Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration
Song Mei∗, Theodor Misiakiewicz†, Andrea Montanari†‡

January 27, 2021

∗Department of Statistics, University of California, Berkeley
†Department of Statistics, Stanford University
‡Department of Electrical Engineering, Stanford University
Abstract
Consider the classical supervised learning problem: we are given data (y_i, x_i), i ≤ n, with y_i a response and x_i ∈ X a covariates vector, and try to learn a model f : X → R to predict future responses. Random features methods map the covariates vector x_i to a point φ(x_i) in a higher dimensional space R^N, via a random featurization map φ. We study the use of random features methods in conjunction with ridge regression in the feature space R^N. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime.

We define a class of problems satisfying certain spectral conditions on the underlying kernels, and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random features ridge regression. In particular, we address two fundamental questions: (1) What is the generalization error of KRR? (2) How large should N be for the random features approximation to achieve the same error as KRR?

In this setting, we prove that KRR is well approximated by a projection onto the top ℓ eigenfunctions of the kernel, where ℓ depends on the sample size n. We show that the test error of random features ridge regression is dominated by its approximation error, and is larger than the error of KRR as long as N ≤ n^{1−δ} for some δ > 0. We characterize this gap. For N ≥ n^{1+δ}, random features achieve the same error as the corresponding KRR, and further increasing N does not lead to a significant change in test error.

Contents

Generalization error of kernel machines
A Approximation error of random features model
  A.1 Assumptions and theorem
  A.2 Proof of Theorem 5.(a): lower bound on the approximation error
  A.3 Proof of Theorem 5.(b): upper bound on the approximation error
  A.4 Structure of the empirical kernel matrix
    A.4.1 Concentration of the top eigenvectors
    A.4.2 Bounding the off-diagonal part of the matrix U_{>M}
  A.5 Proof of Proposition 4
    A.5.1 Auxiliary lemmas
B Generalization error of random features model: Proof of Theorem 1
  B.1 Proof of Theorem 1 in the overparametrized regime
  B.2 Proof of Proposition 6: Structure of the feature matrix Z
    B.2.1 Auxiliary lemmas
  B.3 Proof of Proposition 7: technical bounds in the overparametrized regime
    B.3.1 Proof of claim (c)
    B.3.2 Proof of Proposition 7.(a)
    B.3.3 Proof of Proposition 7.(b)
    B.3.4 Proof of Proposition 7.(d)
    B.3.5 Bounds in the underparametrized regime
  B.4 Concentration of the random features kernel matrix Z^T Z
C Generalization error of kernel ridge regression: Proof of Theorem 4
  C.1 Proof of Theorem 4
    C.1.1 Auxiliary lemmas
  C.2 Kernel ridge regression under relaxed assumptions on the diagonal
    C.2.1 Proof outline for Theorem 8
D Proof of Theorem 2: generalization error of RFRR on the sphere and hypercube
  D.1 On the sphere
  D.2 On the hypercube
E Technical background
  E.1 Functions on the sphere
    E.1.1 Functional spaces over the sphere
    E.1.2 Gegenbauer polynomials
    E.1.3 Hermite polynomials
  E.2 Functions on the hypercube
    E.2.1 Fourier basis
    E.2.2 Hypercubic Gegenbauer
  E.3 Hypercontractivity of Gaussian measure and uniform distributions on the sphere and the hypercube
1 Introduction
Consider the supervised learning problem in which we are given i.i.d. samples (y_i, x_i), i ≤ n, from a common probability distribution on R × X. Here x_i ∈ X is a vector of covariates, and y_i is a response variable. We are interested in learning a model f̂ : X → R which, given a new point x_test, predicts the corresponding response y_test via f̂(x_test).

A number of statistical learning methods can be viewed as a combination of two steps: featurization and training. Featurization maps sample points into a convenient 'feature space' H (a vector space) via a featurization map φ : X → H, x_i ↦ φ(x_i). Training fits a model that is linear in the feature space: f̂(x) = ⟨a, φ(x)⟩_H. In this paper we will be concerned with a relatively simple method for training, ridge regression:

    \hat{a}(\lambda) := \arg\min_{a} \Big\{ \sum_{i=1}^{n} \big( y_i - \langle a, \phi(x_i) \rangle_{\mathcal{H}} \big)^2 + \lambda \|a\|_{\mathcal{H}}^2 \Big\}.    (1)

Here it is implicitly assumed that H is a Hilbert space, and therefore a ∈ H and ⟨·,·⟩_H, ‖·‖_H are the scalar product and norm in H.

It is useful to discuss a few examples of this paradigm, some of which will play a role in what follows (we refer to Section 2.1 for formal definitions).
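Before turning to the examples, a minimal numpy sketch of the featurize-then-train paradigm may be useful. This is our illustration (not part of the paper's formal development): the data and the featurization map below are placeholder choices, and H = R^N is taken with the standard inner product.

```python
import numpy as np

def ridge_in_feature_space(phi, X, y, lam):
    """Solve Eq. (1) for a finite-dimensional feature map phi: R^d -> R^N."""
    F = np.stack([phi(x) for x in X])               # n x N feature matrix
    N = F.shape[1]
    # normal equations: a = (F^T F + lam * I)^{-1} F^T y
    a = np.linalg.solve(F.T @ F + lam * np.eye(N), F.T @ y)
    return lambda x: phi(x) @ a                     # fitted model x -> <a, phi(x)>

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # placeholder covariates
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=100)  # placeholder responses
phi = lambda x: np.array([x[0], x[1], x[0] * x[1], x[0] ** 2, x[1] ** 2])
f_hat = ridge_in_feature_space(phi, X, y, lam=1e-3)
```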
Feature engineering. We use this term to refer to the classical approach of crafting a set of N features φ(x) = (φ_1(x), ..., φ_N(x)) ∈ H = R^N for a specific application, by leveraging domain expertise. This has been the standard approach to computer vision for a long time [Low04, BETVG08], and is still the state of the art in most of applied statistics [HTF09].
Kernel methods. In this case H is a reproducing kernel Hilbert space (RKHS) defined implicitly via a positive definite kernel H : X × X → R [BTA11]. Rather than manually constructing features, the statistician/data analyst only needs to encode in H(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩_H a suitable notion of similarity in the input space X. The resulting model only depends on the kernel H, and a crucial role is played by its eigenvalue decomposition H(x_1, x_2) = Σ_{ℓ=1}^∞ λ_ℓ ψ_ℓ(x_1) ψ_ℓ(x_2). Ridge regression with RKHS featurization is referred to as kernel ridge regression (KRR). Formally, the KRR estimator takes the form

    \hat f_\lambda(x) = \sum_{\ell=1}^{\infty} \hat f_{\lambda,\ell}\, \psi_\ell(x), \qquad \hat f_{\lambda,\ell} = \sum_{\ell'=1}^{\infty} \big( (\lambda/n) \mathbf{I} + \mathbf{G} \big)^{-1}_{\ell,\ell'} \sqrt{\lambda_\ell \lambda_{\ell'}}\, \langle \psi_{\ell'}, y \rangle_n,    (2)

    G_{\ell,\ell'} := \sqrt{\lambda_\ell \lambda_{\ell'}}\, \langle \psi_\ell, \psi_{\ell'} \rangle_n.    (3)

Here ⟨f, g⟩_n := n^{-1} Σ_{i=1}^n f(x_i) g(x_i) denotes the scalar product with respect to the empirical measure. For large n, we can imagine replacing the empirical scalar product with the population one, and therefore G_{ℓ,ℓ'} ≈ λ_ℓ 1{ℓ = ℓ'}, whence f̂_{λ,ℓ} ≈ ((λ/n) + λ_ℓ)^{-1} λ_ℓ ⟨ψ_ℓ, y⟩_n. In words, KRR attempts to estimate accurately the projection of f(x) = E[y | x] onto the eigenvectors of the kernel H corresponding to large eigenvalues λ_ℓ. On the other hand, it shrinks towards 0 the projections of f onto eigenvectors corresponding to smaller eigenvalues.
Random features (RF). Instead of constructing the featurization map φ on the basis of domain expertise, or, implicitly, via a kernel, RF methods use a random map φ : X → R^N. We will study a general construction that generalizes the original proposal of [RR08, BBV06]. We sample N points in a space Ω via θ_1, ..., θ_N ∼_iid τ (for a certain probability measure τ on Ω), and then define the mapping φ by letting φ(x) = (σ(x; θ_1), ..., σ(x; θ_N)). Here σ : X × Ω → R is a square integrable function. We endow the feature space H_N = R^N with the inner product ⟨a_1, a_2⟩_{H_N} = a_1^T a_2 / N.

Because of the connection to two-layers neural networks (see below) we shall refer to N as the 'number of neurons' (although 'number of parameters' would be more appropriate), and to σ as the 'activation function.' The resulting function f̂ takes the form

    \hat f(x; a) := \langle a, \phi(x) \rangle_{\mathcal{H}_N} = \frac{1}{N} \sum_{i=1}^{N} a_i\, \sigma(x; \theta_i).    (4)

We will refer to the procedure defined by Eq. (1), with φ the random feature map defined here, as 'random features ridge regression' (RFRR). RFRR is closely related to KRR. First of all, we can view RFRR as an example of KRR, with kernel

    H_N(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle_{\mathcal{H}_N} = \frac{1}{N} \sum_{i=1}^{N} \sigma(x_1; \theta_i)\, \sigma(x_2; \theta_i).

Notice however that the kernel H_N has finite rank and is random, because of the random features θ_1, ..., θ_N. Second, for large N, we can expect H_N to be a good approximation of its expectation

    \mathbb{E}\, H_N(x_1, x_2) = H(x_1, x_2) := \int_{\Omega} \sigma(x_1; \theta)\, \sigma(x_2; \theta)\, \tau(\mathrm{d}\theta).    (5)

Hence, for large N, we expect RFRR to have similar generalization properties as KRR with the underlying kernel H, while possibly exhibiting lower complexity because it only operates on N × n matrices (instead of n × n matrices as for KRR).
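The concentration (5) of the random kernel H_N around H is easy to see numerically. The following sketch is our illustration (not from the paper); the ReLU activation and the sphere distribution are placeholder choices that anticipate the examples of Section 2.4.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50

def sphere(n):
    """n i.i.d. points, uniform on the sphere of radius sqrt(d)."""
    z = rng.normal(size=(n, d))
    return z * np.sqrt(d) / np.linalg.norm(z, axis=1, keepdims=True)

sigma = lambda t: np.maximum(t, 0.0)          # ReLU, as an example activation

def H_N(x1, x2, Theta):
    """Random features kernel (1/N) sum_i sigma(x1; th_i) sigma(x2; th_i)."""
    a = sigma(Theta @ x1 / np.sqrt(d))
    b = sigma(Theta @ x2 / np.sqrt(d))
    return a @ b / len(Theta)

x1, x2 = sphere(2)
for N in [10, 100, 1000, 100000]:
    print(N, H_N(x1, x2, sphere(N)))          # fluctuates less and less around H(x1, x2)
```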
Neural networks in the linear (lazy) regime. The methods described above fit the general paradigm of Eq. (1): training does not affect the feature map φ, and the model f̂_λ(·) is linear in y, as a consequence of the fact that the loss is quadratic (see also Eq. (2)). In contrast, neural networks aim at learning the best feature representation of the data. The feature map changes during training, and indeed there is no clear separation between the feature map φ(x) and the coefficients a.

Nevertheless, a copious line of recent research shows that —under certain training schemes— neural networks are well approximated by their linearization around a random initialization [JGH18, LL18, DZPS18, DLL+18, AZLS18, AZLL18, ADH+19, ZCZG18, OS19]. It is useful to recall the basic argument here. Denote by x ↦ f(x; θ) the neural network, with parameters (weights) θ ∈ R^N, and by θ_0 the initialization for gradient-based training. For highly overparametrized networks, a small change in the parameters θ is sufficient to change significantly the evaluations of f at the data points, i.e., the vector (f(x_1; θ), ..., f(x_n; θ)). As a consequence, an empirical risk minimizer can be found in a small neighborhood of the initialization θ_0, and it is legitimate to approximate f by its first order Taylor expansion in the parameters:

    f(x; \theta_0 + a) \approx f(x; \theta_0) + \langle a, \nabla_{\theta} f(x; \theta_0) \rangle.    (6)

Apart from the zero-th order term f(x; θ_0) (which has no free parameters, and hence plays the role of an offset), this linearized model takes the same form f̂(x) = ⟨a, φ(x)⟩. The featurization map is given by φ(x) = ∇_θ f(x; θ_0). We refer to the model x ↦ ⟨a, ∇_θ f(x; θ_0)⟩ as the neural tangent (NT) model.

Notice that the NT featurization map is random, because of the random initialization θ_0. However, in general it does not take the form of the RF model, because the entries of ∇_θ f(x; θ_0) are not independent. Despite this important difference, we expect key properties of the RF model to generalize to suitable classes of NT models. Examples of this phenomenon were studied recently in [GMMM19, MZ20].
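As an illustration of the NT featurization map φ(x) = ∇_θ f(x; θ_0), the following sketch computes it in closed form for a two-layer ReLU network. This is a toy example of ours, not an object defined in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d, width = 10, 200

# two-layer network f(x; theta) = width^{-1/2} * sum_i b_i * relu(<w_i, x>),
# with theta = (W, b) and random initialization theta_0 = (W0, b0)
W0 = rng.normal(size=(width, d))
b0 = rng.normal(size=width)

def nt_features(x):
    """phi(x) = gradient of f(x; theta) with respect to theta, at theta_0."""
    pre = W0 @ x
    act = np.maximum(pre, 0.0)
    grad_b = act / np.sqrt(width)                                   # df / db_i
    grad_W = ((pre > 0).astype(float) * b0)[:, None] * x[None, :] / np.sqrt(width)  # df / dw_i
    return np.concatenate([grad_b, grad_W.ravel()])

# the linearized model of Eq. (6) is x -> f(x; theta_0) + <a, nt_features(x)>;
# note that the entries of nt_features(x) are correlated, unlike in the RF model
x = rng.normal(size=d)
phi = nt_features(x)        # feature vector of dimension width * (d + 1)
```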
The present paper focuses on KRR and RFRR. We introduce a set of assumptions on the data distribution, the choice of activation function, and the probability distribution τ on the θ_i's, under which we can characterize the large n, N behavior of the generalization (test) error. While our results apply to an abstract input space X, our assumptions aim at capturing the behavior observed when X is high-dimensional, and the distribution ν on X satisfies strong concentration properties. For instance, our results apply to X = S^{d−1} (the sphere in d dimensions) or X = {+1, −1}^d, both endowed with the uniform measure.

Our results do not require the true regression function f to belong to the associated RKHS, and they characterize the test error (with respect to the square loss) pointwise, i.e., for any function f. This characterization holds up to error terms that are negligible compared to the null risk E{f(x)²}.

In particular, our results allow us to answer in a quantitative way two sets of key questions that emerge from the above discussion:

Q1. How does the test error of KRR depend on the sample size n, on the target function f, and on the kernel H? While this question has attracted considerable attention in the past (see Section 1.3 for an overview), a very precise answer can be given in the present setting.
Q2. How does the test error of RFRR depend on the sample size n and the number of neurons N? In particular, for a given sample size, how large should N be to achieve the same error as the associated KRR (which corresponds formally to N = ∞)?
Q3. How do the answers to the previous questions depend on the regularization parameter λ? In particular, in which cases is the optimal test error achieved by choosing λ → 0, i.e., by using the minimum norm interpolator of the training data?

Let us emphasize that the second question is technically more challenging than the first one, because it amounts to studying KRR with a random kernel. The setting introduced here is particularly motivated by the objective to address Q2 (and its ramifications in Q3). Indeed, to the best of our knowledge, we provide the first set of results on the optimal choice of the overparametrization N/n under polynomial scalings of N, n, d.
Before summarizing our results, it is useful to describe informally our assumptions; we refer to Sections 2.2 and 3.2 for a formal statement of the same assumptions. We consider (x_i)_{i≤n} ∼_iid ν, with ν a probability distribution on the covariates space X, and y_i = f(x_i) + ε_i, where f is the target function and ε_i ∼ N(0, σ_ε²), independent of x_i, is noise.

An RFRR problem is specified by ν, f, σ_ε (which determine the data distribution), σ, τ (which determine the RF representation), and the parameters n, N (sample size and number of neurons). The associated kernel problem is obtained by using the kernel (5). It is also useful to introduce a kernel in the θ space via U(θ_1, θ_2) := E_{x∼ν}{σ(x; θ_1) σ(x; θ_2)}.

We will consider sequences of such problems indexed by an integer d, and characterize their behavior as N, n, d → ∞. In applications, d typically corresponds to the dimension of the covariates space X. In this informal summary, we drop any reference to d for simplicity.

We next describe informally our key assumptions, which depend on integers (m, M, u), with u ≥ max(M, m). (For the sake of simplicity, we omit some assumptions of a more technical nature.)

1. Hypercontractivity.
The top u eigenvectors of H are 'delocalized'. We formalize this condition by requiring that, for any function g ∈ span(ψ_j : j ≤ u), and for any integer k,

    \mathbb{E}_\nu\{ g(x)^{2k} \} \le C_{k,u}\, \mathbb{E}_\nu\{ g(x)^2 \}^{k}.

We assume the same condition for the eigenvectors of U.

2. Concentration of diagonal elements of the kernels.
Denote by H_{>m} the kernel obtained from H by setting to zero the eigenvalues λ_1, ..., λ_m. We require the diagonal elements {H_{>m}(x_i, x_i)}_{i≤n} to concentrate around their expectation, where (x_i)_{i≤n} ∼_iid ν on X. Analogously, we require the diagonal elements {U_{>M}(θ_i, θ_i)}_{i≤N} to concentrate around their expectation.

This assumption amounts to a condition of symmetry: most points x in the support of ν are roughly equivalent, in the sense of having the same value of H_{>m}(x, x), and similarly for most θ in the support of τ.

3. Spectral gap.
Recall that (λ_j)_{j≥1} denote the eigenvalues of the kernel H in decreasing order. We then assume one of the following two conditions to hold:

Underparametrized regime.
We have N ≪ n and

    \frac{1}{\lambda_{M}} \sum_{k=M+1}^{\infty} \lambda_k \;\ll\; N \;\ll\; \frac{1}{\lambda_{M+1}} \sum_{k=M+1}^{\infty} \lambda_k,    (7)

Overparametrized regime. We have n ≪ N and

    \frac{1}{\lambda_{m}} \sum_{k=m+1}^{\infty} \lambda_k \;\ll\; n \;\ll\; \frac{1}{\lambda_{m+1}} \sum_{k=m+1}^{\infty} \lambda_k.    (8)
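To make Eq. (8) concrete, the following sketch (our illustration; the factor-c surrogate for '≪' and the toy spectrum are placeholder choices) locates the level m singled out by the spectral gap condition for a sphere-like eigenvalue sequence.

```python
import numpy as np

def gap_level(lams, n, c=2.0):
    """Smallest m with (1/lam_m) * tail(m) << n << (1/lam_{m+1}) * tail(m),
    where tail(m) = sum_{k>m} lam_k and '<<' is approximated by a factor c."""
    lams = np.sort(np.asarray(lams, dtype=float))[::-1]
    for m in range(1, len(lams) - 1):
        tail = lams[m:].sum()                    # sum over k >= m+1 (0-indexed)
        if c * tail / lams[m - 1] <= n <= tail / (c * lams[m]):
            return m
    return None

# sphere-like spectrum: ~d^l eigenvalues of size d^{-l} at each degree l
d = 20
lams = np.concatenate([np.full(d ** l, float(d) ** (-l)) for l in range(4)])
print(gap_level(lams, n=5 * d ** 2))   # n between d^2 and d^3 -> m = 1 + d + d^2 = 421
```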
This assumption ensures a clear separation between the subspace of D_d which is estimated accurately (spanned by the eigenfunctions of H corresponding to the top eigenvalues) and the subspace that is estimated trivially by 0 (corresponding to the low eigenvalues of H). As we will see, a spectral gap condition holds for classical high-dimensional examples. On the other hand, we believe it should be possible to avoid this condition at the price of a more complicated characterization of the risk, and indeed we do not require it for KRR.

As explained above, KRR attempts to estimate accurately the projection of the target function f* onto the top eigenvectors of the kernel H, and shrinks to zero its other components. RFRR behaves similarly, except that it only constructs a finite rank approximation of the kernel H. How many components of the target function are estimated accurately? There are of course two limiting factors: the statistical error, which depends on the sample size n, and the approximation error, which depends on the number of neurons N.

It turns out that, in the present setting, the interplay between n and N takes a particularly simple form. In a nutshell, what matters is the smaller of n and N. If n ≪ N, then the statistical error dominates and ridge regression estimates correctly the projection of f* onto the top m eigenfunctions of H (where m is defined per Eq. (8)). If on the other hand N ≪ n, then the approximation error dominates and ridge regression estimates correctly the projection of f* onto the top M eigenfunctions of H (where M is defined per Eq. (7)).

In formulas, we denote by R_RF(f*; λ) = E{(f*(x) − f̂_λ(x))²} the test error of RFRR (for square loss) when the target function is f* and the regularization parameter equals λ. Our main result establishes that for all λ ∈ [0, λ*] (with a suitable choice of λ*), in a certain asymptotic sense, the following holds:

    R_{\mathrm{RF}}(f_*; \lambda) = \begin{cases} \mathbb{E}\{ (\mathsf{P}_{>m} f_*(x))^2 \} + o(1) \cdot \mathbb{E}\{ f_*(x)^2 \} & \text{if } n \ll N, \\ \mathbb{E}\{ (\mathsf{P}_{>M} f_*(x))^2 \} + o(1) \cdot \mathbb{E}\{ f_*(x)^2 \} & \text{if } n \gg N, \end{cases}    (9)

where P_{>ℓ} is the projector onto the span of the eigenfunctions {ψ_j : j > ℓ}. This statement also applies to KRR, if we interpret the latter as the N = ∞ case of RFRR. Further, no kernel machine achieves a smaller error.

This characterization implies a relatively simple answer to questions Q1, Q2, and Q3, which we posed in the previous section. We summarize some of the insights that follow from this result.

KRR acts as a projection.
As mentioned above, Eq. (9) can be restated as saying that (for the special case N = ∞) f̂_λ(x) ≈ P_{≤m} f*(x). Indeed, we will prove a stronger result, which does not require the spectral gap assumption of Eq. (8): the KRR estimator f̂_λ is well approximated by the KRR estimator for the population problem (n = ∞), but with a larger value of the ridge regularization γ > λ. In other words, KRR acts as a shrinkage operator along the eigenfunctions of the kernel.

Effects of overparametrization.
In random features models, we are free to choose the number of neurons N. Equation (9) indicates that any choice of N has roughly the same test error (which is also the test error of KRR) as long as N ≫ n. This is interesting in both directions. First, the test error does not deteriorate as the number of parameters increases and becomes much larger than the sample size. This contrasts with a naive measure of model complexity: counting the number of parameters would naively suggest that N ≫ n might hurt generalization. Second, the error does not improve with overparametrization either, as long as N ≫ n.

Optimal overparametrization.
At what level of overparametrization should we operate? In view of the previous point, it is sufficient to use a model with a number of parameters much larger than the sample size (formally, N ≥ n^{1+δ} for some δ > 0, although this specific condition is mainly dictated by our proof technique). Further overparametrization does not improve the statistical behavior. Let us also note that —as proven in [MM19]— choosing N/n =: ψ = O(1) can lead to sub-optimal test error, with the suboptimality vanishing if ψ → ∞ after N, n → ∞.

Optimality of interpolation.
Finally, the above phenomena are obtained for all λ ∈ [0, λ*]. The case λ = 0 corresponds to minimum norm interpolators. We also prove that the risk of any kernel machine is lower bounded by E{(P_{>m} f*(x))²} + o(1)·E{f*(x)²}. We therefore conclude that, in the overparametrized regime N ≫ n, min-norm interpolators are optimal among all kernel methods.

1.3 Related literature

The test error of KRR was studied by a number of authors in the past [CDV07, JŞS+20]. Most of these works focus on the limit n → ∞ in fixed dimension d. In contrast, our focus is on the case in which both d and n grow simultaneously. Further, we provide upper and lower bounds that hold pointwise (for a given target function f*), while earlier works mostly establish pointwise upper bounds and minimax lower bounds (for the worst case f*). The recent work [JŞS+20] also derived pointwise upper and lower bounds for kernel ridge regression (but with a strictly positive ridge regularizer), which are very similar to our Theorem 4. However, these results are based on a universality assumption whose validity is unclear in specific settings.

Recently, the ridge-less (interpolation) limit of KRR was studied by Liang, Rakhlin and Zhai [LR20, LRZ19]. Again, these authors provide minimax upper bounds that hold within the RKHS, for inner product kernels, when the feature vectors x have independent coordinates. Their results are related but not directly comparable to ours.

The complexity of training kernel machines scales at least quadratically in the sample size. This has motivated the development of randomized techniques to lower the complexity of training and testing. While our focus is on random features methods, alternative approaches are based on subsampling the columns/rows of the empirical kernel matrix, see e.g. [Bac13, AM15, RCR15]. In particular, [RCR15] compares the prediction errors using the sketched and the full kernel matrices, and shows that —for a fixed RKHS— it is sufficient to use a number of rows/columns of the order of the square root of the sample size in order to achieve the minimax rate over that RKHS.

The generalization properties of random features methods have been studied in a smaller number of papers [RR09, RR17, MWW20]. Rahimi and Recht [RR09] proved an upper bound of the order 1/√N + 1/√n on the generalization error. The insight provided by this bound is similar to one of our points: about N ≍ n neurons are sufficient for the error to be of the same order as for N → ∞. On the other hand, [RR09] proves only a minimax upper bound, it is limited to Lipschitz losses, and, crucially, it requires the coefficients to satisfy max_{i≤N} |a_i| ≤ C, so that ‖a‖² = O(N). In contrast, in the present setting, we typically have ‖a‖² = Θ(nN).

The case of square loss was considered earlier by Rudi and Rosasco [RR17], who proved that, for a target function f* in the RKHS, N = C√n log n is sufficient to learn a random features model with test error of order 1/√n. These authors interpret this finding as implying that roughly √n random features are sufficient: we will discuss the difference between their setting and ours in Section 2.3. Finally, [Bac15] studies optimized distributions for sampling the random features, while [YLM+12] provides a comparison between random features approaches and subsampling of the kernel matrix.
As pointed out above, we find that taking λ → 0 (i.e., min-norm interpolation) is optimal in the overparametrized regime. This connects to recent work on interpolation in linear models; that line of work, however, assumes subgaussian feature vectors φ(x_i). Further, it only provides upper and lower bounds that match up to factors depending on the condition number of a certain random matrix. In contrast, our characterization is specialized to the random features setting, does not require subgaussianity, and holds up to additive errors that are negligible compared to the null risk.

The present paper solves a number of problems that were left open in our earlier work [GMMM19]. First of all, [GMMM19] only considered the cases n = ∞ (approximation error of random features models) or N = ∞ (generalization error of KRR). Here instead we establish the complete picture for both n and N finite. Second, [GMMM19] assumed a special data distribution (ν was the uniform distribution over the d-dimensional sphere), a special structure for the kernel (inner product kernels), and a special type of activation functions (depending on the inner product ⟨θ, x⟩). The present paper considers general data distributions, kernels, and activation functions, under a set of assumptions that covers the previous example as a special case. Finally, the proofs of [GMMM19] made use of the moment method, which is difficult to generalize beyond special examples. Here we use a decoupling approach and matrix concentration methods, which are significantly more flexible.

The results of [GMMM19] were generalized to certain anisotropic distributions in [GMMM20]. For inner product activation functions on the sphere, the precise asymptotics (for N, n, d → ∞ with N/d → ψ_1, n/d → ψ_2, ψ_1, ψ_2 ∈ (0, ∞)) of the generalization error of random features models was calculated in [MM19].

1.4 Notations
For a positive integer n, we denote by [n] the set {1, 2, ..., n}. For vectors u, v ∈ R^d, we denote by ⟨u, v⟩ = u_1 v_1 + ... + u_d v_d their scalar product, and by ‖u‖_2 = ⟨u, u⟩^{1/2} the ℓ_2 norm. Given a matrix A ∈ R^{n×m}, we denote by ‖A‖_op = max_{‖u‖_2=1} ‖Au‖_2 its operator norm and by ‖A‖_F = (Σ_{i,j} A_{ij}²)^{1/2} its Frobenius norm. If A ∈ R^{n×n} is a square matrix, the trace of A is denoted by Tr(A) = Σ_{i∈[n]} A_{ii}.

We use O_d(·) (resp. o_d(·)) for the standard big-O (resp. little-o) relations, where the subscript d emphasizes the asymptotic variable. Furthermore, we write f = Ω_d(g) if g(d) = O_d(f(d)), and f = ω_d(g) if g(d) = o_d(f(d)). Finally, f = Θ_d(g) if we have both f = O_d(g) and f = Ω_d(g).

We use O_{d,P}(·) (resp. o_{d,P}(·)) for the big-O (resp. little-o) in probability relations. Namely, for h_1(d) and h_2(d) two sequences of random variables, h_1(d) = O_{d,P}(h_2(d)) if for any ε > 0 there exist C_ε > 0 and d_ε ∈ Z_{>0} such that

    P(|h_1(d)/h_2(d)| > C_ε) ≤ ε,   ∀ d ≥ d_ε,

and h_1(d) = o_{d,P}(h_2(d)) if h_1(d)/h_2(d) converges to 0 in probability.
Similarly, we will denote h_1(d) = Ω_{d,P}(h_2(d)) if h_2(d) = O_{d,P}(h_1(d)), and h_1(d) = ω_{d,P}(h_2(d)) if h_2(d) = o_{d,P}(h_1(d)). Finally, h_1(d) = Θ_{d,P}(h_2(d)) if we have both h_1(d) = O_{d,P}(h_2(d)) and h_1(d) = Ω_{d,P}(h_2(d)).

2 Generalization error of random features models

In this section, we present our results on the generalization error of random features models. We begin in Section 2.1 by introducing the general abstract setting in which we work, and some of its basic properties. We then state our assumptions in Section 2.2, and state our main theorem (Theorem 1) in Section 2.3. Finally, Section 2.4 presents applications of our general theorem to (i) the case of feature vectors uniformly distributed over the sphere, x_i ∼ Unif(S^{d−1}(√d)), and (ii) the case of feature vectors uniformly distributed over the Hamming cube, x_i ∼ Unif({+1, −1}^d). While these applications are 'simple' in the sense that checking the assumptions of our general theorem is straightforward, they are in themselves quite interesting. In particular, our result for the uniform distribution on the sphere (cf. Proposition 2) closes the main problem left unsolved in [GMMM19].

2.1 Setting

We consider two sequences of Polish probability spaces (X_d, ν_d) and (Ω_d, τ_d), indexed by an integer d. We denote by L²(X_d) = L²(X_d, ν_d) the space of square integrable functions on (X_d, ν_d), and by L²(Ω_d) = L²(Ω_d, τ_d) the space of square integrable functions on (Ω_d, τ_d). Since (X_d, ν_d) and (Ω_d, τ_d) are standard probability spaces [Dud18, Theorem 13.1.1], it follows that L²(X_d) and L²(Ω_d) are separable.
More generally, for p ≥ 1, we denote by ‖f‖_{L^p(X)} = E_{x∼ν}[|f(x)|^p]^{1/p} the L^p norm of f. We will sometimes omit X and write directly ‖f‖_{L²} and ‖f‖_{L^p} when clear from context.

Given two closed linear subspaces D_d ⊆ L²(X_d), V_d ⊆ L²(Ω_d), and the activation function σ_d ∈ L²(X_d × Ω_d, ν_d ⊗ τ_d), we define a Fredholm integral operator T_d : D_d → V_d via

    T_d\, g(\theta) \equiv \int_{X_d} \sigma_d(x, \theta)\, g(x)\, \nu_d(\mathrm{d}x).    (10)

Note that T_d is a compact operator by construction. We will assume that T_d g ≠ 0 for any g ∈ D_d \ {0}. Also, without loss of generality, we can assume V_d = Im(T_d) (which is closed since T_d is bounded). With an abuse of notation, we will sometimes denote by T_d the extension of this operator obtained by setting T_d g = 0 for g ∈ D_d^⊥. Notice that we can choose the kernel σ_d so that ∫_{X_d} σ_d(x, θ) g(x) ν_d(dx) = 0 for any g ∈ D_d^⊥: we will assume such a choice hereafter.

While in simple examples we might assume D_d = L²(X_d), the extra flexibility afforded by a general subspace D_d ⊆ L²(X_d) allows to model some important applications [MMM21].

The adjoint operator T_d* : V_d → D_d has the kernel representation

    T_d^*\, f(x) = \int_{\Omega_d} \sigma_d(x, \theta)\, f(\theta)\, \tau_d(\mathrm{d}\theta).

As before, we will sometimes extend T_d* to L²(Ω_d) by setting Ker(T_d*) = V_d^⊥.

The operator T_d induces two compact self-adjoint positive definite operators: U_d = T_d T_d* : V_d → V_d, and H_d = T_d* T_d : D_d → D_d. These operators admit the kernel representations

    U_d\, f(\theta) = \int_{\Omega_d} U_d(\theta, \theta')\, f(\theta')\, \tau_d(\mathrm{d}\theta'),    (11)

    H_d\, g(x) = \int_{X_d} H_d(x, x')\, g(x')\, \nu_d(\mathrm{d}x'),    (12)

where U_d : Ω_d × Ω_d → R and H_d : X_d × X_d → R are two measurable functions, satisfying ∫_{Ω_d} U_d(θ, θ') f(θ') τ_d(dθ') = 0 for f ∈ V_d^⊥, and ∫_{X_d} H_d(x, x') g(x') ν_d(dx') = 0 for g ∈ D_d^⊥. We immediately have

    U_d(\theta_1, \theta_2) = \mathbb{E}_{x \sim \nu_d}[\sigma_d(x, \theta_1)\, \sigma_d(x, \theta_2)],    (13)

    H_d(x_1, x_2) = \mathbb{E}_{\theta \sim \tau_d}[\sigma_d(x_1, \theta)\, \sigma_d(x_2, \theta)].    (14)

By the Cauchy–Schwarz inequality, we have U_d ∈ L²(Ω_d × Ω_d) and H_d ∈ L²(X_d × X_d).
By the spectral theorem for compact operators, there exist two orthonormal bases (ψ_j)_{j≥1}, with span(ψ_j : j ≥ 1) = D_d ⊆ L²(X_d), and (φ_j)_{j≥1}, with span(φ_j : j ≥ 1) = V_d ⊆ L²(Ω_d), and eigenvalues (λ_{d,j})_{j≥1} ⊆ R, with nonincreasing absolute values |λ_{d,1}| ≥ |λ_{d,2}| ≥ ··· and Σ_{j≥1} λ_{d,j}² < ∞, such that

    T_d = \sum_{j=1}^{\infty} \lambda_{d,j}\, \psi_j \phi_j^*, \qquad U_d = \sum_{j=1}^{\infty} \lambda_{d,j}^2\, \phi_j \phi_j^*, \qquad H_d = \sum_{j=1}^{\infty} \lambda_{d,j}^2\, \psi_j \psi_j^*.

(Here convergence holds in operator norm.) In terms of the kernels, these identities read

    \sigma_d(x, \theta) = \sum_{j=1}^{\infty} \lambda_{d,j}\, \psi_j(x) \phi_j(\theta), \quad U_d(\theta_1, \theta_2) = \sum_{j=1}^{\infty} \lambda_{d,j}^2\, \phi_j(\theta_1) \phi_j(\theta_2), \quad H_d(x_1, x_2) = \sum_{j=1}^{\infty} \lambda_{d,j}^2\, \psi_j(x_1) \psi_j(x_2).    (15)

Here convergence holds in L²(X_d × Ω_d), L²(Ω_d × Ω_d), and L²(X_d × X_d), respectively.

Associated to the operator H_d, we can define a reproducing kernel Hilbert space (RKHS) H ⊆ D_d, defined as

    \mathcal{H} = \Big\{ f \in D_d : \|f\|_{\mathcal{H}}^2 = \sum_{j=1}^{\infty} \lambda_{d,j}^{-2} \langle f, \psi_j \rangle_{L^2}^2 < \infty \Big\},

where ‖·‖_H denotes the RKHS norm associated to H_d. In particular, H is dense in D_d, provided λ_{d,j} > 0 for all j.

For S ⊆ {1, 2, ...}, we denote by P_S the projection operator from L²(X_d) onto D_{d,S} := span(ψ_j : j ∈ S). With a slight abuse of notation, we also denote by P_S the projection operator from L²(Ω_d) onto V_{d,S} := span(φ_j : j ∈ S). We denote by T_{d,S} and σ_{d,S} the corresponding operator and kernel:

    T_{d,S} = \sum_{j \in S} \lambda_{d,j}\, \psi_j \phi_j^*, \qquad \sigma_{d,S}(x, \theta) = \sum_{j \in S} \lambda_{d,j}\, \psi_j(x) \phi_j(\theta).

We define U_{d,S} = T_{d,S} T_{d,S}^* and H_{d,S} = T_{d,S}^* T_{d,S}, and denote by U_{d,S} and H_{d,S} the corresponding kernels. If S = {j ∈ N : j ≤ ℓ} we will write for brevity T_{d,≤ℓ}, U_{d,≤ℓ}, H_{d,≤ℓ}, and similarly for S = {j ∈ N : j > ℓ}.

Since σ_d ∈ L²(X_d × Ω_d), it follows that U_{d,S} is trace class, for any S ⊆ N, with trace given by

    \mathrm{Tr}(U_{d,S}) \equiv \sum_{j \in S} \lambda_{d,j}^2 = \mathbb{E}_{\theta \sim \tau_d}[U_{d,S}(\theta, \theta)] < \infty.

Similarly, we have

    \mathrm{Tr}(H_{d,S}) \equiv \sum_{j \in S} \lambda_{d,j}^2 = \mathbb{E}_{x \sim \nu_d}[H_{d,S}(x, x)] < \infty.
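The fact that U_d and H_d share the spectrum {λ_{d,j}²} can be checked numerically by discretizing both spaces. This is a toy verification of ours (uniform measures on finite grids, placeholder activation), not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 200, 150                        # discretized X_d and Omega_d, uniform measures
S = np.tanh(rng.normal(size=(p, q)))   # matrix of values sigma_d(x_i, theta_j)

# with uniform measures, T_d g = (1/p) S^T g, hence
U = S.T @ S / (p * q)                  # matrix of U_d = T_d T_d^* acting on L^2(Omega_d)
H = S @ S.T / (p * q)                  # matrix of H_d = T_d^* T_d acting on L^2(X_d)

eu = np.sort(np.linalg.eigvalsh(U))[::-1]
eh = np.sort(np.linalg.eigvalsh(H))[::-1]
k = min(p, q)
print(np.allclose(eu[:k], eh[:k]))     # True: shared nonzero spectrum {lambda_{d,j}^2}
```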
2.2 Assumptions

Let Θ = (θ_i)_{i∈[N]} ∼_iid τ_d. We define the random features function class to be

    F_{\mathrm{RF},N}(\Theta) = \Big\{ \hat f(x; a) = \frac{1}{N} \sum_{i=1}^{N} a_i\, \sigma_d(x, \theta_i) \,:\, a_i \in \mathbb{R},\ i \in [N] \Big\}.

Note that the factor 1/N is immaterial here, and is only introduced in order to match the definition of feature map and scalar product in Section 1.1.

We observe pairs (y_i, x_i)_{i∈[n]}, with (x_i)_{i∈[n]} ∼_iid ν_d, and y_i = f_d(x_i) + ε_i, with f_d ∈ L²(X_d) and ε_i ∼ N(0, σ_ε²) independently. We fit the coefficients (a_i)_{i≤N} using ridge regression, cf. Eq. (1), which we reproduce here:

    \hat a(\lambda) = \arg\min_{a} \Big\{ \sum_{i=1}^{n} \big( y_i - \hat f(x_i; a) \big)^2 + \frac{\lambda}{N} \|a\|_2^2 \Big\}.    (16)

We allow λ to depend on the dimension parameter d. The test error is given by

    R_{\mathrm{RF}}(f_d, X, \Theta, \lambda) := \mathbb{E}_x\Big[ \big( f_d(x) - \hat f(x; \hat a(\lambda)) \big)^2 \Big].    (17)

We next state our assumptions on the sequences of probability spaces (X_d, ν_d) and (Ω_d, τ_d), and on the activation functions σ_d. The first set of assumptions concerns the concentration properties of the feature map, and is grouped in the next definition. These assumptions are quantified by four sequences of integers {(N(d), M(d), n(d), m(d))}_{d≥1}, where N(d) and n(d) are, respectively, the number of neurons and the sample size.
The integers M(d) and m(d) play a minor role in this definition, but will encode the decomposition of L²(Ω_d) and L²(X_d) (respectively) into the span of the top eigenfunctions of U_d and H_d (of dimensions M(d) and m(d)) and their complements.

Assumption 1 ({(N(d), M(d), n(d), m(d))}_{d≥1}-Feature Map Concentration Property). We say that the sequence of activation functions {σ_d}_{d≥1} satisfies the Feature Map Concentration Property (FMCP) with respect to the sequence {(N(d), M(d), n(d), m(d))}_{d≥1} if there exists a sequence {u(d)}_{d≥1} with u(d) ≥ max(M(d), m(d)) such that the following hold.

(a) (Hypercontractivity of finite eigenspaces)

  (i) (Hypercontractivity of finite eigenspaces on D_d.) For any integer k ≥ 2, there exists C such that, for any g ∈ D_{d,≤u(d)} = span(ψ_s : 1 ≤ s ≤ u(d)), we have ‖g‖_{L^k(X_d)} ≤ C · ‖g‖_{L²(X_d)}.

  (ii) (Hypercontractivity of finite eigenspaces on V_d.) For any integer k ≥ 2, there exists C' such that, for any g ∈ V_{d,≤u(d)} = span(φ_s : 1 ≤ s ≤ u(d)), we have ‖g‖_{L^k(Ω_d)} ≤ C' · ‖g‖_{L²(Ω_d)}.

(b) (Properly decaying eigenvalues.) There exists a fixed δ_0 > 0 such that, for all d large enough,

    \max(N(d), n(d))^{\delta_0} \le \frac{\big( \sum_{j=u(d)+1}^{\infty} \lambda_{d,j}^2 \big)^2}{\sum_{j=u(d)+1}^{\infty} \lambda_{d,j}^4}.    (18)

(c) (Hypercontractivity of the high degree part.) Let σ_{d,>u(d)} denote the projection of σ_d onto its high degree part. Then there exist a fixed δ_0 > 0 and an integer k such that

    \min(n, N)^{\delta_0}\, \max(N, n)^{1/k - 1} \log(\max(N, n)) = o_d(1),

and

    \mathbb{E}_{x,\theta}\big[ \sigma_{>u(d)}(x; \theta)^{2k} \big]^{1/(2k)} = O_d(1) \cdot \min(n, N)^{\delta_0} \cdot \mathbb{E}_{x,\theta}\big[ \sigma_{>u(d)}(x; \theta)^2 \big]^{1/2}.

(d) (Concentration of diagonal elements.) For (x_i)_{i∈[n(d)]} ∼_iid ν_d and (θ_i)_{i∈[N(d)]} ∼_iid τ_d, we have

    \sup_{i \in [n(d)]} \Big| H_{d,>m(d)}(x_i, x_i) - \mathbb{E}_x[H_{d,>m(d)}(x, x)] \Big| = o_{d,\mathbb{P}}(1) \cdot \mathbb{E}_x[H_{d,>m(d)}(x, x)],

    \sup_{i \in [N(d)]} \Big| U_{d,>M(d)}(\theta_i, \theta_i) - \mathbb{E}_\theta[U_{d,>M(d)}(\theta, \theta)] \Big| = o_{d,\mathbb{P}}(1) \cdot \mathbb{E}_\theta[U_{d,>M(d)}(\theta, \theta)].

This statement formalizes three assumptions. The first one is hypercontractivity (points (a) and (c)). Recall that D_{d,≤u(d)} is the eigenspace spanned by the top eigenvectors of the operator H_d, and V_{d,≤u(d)} is the eigenspace spanned by the top eigenvectors of the operator U_d. We request that functions in these spaces have comparable norms of all orders, which roughly amounts to saying that they take values of the same order as their typical value for most x (or most θ). This typically happens when the functions in the top eigenspaces are delocalized.

The second assumption (point (b)) requires that the eigenvalues of the kernel operators do not decay too rapidly. If this is not the case, the RKHS is very close to a low-dimensional space. For instance, if λ_{d,k} ≍ k^{−α}, α > 0, then this condition holds as long as we take u(d) ≥ max(N(d), n(d))^{δ} for some δ > 0.

The third assumption (point (d)) concerns the diagonal elements of the kernel matrices. It requires the truncated kernel functions H_{d,>m(d)} and U_{d,>M(d)}, evaluated on covariates and weight vectors, to have nearly constant diagonal values.

The second set of assumptions concerns the spectrum of the kernel operator, defined by the sequence of eigenvalues (λ_{d,j})_{j≥1}. We require that the spectrum has a gap: the location of this gap dictates the relationship between N(d) and M(d) and between n(d) and m(d).

Assumption 2 (Spectral gap at level {(N(d), M(d), n(d), m(d))}_{d≥1}). We say that the sequence of activation functions {σ_d}_{d≥1} has a spectral gap at level {(N(d), M(d), n(d), m(d))}_{d≥1} if one of the following conditions (a), (b) holds for all d large enough.

(a) (Overparametrized regime.) We have N(d) ≥ n(d) and

  (i) (Number of samples.) There exists fixed δ_0 > 0 such that m(d) ≤ n(d)^{1−δ_0} and

    \frac{1}{\lambda_{d,m(d)}^2} \sum_{k=m(d)+1}^{\infty} \lambda_{d,k}^2 \;\le\; n(d)^{1-\delta_0} \;\le\; n(d)^{1+\delta_0} \;\le\; \frac{1}{\lambda_{d,m(d)+1}^2} \sum_{k=m(d)+1}^{\infty} \lambda_{d,k}^2.    (19)

  (ii) (Number of features.) There exists fixed δ_0 > 0 such that M(d) ≤ N(d)^{1−δ_0}, M(d) ≥ m(d), and

    N(d)^{1+\delta_0} \le \frac{1}{\lambda_{d,M(d)+1}^2} \sum_{k=M(d)+1}^{\infty} \lambda_{d,k}^2.    (20)

(b) (Underparametrized regime.) We have n(d) ≥ N(d) and

  (i) (Number of features.) There exists fixed δ_0 > 0 such that M(d) ≤ N(d)^{1−δ_0} and

    \frac{1}{\lambda_{d,M(d)}^2} \sum_{k=M(d)+1}^{\infty} \lambda_{d,k}^2 \;\le\; N(d)^{1-\delta_0} \;\le\; N(d)^{1+\delta_0} \;\le\; \frac{1}{\lambda_{d,M(d)+1}^2} \sum_{k=M(d)+1}^{\infty} \lambda_{d,k}^2.

  (ii) (Number of samples.) There exists fixed δ_0 > 0 such that m(d) ≤ n(d)^{1−δ_0}, m(d) ≥ M(d), and

    n(d)^{1+\delta_0} \le \frac{1}{\lambda_{d,m(d)+1}^2} \sum_{k=m(d)+1}^{\infty} \lambda_{d,k}^2.

The assumption of a spectral gap is useful in that it leads to a clear-cut separation in our main statement below. For instance, in the overparametrized regime n(d) ≪ N(d), the projection of the target function onto D_{d,≤m(d)} is estimated with negligible error, while the projection onto D_{d,>m(d)} is estimated by 0. If there was no spectral gap, the transition would not be as sharp. However, we expect this to affect only target functions with a large projection onto eigenfunctions whose indices are close to m(d). In this sense, while restrictive, the spectral gap assumption can in fact be a good model for a more generic situation.
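The hypercontractivity assumptions above can be probed numerically in simple cases. The following Monte Carlo sketch (ours, not the paper's) checks the q = 4 instance of Assumption 1.(a) for a random degree-2 polynomial on the hypercube, where the classical bound ‖g‖_{L⁴} ≤ 3^{ℓ/2} ‖g‖_{L²} holds for degree-ℓ polynomials.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_mc = 30, 100000
X = rng.choice([-1.0, 1.0], size=(n_mc, d))    # uniform on the hypercube

# a random polynomial of degree 2 (a function in a 'low' eigenspace)
w = rng.normal(size=d)
A = np.triu(rng.normal(size=(d, d)), 1)        # coefficients of x_i * x_j, i < j
g = X @ w + np.einsum('ni,ij,nj->n', X, A, X)

l2 = np.mean(g ** 2) ** 0.5
l4 = np.mean(g ** 4) ** 0.25
print(l4 / l2)    # bounded by 3^(l/2) = 3 for degree l = 2, uniformly in d
```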
We are now in a position to state our main result for random features ridge regression.

2.3 Main theorem

Theorem 1 (Generalization error of random features ridge regression). Let {f_d ∈ D_d}_{d≥1} be a sequence of functions, X = (x_i)_{i∈[n(d)]} and Θ = (θ_j)_{j∈[N(d)]}, with (x_i)_{i∈[n(d)]} ∼_iid ν_d and (θ_j)_{j∈[N(d)]} ∼_iid τ_d independently. Let y_i = f_d(x_i) + ε_i with ε_i ∼_iid N(0, σ_ε²) for some σ_ε > 0. Let {σ_d}_{d≥1} be a sequence of activation functions satisfying the {(N(d), M(d), n(d), m(d))}_{d≥1}-FMCP (Assumption 1) and having a spectral gap at level {(N(d), M(d), n(d), m(d))}_{d≥1} (Assumption 2). Then the following hold for the test error of RFRR (see Eq. (17)):

(a) (Overparametrized regime.) If N(d) ≥ d^{δ} · n(d) for some δ > 0, let λ* be such that λ* = o_d(Tr(H_{d,>m})). Then, for any regularization parameter λ ∈ [0, λ*], and any fixed η > 0 and ε > 0, with high probability we have

    \big| R_{\mathrm{RF}}(f_d, X, \Theta, \lambda) - \|\mathsf{P}_{>m} f_d\|_{L^2}^2 \big| \le \varepsilon \cdot \big( \|f_d\|_{L^2}^2 + \|\mathsf{P}_{>m} f_d\|_{L^{2+\eta}}^2 + \sigma_\varepsilon^2 \big).    (21)

(b) (Underparametrized regime.) If n(d) ≥ d^{δ} · N(d) for some δ > 0, let λ* be such that λ* = o_d((n/N) · Tr(U_{d,>M})). Then, for any regularization parameter λ ∈ [0, λ*], and any fixed η > 0 and ε > 0, with high probability we have

    \big| R_{\mathrm{RF}}(f_d, X, \Theta, \lambda) - \|\mathsf{P}_{>M} f_d\|_{L^2}^2 \big| \le \varepsilon \cdot \big( \|f_d\|_{L^2}^2 + \|\mathsf{P}_{>M} f_d\|_{L^{2+\eta}}^2 + \sigma_\varepsilon^2 \big).    (22)

Remark 2.1.
The two limits N = ∞ and n = ∞ play a special role. For N = ∞, the random kernel H_N(x_1, x_2) = N^{-1} Σ_{i=1}^N σ(x_1; θ_i) σ(x_2; θ_i) converges to its expectation, and we recover KRR. While this case is not technically covered by Theorem 1, we establish the relevant characterization in Theorems 3 and 4. In the case n = ∞ the generalization error vanishes, and we are left with the approximation error. This case is covered separately in Appendix A. In both these limit cases we confirm the result that would have been obtained by naively setting N = ∞ or n = ∞ in the last theorem.

Notice that the sample size n and the number of neurons N play a nearly symmetric role in this statement, and the smaller of the two determines the test error. An important insight follows: in the present setting, the test error is nearly insensitive to the number of neurons as long as we take N ≫ n. If we want to minimize computational complexity subject to achieving nearly optimal generalization properties, we should operate, say, at N ≍ n^{1+δ} for some small δ > 0. This is to be contrasted with the recommendation N ≍ √n log n that follows from [RR17]. While our setting differs from the one of [RR17] in a number of technical aspects, we believe that the core difference between the two results lies in the treatment of the target function f_d. Simplifying, the recommendation of [RR17] is based on two results, the second of which was proved in [CDV07] (with an abuse of notation, we indicate the number of neurons and sample size as arguments of R_RF(f_d) = R_RF(f_d; N, n), and use N = ∞ to denote the KRR limit case):

    \sup_{\|f_d\|_{\mathcal{H}} \le r} R_{\mathrm{RF}}(f_d; N_n, n) \le \frac{C(d)\, r^2}{\sqrt{n}}, \quad \text{for } N_n \asymp \sqrt{n} \log n,    (23)

    \sup_{\|f_d\|_{\mathcal{H}} \le r} R_{\mathrm{RF}}(f_d; \infty, n) \le C(d)\, r^2 \Big( \frac{\log n}{n} \Big)^{b/(b+1)},    (24)

where b ∈ (1, ∞) encodes the decay of the eigenvalues of the kernel. (The results of [CDV07, RR17] assume the weaker condition that inf_{g∈H} ‖f_d − g‖_{L²} is achieved in H: since H is dense in L²(X_d) provided the kernel is strictly positive definite, this is equivalent to f_d ∈ H.) Now, considering the worst case decay b → 1, the error rate achieved by RFRR, cf. Eq. (23), is of the same order as the one achieved by KRR, cf. Eq. (24).

Note several differences with respect to our results: (i) The analysis of [RR17, CDV07] is minimax, over balls in the RKHS, while our results hold pointwise, i.e., for a given function f_d; (ii) Optimality in [RR17] is established in terms of rates, i.e., up to multiplicative constants, while ours holds up to additive errors (multiplicative constants are exactly characterized); (iii) The results of [RR17, CDV07] apply to a fixed RKHS (in particular, a fixed dimension d), while we study the case in which d is large and N, n, d are polynomially related.

Some of these distinctions are also relevant in comparing our work to recent results on KRR. In particular, points (i) and (ii) apply when comparing with [LR20, LRZ19].

2.4 Examples: the sphere and the hypercube

As examples we consider the case of feature vectors x_i that are uniformly distributed over the discrete hypercube Q^d = {−1, +1}^d or the sphere S^{d−1}(√d) = {x ∈ R^d : ‖x‖_2² = d}. Namely, letting A^d be either Q^d or S^{d−1}(√d), and ρ_d = Unif(A^d), we set X_d = A^d and ν_d = ρ_d. We further choose the θ_i's to be distributed as the covariates vectors, namely Ω_d = A^d and τ_d = ρ_d. Apart from simplifying our analysis, this is a sensible choice: since the covariates vectors do not align along any preferred direction, it is reasonable for the θ_i's to be isotropic as well.

Given a function ¯σ_d : R → R (which we allow to depend on the dimension d), we define the activation function σ_d : A^d × A^d → R by

    \sigma_d(x; \theta) = \bar\sigma_d\big( \langle x, \theta \rangle / \sqrt{d} \big).    (25)

We denote by E_{d,≤ℓ} the subspace of L²(A^d, ρ_d) spanned by polynomials of degree less than or equal to ℓ, and by P_{≤ℓ} the orthogonal projection onto E_{d,≤ℓ} in L²(A^d, ρ_d). The projectors P_ℓ and P_{>ℓ} are defined analogously (see Appendix E for more details). Let us emphasize that the projectors P_{≤ℓ} are related to but distinct from P_{≤m}: while P_{≤ℓ} projects onto eigenspaces of polynomials of degree at most ℓ, P_{≤m} projects onto the top m eigenfunctions. (The two coincide if m = Σ_{ℓ'≤ℓ} B(A^d; ℓ'), with B(A^d; ℓ') the dimension of the space of degree-ℓ' polynomials, and the top m eigenvalues verify λ_{d,j}² = Ω_d(d^{−ℓ}); see Appendix D.)

In order to apply Theorem 1, we make the following assumption about ¯σ_d.

Assumption 3 (Assumptions on A^d at level (s, S) ∈ N²). For {¯σ_d}_{d≥1} a sequence of functions ¯σ_d : R → R, we assume the following conditions to hold.

(a) There exist an integer k, constants c_0 > 0 and c_1 < 1, and δ_0 > 1/k, such that n ≤ N^{1−δ_0} or N ≤ n^{1−δ_0}, and |¯σ_d(x)| ≤ c_0 exp(c_1 x² / (4k)).

(b) We have

    \min_{k \le s}\; d^{\,s-k}\, \big\| \mathsf{P}_k\, \bar\sigma_d(\langle e, \cdot \rangle / \sqrt{d}) \big\|_{L^2(A^d, \rho_d)}^2 = \Omega_d(1),    (26)

    \min_{k \le S}\; d^{\,S-k}\, \big\| \mathsf{P}_k\, \bar\sigma_d(\langle e, \cdot \rangle / \sqrt{d}) \big\|_{L^2(A^d, \rho_d)}^2 = \Omega_d(1),    (27)

    \big\| \mathsf{P}_{> 2\max(s,S)+1}\, \bar\sigma_d(\langle e, \cdot \rangle / \sqrt{d}) \big\|_{L^2(A^d, \rho_d)}^2 = \Omega_d(1),    (28)

where e ∈ A^d is a fixed vector (it is easy to see that these quantities do not depend on e).

(c) If A^d = Q^d, we have, for all d large enough,

    \max_{k \le 2\max(s,S)+2}\; d^{-k}\, \big\| \mathsf{P}_{d-k}\, \bar\sigma_d(\langle e, \cdot \rangle / \sqrt{d}) \big\|_{L^2(A^d, \rho_d)}^2 \le d^{-2\max(s,S)-2}.    (29)

Assumption (a) requires n, N to be well separated, and adds a technical integrability condition.
The latter is necessary for the hypercontractivity condition in Assumption 1.(c) to make sense. Equations (26) and (27) (Assumption (b)) are a quantitative version of a universality condition: if P_k ¯σ_d(⟨e, ·⟩/√d) = 0 for some k, then linear combinations of ¯σ_d can only span a strict linear subspace of L²(A^d, ρ_d). Equation (28) (Assumption (b)) requires the high degree part of ¯σ_d to be non-vanishing (and therefore to induce a non-zero regularization from the high degree non-linearity).

For A^d = Q^d, we further require Assumption (c), namely that the last eigenvalues of ¯σ_d decrease sufficiently fast. This is a necessary condition to rule out pathological sequences {¯σ_d}_{d≥1} which are very rapidly oscillating.

Remark 2.2.
If ¯σ_d = ¯σ is independent of the dimension, then Assumptions (b), (c) are easy to check:

• The first two parts of Assumption (b) (Eqs. (26) and (27)) are satisfied if we require E{¯σ(G) p(G)} ≠ 0 for all non-vanishing polynomials p of degree at most max(s, S) (expectation being taken with respect to G ∼ N(0, 1)). Equivalently, E{¯σ(G) He_k(G)} ≠ 0 for all k ≤ max(s, S), where He_k is the k-th Hermite polynomial.

• The third part of Assumption (b) (Eq. (28)) amounts to requiring ¯σ not to be a degree-(2 max(s, S) + 1) polynomial.

• In Appendix D.2 we check that Assumption (c) holds if ¯σ is smooth and there exist c_0 > 0, c_1 < ∞ such that its (2 max(s, S) + 2)-th derivative verifies |¯σ^{(2max(s,S)+2)}(x)| ≤ c_0 exp(c_1 x²/2).

For instance, the shifted ReLU ¯σ(x) = (x − c)_+ with a generic c ∈ R \ {0} verifies Assumption 3. (The case c = 0 violates Eq. (26), since E{¯σ(G) He_k(G)} = 0 for odd k ≥ 3.)
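The Hermite-coefficient conditions in the first bullet can be checked directly. A small numerical sketch of ours follows (Gauss–Hermite quadrature; the values for c = 0 are zero only up to quadrature error, since the shifted ReLU has a kink).

```python
import numpy as np
from numpy.polynomial import hermite_e as He

x, w = He.hermegauss(200)            # nodes/weights for the weight e^{-x^2/2}
w = w / np.sqrt(2 * np.pi)           # normalize so that sums give E[.] under N(0,1)

def coeff(sigma, k):
    """E[sigma(G) He_k(G)] for G ~ N(0, 1)."""
    return np.sum(w * sigma(x) * He.hermeval(x, [0.0] * k + [1.0]))

for c in [0.0, 0.5]:
    s = lambda t, c=c: np.maximum(t - c, 0.0)      # shifted ReLU
    print(c, [round(coeff(s, k), 3) for k in range(1, 7)])
# for c = 0 the odd coefficients k = 3, 5 (numerically) vanish; for c != 0 they do not
```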
The corresponding distincteigenvalues are ξ d,(cid:96) , with degeneracy B ( S d − ; (cid:96) ) = d − (cid:96)d − (cid:18) d − (cid:96)(cid:96) (cid:19) , B ( Q d ; (cid:96) ) = (cid:18) d(cid:96) (cid:19) . (32)Notice that in both cases B ( A d ; (cid:96) ) = ( d (cid:96) /(cid:96) !)(1 + o d (1)) and, hence ξ d,(cid:96) (cid:46) d − (cid:96)/ (by construction Tr( H d ) isbounded uniformly). Indeed, by Assumption 3.( a ), we have ξ d,(cid:96) (cid:16) d − (cid:96)/ .As a consequence, if we set m = (cid:80) (cid:96) ≤ s B ( A d ; (cid:96) ), M = (cid:80) (cid:96) ≤ S B ( A d ; (cid:96) ), we have (cid:80) ∞ k = (cid:96) +1 λ k,d = Θ(1) (indeedthis sum is O d (1) because Tr( H d ) is bounded uniformly, and it is Ω d (1) by Assumption 3.( b )). Therefore,the conditions (7) and (8) (or, more formally, the conditions in Assumption 2) can be rewritten as d S (cid:16) ξ S (cid:28) N (cid:28) ξ S +1 (cid:16) d S +1 , (33) d s (cid:16) ξ s (cid:28) n (cid:28) ξ s +1 (cid:16) d s +1 , (34)which matches the assumptions in Theorem 2.Figure 1 provides an illustration of Theorem 2, for the case of the uniform distribution over the sphere A d = S d − ( √ d ). We fix d = 50, and generate data { ( x i , y i ) } i ≤ n with no noise σ ε = 0. We use the targetfunction f d ( x ) = g d ( (cid:104) v , x (cid:105) ) , (35) In the { +1 , − } d representation, z ∈ { +1 , − } d acts on Q d via x (cid:55)→ D z x , where D z is the diagonal matrix with diag( D z ) = z . .5 1.0 1.5 2.0 2.5 log( n )/log( d ) l o g ( N ) / l o g ( d ) min( n , N ) = d min( n , N ) = d log( n )/log( d ) R R F f d L P > 1 f d L P > 2 f d L N nN n
Figure 1: Learning a polynomial f_d (cf. Eq. (35)) over the d-dimensional sphere, d = 50, using a random features model and min-norm interpolation. We report the test error averaged over 10 realizations. Left: heatmap of the test error as a function of the number of neurons N and the number of samples n. Notice the blow-up at the interpolation threshold N ≈ n, and the symmetry around this line. Right: decrease of the test error as a function of sample size for scalings of the network size N = n^α.

Here v ∈ S^{d−1}(√d) and g_d is a fourth-order polynomial, g_d(z) = Σ_{ℓ=1}^4 √c_ℓ Q̂_ℓ(z), where Q̂_ℓ is the ℓ-th Gegenbauer polynomial, normalized so that ‖Q̂_ℓ(⟨v, ·⟩)‖_{L²(S^{d−1}(√d))} = 1, and the c_ℓ are fixed positive constants. While the precise form of f_d does not really matter here, we note that ‖P_1 f_d‖²_{L²} = ‖P_2 f_d‖²_{L²} = c_1, ‖P_3 f_d‖²_{L²} = ‖P_4 f_d‖²_{L²} = c_3, and ‖P_{>4} f_d‖²_{L²} = 0. We plot the test error of RFRR using the shifted ReLU activation
σ(x) = (x − c)_+ with a fixed offset c > 0, and λ = 0+ (min-norm interpolation). We repeat this calculation for a grid of values of n, N, and for each point in the grid report the average risk over 10 realizations. A minimal sketch of this experiment is given after the list below.

We plot the observed average risk in the sample-size/number-of-parameters plane, whose axes are log n/log d and log N/log d (corresponding to the exponents in the polynomial relations between n and d, and between N and d). Several prominent features of this plot are worth noting:

• The risk has a large peak for N ≈ n. This phenomenon was characterized precisely in the proportional regime N ≍ d, n ≍ d in [HMRT19, MM19].

• The plot appears completely symmetric under exchange of N and n: the number of parameters and the sample size play the same role in limiting the generalization ability, as anticipated by Theorem 1 and Theorem 2.

• The risk is bounded away from zero even for the largest values of N, n considered. Indeed, since f_d has non-vanishing degree-4 components, Theorem 2 implies that consistent estimation would require N, n ≫ d⁴ in this case.

• Finally, for a fixed n, near optimal test error is achieved when N ≍ n^{1+δ*}, for δ* a small positive constant.
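A minimal version of this experiment (our sketch; the offset, target, and grid are placeholder choices, and λ = 0+ is implemented via the pseudoinverse):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 50

def sphere(n):
    z = rng.normal(size=(n, d))
    return z * np.sqrt(d) / np.linalg.norm(z, axis=1, keepdims=True)

act = lambda t: np.maximum(t - 0.5, 0.0)          # shifted ReLU (placeholder offset)
v = sphere(1)[0]
f = lambda X: (X @ v / np.sqrt(d)) ** 2           # a simple low-degree target (placeholder)

def rf_min_norm_risk(n, N, n_test=2000):
    X, Theta, Xt = sphere(n), sphere(N), sphere(n_test)
    Z = act(X @ Theta.T / np.sqrt(d)) / N         # n x N design matrix, matching Eq. (4)
    a = np.linalg.pinv(Z) @ f(X)                  # lambda -> 0+: min-norm interpolator
    pred = act(Xt @ Theta.T / np.sqrt(d)) / N @ a
    return np.mean((f(Xt) - pred) ** 2)

for N in [50, 500, 5000]:
    print(N, rf_min_norm_risk(n=500, N=N))        # blow-up near N = n, stable for N >> n
```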
3 Generalization error of kernel machines

Formally, kernel ridge regression (KRR) corresponds to the limit N → ∞ of random features ridge regression. Despite this, we cannot apply Theorem 1 directly with N = ∞. We therefore state separate theorems for kernel methods. As a side benefit, we establish somewhat stronger results in this case. In particular:

• We simplify the set of assumptions (in particular, the assumptions concern only H_d and not the activation function σ_d, as they should).

• We prove a risk lower bound, Theorem 3, that holds for general kernel methods, not only KRR.

• Crucially, we remove the spectral gap assumption. In this more general setting, the risk of KRR is not approximated by the square norm of the projection of f_d orthogonal to the leading eigenfunctions of the kernel. We instead obtain an approximation in terms of a population-level ridge regression problem, with an effective value of the regularization parameter, which we determine.

Throughout this section, the setting is the same as in the previous one: we observe i.i.d. data (y_i, x_i)_{i∈[n]}, with feature vectors x_i from the probability space (X_d, ν_d). Responses are given by y_i = f_d(x_i) + ε_i, with f_d ∈ D_d and ε_i ∼ N(0, σ_ε²) independent of x_i.

We introduce some general background in Section 3.1, then state our assumptions in Section 3.2, and formally state our results in Sections 3.3 and 3.4.

3.1 Background

We consider a general RKHS defined on the probability space (X_d, ν_d), via a compact self-adjoint positive definite operator H_d : D_d → D_d with kernel representation

    H_d\, g(x) = \int_{X_d} H_d(x, x')\, g(x')\, \nu_d(\mathrm{d}x'),

where H_d : X_d × X_d → R is a square integrable function, H_d ∈ L²(X_d × X_d), with the property that ∫_{X_d} H_d(x, x') g(x') ν_d(dx') = 0 for g ∈ D_d^⊥.

Given a loss function ℓ : R × R → R_{≥0}, a general kernel method learns the function

    \hat f_\lambda = \arg\min_{f} \Big\{ \sum_{i=1}^{n} \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2 \Big\},    (36)

where ‖f‖_H is the RKHS norm associated to H_d. Kernel ridge regression corresponds to the special case ℓ(y, ŷ) = (y − ŷ)². As before, we will evaluate kernel methods via their test error, which we denote as follows in the case of KRR:

    R_{\mathrm{KR}}(f_d, X, \lambda) := \mathbb{E}_x\Big[ \big( f_d(x) - \hat f_\lambda(x) \big)^2 \Big].    (37)

As mentioned above, any kernel method can be seen as the N → ∞ limit of an RF model. To see this, note that any positive semidefinite kernel can be written in the form H_d(x_1, x_2) = E_{θ∼τ_d}[σ_d(x_1, θ) σ_d(x_2, θ)], for some activation function σ_d and some probability space (Ω_d, τ_d). This is akin to taking the square root of a matrix and —as in the finite-dimensional case— the square root is not unique. For instance, we can let σ_d be the symmetric square root of H_d, obtained by replacing λ_{d,j}² by λ_{d,j} in Eq. (15).

Given a choice of this square root, we can rewrite the estimator (36) as f̂_λ(x) = f(x; â_λ), where â_λ ∈ L²(Ω_d, τ_d) and

    \hat a_\lambda = \arg\min_{a} \Big\{ \sum_{i=1}^{n} \ell(y_i, f(x_i; a)) + \lambda \|a\|_{L^2}^2 \Big\},    (38)

    f(x; a) := \int_{\Omega_d} \sigma_d(x; \theta)\, a(\theta)\, \tau_d(\mathrm{d}\theta).    (39)

This can be informally seen as the N → ∞ limit of Eq. (16) if we choose the square loss function.
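For the square-loss case of (36), the finite-sample estimator has a familiar closed form; a minimal sketch of ours:

```python
import numpy as np

def krr_fit(K, y, lam):
    """Square-loss case of Eq. (36). By the representer theorem,
    f_hat(x) = sum_i zeta_i H(x, x_i) with zeta = (K + lam * I)^{-1} y,
    where K[i, j] = H(x_i, x_j) is the empirical kernel matrix."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(H_new, zeta):
    """H_new[a, i] = H(x_new_a, x_i) for test points x_new_a."""
    return H_new @ zeta
```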
3.2 Assumptions on the kernel

As in the case of RFRR, we collect our assumptions in two groups. The first one is mainly concerned with the concentration properties of the kernel, which are quantified in terms of the sequences of integers $n(d)$, $m(d)$.

Assumption 4 ($\{n(d), m(d)\}_{d\ge 1}$-Kernel Concentration Property). We say that the sequence of operators $\{\mathbb{H}_d\}_{d\ge 1}$ satisfies the Kernel Concentration Property (KCP) with respect to the sequence $\{(n(d), m(d))\}_{d\ge 1}$ if there exists a sequence of integers $\{u(d)\}_{d\ge 1}$ with $u(d) \ge m(d)$ such that the following conditions hold.
(a) (Hypercontractivity of finite eigenspaces.) For any fixed $q \ge 1$, there exists a constant $C$ such that, for any $h \in \mathcal{D}_{d,\le u(d)} = \mathrm{span}(\psi_s, 1 \le s \le u(d))$, we have
$\|h\|_{L^q} \le C \cdot \|h\|_{L^2}$. (40)
(b) (Properly decaying eigenvalues.) There exists fixed $\delta_0 > 0$ such that, for all $d$ large enough,
$n(d)^{2+\delta_0} \le \Big(\sum_{j=u(d)+1}^{\infty} \lambda_{d,j}^2\Big)^2 \Big/ \sum_{j=u(d)+1}^{\infty} \lambda_{d,j}^4$, (41)
$n(d)^{2+\delta_0} \le \Big(\sum_{j=u(d)+1}^{\infty} \lambda_{d,j}\Big)^2 \Big/ \sum_{j=u(d)+1}^{\infty} \lambda_{d,j}^2$. (42)
(c) (Concentration of diagonal elements of the kernel.) For $(x_i)_{i\in[n(d)]} \sim_{iid} \nu_d$, we have
$\max_{i\in[n(d)]} \big| E_{x\sim\nu_d}[H_{d,>m(d)}(x_i, x)^2] - E_{x,x'\sim\nu_d}[H_{d,>m(d)}(x, x')^2] \big| = o_{d,P}(1) \cdot E_{x,x'\sim\nu_d}[H_{d,>m(d)}(x, x')^2]$, (43)
$\max_{i\in[n(d)]} \big| H_{d,>m(d)}(x_i, x_i) - E_x[H_{d,>m(d)}(x, x)] \big| = o_{d,P}(1) \cdot E_x[H_{d,>m(d)}(x, x)]$. (44)

In the last definition, assumptions (a) and (c) have an interpretation similar to the one for RFRR. Namely, assumption (a) requires that the top eigenfunctions of $\mathbb{H}_d$ are delocalized, and assumption (c) requires that 'most points' in the sample space $\mathcal{X}_d$ behave similarly, in the sense of having similar values of the kernel diagonal $H_d(x, x)$. Condition (b) is very mild in high dimension and concerns the tail of the eigenvalues of $\mathbb{H}_d$.
The next condition essentially connects the sample size $n(d)$ to the eigenvalue index $m(d)$, via the sequence of eigenvalues.

Assumption 5 (Eigenvalue condition at level $\{(n(d), m(d))\}_{d\ge 1}$). We say that the sequence of kernel operators $\{\mathbb{H}_d\}_{d\ge 1}$ satisfies the Eigenvalue Condition at level $\{(n(d), m(d))\}_{d\ge 1}$ if the following conditions hold for all $d$ large enough.
(a) There exists fixed $\delta_0 > 0$ such that
$n(d)^{1+\delta_0} \le \lambda_{d,m(d)+1}^{-2} \sum_{k=m(d)+1}^{\infty} \lambda_{d,k}^2$, (45)
$n(d)^{1+\delta_0} \le \lambda_{d,m(d)+1}^{-1} \sum_{k=m(d)+1}^{\infty} \lambda_{d,k}$. (46)
(b) There exists fixed $\delta_0 > 0$ such that $m(d) \le n(d)^{1-\delta_0}$.

Unlike in the case of RFRR, we do not require the existence of a spectral gap, but we assume two different upper bounds on $n(d)$ to hold simultaneously. In many cases of interest, the right-hand sides of (45) and (46) have roughly the same value, which is given by the number of eigenvalues between $\lambda_{d,m(d)+1}$ and $c\,\lambda_{d,m(d)+1}$ for a small $c$ (counting degeneracy). The technical requirement (b) is mild, and we do not know of any interesting counterexample.

Consider any regression method of the form (36). By the representer theorem, there exist coefficients $\hat\zeta_1, \ldots, \hat\zeta_n$ such that
$\hat f_\lambda(x) = \sum_{i=1}^n \hat\zeta_i H_d(x, x_i)$. (47)
We are therefore led to define the following data-dependent prediction risk for kernel methods:
$R_H(f_d, X) := \min_{\zeta} E_x\Big[\Big(f_d(x) - \sum_{i=1}^n \zeta_i H_d(x_i, x)\Big)^2\Big]$. (48)
This is a lower bound on the prediction error of any kernel method of the form (36).
The next theorem provides a lower bound on the generalization error of kernel methods, which is a consequence of the approximation bound in Theorem 5.(a), derived for the random features model in Appendix A.

Theorem 3.
Let $\{f_d \in \mathcal{D}_d\}_{d\ge 1}$ be a sequence of functions, $(x_i)_{i\in[n(d)]} \sim \nu_d$ independently, and $\{\mathbb{H}_d\}_{d\ge 1}$ a sequence of kernel operators such that $\{(\mathbb{H}_d, n(d), m(d))\}_{d\ge 1}$ satisfies Eqs. (40), (41), (43), and (45). Then we have (cf. Eq. (48))
$\big| R_H(f_d, X) - R_H(P_{\le m(d)} f_d, X) - \|P_{>m(d)} f_d\|_{L^2}^2 \big| \le o_{d,P}(1) \cdot \|f_d\|_{L^2}\, \|P_{>m(d)} f_d\|_{L^2}$. (49)

Proof.
This follows immediately from Theorem 5.(a) stated in Appendix A. Indeed, setting $\sigma_d(x, x') = H_d(x, x')$, we obtain $R_H(f_d, X) = R_{RF}(f_d, X)$, whence the claim follows by applying Eq. (57).

Notice that
$R_H(P_{\le m(d)} f_d, X) \ge R_{KR}(f_d; X, \lambda) \ge R_H(f_d, X) \ge \|P_{>m(d)} f_d\|_{L^2}^2 - o_{d,P}(1) \cdot \|f_d\|_{L^2}\, \|P_{>m(d)} f_d\|_{L^2}$. (50)
In words, if we neglect the error term $o_{d,P}(1) \cdot \|f_d\|_{L^2}\|P_{>m(d)} f_d\|_{L^2}$, no kernel method can achieve nontrivial accuracy on the projection of $f_d$ onto the eigenfunctions beyond the first $m(d)$.

Kernel ridge regression is one specific way of selecting the coefficients $\hat\zeta$ in Eq. (47), namely by using $\ell(\hat y, y) = (\hat y - y)^2$ in Eq. (36). Solving for the coefficients yields
$\hat\zeta = (H + \lambda I_n)^{-1} y$,
where the kernel matrix $H = (H_{ij})_{i,j\in[n]}$ is given by $H_{ij} = H_d(x_i, x_j)$, and $y = (y_1, \ldots, y_n)^T$.
It is convenient to state our main results in terms of an effective ridge regression estimator
$\hat f^{\mathrm{eff}}_\gamma = \arg\min_f \Big\{ \|f_d - f\|_{L^2}^2 + \frac{\gamma}{n}\|f\|_{\mathcal{H}}^2 \Big\}$. (51)
This amounts to replacing the empirical risk in Eq. (36) by its population counterpart $\|f_d - f\|_{L^2}^2 = E\{(f_d(x) - f(x))^2\}$. Also note that the regularization parameter $\gamma$ does not coincide with $\lambda$: its precise value will be specified below.
The solution of the population ridge problem (51) can be written explicitly in terms of a shrinkage operator in the basis of eigenfunctions of $\mathbb{H}_d$:
$f(x) = \sum_{\ell=1}^{\infty} c_\ell\, \psi_{d,\ell}(x) \;\mapsto\; \hat f^{\mathrm{eff}}_\gamma(x) = \sum_{\ell=1}^{\infty} \frac{\lambda_{d,\ell}}{\lambda_{d,\ell} + \gamma/n}\, c_\ell\, \psi_{d,\ell}(x)$. (52)
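To make the two displays above concrete, the following sketch implements KRR through the representer-theorem formula $\hat\zeta = (H + \lambda I_n)^{-1}y$ and the coordinatewise shrinkage (52), in an assumed synthetic cosine eigenbasis on the circle. The eigenvalue decay, target coefficients and sample size are placeholder choices, not examples from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 400, 200                                   # sample size, retained eigenpairs
lams = 1.0 / np.arange(1, K + 1) ** 2             # placeholder eigenvalue decay
c = rng.standard_normal(K) / np.arange(1, K + 1)  # placeholder target coefficients

def feats(x):  # orthonormal cosine basis psi_k on [0, 2*pi) w.r.t. uniform measure
    k = np.arange(1, K + 1)
    return np.sqrt(2.0) * np.cos(np.outer(x, k))

x = rng.uniform(0, 2 * np.pi, n)
Psi = feats(x)                                    # n x K matrix of eigenfunction values
y = Psi @ c + 0.1 * rng.standard_normal(n)

# KRR via the representer theorem: zeta = (H + lam*I)^{-1} y, H_ij = H_d(x_i, x_j)
lam = 0.1
H = Psi @ (lams[:, None] * Psi.T)
zeta = np.linalg.solve(H + lam * np.eye(n), y)
# coefficients of the KRR estimate: fhat_k = lam_k * sum_i zeta_i psi_k(x_i)
fhat = lams * (Psi.T @ zeta)

# population ridge, Eq. (52): coordinatewise shrinkage with parameter gamma
gamma = lam                                       # naive choice; Theorem 4 corrects this
feff = lams / (lams + gamma / n) * c

print("KRR risk             :", np.sum((c - fhat) ** 2))
print("effective ridge risk :", np.sum((c - feff) ** 2))
```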
Theorem 4. Let $\{f_d \in \mathcal{D}_d\}_{d\ge 1}$ be a sequence of functions, $(x_i)_{i\in[n(d)]} \sim \nu_d$ independently, and $\{\mathbb{H}_d\}_{d\ge 1}$ a sequence of kernel operators such that $\{(\mathbb{H}_d, n(d), m(d))\}_{d\ge 1}$ satisfies the $\{n(d), m(d)\}_{d\ge 1}$-KCP (Assumption 4) and the eigenvalue condition at level $\{n(d), m(d)\}_{d\ge 1}$ (Assumption 5). Define the effective regularization
$\gamma_{\mathrm{eff}} := \lambda + \mathrm{Tr}(\mathbb{H}_{d,>m(d)})$. (53)
Then, for any regularization parameter $\lambda \in [0, \lambda_\star]$, where $\lambda_\star = \mathrm{Tr}(\mathbb{H}_{d,>m(d)})$, and any $\eta > 0$, we have (cf. Eq. (37))
$\big| R_{KR}(f_d, X, \lambda) - \|f_d - \hat f^{\mathrm{eff}}_{\gamma_{\mathrm{eff}}}\|_{L^2}^2 \big| = o_{d,P}(1) \cdot \big( \|f_d\|_{L^2}^2 + \|P_{>m} f_d\|_{L^{2+\eta}}^2 + \sigma_\varepsilon^2 \big)$. (54)
Further, the ridge regression estimator $\hat f_\lambda$ is close to the effective estimator $\hat f^{\mathrm{eff}}_{\gamma_{\mathrm{eff}}}$, namely
$\big\|\hat f_\lambda - \hat f^{\mathrm{eff}}_{\gamma_{\mathrm{eff}}}\big\|_{L^2}^2 = o_{d,P}(1) \cdot \big( \|f_d\|_{L^2}^2 + \|P_{>m} f_d\|_{L^{2+\eta}}^2 + \sigma_\varepsilon^2 \big)$. (55)

The proof of Theorem 4 is deferred to Appendix C.
In words, KRR behaves as ridge regression with respect to the population risk, except that the regularization parameter is increased by $\mathrm{Tr}(\mathbb{H}_{d,>m})$. The underlying mechanism is quite simple. The empirical kernel matrix is decomposed as $H = H_{\le m} + H_{>m}$, and the second component can be approximated by a multiple of the identity: $H_{>m} \approx \mathrm{Tr}(\mathbb{H}_{d,>m}) \cdot I_n$. This term acts as an additional ridge regularizer.
As mentioned above, we do not assume here any eigenvalue gap condition. However, formulas simplify if we assume an eigenvalue gap, e.g.,
$\lambda_{d,m(d)+1}^{-1} \sum_{k=m(d)+1}^{\infty} \lambda_{d,k} = \omega_d(1) \cdot n(d)$.
Under this additional assumption, Theorem 4 implies the following simplified formula for the test error:
$\big| R_{KR}(f_d, X, \lambda) - \|P_{>m(d)} f_d\|_{L^2}^2 \big| = o_{d,P}(1) \cdot \big( \|f_d\|_{L^{2+\eta}}^2 + \sigma_\varepsilon^2 \big)$.
As anticipated, this coincides with the risk of RFRR if we heuristically set $N = \infty$ in Theorem 1.
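A minimal numerical check of the mechanism just described, under assumed placeholder choices (cosine eigenfunctions on the circle and a flat tail spectrum whose effective rank far exceeds $n^2$): the empirical tail kernel matrix concentrates around $\mathrm{Tr}(\mathbb{H}_{d,>m})\cdot I_n$ in operator norm, which is exactly why it acts as the extra ridge term in $\gamma_{\mathrm{eff}}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, K = 100, 10, 50000
lam_tail = np.full(K - m, 1.0 / (K - m))     # flat tail: effective rank K - m >> n^2
x = rng.uniform(0, 2 * np.pi, n)
k = np.arange(m + 1, K + 1)
Psi_tail = np.sqrt(2.0) * np.cos(np.outer(x, k))      # eigenfunctions psi_k, k > m

H_tail = Psi_tail @ (lam_tail[:, None] * Psi_tail.T)  # empirical H_{>m}
kappa = lam_tail.sum()                                # Tr(H_{d,>m}) = 1 here
dev = np.linalg.norm(H_tail - kappa * np.eye(n), 2) / kappa
print("||H_>m - Tr(H_>m) I|| / Tr(H_>m) =", dev)      # small when eff. rank >> n^2
```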
Acknowledgements

This work was supported by the NSF through award DMS-2031883 and by the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning. We also acknowledge NSF grants CCF-2006489 and IIS-1741162, and ONR grant N00014-18-1-2729.

References
[ADH+19] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang, Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks, arXiv:1901.08584 (2019).
[AM15] Ahmed El Alaoui and Michael W. Mahoney, Fast randomized kernel ridge regression with statistical guarantees, Advances in Neural Information Processing Systems, 2015, pp. 775–783.
[AZLL18] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang, Learning and generalization in overparameterized neural networks, going beyond two layers, arXiv:1811.04918 (2018).
[AZLS18] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song, A convergence theory for deep learning via over-parameterization, arXiv:1811.03962 (2018).
[Bac13] Francis Bach, Sharp analysis of low-rank kernel matrix approximations, Conference on Learning Theory, 2013, pp. 185–209.
[Bac15] Francis Bach, On the equivalence between quadrature rules and random features, arXiv:1502.06800 (2015).
[BBV06] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala, Kernels as features: On kernels, margins, and low-dimensional mappings, Machine Learning (2006), no. 1, 79–94.
[Bec75] William Beckner, Inequalities in Fourier analysis, Annals of Mathematics (1975), 159–182.
[Bec92] William Beckner, Sobolev inequalities, the Poisson semigroup, and analysis on the sphere S^n, Proceedings of the National Academy of Sciences (1992), no. 11, 4816–4819.
[BETVG08] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool, Speeded-up robust features (SURF), Computer Vision and Image Understanding (2008), no. 3, 346–359.
[BHMM19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, Reconciling modern machine-learning practice and the classical bias-variance trade-off, Proceedings of the National Academy of Sciences (2019), no. 32, 15849–15854.
[BLLT20] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler, Benign overfitting in linear regression, Proceedings of the National Academy of Sciences (2020).
[Bon70] Aline Bonami, Étude des coefficients de Fourier des fonctions de L^p(G), Annales de l'Institut Fourier, vol. 20, 1970, pp. 335–402.
[BRT19] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov, Does data interpolation contradict statistical optimality?, The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, 2019, pp. 1611–1619.
[BTA11] Alain Berlinet and Christine Thomas-Agnan, Reproducing Kernel Hilbert Spaces in Probability and Statistics, Springer Science & Business Media, 2011.
[CDV07] Andrea Caponnetto and Ernesto De Vito, Optimal rates for the regularized least-squares algorithm, Foundations of Computational Mathematics (2007), no. 3, 331–368.
[DLL+18] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai, Gradient descent finds global minima of deep neural networks, arXiv:1811.03804 (2018).
[Dud18] Richard M. Dudley, Real Analysis and Probability, CRC Press, 2018.
[DZPS18] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes over-parameterized neural networks, arXiv:1810.02054 (2018).
[GMMM19] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Linearized two-layers neural networks in high dimension, Annals of Statistics (2019), arXiv:1904.12191.
[GMMM20] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, When do neural networks outperform kernel methods?, Advances in Neural Information Processing Systems (2020).
[Gro75] Leonard Gross, Logarithmic Sobolev inequalities, American Journal of Mathematics (1975), no. 4, 1061–1083.
[HMRT19] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani, Surprises in high-dimensional ridgeless least squares interpolation, arXiv:1903.08560 (2019).
[HTF09] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning, Springer, 2009.
[JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler, Neural tangent kernel: Convergence and generalization in neural networks, Advances in Neural Information Processing Systems, 2018, pp. 8571–8580.
[JŞS+20] Arthur Jacot, Berfin Şimşek, Francesco Spadaro, Clément Hongler, and Franck Gabriel, Kernel alignment risk estimator: Risk prediction from training data, arXiv:2006.09796 (2020).
[LL18] Yuanzhi Li and Yingyu Liang, Learning overparameterized neural networks via stochastic gradient descent on structured data, Advances in Neural Information Processing Systems, 2018, pp. 8157–8166.
[Low04] David G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision (2004), no. 2, 91–110.
[LR20] Tengyuan Liang and Alexander Rakhlin, Just interpolate: Kernel "ridgeless" regression can generalize, Annals of Statistics (2020), no. 3, 1329–1347.
[LRZ19] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai, On the risk of minimum-norm interpolants and restricted lower isometry of kernels, arXiv:1908.10292 (2019).
[MM19] Song Mei and Andrea Montanari, The generalization error of random features regression: Precise asymptotics and double descent curve, arXiv:1908.05355 (2019).
[MMM21] Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Learning invariances in random feature models, In preparation (2021).
[MWW20] Chao Ma, Stephan Wojtowytsch, and Lei Wu, Towards a mathematical understanding of neural network-based machine learning: what we know and what we don't, arXiv:2009.10713 (2020).
[MZ20] Andrea Montanari and Yiqiao Zhong, The interpolation phase transition in neural networks: Memorization and generalization under lazy training, arXiv:2007.12826 (2020).
[O'D14] Ryan O'Donnell, Analysis of Boolean Functions, Cambridge University Press, 2014.
[OS19] Samet Oymak and Mahdi Soltanolkotabi, Towards moderate overparameterization: global convergence guarantees for training shallow neural networks, arXiv:1902.04674 (2019).
[RCR15] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco, Less is more: Nyström computational regularization, Advances in Neural Information Processing Systems, 2015, pp. 1657–1665.
[RR08] Ali Rahimi and Benjamin Recht, Random features for large-scale kernel machines, Advances in Neural Information Processing Systems, 2008, pp. 1177–1184.
[RR09] Ali Rahimi and Benjamin Recht, Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning, Advances in Neural Information Processing Systems, 2009, pp. 1313–1320.
[RR17] Alessandro Rudi and Lorenzo Rosasco, Generalization properties of learning with random features, Advances in Neural Information Processing Systems, 2017, pp. 3215–3225.
[TB20] Alexander Tsigler and Peter L. Bartlett, Benign overfitting in ridge regression, arXiv:2009.14286 (2020).
[Ver10] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, arXiv:1011.3027 (2010).
[Wai19] Martin J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48, Cambridge University Press, 2019.
[Wed72] Per-Åke Wedin, Perturbation bounds in connection with singular value decomposition, BIT Numerical Mathematics (1972), no. 1, 99–111.
[YLM+12] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou, Nyström method vs random Fourier features: A theoretical and empirical comparison, Advances in Neural Information Processing Systems, 2012, pp. 476–484.
[ZCZG18] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu, Stochastic gradient descent optimizes over-parameterized deep ReLU networks, arXiv:1811.08888 (2018).
A Approximation error of random features model
In this section, we consider the approximation error of the random features function class. Formally, the approximation error can be seen as the generalization error of random features ridge regression with a finite number of neurons
$N < \infty$ and an infinite sample size $n = \infty$. However, we cannot directly apply Theorem 1 with $n = \infty$. We therefore state a separate theorem. This theorem is also used to prove the lower bound of Theorem 3 on the generalization error of general kernel methods.
In Section A.1, we state our assumptions and the theorem. Sections A.2 and A.3 provide a proof of the theorem, while Section A.4 gathers key technical concentration results that will also be used in the proofs of Theorem 1 and Theorem 4.

A.1 Assumptions and theorem
Recall the definition of the random features function class (see Section 2.1): letting $\Theta = (\theta_i)_{i\in[N]} \sim_{iid} \tau_d$,
$\mathcal{F}_{RF,N}(\Theta) = \Big\{ \hat f(x; a) = \sum_{i=1}^N a_i \sigma_d(x; \theta_i) : a_i \in \mathbb{R},\, i \in [N] \Big\}$.
We define the approximation error of the random features function class for a target function $f_d \in L^2(\mathcal{X}_d)$ as
$R_{App}(f_d, \Theta) := \inf_{\hat f \in \mathcal{F}_{RF,N}(\Theta)} E_{x\sim\nu_d}\big[(f_d(x) - \hat f(x))^2\big]$. (56)
Similarly to Sections 2.2 and 3.2, we quantify our assumptions on the sequences of probability spaces $(\mathcal{X}_d, \nu_d)$ and $(\Omega_d, \tau_d)$, and on the activation functions $\sigma_d \in L^2(\mathcal{X}_d \times \Omega_d)$, in terms of the sequences of integers $N(d), M(d)$. We state the assumptions in two groups: Assumption 6 and Assumption 7 deal respectively with the concentration properties and the spectrum of the sequence of feature kernel operators $\{\mathbb{U}_d\}_{d\ge 1}$, defined by
$U_d(\theta_1, \theta_2) = E_{x\sim\nu_d}[\sigma_d(x; \theta_1)\sigma_d(x; \theta_2)]$.

Assumption 6 (Feature kernel concentration at level $\{(N(d), M(d))\}_{d\ge 1}$). The sequences of spaces $\{\mathcal{V}_d\}_{d\ge 1}$, operators $\{\mathbb{U}_d\}_{d\ge 1}$, and numbers of neurons $\{N(d)\}_{d\ge 1}$ satisfy feature kernel concentration at level $\{(N(d), M(d))\}_{d\ge 1}$ if there exists a sequence $\{u(d)\}_{d\ge 1}$ with $u(d) \ge M(d)$ such that the following hold.
(a) (Hypercontractivity of finite eigenspaces.) For any fixed $q \ge 1$, there exists $C$ such that, for any $g \in \mathcal{V}_{d,\le u(d)} = \mathrm{span}(\phi_s, 1 \le s \le u(d))$, we have $\|g\|_{L^q(\Omega_d)} \le C \cdot \|g\|_{L^2(\Omega_d)}$.
(b) (Properly decaying eigenvalues.) There exists a fixed $\delta_0 > 0$ such that
$N(d)^{2+\delta_0} \le \Big(\sum_{j=u(d)+1}^{\infty} \lambda_{d,j}\Big)^2 \Big/ \sum_{j=u(d)+1}^{\infty} \lambda_{d,j}^2$.
(c) (Upper bound on the diagonal elements of the kernel.) For $(\theta_i)_{i\in[N(d)]} \sim_{iid} \tau_d$ and any $\delta > 0$, we have
$\max_{i\in[N(d)]} U_{d,>M(d)}(\theta_i, \theta_i) = O_{d,P}(N(d)^{\delta}) \cdot E_{\theta}[U_{d,>M(d)}(\theta, \theta)]$.
(d) (Lower bound on the diagonal elements of the kernel.) For $(\theta_i)_{i\in[N(d)]} \sim_{iid} \tau_d$ and any $\delta > 0$, we have
$\min_{i\in[N(d)]} U_{d,>M(d)}(\theta_i, \theta_i) = \Omega_{d,P}(N(d)^{-\delta}) \cdot E_{\theta}[U_{d,>M(d)}(\theta, \theta)]$.

Assumption 7 (Spectral gap at level $\{(N(d), M(d))\}_{d\ge 1}$). The sequence of operators $\{\mathbb{U}_d\}_{d\ge 1}$ has a spectral gap at level $\{(N(d), M(d))\}_{d\ge 1}$ if the following hold.
(a) There exists a fixed $\delta_0 > 0$ such that
$N(d)^{1+\delta_0} \le \lambda_{d,M(d)+1}^{-1} \sum_{j=M(d)+1}^{\infty} \lambda_{d,j}$.
(b) There exists a fixed $\delta_0 > 0$ such that $M(d) \le N(d)^{1-\delta_0}$ and
$N(d)^{1-\delta_0} \ge \lambda_{d,M(d)}^{-1} \sum_{j=M(d)+1}^{\infty} \lambda_{d,j}$.
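The quantities entering Assumptions 6.(b), 7.(a) and 7.(b) are easy to tabulate for a given spectrum. The sketch below does so for a hypothetical power-law spectrum $\lambda_j = j^{-\alpha}$; the exponent $\alpha$, the truncation $J$, and the helper `spectral_ratios` are arbitrary illustrations, not the paper's examples (which typically have step-like spectra with large degeneracies).

```python
import numpy as np

alpha, J = 0.8, 10**6
lam = np.arange(1, J + 1, dtype=float) ** (-alpha)     # hypothetical spectrum

def spectral_ratios(M):
    tail = lam[M:]
    gap_a  = tail.sum() / tail[0]                      # Assumption 7.(a): want >= N^{1+delta_0}
    rank_b = tail.sum() ** 2 / (tail ** 2).sum()       # Assumption 6.(b): want >= N^{2+delta_0}
    gap_b  = tail.sum() / lam[M - 1]                   # Assumption 7.(b): want <= N^{1-delta_0}
    return gap_a, rank_b, gap_b

for M in [10, 100, 1000]:
    print(M, ["%.3g" % r for r in spectral_ratios(M)])
```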
Remark A.1. In Assumption 6.(c), we can replace $U_{d,>M(d)}$ by $U_{d,>u(d)}$ (see Lemma 7).

We are now in position to state our theorem on the approximation error of the random features function class. We state the lower and upper bounds and their assumptions separately.

Theorem 5 (Approximation error of the random features function class). Let $\{f_d \in \mathcal{D}_d\}_{d\ge 1}$ be a sequence of functions and $\Theta = (\theta_i)_{i\in[N(d)]}$ with $(\theta_i)_{i\in[N(d)]} \sim \tau_d$ independently. Let $\{\sigma_d\}_{d\ge 1}$ be a sequence of activation functions satisfying Assumptions 6.(a) and 6.(b) at level $\{M(d)\}_{d\ge 1}$. Then the following hold for the approximation error of the random features class (see Eq. (56)):
(a) (Lower bound.) If $\{\sigma_d\}_{d\ge 1}$ further satisfies Assumptions 6.(d) and 7.(a), then we have
$\big| R_{App}(f_d, \Theta) - R_{App}(P_{\le M(d)} f_d, \Theta) - \|P_{>M(d)} f_d\|_{L^2}^2 \big| \le o_{d,P}(1) \cdot \|f_d\|_{L^2}\, \|P_{>M(d)} f_d\|_{L^2}$. (57)
(b) (Upper bound.) If $\{\sigma_d\}_{d\ge 1}$ further satisfies Assumptions 6.(c) and 7, then we have
$\big| R_{App}(P_{\le M(d)} f_d, \Theta) \big| \le o_{d,P}(1) \cdot \|f_d\|_{L^2}\, \|P_{\le M(d)} f_d\|_{L^2}$. (58)

Point (a) is proved in Section A.2, while point (b) is proved in Section A.3.
The lower bound on general kernel methods in Theorem 3 is obtained as a direct consequence of Theorem 5.(a), by taking $\sigma_d(x, x') = H_d(x, x')$. Indeed, it is easy to check that Eqs. (40) and (41) imply Assumptions 6.(a) and 6.(b), Eq. (43) implies Assumptions 6.(c) and 6.(d), and Eq. (45) implies Assumption 7.(a).

A.2 Proof of Theorem 5.(a): lower bound on the approximation error

We denote by $E_\theta$ the expectation operator with respect to $\theta \sim \tau_d$, and by $E_x$ the expectation operator with respect to $x \sim \nu_d$. We will denote $M = M(d)$ and $N = N(d)$.
Define the random vectors $V = (V_1, \ldots, V_N)^T$, $V_{\le M} = (V_{1,\le M}, \ldots, V_{N,\le M})^T$, $V_{>M} = (V_{1,>M}, \ldots, V_{N,>M})^T$, with
$V_{i,\le M} \equiv E_{x\sim\nu_d}[[P_{\le M} f_d](x)\, \sigma_d(x; \theta_i)]$,
$V_{i,>M} \equiv E_{x\sim\nu_d}[[P_{>M} f_d](x)\, \sigma_d(x; \theta_i)]$,
$V_i \equiv E_{x\sim\nu_d}[f_d(x)\, \sigma_d(x; \theta_i)] = V_{i,\le M} + V_{i,>M}$,
and the matrix $U = (U_{ij})_{i,j\in[N]}$ with
$U_{ij} = E_{x\sim\nu_d}[\sigma_d(x; \theta_i)\sigma_d(x; \theta_j)]$. (59)
In what follows, we write $R_{App}(f_d) = R_{App}(f_d, \Theta)$ for the approximation error of the random features model, omitting the dependence on the weights $\Theta$. By definition and a simple calculation, we have
$R_{App}(f_d) = \min_{a\in\mathbb{R}^N} \big\{ E_x[f_d(x)^2] - 2\langle a, V\rangle + \langle a, U a\rangle \big\} = E_x[f_d(x)^2] - V^T U^{-1} V$,
$R_{App}(P_{\le M} f_d) = \min_{a\in\mathbb{R}^N} \big\{ E_x[P_{\le M} f_d(x)^2] - 2\langle a, V_{\le M}\rangle + \langle a, U a\rangle \big\} = E_x[P_{\le M} f_d(x)^2] - V_{\le M}^T U^{-1} V_{\le M}$.
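The closed-form expression $R_{App}(f_d) = E_x[f_d(x)^2] - V^T U^{-1} V$ derived above can be evaluated directly. The sketch below does this for an assumed cosine singular basis on the circle, with placeholder singular-value and coefficient decays; it illustrates the formula, not the paper's specific examples.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 50, 2000
k = np.arange(1, K + 1)
s = 1.0 / k                       # singular values lambda_k^{1/2} (placeholder decay)
fhat = 1.0 / k                    # target coefficients in the psi-basis (placeholder)

theta = rng.uniform(0, 2 * np.pi, N)
Phi = np.sqrt(2.0) * np.cos(np.outer(theta, k))      # phi_k(theta_i), orthonormal basis

V = Phi @ (s * fhat)                                 # V_i = E_x[f_d(x) sigma(x; theta_i)]
U = Phi @ (s[:, None] ** 2 * Phi.T)                  # U_ij = E_x[sigma(x;th_i) sigma(x;th_j)]
r_app = fhat @ fhat - V @ np.linalg.solve(U, V)      # R_App = ||f||^2 - V^T U^{-1} V
print("R_App with N =", N, ":", r_app)
```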
By orthogonality, we have $E_x[f_d(x)^2] = E_x[[P_{\le M} f_d](x)^2] + E_x[[P_{>M} f_d](x)^2]$, which gives
$\big| R_{App}(f_d) - R_{App}(P_{\le M} f_d) - E_x[[P_{>M} f_d](x)^2] \big|$
$= \big| V_{\le M}^T U^{-1} V_{\le M} - V^T U^{-1} V \big|$
$= \big| V_{\le M}^T U^{-1} V_{\le M} - (V_{\le M} + V_{>M})^T U^{-1} (V_{\le M} + V_{>M}) \big|$
$= \big| 2 V^T U^{-1} V_{>M} - V_{>M}^T U^{-1} V_{>M} \big|$
$\le 2\|U^{-1/2} V_{>M}\|_2 \|U^{-1/2} V\|_2 + \|U^{-1}\|_{op} \|V_{>M}\|_2^2$
$\le 2\|U^{-1/2}\|_{op} \|V_{>M}\|_2 \|f_d\|_{L^2} + \|U^{-1}\|_{op} \|V_{>M}\|_2^2$, (60)
where the last inequality used the fact that
$0 \le R_{App}(f_d) = \|f_d\|_{L^2}^2 - V^T U^{-1} V$,
so that $\|U^{-1/2} V\|_2^2 = V^T U^{-1} V \le \|f_d\|_{L^2}^2$.
By Eq. (60), to prove Theorem 5.(a), we need to bound $\|U^{-1}\|_{op} \|V_{>M}\|_2^2$. This is achieved in the two following propositions.

Proposition 1 (Expected norm of V). Let $\{f_d \in \mathcal{D}_d\}$ be a sequence of target functions. Define $E_{>M}$ by
$E_{>M} \equiv E_\theta\Big[\big(E_x[P_{>M} f_d(x)\, \sigma_d(x; \theta)]\big)^2\Big]$.
Then we have $E_{>M} \le \|\mathbb{U}_{d,>M}\|_{op} \cdot \|P_{>M} f_d\|_{L^2}^2$.

Proof of Proposition 1.
We have
$E_{>M} \equiv E_{\theta\sim\tau_d}\big[\langle P_{>M} f_d, \sigma_d(\cdot, \theta)\rangle_{L^2(\mathcal{X}_d)}^2\big]$
$= E_{\theta\sim\tau_d} E_{x_1,x_2\sim\nu_d}\big[P_{>M} f_d(x_1)\, \sigma_d(x_1, \theta)\sigma_d(x_2, \theta)\, P_{>M} f_d(x_2)\big]$
$= E_{x_1,x_2\sim\nu_d}\big[P_{>M} f_d(x_1)\, E_{\theta\sim\tau_d}[\sigma_d(x_1, \theta)\sigma_d(x_2, \theta)]\, P_{>M} f_d(x_2)\big]$
$= \langle P_{>M} f_d, \mathbb{H}_d P_{>M} f_d\rangle_{L^2} = \langle P_{>M} f_d, \mathbb{H}_{d,>M} P_{>M} f_d\rangle_{L^2}$
$\le \|\mathbb{H}_{d,>M}\|_{op} \|P_{>M} f_d\|_{L^2}^2 = \|\mathbb{U}_{d,>M}\|_{op} \|P_{>M} f_d\|_{L^2}^2$.
This proves the proposition.

Proposition 2 (Lower bound on the kernel matrix). Let $\{\sigma_d\}_{d\ge 1}$ be a sequence of activation functions satisfying Assumptions 6.(a), 6.(b) and 7.(a) at level $\{(N(d), M(d))\}_{d\ge 1}$. Let $(\theta_i)_{i\in[N]} \sim \tau_d$ independently, and let $U \in \mathbb{R}^{N\times N}$ be the kernel matrix defined by Eq. (59). Then we have
$U \succeq \kappa_{>M}(\Lambda + \Delta)$, (61)
with $\Lambda = \mathrm{diag}\big((U_{d,>M}(\theta_i, \theta_i)/\kappa_{>M})_{i\in[N]}\big)$, $\kappa_{>M} = \mathrm{Tr}(\mathbb{U}_{d,>M})$, and $\Delta$ such that there exists some $\delta' > 0$ with $E[\|\Delta\|_{op}] = O_d(N^{-\delta'})$.

Proof of Proposition 2.
This is a direct consequence of Theorem 6.(a).

By Proposition 1, we have
$E[\|V_{>M}\|_2^2] = N E_{>M} \le N \cdot \|\mathbb{U}_{d,>M}\|_{op} \cdot \|P_{>M} f_d\|_{L^2}^2$. (62)
Next, by Proposition 2 and Assumption 6.(d), for any fixed $\delta$ with $0 < \delta < \delta'$, we have
$\|U^{-1}\|_{op} \cdot \mathrm{Tr}(\mathbb{U}_{d,>M}) \le \Big[\min_{i\in[N]} U_{d,>M}(\theta_i, \theta_i)/\mathrm{Tr}(\mathbb{U}_{d,>M}) - O_{d,P}(N^{-\delta'})\Big]^{-1} \le O_{d,P}(N^{\delta})$,
and hence, by Markov's inequality,
$\|U^{-1}\|_{op} \|V_{>M}\|_2^2 \le O_{d,P}(N^{\delta}) \cdot \frac{N\, \|\mathbb{U}_{d,>M}\|_{op}}{\mathrm{Tr}(\mathbb{U}_{d,>M})} \cdot \|P_{>M} f_d\|_{L^2}^2$. (63)
By Assumption 7.(a), we have $N \cdot \|\mathbb{U}_{d,>M}\|_{op}/\mathrm{Tr}(\mathbb{U}_{d,>M}) = O_d(N^{-\delta_0})$ for some $\delta_0 > 0$.
Plugging this bound into Eq. (63) and choosing $\delta < \delta_0$, we obtain
$\|U^{-1}\|_{op} \|V_{>M}\|_2^2 = o_{d,P}(1) \cdot \|P_{>M} f_d\|_{L^2}^2$. (64)
Combining Eq. (64) with Eq. (60) proves Theorem 5.(a).

A.3 Proof of Theorem 5.(b): upper bound on the approximation error

In the following, we calculate the quantity $R_{App}(P_{\le M} f_d, \Theta)$. We have
$R_{App}(P_{\le M} f_d, \Theta) = \|P_{\le M} f_d\|_{L^2}^2 - V_{\le M}^T U^{-1} V_{\le M}$,
where $V_{\le M} = (V_{\le M,1}, \ldots, V_{\le M,N})^T$ and $U = (U_{ij})_{i,j\in[N]}$ with
$V_{\le M,i} = E_{x\sim\nu_d}[P_{\le M} f_d(x)\, \sigma_d(x; \theta_i)]$, $U_{ij} = E_{x\sim\nu_d}[\sigma_d(x; \theta_i)\sigma_d(x; \theta_j)]$.
Recall that $(\psi_k)_{k\ge 1}$ is the orthonormal eigenbasis of $\mathbb{H}_d$. We denote the decomposition of $P_{\le M} f_d$ in this basis by
$P_{\le M} f_d(x) = \sum_{k=1}^{M} \langle f_d, \psi_k\rangle_{L^2}\, \psi_k(x) \equiv \sum_{k=1}^{M} \hat f_k \psi_k(x)$.
Recall the decomposition of $\sigma_d$:
$\sigma_d(x, \theta) = \sum_{k=1}^{\infty} \lambda_{d,k}^{1/2}\, \psi_k(x)\phi_k(\theta)$.
By orthonormality of the $(\psi_k)_{k\ge 1}$, we have
$V_{\le M,i} = \sum_{k=1}^{M} \hat f_k \lambda_{d,k}^{1/2} \phi_k(\theta_i)$.
Define $\hat f = (\hat f_1, \ldots, \hat f_M)^T \in \mathbb{R}^M$, $D = \mathrm{diag}(\lambda_{d,1}, \ldots, \lambda_{d,M}) \in \mathbb{R}^{M\times M}$, $\Phi = (\phi_k(\theta_i))_{i\in[N],k\in[M]} \in \mathbb{R}^{N\times M}$, and $L = \Phi D^{1/2} \in \mathbb{R}^{N\times M}$. Then we have
$V_{\le M} = \Big(\sum_{k=1}^{M} \hat f_k \lambda_{d,k}^{1/2} \phi_k(\theta_i)\Big)_{i\in[N]} = \Phi D^{1/2} \hat f = L \hat f$.
By Eq. (67) in Theorem 6, there exists $\Delta \in \mathbb{R}^{N\times N}$ such that
$U = \Phi D \Phi^T + \kappa_{>M}(\Lambda + \Delta) = L L^T + \kappa_{>M}(\Lambda + \Delta)$,
where $\kappa_{>M} = \mathrm{Tr}(\mathbb{U}_{d,>M})$ and $\Lambda = \mathrm{diag}\big((U_{d,>M}(\theta_i, \theta_i)/\kappa_{>M})_{i\in[N]}\big)$. By simple algebra, we have
$V_{\le M}^T U^{-1} V_{\le M} = \hat f^T S \hat f$, where $S = L^T(L L^T + \kappa_{>M}(\Lambda + \Delta))^{-1} L$.
Therefore, we have
$R_{App}(P_{\le M} f_d, \Theta) = \|P_{\le M} f_d\|_{L^2}^2 - V_{\le M}^T U^{-1} V_{\le M} = \|\hat f\|_2^2 - \langle \hat f, S \hat f\rangle \le \|I_M - S\|_{op} \|\hat f\|_2^2 = o_{d,P}(1) \cdot \|P_{\le M} f_d\|_{L^2}^2$.
The last equality follows from Lemma 1, which is stated and proved below. This proves the theorem.
Lemma 1 (Concentration of S). Let Assumptions 6.(a), 6.(b), 6.(c) and 7 hold. Then we have $\|I_M - S\|_{op} = o_{d,P}(1)$.

Proof of Lemma 1.
By the Sherman-Morrison-Woodbury formula, we have
$I_M - S = I_M - L^T(L L^T + \kappa_{>M}(\Lambda + \Delta))^{-1} L = \big(I_M + L^T(\Lambda + \Delta)^{-1} L/\kappa_{>M}\big)^{-1}$,
so that
$\|I_M - S\|_{op} \le 1\big/\lambda_{\min}\big(L^T(\Lambda + \Delta)^{-1} L/\kappa_{>M}\big)$. (65)
Note that we have
$\lambda_{\min}\big(L^T(\Lambda + \Delta)^{-1} L\big)/\kappa_{>M} = \lambda_{\min}\big(D^{1/2}\Phi^T(\Lambda + \Delta)^{-1}\Phi D^{1/2}\big)/\kappa_{>M}$
$\ge \lambda_{\min}(\Phi^T\Phi/N) \cdot \big[N \lambda_{\min}(D)/\kappa_{>M}\big] \big/ \|\Lambda + \Delta\|_{op}$
$= \lambda_{\min}(\Phi^T\Phi/N) \cdot \big[N \lambda_{\min}(\mathbb{U}_{d,\le M})/\kappa_{>M}\big] \big/ \|\Lambda + \Delta\|_{op}$. (66)
By Theorem 6.(b), we have $\lambda_{\min}(\Phi^T\Phi/N) = \Theta_{d,P}(1)$. By Assumption 6.(c), we have $\|\Lambda\|_{op} = O_{d,P}(N^{\delta})$ for any $\delta > 0$. Therefore, by Theorem 6.(a), for any $\delta > 0$,
$\|\Lambda + \Delta\|_{op} \le \|\Lambda\|_{op} + \|\Delta\|_{op} = O_{d,P}(N^{\delta})$.
By Assumption 7.(b), there exists $\delta_0 > 0$ such that
$\big[N \lambda_{\min}(\mathbb{U}_{d,\le M})/\kappa_{>M}\big] = \Omega_d(N^{\delta_0})$.
Combining the above bounds with Eq. (66) and choosing $\delta$ such that $0 < \delta < \delta_0$, we have
$\lambda_{\min}\big(L^T(\Lambda + \Delta)^{-1} L/\kappa_{>M}\big) = \omega_{d,P}(1)$.
Combining with Eq. (65) proves the lemma.
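The Sherman-Morrison-Woodbury step used at the start of this proof, $I_M - L^T(LL^T + E)^{-1}L = (I_M + L^T E^{-1} L)^{-1}$ with $E = \kappa_{>M}(\Lambda + \Delta)$ positive definite, is easy to sanity-check numerically; the sizes and the random positive definite $E$ below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 30, 5
L = rng.standard_normal((N, M))
B = rng.standard_normal((N, N))
E = B @ B.T + np.eye(N)            # stand-in for kappa_{>M} * (Lambda + Delta), strictly PD

lhs = np.eye(M) - L.T @ np.linalg.solve(E + L @ L.T, L)
rhs = np.linalg.inv(np.eye(M) + L.T @ np.linalg.solve(E, L))
print(np.max(np.abs(lhs - rhs)))   # agreement up to floating-point error
```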
A.4 Structure of the empirical kernel matrix
In this section, we present a key theorem describing the structure of the empirical kernel matrix $U = (U_d(\theta_i, \theta_j))_{i,j\in[N]} \in \mathbb{R}^{N\times N}$. The proof of this theorem relies on two propositions: Proposition 3 shows that the matrix of the top eigenfunctions evaluated at the random weights $(\theta_i)_{i\in[N]}$ is nearly orthogonal, and is presented in Section A.4.1; Proposition 4 shows the concentration to zero, in operator norm, of the off-diagonal part of the matrix $U_{>M}$, and is presented in Section A.4.2. The proof of Proposition 4 is deferred to Section A.5.

Theorem 6 (Structure of the empirical kernel matrix). Let Assumptions 6.(a), 6.(b) and 7.(a) hold. Let $(\theta_i)_{i\in[N]} \sim \tau_d$ independently, and define $U = (U_{ij})_{i,j\in[N]}$ with $U_{ij} := U_d(\theta_i, \theta_j)$. (Recall that $U_d(\theta_i, \theta_j) \equiv E_{x\sim\nu_d}[\sigma_d(x, \theta_i)\sigma_d(x, \theta_j)]$.) Then we can rewrite $U$ (by choosing $\Delta \in \mathbb{R}^{N\times N}$) as
$U = \Phi D \Phi^T + \kappa_{>M}(\Lambda + \Delta)$, (67)
with $\kappa_{>M} = \mathrm{Tr}(\mathbb{U}_{d,>M})$ and
$\Phi = (\phi_k(\theta_i))_{i\in[N],k\in[M]}$, $D = \mathrm{diag}(\lambda_{d,1}, \ldots, \lambda_{d,M})$, $\Lambda = \mathrm{diag}\big((U_{d,>M}(\theta_i, \theta_i)/\kappa_{>M})_{i\in[N]}\big)$.
The following hold:
(a) There exists a fixed $\delta' > 0$ such that $E[\|\Delta\|_{op}] = O_d(N^{-\delta'})$.
(b) If we further assume $M(d) \le N(d)^{1-\delta_0}$ for a fixed $\delta_0 > 0$, then we have
$\big\|\Phi^T\Phi/N - I_M\big\|_{op} = o_{d,P}(1)$.

Proof of Theorem 6.
For $S \subseteq \{1, 2, 3, \ldots\}$, recall that
$\mathbb{U}_{d,S} \equiv \sum_{s\in S} \lambda_{d,s}\, \phi_s \phi_s^*$,
and that $U_{d,S}$ is the kernel associated to $\mathbb{U}_{d,S}$. Define $Q_S = (Q_{S,ij})_{i,j\in[N(d)]}$ by
$Q_{S,ij} = U_{d,S}(\theta_i, \theta_j)\, 1_{i\ne j}$.
By decomposing the entries of $U$ in the orthonormal basis $\{\phi_j\}_{j\ge 1}$, we can write $U = U_{\le M} + U_{>M}$, where
$U_{\le M} = \Phi D \Phi^T$, $U_{>M} = (U_{d,>M}(\theta_i, \theta_j))_{i,j\in[N]}$.
We begin with part (b). By Assumption 6.(a) and $M(d) \le N(d)^{1-\delta_0}$, the assumptions of Proposition 3 are satisfied with $D = M$ and $(\phi_1, \ldots, \phi_M)$ the top $M$ eigenfunctions of $\mathbb{U}_d$. Hence, for any $q$ there exists $C = C(q) > 0$ such that
$E\big[\big\|\Phi^T\Phi/N - I_M\big\|_{op}\big] \le C\big(M \log(N)\, N^{-1+1/q}\big)^{1/2}$.
Taking $q > 1/\delta_0$, the right-hand side becomes $o_d(1)$, and Theorem 6.(b) follows by Markov's inequality.
Next, we prove part (a), namely that $U_{>M} = \kappa_{>M}\cdot(\Lambda + \Delta)$ with $\|\Delta\|_{op} = O_{d,P}(N^{-\delta'})$ for some $\delta' > 0$. Let $Q \in \mathbb{R}^{N\times N}$ be the matrix with entries
$Q_{ij} = (U_{>M})_{ij}\, 1_{i\ne j}$.
Then we have $U_{>M} = \kappa_{>M}\Lambda + Q$. We next apply Proposition 4 to the operator $\widehat{\mathbb{U}}_d = \mathbb{U}_{d,>M}$ and subspace $\widehat{\mathcal{V}}_d = \mathcal{V}_{d,>M}$. Notice that the assumptions of Proposition 4 are satisfied by Assumptions 6.(a), 6.(b) and 7.(a). We therefore conclude that
$E[\|Q\|_{op}] = O_d(N^{-\delta'}) \cdot \mathrm{Tr}(\mathbb{U}_{d,>M}) = O_d(N^{-\delta'}) \cdot \kappa_{>M}$
for some $\delta' > 0$. This concludes the proof of Theorem 6.(a).
This theorem implies a particularly simple structure for the empirical kernel matrix $U$. Under the additional Assumptions 6.(c), 6.(d) and 7.(b), $U$ can be written as the sum of a 'spike' $U_{\le M}$ (of rank $M$, with eigenvalues $\gg \mathrm{Tr}(\mathbb{U}_{d,>M})$) and a full rank matrix $U_{>M}$ with eigenvalues of order $\mathrm{Tr}(\mathbb{U}_{d,>M})$. The 'spike' matrix $U_{\le M}$ has the following approximate diagonalization:
$U_{\le M} = \tilde\Phi \tilde D \tilde\Phi^T$,
where $\tilde\Phi = \Phi/\sqrt{N} \in \mathbb{R}^{N\times M}$ is approximately an orthogonal matrix, $\|\tilde\Phi^T\tilde\Phi - I_M\|_{op} = o_{d,P}(1)$, and the diagonal matrix $\tilde D = \mathrm{diag}(N\lambda_{d,1}, \ldots, N\lambda_{d,M})$ verifies $\tilde D \succeq N(d)^{\delta_0}\, \mathrm{Tr}(\mathbb{U}_{d,>M}) \cdot I_M$ (by Assumption 7.(b)). Furthermore, by Assumptions 6.(c) and 6.(d), and Theorem 6.(a), we have, for any $\delta > 0$,
$\Omega_{d,P}(N^{-\delta}) \cdot \mathrm{Tr}(\mathbb{U}_{d,>M}) \cdot I_N \preceq U_{>M} \preceq O_{d,P}(N^{\delta}) \cdot \mathrm{Tr}(\mathbb{U}_{d,>M}) \cdot I_N$.
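The 'spike plus bulk' picture just described can be visualized on a synthetic example (the spectrum, cosine basis and sizes below are placeholder choices): the top $M$ eigenvalues of $U$ scale like $N\lambda_k$, while the bulk averages to about $\mathrm{Tr}(\mathbb{U}_{d,>M})$.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, K = 200, 3, 20000
k = np.arange(1, K + 1)
lam = np.where(k <= M, 1.0, 0.002 * k ** (-0.5))   # spiked spectrum (placeholder)

theta = rng.uniform(0, 2 * np.pi, N)
Phi = np.sqrt(2.0) * np.cos(np.outer(theta, k))
U = Phi @ (lam[:, None] * Phi.T)                   # empirical feature kernel matrix

kappa = lam[M:].sum()                              # Tr(U_{d,>M})
ev = np.linalg.eigvalsh(U)[::-1]
print("top eigenvalues / N     :", ev[:M] / N)     # should be close to the spike values
print("bulk mean vs Tr(U_>M)   :", ev[M:].mean(), kappa)
```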
Here we state and prove a general matrix concentration result. For each d ≥
1, let $(\Omega_d, \tau_d)$ be a (Polish) probability space, and $(\phi_k)_{k\ge 1}$ an orthonormal basis of $L^2(\Omega_d, \tau_d)$. Define $\phi(\theta) \equiv (\phi_1(\theta), \ldots, \phi_D(\theta))^T \in \mathbb{R}^D$, and let $(\theta_i)_{i\le N} \sim_{iid} \tau_d$. The law of large numbers and orthonormality imply that, for any fixed $D$,
$\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \phi(\theta_i)\phi(\theta_i)^T = \int_{\Omega_d} \phi(\theta)\phi(\theta)^T\, \tau_d(d\theta) = I_D$. (68)
The next proposition establishes a generalization of this fact to the case in which both $D$ and $N$ diverge.

Proposition 3. Let $\{\phi_k \in L^2(\Omega, \tau)\}_{k=1}^{D}$ be orthonormal functions. Let $\{\theta_i\}_{i\in[N]} \sim \tau$ independently. Define $\phi_i = \phi(\theta_i) = (\phi_1(\theta_i), \ldots, \phi_D(\theta_i))^T \in \mathbb{R}^D$ for $i \in [N]$. We assume that, for any integer $q \ge 1$, there exists $C = C(q)$ such that
$\sup_{k\in[D]} \|\phi_k\|_{L^q} \le C(q)$. (69)
Then for any $q \ge 1$, there exists $K = K(q)$, depending only on $C(q)$, such that, denoting $\delta \equiv K(q)\, D \log(D \vee N)/N^{1-1/q}$, we have
$E\Big\|\frac{1}{N}\sum_{i=1}^{N} \phi_i\phi_i^T - I_D\Big\|_{op} \le \delta \vee \sqrt{\delta}$.

Proof of Proposition 3.
By the hypercontractivity assumption, cf. Eq. (69), we have
$\Gamma := E\Big[\max_{i\in[N]} \|\phi_i\|_2^2\Big] \le E\Big[\max_{i\in[N]} \|\phi_i\|_2^{2q}\Big]^{1/q} \le N^{1/q} \cdot E\big[\|\phi_1\|_2^{2q}\big]^{1/q} = N^{1/q} \cdot \Big\|\sum_{k=1}^{D} \phi_k^2\Big\|_{L^q} \le N^{1/q} D \cdot \max_{k\in[D]} \|\phi_k^2\|_{L^q} \le C(2q)^2 \cdot N^{1/q} D$.
Applying Lemma 2 below proves the proposition.
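As an illustration of Proposition 3, the following sketch estimates $\|(1/N)\sum_i \phi_i\phi_i^T - I_D\|_{op}$ for the bounded (hence hypercontractive) cosine basis on the circle, an assumed example: the deviation decays as $N$ grows with $D$ fixed.

```python
import numpy as np

rng = np.random.default_rng(6)
D = 50
k = np.arange(1, D + 1)
for N in [200, 2000, 20000]:
    theta = rng.uniform(0, 2 * np.pi, N)
    Phi = np.sqrt(2.0) * np.cos(np.outer(theta, k))   # orthonormal, uniformly bounded
    dev = np.linalg.norm(Phi.T @ Phi / N - np.eye(D), 2)
    print(N, dev)                                     # shrinks roughly like sqrt(D log N / N)
```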
Lemma 2 ([Ver10], Theorem 5.45). Let $\{a_i \in \mathbb{R}^D\}_{i\in[N]}$ be independent random vectors with $E[a_i a_i^T] = I_D$. Denote $\Gamma \equiv E[\max_{i\in[N]} \|a_i\|_2^2]$. Then there exists a universal constant $C$ such that, denoting $\delta \equiv C \cdot \Gamma \cdot \log(N \wedge D)/N$, we have
$E\Big[\Big\|\frac{1}{N}\sum_{i=1}^{N} a_i a_i^T - I_D\Big\|_{op}\Big] \le \delta \vee \sqrt{\delta}$.

A.4.2 Bounding the off-diagonal part of the matrix $U_{>M}$

We state a key proposition whose proof will be presented in Section A.5. The statement and the assumptions are self-contained.
Proposition 4 (Bound on the off-diagonal part of the matrix $U_{>M}$). Let $(\theta_i)_{i\in[N(d)]} \sim_{iid} \tau_d$. Let $\widehat{\mathbb{U}}_d$ be a self-adjoint positive definite operator $\widehat{\mathbb{U}}_d : \widehat{\mathcal{V}}_d \to \widehat{\mathcal{V}}_d$, $\widehat{\mathcal{V}}_d \subseteq L^2(\Omega_d)$, with kernel $\widehat U_d \in L^2(\Omega_d \times \Omega_d)$ (see Eq. (11)) satisfying $\int_{\Omega_d} \widehat U_d(\theta, \theta') f(\theta')\, \tau_d(d\theta') = 0$ for any $f \in \widehat{\mathcal{V}}_d^{\perp}$. Let $(\hat\phi_j)_{j\ge 1}$ be an orthonormal basis of eigenfunctions with $\mathrm{span}(\hat\phi_j, j \ge 1) = \widehat{\mathcal{V}}_d \subseteq L^2(\Omega_d)$, and eigenvalues $(\hat\lambda_{d,j})_{j\ge 1} \subseteq \mathbb{R}$ with nonincreasing absolute values $|\hat\lambda_{d,1}| \ge |\hat\lambda_{d,2}| \ge \cdots$ and $\sum_{j\ge 1} \hat\lambda_{d,j} < \infty$, such that
$\widehat{\mathbb{U}}_d = \sum_{j=1}^{\infty} \hat\lambda_{d,j}\, \hat\phi_j\hat\phi_j^*$, $\widehat U_d(\theta, \theta') = \sum_{j=1}^{\infty} \hat\lambda_{d,j}\, \hat\phi_j(\theta)\hat\phi_j(\theta')$.
For $S \subseteq \{1, 2, 3, \ldots\}$, we denote
$\widehat{\mathbb{U}}_{d,S} = \sum_{j\in S} \hat\lambda_{d,j}\, \hat\phi_j\hat\phi_j^*$, $\widehat U_{d,S}(\theta, \theta') = \sum_{j\in S} \hat\lambda_{d,j}\, \hat\phi_j(\theta)\hat\phi_j(\theta')$.
We make the following assumptions:
(A1) There exists a sequence $\{v(d)\}_{d\ge 1}$ such that, for any fixed $q \ge 1$, there exists $C = C(q, \{v(d)\}_{d\ge 1})$ such that, for any $f_d \in \widehat{\mathcal{V}}_{d,\le v(d)} \equiv \mathrm{span}(\hat\phi_s, 1 \le s \le v(d))$, we have $\|f_d\|_{L^q} \le C \cdot \|f_d\|_{L^2}$.
(A2) For the same sequence $\{v(d)\}_{d\ge 1}$ as in (A1), there exists fixed $\delta_0 > 0$ such that
$N(d)^{2+\delta_0}\, \mathrm{Tr}\big(\widehat{\mathbb{U}}_{d,>v(d)}^2\big) = O_d(1) \cdot \mathrm{Tr}\big(\widehat{\mathbb{U}}_{d,>v(d)}\big)^2$.
(A3) There exists $\delta_0 > 0$ such that
$N(d)^{1+\delta_0} \cdot \|\widehat{\mathbb{U}}_d\|_{op} = O_d(1) \cdot \mathrm{Tr}(\widehat{\mathbb{U}}_d)$. (70)
Consider the random matrix $Q = (Q_{ij})_{i,j\in[N(d)]} \in \mathbb{R}^{N\times N}$, with
$Q_{ij} = \widehat U_d(\theta_i, \theta_j)\, 1_{i\ne j}$.
Then there exists $\delta' > 0$ such that
$E[\|Q\|_{op}] = O_d(N^{-\delta'}) \cdot \mathrm{Tr}(\widehat{\mathbb{U}}_d)$.
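A quick illustration of Proposition 4 with an assumed flat spectrum (so that the effective rank $\mathrm{Tr}(\widehat{\mathbb{U}})^2/\mathrm{Tr}(\widehat{\mathbb{U}}^2)$ is as large as possible for a fixed trace): the operator norm of the off-diagonal part becomes negligible compared to $\mathrm{Tr}(\widehat{\mathbb{U}}_d) = 1$ as the effective rank grows well past $N^2$. All parameters are placeholders.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100
theta = rng.uniform(0, 2 * np.pi, N)
for K in [1000, 10000, 50000]:
    k = np.arange(1, K + 1)
    lam = np.full(K, 1.0 / K)                  # flat spectrum: effective rank = K, trace = 1
    Phi = np.sqrt(2.0) * np.cos(np.outer(theta, k))
    Uhat = Phi @ (lam[:, None] * Phi.T)        # (U_hat(theta_i, theta_j))_{ij}
    Q = Uhat - np.diag(np.diag(Uhat))          # off-diagonal part
    print(K, np.linalg.norm(Q, 2))             # compare to Tr(U_hat) = 1
```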
A.5 Proof of Proposition 4

We begin by stating two key estimates that are used in the proof of Proposition 4. The notation of Lemma 3 follows the notation of Proposition 4. The notation and assumptions of Proposition 5 are self-contained. We collect a number of technical lemmas in Section A.5.1.
Lemma 3.
Consider the same setup as in Proposition 4. Let $\{N(d)\}_{d\ge 1}$ and $\{v(d)\}_{d\ge 1}$ be two sequences, and assume that there exists $\delta > 0$ such that (this follows from Assumption (A2) in Proposition 4)
$N(d)^2\, \mathrm{Tr}\big(\widehat{\mathbb{U}}_{d,>v(d)}^2\big) = O_d(N^{-\delta}) \cdot \mathrm{Tr}\big(\widehat{\mathbb{U}}_{d,>v(d)}\big)^2$. (71)
Consider the random matrix $Q_{>v(d)} = (Q_{>v(d),ij})_{i,j\in[N(d)]} \in \mathbb{R}^{N\times N}$, with
$Q_{>v(d),ij} = \widehat U_{d,>v(d)}(\theta_i, \theta_j)\, 1_{i\ne j}$.
Then we have
$E\big[\|Q_{>v(d)}\|_{op}^2\big]^{1/2} = O_d(N^{-\delta/2}) \cdot \mathrm{Tr}\big(\widehat{\mathbb{U}}_{d,>v(d)}\big)$.

Proposition 5 (Vanishing off-diagonal). Let $\mathbb{U}$ be a compact self-adjoint positive definite operator on a closed subspace $\mathcal{V} \subseteq L^2(\Omega, \tau)$, $\mathbb{U} : \mathcal{V} \to \mathcal{V}$, with corresponding kernel $U \in L^2(\Omega \times \Omega)$, satisfying $\int_{\Omega} U(\theta, \theta') f(\theta')\, \tau(d\theta') = 0$ for all $f \in \mathcal{V}^{\perp}$. For any $q \ge 1$, we assume that there exists $C(q)$ such that
$E_{\theta_1,\theta_2\sim\tau}\big[|U(\theta_1, \theta_2)|^{2q}\big]^{1/(2q)} \le C(q) \cdot E_{\theta_1,\theta_2\sim\tau}\big[U(\theta_1, \theta_2)^2\big]^{1/2}$,
$E_{\theta\sim\tau}\big[|U(\theta, \theta)|^q\big]^{1/q} \le C(q) \cdot E_{\theta\sim\tau}\big[U(\theta, \theta)\big]$. (72)
Moreover, let $\{\theta_i\}_{i\in[N]} \sim_{iid} \tau$ independently, and consider $\Delta = (\Delta_{ij})_{i,j\in[N]} \in \mathbb{R}^{N\times N}$, with
$\Delta_{ij} = U(\theta_i, \theta_j)\, 1_{i\ne j}$.
Then for any integer $p > 1$, there exists a constant $K(p)$, which only depends on the constant $C(p)$, such that
$E[\|\Delta\|_{op}] \le K(p) \cdot \Big\{ N\|\mathbb{U}\|_{op} + \big[\|\mathbb{U}\|_{op}\, \mathrm{Tr}(\mathbb{U})\, N^{1+2/p} \log N\big]^{1/2} \Big\}$. (73)

We are now in position to prove Proposition 4.

Proof of Proposition 4.
We decompose the operator $\widehat{\mathbb{U}}_d = \widehat{\mathbb{U}}_{d,\le v(d)} + \widehat{\mathbb{U}}_{d,>v(d)}$ and the kernel $\widehat U_d = \widehat U_{d,\le v(d)} + \widehat U_{d,>v(d)}$. Define $Q_S = (Q_{S,ij})_{i,j\in[N(d)]}$ with $Q_{S,ij} = \widehat U_{d,S}(\theta_i, \theta_j)\, 1_{i\ne j}$.
By Assumption (A2) and Lemma 3, we have
$E\big[\|Q_{>v(d)}\|_{op}^2\big]^{1/2} = O_d(N^{-\delta_0/2}) \cdot \mathrm{Tr}\big(\widehat{\mathbb{U}}_{d,>v(d)}\big)$.
By Assumption (A1) and Lemma 6, which is stated in Section A.5.1 below, the assumptions of Proposition 5 are satisfied with $\Delta = Q_{\le v(d)}$, $\mathbb{U} = \widehat{\mathbb{U}}_{d,\le v(d)}$, $U = \widehat U_{d,\le v(d)}$, and $\mathcal{V} \equiv \mathrm{span}(\hat\phi_s : 1 \le s \le v(d))$. Further, by Assumption (A3) as in Eq. (70), fixing some $p > 4/\delta_0$ in Proposition 5, we get, for $\delta' = \delta_0/4 > 0$,
$E[\|Q_{\le v(d)}\|_{op}] = O_d(N^{-\delta'}) \cdot \mathrm{Tr}\big(\widehat{\mathbb{U}}_{d,\le v(d)}\big)$.
Combining the bounds in the last two displays proves the proposition.

We next prove Lemma 3 and Proposition 5.
Proof of Lemma 3.
We have
$E\big[\|Q_{>v(d)}\|_{op}^2\big] \le E\big[\|Q_{>v(d)}\|_F^2\big] = N(N-1) \cdot E\big[Q_{>v(d),12}^2\big] = N(N-1) \cdot \mathrm{Tr}\big(\widehat{\mathbb{U}}_{d,>v(d)}^2\big) = O_d(N^{-\delta}) \cdot \mathrm{Tr}\big(\widehat{\mathbb{U}}_{d,>v(d)}\big)^2$,
where the third equality uses $E[\widehat U_{d,>v(d)}(\theta_1, \theta_2)^2] = \mathrm{Tr}(\widehat{\mathbb{U}}_{d,>v(d)}^2)$, which follows from the orthonormality of the eigenfunctions, and the last equality is by Eq. (71). This proves the lemma.

Proof of Proposition 5.
With a slight abuse of notation, we define $U = (U(\theta_i, \theta_j))_{i,j\in[N]} \in \mathbb{R}^{N\times N}$.

Step 1. Bound $E[\|\Delta\|_{op}]$ using matrix decoupling. For $T_1, T_2 \subseteq [N]$, we denote $A_{T_1,T_2} = (A_{ij})_{i\in T_1, j\in T_2}$. By Lemma 4, which is stated in Section A.5.1 below, we have
$E[\|\Delta\|_{op}] \le 4\sup_{T\subseteq[N]} E[\|\Delta_{TT^c}\|_{op}]$. (74)

Step 2. Bound $E_T[\|\Delta_{TT^c}\|_{op}]$. For any $S \subseteq [N]$, we denote by $E_S$ the expectation with respect to $\{\theta_i\}_{i\in S}$, conditional on $\{\theta_j\}_{j\in S^c}$. Fix $T \subseteq [N]$. Using Lemma 5 (stated in Section A.5.1 below), conditionally on $\{\theta_j\}_{j\in T^c}$, we have
$E_T[\|\Delta_{TT^c}\|_{op}] \le [\Sigma(T) \cdot N]^{1/2} + C \cdot (\Gamma(T) \cdot \log N)^{1/2}$,
where $\Sigma(T) \equiv \|E_{\theta_u}[\Delta_{T^c u}\Delta_{u T^c}]\|_{op}$ (for some $u \in T$) and $\Gamma(T) \equiv E_T[\max_{i\in T} \|\Delta_{iT^c}\|_2^2]$. Therefore, by Hölder's inequality, we have
$E[\|\Delta\|_{op}] \le 4\sup_{T\subseteq[N]} E[\|\Delta_{TT^c}\|_{op}] = 4\sup_{T\subseteq[N]} E_{T^c} E_T[\|\Delta_{TT^c}\|_{op}] \le 4\sup_{T\subseteq[N]} \Big\{ [E_{T^c}[\Sigma(T)] \cdot N]^{1/2} + C \cdot (E_{T^c}[\Gamma(T)] \cdot \log N)^{1/2} \Big\}$. (75)

Step 3. Bound $E_{T^c}[\Sigma(T)]$. By the compactness of the operator $\mathbb{U}|_{\mathcal{V}}$, there exist an orthonormal basis $\{\phi_k\}_{k\ge 1}$ and real numbers $\{\lambda_k\}_{k\ge 1}$ such that $U(\theta_i, \theta_j) = \sum_k \lambda_k \phi_k(\theta_i)\phi_k(\theta_j)$. Therefore, we have
$\Sigma(T) = \|E_{\theta_u}[\Delta_{T^c u}\Delta_{u T^c}]\|_{op} = \sup_{\|z\|_2=1} \sum_{i,j\in T^c} \sum_k \lambda_k^2 \phi_k(\theta_i)\phi_k(\theta_j) z_i z_j \le \|\mathbb{U}\|_{op} \cdot \sup_{\|z\|_2=1} \sum_{i,j\in T^c} \sum_k \lambda_k \phi_k(\theta_i)\phi_k(\theta_j) z_i z_j = \|\mathbb{U}\|_{op} \cdot \|(U_{ij})_{i,j\in T^c}\|_{op} \le \|\mathbb{U}\|_{op} \cdot [\|\mathrm{ddiag}(U)\|_{op} + \|\Delta\|_{op}]$.
Moreover,
$E[\|\mathrm{ddiag}(U)\|_{op}] \le E\Big[\sum_{i=1}^{N} U_{ii}^p\Big]^{1/p} \le N^{1/p} \cdot E[U_{11}^p]^{1/p} \le C(p) N^{1/p} \cdot E[U_{11}] = C(p) N^{1/p} \cdot \mathrm{Tr}(\mathbb{U})$.
This gives
$E_{T^c}[\Sigma(T)] \le C(p) N^{1/p} \cdot \|\mathbb{U}\|_{op} \mathrm{Tr}(\mathbb{U}) + \|\mathbb{U}\|_{op}\, E[\|\Delta\|_{op}]$. (76)

Step 4. Bound $E_{T^c}[\Gamma(T)]$. By the hypercontractivity assumption, cf. Eq. (72), we have
$E_{T^c}[\Gamma(T)] \equiv E\Big[\max_{i\in T} \|\Delta_{iT^c}\|_2^2\Big] \le N \cdot E\Big[\max_{i\in T, j\in T^c} \Delta_{ij}^2\Big] \le N \cdot E\Big[\sum_{i\in T, j\in T^c} \Delta_{ij}^{2p}\Big]^{1/p} \le N^{1+2/p} \cdot E[\Delta_{12}^{2p}]^{1/p} = C(p)^2 N^{1+2/p} \cdot E[\Delta_{12}^2] \le C(p)^2 N^{1+2/p} \cdot \|\mathbb{U}\|_{op} \mathrm{Tr}(\mathbb{U})$. (77)
The last inequality holds since $E[\Delta_{12}^2] = E\big\{[\sum_k \lambda_k \phi_k(\theta_1)\phi_k(\theta_2)]^2\big\} = \sum_k \lambda_k^2 \le \|\mathbb{U}\|_{op} \mathrm{Tr}(\mathbb{U})$.

Step 5. Combining the equations.
Combining Eqs. (75), (76), and (77), we have
$E[\|\Delta\|_{op}] \le 4\sup_{T\subseteq[N]} \Big\{ [E_{T^c}[\Sigma(T)] \cdot N]^{1/2} + C \cdot (E_{T^c}[\Gamma(T)] \cdot \log N)^{1/2} \Big\} \le K(p) \Big\{ \big\{\|\mathbb{U}\|_{op} \mathrm{Tr}(\mathbb{U}) N^{1+2/p} \log N\big\}^{1/2} + \big\{N \|\mathbb{U}\|_{op}\, E[\|\Delta\|_{op}]\big\}^{1/2} \Big\}$.
Denote $\varepsilon_1 = K(p)(N\|\mathbb{U}\|_{op})^{1/2} \ge 0$, $\varepsilon_2 = K(p)\{\|\mathbb{U}\|_{op} \mathrm{Tr}(\mathbb{U}) N^{1+2/p} \log N\}^{1/2} \ge 0$, and $x = E[\|\Delta\|_{op}]^{1/2}$. The above inequality implies $x^2 - \varepsilon_1 x - \varepsilon_2 \le 0$, which gives
$x \le \big[\varepsilon_1 + (\varepsilon_1^2 + 4\varepsilon_2)^{1/2}\big]/2$, so that $E[\|\Delta\|_{op}] = x^2 \le \varepsilon_1^2 + 2\varepsilon_2$,
which is Eq. (73). This concludes the proof.

A.5.1 Auxiliary lemmas
The following standard decoupling trick follows, for instance, from [Ver10], Lemma 5.60.
Lemma 4 (Matrix decoupling). Let $A \in \mathbb{R}^{N\times N}$ be a real symmetric random matrix. For $T_1, T_2 \subseteq \{1, 2, \ldots, N\}$, we denote $A_{T_1,T_2} = (A_{ij})_{i\in T_1, j\in T_2}$. Then we have
$E[\|A - \mathrm{ddiag}(A)\|_{op}] \le 4\sup_{T\subseteq[N]} E[\|A_{T,T^c}\|_{op}]$.

Proof of Lemma 4.
Let $T$ be a random subset of $\{1, 2, \ldots, N\}$, with each element selected independently with probability $1/2$. For $x \in S^{N-1}$, we have
$\langle x, [A - \mathrm{ddiag}(A)] x\rangle = 4 E_T\Big[\sum_{i\in T, j\in T^c} A_{ij} x_i x_j\Big]$.
By Jensen's inequality, we have
$E[\|A - \mathrm{ddiag}(A)\|_{op}] = E_A\Big[\sup_{x\in S^{N-1}} \langle x, [A - \mathrm{ddiag}(A)] x\rangle\Big] \le 4 E_T E_A\Big[\sup_{x\in S^{N-1}} \sum_{i\in T, j\in T^c} A_{ij} x_i x_j\Big] \le 4\sup_{T\subseteq[N]} E[\|A_{TT^c}\|_{op}]$.
This completes the proof.

Lemma 5 ([Ver10], Theorem 5.48). Let $A \in \mathbb{R}^{N\times n}$ with $A^T = [a_1, \ldots, a_N]$, where the $a_i$ are independent random vectors in $\mathbb{R}^n$ with common second moment matrix $\Sigma = E[a_i a_i^T]$. Let $\Gamma \equiv E[\max_{i\in[N]} \|a_i\|_2^2]$. Then there exists a universal constant $C$ such that
$E[\|A\|_{op}^2]^{1/2} \le (\|\Sigma\|_{op} \cdot N)^{1/2} + C \cdot (\Gamma \cdot \log(N \wedge n))^{1/2}$.

Lemma 6.
Let $\{\phi_k\}_{1\le k\le Z} \subseteq L^2(\Omega, \tau)$ be a set of orthonormal functions. We assume that, for any fixed $q \ge 1$, there exists $C = C(q)$ such that, for any $f \in \mathrm{span}\{\phi_k : 1 \le k \le Z\}$, we have $\|f\|_{L^q} \le C(q) \cdot \|f\|_{L^2}$. For $\theta, \theta' \in \Omega$, we denote $U(\theta, \theta') = \sum_{k=1}^{Z} \lambda_k \phi_k(\theta)\phi_k(\theta')$, where $\{\lambda_k\}_{1\le k\le Z}$ are fixed nonnegative real numbers. Then for any $q \ge 1$, we have
$E_{\theta_1,\theta_2\sim\tau}\big[U(\theta_1, \theta_2)^{2q}\big]^{1/(2q)} \le C(2q)^2 \cdot E_{\theta_1,\theta_2\sim\tau}\big[U(\theta_1, \theta_2)^2\big]^{1/2}$, (78)
$E_{\theta\sim\tau}\big[U(\theta, \theta)^q\big]^{1/q} \le C(2q)^2 \cdot E_{\theta\sim\tau}\big[U(\theta, \theta)\big]$. (79)

Proof of Lemma 6.
For any $q \ge 1$, we have
$E_{\theta_1,\theta_2\sim\tau}\big[U(\theta_1, \theta_2)^{2q}\big] = E_{\theta_1\sim\tau}\Big\{E_{\theta_2\sim\tau}\Big\{\Big[\sum_{k=1}^{Z} \lambda_k \phi_k(\theta_1)\phi_k(\theta_2)\Big]^{2q} \Big| \theta_1\Big\}\Big\}$
$\overset{(a)}{\le} C(2q)^{2q} \cdot E_{\theta_1\sim\tau}\Big\{E_{\theta_2\sim\tau}\Big\{\Big[\sum_{k=1}^{Z} \lambda_k \phi_k(\theta_1)\phi_k(\theta_2)\Big]^2 \Big| \theta_1\Big\}^q\Big\}$
$\overset{(b)}{=} C(2q)^{2q} \cdot E_{\theta_1\sim\tau}\Big\{\Big[\sum_{k=1}^{Z} \lambda_k^2 \phi_k(\theta_1)^2\Big]^q\Big\}$
$\overset{(c)}{\le} C(2q)^{2q} \cdot \Big\{\sum_{k=1}^{Z} \lambda_k^2 \cdot E_{\theta_1\sim\tau}[\phi_k(\theta_1)^{2q}]^{1/q}\Big\}^q$
$\overset{(d)}{\le} C(2q)^{2q} \cdot \Big\{C(2q)^2 \sum_{k=1}^{Z} \lambda_k^2 \cdot E_{\theta_1\sim\tau}[\phi_k(\theta_1)^2]\Big\}^q$
$\overset{(e)}{=} C(2q)^{4q} \Big[\sum_{k=1}^{Z} \lambda_k^2\Big]^q$
$\overset{(f)}{=} C(2q)^{4q} \cdot \big\{E_{\theta_1,\theta_2\sim\tau}[U(\theta_1, \theta_2)^2]\big\}^q$.
Here, inequality (a) follows by applying the hypercontractivity inequality to $f(\theta_2) = \sum_{k=1}^{Z} \lambda_k \phi_k(\theta_1)\phi_k(\theta_2)$ (conditionally on $\theta_1$); equality (b) holds because the $(\phi_k)_{1\le k\le Z}$ are orthonormal functions; inequality (c) is Minkowski's inequality; inequality (d) follows by applying the hypercontractivity inequality to $f(\theta_1) = \phi_k(\theta_1)$; equality (e) holds because the $(\phi_k)_{1\le k\le Z}$ are orthonormal functions; and equality (f) follows by a simple calculation. This proves Eq. (78).
For any $q \ge 1$, we have
$E_{\theta\sim\tau}\big[U(\theta, \theta)^q\big] = E_{\theta\sim\tau}\Big[\Big(\sum_{k=1}^{Z} \lambda_k \phi_k(\theta)^2\Big)^q\Big] \overset{(a)}{\le} \Big[\sum_{k=1}^{Z} \lambda_k \cdot E_{\theta\sim\tau}[\phi_k(\theta)^{2q}]^{1/q}\Big]^q \overset{(b)}{\le} C(2q)^{2q} \Big[\sum_{k=1}^{Z} \lambda_k \cdot E_{\theta\sim\tau}[\phi_k(\theta)^2]\Big]^q \overset{(c)}{=} C(2q)^{2q} \Big[\sum_{k=1}^{Z} \lambda_k\Big]^q \overset{(d)}{=} C(2q)^{2q} \big\{E_{\theta\sim\tau}[U(\theta, \theta)]\big\}^q$.
Here, inequality (a) holds by Minkowski's inequality; inequality (b) follows by applying the hypercontractivity inequality to $f(\theta) = \phi_k(\theta)$; equality (c) holds because the $(\phi_k)_{1\le k\le Z}$ are orthonormal functions; and equality (d) follows by a simple calculation. This proves Eq. (79).

Lemma 7 (Bound on the maximum of the diagonal). Consider a sequence of probability spaces $(\Omega_d, \tau_d)$, with $\{\phi_{d,k}\}_{k\ge 1}$ an orthonormal basis of functions for $\mathcal{D}_d \subseteq L^2(\Omega_d, \tau_d)$. Assume that there exists a sequence of integers $\{u(d)\}_{d\ge 1}$ such that the subspace $\mathcal{D}_{d,\le u(d)} = \mathrm{span}(\phi_{d,k} : 1 \le k \le u(d))$ is hypercontractive, i.e., for any fixed $k \ge 1$ there exists a constant $C$ such that, for any $g \in \mathcal{D}_{d,\le u(d)}$, we have $\|g\|_{L^k(\Omega_d)} \le C \cdot \|g\|_{L^2(\Omega_d)}$. Let $\{U_d\}_{d\ge 1}$ be a sequence of positive definite kernels $U_d \in L^2(\Omega_d \times \Omega_d)$ with
$U_d(\theta_1, \theta_2) = \sum_{j=1}^{\infty} \lambda_{d,j}\, \phi_{d,j}(\theta_1)\phi_{d,j}(\theta_2)$.
Denote by $U_{d,>\ell}$ the kernel obtained by setting $\lambda_{d,1} = \ldots = \lambda_{d,\ell} = 0$. Letting $(\theta_i)_{i\in[N(d)]} \sim_{iid} \tau_d$, if we assume that, for any $\delta > 0$,
$\max_{i\in[N(d)]} U_{d,>u(d)}(\theta_i, \theta_i) = O_{d,P}(N(d)^{\delta}) \cdot E_{\theta\sim\tau_d}[U_{d,>u(d)}(\theta, \theta)]$, (80)
then, for any $\delta > 0$,
$\max_{i\in[N(d)]} U_d(\theta_i, \theta_i) = O_{d,P}(N(d)^{\delta}) \cdot E_{\theta\sim\tau_d}[U_d(\theta, \theta)]$. (81)
Furthermore, if we assume that, for any $\delta > 0$,
$\max_{i\in[N(d)]} E_{\theta\sim\tau_d}[U_{d,>u(d)}(\theta_i, \theta)^2] = O_{d,P}(N(d)^{\delta}) \cdot E_{\theta_1,\theta_2\sim\tau_d}[U_{d,>u(d)}(\theta_1, \theta_2)^2]$, (82)
then, for any $\delta > 0$,
$\max_{i\in[N(d)]} E_{\theta\sim\tau_d}[U_d(\theta_i, \theta)^2] = O_{d,P}(N(d)^{\delta}) \cdot E_{\theta_1,\theta_2\sim\tau_d}[U_d(\theta_1, \theta_2)^2]$. (83)

Proof of Lemma 7.
Let us decompose $U_d$ into low and high degree parts, $U_d = U_{d,\le u} + U_{d,>u}$, where
$U_{d,\le u}(\theta_1, \theta_2) = \sum_{k=1}^{u} \lambda_{d,k}\, \phi_{d,k}(\theta_1)\phi_{d,k}(\theta_2)$, $U_{d,>u}(\theta_1, \theta_2) = \sum_{k=u+1}^{\infty} \lambda_{d,k}\, \phi_{d,k}(\theta_1)\phi_{d,k}(\theta_2)$.
By Lemma 6, we have, for any $q \ge 1$,
$E\Big[\max_{i\in[N(d)]} U_{d,\le u}(\theta_i, \theta_i)\Big] \le E\Big[\max_{i\in[N(d)]} U_{d,\le u}(\theta_i, \theta_i)^q\Big]^{1/q} \le N^{1/q}\, E\big[U_{d,\le u}(\theta_1, \theta_1)^q\big]^{1/q} \le C(q) N^{1/q}\, E\big[U_{d,\le u}(\theta_1, \theta_1)\big]$.
Hence, by Markov's inequality and condition (80), taking $q$ sufficiently large, we get for any $\delta > 0$
$\max_{i\in[N(d)]} U_d(\theta_i, \theta_i) = O_{d,P}(N(d)^{\delta}) \cdot E_{\theta\sim\tau_d}[U_d(\theta, \theta)]$.
The proof of Eq. (83) follows from a similar argument.
B Generalization error of random features model: Proof of Theorem 1
In this section, we prove Theorem 1. The proof in the overparametrized regime is presented in Section B.1. The proof in the underparametrized regime follows from a very similar argument: we omit it and simply add comments in the overparametrized proof where the two regimes differ.
We defer the proofs of some technical results to later sections. Section B.2 proves a key proposition on the structure of the feature matrix $Z = (\sigma_d(x_i; \theta_j))_{i\in[n],j\in[N]}$. Section B.3 gathers some technical bounds necessary for the proof of Theorem 1. Finally, Section B.4 contains concentration results on the high degree part of the feature matrix.

B.1 Proof of Theorem 1 in the overparametrized regime
In this section, we prove Theorem 1 in the overparametrized regime. We defer the proofs of some of the technical lemmas and matrix concentration results to Sections B.2, B.3 and B.4. The underparametrized case follows from the same proof under the mapping $n \leftrightarrow N$, $m \leftrightarrow M$ and $\lambda \to \lambda_N = N\lambda/n$. We will add remarks in the proof where a difference arises.
Step 1. Rewrite the $y$, $V$, $U$, $Z$ matrices. We recall that the random features ridge regression solution is given by
$\hat a(\lambda) = \arg\min_a \Big\{ \sum_{i=1}^{n} \big(y_i - \hat f(x_i; a)\big)^2 + \lambda N \|a\|_2^2 \Big\}$.
Solving for the coefficients yields
$\hat a(\lambda) = (Z^T Z/N + \lambda I_N)^{-1} Z^T y/N$,
where $y = (y_1, \ldots, y_n)^T$ and $Z = (Z_{ij})_{i\in[n],j\in[N]} \in \mathbb{R}^{n\times N}$ with $Z_{ij} = \sigma_d(x_i; \theta_j)$. Hence, the prediction at location $x$ is given by
$\hat f(x; \hat a(\lambda)) = y^T Z (Z^T Z/N + \lambda I_N)^{-1} \sigma(x)/N$,
where $\sigma(x) = (\sigma_d(x; \theta_1), \ldots, \sigma_d(x; \theta_N))^T \in \mathbb{R}^N$.
Expanding the test error, we get
$R_{RF}(f_d, X, \Theta, \lambda) \equiv E_x\Big[\big(f_d(x) - y^T Z (Z^T Z/N + \lambda I_N)^{-1} \sigma(x)/N\big)^2\Big] = E_x[f_d(x)^2] - 2 y^T Z \hat U_\lambda^{-1} V/N + y^T Z \hat U_\lambda^{-1} U \hat U_\lambda^{-1} Z^T y/N^2$,
where $V = (V_1, \ldots, V_N)^T$ and $U = (U_{ij})_{i,j\in[N]}$ with
$V_i = E_x[f_d(x)\, \sigma_d(x; \theta_i)]$, $U_{ij} = E_x[\sigma_d(x; \theta_i)\sigma_d(x; \theta_j)]$,
and $\hat U_\lambda = Z^T Z/N + \lambda I_N$ is the (rescaled) regularized empirical kernel matrix,
$\hat U_{\lambda,ij} = \frac{1}{N}\sum_{k\in[n]} \sigma_d(x_k; \theta_i)\sigma_d(x_k; \theta_j) + \lambda\delta_{ij}$.
We recall that the eigendecomposition of $\sigma_d$ is given by
$\sigma_d(x; \theta) = \sum_{k=1}^{\infty} \lambda_{d,k}^{1/2}\, \psi_k(x)\phi_k(\theta)$.
We write the orthogonal decomposition of $f_d$ in this basis as
$f_d(x) = \sum_{k=1}^{\infty} \hat f_{d,k}\, \psi_k(x)$.
Define
$\psi_k = (\psi_k(x_1), \ldots, \psi_k(x_n))^T \in \mathbb{R}^n$, $\phi_k = (\phi_k(\theta_1), \ldots, \phi_k(\theta_N))^T \in \mathbb{R}^N$,
$D_{\le m} = \mathrm{diag}(\lambda_{d,1}^{1/2}, \ldots, \lambda_{d,m}^{1/2}) \in \mathbb{R}^{m\times m}$,
$\psi_{\le m} = (\psi_k(x_i))_{i\in[n],k\in[m]} \in \mathbb{R}^{n\times m}$, $\phi_{\le m} = (\phi_k(\theta_i))_{i\in[N],k\in[m]} \in \mathbb{R}^{N\times m}$,
$\hat f_{\le m} = (\hat f_{d,1}, \ldots, \hat f_{d,m})^T \in \mathbb{R}^m$. (84)
Recall that $y = (y_1, \ldots, y_n)^T = f + \varepsilon$ with
$f = (f_d(x_1), \ldots, f_d(x_n))^T$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$.
Using the above notations, we can decompose the vectors and matrices $f$, $V$, $U$, and $Z$ as
$f = f_{\le m} + f_{>m}$, $f_{\le m} = \psi_{\le m}\hat f_{\le m}$, $f_{>m} = \sum_{k=m+1}^{\infty} \hat f_{d,k}\, \psi_k$,
$V = V_{\le m} + V_{>m}$, $V_{\le m} = \phi_{\le m} D_{\le m} \hat f_{\le m}$, $V_{>m} = \sum_{k=m+1}^{\infty} \hat f_{d,k}\, \lambda_{d,k}^{1/2}\, \phi_k$,
$U = U_{\le m} + U_{>m}$, $U_{\le m} = \phi_{\le m} D_{\le m}^2 \phi_{\le m}^T$, $U_{>m} = \sum_{k=m+1}^{\infty} \lambda_{d,k}\, \phi_k \phi_k^T$,
$Z = Z_{\le m} + Z_{>m}$, $Z_{\le m} = \psi_{\le m} D_{\le m} \phi_{\le m}^T$, $Z_{>m} = \sum_{k\ge m+1} \lambda_{d,k}^{1/2}\, \psi_k \phi_k^T$. (85)
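For reference, the closed-form RFRR predictor manipulated in Step 1 takes only a few lines to implement; the data, the ReLU activation, and the sizes below are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n, N, d, lam = 200, 500, 20, 0.1
X = rng.standard_normal((n, d))
Theta = rng.standard_normal((N, d))
y = np.tanh(X[:, 0]) + 0.1 * rng.standard_normal(n)    # placeholder data

Z = np.maximum(X @ Theta.T / np.sqrt(d), 0.0)          # Z_ij = sigma_d(x_i; theta_j)
Uhat = Z.T @ Z / N + lam * np.eye(N)                   # regularized empirical kernel
a_hat = np.linalg.solve(Uhat, Z.T @ y / N)             # a(lam) = Uhat^{-1} Z^T y / N

def predict(x):
    sigma_x = np.maximum(Theta @ x / np.sqrt(d), 0.0)  # sigma(x) in R^N
    return a_hat @ sigma_x                             # = y^T Z Uhat^{-1} sigma(x) / N

print(predict(X[0]), y[0])
```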
Step 2. Decompose the risk. We decompose the risk with respect to $y = f + \varepsilon$ as follows:
$R_{RF}(f_d, X, \Theta, \lambda) = \|f_d\|_{L^2}^2 - 2T_1 + T_2 + T_3 - 2T_4 + 2T_5$,
where
$T_1 = f^T Z \hat U_\lambda^{-1} V/N$,
$T_2 = f^T Z \hat U_\lambda^{-1} U \hat U_\lambda^{-1} Z^T f/N^2$,
$T_3 = \varepsilon^T Z \hat U_\lambda^{-1} U \hat U_\lambda^{-1} Z^T \varepsilon/N^2$,
$T_4 = \varepsilon^T Z \hat U_\lambda^{-1} V/N$,
$T_5 = \varepsilon^T Z \hat U_\lambda^{-1} U \hat U_\lambda^{-1} Z^T f/N^2$. (86)
The proof relies on the following key result on the structure of the feature matrix $Z$.

Proposition 6 (Structure of the feature matrix Z). Follow the assumptions and notations of the proof of Theorem 1 in the overparametrized regime (note in particular that $N \ge n^{1+\delta_0}$ and $n \ge m^{1+\delta_0}$ for some fixed $\delta_0 > 0$). Consider the singular value decomposition of $Z = (Z_{ij})_{i\in[n],j\in[N]}$ with $Z_{ij} = \sigma_d(x_i; \theta_j)$:
$Z/\sqrt{N} = P \Lambda Q^T = [P_1, P_2]\, \mathrm{diag}(\Lambda_1, \Lambda_2)\, [Q_1, Q_2]^T \in \mathbb{R}^{n\times N}$,
where $P \in \mathbb{R}^{n\times n}$ and $Q \in \mathbb{R}^{N\times n}$; $P_1 \in \mathbb{R}^{n\times m}$ and $Q_1 \in \mathbb{R}^{N\times m}$ correspond to the left and right singular vectors associated to the largest $m$ singular values $\Lambda_1$, while $P_2 \in \mathbb{R}^{n\times(n-m)}$ and $Q_2 \in \mathbb{R}^{N\times(n-m)}$ correspond to the left and right singular vectors associated to the $(n-m)$ smallest singular values $\Lambda_2$. Define $\kappa_{>m} = \mathrm{Tr}(\mathbb{H}_{d,>m})$.
Then the singular value decomposition has the following properties:
(a) Define $\Lambda = \mathrm{diag}\big((\sigma_i(Z/\sqrt{N}))_{i\in[n]}\big)$, the singular values (in nonincreasing order) of $Z/\sqrt{N}$. Then the singular values verify
$\sigma_{\min}(\Lambda_1) = \min_{i\in[m]} \sigma_i(Z/\sqrt{N}) = \kappa_{>m}^{1/2} \cdot \omega_{d,P}(1)$, (87)
$\|\Lambda_2 - \kappa_{>m}^{1/2} \cdot I_{n-m}\|_{op} = \max_{i=m+1,\ldots,n} \big|\sigma_i(Z/\sqrt{N}) - \kappa_{>m}^{1/2}\big| = \kappa_{>m}^{1/2} \cdot o_{d,P}(1)$. (88)
(b) The left and right singular vectors associated to the $(n-m)$ smallest singular values verify
$n^{-1/2}\|\psi_{\le m}^T P_2\|_{op} = o_{d,P}(1)$, $N^{-1/2}\|\phi_{\le m}^T Q_2\|_{op} = o_{d,P}(1)$. (89)
(c) We have
$N^{-1/2}\|P_2^T Z_{>m} Q_1\|_{op} = \kappa_{>m}^{1/2} \cdot o_{d,P}(1)$. (90)
We defer the proof of Proposition 6 to Section B.2.

Remark B.1.
Proposition 6 shows that the feature matrix $Z = Z_{\le m} + Z_{>m}$ (cf. Eq. (85)) is a spiked matrix, with $m$ spikes with singular values $\Lambda_1$ much larger than $\kappa_{>m}^{1/2}$ coming from the low-degree part $Z_{\le m}$ (in particular, Proposition 6.(b) shows that the left and right singular vectors of the spikes are approximately spanned by the left and right singular vectors of $Z_{\le m}$), while the rest of the singular values are approximately constant, equal to $\kappa_{>m}^{1/2}$. The proof of this proposition is based on the following observations:
(a) $Z_{\le m}/\sqrt{N} = \psi_{\le m} D_{\le m} \phi_{\le m}^T/\sqrt{N}$ is a rank-$m$ matrix such that
(i) $\psi_{\le m}/\sqrt{n}$ and $\phi_{\le m}/\sqrt{N}$ are approximately orthogonal matrices (see Eq. (105));
(ii) $\sqrt{n}\, D_{\le m} = \mathrm{diag}(\sqrt{n\lambda_{d,1}}, \ldots, \sqrt{n\lambda_{d,m}}) \succeq \omega_{d,P}(\kappa_{>m}^{1/2}) \cdot I_m$, from condition (19) in Assumption 2.(a).
(b) The high degree part $Z_{>m}/\sqrt{N}$ has nearly constant singular values, $\|Z_{>m} Z_{>m}^T/N - \kappa_{>m} I_n\|_{op} = \kappa_{>m} \cdot o_{d,P}(1)$, and is nearly orthogonal to the span of the right singular vectors of $Z_{\le m}$, i.e., $\|Z_{>m}\phi_{\le m}/N\|_{op} = \kappa_{>m}^{1/2} \cdot o_{d,P}(1)$ (see Proposition 8 in Section B.4).
Using Proposition 6, we can prove the following list of bounds, which are the main tools for the rest of the proof of Theorem 1.

Proposition 7.
Follow the assumptions and notations of the proof of Theorem 1 in the overparametrized regime. Then the following bounds hold. (Recall that $\kappa_{>m} = \mathrm{Tr}(\mathbb{H}_{d,>m})$.)
(a) Bounds on $\hat U_\lambda^{-1} = (Z^T Z/N + \lambda I_N)^{-1}$:
$\psi_{\le m}^T Z \hat U_\lambda^{-1} \phi_{\le m} D_{\le m}/N = I_m + \Delta_0$, (91)
$\|D_{\le m}\phi_{\le m}^T \hat U_\lambda^{-1} Z^T f_{>m}/N\|_2 = \|P_{>m} f_d\|_{L^{2+\eta}} \cdot o_{d,P}(1)$, (92)
$\sqrt{n}\, \|Z \hat U_\lambda^{-1} \phi_{\le m} D_{\le m}/N\|_{op} = O_{d,P}(1)$, (93)
where $\|\Delta_0\|_{op} = o_{d,P}(1)$. Furthermore, we have
$\|Z \hat U_\lambda^{-1}/\sqrt{N}\|_{op} = \kappa_{>m}^{-1/2} \cdot O_{d,P}(1)$. (94)
(b) Bound on $U_{>m}$:
$\frac{n}{N}\|U_{>m}\|_{op} = \kappa_{>m} \cdot o_{d,P}(1)$.
(c) Bounds on $f$:
$\|f\|_2 = \sqrt{n}\, \|f_d\|_{L^2} \cdot O_{d,P}(1)$, $\|\psi_{\le m}^T f_{>m}/n\|_2 = \|P_{>m} f_d\|_{L^{2+\eta}} \cdot o_{d,P}(1)$.
(d) Bound on $V_{>m}$:
$\sqrt{\frac{n}{N}}\, \|V_{>m}\|_2 = \kappa_{>m}^{1/2}\, \|P_{>m} f_d\|_{L^2} \cdot o_{d,P}(1)$.
The proof of Proposition 7 is deferred to Section B.3.
Remark B.2.
In the underparametrized case, the proofs and statements of Proposition 6 and of Propositions 7.(a) and 7.(c) are symmetric under the mapping $n \leftrightarrow N$, $m \leftrightarrow M$ and $\lambda \to \lambda_N = N\lambda/n$. The bounds in Propositions 7.(b) and 7.(d) can easily be replaced by
$\|U_{>M}\|_{op} = \kappa_{>M} \cdot O_{d,P}(1)$, $\|V_{>M}\|_2 = \kappa_{>M}^{1/2}\, \|P_{>M} f_d\|_{L^2} \cdot o_{d,P}(1)$.
In order to bound the term $T_2$ in Eq. (100), we will further use the bound
$\|\hat U_\lambda^{-1} Z^T f/n\|_2 = \kappa_{>M}^{-1/2} \cdot \|f_d\|_{L^2} \cdot o_{d,P}(1)$,
which we prove in Section B.3.5. It is straightforward to plug the new bounds, with the aforementioned mapping, into the computation below and check that the underparametrized case indeed follows.
The rest of the proof amounts to controlling each term separately, using the claims listed in Proposition 7. We will use extensively the following (basic) properties of the operator norm: for $A \in \mathbb{R}^{m\times p}$, $B \in \mathbb{R}^{p\times q}$, $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^p$, we have
$\|A\|_{op} = \|A^T A\|_{op}^{1/2} = \|A A^T\|_{op}^{1/2}$, $\|AB\|_{op} \le \|A\|_{op}\|B\|_{op}$, $u^T A v \le \|u\|_2 \|A\|_{op} \|v\|_2$.

Step 3. Term $T_1$. Let us decompose $T_1$ into $T_1 = T_{11} + T_{12} + T_{13}$, where
$T_{11} = f_{\le m}^T Z \hat U_\lambda^{-1} V_{\le m}/N$, $T_{12} = f_{>m}^T Z \hat U_\lambda^{-1} V_{\le m}/N$, $T_{13} = f^T Z \hat U_\lambda^{-1} V_{>m}/N$.
Recall that $V_{\le m} = \phi_{\le m} D_{\le m} \hat f_{\le m}$ and $f_{\le m} = \psi_{\le m}\hat f_{\le m}$. Hence, by Eq. (91) in Proposition 7.(a), we have
$T_{11} = \hat f_{\le m}^T \big(\psi_{\le m}^T Z \hat U_\lambda^{-1} \phi_{\le m} D_{\le m}/N\big) \hat f_{\le m} = \hat f_{\le m}^T (I_m + \Delta_0) \hat f_{\le m} = \|P_{\le m} f_d\|_{L^2}^2 + \|P_{\le m} f_d\|_{L^2}^2 \cdot o_{d,P}(1)$. (95)
Similarly, by Eq. (92) in Proposition 7.(a),
$|T_{12}| = |f_{>m}^T Z \hat U_\lambda^{-1} \phi_{\le m} D_{\le m} \hat f_{\le m}/N| \le \|D_{\le m}\phi_{\le m}^T \hat U_\lambda^{-1} Z^T f_{>m}/N\|_2 \|\hat f_{\le m}\|_2 = \|P_{>m} f_d\|_{L^{2+\eta}}\, \|P_{\le m} f_d\|_{L^2} \cdot o_{d,P}(1)$. (96)
Using Propositions 7.(c) and 7.(d), as well as Eq. (94) in Proposition 7.(a), we get
$|T_{13}| = |f^T Z \hat U_\lambda^{-1} V_{>m}/N| \le \|f/\sqrt{n}\|_2\, \|(Z/\sqrt{N})\hat U_\lambda^{-1}\|_{op} \cdot \sqrt{n/N}\, \|V_{>m}\|_2 \le O_{d,P}(\|f_d\|_{L^2}) \cdot O_{d,P}(\kappa_{>m}^{-1/2}) \cdot o_{d,P}(\kappa_{>m}^{1/2}\|P_{>m} f_d\|_{L^2}) = \|f_d\|_{L^2}\|P_{>m} f_d\|_{L^2} \cdot o_{d,P}(1)$. (97)
Combining Eqs. (95), (96) and (97) yields
$T_1 = \|P_{\le m} f_d\|_{L^2}^2 + o_{d,P}(1) \cdot \big(\|f_d\|_{L^2}^2 + \|P_{>m} f_d\|_{L^{2+\eta}}^2\big)$. (98)

Step 4. Term $T_2$. Recalling $U = \phi_{\le m} D_{\le m}^2 \phi_{\le m}^T + U_{>m}$, we can decompose $T_2$ as $T_2 = T_{21} + T_{22}$, where
$T_{21} = \big(f^T Z \hat U_\lambda^{-1} \phi_{\le m} D_{\le m}/N\big)\big(D_{\le m}\phi_{\le m}^T \hat U_\lambda^{-1} Z^T f/N\big)$,
$T_{22} = f^T Z \hat U_\lambda^{-1} U_{>m} \hat U_\lambda^{-1} Z^T f/N^2$.
From Eqs. (91) and (92) in Proposition 7.(a), we have
$D_{\le m}\phi_{\le m}^T \hat U_\lambda^{-1} Z^T f/N = D_{\le m}\phi_{\le m}^T \hat U_\lambda^{-1} Z^T \psi_{\le m}\hat f_{\le m}/N + D_{\le m}\phi_{\le m}^T \hat U_\lambda^{-1} Z^T f_{>m}/N = (I_m + \Delta_0^T)\hat f_{\le m} + \|P_{>m} f_d\|_{L^{2+\eta}} \cdot \Delta_1$,
where $\|\Delta_0\|_{op} = o_{d,P}(1)$ and $\|\Delta_1\|_2 = o_{d,P}(1)$. Hence,
$T_{21} = \|P_{\le m} f_d\|_{L^2}^2 + \big(\|f_d\|_{L^2}^2 + \|P_{>m} f_d\|_{L^{2+\eta}}^2\big) \cdot o_{d,P}(1)$. (99)
(94) in Proposition 7.( a ) as well as Proposition 7.( b ), 7.( c ), the second term is bounded by | T | = | f T Z ˆ U − λ U > m ˆ U − λ Z T f /N |≤(cid:107) ( n/N ) U > m (cid:107) op (cid:107) ( Z / √ N ) ˆ U − λ (cid:107) (cid:107) f / √ n (cid:107) = o d, P ( κ > m ) · O d, P ( κ − > m ) · O d, P ( (cid:107) f d (cid:107) L ) = (cid:107) f d (cid:107) L · o d, P (1) . (100)As a result, combining Eqs. (99) and (100), we have T = (cid:107) P ≤ m f d (cid:107) L + o d, P (1) · ( (cid:107) f d (cid:107) L + (cid:107) P > m f d (cid:107) L η ) . (101) Step 5. Terms T , T and T . Let us start with the term T . Decompose U = φ ≤ m D ≤ m φ T ≤ m + U > m : E ε [ T ] /σ ε =tr( Z ˆ U − λ U ˆ U − λ Z T ) /N =tr( Z ˆ U − λ φ ≤ m D ≤ m φ T ≤ m ˆ U − λ Z T ) /N + tr( Z ˆ U − λ U > m ˆ U − λ Z T ) /N .
42y Eq. (93) in Proposition 7.( a ), and since m ≤ n − δ by Assumption 2.( a ), we havetr( Z ˆ U − λ φ ≤ m D ≤ m φ T ≤ m ˆ U − λ Z T ) /N ≤ m · (cid:107) Z ˆ U − λ φ ≤ m D ≤ m /N (cid:107) = m n · O d, P (1) = o d, P (1) . By Eq. (94) in Proposition 7.( a ) as well as Proposition 7.( b ), the second term is bounded bytr( Z ˆ U − λ U > m ˆ U − λ Z T ) /N ≤(cid:107) ( n/N ) U > m (cid:107) op (cid:107) Z ˆ U − λ Z T /N (cid:107) op /n = o d, P ( κ > m ) · O d, P ( κ − > m ) · n − = o d, P (1) . Combining these two bounds and using Markov’s inequality, we get T = o d, P (1) · σ ε . (102)Let us consider T term. Recall that we can decompose V = φ ≤ m D ≤ m ˆ f ≤ m + V > m , E ε [ T ] /σ ε =tr( Z ˆ U − λ V V T ˆ U − λ Z T ) /N = V T ˆ U − λ Z T Z ˆ U − λ V /N ≤ (cid:107) Z ˆ U − λ V ≤ m /N (cid:107) + (cid:107) Z ˆ U − λ V > m /N (cid:107) ) . We have by Eq. (93) in Proposition 7.( a ), (cid:107) Z ˆ U − λ V ≤ m /N (cid:107) ≤(cid:107) Z ˆ U − λ φ ≤ m D ≤ m /N (cid:107) op (cid:107) ˆ f ≤ m (cid:107) = (cid:107) P ≤ m f d (cid:107) L · o d, P (1) , and by Proposition 7.( d ), (cid:107) Z ˆ U − λ V > m /N (cid:107) ≤(cid:107) Z ˆ U − λ / √ N (cid:107) (cid:107) V > m / √ N (cid:107) = O d, P ( κ − / > m ) · o d, P ( κ / > m (cid:107) P > m f d (cid:107) L n − / ) = (cid:107) P > m f d (cid:107) L · o d, P (1) . Combining the two above bounds, we get by Markov’s inequality T = o d, P (1) · σ ε (cid:107) f d (cid:107) L = o d, P (1) · ( σ ε + (cid:107) f d (cid:107) L ) . (103)Let us consider the last term T . We have E ε [ T ] /σ ε =tr( Z ˆ U − λ U ˆ U − λ Z T f f T Z ˆ U − λ U ˆ U − λ Z T ) /N = (cid:107) Z ˆ U − λ U ˆ U − λ Z T f /N (cid:107) ≤ (cid:107) Z ˆ U − λ U ˆ U − λ Z T √ n/N (cid:107) (cid:107) f / √ n (cid:107) . By Eq. (92) in Proposition 7.( a ), and Proposition 7.( b ), (cid:107) Z ˆ U − λ U ˆ U − λ Z T √ n/N (cid:107) op ≤√ n · (cid:107) Z ˆ U − λ φ ≤ m D ≤ m /N (cid:107) + (cid:107) (cid:112) n/N U > m (cid:107) op (cid:107) Z ˆ U − λ Z T /N (cid:107) op = o d, P (1) . Hence, by Proposition 7.( c ), E ε [ T ] /σ ε = o d, P (1) · (cid:107) f / √ n (cid:107) = (cid:107) f d (cid:107) · o d, P (1) , which gives by Markov’s inequality T = σ ε (cid:107) f d (cid:107) L · o d, P (1) = ( σ ε + (cid:107) f d (cid:107) L ) · o d, P (1) . (104) Step 6. Finish the proof.
Combining Eqs. (98), (101), (102), (103) and (104), we have R RF ( f d , X , W , λ ) = (cid:107) f d (cid:107) L − T + T + T − T + 2 T = (cid:107) f d (cid:107) L − (cid:107) P ≤ m f d (cid:107) L + (cid:107) P ≤ m f d (cid:107) L + o d, P (1) · ( (cid:107) f d (cid:107) L + (cid:107) P > m f d (cid:107) L η + σ ε )= (cid:107) P > m f d (cid:107) L + o d, P (1) · ( (cid:107) f d (cid:107) L + (cid:107) P > m f d (cid:107) L η + σ ε ) , which concludes the proof. 43 .2 Proof of Proposition 6: Structure of the feature matrix Z Recall the definition Z = ( σ d ( x i ; θ j )) i ∈ [ n ] ,j ∈ [ N ] . Recall the decomposition Z = Z ≤ m + Z > m into a low andhigh degree parts, as per Eq. (85). For convenience, we will consider the normalized quantities˜ Z = Z / √ N , ˜ Z ≤ m = Z ≤ m / √ N , ˜ Z > m = Z > m / √ N , ˜ φ ≤ m = φ ≤ m / √ N , ˜ ψ ≤ m = ψ ≤ m / √ n, ˜ D ≤ m = √ n D ≤ m . In particular, notice that ˆ U λ = ˜ Z T ˜ Z + λ I N and ˜ Z ≤ m = ˜ ψ ≤ m ˜ D ≤ m ˜ φ T ≤ m .By Proposition 3 applied to ˜ φ ≤ m and ˜ ψ ≤ m (with assumptions satisfied by Assumption 1.( a ) and As-sumption 2.( a )), we get ˜ φ T ≤ m ˜ φ ≤ m = I m + ∆ , ˜ ψ T ≤ m ˜ ψ ≤ m = I m + ∆ , (105)with (cid:107) ∆ i (cid:107) op = o d, P (1) for i = 1 ,
2. Furthermore, by Proposition 8 (stated in Section B.4), we have˜ Z > m ˜ Z T > m = κ > m · ( I n + ∆ Z ) , (cid:107) ˜ Z > m ˜ φ ≤ m (cid:107) op = κ / > m · o d, P (1) , (106)with (cid:107) ∆ Z (cid:107) op = o d, P (1) and where we recall κ > m = Tr( H d,> m ). Furthermore, Assumption 2.( a ) implies that σ min ( ˜ D ≤ m ) = min k ≤ m {√ n | λ d,k |} = ω d (1) · κ / > m . (107)Hence, we expect ˜ Z = ˜ Z ≤ m + ˜ Z > m to have m large singular values ω d (1) · κ / > m associated to ˜ Z ≤ m withleft and right singular vectors spanned approximately by ˜ ψ ≤ m and ˜ φ ≤ m , and n − m small singular valuesapproximately equal to κ / > m associated to ˜ Z > m . Proof of Proposition 6.
Claim ( a ) . Bound on the singular values. Using Eqs. (105) and (107), we have˜ Z ≤ m ˜ Z T ≤ m = ˜ ψ ≤ m ˜ D ≤ m ˜ φ T ≤ m ˜ φ ≤ m ˜ D ≤ m ˜ ψ T ≤ m = ˜ ψ ≤ m ˜ D ≤ m ( I m + ∆ ) ˜ D ≤ m ˜ ψ T ≤ m (cid:23) Ω d, P (1) · ˜ ψ ≤ m ˜ D ≤ m ˜ ψ T ≤ m (cid:23) κ > m · ω d (1) · ˜ ψ ≤ m ˜ ψ T ≤ m . Furthermore, by ˜ ψ T ≤ m ˜ ψ ≤ m = I m + ∆ , we deduce that the singular values of ˜ Z ≤ m are lower bounded asfollows min i ∈ [ m ] σ i ( ˜ Z ≤ m ) = κ > m · ω d, P (1) . (108)By Lemma 8 stated below in Section B.2.1, we have for i ∈ [ n ], | σ i ( ˜ Z ) − σ i ( ˜ Z ≤ m ) | ≤ (cid:107) ˜ Z > m (cid:107) op . (109)Recalling Eq. (106), (cid:107) ˜ Z > m (cid:107) op = O d, P (1) · κ / > m . Hence the first m singular values verify σ i ( ˜ Z ) ≥ σ i ( ˜ Z ≤ m ) − κ / > m · O d, P (1) . (110)Using Eq. (108) implies σ min ( Λ ) = min i ∈ [ m ] σ i ( ˜ Z ) = κ / > m · ω d, P (1). This proves Eq. (87).Using again Eq. (109), the n − m smallest singular values verifymax i = m +1 ,...,n σ i ( ˜ Z ) ≤ κ / > m · (1 + o d, P (1)) . (111)44n order to lower bound the n − m smallest singular values, we lower bound the eigenvalues of ˜ Z ˜ Z T . Wehave ˜ Z ˜ Z T = ˜ Z ≤ m ˜ Z T ≤ m + ˜ Z > m ˜ Z T ≤ m + ˜ Z ≤ m ˜ Z T > m + ˜ Z > m ˜ Z T > m . Recalling Eq. (105) and Eq. (106), we have˜ Z ≤ m ˜ Z T ≤ m = ˜ ψ ≤ m ˜ D ≤ m ( ˜ φ T ≤ m ˜ φ ≤ m ) ˜ D ≤ m ˜ ψ T ≤ m = ˜ ψ ≤ m ˜ D ≤ m ( I m + ∆ ) ˜ D ≤ m ˜ ψ ≤ m , ˜ Z > m ˜ Z T > m = κ > m · ( I n + ∆ Z ) , where (cid:107) ∆ Z (cid:107) op = o d, P (1).Denote L = ˜ ψ ≤ m ˜ D ≤ m ( I m + ∆ ) / and T = ˜ Z > m ˜ φ ≤ m ( I m + ∆ ) − / . By Eq. (106), we have (cid:107) T T T (cid:107) op = κ > m · o d, P (1). Combining these remarks, we get˜ Z ˜ Z T = LL T + T L T + LT T + T T T − T T T + ˜ Z > m ˜ Z T > m = ( L + T )( L + T ) T + κ > m · ( I n + ∆ (cid:48) ) (cid:23) κ > m · ( I n + ∆ (cid:48) ) , where (cid:107) ∆ (cid:48) (cid:107) op = o d, P (1). We deduce that σ min ( ˜ Z ) = min i ∈ [ n ] σ i ( ˜ Z ) ≥ κ / > m · (1 + o d, P (1)) , which combined with Eq. (111) yields Eq. (88). Part ( b ) . Left and right singular vectors. Let us prove (cid:107) ˜ φ T ≤ m Q (cid:107) op = o d, P (1). The proof for ˜ ψ T ≤ m P follows from the same argument by replacing˜ Z by ˜ Z T and using the bound (cid:107) ˜ Z > m ˜ φ ≤ m (cid:107) op = κ / > m · o d, P (1), cf. Eq. (106).Let us consider a sequence u ∈ R n − m (where we keep the dependency on d implicit) such that (cid:107) u (cid:107) = 1and (cid:107) ˜ φ T ≤ m Q u (cid:107) = (cid:107) ˜ φ T ≤ m Q (cid:107) op . For convenience, denote ˜ u = ˜ φ T ≤ m Q u . We have u T Λ u = u T Q T ˜ Z T ˜ ZQ u = u T Q T ( ˜ Z T ≤ m ˜ Z ≤ m + ˜ Z T > m ˜ Z ≤ m + ˜ Z T ≤ m ˜ Z T > m + ˜ Z T > m ˜ Z > m ) Q u = ˜ u T ˜ D ≤ m ( I m + ∆ ) ˜ D ≤ m ˜ u + 2 ˜ u T ˜ D ≤ m ( ˜ ψ T ≤ m ˜ Z > m u ) + (cid:107) ˜ Z > m Q u (cid:107) . (112)From step 1, we have u T Λ u = κ > m · O d, P (1). Furthermore,˜ u T ˜ D ≤ m ( I m + ∆ ) ˜ D ≤ m ˜ u = Ω d, P (1) · (cid:107) ˜ D ≤ m ˜ u (cid:107) , ˜ u T ˜ D ≤ m ( ˜ ψ T ≤ m ˜ Z > m u ) ≥ − (cid:107) ˜ D ≤ m ˜ u (cid:107) (cid:107) ˜ ψ T ≤ m ˜ Z > m (cid:107) op , (cid:107) ˜ ψ T ≤ m ˜ Z > m (cid:107) op ≤ (cid:107) ˜ ψ ≤ m (cid:107) op (cid:107) ˜ Z > m (cid:107) op = κ / > m · O d, P (1) . (113)Therefore, using the bounds (113) in Eq. (112), we getΩ d, P (1) · (cid:107) ˜ D ≤ m ˜ u (cid:107) − (cid:107) ˜ D ≤ m ˜ u (cid:107) (cid:107) ˜ ψ T ≤ m ˜ Z > m (cid:107) op ≤ κ > m · O d, P (1) . 
Hence, (cid:107) ˜ D ≤ m ˜ u (cid:107) = O d, P (cid:16) max (cid:0) κ / > m , (cid:107) ˜ ψ T ≤ m ˜ Z > m (cid:107) op (cid:1)(cid:17) = κ / > m · O d, P (1) . (114)Using the bound (cid:107) ˜ D ≤ m ˜ u (cid:107) = κ / > m · ω d (1) · (cid:107) ˜ u (cid:107) = κ / > m · ω d (1) · (cid:107) Q T ˜ φ ≤ m (cid:107) op in Eq. (114), we deduce that (cid:107) Q T ˜ φ ≤ m (cid:107) op = o d, P (1). This concludes the proof of Proposition 6.( b ). Part ( c ) . Cross term bound. κ − / > m ˜ Z = κ − / > m ˜ Z ≤ m + κ − / > m ˜ Z > m . Indeed, Eq. (108) implies that σ min ( κ − / > m ˜ Z ≤ m ) = ω d, P (1) and Eq. (106) gives (cid:107) κ − > m ˜ Z > m ˜ Z T > m − I n (cid:107) op = o d, P (1). Furthermore, the right singular vectors V of ˜ Z ≤ m are spanned by the left singular vectorsof ˜ φ ≤ m . From Eq. (106), we have (cid:107) ˜ Z > m ˜ φ ≤ m (cid:107) op = κ / > m · o d, P (1). Combined with (cid:107) ˜ φ T ≤ m ˜ φ ≤ m − I m (cid:107) op = o d, P (1),we get (cid:107) κ − / > m ˜ Z > m V (cid:107) op = o d, P (1). B.2.1 Auxiliary lemmas
We recall the following classical perturbation theory result.
Theorem 7 (Sin(Θ) theorem for rectangular matrices [Wed72]) . Let A be a n × N -matrix with singularvalue decomposition A = U Σ V T , where U ∈ R n × m , V ∈ R N × m verify m ≤ min( n, N ) and U T U = V T V = I m , and Σ = diag(( σ i ( A )) i ∈ [ m ] ) are the singular values. Let M be a perturbation n × N -matrix and consider B = A + M with singularvalue decomposition B = P Σ Q = [ P , P ]diag( Λ , Λ )[ Q , Q ] T , where P ∈ R n × m , Q ∈ R N × m , P ∈ R n × ( n − m ) , Q ∈ R N × ( n − m ) . Assume that σ min ( Λ ) > . Then max( (cid:107) ( I n − U U T ) P (cid:107) op , (cid:107) ( I N − V V T ) Q (cid:107) op ) ≤ max( (cid:107) M Q (cid:107) op , (cid:107) M T P (cid:107) op ) σ min ( Λ ) . (115) Lemma 8 (Weyl’s inequality) . Consider A , M ∈ R n × N and define B = A + M . Then for any i ∈ [min( n, N )] , we have | σ i ( B ) − σ i ( A ) | ≤ (cid:107) M (cid:107) op . (116)The next lemma implies that the projection of the noise matrix M on the top left singular vectors ofthe full matrix is approximately in the space orthogonal to the right singular vectors. Lemma 9 (Null space of right singular vectors) . Let { N ( d ) } d ≥ , { n ( d ) } d ≥ and { m ( d ) } d ≥ be three sequencesof integers. For convenience, we denote N = N ( d ) , n = n ( d ) and m = m ( d ) . Assume that N ≥ n + m and n ≥ m . Consider the following sequence of random spiked matrices: B := B ( d ) = A + M = U Σ V T + M ∈ R n × N , where U Σ V T is the singular value decomposition of the rank m matrix A with U ∈ R n × m , V ∈ R N × m and U T U = V T V = I m , and Σ = diag(( σ ,i ( A )) i ∈ [ m ] ) ∈ R m × m are the singular values. Further assumethat(a) σ min ( A ) = min i ∈ [ m ] σ ,i ( A ) = ω d, P (1) ,(b) (cid:107) M V (cid:107) op = o d, P (1) ,(c) (cid:107) M M T − I n (cid:107) op = o d, P (1) .Denote B = P Λ Q T = [ P , P ]diag( Λ , Λ )[ Q , Q ] T the singular value decomposition of B where P ∈ R n × m and Q ∈ R N × m correspond to the left and right singular vectors associated to the first m singularvalues Λ , while P ∈ R n × ( n − m ) and Q ∈ R N × ( n − m ) correspond to the left and right singular vectorsassociated to the last ( n − m ) singular values Λ .Then we have (cid:107) P T M Q (cid:107) op = o d, P (1) . (117)46 roof of Lemma 9. Step 1. Simplification of the problem.
Without loss of generality, we can choose an orthonormal basis in R N so that, in that basis V = (cid:20) I m N − m , m (cid:21) , M = (cid:2) M M n,N − ( n + m ) (cid:3) , (118)where M ∈ R n × m and M ∈ R n × n . Because the space corresponding to the last N − ( n + m ) coordinatesof the row is in the right null space of both A and M we can forget about them and consider –without lossof generality– M = [ M , M ] ∈ R n × ( n + m ) , N = n + m .From the assumption (cid:107) M V (cid:107) op = o d, P (1), we have (cid:107) M (cid:107) op = o d, P (1) . (119)Furthermore, from the assumption (cid:107) M M T − I n (cid:107) op = o d, P (1), we have (cid:107) M M T − I n (cid:107) op = o d, P (1) . (120) Step 2. There exists an orthogonal matrix R ∈ R m × m such that (cid:107) P − U R (cid:107) op = o d, P (1) . Recall that Λ = diag(( σ ,i ( B )) i ∈ [ m ] ). By Lemma 8, we have for any i ∈ [ m ], | σ ,i ( B ) − σ ,i ( A ) | ≤ (cid:107) M (cid:107) op . Using the assumption ( a ) that σ min ( A ) = ω d, P (1) and assumption ( c ) (cid:107) M (cid:107) op = O d, P (1), we deduce that σ min ( Λ ) = ω d, P (1) . (121)Furthermore (cid:107) M Q (cid:107) op ≤ (cid:107) M (cid:107) op = O d, P (1) and similarly (cid:107) M T P (cid:107) op = O d, P (1). We can therefore applyTheorem 7 which gives (cid:107) ( I n − U U T ) P (cid:107) op = o d, P (1) . Denote by U , ⊥ ∈ R n × ( n − m ) a matrix such that [ U , U , ⊥ ] is orthogonal, the last equation implies (cid:107) U T , ⊥ P (cid:107) op = o d, P (1). Further, P T U U T P = I m − P T ( I n − U U T ) P , which shows that (cid:107) P T U U T P − I m (cid:107) op = o d, P (1). This implies U T P is an approximately orthogonalmatrix. Namely, let its singular value decomposition be U T P = R SR T . Then, by defining the orthogonalmatrix R := R R T ∈ R m × m , we have (cid:107) P − U R (cid:107) op = o d, P (1). Step 3. The null space of the right eigenvectors Q . Let us explicitly describe the null space of Q ∈ R ( n + m ) × n (recall that we removed the N − ( n + m ) lastcoordinates of the columns). Consider N ∈ R m × m a rank m matrix and write N ∈ R n × m as a function of N such that ker( Q ) is spanned by the columns of the matrix N = (cid:20) N N (cid:21) ∈ R ( n + m ) × m , i.e., BN = , thatis (cid:2) U Σ + M M (cid:3) (cid:20) N N (cid:21) = ( U Σ + M ) N + M N = . Projecting on the two orthogonal subspaces U and U , ⊥ , this is equivalent to N = − ( Σ + U T M ) − U T M N , U T , ⊥ M N = − U T , ⊥ M N . (122)Let us do the following reparametrization N = M − ˜ N and fix N = − ( Σ + U T M ) − . Then Eq. (122)gives U T ˜ N = I m , U T , ⊥ ˜ N = U T , ⊥ M ( Σ + U T M ) − , N = U + U , ⊥ U T , ⊥ M ( Σ + U T M ) − , and N = − ( Σ + U T M ) − , N = M − U + M − U , ⊥ U T , ⊥ M ( Σ + U T M ) − . By the assumption λ min ( Σ ) = ω d, P (1) and Eq. (119), we have (cid:107) ( Σ + U T M ) − (cid:107) op = o d, P (1). Furthermore,from Eq. (120), we have (cid:107) M − − M T (cid:107) op = o d, P (1). We deduce that (cid:107) N T − (cid:2) n, m U T M (cid:3) (cid:107) op = o d, P (1) . (123) Step 4. Concluding the proof.
By construction, N T Q = and using Eq. (123), we get (cid:107) N T Q − (cid:2) n, m U T M (cid:3) Q (cid:107) op = (cid:107) (cid:2) n, m U T M (cid:3) Q (cid:107) op = o d, P (1) . (124)Furthermore using step 2 and recalling that (cid:107) M (cid:107) op = o d, P (1), (cid:107) P T M − (cid:2) n, m R T U T M (cid:3) (cid:107) op = o d, P (1) . (125)Combining Eqs. (124) and (125), we get (cid:107) RP T M Q − (cid:2) n, m U T M (cid:3) Q (cid:107) op = o d, P (1) , and (cid:107) P T M Q (cid:107) op = (cid:107) RP T M Q (cid:107) op = o d, P (1), which concludes the proof. B.3 Proof of Proposition 7: technical bounds in the overparametrized regime
We prove the claims of this proposition in a different order than stated.
B.3.1 Proof of claim ( c )First, notice that E [ (cid:107) f (cid:107) ] = n (cid:107) f d (cid:107) L . Hence, by Markov’s inequality, (cid:107) f (cid:107) = n (cid:107) f d (cid:107) L · O d, P (1).Let us now consider ψ T ≤ m f > m /n . For any η >
0, we have E (cid:104) (cid:107) ψ T ≤ m f > m (cid:107) (cid:105) /n = E x (cid:104)(cid:16) (cid:88) u ≥ m +1 ˆ f u ψ T u (cid:17) ψ ≤ m ψ T ≤ m (cid:16) (cid:88) v ≥ m +1 ˆ f v ψ v (cid:17)(cid:105) /n = (cid:88) u,v ≥ m +1 m (cid:88) s =0 (cid:88) i,j ∈ [ n ] (cid:110) E (cid:104) ψ u ( x i ) ψ s ( x i ) ψ s ( x j ) ψ v ( x j ) (cid:105) /n (cid:111) ˆ f u ˆ f v = (cid:88) u,v ≥ m +1 m (cid:88) s =0 (cid:88) i ∈ [ n ] (cid:110) E (cid:104) ψ u ( x i ) ψ s ( x i ) ψ s ( x i ) ψ v ( x i ) (cid:105) /n (cid:111) ˆ f u ˆ f v = 1 n m (cid:88) s =0 E x (cid:104)(cid:0) P > m f d ( x ) (cid:1) ψ s ( x ) (cid:105) ≤ n m (cid:88) s =0 (cid:107) P > m f d (cid:107) L η (cid:107) ψ s (cid:107) L (4+2 η ) /η ≤ ˜ C ( η ) m n (cid:107) P > m f d (cid:107) L η , where the last inequality uses the hypercontractivity assumption of Assumption 1.( a ): (cid:107) ψ s (cid:107) L (4+2 η ) /η = E x [ ψ s ( x ) · ηη ] η η ≤ C ((2 + η ) /η ) E x [ ψ s ( x ) ] = C ((2 + η ) /η ) , and ˜ C ( η ) = C ((2 + η ) /η ). By Markov’s inequality (using m ≤ n − δ in Assumption 2.( a ) for some fixed δ > (cid:107) ψ T ≤ m f > m /n (cid:107) = o d, P (1) · (cid:107) P > m f d (cid:107) L η . .3.2 Proof of Proposition 7. ( a )Throughout the proof, we will generically denote ∆ any matrix with (cid:107) ∆ (cid:107) op = o d, P (1). In particular, ∆ canchange from line to line. For convenience, we will use the notations introduced in Section B.2. Step 0. Bound (cid:107) ˜ Z ˆ U − λ (cid:107) op = κ − / > m · O d, P (1) . Recall the definition ˆ U λ = ˜ Z T ˜ Z + λ I N and the singular value decomposition ˜ Z = P Λ Q T . Hence, wecan rewrite ˜ Z ˆ U − λ = P ΛΛ + λ Q T , where we denoted by a slight abuse of notation Λ / ( Λ + λ ) := diag((Λ i / (Λ i + λ )) i ∈ [ n ] ). From Proposition6.( a ), σ min ( Λ ) = κ / > m · (1 + o d, P (1)). We deduce that (cid:107) ˜ Z ˆ U − λ (cid:107) op = κ − / > m · O d, P (1) . Step 1. Bound (cid:107) ˜ ψ T ≤ m ˜ Z ˆ U − λ ˜ φ ≤ m ˜ D ≤ m − I m (cid:107) op = o d, P (1) . First notice that ˜ φ ≤ m ˜ D ≤ m = ˜ Z T ≤ m ( ˜ ψ T ≤ m ) † = ( ˜ Z − ˜ Z > m ) T ( ˜ ψ T ≤ m ) † . Furthermore, by Eq. (105), we have( ˜ ψ T ≤ m ) † = ˜ ψ ≤ m + ∆ . Hence,˜ ψ T ≤ m ˜ Z ˆ U − λ ˜ φ ≤ m ˜ D ≤ m = ˜ ψ T ≤ m ˜ Z ˆ U − λ ˜ Z T ( ˜ ψ T ≤ m ) † − ˜ ψ T ≤ m ˜ Z ˆ U − λ ˜ Z T > m ( ˜ ψ T ≤ m ) † . (126)Let us decompose the first term along the large singular values Λ and small singular values Λ :˜ ψ T ≤ m ˜ Z ˆ U − λ ˜ Z T ( ˜ ψ T ≤ m ) † = ˜ ψ T ≤ m P Λ Λ + λ P T ( ˜ ψ T ≤ m ) † = ˜ ψ T ≤ m P Λ Λ + λ P T ( ˜ ψ T ≤ m ) † + ˜ ψ T ≤ m P Λ Λ + λ P T ( ˜ ψ T ≤ m ) † . From Eqs. (87) and (89) in Proposition 6 and the assumption in the theorem λ = O d (1) · κ > m , we have (cid:13)(cid:13)(cid:13) Λ Λ + λ − I m (cid:13)(cid:13)(cid:13) op = o d, P (1) , (cid:107) ˜ ψ T ≤ m P (cid:107) op = o d, P (1) . Hence, (cid:13)(cid:13)(cid:13) ˜ ψ T ≤ m P Λ Λ + λ P T ( ˜ ψ T ≤ m ) † (cid:13)(cid:13)(cid:13) op ≤ (cid:107) ˜ ψ T ≤ m P (cid:107) op (cid:107) ( ˜ ψ T ≤ m ) † (cid:107) op = o d, P (1) , and ˜ ψ T ≤ m P Λ Λ + λ P T ( ˜ ψ T ≤ m ) † = ˜ ψ T ≤ m P P T ( ˜ ψ T ≤ m ) † + ∆ = ˜ ψ T ≤ m P P T ( ˜ ψ T ≤ m ) † + ∆ (cid:48) = I m + ∆ (cid:48) , where (cid:107) ∆ (cid:107) op , (cid:107) ∆ (cid:48) (cid:107) op = o d, P (1). We deduce (cid:13)(cid:13)(cid:13) ˜ ψ T ≤ m ˜ Z ˆ U − λ ˜ Z T ( ˜ ψ T ≤ m ) † − I m (cid:13)(cid:13)(cid:13) op = o d, P (1) . (127)Consider the second term in Eq. 
(126):˜ ψ T ≤ m ˜ Z ˆ U − λ ˜ Z T > m ( ˜ ψ T ≤ m ) † = ˜ ψ T ≤ m P Λ Λ + λ Q T ˜ Z T > m ( ˜ ψ T ≤ m ) † + ˜ ψ T ≤ m P Λ Λ + λ Q T ˜ Z T > m ( ˜ ψ T ≤ m ) † . σ min ( Λ ) = κ / > m · ω d, P (1). Then, recalling that (cid:107) ˜ Z > m (cid:107) op = κ / > m · O d, P (1), we have (cid:13)(cid:13)(cid:13) ˜ ψ T ≤ m P Λ Λ + λ Q T ˜ Z T > m ( ˜ ψ T ≤ m ) † (cid:13)(cid:13)(cid:13) op ≤ (cid:107) ˜ ψ T ≤ m (cid:107) op (cid:107) Λ / ( Λ + λ ) (cid:107) op (cid:107) ˜ Z > m (cid:107) op (cid:107) ˜ ψ ≤ m + ∆ (cid:107) op = O d, P (1) · o d, P ( κ − / > m ) · O d, P ( κ / > m ) · O d, P (1) = o d, P (1) . By Eqs. (88) and (89) in Proposition 6, we get (cid:13)(cid:13)(cid:13) ˜ ψ T ≤ m P Λ Λ + λ Q T ˜ Z T > m ( ˜ ψ T ≤ m ) † (cid:13)(cid:13)(cid:13) op ≤ (cid:107) ˜ ψ T ≤ m P (cid:107) op (cid:107) Λ / ( Λ + λ ) (cid:107) op (cid:107) ˜ Z > m (cid:107) op (cid:107) ˜ ψ ≤ m + ∆ (cid:107) op = o d, P (1) · O d, P ( κ − / > m ) · O d, P ( κ / > m ) · O d, P (1) = o d, P (1) . We deduce that (cid:107) ˜ ψ T ≤ m ˜ Z ˆ U − λ ˜ Z T > m ( ˜ ψ T ≤ m ) † (cid:107) op = o d, P (1) . (128)Combining Eqs. (127) and (128) into Eq. (126) yields˜ ψ T ≤ m ˜ Z ˆ U − λ ˜ φ ≤ m ˜ D ≤ m = I m + ∆ , where (cid:107) ∆ (cid:107) op = o d, P (1). Step 2. Bound (cid:107) ˜ D ≤ m ˜ φ T ≤ m ˆ U − λ ˜ Z T f > m / √ n (cid:107) = (cid:107) P > m f d (cid:107) L η · o d, P (1) . Let us denote ˜ f > m = f > m / √ n for convenience. Let us use again that ˜ φ ≤ m ˜ D ≤ m = ( ˜ Z − ˜ Z > m ) T ( ˜ ψ T ≤ m ) † :˜ D ≤ m ˜ φ T ≤ m ˆ U − λ ˜ Z T ˜ f > m = ( ˜ ψ ≤ m ) † ˜ Z ˆ U − λ ˜ Z T ˜ f > m − ( ˜ ψ ≤ m ) † ˜ Z > m ˆ U − λ ˜ Z T ˜ f > m . (129)First notice that because (cid:107) ˜ ψ T ≤ m ˜ ψ ≤ m − I m (cid:107) op = o d, P (1), we have (cid:107) ˜ ψ T ≤ m P (cid:107) op = o d, P (1) in Proposition 6.( b )that implies (cid:107) ( ˜ ψ ≤ m ) † P (cid:107) op = o d, P (1) (for example by looking at the singular value decomposition of ˜ ψ ≤ m ).Similarly (cid:107) ˜ ψ T ≤ m ˜ f > m (cid:107) = (cid:107) P > m f d (cid:107) L η · o d, P (1) (Proposition 7.( c )) implies (cid:107) ( ˜ ψ ≤ m ) † ˜ f > m (cid:107) = (cid:107) P > m f d (cid:107) L η · o d, P (1). Using the same argument as in the proof of Eq. (127), we have (cid:107) ( ˜ ψ ≤ m ) † ˜ Z ˆ U − λ ˜ Z T ˜ f > m (cid:107) = (cid:13)(cid:13)(cid:13) ( ˜ ψ ≤ m ) † P Λ Λ + λ P T ˜ f > m (cid:13)(cid:13)(cid:13) ≤ (cid:107) ( ˜ ψ ≤ m ) † ˜ f > m (cid:107) + o d, P (1) · (cid:107) ( ˜ ψ ≤ m ) † (cid:107) op (cid:107) ˜ f > m (cid:107) + (cid:107) P > m f d (cid:107) L · O d, P (1) · (cid:107) ( ˜ ψ ≤ m ) † P (cid:107) op = (cid:107) P > m f d (cid:107) L η · o d, P (1) . (130)The second term (129) can be decomposed as( ˜ ψ ≤ m ) † ˜ Z > m ˆ U − λ ˜ Z T ˜ f > m = ( ˜ ψ ≤ m ) † ˜ Z > m Q Λ Λ + λ P T ˜ f > m + ( ˜ ψ ≤ m ) † ˜ Z > m Q Λ Λ + λ P T ˜ f > m . Using that σ min ( Λ ) = κ / > m · ω d, P (1) and (cid:107) ˜ Z > m (cid:107) op = κ / > m · O d, P (1) yields (cid:13)(cid:13)(cid:13) ( ˜ ψ ≤ m ) † ˜ Z > m Q Λ Λ + λ P T ˜ f > m (cid:13)(cid:13)(cid:13) op ≤ (cid:107) ( ˜ ψ ≤ m ) † ˜ Z > m Q (cid:107) op (cid:107) Λ / ( Λ + λ ) (cid:107) op (cid:107) P T ˜ f > m (cid:107) op = O d, P ( κ / > m ) · o d, P ( κ − / > m ) · O d, P ( (cid:107) P > m f d (cid:107) L )= (cid:107) P > m f d (cid:107) L · o d, P (1) . 
(131)50or the second term, recall that ( ˜ ψ ≤ m ) † = ˜ ψ T ≤ m + ∆ and introduce P P T = P P T + P P T = I n : (cid:13)(cid:13)(cid:13) ( ˜ ψ ≤ m ) † ˜ Z > m Q Λ Λ + λ P T ˜ f > m (cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13) ( ˜ ψ ≤ m ) † [ P P T + P P T ] ˜ Z > m Q Λ Λ + λ P T ˜ f > m (cid:13)(cid:13)(cid:13) op ≤ (cid:107) ( ˜ ψ ≤ m ) † P (cid:107) op (cid:107) P T ˜ Z > m Q (cid:107) op (cid:107) Λ / ( Λ + λ ) (cid:107) op (cid:107) P T ˜ f > m (cid:107) op + (cid:107) ( ˜ ψ ≤ m ) † P (cid:107) op (cid:107) P T ˜ Z > m Q (cid:107) op (cid:107) Λ / ( Λ + λ ) (cid:13)(cid:13)(cid:13) op (cid:107) P T ˜ f > m (cid:107) op = o d, P ( κ / > m ) · O d, P ( κ / > m ) · (cid:107) P > m f d (cid:107) L = (cid:107) P > m f d (cid:107) L · o d, P (1) . (132)where we used Eq. (90) in Proposition 6, and σ min ( Λ ) = κ − / > m · Ω d, P (1) to obtain the second to last line.Combining Eqs. (130), (131) and (132) yields the result. Step 3. Bound √ n (cid:107) Z ˆ U − λ φ ≤ m D ≤ m /N (cid:107) op = O d, P (1) . First notice that (cid:107) ˜ Z > m ˜ φ ≤ m (cid:107) op = κ / > m · o d, P (1) implies (cid:107) ˜ Z > m ( ˜ φ T ≤ m ) † (cid:107) op = κ / > m · o d, P (1), where we usedthat (cid:107) ˜ φ T ≤ m ˜ φ ≤ m − I m (cid:107) op = o d, P (1).Using ˜ φ ≤ m ˜ D ≤ m = ( ˜ Z − ˜ Z > m )( ˜ φ T ≤ m ) † , we have (cid:107) ˜ Z ˆ U − λ ˜ φ ≤ m ˜ D ≤ m (cid:107) op ≤ (cid:107) ˜ Z ˆ U − λ ˜ Z ( ˜ φ T ≤ m ) † (cid:107) op + (cid:107) ˜ Z ˆ U − λ ˜ Z > m ( ˜ φ T ≤ m ) † (cid:107) op ≤ (cid:107) Λ / ( Λ + λ ) (cid:107) op (cid:107) ( ˜ φ T ≤ m ) † (cid:107) op + (cid:107) ˜ Z ˆ U − λ (cid:107) op (cid:107) ˜ Z > m ( ˜ φ T ≤ m ) † (cid:107) op = O d, P (1) + O d, P ( κ − / > m ) · o d, P ( κ / > m )= O d, P (1) , which concludes the proof of the claims in Proposition 7.( a ). B.3.3 Proof of Proposition 7. ( b )Denote D m : M =diag( λ d, m +1 , λ d, m +2 , . . . , λ d, M ) ∈ R ( M − m ) × ( M − m ) , φ m : M =( φ k ( θ i )) i ∈ [ N ] ,k = m +1 ,..., M ∈ R N × ( M − m ) . Applying Theorem 6 to U > m (where the assumptions are satisfied by Assumptions 1.( a ) and ( b ) and As-sumption 2.( a )), we get with Assumption 1.( d ), U > m = φ m : M D m : M φ Tm : M + κ > M ( I N + ∆ ) , where (cid:107) ∆ (cid:107) op = o d, P (1) and κ > M = Tr( H d,> M ). By assumption, we have N ≥ n δ for some fixed δ > nN (cid:107) κ > M ( I N + ∆ ) (cid:107) op = κ > M · o d, P (1) . (133)By Proposition 3 (assumptions satisfied by Assumptions 1.( a ) and 2.( a )), we get (cid:107) φ Tm : M φ m : M /N − I M − m (cid:107) op = o d, P (1) . Furthermore, by Assumption 2.( a ), we have n δ · (cid:107) H d,> m (cid:107) op = O d (1) · κ > m for a fixed δ >
0. Therefore n (cid:107) D m : M (cid:107) op = κ > m · o d (1). Hence, nN (cid:107) φ m : M D m : M φ Tm : M (cid:107) op ≤ (cid:107) φ m : M / √ N (cid:107) (cid:107) n D m : M (cid:107) op = κ > m · o d, P (1) . (134)51ombining Eqs. (133) and (134) yields nN (cid:107) U > m (cid:107) op = κ > m · o d, P (1) . B.3.4 Proof of Proposition 7. ( d )Recall V > m = ∞ (cid:88) k = m +1 ˆ f d,k λ d,k φ k . Taking the expectation over ( θ , . . . , θ N ), we get nN E [ (cid:107) V > m (cid:107) ] = n (cid:88) k ≥ m +1 λ d,k ˆ f d,k ≤ n · (cid:107) H d,> m (cid:107) op · (cid:107) P > m f d (cid:107) L . From condition (19) in Assumption 2.( a ), we have n δ (cid:107) H d,> m (cid:107) op = O d (1) · κ > m , and we conclude withMarkov’s inequality that (cid:114) nN (cid:107) V > m (cid:107) = √ κ > m (cid:107) P > m f d (cid:107) L · o d, P (1) . B.3.5 Bounds in the underparametrized regime
In the underparametrized case, we further prove the following lemma.
Lemma 10.
Follow the assumptions of Theorem 1 in the underparametrized case as well as the notationsin Section B.1. Then, we have (cid:107) Z T > M f > M /n (cid:107) = κ / > M · (cid:107) P > M f d (cid:107) L η · o d, P (1) , (135) (cid:107) ˆ U − λ Z T f /n (cid:107) op = κ − / > M · o d, P (1) · ( (cid:107) f d (cid:107) L + (cid:107) P > M f d (cid:107) L η ) . (136) Proof of Lemma 10.
Step 1. Bound (cid:107) Z T > M f > M /n (cid:107) = κ / > M · (cid:107) P > M f d (cid:107) L η · o d, P (1) . Recall the decomposition of Z > M in the eigenbasis of functions: Z > M = ∞ (cid:88) k = M +1 λ d,k ψ k φ T k . Consider the expected square norm (with respect to Θ = ( θ j ) j ∈ [ N ] ) E Θ (cid:2) (cid:107) Z T > M f > M (cid:107) (cid:3) = ∞ (cid:88) k,(cid:96) = M +1 λ d,k λ d,(cid:96) E Θ [ f T > M ψ d,k φ T d,k φ d,(cid:96) ψ d,(cid:96) f > M ]= N ∞ (cid:88) k = M +1 λ d,k ( f T > M ψ d,k ) where we used that E Θ [ φ T d,k φ d,(cid:96) ] = N δ k,(cid:96) by orthonormality of { φ d,k } k ≥ . Expanding with respect to the x i ’s, we get E Θ (cid:2) (cid:107) Z T > M f > M (cid:107) (cid:3) = N (cid:88) i ∈ [ n ] (cid:8) H d,> M : m ( x i , x i )[ P > M f d ( x i )] + H d,> m ( x i , x i )[ P > M f d ( x i )] (cid:111) + N (cid:88) i (cid:54) = j ∈ [ n ] ∞ (cid:88) k = M +1 λ d,k ψ d,k ( x i ) P > M f d ( x i ) · ψ d,k ( x j ) P > M f d ( x j ) , H d, M : u ( x i , x i ) = u (cid:88) k = M +1 λ d,k ψ d,k ( x i ) ,H d,>u ( x i , x i ) = ∞ (cid:88) k = u +1 λ d,k ψ d,k ( x i ) . Consider the first term depending on H d, M : m . Using the same computation as in the proof of Proposition7.( c ) and Lemma 6 (with the hypercontractivity assumption up to u ≥ m of Assumption 1.( a )), by H¨older’sinequality we have for the q E (cid:2) H d, M : m ( x , x )[ P > M f d ( x )] (cid:3) ≤ (cid:107) H d, M : m (cid:107) L /η (cid:107) P > M f d (cid:107) L η ≤ C (1 + 2 /η ) · E x [ H d, M : m ( x , x )] · (cid:107) P > M f d (cid:107) L η . We deduce by Markov’s inequality that the first term is bounded by (cid:88) i ∈ [ n ] H d, M : m ( x i , x i )[ P > M f d ( x i )] = O d, P (1) · n · Tr( H d, M : m ) · (cid:107) P > M f d (cid:107) L η . (137)For the second term, recall that by Assumption 1.( d ), we havemax x i ∈ [ n ] H d,> m ( x i , x i ) = O d, P (1) · Tr( H d,> m ) . Hence (cid:88) i ∈ [ n ] H d,> m ( x i , x i )[ P > M f d ( x i )] = O d, P (1) · Tr( H d,> m ) · (cid:88) i ∈ [ n ] [ P > M f d ( x i )] , and by Markov’s inequality (cid:88) i ∈ [ n ] H d,> m ( x i , x i )[ P > M f d ( x i )] = O d, P (1) · n · Tr( H d,> m ) · (cid:107) P > M f d (cid:107) L . (138)Taking the expectation of the third term gives n ( n − ∞ (cid:88) k = M +1 λ d,k E (cid:2) ψ d,k ( x )[ P > M f d ( x )] (cid:3) = n ( n − ∞ (cid:88) k = M +1 λ d,k ˆ f d,k ≤ n ( n − (cid:107) H d,>u (cid:107) op (cid:107) P > M f d (cid:107) L . (139)Merging Eqs. (137), (138) and (139), we get E Θ (cid:2) (cid:107) Z T > M f > M /n (cid:107) (cid:3) ≤ Nn · O d, P (1) · Tr( H d,> M ) · (cid:107) P > M f d (cid:107) L η + N (cid:107) H d,>u (cid:107) op · (cid:107) P > M f d (cid:107) L = o d (1) · Tr( H d,> M ) · (cid:107) P > M f d (cid:107) L η , where we used Assumption 2.( b ) ( N · (cid:107) H d,>u (cid:107) op = O d, P ( N − δ )Tr( H d,>u ) as well as n ≥ N δ for a fixed δ > Step 2. Bound on (cid:107) ˆ U − λ Z T f /n (cid:107) . By Proposition 6.( a ) in the underparametrized case, we haveˆ U − λ Z T f /n = Q Λ Λ + λ P T f / √ n + Q Λ Λ + λ P T f / √ n, (140)53here σ min ( Λ ) = ω d, P (1) · κ / > M and σ min ( Λ ) = κ / > M · (1 + o d, P (1)). In particular, this shows that (cid:13)(cid:13)(cid:13) Q Λ Λ + λ P T f / √ n (cid:13)(cid:13)(cid:13) ≤ σ min ( Λ ) − (cid:107) f / √ n (cid:107) ≤ o d, P (1) · κ − / > M · (cid:107) f d (cid:107) L . (141)For the second term (140), decompose f = f ≤ M + f > M . Recall f ≤ M = ψ ≤ M ˆ f ≤ M . By Proposition 6.( b ), wehave (cid:107) P T ψ ≤ m / √ n (cid:107) op = o d, P (1). 
Furthermore, using Eq. (135), namely (cid:107) Z T > M f > M /n (cid:107) = κ / > M ·(cid:107) P > M f d (cid:107) L η · o d, P (1), we get (cid:107) P T f > M / √ n (cid:107) op = (cid:107) P > M f d (cid:107) L η · o d, P (1). We deduce (cid:13)(cid:13)(cid:13) Q Λ Λ + λ P T f / √ n (cid:13)(cid:13)(cid:13) ≤ σ min ( Λ ) − ( (cid:107) P T f ≤ M / √ n (cid:107) + (cid:107) P T f > M / √ n (cid:107) )= O d, P ( κ − / > M ) · o d, P (1) · ( (cid:107) f d (cid:107) L + (cid:107) P > M f d (cid:107) L η )= κ − / > M ( (cid:107) f d (cid:107) L + (cid:107) P > M f d (cid:107) L η ) · o d, P (1) . (142)Combining Eqs. (141) and (142) yields Eq. (136). B.4 Concentration of the random features kernel matrix Z T Z We recall the following standard result on concentration of random matrices with independent rows:
Lemma 11 ([Ver10] Theorem 5.45) . Let A be a p × q matrix whose rows a i are independent random vectorsin R q with common second moment matrix Σ = E [ a i ⊗ a i ] . Let Γ := E [max i ∈ [ p ] (cid:107) a i (cid:107) ] . Then E (cid:2) (cid:107) A T A /p − Σ (cid:107) op (cid:3) ≤ max( (cid:107) Σ (cid:107) / η, η ) , where η = C (cid:113) Γ log(min( p,q )) p and C is an absolute constant. We will also use the following corollary for asymmetric matrices:
Corollary 1.
Let A be a n × N matrix whose rows a i are independent random vectors in R N with commonsecond moment matrix Σ a = E [ a i ⊗ a i ] . Let B be a n × m matrix whose rows b i are independent randomvectors in R m with common second moment matrix Σ b = E [ b i ⊗ b i ] . Let Γ a := E [max i ∈ [ n ] (cid:107) a i (cid:107) ] and Γ b := E [max i ∈ [ n ] (cid:107) b i (cid:107) ] . Denote Σ ab = E [ a i ⊗ b i ] . Then, E (cid:2) (cid:107) A T B /n − Σ ab (cid:107) op (cid:3) ≤ max (cid:0) ( (cid:107) Σ a (cid:107) op + (cid:107) Σ b (cid:107) op ) / η, η (cid:1) , (143) where η = C (cid:113) (Γ a +Γ b ) log(min( n,N, m )) n and C is an absolute constant.Proof of Corollary 1. Define C = [ A , B ] ∈ R n × ( N + m ) whose rows c i = [ a i , b i ] are independent randomvectors in R N + m with common second matrix Σ c = (cid:20) Σ a Σ ab Σ ba Σ b (cid:21) . By Lemma 11, we have E (cid:2) (cid:107) C T C /n − Σ c (cid:107) op (cid:3) ≤ max( (cid:107) Σ c (cid:107) / η, η ) , where η = C (cid:113) Γ log(min( n,N + m )) n withΓ = E [max i ∈ [ n ] (cid:107) c i (cid:107) ] ≤ E [max i ∈ [ n ] (cid:107) a i (cid:107) ] + E [max i ∈ [ n ] (cid:107) b i (cid:107) ] ≤ Γ a + Γ b . Notice that (cid:107) Σ c (cid:107) op ≤ C ( (cid:107) Σ a (cid:107) op + (cid:107) Σ b (cid:107) op ), and (cid:107) A T B /n − Σ ab (cid:107) op ≤ (cid:107) C T C /n − Σ c (cid:107) op . Combining these bounds yields Eq. (143). 54onsider the feature matrix Z = ( σ d ( x i ; θ j )) i ∈ [ n ] ,j ∈ [ N ] . We recall the decomposition Z = Z ≤ m + Z > m into a low and high degree parts: Z ≤ m = ψ ≤ m D ≤ m φ T ≤ m , Z > m = (cid:88) k ≥ m +1 λ d,k ψ k φ T k . We prove the following concentration result on Z > m . Proposition 8 (Concentration Z matrix) . Consider the overparametrized case N ( d ) ≥ n ( d ) δ for somefixed δ > . Let { σ d } d ≥ be a sequence of activation functions satisfying the feature map concentration(Assumption 1) and the spectral gap (Assumption 2) at level { ( N ( d ) , M ( d ) , n ( d ) , m ( d )) } d ≥ . Then, we have Z > m Z T > m N = κ > m · ( I n + ∆ Z ) , (144) where κ > m = Tr( H d,> m ) and (cid:107) ∆ Z (cid:107) op = o d, P (1) . Furthermore, (cid:13)(cid:13)(cid:13) Z > m φ ≤ m N (cid:13)(cid:13)(cid:13) op = κ / > m · o d, P (1) . (145) Proof of Proposition 8.
For convenience, we will drop the subscript d . Step 1. Bound on (cid:107) Z > m Z T > m /N − κ > m I n (cid:107) op . Denote A T = Z > m = [ a , . . . , a N ] ∈ R n × N with a i = ( σ > m ( x ; θ i ) , . . . , σ > m ( x n ; θ i )) ∈ R n . Conditionedon ( x , . . . , x n ), the rows a i are independent with common second moment matrix E x [ a i ⊗ a i ] = H > m , where H > m = ( H > m ,ij ) ≤ i,j ≤ N with H > m ,ij = E θ [ σ > m ( x i ; θ ) σ > m ( x j ; θ )]. By applying Theorem 6 to thekernel matrix H > m (assumptions satisfied by Assumptions 1 and 2), we have H > m = κ > m · ( I n + ∆ H ) where (cid:107) ∆ H (cid:107) op = o d, P (1). Therefore it is sufficient to show that (cid:13)(cid:13)(cid:13) Z > m Z T > m N − H > m (cid:13)(cid:13)(cid:13) op = o d, P (1) . Let us decompose σ > m into a low and high degree parts σ > m = σ m : u + σ >u (recall that u ( d ) > m ( d )): σ m : u ( x ; θ ) = u (cid:88) k = m +1 λ d,k ψ k ( x ) φ k ( θ ) ,σ >u ( x ; θ ) = ∞ (cid:88) k = u +1 λ d,k ψ k ( x ) φ k ( θ ) . Let a i = ( σ m : u ( x ; θ i ) , . . . , σ m : u ( x n ; θ i )) ∈ R n and a i = ( σ >u ( x ; θ i ) , . . . , σ >u ( x n ; θ i )) ∈ R n , a i = a i + a i .Then Γ = E θ [max i ∈ [ N ] (cid:107) a i (cid:107) ] ≤ E θ [max i ∈ [ N ] (cid:107) a i (cid:107) ] + 2 E θ [max i ∈ [ N ] (cid:107) a i (cid:107) ] . Let q > c ). We have E θ (cid:104) max i ∈ [ N ] (cid:107) a i (cid:107) (cid:105) ≤ E θ (cid:104) max i ∈ [ N ] (cid:107) a i (cid:107) q (cid:105) /q ≤ N /q E θ [ (cid:107) a i (cid:107) q ] /q .
55y Jensen’s inequality and Assumption 1.( c ), there exists a fixed δ > E x , θ (cid:104) (cid:107) a i (cid:107) q (cid:105) = E x , θ (cid:104)(cid:16) (cid:88) j ∈ [ n ] σ >u ( x j ; θ ) (cid:17) q (cid:105) ≤ n q − E x , θ (cid:104) (cid:88) j ∈ [ n ] σ >u ( x j ; θ ) q (cid:105) ≤ n q E x , θ [ σ >u ( x j ; θ ) q ] = O d (1) · n q (1+2 δ ) · κ q>u , where κ >u = Tr( H >u ) = (cid:80) ∞ k = u +1 λ k . Hence, by Markov’s inequality, we get E θ [max i ∈ [ N ] (cid:107) a i (cid:107) ] = O d, P (1) · N /q n δ · κ >u . (146)Similarly, by the hypercontractivity assumption (Assumption 1.( a )), we have E x (cid:104) E θ (cid:104) max i ∈ [ N ] (cid:107) a i (cid:107) (cid:105)(cid:105) ≤ C q N /q E x , θ (cid:2) (cid:107) a i (cid:107) (cid:3) = C q N /q n · κ m : u , where κ m : u = (cid:80) uk = m +1 λ k . Hence, by Markov’s inequality, we get E θ (cid:104) max i ∈ [ N ] (cid:107) a i (cid:107) (cid:105) = O d, P (1) · N /q n · κ m : u . (147)Combining Eqs. (146) and (147), we getΓ a = O d, P (1) · N /q n δ κ > m . We can therefore apply Lemma 11. Recalling (cid:107) H > m (cid:107) op = O d, P (1) · κ > m , we have E θ (cid:104)(cid:13)(cid:13) Z > m Z T > m /N − H > m (cid:13)(cid:13) op (cid:105) ≤ O d, P (1) · max( κ / > m η, η ) , with η = (cid:0) κ > m N /q − n δ log( N ) (cid:1) / = κ / > m · o d, P (1) by the choice of q in Assumption 1.( c ). We conclude (cid:13)(cid:13) Z > m Z T > m /N − H > m (cid:13)(cid:13) op = κ > m · o d, P (1) . Step 2. Bound on (cid:107) Z > m φ ≤ m /N (cid:107) op . Consider B = κ / > m φ ≤ m = [ b , . . . , b N ] T R N × m where b i = κ / > m [ φ ( θ i ) , . . . , φ m ( θ i )] ∈ R m are independentrows with second moment matrix Σ b = E [ b i ⊗ b i ] = κ > m I m . Furthermore, by the hypercontractivityassumption (Assumption 1.( a )), we haveΓ b = E θ (cid:104) max i ∈ [ N ] (cid:107) b i (cid:107) (cid:105) ≤ C q N /q E θ (cid:104) (cid:107) b i (cid:107) (cid:105) = C q N /q m · κ > m . Notice that E [ a i ⊗ b i ] = 0. Furthermore, recalling the previous step, we have (cid:107) Σ a (cid:107) op = (cid:107) H > m (cid:107) op = O d, P (1) · κ > m and (cid:107) Σ a (cid:107) op + (cid:107) Σ b (cid:107) op = O d, P (1) · κ > m , Γ a + Γ b = O d, P (1) · N /q n δ · κ > m . Then by Corollary 1 applied to A T B /N and recalling the assumption on q in Assumption 1.( c ), we have E θ [ (cid:107) Z m φ ≤ m /N (cid:107) op ] = o d, P (1) · κ / > m , which concludes the proof by Markov’s inequality. 56 Generalization error of kernel ridge regression: Proof of Theorem 4
In this section, we prove Theorem 4. We will then prove a different version of the same theorem in SectionC.2, under somewhat different assumptions. Namely, we will relax Assumption 4.( c ) and instead impose agap condition on the eigenvalues of the kernel. C.1 Proof of Theorem 4
In this section, we prove Theorem 8. Throughout the proof, we will denote ∆ any matrix with (cid:107) ∆ (cid:107) op = o d, P (1). In particular, ∆ can change from one line to line. We defer the proofs of some more technical resultsto Section C.1.1. Step 1. Expressing the risk in terms of empirical kernel matrix.
Recall that the KRR estimator is given byˆ f λ ( x ) = y T ( H + λ I N ) − h ( x ) , where y = ( y , . . . , y n ) and H = ( H ( x i , x j )) i,j ∈ [ n ] , h ( x ) = ( H d ( x , x ) , . . . , H d ( x , x n )) ∈ R n . The resultingtest error is R KR ( f d , X , λ ) ≡ E x (cid:104)(cid:16) f d ( x ) − y T ( H + λ I n ) − h ( x ) (cid:17) (cid:105) = E x [ f d ( x ) ] − y T ( H + λ I n ) − E + y T ( H + λ I n ) − M ( H + λ I n ) − y , where E = ( E , . . . , E n ) T , M = ( M ij ) ij ∈ [ n ] and H = ( H ij ) ij ∈ [ n ] are defined by E i = E x [ f d ( x ) H d ( x , x i )] ,M ij = E x [ H d ( x i , x ) H d ( x j , x )] ,H ij = H d ( x i , x j ) . We recall that the eigendecomposition of H d is given by H d ( x , y ) = ∞ (cid:88) k =1 λ d,k ψ k ( x ) ψ k ( y ) . We write the orthogonal decomposition of f d in the basis { ψ k } k ≥ as f d ( x ) = ∞ (cid:88) k =1 ˆ f d,k ψ k ( x ) . Define ψ k = ( ψ k ( x ) , . . . , ψ k ( x n )) T ∈ R n , D ≤ m = diag( λ d, , λ d, , . . . , λ d, m ) ∈ R m × m , Ψ ≤ m = ( ψ k ( x i )) i ∈ [ n ] ,k ∈ [ m ] ∈ R n × m , ˆ f ≤ m = ( ˆ f d, , ˆ f d, , . . . , ˆ f d, m ) T ∈ R m .
57e decompose the vectors and matrices f , E , H , and M in terms of orthogonal basis f = f ≤ m + f >m , f ≤ m = Ψ ≤ m ˆ f ≤ m , f > m = ∞ (cid:88) k = m +1 ˆ f d,k ψ k , E = E ≤ m + E >m , E ≤ m = Ψ ≤ m D ≤ m ˆ f ≤ m , E > m = ∞ (cid:88) k = m +1 λ d,k ˆ f d,k ψ k , H = H ≤ m + H >m , H ≤ m = Ψ ≤ m D ≤ m Ψ T ≤ m , H > m ∞ (cid:88) k = m +1 λ d,k ψ k ψ T k , M = M ≤ m + M >m , M ≤ m = Ψ ≤ m D ≤ m Ψ T ≤ m , M > m = ∞ (cid:88) k = m +1 λ d,k ψ k ψ T k . (148)Applying Theorem 6 with respect to the operator H d and H d where the assumptions are satisfied byAssumptions 4.( a ), 4.( b ), cf. Eqs. (40) and (42), and 5.( a ), cf. Eq. (46), and using Assumption 4.( c ), thekernel matrices H and M can be rewritten as H = Ψ ≤ m D ≤ m Ψ T ≤ m + κ H ( I + ∆ H ) , M = Ψ ≤ m D ≤ m Ψ T ≤ m + κ M ( I + ∆ M ) , (149)where κ H = Tr( H d,> m ) = (cid:88) k ≥ m +1 λ d,k ,κ M = Tr( H d,> m ) = (cid:88) k ≥ m +1 λ d,k , and max {(cid:107) ∆ M (cid:107) op , (cid:107) ∆ H (cid:107) op } = o d, P (1) . (150)Let us introduce the shrinkage matrix S ≤ m = (cid:16) I m + κ H + λn D − ≤ m (cid:17) − = diag(( s j ) j ∈ [ m ] ) where s j = λ d,j λ d,j + κ H + λn . (151) Step 2. Decompose the risk
Recalling y = f + ε , we decompose the risk as follows R KR ( f d , X , λ ) = (cid:107) f d (cid:107) L − T + T + T − T + 2 T . where T = f T ( H + λ I n ) − E ,T = f T ( H + λ I n ) − M ( H + λ I n ) − f ,T = ε T ( H + λ I n ) − M ( H + λ I n ) − ε ,T = ε T ( H + λ I n ) − E ,T = ε T ( H + λ I n ) − M ( H + λ I n ) − f . Step 3. Term T Note we have T = T + T + T , T = f T ≤ m ( H + λ I n ) − M ( H + λ I n ) − f ≤ m ,T = 2 f T ≤ m ( H + λ I n ) − M ( H + λ I n ) − f > m ,T = f T > m ( H + λ I n ) − M ( H + λ I n ) − f > m . (152)By Lemma 12 which is stated in Section C.1.1 below, we have (cid:107) n ( H + λ I n ) − M ( H + λ I n ) − − Ψ ≤ m S ≤ m Ψ T ≤ m /n (cid:107) op = o d, P (1) , (153)hence T = ˆ f T ≤ m Ψ T ≤ m ( H + λ I n ) − M ( H + λ I n ) − Ψ ≤ m ˆ f ≤ m = ˆ f T ≤ m Ψ T ≤ m Ψ T ≤ m S ≤ m Ψ ≤ m Ψ ≤ m ˆ f ≤ m /n + [ (cid:107) Ψ ≤ m ˆ f ≤ m (cid:107) /n ] · o d, P (1) . By Assumption 4.( a ), the conditions of Theorem 6.( b ) are satisfied, and we have (with (cid:107) ∆ (cid:107) op = o d, P (1))ˆ f T ≤ m Ψ T ≤ m Ψ ≤ m S ≤ m Ψ T ≤ m Ψ ≤ m ˆ f ≤ m /n = ˆ f T ≤ m ( I + ∆ ) S ≤ m ( I + ∆ ) ˆ f ≤ m = (cid:107) S ≤ m ˆ f ≤ m (cid:107) + o d, P (1) · (cid:107) ˆ f ≤ m (cid:107) . Moreover, we have (cid:107) Ψ ≤ m ˆ f ≤ m (cid:107) /n = ˆ f T ≤ m ( I + ∆ ) ˆ f ≤ m = (cid:107) ˆ f ≤ m (cid:107) (1 + o d, P (1)) . As a result, we have T = (cid:107) S ≤ m ˆ f ≤ m (cid:107) + o d, P (1) · (cid:107) ˆ f ≤ m (cid:107) = (cid:107) S ≤ m ˆ f ≤ m (cid:107) + o d, P (1) · (cid:107) P ≤ m f d (cid:107) L . (154)By Eq. (153) again, we have T = (cid:16) (cid:88) k ≥ m +1 ˆ f k ψ T k (cid:17) ( H + λ I n ) − M ( H + λ I n ) − (cid:16) (cid:88) k ≥ m +1 ψ k ˆ f k (cid:17) = (cid:16) (cid:88) k ≥ m +1 ˆ f k ψ T k (cid:17) Ψ ≤ m S ≤ m Ψ T ≤ m (cid:16) (cid:88) k ≥ m +1 ψ k ˆ f k (cid:17) /n + (cid:104)(cid:13)(cid:13)(cid:13) (cid:88) k ≥ m +1 ψ k ˆ f k (cid:13)(cid:13)(cid:13) /n (cid:105) · o d, P (1) . Note that S ≤ m (cid:22) I m and we have E (cid:104)(cid:16) (cid:88) k ≥ m +1 ˆ f k ψ T k (cid:17) Ψ ≤ m S ≤ m Ψ T ≤ m (cid:16) (cid:88) k ≥ m +1 ψ k ˆ f k (cid:17)(cid:105) /n ≤ E (cid:104)(cid:16) (cid:88) k ≥ m +1 ˆ f k ψ T k (cid:17) Ψ ≤ m Ψ T ≤ m (cid:16) (cid:88) k ≥ m +1 ψ k ˆ f k (cid:17)(cid:105) /n = (cid:88) u,v ≥ m +1 m (cid:88) s =1 (cid:88) i,j ∈ [ n ] (cid:110) E (cid:104) ψ u ( x i ) ψ s ( x i ) ψ s ( x j ) ψ v ( x j ) (cid:105) /n (cid:111) ˆ f v ˆ f u = (cid:88) u,v ≥ m +1 m (cid:88) s =1 (cid:88) i ∈ [ n ] (cid:110) E (cid:104) ψ u ( x i ) ψ s ( x i ) ψ s ( x i ) ψ v ( x i ) (cid:105) /n (cid:111) ˆ f v ˆ f u = 1 n m (cid:88) s =1 E x (cid:104)(cid:0) P > m f d ( x ) (cid:1) ψ s ( x ) (cid:105) ≤ n m (cid:88) s =1 (cid:107) P > m f d (cid:107) L η (cid:107) ψ s (cid:107) L (4+2 η ) /η ≤ C ( η ) m n (cid:107) P > m f d (cid:107) L η , where the last inequality used the hypercontractivity assumption as in Assumption 4.( a ). Moreover E (cid:104) n (cid:13)(cid:13)(cid:13) (cid:88) k ≥ m +1 ψ k ˆ f k (cid:13)(cid:13)(cid:13) (cid:105) = ∞ (cid:88) k = m +1 ˆ f k = (cid:107) P > m f d (cid:107) L . m ( d ) ≤ n ( d ) − δ by Assumption 5.( b ), T = o d, P (1) · (cid:107) P > m f d (cid:107) L η . (155)Using Cauchy-Schwarz inequality for T , we get T ≤ T T ) / = o d, P (1) · (cid:107) P ≤ m f d (cid:107) L (cid:107) P > m f d (cid:107) L η . (156)As a result, combining Eqs. (154), (155) and (156), we have T = (cid:107) S ≤ m ˆ f ≤ m (cid:107) + o d, P (1) · ( (cid:107) f d (cid:107) L + (cid:107) P > M f d (cid:107) L η ) . (157) Step 4. Term T . 
We have T = T + T + T , where T = f T ≤ m ( H + λ I n ) − E ≤ m ,T = f T > m ( H + λ I n ) − E ≤ m ,T = f T ( H + λ I n ) − E > m . By Lemma 13 stated in Section C.1.1 below, we have (cid:107) Ψ T ≤ m ( H + λ I n ) − Ψ ≤ m D ≤ m − S ≤ m (cid:107) op = o d, P (1) , so that T = ˆ f T ≤ m Ψ T ≤ m ( H + λ I n ) − Ψ ≤ m D ≤ m ˆ f ≤ m = (cid:107) S / ≤ m ˆ f ≤ m (cid:107) + o d, P (1) · (cid:107) ˆ f ≤ m (cid:107) = (cid:107) S / ≤ m ˆ f ≤ m (cid:107) + o d, P (1) · (cid:107) P ≤ m f d (cid:107) . (158)Using Cauchy-Schwarz inequality for T , and by the expression M = Ψ ≤ m D ≤ m Ψ T ≤ m + κ M ( I M + ∆ M ),cf. Eq. (149), we get with high probability | T | = (cid:12)(cid:12)(cid:12) ∞ (cid:88) k = m +1 ˆ f k ψ T k ( H + λ I n ) − Ψ ≤ m D ≤ m ˆ f ≤ m (cid:12)(cid:12)(cid:12) ( a ) ≤ (cid:13)(cid:13)(cid:13) ∞ (cid:88) k = m +1 ˆ f k ψ T k ( H + λ I n ) − Ψ ≤ m D ≤ m (cid:13)(cid:13)(cid:13) (cid:107) ˆ f ≤ m (cid:107) b ) = (cid:104)(cid:16) ∞ (cid:88) k = m +1 ˆ f k ψ T k (cid:17) ( H + λ I n ) − Ψ ≤ m D ≤ m Ψ T ≤ m ( H + λ I n ) − (cid:16) ∞ (cid:88) k = m +1 ˆ f k ψ k (cid:17)(cid:105) / (cid:107) ˆ f ≤ m (cid:107) c ) ≤ (cid:104)(cid:16) ∞ (cid:88) k = m +1 ˆ f k ψ T k (cid:17) ( H + λ I n ) − M ( H + λ I n ) − (cid:16) ∞ (cid:88) k = m +1 ˆ f k ψ k (cid:17)(cid:105) / (cid:107) ˆ f ≤ m (cid:107) d ) = T / (cid:107) ˆ f ≤ m (cid:107) e ) = o d, P (1) · (cid:107) P ≤ m f d (cid:107) L (cid:107) P > m f d (cid:107) L η . (159)Here ( a ) follows by Cauchy-Schwarz; ( b ) by the definition of norm; ( c ) because M (cid:23) Ψ ≤ m D ≤ m Ψ T ≤ m + κ M ( I + ∆ M ) (cid:23) Ψ ≤ m D ≤ m Ψ T ≤ m ; ( d ) by the definition of T as in Eq. (152); ( e ) by Eq. (155).For term T , we have | T | = | f T ( H + λ I n ) − E > m | ≤ (cid:107) f (cid:107) (cid:107) ( H + λ I n ) − (cid:107) op (cid:107) E > m (cid:107) . E [ (cid:107) f (cid:107) ] = n (cid:107) f d (cid:107) L . Further by Eq. (149), we have (cid:107) ( H + λ I n ) − (cid:107) op ≤ / ( κ H + λ ) withhigh probability. Finally, we have E [ (cid:107) E > m (cid:107) ] = n ∞ (cid:88) k = m +1 λ d,k ˆ f k ≤ n (cid:104) max k ≥ m +1 λ d,k (cid:105) (cid:107) P > m f d (cid:107) L . As a result, we have | T | ≤ O d, P (1) · (cid:107) P > m f d (cid:107) L (cid:107) f d (cid:107) L (cid:104) n max k ≥ m +1 λ d,k (cid:105) / / ( κ H + λ )= O d, P (1) · (cid:107) P > m f d (cid:107) L (cid:107) f d (cid:107) L (cid:104) n max k ≥ m +1 λ d,k (cid:105) / (cid:16) (cid:88) k ≥ m +1 λ d,k + λ (cid:17) = o d, P (1) · (cid:107) P > m f d (cid:107) L (cid:107) f d (cid:107) L , (160)where the last equality used Eq. (46) in Assumption 5.( a ) and the fact that λ ∈ [0 , Tr( H d,> m )]. CombiningEqs. (158), (159) and (160), we get T = (cid:107) S / ≤ m ˆ f ≤ m (cid:107) + o d, P (1) · ( (cid:107) f d (cid:107) L + (cid:107) P > M f d (cid:107) L η ) . (161) Step 5. Terms T . By Lemma 12 again, we have1 σ ε E ε [ T ] = Tr(( H + λ I n ) − M ( H + λ I n ) − ) = Tr( Ψ ≤ m S ≤ m Ψ T ≤ m /n ) + o d, P (1) , By Proposition 3 and noting that S ≤ m (cid:22) I m , we have1 n Tr( Ψ ≤ m S ≤ m Ψ T ≤ m ) ≤ n Tr( Ψ ≤ m Ψ T ≤ m ) = 1 n Tr( Ψ T ≤ m Ψ ≤ m ) = 1 n n m (cid:0) o d, P (1) (cid:1) = o d, P (1) . This gives T = o d, P (1) · σ ε . (162) Step 6. Terms T . Note that 1 σ ε E ε [ T ] = 1 σ ε E ε [ ε T ( H + λ I n ) − EE T ( H + λ I n ) − ε ]= E T ( H + λ I n ) − E . Notice that M (cid:23) Ψ ≤ L D ≤ L Ψ T ≤ L for any L ∈ N , by the decomposition of Eq. (148). 
Therefore: (cid:107) D ≤ L Ψ T ≤ L ( H + λ I n ) − Ψ ≤ L D ≤ L (cid:107) op = (cid:107) ( H + λ I n ) − Ψ ≤ L D ≤ L Ψ T ≤ L ( H + λ I n ) − (cid:107) op ≤ (cid:107) ( H + λ I n ) − M ( H + λ I n ) − (cid:107) op . (163)Further notice that, using Lemma 12 (stated below) followed by Proposition 3, we get (cid:107) ( H + λ I n ) − M ( H + λ I n ) − (cid:107) op = (cid:107) Ψ ≤ m S ≤ m Ψ T ≤ m /n (cid:107) op /n + o d, P (1 /n ) ≤(cid:107) Ψ ≤ m Ψ T ≤ m /n (cid:107) op /n + o d, P (1) = o d, P (1) . (164)61ence, E T ( H + λ I n ) − E ( a ) = lim L →∞ E T ≤ L ( H + λ I n ) − E ≤ L ( b ) = lim L →∞ ˆ f T ≤ L [ D ≤ L Ψ T ≤ L ( H + λ I n ) − Ψ ≤ L D ≤ L ] ˆ f ≤ L ( c ) ≤ lim sup L →∞ (cid:107) D ≤ L Ψ T ≤ L ( H + λ I n ) − Ψ ≤ L D ≤ L (cid:107) op · lim L →∞ (cid:107) ˆ f ≤ L (cid:107) d ) ≤ (cid:107) ( H + λ I n ) − M ( H + λ I n ) − (cid:107) op · (cid:107) f d (cid:107) L ( e ) ≤ o d, P (1) · (cid:107) f d (cid:107) L , where the limits for L → ∞ exist with high probability. In particular, ( a ) holds with high probability since (cid:107) E T ≤ L − E (cid:107) → L → ∞ , and (cid:107) ( H + λ I n ) − (cid:107) op ≤ /λ min ( H ) ≤ (2 /κ H ) , by the decomposition (149),together with the fact that (cid:107) ∆ H (cid:107) op = o d, P (1), cf. Eq. (150). Further, ( b ) is by definition of E ≤ L ; ( c ) bydefinition of operator norm; ( d ) by Eq. (163); ( e ) by Eq. (164).We thus obtain T = o d, P (1) · σ ε · (cid:107) f d (cid:107) L = o d, P (1) · ( σ ε + (cid:107) f d (cid:107) L ) . (165) Step 7. Terms T . We decompose T using f = f ≤ m + f > m , T = T + T , where T = ε T ( H + λ I n ) − M ( H + λ I n ) − f ≤ m ,T = ε T ( H + λ I n ) − M ( H + λ I n ) − f > m . First notice that, by Eq. (164), (cid:107) M / ( H + λ I n ) − M / (cid:107) op = (cid:107) ( H + λ I n ) − M ( H + λ I n ) − (cid:107) op = o d, P (1) . Then by Lemma 12, we get1 σ ε E ε [ T ] = 1 σ ε E ε [ ε T ( H + λ I n ) − M ( H + λ I n ) − f ≤ m f T ≤ m ( H + λ I n ) − M ( H + λ I n ) − ε ]= f T ≤ m [( H + λ I n ) − M ( H + λ I n ) − ] f ≤ m ≤(cid:107) M / ( H + λ I n ) − M / (cid:107) op (cid:107) M / ( H + λ I n ) − f ≤ m (cid:107) = o d, P (1) · T = o d, P (1) · (cid:107) P ≤ m f d (cid:107) L . where the last equality follows by Eq. (154). Similarly, we get E ε [ T ] /σ ε = o d, P (1) · T = o d, P (1) · (cid:107) P > m f d (cid:107) L . By Markov’s inequality, we deduce that T = o d, P (1) · σ ε ( (cid:107) P ≤ m f d (cid:107) L + (cid:107) P > m f d (cid:107) L ) = o d, P (1) · ( σ ε + (cid:107) f d (cid:107) L ) . (166) Step 8. Finish the proof. R KR ( f d , X , λ ) = (cid:107) f d (cid:107) L − T + T + T − T + 2 T = (cid:107) ˆ f ≤ m (cid:107) − (cid:107) S / ≤ m ˆ f ≤ m (cid:107) + (cid:107) S ≤ m ˆ f ≤ m (cid:107) + (cid:107) P > m f d (cid:107) L + o d, P (1) · ( (cid:107) f d (cid:107) L + (cid:107) P > M f d (cid:107) L η + σ ε )= (cid:107) ( I − S ≤ m ) ˆ f ≤ m (cid:107) + (cid:107) P > m f d (cid:107) L + o d, P (1) · ( (cid:107) f d (cid:107) L + (cid:107) P > M f d (cid:107) L η + σ ε ) . Recall the expression (52) of ˆ f eff γ eff : ˆ f eff γ eff = ∞ (cid:88) j =1 λ d,j λ d,j + γ eff n ˆ f d,k ψ d,j , with γ eff = λ + κ H . From Assumption 5.( a ), we have max j> m λ d,j = o d (1) · κ H /n and we deduce (cid:107) ( I − S ≤ m ) ˆ f ≤ m (cid:107) + (cid:107) P > m f d (cid:107) L = (cid:107) f d − ˆ f eff γ eff (cid:107) L + (cid:107) f d (cid:107) L · o d, P (1) . 
We conclude R KR ( f d , X , λ ) = (cid:107) f d − ˆ f eff γ eff (cid:107) L + o d, P (1) · ( (cid:107) f d (cid:107) L + (cid:107) P > M f d (cid:107) L η + σ ε ) . Proceeding analogously (with ˆ f eff γ eff replacing f d ) we obtain (cid:107) ˆ f λ − ˆ f eff γ eff (cid:107) L = o d, P (1) · ( (cid:107) f d (cid:107) L + (cid:107) P > M f d (cid:107) L η + σ ε ) . C.1.1 Auxiliary lemmasLemma 12.
Follow the assumptions and notations in the proof of Theorem 4. We have (cid:107) n ( H + λ I n ) − M ( H + λ I n ) − − Ψ ≤ m S ≤ m Ψ T ≤ m /n (cid:107) op = o d, P (1) . where S ≤ m is the shrinkage matrix defined in Equation (151) .Proof of Lemma 12. We simplify the notations by defining ψ k = ( ψ k ( x i )) i ∈ [ n ] ∈ R n Ψ = ψ ≤ m ∈ R n × m , D = D ≤ m = diag( λ d, , . . . , λ d, m ) ∈ R m × m .Then recalling Eq. (149), we have H = Ψ D Ψ T + κ H ( I + ∆ H ) , M = Ψ D Ψ T + κ M ( I + ∆ M ) , (167)where κ H = Tr( H d,> m ) and κ M = Tr( H d,> m ), andmax {(cid:107) ∆ M (cid:107) op , (cid:107) ∆ H (cid:107) op } = o d, P (1) . (168)As a result, we have n ( H + λ I n ) − M ( H + λ I n ) − = T + T , where T = nκ M ( H + λ I n ) − ( I M + ∆ M )( H + λ I n ) − ,T = n ( H + λ I n ) − Ψ D Ψ T ( H + λ I n ) − . Step 1. Bound term T . T , by Eqs. (167) and (168), we have, (cid:107) T (cid:107) op ≤ nκ M (cid:107) ( H + λ I n ) − (cid:107) (cid:107) I + ∆ M (cid:107) op = O d, P (1) · n [ κ M /κ H ] . (169)By Eq. (46) in Assumption 5.( a ), we have κ M κ H = Tr( H d,> m )Tr( H d,> m ) ≤ (cid:107) H d,> m (cid:107) op Tr( H d,> m ) = O d ( n − − δ ) . We conclude that (cid:107) T (cid:107) op = o d, P (1) . Step 2. Bound term T . For T , by the Sherman-Morrison-Woodbury formula, we have, setting ∆ (cid:48) H = κ H ∆ H / ( λ + κ H ), T = n ( I + ∆ (cid:48) H ) − Ψ (( κ H + λ ) D − + Ψ T ( I + ∆ (cid:48) H ) − Ψ ) − Ψ T ( I + ∆ (cid:48) H ) − = E Ψ R Ψ T E /n, where E = ( I + ∆ (cid:48) H ) − , R = [( κ H + λ )( n D ) − + Ψ T E Ψ /n ] − . Denote S := S ≤ m = [ I m + ( κ H + λ )( n D ) − ] − . We have (cid:107) T − Ψ T S Ψ /n (cid:107) op ≤ (1 + (cid:107) E (cid:107) op ) (cid:107) E − I (cid:107) op (cid:107) Ψ R Ψ T /n (cid:107) + (cid:107) Ψ R − Ψ T /n − Ψ S Ψ T /n (cid:107) op ≤ (1 + (cid:107) E (cid:107) op ) (cid:107) E − I (cid:107) op (cid:107) Ψ R Ψ T /n (cid:107) op + (cid:107) ΨΨ T /n (cid:107) op (cid:107) R − S (cid:107) op . Recalling Eq. (168), we have (cid:107) E − I (cid:107) op = o d, P (1) and by Theorem 6.( b ), we have (cid:107) Ψ T Ψ /n − I (cid:107) op = o d, P (1).Furthermore (cid:107) R − S (cid:107) op = (cid:107) R − S (cid:107) op ( (cid:107) R (cid:107) op + (cid:107) S (cid:107) op ≤ (cid:107) R (cid:107) op (cid:107) S (cid:107) op ( (cid:107) R (cid:107) op + (cid:107) S (cid:107) op ) (cid:107) R − − S − (cid:107) op . We have (cid:107) R − − S − (cid:107) op ≤ (cid:107) Ψ T E Ψ /n − Ψ T Ψ /n (cid:107) op + (cid:107) Ψ T Ψ /n − I (cid:107) op ≤ (cid:107) Ψ T Ψ /n (cid:107) op (cid:107) E − I (cid:107) op + (cid:107) Ψ T Ψ /n − I (cid:107) op = o d, P (1) . Furthermore (cid:107) S (cid:107) op ≤ (cid:107) R (cid:107) op ≤ o d, P (1). Combining the above inequal-ities, we have (cid:107) T − Ψ T S Ψ /n (cid:107) op = o d, P (1) . This gives (cid:107) n ( H + λ I n ) − M ( H + λ I n ) − − Ψ S Ψ T /n (cid:107) op ≤ (cid:107) T (cid:107) op + (cid:107) T − Ψ S Ψ T /n (cid:107) op = o d, P (1) . This completes the proof.
Lemma 13.
Follow the assumptions and notations in the proof of Theorem 4. We have (cid:107) S ≤ m − Ψ T ≤ m ( H + λ I n ) − Ψ ≤ m D ≤ m (cid:107) op = o d, P (1) , where S ≤ m is the shrinkage matrix defined in Equation (151) . roof of Lemma 13. We will follow the notations in the proof of Proposition 12. Applying Theorem 6 withrespect to operator H d and by Eq. (171) in Assumption 4.( c H + λ I n = Ψ D Ψ T + κ H ( I n + ∆ H ) + λ I n = Ψ D Ψ T + ( κ H + λ )( I n + ∆ (cid:48) H ) , where (cid:107) ∆ H (cid:107) op , (cid:107) ∆ (cid:48) H (cid:107) op = o d, P (1). By the Sherman-Morrison-Woodbury formula, we have Ψ T [ Ψ D Ψ T + ( κ H + λ )( I n + ∆ (cid:48) H )] − Ψ D = Ψ T E − Ψ R /n, where E = I n + ∆ H and R = [( κ H + λ )( n D ) − + Ψ T E − Ψ /n ] − . We have (cid:107) S − Ψ T E − Ψ R /n (cid:107) op ≤ (cid:107) R (cid:107) op (cid:107) Ψ T E − Ψ /n − I (cid:107) op + (cid:107) R − S (cid:107) op . In the proof of Lemma 12, we already showed that (cid:107) Ψ T E − Ψ /n − I (cid:107) op = o d, P (1), (cid:107) R − S (cid:107) op = o d, P (1) and (cid:107) R (cid:107) op = O d, P (1), which concludes the proof. C.2 Kernel ridge regression under relaxed assumptions on the diagonal
In this section, we state and prove a version of Theorem 4 that holds under weaker assumptions. Namely,instead of the concentration bound in Assumption 4.( c ) we only require that the diagonal terms are upperbounded by a sub-polynomial factors times their expectation. Instead, we assume a spectral gap conditionthat was not required in the previous section.We will first describe the modified assumption, then state the new version of the theorem. The proof isvery similar to the one in the previous section. We will therefore use the same notations and only sketchthe differences. Assumption 8 (Relaxed kernel concentration at level { ( n ( d ) , m ( d )) } d ≥ ) . We assume the kernel concen-tration property at level { ( n ( d ) , m ( d )) } d ≥ , as stated in Assumption 4, with condition ( c ) replaced by thefollowing(c’) (Upper bound on the diagonal elements of the kernel) For ( x i ) i ∈ [ n ( d )] ∼ iid ν d and any δ > , we have max i ∈ [ n ( d )] E x ∼ u ( d ) (cid:2) H d,>u ( d ) ( x i , x ) (cid:3) = O d, P ( n ( d ) δ ) · E x , x (cid:48) ∼ ν d (cid:2) H d,>u ( d ) ( x , x (cid:48) ) (cid:3) , (170)max i ∈ [ n ( d )] H d,>u ( d ) ( x i , x i ) = O d, P ( n ( d ) δ ) · E x ∼ ν d [ H d,>u ( d ) ( x , x )] . (171) Assumption 9 (Eigenvalue condition at level { ( n ( d ) , m ( d )) } d ≥ ) . We assume the eigenvalue conditionAssumption 5 and, in addition, the following to hold(c) There exists a fixed δ > , such that n − δ ≥ λ d, m ( d ) (cid:88) k = m ( d )+1 λ d,k . Assumption 10 (Regularization and lower bound on diagonal elements) . Consider the regularization pa-rameter λ ∈ R ≥ . We assume that one of the following holds:(i) For ( x i ) i ∈ [ n ( d )] ∼ iid ν d and any δ > , we have min i ∈ [ n ( d )] E x ∼ ν d [ H d,> m ( d ) ( x i , x ) ] =Ω d, P ( n ( d ) − δ ) · E x , x (cid:48) ∼ ν d [ H d,> m ( d ) ( x , x (cid:48) ) ] , (172)min i ∈ [ n ( d )] H d,> m ( d ) ( x i , x i ) = Ω d, P ( n ( d ) − δ ) · E x [ H d,> m ( d ) ( x , x )] , (173) and λ = O d (1) · Tr( H d,> m ( d ) ) (in particular, taking λ = 0 is fine). ii) We have λ = Θ d (1) · Tr( H d,> m ( d ) ) . Theorem 8.
Let $\{f_d \in \mathcal{D}_d\}_{d\ge 1}$ be a sequence of functions, $(x_i)_{i\in[n(d)]} \sim \nu_d$ independently, and $\{\mathbb{H}_d\}_{d\ge 1}$ be a sequence of kernel operators such that $\{(\mathbb{H}_d, n(d), m(d))\}_{d\ge 1}$ and the regularization parameter $\lambda$ satisfy Assumptions 8, 9, and 10. Then for any fixed $\eta > 0$, we have
$$\big|R_{\rm KR}(f_d, X, \lambda) - \|P_{>m}f_d\|_{L^2}^2\big| = o_{d,\mathbb{P}}(1)\cdot\big(\|f_d\|_{L^2}^2 + \|P_{>m}f_d\|_{L^{2+\eta}}^2 + \sigma_\varepsilon^2\big). \tag{174}$$
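Theorem 8 states that KRR effectively learns the projection of $f_d$ onto the top $m$ eigenfunctions, and that its test error concentrates on the residual energy $\|P_{>m}f_d\|_{L^2}^2$. The following self-contained simulation is our own illustration of this phenomenon (the dimensions, the inner-product kernel $(1+\langle x,y\rangle/d)^3$, the target function and the regularization are arbitrary choices, not quantities from the paper): on the hypercube, with $d \lesssim n \ll d^2$, the KRR test error approaches the energy of the part of the target above degree 1, well below the trivial baseline $\|f\|_{L^2}^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, n_test = 60, 600, 4000          # d <~ n << d^2: KRR should learn only degree <= 1
lam, sigma_eps = 1e-3, 0.1

# Target f = degree-1 part + degree-2 part, each normalized to unit L2 norm.
a = rng.standard_normal(d); a /= np.linalg.norm(a)            # ||P_1 f||_{L2}^2 = 1
iu = np.triu_indices(d, k=1)
b = rng.standard_normal(iu[0].size); b /= np.linalg.norm(b)   # ||P_2 f||_{L2}^2 = 1

def f(X):
    return X @ a + (X[:, iu[0]] * X[:, iu[1]]) @ b

def kernel(X, Y):                      # rotationally invariant kernel h(<x,y>/d)
    return (1.0 + X @ Y.T / d) ** 3

X = rng.choice([-1.0, 1.0], size=(n, d))
y = f(X) + sigma_eps * rng.standard_normal(n)

alpha = np.linalg.solve(kernel(X, X) + n * lam * np.eye(n), y)   # KRR coefficients

Xt = rng.choice([-1.0, 1.0], size=(n_test, d))
test_err = np.mean((kernel(Xt, X) @ alpha - f(Xt)) ** 2)
print(f"KRR test error                    : {test_err:.3f}")
print(f"||P_>1 f||_L2^2 (predicted floor) : 1.000")
print(f"||f||_L2^2 (trivial baseline)     : 2.000")
```

On typical runs the test error lands slightly above 1: the degree-2 part of the target is essentially unlearnable at this sample size and acts as the error floor, consistent with Eq. (174).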
C.2.1 Proof outline for Theorem 8

Throughout this section, $\delta_0 > 0$ denotes a fixed constant (as in Assumption 9), while $\delta > 0$ denotes a small constant that is allowed to change from line to line. By the spectral gap condition (Assumption 9), the population estimator $\hat f_{{\rm eff},\gamma_{\rm eff}}$ defined in Eq. (52) is approximately given by the low-degree projection:
$$\|\hat f_{{\rm eff},\gamma_{\rm eff}} - P_{\le m} f_d\|_{L^2} = o_{d,\mathbb{P}}(1)\cdot\|f_d\|_{L^2}.$$
Similarly, the shrinkage matrix defined in Eq. (151) verifies $\|S_{\le m} - I_m\|_{\rm op} = O_{d,\mathbb{P}}(n^{-\delta_0})$.

From Theorem 6 applied to the operators $\mathbb{H}_d$ and $\mathbb{H}_d^2$, the kernel matrices $H$ and $M$ can be rewritten as
$$H = \Psi_{\le m}D_{\le m}\Psi_{\le m}^{\sf T} + \kappa_H(\Lambda_H + \Delta_H), \qquad M = \Psi_{\le m}D_{\le m}^2\Psi_{\le m}^{\sf T} + \kappa_M(\Lambda_M + \Delta_M), \tag{175}$$
where
$$\kappa_H = {\rm Tr}(\mathbb{H}_{d,>m}) = \sum_{k\ge m+1}\lambda_{d,k}, \qquad \kappa_M = {\rm Tr}(\mathbb{H}_{d,>m}^2) = \sum_{k\ge m+1}\lambda_{d,k}^2,$$
and
$$\Lambda_H = {\rm diag}\big(\{H_{d,>m}(x_i,x_i)/\kappa_H\}_{i\in[n]}\big), \qquad \Lambda_M = {\rm diag}\big(\{\mathbb{E}_x[H_{d,>m}(x_i,x)^2]/\kappa_M\}_{i\in[n]}\big),$$
and there exists a fixed $\delta_0 > 0$ such that
$$\max\{\|\Delta_M\|_{\rm op}, \|\Delta_H\|_{\rm op}\} = O_{d,\mathbb{P}}(n^{-\delta_0}). \tag{176}$$
From Lemma 7 applied to $\Lambda_H$ and $\Lambda_M$ with Assumptions 4.(a) and 8.(c'), we have
$$\Lambda_H \preceq O_{d,\mathbb{P}}(n^{\delta})\cdot I_n, \qquad \Lambda_M \preceq O_{d,\mathbb{P}}(n^{\delta})\cdot I_n. \tag{177}$$
Furthermore, from Assumption 10 and Eq. (176), we have for any $\delta > 0$,
$$H + \lambda I_n \succeq \kappa_H(\Lambda_H + \Delta_H) + \lambda I_n \succeq \Omega_{d,\mathbb{P}}(n^{-\delta})\cdot\kappa_H\cdot I_n. \tag{178}$$
The handling of the bounds on the terms $T_1, T_2, T_3, T_4$ and $T_5$ follows from the same computations as in Section C.1, where every $o_{d,\mathbb{P}}(1)$ is replaced by $O_{d,\mathbb{P}}(n^{-\delta_0})$ for some fixed $\delta_0 > 0$, every $O_{d,\mathbb{P}}(1)$ is replaced by $O_{d,\mathbb{P}}(n^{\delta})$ with $\delta > 0$ arbitrarily small, and every product $O_{d,\mathbb{P}}(1)\cdot o_{d,\mathbb{P}}(1)$ is replaced by $O_{d,\mathbb{P}}(n^{\delta})\cdot O_{d,\mathbb{P}}(n^{-\delta_0})$; taking $\delta > 0$ sufficiently small compared to $\delta_0$ then again yields an $o_{d,\mathbb{P}}(1)$ bound (see the proofs below for some examples).

Below we detail the proofs of the updated auxiliary lemmas from Section C.1.1: Eq. (179) is used to bound the term $T_1$, Eq. (180) is used to bound the term $T_2$, while Eq. (181) is used to bound the terms $T_3$, $T_4$ and $T_5$. Lemma 14.
Follow the assumptions of Theorem 8 and the same notations as in Section C.1. Define
$$G = n(H + \lambda I_n)^{-1} M (H + \lambda I_n)^{-1}.$$
Then, there exists a fixed $\delta_0 > 0$ such that for any $\delta > 0$,
$$\|\Psi_{\le m}^{\sf T} G \Psi_{\le m}/n - I_m\|_{\rm op} = O_{d,\mathbb{P}}(n^{-\delta_0}), \tag{179}$$
$$f_{>m}^{\sf T} G f_{>m}/n = O_{d,\mathbb{P}}(n^{-\delta_0})\cdot\|P_{>m}f_d\|_{L^{2+\eta}}^2, \tag{180}$$
$$\|G\|_{\rm op} \le O_{d,\mathbb{P}}(n^{\delta}). \tag{181}$$
Proof of Lemma 14.
Recall that $\delta_0 > 0$ denotes a fixed constant, while $\delta > 0$ is allowed to change from line to line. Following the notations of the proof of Lemma 12, we have
$$H = \Psi D \Psi^{\sf T} + \kappa_H(\Lambda_H + \Delta_H), \qquad M = \Psi D^2 \Psi^{\sf T} + \kappa_M(\Lambda_M + \Delta_M).$$
Consider the same decomposition $G = T_1 + T_2$ as in the proof of Lemma 12, where
$$T_1 = n\kappa_M(H+\lambda I_n)^{-1}(\Lambda_M + \Delta_M)(H+\lambda I_n)^{-1}, \qquad T_2 = n(H+\lambda I_n)^{-1}\Psi D^2 \Psi^{\sf T}(H+\lambda I_n)^{-1}.$$

Step 1. Bound the term $T_1$. For $T_1$, by Eqs. (177) and (178), we have for any $\delta > 0$,
$$\|T_1\|_{\rm op} \le n\kappa_M\|(H+\lambda I_n)^{-1}\|_{\rm op}^2\|\Lambda_M + \Delta_M\|_{\rm op} \le O_{d,\mathbb{P}}(1)\cdot n\kappa_M\cdot n^{2\delta}\kappa_H^{-2}\cdot n^{\delta} \le O_{d,\mathbb{P}}(1)\cdot n^{1+3\delta}\,\frac{\kappa_M}{\kappa_H^2}. \tag{182}$$
By Eq. (46) in Assumption 5.(a), we have
$$\frac{\kappa_M}{\kappa_H^2} = \frac{{\rm Tr}(\mathbb{H}_{d,>m}^2)}{{\rm Tr}(\mathbb{H}_{d,>m})^2} \le \frac{\|\mathbb{H}_{d,>m}\|_{\rm op}}{{\rm Tr}(\mathbb{H}_{d,>m})} = O_d(n^{-1-\delta_0}).$$
Hence, taking $\delta$ sufficiently small in Eq. (182) yields $\|T_1\|_{\rm op} = O_{d,\mathbb{P}}(n^{-\delta})$ for some fixed $\delta > 0$.

Step 2. Simplifying the term $T_2$. Introduce $A = \Lambda_H + \Delta_H + (\lambda/\kappa_H)\cdot I_n$, so that $H + \lambda I_n = \Psi D \Psi^{\sf T} + \kappa_H A$.
By the Sherman-Morrison-Woodbury formula, we have
$$T_2 = A^{-1}\Psi\big(\kappa_H(nD)^{-1} + \Psi^{\sf T}A^{-1}\Psi/n\big)^{-2}\Psi^{\sf T}A^{-1}/n.$$
From Assumption 9.(c), we have $\kappa_H(nD)^{-1} \preceq O_d(n^{-\delta_0})\cdot I_m$. Furthermore, recalling that $\|\Psi^{\sf T}\Psi/n - I_m\|_{\rm op} = o_{d,\mathbb{P}}(1)$ and $A^{-1} \succeq \Omega_{d,\mathbb{P}}(n^{-\delta})\cdot I_n$ for any $\delta > 0$,
we deduce (for example from Lemma 8) that
$$\big\|T_2 - A^{-1}\Psi\big(\Psi^{\sf T}A^{-1}\Psi/n\big)^{-2}\Psi^{\sf T}A^{-1}/n\big\|_{\rm op} = O_{d,\mathbb{P}}(n^{-\delta}),$$
for some fixed $\delta > 0$. Denote by $\bar S = \Lambda_H + (\lambda/\kappa_H)\cdot I_n$ the diagonal matrix such that $\|A - \bar S\|_{\rm op} = O_{d,\mathbb{P}}(n^{-\delta_0})$ (we write $\bar S$ to avoid confusion with the shrinkage matrix $S_{\le m}$). We have $\Omega_{d,\mathbb{P}}(n^{-\delta})\cdot I_n \preceq \bar S \preceq O_{d,\mathbb{P}}(n^{\delta})\cdot I_n$. Similarly to the previous display, we get
$$\big\|A^{-1}\Psi\big(\Psi^{\sf T}A^{-1}\Psi/n\big)^{-2}\Psi^{\sf T}A^{-1}/n - \bar S^{-1}\Psi\big(\Psi^{\sf T}\bar S^{-1}\Psi/n\big)^{-2}\Psi^{\sf T}\bar S^{-1}/n\big\|_{\rm op} = O_{d,\mathbb{P}}(n^{-\delta}).$$
Denote $R = \bar S^{-1}\Psi\big(\Psi^{\sf T}\bar S^{-1}\Psi/n\big)^{-2}\Psi^{\sf T}\bar S^{-1}/n$.

Step 3. Proving the bounds.
First notice that, because $\Omega_{d,\mathbb{P}}(n^{-\delta})\cdot I_n \preceq \bar S \preceq O_{d,\mathbb{P}}(n^{\delta})\cdot I_n$ and $\|\Psi^{\sf T}\Psi/n - I_m\|_{\rm op} = o_{d,\mathbb{P}}(1)$, we have
$$\|G\|_{\rm op} \le \|T_1\|_{\rm op} + \|T_2 - R\|_{\rm op} + \|R\|_{\rm op} = O_{d,\mathbb{P}}(n^{\delta}),$$
for any $\delta > 0$,
which proves Eq. (181). Similarly,
$$\|\Psi^{\sf T}G\Psi/n - I_m\|_{\rm op} \le \big(\|T_1\|_{\rm op} + \|T_2 - R\|_{\rm op}\big)\|\Psi/\sqrt n\|_{\rm op}^2 + \|\Psi^{\sf T}R\Psi/n - I_m\|_{\rm op} = O_{d,\mathbb{P}}(n^{-\delta}),$$
which proves Eq. (179).

Notice that
$$R = \bar S^{-1}\Psi\big(\Psi^{\sf T}\bar S^{-1}\Psi/n\big)^{-2}\Psi^{\sf T}\bar S^{-1}/n \preceq O_{d,\mathbb{P}}(n^{\delta})\cdot\bar S^{-1}\Psi\Psi^{\sf T}\bar S^{-1}/n.$$
Denote $\bar S = {\rm diag}((\bar s_i)_{i\in[n]})$ and recall the decomposition $f_{>m} = \sum_{k=m+1}^{\infty}\hat f_k\,\psi_k$. We have
$$\mathbb{E}\big[f_{>m}^{\sf T}\bar S^{-1}\Psi\Psi^{\sf T}\bar S^{-1}f_{>m}\big]/n^2 = \mathbb{E}\Big[\Big(\sum_{k\ge m+1}\hat f_k\psi_k^{\sf T}\Big)\bar S^{-1}\Psi_{\le m}\Psi_{\le m}^{\sf T}\bar S^{-1}\Big(\sum_{k\ge m+1}\psi_k\hat f_k\Big)\Big]/n^2$$
$$= \sum_{u,v\ge m+1}\sum_{t=1}^{m}\sum_{i,j\in[n]}\Big\{\bar s_i^{-1}\bar s_j^{-1}\,\mathbb{E}\big[\psi_u(x_i)\psi_t(x_i)\psi_t(x_j)\psi_v(x_j)\big]/n^2\Big\}\hat f_v\hat f_u$$
$$= \sum_{u,v\ge m+1}\sum_{t=1}^{m}\sum_{i\in[n]}\bar s_i^{-2}\Big\{\mathbb{E}\big[\psi_u(x_i)\psi_t(x_i)^2\psi_v(x_i)\big]/n^2\Big\}\hat f_v\hat f_u$$
$$= O_d(n^{\delta})\cdot\frac{1}{n}\sum_{t=1}^{m}\mathbb{E}_x\big[\big(P_{>m}f_d(x)\big)^2\psi_t(x)^2\big] \le O_d(n^{\delta})\cdot\frac{1}{n}\sum_{t=1}^{m}\|P_{>m}f_d\|_{L^{2+\eta}}^2\|\psi_t\|_{L^{(4+2\eta)/\eta}}^2$$
$$\le O_d(n^{\delta})\cdot\frac{m}{n}\|P_{>m}f_d\|_{L^{2+\eta}}^2 = O_d(n^{-\delta'})\cdot\|P_{>m}f_d\|_{L^{2+\eta}}^2,$$
for some fixed $\delta' > 0$.
We deduce by Markov's inequality that
$$|f_{>m}^{\sf T} R f_{>m}/n| \le O_{d,\mathbb{P}}(n^{\delta})\cdot f_{>m}^{\sf T}\bar S^{-1}\Psi\Psi^{\sf T}\bar S^{-1}f_{>m}/n^2 = O_{d,\mathbb{P}}(n^{-\delta'})\cdot\|P_{>m}f_d\|_{L^{2+\eta}}^2.$$
We deduce that
$$|f_{>m}^{\sf T}Gf_{>m}/n| \le \big(\|T_1\|_{\rm op} + \|T_2 - R\|_{\rm op}\big)\|f_{>m}/\sqrt n\|_2^2 + |f_{>m}^{\sf T}Rf_{>m}/n| = O_{d,\mathbb{P}}(n^{-\delta})\cdot\|P_{>m}f_d\|_{L^{2+\eta}}^2,$$
for some fixed $\delta > 0$, which concludes the proof. Lemma 15.
Follow the assumptions of Theorem 8 and the same notations as in Section C.1. There exists a fixed $\delta_0 > 0$ such that
$$\|I_m - \Psi_{\le m}^{\sf T}(H+\lambda I_n)^{-1}\Psi_{\le m}D_{\le m}\|_{\rm op} = O_{d,\mathbb{P}}(n^{-\delta_0}).$$
Proof of Lemma 15.
We follow the same argument as in Lemma 13. We have $H + \lambda I_n = \Psi D \Psi^{\sf T} + \kappa_H\cdot A$, where we denoted $A = \Lambda_H + \Delta_H + (\lambda/\kappa_H)\cdot I_n$. By the Sherman-Morrison-Woodbury formula, we have
$$\Psi^{\sf T}\big[\Psi D \Psi^{\sf T} + \kappa_H A\big]^{-1}\Psi D = \Psi^{\sf T}A^{-1}\Psi\big[\kappa_H(nD)^{-1} + \Psi^{\sf T}A^{-1}\Psi/n\big]^{-1}/n.$$
Hence
$$\big\|I_m - \Psi^{\sf T}A^{-1}\Psi\big[\kappa_H(nD)^{-1} + \Psi^{\sf T}A^{-1}\Psi/n\big]^{-1}/n\big\|_{\rm op} = \big\|\kappa_H(nD)^{-1}\big(\kappa_H(nD)^{-1} + \Psi^{\sf T}A^{-1}\Psi/n\big)^{-1}\big\|_{\rm op}.$$
We have, by Assumption 9, $nD \succeq \Omega_d(n^{\delta_0})\cdot\kappa_H\cdot I_m$. Furthermore, by Eq. (177) and Assumption 10, we have $A^{-1} \succeq \Omega_{d,\mathbb{P}}(n^{-\delta})\cdot I_n$ for any $\delta > 0$. Using that $\|\Psi^{\sf T}\Psi/n - I_m\|_{\rm op} = o_{d,\mathbb{P}}(1)$, we deduce that for any $\delta > 0$,
$$\big\|\big(\kappa_H(nD)^{-1} + \Psi^{\sf T}A^{-1}\Psi/n\big)^{-1}\big\|_{\rm op} = O_{d,\mathbb{P}}(n^{\delta}).$$
We deduce that
$$\big\|I_m - \Psi^{\sf T}A^{-1}\Psi\big[\kappa_H(nD)^{-1} + \Psi^{\sf T}A^{-1}\Psi/n\big]^{-1}/n\big\|_{\rm op} = O_d(n^{-\delta_0})\cdot O_{d,\mathbb{P}}(n^{\delta}).$$
Taking $\delta$ sufficiently small concludes the proof.

D Proof of Theorem 2: generalization error of RFRR on the sphere and hypercube
We check that Assumption 3 implies the assumptions of Theorem 1 on the sphere (Section D.1) and on thehypercube (Section D.2).
D.1 On the sphere
Proof of Theorem 2 on the sphere.
Consider the spherical case: $\theta_j, x_i \sim_{iid} {\rm Unif}(S^{d-1}(\sqrt d))$, with $d^{s+\delta_0} \le n \le d^{s+1-\delta_0}$ and $d^{S+\delta_0} \le N \le d^{S+1-\delta_0}$. Take $\sigma_d(x;\theta) = \bar\sigma_d(\langle x,\theta\rangle/\sqrt d)$ for some activation function $\bar\sigma_d : \mathbb{R}\to\mathbb{R}$ satisfying Assumption 3 at level $(s, S)$ (see Section 2.4 in the main text).

Step 1. Diagonalization of the activation function and choice of $m = m(d)$, $M = M(d)$.
By rotational invariance, we can decompose $\bar\sigma_d$ in the basis of spherical harmonics (see Section E.1):
$$\sigma_d(x;\theta) = \bar\sigma_d(\langle x,\theta\rangle/\sqrt d) = \sum_{k=0}^{\infty}\xi_{d,k}B(S^{d-1};k)Q_k^{(d)}(\langle x,\theta\rangle) = \sum_{k=0}^{\infty}\xi_{d,k}\sum_{s\in[B(S^{d-1};k)]}Y_{ks}(x)Y_{ks}(\theta),$$
where the distinct eigenvalues are $\xi_{d,k}$, with degeneracy
$$B(S^{d-1};k) = \frac{2k+d-2}{d-2}\binom{k+d-3}{k}.$$
We have, for fixed $k$, $B(S^{d-1};k) = \Theta_d(d^k)$. Furthermore, we have uniformly $\sup_{k\ge\ell}B(S^{d-1};k)^{-1} = O_d(d^{-\ell})$ (see Lemma 1 in [GMMM19]). Notice that, by Assumption 3 (see for example Lemma 5 in [GMMM19]), there exists a constant $C$ such that
$$\|\bar\sigma_d\|_{L^2}^2 = \sum_{k=0}^{\infty}\xi_{d,k}^2B(S^{d-1};k) \le C, \tag{183}$$
which implies that $\xi_{d,k}^2 = O_d(B(S^{d-1};k)^{-1})$. In particular,
$$\sup_{k>s}\xi_{d,k}^2 = O_d(d^{-s-1}), \tag{184}$$
$$\sup_{k>S}\xi_{d,k}^2 = O_d(d^{-S-1}). \tag{185}$$
Furthermore, by noting that $\xi_{d,k}^2 = B(S^{d-1};k)^{-1}\|P_k\bar\sigma_d(\langle e,\cdot\rangle)\|_{L^2}^2$, conditions (26), (27) and (28) can be rewritten as follows in terms of the coefficients $(\xi_{d,k})_{k\ge 0}$:
$$\min_{k\le s}\xi_{d,k}^2 = \Omega_d(d^{-s}), \tag{186}$$
$$\min_{k\le S}\xi_{d,k}^2 = \Omega_d(d^{-S}), \tag{187}$$
$$\sum_{k=2\max(s,S)+2}^{\infty}\xi_{d,k}^2B(S^{d-1};k) = \Omega_d(1). \tag{188}$$
Denote by $\{\lambda_{d,j}\}_{j\ge 1}$ the eigenvalues $\{\xi_{d,k}^2\}_{k\ge 0}$, listed with their degeneracy in non-increasing order of absolute value. Set $M$ and $m$ to be the number of eigenvalues associated to spherical harmonics of degree less or equal to $S$ and $s$ respectively, i.e.,
$$M = \sum_{k=0}^{S}B(S^{d-1};k) = \Theta_d(d^S), \qquad m = \sum_{k=0}^{s}B(S^{d-1};k) = \Theta_d(d^s). \tag{189}$$
Notice that Eqs. (184) and (186) imply that $(\lambda_{d,j})_{j\le m}$ corresponds exactly to all the eigenvalues associated to spherical harmonics of degree less or equal to $s$. Similarly, Eqs. (185) and (187) imply that $(\lambda_{d,j})_{j\le M}$ corresponds exactly to all the eigenvalues associated to spherical harmonics of degree less or equal to $S$. Notice that the diagonal elements of the truncated kernels are given by (for any $x,\theta\in S^{d-1}(\sqrt d)$)
$$H_{d,>m}(x,x) = \sum_{k=s+1}^{\infty}\xi_{d,k}^2B(S^{d-1};k)Q_k^{(d)}(\langle x,x\rangle) = \sum_{k=s+1}^{\infty}\xi_{d,k}^2B(S^{d-1};k) = \|P_{>s}\bar\sigma_d\|_{L^2}^2 = {\rm Tr}(\mathbb{H}_{d,>m}),$$
$$U_{d,>M}(\theta,\theta) = \sum_{k=S+1}^{\infty}\xi_{d,k}^2B(S^{d-1};k)Q_k^{(d)}(\langle\theta,\theta\rangle) = \sum_{k=S+1}^{\infty}\xi_{d,k}^2B(S^{d-1};k) = \|P_{>S}\bar\sigma_d\|_{L^2}^2 = {\rm Tr}(\mathbb{U}_{d,>M}), \tag{190}$$
where we used that $Q_k^{(d)}(d) = 1$.

Step 2. Checking the assumptions at level $\{(N(d), M(d), n(d), m(d))\}_{d\ge 1}$. We are now in position to verify the assumptions of Theorem 1. Choose $u := u(d)$ to be the number of eigenvalues of absolute value $\Omega_d(d^{-2\max(s,S)-2+\delta})$, for $\delta > 0$ small enough that $(\lambda_{d,j})_{j\in[u]}$ contains all the eigenvalues associated to the spherical harmonics of degree less or equal to $\max(S,s)$, and none of the eigenvalues associated to spherical harmonics of degree $2\max(S,s)+2$ and bigger. We therefore must have $u \ge \max(M(d), m(d))$.

Let us verify the conditions of $(N, M, n, m)$-FMCP in Assumption 1 with the sequence of integers $u(d)$:

(a) The hypercontractivity of the space of polynomials of degree less or equal to $2\max(S,s)+1$ is a consequence of a classical result due to Beckner [Bec92] (see Section E.3).

(b) Let us lower bound the right-hand side of Eq. (18).
We have
$$\sum_{j=u(d)+1}^{\infty}\lambda_{d,j} \ge \sum_{k=2\max(s,S)+2}^{\infty}\xi_{d,k}^2B(S^{d-1};k) = \Omega_d(1),$$
$$\sum_{j=u(d)+1}^{\infty}\lambda_{d,j}^2 \le \Big\{\sup_{j>u}\lambda_{d,j}\Big\}\cdot\sum_{j=u(d)+1}^{\infty}\lambda_{d,j} = O_d(d^{-2\max(s,S)-2+\delta})\cdot\sum_{j=u(d)+1}^{\infty}\lambda_{d,j}.$$
Hence,
$$\frac{\Big(\sum_{j=u(d)+1}^{\infty}\lambda_{d,j}\Big)^2}{\sum_{j=u(d)+1}^{\infty}\lambda_{d,j}^2} = \Omega_d(d^{2\max(s,S)+2-\delta}) \ge \max(n,N)^{1+\delta'}, \tag{191}$$
for some $\delta' > 0$, taking $\delta$ sufficiently small and using that $n \le d^{s+1-\delta_0}$ and $N \le d^{S+1-\delta_0}$ for some fixed $\delta_0 > 0$.

(c) From Eq. (28) in Assumption 3, we only need to check that, for $q$ such that
$$\min(n,N)\,\max(N,n)^{2/q-2}\log(\max(N,n)) = o_d(1),$$
we have
$$\mathbb{E}_{x,\theta}\big[[P_{>u}\bar\sigma_d](\langle x,\theta\rangle/\sqrt d)^{2q}\big]^{1/(2q)} = O_d(1).$$
Denote by $\mathcal{S}$ the set of eigenvalues $\lambda_{d,j}$, with $j > u$, associated to spherical harmonics of degree less or equal to $2\max(s,S)+1$. By the triangular inequality, we have
$$\mathbb{E}_{x,\theta}\big[[P_{>u}\bar\sigma_d](\langle x,\theta\rangle/\sqrt d)^{2q}\big]^{1/(2q)} \le \mathbb{E}_{x,\theta}\big[[P_{\mathcal S}\bar\sigma_d](\langle x,\theta\rangle/\sqrt d)^{2q}\big]^{1/(2q)} + \mathbb{E}_{x,\theta}\big[[P_{>2\max(s,S)+1}\bar\sigma_d](\langle x,\theta\rangle/\sqrt d)^{2q}\big]^{1/(2q)} = O_d(1),$$
where we used that $P_{\mathcal S}\bar\sigma_d$ is a polynomial of degree less or equal to $2\max(S,s)+1$ in each variable $x$ and $\theta$ and satisfies the hypercontractivity property (see Lemma 6), i.e.,
$$\mathbb{E}_{x,\theta}\big[[P_{\mathcal S}\bar\sigma_d](\langle x,\theta\rangle/\sqrt d)^{2q}\big] = O_d(1)\cdot\mathbb{E}_x\big[H_{d,\mathcal S}(x,x)\big]^q = O_d(1)\cdot{\rm Tr}(\mathbb{H}_{d,\mathcal S})^q = O_d(1),$$
while the bound on $P_{>2\max(s,S)+1}\bar\sigma_d$ follows from Assumption 3.(a) and Lemma 16 stated below.

(d) This is automatically verified because the diagonal elements are constant in this case (Eq. (190)).

Next, we check Assumption 2 at level $(N, M, n, m)$. Consider the overparametrized case $N(d) \ge n(d)$, and therefore $M \ge m$. The underparametrized case $N(d) \le n(d)$ is treated analogously.

(i) The eigenvalue sums in Eq. (19) can be estimated as follows:
$$\frac{1}{\lambda_{d,m(d)}}\sum_{k=m(d)+1}^{\infty}\lambda_{d,k} = \frac{1}{\xi_{d,s}^2}\sum_{k=s+1}^{\infty}\xi_{d,k}^2B(S^{d-1};k) = O_d(d^s), \tag{192}$$
$$\frac{1}{\lambda_{d,m(d)+1}}\sum_{k=m(d)+1}^{\infty}\lambda_{d,k} = \frac{1}{\xi_{d,s+1}^2}\sum_{k=s+1}^{\infty}\xi_{d,k}^2B(S^{d-1};k) \ge B(S^{d-1};s+1) = \Omega_d(d^{s+1}). \tag{193}$$
The last equality in (192) follows from Eq. (183) and assumption (186). Hence condition (19) in Assumption 2 is satisfied since, by the statement of Theorem 2, we assume $d^{s+\delta_0} \le n \le d^{s+1-\delta_0}$. Furthermore, by Eq. (189), we have $m \le n^{1-\delta}$ for some $\delta > 0$.

(ii) The eigenvalue sum in Eq. (20) is
$$\frac{1}{\lambda_{d,M(d)+1}}\sum_{k=M(d)+1}^{\infty}\lambda_{d,k} = \frac{1}{\xi_{d,S+1}^2}\sum_{k=S+1}^{\infty}\xi_{d,k}^2B(S^{d-1};k) \ge B(S^{d-1};S+1) = \Omega_d(d^{S+1}). \tag{194}$$
Hence condition (20) in Assumption 2 is satisfied since, by the statement of Theorem 2, $N \le d^{S+1-\delta_0}$. By Eq. (189), we have $M \le N^{1-\delta}$ for some $\delta > 0$.
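As a quick numerical companion to Step 1, the snippet below (an illustration we added, not part of the paper) evaluates the degeneracies $B(S^{d-1};k)$ and checks that $m = \sum_{k\le s}B(S^{d-1};k)$ indeed grows as $\Theta_d(d^s)$, so that the sample-size window $d^{s+\delta_0}\le n\le d^{s+1-\delta_0}$ sits strictly between the eigenvalue blocks of degree $s$ and $s+1$.

```python
from scipy.special import comb

def B_sphere(d: int, k: int) -> int:
    """Dimension of degree-k spherical harmonics on S^{d-1}: (2k+d-2)/(d-2) * C(k+d-3, k)."""
    return round((2 * k + d - 2) / (d - 2) * comb(k + d - 3, k, exact=True))

s = 2
for d in [50, 100, 200, 400]:
    m = sum(B_sphere(d, k) for k in range(s + 1))   # number of eigenvalues of degree <= s
    print(f"d={d:4d}  B(S^(d-1);{s})={B_sphere(d, s):>9d}  m={m:>9d}  m/d^{s}={m / d**s:.3f}")
# m/d^s approaches 1/s! = 1/2, consistent with m = Theta_d(d^s) in Eq. (189).
```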
Lemma 16. Consider $m, \ell$ two fixed integers. Assume $|\bar\sigma_d(x)| \le c_0\exp(c_1x^2/(4m))$ with $c_0 > 0$ and $c_1 < 1/2$. Then
$$\mathbb{E}_{x\sim\tau_d^1}\big[\bar\sigma_{d,>\ell}(x)^{2m}\big] = O_d(1),$$
where $\tau_d^1$ is the marginal distribution of $\langle e, x\rangle$ with $\|e\|_2 = 1$ and $x \sim {\rm Unif}(S^{d-1}(\sqrt d))$, and we denoted $\bar\sigma_{d,>\ell} = P_{>\ell}\bar\sigma_d$.

Proof of Lemma 16. Recall that
$$\bar\sigma_{d,>\ell}(x) = \bar\sigma_d(x) - \sum_{k=0}^{\ell}\xi_{d,k}(\bar\sigma)B(S^{d-1};k)Q_k^{(d)}(\sqrt d x), \tag{195}$$
where $\xi_{d,k}(\bar\sigma)^2B(S^{d-1};k) \le \|\bar\sigma_d\|_{L^2}^2 \le C$ for some constant $C$ (recall $|\bar\sigma_d(x)| \le c_0\exp(c_1x^2/(4m))$), and $\sqrt{B(S^{d-1};k)}\,Q_k^{(d)}$ is a degree-$k$ polynomial that converges to the Hermite polynomial ${\rm He}_k/\sqrt{k!}$ (see Section E.1.3). Therefore, $\bar\sigma_{d,>\ell}$ is equal to $\bar\sigma_d$ plus a polynomial of degree $\ell$ with bounded coefficients. In particular, from the assumption $|\bar\sigma_d(x)| \le c_0\exp(c_1x^2/(4m))$, we deduce that there exists $c_0' > 0$ such that $|\bar\sigma_{d,>\ell}(x)| \le c_0'\exp(c_1x^2/(4m))$, whence
$$\Big|\mathbb{E}_x\big[\bar\sigma_{d,>\ell}(x)^{2m}\big]\Big| \le (c_0')^{2m}\,\mathbb{E}_x\big[\exp(c_1x^2/2)\big]. \tag{196}$$
Furthermore, recall that $\tau_d^1({\rm d}x) = C_d(1-x^2/d)^{(d-3)/2}\mathbf{1}_{x\in[-\sqrt d,\sqrt d]}\,{\rm d}x \le C\exp(-x^2/4)\,{\rm d}x$ for $d$ large enough. We can therefore uniformly upper bound the right-hand side of Eq. (196) and use dominated convergence, which concludes the proof.

D.2 On the hypercube

The proof for the hypercube $\mathcal{Q}^d$ follows the same lines as the proof for the sphere. We refer to [O'D14] for background on Fourier analysis on $\mathcal{Q}^d$, and to Section E.2 for notations that make the analogy with the sphere transparent. In particular, an analogue of Lemma 16 follows by noticing that the law of $\langle\mathbf 1, x\rangle/\sqrt d$ is a standardized binomial, which can be expressed in terms of the standard normal distribution up to polynomial factors. The only difference comes from the degeneracy
$$B(\mathcal{Q}^d;\ell) = \binom{d}{\ell}.$$
Hence Assumption 3.(a) only implies $\xi_{d,d-\ell}^2 = O_d(d^{-\ell})$ for the last coefficients, which is the reason for the further requirement in Assumption 3.(c). Let us check that Assumption 3.(c) holds for a class of smooth activation functions. We believe that this assumption in fact holds much more generally, but we leave such generalizations to future work. Lemma 17.
Consider $\ell$ a fixed integer. Assume there exist constants $c_0 > 0$ and $c_1 < 1$ such that $|\bar\sigma^{(\ell)}(x)| \le c_0\exp(c_1x^2/4)$ for all $x\in\mathbb{R}$. Then, we have
$$\max_{k\le\ell}\big|\xi_{d,d-k}(\bar\sigma)\big| = O_d(d^{-\ell/2}),$$
where $\xi_{d,d-k}(\bar\sigma) = \langle\bar\sigma(\langle\mathbf 1,\cdot\rangle/\sqrt d), Q_{d-k}^{(d)}(\langle\mathbf 1,\cdot\rangle)\rangle_{L^2(\mathcal{Q}^d)}$, and $Q_k^{(d)}$ is the $k$-th hypercubic Gegenbauer polynomial (see Appendix E.2).

Proof of Lemma 17. By the mean value theorem, we have for any $k \le \ell$,
$$\xi_{d,d-k}(\bar\sigma) = \mathbb{E}_x\big[\bar\sigma(\langle\mathbf 1,x\rangle/\sqrt d)\,Q_{d-k}^{(d)}(\langle\mathbf 1,x\rangle)\big] = \mathbb{E}_x\Big[x_1\cdots x_{d-k}\,\bar\sigma\Big(\frac{x_1+\dots+x_d}{\sqrt d}\Big)\Big]$$
$$= \frac12\,\mathbb{E}_{x_2,\dots,x_d}\Big[x_2\cdots x_{d-k}\Big(\bar\sigma\Big(\frac{1+x_2+\dots+x_d}{\sqrt d}\Big) - \bar\sigma\Big(\frac{-1+x_2+\dots+x_d}{\sqrt d}\Big)\Big)\Big]$$
$$= \frac{1}{\sqrt d}\,\mathbb{E}_{x_2,\dots,x_d}\big[x_2\cdots x_{d-k}\,\bar\sigma^{(1)}(\zeta_1(x_2,\dots,x_d))\big],$$
where on the third line we integrated over the first coordinate $x_1$, and on the last line $|\zeta_1(x_2,\dots,x_d) - (x_2+\dots+x_d)/\sqrt d| \le 1/\sqrt d$. By iterating this computation $\ell$ times, we get
$$\xi_{d,d-k}(\bar\sigma) = \frac{1}{d^{\ell/2}}\,\mathbb{E}_{x_{\ell+1},\dots,x_d}\big[x_{\ell+1}\cdots x_{d-k}\,\bar\sigma^{(\ell)}(\zeta_\ell(x_{\ell+1},\dots,x_d))\big],$$
where $|\zeta_\ell(x_{\ell+1},\dots,x_d) - (x_{\ell+1}+\dots+x_d)/\sqrt d| \le \ell/\sqrt d$. Hence, writing $X = (x_{\ell+1}+\dots+x_d)/\sqrt d$ and using $(a+b)^2 \le 2a^2+2b^2$,
$$|\xi_{d,d-k}(\bar\sigma)| \le \frac{1}{d^{\ell/2}}\,\mathbb{E}_{x_{\ell+1},\dots,x_d}\big[|\bar\sigma^{(\ell)}(\zeta_\ell(x_{\ell+1},\dots,x_d))|\big] \le \frac{1}{d^{\ell/2}}\,\mathbb{E}_X\big[c_0\exp\big(c_1X^2/2 + c_1\ell^2/(2d)\big)\big] = O_d(d^{-\ell/2}),$$
where we used that $X$ converges weakly to the standard normal distribution and dominated convergence.

E Technical background
E.1 Functions on the sphere
E.1.1 Functional spaces over the sphere
For $d \ge 3$, we let $S^{d-1}(r) = \{x\in\mathbb{R}^d : \|x\|_2 = r\}$ denote the sphere with radius $r$ in $\mathbb{R}^d$. We will mostly work with the sphere of radius $\sqrt d$, $S^{d-1}(\sqrt d)$, and will denote by $\tau_d$ the uniform probability measure on $S^{d-1}(\sqrt d)$. All functions in this section are assumed to be elements of $L^2(S^{d-1}(\sqrt d),\tau_d)$, with scalar product and norm denoted as $\langle\cdot,\cdot\rangle_{L^2}$ and $\|\cdot\|_{L^2}$:
$$\langle f, g\rangle_{L^2} \equiv \int_{S^{d-1}(\sqrt d)} f(x)g(x)\,\tau_d({\rm d}x). \tag{197}$$
For $\ell\in\mathbb{Z}_{\ge 0}$, let $\tilde V_{d,\ell}$ be the space of homogeneous harmonic polynomials of degree $\ell$ on $\mathbb{R}^d$ (i.e. homogeneous polynomials $q(x)$ satisfying $\Delta q(x) = 0$), and denote by $V_{d,\ell}$ the linear space of functions obtained by restricting the polynomials in $\tilde V_{d,\ell}$ to $S^{d-1}(\sqrt d)$. With these definitions, we have the following orthogonal decomposition:
$$L^2(S^{d-1}(\sqrt d),\tau_d) = \bigoplus_{\ell=0}^{\infty}V_{d,\ell}. \tag{198}$$
The dimension of each subspace is given by
$$\dim(V_{d,\ell}) = B(S^{d-1};\ell) = \frac{2\ell+d-2}{d-2}\binom{\ell+d-3}{\ell}. \tag{199}$$
For each $\ell\in\mathbb{Z}_{\ge 0}$, the spherical harmonics $\{Y_{\ell,j}^{(d)}\}_{1\le j\le B(S^{d-1};\ell)}$ form an orthonormal basis of $V_{d,\ell}$:
$$\langle Y_{ki}^{(d)}, Y_{sj}^{(d)}\rangle_{L^2} = \delta_{ij}\delta_{ks}.$$
Note that our convention is different from the more standard one, which defines the spherical harmonics as functions on $S^{d-1}(1)$. It is immediate to pass from one convention to the other by a simple scaling. We will drop the superscript $d$ and write $Y_{\ell,j} = Y_{\ell,j}^{(d)}$ whenever clear from the context.

We denote by $P_k$ the orthogonal projection onto $V_{d,k}$ in $L^2(S^{d-1}(\sqrt d),\tau_d)$. This can be written in terms of spherical harmonics as
$$P_kf(x) \equiv \sum_{l=1}^{B(S^{d-1};k)}\langle f, Y_{kl}\rangle_{L^2}Y_{kl}(x). \tag{200}$$
We also define $P_{\le\ell} \equiv \sum_{k=0}^{\ell}P_k$, $P_{>\ell} \equiv I - P_{\le\ell} = \sum_{k=\ell+1}^{\infty}P_k$, and $P_{<\ell} \equiv P_{\le\ell-1}$, $P_{\ge\ell} \equiv P_{>\ell-1}$.
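For readers who want to sanity-check Eq. (199): the dimension of $V_{d,\ell}$ can also be obtained as the dimension of the space of homogeneous degree-$\ell$ polynomials minus the dimension of those divisible by $\|x\|_2^2$, i.e. $\binom{d+\ell-1}{\ell} - \binom{d+\ell-3}{\ell-2}$. The snippet below (our own addition) confirms that the two expressions agree.

```python
from scipy.special import comb

def B_formula(d: int, l: int) -> int:
    # Eq. (199): (2l + d - 2)/(d - 2) * C(l + d - 3, l)
    return round((2 * l + d - 2) / (d - 2) * comb(l + d - 3, l, exact=True))

def B_difference(d: int, l: int) -> int:
    # dim(homogeneous degree-l polynomials) - dim(those divisible by |x|^2)
    return comb(d + l - 1, l, exact=True) - (comb(d + l - 3, l - 2, exact=True) if l >= 2 else 0)

assert all(B_formula(d, l) == B_difference(d, l)
           for d in range(3, 30) for l in range(0, 12))
print("Eq. (199) matches the harmonic-polynomial dimension count.")
```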
E.1.2 Gegenbauer polynomials

The $\ell$-th Gegenbauer polynomial $Q_\ell^{(d)}$ is a polynomial of degree $\ell$. Consistently with our convention for spherical harmonics, we view $Q_\ell^{(d)}$ as a function $Q_\ell^{(d)} : [-d,d]\to\mathbb{R}$. The set $\{Q_\ell^{(d)}\}_{\ell\ge 0}$ forms an orthogonal basis of $L^2([-d,d],\tilde\tau_d)$, where $\tilde\tau_d$ is the distribution of $\sqrt d\,\langle x, e\rangle$ when $x\sim\tau_d$ (with $e$ any unit vector), satisfying the normalization condition
$$\langle Q_k^{(d)}(\sqrt d\langle e,\cdot\rangle), Q_j^{(d)}(\sqrt d\langle e,\cdot\rangle)\rangle_{L^2(S^{d-1}(\sqrt d))} = \frac{1}{B(S^{d-1};k)}\delta_{jk}. \tag{201}$$
In particular, these polynomials are normalized so that $Q_\ell^{(d)}(d) = 1$. As above, we will omit the superscript $(d)$ in $Q_\ell^{(d)}$ when clear from the context.

Gegenbauer polynomials are directly related to spherical harmonics as follows. Fix $v\in S^{d-1}(\sqrt d)$ and consider the subspace of $V_{d,\ell}$ formed by all functions that are invariant under the rotations of $\mathbb{R}^d$ that keep $v$ unchanged. It is not hard to see that this subspace has dimension one, and that it coincides with the span of the function $Q_\ell^{(d)}(\langle v,\cdot\rangle)$.

We will use the following properties of Gegenbauer polynomials:

1. For $x, y\in S^{d-1}(\sqrt d)$,
$$\langle Q_j^{(d)}(\langle x,\cdot\rangle), Q_k^{(d)}(\langle y,\cdot\rangle)\rangle_{L^2} = \frac{1}{B(S^{d-1};k)}\delta_{jk}Q_k^{(d)}(\langle x,y\rangle). \tag{202}$$

2. For $x, y\in S^{d-1}(\sqrt d)$,
$$Q_k^{(d)}(\langle x,y\rangle) = \frac{1}{B(S^{d-1};k)}\sum_{i=1}^{B(S^{d-1};k)}Y_{ki}^{(d)}(x)Y_{ki}^{(d)}(y). \tag{203}$$

These properties imply that, up to a constant, $Q_k^{(d)}(\langle x,y\rangle)$ is a representation of the projector onto the subspace of degree-$k$ spherical harmonics:
$$(P_kf)(x) = B(S^{d-1};k)\int_{S^{d-1}(\sqrt d)}Q_k^{(d)}(\langle x,y\rangle)f(y)\,\tau_d({\rm d}y). \tag{204}$$
For a function $\bar\sigma\in L^2([-\sqrt d,\sqrt d],\tau_d^1)$ (where $\tau_d^1$ is the distribution of $\langle e, x\rangle$ when $x\sim{\rm Unif}(S^{d-1}(\sqrt d))$ and $\|e\|_2 = 1$), denoting its spherical harmonics coefficients $\xi_{d,k}(\bar\sigma)$ by
$$\xi_{d,k}(\bar\sigma) = \int_{[-\sqrt d,\sqrt d]}\bar\sigma(x)Q_k^{(d)}(\sqrt d x)\,\tau_d^1({\rm d}x), \tag{205}$$
the following equation holds in the $L^2([-\sqrt d,\sqrt d],\tau_d^1)$ sense:
$$\bar\sigma(x) = \sum_{k=0}^{\infty}\xi_{d,k}(\bar\sigma)B(S^{d-1};k)Q_k^{(d)}(\sqrt d x).$$
To any rotationally invariant kernel $H_d(x_1,x_2) = h_d(\langle x_1,x_2\rangle/d)$, with $h_d(\sqrt d\,\cdot)\in L^2([-\sqrt d,\sqrt d],\tau_d^1)$, we can associate a self-adjoint operator $\mathbb{H}_d : L^2(S^{d-1}(\sqrt d))\to L^2(S^{d-1}(\sqrt d))$ via
$$\mathbb{H}_df(x) \equiv \int_{S^{d-1}(\sqrt d)}h_d(\langle x,x_1\rangle/d)f(x_1)\,\tau_d({\rm d}x_1). \tag{206}$$
By rotational invariance, the space $V_{d,k}$ is an eigenspace of $\mathbb{H}_d$, and we will denote the corresponding eigenvalue by $\xi_{d,k}(h_d)$. In other words, $\mathbb{H}_df = \sum_{k=0}^{\infty}\xi_{d,k}(h_d)P_kf$. The eigenvalues can be computed via
$$\xi_{d,k}(h_d) = \int_{[-\sqrt d,\sqrt d]}h_d\big(x/\sqrt d\big)Q_k^{(d)}(\sqrt d x)\,\tau_d^1({\rm d}x). \tag{207}$$

E.1.3 Hermite polynomials

The Hermite polynomials $\{{\rm He}_k\}_{k\ge 0}$ form an orthogonal basis of $L^2(\mathbb{R},\gamma)$, where $\gamma({\rm d}x) = e^{-x^2/2}{\rm d}x/\sqrt{2\pi}$ is the standard Gaussian measure, and ${\rm He}_k$ has degree $k$. We will follow the classical normalization (here and below, expectation is with respect to $G\sim{\sf N}(0,1)$):
$$\mathbb{E}\big\{{\rm He}_j(G){\rm He}_k(G)\big\} = k!\,\delta_{jk}. \tag{208}$$
As a consequence, for any function $g\in L^2(\mathbb{R},\gamma)$, we have the decomposition
$$g(x) = \sum_{k=0}^{\infty}\frac{\mu_k(g)}{k!}{\rm He}_k(x), \qquad \mu_k(g) \equiv \mathbb{E}\big\{g(G){\rm He}_k(G)\big\}. \tag{209}$$
The Hermite polynomials can be obtained as high-dimensional limits of the Gegenbauer polynomials introduced in the previous section. Indeed, the Gegenbauer polynomials (up to a $\sqrt d$ scaling of the domain) are constructed by Gram-Schmidt orthogonalization of the monomials $\{x^k\}_{k\ge 0}$ with respect to the measure $\tilde\tau_d$, while Hermite polynomials are obtained by Gram-Schmidt orthogonalization with respect to $\gamma$. Since $\tilde\tau_d \Rightarrow \gamma$ (here $\Rightarrow$ denotes weak convergence), it is immediate to show that, for any fixed integer $k$,
$$\lim_{d\to\infty}{\rm Coeff}\big\{Q_k^{(d)}(\sqrt d x)B(S^{d-1};k)^{1/2}\big\} = {\rm Coeff}\Big\{\frac{1}{(k!)^{1/2}}{\rm He}_k(x)\Big\}. \tag{210}$$
Here and below, for $P$ a polynomial, ${\rm Coeff}\{P(x)\}$ is the vector of the coefficients of $P$. As a consequence, for any fixed integer $k$, we have
$$\mu_k(\bar\sigma) = \lim_{d\to\infty}\xi_{d,k}(\bar\sigma)\big(B(S^{d-1};k)\,k!\big)^{1/2}, \tag{211}$$
where $\mu_k(\bar\sigma)$ and $\xi_{d,k}(\bar\sigma)$ are given in Eqs. (209) and (205).
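Eq. (210) is easy to check numerically. The sketch below is our own addition; it relies on the classical identity $Q_k^{(d)}(t) = C_k^{(\alpha)}(t/d)/C_k^{(\alpha)}(1)$ with $\alpha = (d-2)/2$, where $C_k^{(\alpha)}$ is the standard Gegenbauer polynomial, and compares the coefficient vectors of $B(S^{d-1};k)^{1/2}Q_k^{(d)}(\sqrt d\,x)$ and ${\rm He}_k(x)/\sqrt{k!}$ as $d$ grows.

```python
import math
import numpy as np
from scipy.special import comb, gegenbauer, hermitenorm

def coeffs_Q_scaled(d: int, k: int) -> np.ndarray:
    """Coefficient vector of B(S^{d-1};k)^{1/2} * Q_k^{(d)}(sqrt(d)*x), highest degree first."""
    alpha = (d - 2) / 2.0
    C = gegenbauer(k, alpha)   # classical Gegenbauer C_k^(alpha), as a numpy poly1d
    # Q_k^{(d)}(t) = C(t/d)/C(1); substituting t = sqrt(d)*x, the argument becomes x/sqrt(d)
    coeffs = np.array([c / d ** (e / 2.0) for e, c in zip(range(k, -1, -1), C.coeffs)])
    B = (2 * k + d - 2) / (d - 2) * comb(k + d - 3, k)
    return np.sqrt(B) * coeffs / C(1)

k = 3
target = hermitenorm(k).coeffs / math.sqrt(math.factorial(k))  # He_k(x)/sqrt(k!)
for d in [10, 100, 1000]:
    gap = np.max(np.abs(coeffs_Q_scaled(d, k) - target))
    print(f"d={d:5d}   max coefficient gap = {gap:.2e}")
# The gap vanishes as d grows, illustrating Eq. (210).
```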
E.2 Functions on the hypercube

Fourier analysis on the hypercube is a well-studied subject [O'D14]. The purpose of this section is to introduce notations that make the correspondence with the proofs on the sphere straightforward. For convenience, we adopt the same notations as in the spherical case.
E.2.1 Fourier basis
Denote by $\mathcal{Q}^d = \{-1,+1\}^d$ the hypercube in $d$ dimensions. Let $\tau_d$ be the uniform probability measure on $\mathcal{Q}^d$. All the functions will be assumed to be elements of $L^2(\mathcal{Q}^d,\tau_d)$ (which contains all the bounded functions $f : \mathcal{Q}^d\to\mathbb{R}$), with scalar product and norm denoted as $\langle\cdot,\cdot\rangle_{L^2}$ and $\|\cdot\|_{L^2}$:
$$\langle f, g\rangle_{L^2} \equiv \int_{\mathcal{Q}^d}f(x)g(x)\,\tau_d({\rm d}x) = \frac{1}{2^d}\sum_{x\in\mathcal{Q}^d}f(x)g(x).$$
Notice that $L^2(\mathcal{Q}^d,\tau_d)$ is a $2^d$-dimensional linear space. By analogy with the spherical case, we decompose $L^2(\mathcal{Q}^d,\tau_d)$ as a direct sum of $d+1$ linear spaces obtained from polynomials of degree $\ell = 0,\dots,d$:
$$L^2(\mathcal{Q}^d,\tau_d) = \bigoplus_{\ell=0}^{d}V_{d,\ell}.$$
For each $\ell\in\{0,\dots,d\}$, consider the Fourier basis $\{Y_{\ell,S}^{(d)}\}_{S\subseteq[d],|S|=\ell}$ of degree $\ell$, where, for a set $S\subseteq[d]$, the basis element is given by
$$Y_{\ell,S}^{(d)}(x) \equiv x_S \equiv \prod_{i\in S}x_i.$$
It is easy to verify that (notice that $x_i^k = x_i$ if $k$ is odd and $x_i^k = 1$ if $k$ is even)
$$\langle Y_{\ell,S}^{(d)}, Y_{k,S'}^{(d)}\rangle_{L^2} = \mathbb{E}[x_S\cdot x_{S'}] = \delta_{\ell,k}\delta_{S,S'}.$$
Hence $\{Y_{\ell,S}^{(d)}\}_{S\subseteq[d],|S|=\ell}$ forms an orthonormal basis of $V_{d,\ell}$, and
$$\dim(V_{d,\ell}) = B(\mathcal{Q}^d;\ell) = \binom{d}{\ell}.$$
As above, we will omit the superscript $(d)$ in $Y_{\ell,S}^{(d)}$ when clear from the context.
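Because $\mathcal{Q}^d$ is finite, the orthonormality of the Fourier (parity) basis can be verified exactly by enumeration. The short check below is our own illustration: it builds all parities on a small hypercube and confirms $\langle Y_S, Y_{S'}\rangle_{L^2} = \delta_{S,S'}$, together with the dimension count $\sum_\ell\binom{d}{\ell} = 2^d$.

```python
import itertools
import numpy as np

d = 6
cube = np.array(list(itertools.product([-1, 1], repeat=d)))      # all 2^d points
subsets = [S for r in range(d + 1) for S in itertools.combinations(range(d), r)]

# Parity functions Y_S(x) = prod_{i in S} x_i, one column per subset S
Y = np.stack([cube[:, S].prod(axis=1) if S else np.ones(len(cube), dtype=int)
              for S in subsets], axis=1)

gram = Y.T @ Y / 2**d        # <Y_S, Y_S'> under the uniform measure tau_d
assert np.allclose(gram, np.eye(len(subsets)))
print(f"all {len(subsets)} parities are orthonormal; dim L2(Q^d) = 2^d = {2**d}")
```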
E.2.2 Hypercubic Gegenbauer

We consider the family of polynomials $\{Q_\ell^{(d)}\}_{\ell=0,\dots,d}$, which we will call hypercubic Gegenbauer polynomials, defined by
$$Q_\ell^{(d)}(\langle x,y\rangle) = \frac{1}{B(\mathcal{Q}^d;\ell)}\sum_{S\subseteq[d],|S|=\ell}Y_{\ell,S}^{(d)}(x)Y_{\ell,S}^{(d)}(y).$$
Notice that the right-hand side only depends on $\langle x,y\rangle$, and therefore these polynomials are uniquely defined. In particular,
$$\langle Q_\ell^{(d)}(\langle\mathbf 1,\cdot\rangle), Q_k^{(d)}(\langle\mathbf 1,\cdot\rangle)\rangle_{L^2} = \frac{1}{B(\mathcal{Q}^d;k)}\delta_{\ell k}.$$
Hence $\{Q_\ell^{(d)}\}_{\ell=0,\dots,d}$ forms an orthogonal basis of $L^2(\{-d,-d+2,\dots,d-2,d\},\tilde\tau_d)$, where $\tilde\tau_d$ is the distribution of $\langle\mathbf 1, x\rangle$ when $x\sim\tau_d$, i.e., $\tilde\tau_d \sim 2\,{\rm Binom}(d,1/2) - d$. Moreover,
$$\langle Q_\ell^{(d)}(\langle x,\cdot\rangle), Q_k^{(d)}(\langle y,\cdot\rangle)\rangle_{L^2} = \frac{1}{B(\mathcal{Q}^d;k)}Q_k^{(d)}(\langle x,y\rangle)\delta_{\ell k}.$$
For a function $\bar\sigma(\cdot/\sqrt d)\in L^2(\{-d,-d+2,\dots,d-2,d\},\tilde\tau_d)$, we denote its hypercubic Gegenbauer coefficients $\xi_{d,k}(\bar\sigma)$ by
$$\xi_{d,k}(\bar\sigma) = \int_{\{-d,-d+2,\dots,d-2,d\}}\bar\sigma(x/\sqrt d)Q_k^{(d)}(x)\,\tilde\tau_d({\rm d}x).$$
Notice that, by weak convergence of $\langle\mathbf 1,x\rangle/\sqrt d$ to the standard normal distribution, we also have convergence of the (rescaled) hypercubic Gegenbauer polynomials to the Hermite polynomials, i.e., for any fixed $k$,
$$\lim_{d\to\infty}{\rm Coeff}\big\{Q_k^{(d)}(\sqrt d x)B(\mathcal{Q}^d;k)^{1/2}\big\} = {\rm Coeff}\Big\{\frac{1}{(k!)^{1/2}}{\rm He}_k(x)\Big\}. \tag{212}$$

E.3 Hypercontractivity of the Gaussian measure and of the uniform distributions on the sphere and the hypercube
By Hölder's inequality, we have $\|f\|_{L^p} \le \|f\|_{L^q}$ for any $f$ and any $p \le q$. The reverse inequality does not hold in general, even up to a constant. However, for some measures, a reverse inequality does hold for sufficiently nice functions. These measures satisfy the celebrated hypercontractivity properties [Gro75, Bon70, Bec75, Bec92].

Lemma 18 (Hypercube hypercontractivity [Bec75]). For any $\ell\in\{0,\dots,d\}$ and $f_d\in L^2(\mathcal{Q}^d)$ a polynomial of degree $\ell$, and for any integer $q \ge 2$, we have
$$\|f_d\|_{L^q(\mathcal{Q}^d)} \le (q-1)^{\ell/2}\cdot\|f_d\|_{L^2(\mathcal{Q}^d)}.$$

Lemma 19 (Spherical hypercontractivity [Bec92]). For any $\ell\in\mathbb{N}$ and $f_d\in L^2(S^{d-1})$ a polynomial of degree $\ell$, and for any $q \ge 2$, we have
$$\|f_d\|_{L^q(S^{d-1})} \le (q-1)^{\ell/2}\cdot\|f_d\|_{L^2(S^{d-1})}.$$

Lemma 20 (Gaussian hypercontractivity). For any $\ell\in\mathbb{N}$ and $f\in L^2(\mathbb{R},\gamma)$ a polynomial of degree $\ell$ on $\mathbb{R}$, where $\gamma$ is the standard Gaussian distribution, and for any $q \ge 2$, we have
$$\|f\|_{L^q(\mathbb{R},\gamma)} \le (q-1)^{\ell/2}\cdot\|f\|_{L^2(\mathbb{R},\gamma)}.$$
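Lemma 18 can be verified exactly on a small hypercube, since both norms are finite sums. The snippet below is our own illustration (the random degree-$\ell$ polynomial is an arbitrary test function): it computes the ratio $\|f\|_{L^q}/\|f\|_{L^2}$ for a random low-degree polynomial and checks it against the bound $(q-1)^{\ell/2}$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
d, ell, qs = 10, 2, [4, 6, 8]
cube = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))

# Random polynomial of degree <= ell: a combination of parities x_S with |S| <= ell
subsets = [S for r in range(ell + 1) for S in itertools.combinations(range(d), r)]
coefs = rng.standard_normal(len(subsets))
vals = sum(c * (cube[:, S].prod(axis=1) if S else 1.0) for c, S in zip(coefs, subsets))

norm2 = np.mean(vals**2) ** 0.5
for q in qs:
    ratio = np.mean(np.abs(vals) ** q) ** (1 / q) / norm2
    bound = (q - 1) ** (ell / 2)
    print(f"q={q}: ||f||_q / ||f||_2 = {ratio:.3f} <= (q-1)^(l/2) = {bound:.3f}")
```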