Risk Convergence of Centered Kernel Ridge Regression with Large Dimensional Data
Khalil Elkhalil, Abla Kammoun, Xiangliang Zhang, Mohamed-Slim Alouini, Tareq Al-Naffouri
Abstract—This paper carries out a large dimensional analysis of a variation of kernel ridge regression that we call centered kernel ridge regression (CKRR), also known in the literature as kernel ridge regression with offset. This modified technique is obtained by accounting for the bias in the regression problem, resulting in the standard kernel ridge regression but with centered kernels. The analysis is carried out under the assumption that the data are drawn from a Gaussian distribution and relies heavily on tools from random matrix theory (RMT). Under the regime in which the data dimension and the training size grow infinitely large with fixed ratio, and under some mild assumptions controlling the data statistics, we show that both the empirical and the prediction risks converge to deterministic quantities that describe in closed-form fashion the performance of CKRR in terms of the data statistics and dimensions. Inspired by this theoretical result, we subsequently build a consistent estimator of the prediction risk based on the training data, which allows us to optimally tune the design parameters. A key insight of the proposed analysis is the fact that asymptotically a large class of kernels achieve the same minimum prediction risk. This insight is validated with both synthetic and real data.
Index Terms—Kernel regression, centered kernels, random matrix theory.
I. INTRODUCTION

Kernel ridge regression (KRR) is part of the family of kernel-based machine learning methods that deploy a set of nonlinear functions to describe the real output of interest [1], [2]. More precisely, the idea is to map the data into a high-dimensional space $\mathcal{H}$, a.k.a. feature space, which can even be of infinite dimension, resulting in a linear representation of the data with respect to the output. Then, a linear regression problem is solved in $\mathcal{H}$ while controlling over-fitting with a regularization term. In fact, the most important advantage of kernel methods is the kernel trick, or kernel substitution [1], which allows one to work directly with kernels and avoid explicit use of feature vectors in $\mathcal{H}$.

Due to its popularity, a rich body of research has been conducted to analyze the performance of KRR. In [3], a randomized version of KRR is studied with performance guarantees in terms of concentration bounds. The work in [4] analyzes the random features approximation in least squares kernel regression. More relevant results can be found in [5], where upper bounds on the prediction risk have been derived in terms of the empirical quadratic risk for general regression models. Similarly for KRR models, upper and lower bounds on the expected risk have been provided in [6] before being generalized to general regularization operators in [7]. Hence, most of the results related to the performance analysis of KRR and related regression techniques take the form of upper or lower bounds on the prediction risk. In this work, we study the problem from an asymptotic analysis perspective. As we will demonstrate in the course of the paper, such an analysis brings about novel results that predict the prediction risk metrics in an accurate fashion.

The authors are with the Electrical Engineering Program, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; e-mails: {khalil.elkhalil, abla.kammoun, xiangliang.zhang, tareq.alnaffouri, slim.alouini}@kaust.edu.sa.
Our focus is on a variation of KRR called centered kernel ridge regression (CKRR) that is built upon the same principles as KRR with the additional requirement of minimizing the bias in the learning problem. This variation has been motivated by Cortes et al. in [8] and [9], [10], where the benefits of centering kernels have been highlighted. The obtained regression technique can be seen as KRR with centered kernels. Moreover, in the high dimensional setting with certain normalizations, we show that kernel matrices behave as a rank-one matrix; centering thus allows us to neutralize this non-informative component and highlight higher order components that retain useful information about the data.

To understand the behavior of CKRR, we conduct a theoretical analysis in the large dimensional regime where both the data dimension $p$ and the training size $n$ tend to infinity with fixed ratio ($p/n \to$ constant). As far as inner-product kernels are concerned, with mild assumptions on the data statistics, we show, using fundamental results from random matrix theory elaborated in [11] and [12], that both the empirical and prediction risks approach deterministic quantities that relate in closed-form fashion these performance measures to the data statistics and dimensions. This important finding allows us to see how the model performance behaves as a function of the problem's parameters and as such to tune the design parameters to minimize the prediction risk. Moreover, as an outcome of this result, we show that it is possible to jointly optimize the regularization parameter along with the kernel function so as to achieve the minimum possible prediction risk. In other words, the minimum prediction risk is always attainable for all kernels with a proper choice of the regularization parameter. This implies that all kernels behave similarly to the linear kernel.
We regard such a fact as a consequence of the curse of dimensionality phenomenon, which causes CKRR to be asymptotically equivalent to centered linear ridge regression. As an additional contribution of the present work, we build a consistent estimator of the prediction risk based on the training samples, thereby paving the way towards an optimal setting of the regularization parameter.

The rest of the paper is structured as follows. In Section II, we give a brief background on kernel ridge regression and introduce its centered variation. In Section III, we provide the main results of the paper related to the asymptotic analysis of CKRR as well as the construction of a consistent estimator of the prediction risk. Then, we provide some numerical examples in Section IV. We finally make some concluding remarks in Section V.

Notations: $\mathbb{E}[.]$ and $\operatorname{var}[.]$ stand for the expectation and the variance of a random variable, while $\overset{\rm a.s.}{\to}$ and $\overset{\rm prob.}{\to}$ respectively stand for almost sure convergence and convergence in probability. $\|.\|$ denotes the operator norm of a matrix and the $L_2$ norm for vectors; $\operatorname{tr}[.]$ stands for the trace operator. The notation $f = O(g)$ means that there exists a bounded $M$ such that $f \leq Mg$. We say that $f$ is $C^p$ if the $p$-th derivative of $f$ exists and is continuous.

II. BACKGROUND ON KERNEL RIDGE REGRESSION
Let $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ be a set of $n$ observations in $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ denotes the input space and $\mathcal{Y}$ the output space. Our aim is to predict the output of new input points $\mathbf{x} \in \mathcal{X}$ with a reasonable accuracy. Assume that the output is generated by a function $f : \mathcal{X} \to \mathcal{Y}$; then the problem can be cast as a function approximation problem where the goal is to find an estimate of $f$, denoted by $\widehat{f}$, such that $\widehat{f}(\mathbf{x})$ is close to the real output $f(\mathbf{x})$. In this context, the kernel learning problem is formulated as follows
$$\min_{f \in \mathcal{H}} \sum_{i=1}^n l\left(y_i, f(\mathbf{x}_i)\right) + \lambda\|f\|_{\mathcal{H}}^2, \qquad (1)$$
where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS), $l : \mathcal{Y}\times\mathcal{Y} \to \mathbb{R}$ is a loss function and $\lambda > 0$ is a regularization parameter that permits to control overfitting. Denoting by $\phi : \mathcal{X} \to \mathcal{H}$ a feature map that maps the data points to the feature space $\mathcal{H}$, we define $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ such that $k(\mathbf{x}, \mathbf{x}') = \langle\phi(\mathbf{x}), \phi(\mathbf{x}')\rangle_{\mathcal{H}}$ for all $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$, where $k$ is known as the positive definite kernel corresponding to the feature map $\phi$. With these definitions, the representer theorem [13], [14] shows that the minimizer of the problem in (1) writes as $f^*(\mathbf{x}) = \boldsymbol{\alpha}^T\phi(\mathbf{x})$. Thus, we can reformulate (1) as follows
$$\min_{\boldsymbol{\alpha} \in \mathcal{H}} \sum_{i=1}^n l\left(y_i, \boldsymbol{\alpha}^T\phi(\mathbf{x}_i)\right) + \lambda\|\boldsymbol{\alpha}\|^2. \qquad (2)$$
When $l$ is the squared loss, the optimization problem in (2) can be reformulated as
$$\min_{\boldsymbol{\alpha} \in \mathcal{H}} \|\mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\alpha}\|^2 + \lambda\|\boldsymbol{\alpha}\|^2, \qquad (3)$$
where $\boldsymbol{\Phi} = [\phi(\mathbf{x}_1), \cdots, \phi(\mathbf{x}_n)]^T$. This yields the solution $\boldsymbol{\alpha}^* = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda\mathbf{I}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{y}$, where $\mathbf{y} = \{y_i\}_{i=1}^n$. Then, the output estimate of any data point $\mathbf{s}$ is given by [1]
$$\widehat{f}(\mathbf{s}) = \boldsymbol{\kappa}(\mathbf{s})^T\left(\mathbf{K} + \lambda\mathbf{I}\right)^{-1}\mathbf{y}, \qquad (4)$$
where $\boldsymbol{\kappa}(\mathbf{s}) = \{k(\mathbf{s}, \mathbf{x}_i)\}_{i=1}^n$ is the information vector and $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^T$ with entries $\mathbf{K}_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j)$, $1 \leq i, j \leq n$.
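As a concrete illustration of (4), the estimator fits in a few lines of code. The snippet below is a minimal sketch (the function names are ours, not from the paper), assuming a kernel callable that returns Gram blocks:

```python
import numpy as np

def krr_predict(X, y, S, kernel, lam):
    """Kernel ridge regression, eq. (4): f_hat(s) = kappa(s)^T (K + lam I)^{-1} y."""
    n = X.shape[0]
    K = kernel(X, X)                          # n x n kernel matrix
    alpha = np.linalg.solve(K + lam * np.eye(n), y)
    return kernel(S, X) @ alpha               # one prediction per row of S

def inner_product_kernel(g):
    """Inner-product kernel k(x, x') = g(x^T x' / p), as used throughout the paper."""
    return lambda A, B: g(A @ B.T / A.shape[1])
```

With the linear kernel $g(u) = u$, this prediction coincides with primal ridge regression via the push-through identity $\mathbf{X}^T(\mathbf{X}\mathbf{X}^T/p + \lambda\mathbf{I})^{-1} = (\mathbf{X}^T\mathbf{X}/p + \lambda\mathbf{I})^{-1}\mathbf{X}^T$, which is a quick correctness check.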
This is commonly known as the kernel trick, which greatly simplifies the problem by reducing it to an $n$-dimensional one. Throughout this paper, we consider the following data model
$$y_i = f(\mathbf{x}_i) + \sigma\epsilon_i, \quad i = 1, \cdots, n, \qquad (5)$$
where $f$ generates the actual output of the data and the $\epsilon_i$ are i.i.d. standard normal random variables, with $\sigma$ assumed to be known. We consider both the empirical (training) and the prediction (testing) risks, respectively defined as [15]
$$\mathcal{R}_{\rm train}\left(\widehat{f}\right) = \frac{1}{n}\mathbb{E}_{\boldsymbol{\epsilon}}\left\|\widehat{f}(\mathbf{X}) - f(\mathbf{X})\right\|^2, \qquad (6)$$
$$\mathcal{R}_{\rm test}\left(\widehat{f}\right) = \mathbb{E}_{\mathbf{s}\sim\mathcal{D},\boldsymbol{\epsilon}}\left[\left(\widehat{f}(\mathbf{s}) - f(\mathbf{s})\right)^2\right], \qquad (7)$$
where $\mathcal{D}$ is the data input distribution, $\mathbf{s}$ is taken independent of the training data $\mathbf{X}$, and $\boldsymbol{\epsilon} = \{\epsilon_i\}_{i=1}^n$. The above two equations respectively measure the goodness of fit relative to the training data and to new unseen data, both in terms of the mean squared error (MSE).

A. Centered kernel ridge regression
The concept of centered kernels dates back to the work of Cortes [8] on learning kernels based on the notion of centered alignment. As we will show later, this notion of centering comes naturally into the picture when we account for the bias in the learning problem (also see the lecture notes by Jaakkola [16]). More specifically, we modify the optimization problem in (2) to account for the bias as follows
$$\min_{\boldsymbol{\alpha}, \alpha_0 \in \mathcal{H}} \sum_{i=1}^n l\left(y_i, \alpha_0 + \boldsymbol{\alpha}^T\phi(\mathbf{x}_i)\right) + \lambda\|\boldsymbol{\alpha}\|^2, \qquad (8)$$
where clearly we do not penalize the offset (or bias) $\alpha_0$ in the regularization term. With $l$ being the squared loss, we immediately get $\alpha_0^* = \bar{y} - \frac{1}{n}\boldsymbol{\alpha}^T\boldsymbol{\Phi}^T\mathbf{1}_n$, with $\bar{y} = \frac{1}{n}\mathbf{1}_n^T\mathbf{y}$. Substituting $\alpha_0^*$ in (8), we solve the centered optimization problem given by
$$\min_{\boldsymbol{\alpha} \in \mathcal{H}} \left\|\mathbf{P}\left(\mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\alpha}\right)\right\|^2 + \lambda\boldsymbol{\alpha}^T\boldsymbol{\alpha}, \qquad (9)$$
where $\mathbf{P} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T$ is referred to as a projection matrix or a centering matrix [8], [16]. Finally, we get
$$\boldsymbol{\alpha}^* = \left(\boldsymbol{\Phi}^T\mathbf{P}\boldsymbol{\Phi} + \lambda\mathbf{I}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{P}\left(\mathbf{y} - \bar{y}\mathbf{1}_n\right) \overset{(b)}{=} \boldsymbol{\Phi}^T\mathbf{P}\left(\mathbf{P}\boldsymbol{\Phi}\boldsymbol{\Phi}^T\mathbf{P} + \lambda\mathbf{I}_n\right)^{-1}\left(\mathbf{y} - \bar{y}\mathbf{1}_n\right) = \boldsymbol{\Phi}^T\mathbf{P}\left(\mathbf{P}\mathbf{K}\mathbf{P} + \lambda\mathbf{I}_n\right)^{-1}\left(\mathbf{y} - \bar{y}\mathbf{1}_n\right) = \boldsymbol{\Phi}^T\mathbf{P}\left(\mathbf{K}_c + \lambda\mathbf{I}_n\right)^{-1}\left(\mathbf{y} - \bar{y}\mathbf{1}_n\right),$$
where $\mathbf{K}_c = \mathbf{P}\mathbf{K}\mathbf{P}$ is the centered kernel matrix as defined in [8, Lemma 1] and $(b)$ is obtained using the Woodbury identity. With some basic manipulations, the centered kernel ridge regression estimate of the output of a data point $\mathbf{s}$ is given by
$$\widehat{f}_c(\mathbf{s}) = \boldsymbol{\kappa}_c(\mathbf{s})^T\left(\mathbf{K}_c + \lambda\mathbf{I}\right)^{-1}\mathbf{P}\mathbf{y} + \bar{y}. \qquad (10)$$
Therefore, the feature map corresponding to $\mathbf{K}_c$ as well as the information vector can be respectively obtained as follows
$$\phi_c(\mathbf{s}) = \phi(\mathbf{s}) - \frac{1}{n}\sum_{i=1}^n \phi(\mathbf{x}_i), \quad \boldsymbol{\kappa}_c(\mathbf{s}) = \mathbf{P}\boldsymbol{\kappa}(\mathbf{s}) - \frac{1}{n}\mathbf{P}\mathbf{K}\mathbf{1}_n. \qquad (11)$$
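To make (10) and (11) concrete, here is a minimal sketch of the CKRR estimator (our own illustrative code, with hypothetical function names), assuming a kernel callable that returns Gram blocks:

```python
import numpy as np

def ckrr_predict(X, y, S, kernel, lam):
    """Centered KRR, eq. (10): f_c(s) = kappa_c(s)^T (K_c + lam I)^{-1} P y + y_bar."""
    n = X.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n           # centering projection P
    K = kernel(X, X)
    Kc = P @ K @ P                                # centered kernel matrix K_c = P K P
    beta = np.linalg.solve(Kc + lam * np.eye(n), P @ y)
    # rows are kappa_c(s)^T = (kappa(s) - K 1_n / n)^T P, eq. (11)
    kappa_c = (kernel(S, X) - K.mean(axis=1)) @ P
    return kappa_c @ beta + y.mean()
```

Two sanity checks follow directly from the construction: a constant target is reproduced exactly (since $\mathbf{P}\mathbf{y} = \mathbf{0}$, the estimate collapses to $\bar{y}$), and adding a constant to all targets shifts every prediction by that constant.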
Throughout this paper, we consider inner-product kernels [1], [11] defined as follows
$$k(\mathbf{x}, \mathbf{x}') = g\left(\mathbf{x}^T\mathbf{x}'/p\right), \quad \forall\,\mathbf{x}, \mathbf{x}' \in \mathbb{R}^p, \qquad (12)$$
and subsequently $\mathbf{K} = \left\{g\left(\mathbf{x}_i^T\mathbf{x}_j/p\right)\right\}_{i,j=1}^n$, where the normalization by $p$ in (12) is convenient in the large $n, p$ regime, as we will show later (also see [18] for a similar normalization in the analysis of LS-SVMs). This is equivalent to normalizing all data points by $\sqrt{p}$; this type of normalization has also been considered in [17] following the heuristic of Jaakkola. In the following, we conduct a large dimensional analysis of the performance of CKRR with the aim of getting useful insights on the design of CKRR. Particularly, we will focus on studying the empirical and prediction risks of CKRR, which we define as
$$\mathcal{R}_{\rm train} = \frac{1}{n}\mathbb{E}_{\boldsymbol{\epsilon}}\sum_{i=1}^n\left|\widehat{f}_c(\mathbf{x}_i) - f(\mathbf{x}_i)\right|^2, \quad \mathcal{R}_{\rm test} = \mathbb{E}_{\mathbf{s}\sim\mathcal{D},\boldsymbol{\epsilon}}\left[\left|\widehat{f}_c(\mathbf{s}) - f(\mathbf{s})\right|^2\right].$$
The novelty of our analysis with respect to previous studies lies in the following:
1) It provides a mathematical connection between the performance and the problem's dimensions and statistics, resulting in a deeper understanding of centered kernel ridge regression in the large $n, p$ regime.
2) It brings insights on how to choose the kernel function $g$ and the regularization parameter $\lambda$ in order to guarantee a good generalization performance for unknown data.
As far as the second point is concerned, we show later that both the kernel function and the regularization parameter can be optimized jointly as a consequence of the mathematical result connecting the prediction risk with these design parameters. Our analysis does not assume a specific choice of the inner-product kernel, and is valid for the following popular ones.
• Linear kernels: $k(\mathbf{x}, \mathbf{x}') = \alpha\,\mathbf{x}^T\mathbf{x}'/p + \beta$.
• Polynomial kernels: $k(\mathbf{x}, \mathbf{x}') = \left(\alpha\,\mathbf{x}^T\mathbf{x}'/p + \beta\right)^d$.
• Sigmoid kernels: $k(\mathbf{x}, \mathbf{x}') = \tanh\left(\alpha\,\mathbf{x}^T\mathbf{x}'/p + \beta\right)$.
• Exponential kernels: $k(\mathbf{x}, \mathbf{x}') = \exp\left(\alpha\,\mathbf{x}^T\mathbf{x}'/p + \beta\right)$.

III. MAIN RESULTS
A. Technical assumptions
In this section, we present our theoretical results on the prediction risk of CKRR by first introducing the assumptions on the data growth rate, the kernel function $g$ and the true function $f$. Without loss of generality, we assume that the data samples $\mathbf{x}_1, \cdots, \mathbf{x}_n \in \mathbb{R}^p$ are independent with $\mathbf{x}_i \sim \mathcal{N}(\mathbf{0}_p, \boldsymbol{\Sigma})$, $i = 1, \cdots, n$, and positive definite covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{p\times p}$. Throughout the analysis, we consider the large dimensional regime in which both $p$ and $n$ grow simultaneously large with the following growth rate assumptions.

Assumption 1 (Growth rate). As $p, n \to \infty$, we assume the following:
• Data scaling: $p/n \to c \in (0, \infty)$.
• Covariance scaling: $\limsup_p \|\boldsymbol{\Sigma}\| < \infty$.

The above assumptions are standard and allow us to exploit the large heritage of random matrix theory. Moreover, allowing $p$ and $n$ to grow large at the same rate is of practical interest when dealing with modern large and numerous data. The assumption treating the covariance scaling is technically convenient since it permits the use of important theoretical results on the behavior of large kernel matrices [11], [12]. Under Assumption 1, we have the following implications:
$$\mathbf{x}_i^T\mathbf{x}_i/p \overset{\rm a.s.}{\longrightarrow} \frac{1}{p}\operatorname{tr}\boldsymbol{\Sigma} \triangleq \tau, \quad i = 1, \cdots, n, \qquad (13)$$
$$\mathbf{x}_i^T\mathbf{x}_j/p \overset{\rm a.s.}{\longrightarrow} 0, \quad i \neq j, \qquad (14)$$
where $0 < \tau < \infty$ due to the covariance scaling in Assumption 1. This means that in the limit $p \to \infty$, the kernel matrix $\mathbf{K}$ as defined earlier has all its entries converging to a deterministic limit. Applying a Taylor expansion to the entries of $\mathbf{K}$, and under some assumptions on the kernel function $g$, it has been shown in [11, Theorem 2.1] that
$$\left\|\mathbf{K} - \mathbf{K}^\infty\right\| \overset{\rm prob.}{\longrightarrow} 0, \qquad (15)$$
where the convergence is in operator norm and $\mathbf{K}^\infty$ exhibits nice properties and can be expressed using standard random matrix models. The explicit expression of $\mathbf{K}^\infty$ as well as its properties will be thoroughly investigated in Appendix A.
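The implications (13) and (14) are easy to observe numerically. The following sketch (our own code, taking $\boldsymbol{\Sigma} = \mathbf{I}_p$ so that $\tau = 1$) checks that the diagonal inner products concentrate around $\tau$ while the cross products vanish:

```python
import numpy as np

# Concentration check for (13)-(14), assuming Gaussian data with Sigma = I_p (tau = 1):
# x_i^T x_i / p concentrates around tau, x_i^T x_j / p (i != j) vanishes as p grows.
rng = np.random.default_rng(0)
n, p = 50, 4000
X = rng.standard_normal((n, p))
G = X @ X.T / p                               # matrix of x_i^T x_j / p
diag = np.diag(G)
off = G[~np.eye(n, dtype=bool)]
assert np.all(np.abs(diag - 1.0) < 0.2)       # (13): close to tau = 1
assert np.all(np.abs(off) < 0.2)              # (14): close to 0
```

The deviations shrink like $1/\sqrt{p}$, which is exactly why the kernel matrix of (12) degenerates to $g(0)\mathbf{1}_n\mathbf{1}_n^T$ at leading order and why centering matters.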
We subsequently make additional assumptions to control the kernel function $g$ and the data generating function $f$.

Assumption 2 (Kernel function). As in [11, Theorem 2.1], we assume that $g$ is $C^1$ in a neighborhood of $\tau$ and $C^3$ in a neighborhood of 0. Moreover, we assume that for any independent observations $\mathbf{x}_i$ and $\mathbf{x}_j$ drawn from $\mathcal{N}(\mathbf{0}_p, \boldsymbol{\Sigma})$ and $k \in \mathbb{N}^*$,
$$\mathbb{E}\left|g^{(3)}\left(\frac{\mathbf{x}_i^T\mathbf{x}_j}{p}\right)\right|^k < \infty,$$
where $g^{(3)}$ is the third derivative of $g$.

Assumption 3 (Data generating function). We assume that $f$ is $C^1$ and polynomially bounded together with its derivatives. We further assume that the moments of $f(\mathbf{x})$ and of its gradient are finite. More explicitly, we need
$$\mathbb{E}_{\mathbf{x}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})}\left|f(\mathbf{x})\right|^k < \infty, \quad \mathbb{E}_{\mathbf{x}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})}\left\|\nabla f(\mathbf{x})\right\|^k < \infty, \quad \text{where } \nabla f(\mathbf{x}) = \left\{\frac{\partial f(\mathbf{x})}{\partial x_l}\right\}_{l=1}^p. \qquad (16)$$

As we will show later, the above assumptions are needed to guarantee a bounded asymptotic risk and to carry out the analysis. Under the setting of Assumptions 1, 2 and 3, we aim to study the performance of CKRR by asymptotically evaluating the performance metrics defined in (6). Inspired by the fundamental results from [11] and [12] in the context of spectral clustering, and following the observations made in (13) and (14), it is always possible to linearize the kernel matrix $\mathbf{K}$ around the matrix $g(0)\mathbf{1}_n\mathbf{1}_n^T$, which avoids dealing with the original intractable expression of $\mathbf{K}$. Note that the first component of the approximation, given by $g(0)\mathbf{1}_n\mathbf{1}_n^T$, will be neutralized by the projection matrix $\mathbf{P}$ in the context of CKRR, which means that the behavior of CKRR is essentially governed by the higher order approximations of $\mathbf{K}$. Consequently, one can resort to those approximations to obtain an explicit expression of the asymptotic risk in the large $p, n$ regime. This expression reveals the mathematical connection between the regression risk and the data statistics and dimensions as $p, n \to \infty$.

B. Limiting risk
With the above assumptions at hand, we are now in a position to state the main results of the paper related to the derivation of the asymptotic risk of CKRR. Before doing so, we introduce some useful quantities. Define $\nu \triangleq g(\tau) - g(0) - \tau g'(0)$. Also, for all $z \in \mathbb{C}$ at macroscopic distance from the eigenvalues $\lambda_1, \cdots, \lambda_p$ of $\frac{1}{p}\sum_{i=1}^n\mathbf{x}_i\mathbf{x}_i^T$, we define the Stieltjes transform of $\frac{1}{p}\sum_{i=1}^n\mathbf{x}_i\mathbf{x}_i^T$, also known as the Stieltjes transform of the Mar\v{c}enko-Pastur law, as the unique solution to the following fixed-point equation [19]
$$m(z) \triangleq -\left[cz - \frac{1}{n}\operatorname{tr}\boldsymbol{\Sigma}\left(\mathbf{I} + m(z)\boldsymbol{\Sigma}\right)^{-1}\right]^{-1}, \qquad (17)$$
where $m(z)$ in (17) is bounded as $p \to \infty$ provided that Assumption 1 is satisfied. For ease of notation, we write $m_z$ for $m(z)$ at all appropriate $z$. The first main result of the paper is summarized in the following theorem, the proof of which is postponed to Appendix A.

Theorem 1 (Limiting risk). Under Assumptions 1, 2 and 3, taking $z = -\frac{\lambda + \nu}{g'(0)}$ for kernel functions satisfying $g'(0) \neq 0$, with $z$ at macroscopic distance from the eigenvalues of $\frac{1}{p}\sum_{i=1}^n\mathbf{x}_i\mathbf{x}_i^T$, both the empirical and the prediction risks converge in probability to non-trivial deterministic limits, respectively:
$$\mathcal{R}_{\rm train} - \mathcal{R}^\infty_{\rm train} \overset{\rm prob.}{\longrightarrow} 0, \qquad (18)$$
$$\mathcal{R}_{\rm test} - \mathcal{R}^\infty_{\rm test} \overset{\rm prob.}{\longrightarrow} 0, \qquad (19)$$
where the expressions of $\mathcal{R}^\infty_{\rm train}$ and $\mathcal{R}^\infty_{\rm test}$ are given in (20) and (21) at the top of the next page. Note that in the case where $\boldsymbol{\Sigma} = \mathbf{I}_p$, the limiting risks in (20) and (21) can be further simplified as
$$\mathcal{R}^\infty_{\rm train} = \left(\frac{c\lambda m_z}{g'(0)}\right)^2 \times \frac{n(1+m_z)^2\left(\sigma^2 + \operatorname{var} f\right) - nm_z(2+m_z)\left\|\mathbb{E}[\nabla f]\right\|^2}{n(1+m_z)^2 - pm_z^2} + \sigma^2 - 2\sigma^2\,\frac{c\lambda m_z}{g'(0)},$$
$$\mathcal{R}^\infty_{\rm test} = \frac{n(1+m_z)^2\left(\sigma^2 + \operatorname{var} f\right) - nm_z(2+m_z)\left\|\mathbb{E}[\nabla f]\right\|^2}{n(1+m_z)^2 - pm_z^2} - \sigma^2,$$
where $m_z$ can be explicitly derived as in [20]:
$$m_z = \frac{-(cz - c + 1) - \sqrt{(cz - c - 1)^2 - 4c}}{2cz}.$$
The case $g'(0) = 0$ is asymptotically equivalent to taking the sample mean as an estimate of $f$, which is of neither practical nor theoretical interest.

Remark 1.
From Theorem 1, it follows that the limiting prediction risk can be expressed via the limiting empirical risk in the following fashion:
$$\mathcal{R}^\infty_{\rm test} = \left(\frac{c\lambda m_z}{g'(0)}\right)^{-2}\mathcal{R}^\infty_{\rm train} - \sigma^2\left(\frac{g'(0)}{c\lambda m_z} - 1\right)^2. \qquad (22)$$

Lemma 1 (A consistent estimator of the prediction risk). Inspired by the outcome of Theorem 1 summarized in Remark 1, we construct a consistent estimator of the prediction risk given by
$$\widehat{\mathcal{R}}_{\rm test} = \left(\frac{c\lambda\widehat{m}_z}{g'(0)}\right)^{-2}\mathcal{R}_{\rm train} - \sigma^2\left(\frac{g'(0)}{c\lambda\widehat{m}_z} - 1\right)^2, \quad \lambda > 0, \qquad (23)$$
in the sense that $\mathcal{R}_{\rm test} - \widehat{\mathcal{R}}_{\rm test} \overset{\rm prob.}{\longrightarrow} 0$, where $m_z$ can be consistently estimated as $\widehat{m}_z = \frac{1}{p}\operatorname{tr}\left(\mathbf{X}\mathbf{X}^T/p - z\mathbf{I}_n\right)^{-1}$, with $\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_n]^T$.

Proof.
The proof is straightforward, relying on the relation in (22) and the fact that $\widehat{m}_z - m_z \overset{\rm a.s.}{\longrightarrow} 0$, as shown in [12, Lemma 1].

Since the aim of any learning system is to design a model that achieves minimal prediction risk [15], the relation described in Lemma 1 by (23) has enormous advantages, as it permits to estimate the prediction risk in terms of the empirical risk and hence to optimize the prediction risk accordingly.

Remark 2.
One important observation from the expression of the limiting prediction risk in (21) is that the information on the kernel (given by $g'(0)$ and $\nu$) as well as the information on $\lambda$ are both encapsulated in $m_z$ with $z = -\frac{\lambda+\nu}{g'(0)}$. This means that one should optimize $z$ to attain the minimal prediction risk, and thus jointly choose the kernel $g$ and the regularization parameter $\lambda$. Moreover, it entails that the choice of the kernel (as long as $g'(0) \neq 0$) is asymptotically irrelevant, since a bad choice of the kernel can be compensated by a good choice of $\lambda$ and vice-versa. This essentially implies that a linear kernel asymptotically achieves the same optimal performance as any other type of kernel. (This does not mean that all kernels have the same performance for a given regularization parameter, but rather that they all achieve the same minimum prediction risk.)

C. A consistent estimator of the prediction risk

Although the estimator provided in Lemma 1 permits to estimate the prediction risk by virtue of the empirical risk, it presents the drawback of being sensitive to small values of $\lambda$. In the following theorem, we provide a consistent estimator of the prediction risk that is constructed from the training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ and is less sensitive to small values of $\lambda$.

Theorem 2 (A consistent estimator of the prediction risk). Under Assumptions 1, 2 and 3, with $g'(0) \neq 0$ and $z = -\frac{\lambda+\nu}{g'(0)}$, the estimator $\widehat{\mathcal{R}}_{\rm test}$ in (24) below, constructed from the training data, satisfies $\mathcal{R}_{\rm test} - \widehat{\mathcal{R}}_{\rm test} \overset{\rm prob.}{\longrightarrow} 0$.

$$\mathcal{R}^\infty_{\rm train} = \left(\frac{c\lambda m_z}{g'(0)}\right)^2\frac{n\sigma^2 + n\operatorname{var} f - nm_z\,\mathbb{E}[\nabla f]^T\boldsymbol{\Sigma}\left[\left(\mathbf{I}+m_z\boldsymbol{\Sigma}\right)^{-1} + \left(\mathbf{I}+m_z\boldsymbol{\Sigma}\right)^{-2}\right]\boldsymbol{\Sigma}\,\mathbb{E}[\nabla f]}{n - m_z^2\operatorname{tr}\boldsymbol{\Sigma}^2\left(\mathbf{I}+m_z\boldsymbol{\Sigma}\right)^{-2}} + \sigma^2 - 2\sigma^2\,\frac{c\lambda m_z}{g'(0)}. \qquad (20)$$
$$\mathcal{R}^\infty_{\rm test} = \frac{n\sigma^2 + n\operatorname{var} f - nm_z\,\mathbb{E}[\nabla f]^T\boldsymbol{\Sigma}\left[\left(\mathbf{I}+m_z\boldsymbol{\Sigma}\right)^{-1} + \left(\mathbf{I}+m_z\boldsymbol{\Sigma}\right)^{-2}\right]\boldsymbol{\Sigma}\,\mathbb{E}[\nabla f]}{n - m_z^2\operatorname{tr}\boldsymbol{\Sigma}^2\left(\mathbf{I}+m_z\boldsymbol{\Sigma}\right)^{-2}} - \sigma^2. \qquad (21)$$
$$\widehat{\mathcal{R}}_{\rm test} = \frac{1}{\left(cz\widehat{m}_z\right)^2}\left[\frac{1}{np}\mathbf{y}^T\mathbf{P}\mathbf{X}\left(z\widetilde{\mathbf{Q}}_z^2 - \widetilde{\mathbf{Q}}_z\right)\mathbf{X}^T\mathbf{P}\mathbf{y} + \operatorname{var}(\mathbf{y})\right] - \sigma^2, \qquad (24)$$
where $\widetilde{\mathbf{Q}}_z$ is the resolvent matrix given by $\widetilde{\mathbf{Q}}_z = \left(\frac{\mathbf{X}^T\mathbf{P}\mathbf{X}}{p} - z\mathbf{I}_p\right)^{-1}$, with $\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_n]^T$. Moreover, in the special case where $\boldsymbol{\Sigma} = \mathbf{I}_p$, the estimator reduces to
$$\widehat{\mathcal{R}}_{\rm test} = \frac{n(1+\widehat{m}_z)^2\operatorname{var}(\mathbf{y}) - \widehat{m}_z(2+\widehat{m}_z)\left[\mathbf{y}^T\mathbf{P}\frac{\mathbf{X}\mathbf{X}^T}{n}\mathbf{P}\mathbf{y} - p\operatorname{var}(\mathbf{y})\right]}{n(1+\widehat{m}_z)^2 - p\widehat{m}_z^2} - \sigma^2. \qquad (25)$$
Theorem 2 provides a generic way to estimate the prediction risk from the pairs of training examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$. This allows using the closed-form expressions in (24) and (25), with the same set of arguments as in Remark 2, to jointly estimate the optimal kernel and the optimal regularization parameter $\lambda$.

D. Parameters optimization
We briefly discuss how to jointly optimize the kernel function and the regularization parameter $\lambda$. As mentioned earlier, we exploit the special structure of the consistent estimate $\widehat{\mathcal{R}}_{\rm test}$, where both parameters (the kernel function $g$ and $\lambda$) are summarized in $z$. We focus on the case $\boldsymbol{\Sigma} = \mathbf{I}_p$ due to the tractability of the expression of $\widehat{\mathcal{R}}_{\rm test}$ in (25). By simple calculations, we can show that $\widehat{\mathcal{R}}_{\rm test}$ is minimized when $\widehat{m}_z$ satisfies the equation
$$p\operatorname{var}(\mathbf{y})\left[p\widehat{m}_z^2 + n\left(1+\widehat{m}_z\right)^2\right] = A\left(n + n\widehat{m}_z + p\widehat{m}_z^2\right),$$
where $A = \mathbf{y}^T\mathbf{P}\frac{\mathbf{X}\mathbf{X}^T}{n}\mathbf{P}\mathbf{y}$, which admits the following closed-form solution
$$m^\star = \frac{nA - 2np\operatorname{var}(\mathbf{y}) + \sqrt{n^2A^2 - 4npA^2 + 8np^2A\operatorname{var}(\mathbf{y}) - 4np^3\operatorname{var}^2(\mathbf{y})}}{2\left(-pA + np\operatorname{var}(\mathbf{y}) + p^2\operatorname{var}(\mathbf{y})\right)}. \qquad (26)$$
Then, we look up $z^\star$ such that $\widehat{m}_{z^\star} = m^\star$. Finally, we choose $\lambda$ and $g(.)$ such that $z^\star = -\frac{\lambda+\nu}{g'(0)}$. In the general case, it is difficult to get a closed-form expression in terms of $z$ or $\widehat{m}_z$; however, it is possible to numerically optimize the expression of $\widehat{\mathcal{R}}_{\rm test}$ with respect to $z$. This can be done using simple one-dimensional optimization techniques implemented in most software. The expression in (25) is useful because it does not involve any matrix inversion, unlike the one in (24).
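The estimator (25) and the optimization step above can be sketched as follows (our own illustrative code for the $\boldsymbol{\Sigma} = \mathbf{I}_p$ case; the function names are ours, and the quadratic optimality condition is used as a self-check on the root (26)):

```python
import numpy as np

def m_hat(X, z):
    """Consistent estimate of m_z: (1/p) tr (X X^T / p - z I_n)^{-1}."""
    n, p = X.shape
    return np.trace(np.linalg.inv(X @ X.T / p - z * np.eye(n))) / p

def r_test_hat(X, y, z, sigma2):
    """Prediction-risk estimator (25), valid for Sigma = I_p."""
    n, p = X.shape
    P = np.eye(n) - np.ones((n, n)) / n
    m = m_hat(X, z)
    vy = y.var()                                  # var(y)
    A = y @ P @ (X @ X.T / n) @ P @ y             # A = y^T P (X X^T / n) P y
    D = n * (1 + m) ** 2 - p * m ** 2
    return (n * (1 + m) ** 2 * vy - m * (2 + m) * (A - p * vy)) / D - sigma2

def m_star(n, p, A, vy):
    """Closed-form minimizer (26): positive root of
    p vy [p m^2 + n (1 + m)^2] = A (n + n m + p m^2)."""
    disc = n**2 * A**2 - 4*n*p*A**2 + 8*n*p**2*A*vy - 4*n*p**3*vy**2
    return (n*A - 2*n*p*vy + np.sqrt(disc)) / (2 * (-p*A + n*p*vy + p**2*vy))
```

One can then scan $z < 0$ for the point where $\widehat{m}_z$ crosses $m^\star$ and pick any pair $(g, \lambda)$ with $-\frac{\lambda+\nu}{g'(0)} = z^\star$, in line with Remark 2.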
IV. EXPERIMENTS

A. Synthetic data
To validate our theoretical findings, we consider both Gaussian and Bernoulli data. As shown in Figure 2, both data distributions exhibit the same behavior for all the settings with different kernel functions. More importantly, even though the derived formulas heavily rely on the Gaussian assumption, in the case where the data is Bernoulli distributed we still observe a good agreement with the theoretical limits. This can be understood as part of the universality property often encountered in many high dimensional settings. Therefore, we conjecture that the obtained results are valid for any data distribution following the model $\mathbf{x} = \boldsymbol{\Sigma}^{1/2}\mathbf{z}$, where $\boldsymbol{\Sigma}$ satisfies Assumption 1 and the entries $\{z_i\}_{1\leq i\leq p}$ of $\mathbf{z}$ are i.i.d. with zero mean, unit variance and bounded moments. For more clarity, we refer the reader to Figure 1 as a representative of Figure 2 when the data is Gaussian with $p = 100$ and $n = 200$. As shown in Figure 1, the proposed consistent estimators are able to track the real behavior of the prediction risk for all types of kernels under consideration. It is worth mentioning, however, that the proposed estimator in Lemma 1 exhibits some instability for small values of $\lambda$ due to the inversion of $\lambda$ in (23); therefore, it is advised to use the estimator given by Theorem 2. It is also clear from Figure 1 that all the considered kernels achieve the same minimum prediction risk but with different optimal regularizations $\lambda$. This is not the case for the empirical risk, as shown in Figure 1 and (20), where the information on the kernel and the regularization parameter $\lambda$ are decoupled. Hence, in contrast to the prediction risk, the regularization parameter and the kernel cannot be jointly optimized to minimize the empirical risk.

B. Real data
As a further experiment, we validate the accuracy of our results on a real data set. To this end, we use the real
Communities and Crime Data Set [21] for evaluation, which has 123 samples and 122 features. For the experiment in Figure 3, we divide the data set into training samples ($n = 73$) and keep the remaining for testing ($n_{\rm test} = 50$). The risks in Figure 3 are obtained by averaging the prediction risk (computed using the $n_{\rm test}$ samples) over random permutations of the data. Although the data set is far from being Gaussian, we notice that the proposed prediction risk estimators are able to track the real behavior of the prediction risk for all types of considered kernels. We can also validate the previous insight from Theorem 1, where all kernels almost achieve the same minimum prediction risk.

We could not provide experiments for more data distributions due to space limitations.

Fig. 1. CKRR risk with respect to the regularization parameter $\lambda$ on Gaussian data, with $n = 200$ training samples and $p = 100$ predictors, for linear ($\alpha = 1$, $\beta = 1$), polynomial ($\alpha = 1$, $\beta = 1$, $d = 2$), sigmoid ($\alpha = 1$) and exponential ($\alpha = 1$, $\beta = 0$) kernels. Each panel shows $\mathcal{R}_{\rm test}$, $\mathcal{R}^\infty_{\rm test}$ (Theorem 1), $\widehat{\mathcal{R}}_{\rm test}$ (Theorem 2), $\widehat{\mathcal{R}}_{\rm test}$ (Lemma 1), $\mathcal{R}_{\rm train}$ and $\mathcal{R}^\infty_{\rm train}$ (Theorem 1). The data generating function is $f(\mathbf{x}) = \sin\left(\mathbf{1}^T\mathbf{x}/\sqrt{p}\right)$.

V. CONCLUDING REMARKS

We conducted a large dimensional analysis of centered kernel ridge regression, a modified version of kernel ridge regression that accounts for the bias in the regression formulation. By allowing both the data dimension and the training size to grow infinitely large in fixed proportion, and by relying on fundamental tools from random matrix theory, we showed that both the empirical and the prediction risks converge in probability to deterministic quantities that mathematically connect these performance metrics to the data dimensions and statistics. A fundamental insight from the analysis is that asymptotically the choice of the kernel is irrelevant to the learning problem, which asserts that a large class of kernels achieve the same best performance, in terms of prediction risk, as a linear kernel. Finally, based on the asymptotic analysis, we built a consistent estimator of the prediction risk, making it possible to estimate the optimal regularization parameter that achieves the minimum prediction risk.

APPENDIX A: PROOF OF THEOREM 1
Theorem (Asymptotic behavior of inner-product kernel random matrices). Under the assumptions of Theorem 2.1 in [11], the kernel matrix $\mathbf{K}$ can be approximated by $\mathbf{K}^\infty$ in the sense that $\|\mathbf{K} - \mathbf{K}^\infty\| \to 0$ almost surely in operator norm, where
$$\mathbf{K}^\infty = \left[g(0) + \frac{g''(0)}{2}\frac{\operatorname{tr}\left(\boldsymbol{\Sigma}^2\right)}{p^2}\right]\mathbf{1}_n\mathbf{1}_n^T + g'(0)\frac{\mathbf{X}\mathbf{X}^T}{p} + \nu\mathbf{I}_n,$$
where $\nu = g(\tau) - g(0) - \tau g'(0)$. A similar result can be found in [11], where the accuracy of $\mathbf{K}^\infty$ has been assessed as $\mathbf{K} = \mathbf{K}^\infty + O_{\|.\|}(1/\sqrt{n})$, where $O_{\|.\|}(1/\sqrt{n})$ denotes a matrix with spectral norm converging in probability to zero at rate $1/\sqrt{n}$. Define
$$\mathbf{Q}_z = \left(\mathbf{P}\frac{\mathbf{X}\mathbf{X}^T}{p}\mathbf{P} - z\mathbf{I}_n\right)^{-1}, \quad \widetilde{\mathbf{Q}}_z = \left(\frac{\mathbf{X}^T\mathbf{P}\mathbf{X}}{p} - z\mathbf{I}_p\right)^{-1}. \qquad (27)$$
Note that using the Woodbury identity, it is easy to show the following useful relations:
$$\mathbf{Q}_z = -\frac{1}{z}\mathbf{I}_n + \frac{1}{zp}\mathbf{P}\mathbf{X}\widetilde{\mathbf{Q}}_z\mathbf{X}^T\mathbf{P}, \qquad (28)$$
$$\widetilde{\mathbf{Q}}_z = -\frac{1}{z}\mathbf{I}_p + \frac{1}{zp}\mathbf{X}^T\mathbf{P}\mathbf{Q}_z\mathbf{P}\mathbf{X}. \qquad (29)$$
The above theorem has the following consequence:
$$\left\|\left(\mathbf{K}_c + \lambda\mathbf{I}\right)^{-1} - \frac{1}{g'(0)}\left[\mathbf{Q}_z + \frac{\frac{\nu}{ng'(0)}\mathbf{Q}_z\mathbf{1}_n\mathbf{1}_n^T\mathbf{Q}_z}{1 - \frac{\nu}{ng'(0)}\mathbf{1}_n^T\mathbf{Q}_z\mathbf{1}_n}\right]\right\| = O_p\left(\frac{1}{\sqrt{n}}\right), \qquad (30)$$
where (30) is obtained by a simple application of the Sherman-Morrison lemma (inversion lemma), along with the use of the resolvent identity $\mathbf{A}^{-1} - \mathbf{B}^{-1} = \mathbf{A}^{-1}\left(\mathbf{B} - \mathbf{A}\right)\mathbf{B}^{-1}$, which holds for any square invertible matrices $\mathbf{A}$ and $\mathbf{B}$. The proof of the above theorem follows from a Taylor expansion of the elements of $\frac{1}{p}\mathbf{X}\mathbf{X}^T$ in the vicinity of their mean. Applying the same approach to the vector $\boldsymbol{\kappa}(\mathbf{s})$, we get
$$\boldsymbol{\kappa}(\mathbf{s}) = g(0)\mathbf{1}_n + g'(0)\frac{1}{p}\mathbf{X}\mathbf{s} + \tilde{\boldsymbol{\kappa}}(\mathbf{s}),$$
where $\tilde{\boldsymbol{\kappa}}(\mathbf{s})$ has elements
$$\left[\tilde{\boldsymbol{\kappa}}(\mathbf{s})\right]_i = \frac{g''(0)}{2}\left(\frac{\mathbf{x}_i^T\mathbf{s}}{p}\right)^2 + \frac{g^{(3)}(\xi_i)}{6}\left(\frac{\mathbf{x}_i^T\mathbf{s}}{p}\right)^3, \qquad (31)$$
with $\xi_i \in \left[0, \frac{\mathbf{x}_i^T\mathbf{s}}{p}\right]$.
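The algebraic identities (28) and (29) are exact for any finite $n, p$, so they can be verified directly; the following small numerical check is our own code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, z = 8, 5, -0.7
X = rng.standard_normal((n, p))
P = np.eye(n) - np.ones((n, n)) / n

Q = np.linalg.inv(P @ X @ X.T @ P / p - z * np.eye(n))    # Q_z of eq. (27)
Qt = np.linalg.inv(X.T @ P @ X / p - z * np.eye(p))       # Q_tilde_z of eq. (27)

# eq. (28): Q_z = -(1/z) I_n + (1/(zp)) P X Qt X^T P
assert np.allclose(Q, -np.eye(n) / z + P @ X @ Qt @ X.T @ P / (z * p))
# eq. (29): Qt_z = -(1/z) I_p + (1/(zp)) X^T P Q_z P X
assert np.allclose(Qt, -np.eye(p) / z + X.T @ P @ Q @ P @ X / (z * p))
```

Both identities follow from the push-through relation $(\mathbf{A}\mathbf{B} - z\mathbf{I})^{-1}\mathbf{A} = \mathbf{A}(\mathbf{B}\mathbf{A} - z\mathbf{I})^{-1}$ together with the idempotence $\mathbf{P}^2 = \mathbf{P}$.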
Then, since $\mathbb{E}\left|\frac{\mathbf{x}_i^T\mathbf{s}}{\sqrt{p}}\right|^r$ is uniformly bounded in $p$ for all $r \in \mathbb{N}$, we have for all $k \in \mathbb{N}$,
$$\mathbb{E}\left\|\tilde{\boldsymbol{\kappa}}(\mathbf{s})\right\|^{2k} = O\left(\frac{1}{n^k}\right). \qquad (32)$$
As shall be seen later, we also need to control $\mathbb{E}_{\mathbf{s}}\left[\tilde{\boldsymbol{\kappa}}(\mathbf{s})\tilde{\boldsymbol{\kappa}}(\mathbf{s})^T\right]$. This is performed in the following technical lemma.

Lemma 2.
Let $\tilde{\boldsymbol{\kappa}}(\mathbf{s})$ be as in (31). Then,
$$\mathbb{E}_{\mathbf{s}}\left[\tilde{\boldsymbol{\kappa}}(\mathbf{s})\tilde{\boldsymbol{\kappa}}(\mathbf{s})^T\right] = \frac{1}{p^2}\left(\frac{g''(0)}{2}\right)^2\left(\frac{1}{p}\operatorname{tr}\boldsymbol{\Sigma}^2\right)^2\mathbf{1}_n\mathbf{1}_n^T + O_{\|.\|}\left(\frac{1}{n^{3/2}}\right).$$
Similarly, the following approximations hold true:
$$\mathbb{E}_{\mathbf{s}}\left[\mathbf{X}\mathbf{s}\,\tilde{\boldsymbol{\kappa}}(\mathbf{s})^T\right] = O_{\|.\|}\left(n^{-1/2}\right),$$
$$\mathbb{E}_{\mathbf{s}}\left[\tilde{\boldsymbol{\kappa}}(\mathbf{s})^T\frac{1}{n}\mathbf{K}\mathbf{P}\right] = \frac{1}{p}\operatorname{tr}\boldsymbol{\Sigma}^2\,\frac{\mathbf{1}_n^T}{np}\frac{\mathbf{X}\mathbf{X}^T}{p}\mathbf{P} + O_{\|.\|}\left(n^{-3/2}\right).$$

Proof.
To begin with, note that for a random matrix M = {m_ij}_{i,j=1}^n whose elements satisfy m_ij = O_p(n^{-α}) for some α > 1, since ‖M‖² ≤ tr(M M^T), we have

m_ij = O_p(n^{-α}) ⇒ ‖M‖ = O_p(n^{1-α}).

Fig. 2. CKRR risk with respect to the regularization parameter λ, on both Gaussian and Bernoulli data, for the linear (α = 1, β = 1), polynomial (α = 1, β = 1, d = 2), sigmoid (α = 1, β = −1), and exponential (α = 1, β = 0) kernels, with (p, n) = (100, 200) and (p, n) = (200, 100). The curves compare R_test, R∞_test (Theorem 1), R̂_test (Theorem 2 and Lemma 1), R_train, and R∞_train (Theorem 1). The data-generating function is f(x) = sin(1^T x/√p), with additive Gaussian noise.

We first start by deriving E_s[[κ̃(s)]_i [κ̃(s)]_j]. We have

E_s[[κ̃(s)]_i [κ̃(s)]_j] = (g''(0)/2)² (1/p⁴) E_s[(s^T x_i x_i^T s)(s^T x_j x_j^T s)]
+ (g''(0)/2) E_s[(x_i^T s/p)² (x_j^T s/p)³ (g'''(ξ_j)/6)]
+ (g''(0)/2) E_s[(x_j^T s/p)² (x_i^T s/p)³ (g'''(ξ_i)/6)]
+ E_s[(g'''(ξ_i)/6)(g'''(ξ_j)/6)(x_i^T s/p)³ (x_j^T s/p)³].

Using Assumption 2, we can prove that E|g'''(ξ_i)|^r is bounded for all r ∈ ℕ.
Moreover,

E[(x_i^T s/p)² |x_j^T s/p|³ |g'''(ξ_j)/6|] = O(n^{-5/2}),

and

E[|g'''(ξ_i)/6| |g'''(ξ_j)/6| |x_i^T s/p|³ |x_j^T s/p|³] = O(n^{-3}).

Hence, by computing the expectation over s of the first term, we obtain

E_s[[κ̃(s)]_i [κ̃(s)]_j] = 2 (g''(0)/2)² (x_i^T Σ x_j/p²)² + (g''(0)/2)² (1/p⁴) (x_i^T Σ x_i)(x_j^T Σ x_j) + O_p(n^{-5/2}).

Fig. 3. CKRR risk with respect to the regularization parameter λ on the Communities and Crime data set, where independent zero-mean Gaussian noise samples are added to the true response. Panels cover the linear (α = 1, β = 1), polynomial (α = 1, β = 1, d = 2), sigmoid (α = 1, β = −1), and exponential (α = 1, β = 0) kernels, each annotated with its optimal λ and the corresponding R_test, together with R̂_test (Theorem 2 and Lemma 2) and R_train.

When i ≠ j, (x_i^T Σ x_j/p²)² = O_p(n^{-3}). Hence,

E_s[[κ̃(s)]_i [κ̃(s)]_j] = 2 (g''(0)/2)² δ_{i=j} ((1/p²) x_i^T Σ x_i)² + (g''(0)/2)² (1/p⁴) (x_i^T Σ x_i)(x_j^T Σ x_j) + O_p(n^{-5/2}).

In matrix form, we thus obtain

E_s[κ̃(s) κ̃(s)^T] = 2 (g''(0)/2)² diag{((1/p²) x_i^T Σ x_i)²}_{i=1}^n + (g''(0)/2)² {(1/p²) x_i^T Σ x_i}_{i=1}^n ({(1/p²) x_j^T Σ x_j}_{j=1}^n)^T + O_‖·‖(n^{-3/2}).
It is easy to see that

‖2 (g''(0)/2)² diag{((1/p²) x_i^T Σ x_i)²}_{i=1}^n‖ = O_p(n^{-2}).

On the other hand, one can show that, in the second term, (1/p²) x_i^T Σ x_i can be replaced by (1/p²) tr Σ². This is because

‖{(1/p²)(x_i^T Σ x_i − tr Σ²)}_{i=1}^n ({(1/p²) x_j^T Σ x_j}_{j=1}^n)^T‖ ≤ ‖{(1/p²)(x_i^T Σ x_i − tr Σ²)}_{i=1}^n‖ ‖{(1/p²) x_j^T Σ x_j}_{j=1}^n‖ = O_p(n^{-3/2}).

Putting all the above results together, we obtain

E_s[κ̃(s) κ̃(s)^T] = (1/p²) (g''(0)/2)² ((1/p) tr Σ²)² 11^T + O_‖·‖(n^{-3/2}).

Now, using the approximation in (31), we obtain

κ_c(s) = g'(0) (1/p) P X s + P κ̃(s) − (1/n) P K 1.  (33)

Theorem (Asymptotic behavior of Q_z and Q̃_z). As in [12, Lemma 1], let Assumption 1 hold. Then, as p → ∞, for all z ∈ ℂ∖ℝ₊ at a macroscopic distance from the limiting eigenvalue support,

Q̃_z ↔ −(1/z) (I_p + m_z Σ)^{-1},  (34)

where m_z is the unique Stieltjes-transform solution, for all such z, of the implicit equation

m_z = −(c z − (1/n) tr Σ (I_p + m_z Σ)^{-1})^{-1},

and the notation A ↔ B means that, as p → ∞, (1/p) tr M (A − B) → 0 almost surely and u^T (A − B) v → 0 almost surely, for all deterministic Hermitian matrices M and deterministic vectors u, v of bounded norms. Moreover, from [22] and [23], for z ∈ ℂ∖ℝ₊,

(1/n) tr M Q̃_z + (1/(zn)) tr M (I_p + m_z Σ)^{-1} = (1/n) ψ_n(z),  (35)
u^T Q̃_z v + (1/z) u^T (I_p + m_z Σ)^{-1} v = (1/√n) h_n(z),  (36)

where, for all k ∈ ℕ, E|ψ_n(z)|^k and E|h_n(z)|^k can be bounded uniformly in n over any compact set at a macroscopic distance from the limiting support of (1/p) P X X^T P.
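The fixed-point characterization of m_z lends itself to a simple numerical check. The sketch below (our illustration; the toy diagonal Σ, the sizes, and z are all arbitrary choices) iterates the implicit equation at a real z < 0 and compares (1/p) tr Q̃_z with the trace of its deterministic equivalent −(1/z)(I_p + m_z Σ)^{-1}:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 800, 400
c = p / n
d = np.linspace(0.5, 1.5, p)                   # eigenvalues of a toy diagonal Sigma
X = rng.standard_normal((n, p)) * np.sqrt(d)   # rows x_i ~ N(0_p, Sigma)
z = -1.0                                       # real negative z, off the spectrum

# empirical resolvent Q~_z = (X^T P X / p - z I_p)^{-1}
P = np.eye(n) - np.ones((n, n)) / n
Qt = np.linalg.inv(X.T @ P @ X / p - z * np.eye(p))

# fixed-point iteration for m_z = -(c z - (1/n) tr Sigma (I + m_z Sigma)^{-1})^{-1}
m = 1.0
for _ in range(200):
    m = -1.0 / (c * z - np.sum(d / (1.0 + m * d)) / n)

# deterministic equivalent in trace sense: (1/p) tr Q~_z ~ -(1/(zp)) tr (I + m_z Sigma)^{-1}
lhs = np.trace(Qt) / p
rhs = -np.sum(1.0 / (1.0 + m * d)) / (z * p)
print(lhs, rhs)
```

For Σ = I the fixed point can be solved in closed form and agrees with the Marchenko–Pastur law, which is a convenient way to double-check the iteration.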
Theorem 3 (An integration-by-parts formula for Gaussian functionals). [22] With f satisfying Assumption 3 and for x = [x_1, …, x_p]^T ∼ N(0_p, Σ), we have

E[x_i f(x)] = Σ_{j=1}^p [Σ]_{i,j} E[∂f(x)/∂x_j],  (37)

or, equivalently, E[x f(x)] = Σ E[∇f(x)].

Theorem (Nash–Poincaré inequality). [22] With f satisfying Assumption 3 and under the setting of the previous theorem,

var(f(x)) ≤ E[∇f(x)^T Σ ∇f(x)].

We shall also need the following differentiation formula. For i ∈ {1, …, n} and j ∈ {1, …, p},

∂Q̃_z/∂x_ij = −Q̃_z e_j e_i^T (P X/p) Q̃_z − Q̃_z (X^T P/p) e_i e_j^T Q̃_z,  (38)

where e_k is the all-zero vector with a 1 at the k-th entry, its dimension being clear from the context. With this background on the asymptotic behavior of kernel matrices, we derive in the following the limiting prediction and empirical risks. Recall that the prediction risk writes as

R_test = E_{s,ε}[|κ_c(s)^T (K_c + λ I_n)^{-1} P y + ȳ − f(s)|²],

where y = [y_1, …, y_n]^T = f(X) + ε and ȳ = (1/n) 1^T f(X) + (1/n) ε^T 1. Due to the independence of s and ε, the prediction risk can be decomposed into variance and bias terms as R_test = B + V, where

V = E_{s,ε}[|κ_c(s)^T (K_c + λ I_n)^{-1} P ε + (1/n) ε^T 1|²],
B = E_s[(κ_c(s)^T (K_c + λ I_n)^{-1} P f(X) + (1/n) 1^T f(X) − f(s))²].

Now, computing the expectation over ε, we obtain

V = σ² E_s[κ_c(s)^T (K_c + λ I_n)^{-1} P (K_c + λ I_n)^{-1} κ_c(s)] + O_p(n^{-1})
= σ² E_s[κ_c(s)^T (K_c + λ I_n)^{-2} κ_c(s) − |κ_c(s)^T (K_c + λ I_n)^{-1} (1/√n) 1|²] + O_p(n^{-1}).
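The Gaussian integration-by-parts formula (37), used repeatedly below, is easy to validate by Monte Carlo. The sketch below (our toy illustration; the covariance and sample count are arbitrary) checks E[x f(x)] = Σ E[∇f(x)] for the test function f(x) = sin(1^T x/√p), whose gradient is cos(1^T x/√p)·1/√p:

```python
import numpy as np

rng = np.random.default_rng(2)
p, N = 5, 400_000
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + np.eye(p)            # a valid (positive definite) covariance
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((N, p)) @ L.T      # N samples x ~ N(0_p, Sigma)

u = X.sum(axis=1) / np.sqrt(p)             # 1^T x / sqrt(p)
f = np.sin(u)                              # f(x) = sin(1^T x / sqrt(p))
grad = (np.cos(u)[:, None] / np.sqrt(p)) * np.ones(p)   # nabla f(x), an N x p array

lhs = (X * f[:, None]).mean(axis=0)        # Monte Carlo estimate of E[x f(x)]
rhs = Sigma @ grad.mean(axis=0)            # Sigma E[nabla f(x)]
print(np.max(np.abs(lhs - rhs)))
```

The two p-dimensional vectors agree up to the Monte Carlo error O(1/√N).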
Let us start by controlling the second term. Replacing κ_c(s) by (33),

E_s[|κ_c(s)^T (K_c + λ I_n)^{-1} (1/√n) 1|²] ≤ 3 (g'(0))² E_s[|(1/p) s^T X^T P (K_c + λ I_n)^{-1} (1/√n) 1|²] + 3 E_s[|κ̃(s)^T P (K_c + λ I_n)^{-1} (1/√n) 1|²] + O_p(n^{-1}).

Computing the expectation over s of the first term of the last inequality, we can show that

E_s[|(1/p) s^T X^T P (K_c + λ I_n)^{-1} (1/√n) 1|²] = (1/(np)) 1^T (K_c + λ I_n)^{-1} P (X Σ X^T/p) P (K_c + λ I_n)^{-1} 1 = O_p(n^{-1}).

On the other hand, from (32), we have

E[|κ̃(s)^T P (K_c + λ I_n)^{-1} (1/√n) 1|²] = O_p(n^{-1}).

The above approximations thus yield

E_s[|κ_c(s)^T (K_c + λ I_n)^{-1} (1/√n) 1|²] = O_p(n^{-1}).

It remains now to compute the first term. Using (32) along with (33), we have

V = σ² (g'(0))² E_s[(1/p²) s^T X^T P (K_c + λ I_n)^{-2} P X s] + O_p(n^{-1})
= σ² (g'(0))² (1/p²) tr Σ X^T P (K_c + λ I_n)^{-2} P X + O_p(n^{-1}).

From (30), V can be approximated as

V (a)= (σ²/p²) tr Σ X^T P Q_z² P X + O_p(n^{-1})
= (σ²/p²) (∂/∂t)[tr Σ X^T P Q_t P X]_{t=z} + O_p(n^{-1})
= σ² (∂/∂t)[(1/p) tr Σ (X^T P X/p) Q̃_t]_{t=z} + O_p(n^{-1})
= σ² (∂/∂t)[(1/p) tr Σ + (t/p) tr Σ Q̃_t]_{t=z} + O_p(n^{-1})
= (∂/∂t)[(t σ²/p) tr Σ Q̃_t]_{t=z} + O_p(n^{-1}),

where in (a) the contribution of the matrix (ν/g'(0)) Q_z 11^T Q_z/(n − (ν/g'(0)) 1^T Q_z 1) has been discarded, since it only induces terms of order O_p(n^{-1}). Using (35), we thus have

(∂/∂t)[(t σ²/p) tr Σ Q̃_t]_{t=z} = (∂/∂t)[−(σ²/p) tr Σ (I_p + m_t Σ)^{-1}]_{t=z} + O_p(n^{-1}).
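The step above that trades the squared resolvent for a derivative rests on the identity ∂Q_t/∂t = Q_t², valid for any resolvent Q_t = (S − t I)^{-1}. A direct finite-difference check (an illustration with small, arbitrarily chosen sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, t, h = 40, 25, -0.9, 1e-5
X = rng.standard_normal((n, p))
S = X.T @ X / p                      # a sample-covariance-type matrix
Sigma = np.eye(p)


def Q(t):
    # resolvent Q_t = (S - t I_p)^{-1}
    return np.linalg.inv(S - t * np.eye(p))


# d/dt tr(Sigma Q_t) should equal tr(Sigma Q_t^2); compare with central differences
fd = (np.trace(Sigma @ Q(t + h)) - np.trace(Sigma @ Q(t - h))) / (2 * h)
exact = np.trace(Sigma @ Q(t) @ Q(t))
print(abs(fd - exact))
```

The agreement is to within the O(h²) truncation error of the central difference.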
Taking the derivative over t, we find after simple calculations that

(∂/∂t)[(t σ²/p) tr Σ Q̃_t]_{t=z} = σ² m_z² tr Σ² (I_p + m_z Σ)^{-2}/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2}) + O_p(n^{-1}).

Putting all the above derivations together, we finally obtain

V = σ² m_z² tr Σ² (I_p + m_z Σ)^{-2}/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2}) + O_p(n^{-1}).

Evaluation of the bias term. To begin with, we first expand B as

B = E_s[f(X)^T P (K_c + λ I_n)^{-1} κ_c(s) κ_c(s)^T (K_c + λ I_n)^{-1} P f(X)] + E_s[|(1/n) 1^T f(X) − f(s)|²] + 2 E_s[f(X)^T P (K_c + λ I_n)^{-1} κ_c(s) ((1/n) 1^T f(X) − f(s))].

We will treat the above three terms sequentially. To begin with, we first control E_s[κ_c(s) κ_c(s)^T], which we expand as

E_s[κ_c(s) κ_c(s)^T] = (g'(0))² (1/p²) P X Σ X^T P + g'(0) (1/p) P X E_s[s κ̃(s)^T] P + g'(0) (1/p) P E_s[κ̃(s) s^T] X^T P + P E_s[κ̃(s) κ̃(s)^T] P − (1/n) P E_s[κ̃(s)] 1^T K P − (1/n) P K 1 E_s[κ̃(s)^T] P + (1/n²) P K 11^T K P.

Using (33) along with Lemma 2, we obtain

E_s[κ_c(s) κ_c(s)^T] = (g'(0))² (1/p²) P X Σ X^T P + (1/n²) P K 11^T K P + O_‖·‖(n^{-3/2}).

Replacing K by K∞, we thus obtain

E_s[κ_c(s) κ_c(s)^T] = (g'(0))² (1/p²) P X Σ X^T P + (g'(0))² (1/n²) P (XX^T/p) 11^T (XX^T/p) P + O_‖·‖(n^{-3/2}).

From Assumption 3, we can prove that ‖P f(X)‖ = O_p(√n). Hence, the first term in B can be approximated as

E_s[f(X)^T P (K_c + λ I_n)^{-1} κ_c(s) κ_c(s)^T (K_c + λ I_n)^{-1} P f(X)]
= (g'(0))² (1/p) f(X)^T P (K_c + λ I_n)^{-1} P (X Σ X^T/p) P (K_c + λ I_n)^{-1} P f(X)
+ (g'(0))² (1/n²) f(X)^T P (K_c + λ I_n)^{-1} P (XX^T/p) 11^T (XX^T/p) P (K_c + λ I_n)^{-1} P f(X) + O_p(n^{-1/2})
≜ Z₁ + Z₂ + O_p(n^{-1/2}).

We will start by treating Z₁.
From (30), we can show that

Z₁ = (1/p²) f(X)^T P X Q̃_z Σ Q̃_z X^T P f(X) + O_p(n^{-1/2}).

To treat Z₁, we first decompose P f(X) as

P f(X) = f(X) − E_x[f(x)] 1 + (E_x[f(x)] − (1/n) 1^T f(X)) 1,

where, from classical probability results, ‖(E_x[f(x)] − (1/n) 1^T f(X)) 1‖ = O_p(1). Hence, we can replace P f(X) by f̊(X) = f(X) − E_x[f(x)] 1 up to an error O_p(n^{-1/2}), thus yielding

Z₁ = (1/p²) f̊(X)^T X Q̃_z Σ Q̃_z X^T f̊(X) + O_p(n^{-1/2}).

To proceed further, we shall resort to the following lemma.

Lemma 3.
Let A be a p × p symmetric matrix with uniformly bounded spectral norm, let f satisfy Assumption 3, and let z ∈ ℂ∖ℝ₊. Consider the function

h(X) = (1/p²) f̊(X)^T X Q̃_z A Q̃_z X^T f̊(X).

Then, for any δ > 0, var(h(X)) = O(n^{-1+δ}).

Proof.
The proof of Lemma 3 follows from the Nash–Poincaré inequality and the differentiation formula in (38); we therefore omit the technical details for brevity.

As per Lemma 3, Z₁ can be approximated as

Z₁ = E[(1/p²) f̊(X)^T X Q̃_z Σ Q̃_z X^T f̊(X)] + O_p(n^{-1/2+δ}),

for any δ > 0. Now let x = (1/√n) X^T 1. The resolvent matrix Q̃_z can be expressed as

Q̃_z = ((1/p) X^T X − (1/p) x x^T − z I_p)^{-1}.

By the inversion lemma, Q̃_z can be written as

Q̃_z = Q_z + (1/p) Q_z x x^T Q_z/(1 − (1/p) x^T Q_z x),  where Q_z = ((1/p) X^T X − z I_p)^{-1}.

Hence, Z₁ can be expanded as

Z₁ = (1/p²) E[f̊(X)^T X Q_z Σ Q_z X^T f̊(X)]
+ (1/p³) E[f̊(X)^T X Q_z x x^T Q_z Σ Q_z X^T f̊(X) (1 − (1/p) x^T Q_z x)^{-1}]
+ (1/p³) E[f̊(X)^T X Q_z Σ Q_z x x^T Q_z X^T f̊(X) (1 − (1/p) x^T Q_z x)^{-1}]
+ (1/p⁴) E[f̊(X)^T X Q_z x x^T Q_z Σ Q_z x x^T Q_z X^T f̊(X) (1 − (1/p) x^T Q_z x)^{-2}].

We can show that the last three terms are O(n^{-1/2+δ}). To illustrate this, we focus on the second term, as the derivations are similar for the remaining quantities. By the Cauchy–Schwarz inequality, and since x = (1/√n) X^T 1,

(1/p³) E[f̊(X)^T X Q_z x x^T Q_z Σ Q_z X^T f̊(X) (1 − (1/p) x^T Q_z x)^{-1}]
≤ C √(E|(1/p) f̊(X)^T X Q_z X^T (1/√n) 1|²) √(E|(1/p) (1/√n) 1^T X Q_z Σ Q_z X^T f̊(X)|²),

where the constant C absorbs the uniformly bounded factor (1 − (1/p) x^T Q_z x)^{-1} and the residual normalizations. To treat the above bound, we first show that

E[|(1/p) (1/√n) 1^T X Q_z X^T f̊(X)|²] = O(n^{-1+δ}),

for any δ > 0. Towards this end, we need a control of the variance of (1/p) (1/√n) 1^T X Q_z X^T f̊(X), which is provided by the following lemma.

Lemma 4.
Let f satisfy Assumption 3 and let z ∈ ℂ∖ℝ₊. Consider the function

g(X) = (1/p) (1/√n) 1^T X Q_z X^T f̊(X).

Then, for any δ > 0, var(g(X)) = O(n^{-1+δ}).

The proof of Lemma 4 follows the same lines as that of Lemma 3 and is thus omitted. With the control of the variance at hand, it suffices to show that

E[g(X)] = O(n^{-1/2+δ}).  (39)

Let Q_{z,i} = (Σ_{k=1, k≠i}^n (1/p) x_k x_k^T − z I_p)^{-1}. We thus develop E[g(X)] as

(1/(√n p)) Σ_{i=1}^n Σ_{j=1}^n E[x_i^T Q_z x_j f̊(x_j)]
= (1/(√n p)) Σ_{i=1}^n Σ_{j=1, j≠i}^n E[x_i^T Q_{z,i} x_j f̊(x_j)/(1 + (1/p) x_i^T Q_{z,i} x_i)] + (1/(√n p)) Σ_{i=1}^n E[x_i^T Q_{z,i} x_i f̊(x_i)/(1 + (1/p) x_i^T Q_{z,i} x_i)]
(a)= (1/(√n p)) Σ_{i=1}^n Σ_{j=1}^n E[x_i^T Q_{z,i} x_j f̊(x_j) ((1 + (1/p) x_i^T Q_{z,i} x_i)^{-1} − (1 + (1/p) tr Σ Q_{z,i})^{-1})] + O(n^{-1/2}),

where in (a) we use the fact that, when i ≠ j, E[x_i^T Q_{z,i} x_j f̊(x_j) (1 + (1/p) tr Σ Q_{z,i})^{-1}] = 0. Writing the right-hand side of (a) in matrix form and using the fact that

‖diag{(1 + (1/p) x_i^T Q_{z,i} x_i)^{-1} − (1 + (1/p) tr Σ Q_{z,i})^{-1}}_{i=1}^n‖ = O_p(n^{-1/2+δ}),

we conclude that (39) holds, and therefore

Z₁ = E[(1/p²) f̊(X)^T X Q_z Σ Q_z X^T f̊(X)] + O_p(n^{-1/2+δ}).

From the inversion lemma, we get Q_z x_i = Q_{z,i} x_i/(1 + (1/p) x_i^T Q_{z,i} x_i). Z₁ then writes as

Z₁ = Σ_{i=1}^n Σ_{j=1}^n (1/p²) E[f̊(x_i) x_i^T Q_{z,i} Σ Q_{z,j} x_j f̊(x_j)/((1 + (1/p) x_i^T Q_{z,i} x_i)(1 + (1/p) x_j^T Q_{z,j} x_j))] + O_p(n^{-1/2+δ}).

We next use the fact that, for all k ∈ ℕ,

E[|(1/p) x_i^T Q_{z,i} x_i + (1/(pz)) tr Σ (I_p + m_z Σ)^{-1}|^k] = O(n^{-k/2}).
Along with the relation 1 − (1/(pz)) tr Σ (I_p + m_z Σ)^{-1} = −1/(z c m_z), we obtain

Z₁ = c² z² m_z² Σ_{i=1}^n Σ_{j=1}^n (1/p²) E[f̊(x_i) x_i^T Q_{z,i} Σ Q_{z,j} x_j f̊(x_j)]
= c² z² m_z² Σ_{i=1}^n Σ_{j=1, j≠i}^n (1/p²) E[f̊(x_i) x_i^T Q_{z,i} Σ Q_{z,j} x_j f̊(x_j)] + c² z² m_z² (1/p²) Σ_{i=1}^n E[|f̊(x_i)|² x_i^T Q_{z,i} Σ Q_{z,i} x_i] + O_p(n^{-1/2+δ}).  (40)

Now, we proceed further by resorting to the inversion lemma

Q_{z,i} = Q_{z,ij} − (1/p) Q_{z,ij} x_j x_j^T Q_{z,ij}/(1 + (1/p) x_j^T Q_{z,ij} x_j),

where Q_{z,ij} denotes the resolvent with both samples x_i and x_j removed, which, once plugged into (40), yields

Z₁ = c² z² m_z² Σ_{i=1}^n Σ_{j=1}^n E[(1/p²) f̊(x_i) x_i^T Q_{z,ij} Σ Q_{z,ij} x_j f̊(x_j)]
− c² z² m_z² Σ_{i=1}^n Σ_{j=1, j≠i}^n E[(1/p³) f̊(x_i) x_i^T Q_{z,ij} x_j x_j^T Q_{z,ij} Σ Q_{z,ij} x_j f̊(x_j)/(1 + (1/p) x_j^T Q_{z,ij} x_j)]
− c² z² m_z² Σ_{i=1}^n Σ_{j=1, j≠i}^n E[(1/p³) f̊(x_i) x_i^T Q_{z,ij} Σ Q_{z,ij} x_i x_i^T Q_{z,ij} x_j f̊(x_j)/(1 + (1/p) x_i^T Q_{z,ij} x_i)]
+ c² z² m_z² Σ_{i=1}^n Σ_{j=1, j≠i}^n E[(1/p⁴) f̊(x_i) x_i^T Q_{z,ij} x_j x_j^T Q_{z,ij} Σ Q_{z,ij} x_i x_i^T Q_{z,ij} x_j f̊(x_j)/((1 + (1/p) x_i^T Q_{z,ij} x_i)(1 + (1/p) x_j^T Q_{z,ij} x_j))]
≜ Z₁₁ + Z₁₂ + Z₁₃ + Z₁₄ + O_p(n^{-1/2+δ}).

Let us first control Z₁₁. Taking the expectation over x_i and x_j (using Theorem 3), we obtain

Z₁₁ = c² z² m_z² (1/p²) Σ_{i=1}^n Σ_{j=1, j≠i}^n E[∇f]^T Σ E[Q_{z,ij} Σ Q_{z,ij}] Σ E[∇f] + c² z² m_z² (1/p²) Σ_{i=1}^n E[|f̊(x_i)|² x_i^T Q_{z,i} Σ Q_{z,i} x_i].

It can be shown, for vectors a and b with uniformly bounded norms, that

E[a^T Q_{z,ij} Σ Q_{z,ij} b] = E[a^T Q_z Σ Q_z b] + O(n^{-1/2}).
Using the fact that

E[|(1/p) x_i^T Q_{z,i} Σ Q_{z,i} x_i − E[(1/p) tr Σ Q_z Σ Q_z]|²] = O(n^{-1}),

we thus obtain

Z₁₁ = c² z² m_z² (1/p²) Σ_{i=1}^n Σ_{j=1, j≠i}^n E[∇f]^T Σ E[Q_z Σ Q_z] Σ E[∇f] + c² z² m_z² (1/p) Σ_{i=1}^n var(f) E[(1/p) tr Σ Q_z Σ Q_z] + O(n^{-1/2}).

By standard results from random matrix theory [23], [24], we get

Z₁₁ = n m_z² E[∇f]^T Σ (I_p + m_z Σ)^{-1} Σ (I_p + m_z Σ)^{-1} Σ E[∇f]/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2}) + var(f) m_z² tr Σ² (I_p + m_z Σ)^{-2}/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2}) + O(n^{-1/2}).

Along the same arguments, we can show that

Z₁₂ = Z₁₃ = −m_z³ tr Σ² (I_p + m_z Σ)^{-2}/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2}) E[∇f]^T Σ (I_p + m_z Σ)^{-1} Σ E[∇f] + O(n^{-1/2}).

As for Z₁₄, using Hölder's inequality, it can be bounded as

|Z₁₄| ≤ c² |z|² |m_z|² (1/(p² √p)) Σ_{i=1}^n Σ_{j=1, j≠i}^n (E|(1/√p) x_i^T Q_{z,ij} x_j|⁴)^{1/4} (E|(1/√p) x_j^T Q_{z,ij} Σ Q_{z,ij} x_i|⁴)^{1/4} (E|(1/√p) x_i^T Q_{z,ij} x_j|⁴)^{1/4} (E|f̊(x_i) f̊(x_j)|⁴)^{1/4} = O(n^{-1/2}).

We thus conclude that

Z₁ = var(f) m_z² tr Σ² (I_p + m_z Σ)^{-2}/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2}) + n m_z² E[∇f]^T Σ (I_p + m_z Σ)^{-1} Σ (I_p + m_z Σ)^{-1} Σ E[∇f]/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2}) − 2 m_z³ tr Σ² (I_p + m_z Σ)^{-2}/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2}) E[∇f]^T Σ (I_p + m_z Σ)^{-1} Σ E[∇f] + O_p(n^{-1/2+δ}).

Now, we treat the term Z₂. Similarly to Z₁, we can show that

Z₂ = (1/(n p²)) f(X)^T P X Q̃_z X^T (11^T/n) X Q̃_z X^T P f(X) + O_p(n^{-1/2}).

Using Lemma 4 along with (39), we thus obtain Z₂ = O_p(n^{-1/2+δ}).
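The rank-one resolvent updates used repeatedly above (removing the centering term (1/p)xx^T, or a single sample (1/p)x_i x_i^T) are exact algebraic identities, not asymptotic approximations. The following sketch (our illustration; all sizes and z are arbitrary) verifies the Sherman–Morrison form to machine precision on random data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, z = 50, 30, -0.7
X = rng.standard_normal((n, p))
x = X.T @ np.ones(n) / np.sqrt(n)       # x = (1/sqrt(n)) X^T 1

# base resolvent and resolvent with the rank-one term (1/p) x x^T removed
Q = np.linalg.inv(X.T @ X / p - z * np.eye(p))
Qt = np.linalg.inv(X.T @ X / p - np.outer(x, x) / p - z * np.eye(p))

# Sherman-Morrison: removing (1/p) x x^T shifts the resolvent by a rank-one correction
Qt_sm = Q + (Q @ np.outer(x, x) @ Q / p) / (1 - x @ Q @ x / p)
print(np.max(np.abs(Qt - Qt_sm)))
```

The maximal entrywise discrepancy is at the level of floating-point round-off.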
We have thus completed the treatment of the first term of the bias and shown that

E_s[f(X)^T P (K_c + λ I_n)^{-1} κ_c(s) κ_c(s)^T (K_c + λ I_n)^{-1} P f(X)]
= var(f) m_z² tr Σ² (I_p + m_z Σ)^{-2}/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2})
+ n m_z²/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2}) E[∇f]^T Σ (I_p + m_z Σ)^{-1} Σ (I_p + m_z Σ)^{-1} Σ E[∇f]
− 2 m_z³ tr Σ² (I_p + m_z Σ)^{-2}/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2}) E[∇f]^T Σ (I_p + m_z Σ)^{-1} Σ E[∇f] + O_p(n^{-1/2+δ}).

The second term in the bias can be dealt with by noticing that (1/n) 1^T f(X) = E[f(x)] + O_p(1/√n), thus yielding

E_s[|(1/n) 1^T f(X) − f(s)|²] = var(f) + O_p(n^{-1/2}).

We now move to the last term in the bias. Using (33), we obtain

2 E_s[f(X)^T P (K_c + λ I_n)^{-1} κ_c(s) ((1/n) 1^T f(X) − f(s))]
= 2 E_s[f(X)^T P (K_c + λ I_n)^{-1} {g'(0) (1/p) P X s + P κ̃(s) − (1/n) P K 1} ((1/n) 1^T f(X) − f(s))]
= −2 E_s[f(X)^T P (K_c + λ I_n)^{-1} g'(0) (1/p) P X s f(s)]
+ 2 E_s[f(X)^T P (K_c + λ I_n)^{-1} P κ̃(s) ((1/n) 1^T f(X) − f(s))]
− 2 E_s[f(X)^T P (K_c + λ I_n)^{-1} (1/n) P K 1 ((1/n) 1^T f(X) − f(s))].

Since E_s[f(s)] − (1/n) 1^T f(X) = O_p(n^{-1/2}), we can replace (1/n) 1^T f(X) by E_s[f(s)] in the last two terms, at the price of an error O(n^{-1/2}). In doing so, and using Theorem 3 to write E_s[s f(s)] = Σ E[∇f], we obtain

2 E_s[f(X)^T P (K_c + λ I_n)^{-1} κ_c(s) ((1/n) 1^T f(X) − f(s))]
= −2 f(X)^T P (K_c + λ I_n)^{-1} g'(0) (1/p) P X Σ E[∇f]
+ 2 E_s[f(X)^T P (K_c + λ I_n)^{-1} P κ̃(s) (E_s[f(s)] − f(s))] + O_p(n^{-1/2}).
Using (30) along with standard calculations similar to those used in Lemma 4, we obtain

−2 f(X)^T P (K_c + λ I_n)^{-1} g'(0) (1/p) P X Σ E[∇f]
= −2 f(X)^T P Q_z (1/p) P X Σ E[∇f] + O_p(n^{-1/2})
= −(2/p) f(X)^T P X Q̃_z Σ E[∇f] + O_p(n^{-1/2}).

Replacing Q̃_z by its deterministic equivalent and using similar derivations as before, we obtain

−2 f(X)^T P (K_c + λ I_n)^{-1} g'(0) (1/p) P X Σ E[∇f] = −2 m_z E[∇f]^T Σ (I_p + m_z Σ)^{-1} Σ E[∇f] + O_p(n^{-1/2+δ}).

Finally, using the same tools, we can show that E_s[f(X)^T P (K_c + λ I_n)^{-1} P κ̃(s) (E_s[f(s)] − f(s))] does not contribute to the expression of the bias. In fact,

E_s[f(X)^T P (K_c + λ I_n)^{-1} P κ̃(s) (E_s[f(s)] − f(s))] = O_p(n^{-1/2+δ}).

We are now in a position to evaluate the bias term. Combining all the above derivations, for any δ > 0 we obtain

B = n var(f)/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2})
− 2 n m_z/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2}) E[∇f]^T Σ (I_p + m_z Σ)^{-1} Σ E[∇f]
+ n m_z²/(n − m_z² tr Σ² (I_p + m_z Σ)^{-2}) E[∇f]^T Σ (I_p + m_z Σ)^{-1} Σ (I_p + m_z Σ)^{-1} Σ E[∇f] + O_p(n^{-1/2+δ}).

This concludes the proof.
Appendix B
Proof of Theorem 2

We first write

(1/p²) y^T P X Q̃_z X^T P y
= (1/p²) f(X)^T P X Q̃_z X^T P f(X) + (1/p²) ε^T P X Q̃_z X^T P ε + O_p(n^{-1/2})
= (1/p²) f(X)^T P X Q̃_z X^T P f(X) + σ² (1/p) tr (X^T P X/p) Q̃_z + O_p(n^{-1/2})
= (1/p²) f(X)^T P X Q̃_z X^T P f(X) + σ² + (z σ²/p) tr Q̃_z + O_p(n^{-1/2})
= (1/p²) f(X)^T P X Q̃_z X^T P f(X) + σ² − (σ²/p) tr (I_p + m_z Σ)^{-1} + O_p(n^{-1/2}).

It is also possible to show, along the same lines as before, that

(1/p²) f(X)^T P X Q̃_z X^T P f(X) = −z m_z² E[∇f]^T Σ (I_p + m_z Σ)^{-1} Σ E[∇f] + m_z var(f) (1/p) tr Σ (I_p + m_z Σ)^{-1} + O_p(n^{-1/2+δ}).

Moreover, by computing

(1/p²) f(X)^T P X Q̃_z² X^T P f(X) = (∂/∂z) (1/p²) f(X)^T P X Q̃_z X^T P f(X)
= (∂/∂z)[−z m_z² E[∇f]^T Σ (I_p + m_z Σ)^{-1} Σ E[∇f] + m_z var(f) (1/p) tr Σ (I_p + m_z Σ)^{-1}] + O_p(n^{-1/2+δ}).

With the above observations at hand, it is straightforward to show that R̂_test = R∞_test + O_p(n^{-1/2+δ}), which, combined with Theorem 1, gives the claim of Theorem 2.

References

[1] C. M. Bishop,
Pattern Recognition and Machine Learning. Springer, 2006.
[2] T. Hofmann, B. Schölkopf, and A. Smola, "Kernel methods in machine learning," Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, Jun. 2008.
[3] A. E. Alaoui and M. W. Mahoney, "Fast randomized kernel ridge regression with statistical guarantees," NIPS, vol. 1, pp. 775–783, 2015.
[4] A. Rudi and L. Rosasco, "Generalization properties of learning with random features," in NIPS, 2017.
[5] S. Giguère, F. Laviolette, M. Marchand, and K. Sylla, "Risk bounds and learning algorithms for the regression approach to structured output prediction," in Proceedings of the 30th International Conference on Machine Learning, ser. ICML'13. JMLR.org, 2013, pp. I-107–I-114. [Online]. Available: http://dl.acm.org/citation.cfm?id=3042817.3042831
[6] A. Caponnetto and E. De Vito, "Optimal rates for the regularized least-squares algorithm," Foundations of Computational Mathematics, vol. 7, no. 3, pp. 331–368, Jul. 2007. [Online]. Available: https://doi.org/10.1007/s10208-006-0196-8
[7] L. H. Dicker, D. P. Foster, and D. Hsu, "Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators," Electronic Journal of Statistics, vol. 11, no. 1, pp. 1022–1047, 2017.
[8] C. Cortes, M. Mohri, and A. Rostamizadeh, "Algorithms for learning kernels based on centered alignment," Journal of Machine Learning Research, vol. 13, pp. 795–828, 2012.
[9] Y. Lu, L. Wang, J. Lu, J. Yang, and C. Shen, "Multiple kernel clustering based on centered kernel alignment," Pattern Recognition.
[10] Computational and Mathematical Methods in Medicine, 2016. [Online]. Available: https://doi.org/10.1155/2016/9523849
[11] N. El-Karoui, "The spectrum of kernel random matrices," The Annals of Statistics, vol. 38, no. 1, pp. 1–50, 2010.
[12] R. Couillet and F. Benaych-Georges, "Kernel spectral clustering of large dimensional data," Electronic Journal of Statistics, vol. 10, pp. 1393–1454, 2016.
[13] G. Kimeldorf and G. Wahba, "Some results on Tchebycheffian spline functions," Journal of Mathematical Analysis and Applications, vol. 33, no. 1, pp. 82–95, 1971.
[14] B. Schölkopf, R. Herbrich, and A. J. Smola, "A generalized representer theorem," Springer, no. 1, pp. 416–426, 2001.
[15] V. Cherkassky, X. Shao, F. M. Mulier, and V. N. Vapnik, "Model complexity control for regression using VC generalization bounds," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1075–1089, Sep. 1999.
[16] T. Jaakkola, "Course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare," Massachusetts Institute of Technology, 2006.
[17] G. Valentini and T. G. Dietterich, "Bias-variance analysis of support vector machines for the development of SVM-based ensemble methods," Journal of Machine Learning Research, vol. 5, pp. 725–775, Dec. 2004. [Online]. Available: http://dl.acm.org/citation.cfm?id=1005332.1016783
[18] Z. Liao and R. Couillet, "A large dimensional analysis of least squares support vector machines," 2017. [Online]. Available: https://arxiv.org/abs/1701.02967
[19] R. Couillet and M. Debbah, Random Matrix Methods for Wireless Communications. Cambridge University Press, 2011.
[20] Z. Bai and J. W. Silverstein, Spectral Analysis of Large Dimensional Random Matrices. Springer, 2009.
[21] M. A. Redmond and A. Baveja, "Communities and Crime data set," UCI Repository, 2002. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/communities+and+crime
[22] W. Hachem, O. Khorunzhiy, P. Loubaton, J. Najim, and L. Pastur, "A new approach for mutual information analysis of large dimensional multi-antenna channels," IEEE Transactions on Information Theory, vol. 54, no. 9, pp. 3987–4004, Sep. 2008.
[23] W. Hachem, P. Loubaton, J. Najim, and P. Vallet, "On bilinear forms based on the resolvent of large random matrices," Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, vol. 49, no. 1, pp. 36–63, 2013.
[24] W. Hachem, M. Kharouf, J. Najim, and J. W. Silverstein, "A CLT for information-theoretic statistics of non-centered Gram random matrices,"