Inner-product Kernels are Asymptotically Equivalent to Binary Discrete Kernels
aa r X i v : . [ s t a t . M L ] S e p Inner-product Kernels are Asymptotically Equivalent toBinary Discrete Kernels
Zhenyu Liao ∗ †
Romain Couillet †∗ Abstract
This article investigates the eigenspectrum of the inner product-type kernel ma-trix √ p K = { f ( x T i x j / √ p ) } ni , j = under a binary mixture model in the high dimen-sional regime where the number of data n and their dimension p are both large andcomparable. Based on recent advances in random matrix theory, we show that, fora wide range of nonlinear functions f , the eigenspectrum behavior is asymptoticallyequivalent to that of an (at most) cubic function. This sheds new light on the un-derstanding of nonlinearity in large dimensional problems. As a byproduct, wepropose a simple function prototype valued in ( −
1, 0, 1 ) that, while reducing bothstorage memory and running time, achieves the same (asymptotic) classification per-formance as any arbitrary function f . Multivariate mixture models, especially Gaussian mixtures, play a fundamental role instatistics and have received significant research attention in the machine learning com-munity [KMV16, ABDH + independent of the underlying distribution of the random matrix.Intuitively speaking, the “squared” number of degrees of freedom in large matrices (e.g.,sample covariance matrices n XX T ∈ R p × p based on n observations of dimension p ar-ranged in the columns of X ) and their independent interactions induce fast versions ofcentral limit theorems irrespective of the data distribution, resulting in universal statisti-cal results.In this paper, we consider the eigenspectrum behavior of the inner product “prop-erly scaled” (see details below) kernel matrix K ij = f ( x T i x j / √ p ) / √ p , for n data x i ∈ R p arising from an affine transformation of i.i.d. random vectors with zero mean, unit covari-ance and bounded higher order moments. Under this setting and some mild regularitycondition for the nonlinear function f , the spectrum of K can be shown to only dependon three parameters of f , in the regime where n , p are both large and comparable.The theoretical study of the eigenspectrum of large random matrices serves as a ba-sis to understand many practical statistical learning algorithms, among which are kernelspectral clustering [NJW02] or sparse principle component analysis (PCA) [JL09]. In the ∗ Laboratoire des Signaux et Système, CentraleSupélec, Université Paris-Saclay, France. † G-STATS Data Science Chair, GIPSA-lab, University Grenoble Alpes, France. n , p regime, various types of “random kernel matrices” have been studied from aspectral random matrix viewpoint. In [EK10b] the authors considered kernel matricesbased on the Euclidean distance f ( k x i − x j k / p ) or the inner product f ( x T i x j / p ) betweendata vectors, and study the eigenvalue distribution by essentially “linearizing” the non-linear function f via a Taylor expansion, which naturally demands f to be locally smooth.Later in [CBG16] the authors followed the same technical approach and considered amore involved Gaussian mixture model for the x i , providing rich insights into the impactof nonlinearity in kernel spectral clustering application.Nonetheless, these results are in essence built upon a local expansion of the non-linear function f , which follows from the “concentration” of the similarity measures k x i − x j k / p or x T i x j / p around a single value of the smooth domain of f , therefore dis-regarding most of the domain of f . In this article, following [CS13, DV13], we study theinner product kernel f ( x T i x j / √ p ) / √ p which avoids the concentration effects with themore natural √ p normalization. With the flexible tool of orthogonal polynomials, we areable to prove universal results which solely depend on the first two order moments ofthe data distribution and allow for nonlinear functions f that need not even be differen-tiable. As a practical outcome of our theoretical results, we propose an extremely simplepiecewise constant function which is spectrally equivalent and thus performs equallywell as arbitrarily complex functions f , while inducing enormous gains in both storageand computational complexity.The remainder of this article is organized as follows. In Section 2 we introduce theobject of interest together with necessary assumptions to work along with. Our main the-oretical findings are presented in Section 3 with intuitive ideas, while detailed proofs aredeferred to the supplementary material due to space limitation. In Section 4 we discussthe practical consequence of our theoretical findings and propose our piecewise constantfunction prototype which works in a universal manner for kernel spectrum-based appli-cations, for the system model under consideration. The article closes with concludingremarks and envisioned extensions in Section 5. Notations : Boldface lowercase (uppercase) characters stand for vectors (matrices). Thenotation ( · ) T denotes the transpose operator. The norm k · k is the Euclidean norm forvectors and the operator norm for matrices, and we denote k A k ∞ = max i , j | A ij | as well as k · k F the Frobenius norm: k A k F = tr ( AA T ) . ξ is often used to denote standard Gaussianrandom variable, i.e., ξ ∼ N (
0, 1 ) . Let x , . . . , x n ∈ R p be n feature vectors drawn independently from the following two-class ( C and C ) mixture model: ( C : x = µ + ( I p + E ) z C : x = µ + ( I p + E ) z (1)each having cardinality n /2, for some deterministic µ a ∈ R p , E a ∈ R p × p , a =
1, 2 andrandom vector z ∈ R p having i.i.d. entries of zero mean, unit variance and bounded We restrict ourselves to binary classification for readability, but the presented results easily extend to amulti-class setting. µ a , E a is neither (asymptotically) too simplenor impossible to be extracted from the noisy features , we work (as in [CLM18]) underthe following assumption. Assumption 1 (Non-trivial classification) . As n → ∞ , we have for a ∈ {
1, 2 } (i) p / n = c → ¯ c ∈ ( ∞ ) ,(ii) k µ a k = O ( ) , k E a k = O ( p − ) , | tr ( E a ) | = O ( √ p ) and k E a k F = O ( √ p ) . Following [EK10a, CS13] we consider the following nonlinear random inner-productmatrix K = n δ i = j f ( x T i x j / √ p ) / √ p o ni , j = (2)for function f : R R satisfying some regularity conditions (detailed later in Assump-tion 2). As in [EK10a, CS13], the diagonal elements f ( x T i x i / √ p ) have been discarded.Indeed, under Assumption 1, x T i x i / √ p = O ( √ p ) which is an “improper scaling” for theevaluation by f (unlike x T i x j / √ p which properly scales as O ( ) for all i = j ).In the absence of discriminative information (null model), i.e., if µ a = and E a = for a =
1, 2, we write K = K N with [ K N ] ij = δ i = j f ( z T i z j / √ p ) / √ p . (3)Letting ξ p ≡ z T i z j / √ p , by the central limit theorem, ξ p L −→ N (
0, 1 ) as p → ∞ . As such,the [ K N ] ij , 1 ≤ i = j ≤ n , asymptotically behave like a family of dependent standardGaussian variables to which f is applied. In order to analyze the joint behavior of thisfamily, we shall exploit some useful concepts of the theory of orthogonal polynomialsand, in particular, of the class of Hermite polynomials defined with respect to the stan-dard Gaussian distribution [AS65, AAR00]. For a real probability measure µ , we denote the set of orthogonal polynomials with re-spect to the scalar product h f , g i = R f gd µ as { P l ( x ) , l =
0, 1, . . . } , obtained from theGram-Schmidt procedure on the monomials { x , x , . . . } such that P ( x ) = P l is ofdegree l and h P l , P l i = δ l − l . By the Riesz-Fisher theorem [Rud64, Theorem 11.43], forany function f ∈ L ( µ ) , the set of squared integrable functions with respect to h· , ·i , onecan formally expand f as f ( x ) ∼ ∞ ∑ l = a l P l ( x ) , a l = Z R f ( x ) P l ( x ) d µ ( x ) (4)where “ f ∼ ∑ ∞ l = P l ” indicates that k f − ∑ Nl = P l k → N → ∞ (and k f k = h f , f i ).To investigate the asymptotic behavior of K and K N as n , p → ∞ , we position our-selves under the following technical assumption. Assumption 2.
For each p, let ξ p = z T i z j / √ p and let { P l , p ( x ) , l ≥ } be the set of orthonormalpolynomials with respect to the probability measure µ p of ξ p . For f ∈ L ( µ p ) for each p, i.e.,f ( x ) ∼ ∞ ∑ l = a l , p P l , p ( x ) for a l , p defined in (4) , we demand that We refer the readers to Section A in Supplementary Material for a more detailed discussion on this point. Note that µ p is merely standard Gaussian in the large p limit. i) ∑ ∞ l = a l , p P l , p ( x ) µ p ( dx ) converges in L ( µ p ) to f ( x ) uniformly over large p, i.e., for any ǫ > there exists L such that for all p large, (cid:13)(cid:13)(cid:13) f − L ∑ l = a l , p P l , p (cid:13)(cid:13)(cid:13) L ( µ p ) = ∞ ∑ l = L + | a l , p | ≤ ǫ , (ii) as p → ∞ , ∑ ∞ l = | a l , p | → ν ∈ [ ∞ ) . Moreover, for l =
0, 1, 2 , a l , p converges and wedenote a , a and a their limits, respectively.(iii) a = . Since ξ p → N (
0, 1 ) , the limiting parameters a , a , a and ν are simply (generalized)moments of the standard Gaussian measure involving f . Precisely, a = E [ f ( ξ )] , a = E [ ξ f ( ξ )] , a = E [( ξ − ) f ( ξ )] √ = E [ ξ f ( ξ )] − a √ ν = Var [ f ( ξ )] ≥ a + a for ξ ∼ N (
0, 1 ) . These parameters are of crucial significance in determining the eigen-spectrum behavior of K . Note that a will not affect the classification performance, asdescribed below. Remark 1 (On a ) . In the present case of balanced mixtures (equal cardinalities for C and C ),a contributes to the polynomial expansion of K N (and K ) as a non-informative rank-1 pertur-bation of the form a ( n T n − I n ) / √ p. Since n is orthogonal to the “class-information vector” [ n /2 , − n /2 ] , its presence does not impact the classification performance. N It was shown in [CS13, DV13] that, for independent z i ’s with independent entries, the empirical spectral measure L n = n ∑ ni = δ λ i ( K N ) of the null model K N has an asymptoticallydeterministic behavior as n , p → ∞ with p / n → ¯ c ∈ ( ∞ ) . Theorem 1 ([CS13, DV13]) . Let p / n = c → ¯ c ∈ ( ∞ ) and Assumption 2 hold. Then,the empirical spectral measure L n of K N defined in (3) converges weakly and almost surely to aprobability measure L . The latter is uniquely defined through its Stieltjes transform m : C + → C + , z R ( t − z ) − L ( dt ) , given as the unique solution in C + of the (cubic) equation − m ( z ) = z + a m ( z ) c + a m ( z ) + ν − a c m ( z ) .Theorem 1 is “universal” with respect to the law of the (independent) entries of z i .While universality is classical in random matrix results, with mostly first and secondorder statistics involved, the present universality result is much less obvious since (i) thenonlinear application f ( x T i x j / √ p ) depends in an intricate manner on all moments of x T i x j and (ii) the entries of K N are strongly dependent. In essence, universality still holds herebecause the convergence speed to Gaussian of x T i x j / √ p is sufficiently fast to compensatethe residual impact of higher order moments in the spectrum of K N . If mixtures are unbalanced, the vector n may tend to “pull” eigenvectors aligned to [ n , − n ] , with n i the cardinality in C i , so away from purely noisy eigenvectors and thereby impacting classification perfor-mance. See [CBG16] for similar considerations. C + ≡ { z ∈ C , ℑ [ z ] > } . We also recall that, for m ( z ) the Stieltjes transform of a measure µ , µ can beobtained from m ( z ) via µ ([ a , b ]) = lim ǫ ↓ π R ba ℑ [ m ( x + ı ǫ )] dx for all a < b continuity points of µ .
4s an illustration, Figure 1a compares the empirical spectral measure of K N to thelimiting measure µ of Theorem 1. From a technical viewpoint, the objective of the article is to go beyond the null modeldescribed in Theorem 1 by providing a tractable random matrix equivalent ˜ K for the kernelmatrix K , in the sense that k K − ˜ K k → n , p → ∞ . This convergence allows one to identify the eigenvalues and isolated eigenvectors(that can be used for spectral clustering purpose) of K by means of those of ˜ K , see forinstance [HJ12, Corollary 4.3.15]. More importantly, while not visible from the expressionof K , the impact of the mixture model ( µ , µ , E , E ) on K is readily accessed from ˜ K and easily related to the Hermite coefficients ( a , a , ν ) of f . This allows us to furtherinvestigate how the choice of f impacts the asymptotically feasibility and efficiency ofspectral clustering from the top eigenvectors of K . (a) Eigenvalues of K N (b) Eigenvalues of K (c) Top eigenvectors of K N (top) and K (bot-tom) Figure 1: Eigenvalue distribution and top eigenvector of K N and K , together with thelimiting spectral measure L (from Theorem 1) in red; f ( x ) = sign ( x ) , Gaussian z i , n = p = µ = − [ p − ] = − µ and E = E = . X = [ x , . . . , x n ] ∈ R p × n with x , . . . , x n /2 ∈ C and x n /2 + , . . . , x n ∈ C . The main idea for the asymptotic analysis of K comes in two steps: (i) first, by an ex-pansion of x T i x j as a function of z i , z j and the statistical mixture model parameters µ , E ,we decompose x T i x j (under Assumption 1) into successive orders of magnitudes with re-spect to p ; this, as we will show, further allows for a Taylor expansion of f ( x T i x j / √ p ) forat least twice differentiable functions f around its dominant term f ( z T i z j / √ p ) . Then, (ii)we rely on the orthogonal polynomial approach of [CS13] to “linearize” the resulting ma-trix terms { f ( x T i x j / √ p ) } , { f ′ ( x T i x j / √ p ) } and { f ′′ ( x T i x j / √ p ) } (all terms correspondingto higher order derivatives asymptotically vanish) and use Assumption 2 to extend theresult to arbitrary square-summable f .Our main conclusion is that K asymptotically behaves like a matrix ˜ K following aso-called “spiked random matrix model” in the sense that ˜ K = K N + ˜ K I is the sum of thefull-rank “noise” matrix K N having compact limiting spectrum (the support of L ) and alow-rank “information” matrix ˜ K I [BAP05, BGN11]. For all figures in this article, the eigenvalues (that produce the empirical histograms) as well as theassociated eigenvectors are computed by MATLAB’s eig(s) function and correspond to a single realizationof the random kernel matrix K or K N . .1 Information-plus-noise decomposition of K We first show that K can be asymptotically approximated as K N + K I with K N definedin (3) and K I an additional (so far full-rank) term containing the statistical information ofthe mixture model.As announced, we start by decomposing x T i x j into a sequence of terms of successiveorders of magnitude using Assumption 1 and x i = µ a + ( I p + E a ) z i , x j = µ b + ( I p + E b ) z j for x i ∈ C a and x j ∈ C b . We have precisely, for i = j , x T i x j √ p = µ T a µ b √ p + √ p ( µ T a ( I p + E b ) z j + µ T b ( I p + E a ) z i ) + √ p z T i ( I p + E a ) ( I p + E b ) z j = z T i z j √ p | {z } O ( ) + z T i ( E a + E b ) z j √ p | {z } ≡ A ij = O ( p − ) + µ T a µ b + µ T a z j + µ T b z i √ p − z T i ( E a − E b ) z j √ p | {z } ≡ B ij = O ( p − ) + o ( p − ) (5)where in particular we performed a Taylor expansion of ( I p + E a ) (since k E a k = O ( p − ) )around I p , and used the fact that with high probability z T i E a z j = O ( p ) and z T i ( E a − E b ) z j = O ( ) .As a consequence of this expansion, for at least twice differentiable f ∈ L ( µ p ) , wehave K ij = f ( x T i x j / √ p ) √ p = f ( z T i z j / √ p ) √ p + f ′ ( z T i z j / √ p ) √ p ( A ij + B ij ) + f ′′ ( z T i z j / √ p ) √ p A ij + o ( p − ) where o ( p − ) is understood with high probability and uniformly over i , j ∈ {
1, . . . , n } .This entry-wise expansion up to order o ( p − ) is sufficient since, matrix-wise , if A ij = o ( p − ) uniformly on i , j , from k A k ≤ p k A k ∞ = p max i , j | A ij | , we have k A k = o ( ) as n , p → ∞ .In the particular case where f is a monomial of degree k ≥
2, this implies the follow-ing result.
Proposition 1 (Monomial f ) . Under Assumptions 1–2, let f ( x ) = x k , k ≥ . Then, asn , p → ∞ , k K − ( K N + K I ) k → almost surely, with K N defined in (3) and K I = k √ p ( Z T Z / √ p ) ◦ ( k − ) ◦ ( A + B ) + k ( k − ) √ p ( Z T Z / √ p ) ◦ ( k − ) ◦ ( A ) ◦ (7) for Z = [ z , . . . , z n ] ∈ R p × n and A , B ∈ R n × n defined in (5) with A ii = B ii = . Here X ◦ Y denotes the Hadamard product between X , Y and X ◦ k the k-th Hadamard power, i.e., [ X ◦ k ] ij =( X ij ) k . Since f ∈ L ( µ ) can be decomposed into its Hermite polynomials, Proposition 1 alongwith Theorem 1 allows for an asymptotic quantification of K . However, the expressionof K I in (7) does not so far allow for a thorough understanding of the spectrum of K , dueto (i) the delicate Hadamard products between purely random ( Z T Z ) and informativematrices ( A , B ) and (ii) the fact that K I is full rank (so that the resulting spectral propertiesof K N + K I remains intractable). We next show that, as n , p → ∞ , K I admits a tractablelow-rank approximation ˜ K I , thereby leading to a spiked-model approximation of K .6 .2 Spiked-model approximation of K Let us then consider K I defined in (7), the ( i , j ) entry of which can be written as the sumof terms containing µ a , µ b (treated separately) and random variables of the type φ = C √ p ( x T y / √ p ) α ( x T Fy ) β for independent random vectors x , y ∈ R p with i.i.d. zero mean, unit variance and finitemoments (uniformly on p ) entries, deterministic F ∈ R p × p , C ∈ R , α ∈ N and β ∈ {
1, 2 } .For Gaussian x , y , the expectation of φ can be explicitly computed via an integral trick[Wil97, LLC18]. For more generic x , y with i.i.d. bounded moment entries, a combinato-rial argument controls the higher order moments of the expansion which asymptoticallyresult in (matrix-wise) vanishing terms. See Sections B–C of the supplementary material.This leads to the following result. Proposition 2 (Low rank asymptotics of K I ) . Under Assumptions 1–2, for f ( x ) = x k , k ≥ , k K I − ˜ K I k → almost surely as n , p → ∞ , for K I defined in (7) and ˜ K I = ( k !! p ( JM T MJ T + JM T Z + Z T MJ T ) , for k odd k ( k − ) !!2 p J ( T + S ) J T , for k even (8) where M = [ µ , µ ] ∈ R p × , T = { tr ( E a + E b ) / √ p } a , b = , S = { tr ( E a E b ) / √ p } a , b = ∈ R × and J = [ j , j ] ∈ R n × with j a ∈ R n the canonical vector of class C a , i.e., [ j a ] i = δ x i ∈C a . We refer the readers to Section C of the supplementary material for a detailed exposi-tion of the proof.Proposition 2 states that K I is asymptotically equivalent to ˜ K I that is of rank at mosttwo. Note that the eigenvectors of ˜ K I are linear combinations of the vectors j , j andthus provide the data classes.From the expression of ˜ K I , quite surprisingly, it appears that for f ( x ) = x k , depend-ing on whether k is odd or even, either only the information in means ( M ) or only incovariance ( T and S ) can be (asymptotically) preserved.By merely combining the results of Propositions 1–2, the latter can be easily extendedto polynomial f . Then, by considering f ( x ) = P κ ( x ) , the Hermite polynomial of degree κ , it can be shown that, quite surprisingly, one has ˜ K I = if κ > P , P , . . . ofan arbitrary f ∈ L ( µ ) satisfying Assumption 2 leads to a very simple expression of ourmain result. For mental reminder, M stands for means , T accounts for the difference in traces of covariance matricesand S for the “shapes” of the covariances. Note that, as defined, ˜ K I has non-zero diagonal elements, while [ K I ] ii =
0. This is not contradictory asthe diagonal matrix diag ( ˜ K I ) has vanishing norm and can thus be added without altering the approximation k K I − ˜ K I k →
0; it however appears convenient as it ensures that ˜ K I is low rank (while without its diagonal,˜ K I is full rank). heorem 2 (Spiked-model approximation of K ) . For an arbitrary f ∈ L ( µ ) with f ∼ ∑ ∞ l = a l P l ( x ) , under Assumptions 1–2, k K − ˜ K k →
0, ˜ K = K N + ˜ K I with K N defined in (3) and ˜ K I = a p ( JM T MJ T + JM T Z + Z T MJ T ) + a p J ( T + S ) J T . (9)The detailed proof of Theorem 2 is provided in Section D of the supplementary mate-rial. − − (a) N : eigs of K for P − − (b) N : eigs of ˜ K for P − − − − − − (c) N : eigs of K for P − − − − − − (d) N : eigs of ˜ K for P − − (e) Stud: eigs of K for P − − (f) Stud: eigs of ˜ K for P − − (g) Stud: eigs of K for P − − (h) Stud: eigs of ˜ K for P Figure 2: Eigenvalue distributions of K and ˜ K from Theorem 2 (blue) and L from The-orem 1 (red), for z i with Gaussian (top) or Student-t with degree of freedom 7 (bottom)entries; functions f ( x ) = P ( x ) = x , f ( x ) = P ( x ) = ( x − ) √ f ( x ) = P ( x ) =( x − x ) / √ n = p = µ = − [ p − ] = − µ and E = − I p / √ p = − E .Figure 2 compares the spectra of K and ˜ K for random vectors with independentGaussian or Student-t entries, for the first three (normalized) Hermite polynomials P ( x ) , P ( x ) and P ( x ) . These numerical evidences validate Theorem 2: only for P ( x ) and P ( x ) is an isolated eigenvalue observed. Besides, as shown in the bottom display of Figure 1c,the corresponding eigenvector is, as expected, a noisy version of linear combinations of j , j . Remark 2 (Even and odd f ) . While rank ( ˜ K I ) ≤ (as the sum of two rank-two terms), inFigure 2 no more than two isolated eigenvalues are observed (for f = P only one on the rightside, for f = P one on each side). This follows from a = when f = P and a = forf = P . More generally, for f odd ( f ( − x ) = − f ( x ) ), a = and the statistical informationon covariances (through E ) asymptotically vanishes in K ; for f even ( f ( − x ) = f ( x ) ), a = and information about the means µ , µ vanishes. Thus, only f neither odd nor even can preserveboth first and second order discriminating statistics (e.g., the popular ReLU function f ( x ) = max ( x ) ). This was previously remarked in [LC18] based on a local expansion of smooth f in asimilar setting. K for piecewise constantand cubic f , in the setting of Figure 2 and 4. f Size (Mb) Running time (s)Piecewise 4.15 0.2390Cubic 16.75 0.4244 Figure 3: Piecewise constant (green)versus cubic (blue) function with equal ( a , a , ν ) . As a direct consequence of Theorem 2, the performance of spectral clustering for largedimensional mixture models of the type (1) only depends on the three parameters of thenonlinear function f : a = E [ ξ f ( ξ )] , a = E [ ξ f ( ξ )] / √ ν = E [ f ( ξ )] . The parame-ters a , ν determine the limiting spectral measure L of K (since K and K N asymptoticallydiffer by a rank-4 matrix, they share the same limiting spectral measure) while a , a de-termine the low rank structure within ˜ K I .As an immediate consequence, arbitrary (square-summable) kernel functions f (with a =
0) are asymptotically equivalent to the simple cubic function ˜ f ( x ) = c x + c x + c x − c having the same Hermite polynomial coefficients a , a , ν . The idea of this section is to design a prototypical family F of functions f having(i) universal properties with respect to ( a , a , ν ) , i.e., for each ( a , a , ν ) there exists f ∈F with these Hermite coefficients and (ii) having numerically advantageous properties.Thus, any arbitrary kernel function f can be mapped, through ( a , a , ν ) , to a function in F with good numerical properties.One such prototypical family F can be the set of f , parametrized by ( t , s − , s + ) , anddefined as f ( x ) = − rt x ≤ √ s − √ s − < x ≤ √ s + t x > √ s + with a = t √ π ( e − s + + re − s − ) a = t √ π ( s + e − s + + rs − e − s − ) ν = t ( − erf ( s + )) ( + r ) (10)where r ≡ − erf ( s + ) + erf ( s − ) . Figure 3 displays f given in (10) together with the cubic function c x + c ( x − ) + c x sharing the same Hermite coefficients ( a , a , ν ) .The class of equivalence of kernel functions induced by this mapping is quite unlikethat raised in [EK10b] or [CBG16] in the “improper” scaling f ( x T i x j / p ) regime. While inthe latter, functions f ( x ) of the same class of equivalence are those having common f ′ ( ) and f ′′ ( ) values, in the present case, these functions may have no similar local behavior(as shown in the example of Figure 3).For the piecewise constant function defined in (10) and the associated cubic functionhaving the same ( a , a , ν ) , a close match is observed for both eigenvalues and top eigen-vectors of K in Figure 4, with gains in both storage size and computational time displayedin Table 1. The coefficients being related through a = c + c , a = √ c and ν = ( c + c ) + c + c . − (a) Piecewise constant − − (b) Cubic polynomial − · − (c) Cubic (blue) versus piecewise (green) func-tion Figure 4: Eigenvalue distribution and top eigenvectors of K for the piecewise constantfunction (in green) and the associated cubic function (in blue) with the same ( a , a , ν ) ,performed on Bernoulli distribution with zero mean and unit variance, in the setting ofFigure 2. We have shown that inner-product kernel matrices √ p K = f ( x T i x j / √ p ) with x i = µ a +( I p + E a ) z i , a ∈ {
1, 2 } , asymptotically behave as a spiked random matrix model whichspectrally only depends on three defining parameters of f . Turning I p into a generic C covariance is more technically challenging, breaking most of the orthogonality propertiesof the orthogonal polynomial approach of the proofs, but a needed extension of the result.Interestingly, this study can be compared to analyses in neural networks (see e.g.,[PW17, BP19]) where it has been shown that in the case of sub-Gaussian entries forboth random layer W and input data X the (limiting) spectrum of the Gram matrix f ( WX ) f ( WX ) T ( f understood entry-wise) is uniquely determined by the same ( a , ν ) coefficients. Our results may then be adapted to an improved understanding of classifi-cation in random neural networks. References [AAR00] George E Andrews, Richard Askey, and Ranjan Roy.
Special functions , volume 71.Cambridge university press, 2000.[ABDH +
18] Hassan Ashtiani, Shai Ben-David, Nicholas Harvey, Christopher Liaw, AbbasMehrabian, and Yaniv Plan. Nearly tight sample complexity bounds for learningmixtures of gaussians via sample compression schemes. In
Advances in Neural Infor-mation Processing Systems , pages 3412–3421, 2018.[AS65] Milton Abramowitz and Irene A Stegun.
Handbook of mathematical functions: withformulas, graphs, and mathematical tables , volume 55. Courier Corporation, 1965.[BAP05] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largesteigenvalue for nonnull complex sample covariance matrices.
The Annals of Probabil-ity , 33(5):1643–1697, 2005.[BGN11] Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigenvec-tors of finite, low rank perturbations of large random matrices.
Advances in Mathe-matics , 227(1):494–521, 2011.[Bil12] Patrick Billingsley.
Probability and measure . John Wiley & Sons, third edition, 2012. BP19] Lucas Benigni and Sandrine Péché. Eigenvalue distribution of nonlinear models ofrandom matrices. arXiv preprint arXiv:1904.03090 , 2019.[CBG16] Romain Couillet and Florent Benaych-Georges. Kernel spectral clustering of largedimensional data.
Electronic Journal of Statistics , 10(1):1393–1454, 2016.[CLM18] Romain Couillet, Zhenyu Liao, and Xiaoyi Mai. Classification asymptotics in therandom matrix regime. In , pages 1875–1879. IEEE, 2018.[CS13] Xiuyuan Cheng and Amit Singer. The spectrum of random inner-product kernelmatrices.
Random Matrices: Theory and Applications , 2(04):1350010, 2013.[DV13] Yen Do and Van Vu. The spectrum of random kernel matrices: universality re-sults for rough and varying kernels.
Random Matrices: Theory and Applications ,2(03):1350005, 2013.[EK10a] Noureddine El Karoui. On information plus noise kernel random matrices.
TheAnnals of Statistics , 38(5):3191–3216, 2010.[EK10b] Noureddine El Karoui. The spectrum of kernel random matrices.
The Annals ofStatistics , 38(1):1–50, 2010.[FM19] Zhou Fan and Andrea Montanari. The spectral norm of random inner-product ker-nel matrices.
Probability Theory and Related Fields , 173(1-2):27–85, 2019.[HJ12] Roger A. Horn and Charles R. Johnson.
Matrix Analysis . Cambridge UniversityPress, second edition, 2012.[JL09] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal com-ponents analysis in high dimensions.
Journal of the American Statistical Association ,104(486):682–693, 2009.[KMV16] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Disentangling gaussians.
Communications of the ACM , 55, 2016.[LC18] Zhenyu Liao and Romain Couillet. On the spectrum of random features maps ofhigh dimensional data. In
Proceedings of the 35th International Conference on MachineLearning , volume 80, pages 3063–3071. PMLR, 2018.[LC19] Zhenyu Liao and Romain Couillet. A large dimensional analysis of least squaressupport vector machines.
IEEE Transactions on Signal Processing , 67(4):1065–1074,2019.[LLC18] Cosme Louart, Zhenyu Liao, and Romain Couillet. A random matrix approach toneural networks.
The Annals of Applied Probability , 28(2):1190–1248, 2018.[NJW02] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysisand an algorithm. In
Advances in neural information processing systems , pages 849–856,2002.[PW17] Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deeplearning. In
Advances in Neural Information Processing Systems , pages 2637–2646, 2017.[Rud64] Walter Rudin.
Principles of mathematical analysis , volume 3. McGraw-hill New York,1964.[Wil97] Christopher KI Williams. Computing with infinite networks. In
Advances in neuralinformation processing systems , pages 295–301, 1997. The non-trivial classification regime
In the ideal case where µ , µ and E , E are perfectly known, the (decision optimal) Neyman-Pearson test to decide on the class of an unknown and normally distributed x , genuinely belong-ing to C , consists in the following comparison ( x − µ ) T ( I p + E ) − ( x − µ ) − ( x − µ ) T ( I p + E ) − ( x − µ ) C ≷ C log det ( I p + E ) det ( I p + E ) .Writing x = µ + ( I p + E ) z so that z ∼ N ( , I p ) , the above test is then equivalent to T ( x ) = p z T (cid:16) ( I p + E ) ( I p + E ) − ( I p + E ) − I p (cid:17) z + p ∆ µ T ( I p + E ) − ( I p + E ) z + p ∆ µ T ( I p + E ) − ∆ µ − p log det ( I p + E ) det ( I p + E ) C ≷ C ∆ µ ≡ µ − µ . Since, for U ∈ R p × p an eigenvector basis for ( I p + E ) ( I p + E ) − ( I p + E ) − I p , Uz has the same distribution as z , with a careful application of the Lyapunov’s centrallimit theorem (see for example [Bil12, Theorem 27.3]), along with the assumption k E a k = o ( ) for a ∈ {
1, 2 } , we obtain V − T ( T ( x ) − E T ) L −→ N (
0, 1 ) as p → ∞ , with E T ≡ p tr (cid:16) ( I p + E )( I p + E ) − (cid:17) − + p ∆ µ T ( I p + E ) − ∆ µ − p log det ( I p + E ) det ( I p + E ) V T ≡ p k ( I p + E ) ( I p + E ) − ( I p + E ) − I p k F + p ∆ µ T ( I p + E ) − ( I p + E )( I p + E ) − ∆ µ .The classification error rate is thus non-trivial (i.e., converging neither to 0 not 1 as p → ∞ ) if E T and √ V T are of the same order of magnitude (with respect to p ).In the case where E = E = E , E T = p ∆ µ T ( I p + E ) − ∆ µ = O ( k ∆ µ k p − ) , p V T = p q ∆ µ T ( I p + E ) − ∆ µ = O ( k ∆ µ k p − ) so that we must as least demand k ∆ µ k ≥ O ( ) (which, up to centering, is equivalent to asking k µ a k ≥ O ( ) for a ∈ {
1, 2 } ).Under the critical condition k ∆ µ k = O ( ) , we move on to the study of the condition on thecovariance E a . To this end, a Taylor expansion can be performed for I p + E around I p + E sothat E T = p ∆ µ T ( I p + E ) − ∆ µ + p k ( I p + E ) − ∆ E k F + o ( p − ) V T = p k ( I p + E ) − ∆ E k F + p ∆ µ T ( I p + E ) − ∆ µ + o ( p − ) .with ∆ E ≡ E − E . Thus one must have k ∆ E k ≥ O ( p − ) for k ( I p + E ) − ∆ E k F not to vanish for p large and for ∆ E to have discriminating power. It is thus convenient to request k E a k ≥ O ( p − ) for a ∈ {
1, 2 } , which unfolds from | tr E a | ≥ O ( √ p ) , or k E a k F ≥ O ( ) .Yet, as noticed in [CLM18], many classification algorithms, either supervised, semi-supervisedor unsupervised, are not able to achieve the optimal rate k E a k F = O ( ) when n , p are of the sameorder of magnitude. Indeed, the best possible rate k E a k F = O ( √ p ) can only be obtained in veryparticular cases, for instance if | tr E a | = o ( √ p ) and k µ a k = o ( ) as investigated in [LC19]. Thisthus leads to the non-trivial classification condition demanded in Assumption 1. Exact computation of φ in the Gaussian case In the Gaussian case where z ∼ N ( , I p ) we resort to computing, as in [Wil97, LLC18], the integral E z [( z T a ) k ( z T b ) k ] = ( π ) − p /2 Z R p ( z T a ) k ( z T b ) k e −k z k /2 d z = π Z R ( ˜ z ˜ a ) k ( ˜ z ˜ b + ˜ z ˜ b ) k e − ( ˜ z + ˜ z ) /2 d ˜ z d ˜ z = π Z R ( ˜ z T ˜ a ) k ( ˜ z T ˜ b ) k e −k ˜ z k /2 d ˜ z where we apply the Gram-Schmidt procedure to project z into the two-dimensional space spannedby a , b with ˜ a = k a k , ˜ b = a T b k a k , ˜ b = r k b k − ( a T b ) k a k and denote ˜ z = [ ˜ z ; ˜ z ] , ˜ a = [ ˜ a ; 0 ] and˜ b = [ ˜ b ; ˜ b ] . As a consequence, we obtain, for k even, E h ( z T i z j / √ p ) k i = E [ ξ k ] = ( k − ) !!; E z i h ( z T i z j / √ p ) k ( z T i b ) i = E z i h ( z T i z j / √ p ) k − ( z T i b ) i = E z i h ( z T i z j / √ p ) k − ( z T i b ) i = ( k − ) !! ( k z j k / √ p ) k − ( z T j b ) / √ p ; E z i h ( z T i z j / √ p ) k ( z T i b ) i = ( k − ) !! (cid:16) k ( k z j k / √ p ) k − ( z T j b / √ p ) + ( k z j k / √ p ) k k b k (cid:17) ;where we denote k !! the double factorial of an integral k such that k !! = k ( k − )( k − ) · · · . Thisfuther leads to, in the Gaussian case, the expression of ˜ K I in Proposition 2. C Proof of Proposition 2
Define by L the matrix with L ij ≡ [ p ( JM T MJ T + JM T Z + Z T MJ T )] ij for i = j and L ii =
0. Then K I can be written as K I = k ( Z T Z / √ p ) ◦ ( k − ) ◦ L + Φ , Φ ij ≡ kp ( z T i z j / √ p ) k − z T i (cid:18) ( E a + E b ) − ( E a − E b ) (cid:19) z j + k ( k − ) p ( z T i z j / √ p ) k − √ p ( z T i ( E a + E b ) z j ) for i = j and Φ ii =
0. With this expression, the proof of Proposition 2 can be divided into threesteps.
Concentration of Φ . We first show that, k Φ − E [ Φ ] k → n , p → ∞ . Thisfollows from the observation that Φ is a p − rescaling (since k E a k = O ( p − ) ) of the nullmodel K N , which concentrates around its expectation in the sense that k K N − E [ K N ] k = O ( ) for E [ K N ] = O ( √ p ) if a = O ( √ p ) discarded (arising from E [ K N ] ), K N is of bounded operatornorm for all large n , p with probability one; this, together with the fact that k E [ Φ ] k = O ( ) thatwill be shown subsequently, allows us to conclude that k Φ − E [ Φ ] k → n , p → ∞ . Computation of E [ Φ ] : beyond the Gaussian case. We then show that, for random vectors z with zero mean, unit variance and bounded moments entries, the expression of E [ Φ ] coincideswith the Gaussian case. To this end, recall that the entries of Φ are the sum of random variablesof the type φ = C √ p ( x T y / √ p ) α ( x T Fy ) β By assuming first that a , b are linearly independent before extending by continuity to a , b proportional. or independent random vectors x , y ∈ R p with i.i.d. zero mean, unit variance and finite moments(uniformly on p ) entries, deterministic F ∈ R p × p , C ∈ R , α ∈ N and β ∈ {
1, 2 } . Let us start withthe case β = φ as φ = C √ p √ p p ∑ i = x i y i ! . . . √ p p ∑ i α = x i α y i α ! p ∑ j , j = F j , j x j y j ! (11)with x i and y i the i -th entry of x and y , respectively, so that (i) x i is independent of y j for all i , j and (ii) x i is independent of x j for i = j with E [ x i ] = E [ x i ] = E [ | x i | k ] ≤ C k for some C k independent of p (and similarly for y ).At this point, note that to ensure E [ K I ] has non-vanishing operator norm as n , p → ∞ , weneed E [ φ ] ≥ O ( p − ) since k A k ≤ p k A k ∞ for A ∈ R p × p . Also, note that (as β = ∑ pj , j = F j , j x j y j with j = j must be zero since in other terms x i always appearstogether with y i , so that all terms with j = j give rise to zero in expectation. Hence, the p terms of the sum only contain p nonzero terms in expectation (those with j = j ). The arbitrary(absolute) moments of x and y being finite, the first α p terms must be divided into ⌈ α ⌉ /2 groupsof size two (containing O ( p ) terms) so that, with the normalization by p − for each group of sizetwo, the associated expectation is not vanishing. We shall thus discuss the following two cases:1. α even: the α terms in the sum form α /2 groups with different indices each and also differ-ent from j = j . Therefore we have E x j [ φ ] = E [ φ ] = α odd: the α terms in the sum form ( α − ) /2 groups with indices different from each otherand the remaining one goes with the last term containing F and one has E [ φ ] = C α !! p tr ( F ) by a simple combinatorial argument.The case β = j may not equal j to giverise to non-vanishing terms. Concentration of Hadamard product.
It now remains to treat the term k ( Z T Z / √ p ) ◦ ( k − ) ◦ L and show it also has an asymptotically deterministic behavior (as Φ ). It can be shown that k N ◦ L k → n , p → ∞ with N = ( Z T Z / √ p ) ⊙ ( k − ) − ( k − ) !! n T n for k odd and N = ( Z T Z / √ p ) ⊙ ( k − ) for k even.To prove this, note that, depending on the key parameter a = E [ f ( ξ )] , the operator normof f ( Z T Z / √ p ) / √ p is either of order O ( √ p ) for a = O ( ) for a =
0. In particular, formonomial f ( x ) = x k under study here, we have a = E [ ξ k ] = k odd and a = E [ ξ k ] =( k − ) !! = k even, ξ ∼ N (
0, 1 ) . To control the operator norm of the Hadamard productbetween matrices, we introduce the following lemma. Lemma 1.
For A , B ∈ R p × p , we have k A ◦ B k ≤ √ p k A k ∞ k B k .Proof of Lemma 1. Let e , . . . , e p be the canonical basis vectors of R p , then for all 1 ≤ i ≤ p , k ( A ◦ B ) e i k ≤ max i , j | A ij |k Be i k = k A k ∞ k Be i k ≤ k A k ∞ k B k .As a consequence, for any v = ∑ pi = v i e i , we obtain k ( A ◦ B ) v k ≤ p ∑ i = | v i |k ( A ◦ B ) e i k ≤ p ∑ i = | v i |k A k ∞ k B k which, by Cauchy-Schwarz inequality further yields ∑ pi = | v i | ≤ √ p k v k . This concludes the proofof Lemma 1.Lemma 1 tells us that the Hadamard product between a matrix with o ( p − ) entry and amatrix with bounded operator norm is of vanishing operator norm, as p → ∞ . As such, since k N k = O ( ) and L has O ( p − ) entries, we have k N ◦ L k →
0. This concludes the proof ofProposition 2. Proof of Theorem 2
The proof follows from the fact that the individual coefficients of the Hermite polynomials P κ ( x ) = ∑ κ l = c κ , l x l satisfy the following recurrent relation [AS65] c κ + l = ( − κ c κ − l l = c κ , l − − κ c κ − l l ≥
1; (12)with c = c = c =
1. As a consequence, by indexing the informative matrix inProposition 2 of the monomial f ( x ) = x l as ˜ K I , l , we have for odd κ ≥ K I = κ ∑ l = c κ , l ˜ K I , l = κ ∑ l = c κ , l l !! ( JM T MJ T + JM T Z + Z T MJ T ) / p − diag ( · ) = with [ X − diag ( · )] ij = X ij δ i = j . This follows from the fact that, for κ ≥
3, we have both ∑ κ l = c κ , l l !! = ∑ κ + l = c κ + l ( l + ) !! =
0. The latter is proved by induction on κ : first, for κ =
3, wehave c + c = c + c + c =
0; then, assuming κ odd, we have ∑ κ l = c κ , l l !! = ∑ κ + l = c κ + l ( l + ) !! = κ + ∑ l = c κ + l l !! = κ + ∑ l = ( c κ + l − − ( κ + ) c κ , l ) l !! = κ + ∑ l = c κ + l − l !! = κ + ∑ l = c κ + l ( l + ) !! = κ + ∑ l = c κ + l ( l + ) !! = − ( κ + ) c κ + + κ + ∑ l = ( c κ + l − − ( κ + ) c κ + l )( l + ) !! = c κ , l = l ≥ κ +
1. Similar arguments hold for the case of κ even, whichconcludes the proof.even, whichconcludes the proof.