Faster Kernel Matrix Algebra via Density Estimation
Arturs Backurs∗   Piotr Indyk†   Cameron Musco‡   Tal Wagner§

Abstract
We study fast algorithms for computing fundamental properties of a positive semidefinite kernel matrix K ∈ R^{n×n} corresponding to n points x_1, …, x_n ∈ R^d. In particular, we consider estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector. We show that the sum of matrix entries can be estimated to 1+ε relative error in time sublinear in n and linear in d for many popular kernels, including the Gaussian, exponential, and rational quadratic kernels. For these kernels, we also show that the top eigenvalue (and an approximate eigenvector) can be approximated to 1+ε relative error in time subquadratic in n and linear in d. Our algorithms represent significant advances in the best known runtimes for these problems. They leverage the positive definiteness of the kernel matrix, along with a recent line of work on efficient kernel density estimation.

∗ Toyota Technological Institute at Chicago. [email protected]
† Massachusetts Institute of Technology. [email protected]
‡ University of Massachusetts Amherst. [email protected]
§ Microsoft Research. [email protected]
This should be contrasted with distance functions, which measure the dissimilarity between two vectors.

Introduction

Kernels are a ubiquitous notion in statistics, machine learning, and other fields. A kernel is a function k : R^d × R^d → R that measures the similarity between two d-dimensional vectors. Many statistical and machine learning methods, such as support vector machines, kernel ridge regression, and kernel density estimation, rely on appropriate choices of kernels. A prominent example of a kernel function is the Radial Basis Function, a.k.a. Gaussian kernel, defined as

k(x, y) = exp(−‖x − y‖²).

Other popular choices include the Laplace kernel, exponential kernel, etc. See [Shawe-Taylor et al., 2004, Hofmann et al., 2008] for an overview.

Kernel methods typically operate using a kernel matrix. Given n vectors x_1, …, x_n ∈ R^d, the kernel matrix K ∈ R^{n×n} is defined as K_{i,j} = k(x_i, x_j). For most popular kernels, k is a positive definite function, and so K is positive semidefinite (PSD). Further, it is often the case that K's entries are in the range [0, 1], with 1's on the diagonal – we assume this throughout.

Although popular, the main drawback of kernel methods is their efficiency. Most kernel-based algorithms have running times that are at least quadratic in n; in fact, many start by explicitly materializing the kernel matrix K. This quadratic runtime is likely necessary as long as exact (or high-precision) answers are desired. Consider perhaps the simplest kernel problem, where the goal is to compute the sum of matrix entries, i.e., s(K) = Σ_{i,j} K_{i,j}. It was shown in Backurs et al. [2017] that, for the Gaussian kernel, computing s(K) up to 1+ε relative error requires n^{2−o(1)} time under the Strong Exponential Time Hypothesis (SETH), as long as the dimension d is at least polylogarithmic in n and ε = exp(−ω(log n)). The same limitations were shown to apply to kernel support vector machines, kernel ridge regression, and other kernel problems.

Fortunately, the aforementioned lower bound does not preclude faster algorithms for larger values of ε (say, ε = Θ(1)). Over the last decade many such algorithms have been proposed.
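To fix notation in code, here is a minimal sketch (our illustration, not code from the paper) that forms the Gaussian kernel matrix explicitly and computes s(K) by brute force. This O(n²d)-time, O(n²)-space baseline is what the algorithms discussed below aim to beat; the function names and toy data are placeholders.

```python
import numpy as np

def gaussian_kernel_matrix(X):
    """Form the n x n Gaussian kernel matrix with K[i, j] = exp(-||x_i - x_j||^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists)

def kernel_sum_exact(X):
    """Brute-force baseline: compute s(K) = sum_{i,j} K[i, j] in O(n^2 d) time and O(n^2) memory."""
    return float(gaussian_kernel_matrix(X).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))  # n = 500 points in d = 10 dimensions
    print(kernel_sum_exact(X))
```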
In our context, the most relevant ones are those solving the problem of kernel density evaluation [Charikar and Siminelakis, 2017, Backurs et al., 2018, 2019, Siminelakis et al., 2019, Charikar et al., 2020]. Here, we are given two sets of vectors X = {x_1, …, x_m} and Y = {y_1, …, y_m}, and the goal is to compute the values k(y_i) = (1/m) · Σ_j k(x_j, y_i) for i = 1, …, m. For the Gaussian kernel, the best known algorithm, due to Charikar et al. [2020], estimates these values to 1+ε relative error in time O(dm/(µ^c ε²)) for an absolute constant c < 1, where µ is a lower bound on the density values k(y_i).

Applying this algorithm directly to approximating the Gaussian kernel sum yields a runtime of roughly O(dn^{1+c}/ε²), since we can set µ = 1/n and still achieve a (1+ε) approximation, as k(x_i, x_i) = 1 for all x_i. It is a natural question whether this bound can be improved, and whether progress on fast kernel density estimation can be extended to other fundamental kernel matrix problems.

Our results
In this paper we give much faster algorithms for approximating two fundamental quantities: the kernel matrix sum and the kernel matrix top eigenvector/eigenvalue. Consider a kernel matrix K induced by n points x_1, …, x_n ∈ R^d and a kernel function with values in [0, 1] such that the matrix K is PSD and has 1's on the diagonal. Furthermore, suppose that the kernel k is supported by a kernel density evaluation algorithm with running time of the form O(dm/(µ^p ε²)) for m points, relative error (1+ε), and density lower bound µ. Then we give:

1. An algorithm for (1+ε)-approximating s(K) in time O(dn^{(1+2p)/(2+2p)} · log n/ε^{O(1)}). For many popular kernels this runtime is sublinear in n – see Table 1. Our algorithm is based on subsampling O(√n) points from x_1, …, x_n and then applying fast kernel density evaluation to these points. We complement the algorithm with a (very simple) lower bound showing that sampling Ω(√n) points is necessary to estimate s(K) up to a constant factor. This shows that our sampling complexity is optimal.

2. An algorithm that returns an approximate top eigenvector z ∈ R^n with ‖z‖₂ = 1 and zᵀKz ≥ (1−ε) · λ₁(K), where λ₁(K) is K's top eigenvalue, running in time O(dn^{1+p} · (log(n/ε)/ε)^{O(1)}). For many popular kernels, this runtime is subquadratic in n – see Table 1. This is the first subquadratic time algorithm for top eigenvalue approximation and a major improvement over forming the full kernel matrix. By a simple argument, Ω(dn) time is necessary even for constant factor approximation, and thus our algorithm is within an Õ(n^p) factor of optimal. Our algorithm is also simple and practical, significantly outperforming baseline methods empirically – see Section 5.

Kernel k(x, y)                        KDE algorithm                                KDE runtime        Kernel Sum              Top Eigenvector
Gaussian e^{−‖x−y‖²}                  Charikar et al. [2020]                       dm/µ^{0.··}        Õ(dn^{0.··})            Õ(dn^{1.··})
Gaussian                              Greengard and Strain [1991] (see appendix)   m·log(m)^{O(d)}    n^{0.5}·log(n)^{O(d)}   n·log(n)^{O(d)}
Exponential e^{−‖x−y‖}                Charikar et al. [2020]                       dm/µ^{0.··}        Õ(dn^{0.··})            Õ(dn^{1.··})
Rational quadratic 1/(1+‖x−y‖²)^β     Backurs et al. [2018]                        dm                 Õ(dn^{0.5})             Õ(dn)
All of the above (lower bound)        -                                            -                  Ω(dn^{0.5})             Ω(dn)

Table 1: Instantiations of our main results, giving sublinear time kernel sum approximation and subquadratic time top eigenvector approximation. All running times are up to polylogarithmic factors, assuming constant accuracy parameter ε and kernel parameter β. The KDE runtime depends on m, the number of query points, and µ, a lower bound on the density for each query point.

Application
An immediate application of our kernel sum algorithm is a faster algorithm for estimating the kernel alignment [Cristianini et al., 2002], a popular measure of similarity between kernel matrices. Given K and K′, the alignment between K and K′ is defined as

Â(K, K′) = ⟨K, K′⟩ / √(⟨K, K⟩ · ⟨K′, K′⟩),

where ⟨K, K′⟩ = Σ_{i,j} K_{i,j} K′_{i,j} is the inner product between the matrices K and K′ interpreted as vectors. Our algorithm yields an efficient method for estimating Â(K, K′) as long as the product kernels K∘K, K′∘K′ and K∘K′ are supported by fast kernel density evaluation algorithms as described earlier. This is the case for, e.g., the Gaussian or Laplace kernels.
The problem of evaluating kernel densities, especially for the Gaussian kernel, has been studied extensively. In addition to the recent randomized algorithms discussed in the introduction, there has been a considerable amount of work on algorithms in low dimensional spaces, including Greengard and Strain [1991], Yang et al. [2003], Lee et al. [2006], Lee and Gray [2009], March et al. [2015]. We present a streamlined version of the Fast Gauss Transform algorithm of Greengard and Strain [1991] in the appendix. In addition, there has been a considerable effort designing core-sets for this problem [Phillips and Tai, 2018a,b].

The sum of kernel values can be viewed as a similarity analog of the sum of pairwise distances in metric spaces. The latter quantity can be approximated in time linear in the number n of points in the metric space [Indyk, 1999, Chechik et al., 2015, Cohen et al., 2018]. Note that it is not possible to achieve an o(n)-time algorithm for this problem, as a single point can greatly affect the overall value. To the best of our knowledge, our algorithms for the kernel matrix sum are the first that achieve sublinear in n running time and give O(1)-approximation for a nontrivial kernel problem.

Computing the top eigenvectors of a kernel matrix is a central problem – it is the primary operation behind kernel principal component analysis [Schölkopf et al., 1997]. Projection onto these eigenvectors also yields an optimal low-rank approximation to the kernel matrix, which can be used to speed up kernel learning methods [Williams and Seeger, 2001, Fine and Scheinberg, 2001]. Significant work has focused on fast low-rank approximation algorithms for kernel matrices or related distance matrices [Musco and Musco, 2017, Musco and Woodruff, 2017, Bakshi and Woodruff, 2018, Indyk et al., 2019, Bakshi et al., 2020]. These algorithms have runtime scaling roughly linearly in n. However, they do not give any nontrivial approximation to the top eigenvalues or eigenvectors of the kernel matrix themselves. To the best of our knowledge, prior to our work, no subquadratic time approximation algorithms for the top eigenvalue were known, even for Θ(1) approximation.

Our Techniques: Kernel Sum
We start by noting that, via a Chernoff bound, a simple randomsampling of kernel matrix entries provides the desired estimation in linear time.
Claim 1.
For a positive definite kernel k : R^d × R^d → [0, 1] with k(x, x) = 1 for all x, uniformly sample t = O(n log(1/δ)/ε²) off-diagonal entries of K, namely K_{i_1,j_1}, …, K_{i_t,j_t}, and let s̃(K) = n + (n(n−1)/t) · Σ_{ℓ=1}^{t} K_{i_ℓ,j_ℓ}. Then with probability ≥ 1 − δ, s̃(K) ∈ (1 ± ε) · s(K).

Our goal is to do better, giving a 1 ± ε approximation to s(K) in sublinear time. We achieve this by (a) performing a more structured random sampling, i.e., sampling a principal submatrix as opposed to individual entries, and (b) providing an efficient algorithm for processing this submatrix. Subsampling the matrix requires a more careful analysis of the variance of the estimator. To accomplish this, we use the fact that the kernel matrix is PSD, which implies that its “mass” cannot be too concentrated. For the second task, we use fast kernel density estimation algorithms, combined with random row sampling to reduce the running time.
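For concreteness, the following sketch (ours; the helper names are illustrative, not the paper's code) implements the entry-sampling estimator of Claim 1 with a generic kernel callable; for the stated choice of t it runs in time linear in n.

```python
import numpy as np

def kernel_sum_entry_sampling(X, kernel, t, rng):
    """Sketch of the estimator in Claim 1: sample t uniformly random off-diagonal
    entries of K and return s~(K) = n + n(n-1)/t * (sum of sampled entries).
    Runs in O(t d) time; Claim 1 takes t = O(n log(1/delta) / eps^2)."""
    n = len(X)
    total = 0.0
    for _ in range(t):
        i = int(rng.integers(n))
        j = int(rng.integers(n - 1))
        if j >= i:          # maps a draw from [0, n-2] onto [n] \ {i}
            j += 1
        total += kernel(X[i], X[j])
    return n + n * (n - 1) / t * total

# Example usage with the Gaussian kernel:
# gaussian = lambda x, y: float(np.exp(-np.sum((x - y) ** 2)))
# est = kernel_sum_entry_sampling(X, gaussian, t=5 * len(X), rng=np.random.default_rng(1))
```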
Our Techniques: Top Eigenvector

Our algorithm for top eigenvector approximation is a variant of the classic power method, with fast approximate matrix vector multiplication implemented through kernel density evaluation. Kernel density evaluation on a set of n points with corresponding kernel matrix K ∈ R^{n×n} can be viewed as approximating the vector Kz ∈ R^n where z(i) = 1/n for all i. Building on this primitive, it is possible to implement approximate multiplication with general z ∈ R^n [Charikar and Siminelakis, 2017]. We can then hope to leverage work on the ‘noisy power method’, which approximates the top eigenvectors of a matrix using just approximate matrix vector multiplications [Hardt and Price, 2014]. However, existing analysis assumes random noise on each matrix vector multiplication, which does not align with the top eigenvector. This cannot be guaranteed in our setting. Fortunately, we can leverage additional structure: if the kernel k is non-negative, then by the Perron-Frobenius theorem, K's top eigenvector is entrywise non-negative. This ensures that, if our noise in approximating Kz at each step of the power method is entrywise non-negative, then this noise will have non-negative dot product with the top eigenvector. We are able to guarantee this property, and show convergence of the method to an approximate top eigenvector, even when the error might align significantly with the top eigenvector.

Throughout, we focus on nice kernels satisfying:
Definition 2 (Nice Kernel Function). A kernel function k : R^d × R^d → [0, 1] is nice if it is positive definite and satisfies k(x, x) = 1 for all x ∈ R^d.

Many popular kernels, such as the Gaussian kernel, the exponential kernel, and the rational quadratic kernel described in the introduction, are indeed nice. We also assume that k admits a fast KDE algorithm. Specifically:

Definition 3 (Fast KDE). A kernel function k admits an O(dm/(µ^p ε²)) time KDE algorithm if, given a set of m points x_1, …, x_m ∈ R^d, we can process them in O(dm/(µ^p ε²)) time for some p ≥ 0, such that we can answer queries of the form (1/m) · Σ_i k(y, x_i) up to (1+ε) relative error in O(d/(µ^p ε²)) time for any query point y ∈ R^d, with probability at least 2/3, assuming that (1/m) · Σ_i k(y, x_i) ≥ µ.

Note that for a kernel function satisfying Def. 3, evaluating k(y_j) for m points Y = {y_1, …, y_m} against m points X = {x_1, …, x_m} requires O(dm/(µ^p ε²)) total time.

Our proposed kernel sum approximation algorithm samples a set A of s = Θ(√n) input points and looks at the principal submatrix K_A of K corresponding to those points. We prove that the sum of off-diagonal entries in K_A (appropriately scaled) is a good estimator of the sum of off-diagonal entries of K. Since for a nice kernel the sum of diagonal entries is always n, this is enough to give a good estimate of the full kernel sum s(K). Furthermore, we show how to estimate the sum of off-diagonal entries of K_A quickly via kernel density evaluation, in d·s^{2−δ}/ε^{O(1)} = d·n^{1−δ/2}/ε^{O(1)} time for a constant δ > 0. Overall this yields:
Theorem 4.
Let K ∈ R^{n×n} be a kernel matrix defined by a set of n points and a nice kernel function k (Def. 2) that admits O(dm/(µ^p ε²)) time approximate kernel density evaluation (Def. 3). After sampling a total of O(√n/ε²) points, we can in O(dn^{(1+2p)/(2+2p)} · log(n)/ε^{O(1)}) time approximate the sum of entries of K within a factor of 1+ε with high probability 1 − 1/n^{Θ(1)}.

For any square matrix K, let s_o(K) be the sum of its off-diagonal entries. Crucial to our analysis will be the following lemma:

Lemma 5 (PSD Mass is Spread Out). Let K ∈ R^{n×n} be a PSD matrix with diagonal entries all equal to 1. Let s_{o,i}(K) be the sum of off-diagonal entries in the i-th row of K. If s_o(K) ≥ εn for some ε ≤ 1, then for all i: s_{o,i}(K) ≤ 2√(s_o(K)/ε).

By Lemma 5, if the off-diagonal elements contribute significantly to s(K) (i.e., s_o(K) ≥ εn), then the off-diagonal weight is spread relatively evenly across Ω(√(ε · s_o(K))) = Ω(√n) rows/columns. This allows our strategy of sampling a principal submatrix with Θ(√n) rows to work.

Proof.
Assume for the sake of contradiction that there is a row i with s_{o,i}(K) > 2√(s_o(K)/ε). Let x be the vector that has value −√(s_o(K)/ε) at index i and 1 elsewhere. Then:

xᵀKx ≤ s_o(K)/ε − 4 · s_o(K)/ε + s_o(K) + n ≤ −s_o(K)/ε < 0,

where the last inequality follows from the assumptions that s_o(K) ≥ εn and ε ≤ 1. The above contradicts K being PSD, completing the lemma.

3.1 Our Estimator

For a subset A ⊆ [n], let K_A be the corresponding kernel matrix (which is a principal submatrix of K). Suppose that A is chosen by adding every element i ∈ [n] to A independently at random with probability p (we will later set p = 1/√n). Then Z := n + s_o(K_A)/p² is an unbiased estimator of s(K). That is, E[Z] = s(K). We would like to show that the variance Var[Z] is small. In fact, in Lemma 6 below, we show that Var[Z] = O(s(K)²). Thus, taking Var[Z]/(ε² E[Z]²) = O(1/ε²) samples of Z and returning the average yields a 1 ± ε approximation of s(K) with constant probability. To amplify the probability of success to 1 − δ for any δ > 0, we take the median of O(log(1/δ)) estimates and apply a Chernoff bound in a standard way. Our variance bound follows:

Lemma 6.
Var[Z] = O(s(K)²).

Proof.
Let Z_o := Z − n = s_o(K_A)/p². Then

Var[Z] = Var[Z_o] = E[Z_o²] − (E[Z_o])² ≤ E[Z_o²] = (1/p⁴) · E[s_o(K_A)²],

and expanding E[s_o(K_A)²] over pairs of sampled entries gives, up to constant factors (accounting for the symmetric orderings of the indices),

E[s_o(K_A)²] = O( p² Σ_{i,j ∈ [n], i ≠ j} K_{i,j}² + p³ Σ_{i,j,j′ ∈ [n], |{i,j,j′}| = 3} K_{i,j} K_{i,j′} + p⁴ Σ_{i,j,i′,j′ ∈ [n], |{i,j,i′,j′}| = 4} K_{i,j} K_{i′,j′} ).

We upper bound each term of (1/p⁴) · E[s_o(K_A)²] separately. We start with the first term:

p^{−2} Σ_{i ≠ j} K_{i,j}² = n Σ_{i ≠ j} K_{i,j}² ≤ n · s(K) ≤ s(K)²,

where in the equality we set p = 1/√n, and in the first inequality we use the fact that 0 ≤ K_{i,j} ≤ 1 for all i, j. We have s(K) ≥ n since all diagonal entries are 1, giving the last inequality.

We upper bound the third term:

Σ_{|{i,j,i′,j′}| = 4} K_{i,j} K_{i′,j′} ≤ Σ_{i,j,i′,j′ ∈ [n]} K_{i,j} K_{i′,j′} = s(K)².

To upper bound the second term we consider two cases.

Case s_o(K) ≤ n. In this case we can use Lemma 5 with ε = s_o(K)/n to conclude that

s_{o,i}(K) ≤ 2√(s_o(K)/ε) = 2√(s_o(K) · n/s_o(K)) = 2√n.

With this bound we can bound the second term by

p^{−1} Σ_{|{i,j,j′}| = 3} K_{i,j} K_{i,j′} ≤ p^{−1} · s_o(K) · max_i s_{o,i}(K) ≤ 2n · s_o(K) ≤ 2s(K)²,

where we substituted p = 1/√n.

Case s_o(K) > n. In this case we proceed as follows:

p^{−1} Σ_{|{i,j,j′}| = 3} K_{i,j} K_{i,j′} ≤ p^{−1} Σ_i ( Σ_{j : j ≠ i} K_{i,j} )² = p^{−1} Σ_i s_{o,i}(K)².

Since Σ_i s_{o,i}(K) is fixed (and is equal to s_o(K)), the last expression is maximized when some s_{o,i} take values as large as possible and the rest are set to 0. Since s_{o,i}(K) ≤ t := 2√(s_o(K)) (by Lemma 5 with ε = 1), in the worst case s_o(K)/t values of i have s_{o,i} = t and the rest have s_{o,i} = 0. Therefore,

p^{−1} Σ_i s_{o,i}(K)² ≤ p^{−1} · t² · s_o(K)/t = 2p^{−1} · √(s_o(K)) · s_o(K) ≤ 2s_o(K)² ≤ 2s(K)²,

where we use s_o(K) > n and p = 1/√n.

To turn the argument from the previous section into an algorithm, we need to approximate the value of Z = n + s_o(K_A)/p² for p = 1/√n efficiently. It is sufficient to efficiently approximate Z = n + n · s_o(K_A) when s_o(K_A) = Ω(ε), as otherwise the induced loss in approximating s(K) is negligible, since we always have s(K) ≥ n.

Let K′ be a kernel matrix of size m × m for which s_o(K′) ≥ Ω(ε). We show that for such a kernel matrix it is possible to approximate s_o(K′) in time m^{2−δ} for a constant δ > 0.
This is enough to yield a sublinear (in n) time algorithm for estimating Z, since K_A is m × m with m ≈ pn = √n.
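A brute-force rendering of this estimator (our sketch; not the paper's code, and without the KDE speedup developed next) looks as follows.

```python
import numpy as np

def submatrix_estimator_Z(X, kernel, rng):
    """One copy of the estimator Z = n + s_o(K_A)/p^2, where A keeps each point
    independently with probability p = 1/sqrt(n), so that E[Z] = s(K). The double
    loop below costs O(|A|^2 d) = O(n d) in expectation; the paper replaces it
    with fast kernel density queries on the sampled points."""
    n = len(X)
    p = 1.0 / np.sqrt(n)
    A = X[rng.random(n) < p]
    off_diag = 0.0
    for i in range(len(A)):
        for j in range(len(A)):
            if i != j:
                off_diag += kernel(A[i], A[j])
    return n + off_diag / p ** 2

def kernel_sum_estimate(X, kernel, eps, rng):
    """Average O(1/eps^2) copies of Z for a (1 +/- eps)-style estimate with constant
    success probability; a median of O(log(1/delta)) such averages amplifies it."""
    reps = max(1, int(np.ceil(1.0 / eps ** 2)))
    return float(np.mean([submatrix_estimator_Z(X, kernel, rng) for _ in range(reps)]))
```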
A simple algorithm. We note that it is sufficient to approximate the contribution to s_o(K′) from the rows i for which the row sum s_{o,i} is Ω(ε/m), as the contribution from the rest of the rows is negligible under the assumption that s_o(K′) = Ω(ε). So fix an i and assume that s_{o,i} ≥ Ω(ε/m). To estimate s_{o,i}, we use a kernel density evaluation algorithm. Our goal is to approximate

s_{o,i} = Σ_{j : j ≠ i} K′_{i,j} = Σ_{j : j ≠ i} k(x_i, x_j).

The approach is to first process the points x_1, …, x_m using the preprocessing step of the KDE algorithm. The KDE query algorithm then allows us to answer queries of the form (1/m) · Σ_j k(q, x_j) for an arbitrary query point q in time O(d/(µ^p ε²)), where µ is a lower bound on (1/m) · Σ_j k(q, x_j). So we set µ = Ω(ε/m) and query the KDE data structure on all q = x_1, …, x_m.

The above does not quite work, however – to estimate the off-diagonal sum we need to answer queries of the form (1/m) · Σ_{j : j ≠ i} k(x_i, x_j) instead of (1/m) · Σ_j k(x_i, x_j). This could be solved if the KDE data structure were dynamic, so that we could remove any point x_i from it. Some of the data structures indeed have this property. To provide a general reduction, however, we avoid this requirement by building several “static” data structures and then answering a single query of the form (1/m) · Σ_{j : j ≠ i} k(x_i, x_j) by querying O(log m) static data structures. Assume w.l.o.g. that m is an integer power of 2. Then we build a data structure for points x_1, …, x_{m/2} and another for x_{m/2+1}, …, x_m. We also build 4 data structures for the sets x_1, …, x_{m/4}, and x_{m/4+1}, …, x_{m/2}, and x_{m/2+1}, …, x_{3m/4}, and x_{3m/4+1}, …, x_m. And so forth, for log m levels. Suppose that we want to estimate (1/m) · Σ_{j : j ≠ 1} k(x_1, x_j). For that we query the data structures on the sets x_{m/2+1}, …, x_m, and x_{m/4+1}, …, x_{m/2}, and x_{m/8+1}, …, x_{m/4}, and so forth, for a total of log m data structures – one from each level. Similarly we can answer queries for an arbitrary i. The threshold µ for all the data structures is the same as before: µ = Ω(ε/m).

We need to query O(log m) data structures for every x_i, and we also need to amplify the probability of success to, say, 1 − 1/m³. Thus, the final runtime of the algorithm is

O(m · d/(µ^p ε²) · log m) = O(dm^{1+p}/ε^{2+p} · log m).
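The dyadic decomposition can be made concrete as follows. In this sketch (ours), each “data structure” is just an index range and the per-block KDE query is replaced by a brute-force sum, so only the combinatorics of the reduction is illustrated; a fast KDE structure per block would supply the speedup.

```python
import numpy as np

def dyadic_levels(m):
    """For m a power of two, return, per level, the list of contiguous index blocks
    obtained by halving [0, m) repeatedly. In the paper each block gets its own
    static KDE data structure; here a block is just its index range."""
    assert m > 1 and m & (m - 1) == 0, "sketch assumes m is a power of two"
    levels, size = [], m
    while size > 1:
        size //= 2
        levels.append([(lo, lo + size) for lo in range(0, m, size)])
    return levels

def leave_one_out_sum(A, kernel, i, levels):
    """Compute s_{o,i} = sum_{j != i} k(A[i], A[j]) using one block per level: at each
    level we take the sibling of the block containing i, and the union of these
    siblings is exactly [m] \\ {i}. Each inner sum stands in for one KDE query."""
    total = 0.0
    for blocks in levels:
        size = blocks[0][1] - blocks[0][0]
        sibling = (i // size) ^ 1      # the other child of i's parent block
        lo, hi = blocks[sibling]
        total += sum(kernel(A[i], A[j]) for j in range(lo, hi))
    return total
```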
A faster algorithm. We note that in the previous case, if s_{o,i} = Θ(ε/m) for every i, then we can approximate s_o(K′) efficiently by sampling a few i, evaluating the corresponding s_{o,i} exactly (in O(dm) time), and returning the empirical mean of the evaluated s_{o,i}. This works since the variance is small. There can be, however, values of i for which s_{o,i} is large. For those values of i, we can run a kernel density evaluation algorithm. More formally, we define a threshold t and run the KDE-based procedure above with µ = t/m (similarly to the previous algorithm), which reports all i for which s_{o,i} ≥ t/m. This takes time

O(dm log m/(µ^p ε²)) = O(dm^{1+p} log m/(ε² t^p)).

Let I be the set of remaining i. To estimate the contribution from s_{o,i} with i ∈ I, we repeatedly sample i ∈ I and evaluate s_{o,i} exactly using a linear scan, and output the average of the evaluated s_{o,i}, scaled by |I|, as an estimate of the contribution from i ∈ I. Since we can ignore the contribution from i ∈ I with s_{o,i} ≤ o(ε/m), we can assume that t/m ≥ s_{o,i} ≥ Ω(ε/m) for every i ∈ I, and then

Var_i[s_{o,i}] / (ε² (E_i[s_{o,i}])²) ≤ O(t/ε³)

samples are sufficient to get a 1 ± ε approximation. This step takes time O(t·dm/ε³). The final runtime is

O(dm^{1+p} log m/(ε² t^p) + dtm/ε³) = O(dm^{(1+2p)/(1+p)} log(m)/ε^{O(1)})

by setting t = (ε m^p)^{1/(1+p)}. Since m = Θ(√n) with high probability, we achieve O(dn^{(1+2p)/(2+2p)} log(n)/ε^{O(1)}) runtime for approximating the random variable Z = n + n · s_o(K_A). Since we evaluate the random variable Z O(1/ε²) times, the final runtime to approximate the sum of entries of the kernel matrix K within a factor of 1+ε is O(dn^{(1+2p)/(2+2p)} log(n)/ε^{O(1)}).

We have the following lower bound, which shows that sampling just O(√n) data points, as is done in our algorithm, is optimal up to constant factors.

Theorem 7. Consider any nice kernel k such that k(x, y) → 0 as ‖x − y‖ → ∞. In order to estimate Σ_{i,j=1}^{n} k(x_i, x_j) within any constant factor, we need to sample at least Ω(√n) points from the input x_1, …, x_n.

Proof. Suppose that we want to approximate s(K) = Σ_{i,j=1}^{n} k(x_i, x_j) within a factor of C > 1. Consider two distributions over the inputs x_1, …, x_n. For the first distribution, x_1, …, x_n are sampled independently from a large enough domain such that k(x_i, x_j) ≈ 0 for all i ≠ j with high probability. In this case s(K) ≈ n. For the other distribution, we again sample x_1, …, x_n independently at random as before, and then sample s_1, …, s_{√(Cn)} from 1, …, n without repetition. Then we set x_{s_t} = x_{s_1} for t = 2, …, √(Cn). For this distribution s(K) ≈ Cn. To distinguish between these two distributions we need to sample at least Ω(√(n/C)) = Ω(√n) points from x_1, …, x_n.

Our algorithm can be immediately used to estimate the value of the kernel alignment Â(K, K′) as defined in the introduction. The only requirement is that the matrices of the product kernels K∘K, K′∘K′ and K∘K′ are supported by fast kernel density evaluation algorithms. We formalize this as follows. Let C be a set of nice kernels defined over pairs of points in R^d. For any two kernel functions k, k′ ∈ C, the product kernel k∘k′ : R^{2d} × R^{2d} → [0, 1] is such that for any p, q, p′, q′ ∈ R^d,

(k∘k′)((p, p′), (q, q′)) = k(p, q) · k′(p′, q′).

Definition 8.
Let C = C_1, C_2, … be a sequence of sets of nice kernels, such that C_d is defined over pairs of points in R^d. We say that C is closed under product if for any two k, k′ ∈ C_d, the product kernel k∘k′ belongs to C_{2d}.

It is immediate that the Gaussian kernel, interpreted as a sequence of kernels for different values of the dimension d, is closed under product. Thus we obtain the following:

Corollary 9.
Given two Gaussian kernel matrices
K, K′ ∈ R^{n×n}, each defined by a set of n points in R^d, and ε ∈ (0, 1), Â(K, K′) can be estimated to 1 ± ε relative error in time O(d·n^{1−γ}·log n/ε^{O(1)}) for an absolute constant γ > 0, with high probability 1 − 1/n^{Θ(1)}.

We now present our top eigenvector approximation algorithm, which is a variant of the ‘noisy power method’ with approximate matrix vector multiplication implemented through kernel density evaluation. Existing analyses of the noisy power method assume random noise on each matrix vector multiplication, which has little correlation with the top eigenvector [Hardt and Price, 2014, Balcan et al., 2016]. This prevents this top direction from being ‘washed out’ by the noise. In our setting this cannot be guaranteed – the noise distribution arising from implementing matrix multiplication with K using kernel density evaluation is complex.

To avoid this issue, we use that since the kernel k is nice, K is entrywise non-negative, and by the Perron-Frobenius theorem, so is its top eigenvector. Thus, if our noise in approximating Kz is entrywise non-negative (i.e., if we overestimate each weighted kernel density), then the noise will have non-negative dot product with the top eigenvector, and will not wash it out, even if it is highly correlated with it.

We formalize this analysis in Theorem 11, first giving the required approximate matrix vector multiplication primitive that we use in Definition 10. We give a full description of our noisy power method variant in Algorithm 1. In Section 4.1 we discuss how to implement the matrix vector multiplication primitive efficiently using existing KDE algorithms.

Definition 10 (Non-negative Approximate Matrix Vector Multiplication). An ε-non-negative approximate MVM algorithm for a matrix K ∈ R^{n×n} takes as input a non-negative vector x ∈ R^n and returns y = Kx + e, where e is an entrywise non-negative error vector satisfying ‖e‖ ≤ ε‖Kx‖.
Algorithm 1 Kernel Noisy Power Method

input: Error parameter ε ∈ (0, 1), number of iterations I, an (ε²/c)-non-negative approximate MVM algorithm K̃(·) (Def. 10) for a nice kernel matrix K ∈ R^{n×n}, where c is a sufficiently large constant.
output: z ∈ R^n with ‖z‖ = 1

1: Initialize z₀ ∈ R^n with z₀(i) = 1/√n for all i. Initialize λ̃ := 0.
2: for i = 0 to I do
3:   z_{i+1} := K̃(z_i).
4:   if z_iᵀ z_{i+1} > λ̃ then
5:     z := z_i, λ̃ := z_iᵀ z_{i+1}.
6:   end if
7:   z_{i+1} := z_{i+1}/‖z_{i+1}‖.
8: end for
9: return z.

Theorem 11.
The kernel noisy power method (Algorithm 1) run for I = O (cid:16) log( n/(cid:15) ) (cid:15) (cid:17) iterationsoutputs a unit vector z with z T Kz ≥ (1 − (cid:15) ) · λ ( K ) .Proof. Let V Λ V T = K be K ’s eigendecomposition. Λ is diagonal containing the eigenvalues indecreasing order λ ≥ . . . ≥ λ n ≥ V is orthonormal, with columns equal to the correspondingeigenvectors of K : v , . . . , v n . Let m be the largest index such that λ m ≥ (1 − (cid:15)/ · λ .Let c i = V T z i be the i th iterate, written in the eigenvector basis. Let c i,m be its first m components and c i,n − m be the last n − m components. We will argue that for I = O (cid:16) log( n/(cid:15) ) (cid:15) (cid:17) iterations, there is at least one iteration i ≤ I where (cid:107) c i,m (cid:107) ≥ (1 − (cid:15)/ z i aligns mostlywith large eigenvectors. Formally this gives z Ti Kz i = c Ti Λ c i ≥ (cid:107) c i,m (cid:107) · (1 − (cid:15)/ · λ ≥ (1 − (cid:15)/ · λ ≥ (1 − (cid:15)/ · λ . Further, when we check to set z := z i at line (4) we have z Ti z i +1 = z Ti K z i = z Ti Kz i + z Ti e. z , K , and e are all entrywise non-negative, z i is entrywise non-negative for all i . Thus, z Ti e ≥
0. By our bound on (cid:107) e (cid:107) we also have z Ti e ≤ (cid:107) e (cid:107) ≤ (cid:15) / · (cid:107) Kz i − (cid:107) ≤ (cid:15) / · λ . Overallthis gives z Ti Kz i ≤ z Ti z i +1 ≤ z Ti Kz i + (cid:15) / · λ . So, if there is an i with (cid:107) c i,m (cid:107) ≥ (1 − (cid:15)/
4) and thus z Ti Kz i ≥ (1 − (cid:15)/ · λ , we will not output z = z j with z Tj Kz j ≤ (1 − (cid:15)/ − (cid:15) / · λ > (1 − (cid:15) ) · λ , ensuring our final error bound.To prove that there is some iterate with (cid:107) c i,m (cid:107) ≥ (1 − (cid:15)/ (cid:107) z i (cid:107) = (cid:107) c i (cid:107) = 1, it sufficesto argue that (cid:107) c i,n − m (cid:107) ≤ (cid:15)/
4. Assume for the sake of contradiction that for all i ≤ I we have (cid:107) c i,n − m (cid:107) > (cid:15)/
4. Under this assumption we can argue that c i (1) grows significantly with respectto (cid:107) c i,n − m (cid:107) in each step. Specifically, we can show by induction that c i (1) (cid:107) c i,n − m (cid:107) ≥ (1+ (cid:15)/ i √ n . Thisgives a contradiction since for I = O (log( n/(cid:15) ) /(cid:15) ) it would imply that 1 ≥ c I (1) ≥ (cid:15) (cid:107) c I,n − m (cid:107) andthus we must have (cid:107) c I,n − m (cid:107) < (cid:15)/
4. This contradiction proves the theorem.
Base Case:
Initially, z has all entries equal to 1 / √ n . Thus c (1) (cid:107) c ,n − m (cid:107) ≥ c (1) = v T z = 1 √ n n (cid:88) j =1 v ( j )= 1 √ n (cid:107) v (cid:107) ≥ √ n , where we use that v is a non-negative unit vector by the Perron-Frobenius theorem so (cid:80) nj =1 v ( j ) ≥(cid:107) v (cid:107) = 1. Inductive Step:
Assume inductively that c i (1) (cid:107) c i,n − m (cid:107) ≥ (1+ (cid:15)/ i √ n . Before normalization at step (7)we have z i +1 = Kz i + e . Normalization doesn’t affect the ratio between c i +1 (1) and (cid:107) c i +1 ,n − m (cid:107) –thus we can ignore this step.For all j ∈ , . . . , n we have c i +1 ( j ) = λ j · c i ( j ) + v Tj e . Since both e and v are all non-negative,this gives c i +1 (1) ≥ λ · c i (1). Further, by triangle inequality and since for j > m , λ j < (1 − (cid:15)/ (cid:107) c i +1 ,n − m (cid:107) ≤ (1 − (cid:15)/ λ · (cid:107) c i,n − m (cid:107) + (cid:107) e (cid:107) ≤ (1 − (cid:15)/ λ · (cid:107) c i,n − m (cid:107) + (cid:15) / · (cid:107) Kz i (cid:107) ≤ (1 − (cid:15)/ λ · (cid:107) c i,n − m (cid:107) + (cid:15) / · λ . By our assumption (for contradiction) that (cid:107) c i,n − m (cid:107) ≥ (cid:15)/ (cid:107) c i +1 ,n − m (cid:107) ≤ (cid:18) − (cid:15)/ (cid:15) / (cid:15)/ (cid:19) (cid:107) c i,n − m (cid:107) · λ ≤ (1 − (cid:15)/ · (cid:107) c i,n − m (cid:107) · λ . Overall, this gives that c i +1 (1) (cid:107) c i +1 ,n − m (cid:107) ≥ λ · c i (1)(1 − (cid:15)/ · (cid:107) c i,n − m (cid:107) · λ ≥ (1 + (cid:15)/ · c i (1) (cid:107) c i,n − m (cid:107) ≥ (1 + (cid:15)/ i +1 √ n , by our inductive assumption. This gives our contradiction, completing the proof.11 .1 Approximate Kernel Matrix Vector Multiplication We next show how to use fast a KDE algorithm to instantiate the non-negative approximate MVMprimitive (Definition 10) required by Algorithm 1. A similar approach was taken by Charikar et al.[2020]. We provide our own analysis, which applies black box to any KDE algorithm.
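To make the pipeline concrete, the following sketch (ours; simplified, with constants chosen loosely and a brute-force stand-in for the fast KDE) combines a bucketing-based non-negative approximate MVM in the spirit of the reduction below with the power-method loop of Algorithm 1. The function names and the specific constants are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def brute_force_kde_mean(X, queries, targets_idx):
    """Stand-in for a fast KDE data structure: mean Gaussian density of the target
    points at every query point. A real implementation would satisfy Definition 3."""
    T = X[targets_idx]
    out = np.empty(len(queries))
    for j, q in enumerate(queries):
        out[j] = np.mean(np.exp(-np.sum((T - q) ** 2, axis=1)))
    return out

def nonneg_approx_mvm(X, x, eps):
    """Rough sketch of the Theorem 12 reduction: round the entries of x *up* into
    multiplicative buckets, answer each bucket with one KDE query per point, and
    overestimate the negligible tail, so the error is entrywise non-negative.
    x is assumed entrywise non-negative (Definition 10); constants are glossed."""
    n = len(X)
    scale = float(np.linalg.norm(x))
    if scale == 0.0:
        return np.zeros(n)
    u = x / scale
    b = int(np.ceil(4 * np.log(n / eps) / eps))     # number of buckets, c = 4 chosen loosely
    base = 1.0 - eps / 2.0
    out = np.full(n, eps / ((b + 1) * np.sqrt(n)))  # overestimate of the unbucketed tail
    for i in range(1, b + 1):
        idx = np.where((u > base ** i) & (u <= base ** (i - 1)))[0]
        if idx.size:
            out += base ** (i - 1) * idx.size * brute_force_kde_mean(X, X, idx)
    return scale * out                              # entrywise >= K x for an exact KDE

def kernel_noisy_power_method(X, eps, num_iters):
    """Algorithm 1 with the oracle above: keep the iterate whose Rayleigh-quotient
    proxy z^T K~(z) is largest, normalizing after every approximate multiplication."""
    n = len(X)
    z = np.full(n, 1.0 / np.sqrt(n))
    best_z, best_val = z, 0.0
    for _ in range(num_iters):
        w = nonneg_approx_mvm(X, z, eps)
        if z @ w > best_val:
            best_z, best_val = z, float(z @ w)
        z = w / np.linalg.norm(w)
    return best_z, best_val
```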
Theorem 12.
Let k be a nice kernel (Def. 2) admitting O(dm/(µ^p ε²)) time approximate kernel density evaluation (Def. 3). Let K ∈ R^{n×n} be the associated kernel matrix for n points in d dimensions. There is an ε-non-negative approximate MVM algorithm for K running in time O(dn^{1+p} · (log(n/ε)/ε)^{O(1)}).

Combined with Theorem 11, Theorem 12 immediately gives our final subquadratic time eigenvalue approximation result:
Corollary 13.
Let k be a nice kernel (Def. 2) admitting O(dm/(µ^p ε²)) time approximate kernel density evaluation (Def. 3). Let K ∈ R^{n×n} be the associated kernel matrix for n points in d dimensions. There is an algorithm running in time O(dn^{1+p} · (log(n/ε)/ε)^{O(1)}) which outputs a unit vector z with zᵀKz ≥ (1 − ε) · λ₁(K).

Proof.
Algorithm 1 requires I = O(log(n/ε)/ε) approximate MVMs, each with error parameter ε²/c. By Theorem 12, each such multiplication can be performed in O(dn^{1+p} · (log(n/ε)/ε)^{O(1)}) time. Multiplying by I gives the final bound.

Proof of Theorem 12.
We seek an algorithm that computes Kx + e where e is a non-negative errorvector with (cid:107) e (cid:107) ≤ c(cid:15) (cid:107) Kx (cid:107) for some fixed constant c . Note that we can always scale x so that (cid:107) x (cid:107) = 1, and then scale back after multiplication, without effecting the error (cid:15) . Thus we assumegoing forward that (cid:107) x (cid:107) = 1. We can also adjust (cid:15) by a constant factor to have error bounded by (cid:15) (cid:107) Kx (cid:107) rather than c(cid:15) (cid:107) Kx (cid:107) . Since K has all non-negative entries and ones on the diagonal, (cid:107) Kx (cid:107) = (cid:107) x (cid:107) + (cid:107) ( K − I ) x (cid:107) + 2 x T ( K − I ) x ≥ (cid:107) x (cid:107) , since K − I also have all non-negative entries. Thus (cid:107) Kx (cid:107) ≥ Rounding x : We split the entries of x into b = c (cid:16) log( n/(cid:15) ) (cid:15) (cid:17) buckets consisting of values lying inthe range [(1 − (cid:15)/ i , (1 − (cid:15)/ i − ] for i = 1 , ..., b for some fixed constant c . Let ¯ x have all values inbucket i rounded to (1 − (cid:15)/ i − . This rounding increases each entry in x by at most a − (cid:15)/ ≤ (cid:15) multiplicative factor, and so increases all values in Kx also by at most a 1 + (cid:15) multiplicative factor.Thus, (cid:107) K ¯ x − Kx (cid:107) ≤ (cid:15) (cid:107) Kx (cid:107) . By triangle inequality, it suffices to compute z = K ¯ x + e where e is non-negative and (cid:107) e (cid:107) ≤ c(cid:15) (cid:107) K ¯ x (cid:107) for some constant c .Let ¯ x i ∈ R n be ¯ x with only the entries in bucket i kept and the rest set to zero. Let ¯ x b +1 ∈ R n be the set of entries not falling into any bucket – note that these entries have not been rounded.We have K ¯ x = (cid:80) b +1 i =1 K ¯ x i . Thus, to achieve our final error bound, it suffices to return z = (cid:80) b +1 i =1 z i where z i = K ¯ x i + e i , for non-negative error vector e i with (cid:107) e i (cid:107) ≤ (cid:15)b +1 (cid:107) K ¯ x (cid:107) . This gives that z = K ¯ x + e where e = (cid:80) b +1 i =1 e i is non-negative and by triangle inequality has (cid:107) e (cid:107) ≤ (cid:15) (cid:107) K ¯ x (cid:107) ,satisfying the required guarantees. Remainder Entries:
We first focus on the entries not lying in any bucket, ¯ x b +1 . For largeenough c (recall that b = c (cid:16) log( n/(cid:15) ) (cid:15) (cid:17) ), ¯ x b +1 only includes entries with value < (cid:15) ( b +1) n / . Since12ll entries of K are bounded by 1, all entries of K ¯ x b +1 are bounded by (cid:15) ( b +1) √ n . Thus, if we let z b +1 have value (cid:15) ( b +1) √ n in each entry, we have z b +1 = K ¯ x b +1 + e where e is non-negative and (cid:107) e (cid:107) ≤ (cid:15)b +1 ≤ (cid:15)b +1 (cid:107) K ¯ x (cid:107) , as required. Estimation within a Bucket:
We next consider approximating K ¯ x i , which amounts to kerneldensity evaluation with query points corresponding to all n points in our data set and target points X i corresponding to the non-zero entries of ¯ x i . We have[ K ¯ x i ]( j ) = (1 − (cid:15)/ i − (cid:88) x ∈X i k ( x, x j ) = |X i | (1 − (cid:15)/ i − · |X i | (cid:88) x ∈X i k ( x, x j ) . If we round any [ K ¯ x i ]( j ) with value ≤ (cid:15) ( b +1) √ n up to (cid:15) ( b +1) √ n , this will affect our final error by atmost (cid:15) ( b +1) ≤ (cid:15)b +1 (cid:107) K ¯ x (cid:107) . Thus, to achieve our designed error, we set our KDE relative error to (cid:15) andminimum density value to µ = (cid:15) ( b +1) √ n · |X i | (1 − (cid:15)/ i − . We multiply each estimate by a 1 / (1 − (cid:15) ) factorto ensure that it is an overestimate of the true entry in K ¯ x i . Since (cid:107) ¯ x (cid:107) ≤ (1 + (cid:15) ) (cid:107) x (cid:107) = (1 + (cid:15) ),we have |X i | ≤ min (cid:16) n, (1+ (cid:15) ) (1 − (cid:15)/ i − (cid:17) . Thus,1 |X i | (1 − (cid:15)/ i − ≥ max (cid:18) n (1 − (cid:15)/ i − , (1 − (cid:15)/ i − (1 + (cid:15) ) (cid:19) ≥ √ n . In turn, this gives µ = (cid:15) ( b +1) √ n · |X i | (1 − (cid:15)/ i − ≥ (cid:15) ( b +1)4 n .Finally, we plug in our KDE runtime of O ( dm/µ p (cid:15) ). For each bucket, we must process |X i | points to evaluate the density against. Since (cid:80) bi =1 |X i | ≤ n the total runtime here is O ( dn/µ p (cid:15) ).We then must evaluate the density of all n points against the points in each bucket, requiring totaltime O ( b · dn/µ p (cid:15) ). Recalling that b = O (cid:16) log( n/(cid:15) ) (cid:15) (cid:17) and µ = (cid:15) ( b +1)4 n gives total runtime: O (cid:18) dn p b p (cid:15) p (cid:19) = O (cid:18) dn p log( n/(cid:15) ) p (cid:15) p (cid:19) . It is easy to see that Ω( dn ) time is necessary to estimate λ ( K ) to even a constant factor. Thus,for (cid:15) = Θ(1), the runtime of Corollary 13 is tight, up to an n p log( n ) p factor. Theorem 14.
Consider any nice kernel k such that k(x, y) → 0 as ‖x − y‖ → ∞. Let K ∈ R^{n×n} be the associated kernel matrix of n points x_1, …, x_n ∈ R^d. Estimating λ₁(K) to any constant factor requires Ω(nd) time.

Proof. Let c be any constant. Consider two input cases. In the first, no two points in x_1, …, x_n are identical. In the second, a random set of c points are exact duplicates of each other. Scaling up these point sets by an arbitrarily large constant value, we can see that their kernel matrices are arbitrarily close to K = I in the first case and K = I + E in the second, where E has a 1 at positions (i, j) and (j, i) for all i ≠ j in the duplicate set, and zeros everywhere else. We can check that λ₁(I) = 1 while λ₁(I + E) = c. Thus, to approximate the top eigenvalue up to a c factor we must distinguish the two cases. If we read o(n/c) points, then with good probability we will see no duplicates. Thus, we must read Ω(n/c) points, requiring Ω(nd/c) time.

5 Empirical Evaluation
In this section we empirically evaluate the kernel noisy power method (Algorithm 1) for approximating the top eigenvalue of the kernel matrix K ∈ R^{n×n}. We use the Laplacian kernel, k(x, y) = exp(−‖x − y‖/σ). It is a nice kernel (by Def. 2), and furthermore, it admits a Fast KDE implementation (as per Def. 3) with p < 1. We refer to Algorithm 1 instantiated with this KDE as KNPM.

Evaluated methods.
We compare KNPM to two baselines: the usual (Full) power method, and a Uniform noisy power method. The latter is similar to KNPM, except that the KDE subroutine in Corollary 13 is evaluated by vanilla uniform sampling instead of a Fast KDE.

Evaluation metrics.
The computational cost of each algorithm is measured by the number of kernel evaluations it performs (i.e., how many entries of K it computes). Computed kernel values are not carried over across iterations. The full power method computes the entire matrix row-by-row in each iteration, since it is too large to be stored in memory. The other two methods use a small sample of entries. The accuracy of each method is evaluated in each iteration by the relative error 1 − zᵀKz/λ₁(K), where z is the unit vector computed by the algorithm in that iteration, and λ₁(K) is the true top eigenvalue. λ₁(K) is computed by letting the full power method run until convergence. This error measure corresponds directly to ε from Corollary 13.

Dataset.
We use classes of the Forest Covertype dataset [Blackard and Dean, 1999], which is a 54-dimensional dataset often used to evaluate kernel methods in high dimensions [Siminelakis et al., 2019, Backurs et al., 2019]. We use 5 of the 7 classes (namely classes 3–7), whose sizes range from 2.7K to 35.7K points. We have omitted the two larger classes since we could not compute an accurate ground-truth λ₁(K) for them, and hence could not measure accuracy.

Parameter setting.
We use bandwidth σ = 0.05 (other choices produce similar results). The full power method has no parameters. The Uniform and KNPM methods each have a single parameter that governs the sampling rate. For both, we start with a small sampling rate and gradually increase it by multiplying it by a fixed constant greater than 1.

Results.
All results are reported in Figure 1. Both the Uniform and KNPM variants of the noisy power method give a much better tradeoff in terms of accuracy vs. computation than the full power method. Additionally, KNPM consistently outperforms Uniform. (In fact, vanilla uniform sampling is a Fast KDE (Definition 3) for nice kernels (Definition 2), albeit with p = 1, leading only to a quadratic time top eigenvalue approximation algorithm.)

Conclusion
We have shown that fast kernel density evaluation methods can be used to give much faster algorithms for approximating the sum of kernel matrix entries and the kernel matrix top eigenvector. Our work leaves open a number of directions. For top eigenvector computation, it is open whether the gap between the linear-in-n lower bound of Theorem 14 and our slightly superlinear runtimes for the Gaussian and exponential kernels can be closed. Extending our techniques to approximate the top k eigenvectors/values in subquadratic time would also be very interesting. Finally, it would be interesting to identify other natural kernel matrix problems that can be solved in sublinear or subquadratic time using fast KDE methods. Conversely, one might hope to prove lower bounds ruling this out. Lower bounds that hold for error ε = Θ(1) would be especially interesting – known lower bounds against, e.g., subquadratic time algorithms for the kernel sum only hold when high accuracy ε = exp(−ω(log n)) is demanded.

Acknowledgements
This research has been supported in part by NSF awards DMS-2022448, CCF-2006798 and CCF-2006806, and a Simons Investigator Award.
References
Arturs Backurs, Piotr Indyk, and Ludwig Schmidt. On the fine-grained complexity of empiricalrisk minimization: Kernel methods and neural networks. In
Advances in Neural InformationProcessing Systems 27 (NIPS) , pages 4308–4318, 2017.Arturs Backurs, Moses Charikar, Piotr Indyk, and Paris Siminelakis. Efficient density evaluationfor smooth kernels. In
Proceedings of the 59th Annual IEEE Symposium on Foundations ofComputer Science (FOCS) , pages 615–626, 2018.Arturs Backurs, Piotr Indyk, and Tal Wagner. Space and time efficient kernel density estimationin high dimensions. In
Advances in Neural Information Processing Systems 32 (NIPS) , pages15799–15808, 2019.Ainesh Bakshi and David Woodruff. Sublinear time low-rank approximation of distance matrices.In
Advances in Neural Information Processing Systems 31 (NIPS) , pages 3782–3792, 2018.Ainesh Bakshi, Nadiia Chepurko, and Rajesh Jayaram. Testing positive semi-definiteness via ran-dom submatrices.
Proceedings of the 61st Annual IEEE Symposium on Foundations of ComputerScience (FOCS) , 2020.Maria-Florina Balcan, Simon Shaolei Du, Yining Wang, and Adams Wei Yu. An improved gap-dependency analysis of the noisy power method. In
Proceedings of the 29th Annual Conferenceon Computational Learning Theory (COLT) , pages 284–309, 2016.Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks anddiscriminant analysis in predicting forest cover types from cartographic variables.
Computers and Electronics in Agriculture, 24(3):131–151, 1999.

M. Charikar, M. Kapralov, N. Nouri, and P. Siminelakis. Kernel density estimation through density constrained near neighbor search. In
Proceedings of the 61st Annual IEEE Symposium onFoundations of Computer Science (FOCS) , 2020.Moses Charikar and Paris Siminelakis. Hashing-based-estimators for kernel density in high dimen-sions. In
Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science(FOCS) , pages 1032–1043, 2017.Shiri Chechik, Edith Cohen, and Haim Kaplan. Average distance queries through weighted samplesin graphs and metric spaces: High scalability with tight statistical guarantees.
Approximation,Randomization, and Combinatorial Optimization. Algorithms and Techniques , page 659, 2015.Edith Cohen, Shiri Chechik, and Haim Kaplan. Clustering small samples with quality guarantees:Adaptivity with one2all PPS. In
Thirty-Second AAAI Conference on Artificial Intelligence , 2018.Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz S Kandola. On kernel-target align-ment. In
Advances in Neural Information Processing Systems 15 (NIPS) , pages 367–373, 2002.Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations.
Journal of Machine Learning Research , 2(Dec):243–264, 2001.Leslie Greengard and John Strain. The fast Gauss transform.
SIAM Journal on Scientific andStatistical Computing , 12(1):79–94, 1991.Moritz Hardt and Eric Price. The noisy power method: a meta algorithm with applications. In
Advances in Neural Information Processing Systems 27 (NIPS) , pages 2861–2869, 2014.Thomas Hofmann, Bernhard Sch¨olkopf, and Alexander J Smola. Kernel methods in machine learn-ing.
The Annals of Statistics , pages 1171–1220, 2008.Piotr Indyk. Sublinear time algorithms for metric space problems. In
Proceedings of the 31st AnnualACM Symposium on Theory of Computing (STOC) , pages 428–434, 1999.Piotr Indyk, Ali Vakilian, Tal Wagner, and David Woodruff. Sample-optimal low-rank approxima-tion of distance matrices.
Proceedings of the 32nd Annual Conference on Computational LearningTheory (COLT) , 2019.Dongryeol Lee and Alexander G Gray. Fast high-dimensional kernel summations using the MonteCarlo multipole method. In
Advances in Neural Information Processing Systems 22 (NIPS) ,2009.Dongryeol Lee, Andrew W Moore, and Alexander G Gray. Dual-tree fast Gauss transforms. In
Advances in Neural Information Processing Systems 19 (NIPS) , 2006.William B March, Bo Xiao, and George Biros. ASKIT: Approximate skeletonization kernel-independent treecode in high dimensions.
SIAM Journal on Scientific Computing , 37(2):A1089–A1110, 2015.Cameron Musco and Christopher Musco. Recursive sampling for the Nystr¨om method. In
Advances in Neural Information Processing Systems 30 (NIPS), pages 3836–3848, 2017.

Cameron Musco and David P Woodruff. Sublinear time low-rank approximation of positive semidefinite matrices. In
Proceedings of the 58th Annual IEEE Symposium on Foundations of ComputerScience (FOCS) , pages 672–683, 2017.Jeff M Phillips and Wai Ming Tai. Improved coresets for kernel density estimates. In
Proceedingsof the 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages 2718–2727,2018a.Jeff M. Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. In , 2018b.Bernhard Sch¨olkopf, Alexander Smola, and Klaus-Robert M¨uller. Kernel principal componentanalysis. In
International Conference on Artificial Neural Networks , pages 583–588. Springer,1997.John Shawe-Taylor, Nello Cristianini, et al.
Kernel methods for pattern analysis . CambridgeUniversity Press, 2004.Paris Siminelakis, Kexin Rong, Peter Bailis, Moses Charikar, and Philip Levis. Rehashing kernelevaluation in high dimensions. In
Proceedings of the 36th International Conference on Machine Learning (ICML), pages 5789–5798, 2019.

Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 14 (NIPS), pages 682–688, 2001.

Changjiang Yang, Ramani Duraiswami, Nail A Gumerov, and Larry Davis. Improved fast Gauss transform and efficient kernel density estimation. In
Proceedings of the 2003 IEEE Computer Society Conference on Computer Visionand Pattern Recognition , 2003. 18
Simpler Algorithm for the Fast Gauss Transform
In this section we provide a simpler version of the algorithm of Greengard and Strain [1991] for approximate Gaussian kernel density estimation in low dimensions. This version uses ideas that are similar to those in the original algorithm, but reduces it to the necessary essentials, removing, e.g., the use of Hermite polynomials.
Theorem 15.
Given n points p_1, …, p_n, we can preprocess them in n · log(1/ε)^{O(d)} time so that we can answer queries of the form (1/n) · Σ_i k(q, p_i), for k(q, p) = exp(−‖q − p‖²), within an additive factor of ε, in log(1/ε)^{O(d)} query time.

Proof. We first note that, if ‖q − p_i‖ ≥ log(1/ε), then we can discard such points, and this changes the average of the kernel values by at most an additive factor of ε.

In the preprocessing step we partition the space into d-dimensional hypercubes of diameter log(1/ε). When a query point q comes, we only examine those hypercubes (and the points of the dataset within) that are at distance at most 5 log(1/ε) from the query point q. This ensures that we don't ignore any point p_i that is at distance less than log(1/ε) from q. Furthermore, the total volume of all inspected hypercubes is less than the volume of a d-dimensional Euclidean ball of radius 10 log(1/ε). A simple volume argument shows that such a Euclidean ball can contain at most 2^{O(d)} hypercubes of diameter log(1/ε). Thus, our algorithm will only examine at most 2^{O(d)} hypercubes. Since this is asymptotically negligible compared to the promised query time, from now on we will assume that all points p_i satisfy ‖q − p_i‖ ≤ 10 log(1/ε).

We calculate the Taylor expansion of exp(−‖q − p_i‖²). Denoting x = −‖q − p_i‖², we obtain:

exp(−‖q − p_i‖²) = exp(x) = 1 + x + x²/2! + x³/3! + …

Since |x| ≤ 10 log(1/ε), we can truncate the expansion after O(log(1/ε)) terms such that the truncation error is at most ε. Consider a term x^r/r! with r ≤ O(log(1/ε)). Since r ≤ O(log(1/ε)), it suffices to have an r^{O(d)} time algorithm (to get the promised final runtime) to answer queries of the form Σ_i ‖q − p_i‖^{2r}. This is sufficient because then we can do separate queries for all r ≤ O(log(1/ε)) and combine them according to the Taylor expansion to approximate Σ_i exp(−‖q − p_i‖²). To answer queries of the form Σ_i ‖q − p_i‖^{2r} efficiently, it is sufficient to answer queries of the form Σ_i (q · p_i)^r efficiently, where q · p_i = q(1)p_i(1) + … + q(d)p_i(d) denotes the inner product between the two vectors.

To achieve a fast query algorithm we observe that

(q · p_i)^r = (q(1)p_i(1) + … + q(d)p_i(d))^r

consists of r^{O(d)} different monomials after opening the parentheses. Thus, we can write

(q · p_i)^r = X(q) · Y(p_i)

for some efficiently computable maps X, Y : R^d → R^{r^{O(d)}}. We get that

Σ_i (q · p_i)^r = X(q) · Σ_i Y(p_i).

It remains to note that we can precompute the quantities Σ_i Y(p_i) for every hypercube in the preprocessing step and then examine the relevant hypercubes during the query step.

Note.
Similar algorithms work for kernels log ‖q − p_i‖, 1/(1 + ‖q − p_i‖^t), etc.: do a Taylor expansion and precompute the quantities Σ_i Y(p_i).
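The maps X and Y used above are easy to write down explicitly via multinomial coefficients. The following sketch (ours; the per-hypercube interface and names are illustrative) constructs them and shows how a precomputed vector Σ_i Y(p_i) answers queries Σ_i (q · p_i)^r.

```python
import numpy as np
from itertools import combinations_with_replacement
from math import factorial

def monomial_features(v, r):
    """Degree-r monomial features of v, scaled so that X = Y = this map satisfies
    (q . p)^r = X(q) . Y(p). The number of features is C(d+r-1, r) <= (r+1)^d = r^{O(d)},
    matching the counting argument in the proof of Theorem 15."""
    d = len(v)
    feats = []
    for combo in combinations_with_replacement(range(d), r):
        counts = np.bincount(np.array(combo, dtype=int), minlength=d)
        coef = factorial(r)
        for c in counts:
            coef //= factorial(int(c))            # multinomial coefficient r!/prod(c_k!)
        feats.append(np.sqrt(coef) * float(np.prod(v[list(combo)])))
    return np.array(feats)

def precompute_cell(points, r):
    """Preprocessing per hypercube: store S_r = sum_i Y(p_i)."""
    return sum(monomial_features(p, r) for p in points)

def query_cell(q, S_r, r):
    """Query: sum_i (q . p_i)^r = X(q) . S_r, independent of how many points the cell holds."""
    return float(monomial_features(q, r) @ S_r)

# Sanity check: for d=2, r=2, q=(1,2), p=(3,4), both (q.p)^2 and the feature
# inner product equal 121.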