Oblivious Sketching of High-Degree Polynomial Kernels
Thomas D. Ahle, Michael Kapralov, Jakob B. T. Knudsen, Rasmus Pagh, Ameya Velingker, David Woodruff, Amir Zandieh
aa r X i v : . [ c s . D S ] M a y Oblivious Sketching of High-Degree Polynomial Kernels ∗ Thomas D. Ahle
ITU and BARC [email protected]
Michael Kapralov
EPFL [email protected]
Jakob B. T. Knudsen
U. Copenhagen and BARC [email protected]
Rasmus Pagh
ITU and BARC [email protected]
Ameya Velingker
Google Research [email protected]
David P. Woodruff
Amir Zandieh
EPFL [email protected]
May 5, 2020
Abstract
Kernel methods are fundamental tools in machine learning that allow detection of non-lineardependencies between data without explicitly constructing feature vectors in high dimensionalspaces. A major disadvantage of kernel methods is their poor scalability: primitives such askernel PCA or kernel ridge regression generally take prohibitively large quadratic space and (atleast) quadratic time, as kernel matrices are usually dense. Some methods for speeding up kernellinear algebra are known, but they all invariably take time exponential in either the dimensionof the input point set (e.g., fast multipole methods suffer from the curse of dimensionality ) orin the degree of the kernel function.
Oblivious sketching has emerged as a powerful approach to speeding up numerical linearalgebra over the past decade, but our understanding of oblivious sketching solutions for kernelmatrices has remained quite limited, suffering from the aforementioned exponential dependenceon input parameters. Our main contribution is a general method for applying sketching solutionsdeveloped in numerical linear algebra over the past decade to a tensoring of data points withoutforming the tensoring explicitly. This leads to the first oblivious sketch for the polynomialkernel with a target dimension that is only polynomially dependent on the degree of the kernelfunction, as well as the first oblivious sketch for the Gaussian kernel on bounded datasets thatdoes not suffer from an exponential dependence on the dimensionality of input data points. ∗ This paper is a merged version of the work of Ahle and Knudsen [AK19] and Kapralov, Pagh, Velingker, Woodruffand Zandieh [KPV + ontents p q (analysis for T base : CountSketch and S base : TensorSketch ) . . 244.2 Higher Moments of Π q (analysis for T base : OSNAP and S base : TensorSRHT ) . . . . . 26 s λ q . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365.3 Spectral Property of Identity × TensorSRHT . . . . . . . . . . . . . . . . . . . . . . . 405.4 Spectral property of Identity × OSNAP . . . . . . . . . . . . . . . . . . . . . . . . . . 435.5 High Probability OSE with linear dependence on s λ . . . . . . . . . . . . . . . . . . 47 A.1 Lower Bound for Sub-Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55A.2 Upper bound for Sub-Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57A.3 Lower Bound for TensorSketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
Introduction
Data dimensionality reduction, or sketching , is a common technique for quickly reducing the size ofa large-scale optimization problem while approximately preserving the solution space, thus allowingone to instead solve a much smaller optimization problem, typically in a smaller amount of time.This technique has led to near-optimal algorithms for a number of fundamental problems in numer-ical linear algebra and machine learning, such as least squares regression, low rank approximation,canonical correlation analysis, and robust variants of these problems. In a typical instance of sucha problem, one is given a large matrix X ∈ R d × n as input, and one wishes to choose a randommap Π from a certain family of random maps and replace X with Π X . As Π typically has manyfewer rows than columns, Π X compresses the original matrix X , which allows one to perform theoriginal optimization problem on the much smaller matrix Π X . For a survey of such techniques,we refer the reader to the survey by Woodruff [Woo14].A key challenge in this area is to extend sketching techniques to kernel-variants of the abovelinear algebra problems. Suppose each column of X corresponds to an example while each ofthe d rows corresponds to a feature. Then these algorithms require an explicit representationof X to be made available to the algorithm. This is unsatisfactory in many machine learningapplications, since typically the actual learning is performed in a much higher (possibly infinite)dimensional feature space, by first mapping each column of X to a much higher dimensional space.Fortunately, due to the kernel trick, one need not ever perform this mapping explicitly; indeed, if theoptimization problem at hand only depends on inner product information between the input points,then the kernel trick allows one to quickly compute the inner products of the high dimensionaltransformations of the input points, without ever explicitly computing the transformation itself.However, evaluating the kernel function easily becomes a bottleneck in algorithms that rely on thekernel trick because it typically takes O ( d ) time to evaluate the kernel function for d dimensionaldatasets. There are a number of recent works which try to improve the running times of kernelmethods; we refer the reader to the recent work of [MM17] and the references therein. A naturalquestion is whether it is possible to instead apply sketching techniques on the high-dimensionalfeature space without ever computing the high-dimensional mapping.For the important case of polynomial kernel , such sketching techniques are known to be possible .This was originally shown by Pham and Pagh in the context of kernel support vector machines[PP13], using the TensorSketch technique for compressed matrix multiplication due to Pagh [Pag13].This was later extended in [ANW14] to a wide array of kernel problems in linear algebra, includingprincipal component analysis, principal component regression, and canonical correlation analysis.The running times of the algorithms above, while nearly linear in the number of non-zeroentries of the input matrix X , depend exponentially on the degree q of the polynomial kernel. Forexample, suppose one wishes to do low rank approximation on A , the matrix obtained by replacingeach column of X with its kernel-transformed version. One would like to express A ≈ U V , where U ∈ R d p × k and V ∈ R k × n . Writing down U explicitly is problematic, since the columns belong tothe much higher d p -dimensional space. 
Instead, one can express U V implicitly via column subsetselection, by expressing it as a
AZZ ⊤ and then outputting Z . Here Z is an n × k matrix. In[ANW14], an algorithm running in nnz ( X ) + ( n + d )poly (3 p , k, /ǫ ) time was given for outputtingsuch Z with the guarantee that k A − AZZ ⊤ k F ≤ (1 + ǫ ) k A − A k k F with constant probability, where A k is the best rank- k approximation to A . Algorithms with similar running times were proposedfor principal component regression and canonical correlation analysis. The main message here is The lifting function corresponding to the polynomial kernel maps x ∈ R d to φ ( x ) ∈ R d p , where φ ( x ) i ,i ,...,i p = x i x i · · · x i p , for i , i , . . . , i p ∈ { , , . . . , d } p rows in order toguarantee their correctness. Moreover, the existing sketches work with constant probability onlyand no high probability result was known for the polynomial kernel.The main drawback with previous work on applying dimensionality reduction for the polynomialkernel is the exponential dependence on p in the sketching dimension and consequently in therunning time. Ideally, one would like a polynomial dependence. This is especially useful for theapplication of approximating the Gaussian kernel by a sum of polynomial kernels of various degrees,for which large values of p , e.g., p = poly (log n ) are used [CKS11]. This raises the main questionof our work: Is it possible to desing a data oblivious sketch with a sketching dimension (and, hence, runningtime) that is not exponential in p for the above applications in the context of the polynomialkernel? While we answer the above question, we also study it in a more general context, namely, thatof regularization. In many machine learning problems, it is crucial to regularize so as to preventoverfitting or ill-posed problems. Sketching and related sampling-based techniques have also beenextensively applied in this setting. For a small sample of such work see [RR07, AM15, PW15,MM17, ACW17b, ACW17a, AKM +
17, AKM + d × n matrix A , and a d × b , and one seeks to find a y ∈ R n so as to minimize k Ay − b k . In ridge regression, we instead seek a y so as to minimize k Ay − b k + λ k y k , for a parameter λ >
0. Intuitively, if λ is much larger than the operator norm k A k of A , then a good solution is obtained simply by setting y = 0 d . On the other hand, if λ = 0,the problem just becomes an ordinary least squares regression. In general, the statistical dimension (or effective degrees of freedom ), s λ , captures this tradeoff, and is defined as P di =1 λ i ( A ⊤ A ) λ i ( A ⊤ A )+ λ , where λ i ( A ⊤ A ) is the i -th eigenvalue of A ⊤ A . Note that the statistical dimension is always at mostmin( n, d ), but in fact can be much smaller. A key example of its power is that for ridge regression,it is known [ACW17b] that if one chooses a random Gaussian matrix Π with O ( s λ /ǫ ) rows, and if y is the minimizer to k Π Ay − Π b k + λ k y k , then k Ay − b k + λ k y k ≤ (1+ ǫ ) min y ′ ( k Ay ′ − b k + λ k y ′ k ).Note that for ordinary regression ( λ = 0) one would need that Π has Ω(rank( A ) /ǫ ) rows [CW09].Another drawback of existing sketches for the polynomial kernel is that their running time andtarget dimension depend at least quadratically on s λ and no result is known with linear dependenceon s λ , which would be optimal. We also ask if the exponential dependence on p is avoidable in the regularized setting: Is it possible to obtain sketching dimension bounds and running times that are not exponential in p in the context of regularization? Moreover, is it possible to obtain a running time that dependsonly linearly on s λ ? In this paper, we answer the above questions in the affirmative. In other words, for each ofthe aforementioned applications, our algorithm depends only polynomially on p . We state theseapplications as corollaries of our main results, which concern approximate matrix product andsubspace embeddings. In particular, we devise a new distribution on oblivious linear maps Π ∈ R m × d p (i.e., a randomized family of maps that does not depend on the dataset X ), so that for anyfixed X ∈ R d × n , it satisfies the approximate matrix product and subspace embedding properties.These are the key properties needed for kernel low rank approximation. We remark that our4 ata oblivious sketching is greatly advantageous to data dependent methods because it results in aone-round distributed protocol for kernel low rank approximation [KVW14].We show that our oblivious linear map Π ∈ R m × d p has the following key properties: Oblivious Subspace Embeddings (OSEs).
Given ε > n -dimensional subspace E ⊆ R d , we say that Π ∈ R m × d is an ε -subspace embedding for E if (1 − ε ) k x k ≤ k Π x k ≤ (1 + ε ) k x k for all x ∈ E . In this paper we focus on Oblivious Subspace Embeddings in the regularized setting.In order to define a (regularized) Oblivious Subspace Embedding, we need to introduce the notionof statistical dimension , which is defined as follows: Definition 1 (Statistical Dimension) . Given λ ≥
0, for every positive semidefinite matrix K ∈ R n × n , we define the λ -statistical dimension of K to be s λ ( K ) := tr ( K ( K + λI n ) − ) . Now, we can define the notion of an oblivious subspace embedding (OSE):
Definition 2 (Oblivious Subspace Embedding (OSE)) . Given ε, δ, µ > d, n ≥ ε, δ, µ, d, n ) -Oblivious Subspace Embedding (OSE) is a distribution D over m × d matrices (forarbitrary m ) such that for every λ ≥
0, every A ∈ R d × n with λ -statistical dimension s λ ( A ⊤ A ) ≤ µ ,the following holds, Pr Π ∼D h (1 − ǫ )( A ⊤ A + λI n ) (cid:22) (Π A ) ⊤ Π A + λI n (cid:22) (1 + ǫ )( A ⊤ A + λI n ) i ≥ − δ. (1)The goal is to have the target dimension m small so that Π provides dimensionality reduction.If we consider the non-oblivious setting where we allow the sketch matrix Π to depend on A , thenby leverage score sampling we can achieve a target dimension of m ≈ s λ ( A ⊤ A ), which is essentiallyoptimal [AKM + m ≈ s λ ( A ⊤ A ). Approximate Matrix Product.
We formally define this property in the following definition.
Definition 3 (Approximate Matrix Product) . Given ε, δ >
0, we say that a distribution D over m × d matrices has the ( ε, δ ) -approximate matrix product property if for every C, D ∈ R d × n ,Pr Π ∼D h k C ⊤ Π ⊤ Π D − C ⊤ D k F ≤ ε k C k F k D k F i ≥ − δ. Our main theorems, which provide the aforementioned guarantees, are as follows, Theorem 1.
For every positive integers n, p, d , every ε, s λ > , there exists a distribution on linearsketches Π p ∈ R m × d p such that: (1) If m = Ω (cid:0) ps λ ǫ − (cid:1) , then Π p is an ( ε, / , s λ , d p , n ) -oblivioussubspace embedding as in Definition 2. (2) If m = Ω (cid:0) pε − (cid:1) , then Π p has the ( ε, / -approximatematrix product property as in Definition 3.Moreover, for any X ∈ R d × n , if A ∈ R d p × n is the matrix whose columns are obtained by the p -fold self-tensoring of each column of X then the matrix Π p A can be computed using Algorithm 1in time e O ( pnm + p nnz( X )) . For symmetric matrices K and K ′ , the spectral inequality relation K (cid:22) K ′ holds if and only if x ⊤ Kx ≤ x ⊤ K ′ x for all vectors x Throughout this paper, the notations e O, e Ω , e Θ suppress poly (log( nd/ε )) factors. heorem 2. For every positive integers n, p, d , every ε, s λ > , there exists a distribution on linearsketches Π p ∈ R m × d p such that: (1) If m = e Ω (cid:0) ps λ ǫ − (cid:1) , then Π p is an ( ε, / poly ( n ) , s λ , d p , n ) -oblivious subspace embedding (Definition 2). (2) If m = e Ω (cid:0) pε − (cid:1) , then Π p has the ( ε, / poly ( n )) -approximate matrix product property (Definition 3).Moreover, in the setting of (1) , for any X ∈ R d × n , if A ∈ R d p × n is the matrix whose columnsare obtained by a p -fold self-tensoring of each column of X , then the matrix Π p A can be computedusing Algorithm 1 in time e O (cid:16) pnm + p / s λ ε − nnz( X ) (cid:17) . Theorem 3.
For every positive integers p, d, n , every ε, s λ > , there exists a distribution onlinear sketches Π p ∈ R m × d p which is an ( ε, / poly ( n ) , s λ , d p , n ) -oblivious subspace embedding as inDefinition 2, provided that the integer m satisfies m = e Ω (cid:0) p s λ /ǫ (cid:1) .Moreover, for any X ∈ R d × n , if A ∈ R d p × n is the matrix whose columns are obtained by a p -fold self-tensoring of each column of X then the matrix Π p A can be computed using Algorithm 1in time e O (cid:0) pnm + p ǫ − nnz( X ) (cid:1) . We can immediately apply these theorems to kernel ridge regression with respect to the polyno-mial kernel of degree p . In this problem, we are given a regularization parameter λ >
0, a d × n ma-trix X , and vector b ∈ R n and would like to find a y ∈ R n so as to minimize k A ⊤ Ay − b k + λ k Ay k ,where A ∈ R d p × n is the matrix obtained from X by applying the self tensoring of degree p toeach column. To solve this problem via sketching, we choose a random matrix Π p accordingto the theorems above and compute Π p A . We then solve the sketched ridge regression problemwhich seeks to minimize (cid:13)(cid:13)(cid:13) (Π p A ) ⊤ Π p Ax − b (cid:13)(cid:13)(cid:13) + λ k Π p Ax k over x . By the above theorems, wehave (cid:13)(cid:13)(cid:13) (Π p A ) ⊤ Π p Ax − b (cid:13)(cid:13)(cid:13) + λ k Π p Ax k = (1 ± ǫ ) (cid:18)(cid:13)(cid:13)(cid:13) A ⊤ Ax − b (cid:13)(cid:13)(cid:13) + λ k Ax k (cid:19) simultaneously for all x ∈ R n ; thus, solving the sketched ridge regression problem gives a (1 ± ǫ )-approximation to theoriginal problem. If we apply Theorem 1, then the number of rows of Π p needed to ensure successwith probability 9 /
10 is Θ( ps λ ǫ − ). The running time to compute Π p A is O ( p s λ ε − n + p nnz( X )),after which a ridge regression problem can be solved in O ( ns λ /ǫ ) time via an exact closed-formsolution for ridge regression. An alternative approach to obtaining a very high-accuracy approxima-tion is to use the sketched kernel as a preconditioner to solve the original ridge regression problem,which improves the dependence on ε to log(1 /ε ) [ACW17a]. To obtain a higher probability ofsuccess, we can instead apply Theorem 3, which would allow us to compute the sketched matrixΠ p A in e O ( p s λ ε − n + p ε − nnz( X )) time. This is the first sketch to achieve the optimal depen-dence on s λ for the polynomial kernel, after which we can now solve the ridge regression problemin e O ( ns λ poly (cid:0) p, ǫ − (cid:1) ) time. Importantly, both running times are polynomial in p , whereas allpreviously known methods incurred running times that were exponential in p .Although there has been much work on sketching methods for kernel approximation whichnearly achieve the optimal target dimension m ≈ s λ , such as Nystrom sampling [MM17], all knownmethods are data-dependent unless strong conditions are assumed about the kernel matrix (smallcondition number or incoherence). Data oblivious methods provide nice advantages, such as one-round distributed protocols and single-pass streaming algorithms. However, for kernel methodsthey are poorly understood and previously had worse theoretical guarantees than data-dependentmethods. Furthermore, note that the Nystrom method requires to sample at least m = Ω( s λ ) land-marks to satisfy the subspace embedding property even given an oracle access to the exact leveragescores distribution. This results in a runtime of Ω (cid:0) s λ d + s λ nnz( X ) (cid:1) . Whereas our method achievesa target dimension that nearly matches the best dimension possible with data-dependent Nystrommethod and with strictly better running time of e O ( ns λ + nnz( X )) (assuming p = poly (log n )).6herefore, for a large range of parameter our sketch runs in input sparsity time wheras the Nys-trom methods are slower by an s λ factor in the best case. Application: Polynomial Kernel Rank- k Approximation.
Approximate matrix productand subspace emebedding are key properties for sketch matrices which imply efficient algorithmsfor rank- k kernel approximation [ANW14]. The following corollary of Theorem 1 immediatelyfollows from Theorem 6 of [ANW14]. Corollary 4 (Rank- k Approximation) . For every positive integers k, n, p, d , every ε > , any X ∈ R d × n , if A ∈ R d p × n is the matrix whose columns are obtained by the p -fold self-tensoringof each column of X then there exists an algorithm which finds an n × k matrix V in time O (cid:0) p nnz( X ) + poly (cid:0) k, p, ε − (cid:1)(cid:1) such that with probability / , k A − AV V ⊤ k F ≤ (1 + ε ) min U ∈ R dp × n rank ( U )= k k A − U k F . Note that this runtime improves the runtime of [ANW14] by exponential factors in the polyno-mial kernel’s degree p . Additional Applications.
Our results also imply improved bounds for each of the applicationsin [ANW14], including canonical correlation analysis (CCA), and principal component regression(PCR). Importantly, we obtain the first sketching-based solutions for these problems with runningtime polynomial rather than exponential in p . Oblivious Subspace Embedding for the Gaussian Kernel.
One very important implicationof our result is Oblivious Subspace Embedding of the Gaussian kernel. Most work in this area isrelated to the Random Fourier Features method [RR07]. It was shown in [AKM +
17] that onerequires Ω( n ) samples of the standard Random Fourier Features to obtain a subspace embeddingfor the Gaussian kernel, while a modified distribution for sampling frequencies yields provablybetter performance. The target dimension of our proposed sketch for the Gaussian kernel strictlyimproves upon the result of [AKM + d . We for the first time, embed the Gaussian kernel with a target dimension which has a lineardependence on the statistical dimension of the kernel and is not exponential in the dimensionalityof the data-point. Theorem 5.
For every r > , every positive integers n, d , and every X ∈ R d × n such that k x i k ≤ r for all i ∈ [ n ] , where x i is the i th column of X , suppose G ∈ R n × n is the Gaussian kernel matrix –i.e., G j,k = e −k x j − x k k / for all j, k ∈ [ n ] . There exists an algorithm which computes S g ( X ) ∈ R m × n in time e O (cid:0) q ǫ − ns λ + q ǫ − nnz( X ) (cid:1) such that for every ε, λ > , Pr S g h (1 − ǫ )( G + λI n ) (cid:22) ( S g ( X )) ⊤ S g ( X ) + λI n (cid:22) (1 + ǫ )( G + λI n ) i ≥ − / poly ( n ) , where m = e Θ (cid:0) q s λ /ǫ (cid:1) and q = Θ( r + log( n/ǫλ )) and s λ is λ -statistical dimension of G as inDefinition 1. We remark that for datasets with radius r = poly (log n ) even if one has oracle access to theexact leverage scores for Fourier features of Gaussian kernel, in order to get subspace embeddingguarantee one needs to use m = Ω( s λ ) features which requires Ω( s λ nnz( X )) operations to compute.Wheras our result of Theorem 5 runs in time e O ( ns λ + nnz( X )). Therefore, for a large range ofparameters our Gaussian sketch runs in input sparsity time wheras the Fourier features method isat best slower by an s λ factor. 7 .2 Technical Overview Our goal is to design a sketching matrix Π p that satisfies the oblivious subspace embedding propertywith an optimal embedding dimension and which can be efficiently applied to vectors of the form x ⊗ p ∈ R d p . We start by describing some natural approaches to this problem (some of which havebeen used before), and show why they incur an exponential loss in the degree of the polynomialkernel. We then present our sketch and outline our proof of its correctness.We first discuss two natural approaches to tensoring classical sketches, namely the Johnson-Lindenstrauss transform and the CountSketch . We show that both lead to an exponential depen-dence of the target dimension on p and then present our new approach. Tensoring the Johnson-Lindenstrauss Transform.
Perhaps the most natural approach todesigning a sketch Π p is the idea of tensoring p independent Johnson-Lindenstrauss matrices. Specif-ically, let m be the target dimension. For every r = 1 , . . . , p let M ( r ) denote an m × d matrix withiid uniformly random ± M ∈ R m × d p be M = 1 √ m M (1) • . . . • M ( p ) , where • stands for the operation of tensoring the rows of matrices M ( r ) (see Definition 7). Thiswould be a very efficient matrix to apply, since for every j = 1 , . . . , m the j -th entry of M x ⊗ p isexactly Q pr =1 h M ( r ) x i j , which can be computed in time O ( p nnz( x )), giving overall evaluation time O ( pm nnz( x )). One would hope that m = O ( ε − log n ) would suffice to ensure that k M x ⊗ p k =(1 ± ǫ ) k x ⊗ q k . However, this is not true: we show in Appendix A that one must have m =Ω( ε − p log( n ) /p + ε − (log( n ) /p ) p ) in order to preserve the norm with high probability. Thus,the dependence on degree p of the polynomial kernel must be exponential. The lower bound isprovided by controlling the moments of the sketch M and using Paley-Zygmund inequality. Forcompleteness, we show that the aforementioned bound on the target dimension m is sharp, i.e.,necessary and sufficient for obtaining the Johnson-Lindenstrauss property. Tensoring of CountSketch (TensorSketch).
Pagh and Pham [PP13] introduced the followingtensorized version of
CountSketch . For every i = 1 , . . . , p let h i : [ d ] → [ m ] denote a random hashfunction, and σ i : [ d ] → [ m ] a random sign function. Then let S : R d ⊗ p → R m be defined by S r, ( j ,...,j p ) := σ ( i ) · · · σ ( i p ) [ h ( i ) + . . . h p ( i p ) = r ]for r = 1 , . . . , m . For every x ∈ R d one can compute Sx ⊗ p in time O ( pm log m + p nnz( x )). Since thetime to apply the sketch only depends linearly on the dimension p (due to the Fast Fourier Trans-form) one might hope that the dependence of the sketching dimension on p is polynomial. However,this turns out to not be the case: the argument in [ANW14] implies that m = e O (3 p s λ ) suffices toconstruct a subspace embedding for a matrix with regularization λ and statistical dimension s λ ,and we show in Appendix A.3 that exponential dependence on p is necessary. Our Approach: Recursive Tensoring.
The initial idea behind our sketch is as follows. Toapply our sketch Π p to x ⊗ p , for x ∈ R d , we first compute the sketches T x, T x, . . . , T p x for inde-pendent sketching matrices T , . . . , T p ∼ T base – see the leaves of the sketching tree in Fig. 1. Notethat we choose these sketches as CountSketch [CCFC02] or
OSNAP [NN13] to ensure that the leaf Tensor product of x with itself p times. base S base T base T base S base T base T base internal nodes: TensorSketch or TensorSRHT leaves:
CountSketch or OSNAP
Figure 1: S base is chosen from the family of sketches which support fast matrix-vector product fortensor inputs such as TensorSketch and
TensorSRHT . The T base is chosen from the family of sketcheswhich operate in input sparsity time such as CountSketch and
OSNAP .sketches can be applied in time proportional to the number of nonzeros in the input data (in thecase of OSNAP this is true up to polylogarithimic factors).Each of these is a standard sketching matrix mapping d -dimensional vectors to m -dimensionalvectors for some common value of m . We refer the reader to the survey [Woo14]. The next ideais to choose new sketching matrices S , S , . . . , S p/ ∼ S base , mapping m -dimensional vectors to m -dimensional vectors and apply S to ( T x ) ⊗ ( T x ), as well as apply S to ( T x ) ⊗ ( T x ), andso on, applying S p/ to ( T p − x ) ⊗ ( T p x ). These sketches are denoted by S base – see internal nodesof the sketching tree in Fig. 1. We note that in order to ensure efficiency of our construction (inparticular, running time that depends only linearly on the statistical dimension s λ ) we must choose S base as a sketch that can be computed on tensored data without explicitly constructing the actualtensored input, i.e., S base supports fast matrix vector product on tensor product of vectors. Weuse either TensorSketch (for results that work with constant probability) and a new variant of theSubsampled Randomized Hadamard Transform
SRHT which supports fast multiplication for thetensoring of two vectors (for high probability bounds) – we call the last sketch
TensorSRHT .At this point we have reduced our number of input vectors from p to p/
2, and the dimensionis m , which will turn out to be roughly s λ . We have made progress, as we now have fewer vectorseach in roughly the same dimension we started with. After log p levels in the tree we are left witha single output vector.Intuitively, the reason that this construction avoids an exponential dependence on p is thatat every level in the tree we use target dimension m larger than the statistical dimension of ourmatrix by a factor polynomial in p . This ensures that the accumulation of error is limited, as thetotal number of nodes in the tree is O ( p ). This is in contrast to the direct approaches discussedabove, which use a rather direct tensoring of classical sketches, thereby incurring an exponentialdependence on p due to dependencies that arise. Showing Our Sketch is a Subspace Embedding.
In order to show that our recursive sketchis a subspace embedding, we need to argue it preserves norms of arbitrary vectors in R d p , not only9ectors of the form x ⊗ p , i.e., p -fold self-tensoring of d -dimensional vectors . Indeed, all knownmethods for showing the subspace embedding property (see [Woo14] for a survey) at the very leastargue that the norms of each of the columns of an orthonormal basis for the subspace in questionare preserved. While our subspace may be formed by the span of vectors which are tensor productsof p d -dimensional vectors, we are not guaranteed that there is an orthonormal basis of this form.Thus, we first observe that our mapping is indeed linear over R d p , making it well-defined on theelements of any basis for our subspace, and hence our task essentially reduces to proving that ourmapping preserves norms of arbitrary vectors in R d p .We present two approaches to analyzing our construction. One is based on the idea of prop-agating moment bounds through the sketching tree, and results in a nearly linear dependence ofthe sketching dimension m on the degree p of the polynomial kernel, at the expense of a quadraticdependence on the statistical dimension s λ . This approach is presented in Section 4. The otherapproach achieves the (optimal) linear dependence on s λ , albeit at the expense of a worse polyno-mial dependence on p . This approach uses sketches that succeed with high probability, and usesmatrix concentration bounds. Propagating moment bounds through the tree – optimizing the dependence on thedegree p . We analyze our recursively tensored version of the
OSNAP and
CountSketch by showinghow moment bounds can be propagated through the tree structure of the sketch. This analysis ispresented in Section 4, and results in the proof of Theorem 1 as well as the first part of Theorem 3.The analysis obtained this way give particularly sharp dependencies on p and log 1 /δ .The idea is to consider the unique matrix M ∈ R m × d p that acts on simple tensors in the waywe have described it recursively above. This matrix could in principle be applied to any vector x ∈ R d p (though it would be slow to realise). We can nevertheless show that this matrix has the( ε, δ, t )-JL Moment Property, which is for parameters ε, δ ∈ [0 , , t ≥
2, and every x ∈ R d with k x k = 1 the statement E h(cid:12)(cid:12) k M x k − (cid:12)(cid:12) t i ≤ ε t δ .It can be shown that M is built from our various S base and T base matrices using three differentoperations: multiplication, direct sum, and row-wise tensoring. In other words, it is sufficient toshow that if Q and Q ′ both have the ( ε, δ, t )-JL Moment Property, then so does QQ ′ , Q ⊕ Q ′ and Q • Q ′ . This turns out to hold for Q ⊕ Q ′ , but QQ ′ and Q • Q ′ are more tricky. (Here ⊕ is thedirect sum and • is the composition of tensoring the rows. See section 2 on notation.)For multiplication, a simple union bound allows us to show that Q (1) Q (2) · · · Q ( p ) has the( pε, pδ, t )-JL Moment Property. This would unfortunately mean a factor of p in the final di-mension. The union bound is clearly suboptimal, since implicitly it is assumes that all the matricesconspire to either shrink or increase the norm of a vector, while in reality with independent ma-trices, we should get a random walk on the real line. Using an intricate decoupling argument, weshow that this is indeed the case, and that Q (1) Q (2) · · · Q ( p ) has the ( √ pε, δ, t )-JL Moment Property,saving a factor of p in the output dimension.Finally we need to analyze Q • Q ′ . Here it is easy to show that the JL Moment Propertydoesn’t in general propagate to Q • Q ′ (consider e.g. Q being constant 0 on its first m/ Q ′ having 0 on its m/ Q • Q ′ behaves well. In particular we show this for matrices with independentsub-Gaussian entries (appendix A.2), and for the so-called Fast Johnson Lindenstrauss construc-tion [AC06] (Lemma 21). The main tool here is a higher order version of the classical Khintchine x ⊗ p denotes x ⊗ x · · · ⊗ x | {z } p terms , the p -fold self-tensoring of x . E h h σ (1) ⊗ σ (2) ⊗ · · · ⊗ σ ( p ) , x i t i when σ (1) , . . . σ ( p ) areindependent sub-Gaussian vectors (Lemma 19). Optimizing the dependence on s λ . Our proof of Theorem 3 relies on instantiating our frame-work with
OSNAP at the leaves of the tree ( T base ) and a novel version of the SRHT that we referto as
TensorSRHT at the internal nodes of the tree. We outline the analysis here. In order to showthat our sketch preserves norms, let y be an arbitrary vector in R d p . Then in the bottom level ofthe tree, we can view our sketch as T × T × · · · × T p , where × for denotes the tensor product ofmatrices (see Definition 5). Then, we can reshape y to be a d q − × d matrix Y , and the entries of T × T × · · · × T p y are in bijective correspondence with those of T × T × · · · × T p − Y T ⊤ p . By defi-nition of T p , it preserves the Frobenius norm of Y , and consequently, we can replace Y with Y T ⊤ p .We next look at ( T × T × · · · × T p − ) Z ( I d × T ⊤ p − ), where Z is the d p − × d matrix with entriesin bijective correspondence with those of Y T ⊤ p . Then we know that T p − preserves the Frobeniusnorm of Z . Iterating in this fashion, this means the first layer of our tree preserves the norm of y , provided we union bound over O ( p ) events that a sketch preserves a norm of an intermediatematrix. The core of the analysis consists of applying spectral concentration bounds based analysisto sketches that act on blocks of the input vector in a correlated fashion. We give the details inSection 5. Sketching the Gaussian kernel.
Our techniques yield the first oblivious sketching method forthe Gaussian kernel with target dimension that does not depend exponentially on the dimensionalityof the input data points. The main idea is to Taylor expand the Gaussian function and apply oursketch for the polynomial kernel to the elements of the expansion. It is crucial here that thetarget dimension of our sketch for the polynomial kernel depends only polynomially on the degree,as otherwise we would not be able to truncate the Taylor expansion sufficiently far in the tail(the number of terms in the Taylor expansion depends on the radius of the dataset and dependslogarithmically on the regularization parameter). Overall, our Gaussian kernel sketch has optimaltarget dimension up to polynomial factors in the radius dataset and logarithmic factors in thedataset size. Moreover, it is the first subspace embedding of Gaussian kernel which runs in inputsparsity time e O (nnz( X )) for datasets with polylogarithmic radius. The result is summarized inTheorem 5, and the analysis is presented in Section 6. Work related to sketching of tensors and explicit kernel embeddings is found in fields ranging frompure mathematics to physics and machine learning. Hence we only try to compare ourselves withthe four most common types we have found.
Johnson-Lindenstrauss Transform
A cornerstone result in the field of subspace embeddingsis the Johnson-Lindenstrauss lemma [JLS86]: “For all ε ∈ [0 , n, d ≥
1, and X ⊆ R d with | X | = n there exists f : R d → R m with m = O ( ε − log( n )), such that (1 − ε ) k x − y k ≤k f ( x ) − f ( y ) k ≤ (1 + ε ) k x − y k for every x, y ∈ X .“It has been shown in [CW13, CNW16a] there exists a constant C , so that, for any r -dimensionalsubspace U ⊆ R d , there exists a subset X ⊆ U with | X | ≤ C r , such that max x ∈ U (cid:12)(cid:12) k f ( x ) k − k x k (cid:12)(cid:12) ≤ O (max x ∈ X (cid:12)(cid:12) k f ( x ) k − k x k (cid:12)(cid:12) ). So the Johnson-Lindenstrauss Lemma implies that there exists asubspace embedding with m = O ( ε − r ). 11t is not enough to know that the subspace embedding exists, we also need the to find thedimension-reducing map f , and we want the map f to be applied to the data quickly. Achlioptasshowed that if Π ∈ R m × d is random matrix with i.i.d. entries where Π i,j = 0 with probability2 /
3, and otherwise Π i,j is uniform in {− , } , and m = O ( ε − log(1 /δ )), then k Π x k = (1 ± ε ) k x k with probability 1 − δ for any x ∈ R d [Ach03]. This gives a running time of O ( m nnz ( x )) to sketcha vector x ∈ R d . Later, the Fast Johnson Lindenstrauss Transform [AC06], which exploits theFast Fourier Transform, improved the running time for dense vectors to O ( d log d + m ). The re-lated Subsampled Randomized Hadamard Transform has been extensively studied [Sar06, DMM06,DMMS11, Tro11, DMMW12, LDFU13], which uses O ( d log d ) time but obtains suboptimal dimen-sion O ( ε − log(1 /δ ) ), hence it can not use the above argument to get subspace embedding, but ithas been proven in [Tro11] that if m = O ( ε − ( r + log(1 /δ ) )), then one get a subspace embedding.The above improvements has a running time of O ( d log d ), which can be worse than O ( m nnz ( x ))if x ∈ R d is very sparse. This inspired a line of work trying to obtain sparse Johnson Lindenstrausstransforms [DKS10, KN14, NN13, Coh16]. They obtain a running time of O ( ε − log(1 /δ )nnz ( x )).In [NN13] they define the ONSAP transform and investigate the trade-off between sparsity andsubspace embedding dimension. This was further improved in [Coh16].In the context of this paper all the above mentioned methods have the same shortcoming,they do not exploit the extra structure of the tensors. The Subsampled Randomized HadamardTransform have a running time of Ω( pd p log( p )) in the model considered in this paper, and thesparse embeddings have a running time of Ω(nnz( x ) p ). This is clearly unsatisfactory and inspiredthe TensorSketch [PP13, ANW14], which has a running time of Ω( p nnz( x )). Unfortunately, theyneed m = Ω(3 p ε − δ − ) and one of the main contributions of this paper is get rid of the exponentialdependence on p . Approximate Kernel Expansions
A classic result by Rahimi and Recht [RR08] shows how tocompute an embedding for any shift-invariant kernel function k ( k x − y k ) in time O ( dm ). In [LSS14]this is improved to any kernel on the form k ( h x, y i ) and time O (( m + d ) log d ), however the methoddoes not handle kernel functions that can’t be specified as a function of the inner product, and itdoesn’t provide subspace embeddings. See also [MM17] for more approaches along the same line.Unfortunately, these methods are unable to operate in input sparsity time and their runtime atbest is off by an s λ factor. Tensor Sparsification
There is also a literature of tensor sparsification based on sampling [NDT15],however unless the vectors tensored are already very smooth (such as ± Hyper-plane rounding
An alternative approach is to use hyper-plane rounding to get vectorson the form ±
1. Let ρ = h x,y ik x kk y k , then we have h sign ( M x ) , sign ( M y ) i = P i sign ( M i x ) sign ( M i y ) = P i X i , where X i are independent Rademachers with µ/m = E [ X i ] = 1 − π arccos ρ = π ρ + O ( ρ ).By tail bounds then Pr[ |h sign ( M x ) , sign ( M y ) i − µ | > ǫµ ] ≤ − min( ǫ µ σ , ǫµ )). Taking m = O ( ρ − ǫ − log 1 /δ ) then suffices with high probability. After this we can simply sample fromthe tensor product using simple sample bounds.The sign-sketch was first brought into the field of data-analysis by [Cha02] and [Val15] was thefirst, in our knowledge, to use it with tensoring. The main issue with this approach is that it isn’ta linear sketch, which hinders the applications we consider in this paper, such as kernel low rank12pproximation, CCA, PCR, and ridge regression. It also takes dm time to calculate M x and
M y which is unsatisfactory.
In section 2 we introduce basic definitions and notations that will be used throughout the paper.Section 3 introduces our recursive construction of the sketch which is our main technical toolfor sketching high degree tensor products. Section 4 analyzes how the moment bounds propagatethrough our recursive construction thereby proving Theorems 1 and 2 which have linear dependenceon the degree q . Section 5 introduces a high probability Oblivious Subspace Embedding with lineardependence on the statistical dimension thereby proving Theorem 3. Finally, section 6 uses thetools that we build for sketching polynomial kernel and proves that, for the first time, Gaussiankernel can be sketched without an exponential loss in the dimension with provable guarantees.Appendix A proves lower bounds. In this section we introduce notation and present useful properties of tensor product of vectors andmatrices as well as properties of linear sketch matrices.We denote the tensor product of vectors a, b by a ⊗ b which is formally defined as follows, Definition 4 (Tensor product of vectors) . Given a ∈ R m and b ∈ R n we define the twofold tensorproduct a ⊗ b to be a ⊗ b = a b a b · · · a b n a b a b · · · a b n ... ... ... a m b a m b · · · a m b n ∈ R m × n . Although tensor products are multidimensional objects, it is often convenient to associate them withsingle-dimensional vectors. In particular, we will often associate a ⊗ b with the single-dimensionalcolumn vector ( a b , a b , . . . , a m b , a b , a b , . . . , a m b , . . . , a m b n ). Given v ∈ R d , v ∈ R d · · · v k ∈ R d k , we define the k -fold tensor product v ⊗ v · · · ⊗ v k ∈ R d d ··· d k . For shorthand, we use thenotation v ⊗ k to denote v ⊗ v · · · ⊗ v | {z } k terms , the k -fold self-tensoring of v .Tensor product can be naturally extended to matrices which is formally defined as follows, Definition 5.
Given A ∈ R m × n , A ∈ R m × n , · · · , A k ∈ R m k × n k , we define A × A × · · · × A k to be the matrix in R m m ··· m k × n n ··· n k whose element at row ( i , · · · , i k ) and column ( j , · · · , j k ) is A ( i , j ) · · · A k ( i k , j k ). As a consequence the following holds for any v ∈ R n , v ∈ R n , · · · , v k ∈ R n k : ( A × A × · · · × A k )( v ⊗ v ⊗ · · · ⊗ v k ) = ( A v ) ⊗ ( A v ) ⊗ · · · ⊗ ( A k v k ).The tensor product has the useful mixed product property , given in the following Claim, Claim 6.
For every matrices
A, B, C, D with appropriate sizes, the following holds, ( A · B ) × ( C · D ) = ( A × C ) · ( B × D ) . We also define the column wise tensoring of matrices as follows,13 efinition 6.
Given A ∈ R m × n , A ∈ R m × n , · · · , A k ∈ R m k × n , we define A ⊗ A ⊗ · · · ⊗ A k tobe the matrix in R m m ··· m k × n whose j th column is A j ⊗ A j ⊗ · · · ⊗ A jk for every j ∈ [ n ], where A jl is the j th column of A l for every l ∈ [ k ].Similarly the row wise tensoring of matrices are introduced in the following Definition, Definition 7.
Given A ∈ R m × n , A ∈ R m × n , · · · , A k ∈ R m × n k , we define A • A • · · · A k to bethe matrix in R m × n n ··· n k whose j th row is ( A j ⊗ A j ⊗ · · · ⊗ A kj ) ⊤ for every j ∈ [ m ], where A lj isthe j th row of A l as a column vector for every l ∈ [ k ]. Definition 8.
Another related operation is the direct sum for vectors: x ⊕ y = [ xy ] and for matrices: A ⊕ B = (cid:2) A B (cid:3) . When the sizes match up, we have ( A ⊕ B )( x ⊕ y ) = Ax + By . Also notice thatif I k is the k × k identity matrix, then I k ⊗ A = A ⊕ · · · ⊕ A | {z } k times . In this section, we present the basic construction for our new sketch. Suppose we are given v , v , . . . v q ∈ R m . Our main task is to map the tensor product v ⊗ v ⊗ · · · ⊗ v q to a vectorof size m using a linear sketch.Our sketch construction is recursive in nature. To illustrate the general idea, let us first considerthe case in which q ≥ v ⊗ v ) , ( v ⊗ v ) , · · · , ( v q − ⊗ v q ) ∈ R m independently using independent instances of some linear basesketch (e.g., degree two TensorSketch , Sub-sampled Randomized Hadamard Transform (
SRHT ), CountSketch , OSNAP ). The number of vectors after this step is half of the number of vectors thatwe began with. The natural idea is to recursively apply the same procedure on the sketched tensorswith half as many instances of the base sketch in each successive step.More precisely, we first choose a (randomized) base sketch S base : R m → R m that sketchestwofold tensor products of vectors in R m (we will describe how to choose the base sketch later).Then, for any power of two q ≥
2, we define Q q : R m q → R m on v ⊗ v ⊗ · · · ⊗ v q recursively asfollows: Q q ( v ⊗ v ⊗ · · · ⊗ v q ) = Q q/ (cid:16) S q ( v ⊗ v ) ⊗ S q ( v ⊗ v ) ⊗ · · · ⊗ S qq/ ( v q − ⊗ v q ) (cid:17) , where S q , S q , · · · , S qq/ : R m → R m are independent instances of S base and Q : R m → R m is simplythe identity map on R m .The above construction of Q q has been defined in terms of its action on q -fold tensor products ofvectors in R m , but it extends naturally to a linear mapping from R m q to R m . The formal definitionof Π q is presented below. Definition 9 (Sketch Q q ) . Let m ≥ S base : R m → R m be alinear map that specifies some base sketch. Then, for any integer power of two q ≥
2, we define Q q : R m q → R m to be the linear map specified as follows: Q q ≡ S · S · · · S q/ · S q , where for each l ∈ { , , · · · , q/ , q } , S l is a matrix in R m l/ × m l defined as S l ≡ S l × S l × · · · × S ll/ , (2)where the matrices S l , · · · , S ll/ ∈ R m × m are drawn independently from a base distribution S base .14 w ⊗ w z = S ( w ⊗ w ) S v ⊗ v w = S ( v ⊗ v ) v v S v ⊗ v w = S ( v ⊗ v ) v v S = S S = S × S Figure 2: Visual illustration of the recursive construction of Q q for degree q = 4. The input tensoris v ⊗ v ⊗ v ⊗ v and the output is z = Q ( v ⊗ v ⊗ v ⊗ v ). The intermediate nodes sketch thetensors w = S ( v ⊗ v ) and w = S ( v ⊗ v ).This sketch construction can be best visualized using a balanced binary tree with q leaves.Figure 2 illustrates the construction of degree 4, Q .For every integer q which is a power of two, by definition of S q in (2) of Definition 9, S q = S q × · · · × S qq/ . Hence, by claim 6 we can write, S q = S q × · · · × S qq/ = (cid:16) S q × · · · × S qq/ − × I m (cid:17) · (cid:16) I m q − × S qq/ (cid:17) . By multiple applications of Claim 6 we have the following claim,
Claim 7.
For every power of two integer q and any positive integer m , if S q is defined as in (2) of Definition 9, then S q = M q/ M q/ − · · · M , where M j = I m q − j × S qq/ − j +1 × I m j − for every j ∈ [ q/ . Embedding R d q : So far we have constructed a sketch Q q for sketching tensor product of vectorsin R m . However, in general the data points can be in a space R d of arbitrary dimension. A naturalidea is to reduce the dimension of the vectors by a mapping from R d to R m and then apply Q q onthe tensor product of reduced data points. The dimensionality reduction defines a linear mappingfrom R d q to R m d which can be represented by a matrix. We denote the dimensionality reductionmatrix by T q ∈ R m q × d q formally defined as follows. Definition 10.
Let m, d be positive integers and let T base : R d → R m be a linear map that specifiessome base sketch. Then for any integer power of two q we define T q to be the linear map specifiedas follows, T q = T × T × · · · × T q , where the matrices T , · · · , T q are drawn independently from T base . Discussion:
Similar to Claim 7, the transform T q can be expressed as the following productof q matrices, T q = M q M q − · · · M , where M j = I d q − j × T q − j +1 × I m j − for every j ∈ [ q ].15ow we define the final sketch Π q : R d q → R m for arbitrary d as the composition of Q q · T q .Moreover, to extend the definition to arbitrary q which is not necessarily a power of two we tensorthe input vector with a standard basis vector a number of times to make the input size compatiblewith the sketch matrices. The sketch Π q is formally defined below, Definition 11 (Sketch Π p ) . Let m, d be positive integers and let S base : R m → R m and T base : R d → R m be linear maps that specify some base sketches. Then, for any integer p ≥ p : R d p → R m to be the linear map specified as follows:1. If p is a power of two, then Π p is defined asΠ p = Q p · T p , where Q p ∈ R m × m p and T p ∈ R m p × d p are sketches as in Definitions 9 and 10 respectively.2. If p is not a power of two, then let q = 2 ⌈ log p ⌉ be the smallest power of two integer that isgreater than p and we define Π p asΠ p ( v ) = Π q (cid:16) v ⊗ e ⊗ ( q − p )1 (cid:17) , for every v ∈ R d p , where e ∈ R d is the standard basis column vector with a 1 in the firstcoordinate and zeros elsewhere, and Π q is defined as in the first part of this definition.Algorithm 1 sketches x ⊗ p for any integer p and any input vector x ∈ R d using the sketch Π p asin Definition 11, i.e., computes Π p ( x ⊗ p ). Algorithm 1
Sketch for the Tensor x ⊗ p input : vector x ∈ R d , dimension d , degree p , number of buckets m , base sketches S base ∈ R m × m and T base ∈ R m × d output : sketched vector z ∈ R m Let q = 2 ⌈ log p ⌉ Let T , · · · T q be independent instances of the base sketch T base : R d → R m For every j ∈ { , , · · · , p } , let Y j = T j · x For every j ∈ { p + 1 , · · · , q } , let Y j = T j · e , where e is the standard basis vector in R d withvalue 1 in the first coordinate and zero elsewhere for l = 1 to log q do Let S q/ l − , · · · , S q/ l − q/ l be independent instances of the base sketch S base : R m → R m For every j ∈ { , · · · , q/ l } let Y lj = S q/ l − j (cid:16) Y l − j − ⊗ Y l − j (cid:17) end for return z = Y log q We show the correctness of Algorithm 1 in the next lemma.
Lemma 8.
For any positive integers d , m , and p , any distribution on matrices S base : R m → R m and T base : R d → R m which specify some base sketches, any vector x ∈ R d , Algorithm 1 computes Π p ( x ⊗ p ) as in Definition 11.Proof. For every input vector x ∈ R d to Algorithm 1, the vectors Y , · · · , Y p , are computed inlines 3 and 4 of algorithm as Y j = T j · x, for all j ∈ { , · · · , p } , and, Y j ′ = T j ′ · e , for all j ∈ { q + 1 , · · · , q } . Therefore, as shown in Definition 5, the following holds,16 ⊗ · · · ⊗ Y p = T × · · · × T q · (cid:16) x ⊗ p ⊗ e ⊗ ( q − p )1 (cid:17) . From the definition of sketch T q as per Definition 10 it follows that, Y ⊗ · · · ⊗ Y q = T q · (cid:16) x ⊗ p ⊗ e ⊗ ( q − p )1 (cid:17) . (3)The algorithm computes Y l , · · · Y lq/ l in line 7 as, Y lj = S q/ l − j (cid:16) Y l − j − ⊗ Y l − j (cid:17) , for every j ∈{ , · · · , q/ l } and every l ∈ { , · · · , log q } in a for loop. Therefore, by Claim 6, Y l ⊗ · · · ⊗ Y lq/ l = (cid:18) S q/ l − × · · · × S q/ l − q/ l (cid:19) · Y l − ⊗ · · · ⊗ Y l − q/ l − . By the definition of the sketch S q/ l − in (2) of Definition 9 we have that for every l ∈ { , · · · , log q } , Y l ⊗ · · · ⊗ Y lq/ l = S q/ l − · Y l − ⊗ · · · ⊗ Y l − q/ l − . Therefore, by recursive application of the above identity we get that, Y log p = S · S · · · S q/ · S q · Y ⊗ · · · ⊗ Y q . From the definition of sketch Q q as in Definition 9 it follows that, Y log q = Q q · Y ⊗ · · · ⊗ Y q . Substituting Y ⊗ · · · ⊗ Y q from (3) in the above gives, z = ( Q q · T q ) · (cid:16) x ⊗ p ⊗ e ⊗ ( q − p )1 (cid:17) , where byDefinition 11 we have that, z = Π p ( x ⊗ p ) . Choices of the Base Sketches S base and T base : We present formal definitions for various choicesof the base sketches S base and T base that will be used for our sketch construction Π q of Definition11. We start by briefly recalling the CountSketch [CCFC02].
Definition 12 ( CountSketch transform) . Let h : [ d ] → [ m ] be a 3-wise independent hash func-tion and also let σ : [ d ] → {− , +1 } be a 4-wise independent random sign function. Then, the CountSketch transform, S : R d → R m , is defined as follows; for every i ∈ [ d ] and every r ∈ [ m ], S r,i = σ ( i ) · [ h ( i ) = r ] . Another base sketch that we consider is the
TensorSketch of degree two [Pag13] defined asfollows.
Definition 13 (degree two
TensorSketch transform) . Let h , h : [ d ] → [ m ] be 3-wise independenthash functions and also let σ , σ : [ d ] → {− , +1 } be 4-wise independent random sign functions.Then, the degree two TensorSketch transform, S : R d × R d → R m , is defined as follows; for every i, j ∈ [ d ] and every r ∈ [ m ], S r, ( i,j ) = σ ( i ) · σ ( j ) · [ h ( i ) + h ( j ) = r mod m ] . Remark: S ( x ⊗ ) can be computed in O ( m log m + nnz( x )) time using the Fast Fourier Trans-form.Now let us briefly recall the SRHT [AC06]. 17 efinition 14 (Subsampled Randomized Hadamard Transform ( SRHT )) . Let D be a d × d diagonalmatrix with independent Rademacher random variables along the diagonal. Also, let P ∈ { , } m × d be a random sampling matrix in which each row contains a 1 at a uniformly distributed coordinateand zeros elsewhere, and let H be a d × d Hadamard matrix. Then, the
SRHT , S ∈ R m × d , is S = √ m P HD .We now define a variant of the
SRHT which is very efficient for sketching x ⊗ which we call the TensorSRHT . Definition 15 (Tensor Subsampled Randomized Hadamard Transform (
TensorSRHT )) . Let D and D be two independent d × d diagonal matrices, each with diagonal entries given by independentRademacher variables. Also let P ∈ { , } m × d be a random sampling matrix in which each rowcontains exactly one uniformly distributed nonzero element which has value one, and let H be a d × d Hadamard matrix. Then, the
TensorSRHT is defined to be S : R d × R d → R m given by S = √ m P · ( HD × HD ). Remark: S ( x ⊗ ) can be computed in time O ( d log d + m ) using the FFT algorithm.Another sketch which is particularly efficient for sketching sparse vectors with high probabilityis the OSNAP transform [NN13], defined as follows.
Definition 16 ( OSNAP transform) . For every sparsity parameter s , target dimension m , andpositive integer d , the OSNAP transform with sparsity parameter s is defined as, S r,j = r s · δ r,j · σ r,j , for all r ∈ [ m ] and all j ∈ [ d ], where σ r,j ∈ {− , +1 } are independent and uniform Rademacherrandom variables and δ r,j are Bernoulli random variables satisfying,1. For every i ∈ [ d ], P r ∈ [ m ] δ r,i = s . That is, each column of S has exactly s non-zero entries.2. For all r ∈ [ m ] and all i ∈ [ d ], E [ δ r,i ] = s/m .3. The δ r,i ’s are negatively correlated: ∀ T ⊂ [ m ] × [ d ], E hQ ( r,i ) ∈ T δ r,i i ≤ Q ( r,i ) ∈ T E [ δ r,i ] = ( sm ) | T | . p There are various desirable properties that we would like a linear sketch to satisfy. One suchproperty which is central to our main results is the
JL Moment Property . In this section weprove Theorem 1 and Theorem 2 by propagating the
JL Moment Property through our recursiveconstruction from Section 3. The
JL Moment Property captures a bound on the moments of thedifference between the Euclidean norm of a vector and its Euclidean norm after applying the sketchon it. The JL Moment Property proves to be a powerful property for a sketch and we will show thatit implies the Oblivious Subspace Embedding as well as the Approximate Matrix Product propertyfor linear sketches.In section 4.1 we choose S base and T base to be TensorSketch and
CountSketch respectively. Thenwe propagate the second JL Moment through the sketch construction Π p and thereby prove Theorem1. In section 4.2 we propagate the higher JL Moments through our recursive construction Π p asper Definition 11 with TensorSRHT at the internal nodes ( S base ) and OSNAP at the leaves ( T base ),thereby proving Theorem 2.To make the notation less heavy we will use k X k L t for the t th moment of a random variable X .This is formally defined below. 18 efinition 17. For every integer t ≥ X ∈ R , we write k X k L t = (cid:16) E h | X | t i(cid:17) /t . Note that k X + Y k L t ≤ k X k L t + k Y k L t for any random variables X, Y by the Minkowski Inequality.We now formally define the JL Moment Property of sketches.
Definition 18 (JL Moment Property) . For every positive integer t and every δ, ε ≥
0, we say adistribution over random matrices S ∈ R m × d has the ( ǫ, δ, t )-JL-moment property, when (cid:13)(cid:13)(cid:13) k Sx k − (cid:13)(cid:13)(cid:13) L t ≤ ǫδ /t and E h k Sx k i = 1for all x ∈ R d such that k x k = 1.The JL Moment Property directly implies the following moment bound for the inner productof two vectors: Lemma 9 (Two vector JL Moment Property) . For any x, y ∈ R d , if S has the ( ε, δ, t ) -JL MomentProperty, then (cid:13)(cid:13)(cid:13) ( Sx ) ⊤ ( Sy ) − x ⊤ y (cid:13)(cid:13)(cid:13) L t ≤ εδ /t k x k k y k . (4) Proof.
We can assume by linearity of the norms that k x k = k y k = 1. We then use that k x − y k = k x k + k y k − x ⊤ y and k x + y k = k x k + k y k + 2 x ⊤ y such that x ⊤ y = ( k x + y k − k x − y k ) / (cid:13)(cid:13)(cid:13) ( Sx ) ⊤ ( Sy ) − x ⊤ y (cid:13)(cid:13)(cid:13) L t = (cid:13)(cid:13)(cid:13) k Sx + Sy k − k x + y k − k Sx − Sy k + k x − y k (cid:13)(cid:13)(cid:13) L t / ≤ (cid:16)(cid:13)(cid:13)(cid:13) k S ( x + y ) k − k x + y k (cid:13)(cid:13)(cid:13) L t + (cid:13)(cid:13)(cid:13) k S ( x − y ) k − k x − y k (cid:13)(cid:13)(cid:13) L t (cid:17) / ≤ εδ /t ( k x + y k + k x − y k ) / εδ /t ( k x k + k y k ) / εδ /t . We will also need the Strong JL Moment Property, which is a sub-Gaussian bound on thedifference between the Euclidean norm of a vector and its Euclidean norm after applying the sketchon it.
We will also need the Strong JL Moment Property, which is a sub-Gaussian bound on the moments of the difference between the squared Euclidean norm of a vector and the squared Euclidean norm of its image under the sketch.

Definition 19 (Strong JL Moment Property). For every ε, δ > 0, we say a matrix M ∈ ℝ^{m×d} has the Strong (ε, δ)-JL Moment Property when
‖ ‖Mx‖₂² − 1 ‖_{L^t} ≤ (ε/e) √( t / log(1/δ) )   and   E[‖Mx‖₂²] = 1,
for all x ∈ ℝ^d with ‖x‖₂ = 1 and every integer 1 ≤ t ≤ log(1/δ).

Remark 1. It should be noted that if a matrix M ∈ ℝ^{m×d} has the Strong (ε, δ)-JL Moment Property then it has the (ε, δ, log(1/δ))-JL Moment Property, since
‖ ‖Mx‖₂² − 1 ‖_{L^{log(1/δ)}} ≤ (ε/e) √( log(1/δ) / log(1/δ) ) = ε/e = ε δ^{1/log(1/δ)}.

To prove that Π^p is an Oblivious Subspace Embedding and that Π^p has the Approximate Matrix Multiplication Property, it suffices to prove that Π^q has the JL Moment Property, where q is the smallest power of two such that q ≥ p, as in Definition 11. This reduction will be the main component of the proofs of Theorem 1 and Theorem 2.

Lemma 10.
For every positive integers n, p, d, every ε, δ ∈ [0, 1], and every µ ≥ 0, let q = 2^{⌈log₂(p)⌉} and let Π^p ∈ ℝ^{m×d^p} and Π^q ∈ ℝ^{m×d^q} be defined as in Definition 11, for some base sketches S_base ∈ ℝ^{m×m²} and T_base ∈ ℝ^{m×d}. If Π^q is an (ε, δ, µ, d^q, n)-Oblivious Subspace Embedding then Π^p is an (ε, δ, µ, d^p, n)-Oblivious Subspace Embedding. Also, if Π^q has the (ε, δ)-Approximate Matrix Multiplication Property then Π^p has the (ε, δ)-Approximate Matrix Multiplication Property.

Proof. We will prove a correspondence between Π^p and Π^q. Let E ∈ ℝ^{d×n} be the matrix whose first row is equal to one and which is zero everywhere else. By Definition 11 we have, for any matrix A ∈ ℝ^{d^p×n}, that Π^p A = Π^q (A ⊗ E^{⊗(q−p)}). A simple calculation shows that for any matrices A, B ∈ ℝ^{d^p×n},
(A ⊗ E^{⊗(q−p)})^⊤ (B ⊗ E^{⊗(q−p)}) = (A^⊤B) ∘ ((E^{⊗(q−p)})^⊤ E^{⊗(q−p)}) = A^⊤B,
where ∘ denotes the Hadamard product and the last equality follows since (E^{⊗(q−p)})^⊤ E^{⊗(q−p)} is the all-ones matrix. This implies that ‖A ⊗ E^{⊗(q−p)}‖_F = ‖A‖_F and s_λ((A ⊗ E^{⊗(q−p)})^⊤ (A ⊗ E^{⊗(q−p)})) = s_λ(A^⊤A).

Now assume that Π^q is an (ε, δ, µ, d^q, n)-Oblivious Subspace Embedding, and let A ∈ ℝ^{d^p×n} and λ ≥ 0 be such that s_λ(A^⊤A) ≤ µ. Define A′ = A ⊗ E^{⊗(q−p)}; then
Pr[ (1 − ε)(A^⊤A + λI_n) ⪯ (Π^p A)^⊤ Π^p A + λI_n ⪯ (1 + ε)(A^⊤A + λI_n) ]
 = Pr[ (1 − ε)(A′^⊤A′ + λI_n) ⪯ (Π^q A′)^⊤ Π^q A′ + λI_n ⪯ (1 + ε)(A′^⊤A′ + λI_n) ] ≥ 1 − δ,
where we have used that s_λ(A′^⊤A′) = s_λ(A^⊤A) ≤ µ. This shows that Π^p is an (ε, δ, µ, d^p, n)-Oblivious Subspace Embedding.

Assume that Π^q has the (ε, δ)-Approximate Matrix Multiplication Property, and let C, D ∈ ℝ^{d^p×n}. Define C′ = C ⊗ E^{⊗(q−p)} and D′ = D ⊗ E^{⊗(q−p)}; then
Pr[ ‖(Π^p C)^⊤ Π^p D − C^⊤D‖_F ≥ ε ‖C‖_F ‖D‖_F ] = Pr[ ‖(Π^q C′)^⊤ Π^q D′ − C′^⊤D′‖_F ≥ ε ‖C′‖_F ‖D′‖_F ] ≤ δ,
where we have used that ‖C′‖_F = ‖C‖_F, ‖D′‖_F = ‖D‖_F, and C′^⊤D′ = C^⊤D. This shows that Π^p has the (ε, δ)-Approximate Matrix Multiplication Property.
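The padding trick used in this proof is easy to check numerically: tensoring each column with extra copies of e₁ leaves Gram matrices (and hence Frobenius norms and statistical dimensions) unchanged. Below is a toy verification of ours, with hypothetical small dimensions:

    import numpy as np

    rng = np.random.default_rng(3)
    d, n, p, q = 3, 4, 2, 4          # q = 2^(ceil log2 p) is the next power of two >= p
    e1 = np.zeros(d)
    e1[0] = 1.0

    X = rng.standard_normal((d, n))
    # Degree-p columns x_j (x) x_j, and the same columns padded to degree q with e1 factors.
    A  = np.stack([np.kron(X[:, j], X[:, j]) for j in range(n)], axis=1)
    Ap = np.stack([np.kron(A[:, j], np.kron(e1, e1)) for j in range(n)], axis=1)

    print(np.allclose(A.T @ A, Ap.T @ Ap))   # True: the Gram matrix is preserved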
Lemma 11. For any ε, δ ∈ [0, 1] and t ≥ 2, if M ∈ ℝ^{m×d} is a random matrix with the (ε, δ, t)-JL Moment Property then M has the (ε, δ)-Approximate Matrix Multiplication Property. Furthermore, for any µ > 0, if M ∈ ℝ^{m×d} is a random matrix with the (ε/µ, δ, t)-JL Moment Property then, for every positive integer n ∈ ℤ, M is an (ε, δ, µ, d, n)-OSE.

Proof. Approximate Matrix Multiplication. Let
C, D ∈ R d × n . We will prove that (cid:13)(cid:13)(cid:13) k ( M C ) ⊤ M D − C ⊤ D k F (cid:13)(cid:13)(cid:13) L t ≤ εδ /t k C k F k D k F . (5)Then Markov’s inequality will give us the result. Using the triangle inequality together withLemma 9 we get that: (cid:13)(cid:13)(cid:13) k ( M C ) ⊤ M D − C ⊤ D k F (cid:13)(cid:13)(cid:13) L t = (cid:13)(cid:13)(cid:13) k ( M C ) ⊤ M D − C ⊤ D k F (cid:13)(cid:13)(cid:13) / L t/ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i,j ∈ [ n ] (cid:16) ( M C i ) ⊤ M D j − C ⊤ i D j (cid:17) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t/ ≤ s X i,j ∈ [ n ] (cid:13)(cid:13) ( M C i ) ⊤ M D j − C ⊤ i D j (cid:13)(cid:13) L t ≤ s X i,j ∈ [ n ] ε δ /t k C i k k D j k = εδ /t k C k F k D k F . Using Markov’s inequality we now get thatPr h k ( M C ) ⊤ M D − C ⊤ D k F ≥ ε k C k F k D k F i ≤ (cid:13)(cid:13)(cid:13) k ( M C ) ⊤ M D − C ⊤ D k F (cid:13)(cid:13)(cid:13) tL t ε t k C k tF k D k tF ≤ δ . Oblivious Subspace Embedding.
We will prove that for any λ ≥ A ∈ R d × n ,(1 − ε )( A ⊤ A + λI n ) (cid:22) ( M A ) ⊤ M A + λI n (cid:22) (1 + ε )( A ⊤ A + λI n ) , (6)holds with probability at least 1 − (cid:16) s λ ( A ⊤ A ) µ (cid:17) t δ , which will imply our result.We will first consider λ >
0. Then A ⊤ A + λI n is positive definite. Thus, by left and rightmultiplying (6) by ( A ⊤ A + λI n ) − / , we see that (6) is equivalent to(1 − ε ) I n (cid:22) (cid:16) M A ( A ⊤ A + λI n ) − / (cid:17) ⊤ M A ( A ⊤ A + λI n ) − / + λ ( A ⊤ A + λI n ) − (cid:22) (1 + ε ) I n . which, in turn, is implied by the following: (cid:13)(cid:13)(cid:13)(cid:13)(cid:16) M A ( A ⊤ A + λI n ) − / (cid:17) ⊤ M A ( A ⊤ A + λI n ) − / + λ ( A ⊤ A + λI n ) − − I n (cid:13)(cid:13)(cid:13)(cid:13) op ≤ ε . Note that ( A ⊤ A + λI n ) − / A ⊤ A ( A ⊤ A + λI n ) − / = I n − λ ( A ⊤ A + λI n ) − . Letting Z = A ( A ⊤ A + λI n ) − / , we note that it suffices to establish, (cid:13)(cid:13)(cid:13) ( M Z ) ⊤ M Z − Z ⊤ Z (cid:13)(cid:13)(cid:13) op ≤ ε . Using (5) together with Markov’s inequality we get thatPr (cid:20)(cid:13)(cid:13)(cid:13) ( M Z ) ⊤ M Z − Z ⊤ Z (cid:13)(cid:13)(cid:13) op ≥ ε (cid:21) ≤ Pr h(cid:13)(cid:13)(cid:13) ( M Z ) ⊤ M Z − Z ⊤ Z (cid:13)(cid:13)(cid:13) F ≥ ε i ≤ k Z k F µ ! t δ = s λ ( A ⊤ A ) µ ! t δ , k Z k F = tr (cid:16) Z ⊤ Z (cid:17) = tr (cid:18)(cid:16) A ( A ⊤ A + λI n ) − / (cid:17) ⊤ A ( A ⊤ A + λI n ) − / (cid:19) = tr (cid:16) A ⊤ A ( A ⊤ A + λI n ) − (cid:17) = s λ ( A ⊤ A ) . To prove the result for λ = 0 we will use Fatou’s lemma.Pr (cid:20)(cid:16) (1 − ε ) A ⊤ A (cid:22) ( M A ) ⊤ M A (cid:22) (1 + ε ) A ⊤ A (cid:17) C (cid:21) ≤ lim inf λ → + Pr (cid:20)(cid:16) (1 − ε )( A ⊤ A + λI n ) (cid:22) ( M A ) ⊤ M A + λI n (cid:22) (1 + ε )( A ⊤ A + λI n ) (cid:17) C (cid:21) ≤ lim inf λ → + s λ ( A ⊤ A ) µ δ = s ( A ⊤ A ) µ δ , where the last equality follows from continuity of λ s λ ( A ⊤ A ).Our next important observation is that Π q can be written as the product of 2 q − Lemma 12.
For any integer q which is a power of two, Π q : R m q → R m be defined as in Defini-tion 11 for some base sketches S base : R m → R m and T base : R d → R m . Then there exist matrices ( M ( i ) ) i ∈ [ q − , ( M ′ ( j ) ) j ∈ [ q ] and integers ( k i ) i ∈ [ q − , ( k ′ ) i ∈ [ q − , ( l j ) j ∈ [ q ] , ( l ′ j ) j ∈ [ q ] , such that, Π q = M ( q − · . . . M (1) · M ′ ( q ) · . . . · M ′ (1) , and M ( i ) = I k i × S ( i )base × I k ′ i , M ′ ( j ) = I l j × T ( j )base × I l ′ j , where S ( i )base and T ( j )base are independent instancesof S base and T base , for every i ∈ [ q − , j ∈ [ q ] .Proof. We have that Π q = Q q T q by Definition 11. By Definition 9 we have that Q q = S S · · · S q .Claim 7 shows that for every l ∈ { , , · · · q } we can write, S l = M ll/ M ll/ − · . . . · M l , (7)where M lj = I m l − j × S ll/ − j +1 × I m j − for every j ∈ [ l/ T q = M ′ ( q ) · . . . · M ′ (1) , (8)where M ′ ( j ) = I d q − j × T q − j +1 × I m j − for every j ∈ [ q ]. Therefore by combining (7) and (8) we getthe result.We want to show that I k × M × I k ′ inherits the JL properties of M . The following simple factdoes just that. 22 emma 13. Let t ∈ N and α ≥ . If P ∈ R m × d and Q ∈ R m × d are two random matrices (notnecessarily independent), such that, (cid:13)(cid:13)(cid:13) k P x k − k x k (cid:13)(cid:13)(cid:13) L t ≤ α k x k and E h k P x k i = k x k , (cid:13)(cid:13)(cid:13) k Qy k − k y k (cid:13)(cid:13)(cid:13) L t ≤ α k y k and E h k Qy k i = k y k , for any vectors x ∈ R d and y ∈ R d , then (cid:13)(cid:13)(cid:13) k ( P ⊕ Q ) z k − k z k (cid:13)(cid:13)(cid:13) L t ≤ α k z k and E h k ( P ⊕ Q ) z k i = k z k , for any vector z ∈ R d + d .Proof. Let z ∈ R d + d and choose x ∈ R d and y ∈ R d , such that, z = x ⊕ y . Using the triangleinequality, (cid:13)(cid:13)(cid:13) k ( P ⊕ Q ) z k − k z k (cid:13)(cid:13)(cid:13) L t = (cid:13)(cid:13)(cid:13) k P x k + k Qy k − k x k − k y k (cid:13)(cid:13)(cid:13) L t ≤ (cid:13)(cid:13)(cid:13) k P x k − k x k (cid:13)(cid:13)(cid:13) L t + (cid:13)(cid:13)(cid:13) k Qy k − k y k (cid:13)(cid:13)(cid:13) L t ≤ α k x k + α k y k = α k z k . We also see that E h k ( P ⊕ Q ) z k i = E h k P x k i + E h k Qy k i = k x k + k y k = k z k . An easy consequence of this lemma is that for any matrix, S , with the ( ε, δ, t )-JL Moment Prop-erty, I k × S has the ( ε, δ, t )-JL Moment Property. This follows simply from I k × S = S ⊕ S ⊕ . . . ⊕ S | {z } k times .Similarly, S × I k has the ( ε, δ, t )-JL Moment Property, since S × I k is just a reordering of the rowsof I k × S , which trivially does not affect the JL Moment Property. The same arguments show thatif S has the Strong ( ε, δ )-JL Moment Property then I k × S and S × I k has the Strong ( ε, δ )-JLMoment Property. So we conclude the following Lemma 14.
If the matrix S has the (ε, δ, t)-JL Moment Property, then for any positive integers k, k′, the matrix M = I_k × S × I_{k′} has the (ε, δ, t)-JL Moment Property. Similarly, if the matrix S has the Strong (ε, δ)-JL Moment Property, then for any positive integers k, k′, the matrix M = I_k × S × I_{k′} has the Strong (ε, δ)-JL Moment Property.

Now if we can prove that the product of matrices with the JL Moment Property has the JL Moment Property, then Lemma 14 and Lemma 12 would imply that Π^q has the JL Moment Property, which in turn implies that Π^p is an Oblivious Subspace Embedding and has the Approximate Matrix Multiplication Property, by Lemma 11 and Lemma 10. This is exactly what we will do: in Section 4.1 we prove that the product of k independent matrices with the (ε/√k, δ, 2)-JL Moment Property results in a matrix with the (ε, δ, 2)-JL Moment Property, which will give us the proof of Theorem 1; and in Section 4.2 we prove that the product of k independent matrices with the Strong (O(ε/√k), δ)-JL Moment Property results in a matrix with the Strong (ε, δ)-JL Moment Property, which will give us the proof of Theorem 2.

4.1 Second Moment of Π^q (analysis for T_base: CountSketch and S_base: TensorSketch)

In this section we prove Theorem 1 by instantiating our recursive construction from Section 3 with
CountSketch at the leaves and
TensorSketch at the internal nodes of the tree. The proof proceeds by showing the second moment property – i.e., the (ε, δ, 2)-JL moment property: we show that Π^q satisfies the (ε, δ, 2)-JL moment property whenever the base sketches S_base, T_base are chosen from a distribution which satisfies the second moment property. We show that this is the case for CountSketch and
TensorSketch .Lemma 14 together with Lemma 12 show that if the base sketches S base , T base have the JLMoment Property then Π q is the product of 2 q − Lemma 15 (Composition lemma for the second moment) . For any ε, δ ≥ and any integer k if M (1) ∈ R d × d , · · · M ( k ) ∈ R d k +1 × d k are independent random matrices with the (cid:16) ε √ k , δ, (cid:17) -JL-moment property then the product matrix M = M ( k ) · · · M (1) satisfies the ( ε, δ, -JL-momentproperty.Proof. Let x ∈ R d be a fixed unit norm vector. We note that for any i ∈ [ k ] we have that E h k M ( i ) · . . . · M (1) x k (cid:12)(cid:12)(cid:12) M (1) , . . . , M ( i − i = k M ( i − · . . . · M (1) x k . (9)Now we will prove by induction on i ∈ [ k ] that,Var h k M ( i ) · . . . · M (1) x k i ≤ ε δ k ! i − . (10)For i = 1 the result follows from the fact that M (1) has the ( ε/ √ k, δ, i −
1. By the law of total variance we get thatVar h k M ( i ) · . . . · M (1) x k i = E h Var h k M ( i ) · . . . · M (1) x k (cid:12)(cid:12)(cid:12) M (1) , . . . , M ( i − ii + Var h E h k M ( i ) · . . . · M (1) x k (cid:12)(cid:12)(cid:12) M (1) , . . . , M ( i − ii (11)Using (9) and the induction hypothesis we get that,Var h E h k M ( i ) · . . . · M (1) x k (cid:12)(cid:12)(cid:12) M (1) , . . . , M ( i − ii = Var h k M ( i − · . . . M (1) x k i ≤ ε δ k ! i − − . (12)Using that M ( i ) has the ( ε/ √ k, δ, E (cid:20) Var (cid:20)(cid:13)(cid:13)(cid:13) M ( i ) · . . . · M (1) x (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12)(cid:12) M (1) , . . . , M ( i − (cid:21)(cid:21) ≤ E " ε k δ (cid:13)(cid:13)(cid:13) M ( i − · . . . M (1) x (cid:13)(cid:13)(cid:13) = ε δ k Var (cid:20)(cid:13)(cid:13)(cid:13) M ( i − · . . . M (1) x (cid:13)(cid:13)(cid:13) (cid:21) + E (cid:20)(cid:13)(cid:13)(cid:13) M ( i − · . . . M (1) x (cid:13)(cid:13)(cid:13) (cid:21) ! ≤ ε δ k ε δ k ! i − − = ε δ k ε δ k ! i − . (13)Plugging (12) and (13) into (11) gives,Var (cid:20)(cid:13)(cid:13)(cid:13) M ( i ) · . . . · M (1) x (cid:13)(cid:13)(cid:13) (cid:21) ≤ ε δ k ε δ k ! i − + ε δ k ! i − − ε δ k ! i − . Hence, Var h k M x k i ≤ ε δ k ! k − ≤ exp( ε δ/ − ≤ ε δ , which proves that M has the ( ε, δ, q : Corollary 16 (Second moment property for Π q ) . For any power of two integer q let Π q : R m q → R m be defined as in Definition 11, where both of the common distributions S base : R m → R m and T base : R d → R m , satisfy the (cid:16) ε √ q +2 , δ, (cid:17) -JL-moment property. Then it follows that Π q satisfiesthe ( ε, δ, -JL-moment property.Proof. This follows from Lemma 12 and Lemma 15.Now we are ready to prove Theorem 1. Recall that k ( x, y ) = h x, y i q is the polynomial kernel ofdegree q . One can see that k ( x, y ) = h x ⊗ q , y ⊗ q i . Let x , x , · · · x n ∈ R m be an arbitrary dataset of n points in R m . We represent the data points by matrix X ∈ R m × n whose i th column is the vector x i . Let A ∈ R m q × n be the matrix whose i th column is x ⊗ qi for every i ∈ [ n ]. For any regularizationparameter λ >
0, the statistical dimension of A^⊤A is defined as s_λ := tr( (A^⊤A)(A^⊤A + λI_n)^{−1} ).
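The statistical dimension is straightforward to compute from the singular values of A, since tr((A^⊤A)(A^⊤A + λI_n)^{−1}) = Σ_i σ_i²/(σ_i² + λ). The helper below is our own illustrative snippet (toy data, hypothetical dimensions), not part of the paper:

    import numpy as np

    def statistical_dimension(A, lam):
        # s_lambda = tr( A^T A (A^T A + lam I)^{-1} ) = sum_i sigma_i^2 / (sigma_i^2 + lam)
        sv = np.linalg.svd(A, compute_uv=False)
        return float(np.sum(sv ** 2 / (sv ** 2 + lam)))

    rng = np.random.default_rng(4)
    # A matrix with quickly decaying spectrum, so s_lambda << rank(A) for moderate lambda.
    A = rng.standard_normal((100, 20)) @ np.diag(2.0 ** -np.arange(20)) @ rng.standard_normal((20, 30))
    for lam in (1e-3, 1e-1, 10.0):
        print(lam, statistical_dimension(A, lam))   # decreases as lambda grows; at most rank(A)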
Theorem 1. For every positive integers n, p, d and every ε, s_λ > 0, there exists a distribution on linear sketches Π^p ∈ ℝ^{m×d^p} such that: (1) If m = Ω(p s_λ² ε^{−2}), then Π^p is an (ε, 1/10, s_λ, d^p, n)-oblivious subspace embedding as in Definition 2. (2) If m = Ω(p ε^{−2}), then Π^p has the (ε, 1/10)-approximate matrix product property as in Definition 3.
Moreover, for any X ∈ ℝ^{d×n}, if A ∈ ℝ^{d^p×n} is the matrix whose columns are obtained by the p-fold self-tensoring of each column of X, then the matrix Π^p A can be computed using Algorithm 1 in time Õ(pnm + p·nnz(X)).

Proof. Throughout the proof, let δ = 1/10 denote the failure probability, let q = 2^{⌈log₂ p⌉}, and let e₁ ∈ ℝ^d be the column vector with a 1 in the first coordinate and zeros elsewhere. Let Π^p ∈ ℝ^{m×d^p} be the sketch defined in Definition 11, where the base distributions S_base ∈ ℝ^{m×m²} and T_base ∈ ℝ^{m×d} are respectively the standard TensorSketch of degree two and the standard
CountSketch. It is shown in [ANW14] and [CW17] that, for these choices of base sketches, S_base and T_base are both unbiased and satisfy the (ε/√(q+2), δ, 2)-JL-moment property as long as m = Ω(q ε^{−2} δ^{−1}) (see Definition 18).
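Both base sketches are simple to implement. The following toy NumPy versions are ours, not the paper's code (and use fully random hash functions rather than the limited-independence hash families of the formal definitions): a CountSketch, and a degree-two TensorSketch computed via the FFT in O(d + m log m) time without ever forming the d²-dimensional tensor.

    import numpy as np

    def countsketch(x, m, h, g):
        # CountSketch of x with hash h : [d] -> [m] and signs g : [d] -> {-1, +1}.
        y = np.zeros(m)
        np.add.at(y, h, g * x)
        return y

    def tensorsketch2(x1, x2, m, h1, h2, g1, g2):
        # Degree-two TensorSketch of x1 (x) x2: multiply the FFTs of two CountSketches,
        # which circularly convolves them.
        f1 = np.fft.rfft(countsketch(x1, m, h1, g1))
        f2 = np.fft.rfft(countsketch(x2, m, h2, g2))
        return np.fft.irfft(f1 * f2, n=m)

    rng = np.random.default_rng(5)
    d, m = 50, 2048
    h1, h2 = rng.integers(0, m, d), rng.integers(0, m, d)
    g1, g2 = rng.choice([-1.0, 1.0], d), rng.choice([-1.0, 1.0], d)
    x = rng.standard_normal(d)
    z = tensorsketch2(x, x, m, h1, h2, g1, g2)
    print(np.linalg.norm(z) ** 2, np.linalg.norm(x) ** 4)   # agree up to ~ 1/sqrt(m) relative error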
Oblivious Subspace Embedding. Let m = Ω(q s_λ² δ^{−1} ε^{−2}) be an integer. Then S_base and T_base have the (ε/(√(q+2)·s_λ), δ, 2)-JL Moment Property. Thus, using Corollary 16, we conclude that Π^q has the (ε/s_λ, δ, 2)-JL Moment Property. Lemma 11 then implies that Π^q is an (ε, δ, s_λ, d^q, n)-Oblivious Subspace Embedding, and by Lemma 10 we get that Π^p is an (ε, δ, s_λ, d^p, n)-Oblivious Subspace Embedding.
Approximate Matrix Multiplication. Let m = Ω(q δ^{−1} ε^{−2}). Then S_base and T_base have the (ε/√(q+2), δ, 2)-JL Moment Property. Thus, using Corollary 16, we conclude that Π^q has the (ε, δ, 2)-JL Moment Property. By Lemma 11, Π^q has the (ε, δ)-Approximate Matrix Multiplication Property, and by Lemma 10 we get that Π^p has the (ε, δ)-Approximate Matrix Multiplication Property.

Runtime of Algorithm 1 when the base sketch S_base is TensorSketch of degree two and T_base is CountSketch: We compute the time of running Algorithm 1 on a vector x. Computing Y⁰_j for each j in lines 3 and 4 of the algorithm requires applying a CountSketch to either x or e₁, which takes time O(nnz(x)); therefore computing all the Y⁰_j's takes time O(q·nnz(x)). Computing each Y^l_j for l ≥ 1 requires applying a degree-two TensorSketch with input dimension m² and target dimension m to Y^{l−1}_{2j−1} ⊗ Y^{l−1}_{2j}, which takes time O(m log m); therefore computing the Y^l_j's for all l ≥ 1 and all j takes time O(q·m log m). Note that q ≤ 2p, and hence the total running time of Algorithm 1 on one vector x is O(p·m log m + p·nnz(x)). Sketching the n columns of a matrix X ∈ ℝ^{d×n} takes time O(p(nm log m + nnz(X))).

4.2 Higher Moments of Π^q (analysis for T_base: OSNAP and S_base: TensorSRHT)

In this section we prove Theorem 2 by instantiating our recursive construction of Section 3 with
OSNAP at the leaves and
TensorSRHT at the internal nodes. The proof proceeds by showing the Strong JL Moment Property for our sketch Π^q; if a sketch satisfies the Strong JL Moment Property then it is straightforwardly an OSE and has the approximate matrix product property. This section has two goals: the first is to show that the SRHT, and
TensorSRHT as well as
OSNAP transform all satisfy the Strong JL Moment Property. The second goal of this section is to prove that our sketch construction Π^q inherits the Strong JL Moment Property from the base sketches S_base, T_base.

In this section we will need Khintchine's inequality.

Lemma 17 (Khintchine's inequality [HM07]). Let t ≥ 1 be an integer, let x ∈ ℝ^d, and let (σ_i)_{i∈[d]} be independent Rademacher ±1 random variables. Then
‖ Σ_{i=1}^d σ_i x_i ‖_{L^t} ≤ C_t ‖x‖₂,
where C_t ≤ √2 ( Γ((t+1)/2) / √π )^{1/t} ≤ √t for all t ≥ 1. One may replace (σ_i) with an arbitrary independent sequence of random variables (ς_i) with E[ς_i] = 0 and ‖ς_i‖_{L^r} ≤ √r for all 2 ≤ r ≤ t, and the lemma still holds up to a universal constant factor on the right-hand side.
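A quick Monte Carlo check of Khintchine's inequality (our own toy, not from the paper; the √t bound is very loose for Gaussian-like sums):

    import numpy as np

    rng = np.random.default_rng(6)
    d, t, trials = 200, 6, 20_000
    x = rng.standard_normal(d)
    sums = (rng.choice([-1.0, 1.0], (trials, d)) * x).sum(axis=1)   # sum_i sigma_i x_i, fresh signs per trial
    lhs = np.mean(np.abs(sums) ** t) ** (1.0 / t)
    print(lhs, np.sqrt(t) * np.linalg.norm(x))   # Khintchine: lhs <= sqrt(t) * ||x||_2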
First we note that the OSNAP transform satisfies the Strong JL Moment Property.
Lemma 18.
There exists a universal constant L such that the following holds. Let M ∈ ℝ^{m×d} be an OSNAP transform with sparsity parameter s, let x ∈ ℝ^d be any vector with ‖x‖₂ = 1, and let t ≥ 1. Then
‖ ‖Mx‖₂² − 1 ‖_{L^t} ≤ L ( √(t/m) + t/s ).   (14)
Setting m = Ω(ε^{−2} log(1/δ)) and s = Ω(ε^{−1} log(1/δ)), M has the Strong (ε, δ)-JL Moment Property (Definition 19).

Proof. The bound (14) follows from the analysis in [CJN18]; they only prove it for t = log(1/δ), but their proof is easily extended to the general case. Now if we set m = 4L²e²·ε^{−2} log(1/δ) and s = 2Le·ε^{−1} log(1/δ), then we get that
‖ ‖Mx‖₂² − 1 ‖_{L^t} ≤ L ( √( t / (4L²e² ε^{−2} log(1/δ)) ) + t / (2Le ε^{−1} log(1/δ)) ) ≤ (ε/e) √( t / log(1/δ) ),
for every 1 ≤ t ≤ log(1/δ), which proves the result.
We continue by proving that the SRHT and TensorSRHT sketches satisfy the Strong JL Moment Property. We will do this by proving that a more general class of matrices satisfies the Strong JL Moment Property. More precisely, let k ∈ ℤ_{>0} be a positive integer and let (D^(i))_{i∈[k]} ∈ ∏_{i∈[k]} ℝ^{d_i×d_i} be independent matrices, each with diagonal entries given by independent Rademacher variables. Let d = ∏_{i∈[k]} d_i, and let P ∈ {0, 1}^{m×d} be a random sampling matrix in which each row contains exactly one uniformly distributed nonzero element, which has value one. We will prove that the matrix M = (1/√m) P H (D^(1) × ··· × D^(k)) satisfies the Strong JL Moment Property, where H is a d × d Hadamard matrix. If k = 1 then M is just an SRHT, and if k = 2 then M is a TensorSRHT.

In order to prove this result we need a couple of lemmas. The first lemma can be seen as a version of Khintchine's inequality for higher-order chaos.
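Below is a toy NumPy sketch of ours (illustrative only, not the paper's code) of the k = 2 case, i.e., a TensorSRHT applied to a tensor-product input. It relies on the fact that the d₁d₂ × d₁d₂ Hadamard matrix factors as the Kronecker product of two smaller Hadamard matrices, so H(D₁ × D₂)(x₁ ⊗ x₂) = (H D₁ x₁) ⊗ (H D₂ x₂), and each sampled row of the big transform is the product of one entry from each factor.

    import numpy as np

    def fwht(v):
        # Unnormalized fast Walsh-Hadamard transform; len(v) must be a power of two.
        a = v.copy()
        h = 1
        while h < len(a):
            for i in range(0, len(a), 2 * h):
                x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
                a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
            h *= 2
        return a

    def tensor_srht(x1, x2, rows1, rows2, m, signs1, signs2):
        # Apply (1/sqrt(m)) P H (D1 x D2) to x1 (x) x2 without forming the d1*d2 tensor.
        h1 = fwht(signs1 * x1)          # H D1 x1
        h2 = fwht(signs2 * x2)          # H D2 x2
        return (h1[rows1] * h2[rows2]) / np.sqrt(m)

    rng = np.random.default_rng(7)
    d1 = d2 = 64
    m = 1024
    signs1, signs2 = rng.choice([-1.0, 1.0], d1), rng.choice([-1.0, 1.0], d2)
    rows1, rows2 = rng.integers(0, d1, m), rng.integers(0, d2, m)   # one uniform row of the big Hadamard per output coordinate
    x1, x2 = rng.standard_normal(d1), rng.standard_normal(d2)
    z = tensor_srht(x1, x2, rows1, rows2, m, signs1, signs2)
    print(np.linalg.norm(z) ** 2, (np.linalg.norm(x1) * np.linalg.norm(x2)) ** 2)   # approximately equal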
Lemma 19.
Let t ≥ 1, k ∈ ℤ_{>0}, and let (σ^(i))_{i∈[k]} ∈ ∏_{i∈[k]} ℝ^{d_i} be independent vectors, each satisfying the Khintchine inequality ‖⟨σ^(i), x⟩‖_{L^t} ≤ C_t ‖x‖₂ for t ≥ 1 and any vector x ∈ ℝ^{d_i}. Let (a_{i₁,...,i_k})_{i₁∈[d₁],...,i_k∈[d_k]} be a tensor in ℝ^{d₁×···×d_k}. Then, for t ≥ 1,
‖ Σ_{i₁∈[d₁],...,i_k∈[d_k]} ∏_{j∈[k]} σ^(j)_{i_j} a_{i₁,...,i_k} ‖_{L^t} ≤ C_t^k ( Σ_{i₁∈[d₁],...,i_k∈[d_k]} a_{i₁,...,i_k}² )^{1/2}.
Or, considering a ∈ ℝ^{d₁···d_k} as a vector, simply ‖ ⟨σ^(1) ⊗ ··· ⊗ σ^(k), a⟩ ‖_{L^t} ≤ C_t^k ‖a‖₂ for t ≥ 1. Here a is not assumed to have any special structure. Note that this implies the classical bound on the fourth moment of products of 4-wise independent hash functions [BCL+10, IM08, PT12]: since C₄ = 3^{1/4} for Rademachers, we have E[⟨σ^(1) ⊗ ··· ⊗ σ^(k), x⟩⁴] ≤ 3^k ‖x‖₂⁴ for four-wise independent (σ^(i))_{i∈[k]}.

Proof.
The proof will be by induction on k . For k = 1 then the result is by assumption. So assumethat the result is true for every value up to k −
1. Let B i ,...,i k − = P i k ∈ [ d k ] σ ( k ) i k a i ,...,i k . We thenpull it out of the left hand term in the theorem: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] ,...,i k ∈ [ d k ] Y j ∈ [ k ] σ ( j ) i j a i ,...,i k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] ,...,i k − ∈ [ d k − ] Y j ∈ [ k − σ ( j ) i j B i ,...,i k − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C k − t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] ,...,i k − ∈ [ d k − ] B i ,...,i k − / (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t (15)= C k − t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] ,...,i k − ∈ [ d k − ] B i ,...,i k − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t/ ≤ C k − t X i ∈ [ d ] ,...,i k − ∈ [ d k − ] (cid:13)(cid:13)(cid:13) B i ,...,i k − (cid:13)(cid:13)(cid:13) L t/ / (16)= C k − t X i ∈ [ d ] ,...,i k − ∈ [ d k − ] (cid:13)(cid:13) B i ,...,i k − (cid:13)(cid:13) L t / . Here (15) is the inductive hypothesis and (16) is the triangle inequality. It remains to bound (cid:13)(cid:13) B i ,...,i k − (cid:13)(cid:13) L t ≤ C t P i c ∈ [ d k ] a i ,...,i k by Khintchine’s inequality, which finishes the induction stepand hence the proof.The next lemma we will be using is a type of Rosenthal inequality, but which mixes large andsmall moments in a careful way. It bears similarity to the one sided bound in [BLM13] (Theorem15.10) derived from the Efron Stein inequality, and the literature has many similar bounds, but westill include a proof here based on first principles. Lemma 20.
There exists a universal constant L , such that, for t ≥ if X , . . . , X k are independentnon-negative random variables with t -moment, then (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] ( X i − E [ X i ]) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ L √ t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t s X i ∈ [ k ] E [ X i ] + t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t . roof. Throughout these calculations L , L and L will be universal constants. (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] ( X i − E [ X i ]) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ L (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] σ i X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t (Symmetrization) ≤ L √ t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t/ (Khintchine’s inequality) ≤ L √ t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ k ] X i · X i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t/ (Non-negativity) ≤ L √ t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t · (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t (Cauchy-Schwartz) ≤ L √ t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t s X i ∈ [ k ] E [ X i ] + L (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] ( X i − E [ X i ]) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t . Now let C = (cid:13)(cid:13)(cid:13)P i ∈ [ k ] ( X i − E [ X i ]) (cid:13)(cid:13)(cid:13) / L t , B = L qP i ∈ [ k ] E [ X i ], and A = √ t (cid:13)(cid:13)(cid:13) max i ∈ [ k ] X i (cid:13)(cid:13)(cid:13) / L t . thenwe have shown C ≤ A ( B + C ). That implies C is smaller than the largest of the roots of thequadratic. Solving this quadratic inequality gives C ≤ L ( AB + A ) which is the result.We can now prove that SHRT and
TensorSRHT has the Strong JL Moment Property.
Lemma 21.
There exists a universal constant L , such that, the following holds. Let k ∈ Z > , and ( D ( i ) ) i ∈ [ k ] ∈ Q i ∈ [ k ] R d i × d i be independent diagonal matrices with independent Rademacher variables.Define d = Q i ∈ [ k ] d i and D = D × D × · · · D k ∈ R d × d . Let P ∈ R m × d be an independent samplingmatrix which samples exactly one coordinate per row, and define M = P HD where H is a d × d Hadamard matrix. Let x ∈ R d be any vector with k x k = 1 and t ≥ , then (cid:13)(cid:13)(cid:13) m k P HDx k − (cid:13)(cid:13)(cid:13) L t ≤ L s tr k m + tr k m , where r = max { t, log m } .There exists a universal constant L ′ , such that, setting m = Ω (cid:16) ε − log(1 /δ )( L ′ log(1 /εδ )) k (cid:17) , weget that √ m P HD has Strong ( ε, δ ) -JL Moment Property. Note that setting k = 1, this matches the Fast Johnson Lindenstrauss analysis in [CNW16b]. Proof.
Throughout the proof C , C and C will denote universal constants.For every i ∈ [ m ] we let P i be the random variable that says which coordinate the i ’th row of P samples, and we define the random variable Z i = M i x = H P i Dx . We note that since the variables( P i ) i ∈ [ m ] are independent then the variables ( Z i ) i ∈ [ m ] are conditionally independent given D , thatis, if we fix D then ( Z i ) i ∈ [ m ] are independent. 29e use Lemma 20, the triangle inequality, and Cauchy-Schwartz to get that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) m X i ∈ [ m ] Z i − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E m X i ∈ [ m ] Z i − t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) D /t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) √ tm E max i ∈ [ m ] Z i ! t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) D / (2 t ) s X i ∈ [ m ] E (cid:2) Z i (cid:12)(cid:12) D (cid:3) + tm E max i ∈ [ m ] Z i ! t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) D /t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C √ tm (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E max i ∈ [ m ] Z i ! t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) D / (2 t ) s X i ∈ [ m ] E (cid:2) Z i (cid:12)(cid:12) D (cid:3)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t + C tm (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ m ] Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C √ tm (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ m ] Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ m ] E h Z i (cid:12)(cid:12)(cid:12) D i(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t + C tm (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ m ] Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t . By orthogonality of H we have k HDx k = d k x k independent of D . Hence X i ∈ [ m ] E h Z i (cid:12)(cid:12)(cid:12) D i = X i ∈ [ m ] k x k = m . To bound (cid:13)(cid:13)(cid:13) max i ∈ [ m ] Z i (cid:13)(cid:13)(cid:13) L t we first use Lemma 19 to show (cid:13)(cid:13)(cid:13) Z i (cid:13)(cid:13)(cid:13) L r = k H P i Dx k L r = k Dx k L r ≤ r k k x k . We then bound the maximum using a sufficiently high powered sum: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ m ] Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ m ] Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L r ≤ X i ∈ [ m ] (cid:13)(cid:13)(cid:13) Z i (cid:13)(cid:13)(cid:13) rL r /r ≤ m /r r k k x k ≤ er k , where the last inequality follows from r ≥ log m . This gives us that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) m X i ∈ [ m ] Z i − k x k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C s tr k m + C tr k m , which finishes the first part of the proof.We set m = 4 e C ε − log(1 /δ )( C log(1 / ( δε ))) k , such that, r ≤ C log(1 / ( δε )). Hence m ≥ e C ε − log(1 /δ ) r k . We then get that (cid:13)(cid:13)(cid:13) k P HDx k − (cid:13)(cid:13)(cid:13) L t ≤ C s tr k e C ε − log(1 /δ ) r k + C tr k e C ε − log(1 /δ ) r k ≤ εe s t log 1 /δ , for all 1 ≤ t ≤ log(1 /δ ) which finishes the proof.30ow we have proved that the Strong JL Moment Property is satisfied by the SRHT , the
Ten-sorSRHT as well as
OSNAP transform, but we still need to prove the usefulness of the property. Our next result remedies this and shows that the Strong JL Moment Property is preserved under multiplication. We will use the following decoupling lemma, which first appeared in [Hit94]; the statement below is roughly taken from [DlPG12], which we also recommend for readers interested in more general versions.
Lemma 22 (General decoupling, [DlPG12] Theorem 7.3.1, paraphrasing). There exists a universal constant C such that, given any two sequences (X_i)_{i∈[n]} and (Y_i)_{i∈[n]} of random variables satisfying
1. Pr[ Y_i > t | (X_j)_{j∈[i−1]} ] = Pr[ X_i > t | (X_j)_{j∈[i−1]} ] for every t ∈ ℝ and every i ∈ [n];
2. the sequence (Y_i)_{i∈[n]} is conditionally independent given (X_i)_{i∈[n]};
3. Pr[ Y_i > t | (X_j)_{j∈[i−1]} ] = Pr[ Y_i > t | (X_j)_{j∈[n]} ] for every t ∈ ℝ and every i ∈ [n];
then for all t ≥ 1,
‖ Σ_{i∈[n]} X_i ‖_{L^t} ≤ C ‖ Σ_{i∈[n]} Y_i ‖_{L^t}.

We are now ready to state and prove the main lemma of this section. It basically says that if you take k independent JL transforms that all have the Strong (ε/√k, δ)-JL Moment Property (SJLMP), then their product has the Strong (ε, δ)-JL Moment Property. A simple union bound would give the same result when each matrix has the (ε/k, δ) SJLMP, but that would ultimately result in a higher dependency on the tensoring dimension. A simple change to the proof shows that we only need the i-th JL transform to have the (ε/√i, δ) SJLMP, but that ultimately makes no difference for our construction.

Lemma 23.
There exists a universal constant L , such that, for any constants ε, δ ∈ [0 , andpositive integer k ∈ Z > . If M (1) ∈ R d × d , . . . , M ( k ) ∈ R d k +1 × d c are independent random matriceswith the Strong ( ε/ ( L √ k ) , δ ) -JL Moment Property, then the matrix M = M ( k ) · . . . · M (1) has theStrong ( ε, δ ) -JL Moment Property.Proof. Let x ∈ R d be an arbitrary, fixed unit vector, and fix 1 < t ≤ log(1 /δ ). We define X i = k M ( i ) · . . . · M (1) x k and Y i = X i − X i − for every i ∈ [ k ]. By telescoping we then have that X i − P j ∈ [ i ] Y i . We let ( T ( i ) ) i ∈ [ k ] be independent copies of ( M ( i ) ) i ∈ [ k ] and define Z i = k T ( i ) · M ( i − · . . . · M (1) x k − k M ( i − · . . . · M (1) x k , for every i ∈ [ k ]. We get the following three properties:1. Pr h Z i > t (cid:12)(cid:12)(cid:12) ( M ( j ) ) j ∈ [ i − i = Pr h Y i > t (cid:12)(cid:12)(cid:12) ( M ( j ) ) j ∈ [ i − i for every t ∈ R and every i ∈ [ k ].2. The sequence ( Z i ) i ∈ [ k ] is conditionally independent given ( M ( i ) ) i ∈ [ k ] .3. Pr h Z i > t (cid:12)(cid:12)(cid:12) ( M ( j ) ) j ∈ [ i − i = Pr h Z i > t (cid:12)(cid:12)(cid:12) ( M ( j ) ) j ∈ [ k ] i for every t ∈ R and for every i ∈ [ k ].31his means we can use Lemma 22 to get (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X j ∈ [ i ] Y j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X j ∈ [ i ] Z j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t . (17)for every i ∈ [ k ].We will prove by induction on i ∈ [ k ] that k X i − k L t ≤ εe s t log(1 /δ ) ≤ . (18)For i = 1 we use that M (1) has the Strong ( ε/ ( L √ k ) , δ )-JL Moment Property and get that (cid:13)(cid:13)(cid:13) k M (1) x k − (cid:13)(cid:13)(cid:13) L t ≤ εeL √ k s t log(1 /δ ) ≤ εe s t log(1 /δ ) . Now assume that (18) is true for i −
1. Using (17) we get that k X i − k L t = (cid:13)(cid:13)(cid:13)P j ∈ [ i ] Y j (cid:13)(cid:13)(cid:13) L t ≤ C (cid:13)(cid:13)(cid:13)P j ∈ [ i ] Z j (cid:13)(cid:13)(cid:13) L t . By using that ( T ( j ) ) j ∈ [ i ] has the Strong ( ε/ ( L √ k ) , δ )-JL Moment Property to-gether with Khintchine’s inequality (Lemma 17), we get that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X j ∈ [ i ] Z j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E X j ∈ [ i ] Z j t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( M ( j ) ) j ∈ [ i ] /t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) εeL √ k s t log(1 /δ ) s X j ∈ [ i ] X j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t = C εe s t log(1 /δ ) · L √ k vuuut(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X j ∈ [ i ] X j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t/ ≤ C εe s t log(1 /δ ) · L √ k s X j ∈ [ i ] k X j k L t , where the last inequality follows from the triangle inequality. Using the triangle inequality and(18) we get that k X j k L t ≤ k X j − k L t ≤ , for every j ∈ [ i ]. Setting L = 2 C C we get that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X j ∈ [ i ] Y j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ εe s t log(1 /δ ) · C C L √ k s X j ∈ [ i ] k X j k L t (19) ≤ εe s t log(1 /δ ) · C C L √ k · √ i (20) ≤ εe s t log(1 /δ ) , (21)which finishes the induction. Now we have that (cid:13)(cid:13) k M x k − (cid:13)(cid:13) L t ≤ εe q t log(1 /δ ) so we conclude that M has Strong ( ε, δ )-JL Moment Property. 32 simple corollary of this result is a sufficient condition for our recursive sketch Π q to have theStrong JL Moment Property. Corollary 24 (Strong JL Moment Property for Π q ) . For any integer q which is a power of two,let Π q : R m q → R m be defined as in Definition 11, where both of the common distributions S base : R m → R m and T base : R d → R m , satisfy the Strong (cid:16) O (cid:16) ε √ q (cid:17) , δ (cid:17) -JL Moment Property. Then itfollows that Π q satisfies the Strong ( ε, δ ) -JL Moment Property.Proof. The proof follows from using Lemma 12 and Lemma 23.We conclude this section by proving Theorem 2.
Theorem 2.
For every positive integers n, p, d and every ε, s_λ > 0, there exists a distribution on linear sketches Π^p ∈ ℝ^{m×d^p} such that: (1) If m = Ω̃(p s_λ² ε^{−2}), then Π^p is an (ε, 1/poly(n), s_λ, d^p, n)-oblivious subspace embedding (Definition 2). (2) If m = Ω̃(p ε^{−2}), then Π^p has the (ε, 1/poly(n))-approximate matrix product property (Definition 3).
Moreover, in the setting of (1), for any X ∈ ℝ^{d×n}, if A ∈ ℝ^{d^p×n} is the matrix whose columns are obtained by a p-fold self-tensoring of each column of X, then the matrix Π^p A can be computed using Algorithm 1 in time Õ( pnm + p^{3/2} s_λ ε^{−1} nnz(X) ).

Proof. Let δ = 1/poly(n) denote the failure probability. Define q = 2^{⌈log₂ p⌉} and let Π^p ∈ ℝ^{m×d^p} and Π^q ∈ ℝ^{m×d^q} be the sketches defined in Definition 11, where S_base ∈ ℝ^{m×m²} is a TensorSRHT sketch and T_base ∈ ℝ^{m×d} is an OSNAP sketch with sparsity parameter s, which will be set later.

Oblivious Subspace Embedding.
Let m = Θ (cid:18) ps λ log(1 / ( εδ )) ε (cid:19) and s = Θ (cid:16) √ ps λ log(1 /δ ) ε (cid:17) beintegers, then Lemma 21 and Lemma 18 implies that S base and T base has the Strong (cid:16) O (cid:16) ε √ qs λ (cid:17) , δ (cid:17) -JL Moment Property, thus using Corollary 24 we conclude that Π q has the Strong (cid:16) εs λ , δ (cid:17) -JLMoment Property and in particular it has the (cid:16) εs λ , δ, log(1 /δ ) (cid:17) -JL Moment Property. By Lemma 11we then get that Π q is an ( ε, δ, s λ , d q , n )-Oblivious Subspace Embedding, and by Lemma 10 we getthat Π p is an ( ε, δ, s λ , d p , n )-Oblivious Subspace Embedding. Approximate Matrix Multiplication
Let m = Θ (cid:16) p log(1 / ( εδ )) ε (cid:17) and s = Θ (cid:16) √ p log(1 /δ ) ε (cid:17) beintegers. Then Lemma 21 and Lemma 18 implies that S base and T base has the Strong (cid:16) O (cid:16) ε √ qs λ (cid:17) , δ (cid:17) -JL Moment Property. Thus, using Corollary 24 we conclude that Π q has the Strong ( ε, δ )-JLMoment Property and in particular it has the ( ε, δ, log(1 /δ ))-JL Moment Property. By Lemma 11we then get that Π q has the ( ε, δ )-Approximate Matrix Multiplication Property, and by Lemma 10we get that Π p has the ( ε, δ )-Approximate Matrix Multiplication Property. Runtime of Algorithm 1 when the base sketch S base is a TensorSRHT sketch and T base isan OSNAP sketch with sparsity parameter s : We compute the time of running Algorithm 1on a vector x . Computing Y j for each j in lines 3 and 4 of algorithm requires applying an ONSAP sketch on either x or e which takes time O ( s · nnz( x )). Therefore computing all Y j ’s takes time O ( qs · nnz( x )).Computing each of Y lj ’s for l ≥ TensorSRHT sketch of input dimension m and target dimension of m on Y l − j − ⊗ Y l − j . This takes time33 ( m log m ). Therefore computing Y lj for all l, j ≥ O ( q · m log m ). Note that q ≤ p hence the total running time of Algorithm 1 on one vector x is O ( pm log m + ps · nnz( w )). Sketching n columns of a matrix X ∈ R d × n takes time O ( p ( nm log m + s · nnz( X ))).In the setting of (1) we have that s = O (cid:16) √ ps λ log(1 /δ ) ε (cid:17) , hence we get a runtime of O (cid:16) pnm log m + p / s λ log(1 /δ )) ε nnz( X ) (cid:17) = ˜ O (cid:16) pnm + p / s λ ε nnz( X ) (cid:17) . s λ In this section, we show that if one chooses the internal nodes and the leaves of our recursive con-struction from Section 3 to be
TensorSRHT and
OSNAP transform respectively, then the recursive construction Π^q as in Definition 11 yields a high-probability OSE whose target dimension depends only linearly on s_λ; this proves Theorem 3. This sketch is very efficiently computable for high-degree tensor products, because the OSNAP transform is computable in input-sparsity time and the
TensorSRHT supports fast matrix vector multiplication for tensor inputs.We start by defining the
Spectral Property for a sketch. We use the notation k · k op to denotethe operator norm of matrices. Definition 20 (Spectral Property) . For any positive integers m, n, d and any ε, δ, µ F , µ ≥ S ∈ R m × d satisfies the ( µ F , µ , ǫ, δ, n ) -spectral property if, for every fixedmatrix U ∈ R d × n with k U k F ≤ µ F and k U k op ≤ µ ,Pr S (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ S ⊤ SU − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ (cid:21) ≥ − δ. The spectral property is a central property of our sketch construction from Section 3 when leavesare
OSNAP and internal nodes are
TensorSRHT . This is a powerful property which implies thatany sketch which satisfies the spectral property , is an
Oblivious Subspace Embedding . The
SRHT , TensorSRHT , as well as
OSNAP sketches (Definitions 14, 15, 16 respectively) with target dimension m = Ω (cid:16) ( µ F µ ǫ ) · poly (log( nd/δ )) (cid:17) and sparsity parameter s = Ω(poly (log( nd/δ ))), all satisfy theabove-mentioned spectral property [Sar06, Tro11, NN13].In section 5.1 we recall the tools from the literature which we use to prove the spectral propertyfor our construction Π q . Then in section 5.2 we show that our recursive construction in Section3 satisfies the Spectral Property of Definition 20 as long as I d q × T base and I m q × S base satisfythe Spectral Property. Therefore, we analyze the Spectral Property of I d q × OSNAP and I m q × TensorSRHT in section 5.3 and section 5.4 respectively. Finally we put everything together insection 5.5 and prove that when the leaves are
OSNAP and the internal nodes are
TensorSRHT in our recursive construction of Section 3, the resulting sketch Π q satisfies the Spectral Propertythereby proving Theorem 3. In this section we present the definitions and tools which we use for proving concentration propertiesof random matrices.
Claim 25.
For every ǫ, δ > and any sketch S ∈ R m × d such that I k × S satisfies ( µ F , µ , ǫ, δ, n ) -spectral property, the sketch S × I k also satisfies the ( µ F , µ , ǫ, δ, n ) -spectral property.Proof. Suppose U ∈ R dk × n . Then, note that there exists U ′ ∈ R dk × n formed by permuting therows of U such that ( S × I k ) U and ( I k × S ) U ′ are identical up to a permutation of the rows. (In34articular, U ′ is the matrix such that the ( d, k )-reshaping of any column U j of U ′ is the transposeof the ( k, d )-reshaping of the corresponding column U ′ j of U ′ .) Then, observe that U ⊤ U = U ′⊤ U ′ . and U ⊤ ( S × I k ) ⊤ ( S × I k ) U = U ′⊤ ( I k × S ) ⊤ ( I k × S ) U ′ . Therefore, k U ⊤ ( S × I k ) ⊤ ( S × I k ) U − U ⊤ U k op = k U ′⊤ ( S × I k ) ⊤ ( S × I k ) U ′ − U ′⊤ U ′ k op . Moreover, since U and U ′ are identical up to a permutation of the rows, we have k U k op = k U ′ k op and k U k F = k U ′ k F . The desired claim now follows easily.We will use matrix Bernstein inequalities to show spectral guarantees for sketches, Lemma 26 (Matrix Bernstein Inequality (Theorem 6.1.1 in [Tro15])) . Consider a finite sequence Z i of independent, random matrices with dimensions d × d . Assume that each random matrix satisfies E [ Z i ] = 0 and k Z i k op ≤ B almost surely. Define σ = max {k P i E [ Z i Z ∗ i ] k op , k P i E [ Z ∗ i Z i ] k op } .Then for all t > − , P (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)X i Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≥ t ≤ ( d + d ) · exp − t / σ + Bt/ ! . Lemma 27 (Restatement of Corollary 6.2.1 of [Tro15]) . Let B be a fixed n × n matrix. Constructan n × n matrix R that satisfies, E [ R ] = B and k R k op ≤ L, almost surely. Define M = max {k E [ RR ∗ ] k op , k E [ R ∗ R ] k op } . Form the matrix sampling estimator, ¯ R = 1 m m X k =1 R k , where each R k is an independent copy of R . Then, Pr h k ¯ R − B k op ≥ ǫ i ≤ n · exp − mǫ / M + 2 Lǫ/ ! . To analyze the performance of
SRHT we need the following claim which shows that with highprobability individual entries of the Hadamard transform of a vector with random signs on itsentries do not “overshoot the mean energy” by much.
Claim 28.
Let D be a d × d diagonal matrix with independent Rademacher random variables alongthe diagonal. Also, let H be a d × d Hadamard matrix. Then, for every x ∈ R d , Pr D (cid:20) k HD · x k ∞ ≤ q log ( d/δ ) · k x k (cid:21) ≥ − δ. roof. By Khintchine’s inequality, Lemma 17 we have that for every t ≥ j ∈ [ d ] the j th element of HDx has a bounded t th moment as follows, k ( HDx ) j k L t ≤ √ t · k x k . Hence by applying Markov’s inequality to the t th moment of | ( HDx ) j | for t = log ( d/δ ) we getthat, Pr (cid:20) | ( HDx ) j | ≥ q log ( d/δ ) · k x k (cid:21) ≤ δ/d. The claim follows by a union bound over all entries j ∈ [ d ]. Claim 29.
Let D , D be two independent d × d diagonal matrices, each with diagonal entries givenby independent Rademacher random variables. Also, let H be a d × d Hadamard matrix. Then, forevery x ∈ R d , Pr D ,D [ k (( HD ) × ( HD )) · x k ∞ ≤ ( d/δ ) · k x k ] ≥ − δ. Proof.
By Claim 6 we can write that,( HD ) × ( HD ) = ( H × H )( D × D ) , where H × H is indeed a Hadamard matrix of size d × d which we denote by H ′ . The goal is toprove Pr D ,D (cid:2) k H ′ ( D × D ) · x k ∞ ≤ ( d/δ ) · k x k (cid:3) ≥ − δ. By Lemma 19 we have that for every t ≥ j ∈ [ d ] the j th element of H ′ ( D × D ) x hasa bounded t th moment as follows, (cid:13)(cid:13) ( H ′ ( D × D ) x ) j (cid:13)(cid:13) L t ≤ t · k x k . Hence by applying Markov’s inequality to the t th moment of | ( H ′ ( D × D ) x ) j | for t = log ( d/δ )we get that, Pr (cid:2) | ( H ′ ( D × D ) x ) j | ≥ ( d/δ ) · k x k (cid:3) ≤ δ/d . The claim follows by a union bound over all entries j ∈ [ d ]. Π q In this section we show that the sketch Π q presented in Definition 11 inherits the spectral property(see Definition 20) from the base sketches S base and T base . We start by the following claim whichproves that composing two random matrices with spectral property results in a matrix with spectralproperty. Claim 30.
For every ǫ, ǫ ′ , δ, δ ′ > , suppose that S ∈ R m × t is a sketch which satisfies the (( µ F +1)(1 + ǫ ′ ) , µ + 1 + ǫ ′ , ǫ, δ, n ) -spectral property and also suppose that the sketch T ∈ R t × d satisfies the ( µ F +1 , µ +1 , ǫ ′ , δ ′ /n, n ) -spectral property. Then S · T satisfies the ( µ F + 1 , µ + 1 , ǫ + ǫ ′ , δ + δ ′ (1 + 1 /n ) , n ) -spectral property. roof. Suppose S and T are matrices satisfying the hypothesis of the claim. Consider an arbitrarymatrix U ∈ R d × n which satisfies k U k F ≤ µ F + 1 and k U k op ≤ µ + 1. We want to prove that forevery such U , Pr h k U ⊤ ( S · T ) ⊤ ( S · T ) U − U ⊤ U k op ≤ ǫ + ǫ ′ i ≥ − δ − δ ′ (1 + 1 /n ) . Let us define the event E as follows, E := (cid:26) k T · U k F ≤ (cid:0) ǫ ′ (cid:1) k U k F and (cid:13)(cid:13)(cid:13) U ⊤ T ⊤ T U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ ′ (cid:27) . We show that this event holds with probability 1 − δ ′ (1 + 1 /n ) over the random choice of sketch T .The spectral property of T implies that for every column U j of matrix U , k T U j k = (cid:0) ± ǫ ′ (cid:1) k U j k , with probability 1 − δ ′ n . By a union bound over all j ∈ [ n ], we have the following,Pr T h k T · U k F ≤ (cid:0) ǫ ′ (cid:1) k U k F i ≥ − δ ′ . Also, Pr T (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ T ⊤ T U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ ′ (cid:21) ≥ − δ ′ /n. Therefore by union bound, Pr T [ E ] ≥ − δ ′ (1 + 1 /n ) . We condition on T ∈ E in the rest of the proof. Since S satisfies the (( µ F +1)(1+ ǫ ′ ) , µ +1+ ǫ ′ , ǫ, δ, n )-spectral property, Pr S (cid:20)(cid:13)(cid:13)(cid:13) ( T U ) ⊤ S ⊤ S ( T U ) − ( T U ) ⊤ ( T U ) (cid:13)(cid:13)(cid:13) op ≤ ǫ (cid:21) ≥ − δ. Therefore,Pr
T,S (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ ( S · T ) ⊤ ( S · T ) U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ + ǫ ′ (cid:21) ≥ Pr S (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ ( S · T ) ⊤ ( S · T ) U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ + ǫ ′ (cid:12)(cid:12)(cid:12) T ∈ E (cid:21) − Pr T [ ¯ E ] ≥ Pr S (cid:20)(cid:13)(cid:13)(cid:13) ( T U ) ⊤ S ⊤ S ( T U ) − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ + ǫ ′ (cid:12)(cid:12)(cid:12) T ∈ E (cid:21) − δ ′ (1 + 1 /n ) ≥ Pr S (cid:20) (cid:13)(cid:13)(cid:13) ( T U ) ⊤ S ⊤ S ( T U ) − ( T U ) ⊤ ( T U ) (cid:13)(cid:13)(cid:13) op + (cid:13)(cid:13)(cid:13) ( T U ) ⊤ ( T U ) − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ + ǫ ′ (cid:12)(cid:12)(cid:12)(cid:12) T ∈ E (cid:21) − δ ′ (1 + 1 n ) ≥ Pr S (cid:20)(cid:13)(cid:13)(cid:13) ( T U ) ⊤ S ⊤ S ( T U ) − ( T U ) ⊤ ( T U ) (cid:13)(cid:13)(cid:13) op ≤ ǫ (cid:12)(cid:12)(cid:12) T ∈ E (cid:21) − δ ′ (1 + 1 /n ) ≥ − δ − δ ′ (1 + 1 /n ) . This completes the proof.In the following lemma we show that composing independent random matrices with spectralproperty preserves the spectral property. 37 emma 31.
For any ε, δ, µ F , µ > and every positive integers k, n , if M (1) ∈ R d × d , · · · M ( k ) ∈ R d k +1 × d k are independent random matrices with the (2 µ F + 2 , µ + 2 , O ( ǫ/k ) , O ( δ/nk ) , n ) -spectralproperty then the product matrix M = M ( k ) · · · M (1) satisfies the ( µ F + 1 , µ + 1 , ǫ, δ, n ) -spectralproperty.Proof. Consider a matrix U ∈ R d × n which satisfies k U k F ≤ µ F + 1 and k U k op ≤ µ + 1. We wantto prove that for every such U ,Pr h k U ⊤ M ⊤ M U − U ⊤ U k op ≤ ǫ i ≥ − δ, where M = M ( k ) · · · M (1) .By the assumption of the lemma the matrices M (1) , · · · M ( k ) satisfy the (2 µ F +2 , µ +2 , O ( ǫ/k ) , O ( δ/nk ) , n )-spectral property. For every j ∈ [ k ], let us define the set E j as follows, E j := (cid:16) M (1) , · · · , M ( j ) (cid:17) : . (cid:13)(cid:13)(cid:13)(cid:16) M ( j ) · · · M (1) (cid:17) U (cid:13)(cid:13)(cid:13) F ≤ (cid:0) ǫ k (cid:1) j k U k F . (cid:13)(cid:13)(cid:13)(cid:13) U ⊤ (cid:16) M ( j ) · · · M (1) (cid:17) ⊤ (cid:16) M ( j ) · · · M (1) (cid:17) U − U ⊤ U (cid:13)(cid:13)(cid:13)(cid:13) op ≤ ǫj k . First we prove that for every j ∈ { , · · · , k − } ,Pr M ( j +1) h (cid:16) M (1) , · · · , M ( j +1) (cid:17) ∈ E j +1 (cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j i ≥ − δ k . Let us denote (cid:16) M ( j ) · · · M (1) (cid:17) · U by U ′ . The condition (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j implies that, k U ′ k F ≤ (1 + ǫ/ (10 k )) j k U k F and k U ′⊤ U ′ − U ⊤ U k op ≤ ǫj k and therefore by triangle inequality wehave k U ′ k op ≤ (cid:16) k U k op + ǫj k (cid:17) . The assumptions k U k F ≤ µ F + 1 and k U k op ≤ µ + 1 imply that k U ′ k F ≤ µ F + 2 and k U ′ k op ≤ µ + 2. Now note that by the assumption of the lemma, M ( j +1) satisfies the (2 µ F + 2 , µ + 2 , O ( ǫ/k ) , O ( δ/nk ) , n )-spectral property. Therefore,Pr M ( j +1) " (cid:13)(cid:13)(cid:13)(cid:13)(cid:16) M ( j +1) U ′ (cid:17) ⊤ M ( j +1) U ′ − U ′⊤ U ′ (cid:13)(cid:13)(cid:13)(cid:13) op ≤ ǫ k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j ≥ − δ/ (4 nk ) . Combining the above with k U ′⊤ U ′ − U ⊤ U k ≤ ǫj k gives,Pr M ( j +1) " (cid:13)(cid:13)(cid:13)(cid:13)(cid:16) M ( j +1) U ′ (cid:17) ⊤ M ( j +1) U ′ − U ⊤ U (cid:13)(cid:13)(cid:13)(cid:13) op ≤ ǫ j + 13 k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j ≥ − δ/ (4 nk ) . (22)Also from the spectral property of M ( j +1) it follows that for every column U ′ i of matrix U ′ , k M ( j +1) U ′ i k = (1 ± ǫ/ (10 k )) k U ′ i k , with probability 1 − δ nk . By a union bound over all i ∈ [ n ], we have the following,Pr M ( j +1) h k M ( j +1) · U ′ k F ≤ (1 + ǫ/ (10 k )) k U ′ k F (cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j i ≥ − δ k . Combining the above with k U ′ k F ≤ (1 + ǫ/ (10 k )) j k U k F gives,Pr M ( j +1) " k M ( j +1) · U ′ k F ≤ (cid:18) ǫ k (cid:19) j +1 k U k F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j ≥ − δ k . (23)38 union bound on (22) and (23) gives,Pr M ( j +1) h (cid:16) M (1) , · · · , M ( j +1) (cid:17) ∈ E j +1 (cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j i ≥ − δ nk − δ k ≥ − δ k . We also show that, Pr M (1) [ M (1) ∈ E ] ≥ − δ/ k. By the assumption of lemma we know that M (1) satisfies the (cid:16) µ F + 2 , µ + 2 , ǫ k , δ nk , n (cid:17) -spectralproperty. Therefore, Pr M (1) (cid:20) k ( M (1) U ) ⊤ M (1) U − U ⊤ U k op ≤ ǫ k (cid:21) ≥ − δ nk . 
(24)Also for every column U i of matrix U , k M (1) U i k = (1 ± ǫ/ (10 k )) k U i k , with probability 1 − δ nk . By a union bound over all i ∈ [ n ], we have the following,Pr M (1) h k M (1) · U k F ≤ (1 + ǫ/ (10 k )) k U k F i ≥ − δ k . (25)A union bound on (24) and (25) gives,Pr T [ T ∈ E ] ≥ − δ nk − δ k ≥ − δ k . By the chain rule for events we have,Pr M (1) , ··· ,M ( k ) h(cid:16) M (1) , · · · , M ( k ) (cid:17) ∈ E k i ≥ k Y j =2 Pr M ( j ) h (cid:16) M (1) , · · · M ( j ) (cid:17) ∈ E j (cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · M ( j − (cid:17) ∈ E j − i · Pr M (1) [ M (1) ∈ E ] ≥ (1 − δ k ) k ≥ − δ, which completes the proof of the lemma.The following lemma shows that our sketch construction Π q presented in definition 11 inheritsthe spectral property of Definition 20 from the base sketches, that is, if S base and T base are suchthat I m q − × S base and I d q − × T base satisfy the spectral property, then the sketch Π q satisfies thespectral property. Lemma 32.
For every positive integers n, d, m , any power of two integer q , any base sketch T base : R d → R m such that I d q − × T base satisfies the (2 µ F +2 , µ +2 , O ( ǫ/q ) , O ( δ/nq ) , n ) -spectral property,any S base : R m → R m such that I m q − × S base satisfies the (2 µ F + 2 , µ + 2 , O ( ǫ/q ) , O ( δ/nq ) , n ) -spectral property, the sketch Π q defined as in Definition 11 satisfies the ( µ F +1 , µ +1 , ε, δ, n ) -spectralproperty. roof. We wish to show that Π q = Q q T q as per Definition 11, satisfies the ( µ F + 1 , µ + 1 , ε, δ, n )-spectral property. By Definition 9 Q q = S S · · · S q . Claim 7 shows that for every l ∈ { , , · · · q } we can write, S l = M ll/ M ll/ − · · · M l , (26)where M j = I m q − j × S qq/ − j +1 × I m j − for every j ∈ [ q/ T q = M ′ q · · · M ′ , (27)where M ′ j = I d q − j × T q − j +1 × I m j − for every j ∈ [ q ]. Therefore by combining (26) and (27) we getthat, Π q = M (2 q +1) M (2 q ) · · · M (1) , where M ( i ) matrices are independent and by the assumption of the lemma about the spectralproperty of I m q − × S base and I d q − × T base together with Claim 25 it follows that M ( i ) matricessatisfy the (2 µ F + 2 , µ + 2 , O ( ǫ/q ) , O ( δ/nq ) , n )-spectral property. Therefore, the Lemma readilyfollows by invoking Lemma 31 with k = 2 q + 1. × TensorSRHT
In this section, we show that tensoring an identity operator with a
TensorSRHT sketch results ina transform that satisfies the spectral property defined in Definition 20 with nearly optimal targetdimension.
Lemma 33.
Suppose ǫ, δ, µ , µ F > and n is a positive integer. If m = Ω (cid:16) log( nδ ) log ( ndkǫδ ) · µ F µ ǫ (cid:17) and S ∈ R m × d is a TensorSRHT , then the sketch I k × S satisfies ( µ F , µ , ǫ, δ, n ) -spectral property.Proof. Fix a matrix U ∈ R kd × n with k U k F ≤ µ F and k U k op ≤ µ . Partition U by rows into d × n submatrices U , U , . . . , U k such that U ⊤ = h U ⊤ U ⊤ · · · U ⊤ k i . Note that U ⊤ ( I k × S ) ⊤ ( I k × S ) U = ( U ) ⊤ S ⊤ SU + · · · ( U k ) ⊤ S ⊤ SU k . The proof first considers the simpler case of a
TensorSRHT sketch of rank 1 and then applies thematrix Bernstein inequality from Lemma 27. Let R denote a rank one TensorSRHT sketch. R is a1 × d matrix defined in Definition 15 by setting m = 1 as follows, R = P · ( HD × HD ) , where P ∈ { , } × d has one non-zero element whose position is uniformly distributed over [ d ]. Notethat S ⊤ S ∈ R d × d , is the average of m independent samples from R ⊤ R , i.e., S ⊤ S = m P i ∈ [ m ] R ⊤ i R i ,for i.i.d. R , R , . . . , R m ∼ R , and therefore, U ⊤ ( I k × S ) ⊤ ( I k × S ) U = 1 m X i ∈ [ m ] U ⊤ ( I k × R i ) ⊤ ( I k × R i ) U. Therefore in order to use matrix Bernstein, Lemma 27, we need to bound the maximum operatornorm of U ⊤ ( I k × R ) ⊤ ( I k × R ) U as well as the operator norm of its second moment.40e proceed to upper bound the operator norm of U ⊤ ( I k × R ) ⊤ ( I k × R ) U . First, define the set E := (cid:26) ( D , D ) : (cid:13)(cid:13)(cid:13) ( HD × HD ) U ij (cid:13)(cid:13)(cid:13) ∞ ≤ ( ndµ F kǫδ )) · k U ij k for all j ∈ [ k ] and all i ∈ [ n ] (cid:27) , where U ji is the i th column of U j . By Claim 29, for every i ∈ [ n ] and j ∈ [ k ],Pr D ,D (cid:20)(cid:13)(cid:13)(cid:13) ( HD × HD ) U ji (cid:13)(cid:13)(cid:13) ∞ ≤ ( ndk/δ ) k U ji k (cid:21) ≥ − ǫδ/ ( nkµ F d ) . Thus, by a union bound over all i ∈ [ n ] and j ∈ [ k ], it follows that E occurs with probability atleast 1 − ǫδ/ ( dµ F ), Pr D ,D [( D , D ) ∈ E ] ≥ − ǫδ/ ( dµ F ) , where the probability is over the random choice of D , D .From now on, we fix ( D , D ) ∈ E and proceed having conditioned on this event. Upper bounding (cid:13)(cid:13)(cid:13) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:13)(cid:13)(cid:13) op . From the fact that we have conditioned on ( D , D ) ∈E , note that L ≡ (cid:13)(cid:13)(cid:13) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:13)(cid:13)(cid:13) op = k ( U ) ⊤ R ⊤ RU + · · · ( U k ) ⊤ R ⊤ RU k k op ≤ (cid:13)(cid:13)(cid:13) ( U ) ⊤ R ⊤ RU (cid:13)(cid:13)(cid:13) op + · · · + (cid:13)(cid:13)(cid:13) ( U k ) ⊤ R ⊤ RU k (cid:13)(cid:13)(cid:13) op = k RU k + · · · + k RU k k ≤ ( ndµ F k/ǫδ ) · ( k U k F + · · · + k U k k F ) ≤ ( ndµ F k/ǫδ ) · k U k F = 16 µ F · log ( ndµ F k/ǫδ )) , where the equality on the third line above holds because the matrices ( U i ) ⊤ R ⊤ RU i are rank one. Upper bounding (cid:13)(cid:13)(cid:13)(cid:13) E P (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) op . For every x ∈ R d with k x k = 1, wehave x T E P (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21) x = E P X j,j ′ ∈ [ k ] x T ( U j ) ⊤ R ⊤ RU j · ( U j ′ ) ⊤ R ⊤ RU j ′ x ≤ E P X j,j ′ ∈ [ k ] | RU j x |k RU j k | RU j ′ x |k RU j ′ k = E P X j ∈ [ k ] | RU j x |k RU j k ≤ E P X j ∈ [ k ] ( RU j x ) X j ∈ [ k ] k RU j k , where the second and fourth lines follow from the Cauchy-Schwarz inequality. Using the fact thatwe conditioned on ( D , D ) ∈ E , we get 41 T E P (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21) x ≤
16 log ( ndµ F k/ǫδ ) X j ∈ [ k ] k U j k F E P X j ∈ [ k ] ( RU j x ) = 16 log ( ndµ F k/ǫδ ) X j ∈ [ k ] k U j k F X j ∈ [ k ] E P h ( P ( HD × HD ) U j x ) i = 16 log ( ndµ F k/ǫδ ) · k U k F X j ∈ [ k ] k U j x k = 16 log ( ndµ F k/ǫδ ) · k U k F k U x k ≤
16 log ( ndµ F k/ǫδ ) · µ F µ , since E P (cid:2) ( P ( HD × HD ) U j x ) (cid:3) = d k ( HD × HD ) U j x k = k U j x k for all x .Since the matrix E P (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21) is positive semi-definite for any fixed D and D , it follows that M ≡ (cid:13)(cid:13)(cid:13)(cid:13) E P (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) op ≤
16 log ( ndµ F k/ǫδ ) · µ F µ . Combining one-dimensional TensorSRHT sketches.
To conclude, we note that the Gram matrix of a
TensorSRHT , S ⊤ S ∈ R d × d , is the average of m independent samples from R ⊤ R , i.e., S ⊤ S = m P i ∈ [ m ] R ⊤ i R i , for i.i.d. R , R , . . . , R m ∼ R , and therefore,( I k × S ) ⊤ ( I k × S ) = 1 m X i ∈ [ m ] ( I k × R i ) ⊤ ( I k × R i ) . Recall that ( D , D ) ∈ E occurs with probability at least 1 − ǫδ/ ( dµ F ), therefore we have thefollowing for the conditional expectation E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) ( D , D ) ∈ E i , E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) ( D , D ) ∈ E i (cid:22) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U i Pr[( D , D ) ∈ E ] (cid:22) U ⊤ U − ǫδ/ ( dµ F ) . And also by Cauchy-Schwarz we have, E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) ( D , D ) ∈ E i (cid:23) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U i − E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) ( D , D ) ∈ ¯ E i · Pr[ ¯ E ] (cid:23) U ⊤ U − d k U k F Pr[ ¯ E ] · I n (cid:23) U ⊤ U − d k U k F · ǫδ/ ( dµ F ) · I n (cid:23) U ⊤ U − ( ǫ/ · I n . These two bounds together imply that, (cid:13)(cid:13)(cid:13) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) ( D , D ) ∈ E i − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ/ . Now note that the random variables R ⊤ i R i are independent conditioned on ( D , D ) ∈ E . Hence,using the upper bounds L ≤ µ F · log ( ndµ F k/ǫδ ) and M ≤ µ F µ · log ( ndµ F k/ǫδ ), which hold42hen ( D , D ) ∈ E , we have the following by Lemma 27, (here we drop the subscript from I k forease of notation)Pr P,D ,D (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ ( I × S ) ⊤ ( I × S ) U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≥ ǫ (cid:21) ≤ Pr P (cid:20) (cid:13)(cid:13)(cid:13) U ⊤ ( I × S ) ⊤ ( I × S ) U − E h U ⊤ ( I × R ) ⊤ ( I × R ) U (cid:12)(cid:12)(cid:12) ( D , D ) ∈ E i(cid:13)(cid:13)(cid:13) op ≥ ǫ/ (cid:12)(cid:12)(cid:12)(cid:12) ( D , D ) ∈ E (cid:21) + Pr D ,D [ ¯ E ] ≤ n · exp − mǫ / M + 2 ǫL/ ! + δ/ ≤ δ, where the last inequality follows by setting m = Ω (cid:16) log( n/δ ) log ( ndk/ǫδ ) · µ F µ /ǫ (cid:17) . This showsthat I k × S satisfies the ( µ F , µ , ǫ, δ, n )-spectral property. × OSNAP
In this section, we show that tensoring the identity operator with the OSNAP sketch (Definition 16) results in a transform that satisfies the spectral property (Definition 20) with nearly optimal target dimension as well as nearly optimal application time. This sketch is particularly efficient for sketching sparse vectors. We use a slightly different sketch than the original OSNAP to simplify the analysis, defined as follows.
Definition 21 (OSNAP transform). For every sparsity parameter s, target dimension m, and positive integer d, the OSNAP transform with sparsity parameter s is defined as
S_{r,j} = (1/√s) · δ_{r,j} · σ_{r,j}
for all r ∈ [m] and all j ∈ [d], where σ_{r,j} ∈ {−1, +1} are independent and uniform Rademacher random variables and δ_{r,j} are independent Bernoulli random variables satisfying E[δ_{r,i}] = s/m for all r ∈ [m] and all i ∈ [d].

Lemma 34.
Suppose ǫ, δ, µ , µ F > and n is a positive integer. If S ∈ R m × d is a OSNAP sketchwith sparsity parameter s , then the sketch I k × S satisfies the ( µ F , µ , ǫ, δ, n ) -spectral property,provided that s = Ω (cid:16) log ( ndk/ǫδ ) log( n/δ ) · µ ǫ (cid:17) and m = Ω (cid:16) ( µ F µ /ǫ ) · log ( ndk/ǫδ ) (cid:17) .Proof. Fix a matrix U ∈ R kd × n with k U k F ≤ µ F and k U k op ≤ µ . Partition U by rows into d × n sub-matrices U , U , . . . , U k such that U T = h U ⊤ U ⊤ · · · U ⊤ k i . Note that U ⊤ ( I k × S ) ⊤ ( I k × S ) U = ( U ) ⊤ S ⊤ SU + · · · ( U k ) ⊤ S ⊤ SU k . The proof first considers the simpler case of an
OSNAP sketch of rank 1 and then applies the matrixBernstein bound. Let R denote a rank one OSNAP sketch. R is a 1 × d matrix defined as follows, R i = r ms · δ i σ i , (28)where σ i for all i ∈ [ d ] are independent Rademacher random variables and also, δ i for all i ∈ [ d ] areindependent Bernoulli random variables for which the probability of being one is equal to sm .43e proceed to upper bound the operator norm of U ⊤ ( I k × R ) ⊤ ( I k × R ) U . First, define the set E := (cid:26) R : ( RU j ) ⊤ RU j (cid:22) C (cid:18) ms log ( ndkµ F ǫδ ) · U ⊤ j U j + log( ndkµ F ǫδ ) k U j k F · I n (cid:19) for all j = 1 , . . . , k (cid:27) , where C > R ∈ E ] ≥ − ǫδ/ ( dmµ F ) , where the probability is over the random choices of { σ i } i ∈ [ d ] and { δ i } i ∈ [ d ] . To show this we firstprove the following claim, Claim 35.
For every matrix Z ∈ R d × n , if we let R be defined as in (28) , then, Pr (cid:20) Z ⊤ R ⊤ RZ (cid:22) C (cid:18) ms · log ( n/δ ) Z ⊤ Z + log( n/δ ) k Z k F I n (cid:19)(cid:21) ≥ − δ. Proof.
The proof is by Matrix Bernstein inequality, Lemma 26. For any matrix Z let A = Z ( Z ⊤ Z + µI n ) − / , where µ = sm n/δ ) k Z k F . We can write RA = q ms P i ∈ [ d ] δ i σ i A i , where A i is the i throw of A . Note that E [ δ i σ i A i ] = 0 and k δ i σ i A i k ≤ k A i k ≤ k A k op . Also note that X i ∈ [ d ] E [( δ i σ i A i )( δ i σ i A i ) ∗ ] = X i ∈ [ d ] sm k A i k = sm k A k F and, X i ∈ [ d ] E [( δ i σ i A i ) ∗ ( δ i σ i A i )] = X i ∈ [ d ] sm A ∗ i A i = sm A ⊤ A. Therefore,max (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] E [( δ i σ i A i )( δ i σ i A i ) ∗ ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] E [( δ i σ i A i ) ∗ ( δ i σ i A i )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ sm k A k F . By Lemma 26, Pr (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] δ i σ i A i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≥ t ≤ ( n + 1) · exp − t / sm k A k F + k A k op t/ ! , hence if t = C ′ / · (cid:16)q sm log( n/δ ) k A k F + log( n/δ ) k A k op (cid:17) , then Pr (cid:20)(cid:13)(cid:13)(cid:13)P i ∈ [ d ] δ i σ i A i (cid:13)(cid:13)(cid:13) op ≥ t (cid:21) ≤ δ . Byplugging k RA k = ms · k P i ∈ [ d ] δ i σ i A i k into the above we get the following,Pr (cid:20) k RA k op ≤ C ′ / (cid:18) ms · log ( n/δ ) k A k op + log( n/δ ) k A k F (cid:19)(cid:21) ≥ − δ. Now note that for the choice of A = Z ( Z ⊤ Z + µI n ) − / , we have k A k op ≤ k Z ⊤ Z k op k Z ⊤ Z k op + µ ≤ k A k F = P i λ i ( Z ⊤ Z ) λ i ( Z ⊤ Z )+ µ ≤ P i λ i ( Z ⊤ Z ) µ = ms log( n/δ ). By plugging these into the above we get that,Pr (cid:20)(cid:13)(cid:13)(cid:13) RZ ( Z ⊤ Z + µI n ) − / (cid:13)(cid:13)(cid:13) op ≤ C ′ ms · log ( n/δ ) (cid:21) ≥ − δ. Z ⊤ Z + µI n ) − / Z ⊤ R ⊤ RZ ( Z ⊤ Z + µI n ) − / (cid:22) C ms · log ( n/δ ) I n , with probability 1 − δ , where C = C ′ . Multiplying both sides of the above from left and right bythe positive definite matrix ( Z ⊤ Z + µI n ) / gives (recall that µ = sm · k Z k F log( n/δ ) ), Z ⊤ R ⊤ RZ (cid:22) C (cid:18) ms · log ( n/δ ) Z ⊤ Z + log( n/δ ) k Z k F I n (cid:19) . By applying Claim 35 with failure probability of ǫδ/ ( dkµ F ) on each of U j ’s and then applyinga union bound, we get the following,Pr[ R ∈ E ] ≥ − ǫδ/ ( dmµ F ) . From now on, we fix R ∈ E and proceed having conditioned on this event. Upper bounding (cid:13)(cid:13)(cid:13) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:13)(cid:13)(cid:13) op . From the fact that we have conditioned on R ∈ E ,note that, L ≡ (cid:13)(cid:13)(cid:13) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:13)(cid:13)(cid:13) op = k ( U ) ⊤ R ⊤ RU + · · · ( U k ) ⊤ R ⊤ RU k k op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] C (cid:18) ms · log ( ndkµ F /ǫδ ) · U ⊤ j U j + log( ndkµ F /ǫδ ) k U j k F · I n (cid:19)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13) C (cid:18) ms · log ( ndkµ F /ǫδ ) · U ⊤ U + log( ndkµ F /ǫδ ) k U k F · I n (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ C (cid:18) ms · log ( ndkµ F /ǫδ ) · k U k op + log( ndkµ F /ǫδ ) k U k F (cid:19) ≤ C (cid:18) ms µ · log ( ndkµ F /ǫδ ) + µ F · log( ndkµ F /ǫδ ) (cid:19) . Upper bounding (cid:13)(cid:13)(cid:13)(cid:13) E (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) op . 
From the condition R ∈ E , it follows that E (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21) (cid:22) E (cid:20) C (cid:18) ms · log ( ndkµ F /ǫδ ) · U ⊤ U + log( ndkµ F /ǫδ ) k U k F · I n (cid:19) (cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17)(cid:21) (cid:22) C (cid:18) ms · log ( ndkµ F /ǫδ ) · U ⊤ U + log( ndkµ F /ǫδ ) k U k F · I n (cid:19) E h(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17)i (cid:22) C (cid:18) ms · log ( ndkµ F /ǫδ ) · U ⊤ U + log( ndkµ F /ǫδ ) k U k F · I n (cid:19) · U ⊤ U − ǫδ/ ( dmµ F )where the last line follows from the fact that the random variable U ⊤ ( I k × R ) ⊤ ( I k × R ) U is positivesemidefinite and the conditional expectation can be upper bounded by its unconditional expectationas follows, E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ E i (cid:22) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U i Pr[ R ∈ E ] . M ≡ (cid:13)(cid:13)(cid:13)(cid:13) E (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ (cid:13)(cid:13)(cid:13)(cid:13) C (cid:18) ms · log ( ndkµ F /ǫδ ) · ( U ⊤ U ) + log( ndkµ F /ǫδ ) k U k F · U ⊤ U (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ C (cid:18) ms · log ( ndkµ F /ǫδ ) · k U ⊤ U k op + log( ndkµ F /ǫδ ) k U k F · k U ⊤ U k op (cid:19) = 2 C (cid:18) ms · log ( ndkµ F /ǫδ ) · µ + log( ndkµ F /ǫδ ) µ F µ (cid:19) . Combining one-dimensional OSNAP transforms.
To conclude, we note that the Gram matrix of an
OSNAP sketch, S ⊤ S ∈ R d × d , is the average of m independent samples from R ⊤ R with R defined as in (28) – i.e., S ⊤ S = m P i ∈ [ m ] R ⊤ i R i for i.i.d. R , R , . . . , R m ∼ R , and therefore,( I k × S ) ⊤ ( I k × S ) = 1 m X i ∈ [ m ] ( I k × R i ) ⊤ ( I k × R i ) . Note that by a union bound R i ∈ E simultaneously for all i ∈ [ m ] with probability at least1 − ǫδ/ ( dµ F ). Now note that the random variables R ⊤ i R i are independent conditioned on R i ∈ E for all i ∈ [ m ]. Also note that the conditional expectation E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ E i satisfies the following, E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ E i (cid:23) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U i − E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ ¯ E i · Pr[ ¯ E ] (cid:23) U ⊤ U − d k U k F Pr[ ¯ E ] · I n (cid:23) U ⊤ U − d k U k F · ǫδ/ ( dµ F ) · I n (cid:23) U ⊤ U − d k U k F · ǫ/ · I n . We also have, E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ E i (cid:22) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U i Pr[ R ∈ E ] (cid:22) U ⊤ U − ǫδ/ ( dµ F ) . These two bounds together imply that, (cid:13)(cid:13)(cid:13) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ E i − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ/ . Now, using the upper bounds L ≤ C (cid:16) ms µ · log ( ndkµ F /ǫδ ) + µ F · log( ndkµ F /δ ) (cid:17) and M ≤ C (cid:16) ms · log ( ndkµ F /δ ) · µ + log( ndkµ F /δ ) µ F µ (cid:17) , which hold when R ∈ E , we have that by Lemma27,Pr (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ ( I k × S ) ⊤ ( I k × S ) U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≥ ǫ (cid:21) ≤ Pr (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ ( I k × S ) ⊤ ( I k × S ) U − E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ E i(cid:13)(cid:13)(cid:13) op ≥ ǫ/ (cid:12)(cid:12) E (cid:21) + Pr D [ ¯ E ] ≤ n · exp − mǫ / M + ǫL/ ! + δ/ ≤ δ, s = Ω (cid:16) log ( ndkµ F /ǫδ ) log( nd/δ ) · µ ǫ (cid:17) and m =Ω (cid:16) µ F µ /ǫ · log ( ndkµ F /ǫδ ) (cid:17) . This shows that I k × S satisfies the ( µ F , µ , ǫ, δ, n )-spectral prop-erty. s λ We are ready to prove Theorem 3. We prove that if we instantiate Π p from Definition 11 with T base : OSNAP and S base : TensorSRHT , it satisfies the statement of Theorem 3.
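Before turning to the proof of Theorem 3, the following is a minimal numpy sketch of the OSNAP transform of Definition 21. The dense construction and the unit check at the end are illustrative assumptions (a practical implementation would store S sparsely and apply it in O(s · nnz(x)) time); this is not the paper's Algorithm 1.

```python
import numpy as np

def osnap_matrix(m, d, s, rng):
    """OSNAP-style matrix as in Definition 21 (dense, purely illustrative).

    S[r, j] = delta[r, j] * sigma[r, j] / sqrt(s), where sigma is a uniform random
    sign and delta is Bernoulli with mean s/m, all entries independent.  Each
    column has s nonzeros in expectation, so a sparse representation applies S to
    a vector x in O(s * nnz(x)) expected time.
    """
    delta = rng.random((m, d)) < s / m       # Bernoulli(s/m) indicators
    sigma = rng.choice([-1.0, 1.0], size=(m, d))
    return delta * sigma / np.sqrt(s)

rng = np.random.default_rng(1)
m, d, s = 512, 4096, 8
S = osnap_matrix(m, d, s, rng)
x = rng.standard_normal(d)
print(np.linalg.norm(S @ x) ** 2, np.linalg.norm(x) ** 2)   # E[||Sx||^2] = ||x||^2
```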
Theorem 3.
For every positive integers p, d, n , every ε, s λ > , there exists a distribution onlinear sketches Π p ∈ R m × d p which is an ( ε, / poly ( n ) , s λ , d p , n ) -oblivious subspace embedding as inDefinition 2, provided that the integer m satisfies m = e Ω (cid:0) p s λ /ǫ (cid:1) .Moreover, for any X ∈ R d × n , if A ∈ R d p × n is the matrix whose columns are obtained by a p -fold self-tensoring of each column of X then the matrix Π p A can be computed using Algorithm 1in time e O (cid:0) pnm + p ǫ − nnz( X ) (cid:1) .Proof. Let δ = n ) denote the failure probability. Let m ≈ p log ( ndεδ ) · s λ ε and s ≈ p ε · log ( ndεδ )be integers. Let Π p ∈ R m × m p be the sketch defined in Definition 11, where S base ∈ R m × m is a TensorSRHT sketch and T base ∈ R m × d is an OSNAP sketch with sparsity parameter s .Let q = 2 ⌈ log ( p ) ⌉ . By Lemma 10, it is sufficient to show that Π q is a ( ε, δ, s λ , d q , n )-ObliviousSubspace Embedding. Consider arbitrary A ∈ R d q × n and λ >
0. Let us denote the statisti-cal dimension of A by s λ = s λ ( A ⊤ A ). Let U = A (cid:16) A ⊤ A + λI n (cid:17) − / . Therefore, k U k ≤ k U k F = s λ . Since q < p , by Lemma 34, the transform I d q − × T base , satisfies (2 s λ +2 , , O ( ε/q ) , O ( δ/n q ) , n )-spectral property. Moreover, by Lemma 33, the transform I m q − × S base satisfies (5 s λ + 9 , , O ( ε/q ) , O ( δ/n q ) , n )-spectral property. Therefore, by Lemma 32, the sketchΠ q satisfies ( s λ + 1 , , ε, δ, n )-spectral property, hence,Pr (cid:20)(cid:13)(cid:13)(cid:13) (Π q U ) ⊤ Π q U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ε (cid:21) ≥ − δ. Since U ⊤ U = ( A ⊤ A + λI n ) − / A ⊤ A ( A ⊤ A + λI n ) − / and Π q U = Π p A ( A ⊤ A + λI n ) − / we havethe following,Pr h (1 − ǫ )( A ⊤ A + λI n ) (cid:22) (Π p A ) ⊤ Π p A + λI n (cid:22) (1 + ǫ )( A ⊤ A + λI n ) i ≥ − δ. Runtime:
By Lemma 8, for any S_base and T_base, if A is the matrix whose columns are obtained by p-fold self-tensoring of each column of some X ∈ R^{d×n}, then the sketched matrix Π^p A can be computed using Algorithm 1. When S_base is TensorSRHT and T_base is OSNAP, the runtime of Algorithm 1 for a fixed vector w ∈ R^d is as follows. Computing the Y_j's for each j in lines 3 and 4 of the algorithm requires applying an OSNAP sketch to w ∈ R^d, which in expectation takes time O(s · nnz(w)); therefore computing all the Y_j's takes time O(qs · nnz(w)). Computing each Y_j^l in line 7 of the algorithm amounts to applying a TensorSRHT of input dimension m² and target dimension m to Y_{2j−1}^{l−1} ⊗ Y_{2j}^{l−1}, which takes time O(m log m); therefore computing all the Y_j^l's takes time O(q · m log m). Since q ≤ 2p, the total time for running Algorithm 1 on a vector w is O(p · m log m + ps · nnz(w)). Therefore, sketching the n columns of a matrix X ∈ R^{d×n} takes time O(p(nm log m + s · nnz(X))).
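The binary-tree combination just described can be summarized as follows. Here apply_Tbase and apply_Sbase_pair are hypothetical stand-ins for the OSNAP and TensorSRHT sketches (instantiated below with plain Gaussian projections just to exercise the control flow), and independent copies per tree node are omitted, so this is a schematic of Algorithm 1's structure rather than the algorithm itself.

```python
import numpy as np

def sketch_tensor_power(w, q, apply_Tbase, apply_Sbase_pair):
    """Schematic of the binary-tree combination described above (hypothetical helpers).

    apply_Tbase(w)         -> m-dimensional base sketch of w (role of OSNAP, lines 3-4)
    apply_Sbase_pair(u, v) -> m-dimensional sketch of u tensor v (role of TensorSRHT, line 7)

    Leaves: q base sketches of w; each level pairs up adjacent vectors and
    re-sketches their tensor product, until one length-m vector remains, which
    plays the role of the sketch of w^{tensor q}.
    """
    assert q & (q - 1) == 0, "q is assumed to be a power of two in this illustration"
    level = [apply_Tbase(w) for _ in range(q)]
    while len(level) > 1:
        level = [apply_Sbase_pair(level[2 * i], level[2 * i + 1])
                 for i in range(len(level) // 2)]
    return level[0]

# Toy stand-ins (plain Gaussian projections, shared across nodes) just to run the control flow.
rng = np.random.default_rng(2)
d, m, q = 32, 64, 4
T = rng.standard_normal((m, d)) / np.sqrt(m)
G = rng.standard_normal((m, m * m)) / np.sqrt(m)
w = rng.standard_normal(d)
z = sketch_tensor_power(w, q,
                        apply_Tbase=lambda v: T @ v,
                        apply_Sbase_pair=lambda u, v: G @ np.kron(u, v))
print(z.shape)   # (64,)
```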
Oblivious Subspace Embedding for the Gaussian Kernel

In this section we show how to sketch the Gaussian kernel matrix by polynomial expansion and then applying our proposed sketch for polynomial kernels.
Data points with bounded ℓ₂ radius: Suppose that we are given a dataset of points x₁, …, x_n ∈ R^d such that ‖x_i‖₂ ≤ r for all i ∈ [n], for some positive value r. Consider the Gaussian kernel matrix G ∈ R^{n×n} defined as G_{i,j} = e^{−‖x_i−x_j‖₂²/2} for all i, j ∈ [n]. We are interested in sketching the data-points matrix X using a sketch S_g : R^d → R^m such that the following holds with probability 1 − δ:
(1 − ε)(G + λI_n) ⪯ (S_g(X))^⊤ S_g(X) + λI_n ⪯ (1 + ε)(G + λI_n).

Theorem 5.
For every r > , every positive integers n, d , and every X ∈ R d × n such that k x i k ≤ r for all i ∈ [ n ] , where x i is the i th column of X , suppose G ∈ R n × n is the Gaussian kernel matrix –i.e., G j,k = e −k x j − x k k / for all j, k ∈ [ n ] . There exists an algorithm which computes S g ( X ) ∈ R m × n in time e O (cid:0) q ǫ − ns λ + q ǫ − nnz( X ) (cid:1) such that for every ε, λ > , Pr S g h (1 − ǫ )( G + λI n ) (cid:22) ( S g ( X )) ⊤ S g ( X ) + λI n (cid:22) (1 + ǫ )( G + λI n ) i ≥ − / poly ( n ) , where m = e Θ (cid:0) q s λ /ǫ (cid:1) and q = Θ( r + log( n/ǫλ )) and s λ is λ -statistical dimension of G as inDefinition 1.Proof. Let δ = n ) denote the failure probability. Note that G i,j = e −k x i k / · e x ⊤ i x j · e −k x j k / for every i, j ∈ [ n ]. Let D be a n × n diagonal matrix with i th diagonal entry e −k x i k / and let K ∈ R n × n be defined as K i,j = e x ⊤ i x j (note that DKD = G ). Note that K is a positive definitekernel matrix. The Taylor series expansion for kernel K is as follows, K = ∞ X l =0 ( X ⊗ l ) ⊤ X ⊗ l l ! . Therefore G can be written as the following series, G = ∞ X l =0 ( X ⊗ l D ) ⊤ X ⊗ l Dl ! . Note that each of the terms ( X ⊗ l D ) ⊤ X ⊗ l D = D ( X ⊗ l ) ⊤ X ⊗ l D are positive definite kernelmatrices. The statistical dimension of kernel ( X ⊗ l D ) ⊤ X ⊗ l D for every l ≥ G through the following claim. Claim 36.
For every µ ≥ 0 and every integer l ≥ 0, s_µ((X^{⊗l}D)^⊤ X^{⊗l}D) ≤ s_µ(G).

Proof.
From the Taylor expansion G = Σ_{l=0}^∞ (X^{⊗l}D)^⊤ X^{⊗l}D / l!, together with the fact that the polynomial kernel of any degree is positive definite, we have that (X^{⊗l}D)^⊤ X^{⊗l}D ⪯ G. Now, by the Courant–Fischer min-max theorem we have that
λ_j((X^{⊗l}D)^⊤ X^{⊗l}D) = max_{U ∈ R^{(j−1)×n}} min_{α≠0, Uα=0} α^⊤ (X^{⊗l}D)^⊤ X^{⊗l}D α / ‖α‖₂².
Let U* be the maximizer of the expression above. Then we have
λ_j(G) = max_{U ∈ R^{(j−1)×n}} min_{α≠0, Uα=0} α^⊤ G α / ‖α‖₂² ≥ min_{α≠0, U*α=0} α^⊤ G α / ‖α‖₂² ≥ min_{α≠0, U*α=0} α^⊤ (X^{⊗l}D)^⊤ X^{⊗l}D α / ‖α‖₂² = λ_j((X^{⊗l}D)^⊤ X^{⊗l}D)
for all j. Therefore, the claim follows from the definition of statistical dimension:
s_µ(G) = Σ_{j=1}^n λ_j(G)/(λ_j(G) + µ) ≥ Σ_{j=1}^n λ_j((X^{⊗l}D)^⊤ X^{⊗l}D)/(λ_j((X^{⊗l}D)^⊤ X^{⊗l}D) + µ) = s_µ((X^{⊗l}D)^⊤ X^{⊗l}D).

If we let P = Σ_{l=0}^q (X^{⊗l})^⊤ X^{⊗l} / l!, where q = C · (r² + log(n/(ελ))) for some constant C, then by the triangle inequality we have
‖K − P‖_op ≤ Σ_{l>q} ‖(X^{⊗l})^⊤ X^{⊗l}/l!‖_op ≤ Σ_{l>q} ‖(X^{⊗l})^⊤ X^{⊗l}/l!‖_F ≤ Σ_{l>q} n · r^{2l}/l! ≤ ελ/2.
P is a positive definite kernel matrix. Also note that all the eigenvalues of the diagonal matrix D are bounded by 1. Hence, in order to get a subspace embedding it is sufficient to satisfy the following with probability 1 − δ:
(1 − ε/2)(DPD + λI_n) ⪯ (S_g(X))^⊤ S_g(X) + λI_n ⪯ (1 + ε/2)(DPD + λI_n).
Let the sketch Π^l ∈ R^{m_l × d^l} be the sketch from Theorem 3; then by Claim 36 we get the following guarantee on Π^l:
(1 − ε/2)((X^{⊗l}D)^⊤ X^{⊗l}D + λI_n) ⪯ (Π^l X^{⊗l}D)^⊤ Π^l X^{⊗l}D + λI_n ⪯ (1 + ε/2)((X^{⊗l}D)^⊤ X^{⊗l}D + λI_n),   (29)
with probability 1 − δ/(q+1), as long as m_l is chosen as in Theorem 3 (with degree l, accuracy ε/2, and failure probability δ/(q+1)); moreover, Π^l X^{⊗l}D can be computed in the runtime given by Theorem 3, where s_λ is the λ-statistical dimension of G.

We let S_P be the sketch of size m × (Σ_{l=0}^q d^l) which sketches the kernel P. The sketch S_P is defined as
S_P = (1/√(0!)) · Π⁰ ⊕ (1/√(1!)) · Π¹ ⊕ (1/√(2!)) · Π² ⊕ ⋯ ⊕ (1/√(q!)) · Π^q.
Let Z be the matrix of size (Σ_{l=0}^q d^l) × n whose i-th column is z_i = x_i^{⊗0} ⊕ x_i^{⊗1} ⊕ x_i^{⊗2} ⊕ ⋯ ⊕ x_i^{⊗q}, where x_i is the i-th column of X. Therefore the following holds for (S_P Z)^⊤ S_P Z:
(S_P Z)^⊤ S_P Z = Σ_{l=0}^q (Π^l X^{⊗l})^⊤ Π^l X^{⊗l} / l!,  and hence  (S_P Z D)^⊤ S_P Z D = Σ_{l=0}^q (Π^l X^{⊗l}D)^⊤ Π^l X^{⊗l}D / l!.
Therefore, by combining the terms of (29) for all 0 ≤ l ≤ q and using a union bound, we get that with probability 1 − δ the following holds:
(1 − ε/2)(DPD + λI_n) ⪯ (S_P Z D)^⊤ S_P Z D + λI_n ⪯ (1 + ε/2)(DPD + λI_n).
Now we define S_g(x), a non-linear transformation of the input x, as
S_g(x) = e^{−‖x‖₂²/2} ( (1/√(0!)) · Π⁰(x^{⊗0}) ⊕ (1/√(1!)) · Π¹(x^{⊗1}) ⊕ (1/√(2!)) · Π²(x^{⊗2}) ⊕ ⋯ ⊕ (1/√(q!)) · Π^q(x^{⊗q}) ).
We have that S_g(X) = S_P Z D; therefore, with probability 1 − δ the following holds:
(1 − ε)(G + λI_n) ⪯ (S_g(X))^⊤ S_g(X) + λI_n ⪯ (1 + ε)(G + λI_n).
Note that the target dimension of S_g is m = m_0 + m_1 + ⋯ + m_q, and by Theorem 3 the time to compute S_g(X) is as claimed in the theorem statement.
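To make the construction of S_g concrete, the following is an illustrative numpy sketch that truncates the Taylor expansion at degree q and plugs in generic linear sketches for the Π^l (Gaussian projections are used below as hypothetical stand-ins for the sketches of Theorem 3). It forms x^{⊗l} explicitly, so it is meaningful only for tiny d and q.

```python
import numpy as np
from math import factorial

def gaussian_kernel_features(X, q, sketches):
    """Illustrative construction of S_g from the proof above.

    For each column x of X, concatenate sketches[l](x^{tensor l}) / sqrt(l!) for
    l = 0..q and scale the result by exp(-||x||^2 / 2).  `sketches[l]` should be a
    linear map from R^{d^l} to R^{m_l}; x^{tensor l} is formed explicitly here, so
    this only makes sense for tiny d and q.
    """
    feats = []
    for x in X.T:
        blocks, v = [], np.array([1.0])            # v = x^{tensor 0}
        for l in range(q + 1):
            blocks.append(sketches[l](v) / np.sqrt(factorial(l)))
            v = np.kron(v, x)                      # v becomes x^{tensor (l+1)}
        feats.append(np.exp(-x @ x / 2) * np.concatenate(blocks))
    return np.array(feats).T                       # column i is S_g(x_i)

rng = np.random.default_rng(3)
d, n, q, m = 4, 50, 6, 200
X = 0.3 * rng.standard_normal((d, n))              # small radius, so truncation at q is accurate
# Gaussian projections as hypothetical stand-ins for the per-degree sketches Pi^l.
sketches = [lambda v, P=rng.standard_normal((m, d ** l)) / np.sqrt(m): P @ v for l in range(q + 1)]
F = gaussian_kernel_features(X, q, sketches)
G = np.exp(-np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0) / 2)   # exact Gaussian kernel
print(np.max(np.abs(F.T @ F - G)))                 # entrywise error of the kernel approximation
```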
Acknowledgements

Michael Kapralov is supported by ERC Starting Grant SUBLINEAR. Thomas D. Ahle, Jakob B. T. Knudsen, and Rasmus Pagh are supported by Villum Foundation grant 16582 to Basic Algorithms Research Copenhagen (BARC). David Woodruff is supported in part by Office of Naval Research (ONR) grant N00014-18-1-2562. Part of this work was done while Michael Kapralov, Rasmus Pagh, and David Woodruff were visiting the Simons Institute for the Theory of Computing.
References [AC06] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast johnson-lindenstrauss transform. In
Proceedings of the 38th Annual ACM Symposium on The-ory of Computing, Seattle, WA, USA, May 21-23, 2006 , pages 557–563, 2006.[Ach03] Dimitris Achlioptas. Database-friendly random projections: Johnson-lindenstrausswith binary coins.
J. Comput. Syst. Sci. , 66(4):671–687, June 2003.50ACW17a] Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Faster kernel ridge re-gression using sketching and preconditioning.
SIAM J. Matrix Analysis Applications ,38(4):1116–1138, 2017.[ACW17b] Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Sharper bounds for regu-larized data fitting. In
Approximation, Randomization, and Combinatorial Optimiza-tion. Algorithms and Techniques, APPROX/RANDOM 2017, August 16-18, 2017,Berkeley, CA, USA , pages 27:1–27:22, 2017.[AK19] Thomas D Ahle and Jakob BT Knudsen. Almost optimal tensor sketch. arXiv preprintarXiv:1909.01821 , 2019.[AKM +
17] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Vel-ingker, and Amir Zandieh. Random fourier features for kernel ridge regression: Ap-proximation bounds and statistical guarantees. In
Proceedings of the 34th InternationalConference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August2017 , pages 253–262, 2017.[AKM + CoRR , abs/1804.09893, 2018.[AKM + arXiv preprint arXiv:1812.08723 , 2018.[AM15] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regressionwith statistical guarantees. In Advances in Neural Information Processing Systems 28:Annual Conference on Neural Information Processing Systems 2015, December 7-12,2015, Montreal, Quebec, Canada , pages 775–783, 2015.[ANW14] Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polyno-mial kernel. In
Advances in neural information processing systems , pages 2258–2266,2014.[BCL +
10] Vladimir Braverman, Kai-Min Chung, Zhenming Liu, Michael Mitzenmacher, andRafail Ostrovsky. AMS without 4-wise independence on product domains. In , pages 119–130, 2010.[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.
Concentration inequalities:A nonasymptotic theory of independence . Oxford university press, 2013.[CCFC02] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items indata streams. In
International Colloquium on Automata, Languages, and Program-ming , pages 693–703. Springer, 2002.[Cha02] Moses S Charikar. Similarity estimation techniques from rounding algorithms. In
Proceedings of the thiry-fourth annual ACM symposium on Theory of computing , pages380–388. ACM, 2002. 51CJN18] Michael B. Cohen, T. S. Jayram, and Jelani Nelson. Simple analyses of the sparsejohnson-lindenstrauss transform. In , pages 15:1–15:9, 2018.[CKS11] Andrew Cotter, Joseph Keshet, and Nathan Srebro. Explicit approximations of thegaussian kernel. arXiv preprint arXiv:1109.4603 , 2011.[CNW16a] Michael B. Cohen, Jelani Nelson, and David P. Woodruff. Optimal approximate matrixproduct in terms of stable rank. In , pages11:1–11:14, 2016.[CNW16b] Michael B. Cohen, Jelani Nelson, and David P. Woodruff. Optimal approximate matrixproduct in terms of stable rank. In , pages11:1–11:14, 2016.[Coh16] Michael B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities.In
Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on DiscreteAlgorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016 , pages 278–287,2016.[CW09] Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in the stream-ing model. In
Proceedings of the 41st Annual ACM Symposium on Theory of Com-puting, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009 , pages 205–214,2009.[CW13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regressionin input sparsity time. In
Symposium on Theory of Computing Conference, STOC’13,Palo Alto, CA, USA, June 1-4, 2013 , pages 81–90, 2013.[CW17] Kenneth L Clarkson and David P Woodruff. Low-rank approximation and regressionin input sparsity time.
Journal of the ACM (JACM) , 63(6):54, 2017.[DKS10] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse johnson: Lindenstrausstransform. In
Proceedings of the 42nd ACM Symposium on Theory of Computing,STOC 2010, Cambridge, Massachusetts, USA, 5-8 June 2010 , pages 341–350, 2010.[DlPG12] Victor De la Pena and Evarist Giné.
Decoupling: from dependence to independence .Springer Science & Business Media, 2012.[DMM06] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for l Proceedings of the Seventeenth Annual ACM-SIAMSymposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006 , pages 1127–1136, 2006.[DMMS11] Petros Drineas, Michael W. Mahoney, S. Muthukrishnan, and Tamás Sarlós. Fasterleast squares approximation.
Numerische Mathematik , 117(2):219–249, 2011.[DMMW12] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff.Fast approximation of matrix coherence and statistical leverage.
Journal of MachineLearning Research , 13:3475–3506, 2012.52Hit93] Paweł Hitczenko. Domination inequality for martingale transforms of a rademachersequence.
Israel Journal of Mathematics , 84(1-2):161–178, 1993.[Hit94] Pawel Hitczenko. On a domination of sums of random variables by sums of condition-ally independent ones.
The Annals of Probability , pages 453–468, 1994.[HM07] Uffe Haagerup and Magdalena Musat. On the best constants in noncommutativekhintchine-type inequalities.
Journal of Functional Analysis , 250(2):588–624, 2007.[IM08] Piotr Indyk and Andrew McGregor. Declaring independence via the sketching ofsketches. In
Proceedings of the nineteenth annual ACM-SIAM symposium on Discretealgorithms , pages 737–745. Society for Industrial and Applied Mathematics, 2008.[JLS86] William B. Johnson, Joram Lindenstrauss, and Gideon Schechtman. Extensions oflipschitz maps into banach spaces.
Israel Journal of Mathematics , 54(2):129–138, Jun1986.[KN14] Daniel M. Kane and Jelani Nelson. Sparser johnson-lindenstrauss transforms.
J. ACM ,61(1):4:1–4:23, 2014.[KPV +
19] Michael Kapralov, Rasmus Pagh, Ameya Velingker, David Woodruff, and AmirZandieh. Oblivious sketching of high-degree polynomial kernels. arXiv preprintarXiv:1909.01410 , 2019.[KVW14] Ravi Kannan, Santosh Vempala, and David Woodruff. Principal component analysisand higher correlations for distributed data. In
Conference on Learning Theory , pages1040–1057, 2014.[Lat97] Rafał Latała. Estimation of moments of sums of independent real random variables.
The Annals of Probability , 25(3):1502–1513, 1997.[Lat06] Rafał Latała. Estimates of moments and tails of gaussian chaoses.
The Annals ofProbability , 34(6):2315–2331, 2006.[LDFU13] Yichao Lu, Paramveer S. Dhillon, Dean P. Foster, and Lyle H. Ungar. Faster ridgeregression via the subsampled randomized hadamard transform. In
Advances in NeuralInformation Processing Systems 26: 27th Annual Conference on Neural InformationProcessing Systems 2013. Proceedings of a meeting held December 5-8, 2013, LakeTahoe, Nevada, United States. , pages 369–377, 2013.[LSS14] Quoc Viet Le, Tamás Sarlós, and Alexander Johannes Smola. Fastfood: Approximatekernel expansions in loglinear time.
CoRR , abs/1408.3060, 2014.[MM17] Cameron Musco and Christopher Musco. Recursive sampling for the nystrom method.In
Advances in Neural Information Processing Systems 30: Annual Conference onNeural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA,USA , pages 3836–3848, 2017.[NDT15] Nam H Nguyen, Petros Drineas, and Trac D Tran. Tensor sparsification via a boundon the spectral norm of random tensors.
Information and Inference: A Journal of theIMA , 4(3):195–229, 2015. 53NN13] Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithmsvia sparser subspace embeddings. In , pages 117–126. IEEE, 2013.[Pag13] Rasmus Pagh. Compressed matrix multiplication.
TOCT , 5(3):9:1–9:17, 2013.[PP13] Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit featuremaps. In
The 19th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining, KDD 2013, Chicago, IL, USA, August 11-14, 2013 , pages 239–247,2013.[PT12] Mihai Patrascu and Mikkel Thorup. The power of simple tabulation hashing.
J. ACM ,59(3):14:1–14:50, 2012.[PW15] Mert Pilanci and Martin J. Wainwright. Randomized sketches of convex programswith sharp guarantees.
IEEE Trans. Information Theory , 61(9):5096–5115, 2015.[RR07] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In
Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver,British Columbia, Canada, December 3-6, 2007 , pages 1177–1184, 2007.[RR08] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In
Advances in neural information processing systems , pages 1177–1184, 2008.[Sar06] Tamás Sarlós. Improved approximation algorithms for large matrices via randomprojections. In , pages143–152, 2006.[Tro11] Joel A. Tropp. Improved analysis of the subsampled randomized hadamard transform.
Advances in Adaptive Data Analysis , 3(1-2):115–126, 2011.[Tro15] Joel A. Tropp. An introduction to matrix concentration inequalities.
Foundations andTrends in Machine Learning , 8(1-2):1–230, 2015.[Val15] Gregory Valiant. Finding correlations in subquadratic time, with applications to learn-ing parities and the closest pair problem.
Journal of the ACM (JACM) , 62(2):13, 2015.[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra.
Foundations andTrends in Theoretical Computer Science , 10(1-2):1–157, 2014.
A Direct Lower and Upper Bounds
We introduce the following notation. We say f(x) ≲ g(x) if for some universal constant C we have f(x) ≤ C g(x) for all x ∈ R. Note this is slightly different from the usual f(x) = O(g(x)) in that it is uniform in x rather than asymptotic. We similarly say f(x) ≳ g(x) if g(x) ≲ f(x), and f(x) ∼ g(x) if both f(x) ≲ g(x) and f(x) ≳ g(x).

We will also make heavy use of the L_p norm notation for random variables in R, that is, for p ≥ 1, ‖X‖_{L_p} = (E|X|^p)^{1/p}. A very useful result for computing the L_p norm of a sum of random variables is the following:

Lemma 37 (Latała's inequality, [Lat97]). If p ≥ 2 and X, X₁, …, X_n are i.i.d. mean-0 random variables, then we have
‖ Σ_{i=1}^n X_i ‖_{L_p} ∼ sup { (p/s) · (n/p)^{1/s} · ‖X‖_{L_s} : max{2, p/n} ≤ s ≤ p }.   (30)

The following simple corollary will be used for both upper and lower bounds:

Corollary 38.
Let p ≥ 2, C > 0 and α ≥ 1. Let (X_i)_{i ∈ [n]} be i.i.d. mean-0 random variables such that ‖X_i‖_{L_p} ∼ (Cp)^α; then ‖ Σ_i X_i ‖_{L_p} ∼ C^α max{ 2^α √(pn), (n/p)^{1/p} p^α }.

Proof. We will show that the expression in eq. (30) is maximized either by minimizing or maximizing s. Hence we need to check that (p/s)(n/p)^{1/s} s^α has no other optima in the valid range. For this, we note that
d/ds [ (p/s)(n/p)^{1/s} s^α ] = − p s^{α−3} (n/p)^{1/s} ( (1 − α)s + log(n/p) ).
Given α ≥ 1 this has at most one zero in s, which gives the lemma.

For the lower bound we will also use the following result by Hitczenko, which provides an improvement on Khintchine for Rademacher random variables.

Lemma 39 (Sharp bound on Rademacher sums [Hit93]). Let σ ∈ {−1, 1}^n be a random Rademacher sequence and let a ∈ R^n be an arbitrary real vector with sorted entries |a₁| ≥ |a₂| ≥ ⋯ ≥ |a_n|; then
‖⟨a, σ⟩‖_{L_p} ∼ Σ_{i ≤ p} |a_i| + √p ( Σ_{i > p} a_i² )^{1/2}.   (31)

Finally, the lower bound will use the Paley–Zygmund inequality (also known as the one-sided Chebyshev inequality):

Lemma 40 (Paley–Zygmund). Let X ≥ 0 be a real random variable with finite variance, and let θ ∈ [0, 1]; then
Pr[X ≥ θ E[X]] ≥ (1 − θ)² E[X]² / E[X²].   (32)

A classical strategy when using Paley–Zygmund is to prove E[X] ≥ 2ε for some ε >
0, and then take θ = 1/2, which gives Pr[X ≥ ε] ≥ E[X]² / (4 E[X²]).

A.1 Lower Bound for Sub-Gaussians
The following lower bound considers the sketching matrix consisting of the direct composition ofmatrices with Rademacher entries. Note however that the assumptions on Rademachers are onlyused to show that the p -norm of a single row with a vector is ∼ √ p . For this reason the same lowerbound hold if the Rademacher entries are substituted for, say Gaussians. Theorem 41 (Lower bound) . For some constants C , C , B > , let d, m, c ≥ be integers, let ε ∈ [0 , and δ ∈ [0 , / . Further assume that d ≥ log 1 /δ ≥ c/B . Then the following holds.Let M (1) , . . . , M ( c ) ∈ R m × d be matrices with all independent Rademacher entries and let M = √ m M (1) • · · · • M ( c ) . Then there exists some unit vector y ∈ R d c such that if m < C max (cid:26) c ε − log 1 /δc , ε − (cid:18) C log 1 /δc (cid:19) c (cid:27) then Pr h(cid:12)(cid:12)(cid:12) k M y k − (cid:12)(cid:12)(cid:12) > ε i > δ. (33)55 roof. Let y = [1 , . . . , T / √ d ∈ R d and let x = y ⊗ c . We have k M y k − m (cid:13)(cid:13)(cid:13) M (1) x ◦ · · · ◦ M ( c ) x (cid:13)(cid:13)(cid:13) − m X j ∈ [ m ] (cid:0) Y i ∈ [ c ] Z i,j (cid:1) − Z i,j = P k ∈ [ d ] M ( i ) j,k / √ d are independent averages of d independent Rademacher randomvariables. By Lemma 39 we have k Z i,j k L p ∼ min {√ p, √ d } which is √ p by the assumption d ≥ log 1 /δ as long as p ≤ log 1 /δ . By the expanding Z i,j into monomials and linearity of expectationwe get k Z i,j k = √ d ( d + 3 d ( d − / = (3 − /d ) / .Now define X j = Q i ∈ [ c ] Z i,j −
1, then EX j = 0 and k X j k L p ≥ (cid:13)(cid:13)(cid:13)Q i ∈ [ c ] Z i,j (cid:13)(cid:13)(cid:13) L p − k Z i,j k cL p − ≥ K c p c for some K , assuming p ≥
2. In particular, k X j k L ≥ k Z i,j k cL − − /d ) c/ − ∼ c/ by the assumption d ≥ c ≥ (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p = m (cid:13)(cid:13)(cid:13)P j ∈ [ m ] X m (cid:13)(cid:13)(cid:13) L p is a sum of iid. random variables, so we can useCorollary 38 to show K max (cid:26)q c p/m, ( m/p ) /p K c p c /m (cid:27) . (cid:13)(cid:13)(cid:13) k M y k − (cid:13)(cid:13)(cid:13) L p (35) . K max (cid:26)q c p/m, ( m/p ) /p K c p c /m (cid:27) (36)for some universal constants K , K , K , K > m < max n AK c ε − /δc , K ε − (cid:16) AK /δc (cid:17) c o as in the theorem. Wetake p = 4 A log 1 /δc for some constant A to be determined. We want to show (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p ≥ ε .For this we split into two cases depending on which term of m < max { (1) , (2) } dominates. If (1) ≥ (2) we pick the first lower bound in eq. (36) and get (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p ≥ K p c p/m ≥ K r ε K = 2 ε .Otherwise, if (2) ≥ (1), we pick the other lower bound and also get: (cid:13)(cid:13)(cid:13) k M y k − (cid:13)(cid:13)(cid:13) L p ≥ K ( m/p ) /p K c p c m ≥ K K c (cid:16) A log 1 /δc (cid:17) cK ε − (cid:16) AK /δc (cid:17) c = 2 ε, (37)where we used ( m/p ) /p ≥ e − / ( em ) ≥ / m ≥
1. Plugging into Paley-Zygmund (Lemma 40)we have Pr h(cid:12)(cid:12)(cid:12) k M y k − (cid:12)(cid:12)(cid:12) ≥ ε i ≥ Pr h(cid:12)(cid:12)(cid:12) k M y k − (cid:12)(cid:12)(cid:12) p ≥ (cid:13)(cid:13)(cid:13) k M y k − (cid:13)(cid:13)(cid:13) pL p − p i (38) ≥ (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p ! p , (39)where we used that p ≥ − − p ) ≥ / p c p/m ≥ ( m/p ) /p K c p c /m we have using the first lower bound that k k My k − k Lp k k My k − k L p ≥ K √ K .For the alternative case, ( m/p ) /p K c p c /m ≥ p c p/m , we have (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p ≥ K √ K ( m/p ) /p ( m/ p ) / p (cid:18) K K (cid:19) c ≥ K K (cid:18) K K (cid:19) c (40)56here ( m/p ) /p ( m/ p ) / p ≥ e − / (4 em ) ≥ / √ m ≥ A ≤ min { K /K , K /K } /
32. Thischoice also ensures that 1 ≤ p ≤ log 1 /δ as we promised. Note that we may assume in eq. (36) that K ≤ K and K ≤ K . We then finally have14 (cid:18) K √ K (cid:19) p ≥ δ / (4 c ) and 14 (cid:18) K K (cid:18) K K (cid:19) c (cid:19) p ≥ δ / (4 c )+1 / , (41)which are both ≥ δ for c ≥ δ < / A.2 Upper bound for Sub-Gaussians
Theorem 42 (Upper bound) . Let ε, δ ∈ [0 , and let γ > , ≤ c ≤ log 1 /δ γ be some constants.Let T ∈ R m × d be a matrix with iid. rows T , . . . , T m ∈ R d such that E (cid:2) ( T x ) (cid:3) = k x k and k T x k L p ≤ √ ap k x k for some a > and p ≥ . Let M = T (1) • · · · • T ( c ) where T (1) , . . . , T ( c ) areindependent copies of T . Then M has the JL-moment property, kk M x k − k x k k L p ≤ εδ /p , given m & (4 ae γ ) c ε − log 1 /δcγ + (4 ae γ ) c ε − (cid:18) log 1 /δcγ (cid:19) c . (42) Remark 2.
In the case of random Rademachers we set a = √ / m = O (cid:18) c ε − log 1 /δcγ e cγ + ε − (cid:18) √ /δcγ (cid:19) c e cγ (cid:19) . Note that depending on γ this matches either of the terms of the lower bound. Setting γ = Θ(1 /c )or γ = Θ(1) we have either m = O (cid:16) c ε − log 1 /δ + ε − (cid:16) √ /δ (cid:17) c (cid:17) or m = O (cid:18) (3 e ) c ε − log 1 /δc + ε − (cid:18) √ e log 1 /δc (cid:19) c (cid:19) . Finally, in the case of constant c = O (1) , γ = Θ(1) we simply get m = O (cid:16) ε − log 1 /δ + ε − (log 1 /δ ) c (cid:17) . Proof of Theorem 42.
Without loss of generalization we may assume k x k L = 1. We notice that (cid:13)(cid:13) k M x k − (cid:13)(cid:13) L p ≤ (cid:13)(cid:13)(cid:13) m P i ( M i x ) − (cid:13)(cid:13)(cid:13) L p is the mean of iid. random variables. Call these Z i =( M i x ) −
1. Then EZ i = 0 and k Z i k L p = (cid:13)(cid:13) ( M i x ) − (cid:13)(cid:13) L p . (cid:13)(cid:13) ( M i x ) (cid:13)(cid:13) L p = k M i x k L p by the triangleinequality and the assumption that p ≥
1. Now by the assumption k T x k L p ≤ √ ap k x k = √ ap , andby Lemma 19, we get that k M i x k L p = (cid:13)(cid:13)(cid:13) T (1) i ⊗ · · · ⊗ T ( c ) i x (cid:13)(cid:13)(cid:13) L p ≤ ( ap ) c/ , and so k Z i k L p ≤ (2 ap ) c forall i ∈ [ m ].We now use Corollary 38 which implies (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) m X i Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L p . (4 a ) c q p/m + m /p (2 ap ) c /m . (4 a ) c q p/m + (4 ap ) c /m. (43)The second inequality comes from the following consideration: If the second term of (43) dominates,then (4 a ) c p p/m ≤ m /p (2 ap ) c /m which implies m /p ≤ ( p/ c − p − ≤ c for p ≥ p . We take p = log 1 /δcγ which is ≥ m = max { (4 ae γ ) c pε − , (4 ae γ ) c p c ε − } . Then (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) m X i Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) pL p . (4 a ) cp max { ε p (4 ae γ ) − cp , ε p (4 ae γ ) − cp } (44)= e − cγp ε p (45)= δε p , (46)which is exactly the JL moment property. A.3 Lower Bound for TensorSketch
For every pair of integers d, q, the TensorSketch of degree q, M : R^{d^q} → R^m, is defined as
M(x^{⊗q}) = F^{−1}((F C¹ x) ◦ (F C² x) ◦ ⋯ ◦ (F C^q x)),   (47)
for every x ∈ R^d, where C¹, …, C^q ∈ R^{m×d} are independent instances of CountSketch and F ∈ C^{m×m} is the Discrete Fourier Transform matrix with the normalization that satisfies the convolution theorem; ◦ denotes the entry-wise (Hadamard) product of vectors of the same size.

Lemma 43.
For every integer d, q , let M : R d q → R m be the TensorSketch of degree q ≤ d , see (47) . For the all ones vector x = { } d , Var h k M x ⊗ q k i ≥ (cid:18) q m − (cid:19) k x ⊗ q k . Proof.
Note that since F is normalized such that it satisfies the convolution theorem, F − is indeeda unitary matrix times 1 / √ m , k M x ⊗ q k = m k ( F C x ) ◦ ( F C x ) ◦ · · · ( F C q x ) k . Consider the firstentry of the vector ( F C x ) ◦ ( F C x ) ◦ · · · ( F C q x ). Because the first row of F is all ones { } m ,the first element of the mentioned vector for the choice of x = { } d is Q qi =1 (cid:16)P j ∈ [ d ] σ i ( j ) (cid:17) = Q qi =1 (cid:16)P j ∈ [ d ] σ i ( j ) (cid:17) , where σ i : [ d ] → {− , +1 } are fully independent random hash functions usedby the CountSketch C i for all i ∈ [ q ]. Let us denote by V the following positive random variable, V = q Y i =1 X j ∈ [ d ] σ i ( j ) . Note that k M x ⊗ q k ≥ Vm , hence E (cid:2) k M x ⊗ q k (cid:3) ≥ E [ V ] m . Also note that E (cid:2) V (cid:3) = Q qi =1 E (cid:20)(cid:16)P j ∈ [ d ] σ i ( j ) (cid:17) (cid:21) because σ i ’s are independent. We can write E X j ∈ [ d ] σ i ( j ) = 3 d − d = 3(1 − d ) k x k , hence if d ≥ q , E h V i ≥ (1 / · q · k x ⊗ q k , Therefore E (cid:2) k M x ⊗ q k (cid:3) ≥ E [ V ] m ≥ q m k x ⊗ q k . It is also true that E (cid:2) k M x ⊗ q k (cid:3) = k x ⊗ q k [ANW14].58 emma 44. For every integer d, q every ε > , every < δ ≤ · q , let M : R d q → R m be the TensorSketch of degree q , see (47) . If m < q/ then for the all ones vector x = { } d we have, Pr h |k M x ⊗ q k − k x ⊗ q k | > / · k x ⊗ q k i > δ. Proof.
Note that since F is normalized such that it satisfies the convolution theorem, F − is indeeda unitary matrix times 1 / √ m , k M x ⊗ q k = m k ( F C x ) ◦ ( F C x ) ◦ · · · ( F C q x ) k . Consider the firstentry of the vector ( F C x ) ◦ ( F C x ) ◦ · · · ( F C q x ). Because the first row of F is all ones { } m ,the first element of the mentioned vector for the choice of x = { } d is Q qi =1 (cid:16)P j ∈ [ d ] σ i ( j ) (cid:17) = Q qi =1 (cid:16)P j ∈ [ d ] σ i ( j ) (cid:17) , where σ i : [ d ] → {− , +1 } are fully independent random hash functions usedby the CountSketch C i for all i ∈ [ q ]. Let us denote by V the following positive random variable, V = q Y i =1 X j ∈ [ d ] σ i ( j ) . Note that k M x ⊗ q k ≥ Vm . Note that E (cid:2) V t (cid:3) = Q qi =1 E (cid:20)(cid:16)P j ∈ [ d ] σ i ( j ) (cid:17) t (cid:21) for every t because σ i ’s areindependent. Note that for t = 2 we have, E X j ∈ [ d ] σ i ( j ) = 3 d − d ≥ − d ) k x k , hence if d ≥ q , E h V i ≥ (3 q / · k x ⊗ q k . Now consider t = 4. By Khintchine’s inequality, Lemma 17, we have, E X j ∈ [ d ] σ i ( j ) ≤ · k x k , hence, E h V i ≤ q · k x ⊗ q k . Therefore by Paley Zygmund we have the following,Pr " k M x ⊗ q k ≥ q m · k x ⊗ q k ≥ Pr h V ≥ q / · k x ⊗ q k i = Pr h V ≥ q / · k x ⊗ q k i ≥ Pr h V ≥ / · E h V ii ≥ / · E (cid:2) V (cid:3) E [ V ] ≥ q · q > · q ≥ δ.δ.
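For reference, the TensorSketch of (47) can be applied without ever forming x^{⊗q}, using CountSketch and the FFT. The snippet below uses fully independent hash functions for simplicity (the standard construction needs only limited independence), so it is an illustration of the definition rather than an optimized implementation.

```python
import numpy as np

def count_sketch(x, h, sgn, m):
    """CountSketch: coordinate j of x is added to bucket h[j] with sign sgn[j]."""
    out = np.zeros(m)
    np.add.at(out, h, sgn * x)
    return out

def tensor_sketch(x, q, m, rng):
    """Degree-q TensorSketch of x^{tensor q} following (47), via the FFT:
    M(x^{tensor q}) = F^{-1}((F C1 x) o ... o (F Cq x)), so x^{tensor q} is never formed."""
    d = len(x)
    prod = np.ones(m, dtype=complex)
    for _ in range(q):
        h = rng.integers(0, m, size=d)           # bucket hash of one CountSketch
        sgn = rng.choice([-1.0, 1.0], size=d)    # sign hash of one CountSketch
        prod *= np.fft.fft(count_sketch(x, h, sgn, m))
    return np.real(np.fft.ifft(prod))

data_rng = np.random.default_rng(0)
d, q, m = 16, 3, 4096
x = data_rng.standard_normal(d); x /= np.linalg.norm(x)
y = x + 0.3 * data_rng.standard_normal(d); y /= np.linalg.norm(y)
zx = tensor_sketch(x, q, m, np.random.default_rng(4))   # same seed: both sketches share
zy = tensor_sketch(y, q, m, np.random.default_rng(4))   # the same hash functions
print(np.dot(zx, zy), np.dot(x, y) ** q)                # approximately equal
```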