Oblivious Sketching of High-Degree Polynomial Kernels
Thomas D. Ahle, Michael Kapralov, Jakob B. T. Knudsen, Rasmus Pagh, Ameya Velingker, David Woodruff, Amir Zandieh
aa r X i v : . [ c s . D S ] M a y Oblivious Sketching of High-Degree Polynomial Kernels ∗ Thomas D. Ahle
ITU and BARC [email protected]
Michael Kapralov
EPFL [email protected]
Jakob B. T. Knudsen
U. Copenhagen and BARC [email protected]
Rasmus Pagh
ITU and BARC [email protected]
Ameya Velingker
Google Research [email protected]
David P. Woodruff
Amir Zandieh
EPFL [email protected]
May 5, 2020
Abstract
Kernel methods are fundamental tools in machine learning that allow detection of non-lineardependencies between data without explicitly constructing feature vectors in high dimensionalspaces. A major disadvantage of kernel methods is their poor scalability: primitives such askernel PCA or kernel ridge regression generally take prohibitively large quadratic space and (atleast) quadratic time, as kernel matrices are usually dense. Some methods for speeding up kernellinear algebra are known, but they all invariably take time exponential in either the dimensionof the input point set (e.g., fast multipole methods suffer from the curse of dimensionality ) orin the degree of the kernel function.
Oblivious sketching has emerged as a powerful approach to speeding up numerical linearalgebra over the past decade, but our understanding of oblivious sketching solutions for kernelmatrices has remained quite limited, suffering from the aforementioned exponential dependenceon input parameters. Our main contribution is a general method for applying sketching solutionsdeveloped in numerical linear algebra over the past decade to a tensoring of data points withoutforming the tensoring explicitly. This leads to the first oblivious sketch for the polynomialkernel with a target dimension that is only polynomially dependent on the degree of the kernelfunction, as well as the first oblivious sketch for the Gaussian kernel on bounded datasets thatdoes not suffer from an exponential dependence on the dimensionality of input data points. ∗ This paper is a merged version of the work of Ahle and Knudsen [AK19] and Kapralov, Pagh, Velingker, Woodruffand Zandieh [KPV + ontents p q (analysis for T base : CountSketch and S base : TensorSketch ) . . 244.2 Higher Moments of Π q (analysis for T base : OSNAP and S base : TensorSRHT ) . . . . . 26 s λ q . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365.3 Spectral Property of Identity × TensorSRHT . . . . . . . . . . . . . . . . . . . . . . . 405.4 Spectral property of Identity × OSNAP . . . . . . . . . . . . . . . . . . . . . . . . . . 435.5 High Probability OSE with linear dependence on s λ . . . . . . . . . . . . . . . . . . 47 A.1 Lower Bound for Sub-Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55A.2 Upper bound for Sub-Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57A.3 Lower Bound for TensorSketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
Introduction
Data dimensionality reduction, or sketching , is a common technique for quickly reducing the size ofa large-scale optimization problem while approximately preserving the solution space, thus allowingone to instead solve a much smaller optimization problem, typically in a smaller amount of time.This technique has led to near-optimal algorithms for a number of fundamental problems in numer-ical linear algebra and machine learning, such as least squares regression, low rank approximation,canonical correlation analysis, and robust variants of these problems. In a typical instance of sucha problem, one is given a large matrix X ∈ R d × n as input, and one wishes to choose a randommap Π from a certain family of random maps and replace X with Π X . As Π typically has manyfewer rows than columns, Π X compresses the original matrix X , which allows one to perform theoriginal optimization problem on the much smaller matrix Π X . For a survey of such techniques,we refer the reader to the survey by Woodruff [Woo14].A key challenge in this area is to extend sketching techniques to kernel-variants of the abovelinear algebra problems. Suppose each column of X corresponds to an example while each ofthe d rows corresponds to a feature. Then these algorithms require an explicit representationof X to be made available to the algorithm. This is unsatisfactory in many machine learningapplications, since typically the actual learning is performed in a much higher (possibly infinite)dimensional feature space, by first mapping each column of X to a much higher dimensional space.Fortunately, due to the kernel trick, one need not ever perform this mapping explicitly; indeed, if theoptimization problem at hand only depends on inner product information between the input points,then the kernel trick allows one to quickly compute the inner products of the high dimensionaltransformations of the input points, without ever explicitly computing the transformation itself.However, evaluating the kernel function easily becomes a bottleneck in algorithms that rely on thekernel trick because it typically takes O ( d ) time to evaluate the kernel function for d dimensionaldatasets. There are a number of recent works which try to improve the running times of kernelmethods; we refer the reader to the recent work of [MM17] and the references therein. A naturalquestion is whether it is possible to instead apply sketching techniques on the high-dimensionalfeature space without ever computing the high-dimensional mapping.For the important case of polynomial kernel , such sketching techniques are known to be possible .This was originally shown by Pham and Pagh in the context of kernel support vector machines[PP13], using the TensorSketch technique for compressed matrix multiplication due to Pagh [Pag13].This was later extended in [ANW14] to a wide array of kernel problems in linear algebra, includingprincipal component analysis, principal component regression, and canonical correlation analysis.The running times of the algorithms above, while nearly linear in the number of non-zeroentries of the input matrix X , depend exponentially on the degree q of the polynomial kernel. Forexample, suppose one wishes to do low rank approximation on A , the matrix obtained by replacingeach column of X with its kernel-transformed version. One would like to express A ≈ U V , where U ∈ R d p × k and V ∈ R k × n . Writing down U explicitly is problematic, since the columns belong tothe much higher d p -dimensional space. 
Instead, one can express U V implicitly via column subsetselection, by expressing it as a
AZZ ⊤ and then outputting Z . Here Z is an n × k matrix. In[ANW14], an algorithm running in nnz ( X ) + ( n + d )poly (3 p , k, /ǫ ) time was given for outputtingsuch Z with the guarantee that k A − AZZ ⊤ k F ≤ (1 + ǫ ) k A − A k k F with constant probability, where A k is the best rank- k approximation to A . Algorithms with similar running times were proposedfor principal component regression and canonical correlation analysis. The main message here is The lifting function corresponding to the polynomial kernel maps x ∈ R d to φ ( x ) ∈ R d p , where φ ( x ) i ,i ,...,i p = x i x i · · · x i p , for i , i , . . . , i p ∈ { , , . . . , d } p rows in order toguarantee their correctness. Moreover, the existing sketches work with constant probability onlyand no high probability result was known for the polynomial kernel.The main drawback with previous work on applying dimensionality reduction for the polynomialkernel is the exponential dependence on p in the sketching dimension and consequently in therunning time. Ideally, one would like a polynomial dependence. This is especially useful for theapplication of approximating the Gaussian kernel by a sum of polynomial kernels of various degrees,for which large values of p , e.g., p = poly (log n ) are used [CKS11]. This raises the main questionof our work: Is it possible to desing a data oblivious sketch with a sketching dimension (and, hence, runningtime) that is not exponential in p for the above applications in the context of the polynomialkernel? While we answer the above question, we also study it in a more general context, namely, thatof regularization. In many machine learning problems, it is crucial to regularize so as to preventoverfitting or ill-posed problems. Sketching and related sampling-based techniques have also beenextensively applied in this setting. For a small sample of such work see [RR07, AM15, PW15,MM17, ACW17b, ACW17a, AKM +
17, AKM + d × n matrix A , and a d × b , and one seeks to find a y ∈ R n so as to minimize k Ay − b k . In ridge regression, we instead seek a y so as to minimize k Ay − b k + λ k y k , for a parameter λ >
0. Intuitively, if λ is much larger than the operator norm k A k of A , then a good solution is obtained simply by setting y = 0 d . On the other hand, if λ = 0,the problem just becomes an ordinary least squares regression. In general, the statistical dimension (or effective degrees of freedom ), s λ , captures this tradeoff, and is defined as P di =1 λ i ( A ⊤ A ) λ i ( A ⊤ A )+ λ , where λ i ( A ⊤ A ) is the i -th eigenvalue of A ⊤ A . Note that the statistical dimension is always at mostmin( n, d ), but in fact can be much smaller. A key example of its power is that for ridge regression,it is known [ACW17b] that if one chooses a random Gaussian matrix Π with O ( s λ /ǫ ) rows, and if y is the minimizer to k Π Ay − Π b k + λ k y k , then k Ay − b k + λ k y k ≤ (1+ ǫ ) min y ′ ( k Ay ′ − b k + λ k y ′ k ).Note that for ordinary regression ( λ = 0) one would need that Π has Ω(rank( A ) /ǫ ) rows [CW09].Another drawback of existing sketches for the polynomial kernel is that their running time andtarget dimension depend at least quadratically on s λ and no result is known with linear dependenceon s λ , which would be optimal. We also ask if the exponential dependence on p is avoidable in the regularized setting: Is it possible to obtain sketching dimension bounds and running times that are not exponential in p in the context of regularization? Moreover, is it possible to obtain a running time that dependsonly linearly on s λ ? In this paper, we answer the above questions in the affirmative. In other words, for each ofthe aforementioned applications, our algorithm depends only polynomially on p . We state theseapplications as corollaries of our main results, which concern approximate matrix product andsubspace embeddings. In particular, we devise a new distribution on oblivious linear maps Π ∈ R m × d p (i.e., a randomized family of maps that does not depend on the dataset X ), so that for anyfixed X ∈ R d × n , it satisfies the approximate matrix product and subspace embedding properties.These are the key properties needed for kernel low rank approximation. We remark that our4 ata oblivious sketching is greatly advantageous to data dependent methods because it results in aone-round distributed protocol for kernel low rank approximation [KVW14].We show that our oblivious linear map Π ∈ R m × d p has the following key properties: Oblivious Subspace Embeddings (OSEs).
Given ε > n -dimensional subspace E ⊆ R d , we say that Π ∈ R m × d is an ε -subspace embedding for E if (1 − ε ) k x k ≤ k Π x k ≤ (1 + ε ) k x k for all x ∈ E . In this paper we focus on Oblivious Subspace Embeddings in the regularized setting.In order to define a (regularized) Oblivious Subspace Embedding, we need to introduce the notionof statistical dimension , which is defined as follows: Definition 1 (Statistical Dimension) . Given λ ≥
0, for every positive semidefinite matrix K ∈ R n × n , we define the λ -statistical dimension of K to be s λ ( K ) := tr ( K ( K + λI n ) − ) . Now, we can define the notion of an oblivious subspace embedding (OSE):
Definition 2 (Oblivious Subspace Embedding (OSE)) . Given ε, δ, µ > d, n ≥ ε, δ, µ, d, n ) -Oblivious Subspace Embedding (OSE) is a distribution D over m × d matrices (forarbitrary m ) such that for every λ ≥
0, every A ∈ R d × n with λ -statistical dimension s λ ( A ⊤ A ) ≤ µ ,the following holds, Pr Π ∼D h (1 − ǫ )( A ⊤ A + λI n ) (cid:22) (Π A ) ⊤ Π A + λI n (cid:22) (1 + ǫ )( A ⊤ A + λI n ) i ≥ − δ. (1)The goal is to have the target dimension m small so that Π provides dimensionality reduction.If we consider the non-oblivious setting where we allow the sketch matrix Π to depend on A , thenby leverage score sampling we can achieve a target dimension of m ≈ s λ ( A ⊤ A ), which is essentiallyoptimal [AKM + m ≈ s λ ( A ⊤ A ). Approximate Matrix Product.
We formally define this property in the following definition.
Definition 3 (Approximate Matrix Product) . Given ε, δ >
0, we say that a distribution D over m × d matrices has the ( ε, δ ) -approximate matrix product property if for every C, D ∈ R d × n ,Pr Π ∼D h k C ⊤ Π ⊤ Π D − C ⊤ D k F ≤ ε k C k F k D k F i ≥ − δ. Our main theorems, which provide the aforementioned guarantees, are as follows, Theorem 1.
For every positive integers n, p, d , every ε, s λ > , there exists a distribution on linearsketches Π p ∈ R m × d p such that: (1) If m = Ω (cid:0) ps λ ǫ − (cid:1) , then Π p is an ( ε, / , s λ , d p , n ) -oblivioussubspace embedding as in Definition 2. (2) If m = Ω (cid:0) pε − (cid:1) , then Π p has the ( ε, / -approximatematrix product property as in Definition 3.Moreover, for any X ∈ R d × n , if A ∈ R d p × n is the matrix whose columns are obtained by the p -fold self-tensoring of each column of X then the matrix Π p A can be computed using Algorithm 1in time e O ( pnm + p nnz( X )) . For symmetric matrices K and K ′ , the spectral inequality relation K (cid:22) K ′ holds if and only if x ⊤ Kx ≤ x ⊤ K ′ x for all vectors x Throughout this paper, the notations e O, e Ω , e Θ suppress poly (log( nd/ε )) factors. heorem 2. For every positive integers n, p, d , every ε, s λ > , there exists a distribution on linearsketches Π p ∈ R m × d p such that: (1) If m = e Ω (cid:0) ps λ ǫ − (cid:1) , then Π p is an ( ε, / poly ( n ) , s λ , d p , n ) -oblivious subspace embedding (Definition 2). (2) If m = e Ω (cid:0) pε − (cid:1) , then Π p has the ( ε, / poly ( n )) -approximate matrix product property (Definition 3).Moreover, in the setting of (1) , for any X ∈ R d × n , if A ∈ R d p × n is the matrix whose columnsare obtained by a p -fold self-tensoring of each column of X , then the matrix Π p A can be computedusing Algorithm 1 in time e O (cid:16) pnm + p / s λ ε − nnz( X ) (cid:17) . Theorem 3.
For every positive integers p, d, n , every ε, s λ > , there exists a distribution onlinear sketches Π p ∈ R m × d p which is an ( ε, / poly ( n ) , s λ , d p , n ) -oblivious subspace embedding as inDefinition 2, provided that the integer m satisfies m = e Ω (cid:0) p s λ /ǫ (cid:1) .Moreover, for any X ∈ R d × n , if A ∈ R d p × n is the matrix whose columns are obtained by a p -fold self-tensoring of each column of X then the matrix Π p A can be computed using Algorithm 1in time e O (cid:0) pnm + p ǫ − nnz( X ) (cid:1) . We can immediately apply these theorems to kernel ridge regression with respect to the polyno-mial kernel of degree p . In this problem, we are given a regularization parameter λ >
0, a d × n ma-trix X , and vector b ∈ R n and would like to find a y ∈ R n so as to minimize k A ⊤ Ay − b k + λ k Ay k ,where A ∈ R d p × n is the matrix obtained from X by applying the self tensoring of degree p toeach column. To solve this problem via sketching, we choose a random matrix Π p accordingto the theorems above and compute Π p A . We then solve the sketched ridge regression problemwhich seeks to minimize (cid:13)(cid:13)(cid:13) (Π p A ) ⊤ Π p Ax − b (cid:13)(cid:13)(cid:13) + λ k Π p Ax k over x . By the above theorems, wehave (cid:13)(cid:13)(cid:13) (Π p A ) ⊤ Π p Ax − b (cid:13)(cid:13)(cid:13) + λ k Π p Ax k = (1 ± ǫ ) (cid:18)(cid:13)(cid:13)(cid:13) A ⊤ Ax − b (cid:13)(cid:13)(cid:13) + λ k Ax k (cid:19) simultaneously for all x ∈ R n ; thus, solving the sketched ridge regression problem gives a (1 ± ǫ )-approximation to theoriginal problem. If we apply Theorem 1, then the number of rows of Π p needed to ensure successwith probability 9 /
10 is Θ( ps λ ǫ − ). The running time to compute Π p A is O ( p s λ ε − n + p nnz( X )),after which a ridge regression problem can be solved in O ( ns λ /ǫ ) time via an exact closed-formsolution for ridge regression. An alternative approach to obtaining a very high-accuracy approxima-tion is to use the sketched kernel as a preconditioner to solve the original ridge regression problem,which improves the dependence on ε to log(1 /ε ) [ACW17a]. To obtain a higher probability ofsuccess, we can instead apply Theorem 3, which would allow us to compute the sketched matrixΠ p A in e O ( p s λ ε − n + p ε − nnz( X )) time. This is the first sketch to achieve the optimal depen-dence on s λ for the polynomial kernel, after which we can now solve the ridge regression problemin e O ( ns λ poly (cid:0) p, ǫ − (cid:1) ) time. Importantly, both running times are polynomial in p , whereas allpreviously known methods incurred running times that were exponential in p .Although there has been much work on sketching methods for kernel approximation whichnearly achieve the optimal target dimension m ≈ s λ , such as Nystrom sampling [MM17], all knownmethods are data-dependent unless strong conditions are assumed about the kernel matrix (smallcondition number or incoherence). Data oblivious methods provide nice advantages, such as one-round distributed protocols and single-pass streaming algorithms. However, for kernel methodsthey are poorly understood and previously had worse theoretical guarantees than data-dependentmethods. Furthermore, note that the Nystrom method requires to sample at least m = Ω( s λ ) land-marks to satisfy the subspace embedding property even given an oracle access to the exact leveragescores distribution. This results in a runtime of Ω (cid:0) s λ d + s λ nnz( X ) (cid:1) . Whereas our method achievesa target dimension that nearly matches the best dimension possible with data-dependent Nystrommethod and with strictly better running time of e O ( ns λ + nnz( X )) (assuming p = poly (log n )).6herefore, for a large range of parameter our sketch runs in input sparsity time wheras the Nys-trom methods are slower by an s λ factor in the best case. Application: Polynomial Kernel Rank- k Approximation.
Approximate matrix productand subspace emebedding are key properties for sketch matrices which imply efficient algorithmsfor rank- k kernel approximation [ANW14]. The following corollary of Theorem 1 immediatelyfollows from Theorem 6 of [ANW14]. Corollary 4 (Rank- k Approximation) . For every positive integers k, n, p, d , every ε > , any X ∈ R d × n , if A ∈ R d p × n is the matrix whose columns are obtained by the p -fold self-tensoringof each column of X then there exists an algorithm which finds an n × k matrix V in time O (cid:0) p nnz( X ) + poly (cid:0) k, p, ε − (cid:1)(cid:1) such that with probability / , k A − AV V ⊤ k F ≤ (1 + ε ) min U ∈ R dp × n rank ( U )= k k A − U k F . Note that this runtime improves the runtime of [ANW14] by exponential factors in the polyno-mial kernel’s degree p . Additional Applications.
Our results also imply improved bounds for each of the applicationsin [ANW14], including canonical correlation analysis (CCA), and principal component regression(PCR). Importantly, we obtain the first sketching-based solutions for these problems with runningtime polynomial rather than exponential in p . Oblivious Subspace Embedding for the Gaussian Kernel.
One very important implicationof our result is Oblivious Subspace Embedding of the Gaussian kernel. Most work in this area isrelated to the Random Fourier Features method [RR07]. It was shown in [AKM +
17] that onerequires Ω( n ) samples of the standard Random Fourier Features to obtain a subspace embeddingfor the Gaussian kernel, while a modified distribution for sampling frequencies yields provablybetter performance. The target dimension of our proposed sketch for the Gaussian kernel strictlyimproves upon the result of [AKM + d . We for the first time, embed the Gaussian kernel with a target dimension which has a lineardependence on the statistical dimension of the kernel and is not exponential in the dimensionalityof the data-point. Theorem 5.
For every r > , every positive integers n, d , and every X ∈ R d × n such that k x i k ≤ r for all i ∈ [ n ] , where x i is the i th column of X , suppose G ∈ R n × n is the Gaussian kernel matrix –i.e., G j,k = e −k x j − x k k / for all j, k ∈ [ n ] . There exists an algorithm which computes S g ( X ) ∈ R m × n in time e O (cid:0) q ǫ − ns λ + q ǫ − nnz( X ) (cid:1) such that for every ε, λ > , Pr S g h (1 − ǫ )( G + λI n ) (cid:22) ( S g ( X )) ⊤ S g ( X ) + λI n (cid:22) (1 + ǫ )( G + λI n ) i ≥ − / poly ( n ) , where m = e Θ (cid:0) q s λ /ǫ (cid:1) and q = Θ( r + log( n/ǫλ )) and s λ is λ -statistical dimension of G as inDefinition 1. We remark that for datasets with radius r = poly (log n ) even if one has oracle access to theexact leverage scores for Fourier features of Gaussian kernel, in order to get subspace embeddingguarantee one needs to use m = Ω( s λ ) features which requires Ω( s λ nnz( X )) operations to compute.Wheras our result of Theorem 5 runs in time e O ( ns λ + nnz( X )). Therefore, for a large range ofparameters our Gaussian sketch runs in input sparsity time wheras the Fourier features method isat best slower by an s λ factor. 7 .2 Technical Overview Our goal is to design a sketching matrix Π p that satisfies the oblivious subspace embedding propertywith an optimal embedding dimension and which can be efficiently applied to vectors of the form x ⊗ p ∈ R d p . We start by describing some natural approaches to this problem (some of which havebeen used before), and show why they incur an exponential loss in the degree of the polynomialkernel. We then present our sketch and outline our proof of its correctness.We first discuss two natural approaches to tensoring classical sketches, namely the Johnson-Lindenstrauss transform and the CountSketch . We show that both lead to an exponential depen-dence of the target dimension on p and then present our new approach. Tensoring the Johnson-Lindenstrauss Transform.
Perhaps the most natural approach todesigning a sketch Π p is the idea of tensoring p independent Johnson-Lindenstrauss matrices. Specif-ically, let m be the target dimension. For every r = 1 , . . . , p let M ( r ) denote an m × d matrix withiid uniformly random ± M ∈ R m × d p be M = 1 √ m M (1) • . . . • M ( p ) , where • stands for the operation of tensoring the rows of matrices M ( r ) (see Definition 7). Thiswould be a very efficient matrix to apply, since for every j = 1 , . . . , m the j -th entry of M x ⊗ p isexactly Q pr =1 h M ( r ) x i j , which can be computed in time O ( p nnz( x )), giving overall evaluation time O ( pm nnz( x )). One would hope that m = O ( ε − log n ) would suffice to ensure that k M x ⊗ p k =(1 ± ǫ ) k x ⊗ q k . However, this is not true: we show in Appendix A that one must have m =Ω( ε − p log( n ) /p + ε − (log( n ) /p ) p ) in order to preserve the norm with high probability. Thus,the dependence on degree p of the polynomial kernel must be exponential. The lower bound isprovided by controlling the moments of the sketch M and using Paley-Zygmund inequality. Forcompleteness, we show that the aforementioned bound on the target dimension m is sharp, i.e.,necessary and sufficient for obtaining the Johnson-Lindenstrauss property. Tensoring of CountSketch (TensorSketch).
Pagh and Pham [PP13] introduced the followingtensorized version of
CountSketch . For every i = 1 , . . . , p let h i : [ d ] → [ m ] denote a random hashfunction, and σ i : [ d ] → [ m ] a random sign function. Then let S : R d ⊗ p → R m be defined by S r, ( j ,...,j p ) := σ ( i ) · · · σ ( i p ) [ h ( i ) + . . . h p ( i p ) = r ]for r = 1 , . . . , m . For every x ∈ R d one can compute Sx ⊗ p in time O ( pm log m + p nnz( x )). Since thetime to apply the sketch only depends linearly on the dimension p (due to the Fast Fourier Trans-form) one might hope that the dependence of the sketching dimension on p is polynomial. However,this turns out to not be the case: the argument in [ANW14] implies that m = e O (3 p s λ ) suffices toconstruct a subspace embedding for a matrix with regularization λ and statistical dimension s λ ,and we show in Appendix A.3 that exponential dependence on p is necessary. Our Approach: Recursive Tensoring.
The initial idea behind our sketch is as follows. Toapply our sketch Π p to x ⊗ p , for x ∈ R d , we first compute the sketches T x, T x, . . . , T p x for inde-pendent sketching matrices T , . . . , T p ∼ T base – see the leaves of the sketching tree in Fig. 1. Notethat we choose these sketches as CountSketch [CCFC02] or
OSNAP [NN13] to ensure that the leaf Tensor product of x with itself p times. base S base T base T base S base T base T base internal nodes: TensorSketch or TensorSRHT leaves:
CountSketch or OSNAP
Figure 1: S base is chosen from the family of sketches which support fast matrix-vector product fortensor inputs such as TensorSketch and
TensorSRHT . The T base is chosen from the family of sketcheswhich operate in input sparsity time such as CountSketch and
OSNAP .sketches can be applied in time proportional to the number of nonzeros in the input data (in thecase of OSNAP this is true up to polylogarithimic factors).Each of these is a standard sketching matrix mapping d -dimensional vectors to m -dimensionalvectors for some common value of m . We refer the reader to the survey [Woo14]. The next ideais to choose new sketching matrices S , S , . . . , S p/ ∼ S base , mapping m -dimensional vectors to m -dimensional vectors and apply S to ( T x ) ⊗ ( T x ), as well as apply S to ( T x ) ⊗ ( T x ), andso on, applying S p/ to ( T p − x ) ⊗ ( T p x ). These sketches are denoted by S base – see internal nodesof the sketching tree in Fig. 1. We note that in order to ensure efficiency of our construction (inparticular, running time that depends only linearly on the statistical dimension s λ ) we must choose S base as a sketch that can be computed on tensored data without explicitly constructing the actualtensored input, i.e., S base supports fast matrix vector product on tensor product of vectors. Weuse either TensorSketch (for results that work with constant probability) and a new variant of theSubsampled Randomized Hadamard Transform
SRHT which supports fast multiplication for thetensoring of two vectors (for high probability bounds) – we call the last sketch
TensorSRHT .At this point we have reduced our number of input vectors from p to p/
2, and the dimensionis m , which will turn out to be roughly s λ . We have made progress, as we now have fewer vectorseach in roughly the same dimension we started with. After log p levels in the tree we are left witha single output vector.Intuitively, the reason that this construction avoids an exponential dependence on p is thatat every level in the tree we use target dimension m larger than the statistical dimension of ourmatrix by a factor polynomial in p . This ensures that the accumulation of error is limited, as thetotal number of nodes in the tree is O ( p ). This is in contrast to the direct approaches discussedabove, which use a rather direct tensoring of classical sketches, thereby incurring an exponentialdependence on p due to dependencies that arise. Showing Our Sketch is a Subspace Embedding.
In order to show that our recursive sketchis a subspace embedding, we need to argue it preserves norms of arbitrary vectors in R d p , not only9ectors of the form x ⊗ p , i.e., p -fold self-tensoring of d -dimensional vectors . Indeed, all knownmethods for showing the subspace embedding property (see [Woo14] for a survey) at the very leastargue that the norms of each of the columns of an orthonormal basis for the subspace in questionare preserved. While our subspace may be formed by the span of vectors which are tensor productsof p d -dimensional vectors, we are not guaranteed that there is an orthonormal basis of this form.Thus, we first observe that our mapping is indeed linear over R d p , making it well-defined on theelements of any basis for our subspace, and hence our task essentially reduces to proving that ourmapping preserves norms of arbitrary vectors in R d p .We present two approaches to analyzing our construction. One is based on the idea of prop-agating moment bounds through the sketching tree, and results in a nearly linear dependence ofthe sketching dimension m on the degree p of the polynomial kernel, at the expense of a quadraticdependence on the statistical dimension s λ . This approach is presented in Section 4. The otherapproach achieves the (optimal) linear dependence on s λ , albeit at the expense of a worse polyno-mial dependence on p . This approach uses sketches that succeed with high probability, and usesmatrix concentration bounds. Propagating moment bounds through the tree – optimizing the dependence on thedegree p . We analyze our recursively tensored version of the
OSNAP and
CountSketch by showinghow moment bounds can be propagated through the tree structure of the sketch. This analysis ispresented in Section 4, and results in the proof of Theorem 1 as well as the first part of Theorem 3.The analysis obtained this way give particularly sharp dependencies on p and log 1 /δ .The idea is to consider the unique matrix M ∈ R m × d p that acts on simple tensors in the waywe have described it recursively above. This matrix could in principle be applied to any vector x ∈ R d p (though it would be slow to realise). We can nevertheless show that this matrix has the( ε, δ, t )-JL Moment Property, which is for parameters ε, δ ∈ [0 , , t ≥
2, and every x ∈ R d with k x k = 1 the statement E h(cid:12)(cid:12) k M x k − (cid:12)(cid:12) t i ≤ ε t δ .It can be shown that M is built from our various S base and T base matrices using three differentoperations: multiplication, direct sum, and row-wise tensoring. In other words, it is sufficient toshow that if Q and Q ′ both have the ( ε, δ, t )-JL Moment Property, then so does QQ ′ , Q ⊕ Q ′ and Q • Q ′ . This turns out to hold for Q ⊕ Q ′ , but QQ ′ and Q • Q ′ are more tricky. (Here ⊕ is thedirect sum and • is the composition of tensoring the rows. See section 2 on notation.)For multiplication, a simple union bound allows us to show that Q (1) Q (2) · · · Q ( p ) has the( pε, pδ, t )-JL Moment Property. This would unfortunately mean a factor of p in the final di-mension. The union bound is clearly suboptimal, since implicitly it is assumes that all the matricesconspire to either shrink or increase the norm of a vector, while in reality with independent ma-trices, we should get a random walk on the real line. Using an intricate decoupling argument, weshow that this is indeed the case, and that Q (1) Q (2) · · · Q ( p ) has the ( √ pε, δ, t )-JL Moment Property,saving a factor of p in the output dimension.Finally we need to analyze Q • Q ′ . Here it is easy to show that the JL Moment Propertydoesn’t in general propagate to Q • Q ′ (consider e.g. Q being constant 0 on its first m/ Q ′ having 0 on its m/ Q • Q ′ behaves well. In particular we show this for matrices with independentsub-Gaussian entries (appendix A.2), and for the so-called Fast Johnson Lindenstrauss construc-tion [AC06] (Lemma 21). The main tool here is a higher order version of the classical Khintchine x ⊗ p denotes x ⊗ x · · · ⊗ x | {z } p terms , the p -fold self-tensoring of x . E h h σ (1) ⊗ σ (2) ⊗ · · · ⊗ σ ( p ) , x i t i when σ (1) , . . . σ ( p ) areindependent sub-Gaussian vectors (Lemma 19). Optimizing the dependence on s λ . Our proof of Theorem 3 relies on instantiating our frame-work with
OSNAP at the leaves of the tree ( T base ) and a novel version of the SRHT that we referto as
TensorSRHT at the internal nodes of the tree. We outline the analysis here. In order to showthat our sketch preserves norms, let y be an arbitrary vector in R d p . Then in the bottom level ofthe tree, we can view our sketch as T × T × · · · × T p , where × for denotes the tensor product ofmatrices (see Definition 5). Then, we can reshape y to be a d q − × d matrix Y , and the entries of T × T × · · · × T p y are in bijective correspondence with those of T × T × · · · × T p − Y T ⊤ p . By defi-nition of T p , it preserves the Frobenius norm of Y , and consequently, we can replace Y with Y T ⊤ p .We next look at ( T × T × · · · × T p − ) Z ( I d × T ⊤ p − ), where Z is the d p − × d matrix with entriesin bijective correspondence with those of Y T ⊤ p . Then we know that T p − preserves the Frobeniusnorm of Z . Iterating in this fashion, this means the first layer of our tree preserves the norm of y , provided we union bound over O ( p ) events that a sketch preserves a norm of an intermediatematrix. The core of the analysis consists of applying spectral concentration bounds based analysisto sketches that act on blocks of the input vector in a correlated fashion. We give the details inSection 5. Sketching the Gaussian kernel.
Our techniques yield the first oblivious sketching method forthe Gaussian kernel with target dimension that does not depend exponentially on the dimensionalityof the input data points. The main idea is to Taylor expand the Gaussian function and apply oursketch for the polynomial kernel to the elements of the expansion. It is crucial here that thetarget dimension of our sketch for the polynomial kernel depends only polynomially on the degree,as otherwise we would not be able to truncate the Taylor expansion sufficiently far in the tail(the number of terms in the Taylor expansion depends on the radius of the dataset and dependslogarithmically on the regularization parameter). Overall, our Gaussian kernel sketch has optimaltarget dimension up to polynomial factors in the radius dataset and logarithmic factors in thedataset size. Moreover, it is the first subspace embedding of Gaussian kernel which runs in inputsparsity time e O (nnz( X )) for datasets with polylogarithmic radius. The result is summarized inTheorem 5, and the analysis is presented in Section 6. Work related to sketching of tensors and explicit kernel embeddings is found in fields ranging frompure mathematics to physics and machine learning. Hence we only try to compare ourselves withthe four most common types we have found.
Johnson-Lindenstrauss Transform
A cornerstone result in the field of subspace embeddingsis the Johnson-Lindenstrauss lemma [JLS86]: “For all ε ∈ [0 , n, d ≥
1, and X ⊆ R d with | X | = n there exists f : R d → R m with m = O ( ε − log( n )), such that (1 − ε ) k x − y k ≤k f ( x ) − f ( y ) k ≤ (1 + ε ) k x − y k for every x, y ∈ X .“It has been shown in [CW13, CNW16a] there exists a constant C , so that, for any r -dimensionalsubspace U ⊆ R d , there exists a subset X ⊆ U with | X | ≤ C r , such that max x ∈ U (cid:12)(cid:12) k f ( x ) k − k x k (cid:12)(cid:12) ≤ O (max x ∈ X (cid:12)(cid:12) k f ( x ) k − k x k (cid:12)(cid:12) ). So the Johnson-Lindenstrauss Lemma implies that there exists asubspace embedding with m = O ( ε − r ). 11t is not enough to know that the subspace embedding exists, we also need the to find thedimension-reducing map f , and we want the map f to be applied to the data quickly. Achlioptasshowed that if Π ∈ R m × d is random matrix with i.i.d. entries where Π i,j = 0 with probability2 /
3, and otherwise Π i,j is uniform in {− , } , and m = O ( ε − log(1 /δ )), then k Π x k = (1 ± ε ) k x k with probability 1 − δ for any x ∈ R d [Ach03]. This gives a running time of O ( m nnz ( x )) to sketcha vector x ∈ R d . Later, the Fast Johnson Lindenstrauss Transform [AC06], which exploits theFast Fourier Transform, improved the running time for dense vectors to O ( d log d + m ). The re-lated Subsampled Randomized Hadamard Transform has been extensively studied [Sar06, DMM06,DMMS11, Tro11, DMMW12, LDFU13], which uses O ( d log d ) time but obtains suboptimal dimen-sion O ( ε − log(1 /δ ) ), hence it can not use the above argument to get subspace embedding, but ithas been proven in [Tro11] that if m = O ( ε − ( r + log(1 /δ ) )), then one get a subspace embedding.The above improvements has a running time of O ( d log d ), which can be worse than O ( m nnz ( x ))if x ∈ R d is very sparse. This inspired a line of work trying to obtain sparse Johnson Lindenstrausstransforms [DKS10, KN14, NN13, Coh16]. They obtain a running time of O ( ε − log(1 /δ )nnz ( x )).In [NN13] they define the ONSAP transform and investigate the trade-off between sparsity andsubspace embedding dimension. This was further improved in [Coh16].In the context of this paper all the above mentioned methods have the same shortcoming,they do not exploit the extra structure of the tensors. The Subsampled Randomized HadamardTransform have a running time of Ω( pd p log( p )) in the model considered in this paper, and thesparse embeddings have a running time of Ω(nnz( x ) p ). This is clearly unsatisfactory and inspiredthe TensorSketch [PP13, ANW14], which has a running time of Ω( p nnz( x )). Unfortunately, theyneed m = Ω(3 p ε − δ − ) and one of the main contributions of this paper is get rid of the exponentialdependence on p . Approximate Kernel Expansions
A classic result by Rahimi and Recht [RR08] shows how tocompute an embedding for any shift-invariant kernel function k ( k x − y k ) in time O ( dm ). In [LSS14]this is improved to any kernel on the form k ( h x, y i ) and time O (( m + d ) log d ), however the methoddoes not handle kernel functions that can’t be specified as a function of the inner product, and itdoesn’t provide subspace embeddings. See also [MM17] for more approaches along the same line.Unfortunately, these methods are unable to operate in input sparsity time and their runtime atbest is off by an s λ factor. Tensor Sparsification
There is also a literature of tensor sparsification based on sampling [NDT15],however unless the vectors tensored are already very smooth (such as ± Hyper-plane rounding
An alternative approach is to use hyper-plane rounding to get vectorson the form ±
1. Let ρ = h x,y ik x kk y k , then we have h sign ( M x ) , sign ( M y ) i = P i sign ( M i x ) sign ( M i y ) = P i X i , where X i are independent Rademachers with µ/m = E [ X i ] = 1 − π arccos ρ = π ρ + O ( ρ ).By tail bounds then Pr[ |h sign ( M x ) , sign ( M y ) i − µ | > ǫµ ] ≤ − min( ǫ µ σ , ǫµ )). Taking m = O ( ρ − ǫ − log 1 /δ ) then suffices with high probability. After this we can simply sample fromthe tensor product using simple sample bounds.The sign-sketch was first brought into the field of data-analysis by [Cha02] and [Val15] was thefirst, in our knowledge, to use it with tensoring. The main issue with this approach is that it isn’ta linear sketch, which hinders the applications we consider in this paper, such as kernel low rank12pproximation, CCA, PCR, and ridge regression. It also takes dm time to calculate M x and
M y which is unsatisfactory.
In section 2 we introduce basic definitions and notations that will be used throughout the paper.Section 3 introduces our recursive construction of the sketch which is our main technical toolfor sketching high degree tensor products. Section 4 analyzes how the moment bounds propagatethrough our recursive construction thereby proving Theorems 1 and 2 which have linear dependenceon the degree q . Section 5 introduces a high probability Oblivious Subspace Embedding with lineardependence on the statistical dimension thereby proving Theorem 3. Finally, section 6 uses thetools that we build for sketching polynomial kernel and proves that, for the first time, Gaussiankernel can be sketched without an exponential loss in the dimension with provable guarantees.Appendix A proves lower bounds. In this section we introduce notation and present useful properties of tensor product of vectors andmatrices as well as properties of linear sketch matrices.We denote the tensor product of vectors a, b by a ⊗ b which is formally defined as follows, Definition 4 (Tensor product of vectors) . Given a ∈ R m and b ∈ R n we define the twofold tensorproduct a ⊗ b to be a ⊗ b = a b a b · · · a b n a b a b · · · a b n ... ... ... a m b a m b · · · a m b n ∈ R m × n . Although tensor products are multidimensional objects, it is often convenient to associate them withsingle-dimensional vectors. In particular, we will often associate a ⊗ b with the single-dimensionalcolumn vector ( a b , a b , . . . , a m b , a b , a b , . . . , a m b , . . . , a m b n ). Given v ∈ R d , v ∈ R d · · · v k ∈ R d k , we define the k -fold tensor product v ⊗ v · · · ⊗ v k ∈ R d d ··· d k . For shorthand, we use thenotation v ⊗ k to denote v ⊗ v · · · ⊗ v | {z } k terms , the k -fold self-tensoring of v .Tensor product can be naturally extended to matrices which is formally defined as follows, Definition 5.
Given A ∈ R m × n , A ∈ R m × n , · · · , A k ∈ R m k × n k , we define A × A × · · · × A k to be the matrix in R m m ··· m k × n n ··· n k whose element at row ( i , · · · , i k ) and column ( j , · · · , j k ) is A ( i , j ) · · · A k ( i k , j k ). As a consequence the following holds for any v ∈ R n , v ∈ R n , · · · , v k ∈ R n k : ( A × A × · · · × A k )( v ⊗ v ⊗ · · · ⊗ v k ) = ( A v ) ⊗ ( A v ) ⊗ · · · ⊗ ( A k v k ).The tensor product has the useful mixed product property , given in the following Claim, Claim 6.
For every matrices
A, B, C, D with appropriate sizes, the following holds, ( A · B ) × ( C · D ) = ( A × C ) · ( B × D ) . We also define the column wise tensoring of matrices as follows,13 efinition 6.
Given A ∈ R m × n , A ∈ R m × n , · · · , A k ∈ R m k × n , we define A ⊗ A ⊗ · · · ⊗ A k tobe the matrix in R m m ··· m k × n whose j th column is A j ⊗ A j ⊗ · · · ⊗ A jk for every j ∈ [ n ], where A jl is the j th column of A l for every l ∈ [ k ].Similarly the row wise tensoring of matrices are introduced in the following Definition, Definition 7.
Given A ∈ R m × n , A ∈ R m × n , · · · , A k ∈ R m × n k , we define A • A • · · · A k to bethe matrix in R m × n n ··· n k whose j th row is ( A j ⊗ A j ⊗ · · · ⊗ A kj ) ⊤ for every j ∈ [ m ], where A lj isthe j th row of A l as a column vector for every l ∈ [ k ]. Definition 8.
Another related operation is the direct sum for vectors: x ⊕ y = [ xy ] and for matrices: A ⊕ B = (cid:2) A B (cid:3) . When the sizes match up, we have ( A ⊕ B )( x ⊕ y ) = Ax + By . Also notice thatif I k is the k × k identity matrix, then I k ⊗ A = A ⊕ · · · ⊕ A | {z } k times . In this section, we present the basic construction for our new sketch. Suppose we are given v , v , . . . v q ∈ R m . Our main task is to map the tensor product v ⊗ v ⊗ · · · ⊗ v q to a vectorof size m using a linear sketch.Our sketch construction is recursive in nature. To illustrate the general idea, let us first considerthe case in which q ≥ v ⊗ v ) , ( v ⊗ v ) , · · · , ( v q − ⊗ v q ) ∈ R m independently using independent instances of some linear basesketch (e.g., degree two TensorSketch , Sub-sampled Randomized Hadamard Transform (
SRHT ), CountSketch , OSNAP ). The number of vectors after this step is half of the number of vectors thatwe began with. The natural idea is to recursively apply the same procedure on the sketched tensorswith half as many instances of the base sketch in each successive step.More precisely, we first choose a (randomized) base sketch S base : R m → R m that sketchestwofold tensor products of vectors in R m (we will describe how to choose the base sketch later).Then, for any power of two q ≥
2, we define Q q : R m q → R m on v ⊗ v ⊗ · · · ⊗ v q recursively asfollows: Q q ( v ⊗ v ⊗ · · · ⊗ v q ) = Q q/ (cid:16) S q ( v ⊗ v ) ⊗ S q ( v ⊗ v ) ⊗ · · · ⊗ S qq/ ( v q − ⊗ v q ) (cid:17) , where S q , S q , · · · , S qq/ : R m → R m are independent instances of S base and Q : R m → R m is simplythe identity map on R m .The above construction of Q q has been defined in terms of its action on q -fold tensor products ofvectors in R m , but it extends naturally to a linear mapping from R m q to R m . The formal definitionof Π q is presented below. Definition 9 (Sketch Q q ) . Let m ≥ S base : R m → R m be alinear map that specifies some base sketch. Then, for any integer power of two q ≥
2, we define Q q : R m q → R m to be the linear map specified as follows: Q q ≡ S · S · · · S q/ · S q , where for each l ∈ { , , · · · , q/ , q } , S l is a matrix in R m l/ × m l defined as S l ≡ S l × S l × · · · × S ll/ , (2)where the matrices S l , · · · , S ll/ ∈ R m × m are drawn independently from a base distribution S base .14 w ⊗ w z = S ( w ⊗ w ) S v ⊗ v w = S ( v ⊗ v ) v v S v ⊗ v w = S ( v ⊗ v ) v v S = S S = S × S Figure 2: Visual illustration of the recursive construction of Q q for degree q = 4. The input tensoris v ⊗ v ⊗ v ⊗ v and the output is z = Q ( v ⊗ v ⊗ v ⊗ v ). The intermediate nodes sketch thetensors w = S ( v ⊗ v ) and w = S ( v ⊗ v ).This sketch construction can be best visualized using a balanced binary tree with q leaves.Figure 2 illustrates the construction of degree 4, Q .For every integer q which is a power of two, by definition of S q in (2) of Definition 9, S q = S q × · · · × S qq/ . Hence, by claim 6 we can write, S q = S q × · · · × S qq/ = (cid:16) S q × · · · × S qq/ − × I m (cid:17) · (cid:16) I m q − × S qq/ (cid:17) . By multiple applications of Claim 6 we have the following claim,
Claim 7.
For every power of two integer q and any positive integer m , if S q is defined as in (2) of Definition 9, then S q = M q/ M q/ − · · · M , where M j = I m q − j × S qq/ − j +1 × I m j − for every j ∈ [ q/ . Embedding R d q : So far we have constructed a sketch Q q for sketching tensor product of vectorsin R m . However, in general the data points can be in a space R d of arbitrary dimension. A naturalidea is to reduce the dimension of the vectors by a mapping from R d to R m and then apply Q q onthe tensor product of reduced data points. The dimensionality reduction defines a linear mappingfrom R d q to R m d which can be represented by a matrix. We denote the dimensionality reductionmatrix by T q ∈ R m q × d q formally defined as follows. Definition 10.
Let m, d be positive integers and let T base : R d → R m be a linear map that specifiessome base sketch. Then for any integer power of two q we define T q to be the linear map specifiedas follows, T q = T × T × · · · × T q , where the matrices T , · · · , T q are drawn independently from T base . Discussion:
Similar to Claim 7, the transform T q can be expressed as the following productof q matrices, T q = M q M q − · · · M , where M j = I d q − j × T q − j +1 × I m j − for every j ∈ [ q ].15ow we define the final sketch Π q : R d q → R m for arbitrary d as the composition of Q q · T q .Moreover, to extend the definition to arbitrary q which is not necessarily a power of two we tensorthe input vector with a standard basis vector a number of times to make the input size compatiblewith the sketch matrices. The sketch Π q is formally defined below, Definition 11 (Sketch Π p ) . Let m, d be positive integers and let S base : R m → R m and T base : R d → R m be linear maps that specify some base sketches. Then, for any integer p ≥ p : R d p → R m to be the linear map specified as follows:1. If p is a power of two, then Π p is defined asΠ p = Q p · T p , where Q p ∈ R m × m p and T p ∈ R m p × d p are sketches as in Definitions 9 and 10 respectively.2. If p is not a power of two, then let q = 2 ⌈ log p ⌉ be the smallest power of two integer that isgreater than p and we define Π p asΠ p ( v ) = Π q (cid:16) v ⊗ e ⊗ ( q − p )1 (cid:17) , for every v ∈ R d p , where e ∈ R d is the standard basis column vector with a 1 in the firstcoordinate and zeros elsewhere, and Π q is defined as in the first part of this definition.Algorithm 1 sketches x ⊗ p for any integer p and any input vector x ∈ R d using the sketch Π p asin Definition 11, i.e., computes Π p ( x ⊗ p ). Algorithm 1
Sketch for the Tensor x ⊗ p input : vector x ∈ R d , dimension d , degree p , number of buckets m , base sketches S base ∈ R m × m and T base ∈ R m × d output : sketched vector z ∈ R m Let q = 2 ⌈ log p ⌉ Let T , · · · T q be independent instances of the base sketch T base : R d → R m For every j ∈ { , , · · · , p } , let Y j = T j · x For every j ∈ { p + 1 , · · · , q } , let Y j = T j · e , where e is the standard basis vector in R d withvalue 1 in the first coordinate and zero elsewhere for l = 1 to log q do Let S q/ l − , · · · , S q/ l − q/ l be independent instances of the base sketch S base : R m → R m For every j ∈ { , · · · , q/ l } let Y lj = S q/ l − j (cid:16) Y l − j − ⊗ Y l − j (cid:17) end for return z = Y log q We show the correctness of Algorithm 1 in the next lemma.
Lemma 8.
For any positive integers d , m , and p , any distribution on matrices S base : R m → R m and T base : R d → R m which specify some base sketches, any vector x ∈ R d , Algorithm 1 computes Π p ( x ⊗ p ) as in Definition 11.Proof. For every input vector x ∈ R d to Algorithm 1, the vectors Y , · · · , Y p , are computed inlines 3 and 4 of algorithm as Y j = T j · x, for all j ∈ { , · · · , p } , and, Y j ′ = T j ′ · e , for all j ∈ { q + 1 , · · · , q } . Therefore, as shown in Definition 5, the following holds,16 ⊗ · · · ⊗ Y p = T × · · · × T q · (cid:16) x ⊗ p ⊗ e ⊗ ( q − p )1 (cid:17) . From the definition of sketch T q as per Definition 10 it follows that, Y ⊗ · · · ⊗ Y q = T q · (cid:16) x ⊗ p ⊗ e ⊗ ( q − p )1 (cid:17) . (3)The algorithm computes Y l , · · · Y lq/ l in line 7 as, Y lj = S q/ l − j (cid:16) Y l − j − ⊗ Y l − j (cid:17) , for every j ∈{ , · · · , q/ l } and every l ∈ { , · · · , log q } in a for loop. Therefore, by Claim 6, Y l ⊗ · · · ⊗ Y lq/ l = (cid:18) S q/ l − × · · · × S q/ l − q/ l (cid:19) · Y l − ⊗ · · · ⊗ Y l − q/ l − . By the definition of the sketch S q/ l − in (2) of Definition 9 we have that for every l ∈ { , · · · , log q } , Y l ⊗ · · · ⊗ Y lq/ l = S q/ l − · Y l − ⊗ · · · ⊗ Y l − q/ l − . Therefore, by recursive application of the above identity we get that, Y log p = S · S · · · S q/ · S q · Y ⊗ · · · ⊗ Y q . From the definition of sketch Q q as in Definition 9 it follows that, Y log q = Q q · Y ⊗ · · · ⊗ Y q . Substituting Y ⊗ · · · ⊗ Y q from (3) in the above gives, z = ( Q q · T q ) · (cid:16) x ⊗ p ⊗ e ⊗ ( q − p )1 (cid:17) , where byDefinition 11 we have that, z = Π p ( x ⊗ p ) . Choices of the Base Sketches S base and T base : We present formal definitions for various choicesof the base sketches S base and T base that will be used for our sketch construction Π q of Definition11. We start by briefly recalling the CountSketch [CCFC02].
Definition 12 ( CountSketch transform) . Let h : [ d ] → [ m ] be a 3-wise independent hash func-tion and also let σ : [ d ] → {− , +1 } be a 4-wise independent random sign function. Then, the CountSketch transform, S : R d → R m , is defined as follows; for every i ∈ [ d ] and every r ∈ [ m ], S r,i = σ ( i ) · [ h ( i ) = r ] . Another base sketch that we consider is the
TensorSketch of degree two [Pag13] defined asfollows.
Definition 13 (degree two
TensorSketch transform) . Let h , h : [ d ] → [ m ] be 3-wise independenthash functions and also let σ , σ : [ d ] → {− , +1 } be 4-wise independent random sign functions.Then, the degree two TensorSketch transform, S : R d × R d → R m , is defined as follows; for every i, j ∈ [ d ] and every r ∈ [ m ], S r, ( i,j ) = σ ( i ) · σ ( j ) · [ h ( i ) + h ( j ) = r mod m ] . Remark: S ( x ⊗ ) can be computed in O ( m log m + nnz( x )) time using the Fast Fourier Trans-form.Now let us briefly recall the SRHT [AC06]. 17 efinition 14 (Subsampled Randomized Hadamard Transform ( SRHT )) . Let D be a d × d diagonalmatrix with independent Rademacher random variables along the diagonal. Also, let P ∈ { , } m × d be a random sampling matrix in which each row contains a 1 at a uniformly distributed coordinateand zeros elsewhere, and let H be a d × d Hadamard matrix. Then, the
SRHT , S ∈ R m × d , is S = √ m P HD .We now define a variant of the
SRHT which is very efficient for sketching x ⊗ which we call the TensorSRHT . Definition 15 (Tensor Subsampled Randomized Hadamard Transform (
TensorSRHT )) . Let D and D be two independent d × d diagonal matrices, each with diagonal entries given by independentRademacher variables. Also let P ∈ { , } m × d be a random sampling matrix in which each rowcontains exactly one uniformly distributed nonzero element which has value one, and let H be a d × d Hadamard matrix. Then, the
TensorSRHT is defined to be S : R d × R d → R m given by S = √ m P · ( HD × HD ). Remark: S ( x ⊗ ) can be computed in time O ( d log d + m ) using the FFT algorithm.Another sketch which is particularly efficient for sketching sparse vectors with high probabilityis the OSNAP transform [NN13], defined as follows.
Definition 16 ( OSNAP transform) . For every sparsity parameter s , target dimension m , andpositive integer d , the OSNAP transform with sparsity parameter s is defined as, S r,j = r s · δ r,j · σ r,j , for all r ∈ [ m ] and all j ∈ [ d ], where σ r,j ∈ {− , +1 } are independent and uniform Rademacherrandom variables and δ r,j are Bernoulli random variables satisfying,1. For every i ∈ [ d ], P r ∈ [ m ] δ r,i = s . That is, each column of S has exactly s non-zero entries.2. For all r ∈ [ m ] and all i ∈ [ d ], E [ δ r,i ] = s/m .3. The δ r,i ’s are negatively correlated: ∀ T ⊂ [ m ] × [ d ], E hQ ( r,i ) ∈ T δ r,i i ≤ Q ( r,i ) ∈ T E [ δ r,i ] = ( sm ) | T | . p There are various desirable properties that we would like a linear sketch to satisfy. One suchproperty which is central to our main results is the
JL Moment Property . In this section weprove Theorem 1 and Theorem 2 by propagating the
JL Moment Property through our recursiveconstruction from Section 3. The
JL Moment Property captures a bound on the moments of thedifference between the Euclidean norm of a vector and its Euclidean norm after applying the sketchon it. The JL Moment Property proves to be a powerful property for a sketch and we will show thatit implies the Oblivious Subspace Embedding as well as the Approximate Matrix Product propertyfor linear sketches.In section 4.1 we choose S base and T base to be TensorSketch and
CountSketch respectively. Thenwe propagate the second JL Moment through the sketch construction Π p and thereby prove Theorem1. In section 4.2 we propagate the higher JL Moments through our recursive construction Π p asper Definition 11 with TensorSRHT at the internal nodes ( S base ) and OSNAP at the leaves ( T base ),thereby proving Theorem 2.To make the notation less heavy we will use k X k L t for the t th moment of a random variable X .This is formally defined below. 18 efinition 17. For every integer t ≥ X ∈ R , we write k X k L t = (cid:16) E h | X | t i(cid:17) /t . Note that k X + Y k L t ≤ k X k L t + k Y k L t for any random variables X, Y by the Minkowski Inequality.We now formally define the JL Moment Property of sketches.
Definition 18 (JL Moment Property) . For every positive integer t and every δ, ε ≥
0, we say adistribution over random matrices S ∈ R m × d has the ( ǫ, δ, t )-JL-moment property, when (cid:13)(cid:13)(cid:13) k Sx k − (cid:13)(cid:13)(cid:13) L t ≤ ǫδ /t and E h k Sx k i = 1for all x ∈ R d such that k x k = 1.The JL Moment Property directly implies the following moment bound for the inner productof two vectors: Lemma 9 (Two vector JL Moment Property) . For any x, y ∈ R d , if S has the ( ε, δ, t ) -JL MomentProperty, then (cid:13)(cid:13)(cid:13) ( Sx ) ⊤ ( Sy ) − x ⊤ y (cid:13)(cid:13)(cid:13) L t ≤ εδ /t k x k k y k . (4) Proof.
We can assume by linearity of the norms that k x k = k y k = 1. We then use that k x − y k = k x k + k y k − x ⊤ y and k x + y k = k x k + k y k + 2 x ⊤ y such that x ⊤ y = ( k x + y k − k x − y k ) / (cid:13)(cid:13)(cid:13) ( Sx ) ⊤ ( Sy ) − x ⊤ y (cid:13)(cid:13)(cid:13) L t = (cid:13)(cid:13)(cid:13) k Sx + Sy k − k x + y k − k Sx − Sy k + k x − y k (cid:13)(cid:13)(cid:13) L t / ≤ (cid:16)(cid:13)(cid:13)(cid:13) k S ( x + y ) k − k x + y k (cid:13)(cid:13)(cid:13) L t + (cid:13)(cid:13)(cid:13) k S ( x − y ) k − k x − y k (cid:13)(cid:13)(cid:13) L t (cid:17) / ≤ εδ /t ( k x + y k + k x − y k ) / εδ /t ( k x k + k y k ) / εδ /t . We will also need the Strong JL Moment Property, which is a sub-Gaussian bound on thedifference between the Euclidean norm of a vector and its Euclidean norm after applying the sketchon it.
We will also need the Strong JL Moment Property, which is a sub-Gaussian bound on the moments of the difference between the squared Euclidean norm of a vector and the squared Euclidean norm of its image under the sketch.

Definition 19 (Strong JL Moment Property). For every ε, δ > 0, we say a matrix M ∈ ℝ^{m×d} has the Strong (ε, δ)-JL Moment Property when
‖ ‖Mx‖₂² − 1 ‖_{L^t} ≤ (ε/e) √( t / log(1/δ) )   and   E[‖Mx‖₂²] = 1,
for all x ∈ ℝ^d with ‖x‖₂ = 1 and every integer 1 ≤ t ≤ log(1/δ).

Remark 1. It should be noted that if a matrix M ∈ ℝ^{m×d} has the Strong (ε, δ)-JL Moment Property then it has the (ε, δ, log(1/δ))-JL Moment Property, since
‖ ‖Mx‖₂² − 1 ‖_{L^{log(1/δ)}} ≤ (ε/e) √( log(1/δ) / log(1/δ) ) = ε/e = ε δ^{1/log(1/δ)}.

To prove that Π^p is an Oblivious Subspace Embedding and that Π^p has the Approximate Matrix Multiplication Property, it suffices to prove that Π^q has the JL Moment Property, where q is the smallest power of two such that q ≥ p, as in Definition 11. This reduction will be the main component of the proofs of Theorem 1 and Theorem 2.

Lemma 10.
For every positive integers n, p, d, every ε, δ ∈ [0, 1], and every µ ≥ 0, let q = 2^{⌈log₂(p)⌉} and let Π^p ∈ ℝ^{m×d^p} and Π^q ∈ ℝ^{m×d^q} be defined as in Definition 11, for some base sketches S_base ∈ ℝ^{m×m²} and T_base ∈ ℝ^{m×d}. If Π^q is an (ε, δ, µ, d^q, n)-Oblivious Subspace Embedding then Π^p is an (ε, δ, µ, d^p, n)-Oblivious Subspace Embedding. Also, if Π^q has the (ε, δ)-Approximate Matrix Multiplication Property then Π^p has the (ε, δ)-Approximate Matrix Multiplication Property.

Proof. We will prove a correspondence between Π^p and Π^q. Let E ∈ ℝ^{d×n} be the matrix whose first row is equal to one and which is zero everywhere else. By Definition 11 we have, for any matrix A ∈ ℝ^{d^p×n}, that Π^p A = Π^q (A ⊗ E^{⊗(q−p)}). A simple calculation shows that for any matrices A, B ∈ ℝ^{d^p×n},
(A ⊗ E^{⊗(q−p)})^⊤ (B ⊗ E^{⊗(q−p)}) = (A^⊤B) ∘ ((E^{⊗(q−p)})^⊤ E^{⊗(q−p)}) = A^⊤B,
where ∘ denotes the Hadamard product and the last equality follows since (E^{⊗(q−p)})^⊤ E^{⊗(q−p)} is the all-ones matrix. This implies that ‖A ⊗ E^{⊗(q−p)}‖_F = ‖A‖_F and s_λ((A ⊗ E^{⊗(q−p)})^⊤ (A ⊗ E^{⊗(q−p)})) = s_λ(A^⊤A).

Now assume that Π^q is an (ε, δ, µ, d^q, n)-Oblivious Subspace Embedding, and let A ∈ ℝ^{d^p×n} and λ ≥ 0 be such that s_λ(A^⊤A) ≤ µ. Define A′ = A ⊗ E^{⊗(q−p)}; then
Pr[ (1 − ε)(A^⊤A + λI_n) ⪯ (Π^p A)^⊤ Π^p A + λI_n ⪯ (1 + ε)(A^⊤A + λI_n) ]
 = Pr[ (1 − ε)(A′^⊤A′ + λI_n) ⪯ (Π^q A′)^⊤ Π^q A′ + λI_n ⪯ (1 + ε)(A′^⊤A′ + λI_n) ] ≥ 1 − δ,
where we have used that s_λ(A′^⊤A′) = s_λ(A^⊤A) ≤ µ. This shows that Π^p is an (ε, δ, µ, d^p, n)-Oblivious Subspace Embedding.

Assume that Π^q has the (ε, δ)-Approximate Matrix Multiplication Property, and let C, D ∈ ℝ^{d^p×n}. Define C′ = C ⊗ E^{⊗(q−p)} and D′ = D ⊗ E^{⊗(q−p)}; then
Pr[ ‖(Π^p C)^⊤ Π^p D − C^⊤D‖_F ≥ ε ‖C‖_F ‖D‖_F ] = Pr[ ‖(Π^q C′)^⊤ Π^q D′ − C′^⊤D′‖_F ≥ ε ‖C′‖_F ‖D′‖_F ] ≤ δ,
where we have used that ‖C′‖_F = ‖C‖_F, ‖D′‖_F = ‖D‖_F, and C′^⊤D′ = C^⊤D. This shows that Π^p has the (ε, δ)-Approximate Matrix Multiplication Property.
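The padding trick used in this proof is easy to check numerically: tensoring each column with extra copies of e₁ leaves Gram matrices (and hence Frobenius norms and statistical dimensions) unchanged. Below is a toy verification of ours, with hypothetical small dimensions:

    import numpy as np

    rng = np.random.default_rng(3)
    d, n, p, q = 3, 4, 2, 4          # q = 2^(ceil log2 p) is the next power of two >= p
    e1 = np.zeros(d)
    e1[0] = 1.0

    X = rng.standard_normal((d, n))
    # Degree-p columns x_j (x) x_j, and the same columns padded to degree q with e1 factors.
    A  = np.stack([np.kron(X[:, j], X[:, j]) for j in range(n)], axis=1)
    Ap = np.stack([np.kron(A[:, j], np.kron(e1, e1)) for j in range(n)], axis=1)

    print(np.allclose(A.T @ A, Ap.T @ Ap))   # True: the Gram matrix is preserved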
Lemma 11. For any ε, δ ∈ [0, 1] and t ≥ 2, if M ∈ ℝ^{m×d} is a random matrix with the (ε, δ, t)-JL Moment Property then M has the (ε, δ)-Approximate Matrix Multiplication Property. Furthermore, for any µ > 0, if M ∈ ℝ^{m×d} is a random matrix with the (ε/µ, δ, t)-JL Moment Property then, for every positive integer n ∈ ℤ, M is an (ε, δ, µ, d, n)-OSE.

Proof. Approximate Matrix Multiplication. Let
C, D ∈ R d × n . We will prove that (cid:13)(cid:13)(cid:13) k ( M C ) ⊤ M D − C ⊤ D k F (cid:13)(cid:13)(cid:13) L t ≤ εδ /t k C k F k D k F . (5)Then Markov’s inequality will give us the result. Using the triangle inequality together withLemma 9 we get that: (cid:13)(cid:13)(cid:13) k ( M C ) ⊤ M D − C ⊤ D k F (cid:13)(cid:13)(cid:13) L t = (cid:13)(cid:13)(cid:13) k ( M C ) ⊤ M D − C ⊤ D k F (cid:13)(cid:13)(cid:13) / L t/ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i,j ∈ [ n ] (cid:16) ( M C i ) ⊤ M D j − C ⊤ i D j (cid:17) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t/ ≤ s X i,j ∈ [ n ] (cid:13)(cid:13) ( M C i ) ⊤ M D j − C ⊤ i D j (cid:13)(cid:13) L t ≤ s X i,j ∈ [ n ] ε δ /t k C i k k D j k = εδ /t k C k F k D k F . Using Markov’s inequality we now get thatPr h k ( M C ) ⊤ M D − C ⊤ D k F ≥ ε k C k F k D k F i ≤ (cid:13)(cid:13)(cid:13) k ( M C ) ⊤ M D − C ⊤ D k F (cid:13)(cid:13)(cid:13) tL t ε t k C k tF k D k tF ≤ δ . Oblivious Subspace Embedding.
We will prove that for any λ ≥ A ∈ R d × n ,(1 − ε )( A ⊤ A + λI n ) (cid:22) ( M A ) ⊤ M A + λI n (cid:22) (1 + ε )( A ⊤ A + λI n ) , (6)holds with probability at least 1 − (cid:16) s λ ( A ⊤ A ) µ (cid:17) t δ , which will imply our result.We will first consider λ >
0. Then A ⊤ A + λI n is positive definite. Thus, by left and rightmultiplying (6) by ( A ⊤ A + λI n ) − / , we see that (6) is equivalent to(1 − ε ) I n (cid:22) (cid:16) M A ( A ⊤ A + λI n ) − / (cid:17) ⊤ M A ( A ⊤ A + λI n ) − / + λ ( A ⊤ A + λI n ) − (cid:22) (1 + ε ) I n . which, in turn, is implied by the following: (cid:13)(cid:13)(cid:13)(cid:13)(cid:16) M A ( A ⊤ A + λI n ) − / (cid:17) ⊤ M A ( A ⊤ A + λI n ) − / + λ ( A ⊤ A + λI n ) − − I n (cid:13)(cid:13)(cid:13)(cid:13) op ≤ ε . Note that ( A ⊤ A + λI n ) − / A ⊤ A ( A ⊤ A + λI n ) − / = I n − λ ( A ⊤ A + λI n ) − . Letting Z = A ( A ⊤ A + λI n ) − / , we note that it suffices to establish, (cid:13)(cid:13)(cid:13) ( M Z ) ⊤ M Z − Z ⊤ Z (cid:13)(cid:13)(cid:13) op ≤ ε . Using (5) together with Markov’s inequality we get thatPr (cid:20)(cid:13)(cid:13)(cid:13) ( M Z ) ⊤ M Z − Z ⊤ Z (cid:13)(cid:13)(cid:13) op ≥ ε (cid:21) ≤ Pr h(cid:13)(cid:13)(cid:13) ( M Z ) ⊤ M Z − Z ⊤ Z (cid:13)(cid:13)(cid:13) F ≥ ε i ≤ k Z k F µ ! t δ = s λ ( A ⊤ A ) µ ! t δ , k Z k F = tr (cid:16) Z ⊤ Z (cid:17) = tr (cid:18)(cid:16) A ( A ⊤ A + λI n ) − / (cid:17) ⊤ A ( A ⊤ A + λI n ) − / (cid:19) = tr (cid:16) A ⊤ A ( A ⊤ A + λI n ) − (cid:17) = s λ ( A ⊤ A ) . To prove the result for λ = 0 we will use Fatou’s lemma.Pr (cid:20)(cid:16) (1 − ε ) A ⊤ A (cid:22) ( M A ) ⊤ M A (cid:22) (1 + ε ) A ⊤ A (cid:17) C (cid:21) ≤ lim inf λ → + Pr (cid:20)(cid:16) (1 − ε )( A ⊤ A + λI n ) (cid:22) ( M A ) ⊤ M A + λI n (cid:22) (1 + ε )( A ⊤ A + λI n ) (cid:17) C (cid:21) ≤ lim inf λ → + s λ ( A ⊤ A ) µ δ = s ( A ⊤ A ) µ δ , where the last equality follows from continuity of λ s λ ( A ⊤ A ).Our next important observation is that Π q can be written as the product of 2 q − Lemma 12.
For any integer q which is a power of two, Π q : R m q → R m be defined as in Defini-tion 11 for some base sketches S base : R m → R m and T base : R d → R m . Then there exist matrices ( M ( i ) ) i ∈ [ q − , ( M ′ ( j ) ) j ∈ [ q ] and integers ( k i ) i ∈ [ q − , ( k ′ ) i ∈ [ q − , ( l j ) j ∈ [ q ] , ( l ′ j ) j ∈ [ q ] , such that, Π q = M ( q − · . . . M (1) · M ′ ( q ) · . . . · M ′ (1) , and M ( i ) = I k i × S ( i )base × I k ′ i , M ′ ( j ) = I l j × T ( j )base × I l ′ j , where S ( i )base and T ( j )base are independent instancesof S base and T base , for every i ∈ [ q − , j ∈ [ q ] .Proof. We have that Π q = Q q T q by Definition 11. By Definition 9 we have that Q q = S S · · · S q .Claim 7 shows that for every l ∈ { , , · · · q } we can write, S l = M ll/ M ll/ − · . . . · M l , (7)where M lj = I m l − j × S ll/ − j +1 × I m j − for every j ∈ [ l/ T q = M ′ ( q ) · . . . · M ′ (1) , (8)where M ′ ( j ) = I d q − j × T q − j +1 × I m j − for every j ∈ [ q ]. Therefore by combining (7) and (8) we getthe result.We want to show that I k × M × I k ′ inherits the JL properties of M . The following simple factdoes just that. 22 emma 13. Let t ∈ N and α ≥ . If P ∈ R m × d and Q ∈ R m × d are two random matrices (notnecessarily independent), such that, (cid:13)(cid:13)(cid:13) k P x k − k x k (cid:13)(cid:13)(cid:13) L t ≤ α k x k and E h k P x k i = k x k , (cid:13)(cid:13)(cid:13) k Qy k − k y k (cid:13)(cid:13)(cid:13) L t ≤ α k y k and E h k Qy k i = k y k , for any vectors x ∈ R d and y ∈ R d , then (cid:13)(cid:13)(cid:13) k ( P ⊕ Q ) z k − k z k (cid:13)(cid:13)(cid:13) L t ≤ α k z k and E h k ( P ⊕ Q ) z k i = k z k , for any vector z ∈ R d + d .Proof. Let z ∈ R d + d and choose x ∈ R d and y ∈ R d , such that, z = x ⊕ y . Using the triangleinequality, (cid:13)(cid:13)(cid:13) k ( P ⊕ Q ) z k − k z k (cid:13)(cid:13)(cid:13) L t = (cid:13)(cid:13)(cid:13) k P x k + k Qy k − k x k − k y k (cid:13)(cid:13)(cid:13) L t ≤ (cid:13)(cid:13)(cid:13) k P x k − k x k (cid:13)(cid:13)(cid:13) L t + (cid:13)(cid:13)(cid:13) k Qy k − k y k (cid:13)(cid:13)(cid:13) L t ≤ α k x k + α k y k = α k z k . We also see that E h k ( P ⊕ Q ) z k i = E h k P x k i + E h k Qy k i = k x k + k y k = k z k . An easy consequence of this lemma is that for any matrix, S , with the ( ε, δ, t )-JL Moment Prop-erty, I k × S has the ( ε, δ, t )-JL Moment Property. This follows simply from I k × S = S ⊕ S ⊕ . . . ⊕ S | {z } k times .Similarly, S × I k has the ( ε, δ, t )-JL Moment Property, since S × I k is just a reordering of the rowsof I k × S , which trivially does not affect the JL Moment Property. The same arguments show thatif S has the Strong ( ε, δ )-JL Moment Property then I k × S and S × I k has the Strong ( ε, δ )-JLMoment Property. So we conclude the following Lemma 14.
If the matrix S has the (ε, δ, t)-JL Moment Property, then for any positive integers k, k′, the matrix M = I_k × S × I_{k′} has the (ε, δ, t)-JL Moment Property. Similarly, if the matrix S has the Strong (ε, δ)-JL Moment Property, then for any positive integers k, k′, the matrix M = I_k × S × I_{k′} has the Strong (ε, δ)-JL Moment Property.

Now if we can prove that the product of matrices with the JL Moment Property has the JL Moment Property, then Lemma 14 and Lemma 12 would imply that Π^q has the JL Moment Property, which in turn implies that Π^p is an Oblivious Subspace Embedding and has the Approximate Matrix Multiplication Property, by Lemma 11 and Lemma 10. This is exactly what we will do: in Section 4.1 we prove that the product of k independent matrices with the (ε/√k, δ, 2)-JL Moment Property results in a matrix with the (ε, δ, 2)-JL Moment Property, which will give us the proof of Theorem 1; and in Section 4.2 we prove that the product of k independent matrices with the Strong (O(ε/√k), δ)-JL Moment Property results in a matrix with the Strong (ε, δ)-JL Moment Property, which will give us the proof of Theorem 2.

4.1 Second Moment of Π^q (analysis for T_base: CountSketch and S_base: TensorSketch)

In this section we prove Theorem 1 by instantiating our recursive construction from Section 3 with
CountSketch at the leaves and
TensorSketch at the internal nodes of the tree. The proof proceeds by showing the second moment property – i.e., the (ε, δ, 2)-JL moment property: we show that Π^q satisfies the (ε, δ, 2)-JL moment property whenever the base sketches S_base, T_base are chosen from a distribution which satisfies the second moment property. We show that this is the case for CountSketch and
TensorSketch .Lemma 14 together with Lemma 12 show that if the base sketches S base , T base have the JLMoment Property then Π q is the product of 2 q − Lemma 15 (Composition lemma for the second moment) . For any ε, δ ≥ and any integer k if M (1) ∈ R d × d , · · · M ( k ) ∈ R d k +1 × d k are independent random matrices with the (cid:16) ε √ k , δ, (cid:17) -JL-moment property then the product matrix M = M ( k ) · · · M (1) satisfies the ( ε, δ, -JL-momentproperty.Proof. Let x ∈ R d be a fixed unit norm vector. We note that for any i ∈ [ k ] we have that E h k M ( i ) · . . . · M (1) x k (cid:12)(cid:12)(cid:12) M (1) , . . . , M ( i − i = k M ( i − · . . . · M (1) x k . (9)Now we will prove by induction on i ∈ [ k ] that,Var h k M ( i ) · . . . · M (1) x k i ≤ ε δ k ! i − . (10)For i = 1 the result follows from the fact that M (1) has the ( ε/ √ k, δ, i −
1. By the law of total variance we get thatVar h k M ( i ) · . . . · M (1) x k i = E h Var h k M ( i ) · . . . · M (1) x k (cid:12)(cid:12)(cid:12) M (1) , . . . , M ( i − ii + Var h E h k M ( i ) · . . . · M (1) x k (cid:12)(cid:12)(cid:12) M (1) , . . . , M ( i − ii (11)Using (9) and the induction hypothesis we get that,Var h E h k M ( i ) · . . . · M (1) x k (cid:12)(cid:12)(cid:12) M (1) , . . . , M ( i − ii = Var h k M ( i − · . . . M (1) x k i ≤ ε δ k ! i − − . (12)Using that M ( i ) has the ( ε/ √ k, δ, E (cid:20) Var (cid:20)(cid:13)(cid:13)(cid:13) M ( i ) · . . . · M (1) x (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12)(cid:12) M (1) , . . . , M ( i − (cid:21)(cid:21) ≤ E " ε k δ (cid:13)(cid:13)(cid:13) M ( i − · . . . M (1) x (cid:13)(cid:13)(cid:13) = ε δ k Var (cid:20)(cid:13)(cid:13)(cid:13) M ( i − · . . . M (1) x (cid:13)(cid:13)(cid:13) (cid:21) + E (cid:20)(cid:13)(cid:13)(cid:13) M ( i − · . . . M (1) x (cid:13)(cid:13)(cid:13) (cid:21) ! ≤ ε δ k ε δ k ! i − − = ε δ k ε δ k ! i − . (13)Plugging (12) and (13) into (11) gives,Var (cid:20)(cid:13)(cid:13)(cid:13) M ( i ) · . . . · M (1) x (cid:13)(cid:13)(cid:13) (cid:21) ≤ ε δ k ε δ k ! i − + ε δ k ! i − − ε δ k ! i − . Hence, Var h k M x k i ≤ ε δ k ! k − ≤ exp( ε δ/ − ≤ ε δ , which proves that M has the ( ε, δ, q : Corollary 16 (Second moment property for Π q ) . For any power of two integer q let Π q : R m q → R m be defined as in Definition 11, where both of the common distributions S base : R m → R m and T base : R d → R m , satisfy the (cid:16) ε √ q +2 , δ, (cid:17) -JL-moment property. Then it follows that Π q satisfiesthe ( ε, δ, -JL-moment property.Proof. This follows from Lemma 12 and Lemma 15.Now we are ready to prove Theorem 1. Recall that k ( x, y ) = h x, y i q is the polynomial kernel ofdegree q . One can see that k ( x, y ) = h x ⊗ q , y ⊗ q i . Let x , x , · · · x n ∈ R m be an arbitrary dataset of n points in R m . We represent the data points by matrix X ∈ R m × n whose i th column is the vector x i . Let A ∈ R m q × n be the matrix whose i th column is x ⊗ qi for every i ∈ [ n ]. For any regularizationparameter λ >
0, the statistical dimension of A^⊤A is defined as s_λ := tr( (A^⊤A)(A^⊤A + λI_n)^{−1} ).
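The statistical dimension is straightforward to compute from the singular values of A, since tr((A^⊤A)(A^⊤A + λI_n)^{−1}) = Σ_i σ_i²/(σ_i² + λ). The helper below is our own illustrative snippet (toy data, hypothetical dimensions), not part of the paper:

    import numpy as np

    def statistical_dimension(A, lam):
        # s_lambda = tr( A^T A (A^T A + lam I)^{-1} ) = sum_i sigma_i^2 / (sigma_i^2 + lam)
        sv = np.linalg.svd(A, compute_uv=False)
        return float(np.sum(sv ** 2 / (sv ** 2 + lam)))

    rng = np.random.default_rng(4)
    # A matrix with quickly decaying spectrum, so s_lambda << rank(A) for moderate lambda.
    A = rng.standard_normal((100, 20)) @ np.diag(2.0 ** -np.arange(20)) @ rng.standard_normal((20, 30))
    for lam in (1e-3, 1e-1, 10.0):
        print(lam, statistical_dimension(A, lam))   # decreases as lambda grows; at most rank(A)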
Theorem 1. For every positive integers n, p, d and every ε, s_λ > 0, there exists a distribution on linear sketches Π^p ∈ ℝ^{m×d^p} such that: (1) If m = Ω(p s_λ² ε^{−2}), then Π^p is an (ε, 1/10, s_λ, d^p, n)-oblivious subspace embedding as in Definition 2. (2) If m = Ω(p ε^{−2}), then Π^p has the (ε, 1/10)-approximate matrix product property as in Definition 3.
Moreover, for any X ∈ ℝ^{d×n}, if A ∈ ℝ^{d^p×n} is the matrix whose columns are obtained by the p-fold self-tensoring of each column of X, then the matrix Π^p A can be computed using Algorithm 1 in time Õ(pnm + p·nnz(X)).

Proof. Throughout the proof, let δ = 1/10 denote the failure probability, let q = 2^{⌈log₂ p⌉}, and let e₁ ∈ ℝ^d be the column vector with a 1 in the first coordinate and zeros elsewhere. Let Π^p ∈ ℝ^{m×d^p} be the sketch defined in Definition 11, where the base distributions S_base ∈ ℝ^{m×m²} and T_base ∈ ℝ^{m×d} are respectively the standard TensorSketch of degree two and the standard
CountSketch. It is shown in [ANW14] and [CW17] that, for these choices of base sketches, S_base and T_base are both unbiased and satisfy the (ε/√(q+2), δ, 2)-JL-moment property as long as m = Ω(q ε^{−2} δ^{−1}) (see Definition 18).
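Both base sketches are simple to implement. The following toy NumPy versions are ours, not the paper's code (and use fully random hash functions rather than the limited-independence hash families of the formal definitions): a CountSketch, and a degree-two TensorSketch computed via the FFT in O(d + m log m) time without ever forming the d²-dimensional tensor.

    import numpy as np

    def countsketch(x, m, h, g):
        # CountSketch of x with hash h : [d] -> [m] and signs g : [d] -> {-1, +1}.
        y = np.zeros(m)
        np.add.at(y, h, g * x)
        return y

    def tensorsketch2(x1, x2, m, h1, h2, g1, g2):
        # Degree-two TensorSketch of x1 (x) x2: multiply the FFTs of two CountSketches,
        # which circularly convolves them.
        f1 = np.fft.rfft(countsketch(x1, m, h1, g1))
        f2 = np.fft.rfft(countsketch(x2, m, h2, g2))
        return np.fft.irfft(f1 * f2, n=m)

    rng = np.random.default_rng(5)
    d, m = 50, 2048
    h1, h2 = rng.integers(0, m, d), rng.integers(0, m, d)
    g1, g2 = rng.choice([-1.0, 1.0], d), rng.choice([-1.0, 1.0], d)
    x = rng.standard_normal(d)
    z = tensorsketch2(x, x, m, h1, h2, g1, g2)
    print(np.linalg.norm(z) ** 2, np.linalg.norm(x) ** 4)   # agree up to ~ 1/sqrt(m) relative error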
Oblivious Subspace Embedding. Let m = Ω(q s_λ² δ^{−1} ε^{−2}) be an integer. Then S_base and T_base have the (ε/(√(q+2)·s_λ), δ, 2)-JL Moment Property. Thus, using Corollary 16, we conclude that Π^q has the (ε/s_λ, δ, 2)-JL Moment Property. Lemma 11 then implies that Π^q is an (ε, δ, s_λ, d^q, n)-Oblivious Subspace Embedding, and by Lemma 10 we get that Π^p is an (ε, δ, s_λ, d^p, n)-Oblivious Subspace Embedding.
Approximate Matrix Multiplication. Let m = Ω(q δ^{−1} ε^{−2}). Then S_base and T_base have the (ε/√(q+2), δ, 2)-JL Moment Property. Thus, using Corollary 16, we conclude that Π^q has the (ε, δ, 2)-JL Moment Property. By Lemma 11, Π^q has the (ε, δ)-Approximate Matrix Multiplication Property, and by Lemma 10 we get that Π^p has the (ε, δ)-Approximate Matrix Multiplication Property.

Runtime of Algorithm 1 when the base sketch S_base is TensorSketch of degree two and T_base is CountSketch: We compute the time of running Algorithm 1 on a vector x. Computing Y⁰_j for each j in lines 3 and 4 of the algorithm requires applying a CountSketch to either x or e₁, which takes time O(nnz(x)); therefore computing all the Y⁰_j's takes time O(q·nnz(x)). Computing each Y^l_j for l ≥ 1 requires applying a degree-two TensorSketch with input dimension m² and target dimension m to Y^{l−1}_{2j−1} ⊗ Y^{l−1}_{2j}, which takes time O(m log m); therefore computing the Y^l_j's for all l ≥ 1 and all j takes time O(q·m log m). Note that q ≤ 2p, and hence the total running time of Algorithm 1 on one vector x is O(p·m log m + p·nnz(x)). Sketching the n columns of a matrix X ∈ ℝ^{d×n} takes time O(p(nm log m + nnz(X))).

4.2 Higher Moments of Π^q (analysis for T_base: OSNAP and S_base: TensorSRHT)

In this section we prove Theorem 2 by instantiating our recursive construction of Section 3 with
OSNAP at the leaves and
TensorSRHT at the internal nodes. The proof proceeds by showing the Strong JL Moment Property for our sketch Π^q; if a sketch satisfies the Strong JL Moment Property then it is straightforwardly an OSE and has the approximate matrix product property. This section has two goals: the first is to show that the SRHT, and
TensorSRHT as well as
OSNAP transform all satisfy the Strong JL Moment Property. The second goal of this section is to prove that our sketch construction Π^q inherits the Strong JL Moment Property from the base sketches S_base, T_base.

In this section we will need Khintchine's inequality.

Lemma 17 (Khintchine's inequality [HM07]). Let t ≥ 1 be an integer, let x ∈ ℝ^d, and let (σ_i)_{i∈[d]} be independent Rademacher ±1 random variables. Then
‖ Σ_{i=1}^d σ_i x_i ‖_{L^t} ≤ C_t ‖x‖₂,
where C_t ≤ √2 ( Γ((t+1)/2) / √π )^{1/t} ≤ √t for all t ≥ 1. One may replace (σ_i) with an arbitrary independent sequence of random variables (ς_i) with E[ς_i] = 0 and ‖ς_i‖_{L^r} ≤ √r for all 2 ≤ r ≤ t, and the lemma still holds up to a universal constant factor on the right-hand side.
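A quick Monte Carlo check of Khintchine's inequality (our own toy, not from the paper; the √t bound is very loose for Gaussian-like sums):

    import numpy as np

    rng = np.random.default_rng(6)
    d, t, trials = 200, 6, 20_000
    x = rng.standard_normal(d)
    sums = (rng.choice([-1.0, 1.0], (trials, d)) * x).sum(axis=1)   # sum_i sigma_i x_i, fresh signs per trial
    lhs = np.mean(np.abs(sums) ** t) ** (1.0 / t)
    print(lhs, np.sqrt(t) * np.linalg.norm(x))   # Khintchine: lhs <= sqrt(t) * ||x||_2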
First we note that the OSNAP transform satisfies the Strong JL Moment Property.
Lemma 18.
There exists a universal constant L such that the following holds. Let M ∈ ℝ^{m×d} be an OSNAP transform with sparsity parameter s, let x ∈ ℝ^d be any vector with ‖x‖₂ = 1, and let t ≥ 1. Then
‖ ‖Mx‖₂² − 1 ‖_{L^t} ≤ L ( √(t/m) + t/s ).   (14)
Setting m = Ω(ε^{−2} log(1/δ)) and s = Ω(ε^{−1} log(1/δ)), M has the Strong (ε, δ)-JL Moment Property (Definition 19).

Proof. The bound (14) follows from the analysis in [CJN18]; they only prove it for t = log(1/δ), but their proof is easily extended to the general case. Now if we set m = 4L²e²·ε^{−2} log(1/δ) and s = 2Le·ε^{−1} log(1/δ), then we get that
‖ ‖Mx‖₂² − 1 ‖_{L^t} ≤ L ( √( t / (4L²e² ε^{−2} log(1/δ)) ) + t / (2Le ε^{−1} log(1/δ)) ) ≤ (ε/e) √( t / log(1/δ) ),
for every 1 ≤ t ≤ log(1/δ), which proves the result.
We continue by proving that the SRHT and TensorSRHT sketches satisfy the Strong JL Moment Property. We will do this by proving that a more general class of matrices satisfies the Strong JL Moment Property. More precisely, let k ∈ ℤ_{>0} be a positive integer and let (D^(i))_{i∈[k]} ∈ ∏_{i∈[k]} ℝ^{d_i×d_i} be independent matrices, each with diagonal entries given by independent Rademacher variables. Let d = ∏_{i∈[k]} d_i, and let P ∈ {0, 1}^{m×d} be a random sampling matrix in which each row contains exactly one uniformly distributed nonzero element, which has value one. We will prove that the matrix M = (1/√m) P H (D^(1) × ··· × D^(k)) satisfies the Strong JL Moment Property, where H is a d × d Hadamard matrix. If k = 1 then M is just an SRHT, and if k = 2 then M is a TensorSRHT.

In order to prove this result we need a couple of lemmas. The first lemma can be seen as a version of Khintchine's inequality for higher-order chaos.
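Below is a toy NumPy sketch of ours (illustrative only, not the paper's code) of the k = 2 case, i.e., a TensorSRHT applied to a tensor-product input. It relies on the fact that the d₁d₂ × d₁d₂ Hadamard matrix factors as the Kronecker product of two smaller Hadamard matrices, so H(D₁ × D₂)(x₁ ⊗ x₂) = (H D₁ x₁) ⊗ (H D₂ x₂), and each sampled row of the big transform is the product of one entry from each factor.

    import numpy as np

    def fwht(v):
        # Unnormalized fast Walsh-Hadamard transform; len(v) must be a power of two.
        a = v.copy()
        h = 1
        while h < len(a):
            for i in range(0, len(a), 2 * h):
                x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
                a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
            h *= 2
        return a

    def tensor_srht(x1, x2, rows1, rows2, m, signs1, signs2):
        # Apply (1/sqrt(m)) P H (D1 x D2) to x1 (x) x2 without forming the d1*d2 tensor.
        h1 = fwht(signs1 * x1)          # H D1 x1
        h2 = fwht(signs2 * x2)          # H D2 x2
        return (h1[rows1] * h2[rows2]) / np.sqrt(m)

    rng = np.random.default_rng(7)
    d1 = d2 = 64
    m = 1024
    signs1, signs2 = rng.choice([-1.0, 1.0], d1), rng.choice([-1.0, 1.0], d2)
    rows1, rows2 = rng.integers(0, d1, m), rng.integers(0, d2, m)   # one uniform row of the big Hadamard per output coordinate
    x1, x2 = rng.standard_normal(d1), rng.standard_normal(d2)
    z = tensor_srht(x1, x2, rows1, rows2, m, signs1, signs2)
    print(np.linalg.norm(z) ** 2, (np.linalg.norm(x1) * np.linalg.norm(x2)) ** 2)   # approximately equal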
Lemma 19.
Let t ≥ 1, k ∈ ℤ_{>0}, and let (σ^(i))_{i∈[k]} ∈ ∏_{i∈[k]} ℝ^{d_i} be independent vectors, each satisfying the Khintchine inequality ‖⟨σ^(i), x⟩‖_{L^t} ≤ C_t ‖x‖₂ for t ≥ 1 and any vector x ∈ ℝ^{d_i}. Let (a_{i₁,...,i_k})_{i₁∈[d₁],...,i_k∈[d_k]} be a tensor in ℝ^{d₁×···×d_k}. Then, for t ≥ 1,
‖ Σ_{i₁∈[d₁],...,i_k∈[d_k]} ∏_{j∈[k]} σ^(j)_{i_j} a_{i₁,...,i_k} ‖_{L^t} ≤ C_t^k ( Σ_{i₁∈[d₁],...,i_k∈[d_k]} a_{i₁,...,i_k}² )^{1/2}.
Or, considering a ∈ ℝ^{d₁···d_k} as a vector, simply ‖ ⟨σ^(1) ⊗ ··· ⊗ σ^(k), a⟩ ‖_{L^t} ≤ C_t^k ‖a‖₂ for t ≥ 1. Here a is not assumed to have any special structure. Note that this implies the classical bound on the fourth moment of products of 4-wise independent hash functions [BCL+10, IM08, PT12]: since C₄ = 3^{1/4} for Rademachers, we have E[⟨σ^(1) ⊗ ··· ⊗ σ^(k), x⟩⁴] ≤ 3^k ‖x‖₂⁴ for four-wise independent (σ^(i))_{i∈[k]}.

Proof.
The proof will be by induction on k . For k = 1 then the result is by assumption. So assumethat the result is true for every value up to k −
1. Let B i ,...,i k − = P i k ∈ [ d k ] σ ( k ) i k a i ,...,i k . We thenpull it out of the left hand term in the theorem: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] ,...,i k ∈ [ d k ] Y j ∈ [ k ] σ ( j ) i j a i ,...,i k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] ,...,i k − ∈ [ d k − ] Y j ∈ [ k − σ ( j ) i j B i ,...,i k − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C k − t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] ,...,i k − ∈ [ d k − ] B i ,...,i k − / (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t (15)= C k − t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] ,...,i k − ∈ [ d k − ] B i ,...,i k − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t/ ≤ C k − t X i ∈ [ d ] ,...,i k − ∈ [ d k − ] (cid:13)(cid:13)(cid:13) B i ,...,i k − (cid:13)(cid:13)(cid:13) L t/ / (16)= C k − t X i ∈ [ d ] ,...,i k − ∈ [ d k − ] (cid:13)(cid:13) B i ,...,i k − (cid:13)(cid:13) L t / . Here (15) is the inductive hypothesis and (16) is the triangle inequality. It remains to bound (cid:13)(cid:13) B i ,...,i k − (cid:13)(cid:13) L t ≤ C t P i c ∈ [ d k ] a i ,...,i k by Khintchine’s inequality, which finishes the induction stepand hence the proof.The next lemma we will be using is a type of Rosenthal inequality, but which mixes large andsmall moments in a careful way. It bears similarity to the one sided bound in [BLM13] (Theorem15.10) derived from the Efron Stein inequality, and the literature has many similar bounds, but westill include a proof here based on first principles. Lemma 20.
There exists a universal constant L , such that, for t ≥ if X , . . . , X k are independentnon-negative random variables with t -moment, then (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] ( X i − E [ X i ]) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ L √ t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t s X i ∈ [ k ] E [ X i ] + t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t . roof. Throughout these calculations L , L and L will be universal constants. (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] ( X i − E [ X i ]) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ L (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] σ i X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t (Symmetrization) ≤ L √ t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t/ (Khintchine’s inequality) ≤ L √ t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ k ] X i · X i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t/ (Non-negativity) ≤ L √ t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t · (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t (Cauchy-Schwartz) ≤ L √ t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ k ] X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t s X i ∈ [ k ] E [ X i ] + L (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] ( X i − E [ X i ]) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t . Now let C = (cid:13)(cid:13)(cid:13)P i ∈ [ k ] ( X i − E [ X i ]) (cid:13)(cid:13)(cid:13) / L t , B = L qP i ∈ [ k ] E [ X i ], and A = √ t (cid:13)(cid:13)(cid:13) max i ∈ [ k ] X i (cid:13)(cid:13)(cid:13) / L t . thenwe have shown C ≤ A ( B + C ). That implies C is smaller than the largest of the roots of thequadratic. Solving this quadratic inequality gives C ≤ L ( AB + A ) which is the result.We can now prove that SHRT and
TensorSRHT has the Strong JL Moment Property.
Lemma 21.
There exists a universal constant L , such that, the following holds. Let k ∈ Z > , and ( D ( i ) ) i ∈ [ k ] ∈ Q i ∈ [ k ] R d i × d i be independent diagonal matrices with independent Rademacher variables.Define d = Q i ∈ [ k ] d i and D = D × D × · · · D k ∈ R d × d . Let P ∈ R m × d be an independent samplingmatrix which samples exactly one coordinate per row, and define M = P HD where H is a d × d Hadamard matrix. Let x ∈ R d be any vector with k x k = 1 and t ≥ , then (cid:13)(cid:13)(cid:13) m k P HDx k − (cid:13)(cid:13)(cid:13) L t ≤ L s tr k m + tr k m , where r = max { t, log m } .There exists a universal constant L ′ , such that, setting m = Ω (cid:16) ε − log(1 /δ )( L ′ log(1 /εδ )) k (cid:17) , weget that √ m P HD has Strong ( ε, δ ) -JL Moment Property. Note that setting k = 1, this matches the Fast Johnson Lindenstrauss analysis in [CNW16b]. Proof.
Throughout the proof C , C and C will denote universal constants.For every i ∈ [ m ] we let P i be the random variable that says which coordinate the i ’th row of P samples, and we define the random variable Z i = M i x = H P i Dx . We note that since the variables( P i ) i ∈ [ m ] are independent then the variables ( Z i ) i ∈ [ m ] are conditionally independent given D , thatis, if we fix D then ( Z i ) i ∈ [ m ] are independent. 29e use Lemma 20, the triangle inequality, and Cauchy-Schwartz to get that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) m X i ∈ [ m ] Z i − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E m X i ∈ [ m ] Z i − t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) D /t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) √ tm E max i ∈ [ m ] Z i ! t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) D / (2 t ) s X i ∈ [ m ] E (cid:2) Z i (cid:12)(cid:12) D (cid:3) + tm E max i ∈ [ m ] Z i ! t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) D /t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C √ tm (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E max i ∈ [ m ] Z i ! t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) D / (2 t ) s X i ∈ [ m ] E (cid:2) Z i (cid:12)(cid:12) D (cid:3)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t + C tm (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ m ] Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C √ tm (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ m ] Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ m ] E h Z i (cid:12)(cid:12)(cid:12) D i(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / L t + C tm (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ m ] Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t . By orthogonality of H we have k HDx k = d k x k independent of D . Hence X i ∈ [ m ] E h Z i (cid:12)(cid:12)(cid:12) D i = X i ∈ [ m ] k x k = m . To bound (cid:13)(cid:13)(cid:13) max i ∈ [ m ] Z i (cid:13)(cid:13)(cid:13) L t we first use Lemma 19 to show (cid:13)(cid:13)(cid:13) Z i (cid:13)(cid:13)(cid:13) L r = k H P i Dx k L r = k Dx k L r ≤ r k k x k . We then bound the maximum using a sufficiently high powered sum: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ m ] Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) max i ∈ [ m ] Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L r ≤ X i ∈ [ m ] (cid:13)(cid:13)(cid:13) Z i (cid:13)(cid:13)(cid:13) rL r /r ≤ m /r r k k x k ≤ er k , where the last inequality follows from r ≥ log m . This gives us that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) m X i ∈ [ m ] Z i − k x k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C s tr k m + C tr k m , which finishes the first part of the proof.We set m = 4 e C ε − log(1 /δ )( C log(1 / ( δε ))) k , such that, r ≤ C log(1 / ( δε )). Hence m ≥ e C ε − log(1 /δ ) r k . We then get that (cid:13)(cid:13)(cid:13) k P HDx k − (cid:13)(cid:13)(cid:13) L t ≤ C s tr k e C ε − log(1 /δ ) r k + C tr k e C ε − log(1 /δ ) r k ≤ εe s t log 1 /δ , for all 1 ≤ t ≤ log(1 /δ ) which finishes the proof.30ow we have proved that the Strong JL Moment Property is satisfied by the SRHT , the
Ten-sorSRHT as well as
OSNAP transform, but we still need to prove the usefulness of the property. Our next result remedies this and shows that the Strong JL Moment Property is preserved under multiplication. We will use the following decoupling lemma, which first appeared in [Hit94]; the statement below is roughly taken from [DlPG12], which we also recommend for readers interested in more general versions.
Lemma 22 (General decoupling, [DlPG12] Theorem 7.3.1, paraphrasing). There exists a universal constant C such that, given any two sequences (X_i)_{i∈[n]} and (Y_i)_{i∈[n]} of random variables satisfying
1. Pr[ Y_i > t | (X_j)_{j∈[i−1]} ] = Pr[ X_i > t | (X_j)_{j∈[i−1]} ] for every t ∈ ℝ and every i ∈ [n];
2. the sequence (Y_i)_{i∈[n]} is conditionally independent given (X_i)_{i∈[n]};
3. Pr[ Y_i > t | (X_j)_{j∈[i−1]} ] = Pr[ Y_i > t | (X_j)_{j∈[n]} ] for every t ∈ ℝ and every i ∈ [n];
then for all t ≥ 1,
‖ Σ_{i∈[n]} X_i ‖_{L^t} ≤ C ‖ Σ_{i∈[n]} Y_i ‖_{L^t}.

We are now ready to state and prove the main lemma of this section. It basically says that if you take k independent JL transforms that all have the Strong (ε/√k, δ)-JL Moment Property (SJLMP), then their product has the Strong (ε, δ)-JL Moment Property. A simple union bound would give the same result when each matrix has the (ε/k, δ) SJLMP, but that would ultimately result in a higher dependency on the tensoring dimension. A simple change to the proof shows that we only need the i-th JL transform to have the (ε/√i, δ) SJLMP, but that ultimately makes no difference for our construction.

Lemma 23.
There exists a universal constant L , such that, for any constants ε, δ ∈ [0 , andpositive integer k ∈ Z > . If M (1) ∈ R d × d , . . . , M ( k ) ∈ R d k +1 × d c are independent random matriceswith the Strong ( ε/ ( L √ k ) , δ ) -JL Moment Property, then the matrix M = M ( k ) · . . . · M (1) has theStrong ( ε, δ ) -JL Moment Property.Proof. Let x ∈ R d be an arbitrary, fixed unit vector, and fix 1 < t ≤ log(1 /δ ). We define X i = k M ( i ) · . . . · M (1) x k and Y i = X i − X i − for every i ∈ [ k ]. By telescoping we then have that X i − P j ∈ [ i ] Y i . We let ( T ( i ) ) i ∈ [ k ] be independent copies of ( M ( i ) ) i ∈ [ k ] and define Z i = k T ( i ) · M ( i − · . . . · M (1) x k − k M ( i − · . . . · M (1) x k , for every i ∈ [ k ]. We get the following three properties:1. Pr h Z i > t (cid:12)(cid:12)(cid:12) ( M ( j ) ) j ∈ [ i − i = Pr h Y i > t (cid:12)(cid:12)(cid:12) ( M ( j ) ) j ∈ [ i − i for every t ∈ R and every i ∈ [ k ].2. The sequence ( Z i ) i ∈ [ k ] is conditionally independent given ( M ( i ) ) i ∈ [ k ] .3. Pr h Z i > t (cid:12)(cid:12)(cid:12) ( M ( j ) ) j ∈ [ i − i = Pr h Z i > t (cid:12)(cid:12)(cid:12) ( M ( j ) ) j ∈ [ k ] i for every t ∈ R and for every i ∈ [ k ].31his means we can use Lemma 22 to get (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X j ∈ [ i ] Y j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X j ∈ [ i ] Z j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t . (17)for every i ∈ [ k ].We will prove by induction on i ∈ [ k ] that k X i − k L t ≤ εe s t log(1 /δ ) ≤ . (18)For i = 1 we use that M (1) has the Strong ( ε/ ( L √ k ) , δ )-JL Moment Property and get that (cid:13)(cid:13)(cid:13) k M (1) x k − (cid:13)(cid:13)(cid:13) L t ≤ εeL √ k s t log(1 /δ ) ≤ εe s t log(1 /δ ) . Now assume that (18) is true for i −
1. Using (17) we get that k X i − k L t = (cid:13)(cid:13)(cid:13)P j ∈ [ i ] Y j (cid:13)(cid:13)(cid:13) L t ≤ C (cid:13)(cid:13)(cid:13)P j ∈ [ i ] Z j (cid:13)(cid:13)(cid:13) L t . By using that ( T ( j ) ) j ∈ [ i ] has the Strong ( ε/ ( L √ k ) , δ )-JL Moment Property to-gether with Khintchine’s inequality (Lemma 17), we get that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X j ∈ [ i ] Z j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E X j ∈ [ i ] Z j t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( M ( j ) ) j ∈ [ i ] /t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ C (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) εeL √ k s t log(1 /δ ) s X j ∈ [ i ] X j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t = C εe s t log(1 /δ ) · L √ k vuuut(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X j ∈ [ i ] X j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t/ ≤ C εe s t log(1 /δ ) · L √ k s X j ∈ [ i ] k X j k L t , where the last inequality follows from the triangle inequality. Using the triangle inequality and(18) we get that k X j k L t ≤ k X j − k L t ≤ , for every j ∈ [ i ]. Setting L = 2 C C we get that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X j ∈ [ i ] Y j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L t ≤ εe s t log(1 /δ ) · C C L √ k s X j ∈ [ i ] k X j k L t (19) ≤ εe s t log(1 /δ ) · C C L √ k · √ i (20) ≤ εe s t log(1 /δ ) , (21)which finishes the induction. Now we have that (cid:13)(cid:13) k M x k − (cid:13)(cid:13) L t ≤ εe q t log(1 /δ ) so we conclude that M has Strong ( ε, δ )-JL Moment Property. 32 simple corollary of this result is a sufficient condition for our recursive sketch Π q to have theStrong JL Moment Property. Corollary 24 (Strong JL Moment Property for Π q ) . For any integer q which is a power of two,let Π q : R m q → R m be defined as in Definition 11, where both of the common distributions S base : R m → R m and T base : R d → R m , satisfy the Strong (cid:16) O (cid:16) ε √ q (cid:17) , δ (cid:17) -JL Moment Property. Then itfollows that Π q satisfies the Strong ( ε, δ ) -JL Moment Property.Proof. The proof follows from using Lemma 12 and Lemma 23.We conclude this section by proving Theorem 2.
Theorem 2.
For every positive integers n, p, d and every ε, s_λ > 0, there exists a distribution on linear sketches Π^p ∈ ℝ^{m×d^p} such that: (1) If m = Ω̃(p s_λ² ε^{−2}), then Π^p is an (ε, 1/poly(n), s_λ, d^p, n)-oblivious subspace embedding (Definition 2). (2) If m = Ω̃(p ε^{−2}), then Π^p has the (ε, 1/poly(n))-approximate matrix product property (Definition 3).
Moreover, in the setting of (1), for any X ∈ ℝ^{d×n}, if A ∈ ℝ^{d^p×n} is the matrix whose columns are obtained by a p-fold self-tensoring of each column of X, then the matrix Π^p A can be computed using Algorithm 1 in time Õ( pnm + p^{3/2} s_λ ε^{−1} nnz(X) ).

Proof. Let δ = 1/poly(n) denote the failure probability. Define q = 2^{⌈log₂ p⌉} and let Π^p ∈ ℝ^{m×d^p} and Π^q ∈ ℝ^{m×d^q} be the sketches defined in Definition 11, where S_base ∈ ℝ^{m×m²} is a TensorSRHT sketch and T_base ∈ ℝ^{m×d} is an OSNAP sketch with sparsity parameter s, which will be set later.

Oblivious Subspace Embedding.
Let m = Θ (cid:18) ps λ log(1 / ( εδ )) ε (cid:19) and s = Θ (cid:16) √ ps λ log(1 /δ ) ε (cid:17) beintegers, then Lemma 21 and Lemma 18 implies that S base and T base has the Strong (cid:16) O (cid:16) ε √ qs λ (cid:17) , δ (cid:17) -JL Moment Property, thus using Corollary 24 we conclude that Π q has the Strong (cid:16) εs λ , δ (cid:17) -JLMoment Property and in particular it has the (cid:16) εs λ , δ, log(1 /δ ) (cid:17) -JL Moment Property. By Lemma 11we then get that Π q is an ( ε, δ, s λ , d q , n )-Oblivious Subspace Embedding, and by Lemma 10 we getthat Π p is an ( ε, δ, s λ , d p , n )-Oblivious Subspace Embedding. Approximate Matrix Multiplication
Let m = Θ (cid:16) p log(1 / ( εδ )) ε (cid:17) and s = Θ (cid:16) √ p log(1 /δ ) ε (cid:17) beintegers. Then Lemma 21 and Lemma 18 implies that S base and T base has the Strong (cid:16) O (cid:16) ε √ qs λ (cid:17) , δ (cid:17) -JL Moment Property. Thus, using Corollary 24 we conclude that Π q has the Strong ( ε, δ )-JLMoment Property and in particular it has the ( ε, δ, log(1 /δ ))-JL Moment Property. By Lemma 11we then get that Π q has the ( ε, δ )-Approximate Matrix Multiplication Property, and by Lemma 10we get that Π p has the ( ε, δ )-Approximate Matrix Multiplication Property. Runtime of Algorithm 1 when the base sketch S base is a TensorSRHT sketch and T base isan OSNAP sketch with sparsity parameter s : We compute the time of running Algorithm 1on a vector x . Computing Y j for each j in lines 3 and 4 of algorithm requires applying an ONSAP sketch on either x or e which takes time O ( s · nnz( x )). Therefore computing all Y j ’s takes time O ( qs · nnz( x )).Computing each of Y lj ’s for l ≥ TensorSRHT sketch of input dimension m and target dimension of m on Y l − j − ⊗ Y l − j . This takes time33 ( m log m ). Therefore computing Y lj for all l, j ≥ O ( q · m log m ). Note that q ≤ p hence the total running time of Algorithm 1 on one vector x is O ( pm log m + ps · nnz( w )). Sketching n columns of a matrix X ∈ R d × n takes time O ( p ( nm log m + s · nnz( X ))).In the setting of (1) we have that s = O (cid:16) √ ps λ log(1 /δ ) ε (cid:17) , hence we get a runtime of O (cid:16) pnm log m + p / s λ log(1 /δ )) ε nnz( X ) (cid:17) = ˜ O (cid:16) pnm + p / s λ ε nnz( X ) (cid:17) . s λ In this section, we show that if one chooses the internal nodes and the leaves of our recursive con-struction from Section 3 to be
TensorSRHT and
OSNAP transform respectively, then the recursive construction Π^q as in Definition 11 yields a high-probability OSE whose target dimension depends only linearly on s_λ; this proves Theorem 3. This sketch is very efficiently computable for high-degree tensor products, because the OSNAP transform is computable in input-sparsity time and the
TensorSRHT supports fast matrix vector multiplication for tensor inputs.We start by defining the
Spectral Property for a sketch. We use the notation k · k op to denotethe operator norm of matrices. Definition 20 (Spectral Property) . For any positive integers m, n, d and any ε, δ, µ F , µ ≥ S ∈ R m × d satisfies the ( µ F , µ , ǫ, δ, n ) -spectral property if, for every fixedmatrix U ∈ R d × n with k U k F ≤ µ F and k U k op ≤ µ ,Pr S (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ S ⊤ SU − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ (cid:21) ≥ − δ. The spectral property is a central property of our sketch construction from Section 3 when leavesare
OSNAP and internal nodes are
TensorSRHT . This is a powerful property which implies thatany sketch which satisfies the spectral property , is an
Oblivious Subspace Embedding . The
SRHT , TensorSRHT , as well as
OSNAP sketches (Definitions 14, 15, 16 respectively) with target dimension m = Ω (cid:16) ( µ F µ ǫ ) · poly (log( nd/δ )) (cid:17) and sparsity parameter s = Ω(poly (log( nd/δ ))), all satisfy theabove-mentioned spectral property [Sar06, Tro11, NN13].In section 5.1 we recall the tools from the literature which we use to prove the spectral propertyfor our construction Π q . Then in section 5.2 we show that our recursive construction in Section3 satisfies the Spectral Property of Definition 20 as long as I d q × T base and I m q × S base satisfythe Spectral Property. Therefore, we analyze the Spectral Property of I d q × OSNAP and I m q × TensorSRHT in section 5.3 and section 5.4 respectively. Finally we put everything together insection 5.5 and prove that when the leaves are
OSNAP and the internal nodes are
TensorSRHT in our recursive construction of Section 3, the resulting sketch Π q satisfies the Spectral Propertythereby proving Theorem 3. In this section we present the definitions and tools which we use for proving concentration propertiesof random matrices.
Claim 25.
For every ǫ, δ > and any sketch S ∈ R m × d such that I k × S satisfies ( µ F , µ , ǫ, δ, n ) -spectral property, the sketch S × I k also satisfies the ( µ F , µ , ǫ, δ, n ) -spectral property.Proof. Suppose U ∈ R dk × n . Then, note that there exists U ′ ∈ R dk × n formed by permuting therows of U such that ( S × I k ) U and ( I k × S ) U ′ are identical up to a permutation of the rows. (In34articular, U ′ is the matrix such that the ( d, k )-reshaping of any column U j of U ′ is the transposeof the ( k, d )-reshaping of the corresponding column U ′ j of U ′ .) Then, observe that U ⊤ U = U ′⊤ U ′ . and U ⊤ ( S × I k ) ⊤ ( S × I k ) U = U ′⊤ ( I k × S ) ⊤ ( I k × S ) U ′ . Therefore, k U ⊤ ( S × I k ) ⊤ ( S × I k ) U − U ⊤ U k op = k U ′⊤ ( S × I k ) ⊤ ( S × I k ) U ′ − U ′⊤ U ′ k op . Moreover, since U and U ′ are identical up to a permutation of the rows, we have k U k op = k U ′ k op and k U k F = k U ′ k F . The desired claim now follows easily.We will use matrix Bernstein inequalities to show spectral guarantees for sketches, Lemma 26 (Matrix Bernstein Inequality (Theorem 6.1.1 in [Tro15])) . Consider a finite sequence Z i of independent, random matrices with dimensions d × d . Assume that each random matrix satisfies E [ Z i ] = 0 and k Z i k op ≤ B almost surely. Define σ = max {k P i E [ Z i Z ∗ i ] k op , k P i E [ Z ∗ i Z i ] k op } .Then for all t > − , P (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)X i Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≥ t ≤ ( d + d ) · exp − t / σ + Bt/ ! . Lemma 27 (Restatement of Corollary 6.2.1 of [Tro15]) . Let B be a fixed n × n matrix. Constructan n × n matrix R that satisfies, E [ R ] = B and k R k op ≤ L, almost surely. Define M = max {k E [ RR ∗ ] k op , k E [ R ∗ R ] k op } . Form the matrix sampling estimator, ¯ R = 1 m m X k =1 R k , where each R k is an independent copy of R . Then, Pr h k ¯ R − B k op ≥ ǫ i ≤ n · exp − mǫ / M + 2 Lǫ/ ! . To analyze the performance of
SRHT we need the following claim which shows that with highprobability individual entries of the Hadamard transform of a vector with random signs on itsentries do not “overshoot the mean energy” by much.
Claim 28.
Let D be a d × d diagonal matrix with independent Rademacher random variables alongthe diagonal. Also, let H be a d × d Hadamard matrix. Then, for every x ∈ R d , Pr D (cid:20) k HD · x k ∞ ≤ q log ( d/δ ) · k x k (cid:21) ≥ − δ. roof. By Khintchine’s inequality, Lemma 17 we have that for every t ≥ j ∈ [ d ] the j th element of HDx has a bounded t th moment as follows, k ( HDx ) j k L t ≤ √ t · k x k . Hence by applying Markov’s inequality to the t th moment of | ( HDx ) j | for t = log ( d/δ ) we getthat, Pr (cid:20) | ( HDx ) j | ≥ q log ( d/δ ) · k x k (cid:21) ≤ δ/d. The claim follows by a union bound over all entries j ∈ [ d ]. Claim 29.
Let D , D be two independent d × d diagonal matrices, each with diagonal entries givenby independent Rademacher random variables. Also, let H be a d × d Hadamard matrix. Then, forevery x ∈ R d , Pr D ,D [ k (( HD ) × ( HD )) · x k ∞ ≤ ( d/δ ) · k x k ] ≥ − δ. Proof.
By Claim 6 we can write that,( HD ) × ( HD ) = ( H × H )( D × D ) , where H × H is indeed a Hadamard matrix of size d × d which we denote by H ′ . The goal is toprove Pr D ,D (cid:2) k H ′ ( D × D ) · x k ∞ ≤ ( d/δ ) · k x k (cid:3) ≥ − δ. By Lemma 19 we have that for every t ≥ j ∈ [ d ] the j th element of H ′ ( D × D ) x hasa bounded t th moment as follows, (cid:13)(cid:13) ( H ′ ( D × D ) x ) j (cid:13)(cid:13) L t ≤ t · k x k . Hence by applying Markov’s inequality to the t th moment of | ( H ′ ( D × D ) x ) j | for t = log ( d/δ )we get that, Pr (cid:2) | ( H ′ ( D × D ) x ) j | ≥ ( d/δ ) · k x k (cid:3) ≤ δ/d . The claim follows by a union bound over all entries j ∈ [ d ]. Π q In this section we show that the sketch Π q presented in Definition 11 inherits the spectral property(see Definition 20) from the base sketches S base and T base . We start by the following claim whichproves that composing two random matrices with spectral property results in a matrix with spectralproperty. Claim 30.
For every ǫ, ǫ ′ , δ, δ ′ > , suppose that S ∈ R m × t is a sketch which satisfies the (( µ F +1)(1 + ǫ ′ ) , µ + 1 + ǫ ′ , ǫ, δ, n ) -spectral property and also suppose that the sketch T ∈ R t × d satisfies the ( µ F +1 , µ +1 , ǫ ′ , δ ′ /n, n ) -spectral property. Then S · T satisfies the ( µ F + 1 , µ + 1 , ǫ + ǫ ′ , δ + δ ′ (1 + 1 /n ) , n ) -spectral property. roof. Suppose S and T are matrices satisfying the hypothesis of the claim. Consider an arbitrarymatrix U ∈ R d × n which satisfies k U k F ≤ µ F + 1 and k U k op ≤ µ + 1. We want to prove that forevery such U , Pr h k U ⊤ ( S · T ) ⊤ ( S · T ) U − U ⊤ U k op ≤ ǫ + ǫ ′ i ≥ − δ − δ ′ (1 + 1 /n ) . Let us define the event E as follows, E := (cid:26) k T · U k F ≤ (cid:0) ǫ ′ (cid:1) k U k F and (cid:13)(cid:13)(cid:13) U ⊤ T ⊤ T U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ ′ (cid:27) . We show that this event holds with probability 1 − δ ′ (1 + 1 /n ) over the random choice of sketch T .The spectral property of T implies that for every column U j of matrix U , k T U j k = (cid:0) ± ǫ ′ (cid:1) k U j k , with probability 1 − δ ′ n . By a union bound over all j ∈ [ n ], we have the following,Pr T h k T · U k F ≤ (cid:0) ǫ ′ (cid:1) k U k F i ≥ − δ ′ . Also, Pr T (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ T ⊤ T U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ ′ (cid:21) ≥ − δ ′ /n. Therefore by union bound, Pr T [ E ] ≥ − δ ′ (1 + 1 /n ) . We condition on T ∈ E in the rest of the proof. Since S satisfies the (( µ F +1)(1+ ǫ ′ ) , µ +1+ ǫ ′ , ǫ, δ, n )-spectral property, Pr S (cid:20)(cid:13)(cid:13)(cid:13) ( T U ) ⊤ S ⊤ S ( T U ) − ( T U ) ⊤ ( T U ) (cid:13)(cid:13)(cid:13) op ≤ ǫ (cid:21) ≥ − δ. Therefore,Pr
T,S (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ ( S · T ) ⊤ ( S · T ) U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ + ǫ ′ (cid:21) ≥ Pr S (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ ( S · T ) ⊤ ( S · T ) U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ + ǫ ′ (cid:12)(cid:12)(cid:12) T ∈ E (cid:21) − Pr T [ ¯ E ] ≥ Pr S (cid:20)(cid:13)(cid:13)(cid:13) ( T U ) ⊤ S ⊤ S ( T U ) − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ + ǫ ′ (cid:12)(cid:12)(cid:12) T ∈ E (cid:21) − δ ′ (1 + 1 /n ) ≥ Pr S (cid:20) (cid:13)(cid:13)(cid:13) ( T U ) ⊤ S ⊤ S ( T U ) − ( T U ) ⊤ ( T U ) (cid:13)(cid:13)(cid:13) op + (cid:13)(cid:13)(cid:13) ( T U ) ⊤ ( T U ) − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ + ǫ ′ (cid:12)(cid:12)(cid:12)(cid:12) T ∈ E (cid:21) − δ ′ (1 + 1 n ) ≥ Pr S (cid:20)(cid:13)(cid:13)(cid:13) ( T U ) ⊤ S ⊤ S ( T U ) − ( T U ) ⊤ ( T U ) (cid:13)(cid:13)(cid:13) op ≤ ǫ (cid:12)(cid:12)(cid:12) T ∈ E (cid:21) − δ ′ (1 + 1 /n ) ≥ − δ − δ ′ (1 + 1 /n ) . This completes the proof.In the following lemma we show that composing independent random matrices with spectralproperty preserves the spectral property. 37 emma 31.
For any ε, δ, µ F , µ > and every positive integers k, n , if M (1) ∈ R d × d , · · · M ( k ) ∈ R d k +1 × d k are independent random matrices with the (2 µ F + 2 , µ + 2 , O ( ǫ/k ) , O ( δ/nk ) , n ) -spectralproperty then the product matrix M = M ( k ) · · · M (1) satisfies the ( µ F + 1 , µ + 1 , ǫ, δ, n ) -spectralproperty.Proof. Consider a matrix U ∈ R d × n which satisfies k U k F ≤ µ F + 1 and k U k op ≤ µ + 1. We wantto prove that for every such U ,Pr h k U ⊤ M ⊤ M U − U ⊤ U k op ≤ ǫ i ≥ − δ, where M = M ( k ) · · · M (1) .By the assumption of the lemma the matrices M (1) , · · · M ( k ) satisfy the (2 µ F +2 , µ +2 , O ( ǫ/k ) , O ( δ/nk ) , n )-spectral property. For every j ∈ [ k ], let us define the set E j as follows, E j := (cid:16) M (1) , · · · , M ( j ) (cid:17) : . (cid:13)(cid:13)(cid:13)(cid:16) M ( j ) · · · M (1) (cid:17) U (cid:13)(cid:13)(cid:13) F ≤ (cid:0) ǫ k (cid:1) j k U k F . (cid:13)(cid:13)(cid:13)(cid:13) U ⊤ (cid:16) M ( j ) · · · M (1) (cid:17) ⊤ (cid:16) M ( j ) · · · M (1) (cid:17) U − U ⊤ U (cid:13)(cid:13)(cid:13)(cid:13) op ≤ ǫj k . First we prove that for every j ∈ { , · · · , k − } ,Pr M ( j +1) h (cid:16) M (1) , · · · , M ( j +1) (cid:17) ∈ E j +1 (cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j i ≥ − δ k . Let us denote (cid:16) M ( j ) · · · M (1) (cid:17) · U by U ′ . The condition (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j implies that, k U ′ k F ≤ (1 + ǫ/ (10 k )) j k U k F and k U ′⊤ U ′ − U ⊤ U k op ≤ ǫj k and therefore by triangle inequality wehave k U ′ k op ≤ (cid:16) k U k op + ǫj k (cid:17) . The assumptions k U k F ≤ µ F + 1 and k U k op ≤ µ + 1 imply that k U ′ k F ≤ µ F + 2 and k U ′ k op ≤ µ + 2. Now note that by the assumption of the lemma, M ( j +1) satisfies the (2 µ F + 2 , µ + 2 , O ( ǫ/k ) , O ( δ/nk ) , n )-spectral property. Therefore,Pr M ( j +1) " (cid:13)(cid:13)(cid:13)(cid:13)(cid:16) M ( j +1) U ′ (cid:17) ⊤ M ( j +1) U ′ − U ′⊤ U ′ (cid:13)(cid:13)(cid:13)(cid:13) op ≤ ǫ k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j ≥ − δ/ (4 nk ) . Combining the above with k U ′⊤ U ′ − U ⊤ U k ≤ ǫj k gives,Pr M ( j +1) " (cid:13)(cid:13)(cid:13)(cid:13)(cid:16) M ( j +1) U ′ (cid:17) ⊤ M ( j +1) U ′ − U ⊤ U (cid:13)(cid:13)(cid:13)(cid:13) op ≤ ǫ j + 13 k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j ≥ − δ/ (4 nk ) . (22)Also from the spectral property of M ( j +1) it follows that for every column U ′ i of matrix U ′ , k M ( j +1) U ′ i k = (1 ± ǫ/ (10 k )) k U ′ i k , with probability 1 − δ nk . By a union bound over all i ∈ [ n ], we have the following,Pr M ( j +1) h k M ( j +1) · U ′ k F ≤ (1 + ǫ/ (10 k )) k U ′ k F (cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j i ≥ − δ k . Combining the above with k U ′ k F ≤ (1 + ǫ/ (10 k )) j k U k F gives,Pr M ( j +1) " k M ( j +1) · U ′ k F ≤ (cid:18) ǫ k (cid:19) j +1 k U k F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j ≥ − δ k . (23)38 union bound on (22) and (23) gives,Pr M ( j +1) h (cid:16) M (1) , · · · , M ( j +1) (cid:17) ∈ E j +1 (cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · , M ( j ) (cid:17) ∈ E j i ≥ − δ nk − δ k ≥ − δ k . We also show that, Pr M (1) [ M (1) ∈ E ] ≥ − δ/ k. By the assumption of lemma we know that M (1) satisfies the (cid:16) µ F + 2 , µ + 2 , ǫ k , δ nk , n (cid:17) -spectralproperty. Therefore, Pr M (1) (cid:20) k ( M (1) U ) ⊤ M (1) U − U ⊤ U k op ≤ ǫ k (cid:21) ≥ − δ nk . 
(24)Also for every column U i of matrix U , k M (1) U i k = (1 ± ǫ/ (10 k )) k U i k , with probability 1 − δ nk . By a union bound over all i ∈ [ n ], we have the following,Pr M (1) h k M (1) · U k F ≤ (1 + ǫ/ (10 k )) k U k F i ≥ − δ k . (25)A union bound on (24) and (25) gives,Pr T [ T ∈ E ] ≥ − δ nk − δ k ≥ − δ k . By the chain rule for events we have,Pr M (1) , ··· ,M ( k ) h(cid:16) M (1) , · · · , M ( k ) (cid:17) ∈ E k i ≥ k Y j =2 Pr M ( j ) h (cid:16) M (1) , · · · M ( j ) (cid:17) ∈ E j (cid:12)(cid:12)(cid:12) (cid:16) M (1) , · · · M ( j − (cid:17) ∈ E j − i · Pr M (1) [ M (1) ∈ E ] ≥ (1 − δ k ) k ≥ − δ, which completes the proof of the lemma.The following lemma shows that our sketch construction Π q presented in definition 11 inheritsthe spectral property of Definition 20 from the base sketches, that is, if S base and T base are suchthat I m q − × S base and I d q − × T base satisfy the spectral property, then the sketch Π q satisfies thespectral property. Lemma 32.
For every positive integers n, d, m , any power of two integer q , any base sketch T base : R d → R m such that I d q − × T base satisfies the (2 µ F +2 , µ +2 , O ( ǫ/q ) , O ( δ/nq ) , n ) -spectral property,any S base : R m → R m such that I m q − × S base satisfies the (2 µ F + 2 , µ + 2 , O ( ǫ/q ) , O ( δ/nq ) , n ) -spectral property, the sketch Π q defined as in Definition 11 satisfies the ( µ F +1 , µ +1 , ε, δ, n ) -spectralproperty. roof. We wish to show that Π q = Q q T q as per Definition 11, satisfies the ( µ F + 1 , µ + 1 , ε, δ, n )-spectral property. By Definition 9 Q q = S S · · · S q . Claim 7 shows that for every l ∈ { , , · · · q } we can write, S l = M ll/ M ll/ − · · · M l , (26)where M j = I m q − j × S qq/ − j +1 × I m j − for every j ∈ [ q/ T q = M ′ q · · · M ′ , (27)where M ′ j = I d q − j × T q − j +1 × I m j − for every j ∈ [ q ]. Therefore by combining (26) and (27) we getthat, Π q = M (2 q +1) M (2 q ) · · · M (1) , where M ( i ) matrices are independent and by the assumption of the lemma about the spectralproperty of I m q − × S base and I d q − × T base together with Claim 25 it follows that M ( i ) matricessatisfy the (2 µ F + 2 , µ + 2 , O ( ǫ/q ) , O ( δ/nq ) , n )-spectral property. Therefore, the Lemma readilyfollows by invoking Lemma 31 with k = 2 q + 1. × TensorSRHT
In this section, we show that tensoring an identity operator with a
TensorSRHT sketch results ina transform that satisfies the spectral property defined in Definition 20 with nearly optimal targetdimension.
Lemma 33.
Suppose ǫ, δ, µ , µ F > and n is a positive integer. If m = Ω (cid:16) log( nδ ) log ( ndkǫδ ) · µ F µ ǫ (cid:17) and S ∈ R m × d is a TensorSRHT , then the sketch I k × S satisfies ( µ F , µ , ǫ, δ, n ) -spectral property.Proof. Fix a matrix U ∈ R kd × n with k U k F ≤ µ F and k U k op ≤ µ . Partition U by rows into d × n submatrices U , U , . . . , U k such that U ⊤ = h U ⊤ U ⊤ · · · U ⊤ k i . Note that U ⊤ ( I k × S ) ⊤ ( I k × S ) U = ( U ) ⊤ S ⊤ SU + · · · ( U k ) ⊤ S ⊤ SU k . The proof first considers the simpler case of a
TensorSRHT sketch of rank 1 and then applies thematrix Bernstein inequality from Lemma 27. Let R denote a rank one TensorSRHT sketch. R is a1 × d matrix defined in Definition 15 by setting m = 1 as follows, R = P · ( HD × HD ) , where P ∈ { , } × d has one non-zero element whose position is uniformly distributed over [ d ]. Notethat S ⊤ S ∈ R d × d , is the average of m independent samples from R ⊤ R , i.e., S ⊤ S = m P i ∈ [ m ] R ⊤ i R i ,for i.i.d. R , R , . . . , R m ∼ R , and therefore, U ⊤ ( I k × S ) ⊤ ( I k × S ) U = 1 m X i ∈ [ m ] U ⊤ ( I k × R i ) ⊤ ( I k × R i ) U. Therefore in order to use matrix Bernstein, Lemma 27, we need to bound the maximum operatornorm of U ⊤ ( I k × R ) ⊤ ( I k × R ) U as well as the operator norm of its second moment.40e proceed to upper bound the operator norm of U ⊤ ( I k × R ) ⊤ ( I k × R ) U . First, define the set E := (cid:26) ( D , D ) : (cid:13)(cid:13)(cid:13) ( HD × HD ) U ij (cid:13)(cid:13)(cid:13) ∞ ≤ ( ndµ F kǫδ )) · k U ij k for all j ∈ [ k ] and all i ∈ [ n ] (cid:27) , where U ji is the i th column of U j . By Claim 29, for every i ∈ [ n ] and j ∈ [ k ],Pr D ,D (cid:20)(cid:13)(cid:13)(cid:13) ( HD × HD ) U ji (cid:13)(cid:13)(cid:13) ∞ ≤ ( ndk/δ ) k U ji k (cid:21) ≥ − ǫδ/ ( nkµ F d ) . Thus, by a union bound over all i ∈ [ n ] and j ∈ [ k ], it follows that E occurs with probability atleast 1 − ǫδ/ ( dµ F ), Pr D ,D [( D , D ) ∈ E ] ≥ − ǫδ/ ( dµ F ) , where the probability is over the random choice of D , D .From now on, we fix ( D , D ) ∈ E and proceed having conditioned on this event. Upper bounding (cid:13)(cid:13)(cid:13) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:13)(cid:13)(cid:13) op . From the fact that we have conditioned on ( D , D ) ∈E , note that L ≡ (cid:13)(cid:13)(cid:13) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:13)(cid:13)(cid:13) op = k ( U ) ⊤ R ⊤ RU + · · · ( U k ) ⊤ R ⊤ RU k k op ≤ (cid:13)(cid:13)(cid:13) ( U ) ⊤ R ⊤ RU (cid:13)(cid:13)(cid:13) op + · · · + (cid:13)(cid:13)(cid:13) ( U k ) ⊤ R ⊤ RU k (cid:13)(cid:13)(cid:13) op = k RU k + · · · + k RU k k ≤ ( ndµ F k/ǫδ ) · ( k U k F + · · · + k U k k F ) ≤ ( ndµ F k/ǫδ ) · k U k F = 16 µ F · log ( ndµ F k/ǫδ )) , where the equality on the third line above holds because the matrices ( U i ) ⊤ R ⊤ RU i are rank one. Upper bounding (cid:13)(cid:13)(cid:13)(cid:13) E P (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) op . For every x ∈ R d with k x k = 1, wehave x T E P (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21) x = E P X j,j ′ ∈ [ k ] x T ( U j ) ⊤ R ⊤ RU j · ( U j ′ ) ⊤ R ⊤ RU j ′ x ≤ E P X j,j ′ ∈ [ k ] | RU j x |k RU j k | RU j ′ x |k RU j ′ k = E P X j ∈ [ k ] | RU j x |k RU j k ≤ E P X j ∈ [ k ] ( RU j x ) X j ∈ [ k ] k RU j k , where the second and fourth lines follow from the Cauchy-Schwarz inequality. Using the fact thatwe conditioned on ( D , D ) ∈ E , we get 41 T E P (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21) x ≤
16 log ( ndµ F k/ǫδ ) X j ∈ [ k ] k U j k F E P X j ∈ [ k ] ( RU j x ) = 16 log ( ndµ F k/ǫδ ) X j ∈ [ k ] k U j k F X j ∈ [ k ] E P h ( P ( HD × HD ) U j x ) i = 16 log ( ndµ F k/ǫδ ) · k U k F X j ∈ [ k ] k U j x k = 16 log ( ndµ F k/ǫδ ) · k U k F k U x k ≤
16 log ( ndµ F k/ǫδ ) · µ F µ , since E P (cid:2) ( P ( HD × HD ) U j x ) (cid:3) = d k ( HD × HD ) U j x k = k U j x k for all x .Since the matrix E P (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21) is positive semi-definite for any fixed D and D , it follows that M ≡ (cid:13)(cid:13)(cid:13)(cid:13) E P (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) op ≤
16 log ( ndµ F k/ǫδ ) · µ F µ . Combining one-dimensional TensorSRHT sketches.
To conclude, we note that the Gram matrix of a
TensorSRHT , S ⊤ S ∈ R d × d , is the average of m independent samples from R ⊤ R , i.e., S ⊤ S = m P i ∈ [ m ] R ⊤ i R i , for i.i.d. R , R , . . . , R m ∼ R , and therefore,( I k × S ) ⊤ ( I k × S ) = 1 m X i ∈ [ m ] ( I k × R i ) ⊤ ( I k × R i ) . Recall that ( D , D ) ∈ E occurs with probability at least 1 − ǫδ/ ( dµ F ), therefore we have thefollowing for the conditional expectation E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) ( D , D ) ∈ E i , E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) ( D , D ) ∈ E i (cid:22) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U i Pr[( D , D ) ∈ E ] (cid:22) U ⊤ U − ǫδ/ ( dµ F ) . And also by Cauchy-Schwarz we have, E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) ( D , D ) ∈ E i (cid:23) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U i − E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) ( D , D ) ∈ ¯ E i · Pr[ ¯ E ] (cid:23) U ⊤ U − d k U k F Pr[ ¯ E ] · I n (cid:23) U ⊤ U − d k U k F · ǫδ/ ( dµ F ) · I n (cid:23) U ⊤ U − ( ǫ/ · I n . These two bounds together imply that, (cid:13)(cid:13)(cid:13) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) ( D , D ) ∈ E i − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ/ . Now note that the random variables R ⊤ i R i are independent conditioned on ( D , D ) ∈ E . Hence,using the upper bounds L ≤ µ F · log ( ndµ F k/ǫδ ) and M ≤ µ F µ · log ( ndµ F k/ǫδ ), which hold42hen ( D , D ) ∈ E , we have the following by Lemma 27, (here we drop the subscript from I k forease of notation)Pr P,D ,D (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ ( I × S ) ⊤ ( I × S ) U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≥ ǫ (cid:21) ≤ Pr P (cid:20) (cid:13)(cid:13)(cid:13) U ⊤ ( I × S ) ⊤ ( I × S ) U − E h U ⊤ ( I × R ) ⊤ ( I × R ) U (cid:12)(cid:12)(cid:12) ( D , D ) ∈ E i(cid:13)(cid:13)(cid:13) op ≥ ǫ/ (cid:12)(cid:12)(cid:12)(cid:12) ( D , D ) ∈ E (cid:21) + Pr D ,D [ ¯ E ] ≤ n · exp − mǫ / M + 2 ǫL/ ! + δ/ ≤ δ, where the last inequality follows by setting m = Ω (cid:16) log( n/δ ) log ( ndk/ǫδ ) · µ F µ /ǫ (cid:17) . This showsthat I k × S satisfies the ( µ F , µ , ǫ, δ, n )-spectral property. × OSNAP
In this section, we show that tensoring the identity operator with the OSNAP sketch (Definition 16) results in a transform that satisfies the spectral property (Definition 20) with nearly optimal target dimension as well as nearly optimal application time. This sketch is particularly efficient for sketching sparse vectors. We use a slightly different sketch than the original OSNAP to simplify the analysis, defined as follows.
Definition 21 (OSNAP transform). For every sparsity parameter s, target dimension m, and positive integer d, the OSNAP transform with sparsity parameter s is defined as
S_{r,j} = (1/√s) · δ_{r,j} · σ_{r,j}
for all r ∈ [m] and all j ∈ [d], where σ_{r,j} ∈ {−1, +1} are independent and uniform Rademacher random variables and δ_{r,j} are independent Bernoulli random variables satisfying E[δ_{r,i}] = s/m for all r ∈ [m] and all i ∈ [d].

Lemma 34.
Suppose ǫ, δ, µ , µ F > and n is a positive integer. If S ∈ R m × d is a OSNAP sketchwith sparsity parameter s , then the sketch I k × S satisfies the ( µ F , µ , ǫ, δ, n ) -spectral property,provided that s = Ω (cid:16) log ( ndk/ǫδ ) log( n/δ ) · µ ǫ (cid:17) and m = Ω (cid:16) ( µ F µ /ǫ ) · log ( ndk/ǫδ ) (cid:17) .Proof. Fix a matrix U ∈ R kd × n with k U k F ≤ µ F and k U k op ≤ µ . Partition U by rows into d × n sub-matrices U , U , . . . , U k such that U T = h U ⊤ U ⊤ · · · U ⊤ k i . Note that U ⊤ ( I k × S ) ⊤ ( I k × S ) U = ( U ) ⊤ S ⊤ SU + · · · ( U k ) ⊤ S ⊤ SU k . The proof first considers the simpler case of an
OSNAP sketch of rank 1 and then applies the matrixBernstein bound. Let R denote a rank one OSNAP sketch. R is a 1 × d matrix defined as follows, R i = r ms · δ i σ i , (28)where σ i for all i ∈ [ d ] are independent Rademacher random variables and also, δ i for all i ∈ [ d ] areindependent Bernoulli random variables for which the probability of being one is equal to sm .43e proceed to upper bound the operator norm of U ⊤ ( I k × R ) ⊤ ( I k × R ) U . First, define the set E := (cid:26) R : ( RU j ) ⊤ RU j (cid:22) C (cid:18) ms log ( ndkµ F ǫδ ) · U ⊤ j U j + log( ndkµ F ǫδ ) k U j k F · I n (cid:19) for all j = 1 , . . . , k (cid:27) , where C > R ∈ E ] ≥ − ǫδ/ ( dmµ F ) , where the probability is over the random choices of { σ i } i ∈ [ d ] and { δ i } i ∈ [ d ] . To show this we firstprove the following claim, Claim 35.
For every matrix Z ∈ R d × n , if we let R be defined as in (28) , then, Pr (cid:20) Z ⊤ R ⊤ RZ (cid:22) C (cid:18) ms · log ( n/δ ) Z ⊤ Z + log( n/δ ) k Z k F I n (cid:19)(cid:21) ≥ − δ. Proof.
The proof is by Matrix Bernstein inequality, Lemma 26. For any matrix Z let A = Z ( Z ⊤ Z + µI n ) − / , where µ = sm n/δ ) k Z k F . We can write RA = q ms P i ∈ [ d ] δ i σ i A i , where A i is the i throw of A . Note that E [ δ i σ i A i ] = 0 and k δ i σ i A i k ≤ k A i k ≤ k A k op . Also note that X i ∈ [ d ] E [( δ i σ i A i )( δ i σ i A i ) ∗ ] = X i ∈ [ d ] sm k A i k = sm k A k F and, X i ∈ [ d ] E [( δ i σ i A i ) ∗ ( δ i σ i A i )] = X i ∈ [ d ] sm A ∗ i A i = sm A ⊤ A. Therefore,max (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] E [( δ i σ i A i )( δ i σ i A i ) ∗ ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] E [( δ i σ i A i ) ∗ ( δ i σ i A i )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ sm k A k F . By Lemma 26, Pr (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ d ] δ i σ i A i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≥ t ≤ ( n + 1) · exp − t / sm k A k F + k A k op t/ ! , hence if t = C ′ / · (cid:16)q sm log( n/δ ) k A k F + log( n/δ ) k A k op (cid:17) , then Pr (cid:20)(cid:13)(cid:13)(cid:13)P i ∈ [ d ] δ i σ i A i (cid:13)(cid:13)(cid:13) op ≥ t (cid:21) ≤ δ . Byplugging k RA k = ms · k P i ∈ [ d ] δ i σ i A i k into the above we get the following,Pr (cid:20) k RA k op ≤ C ′ / (cid:18) ms · log ( n/δ ) k A k op + log( n/δ ) k A k F (cid:19)(cid:21) ≥ − δ. Now note that for the choice of A = Z ( Z ⊤ Z + µI n ) − / , we have k A k op ≤ k Z ⊤ Z k op k Z ⊤ Z k op + µ ≤ k A k F = P i λ i ( Z ⊤ Z ) λ i ( Z ⊤ Z )+ µ ≤ P i λ i ( Z ⊤ Z ) µ = ms log( n/δ ). By plugging these into the above we get that,Pr (cid:20)(cid:13)(cid:13)(cid:13) RZ ( Z ⊤ Z + µI n ) − / (cid:13)(cid:13)(cid:13) op ≤ C ′ ms · log ( n/δ ) (cid:21) ≥ − δ. Z ⊤ Z + µI n ) − / Z ⊤ R ⊤ RZ ( Z ⊤ Z + µI n ) − / (cid:22) C ms · log ( n/δ ) I n , with probability 1 − δ , where C = C ′ . Multiplying both sides of the above from left and right bythe positive definite matrix ( Z ⊤ Z + µI n ) / gives (recall that µ = sm · k Z k F log( n/δ ) ), Z ⊤ R ⊤ RZ (cid:22) C (cid:18) ms · log ( n/δ ) Z ⊤ Z + log( n/δ ) k Z k F I n (cid:19) . By applying Claim 35 with failure probability of ǫδ/ ( dkµ F ) on each of U j ’s and then applyinga union bound, we get the following,Pr[ R ∈ E ] ≥ − ǫδ/ ( dmµ F ) . From now on, we fix R ∈ E and proceed having conditioned on this event. Upper bounding (cid:13)(cid:13)(cid:13) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:13)(cid:13)(cid:13) op . From the fact that we have conditioned on R ∈ E ,note that, L ≡ (cid:13)(cid:13)(cid:13) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:13)(cid:13)(cid:13) op = k ( U ) ⊤ R ⊤ RU + · · · ( U k ) ⊤ R ⊤ RU k k op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ [ k ] C (cid:18) ms · log ( ndkµ F /ǫδ ) · U ⊤ j U j + log( ndkµ F /ǫδ ) k U j k F · I n (cid:19)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13) C (cid:18) ms · log ( ndkµ F /ǫδ ) · U ⊤ U + log( ndkµ F /ǫδ ) k U k F · I n (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ C (cid:18) ms · log ( ndkµ F /ǫδ ) · k U k op + log( ndkµ F /ǫδ ) k U k F (cid:19) ≤ C (cid:18) ms µ · log ( ndkµ F /ǫδ ) + µ F · log( ndkµ F /ǫδ ) (cid:19) . Upper bounding (cid:13)(cid:13)(cid:13)(cid:13) E (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) op . 
From the condition R ∈ E , it follows that E (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21) (cid:22) E (cid:20) C (cid:18) ms · log ( ndkµ F /ǫδ ) · U ⊤ U + log( ndkµ F /ǫδ ) k U k F · I n (cid:19) (cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17)(cid:21) (cid:22) C (cid:18) ms · log ( ndkµ F /ǫδ ) · U ⊤ U + log( ndkµ F /ǫδ ) k U k F · I n (cid:19) E h(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17)i (cid:22) C (cid:18) ms · log ( ndkµ F /ǫδ ) · U ⊤ U + log( ndkµ F /ǫδ ) k U k F · I n (cid:19) · U ⊤ U − ǫδ/ ( dmµ F )where the last line follows from the fact that the random variable U ⊤ ( I k × R ) ⊤ ( I k × R ) U is positivesemidefinite and the conditional expectation can be upper bounded by its unconditional expectationas follows, E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ E i (cid:22) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U i Pr[ R ∈ E ] . M ≡ (cid:13)(cid:13)(cid:13)(cid:13) E (cid:20)(cid:16) U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:17) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ (cid:13)(cid:13)(cid:13)(cid:13) C (cid:18) ms · log ( ndkµ F /ǫδ ) · ( U ⊤ U ) + log( ndkµ F /ǫδ ) k U k F · U ⊤ U (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ C (cid:18) ms · log ( ndkµ F /ǫδ ) · k U ⊤ U k op + log( ndkµ F /ǫδ ) k U k F · k U ⊤ U k op (cid:19) = 2 C (cid:18) ms · log ( ndkµ F /ǫδ ) · µ + log( ndkµ F /ǫδ ) µ F µ (cid:19) . Combining one-dimensional OSNAP transforms.
To conclude, we note that the Gram matrix of an
OSNAP sketch, S ⊤ S ∈ R d × d , is the average of m independent samples from R ⊤ R with R defined as in (28) – i.e., S ⊤ S = m P i ∈ [ m ] R ⊤ i R i for i.i.d. R , R , . . . , R m ∼ R , and therefore,( I k × S ) ⊤ ( I k × S ) = 1 m X i ∈ [ m ] ( I k × R i ) ⊤ ( I k × R i ) . Note that by a union bound R i ∈ E simultaneously for all i ∈ [ m ] with probability at least1 − ǫδ/ ( dµ F ). Now note that the random variables R ⊤ i R i are independent conditioned on R i ∈ E for all i ∈ [ m ]. Also note that the conditional expectation E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ E i satisfies the following, E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ E i (cid:23) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U i − E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ ¯ E i · Pr[ ¯ E ] (cid:23) U ⊤ U − d k U k F Pr[ ¯ E ] · I n (cid:23) U ⊤ U − d k U k F · ǫδ/ ( dµ F ) · I n (cid:23) U ⊤ U − d k U k F · ǫ/ · I n . We also have, E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ E i (cid:22) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U i Pr[ R ∈ E ] (cid:22) U ⊤ U − ǫδ/ ( dµ F ) . These two bounds together imply that, (cid:13)(cid:13)(cid:13) E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ E i − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ǫ/ . Now, using the upper bounds L ≤ C (cid:16) ms µ · log ( ndkµ F /ǫδ ) + µ F · log( ndkµ F /δ ) (cid:17) and M ≤ C (cid:16) ms · log ( ndkµ F /δ ) · µ + log( ndkµ F /δ ) µ F µ (cid:17) , which hold when R ∈ E , we have that by Lemma27,Pr (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ ( I k × S ) ⊤ ( I k × S ) U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≥ ǫ (cid:21) ≤ Pr (cid:20)(cid:13)(cid:13)(cid:13) U ⊤ ( I k × S ) ⊤ ( I k × S ) U − E h U ⊤ ( I k × R ) ⊤ ( I k × R ) U (cid:12)(cid:12)(cid:12) R ∈ E i(cid:13)(cid:13)(cid:13) op ≥ ǫ/ (cid:12)(cid:12) E (cid:21) + Pr D [ ¯ E ] ≤ n · exp − mǫ / M + ǫL/ ! + δ/ ≤ δ, s = Ω (cid:16) log ( ndkµ F /ǫδ ) log( nd/δ ) · µ ǫ (cid:17) and m =Ω (cid:16) µ F µ /ǫ · log ( ndkµ F /ǫδ ) (cid:17) . This shows that I k × S satisfies the ( µ F , µ , ǫ, δ, n )-spectral prop-erty. s λ We are ready to prove Theorem 3. We prove that if we instantiate Π p from Definition 11 with T base : OSNAP and S base : TensorSRHT , it satisfies the statement of Theorem 3.
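Before turning to the proof of Theorem 3, the following is a minimal numpy sketch of the OSNAP transform of Definition 21. The dense construction and the unit check at the end are illustrative assumptions (a practical implementation would store S sparsely and apply it in O(s · nnz(x)) time); this is not the paper's Algorithm 1.

```python
import numpy as np

def osnap_matrix(m, d, s, rng):
    """OSNAP-style matrix as in Definition 21 (dense, purely illustrative).

    S[r, j] = delta[r, j] * sigma[r, j] / sqrt(s), where sigma is a uniform random
    sign and delta is Bernoulli with mean s/m, all entries independent.  Each
    column has s nonzeros in expectation, so a sparse representation applies S to
    a vector x in O(s * nnz(x)) expected time.
    """
    delta = rng.random((m, d)) < s / m       # Bernoulli(s/m) indicators
    sigma = rng.choice([-1.0, 1.0], size=(m, d))
    return delta * sigma / np.sqrt(s)

rng = np.random.default_rng(1)
m, d, s = 512, 4096, 8
S = osnap_matrix(m, d, s, rng)
x = rng.standard_normal(d)
print(np.linalg.norm(S @ x) ** 2, np.linalg.norm(x) ** 2)   # E[||Sx||^2] = ||x||^2
```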
Theorem 3.
For every positive integers p, d, n , every ε, s λ > , there exists a distribution onlinear sketches Π p ∈ R m × d p which is an ( ε, / poly ( n ) , s λ , d p , n ) -oblivious subspace embedding as inDefinition 2, provided that the integer m satisfies m = e Ω (cid:0) p s λ /ǫ (cid:1) .Moreover, for any X ∈ R d × n , if A ∈ R d p × n is the matrix whose columns are obtained by a p -fold self-tensoring of each column of X then the matrix Π p A can be computed using Algorithm 1in time e O (cid:0) pnm + p ǫ − nnz( X ) (cid:1) .Proof. Let δ = n ) denote the failure probability. Let m ≈ p log ( ndεδ ) · s λ ε and s ≈ p ε · log ( ndεδ )be integers. Let Π p ∈ R m × m p be the sketch defined in Definition 11, where S base ∈ R m × m is a TensorSRHT sketch and T base ∈ R m × d is an OSNAP sketch with sparsity parameter s .Let q = 2 ⌈ log ( p ) ⌉ . By Lemma 10, it is sufficient to show that Π q is a ( ε, δ, s λ , d q , n )-ObliviousSubspace Embedding. Consider arbitrary A ∈ R d q × n and λ >
0. Let us denote the statisti-cal dimension of A by s λ = s λ ( A ⊤ A ). Let U = A (cid:16) A ⊤ A + λI n (cid:17) − / . Therefore, k U k ≤ k U k F = s λ . Since q < p , by Lemma 34, the transform I d q − × T base , satisfies (2 s λ +2 , , O ( ε/q ) , O ( δ/n q ) , n )-spectral property. Moreover, by Lemma 33, the transform I m q − × S base satisfies (5 s λ + 9 , , O ( ε/q ) , O ( δ/n q ) , n )-spectral property. Therefore, by Lemma 32, the sketchΠ q satisfies ( s λ + 1 , , ε, δ, n )-spectral property, hence,Pr (cid:20)(cid:13)(cid:13)(cid:13) (Π q U ) ⊤ Π q U − U ⊤ U (cid:13)(cid:13)(cid:13) op ≤ ε (cid:21) ≥ − δ. Since U ⊤ U = ( A ⊤ A + λI n ) − / A ⊤ A ( A ⊤ A + λI n ) − / and Π q U = Π p A ( A ⊤ A + λI n ) − / we havethe following,Pr h (1 − ǫ )( A ⊤ A + λI n ) (cid:22) (Π p A ) ⊤ Π p A + λI n (cid:22) (1 + ǫ )( A ⊤ A + λI n ) i ≥ − δ. Runtime:
By Lemma 8, for any S_base and T_base, if A is the matrix whose columns are obtained by p-fold self-tensoring of each column of some X ∈ R^{d×n}, then the sketched matrix Π^p A can be computed using Algorithm 1. When S_base is TensorSRHT and T_base is OSNAP, the runtime of Algorithm 1 for a fixed vector w ∈ R^d is as follows. Computing the Y_j's for each j in lines 3 and 4 of the algorithm requires applying an OSNAP sketch to w ∈ R^d, which in expectation takes time O(s · nnz(w)); therefore computing all the Y_j's takes time O(qs · nnz(w)). Computing each Y_j^l in line 7 of the algorithm amounts to applying a TensorSRHT of input dimension m² and target dimension m to Y_{2j−1}^{l−1} ⊗ Y_{2j}^{l−1}, which takes time O(m log m); therefore computing all the Y_j^l's takes time O(q · m log m). Since q ≤ 2p, the total time for running Algorithm 1 on a vector w is O(p · m log m + ps · nnz(w)). Therefore, sketching the n columns of a matrix X ∈ R^{d×n} takes time O(p(nm log m + s · nnz(X))).
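The binary-tree combination just described can be summarized as follows. Here apply_Tbase and apply_Sbase_pair are hypothetical stand-ins for the OSNAP and TensorSRHT sketches (instantiated below with plain Gaussian projections just to exercise the control flow), and independent copies per tree node are omitted, so this is a schematic of Algorithm 1's structure rather than the algorithm itself.

```python
import numpy as np

def sketch_tensor_power(w, q, apply_Tbase, apply_Sbase_pair):
    """Schematic of the binary-tree combination described above (hypothetical helpers).

    apply_Tbase(w)         -> m-dimensional base sketch of w (role of OSNAP, lines 3-4)
    apply_Sbase_pair(u, v) -> m-dimensional sketch of u tensor v (role of TensorSRHT, line 7)

    Leaves: q base sketches of w; each level pairs up adjacent vectors and
    re-sketches their tensor product, until one length-m vector remains, which
    plays the role of the sketch of w^{tensor q}.
    """
    assert q & (q - 1) == 0, "q is assumed to be a power of two in this illustration"
    level = [apply_Tbase(w) for _ in range(q)]
    while len(level) > 1:
        level = [apply_Sbase_pair(level[2 * i], level[2 * i + 1])
                 for i in range(len(level) // 2)]
    return level[0]

# Toy stand-ins (plain Gaussian projections, shared across nodes) just to run the control flow.
rng = np.random.default_rng(2)
d, m, q = 32, 64, 4
T = rng.standard_normal((m, d)) / np.sqrt(m)
G = rng.standard_normal((m, m * m)) / np.sqrt(m)
w = rng.standard_normal(d)
z = sketch_tensor_power(w, q,
                        apply_Tbase=lambda v: T @ v,
                        apply_Sbase_pair=lambda u, v: G @ np.kron(u, v))
print(z.shape)   # (64,)
```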
Oblivious Subspace Embedding for the Gaussian Kernel

In this section we show how to sketch the Gaussian kernel matrix by polynomial expansion and then applying our proposed sketch for polynomial kernels.
Data points with bounded ℓ₂ radius: Suppose that we are given a dataset of points x₁, …, x_n ∈ R^d such that ‖x_i‖₂ ≤ r for all i ∈ [n], for some positive value r. Consider the Gaussian kernel matrix G ∈ R^{n×n} defined as G_{i,j} = e^{−‖x_i−x_j‖₂²/2} for all i, j ∈ [n]. We are interested in sketching the data-points matrix X using a sketch S_g : R^d → R^m such that the following holds with probability 1 − δ:
(1 − ε)(G + λI_n) ⪯ (S_g(X))^⊤ S_g(X) + λI_n ⪯ (1 + ε)(G + λI_n).

Theorem 5.
For every r > , every positive integers n, d , and every X ∈ R d × n such that k x i k ≤ r for all i ∈ [ n ] , where x i is the i th column of X , suppose G ∈ R n × n is the Gaussian kernel matrix –i.e., G j,k = e −k x j − x k k / for all j, k ∈ [ n ] . There exists an algorithm which computes S g ( X ) ∈ R m × n in time e O (cid:0) q ǫ − ns λ + q ǫ − nnz( X ) (cid:1) such that for every ε, λ > , Pr S g h (1 − ǫ )( G + λI n ) (cid:22) ( S g ( X )) ⊤ S g ( X ) + λI n (cid:22) (1 + ǫ )( G + λI n ) i ≥ − / poly ( n ) , where m = e Θ (cid:0) q s λ /ǫ (cid:1) and q = Θ( r + log( n/ǫλ )) and s λ is λ -statistical dimension of G as inDefinition 1.Proof. Let δ = n ) denote the failure probability. Note that G i,j = e −k x i k / · e x ⊤ i x j · e −k x j k / for every i, j ∈ [ n ]. Let D be a n × n diagonal matrix with i th diagonal entry e −k x i k / and let K ∈ R n × n be defined as K i,j = e x ⊤ i x j (note that DKD = G ). Note that K is a positive definitekernel matrix. The Taylor series expansion for kernel K is as follows, K = ∞ X l =0 ( X ⊗ l ) ⊤ X ⊗ l l ! . Therefore G can be written as the following series, G = ∞ X l =0 ( X ⊗ l D ) ⊤ X ⊗ l Dl ! . Note that each of the terms ( X ⊗ l D ) ⊤ X ⊗ l D = D ( X ⊗ l ) ⊤ X ⊗ l D are positive definite kernelmatrices. The statistical dimension of kernel ( X ⊗ l D ) ⊤ X ⊗ l D for every l ≥ G through the following claim. Claim 36.
For every µ ≥ 0 and every integer l ≥ 0, s_µ((X^{⊗l}D)^⊤ X^{⊗l}D) ≤ s_µ(G).

Proof.
From the Taylor expansion G = Σ_{l=0}^∞ (X^{⊗l}D)^⊤ X^{⊗l}D / l!, together with the fact that the polynomial kernel of any degree is positive definite, we have that (X^{⊗l}D)^⊤ X^{⊗l}D ⪯ G. Now, by the Courant–Fischer min-max theorem we have that
λ_j((X^{⊗l}D)^⊤ X^{⊗l}D) = max_{U ∈ R^{(j−1)×n}} min_{α≠0, Uα=0} α^⊤ (X^{⊗l}D)^⊤ X^{⊗l}D α / ‖α‖₂².
Let U* be the maximizer of the expression above. Then we have
λ_j(G) = max_{U ∈ R^{(j−1)×n}} min_{α≠0, Uα=0} α^⊤ G α / ‖α‖₂² ≥ min_{α≠0, U*α=0} α^⊤ G α / ‖α‖₂² ≥ min_{α≠0, U*α=0} α^⊤ (X^{⊗l}D)^⊤ X^{⊗l}D α / ‖α‖₂² = λ_j((X^{⊗l}D)^⊤ X^{⊗l}D)
for all j. Therefore, the claim follows from the definition of statistical dimension:
s_µ(G) = Σ_{j=1}^n λ_j(G)/(λ_j(G) + µ) ≥ Σ_{j=1}^n λ_j((X^{⊗l}D)^⊤ X^{⊗l}D)/(λ_j((X^{⊗l}D)^⊤ X^{⊗l}D) + µ) = s_µ((X^{⊗l}D)^⊤ X^{⊗l}D).

If we let P = Σ_{l=0}^q (X^{⊗l})^⊤ X^{⊗l} / l!, where q = C · (r² + log(n/(ελ))) for some constant C, then by the triangle inequality we have
‖K − P‖_op ≤ Σ_{l>q} ‖(X^{⊗l})^⊤ X^{⊗l}/l!‖_op ≤ Σ_{l>q} ‖(X^{⊗l})^⊤ X^{⊗l}/l!‖_F ≤ Σ_{l>q} n · r^{2l}/l! ≤ ελ/2.
P is a positive definite kernel matrix. Also note that all the eigenvalues of the diagonal matrix D are bounded by 1. Hence, in order to get a subspace embedding it is sufficient to satisfy the following with probability 1 − δ:
(1 − ε/2)(DPD + λI_n) ⪯ (S_g(X))^⊤ S_g(X) + λI_n ⪯ (1 + ε/2)(DPD + λI_n).
Let the sketch Π^l ∈ R^{m_l × d^l} be the sketch from Theorem 3; then by Claim 36 we get the following guarantee on Π^l:
(1 − ε/2)((X^{⊗l}D)^⊤ X^{⊗l}D + λI_n) ⪯ (Π^l X^{⊗l}D)^⊤ Π^l X^{⊗l}D + λI_n ⪯ (1 + ε/2)((X^{⊗l}D)^⊤ X^{⊗l}D + λI_n),   (29)
with probability 1 − δ/(q+1), as long as m_l is chosen as in Theorem 3 (with degree l, accuracy ε/2, and failure probability δ/(q+1)); moreover, Π^l X^{⊗l}D can be computed in the runtime given by Theorem 3, where s_λ is the λ-statistical dimension of G.

We let S_P be the sketch of size m × (Σ_{l=0}^q d^l) which sketches the kernel P. The sketch S_P is defined as
S_P = (1/√(0!)) · Π⁰ ⊕ (1/√(1!)) · Π¹ ⊕ (1/√(2!)) · Π² ⊕ ⋯ ⊕ (1/√(q!)) · Π^q.
Let Z be the matrix of size (Σ_{l=0}^q d^l) × n whose i-th column is z_i = x_i^{⊗0} ⊕ x_i^{⊗1} ⊕ x_i^{⊗2} ⊕ ⋯ ⊕ x_i^{⊗q}, where x_i is the i-th column of X. Therefore the following holds for (S_P Z)^⊤ S_P Z:
(S_P Z)^⊤ S_P Z = Σ_{l=0}^q (Π^l X^{⊗l})^⊤ Π^l X^{⊗l} / l!,  and hence  (S_P Z D)^⊤ S_P Z D = Σ_{l=0}^q (Π^l X^{⊗l}D)^⊤ Π^l X^{⊗l}D / l!.
Therefore, by combining the terms of (29) for all 0 ≤ l ≤ q and using a union bound, we get that with probability 1 − δ the following holds:
(1 − ε/2)(DPD + λI_n) ⪯ (S_P Z D)^⊤ S_P Z D + λI_n ⪯ (1 + ε/2)(DPD + λI_n).
Now we define S_g(x), a non-linear transformation of the input x, as
S_g(x) = e^{−‖x‖₂²/2} ( (1/√(0!)) · Π⁰(x^{⊗0}) ⊕ (1/√(1!)) · Π¹(x^{⊗1}) ⊕ (1/√(2!)) · Π²(x^{⊗2}) ⊕ ⋯ ⊕ (1/√(q!)) · Π^q(x^{⊗q}) ).
We have that S_g(X) = S_P Z D; therefore, with probability 1 − δ the following holds:
(1 − ε)(G + λI_n) ⪯ (S_g(X))^⊤ S_g(X) + λI_n ⪯ (1 + ε)(G + λI_n).
Note that the target dimension of S_g is m = m_0 + m_1 + ⋯ + m_q, and by Theorem 3 the time to compute S_g(X) is as claimed in the theorem statement.
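To make the construction of S_g concrete, the following is an illustrative numpy sketch that truncates the Taylor expansion at degree q and plugs in generic linear sketches for the Π^l (Gaussian projections are used below as hypothetical stand-ins for the sketches of Theorem 3). It forms x^{⊗l} explicitly, so it is meaningful only for tiny d and q.

```python
import numpy as np
from math import factorial

def gaussian_kernel_features(X, q, sketches):
    """Illustrative construction of S_g from the proof above.

    For each column x of X, concatenate sketches[l](x^{tensor l}) / sqrt(l!) for
    l = 0..q and scale the result by exp(-||x||^2 / 2).  `sketches[l]` should be a
    linear map from R^{d^l} to R^{m_l}; x^{tensor l} is formed explicitly here, so
    this only makes sense for tiny d and q.
    """
    feats = []
    for x in X.T:
        blocks, v = [], np.array([1.0])            # v = x^{tensor 0}
        for l in range(q + 1):
            blocks.append(sketches[l](v) / np.sqrt(factorial(l)))
            v = np.kron(v, x)                      # v becomes x^{tensor (l+1)}
        feats.append(np.exp(-x @ x / 2) * np.concatenate(blocks))
    return np.array(feats).T                       # column i is S_g(x_i)

rng = np.random.default_rng(3)
d, n, q, m = 4, 50, 6, 200
X = 0.3 * rng.standard_normal((d, n))              # small radius, so truncation at q is accurate
# Gaussian projections as hypothetical stand-ins for the per-degree sketches Pi^l.
sketches = [lambda v, P=rng.standard_normal((m, d ** l)) / np.sqrt(m): P @ v for l in range(q + 1)]
F = gaussian_kernel_features(X, q, sketches)
G = np.exp(-np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0) / 2)   # exact Gaussian kernel
print(np.max(np.abs(F.T @ F - G)))                 # entrywise error of the kernel approximation
```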
Acknowledgements

Michael Kapralov is supported by ERC Starting Grant SUBLINEAR. Thomas D. Ahle, Jakob B. T. Knudsen, and Rasmus Pagh are supported by Villum Foundation grant 16582 to Basic Algorithms Research Copenhagen (BARC). David Woodruff is supported in part by Office of Naval Research (ONR) grant N00014-18-1-2562. Part of this work was done while Michael Kapralov, Rasmus Pagh, and David Woodruff were visiting the Simons Institute for the Theory of Computing.
References [AC06] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast johnson-lindenstrauss transform. In
Proceedings of the 38th Annual ACM Symposium on The-ory of Computing, Seattle, WA, USA, May 21-23, 2006 , pages 557–563, 2006.[Ach03] Dimitris Achlioptas. Database-friendly random projections: Johnson-lindenstrausswith binary coins.
J. Comput. Syst. Sci. , 66(4):671–687, June 2003.50ACW17a] Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Faster kernel ridge re-gression using sketching and preconditioning.
SIAM J. Matrix Analysis Applications ,38(4):1116–1138, 2017.[ACW17b] Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Sharper bounds for regu-larized data fitting. In
Approximation, Randomization, and Combinatorial Optimiza-tion. Algorithms and Techniques, APPROX/RANDOM 2017, August 16-18, 2017,Berkeley, CA, USA , pages 27:1–27:22, 2017.[AK19] Thomas D Ahle and Jakob BT Knudsen. Almost optimal tensor sketch. arXiv preprintarXiv:1909.01821 , 2019.[AKM +
17] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Vel-ingker, and Amir Zandieh. Random fourier features for kernel ridge regression: Ap-proximation bounds and statistical guarantees. In
Proceedings of the 34th InternationalConference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August2017 , pages 253–262, 2017.[AKM + CoRR , abs/1804.09893, 2018.[AKM + arXiv preprint arXiv:1812.08723 , 2018.[AM15] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regressionwith statistical guarantees. In Advances in Neural Information Processing Systems 28:Annual Conference on Neural Information Processing Systems 2015, December 7-12,2015, Montreal, Quebec, Canada , pages 775–783, 2015.[ANW14] Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polyno-mial kernel. In
Advances in neural information processing systems , pages 2258–2266,2014.[BCL +
10] Vladimir Braverman, Kai-Min Chung, Zhenming Liu, Michael Mitzenmacher, andRafail Ostrovsky. AMS without 4-wise independence on product domains. In , pages 119–130, 2010.[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.
Concentration inequalities:A nonasymptotic theory of independence . Oxford university press, 2013.[CCFC02] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items indata streams. In
International Colloquium on Automata, Languages, and Program-ming , pages 693–703. Springer, 2002.[Cha02] Moses S Charikar. Similarity estimation techniques from rounding algorithms. In
Proceedings of the thiry-fourth annual ACM symposium on Theory of computing , pages380–388. ACM, 2002. 51CJN18] Michael B. Cohen, T. S. Jayram, and Jelani Nelson. Simple analyses of the sparsejohnson-lindenstrauss transform. In , pages 15:1–15:9, 2018.[CKS11] Andrew Cotter, Joseph Keshet, and Nathan Srebro. Explicit approximations of thegaussian kernel. arXiv preprint arXiv:1109.4603 , 2011.[CNW16a] Michael B. Cohen, Jelani Nelson, and David P. Woodruff. Optimal approximate matrixproduct in terms of stable rank. In , pages11:1–11:14, 2016.[CNW16b] Michael B. Cohen, Jelani Nelson, and David P. Woodruff. Optimal approximate matrixproduct in terms of stable rank. In , pages11:1–11:14, 2016.[Coh16] Michael B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities.In
Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on DiscreteAlgorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016 , pages 278–287,2016.[CW09] Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in the stream-ing model. In
Proceedings of the 41st Annual ACM Symposium on Theory of Com-puting, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009 , pages 205–214,2009.[CW13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regressionin input sparsity time. In
Symposium on Theory of Computing Conference, STOC’13,Palo Alto, CA, USA, June 1-4, 2013 , pages 81–90, 2013.[CW17] Kenneth L Clarkson and David P Woodruff. Low-rank approximation and regressionin input sparsity time.
Journal of the ACM (JACM) , 63(6):54, 2017.[DKS10] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse johnson: Lindenstrausstransform. In
Proceedings of the 42nd ACM Symposium on Theory of Computing,STOC 2010, Cambridge, Massachusetts, USA, 5-8 June 2010 , pages 341–350, 2010.[DlPG12] Victor De la Pena and Evarist Giné.
Decoupling: from dependence to independence .Springer Science & Business Media, 2012.[DMM06] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for l Proceedings of the Seventeenth Annual ACM-SIAMSymposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006 , pages 1127–1136, 2006.[DMMS11] Petros Drineas, Michael W. Mahoney, S. Muthukrishnan, and Tamás Sarlós. Fasterleast squares approximation.
Numerische Mathematik , 117(2):219–249, 2011.[DMMW12] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff.Fast approximation of matrix coherence and statistical leverage.
Journal of MachineLearning Research , 13:3475–3506, 2012.52Hit93] Paweł Hitczenko. Domination inequality for martingale transforms of a rademachersequence.
Israel Journal of Mathematics , 84(1-2):161–178, 1993.[Hit94] Pawel Hitczenko. On a domination of sums of random variables by sums of condition-ally independent ones.
The Annals of Probability , pages 453–468, 1994.[HM07] Uffe Haagerup and Magdalena Musat. On the best constants in noncommutativekhintchine-type inequalities.
Journal of Functional Analysis , 250(2):588–624, 2007.[IM08] Piotr Indyk and Andrew McGregor. Declaring independence via the sketching ofsketches. In
Proceedings of the nineteenth annual ACM-SIAM symposium on Discretealgorithms , pages 737–745. Society for Industrial and Applied Mathematics, 2008.[JLS86] William B. Johnson, Joram Lindenstrauss, and Gideon Schechtman. Extensions oflipschitz maps into banach spaces.
Israel Journal of Mathematics , 54(2):129–138, Jun1986.[KN14] Daniel M. Kane and Jelani Nelson. Sparser johnson-lindenstrauss transforms.
J. ACM ,61(1):4:1–4:23, 2014.[KPV +
19] Michael Kapralov, Rasmus Pagh, Ameya Velingker, David Woodruff, and AmirZandieh. Oblivious sketching of high-degree polynomial kernels. arXiv preprintarXiv:1909.01410 , 2019.[KVW14] Ravi Kannan, Santosh Vempala, and David Woodruff. Principal component analysisand higher correlations for distributed data. In
Conference on Learning Theory , pages1040–1057, 2014.[Lat97] Rafał Latała. Estimation of moments of sums of independent real random variables.
The Annals of Probability , 25(3):1502–1513, 1997.[Lat06] Rafał Latała. Estimates of moments and tails of gaussian chaoses.
The Annals ofProbability , 34(6):2315–2331, 2006.[LDFU13] Yichao Lu, Paramveer S. Dhillon, Dean P. Foster, and Lyle H. Ungar. Faster ridgeregression via the subsampled randomized hadamard transform. In
Advances in NeuralInformation Processing Systems 26: 27th Annual Conference on Neural InformationProcessing Systems 2013. Proceedings of a meeting held December 5-8, 2013, LakeTahoe, Nevada, United States. , pages 369–377, 2013.[LSS14] Quoc Viet Le, Tamás Sarlós, and Alexander Johannes Smola. Fastfood: Approximatekernel expansions in loglinear time.
CoRR , abs/1408.3060, 2014.[MM17] Cameron Musco and Christopher Musco. Recursive sampling for the nystrom method.In
Advances in Neural Information Processing Systems 30: Annual Conference onNeural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA,USA , pages 3836–3848, 2017.[NDT15] Nam H Nguyen, Petros Drineas, and Trac D Tran. Tensor sparsification via a boundon the spectral norm of random tensors.
Information and Inference: A Journal of theIMA , 4(3):195–229, 2015. 53NN13] Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithmsvia sparser subspace embeddings. In , pages 117–126. IEEE, 2013.[Pag13] Rasmus Pagh. Compressed matrix multiplication.
TOCT , 5(3):9:1–9:17, 2013.[PP13] Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit featuremaps. In
The 19th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining, KDD 2013, Chicago, IL, USA, August 11-14, 2013 , pages 239–247,2013.[PT12] Mihai Patrascu and Mikkel Thorup. The power of simple tabulation hashing.
J. ACM ,59(3):14:1–14:50, 2012.[PW15] Mert Pilanci and Martin J. Wainwright. Randomized sketches of convex programswith sharp guarantees.
IEEE Trans. Information Theory , 61(9):5096–5115, 2015.[RR07] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In
Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver,British Columbia, Canada, December 3-6, 2007 , pages 1177–1184, 2007.[RR08] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In
Advances in neural information processing systems , pages 1177–1184, 2008.[Sar06] Tamás Sarlós. Improved approximation algorithms for large matrices via randomprojections. In , pages143–152, 2006.[Tro11] Joel A. Tropp. Improved analysis of the subsampled randomized hadamard transform.
Advances in Adaptive Data Analysis , 3(1-2):115–126, 2011.[Tro15] Joel A. Tropp. An introduction to matrix concentration inequalities.
Foundations andTrends in Machine Learning , 8(1-2):1–230, 2015.[Val15] Gregory Valiant. Finding correlations in subquadratic time, with applications to learn-ing parities and the closest pair problem.
Journal of the ACM (JACM) , 62(2):13, 2015.[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra.
Foundations andTrends in Theoretical Computer Science , 10(1-2):1–157, 2014.
A Direct Lower and Upper Bounds
We introduce the following notation. We say f(x) ≲ g(x) if for some universal constant C we have f(x) ≤ C g(x) for all x ∈ R. Note this is slightly different from the usual f(x) = O(g(x)) in that it is uniform in x rather than asymptotic. We similarly say f(x) ≳ g(x) if g(x) ≲ f(x), and f(x) ∼ g(x) if both f(x) ≲ g(x) and f(x) ≳ g(x).

We will also make heavy use of the L_p norm notation for random variables in R, that is, for p ≥ 1, ‖X‖_{L_p} = (E|X|^p)^{1/p}. A very useful result for computing the L_p norm of a sum of random variables is the following:

Lemma 37 (Latała's inequality, [Lat97]). If p ≥ 2 and X, X₁, …, X_n are i.i.d. mean-0 random variables, then we have
‖ Σ_{i=1}^n X_i ‖_{L_p} ∼ sup { (p/s) · (n/p)^{1/s} · ‖X‖_{L_s} : max{2, p/n} ≤ s ≤ p }.   (30)

The following simple corollary will be used for both upper and lower bounds:

Corollary 38.
Let p ≥ 2, C > 0 and α ≥ 1. Let (X_i)_{i ∈ [n]} be i.i.d. mean-0 random variables such that ‖X_i‖_{L_p} ∼ (Cp)^α; then ‖ Σ_i X_i ‖_{L_p} ∼ C^α max{ 2^α √(pn), (n/p)^{1/p} p^α }.

Proof. We will show that the expression in eq. (30) is maximized either by minimizing or maximizing s. Hence we need to check that (p/s)(n/p)^{1/s} s^α has no other optima in the valid range. For this, we note that
d/ds [ (p/s)(n/p)^{1/s} s^α ] = − p s^{α−3} (n/p)^{1/s} ( (1 − α)s + log(n/p) ).
Given α ≥ 1 this has at most one zero in s, which gives the lemma.

For the lower bound we will also use the following result by Hitczenko, which provides an improvement on Khintchine for Rademacher random variables.

Lemma 39 (Sharp bound on Rademacher sums [Hit93]). Let σ ∈ {−1, 1}^n be a random Rademacher sequence and let a ∈ R^n be an arbitrary real vector with sorted entries |a₁| ≥ |a₂| ≥ ⋯ ≥ |a_n|; then
‖⟨a, σ⟩‖_{L_p} ∼ Σ_{i ≤ p} |a_i| + √p ( Σ_{i > p} a_i² )^{1/2}.   (31)

Finally, the lower bound will use the Paley–Zygmund inequality (also known as the one-sided Chebyshev inequality):

Lemma 40 (Paley–Zygmund). Let X ≥ 0 be a real random variable with finite variance, and let θ ∈ [0, 1]; then
Pr[X ≥ θ E[X]] ≥ (1 − θ)² E[X]² / E[X²].   (32)

A classical strategy when using Paley–Zygmund is to prove E[X] ≥ 2ε for some ε >
0, and then take θ = 1/2, which gives Pr[X ≥ ε] ≥ E[X]² / (4 E[X²]).

A.1 Lower Bound for Sub-Gaussians
The following lower bound considers the sketching matrix consisting of the direct composition ofmatrices with Rademacher entries. Note however that the assumptions on Rademachers are onlyused to show that the p -norm of a single row with a vector is ∼ √ p . For this reason the same lowerbound hold if the Rademacher entries are substituted for, say Gaussians. Theorem 41 (Lower bound) . For some constants C , C , B > , let d, m, c ≥ be integers, let ε ∈ [0 , and δ ∈ [0 , / . Further assume that d ≥ log 1 /δ ≥ c/B . Then the following holds.Let M (1) , . . . , M ( c ) ∈ R m × d be matrices with all independent Rademacher entries and let M = √ m M (1) • · · · • M ( c ) . Then there exists some unit vector y ∈ R d c such that if m < C max (cid:26) c ε − log 1 /δc , ε − (cid:18) C log 1 /δc (cid:19) c (cid:27) then Pr h(cid:12)(cid:12)(cid:12) k M y k − (cid:12)(cid:12)(cid:12) > ε i > δ. (33)55 roof. Let y = [1 , . . . , T / √ d ∈ R d and let x = y ⊗ c . We have k M y k − m (cid:13)(cid:13)(cid:13) M (1) x ◦ · · · ◦ M ( c ) x (cid:13)(cid:13)(cid:13) − m X j ∈ [ m ] (cid:0) Y i ∈ [ c ] Z i,j (cid:1) − Z i,j = P k ∈ [ d ] M ( i ) j,k / √ d are independent averages of d independent Rademacher randomvariables. By Lemma 39 we have k Z i,j k L p ∼ min {√ p, √ d } which is √ p by the assumption d ≥ log 1 /δ as long as p ≤ log 1 /δ . By the expanding Z i,j into monomials and linearity of expectationwe get k Z i,j k = √ d ( d + 3 d ( d − / = (3 − /d ) / .Now define X j = Q i ∈ [ c ] Z i,j −
1, then EX j = 0 and k X j k L p ≥ (cid:13)(cid:13)(cid:13)Q i ∈ [ c ] Z i,j (cid:13)(cid:13)(cid:13) L p − k Z i,j k cL p − ≥ K c p c for some K , assuming p ≥
2. In particular, k X j k L ≥ k Z i,j k cL − − /d ) c/ − ∼ c/ by the assumption d ≥ c ≥ (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p = m (cid:13)(cid:13)(cid:13)P j ∈ [ m ] X m (cid:13)(cid:13)(cid:13) L p is a sum of iid. random variables, so we can useCorollary 38 to show K max (cid:26)q c p/m, ( m/p ) /p K c p c /m (cid:27) . (cid:13)(cid:13)(cid:13) k M y k − (cid:13)(cid:13)(cid:13) L p (35) . K max (cid:26)q c p/m, ( m/p ) /p K c p c /m (cid:27) (36)for some universal constants K , K , K , K > m < max n AK c ε − /δc , K ε − (cid:16) AK /δc (cid:17) c o as in the theorem. Wetake p = 4 A log 1 /δc for some constant A to be determined. We want to show (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p ≥ ε .For this we split into two cases depending on which term of m < max { (1) , (2) } dominates. If (1) ≥ (2) we pick the first lower bound in eq. (36) and get (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p ≥ K p c p/m ≥ K r ε K = 2 ε .Otherwise, if (2) ≥ (1), we pick the other lower bound and also get: (cid:13)(cid:13)(cid:13) k M y k − (cid:13)(cid:13)(cid:13) L p ≥ K ( m/p ) /p K c p c m ≥ K K c (cid:16) A log 1 /δc (cid:17) cK ε − (cid:16) AK /δc (cid:17) c = 2 ε, (37)where we used ( m/p ) /p ≥ e − / ( em ) ≥ / m ≥
1. Plugging into Paley-Zygmund (Lemma 40)we have Pr h(cid:12)(cid:12)(cid:12) k M y k − (cid:12)(cid:12)(cid:12) ≥ ε i ≥ Pr h(cid:12)(cid:12)(cid:12) k M y k − (cid:12)(cid:12)(cid:12) p ≥ (cid:13)(cid:13)(cid:13) k M y k − (cid:13)(cid:13)(cid:13) pL p − p i (38) ≥ (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p ! p , (39)where we used that p ≥ − − p ) ≥ / p c p/m ≥ ( m/p ) /p K c p c /m we have using the first lower bound that k k My k − k Lp k k My k − k L p ≥ K √ K .For the alternative case, ( m/p ) /p K c p c /m ≥ p c p/m , we have (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p (cid:13)(cid:13) k M y k − (cid:13)(cid:13) L p ≥ K √ K ( m/p ) /p ( m/ p ) / p (cid:18) K K (cid:19) c ≥ K K (cid:18) K K (cid:19) c (40)56here ( m/p ) /p ( m/ p ) / p ≥ e − / (4 em ) ≥ / √ m ≥ A ≤ min { K /K , K /K } /
32. Thischoice also ensures that 1 ≤ p ≤ log 1 /δ as we promised. Note that we may assume in eq. (36) that K ≤ K and K ≤ K . We then finally have14 (cid:18) K √ K (cid:19) p ≥ δ / (4 c ) and 14 (cid:18) K K (cid:18) K K (cid:19) c (cid:19) p ≥ δ / (4 c )+1 / , (41)which are both ≥ δ for c ≥ δ < / A.2 Upper bound for Sub-Gaussians
Theorem 42 (Upper bound) . Let ε, δ ∈ [0 , and let γ > , ≤ c ≤ log 1 /δ γ be some constants.Let T ∈ R m × d be a matrix with iid. rows T , . . . , T m ∈ R d such that E (cid:2) ( T x ) (cid:3) = k x k and k T x k L p ≤ √ ap k x k for some a > and p ≥ . Let M = T (1) • · · · • T ( c ) where T (1) , . . . , T ( c ) areindependent copies of T . Then M has the JL-moment property, kk M x k − k x k k L p ≤ εδ /p , given m & (4 ae γ ) c ε − log 1 /δcγ + (4 ae γ ) c ε − (cid:18) log 1 /δcγ (cid:19) c . (42) Remark 2.
In the case of random Rademachers we set a = √ / m = O (cid:18) c ε − log 1 /δcγ e cγ + ε − (cid:18) √ /δcγ (cid:19) c e cγ (cid:19) . Note that depending on γ this matches either of the terms of the lower bound. Setting γ = Θ(1 /c )or γ = Θ(1) we have either m = O (cid:16) c ε − log 1 /δ + ε − (cid:16) √ /δ (cid:17) c (cid:17) or m = O (cid:18) (3 e ) c ε − log 1 /δc + ε − (cid:18) √ e log 1 /δc (cid:19) c (cid:19) . Finally, in the case of constant c = O (1) , γ = Θ(1) we simply get m = O (cid:16) ε − log 1 /δ + ε − (log 1 /δ ) c (cid:17) . Proof of Theorem 42.
Without loss of generalization we may assume k x k L = 1. We notice that (cid:13)(cid:13) k M x k − (cid:13)(cid:13) L p ≤ (cid:13)(cid:13)(cid:13) m P i ( M i x ) − (cid:13)(cid:13)(cid:13) L p is the mean of iid. random variables. Call these Z i =( M i x ) −
1. Then EZ i = 0 and k Z i k L p = (cid:13)(cid:13) ( M i x ) − (cid:13)(cid:13) L p . (cid:13)(cid:13) ( M i x ) (cid:13)(cid:13) L p = k M i x k L p by the triangleinequality and the assumption that p ≥
1. Now by the assumption k T x k L p ≤ √ ap k x k = √ ap , andby Lemma 19, we get that k M i x k L p = (cid:13)(cid:13)(cid:13) T (1) i ⊗ · · · ⊗ T ( c ) i x (cid:13)(cid:13)(cid:13) L p ≤ ( ap ) c/ , and so k Z i k L p ≤ (2 ap ) c forall i ∈ [ m ].We now use Corollary 38 which implies (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) m X i Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L p . (4 a ) c q p/m + m /p (2 ap ) c /m . (4 a ) c q p/m + (4 ap ) c /m. (43)The second inequality comes from the following consideration: If the second term of (43) dominates,then (4 a ) c p p/m ≤ m /p (2 ap ) c /m which implies m /p ≤ ( p/ c − p − ≤ c for p ≥ p . We take p = log 1 /δcγ which is ≥ m = max { (4 ae γ ) c pε − , (4 ae γ ) c p c ε − } . Then (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) m X i Z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) pL p . (4 a ) cp max { ε p (4 ae γ ) − cp , ε p (4 ae γ ) − cp } (44)= e − cγp ε p (45)= δε p , (46)which is exactly the JL moment property. A.3 Lower Bound for TensorSketch
For every pair of integers d, q, the TensorSketch of degree q, M : R^{d^q} → R^m, is defined as
M(x^{⊗q}) = F^{−1}((F C¹ x) ◦ (F C² x) ◦ ⋯ ◦ (F C^q x)),   (47)
for every x ∈ R^d, where C¹, …, C^q ∈ R^{m×d} are independent instances of CountSketch and F ∈ C^{m×m} is the Discrete Fourier Transform matrix with the normalization that satisfies the convolution theorem; ◦ denotes the entry-wise (Hadamard) product of vectors of the same size.

Lemma 43.
For every integer d, q , let M : R d q → R m be the TensorSketch of degree q ≤ d , see (47) . For the all ones vector x = { } d , Var h k M x ⊗ q k i ≥ (cid:18) q m − (cid:19) k x ⊗ q k . Proof.
Note that since F is normalized such that it satisfies the convolution theorem, F − is indeeda unitary matrix times 1 / √ m , k M x ⊗ q k = m k ( F C x ) ◦ ( F C x ) ◦ · · · ( F C q x ) k . Consider the firstentry of the vector ( F C x ) ◦ ( F C x ) ◦ · · · ( F C q x ). Because the first row of F is all ones { } m ,the first element of the mentioned vector for the choice of x = { } d is Q qi =1 (cid:16)P j ∈ [ d ] σ i ( j ) (cid:17) = Q qi =1 (cid:16)P j ∈ [ d ] σ i ( j ) (cid:17) , where σ i : [ d ] → {− , +1 } are fully independent random hash functions usedby the CountSketch C i for all i ∈ [ q ]. Let us denote by V the following positive random variable, V = q Y i =1 X j ∈ [ d ] σ i ( j ) . Note that k M x ⊗ q k ≥ Vm , hence E (cid:2) k M x ⊗ q k (cid:3) ≥ E [ V ] m . Also note that E (cid:2) V (cid:3) = Q qi =1 E (cid:20)(cid:16)P j ∈ [ d ] σ i ( j ) (cid:17) (cid:21) because σ i ’s are independent. We can write E X j ∈ [ d ] σ i ( j ) = 3 d − d = 3(1 − d ) k x k , hence if d ≥ q , E h V i ≥ (1 / · q · k x ⊗ q k , Therefore E (cid:2) k M x ⊗ q k (cid:3) ≥ E [ V ] m ≥ q m k x ⊗ q k . It is also true that E (cid:2) k M x ⊗ q k (cid:3) = k x ⊗ q k [ANW14].58 emma 44. For every integer d, q every ε > , every < δ ≤ · q , let M : R d q → R m be the TensorSketch of degree q , see (47) . If m < q/ then for the all ones vector x = { } d we have, Pr h |k M x ⊗ q k − k x ⊗ q k | > / · k x ⊗ q k i > δ. Proof.
Note that since F is normalized such that it satisfies the convolution theorem, F − is indeeda unitary matrix times 1 / √ m , k M x ⊗ q k = m k ( F C x ) ◦ ( F C x ) ◦ · · · ( F C q x ) k . Consider the firstentry of the vector ( F C x ) ◦ ( F C x ) ◦ · · · ( F C q x ). Because the first row of F is all ones { } m ,the first element of the mentioned vector for the choice of x = { } d is Q qi =1 (cid:16)P j ∈ [ d ] σ i ( j ) (cid:17) = Q qi =1 (cid:16)P j ∈ [ d ] σ i ( j ) (cid:17) , where σ i : [ d ] → {− , +1 } are fully independent random hash functions usedby the CountSketch C i for all i ∈ [ q ]. Let us denote by V the following positive random variable, V = q Y i =1 X j ∈ [ d ] σ i ( j ) . Note that k M x ⊗ q k ≥ Vm . Note that E (cid:2) V t (cid:3) = Q qi =1 E (cid:20)(cid:16)P j ∈ [ d ] σ i ( j ) (cid:17) t (cid:21) for every t because σ i ’s areindependent. Note that for t = 2 we have, E X j ∈ [ d ] σ i ( j ) = 3 d − d ≥ − d ) k x k , hence if d ≥ q , E h V i ≥ (3 q / · k x ⊗ q k . Now consider t = 4. By Khintchine’s inequality, Lemma 17, we have, E X j ∈ [ d ] σ i ( j ) ≤ · k x k , hence, E h V i ≤ q · k x ⊗ q k . Therefore by Paley Zygmund we have the following,Pr " k M x ⊗ q k ≥ q m · k x ⊗ q k ≥ Pr h V ≥ q / · k x ⊗ q k i = Pr h V ≥ q / · k x ⊗ q k i ≥ Pr h V ≥ / · E h V ii ≥ / · E (cid:2) V (cid:3) E [ V ] ≥ q · q > · q ≥ δ.δ.
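For reference, the TensorSketch of (47) can be applied without ever forming x^{⊗q}, using CountSketch and the FFT. The snippet below uses fully independent hash functions for simplicity (the standard construction needs only limited independence), so it is an illustration of the definition rather than an optimized implementation.

```python
import numpy as np

def count_sketch(x, h, sgn, m):
    """CountSketch: coordinate j of x is added to bucket h[j] with sign sgn[j]."""
    out = np.zeros(m)
    np.add.at(out, h, sgn * x)
    return out

def tensor_sketch(x, q, m, rng):
    """Degree-q TensorSketch of x^{tensor q} following (47), via the FFT:
    M(x^{tensor q}) = F^{-1}((F C1 x) o ... o (F Cq x)), so x^{tensor q} is never formed."""
    d = len(x)
    prod = np.ones(m, dtype=complex)
    for _ in range(q):
        h = rng.integers(0, m, size=d)           # bucket hash of one CountSketch
        sgn = rng.choice([-1.0, 1.0], size=d)    # sign hash of one CountSketch
        prod *= np.fft.fft(count_sketch(x, h, sgn, m))
    return np.real(np.fft.ifft(prod))

data_rng = np.random.default_rng(0)
d, q, m = 16, 3, 4096
x = data_rng.standard_normal(d); x /= np.linalg.norm(x)
y = x + 0.3 * data_rng.standard_normal(d); y /= np.linalg.norm(y)
zx = tensor_sketch(x, q, m, np.random.default_rng(4))   # same seed: both sketches share
zy = tensor_sketch(y, q, m, np.random.default_rng(4))   # the same hash functions
print(np.dot(zx, zy), np.dot(x, y) ** q)                # approximately equal
```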