Sublinear Time Low-Rank Approximation of Positive Semidefinite Matrices
Cameron Musco (MIT), [email protected]
David P. Woodruff (Carnegie Mellon University), [email protected]
Abstract
We show how to compute a relative-error low-rank approximation to any positive semidefinite (PSD) matrix in sublinear time, i.e., for any $n \times n$ PSD matrix $A$, in $\tilde O(n \cdot \mathrm{poly}(k/\epsilon))$ time we output a rank-$k$ matrix $B$, in factored form, for which $\|A - B\|_F^2 \le (1+\epsilon)\|A - A_k\|_F^2$, where $A_k$ is the best rank-$k$ approximation to $A$. When $k$ and $1/\epsilon$ are not too large compared to the sparsity of $A$, our algorithm does not need to read all entries of the matrix. Hence, we significantly improve upon previous $\mathrm{nnz}(A)$ time algorithms based on oblivious subspace embeddings, and bypass an $\mathrm{nnz}(A)$ time lower bound for general matrices (where $\mathrm{nnz}(A)$ denotes the number of non-zero entries in the matrix). We prove time lower bounds for low-rank approximation of PSD matrices, showing that our algorithm is close to optimal. Finally, we extend our techniques to give sublinear time algorithms for low-rank approximation of $A$ in the (often stronger) spectral norm metric $\|A - B\|_2$ and for ridge regression on PSD matrices.

1 Introduction
A fundamental task in numerical linear algebra is to compute a low-rank approximation of a matrix. Such an approximation can reveal underlying low-dimensional structure, can provide a compact way of storing a matrix in factored form, and can be quickly applied to a vector. Countless applications include clustering [DFK+04, FSS13, LBKW14, CEM+15], among many others. In the most common formulation of the problem, given an $n \times n$ input matrix $A$ and an accuracy parameter $\epsilon > 0$, the goal is to output a rank-$k$ matrix $B$ for which:
\[ \|A - B\|_F^2 \le (1+\epsilon)\|A - A_k\|_F^2, \qquad (1) \]
where for a matrix $C$, $\|C\|_F^2 = \sum_{i,j} C_{i,j}^2$ is its squared Frobenius norm, and $A_k = \operatorname{argmin}_{\text{rank-}k\ B} \|A - B\|_F$. $A_k$ can be computed exactly using the singular value decomposition, but this takes $O(n^3)$ time in practice and $n^{\omega}$ time in theory, where $\omega \approx$ 2.
373 is the exponent of matrix multiplication.

In seminal work, Frieze, Kannan, and Vempala [FKV04] and Achlioptas and McSherry [AM07] show that using randomization and approximation, much faster runtimes are possible. Specifically, [FKV04] gives an algorithm that, assuming access to the row norms of $A$, outputs a rank-$k$ $B$, in factored form, such that with good probability, $\|A - B\|_F^2 \le \|A - A_k\|_F^2 + \epsilon\|A\|_F^2$. The algorithm runs in just $n \cdot \mathrm{poly}(k/\epsilon)$ time. However, $\mathrm{nnz}(A)$ additional time is required to compute the row norms, where $\mathrm{nnz}(A)$ denotes the number of non-zero entries of $A$. Further, the guarantee achieved can be significantly weaker than (1), since the error is of the form $\epsilon\|A\|_F^2$ rather than $\epsilon\|A - A_k\|_F^2$. Note that $\|A - A_k\|_F^2 \ll \|A\|_F^2$ precisely when $A$ is well-approximated by a rank-$k$ matrix. Related additive error algorithms with additional assumptions were given for tensors in [SWZ16].

Sarlós [Sar06] showed how to achieve (1) with constant probability in $\tilde O(\mathrm{nnz}(A) \cdot k/\epsilon) + n \cdot \mathrm{poly}(k/\epsilon)$ time. This was improved by Clarkson and Woodruff [CW13], who achieved $O(\mathrm{nnz}(A)) + n \cdot \mathrm{poly}(k/\epsilon)$ time. See also work by Bourgain, Dirksen, and Nelson [BDN15], Cohen [Coh16], Meng and Mahoney [MM13], and Nelson and Nguyen [NN13], which further improved the degree in the $\mathrm{poly}(k/\epsilon)$ term. For a survey, see [Woo14].

In the special case that $A$ is rank-$k$, and so $\|A - A_k\|_F^2 = 0$, (1) is equivalent to the well studied low-rank matrix completion problem [CR09]. Much attention has focused on completing incoherent low-rank matrices, whose singular directions are represented uniformly throughout the rows and columns and hence can be identified via uniform sampling, without fully accessing the matrix. Under incoherence (and often condition number) assumptions, a number of methods are able to complete a rank-$k$ matrix in $\tilde O(n \cdot \mathrm{poly}(k))$ time [JNS13, Har14].

For general matrices, without incoherence, it is not hard to see that $\Omega(\mathrm{nnz}(A))$ is a time lower bound: if one does not read a constant fraction of the entries of $A$, with constant probability one can miss an entry much larger than all others, which needs to be included in the low-rank approximation.

An important class of matrices for which low-rank approximation is often applied is the set of positive semidefinite (PSD) matrices. These are real symmetric matrices with all non-negative eigenvalues. They arise for example as covariance matrices, graph Laplacians, Gram matrices (in particular, kernel matrices), and random dot product models [YS07]. In multidimensional scaling, low-rank approximation of PSD matrices in the Frobenius norm error metric (1) corresponds to the standard 'strain minimization' problem [CC00]. Completion of low-rank, or nearly low-rank (i.e., when $\|A - A_k\|_F \approx 0$), PSD matrices has a number of applications, including global positioning using local distances [SY05, YH38].

Due to its importance, a vast literature studies low-rank approximation of PSD matrices [DM05, ZTK08, KMT09, BW09, LKL10, GM13, WZ13, DLWZ14, WLZ16, TYUC16, LJS16, MM16, CW17]. However, known algorithms either run in at least $\mathrm{nnz}(A)$ time, do not achieve the relative-error guarantee of (1), or require strong incoherence assumptions (see Table 1).

At the same time, the simple $\Omega(\mathrm{nnz}(A))$ time lower bound for general matrices does not hold in the PSD case. Positive semidefiniteness ensures that for all $i, j$, $|A_{i,j}| \le \max(A_{i,i}, A_{j,j})$. So 'hiding' a large entry in $A$ requires creating a corresponding large diagonal entry. By reading the $n$ diagonal elements, an algorithm can avoid being tricked by this approach. While far from an algorithm, this argument raises the possibility that improved runtimes could be possible for PSD matrices.

1.1 Our Results

We give the first sublinear time relative-error low-rank approximation algorithm for PSD matrices. Our algorithm reads just $nk \cdot \mathrm{poly}(\log n/\epsilon)$ entries of $A$ and runs in $nk^{\omega-1} \cdot \mathrm{poly}(\log n/\epsilon)$ time (Theorem 9). With probability 99/
100 it outputs a matrix B in factored form which satisfies (1).We critically exploit the intuition that large entries cannot ‘hide’ in PSD matrices, but surprisinglyrequire no additional assumptions on A , such as incoherence or bounded condition number.We complement our algorithm with an Ω( nk/ǫ ) time lower bound. The lower bound is information-theoretic, showing that any algorithm which reads fewer than this number of entries in the inputPSD matrix cannot achieve the relative-error guarantee of (1) with constant probability. As ouralgorithm only reads nk · poly(log n/ǫ ) entries of A , this is nearly optimal for constant ǫ . We notethat the actual time complexity of our algorithm is slower by a factor of k ω − .Finally, we show that our techniques can be extended to compute B satisfying the spectralnorm guarantee: k A − B k ≤ (1 + ǫ ) k A − A k k + ǫk k A − A k k F using just nk · poly(log n/ǫ ) accessesto A and nk ω · poly(log n/ǫ ) time (Theorem 25). This guarantee is often stronger than (1) when k A − A k k F is large, and is important in many applications. For example, we use this result to solvethe ridge regression problem min x ∈ R n k Ax − y k + λ k x k up to (1 + ǫ ) relative error in ˜ O (cid:16) ns ωλ ǫ ω (cid:17) time, where s λ = tr(( A + λI ) − A ) is the statistical dimension of the problem (see Theorem 27).Typically s λ ≪ n , so our runtime is sublinear and improves significantly on existing input-sparsitytime results [ACW16]. For a summary of our results and comparison to prior work see, Table 1. The starting point for our approach is the fundamental fact that any matrix A contains a subsetof O ( k/ǫ ) columns, call them C , that span a relative-error rank- k approximation to A [DRVW06,DV06, DMM06]. Computing the best low-rank approximation to A using an SVD requires accessto all Θ( n ) dot products between the columns of the matrix. However, given C , just n · O ( k/ǫ ) dotproducts are needed – to project the remaining columns of the matrix to the span of the subset.Additionally, a subset of size poly( k/ǫ ) can be identified using an intuitive approach knownas adaptive sampling [DV06]: columns are iteratively added to the subset, with each new columnbeing sampled with probability proportional to its norm outside the column span of the currentsubset. Formally, column a i is selected with probability k a i − P C a i k k A − P C A k F where P C is the projection ontothe current subset C . Computing these sampling probabilities requires knowing the norm of each Many of these algorithms satisfy the additional constraint that the low-rank approximation B is PSD. This isalso now known to be possible in O (nnz( A )) time using sketching-based algorithms for general matrices [CW17]. Note that this bound is stated incorrectly as k A − B k F ≤ k A − A k k F + ǫ P ni =1 ( A ii ) in [DM05]. ource Runtime Approximation Bound [DM05] nk ω − · poly(1 /ǫ ) k A − B k F ≤ k A − A k k F + ǫ k A k ∗ [KMT09] nk ω − · poly(1 /ǫ ) k A − B k F ≤ k A − A k k F + ǫn · max i A ii [GM13] ˜ O ( n ) + nk ω − poly(log n/ǫ ) k A − B k F ≤ k A − A k k F + ǫ k A − A k k ∗ [AGR16] + [BW09] n log n · poly( k ) k A − B k F ≤ ( k + 1) k A − A k k ∗ [MM16] nk w − · poly(log k/ǫ ) k A / − B k F ≤ (1 + ǫ ) k A / − A / k k F [CW17] O (nnz( A )) + n poly( k/ǫ ) k A − B k F ≤ (1 + ǫ ) k A − A k k F Our Results
Theorem 9 nk ω − · poly(log n/ǫ ) k A − B k F ≤ (1 + ǫ ) k A − A k k F Theorem 25 nk ω · poly(log n/ǫ ) k A − B k ≤ (1 + ǫ ) k A − A k k + ǫk k A − A k k F Table 1: Comparison of our results to prior work on low-rank approximation of PSD matrices. k M k ∗ = P ni =1 σ i ( M ) denotes the nuclear norm of matrix M . The cited results all output B (infactored form), which is itself PSD. In Theorem 12 we show how to modify our algorithm to satisfythis condition, and run in nk ω · poly(log n/ǫ ) time. The table shows results that do not requireincoherence assumptions on A . For general PSD matrices, all known incoherence based results (seee.g., [Git11, GM13]) degrade to Θ( n ω ) runtime. Additionally, as discussed, any general low-rankapproximation algorithm can be applied to PSD matrices, with state-of-the-art approaches runningin input-sparsity time [CW13, MM13, NN13]. [CW17] extends these results to the case where theoutput B is restricted to be PSD. [TYUC16] does the same but outputs B with rank 2 k . For amore in depth discussion of bounds obtained in prior work, see Section 1.4. a i along with its dot product with each column currently in C . So, overall this approach gives arelative-error low-rank approximation using just n · poly( k/ǫ ) dot products between columns of A .The above observation is surprising – not only does every matrix contain a small column subsetwitnessing a near optimal low-rank approximation, but also, such a witness can be found usingsignificantly less information about the column span of the matrix than is required by a full SVD.This fact is not immediately algorithmically useful, as computing the required dot products takesnnz( A ) · poly( k/ǫ ) time. However, given PSD A , we can write the eigendecomposition A = U Λ U T where Λ is a non-negative diagonal matrix of eigenvalues, and let A / = U Λ / U T be the matrixsquare root of A . Since A / A / = A , the entry A i,j is just the dot product between the i th and j th columns of A / . So with A in hand, the dot products have been ‘precomputed’ and the aboveapproach yields a low-rank approximation algorithm for A / running in just n · poly( k/ǫ ) time.Note that, aligning with our initial intuition that reading the diagonal entries of A is necessary toavoid the nnz( A ) time lower bound for general matrices, the diagonal entries of A are the columnnorms of A / , and hence their values are critical to computing the adaptive sampling probabilities.By the above argument, given PSD A , we can compute in n · poly( k/ǫ ) time a rank- k orthogonalprojection matrix P ∈ R n × n (in factored form) for which k A / − A / P k F ≤ (1 + ǫ ) k A / − A / k k F .This approach can be implemented using adaptive sampling [DV06], sublinear time volume sampling[AGR16], or as shown in [MM16], recursive ridge leverage score sampling . The ridge leverage scoresare a natural interpolation between adaptive sampling and the widely studied leverage scores,which, as we will see, have a number of additional algorithmically useful properties. As discussedin [MM16], the guarantee for A / is useful for a number of kernel learning methods such as kernelridge regression. However, it is very different from our final goal. In fact, one can show thatprojecting to P can yield an arbitrarily bad low-rank approximation to A itself (see Appendix A).We note that, since P is constructed via column selection methods, it is possible to efficiently3ompute a factorization of A / P A / (see Appendix A). 
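To make this concrete, here is a minimal sketch (ours, purely illustrative, and not the paper's algorithm) of the adaptive sampling primitive discussed above, driven entirely by queried entries of $A$: since $A_{i,j}$ is the dot product of the $i$th and $j$th columns of $A^{1/2}$, the squared residual of column $i$ of $A^{1/2}$ after projecting onto the columns indexed by a set $S$ equals $A_{i,i} - A_{i,S}(A_{S,S})^{+}A_{S,i}$. The function name and parameters below are our own.

```python
import numpy as np

def select_columns(A, num_cols, rng=None):
    """Adaptive column selection for A^(1/2), touching only entries of A.

    A is assumed PSD. Since A[i, j] is the inner product of columns i and j of
    A^(1/2), residual column norms of A^(1/2) can be computed from queried
    entries of A alone: the diagonal plus the rows/columns indexed so far.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = A.shape[0]
    selected = []
    residual = A.diagonal().astype(float)          # squared column norms of A^(1/2)
    for _ in range(num_cols):
        probs = np.maximum(residual, 0.0)
        total = probs.sum()
        if total <= 1e-12:
            break
        i = rng.choice(n, p=probs / total)          # adaptive sampling step
        selected.append(i)
        S = np.array(selected)
        # Residual of column j of A^(1/2) after projecting onto the span of the
        # selected columns: A[j, j] - A[j, S] @ pinv(A[S, S]) @ A[S, j].
        pinv_SS = np.linalg.pinv(A[np.ix_(S, S)])
        cross = A[:, S]
        residual = A.diagonal() - np.einsum('ij,jk,ik->i', cross, pinv_SS, cross)
    return selected
```

With a selected subset in hand, one can form the corresponding projection $P$ and, as noted above, a factorization of $A^{1/2} P A^{1/2}$.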
Further, this matrix gives a near optimallow-rank approximation of A if we use error parameter ǫ ′ = ǫ/ √ n . This approach gives a firstsublinear time algorithm, but it is significantly suboptimal. Namely, it requires reading ˜ O ( nk/ǫ ′ ) =˜ O ( n / k/ǫ ) entries of A and takes n . · poly( k/ǫ ) time using fast matrix multiplication.To improve the dependence on n , we need a better understanding of how to perform ridgeleverage score sampling on A itself. We start by showing that the ridge leverage scores of A / arewithin a factor of O ( p n/k ) of the ridge leverage scores of A . By this bound, if we over-samplecolumns of A by a factor of O ( p n/k ) using the ridge leverage scores of A / (computable via[MM16]), obtaining a sample of ˜ O ( p n/k · k/ǫ ) columns, the sample will be a so-called projection-cost preserving sketch (PCP) of A . The notion of a PCP was introduced in [CEM + C is defined to be an ( ǫ, k )-column PCP of A if for all rank- k projection matrices P :(1 − ǫ ) k A − P A k F ≤ k C − P C k F ≤ (1 + ǫ ) k A − P A k F . (2)One important property of a PCP is that good low-rank approximations to C translate to goodlow-rank approximations of A . More precisely, if U is an n × k matrix with orthonormal columnsfor which k C − U U T C k F ≤ (1 + ǫ ) k C − C k k F , then k A − U U T A k F ≤ (1+ ǫ ) (1 − ǫ ) k A − A k k F .Letting C be the n × ˜ O ( √ nk/ǫ ) submatrix which we sample via ridge leverage scores, wecan apply an nnz( C ) time algorithm to compute a subspace U ∈ R n × k whose columns span anear-optimal low-rank approximation of C , and hence of A by the PCP property. Using standardsampling techniques, we can approximately project the columns of A to U , producing our finalsolution. This gives time complexity n / · poly( k/ǫ ), improving slightly upon our first approach.To reduce the time to linear in n , we must further reduce the size of C by sampling a smallsubset of its rows, which themselves form a PCP. To find these rows, we cannot afford to useoblivious sketching techniques, which would take at least nnz( C ) time, nor can we use our previousmethod for providing O ( p n/k ) overestimates to the ridge leverage scores, since C is no longerPSD. In fact, the row ridge leverage scores of C can be arbitrarily large compared to those of A / .The key idea to getting around this issue is that, since C is a column PCP of A , projectingits columns onto A ’s top eigenvectors gives a near optimal low-rank approximation. Further, wecan show that the ridge leverage scores of A / (appropriately scaled) upper bound the standardleverage scores of this low-rank approximation. Sampling by these leverage scores is not enoughto give a guarantee like (2) – they ignore the entire component of C not falling in the span of A ’stop eigenvectors and so may significantly distort projection costs over the matrix. Further, at thispoint, we have no idea how to estimate the row norms of C , or even its Frobenius norm, with n poly( k/ǫ ) samples, which are necessary to implement any kind of adaptive sampling approach.Fortunately, using that sampling at least preserves the matrix in expectation, along with a fewother properties of the ridge leverage scores of A / , we show that, with good probability, sampling˜ O ( √ nk/ poly( ǫ )) rows of C by these scores yields R satisfying for all rank- k projection matrices P :(1 − ǫ ) k C − CP k F ≤ k R − RP k F + ∆ ≤ (1 + ǫ ) k C − CP k F where ∆ is a fixed value, independent of P , with | ∆ | ≤ c k C − C k k F for some constant c . 
Since thesame ∆ distortion applies to all P , and since it is at most a constant times the true optimum, anear optimal low-rank approximation for R still translates to a near optimal approximation for C .At this point R is a small matrix, and we can run any O (nnz( R )) time algorithm to find a goodlow-rank factorization EF T to it, where F T is k × ˜ O ( √ nk/ poly( ǫ )). Since R is a row PCP for C ,by regressing the rows of C to the span of F , we can obtain a near optimal low-rank approximationto C . We can solve this multi-response regression approximately in sublinear time via standard4ampling techniques. Approximately regressing A to the span of this approximation using similartechniques yields our final result. The total runtime is dominated by the input-sparsity low-rankapproximation of R requiring O (nnz( R )) = ˜ O ( nk/ poly( ǫ )) time.To improve ǫ dependencies in our final runtime, achieving sample complexity ˜ O (cid:0) nkǫ . (cid:1) , we modifythis approach somewhat, showing that R actually satisfies a stronger spectral norm PCP propertyfor C . This property lets us find a low-rank span Z with k C − CZZ T k ≤ ǫk k A − A k k F , fromwhich, through a series of approximate regression steps, we can extract a low-rank approximationto A satisfying (1). This stronger spectral guarantee also lies at the core of our extensions to nearoptimal spectral norm low-rank approximation (Theorem 25), ridge regression (Theorem 27), andlow-rank approximation where B is restricted to be PSD (Theorem 12). Observe that in computing a low-rank approximation of A , we read just ˜ O ( n · poly( k/ǫ )) entriesof the matrix, which is, up to lower order terms, the same number of entries (corresponding tocolumn dot products of A / ) that we accessed to compute a low-rank approximation of A / in ourdescription above. However, these sets of entries are very different. While low-rank approximationof A / looks at an n × poly( k/ǫ ) sized submatrix of A together with the diagonal entries, ouralgorithm considers a carefully chosen √ nk poly(log n/ǫ ) × √ nk poly(log n/ǫ ) submatrix togetherwith the diagonal entries, which gives significantly more information about the spectrum of A .As a simple example, consider A with top eigenvalue λ = √ n , and λ i = 1 for i = 2 , ...n . k A / k F = P ni =1 λ i = √ n + n − k A / − A / k F = P ni =2 λ i = n −
1. So, $A^{1/2}$ has no good rank-1 approximation. Unless we set $\epsilon = O(1/\sqrt{n})$, a low-rank approximation algorithm for $A^{1/2}$ can learn nothing about $\lambda_1$ and still be near optimal. In contrast, $\|A\|_F^2 = \sum_{i=1}^n \lambda_i^2 = 2n - 1$ and $\|A - A_1\|_F^2 = \sum_{i=2}^n \lambda_i^2 = n - 1$. So, even with $\epsilon$ = 1/
2, any rank-1 approximation algorithm for A must identify the presence of λ and project this direction off the matrix. In this sense, ouralgorithm is able to obtain a much more accurate picture of A ’s spectrum.With incoherence assumptions, prior work on PSD low-rank approximation [GM13] obtainsthe bound k A − B k ∗ ≤ (1 + ǫ ) k A − A k k ∗ in sublinear time, where k M k ∗ = P ni =1 σ i ( M ) is thenuclear norm of M . Recent work ([AGR16] in combination with [BW09]) gives k A − B k F ≤ ( k + 1) k A − A k k ∗ without the incoherence assumption. These nuclear norm bounds are closelyrelated to approximation bounds for A / and it is not hard to see that neither require λ to bedetected in the example above, and so in this sense are weaker than our Frobenius norm bound.A natural question if even stronger bounds are possible: e.g., can we compute B with k A − B k ≤ (1 + ǫ ) k A − A k k in sublinear time? We partially answer this question in Theorem 25. In˜ O ( nk ω poly(log n/ǫ )) time, we can find B satisfying k A − B k ≤ (1 + ǫ ) k A − A k k + ǫk k A − A k k F .Significantly improving the above bound seems difficult: it is easy to see that a relative errorspectral norm guarantee requires Ω( n ) time. Consider A which is the identity except with A i,j = A j,i = 1 for some uniform random pair ( i, j ). Finding ( i, j ) requires Ω( n ) queries to A . However,it is necessary to achieve a relative error spectral norm guarantee with ǫ < k A k = 4 while k A − A k = 1 where A is all zeros with ones at its ( i, i ), ( j, j ), ( i, j ), and ( j, i ) entries.A similar argument shows that relative error low-rank approximation in higher Schatten- p norms, i.e., k A − B k pp for p > n (where k M k pp = P ni =1 σ pi ( M ).)We can set A to be the identity but with an all ones block on a uniform random subset of n /p indices. This block has associated eigenvalue λ = n /p and so, since all other ( n − n /p ) eigenvaluesof A are 1, k A k pp = Θ( n ), and the block must be recovered to give a relative error approximationto k A − A k pp . However, as the block is placed uniformly at random and contains just n /p entries,finding even a single entry requires n − /p queries to A – superlinear for p > .5 Open Questions While it is apparent that obtaining stronger error guarantees than (1) may require increasedruntime, understanding exactly what can be achieved in sublinear time is an interesting direc-tion for future work. We also note that it is still unknown how to compute a number of basicproperties of PSD matrices in sublinear time. For example, while we can output B satisfying k A − B k F ≤ (1 + ǫ ) k A − A k k F , surprisingly it is not clear how to actually estimate the value k A − A k k F to within a (1 ± ǫ ) factor. This can be achieved in n / poly( k/ǫ ) time using our PCPtechniques. However, obtaining linear runtime in n is open. Estimating k A − A k k F seems stronglyconnected to estimating other important quantities such as the statistical dimension of A for ridgeregression (see Theorem 27) which we do not know how to do in o ( n / ) time.Finally, an open question is if these techniques can be generalized to a broader class of matrices.As discussed, in the matrix completion literature, much attention has focused on incoherent low-rank matrices [CR09] which can be approximated with uniform sampling. PSD matrices are notincoherent in general, which is highlighted by the fact that our sampling schemes are far fromuniform and very adaptive to previously seen matrix entries. 
However, perhaps there is some other parameter (maybe relating to a measure of diagonal dominance) which characterizes when low-rank approximation can be performed with just a small number of adaptive accesses to $A$.

The remainder of the paper is organized as follows.

Section 2: Ridge Leverage Score Sampling. We show that the ridge leverage scores of $A$ are within an $O(\sqrt{n/k})$ factor of those of $A^{1/2}$, letting us use the fast ridge leverage score sampling algorithm of [MM16] to sample $\tilde O(\sqrt{nk}/\epsilon)$ columns of $A$ that form a column PCP of the matrix.

Section 3: Row Sampling. We discuss how to further accelerate our algorithm by obtaining a row PCP for our column sample, allowing us to achieve runtime linear in $n$.

Section 4: Full Algorithm. We use the primitives in the previous sections along with approximate regression techniques to give our full sublinear time low-rank approximation algorithm.

Section 5: Lower Bounds. We show that our algorithm is nearly optimal: any relative error low-rank approximation algorithm must read $\Omega(nk/\epsilon)$ entries of $A$.

Section 6: Spectral Norm Bounds. We modify the algorithm of Section 4 to give a tighter approximation in the spectral norm and discuss applications to sublinear time ridge regression.
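As a rough orientation for what follows, the sketch below (ours; the sampling sizes, probabilities, and the exact-eigendecomposition shortcut are for illustration only and carry none of the paper's guarantees) mirrors the shape of the pipeline developed in Sections 2-4: sample columns of $A$ by ridge leverage scores of $A^{1/2}$, sample rows of that column sample, solve a small low-rank problem on the fully-read block, and project back.

```python
import numpy as np

def ridge_scores_of_sqrt(A, k):
    """Exact rank-k ridge leverage scores of A^(1/2), via a full eigendecomposition.
    (Illustration only: the point of the paper is to approximate these scores
    from roughly nk queried entries of A instead.)"""
    lam, U = np.linalg.eigh(A)                      # eigenvalues, ascending
    lam = np.maximum(lam, 0.0)
    ridge = lam[:-k].sum() / k                      # ||A^(1/2) - (A^(1/2))_k||_F^2 / k
    scale = np.divide(lam, lam + ridge, out=np.zeros_like(lam), where=(lam + ridge) > 0)
    return (U ** 2) @ scale                         # i-th entry: x_i^T (A + ridge*I)^+ x_i

def weighted_sample(p, t, rng):
    """t indices sampled with replacement, plus 1/sqrt(t*p_i) rescaling weights."""
    idx = rng.choice(len(p), size=t, p=p)
    return idx, 1.0 / np.sqrt(t * p[idx])

def toy_psd_low_rank(A, k, t_cols, t_rows, seed=0):
    rng = np.random.default_rng(seed)
    scores = ridge_scores_of_sqrt(A, k)
    p = scores / scores.sum()
    cols, wc = weighted_sample(p, t_cols, rng)      # column sample of A
    rows, wr = weighted_sample(p, t_rows, rng)      # row sample of that column sample
    C = A[:, cols] * wc                             # n x t_cols
    R = C[rows, :] * wr[:, None]                    # t_rows x t_cols: small, fully read
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    Z = Vt[:k].T                                    # rank-k row space of the small matrix
    Q, _ = np.linalg.qr(C @ Z)                      # span of the column sample restricted to Z
    # The paper performs this final projection approximately, by sampling rows of Q;
    # here we simply project A exactly for readability.
    return Q, Q.T @ A                               # A is approximated by Q @ (Q.T @ A)
```

The actual algorithm (Algorithm 1 in Section 4) replaces every dense step above with a sampled, sublinear-time counterpart.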
2 Ridge Leverage Score Sampling

Our main algorithmic tool will be ridge leverage score sampling, which is used to identify a small subset of columns of $A$ that span a good low-rank approximation of the matrix. Following the definition of [CMM17], the rank-$k$ ridge leverage scores of any matrix $A$ are given by:

Definition 1 (Ridge Leverage Scores). For any $A \in \mathbb{R}^{n \times d}$, letting $a_i \in \mathbb{R}^n$ be the $i$th column of $A$, the $i$th rank-$k$ column ridge leverage score of $A$ is:
\[ \tau_i^k(A) = a_i^T \left( AA^T + \frac{\|A - A_k\|_F^2}{k}\, I \right)^{+} a_i. \]

Above, $I$ is the appropriately sized identity matrix and $M^+$ denotes the matrix pseudoinverse, equivalent to the inverse unless $\|A - A_k\|_F^2 = 0$ and $A$ is singular. Analogous scores can be defined for the rows of $A$ by simply transposing the matrix. It is not hard to see that $0 \le \tau_i^k(A) \le 1$.

Since we use these scores as sampling probabilities, it is critical that the sum of the scores, and hence the size of the subsets we sample, is not too large. We have the following (see Appendix B):

Lemma 2 (Sum of Ridge Leverage Scores). For any $A \in \mathbb{R}^{n \times d}$, $\sum_{i=1}^d \tau_i^k(A) \le 2k$.

Intuitively, the ridge leverage scores are similar to the standard leverage scores of $A$, which are given by $a_i^T (AA^T)^+ a_i$. By writing $A = U \Sigma V^T$ in its SVD, one sees that the standard leverage scores are just the squared column norms of $V^T$. Sampling columns by ridge leverage scores yields a spectral approximation to the matrix. The addition of the weighted identity (or 'ridge') $\frac{\|A - A_k\|_F^2}{k} I$ 'dampens' contributions from smaller singular directions of $A$, decreasing the sum of the scores and allowing us to sample fewer columns. At the same time, it introduces error dependent on the size of the tail $\|A - A_k\|_F^2$, ultimately giving an approximation from which it is possible to output a near optimal low-rank approximation to the original matrix. Specifically, sampling by ridge leverage scores yields a projection-cost preserving sketch (PCP) of $A$:

Lemma 3 (Theorem 6 of [CMM17]). For any $A \in \mathbb{R}^{n \times d}$, for $i \in \{1, \ldots, d\}$, let $\tilde\tau_i^k \ge \tau_i^k(A)$ be an overestimate for the $i$th rank-$k$ ridge leverage score. Let $p_i = \frac{\tilde\tau_i^k}{\sum_i \tilde\tau_i^k}$ and $t = \frac{c \log(k/\delta)}{\epsilon^2} \sum_i \tilde\tau_i^k$ for any $\epsilon < 1$ and sufficiently large constant $c$. Construct $C$ by sampling $t$ columns of $A$, each set to $\frac{1}{\sqrt{t p_i}} a_i$ with probability $p_i$. With probability $1 - \delta$, for any rank-$k$ orthogonal projection $P \in \mathbb{R}^{n \times n}$,
\[ (1-\epsilon)\,\|A - PA\|_F^2 \le \|C - PC\|_F^2 \le (1+\epsilon)\,\|A - PA\|_F^2. \]
We refer to $C$ as an $(\epsilon, k)$-column PCP of $A$.

Since the 'cost' $\|A - PA\|_F^2$ of any rank-$k$ projection of $A$ is preserved by $C$, any near-optimal low-rank approximation of $C$ yields a near optimal low-rank approximation of $A$. Further, $C$ is much smaller than $A$, so such a low-rank approximation can be computed quickly. The difficulty is in computing the approximate leverage scores. To do this, we use the main result from [MM16]:

Lemma 4 (Corollary of Theorem 20 of [MM16]). There is an algorithm that given any PSD matrix $A \in \mathbb{R}^{n \times n}$, runs in $O(n (k \log(k/\delta))^{\omega-1})$ time, accesses $O(nk \log(k/\delta))$ entries of $A$, and returns for each $i \in [1, \ldots, n]$, $\tilde\tau_i^k(A^{1/2})$ such that with probability $1 - \delta$, for all $i$: $\tau_i^k(A^{1/2}) \le \tilde\tau_i^k(A^{1/2}) \le c \cdot \tau_i^k(A^{1/2})$ for an absolute constant $c$.

Proof.
Theorem 20 of [MM16] shows that by using a recursive ridge leverage score sampling algorithm, it is possible to return (with probability $1-\delta$) a sampling matrix $S \in \mathbb{R}^{n \times s}$ with $s = O(k \log(k/\delta))$ such that, letting $\lambda = \frac{\|A^{1/2} - A^{1/2}_k\|_F^2}{k}$:
\[ \tfrac{1}{2}\,(A + \lambda I) \preceq \left( A^{1/2} S S^T A^{1/2} + \lambda I \right) \preceq \tfrac{3}{2}\,(A + \lambda I), \]
where $M \preceq N$ indicates $x^T M x \le x^T N x$ for all $x$. If we set $\tilde\tau_i^k(A^{1/2}) = 2 \cdot x_i^T \left( A^{1/2} S S^T A^{1/2} + \lambda I \right)^{+} x_i$, where $x_i$ is the $i$th column of $A^{1/2}$, we have the desired bound. Of course, we cannot directly compute this value without factoring $A$ to form $A^{1/2}$. However, as shown in Lemma 6 of [MM16]:
\[ x_i^T \left( A^{1/2} S S^T A^{1/2} + \lambda I \right)^{-1} x_i = \frac{1}{\lambda} \left( A - AS\,(S^T A S + \lambda I)^{-1} S^T A \right)_{i,i}. \]
Computing $(S^T A S + \lambda I)^{-1}$ requires $O(s^2) = O((k \log(k/\delta))^2)$ accesses to $A$ and $O(s^\omega) = O((k \log(k/\delta))^\omega)$ time. Computing all $n$ diagonal entries of $AS(S^T A S + \lambda I)^{-1} S^T A$ then requires $O(nk \log(k/\delta))$ accesses to $A$ and $O(n (k \log(k/\delta))^{\omega-1})$ time. With these entries in hand we can simply subtract from the diagonal entries of $A$ and rescale to give the final leverage score approximation. Critically, this calculation always reads all diagonal entries of $A$, allowing it to identify rows containing large off-diagonal entries and skirt the $\mathrm{nnz}(A)$ time lower bound for general matrices.

Note that the stated runtime in [MM16] for outputting $S$ is $\tilde O(nk)$ accesses to $A$ (kernel evaluations in the language of [MM16]) and $\tilde O(nk^2)$ runtime. However, this runtime is improved to $\tilde O(nk^{\omega-1})$ using fast matrix multiplication.

In order to apply Lemmas 3 and 4 to low-rank approximation of $A$, we now show that the ridge leverage scores of $A^{1/2}$ coarsely approximate those of $A$:

Lemma 5 (Ridge Leverage Score Bound). For any PSD matrix $A \in \mathbb{R}^{n \times n}$: $\tau_i^k(A) \le \sqrt{\tfrac{n}{k}} \cdot \tau_i^k(A^{1/2})$.

Proof.
We write A / in its eigendecomposition A / = U Λ / U T , where Λ i,i = λ i is the i th eigen-value of A . Letting x i denote the i th column of A / we have: τ ki ( A / ) = x Ti A + k A / − A / k k F k I ! − x i = x Ti U ¯Λ U T x i where ¯Λ i,i def = λ i + k P nj = k +1 λ j . We can similarly write: τ ki ( A ) = a Ti (cid:18) A + k A − A k k F k I (cid:19) − a i = x Ti A / (cid:18) A + k A − A k k F k I (cid:19) − A / x i = x Ti U ˆΛ U T x i where ˆΛ i,i def = λ i λ i + k P nj = k +1 λ j . Showing ˆΛ (cid:22) p nk · ¯Λ is enough to give the lemma. Specifically wemust show, for all i , ˆΛ i,i ≤ p nk · ¯Λ i,i which after cross-multiplying is equivalent to: λ i + 1 k λ i n X j = k +1 λ j ≤ r nk λ i + 1 k n X j = k +1 λ j . (3)First consider the relatively large eigenvalues. Say we have k P nj = k +1 λ j ≤ p nk λ i . Then: λ i + 1 k λ i n X j = k +1 λ j ≤ (cid:16) p n/k (cid:17) λ i k P nj = k +1 λ j ≥ p nk λ i . In this case: λ i + 1 k λ i n X j = k +1 λ j ≤ λ i + 1 √ n · k / n X j = k +1 λ j ≤ λ i + 1 √ n · k / · n n X j = k +1 λ j (Norm bound: k · k ≤ n k · k ) ≤ r nk λ i + 1 k n X j = k +1 λ j which gives (3), completing the proof.Combining Lemmas 2, 3, 4, 5 we have: Corollary 6 (Fast PSD Ridge Leverage Score Sampling) . There is an algorithm that given anyPSD matrix A ∈ R n × n runs in ˜ O ( nk ω − ) time, accesses ˜ O ( nk ) entries of A , and with prob. − δ outputs a weighted sampling matrix S ∈ R n × ˜ O (cid:16) √ nkǫ (cid:17) such that AS is an ( ǫ , k ) -column PCP of A .Proof. By Lemma 4 we can compute constant factor approximations to the ridge leverage scoresof A / in time ˜ O ( nk ω − ). Applying Lemma 5, if we scale these scores up by 2 p n/k they will beoverestimates of the ridge leverage scores of A . If we set t = O (cid:16) log( k/δ ) ǫ · P ˜ τ ki (cid:17) , and generate S by sampling t columns of A with probabilities proportional to these estimated scores, by Lemma 3, AS will be an ( ǫ, k )-column PCP of A with probability 1 − δ . By Lemma 2, P ni =1 τ ki ( A / ) ≤ k .So we have t = ˜ O ( P ˜ τ ki /ǫ ) = ˜ O ( √ nk/ǫ ).Forming AS requires reading just ˜ O ( n / √ k/ǫ ) entries of A . At this point, we could employany input sparsity time algorithm to find a near optimal rank- k projection P for approximating AS in O (nnz( AS )) + n poly( k/ǫ ) = n / · poly( k/ǫ ) time. This would in turn yield a near optimallow-rank approximation of A . However, as we will see in the next section, by further sampling therows of AS , we can significantly improve this runtime. To achieve near linear dependence on n , we sample roughly √ nk rows from AS , producing an evensmaller matrix S T AS , which we can afford to fully read and from which we can form a near optimallow-rank approximation to AS and consequently to A . However, sampling AS is challenging: wecannot employ input sparsity time methods as we cannot afford to read the full matrix, and sinceit is no longer PSD, we cannot apply the same approach we used for A , approximating the ridgeleverage scores with those of A / .Rewriting Definition 1 using the SVD AS = U Σ V T (and transposing AS to give row insteadof column scores) we see that the row ridge leverage scores of AS are the diagonal entries of: AS (cid:18) S T A T AS + k AS − ( AS ) k k F k I (cid:19) + S T A T = U ¯Σ U T where ¯Σ i,i = Σ i,i Σ i,i + k AS − ( AS k k Fk . That is, the row ridge leverage scores depend only on the columnspan U of AS and its spectrum. Since AS is a column PCP of A this gives hope that the twomatrices have similar row ridge leverage scores. 
9nfortunately, this is not the case . It is possible to have rows in AS with ridge leverage scoressignificantly higher than in A . Thus, even if we knew the ridge leverage scores of A , we wouldhave to scale them up significantly to sample from AS . As an example, consider A with relativelyuniform ridge leverage scores: τ i ( A ) ≈ k/n for all i . When a column is selected to be included in AS it will be reweighted by roughly a factor of p n/k . Now, append a number of rows to A eachwith very small norm and just a containing single non-zero entry. These rows will have little effecton the ridge leverage scores if their norms are small enough. However, if the column correspondingto the nonzero in a row is selected, the row will appear in AS with p n/k times the weight thatit appears in A , and its ridge leverage score will be roughly a factor n/k times higher.Fortunately, we are still able to show that sampling the rows of AS by the rank k ′ = O ( k/ǫ )leverage scores of A / scaled up by a p n/k ′ factor yields a row PCP for this matrix. Our proofworks not with the ridge scores of AS but with the standard leverage scores of a near optimallow-rank approximation to this matrix – specifically the approximation given by projecting ontothe top eigenvectors of A . We have: Lemma 7 (Row PCP) . For any PSD A ∈ R n × n and ǫ ≤ let k ′ = ⌈ ck/ǫ ⌉ and let ˜ τ k ′ i ( A / ) ≥ τ k ′ i ( A / ) be an overestimate for the i th rank- k ′ ridge leverage score of A / . Let ˜ ℓ i = q nǫk · ˜ τ k ′ i ( A / ) , p i = ˜ ℓ i P i ˜ ℓ i , and t = c ′ log nǫ P i ˜ ℓ i . Construct weighted sampling matrices S , S ∈ R n × t each whose j th column is set to √ tp i e i with probability p i .For sufficiently large constants c, c ′ , with probability , letting ˜ A = S T AS , for any rank- k orthogonal projection P ∈ R t × t : (1 − ǫ ) k AS ( I − P ) k F ≤ k ˜ A ( I − P ) k F + ∆ ≤ (1 + ǫ ) k AS ( I − P ) k F for some fixed ∆ (independent of P ) with | ∆ | ≤ k A − A k k F . We refer to ˜ A as an ( ǫ, k ) -rowPCP of AS . Note that by Lemma 2, P i τ k ′ i ( A / ) = O ( k/ǫ ). So if ˜ τ k ′ i ( A / ) is a constant factor approxi-mation to τ k ′ i ( A / ), t = O (cid:16) √ nk log nǫ . (cid:17) . Also note that the Lemma requires both S and S to besampled using the rank k ′ ridge scores. If we sample S using a sum of the rank- k and rank- k ′ ridgescores (appropriately scaled) Lemma 7 and Lemma 3 will hold simultaneously.By applying an input sparsity time low-rank approximation algorithm to ˜ A (which has just˜ O (cid:0) nkǫ (cid:1) entries) we can find a near optimal low-rank approximation of AS , and thus for A . However,in our final algorithm, we will take a somewhat different approach. We are able to show that usingappropriate sampling probabilities, we can in fact sample ˜ A which is a projection-cost preservingsketch of AS for spectral norm error. As we will see, recovering a near optimal spectral normlow-rank approximation to AS suffices to recover a near optimal Frobenius norm approximationto A , and allows us to improve ǫ dependencies in our final runtime. Lemma 8 (Spectral Norm Row PCP) . For any PSD A ∈ R n × n , and ǫ < let k ′ = ⌈ ck/ǫ ⌉ and ˜ τ k ′ i ( A / ) ≥ τ k ′ i ( A / ) be an overestimate for the i th rank- k ′ ridge leverage score of A / . Let ˜ ℓ i = 4 ǫ p nk τ k ′ i ( A / ) , p i = ˜ ℓ i P i ˜ ℓ i , and t = c ′ log nǫ · P i ˜ ℓ i . 
Construct weighted sampling matrices S , S ∈ R n × t , each whose j th column is set to √ tp i e i with probability p i .For sufficiently large constants c, c ′ , with high probability (i.e. probability ≥ − /n d for somelarge constant d ), letting ˜ A = S T AS , for any orthogonal projection P ∈ R t × t : (1 − ǫ ) k AS ( I − P ) k − ǫk k A − A k k F ≤ k ˜ A ( I − P ) k ≤ (1 + ǫ ) k AS ( I − P ) k + ǫk k A − A k k F . e refer to ˜ A as an ( ǫ, k ) -spectral PCP of AS . Note that if ˜ τ k ′ i ( A / ) is a constant factor approximation to τ k ′ i ( A / ), t = O (cid:16) √ nk log nǫ (cid:17) . Wedefer the proofs of Lemmas 7 and 8 to Appendix B. We are finally ready to give our main algorithm for relative error low-rank approximation of PSDmatrices in ˜ O ( n poly( k/ǫ )) time, Algorithm 1. We set k = ⌈ ck/ǫ ⌉ and estimate the both the rank- k and rank- c ′ k ridge leverage scores of A / using the algorithm of [MM16] (Step 1). If c, c ′ aresufficiently large, sampling by the sum of these scores (Steps 2-3) ensures that AS is an ( ǫ, k )-column PCP for A and simultaneously, by applying Lemmas 7 and 8 that ˜ A = S T AS is a rowPCP in both spectral and Frobenius norm with rank k and error ǫ = 1 / AS In conjunction, these guarantees ensure that we can apply an input sparsity time algorithm to˜ A (Step 4) to find a rank- k Z satisfying k AS − AZZ T k = O ( k AS − ( AS ) k k + k k A − A k k F ) = O (cid:0) ǫk k A − A k k F (cid:1) , where the final bound holds since k = Θ( k/ǫ ). It is not hard to show that, dueto this strong spectral norm bound, projecting AS to Z and taking the best rank- k approximationin the span gives a near optimal Frobenius norm low-rank approximation to AS and hence A .We can still not afford to read AS in its entirety, so we employ a number of standard leveragescore sampling techniques to perform this projection approximately. In Step 5, we sample ˜ O ( k/ǫ )columns of AS using the leverage scores of Z (its row norms since it is an orthonormal matrix) toform AS S . We argue that there is a good rank- k approximation to AS lying in both the columnspan of AS S and the row span of Z T . In Step 6 we find a near optimal such approximationby further sampling ˜ O ( k/ǫ ) rows AS by the leverage scores of AS S (the row norms of V , anorthonormal basis for its span), and computing the best rank- k approximation to the sampledmatrix falling in the column span of AS S and the row span of Z T .Finally, in Step 7 we approximately project A itself to the span of this rank- k approximationby first sampling by the leverage scores of the approximation (the row norms of Q ) and projecting. Algorithm 1 PSD Low-Rank Approximation
1. Let k = ⌈ ck/ǫ ⌉ . For all i ∈ [1 , .., n ] compute ˜ τ ki ( A / ) and ˜ τ c ′ k i ( A / ) which are constantfactor approximations to the ridge leverage scores τ ki ( A / ) and τ c ′ k i ( A / ) respectively.2. Set ℓ (1) i = p nk ˜ τ ki ( A / ) + q nǫ k ˜ τ c ′ k i ( A / ) and ℓ (2) i = q nk ˜ τ c ′ k i ( A / ). Set p (1) i = ℓ (1) i P i ℓ (1) i and p (2) i = ℓ (2) i P i ℓ (2) i .3. Set t = c log nǫ P i ℓ (1) i and t = c log n P i ℓ (2) i . Sample S ∈ R n × t whose j th column is setto q tp (1) i e i with probability p (1) i . Sample S ∈ R n × t analogously using p (2) i .4. Let ˜ A = S T AS , and use an input sparsity time algorithm to compute orthonormal Z ∈ R t × k satisfying the spectral guarantee k ˜ A − ˜ AZZ T k ≤ k ˜ A − ˜ A k k + k k ˜ A − ˜ A k k F .5. Let t = c (cid:16) k log( k/ǫ ) ǫ + kǫ (cid:17) , set p (3) i = k z i k k Z k F , and sample S ∈ R t × t where the j th column isset to q t p (3) i e i with probability p (3) i . Compute V ∈ R n × t which is an orthogonal basis forthe column span of AS S . 11. Let p (4) i = k v i k k V k F and t = c (cid:16) t log t ǫ (cid:17) . Sample S ∈ R n × t where the j th column is set to q t p (4) i e i with probability p (4) i . Compute W ∈ R t × t satisfying: W = arg min W | rank( W )= k k S T AS S W Z T − S T AS k F .
7. Compute an orthogonal basis Q ∈ R n × k for the column span of AS S W . Let t = c (cid:0) k log k + kǫ (cid:1) ,set p (5) i = k q i k k Q k F , where q i is the i th row of Q . Sample S ∈ R n × t where the j th column is setof q t p (5) i e i with probability p (5) i . Solve: N = arg min N ∈ R n × k k S T QN T − S T A k F .
8. Return $Q, N \in \mathbb{R}^{n \times k}$.

Theorem 9 (Sublinear Time Low-Rank Approximation). Given any PSD $A \in \mathbb{R}^{n \times n}$, for sufficiently large constants $c, c', c_1, \ldots, c_5$, for any $\epsilon < 1$, Algorithm 1 accesses $O\!\left( \frac{nk \log n}{\epsilon^{2.5}} + \sqrt{n}\, k^{1.5} \log n \cdot \mathrm{poly}(1/\epsilon) \right)$ entries of $A$, runs in $\tilde O\!\left( \frac{nk^{\omega-1}}{\epsilon^{\omega-1}} + \sqrt{n}\, k^{\omega-0.5} \cdot \mathrm{poly}(1/\epsilon) \right)$ time, and with probability at least $9/10$ outputs $M, N \in \mathbb{R}^{n \times k}$ with:
\[ \|A - M N^T\|_F^2 \le (1+\epsilon)\,\|A - A_k\|_F^2. \]

For simplicity, we will show $\|A - M N^T\|_F^2 \le (1 + O(\epsilon))\,\|A - A_k\|_F^2$. By scaling up the constants $c, c', c_1, \ldots, c_5$ we can then scale down $\epsilon$, achieving the claimed result. We first show:

Lemma 10.
For sufficiently large constants c, c ′ , c , c , with probability / , Steps 1-4 of Algorithm 1produce Z ∈ R t × k satisfying: k AS − AS ZZ T k ≤ ǫk k A − A k k F . Proof.
In Step 3, S is sampled using probabilities proportional to the scores ℓ (1) i = p nk ˜ τ ki ( A / ) + q nǫ k ˜ τ c ′ k i ( A / ). The first part of this score ensures that by Lemma 3, with high probability AS is an ( ǫ, k )-column PCP of A . The second half, in combination with the construction of S ensuresvia Lemmas 7 and 8 that if c ′ is large enough, ˜ A = S T AS is both a (1 / , k )-row PCP of AS for Frobenius norm error with probability 99 /
100 and a (1 / , k )-spectral PCP of AS with highprobability. Note that in ℓ (1) i the scores corresponding to ˜ τ c ′ k i ( A / ) are scaled down by an ǫ factor from what is required to apply Lemmas 7 and 8 with rank k and error 1 /
2. However, t isoversampled by a 1 /ǫ factor, balancing this scaling. By a union bound, all three PCP propertieshold simultaneously with probability 98 / U k contain the top k row singularvectors of AS we have: k ˜ A − ˜ A k k F ≤ k ˜ A − ˜ AU k U Tk k F ≤ k AS − ( AS ) k k F − ∆ ≤ k A − A k k F (4)12y the fact that | ∆ | ≤ k A − A k k F and k AS − ( AS ) k k F ≤ k AS − ( AS ) k k F ≤ (1 + ǫ ) k A − A k k F ≤ k A − A k k F since AS is an ( ǫ, k )-column PCP of A . Further, via Lemma 8, for anyrank- k projection P :12 k AS ( I − P ) k − k k A − A k k F ≤ k ˜ A ( I − P ) k ≤ k AS ( I − P ) k + 12 k k A − A k k F . (5)By this bound, we have k ˜ A − ˜ A k k ≤ k ˜ A − ˜ AU k U Tk k ≤ k AS − ( AS ) k k + 12 k k A − A k k ≤ ǫk k A − A k k F (6)where the final bound follows because if we set c ≥ k ≥ ⌈ k/ǫ ⌉ so k AS − ( AS ) k k ≤ ǫk k AS − ( AS ) k k F ≤ ǫk k A − A k k F . With (4) and (6) in place, applying (5) again, if we compute Z satisfying the guarantee in Step 4 we have: k AS − AS ZZ T k ≤ k ˜ A − ˜ AZZ T k + 1 k k A − A k k F ≤ k ˜ A − ˜ A k k + 4 k k ˜ A − ˜ A k k F + 1 k k A − A k k F (By the guarantee on Z in Step 4) ≤ (4 + 4 ·
603 + 1) ǫk · k A − A k k F which gives the lemma after adjusting constants on ǫ by making c , c ′ , c and c sufficiently large.Lemma 10 ensures that the rank- k matrix W computed in Step 6 gives a near optimal low-rankapproximation of AS . Specifically we have: Lemma 11.
With probability / , for sufficiently large c, c ′ , c , c , c , c , Step 5 of Algorithm 1produces W such that, letting Q ∈ R n × k be an orthonormal basis for the column span of AS S W : k AS − QQ T AS k F ≤ (1 + ǫ ) k A − A k k F . Proof.
We first consider the optimization problem: M ∗ = arg min M | rank( M )= k k M Z T − AS k F . We have: k M ∗ Z T − AS k F ≤ k ( AS ) k ZZ T − AS k F = k [ AS − ( AS ) k ] + ( AS ) k ( I − ZZ T ) k F = k AS − ( AS ) k k F + k ( AS ) k ( I − ZZ T ) k F ≤ k AS − ( AS ) k k F + k · k AS ( I − ZZ T ) k ≤ k AS − ( AS ) k k F + ǫ k A − A k k F ≤ (1 + 2 ǫ ) k A − A k k F (7)where the second to last step uses that k AS ( I − ZZ T ) k ≤ ǫk k A − A k k F by Lemma 10 and thelast step uses that k AS − ( AS ) k k F ≤ (1 + ǫ ) k A − A k k F since AS is an ( ǫ, k )-column PCP of A .13ote that since it is rank- k we can write M ∗ = Y ∗ N ∗ where Y ∗ ∈ R n × k and N ∗ ∈ R k × k hasorthonormal rows. Y ∗ is the solution to the unconstrained low-rank approximation problem: Y ∗ = arg min Y ∈ R n × k k Y N ∗ Z T − AS k F .N ∗ Z T has orthonormal rows, so its column norms are its leverage scores. S is sampled us-ing the column norms of Z T , which upper bound those of N ∗ Z T . Since k Z k F = ⌈ ck/ǫ ⌉ , t =Θ (cid:0) k Z k F log( k Z k F ) + k Z k F /ǫ (cid:1) . It is thus well known (see e.g. Theorem 38 of [CW13], and[DMM08] for earlier work) that if we set˜ Y = arg min Y ∈ R n × k k Y N ∗ Z T S − AS S k F we have with probability 99 / k ˜ Y N ∗ Z T − AS k F ≤ (1 + ǫ ) k Y ∗ N ∗ Z T − AS k F . (8)˜ Y can be computed in closed form as ˜ Y = AS S ( N ∗ Z T S ) + , which is in the column span of AS S .Thus (8) demonstrates that there is some rank- k M in this span satisfying k M Z T − AS k F ≤ (1 + ǫ ) k M ∗ Z T − AS k F ≤ (1 + 4 ǫ ) k A − A k k F by (7). Thus if we compute W ∗ = arg min W | rank( W )= k k AS S W Z T − AS k F . we will have k AS S W ∗ Z T − AS k F ≤ (1 + 4 ǫ ) k A − A k k F . (9)We can compute W ∗ approximately by sampling the rows of AS S by their leverage scores. Theseare given by the row norms of the orthogonal basis V as computed in Step 5 and sampled with toobtain S in Step 6. For any W , by the Pythagorean theorem, since V spans AS S , k AS S W Z T − AS k F = k AS S W Z T − V V T AS k F + k ( I − V V T ) AS k F . (10)Similarly k S T AS S W Z T − S T AS k F = k S T AS S W Z T − S T V V T AS k F + k S T ( I − V V T ) AS k F + 2 tr (cid:0) ( ZW T S T S T A T − S T A T V V T ) S S T ( I − V V T ) AS (cid:1) . (11)For the first term, since S is sampled via the leverage scores of AS S and t = c (cid:16) t log t ǫ (cid:17) , S gives a subspace embedding for the column span of AS S with high probability and: k S T AS S W Z T − S T V V T AS k F ∈ (1 ± ǫ ) k AS S W Z T − V V T AS k F (12)For the cross term, again since V spans the columns of AS S we have:tr (cid:0) ( ZW T S T S T A T − S T A T V V T ) S S T ( I − V V T ) AS (cid:1) = tr (cid:0) ( ZW T S T S T A T − S T A T ) V V T S S T ( I − V V T ) AS (cid:1) ≤ k AS S W Z T − AS k F · k V V T S S T ( I − V V T ) AS k F ≤ k AS S W Z T − AS k F · ǫ k ( I − V V T ) AS k F /
100 from a standard approximate matrix mul-tiplication result [DKM06] since S is sampled using the row norms of V . k ( I − V V T ) AS k F ≤k AS S W ∗ Z T − AS k F ≤ k AS S W Z T − AS k F and so overall:tr (cid:0) ( ZW T S T S T A T − S T A T V V T ) S S T ( I − V V T ) AS (cid:1) ≤ ǫ k AS S W Z T − AS k F . (13)Combining (10), (11), (12), and (13) for any W , k S T AS S W Z T − S T AS k F ∈ (1 ± ǫ ) k AS S W Z T − AS k F + ∆where ∆ = k S T ( I − V V T ) AS k F − k ( I − V V T ) AS k F is fixed independent of W . Thus, if we set:˜ W = arg min W | rank( W )= k k S T AS S W Z T − S T AS k F . with probability 95 /
100 union bounding over all the probabilistic events above, and applying (9): k AS S ˜ W Z T − AS k F ≤ (1 + 3 ǫ )(1 − ǫ ) k AS S W ∗ Z T − AS k F ≤ (1 + 3 ǫ )(1 + 4 ǫ )(1 − ǫ ) k A − A k k F which gives the lemma after adjusting constants on ǫ by making c, c ′ , c , c , c , c large enough. Proof of Theorem 9.
By Lemma 11 and since AS is an ( ǫ, k )-column PCP of A , we have k A − QQ T A k F ≤ ǫ − ǫ k A − A k k F . In Step 7 we sample c ( k log k + k/ǫ ) rows of Q by their standardleverage scores (their row norms) and so have with probability 99/100: k QN T − A k F ≤ (1 + ǫ ) min Y ∈ R n × k k QY T − A k F = (1 + ǫ ) k A − QQ T A k F ≤ (1 + ǫ ) − ǫ k A − A k k F . Union bounding over failure probabilities and adjusting constants on ǫ yields the final claim.It just remains to discuss Algorithm 1’s runtime and query complexity. In Step 3, forming˜ A = S T AS requires reading t · t = O (cid:16) nk log nǫ . (cid:17) entries of A . This dominates the access costof Step 1, which requires O ( nk log k ) = O (cid:16) nk log( k/ǫ ) ǫ (cid:17) accesses by Lemma 4, forming AS S in Step 5 which requires O (cid:16) nk log( k/ǫ ) ǫ + nkǫ (cid:17) accesses, and forming S T A in Step 7 which requires O (cid:0) nk log k + nkǫ (cid:1) accesses. Forming S T AS in Step 6 requires t · t = O (cid:16) √ nk . log( k/ǫ ) log nǫ (cid:17) accesses, which when n is large compared to k and 1 /ǫ will be dominated by our linear in n terms.For time complexity, Step 1 requires O ( n ( k log k ) ω − ) = ˜ O (cid:16) nk ω − ǫ ω − (cid:17) time. Computing Z inStep 4 can be done using an input sparsity time algorithm for spectral norm error with rank k and error parameter ǫ ′ = Θ(1). By Theorem 27 of [CEM +
15] or using input sparsity time ridgeleverage score sampling [CMM17] in conjunction with the spectral norm PCP result of Lemma 24,the total runtime required is O (nnz( ˜ A )) + ˜ O ( √ nk ω − ǫ ω − ).We can compute V in Step 5 in O ( nt ω − ) = ˜ O (cid:16) nk ω − ǫ ω − (cid:17) . In Step 6, we can compute W byfirst multiplying S T AS by Z and then multiplying by ( S T AS S ) + and taking the best rank- k approximation of the result. The total runtime is ˜ O (cid:0) t · k ω − + k ω (cid:1) · poly(1 /ǫ ) = ˜ O ( √ nk ω − . ) · poly(1 /ǫ ). Finally, we can compute Q in Step 7 by first computing a t × k span of the columnspace of W , multiplying this by AS S , and then taking an orthogonal basis for the result, requiringtotal time ˜ O ( k ω · poly(1 /ǫ )) + O (cid:16) nk ω − log( k/ǫ ) ǫ + nk ω − (cid:17) . The final regression problem can besolved by forming the pseudoinverse of S T Q in O ( k ω (log k + 1 /ǫ )) time and applying it to S T A in O ( nk ω − (log k + 1 /ǫ )) time. So overall our runtime cost with linear dependence on n is dominatedby the cost of Step 5. We additionally must add the ˜ O ( √ nk ω − . ) · poly(1 /ǫ ) term from Step 6,which only dominates if n is relatively small. 15n many applications it is desirable that the low-rank approximation to A is also symmetricand positive semidefinite. We show in Appendix C that a modification to Algorithm 1 can satisfythis constraint also in ˜ O ( n poly( k/ǫ )) time. The upshot is: Theorem 12 (Sublinear Time Low-Rank Approximation – PSD Output) . There is an algorithmthat given any PSD A ∈ R n × n , accesses ˜ O (cid:16) nk ǫ + nkǫ (cid:17) entries of A , runs in ˜ O (cid:16) nk ω ǫ ω + nk ω − ǫ ω − (cid:17) timeand with probability at least / outputs M ∈ R n × k with: k A − M M T k F ≤ (1 + ǫ ) k A − A k k F . We now present our lower bound on the number of accesses to A required to compute a near-optimallow-rank approximation, matching the query complexity of Algorithm 4 up to a ˜ O (1 /ǫ . ) factor. Theorem 13.
Assume that k, ǫ are such that nk/ǫ = o ( n ) . Consider any (possibly randomized)algorithm A that, given any PSD A ∈ R n × n , outputs a (1 + ǫ ) -approximate rank- k approximationto A (in the Frobenius norm) with probability at least / . Then there must be some input A ofwhich A reads at least Ω( nk/ǫ ) positions, possibly adaptively, in expectation. We prove Theorem 13 via Yao’s minimax principle [Yao77], proving a lower bound for randomizedalgorithms on worst-case inputs via a lower bound for deterministic algorithms on a hard inputdistribution. Specifically, we will draw the input A from a distribution over binary matrices. A hasall 1’s on its diagonal, along with k randomly positioned (non-contiguous) blocks of all 1’s, each ofsize p ǫn/k × p ǫn/k . In other words, A is the adjacency matrix (plus identity) of a graph with k cliques of size p ǫn/k , placed on random subsets of the vertices, with all other vertices isolated.It is easy to see that every A chosen from the above distribution is PSD since applying apermutation yields a block diagonal matrix, each of whose blocks is a PSD matrix (either a single1 entry or a rank-1 all 1’s block). Additionally, for every A chosen from the distribution, theoptimal rank- k approximation to A projects off each of the k blocks, achieving Frobenius normerror k A − A k k F = n − k p ǫn/k ≈ n . To match this up to a 1 + ǫ factor, any near-optimal rank- k approximation must at least capture a constant fraction of the Frobenius norm mass in the blockssince this mass is k · ǫn/k = 2 ǫn .Doing so requires identifying at least a constant fraction of the blocks. However, since blockpositions are chosen uniformly at random, and since the diagonal entries of A are identical andso convey no information about the positions, to identify a single block, any algorithm essentiallymust read arbitrary off-diagonal entries until it finds a 1. There are ≈ n off-diagonal entries withjust 2 ǫn of them 1’s, so identifying a first block requires Ω( n/ǫ ) queries to A in expectation (overthe random choice of A ). Since the vast majority of vertices are isolated and not contained withina block, finding this first block does little to make finding future blocks easier. So overall, thealgorithm must make Ω( nk/ǫ ) queries in expectation to find a constant fraction of the k blocks andoutput a near-optimal low-rank approximation with good probability.While the above intuition is the key idea behind the lower bound, a rigorous proof requiresa number of additional steps and modifications, detailed in the remainder of this section. InSection 5.2 we introduce the notion of a primitive approximation to a matrix, which we will employin our lower bound for low-rank approximation. In Section 5.3 we show that any deterministic16ow-rank approximation algorithm that succeeds with good probability on the input distributiondescribed above can be used to give an algorithm that computes a primitive approximation withgood probability on a matrix drawn from a related input distribution (Lemma 20). This reductionyields a lower bound for deterministic low-rank approximation algorithms (Theorem 23), whichgives Theorem 13 after an application of Yao’s principle. We first define the notion of an ǫ -primitive approximation to a matrix and establish some basicproperties of these approximations. Definition 14 ( ǫ -primitive Approximation) . 
A matrix $A' \in \mathbb{R}^{m \times m}$ is said to be $\epsilon$-primitive for $A \in \mathbb{R}^{m \times m}$ if the squared Frobenius norm of $A - A'$, restricted to its off-diagonal entries, is at most $\epsilon m$. Note that $A'$ is allowed to have any rank.

We also define a distribution on matrices with a randomly placed block of all-one entries that we will later use in our hard input distribution construction.
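For intuition, the following sketch (ours; the sizes and constants are illustrative rather than the exact parameters of Definitions 15 and 19 below) generates a hard instance of the flavor described above (the identity plus $k$ randomly placed all-ones blocks), checks that it is PSD, and reports how little of its squared Frobenius mass the best rank-$k$ approximation can remove.

```python
import numpy as np

def hard_instance(n, k, eps, seed=0):
    """Identity plus k randomly placed, disjoint all-ones blocks (illustrative sizes)."""
    rng = np.random.default_rng(seed)
    s = max(1, int(round(np.sqrt(eps * n / k))))    # block side length, roughly sqrt(eps*n/k)
    A = np.eye(n)
    perm = rng.permutation(n)
    for b in range(k):
        idx = perm[b * s:(b + 1) * s]               # disjoint random index sets
        A[np.ix_(idx, idx)] = 1.0
    return A

if __name__ == "__main__":
    n, k, eps = 400, 3, 0.1
    A = hard_instance(n, k, eps)
    evals = np.linalg.eigvalsh(A)
    print("min eigenvalue (PSD check):", evals.min())
    top_k = np.sort(evals)[-k:]
    # For a symmetric matrix, ||A||_F^2 is the sum of squared eigenvalues.
    print("||A||_F^2:", np.sum(evals ** 2))
    print("||A - A_k||_F^2:", np.sum(evals ** 2) - np.sum(top_k ** 2))
```

As the printed values suggest, almost all of the Frobenius mass sits on the diagonal, yet any near-optimal rank-$k$ approximation must still locate most of the hidden blocks.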
Definition 15 (Random Block Matrix) . For any m, ǫ with m/ǫ ≤ m , let µ ( m, ǫ ) be the distributionon A ∈ R m × m defined as follows. We choose a uniformly random subset S of [ m ] where | S | = √ ǫm , where we assume for simplicity that | S | is an integer. We generate a random matrix A bysetting for each i = j ∈ S , A i,j = 1 . We then set A i,i = 1 for all i and set all remaining entries of A to equal . Note that we associate a random subset S with the sampling of a matrix A according to µ ( m, ǫ ).It is clear that any A in the support of µ ( m, ǫ ) is PSD. This is since any A in the support of µ ( m, ǫ ),after a permutation, is composed of an | S | × | S | all ones block and an ( m − | S | ) × ( m − | S | ) identity.If A ′ is ǫ - primitive for A in the support of µ ( m, ǫ ), it approximates A up to error ǫm on itsoff-diagonal entries. Let R denote the set of off-diagonal entries of A restricted to the intersectionof the rows and columns indexed by S . The error on the entries of R is at most ǫm . Further,restricted to these entries, A is an all ones matrix. Thus, on the entries of R , both A and A ′ lookfar from an identity matrix. Formally, we show: Lemma 16. If A ′ ∈ R m × m is ǫ -primitive for an A ∈ R m × m in the support of µ ( m, ǫ ) , then A ′ isnot ǫ -primitive for I .Proof. By definition, any ǫ -primitive matrix A ′ for A has squared Frobenius norm restricted to A ’soff-diagonal entries of ≤ ǫm . Let R denote the set of off-diagonal entries in the intersection of therows and columns of A indexed by S . Assuming | S | ≥
4, which follows from the assumption inDefinition 15 that m/ǫ ≤ m , | R | = | S | − S ≥ ǫm .Restricted to the entries in R , A is an all ones matrix. Thus, on at least 8 ǫm of these entries A ′ must have value at least 1 /
2. Otherwise it would have greater than | R | − ǫm ≥ ǫm entrieswith value ≤ / A ’s off-diagonal entries greater than · ǫm ≥ ǫm , contradicting the assumption that it is ǫ -primitive for A .Restricted to the entries in R , the identity matrix is all zero. Thus, A ′ has Frobenius norm erroron these entries at least · ǫm ≥ ǫm . Thus A ′ is not ǫ -primitive for I , giving the lemma.Consider A drawn from µ ( m, ǫ ) with probability 1 / I with probability 1 /
2. ByLemma 16, any (possibly randomized) algorithm A that returns an ǫ -primitive approximation to A with good probability can be used to distinguish with good probability whether A is in thesupport of µ ( m, ǫ ) or A = I . This is because, by Lemma 16, the output of such an algorithm can17nly be correct either for some A in the support of µ ( m, ǫ ) or for the identity. We first define thedistribution over matrices to which this result applies: Definition 17.
For any m, ǫ with m/ǫ ≤ m , let γ ( m, ǫ ) be the distribution on A ∈ R m × m definedas follows. With probability / draw A from µ ( m, ǫ ) (Definition 15). Otherwise, draw A from ν ( m ) , which is the distribution whose support is only the m × m identity matrix. Note that, as with µ ( m, ǫ ), we associate a random subset S with | S | = √ ǫm with the samplingof a matrix A according to γ ( m, ǫ ). S is not used in the case that A is drawn from ν ( m ).Formally, since A can distinguish whether A is in the support of µ ( m, ǫ ) or ν ( m ) (i.e., A = I )with good probability we can prove that the distribution of A ’s access pattern (over randomnessin the input and possible randomization in the algorithm) is significantly different when it is giveninput A drawn from µ ( m, ǫ ) than when it is given A drawn from ν ( m ). Recall for distributions α and β supported on elements s of a finite set S , that the total variation distance D T V ( α, β ) = P s ∈ S | α ( s ) − β ( s ) | , where α ( s ) is the probability of s in distribution α . Corollary 18.
Corollary 18. Suppose that a (possibly randomized) algorithm A, with probability at least 7/12 over its random coin flips and random input A ∈ R^{m×m} drawn from γ(m, ǫ), outputs an ǫ-primitive matrix for A. Further, suppose that A reads at most r positions of A, possibly adaptively; that is, for any input A, in any random execution, A reads at most r entries of A. Let S be a random variable indicating the list of positions read and their corresponding values. S is determined by the random input A and the random choices of A, and since A reads at most r positions of any input, we always have |S| ≤ r. Let L(µ) denote the distribution of S conditioned on A ∼ µ(m, ǫ), and let L(ν) denote the distribution of S conditioned on A ∼ ν(m). Then D_TV(L(µ), L(ν)) ≥ 1/6.

Here and throughout, A ∼ µ(m, ǫ) denotes the event that, when A is drawn from the distribution γ(m, ǫ) (Definition 17), which is a mixture of the distributions µ(m, ǫ) and ν(m), A is drawn from µ(m, ǫ). A ∼ ν(m) denotes the analogous event for ν(m).

Proof. By Lemma 16, if algorithm A succeeds then its output can be used to decide whether A ∼ µ(m, ǫ) or A ∼ ν(m). The success probability of any such algorithm is well known (see, e.g., Proposition 2.58 of [BY02]) to be at most 1/2 + D_TV(L(µ), L(ν))/2. Making this quantity at least 7/12 and solving for D_TV(L(µ), L(ν)) proves the corollary.

We now give a reduction, showing that any deterministic relative-error rank-k approximation algorithm that succeeds with good probability on the hard input distribution described in Section 5.1 can be used to compute an ǫ-primitive approximation to A drawn from γ(n/(2k), ǫ) with good probability. We first formally define this input distribution.

Definition 19 (Hard Input Distribution – Low-Rank Approximation). Suppose 2nk/ǫ ≤ n². Let γ_b be the distribution on A ∈ R^{n×n} determined as follows. We draw a uniformly random subset S of [n] with |S| = n/2, where we assume for simplicity that |S| is an integer. We further partition S into k subsets S_1, S_2, ..., S_k chosen uniformly at random. For all ℓ ∈ [k], |S_ℓ| = n/(2k), which we also assume to be an integer.

Letting A_ℓ denote the entries of A restricted to the intersection of the rows and columns indexed by S_ℓ, we independently draw each A_ℓ from γ(n/(2k), ǫ) (Definition 17). We then set A_{i,i} = 1 for all i and set all remaining entries of A equal to 0.

Note that by the assumption that 2nk/ǫ ≤ n², we have (n/(2k))/ǫ ≤ (n/(2k))², so this is a valid setting of the parameters for γ(m, ǫ) with m = n/(2k).
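A small sketch of sampling from the hard input distribution γ_b, reusing the µ(m, ǫ) sampler sketched after Lemma 16; the helper names are ours and we simply round away the divisibility assumptions of Definition 19.

```python
import numpy as np

def sample_gamma(m, eps, rng):
    """gamma(m, eps): mu(m, eps) with probability 1/2, identity (nu(m)) otherwise."""
    if rng.random() < 0.5:
        return sample_mu(m, eps, rng)[0]       # sample_mu as sketched after Lemma 16
    return np.eye(m)

def sample_gamma_b(n, k, eps, rng=np.random.default_rng(0)):
    """Hard input distribution of Definition 19: k hidden blocks of size n/(2k)."""
    m = n // (2 * k)                           # block size n/(2k), assumed integral
    S = rng.choice(n, size=k * m, replace=False)
    blocks = S.reshape(k, m)                   # random partition S_1, ..., S_k
    A = np.eye(n)
    for S_l in blocks:
        A[np.ix_(S_l, S_l)] = sample_gamma(m, eps, rng)
        A[S_l, S_l] = 1.0                      # diagonal stays 1 everywhere
    return A

A = sample_gamma_b(n=800, k=4, eps=0.1)
```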
The reduction from ǫ-primitive approximation to low-rank approximation is as follows:

Lemma 20 (Reduction from Primitive Approximation to Low-Rank Approximation). Suppose that nk/ǫ = o(n²) and that A is a deterministic algorithm that, with probability ≥ 2/3 on random input A ∈ R^{n×n} drawn from γ_b, outputs a (1 + ǫ/26)-approximate rank-k approximation to A. Further suppose A reads at most r positions of A, possibly adaptively.

Then there is a randomized algorithm B that, with probability ≥ 7/12 over its random coin flips and random input B drawn from γ(n/(2k), ǫ), outputs an ǫ-primitive matrix for B. Further, letting L(µ), L(ν) be as defined in Corollary 18 for B, D_TV(L(µ), L(ν)) ≤ 32ǫr/(nk).

Proof. Consider a randomized algorithm B that, given B drawn from γ(n/(2k), ǫ), generates a random matrix A ∈ R^{n×n} drawn from γ_b as follows. Choose a uniformly random subset S of [n] with |S| = n/2. Partition S into k subsets S_1, S_2, ..., S_k chosen uniformly at random; note that for each ℓ ∈ [k], |S_ℓ| = n/(2k). Letting A_ℓ denote the entries of A restricted to the intersection of the rows and columns indexed by S_ℓ, set A_1 = B, and for ℓ = 2, ..., k independently draw A_ℓ from γ(n/(2k), ǫ). Set A_{i,i} = 1 for all i and set all remaining entries of A equal to 0.

After generating A, B then applies A to A to compute a rank-k approximation A′. B then outputs (A′)_1, the n/(2k) × n/(2k) submatrix of A′ corresponding to the intersection of the rows and columns indexed by S_1. We have the following:
Claim 21. With probability ≥ 7/12 over the random choices of B and over the random input B drawn from γ(n/(2k), ǫ), (A′)_1 is ǫ-primitive for B.

Proof. It is clear that A generated by B is distributed according to γ_b (Definition 19). By construction, for any A ∼ γ_b there is a rank-k approximation of cost at most n; indeed this follows by choosing the best rank-1 solution for each A_ℓ. Consequently, if A succeeds on A, then its output A′ satisfies ‖A − A′‖_F² ≤ n + ǫn/26.

Note that A has A_{i,i} = 1 for all i ∈ [n], so the squared Frobenius norm cost of any rank-k approximation restricted to the diagonal is at least n − k. Let c_ℓ be the squared Frobenius norm cost of A′ restricted to the off-diagonal entries in A_ℓ. Note that c_ℓ is a random variable. Then,

n − k + Σ_{ℓ=1}^k c_ℓ ≤ ‖A′ − A‖_F² ≤ n + ǫn/26.

By averaging, for at least an 11/12 fraction of the blocks i,

c_i ≤ 12 + 12ǫn/(26k) ≤ 13ǫn/(26k) ≤ ǫn/(2k),

assuming ǫn/(26k) ≥ 12, which holds if ǫn/k = ω(1), which follows from our assumption that nk/ǫ = o(n²).

It follows by symmetry of γ_b with respect to the k blocks that, with probability at least 11/12, if A succeeds on A then c_1 ≤ ǫn/(2k). This gives that (A′)_1 is ǫ-primitive (Definition 14) for A_1 = B. The claim follows after applying a union bound, since A succeeds with probability at least 2/3.

It remains to bound the total variation distance between B's access patterns when B ∼ µ(n/(2k), ǫ) and when B ∼ ν(n/(2k)). We have:

Claim 22. For the algorithm B defined above, D_TV(L(µ), L(ν)) ≤ 32ǫr/(nk).

Proof.
Let W denote the random variable that encompasses B's random choices in choosing the indices in S_1, ..., S_k and in setting the entries in A_2, ..., A_k. Let Ω denote the set of all possible values of W. Let S denote the random subset with |S| = √(16ǫ·n/(2k)) associated with the drawing of B from γ(n/(2k), ǫ) (see Definition 17), and let S̄ denote the set of off-diagonal entries of A in the intersection of the rows and columns indexed by S.

If B ∼ µ(n/(2k), ǫ) then all entries of S̄ are 1. If B ∼ ν(n/(2k)) then B = I and so all entries of S̄ are 0. Outside of the entries in S̄, the entries of B are identical in the two cases B ∼ µ(n/(2k), ǫ) and B ∼ ν(n/(2k)) (in particular, they are all 0 off the diagonal and 1 on the diagonal).

So, for any w ∈ Ω, conditioned on W = w, all entries of A outside S̄ are fixed. Further conditioned on B ∼ ν(n/(2k)), all entries in S̄ are 0, so all entries of A are fixed and A always reads the same sequence of entries (i_{w,1}, j_{w,1}), ..., (i_{w,r}, j_{w,r}). Further, for any w ∈ Ω, conditioned on W = w, S is a uniform random subset of R = [n] \ (S_2 ∪ ... ∪ S_k), and |R| = n − (k − 1)·n/(2k) ≥ n/2. So, for any ℓ ∈ [r] we have:

Pr[(i_{w,ℓ}, j_{w,ℓ}) ∈ S̄ | W = w] ≤ |S̄|/|R|² ≤ |S|²/(n/2)² ≤ 32ǫ/(nk).

By a union bound, letting E be the event that A reads (i_{w,1}, j_{w,1}), ..., (i_{w,r}, j_{w,r}) and does not read any entry of S̄ in its r accesses to A, we have for any w ∈ Ω:

Pr[E | W = w] ≥ 1 − 32ǫr/(nk).

Thus, for any w ∈ Ω, regardless of whether B ∼ ν(n/(2k)) or B ∼ µ(n/(2k), ǫ), A has access pattern (i_{w,1}, j_{w,1}), ..., (i_{w,r}, j_{w,r}) to A with probability ≥ 1 − 32ǫr/(nk). Correspondingly, B has a fixed access pattern to B with probability ≥ 1 − 32ǫr/(nk). Thus D_TV(L(µ), L(ν)) ≤ 32ǫr/(nk), yielding the claim.

In combination, Claims 21 and 22 give Lemma 20.

We can now use Lemma 20 to argue that if a deterministic low-rank approximation algorithm succeeding with good probability on a random input drawn from γ_b accesses too few entries, then it can be used to give a primitive approximation algorithm violating Corollary 18.

Theorem 23 (Lower Bound for Deterministic Algorithms). Assume that n, k, ǫ are such that nk/ǫ = o(n²). Consider any deterministic algorithm A that, given random input A drawn from γ_b, outputs a (1 + ǫ)-approximate rank-k approximation to A (in the Frobenius norm) with probability at least 2/3. Further, suppose A reads at most r positions of A, possibly adaptively. Then r = Ω(nk/ǫ).

Proof. Assume towards a contradiction that r = o(nk/ǫ). Then, applying Lemma 20 (after adjusting ǫ by a constant factor), A can be used to give a randomized algorithm B that with probability ≥ 7/12 over its random coin flips and random input B ∈ R^{n/(2k)×n/(2k)} drawn from γ(n/(2k), ǫ), outputs an ǫ-primitive matrix for B. Further, letting L(µ), L(ν) be as defined in Corollary 18 for B, by Lemma 20, D_TV(L(µ), L(ν)) ≤ 32ǫr/(nk). For r = o(nk/ǫ), 32ǫr/(nk) = o(1), contradicting Corollary 18 and giving the theorem.

Theorem 13 follows directly from applying Yao's minimax principle ([Yao77], Theorem 3) to Theorem 23.

6 Spectral Norm Error Bounds
We conclude by showing how to modify Algorithm 1 to output a low-rank approximation B achiev-ing the spectral norm guarantee: k A − B k ≤ (1 + ǫ ) k A − A k k + ǫk k A − A k k F . (14)This can be significantly stronger than the Frobenius guarantee (1) when k A − A k k F is large, and,for example is critical in our application to ridge regression, discussed in Section 6.2.It is not hard to see that since additive error in the Frobenius norm upper bounds additiveerror in the spectral norm (see e.g. Theorem 3.4 of [Gu14]) that for B satisfying the Frobeniusnorm guarantee k A − B k F ≤ (1 + ǫ ) k A − A k k F , we immediately have the spectral bound k A − B k ≤ k A − A k k + ǫ k A − A k k F . Thus, we can achieve 14 simply by running Algorithm 1 witherror parameter ǫ/k . However, this approach is suboptimal. Applying Theorem 9, our querycomplexity would be Θ (cid:16) nk . log nǫ . (cid:17) . We improve this k dependence significantly in Algorithm 2.Since (14) is often applied (see for example Section 6.2) with k ′ = k/ǫ and ǫ = Θ(1) to give k A − B k ≤ O (cid:0) ǫk k A − A k k F (cid:1) , optimizing k dependence is especially important.We first give an extension of Lemma 3 to the spectral norm case. This lemma provides thecolumn sampling analog to Lemma 8. Lemma 24 (Spectral Norm PCP) . For any A ∈ R n × d , for i ∈ { , . . . , d } , let ˜ τ ki ≥ τ ki ( A ) be anoverestimate for the i th rank- k ridge leverage score. Let p i = ˜ τ ki P i ˜ τ ki and t = c log( k/δ ) ǫ P i ˜ τ ki for any ǫ < and sufficiently large constant c . Construct C by sampling t columns of A , each set to √ tp i a i with probability p i . With probability − δ , for any orthogonal projection P ∈ R n × n , (1 − ǫ ) k A − P A k − ǫk k A − A k k F ≤ k C − P C k ≤ (1 + ǫ ) k A − P A k + ǫk k A − A k k F . We refer to C as an ( ǫ, k ) -spectral PCP of A .Proof. This follows from Corollary 34. With probability ≥ − δ , sampling by the rank- k ridgeleverage scores gives C satisfying:(1 − ǫ ) CC T − ǫk k A − A k k F I (cid:22) AA T (cid:22) (1 + ǫ ) CC T + ǫk k A − A k k F I. (15)We can write for any M , k M k = max x : k x k =1 x T M x . When k x k = 1, k ( I − P ) x k ≤ x : x T ( I − P ) CC T ( I − P ) x ≤ x T ( I − P ) AA T ( I − P ) x + ǫk k A − A k k F ≤ k A − P A k + ǫk k A − A k k F which gives k C − P C k ≤ − ǫ k A − P A k + ǫ (1 − ǫ ) k k A − A k k F . Similarly we have: x T ( I − P ) AA T ( I − P ) x ≤ (1 + ǫ ) x T ( I − P ) CC T ( I − P ) x + ǫk k A − A k k F ≤ (1 + ǫ ) k C − P C k + ǫk k A − A k k F which gives ǫ k A − P A k − ǫ (1+ ǫ ) k k A − A k k F ≤ k C − P C k . The lemma follows from combiningthese upper and lower bounds after adjusting constant factors on ǫ (by making c large enough).21 .1 Spectral Norm Low-Rank Approximation Algorithm We now use Lemma 24, along with its row sampling counterpart, Lemma 8, to give an algorithm(Algorithm 2) for computing a near optimal spectral norm low-rank approximation to A .In Steps 1-3 we sample both rows and columns of A via the rank Θ( k/ǫ ) ridge leverage scoresof A / , ensuring with high probability that AS is an ( ǫ, k )-spectral PCP of A and ˜ A is in turn an( ǫ, k )-spectral row PCP of AS . Thus, if we compute (using an input sparsity time algorithm) aspan Z which gives a near optimal spectral norm low-rank approximation to ˜ A (Step 3), this spanwill also be nearly optimal for AS . Since we cannot afford to fully read AS , we approximatelyproject it to Z by further sampling its columns using Z ’s leverage scores (Step 4). 
We use leveragescore sampling again in Step 5 to approximately project A to the span of the result. This yieldsour final approximation, using the fact that AS is a spectral PCP for A . Algorithm 2 PSD Low-Rank Approximation – Spectral Error
1. Let k_2 = ⌈c_1 k/ǫ⌉. For all i ∈ [n] compute τ̃_i^{k_2}(A^{1/2}), a constant factor approximation to the ridge leverage score τ_i^{k_2}(A^{1/2}).

2. Set ℓ_i^{(1)} = (4/ǫ)·√(n/k)·τ̃_i^{k_2}(A^{1/2}). Set p_i^{(1)} = ℓ_i^{(1)}/Σ_i ℓ_i^{(1)} and t_1 = (c_2 log n/ǫ²)·Σ_i ℓ_i^{(1)}. Sample S_1, S_2 ∈ R^{n×t_1}, each of whose j-th column is set to e_i/√(t_1 p_i^{(1)}) with probability p_i^{(1)}.

3. Let Ã = S_2^T A S_1, and use an input sparsity time algorithm to compute orthonormal Z ∈ R^{t_1×k} satisfying the Frobenius guarantee ‖Ã − ÃZZ^T‖_F² ≤ 2‖Ã − Ã_k‖_F² along with the spectral guarantee ‖Ã − ÃZZ^T‖_2² ≤ (1 + ǫ)‖Ã − Ã_k‖_2² + (ǫ/k)‖Ã − Ã_k‖_F².

4. Let t_3 = c_3(k log k + k²/ǫ), set p_i^{(3)} = ‖z_i‖_2²/‖Z‖_F², and sample S_3 ∈ R^{t_1×t_3} where the j-th column is set to e_i/√(t_3 p_i^{(3)}) with probability p_i^{(3)}. Solve: M = argmin_{M ∈ R^{n×k}} ‖AS_1S_3 − MZ^TS_3‖_F².

5. Compute an orthogonal basis Q ∈ R^{n×k} for the column span of M. Let t_4 = c_4(k log k + k²/ǫ), set p_i^{(4)} = ‖q_i‖_2²/‖Q‖_F², and sample S_4 ∈ R^{n×t_4} where the j-th column is set to e_i/√(t_4 p_i^{(4)}) with probability p_i^{(4)}. Solve: N = argmin_{N ∈ R^{n×k}} ‖S_4^TQN^T − S_4^TA‖_F².

6. Return Q, N ∈ R^{n×k}.
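Steps 4 and 5 are ordinary leverage-score-sampled least squares problems. The following minimal sketch (our own notation, a standalone illustration rather than the full Algorithm 2) shows the pattern for Step 4: sample columns proportionally to the squared row norms of the orthonormal factor Z, rescale, and solve the small regression, which approximates the projection of AS_1 onto the span of Z.

```python
import numpy as np

def sampled_regression(Y, Z, t, rng=np.random.default_rng(0)):
    """Approximate argmin_M ||Y - M Z^T||_F by sampling t columns of the problem
    with probabilities proportional to the squared row norms of Z (its leverage
    scores, since Z has orthonormal columns)."""
    p = np.sum(Z ** 2, axis=1)
    p = p / p.sum()
    idx = rng.choice(Z.shape[0], size=t, p=p)      # sampled column indices
    scale = 1.0 / np.sqrt(t * p[idx])              # rescaling 1/sqrt(t * p_i)
    Ys = Y[:, idx] * scale                         # sampled, rescaled columns of Y
    Zs = Z[idx, :] * scale[:, None]                # corresponding rows of Z
    M = np.linalg.lstsq(Zs, Ys.T, rcond=None)[0].T # solve min_M ||Ys - M Zs^T||_F
    return M

# Toy check against the exact solution M* = Y Z (Z orthonormal).
rng = np.random.default_rng(1)
Y = rng.standard_normal((500, 200))
Z, _ = np.linalg.qr(rng.standard_normal((200, 10)))
M = sampled_regression(Y, Z, t=120, rng=rng)
print(np.linalg.norm(Y - M @ Z.T, 'fro') / np.linalg.norm(Y - (Y @ Z) @ Z.T, 'fro'))
```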
Theorem 25 (Sublinear Time Low-Rank Approximation – Spectral Norm Error). Given any PSD A ∈ R^{n×n}, for sufficiently large constants c_1, ..., c_4, Algorithm 2 accesses O(n·k log n/ǫ + nk²/ǫ) entries of A, runs in Õ(nk^ω/ǫ + nk/ǫ + (√n·k^{ω−1} + k^{ω+1})·poly(1/ǫ)) time, and with probability at least 9/10 outputs M, N ∈ R^{n×k} with:

‖A − MN^T‖_2² ≤ (1 + ǫ)‖A − A_k‖_2² + (ǫ/k)‖A − A_k‖_F².

Proof. ℓ_i^{(1)} = (4/ǫ)·√(n/k)·τ̃_i^{k_2}(A^{1/2}), which by Lemma 5 is within a constant factor of upper bounding the rank-k_2 = ⌈c_1k/ǫ⌉ ridge leverage scores of A. As long as c_1 ≥
1, these scores upper bound therank- k ridge leverage scores. So by Lemma 24, with high probability AS an ( ǫ, k )-spectral PCPof A as long as c is set large enough. Additionally, by Lemma 8, with high probability ˜ A is an( ǫ, k )-spectral row PCP of AS .Frobenius norm PCP bounds also hold. By Lemma 3, AS is an ( ǫ, k )-column PCP of A with high probability and ˜ A is an ( ǫ, k )-row PCP for AS with probability 99 / AS and ˜ A are ( ǫ, k ) PCPs for A and AS respectively, which gives: k ˜ A − ˜ A k k F ≤ c k A − A k k F (16)for some constant c . By (16) and the fact that ˜ A is an ( ǫ, k )-spectral PCP of AS , for Z computedin Step 3 of Algorithm 2 we have: k AS − AS ZZ T k ≤ (1 + ǫ )(1 − ǫ ) (cid:16) k ˜ A − ˜ A k k + ǫk k ˜ A − ˜ A k k F (cid:17) + ǫk (1 − ǫ ) k A − A k k F ≤ (1 + 3 ǫ ) k ˜ A − ˜ A k k + (3 c + 2) ǫk k A − A k k F ≤ (1 + 3 ǫ )(1 + ǫ ) k AS − ( AS ) k k + (3 c + 2 + 1) ǫk k A − A k k F (17)where we assume without loss of generality that ǫ < / S is sampled by the leverage scores of Z , andsince t = Θ ( k log k + k/ǫ ′ ) for ǫ ′ = ǫ/k , for M computed in Step 4, with probability 99 / k AS − M Z T k F ≤ (cid:16) ǫk (cid:17) k AS − AS ZZ T k F . (18)This Frobenius norm bound also implies a spectral norm bound. Specifically, we can write: k AS − M Z T k F = k AS ZZ T − M Z T k F + k AS ( I − ZZ T ) k F so by (18) we must have k AS ZZ T − M Z T k F ≤ ǫk k AS ( I − ZZ T ) k F . This gives: k AS − M Z T k ≤ k AS ZZ T − M Z T k + k AS ( I − ZZ T ) k ≤ k AS ( I − ZZ T ) k + k AS ZZ T − M Z T k F ≤ k AS ( I − ZZ T ) k + ǫk k AS ( I − ZZ T ) k F . Note that k AS ( I − ZZ T ) k F = O ( k A − A k k F ) by the Frobenius norm guarantee required in Step3, and the fact that ˜ A and AS are ( ǫ, k ) PCPs for AS and A respectively. So combining with(16) the above implies that for Q spanning the columns of M , k AS − QQ T AS k ≤ k AS − M Z T k ≤ (1 + O ( ǫ )) k AS − ( AS ) k k + O (cid:16) ǫk (cid:17) k A − A k k F . Since AS is an ( ǫ, k )-spectral PCP for A we also have k A − QQ T A k ≤ (1+ O ( ǫ )) k AS − ( AS ) k k + O (cid:0) ǫk (cid:1) k A − A k k F . The theorem follows by applying an identical approximate regression argumentfor N computed in Step 5 to show that k A − QN T k ≤ (1 + ǫ ) k A − QQ T A k + ǫk k A − QQ T A k F = (1 + O ( ǫ )) k A − A k k + O (cid:16) ǫk (cid:17) k A − A k k F / ǫ and union bounding over failure probabilitiesyields the final bound.It just remains to discuss runtime and sample complexity. Constructing S T AS in Step 3requires reading t = O (cid:16) nk log nǫ (cid:17) entries of A . Constructing AS S in Step 4 and AS in Step 5both require reading O (cid:16) nk log k + nk ǫ (cid:17) entries.For runtime, computing Z in Step 3 requires O (nnz( ˜ A )) + ˜ O ( √ nk ω − · poly(1 /ǫ )) time using aninput sparsity time algorithm (e.g. by Theorem 27 of [CEM +
15] or using input sparsity time ridgeleverage score sampling [CMM17] in conjunction with the spectral norm PCP result of Lemma24). Computing M in Step 4 and N in Step 5 both require computing the pseudoinverse of a k × O (cid:16) k log k + k ǫ (cid:17) matrix in ˜ O ( k ω +1 poly(1 /ǫ )) time, and then applying this to an n × O (cid:16) k log k + k ǫ (cid:17) matrix requiring ˜ O (cid:0) nk ω ǫ (cid:1) time. We now demonstrate how Theorem 25 can be leveraged to give a sublinear time, relative erroralgorithm for approximately solving the ridge regression problem:min x ∈ R n k Ax − y k + λ k x k (19)for PSD A . We begin with a lemma showing that any approximation to A with small spectral normerror can be used to approximately solve (19) up to relative error. Lemma 26 (Ridge Regression via Spectral Norm Low-Rank Approximation) . For any PSD A ∈ R n × n , y ∈ R n , regularization parameter λ ≥ and B with k A − B k ≤ ǫ λ , let ˜ x ∈ R n be anyvector satisfying: k B ˜ x − y k + λ k ˜ x k ≤ (1 + α ) · min x ∈ R n k Bx − y k + λ k x k we have: k A ˜ x − y k + λ k ˜ x k ≤ (1 + α )(1 + 5 ǫ ) min x ∈ R n k Ax − y k + λ k x k . Proof.
For any x ∈ R^n we have:

‖Bx − y‖_2² + λ‖x‖_2² = ‖Ax − y‖_2² + ‖(B − A)x‖_2² + 2x^T(B − A)^T(Ax − y) + λ‖x‖_2².

Now ‖(B − A)x‖_2² ≤ ǫ²λ‖x‖_2² ≤ ǫλ‖x‖_2², and further

2|x^T(B − A)^T(Ax − y)| ≤ 2‖(A − B)x‖_2·‖Ax − y‖_2 ≤ 2ǫ√λ‖x‖_2·‖Ax − y‖_2 ≤ ǫ(‖Ax − y‖_2² + λ‖x‖_2²),

where the last step uses the AM-GM inequality. So for any x, ‖Bx − y‖_2² + λ‖x‖_2² ∈ (1 ± 2ǫ)(‖Ax − y‖_2² + λ‖x‖_2²), which gives the lemma, since any nearly optimal x̃ for the ridge regression problem on B will also be nearly optimal for A.
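Once a factored approximation B = MN^T with small spectral norm error is in hand, the ridge problem on B can be solved exactly in time scaling only linearly in n. The sketch below is our own helper, not the paper's pseudocode; it uses a thin factorization of MN^T so that no n × n matrix is ever formed.

```python
import numpy as np

def ridge_via_lowrank(M, N, y, lam):
    """Solve argmin_x ||M N^T x - y||^2 + lam ||x||^2 exactly, given rank-k factors
    M, N (n x k). Cost is O(n k^2) rather than O(n^2) for a dense solve."""
    Q, R = np.linalg.qr(M)                                   # M = Q R, Q is n x k orthonormal
    U, s, Vt = np.linalg.svd(R @ N.T, full_matrices=False)   # thin SVD of M N^T = (Q U) diag(s) Vt
    # Ridge solution: x = V diag(s/(s^2 + lam)) (Q U)^T y.
    return Vt.T @ ((s / (s ** 2 + lam)) * ((Q @ U).T @ y))

# Toy check against the dense solution.
rng = np.random.default_rng(0)
n, k, lam = 300, 10, 0.5
M, N, y = rng.standard_normal((n, k)), rng.standard_normal((n, k)), rng.standard_normal(n)
B = M @ N.T
x_dense = np.linalg.solve(B.T @ B + lam * np.eye(n), B.T @ y)
print(np.linalg.norm(ridge_via_lowrank(M, N, y, lam) - x_dense))
```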
Combining Lemma 26 with Theorem 25 gives the following:

Theorem 27 (Sublinear Time Ridge Regression). Given any PSD A ∈ R^{n×n}, regularization parameter λ ≥ 0, y ∈ R^n, and an upper bound s̃_λ on the statistical dimension s_λ := tr((A + λI)^{−1}A), there is an algorithm accessing Õ(n·s̃_λ/ǫ) entries of A and running in Õ(n·s̃_λ^ω/ǫ^ω) time which outputs x̃ satisfying:

‖Ax̃ − y‖_2² + λ‖x̃‖_2² ≤ (1 + ǫ)·min_{x∈R^n} ‖Ax − y‖_2² + λ‖x‖_2².

When s̃_λ ≪ n, as is often the case, the above significantly improves upon state-of-the-art input sparsity time runtimes for general matrices [ACW16].

Proof. Let k = c·s̃_λ/ǫ for a sufficiently large constant c. We have:

s_λ = Σ_{i=1}^n λ_i(A)/(λ_i(A) + λ) ≥ Σ_{i : λ_i(A) ≥ ǫλ} λ_i(A)/((1 + 1/ǫ)λ_i(A)) ≥ (ǫ/2)·|{i : λ_i(A) ≥ ǫλ}|.

So |{i : λ_i(A) ≥ ǫλ}| ≤ 2s_λ/ǫ, and for large enough c and k = c·s̃_λ/ǫ ≥ c·s_λ/ǫ we have ‖A − A_k‖_2 ≤ ǫλ and can run Algorithm 2 with error parameter ǫ′ = Θ(1) to find M, N ∈ R^{n×k} with ‖A − MN^T‖_2 ≤ ǫλ. We can then apply Lemma 26, solving x̃ = argmin_{x∈R^n} ‖MN^Tx − y‖_2² + λ‖x‖_2² exactly using an SVD in O(nk^{ω−1}) time. x̃ will be a (1 + O(ǫ))-approximate solution for A, which, after adjusting constants on ǫ, gives the theorem. The runtime follows from Theorem 25 with k = c·s̃_λ/ǫ and ǫ′ = Θ(1); the O(nk^{ω−1}) regression cost is dominated by the cost of computing the low-rank approximation.

Note that Theorem 27 ensures that if k ≥ c·s_λ/ǫ for some constant c, x̃ is a good approximation to the ridge regression problem. Setting k properly requires some knowledge of an upper bound s̃_λ on s_λ. A constant factor approximation to s_λ can be computed in Õ(n^{3/2}·poly(s_λ)) time using, for example, a column PCP as given by Lemma 3 and binary searching for an appropriate k value. An interesting open question is whether s_λ can be approximated more quickly, specifically with linear dependence on n. This question is closely related to whether it is possible to estimate the cost ‖A − A_k‖_F² in Õ(n·poly(k)) time, which surprisingly is also open.

Acknowledgements
The authors thank IBM Almaden where part of this work was done. David Woodruff also thanks theSimons Institute program on Machine Learning and the XDATA program of DARPA for support.
References [ACW16] Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Sharper bounds for re-gression and low-rank approximation with regularization, 2016.[AFK +
01] Yossi Azar, Amos Fiat, Anna R. Karlin, Frank McSherry, and Jared Saia. Spectralanalysis of data. In
Proceedings of the 33rd Annual ACM Symposium on Theory ofComputing (STOC) , pages 619–626, 2001.[AFKM01] Dimitris Achlioptas, Amos Fiat, Anna R. Karlin, and Frank McSherry. Web search viahub synthesis. In
Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 500–509, 2001. [AGR16] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In
Proceedings of the 29th Annual Conference on Computational Learning The-ory (COLT) , pages 103–115, 2016.[AM05] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distri-butions. In
Proceedings of the 18th Annual Conference on Computational LearningTheory (COLT) , pages 458–469, 2005.[AM07] Dimitris Achlioptas and Frank McSherry. Fast computation of low-rank matrix ap-proximations.
Journal of the ACM , 54(2), 2007.[BDN15] Jean Bourgain, Sjoerd Dirksen, and Jelani Nelson. Toward a unified theory of sparsedimensionality reduction in Euclidean space. In
Proceedings of the 47th Annual ACMSymposium on Theory of Computing (STOC) , pages 499–508, 2015.[BW09] Mohamed-Ali Belabbas and Patrick J Wolfe. Spectral methods in machine learningand new strategies for very large datasets.
Proceedings of the National Academy ofSciences , 106(2):369–374, 2009.[BY02] Ziv Bar-Yossef.
The Complexity of Massive Data Set Computations . PhD thesis, Uni-versity of California at Berkeley, 2002.[CC00] Trevor F Cox and Michael AA Cox.
Multidimensional scaling . CRC press, 2000.[CEM +
15] Michael B Cohen, Sam Elder, Cameron Musco, Christopher Musco, and MadalinaPersu. Dimensionality reduction for k-means clustering and low rank approximation.In
Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC) ,pages 163–172, 2015.[CMM17] Michael B Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In
Proceedings of the 28th AnnualACM-SIAM Symposium on Discrete Algorithms (SODA) , 2017.[Coh16] Michael B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities.In
Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms(SODA) , pages 278–287, 2016.[CR09] Emmanuel J Cand`es and Benjamin Recht. Exact matrix completion via convex opti-mization.
Foundations of Computational Mathematics , 9(6):717–772, 2009.[CW13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regressionin input sparsity time. In
Proceedings of the 45th Annual ACM Symposium on Theoryof Computing (STOC) , pages 81–90, 2013.[CW17] Ken Clarkson and David P. Woodruff. Low-rank PSD approximation in input-sparsitytime. In
Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms(SODA) , 2017.[DFK +
04] Petros Drineas, Alan M. Frieze, Ravi Kannan, Santosh Vempala, and V. Vinay. Cluster-ing large graphs via the singular value decomposition.
Machine Learning, 56(1-3):9–33, 2004. [DKM06] Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication.
SIAM Journal on Computing ,36(1):132–157, 2006.[DKR02] Petros Drineas, Iordanis Kerenidis, and Prabhakar Raghavan. Competitive recommen-dation systems. In
Proceedings of the 34th Annual ACM Symposium on Theory ofComputing (STOC) , pages 82–90, 2002.[DLWZ14] Xuefeng Duan, Jiaofen Li, Qingwen Wang, and Xinjun Zhang. Low rank approximationof the symmetric positive semidefinite matrix.
Journal of Computational and AppliedMathematics , 260:236–243, 2014.[DM05] Petros Drineas and Michael W Mahoney. On the Nystr¨om method for approximat-ing a Gram matrix for improved kernel-based learning.
Journal of Machine LearningResearch , 6:2153–2175, 2005.[DMM06] Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Subspace sampling andrelative-error matrix approximation: Column-row-based methods. In
European Sym-posium on Algorithms , pages 304–314. Springer, 2006.[DMM08] Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Relative-error CUR matrixdecompositions.
SIAM Journal on Matrix Analysis and Applications , 30(2), 2008.[DRVW06] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix ap-proximation and projective clustering via volume sampling. In
Proceedings of the 17thAnnual ACM-SIAM Symposium on Discrete Algorithms (SODA) , 2006.[DV06] Amit Deshpande and Santosh Vempala. Adaptive sampling and fast low-rank matrixapproximation. In
Approximation, Randomization, and Combinatorial Optimization:Algorithms and Techniques , pages 292–303. Springer, 2006.[FKV04] Alan M. Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms forfinding low-rank approximations.
Journal of the ACM , 51(6):1025–1041, 2004.[FSS13] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data:Constant-size coresets for k -means, PCA, and projective clustering. In Proceedingsof the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages1434–1453, 2013.[Git11] Alex Gittens. The spectral norm error of the naive Nystr¨om extension. arXiv:1110.5305 ,2011.[GLF +
10] David Gross, Yi-Kai Liu, Steven T Flammia, Stephen Becker, and Jens Eisert. Quan-tum state tomography via compressed sensing.
Physical review letters , 105(15):150401,2010.[GM13] Alex Gittens and Michael W. Mahoney. Revisiting the Nystr¨om method for improvedlarge-scale machine learning. In
Proceedings of the 30th International Conference on Machine Learning (ICML), 2013. Preliminary version at arXiv:1303.1849. [Gu14] Ming Gu. Subspace iteration randomization and singular value problems. arXiv:1408.2208, 2014. [Har14] Moritz Hardt. Understanding alternating minimization for matrix completion. In
Proceedings of the 55th Annual IEEE Symposium on Foundations of Computer Science(FOCS) , pages 651–660, 2014.[Hof03] Thomas Hofmann. Collaborative filtering via Gaussian probabilistic latent semanticanalysis. In
Proceedings of the 26th Annual International ACM SIGIR Conference onResearch and Development in Information Retrieval (SIGIR) , pages 259–266, 2003.[JNS13] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completionusing alternating minimization. In
Proceedings of the 45th Annual ACM Symposiumon Theory of Computing (STOC) , pages 665–674, 2013.[Kle99] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment.
Journal of theACM , 46(5):604–632, 1999.[KMT09] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling techniques for theNystr¨om method. In
Proceedings of the 12th International Conference on ArtificialIntelligence and Statistics (AISTATS) , pages 304–311, 2009.[KSV08] Ravindran Kannan, Hadi Salmasian, and Santosh Vempala. The spectral method forgeneral mixture models.
SIAM Journal on Computing , 38(3):1141–1156, 2008.[LBKW14] Yingyu Liang, Maria-Florina Balcan, Vandana Kanchanapally, and David P. Woodruff.Improved distributed principal component analysis. In
Advances in Neural InformationProcessing Systems 27 (NIPS) , pages 3113–3121, 2014.[LJS16] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast DPP sampling for Nystr¨om withapplication to kernel methods. In
Proceedings of the 33rd International Conference onMachine Learning (ICML) , pages 2061–2070, 2016.[LKL10] Mu Li, James Tin-Yau Kwok, and Baoliang Lu. Making large-scale Nystr¨om approx-imation possible. In
Proceedings of the 27th International Conference on MachineLearning (ICML) , page 631, 2010.[MM13] Xiangrui Meng and Michael W. Mahoney. Low-distortion subspace embeddings ininput-sparsity time and applications to robust linear regression. In
Proceedings of the45th Annual ACM Symposium on Theory of Computing (STOC) , pages 91–100, 2013.[MM15] Cameron Musco and Christopher Musco. Randomized block Krylov methods forstronger and faster approximate singular value decomposition. In
Advances in Neu-ral Information Processing Systems 28 (NIPS) , pages 1396–1404, 2015.[MM16] Cameron Musco and Christopher Musco. Recursive sampling for the Nystr¨om method. arXiv:1605.07583 , 2016.[NN13] Jelani Nelson and Huy L. Nguyen. OSNAP: Faster numerical linear algebra algorithmsvia sparser subspace embeddings. In
Proceedings of the 54th Annual IEEE Symposiumon Foundations of Computer Science (FOCS) , pages 117–126, 2013.[PRTV00] Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala.Latent semantic indexing: A probabilistic analysis.
Journal of Computer and System Sciences, 61(2):217–235, 2000. [Sar06] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In
Proceedings of the 47th Annual IEEE Symposium on Foundations of Com-puter Science (FOCS) , pages 143–152, 2006.[SWZ16] Zhao Song, David P. Woodruff, and Huan Zhang. Sublinear time orthogonal tensordecomposition. In
Advances in Neural Information Processing Systems 29 (NIPS) ,2016.[SY05] Anthony Man-Cho So and Yinyu Ye. Theory of semidefinite programming for sensornetwork localization. In
Proceedings of the 16th Annual ACM-SIAM Symposium onDiscrete Algorithms (SODA) , pages 405–414, 2005.[Tro15] Joel A Tropp. An introduction to matrix concentration inequalities. arXiv:1501.01571 ,2015.[TYUC16] Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Randomized single-view algorithms for low-rank matrix approximation. arXiv:1609.00048 , 2016.[WLZ16] Shusen Wang, Luo Luo, and Zhihua Zhang. SPSD matrix approximation vis columnselection: theories, algorithms, and extensions.
Journal of Machine Learning Research ,17(49):1–49, 2016.[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra.
Foundations andTrends in Theoretical Computer Science , 10(1-2):1–157, 2014.[WZ13] Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and theNystr¨om approximation via adaptive sampling.
Journal of Machine Learning Research ,14(1):2729–2769, 2013.[Yao77] Andrew Chi-Chin Yao. Probabilistic computations: Toward a unified measure of com-plexity. In
Proceedings of the 18th Annual IEEE Symposium on Foundations of Com-puter Science (FOCS) , pages 222–227, 1977.[YH38] Gale Young and Alston S Householder. Discussion of a set of points in terms of theirmutual distances.
Psychometrika , 3(1):19–22, 1938.[YS07] Stephen J Young and Edward R Scheinerman. Random dot product graph models forsocial networks. In
International Workshop on Algorithms and Models for the Web-Graph , pages 138–149. Springer, 2007.[ZTK08] Kai Zhang, Ivor W Tsang, and James T Kwok. Improved Nystr¨om low-rank approx-imation and error analysis. In
Proceedings of the 25th International Conference on Machine Learning (ICML), pages 1232–1239, 2008.
A  Low-Rank Approximation of A via Approximation of A^{1/2}

We first observe that a low-rank approximation for A^{1/2} does not imply a good low-rank approximation for A. Intuitively, if A has a large top singular value, the low-rank approximation for A must capture the corresponding singular direction with significantly more accuracy than a good low-rank approximation for A^{1/2}, in which the singular value is relatively much smaller.
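As a quick numerical illustration of the positive direction developed in this appendix (a sketch we add for intuition, using a basic randomized SVD rather than the paper's algorithm): for a PSD A with decaying spectrum, a near-optimal rank-k projection P for A^{1/2} yields A^{1/2}PA^{1/2} with small relative Frobenius error on A, in the spirit of Theorem 29 below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 300, 5
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = 2.0 ** -np.arange(n)                     # geometrically decaying PSD spectrum
A = (U * lam) @ U.T
A_half = (U * np.sqrt(lam)) @ U.T

# Near-optimal rank-k projection for A^{1/2} via a basic randomized SVD sketch.
Q, _ = np.linalg.qr(A_half @ rng.standard_normal((n, k + 10)))
Ub, _, _ = np.linalg.svd(Q.T @ A_half, full_matrices=False)
W = Q @ Ub[:, :k]                              # top-k approximate singular directions
P = W @ W.T

opt_half = np.sum(lam[k:])                     # ||A^{1/2} - (A^{1/2})_k||_F^2
opt_A = np.sum(lam[k:] ** 2)                   # ||A - A_k||_F^2
print(np.linalg.norm(A_half - A_half @ P, 'fro') ** 2 / opt_half)
print(np.linalg.norm(A - A_half @ P @ A_half, 'fro') ** 2 / opt_A)
```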
Theorem 28. For any k, ǫ there exists a PSD matrix A ∈ R^{n×n} and a rank-k matrix B such that ‖A^{1/2} − B‖_F² ≤ (1 + ǫ)‖A^{1/2} − (A^{1/2})_k‖_F², but for every matrix C in the rowspan of B,

‖A − C‖_F² ≥ (ǫ(n − k − 1)·λ_1(A)/λ_{k+1}(A))·‖A − A_k‖_F².

Notably, if we set B = A^{1/2}P for some rank-k orthogonal projection P, AP can be an arbitrarily bad low-rank approximation of A.

Proof.
Let A ∈ R n × n be a diagonal matrix with A i,i = α for i = 1 , ..., k , A k +1 ,k +1 = 0 and allother diagonal entries equal to β , where α > β >
0. Let B be a rank k matrix which has its last n − k rows all zero. For i = 1 , ..., k , let B i,i = A / i,i and B ,k +2 = p ǫ ( n − k − · β . We have: k A / − A / k k F = ( n − k − β and k A / − B k F = (1 + ǫ )( n − k − β . Note that the firstrow b aligns somewhat well with a , but as we will see, not well enough to give a good low-rankapproximation for A itself.Let C be the projection of A onto the rowspan of B , which gives the optimal low-rank approx-imation to A within this span. For i = 2 , ..., k , c i = a i , since A and B match exactly on these rowsup to a scaling. For i > k , c i = ~
0. Finally, c = b k b k · h b , a i = b · (cid:16) α α + ǫ ( n − k − β (cid:17) . Overall: k A − C k F = ( n − k − β + ( A , − C , ) + ( A ,k +2 − C ,k +2 ) ≥ ( n − k − β + p ǫ ( n − k − · βα α + ǫ ( n − k − β ! ≥ ( n − k − β · (1 + ǫ ( n − k − α / β )= (1 + ǫ ( n − k − α /β ) · k A − A k k F . By setting α ≫ β we can make this approximation arbitrarily bad. Note that α /β = λ ( A ) /λ k +1 ( A ).This ratio will be large whenever A is well approximated by a low-rank matrix.Despite the above example, we can show that for a projection P , if A / P is a very near optimallow-rank approximation of A / then A / P A / is a relative error low-rank approximation of A : Theorem 29.
Let P ∈ R n × n be an orthogonal projection matrix such that k A / − A / P k F ≤ (1 + ǫ/ √ n ) k A / − A / k k F . Then: k A − A / P A / k F ≤ (1 + 3 ǫ ) k A − A k k F . Proof.
We can rewrite using the fact that ( I − P ) = ( I − P ) since it is a projection: k A − A / P A / k F = k A / ( I − P ) A / k F = k A / ( I − P ) k . δ i = σ i ( A / ( I − P )) denote the i th singular value of A / ( I − P ). Let λ i be the i th eigenvalueof A . By the assumption that P gives a near optimal low-rank approximation of A / : n − k X i =1 δ i ≤ n X i = k +1 λ i + ǫ/ √ n k A / − A / k k F . Additionally, by Weyl’s monotonicity theorem (see e.g. Theorem 3.2 in [Gu14] and proof of Lemma15 in [MM15]), for all i , δ i ≥ λ / i + k . We thus have: k A − A / P A / k F = n − k X i =1 δ i ≤ n X i = k +2 λ i + (cid:16) λ k +1 + ǫ/ √ n k A / − A / k k F (cid:17) . If λ k +1 ≥ / √ n ·k A / − A / k k F then (cid:16) λ k +1 + ǫ/ √ n k A / − A / k k F (cid:17) ≤ (1+ ǫ ) λ k +1 ≤ (1+3 ǫ ) λ k +1 and hence: k A − A / P A / k F ≤ (1 + 3 ǫ ) n X i = k +1 λ i = (1 + 3 ǫ ) k A − A k k F . Alternatively if λ k +1 ≤ / √ n · k A / − A / k k F then (cid:16) λ k +1 + ǫ/ √ n k A / − A / k k F (cid:17) ≤ (cid:16) (1 + ǫ ) / √ n k A / − A / k k F (cid:17) ≤ (1 + ǫ ) k A − A k k F which also gives the theorem. A.1 PSD Low-Rank Approximation in n . · poly( k/ǫ ) Time
We now combine Theorem 29 with the ridge leverage score sampling algorithm of [MM16] to givea sublinear time algorithm for low-rank approximation of A reading n / · poly( k/ǫ ) entries of thematrix and running in n . · poly( k/ǫ ) time. We note that this approach could also be used withadaptive sampling [DV06] or volume sampling techniques [AGR16], as outlined in the introduction. Theorem 30.
There is an algorithm based off ridge leverage score sampling which, given PSD A ∈ R n × n with high probability outputs M ∈ R n × k with k A − M M T k F ≤ (1 + ǫ ) k A − A k k F . Thealgorithm reads ˜ O ( n / k/ǫ ) entries of A and runs in ˜ O (cid:0) n ( ω +1) / · ( k/ǫ ) ω − (cid:1) time where ω < . is the exponent of matrix multiplication. This follows from Lemma 4, adapted from Theorem 8 of [MM16], which shows that it is possibleto estimate the ridge leverage scores of A / with ˜ O ( nk ) accesses to A and O ( nk ω − ) time. We canuse these scores to sample a set of rows from A / whose span contains a near optimal low-rankapproximation. Specifically we have: Lemma 31 (Theorem 7 of [CMM17]) . For any B ∈ R n × n , for i ∈ { , . . . , n } , let ˜ τ ki ≥ τ ki ( B ) be anoverestimate for the i th rank- k ridge leverage score of B . Let p i = ˜ τ ki P i ˜ τ ki and t = c (cid:16) log k + log(1 /δ ) ǫ (cid:17) P ki ˜ τ ki for ǫ < and some sufficiently large constant c . Construct R by sampling t rows of B , each set torow b i with probability p i . With probability − δ , letting P R be the projection onto the rows of R , k B − ( BP R ) k k F ≤ (1 + ǫ ) k B − B k k F . BP R ) k can be written as a row projection of B – onto the top k singular vectors of BP R . So, if we compute for each i , ˜ τ ki ≥ τ ki ( A / ) using Lemma 4, set ǫ ′ = ǫ/ √ n , and let S bea sampling matrix selecting ˜ O ( P ˜ τ ki /ǫ ′ ) = ˜ O ( k √ n/ǫ ) rows of A / , then by Theorem 29, letting P be the projection onto the rows of SA / , we have k A − ( A / P ) k ( A / P ) Tk k F ≤ (1 + ǫ ) k A − A k k F .Finally, ( A / P ) k ( A / P ) Tk = ( A / P A / ) k = ( AS ( S T AS ) + S T A ) k . We can compute a factor-ization of this matrix by computing ( S T AS ) + / , then computing AS ( S T AS ) + / and taking theSVD of this matrix. Since S has ˜ O ( k √ n/ǫ ) columns, using fast matrix multiplication this requirestime ˜ O ( n · ( k √ n/ǫ ) ω − ) = ˜ O ( n ( ω +1) / · ( k/ǫ ) ω − ) and ˜ O ( n / k/ǫ ) accesses to A (to read the entriesof AS ), giving Theorem 30. B Additional Proofs for Main Algorithm
Lemma 2 (Sum of Ridge Leverage Scores). For any A ∈ R^{n×d}, Σ_{i=1}^d τ_i^k(A) ≤ 2k.

Proof. We rewrite Definition 1 using A's singular value decomposition A = UΣV^T:

τ_i^k(A) = a_i^T (UΣ²U^T + (‖A − A_k‖_F²/k)·UU^T)^+ a_i = a_i^T (U Σ̄ U^T) a_i,

where Σ̄_{i,i} = 1/(σ_i²(A) + ‖A − A_k‖_F²/k). We then have:

Σ_i τ_i^k(A) = tr(A^T U Σ̄ U^T A) = tr(V Σ Σ̄ Σ V^T) = tr(Σ² Σ̄),

and (Σ² Σ̄)_{i,i} = σ_i²(A)/(σ_i²(A) + ‖A − A_k‖_F²/k). For i ≤ k we simply upper bound each of these terms by 1. So:

tr(Σ² Σ̄) ≤ k + Σ_{i=k+1} σ_i²(A)/(σ_i²(A) + ‖A − A_k‖_F²/k) ≤ k + (k/‖A − A_k‖_F²)·Σ_{i=k+1} σ_i²(A) = 2k.
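A direct numerical check of Lemma 2 follows; it is a sketch with our own helper name, and the rank-k ridge leverage scores are computed densely from the definition rather than by the paper's fast approximation.

```python
import numpy as np

def rank_k_ridge_leverage_scores(A, k):
    """tau_i^k(A) = a_i^T (A A^T + (||A - A_k||_F^2 / k) I)^+ a_i, over columns a_i."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    ridge = np.sum(s[k:] ** 2) / k                     # ||A - A_k||_F^2 / k
    K_inv = (U / (s ** 2 + ridge)) @ U.T               # inverse restricted to the column span
    return np.einsum('ij,jk,ki->i', A.T, K_inv, A)     # diag(A^T K_inv A)

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 150)) @ np.diag(0.8 ** np.arange(150))
k = 10
tau = rank_k_ridge_leverage_scores(A, k)
print(tau.sum(), '<=', 2 * k)                          # Lemma 2: the sum is at most 2k
```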
We first prove our row sampling PCP result for spectral norm error, and follow with the closely related proof of Lemma 7, which gives Frobenius norm error.

Lemma 8 (Spectral Norm Row PCP). For any PSD A ∈ R^{n×n} and ǫ < 1, let k′ = ⌈ck/ǫ⌉ and let τ̃_i^{k′}(A^{1/2}) ≥ τ_i^{k′}(A^{1/2}) be an overestimate for the i-th rank-k′ ridge leverage score of A^{1/2}. Let ℓ̃_i = (4/ǫ)·√(n/k)·τ̃_i^{k′}(A^{1/2}), p_i = ℓ̃_i/Σ_i ℓ̃_i, and t = (c′ log n/ǫ²)·Σ_i ℓ̃_i. Construct weighted sampling matrices S_1, S_2 ∈ R^{n×t}, where the j-th column is set to e_i/√(t p_i) with probability p_i.

For sufficiently large constants c, c′, with high probability, letting Ã = S_2^T A S_1, for any orthogonal projection P ∈ R^{t×t}:

(1 − ǫ)‖AS_1(I − P)‖_2² − (ǫ/k)‖A − A_k‖_F² ≤ ‖Ã(I − P)‖_2² ≤ (1 + ǫ)‖AS_1(I − P)‖_2² + (ǫ/k)‖A − A_k‖_F².

We refer to Ã as an (ǫ, k)-spectral PCP of AS_1. Note that if τ̃_i^{k′}(A^{1/2}) is a constant factor approximation to τ_i^{k′}(A^{1/2}), then t = O(√(nk)·log n·poly(1/ǫ)). We use "with high probability" to mean with probability ≥ 1 − 1/n^d for some large constant d.

Proof.
For conciseness write C def = AS and write the eigendecomposition A = U Λ U T with λ i = Λ i,i .Applying the triangle inequality we have: k ˜ A ( I − P ) k = k ( I − P ) ˜ A T ˜ A ( I − P ) k = k ( I − P )[ C T C + ( ˜ A T ˜ A − C T C )]( I − P ) k ∈ k C ( I − P ) k ± k ( I − P )( ˜ A T ˜ A − C T C )( I − P ) k Thus to show the Lemma it suffices to show: k ( I − P )( ˜ A T ˜ A − C T C )( I − P ) k ≤ ǫ k C ( I − P ) k + ǫk k A − A k k F . (20)Let m be the largest index with λ m ≥ ǫ k k A − A k k F . Let U H def = U m contain the top m ‘head’eigenvectors of A and let U T contain the remaining ‘tail’ eigenvectors. Let C H = U H U TH C and C T = U T U TT C . C H + C T = C and so by the triangle inequality: k ( I − P )( ˜ A T ˜ A − C T C )( I − P ) k ≤ k ( I − P )( C TH S S T C H − C TH C H )( I − P ) k + k ( I − P )( C TT S S T C T − C TT C T )( I − P ) k +2 k ( I − P ) C TH S S T C T ( I − P ) k . (21)We bound each of the terms in the above sum separately. Specifically we show: • Head Term: k ( I − P )( C TH S S T C H − C TH C H )( I − P ) k ≤ ǫ k C ( I − P ) k • Tail Term: k ( I − P )( C TT S S T C T − C TT C T )( I − P ) k ≤ ǫ k k A − A k k F . • Cross Term: k ( I − P ) C TH S S T C T ( I − P ) k ≤ ǫk k A − A k k F + 4 ǫ k C ( I − P ) k F .Combining these three bounds, after adjusting constant factors on ǫ by making the constants c and c ′ in the rank parameter k ′ and sample size t large enough, gives (20) and thus the lemma.For the remainder of the proof we thus fix c = 1 so k ′ = ⌈ k/ǫ ⌉ . Head Term:
We first show that the ridge scores of A / upper bound the standard leverage scores of U m . Lemma 32.
For any p with λ p ( A ) ≥ k k A − A k k F we have: r nk · τ ki ( A / ) ≥ k ( U p ) i k . where k ( U p ) i k is the i th row norm of U p , whose columns are the top p eigenvectors of A .Proof. If U p contains no eigenvectors, this is true vacuously as all row norms are 0. Otherwise τ ki ( A ) = a Ti (cid:18) A + k A − A k k F k I (cid:19) − a i = e Ti U ˆΛ U T e i j,j = λ j λ j + k A − Ak k Fk . We can then write: τ ki ( A ) = n X j =1 U i,j · ˆΛ j,j ≥ p X j =1 U i,j · ˆΛ j,j (truncate sum) ≥ p X j =1 U i,j · λ j λ j ! (By assumption, λ j ( A ) ≥ λ p ( A ) ≥ k k A − A k k F ) ≥ m X j =1 U i,j = 12 k ( U p ) i k . This gives the lemma combined with the fact that τ ki ( A ) ≤ p nk τ ki ( A / ) by Lemma 5.Applying Lemma 32 with k ′ = ⌈ k/ǫ ⌉ , since λ m ( A ) ≥ k ′ k A − A k ′ k F , we have:˜ ℓ i = r nǫ k · ˜ τ k ′ i ( A / ) ≥ k ( U m ) i k = k ( U H ) i k Further, t = c ′ log nǫ · P i ˜ ℓ i so if we set c ′ large enough, we have by a standard matrix Chernoffbound (see Lemma 33) since U H is an orthogonal span for the columns of C H and hence and itsrow norms are the leverage scores of C H , with high probability:(1 − ǫ ) C TH C H (cid:22) C TH S S T C H ≺ (1 + ǫ ) C TH C H . (22)This in turn gives: k ( I − P )( C TH S S T C H − C TH C H )( I − P ) k ≤ ǫ k ( I − P ) C TH C H ( I − P ) k = ǫ k C H ( I − P ) k ≤ ǫ k C ( I − P ) k . (23) Tail Term:
We can loosely bound via the triangle inequality and the fact that k I − P k ≤ k ( I − P )( C TT S S T C T − C TT C T )( I − P ) k ≤ k S C T k + k C T k . (24)Since k ′ = ⌈ k/ǫ ⌉ :˜ ℓ i = 4 ǫ r nk · ˜ τ k ′ i ( A / ) ≥ x Ti A + ǫ k A / − A / k k F √ nk ! + x i . Thus for S = S or S = S , for sufficiently large c ′ in our sample size t = c ′ log nǫ , by a matrixChernoff bound (Corollary 34), with high probability(1 − ǫ ) A / SS T A / − ǫ k A / − A / k k F √ nk (cid:22) A (cid:22) (1 + ǫ ) A / SS T A / + ǫ k A / − A / k k F √ nk . Which in turn implies σ ( S T A / ( I − U T U TT )) ≤ (1 + ǫ ) σ ( A / ( I − U T U TT )) + ǫ k A / − A / k k F √ nk . k S T C T k = k S T ( I − U T U TT ) AS k ≤ σ ( S T A / ( I − U T U TT )) · σ (( I − U T U TT ) A / S ) ≤ ǫ ) σ ( A / ( I − U T U T ) T ) + 2 ǫ k A / − A / k k F nk ≤ k A ( I − U T U TT ) k + 2 ǫ k A − A k k F k (by ℓ / ℓ bound.) ≤ ǫ k A − A k k F k . ( k A ( I − U T U TT ) k ≤ ǫ k A − A k k F k by definition.)Similarly we have: k C T k = σ (( I − U T U TT ) AS ) ≤ σ ( S T A / ( I − U T U TT )) · σ ( A / ( I − U T U TT )) ≤ (1 + ǫ ) σ ( A / ( I − U T U TT )) + ǫ k A / − A / k k F √ nk · σ ( A / ( I − U T U TT )) ≤ ǫ k A − A k k F k . So overall, plugging back into (24) gives: k ( I − P )( C TT S S T C T − C TT C T )( I − P ) k ≤ ǫ k k A − A k k F . (25) Cross Term:
By submultiplicativity of the spectral norm: k ( I − P ) C TT S S T C H ( I − P ) k ≤ k C TT S k · k S T C H ( I − P ) k . As shown above for our tail bound, k C TT S k ≤ q ǫ k k A − A k k F . Further, as shown in (22), S gives a subspace embedding for C H so we have: k ( I − P ) C TT S S T C H ( I − P ) k ≤ r ǫ k k A − A k k F · (1 + ǫ ) k C H ( I − P ) k ≤ ǫk k A − A k k F + 4 ǫ k C ( I − P ) k . (26)where the last bound follows form the AMGM inequality.Plugging our head (23), tail (25) and cross term (26) bounds in (21), and adjusting constantson ǫ by making c and c ′ sufficiently large gives the lemma. Lemma 7 (Frobenius Norm Row PCP) . For any PSD A ∈ R n × n and ǫ ≤ let k ′ = ⌈ ck/ǫ ⌉ andlet ˜ τ k ′ i ( A / ) ≥ τ k ′ i ( A / ) be an overestimate for the i th rank- k ′ ridge leverage score of A / . Let ˜ ℓ i = q nǫk · ˜ τ k ′ i ( A / ) , p i = ˜ ℓ i P i ˜ ℓ i , and t = c ′ log nǫ P i ˜ ℓ i . Construct weighted sampling matrices S , S ∈ R n × t each whose j th column is set to √ tp i e i with probability p i . or sufficiently large constants c, c ′ , with probability , letting ˜ A = S T AS , for any rank- k orthogonal projection P ∈ R t × t : (1 − ǫ ) k AS ( I − P ) k F ≤ k ˜ A ( I − P ) k F + ∆ ≤ (1 + ǫ ) k AS ( I − P ) k F for some fixed ∆ (independent of P ) with | ∆ | ≤ k A − A k k F . Note that if ˜ τ k ′ i ( A / ) is a constant factor approximation to τ k ′ i ( A / ), t = O (cid:16) √ nk log nǫ . (cid:17) . Proof.
The proof is similar to that of Lemma 8. Denote C def = AS and write the eigendecomposition A = U Λ U T with λ i = Λ i,i . Let m be the largest index with λ m ≥ ǫk k A − A k ′ k F . Let U H def = U m contain the top m ‘head’ eigenvectors of A and let U T contain the remaining ‘tail’ eigenvectors.Let C H = U H U TH C and C T = U T U TT C and note that C H + C T = C .By the Pythagorean theorem we have k C ( I − P ) k F = k C H ( I − P ) k F + k C T ( I − P ) k F . Expandingusing the identity k M k F = tr( M T M ), k ˜ A ( I − P ) k F = k S T C H ( I − P ) k F + k S T C T ( I − P ) k F + 2 tr (cid:0) ( I − P ) C TH S S T C T ( I − P ) (cid:1) (27)We bound each of the terms in the above sum separately. Specifically we show: • Head Term: k S T C H ( I − P ) k F ∈ (1 ± ǫ ) k C H ( I − P ) k F • Tail Term: k S T C T ( I − P ) k F + ∆ ∈ k C T ( I − P ) k F ± ǫ k A − A k k F where | ∆ | ≤ k A − A k k F . • Cross Term: | tr (cid:0) ( I − P ) C TH S S T C T ( I − P ) (cid:1) | ≤ ǫ k C ( I − P ) k F . It is not hard to see that combining these bounds gives the Lemma. k A − A k k F ≤ (1+3 ǫ ) k C ( I − P ) k F since C is an ( ǫ, k )-column PCP of A by Lemmas 3, 5, and the fact that rank- k ′ ridge scores upperbound the rank- k scores (as long as c > k A − A k k F ≤ k A ( I − P ) k F for any rank- kP . So plugging into (27) we have: k ˜ A ( I − P ) k + ∆ ∈ (1 ± ǫ ) k C ( I − P ) k ± (2 + 3 ǫ ) ǫ k C ( I − P ) k F ∈ (1 ± O ( ǫ )) k C ( I − P ) k . This gives the lemma by adjusting constants on ǫ by making c and c ′ large enough. For theremainder of the proof we thus fix c = 1 and so k ′ = ⌈ k/ǫ ⌉ . Head Term:
By Lemma 32 applied to k ′ = ⌈ k/ǫ ⌉ , since λ m ≥ k ′ k A − A k ′ k F :˜ ℓ i ≥ r nǫk · ˜ τ k ′ i ( A / ) ≥ k ( U H ) i k . Further, t = c ′ log nǫ · P i ˜ ℓ i so if we set c ′ large enough, by a standard matrix Chernoff bound (seeLemma 33) since U H is an orthogonal span for the columns of C H and hence and its row normsare the leverage scores of C H , with high probability:(1 − ǫ ) C TH C H (cid:22) C TH S S T C H ≺ (1 + ǫ ) C TH C H . This gives the bound k S T C H ( I − P ) k F ∈ (1 ± ǫ ) k C H ( I − P ) k F . Tail Term:
We want to show:
We want to show: | tr (cid:0) ( I − P ) C TT S S T C H ( I − P ) (cid:1) | ≤ ǫ k C ( I − P ) k F . (29)We can write: (cid:12)(cid:12) tr (cid:0) ( I − P ) C TT S S T C H ( I − P ) (cid:1)(cid:12)(cid:12) = (cid:12)(cid:12) tr (cid:0) C TT S S T C H ( I − P ) (cid:1)(cid:12)(cid:12) (Cyclic property of trace and ( I − P ) = ( I − P ) )= (cid:12)(cid:12) tr (cid:0) C TT S S T C H ( C T C ) + ( C T C )( I − P ) (cid:1)(cid:12)(cid:12) where in the last step, inserting ( C T C ) + ( C T C ), which is the projection onto the row span of C has no effect as the rows of C H = U H U TH C already lie in this span. h M, N i = tr( M ( C T C ) + N T ) isa semi-inner product since C T C is positive semidefinite, so by Cauchy-Schwarz: (cid:12)(cid:12) tr (cid:0) ( I − P ) C TT S S T C H ( I − P ) (cid:1)(cid:12)(cid:12) ≤ k C TT S S T C H ( C T C ) + / k F · k C ( I − P ) k F . Using the singular value decomposition C = XSY T : (cid:12)(cid:12) tr (cid:0) C TT S S T C H ( I − P ) (cid:1)(cid:12)(cid:12) ≤ k C TT S S T U H U TH X k F · k C ( I − P ) k F ≤ k C TT S S T U H k F · k C ( I − P ) k F (30)As argued, by Lemma 32 we have ˜ ℓ i ≥ k ( U H ) i k , and so by a standard approximate matrixmultiplication result [DKM06], with probability 299 /
300 if we set the constant c ′ in our sample size > c ′′ for some fixed c ′′ : k C TT S S T U H k F ≤ k U H k F k C T k F r t · c ′′ k U H k F P i ˜ ℓ i ·≤ ǫ k C T k F = O ( ǫ k A − A k k F )where the last step follows from the fact that k C T k F = O ( k A − A k k F ) since C is an ( ǫ, k )-columnPCP for A . The final bound then follows from combining with (30) with the fact that k A − A k k F ≤ (1 + ǫ ) k C ( I − P ) k F and adjusting constants on ǫ by increasing the constant c ′ in the sample size t and c in the rank parameter k ′ = ⌈ ck/ǫ ⌉ .The full lemma follows simply from noting that we can union bound over our failure probabilityfor each term so all hold simultaneously with probability 99 / emma 33 (Leverage Score Sampling Matrix Chernoff) . For any A ∈ R n × d for i ∈ { , . . . , d } , let ℓ i ( A ) def = a Ti ( AA T ) + a i be the i th column leverage score of A and let ˜ ℓ i ≥ ℓ i ( A ) be an overestimate forthis score. Let p i = ˜ ℓ i P i ˜ ℓ i and t = c log( d/δ ) ǫ P i ˜ ℓ i for sufficiently large c . Construct C by sampling t columns of A , each set to √ tp i a i with probability p i . With probability − δ : (1 − ǫ ) CC T (cid:22) AA T (cid:22) (1 + ǫ ) CC T . (31) Proof.
Write the singular value decomposition A = U Σ V T . Note that: ℓ i = e Ti V Σ U T (cid:0) U Σ U T (cid:1) + U Σ V T e i = k v i k . Let Y = Σ − U T (cid:0) CC T − AA T (cid:1) U Σ − . We can write: Y = t X j =1 (cid:20) Σ − U T (cid:18) c j c Tj − t AA T (cid:19) U Σ − (cid:21) def = t X j =1 [ X j ] . For each j ∈ , . . . , t , X j is given by: X j = 1 t · Σ − U T (cid:18) p i a i a Ti − AA T (cid:19) U Σ − with probability p i . E Y = 0 since E X j = P di =1 p i h p i a i a Ti − AA T i = 0. Furthermore, CC T = U Σ Y Σ U + AA T .Showing k Y k ≤ ǫ gives − ǫI (cid:22) Y (cid:22) ǫI , which gives the lemma. We prove this bound using amatrix Bernstein inequality from [Tro15]. This inequality requires upper bounds on the spectralnorm of each X j and on variance of Y . We first note that for any i , ℓ i ( A ) a i a Ti (cid:22) AA T . This followsfrom writing any x in the column span of A as ( AA T ) + / y and then noting: x T (cid:0) a i a Ti (cid:1) x = y T ( AA T ) + / a i a Ti ( AA T ) + / y ≤ ℓ i ( A ) k y k since ( AA T ) + / a i a Ti ( AA T ) + / is rank-1 and so has maximum eigenvalue tr (cid:0) ( AA T ) + / a i a Ti ( AA T ) + / (cid:1) = ℓ i ( A ) by the cyclic property of trace. Further x T AA T x = y T ( AA T ) + / AA T ( AA T ) + / y = k y k , giv-ing us the bound. We then have:1 ℓ i ( A ) · Σ − U T a i a Ti U Σ − (cid:22) Σ − U T AA T U Σ − = I. And so tp i Σ − U T a i a Ti U Σ − (cid:22) ǫ c log( d/δ )˜ ℓ i Σ − U T a i a Ti U Σ − (cid:22) ǫ c log( d/δ ) I . Additionally,1 tp i Σ − U T AA T U Σ − (cid:22) ǫ c log( d/δ ) I as long as ˜ ℓ i ≤ ℓ i ( A ) ≤
1. Overall this gives k X j k ≤ ǫ c log( d/δ ) .Next we bound the variance of Y . E ( Y ) = t · E ( X j ) = 1 t X p i · (cid:18) p i Σ − U T a i a Ti U Σ − U T a i a Ti U Σ − − p i Σ − U T a i a Ti U Σ − U T AA T U Σ − + Σ − U T AA T U Σ − U T AA T U Σ − (cid:19) (cid:22) t X " P ˜ ℓ i ˜ ℓ i · ℓ i ( A ) · Σ − U a i a Ti U Σ − − t Σ − U T AA T U Σ − (cid:22) ǫ c log( d/δ ) Σ − U T AA T U Σ − (cid:22) ǫ c log( d/δ ) I.
38y Theorem 7.3.1 of [Tro15], for ǫ < k Y k ≥ ǫ ] ≤ d · e − ǫ / (cid:18) ǫ c log( d/δ ) (1+ ǫ/ (cid:19) . Which gives Pr [ k Y k ≥ ǫ ] ≤ de − c log( d/δ )4 ≤ δ if we choose c large enough.The above Lemma also gives an easy Corollary for the approximation obtained when samplingwith ridge leverage scores: Corollary 34 (Ridge Leverage Scores Sampling Matrix Chernoff) . For any A ∈ R n × d for i ∈{ , . . . , d } , let τ λi ( A ) def = a Ti ( AA T + λI ) + a i be the i th λ -ridge leverage score of A and let ˜ τ λi ≥ τ λi ( A ) be an overestimate for this score. Let p i = ˜ τ λi P i ˜ τ λi and t = c log( d/δ ) ǫ P i ˜ τ λi for sufficiently large c .Construct C by sampling t columns of A , each set to √ tp i a i with probability p i . With probability − δ : (1 − ǫ ) CC T − ǫλI (cid:22) AA T (cid:22) (1 + ǫ ) CC T + ǫλI. (32) Proof.
We can instantiate Lemma 33 with [A, √λ·I], setting ℓ̃_i = τ̃_i^λ. We simply fix the columns of the identity to appear in our sample. This only decreases variance, all calculations go through, and with probability 1 − δ:

(1 − ǫ)[C, √λ·I][C, √λ·I]^T ⪯ [A, √λ·I][A, √λ·I]^T ⪯ (1 + ǫ)[C, √λ·I][C, √λ·I]^T,

which gives the desired bound if we subtract λI from all sides.
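A small numerical check of the guarantee in Corollary 34 (a sketch using exact λ-ridge leverage scores computed densely; the helper names are ours): sample rescaled columns with probability proportional to τ_i^λ and verify the two-sided spectral bound.

```python
import numpy as np

def ridge_leverage_column_sample(A, lam, t, rng=np.random.default_rng(0)):
    """Sample t rescaled columns of A with probabilities proportional to the
    lambda-ridge leverage scores tau_i = a_i^T (A A^T + lam I)^{-1} a_i."""
    n = A.shape[0]
    tau = np.einsum('ij,jk,ki->i', A.T, np.linalg.inv(A @ A.T + lam * np.eye(n)), A)
    p = tau / tau.sum()
    idx = rng.choice(A.shape[1], size=t, p=p)
    return A[:, idx] / np.sqrt(t * p[idx])

rng = np.random.default_rng(1)
A = rng.standard_normal((80, 4)) @ rng.standard_normal((4, 2000))   # nearly low rank
A += 0.001 * rng.standard_normal(A.shape)
lam, eps = 1.0, 0.5
C = ridge_leverage_column_sample(A, lam, t=800, rng=rng)
# Check (1-eps) C C^T - eps*lam*I <= A A^T <= (1+eps) C C^T + eps*lam*I (Corollary 34).
lhs = np.linalg.eigvalsh(A @ A.T - (1 - eps) * C @ C.T + eps * lam * np.eye(80)).min()
rhs = np.linalg.eigvalsh((1 + eps) * C @ C.T + eps * lam * np.eye(80) - A @ A.T).min()
print(lhs >= -1e-8, rhs >= -1e-8)
```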
C  Outputting a PSD Low-Rank Approximation

In this section we prove Theorem 12, showing how to efficiently output a low-rank approximation to A under the restriction that the approximation is also PSD. We start with the following lemma, which shows that as long as we have a good low-rank subspace for approximating A in the spectral norm (computable using Theorem 25), we can quickly find a near optimal PSD low-rank approximation:

Lemma 35.
Given PSD A ∈ R n × n and orthonormal basis Z ∈ R n × m with k A − AZZ T k ≤ ǫk k A − A k k F , there is an algorithm which accesses O (cid:16) nm log mǫ (cid:17) entries of A , runs in ˜ O (cid:16) nm ω − ǫ ω − (cid:17) time, and with probability / outputs M ∈ R n × k satisfying k A − M M T k F ≤ (1 + ǫ ) k A − A k k F .Proof. By Lemma 10 of [CW17] for any basis Z ∈ R n × m with k A − AZZ T k ≤ ǫk k A − A k k F ,min X : rank ( X )= k,X (cid:23) k A − ZXZ T k F ≤ (1 + O ( ǫ )) k A − A k k F . As in Algorithm 1 we can find a near optimal X by further sampling Z using its leveragescores. If we sample t = O (cid:16) m log mǫ (cid:17) rows of Z by their leverage scores (their norms since Z isorthonormal) to form S ∈ R n × t , by Theorem 39 of [CW13], we will have an affine embedding of Z . Specifically, letting B ∗ = arg min B k A − ZB k F and E ∗ = A − ZB ∗ , for any B we have: k S T A − S ZB k F + (cid:0) k E ∗ k F − k S T E ∗ k F (cid:1) ∈ (cid:2) (1 − ǫ ) k A − ZB k F , (1 + ǫ ) k A − ZB k F (cid:3) . W computed in Step 6 of Algorithm 1 gave a near optimal low-rank approximation to AS .By a Markov bound, since E k S T E ∗ k F = k E ∗ k F , with probability 99 / (cid:12)(cid:12) k E ∗ k F − k S T E ∗ k F (cid:12)(cid:12) ≤ k E ∗ k F = O (1) k A − A k k F . This guarantees that a (1+ ǫ ) approximation to the sketched problemgives a (1 + O ( ǫ )) approximation to the original. That is, for any PSD ˜ X with rank( ˜ X ) = k , and k S T A − S Z ˜ XZ T k F ≤ (1 + ǫ ) min X :rank( X )= k,X (cid:23) k S T A − S ZXZ T k F we have k A − Z ˜ XZ T k F ≤ (1 + O ( ǫ )) k A − A k k F . Following [CW17], we write S T Z in its SVD S T Z = U z Σ z V Tz . Since S ZXZ T falls in the column span of S Z and the row span of Z T , we write δ def = k ( I − U z U Tz ) S T A k F + k U z U Tz S T A ( I − ZZ T ) k F and by Pythagorean theorem have: k S T A − S ZXZ T k F = k U z U Tz S AZZ T − U z Σ z V Tz XZ T k F + δ = k U Tz S T AZ − Σ z V Tz X k F + δ = k Σ z V Tz (cid:0) V z Σ − z U Tz S T AZ − X (cid:1) k F + δ Since S is sampled via Z ’s leverage scores, (1 − ǫ ) I (cid:22) S T Z (cid:22) (1 + ǫ ) I and so: k S T AZ − S ZXZ T k F = (1 ± ǫ ) k V Tz Σ − z U Tz S T AZ − X k F + δ. Finally, following [CW17], letting B = V Tz Σ − z U Tz S T AZ , we have˜ X = arg min X | X (cid:23) , rank( X )= k k B − X k F = (cid:0) B/ B T / (cid:1) k, + where N k, + has all but the top k positive eigenvalues of N set to 0. We output M = Z ˜ X / .Overall, the above algorithm requires accessing O (cid:16) nm log mǫ (cid:17) entries of A (the entries of AS ) andhas runtime ˜ O (cid:16) nm ω − ǫ ω − (cid:17) giving the lemma.We can obtain Z with rank m = Θ( k/ǫ ) by applying Theorem 25 with rank k ′ = Θ( k/ǫ ) anderror parameter ǫ ′ = Θ(1). Combined with Lemma 35 this yields: Theorem 12 (Sublinear Time Low-Rank Approximation – PSD Output) . There is an algorithmthat given any PSD A ∈ R n × n , accesses ˜ O (cid:16) nk ǫ + nkǫ (cid:17) entries of A , runs in ˜ O (cid:16) nk ω ǫ ω + nk ω − ǫ ω − (cid:17) timeand with probability at least / outputs M ∈ R n × k with: k A − M M T k F ≤ (1 + ǫ ) k A − A k k F ..
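The final step in the proof of Lemma 35, projecting a small matrix onto the set of rank-k PSD matrices in Frobenius norm, is a plain eigenvalue truncation. Below is a minimal sketch of that step alone (the helper name is ours): symmetrize the input and keep only its top k nonnegative eigenvalues, the (·)_{k,+} operation.

```python
import numpy as np

def best_rank_k_psd(B, k):
    """Frobenius-optimal rank-k PSD approximation of a square matrix B:
    symmetrize, then keep the k largest nonnegative eigenvalues."""
    S = (B + B.T) / 2
    w, V = np.linalg.eigh(S)                  # eigenvalues in ascending order
    w = np.clip(w, 0.0, None)                 # drop negative eigenvalues
    keep = np.argsort(w)[-k:]                 # indices of the k largest remaining
    return (V[:, keep] * w[keep]) @ V[:, keep].T

B = np.random.default_rng(0).standard_normal((50, 50))
X = best_rank_k_psd(B, k=5)
print(np.linalg.matrix_rank(X) <= 5, np.linalg.eigvalsh(X).min() >= -1e-10)
```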