Computer Science Data Structures and Algorithms

Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates

De Huang, Jonathan Niles-Weed, Rachel Ward

Abstract

We analyze Oja's algorithm for streaming k-PCA and prove that it achieves performance nearly matching that of an optimal offline algorithm. Given access to a sequence of i.i.d. d \times d symmetric matrices, we show that Oja's algorithm can obtain an accurate approximation to the subspace of the top k eigenvectors of their expectation using a number of samples that scales polylogarithmically with d. Previously, such a result was only known in the case where the updates have rank one. Our analysis is based on recently developed matrix concentration tools, which allow us to prove strong bounds on the tails of the random matrices which arise in the course of the algorithm's execution.

Full PDF

aa r X i v : . [ c s . D S ] F e b STREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONEUPDATES DE HUANG, JONATHAN NILES-WEED, AND RACHEL WARDAbstract. We analyze Oja’s algorithm for streaming 𝑘 -PCA, and prove that it achieves performancenearly matching that of an optimal oﬄine algorithm. Given access to a sequence of i.i.d. 𝑑 × 𝑑 symmetricmatrices, we show that Oja’s algorithm can obtain an accurate approximation to the subspace of thetop 𝑘 eigenvectors of their expectation using a number of samples that scales polylogarithmically with 𝑑 .Previously, such a result was only known in the case where the updates have rank one.Our analysis is based on recently developed matrix concentration tools, which allow us to prove strongbounds on the tails of the random matrices which arise in the course of the algorithm’s execution. Introduction

Principal component analysis is one of the foundational algorithms of statistics and machine learn-ing. From a practical perspective, perhaps no optimization problem is more widely used in dataanalysis [18]. From a theoretical perspective, it is one of the simplest examples of a non-convexoptimization problem that can nevertheless be solved in polynomial time; as such, it has been animportant proving ground for understanding the fundamental limits of eﬃcient optimization [30].In the basic setting, the practitioner has access to a sequence of independent symmetric randommatrices 𝑨 , 𝑨 , . . . with expectation 𝑴 ∈ R 𝑑 × 𝑑 . The goal is to approximate the leading eigenspaceof 𝑴 or, more generally, to approximate the subspace spanned by its leading 𝑘 eigenvectors. Whileit is natural to attempt to solve this problem by performing an eigen-decomposition of the empiricalaverage ¯ 𝑨 = 𝑇 Í 𝑇𝑖 = 𝑨 𝑖 , the amount of space required by this approach can be prohibitive when 𝑑 islarge. In particular, if the matrices 𝑨 𝑖 are sparse or low-rank, performing incremental updates with thematrices 𝑨 𝑖 may be signiﬁcantly cheaper than storing all the iterates or their average. A tremendousamount of attention has therefore been paid to designing algorithms which can cheaply and provablyestimate the subspace spanned by the top 𝑘 eigenvectors of 𝑴 using limited memory and a single passover the data, a problem known as streaming PCA [17].The simplest and most natural approach to this problem was proposed nearly 40 years ago byOja [25, 26]:(1) Randomly choose an initial guess 𝒁 ∈ R 𝑑 × 𝑘 , and set 𝑸 ← QR [ 𝒁 ] (2) For 𝑡 ≥ , set 𝑸 𝑡 ← QR [( I + 𝜂 𝑡 𝑨 𝑡 ) 𝑸 𝑡 − ] .Here, QR [ 𝑸 𝑡 ] returns an orthogonal R 𝑑 × 𝑘 matrix obtained by performing the Gram–Schmidt processto the columns of 𝑸 𝑡 . It is easy to see [1, Lemma 2.2] that the Gram–Schmidt step commutes with themultiplicative update, so that we can equivalently consider a version of the algorithm which performsa single orthonormalization at the end, and outputs 𝑸 𝑡 = QR [ 𝒁 𝑡 ] , 𝒁 𝑡 = 𝒀 𝑡 . . . 𝒀 𝒁 , Date : 6 February 2021.The authors gratefully acknowledge the funding for this work. DH was in part supported by NSF Grants DMS-1907977,DMS-1912654, and the Choi Family Postdoc Gift Fund. JNW was supported under NSF grant DMS-2015291. JNW and RWwere supported in part by the Institute for Advanced Study, where some of this research was conducted. RW receivedsupport from AFOSR MURI Award N00014-17-S-F006 and NSF grant DMS-1952735 . where 𝒀 𝑖 : = ( I + 𝜂 𝑖 𝑨 𝑖 ) .Oja’s algorithm can be viewed as a noisy version of the classic orthogonal iteration algorithm forcomputing invariant subspaces of a symmetric matrix [12, Section 7.3.2]; alternatively, it corresponds toprojected stochastic gradient descent on the Stiefel manifold of matrices with orthonormal columns [9].Despite its simplicity and practical eﬀectiveness, Oja’s algorithm has proven challenging to analyzebecause of its inherent non-convexity.As a benchmark against which to compare Oja’s algorithm, we may consider the performance of thesimple oﬄine algorithm which computes the leading 𝑘 eigenvectors of ¯ 𝑨 . We write 𝑽 ∈ R 𝑑 × 𝑘 for theorthogonal matrix whose columns are the leading 𝑘 eigenvectors of 𝑴 and ˆ 𝑽 ∈ R 𝑑 × 𝑘 for the matrixcontaining the leading 𝑘 eigenvectors of ¯ 𝑨 , and measure the quality of ˆ 𝑽 by the following standardmeasure of distance between subspaces:dist ( ˆ 𝑽 , 𝑽 ) : = k 𝑽𝑽 ∗ − ˆ 𝑽 ˆ 𝑽 ∗ k If k 𝑨 𝑖 − 𝑴 k ≤ 𝑀 almost surely and the gap between the 𝑘 th and ( 𝑘 + ) th eigenvalues is 𝜌 𝑘 , thenthe Matrix Bernstein inequality [31, Theorem 1.4] combined with Wedin’s Theorem [33] implies thatthere exists a positive constant 𝐶 such thatdist ( ˆ 𝑽 , 𝑽 ) ≤ 𝐶 𝑀𝜌 𝑘 r log ( 𝑑 / 𝛿 ) 𝑇 . (1.1)with probability at least − 𝛿 .The key question is whether Oja’s algorithm is able to achieve similar performance. However, exceptin the special rank-one case where either 𝑘 = or rank ( 𝑨 𝑖 ) = almost surely, no such bound is known.1.1. Our contribution.

We give the ﬁrst results for Oja’s algorithm nearly matching (1.1), for any 𝑘 ≥ and updates of any rank. Our main result (Theorem 2.3) establishes that, after a burn-in period of 𝑇 = ˜ 𝑂 (cid:16) 𝑘𝑀 𝛿 𝜌 𝑘 (cid:17) steps, the output of Oja’s algorithm satisﬁesdist ( 𝑸 𝑇 , 𝑽 ) ≤ 𝐶 ′ 𝑀𝜌 𝑘 s log ( 𝑘𝑀 / 𝛿𝜌 𝑘 ) 𝑇 − 𝑇 with probability at least − 𝛿 for a universal positive constant 𝐶 ′ . Ours is the ﬁrst work to show thatOja’s algorithm can achieve a guarantee similar to (1.1) beyond the rank-one case.The assumption that 𝑘 = or rank ( 𝑨 𝑖 ) = is fundamental to the proof strategies used in prior works.To show that the error decays suﬃciently quickly, prior work focuses on the quantity k 𝑼 ∗ 𝒁 𝑡 ( 𝑽 ∗ 𝒁 𝑡 ) − k ,where the columns of 𝑼 are the last 𝑑 − 𝑘 eigenvectors of 𝑴 , which is an upper bound on dist ( 𝑸 𝑡 , 𝑽 ) .(See Lemma 2.6, below.) The key challenge is to control the inverse ( 𝑽 ∗ 𝒁 𝑡 ) − . When 𝑘 = , as in[17], this quantity is a scalar, so it can be pulled out of the norm and bounded separately. This isno longer possible when 𝑘 > , but if rank ( 𝑨 𝑖 ) = , as in [1], then 𝑽 ∗ 𝒁 𝑡 can be written as a rank-one perturbation of 𝑽 ∗ 𝒁 𝑡 − . The Sherman–Morrison formula then implies that 𝑼 ∗ 𝒁 𝑡 ( 𝑽 ∗ 𝒁 𝑡 ) − can bewritten as 𝑼 ∗ 𝒁 𝑡 − ( 𝑽 ∗ 𝒁 𝑡 − ) − plus the sum of explicit, rank-one correction terms. However, if neither 𝑘 = nor rank ( 𝑨 𝑖 ) = , this approach quickly becomes infeasible, since the correction terms nowinvolve a product of rank- 𝑘 matrices whose norm is diﬃcult to bound.A more subtle diﬃculty implicit in prior work is that proofs must be carried out entirely in expected(squared) Frobenius norm. This requirement is necessitated by the fact that the Frobenius norm isHilbertian, so it is possible to employ the crucial Pythagorean identity E k 𝒀 k = k E 𝒀 k + k 𝒀 − E 𝒀 k (1.2) TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 3 for any random matrix 𝒀 . It is this identity that makes it possible to control the evolution of E k 𝑼 ∗ 𝒁 𝑡 ( 𝑽 ∗ 𝒁 𝑡 ) − k . However, as our proofs reveal, it is of signiﬁcant utility to be able to recur-sively control the operator norm k 𝑼 ∗ 𝒁 𝑡 ( 𝑽 ∗ 𝒁 𝑡 ) − k with high probability instead. Unfortunately, (1.2)is of no help in proving statements of this kind.Our argument handles both challenges and represents a signiﬁcant conceptual simpliﬁcation overearlier proofs. Our crucial insight is that, rather than using the squared Frobenius norm, it is possibleto prove a stronger recursion in a diﬀerent norm, which implies high-probability bounds. Usingtechniques recently developed by [16] to prove concentration inequalities for products of randommatrices, we show that conditioned on k 𝑼 ∗ 𝒁 𝑡 − ( 𝑽 ∗ 𝒁 𝑡 − ) − k being well behaved, the probability that k 𝑼 ∗ 𝒁 𝑡 ( 𝑽 ∗ 𝒁 𝑡 ) − k deviates signiﬁcantly from its expectation is exponentially small.In other words, good concentration properties for k 𝑼 ∗ 𝒁 𝑡 − ( 𝑽 ∗ 𝒁 𝑡 − ) − k imply good concentrationproperties for the next iterate, k 𝑼 ∗ 𝒁 𝑡 ( 𝑽 ∗ 𝒁 𝑡 ) − k . These high-probability bounds signiﬁcantly simplifythe calculations, since they allow us to guarantee that the problematic error terms appearing in priorwork are small.If we knew that k 𝑼 ∗ 𝒁 ( 𝑽 ∗ 𝒁 ) − k = 𝑂 ( ) with high probability, then the above induction argumentwould allow us to conclude that k 𝑼 ∗ 𝒁 𝑡 ( 𝑽 ∗ 𝒁 𝑡 ) − k = 𝑂 ( ) for all 𝑡 . Unfortunately, this is not the case:if 𝒁 is randomly initialized with i.i.d. Gaussian entries, then typically k 𝑼 ∗ 𝒁 ( 𝑽 ∗ 𝒁 ) − k ≍ √ 𝑑𝑘 . We therefore adopt a two-phase approach: in the ﬁrst, short phase, of length approximately l og 𝑑 , weshow that the operator norm decays from 𝑂 (√ 𝑑𝑘 ) to 𝑂 ( ) , and in the second phase we use the aboverecursive argument to establish that the operator norm decays to zero at a 𝑂 ( /√ 𝑇 ) rate. To simplifythe analysis of the ﬁrst phase, we develop a coupling argument that allows us reduce without loss ofgenerality to the case where the law P 𝑨 of the random matrices 𝑨 , 𝑨 , . . . has ﬁnite support andobtain almost-sure guarantees by a simple union bound. This weak control is enough to guaranteethat k 𝑼 ∗ 𝒁 𝑡 ( 𝑽 ∗ 𝒁 𝑡 ) − k decays exponentially fast, so that it is of constant order after approximately l og 𝑑 iterations.1.2. Prior work.

Obtaining non-asymptotic rates of convergence for Oja’s algorithm and its variantshas been an area of active recent interest [28, 29, 27, 21, 20, 2, 4, 13, 17, 23]. Apart from the resultsof [1] and [17], none of these works proves bounds matching (1.1).A breakthrough in the project of obtaining optimal guarantees was due to [28], who gave an analysisof Oja’s algorithm that works when provided with a warm start: he showed that, when 𝑘 = andrank ( 𝑨 𝑖 ) = almost surely, Oja’s algorithm converges in a number of steps logarithmic in 𝑑 if it isinitialized in a neighborhood of the optimum, but his result does not extend to random initializationand it is unclear how to ﬁnd a warm start in practice. This restriction was lifted by [17], who werethe ﬁrst to show a global, eﬃcient guarantee for Oja’s algorithm when 𝑘 = . Subsequently, [1]gave a global, eﬃcient guarantee for Oja’s algorithm in the 𝑘 > case, but under the restriction thatrank ( 𝑨 𝑖 ) = almost surely.The idea of analyzing Oja’s algorithm by developing concentration bounds for products of randommatrices was suggested by [15], who also proved such non-asymptotic concentration bounds in asimpliﬁed setting. Those bounds were later improved by [16] who developed a diﬀerent techniquebased on martingale inequalities for Schatten norms, following a strategy pursued by [19] and [24]for other Banach space norms. The concentration inequalities of [16] are not sharp enough to recoveroptimal rates for Oja’s algorithm on their own; in this work, we use a similar proof techniques toestablish tailor-made concentration results for the Oja setting. HUANG ET AL.

Organization of the remainder of the paper.

In Section 2, we give our main results and anoverview of our techniques. Our main tool is a recursive inequality which proves a concentrationresult for the iterates of Oja’s algorithm, which we state and prove in Section 3.Our analysis of Oja’s algorithm involves two distinct phases, which we analyze separately. Since theargument for the second phase is simpler, we present it ﬁrst in Section 4, and present the slightly morecomplicated argument for the ﬁrst phase in Section 5. We conclude in Section 6 with open questionsand directions for future work. The appendices contain omitted proofs and supplementary results foreach section.1.4.

Notation.

We write 𝜆 ≥ · · · ≥ 𝜆 𝑑 for the eigenvalues of the symmetric matrix 𝑴 , and we write 𝜌 𝑘 : = 𝜆 𝑘 − 𝜆 𝑘 + for the gap between the 𝑘 th and ( 𝑘 + ) th eigenvalue. We write 𝑽 ∈ R 𝑑 × 𝑘 forthe orthogonal matrix whose columns are the 𝑘 leading eigenvectors of 𝑴 , and 𝑼 ∈ R 𝑑 ×( 𝑑 − 𝑘 ) forthe orthogonal matrix whose columns are the remaining eigenvectors. Given an orthogonal matrix 𝑾 ∈ R 𝑑 × 𝑘 , we write [7] dist ( 𝑾 , 𝑽 ) = k 𝑽𝑽 ∗ − 𝑾𝑾 ∗ k = k 𝑼 ∗ 𝑾 k , The symbol k·k denotes the spectral norm (i.e., ℓ operator norm) of a matrix, which is equal to itsmaximum singular value. For 𝑝 ≥ , the symbol k·k 𝑝 denotes the Schatten 𝑝 -norm, which is the ℓ 𝑝 norm of the singular values of its argument. We also deﬁne the 𝐿 𝑝 norm of a random matrix 𝑿 as k 𝑿 k 𝑝,𝑝 : = (cid:0) E k 𝑿 k 𝑝𝑝 (cid:1) / 𝑝 . We employ standard asymptotic notation 𝑎 = 𝑂 ( 𝑏 ) to indicate that 𝑎 ≤ 𝐶𝑏 for a universal positiveconstant 𝐶 , and write 𝑎 = Θ ( 𝑏 ) if 𝑎 = 𝑂 ( 𝑏 ) and 𝑏 = 𝑂 ( 𝑎 ) . The notations ˜ 𝑂 (·) and ˜ Θ (·) suppresspolylogarithmic factors in the problem parameters. When 𝑡 is a positive integer, we write [ 𝑡 ] : = { , . . . , 𝑡 } . 2. T echniques and main results We focus throughout on the following setup:

Assumption 2.1.

The matrices 𝑨 𝑖 are symmetric, independent, identically distributed samples from adistribution P 𝑨 , with expectation 𝑴 .Note that while we require that each 𝑨 𝑖 is symmetric, we do not require that 𝑨 𝑖 (cid:23) .The requirement that 𝑨 𝑖 is symmetric is not as restrictive as it may seem, since we can replace 𝑨 𝑖 by its Hermitian dilation : D ( 𝑨 𝑖 ) : = (cid:18) 𝑨 𝑖 𝑨 ∗ 𝑖 (cid:19) ∈ R 𝑑 × 𝑑 . Estimating the leading eigenvectors of D ( 𝑴 ) is equivalent to estimating the leading singular vectorsof 𝑴 . Our results therefore extend to the non-symmetric streaming SVD problem as well. We referthe reader to [32] for more details about this standard reduction.The second requirement establishes that the random errors are bounded in a suitable norm. Wewrite S 𝑑,𝑘 for the Stiefel manifold of 𝑑 × 𝑘 matrices with orthonormal columns. Assumption 2.2. If 𝑨 ∼ P 𝑨 , then sup 𝑷 ∈ S 𝑑,𝑘 k 𝑷 ∗ ( 𝑨 − 𝑴 ) k ≤ 𝑀 almost surely.Note that for any matrix 𝑿 ∈ ℝ 𝑑 × 𝑑 ,sup 𝑃 ∈ S 𝑑,𝑘 k 𝑷 ∗ 𝑿 k = (cid:16)Õ 𝑘𝑖 = 𝜎 𝑖 ( 𝑿 ) (cid:17) / , ≤ 𝑘 ≤ 𝑑, TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 5 where 𝜎 ( 𝑿 ) ≥ 𝜎 ( 𝑿 ) ≥ · · · ≥ 𝜎 𝑑 ( 𝑿 ) are the singular values of 𝑿 . This norm, sometimes known asthe ( , 𝑘 ) norm [22] or the Ky Fan - 𝑘 norm [8], satisﬁes k 𝑿 k ≤ sup 𝑃 ∈ S 𝑑,𝑘 k 𝑷 ∗ 𝑿 k ≤ √ 𝑘 k 𝑿 k ≤ k 𝑿 k . This choice of norm generalizes the error assumptions in the literature. In the 𝑘 = case, it agreeswith the operator norm, which is the condition used by [17]; and it weakens the requirement of [1]that k 𝑨 𝑖 k ≤ almost surely.The following theorem summarizes our main results for Oja’s algorithm. Theorem 2.3 (Main, informal) . Adopt Assumptions 2.1 and 2.2. Let 𝜆 ≥ . . . 𝜆 𝑑 be the eigenvalues of 𝑴 , and let 𝜌 𝑘 = 𝜆 𝑘 − 𝜆 𝑘 + .For every 𝛿 ∈ ( , ) , deﬁne learning rates 𝑇 = ˜ Θ 𝑘𝑀 𝛿 𝜌 𝑘 ! , 𝛽 = ˜ Θ 𝑀 𝜌 𝑘 ! , 𝜂 𝑡 =  ˜ Θ (cid:16) 𝜌 𝑘 𝑇 (cid:17) , 𝑡 ≤ 𝑇 Θ (cid:16) 𝜌 𝑘 ( 𝛽 + 𝑡 − 𝑇 ) (cid:17) , 𝑡 > 𝑇 . Let 𝑽 ∈ R 𝑑 × 𝑘 be the orthogonal matrix whose columns are the 𝑘 leading eigenvectors of 𝑴 . Then forany 𝑇 > 𝑇 , the output 𝑸 𝑇 of Oja’s algorithm satisﬁes dist ( 𝑸 𝑇 , 𝑽 ) ≤ 𝐶 ′ 𝑀𝜌 𝑘 s log ( 𝑀𝑘 / 𝜌 𝑘 𝛿 ) 𝑇 − 𝑇 with probability at least − 𝛿 , where 𝐶 ′ is a universal positive constant. To prove Theorem 2.3, we adopt a two-phase analysis. Our ﬁrst result shows that after 𝑇 iterations,the output of Oja’s algorithm satisﬁes k 𝑼 ∗ 𝑸 𝑇 ( 𝑽 ∗ 𝑸 𝑇 ) − k ≤ with high probability. Theorem 2.4 (Phase I, informal) . Adopt the same setting as Theorem 2.3, and let 𝒁 ∈ R 𝑑 × 𝑘 havei.i.d. Gaussian entries. Let 𝑇 = Θ 𝑘𝑀 𝛿 𝜌 𝑘 (cid:0) l og ( 𝑑 𝑀 / 𝛿𝜌 𝑘 ) (cid:1) ! . Then after 𝑇 iterations of Oja’s algorithm with constant step size 𝜂 = Θ (cid:16) l og ( 𝑑 / 𝛿 ) 𝜌 𝑘 𝑇 (cid:17) and initialization 𝒁 ,the output 𝑸 𝑇 satisﬁes k 𝑼 ∗ 𝑸 𝑇 ( 𝑽 ∗ 𝑸 𝑇 ) − k ≤ with probability at least − 𝛿 . Our analysis of the second phase shows that, if Oja’s algorithm is initialized with any matrixsatisfying k 𝑼 ∗ 𝑸 ( 𝑽 ∗ 𝑸 ) − k ≤ , then the output of Oja’s algorithm decays at the rate 𝑂 ( /√ 𝑇 ) . Theorem 2.5 (Phase II, informal) . Adopt the same setting as Theorem 2.3, and suppose that 𝒁 ∈ R 𝑑 × 𝑘 satisﬁes k 𝑼 ∗ 𝒁 ( 𝑽 ∗ 𝒁 ) − k ≤ . Then after 𝑇 iterations of Oja’s algorithm with step size 𝜂 𝑖 = ( 𝛽 + 𝑖 ) 𝜌 𝑘 with 𝛽 = Θ (cid:16) 𝑀 𝜌 𝑘 l og (cid:16) 𝑀𝑘𝜌 𝑘 𝛿 (cid:17) (cid:17) and initialization 𝑸 , the output 𝑸 𝑇 satisﬁes dist ( 𝑸 𝑇 , 𝑽 ) ≤ s 𝛽 + 𝛽 + 𝑇 with probability at least − 𝛿 . This error guarantee is completely dimension free, and depends only logarithmically on 𝑘 and thefailure probability 𝛿 . HUANG ET AL.

Theorem 2.3 follows directly from Theorems 2.4 and 2.5. Theorem 2.4 guarantees that with proba-bility − 𝛿 , the output of Phase I is a suitable initialization for Phase II, and, conditioned on this goodevent, Theorem 2.5 guarantees that the output of the second phase has error 𝑂 ( p 𝛽 / 𝑇 ) with probability − 𝛿 . By concatenating the analysis of the two phases and using the union bound, we obtain that theresulting two-phase algorithm succeeds with probability at least − 𝛿 , yielding Theorem 2.3.In the remainder of this section, we describe the main technical tools we employ in our argument.2.1. A recursive expression.

To simplify the argument, we recall the following result of [1, Lemma2.2]:

Lemma 2.6.

For all 𝑡 ≥ , dist ( 𝑸 𝑡 , 𝑽 ) = k 𝑼 ∗ 𝑸 𝑡 k ≤ k 𝑼 ∗ 𝑸 𝑡 ( 𝑽 ∗ 𝑸 𝑡 ) − k = k 𝑼 ∗ 𝒁 𝑡 ( 𝑽 ∗ 𝒁 𝑡 ) − k . We therefore focus on bounding the norm of the matrix 𝑾 𝑡 : = 𝑼 ∗ 𝒁 𝑡 ( 𝑽 ∗ 𝒁 𝑡 ) − . (2.1)Under the assumption that 𝜂 𝑡 is small, we might expect that we can write 𝑾 𝑡 as a sum of thedominant term 𝑯 𝑡 : = 𝑼 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ( 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ) − plus lower order terms.To argue that 𝑾 𝑡 is close to 𝑯 𝑡 , we need to argue that the inverse ( 𝑽 ∗ 𝒁 𝑡 ) − does not blow up,which will be the case so long as the ﬂuctuation term 𝜂 𝑡 𝑽 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝒁 𝑡 − is smaller than the main term 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − . In order to make this requirement precise, we write 𝚫 𝑡 : = 𝜂 𝑽 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝒁 𝑡 − ( 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ) − . (2.2)So long as this matrix has small norm, the inverse term will be well behaved. As we discuss in thefollowing section, we will be able to guarantee that this is the case by conditioning on an appropriategood event.The following lemma shows that, modulo a term involving 𝚫 𝑡 , we can indeed express 𝑾 𝑡 as 𝑯 𝑡 plusa small correction. Lemma 2.7.

Let 𝑾 𝑡 , 𝑯 𝑡 , and 𝚫 𝑡 be deﬁned as in (2.1) – (2.2) . Then we can write 𝑾 𝑡 ( I − 𝚫 𝑡 ) = 𝑯 𝑡 + 𝑱 𝑡, + 𝑱 𝑡, , (2.3) for matrices 𝑱 𝑡, and 𝑱 𝑡, of norm 𝑂 ( 𝜂 𝑡 ) and 𝑂 ( 𝜂 𝑡 ) , respectively. Below, in Propositions A.1 and A.2, we use Lemma 2.7 to develop an explicit recursive bound on thenorm of 𝑾 𝑡 .2.2. Matrix concentration via smoothness.

In order to exploit the expression (2.3), we need concen-tration inequalities that allow us to conclude that 𝑾 𝑡 is near 𝑯 𝑡 with high probability. [16] recentlydeveloped new tools to control the norms of products of independent random matrices, in an attemptto extend the mature toolset for bounding sums of random matrices to the product setting. Their tech-niques are based on a simple but deep property of the Schatten 𝑝 -norms known as uniform smoothness .The most elementary expression of this fact is the following inequality, which is the analogue of (1.2)for the 𝐿 𝑝 norm. Proposition 2.8 ([16, Proposition 4.3]) . Let 𝑿 and 𝒀 be random matrices of the same size, with E [ 𝒀 : 𝑿 ] = . Then for any 𝑝 ≥ , k 𝑿 + 𝒀 + 𝒁 k 𝑝,𝑝 ≤ k 𝑿 k 𝑝,𝑝 + ( 𝑝 − ) k 𝒀 k 𝑝,𝑝 . TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 7 We will employ the following corollary of Proposition 2.8, which extends the inequality to non-centered random matrices.

Proposition 2.9.

Let 𝑿 , 𝒀 , and 𝒁 be random matrices of the same size, with E [ 𝒀 : 𝑿 ] = . Then forany 𝑝 ≥ and 𝜆 > , k 𝑿 + 𝒀 + 𝒁 k 𝑝,𝑝 ≤ ( + 𝜆 ) ( k 𝑿 k 𝑝,𝑝 + ( 𝑝 − ) k 𝒀 k 𝑝,𝑝 + 𝜆 − k 𝒁 k 𝑝,𝑞 ) The beneﬁt of working in the 𝐿 𝑝 norm is that bounding this norm for 𝑝 large yields good tailbounds on the operator norm, which are not available if the argument is carried out solely in expectedFrobenius norm. We will rely heavily on this fact heavily in our argument.2.3. Conditioning on good events.

Obtaining control on 𝑾 𝑡 via (2.3) requires ensuring that thematrix I − 𝚫 𝑡 is invertible, with inverse of bounded norm. To accomplish this, we deﬁne a sequenceof good events G ⊃ G ⊃ . . . , where each G 𝑖 is measurable with respect to the 𝜎 -algebra F 𝑖 : = 𝜎 ( 𝒁 , 𝒀 , . . . , 𝒀 𝑖 ) . We write 𝑖 for the indicator of the event G 𝑖 , and we will deﬁne G 𝑖 in such a waythat ( I − 𝚫 𝑡 𝑡 − ) is invertible almost surely.During Phase II, the good events are deﬁned by G : = {k 𝑾 k ≤ } G 𝑖 : = {k 𝑾 𝑖 k ≤ 𝛾 } ∩ G 𝑖 − , ∀ 𝑖 ≥ for some 𝛾 ≥ to be speciﬁed. Since Assumption 2.2 implies that k 𝑨 𝑖 − 𝑴 k ≤ 𝑀 almost surely, thisdeﬁnition guarantees that for all 𝑖 ≥ , k 𝑽 ∗ ( 𝑨 𝑖 − 𝑴 ) 𝑼𝑾 𝑖 − 𝑖 − k ≤ 𝑀𝛾 almost surely. (2.4)As we show in Proposition A.1 below, if the step size is suﬃciently small, then (2.4) implies that I − 𝚫 𝑡 is almost surely invertible on G 𝑡 − , which allows us to employ (2.3) to bound the norm of 𝑾 𝑡 𝑡 − .During Phase I, we condition on a slightly more complicated set of events, which we describeexplicitly in Section 5. However, these events are constructed so that (2.4) still holds for all 𝑖 ≥ .Our matrix concentration results described in Section 2.2 allow us to show that, during both PhaseI and Phase II, k 𝑾 𝑡 𝑡 − k is small with high probability, for all 𝑡 ≥ . Using this fact, we show that,conditioned on G 𝑡 − , the probability that G 𝑡 holds is also large. Bounding the failure probability ateach step, we are able to conclude that, conditioned on the initialization event G , the good events G 𝑡 hold for all 𝑡 ≥ with high probability.3. Main recursive bound

In this section, we state our main recursive bound, which we use in both Phase I and Phase II. Aproof appears in Section B.

Theorem 3.1.

Let 𝑡 be a positive integer, and for all 𝑖 ∈ [ 𝑡 ] , let 𝜀 𝑖 = 𝜂 𝑖 𝑀 ( + 𝛾 ) . Let , . . . , 𝑡 be theindicator functions of a sequence of good events satisfying (2.4) for all 𝑖 ∈ [ 𝑡 ] .Assume that for all 𝑖 ∈ [ 𝑡 ] , 𝜀 𝑖 ≤ , 𝜂 𝑖 k 𝑴 k ≤ , e − 𝜂 𝑖 𝜌 𝑘 / ≤ 𝜀 𝑖 𝜀 𝑖 − , (3.1) with the convention that the last requirement is vacuous when 𝑖 = . Then for any 𝑝 ≥ , k 𝑾 𝑡 𝑡 k 𝑝,𝑝 ≤ k 𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ e − 𝑠 𝑡 𝜌 𝑘 k 𝑾 k 𝑝,𝑝 + 𝐶 𝑝𝜀 𝑡 Õ 𝑡 − 𝑖 = k 𝑾 𝑖 𝑖 k 𝑝,𝑝 + 𝐶 𝑝𝑘 / 𝑝 𝜀 𝑡 𝑡 , (3.2) where 𝑠 𝑡 = Í 𝑡𝑖 = 𝜂 𝑖 , 𝐶 = , and 𝐶 = . Moreover, if in addition for all 𝑖 ∈ [ 𝑡 ] , 𝑝𝜀 𝑖 ≤ 𝜂 𝑖 𝜌 𝑘 , (3.3) HUANG ET AL. then k 𝑾 𝑡 𝑡 k 𝑝,𝑝 ≤ k 𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ e − 𝑠 𝑡 𝜌 𝑘 / k 𝑾 k 𝑝,𝑝 + 𝐶 𝑝𝑘 / 𝑝 𝜀 𝑡 𝑡 . Theorem 3.1 shows that, up to small error, k 𝑾 𝑡 𝑡 − k 𝑝,𝑝 decays exponentially fast. We will use thisfact to prove high probability bounds on k 𝑾 𝑡 𝑡 − k , which then imply bounds on k 𝑾 𝑡 k .4. Phase II

In this section, we use Theorem 3.1 to prove a formal version of Theorem 2.5.For this phase, recall that we deﬁne the good events G 𝑖 by G = {k 𝑾 k ≤ } , G 𝑖 = {k 𝑾 𝑖 k ≤ 𝛾 } ∩ G 𝑖 − , ∀ 𝑖 ≥ . (4.1)For Phase II, we set 𝛾 = √ .We ﬁrst show that, with a speciﬁc step-size schedule, we obtain good bounds on the norm of thelast iterate. Proposition 4.1.

Deﬁne the good events as in (4.1) . Set 𝜂 𝑖 = 𝛼 ( 𝛽 + 𝑖 ) 𝜌 𝑘 , for positive quantities 𝛼 and 𝛽 , anddeﬁne the normalized gap ¯ 𝜌 𝑘 = min (cid:26) 𝜌 𝑘 𝑀 , 𝜌 𝑘 k 𝑴 k , (cid:27) . (4.2) If 𝛼 ≥ , 𝛽 ≥ ( + √ ) 𝛼 ¯ 𝜌 𝑘 , (4.3) then for any 𝑡 ≥ , k 𝑾 𝑡 𝑡 k 𝑝,𝑝 ≤ 𝑘 / 𝑝 (cid:18) 𝛽 + 𝛽 + 𝑡 (cid:19) 𝛼 + 𝑝𝑘 / 𝑝 · (cid:18) 𝐶 𝛼 ¯ 𝜌 𝑘 (cid:19) · 𝑡 ( 𝛽 + 𝑡 ) , (4.4) where 𝐶 is a numerical constant less than .Proof. Since the good events deﬁned in (4.1) satisfy (2.4), we can apply Theorem 3.1. In the appendix,we show (Lemma C.1) that (4.3) implies that the assumptions in (3.1) hold. Theorem 3.1 then yields k 𝑾 𝑡 𝑡 k 𝑝,𝑝 ≤ e − 𝑠 𝑡 𝜌 𝑘 k 𝑾 k 𝑝,𝑝 + 𝐶 𝑝𝜀 𝑡 Õ 𝑡 − 𝑖 = k 𝑾 𝑖 𝑖 k 𝑝,𝑝 + 𝐶 𝑝𝑘 / 𝑝 𝜀 𝑡 𝑡 ≤ e − 𝑠 𝑡 𝜌 𝑘 𝑘 / 𝑝 + ( 𝐶 𝛾 + 𝐶 ) 𝑝𝑘 / 𝑝 𝜀 𝑡 𝑡 , since (4.1) implies k 𝑾 k 𝑝,𝑝 ≤ 𝑘 / 𝑝 and k 𝑾 𝑖 𝑖 k 𝑝,𝑝 ≤ 𝛾 𝑘 / 𝑝 for all 𝑖 ≥ .The deﬁnition of 𝜂 𝑖 implies 𝜌 𝑘 𝑠 𝑡 = 𝛼 Õ 𝑡𝑖 = 𝛽 + 𝑖 ≥ 𝛼 l og (cid:18) 𝛽 + 𝑡𝛽 + (cid:19) . We obtain k 𝑾 𝑡 𝑡 k 𝑝,𝑝 ≤ 𝑘 / 𝑝 (cid:18) 𝛽 + 𝛽 + 𝑡 (cid:19) 𝛼 + 𝑝𝑘 / 𝑝 · (cid:18) 𝐶 𝛼 ¯ 𝜌 𝑘 (cid:19) · 𝑡 ( 𝛽 + 𝑡 ) , where 𝐶 = ( 𝐶 𝛾 + 𝐶 ) / 𝐶 𝜀 < , as desired. (cid:3) Finally, we remove the conditioning and prove the full version of Theorem 2.5.

TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 9 Theorem 4.2.

Assume k 𝑾 k ≤ , and adopt the step size 𝜂 𝑖 = 𝛼 ( 𝛽 + 𝑖 ) 𝜌 𝑘 , with 𝛼 ≥ , 𝛽 ≥ (cid:18) 𝐶 𝛼 ¯ 𝜌 𝑘 (cid:19) l og (cid:18) 𝐶 𝛼 ¯ 𝜌 𝑘 · 𝑘 / 𝛿 (cid:19) , where ¯ 𝜌 𝑘 is as in (4.2) and 𝐶 is as in (4.4) . Then k 𝑾 𝑇 k ≤ s 𝛽 + 𝛽 + 𝑇 with probability at least − 𝛿 .Proof. For any 𝑠 ≥ , it holds P {k 𝑾 𝑇 k ≥ 𝑠 } ≤ P {k 𝑾 𝑇 𝑇 k ≥ 𝑠 } + P (cid:8) G 𝐶𝑇 (cid:9) . First, we have P (cid:8) G 𝐶𝑇 (cid:9) ≤ P (cid:8) G 𝐶 (cid:9) + Õ 𝑇𝑗 = P n G 𝐶𝑗 ∩ G 𝑗 − o . Since we have assumed that the initialization satisﬁes k 𝑾 k ≤ , the event G holds with probability , so it suﬃces to bound the second term. By Markov’s inequality, we have P n G 𝐶𝑗 ∩ G 𝑗 − o = P (cid:8) k 𝑾 𝑗 𝑗 − k ≥ 𝛾 (cid:9) ≤ inf 𝑝 ≥ 𝛾 − 𝑝 k 𝑾 𝑗 𝑗 − k 𝑝𝑝,𝑝 . For ﬁxed 𝑗 ≥ , we choose 𝑝 = ( 𝛽 + 𝑗 ) · ¯ 𝜌 𝑘 𝐶 𝛼 . It follows from (4.4) that, 𝛾 − 𝑝 k 𝑾 𝑗 𝑗 − k 𝑝𝑝,𝑝 ≤ 𝛾 𝑘 / 𝑝 (cid:18) 𝛽 + 𝛽 + 𝑗 (cid:19) 𝛼 + 𝛾 𝑝𝑘 / 𝑝 · 𝐶 𝛼 ¯ 𝜌 𝑘 · 𝑗 ( 𝛽 + 𝑗 ) ! 𝑝 / ≤ 𝑘 (cid:18) + 𝑗𝛽 + 𝑗 (cid:19) 𝑝 / ≤ 𝑘 e − 𝑝 = 𝑘 exp −( 𝛽 + 𝑗 ) · ¯ 𝜌 𝑘 𝐶 𝛼 ! . Therefore, for any 𝑇 ≥ , Õ 𝑇𝑗 = P n G 𝐶𝑗 | G 𝑗 − o ≤ 𝑘 Õ 𝑇𝑗 = exp −( 𝛽 + 𝑗 ) · ¯ 𝜌 𝑘 𝐶 𝛼 ! ≤ 𝑘 𝐶 𝛼 ¯ 𝜌 𝑘 e − 𝛽 · ¯ 𝜌 𝑘𝐶 𝛼 . This quantity is smaller than 𝛿 / if 𝛽 ≥ 𝐶 𝛼 ¯ 𝜌 𝑘 l og (cid:18) 𝐶 𝛼𝑀 ¯ 𝜌 𝑘 · 𝑘 / 𝛿 (cid:19) . It remains to bound P {k 𝑾 𝑇 𝑇 k ≥ 𝑠 } . A simple argument (Lemma C.2) based on (4.4) shows thatthis probability is at least 𝛿 / for 𝑠 = s 𝛽 + 𝛽 + 𝑇 .

The claim follows. (cid:3) Phase I

In this section, we describe the slightly more delicate proof of the formal version of Theorem 2.4.As in Section 4, we will employ Theorem 3.1. However, we will also need to develop an auxiliaryrecurrence to bound the growth of an additional matrix sequence.

Before we analyze Phase I, we ﬁrst show that we can reduce to the case that that P 𝑨 has ﬁnitesupport. We prove the following result in Appendix E. Proposition 5.1.

Fix 𝜌 > . Suppose that there exists a choice of constant step size 𝜂 and 𝑇 ≥ 𝑀𝜌𝛿 l og ( 𝑑 / 𝛿 ) such that for any ﬁnitely-supported distribution with support size at most 𝑇 satisfyingAssumptions 2.1 and 2.2 and with 𝜌 𝑘 ≥ 𝜌 / , we have k 𝑼 ∗ 𝑸 𝑇 ( 𝑽 ∗ 𝑸 𝑇 ) − k ≤ (5.1) with probability at least − 𝛿 / .Then for this same 𝜂 and 𝑇 it in fact holds that for any distribution satisfying Assumptions 2.1 and 2.2and with 𝜌 𝑘 ≥ 𝜌 , we have k 𝑼 ∗ 𝑸 𝑇 ( 𝑽 ∗ 𝑸 𝑇 ) − k ≤ with probability at least − 𝛿 . Proposition 5.1 implies that it suﬃces to prove the error guarantee (5.1) in the special case when P 𝑨 has ﬁnite support of cardinality at most 𝑇 .Let us ﬁx a time horizon 𝑇 and assume in what follows that 𝑚 : = | supp ( P 𝑨 ) | ≤ 𝑇 .We begin bydeﬁning the good events for Phase I. We adopt a constant step size 𝜂 , to be speciﬁed. Denote E : = { 𝑀 − ( 𝑨 − 𝑴 ) 𝑼𝑼 ∗ : 𝑨 ∈ supp ( P 𝑨 )} . For 𝑖 ≥ , we will set G 𝑖 = { max 𝑬 ∈ E k 𝑽 ∗ 𝑬𝑼𝑾 𝑖 k ≤ 𝛾 } ∩ G 𝑖 − . Note that this choice satisﬁes (2.4) for all 𝑖 > .To deﬁne the initial good event G , we need to deﬁne a larger set of matrices to condition on. Forall 𝑟, ℓ ≥ , set E 𝑟,ℓ : = { 𝑽 ∗ 𝑭 · · · 𝑭 𝑟 𝑼 : 𝑭 𝑖 ∈ E for at most ℓ distinct indices 𝑖 ∈ [ 𝑟 ] , and 𝑭 𝑖 = ( + 𝜂𝜆 𝑘 + ) − ( I + 𝜂 𝑴 ) 𝑼𝑼 ∗ otherwise } The set E 𝑟,ℓ has cardinality less than ( 𝑟 ( 𝑚 + )) ℓ , and k 𝑬 k ≤ for any 𝑬 ∈ E 𝑟,ℓ , and any 𝑟, ℓ ≥ .We have deﬁned E 𝑟,ℓ so that control over max 𝑬 ∈ E 𝑟 + ,ℓ + k 𝑬𝑾 𝑡 − k gives control over max 𝑬 ∈ E 𝑟,ℓ k 𝑬𝑾 𝑡 k .Finally, we deﬁne G : = Ù 𝑇 + 𝑟,ℓ = ( max 𝑬 ∈ E 𝑟,ℓ k 𝑬𝑾 k ≤ √ ℓ𝛾 √ ) ∩ {k 𝑾 k ≤ √ 𝑑𝛾 } . (5.2)Since 𝑽 ∗ ( 𝑨 − 𝑴 ) 𝑼 ∈ E , almost surely, this choice satisﬁes (2.4) for 𝑖 = . Our strategy will be similar to the one used in Section 4. However, in order to show that the goodevents G 𝑖 hold with high probability, we will also need a second recurrence that allows us to controlthe norm of matrices of the form 𝑬𝑾 𝑡 , for 𝑬 ∈ E 𝑟,ℓ . The details appear in Section D.6. Conclusion

This work gives the ﬁrst nearly optimal analysis of Oja’s algorithm for streaming PCA beyond therank one case. Our analysis is conceptually simple: we show that the spectral norm of the matrix 𝑾 𝑡 concentrates well around its expectation, once we condition on 𝑾 𝑡 − having the same behavior. Andour concentration results are strong enough that we can pay to union bound over the entire course ofthe algorithm, to show that 𝑾 𝑡 is well behaved for all 𝑡 ≥ .The matrix concentration techniques we have applied here could be useful in analyzing other PCA-like algorithms, or, more generally, other stochastic algorithms for simple non-convex optimization TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 11 problems. An interesting question is whether these techniques can prove gap-free rates for Oja’salgorithm outside the rank-one setting. This would extend the results of [1] to the general case.Finally, we stress that the algorithm we have described here requires a priori knowledge of theproblem parameters (including the gap 𝜌 𝑘 ) to set the step sizes, which is a serious limitation inpractice. Recently, [14] developed a data-driven procedure to adaptively select the optimal step sizes.Obtaining theoretical guarantees for this or similar algorithms is an important open problem. Acknowledgement

We thank Joel Tropp and Amelia Henriksen for valuable discussions which greatly improved thismanuscript.

Appendix A. Additional results for Section 3

The following proposition develops the expansion described in Lemma 2.7 and gives explicit boundson the norms of the error matrices 𝑱 𝑡, and 𝑱 𝑡, .We recall the following deﬁnitions 𝑾 𝑡 = 𝑼 ∗ 𝒁 𝑡 ( 𝑽 ∗ 𝒁 𝑡 ) − 𝑯 𝑡 = 𝑼 ∗ ( I + 𝜂 𝑴 ) 𝒁 𝑡 − ( 𝑽 ∗ ( I + 𝜂 𝑴 ) 𝒁 𝑡 − ) − 𝚫 𝑡 = 𝜂 𝑡 𝑽 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝒁 𝑡 − ( 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ) − Proposition A.1.

Let 𝑡 ≥ . Assume that 𝜂 𝑡 is small enough that 𝑴 (cid:23) − 𝜂 𝑡 I , and assume that (2.4) holds for 𝑖 = 𝑡 . Let 𝐸 𝑡 = ( 𝑘 / 𝑝 + k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 ) 𝜀 𝑡 = 𝜂 𝑡 𝑀 ( + 𝛾 ) . Then k 𝚫 𝑡 𝑡 − k ≤ 𝜀 𝑡 almost surely, and 𝑾 𝑡 ( I − 𝚫 𝑡 ) = 𝑯 𝑡 + 𝑱 𝑡, + 𝑱 𝑡, for 𝑱 𝑡, and 𝑱 𝑡, satisfying k 𝑱 𝑡, 𝑡 − k 𝑝,𝑝 ≤ 𝐸 𝑡 𝜀 𝑡 k 𝑱 𝑡, 𝑡 − k 𝑝,𝑝 ≤ 𝐸 𝑡 𝜀 𝑡 , and E [ 𝑱 𝑡, : F 𝑡 − ] = .Proof. We employ the notation of the proof of Lemma 2.7. (See Appendix G.) First, we show the boundon 𝚫 𝑡 . Since 𝜂 𝑡 𝑴 (cid:23) − I , we have k 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) − 𝑽 k ≤ . Moreover, since k 𝑽 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝑼 𝑊 𝑡 − k ≤ 𝑀𝛾 almost surely, we have that k 𝚫 𝑡 𝑡 − k ≤ k 𝜂 𝑡 𝑽 ∗ ( 𝑨 𝑡 − 𝑴 ) ( 𝑼𝑼 ∗ + 𝑽𝑽 ∗ ) 𝒁 𝑡 − ( 𝑽 ∗ 𝒁 𝑡 − ) − 𝑡 − k≤ 𝜂 𝑡 k 𝑽 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝑼𝑼 ∗ 𝒁 𝑡 − ( 𝑽 ∗ 𝒁 𝑡 − ) − 𝑡 − k + 𝜂 𝑡 k 𝑽 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝑽 k = 𝜂 𝑡 k 𝑽 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝑼𝑾 𝑡 − 𝑡 − k + 𝜂 𝑡 k 𝑽 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝑽 k≤ 𝜂 𝑡 𝑀 ( + 𝛾 ) = : 𝜀 𝑡 . We can bound k b 𝚫 𝑡 𝑡 − k 𝑝,𝑝 by a similar argument. First, note that Assumption 2.2 implies that k 𝑨 𝑡 − 𝑴 k ≤ 𝑀 almost surely. Hence k b 𝚫 𝑡 𝑡 − k 𝑝,𝑝 ≤ 𝜂 𝑡 k 𝑼 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝑼𝑼 ∗ 𝑍 𝑡 − ( 𝑽 ∗ 𝒁 𝑡 − ) − 𝑡 − k 𝑝,𝑝 + 𝜂 𝑡 k 𝑼 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝑽 𝑡 − k 𝑝,𝑝 = 𝜂 𝑡 k 𝑼 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝑼 k k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + 𝜂 𝑡 k 𝑼 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝑽 𝑡 − k 𝑝,𝑝 ≤ ( k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + 𝑘 / 𝑝 ) 𝜂 𝑡 𝑀 ≤ ( k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + 𝑘 / 𝑝 ) 𝜀 𝑡 , Finally, we have k 𝑯 𝑡 𝑡 − k 𝑝,𝑝 ≤ + 𝜂 𝑡 𝜆 𝑘 + + 𝜂 𝑡 𝜆 𝑘 k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 ≤ k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 . We now employ Lemma 2.7. The term 𝑱 𝑡, satisﬁes E [ 𝑱 𝑡, 𝑡 − | F 𝑡 − ] = , and we have k 𝑱 𝑡, 𝑡 − k 𝑝,𝑝 ≤ k b 𝚫 𝑡 𝑡 − k 𝑝,𝑝 + k 𝑯 𝑡 𝑡 − k 𝑝,𝑝 k 𝚫 𝑡 𝑡 − k≤ ( k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + 𝑘 / 𝑝 ) 𝜀 𝑡 + k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 𝜀 𝑡 ≤ 𝐸 𝑡 𝜀 𝑡 . Finally, k 𝑱 𝑡, k 𝑝,𝑝 ≤ k b 𝚫 𝑡 𝑡 − k 𝑝,𝑝 k 𝚫 𝑡 𝑡 − k ≤ ( k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + 𝑘 / 𝑝 ) 𝜀 𝑡 ≤ 𝐸 𝑡 𝜀 𝑡 . (cid:3) Combining Proposition A.1 with Proposition 2.9 immediately yields a recursive bound.

Proposition A.2.

Adopt the setting of Proposition A.1. If 𝜀 𝑡 ≤ / , then k 𝑾 𝑡 𝑡 k 𝑝,𝑝 ≤ k 𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ 𝐾 ,𝑡 k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + 𝐾 ,𝑡 , (A.1) where 𝐾 ,𝑡 = ( + 𝜀 𝑡 ) ( (cid:18) + 𝜂 𝑡 𝜆 𝑘 + 𝜂 𝑡 𝜆 𝑘 + (cid:19) + 𝑝𝜀 𝑡 ) 𝐾 ,𝑡 = 𝑝𝑘 / 𝑝 𝜀 𝑡 . Proof.

Reusing the notation of Proposition A.1, we have 𝑾 𝑡 𝑡 − ( I − 𝚫 𝑡 ) = 𝑯 𝑡 𝑡 − + 𝑱 𝑡, 𝑡 − + 𝑱 𝑡, 𝑡 − , where E [ 𝑱 𝑡, 𝑡 − : F 𝑡 − ] = . Since 𝑯 𝑡 𝑡 − is F 𝑡 − -measurable, Proposition 2.9 therefore yields forany 𝜆 > k 𝑾 𝑡 𝑡 − ( I − 𝚫 𝑡 ) k 𝑝,𝑝 ≤ ( + 𝜆 ) ( k 𝑯 𝑡 𝑡 − k 𝑝,𝑝 + ( 𝑝 − ) 𝐸 𝑡 𝜀 𝑡 + 𝜆 − 𝐸 𝑡 𝜀 𝑡 ) . Choosing 𝜆 = 𝜀 𝑡 , we obtain k 𝑾 𝑡 𝑡 − ( I − 𝚫 𝑡 ) k 𝑝,𝑝 ≤ ( + 𝜀 𝑡 ) ( k 𝑯 𝑡 𝑡 − k 𝑝,𝑝 + 𝑝𝐸 𝑡 𝜀 𝑡 ) . Finally, under the assumption that k 𝚫 𝑡 𝑡 − k ≤ 𝜀 𝑡 ≤ almost surely, on the event G 𝑡 − the matrix I − 𝚫 𝑡 is invertible and satisﬁes k ( I − 𝚫 𝑡 ) − 𝑡 − k ≤ ( − k 𝚫 𝑡 𝑡 − k ) − ≤ ( − 𝜀 𝑡 ) − Hence k 𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ k 𝑾 𝑡 𝑡 − ( I − 𝚫 𝑡 ) k 𝑝,𝑝 k ( I − 𝚫 𝑡 ) − 𝑡 − k ≤ + 𝜀 𝑡 ( − 𝜀 𝑡 ) ( k 𝑯 𝑡 𝑡 − k 𝑝,𝑝 + 𝑝𝐸 𝑡 𝜀 𝑡 ) . TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 13 Since + 𝜀 𝑡 ( − 𝜀 𝑡 ) ≤ + 𝜀 𝑡 for all 𝜀 𝑡 ≤ and ( + 𝜀 𝑡 ) 𝐸 𝑡 ≤ ( + 𝜀 𝑡 ) ( 𝑘 / 𝑝 + k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 ) and ( + 𝜀 𝑡 ) ≤ for all 𝜀 𝑡 ≤ , this proves the claim. (cid:3) Appendix B. Proof of Theorem 3.1

We will unroll the one-step recurrence of Proposition A.2. We ﬁrst bound 𝐾 ,𝑖 . We have 𝐾 ,𝑖 ≤ (cid:18) + 𝜂 𝑖 𝜆 𝑘 + 𝜂 𝑖 𝜆 𝑘 + (cid:19) + ( + 𝑝 ) 𝜀 𝑖 + 𝑝𝜀 𝑖 ≤ (cid:18) + 𝜂 𝑖 𝜆 𝑘 + 𝜂 𝑖 𝜆 𝑘 + (cid:19) + ( + 𝑝 ) 𝜀 𝑖 , where the second inequality follows from the ﬁrst assumption in (3.1). The second assumption in (3.1)implies that ≤ + 𝜂 𝑖 𝜆 𝑘 ≤ , so (cid:18) + 𝜂 𝑖 𝜆 𝑘 + + 𝜂 𝑖 𝜆 𝑘 (cid:19) = (cid:18) − 𝜂 𝑖 𝜌 𝑘 + 𝜂 𝑖 𝜆 𝑘 (cid:19) ≤ (cid:18) − 𝜂 𝑖 𝜌 𝑘 (cid:19) ≤ e − 𝜂 𝑖 𝜌 𝑘 . Since + 𝑝 ≤ 𝑝 for all 𝑝 ≥ , we obtain 𝐾 ,𝑖 ≤ e − 𝜂 𝑖 𝜌 𝑘 + 𝐶 𝑝𝜀 𝑖 . We now proceed to prove the ﬁrst claim by induction. When 𝑡 = , we use (A.1) to obtain k 𝑾 k 𝑝,𝑝 ≤ k 𝑾 k 𝑝,𝑝 ≤ 𝐾 , k 𝑾 k 𝑝,𝑝 + 𝐾 , ≤ e − 𝜂 𝜌 𝑘 k 𝑾 k 𝑝,𝑝 + 𝐶 𝑝𝜀 k 𝑾 k 𝑝,𝑝 + 𝐶 𝑝𝑘 / 𝑝 𝜀 , which is the desired bound.Proceeding by induction, for 𝑡 > we have k 𝑾 𝑡 𝑡 k 𝑝,𝑝 ≤ k 𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ 𝐾 ,𝑡 k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + 𝐾 ,𝑡 ≤ e − 𝜂 𝑡 𝜌 𝑘 k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + 𝐶 𝑝𝜀 𝑡 k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + 𝐾 ,𝑡 ≤ e − 𝜂 𝑡 𝜌 𝑘 (cid:16) e − 𝑠 𝑡 − 𝜌 𝑘 k 𝑾 k 𝑝,𝑝 + 𝐶 𝑝𝜀 𝑡 − Õ 𝑡 − 𝑖 = k 𝑾 𝑖 𝑖 k 𝑝,𝑝 + 𝐶 𝑝𝑘 / 𝑝 𝜀 𝑡 − ( 𝑡 − ) (cid:17) + 𝐶 𝑝𝜀 𝑡 k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + 𝐶 𝑝𝑘 / 𝑝 𝜀 𝑡 ≤ e − 𝑠 𝑡 𝜌 𝑘 k 𝑾 k 𝑝,𝑝 + 𝐶 𝑝𝜀 𝑡 Õ 𝑡 − 𝑖 = k 𝑾 𝑖 𝑖 k 𝑝,𝑝 + 𝐶 𝑝𝑘 / 𝑝 𝜀 𝑡 𝑡 , where in the ﬁnal inequality we have used that e − 𝜂 𝑡 𝜌 𝑘 𝜀 𝑡 − ≤ 𝜀 𝑡 by the third assumption of (3.1). Thisproves the ﬁrst bound.For the second bound, we proceed in a similar way, but with a sharper bound on 𝐾 ,𝑖 . The secondassumption of (3.1) again implies (cid:18) + 𝜂 𝑖 𝜆 𝑘 + + 𝜂 𝑖 𝜆 𝑘 (cid:19) = (cid:18) − 𝜂 𝑖 𝜌 𝑘 + 𝜂 𝑖 𝜆 𝑘 (cid:19) ≤ − 𝜂 𝑖 𝜌 𝑘 + ( 𝜂 𝑖 𝜌 𝑘 ) ≤ − 𝜂 𝑖 𝜌 𝑘 , and therefore 𝐾 ,𝑖 ≤ ( + 𝜀 𝑖 ) (cid:18) − 𝜂 𝑖 𝜌 𝑘 + 𝑝𝜀 𝑖 (cid:19) ≤ exp (cid:18) − 𝜂 𝑖 𝜌 𝑘 + ( + 𝑝 ) 𝜀 𝑖 (cid:19) ≤ e − 𝜂 𝑖 𝜌 𝑘 / , where the ﬁnal step uses Assumption (3.3) and the fact that + 𝑝 ≤ 𝑝 for all 𝑝 ≥ .When 𝑡 = , we therefore have k 𝑾 k 𝑝,𝑝 ≤ k 𝑾 k 𝑝,𝑝 ≤ 𝐾 , k 𝑾 k 𝑝,𝑝 + 𝐾 , ≤ e − 𝜂 𝜌 𝑘 / k 𝑾 k 𝑝,𝑝 + 𝐶 𝑝𝑘 / 𝑝 𝜀 , as desired, and for 𝑡 > the induction hypothesis yields k 𝑾 𝑡 𝑡 k 𝑝,𝑝 ≤ k 𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ 𝐾 ,𝑡 k 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + 𝐾 ,𝑡 ≤ e − 𝜂 𝑡 𝜌 𝑘 / (cid:16) e − 𝑠 𝑡 − 𝜌 𝑘 / k 𝑾 k 𝑝,𝑝 + 𝐶 𝑝𝑘 / 𝑝 𝜀 𝑡 − ( 𝑡 − ) (cid:17) ≤ e − 𝑠 𝑡 𝜌 𝑘 / k 𝑾 k 𝑝,𝑝 + 𝐶 𝑝𝑘 / 𝑝 𝜀 𝑡 𝑡 , where the ﬁnal inequality again uses the third assumption in (3.1). This proves the second bound. (cid:3) Appendix C. Additional results for Section 4

Lemma C.1.

Under the conditions of Proposition 4.1, the assumptions of (3.1) hold.Proof.

First assumption. We have 𝜀 𝑖 = 𝜂 𝑖 𝑀 ( + 𝛾 ) = ( + √ ) 𝛼𝑀 ( 𝛽 + 𝑖 ) 𝜌 𝑘 ≤ 𝐶 𝜀 𝛼𝛽 ¯ 𝜌 𝑘 , where 𝐶 𝜀 = ( + √ ) . So the ﬁrst assumption is fulﬁlled as long as 𝛽 / 𝛼 ≥ 𝐶 𝜀 / ¯ 𝜌 𝑘 . (C.1a)Second assumption. As above, we have 𝜂 𝑖 k 𝑴 k ≤ 𝛼 k 𝑴 k 𝛽𝜌 𝑘 ≤ 𝛼𝛽 ¯ 𝜌 𝑘 , so the assumption is fulﬁlled if (C.1a) holds.Third assumption. It suﬃces to show that 𝜀 𝑖 − 𝜀 𝑖 ≤ + 𝜂 𝑖 𝜌 𝑘 ∀ 𝑖 ≥ , which is equivalent to 𝛽 + 𝑖 − ≤ 𝛼 / 𝛽 + 𝑖 ∀ 𝑖 ≥ . This holds as long as 𝛼 ≥ . (C.1b)We obtain that all three assumptions hold under (C.1a) and (C.1b), as claimed. (cid:3) Lemma C.2.

In the setting of Theorem D.5, if 𝑠 = q 𝛽 + 𝛽 + 𝑇 , then P {k 𝑾 𝑇 k ≥ 𝑠 } ≤ 𝛿 / . Proof.

We have P {k 𝑾 𝑇 𝑇 k ≥ 𝑠 } ≤ inf 𝑝 ≥ 𝑠 − 𝑝 k 𝑾 𝑇 𝑇 k 𝑝𝑝,𝑝 . TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 15 In particular, we choose 𝑠 = e (cid:18) 𝛽 + 𝛽 + 𝑇 (cid:19) 𝛼 + e 𝐶 𝛼 ¯ 𝜌 𝑘 𝑇 ( 𝛽 + 𝑇 ) l og ( 𝑘 / 𝛿 ) , and 𝑝 = l og ( 𝑘 / 𝛿 ) . It then follows from (4.4) that P {k 𝑾 𝑇 𝑇 k ≥ 𝑠 } ≤ 𝑠 − 𝑝 k 𝑾 𝑇 𝑇 k 𝑝𝑝 ≤ 𝑘 𝑠 (cid:18) 𝛽 + 𝛽 + 𝑇 (cid:19) 𝛼 + 𝑠 𝑝 𝐶 𝛼 ¯ 𝜌 𝑘 𝑇 ( 𝛽 + 𝑇 ) ! 𝑝 / = 𝑘 e − 𝑝 = 𝛿 / . Combining the above bounds, we obtain that k 𝑾 𝑇 k ≤ 𝑠 ≤ e (cid:18) 𝛽 + 𝛽 + 𝑇 (cid:19) 𝛼 / + e 𝐶 𝛼𝑀𝜌 𝑘 r l og ( 𝑘 / 𝛿 ) 𝑇 , with probability at least − 𝛿 . Since both terms are smaller than e q 𝛽 + 𝛽 + 𝑇 , the claim follows. (cid:3) Appendix D. Additional results for Section 5

Our main tool will be the following slight variation on Proposition A.1.

Proposition D.1.

Let 𝑡 ≥ . Assume that 𝜂 𝑡 is small enough that 𝑴 (cid:23) − 𝜂 𝑡 I , and assume that (2.4) holds for 𝑖 = 𝑡 . Consider an arbitrary deterministic matrix 𝑬 ∈ E 𝑟,ℓ .Let ¯ 𝐸 𝑡 = + max 𝑬 ′′ ∈ E 𝑟 + ,ℓ + k 𝑬 ′′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 𝜀 = 𝜂𝑀 ( + 𝛾 ) . Then k 𝚫 𝑡 𝑡 − k ≤ 𝜀 almost surely, and 𝑬𝑾 𝑡 ( I − 𝚫 𝑡 ) = 𝑬𝑯 𝑡 + 𝑬𝑱 𝑡, + 𝑬𝑱 𝑡, for 𝑬𝑱 𝑡, and 𝑬𝑱 𝑡, satisfying k 𝑬𝑱 𝑡, 𝑡 − k 𝑝,𝑝 ≤ ¯ 𝐸 𝑡 𝜀 k 𝑬𝑱 𝑡, 𝑡 − k 𝑝,𝑝 ≤ ¯ 𝐸 𝑡 𝜀 , and E [ 𝑬𝑱 𝑡, : F 𝑡 − ] = .Proof. The proof is a slight modiﬁcation on the proof of Proposition A.1. By construction, k 𝑬𝑯 𝑡 𝑡 − k 𝑝,𝑝 ≤ (cid:18) + 𝜂𝜆 𝑘 + 𝜂𝜆 𝑘 + (cid:19) k 𝑬 ′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 , where 𝑬 ′ = + 𝜂𝜆 𝑘 + 𝑬𝑼 ∗ ( I + 𝜂 𝚺 ) 𝑼 ∈ E 𝑟 + ,ℓ ⊆ E 𝑟 + ,ℓ + .Similarly, we have k 𝑬 b 𝚫 𝑡 𝑡 − k 𝑝,𝑝 ≤ 𝜂 k 𝑬𝑼 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝑼𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + 𝜂 k 𝑬𝑼 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝑽 k 𝑝,𝑝 ≤ 𝜂𝑀 ( k 𝑬 ′′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + k 𝑬 k 𝑝,𝑝 )≤ 𝜀 ( k 𝑬 ′′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + ) where 𝑬 ′′ = 𝑀 𝑬𝑼 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝑼 ∈ E 𝑟 + ,ℓ + , and we have used k 𝑬 k 𝑝 ≤ k 𝑬 k ≤ .We therefore obtain k 𝑬𝑱 𝑡, 𝑡 − k 𝑝,𝑝 ≤ k 𝑬 b 𝚫 𝑡 𝑡 − k 𝑝,𝑝 + k 𝑬𝑯 𝑡 𝑡 − k 𝑝,𝑝 k 𝚫 𝑡 𝑡 − k ≤ (cid:0) k 𝑬 ′′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + k 𝑬 ′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + (cid:1) 𝜀 ≤ ¯ 𝐸 𝑡 𝜀 , and k 𝑬𝑱 𝑡, 𝑡 − k 𝑝,𝑝 ≤ k 𝑬 b 𝚫 𝑡 𝑡 − k 𝑝,𝑝 k 𝚫 𝑡 𝑡 − k ≤ ( k 𝑬 ′′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + ) 𝜀 ≤ ¯ 𝐸 𝑡 𝜀 . (cid:3) The following two results are the appropriate analogues of Proposition A.2 and Theorem 3.1.

Proposition D.2.

Adopt the setting of Proposition D.1. If 𝜀 ≤ / , then max 𝑬 ∈ E 𝑟,ℓ k 𝑬𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ ¯ 𝐾 max 𝑬 ′ ∈ E 𝑟 + ,ℓ k 𝑬 ′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + ¯ 𝐾 max 𝑬 ′′ ∈ E 𝑟 + ,ℓ + k 𝑬 ′′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + ¯ 𝐾 , (D.1) where ¯ 𝐾 = ( + 𝜀 ) (cid:18) + 𝜂𝜆 𝑘 + 𝜂𝜆 𝑘 + (cid:19) ¯ 𝐾 = ( + 𝜀 ) 𝑝𝜀 Proof.

As in the proof of Proposition A.2, we have for any 𝑬 ∈ E 𝑟,ℓ , k 𝑬 k 𝑝,𝑝 ≤ ( + 𝜀 ) ( k 𝑬𝑯 𝑡 𝑡 − k 𝑝,𝑝 + 𝑝 ¯ 𝐸 𝑡 𝜀 ) . As in the proof of Proposition D.1, we can write k 𝑬𝑯 𝑡 𝑡 − k 𝑝,𝑝 ≤ (cid:18) + 𝜂𝜆 𝑘 + 𝜂𝜆 𝑘 + (cid:19) k 𝑬 ′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 where 𝑬 ′ = + 𝜂𝜆 𝑘 + 𝑬𝑼 ∗ ( I + 𝜂 𝚺 ) 𝑼 ∈ E 𝑟 + ,ℓ . Since ¯ 𝐸 𝑡 ≤ max 𝑬 ′′ ∈ E 𝑟 + ,ℓ + k 𝑬 ′′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + , taking the maximum over all 𝑬 ∈ E 𝑟,ℓ and 𝑬 ′ ∈ E 𝑟 + ,ℓ yields the claim. (cid:3) Theorem D.3.

Let 𝑡 ≤ 𝑇 be a positive integer, and assume the following requirements hold for some 𝑝 ≥ : 𝜀 ≤ , (D.2a) 𝜂 k 𝑴 k ≤ , (D.2b) 𝑝𝜀 ≤ 𝜂𝜌 𝑘 (D.2c) 𝛾 ≥ . (D.2d) Then for any 𝑟, ℓ ∈ [ 𝑇 − 𝑡 + ] and 𝑝 ≥ , max 𝑬 ∈ E 𝑟,ℓ k 𝑬𝑾 𝑡 𝑡 k 𝑝,𝑝 ≤ max 𝑬 ∈ E 𝑟,ℓ k 𝑬𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ ℓ𝛾 e − 𝑡𝜂𝜌 𝑘 / + 𝐶 𝑝𝛾 𝜀 𝑡 . where 𝐶 = .Proof. First, as in the proof of Theorem 3.1, Assumptions (D.2b) and (D.2c) imply ¯ 𝐾 + ¯ 𝐾 = ( + 𝜀 ) ( (cid:18) + 𝜂𝜆 𝑘 + 𝜂𝜆 𝑘 + (cid:19) + 𝑝𝜀 ) ≤ e − 𝜂𝜌 𝑘 / . TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 17 In particular, ¯ 𝐾 + ¯ 𝐾 ≤ . Assumption (D.2a) likewise implies that ¯ 𝐾 ≤ .We now turn to the proof of the main claim, which we prove by induction on 𝑡 . For convenience,we introduce the notation 𝛾 e = 𝛾 /√ . When 𝑡 = and 𝑟, ℓ ≤ 𝑇 , (D.1) impliesmax 𝑬 ∈ E 𝑟,ℓ k 𝑬𝑾 k 𝑝,𝑝 ≤ max 𝑬 ∈ E 𝑟,ℓ k 𝑬𝑾 k 𝑝,𝑝 ≤ ¯ 𝐾 max 𝑬 ′ ∈ E 𝑟 + ,ℓ k 𝑬 ′ 𝑾 k 𝑝,𝑝 + ¯ 𝐾 max 𝑬 ′′ ∈ E 𝑟 + ,ℓ + k 𝑬 ′′ 𝑾 k 𝑝,𝑝 𝜀 + ¯ 𝐾 ≤ ¯ 𝐾 ℓ𝛾 + ¯ 𝐾 ( ℓ + ) 𝛾 + ¯ 𝐾 ≤ ℓ𝛾 ( ¯ 𝐾 + ¯ 𝐾 ) + ( + 𝛾 ) ¯ 𝐾 ≤ ℓ𝛾 e − 𝜂𝜌 𝑘 / + 𝛾 𝐾 where we have used the deﬁnition of G and where the last step uses (D.2d). Proceeding by induction,we have max 𝑬 ∈ E 𝑟,ℓ k 𝑬𝑾 𝑡 𝑡 k 𝑝,𝑝 ≤ max 𝑬 ∈ E 𝑟,ℓ k 𝑬𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ ¯ 𝐾 max 𝑬 ′ ∈ E 𝑟 + ,ℓ k 𝑬 ′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + ¯ 𝐾 max 𝑬 ′′ ∈ E 𝑟 + ,ℓ + k 𝑬 ′′ 𝑾 𝑡 − 𝑡 − k 𝑝,𝑝 + ¯ 𝐾 ≤ ¯ 𝐾 ( ℓ𝛾 e −( 𝑡 − ) 𝜂𝜌 𝑘 / + ( 𝑡 − ) 𝛾 ¯ 𝐾 )+ ¯ 𝐾 ( ( ℓ + ) 𝛾 e −( 𝑡 − ) 𝜂𝜌 𝑘 / + ( 𝑡 − ) 𝛾 ¯ 𝐾 ) + ¯ 𝐾 ≤ ℓ𝛾 ( ¯ 𝐾 + ¯ 𝐾 ) e −( 𝑡 − ) 𝜂𝜌 𝑘 / + ( 𝑡 − ) ( ¯ 𝐾 + ¯ 𝐾 ) 𝛾 ¯ 𝐾 + ( + 𝛾 ) ¯ 𝐾 = ℓ𝛾 e − 𝑡𝜂𝜌 𝑘 / + 𝛾 𝐾 𝑡 , as claimed. (cid:3) Proposition D.4.

Fix 𝑠 ∈ ( , ) , ≤ 𝛾 ≤ 𝐶 𝛾 𝑑𝛿 , and 𝑝 ≥ , where 𝐶 𝛾 = 𝛾 is the constant inLemma H.4. Given 𝜌 > , deﬁne the normalized gap ¯ 𝜌 = min (cid:26) 𝑀𝜌 , k 𝑴 k 𝜌 , (cid:27) , and adopt the step size 𝜂 = 𝐶 𝜂 l og ( e 𝑑 / 𝑠𝛿 ) 𝜌𝑇 . If 𝜌 𝑘 ≥ 𝜌 / and 𝑇 ≥ 𝑝 · 𝐶 𝑇 𝛾 l og ( e 𝑑 / 𝑠𝛿 ) 𝑠 ¯ 𝜌 where 𝐶 𝜂 ≥ + og 𝐶 𝛾 , 𝐶 𝑇 ≥ 𝐶 𝜂 , then k 𝑾 𝑇 𝑇 − k 𝑝,𝑝 ≤ 𝑠 (cid:16) + 𝑘 / 𝑝 (cid:17) and max 𝑬 ∈ E , k 𝑬𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ 𝛾 e for all ≤ 𝑡 ≤ 𝑇 . Proof.

We will apply Theorems 3.1 and D.3. First, note that (D.2d) holds by assumption. We now turnto the other conditions.Assumption (D.2a): Since 𝛾 ≥ , we have 𝜀 = 𝜂𝑀 ( + 𝛾 ) ≤ 𝐶 𝜂 𝛾 l og ( e 𝑑 / 𝑠𝛿 ) 𝑀𝜌𝑇 . The assumption therefore holds as long as 𝐶 𝑇 ≥ 𝐶 𝜂 . (D.3)Assumption (D.2b): As above, we have 𝜂 k 𝑴 k ≤ 𝐶 𝜂 l og ( e 𝑑 / 𝑠𝛿 ) k 𝑴 k 𝜌𝑇 , and the requirement (D.3) implies that this quantity is also smaller than / .Assumption (D.2c): Since 𝜂𝜌 𝑘 = 𝐶 𝜂 l og ( e 𝑑 / 𝑠𝛾 ) 𝑇 ≥ 𝑇 and > , it suﬃces to prove the strongerclaim 𝑝𝜀 ≤ 𝑠 𝑇 . (D.4)This is satisﬁed so long as 𝑝 · 𝐶 𝜂 𝛾 l og ( e 𝑑 / 𝑠𝛿 ) 𝑀 𝜌 𝑇 ≤ 𝑠 𝑇 . which will hold if 𝐶 𝑇 ≥ 𝐶 𝜂 . (D.5)This requirement is stronger than (D.3), so Assumptions (D.2a)–(D.2c) hold under the sole condi-tion (D.5).We now turn to the two claimed bounds. First, we instantiate Theorem 3.1 with the choice 𝜂 𝑖 = 𝜂 for ≤ 𝑖 ≤ 𝑇 . The third assumption of (3.1) is trivially satisﬁed when when 𝜂 𝑖 is constant, sincein that case 𝜀 𝑖 = 𝜀 𝑖 − for all 𝑖 ≥ . The remaining assumptions correspond directly to Assump-tions (D.2a), (D.2b), and (D.2c). The assumptions of Theorem 3.1 are therefore satisﬁed, so we obtain, k 𝑾 𝑇 𝑇 − k 𝑝,𝑝 ≤ e − 𝑇 𝜂𝜌 𝑘 / k 𝑾 k 𝑝,𝑝 + 𝑝𝑘 / 𝑝 𝜀 𝑇 . The deﬁnition of G in (5.2) and the fact that 𝜌 𝑘 ≥ 𝜌 / implies that the ﬁrst term is at most e − 𝑇 𝜂𝜌 𝑘 / 𝑑𝛾 = ( e 𝑑 / 𝑠𝛿 ) − 𝐶 𝜂 / 𝑑𝛾 , and this will be less than 𝑠 if 𝐶 𝜂 ≥ + og ( 𝐶 𝛾 ) . Since (D.4) holds, the second term satisﬁes 𝑝𝑘 / 𝑝 𝜀 𝑇 ≤ 𝑠 𝑘 / 𝑝 < 𝑠 𝑘 / 𝑝 . We obtain k 𝑾 𝑇 𝑇 − k 𝑝,𝑝 ≤ 𝑠 (cid:16) + 𝑘 / 𝑝 (cid:17) , as claimed.For the second claim, we rely on Theorem D.3. Assumptions (D.2a)–(D.2d) having already beenveriﬁed, we obtain for all ≤ 𝑡 ≤ 𝑇 ,max 𝑬 ∈ E , k 𝑬𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ 𝛾 e − 𝑡𝜂𝜌 𝑘 / + 𝑝𝛾 𝜀 𝑡 . TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 19 Since 𝜌 𝑘 ≥ , the ﬁrst term is at most 𝛾 , and the second term is also at most 𝛾 by (D.4). We obtainthat max 𝑬 ∈ E , k 𝑬𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ 𝛾 e , as claimed. (cid:3) With Proposition D.4 in hand, we can prove a full version of Theorem 2.4.

Theorem D.5.

Fix a 𝜌 > and assume | supp ( P 𝑨 ) | = 𝑚 . Let ¯ 𝜌 = max (cid:26) 𝜌𝑀 , 𝜌 k 𝑴 k , (cid:27) , and set 𝑠 = / .Adopt the step size 𝜂 = 𝐶 𝜂 l og ( e 𝑑 / 𝛿𝑠 ) 𝜌𝑇 where 𝑇 ≥ 𝐶 𝑇 𝑘 ( log 𝑑 / 𝛿 ¯ 𝜌𝑠 ) 𝑠 𝛿 ¯ 𝜌 . and 𝐶 𝜂 ≥ + og 𝐶 𝛾 , 𝐶 𝑇 ≥ ( 𝐶 𝜂 𝐶 𝛾 ) / . If 𝑚 ≤ 𝑇 and 𝜌 𝑘 ≥ 𝜌 / , then k 𝑾 𝑇 k ≤ / with probability at least − 𝛿 / .Proof. We ﬁrst show that we can assume that l og 𝑇 ≤ og ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 ) . Indeed, if 𝑇 > (cid:16) 𝐶 𝑇 𝑑𝛿 ¯ 𝜌𝑠 (cid:17) , acrude argument similar to the one employed in the analysis of Phase II yields the claim. We give thefull details in Appendix F. In what follows, we therefore assumelog 𝑇 ≤ og ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 ) . (D.6)Set 𝛾 = 𝐶 𝛾 min ( p 𝑘 l og ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 ) 𝛿 , 𝑑𝛿 ) , where 𝐶 𝛾 is as in Lemma H.4.Recall that our goal is to show k 𝑾 𝑇 k ≤ 𝑠 with probability at least − 𝛿 / . The failure probabilitycan be bounded as P (cid:8) k 𝑾 𝑇 k ≥ 𝑠 (cid:9) ≤ P (cid:8) k 𝑾 𝑇 𝑇 k ≥ 𝑠 (cid:9) + P (cid:8) G 𝐶𝑇 (cid:9) ≤ inf 𝑝 ≥ 𝑠 − 𝑝 k 𝑾 𝑇 𝑇 k 𝑝𝑝,𝑝 + P (cid:8) G 𝐶𝑇 (cid:9) . If we choose 𝑝 = log ( 𝑘 / 𝛿 ) , then since l og ( 𝐶 𝑇 ) ≤ 𝐶 / 𝑇 log ( ) for any value of 𝐶 𝑇 , we have 𝑇 ≥ 𝐶 𝑇 𝑘 ( l og ( 𝑑 / 𝛿 ¯ 𝜌𝑠 )) 𝑠 𝛿 ¯ 𝜌 ≥ l og ( 𝑘 / 𝛿 ) · 𝐶 / 𝑇 𝑘 l og ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 ) 𝛿 · log ( e 𝑑 / 𝑠𝛿 ) 𝑠 ¯ 𝜌 ≥ 𝑝 𝐶 𝜂 𝛾 l og ( e 𝑑 / 𝑠𝛿 ) 𝑠 ¯ 𝜌 , as long as 𝐶 𝑇 ≥ ( 𝐶 𝜂 ( 𝐶 𝛾 ) ) / , which veriﬁes the assumption of Proposition D.4.We obtain k 𝑾 𝑇 𝑇 k 𝑝,𝑝 ≤ 𝑠 ( + 𝑘 / 𝑝 ) ≤ 𝑘 / 𝑝 𝑠 e . We therefore have 𝑠 − 𝑝 k 𝑾 𝑇 𝑇 k 𝑝𝑝,𝑝 ≤ e − l og ( 𝑘 / 𝛿 ) ≤ 𝛿 / . It remains to bound P n G 𝐶𝑇 o . Clearly P (cid:8) G 𝐶𝑇 (cid:9) ≤ P (cid:8) G 𝐶 (cid:9) + Õ 𝑇 𝑗 = P n G 𝐶𝑗 ∩ G 𝑗 − o . Since 𝑚 ≤ 𝑇 and we have assumed l og 𝑇 ≤ og ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 ) , we havelog ( e 𝑚𝑇 / 𝛿 ) ≤ og ( 𝑇 ) + log ( e / 𝛿 ) ≤

20 l og ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 ) + log ( e / 𝛿 ) ≤

21 l og ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 ) , so Lemma H.4 guarantees that G holds with probability at least − 𝛿 / .For the second term, we have P n G 𝐶𝑗 ∩ G 𝑗 − o = P (cid:26) max 𝑬 ∈ E , k 𝑬𝑾 𝑗 𝑗 − k ≥ 𝛾 (cid:27) ≤ Õ 𝑬 ∈ E , P (cid:8) k 𝑬𝑾 𝑗 𝑗 − k ≥ 𝛾 (cid:9) . Choose 𝑝 =

21 l og ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 ) . The same argument as above yields 𝑇 ≥ 𝑝 · 𝐶 / 𝑇 𝑘 log ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 ) 𝛿 · log ( e 𝑑 / 𝑠𝛿 ) 𝑠 ¯ 𝜌 , and this will be larger than the lower bound required on 𝑇 that was assumed in Proposition D.4 aslong as 𝐶 𝑇 ≥ ( 𝐶 𝜂 ( 𝐶 𝛾 ) ) / Proposition D.4 therefore yields P (cid:8) k 𝑬𝑾 𝑗 𝑗 − k ≥ 𝛾 (cid:9) ≤ 𝛾 − 𝑝 k 𝑬𝑾 𝑗 𝑗 − k 𝑝𝑝,𝑝 ≤ e − 𝑝 = e −

21 l og ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 ) for all 𝑬 ∈ E , and thus P n G 𝐶𝑗 | G 𝑗 − o ≤ Õ 𝑬 ∈ E , P (cid:8) k 𝑬𝑾 𝑗 𝑗 − k ≥ 𝛾 (cid:9) ≤ 𝑚 e −

21 l og ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 ) . This yields Õ 𝑇 𝑗 = P n G 𝐶𝑗 | G 𝑗 − o ≤ 𝑚𝑇 e −

21 l og ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 ) ≤ e −

21 l og ( 𝐶 𝑇 𝑑 / 𝛿 ¯ 𝜌𝑠 )+ og 𝑇 ≤ 𝛿 / , where the last step uses (D.6). Finally, choosing 𝑠 = / , we obtain P (cid:8) k 𝑾 𝑇 k ≥ / (cid:9) ≤ 𝛿 / , as claimed. (cid:3) Appendix E. A reduction to finite support

Let Ω be the space of 𝑑 × 𝑑 symmetric matrices. We argue that it suﬃces to assume that 𝑃 𝐴 hasﬁnite support of cardinality at most 𝑇 in Phase I. We prove this by comparing the product measure 𝑃 ⊗ 𝑇 𝐴 with another distribution 𝑃 𝑚 on Ω ⊗ 𝑇 . We specify this distribution by the following procedure:drawing a 𝑇 -tuple ( 𝐴 , . . . , 𝐴 𝑇 ) from the distribution 𝑃 𝑚 is accomplished by(1) Drawing 𝑚 independent samples ˆ 𝑨 , . . . , ˆ 𝑨 𝑚 from 𝑃 𝐴 . TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 21 (2) Drawing 𝑨 , . . . , 𝑨 𝑇 independently from the discrete distribution 𝑃 ˆ 𝐴 = 𝑚 Õ 𝑚𝑖 = 𝛿 ˆ 𝑨 𝑖 . That is, drawing 𝑨 , . . . , 𝑨 𝑇 independently and uniformly from the set { ˆ 𝑨 𝑖 } 𝑚𝑖 = with replace-ment.We will rely on the fact that the two distributions, 𝑃 ⊗ 𝑇 𝐴 and 𝑃 𝑚 , are close in total variation distancewhen 𝑚 is large. To see this, we ﬁrst recognize that drawing ( 𝐴 , . . . , 𝐴 𝑇 ) from 𝑃 ⊗ 𝑇 𝐴 is equivalent tothe following:(1) Draw 𝑚 independent samples ˆ 𝑨 , . . . , ˆ 𝑨 𝑚 from 𝑃 𝐴 .(2) Draw 𝑨 , . . . , 𝑨 𝑇 sequentially and uniformly from the set { ˆ 𝑨 𝑖 } 𝑚𝑖 = without replacement. De-note by 𝑃 ( 𝑇 ) ˆ 𝐴 the distribution of this sampling.It is a standard result [11] that, given any { ˆ 𝐴 𝑖 } 𝑚𝑖 = , 𝑑 TV (cid:16) 𝑃 ⊗ 𝑇 ˆ 𝐴 , 𝑃 ( 𝑇 ) ˆ 𝐴 (cid:17) ≤ 𝑇 𝑚 . We thus have the following:

Proposition E.1.

For any 𝛿 ∈ ( , ) , it holds that 𝑑 TV (cid:16) 𝑃 𝑚 , 𝑃 ⊗ 𝑇 𝐴 (cid:17) ≤ 𝛿 for all 𝑚 ≥ 𝑇 / 𝛿 .Proof. For any set 𝑆 ⊂ Ω ⊗ 𝑇 , we have (cid:12)(cid:12)(cid:12) 𝑃 𝑚 ( 𝑆 ) − 𝑃 ⊗ 𝑇 𝐴 ( 𝑆 ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) E ˆ 𝐴 𝑖 ∼ 𝑃 𝐴 , ≤ 𝑖 ≤ 𝑚 h 𝑃 ⊗ 𝑇 ˆ 𝐴 ( 𝑆 ) − 𝑃 ( 𝑇 ) ˆ 𝐴 ( 𝑆 ) i (cid:12)(cid:12)(cid:12) ≤ E ˆ 𝐴 𝑖 ∼ 𝑃 𝐴 , ≤ 𝑖 ≤ 𝑚 (cid:12)(cid:12)(cid:12) 𝑃 ⊗ 𝑇 ˆ 𝐴 ( 𝑆 ) − 𝑃 ( 𝑇 ) ˆ 𝐴 ( 𝑆 ) (cid:12)(cid:12)(cid:12) ≤ E ˆ 𝐴 𝑖 ∼ 𝑃 𝐴 , ≤ 𝑖 ≤ 𝑚 𝑑 TV (cid:16) 𝑃 ⊗ 𝑇 ˆ 𝐴 , 𝑃 ( 𝑇 ) ˆ 𝐴 (cid:17) ≤ 𝑇 𝑚 ≤ 𝛿. The claim follows from taking the maximum of | 𝑃 𝑚 ( 𝑆 ) − 𝑃 ⊗ 𝑇 𝐴 ( 𝑆 ) | over all subsets of Ω ⊗ 𝑇 . (cid:3) Given any ˆ 𝑨 , . . . , ˆ 𝑨 𝑚 , deﬁne the empirical average ˆ 𝑴 𝑚 : = E 𝐴 ∼ 𝑃 ˆ 𝐴 𝑨 = 𝑚 Õ 𝑚𝑖 = ˆ 𝑨 𝑖 . Denote by ˆ 𝜆 ≥ ˆ 𝜆 ≥ · · · ≥ ˆ 𝜆 𝑑 the eigenvalues of ˆ 𝑴 𝑚 , and write ˆ 𝜌 𝑘 = ˆ 𝜆 𝑘 − ˆ 𝜆 𝑘 + . Let ˆ 𝑉 ∈ ℝ 𝑑 × 𝑘 be theorthogonal matrix whose columns are the leading 𝑘 eigenvectors of ˆ 𝑴 𝑚 , and let ˆ 𝑼 ∈ ℝ 𝑑 ×( 𝑑 − 𝑘 ) be theorthogonal matrix consisting of the remaining eigenvectors. Standard results of matrix concentrationimplies that ˆ 𝑴 𝑚 is close to 𝑴 . In particular, we have the following: Proposition E.2.

Suppose that 𝑚 ≥ 𝑀 𝜌 𝑘 l og ( 𝑑 / 𝛿 ) . Let ˆ 𝑨 , . . . , ˆ 𝑨 𝑚 be drawn independently from 𝑃 𝐴 .Then it holds with probability at least − 𝛿 that k ˆ 𝑴 𝑚 − 𝑴 k ≤ 𝜌 𝑘 / , and, in particular, ˆ 𝜌 𝑘 ≥ 𝜌 𝑘 / and k 𝑼 ∗ ˆ 𝑽 k ≤ / . Proof.

By assumption 2, we have that k ˆ 𝑴 𝑚 − 𝑴 k ≤ 𝑀 almost surely. Then the matrix Bernsteininequality [31, Theorem 1.4] implies that, for any 𝑡 ≥ , P (cid:8) k ˆ 𝑴 𝑚 − 𝑴 k ≥ 𝑡 (cid:9) ≤ 𝑑 exp (cid:18) − 𝑚𝑡 / 𝑀 + 𝑀𝑡 / (cid:19) . Substituting 𝑡 = 𝜌 𝑘 / yields the ﬁrst claim. Using the perturbation theory of eigenvalues of symmetricmatrices, we have ˆ 𝜆 𝑘 ≥ 𝜆 𝑘 − k ˆ 𝑴 𝑚 − 𝑴 k and ˆ 𝜆 𝑘 + ≤ 𝜆 𝑘 + − k ˆ 𝑴 𝑚 − 𝑴 k . Therefore, conditioned on the ﬁrst claim, it holds that ˆ 𝜌 𝑘 ≥ 𝜌 𝑘 − k ˆ 𝑴 𝑚 − 𝑴 k ≥ 𝜌 𝑘 . Furthermore, it follows from Wedin’s inequality [33] that k 𝑼 ∗ ˆ 𝑽 k ≤ k ˆ 𝑴 𝑚 − 𝑴 k ˆ 𝜆 𝑘 − 𝜆 𝑘 + ≤ . (cid:3) Proposition E.3.

Let 𝑼 and 𝑽 be orthogonal matrices such that 𝑼𝑼 ∗ + 𝑽𝑽 ∗ = I , and let ˆ 𝑼 and ˆ 𝑽 be matricesof the same size satisfying the same requirement. Suppose k 𝑼 ∗ ˆ 𝑽 k ≤ / and k ˆ 𝑼 ∗ 𝑺 ( ˆ 𝑽 ∗ 𝑺 ) − k ≤ 𝛾 ≤ .Then k 𝑼 ∗ 𝑺 ( 𝑽 ∗ 𝑺 ) − k ≤ + 𝛾 − 𝛾 . Proof.

A direct calculation yields k 𝑼 ∗ 𝑺 ( 𝑽 ∗ 𝑺 ) − k = k 𝑼 ∗ ( ˆ 𝑼 ˆ 𝑼 ∗ + ˆ 𝑽 ˆ 𝑽 ∗ ) 𝑺 ( 𝑽 ∗ 𝑺 ) − k≤ k ˆ 𝑼 ∗ 𝑺 ( 𝑽 ∗ 𝑺 ) − k + k 𝑼 ∗ ˆ 𝑽 ˆ 𝑽 ∗ 𝑺 ( 𝑽 ∗ 𝑺 ) − k≤ k ˆ 𝑼 ∗ 𝑺 ( ˆ 𝑽 ∗ 𝑺 ) − ˆ 𝑽 ∗ 𝑺 ( 𝑽 ∗ 𝑺 ) − k + k ˆ 𝑽 ∗ 𝑺 ( 𝑽 ∗ 𝑺 ) − k≤ ( 𝛾 + ) k ˆ 𝑽 ∗ 𝑺 ( 𝑽 ∗ 𝑺 ) − k . We also have k ˆ 𝑽 ∗ 𝑺 ( 𝑽 ∗ 𝑺 ) − k ≤ k ˆ 𝑽 ∗ 𝑼𝑼 ∗ 𝑺 ( 𝑽 ∗ 𝑺 ) − k + k ˆ 𝑽 ∗ 𝑽𝑽 ∗ 𝑺 ( 𝑽 ∗ 𝑺 ) − k ≤ k 𝑼 ∗ 𝑺 ( 𝑽 ∗ 𝑺 ) − k + . Sequencing the two displays above and rearrange the inequality yields the claim. (cid:3)

Now let 𝑇 be given as in Theorem D.5 and choose 𝑚 = 𝑇 / 𝛿 . As long as 𝑇 ≥ 𝑀𝜌 𝑘 𝛿 l og ( 𝑑 / 𝛿 ) , wehave 𝑀 𝜌 𝑘 l og ( 𝑑 / 𝛿 ) ≤ 𝑚 ≤ 𝑇 . It then follows from Proposition E.2 that, when drawing ˆ 𝑨 , . . . , ˆ 𝑨 𝑚 independently from 𝑃 𝐴 , the event G : = { ˆ 𝜌 𝑘 ≥ 𝜌 𝑘 / and k 𝑼 ∗ ˆ 𝑽 k ≤ / } (E.1)happens with probability at least − 𝛿 . Conditioned on G , we consider running 𝑇 steps of Oja’salgorithm, with 𝐴 , . . . , 𝐴 𝑇 drawn i.i.d from 𝑃 ˆ 𝐴 . Note that the discrete distribution 𝑃 ˆ 𝐴 also satisﬁesAssumption 1 and Assumption 2 (with 𝑀 replaced by 𝑀 ). Our main theorem thus guarantees thatwith appropriately chosen step size, the output 𝑄 𝑇 = 𝑄 𝑇 ( 𝐴 , . . . , 𝐴 𝑇 ) of this algorithm after 𝑇 stepssatisﬁes k ˆ 𝑼 ∗ 𝑸 𝑇 ( ˆ 𝑽 ∗ 𝑸 𝑇 ) − k ≤ TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 23 with probability − 𝛿 . Combining (E.1) and Proposition E.3, we obtain that with probability at least ( − 𝛿 ) ≥ − 𝛿 , the output of the algorithm satisﬁes k 𝑼 ∗ 𝑸 𝑇 ( 𝑽 ∗ 𝑸 𝑇 ) − k ≤ , that is, 𝑃 𝑚 (cid:0) k 𝑼 ∗ 𝑸 𝑇 ( 𝑽 ∗ 𝑸 𝑇 ) − k ≤ (cid:1) ≥ − 𝛿. Finally, we obtain from Proposition E.1 that 𝑃 𝑚 (cid:0) k 𝑼 ∗ 𝑸 𝑇 ( 𝑽 ∗ 𝑸 𝑇 ) − k ≤ (cid:1) ≥ 𝑃 ⊗ 𝑇 𝐴 (cid:0) k 𝑼 ∗ 𝑸 𝑇 ( 𝑽 ∗ 𝑸 𝑇 ) − k ≤ (cid:1) − 𝑑 TV (cid:16) 𝑃 𝑚 , 𝑃 ⊗ 𝑇 𝐴 (cid:17) ≥ − 𝛿. In other words, with the same choice of 𝑇 , the output of 𝑇 steps of Oja’s algorithm with 𝐴 , . . . , 𝐴 𝑇 drawn i.i.d from the original distribution 𝑃 𝐴 satisﬁes k 𝑼 ∗ 𝑸 𝑇 ( 𝑽 ∗ 𝑸 𝑇 ) − k ≤ with probability at least − 𝛿 . Appendix F. Phase I succeeds if 𝑇 is large In this section, we prove Theorem D.5 when 𝑇 > 𝐶 𝑇 𝑑 𝛿 ¯ 𝜌 𝑠 . Note that this value of 𝑇 is far largerthan the optimal choice (which is of order ˜ Θ ( 𝑘 / 𝛿 ¯ 𝜌 𝑠 ) ), which makes the theorem much easier toprove. Indeed, if 𝑇 is this large, we can prove Theorem D.5 directly by using the same conditioningargument as in Phase II. Proposition F.1.

Assume 𝜂 and 𝑇 satisfy the requirements of Theorem D.5, and assume 𝜌 ≥ 𝜌 𝑘 / .If 𝑇 ≥ 𝐶 𝑇 𝑑 𝛿 ¯ 𝜌 𝑠 , then k 𝑾 𝑇 k ≤ 𝑠 with probability at least − 𝛿 / .Proof. Set 𝛾 = 𝐶 𝛾 𝑑𝛿 where 𝐶 𝛾 is deﬁned in Lemma H.4 and deﬁne the good events G : = {k 𝑾 k ≤ 𝛾 /(√ )} G 𝑖 : = {k 𝑾 k ≤ 𝛾 } ∩ G 𝑖 − , ∀ 𝑖 ≥ . In order to apply Theorem 3.1, we verify (3.1)First assumption. We have 𝜀 = 𝜂𝑀 ( + 𝛾 ) ≤ 𝐶 𝜂 l og ( e 𝑑 / 𝛿𝑠 ) 𝑀𝛾𝜌𝑇 , and this quantity is smaller than / so long as 𝐶 𝑇 ≥ 𝐶 𝜂 𝐶 𝛾 . (F.1)Second assumption. We again have 𝜂 k 𝑴 k = 𝐶 𝜂 l og ( e 𝑑 / 𝛿𝑠 ) k 𝑴 k 𝜌𝑇 , and (F.1) guarantees that this quantity is smaller than / as well.Third assumption. Since 𝜀 𝑖 = 𝜀 for all 𝑖 and 𝜂𝜌 ≥ , this requirement trivially holds. Our goal is to bound P (cid:8) k 𝑾 𝑇 k ≥ 𝑠 (cid:9) ≤ P (cid:8) k 𝑾 𝑇 𝑇 k ≥ 𝑠 (cid:9) + P (cid:8) G 𝐶 (cid:9) + Õ 𝑇 𝑗 = P n G 𝐶𝑗 ∩ G 𝑗 − o . Having veriﬁed (3.1), we can employ (3.2), obtaining k 𝑾 𝑇 𝑇 k 𝑝,𝑝 ≤ e − 𝑇 𝜂𝜌 𝑘 𝑘 / 𝑝 𝛾 / + ( 𝐶 𝛾 + 𝐶 ) 𝑝𝑘 / 𝑝 𝜀 𝑇 . For the ﬁrst term, the fact that 𝜌 𝑘 ≥ 𝜌 / implies that e − 𝑇 𝜂𝜌 𝑘 𝛾 = ( 𝛿𝑠 / e 𝑑 ) 𝐶 𝜂 / 𝛾 , and this is smaller than 𝑠 as long as 𝐶 𝜂 ≥ + og ( 𝐶 𝛾 ) . Letting 𝐶 be as in Proposition 4.1 and choosing 𝑝 = l og ( 𝑘 / 𝑑𝛿 ) , we also have 𝑝 ( 𝐶 𝛾 + 𝐶 ) 𝜀 𝑇 ≤ 𝑝 𝐶 𝐶 𝜂 l og ( e 𝑑 / 𝛿𝑠 ) 𝑀 𝛾 𝜌 𝑇 ≤ 𝐶 𝐶 𝜂 𝐶 𝛾 l og ( 𝑑 / 𝛿𝑠 ) 𝐶 𝑇 · 𝛿𝑠𝑑 Since l og ( 𝑑 / 𝛿𝑠 ) ≤ 𝑑𝛿𝑠 for all positive 𝑑 , 𝛿 , and 𝑠 , this quantity will be less than 𝑠 so long as 𝐶 𝑇 ≥ ( 𝐶 𝐶 𝜂 𝐶 𝛾 ) , (F.2)and this requirement subsumes (F.1).We therefore obtain, for 𝑝 = l og ( 𝑘 / 𝛿 ) , P (cid:8) k 𝑾 𝑇 k ≥ 𝑠 (cid:9) ≤ 𝑠 − 𝑝 k 𝑾 𝑇 k 𝑝𝑝,𝑝 ≤ 𝑘 e − 𝑝 ≤ 𝛿 / , In a similar way, (3.2) yields for all 𝑡 ∈ [ 𝑇 ] , 𝛾 − k 𝑾 𝑡 𝑡 − k 𝑝,𝑝 ≤ 𝑘 / 𝑝 + ( 𝐶 𝛾 + 𝐶 ) 𝑝𝑘 / 𝑝 𝜀 𝑇 . If we choose 𝑝 = l og ( 𝑘𝑇 / 𝛿 ) , then we have 𝑝 ( 𝐶 𝛾 + 𝐶 ) 𝜀 𝑇 ≤ 𝑝 𝐶 𝐶 𝜂 l og ( e 𝑑 / 𝛿𝑠 ) 𝑀 𝛾 𝜌 𝑇 ≤ 𝐶 𝐶 𝜂 𝐶 𝛾 l og ( 𝑇 ) 𝐶 𝑇 𝑇 / , and since log ( 𝑇 ) ≤ 𝑇 / for all 𝑇 , we have that this quantity will be at most if 𝐶 𝑇 ≥ ( 𝐶 𝐶 𝜂 𝐶 𝛾 ) / , and this requirement subsumes (F.2), and it holds under the assumptions of Theorem D.5.By Lemma H.4, the event G holds with probability at least − 𝛿 / .Finally, we have for any 𝑗 ∈ [ 𝑇 ] , P n G 𝐶𝑗 ∩ G 𝑗 − o ≤ P (cid:8) k 𝑾 𝑗 𝑗 − k ≥ 𝛾 (cid:9) ≤ inf 𝑝 ≥ 𝛾 − 𝑝 k 𝑾 𝑡 𝑡 − k 𝑝𝑝,𝑝 , and choosing 𝑝 = log ( 𝑘𝑇 / 𝛿 ) we have 𝛾 − 𝑝 k 𝑾 𝑡 𝑡 − k 𝑝𝑝,𝑝 ≤ 𝑘 e − 𝑝 ≤ 𝛿𝑇 , and summing these probabilities for 𝑗 ∈ [ 𝑇 ] , yields that P (cid:8) k 𝑾 𝑇 k ≥ 𝑠 (cid:9) ≤ P (cid:8) k 𝑾 𝑇 𝑇 k ≥ 𝑠 (cid:9) + P (cid:8) G 𝐶 (cid:9) + Õ 𝑇 𝑗 = P n G 𝐶𝑗 ∩ G 𝑗 − o ≤ + + = , as claimed. (cid:3) TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 25 Appendix G. Omitted proofs

G.1.

Proof of Lemma 2.7.

We will show that 𝑾 𝑡 ( I − 𝚫 𝑡 ) = 𝑯 𝑡 + 𝑱 𝑡, + 𝑱 𝑡, , where 𝑯 𝑡 = 𝑼 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ( 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ) − , 𝑱 𝑡, = b 𝚫 𝑡 − 𝑯 𝑡 𝚫 𝑡 , and 𝑱 𝑡, = − b 𝚫 𝑡 𝚫 𝑡 and where we write b 𝚫 𝑡 = 𝜂 𝑡 𝑼 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝒁 𝑡 − ( 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ) − . By the deﬁnition of 𝒁 𝑡 , we have 𝑾 𝑡 = 𝑼 ∗ 𝒁 𝑡 ( 𝑽 ∗ 𝒁 𝑡 ) − = 𝑼 ∗ 𝒀 𝑡 𝒁 𝑡 − ( 𝑽 ∗ 𝒀 𝑡 𝒁 𝑡 − ) − . We have 𝑽 ∗ 𝒀 𝑡 𝒁 𝑡 − = 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − + 𝜂 𝑡 𝑽 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝒁 𝑡 − = (cid:0) I + 𝜂 𝑡 𝑽 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝒁 𝑡 − ( 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ) − (cid:1) 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − = ( I + 𝚫 𝑡 ) 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − , which implies ( 𝑽 ∗ 𝒀 𝑡 𝒁 𝑡 − ) − ( I − 𝚫 𝑡 ) = ( 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ) − ( I + 𝚫 𝑡 ) − ( I + 𝚫 𝑡 ) ( I − 𝚫 𝑡 ) = ( 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ) − ( I − 𝚫 𝑡 ) . We also have 𝑼 ∗ 𝒀 𝑡 𝒁 𝑡 − = 𝑼 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − + 𝜂 𝑡 𝑼 ∗ ( 𝑨 𝑡 − 𝑴 ) 𝒁 𝑡 − = 𝑼 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − + b 𝚫 𝑡 ( 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ) . Therefore 𝑾 𝑡 ( I − 𝚫 𝑡 ) = 𝑼 ∗ 𝒀 𝑡 𝒁 𝑡 − ( 𝑽 ∗ 𝒀 𝑡 𝒁 𝑡 − ) − = 𝑼 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ( 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ) − + b 𝚫 𝑡 − 𝑼 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ( 𝑽 ∗ ( I + 𝜂 𝑡 𝑴 ) 𝒁 𝑡 − ) − 𝚫 𝑡 − b 𝚫 𝑡 𝚫 𝑡 . That is 𝑾 𝑡 ( I − b 𝚫 𝑡 ) = 𝑯 𝑡 + 𝑱 𝑡, + 𝑱 𝑡, . Since 𝚫 𝑡 and b 𝚫 𝑡 are both 𝑂 ( 𝜂 𝑡 ) , the claim follows. (cid:3) G.2.

Proof of Proposition 2.9.

By the triangle inequality, we have k 𝑿 + 𝒀 + 𝒁 k 𝑝,𝑝 ≤ k 𝑿 + 𝒀 k 𝑝,𝑝 + k 𝒁 k 𝑝,𝑝 , which implies k 𝑿 + 𝒀 + 𝒁 k 𝑝,𝑝 ≤ ( k 𝑿 + 𝒀 k 𝑝,𝑝 + k 𝒁 k 𝑝,𝑝 ) ≤ ( + 𝜆 ) ( k 𝑿 + 𝒀 k 𝑝,𝑝 + 𝜆 − k 𝒁 k 𝑝,𝑝 ) , where in the second step we have applied the elementary inequality ( 𝑎 + 𝑏 ) ≤ ( + 𝜆 ) ( 𝑎 + 𝜆 − 𝑏 ) , valid for all real numbers 𝑎 and 𝑏 and 𝜆 > . Applying Proposition 2.8 to k 𝑿 + 𝒀 k 𝑝,𝑝 then yields theclaim. (cid:3) Appendix H. Additional Lemmas

Lemma H.1.

For any deterministic matrices 𝑨 , 𝑩 and any standard Gaussian matrix 𝒁 of suitable sizes,it holds that P {k 𝑨𝒁𝑩 k ≥ k 𝑨 k k 𝑩 k ( + 𝑡 )} ≤ e − 𝑡 / . Proof.

Let 𝑓 ( 𝑿 ) : = k 𝑨𝑿 𝑩 k , then | 𝑓 ( 𝑿 ) − 𝑓 ( 𝑿 ) | ≤ k 𝑨 k k 𝑩 k · k 𝑿 − 𝑿 k . By Gaussian concentration, we have P { 𝑓 ( 𝒁 ) ≥ E 𝑓 ( 𝒁 ) + k 𝑨 k k 𝑩 k 𝑡 } ≤ e − 𝑡 / . Moreover, we have E 𝑓 ( 𝒁 ) ≤ ( E k 𝑨𝒁𝑩 k ) / = k 𝑨 k k 𝑩 k . It thus follows that P { 𝑓 ( 𝒁 ) ≥ k 𝑨 k k 𝑩 k ( + 𝑡 )} ≤ P { 𝑓 ( 𝒁 ) ≥ E 𝑓 ( 𝒁 ) + k 𝑨 k k 𝑩 k 𝑡 } ≤ e − 𝑡 / , which is the stated result. (cid:3) Lemma H.2 ([6, Theorem II.13]) . Let 𝑸 ∈ ℝ 𝑑 × 𝑘 be a standard Gaussian matrix. Then P n k 𝑸 k ≥ √ 𝑑 + √ 𝑘 + 𝑡 o ≤ · e − 𝑡 / . Lemma H.3 ([1, Lemma i.A.3]) . Let 𝑸 ∈ ℝ 𝑘 × 𝑘 be a standard Gaussian matrix. Then for every 𝛿 ∈ ( , ) , P ( k 𝑸 − k ≥ √ 𝑘𝛿 ) ≤ 𝛿. The next lemma bounds the probability of G from below. Lemma H.4.

Let G be the event deﬁned in (5.2) . There exists a positive constant 𝐶 𝛾 = such thatfor any 𝛿 ∈ ( , ) , if 𝛾 ≥ 𝐶 𝛾 min { p 𝑘 log ( e 𝑚𝑇 / 𝛿 )/ 𝛿, 𝑑 / 𝛿 } , then G holds with probability at least − 𝛿 .Proof. We have 𝑾 = 𝑼 ∗ 𝒁 ( 𝑽 ∗ 𝒁 ) − , where 𝒁 is a matrix with i.i.d. Gaussian entries. Since 𝑼 and 𝑽 have orthonormal columns and are themselves orthogonal, the two matrices 𝑽 ∗ 𝒁 and 𝑼 ∗ 𝒁 areindependent matrices with i.i.d. Gaussian entries. Using Lemma H.1 and conditioning on 𝑽 ∗ 𝒁 , wehave that with probability at least − 𝛿 / ( 𝑇 + ) ,max 𝑬 ∈ E 𝑟,ℓ k 𝑬𝑼 ∗ 𝒁 ( 𝑽 ∗ 𝒁 ) − k ≤ k ( 𝑽 ∗ 𝒁 ) − k · p ℓ l og ( e 𝑚𝑇 / 𝛿 ) , (H.1)where we have taken a union bound over the fewer than ( ( 𝑚 + ) ( 𝑇 + )) ℓ elements of E 𝑟,ℓ . Takinga uniform bound again over all 𝑟, ℓ ∈ [ 𝑇 + ] yields that, with probability at least − 𝛿 / , the event(H.1) holds for all 𝑟, ℓ ∈ [ 𝑇 + ] . By Lemma H.3, we also have that that k ( 𝑽 ∗ 𝒁 ) − k ≤ √ 𝑘 / 𝛿 withprobability at least − 𝛿 / . Furthermore, Lemma H.2 implies that k 𝑼 ∗ 𝒁 k ≤ p 𝑑 l og ( / 𝛿 ) withprobability at least − 𝛿 / . Combining these bounds, we obtain that with probability at least − 𝛿 / ,max 𝑬 ∈ E 𝑟,ℓ k 𝑬𝑼 ∗ 𝒁 ( 𝑽 ∗ 𝒁 ) − k ≤ p ℓ l og ( e 𝑚𝑇 / 𝛿 ) , which is less than √ ℓ𝛾 √ as long as 𝐶 𝛾 ≥ , and under this same assumption k 𝑾 k ≤ k 𝑼 ∗ 𝒁 k k ( 𝑽 ∗ 𝒁 ) − k ≤ p 𝑑 l og ( / 𝛿 ) ≤ √ 𝑑𝛾 as well. TREAMING 𝑘 -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 27 So G holds with probability at least − 𝛿 if 𝛾 ≥ 𝐶 𝛾 p 𝑘 l og ( e 𝑚𝑇 / 𝛿 )/ 𝛿 for 𝐶 𝛾 ≥ .On the other hand, We have E k 𝑼 ∗ 𝒁 k ≤ √ 𝑑 , so that k 𝑼 ∗ 𝒁 k ≤ √ 𝑑 / 𝛿 with probability at least − 𝛿 / , and Lemma H.3 implies that k 𝑽 ∗ 𝒁 k ≤ √ 𝑘 / 𝛿 with probability at least − 𝛿 / , so withprobability at least − 𝛿 we have k 𝑾 k ≤ k 𝑼 ∗ 𝒁 k k ( 𝑽 ∗ 𝒁 ) − k ≤ √ 𝑑𝑘 / 𝛿 < 𝑑 / 𝛿 . as claimed. On this event, we also have k 𝑬𝑾 k ≤ k 𝑾 k ≤ 𝑑 / 𝛿 . Therefore, if 𝛾 ≥ √ 𝑑 / 𝛿 ,then G holds.So G holds with probability at least − 𝛿 if 𝛾 ≥ 𝐶 𝛾 𝑑 / 𝛿 for 𝐶 𝛾 ≥ √ . Therefore, taking 𝐶 𝛾 = satisﬁes both requirements and proves the claim. (cid:3) References [1] Z. Allen-Zhu and Y. Li. First eﬃcient convergence for streaming 𝑘 -PCA: a global, gap-free, and near-optimal rate. In , pages 487–492. IEEE Computer Soc.,Los Alamitos, CA, 2017.[2] M. Balcan, S. S. Du, Y. Wang, and A. W. Yu. An improved gap-dependency analysis of the noisy power method. InFeldman et al. [10], pages 284–309.[3] M. Balcan and K. Q. Weinberger, editors. Proceedings of the 33nd International Conference on Machine Learning, ICML2016, New York City, NY, USA, June 19-24, 2016 , volume 48 of

JMLR Workshop and Conference Proceedings . JMLR.org,2016.[4] A. Balsubramani, S. Dasgupta, and Y. Freund. The fast convergence of incremental PCA. In Burges et al. [5], pages3174–3182.[5] C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, editors.

Advances in Neural Information ProcessingSystems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting heldDecember 5-8, 2013, Lake Tahoe, Nevada, United States , 2013.[6] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. In

Handbook of thegeometry of Banach spaces, Vol. I , pages 317–366. North-Holland, Amsterdam, 2001.[7] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III.

SIAM J. Numer. Anal. , 7:1–46, 1970.[8] X. V. Doan and S. Vavasis. Finding the largest low-rank clusters with Ky Fan - 𝑘 -norm and ℓ -norm. SIAM J. Optim. ,26(1):274–312, 2016.[9] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints.

SIAM J. MatrixAnal. Appl. , 20(2):303–353, 1999.[10] V. Feldman, A. Rakhlin, and O. Shamir, editors.

Proceedings of the 29th Conference on Learning Theory, COLT 2016, NewYork, USA, June 23-26, 2016 , volume 49 of

JMLR Workshop and Conference Proceedings . JMLR.org, 2016.[11] D. Freedman. A remark on the diﬀerence between sampling with and without replacement.

J. Amer. Statist. Assoc. ,72(359):681, 1977.[12] G. H. Golub and C. F. Van Loan.

Matrix computations . Johns Hopkins Studies in the Mathematical Sciences. JohnsHopkins University Press, Baltimore, MD, third edition, 1996.[13] M. Hardt and E. Price. The noisy power method: A meta algorithm with applications. In Z. Ghahramani, M. Welling,C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors,

Advances in Neural Information Processing Systems 27:Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada , pages2861–2869, 2014.[14] A. Henriksen and R. Ward. Adaoja: Adaptive learning rates for streaming pca. 05 2019, 1905.12115.[15] A. Henriksen and R. Ward. Concentration inequalities for random matrix products.

Linear Algebra Appl. , 594:81–94,2020.[16] D. Huang, J. Niles-Weed, J. A. Tropp, and R. Ward. Matrix concentration for products. 03 2020, 2003.05437.[17] P. Jain, C. Jin, S. M. Kakade, P. Netrapalli, and A. Sidford. Streaming PCA: matching matrix bernstein and near-optimalﬁnite sample guarantees for oja’s algorithm. In Feldman et al. [10], pages 1147–1164.[18] I. T. Jolliﬀe.

Principal component analysis . Springer Series in Statistics. Springer-Verlag, New York, second edition, 2002.[19] A. Juditsky and A. S. Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. 09 2008,0809.0813.[20] C. Li, H. Lin, and C. Lu. Rivalry of two families of algorithms for memory-restricted streaming PCA. In A. Gretton andC. C. Robert, editors,

Proceedings of the 19th International Conference on Artiﬁcial Intelligence and Statistics, AISTATS , volume 51 of

JMLR Workshop and Conference Proceedings , pages 473–481. JMLR.org,2016.[21] C. J. Li, M. Wang, H. Liu, and T. Zhang. Near-optimal stochastic approximation for online principal componentestimation.

Math. Program. , 167(1, Ser. B):75–97, 2018.[22] C.-K. Li and N.-K. Tsing. Some isometries of rectangular complex matrices.

Linear and Multilinear Algebra , 23(1):47–53,1988.[23] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. In Burges et al. [5], pages 2886–2894.[24] A. Naor. On the banach-space-valued azuma inequality and small-set isoperimetry of alon–roichman graphs.

Combina-torics, Probability and Computing , 21(4):623–634, 2012.[25] E. Oja. A simpliﬁed neuron model as a principal component analyzer.

J. Math. Biol. , 15(3):267–273, 1982.[26] E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of arandom matrix.

Journal of Mathematical Analysis and Applications , 106(1):69–84, 1985.[27] C. D. Sa, C. Ré, and K. Olukotun. Global convergence of stochastic gradient descent for some non-convex matrixproblems. In F. R. Bach and D. M. Blei, editors,

Proceedings of the 32nd International Conference on Machine Learning,ICML 2015, Lille, France, 6-11 July 2015 , volume 37 of

JMLR Workshop and Conference Proceedings , pages 2332–2341.JMLR.org, 2015.[28] O. Shamir. Convergence of stochastic gradient descent for PCA. In Balcan and Weinberger [3], pages 257–265.[29] O. Shamir. Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In Balcan and Weinberger[3], pages 248–256.[30] M. Simchowitz, A. El Alaoui, and B. Recht. Tight query complexity lower bounds for PCA via ﬁnite sample deformedWigner law. In

STOC’18—Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing , pages1249–1259. ACM, New York, 2018.[31] J. A. Tropp. User-friendly tail bounds for sums of random matrices.

Found. Comput. Math. , 12(4):389–434, 2012.[32] J. A. Tropp. An introduction to matrix concentration inequalities.

Foundations and Trends in Machine Learning , 8(1-2):1–230, 2015.[33] P.-A. Wedin. Perturbation bounds in connection with singular value decomposition.