Computer Science > Data Structures and Algorithms

Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates

De Huang,  Jonathan Niles-Weed,  Rachel Ward

Abstract
We analyze Oja's algorithm for streaming k-PCA and prove that it achieves performance nearly matching that of an optimal offline algorithm. Given access to a sequence of i.i.d. d \times d symmetric matrices, we show that Oja's algorithm can obtain an accurate approximation to the subspace of the top k eigenvectors of their expectation using a number of samples that scales polylogarithmically with d. Previously, such a result was only known in the case where the updates have rank one. Our analysis is based on recently developed matrix concentration tools, which allow us to prove strong bounds on the tails of the random matrices which arise in the course of the algorithm's execution.

1. Introduction

Principal component analysis is one of the foundational algorithms of statistics and machine learning. From a practical perspective, perhaps no optimization problem is more widely used in data analysis [18]. From a theoretical perspective, it is one of the simplest examples of a non-convex optimization problem that can nevertheless be solved in polynomial time; as such, it has been an important proving ground for understanding the fundamental limits of efficient optimization [30].

In the basic setting, the practitioner has access to a sequence of independent symmetric random matrices $\mathbf{A}_1, \mathbf{A}_2, \ldots$ with expectation $\mathbf{M} \in \mathbb{R}^{d \times d}$. The goal is to approximate the leading eigenspace of $\mathbf{M}$ or, more generally, to approximate the subspace spanned by its leading $k$ eigenvectors. While it is natural to attempt to solve this problem by performing an eigendecomposition of the empirical average $\bar{\mathbf{A}} = \frac{1}{T} \sum_{i=1}^{T} \mathbf{A}_i$, the amount of space required by this approach can be prohibitive when $d$ is large. In particular, if the matrices $\mathbf{A}_i$ are sparse or low-rank, performing incremental updates with the matrices $\mathbf{A}_i$ may be significantly cheaper than storing all the iterates or their average. A tremendous amount of attention has therefore been paid to designing algorithms which can cheaply and provably estimate the subspace spanned by the top $k$ eigenvectors of $\mathbf{M}$ using limited memory and a single pass over the data, a problem known as streaming PCA [17].

The simplest and most natural approach to this problem was proposed nearly 40 years ago by Oja [25, 26]:

(1) Randomly choose an initial guess $\mathbf{Z}_0 \in \mathbb{R}^{d \times k}$, and set $\mathbf{Q}_0 \leftarrow \mathrm{QR}[\mathbf{Z}_0]$.
(2) For $t \geq 1$, set $\mathbf{Q}_t \leftarrow \mathrm{QR}[(\mathbf{I} + \eta_t \mathbf{A}_t) \mathbf{Q}_{t-1}]$.

Here, $\mathrm{QR}[\mathbf{Q}_t]$ returns an orthogonal $\mathbb{R}^{d \times k}$ matrix obtained by applying the Gram–Schmidt process to the columns of $\mathbf{Q}_t$.
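As a concrete illustration (not from the paper; the spectrum, noise model, step sizes, and horizon below are arbitrary choices for the demonstration), the update can be sketched in NumPy. The sketch also runs the mathematically equivalent variant that defers orthonormalization to a single QR at the end, and checks that both recover the same subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 8, 2, 200

# A symmetric "population" matrix with a clear gap after its top-k eigenvalues.
spectrum = np.array([5.0, 4.0, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5])
basis, _ = np.linalg.qr(rng.standard_normal((d, d)))
M_mat = basis @ np.diag(spectrum) @ basis.T
V = basis[:, :k]  # true top-k eigenvectors

def sample_A():
    # i.i.d. symmetric sample with expectation M_mat and bounded noise
    N = rng.standard_normal((d, d))
    return M_mat + 0.1 * (N + N.T)

etas = [0.5 / (10.0 + t) for t in range(1, T + 1)]
samples = [sample_A() for _ in range(T)]
Z0 = rng.standard_normal((d, k))

# Variant 1: orthonormalize (QR) at every step.
Q = np.linalg.qr(Z0)[0]
for eta, A in zip(etas, samples):
    Q = np.linalg.qr((np.eye(d) + eta * A) @ Q)[0]

# Variant 2: accumulate the product, single QR at the end (same column span).
Z = Z0.copy()
for eta, A in zip(etas, samples):
    Z = (np.eye(d) + eta * A) @ Z
    Z /= np.linalg.norm(Z)  # rescaling leaves the span unchanged
Q2 = np.linalg.qr(Z)[0]

# Subspace distance: spectral norm of the difference of projectors.
dist_1 = np.linalg.norm(V @ V.T - Q @ Q.T, 2)
dist_2 = np.linalg.norm(V @ V.T - Q2 @ Q2.T, 2)
```

With a decaying step size and a seeded generator, both variants produce the same column span; the per-step QR form is the numerically safer choice in practice.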
It is easy to see [1, Lemma 2.2] that the Gram–Schmidt step commutes with the multiplicative update, so that we can equivalently consider a version of the algorithm which performs a single orthonormalization at the end, and outputs
$$\mathbf{Q}_t = \mathrm{QR}[\mathbf{Z}_t], \qquad \mathbf{Z}_t = \mathbf{Y}_t \cdots \mathbf{Y}_1 \mathbf{Z}_0, \qquad \text{where } \mathbf{Y}_i := \mathbf{I} + \eta_i \mathbf{A}_i.$$
Oja's algorithm can be viewed as a noisy version of the classic orthogonal iteration algorithm for computing invariant subspaces of a symmetric matrix [12, Section 7.3.2]; alternatively, it corresponds to projected stochastic gradient descent on the Stiefel manifold of matrices with orthonormal columns [9]. Despite its simplicity and practical effectiveness, Oja's algorithm has proven challenging to analyze because of its inherent non-convexity.

As a benchmark against which to compare Oja's algorithm, we may consider the performance of the simple offline algorithm which computes the leading $k$ eigenvectors of $\bar{\mathbf{A}}$.

Date: 6 February 2021. The authors gratefully acknowledge the funding for this work. DH was in part supported by NSF Grants DMS-1907977, DMS-1912654, and the Choi Family Postdoc Gift Fund. JNW was supported under NSF grant DMS-2015291. JNW and RW were supported in part by the Institute for Advanced Study, where some of this research was conducted. RW received support from AFOSR MURI Award N00014-17-S-F006 and NSF grant DMS-1952735.
We write $\mathbf{V} \in \mathbb{R}^{d \times k}$ for the orthogonal matrix whose columns are the leading $k$ eigenvectors of $\mathbf{M}$ and $\hat{\mathbf{V}} \in \mathbb{R}^{d \times k}$ for the matrix containing the leading $k$ eigenvectors of $\bar{\mathbf{A}}$, and measure the quality of $\hat{\mathbf{V}}$ by the following standard measure of distance between subspaces:
$$\mathrm{dist}(\hat{\mathbf{V}}, \mathbf{V}) := \|\mathbf{V}\mathbf{V}^* - \hat{\mathbf{V}}\hat{\mathbf{V}}^*\|.$$
If $\|\mathbf{A}_i - \mathbf{M}\| \leq M$ almost surely and the gap between the $k$th and $(k+1)$th eigenvalues is $\rho_k$, then the Matrix Bernstein inequality [31, Theorem 1.4] combined with Wedin's theorem [33] implies that there exists a positive constant $C$ such that
$$\mathrm{dist}(\hat{\mathbf{V}}, \mathbf{V}) \leq C \frac{M}{\rho_k} \sqrt{\frac{\log(d/\delta)}{T}} \qquad (1.1)$$
with probability at least $1 - \delta$.

The key question is whether Oja's algorithm is able to achieve similar performance. However, except in the special rank-one case where either $k = 1$ or $\mathrm{rank}(\mathbf{A}_i) = 1$ almost surely, no such bound is known.

1.1. Our contribution.

We give the first results for Oja's algorithm nearly matching (1.1), for any $k \geq 1$ and updates of any rank. Our main result (Theorem 2.3) establishes that, after a burn-in period of $T_1 = \tilde{O}\big(\frac{k M^2}{\delta^2 \rho_k^2}\big)$ steps, the output of Oja's algorithm satisfies
$$\mathrm{dist}(\mathbf{Q}_T, \mathbf{V}) \leq C' \frac{M}{\rho_k} \sqrt{\frac{\log(k M / \delta \rho_k)}{T - T_1}}$$
with probability at least $1 - \delta$ for a universal positive constant $C'$. Ours is the first work to show that Oja's algorithm can achieve a guarantee similar to (1.1) beyond the rank-one case.

The assumption that $k = 1$ or $\mathrm{rank}(\mathbf{A}_i) = 1$ is fundamental to the proof strategies used in prior works. To show that the error decays sufficiently quickly, prior work focuses on the quantity $\|\mathbf{U}^* \mathbf{Z}_t (\mathbf{V}^* \mathbf{Z}_t)^{-1}\|$, where the columns of $\mathbf{U}$ are the last $d - k$ eigenvectors of $\mathbf{M}$; this quantity is an upper bound on $\mathrm{dist}(\mathbf{Q}_t, \mathbf{V})$. (See Lemma 2.6 below.) The key challenge is to control the inverse $(\mathbf{V}^* \mathbf{Z}_t)^{-1}$. When $k = 1$, as in [17], this quantity is a scalar, so it can be pulled out of the norm and bounded separately. This is no longer possible when $k > 1$, but if $\mathrm{rank}(\mathbf{A}_i) = 1$, as in [1], then $\mathbf{V}^* \mathbf{Z}_t$ can be written as a rank-one perturbation of $\mathbf{V}^* \mathbf{Z}_{t-1}$. The Sherman–Morrison formula then implies that $\mathbf{U}^* \mathbf{Z}_t (\mathbf{V}^* \mathbf{Z}_t)^{-1}$ can be written as $\mathbf{U}^* \mathbf{Z}_{t-1} (\mathbf{V}^* \mathbf{Z}_{t-1})^{-1}$ plus the sum of explicit, rank-one correction terms. However, if neither $k = 1$ nor $\mathrm{rank}(\mathbf{A}_i) = 1$, this approach quickly becomes infeasible, since the correction terms now involve a product of rank-$k$ matrices whose norm is difficult to bound.

A more subtle difficulty implicit in prior work is that proofs must be carried out entirely in expected (squared) Frobenius norm.
This requirement is necessitated by the fact that the Frobenius norm is Hilbertian, so it is possible to employ the crucial Pythagorean identity
$$\mathbb{E}\|\mathbf{Y}\|_F^2 = \|\mathbb{E}\mathbf{Y}\|_F^2 + \mathbb{E}\|\mathbf{Y} - \mathbb{E}\mathbf{Y}\|_F^2 \qquad (1.2)$$
for any random matrix $\mathbf{Y}$. It is this identity that makes it possible to control the evolution of $\mathbb{E}\|\mathbf{U}^* \mathbf{Z}_t (\mathbf{V}^* \mathbf{Z}_t)^{-1}\|_F^2$. However, as our proofs reveal, it is of significant utility to be able to recursively control the operator norm $\|\mathbf{U}^* \mathbf{Z}_t (\mathbf{V}^* \mathbf{Z}_t)^{-1}\|$ with high probability instead. Unfortunately, (1.2) is of no help in proving statements of this kind.

Our argument handles both challenges and represents a significant conceptual simplification over earlier proofs. Our crucial insight is that, rather than using the squared Frobenius norm, it is possible to prove a stronger recursion in a different norm, which implies high-probability bounds. Using techniques recently developed by [16] to prove concentration inequalities for products of random matrices, we show that, conditioned on $\|\mathbf{U}^* \mathbf{Z}_{t-1} (\mathbf{V}^* \mathbf{Z}_{t-1})^{-1}\|$ being well behaved, the probability that $\|\mathbf{U}^* \mathbf{Z}_t (\mathbf{V}^* \mathbf{Z}_t)^{-1}\|$ deviates significantly from its expectation is exponentially small. In other words, good concentration properties for $\|\mathbf{U}^* \mathbf{Z}_{t-1} (\mathbf{V}^* \mathbf{Z}_{t-1})^{-1}\|$ imply good concentration properties for the next iterate, $\|\mathbf{U}^* \mathbf{Z}_t (\mathbf{V}^* \mathbf{Z}_t)^{-1}\|$. These high-probability bounds significantly simplify the calculations, since they allow us to guarantee that the problematic error terms appearing in prior work are small.

If we knew that $\|\mathbf{U}^* \mathbf{Z}_0 (\mathbf{V}^* \mathbf{Z}_0)^{-1}\| = O(1)$ with high probability, then the above induction argument would allow us to conclude that $\|\mathbf{U}^* \mathbf{Z}_t (\mathbf{V}^* \mathbf{Z}_t)^{-1}\| = O(1)$ for all $t$. Unfortunately, this is not the case: if $\mathbf{Z}_0$ is randomly initialized with i.i.d. Gaussian entries, then typically $\|\mathbf{U}^* \mathbf{Z}_0 (\mathbf{V}^* \mathbf{Z}_0)^{-1}\| \asymp \sqrt{dk}$.
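A quick Monte Carlo sketch (an illustration, not part of the paper's argument) shows this poor initial scale. By rotational invariance of the Gaussian initialization, $\mathbf{V}^* \mathbf{Z}_0$ and $\mathbf{U}^* \mathbf{Z}_0$ may be taken to be coordinate blocks of $\mathbf{Z}_0$; the dimensions and trial count below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, trials = 400, 4, 20

norms = []
for _ in range(trials):
    Z0 = rng.standard_normal((d, k))
    # Coordinate split: the top k x k block plays the role of V*Z0,
    # the remaining (d-k) x k block the role of U*Z0.
    VZ, UZ = Z0[:k, :], Z0[k:, :]
    norms.append(np.linalg.norm(UZ @ np.linalg.inv(VZ), 2))

median_norm = float(np.median(norms))
```

The median norm is far larger than the $O(1)$ scale the induction would need, which is what motivates the separate first phase.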
We therefore adopt a two-phase approach: in the first, short phase, of length approximately $\log d$, we show that the operator norm decays from $O(\sqrt{dk})$ to $O(1)$, and in the second phase we use the above recursive argument to establish that the operator norm decays to zero at a $O(1/\sqrt{T})$ rate. To simplify the analysis of the first phase, we develop a coupling argument that allows us to reduce, without loss of generality, to the case where the law $\mathbb{P}_{\mathbf{A}}$ of the random matrices $\mathbf{A}_1, \mathbf{A}_2, \ldots$ has finite support, and to obtain almost-sure guarantees by a simple union bound. This weak control is enough to guarantee that $\|\mathbf{U}^* \mathbf{Z}_t (\mathbf{V}^* \mathbf{Z}_t)^{-1}\|$ decays exponentially fast, so that it is of constant order after approximately $\log d$ iterations.

1.2. Prior work.

Obtaining non-asymptotic rates of convergence for Oja's algorithm and its variants has been an area of active recent interest [28, 29, 27, 21, 20, 2, 4, 13, 17, 23]. Apart from the results of [1] and [17], none of these works proves bounds matching (1.1).

A breakthrough in the project of obtaining optimal guarantees was due to [28], who gave an analysis of Oja's algorithm that works when provided with a warm start: he showed that, when $k = 1$ and $\mathrm{rank}(\mathbf{A}_i) = 1$ almost surely, Oja's algorithm converges in a number of steps logarithmic in $d$ if it is initialized in a neighborhood of the optimum. However, his result does not extend to random initialization, and it is unclear how to find a warm start in practice. This restriction was lifted by [17], who were the first to show a global, efficient guarantee for Oja's algorithm when $k = 1$. Subsequently, [1] gave a global, efficient guarantee for Oja's algorithm in the $k > 1$ case, but under the restriction that $\mathrm{rank}(\mathbf{A}_i) = 1$ almost surely.

The idea of analyzing Oja's algorithm by developing concentration bounds for products of random matrices was suggested by [15], who also proved such non-asymptotic concentration bounds in a simplified setting. Those bounds were later improved by [16], who developed a different technique based on martingale inequalities for Schatten norms, following a strategy pursued by [19] and [24] for other Banach space norms. The concentration inequalities of [16] are not sharp enough to recover optimal rates for Oja's algorithm on their own; in this work, we use similar proof techniques to establish tailor-made concentration results for the Oja setting.

1.3. Organization of the remainder of the paper.

In Section 2, we give our main results and an overview of our techniques. Our main tool is a recursive inequality which proves a concentration result for the iterates of Oja's algorithm, which we state and prove in Section 3. Our analysis of Oja's algorithm involves two distinct phases, which we analyze separately. Since the argument for the second phase is simpler, we present it first in Section 4, and present the slightly more complicated argument for the first phase in Section 5. We conclude in Section 6 with open questions and directions for future work. The appendices contain omitted proofs and supplementary results for each section.

1.4. Notation.

We write πœ† β‰₯ Β· Β· Β· β‰₯ πœ† 𝑑 for the eigenvalues of the symmetric matrix 𝑴 , and we write 𝜌 π‘˜ : = πœ† π‘˜ βˆ’ πœ† π‘˜ + for the gap between the π‘˜ th and ( π‘˜ + ) th eigenvalue. We write 𝑽 ∈ R 𝑑 Γ— π‘˜ forthe orthogonal matrix whose columns are the π‘˜ leading eigenvectors of 𝑴 , and 𝑼 ∈ R 𝑑 Γ—( 𝑑 βˆ’ π‘˜ ) forthe orthogonal matrix whose columns are the remaining eigenvectors. Given an orthogonal matrix 𝑾 ∈ R 𝑑 Γ— π‘˜ , we write [7] dist ( 𝑾 , 𝑽 ) = k 𝑽𝑽 βˆ— βˆ’ 𝑾𝑾 βˆ— k = k 𝑼 βˆ— 𝑾 k , The symbol kΒ·k denotes the spectral norm (i.e., β„“ operator norm) of a matrix, which is equal to itsmaximum singular value. For 𝑝 β‰₯ , the symbol kΒ·k 𝑝 denotes the Schatten 𝑝 -norm, which is the β„“ 𝑝 norm of the singular values of its argument. We also define the 𝐿 𝑝 norm of a random matrix 𝑿 as k 𝑿 k 𝑝,𝑝 : = (cid:0) E k 𝑿 k 𝑝𝑝 (cid:1) / 𝑝 . We employ standard asymptotic notation π‘Ž = 𝑂 ( 𝑏 ) to indicate that π‘Ž ≀ 𝐢𝑏 for a universal positiveconstant 𝐢 , and write π‘Ž = Θ ( 𝑏 ) if π‘Ž = 𝑂 ( 𝑏 ) and 𝑏 = 𝑂 ( π‘Ž ) . The notations ˜ 𝑂 (Β·) and ˜ Θ (Β·) suppresspolylogarithmic factors in the problem parameters. When 𝑑 is a positive integer, we write [ 𝑑 ] : = { , . . . , 𝑑 } . 2. T echniques and main results We focus throughout on the following setup:

Assumption 2.1.

The matrices $\mathbf{A}_i$ are symmetric, independent, identically distributed samples from a distribution $\mathbb{P}_{\mathbf{A}}$, with expectation $\mathbf{M}$.

Note that while we require that each $\mathbf{A}_i$ is symmetric, we do not require that $\mathbf{A}_i \succeq 0$. The requirement that $\mathbf{A}_i$ is symmetric is not as restrictive as it may seem, since we can replace $\mathbf{A}_i$ by its Hermitian dilation:
$$\mathcal{D}(\mathbf{A}_i) := \begin{pmatrix} 0 & \mathbf{A}_i \\ \mathbf{A}_i^* & 0 \end{pmatrix} \in \mathbb{R}^{2d \times 2d}.$$
Estimating the leading eigenvectors of $\mathcal{D}(\mathbf{M})$ is equivalent to estimating the leading singular vectors of $\mathbf{M}$. Our results therefore extend to the non-symmetric streaming SVD problem as well. We refer the reader to [32] for more details about this standard reduction.

The second requirement establishes that the random errors are bounded in a suitable norm. We write $\mathbb{S}_{d,k}$ for the Stiefel manifold of $d \times k$ matrices with orthonormal columns.

Assumption 2.2. If $\mathbf{A} \sim \mathbb{P}_{\mathbf{A}}$, then $\sup_{\mathbf{P} \in \mathbb{S}_{d,k}} \|\mathbf{P}^* (\mathbf{A} - \mathbf{M})\|_F \leq M$ almost surely.

Note that for any matrix $\mathbf{X} \in \mathbb{R}^{d \times d}$,
$$\sup_{\mathbf{P} \in \mathbb{S}_{d,k}} \|\mathbf{P}^* \mathbf{X}\|_F = \Big( \sum_{i=1}^{k} \sigma_i(\mathbf{X})^2 \Big)^{1/2}, \qquad 1 \leq k \leq d,$$
where $\sigma_1(\mathbf{X}) \geq \sigma_2(\mathbf{X}) \geq \cdots \geq \sigma_d(\mathbf{X})$ are the singular values of $\mathbf{X}$. This norm, sometimes known as the $(2,k)$ norm [22] or the Ky Fan $2$-$k$ norm [8], satisfies
$$\|\mathbf{X}\| \leq \sup_{\mathbf{P} \in \mathbb{S}_{d,k}} \|\mathbf{P}^* \mathbf{X}\|_F \leq \min\big\{ \sqrt{k}\,\|\mathbf{X}\|,\ \|\mathbf{X}\|_F \big\}.$$
This choice of norm generalizes the error assumptions in the literature. In the $k = 1$ case, it agrees with the operator norm, which is the condition used by [17]; and it weakens the requirement of [1] that $\|\mathbf{A}_i\| \leq 1$ almost surely.

The following theorem summarizes our main results for Oja's algorithm.

Theorem 2.3 (Main, informal). Adopt Assumptions 2.1 and 2.2. Let $\lambda_1 \geq \ldots \geq \lambda_d$ be the eigenvalues of $\mathbf{M}$, and let $\rho_k = \lambda_k - \lambda_{k+1}$. For every $\delta \in (0, 1)$, define
$$T_1 = \tilde{\Theta}\Big( \frac{k M^2}{\delta^2 \rho_k^2} \Big), \qquad \beta = \tilde{\Theta}\Big( \frac{M}{\rho_k} \Big), \qquad \eta_t = \begin{cases} \tilde{\Theta}\big( \frac{1}{\rho_k T_1} \big), & t \leq T_1, \\ \Theta\big( \frac{1}{\rho_k (\beta + t - T_1)} \big), & t > T_1. \end{cases}$$
Let $\mathbf{V} \in \mathbb{R}^{d \times k}$ be the orthogonal matrix whose columns are the $k$ leading eigenvectors of $\mathbf{M}$. Then for any $T > T_1$, the output $\mathbf{Q}_T$ of Oja's algorithm satisfies
$$\mathrm{dist}(\mathbf{Q}_T, \mathbf{V}) \leq C' \frac{M}{\rho_k} \sqrt{\frac{\log(M k / \rho_k \delta)}{T - T_1}}$$
with probability at least $1 - \delta$, where $C'$ is a universal positive constant.

To prove Theorem 2.3, we adopt a two-phase analysis. Our first result shows that after $T_1$ iterations, the output of Oja's algorithm satisfies $\|\mathbf{U}^* \mathbf{Q}_{T_1} (\mathbf{V}^* \mathbf{Q}_{T_1})^{-1}\| \leq 1/2$ with high probability.

Theorem 2.4 (Phase I, informal). Adopt the same setting as Theorem 2.3, and let $\mathbf{Z}_0 \in \mathbb{R}^{d \times k}$ have i.i.d. Gaussian entries. Let
$$T_1 = \Theta\Big( \frac{k M^2}{\delta^2 \rho_k^2} \big( \log(d M / \delta \rho_k) \big)^2 \Big).$$
Then after $T_1$ iterations of Oja's algorithm with constant step size $\eta = \Theta\big( \frac{\log(d/\delta)}{\rho_k T_1} \big)$ and initialization $\mathbf{Z}_0$, the output $\mathbf{Q}_{T_1}$ satisfies $\|\mathbf{U}^* \mathbf{Q}_{T_1} (\mathbf{V}^* \mathbf{Q}_{T_1})^{-1}\| \leq 1/2$ with probability at least $1 - \delta$.

Our analysis of the second phase shows that, if Oja's algorithm is initialized with any matrix satisfying $\|\mathbf{U}^* \mathbf{Q}_0 (\mathbf{V}^* \mathbf{Q}_0)^{-1}\| \leq 1/2$, then the error of Oja's algorithm decays at the rate $O(1/\sqrt{T})$.

Theorem 2.5 (Phase II, informal). Adopt the same setting as Theorem 2.3, and suppose that $\mathbf{Z}_0 \in \mathbb{R}^{d \times k}$ satisfies $\|\mathbf{U}^* \mathbf{Z}_0 (\mathbf{V}^* \mathbf{Z}_0)^{-1}\| \leq 1/2$. Then after $T$ iterations of Oja's algorithm with step sizes $\eta_i = \Theta\big( \frac{1}{(\beta + i) \rho_k} \big)$, where $\beta = \Theta\big( \frac{M}{\rho_k} \log\big( \frac{M k}{\rho_k \delta} \big) \big)$, and initialization $\mathbf{Q}_0$, the output $\mathbf{Q}_T$ satisfies
$$\mathrm{dist}(\mathbf{Q}_T, \mathbf{V}) \leq \sqrt{\frac{\beta + 1}{\beta + T}}$$
with probability at least $1 - \delta$.

This error guarantee is completely dimension free, and depends only logarithmically on $k$ and the failure probability $\delta$.

Theorem 2.3 follows directly from Theorems 2.4 and 2.5. Theorem 2.4 guarantees that with probability $1 - \delta$, the output of Phase I is a suitable initialization for Phase II, and, conditioned on this good event, Theorem 2.5 guarantees that the output of the second phase has error $O(\sqrt{\beta / T})$ with probability $1 - \delta$. By concatenating the analysis of the two phases and using the union bound, we obtain that the resulting two-phase algorithm succeeds with probability at least $1 - 2\delta$, yielding Theorem 2.3.

In the remainder of this section, we describe the main technical tools we employ in our argument.

2.1. A recursive expression.

To simplify the argument, we recall the following result of [1, Lemma 2.2]:

Lemma 2.6.

For all $t \geq 0$,
$$\mathrm{dist}(\mathbf{Q}_t, \mathbf{V}) = \|\mathbf{U}^* \mathbf{Q}_t\| \leq \|\mathbf{U}^* \mathbf{Q}_t (\mathbf{V}^* \mathbf{Q}_t)^{-1}\| = \|\mathbf{U}^* \mathbf{Z}_t (\mathbf{V}^* \mathbf{Z}_t)^{-1}\|.$$

We therefore focus on bounding the norm of the matrix
$$\mathbf{W}_t := \mathbf{U}^* \mathbf{Z}_t (\mathbf{V}^* \mathbf{Z}_t)^{-1}. \qquad (2.1)$$
Under the assumption that $\eta_t$ is small, we might expect that we can write $\mathbf{W}_t$ as a sum of the dominant term
$$\mathbf{H}_t := \mathbf{U}^* (\mathbf{I} + \eta_t \mathbf{M}) \mathbf{Z}_{t-1} \big( \mathbf{V}^* (\mathbf{I} + \eta_t \mathbf{M}) \mathbf{Z}_{t-1} \big)^{-1}$$
plus lower-order terms. To argue that $\mathbf{W}_t$ is close to $\mathbf{H}_t$, we need to argue that the inverse $(\mathbf{V}^* \mathbf{Z}_t)^{-1}$ does not blow up, which will be the case so long as the fluctuation term $\eta_t \mathbf{V}^* (\mathbf{A}_t - \mathbf{M}) \mathbf{Z}_{t-1}$ is smaller than the main term $\mathbf{V}^* (\mathbf{I} + \eta_t \mathbf{M}) \mathbf{Z}_{t-1}$. In order to make this requirement precise, we write
$$\boldsymbol{\Delta}_t := \eta_t \mathbf{V}^* (\mathbf{A}_t - \mathbf{M}) \mathbf{Z}_{t-1} \big( \mathbf{V}^* (\mathbf{I} + \eta_t \mathbf{M}) \mathbf{Z}_{t-1} \big)^{-1}. \qquad (2.2)$$
So long as this matrix has small norm, the inverse term will be well behaved. As we discuss in the following section, we will be able to guarantee that this is the case by conditioning on an appropriate good event.

The following lemma shows that, modulo a term involving $\boldsymbol{\Delta}_t$, we can indeed express $\mathbf{W}_t$ as $\mathbf{H}_t$ plus a small correction.

Lemma 2.7.

Let $\mathbf{W}_t$, $\mathbf{H}_t$, and $\boldsymbol{\Delta}_t$ be defined as in (2.1)–(2.2). Then we can write
$$\mathbf{W}_t (\mathbf{I} - \boldsymbol{\Delta}_t) = \mathbf{H}_t + \mathbf{J}_{t,1} + \mathbf{J}_{t,2} \qquad (2.3)$$
for matrices $\mathbf{J}_{t,1}$ and $\mathbf{J}_{t,2}$ of norm $O(\eta_t)$ and $O(\eta_t^2)$, respectively.

Below, in Propositions A.1 and A.2, we use Lemma 2.7 to develop an explicit recursive bound on the norm of $\mathbf{W}_t$.

2.2. Matrix concentration via smoothness.

In order to exploit the expression (2.3), we need concentration inequalities that allow us to conclude that $\mathbf{W}_t$ is near $\mathbf{H}_t$ with high probability. [16] recently developed new tools to control the norms of products of independent random matrices, in an attempt to extend the mature toolset for bounding sums of random matrices to the product setting. Their techniques are based on a simple but deep property of the Schatten $p$-norms known as uniform smoothness. The most elementary expression of this fact is the following inequality, which is the analogue of (1.2) for the $L_p$ norm.

Proposition 2.8 ([16, Proposition 4.3]). Let $\mathbf{X}$ and $\mathbf{Y}$ be random matrices of the same size, with $\mathbb{E}[\mathbf{Y} \mid \mathbf{X}] = 0$. Then for any $p \geq 2$,
$$\|\mathbf{X} + \mathbf{Y}\|_{p,p}^2 \leq \|\mathbf{X}\|_{p,p}^2 + (p - 1) \|\mathbf{Y}\|_{p,p}^2.$$

We will employ the following corollary of Proposition 2.8, which extends the inequality to non-centered random matrices.

Proposition 2.9.

Let $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ be random matrices of the same size, with $\mathbb{E}[\mathbf{Y} \mid \mathbf{X}] = 0$. Then for any $p \geq 2$ and $\lambda > 0$,
$$\|\mathbf{X} + \mathbf{Y} + \mathbf{Z}\|_{p,p}^2 \leq (1 + \lambda) \big( \|\mathbf{X}\|_{p,p}^2 + (p - 1) \|\mathbf{Y}\|_{p,p}^2 + \lambda^{-1} \|\mathbf{Z}\|_{p,p}^2 \big).$$

The benefit of working in the $L_p$ norm is that bounding this norm for $p$ large yields good tail bounds on the operator norm, which are not available if the argument is carried out solely in expected Frobenius norm. We will rely on this fact heavily in our argument.

2.3. Conditioning on good events.

Obtaining control on $\mathbf{W}_t$ via (2.3) requires ensuring that the matrix $\mathbf{I} - \boldsymbol{\Delta}_t$ is invertible, with inverse of bounded norm. To accomplish this, we define a sequence of good events $\mathcal{G}_0 \supset \mathcal{G}_1 \supset \ldots$, where each $\mathcal{G}_i$ is measurable with respect to the $\sigma$-algebra $\mathcal{F}_i := \sigma(\mathbf{Z}_0, \mathbf{Y}_1, \ldots, \mathbf{Y}_i)$. We write $\mathbb{1}_i$ for the indicator of the event $\mathcal{G}_i$, and we will define $\mathcal{G}_i$ in such a way that $(\mathbf{I} - \boldsymbol{\Delta}_t \mathbb{1}_{t-1})$ is invertible almost surely.

During Phase II, the good events are defined by
$$\mathcal{G}_0 := \{\|\mathbf{W}_0\| \leq 1/2\}, \qquad \mathcal{G}_i := \{\|\mathbf{W}_i\| \leq \gamma\} \cap \mathcal{G}_{i-1}, \quad \forall i \geq 1,$$
for some $\gamma \geq 1$ to be specified. Since Assumption 2.2 implies that $\|\mathbf{A}_i - \mathbf{M}\| \leq M$ almost surely, this definition guarantees that for all $i \geq 1$,
$$\|\mathbf{V}^* (\mathbf{A}_i - \mathbf{M}) \mathbf{U} \mathbf{W}_{i-1} \mathbb{1}_{i-1}\| \leq M \gamma \quad \text{almost surely.} \qquad (2.4)$$
As we show in Proposition A.1 below, if the step size is sufficiently small, then (2.4) implies that $\mathbf{I} - \boldsymbol{\Delta}_t$ is almost surely invertible on $\mathcal{G}_{t-1}$, which allows us to employ (2.3) to bound the norm of $\mathbf{W}_t \mathbb{1}_{t-1}$.

During Phase I, we condition on a slightly more complicated set of events, which we describe explicitly in Section 5. However, these events are constructed so that (2.4) still holds for all $i \geq 1$.

Our matrix concentration results described in Section 2.2 allow us to show that, during both Phase I and Phase II, $\|\mathbf{W}_t \mathbb{1}_{t-1}\|$ is small with high probability, for all $t \geq 1$. Using this fact, we show that, conditioned on $\mathcal{G}_{t-1}$, the probability that $\mathcal{G}_t$ holds is also large. Bounding the failure probability at each step, we are able to conclude that, conditioned on the initialization event $\mathcal{G}_0$, the good events $\mathcal{G}_t$ hold for all $t \geq 1$ with high probability.

3. Main recursive bound

In this section, we state our main recursive bound, which we use in both Phase I and Phase II. A proof appears in Appendix B.

Theorem 3.1.

Let $t$ be a positive integer, and for all $i \in [t]$, let $\varepsilon_i = 2 \eta_i M (1 + \gamma)$. Let $\mathbb{1}_0, \ldots, \mathbb{1}_t$ be the indicator functions of a sequence of good events satisfying (2.4) for all $i \in [t]$. Assume that for all $i \in [t]$,
$$\varepsilon_i \leq \frac{1}{2}, \qquad \eta_i \|\mathbf{M}\| \leq \frac{1}{2}, \qquad e^{-\eta_i \rho_k / 2} \leq \frac{\varepsilon_i}{\varepsilon_{i-1}}, \qquad (3.1)$$
with the convention that the last requirement is vacuous when $i = 1$. Then for any $p \geq 2$,
$$\|\mathbf{W}_t \mathbb{1}_t\|_{p,p} \leq \|\mathbf{W}_t \mathbb{1}_{t-1}\|_{p,p} \leq e^{-s_t \rho_k} \|\mathbf{W}_0 \mathbb{1}_0\|_{p,p} + C_1 p\, \varepsilon_t \sum_{i=1}^{t-1} \|\mathbf{W}_i \mathbb{1}_i\|_{p,p} + C_2 p k^{1/p} \varepsilon_t, \qquad (3.2)$$
where $s_t = \sum_{i=1}^{t} \eta_i$ and $C_1$, $C_2$ are absolute constants. Moreover, if in addition for all $i \in [t]$,
$$p \varepsilon_i^2 \leq \eta_i \rho_k, \qquad (3.3)$$
then
$$\|\mathbf{W}_t \mathbb{1}_t\|_{p,p} \leq \|\mathbf{W}_t \mathbb{1}_{t-1}\|_{p,p} \leq e^{-s_t \rho_k / 2} \|\mathbf{W}_0 \mathbb{1}_0\|_{p,p} + C_3 p k^{1/p} \varepsilon_t.$$

Theorem 3.1 shows that, up to a small error, $\|\mathbf{W}_t \mathbb{1}_{t-1}\|_{p,p}$ decays exponentially fast. We will use this fact to prove high-probability bounds on $\|\mathbf{W}_t \mathbb{1}_{t-1}\|$, which then imply bounds on $\|\mathbf{W}_t\|$.

4. Phase II

In this section, we use Theorem 3.1 to prove a formal version of Theorem 2.5. For this phase, recall that we define the good events $\mathcal{G}_i$ by
$$\mathcal{G}_0 = \{\|\mathbf{W}_0\| \leq 1/2\}, \qquad \mathcal{G}_i = \{\|\mathbf{W}_i\| \leq \gamma\} \cap \mathcal{G}_{i-1}, \quad \forall i \geq 1. \qquad (4.1)$$
For Phase II, we set $\gamma = \sqrt{2}$.

We first show that, with a specific step-size schedule, we obtain good bounds on the norm of the last iterate.

Proposition 4.1.

Define the good events as in (4.1). Set $\eta_i = \frac{\alpha}{(\beta + i) \rho_k}$ for positive quantities $\alpha$ and $\beta$, and define the normalized gap
$$\bar{\rho}_k = \min\Big\{ \frac{\rho_k}{M}, \frac{\rho_k}{\|\mathbf{M}\|}, 1 \Big\}. \qquad (4.2)$$
If
$$\alpha \geq 2, \qquad \beta \geq \frac{(1 + \sqrt{2})\, \alpha}{\bar{\rho}_k}, \qquad (4.3)$$
then for any $t \geq 1$ and any $p \geq 2$ for which (3.3) holds,
$$\|\mathbf{W}_t \mathbb{1}_t\|_{p,p} \leq \|\mathbf{W}_t \mathbb{1}_{t-1}\|_{p,p} \leq k^{1/p} \Big( \frac{\beta + 1}{\beta + t} \Big)^{\alpha/2} + p k^{1/p} \cdot \frac{C_4 \alpha}{\bar{\rho}_k} \cdot \frac{1}{\beta + t}, \qquad (4.4)$$
where $C_4$ is a numerical constant.

Proof. Since the good events defined in (4.1) satisfy (2.4), we can apply Theorem 3.1. In the appendix, we show (Lemma C.1) that (4.3) implies that the assumptions in (3.1) hold. The second bound of Theorem 3.1 then yields
$$\|\mathbf{W}_t \mathbb{1}_t\|_{p,p} \leq e^{-s_t \rho_k / 2} \|\mathbf{W}_0 \mathbb{1}_0\|_{p,p} + C_3 p k^{1/p} \varepsilon_t \leq e^{-s_t \rho_k / 2}\, \frac{k^{1/p}}{2} + C_3 p k^{1/p} \varepsilon_t,$$
since (4.1) implies $\|\mathbf{W}_0 \mathbb{1}_0\|_{p,p} \leq k^{1/p}/2$. The definition of $\eta_i$ implies
$$\rho_k s_t = \alpha \sum_{i=1}^{t} \frac{1}{\beta + i} \geq \alpha \log\Big( \frac{\beta + t + 1}{\beta + 1} \Big) \geq \alpha \log\Big( \frac{\beta + t}{\beta + 1} \Big),$$
so that $e^{-s_t \rho_k / 2} \leq \big( \frac{\beta+1}{\beta+t} \big)^{\alpha/2}$, and moreover $\varepsilon_t = \frac{2 \alpha M (1+\gamma)}{(\beta + t) \rho_k} \leq \frac{2 (1+\gamma) \alpha}{\bar{\rho}_k (\beta + t)}$. We obtain
$$\|\mathbf{W}_t \mathbb{1}_t\|_{p,p} \leq k^{1/p} \Big( \frac{\beta + 1}{\beta + t} \Big)^{\alpha/2} + p k^{1/p} \cdot \frac{C_4 \alpha}{\bar{\rho}_k} \cdot \frac{1}{\beta + t},$$
where $C_4 = 2 (1 + \gamma) C_3$, as desired. □

Finally, we remove the conditioning and prove the full version of Theorem 2.5.

Theorem 4.2.

Assume $\|\mathbf{W}_0\| \leq \frac{1}{2}$, and adopt the step sizes $\eta_i = \frac{\alpha}{(\beta + i) \rho_k}$ with
$$\alpha \geq 2, \qquad \beta \geq \frac{C_5 \alpha}{\bar{\rho}_k} \log\Big( \frac{C_5 \alpha}{\bar{\rho}_k} \cdot \frac{2k}{\delta} \Big),$$
where $\bar{\rho}_k$ is as in (4.2) and $C_5$ is a numerical constant depending only on $C_4$ from (4.4). Then
$$\|\mathbf{W}_T\| \leq \sqrt{\frac{\beta + 1}{\beta + T}}$$
with probability at least $1 - \delta$.

Proof. For any $s \geq 0$, it holds that
$$\mathbb{P}\{\|\mathbf{W}_T\| \geq s\} \leq \mathbb{P}\{\|\mathbf{W}_T \mathbb{1}_{T-1}\| \geq s\} + \mathbb{P}\{\mathcal{G}_{T-1}^c\}.$$
First, we have
$$\mathbb{P}\{\mathcal{G}_{T-1}^c\} \leq \mathbb{P}\{\mathcal{G}_0^c\} + \sum_{j=1}^{T-1} \mathbb{P}\{\mathcal{G}_j^c \cap \mathcal{G}_{j-1}\}.$$
Since we have assumed that the initialization satisfies $\|\mathbf{W}_0\| \leq \frac{1}{2}$, the event $\mathcal{G}_0$ holds with probability $1$, so it suffices to bound the second term. By Markov's inequality, we have
$$\mathbb{P}\{\mathcal{G}_j^c \cap \mathcal{G}_{j-1}\} = \mathbb{P}\{\|\mathbf{W}_j \mathbb{1}_{j-1}\| \geq \gamma\} \leq \inf_{p \geq 2} \gamma^{-p} \|\mathbf{W}_j \mathbb{1}_{j-1}\|_{p,p}^p.$$
For fixed $j \geq 1$, we choose $p = \frac{(\beta + j) \bar{\rho}_k}{4 C_4 \alpha}$. Since $\big( \frac{\beta+1}{\beta+j} \big)^{\alpha/2} \leq 1$, it follows from (4.4) that
$$\gamma^{-p} \|\mathbf{W}_j \mathbb{1}_{j-1}\|_{p,p}^p \leq \Big( \gamma^{-1} k^{1/p} \Big[ \Big( \frac{\beta+1}{\beta+j} \Big)^{\alpha/2} + p \cdot \frac{C_4 \alpha}{\bar{\rho}_k} \cdot \frac{1}{\beta+j} \Big] \Big)^p \leq k \Big( \frac{5}{4 \gamma} \Big)^p \leq k\, e^{-p/9} = k \exp\Big( -\frac{(\beta + j)\, \bar{\rho}_k}{36\, C_4 \alpha} \Big).$$
Therefore, for any $T \geq 1$,
$$\sum_{j=1}^{T-1} \mathbb{P}\{\mathcal{G}_j^c \cap \mathcal{G}_{j-1}\} \leq k \sum_{j=1}^{\infty} \exp\Big( -\frac{(\beta + j)\, \bar{\rho}_k}{36\, C_4 \alpha} \Big) \leq k \cdot \frac{36\, C_4 \alpha}{\bar{\rho}_k}\, e^{-\beta \bar{\rho}_k / (36 C_4 \alpha)}.$$
This quantity is smaller than $\delta/2$ by the assumption on $\beta$, with $C_5 = 36\, C_4$. It remains to bound $\mathbb{P}\{\|\mathbf{W}_T \mathbb{1}_{T-1}\| \geq s\}$. A simple argument (Lemma C.2) based on (4.4) shows that this probability is at most $\delta/2$ for $s = \sqrt{\frac{\beta + 1}{\beta + T}}$.

The claim follows. □

5. Phase I

In this section, we describe the slightly more delicate proof of the formal version of Theorem 2.4. As in Section 4, we will employ Theorem 3.1. However, we will also need to develop an auxiliary recurrence to bound the growth of an additional matrix sequence.

Before we analyze Phase I, we first show that we can reduce to the case that $\mathbb{P}_{\mathbf{A}}$ has finite support. We prove the following result in Appendix E.

Proposition 5.1.

Fix $\rho > 0$. Suppose that there exists a choice of constant step size $\eta$ and
$$T \geq \frac{M^2}{\rho^2 \delta^2} \log(d/\delta)$$
such that for any finitely-supported distribution with support size at most $T$ satisfying Assumptions 2.1 and 2.2 and with $\rho_k \geq \rho/2$, we have
$$\|\mathbf{U}^* \mathbf{Q}_T (\mathbf{V}^* \mathbf{Q}_T)^{-1}\| \leq \frac{1}{2} \qquad (5.1)$$
with probability at least $1 - \delta/2$. Then for this same $\eta$ and $T$ it in fact holds that for any distribution satisfying Assumptions 2.1 and 2.2 and with $\rho_k \geq \rho$, we have $\|\mathbf{U}^* \mathbf{Q}_T (\mathbf{V}^* \mathbf{Q}_T)^{-1}\| \leq \frac{1}{2}$ with probability at least $1 - \delta$.

Proposition 5.1 implies that it suffices to prove the error guarantee (5.1) in the special case when $\mathbb{P}_{\mathbf{A}}$ has finite support of cardinality at most $T$. Let us fix a time horizon $T$ and assume in what follows that $m := |\mathrm{supp}(\mathbb{P}_{\mathbf{A}})| \leq T$.

We begin by defining the good events for Phase I. We adopt a constant step size $\eta$, to be specified. Denote
$$\mathcal{E} := \big\{ M^{-1} (\mathbf{A} - \mathbf{M}) \mathbf{U}\mathbf{U}^* : \mathbf{A} \in \mathrm{supp}(\mathbb{P}_{\mathbf{A}}) \big\}.$$
For $i \geq 1$, we will set
$$\mathcal{G}_i = \Big\{ \max_{\mathbf{E} \in \mathcal{E}} \|\mathbf{V}^* \mathbf{E} \mathbf{U} \mathbf{W}_i\| \leq \gamma \Big\} \cap \mathcal{G}_{i-1}.$$
Note that this choice satisfies (2.4) for all $i > 0$.

To define the initial good event $\mathcal{G}_0$, we need to define a larger set of matrices to condition on. For all $r, \ell \geq 0$, set
$$\mathcal{E}_{r,\ell} := \big\{ \mathbf{V}^* \mathbf{F}_1 \cdots \mathbf{F}_r \mathbf{U} : \mathbf{F}_i \in \mathcal{E} \text{ for at most } \ell \text{ distinct indices } i \in [r], \text{ and } \mathbf{F}_i = (1 + \eta \lambda_{k+1})^{-1} (\mathbf{I} + \eta \mathbf{M}) \mathbf{U}\mathbf{U}^* \text{ otherwise} \big\}.$$
The set $\mathcal{E}_{r,\ell}$ has cardinality less than $(r(m+1))^{\ell}$, and $\|\mathbf{E}\| \leq 1$ for any $\mathbf{E} \in \mathcal{E}_{r,\ell}$ and any $r, \ell \geq 0$. We have defined $\mathcal{E}_{r,\ell}$ so that control over $\max_{\mathbf{E} \in \mathcal{E}_{r+1,\ell+1}} \|\mathbf{E} \mathbf{W}_{t-1}\|$ gives control over $\max_{\mathbf{E} \in \mathcal{E}_{r,\ell}} \|\mathbf{E} \mathbf{W}_t\|$. Finally, we define
$$\mathcal{G}_0 := \bigcap_{r, \ell = 0}^{T+1} \Big\{ \max_{\mathbf{E} \in \mathcal{E}_{r,\ell}} \|\mathbf{E} \mathbf{W}_0\| \leq \frac{\sqrt{\ell}\, \gamma}{\sqrt{2}} \Big\} \cap \big\{ \|\mathbf{W}_0\| \leq \sqrt{d}\, \gamma \big\}. \qquad (5.2)$$
Since $M^{-1} \mathbf{V}^* (\mathbf{A} - \mathbf{M}) \mathbf{U} \in \mathcal{E}_{1,1}$ almost surely, this choice satisfies (2.4) for $i = 0$.

Our strategy will be similar to the one used in Section 4.
However, in order to show that the good events $\mathcal{G}_i$ hold with high probability, we will also need a second recurrence that allows us to control the norm of matrices of the form $\mathbf{E} \mathbf{W}_t$, for $\mathbf{E} \in \mathcal{E}_{r,\ell}$. The details appear in Appendix D.

6. Conclusion

This work gives the first nearly optimal analysis of Oja's algorithm for streaming PCA beyond the rank-one case. Our analysis is conceptually simple: we show that the spectral norm of the matrix $\mathbf{W}_t$ concentrates well around its expectation, once we condition on $\mathbf{W}_{t-1}$ having the same behavior. And our concentration results are strong enough that we can afford to union bound over the entire course of the algorithm, to show that $\mathbf{W}_t$ is well behaved for all $t \geq 1$.

The matrix concentration techniques we have applied here could be useful in analyzing other PCA-like algorithms, or, more generally, other stochastic algorithms for simple non-convex optimization problems. An interesting question is whether these techniques can prove gap-free rates for Oja's algorithm outside the rank-one setting. This would extend the results of [1] to the general case.

Finally, we stress that the algorithm we have described here requires a priori knowledge of the problem parameters (including the gap $\rho_k$) to set the step sizes, which is a serious limitation in practice. Recently, [14] developed a data-driven procedure to adaptively select the optimal step sizes. Obtaining theoretical guarantees for this or similar algorithms is an important open problem.

We thank Joel Tropp and Amelia Henriksen for valuable discussions which greatly improved this manuscript.

Appendix A. Additional results for Section 3

The following proposition develops the expansion described in Lemma 2.7 and gives explicit bounds on the norms of the error matrices $\mathbf{J}_{t,1}$ and $\mathbf{J}_{t,2}$. We recall the following definitions:
$$\mathbf{W}_t = \mathbf{U}^* \mathbf{Z}_t (\mathbf{V}^* \mathbf{Z}_t)^{-1}, \qquad \mathbf{H}_t = \mathbf{U}^* (\mathbf{I} + \eta_t \mathbf{M}) \mathbf{Z}_{t-1} \big( \mathbf{V}^* (\mathbf{I} + \eta_t \mathbf{M}) \mathbf{Z}_{t-1} \big)^{-1}, \qquad \boldsymbol{\Delta}_t = \eta_t \mathbf{V}^* (\mathbf{A}_t - \mathbf{M}) \mathbf{Z}_{t-1} \big( \mathbf{V}^* (\mathbf{I} + \eta_t \mathbf{M}) \mathbf{Z}_{t-1} \big)^{-1}.$$

Proposition A.1.

Let 𝑑 β‰₯ . Assume that πœ‚ 𝑑 is small enough that 𝑴 (cid:23) βˆ’ πœ‚ 𝑑 I , and assume that (2.4) holds for 𝑖 = 𝑑 . Let 𝐸 𝑑 = ( π‘˜ / 𝑝 + k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 ) πœ€ 𝑑 = πœ‚ 𝑑 𝑀 ( + 𝛾 ) . Then k 𝚫 𝑑 𝑑 βˆ’ k ≀ πœ€ 𝑑 almost surely, and 𝑾 𝑑 ( I βˆ’ 𝚫 𝑑 ) = 𝑯 𝑑 + 𝑱 𝑑, + 𝑱 𝑑, for 𝑱 𝑑, and 𝑱 𝑑, satisfying k 𝑱 𝑑, 𝑑 βˆ’ k 𝑝,𝑝 ≀ 𝐸 𝑑 πœ€ 𝑑 k 𝑱 𝑑, 𝑑 βˆ’ k 𝑝,𝑝 ≀ 𝐸 𝑑 πœ€ 𝑑 , and E [ 𝑱 𝑑, : F 𝑑 βˆ’ ] = .Proof. We employ the notation of the proof of Lemma 2.7. (See Appendix G.) First, we show the boundon 𝚫 𝑑 . Since πœ‚ 𝑑 𝑴 (cid:23) βˆ’ I , we have k 𝑽 βˆ— ( I + πœ‚ 𝑑 𝑴 ) βˆ’ 𝑽 k ≀ . Moreover, since k 𝑽 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝑼 π‘Š 𝑑 βˆ’ k ≀ 𝑀𝛾 almost surely, we have that k 𝚫 𝑑 𝑑 βˆ’ k ≀ k πœ‚ 𝑑 𝑽 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) ( 𝑼𝑼 βˆ— + 𝑽𝑽 βˆ— ) 𝒁 𝑑 βˆ’ ( 𝑽 βˆ— 𝒁 𝑑 βˆ’ ) βˆ’ 𝑑 βˆ’ k≀ πœ‚ 𝑑 k 𝑽 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝑼𝑼 βˆ— 𝒁 𝑑 βˆ’ ( 𝑽 βˆ— 𝒁 𝑑 βˆ’ ) βˆ’ 𝑑 βˆ’ k + πœ‚ 𝑑 k 𝑽 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝑽 k = πœ‚ 𝑑 k 𝑽 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝑼𝑾 𝑑 βˆ’ 𝑑 βˆ’ k + πœ‚ 𝑑 k 𝑽 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝑽 k≀ πœ‚ 𝑑 𝑀 ( + 𝛾 ) = : πœ€ 𝑑 . We can bound k b 𝚫 𝑑 𝑑 βˆ’ k 𝑝,𝑝 by a similar argument. First, note that Assumption 2.2 implies that k 𝑨 𝑑 βˆ’ 𝑴 k ≀ 𝑀 almost surely. Hence k b 𝚫 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ πœ‚ 𝑑 k 𝑼 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝑼𝑼 βˆ— 𝑍 𝑑 βˆ’ ( 𝑽 βˆ— 𝒁 𝑑 βˆ’ ) βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + πœ‚ 𝑑 k 𝑼 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝑽 𝑑 βˆ’ k 𝑝,𝑝 = πœ‚ 𝑑 k 𝑼 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝑼 k k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + πœ‚ 𝑑 k 𝑼 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝑽 𝑑 βˆ’ k 𝑝,𝑝 ≀ ( k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + π‘˜ / 𝑝 ) πœ‚ 𝑑 𝑀 ≀ ( k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + π‘˜ / 𝑝 ) πœ€ 𝑑 , Finally, we have k 𝑯 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ + πœ‚ 𝑑 πœ† π‘˜ + + πœ‚ 𝑑 πœ† π‘˜ k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 ≀ k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 . We now employ Lemma 2.7. The term 𝑱 𝑑, satisfies E [ 𝑱 𝑑, 𝑑 βˆ’ | F 𝑑 βˆ’ ] = , and we have k 𝑱 𝑑, 𝑑 βˆ’ k 𝑝,𝑝 ≀ k b 𝚫 𝑑 𝑑 βˆ’ k 𝑝,𝑝 + k 𝑯 𝑑 𝑑 βˆ’ k 𝑝,𝑝 k 𝚫 𝑑 𝑑 βˆ’ k≀ ( k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + π‘˜ / 𝑝 ) πœ€ 𝑑 + k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 πœ€ 𝑑 ≀ 𝐸 𝑑 πœ€ 𝑑 . Finally, k 𝑱 𝑑, k 𝑝,𝑝 ≀ k b 𝚫 𝑑 𝑑 βˆ’ k 𝑝,𝑝 k 𝚫 𝑑 𝑑 βˆ’ k ≀ ( k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + π‘˜ / 𝑝 ) πœ€ 𝑑 ≀ 𝐸 𝑑 πœ€ 𝑑 . 
(cid:3) Combining Proposition A.1 with Proposition 2.9 immediately yields a recursive bound.

Proposition A.2.

Adopt the setting of Proposition A.1. If πœ€ 𝑑 ≀ / , then k 𝑾 𝑑 𝑑 k 𝑝,𝑝 ≀ k 𝑾 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ 𝐾 ,𝑑 k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + 𝐾 ,𝑑 , (A.1) where 𝐾 ,𝑑 = ( + πœ€ 𝑑 ) ( (cid:18) + πœ‚ 𝑑 πœ† π‘˜ + πœ‚ 𝑑 πœ† π‘˜ + (cid:19) + π‘πœ€ 𝑑 ) 𝐾 ,𝑑 = π‘π‘˜ / 𝑝 πœ€ 𝑑 . Proof.

Reusing the notation of Proposition A.1, we have 𝑾 𝑑 𝑑 βˆ’ ( I βˆ’ 𝚫 𝑑 ) = 𝑯 𝑑 𝑑 βˆ’ + 𝑱 𝑑, 𝑑 βˆ’ + 𝑱 𝑑, 𝑑 βˆ’ , where E [ 𝑱 𝑑, 𝑑 βˆ’ : F 𝑑 βˆ’ ] = . Since 𝑯 𝑑 𝑑 βˆ’ is F 𝑑 βˆ’ -measurable, Proposition 2.9 therefore yields forany πœ† > k 𝑾 𝑑 𝑑 βˆ’ ( I βˆ’ 𝚫 𝑑 ) k 𝑝,𝑝 ≀ ( + πœ† ) ( k 𝑯 𝑑 𝑑 βˆ’ k 𝑝,𝑝 + ( 𝑝 βˆ’ ) 𝐸 𝑑 πœ€ 𝑑 + πœ† βˆ’ 𝐸 𝑑 πœ€ 𝑑 ) . Choosing πœ† = πœ€ 𝑑 , we obtain k 𝑾 𝑑 𝑑 βˆ’ ( I βˆ’ 𝚫 𝑑 ) k 𝑝,𝑝 ≀ ( + πœ€ 𝑑 ) ( k 𝑯 𝑑 𝑑 βˆ’ k 𝑝,𝑝 + 𝑝𝐸 𝑑 πœ€ 𝑑 ) . Finally, under the assumption that k 𝚫 𝑑 𝑑 βˆ’ k ≀ πœ€ 𝑑 ≀ almost surely, on the event G 𝑑 βˆ’ the matrix I βˆ’ 𝚫 𝑑 is invertible and satisfies k ( I βˆ’ 𝚫 𝑑 ) βˆ’ 𝑑 βˆ’ k ≀ ( βˆ’ k 𝚫 𝑑 𝑑 βˆ’ k ) βˆ’ ≀ ( βˆ’ πœ€ 𝑑 ) βˆ’ Hence k 𝑾 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ k 𝑾 𝑑 𝑑 βˆ’ ( I βˆ’ 𝚫 𝑑 ) k 𝑝,𝑝 k ( I βˆ’ 𝚫 𝑑 ) βˆ’ 𝑑 βˆ’ k ≀ + πœ€ 𝑑 ( βˆ’ πœ€ 𝑑 ) ( k 𝑯 𝑑 𝑑 βˆ’ k 𝑝,𝑝 + 𝑝𝐸 𝑑 πœ€ 𝑑 ) . TREAMING π‘˜ -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 13 Since + πœ€ 𝑑 ( βˆ’ πœ€ 𝑑 ) ≀ + πœ€ 𝑑 for all πœ€ 𝑑 ≀ and ( + πœ€ 𝑑 ) 𝐸 𝑑 ≀ ( + πœ€ 𝑑 ) ( π‘˜ / 𝑝 + k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 ) and ( + πœ€ 𝑑 ) ≀ for all πœ€ 𝑑 ≀ , this proves the claim. (cid:3) Appendix B. Proof of Theorem 3.1

We will unroll the one-step recurrence of Proposition A.2. We first bound 𝐾 ,𝑖 . We have 𝐾 ,𝑖 ≀ (cid:18) + πœ‚ 𝑖 πœ† π‘˜ + πœ‚ 𝑖 πœ† π‘˜ + (cid:19) + ( + 𝑝 ) πœ€ 𝑖 + π‘πœ€ 𝑖 ≀ (cid:18) + πœ‚ 𝑖 πœ† π‘˜ + πœ‚ 𝑖 πœ† π‘˜ + (cid:19) + ( + 𝑝 ) πœ€ 𝑖 , where the second inequality follows from the first assumption in (3.1). The second assumption in (3.1)implies that ≀ + πœ‚ 𝑖 πœ† π‘˜ ≀ , so (cid:18) + πœ‚ 𝑖 πœ† π‘˜ + + πœ‚ 𝑖 πœ† π‘˜ (cid:19) = (cid:18) βˆ’ πœ‚ 𝑖 𝜌 π‘˜ + πœ‚ 𝑖 πœ† π‘˜ (cid:19) ≀ (cid:18) βˆ’ πœ‚ 𝑖 𝜌 π‘˜ (cid:19) ≀ e βˆ’ πœ‚ 𝑖 𝜌 π‘˜ . Since + 𝑝 ≀ 𝑝 for all 𝑝 β‰₯ , we obtain 𝐾 ,𝑖 ≀ e βˆ’ πœ‚ 𝑖 𝜌 π‘˜ + 𝐢 π‘πœ€ 𝑖 . We now proceed to prove the first claim by induction. When 𝑑 = , we use (A.1) to obtain k 𝑾 k 𝑝,𝑝 ≀ k 𝑾 k 𝑝,𝑝 ≀ 𝐾 , k 𝑾 k 𝑝,𝑝 + 𝐾 , ≀ e βˆ’ πœ‚ 𝜌 π‘˜ k 𝑾 k 𝑝,𝑝 + 𝐢 π‘πœ€ k 𝑾 k 𝑝,𝑝 + 𝐢 π‘π‘˜ / 𝑝 πœ€ , which is the desired bound.Proceeding by induction, for 𝑑 > we have k 𝑾 𝑑 𝑑 k 𝑝,𝑝 ≀ k 𝑾 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ 𝐾 ,𝑑 k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + 𝐾 ,𝑑 ≀ e βˆ’ πœ‚ 𝑑 𝜌 π‘˜ k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + 𝐢 π‘πœ€ 𝑑 k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + 𝐾 ,𝑑 ≀ e βˆ’ πœ‚ 𝑑 𝜌 π‘˜ (cid:16) e βˆ’ 𝑠 𝑑 βˆ’ 𝜌 π‘˜ k 𝑾 k 𝑝,𝑝 + 𝐢 π‘πœ€ 𝑑 βˆ’ Γ• 𝑑 βˆ’ 𝑖 = k 𝑾 𝑖 𝑖 k 𝑝,𝑝 + 𝐢 π‘π‘˜ / 𝑝 πœ€ 𝑑 βˆ’ ( 𝑑 βˆ’ ) (cid:17) + 𝐢 π‘πœ€ 𝑑 k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + 𝐢 π‘π‘˜ / 𝑝 πœ€ 𝑑 ≀ e βˆ’ 𝑠 𝑑 𝜌 π‘˜ k 𝑾 k 𝑝,𝑝 + 𝐢 π‘πœ€ 𝑑 Γ• 𝑑 βˆ’ 𝑖 = k 𝑾 𝑖 𝑖 k 𝑝,𝑝 + 𝐢 π‘π‘˜ / 𝑝 πœ€ 𝑑 𝑑 , where in the final inequality we have used that e βˆ’ πœ‚ 𝑑 𝜌 π‘˜ πœ€ 𝑑 βˆ’ ≀ πœ€ 𝑑 by the third assumption of (3.1). Thisproves the first bound.For the second bound, we proceed in a similar way, but with a sharper bound on 𝐾 ,𝑖 . 
The secondassumption of (3.1) again implies (cid:18) + πœ‚ 𝑖 πœ† π‘˜ + + πœ‚ 𝑖 πœ† π‘˜ (cid:19) = (cid:18) βˆ’ πœ‚ 𝑖 𝜌 π‘˜ + πœ‚ 𝑖 πœ† π‘˜ (cid:19) ≀ βˆ’ πœ‚ 𝑖 𝜌 π‘˜ + ( πœ‚ 𝑖 𝜌 π‘˜ ) ≀ βˆ’ πœ‚ 𝑖 𝜌 π‘˜ , and therefore 𝐾 ,𝑖 ≀ ( + πœ€ 𝑖 ) (cid:18) βˆ’ πœ‚ 𝑖 𝜌 π‘˜ + π‘πœ€ 𝑖 (cid:19) ≀ exp (cid:18) βˆ’ πœ‚ 𝑖 𝜌 π‘˜ + ( + 𝑝 ) πœ€ 𝑖 (cid:19) ≀ e βˆ’ πœ‚ 𝑖 𝜌 π‘˜ / , where the final step uses Assumption (3.3) and the fact that + 𝑝 ≀ 𝑝 for all 𝑝 β‰₯ .When 𝑑 = , we therefore have k 𝑾 k 𝑝,𝑝 ≀ k 𝑾 k 𝑝,𝑝 ≀ 𝐾 , k 𝑾 k 𝑝,𝑝 + 𝐾 , ≀ e βˆ’ πœ‚ 𝜌 π‘˜ / k 𝑾 k 𝑝,𝑝 + 𝐢 π‘π‘˜ / 𝑝 πœ€ , as desired, and for 𝑑 > the induction hypothesis yields k 𝑾 𝑑 𝑑 k 𝑝,𝑝 ≀ k 𝑾 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ 𝐾 ,𝑑 k 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + 𝐾 ,𝑑 ≀ e βˆ’ πœ‚ 𝑑 𝜌 π‘˜ / (cid:16) e βˆ’ 𝑠 𝑑 βˆ’ 𝜌 π‘˜ / k 𝑾 k 𝑝,𝑝 + 𝐢 π‘π‘˜ / 𝑝 πœ€ 𝑑 βˆ’ ( 𝑑 βˆ’ ) (cid:17) ≀ e βˆ’ 𝑠 𝑑 𝜌 π‘˜ / k 𝑾 k 𝑝,𝑝 + 𝐢 π‘π‘˜ / 𝑝 πœ€ 𝑑 𝑑 , where the final inequality again uses the third assumption in (3.1). This proves the second bound. (cid:3) Appendix C. Additional results for Section 4

Lemma C.1.

Under the conditions of Proposition 4.1, the assumptions of (3.1) hold.Proof.

First assumption. We have πœ€ 𝑖 = πœ‚ 𝑖 𝑀 ( + 𝛾 ) = ( + √ ) 𝛼𝑀 ( 𝛽 + 𝑖 ) 𝜌 π‘˜ ≀ 𝐢 πœ€ 𝛼𝛽 Β― 𝜌 π‘˜ , where 𝐢 πœ€ = ( + √ ) . So the first assumption is fulfilled as long as 𝛽 / 𝛼 β‰₯ 𝐢 πœ€ / Β― 𝜌 π‘˜ . (C.1a)Second assumption. As above, we have πœ‚ 𝑖 k 𝑴 k ≀ 𝛼 k 𝑴 k π›½πœŒ π‘˜ ≀ 𝛼𝛽 Β― 𝜌 π‘˜ , so the assumption is fulfilled if (C.1a) holds.Third assumption. It suffices to show that πœ€ 𝑖 βˆ’ πœ€ 𝑖 ≀ + πœ‚ 𝑖 𝜌 π‘˜ βˆ€ 𝑖 β‰₯ , which is equivalent to 𝛽 + 𝑖 βˆ’ ≀ 𝛼 / 𝛽 + 𝑖 βˆ€ 𝑖 β‰₯ . This holds as long as 𝛼 β‰₯ . (C.1b)We obtain that all three assumptions hold under (C.1a) and (C.1b), as claimed. (cid:3) Lemma C.2.

In the setting of Theorem D.5, if 𝑠 = q 𝛽 + 𝛽 + 𝑇 , then P {k 𝑾 𝑇 k β‰₯ 𝑠 } ≀ 𝛿 / . Proof.

We have P {k 𝑾 𝑇 𝑇 k β‰₯ 𝑠 } ≀ inf 𝑝 β‰₯ 𝑠 βˆ’ 𝑝 k 𝑾 𝑇 𝑇 k 𝑝𝑝,𝑝 . TREAMING π‘˜ -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 15 In particular, we choose 𝑠 = e (cid:18) 𝛽 + 𝛽 + 𝑇 (cid:19) 𝛼 + e 𝐢 𝛼 Β― 𝜌 π‘˜ 𝑇 ( 𝛽 + 𝑇 ) l og ( π‘˜ / 𝛿 ) , and 𝑝 = l og ( π‘˜ / 𝛿 ) . It then follows from (4.4) that P {k 𝑾 𝑇 𝑇 k β‰₯ 𝑠 } ≀ 𝑠 βˆ’ 𝑝 k 𝑾 𝑇 𝑇 k 𝑝𝑝 ≀ π‘˜ 𝑠 (cid:18) 𝛽 + 𝛽 + 𝑇 (cid:19) 𝛼 + 𝑠 𝑝 𝐢 𝛼 Β― 𝜌 π‘˜ 𝑇 ( 𝛽 + 𝑇 ) ! 𝑝 / = π‘˜ e βˆ’ 𝑝 = 𝛿 / . Combining the above bounds, we obtain that k 𝑾 𝑇 k ≀ 𝑠 ≀ e (cid:18) 𝛽 + 𝛽 + 𝑇 (cid:19) 𝛼 / + e 𝐢 π›Όπ‘€πœŒ π‘˜ r l og ( π‘˜ / 𝛿 ) 𝑇 , with probability at least βˆ’ 𝛿 . Since both terms are smaller than e q 𝛽 + 𝛽 + 𝑇 , the claim follows. (cid:3) Appendix D. Additional results for Section 5

Our main tool will be the following slight variation on Proposition A.1.

Proposition D.1.

Let 𝑑 β‰₯ . Assume that πœ‚ 𝑑 is small enough that 𝑴 (cid:23) βˆ’ πœ‚ 𝑑 I , and assume that (2.4) holds for 𝑖 = 𝑑 . Consider an arbitrary deterministic matrix 𝑬 ∈ E π‘Ÿ,β„“ .Let Β― 𝐸 𝑑 = + max 𝑬 β€²β€² ∈ E π‘Ÿ + ,β„“ + k 𝑬 β€²β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 πœ€ = πœ‚π‘€ ( + 𝛾 ) . Then k 𝚫 𝑑 𝑑 βˆ’ k ≀ πœ€ almost surely, and 𝑬𝑾 𝑑 ( I βˆ’ 𝚫 𝑑 ) = 𝑬𝑯 𝑑 + 𝑬𝑱 𝑑, + 𝑬𝑱 𝑑, for 𝑬𝑱 𝑑, and 𝑬𝑱 𝑑, satisfying k 𝑬𝑱 𝑑, 𝑑 βˆ’ k 𝑝,𝑝 ≀ Β― 𝐸 𝑑 πœ€ k 𝑬𝑱 𝑑, 𝑑 βˆ’ k 𝑝,𝑝 ≀ Β― 𝐸 𝑑 πœ€ , and E [ 𝑬𝑱 𝑑, : F 𝑑 βˆ’ ] = .Proof. The proof is a slight modification on the proof of Proposition A.1. By construction, k 𝑬𝑯 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ (cid:18) + πœ‚πœ† π‘˜ + πœ‚πœ† π‘˜ + (cid:19) k 𝑬 β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 , where 𝑬 β€² = + πœ‚πœ† π‘˜ + 𝑬𝑼 βˆ— ( I + πœ‚ 𝚺 ) 𝑼 ∈ E π‘Ÿ + ,β„“ βŠ† E π‘Ÿ + ,β„“ + .Similarly, we have k 𝑬 b 𝚫 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ πœ‚ k 𝑬𝑼 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝑼𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + πœ‚ k 𝑬𝑼 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝑽 k 𝑝,𝑝 ≀ πœ‚π‘€ ( k 𝑬 β€²β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + k 𝑬 k 𝑝,𝑝 )≀ πœ€ ( k 𝑬 β€²β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + ) where 𝑬 β€²β€² = 𝑀 𝑬𝑼 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝑼 ∈ E π‘Ÿ + ,β„“ + , and we have used k 𝑬 k 𝑝 ≀ k 𝑬 k ≀ .We therefore obtain k 𝑬𝑱 𝑑, 𝑑 βˆ’ k 𝑝,𝑝 ≀ k 𝑬 b 𝚫 𝑑 𝑑 βˆ’ k 𝑝,𝑝 + k 𝑬𝑯 𝑑 𝑑 βˆ’ k 𝑝,𝑝 k 𝚫 𝑑 𝑑 βˆ’ k ≀ (cid:0) k 𝑬 β€²β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + k 𝑬 β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + (cid:1) πœ€ ≀ Β― 𝐸 𝑑 πœ€ , and k 𝑬𝑱 𝑑, 𝑑 βˆ’ k 𝑝,𝑝 ≀ k 𝑬 b 𝚫 𝑑 𝑑 βˆ’ k 𝑝,𝑝 k 𝚫 𝑑 𝑑 βˆ’ k ≀ ( k 𝑬 β€²β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + ) πœ€ ≀ Β― 𝐸 𝑑 πœ€ . (cid:3) The following two results are the appropriate analogues of Proposition A.2 and Theorem 3.1.

Proposition D.2.

Adopt the setting of Proposition D.1. If πœ€ ≀ / , then max 𝑬 ∈ E π‘Ÿ,β„“ k 𝑬𝑾 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ Β― 𝐾 max 𝑬 β€² ∈ E π‘Ÿ + ,β„“ k 𝑬 β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + Β― 𝐾 max 𝑬 β€²β€² ∈ E π‘Ÿ + ,β„“ + k 𝑬 β€²β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + Β― 𝐾 , (D.1) where Β― 𝐾 = ( + πœ€ ) (cid:18) + πœ‚πœ† π‘˜ + πœ‚πœ† π‘˜ + (cid:19) Β― 𝐾 = ( + πœ€ ) π‘πœ€ Proof.

As in the proof of Proposition A.2, we have for any 𝑬 ∈ E π‘Ÿ,β„“ , k 𝑬 k 𝑝,𝑝 ≀ ( + πœ€ ) ( k 𝑬𝑯 𝑑 𝑑 βˆ’ k 𝑝,𝑝 + 𝑝 Β― 𝐸 𝑑 πœ€ ) . As in the proof of Proposition D.1, we can write k 𝑬𝑯 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ (cid:18) + πœ‚πœ† π‘˜ + πœ‚πœ† π‘˜ + (cid:19) k 𝑬 β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 where 𝑬 β€² = + πœ‚πœ† π‘˜ + 𝑬𝑼 βˆ— ( I + πœ‚ 𝚺 ) 𝑼 ∈ E π‘Ÿ + ,β„“ . Since Β― 𝐸 𝑑 ≀ max 𝑬 β€²β€² ∈ E π‘Ÿ + ,β„“ + k 𝑬 β€²β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + , taking the maximum over all 𝑬 ∈ E π‘Ÿ,β„“ and 𝑬 β€² ∈ E π‘Ÿ + ,β„“ yields the claim. (cid:3) Theorem D.3.

Let 𝑑 ≀ 𝑇 be a positive integer, and assume the following requirements hold for some 𝑝 β‰₯ : πœ€ ≀ , (D.2a) πœ‚ k 𝑴 k ≀ , (D.2b) π‘πœ€ ≀ πœ‚πœŒ π‘˜ (D.2c) 𝛾 β‰₯ . (D.2d) Then for any π‘Ÿ, β„“ ∈ [ 𝑇 βˆ’ 𝑑 + ] and 𝑝 β‰₯ , max 𝑬 ∈ E π‘Ÿ,β„“ k 𝑬𝑾 𝑑 𝑑 k 𝑝,𝑝 ≀ max 𝑬 ∈ E π‘Ÿ,β„“ k 𝑬𝑾 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ ℓ𝛾 e βˆ’ π‘‘πœ‚πœŒ π‘˜ / + 𝐢 𝑝𝛾 πœ€ 𝑑 . where 𝐢 = .Proof. First, as in the proof of Theorem 3.1, Assumptions (D.2b) and (D.2c) imply Β― 𝐾 + Β― 𝐾 = ( + πœ€ ) ( (cid:18) + πœ‚πœ† π‘˜ + πœ‚πœ† π‘˜ + (cid:19) + π‘πœ€ ) ≀ e βˆ’ πœ‚πœŒ π‘˜ / . TREAMING π‘˜ -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 17 In particular, Β― 𝐾 + Β― 𝐾 ≀ . Assumption (D.2a) likewise implies that Β― 𝐾 ≀ .We now turn to the proof of the main claim, which we prove by induction on 𝑑 . For convenience,we introduce the notation 𝛾 e = 𝛾 /√ . When 𝑑 = and π‘Ÿ, β„“ ≀ 𝑇 , (D.1) impliesmax 𝑬 ∈ E π‘Ÿ,β„“ k 𝑬𝑾 k 𝑝,𝑝 ≀ max 𝑬 ∈ E π‘Ÿ,β„“ k 𝑬𝑾 k 𝑝,𝑝 ≀ Β― 𝐾 max 𝑬 β€² ∈ E π‘Ÿ + ,β„“ k 𝑬 β€² 𝑾 k 𝑝,𝑝 + Β― 𝐾 max 𝑬 β€²β€² ∈ E π‘Ÿ + ,β„“ + k 𝑬 β€²β€² 𝑾 k 𝑝,𝑝 πœ€ + Β― 𝐾 ≀ Β― 𝐾 ℓ𝛾 + Β― 𝐾 ( β„“ + ) 𝛾 + Β― 𝐾 ≀ ℓ𝛾 ( Β― 𝐾 + Β― 𝐾 ) + ( + 𝛾 ) Β― 𝐾 ≀ ℓ𝛾 e βˆ’ πœ‚πœŒ π‘˜ / + 𝛾 𝐾 where we have used the definition of G and where the last step uses (D.2d). Proceeding by induction,we have max 𝑬 ∈ E π‘Ÿ,β„“ k 𝑬𝑾 𝑑 𝑑 k 𝑝,𝑝 ≀ max 𝑬 ∈ E π‘Ÿ,β„“ k 𝑬𝑾 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ Β― 𝐾 max 𝑬 β€² ∈ E π‘Ÿ + ,β„“ k 𝑬 β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + Β― 𝐾 max 𝑬 β€²β€² ∈ E π‘Ÿ + ,β„“ + k 𝑬 β€²β€² 𝑾 𝑑 βˆ’ 𝑑 βˆ’ k 𝑝,𝑝 + Β― 𝐾 ≀ Β― 𝐾 ( ℓ𝛾 e βˆ’( 𝑑 βˆ’ ) πœ‚πœŒ π‘˜ / + ( 𝑑 βˆ’ ) 𝛾 Β― 𝐾 )+ Β― 𝐾 ( ( β„“ + ) 𝛾 e βˆ’( 𝑑 βˆ’ ) πœ‚πœŒ π‘˜ / + ( 𝑑 βˆ’ ) 𝛾 Β― 𝐾 ) + Β― 𝐾 ≀ ℓ𝛾 ( Β― 𝐾 + Β― 𝐾 ) e βˆ’( 𝑑 βˆ’ ) πœ‚πœŒ π‘˜ / + ( 𝑑 βˆ’ ) ( Β― 𝐾 + Β― 𝐾 ) 𝛾 Β― 𝐾 + ( + 𝛾 ) Β― 𝐾 = ℓ𝛾 e βˆ’ π‘‘πœ‚πœŒ π‘˜ / + 𝛾 𝐾 𝑑 , as claimed. (cid:3) Proposition D.4.

Fix 𝑠 ∈ ( , ) , ≀ 𝛾 ≀ 𝐢 𝛾 𝑑𝛿 , and 𝑝 β‰₯ , where 𝐢 𝛾 = 𝛾 is the constant inLemma H.4. Given 𝜌 > , define the normalized gap Β― 𝜌 = min (cid:26) π‘€πœŒ , k 𝑴 k 𝜌 , (cid:27) , and adopt the step size πœ‚ = 𝐢 πœ‚ l og ( e 𝑑 / 𝑠𝛿 ) πœŒπ‘‡ . If 𝜌 π‘˜ β‰₯ 𝜌 / and 𝑇 β‰₯ 𝑝 Β· 𝐢 𝑇 𝛾 l og ( e 𝑑 / 𝑠𝛿 ) 𝑠 Β― 𝜌 where 𝐢 πœ‚ β‰₯ + og 𝐢 𝛾 , 𝐢 𝑇 β‰₯ 𝐢 πœ‚ , then k 𝑾 𝑇 𝑇 βˆ’ k 𝑝,𝑝 ≀ 𝑠 (cid:16) + π‘˜ / 𝑝 (cid:17) and max 𝑬 ∈ E , k 𝑬𝑾 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ 𝛾 e for all ≀ 𝑑 ≀ 𝑇 . Proof.

We will apply Theorems 3.1 and D.3. First, note that (D.2d) holds by assumption. We now turnto the other conditions.Assumption (D.2a): Since 𝛾 β‰₯ , we have πœ€ = πœ‚π‘€ ( + 𝛾 ) ≀ 𝐢 πœ‚ 𝛾 l og ( e 𝑑 / 𝑠𝛿 ) π‘€πœŒπ‘‡ . The assumption therefore holds as long as 𝐢 𝑇 β‰₯ 𝐢 πœ‚ . (D.3)Assumption (D.2b): As above, we have πœ‚ k 𝑴 k ≀ 𝐢 πœ‚ l og ( e 𝑑 / 𝑠𝛿 ) k 𝑴 k πœŒπ‘‡ , and the requirement (D.3) implies that this quantity is also smaller than / .Assumption (D.2c): Since πœ‚πœŒ π‘˜ = 𝐢 πœ‚ l og ( e 𝑑 / 𝑠𝛾 ) 𝑇 β‰₯ 𝑇 and > , it suffices to prove the strongerclaim π‘πœ€ ≀ 𝑠 𝑇 . (D.4)This is satisfied so long as 𝑝 Β· 𝐢 πœ‚ 𝛾 l og ( e 𝑑 / 𝑠𝛿 ) 𝑀 𝜌 𝑇 ≀ 𝑠 𝑇 . which will hold if 𝐢 𝑇 β‰₯ 𝐢 πœ‚ . (D.5)This requirement is stronger than (D.3), so Assumptions (D.2a)–(D.2c) hold under the sole condi-tion (D.5).We now turn to the two claimed bounds. First, we instantiate Theorem 3.1 with the choice πœ‚ 𝑖 = πœ‚ for ≀ 𝑖 ≀ 𝑇 . The third assumption of (3.1) is trivially satisfied when when πœ‚ 𝑖 is constant, sincein that case πœ€ 𝑖 = πœ€ 𝑖 βˆ’ for all 𝑖 β‰₯ . The remaining assumptions correspond directly to Assump-tions (D.2a), (D.2b), and (D.2c). The assumptions of Theorem 3.1 are therefore satisfied, so we obtain, k 𝑾 𝑇 𝑇 βˆ’ k 𝑝,𝑝 ≀ e βˆ’ 𝑇 πœ‚πœŒ π‘˜ / k 𝑾 k 𝑝,𝑝 + π‘π‘˜ / 𝑝 πœ€ 𝑇 . The definition of G in (5.2) and the fact that 𝜌 π‘˜ β‰₯ 𝜌 / implies that the first term is at most e βˆ’ 𝑇 πœ‚πœŒ π‘˜ / 𝑑𝛾 = ( e 𝑑 / 𝑠𝛿 ) βˆ’ 𝐢 πœ‚ / 𝑑𝛾 , and this will be less than 𝑠 if 𝐢 πœ‚ β‰₯ + og ( 𝐢 𝛾 ) . Since (D.4) holds, the second term satisfies π‘π‘˜ / 𝑝 πœ€ 𝑇 ≀ 𝑠 π‘˜ / 𝑝 < 𝑠 π‘˜ / 𝑝 . We obtain k 𝑾 𝑇 𝑇 βˆ’ k 𝑝,𝑝 ≀ 𝑠 (cid:16) + π‘˜ / 𝑝 (cid:17) , as claimed.For the second claim, we rely on Theorem D.3. Assumptions (D.2a)–(D.2d) having already beenverified, we obtain for all ≀ 𝑑 ≀ 𝑇 ,max 𝑬 ∈ E , k 𝑬𝑾 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ 𝛾 e βˆ’ π‘‘πœ‚πœŒ π‘˜ / + 𝑝𝛾 πœ€ 𝑑 . 
TREAMING π‘˜ -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 19 Since 𝜌 π‘˜ β‰₯ , the first term is at most 𝛾 , and the second term is also at most 𝛾 by (D.4). We obtainthat max 𝑬 ∈ E , k 𝑬𝑾 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ 𝛾 e , as claimed. (cid:3) With Proposition D.4 in hand, we can prove a full version of Theorem 2.4.

Theorem D.5.

Fix a 𝜌 > and assume | supp ( P 𝑨 ) | = π‘š . Let Β― 𝜌 = max (cid:26) πœŒπ‘€ , 𝜌 k 𝑴 k , (cid:27) , and set 𝑠 = / .Adopt the step size πœ‚ = 𝐢 πœ‚ l og ( e 𝑑 / 𝛿𝑠 ) πœŒπ‘‡ where 𝑇 β‰₯ 𝐢 𝑇 π‘˜ ( log 𝑑 / 𝛿 Β― πœŒπ‘  ) 𝑠 𝛿 Β― 𝜌 . and 𝐢 πœ‚ β‰₯ + og 𝐢 𝛾 , 𝐢 𝑇 β‰₯ ( 𝐢 πœ‚ 𝐢 𝛾 ) / . If π‘š ≀ 𝑇 and 𝜌 π‘˜ β‰₯ 𝜌 / , then k 𝑾 𝑇 k ≀ / with probability at least βˆ’ 𝛿 / .Proof. We first show that we can assume that l og 𝑇 ≀ og ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  ) . Indeed, if 𝑇 > (cid:16) 𝐢 𝑇 𝑑𝛿 Β― πœŒπ‘  (cid:17) , acrude argument similar to the one employed in the analysis of Phase II yields the claim. We give thefull details in Appendix F. In what follows, we therefore assumelog 𝑇 ≀ og ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  ) . (D.6)Set 𝛾 = 𝐢 𝛾 min ( p π‘˜ l og ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  ) 𝛿 , 𝑑𝛿 ) , where 𝐢 𝛾 is as in Lemma H.4.Recall that our goal is to show k 𝑾 𝑇 k ≀ 𝑠 with probability at least βˆ’ 𝛿 / . The failure probabilitycan be bounded as P (cid:8) k 𝑾 𝑇 k β‰₯ 𝑠 (cid:9) ≀ P (cid:8) k 𝑾 𝑇 𝑇 k β‰₯ 𝑠 (cid:9) + P (cid:8) G 𝐢𝑇 (cid:9) ≀ inf 𝑝 β‰₯ 𝑠 βˆ’ 𝑝 k 𝑾 𝑇 𝑇 k 𝑝𝑝,𝑝 + P (cid:8) G 𝐢𝑇 (cid:9) . If we choose 𝑝 = log ( π‘˜ / 𝛿 ) , then since l og ( 𝐢 𝑇 ) ≀ 𝐢 / 𝑇 log ( ) for any value of 𝐢 𝑇 , we have 𝑇 β‰₯ 𝐢 𝑇 π‘˜ ( l og ( 𝑑 / 𝛿 Β― πœŒπ‘  )) 𝑠 𝛿 Β― 𝜌 β‰₯ l og ( π‘˜ / 𝛿 ) Β· 𝐢 / 𝑇 π‘˜ l og ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  ) 𝛿 Β· log ( e 𝑑 / 𝑠𝛿 ) 𝑠 Β― 𝜌 β‰₯ 𝑝 𝐢 πœ‚ 𝛾 l og ( e 𝑑 / 𝑠𝛿 ) 𝑠 Β― 𝜌 , as long as 𝐢 𝑇 β‰₯ ( 𝐢 πœ‚ ( 𝐢 𝛾 ) ) / , which verifies the assumption of Proposition D.4.We obtain k 𝑾 𝑇 𝑇 k 𝑝,𝑝 ≀ 𝑠 ( + π‘˜ / 𝑝 ) ≀ π‘˜ / 𝑝 𝑠 e . We therefore have 𝑠 βˆ’ 𝑝 k 𝑾 𝑇 𝑇 k 𝑝𝑝,𝑝 ≀ e βˆ’ l og ( π‘˜ / 𝛿 ) ≀ 𝛿 / . It remains to bound P n G 𝐢𝑇 o . Clearly P (cid:8) G 𝐢𝑇 (cid:9) ≀ P (cid:8) G 𝐢 (cid:9) + Γ• 𝑇 𝑗 = P n G 𝐢𝑗 ∩ G 𝑗 βˆ’ o . Since π‘š ≀ 𝑇 and we have assumed l og 𝑇 ≀ og ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  ) , we havelog ( e π‘šπ‘‡ / 𝛿 ) ≀ og ( 𝑇 ) + log ( e / 𝛿 ) ≀

20 l og ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  ) + log ( e / 𝛿 ) ≀

21 l og ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  ) , so Lemma H.4 guarantees that G holds with probability at least βˆ’ 𝛿 / .For the second term, we have P n G 𝐢𝑗 ∩ G 𝑗 βˆ’ o = P (cid:26) max 𝑬 ∈ E , k 𝑬𝑾 𝑗 𝑗 βˆ’ k β‰₯ 𝛾 (cid:27) ≀ Γ• 𝑬 ∈ E , P (cid:8) k 𝑬𝑾 𝑗 𝑗 βˆ’ k β‰₯ 𝛾 (cid:9) . Choose 𝑝 =

21 l og ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  ) . The same argument as above yields 𝑇 β‰₯ 𝑝 Β· 𝐢 / 𝑇 π‘˜ log ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  ) 𝛿 Β· log ( e 𝑑 / 𝑠𝛿 ) 𝑠 Β― 𝜌 , and this will be larger than the lower bound required on 𝑇 that was assumed in Proposition D.4 aslong as 𝐢 𝑇 β‰₯ ( 𝐢 πœ‚ ( 𝐢 𝛾 ) ) / Proposition D.4 therefore yields P (cid:8) k 𝑬𝑾 𝑗 𝑗 βˆ’ k β‰₯ 𝛾 (cid:9) ≀ 𝛾 βˆ’ 𝑝 k 𝑬𝑾 𝑗 𝑗 βˆ’ k 𝑝𝑝,𝑝 ≀ e βˆ’ 𝑝 = e βˆ’

21 l og ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  ) for all 𝑬 ∈ E , and thus P n G 𝐢𝑗 | G 𝑗 βˆ’ o ≀ Γ• 𝑬 ∈ E , P (cid:8) k 𝑬𝑾 𝑗 𝑗 βˆ’ k β‰₯ 𝛾 (cid:9) ≀ π‘š e βˆ’

21 l og ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  ) . This yields Γ• 𝑇 𝑗 = P n G 𝐢𝑗 | G 𝑗 βˆ’ o ≀ π‘šπ‘‡ e βˆ’

21 l og ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  ) ≀ e βˆ’

21 l og ( 𝐢 𝑇 𝑑 / 𝛿 Β― πœŒπ‘  )+ og 𝑇 ≀ 𝛿 / , where the last step uses (D.6). Finally, choosing 𝑠 = / , we obtain P (cid:8) k 𝑾 𝑇 k β‰₯ / (cid:9) ≀ 𝛿 / , as claimed. (cid:3) Appendix E. A reduction to finite support

Let Ξ© be the space of 𝑑 Γ— 𝑑 symmetric matrices. We argue that it suffices to assume that 𝑃 𝐴 hasfinite support of cardinality at most 𝑇 in Phase I. We prove this by comparing the product measure 𝑃 βŠ— 𝑇 𝐴 with another distribution 𝑃 π‘š on Ξ© βŠ— 𝑇 . We specify this distribution by the following procedure:drawing a 𝑇 -tuple ( 𝐴 , . . . , 𝐴 𝑇 ) from the distribution 𝑃 π‘š is accomplished by(1) Drawing π‘š independent samples Λ† 𝑨 , . . . , Λ† 𝑨 π‘š from 𝑃 𝐴 . TREAMING π‘˜ -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 21 (2) Drawing 𝑨 , . . . , 𝑨 𝑇 independently from the discrete distribution 𝑃 Λ† 𝐴 = π‘š Γ• π‘šπ‘– = 𝛿 Λ† 𝑨 𝑖 . That is, drawing 𝑨 , . . . , 𝑨 𝑇 independently and uniformly from the set { Λ† 𝑨 𝑖 } π‘šπ‘– = with replace-ment.We will rely on the fact that the two distributions, 𝑃 βŠ— 𝑇 𝐴 and 𝑃 π‘š , are close in total variation distancewhen π‘š is large. To see this, we first recognize that drawing ( 𝐴 , . . . , 𝐴 𝑇 ) from 𝑃 βŠ— 𝑇 𝐴 is equivalent tothe following:(1) Draw π‘š independent samples Λ† 𝑨 , . . . , Λ† 𝑨 π‘š from 𝑃 𝐴 .(2) Draw 𝑨 , . . . , 𝑨 𝑇 sequentially and uniformly from the set { Λ† 𝑨 𝑖 } π‘šπ‘– = without replacement. De-note by 𝑃 ( 𝑇 ) Λ† 𝐴 the distribution of this sampling.It is a standard result [11] that, given any { Λ† 𝐴 𝑖 } π‘šπ‘– = , 𝑑 TV (cid:16) 𝑃 βŠ— 𝑇 Λ† 𝐴 , 𝑃 ( 𝑇 ) Λ† 𝐴 (cid:17) ≀ 𝑇 π‘š . We thus have the following:

Proposition E.1.

For any 𝛿 ∈ ( , ) , it holds that 𝑑 TV (cid:16) 𝑃 π‘š , 𝑃 βŠ— 𝑇 𝐴 (cid:17) ≀ 𝛿 for all π‘š β‰₯ 𝑇 / 𝛿 .Proof. For any set 𝑆 βŠ‚ Ξ© βŠ— 𝑇 , we have (cid:12)(cid:12)(cid:12) 𝑃 π‘š ( 𝑆 ) βˆ’ 𝑃 βŠ— 𝑇 𝐴 ( 𝑆 ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) E Λ† 𝐴 𝑖 ∼ 𝑃 𝐴 , ≀ 𝑖 ≀ π‘š h 𝑃 βŠ— 𝑇 Λ† 𝐴 ( 𝑆 ) βˆ’ 𝑃 ( 𝑇 ) Λ† 𝐴 ( 𝑆 ) i (cid:12)(cid:12)(cid:12) ≀ E Λ† 𝐴 𝑖 ∼ 𝑃 𝐴 , ≀ 𝑖 ≀ π‘š (cid:12)(cid:12)(cid:12) 𝑃 βŠ— 𝑇 Λ† 𝐴 ( 𝑆 ) βˆ’ 𝑃 ( 𝑇 ) Λ† 𝐴 ( 𝑆 ) (cid:12)(cid:12)(cid:12) ≀ E Λ† 𝐴 𝑖 ∼ 𝑃 𝐴 , ≀ 𝑖 ≀ π‘š 𝑑 TV (cid:16) 𝑃 βŠ— 𝑇 Λ† 𝐴 , 𝑃 ( 𝑇 ) Λ† 𝐴 (cid:17) ≀ 𝑇 π‘š ≀ 𝛿. The claim follows from taking the maximum of | 𝑃 π‘š ( 𝑆 ) βˆ’ 𝑃 βŠ— 𝑇 𝐴 ( 𝑆 ) | over all subsets of Ξ© βŠ— 𝑇 . (cid:3) Given any Λ† 𝑨 , . . . , Λ† 𝑨 π‘š , define the empirical average Λ† 𝑴 π‘š : = E 𝐴 ∼ 𝑃 Λ† 𝐴 𝑨 = π‘š Γ• π‘šπ‘– = Λ† 𝑨 𝑖 . Denote by Λ† πœ† β‰₯ Λ† πœ† β‰₯ Β· Β· Β· β‰₯ Λ† πœ† 𝑑 the eigenvalues of Λ† 𝑴 π‘š , and write Λ† 𝜌 π‘˜ = Λ† πœ† π‘˜ βˆ’ Λ† πœ† π‘˜ + . Let Λ† 𝑉 ∈ ℝ 𝑑 Γ— π‘˜ be theorthogonal matrix whose columns are the leading π‘˜ eigenvectors of Λ† 𝑴 π‘š , and let Λ† 𝑼 ∈ ℝ 𝑑 Γ—( 𝑑 βˆ’ π‘˜ ) be theorthogonal matrix consisting of the remaining eigenvectors. Standard results of matrix concentrationimplies that Λ† 𝑴 π‘š is close to 𝑴 . In particular, we have the following: Proposition E.2.

Suppose that π‘š β‰₯ 𝑀 𝜌 π‘˜ l og ( 𝑑 / 𝛿 ) . Let Λ† 𝑨 , . . . , Λ† 𝑨 π‘š be drawn independently from 𝑃 𝐴 .Then it holds with probability at least βˆ’ 𝛿 that k Λ† 𝑴 π‘š βˆ’ 𝑴 k ≀ 𝜌 π‘˜ / , and, in particular, Λ† 𝜌 π‘˜ β‰₯ 𝜌 π‘˜ / and k 𝑼 βˆ— Λ† 𝑽 k ≀ / . Proof.

By assumption 2, we have that k Λ† 𝑴 π‘š βˆ’ 𝑴 k ≀ 𝑀 almost surely. Then the matrix Bernsteininequality [31, Theorem 1.4] implies that, for any 𝑑 β‰₯ , P (cid:8) k Λ† 𝑴 π‘š βˆ’ 𝑴 k β‰₯ 𝑑 (cid:9) ≀ 𝑑 exp (cid:18) βˆ’ π‘šπ‘‘ / 𝑀 + 𝑀𝑑 / (cid:19) . Substituting 𝑑 = 𝜌 π‘˜ / yields the first claim. Using the perturbation theory of eigenvalues of symmetricmatrices, we have Λ† πœ† π‘˜ β‰₯ πœ† π‘˜ βˆ’ k Λ† 𝑴 π‘š βˆ’ 𝑴 k and Λ† πœ† π‘˜ + ≀ πœ† π‘˜ + βˆ’ k Λ† 𝑴 π‘š βˆ’ 𝑴 k . Therefore, conditioned on the first claim, it holds that Λ† 𝜌 π‘˜ β‰₯ 𝜌 π‘˜ βˆ’ k Λ† 𝑴 π‘š βˆ’ 𝑴 k β‰₯ 𝜌 π‘˜ . Furthermore, it follows from Wedin’s inequality [33] that k 𝑼 βˆ— Λ† 𝑽 k ≀ k Λ† 𝑴 π‘š βˆ’ 𝑴 k Λ† πœ† π‘˜ βˆ’ πœ† π‘˜ + ≀ . (cid:3) Proposition E.3.

Let 𝑼 and 𝑽 be orthogonal matrices such that 𝑼𝑼 βˆ— + 𝑽𝑽 βˆ— = I , and let Λ† 𝑼 and Λ† 𝑽 be matricesof the same size satisfying the same requirement. Suppose k 𝑼 βˆ— Λ† 𝑽 k ≀ / and k Λ† 𝑼 βˆ— 𝑺 ( Λ† 𝑽 βˆ— 𝑺 ) βˆ’ k ≀ 𝛾 ≀ .Then k 𝑼 βˆ— 𝑺 ( 𝑽 βˆ— 𝑺 ) βˆ’ k ≀ + 𝛾 βˆ’ 𝛾 . Proof.

A direct calculation yields k 𝑼 βˆ— 𝑺 ( 𝑽 βˆ— 𝑺 ) βˆ’ k = k 𝑼 βˆ— ( Λ† 𝑼 Λ† 𝑼 βˆ— + Λ† 𝑽 Λ† 𝑽 βˆ— ) 𝑺 ( 𝑽 βˆ— 𝑺 ) βˆ’ k≀ k Λ† 𝑼 βˆ— 𝑺 ( 𝑽 βˆ— 𝑺 ) βˆ’ k + k 𝑼 βˆ— Λ† 𝑽 Λ† 𝑽 βˆ— 𝑺 ( 𝑽 βˆ— 𝑺 ) βˆ’ k≀ k Λ† 𝑼 βˆ— 𝑺 ( Λ† 𝑽 βˆ— 𝑺 ) βˆ’ Λ† 𝑽 βˆ— 𝑺 ( 𝑽 βˆ— 𝑺 ) βˆ’ k + k Λ† 𝑽 βˆ— 𝑺 ( 𝑽 βˆ— 𝑺 ) βˆ’ k≀ ( 𝛾 + ) k Λ† 𝑽 βˆ— 𝑺 ( 𝑽 βˆ— 𝑺 ) βˆ’ k . We also have k Λ† 𝑽 βˆ— 𝑺 ( 𝑽 βˆ— 𝑺 ) βˆ’ k ≀ k Λ† 𝑽 βˆ— 𝑼𝑼 βˆ— 𝑺 ( 𝑽 βˆ— 𝑺 ) βˆ’ k + k Λ† 𝑽 βˆ— 𝑽𝑽 βˆ— 𝑺 ( 𝑽 βˆ— 𝑺 ) βˆ’ k ≀ k 𝑼 βˆ— 𝑺 ( 𝑽 βˆ— 𝑺 ) βˆ’ k + . Sequencing the two displays above and rearrange the inequality yields the claim. (cid:3)

Now let 𝑇 be given as in Theorem D.5 and choose π‘š = 𝑇 / 𝛿 . As long as 𝑇 β‰₯ π‘€πœŒ π‘˜ 𝛿 l og ( 𝑑 / 𝛿 ) , wehave 𝑀 𝜌 π‘˜ l og ( 𝑑 / 𝛿 ) ≀ π‘š ≀ 𝑇 . It then follows from Proposition E.2 that, when drawing Λ† 𝑨 , . . . , Λ† 𝑨 π‘š independently from 𝑃 𝐴 , the event G : = { Λ† 𝜌 π‘˜ β‰₯ 𝜌 π‘˜ / and k 𝑼 βˆ— Λ† 𝑽 k ≀ / } (E.1)happens with probability at least βˆ’ 𝛿 . Conditioned on G , we consider running 𝑇 steps of Oja’salgorithm, with 𝐴 , . . . , 𝐴 𝑇 drawn i.i.d from 𝑃 Λ† 𝐴 . Note that the discrete distribution 𝑃 Λ† 𝐴 also satisfiesAssumption 1 and Assumption 2 (with 𝑀 replaced by 𝑀 ). Our main theorem thus guarantees thatwith appropriately chosen step size, the output 𝑄 𝑇 = 𝑄 𝑇 ( 𝐴 , . . . , 𝐴 𝑇 ) of this algorithm after 𝑇 stepssatisfies k Λ† 𝑼 βˆ— 𝑸 𝑇 ( Λ† 𝑽 βˆ— 𝑸 𝑇 ) βˆ’ k ≀ TREAMING π‘˜ -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 23 with probability βˆ’ 𝛿 . Combining (E.1) and Proposition E.3, we obtain that with probability at least ( βˆ’ 𝛿 ) β‰₯ βˆ’ 𝛿 , the output of the algorithm satisfies k 𝑼 βˆ— 𝑸 𝑇 ( 𝑽 βˆ— 𝑸 𝑇 ) βˆ’ k ≀ , that is, 𝑃 π‘š (cid:0) k 𝑼 βˆ— 𝑸 𝑇 ( 𝑽 βˆ— 𝑸 𝑇 ) βˆ’ k ≀ (cid:1) β‰₯ βˆ’ 𝛿. Finally, we obtain from Proposition E.1 that 𝑃 π‘š (cid:0) k 𝑼 βˆ— 𝑸 𝑇 ( 𝑽 βˆ— 𝑸 𝑇 ) βˆ’ k ≀ (cid:1) β‰₯ 𝑃 βŠ— 𝑇 𝐴 (cid:0) k 𝑼 βˆ— 𝑸 𝑇 ( 𝑽 βˆ— 𝑸 𝑇 ) βˆ’ k ≀ (cid:1) βˆ’ 𝑑 TV (cid:16) 𝑃 π‘š , 𝑃 βŠ— 𝑇 𝐴 (cid:17) β‰₯ βˆ’ 𝛿. In other words, with the same choice of 𝑇 , the output of 𝑇 steps of Oja’s algorithm with 𝐴 , . . . , 𝐴 𝑇 drawn i.i.d from the original distribution 𝑃 𝐴 satisfies k 𝑼 βˆ— 𝑸 𝑇 ( 𝑽 βˆ— 𝑸 𝑇 ) βˆ’ k ≀ with probability at least βˆ’ 𝛿 . Appendix F. Phase I succeeds if 𝑇 is large In this section, we prove Theorem D.5 when 𝑇 > 𝐢 𝑇 𝑑 𝛿 Β― 𝜌 𝑠 . Note that this value of 𝑇 is far largerthan the optimal choice (which is of order ˜ Θ ( π‘˜ / 𝛿 Β― 𝜌 𝑠 ) ), which makes the theorem much easier toprove. Indeed, if 𝑇 is this large, we can prove Theorem D.5 directly by using the same conditioningargument as in Phase II. 
Proposition F.1.

Assume πœ‚ and 𝑇 satisfy the requirements of Theorem D.5, and assume 𝜌 β‰₯ 𝜌 π‘˜ / .If 𝑇 β‰₯ 𝐢 𝑇 𝑑 𝛿 Β― 𝜌 𝑠 , then k 𝑾 𝑇 k ≀ 𝑠 with probability at least βˆ’ 𝛿 / .Proof. Set 𝛾 = 𝐢 𝛾 𝑑𝛿 where 𝐢 𝛾 is defined in Lemma H.4 and define the good events G : = {k 𝑾 k ≀ 𝛾 /(√ )} G 𝑖 : = {k 𝑾 k ≀ 𝛾 } ∩ G 𝑖 βˆ’ , βˆ€ 𝑖 β‰₯ . In order to apply Theorem 3.1, we verify (3.1)First assumption. We have πœ€ = πœ‚π‘€ ( + 𝛾 ) ≀ 𝐢 πœ‚ l og ( e 𝑑 / 𝛿𝑠 ) π‘€π›ΎπœŒπ‘‡ , and this quantity is smaller than / so long as 𝐢 𝑇 β‰₯ 𝐢 πœ‚ 𝐢 𝛾 . (F.1)Second assumption. We again have πœ‚ k 𝑴 k = 𝐢 πœ‚ l og ( e 𝑑 / 𝛿𝑠 ) k 𝑴 k πœŒπ‘‡ , and (F.1) guarantees that this quantity is smaller than / as well.Third assumption. Since πœ€ 𝑖 = πœ€ for all 𝑖 and πœ‚πœŒ β‰₯ , this requirement trivially holds. Our goal is to bound P (cid:8) k 𝑾 𝑇 k β‰₯ 𝑠 (cid:9) ≀ P (cid:8) k 𝑾 𝑇 𝑇 k β‰₯ 𝑠 (cid:9) + P (cid:8) G 𝐢 (cid:9) + Γ• 𝑇 𝑗 = P n G 𝐢𝑗 ∩ G 𝑗 βˆ’ o . Having verified (3.1), we can employ (3.2), obtaining k 𝑾 𝑇 𝑇 k 𝑝,𝑝 ≀ e βˆ’ 𝑇 πœ‚πœŒ π‘˜ π‘˜ / 𝑝 𝛾 / + ( 𝐢 𝛾 + 𝐢 ) π‘π‘˜ / 𝑝 πœ€ 𝑇 . For the first term, the fact that 𝜌 π‘˜ β‰₯ 𝜌 / implies that e βˆ’ 𝑇 πœ‚πœŒ π‘˜ 𝛾 = ( 𝛿𝑠 / e 𝑑 ) 𝐢 πœ‚ / 𝛾 , and this is smaller than 𝑠 as long as 𝐢 πœ‚ β‰₯ + og ( 𝐢 𝛾 ) . Letting 𝐢 be as in Proposition 4.1 and choosing 𝑝 = l og ( π‘˜ / 𝑑𝛿 ) , we also have 𝑝 ( 𝐢 𝛾 + 𝐢 ) πœ€ 𝑇 ≀ 𝑝 𝐢 𝐢 πœ‚ l og ( e 𝑑 / 𝛿𝑠 ) 𝑀 𝛾 𝜌 𝑇 ≀ 𝐢 𝐢 πœ‚ 𝐢 𝛾 l og ( 𝑑 / 𝛿𝑠 ) 𝐢 𝑇 Β· 𝛿𝑠𝑑 Since l og ( 𝑑 / 𝛿𝑠 ) ≀ 𝑑𝛿𝑠 for all positive 𝑑 , 𝛿 , and 𝑠 , this quantity will be less than 𝑠 so long as 𝐢 𝑇 β‰₯ ( 𝐢 𝐢 πœ‚ 𝐢 𝛾 ) , (F.2)and this requirement subsumes (F.1).We therefore obtain, for 𝑝 = l og ( π‘˜ / 𝛿 ) , P (cid:8) k 𝑾 𝑇 k β‰₯ 𝑠 (cid:9) ≀ 𝑠 βˆ’ 𝑝 k 𝑾 𝑇 k 𝑝𝑝,𝑝 ≀ π‘˜ e βˆ’ 𝑝 ≀ 𝛿 / , In a similar way, (3.2) yields for all 𝑑 ∈ [ 𝑇 ] , 𝛾 βˆ’ k 𝑾 𝑑 𝑑 βˆ’ k 𝑝,𝑝 ≀ π‘˜ / 𝑝 + ( 𝐢 𝛾 + 𝐢 ) π‘π‘˜ / 𝑝 πœ€ 𝑇 . 
If we choose 𝑝 = l og ( π‘˜π‘‡ / 𝛿 ) , then we have 𝑝 ( 𝐢 𝛾 + 𝐢 ) πœ€ 𝑇 ≀ 𝑝 𝐢 𝐢 πœ‚ l og ( e 𝑑 / 𝛿𝑠 ) 𝑀 𝛾 𝜌 𝑇 ≀ 𝐢 𝐢 πœ‚ 𝐢 𝛾 l og ( 𝑇 ) 𝐢 𝑇 𝑇 / , and since log ( 𝑇 ) ≀ 𝑇 / for all 𝑇 , we have that this quantity will be at most if 𝐢 𝑇 β‰₯ ( 𝐢 𝐢 πœ‚ 𝐢 𝛾 ) / , and this requirement subsumes (F.2), and it holds under the assumptions of Theorem D.5.By Lemma H.4, the event G holds with probability at least βˆ’ 𝛿 / .Finally, we have for any 𝑗 ∈ [ 𝑇 ] , P n G 𝐢𝑗 ∩ G 𝑗 βˆ’ o ≀ P (cid:8) k 𝑾 𝑗 𝑗 βˆ’ k β‰₯ 𝛾 (cid:9) ≀ inf 𝑝 β‰₯ 𝛾 βˆ’ 𝑝 k 𝑾 𝑑 𝑑 βˆ’ k 𝑝𝑝,𝑝 , and choosing 𝑝 = log ( π‘˜π‘‡ / 𝛿 ) we have 𝛾 βˆ’ 𝑝 k 𝑾 𝑑 𝑑 βˆ’ k 𝑝𝑝,𝑝 ≀ π‘˜ e βˆ’ 𝑝 ≀ 𝛿𝑇 , and summing these probabilities for 𝑗 ∈ [ 𝑇 ] , yields that P (cid:8) k 𝑾 𝑇 k β‰₯ 𝑠 (cid:9) ≀ P (cid:8) k 𝑾 𝑇 𝑇 k β‰₯ 𝑠 (cid:9) + P (cid:8) G 𝐢 (cid:9) + Γ• 𝑇 𝑗 = P n G 𝐢𝑗 ∩ G 𝑗 βˆ’ o ≀ + + = , as claimed. (cid:3) TREAMING π‘˜ -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 25 Appendix G. Omitted proofs

G.1.

Proof of Lemma 2.7.

We will show that 𝑾 𝑑 ( I βˆ’ 𝚫 𝑑 ) = 𝑯 𝑑 + 𝑱 𝑑, + 𝑱 𝑑, , where 𝑯 𝑑 = 𝑼 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ ( 𝑽 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ ) βˆ’ , 𝑱 𝑑, = b 𝚫 𝑑 βˆ’ 𝑯 𝑑 𝚫 𝑑 , and 𝑱 𝑑, = βˆ’ b 𝚫 𝑑 𝚫 𝑑 and where we write b 𝚫 𝑑 = πœ‚ 𝑑 𝑼 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝒁 𝑑 βˆ’ ( 𝑽 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ ) βˆ’ . By the definition of 𝒁 𝑑 , we have 𝑾 𝑑 = 𝑼 βˆ— 𝒁 𝑑 ( 𝑽 βˆ— 𝒁 𝑑 ) βˆ’ = 𝑼 βˆ— 𝒀 𝑑 𝒁 𝑑 βˆ’ ( 𝑽 βˆ— 𝒀 𝑑 𝒁 𝑑 βˆ’ ) βˆ’ . We have 𝑽 βˆ— 𝒀 𝑑 𝒁 𝑑 βˆ’ = 𝑽 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ + πœ‚ 𝑑 𝑽 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝒁 𝑑 βˆ’ = (cid:0) I + πœ‚ 𝑑 𝑽 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝒁 𝑑 βˆ’ ( 𝑽 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ ) βˆ’ (cid:1) 𝑽 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ = ( I + 𝚫 𝑑 ) 𝑽 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ , which implies ( 𝑽 βˆ— 𝒀 𝑑 𝒁 𝑑 βˆ’ ) βˆ’ ( I βˆ’ 𝚫 𝑑 ) = ( 𝑽 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ ) βˆ’ ( I + 𝚫 𝑑 ) βˆ’ ( I + 𝚫 𝑑 ) ( I βˆ’ 𝚫 𝑑 ) = ( 𝑽 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ ) βˆ’ ( I βˆ’ 𝚫 𝑑 ) . We also have 𝑼 βˆ— 𝒀 𝑑 𝒁 𝑑 βˆ’ = 𝑼 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ + πœ‚ 𝑑 𝑼 βˆ— ( 𝑨 𝑑 βˆ’ 𝑴 ) 𝒁 𝑑 βˆ’ = 𝑼 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ + b 𝚫 𝑑 ( 𝑽 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ ) . Therefore 𝑾 𝑑 ( I βˆ’ 𝚫 𝑑 ) = 𝑼 βˆ— 𝒀 𝑑 𝒁 𝑑 βˆ’ ( 𝑽 βˆ— 𝒀 𝑑 𝒁 𝑑 βˆ’ ) βˆ’ = 𝑼 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ ( 𝑽 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ ) βˆ’ + b 𝚫 𝑑 βˆ’ 𝑼 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ ( 𝑽 βˆ— ( I + πœ‚ 𝑑 𝑴 ) 𝒁 𝑑 βˆ’ ) βˆ’ 𝚫 𝑑 βˆ’ b 𝚫 𝑑 𝚫 𝑑 . That is 𝑾 𝑑 ( I βˆ’ b 𝚫 𝑑 ) = 𝑯 𝑑 + 𝑱 𝑑, + 𝑱 𝑑, . Since 𝚫 𝑑 and b 𝚫 𝑑 are both 𝑂 ( πœ‚ 𝑑 ) , the claim follows. (cid:3) G.2.

Proof of Proposition 2.9.

By the triangle inequality, we have k 𝑿 + 𝒀 + 𝒁 k 𝑝,𝑝 ≀ k 𝑿 + 𝒀 k 𝑝,𝑝 + k 𝒁 k 𝑝,𝑝 , which implies k 𝑿 + 𝒀 + 𝒁 k 𝑝,𝑝 ≀ ( k 𝑿 + 𝒀 k 𝑝,𝑝 + k 𝒁 k 𝑝,𝑝 ) ≀ ( + πœ† ) ( k 𝑿 + 𝒀 k 𝑝,𝑝 + πœ† βˆ’ k 𝒁 k 𝑝,𝑝 ) , where in the second step we have applied the elementary inequality ( π‘Ž + 𝑏 ) ≀ ( + πœ† ) ( π‘Ž + πœ† βˆ’ 𝑏 ) , valid for all real numbers π‘Ž and 𝑏 and πœ† > . Applying Proposition 2.8 to k 𝑿 + 𝒀 k 𝑝,𝑝 then yields theclaim. (cid:3) Appendix H. Additional Lemmas

Lemma H.1.

For any deterministic matrices 𝑨 , 𝑩 and any standard Gaussian matrix 𝒁 of suitable sizes,it holds that P {k 𝑨𝒁𝑩 k β‰₯ k 𝑨 k k 𝑩 k ( + 𝑑 )} ≀ e βˆ’ 𝑑 / . Proof.

Let 𝑓 ( 𝑿 ) : = k 𝑨𝑿 𝑩 k , then | 𝑓 ( 𝑿 ) βˆ’ 𝑓 ( 𝑿 ) | ≀ k 𝑨 k k 𝑩 k Β· k 𝑿 βˆ’ 𝑿 k . By Gaussian concentration, we have P { 𝑓 ( 𝒁 ) β‰₯ E 𝑓 ( 𝒁 ) + k 𝑨 k k 𝑩 k 𝑑 } ≀ e βˆ’ 𝑑 / . Moreover, we have E 𝑓 ( 𝒁 ) ≀ ( E k 𝑨𝒁𝑩 k ) / = k 𝑨 k k 𝑩 k . It thus follows that P { 𝑓 ( 𝒁 ) β‰₯ k 𝑨 k k 𝑩 k ( + 𝑑 )} ≀ P { 𝑓 ( 𝒁 ) β‰₯ E 𝑓 ( 𝒁 ) + k 𝑨 k k 𝑩 k 𝑑 } ≀ e βˆ’ 𝑑 / , which is the stated result. (cid:3) Lemma H.2 ([6, Theorem II.13]) . Let 𝑸 ∈ ℝ 𝑑 Γ— π‘˜ be a standard Gaussian matrix. Then P n k 𝑸 k β‰₯ √ 𝑑 + √ π‘˜ + 𝑑 o ≀ Β· e βˆ’ 𝑑 / . Lemma H.3 ([1, Lemma i.A.3]) . Let 𝑸 ∈ ℝ π‘˜ Γ— π‘˜ be a standard Gaussian matrix. Then for every 𝛿 ∈ ( , ) , P ( k 𝑸 βˆ’ k β‰₯ √ π‘˜π›Ώ ) ≀ 𝛿. The next lemma bounds the probability of G from below. Lemma H.4.

Let G be the event defined in (5.2) . There exists a positive constant 𝐢 𝛾 = such thatfor any 𝛿 ∈ ( , ) , if 𝛾 β‰₯ 𝐢 𝛾 min { p π‘˜ log ( e π‘šπ‘‡ / 𝛿 )/ 𝛿, 𝑑 / 𝛿 } , then G holds with probability at least βˆ’ 𝛿 .Proof. We have 𝑾 = 𝑼 βˆ— 𝒁 ( 𝑽 βˆ— 𝒁 ) βˆ’ , where 𝒁 is a matrix with i.i.d. Gaussian entries. Since 𝑼 and 𝑽 have orthonormal columns and are themselves orthogonal, the two matrices 𝑽 βˆ— 𝒁 and 𝑼 βˆ— 𝒁 areindependent matrices with i.i.d. Gaussian entries. Using Lemma H.1 and conditioning on 𝑽 βˆ— 𝒁 , wehave that with probability at least βˆ’ 𝛿 / ( 𝑇 + ) ,max 𝑬 ∈ E π‘Ÿ,β„“ k 𝑬𝑼 βˆ— 𝒁 ( 𝑽 βˆ— 𝒁 ) βˆ’ k ≀ k ( 𝑽 βˆ— 𝒁 ) βˆ’ k Β· p β„“ l og ( e π‘šπ‘‡ / 𝛿 ) , (H.1)where we have taken a union bound over the fewer than ( ( π‘š + ) ( 𝑇 + )) β„“ elements of E π‘Ÿ,β„“ . Takinga uniform bound again over all π‘Ÿ, β„“ ∈ [ 𝑇 + ] yields that, with probability at least βˆ’ 𝛿 / , the event(H.1) holds for all π‘Ÿ, β„“ ∈ [ 𝑇 + ] . By Lemma H.3, we also have that that k ( 𝑽 βˆ— 𝒁 ) βˆ’ k ≀ √ π‘˜ / 𝛿 withprobability at least βˆ’ 𝛿 / . Furthermore, Lemma H.2 implies that k 𝑼 βˆ— 𝒁 k ≀ p 𝑑 l og ( / 𝛿 ) withprobability at least βˆ’ 𝛿 / . Combining these bounds, we obtain that with probability at least βˆ’ 𝛿 / ,max 𝑬 ∈ E π‘Ÿ,β„“ k 𝑬𝑼 βˆ— 𝒁 ( 𝑽 βˆ— 𝒁 ) βˆ’ k ≀ p β„“ l og ( e π‘šπ‘‡ / 𝛿 ) , which is less than √ ℓ𝛾 √ as long as 𝐢 𝛾 β‰₯ , and under this same assumption k 𝑾 k ≀ k 𝑼 βˆ— 𝒁 k k ( 𝑽 βˆ— 𝒁 ) βˆ’ k ≀ p 𝑑 l og ( / 𝛿 ) ≀ √ 𝑑𝛾 as well. TREAMING π‘˜ -PCA: EFFICIENT GUARANTEES FOR OJA’S ALGORITHM, BEYOND RANK-ONE UPDATES 27 So G holds with probability at least βˆ’ 𝛿 if 𝛾 β‰₯ 𝐢 𝛾 p π‘˜ l og ( e π‘šπ‘‡ / 𝛿 )/ 𝛿 for 𝐢 𝛾 β‰₯ .On the other hand, We have E k 𝑼 βˆ— 𝒁 k ≀ √ 𝑑 , so that k 𝑼 βˆ— 𝒁 k ≀ √ 𝑑 / 𝛿 with probability at least βˆ’ 𝛿 / , and Lemma H.3 implies that k 𝑽 βˆ— 𝒁 k ≀ √ π‘˜ / 𝛿 with probability at least βˆ’ 𝛿 / , so withprobability at least βˆ’ 𝛿 we have k 𝑾 k ≀ k 𝑼 βˆ— 𝒁 k k ( 𝑽 βˆ— 𝒁 ) βˆ’ k ≀ √ π‘‘π‘˜ / 𝛿 < 𝑑 / 𝛿 . as claimed. 
On this event, we also have ‖𝑬𝑾‖ ≀ ‖𝑾‖ ≀ 8d/δ². Therefore, if Ξ³ β‰₯ 8√2 d/δ², then G holds. So G holds with probability at least 1 βˆ’ Ξ΄ if Ξ³ β‰₯ C_Ξ³ d/δ² for C_Ξ³ β‰₯ 8√2. Therefore, taking C_Ξ³ large enough to satisfy both requirements proves the claim. β–‘

References

[1] Z. Allen-Zhu and Y. Li. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. In 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2017), pages 487–492. IEEE Computer Soc., Los Alamitos, CA, 2017.
[2] M. Balcan, S. S. Du, Y. Wang, and A. W. Yu. An improved gap-dependency analysis of the noisy power method. In Feldman et al. [10], pages 284–309.
[3] M. Balcan and K. Q. Weinberger, editors. Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings. JMLR.org, 2016.
[4] A. Balsubramani, S. Dasgupta, and Y. Freund. The fast convergence of incremental PCA. In Burges et al. [5], pages 3174–3182.
[5] C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, editors. Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013.
[6] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. In Handbook of the geometry of Banach spaces, Vol. I, pages 317–366. North-Holland, Amsterdam, 2001.
[7] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7:1–46, 1970.
[8] X. V. Doan and S. Vavasis. Finding the largest low-rank clusters with Ky Fan 2-k-norm and β„“1-norm. SIAM J. Optim., 26(1):274–312, 2016.
[9] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20(2):303–353, 1999.
[10] V. Feldman, A. Rakhlin, and O. Shamir, editors. Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, volume 49 of JMLR Workshop and Conference Proceedings. JMLR.org, 2016.
[11] D. Freedman. A remark on the difference between sampling with and without replacement. J. Amer. Statist. Assoc., 72(359):681, 1977.
[12] G. H. Golub and C. F. Van Loan. Matrix computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition, 1996.
[13] M. Hardt and E. Price. The noisy power method: A meta algorithm with applications. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2861–2869, 2014.
[14] A. Henriksen and R. Ward. AdaOja: adaptive learning rates for streaming PCA. arXiv:1905.12115, May 2019.
[15] A. Henriksen and R. Ward. Concentration inequalities for random matrix products. Linear Algebra Appl., 594:81–94, 2020.
[16] D. Huang, J. Niles-Weed, J. A. Tropp, and R. Ward. Matrix concentration for products. arXiv:2003.05437, March 2020.
[17] P. Jain, C. Jin, S. M. Kakade, P. Netrapalli, and A. Sidford. Streaming PCA: matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm. In Feldman et al. [10], pages 1147–1164.
[18] I. T. Jolliffe. Principal component analysis. Springer Series in Statistics. Springer-Verlag, New York, second edition, 2002.
[19] A. Juditsky and A. S. Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. arXiv:0809.0813, September 2008.
[20] C. Li, H. Lin, and C. Lu. Rivalry of two families of algorithms for memory-restricted streaming PCA. In A. Gretton and C. C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, volume 51 of JMLR Workshop and Conference Proceedings, pages 473–481. JMLR.org, 2016.
[21] C. J. Li, M. Wang, H. Liu, and T. Zhang. Near-optimal stochastic approximation for online principal component estimation. Math. Program., 167(1, Ser. B):75–97, 2018.
[22] C.-K. Li and N.-K. Tsing. Some isometries of rectangular complex matrices. Linear and Multilinear Algebra, 23(1):47–53, 1988.
[23] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. In Burges et al. [5], pages 2886–2894.
[24] A. Naor. On the Banach-space-valued Azuma inequality and small-set isoperimetry of Alon–Roichman graphs. Combinatorics, Probability and Computing, 21(4):623–634, 2012.
[25] E. Oja. A simplified neuron model as a principal component analyzer. J. Math. Biol., 15(3):267–273, 1982.
[26] E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1):69–84, 1985.
[27] C. De Sa, C. RΓ©, and K. Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. In F. R. Bach and D. M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2332–2341. JMLR.org, 2015.
[28] O. Shamir. Convergence of stochastic gradient descent for PCA. In Balcan and Weinberger [3], pages 257–265.
[29] O. Shamir. Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In Balcan and Weinberger [3], pages 248–256.
[30] M. Simchowitz, A. El Alaoui, and B. Recht. Tight query complexity lower bounds for PCA via finite sample deformed Wigner law. In STOC'18: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1249–1259. ACM, New York, 2018.
[31] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389–434, 2012.
[32] J. A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.
[33] P.-A. Wedin. Perturbation bounds in connection with singular value decomposition. BIT, 12:99–111, 1972.

Submitted on 6 Feb 2021.