Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates
DE HUANG, JONATHAN NILES-WEED, AND RACHEL WARD

Abstract. We analyze Oja's algorithm for streaming $k$-PCA, and prove that it achieves performance nearly matching that of an optimal offline algorithm. Given access to a sequence of i.i.d. $d \times d$ symmetric matrices, we show that Oja's algorithm can obtain an accurate approximation to the subspace of the top $k$ eigenvectors of their expectation using a number of samples that scales polylogarithmically with $d$. Previously, such a result was only known in the case where the updates have rank one.

Our analysis is based on recently developed matrix concentration tools, which allow us to prove strong bounds on the tails of the random matrices which arise in the course of the algorithm's execution.

1. Introduction
Principal component analysis is one of the foundational algorithms of statistics and machine learning. From a practical perspective, perhaps no optimization problem is more widely used in data analysis [18]. From a theoretical perspective, it is one of the simplest examples of a non-convex optimization problem that can nevertheless be solved in polynomial time; as such, it has been an important proving ground for understanding the fundamental limits of efficient optimization [30].

In the basic setting, the practitioner has access to a sequence of independent symmetric random matrices $\mathbf{A}_1, \mathbf{A}_2, \ldots$ with expectation $A \in \mathbb{R}^{d \times d}$. The goal is to approximate the leading eigenspace of $A$ or, more generally, to approximate the subspace spanned by its leading $k$ eigenvectors. While it is natural to attempt to solve this problem by performing an eigendecomposition of the empirical average $\bar{\mathbf{A}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{A}_i$, the amount of space required by this approach can be prohibitive when $d$ is large. In particular, if the matrices $\mathbf{A}_i$ are sparse or low-rank, performing incremental updates with the matrices $\mathbf{A}_i$ may be significantly cheaper than storing all the iterates or their average. A tremendous amount of attention has therefore been paid to designing algorithms which can cheaply and provably estimate the subspace spanned by the top $k$ eigenvectors of $A$ using limited memory and a single pass over the data, a problem known as streaming PCA [17].

The simplest and most natural approach to this problem was proposed nearly 40 years ago by Oja [25, 26]:

(1) Randomly choose an initial guess $M \in \mathbb{R}^{d \times k}$, and set $Q_0 \leftarrow \mathrm{QR}[M]$.
(2) For $t \ge 1$, set $Q_t \leftarrow \mathrm{QR}[(I + \eta_t \mathbf{A}_t) Q_{t-1}]$.

Here, $\mathrm{QR}[\cdot]$ returns an orthogonal matrix in $\mathbb{R}^{d \times k}$ obtained by applying the Gram–Schmidt process to the columns of its argument.
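As a concrete illustration, the two steps above can be sketched in a few lines of NumPy (a minimal sketch, not the authors' reference implementation; the function name, the sample stream, and the step-size sequence are hypothetical placeholders):

```python
import numpy as np

def oja(sample_stream, step_sizes, d, k, seed=0):
    """Streaming k-PCA via Oja's algorithm: Q_t <- QR[(I + eta_t * A_t) Q_{t-1}]."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((d, k))      # random initial guess M
    Q, _ = np.linalg.qr(M)               # Q_0 <- QR[M]
    for A_t, eta_t in zip(sample_stream, step_sizes):
        # Multiplicative update followed by Gram-Schmidt (QR) re-orthonormalization.
        Q, _ = np.linalg.qr(Q + eta_t * (A_t @ Q))
    return Q
```

Note that the update touches each sample only through the product $\mathbf{A}_t Q$, so a step costs $O(\mathrm{nnz}(\mathbf{A}_t)\,k)$ time and the algorithm stores only the $d \times k$ iterate, which is the point of the streaming formulation.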
It is easy to see [1, Lemma 2.2] that the Gram–Schmidt step commutes with the multiplicative update, so that we can equivalently consider a version of the algorithm which performs a single orthonormalization at the end, and outputs $Q_t = \mathrm{QR}[P_t]$, where

$P_t = B_t \cdots B_1 M, \qquad B_i := I + \eta_i \mathbf{A}_i.$

Date: 6 February 2021. The authors gratefully acknowledge the funding for this work. DH was in part supported by NSF Grants DMS-1907977, DMS-1912654, and the Choi Family Postdoc Gift Fund. JNW was supported under NSF grant DMS-2015291. JNW and RW were supported in part by the Institute for Advanced Study, where some of this research was conducted. RW received support from AFOSR MURI Award N00014-17-S-F006 and NSF grant DMS-1952735.

Oja's algorithm can be viewed as a noisy version of the classic orthogonal iteration algorithm for computing invariant subspaces of a symmetric matrix [12, Section 7.3.2]; alternatively, it corresponds to projected stochastic gradient descent on the Stiefel manifold of matrices with orthonormal columns [9]. Despite its simplicity and practical effectiveness, Oja's algorithm has proven challenging to analyze because of its inherent non-convexity.

As a benchmark against which to compare Oja's algorithm, we may consider the performance of the simple offline algorithm which computes the leading $k$ eigenvectors of $\bar{\mathbf{A}}$. We write $U \in \mathbb{R}^{d \times k}$ for the orthogonal matrix whose columns are the leading $k$ eigenvectors of $A$ and $\hat{U} \in \mathbb{R}^{d \times k}$ for the matrix containing the leading $k$ eigenvectors of $\bar{\mathbf{A}}$, and measure the quality of $\hat{U}$ by the following standard measure of distance between subspaces:

$\mathrm{dist}(\hat{U}, U) := \| U U^\top - \hat{U} \hat{U}^\top \|.$

If $\|\mathbf{A}_i - A\| \le \sigma$ almost surely and the gap between the $k$th and $(k+1)$th eigenvalues is $g_k$, then the matrix Bernstein inequality [31, Theorem 1.4] combined with Wedin's theorem [33] implies that there exists a positive constant $C$ such that

$\mathrm{dist}(\hat{U}, U) \le C \frac{\sigma}{g_k} \sqrt{\frac{\log(d/\delta)}{n}}$
(1.1)

with probability at least $1 - \delta$.

The key question is whether Oja's algorithm is able to achieve similar performance. However, except in the special rank-one case where either $k = 1$ or $\mathrm{rank}(\mathbf{A}_i) = 1$ almost surely, no such bound is known.

1.1. Our contribution.
We give the first results for Oja's algorithm nearly matching (1.1), for any $k \ge 1$ and updates of any rank. Our main result (Theorem 2.3) establishes that, after a burn-in period of $T = \tilde{O}\big( \frac{\sigma^2 k}{\delta^2 g_k^2} \big)$ steps, the output of Oja's algorithm satisfies

$\mathrm{dist}(Q_n, U) \le C' \frac{\sigma}{g_k} \sqrt{\frac{\log(nd/\delta g_k)}{n - T}}$

with probability at least $1 - \delta$ for a universal positive constant $C'$. Ours is the first work to show that Oja's algorithm can achieve a guarantee similar to (1.1) beyond the rank-one case.

The assumption that $k = 1$ or $\mathrm{rank}(\mathbf{A}_i) = 1$ is fundamental to the proof strategies used in prior works. To show that the error decays sufficiently quickly, prior work focuses on the quantity $\| V^\top P_t (U^\top P_t)^{-1} \|$, where the columns of $V$ are the last $d - k$ eigenvectors of $A$, which is an upper bound on $\mathrm{dist}(Q_t, U)$. (See Lemma 2.6, below.) The key challenge is to control the inverse $(U^\top P_t)^{-1}$. When $k = 1$, as in [17], this quantity is a scalar, so it can be pulled out of the norm and bounded separately. This is no longer possible when $k > 1$, but if $\mathrm{rank}(\mathbf{A}_i) = 1$, as in [1], then $U^\top P_t$ can be written as a rank-one perturbation of $U^\top P_{t-1}$. The Sherman–Morrison formula then implies that $V^\top P_t (U^\top P_t)^{-1}$ can be written as $V^\top P_{t-1} (U^\top P_{t-1})^{-1}$ plus the sum of explicit, rank-one correction terms. However, if neither $k = 1$ nor $\mathrm{rank}(\mathbf{A}_i) = 1$, this approach quickly becomes infeasible, since the correction terms now involve a product of rank-$k$ matrices whose norm is difficult to bound.

A more subtle difficulty implicit in prior work is that proofs must be carried out entirely in expected (squared) Frobenius norm. This requirement is necessitated by the fact that the Frobenius norm is Hilbertian, so it is possible to employ the crucial Pythagorean identity

$\mathbb{E}\|X\|_F^2 = \|\mathbb{E} X\|_F^2 + \mathbb{E}\|X - \mathbb{E} X\|_F^2 \qquad (1.2)$

for any random matrix $X$.
It is this identity that makes it possible to control the evolution of $\mathbb{E}\| V^\top P_t (U^\top P_t)^{-1} \|_F^2$. However, as our proofs reveal, it is of significant utility to be able to recursively control the operator norm $\| V^\top P_t (U^\top P_t)^{-1} \|$ with high probability instead. Unfortunately, (1.2) is of no help in proving statements of this kind.

Our argument handles both challenges and represents a significant conceptual simplification over earlier proofs. Our crucial insight is that, rather than using the squared Frobenius norm, it is possible to prove a stronger recursion in a different norm, which implies high-probability bounds. Using techniques recently developed by [16] to prove concentration inequalities for products of random matrices, we show that conditioned on $\| V^\top P_{t-1} (U^\top P_{t-1})^{-1} \|$ being well behaved, the probability that $\| V^\top P_t (U^\top P_t)^{-1} \|$ deviates significantly from its expectation is exponentially small. In other words, good concentration properties for $\| V^\top P_{t-1} (U^\top P_{t-1})^{-1} \|$ imply good concentration properties for the next iterate, $\| V^\top P_t (U^\top P_t)^{-1} \|$. These high-probability bounds significantly simplify the calculations, since they allow us to guarantee that the problematic error terms appearing in prior work are small.

If we knew that $\| V^\top P_0 (U^\top P_0)^{-1} \| = O(1)$ with high probability, then the above induction argument would allow us to conclude that $\| V^\top P_t (U^\top P_t)^{-1} \| = O(1)$ for all $t$. Unfortunately, this is not the case: if $M$ is randomly initialized with i.i.d. Gaussian entries, then typically $\| V^\top P_0 (U^\top P_0)^{-1} \| \approx \sqrt{dk}$. We therefore adopt a two-phase approach: in the first, short phase, of length approximately $\log d$, we show that the operator norm decays from $O(\sqrt{dk})$ to $O(1)$, and in the second phase we use the above recursive argument to establish that the operator norm decays to zero at a $O(1/\sqrt{n})$ rate.
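This two-phase behavior is easy to observe numerically. The sketch below (an illustration on assumed toy parameters, not the paper's experiment) tracks $\|V^\top Q_t (U^\top Q_t)^{-1}\|$ along a run of Oja's algorithm; this equals $\|V^\top P_t (U^\top P_t)^{-1}\|$ because the quantity is invariant under right-multiplication of $P_t$ by an invertible matrix, so we may orthonormalize at every step:

```python
import numpy as np

def track_tangent_norm(d=30, k=3, n_steps=400, noise=0.02, seed=0):
    """Track ||V^T Q_t (U^T Q_t)^{-1}|| along a run of Oja's algorithm."""
    rng = np.random.default_rng(seed)
    A = np.diag(np.concatenate([[3.0, 2.5, 2.0], np.linspace(0.4, 0.0, d - k)]))
    U, V = np.eye(d)[:, :k], np.eye(d)[:, k:]         # eigenvector blocks of diagonal A
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random Gaussian initialization
    norms = [np.linalg.norm(V.T @ Q @ np.linalg.inv(U.T @ Q), 2)]
    for t in range(1, n_steps + 1):
        S = rng.standard_normal((d, d))
        A_t = A + noise * (S + S.T) / 2               # i.i.d. symmetric sample
        Q, _ = np.linalg.qr(Q + (1.0 / (5.0 + t)) * (A_t @ Q))
        norms.append(np.linalg.norm(V.T @ Q @ np.linalg.inv(U.T @ Q), 2))
    return norms
```

With a random Gaussian start the initial norm is large (on the order of $\sqrt{d}$), collapses to constant order within a few dozen iterations, and then continues to shrink, mirroring the two phases of the analysis.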
To simplify the analysis of the first phase, we develop a coupling argument that allows us to reduce, without loss of generality, to the case where the law $\mathbb{P}_{\mathbf{A}}$ of the random matrices $\mathbf{A}_1, \mathbf{A}_2, \ldots$ has finite support, and to obtain almost-sure guarantees by a simple union bound. This weak control is enough to guarantee that $\| V^\top P_t (U^\top P_t)^{-1} \|$ decays exponentially fast, so that it is of constant order after approximately $\log d$ iterations.

1.2. Prior work.
Obtaining non-asymptotic rates of convergence for Oja's algorithm and its variants has been an area of active recent interest [28, 29, 27, 21, 20, 2, 4, 13, 17, 23]. Apart from the results of [1] and [17], none of these works proves bounds matching (1.1).

A breakthrough in the project of obtaining optimal guarantees was due to [28], who gave an analysis of Oja's algorithm that works when provided with a warm start: he showed that, when $k = 1$ and $\mathrm{rank}(\mathbf{A}_i) = 1$ almost surely, Oja's algorithm converges in a number of steps logarithmic in $d$ if it is initialized in a neighborhood of the optimum; but his result does not extend to random initialization, and it is unclear how to find a warm start in practice. This restriction was lifted by [17], who were the first to show a global, efficient guarantee for Oja's algorithm when $k = 1$. Subsequently, [1] gave a global, efficient guarantee for Oja's algorithm in the $k > 1$ case, but under the restriction that $\mathrm{rank}(\mathbf{A}_i) = 1$ almost surely.

The idea of analyzing Oja's algorithm by developing concentration bounds for products of random matrices was suggested by [15], who also proved such non-asymptotic concentration bounds in a simplified setting. Those bounds were later improved by [16], who developed a different technique based on martingale inequalities for Schatten norms, following a strategy pursued by [19] and [24] for other Banach space norms. The concentration inequalities of [16] are not sharp enough to recover optimal rates for Oja's algorithm on their own; in this work, we use similar proof techniques to establish tailor-made concentration results for the Oja setting.
1.3. Organization of the remainder of the paper.
In Section 2, we give our main results and an overview of our techniques. Our main tool is a recursive inequality which proves a concentration result for the iterates of Oja's algorithm, which we state and prove in Section 3. Our analysis of Oja's algorithm involves two distinct phases, which we analyze separately. Since the argument for the second phase is simpler, we present it first in Section 4, and present the slightly more complicated argument for the first phase in Section 5. We conclude in Section 6 with open questions and directions for future work. The appendices contain omitted proofs and supplementary results for each section.

1.4.
Notation.
We write $\lambda_1 \ge \cdots \ge \lambda_d$ for the eigenvalues of the symmetric matrix $A$, and we write $g_k := \lambda_k - \lambda_{k+1}$ for the gap between the $k$th and $(k+1)$th eigenvalue. We write $U \in \mathbb{R}^{d \times k}$ for the orthogonal matrix whose columns are the $k$ leading eigenvectors of $A$, and $V \in \mathbb{R}^{d \times (d-k)}$ for the orthogonal matrix whose columns are the remaining eigenvectors. Given an orthogonal matrix $Q \in \mathbb{R}^{d \times k}$, we write [7]

$\mathrm{dist}(Q, U) := \| U U^\top - Q Q^\top \| = \| V^\top Q \|.$

The symbol $\|\cdot\|$ denotes the spectral norm (i.e., $\ell_2$ operator norm) of a matrix, which is equal to its maximum singular value. For $p \ge 1$, the symbol $\|\cdot\|_p$ denotes the Schatten $p$-norm, which is the $\ell_p$ norm of the singular values of its argument. We also define the $L_q$ norm of a random matrix $X$ as

$\|X\|_{p,q} := \big( \mathbb{E} \|X\|_p^q \big)^{1/q}.$

We employ standard asymptotic notation $b = O(a)$ to indicate that $b \le C a$ for a universal positive constant $C$, and write $b = \Theta(a)$ if $b = O(a)$ and $a = O(b)$. The notations $\tilde{O}(\cdot)$ and $\tilde{\Theta}(\cdot)$ suppress polylogarithmic factors in the problem parameters. When $t$ is a positive integer, we write $[t] := \{1, \ldots, t\}$.

2. Techniques and main results

We focus throughout on the following setup:
Assumption 2.1.
The matrices $\mathbf{A}_i$ are symmetric, independent, identically distributed samples from a distribution $\mathbb{P}_{\mathbf{A}}$, with expectation $A$.

Note that while we require that each $\mathbf{A}_i$ is symmetric, we do not require that $\mathbf{A}_i \succeq 0$. The requirement that $\mathbf{A}_i$ is symmetric is not as restrictive as it may seem, since we can replace $\mathbf{A}_i$ by its Hermitian dilation:

$\mathcal{D}(\mathbf{A}_i) := \begin{pmatrix} 0 & \mathbf{A}_i \\ \mathbf{A}_i^\top & 0 \end{pmatrix} \in \mathbb{R}^{2d \times 2d}.$

Estimating the leading eigenvectors of $\mathcal{D}(A)$ is equivalent to estimating the leading singular vectors of $A$. Our results therefore extend to the non-symmetric streaming SVD problem as well. We refer the reader to [32] for more details about this standard reduction.

The second requirement establishes that the random errors are bounded in a suitable norm. We write $S_{d,k}$ for the Stiefel manifold of $d \times k$ matrices with orthonormal columns.

Assumption 2.2. If $\mathbf{A} \sim \mathbb{P}_{\mathbf{A}}$, then $\sup_{O \in S_{d,k}} \| O^\top (\mathbf{A} - A) \|_F \le \sigma$ almost surely.

Note that for any matrix $X \in \mathbb{R}^{d \times d}$,

$\sup_{O \in S_{d,k}} \| O^\top X \|_F = \Big( \sum_{i=1}^{k} \sigma_i(X)^2 \Big)^{1/2}, \qquad 1 \le k \le d,$

where $\sigma_1(X) \ge \sigma_2(X) \ge \cdots \ge \sigma_d(X)$ are the singular values of $X$. This norm, sometimes known as the $(2,k)$ norm [22] or the Ky Fan $2$-$k$ norm [8], satisfies

$\|X\| \le \sup_{O \in S_{d,k}} \| O^\top X \|_F \le \min\big\{ \sqrt{k}\,\|X\|, \|X\|_F \big\}.$

This choice of norm generalizes the error assumptions in the literature. In the $k = 1$ case, it agrees with the operator norm, which is the condition used by [17]; and it weakens the requirement of [1] that $\|\mathbf{A}_i\| \le 1$ almost surely.

The following theorem summarizes our main results for Oja's algorithm.

Theorem 2.3 (Main, informal). Adopt Assumptions 2.1 and 2.2. Let $\lambda_1 \ge \cdots \ge \lambda_d$ be the eigenvalues of $A$, and let $g_k = \lambda_k - \lambda_{k+1}$. For every $\delta \in (0, 1)$, define learning rates

$T = \tilde{\Theta}\Big( \frac{\sigma^2 k}{\delta^2 g_k^2} \Big), \qquad \beta = \tilde{\Theta}\Big( \frac{\sigma^2}{g_k^2} \Big)$
and

$\eta_t = \begin{cases} \tilde{\Theta}\big( \frac{\delta^2 g_k}{\sigma^2 k} \big), & t \le T, \\ \Theta\big( \frac{1}{g_k (\beta + t - T)} \big), & t > T. \end{cases}$

Let $U \in \mathbb{R}^{d \times k}$ be the orthogonal matrix whose columns are the $k$ leading eigenvectors of $A$. Then for any $n > T$, the output $Q_n$ of Oja's algorithm satisfies

$\mathrm{dist}(Q_n, U) \le C' \frac{\sigma}{g_k} \sqrt{\frac{\log(nd / g_k \delta)}{n - T}}$

with probability at least $1 - \delta$, where $C'$ is a universal positive constant.

To prove Theorem 2.3, we adopt a two-phase analysis. Our first result shows that after $T$ iterations, the output of Oja's algorithm satisfies $\| V^\top Q_T (U^\top Q_T)^{-1} \| \le 1$ with high probability.

Theorem 2.4 (Phase I, informal). Adopt the same setting as Theorem 2.3, and let $M \in \mathbb{R}^{d \times k}$ have i.i.d. Gaussian entries. Let

$T = \Theta\Big( \frac{\sigma^2 k}{\delta^2 g_k^2} \log^2(dk / \delta g_k) \Big).$

Then after $T$ iterations of Oja's algorithm with constant step size $\eta = \Theta\big( \frac{\delta^2 g_k}{\sigma^2 k \log(d/\delta)} \big)$ and initialization $M$, the output $Q_T$ satisfies $\| V^\top Q_T (U^\top Q_T)^{-1} \| \le 1$ with probability at least $1 - \delta$.

Our analysis of the second phase shows that, if Oja's algorithm is initialized with any matrix satisfying $\| V^\top Q_0 (U^\top Q_0)^{-1} \| \le 1$, then the error of Oja's algorithm decays at the rate $O(1/\sqrt{n})$.

Theorem 2.5 (Phase II, informal). Adopt the same setting as Theorem 2.3, and suppose that $Q_0 \in \mathbb{R}^{d \times k}$ satisfies $\| V^\top Q_0 (U^\top Q_0)^{-1} \| \le 1$. Then after $n$ iterations of Oja's algorithm with step size $\eta_i = \Theta\big( \frac{1}{(\beta + i) g_k} \big)$ with $\beta = \Theta\big( \frac{\sigma^2}{g_k^2} \log\big( \frac{n d g_k}{\delta} \big) \big)$ and initialization $Q_0$, the output $Q_n$ satisfies

$\mathrm{dist}(Q_n, U) \le \sqrt{\frac{\beta + 1}{\beta + n}}$

with probability at least $1 - \delta$. This error guarantee is completely dimension free, and depends only logarithmically on $n$ and the failure probability $\delta$.
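The two-phase step-size schedule used in Theorems 2.3–2.5 is simple to generate. A sketch follows; the numeric constants are hypothetical tuning parameters (the theorems fix them only up to logarithmic factors), and the function name is our own:

```python
def oja_schedule(n, T, eta_phase1, beta, g_k, alpha=1.0):
    """Two-phase step sizes: constant during the burn-in, then O(1/t) decay."""
    etas = []
    for t in range(1, n + 1):
        if t <= T:
            etas.append(eta_phase1)                           # Phase I: constant step
        else:
            etas.append(4.0 * alpha / ((beta + t - T) * g_k))  # Phase II: decaying step
    return etas
```

The Phase II steps are the classic Robbins–Monro $1/t$ decay shifted by $\beta$, which keeps the early Phase II steps from being too large relative to the noise level.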
Theorem 2.3 follows directly from Theorems 2.4 and 2.5. Theorem 2.4 guarantees that with probability $1 - \delta/2$, the output of Phase I is a suitable initialization for Phase II, and, conditioned on this good event, Theorem 2.5 guarantees that the output of the second phase has error $O(\sqrt{\beta/n})$ with probability $1 - \delta/2$. By concatenating the analysis of the two phases and using the union bound, we obtain that the resulting two-phase algorithm succeeds with probability at least $1 - \delta$, yielding Theorem 2.3.

In the remainder of this section, we describe the main technical tools we employ in our argument.

2.1. A recursive expression.
To simplify the argument, we recall the following result of [1, Lemma 2.2]:
Lemma 2.6.
For all $t \ge 0$,

$\mathrm{dist}(Q_t, U) = \| V^\top Q_t \| \le \| V^\top Q_t (U^\top Q_t)^{-1} \| = \| V^\top P_t (U^\top P_t)^{-1} \|.$

We therefore focus on bounding the norm of the matrix

$W_t := V^\top P_t (U^\top P_t)^{-1}. \qquad (2.1)$

Under the assumption that $\eta_t$ is small, we might expect that we can write $W_t$ as a sum of the dominant term

$H_t := V^\top (I + \eta_t A) P_{t-1} \big( U^\top (I + \eta_t A) P_{t-1} \big)^{-1}$

plus lower-order terms. To argue that $W_t$ is close to $H_t$, we need to argue that the inverse $(U^\top P_t)^{-1}$ does not blow up, which will be the case so long as the fluctuation term $\eta_t U^\top (\mathbf{A}_t - A) P_{t-1}$ is smaller than the main term $U^\top (I + \eta_t A) P_{t-1}$. In order to make this requirement precise, we write

$D_t := \eta_t U^\top (\mathbf{A}_t - A) P_{t-1} \big( U^\top (I + \eta_t A) P_{t-1} \big)^{-1}. \qquad (2.2)$

So long as this matrix has small norm, the inverse term will be well behaved. As we discuss in the following section, we will be able to guarantee that this is the case by conditioning on an appropriate good event.

The following lemma shows that, modulo a term involving $D_t$, we can indeed express $W_t$ as $H_t$ plus a small correction.

Lemma 2.7.
Let $W_t$, $H_t$, and $D_t$ be defined as in (2.1)–(2.2). Then we can write

$W_t (I - D_t) = H_t + E_{t,1} + E_{t,2} \qquad (2.3)$

for matrices $E_{t,1}$ and $E_{t,2}$ of norm $O(\eta_t)$ and $O(\eta_t^2)$, respectively.

Below, in Propositions A.1 and A.2, we use Lemma 2.7 to develop an explicit recursive bound on the norm of $W_t$.

2.2. Matrix concentration via smoothness.
In order to exploit the expression (2.3), we need concentration inequalities that allow us to conclude that $W_t$ is near $H_t$ with high probability. [16] recently developed new tools to control the norms of products of independent random matrices, in an attempt to extend the mature toolset for bounding sums of random matrices to the product setting. Their techniques are based on a simple but deep property of the Schatten $p$-norms known as uniform smoothness. The most elementary expression of this fact is the following inequality, which is the analogue of (1.2) for the $L_q$ norm.

Proposition 2.8 ([16, Proposition 4.3]). Let $X$ and $Y$ be random matrices of the same size, with $\mathbb{E}[Y \mid X] = 0$. Then for any $q \ge 2$,

$\| X + Y \|_{p,q}^2 \le \| X \|_{p,q}^2 + (q - 1) \| Y \|_{p,q}^2.$

We will employ the following corollary of Proposition 2.8, which extends the inequality to non-centered random matrices.
Proposition 2.9.
Let $X$, $Y$, and $Z$ be random matrices of the same size, with $\mathbb{E}[Y \mid X] = 0$. Then for any $q \ge 2$ and $s > 0$,

$\| X + Y + Z \|_{p,q}^2 \le (1 + s) \big( \| X \|_{p,q}^2 + (q - 1) \| Y \|_{p,q}^2 + s^{-1} \| Z \|_{p,q}^2 \big).$

The benefit of working in the $L_q$ norm is that bounding this norm for $q$ large yields good tail bounds on the operator norm, which are not available if the argument is carried out solely in expected Frobenius norm. We will rely heavily on this fact in our argument.

2.3. Conditioning on good events.
Obtaining control on $W_t$ via (2.3) requires ensuring that the matrix $I - D_t$ is invertible, with inverse of bounded norm. To accomplish this, we define a sequence of good events $\mathcal{G}_0 \supseteq \mathcal{G}_1 \supseteq \cdots$, where each $\mathcal{G}_i$ is measurable with respect to the $\sigma$-algebra $\mathcal{F}_i := \sigma(B_1, B_2, \ldots, B_i)$. We write $\mathbb{1}_i$ for the indicator of the event $\mathcal{G}_i$, and we will define $\mathcal{G}_i$ in such a way that $(I - D_t)\mathbb{1}_{t-1}$ is invertible almost surely.

During Phase II, the good events are defined by

$\mathcal{G}_0 := \{ \| W_0 \| \le 1 \}, \qquad \mathcal{G}_i := \{ \| W_i \| \le \gamma \} \cap \mathcal{G}_{i-1}, \quad \forall i \ge 1,$

for some $\gamma \ge 1$ to be specified. Since Assumption 2.2 implies that $\| \mathbf{A}_i - A \| \le \sigma$ almost surely, this definition guarantees that for all $i \ge 1$,

$\| U^\top (\mathbf{A}_i - A) V W_{i-1} \mathbb{1}_{i-1} \| \le \sigma \gamma \quad \text{almost surely.} \qquad (2.4)$

As we show in Proposition A.1 below, if the step size is sufficiently small, then (2.4) implies that $I - D_t$ is almost surely invertible on $\mathcal{G}_{t-1}$, which allows us to employ (2.3) to bound the norm of $W_t \mathbb{1}_{t-1}$.

During Phase I, we condition on a slightly more complicated set of events, which we describe explicitly in Section 5. However, these events are constructed so that (2.4) still holds for all $i \ge 1$.

Our matrix concentration results described in Section 2.2 allow us to show that, during both Phase I and Phase II, $\| W_t \mathbb{1}_{t-1} \|$ is small with high probability, for all $t \ge 1$. Using this fact, we show that, conditioned on $\mathcal{G}_{t-1}$, the probability that $\mathcal{G}_t$ holds is also large. Bounding the failure probability at each step, we are able to conclude that, conditioned on the initialization event $\mathcal{G}_0$, the good events $\mathcal{G}_t$ hold for all $t \ge 1$ with high probability.

3. Main recursive bound
In this section, we state our main recursive bound, which we use in both Phase I and Phase II. A proof appears in Section B.
Theorem 3.1.
Let $t$ be a positive integer, and for all $i \in [t]$, let $\varepsilon_i = \eta_i \sigma (1 + \gamma)$. Let $\mathbb{1}_0, \ldots, \mathbb{1}_t$ be the indicator functions of a sequence of good events satisfying (2.4) for all $i \in [t]$.

Assume that for all $i \in [t]$,

$\varepsilon_i \le \frac{1}{2}, \qquad \eta_i \| A \| \le \frac{1}{2}, \qquad e^{-\eta_i g_k / 2} \le \frac{\eta_i}{\eta_{i-1}}, \qquad (3.1)$

with the convention that the last requirement is vacuous when $i = 1$. Then for any $q \ge 2$,

$\| W_t \mathbb{1}_t \|_{p,q} \le \| W_t \mathbb{1}_{t-1} \|_{p,q} \le e^{-\nu_t g_k / 2} \| W_0 \|_{p,q} + C_1 \sqrt{q} \, \varepsilon_t \sum_{i=0}^{t-1} \| W_i \mathbb{1}_i \|_{p,q} + C_2 \sqrt{q} \, d^{1/p} \varepsilon_t \, t, \qquad (3.2)$

where $\nu_t = \sum_{i=1}^{t} \eta_i$, $C_1 = 8$, and $C_2 = 10$. Moreover, if in addition for all $i \in [t]$,

$C_1 \sqrt{q} \, \varepsilon_i \le \frac{\eta_i g_k}{4}, \qquad (3.3)$

then

$\| W_t \mathbb{1}_t \|_{p,q} \le \| W_t \mathbb{1}_{t-1} \|_{p,q} \le e^{-\nu_t g_k / 4} \| W_0 \|_{p,q} + C_2 \sqrt{q} \, d^{1/p} \varepsilon_t \, t.$

Theorem 3.1 shows that, up to a small error, $\| W_t \mathbb{1}_{t-1} \|_{p,q}$ decays exponentially fast. We will use this fact to prove high-probability bounds on $\| W_t \mathbb{1}_{t-1} \|$, which then imply bounds on $\| W_t \|$.

4. Phase II
In this section, we use Theorem 3.1 to prove a formal version of Theorem 2.5. For this phase, recall that we define the good events $\mathcal{G}_i$ by

$\mathcal{G}_0 = \{ \| W_0 \| \le 1 \}, \qquad \mathcal{G}_i = \{ \| W_i \| \le \gamma \} \cap \mathcal{G}_{i-1}, \quad \forall i \ge 1. \qquad (4.1)$

For Phase II, we set $\gamma = \sqrt{2}$.

We first show that, with a specific step-size schedule, we obtain good bounds on the norm of the last iterate.

Proposition 4.1.
Define the good events as in (4.1). Set $\eta_i = \frac{4\alpha}{(\beta + i) g_k}$ for positive quantities $\alpha$ and $\beta$, and define the normalized gap

$\bar{g}_k = \min\Big\{ \frac{g_k}{\sigma}, \frac{g_k}{\|A\|}, 1 \Big\}. \qquad (4.2)$

If

$\alpha \ge 1, \qquad \beta \ge \frac{(1 + \sqrt{2})\alpha}{\bar{g}_k}, \qquad (4.3)$

then for any $t \ge 1$,

$\| W_t \mathbb{1}_t \|_{p,q} \le k^{1/p} \Big( \frac{\beta + 1}{\beta + t} \Big)^{\alpha} + \sqrt{q} \, d^{1/p} \cdot \frac{C_3 \alpha}{\bar{g}_k} \cdot \frac{t}{\beta + t}, \qquad (4.4)$

where $C_3$ is a numerical constant.

Proof. Since the good events defined in (4.1) satisfy (2.4), we can apply Theorem 3.1. In the appendix, we show (Lemma C.1) that (4.3) implies that the assumptions in (3.1) hold. Theorem 3.1 then yields

$\| W_t \mathbb{1}_t \|_{p,q} \le e^{-\nu_t g_k / 2} \| W_0 \|_{p,q} + C_1 \sqrt{q} \, \varepsilon_t \sum_{i=0}^{t-1} \| W_i \mathbb{1}_i \|_{p,q} + C_2 \sqrt{q} \, d^{1/p} \varepsilon_t \, t \le e^{-\nu_t g_k / 2} k^{1/p} + (C_1 \gamma + C_2) \sqrt{q} \, d^{1/p} \varepsilon_t \, t,$

since (4.1) implies $\| W_0 \|_{p,q} \le k^{1/p}$ and $\| W_i \mathbb{1}_i \|_{p,q} \le \gamma k^{1/p}$ for all $i \ge 1$.

The definition of $\eta_i$ implies

$\frac{\nu_t g_k}{2} = 2\alpha \sum_{i=1}^{t} \frac{1}{\beta + i} \ge \alpha \log\Big( \frac{\beta + t}{\beta + 1} \Big).$

We obtain

$\| W_t \mathbb{1}_t \|_{p,q} \le k^{1/p} \Big( \frac{\beta + 1}{\beta + t} \Big)^{\alpha} + \sqrt{q} \, d^{1/p} \cdot \frac{C_3 \alpha}{\bar{g}_k} \cdot \frac{t}{\beta + t},$

where $C_3 := 4(C_1 \gamma + C_2)(1 + \gamma)$, as desired. □

Finally, we remove the conditioning and prove the full version of Theorem 2.5.
Theorem 4.2.
Assume $\| W_0 \| \le 1$, and adopt the step size $\eta_i = \frac{4\alpha}{(\beta + i) g_k}$, with

$\alpha \ge 1, \qquad \beta \ge \Big( \frac{2 C_3 \alpha}{\bar{g}_k} \Big)^2 \log\Big( \frac{8 C_3^2 \alpha^2}{\bar{g}_k^2} \cdot \frac{n}{\delta} \Big),$

where $\bar{g}_k$ is as in (4.2) and $C_3$ is as in (4.4). Then

$\| W_n \| \le \sqrt{\frac{\beta + 1}{\beta + n}}$

with probability at least $1 - \delta$.

Proof. For any $\epsilon \ge 0$, it holds that

$\mathbb{P}\{ \| W_n \| \ge \epsilon \} \le \mathbb{P}\{ \| W_n \mathbb{1}_n \| \ge \epsilon \} + \mathbb{P}\{ \mathcal{G}_n^c \}.$

First, we have

$\mathbb{P}\{ \mathcal{G}_n^c \} \le \mathbb{P}\{ \mathcal{G}_0^c \} + \sum_{i=1}^{n} \mathbb{P}\{ \mathcal{G}_i^c \cap \mathcal{G}_{i-1} \}.$

Since we have assumed that the initialization satisfies $\| W_0 \| \le 1$, the event $\mathcal{G}_0$ holds with probability $1$, so it suffices to bound the second term. By Markov's inequality, we have

$\mathbb{P}\{ \mathcal{G}_i^c \cap \mathcal{G}_{i-1} \} = \mathbb{P}\{ \| W_i \mathbb{1}_{i-1} \| \ge \gamma \} \le \inf_{q \ge 2} \gamma^{-q} \| W_i \mathbb{1}_{i-1} \|_{q,q}^{q}.$

For fixed $i \ge 1$, we choose

$q = (\beta + i) \cdot \Big( \frac{\bar{g}_k}{C_3 \alpha} \Big)^2.$

It follows from (4.4) that

$\gamma^{-q} \| W_i \mathbb{1}_{i-1} \|_{q,q}^{q} \le \gamma^{-q} \Big( k^{1/q} \Big( \frac{\beta + 1}{\beta + i} \Big)^{\alpha} + \sqrt{q} \, d^{1/q} \cdot \frac{C_3 \alpha}{\bar{g}_k} \cdot \frac{i}{\beta + i} \Big)^{q} \le 2 \Big( \frac{1 + 2q}{\beta + i} \Big)^{q/2} \le 2 e^{-q/2} = 2 \exp\Big( -(\beta + i) \cdot \frac{\bar{g}_k^2}{2 C_3^2 \alpha^2} \Big).$

Therefore, for any $n \ge 1$,

$\sum_{i=1}^{n} \mathbb{P}\{ \mathcal{G}_i^c \cap \mathcal{G}_{i-1} \} \le 2 \sum_{i=1}^{n} \exp\Big( -(\beta + i) \cdot \frac{\bar{g}_k^2}{2 C_3^2 \alpha^2} \Big) \le \frac{4 C_3^2 \alpha^2}{\bar{g}_k^2} \, e^{-\beta \cdot \frac{\bar{g}_k^2}{2 C_3^2 \alpha^2}}.$

This quantity is smaller than $\delta / 2$ if

$\beta \ge \frac{2 C_3^2 \alpha^2}{\bar{g}_k^2} \log\Big( \frac{8 C_3^2 \alpha^2}{\bar{g}_k^2} \cdot \frac{1}{\delta} \Big).$

It remains to bound $\mathbb{P}\{ \| W_n \mathbb{1}_n \| \ge \epsilon \}$. A simple argument (Lemma C.2) based on (4.4) shows that this probability is at most $\delta / 2$ for

$\epsilon = \sqrt{\frac{\beta + 1}{\beta + n}}.$
The claim follows. □

5. Phase I
In this section, we describe the slightly more delicate proof of the formal version of Theorem 2.4. As in Section 4, we will employ Theorem 3.1. However, we will also need to develop an auxiliary recurrence to bound the growth of an additional matrix sequence.
Before we analyze Phase I, we first show that we can reduce to the case that $\mathbb{P}_{\mathbf{A}}$ has finite support. We prove the following result in Appendix E.

Proposition 5.1.
Fix $g > 0$. Suppose that there exists a choice of constant step size $\eta$ and

$T \ge \frac{\sigma^2 k}{\delta^2 g^2} \log(d/\delta)$

such that for any finitely-supported distribution with support size at most $N$ satisfying Assumptions 2.1 and 2.2 and with $g_k \ge g/2$, we have

$\| V^\top Q_T (U^\top Q_T)^{-1} \| \le 1 \qquad (5.1)$

with probability at least $1 - \delta/2$. Then for this same $\eta$ and $T$ it in fact holds that for any distribution satisfying Assumptions 2.1 and 2.2 and with $g_k \ge g$, we have $\| V^\top Q_T (U^\top Q_T)^{-1} \| \le 1$ with probability at least $1 - \delta$.

Proposition 5.1 implies that it suffices to prove the error guarantee (5.1) in the special case when $\mathbb{P}_{\mathbf{A}}$ has finite support of cardinality at most $N$. Let us fix a time horizon $T$ and assume in what follows that $|\mathrm{supp}(\mathbb{P}_{\mathbf{A}})| \le N$.

We begin by defining the good events for Phase I. We adopt a constant step size $\eta$, to be specified. Denote

$\mathcal{E} := \{ \sigma^{-1} (\mathbf{A} - A) V V^\top : \mathbf{A} \in \mathrm{supp}(\mathbb{P}_{\mathbf{A}}) \}.$

For $i \ge 1$, we will set

$\mathcal{G}_i = \Big\{ \max_{\mathbf{M} \in \mathcal{E}} \| U^\top \mathbf{M} V W_i \| \le \gamma \Big\} \cap \mathcal{G}_{i-1}.$

Note that this choice satisfies (2.4) for all $i > 1$.

To define the initial good event $\mathcal{G}_0$, we need to define a larger set of matrices to condition on. For all $m, \ell \ge 0$, set

$\mathcal{E}_{m,\ell} := \big\{ U^\top \mathbf{M}_1 \cdots \mathbf{M}_m V : \mathbf{M}_j \in \mathcal{E} \text{ for at most } \ell \text{ distinct indices } j \in [m], \text{ and } \mathbf{M}_j = (1 + \eta \lambda_{k+1})^{-1} (I + \eta A) V V^\top \text{ otherwise} \big\}.$

The set $\mathcal{E}_{m,\ell}$ has cardinality less than $(N(m+1))^{\ell}$, and $\| \mathbf{M} \| \le 1$ for any $\mathbf{M} \in \mathcal{E}_{m,\ell}$ and any $m, \ell \ge 0$. We have defined $\mathcal{E}_{m,\ell}$ so that control over $\max_{\mathbf{M} \in \mathcal{E}_{m+1,\ell+1}} \| \mathbf{M} W_{t-1} \|$ gives control over $\max_{\mathbf{M} \in \mathcal{E}_{m,\ell}} \| \mathbf{M} W_t \|$.

Finally, we define

$\mathcal{G}_0 := \bigcap_{m,\ell=0}^{T+1} \Big\{ \max_{\mathbf{M} \in \mathcal{E}_{m,\ell}} \| \mathbf{M} W_0 \| \le \sqrt{d} \, (2\gamma)^{\ell} \Big\} \cap \big\{ \| W_0 \| \le \sqrt{dk} \, \gamma \big\}. \qquad (5.2)$

Since $\sigma^{-1} U^\top (\mathbf{A} - A) V \in \mathcal{E}_{1,1}$ almost surely, this choice satisfies (2.4) for $i = 1$.

Our strategy will be similar to the one used in Section 4. However, in order to show that the good events $\mathcal{G}_i$ hold with high probability, we will also need a second recurrence that allows us to control the norm of matrices of the form $\mathbf{M} W_t$, for $\mathbf{M} \in \mathcal{E}_{m,\ell}$.
The details appear in Section D.

6. Conclusion
This work gives the first nearly optimal analysis of Oja's algorithm for streaming PCA beyond the rank-one case. Our analysis is conceptually simple: we show that the spectral norm of the matrix $W_t$ concentrates well around its expectation, once we condition on $W_{t-1}$ having the same behavior. And our concentration results are strong enough that we can afford to union bound over the entire course of the algorithm, to show that $W_t$ is well behaved for all $t \ge 0$.

The matrix concentration techniques we have applied here could be useful in analyzing other PCA-like algorithms, or, more generally, other stochastic algorithms for simple non-convex optimization problems. An interesting question is whether these techniques can prove gap-free rates for Oja's algorithm outside the rank-one setting. This would extend the results of [1] to the general case.

Finally, we stress that the algorithm we have described here requires a priori knowledge of the problem parameters (including the gap $g_k$) to set the step sizes, which is a serious limitation in practice. Recently, [14] developed a data-driven procedure to adaptively select the optimal step sizes. Obtaining theoretical guarantees for this or similar algorithms is an important open problem.

Acknowledgement
We thank Joel Tropp and Amelia Henriksen for valuable discussions which greatly improved this manuscript.
Appendix A. Additional results for Section 3
The following proposition develops the expansion described in Lemma 2.7 and gives explicit bounds on the norms of the error matrices $E_{t,1}$ and $E_{t,2}$. We recall the following definitions:

$W_t = V^\top P_t (U^\top P_t)^{-1},$
$H_t = V^\top (I + \eta_t A) P_{t-1} \big( U^\top (I + \eta_t A) P_{t-1} \big)^{-1},$
$D_t = \eta_t U^\top (\mathbf{A}_t - A) P_{t-1} \big( U^\top (I + \eta_t A) P_{t-1} \big)^{-1}.$

Proposition A.1.
Let $t \ge 1$. Assume that $\eta_t$ is small enough that $\eta_t A \succeq -\frac{1}{2} I$, and assume that (2.4) holds for $i = t$. Let

$\Delta_t = 2 \big( d^{1/p} + \| W_{t-1} \mathbb{1}_{t-1} \|_{p,q} \big), \qquad \varepsilon_t = \eta_t \sigma (1 + \gamma).$

Then $\| D_t \mathbb{1}_{t-1} \| \le \varepsilon_t$ almost surely, and

$W_t (I - D_t) = H_t + E_{t,1} + E_{t,2}$

for $E_{t,1}$ and $E_{t,2}$ satisfying

$\| E_{t,1} \mathbb{1}_{t-1} \|_{p,q} \le \Delta_t \varepsilon_t, \qquad \| E_{t,2} \mathbb{1}_{t-1} \|_{p,q} \le \Delta_t \varepsilon_t^2,$

and $\mathbb{E}[ E_{t,1} \mid \mathcal{F}_{t-1} ] = 0$.

Proof. We employ the notation of the proof of Lemma 2.7. (See Appendix G.) First, we show the bound on $D_t$. Since $\eta_t A \succeq -\frac{1}{2} I$, we have $\| U^\top (I + \eta_t A)^{-1} U \| \le 2$. Moreover, since $\| U^\top (\mathbf{A}_t - A) V W_{t-1} \mathbb{1}_{t-1} \| \le \sigma \gamma$ almost surely, we have that

$\| D_t \mathbb{1}_{t-1} \| \le \eta_t \| U^\top (\mathbf{A}_t - A) (V V^\top + U U^\top) P_{t-1} (U^\top P_{t-1})^{-1} \mathbb{1}_{t-1} \| \le \eta_t \| U^\top (\mathbf{A}_t - A) V W_{t-1} \mathbb{1}_{t-1} \| + \eta_t \| U^\top (\mathbf{A}_t - A) U \| \le \eta_t \sigma (1 + \gamma) =: \varepsilon_t.$

We can bound $\| \widehat{D}_t \mathbb{1}_{t-1} \|_{p,q}$ by a similar argument. First, note that Assumption 2.2 implies that $\| \mathbf{A}_t - A \| \le \sigma$ almost surely. Hence

$\| \widehat{D}_t \mathbb{1}_{t-1} \|_{p,q} \le \eta_t \| V^\top (\mathbf{A}_t - A) V V^\top P_{t-1} (U^\top P_{t-1})^{-1} \mathbb{1}_{t-1} \|_{p,q} + \eta_t \| V^\top (\mathbf{A}_t - A) U \mathbb{1}_{t-1} \|_{p,q} = \eta_t \| V^\top (\mathbf{A}_t - A) V \| \, \| W_{t-1} \mathbb{1}_{t-1} \|_{p,q} + \eta_t \| V^\top (\mathbf{A}_t - A) U \mathbb{1}_{t-1} \|_{p,q} \le \big( \| W_{t-1} \mathbb{1}_{t-1} \|_{p,q} + d^{1/p} \big) \eta_t \sigma \le \big( \| W_{t-1} \mathbb{1}_{t-1} \|_{p,q} + d^{1/p} \big) \varepsilon_t.$

Finally, we have

$\| H_t \mathbb{1}_{t-1} \|_{p,q} \le \frac{1 + \eta_t \lambda_{k+1}}{1 + \eta_t \lambda_k} \| W_{t-1} \mathbb{1}_{t-1} \|_{p,q} \le \| W_{t-1} \mathbb{1}_{t-1} \|_{p,q}.$

We now employ Lemma 2.7. The term $E_{t,1}$ satisfies $\mathbb{E}[ E_{t,1} \mathbb{1}_{t-1} \mid \mathcal{F}_{t-1} ] = 0$, and we have

$\| E_{t,1} \mathbb{1}_{t-1} \|_{p,q} \le \| \widehat{D}_t \mathbb{1}_{t-1} \|_{p,q} + \| H_t \mathbb{1}_{t-1} \|_{p,q} \| D_t \mathbb{1}_{t-1} \| \le \big( \| W_{t-1} \mathbb{1}_{t-1} \|_{p,q} + d^{1/p} \big) \varepsilon_t + \| W_{t-1} \mathbb{1}_{t-1} \|_{p,q} \varepsilon_t \le \Delta_t \varepsilon_t.$

Finally,

$\| E_{t,2} \mathbb{1}_{t-1} \|_{p,q} \le \| \widehat{D}_t \mathbb{1}_{t-1} \|_{p,q} \| D_t \mathbb{1}_{t-1} \| \le \big( \| W_{t-1} \mathbb{1}_{t-1} \|_{p,q} + d^{1/p} \big) \varepsilon_t^2 \le \Delta_t \varepsilon_t^2.$
∎

Combining Proposition A.1 with Proposition 2.9 immediately yields a recursive bound.
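Before turning to the recursion itself, it may help to fix the underlying iteration concretely. The quantities analyzed throughout these appendices are generated by Oja's update Q_t = QR[(I + η_t A_t) Q_{t−1}]. The following minimal numerical sketch (the spiked distribution, dimensions, step size, and iteration count are all hypothetical, chosen purely for illustration) shows the residual block ‖U⊥ᵀQ_t‖ contracting in the way the analysis predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, T, eta = 30, 3, 800, 0.02

# Hypothetical population matrix: top-k eigenvalues separated from the rest.
eigvals = np.concatenate([[6.0, 5.0, 4.0], 0.1 * np.ones(n - k)])
U = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = (U * eigvals) @ U.T
V, V_perp = U[:, :k], U[:, k:]          # leading / trailing eigenvectors

def sample():
    # i.i.d. symmetric samples with mean A (symmetric Gaussian noise).
    N = rng.standard_normal((n, n))
    return A + 0.5 * (N + N.T) / np.sqrt(n)

Q = np.linalg.qr(rng.standard_normal((n, k)))[0]   # random initialization
err0 = np.linalg.norm(V_perp.T @ Q, 2)             # initial subspace error
for t in range(T):
    Q = np.linalg.qr(Q + eta * sample() @ Q)[0]    # QR[(I + eta*A_t) Q]

err = np.linalg.norm(V_perp.T @ Q, 2)   # spectral norm of residual block
print(err0, err)
```

With this step size the error settles at the O(√η) noise floor rather than at zero, mirroring the additive term in the recursive bounds.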
Proposition A.2.
Adopt the setting of Proposition A.1. If π π‘ β€ / , then k πΎ π‘ π‘ k π,π β€ k πΎ π‘ π‘ β k π,π β€ πΎ ,π‘ k πΎ π‘ β π‘ β k π,π + πΎ ,π‘ , (A.1) where πΎ ,π‘ = ( + π π‘ ) ( (cid:18) + π π‘ π π + π π‘ π π + (cid:19) + ππ π‘ ) πΎ ,π‘ = ππ / π π π‘ . Proof.
Reusing the notation of Proposition A.1, we have πΎ π‘ π‘ β ( I β π« π‘ ) = π― π‘ π‘ β + π± π‘, π‘ β + π± π‘, π‘ β , where E [ π± π‘, π‘ β : F π‘ β ] = . Since π― π‘ π‘ β is F π‘ β -measurable, Proposition 2.9 therefore yields forany π > k πΎ π‘ π‘ β ( I β π« π‘ ) k π,π β€ ( + π ) ( k π― π‘ π‘ β k π,π + ( π β ) πΈ π‘ π π‘ + π β πΈ π‘ π π‘ ) . Choosing π = π π‘ , we obtain k πΎ π‘ π‘ β ( I β π« π‘ ) k π,π β€ ( + π π‘ ) ( k π― π‘ π‘ β k π,π + ππΈ π‘ π π‘ ) . Finally, under the assumption that k π« π‘ π‘ β k β€ π π‘ β€ almost surely, on the event G π‘ β the matrix I β π« π‘ is invertible and satisο¬es k ( I β π« π‘ ) β π‘ β k β€ ( β k π« π‘ π‘ β k ) β β€ ( β π π‘ ) β Hence k πΎ π‘ π‘ β k π,π β€ k πΎ π‘ π‘ β ( I β π« π‘ ) k π,π k ( I β π« π‘ ) β π‘ β k β€ + π π‘ ( β π π‘ ) ( k π― π‘ π‘ β k π,π + ππΈ π‘ π π‘ ) . TREAMING π -PCA: EFFICIENT GUARANTEES FOR OJAβS ALGORITHM, BEYOND RANK-ONE UPDATES 13 Since + π π‘ ( β π π‘ ) β€ + π π‘ for all π π‘ β€ and ( + π π‘ ) πΈ π‘ β€ ( + π π‘ ) ( π / π + k πΎ π‘ β π‘ β k π,π ) and ( + π π‘ ) β€ for all π π‘ β€ , this proves the claim. (cid:3) Appendix B. Proof of Theorem 3.1
We will unroll the one-step recurrence of Proposition A.2. We ο¬rst bound πΎ ,π . We have πΎ ,π β€ (cid:18) + π π π π + π π π π + (cid:19) + ( + π ) π π + ππ π β€ (cid:18) + π π π π + π π π π + (cid:19) + ( + π ) π π , where the second inequality follows from the ο¬rst assumption in (3.1). The second assumption in (3.1)implies that β€ + π π π π β€ , so (cid:18) + π π π π + + π π π π (cid:19) = (cid:18) β π π π π + π π π π (cid:19) β€ (cid:18) β π π π π (cid:19) β€ e β π π π π . Since + π β€ π for all π β₯ , we obtain πΎ ,π β€ e β π π π π + πΆ ππ π . We now proceed to prove the ο¬rst claim by induction. When π‘ = , we use (A.1) to obtain k πΎ k π,π β€ k πΎ k π,π β€ πΎ , k πΎ k π,π + πΎ , β€ e β π π π k πΎ k π,π + πΆ ππ k πΎ k π,π + πΆ ππ / π π , which is the desired bound.Proceeding by induction, for π‘ > we have k πΎ π‘ π‘ k π,π β€ k πΎ π‘ π‘ β k π,π β€ πΎ ,π‘ k πΎ π‘ β π‘ β k π,π + πΎ ,π‘ β€ e β π π‘ π π k πΎ π‘ β π‘ β k π,π + πΆ ππ π‘ k πΎ π‘ β π‘ β k π,π + πΎ ,π‘ β€ e β π π‘ π π (cid:16) e β π π‘ β π π k πΎ k π,π + πΆ ππ π‘ β Γ π‘ β π = k πΎ π π k π,π + πΆ ππ / π π π‘ β ( π‘ β ) (cid:17) + πΆ ππ π‘ k πΎ π‘ β π‘ β k π,π + πΆ ππ / π π π‘ β€ e β π π‘ π π k πΎ k π,π + πΆ ππ π‘ Γ π‘ β π = k πΎ π π k π,π + πΆ ππ / π π π‘ π‘ , where in the ο¬nal inequality we have used that e β π π‘ π π π π‘ β β€ π π‘ by the third assumption of (3.1). Thisproves the ο¬rst bound.For the second bound, we proceed in a similar way, but with a sharper bound on πΎ ,π . 
The secondassumption of (3.1) again implies (cid:18) + π π π π + + π π π π (cid:19) = (cid:18) β π π π π + π π π π (cid:19) β€ β π π π π + ( π π π π ) β€ β π π π π , and therefore πΎ ,π β€ ( + π π ) (cid:18) β π π π π + ππ π (cid:19) β€ exp (cid:18) β π π π π + ( + π ) π π (cid:19) β€ e β π π π π / , where the ο¬nal step uses Assumption (3.3) and the fact that + π β€ π for all π β₯ .When π‘ = , we therefore have k πΎ k π,π β€ k πΎ k π,π β€ πΎ , k πΎ k π,π + πΎ , β€ e β π π π / k πΎ k π,π + πΆ ππ / π π , as desired, and for π‘ > the induction hypothesis yields k πΎ π‘ π‘ k π,π β€ k πΎ π‘ π‘ β k π,π β€ πΎ ,π‘ k πΎ π‘ β π‘ β k π,π + πΎ ,π‘ β€ e β π π‘ π π / (cid:16) e β π π‘ β π π / k πΎ k π,π + πΆ ππ / π π π‘ β ( π‘ β ) (cid:17) β€ e β π π‘ π π / k πΎ k π,π + πΆ ππ / π π π‘ π‘ , where the ο¬nal inequality again uses the third assumption in (3.1). This proves the second bound. (cid:3) Appendix C. Additional results for Section 4
Lemma C.1.
Under the conditions of Proposition 4.1, the assumptions of (3.1) hold.

Proof.
First assumption. We have π π = π π π ( + πΎ ) = ( + β ) πΌπ ( π½ + π ) π π β€ πΆ π πΌπ½ Β― π π , where πΆ π = ( + β ) . So the ο¬rst assumption is fulο¬lled as long as π½ / πΌ β₯ πΆ π / Β― π π . (C.1a)Second assumption. As above, we have π π k π΄ k β€ πΌ k π΄ k π½π π β€ πΌπ½ Β― π π , so the assumption is fulο¬lled if (C.1a) holds.Third assumption. It suο¬ces to show that π π β π π β€ + π π π π β π β₯ , which is equivalent to π½ + π β β€ πΌ / π½ + π β π β₯ . This holds as long as πΌ β₯ . (C.1b)We obtain that all three assumptions hold under (C.1a) and (C.1b), as claimed. (cid:3) Lemma C.2.
In the setting of Theorem D.5, if π = q π½ + π½ + π , then P {k πΎ π k β₯ π } β€ πΏ / . Proof.
We have P {k πΎ π π k β₯ π } β€ inf π β₯ π β π k πΎ π π k ππ,π . TREAMING π -PCA: EFFICIENT GUARANTEES FOR OJAβS ALGORITHM, BEYOND RANK-ONE UPDATES 15 In particular, we choose π = e (cid:18) π½ + π½ + π (cid:19) πΌ + e πΆ πΌ Β― π π π ( π½ + π ) l og ( π / πΏ ) , and π = l og ( π / πΏ ) . It then follows from (4.4) that P {k πΎ π π k β₯ π } β€ π β π k πΎ π π k ππ β€ π π (cid:18) π½ + π½ + π (cid:19) πΌ + π π πΆ πΌ Β― π π π ( π½ + π ) ! π / = π e β π = πΏ / . Combining the above bounds, we obtain that k πΎ π k β€ π β€ e (cid:18) π½ + π½ + π (cid:19) πΌ / + e πΆ πΌππ π r l og ( π / πΏ ) π , with probability at least β πΏ . Since both terms are smaller than e q π½ + π½ + π , the claim follows. (cid:3) Appendix D. Additional results for Section 5
Our main tool will be the following slight variation on Proposition A.1.
Proposition D.1.
Let π‘ β₯ . Assume that π π‘ is small enough that π΄ (cid:23) β π π‘ I , and assume that (2.4) holds for π = π‘ . Consider an arbitrary deterministic matrix π¬ β E π,β .Let Β― πΈ π‘ = + max π¬ β²β² β E π + ,β + k π¬ β²β² πΎ π‘ β π‘ β k π,π π = ππ ( + πΎ ) . Then k π« π‘ π‘ β k β€ π almost surely, and π¬πΎ π‘ ( I β π« π‘ ) = π¬π― π‘ + π¬π± π‘, + π¬π± π‘, for π¬π± π‘, and π¬π± π‘, satisfying k π¬π± π‘, π‘ β k π,π β€ Β― πΈ π‘ π k π¬π± π‘, π‘ β k π,π β€ Β― πΈ π‘ π , and E [ π¬π± π‘, : F π‘ β ] = .Proof. The proof is a slight modiο¬cation on the proof of Proposition A.1. By construction, k π¬π― π‘ π‘ β k π,π β€ (cid:18) + ππ π + ππ π + (cid:19) k π¬ β² πΎ π‘ β π‘ β k π,π , where π¬ β² = + ππ π + π¬πΌ β ( I + π πΊ ) πΌ β E π + ,β β E π + ,β + .Similarly, we have k π¬ b π« π‘ π‘ β k π,π β€ π k π¬πΌ β ( π¨ π‘ β π΄ ) πΌπΎ π‘ β π‘ β k π,π + π k π¬πΌ β ( π¨ π‘ β π΄ ) π½ k π,π β€ ππ ( k π¬ β²β² πΎ π‘ β π‘ β k π,π + k π¬ k π,π )β€ π ( k π¬ β²β² πΎ π‘ β π‘ β k π,π + ) where π¬ β²β² = π π¬πΌ β ( π¨ π‘ β π΄ ) πΌ β E π + ,β + , and we have used k π¬ k π β€ k π¬ k β€ .We therefore obtain k π¬π± π‘, π‘ β k π,π β€ k π¬ b π« π‘ π‘ β k π,π + k π¬π― π‘ π‘ β k π,π k π« π‘ π‘ β k β€ (cid:0) k π¬ β²β² πΎ π‘ β π‘ β k π,π + k π¬ β² πΎ π‘ β π‘ β k π,π + (cid:1) π β€ Β― πΈ π‘ π , and k π¬π± π‘, π‘ β k π,π β€ k π¬ b π« π‘ π‘ β k π,π k π« π‘ π‘ β k β€ ( k π¬ β²β² πΎ π‘ β π‘ β k π,π + ) π β€ Β― πΈ π‘ π . (cid:3) The following two results are the appropriate analogues of Proposition A.2 and Theorem 3.1.
Proposition D.2.
Adopt the setting of Proposition D.1. If π β€ / , then max π¬ β E π,β k π¬πΎ π‘ π‘ β k π,π β€ Β― πΎ max π¬ β² β E π + ,β k π¬ β² πΎ π‘ β π‘ β k π,π + Β― πΎ max π¬ β²β² β E π + ,β + k π¬ β²β² πΎ π‘ β π‘ β k π,π + Β― πΎ , (D.1) where Β― πΎ = ( + π ) (cid:18) + ππ π + ππ π + (cid:19) Β― πΎ = ( + π ) ππ Proof.
As in the proof of Proposition A.2, we have for any π¬ β E π,β , k π¬ k π,π β€ ( + π ) ( k π¬π― π‘ π‘ β k π,π + π Β― πΈ π‘ π ) . As in the proof of Proposition D.1, we can write k π¬π― π‘ π‘ β k π,π β€ (cid:18) + ππ π + ππ π + (cid:19) k π¬ β² πΎ π‘ β π‘ β k π,π where π¬ β² = + ππ π + π¬πΌ β ( I + π πΊ ) πΌ β E π + ,β . Since Β― πΈ π‘ β€ max π¬ β²β² β E π + ,β + k π¬ β²β² πΎ π‘ β π‘ β k π,π + , taking the maximum over all π¬ β E π,β and π¬ β² β E π + ,β yields the claim. (cid:3) Theorem D.3.
Let π‘ β€ π be a positive integer, and assume the following requirements hold for some π β₯ : π β€ , (D.2a) π k π΄ k β€ , (D.2b) ππ β€ ππ π (D.2c) πΎ β₯ . (D.2d) Then for any π, β β [ π β π‘ + ] and π β₯ , max π¬ β E π,β k π¬πΎ π‘ π‘ k π,π β€ max π¬ β E π,β k π¬πΎ π‘ π‘ β k π,π β€ βπΎ e β π‘ππ π / + πΆ ππΎ π π‘ . where πΆ = .Proof. First, as in the proof of Theorem 3.1, Assumptions (D.2b) and (D.2c) imply Β― πΎ + Β― πΎ = ( + π ) ( (cid:18) + ππ π + ππ π + (cid:19) + ππ ) β€ e β ππ π / . TREAMING π -PCA: EFFICIENT GUARANTEES FOR OJAβS ALGORITHM, BEYOND RANK-ONE UPDATES 17 In particular, Β― πΎ + Β― πΎ β€ . Assumption (D.2a) likewise implies that Β― πΎ β€ .We now turn to the proof of the main claim, which we prove by induction on π‘ . For convenience,we introduce the notation πΎ e = πΎ /β . When π‘ = and π, β β€ π , (D.1) impliesmax π¬ β E π,β k π¬πΎ k π,π β€ max π¬ β E π,β k π¬πΎ k π,π β€ Β― πΎ max π¬ β² β E π + ,β k π¬ β² πΎ k π,π + Β― πΎ max π¬ β²β² β E π + ,β + k π¬ β²β² πΎ k π,π π + Β― πΎ β€ Β― πΎ βπΎ + Β― πΎ ( β + ) πΎ + Β― πΎ β€ βπΎ ( Β― πΎ + Β― πΎ ) + ( + πΎ ) Β― πΎ β€ βπΎ e β ππ π / + πΎ πΎ where we have used the deο¬nition of G and where the last step uses (D.2d). Proceeding by induction,we have max π¬ β E π,β k π¬πΎ π‘ π‘ k π,π β€ max π¬ β E π,β k π¬πΎ π‘ π‘ β k π,π β€ Β― πΎ max π¬ β² β E π + ,β k π¬ β² πΎ π‘ β π‘ β k π,π + Β― πΎ max π¬ β²β² β E π + ,β + k π¬ β²β² πΎ π‘ β π‘ β k π,π + Β― πΎ β€ Β― πΎ ( βπΎ e β( π‘ β ) ππ π / + ( π‘ β ) πΎ Β― πΎ )+ Β― πΎ ( ( β + ) πΎ e β( π‘ β ) ππ π / + ( π‘ β ) πΎ Β― πΎ ) + Β― πΎ β€ βπΎ ( Β― πΎ + Β― πΎ ) e β( π‘ β ) ππ π / + ( π‘ β ) ( Β― πΎ + Β― πΎ ) πΎ Β― πΎ + ( + πΎ ) Β― πΎ = βπΎ e β π‘ππ π / + πΎ πΎ π‘ , as claimed. (cid:3) Proposition D.4.
Fix π β ( , ) , β€ πΎ β€ πΆ πΎ ππΏ , and π β₯ , where πΆ πΎ = πΎ is the constant inLemma H.4. Given π > , deο¬ne the normalized gap Β― π = min (cid:26) ππ , k π΄ k π , (cid:27) , and adopt the step size π = πΆ π l og ( e π / π πΏ ) ππ . If π π β₯ π / and π β₯ π Β· πΆ π πΎ l og ( e π / π πΏ ) π Β― π where πΆ π β₯ + og πΆ πΎ , πΆ π β₯ πΆ π , then k πΎ π π β k π,π β€ π (cid:16) + π / π (cid:17) and max π¬ β E , k π¬πΎ π‘ π‘ β k π,π β€ πΎ e for all β€ π‘ β€ π . Proof.
We will apply Theorems 3.1 and D.3. First, note that (D.2d) holds by assumption. We now turnto the other conditions.Assumption (D.2a): Since πΎ β₯ , we have π = ππ ( + πΎ ) β€ πΆ π πΎ l og ( e π / π πΏ ) πππ . The assumption therefore holds as long as πΆ π β₯ πΆ π . (D.3)Assumption (D.2b): As above, we have π k π΄ k β€ πΆ π l og ( e π / π πΏ ) k π΄ k ππ , and the requirement (D.3) implies that this quantity is also smaller than / .Assumption (D.2c): Since ππ π = πΆ π l og ( e π / π πΎ ) π β₯ π and > , it suο¬ces to prove the strongerclaim ππ β€ π π . (D.4)This is satisο¬ed so long as π Β· πΆ π πΎ l og ( e π / π πΏ ) π π π β€ π π . which will hold if πΆ π β₯ πΆ π . (D.5)This requirement is stronger than (D.3), so Assumptions (D.2a)β(D.2c) hold under the sole condi-tion (D.5).We now turn to the two claimed bounds. First, we instantiate Theorem 3.1 with the choice π π = π for β€ π β€ π . The third assumption of (3.1) is trivially satisο¬ed when when π π is constant, sincein that case π π = π π β for all π β₯ . The remaining assumptions correspond directly to Assump-tions (D.2a), (D.2b), and (D.2c). The assumptions of Theorem 3.1 are therefore satisο¬ed, so we obtain, k πΎ π π β k π,π β€ e β π ππ π / k πΎ k π,π + ππ / π π π . The deο¬nition of G in (5.2) and the fact that π π β₯ π / implies that the ο¬rst term is at most e β π ππ π / ππΎ = ( e π / π πΏ ) β πΆ π / ππΎ , and this will be less than π if πΆ π β₯ + og ( πΆ πΎ ) . Since (D.4) holds, the second term satisο¬es ππ / π π π β€ π π / π < π π / π . We obtain k πΎ π π β k π,π β€ π (cid:16) + π / π (cid:17) , as claimed.For the second claim, we rely on Theorem D.3. Assumptions (D.2a)β(D.2d) having already beenveriο¬ed, we obtain for all β€ π‘ β€ π ,max π¬ β E , k π¬πΎ π‘ π‘ β k π,π β€ πΎ e β π‘ππ π / + ππΎ π π‘ . TREAMING π -PCA: EFFICIENT GUARANTEES FOR OJAβS ALGORITHM, BEYOND RANK-ONE UPDATES 19 Since π π β₯ , the ο¬rst term is at most πΎ , and the second term is also at most πΎ by (D.4). 
We obtain that max π¬ β E , k π¬πΎ π‘ π‘ β k π,π β€ πΎ e , as claimed. ∎

With Proposition D.4 in hand, we can prove a full version of Theorem 2.4.
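The tail bounds in this argument (as in Lemma C.2) come from optimizing a moment bound of the form P{‖Γ‖ ≥ r} ≤ inf_{p≥1} r^{−p} E‖Γ‖^p over the moment order p. A scalar sanity check of this mechanism, using an Exp(1) variable as a hypothetical stand-in because both its moments and its tail are available in closed form:

```python
import math

# For X ~ Exp(1): E[X^p] = Gamma(p + 1) and P{X >= r} = exp(-r).
r = 10.0
exact_tail = math.exp(-r)

# Markov applied to the p-th moment: P{X >= r} <= r^{-p} * E[X^p].
# Optimizing over integer p recovers a bound close to the true tail.
best = min(math.gamma(p + 1) / r**p for p in range(1, 40))

print(exact_tail, best)
```

The optimized bound sits within a polynomial factor of the exact tail, which is the same effect exploited when p is chosen proportional to log(1/δ) in the proofs.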
Theorem D.5.
Fix a π > and assume | supp ( P π¨ ) | = π . Let Β― π = max (cid:26) ππ , π k π΄ k , (cid:27) , and set π = / .Adopt the step size π = πΆ π l og ( e π / πΏπ ) ππ where π β₯ πΆ π π ( log π / πΏ Β― ππ ) π πΏ Β― π . and πΆ π β₯ + og πΆ πΎ , πΆ π β₯ ( πΆ π πΆ πΎ ) / . If π β€ π and π π β₯ π / , then k πΎ π k β€ / with probability at least β πΏ / .Proof. We ο¬rst show that we can assume that l og π β€ og ( πΆ π π / πΏ Β― ππ ) . Indeed, if π > (cid:16) πΆ π ππΏ Β― ππ (cid:17) , acrude argument similar to the one employed in the analysis of Phase II yields the claim. We give thefull details in Appendix F. In what follows, we therefore assumelog π β€ og ( πΆ π π / πΏ Β― ππ ) . (D.6)Set πΎ = πΆ πΎ min ( p π l og ( πΆ π π / πΏ Β― ππ ) πΏ , ππΏ ) , where πΆ πΎ is as in Lemma H.4.Recall that our goal is to show k πΎ π k β€ π with probability at least β πΏ / . The failure probabilitycan be bounded as P (cid:8) k πΎ π k β₯ π (cid:9) β€ P (cid:8) k πΎ π π k β₯ π (cid:9) + P (cid:8) G πΆπ (cid:9) β€ inf π β₯ π β π k πΎ π π k ππ,π + P (cid:8) G πΆπ (cid:9) . If we choose π = log ( π / πΏ ) , then since l og ( πΆ π ) β€ πΆ / π log ( ) for any value of πΆ π , we have π β₯ πΆ π π ( l og ( π / πΏ Β― ππ )) π πΏ Β― π β₯ l og ( π / πΏ ) Β· πΆ / π π l og ( πΆ π π / πΏ Β― ππ ) πΏ Β· log ( e π / π πΏ ) π Β― π β₯ π πΆ π πΎ l og ( e π / π πΏ ) π Β― π , as long as πΆ π β₯ ( πΆ π ( πΆ πΎ ) ) / , which veriο¬es the assumption of Proposition D.4.We obtain k πΎ π π k π,π β€ π ( + π / π ) β€ π / π π e . We therefore have π β π k πΎ π π k ππ,π β€ e β l og ( π / πΏ ) β€ πΏ / . It remains to bound P n G πΆπ o . Clearly P (cid:8) G πΆπ (cid:9) β€ P (cid:8) G πΆ (cid:9) + Γ π π = P n G πΆπ β© G π β o . Since π β€ π and we have assumed l og π β€ og ( πΆ π π / πΏ Β― ππ ) , we havelog ( e ππ / πΏ ) β€ og ( π ) + log ( e / πΏ ) β€
20 l og ( πΆ π π / πΏ Β― ππ ) + log ( e / πΏ ) β€
21 l og ( πΆ π π / πΏ Β― ππ ) , so Lemma H.4 guarantees that G holds with probability at least β πΏ / .For the second term, we have P n G πΆπ β© G π β o = P (cid:26) max π¬ β E , k π¬πΎ π π β k β₯ πΎ (cid:27) β€ Γ π¬ β E , P (cid:8) k π¬πΎ π π β k β₯ πΎ (cid:9) . Choose π =
21 l og ( πΆ π π / πΏ Β― ππ ) . The same argument as above yields π β₯ π Β· πΆ / π π log ( πΆ π π / πΏ Β― ππ ) πΏ Β· log ( e π / π πΏ ) π Β― π , and this will be larger than the lower bound required on π that was assumed in Proposition D.4 aslong as πΆ π β₯ ( πΆ π ( πΆ πΎ ) ) / Proposition D.4 therefore yields P (cid:8) k π¬πΎ π π β k β₯ πΎ (cid:9) β€ πΎ β π k π¬πΎ π π β k ππ,π β€ e β π = e β
21 l og ( πΆ π π / πΏ Β― ππ ) for all π¬ β E , and thus P n G πΆπ | G π β o β€ Γ π¬ β E , P (cid:8) k π¬πΎ π π β k β₯ πΎ (cid:9) β€ π e β
21 l og ( πΆ π π / πΏ Β― ππ ) . This yields Γ π π = P n G πΆπ | G π β o β€ ππ e β
21 l og ( πΆ π π / πΏ Β― ππ ) β€ e β
21 l og ( πΆ π π / πΏ Β― ππ )+ og π β€ πΏ / , where the last step uses (D.6). Finally, choosing π = / , we obtain P (cid:8) k πΎ π k β₯ / (cid:9) β€ πΏ / , as claimed. (cid:3) Appendix E. A reduction to finite support
Let Ω be the space of n × n symmetric matrices. We argue that it suffices to assume that μ_A has finite support of cardinality at most N in Phase I. We prove this by comparing the product measure μ_A^⊗m with another distribution μ_N on Ω^⊗m. We specify this distribution by the following procedure: drawing an m-tuple (A_1, …, A_m) from the distribution μ_N is accomplished by
(1) Drawing N independent samples Ã_1, …, Ã_N from μ_A.
(2) Drawing A_1, …, A_m independently from the discrete distribution μ_Ã = (1/N) Σ_{i=1}^N δ_{Ã_i}. That is, drawing A_1, …, A_m independently and uniformly from the set {Ã_i}_{i=1}^N with replacement.
We will rely on the fact that the two distributions, μ_A^⊗m and μ_N, are close in total variation distance when N is large. To see this, we first recognize that drawing (A_1, …, A_m) from μ_A^⊗m is equivalent to the following:
(1) Draw N independent samples Ã_1, …, Ã_N from μ_A.
(2) Draw A_1, …, A_m sequentially and uniformly from the set {Ã_i}_{i=1}^N without replacement. Denote by μ_Ã^(m) the distribution of this sampling.
It is a standard result [11] that, given any {Ã_i}_{i=1}^N,
d_TV(μ_Ã^⊗m, μ_Ã^(m)) ≤ m²/N.
We thus have the following:
Proposition E.1.
For any δ ∈ (0, 1), it holds that d_TV(μ_N, μ_A^⊗m) ≤ δ for all N ≥ m²/δ.

Proof. For any set S ⊆ Ω^⊗m, we have
|μ_N(S) − μ_A^⊗m(S)| = |E[μ_Ã^⊗m(S) − μ_Ã^(m)(S)]| ≤ E|μ_Ã^⊗m(S) − μ_Ã^(m)(S)| ≤ E d_TV(μ_Ã^⊗m, μ_Ã^(m)) ≤ m²/N ≤ δ,
where the expectations are over Ã_1, …, Ã_N drawn i.i.d. from μ_A. The claim follows from taking the maximum of |μ_N(S) − μ_A^⊗m(S)| over all subsets of Ω^⊗m. ∎

Given any Ã_1, …, Ã_N, define the empirical average Ã_N := E_{A∼μ_Ã}[A] = (1/N) Σ_{i=1}^N Ã_i. Denote by λ̃_1 ≥ λ̃_2 ≥ ⋯ ≥ λ̃_n the eigenvalues of Ã_N, and write Δ̃_k = λ̃_k − λ̃_{k+1}. Let Ṽ ∈ ℝ^{n×k} be the orthogonal matrix whose columns are the leading k eigenvectors of Ã_N, and let Ũ ∈ ℝ^{n×(n−k)} be the orthogonal matrix consisting of the remaining eigenvectors. Standard results of matrix concentration imply that Ã_N is close to A. In particular, we have the following:

Proposition E.2.
Suppose that π β₯ π π π l og ( π / πΏ ) . Let Λ π¨ , . . . , Λ π¨ π be drawn independently from π π΄ .Then it holds with probability at least β πΏ that k Λ π΄ π β π΄ k β€ π π / , and, in particular, Λ π π β₯ π π / and k πΌ β Λ π½ k β€ / . Proof.
By assumption 2, we have that k Λ π΄ π β π΄ k β€ π almost surely. Then the matrix Bernsteininequality [31, Theorem 1.4] implies that, for any π‘ β₯ , P (cid:8) k Λ π΄ π β π΄ k β₯ π‘ (cid:9) β€ π exp (cid:18) β ππ‘ / π + ππ‘ / (cid:19) . Substituting π‘ = π π / yields the ο¬rst claim. Using the perturbation theory of eigenvalues of symmetricmatrices, we have Λ π π β₯ π π β k Λ π΄ π β π΄ k and Λ π π + β€ π π + β k Λ π΄ π β π΄ k . Therefore, conditioned on the ο¬rst claim, it holds that Λ π π β₯ π π β k Λ π΄ π β π΄ k β₯ π π . Furthermore, it follows from Wedinβs inequality [33] that k πΌ β Λ π½ k β€ k Λ π΄ π β π΄ k Λ π π β π π + β€ . (cid:3) Proposition E.3.
Let πΌ and π½ be orthogonal matrices such that πΌπΌ β + π½π½ β = I , and let Λ πΌ and Λ π½ be matricesof the same size satisfying the same requirement. Suppose k πΌ β Λ π½ k β€ / and k Λ πΌ β πΊ ( Λ π½ β πΊ ) β k β€ πΎ β€ .Then k πΌ β πΊ ( π½ β πΊ ) β k β€ + πΎ β πΎ . Proof.
A direct calculation yields k πΌ β πΊ ( π½ β πΊ ) β k = k πΌ β ( Λ πΌ Λ πΌ β + Λ π½ Λ π½ β ) πΊ ( π½ β πΊ ) β k β€ k Λ πΌ β πΊ ( π½ β πΊ ) β k + k πΌ β Λ π½ Λ π½ β πΊ ( π½ β πΊ ) β k β€ k Λ πΌ β πΊ ( Λ π½ β πΊ ) β Λ π½ β πΊ ( π½ β πΊ ) β k + k Λ π½ β πΊ ( π½ β πΊ ) β k β€ ( πΎ + ) k Λ π½ β πΊ ( π½ β πΊ ) β k . We also have k Λ π½ β πΊ ( π½ β πΊ ) β k β€ k Λ π½ β πΌπΌ β πΊ ( π½ β πΊ ) β k + k Λ π½ β π½π½ β πΊ ( π½ β πΊ ) β k β€ k πΌ β πΊ ( π½ β πΊ ) β k + . Combining the two displays above and rearranging the resulting inequality yields the claim. ∎
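The reduction in this appendix leans on Freedman's comparison [11] between sampling with and without replacement. For uniform draws the total variation distance between the two schemes can be computed exactly (it equals the collision probability of the with-replacement draw), which makes the m²/N-type bound easy to verify numerically; the values of N and m below are arbitrary illustrative choices:

```python
# m uniform draws from {1,...,N}: with replacement the sequence law is uniform
# on all N^m tuples; without replacement it is uniform on the
# N(N-1)...(N-m+1) injective tuples.  The total variation distance between the
# two sequence laws is exactly the probability of a repeated draw.
def tv_with_vs_without(N, m):
    prob_all_distinct = 1.0
    for i in range(m):
        prob_all_distinct *= (N - i) / N
    return 1.0 - prob_all_distinct

N, m = 100, 10
tv = tv_with_vs_without(N, m)
print(tv)
```

The union bound 1 − Π(1 − i/N) ≤ Σ i/N gives tv ≤ m(m − 1)/(2N), which is in turn dominated by the cruder m²/N form.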
Now let π be given as in Theorem D.5 and choose π = π / πΏ . As long as π β₯ ππ π πΏ l og ( π / πΏ ) , wehave π π π l og ( π / πΏ ) β€ π β€ π . It then follows from Proposition E.2 that, when drawing Λ π¨ , . . . , Λ π¨ π independently from π π΄ , the event G : = { Λ π π β₯ π π / and k πΌ β Λ π½ k β€ / } (E.1)happens with probability at least β πΏ . Conditioned on G , we consider running π steps of Ojaβsalgorithm, with π΄ , . . . , π΄ π drawn i.i.d from π Λ π΄ . Note that the discrete distribution π Λ π΄ also satisο¬esAssumption 1 and Assumption 2 (with π replaced by π ). Our main theorem thus guarantees thatwith appropriately chosen step size, the output π π = π π ( π΄ , . . . , π΄ π ) of this algorithm after π stepssatisο¬es k Λ πΌ β πΈ π ( Λ π½ β πΈ π ) β k β€ TREAMING π -PCA: EFFICIENT GUARANTEES FOR OJAβS ALGORITHM, BEYOND RANK-ONE UPDATES 23 with probability β πΏ . Combining (E.1) and Proposition E.3, we obtain that with probability at least ( β πΏ ) β₯ β πΏ , the output of the algorithm satisο¬es k πΌ β πΈ π ( π½ β πΈ π ) β k β€ , that is, π π (cid:0) k πΌ β πΈ π ( π½ β πΈ π ) β k β€ (cid:1) β₯ β πΏ. Finally, we obtain from Proposition E.1 that π π (cid:0) k πΌ β πΈ π ( π½ β πΈ π ) β k β€ (cid:1) β₯ π β π π΄ (cid:0) k πΌ β πΈ π ( π½ β πΈ π ) β k β€ (cid:1) β π TV (cid:16) π π , π β π π΄ (cid:17) β₯ β πΏ. In other words, with the same choice of π , the output of π steps of Ojaβs algorithm with π΄ , . . . , π΄ π drawn i.i.d from the original distribution π π΄ satisο¬es k πΌ β πΈ π ( π½ β πΈ π ) β k β€ with probability at least β πΏ . Appendix F. Phase I succeeds if π is large In this section, we prove Theorem D.5 when π > πΆ π π πΏ Β― π π . Note that this value of π is far largerthan the optimal choice (which is of order Λ Ξ ( π / πΏ Β― π π ) ), which makes the theorem much easier toprove. Indeed, if π is this large, we can prove Theorem D.5 directly by using the same conditioningargument as in Phase II. Proposition F.1.
Assume π and π satisfy the requirements of Theorem D.5, and assume π β₯ π π / .If π β₯ πΆ π π πΏ Β― π π , then k πΎ π k β€ π with probability at least β πΏ / .Proof. Set πΎ = πΆ πΎ ππΏ where πΆ πΎ is deο¬ned in Lemma H.4 and deο¬ne the good events G : = {k πΎ k β€ πΎ /(β )} G π : = {k πΎ k β€ πΎ } β© G π β , β π β₯ . In order to apply Theorem 3.1, we verify (3.1)First assumption. We have π = ππ ( + πΎ ) β€ πΆ π l og ( e π / πΏπ ) ππΎππ , and this quantity is smaller than / so long as πΆ π β₯ πΆ π πΆ πΎ . (F.1)Second assumption. We again have π k π΄ k = πΆ π l og ( e π / πΏπ ) k π΄ k ππ , and (F.1) guarantees that this quantity is smaller than / as well.Third assumption. Since π π = π for all π and ππ β₯ , this requirement trivially holds. Our goal is to bound P (cid:8) k πΎ π k β₯ π (cid:9) β€ P (cid:8) k πΎ π π k β₯ π (cid:9) + P (cid:8) G πΆ (cid:9) + Γ π π = P n G πΆπ β© G π β o . Having veriο¬ed (3.1), we can employ (3.2), obtaining k πΎ π π k π,π β€ e β π ππ π π / π πΎ / + ( πΆ πΎ + πΆ ) ππ / π π π . For the ο¬rst term, the fact that π π β₯ π / implies that e β π ππ π πΎ = ( πΏπ / e π ) πΆ π / πΎ , and this is smaller than π as long as πΆ π β₯ + og ( πΆ πΎ ) . Letting πΆ be as in Proposition 4.1 and choosing π = l og ( π / ππΏ ) , we also have π ( πΆ πΎ + πΆ ) π π β€ π πΆ πΆ π l og ( e π / πΏπ ) π πΎ π π β€ πΆ πΆ π πΆ πΎ l og ( π / πΏπ ) πΆ π Β· πΏπ π Since l og ( π / πΏπ ) β€ ππΏπ for all positive π , πΏ , and π , this quantity will be less than π so long as πΆ π β₯ ( πΆ πΆ π πΆ πΎ ) , (F.2)and this requirement subsumes (F.1).We therefore obtain, for π = l og ( π / πΏ ) , P (cid:8) k πΎ π k β₯ π (cid:9) β€ π β π k πΎ π k ππ,π β€ π e β π β€ πΏ / , In a similar way, (3.2) yields for all π‘ β [ π ] , πΎ β k πΎ π‘ π‘ β k π,π β€ π / π + ( πΆ πΎ + πΆ ) ππ / π π π . 
If we choose π = l og ( ππ / πΏ ) , then we have π ( πΆ πΎ + πΆ ) π π β€ π πΆ πΆ π l og ( e π / πΏπ ) π πΎ π π β€ πΆ πΆ π πΆ πΎ l og ( π ) πΆ π π / , and since log ( π ) β€ π / for all π , we have that this quantity will be at most if πΆ π β₯ ( πΆ πΆ π πΆ πΎ ) / , and this requirement subsumes (F.2), and it holds under the assumptions of Theorem D.5.By Lemma H.4, the event G holds with probability at least β πΏ / .Finally, we have for any π β [ π ] , P n G πΆπ β© G π β o β€ P (cid:8) k πΎ π π β k β₯ πΎ (cid:9) β€ inf π β₯ πΎ β π k πΎ π‘ π‘ β k ππ,π , and choosing π = log ( ππ / πΏ ) we have πΎ β π k πΎ π‘ π‘ β k ππ,π β€ π e β π β€ πΏπ , and summing these probabilities for π β [ π ] , yields that P (cid:8) k πΎ π k β₯ π (cid:9) β€ P (cid:8) k πΎ π π k β₯ π (cid:9) + P (cid:8) G πΆ (cid:9) + Γ π π = P n G πΆπ β© G π β o β€ + + = , as claimed. (cid:3) TREAMING π -PCA: EFFICIENT GUARANTEES FOR OJAβS ALGORITHM, BEYOND RANK-ONE UPDATES 25 Appendix G. Omitted proofs
G.1.
Proof of Lemma 2.7.
We will show that πΎ π‘ ( I β π« π‘ ) = π― π‘ + π± π‘, + π± π‘, , where π― π‘ = πΌ β ( I + π π‘ π΄ ) π π‘ β ( π½ β ( I + π π‘ π΄ ) π π‘ β ) β , π± π‘, = b π« π‘ β π― π‘ π« π‘ , and π± π‘, = β b π« π‘ π« π‘ and where we write b π« π‘ = π π‘ πΌ β ( π¨ π‘ β π΄ ) π π‘ β ( π½ β ( I + π π‘ π΄ ) π π‘ β ) β . By the deο¬nition of π π‘ , we have πΎ π‘ = πΌ β π π‘ ( π½ β π π‘ ) β = πΌ β π π‘ π π‘ β ( π½ β π π‘ π π‘ β ) β . We have π½ β π π‘ π π‘ β = π½ β ( I + π π‘ π΄ ) π π‘ β + π π‘ π½ β ( π¨ π‘ β π΄ ) π π‘ β = (cid:0) I + π π‘ π½ β ( π¨ π‘ β π΄ ) π π‘ β ( π½ β ( I + π π‘ π΄ ) π π‘ β ) β (cid:1) π½ β ( I + π π‘ π΄ ) π π‘ β = ( I + π« π‘ ) π½ β ( I + π π‘ π΄ ) π π‘ β , which implies ( π½ β π π‘ π π‘ β ) β ( I β π« π‘ ) = ( π½ β ( I + π π‘ π΄ ) π π‘ β ) β ( I + π« π‘ ) β ( I + π« π‘ ) ( I β π« π‘ ) = ( π½ β ( I + π π‘ π΄ ) π π‘ β ) β ( I β π« π‘ ) . We also have πΌ β π π‘ π π‘ β = πΌ β ( I + π π‘ π΄ ) π π‘ β + π π‘ πΌ β ( π¨ π‘ β π΄ ) π π‘ β = πΌ β ( I + π π‘ π΄ ) π π‘ β + b π« π‘ ( π½ β ( I + π π‘ π΄ ) π π‘ β ) . Therefore πΎ π‘ ( I β π« π‘ ) = πΌ β π π‘ π π‘ β ( π½ β π π‘ π π‘ β ) β = πΌ β ( I + π π‘ π΄ ) π π‘ β ( π½ β ( I + π π‘ π΄ ) π π‘ β ) β + b π« π‘ β πΌ β ( I + π π‘ π΄ ) π π‘ β ( π½ β ( I + π π‘ π΄ ) π π‘ β ) β π« π‘ β b π« π‘ π« π‘ . That is πΎ π‘ ( I β b π« π‘ ) = π― π‘ + π± π‘, + π± π‘, . Since π« π‘ and b π« π‘ are both π ( π π‘ ) , the claim follows. (cid:3) G.2.
Proof of Proposition 2.9.
By the triangle inequality, we have
‖X + Y + Z‖_{p,q} ≤ ‖X + Y‖_{p,q} + ‖Z‖_{p,q},
which implies
‖X + Y + Z‖²_{p,q} ≤ (‖X + Y‖_{p,q} + ‖Z‖_{p,q})² ≤ (1 + s)(‖X + Y‖²_{p,q} + s⁻¹‖Z‖²_{p,q}),
where in the second step we have applied the elementary inequality (a + b)² ≤ (1 + s)(a² + s⁻¹b²), valid for all real numbers a and b and all s > 0. Applying Proposition 2.8 to ‖X + Y‖_{p,q} then yields the claim. ∎

Appendix H. Additional Lemmas
Lemma H.1.
For any deterministic matrices A, B and any standard Gaussian matrix G of suitable sizes, it holds that
P{‖AGB‖ ≥ ‖A‖_F ‖B‖_F (1 + t)} ≤ e^{−t²/2}.
Let f(X) := ‖AXB‖; then |f(X₁) − f(X₂)| ≤ ‖A‖ ‖B‖ · ‖X₁ − X₂‖_F ≤ ‖A‖_F ‖B‖_F · ‖X₁ − X₂‖_F. By Gaussian concentration, we therefore have
P{f(G) ≥ E f(G) + ‖A‖_F ‖B‖_F t} ≤ e^{−t²/2}.
Moreover, we have E f(G) ≤ (E ‖AGB‖_F²)^{1/2} = ‖A‖_F ‖B‖_F. It thus follows that
P{f(G) ≥ ‖A‖_F ‖B‖_F (1 + t)} ≤ P{f(G) ≥ E f(G) + ‖A‖_F ‖B‖_F t} ≤ e^{−t²/2},
which is the stated result. ∎

Lemma H.2 ([6, Theorem II.13]). Let Q ∈ ℝ^{a×b} be a standard Gaussian matrix. Then
P{‖Q‖ ≥ √a + √b + t} ≤ e^{−t²/2}.

Lemma H.3 ([1, Lemma i.A.3]). Let Q ∈ ℝ^{a×a} be a standard Gaussian matrix. Then for every δ ∈ (0, 1),
P{‖Q⁻¹‖ ≥ √a/δ} ≤ δ.

The next lemma bounds the probability of G from below.

Lemma H.4.
Let G be the event deο¬ned in (5.2) . There exists a positive constant πΆ πΎ = such thatfor any πΏ β ( , ) , if πΎ β₯ πΆ πΎ min { p π log ( e ππ / πΏ )/ πΏ, π / πΏ } , then G holds with probability at least β πΏ .Proof. We have πΎ = πΌ β π ( π½ β π ) β , where π is a matrix with i.i.d. Gaussian entries. Since πΌ and π½ have orthonormal columns and are themselves orthogonal, the two matrices π½ β π and πΌ β π areindependent matrices with i.i.d. Gaussian entries. Using Lemma H.1 and conditioning on π½ β π , wehave that with probability at least β πΏ / ( π + ) ,max π¬ β E π,β k π¬πΌ β π ( π½ β π ) β k β€ k ( π½ β π ) β k Β· p β l og ( e ππ / πΏ ) , (H.1)where we have taken a union bound over the fewer than ( ( π + ) ( π + )) β elements of E π,β . Takinga uniform bound again over all π, β β [ π + ] yields that, with probability at least β πΏ / , the event(H.1) holds for all π, β β [ π + ] . By Lemma H.3, we also have that that k ( π½ β π ) β k β€ β π / πΏ withprobability at least β πΏ / . Furthermore, Lemma H.2 implies that k πΌ β π k β€ p π l og ( / πΏ ) withprobability at least β πΏ / . Combining these bounds, we obtain that with probability at least β πΏ / ,max π¬ β E π,β k π¬πΌ β π ( π½ β π ) β k β€ p β l og ( e ππ / πΏ ) , which is less than β βπΎ β as long as πΆ πΎ β₯ , and under this same assumption k πΎ k β€ k πΌ β π k k ( π½ β π ) β k β€ p π l og ( / πΏ ) β€ β ππΎ as well. TREAMING π -PCA: EFFICIENT GUARANTEES FOR OJAβS ALGORITHM, BEYOND RANK-ONE UPDATES 27 So G holds with probability at least β πΏ if πΎ β₯ πΆ πΎ p π l og ( e ππ / πΏ )/ πΏ for πΆ πΎ β₯ .On the other hand, We have E k πΌ β π k β€ β π , so that k πΌ β π k β€ β π / πΏ with probability at least β πΏ / , and Lemma H.3 implies that k π½ β π k β€ β π / πΏ with probability at least β πΏ / , so withprobability at least β πΏ we have k πΎ k β€ k πΌ β π k k ( π½ β π ) β k β€ β ππ / πΏ < π / πΏ . as claimed. On this event, we also have k π¬πΎ k β€ k πΎ k β€ π / πΏ . 
Therefore, if πΎ β₯ β π / πΏ , then G holds. So G holds with probability at least 1 − πΏ if πΎ β₯ πΆ πΎ π / πΏ for πΆ πΎ β₯ β . Therefore, taking πΆ πΎ = satisfies both requirements and proves the claim. ∎

References

[1] Z. Allen-Zhu and Y. Li. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. In 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2017), pages 487–492. IEEE Computer Soc., Los Alamitos, CA, 2017.
[2] M. Balcan, S. S. Du, Y. Wang, and A. W. Yu. An improved gap-dependency analysis of the noisy power method. In Feldman et al. [10], pages 284–309.
[3] M. Balcan and K. Q. Weinberger, editors. Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings. JMLR.org, 2016.
[4] A. Balsubramani, S. Dasgupta, and Y. Freund. The fast convergence of incremental PCA. In Burges et al. [5], pages 3174–3182.
[5] C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, editors. Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013.
[6] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. In Handbook of the geometry of Banach spaces, Vol. I, pages 317–366. North-Holland, Amsterdam, 2001.
[7] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7:1–46, 1970.
[8] X. V. Doan and S. Vavasis. Finding the largest low-rank clusters with Ky Fan 2-k-norm and ℓ1-norm. SIAM J. Optim., 26(1):274–312, 2016.
[9] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20(2):303–353, 1999.
[10] V. Feldman, A. Rakhlin, and O. Shamir, editors. Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, volume 49 of JMLR Workshop and Conference Proceedings. JMLR.org, 2016.
[11] D. Freedman. A remark on the difference between sampling with and without replacement. J. Amer. Statist. Assoc., 72(359):681, 1977.
[12] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition, 1996.
[13] M. Hardt and E. Price. The noisy power method: a meta algorithm with applications. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2861–2869, 2014.
[14] A. Henriksen and R. Ward. AdaOja: adaptive learning rates for streaming PCA. May 2019, arXiv:1905.12115.
[15] A. Henriksen and R. Ward. Concentration inequalities for random matrix products. Linear Algebra Appl., 594:81–94, 2020.
[16] D. Huang, J. Niles-Weed, J. A. Tropp, and R. Ward. Matrix concentration for products. March 2020, arXiv:2003.05437.
[17] P. Jain, C. Jin, S. M. Kakade, P. Netrapalli, and A. Sidford. Streaming PCA: matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm. In Feldman et al. [10], pages 1147–1164.
[18] I. T. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer-Verlag, New York, second edition, 2002.
[19] A. Juditsky and A. S. Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. September 2008, arXiv:0809.0813.
[20] C. Li, H. Lin, and C. Lu. Rivalry of two families of algorithms for memory-restricted streaming PCA. In A. Gretton and C. C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, volume 51 of JMLR Workshop and Conference Proceedings, pages 473–481. JMLR.org, 2016.
[21] C. J. Li, M. Wang, H. Liu, and T. Zhang. Near-optimal stochastic approximation for online principal component estimation. Math. Program., 167(1, Ser. B):75–97, 2018.
[22] C.-K. Li and N.-K. Tsing. Some isometries of rectangular complex matrices. Linear and Multilinear Algebra, 23(1):47–53, 1988.
[23] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. In Burges et al. [5], pages 2886–2894.
[24] A. Naor. On the Banach-space-valued Azuma inequality and small-set isoperimetry of Alon–Roichman graphs. Combinatorics, Probability and Computing, 21(4):623–634, 2012.
[25] E. Oja. A simplified neuron model as a principal component analyzer. J. Math. Biol., 15(3):267–273, 1982.
[26] E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1):69–84, 1985.
[27] C. De Sa, C. Ré, and K. Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. In F. R. Bach and D. M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2332–2341. JMLR.org, 2015.
[28] O. Shamir. Convergence of stochastic gradient descent for PCA. In Balcan and Weinberger [3], pages 257–265.
[29] O. Shamir. Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In Balcan and Weinberger [3], pages 248–256.
[30] M. Simchowitz, A. El Alaoui, and B. Recht. Tight query complexity lower bounds for PCA via finite sample deformed Wigner law. In
STOCβ18βProceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing , pages1249β1259. ACM, New York, 2018.[31] J. A. Tropp. User-friendly tail bounds for sums of random matrices.
Found. Comput. Math. , 12(4):389β434, 2012.[32] J. A. Tropp. An introduction to matrix concentration inequalities.
Foundations and Trends in Machine Learning , 8(1-2):1β230, 2015.[33] P.-A. Wedin. Perturbation bounds in connection with singular value decomposition.