[PDF] Matrix Completion from O(n) Samples in Linear Time

Abstract

We consider the problem of reconstructing a rank- k n×n matrix M from a sampling of its entries. Under a certain incoherence assumption on M and for the case when both the rank and the condition number of M are bounded, it was shown in \cite{CandesRecht2009, CandesTao2010, keshavan2010, Recht2011, Jain2012, Hardt2014} that M can be recovered exactly or approximately (depending on some trade-off between accuracy and computational complexity) using O(npoly(logn)) samples in super-linear time O( n a poly(logn)) for some constant a≥1 . In this paper, we propose a new matrix completion algorithm using a novel sampling scheme based on a union of independent sparse random regular bipartite graphs. We show that under the same conditions w.h.p. our algorithm recovers an ϵ -approximation of M in terms of the Frobenius norm using O(n log 2 (1/ϵ)) samples and in linear time O(n log 2 (1/ϵ)) . This provides the best known bounds both on the sample complexity and computational complexity for reconstructing (approximately) an unknown low-rank matrix. The novelty of our algorithm is two new steps of thresholding singular values and rescaling singular vectors in the application of the "vanilla" alternating minimization algorithm. The structure of sparse random regular graphs is used heavily for controlling the impact of these regularization steps.

Full PDF

aa r X i v : . [ s t a t . M L ] A ug Matrix Completion from O ( n ) Samples in Linear Time David Gamarnik ∗ Quan Li † Hongyi Zhang ‡ Abstract

We consider the problem of reconstructing a rank- k n × n matrix M from a sampling of itsentries. Under a certain incoherence assumption on M and for the case when both the rank and thecondition number of M are bounded, it was shown in [CR09, CT10, KMO10, Rec11, JNS12, Har14]that M can be recovered exactly or approximately (depending on some trade-oﬀ between accuracyand computational complexity) using O ( n poly(log n )) samples in super-linear time O ( n a poly(log n ))for some constant a ≥ ǫ -approximation of M in terms of the Frobenius normusing O ( n log (1 /ǫ )) samples and in linear time O ( n log (1 /ǫ )). This provides the best known boundsboth on the sample complexity and computational complexity for reconstructing (approximately) anunknown low-rank matrix.The novelty of our algorithm is two new steps of thresholding singular values and rescaling singularvectors in the application of the “vanilla” alternating minimization algorithm. The structure of sparserandom regular graphs is used heavily for controlling the impact of these regularization steps. We consider the problem of reconstructing a hidden rank- k matrix from a sampling of its entries.Speciﬁcally, consider an n × n matrix M . The goal is to design a sampling index set Ω ⊆ [ n ] × [ n ]such that M can be reconstructed eﬃciently from the entries in M associated with Ω, that is, fromthe entries M ij , ( i, j ) ∈ Ω, with the cardinality | Ω | as small as possible. The problem has a wide rangeof applications in recommendation systems, system identiﬁcation, global positioning, computer vision ,etc. [CP10].For the convenience of discussing various matrix completion results and comparing them to ourresults, we will assume in the discussion below that the rank k , condition number κ and the incoherenceparameter µ of M (appropriately deﬁned) are bounded in n . The problem of reconstructing M un-der uniform sampling received considerable attention in recent years. One research direction of matrixcompletion under this sampling scheme focuses on the exact recovery of M . Recht [Rec11] and Gross[Gro11] showed that M can be reconstructed exactly from O ( n log n ) samples using trace-norm basedoptimization. Keshavan et al. [KMO10] showed that M can be reconstructed exactly from O ( n log n )samples using singular value decomposition (SVD) followed by gradient descent on Grassmanian mani-fold. Another research direction of matrix completion under uniform sampling pays more attention tothe eﬃciency of the algorithm, and only requires approximate matrix completion. Jain et al. [JNS12]showed that an ǫ -approximation (appropriately deﬁned) of M in the Frobenius norm can be recon-structed from O ( n log n log(1 /ǫ )) samples using alternating minimization in O ( n log n log(1 /ǫ )) time. ∗ MIT; e-mail: [email protected] † MIT; e-mail: [email protected] ‡ MIT; e-mail: [email protected]

Accepted for presentation at Conference on Learning Theory (COLT) 2017 † ’are for exact matrix completion while the remaining methods without ‘ † ’ are for approximate matrixcompletion. The methods with superscript symbols ‘ ∗ ’ are under stronger incoherence assumption thanthe standard incoherence assumption (Assumption 1, appropriately deﬁned) while others are under thestandard incoherence assumption. ǫ is the tolerance such that the reconstructed matrix ˜ M satisﬁes k M − ˜ M k F ≤ ǫ k M k F w.h.p. ˜ O notation hides factors polynomial in k , κ and µ .Methods Sample Complexity Running Time[KMO10] † , ∗ O (cid:0) κ µ kn (cid:0) log n + κ µ k (cid:1)(cid:1) O (cid:0) κ µ k n log n (cid:0) log n + κ µ k (cid:1)(cid:1) [Rec11] † , ∗ , [Gro11] † , ∗ O (cid:0) µ kn log n (cid:1) O ( n log n/ √ ǫ ) or O ( n log(1 /ǫ ))[Che15] † O (cid:0) µ kn log n (cid:1) O ( n log n/ √ ǫ ) or O ( n log(1 /ǫ ))[SL16] † O (cid:0) κ µ kn (cid:0) log n + µ k κ (cid:1)(cid:1) ˜ O (poly( n ) log( ǫ ))[ZL16] † O (cid:0) κ µ k n (log n + µ ) (cid:1) O (cid:0) kn log( ǫ ) (cid:1) [BLWZ17] † O (cid:0) κ µ kn log n log κ n (cid:1) O (cid:16) n ǫ (cid:17) [JNS12] O (cid:0) κ µ k . log (cid:0) kǫ (cid:1) n log n (cid:1) O (cid:0) κ µ k . n log n log (cid:0) kǫ (cid:1)(cid:1) [Har14] O (cid:16) k M ∗ k F ( σ ∗ k ) µ kn (cid:0) log (cid:0) nǫ (cid:1) + k (cid:1)(cid:17) O (cid:16) k M ∗ k F ( σ ∗ k ) µ k n (cid:0) log (cid:0) nǫ (cid:1) + k (cid:1)(cid:17) [ZWL15] O (cid:0) κ µ k n log n log( ǫ ) (cid:1) O (cid:0) κ µ k n log n log( ǫ ) (cid:1) Ours O (cid:0)(cid:0) κ µ k + µ k log (cid:0) ǫ (cid:1)(cid:1) n log( ǫ ) (cid:1) O ( (cid:0) κ µ k + µ k log (cid:0) ǫ (cid:1)(cid:1) n log( ǫ ))Then Hardt [Har14] reﬁned the analysis of alternating minimization and improved the sample complex-ity to O ( n log( n/ǫ )). With extensive research on this subject, it is tempting to believe that the samplecomplexity obtained by [JNS12] or [Har14] are optimal (up to a constant factor) for ǫ -approximationof matrix completion as well. Perhaps surprisingly, we establish that this is not the case and proposea new algorithm, which constructs an ǫ -approximation of M in Frobenius norm using O ( n log (1 /ǫ ))samples in linear time O ( n log (1 /ǫ )). The comparison of various matrix completion methods is givenin Table 1. In order to compare various methods for exact and approximate matrix completion, thecriterion k M − ˜ M k F ≤ ǫ k M k F is used where ǫ is the tolerance, ˜ M is the reconstructed matrix and k · k F is the Frobenius norm.Our proposed algorithm adds two new steps: a thresholding of singular values and a rescaling ofsingular vectors upon the “vanilla” alternating minimization algorithm. The idea behind these stepsis regularization of the least square estimation in the form of the singular value thresholding. Thesingular value thresholding step is necessary due to the decreased sample complexity. More speciﬁcally,due to the decreased sample complexity by a logarithmic factor log n , certain matrices inverted in eachstep of the alternating minimization algorithm may become ill-conditioned. Our algorithm avoids this2ll-conditioning problem by adding to the “vanilla” alternating minimization an extra step of singularvalue thresholding applied to these matrices (i.e. the Gramian matrices inverted in (10) and (12))before their inversion. This extra singular value thresholding step enforces that the singular values ofthe Gramian matrices inverted in (10), (11), (12) and (13) deviate from their expected values by atmost 1 − β after proper normalization, and as a result, guarantees the nonsingularity of these (adjusted)Gramian matrices. We call this algorithm Thresholded Alternating Minimization ( T AM ), referring tothe extra singular value thresholding steps added to alternating minimization. A rescaling of the entriesof singular vectors is also implemented in the

T AM algorithm in order to maintain the proximity toincoherence. A more speciﬁc discussion of the intuition behind these two new steps appears after theintroduction of the

T AM algorithm (in Pages 7 and 8).We restrict our attention to the case of bounded rank, bounded condition number and boundedincoherence parameter of M , for the convenience of the analysis. Most of the work in this paper isto prove the following result: with high probability (w.h.p.) T AM produces a 1 ± ǫ multiplicativeapproximation of M in Frobenius norm using O ( n log (1 /ǫ )) samples under the standard incoherenceAssumption 1, given in Section 2. For simplicity, we call this just ǫ -approximation. Let M = U ∗ Σ ∗ ( V ∗ ) T and U be the input to one of the iterations of T AM . Also, let γ be the distance between the subspacesspanned by U ∗ and U , appropriately deﬁned later. We further establish that the number of times thatthe singular value thresholding is applied per one iteration of T AM is bounded above by a function of γ ,which is monotonically decreasing as γ decreases. The novel bounding technique we used for establishingthis result is based on random graph theory. More speciﬁcally, the detailed structure of sparse randomregular graphs is used heavily on controlling the impact of regularization, i.e. the number of times thatthe singular value thresholding steps are applied per one iteration of T AM algorithm. This result issummarized in Theorem 4.7. We use it as a key result in establishing the geometric convergence of

T AM . The analysis of our algorithm is substantially diﬀerent from the one in [JNS12], due to thiscritical singular value thresholding step. Although the proof of our main result seems involved, most ofthe proof steps use elementary linear algebraic derivations and are easy to follow.For the convenience of analysis,

T AM employs a sampling generated from a union of independentrandom bipartite regular graphs. Although our results of

T AM are established on this special sampling,

T AM can be generalized to uniform sampling in the obvious manner and similar results of

T AM underuniform sampling can be established accordingly. In fact, by considering Poisson cloning model [Kim06]for Erd¨os-R´enyi graphs, (which we intend to research in future), we conjecture that the same samplecomplexity of

T AM might hold for constructing an ǫ -approximation of M in Frobenius norm underuniform sampling. There is no contradiction between the information theoretic lower bound O ( n log n )for exact matrix completion and this conjecture, due to its approximate nature. Other sampling schemesfor matrix completion are also studied in [MJD09, KTT15, PABN16].Bhojanapalli and Jain [BJ14] showed that if the index set of the sampled entries corresponds to abipartite graph with large spectral gap, then the trace-norm based optimization exactly reconstructs M that satisﬁes certain stricter incoherence assumptions (Assumption 1 and condition (6), see below).In particular, they showed that the trace-norm based optimization exactly reconstructs M for δ ≤ / O ( k n ) samples. Furthermore, they raised a question of studying alternating minimizationunder the same incoherence assumptions, in the hope of achieving similar sample complexity. Our secondresult answers this question for the case of constant k : w.h.p. T AM under incoherence Assumptions 1and 2 produces an ǫ -approximation of M in Frobenius norm using O ( n log(1 /ǫ )) samples. Furthermore,this result requires a less stringent incoherence condition (Assumption 2) on M than the incoherencecondition (6), and furthermore holds for all δ ∈ (0 ,

1) satisfying condition (5) in Assumption 2 whilethe result in [BJ14] holds for all δ ∈ (0 , /

6] satisfying condition (6).

T AM maintains the computational complexity of alternating minimization, which is O ( | Ω | ) forbounded k . T AM only requires O ( n log (1 /ǫ )), or O ( n log(1 /ǫ )) samples, depending on whether As-3umption 1 or both Assumptions 1 and 2 are satisﬁed, respectively. Hence, T AM is a linear algorithmof computational complexity O ( n log (1 /ǫ )) or O ( n log(1 /ǫ )). Like alternating minimization, T AM hascomputational eﬃciency advantage over trace-norm based optimization, which requires time O ( n log n/ √ ǫ ) using the singular value thresholding algorithm [CCS10] or O ( n log(1 /ǫ )) using interior point meth-ods. More speciﬁc computational complexity comparison between trace-norm based optimization andalternating minimization is given in [JNS12].The remainder of the paper is structured as follows. In the next section, we deﬁne the problem ofmatrix completion and state necessary assumptions. In Section 3, we introduce the random d -regulargraph model of Ω and formally state our two main results: the one regarding the performance of T AM under the incoherence Assumption 1 and the one regarding the performance of

T AM under theincoherence Assumptions 1 and 2. Section 4 is devoted to the proof of two main results. We concludein Section 5 with some open questions.We close this section with some notational conventions. We use standard notations o ( · ), O ( · ) andΩ( · ) with respect to n → ∞ . Let σ i ( A ) be the i -th largest singular value of matrix A and σ min ( A ) bethe least singular value of matrix A . Let k A k be the spectral norm (largest singular value) of matrix A and k A k F be the Frobenius norm of matrix A. Let A T be the transpose of a vector or matrix A .For a ∈ N , let [ a ] be a set of indices { , , . . . , a } . Let k ∈ N be the rank of matrix M . For a matrix U ∈ R n × k , let u Ti , i ∈ [ n ], be the i -th row of U where u i ∈ R k × is a column vector. Also, let Span( U )be the subspace spanned by the k columns of U . For a matrix A ∈ R n × n , let SVD( A, k ) ∈ R n × k be thematrix consisting of the top- k left singular vectors of the matrix A . Let h x, y i be the inner product oftwo vectors x and y , and ⌈ z ⌉ be the smallest integer no less than z . We say that a sequence of events E n occurs w.h.p. if P ( E n ) → n → ∞ . Given l ≤ n , we call a matrix A ∈ R n × l with orthonormalcolumns (column-)orthonormal matrix. A QR decomposition of a matrix A ∈ R n × k is A = QR where Q ∈ R n × k is an orthonormal matrix and R ∈ R k × k is an upper triangular matrix. We include thefollowing list of matrix inequalities to be used later. Given a matrix A of rank l k A k F ≤ √ l k A k . (1)Given two matrices A and B k AB k F ≤ k A k k B k F . (2)Give matrices A, B ∈ R n × n , the following Ky Fan singular value inequality [Mos12] holds σ r + t +1 ( A + B ) ≤ σ r +1 ( A ) + σ t +1 ( B ) (3)for t ≥ , r ≥ r + t + 1 ≤ n . Let M ∈ R n × m be a rank- k matrix and M = U ∗ Σ ∗ ( V ∗ ) T be its SVD where the singular values are σ ∗ ≥ σ ∗ . . . ≥ σ ∗ k in decreasing order. The entries in M associated with the index set Ω ⊆ [ n ] × [ m ] areobserved, that is, the entries M ij , ∀ ( i, j ) ∈ Ω, are known. Deﬁne the sampling operator P Ω : R n × m → R n × m by P Ω ( M ) = (cid:26) M ij if ( i, j ) ∈ Ω , i, j ) / ∈ Ω . Let V R and V C be the sets of rows and columns of matrix M , respectively, indexed by the sets { , , . . . , n } and { , , . . . , m } . Also, let G = ( V , E ) be a bipartite undirected graph on the vertex set4 = V R ∪ V C with edge set E ∋ ( i, j ) if and only if ( i, j ) ∈ Ω. Our goal is to obtain an ǫ -approximationof the matrix M from the observed P Ω ( M ).For the rest of the paper, we will assume for simplicity that m = n . Our results can be easilyextended to the more general case m = Θ( n ), using the generalization as in the appendix D of [Har14].We say a graph is a random bipartite d -regular graph G d ( n, n ) if it is chosen uniformly at random fromall bipartite d -regular graphs with n vertices { , , . . . , n } on the left and another n vertices { , , . . . , n } on the right. Let G n ∈ R n × n be the bi-adjacency matrix of G d ( n, n ) with the ( i, j ) entry ( G n ) ij = 1if and only if there is an edge between vertex i on the left and vertex j on the right in G d ( n, n ) and( G n ) ij = 0 otherwise. For our proposed algorithm, we choose G to be a union of several independentrandom bipartite d -regular graphs G d ( n, n ). Two essential properties of the random bipartite d -regulargraph are • P G n is [1 / √ n, / √ n, . . . , / √ n ] T . • P σ ( G n ) = d . As discussed below, w.h.p. the second largestsingular value σ ( G n ) is upper bounded by (7 √ d ) / d ≥ G d ( n, n ) are ∪ ni =1 {− σ i ( G n ) , σ i ( G n ) } . Corollary1.6 in [Pud15] states that alongside the two trivial eigenvalues ± d , all other eigenvalues of the adjacencymatrix of the graph G d ( n, n ) are within [ − √ d − − . , √ d − .

84] w.h.p. as n → ∞ . For d ≥ √ d − . ≤ (7 √ d ) / P d -regular graph G d ( n, n ) in expected running time O ( nd ).Let u ∗ ,Ti , i ∈ [ n ], be the i -th row of U ∗ and v ∗ ,Tj , j ∈ [ n ], be the j th row of V ∗ . Now we present theincoherence assumptions on M . • Assumption . There exists a constant µ ≥ k u ∗ i k ≤ µ kn , ∀ i ∈ [ n ] and k v ∗ j k ≤ µ kn , ∀ j ∈ [ n ] . (4) • Assumption . Given the degree d of G d ( n, n ), let S n be a subset of [ n ] chosen uniformly atrandom from all the subsets of [ n ] with cardinality d . There exists a constant δ ∈ (0 ,

1) such that P ( k X i ∈ S n nd u ∗ i u ∗ ,Ti − I k ≤ δ ) = 1 − o (1) and P ( k X j ∈ S n nd v ∗ j v ∗ ,Tj − I k ≤ δ ) = 1 − o (1) . (5) where Assumption 1 is the standard incoherence condition assumed by most of existing low-rank ma-trix completion results [CR09, KMO10, JNS12, Har14] etc. We call Assumption 2 the probabilisticgeneralized restricted isometry condition, which is strictly weaker, for example, than the incoherenceassumption A (cid:13)(cid:13)(cid:13)(cid:13) X i ∈ S n nd u ∗ i u ∗ ,Ti − I (cid:13)(cid:13)(cid:13)(cid:13) ≤ δ and (cid:13)(cid:13)(cid:13)(cid:13) X j ∈ S n nd v ∗ j v ∗ ,Tj − I (cid:13)(cid:13)(cid:13)(cid:13) ≤ δ, (6)for δ ≤ / S n , S n ⊂ [ n ] of cardinality | S n | = | S n | = d while the probabilistic generalizedrestricted isometry condition (5) requires the inequalities above hold for majority of the subsets S n ⊂ [ n ]of cardinality | S n | = d and for majority of the subsets S n ⊂ [ n ] of cardinality | S n | = d .5 Main Results

We are about to present a new matrix completion algorithm and give recovery guarantees of the proposedalgorithm for two scenarios: matrix completion under Assumption 1, and matrix completion under bothAssumption 1 and Assumption 2. Furthermore, we will assume that Assumption 1 always holds, andthat the rank k , the condition number σ ∗ /σ ∗ k , and the incoherence parameter µ of the matrix M arebounded from above by a constant, as n → ∞ .Now we formally describe the matrix completion algorithm we propose in this paper and stateour main results. For the statement of our algorithm, we ﬁrst introduce two operators acting on thematrices. Deﬁne T : R k × → R × k by T ( u ) ,  q µ kn u T k u k k u k ≥ q µ kn ,u T k u k < q µ kn . (7)Speciﬁcally, the operator T normalizes the vector u of length at least 2 p µ k/n to the vector of thesame direction and of length p µ k/n . For the convenience of notation we extend T to the one actingon matrix U = ( u Ti , i ∈ [ n ]) ∈ R n × k by T ( U ) ,  T ( u )... T ( u n )  . Then it follows from the deﬁnition of T ( · ) in (7) that any row vector of T ( U ) has length at most2 p µ k/n .For A ∈ R d × k , let the SVD of A be A = U A Σ A ( V A ) T . We write Σ A in the form p d/n diag( σ , · · · , σ k ) where the diagonal entries σ , σ . . . , σ k ( σ ≥ σ . . . ≥ σ k ) are the singular values of A divided by p d/n . For a given a ∈ (0 ,

1) and ∀ i ∈ [ k ], let σ i,a =  σ i if σ i ∈ [ √ a, √ − a ] , √ a if σ i < √ a, √ − a if σ i > √ − a. Deﬁne T ( A, a ) by T ( A, a ) , U A ˆΣ A ( V A ) T (8)where ˆΣ A = p d/n diag( σ ,a , · · · , σ k,a ) and hence the entire σ ,a , · · · , σ k,a satisfy √ − a ≥ σ ,a ≥ σ ,a . . . ≥ σ k,a ≥ √ a. Speciﬁcally, the operator T lifts the normalized singular values in Σ A less than √ a to √ a and truncatesthe normalized singular values in Σ A more than √ − a to √ − a .Let Ω t ⊆ [ n ] × [ n ], t = 0 , , . . . , N , be the index sets associated with 2 N + 1 independent randombipartite d -regular graphs G d ( n, n ). Deﬁne RRG ( d, n, N ) as the random d -regular graph model of Ω,that is, RRG ( d, n, N ) , { Ω , Ω , · · · , Ω N } . (9)6et D be a subset of [ n ] with d entries, namely, D = { i , i , . . . , i d } . For a matrix U = ( u Ti , i ∈ [ n ]) ∈ R n × k , let its submatrix with the row indices in D and the column indices the same as U be U D =  u Ti ... u Ti d  . Let S t,Lj = { i ∈ [ n ] : ( i, j ) ∈ Ω t } , ∀ j ∈ [ n ]. Then | S t,Lj | = d . Namely, S t,Lj consists of all the leftneighbors of vertex j on the right in the random bipartite d -regular graph associated with the index setΩ t . Correspondingly given any a ∈ (0 ,

1) and any j ∈ [ n ], we denote ˆ U S t,Lj = T ( U S t,Lj , a ) and the rowin ˆ U S t,Lj associated with the index i ∈ S t,Lj by ˆ u t,Ti . Similarly, let S t,Ri = { j ∈ [ n ] : ( i, j ) ∈ Ω t } , ∀ i ∈ [ n ],that is, S t,Ri consists of all the right neighbors of vertex i on the left in the random bipartite d -regulargraph associated with the index set Ω t . Also, we have | S t,Ri | = d . For a matrix V ∈ R n × k and a given a ∈ (0 , V S t,Ri = T ( V S t,Ri , a ) and the row in ˆ V S t,Ri associated with the index j ∈ S t,Ri by ˆ v t,Tj .Now we introduce the algorithm T AM for matrix completion in the sparse regime. For the algo-rithm below we ﬁx arbitrary δ ∈ (0 ,

1) and we let β be any constant in (0 , − δ ). Thresholded Alternating Minimization algorithm (

T AM ) Input : Observed index sets

RRG ( d, n, N ) and values P ∪ Nt =0 Ω t ( M ) . Initialize : ¯ U = SVD ( nd P Ω ( M ) , k ) , i.e. top- k left singular vectors of nd P Ω ( M ) .Truncation step: ﬁrst apply T on ¯ U then orthonormalize the columns of T ( ¯ U ) . Denote the resultantorthonormal matrix by U = ( u ,Ti , ≤ i ≤ n ) . Loop : For t = 0 to N − For each j ∈ [ n ] :If nd σ l ( P i ∈ [ n ]:( i,j ) ∈ Ω t +1 u ti u t,Ti ) ∈ [ β, − β ] for all l ∈ [ k ] , then set ˜ v t +1 j =  X i ∈ [ n ]:( i,j ) ∈ Ω t +1 u ti u t,Ti  − X i ∈ [ n ]:( i,j ) ∈ Ω t +1 u ti M ij . (10) Otherwise let ˆ U tS t +1 ,Lj = T ( U tS t +1 ,Lj , β ) and ˜ v t +1 j =  X i ∈ [ n ]:( i,j ) ∈ Ω t +1 ˆ u ti ˆ u t,Ti  − X i ∈ [ n ]:( i,j ) ∈ Ω t +1 ˆ u ti M ij . (11) Let ˜ V t +1 = (˜ v t +1 ,Tj , ≤ j ≤ n ) and ˜ V t +1 = ¯ V t +1 R t +1 be the QR decomposition of ˜ V t +1 . Orthonormalize the columns of T ( ¯ V t +1 ) . Denote the resultant orthonormal matrixby V t +1 = ( v t +1 ,Tj , ≤ j ≤ n ) .For each i ∈ [ n ] :If nd σ l ( P j ∈ [ n ]:( i,j ) ∈ Ω N + t +1 v t +1 j v t +1 ,Tj ) ∈ [ β, − β ] for all l ∈ [ k ] , then set ˜ u t +1 i =  X j ∈ [ n ]:( i,j ) ∈ Ω N + t +1 v t +1 j v t +1 ,Tj  − X j ∈ [ n ]:( i,j ) ∈ Ω N + t +1 v t +1 j M ij . (12)7 therwise let ˆ V t +1 S N + t +1 ,Ri = T ( V t +1 S N + t +1 ,Ri , β ) and ˜ u t +1 i =  X j ∈ [ n ]:( i,j ) ∈ Ω N + t +1 ˆ v t +1 j ˆ v t +1 ,Tj  − X j ∈ [ n ]:( i,j ) ∈ Ω N + t +1 ˆ v t +1 j M ij . (13) Let ˜ U t +1 = (˜ u t +1 ,Tj , ≤ j ≤ n ) and ˜ U t +1 = ¯ U t +1 R N + t +1 be the QR decomposition of ˜ U t +1 . Orthonormalize the columns of T ( ¯ U t +1 ) . Denote the resultant orthonormal matrix by U t +1 = ( u t +1 ,Ti , ≤ i ≤ n ) . Output : Set U N − = ( u N − ,Ti , ≤ i ≤ n ) , ˜ V N = (˜ v N,Tj , ≤ j ≤ n ) . Output M N = U N − ˜ V N,T . Now we provide the intuition behind the algorithm. Given j ∈ [ n ] and a constant d , it is notguaranteed that at the t -th iteration of the alternating minimization algorithm U t,TS t +1 ,Lj U tS t +1 ,Lj = X i ∈ [ n ]:( i,j ) ∈ Ω t +1 u ti u t,Ti concentrates around its expectation E [ U t,TS t +1 ,Lj U tS t +1 ,Lj ] = 1 (cid:0) nd (cid:1) X D ∈{ S ⊂ [ n ]: | S | = d } U t,TD U tD = 1 (cid:0) nd (cid:1) (cid:0) nd (cid:1) dn X i ∈ [ n ] u ti u t,Ti = dn I. Some U t,TS t +1 ,Lj U tS t +1 ,Lj might be ill-conditioned, namely, its least singular value is 0 or closed to zero. If thematrix U t,TS t +1 ,Lj U tS t +1 ,Lj is ill-conditioned, the results from the iteration (10) in the “vanilla” alternatingminimization algorithm might blow up. To prevent this adversarial scenario, we use the operations T to lift the small singular values and truncate the large singular values of U tS t +1 ,Lj , ∀ j ∈ [ n ], before eachrow vector of ˜ V t +1 = (˜ v t +1 ,Tj , ≤ j ≤ n ) is computed. The convergence of the algorithm relies on thefact that w.h.p. the number of times the algorithm applies the operation T in each iteration is a smallfraction of n . We will elaborate this point later in Theorem 4.7. Also, the operators T are applied atthe end of each iteration to guarantee the incoherence of the input V t +1 (or U t +1 ) for the next iterationwhile maintaining that V t +1 (or U t +1 ) is still close enough to V ∗ (or U ∗ ).Our main result concerns the performance of the algorithm T AM under Assumption 1 and underboth Assumptions 1 and 2, respectively. We recall that

T AM is parameterized by δ and β . Theorem 3.1.

Suppose M ∈ R n × n is a rank- k matrix satisfying Assumption . Suppose the observedindex set Ω is sampled according to the model RRG ( d, n, N ) in (9). Given any δ ∈ (0 , , β ∈ (0 , − δ ) and ǫ ∈ (0 , / , there exists a C ( δ, β ) > such that for d ≥ C ( δ, β ) k µ (cid:18) σ ∗ σ ∗ k (cid:19) + 5 µ k (1 + δ/ δ log (cid:18) ǫ (cid:19) (14) and N ≥ ⌈ log( ǫ ) / log 4 ⌉ , the T AM algorithm produces a matrix M N satisfying k M − M N k F ≤ ǫ k M k F w.h.p. urthermore, suppose M satisﬁes both Assumptions and . Then for δ ∈ (0 , as deﬁned inAssumption and β ∈ (0 , − δ ) , the same result holds when d ≥ C ( δ, β ) k µ (cid:18) σ ∗ σ ∗ k (cid:19) , (15) for the same constant C ( δ, β ) in (14). Theorem 3.1 states that under Assumption 1 the

T AM algorithm produces a rank- k ǫ -approximationof matrix M using O ( dn log(1 /ǫ )) samples for d satisfying (14). Furthermore, under both Assump-tion 1 and Assumption 2 the T AM algorithm produces a rank- k ǫ -approximation of matrix M using O ( dn log(1 /ǫ )) samples for d satisfying (15).In terms of computational complexity, the cost in the initialization of T AM is mainly contributedby computing the top- k left singular vectors of a sparse matrix nd P Ω ( M ) ∈ R n × n , which requires time O ( k | Ω | ) by exploiting the sparsity of nd P Ω ( M ) [MHT10]. In each iteration t = 0 , , · · · , N −

1, the cost ismainly contributed by computing P i ∈ [ n ]:( i,j ) ∈ Ω t +1 u ti u t,Ti ∈ R k × k , ∀ j ∈ [ n ], P j ∈ [ n ]:( i,j ) ∈ Ω N + t +1 ˆ v t +1 j ˆ v t +1 ,Tj ∈ R k × k , ∀ i ∈ [ n ], at most n SVD of U tS t +1 ,Lj ∈ R d × k , ∀ j ∈ [ n ], and at most n SVD of V t +1 S N + t +1 ,Ri ∈ R d × k , ∀ i ∈ [ n ]. Each component of the ﬁrst two terms is the sum of d k -by- k matrices. Each matrix is theouter product of two k -by-1 vectors. Hence in each iteration it costs O ( dk n ) to compute the ﬁrst twoterms and O ( dk n ) to compute at most 2 n SVD of d -by- k matrices. By | Ω | = O ( dn ) and N chosen asthe lower bound given by Theorem 3.1, the overall cost for T AM algorithm is O ( k | Ω | ) + O ( dk nN ) = O ( dk log(1 /ǫ ) n ) . Choosing the lower bound of d given by (14) or (15) in Theorem 3.1, T AM algorithm runs in lineartime in n . T AM algorithm

The convergence of the

T AM algorithm requires a warm start point U close to the true U ∗ . To measurethe closeness between two subspaces spanned by two matrices, we introduce the following deﬁnition ofdistance between subspaces. Deﬁnition 4.1. [GVL12] Given any two matrices

X, Y ∈ R n × k , let ˆ X, ˆ Y ∈ R n × k be their correspondingorthonormal basis, and ˆ X ⊥ , ˆ Y ⊥ ∈ R n × ( n − k ) be any orthonormal basis of the orthogonal complement of ˆ X and ˆ Y . Then the distance between the subspaces spanned by the columns of X and Y is deﬁned by dist( X, Y ) , k ˆ X T ⊥ ˆ Y k . The range of dist( · , · ) is [0 , X, Y ) deﬁned above depends only on thespaces spanned by the columns of X and Y , that is, Span( X ) and Span( Y ). Furthermore,dist( X, Y ) = dist(

Y, X ) ⇒ k ˆ X T ⊥ ˆ Y k = k ˆ Y T ⊥ ˆ X k , (16) σ min ( ˆ X T ˆ Y ) + k ˆ X T ⊥ ˆ Y k = 1 , (17) k ˆ X T ⊥ ˆ Y k = k ˆ X ˆ X T − ˆ Y ˆ Y T k . (18)We refer to Theorem 2.6.1 in [GVL12] and its proof for the three properties above.We now obtain a bound on the distance dist( ¯ U , U ∗ ).9 emma 4.2. Let M be a rank- k matrix that satisﬁes Assumption . Also, let Ω be as deﬁned in RRG ( d, n, N ) in (9) and ¯ U = SV D ( nd P Ω ( M ) , k ) as deﬁned in the ﬁrst step of the T AM algorithm.For any

C > and d ≥ Ck µ ( σ ∗ /σ ∗ k ) , w.h.p. we have dist( ¯ U , U ∗ ) ≤ √ C k . (19)The proof of this lemma is similar to the proof of Lemma C.1. in [JNS12]. We give its proof in theAppendix A for completeness.While ¯ U is close enough to U ∗ , ¯ U might not be incoherent. Hence, T AM algorithm implementsthe operation T on ¯ U in the truncation step to obtain an incoherent warm start U for the iterationsafterward. Lemma 4.3.

Suppose U ∗ satisﬁes Assumption . Let ¯ U ∈ R n × k be an orthonormal matrix such that dist( ¯ U , U ∗ ) ≤ φk / for some φ ≥ √ √ − . Let ˆ U = T ( ¯ U ) , and U ∈ R n × k be an orthonormal basis of ˆ U .Also, let u Ti ∈ R × k , i ∈ [ n ] , be the i -th row of U . Then k u i k ≤ r µ kn ∀ i ∈ [ n ] , (20)dist( U, U ∗ ) ≤ √ φ . (21)This lemma states that by applying the operator T to ¯ U and then orthonormalizing ˆ U , the resultantmatrix U loses a factor √ k / in dist( · , U ∗ ) but gains the incoherence. Applying this lemma to ¯ U ,from Lemma 4.2 w.h.p. the corresponding φ is √ Ck . . Choosing a large enough constant C > φ ≥ √ √ − , this lemma implies that w.h.p. the following inequalities hold. k u i k ≤ r µ kn ∀ i ∈ [ n ] and dist( U , U ∗ ) ≤ √ √ Ck . . (22)We delay the proof of this lemma to Appendix B. T AM . Proof of Theorem 3.1

First we formulate the update of ¯ V t +1 at the t -th iteration in the algorithm T AM in a more compactform. For j ∈ [ n ] and β as given in the algorithm, letˆ B j = ( nd P i :( i,j ) ∈ Ω t +1 u ti u t,Ti if nd σ l ( P i ∈ [ n ]:( i,j ) ∈ Ω t +1 u ti u t,Ti ) ∈ [ β, − β ] ∀ l ∈ [ k ] nd P i :( i,j ) ∈ Ω t +1 ˆ u ti ˆ u t,Ti o.w.ˆ C j = ( nd P i :( i,j ) ∈ Ω t +1 u ti u ∗ ,Ti if nd σ l ( P i ∈ [ n ]:( i,j ) ∈ Ω t +1 u ti u t,Ti ) ∈ [ β, − β ] ∀ l ∈ [ k ] nd P i :( i,j ) ∈ Ω t +1 ˆ u ti u ∗ ,Ti o.w. (23)and D = U t,T U ∗ . (24)Using M ij = u ∗ ,Ti Σ ∗ v ∗ j , we combine (10) and (11) for j ∈ [ n ] at the t -th iteration and rewrite ˜ v t +1 j by˜ v t +1 j = ( ˆ B j ) − ˆ C j Σ ∗ v ∗ j . (25)10hen we rearrange the equation above as follows˜ v t +1 ,Tj = v ∗ ,Tj Σ ∗ D T − v ∗ ,Tj Σ ∗ (cid:16) D T ˆ B j − ( ˆ C j ) T (cid:17) ( ˆ B j ) − . Recall that ˜ V t +1 ∈ R n × k is a matrix with the j -th row equal to ˜ v t +1 ,Tj and the QR decomposition˜ V t +1 = ¯ V t +1 R t +1 . We then rewrite the equation above in a more compact form˜ V t +1 = V ∗ Σ ∗ U ∗ ,T U t − F t ¯ V t +1 = ˜ V t +1 ( R t +1 ) − (26)where F t =  v ∗ ,T Σ ∗ ( D T ˆ B − ( ˆ C ) T )( ˆ B ) − ... v ∗ ,Tn Σ ∗ ( D T ˆ B n − ( ˆ C n ) T )( ˆ B n ) −  . (27)Next we establish the geometric decay of the distance between the subspaces spanned by V t +1 and V ∗ and the distance between the subspaces spanned by U t +1 and U ∗ . Then we use this property toconclude the proof of Theorem 3.1. Our ﬁrst step is to show an upper bound on the Frobenius norm ofthe error term F t in (27) for the t -th iteration in the algorithm T AM . Theorem 4.4.

Suppose U t satisﬁes k u ti k ≤ r µ kn , ∀ i ∈ [ n ] . (28) Let F t be the matrix as deﬁned in (27), and M , Ω t +1 , δ , β and ǫ be as deﬁned in Theorem . . Thenunder Assumption and for d satisfying (14) w.h.p. we have k F t /σ ∗ k k F ≤ √ k max { dist( U t , U ∗ ) , ǫ/ } . (29) Also, under Assumptions and and for d satisfying (15), the inequality (29) holds w.h.p. We delay the proof of this theorem to the next subsection. Our next step in proving Theorem 3.1 isto show the geometric decay property of the distance between the subspaces spanned by iterates U t +1 ( V t +1 ) and U ∗ ( V ∗ ). In order to prove Theorem 3.1, we also need the following lemma which resultsfrom Deﬁnition 3 . . Lemma 4.5.

Given two orthonormal matrices

X, Y ∈ R n × k , let X ⊥ , Y ⊥ ∈ R n × ( n − k ) be another twoorthonormal matrices which span the orthogonal complements of X and Y , respectively. Suppose X T Y is invertible. Then k X T ⊥ Y k σ k ( X T Y ) = k X T ⊥ Y ( X T Y ) − k . In this lemma we replaced the original k ( I − XX T ) Y k in [Har14] by k X T ⊥ Y k due to the relation k ( I − XX T ) Y k = k X ⊥ X T ⊥ Y k = sup v ∈ R n : k v k =1 k v T X ⊥ X T ⊥ Y k = sup v ∈ span( X ⊥ ): k v k =1 k v T X ⊥ X T ⊥ Y k = sup u ∈ R n − k : k u k =1 k u T X T ⊥ Y k = k X T ⊥ Y k . heorem 4.6. Let ǫ be as deﬁned in Theorem 3.1. Under Assumption and for d satisfying (14),w.h.p. the ( t + 1) th iterates V t +1 and U t +1 of algorithm T AM satisfy k v t +1 j k ≤ r µ kn ∀ j ∈ [ n ] , dist( V t +1 , V ∗ ) ≤

12 max { dist( U t , U ∗ ) , ǫ/ } , ∀ t = 0 , , . . . , N − , (30) and k u t +1 i k ≤ r µ kn ∀ i ∈ [ n ] , dist( U t +1 , U ∗ ) ≤

12 max { dist( V t +1 , V ∗ ) , ǫ/ } , ∀ t = 0 , , . . . , N − . (31) Also, under Assumptions and and for d satisfying (15), w.h.p. the ( t + 1) th iterates V t +1 and U t +1 of algorithm T AM satisfy (30) and (31).Proof. we ﬁrst prove (30) for both cases, and then use a similar argument to show (31). Under Assump-tion 1, we apply Lemma 4.3 to ¯ U and obtain w.h.p. (22) in which we choose a large enough C ( δ, β )such that dist( U , U ∗ ) < /

3. Then the following inequalities hold w.h.p. for t = 0. k u ti k ≤ r µ kn ∀ i ∈ [ n ] and dist( U t , U ∗ ) <

13 (32)Now we assume the inequality (32) holds for some t ≥

0. It follows from Theorem 4.4 that for boththe case under Assumption 1 and d satisfying (14), and the case under Assumptions 1 and 2 and d satisfying (15), the following inequality holds w.h.p. k F t /σ ∗ k k ≤ k F t /σ ∗ k k F ≤ √ k max { dist( U t , U ∗ ) , ǫ/ } . (33)Next we derive an upper bound on dist( ¯ V t +1 , V ∗ ). First we claim that V ∗ ,T ¯ V t +1 is invertible. Usingthe expression of ˜ V t +1 given by (26), we have σ k ( V ∗ ,T ˜ V t +1 ) = σ k ( V ∗ ,T ( V ∗ Σ ∗ U ∗ ,T U t − F t ))= σ k (Σ ∗ U ∗ ,T U t − V ∗ ,T F t )Using Ky Fan singular value inequality in (3) for A = V ∗ ,T F t , B = Σ ∗ U ∗ ,T U t − V ∗ ,T F t , r = 0 and t = k −

1, we have σ k (Σ ∗ U ∗ ,T U t − V ∗ ,T F t ) ≥ σ k (Σ ∗ U ∗ ,T U t ) − σ ( V ∗ ,T F t ) ≥ σ k (Σ ∗ U ∗ ,T U t ) − k F t k ≥ σ ∗ k σ k ( U ∗ ,T U t ) − σ ∗ k k F t /σ ∗ k k By the assumption dist( U t , U ∗ ) < /

3, that is, k U ∗ ,T ⊥ U t k < / σ k ( U ∗ ,T U t ) = q − k U ∗ ,T ⊥ U t k ≥ √ / k F t /σ ∗ k k F in (33), gives σ ∗ k σ k ( U ∗ ,T U t ) − σ ∗ k k F t /σ ∗ k k ≥ √ σ ∗ k − σ ∗ k √ k max { dist( U t , U ∗ ) , ǫ/ } > . σ k ( V ∗ ,T ˜ V t +1 ) > V ∗ ,T ˜ V t +1 is invertible. Also by QR decomposition ˜ V t +1 =¯ V t +1 R t +1 , we have V ∗ ,T ˜ V t +1 = V ∗ ,T ¯ V t +1 R t +1 . Then V ∗ ,T ¯ V t +1 ∈ R k × k has rank k and hence the claimfollows. Then by Lemma 4.5 where the claim we just proved veriﬁes the assumption, we have k V ∗ ,T ⊥ ¯ V t +1 k σ k ( V ∗ ,T ¯ V t +1 ) = k V ∗ ,T ⊥ ¯ V t +1 (cid:0) V ∗ ,T ¯ V t +1 (cid:1) − k . (35)First applying the second equation in (26) and then the ﬁrst equation in (26), we obtain k V ∗ ,T ⊥ ¯ V t +1 (cid:0) V ∗ ,T ¯ V t +1 (cid:1) − k = k V ∗ ,T ⊥ ˜ V t +1 (cid:16) V ∗ ,T ˜ V t +1 (cid:17) − k = k V ∗ ,T ⊥ ˜ V t +1 (cid:0) Σ ∗ U ∗ ,T U t − V ∗ ,T F t (cid:1) − k . (36)It follows from (34) that U ∗ ,T U t is invertible. Hence k V ∗ ,T ⊥ ¯ V t +1 (cid:0) V ∗ ,T ¯ V t +1 (cid:1) − k = k V ∗ ,T ⊥ ˜ V t +1 (cid:0) U ∗ ,T U t (cid:1) − (cid:0) Σ ∗ − V ∗ ,T F t ( U ∗ ,T U t ) − (cid:1) − k ≤ k V ∗ ,T ⊥ ˜ V t +1 (cid:0) U ∗ ,T U t (cid:1) − k k (cid:0) Σ ∗ − V ∗ ,T F t ( U ∗ ,T U t ) − (cid:1) − k ≤ k V ∗ ,T ⊥ ˜ V t +1 (cid:0) U ∗ ,T U t (cid:1) − k σ k (Σ ∗ − V ∗ ,T F t ( U ∗ ,T U t ) − ) . (37)Using the expression of ˜ V t +1 in (26), the numerator of the right hand side above becomes k V ∗ ,T ⊥ ˜ V t +1 (cid:0) U ∗ ,T U t (cid:1) − k ≤ k V ∗ ,T ⊥ F t (cid:0) U ∗ ,T U t (cid:1) − k ≤ k V ∗ ,T ⊥ F t k k (cid:0) U ∗ ,T U t (cid:1) − k ≤ k F t k σ k ( U ∗ ,T U t ) . Using Ky Fan singular value inequality in (3) for A = V ∗ ,T F t ( U ∗ ,T U t ) − , B = Σ ∗ − V ∗ ,T F t ( U ∗ ,T U t ) − , r = 0 and t = k −

1, the denominator of the right hand side in (37) becomes σ k (Σ ∗ − V ∗ ,T F t ( U ∗ ,T U t ) − ) ≥ σ ∗ k − k V ∗ ,T F t ( U ∗ ,T U t ) − k ≥ σ ∗ k − k V ∗ ,T F t k k ( U ∗ ,T U t ) − k ≥ σ ∗ k − k F t k σ k ( U ∗ ,T U t ) . Then (37) becomes k V ∗ ,T ⊥ ¯ V t +1 (cid:0) V ∗ ,T ¯ V t +1 (cid:1) − k ≤ k F t k σ k ( U ∗ ,T U t ) σ ∗ k − k F t k σ k ( U ∗ ,T U t ) . By σ k ( U ∗ ,T U t ) ≥ √ / > /

2. Then k V ∗ ,T ⊥ ¯ V t +1 (cid:0) V ∗ ,T ¯ V t +1 (cid:1) − k ≤ k F t k σ ∗ k − k F t k = 2 k F t /σ ∗ k k − k F t /σ ∗ k k . Using the upper bound on k F t /σ ∗ k k in (33), we obtain k V ∗ ,T ⊥ ¯ V t +1 (cid:0) V ∗ ,T ¯ V t +1 (cid:1) − k ≤ / (5 √ k )1 − { dist( U t , U ∗ ) , ǫ/ } / (5 √ k ) max { dist( U t , U ∗ ) , ǫ/ } .

13y dist( U t , U ∗ ) ∈ [0 ,

1] and ǫ ∈ (0 , / k V ∗ ,T ⊥ ¯ V t +1 (cid:0) V ∗ ,T ¯ V t +1 (cid:1) − k ≤ / (5 √ k )1 − / (5 √ k ) max { dist( U t , U ∗ ) , ǫ/ }≤ √ k max { dist( U t , U ∗ ) , ǫ/ } . Then it follows from (35) that k V ∗ ,T ⊥ ¯ V t +1 k σ k ( V ∗ ,T ¯ V t +1 ) ≤ √ k max { dist( U t , U ∗ ) , ǫ/ } . We have shown that V ∗ ,T ¯ V t +1 is invertible and hence σ k ( V ∗ ,T ¯ V t +1 ) ∈ (0 ,

1] from which it follows thatdist( ¯ V t +1 , V ∗ ) = k V ∗ ,T ⊥ ¯ V t +1 k ≤ √ k max { dist( U t , U ∗ ) , ǫ/ } . Now we apply Lemma 4.3 where ¯ U and U ∗ are replaced by ¯ V t +1 and V ∗ , respectively, and by theinequality above φ = 2 √ / max { dist( U t , U ∗ ) , ǫ/ } . Then by dist( U t , U ∗ ) < / ǫ/ < /

3, weobtain φ ≥ √ ≥ √ / ( √ − k v t +1 j k ≤ r µ kn ∀ j ∈ [ n ] and dist( V t +1 , V ∗ ) ≤

12 max { dist( U t , U ∗ ) , ǫ/ } . The second inequality above also implies dist( V t +1 , V ∗ ) < /

3. Using k v t +1 j k ≤ r µ kn ∀ j ∈ [ n ] and dist( V t +1 , V ∗ ) < , (31) is established similarly and then (32) holds by replacing t by t + 1. By repeating the argumentsabove, (30) and (31) hold for all t = 0 , , · · · , N − Proof of Theorem 3.1.

By Theorem 4.6, after N ≥ /ǫ ) / log 4 iterations, we obtaindist( U N − , U ∗ ) ≤

12 max { dist( V N − , V ∗ ) , ǫ/ }≤

12 max (cid:26)

12 max { dist( U N − , U ∗ ) , ǫ/ } , ǫ/ (cid:27) = max (cid:26)

14 dist( U N − , U ∗ ) , ǫ (cid:27) ... ≤ max ((cid:18) (cid:19) N − dist( U , U ∗ ) , ǫ ) ≤ ǫ , (38)and k u N − i k ≤ r µ kn ∀ i ∈ [ n ] . (39)14sing the expression of ˜ V N in (26) for t = N −

1, we obtain k M − U N − ˜ V N,T k F = k U ∗ Σ ∗ V ∗ ,T − U N − ( U N − ,T U ∗ Σ ∗ V ∗ ,T − F N − ,T ) k F ≤ k ( I − U N − U N − ,T ) U ∗ Σ ∗ V ∗ ,T k F + k U N − F N − ,T k F . Using the inequality (2), we obtain k M − U N − ˜ V N,T k F ≤ k ( I − U N − U N − ,T ) U ∗ k k Σ ∗ V ∗ ,T k F + k F N − ,T k F = k U N − ⊥ U N − ,T ⊥ U ∗ k k Σ ∗ V ∗ ,T k F + k F N − ,T /σ ∗ k k F σ ∗ k (40)Then by the upper bound on dist( U N − , U ∗ ) in (38) and k Σ ∗ V ∗ ,T k F = k M k F , k M − U N − ˜ V N,T k F ≤ ǫ k M k F + k F N − ,T /σ ∗ k k F k M k F . By the incoherence of U N − in (39), Theorem 4.4 implies that w.h.p. k F N − ,T /σ ∗ k k F ≤ √ k × max { dist( U N − , U ∗ ) , ǫ/ } ≤ ǫ √ k . Then w.h.p. the right hand side of the inequality (40) is upper bounded by ≤ ǫ k M k F + ǫ √ k k M k F ≤ ǫ k M k F from which the result follows. k F t /σ ∗ k k F . Proof of Theorem 4.4 We ﬁrst introduce a theorem which gives an upper bound on the number of times at the t -th iterationof the algorithm T AM the operations T ( · , β ) are applied to compute ˜ V t +1 = (˜ v t +1 ,Tj , ≤ j ≤ n ), andthen use the upper bound given by this theorem to conclude the proof of Theorem 4.4.Let β, δ be as deﬁned in the algorithm T AM . Deﬁne S tb ( β ) , ( j ∈ [ n ] : (cid:13)(cid:13)(cid:13)(cid:13) nd X i :( i,j ) ∈ Ω t +1 u ti u t,Ti − I (cid:13)(cid:13)(cid:13)(cid:13) > − β ) . (41)The equivalence relation (cid:13)(cid:13)(cid:13)(cid:13) nd X i :( i,j ) ∈ Ω t +1 u ti u t,Ti − I (cid:13)(cid:13)(cid:13)(cid:13) ≤ − β ⇐⇒ σ l  nd X i :( i,j ) ∈ Ω t +1 u ti u t,Ti  ∈ [ β, − β ] , ∀ l ∈ [ k ] , implies that S tb ( β ) consists of all the ‘bad’ indices j ∈ [ n ] associated with U tS t +1 ,Lj to which the operation T ( · , β ) is applied before ˜ v t +1 j is computed in (11). Let γ t = dist( U t , U ∗ ), α = 1 − β − δ µ k , ρ t = 2 k (1 − β − δ ) µ k − γ t µ k (42)15nd the function f : N × R → R f ( d, µ, a ) , k √ πd exp (cid:18) − a / µk + µka/ d (cid:19) . (43)For a large C ( δ, β ) >

0, it can be checked easily that ρ t > γ t ∈ (cid:16) , √ / ( p C ( δ, β ) k . µ ) (cid:17) . (44)The following theorem gives an upper bound on the size of S tb ( β ). Theorem 4.7.

Suppose Assumption holds and U t satisﬁes k u ti k ≤ r µ kn ∀ i ∈ [ n ] . (45) Let δ and β be as deﬁned in T AM . Then the following statements hold.(a) w.h.p. we have for any ﬁxed ζ > , | S tb ( β ) | ≤ (1 + ζ ) f ( d, µ , − β ) n. (46) (b) Suppose γ t satisﬁes (44). w.h.p. we have for any ﬁxed ζ > and a large C ( δ, β ) > | S tb ( β ) | ≤ . (cid:18) e ρ t γ t α (cid:19) αd + (1 + ζ ) f ( d, µ , δ ) ! n. (47) (c) Suppose γ t satisﬁes (44) and Assumption also holds. w.h.p. we have for any ﬁxed ζ > and alarge C ( δ, β ) > | S tb ( β ) | ≤ . (cid:18) e ρ t γ t α (cid:19) αd + ζ ! n. (48)We delay the proof of this theorem to the next subsection. We now prove Theorem 4.4, assumingthe validity of Theorem 4.7. Proof of Theorem 4.4.

We vectorize the rows of F t in (27) and then reassemble them one by one as along vector A ∈ R kn × A =  ( ˆ B ) − ( ˆ B D − ˆ C )Σ ∗ v ∗ ...( ˆ B n ) − ( ˆ B n D − ˆ C n )Σ ∗ v ∗ n  . Then k F t k F = k A k . For any x j ∈ R × k , j ∈ [ n ], we have( x , x , . . . , x n ) A = n X j =1 x j ( ˆ B j ) − ( ˆ B j D − ˆ C j )Σ ∗ v ∗ j . Let B j = nd X i :( i,j ) ∈ Ω t +1 u ti u t,Ti ∀ j ∈ [ n ] and C j = nd X i :( i,j ) ∈ Ω t +1 u ti u ∗ ,Ti ∀ j ∈ [ n ] . (49)16ecall ˆ B j and ˆ C j deﬁned in (23), and that S tb ( β ) consists of all the indices j ∈ [ n ] associated with U tS t +1 ,Lj to which the operation T ( · , β ) is applied. We have ˆ B j = B j and ˆ C j = C j for all j ∈ [ n ] \ S tb ( β ).Then, ( x , x , . . . , x n ) A = n X j =1 x j ( ˆ B j ) − ( B j D − C j )Σ ∗ v ∗ j + X j ∈ S tb ( β ) x j ( ˆ B j ) − ( ˆ B j − B j ) D Σ ∗ v ∗ j + X j ∈ S tb ( β ) x j ( ˆ B j ) − ( C j − ˆ C j )Σ ∗ v ∗ j . (50)We will establish Theorem 4.4 from the following proposition, which gives upper bounds on the threeterms on the right hand side of (50), respectively. We delay its proof for later. Proposition 4.8.

Suppose Assumption holds and U t satisﬁes k u ti k ≤ r µ kn , ∀ i ∈ [ n ] . Let δ and β be as deﬁned in Theorem 3.1 and S tb ( β ) be as deﬁned in (41). Then for d satisfying (15)and all x j ∈ R × k , j ∈ [ n ] , satisfying k ( x , x , . . . , x n ) k = 1 we have X j ∈ S tb ( β ) x j ( ˆ B j ) − ( ˆ B j − B j ) D Σ ∗ v ∗ j ≤ (2 − β + 5 µ k ) σ ∗ β p µ k r | S tb ( β ) | n , (51) X j ∈ S tb ( β ) x j ( ˆ B j ) − ( C j − ˆ C j )Σ ∗ v ∗ j ≤ β σ ∗ ( µ k ) . r | S tb ( β ) | n (52) and w.h.p. n X j =1 x j ( ˆ B j ) − ( B j D − C j )Σ ∗ v ∗ j ≤ σ ∗ k √ k dist( U t , U ∗ ) . (53)Applying Proposition 4.8 and then replacing the three terms on the right hand side of (50) by theirupper bounds provided by (51), (52) and (53), w.h.p. for d satisfying (15) we obtain an upper boundon k F t /σ ∗ k k F k F t /σ ∗ k k F = max ( x ,x ,...,x n ): k ( x ,x ,...,x n ) k =1 ( x , x , . . . , x n ) A/σ ∗ k ≤ γ t √ k + (2 − β + 5 µ k ) σ ∗ βσ ∗ k p µ k r | S tb ( β ) | n + 7 β σ ∗ σ ∗ k ( µ k ) . r | S tb ( β ) | n ≤ γ t √ k + 14 β σ ∗ σ ∗ k ( µ k ) . r | S tb ( β ) | n . (54)Next, we prove the upper bound on k F t /σ ∗ k k F in (29) under Assumption 1 and for d satisfying (14).We show this result for two cases: γ t ∈ [4 √ / ( p C ( δ, β ) µ k . ) ,

1] and γ t ∈ (0 , √ / ( p C ( δ, β ) µ k . )),respectively. Under Assumption 1, the upper bound on | S tb ( β ) | from (46) in Theorem 4.7 implies thatw.h.p. 14 β σ ∗ σ ∗ k ( µ k ) . r | S tb ( β ) | n ≤ β σ ∗ σ ∗ k ( µ k ) . p . f ( d, µ , − β ) . f ( d, µ , − β ) in (43). Then,14 β σ ∗ σ ∗ k ( µ k ) . p . f ( d, µ , − β )= 14 β σ ∗ σ ∗ k ( µ k ) . √ . π / k / d / exp (cid:18) − (1 − β ) / µ k (1 + (1 − β ) / d (cid:19) = 14 √ . π / β exp (cid:18) log (cid:18) σ ∗ σ ∗ k (cid:19) + 1 . µ + 2 log k + log d − (1 − β ) / µ k (1 + (1 − β ) / d (cid:19) For d satisfying (14), we observe that the last term inside exp( · ) above is a polynomial of k , µ and σ ∗ /σ ∗ k while other terms inside exp( · ) are linear combination of log k , log µ and log( σ ∗ /σ ∗ k ). Hence wecan choose a large C ( δ, β ) > β σ ∗ σ ∗ k ( µ k ) . p . f ( d, µ , − β ) ≤ √ k √ p C ( δ, β ) k . µ . Hence for a large C ( δ, β ) and γ t ∈ [4 √ / ( p C ( δ, β ) µ k . ) , γ t √ k + 14 β σ ∗ σ ∗ k ( µ k ) . r | S tb ( β ) | n ≤ γ t √ k + 110 √ k √ p C ( δ, β ) k . µ ≤ γ t √ k + 110 √ k γ t = γ t √ k , which, along with (54), gives the upper bound on k F t /σ ∗ k k F in (29).Next, we consider the case γ t ∈ (0 , √ / ( p C ( δ, β ) µ k . )). Under Assumption 1 and γ t ∈ (0 , √ / ( p C ( δ, β ) µ k . )), the upper bound on | S tb ( β ) | from (47) in Theorem 4.7 implies that w.h.p.14 β σ ∗ σ ∗ k ( µ k ) . r | S tb ( β ) | n ≤ β σ ∗ σ ∗ k ( µ k ) . s . (cid:18) e ρ t γ t α (cid:19) αd + 1 . f ( d, µ , δ ) ≤ β σ ∗ σ ∗ k ( µ k ) . √ . (cid:18) e ρ t γ t α (cid:19) αd/ + p . f ( d, µ , δ ) ! . Our proof of Theorem 4.4 also relies on the following proposition, which gives upper bounds on thelast two terms in the inequality above. The proof of this proposition, which involves heavy calculations,can be found in Appendix C.

Proposition 4.9.

Let α and ρ t be as deﬁned in (42), f ( d, µ , δ ) be as deﬁned in (43), and ǫ , δ , β beas deﬁned in Theorem 3.1. Suppose γ t satisﬁes (44). There exists a large C ( δ, β ) > such that if d satisﬁes (15), we have β σ ∗ σ ∗ k ( µ k ) . √ . (cid:18) e ρ t γ t α (cid:19) αd/ ≤ γ t √ k , (55) and if d satisﬁes (14), we have β σ ∗ σ ∗ k ( µ k ) . p . f ( d, µ , δ ) ≤ ǫ √ k . (56)18ince any d satisfying the inequality (14) also satisﬁes the inequality (15), the two upper bounds(55) and (56) in Proposition 4.9 yield that for d satisfying (14), w.h.p.14 β σ ∗ σ ∗ k ( µ k ) . r | S tb ( β ) | n ≤ γ t √ k + ǫ √ k , which, along with (54), gives the upper bound on k F t /σ ∗ k k F in (29) k F t /σ ∗ k k F ≤ γ t √ k + γ t √ k + ǫ √ k ≤ √ k max { γ t , ǫ/ } . This completes the proof of (29) under Assumption 1 for d satisfying (14).Finally, we prove the upper bound on k F t /σ ∗ k k F in (29) under Assumptions 1, 2 and for d satisfying(15). For γ t ∈ [4 √ / ( p C ( δ, β ) µ k . ) , k F t /σ ∗ k k F in (29) follows similarlyusing (46) and (54). Suppose γ t ∈ (0 , √ / ( p C ( δ, β ) µ k . )). Under Assumptions 1 and 2 and γ t ∈ (0 , √ / ( p C ( δ, β ) µ k . )), the upper bound on | S tb ( β ) | from (48) in Theorem 4.7 implies that forany ﬁxed ζ > β σ ∗ σ ∗ k ( µ k ) . r | S tb ( β ) | n ≤ β σ ∗ σ ∗ k ( µ k ) . s . (cid:18) e ρ t γ t α (cid:19) αd + ζ ≤ β σ ∗ σ ∗ k ( µ k ) . √ . (cid:18) e ρ t γ t α (cid:19) αd/ + 14 β σ ∗ σ ∗ k ( µ k ) . p ζ. We choose a small enough ζ > β σ ∗ σ ∗ k ( µ k ) . p ζ ≤ ǫ √ k . This is possible since k , σ ∗ /σ ∗ k and µ are assumed to be bounded from above by a constant. The upperbound (55) in Proposition 4.9 and the inequality above yield that w.h.p.14 β σ ∗ σ ∗ k ( µ k ) . r | S tb ( β ) | n ≤ γ t √ k + ǫ √ k . Then similarly, the upper bound on k F t /σ ∗ k k F in (29) follows. The proof of Theorem 4.4 is complete. We ﬁrst prove (51). By the submultiplicative inequality for the spectral norm and Cauchy-Schwarzinequality, we have X j ∈ S tb ( β ) x j ( ˆ B j ) − ( ˆ B j − B j ) D Σ ∗ v ∗ j ≤ max j ∈ [ n ] k ( ˆ B j ) − ( ˆ B j − B j ) D Σ ∗ k X j ∈ S tb ( β ) k x j k k v ∗ j k ≤ max j ∈ [ n ] k ( ˆ B j ) − ( ˆ B j − B j ) D Σ ∗ k s X j ∈ S tb ( β ) k x j k s X j ∈ S tb ( β ) k v ∗ j k . By P j ∈ [ n ] k x j k = 1 and Assumption 1 on the incoherence of V ∗ , we have X j ∈ S tb ( β ) k x j k ≤ X j ∈ S tb ( β ) k v ∗ j k ≤ | S tb ( β ) | µ kn . j ∈ [ n ] k ( ˆ B j ) − ( ˆ B j − B j ) D Σ ∗ k ≤ max j ∈ [ n ] k ( ˆ B j ) − k max j ∈ [ n ] {k ˆ B j k + k B j k }k D k σ ∗ . Recall ˆ B j given by (23). Then by σ l ( ˆ B j ) ∈ [ β, − β ] for all l ∈ [ k ] and all j ∈ [ n ] and D = U t,T U ∗ where U t and U ∗ are both orthonormal matrices, we havemax j ∈ [ n ] k ( ˆ B j ) − k ≤ β , max j ∈ [ n ] k ˆ B j k ≤ − β, k D k ≤ . (57)Also recall B j given by (49) and the incoherence assumption k u ti k ≤ p µ k/n , ∀ i ∈ [ n ]. Then we havethe following upper bound on k B j k k B j k ≤ nd d max i ∈ [ n ] k u ti k ≤ nd d µ kn = 5 µ k, ∀ j ∈ [ n ] . Then max j ∈ [ n ] k ( ˆ B j ) − ( ˆ B j − B j ) D Σ ∗ k ≤ β (2 − β + 5 µ k ) σ ∗ . Combining the inequalities above, we obtain X j ∈ S tb ( β ) x j ( ˆ B j ) − ( ˆ B j − B j ) D Σ ∗ v ∗ j ≤ (2 − β + 5 µ k ) σ ∗ β p µ k r | S tb ( β ) | n . (58)This proves (51). Next, we prove (52). Similarly, we have X j ∈ S tb ( β ) x j ( ˆ B j ) − ( C j − ˆ C j )Σ ∗ v ∗ j ≤ max j ∈ [ n ] {k C j k + k ˆ C j k } σ ∗ β p µ k r | S tb ( β ) | n . It follows from C j given by (49) and ˆ C j given by (23) that C j = nd U t,TS t +1 ,Lj U ∗ S t +1 ,Lj and ˆ C j = nd ˆ U t,TS t +1 ,Lj U ∗ S t +1 ,Lj . Also by the deﬁnition of T ( · , β ) in (8) we have k ˆ U tS t +1 ,Lj k = kT ( U tS t +1 ,Lj , β ) k ≤ p (2 − β ) d/n , which,together with Assumption 1 on the incoherence of U ∗ and the incoherence condition k u ti k ≤ p µ k/n , ∀ i ∈ [ n ], gives k C j k ≤ nd k U tS t +1 ,Lj k k U ∗ S t +1 ,Lj k ≤ nd k U tS t +1 ,Lj k F k U ∗ S t +1 ,Lj k F ≤ nd r d µ kn r d µ kn = √ µ k, ∀ j ∈ [ n ] , and k ˆ C j k ≤ nd k ˆ U tS t +1 ,Lj k k U ∗ S t +1 ,Lj k ≤ nd k ˆ U tS t +1 ,Lj k k U ∗ S t +1 ,Lj k F ≤ nd r dn (2 − β ) r d µ kn = p (2 − β ) µ k, ∀ j ∈ [ n ] . X j ∈ S tb ( β ) x j ( ˆ B j ) − ( C j − ˆ C j )Σ ∗ v ∗ j ≤ ( √ µ k + p (2 − β ) µ k ) σ ∗ β p µ k r | S tb ( β ) | n ≤ β σ ∗ ( µ k ) . r | S tb ( β ) | n . (59)Finally, we prove (53). Let y j = x j ( ˆ B j ) − , ˜ v ∗ j = Σ ∗ v ∗ j and J i = u ti u t,Ti U t,T U ∗ − u ti u ∗ ,Ti ∀ i ∈ [ n ] . (60)Then B j D − C j = nd X i :( i,j ) ∈ Ω t +1 ( u ti u t,Ti U t,T U ∗ − u ti u ∗ ,Ti )= nd X i :( i,j ) ∈ Ω t +1 J i . We can rewrite the left hand side of (53) then as follows nd n X j =1 X i :( i,j ) ∈ Ω t +1 y j J i ˜ v ∗ j = nd X ( i,j ): ( i,j ) ∈ Ω t +1 y j J i ˜ v ∗ j . (61)Also, let y jh , h ∈ [ k ], be the h -th entry of y j ∈ R × k , ˜ v ∗ jl , l ∈ [ k ], be the l -th entry of ˜ v ∗ j ∈ R k × and( J i ) hl , h, l ∈ [ k ], be the ( h, l ) entry of the matrix J i ∈ R k × k . Then the right-hand side of (61) is nd X h,l ∈ [ k ] X ( i,j ):( i,j ) ∈ Ω t +1 y jh ˜ v ∗ jl ( J i ) hl . (62)Let G n ∈ R n × n be the biadjacency matrix of the random bipartite d -regular graph associated with theindex set Ω t +1 . Also, let J hl ∈ R × n , h, l ∈ [ k ], be J hl = (( J ) hl , ( J ) hl , . . . , ( J n ) hl ) , L hl ∈ R × n , h, l ∈ [ k ], be L hl = ( y h ˜ v ∗ l , y h ˜ v ∗ l , . . . , y nh ˜ v ∗ nl ) , J ∈ R × k n be J = ( J , . . . , J k , J , . . . , J k , . . . , J k , . . . , J kk ) , L ∈ R × k n be L = ( L , . . . , L k , L , . . . , L k , . . . , L k , . . . , L kk ) , and I k ∈ R k × k be an identity matrix. Denote ⊗ the Kronecker product. Then we rewrite (62) by nd ( J , . . . , J k , J , . . . , J k , . . . , J k , . . . , J kk ) × ( I k ⊗ G n ) ( L , . . . , L k , L , . . . , L k , . . . , L k , . . . , L kk ) T = nd J ( I k ⊗ G n ) L T . (63)21bserve I k ⊗ G n is a block diagonal matrix in which each block is G n . Let U be the top left sigular vec-tor of G n . Then by property P of the random bipartite d -regular graph, U = [1 / √ n, / √ n, · · · , / √ n ] T .Hence the top k left singular vectors of I k ⊗ G n are e i ⊗ U , ∀ i ∈ [ k ], where e i ∈ R k × is the i -thunit vector, that is, its i -th entry is one and all others are zero.Let X i , Y i ∈ R k n × , i ∈ [ k n ], be the i -th left singular vector and the i -th right singular vector of I k ⊗ G n , respectively, and σ i , i ∈ [ k n ], be the i -th singular value of I k ⊗ G n . Then we can rewrite(63) nd  k X i =1 σ i ( J X i )( L Y i ) + k n X i = k +1 σ i ( J X i )( L Y i )  = nd  k X i =1 σ i hJ , e i ⊗ U i ( L Y i ) + k n X i = k +1 σ i ( J X i )( L Y i )  . (64)Note that X i ∈ [ n ] J i = U t,T U t U t,T U ∗ − U t,T U ∗ = 0 . Then for all h, l ∈ [ k ] we have X i ∈ [ n ] ( J i ) hl = 0 . (65)Hence the entry sum of J hl for all h, l ∈ [ k ] is 0, which yields( J , . . . , J k , J , . . . , J k , . . . , J k , . . . , J kk ) ( e i ⊗ U ) = 0 , ∀ i ∈ [ k ] , that is, hJ , e i ⊗ U i = 0 , ∀ i ∈ [ k ]. Then the right hand side of (64) becomes nd k n X i = k +1 σ i ( J X i )( L Y i ) . (66)Also by the property P of random bipartite d -regular graph, the top k singular values of I k ⊗ G n areall d , and the remaining singular values are upper bounded by (7 √ d ) / nd k n X i = k +1 σ i ( J X i )( L Y i ) ≤ nd k n X i = k +1 σ i |J X i ||L Y i |≤ nd √ d vuut k n X i = k +1 |J X i | vuut k n X i = k +1 |L Y i | ≤ nd √ d kJ k kLk . (67)Now we bound kJ k k and kLk separately. Let u tih , h ∈ [ k ], be the h -th entry of u ti ∈ R k × , u ∗ il , l ∈ [ k ],be the l -th entry of u ∗ i ∈ R k × and U ∗ l ∈ R n × , l ∈ [ k ], be the l -th column of U ∗ . Then, kJ k = X h,l ∈ [ k ] X i ∈ [ n ] ( J i ) h,l = X h,l ∈ [ k ] X i ∈ [ n ] ( u tih u t,Ti U t,T U ∗ l − u tih u ∗ il ) = X l ∈ [ k ] X i ∈ [ n ] X h ∈ [ k ] ( u tih ) ( u t,Ti U t,T U ∗ l − u ∗ il ) ≤ max i ∈ [ n ] k u ti k X l ∈ [ k ] X i ∈ [ n ] ( u t,Ti U t,T U ∗ l − u ∗ il ) . U t and U ∗ ∈ R n × k are both orthonormal matrices, we have X l ∈ [ k ] X i ∈ [ n ] ( u t,Ti U t,T U ∗ l − u ∗ il ) = X l ∈ [ k ] X i ∈ [ n ] (cid:16) U ∗ ,Tl U t u ti u t,Ti U t,T U ∗ l − u ∗ i,l u t,Ti U t,T U ∗ l + ( u ∗ il ) (cid:17) = X l ∈ [ k ] (cid:16) U ∗ ,Tl U t U t,T U ∗ l − U ∗ ,Tl U t U t,T U ∗ l + 1 (cid:17) = X l ∈ [ k ] (cid:16) − U ∗ ,Tl U t U t,T U ∗ l (cid:17) = X l ∈ [ k ] (cid:0) − k U t,T U ∗ l k (cid:1) ≤ X l ∈ [ k ] (cid:0) − ( σ min ( U t,T U ∗ )) (cid:1) = k (cid:0) − ( σ min ( U t,T U ∗ )) (cid:1) Also by the subspace distance property (17), we have1 − ( σ min ( U t,T U ∗ )) = dist( U t , U ∗ ) which gives X l ∈ [ k ] X i ∈ [ n ] ( u t,Ti U t,T U ∗ l − u ∗ il ) ≤ k dist( U t , U ∗ ) . Then using the incoherence assumption k u ti k ≤ p µ k/n , ∀ i ∈ [ n ], we obtain kJ k ≤ µ k n dist( U t , U ∗ ) . Next, we bound kLk . It follows from y j and ˜ v j given in (60) and Assumption 1 that X l ∈ [ k ] (˜ v ∗ jl ) = k ˜ v ∗ j k = k Σ ∗ v ∗ j k ≤ ( σ ∗ ) µ kn and X h ∈ [ k ] ( y jh ) = k y j k = k x j ( ˆ B j ) − k ≤ k x j k β where in the last inequality we used (57). Then recalling P j ∈ [ n ] k x j k = 1 we have kLk = X h,l ∈ [ k ] X j ∈ [ n ] ( y jh ) (˜ v ∗ jl ) ≤ X j ∈ [ n ] k x j k β ( σ ∗ ) µ kn = ( σ ∗ ) β µ kn . Finally, we obtain w.h.p. n X j =1 x j ( ˆ B j ) − ( B j D − C j )Σ ∗ v ∗ j ≤ nd √ d kJ k kLk ≤ nd √ d r µ k n dist( U t , U ∗ ) σ ∗ β r µ kn = 7 √ β k . µ √ d σ ∗ dist( U t , U ∗ ) . d ≥ C ( δ, β ) k µ ( σ ∗ /σ ∗ k ) we can choose a large C ( δ, β ) > n X j =1 x j ( ˆ B j ) − ( B j D − C j )Σ ∗ v ∗ j ≤ σ ∗ k √ k dist( U t , U ∗ ) . (68)The proof of Proposition 4.8 is complete. S tb ( β ) . Proof of Theorem 4.7 First, we claim that there exists an orthonormal matrix R ∈ R k × k such that U ∗ ,T U t R is symmetric.Indeed, suppose the SVD of U ∗ ,T U t is U ∗ ,T U t = W Σ W T where W , W ∈ R k × k are two orthonormal matrices. Right-multiplying both sides of the equationabove by W W T , we obtain U ∗ ,T U t W W T = W Σ W T . (69)Observe W W T ∈ R k × k is an orthonormal matrix and then the claim follows by taking R = W W T .Note the deﬁnition of S tb ( β ) in (41). If we replace U t by U t R , it can be checked easily that the indexset S tb ( β ), γ t and ρ t given in (42) are unchanged. In the remaining part of this subsection, we will use U t R instead of U t to derive an upper bound on | S tb ( β ) | . We will still denote U t R by U t for convenience.Now U ∗ ,T U t is symmetric.For τ ∈ (0 , Q t ( τ ) be Q t ( τ ) , n i ∈ [ n ] : k u ti u t,Ti − u ∗ i u ∗ ,Ti k > τn o . Our ﬁrst step is to show an upper bound on the size of Q t ( τ ) when dist( U t , U ∗ ) is small. Lemma 4.10.

Suppose Assumption holds. Let γ t = dist( U t , U ∗ ) . Then for any τ ∈ (0 , , (cid:18) τ µ k − γ t µ k (cid:19) | Q t ( τ ) | ≤ kγ t n. (70)For γ t < τ √ µ k , the coeﬃcient of | Q t ( τ ) | above is positive. Then the inequality above implies anupper bound on the size of Q t ( τ ) | Q t ( τ ) | ≤ kγ t n τ µ k − γ t µ k . Hence for small distance γ t , most of the row vectors u ti of U t are close to the corresponding row vectors u ∗ i of U ∗ . Proof.

For any i ∈ Q t ( τ ), we now derive a lower bound on k u t,Ti U t,T − u ∗ ,Ti U ∗ ,T k by consideringthe cases ( k u ti k − k u ∗ i k ) ≥ τ µ kn and ( k u ti k − k u ∗ i k ) < τ µ kn , separately. Consider the case( k u ti k − k u ∗ i k ) ≥ τ µ kn . Recall U t , U ∗ ∈ R n × k are two orthonormal matrices. Then, k u t,Ti U t,T − u ∗ ,Ti U ∗ ,T k =( u t,Ti U t,T − u ∗ ,Ti U ∗ ,T )( U t u ti − U ∗ u ∗ i )= u t,Ti u ti + u ∗ ,Ti u ∗ i − u t,Ti U t,T U ∗ u ∗ i − u ∗ ,Ti U ∗ ,T U t u ti ≥k u ti k + k u ∗ i k − k u ti k k u ∗ i k =( k u ti k − k u ∗ i k ) ≥ τ µ kn . (71)24ext, we consider the case ( k u ti k − k u ∗ i k ) < τ µ kn . (72)We ﬁrst show a lower bound on k u ti − u ∗ i k . By Assumption 1 on the incoherence of U ∗ and the inequality(72), we have k u ti k ≤ k u ∗ i k + (cid:12)(cid:12) k u ti k − k u ∗ i k (cid:12)(cid:12) ≤ r µ kn + s τ µ kn . (73)Then, k u ti u t,Ti − u ∗ i u ∗ ,Ti k = k u ti u t,Ti − u ti u ∗ ,Ti + u ti u ∗ ,Ti − u ∗ i u ∗ ,Ti k ≤ k u ti k k u t,Ti − u ∗ ,Ti k + k u ti − u ∗ i k k u ∗ ,Ti k ≤ r µ kn + s τ µ kn ! k u t,Ti − u ∗ ,Ti k + r µ kn k u ti − u ∗ i k = r µ kn + s τ µ kn ! k u ti − u ∗ i k . (74)Also by the deﬁnition of Q t ( τ ), for any i ∈ Q t ( τ ), we have k u ti u t,Ti − u ∗ i u ∗ ,Ti k > τn . (75)Recall τ ∈ (0 , k ≥ µ ≥

1. Hence, k u ti − u ∗ i k > τ /n q µ kn + q τ µ kn ≥ τ /n q µ kn + q µ kn ≥ τ √ µ kn . (76)Now, k u t,Ti U t,T − u ∗ ,Ti U ∗ ,T k = u t,Ti u ti + u ∗ ,Ti u ∗ i − u t,Ti U t,T U ∗ u ∗ i − u ∗ ,Ti U ∗ ,T U t u ti = k u ti − u ∗ i k − u t,Ti ( U t,T U ∗ − I ) u ∗ i − u ∗ ,Ti ( U ∗ ,T U t − I ) u ti ≥k u ti − u ∗ i k − k I − U ∗ ,T U t k k u ti k k u ∗ i k . Since U ∗ ,T U t is symmetric, U ∗ ,T U t has SVD U ∗ ,T U t = W Σ W T for some orthonormal matrix W ∈ R k × k , k I − U ∗ ,T U t k = k W ( I − Σ) W T k = k I − Σ k . By the property (17) of subspace distance, the least singular value of U ∗ ,T U t is p − γ t and thus allthe singular values in Σ are in [ p − γ t , k I − U ∗ ,T U t k ≤ − q − γ t ≤ γ t . Hence, k u t,Ti U t,T − u ∗ ,Ti U ∗ ,T k ≥k u ti − u ∗ i k − γ t k u ti k k u ∗ i k .

25y the lower bound on k u ti − u ∗ i k in (76), the upper bound of k u ti k in (73) and the incoherenceAssumption 1 on U ∗ , we have k u t,Ti U t,T − u ∗ ,Ti U ∗ ,T k ≥ τ µ kn − γ t r µ kn + s τ µ kn ! r µ kn ≥ τ µ kn − γ t µ kn , which, along with the lower bound on k u t,Ti U t,T − u ∗ ,Ti U ∗ ,T k in (71) for the ﬁrst case, implies that theinequality above holds for all i ∈ Q t ( τ ). Hence k U t U t,T − U ∗ U ∗ ,T k F = n X i =1 k u t,Ti U t,T − u ∗ ,Ti U ∗ ,T k ≥ | Q t ( τ ) | (cid:18) τ µ kn − γ t µ kn (cid:19) . (77)Since U t , U ∗ ∈ R n × k are both orthonormal matrices, the ranks of U t U t,T and U ∗ U ∗ ,T are both k . Thenthe rank of U t U t,T − U ∗ U ∗ ,T is at most 2 k , since the rank of the sum of two matrices is at most thesum of the ranks of two matrices. Then by property (18) of subspace distance, namely, γ t = dist( U t , U ∗ ) = k U t U t,T − U ∗ U ∗ ,T k and the inequality (1) where l = 2 k , we have k U t U t,T − U ∗ U ∗ ,T k F ≤ √ k k U t U t,T − U ∗ U ∗ ,T k = √ kγ t . Then from the inequality (77) we have2 kγ t ≥ | Q t ( τ ) | (cid:18) τ µ kn − γ t µ kn (cid:19) , from which the result (70) follows.For τ, α ∈ (0 , S tb, ( τ, α ) be S tb, ( τ, α ) , { j ∈ [ n ] : (cid:12)(cid:12) { i ∈ [ n ] : ( i, j ) ∈ Ω t +1 and i ∈ Q t ( τ ) } (cid:12)(cid:12) ≥ αd } . That is, S tb, ( τ, α ) is the set of the vertices on the right in the random bipartite d -regular graph associatedwith Ω t +1 such that each vertex in S tb, ( τ, α ) has at least αd neighbors in the index set Q t ( τ ). Let W ∈ R n × k be any orthonormal matrix with its i th row w Ti satisfying k w i k ≤ µkn , ∀ i ∈ [ n ]for some µ >

0. In our application, matrices U ∗ and U t will play the role of W . For a ∈ (0 , S tb, ( W, a ) , { j ∈ [ n ] : k nd X i :( i,j ) ∈ Ω t +1 w i w Ti − I k > a } . Roughly speaking, S tb, ( W, a ) contains all the vertices j ∈ [ n ] on the right in the random bipartite d -regular graph associated with Ω t +1 for which the corresponding matrix nd P i :( i,j ) ∈ Ω t +1 w i w Ti deviatesfrom I by a certain threshold. The next lemma shows that the size of S tb ( β ) is bounded from above bythe sum of | S tb, ( τ, α ) | and | S tb, ( W, a ) | for a certain choice of τ , α , W and a .26 emma 4.11. Let δ and β be as deﬁned in T AM . Also, let τ = (1 − β − δ ) / and α = (1 − β − δ ) / (12 µ k ) .Then, | S tb ( β ) | ≤ | S tb, ( τ, α ) | + | S tb, ( U ∗ , δ ) | . (78) Proof.

It suﬃces to show S tb ( β ) ⊆ S tb, ( τ, α ) ∪ S tb, ( U ∗ , δ ). For j / ∈ S tb, ( τ, α ) ∪ S tb, ( U ∗ , δ ), it follows fromthe deﬁnition of S tb, ( τ, α ) and S tb, ( U ∗ , δ ) that |{ i ∈ [ n ] : ( i, j ) ∈ Ω t +1 and i ∈ Q t ( τ ) }| < αd and (cid:13)(cid:13)(cid:13)(cid:13) nd X i :( i,j ) ∈ Ω t +1 u ∗ i u ∗ ,Ti − I (cid:13)(cid:13)(cid:13)(cid:13) ≤ δ. (79)Then, (cid:13)(cid:13)(cid:13)(cid:13) nd X i :( i,j ) ∈ Ω t +1 u ti u t,Ti − I (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13)(cid:13) nd X i :( i,j ) ∈ Ω t +1 u ∗ i u ∗ ,Ti − I (cid:13)(cid:13)(cid:13)(cid:13) + nd (cid:13)(cid:13)(cid:13)(cid:13) X i :( i,j ) ∈ Ω t +1 u ti u t,Ti − X i :( i,j ) ∈ Ω t +1 u ∗ i u ∗ ,Ti (cid:13)(cid:13)(cid:13)(cid:13) ≤ δ + nd (cid:13)(cid:13)(cid:13)(cid:13) X i :( i,j ) ∈ Ω t +1 u ti u t,Ti − X i :( i,j ) ∈ Ω t +1 u ∗ i u ∗ ,Ti (cid:13)(cid:13)(cid:13)(cid:13) . (80)Divide vertex j ’s neighbors { i ∈ [ n ] : ( i, j ) ∈ Ω t +1 } into two parts: neighbors in [ n ] \ Q t ( τ ) and neighborsin Q t ( τ ), that is, S = { i ∈ [ n ] : ( i, j ) ∈ Ω t +1 and i / ∈ Q t ( τ ) } and S = { i ∈ [ n ] : ( i, j ) ∈ Ω t +1 , i ∈ Q t ( τ ) } . Then we have (cid:13)(cid:13)(cid:13)(cid:13) X i :( i,j ) ∈ Ω t +1 u ti u t,Ti − X i :( i,j ) ∈ Ω t +1 u ∗ i u ∗ ,Ti (cid:13)(cid:13)(cid:13)(cid:13) ≤ X i ∈ S (cid:13)(cid:13) u ti u t,Ti − u ∗ i u ∗ ,Ti (cid:13)(cid:13) + X i ∈ S (cid:13)(cid:13) u ti u t,Ti − u ∗ i u ∗ ,Ti (cid:13)(cid:13) .i ∈ S implies i / ∈ Q t ( τ ) and thus (cid:13)(cid:13) u ti u t,Ti − u ∗ i u ∗ ,Ti (cid:13)(cid:13) ≤ τ /n . Then the right hand side of the inequalityabove is ≤ τn | S | + | S | ( (cid:13)(cid:13) u ti u t,Ti (cid:13)(cid:13) + (cid:13)(cid:13) u ∗ i u ∗ ,Ti (cid:13)(cid:13) ) . From the ﬁrst inequality of (79), we have | S | < αd , which, together with the incoherence assumptionof u ti in (45) and α = (1 − β − δ ) / (12 µ k ), implies the inequality above ≤ τn d + αd (cid:18) µ kn + µ kn (cid:19) = dn (cid:18) − β − δ − β − δ (cid:19) = dn (1 − β − δ ) . Then (80) becomes (cid:13)(cid:13)(cid:13)(cid:13) nd X i :( i,j ) ∈ Ω t +1 u ti u t,Ti − I (cid:13)(cid:13)(cid:13)(cid:13) ≤ − β. (81)It follows from the deﬁnition of S tb ( β ) in (41) that j / ∈ S tb ( β ) and thus S tb ( β ) ⊆ S tb, ( τ, α ) ∪ S tb, ( U ∗ , δ ).27e will establish Theorem 4.7 from the following two lemmas, which gives upper bounds on | S tb, ( τ, α ) | and | S tb, ( W, a ) | , respectively. We delay their proof for later. Proposition 4.12.

Suppose Assumption holds. Let α and ρ t be as deﬁned in (42). Without loss ofgenerality, let αd be an integer. Also, let λ = 1 αkµ and ν = ρ t k µ . For a

C > , suppose C ≥ e √ νλ, γ t ∈ (0 , / ( Cµ k . )) , | Q t ( τ ) | ≤ ρ t γ t n and ρ t γ t < . The following inequality | S tb, ( τ, α ) | ≤ . (cid:18) e ρ t γ t α (cid:19) αd n (82) holds w.h.p. Proposition 4.13.

For µ > and a ∈ (0 , , let f ( d, µ, a ) be as deﬁned in (43). Then w.h.p. for any ζ > | S tb, ( W, a ) | ≤ (1 + ζ ) f ( d, µ, a ) n. (83) Suppose Assumptions and hold for U ∗ . Let δ be as given in Assumptions . w.h.p. for any ζ > | S tb, ( U ∗ , δ ) | ≤ ζn. (84) Proof of Theorem 4.7 .

The ﬁrst result (46) directly follows from (83) in Lemma 4.13 where we choose W = U t , µ = 5 µ and a = 1 − β .Now we prove the second result (47). Let τ = (1 − β − δ ) /

0, we have C ≥ e √ νλ and ρ t γ t < . Also, Lemma 4.10 implies | Q t ( τ ) | ≤ ρ t γ t n . The veriﬁcation is completed. Then it follows from Propo-sition 4.12 that w.h.p. | S tb, ( τ, α ) | ≤ . (cid:18) e ρ t γ t α (cid:19) αd n. (85)28lso, (83) in Proposition 4.13 implies that under Assumption 1 w.h.p. for any ζ > | S tb, ( U ∗ , δ ) | ≤ (1 + ζ ) f ( d, µ , δ ) n. Therefore w.h.p. | S tb ( β ) | ≤ | S tb, ( τ, α ) | + | S tb, ( U ∗ , δ ) | ≤ . (cid:18) e ρ t γ t α (cid:19) αd n + (1 + ζ ) f ( d, µ , δ ) n from which (47) follows.Suppose that Assumption 2 is also satisﬁed, (84) in Proposition 4.13 implies that w.h.p. for any ζ > | S tb, ( U ∗ , δ ) | ≤ ζn which, together with the bound in (85), implies the third result (48) similarly. S tb, ( τ, α ) . Proof of Proposition 4.12 We will rely on the conﬁguration model of random regular graphs and its extension to the randombipartite regular graphs [Bol85, JLR00], which we now introduce.A conﬁguration model of G d ( n, n ) is obtained by replicating each of the 2 n vertices of the graph d times, and then creating a uniform random bipartite matching between dn replicas on the left andthe other dn replicas on the right. Then for every two vertices u ∈ [ n ] and v ∈ [ n ] on the oppositesides, an edge is created between u and v , for each edge between any of the replicas of u and any of thereplicas of v . The step of creating edges between vertices belonging to diﬀerent sides from the matchingon dn replicas on the left and the other dn replicas on the right we call projecting. It is known that,conditioned on the absence of parallel edges, this procedure gives a bipartite d -regular graph generateduniformly at random from the set of all bipartite d -regular graphs on 2 n vertices. It is also known thatthe probability of no parallel edges after projecting is bounded away from zero when d is bounded.More detailed results on this fact can be found in the introduction section of [Coo16]. Since we are onlyconcerned with events holding w.h.p., such a conditioning is irrelevant to us and thus we assume that G d ( n, n ) is generated simply by ﬁrst choosing a unifrom random bipartite matching and then projecting.Denote the conﬁguration model by ¯ G d ( n, n ), with vertices denoted by ( i, r, L ) for the vertices on the leftand ( i, r, R ) for the vertices on the right where i ∈ [ n ] and r ∈ [ d ]. Namely, ( i, r, L ( R )) is the r -th replicaof vertex i on the left (right) in the conﬁguration model. Given any set A ⊂ [ n ] on the left (right), let¯ A be the extension of A to the conﬁguration model, namely, ¯ A = { ( i, r, L ( R )) : i ∈ A, r ∈ [ d ] } . We willuse A and ¯ A interchangeably. Proof of Proposition 4.12.

By the assumption | Q t ( τ ) | ≤ ρ t γ t n , let | Q t ( τ ) | = ˆ ργ t n for some ˆ ρ ∈ [0 , ρ t ].Let E ( βn, αd ) be the event that there are exactly | S tb, ( τ, α ) | = βn vertices on the right such that eachof them has at least αd neighbors in the vertex set Q t ( τ ) on the left. Also, let R ( βn, αd, l ) ⊂ E ( βn, αd )be the event that there are exactly l edges between the vertex set Q t ( τ ) on the left and the vertex set S tb, ( τ, α ) on the right. Since under the event E ( βn, αd ) each vertex in S tb, ( τ, α ) has at least αd neighborsin Q t ( τ ) and the number of edges originating from S tb, ( τ, α ) is dβn , the number of edges between thevertex set Q t ( τ ) and the vertex set S tb, ( τ, α ) is within [ αdβn, dβn ]. Then l is at least αdβn , at most βdn and ∪ βdnl = αdβn R ( βn, αd, l ) = E ( βn, αd ). In what follows we bound the probability P ( R ( βn, αd, l ))in the conﬁguration model ¯ G d ( n, n ) for l ∈ [ αdβn, βdn ], and thus the probability P ( E ( βn, αd )) in theconﬁguration model ¯ G d ( n, n ) by the union bound.It follows from S tb, ( τ, α ) = βn and | Q t ( τ ) | = ˆ ργ t n that their counterparts in the conﬁgurationmodel are | ¯ S tb, ( τ, α ) | = βdn and | ¯ Q t ( τ ) | = ˆ ργ t dn. θ ∈ [ α,

1] be deﬁned by l = θβdn . Then as shown in Figure 1, the number of edges between ¯ Q t ( τ )and [ n ] \ S tb, ( τ, α ) is ˆ ργ t dn − θβdn, (86)the number of edges between ¯ S tb, ( τ, α ) and [ n ] \ Q t ( τ ) is βdn − θβdn , and the number of edges between[ n ] \ Q t ( τ ) and [ n ] \ S tb, ( τ, α ) is(1 − ˆ ργ t ) dn − ( βdn − θβdn ) = (1 − β ) dn − ˆ ργ t dn + θβdn. (87) S S ( ) t Q τ [ ] \ ( ) t n Q τ ,1 ( , ) tb S τ α ,1 [ ] \ ( , ) tb n S τ α , Q S E , c Q S E , c Q S E , c c Q S E Figure 1: Illustration of the event R ( βn, αd, θβdn ) where E Q,S = θβdn represents the number of edgesbetween two vertex sets sitting at the ends of the line corresponding to E Q,S . E Q,S c = ˆ ργ t dn − θβdn , E Q c ,S = βdn − θβdn and E Q c ,S c = (1 − β ) dn − ˆ ργ t dn + θβdn are deﬁned accordingly.Let X ij , i ∈ [ βn ] , j ∈ [ d ] be i.i.d. Bernoulli random variables with P ( X ij = 1) = θ , and Y ij , i ∈ [(1 − β ) n ] , j ∈ [ d ] be another set of i.i.d. Bernoulli random variables with P ( Y ij = 1) = ˆ ργ t − θβ − β .Deﬁne two conditional probabilities f = P  d X j =1 X ij ≥ αd, ∀ i ∈ [ βn ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) βn X i =1 d X j =1 X ij = θβdn  ,f = P  d X j =1 Y ij < αd, ∀ i ∈ [(1 − β ) n ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (1 − β ) n X i =1 d X j =1 Y ij = ˆ ργ t dn − θβdn  . Then we claim that (cid:0) βdnθβdn (cid:1) f is the number of ways of choosing θβdn replicas from βdn replicas in¯ S tb, ( τ, α ) such that each vertex in S tb, ( τ, α ) has at least αd replicas chosen. Deﬁne the set L , (cid:26) ( r , . . . , r βn ) ∈ [ d ] βn : βn X i =1 r i = θβdn ; r i ≥ αd, ∀ i ∈ [ βn ] (cid:27) . Then we expand f by Bayes’ formula f = P ( r ,...,r βn ): ( r ,...,r βn ) ∈ L Q βni =1 (cid:0) dr i (cid:1) θ r i (1 − θ ) d − r i (cid:0) βdnθβdn (cid:1) θ θβdn (1 − θ ) (1 − θ ) βdn = P ( r ,...,r βn ): ( r ,...,r βn ) ∈ L Q βni =1 (cid:0) dr i (cid:1)(cid:0) βdnθβdn (cid:1) . θβdn replicas from βdn replicas in ¯ S tb, ( τ, α ) such that each vertex in S tb, ( τ, α ) has at least αd replicas chosen.Hence the claim follows. Similarly, we have that (cid:0) (1 − β ) dn ˆ ργ t dn − θβdn (cid:1) f is the number of ways of choosingˆ ργ t dn − θβdn replicas from (1 − β ) dn replicas in [ n ] \ S tb, ( τ, α ) such that each vertex in [ n ] \ S tb, ( τ, α )has less than αd replicas chosen.Now we claim that the probability P ( R ( βn, αd, θβdn )) is given by P ( R ( βn, αd, θβdn )) = (cid:0) nβn (cid:1) I I I I ( dn )! (88)where I = (cid:18) βdnθβdn (cid:19) f (cid:18) ˆ ργ t dnθβdn (cid:19) ( θβdn )! ,I = (cid:18) (1 − ˆ ργ t ) dnβdn − θβdn (cid:19) ( βdn − θβdn )! ,I = (cid:18) (1 − β ) dn ˆ ργ t dn − θβdn (cid:19) f (ˆ ργ t dn − θβdn )! ,I = ((1 − β ) dn − ˆ ργ t dn + θβdn )! . Indeed, the term (cid:0) nβn (cid:1) is the number of ways of selecting | S tb, ( τ, α ) | = βn vertices from n vertices onthe right. The term I is the number of matching choices between θβdn vertices chosen from ¯ S tb, ( τ, α )and θβdn vertices chosen from ¯ Q t ( τ ) such that any vertex in S tb, ( τ, α ) has at least αd neighbors in Q t ( τ ). The term I is the number of matching choices between the remaining vetices in ¯ S tb, ( τ, α ) and βdn − θβdn vertices chosen from [ n ] \ Q t ( τ ). The term I is the number of matching choices betweenthe remaining vetices in ¯ Q t ( τ ) and ˆ ργ t dn − θβdn vertices chosen from [ n ] \ S tb, ( τ, α ) such that anyvertex in [ n ] \ S tb, ( τ, α ) has less than αd neighbors in Q t ( τ ). The term I is the number of matchingchoices between the remaining vetices in [ n ] \ Q t ( τ ) and the remaining vertices in [ n ] \ S tb, ( τ, α ). Thus (cid:0) nβn (cid:1) I I I I is the number of conﬁguration graphs such that there are exactly βn vertices on the righteach of which has at least αd neighbors in Q t ( τ ), and the number of edges between Q t ( τ ) and S tb, ( τ, α )is exactly θβdn . ( dn )! is precisely the total number of conﬁguration graphs. Hence (88) follows.By expanding the terms in (88), we have the following lemma. The proof of this lemma, whichinvolves heavy asymptotic expansions, can be found in Appendix D. Lemma 4.14.

Given β ∈ (1 . ρ t γ t /α ) αd , , there exists an η > such that lim sup n →∞ n log P ( R ( βn, αd, θβdn )) ≤ − η. (89)Applying Lemma 4.14, for any β ∈ (1 . ρ t γ t /α ) αd ,

1] we have by the union bound P ( E ( βn, αd )) ≤ βdn X l = αdβn P ( R ( βn, αd, l )) = exp( − Ω( n )) . Thus in the conﬁguration model ¯ G d ( n, n ), we have P ( | S tb, ( τ, α ) | > . ρ t γ t /α ) αd n ) ≤ n X h = ⌊ . ρ t γ t /α ) αd n ⌋ +1 P ( E ( h, αd )) = exp( − Ω( n )) . .4.2 Bounding the size of S tb, ( a ) . Proof of Proposition 4.13 We ﬁrst introduce Matrix Bernstein inequality.

Theorem 4.15. [T +

15, Theorem 6.1.1]

Consider a ﬁnite sequence { S k } of independent, random ma-trices with common dimension d × d . Assume that E ( S k ) = and k S k k ≤ L for each index k. Introduce the random matrix Z = P k S k . Let ν ( Z ) = max {k E ( Z T Z ) k , k E ( ZZ T ) k } = max {k E ( X k S Tk S k ) k , k E ( X k S k S Tk ) k } . Then for all t ≥ , P ( k Z k ≥ t ) ≤ ( d + d ) exp (cid:18) − t / ν ( Z ) + Lt/ (cid:19) . (90)We rely on the conﬁguration model ¯ G d ( n, n ) to prove Proposition 4.13. To state the generation ofthe conﬁguration model ¯ G d ( n, n ) more precisely, we ﬁrst introduce an ordering for the replicas on theright side of ¯ G d ( n, n ). For j , j ∈ [ n ] and r , r ∈ [ d ], we say ( j , r , R ) > ( j , r , R ) if j > j . For j ∈ [ n ] and r , r ∈ [ d ], we say ( j, r , R ) > ( j, r , R ) if r > r . Here we use the following procedureto generate a random bipartite d -regular multigraphs ¯ G d ( n, n ) on [ n ] × [ n ] vertices [Coo16, Wor99].Replicate each vetex in [ n ] on both sides of the graph d times. Then on the left side, the replicas are( i, r, L ) for all i ∈ [ n ] and all r ∈ [ d ]. Similarly on the right side, the replicas are ( i, r, R ) for all i ∈ [ n ]and all r ∈ [ d ]. Always choose the replica on the right of the least order and pair it uniformly at randomwith one unpaired replica on the left until all the replicas are paired. Finally for each pair, create anedge between the two replicas in the pair. Proof of Proposition 4.13.

Denote ( i, r , L ) ¯ G d ( n,n ) ∼ ( j, r , R ) if the vertex replica ( i, r , L ) on the left ispaired with the vertex replica ( j, r , R ) on the right in the graph ¯ G d ( n, n ). Then for each j ∈ [ n ], thevertex replicas on the left pairing with the replicas ( j, r , R ), ∀ r ∈ [ d ], on the right in ¯ G d ( n, n ) areincluded in H j , (cid:26) ( i, r , L ) : ∃ r ∈ [ d ] such that ( i, r , L ) ¯ G d ( n,n ) ∼ ( j, r , R ) (cid:27) . Recall W ∈ R n × k is an orthonormal matrix with incoherence parameter µ >

0. For the tuple ( i, r ), i ∈ [ n ] and r ∈ [ d ], let g (( i, r )) , i and correspondinglyˆ S b, ( W, a ) , (cid:26) j ∈ [ n ] : (cid:13)(cid:13)(cid:13)(cid:13) nd X ( i,r ):( i,r,L ) ∈ H j w g (( i,r )) w Tg (( i,r )) − I (cid:13)(cid:13)(cid:13)(cid:13) > a (cid:27) . Observe that conditional on ¯ G d ( n, n ) being a simple graph, ˆ S b, ( W, a ) has the same distribution as S tb, ( W, a ). For bounded d , the probability that the conﬁguration model produces a simple graph isbounded away from zero. Since we are only concerned with events holding w.h.p., in the following wederive an upper bound on | ˆ S b, ( W, a ) | instead.Let Z ir , i ∈ [ n ] and r ∈ [ d ], be a sequence of i.i.d. Bernoulli random variable with P ( Z ir = 1) = 1 /n . H consists of d replicas on the left which are paired with the d least ordered replicas on the right in32 G d ( n, n ). H can be also seen as d replicas chosen uniformly at random from nd replicas on the left.Then we have P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) nd X ( i,r ):( i,r,L ) ∈ H w g (( i,r )) w Tg (( i,r )) − I (cid:13)(cid:13)(cid:13)(cid:13) > a (cid:19) = P (cid:13)(cid:13)(cid:13)(cid:13) nd X ( i,r ) ∈{ ( i,r ) ∈ [ n ] × [ d ]: Z ir =1 } w g (( i,r )) w Tg (( i,r )) − I (cid:13)(cid:13)(cid:13)(cid:13) > a (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 d X r =1 Z ir = d  . (91)It follows from the Local Limit Theorem [Les05, Theorem 9.1] that P n X i =1 d X r =1 Z ir = d ! = 1 √ πd (1 + o (1)) . Then we have an upper bound on the right hand side of (91) P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) nd X ( i,r ) ∈{ ( i,r ) ∈ [ n ] × [ d ]: Z ir =1 } w g (( i,r )) w Tg (( i,r )) − I (cid:13)(cid:13)(cid:13)(cid:13) > a (cid:12)(cid:12)(cid:12)(cid:12) n X i =1 d X r =1 Z ir = d (cid:19) ≤ √ πd (1 + o (1)) P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) nd X ( i,r ) ∈{ ( i,r ) ∈ [ n ] × [ d ]: Z ir =1 } w g (( i,r )) w Tg (( i,r )) − I (cid:13)(cid:13)(cid:13)(cid:13) > a (cid:19) . (92)We claim P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) nd X ( i,r ) ∈{ ( i,r ) ∈ [ n ] × [ d ]: Z ir =1 } w g (( i,r )) w Tg (( i,r )) − I (cid:13)(cid:13)(cid:13)(cid:13) > a (cid:19) ≤ k exp (cid:18) − a / µk + µka/ d (cid:19) . Now we use Matrix Bernstein inequality (Theorem 4.15) to establish this claim. Let S ir , i ∈ [ n ] and r ∈ [ d ], be S ir = nd (cid:18) Z ir w i w Ti − n w i w Ti (cid:19) . Then by P ni =1 w i w Ti = W T W = I , n X i =1 d X r =1 S ir = nd X ( i,r ) ∈{ ( i,r ) ∈ [ n ] × [ d ]: Z ir =1 } w g (( i,r )) w Tg (( i,r )) − I and E ( S ir ) = . Using k w i k ≤ µk/n for all i ∈ [ n ], we have k S ir k ≤ nd × (1 − n ) k w i k ≤ µkd , ∀ i ∈ [ n ] and ∀ r ∈ [ d ] . Observe S ir ∈ R k × k is a symmetric matrix and w i w Ti is positive semideﬁnite. Then, (cid:13)(cid:13)(cid:13)(cid:13) n X i =1 d X r =1 E ( S ir S ir ) (cid:13)(cid:13)(cid:13)(cid:13) = (cid:16) nd (cid:17) (cid:13)(cid:13)(cid:13)(cid:13) n X i =1 d X r =1 E ( Z ir − n Z ir + 1 n ) w i w Ti w i w Ti (cid:13)(cid:13)(cid:13)(cid:13) = (cid:16) nd (cid:17) (cid:13)(cid:13)(cid:13)(cid:13) n X i =1 d (cid:18) n − n (cid:19) w i w Ti w i w Ti (cid:13)(cid:13)(cid:13)(cid:13) ≤ nd (cid:13)(cid:13)(cid:13) n X i =1 ( w Ti w i ) w i w Ti (cid:13)(cid:13)(cid:13) ≤ nd max i ∈ [ n ] { w Ti w i } (cid:13)(cid:13)(cid:13) n X i =1 w i w Ti (cid:13)(cid:13)(cid:13) . P ni =1 w i w Ti = I and the incoherence parameter µ of W , we have (cid:13)(cid:13)(cid:13)(cid:13) n X i =1 d X r =1 E ( S ir S ir ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ nd × µkn = µkd . The claim then follows from choosing t = a in (90) in Theorem 4.15. Then from the inequality (92), wehave P (cid:18)(cid:13)(cid:13)(cid:13) nd X ( i,r ):( i,r,L ) ∈ H w g (( i,r )) w Tg (( i,r )) − I (cid:13)(cid:13)(cid:13) > a (cid:19) ≤ k √ πd (1 + o (1)) exp (cid:18) − a / µk + µka/ d (cid:19) . In the conﬁguration model ¯ G d ( n, n ), H j for 2 ≤ j ≤ n has the same distribution as H . Hence for alarge n we have E ( | ˆ S b, ( W, a ) | ) ≤ k √ πd (1 + o (1)) exp (cid:18) − a / µk + µka/ d (cid:19) n < f ( d, µ, a ) n. (93)Next we apply the following concentration result. Theorem 4.16. [Wor99, Theorem 2.19] If X n is a random variable deﬁned on ¯ G d ( n, n ) such that | X n ( P ) − X n ( P ′ ) | ≤ c whenever P and P ′ diﬀer by a simple switching of two edges, then P ( | X n − E ( X n ) | ≥ t ) ≤ (cid:18) − t dnc (cid:19) for all t > . Although this result is established for the conﬁguration model of a random regular graph, the sameresult for the conﬁguration model of a random bipartite regular graph can be established in the obviousmanner. Choosing the constant c = 2 in this theorem, we have P (cid:18) (cid:12)(cid:12)(cid:12) | ˆ S b, ( W, a ) | − E ( | ˆ S b, ( W, a ) | ) (cid:12)(cid:12)(cid:12) ≥ ζf ( d, µ, a ) n (cid:19) ≤ (cid:18) − ζ f ( d, µ, a ) n dn (cid:19) = 2 exp (cid:18) − ζ f ( d, µ, a ) d n (cid:19) . Then it follows from the inequality above and the inequality (93) that P ( | ˆ S tb, ( W, a ) | > (1 + ζ ) f ( d, µ, a ) n ) ≤ (cid:18) − ζ f ( d, µ, a ) d n (cid:19) from which the ﬁrst result (83) follows.Now we establish the second result (84). Similarly, we have for any ζ > P (cid:18)(cid:12)(cid:12)(cid:12) | ˆ S tb, ( U ∗ , δ ) | − E ( | ˆ S tb, ( U ∗ , δ ) | ) (cid:12)(cid:12)(cid:12) ≥ ζ n (cid:19) ≤ (cid:18) − ( ζ/ n dn (cid:19) = 2 exp (cid:18) − ζ d n (cid:19) . Recall that the probability that the conﬁguration model produces a simple graph is bounded away fromzero and does not depend on n . Then, P (cid:18)(cid:12)(cid:12) | S tb, ( U ∗ , δ ) | − E ( | S tb, ( U ∗ , δ ) | ) (cid:12)(cid:12) ≥ ζ n (cid:19) ≤ (cid:18) − ζ d n + O (1) (cid:19) . It follows from Assumption 2 that E ( | S tb, ( U ∗ , δ ) | ) = o ( n ) and thus P ( | S tb, ( U ∗ , δ ) | > ζn ) ≤ (cid:18) − ζ d n + O (1) (cid:19) from which the second result follows. 34 Conclusions and Open Questions

We close this paper with several open questions for further research. In light of the new algorithm

T AM which improves the sample complexity for the alternating minimization algorithm by a factor log n for the case of matrix M with bounded rank, condition number and incoherence parameter, a naturaldirection is to extend this result to the cases when the rank, condition number and incoherence parameterare possibly growing functions of the dimension of M . In this situation we would be considering thecase of growing d for which Assumption 2 is satisﬁed automatically by applying Matrix Bernsteininequality. On the other hand, under uniform sampling and for the case of growing (average degree) d ,Hardt [Har14] proposed an augmented alternating minimization algorithm by adding extra smoothingsteps typically used in smoothed analysis of the QR factorization. This reduced the dependence of thesample complexity on the rank, condition number and incoherence parameter. Perhaps such smoothingsteps can be incorporated into the T AM algorithm as well, possibly leading to a reduced sample andcomputational complexity when compared to the one achieved in [Har14].Studying

T AM under i.i.d. uniform sampling, which corresponds to a bipartite Erd¨os-R´enyi graph,is another interesting problem. Instead of using the conﬁguration model, possibly Poisson cloningmodel can be employed to carry out a similar analysis for the case of a bipartite Erd¨os-R´enyi graph.We conjecture that the same sample complexity of

T AM holds under such uniform sampling.Finally, another challenge is to achieve the information theoretic lower bound of sample complexity O ( µ kn log n ) [CT10] for exact low-rank matrix completion when k is growing. The technique developedin this paper for reducing sample complexity by a log n factor might be of interest for achieving thisgoal via more careful analysis of the trace-norm based minimization. References [BJ14] Srinadh Bhojanapalli and Prateek Jain,

Universal matrix completion , Proceedings of The31st International Conference on Machine Learning, 2014, pp. 1881–1889.[BKS10] Mohsen Bayati, Jeong Han Kim, and Amin Saberi,

A sequential algorithm for generatingrandom graphs , Algorithmica (2010), no. 58, 860–910.[BLWZ17] Maria-Florina Balcan, Yingyu Liang, David P Woodruﬀ, and Hongyang Zhang, Optimalsample complexity for matrix completion and related problems via l -regularization , arXivpreprint arXiv:1704.08683 (2017).[Bol85] B. Bollobas, Random graphs , Academic Press, Inc., 1985.[CCS10] J. F. Cai, E. J. Cand´es, and Z. Shen,

A singular value thresholding algorithm for matrixcompletion , SIAM Journal on Optimization (2010), no. 4, 1956–1982.[Che15] Yudong Chen, Incoherence-optimal matrix completion , IEEE Transactions on InformationTheory (2015), no. 5, 2909–2923.[Coo16] Nicholas A Cook, Discrepancy properties for random regular digraphs , Random Structures& Algorithms (2016).[CP10] Emmanuel J Candes and Yaniv Plan,

Matrix completion with noise , Proceedings of the IEEE (2010), no. 6, 925–936.[CR09] Emmanuel J Cand`es and Benjamin Recht, Exact matrix completion via convex optimization ,Foundations of computational mathematics (2009), no. 6, 717–772.35CT10] Emmanuel J Cand`es and Terence Tao, The power of convex relaxation: Near-optimal matrixcompletion , IEEE Transactions on Information Theory (2010), no. 5, 2053–2080.[Gro11] David Gross, Recovering low-rank matrices from few coeﬃcients in any basis , IEEE Trans-actions on Information Theory (2011), no. 3, 1548–1566.[GVL12] Gene H Golub and Charles F Van Loan, Matrix computations , vol. 3, JHU Press, 2012.[Har14] Moritz Hardt,

Understanding alternating minimization for matrix completion , Foundationsof Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, IEEE, 2014, pp. 651–660.[JLR00] S. Janson, T. Luczak, and A. Rucinski,

Random graphs , John Wiley and Sons, Inc., 2000.[JNS12] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi,

Low-rank matrix completion usingalternating minimization , arXiv preprint arXiv:1212.0467 (2012).[Kim06] Jeong Han Kim,

Poisson cloning model for random graphs , Proceedings of the InternationalCongress of Mathematicians, 2006, pp. 873–897.[KMO10] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh,

Matrix completion from afew entries , IEEE Transactions on Information Theory (2010), no. 6, 2980–2998.[KTT15] Franz J Kir´aly, Louis Theran, and Ryota Tomioka, The algebraic combinatorial approach forlow-rank matrix completion. , Journal of Machine Learning Research (2015), 1391–1436.[Les05] Emmanuel Lesigne, Heads or tails: an introduction to limit theorems in probability , vol. 28,American Mathematical Soc., 2005.[MHT10] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani,

Spectral regularization algorithmsfor learning large incomplete matrices , Journal of machine learning research (2010),no. Aug, 2287–2322.[MJD09] Raghu Meka, Prateek Jain, and Inderjit S Dhillon, Matrix completion from power-law dis-tributed samples , Advances in neural information processing systems, 2009, pp. 1258–1266.[Mos12] Mohammad Sal Moslehian,

Ky fan inequalities , Linear and Multilinear Algebra (2012),no. 11-12, 1313–1325.[PABN16] Daniel L Pimentel-Alarc´on, Nigel Boston, and Robert D Nowak, A characterization of deter-ministic sampling patterns for low-rank matrix completion , IEEE Journal of Selected Topicsin Signal Processing (2016), no. 4, 623–636.[Pud15] Doron Puder, Expansion of random graphs: New proofs, new results , Inventiones mathemat-icae (2015), no. 3, 845–908.[Rec11] Benjamin Recht,

A simpler approach to matrix completion , Journal of Machine LearningResearch (2011), no. Dec, 3413–3430.[SL16] Ruoyu Sun and Zhi-Quan Luo, Guaranteed matrix completion via non-convex factorization ,IEEE Transactions on Information Theory (2016), no. 11, 6535–6579.[T +

15] Joel A Tropp et al.,

An introduction to matrix concentration inequalities , Foundations andTrends R (cid:13) in Machine Learning (2015), no. 1-2, 1–230.36Wor99] Nicholas C Wormald, Models of random regular graphs , London Mathematical Society Lec-ture Note Series (1999), 239–298.[ZL16] Qinqing Zheng and John Laﬀerty,

Convergence analysis for rectangular matrix completionusing burer-monteiro factorization and gradient descent , arXiv preprint arXiv:1605.07051(2016).[ZWL15] Tuo Zhao, Zhaoran Wang, and Han Liu,

A nonconvex optimization framework for low rankmatrix estimation , Advances in Neural Information Processing Systems, 2015, pp. 559–567.

A Proof of Lemma 4.2

Proof of Lemma 4.2.

Let ¯ U Σ V T be the top- k singular components of nd P Ω ( M ). Recall from theproperty P that w.h.p. the second largest singular value of the biadjacency matrix of the randombipartite d -regular graph associated with the index set Ω is at most (7 √ d ) /

3. By Theorem 4.1 in[BJ14] where we choose 7 / k M − ¯ U Σ V T k ≤ µ k √ d k M k . (94)Also we have k M − ¯ U Σ V T k = k U ∗ Σ ∗ V ∗ ,T − ¯ U ¯ U ,T U ∗ Σ ∗ V ∗ ,T + ¯ U ¯ U ,T U ∗ Σ ∗ V ∗ ,T − ¯ U Σ V T k = k ( I − ¯ U ¯ U ,T ) U ∗ Σ ∗ V ∗ ,T + ¯ U ( ¯ U ,T U ∗ Σ ∗ V ∗ ,T − Σ V T ) k . Since I − ¯ U ¯ U ,T is orthogonal to ¯ U , we have the right hand side of the equation above ≥ k ( I − ¯ U ¯ U ,T ) U ∗ Σ ∗ V ∗ ,T k = k ¯ U ⊥ ¯ U ,T ⊥ U ∗ Σ ∗ V ∗ ,T k . Suppose the SVD of ¯ U ,T ⊥ U ∗ Σ ∗ is ˆ U ˆΣ ˆ V T . Then¯ U ⊥ ¯ U ,T ⊥ U ∗ Σ ∗ V ∗ ,T = ¯ U ⊥ ˆ U ˆΣ ˆ V T V ∗ ,T . Observe that ¯ U ⊥ ˆ U and V ∗ ˆ V are both orthonormal matrices. Then ˆ U ˆΣ ˆ V T has the same singular valuesas the ones in ¯ U ⊥ ¯ U ,T ⊥ U ∗ Σ ∗ V ∗ ,T , i.e. k ¯ U ⊥ ¯ U ,T ⊥ U ∗ Σ ∗ V ∗ ,T k = k ¯ U ,T ⊥ U ∗ Σ ∗ k . Let y ∈ R n − k be the top left singular vector of ¯ U ,T ⊥ U ∗ . In particular, k y ¯ U ,T ⊥ U ∗ k = k ¯ U ,T ⊥ U ∗ k . Then k ¯ U ,T ⊥ U ∗ Σ ∗ k = sup x ∈ R n − k : k x k =1 k x ¯ U ,T ⊥ U ∗ Σ ∗ k ≥ k y ¯ U ,T ⊥ U ∗ Σ ∗ k = k ¯ U ,T ⊥ U ∗ k k y ¯ U ,T ⊥ U ∗ k ¯ U ,T ⊥ U ∗ k Σ ∗ k ≥ k ¯ U ,T ⊥ U ∗ k inf z ∈ R k : k z k =1 k z Σ ∗ k = k ¯ U ,T ⊥ U ∗ k σ ∗ k , which, together with (94), gives k ¯ U ,T ⊥ U ∗ k ≤ µ k √ d σ ∗ σ ∗ k . The result then follows from d ≥ Ck µ ( σ ∗ /σ ∗ k ) . 37 Proof of Lemma 4.3

Proof of Lemma 4.3.

First, we claim that there exists an orthonormal matrix R ∈ R k × k such that k U ∗ R − ¯ U k F ≤ √ φ . (95)Consider the SVD of U ∗ ,T ¯ U = W Σ W T where W , W ∈ R k × k are two orthonormal matrices. Since k U ∗ ,T ¯ U k ≤ k U ∗ ,T k k ¯ U k = 1, all the singular values in Σ are within [0 ,

1] and W W T is also anorthonormal matrix. Let R = W W T . Then we have k U ∗ R − ¯ U k = k ( U ∗ R − ¯ U ) T ( U ∗ R − ¯ U ) k = k I − R T U ∗ ,T ¯ U − ¯ U T U ∗ R k = k I − W Σ W T k = 2 k W ( I − Σ) W T k = 2 k I − Σ k Let γ = dist( ¯ U , U ∗ ). By the property (17) of subspace distance and γ = k U ∗ ,T ⊥ ¯ U k , the least singularvalue of U ∗ ,T ¯ U is p − γ . Then the inequality above becomes k U ∗ R − ¯ U k = 2(1 − p − γ ) ≤ γ which, together with the inequality (1) where l = k , implies k U ∗ R − ¯ U k F ≤ √ kγ . Then the claimfollows from γ = dist( ¯ U , U ∗ ) ≤ / ( φk / ).Let ¯ u Ti , ˆ u Ti , u ∗ ,Ti , i ∈ [ n ], be the i -th row of the matrices ¯ U , ˆ U and U ∗ . We claim k ¯ u Ti − ˆ u Ti k ≤ k ¯ u Ti − u ∗ ,Ti R k ∀ i ∈ [ n ] . (96)We will establish this claim by considering the case k ¯ u Ti k ≥ p µ k/n and the case k ¯ u Ti k < p µ k/n ,respectively. Consider the case k ¯ u Ti k ≥ p µ k/n . Applying the operator T on ¯ u i truncates ¯ u i to ˆ u i of the same direction and of length p µ k/n , which gives k ˆ u i k = p µ k/n and thus k ¯ u Ti − ˆ u Ti k = k ¯ u Ti k − r µ kn . Notice that the orthonormal transformation does not change the length of u ∗ i , that is, k u ∗ ,Ti R k = k u ∗ ,Ti k ≤ p µ k/n. The triangle inequality gives k ¯ u Ti − ˆ u Ti k = k ¯ u Ti k − r µ kn ≤ k ¯ u Ti k − k u ∗ ,Ti R k ≤ k ¯ u Ti − u ∗ ,Ti R k , and the claim is established for the case k ¯ u Ti k ≥ p µ k/n .Suppose now ¯ u Ti satisﬁes k ¯ u Ti k < p µ k/n . It follows from (7) that ˆ u Ti = T (¯ u Ti ) = ¯ u Ti and thus k ¯ u Ti − ˆ u Ti k = 0. Then the claim follows. Thus it follows from (96) and (95) that k ¯ U − ˆ U k F = vuut n X i =1 k ¯ u Ti − ˆ u Ti k ≤ vuut n X i =1 k ¯ u Ti − u ∗ ,Ti R k = k ¯ U − U ∗ R k F ≤ √ φ . (97)38pplying Ky Fan singular value inequality (3) to ¯ U = ˆ U + ( ¯ U − ˆ U ) gives σ k ( ¯ U ) ≤ σ k ( ˆ U ) + σ ( ¯ U − ˆ U ) . Since ¯ U ∈ R n × k is an orthonormal matrix, we have σ k ( ¯ U ) = 1. Also we have σ ( ¯ U − ˆ U ) ≤ k ¯ U − ˆ U k F ≤√ /φ . Then using φ ≥ √ / ( √ − σ k ( ˆ U ) ≥ σ k ( ¯ U ) − σ ( ¯ U − ˆ U ) ≥ − √ φ ≥ √ . We can write U = ˆ U Q − where Q is an invertible matrix with the same singular values as ˆ U . This,together with the inequality above, implies k Q − k = 1 σ k ( ˆ U ) ≤ √ . Since ˆ u Ti is obtained by applying the operations T on ¯ u Ti , we have k ˆ u i k < p µ k/n for all i ∈ [ n ].Therefore for all i ∈ [ n ] k u Ti k = k ˆ u Ti Q − k ≤ k ˆ u i k k Q − k ≤ √ × r µ kn = r µ kn , and (20) is established. Finally,dist( U, U ∗ ) = k U ∗ ,T ⊥ U k = k U ∗ ,T ⊥ ˆ U Q − k ≤k U ∗ ,T ⊥ ˆ U k k Q − k ≤ √ k U ∗ ,T ⊥ ˆ U k ≤ √ (cid:16) k U ∗ ,T ⊥ ¯ U k + k U ∗ ,T ⊥ ( ˆ U − ¯ U ) k (cid:17) ≤ √

52 ( k U ∗ ,T ⊥ ¯ U k + k ¯ U − ˆ U k ) . Recall from the assumptions of this lemma that k U ∗ ,T ⊥ ¯ U k = dist( ¯ U , U ∗ ) ≤ / ( φk / ) and from (97)that k ¯ U − ˆ U k ≤ k ¯ U − ˆ U k F ≤ √ /φ . (21) then follows fromdist( U, U ∗ ) ≤ √ φk / + √ φ ! ≤ √ φ . C Proof of Proposition 4.9

Proof of Proposition 4.9.

We prove (55) for the cases e ρ t γ t /α > ρ t γ t /α ≤

1, separately. Con-sider the case e ρ t γ t /α >

1. Recall ρ t given in (42). We ﬁrst derive an upper bound on ρ t . It followsfrom γ t ∈ (0 , √ / ( p C ( δ, β ) µ k . )) that(1 − β − δ ) µ k − γ t µ k ≥ (1 − β − δ ) µ k − C ( δ, β ) µ k µ k = (1 − β − δ ) µ k − C ( δ, β ) µ k . C ( δ, β ) > ρ t = 2 k (1 − β − δ ) µ k − γ t µ k ≤ k (1 − β − δ ) µ k = 96 µ k (1 − β − δ ) . (98)Recall α = (1 − β − δ ) / (12 µ k ) in (42). Then,e ρ t γ t α ≤ µ k − β − δ µ k (1 − β − δ ) C ( δ, β ) k µ = 12 × (1 − β − δ ) C ( δ, β ) . Now we have the left hand side of (55)14 β σ ∗ σ ∗ k ( µ k ) . √ . (cid:18) e ρ t γ t α (cid:19) αd/ ≤ β ( µ k ) . σ ∗ σ ∗ k √ . (cid:18) × (1 − β − δ ) C ( δ, β ) (cid:19) − β − δ µ k d = exp log (cid:18) √ . β (cid:19) + 1 . µ + 1 . k + log (cid:18) σ ∗ σ ∗ k (cid:19) + d − β − δ µ k log (cid:18) × (1 − β − δ ) C ( δ, β ) (cid:19) ! . Then for d ≥ C ( δ, β ) k µ ( σ ∗ /σ ∗ k ) , the last term in the exponent above is a polynomial of µ , k and σ ∗ /σ ∗ k while the rest terms are the linear combination of log µ , log k and log( σ ∗ /σ ∗ k ). Also observethat a large C ( δ, β ) leads to a negative coeﬃcient of the last term in the exponent above. Hence thefollowing inequality holds for a large C ( δ, β ) and d ≥ C ( δ, β ) k µ ( σ ∗ /σ ∗ k ) β σ ∗ σ ∗ k ( µ k ) . √ . (cid:18) e ρ t γ t α (cid:19) αd/ ≤ √ k (1 − β − δ ) µ k − β − δ µ k . Finally by the upper bound on ρ t in (98) and e ρ t γ t /α >

1, we have120 √ k (1 − β − δ ) µ k − β − δ µ k ≤ √ k ρ t α e ≤ γ t √ k , which gives (55) for the cases e ρ t γ t /α > ρ t γ t /α ≤

1. Then we have the following upper bound on the left hand side of(55) 14 β σ ∗ σ ∗ k ( µ k ) . √ . (cid:18) e ρ t γ t α (cid:19) αd/ ≤ β σ ∗ σ ∗ k ( µ k ) . √ . γ − β − δ µ k dt . Similarly it follows from γ t ≤ √ / ( p C ( δ, β ) µ k . ) and d ≥ C ( δ, β ) k µ ( σ ∗ /σ ∗ k ) that for a large C ( δ, β ) >

0, the following inequality holds14 β σ ∗ σ ∗ k ( µ k ) . √ . γ αd/ t ≤ γ t √ k . Hence the inequality (55) follows. 40ow we show the inequality (56). Recall the deﬁnition of f ( d, µ , δ ) in (43). Then the left hand sideof (56) becomes exp (cid:18) log (cid:18) β σ ∗ σ ∗ k ( µ k ) . (cid:19) + 12 log(1 .

1) + 12 log f ( d, µ , δ ) (cid:19) = exp log (cid:18) β σ ∗ σ ∗ k ( µ k ) . (cid:19) + 12 log(1 .

1) + 12 log(3 k √ π )+ 14 log d − δ µ k (1 + δ/ d ! . (99)Let g ( d ) = 14 log d − δ µ k (1 + δ/ d. Then the derivative of g ( d ) is g ′ ( d ) = 14 d − δ µ k (1 + δ/ . For a large C ( δ, β ) > g ′ ( d ) is always negative for any d satisfying (14), that is, d ≥ C ( δ, β ) k µ (cid:18) σ ∗ σ ∗ k (cid:19) + 5 µ k (1 + δ/ δ log (cid:18) ǫ (cid:19) . Then the right hand side of (99) is ≤ exp log (cid:18) β ( µ k ) . σ ∗ σ ∗ k (cid:19) + 12 log(1 .

1) + 12 log(3 k √ π )+ 14 log C ( δ, β ) k µ (cid:18) σ ∗ σ ∗ k (cid:19) + 5 µ k (1 + δ/ δ log (cid:18) ǫ (cid:19)! − δ µ k (1 + δ/ C ( δ, β ) k µ (cid:18) σ ∗ σ ∗ k (cid:19) + 5 µ k (1 + δ/ δ log (cid:18) ǫ (cid:19)! ! . Using log( x + y ) ≤ log x + log y for x, y ≥

2, we obtain the right hand side of the inequality above ≤ exp log (cid:18) β ( µ k ) . σ ∗ σ ∗ k (cid:19) + 12 log(1 .

1) + 12 log(3 k √ π )+ 14 log C ( δ, β ) k µ (cid:18) σ ∗ σ ∗ k (cid:19) ! + 14 log (cid:18) µ k (1 + δ/ δ log (cid:18) ǫ (cid:19)(cid:19) − δ µ k (1 + δ/ C ( δ, β ) k µ (cid:18) σ ∗ σ ∗ k (cid:19) + 5 µ k (1 + δ/ δ log (cid:18) ǫ (cid:19)! ! . Using log log(1 /ǫ ) ≤ log(1 /ǫ ) for all ǫ ∈ (0 , / ≤ exp log (cid:18) β ( µ k ) . σ ∗ σ ∗ k (cid:19) + 12 log(1 .

1) + 12 log(3 k √ π )+ 14 log C ( δ, β ) k µ (cid:18) σ ∗ σ ∗ k (cid:19) ! + 14 log (cid:18) µ k (1 + δ/ δ (cid:19) − δ C ( δ, β ) k µ δ/ (cid:18) σ ∗ σ ∗ k (cid:19) − log (cid:18) ǫ (cid:19) ! . µ , k and σ ∗ /σ ∗ k while thepositive terms are linear combination of log µ , log k and log( σ ∗ /σ ∗ k ). Hence for a large enough C ( δ, β ) >

0, the right hand side of the inequation above is no more than ǫ/ (40 √ k ), from which the inequality(56) follows. D Proof of Lemma 4.14

Proof of Lemma 4.14.

Consider the logarithm of each term in (88) normalized by dn . Using Stirling’sapproximation a ! ≈ √ πa ( a/ e) a , we have1 dn log (cid:18) nβn (cid:19) = 1 dn log n !((1 − β ) n )!( βn )!= o (1) + 1 dn log √ πnn n p π (1 − β ) n ((1 − β ) n ) (1 − β ) n √ πβn ( βn ) βn = o (1) + 1 dn ( n log n − (1 − β ) n log((1 − β ) n ) − βn log( βn ))= o (1) − (1 − β ) log(1 − β ) + β log βd , Notice that (log( √ n )) /n = o (1). In the following expansion of a !, for convenience we will not explicitlywrite down the term √ πa .1 dn log I = 1 dn log (cid:18) ( βdn )!( θβdn )!((1 − θ ) βdn )! f (ˆ ργ t dn )!( θβdn )!((ˆ ργ t − θβ ) dn )! ( θβdn )! (cid:19) = 1 dn log (cid:18) f ( βdn )!( θβdn )!((1 − θ ) βdn )! (ˆ ργ t dn )!((ˆ ργ t − θβ ) dn )! (cid:19) = o (1) + 1 dn log (cid:16) f ( βdn ) βdn ( θβdn ) θβdn ((1 − θ ) βdn ) (1 − θ ) βdn × (ˆ ργ t dn ) ˆ ργ t dn exp( − ˆ ργ t dn )((ˆ ργ t − θβ ) dn ) (ˆ ργ t − θβ ) dn exp( − (ˆ ργ t − θβ ) dn ) (cid:17) = o (1) + 1 dn log (cid:16) f ( βdn ) βdn ( θβdn ) θβdn ((1 − θ ) βdn ) (1 − θ ) βdn (ˆ ργ t dn ) ˆ ργ t dn ((ˆ ργ t − θβ ) dn ) (ˆ ργ t − θβ ) dn exp( − θβdn ) (cid:17) = o (1) + 1 dn log f + β log( βdn ) − θβ log( θβdn ) − (1 − θ ) β log((1 − θ ) βdn )+ ˆ ργ t log(ˆ ργ t dn ) − (ˆ ργ t − θβ ) log((ˆ ργ t − θβ ) dn ) − θβ = o (1) + 1 dn log f + β log β − θβ log( θβ ) − (1 − θ ) β log((1 − θ ) β ) + ˆ ργ t log(ˆ ργ t ) − (ˆ ργ t − θβ ) log(ˆ ργ t − θβ ) + θβ log( dn ) − θβ, dn log I = 1 dn log (cid:18) ((1 − ˆ ργ t ) dn )!((1 − ˆ ργ t − β + θβ ) dn )! (cid:19) = 1 dn log ((1 − ˆ ργ t ) dn ) (1 − ˆ ργ t ) dn exp( − (1 − ˆ ργ t ) dn )((1 − ˆ ργ t − β + θβ ) dn ) (1 − ˆ ργ t − β + θβ ) dn exp( − (1 − ˆ ργ t − β + θβ ) dn ) ! + o (1)= (1 − ˆ ργ t ) log((1 − ˆ ργ t ) dn ) − (1 − ˆ ργ t − β + θβ ) log((1 − ˆ ργ t − β + θβ ) dn ) − (1 − θ ) β + o (1)= (1 − ˆ ργ t ) log(1 − ˆ ργ t ) − (1 − ˆ ργ t − β + θβ ) log(1 − ˆ ργ t − β + θβ ) + (1 − θ ) β log( dn ) − (1 − θ ) β + o (1) , dn log I = 1 dn log (cid:18) f ((1 − β ) dn )!((1 − β − ˆ ργ t + θβ ) dn )! (cid:19) = 1 dn log f ((1 − β ) dn ) (1 − β ) dn exp( − (1 − β ) dn )((1 − β − ˆ ργ t + θβ ) dn ) (1 − β − ˆ ργ t + θβ ) dn exp( − (1 − β − ˆ ργ t + θβ ) dn ) ! + o (1)= 1 dn log f + (1 − β ) log((1 − β ) dn ) − (1 − β − ˆ ργ t + θβ ) log((1 − β − ˆ ργ t + θβ ) dn ) − ˆ ργ t + θβ + o (1)= 1 dn log f + (1 − β ) log(1 − β ) − (1 − β − ˆ ργ t + θβ ) log(1 − β − ˆ ργ t + θβ )+ (ˆ ργ t − θβ ) log( dn ) − (ˆ ργ t − θβ ) + o (1) , dn log I = 1 dn log (cid:16) ((1 − β − ˆ ργ t + θβ ) dn ) (1 − β − ˆ ργ t + θβ ) dn exp( − (1 − β − ˆ ργ t + θβ ) dn ) (cid:17) + o (1)= (1 − β − ˆ ργ t + θβ ) log(1 − β − ˆ ργ t + θβ ) + (1 − β − ˆ ργ t + θβ ) log( dn ) − (1 − β − ˆ ργ t + θβ ) + o (1)1 dn log(( dn )!) = 1 dn log(( dn ) dn exp( − dn )) + o (1) = o (1) + log( dn ) − . Combining these terms above, the expression of log P ( R ( βn, αd, θβdn )) normalized by dn is rewrittenas follows where both the terms with factor log( dn ) and without log( · ) factor cancel out.1 dn log P ( R ( βn, αd, θβdn ))= o (1) − β log β + (1 − β ) log(1 − β ) d + 1 dn (log f + log f ) + β log β − θβ log( θβ ) − (1 − θ ) β log((1 − θ ) β ) + ˆ ργ t log(ˆ ργ t ) − (ˆ ργ t − θβ ) log(ˆ ργ t − θβ ) + (1 − ˆ ργ t ) log(1 − ˆ ργ t ) − (1 − ˆ ργ t − β + θβ ) log(1 − ˆ ργ t − β + θβ ) + (1 − β ) log(1 − β ) . (100)43ext we divide the terms on the right hand side of the equation (100) into ﬁve groups and then deriveupper bounds on them, respectively. Using the fact log(1 − a ) ≥ − a/ (1 − a ) for a ∈ [0 , − β log β + (1 − β ) log(1 − β ) d ≤ − β log βd + 1 − βd β − β = − β log βd + βd . (101)Since f , f ∈ [0 , dn (log f + log f ) ≤ . (102)For the third group, we have β log β − θβ log( θβ ) − (1 − θ ) β log((1 − θ ) β )= β log β − θβ log θ − θβ log β − (1 − θ ) β log(1 − θ ) − (1 − θ ) β log β = − β ( θ log θ + (1 − θ ) log(1 − θ )) . (103)Using the fact log(1 − a ) ≥ − a/ (1 − a ) for a ∈ [0 ,

1) again, we have for the fourth groupˆ ργ t log(ˆ ργ t ) − (ˆ ργ t − θβ ) log(ˆ ργ t − θβ )= ˆ ργ t log(ˆ ργ t ) − (ˆ ργ t − θβ ) (cid:18) log(ˆ ργ t ) + log(1 − θβ ˆ ργ t ) (cid:19) ≤ ˆ ργ t log(ˆ ργ t ) − (ˆ ργ t − θβ ) log(ˆ ργ t ) + (ˆ ργ t − θβ ) θβ/ (ˆ ργ t )1 − θβ/ (ˆ ργ t )= θβ log(ˆ ργ t ) + θβ. (104)It follows from the non-negativity of (86) and the right hand side of (87) that θβ ≤ ˆ ργ t and β (1 − θ ) ≤ − ˆ ργ t . Also | Q t ( τ ) | = ˆ ργ t n < n gives 1 − ˆ ργ t >

0. Then we have for the ﬁfth group (1 − ˆ ργ t ) log(1 − ˆ ργ t ) − (1 − ˆ ργ t − β + θβ ) log(1 − ˆ ργ t − β + θβ ) + (1 − β ) log(1 − β )= (1 − ˆ ργ t ) log(1 − ˆ ργ t ) + (ˆ ργ t − θβ ) log(1 − ˆ ργ t − β + θβ ) − (1 − β ) log 1 − ˆ ργ t − β + θβ − β = (1 − ˆ ργ t ) log(1 − ˆ ργ t ) + (ˆ ργ t − θβ ) log(1 − ˆ ργ t ) + (ˆ ργ t − θβ ) log(1 − β (1 − θ )1 − ˆ ργ t ) − (1 − β ) log 1 − ˆ ργ t − β + θβ − β . (105) Using log(1 − a ) ≤ − a for a ∈ [0 , ≤ (1 − θβ ) log(1 − ˆ ργ t ) − (ˆ ργ t − θβ ) β (1 − θ )1 − ˆ ργ t − (1 − β ) log 1 − ˆ ργ t − β + θβ − β = ( β − θβ ) log(1 − ˆ ργ t ) + (1 − β ) log(1 − ˆ ργ t ) − (ˆ ργ t − θβ ) β (1 − θ )1 − ˆ ργ t − (1 − β ) log 1 − ˆ ργ t − β + θβ − β = β (1 − θ ) log(1 − ˆ ργ t ) − (ˆ ργ t − θβ ) β (1 − θ )1 − ˆ ργ t + (1 − β ) log (1 − ˆ ργ t )(1 − β )1 − ˆ ργ t − β + θβ = β (1 − θ ) log(1 − ˆ ργ t ) + ( θβ − ˆ ργ t ) β (1 − θ )1 − ˆ ργ t + (1 − β ) log 1 − ˆ ργ t − β + ˆ ργ t β − ˆ ργ t − β + θβ . (106)

44e claim that the right hand side of (106) is nonpositive. It is easy to see that the ﬁrst term onthe right hand side of (106) is nonpositive. It follows from the non-negativity of the term in (86) that θβ − ˆ ργ t ≤ ρ t = νk µ and γ t ≤ Cµ k . , we have ρ t γ t ≤ νk µ C µ k = νC µ k . Also by C ≥ e √ νλ and α = 1 / ( λµ k ), the inequality above becomes ρ t γ t ≤ λ µ k = α e . (107)Then it follows from θ ∈ [ α,

1] and ˆ ρ ∈ [0 , ρ ] that ˆ ργ t ≤ θ . Hence the last term on the right hand sideof (106) is also nonpositive. Hence the claim follows. We conclude (1 − ˆ ργ t ) log(1 − ˆ ργ t ) − (1 − ˆ ργ t − β + θβ ) log(1 − ˆ ργ t − β + θβ ) + (1 − β ) log(1 − β ) ≤ . (108) Now we sum up those terms on the right hand sides of (101), (102), (103), (104) and (108) and obtainan upper bound on log P ( R ( βn, αd, θdn )) normalized by dn .1 dn log P ( R ( βn, αd, θdn )) ≤ − β log βd + βd + o (1) − β ( θ log θ + (1 − θ ) log(1 − θ )) + θβ log(ˆ ργ t ) + θβ. (109)Using − (1 − θ ) log(1 − θ ) ≤ θ for θ ∈ (0 , ≤ − β log βd + βd + o (1) − βθ log θ + βθ + θβ log(ˆ ργ t ) + θβ = − β log( β/ e) d − βθ log θ e + θβ log(eˆ ργ t ) + o (1)= β log (cid:18) e β (cid:19) /d (cid:18) e ˆ ργ t θ (cid:19) θ ! + o (1) . It follows from (107) that e ρ t γ t /α ≤

1, which, together with θ ∈ [ α,

1] and ˆ ρ ∈ [0 , ρ ], implies that (cid:18) e β (cid:19) /d (cid:18) e ˆ ργ t θ (cid:19) θ ≤ (cid:18) e β (cid:19) /d (cid:18) e ρ t γ t α (cid:19) θ ≤ e β (cid:18) e ργ t α (cid:19) αd ! /d . Hence for β ≥ . ρ t γ t /α ) αd , we have1 n log P ( R ( βn, αd, θdn )) ≤ dβ log (cid:16) (1 / . /d (cid:17) + o (1)= − β log 1 . o (1) ≤ − . ρ t γ t /α ) αd log 1 . o (1) . Then we choose η = 1 . ρ t γ t /α ) αd log 1 ..