Exact Linear Convergence Rate Analysis for Low-Rank Symmetric Matrix Completion via Gradient Descent
Trung Vu and Raviv Raich
School of EECS, Oregon State University, Corvallis, OR 97331-5501, USA
{vutru, raich}@oregonstate.edu

ABSTRACT
Factorization-based gradient descent is a scalable and efficient algorithm for solving low-rank matrix completion. Recent progress in structured non-convex optimization has offered global convergence guarantees for gradient descent under certain statistical assumptions on the low-rank matrix and the sampling set. However, while the theory suggests gradient descent enjoys fast linear convergence to a global solution of the problem, the universal nature of the bounding technique prevents it from obtaining an accurate estimate of the rate of convergence. In this paper, we perform a local analysis of the exact linear convergence rate of gradient descent for factorization-based matrix completion for symmetric matrices. Without any additional assumptions on the underlying model, we identify the deterministic condition for local convergence of gradient descent, which only depends on the solution matrix and the sampling set. More crucially, our analysis provides a closed-form expression of the asymptotic rate of convergence that matches exactly with the linear convergence observed in practice. To the best of our knowledge, our result is the first one that offers the exact rate of convergence of gradient descent for matrix factorization in Euclidean space for matrix completion.
Index Terms—Low-rank matrix completion, matrix factorization, local convergence analysis, gradient descent.
1. INTRODUCTION
Matrix completion is the problem of recovering a low-rank matrix from a sampling of its entries. In machine learning and signal processing, it has a wide range of applications including collaborative filtering [1], system identification [2], and dimension reduction [3]. In the era of big data, matrix completion has been proven to be an efficient and powerful framework to handle the enormous amount of information by exploiting the low-rank structure of the data matrix.

Let M ∈ R^{n×m} be a rank-r matrix with 1 ≤ r ≤ min(n, m), and let Ω = {(i, j) | M_ij is observed} be an index subset of cardinality s such that s ≤ nm. The goal is to recover the unknown entries of M. Matrix completion can be formulated as a linearly constrained rank minimization or a rank-constrained least squares problem [4]. Two popular approaches for solving the aforementioned problem formulations are convex relaxation via the nuclear norm and non-convex factorization. The former approach, motivated by the success of compressed sensing, replaces the matrix rank by its convex surrogate (the nuclear norm). Extensive work on designing convex optimization algorithms with guarantees can be found in [4–8]. While on the theoretical side the solutions of the relaxed problem and the original problem can be shown to coincide with high probability, on the practical side computational complexity concerns diminish the applicability of these algorithms. When the size of the matrix grows rapidly, storing and optimizing over a matrix variable become computationally expensive and even infeasible. In addition, it is evident that this approach suffers from slow convergence [9, 10]. In the second approach, the original rank-constrained optimization is studied. Interestingly, by reparametrizing the n × m matrix as the product of two smaller matrices, M = XY^⊤ for X ∈ R^{n×r} and Y ∈ R^{m×r}, the resulting equivalent problem is unconstrained and more computationally efficient to solve [11]. While this problem is non-convex, recent progress shows that for such problems any local minimum is also a global minimum [12, 13]. Thus, basic optimization algorithms such as gradient descent [12, 14, 15] and alternating minimization [16–19] can provably solve matrix completion under a specific sampling regime. Alternatively, the original rank-constrained optimization problem can be solved without the aforementioned reparameterization via the truncated singular value decomposition [10, 20–25].

Among the aforementioned algorithms, let us draw our attention to the gradient descent method due to its outstanding simplicity and scalability. The first global convergence guarantee is attributed to Sun and Luo [12]. The authors proved that gradient descent with appropriate regularization can converge to the global optima of a factorization-based formulation at a linear rate. Later on, Ma et al. [15] showed that even in the absence of explicit regularization, gradient descent recovers the underlying low-rank matrix by implicitly regularizing its iterates. The aforementioned results, while establishing powerful guarantees on the convergence behavior of gradient descent, impose several limitations. First, for some methods the linear convergence rate depends on constants that are not in closed form and are hard to verify in numerical experiments, even when the underlying matrix is known. Second, a solution-independent analysis of the convergence rate typically offers a loose bound when considered for a specific solution.
Third, the global convergence analysis requires certain assumptions on the underlying model which largely restrict the setting of the matrix completion problem in practice. Among such assumptions, one would consider the incoherence of the target matrix, the randomness of the sampling set, and the fact that the rank r and the condition number of M are small constants as n, m → ∞.

To address these issues, we consider the local convergence analysis of gradient descent for factorization-based matrix completion. In the scope of this paper, we restrict our attention to the symmetric case. We identify the condition for linear convergence of gradient descent that depends only on the solution M and the sampling set Ω. In addition, we provide a closed-form expression for the asymptotic convergence rate that matches well with the convergence of the algorithm in practice. The proposed analysis does not require an asymptotic setting for matrix completion, e.g., large matrices of small rank. We believe that our analysis can be useful in both theoretical and practical aspects of the matrix completion problem.

Algorithm 1: (Non-convex) Gradient Descent
  Input: X_0, P_Ω(M), η
  Output: {X_k}
  for k = 0, 1, 2, . . . do
    X_{k+1} = X_k − η P_Ω(X_k X_k^⊤ − M) X_k
2. GRADIENT DESCENT FOR MATRIX COMPLETION
Notations.
Throughout the paper, we use the notations ‖·‖_F and ‖·‖_2 to denote the Frobenius norm and the spectral norm of a matrix, respectively. On the other hand, ‖·‖ is used on a vector to denote the Euclidean norm. Boldfaced symbols are reserved for vectors and matrices. In addition, the t × t identity matrix is denoted by I_t. ⊗ denotes the Kronecker product between two matrices, and vec(·) denotes the vectorization of a matrix by stacking its columns on top of one another. Let X be some matrix and F(X) be a matrix-valued function of X. Then, for some positive number k, we use F(X) = O(‖X‖_F^k) to imply lim_{δ→0} sup_{‖X‖_F = δ} ‖F(X)‖_F / ‖X‖_F^k < ∞.

We begin by introducing the low-rank matrix completion problem. For simplicity, we focus on the symmetric case where M is an n × n positive semi-definite (PSD) rank-r matrix and the sampling set Ω is symmetric. Let the rank-r economy version of the eigendecomposition of M be given by M = U Λ U^⊤, where U ∈ R^{n×r} is a semi-orthogonal matrix and Λ ∈ R^{r×r} is a diagonal matrix containing the r non-zero eigenvalues of M, i.e., λ_1 ≥ . . . ≥ λ_r > 0. Since M can be represented as M = (U Λ^{1/2})(U Λ^{1/2})^⊤, we can write M = X* X*^⊤ with X* = U Λ^{1/2} ∈ R^{n×r}. Therefore, the factorization-based formulation for matrix completion can be described using the following non-convex optimization:

    min_{X ∈ R^{n×r}}  (1/4) Σ_{(i,j)∈Ω} ([X X^⊤]_ij − M_ij)².    (1)

Denote by P_Ω : R^{n×n} → R^{n×n} the projection onto the set of matrices supported on Ω, i.e., [P_Ω(Z)]_ij = Z_ij if (i, j) ∈ Ω, and 0 otherwise. We can rewrite the objective function as f(X) = (1/4) ‖P_Ω(X X^⊤ − M)‖_F². The gradient of f(X) is given by

    ∇f(X) = P_Ω(X X^⊤ − M) X.    (2)

Starting from an initial X_0 (usually obtained through spectral initialization [15]), the gradient descent algorithm (see Algorithm 1) simply updates the value of X by taking steps proportional to the negative of the gradient ∇f(X). If the sampling set is not symmetric, one can symmetrize it by adding (j, i) to Ω for any (i, j) ∈ Ω, since M_ji = M_ij.
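To make the update in Algorithm 1 concrete, the following NumPy sketch implements one possible version of the iteration X_{k+1} = X_k − η P_Ω(X_k X_k^⊤ − M) X_k. The function and variable names are our own and the spectral initialization step is omitted; this is an illustrative sketch rather than the authors' implementation.

```python
import numpy as np

def gradient_descent_mc(X0, M_obs, mask, eta, num_iters=1000):
    """Gradient descent for symmetric matrix completion (Algorithm 1).

    X0     : (n, r) initial factor.
    M_obs  : (n, n) array holding the observed entries of M (zeros elsewhere).
    mask   : (n, n) boolean array, True on the symmetric sampling set Omega.
    eta    : step size.
    Returns the list of iterates X_0, X_1, ..., X_{num_iters}.
    """
    X = X0.copy()
    iterates = [X.copy()]
    for _ in range(num_iters):
        # Residual restricted to the observed entries: P_Omega(X X^T - M).
        residual = mask * (X @ X.T - M_obs)
        # Gradient step: X <- X - eta * P_Omega(X X^T - M) X.
        X = X - eta * residual @ X
        iterates.append(X.copy())
    return iterates
```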
3. LOCAL CONVERGENCE ANALYSIS
This section presents the local convergence result of Algorithm 1. While recent work on the global guarantees of the algorithm has shown the linear behavior under certain statistical models, we emphasize that no closed-form expression of the convergence rate was provided. Our result in this paper, on the other hand, does not make any assumption about the underlying model for M and Ω, and provides an exact expression of the asymptotic rate of convergence. Let us first introduce some critical concepts used in our derivation.
Definition 1. Denote Ω̄ = {(i − 1)n + j | (i, j) ∈ Ω}. The row selection matrix S is the s × n² matrix obtained from the subset of rows of the n² × n² identity matrix I_{n²} corresponding to the elements of Ω̄.
Definition 2. The orthogonal projection onto the null space of M is defined by P_{U⊥} = I_n − U U^⊤.
Definition 3. Let T_{n²} be the n² × n² matrix whose (i, j)-th block is the n × n matrix e_j e_i^⊤ for 1 ≤ i, j ≤ n. Then T_{n²} can be used to represent the transpose operator as follows: vec(E^⊤) = T_{n²} vec(E) for any E ∈ R^{n×n}.
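To illustrate Definitions 1 and 3, the sketch below builds the row-selection matrix S and the transpose operator T_{n²} on a small example and checks that T_{n²} vec(E) = vec(E^⊤). The helper names are hypothetical; we use column-major (vec) linear indices, which select the same rows as the (i − 1)n + j convention above whenever Ω is symmetric.

```python
import numpy as np

def selection_matrix(mask):
    """Definition 1 (sketch): the s x n^2 matrix S formed by the rows of the
    n^2 x n^2 identity indexed by the linear indices of the observed entries.
    Column-major indices are used; for a symmetric Omega this selects the same
    rows as the (i-1)n + j convention in the paper."""
    n = mask.shape[0]
    idx = [j * n + i for j in range(n) for i in range(n) if mask[i, j]]
    return np.eye(n * n)[idx, :]

def transpose_operator(n):
    """Definition 3 (sketch): the n^2 x n^2 matrix T with T vec(E) = vec(E^T)."""
    T = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            # The (j, i) entry of E^T equals the (i, j) entry of E.
            T[i * n + j, j * n + i] = 1.0
    return T

# Quick check of Definition 3 with column-major vectorization.
n = 4
E = np.random.randn(n, n)
vec = lambda A: A.flatten(order="F")
assert np.allclose(transpose_operator(n) @ vec(E), vec(E.T))
```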
We are now in a position to state our main result on the asymptotic linear convergence rate of Algorithm 1.

Theorem 1.
Denote P_1 = I_{n²} − P_{U⊥} ⊗ P_{U⊥}, P_2 = (1/2)(I_{n²} + T_{n²}), and P = P_1 P_2. In addition, let

    H = P (I_{n²} − η (M ⊕ M)(S^⊤ S)) P,    (3)

where M ⊕ M = M ⊗ I_n + I_n ⊗ M is the Kronecker sum. Define the spectral radius of H, ρ(H), as the largest absolute value of the eigenvalues of H. If ρ(H) < 1, then Algorithm 1 produces a sequence of matrices X_k X_k^⊤ converging to M at an asymptotic linear rate ρ(H). Formally, there exists a neighborhood N(M) of M such that for any X_0 X_0^⊤ ∈ N(M),

    ‖X_k X_k^⊤ − M‖_F ≤ C ‖X_0 X_0^⊤ − M‖_F ρ(H)^k,    (4)

for some numerical constant C > 0.
Remark 1. Theorem 1 provides a closed-form expression of the asymptotic linear convergence rate of Algorithm 1, which only depends on M, Ω, and the choice of step size η. We note that the condition for linear convergence, ρ(H) < 1, is fully determined given M, Ω, and η. It would be interesting to establish a connection between this condition and the standard statistical model for matrix completion; for instance, how would the incoherence of M and the randomness of Ω affect the spectral radius of H? This exploration is left as future work.
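For small problem sizes, the matrix H of Theorem 1 can be formed explicitly and its spectral radius evaluated numerically. The sketch below is one such construction under our own naming and a column-major vectorization, assuming a symmetric Ω; it is meant only to illustrate how the predicted rate can be computed.

```python
import numpy as np

def predicted_rate(M, mask, eta, r):
    """Form H = P (I - eta (M ⊕ M) S^T S) P from Theorem 1 and return rho(H)."""
    n = M.shape[0]
    I2 = np.eye(n * n)
    # S^T S acts as the diagonal 0/1 matrix keeping the observed entries
    # (column-major vec ordering).
    StS = np.diag(mask.flatten(order="F").astype(float))
    # Kronecker sum M ⊕ M = M ⊗ I_n + I_n ⊗ M.
    ksum = np.kron(M, np.eye(n)) + np.kron(np.eye(n), M)
    A = I2 - eta * ksum @ StS
    # P1: projection onto the tangent space of the rank-r matrices at M.
    w, V = np.linalg.eigh(M)
    U = V[:, -r:]                                   # top-r eigenvectors of M
    P_Uperp = np.eye(n) - U @ U.T
    P1 = I2 - np.kron(P_Uperp, P_Uperp)
    # P2: projection onto vectorized symmetric matrices, (I + T)/2.
    T = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            T[i * n + j, j * n + i] = 1.0
    P2 = 0.5 * (I2 + T)
    P = P1 @ P2
    H = P @ A @ P
    return np.max(np.abs(np.linalg.eigvals(H)))
```

Note that H is n² × n², so forming it densely is only practical for small n; this suffices for verifying Theorem 1 on toy problems.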
In our approach, the following lemma plays a pivotal role in the derivation of Theorem 1, establishing the recursion on the error matrix X_k X_k^⊤ − M.

Lemma 1. Let E_k = X_k X_k^⊤ − M. Then

    E_{k+1} = E_k − η (P_Ω(E_k) M + M P_Ω(E_k)) + O(‖E_k‖_F²).

Furthermore, denoting A = I_{n²} − η (M ⊕ M)(S^⊤ S) and e_k = vec(E_k), the matrix recursion can be rewritten compactly as

    e_{k+1} = A e_k + O(‖e_k‖²).    (5)

We provide proofs of all the lemmas in the Appendix.
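The first-order recursion in Lemma 1 is easy to check numerically. The sketch below (our own toy setup) takes one gradient step near the solution and verifies that the gap between E_{k+1} and its linearization E_k − η(P_Ω(E_k)M + M P_Ω(E_k)) is on the order of ‖E_k‖_F².

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, eta = 8, 2, 0.01
Xs = rng.standard_normal((n, r))
M = Xs @ Xs.T                                   # rank-r PSD solution
mask = rng.random((n, n)) < 0.5
mask = mask | mask.T                            # symmetric sampling set
X = Xs + 1e-4 * rng.standard_normal((n, r))     # start near the solution
E = X @ X.T - M
X_next = X - eta * (mask * E) @ X               # one step of Algorithm 1
E_next = X_next @ X_next.T - M
E_lin = E - eta * ((mask * E) @ M + M @ (mask * E))
# The gap below is on the order of ||E||_F^2, confirming the first-order model.
print(np.linalg.norm(E_next - E_lin, "fro"), np.linalg.norm(E, "fro") ** 2)
```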
Fig. 1. Linear convergence of gradient descent for matrix completion. The decrease in the norm of the error matrix E_k through iterations is shown by the blue dashed line with triangle markers. The black solid line with square markers and the red dotted line with circle markers represent first-order approximations of the error using A and H, respectively. Finally, the green dash-dot line is the theoretical bound (up to a constant) given by ‖e_0‖ ρ(H)^k. We use different markers, i.e., triangle versus circle, to better distinguish the blue line from the red line, respectively.

Remark 2.
Figure 1 illustrates the effectiveness of the proposed bound on the asymptotic rate of convergence given by Theorem 1. In Fig. 1, the low-rank solution matrix M is generated by taking the product of a random matrix X and its transpose, where X has i.i.d. normally distributed entries. The sampling set Ω is obtained by randomly selecting the entries of M based on a Bernoulli model with a fixed observation probability. Next, we run the economy-SVD on M to compute X* = U Λ^{1/2}. The initialization X_0 is obtained by adding i.i.d. normally distributed noise with a small standard deviation σ to the entries of X*. Then we run Algorithm 1 with X_0, P_Ω(M), and a step size η proportional to 1/‖M‖_2. It is noticeable from Fig. 1 that our theoretical bound ‖e_0‖ ρ(H)^k, given by the green line, successfully predicts the rate of decrease in ‖E_k‖_F, running parallel to the blue line as soon as ‖E_k‖_F becomes sufficiently small. As far as the approximations are concerned, we compare the changes in the error modeled by e_{k+1} = A e_k and the error modeled by e_{k+1} = H e_k. While the former (represented by ‖A^k e_0‖ in black) fails to approximate ‖E_k‖_F, the latter (represented by ‖H^k e_0‖ in red) matches ‖E_k‖_F surprisingly well.
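An experiment in the spirit of Fig. 1 can be assembled from the sketches above; gradient_descent_mc and predicted_rate are the hypothetical helpers introduced earlier, and the problem sizes and constants below are our own choices, not those used to produce Fig. 1. The empirical per-iteration contraction of ‖E_k‖_F should approach ρ(H).

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 2
X_true = rng.standard_normal((n, r))
M = X_true @ X_true.T
mask = rng.random((n, n)) < 0.6
mask = np.triu(mask) | np.triu(mask).T              # symmetric Bernoulli sampling
M_obs = mask * M
eta = 0.5 / np.linalg.norm(M, 2)
X0 = X_true + 1e-3 * rng.standard_normal((n, r))    # initialization near X*

iterates = gradient_descent_mc(X0, M_obs, mask, eta, num_iters=80)
errs = [np.linalg.norm(X @ X.T - M, "fro") for X in iterates]
# Average per-iteration contraction over a window where the error is still
# well above machine precision, compared with rho(H) from Theorem 1.
empirical_rate = (errs[60] / errs[30]) ** (1 / 30)
print("empirical rate:", empirical_rate)
print("rho(H):       ", predicted_rate(M, mask, eta, r))
```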
In the rest of this section, we derive the proof of Theorem 1. First, we present a major challenge met by the traditional approach that uses (5) to characterize the convergence of the error towards zero. Next, we describe our proposed technique to overcome this difficulty. Finally, we show that our bounding technique recovers the exact rate of local convergence of Algorithm 1.

3.1. A challenge in the traditional approach

The stability of the nonlinear difference equation (5) is the key to analyzing the convergence of Algorithm 1. In essence, the linear convergence rate is obtained by the following lemma:
Lemma 2. (Rephrased from the supplemental material of [26]) Let (a_n)_{n∈N} ⊂ R_+ be the sequence defined by

    a_{n+1} ≤ ρ a_n + q a_n²    for n = 0, 1, 2, . . . ,

where 0 ≤ ρ < 1 and q ≥ 0. Then (a_n) converges to 0 if and only if a_0 < (1 − ρ)/q. A simple linear convergence bound can be derived for a_0 < ρ(1 − ρ)/q in the form of

    a_n ≤ a_0 K ρ^n,    for K = (1 − a_0 q / (ρ(1 − ρ)))^{−1}.

In order to apply Lemma 2 to (5), one natural way is to perform the eigendecomposition A = Q_A Λ_A Q_A^{−1}, where Q_A is the square matrix whose columns are the n² eigenvectors of A, and Λ_A is the diagonal matrix whose diagonal elements are the corresponding eigenvalues of A. Then, left-multiplying both sides of (5) by Q_A^{−1} yields

    Q_A^{−1} e_{k+1} = Λ_A Q_A^{−1} e_k + O(‖e_k‖²),

where Q_A^{−1} does not affect the O term since its norm is constant. Applying the triangle inequality to the last equation leads to

    ‖Q_A^{−1} e_{k+1}‖ = ‖Λ_A Q_A^{−1} e_k‖ + O(‖e_k‖²).    (6)

(Here we use the fact that, given a = b + c, the triangle inequality yields ‖a‖ ≤ ‖b‖ + ‖c‖ and ‖a‖ ≥ ‖b‖ − ‖c‖ (since b = a + (−c) and hence ‖b‖ ≤ ‖a‖ + ‖c‖). Consequently, |‖a‖ − ‖b‖| ≤ ‖c‖ and hence ‖a‖ = ‖b‖ + O(‖c‖).) With the definition of the spectral radius of A using the spectral norm of Λ_A, we have

    ρ(A) = ‖Λ_A‖_2 = sup { ‖Λ_A ẽ‖ / ‖ẽ‖ : ẽ ∈ R^{n²}, ẽ ≠ 0 }.    (7)

Now, using (7) and the fact that O(‖e_k‖²) = O(‖Q_A^{−1} e_k‖²), (6) can be upper-bounded by

    ‖Q_A^{−1} e_{k+1}‖ ≤ ρ(A) ‖Q_A^{−1} e_k‖ + O(‖Q_A^{−1} e_k‖²).    (8)

If ρ(A) < 1, then by Lemma 2 the sequence ‖Q_A^{−1} e_k‖ converges to 0 linearly at rate ρ(A). Unfortunately, one can verify that ρ(A) ≥ 1 by taking any vector v ∈ R^{n²} such that v_i = 0 for all i ∈ Ω̄. Since A v = v, 1 must be an eigenvalue of A.

The failure of the aforementioned bounding technique is that it overlooks the fact that E_k = X_k X_k^⊤ − M. By defining E = {X X^⊤ − M | X ∈ R^{n×r}} and Ẽ_A = {Q_A^{−1} vec(E) | E ∈ E}, a tighter bound on ‖Λ_A Q_A^{−1} e_k‖ / ‖Q_A^{−1} e_k‖ can be obtained by

    ρ_E(A, δ) = sup { ‖Λ_A ẽ‖ / ‖ẽ‖ : ẽ ∈ Ẽ_A, ẽ ≠ 0, ‖ẽ‖ ≤ δ },    (9)

for some constant δ > 0. Taking into account the structure of E_k, one would expect ρ_E(A) = lim_{δ→0} ρ_E(A, δ) to be a more reliable estimate of the asymptotic rate of convergence for (5). Nonetheless, (9) is a non-trivial optimization problem that has no closed-form solution to the best of our knowledge.

3.2. Integrating structural constraints

To address the aforementioned issue, we propose to integrate the structural constraint on E_k into the recursion (5). As we shall show in the next subsection, this integration enables the application of Lemma 2 to the new recursion in order to obtain a tight bound on the convergence rate. First, let us characterize the feasible set of error matrices E as follows:

Lemma 3. E ∈ E if and only if the following conditions hold simultaneously:
(C1) P_r(M + E) = M + E, where P_r is the truncated singular value decomposition of order r [27].
(C2) E^⊤ = E.
(C3) v^⊤(M + E)v ≥ 0 for all v ∈ R^n.

Our strategy is to integrate the three conditions in Lemma 3 into the linear operator A so that the resulting recursion will implicitly enforce E_k to remain in E. Specifically, for condition (C1), we linearize P_r using the first-order perturbation analysis of the truncated singular value decomposition [28]. For condition (C2), we leverage the linearity of the transpose operator.
Finally, while handling condition (C3) is non-trivial, it turns out that this condition can be ignored. In the following lemma, we introduce the linear projection that ensures the updated error E_k remains near E.

Lemma 4.
Recall that P_1 = I_{n²} − P_{U⊥} ⊗ P_{U⊥} and P_2 = (1/2)(I_{n²} + T_{n²}). Then, the following statements hold:
1. P_1 corresponds to the orthogonal projection onto the tangent plane of the set of rank-r matrices at M.
2. P_2 corresponds to the orthogonal projection onto the space of symmetric matrices.
3. P_1 and P_2 commute, and P = P_1 P_2 = P_2 P_1 is also an orthogonal projection.
4. For any E ∈ E, vec(E) = P vec(E) + O(‖E‖_F²).

By Lemma 4-4, we have e_k = P e_k + O(‖e_k‖²) for all k. Using this result with k + 1 instead of k and replacing e_{k+1} from (5) into the first term on the RHS, we have

    e_{k+1} = P (A e_k + O(‖e_k‖²)) + O(‖e_{k+1}‖²).

Substituting e_k = P e_k + O(‖e_k‖²) and using e_{k+1} = O(‖e_k‖), we obtain

    e_{k+1} = P A P e_k + O(‖e_k‖²).    (10)

It can be seen from Lemma 4-1 and Lemma 4-2 that the projection P enforces the error vector e_k to lie in the space determined by conditions (C1) and (C2) in Lemma 3. Now, using the definition H = P A P, (10) can be rewritten as

    e_{k+1} = H e_k + O(‖e_k‖²).    (11)

Similar to the derivation with A, let H = Q_H Λ_H Q_H^{−1} be the eigendecomposition of H and define ẽ_k = Q_H^{−1} e_k. Then, we have

    ‖ẽ_{k+1}‖ = ‖Λ_H ẽ_k‖ + O(‖ẽ_k‖²).    (12)

In addition, denoting Ẽ_H = {Q_H^{−1} vec(E) | E ∈ E}, we can define

    ρ(H) = sup { ‖Λ_H ẽ‖ / ‖ẽ‖ : ẽ ∈ R^{n²}, ẽ ≠ 0 },    (13)
    ρ_E(H, δ) = sup { ‖Λ_H ẽ‖ / ‖ẽ‖ : ẽ ∈ Ẽ_H, ẽ ≠ 0, ‖ẽ‖ ≤ δ }.    (14)

Since (5) and (11) are two different systems that describe the same dynamic for E_k ∈ E, one would expect them to share the same asymptotic behavior. In particular, their linear rates of convergence should agree when the constraint E_k ∈ E is considered.
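The statements of Lemma 4 can be checked numerically on a toy instance. The sketch below (our own construction) verifies that P_1 and P_2 commute, that P = P_1 P_2 is an orthogonal projection, and that vec(E) − P vec(E) is of order ‖E‖_F² for E = X X^⊤ − M near the solution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 6, 2
Xs = rng.standard_normal((n, r))
M = Xs @ Xs.T
U, _, _ = np.linalg.svd(Xs, full_matrices=False)   # orthonormal basis of col(M)
P_Uperp = np.eye(n) - U @ U.T
I2 = np.eye(n * n)
P1 = I2 - np.kron(P_Uperp, P_Uperp)
T = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        T[i * n + j, j * n + i] = 1.0
P2 = 0.5 * (I2 + T)
P = P1 @ P2
assert np.allclose(P1 @ P2, P2 @ P1)                    # statement 3: commutation
assert np.allclose(P @ P, P) and np.allclose(P, P.T)    # orthogonal projection
# Statement 4: vec(E) = P vec(E) + O(||E||_F^2) for E = X X^T - M near M.
X = Xs + 1e-5 * rng.standard_normal((n, r))
E = X @ X.T - M
e = E.flatten(order="F")
print(np.linalg.norm(e - P @ e), np.linalg.norm(E, "fro") ** 2)
```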
Lemma 5. Let ρ_E(H) = lim_{δ→0} ρ_E(H, δ). Then, ρ_E(H) = ρ_E(A).

While using H instead of A preserves the system dynamic over E, it provides updates of the error that ensure that it remains in E. Consequently, we can ignore the constraints that are implicitly satisfied in our analysis when using H.

3.3. Exact rate of local convergence

We have seen in Subsection 3.1 that applying Lemma 2 to (5) fails to estimate the convergence rate due to the gap between ρ_E(A) and ρ(A). In this subsection, we show that integrating the structural constraint helps eliminate the gap between ρ_E(H) and ρ(H) (even when condition (C3) is omitted). Therefore, applying Lemma 2 to (12) yields ρ(H) as a tight bound on the convergence rate. To that end, our goal is to prove the following lemma:

Lemma 6. As δ approaches 0, we have ρ(H) − ρ_E(H, δ) = O(δ). Consequently, it holds that ρ(H) = ρ_E(H).

Let us briefly present the key ideas and lemmas we use to prove Lemma 6. Our proof relies on two critical considerations: (i) ρ_E(H, δ) ≤ ρ(H); (ii) there exists a maximizer ẽ* of the supremum in (13) such that the distance from ẽ* to Ẽ_H is O(δ). While (i) is trivial from (13) and (14), (ii) is proven by introducing F_δ as a surrogate for the set E as follows:

Lemma 7.
Denote the eigenvector of H corresponding to the largest (in magnitude) eigenvalue by q_1. Define G as the n × n matrix satisfying vec(G) = δ q_1. Let F_δ be the set of n × n matrices F satisfying the following conditions: (i) ‖F‖_F ≤ 2δ; (ii) F^⊤ = F; (iii) ‖P_{U⊥} F P_{U⊥}‖_F ≤ (2√n / λ_r) δ²; and (iv) v^⊤(M + F)v ≥ 0 for all v ∈ R^n. Then, there exists F ∈ F_δ satisfying ‖F − G‖_F = O(δ²).

Lemma 8.
For any F ∈ F_δ, there exists E ∈ E satisfying ‖E − F‖_F = O(δ²).

From (i) and (ii), it follows that the difference between ρ_E(H, δ) and ρ(H) is O(δ). Thus, ρ(H) = ρ_E(H) when taking the limit of ρ_E(H, δ) as δ → 0. Our derivation of Theorem 1 is completed by directly applying Lemma 2 to (12).
4. CONCLUSION AND FUTURE WORK
We presented a framework for analyzing the convergence of the existing gradient descent approach for low-rank matrix completion. In our analysis, we restricted our focus to the symmetric matrix completion case. We proved that the algorithm converges linearly. Different from other approaches, we made no assumption on the rank of the matrix or the fraction of available entries. Instead, we derived an expression for the linear convergence rate via the spectral radius of a closed-form matrix. As future work, using random matrix theory, the closed-form expression for the convergence rate can be further related to the rank, the number of available entries, and the matrix dimensions. Additionally, this work can be extended to the non-symmetric case.

5. APPENDIX

5.1. Proof of Lemma 1
Recall the gradient descent update in Algorithm 1:

    X_{k+1} = X_k − η P_Ω(X_k X_k^⊤ − M) X_k = (I_n − η P_Ω(E_k)) X_k.    (15)

Substituting (15) into the definition of E_{k+1}, we have

    E_{k+1} = X_{k+1} X_{k+1}^⊤ − M = (I_n − η P_Ω(E_k)) X_k X_k^⊤ (I_n − η P_Ω(E_k))^⊤ − M.

From the fact that E_k is symmetric and Ω is a symmetric sampling, the last equation can be further expanded as

    E_{k+1} = X_k X_k^⊤ − η P_Ω(E_k) X_k X_k^⊤ − η X_k X_k^⊤ P_Ω(E_k) + η² P_Ω(E_k) X_k X_k^⊤ P_Ω(E_k) − M.    (16)

Since X_k X_k^⊤ = M + E_k, (16) is equivalent to

    E_{k+1} = E_k − η (P_Ω(E_k) M + M P_Ω(E_k)) − η (P_Ω(E_k) E_k + E_k P_Ω(E_k)) + η² P_Ω(E_k) M P_Ω(E_k) + η² P_Ω(E_k) E_k P_Ω(E_k).    (17)

Note that ‖P_Ω(E_k)‖_F ≤ ‖E_k‖_F. Hence, collecting terms that are of second order and higher, with respect to ‖E_k‖_F, on the RHS of (17) yields

    E_{k+1} = E_k − η (P_Ω(E_k) M + M P_Ω(E_k)) + O(‖E_k‖_F²).

Now, by Definition 1, it is easy to verify that S S^⊤ = I_s and vec(P_Ω(E_k)) = S^⊤ S e_k.
Using the property vec(ABC) = (C^⊤ ⊗ A) vec(B), the last recursion can be vectorized as follows:

    e_{k+1} = e_k − η (M ⊗ I_n) vec(P_Ω(E_k)) − η (I_n ⊗ M) vec(P_Ω(E_k)) + O(‖e_k‖²).

The last equation can be reorganized as

    e_{k+1} = (I_{n²} − η (M ⊕ M)(S^⊤ S)) e_k + O(‖e_k‖²).

5.2. Proof of Lemma 3

(⇒) Suppose E ∈ E. Condition (C2), i.e., E^⊤ = E, holds since both X X^⊤ and M are symmetric. Condition (C1), i.e., P_r(M + E) = M + E, stems from the fact that M + E = X X^⊤ has rank no greater than r for X ∈ R^{n×r}. Finally, for any v ∈ R^n, we have v^⊤(M + E)v = v^⊤(X X^⊤)v = ‖X^⊤ v‖² ≥ 0, which establishes (C3).

(⇐) From conditions (C2) and (C3), M + E is a PSD matrix. In addition, P_r(M + E) = M + E implies M + E must have rank no greater than r. Since any PSD matrix A with rank less than or equal to r can be factorized as A = Y Y^⊤ for some Y ∈ R^{n×r}, we conclude that E ∈ E.

5.3. Proof of Lemma 4

First, recall that any matrix Π ∈ R^{n²×n²} is an orthogonal projection if and only if Π² = Π and Π = Π^⊤. Since P_{U⊥}^⊤ = P_{U⊥}, we have

    P_1^⊤ = (I_{n²} − P_{U⊥} ⊗ P_{U⊥})^⊤ = I_{n²} − P_{U⊥}^⊤ ⊗ P_{U⊥}^⊤ = I_{n²} − P_{U⊥} ⊗ P_{U⊥} = P_1.

In addition, since P_{U⊥}² = P_{U⊥}, we have

    P_1² = (I_{n²} − P_{U⊥} ⊗ P_{U⊥})(I_{n²} − P_{U⊥} ⊗ P_{U⊥})^⊤ = I_{n²} − 2 P_{U⊥} ⊗ P_{U⊥} + (P_{U⊥} ⊗ P_{U⊥})² = I_{n²} − 2 P_{U⊥} ⊗ P_{U⊥} + P_{U⊥}² ⊗ P_{U⊥}² = I_{n²} − P_{U⊥} ⊗ P_{U⊥} = P_1.

Second, using the facts that T_{n²}² = I_{n²} and T_{n²} is symmetric, we can derive a similar result:

    P_2^⊤ = ((1/2)(I_{n²} + T_{n²}))^⊤ = (1/2)(I_{n²} + T_{n²}) = P_2,

and

    P_2² = (1/4)(I_{n²} + T_{n²})² = (1/4)(I_{n²} + 2 T_{n²} + T_{n²}²) = (1/4)(2 I_{n²} + 2 T_{n²}) = (1/2)(I_{n²} + T_{n²}) = P_2.

Third, we observe that P_1 and P_2 are the vectorized versions of the linear operators Π_1(E) = E − P_{U⊥} E P_{U⊥} and Π_2(E) = (1/2)(E + E^⊤), respectively, for any E ∈ R^{n×n}. Hence, in order to prove that P_1 and P_2 commute, it is sufficient to show that the operators Π_1 and Π_2 commute. Indeed, we have

    Π_2 Π_1(E) = (1/2)((E − P_{U⊥} E P_{U⊥}) + (E − P_{U⊥} E P_{U⊥})^⊤) = (1/2)(E + E^⊤) − P_{U⊥} ((1/2)(E + E^⊤)) P_{U⊥} = Π_1 Π_2(E).

This implies Π_1 and Π_2 commute. Since P is the product of two commuting orthogonal projections, it is also an orthogonal projection.

Finally, let us restrict E to belong to E and denote e = vec(E). Using Theorem 3 in [28], we have

    P_r(M + E) = M + E − P_{U⊥} E P_{U⊥} + O(‖E‖_F²).    (18)

Since P_r(M + E) = M + E, it follows from (18) that P_{U⊥} E P_{U⊥} = O(‖E‖_F²). Vectorizing the last equation, we obtain

    (P_{U⊥} ⊗ P_{U⊥}) e = O(‖E‖_F²).    (19)

On the other hand, since E is symmetric,

    e = T_{n²} e = (1/2)(I_{n²} + T_{n²}) e.    (20)

From (19) and (20), we have

    e = (I_{n²} − P_{U⊥} ⊗ P_{U⊥}) e + O(‖E‖_F²) = (I_{n²} − P_{U⊥} ⊗ P_{U⊥}) (1/2)(I_{n²} + T_{n²}) e + O(‖E‖_F²).    (21)

Substituting P = P_1 P_2 = (I_{n²} − P_{U⊥} ⊗ P_{U⊥}) (1/2)(I_{n²} + T_{n²}) into (21) completes our proof of the lemma.

5.4. Proof of Lemma 5

Let Ẽ = {vec(E) | E ∈ E}. Recall that for any e ∈ Ẽ, e = P e + O(‖e‖²). Therefore, by the triangle inequality, we obtain

    ‖A e‖ = ‖A (P e + O(‖e‖²))‖ ≤ ‖A P e‖ + ‖A‖_2 O(‖e‖²).

Since the second term on the RHS of the last inequality is O(‖e‖²), it is also O(δ²) for any e ∈ Ẽ such that ‖e‖ ≤ δ. In other words,

    ‖A e‖ ≤ ‖A P e‖ + O(δ²).    (22)

Similarly, we also have

    ‖A e‖ ≥ ‖A P e‖ − ‖A‖_2 O(‖e‖²) = ‖A P e‖ − O(δ²).    (23)

From (22) and (23), it follows that

    ‖A e‖ / ‖e‖ = ‖A P e‖ / ‖e‖ + O(δ).    (24)

Taking the limit of the supremum of (24) as δ → 0 yields

    ρ_E(A) = lim_{δ→0} sup_{e ∈ Ẽ, e ≠ 0, ‖e‖ ≤ δ} ‖A e‖ / ‖e‖ = lim_{δ→0} sup_{e ∈ Ẽ, e ≠ 0, ‖e‖ ≤ δ} ‖A P e‖ / ‖e‖ = ρ_E(A P).    (25)

Now, following a similar argument to that in Lemma 6, we have

    ρ_E(A P) = ρ(A P),    ρ_E(P A P) = ρ(P A P).    (26)

Given (25) and (26), it remains to show that ρ(A P) = ρ(P A P). Indeed, using Gelfand's formula [29], we have

    ρ(A P) = lim_{k→∞} ‖(A P)^k‖^{1/k}    and    ρ(P A P) = lim_{k→∞} ‖(P A P)^k‖^{1/k}.

By the property of operator norms,

    ‖(A P)^k‖ = ‖A (P A P)^{k−1}‖ ≤ ‖A‖ ‖(P A P)^{k−1}‖.

Thus,

    ‖(A P)^k‖^{1/k} ≤ ‖A‖^{1/k} (‖(P A P)^{k−1}‖^{1/(k−1)})^{(k−1)/k}.

Taking the limit of both sides of the last inequality as k → ∞ yields ρ(A P) ≤ ρ(P A P). Similarly, since ‖(P A P)^k‖ = ‖P (A P)^k‖ ≤ ‖(A P)^k‖, we also obtain ρ(P A P) ≤ ρ(A P). This concludes our proof of the lemma.

5.5. Proof of Lemma 6

Without loss of generality, assume λ_1 is the eigenvalue of H with the largest magnitude, i.e., |λ_1| = ρ(H). By the definition of G, we have ‖G‖_F = δ. Since H vec(G) = λ_1 vec(G) and H = Q_H Λ_H Q_H^{−1}, it follows that

    Q_H Λ_H Q_H^{−1} vec(G) = λ_1 vec(G).    (27)

Multiplying both sides of (27) by Q_H^{−1}, we obtain Λ_H Q_H^{−1} vec(G) = λ_1 Q_H^{−1} vec(G). Taking the ℓ2-norm and reorganizing the equation yields

    ‖Λ_H Q_H^{−1} vec(G)‖ / ‖Q_H^{−1} vec(G)‖ = |λ_1| = ρ(H).    (28)

Therefore, G leads to a solution of the supremum in (13). We now prove that G is symmetric and (P_{U⊥} ⊗ P_{U⊥}) vec(G) = 0. First, since P_1, P_2, and P = P_1 P_2 = P_2 P_1 are orthogonal projections, we have

    P_2 H = P_2 P A P = P_2 P_2 P_1 A P = P_2 P_1 A P = P A P = H.

Thus,

    λ_1 vec(G) = H vec(G) = P_2 H vec(G) = λ_1 P_2 vec(G).    (29)

Substituting P_2 = (1/2)(I_{n²} + T_{n²}) into (29) yields vec(G) = T_{n²} vec(G), i.e., G = G^⊤. Second, since P_1 H = H by the same argument, we obtain

    λ_1 vec(G) = H vec(G) = P_1 H vec(G) = λ_1 P_1 vec(G).    (30)

Substituting P_1 = I_{n²} − P_{U⊥} ⊗ P_{U⊥} into (30) yields (P_{U⊥} ⊗ P_{U⊥}) vec(G) = 0, or P_{U⊥} G P_{U⊥} = 0.

Since ‖E − G‖_F ≤ ‖E − F‖_F + ‖F − G‖_F (by the triangle inequality), Lemmas 7 and 8 imply the existence of E ∈ E such that ‖E − G‖_F = O(δ²). Denoting ẽ = Q_H^{−1} vec(E) ∈ Ẽ_H, we have

    Λ_H ẽ = λ_1 ẽ − (λ_1 I_{n²} − Λ_H) ẽ = λ_1 ẽ − (λ_1 I_{n²} − Λ_H) Q_H^{−1} vec(E) = λ_1 ẽ − (λ_1 I_{n²} − Λ_H) Q_H^{−1} vec(E − G),

where the last equality stems from the fact that λ_1 Q_H^{−1} vec(G) = Λ_H Q_H^{−1} vec(G). Next, using the triangle inequality, we obtain

    ‖Λ_H ẽ‖ ≥ ‖λ_1 ẽ‖ − ‖(λ_1 I_{n²} − Λ_H) Q_H^{−1} vec(E − G)‖ ≥ ρ(H) ‖ẽ‖ − ‖λ_1 I_{n²} − Λ_H‖_2 ‖Q_H^{−1}‖_2 ‖vec(E − G)‖.

Dividing both sides by ‖ẽ‖ yields

    ‖Λ_H ẽ‖ / ‖ẽ‖ ≥ ρ(H) − ‖λ_1 I_{n²} − Λ_H‖_2 ‖Q_H^{−1}‖_2 ‖vec(E − G)‖ / ‖ẽ‖.    (31)

Since ‖E − G‖_F = O(δ²), (31) can be rewritten as

    ‖Λ_H ẽ‖ / ‖ẽ‖ ≥ ρ(H) − O(δ).    (32)

On the other hand, for any ẽ ∈ Ẽ_H, we also have

    ‖Λ_H ẽ‖ / ‖ẽ‖ ≤ ρ_E(H, δ) ≤ ρ(H).    (33)

Combining (32) and (33) yields ρ(H) − ρ_E(H, δ) = O(δ).

5.6. Proof of Lemma 7

Denote P_U = U U^⊤. For any v ∈ R^n, we can decompose v into two orthogonal components: v = v_U + v_⊥, where v_U = P_U v and v_⊥ = P_{U⊥} v. Without loss of generality, assume that ‖v‖² = ‖v_U‖² + ‖v_⊥‖² = 1. Thus, we have

    v^⊤(M + G)v = (v_U + v_⊥)^⊤(M + G)(v_U + v_⊥) = v_U^⊤ M v_U + v_U^⊤ G v_U + v_U^⊤ G v_⊥ + v_⊥^⊤ G v_U + v_⊥^⊤ G v_⊥,    (34)

where the last equation stems from the fact that M = P_U M P_U and P_U P_{U⊥} = 0. Since P_{U⊥} G P_{U⊥} = 0, we have v_⊥^⊤ G v_⊥ = v^⊤ P_{U⊥} G P_{U⊥} v = 0. Thus, (34) is equivalent to

    v^⊤(M + G)v = v_U^⊤ M v_U + v_U^⊤ G v_U + 2 v_U^⊤ G v_⊥.    (35)

Now let us lower-bound each term on the RHS of (35) as follows. First, by the Rayleigh quotient, we have

    v_U^⊤ M v_U ≥ λ_r ‖v_U‖²,    (36)

and

    v_U^⊤ G v_U ≥ λ_min(G) ‖v_U‖² ≥ −‖G‖_F ‖v_U‖².    (37)

Next, by the Cauchy–Schwarz inequality,

    v_U^⊤ G v_⊥ ≥ −‖G‖_2 ‖v_U‖ ‖v_⊥‖ ≥ −‖G‖_F ‖v_U‖.    (38)

From (36), (37), and (38), we obtain

    v^⊤(M + G)v ≥ (λ_r − ‖G‖_F) ‖v_U‖² − 2 ‖G‖_F ‖v_U‖.    (39)

Note that ‖G‖_F = δ and the quadratic g(t) = (λ_r − δ)t² − 2δt is minimized at t* = δ/(λ_r − δ), with g(t*) = −δ²/(λ_r − δ). Combining this with (39) yields

    v^⊤(M + G)v ≥ −(2/λ_r) δ²,

for sufficiently small δ. Let F = G + (2/λ_r) δ² I_n. Now we can easily verify that ‖F − G‖_F = O(δ²) and F ∈ F_δ.

5.7. Proof of Lemma 8

We shall show that the matrix E = P_r(M + F) − M belongs to E and satisfies

    ‖E − F‖_F = O(δ²).    (40)

First, since F ∈ F_δ, M + F must be PSD. Thus, P_r(M + F) is a PSD matrix of rank no greater than r and it admits a rank-r factorization P_r(M + F) = Z Z^⊤ for some Z ∈ R^{n×r}. Therefore, by the definition of E, E = P_r(M + F) − M = Z Z^⊤ − M ∈ E. Next, using (18), we have

    E − F = P_r(M + F) − M − F = −P_{U⊥} F P_{U⊥} + O(‖F‖_F²).

Since F ∈ F_δ implies ‖P_{U⊥} F P_{U⊥}‖_F = O(δ²), we conclude that ‖E − F‖_F = O(δ²).
6. REFERENCES

[1] Jasson D. M. Rennie and Nathan Srebro, "Fast maximum margin matrix factorization for collaborative prediction," in International Conference on Machine Learning, 2005, pp. 713–719.
[2] Zhang Liu and Lieven Vandenberghe, "Interior-point method for nuclear norm approximation with application to system identification," SIAM Journal on Matrix Analysis and Applications, vol. 31, no. 3, pp. 1235–1256, 2009.
[3] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright, "Robust principal component analysis?," Journal of the ACM, vol. 58, no. 3, pp. 11, 2011.
[4] Emmanuel J. Candès and Benjamin Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717, 2009.
[5] Shuiwang Ji and Jieping Ye, "An accelerated gradient method for trace norm minimization," in International Conference on Machine Learning, 2009, pp. 457–464.
[6] Kim-Chuan Toh and Sangwoon Yun, "An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems," Pacific Journal of Optimization, vol. 6, pp. 615–640, 2010.
[7] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen, "A singular value thresholding algorithm for matrix completion," SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
[8] Shiqian Ma, Donald Goldfarb, and Lifeng Chen, "Fixed point and Bregman iterative methods for matrix rank minimization," Mathematical Programming, vol. 128, no. 1-2, pp. 321–353, 2011.
[9] Yehuda Koren, "The BellKor solution to the Netflix grand prize," Netflix prize documentation, vol. 81, no. 2009, pp. 1–10, 2009.
[10] Trung Vu and Raviv Raich, "Accelerating iterative hard thresholding for low-rank matrix completion via adaptive restart," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 2917–2921.
[11] Samuel Burer and Renato D. C. Monteiro, "Local minima and convergence in low-rank semidefinite programming," Mathematical Programming, vol. 103, no. 3, pp. 427–444, 2005.
[12] Ruoyu Sun and Zhi-Quan Luo, "Guaranteed matrix completion via non-convex factorization," IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 6535–6579, 2016.
[13] Rong Ge, Jason D. Lee, and Tengyu Ma, "Matrix completion has no spurious local minimum," in Advances in Neural Information Processing Systems, 2016, pp. 2973–2981.
[14] Yudong Chen and Martin J. Wainwright, "Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees," arXiv preprint arXiv:1509.03025, 2015.
[15] Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen, "Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion," in International Conference on Machine Learning, 2018, pp. 3345–3354.
[16] Caihua Chen, Bingsheng He, and Xiaoming Yuan, "Matrix completion via an alternating direction method," IMA Journal of Numerical Analysis, vol. 32, no. 1, pp. 227–245, 2012.
[17] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi, "Low-rank matrix completion using alternating minimization," in Annual ACM Symposium on Theory of Computing, 2013, pp. 665–674.
[18] Moritz Hardt, "Understanding alternating minimization for matrix completion," in IEEE Annual Symposium on Foundations of Computer Science, 2014, pp. 651–660.
[19] Moritz Hardt and Mary Wootters, "Fast matrix completion without the condition number," in Conference on Learning Theory, 2014, pp. 638–678.
[20] Prateek Jain, Raghu Meka, and Inderjit S. Dhillon, "Guaranteed rank minimization via singular value projection," in Advances in Neural Information Processing Systems, 2010, pp. 937–945.
[21] Donald Goldfarb and Shiqian Ma, "Convergence of fixed-point continuation algorithms for matrix rank minimization," Foundations of Computational Mathematics, vol. 11, no. 2, pp. 183–210, 2011.
[22] Jared Tanner and Ke Wei, "Normalized iterative hard thresholding for matrix completion," SIAM Journal on Scientific Computing, vol. 35, no. 5, pp. S104–S125, 2013.
[23] Prateek Jain and Praneeth Netrapalli, "Fast exact matrix completion with finite samples," in Conference on Learning Theory, 2015, pp. 1007–1034.
[24] Evgenia Chunikhina, Raviv Raich, and Thinh Nguyen, "Performance analysis for matrix completion via iterative hard-thresholded SVD," in IEEE Workshop on Statistical Signal Processing, 2014, pp. 392–395.
[25] Trung Vu and Raviv Raich, "Local convergence of the Heavy Ball method in iterative hard thresholding for low-rank matrix completion," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 3417–3421.
[26] Trung Vu and Raviv Raich, "Adaptive step size momentum method for deconvolution," in IEEE Statistical Signal Processing Workshop, 2018, pp. 438–442.
[27] Carl Eckart and Gale Young, "The approximation of one matrix by another of lower rank," Psychometrika, vol. 1, no. 3, pp. 211–218, 1936.
[28] Trung Vu, Evgenia Chunikhina, and Raviv Raich, "Perturbation expansions and error bounds for the truncated singular value decomposition," arXiv preprint arXiv:2009.07542, 2020.
[29] Izrail Gelfand, "Normierte Ringe," Matematicheskii Sbornik, vol. 9, no. 1, pp. 3–24, 1941.