Structured Low-Rank Matrix Factorization with Missing and Grossly Corrupted Observations
Fanhua Shang, Yuanyuan Liu, Hanghang Tong, James Cheng, Hong Cheng
Fanhua Shang a, Yuanyuan Liu b, Hanghang Tong c, James Cheng a, Hong Cheng b
a Department of Computer Science and Engineering, The Chinese University of Hong Kong
b Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
c School of Computing, Informatics, and Decision Systems Engineering, Arizona State University
Abstract
Recovering low-rank and sparse matrices from incomplete or corrupted observations is an important problem in machine learning, statistics, bioinformatics, computer vision, and signal and image processing. In theory, this problem can be solved by the natural convex joint/mixed relaxations (i.e., the ℓ1-norm and the trace norm) under certain conditions. However, all current provable algorithms suffer from a superlinear per-iteration cost, which severely limits their applicability to large-scale problems. In this paper, we propose a scalable, provable structured low-rank matrix factorization method to recover low-rank and sparse matrices from missing and grossly corrupted data, i.e., robust matrix completion (RMC) problems, or from incomplete and grossly corrupted measurements, i.e., compressive principal component pursuit (CPCP) problems. Specifically, we first present two small-scale matrix trace norm regularized bilinear structured factorization models for RMC and CPCP problems, in which repeatedly calculating the SVD of a large matrix is replaced by updating two much smaller factor matrices. We then apply the alternating direction method of multipliers (ADMM) to efficiently solve the RMC problems. Finally, we provide the convergence analysis of our algorithm and extend it to address general CPCP problems. Experimental results verify both the efficiency and effectiveness of our method compared with the state-of-the-art methods.

Keywords:
Compressive principal component pursuit, Robust matrix completion, Robust principal component analysis, Low-rank matrix recovery and completion
1. Introduction
In recent years, recovering low-rank and sparse matrices from severely incomplete or even corrupted observations has received broad attention in many different fields, such as statistics [1, 2, 3], bioinformatics [4], machine learning [5, 6, 7, 8], computer vision [9, 10, 11, 12, 13], and signal and image processing [14, 15, 16, 17, 18]. In those areas, the data to be analyzed often have high dimensionality, which brings great challenges to data analysis; examples include digital photographs, surveillance videos, and text and web documents. Fortunately, such high-dimensional data are observed to have low intrinsic dimension, which is often much smaller than the dimension of the ambient space [19].

∗ Corresponding author, e-mail: [email protected]
Preprint submitted to Elsevier September 4, 2014

For high-dimensional data, principal component analysis (PCA) is one of the most popular analysis tools for recovering a low-rank structure of the data, mainly because it is simple to implement, can be solved efficiently, and is effective in many real-world applications such as face recognition and text clustering. However, one of the main challenges faced by PCA is that the observed data are often contaminated by outliers and missing values [20], or available only as a small set of linear measurements [1]. To address these issues, many compressive sensing or rank minimization based techniques and methods have been proposed, such as robust PCA [5, 21, 13] (RPCA, also called PCP in [9], and low-rank and sparse matrix decomposition, LRSD, in [7, 22]) and low-rank matrix completion [23, 3].

In many applications, we have to recover a matrix from only a small number of observed entries, for example in collaborative filtering for recommender systems. This problem is often called matrix completion, where missing entries or outliers are present at arbitrary locations in the measurement matrix. Matrix completion has been used in a wide range of problems such as collaborative filtering [23, 3], structure-from-motion [24, 11], click prediction [25], tag recommendation [26], and face reconstruction [27]. In some other applications, we would like to recover low-rank and sparse matrices from corrupted data. For example, the face images of a person may be corrupted by glasses or shadows [28]. Classical PCA cannot address this issue, as its least-squares fitting is sensitive to such gross outliers. Recovering a low-rank matrix in the presence of outliers has been extensively studied under the names RPCA, PCP, and LRSD. The RPCA problem has been successfully applied in many important applications, such as latent semantic indexing [29], video surveillance [5, 9], and image alignment [15].
In some more general applications, we also have to simultaneously recover both low-rank and sparse matrices from small sets of linear measurements, which is called compressive principal component pursuit (CPCP) in [1].

In principle, the problems mentioned above can be exactly solved with high probability under mild assumptions via a hybrid convex program involving both the ℓ1-norm and the trace norm (also called the nuclear norm) minimization. In recent years, many new techniques and algorithms [23, 3, 9, 5, 21, 1] for low-rank matrix recovery and completion have been proposed, and theoretical guarantees have been derived in [23, 9, 1]. However, these provable algorithms all exploit a closed-form expression for the proximal operator of the trace norm, which involves the singular value decomposition (SVD). Hence, they all have high computational cost and are not applicable to large-scale problems.

To address this issue, we propose a scalable robust bilinear structured factorization (RBF) method to recover low-rank and sparse matrices from incomplete or corrupted data, or from a small set of linear measurements, which is formulated as follows:

min_{L,S} f(L, S) + λ||L||_*,  s.t.  A(L + S) = y,   (1)

where λ ≥ 0 is a regularization parameter, ||L||_* is the trace norm of the low-rank matrix L ∈ R^{m×n}, i.e., the sum of its singular values, S ∈ R^{m×n} is a sparse error matrix, y ∈ R^p is the vector of given linear measurements, A(·) is an underdetermined linear operator such as the linear projection operator P_Ω, and f(·) denotes the loss function, such as the ℓ2-norm or ℓ1-norm loss functions.

Unlike existing robust low-rank matrix factorization approaches, our method not only takes into account the fact that the observations are contaminated by additive outliers (Fig. 1 shows an example) or missing data, i.e., robust matrix completion [3, 30] (RMC, also called RPCA plus matrix completion) problems, but can also identify both low-rank and sparse noisy components from incomplete and grossly corrupted measurements, i.e., CPCP problems. We also present a robust bilateral factorization framework for both RMC and CPCP problems, in which repeatedly calculating the SVD of a large matrix as in [23, 9, 1] is replaced by updating two much smaller factor matrices. We verify with convincing experimental results both the efficiency and effectiveness of our RBF method.

Figure 1: Principal directions learned by PCA and RBF on the toy data set with outliers, which are in a blue rectangle.

The main contributions of this paper are summarized as follows:
1. We propose a scalable structured RBF framework to simultaneously recover both low-rank and sparse matrices for both RMC and CPCP problems. By imposing the orthogonality constraint, we convert the original RMC and CPCP models into two smaller-scale matrix trace norm regularized problems, respectively.
2. Using the fact that the optimal solution satisfies S_{Ω^C} = 0, i.e., the values of S at unobserved locations are zero, we reformulate the proposed RMC problem by replacing the linear projection operator constraint with a simple equality constraint.
3. Moreover, we propose an efficient alternating direction method of multipliers (ADMM) to solve our RMC problems, and then extend it to address CPCP problems with a linearization technique.
4. Finally, we theoretically analyze the suboptimality of the solution produced by our algorithm.

The remainder of this paper is organized as follows. We review background and related work in Section 2. In Section 3, we propose two scalable trace norm regularized RBF models for RMC and CPCP problems. We develop an efficient ADMM algorithm for solving RMC problems and then extend it to solve CPCP problems in Section 4. We provide the theoretical analysis of our algorithm in Section 5. We report empirical results in Section 6, and conclude this paper in Section 7.
2. BACKGROUND
A low-rank structured matrix L_0 ∈ R^{m×n} and a sparse one S_0 ∈ R^{m×n} can be recovered from highly corrupted measurements y = P_Q(D) ∈ R^p, where D = L_0 + S_0, via the following CPCP model:

min_{L,S} ||S||_1 + λ||L||_*,  s.t.  P_Q(D) = P_Q(L + S),   (2)

where ||S||_1 denotes the ℓ1-norm of S, i.e., ||S||_1 = Σ_{ij}|s_ij|, Q ⊆ R^{m×n} is a linear subspace, and P_Q is the projection operator onto that subspace. When P_Q = P_Ω, the model (2) becomes the robust matrix completion (RMC) problem, where Ω is the index set of observed entries. Wright et al. [1] proved the following result.

Theorem 1. Let L_0, S_0 ∈ R^{m×n}, with m ≥ n, and suppose that L_0 is a µ-incoherent matrix of rank r, r ≤ c1 n/(µ log² m), and that sign(S_0) is i.i.d. Bernoulli-Rademacher with non-zero probability ρ < c2. Let Q ⊂ R^{m×n} be a random subspace of dimension dim(Q) ≥ C1(ρmn + mr) log² m, distributed according to the Haar measure and probabilistically independent of sign(S_0). Then the minimizer of the problem (2) with λ = √m is unique and equal to (L_0, S_0) with probability at least 1 − C2 m^{−9}, where c1, c2, C1 and C2 are positive numerical constants.

This theorem states that a commensurately small number of measurements is sufficient to accurately recover the low-rank and sparse matrices with high probability. Indeed, if Q is the entire space, the model (2) degenerates to the following RPCA problem [5, 9]:

min_{L,S} ||S||_1 + λ||L||_*,  s.t.  D = L + S,   (3)

where D denotes the given observations. Several algorithms have been developed to solve the convex optimization problem (3), such as IALM [31] and LRSD [22]. Although both models (2) and (3) are convex optimization problems, and their algorithms converge to the globally optimal solution, they involve SVD at each iteration and suffer from a high computational cost of O(mn²). While there have been many efforts towards fast SVD computation, such as partial SVD [31] and approximate SVD [32], the performance of these existing methods is still unsatisfactory for many real applications. To address this problem, we propose a scalable, provable robust bilinear factorization method for missing and grossly corrupted observations.
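To make the observation model behind (2) and (3) concrete, the following sketch generates synthetic data of exactly this form: a low-rank matrix plus entrywise sparse corruptions, seen through a sampling mask. All sizes, densities and magnitudes here are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 200, 150, 5        # ambient sizes and true rank (illustrative)
rho_s, rho_obs = 0.1, 0.7    # corruption density and observation ratio (illustrative)

# Low-rank component L0 of rank r
L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Sparse component S0 with Bernoulli support and random signs (Bernoulli-Rademacher)
S0 = np.zeros((m, n))
support = rng.random((m, n)) < rho_s
S0[support] = 10.0 * rng.choice([-1.0, 1.0], size=int(support.sum()))

# Index set Omega of observed entries; P_Omega keeps observed entries and zeros the rest
Omega = rng.random((m, n)) < rho_obs
D = np.where(Omega, L0 + S0, 0.0)   # the incomplete, grossly corrupted observation
```

Here `D` plays the role of P_Ω(D) in the RMC special case of (2); with `Omega` all true and `rho_obs = 1`, the same construction yields an RPCA instance (3).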
3. OUR RBF FRAMEWORK
Matrix factorization is one of the most useful tools in scientific computing and high-dimensional data analysis; examples include the QR decomposition, the LU decomposition, SVD, and NMF. In this paper, robust bilinear factorization (RBF) aims to find two smaller low-rank matrices U ∈ R^{m×d} (U^T U = I) and V ∈ R^{n×d} whose product equals the desired low-rank matrix L ∈ R^{m×n}, i.e., L = UV^T, where d is an upper bound on the rank of L, i.e., d ≥ r = rank(L).

Suppose the observed matrix D is corrupted by outliers and missing data. The RMC problem is then given by

min_{L,S} ||S||_1 + λ||L||_*,  s.t.  P_Ω(D) = P_Ω(L + S).   (4)

From the optimization problem (4), we easily find that the optimal solution satisfies S_{Ω^C} = 0, where Ω^C is the complement of Ω, i.e., the index set of unobserved entries. Consequently, we have the following lemma.

Lemma 2. The RMC model (4) with the operator P_Ω is equivalent to the following problem:

min_{L,S} ||P_Ω(S)||_1 + λ||L||_*,  s.t.  P_Ω(D) = P_Ω(L + S) and P_{Ω^C}(S) = 0.   (5)

The proof of this lemma can be found in APPENDIX A. Given the incomplete and corrupted matrix D, our RBF model, which finds two smaller matrices whose product approximates L, can be formulated as follows:

min_{U,V,S} ||P_Ω(S)||_1 + λ||UV^T||_*,  s.t.  P_Ω(D) = P_Ω(UV^T + S).   (6)

Lemma 3. Let U and V be two matrices of compatible dimensions, where U has orthogonal columns, i.e., U^T U = I. Then ||UV^T||_* = ||V||_*.

The proof of this lemma can be found in APPENDIX B. By imposing U^T U = I and substituting ||UV^T||_* = ||V||_* into (6), we arrive at a much smaller-scale matrix trace norm minimization problem:

min_{U,V,S} ||P_Ω(S)||_1 + λ||V||_*,  s.t.  P_Ω(D) = P_Ω(UV^T + S), U^T U = I.   (7)

Theorem 4.
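Lemma 3 is easy to check numerically: left-multiplying V^T by a matrix U with orthonormal columns leaves the singular values, and hence the trace norm, unchanged. A quick sanity check with arbitrary illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 60, 40, 8   # illustrative sizes

U, _ = np.linalg.qr(rng.standard_normal((m, d)))   # U has orthonormal columns: U^T U = I
V = rng.standard_normal((n, d))

def trace_norm(M):
    """Sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()

# ||U V^T||_* == ||V||_*  (Lemma 3)
print(np.isclose(trace_norm(U @ V.T), trace_norm(V)))   # True
```

This works because if V = PΣQ^T is the SVD of V, then UV^T = (UQ)ΣP^T, and UQ still has orthonormal columns, so the singular values are unchanged.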
Suppose (L*, S*) is a solution of the problem (5) with rank(L*) = r. Then there exists a solution U_k ∈ R^{m×d}, V_k ∈ R^{n×d} and S_k ∈ R^{m×n} to the problem (7) with d ≥ r and S_{Ω^C} = 0, such that (U_k V_k^T, S_k) is also a solution to the problem (5). The proof of this theorem can be found in APPENDIX C.
From a small set of linear measurements y ∈ R^p, the CPCP problem is to recover low-rank and sparse matrices as follows:

min_{U,V,S} ||S||_1 + λ||V||_*,  s.t.  P_Q(D) = P_Q(UV^T + S).   (8)

Theorem 5.
Suppose (L*, S*) is a solution of the problem (2) with rank(L*) = r. Then there exists a solution U_k ∈ R^{m×d}, V_k ∈ R^{n×d} and S_k ∈ R^{m×n} to the problem (8) with d ≥ r, such that (U_k V_k^T, S_k) is also a solution to the problem (2).

We omit the proof of this theorem since it is very similar to that of Theorem 4. In the following, we will discuss how to solve the models (7) and (8). It is worth noting that the RPCA problem can be viewed as a special case of the RMC problem (7) when all entries of the corrupted matrix are directly observed. In the next section, we will mainly develop an efficient alternating direction method of multipliers (ADMM) solver for the non-convex problem (7). It is also worth noting that although our algorithm will produce different estimates of U and V, the estimate of UV^T is stable, as guaranteed by Theorems 4 and 5 and the convexity of the problems (2) and (4).

3.3. Connections to Existing Approaches

According to the discussion above, it is clear that our RBF method is a scalable method for both RMC and CPCP problems. Compared with convex algorithms such as common RPCA [9] and CPCP [1] methods, which have a computational complexity of O(mn²) and are impractical for relatively large-scale problems, our RBF method has linear complexity and scales well to large-scale problems.

To better understand the superiority of our RBF method, we compare and relate RBF to several popular robust low-rank matrix factorization methods. It is clear that the model in [33, 10, 27] is a special case of our trace norm regularized model (7) with λ = 0. Moreover, the models used in [11, 12] focus only on the desired low-rank matrix; in this sense, they can be viewed as special cases of our model (7). The other major difference is that SVD is used in [11], while QR factorizations are used in this paper. The use of QR factorizations also makes the update operation highly scalable on modern parallel architectures [34]. Regarding complexity, both schemes have similar theoretical computational complexity. However, from the experimental results in Section 6, we can see that our algorithm usually runs much faster, and is more accurate, than the methods in [11, 12]. The following bilinear spectral regularized matrix factorization formulation in [12] is one of the models most similar to our model (7):

min_{L,U,V} ||W ⊙ (D − L)||_1 + (λ/2)(||U||_F² + ||V||_F²),  s.t.  L = UV^T,

where ⊙ denotes the Hadamard product and W ∈ R^{m×n} is a weight matrix that can be used to encode missing data (i.e., w_ij = 0 if the entry d_ij is missing, and w_ij = 1 otherwise).
4. OPTIMIZATION ALGORITHM
In this section, we propose an efficient alternating direction method of multipliers (ADMM) for solving the RMC problem (7), and then extend it to solve the CPCP problem (8). We provide the convergence analysis of our algorithm in Section 5.

Recently, it has been shown in the literature [35, 36] that ADMM is very efficient for some convex or non-convex programming problems from various applications. We also refer to a recent survey [35] for some recently exploited applications of ADMM. To efficiently solve the RMC problem (7), we can assume, without loss of generality, that the unknown entries of D are simply set to zero, i.e., D_{Ω^C} = 0, and that S_{Ω^C} may take any values such that P_{Ω^C}(D) = P_{Ω^C}(UV^T) + P_{Ω^C}(S). Therefore, the constraint with the operator P_Ω in (7) simplifies to D = UV^T + S. Hence, we introduce the constraint D = UV^T + S into (7) and obtain the following equivalent form:

min_{U,V,S} ||P_Ω(S)||_1 + λ||V||_*,  s.t.  D = UV^T + S, U^T U = I.   (9)

The partial augmented Lagrangian function of (9) is

L_α(U, V, S, Y) = λ||V||_* + ||P_Ω(S)||_1 + ⟨Y, D − S − UV^T⟩ + (α/2)||D − S − UV^T||_F²,   (10)

where Y ∈ R^{m×n} is a matrix of Lagrange multipliers, α > 0 is a penalty parameter, and ⟨M, N⟩ denotes the inner product between matrices M and N of equal sizes, i.e., ⟨M, N⟩ = Σ_{i,j} M_ij N_ij.

4.2. Robust Bilateral Factorization Scheme

We now derive our scheme for solving the following subproblems with respect to U, V and S, respectively:

U_{k+1} = argmin_{U ∈ R^{m×d}} L_{α_k}(U, V_k, S_k, Y_k),  s.t.  U^T U = I,   (11)
V_{k+1} = argmin_{V ∈ R^{n×d}} L_{α_k}(U_{k+1}, V, S_k, Y_k),   (12)
S_{k+1} = argmin_{S ∈ R^{m×n}} L_{α_k}(U_{k+1}, V_{k+1}, S, Y_k).   (13)

Fixing V and S at their latest values, removing the terms that do not depend on U, and adding some proper terms that do not depend on U, the problem (11) with respect to U is reformulated as follows:

min_U ||UV_k^T − P_k||_F²,  s.t.  U^T U = I,   (14)

where P_k = D − S_k + Y_k/α_k. In fact, the optimal solution can be given by the SVD of the matrix P_k V_k, as in [37]. To further speed up the calculation, we follow the idea in [36] and use a QR decomposition instead of the SVD. The resulting iteration step is

U_{k+1} = Q,  where  P_k V_k = QR,   (15)

and U_{k+1} is an orthogonal basis for the range space R(P_k V_k), i.e., R(U_{k+1}) = R(P_k V_k).
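The U-update (15) therefore reduces to one matrix product and one reduced QR decomposition. A minimal sketch (the function and variable names are our own, not from the paper):

```python
import numpy as np

def update_U(D, S, Y, V, alpha):
    """U-update of (15): an orthonormal basis of range(P_k V_k),
    where P_k = D - S_k + Y_k / alpha_k."""
    P = D - S + Y / alpha
    Q, _ = np.linalg.qr(P @ V)   # reduced QR: Q is m x d with orthonormal columns
    return Q

# toy usage: the returned U always has orthonormal columns
rng = np.random.default_rng(2)
m, n, d = 30, 20, 4
D, S, Y = rng.standard_normal((3, m, n))
V = rng.standard_normal((n, d))
U = update_U(D, S, Y, V, alpha=1.0)
```

The QR of the m×d product P_k V_k costs O(d²m), versus the O(mn²) SVD required by a full trace norm proximal step on an m×n matrix.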
Although U_{k+1} in (15) is not necessarily an optimal solution to (14), our iterative scheme and the one in [38] are equivalent for solving (14) and (16), and the equivalence analysis is provided in Section 5. Moreover, the use of QR factorizations also makes our update scheme highly scalable on modern parallel architectures [34].

Fixing U and S, the optimization problem (12) with respect to V can be rewritten as

min_V (α_k/2)||U_{k+1}V^T − P_k||_F² + λ||V||_*.   (16)

To solve the convex problem (16), we first introduce the following definition [39].

Definition 1. For any given matrix M ∈ R^{n×d} of rank r, and µ ≥ 0, the singular value thresholding (SVT) operator is defined as

SVT_µ(M) = U_M diag(max{σ − µ, 0}) V_M^T,

where max{·,·} is understood element-wise, and U_M ∈ R^{n×r}, V_M ∈ R^{d×r} and σ = (σ_1, ..., σ_r)^T ∈ R^{r×1} are obtained by the SVD of M, i.e., M = U_M diag(σ) V_M^T.
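Definition 1 translates directly into code: compute a thin SVD and shrink the singular values toward zero by µ (a sketch using NumPy's `svd` for the factorization):

```python
import numpy as np

def svt(M, mu):
    """Singular value thresholding: U diag(max(sigma - mu, 0)) V^T (Definition 1)."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(sigma - mu, 0.0)) @ Vt

# example: thresholding a diagonal matrix shrinks its singular values
print(svt(np.diag([3.0, 1.0]), 2.0))   # ~ diag(1, 0)
```

Because the operator is applied to the n×d matrix P_k^T U_{k+1} rather than an m×n matrix, this SVD costs only O(d²n).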
Theorem 6. The trace norm minimization problem (16) has a closed-form solution given by

V_{k+1} = SVT_{λ/α_k}(P_k^T U_{k+1}).   (17)

Proof. The first-order optimality condition for (16) is given by

0 ∈ λ∂||V||_* + α_k (VU_{k+1}^T − P_k^T) U_{k+1},

where ∂||·||_* is the set of subgradients of the trace norm. Since U_{k+1}^T U_{k+1} = I, the optimality condition for (16) can be rewritten as

0 ∈ λ∂||V||_* + α_k (V − P_k^T U_{k+1}).   (18)

(18) is also the optimality condition for the following convex problem:

min_V (α_k/2)||V − P_k^T U_{k+1}||_F² + λ||V||_*.   (19)

By Theorem 2.1 in [39], the optimal solution of (19) is given by (17).

Fixing U and V, we can update S by solving

min_S ||P_Ω(S)||_1 + (α_k/2)||S + U_{k+1}V_{k+1}^T − D − Y_k/α_k||_F².   (20)

To solve the problem (20), we introduce the following soft-thresholding operator S_τ:

S_τ(A_ij) := A_ij − τ, if A_ij > τ;  A_ij + τ, if A_ij < −τ;  0, otherwise.

Then the optimal solution S_{k+1} can be obtained by solving the following two subproblems with respect to S_Ω and S_{Ω^C}, respectively. The optimization problem with respect to S_Ω is first formulated as follows:

min_{S_Ω} (α_k/2)||P_Ω(S + U_{k+1}V_{k+1}^T − D − Y_k/α_k)||_F² + ||P_Ω(S)||_1.   (21)

By the operator S_τ with τ = 1/α_k, the closed-form solution to the problem (21) is given by

(S_{k+1})_Ω = S_τ((D − U_{k+1}V_{k+1}^T + Y_k/α_k)_Ω).   (22)

Moreover, the subproblem with respect to S_{Ω^C} is formulated as follows:

min_{S_{Ω^C}} ||P_{Ω^C}(S + U_{k+1}V_{k+1}^T − D − Y_k/α_k)||_F².   (23)

We can easily obtain the closed-form solution by zeroing the gradient of the cost function (23) with respect to S_{Ω^C}, i.e.,

(S_{k+1})_{Ω^C} = (D − U_{k+1}V_{k+1}^T + Y_k/α_k)_{Ω^C}.   (24)

Summarizing the analysis above, we obtain an ADMM scheme to solve the RMC problem (7), as outlined in Algorithm 1.

Algorithm 1: Solving RMC problem (7) via ADMM.
Input: P_Ω(D), λ and ε.
Output: U, V and S, where S_{Ω^C} is set to 0.
Initialize: U_0 = eye(m, d), V_0 = 0, Y_0 = 0, α_0 = 1/||P_Ω(D)||_F, α_max, and ρ > 1.
while not converged do
  Update U_{k+1} by (15).
  Update V_{k+1} by (17).
  Update S_{k+1} by (22) and (24).
  Update the multiplier Y_{k+1} by Y_{k+1} = Y_k + α_k(D − U_{k+1}V_{k+1}^T − S_{k+1}).
  Update α_{k+1} by α_{k+1} = min(ρα_k, α_max).
  Check the convergence condition ||D − U_{k+1}V_{k+1}^T − S_{k+1}||_F < ε.
end while

Our algorithm is essentially a Gauss-Seidel-type scheme of ADMM; the update strategy of the Jacobi version of ADMM is easily implemented and well suited for parallel and distributed computing, and hence is particularly attractive for solving large-scale problems [40]. In addition, S_{Ω^C} should be set to 0 for the expected output S. This algorithm can also be accelerated by adaptively changing α. An efficient strategy [31] is to start from the initialization α_0 in Algorithm 1 and increase α_k iteratively by α_{k+1} = ρα_k, where ρ ∈ (1.0, 1.1] in general and α_0 is a small constant. Furthermore, U_0 is initialized with eye(m, d) := [I_{d×d}; 0_{(m−d)×d}]. Algorithm 1 can also be used directly to solve the RPCA problem (3), where all entries of the corrupted matrix are observed. Moreover, we introduce an adaptive rank adjusting strategy for our algorithm in Section 4.4.

Algorithm 1 can be extended to solve the CPCP problem (8) with the more general operator P_Q, as outlined in Algorithm 2, which optimizes the following augmented Lagrange function:

F_α(U, V, S, Y) = λ||V||_* + ||S||_1 + ⟨Y, y − P_Q(S + UV^T)⟩ + (α/2)||y − P_Q(S + UV^T)||_2².   (25)

We minimize F_α with respect to (U, V, S) using a recently proposed linearization technique [41], which handles problems where P_Q is not the identity operator. Specifically, for updating U and V, let T = UV^T and g(T) = (α_k/2)||y − P_Q(S_k + T) + Y_k/α_k||_2²; then g(T) is approximated by

g(T) ≈ g(T_k) + ⟨∇g(T_k), T − T_k⟩ + (τ/2)||T − T_k||_F²,   (26)

where ∇g(T_k) = α_k P_Q^⋆(P_Q(T_k + S_k) − y − Y_k/α_k), P_Q^⋆ is the adjoint operator of P_Q, τ is chosen as τ = ||P_Q^⋆ P_Q||_2 as in [41], and ||·||_2 denotes the spectral norm of a matrix, i.e., its largest singular value.

Similarly, for updating S, let T_{k+1} = U_{k+1}V_{k+1}^T and h(S) = (α_k/2)||y − P_Q(S + T_{k+1}) + Y_k/α_k||_2²; then h(S) is approximated by

h(S) ≈ h(S_k) + ⟨∇h(S_k), S − S_k⟩ + (τ/2)||S − S_k||_F²,   (27)

where ∇h(S_k) = α_k P_Q^⋆(P_Q(S_k + T_{k+1}) − y − Y_k/α_k).

Algorithm 2: Solving CPCP problem (8) via ADMM.
Input: y ∈ R^p, P_Q, and parameters λ and ε.
Output: U, V and S.
Initialize: U_0 = eye(m, d), V_0 = 0, Y_0 = 0, α_0 = 1/||y||_2, α_max, and ρ > 1.
while not converged do
  Update U_{k+1} by U_{k+1} = Q, where (U_k V_k^T − ∇g(U_k V_k^T)/τ) V_k = QR.
  Update V_{k+1} by V_{k+1}^T = SVT_{λ/α_k}(U_{k+1}^T (U_k V_k^T − ∇g(U_k V_k^T)/τ)).
  Update S_{k+1} by S_{k+1} = S_{1/α_k}(S_k − ∇h(S_k)/τ).
  Update the multiplier Y_{k+1} by Y_{k+1} = Y_k + α_k(y − P_Q(U_{k+1}V_{k+1}^T + S_{k+1})).
  Update the parameter α_{k+1} by α_{k+1} = min(ρα_k, α_max).
  Check the convergence condition (||T_{k+1} − T_k||_F + ||S_{k+1} − S_k||_F)/(||T_k||_F + ||S_k||_F) < ε.
end while

As the stopping criteria for terminating our RBF algorithms, we employ gap criteria; that is, we stop Algorithm 1 when ||D − U_k V_k^T − S_k||_F < ε, and we stop Algorithm 2 when (||U_k V_k^T − U_{k−1}V_{k−1}^T||_F + ||S_k − S_{k−1}||_F)/(||U_{k−1}V_{k−1}^T||_F + ||S_{k−1}||_F) < ε.

In Algorithms 1 and 2, d is one of the most important parameters. If d is too small, it can cause underfitting and a large estimation error; if d is too large, it can cause overfitting and a large deviation from the underlying low-rank matrix L. Fortunately, several works [42, 36] have provided matrix rank estimation strategies to compute a good value r for the rank of the involved matrices. Thus, we only need to set a relatively large integer d such that d ≥ r. In addition, we provide a scheme to dynamically adjust the rank parameter d. Our scheme starts from an overestimated input, i.e., d set to a small constant factor times the estimated rank r. Following [42] and [33], we decrease it aggressively once a dramatic change in the estimated rank of the product U_k V_k^T is detected based on an eigenvalue decomposition, which usually occurs after a few iterations. Specifically, we calculate the eigenvalues of (U_k V_k^T)^T U_k V_k^T = V_k U_k^T U_k V_k^T = V_k V_k^T, which are assumed to be ordered as λ_1 ≥ λ_2 ≥ ... ≥ λ_d.
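Putting the pieces together, Algorithm 1 can be sketched in a few dozen lines. This is a simplified reading of the pseudocode above, not the authors' implementation; the initial penalty and the values of ρ, α_max and ε are heuristic choices:

```python
import numpy as np

def soft_threshold(A, tau):
    """Entrywise soft-thresholding operator S_tau."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svt(M, mu):
    """Singular value thresholding (Definition 1)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - mu, 0.0)) @ Vt

def rbf_rmc(D, Omega, d, lam, eps=1e-6, rho=1.05, alpha_max=1e10, max_iter=500):
    """ADMM sketch for the RMC model (7); D must be zero on unobserved entries."""
    m, n = D.shape
    U = np.eye(m, d)                 # U_0 = eye(m, d)
    V = np.zeros((n, d))             # V_0 = 0
    S = np.zeros((m, n))             # S_0 = 0
    Y = np.zeros((m, n))             # Y_0 = 0
    alpha = 1.0 / np.linalg.norm(D)  # heuristic initial penalty
    for _ in range(max_iter):
        P = D - S + Y / alpha              # P_k = D - S_k + Y_k / alpha_k
        PV = P @ V
        if np.linalg.norm(PV) > 0:         # (15): U_{k+1} = Q from the QR of P_k V_k
            U, _ = np.linalg.qr(PV)
        V = svt(P.T @ U, lam / alpha)      # (17): V_{k+1} = SVT_{lam/alpha_k}(P_k^T U_{k+1})
        R = D - U @ V.T + Y / alpha
        S = np.where(Omega, soft_threshold(R, 1.0 / alpha), R)   # (22) and (24)
        gap = D - U @ V.T - S
        Y = Y + alpha * gap                # multiplier update
        alpha = min(rho * alpha, alpha_max)
        if np.linalg.norm(gap) < eps:      # convergence check
            break
    return U, V, np.where(Omega, S, 0.0)   # output convention: S is zero on Omega^C
```

On a small synthetic problem the iterates become feasible (D ≈ UV^T + S) within a few hundred iterations; each iteration is dominated by the QR of the m×d matrix P_k V_k and the SVD of the n×d matrix P_k^T U_{k+1}.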
Since the products V_k V_k^T and V_k^T V_k have the same nonzero eigenvalues, it is much more efficient to compute the eigenvalues of the product V_k^T V_k. We then compute the quotient sequence λ̄_i = λ_i/λ_{i+1}, i = 1, ..., d−1. Suppose r̂ = argmax_{1 ≤ i ≤ d−1} λ̄_i. If the condition

gap = (d−1) λ̄_{r̂} / Σ_{i ≠ r̂} λ̄_i ≥ a given threshold

is satisfied, which indicates a "big" jump between λ_{r̂} and λ_{r̂+1}, then we reduce d to r̂; this adjustment is performed only once.
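The jump-detection heuristic above can be sketched as follows. The decision threshold here is a hypothetical stand-in for the paper's constant, and the eigenvalues are taken from V^T V per the efficiency remark:

```python
import numpy as np

def adjust_rank(V, threshold=10.0):
    """Return a reduced rank if the spectrum of V^T V shows a 'big' jump, else keep d."""
    lam = np.sort(np.linalg.eigvalsh(V.T @ V))[::-1]   # eigenvalues, descending
    lam = np.maximum(lam, np.finfo(float).tiny)        # guard against exact zeros
    q = lam[:-1] / lam[1:]                             # quotient sequence lam_i / lam_{i+1}
    r_hat = int(np.argmax(q))                          # position of the largest jump (0-based)
    d = lam.size
    gap = (d - 1) * q[r_hat] / (q.sum() - q[r_hat])
    return r_hat + 1 if gap >= threshold else d

# example: a V with two dominant directions is cut back to rank 2
V = np.diag([10.0, 7.0, 1e-3, 1e-3, 1e-3])
print(adjust_rank(V))   # 2
```

When the spectrum decays smoothly (no dominant quotient), the function leaves d unchanged, matching the "adjust only on a dramatic change" behavior described above.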
5. Theoretical Analysis and Applications
In this section, we present several theoretical properties of Algorithm 1. First, we provide the equivalence analysis of our iterative scheme and the one in [38], as shown in the following theorem.

Theorem 7. Let (U_k^*, V_k^*, S_k^*) be the solution of the subproblems (11), (12) and (13) at the k-th iteration, respectively, Y_k^* = Y_{k−1}^* + α_{k−1}(D − U_k^*(V_k^*)^T − S_k^*), and let (U_k, V_k, S_k, Y_k) be generated by Algorithm 1 at the k-th iteration (k = 1, ..., T). Then there exists O_k ∈ O = {M ∈ R^{d×d} | M^T M = I} such that U_k^* = U_k O_k and V_k^* = V_k O_k. Moreover, U_k^*(V_k^*)^T = U_k V_k^T, ||V_k^*||_* = ||V_k||_*, S_k^* = S_k, and Y_k^* = Y_k.

Remark:
The proof of this theorem can be found in APPENDIX D. Since the Lagrange function (10) is determined by the product UV^T, V, S and Y, different values of U and V are essentially equivalent as long as they yield the same product UV^T and ||V||_* = ||V^*||_*. Meanwhile, our scheme replaces the SVD with the QR decomposition, and thus avoids the SVD computation when solving the optimization problem with the orthogonality constraint.

The global convergence of our derived algorithm is guaranteed, as shown in the following lemmas and theorems.
Lemma 8.
Let (U_k, V_k, S_k) be a sequence generated by Algorithm 1. Then we have the following conclusions:
1. (U_k, V_k, S_k) approaches a feasible solution, i.e., lim_{k→∞} ||D − U_k V_k^T − S_k||_F = 0.
2. Both sequences {U_k V_k^T} and {S_k} are Cauchy sequences.

The detailed proofs of this lemma and of the following lemma and theorems can be found in APPENDIX E. Lemma 8 ensures only the feasibility of each solution. In this paper, we want to show that it is possible to prove the local optimality of the solution produced by Algorithm 1. Let k* be the number of iterations needed by Algorithm 1 to stop, and let (U*, V*, S*) be defined by U* = U_{k*+1}, V* = V_{k*+1}, S* = S_{k*+1}. In addition, let Y* (resp. Ŷ*) denote the Lagrange multiplier Y_{k*+1} (resp. Ŷ_{k*+1}) associated with (U*, V*, S*), i.e., Y* = Y_{k*+1} and Ŷ* = Ŷ_{k*+1}, where Ŷ_{k*+1} = Y_{k*} + α_{k*}(D − U_{k*+1}V_{k*+1}^T − S_{k*}).

Lemma 9. For the solution (U*, V*, S*) generated by Algorithm 1, the inequality

||P_Ω(S)||_1 + λ||V||_* ≥ ||P_Ω(S*)||_1 + λ||V*||_* + ⟨Ŷ* − Y*, UV^T − U*(V*)^T⟩ − mnε

holds for any feasible solution (U, V, S) to (9).

To reach the global optimality of (9) based on the above lemma, it is required to show that the term ⟨Ŷ* − Y*, UV^T − U*(V*)^T⟩ diminishes. Since ||Y* − Ŷ*|| ≤ √(mn) ||Y* − Ŷ*||_∞, and by Lemma 13 (see APPENDIX E), we have

||Y* − Ŷ*||_∞ = ||P_Ω(Y*) − Ŷ*||_∞ ≤ ||P_Ω(Y*)||_∞ + ||Ŷ*||_∞ ≤ 1 + λ,

which means that ||Y* − Ŷ*||_∞ is bounded. By setting the parameter ρ to be relatively small, as in [38], ||Y* − Ŷ*||_∞ is small, and hence ||Y* − Ŷ*|| is also small. Let ε_1 = ||Y* − Ŷ*||; then we have the following theorems.

Theorem 10. Let f_g be the globally optimal objective function value of (9), and let f* = ||P_Ω(S*)||_1 + λ||V*||_* be the objective function value generated by Algorithm 1. We have that

f* ≤ f_g + c_1 ε_1 + mnε,

where c_1 is a constant defined in terms of mn, λ, ||P_Ω(D)||_F and the factor ρ(1+ρ)/(ρ−1) + ρ^{k*}.

Theorem 11.
Suppose (L_0, S_0) is an optimal solution to the RMC problem (5), rank(L_0) = r, and f_0 = ||P_Ω(S_0)||_1 + λ||L_0||_*. Let f* = ||P_Ω(S*)||_1 + λ||U*(V*)^T||_* be the objective function value generated by Algorithm 1 with parameter d > 0. We have that

f_0 ≤ f* ≤ f_0 + c_1 ε_1 + mnε + (√(mn) − λ) σ_{d+1} max(r − d, 0),

where σ_1 ≥ σ_2 ≥ ... are the singular values of L_0.

Since the rank parameter d is set higher than the rank of the optimal solution to the RMC problem (5), i.e., d ≥ r, Theorem 11 directly concludes that

f_0 ≤ f* ≤ f_0 + c_1 ε_1 + mnε.

Moreover, the value of ε can be set arbitrarily small, so the second term involving ε diminishes. Hence, for the solution (U*, V*, S*) generated by Algorithm 1, a solution to the RMC problem (5) can be achieved by computing L* = U*(V*)^T.

We also discuss the time complexity of our RBF algorithm. For the RMC problem (7), the main running time of our RBF algorithm is consumed by performing the SVD of a small matrix of size n×d, the QR decomposition of the matrix P_k V_k, and some matrix multiplications. In (17), the time complexity of performing the SVD is O(d²n). The time complexity of the QR decomposition and the matrix multiplications is O(d²m + mnd). The total time complexity of our RBF algorithm for solving both problems (3) and (7) is therefore O(t(d²n + d²m + mnd)) (usually d ≪ n ≤ m), where t is the number of iterations.

As our RBF framework for robust matrix factorization is general, there are many possible extensions of our methodology. In this section, we outline a novel result and methodology for one extension we consider most important: low-rank matrix completion. Space limits prevent us from fully describing each development, but we try to give readers enough details to understand and use each of these applications. By introducing an auxiliary variable L, the low-rank matrix completion problem can be written in the following form:

min_{U,V,L} (1/2)||P_Ω(D) − P_Ω(L)||_F² + λ||V||_*,  s.t.  L = UV^T, U^T U = I.   (28)

Similar to Algorithm 1, we can derive an efficient ADMM scheme to solve the matrix completion problem (28). This algorithm can also easily be used to solve the low-rank matrix factorization problem where all entries of the given matrix are observed.

Figure 2: Images used in the text removal experiment: (a) Input image; (b) Original image; (c) Outlier mask.
6. Experimental Evaluation
We now evaluate the effectiveness and efficiency of our RBF method for RMC and CPCP problems such as text removal, background modeling, face reconstruction, and collaborative filtering. We ran all experiments on an Intel(R) Core(TM) i5-4570 (3.20 GHz) PC running Windows 7 with 8GB main memory.

We first conduct an experiment on artificially generated data, whose goal is to remove generated text from an image. The ground-truth image is of size 256 × 222, with rank equal to 10 for the data matrix. We then add to the image a short phrase in text form, which plays the role of outliers. Fig. 2 shows the input image together with the clean image and the outlier mask. For fairness, we set the rank of all the algorithms to 20, which is two times the true rank of the underlying matrix. The input data are generated by setting 30% of the randomly selected pixels of the image as missing entries. We compare our RBF method with the state-of-the-art methods, including PCP [9], SpaRCS [6], RegL1 [11] and BF-ALM [12]. We set the regularization parameter λ = 1/√max(m, n) for RegL1 and RBF, and use the same stopping tolerance ε for all algorithms in this section.

The results obtained by the different methods are shown in Fig. 3, together with the outlier detection accuracy (the Area Under the receiver operating characteristic Curve, AUC) and the error of low-rank component recovery (i.e., ‖D − L‖_F/‖D‖_F, where D and L denote the ground-truth image matrix and the recovered image matrix, respectively). As far as low-rank matrix recovery is concerned, the RMC methods SpaRCS, RegL1, BF-ALM and RBF outperform PCP, not only visually but also quantitatively. For outlier detection, our RBF method performs significantly better than the other methods. In short, RBF significantly outperforms PCP, RegL1, BF-ALM and SpaRCS in terms of both low-rank matrix recovery and sparse outlier identification. Moreover, the running times of PCP, SpaRCS, RegL1, BF-ALM and RBF are 36.25sec, 15.68sec, 26.85sec, 6.36sec and 0.87sec, respectively.

We further evaluate the robustness of our RBF method with respect to variations of the regularization parameter λ and the given rank.
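For reference, the two quantitative measures just defined can be computed as follows. The tie-handling in the AUC is the standard Mann-Whitney rank-sum convention, and the function names are ours.

```python
import numpy as np

def recovery_error(D, L):
    """Relative recovery error ||D - L||_F / ||D||_F of the low-rank component."""
    return np.linalg.norm(D - L) / np.linalg.norm(D)

def auc_score(labels, scores):
    """AUC via the rank-sum statistic: the probability that a random outlier
    pixel receives a higher score than a random inlier pixel (ties count 1/2)."""
    y = np.asarray(labels, dtype=bool).ravel()
    s = np.asarray(scores, dtype=float).ravel()
    order = np.argsort(s, kind="mergesort")
    ranks = np.empty(s.size)
    i = 0
    while i < s.size:                      # assign average ranks to tied scores
        j = i
        while j + 1 < s.size and s[order[j + 1]] == s[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    n_pos, n_neg = y.sum(), (~y).sum()
    return (ranks[y].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```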
https://sites.google.com/site/yinqiangzheng/

Figure 3: Text removal results. The first row shows the foreground masks and the second row shows the recovered background images: (a) PCP (AUC: 0.8558; Error: 0.2516); (b) SpaRCS (AUC: 0.8665; Error: 0.2416); (c) RegL1 (AUC: 0.8792; Error: 0.2291); (d) BF-ALM (AUC: 0.8568; Error: 0.2435); (e) RBF (AUC: 0.9227; Error: 0.1844).

We conduct some experiments on the artificially generated data, and report the outlier detection accuracy (AUC) and the error (Error) of low-rank component recovery for PCP, SpaRCS, RegL1 and our RBF method, where the given rank of SpaRCS, RegL1 and our RBF method is varied from 20 to 60, and the regularization
Figure 4: Comparison of PCP, SpaRCS, RegL1 and our RBF method in terms of AUC (Left) and Error (Right) on the artificially generated data with varying ranks.

parameter λ of PCP, RegL1 and RBF is chosen from a grid of candidate values. Notice that, because BF-ALM and RegL1 achieve very similar results, we do not report the results of the former in the following. The average AUC and Error results over 10 independent runs are shown in Figs. 4 and 5, from which we can see that our RBF method is much more robust than SpaRCS and RegL1 with respect to the given rank. Moreover, our RBF method is much more robust than PCP and RegL1 against the regularization parameter λ.

In this experiment, we test our RBF method on real surveillance videos for object detection and background subtraction, as an RPCA plus matrix completion problem. Background modeling is a crucial task for motion segmentation in surveillance videos. A video sequence satisfies the low-rank and sparse structures: the background of all the frames is controlled by a few factors and hence exhibits a low-rank property, while the foreground is detected by identifying spatially localized sparse residuals [5, 9, 43]. We test our RBF method on four color surveillance
Figure 5: Comparison of PCP, SpaRCS, RegL1 and our RBF method in terms of AUC (Left) and Error (Right) on the artificially generated data with varying regularization parameters.

videos: the Bootstrap, Lobby, Hall and Mall databases. For example, the data matrix D of the Hall database consists of its first 400 frames of size 144 × 176, i.e., D is of size 76032 × 400. The foreground objects and the background can be effectively extracted by our RBF method, RegL1 and GRASTA [44], as shown in Fig. 6. Notice that SpaRCS could not yield experimental results on these databases because it ran out of memory. Moreover, we can see that the decomposition results of our RBF method, especially the recovered low-rank components, are slightly better than those of RegL1 and GRASTA. We also report the running time in Table 1, from which we can see that RBF is more than 3 times faster than GRASTA and more than 2 times faster than RegL1. This further shows that our RBF method has very good scalability and can address large-scale problems.

Table 1: Comparison of time costs in CPU seconds of GRASTA, RegL1 and RBF on background modeling data sets.
Datasets    Sizes           GRASTA    RegL1     RBF
Bootstrap   57,600 × 400    153.65    93.17     38.32
Lobby       61,440 × 400    187.43    139.83    50.08
Hall        76,032 × 400    315.11    153.45    67.73
Mall        245,760 × 200   493.92    —         94.59
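As a concrete illustration of how such a data matrix is formed, each vectorized color frame becomes one column of D. The frame size below corresponds to the Hall sequence and is an assumption inferred from Table 1.

```python
import numpy as np

h, w, n_frames = 144, 176, 400              # assumed Hall frame size and length
frames = np.zeros((n_frames, h, w, 3))      # stand-in for the decoded RGB video
D = frames.reshape(n_frames, -1).T          # one vectorized frame per column
assert D.shape == (h * w * 3, n_frames)     # 76032 x 400, matching Table 1
```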
We also test our RBF method on face reconstruction problems with incomplete and corrupted face data, or with a small set of linear measurements y, as in [1]. The face database used here is a part of the Extended Yale Face Database B [28] with large corruptions.

http://perception.i2r.a-star.edu.sg/bkmodel/bkindex
https://sites.google.com/site/hejunzz/grasta

Figure 6: Background extraction results of different algorithms on the Bootstrap data set, where the first, second and last rows show the recovered low-rank and sparse images by GRASTA, RegL1 and RBF, respectively.

The face images can often be decomposed as a low-rank part, capturing the face appearance under different illuminations, and a sparse component, representing varying illumination conditions and heavy "shadows". The resolution of all images is 192 × 168 and the pixel values are normalized to [0, 1]; the pixel values of each image then form a data vector of dimension 32,256. The input data are generated by setting 40% of the randomly selected pixels of each image as missing entries.

Fig. 7 shows some original and reconstructed images by RBF, PCP, RegL1 and CWM [27], together with the average computational time of each algorithm on each person's faces. It can be observed that RBF performs better than the other methods both visually and in running time: it effectively eliminates the heavy noise and "shadows" and simultaneously completes the missing entries. In other words, RBF can recover the latent features underlying the original images regardless of whether the observed data are corrupted by outliers or missing values.

Moreover, we consider a challenging problem: recovering face images from incomplete linear measurements. Considering the computational burden of the projection operator P_Q, we resize the original images to 42 × 48 and normalize the raw pixel values to form data vectors of dimension 2016. Following [1], the input data are P_Q(D), where Q is a randomly generated subspace whose dimension is a fixed fraction of mn.

Fig. 8 illustrates some reconstructed images by CPCP [1] and RBF. It is clear that both CPCP and RBF effectively remove "shadows" from the face images and simultaneously recover both the low-rank and sparse components from the reduced measurements.

Collaborative filtering is a technique used by some recommender systems [45, 46]. One of its main purposes is to predict the unknown preference of a user on a set of unrated items, according to other similar users or similar items. In order to evaluate our RBF method, matrix completion experiments are conducted on three widely used recommendation system data sets: MovieLens100K with 100K ratings, MovieLens1M (ML-1M) with 1M ratings and MovieLens10M (ML-10M) with 10M ratings. We randomly split these data sets into training and testing sets such that the ratio of the training set to the testing set is 9:1, and the experimental results are reported over 10 independent runs. We also compare our RBF method with APG [47], Soft-Impute [48], OptSpace [42] and LMaFit [36], and with two state-of-the-art manifold optimization methods: ScGrass [49] and RTRMC [50]. All other parameters are set to their default values for all compared algorithms.

Figure 7: Face recovery results by these algorithms. From left column to right column: input corrupted images (black pixels denote missing entries), original images, and reconstruction results by PCP (1020.69sec), CWM (1830.18sec), RegL1 (2416.85sec) and RBF (52.73sec).
We use the Root Mean Squared Error (RMSE) as the evaluation measure, which is defined as

RMSE = sqrt( (1/|Ω|) Σ_{(i,j)∈Ω} (D_ij − L_ij)^2 ),

where |Ω| is the total number of ratings in the testing set, L_ij denotes the ground-truth rating of user i for item j, and D_ij denotes the corresponding predicted rating.

The average RMSE on these three data sets over 10 independent runs is shown in Table 2. From the results in Table 2, we can see that, for the fixed ranks, most matrix factorization methods, including ScGrass, RTRMC, LMaFit and our RBF method (with the exception of OptSpace), usually perform better than the two convex trace norm minimization methods, APG and Soft-Impute. Moreover, our bilinear factorization method with trace norm regularization (RBF) consistently outperforms the other matrix factorization methods, including OptSpace, ScGrass, RTRMC and LMaFit, and the two trace norm minimization methods, APG and Soft-Impute.

http://web.engr.illinois.edu/~swoh/software/optspace/
http://lmafit.blogs.rice.edu/
http://perso.uclouvain.be/nicolas.boumal/RTRMC/

Figure 8: Face reconstruction results by CPCP and RBF, where the first column shows the original images, the second and third columns show the low-rank and sparse components obtained by CPCP, while the last two columns show the low-rank and sparse components obtained by RBF.

Table 2: RMSE of different methods on three data sets: MovieLens100K, MovieLens1M and MovieLens10M.

Methods       MovieLens100K            MovieLens1M              MovieLens10M
APG           1.2142                   1.1528                   0.8583
Soft-Impute   1.0489                   0.9058                   0.8615
OptSpace      0.9411                   0.9071                   1.1357
Ranks         5       6       7        5       6       7        5       6       7
ScGrass       0.9647  0.9809  0.9945   0.8847  0.8852  0.8936   0.8359  0.8290  0.8247
RTRMC         0.9837  1.0617  1.1642   0.8875  0.8893  0.8960   0.8463  0.8442  0.8386
LMaFit        0.9468  0.9540  0.9568   0.8918  0.8920  0.8853   0.8576  0.8530  0.8423
RBF           —       —       —        —       —       —        —       —       —
This confirms that our robust bilinear factorization model with trace norm regularization is reasonable.

Furthermore, we analyze the robustness of our RBF method with respect to its parameter changes, namely the given rank and the regularization parameter λ, on the MovieLens1M data set, as shown in Fig. 9, from which we can see that our RBF method is very robust against its parameter variations. For comparison, we also show the results of some related methods, ScGrass, LMaFit, OptSpace and RTRMC, with varying ranks or different regularization parameters in Fig. 9. It is clear that, as the given rank increases, the RMSE of ScGrass, LMaFit and RTRMC increases dramatically, while that of our RBF method increases only slightly. This further confirms that our bilinear matrix factorization model with trace norm regularization can significantly reduce the over-fitting problem of matrix factorization. ScGrass, RTRMC and OptSpace all have their respective spectral regularization models (for example, the formulation for OptSpace is min_{U, S, V} (1/2)‖P_Ω(U S V^T − D)‖_F^2 + λ‖S‖_F^2). We can see that our RBF method is more robust than OptSpace, ScGrass and RTRMC with respect to the regularization parameter λ. Moreover, our RBF method can easily incorporate side-information, as in [51, 52, 8, 53, 54].
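The RMSE above can be computed directly from the held-out ratings; this small helper (names ours) follows the definition given earlier.

```python
import numpy as np

def rmse(pred, truth, omega):
    """RMSE over the test set Omega: sqrt( (1/|Omega|) * sum (pred - truth)^2 )."""
    diff = (pred - truth)[omega]
    return np.sqrt(np.mean(diff ** 2))
```

Here `pred` is the recovered low-rank matrix, `truth` holds the ground-truth ratings, and `omega` is a boolean mask over the test entries.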
Figure 9: Results of our RBF method, ScGrass, LMaFit, RTRMC and OptSpace against their parameters: (a) Rank and (b) Regularization parameter λ.
7. CONCLUSIONS
In this paper, we proposed a scalable robust bilinear structured factorization (RBF) framework for RMC and CPCP problems. Unlike existing robust low-rank matrix factorization methods, the proposed RBF method can not only address large-scale RMC problems, but can also solve low-rank and sparse matrix decomposition problems with incomplete or corrupted observations. To this end, we first presented two smaller-scale matrix trace norm regularized models for RMC and CPCP problems, respectively. Then we developed an efficient ADMM algorithm to solve both RMC and RPCA problems, and analyzed the suboptimality of the solution produced by our algorithm. Finally, we extended our algorithm to solve CPCP problems. Experimental results on real-world data sets demonstrated the superior performance of our RBF method in comparison with the state-of-the-art methods in terms of both efficiency and effectiveness.

APPENDIX A: Proof of Lemma 2
Proof.
Let (L*, S*) be the optimal solution of (4), g(L, S) = ‖S‖_1 + λ‖L‖_*, and Γ = {(L, S) | P_Ω(D) = P_Ω(L + S)}. We prove by contradiction that P_{Ω^C}(S*) = 0.

Assume first that P_{Ω^C}(S*) ≠ 0. Let S̃* be defined by (S̃*)_Ω = (S*)_Ω and (S̃*)_{Ω^C} = 0. Then (L*, S̃*) ∈ Γ and g(L*, S̃*) < g(L*, S*), which contradicts the optimality of (L*, S*). Thus, P_{Ω^C}(S*) = 0. Therefore, (L*, S*) is also the optimal solution of (5).

APPENDIX B: Proof of Lemma 3
Proof.
Let the SVD of V^T be V^T = Û Σ̂ V̂^T; then UV^T = (U Û) Σ̂ V̂^T. As (U Û)^T (U Û) = I, (U Û) Σ̂ V̂^T is actually an SVD of UV^T. According to the definition of the trace norm, we have ‖V‖_* = ‖V^T‖_* = tr(Σ̂) = ‖UV^T‖_*.

APPENDIX C: Proof of Theorem 4
Proof.
If (L*, S*) is a solution to the optimization problem (5), it is also a solution to

min_{L, S, rank(L)=r} ‖P_Ω(S)‖_1 + λ‖L‖_*, s.t. P_Ω(D) = P_Ω(L + S), P_{Ω^C}(S) = 0.

For any (L, S) with rank(L) = r, we can find U ∈ R^{m×d} and V ∈ R^{n×d} satisfying UV^T = L and P_Ω(D − UV^T) = P_Ω(S), where d ≥ r. Moreover, according to Lemma 3, we have

min_{U, V, S} ‖P_Ω(S)‖_1 + λ‖V‖_*  s.t. P_Ω(D) = P_Ω(UV^T + S), U^T U = I,
= min_{U, V, S} ‖P_Ω(S)‖_1 + λ‖UV^T‖_*  s.t. P_Ω(D) = P_Ω(UV^T + S),
= min_{L, S, rank(L)=r} ‖P_Ω(S)‖_1 + λ‖L‖_*  s.t. P_Ω(D) = P_Ω(L + S),

where P_{Ω^C}(S) = 0. This completes the proof.

APPENDIX D: Proof of Theorem 7
Proof.
We will prove the statement in Theorem 7 by mathematical induction.

For k = 1, following [37], the optimal solution to problem (14) is given by U*_1 = Ũ_1 Ṽ_1^T, where the skinny SVD of P*_0 V*_0 is P*_0 V*_0 = Ũ_1 Σ̃_1 Ṽ_1^T. By Algorithm 1, and with the same initial values, i.e., U*_0 = U_0, V*_0 = V_0, S*_0 = S_0 and P*_0 = P_0, we have U_1 = Q_1, where QR(P_0 V_0) = QR(P*_0 V*_0) = Q_1 R_1. Hence, it can be easily verified that there exists O_1 ∈ N satisfying U*_1 = U_1 O_1, where N = {A ∈ R^{d×d} | A^T A = I, A A^T = I}. By the iteration step (19), we have

V*_1 = SVT_{λ/α_0}((P*_0)^T U*_1) = SVT_{λ/α_0}((P_0)^T U_1 O_1) = SVT_{λ/α_0}((P_0)^T U_1) O_1 = V_1 O_1.

Thus, U*_1 (V*_1)^T = U_1 V_1^T. Furthermore, we have S*_1 = S_1, P*_1 = P_1 and Y*_1 = Y_1.

For k > 1, suppose that the result of Theorem 7 holds at the (k−1)-th iteration. Then, following [37] and [38], U*_k is updated by U*_k = Ũ_k Ṽ_k^T, where the skinny SVD of P*_{k−1} V*_{k−1} is P*_{k−1} V*_{k−1} = Ũ_k Σ̃_k Ṽ_k^T. By P*_{k−1} V*_{k−1} = P*_{k−1} V_{k−1} O_{k−1} and according to (15), there exists O_k ∈ N satisfying U*_k = U_k O_k. Furthermore, we have

U*_k (V*_k)^T = U*_k SVT_{λ/α_{k−1}}((U*_k)^T P*_{k−1}) = SVT_{λ/α_{k−1}}(U*_k (U*_k)^T P*_{k−1}) = SVT_{λ/α_{k−1}}(U_k (U_k)^T P_{k−1}) = U_k V_k^T,

and V*_k = V_k O_k, S*_k = S_k, P*_k = P_k and Y*_k = Y_k. Since V*_k = V_k O_k, we also have ‖V*_k‖_* = ‖V_k‖_*. This completes the proof.

APPENDIX E
The proof sketch of Lemma 8 is similar to that in [38]. We first prove the boundedness of the multipliers and some variables of Algorithm 1, and then analyze the convergence of Algorithm 1. To prove the boundedness, we first give the following lemmas.
Lemma 12.
Let X be a real Hilbert space endowed with an inner product ⟨·, ·⟩ and a corresponding norm ‖·‖ (the nuclear norm or the l_1 norm), and let y ∈ ∂‖x‖, where ∂‖·‖ denotes the subgradient. Then ‖y‖* = 1 if x ≠ 0, and ‖y‖* ≤ 1 if x = 0, where ‖·‖* is the dual norm of the norm ‖·‖.

Lemma 13.
Let Y_{k+1} = Y_k + α_k (D − U_{k+1} V_{k+1}^T − S_{k+1}), Ŷ_{k+1} = Y_k + α_k (D − U_{k+1} V_{k+1}^T − S_k) and Ỹ_{k+1} = Y_k + α_k (D − U*_{k+1} V_k^T − S_k), where U*_{k+1} is the solution of problem (14). Then the sequences {Y_k}, {Ŷ_k}, {Ỹ_k}, {V_k} and {S_k} produced by Algorithm 1 are all bounded.

Proof. By the optimality condition of problem (20) with respect to S_{k+1}, we have 0 ∈ ∂_{(S_{k+1})_Ω} L_{α_k}(U_{k+1}, V_{k+1}, S_{k+1}, Y_k), and

P_Ω(Y_k + α_k (D − U_{k+1} V_{k+1}^T − S_{k+1})) ∈ ∂‖P_Ω(S_{k+1})‖_1, i.e., P_Ω(Y_{k+1}) ∈ ∂‖P_Ω(S_{k+1})‖_1. (29)

Furthermore, substituting Y_{k+1} = Y_k + α_k (D − U_{k+1} V_{k+1}^T − S_{k+1}) into (23), we have P_{Ω^C}(Y_{k+1}) = 0. By Lemma 12, we have

‖Y_{k+1}‖_∞ = ‖P_Ω(Y_{k+1})‖_∞ ≤ 1. (30)

Thus, the sequence {Y_k} is bounded.

From the iteration procedure of Algorithm 1, we have

L_{α_k}(U_{k+1}, V_{k+1}, S_{k+1}, Y_k) ≤ L_{α_k}(U_{k+1}, V_{k+1}, S_k, Y_k) ≤ L_{α_k}(U_k, V_k, S_k, Y_k) = L_{α_{k−1}}(U_k, V_k, S_k, Y_{k−1}) + β_k ‖Y_k − Y_{k−1}‖_F^2,

where β_k = (α_{k−1} + α_k)/(2 α_{k−1}^2) and α_k = ρ α_{k−1}. Hence,

Σ_{k=1}^∞ (α_{k−1} + α_k)/(2 α_{k−1}^2) = ρ(ρ + 1)/(2 α_0 (ρ − 1)) < ∞. (31)

Thus, {L_{α_{k−1}}(U_k, V_k, S_k, Y_{k−1})} is upper bounded due to the boundedness of {Y_k}. Then

λ‖V_k‖_* + ‖P_Ω(S_k)‖_1 = L_{α_{k−1}}(U_k, V_k, S_k, Y_{k−1}) − (1/(2 α_{k−1}))(‖Y_k‖_F^2 − ‖Y_{k−1}‖_F^2)

is upper bounded, i.e., {V_k} and {S_k} are bounded, and {U_k V_k^T} is also bounded.

We next prove that {Ỹ_k} is bounded. Let U*_{k+1} denote the solution of subproblem (14). By the optimality of U*_{k+1}, we have

‖Y_k + α_k (D − U*_{k+1} V_k^T − S_k)‖_F ≤ ‖Y_k + α_k (D − U_k V_k^T − S_k)‖_F,

and by the definition of Ỹ_{k+1} and α_k = ρ α_{k−1}, we obtain

‖Ỹ_{k+1}‖_F ≤ ‖(1 + ρ) Y_k − ρ Y_{k−1}‖_F.

By the boundedness of {Y_k}, the sequence {Ỹ_k} is bounded.

The optimality condition of problem (16) with respect to V_{k+1} can be rewritten as follows:

U_{k+1}^T Ŷ_{k+1} ∈ λ ∂‖V_{k+1}^T‖_*. (32)

By Lemma 12, we have ‖U_{k+1}^T Ŷ_{k+1}‖_2 ≤ λ. Thus, {U_{k+1}^T Ŷ_{k+1}} is bounded. Let U^⊥_{k+1} denote the orthogonal complement of U_{k+1}, i.e., (U^⊥_{k+1})^T U_{k+1} = 0. According to Theorem 7, there exists O_{k+1} satisfying U*_{k+1} = U_{k+1} O_{k+1}; thus we have

(U^⊥_{k+1})^T Ŷ_{k+1} = (U^⊥_{k+1})^T (Y_k + α_k (D − U_{k+1} V_{k+1}^T − S_k)) = (U^⊥_{k+1})^T (Y_k + α_k (D − U_{k+1} O_{k+1} V_k^T − S_k)) = (U^⊥_{k+1})^T (Y_k + α_k (D − U*_{k+1} V_k^T − S_k)) = (U^⊥_{k+1})^T Ỹ_{k+1}.

Thus, {(U^⊥_{k+1})^T Ŷ_{k+1}} is bounded due to the boundedness of {Ỹ_k}. Then we have

‖Ŷ_{k+1}‖ ≤ ‖U_{k+1}^T Ŷ_{k+1}‖ + ‖(U^⊥_{k+1})^T Ŷ_{k+1}‖.

Since both U_{k+1}^T Ŷ_{k+1} and (U^⊥_{k+1})^T Ŷ_{k+1} are bounded, the sequence {Ŷ_k} is bounded. This completes the proof.

Proof of Lemma 8:

Proof. By D − U_{k+1} V_{k+1}^T − S_{k+1} = α_k^{−1}(Y_{k+1} − Y_k), the boundedness of {Y_k} and lim_{k→∞} α_k = ∞, we have lim_{k→∞} (D − U_{k+1} V_{k+1}^T − S_{k+1}) = 0. Thus, (U_k, V_k, S_k) approaches a feasible solution. We next prove that the sequences {S_k} and {U_k V_k^T} are Cauchy sequences. By ‖S_{k+1} − S_k‖ = α_k^{−1} ‖Y_{k+1} − Ŷ_{k+1}‖ = O(α_k^{−1}) and

Σ_{k=1}^∞ α_{k−1}^{−1} = ρ/(α_0 (ρ − 1)) < ∞,

{S_k} is a Cauchy sequence, and it has a limit S*. Similarly, {U_k V_k^T} is also a Cauchy sequence, and it has a limit U*(V*)^T. This completes the proof.

To prove Lemma 9, we first give the following lemma from [38]:

Lemma 14.
Let X, Y and Q be matrices of compatible dimensions. If Q obeys Q^T Q = I and Y ∈ ∂‖X‖_*, then QY ∈ ∂‖QX‖_*.

Proof of Lemma 9:
Proof.
Let the skinny SVD of P_k = D − S_k + Y_k/α_k be P_k = Û_k Σ̂_k V̂_k^T; then it can be calculated that QR(P_k V_k) = QR(Û_k Σ̂_k V̂_k^T V_k). Let the full SVD of Σ̂_k V̂_k^T V_k be Σ̂_k V̂_k^T V_k = Ũ Σ̃ Ṽ^T (note that Ũ and Ṽ are orthogonal matrices); then it can be calculated that

QR(Û_k Σ̂_k V̂_k^T V_k) = QR(Û_k Ũ Σ̃ Ṽ^T) = Q_{k+1} R_{k+1}, U_{k+1} = Q_{k+1}.

Then there exists O with O^T O = O O^T = I such that U_{k+1} = Û_k Ũ O, which simply leads to

U_{k+1} U_{k+1}^T = Û_k Ũ O O^T Ũ^T Û_k^T = Û_k Û_k^T.

Hence,

Ŷ_{k+1} − U_{k+1} U_{k+1}^T Ŷ_{k+1} = α_k ((D − S_k + Y_k/α_k) − U_{k+1} U_{k+1}^T (D − S_k + Y_k/α_k)) = α_k (Û_k Σ̂_k V̂_k^T − U_{k+1} U_{k+1}^T Û_k Σ̂_k V̂_k^T) = α_k (Û_k Σ̂_k V̂_k^T − Û_k Û_k^T Û_k Σ̂_k V̂_k^T) = 0,

i.e., Ŷ_{k+1} = U_{k+1} U_{k+1}^T Ŷ_{k+1}.

By (32) and Lemma 14, we have U_{k+1} U_{k+1}^T Ŷ_{k+1} ∈ λ ∂‖U_{k+1} V_{k+1}^T‖_*. Thus, we have

Ŷ_{k+1} ∈ λ ∂‖U_{k+1} V_{k+1}^T‖_* and P_Ω(Y_{k+1}) ∈ ∂‖P_Ω(S_{k+1})‖_1, ∀k.

Since the above conclusion holds for any k, it naturally holds at (U*, V*, S*):

Ŷ* = Ŷ_{k*+1} ∈ λ ∂‖U*(V*)^T‖_* and P_Ω(Y*) = P_Ω(Y_{k*+1}) ∈ ∂‖P_Ω(S*)‖_1. (33)

Given any feasible solution (U, V, S) to problem (9), by the convexity of matrix norms, (33), and P_{Ω^C}(Y*) = 0, it can be calculated that

‖P_Ω(S)‖_1 + λ‖V‖_*
= ‖P_Ω(S)‖_1 + λ‖UV^T‖_*
≥ ‖P_Ω(S*)‖_1 + ⟨P_Ω(Y*), P_Ω(S − S*)⟩ + λ‖U*(V*)^T‖_* + ⟨Ŷ*, UV^T − U*(V*)^T⟩
= ‖P_Ω(S*)‖_1 + ⟨P_Ω(Y*), S − S*⟩ + λ‖U*(V*)^T‖_* + ⟨Ŷ*, UV^T − U*(V*)^T⟩
= ‖P_Ω(S*)‖_1 + λ‖U*(V*)^T‖_* + ⟨P_Ω(Y*), UV^T + S − U*(V*)^T − S*⟩ + ⟨Ŷ* − P_Ω(Y*), UV^T − U*(V*)^T⟩
= ‖P_Ω(S*)‖_1 + λ‖U*(V*)^T‖_* + ⟨P_Ω(Y*), UV^T + S − U*(V*)^T − S*⟩ + ⟨Ŷ* − Y*, UV^T − U*(V*)^T⟩.

By Lemma 8 and ‖P_Ω(Y*)‖_∞ ≤ 1, we have ‖UV^T + S − U*(V*)^T − S*‖_∞ = ‖D − U*(V*)^T − S*‖_∞ ≤ ε, which leads to

|⟨P_Ω(Y*), UV^T + S − U*(V*)^T − S*⟩| ≤ ‖P_Ω(Y*)‖_∞ ‖UV^T + S − U*(V*)^T − S*‖_1 = ‖P_Ω(Y*)‖_∞ ‖D − U*(V*)^T − S*‖_1 ≤ mn ‖D − U*(V*)^T − S*‖_∞ ≤ mn ε.

Hence,

‖P_Ω(S)‖_1 + λ‖V‖_* ≥ ‖P_Ω(S*)‖_1 + λ‖V*‖_* + ⟨Ŷ* − Y*, UV^T − U*(V*)^T⟩ − mn ε.

This completes the proof.
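The two elementwise bounds used in the display above, |⟨A, B⟩| ≤ ‖A‖_∞ ‖B‖_1 and ‖B‖_1 ≤ mn ‖B‖_∞ (with ‖·‖_1 and ‖·‖_∞ taken entrywise), are quick to check numerically:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((4, 6))

inner = float(np.sum(A * B))        # <A, B> = tr(A^T B)
a_inf = np.abs(A).max()             # entrywise infinity norm ||A||_inf
b_l1 = np.abs(B).sum()              # entrywise l1 norm ||B||_1

assert abs(inner) <= a_inf * b_l1 + 1e-12          # Hölder pairing (1, inf)
assert b_l1 <= B.size * np.abs(B).max() + 1e-12    # ||B||_1 <= mn * ||B||_inf
```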
Proof of Theorem 10:
Proof.
It is worth noting that (U, V = 0, S = D) is feasible to (9). Let (U_g, V_g, S_g) be a globally optimal solution to (9); then we have

λ‖V_g‖_* ≤ ‖P_Ω(S_g)‖_1 + λ‖V_g‖_* ≤ ‖D‖_1 = ‖P_Ω(D)‖_1.

By the proof procedure of Lemma 13 and α_0 = ‖P_Ω(D)‖_F, V* is bounded by

λ‖V*‖_* ≤ ‖P_Ω(S*)‖_1 + λ‖V*‖_* ≤ L_{α_{k*}}(U_{k*+1}, V_{k*+1}, S_{k*+1}, Y_{k*}) + ‖Y_{k*}‖_F^2/(2 α_{k*}) ≤ (mn/α_0)(ρ(1 + ρ)/(ρ − 1) + ρ^{−k*}) = (mn/‖P_Ω(D)‖_F)(ρ(1 + ρ)/(ρ − 1) + ρ^{−k*}).

Consequently,

‖U_g(V_g)^T − U*(V*)^T‖_* ≤ ‖V_g‖_* + ‖V*‖_* ≤ c. (34)

Note that |⟨M, N⟩| ≤ ‖M‖_2 ‖N‖_* (please see [55]) holds for any matrices M and N. By Lemma 9 and (34), we have

f_g = ‖P_Ω(S_g)‖_1 + λ‖V_g‖_*
≥ ‖P_Ω(S*)‖_1 + λ‖V*‖_* + ⟨Ŷ* − Y*, U_g(V_g)^T − U*(V*)^T⟩ − mn ε
≥ f* − ‖Ŷ* − Y*‖_2 ‖U_g(V_g)^T − U*(V*)^T‖_* − mn ε
≥ f* − ε ‖U_g(V_g)^T − U*(V*)^T‖_* − mn ε
≥ f* − c ε − mn ε.

This completes the proof.
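The inequality |⟨M, N⟩| ≤ ‖M‖_2 ‖N‖_* used above expresses the duality between the spectral norm and the nuclear norm; a quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 7))
N = rng.standard_normal((5, 7))

inner = float(np.sum(M * N))                           # <M, N> = tr(M^T N)
spec = np.linalg.svd(M, compute_uv=False).max()        # spectral norm ||M||_2
nuc = np.linalg.svd(N, compute_uv=False).sum()         # nuclear norm ||N||_*
assert abs(inner) <= spec * nuc + 1e-10
```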
Proof of Theorem 11:
Proof.
Let L̃ = U*(V*)^T and S̃ = S*; then (L̃, S̃) is a feasible solution to the RMC problem (5). By the convexity of problem (5) and the optimality of (L, S), it naturally follows that f ≤ f*.

Let L = U_L Σ_L (V_L)^T be the skinny SVD of L. Construct U′ = U_L, (V′)^T = Σ_L (V_L)^T and S′ = S. When d ≥ r, we have

D = L + S = U_L Σ_L (V_L)^T + S = U′(V′)^T + S′,

i.e., (U′, V′, S′) is a feasible solution to problem (9). By Theorem 10, it can be concluded that

f* − c ε − mn ε ≤ λ‖V′‖_* + ‖P_Ω(S′)‖_1 = λ‖Σ_L‖_* + ‖P_Ω(S)‖_1 = f.

For d < r, we decompose the skinny SVD of L as L = U_1 Σ_1 V_1^T + U_2 Σ_2 V_2^T, where U_1, V_1 (resp. U_2, V_2) are the singular vectors associated with the d largest singular values (resp. the remaining singular values, which are smaller than or equal to σ_{d+1}). With these notations, we obtain a feasible solution to problem (9) by constructing U″ = U_1, (V″)^T = Σ_1 V_1^T and S″ = D − U_1 Σ_1 V_1^T = S + U_2 Σ_2 V_2^T.

By Theorem 10, it can be calculated that

f* − c ε − mn ε ≤ f_g ≤ λ‖V″‖_* + ‖P_Ω(S″)‖_1
≤ λ‖Σ_1‖_* + ‖P_Ω(S + U_2 Σ_2 V_2^T)‖_1
≤ λ‖L‖_* − λ‖Σ_2‖_* + ‖P_Ω(S)‖_1 + ‖P_Ω(U_2 Σ_2 V_2^T)‖_1
≤ f − λ‖Σ_2‖_* + ‖U_2 Σ_2 V_2^T‖_1
≤ f − λ‖Σ_2‖_* + √(mn) ‖U_2 Σ_2 V_2^T‖_F
≤ f − λ‖Σ_2‖_* + √(mn) ‖U_2 Σ_2 V_2^T‖_*
≤ f + (√(mn) − λ) ‖Σ_2‖_*
≤ f + (√(mn) − λ) σ_{d+1} (r − d).

This completes the proof.
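The chain of norm inequalities used in the last display, ‖X‖_1 ≤ √(mn) ‖X‖_F (entrywise l1 vs. Frobenius) and ‖X‖_F ≤ ‖X‖_*, holds for any m × n matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 6, 5
X = rng.standard_normal((m, n))

l1 = np.abs(X).sum()                               # entrywise l1 norm ||X||_1
fro = np.linalg.norm(X)                            # Frobenius norm ||X||_F
nuc = np.linalg.svd(X, compute_uv=False).sum()     # nuclear norm ||X||_*

assert l1 <= np.sqrt(m * n) * fro + 1e-10          # Cauchy-Schwarz over mn entries
assert fro <= nuc + 1e-10                          # since ||X||_F^2 = sum sigma_i^2
```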
References

[1] J. Wright, A. Ganesh, K. Min, Y. Ma, Compressive principal component pursuit, Inform. Infer. 2 (2013) 32–68.
[2] A. Agarwal, S. Negahban, M. Wainwright, Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions, Ann. Stat. 40 (2) (2012) 1171–1197.
[3] Y. Chen, A. Jalali, S. Sanghavi, C. Caramanis, Low-rank matrix recovery from errors and erasures, IEEE Trans. Inform. Theory 59 (7) (2013) 4324–4337.
[4] R. Otazo, E. Candes, D. Sodickson, Low-rank and sparse matrix decomposition for accelerated dynamic MRI with separation of background and dynamic components, Magn. Reson. Med., in press.
[5] J. Wright, A. Ganesh, S. Rao, Y. Peng, Y. Ma, Robust principal component analysis: exact recovery of corrupted low-rank matrices by convex optimization, in: NIPS, 2009, pp. 2080–2088.
[6] A. Waters, A. Sankaranarayanan, R. Baraniuk, SpaRCS: Recovering low-rank and sparse matrices from compressive measurements, in: NIPS, 2011, pp. 1089–1097.
[7] M. Tao, X. Yuan, Recovering low-rank and sparse components of matrices from incomplete and noisy observations, SIAM J. Optim. 21 (1) (2011) 57–81.
[8] Y. Liu, F. Shang, H. Cheng, J. Cheng, A Grassmannian manifold algorithm for nuclear norm regularized least squares problems, in: UAI, 2014.
[9] E. Candes, X. Li, Y. Ma, J. Wright, Robust principal component analysis?, J. ACM 58 (3) (2011) 1–37.
[10] T. Zhou, D. Tao, Greedy bilateral sketch, completion & smoothing, in: AISTATS, 2013, pp. 650–658.
[11] Y. Zheng, G. Liu, S. Sugimoto, S. Yan, M. Okutomi, Practical low-rank matrix approximation under robust L1-norm, in: CVPR, 2012, pp. 1410–1417.
[12] R. Cabral, F. Torre, J. Costeira, A. Bernardino, Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition, in: ICCV, 2013, pp. 2488–2495.
[13] F. Shang, Y. Liu, J. Cheng, H. Cheng, Robust principal component analysis with missing data, in: CIKM, 2014.
[14] R. Ma, N. Barzigar, A. Roozgard, S.
Cheng, Decomposition approach for low-rank matrix completion and its applications, IEEE Trans. Signal Process. 62 (7) (2014) 1671–1683.
[15] Y. Peng, A. Ganesh, J. Wright, W. Xu, Y. Ma, RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images, IEEE Trans. Pattern Anal. Mach. Intell. 34 (11) (2012) 2233–2246.
[16] Z. Li, J. Liu, Y. Jiang, J. Tang, H. Lu, Low rank metric learning for social image retrieval, in: ACM Multimedia, 2012, pp. 853–856.
[17] Y. Liu, L. Jiao, F. Shang, F. Yin, F. Liu, An efficient matrix bi-factorization alternative optimization method for low-rank matrix recovery and completion, Neur. Netw. 48 (2013) 8–18.
[18] Y. Feng, J. Xiao, Y. Zhuang, X. Yang, J. Zhang, R. Song, Exploiting temporal stability and low-rank structure for motion capture data refinement, Inform. Sci. (2014) in press.
[19] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 171–184.
[20] P. Favaro, R. Vidal, A. Ravichandran, A closed form solution to robust subspace estimation and clustering, in: CVPR, 2011, pp. 1801–1807.
[21] H. Xu, C. Caramanis, S. Sanghavi, Robust PCA via outlier pursuit, in: NIPS, 2010, pp. 2496–2504.
[22] X. Yuan, J. Yang, Sparse and low-rank matrix decomposition via alternating direction methods, Pac. J. Optim. 9 (1) (2013) 167–180.
[23] E. Candes, B. Recht, Exact matrix completion via convex optimization, Found. Comput. Math. 9 (6) (2009) 717–772.
[24] A. Eriksson, A. van den Hengel, Efficient computation of robust low-rank matrix approximations in the presence of missing data using the L1 norm, in: CVPR, 2010, pp. 771–778.
[25] J. Yu, Y. Rui, D. Tao, Click prediction for web image reranking using multimodal sparse coding, IEEE Trans. Image Process. 23 (5) (2014) 2019–2032.
[26] M. Wang, B. Ni, X.-S. Hua, T.-S. Chua, Assistive tagging: A survey of multimedia tagging with human-computer joint exploration, ACM Comput. Surv. 44 (4).
[27] D. Meng, Z. Xu, L. Zhang, J. Zhao, A cyclic weighted median method for L1 low-rank matrix factorization with missing entries, in: AAAI, 2013.
[28] K. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 684–698.
[29] K. Min, Z. Zhang, J. Wright, Y. Ma, Decomposing background topics from keywords by principal component pursuit, in: CIKM, 2010, pp. 269–278.
[30] X. Li, Compressed sensing and matrix completion with constant proportion of corruptions, Constr. Approx. 37 (2013) 73–99.
[31] Z. Lin, R. Liu, Z. Su, Linearized alternating direction method with adaptive penalty for low-rank representation, in: NIPS, 2011, pp. 612–620.
[32] S. Ma, D. Goldfarb, L. Chen, Fixed point and Bregman iterative methods for matrix rank minimization, Math. Prog. 128 (1) (2011) 321–353.
[33] Y. Shen, Z. Wen, Y. Zhang, Augmented Lagrangian alternating direction method for matrix separation based on low-rank factorization, Optim. Method Softw. 29 (2) (2014) 239–263.
[34] H. Avron, S. Kale, S. Kasiviswanathan, V. Sindhwani, Efficient and practical stochastic subgradient descent for nuclear norm regularization, in: ICML, 2012.
[35] S. Boyd, N. Parikh, E. Chu, B. Peleato, J.
Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. 3 (1) (2011) 1–122.
[36] Z. Wen, W. Yin, Y. Zhang, Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm, Math. Prog. Comp. 4 (4) (2012) 333–361.
[37] N. Higham, Matrix Procrustes problems.
[38] G. Liu, S. Yan, Active subspace: toward scalable low-rank learning, Neur. Comp. 24 (12) (2012) 3371–3394.
[39] J. Cai, E. Candes, Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (4) (2010) 1956–1982.
[40] F. Shang, Y. Liu, J. Cheng, Generalized higher-order tensor decomposition via parallel ADMM, in: AAAI, 2014, pp. 1279–1285.
[41] J. Yang, X. Yuan, Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization, Math. Comp. 82 (2013) 301–329.
[42] R. Keshavan, A. Montanari, S. Oh, Matrix completion from a few entries, IEEE Trans. Inform. Theory 56 (6) (2010) 2980–2998.
[43] M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, T.-S. Chua, Event driven web video summarization by tag localization and key-shot identification, IEEE Trans. Multimedia 14 (4) (2012) 975–985.
[44] J. He, L. Balzano, A. Szlam, Incremental gradient on the Grassmannian for online foreground and background separation in subsampled video, in: CVPR, 2012, pp. 1568–1575.
[45] D.-R. Liu, C.-H. Lai, W.-J. Lee, A hybrid of sequential rules and collaborative filtering for product recommendation, Inform. Sci. 179 (20) (2009) 3505–3519.
[46] S. Lee, Y. Cho, S. Kim, Collaborative filtering with ordinal scale-based implicit ratings for mobile music recommendations, Inform. Sci. 180 (11) (2010) 2142–2155.
[47] K.-C. Toh, S. Yun, An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems, Pac. J. Optim. 6 (2010) 615–640.
[48] R. Mazumder, T. Hastie, R. Tibshirani, Spectral regularization algorithms for learning large incomplete matrices, J.
Mach. Learn. Res. 11 (2010) 2287–2322.
[49] T. Ngo, Y. Saad, Scaled gradients on Grassmann manifolds for matrix completion, in: NIPS, 2012, pp. 1421–1429.
[50] N. Boumal, P.-A. Absil, RTRMC: A Riemannian trust-region method for low-rank matrix completion, in: NIPS, 2011, pp. 406–414.
[51] F. Shang, L. Jiao, F. Wang, Semi-supervised learning with mixed knowledge information, in: KDD, 2012, pp. 732–740.
[52] F. Shang, L. Jiao, F. Wang, Graph dual regularization non-negative matrix factorization for co-clustering, Pattern Recogn. 45 (6) (2012) 2237–2250.
[53] Z. Li, J. Liu, X. Zhu, T. Liu, H. Lu, Image annotation using multi-correlation probabilistic matrix factorization, in: ACM Multimedia, 2010, pp. 1187–1190.
[54] J. Yu, D. Tao, M. Wang, Adaptive hypergraph learning and its application in image classification, IEEE Trans. Image Process. 21 (7) (2012) 3262–3272.
[55] R. Tomioka, T. Suzuki, Convex tensor decomposition via structured Schatten norm regularization, in: NIPS, 2013, pp. 1331–1339.