A variational approach to stable principal component pursuit
Aleksandr Aravkin, Stephen Becker, Volkan Cevher, Peder Olsen
Aleksandr Aravkin
T. J. Watson Center, IBM Research, Yorktown Heights, NY
Stephen Becker
T. J. Watson Center, IBM Research, Yorktown Heights, NY
Volkan Cevher∗
LIONS, EPFL, Lausanne, Switzerland
Peder Olsen
T. J. Watson Center, IBM Research, Yorktown Heights, NY
Abstract
We introduce a new convex formulation for stable principal component pursuit (SPCP) to decompose noisy signals into low-rank and sparse representations. For numerical solutions of our SPCP formulation, we first develop a convex variational framework and then accelerate it with quasi-Newton methods. We show, via synthetic and real data experiments, that our approach offers advantages over the classical SPCP formulations in scalability and practical parameter selection.
Linear superposition is a useful model for many applications, including nonlinear mixing problems. Surprisingly, we can perfectly distinguish multiple elements in a given signal using convex optimization as long as they are concise and look sufficiently different from one another. Popular examples include robust principal component analysis (RPCA), where we decompose a signal into low-rank and sparse components, and stable principal component pursuit (SPCP), where we also seek an explicit noise component within the RPCA decomposition. Applications include alignment of occluded images (Peng et al., 2012), scene triangulation (Zhang et al., 2011), model selection (Chandrasekaran et al., 2012), face recognition, and document indexing (Candès et al., 2011).

The SPCP formulation can be stated mathematically as follows. Given a noisy matrix Y ∈ ℝ^{m×n}, we decompose it as a sum of a low-rank matrix L and a sparse matrix S via the convex program

minimize_{L,S} ‖L‖_* + λ_sum ‖S‖_1  subject to ‖L + S − Y‖_F ≤ ε,  (SPCP_sum)

where the ℓ₁ norm and nuclear norm are given by ‖S‖_1 = Σ_{i,j} |s_{ij}| and ‖L‖_* = Σ_i σ_i(L), with σ(L) the vector of singular values of L. In (SPCP_sum), the parameter λ_sum > 0 controls the relative weight of the low-rank term L versus the sparse term S, and the parameter ε accounts for the unknown perturbations Y − (L + S) in the data not explained by L and S. When ε = 0, (SPCP_sum) is the "robust PCA" problem as analyzed by Candès et al. (2011); Chandrasekaran et al. (2009), and it has perfect recovery guarantees under stylized incoherence assumptions.

∗ Author's work is supported in part by the European Commission under the grants MIRG-268398 and ERC Future Proof, and by the Swiss Science Foundation under the grants SNF 200021-132548, SNF 200021-146750 and SNF CRSII2-147633.
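The two regularizers and the constraint in (SPCP_sum) are cheap to evaluate numerically. The following numpy sketch is our own illustration (the helper names are ours, not the authors'):

```python
import numpy as np

def spcp_sum_objective(L, S, lam_sum):
    """Objective of (SPCP_sum): nuclear norm of L plus lam_sum * l1-norm of S."""
    nuclear = np.linalg.svd(L, compute_uv=False).sum()  # sum of singular values
    return nuclear + lam_sum * np.abs(S).sum()

def spcp_feasible(L, S, Y, eps):
    """Check the constraint ||L + S - Y||_F <= eps."""
    return np.linalg.norm(L + S - Y, 'fro') <= eps

# toy example: Y is exactly a low-rank matrix plus one sparse spike
rng = np.random.default_rng(0)
L0 = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 6))  # rank 2
S0 = np.zeros((5, 6)); S0[1, 3] = 4.0                           # one spike
Y = L0 + S0
print(spcp_sum_objective(L0, S0, lam_sum=0.5))
print(spcp_feasible(L0, S0, Y, eps=1e-8))  # True: residual is exactly zero
```

Evaluating the nuclear norm already requires an SVD, which is why the per-iteration SVD cost dominates all the solvers discussed below.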
There is even theoretical guidance for selecting a minimax optimal regularization parameter λ_sum (Candès et al., 2011). Unfortunately, many practical problems only approximately satisfy the idealized assumptions, and hence we typically tune RPCA via cross-validation techniques. SPCP further complicates practical tuning due to the additional parameter ε.

To cope with the practical tuning issues of SPCP, we propose the following new variant, called "max-SPCP":

minimize_{L,S} max(‖L‖_*, λ_max ‖S‖_1)  subject to ‖L + S − Y‖_F ≤ ε,  (SPCP_max)

where λ_max > 0 plays a role analogous to λ_sum. Our work shows that this new formulation offers both modeling and computational advantages over (SPCP_sum).

Cross-validation with (SPCP_max) to estimate (λ_max, ε) is significantly easier than estimating (λ_sum, ε) in (SPCP_sum). For example, given an oracle that provides an ideal separation Y ≈ L_oracle + S_oracle, we can use ε = ‖L_oracle + S_oracle − Y‖_F in both cases. However, while we can estimate λ_max = ‖L_oracle‖_* / ‖S_oracle‖_1, it is not clear how to choose λ_sum from data. Such cross-validation can be performed on a similar dataset, or the oracle separation could be obtained from a probabilistic model.

Our convex approach for solving (SPCP_sum) generalizes to other source separation problems (Baldassarre et al., 2013) beyond SPCP. Both (SPCP_max) and (SPCP_sum) are challenging to solve when the dimensions are large. We show in this paper that these problems can be solved more efficiently by solving a few (typically 5 to 10) subproblems of a different functional form. While the efficiency of solution algorithms for (SPCP_sum) relies heavily on the efficiency of ℓ₁-norm and nuclear-norm projections, the efficiency of our solution algorithm for (SPCP_max) is preserved for arbitrary norms. Moreover, (SPCP_max) allows a faster algorithm in the standard case, discussed in Section 6.
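The oracle-based recipe for (SPCP_max), ε = ‖L_oracle + S_oracle − Y‖_F and λ_max = ‖L_oracle‖_*/‖S_oracle‖_1, can be computed directly from an oracle split; a short numpy sketch (our own naming, not the authors' code):

```python
import numpy as np

def estimate_max_spcp_params(L_oracle, S_oracle, Y):
    """From an oracle separation Y ~ L_oracle + S_oracle, read off the
    (SPCP_max) parameters suggested in the text:
        eps     = ||L_oracle + S_oracle - Y||_F
        lam_max = ||L_oracle||_* / ||S_oracle||_1
    """
    eps = np.linalg.norm(L_oracle + S_oracle - Y, 'fro')
    nuc = np.linalg.svd(L_oracle, compute_uv=False).sum()
    l1 = np.abs(S_oracle).sum()
    return eps, nuc / l1

rng = np.random.default_rng(1)
L = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 50))  # low rank
S = np.where(rng.random((40, 50)) < 0.02, 5.0, 0.0)              # sparse
Y = L + S + 0.01 * rng.standard_normal((40, 50))                 # noisy sum
eps, lam_max = estimate_max_spcp_params(L, S, Y)
```

No comparable closed-form estimate is available for λ_sum, which is the modeling advantage claimed above.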
The theoretical and algorithmic research on SPCP formulations (and source separation in general) is rapidly evolving. Hence, it is important to first set the stage in terms of the available formulations in order to highlight our contributions. To this end, we illustrate (SPCP_sum) and (SPCP_max) via different convex formulations. Flipping the objective and the constraints in (SPCP_max) and (SPCP_sum), we obtain the convex programs

minimize_{L,S} ‖L + S − Y‖_F  subject to ‖L‖_* + λ_sum ‖S‖_1 ≤ τ_sum,  (flip-SPCP_sum)

minimize_{L,S} ‖L + S − Y‖_F  subject to max(‖L‖_*, λ_max ‖S‖_1) ≤ τ_max.  (flip-SPCP_max)

Remark 2.1.
The solutions of (flip-SPCP_sum) and (flip-SPCP_max) are related to the solutions of (SPCP_sum) and (SPCP_max) via the Pareto frontier by Aravkin et al. (2013a, Theorem 2.1). If the constraint ‖L + S − Y‖_F ≤ ε is tight at the solution, then there exist corresponding parameters τ_sum(ε) and τ_max(ε) for which the optimal value of (flip-SPCP_sum) and (flip-SPCP_max) is ε, and the corresponding optimal solutions (S_s, L_s) and (S_m, L_m) are also optimal for (SPCP_sum) and (SPCP_max).

For completeness, we also include the Lagrangian formulation, which is covered by our new algorithm:
minimize_{L,S} λ_L ‖L‖_* + λ_S ‖S‖_1 + (1/2)‖L + S − Y‖_F².  (Lag-SPCP)

Problems (flip-SPCP_max) and (flip-SPCP_sum) can be solved using projected gradient and accelerated gradient methods. The disadvantage of some of these formulations is that it may not be clear how to tune the parameters. Surprisingly, an algorithm we propose in this paper can solve (SPCP_max) and (SPCP_sum) using a sequence of flipped problems that specifically exploits the structured relationship cited in Remark 2.1. In practice, we will see that better tuning also leads to faster algorithms; e.g., fixing ε ahead of time to an estimated 'noise floor' greatly reduces the amount of required computation when parameters are selected via cross-validation.

Finally, we note that in some cases it is useful to change the ‖L + S − Y‖_F term to ‖A(L + S − Y)‖_F, where A is a linear operator. For example, let Ω be a subset of the indices of an m × n matrix. We may only observe Y restricted to these entries, denoted P_Ω(Y), in which case we choose A = P_Ω. Most existing RPCA/SPCP algorithms adapt to the case A = P_Ω, but this is due to the strong properties of the projection operator P_Ω. The advantage of our approach is that it seamlessly handles arbitrary linear operators A. In fact, it also generalizes to smooth misfit penalties that are more robust than the Frobenius norm, including the Huber loss. Our results also generalize to some other penalties on S besides the ℓ₁ norm.

The paper proceeds as follows. In Section 3, we describe previous work and algorithms for SPCP and RPCA.
In Section 4, we cast the relationships between the pairs of problems (flip-SPCP_sum), (SPCP_sum) and (flip-SPCP_max), (SPCP_max) into a general variational framework, and highlight the product-space regularization structure that enables us to solve the formulations of interest using the corresponding flipped problems. We discuss computationally efficient projections as optimization workhorses in Section 5, and develop new accelerated projected quasi-Newton methods for the flipped and Lagrangian formulations in Section 6. Finally, we demonstrate the efficacy of the new solvers and the overall formulation on synthetic problems and a real cloud-removal example in Section 7, and follow with conclusions in Section 8.

While problem (SPCP_sum) with ε = 0 has several solvers (e.g., it can be solved by applying the widely known Alternating Directions Method of Multipliers (ADMM)/Douglas-Rachford method (Combettes & Pesquet, 2007)), the formulation assumes the data are noise free. Unfortunately, the presence of noise we consider in this paper introduces a third term in the ADMM framework, where the algorithm is shown to be non-convergent (Chen et al., 2013). Interestingly, there are only a handful of methods that can handle this case. Those using smoothing techniques no longer promote exactly sparse and/or exactly low-rank solutions (Aybat et al., 2013). Those using dual decomposition techniques require high iteration counts. Because each step requires a partial singular value decomposition (SVD) of a large matrix, it is critical that the methods take only a few iterations.

As a rough comparison, we start with related solvers that solve (SPCP_sum) for ε = 0. Wright et al. (2009a) solve an instance of (SPCP_sum) with ε = 0 and an 800 ×
800 system in 8 hours. By switching to the (Lag-SPCP) formulation, Ganesh et al. (2009) use the accelerated proximal gradient method (Beck & Teboulle, 2009) to solve a 1000 × 1000 system far more quickly, and later work solves (SPCP_sum) with ε = 0 using the augmented Lagrangian and ADMM methods on a 1500 × 1500 system. For (SPCP_sum) with ε >
0, Tao & Yuan (2011) propose the alternating splitting augmented Lagrangian method (ASALM), which exploits separability of the objective in the splitting scheme, and can solve a 1500 × 1500 system (a
300 × 300 system or so in about 1 minute). Another ADMM variant reports solving 1500 × 1500 instances of (SPCP_sum) and makes use of the insight of Aybat et al. (2013). That variant is interesting in that it splits the variable L, rather than the sum L + S or the residual L + S − Y, and its experiments use a heuristic factorization L = UVᵀ. This technique has also been used effectively in practice for matrix completion (Aravkin et al., 2013b; Lee et al., 2010; Recht & Ré, 2011), but lacks a full convergence theory in either context. The method of Shen et al. (2014) was an order of magnitude faster than ASALM, but encountered difficulties in some experiments where the sparse component dominated the low-rank component in some sense. Mansour & Vetro (2014) attack the (SPCP_sum) formulation using a factorized approach, together with alternating solves between (U, V) and S. Non-convex techniques also include hard-thresholding approaches, e.g., the approach of Kyrillidis & Cevher (2014). While the factorization technique may potentially speed up some of the methods presented here, we leave this to future work and only consider convex formulations.

Both formulations of interest, (SPCP_sum) and (SPCP_max), can be written as

min_{L,S} φ(L, S)  subject to ρ(L + S − Y) ≤ ε.  (4.1)

Classic formulations assume ρ to be the Frobenius norm; however, this restriction is not necessary, and we only require ρ to be smooth and convex. In particular, ρ can be taken to be the robust Huber penalty (Huber, 2004). Even more importantly, this formulation allows pre-composition of a smooth convex penalty with an arbitrary linear operator A, which extends the proposed approach to a much more general class of problems. Note that a simple operator is already embedded in both formulations of interest:

L + S = [I  I] [L; S].  (4.2)

Projection onto a set of observed indices Ω is also a simple linear operator that can be included in ρ.
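The embedded operator (4.2) and its composition with the sampling operator P_Ω are one-liners in practice; a numpy illustration (function names are ours):

```python
import numpy as np

def A_sum(L, S):
    """The operator in (4.2): maps the pair (L, S) to L + S."""
    return L + S

def A_masked(L, S, mask):
    """Compose with P_Omega: keep only observed entries (mask is boolean)."""
    return np.where(mask, L + S, 0.0)

L = np.arange(6.0).reshape(2, 3)
S = np.ones((2, 3))
mask = np.array([[True, False, True], [False, True, False]])
print(A_sum(L, S))           # entrywise L + S
print(A_masked(L, S, mask))  # zeros at unobserved entries
```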
Operators may include different transforms (e.g., Fourier) applied to either L or S.

The main formulations of interest differ only in the functional φ(L, S). For (SPCP_sum), we have φ_sum(L, S) = ‖L‖_* + λ_sum ‖S‖_1, while for (SPCP_max), φ_max(L, S) = max(‖L‖_*, λ_max ‖S‖_1).

The problem class (4.1) falls into the class of problems studied by van den Berg & Friedlander (2008, 2011) for ρ(·) = ‖·‖₂ and by Aravkin et al. (2013a) for arbitrary convex ρ. Making use of this framework, we can define a value function

v(τ) = min_{L,S} ρ(A(L, S) − Y)  subject to φ(L, S) ≤ τ,  (4.3)

and use Newton's method to find a solution to v(τ) = ε. The approach is agnostic to the linear operator A (it can be of the simple form (4.2), include restriction in the missing-data case, etc.).

For both formulations of interest, φ is a norm defined on the product space ℝ^{m×n} × ℝ^{m×n}, since we can write

φ_sum(L, S) = ‖( ‖L‖_* , λ_sum ‖S‖_1 )‖_1,  (4.4)
φ_max(L, S) = ‖( ‖L‖_* , λ_max ‖S‖_1 )‖_∞.  (4.5)

In particular, both φ_sum(L, S) and φ_max(L, S) are gauges. For a convex set C containing the origin, the gauge γ(x | C) is defined by

γ(x | C) = inf { λ ≥ 0 : x ∈ λC }.  (4.6)

For any norm ‖·‖, the set defining it as a gauge is simply the unit ball B = {x : ‖x‖ ≤ 1}. We introduce gauges for two reasons. First, they are more general: a gauge is a norm only if C is bounded with nonempty interior and symmetric about the origin. For example, gauges trivially allow inclusion of non-negativity constraints. Second, definition (4.6) and the explicit set C simplify the exposition of the following results.

In order to implement Newton's method for (4.3), the optimization problem that evaluates v(τ) must be solved (fully or approximately) to obtain (L, S).
Then the τ parameter for the next problem (4.3) is updated via the Newton step

τ_{k+1} = τ_k − (v(τ_k) − ε) / v′(τ_k).  (4.7)

Given (L, S), v′(τ) can be written in closed form using (Aravkin et al., 2013a, Theorem 5.2), which simplifies to

v′(τ) = −φ°(Aᵀ ∇ρ(A(L, S) − Y)),  (4.8)

with φ° denoting the polar gauge to φ. The polar gauge is precisely γ(x | C°), with

C° = { v : ⟨v, x⟩ ≤ 1 for all x ∈ C }.  (4.9)

In the simplest case, where A is given by (4.2) and ρ is the least-squares penalty, formula (4.8) becomes

v′(τ) = −φ°( (L + S − Y, L + S − Y) ).

The main computational challenge in the approach outlined in (4.3)–(4.8) is to design a fast solver to evaluate v(τ); Section 6 does just this. The key to RPCA is that the regularization functional φ is a gauge over the product space used to decompose Y into the summands L and S. This makes it straightforward to compute polar results for both φ_sum and φ_max.

Theorem 4.1 (Max-Sum Duality for Gauges on Product Spaces). Let γ₁ and γ₂ be gauges on ℝ^{n₁} and ℝ^{n₂}, and consider the function g(x, y) = max{γ₁(x), γ₂(y)}. Then g is a gauge, and its polar is given by g°(z₁, z₂) = γ₁°(z₁) + γ₂°(z₂).

Proof.
Let C₁ and C₂ denote the canonical sets corresponding to the gauges γ₁ and γ₂. It immediately follows that g is the gauge of the set C = C₁ × C₂, since

inf{ λ ≥ 0 | (x, y) ∈ λC } = inf{ λ ≥ 0 | x ∈ λC₁ and y ∈ λC₂ } = max{ γ₁(x), γ₂(y) }.

By (Rockafellar, 1970, Corollary 15.1.2), the polar of the gauge of C is the support function of C, which is given by

sup_{x ∈ C₁, y ∈ C₂} ⟨(x, y), (z₁, z₂)⟩ = sup_{x ∈ C₁} ⟨x, z₁⟩ + sup_{y ∈ C₂} ⟨y, z₂⟩ = γ₁°(z₁) + γ₂°(z₂).

This theorem allows us to easily compute the polars for φ_sum and φ_max in terms of the polars of ‖·‖_* and ‖·‖_1, which are the dual norms: the spectral norm and the infinity norm, respectively.

Corollary 4.2 (Explicit variational formulae for (SPCP_sum) and (SPCP_max)). We have

φ°_sum(Z₁, Z₂) = max{ ‖Z₁‖₂ , (1/λ_sum) ‖Z₂‖_∞ },
φ°_max(Z₁, Z₂) = ‖Z₁‖₂ + (1/λ_max) ‖Z₂‖_∞,  (4.10)

where ‖Z₁‖₂ denotes the spectral norm (the largest singular value of Z₁).

This result was also obtained by (van den Berg & Friedlander, 2011, Section 9), but is stated only for norms. Theorem 4.1 applies to gauges, and in particular allows asymmetric gauges, so non-negativity constraints can be easily modeled. We now have closed-form expressions for v′(τ) in (4.8) for both formulations of interest. The remaining challenge is to design a fast solver for (4.3) for formulations (SPCP_sum) and (SPCP_max). We focus on this challenge in the remaining sections of the paper, where we also discuss the advantage of (SPCP_max) from this computational perspective.

In this section, we consider the computational issues of projecting onto the set defined by φ(L, S) ≤ τ. For φ_max(L, S) = max(‖L‖_*, λ_max ‖S‖_1) this is straightforward, since the set is just the product of the nuclear-norm and ℓ₁-norm balls, and efficient projectors onto these are known.
In particular, projecting an m × n matrix (without loss of generality, m ≤ n) onto the nuclear-norm ball takes O(m²n) operations, and projecting it onto the ℓ₁-ball can be done in O(mn) operations using fast median-finding algorithms (Brucker, 1984; Duchi et al., 2008).

For φ_sum(L, S) = ‖L‖_* + λ_sum ‖S‖_1, the projection is no longer straightforward. Nonetheless, the following results show this projection can be implemented efficiently.

Proposition 5.1 (van den Berg & Friedlander, 2011, Section 5.2). Projection onto the scaled ℓ₁-ball { x ∈ ℝ^d | Σ_{i=1}^d α_i |x_i| ≤ 1 }, for some α_i > 0, can be done in O(d log d) time.

The proof of the proposition follows by noting that the solution can be written in a form depending only on a single scalar parameter, and this scalar can be found by sorting the ratios |x_i|/α_i followed by appropriate summations. We conjecture that fast median-finding ideas could reduce this to O(d) in theory, the same as the optimal complexity for the ℓ₁-ball.

Armed with the above proposition, we state an important lemma below. For our purposes, we may think of S as a vector in ℝ^{mn} rather than a matrix in ℝ^{m×n}.

Lemma 5.2 (van den Berg & Friedlander, 2011, Section 9.2). Let L = U Σ Vᵀ with Σ = diag(σ), and let (S_i)_{i=1}^{mn} be any ordering of the elements of S. Then the projection of (L, S) onto the φ_sum ball is (U diag(σ̂) Vᵀ, Ŝ), where (σ̂, Ŝ) is the projection onto the scaled ℓ₁-ball { (σ, S) | Σ_{j=1}^{min(m,n)} |σ_j| + Σ_{i=1}^{mn} λ_sum |S_i| ≤ 1 }.

Sketch of proof. We need to solve

min_{ (L̂, Ŝ) : φ_sum(L̂, Ŝ) ≤ 1 } (1/2)‖L̂ − L‖_F² + (1/2)‖Ŝ − S‖_F².

Alternatively, solve

min_Ŝ  min_{ L̂ : ‖L̂‖_* ≤ 1 − λ_sum ‖Ŝ‖_1 } (1/2)‖L̂ − L‖_F² + (1/2)‖Ŝ − S‖_F².

The inner minimization is equivalent to projecting onto a nuclear-norm ball, which is well known to be soft-thresholding of the singular values.
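A sorting-based implementation of the Proposition 5.1 projection might look as follows (our sketch, not the authors' code; `project_scaled_l1` is a hypothetical name):

```python
import numpy as np

def project_scaled_l1(x, alpha, radius=1.0):
    """Euclidean projection of x onto {z : sum_i alpha_i*|z_i| <= radius},
    alpha_i > 0, via sorting (O(d log d)), in the spirit of Proposition 5.1.
    The minimizer is a scaled soft-threshold:
        z_i = sign(x_i) * max(|x_i| - theta*alpha_i, 0)
    for a single scalar theta >= 0."""
    x, alpha = np.asarray(x, float), np.asarray(alpha, float)
    ax = np.abs(x)
    if alpha @ ax <= radius:
        return x.copy()                      # already inside the ball
    order = np.argsort(ax / alpha)[::-1]     # sort ratios |x_i|/alpha_i descending
    aw = (alpha * ax)[order]                 # alpha_i * |x_i|, sorted
    a2 = (alpha ** 2)[order]
    # if the k largest ratios are active: sum_i alpha_i*(|x_i| - theta*alpha_i) = radius
    theta_cand = (np.cumsum(aw) - radius) / np.cumsum(a2)
    k = int(np.count_nonzero(theta_cand < (ax / alpha)[order]))
    theta = theta_cand[k - 1]
    return np.sign(x) * np.maximum(ax - theta * alpha, 0.0)

print(project_scaled_l1(np.array([3.0, 1.0]), np.array([1.0, 1.0])))  # [1. 0.]
```

With all α_i = 1 this reduces to the standard ℓ₁-ball projection.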
Since the inner projection depends only on the singular values, recombining the two minimizations gives exactly a joint projection onto a scaled ℓ₁-ball.

Remark 5.1.
All references to the ℓ₁-ball can be replaced by the intersection of the ℓ₁-ball and the non-negative cone, and the projection is still efficient. As noted in Section 4, imposing non-negativity constraints is covered by the gauge results of Theorem 4.1 and Corollary 4.2. Therefore, both the variational and the efficient computational framework apply to this interesting case.

In order to accelerate the approach, we can use quasi-Newton (QN) methods, since the objective has a simple structure. The main challenge here is that for the ‖L‖_* term, it is tricky to deal with a weighted quadratic term (whereas for ‖S‖_1 we can obtain a low-rank Hessian and solve it efficiently via coordinate descent).

We wish to solve (flip-SPCP_max). Let X = (L, S) be the full variable, so we can write the smooth objective as f(X) = (1/2)‖A(X) − Y‖_F². To simplify the exposition, we take A = [I  I] (an mn × 2mn matrix), but the presented approach applies to general linear operators (including terms like P_Ω). The matrix structure of L and S is not important here, so we can think of them as length-mn vectors rather than m × n matrices.

The gradient is ∇f(X) = Aᵀ(A(X) − Y). For convenience, we write r(X) = A(X) − Y, so that

∇f(X) = (∇_L f(X), ∇_S f(X)) = Aᵀ r(X),  with r_k ≡ r(X_k).

The Hessian is

Aᵀ A = [ I  I ; I  I ].

We cannot simultaneously project (
L, S) onto their constraints with this Hessian scaling (doing so would solve the original problem!), since the Hessian removes separability. Instead, we use (L_k, S_k) to approximate the cross-terms. The true function is quadratic, so the following expansion around X_k = (L_k, S_k) is exact:¹

f(L, S) = f(X_k) + ⟨ (r_k, r_k), (L − L_k, S − S_k) ⟩ + (1/2)⟨ (L − L_k, S − S_k), (L − L_k + S − S_k, L − L_k + S − S_k) ⟩.

The coupling of the second-order terms (each block contains the cross-term from the other block) prevents direct one-step minimization of f subject to the nuclear-norm and ℓ₁-norm constraints. The FISTA (Beck & Teboulle, 2009) and spectral projected gradient (SPG) (Wright et al., 2009b) methods replace the Hessian [I I; I I] with the upper bound 2[I 0; 0 I], which resolves the coupling issue but potentially loses too much second-order information. After comparing FISTA and SPG, we use the SPG method for solving (flip-SPCP_sum). However, for (flip-SPCP_max) (and for (Lag-SPCP), which has no constraints but rather nonsmooth terms that can be treated like constraints using proximity operators), the constraints are uncoupled, and we can take a "middle road" approach: we replace

⟨ (L − L_k, S − S_k), (L − L_k + S − S_k, L − L_k + S − S_k) ⟩

with

⟨ (L − L_k, S − S_k), (L − L_k + (S_k − S_{k−1}), (L_{k+1} − L_k) + S − S_k) ⟩.

¹ We use "quasi-Newton" to mean an approximation to a Newton method; it should not be confused with methods like BFGS.
The first term is decoupled, allowing us to update L_k; this update is then plugged into the second term in a Gauss-Seidel fashion. In practice, we also scale this second-order term by a number slightly greater than 1 but less than 2 (e.g., 1.
25), which leads to more robust behavior. We expect this "quasi-Newton" trick to do well when S_{k+1} − S_k is similar to S_k − S_{k−1}.

The numerical experiments are run with the algorithms suggested in this paper as well as code from PSPG (Aybat et al., 2013), NSA (Aybat & Iyengar, 2014), and ASALM (Tao & Yuan, 2011), available from the authors' experiment package; we modified the other software as needed for testing purposes. PSPG, NSA and ASALM all solve (SPCP_sum), but ASALM has another variant which solves (Lag-SPCP), so we test this as well. All three programs also use versions of PROPACK from Becker & Candès (2008); Larsen (1998) to compute partial SVDs. Since the cost of a single iteration may vary among the solvers, we measure error as a function of time, not iterations. When a reference solution (L*, S*) is available, we measure the (relative) error of a trial solution (L, S) as ‖L − L*‖_F/‖L*‖_F + ‖S − S*‖_F/‖S*‖_F. The benchmark is designed so the time required to calculate this error at each iteration does not factor into the reported times. Since picking stopping conditions is solver dependent, we show plots of error versus time rather than list tables. All tests are done in Matlab, the dominant computational cost for all algorithms was matrix multiplication, and all code was run on the same quad-core machine.

For our solvers for (flip-SPCP_max), (SPCP_max) and (Lag-SPCP), we use a randomized SVD (Halko et al., 2011). Since the number of singular values needed is not known in advance, the partial SVD may be called several times (the same is true for PSPG, NSA and ASALM). Our code limits the number of singular values on the first two iterations in order to speed up calculation without affecting convergence. Unfortunately, the delicate projection involved in (flip-SPCP_sum) makes incorporating a partial SVD more challenging, so there we use Matlab's dense SVD routine.
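As a baseline for the accelerated solvers above, (Lag-SPCP) can be handled by a plain proximal-gradient iteration in a few lines; the sketch below uses our own naming, omits the quasi-Newton acceleration, randomized SVDs, and stopping rules used in the experiments, and takes the step t = 1/2 matching the Lipschitz constant 2 of the coupled quadratic:

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: prox of tau*||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def shrink(M, tau):
    """Entrywise soft-thresholding: prox of tau*||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def lag_spcp_prox_grad(Y, lam_L, lam_S, iters=200, t=0.5):
    """Minimal proximal gradient for min lam_L*||L||_* + lam_S*||S||_1
    + 0.5*||L+S-Y||_F^2 (a sketch, not the paper's accelerated solver)."""
    L = np.zeros_like(Y); S = np.zeros_like(Y)
    for _ in range(iters):
        R = L + S - Y                    # shared gradient of the smooth term
        L = svt(L - t * R, t * lam_L)    # both proxes use the same residual R
        S = shrink(S - t * R, t * lam_S)
    return L, S

rng = np.random.default_rng(2)
Y = rng.standard_normal((30, 4)) @ rng.standard_normal((4, 40))
Y[2, 5] += 10.0                          # one gross outlier
L, S = lag_spcp_prox_grad(Y, lam_L=1.0, lam_S=0.2)
```

The nonsmooth terms are separable in (L, S), so the joint prox splits into the two thresholding operations shown; the quasi-Newton scheme of Section 6 replaces the crude 2I Hessian bound implicit in this iteration.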
We first provide a test with generated data. The observations Y ∈ ℝ^{m×n} with m = 400 and n = 500 were created by first sampling a rank-20 matrix Y₀ with random singular vectors (i.e., from the Haar measure) and singular values drawn from a uniform distribution with mean 0.
1, and then adding exponential random noise (with mean equal to one tenth of the median absolute value of the entries of Y₀). This exponential noise, which has a longer tail than Gaussian noise, is expected to be captured partly by the S term and partly by the ‖L + S − Y‖_F term. Given Y, the reference solution (L*, S*) was generated by solving (Lag-SPCP) to very high accuracy; the values λ_L = 0.
25 and a small λ_S (a negative power of ten) were picked by hand-tuning (λ_L, λ_S) so that both L* and S* are non-zero. The advantage of solving (Lag-SPCP) is that knowledge of (L*, S*, λ_L, λ_S) allows us to generate the parameters for all the other variants, and hence we can test the different problem formulations. With these parameters, L* had rank 17 with nuclear norm about 6, and S* had 54 non-zero entries (most of them positive) with ℓ₁ norm below 1; the induced parameters were ε ≈ 1, λ_sum < 1, λ_max ≈ 150, τ_sum ≈ 6 and τ_max ≈ 6.

Figure 1: The exponential noise test (error versus time). The asterisk in the legend means the method uses a fast SVD.

Results are shown in Fig. 1. Our methods for (flip-SPCP_max) and (Lag-SPCP) are extremely fast, because the simple nature of these formulations allows the quasi-Newton acceleration scheme of Section 6. In turn, since our method for solving (SPCP_max) uses the variational framework of Section 4 to solve a sequence of (flip-SPCP_max) problems, it is also competitive (shown in cyan in Figure 1). The jumps are due to re-starting the sub-problem solver with a new value of τ, generated according to (4.7). Our proximal gradient method for (flip-SPCP_sum), which makes use of the projection in Lemma 5.2, converges more slowly, since it is not easy to accelerate with the quasi-Newton scheme due to variable coupling, and it does not make use of fast SVDs. Our solver for (SPCP_sum), which depends on a sequence of problems (flip-SPCP_sum), converges slowly. ASALM performs reasonably well, which was unexpected since it was shown to be worse than NSA and PSPG in Aybat et al. (2013); Aybat & Iyengar (2014).
The PSPG solver converges to the wrong answer, most likely due to a bad choice of the smoothing parameter μ; we tried several values other than the default but did not see improvement for this test (for other tests, not shown, tweaking μ helped significantly). The NSA solver reaches moderate error quickly but stalls before finding a highly accurate solution.

We next show tests from the setup of Aybat & Iyengar (2014) in the m = n = 1500 case. The default setting λ_sum = 1/√max(m, n) was used, and the NSA solver was run to high accuracy to obtain a reference solution (L*, S*). From the knowledge of (L*, S*, λ_sum) one can generate λ_max, τ_sum, τ_max and ε, but not λ_S and λ_L, and hence we did not test the solvers for (Lag-SPCP) in this experiment. The data were generated as Y = L₀ + S₀ + Z₀, where L₀ was sampled by multiplication of m × r and r × n normal Gaussian matrices, S₀ had p randomly chosen entries drawn uniformly from a symmetric interval, and Z₀ was white noise chosen to give an SNR of 45 dB. We show tests that vary the rank over {0.05, 0.1}·min(m, n) and the sparsity over p = {0.05, 0.1}·mn. Unlike Aybat & Iyengar (2014), who report error in terms of the true noiseless signal (L₀, S₀), we report the optimization error relative to (L*, S*).

For the first test (with r = 75 and p = 0.05·mn), L* had rank 786 and nuclear norm 111363; S* had 75.
49% of its elements nonzero and ℓ₁ norm 5720399, with a small relative residual ‖L* + S* − Y‖_F/‖Y‖_F. The remaining parameters (ε, λ_sum, λ_max, τ_sum, τ_max) were generated from (L*, S*) as before. An interesting feature of this test is that while L₀ is low-rank, L* is only nearly low-rank, with a small tail of significant singular values up to number 786. We expect methods to converge quickly to low accuracy, where only a low-rank approximation is needed, and then slow down as they try to find a larger-rank, highly accurate solution.

Figure 2: The first 1500 × 1500 test (error versus time).
Most solvers quickly reach an error of 0.01 (for comparison, an error of 2 is achieved by setting L = S = 0). The NSA and PSPG solvers do quite well. In contrast to the previous test, ASALM does poorly. Our methods for (flip-SPCP_sum), and hence (SPCP_sum), are not competitive, since they use dense SVDs; we imposed a time limit of about one minute, so these methods only manage one or two iterations. Our quasi-Newton method for (flip-SPCP_max) does well initially, then takes a long time due to a long partial-SVD computation. Interestingly, (SPCP_max) does better than pure (flip-SPCP_max). One possible explanation is that it chooses a fortuitous sequence of τ values, for which the corresponding (flip-SPCP_max) subproblems become increasingly hard and therefore benefit from the warm start from the solution of the easier previous problem. This is consistent with empirical observations regarding continuation techniques; see, e.g., van den Berg & Friedlander (2008); Wright et al. (2009b).

Figure 3: The second 1500 × 1500 test, with r = 150 and p = 0.1·mn; the conclusions are largely similar.

Figure 4 shows 15 images of size 300 ×
300 from the MODIS satellite (publicly available at http://ladsweb.nascom.nasa.gov/), after some transformations to turn images from different spectral bands into one grayscale image. Each image is a photo of the same rural location but at a different point in time, over the course of a few months. The background changes slowly, and its variability is due to changes in vegetation, snow cover, and different reflectance. There are also outlying sources of error, mainly due to clouds (e.g., major clouds in frames 5 and 7, smaller clouds in frames 9, 11 and 12), as well as artifacts of the CCD camera on the satellite (frames 4 and 6) and issues stitching together photos of the same scene (the lines in frames 8 and 10).

Figure 5: Frames 4, 5 and 12. The leftmost column is the original data Y, the middle column is the low-rank term L of the solution, and the right column is the sparse term S. Data have been processed slightly to enhance contrast for viewing.

There are hundreds of applications for clean satellite imagery, so removing the outlying error is of great practical importance. Because of the slowly changing background and sparse errors, we can model the problem using the robust PCA approach. We use the (flip-SPCP_max) version due to its speed, and pick the parameters (λ_max, τ_max) using a Nelder-Mead simplex search. For an error metric to use in the parameter tuning, we remove frame 1 from the data set (call it y) and set Y to be frames 2–15. From this training data Y, the algorithm generates L and S. Since L is a 300² ×
14 matrix, it is far from having full column span. Thus our error is the distance of y from the span of L, i.e., ‖y − P_span(L)(y)‖. Our method takes about 11 iterations and 5 seconds, and uses a dense SVD instead of the randomized method due to the high aspect ratio of the matrix. Some results of the obtained (L, S) outputs are in Fig. 5, where one can see that some of the anomalies in the original data frames Y are picked up by the S term and removed from the L term. Frame 4 has what appears to be a camera pixel error; frame 6 has another artificial error (that is, caused by the camera and not the scene); and frame 12 has cloud cover.

Figure 4: Satellite photos of the same location on different days.
8 Conclusions

In this paper, we reviewed several formulations and algorithms for the RPCA problem. We introduced a new denoising formulation (SPCP_max) in addition to the ones previously considered, and discussed modeling and algorithmic advantages of the denoising formulations (SPCP_max) and (SPCP_sum) compared to the flipped versions (flip-SPCP_max) and (flip-SPCP_sum). In particular, we showed that these formulations can be linked through a variational framework, which can be exploited to solve the denoising formulations via a sequence of flipped problems. For (flip-SPCP_max), we proposed a quasi-Newton acceleration that is competitive with the state of the art, and used this innovation to design a fast method for (SPCP_max) through the variational framework. The new methods were compared against prior art on synthetic examples, and applied to a real-world cloud removal application using publicly available MODIS satellite data.
References
Aravkin, A. Y., Burke, J., and Friedlander, M. P. Variational properties of value functions. SIAM J. Optimization, 23(3):1689–1717, 2013a.

Aravkin, A. Y., Kumar, R., Mansour, H., Recht, B., and Herrmann, F. J. A robust SVD-free approach to matrix completion, with applications to interpolation of large scale data. 2013b. URL http://arxiv.org/abs/1302.4886.

Aybat, N., Goldfarb, D., and Ma, S. Efficient algorithms for robust and stable principal component pursuit. Computational Optimization and Applications, (accepted), 2013.

Aybat, N. S. and Iyengar, G. An alternating direction method with increasing penalty for stable principal component pursuit. Computational Optimization and Applications, (submitted), 2014. http://arxiv.org/abs/1309.6553.

Baldassarre, L., Cevher, V., McCoy, M., Tran Dinh, Q., and Asaei, A. Convexity in source separation: Models, geometry, and algorithms. Technical report, 2013. http://arxiv.org/abs/1311.0258.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, January 2009.

Becker, S. and Candès, E. Singular value thresholding toolbox, 2008. Available from http://svt.stanford.edu/.

Brucker, P. An O(n) algorithm for quadratic knapsack problems. Operations Res. Lett., 3(3):163–166, 1984. doi: 10.1016/0167-6377(84)90010-5.

Candès, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? J. Assoc. Comput. Mach., 58(3):1–37, May 2011.

Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., and Willsky, A. S. Sparse and low-rank matrix decompositions. In SYSID 2009, Saint-Malo, France, July 2009.

Chandrasekaran, V., Parrilo, P. A., and Willsky, A. S. Latent variable graphical model selection via convex optimization. Ann. Stat., 40(4):1935–2357, 2012.

Chen, C., He, B., Ye, Y., and Yuan, X. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Optimization Online, 2013.

Combettes, P. L. and Pesquet, J.-C. A Douglas–Rachford splitting approach to nonsmooth convex variational signal recovery. IEEE J. Sel. Topics Sig. Processing, 1(4):564–574, December 2007.

Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. Efficient projections onto the l1-ball for learning in high dimensions. In Intl. Conf. Machine Learning (ICML), pp. 272–279, New York, July 2008. ACM Press.

Ganesh, A., Lin, Z., Wright, J., Wu, L., Chen, M., and Ma, Y. Fast algorithms for recovering a corrupted low-rank matrix. In Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 213–215, Aruba, Dec. 2009.

Halko, N., Martinsson, P.-G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

Huber, P. J. Robust Statistics. John Wiley and Sons, 2nd edition, 2004.

Kyrillidis, A. and Cevher, V. Matrix recipes for hard thresholding methods. Journal of Mathematical Imaging and Vision, 48(2):235–265, 2014.

Larsen, R. M. Lanczos bidiagonalization with partial reorthogonalization. Tech. Report DAIMI PB-357, Department of Computer Science, Aarhus University, September 1998.

Lee, J., Recht, B., Salakhutdinov, R., Srebro, N., and Tropp, J. A. Practical large-scale optimization for max-norm regularization. In Neural Information Processing Systems (NIPS), Vancouver, 2010.

Lin, Z., Chen, M., and Ma, Y. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055, 2010.

Mansour, H. and Vetro, A. Video background subtraction using semi-supervised robust matrix completion. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014. To appear.

Peng, Y., Ganesh, A., Wright, J., Xu, W., and Ma, Y. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE Trans. Pattern Analysis and Machine Intelligence, 34(11):2233–2246, 2012.

Recht, B. and Ré, C. Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Prog. Comput., pp. 1–26, 2011.

Rockafellar, R. T. Convex Analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton, N.J., 1970.

Shen, Y., Wen, Z., and Zhang, Y. Augmented Lagrangian alternating direction method for matrix separation based on low-rank factorization. Optimization Methods and Software, 29(2):239–263, March 2014.

Tao, M. and Yuan, X. Recovering low-rank and sparse components of matrices from incomplete and noisy observations. SIAM J. Optimization, 21:57–81, 2011.

van den Berg, E. and Friedlander, M. P. Probing the Pareto frontier for basis pursuit solutions. SIAM J. Sci. Computing, 31(2):890–912, 2008.

van den Berg, E. and Friedlander, M. P. Sparse optimization with least-squares constraints. SIAM J. Optimization, 21(4):1201–1229, 2011.

Wright, J., Ganesh, A., Rao, S., and Ma, Y. Robust principal component analysis: Exact recovery of corrupted low-rank matrices by convex optimization. In Neural Information Processing Systems (NIPS), 2009a.

Wright, S. J., Nowak, R. D., and Figueiredo, M. A. T. Sparse reconstruction by separable approximation. IEEE Trans. Sig. Processing, 57(7):2479–2493, July 2009b.

Zhang, Z., Liang, X., Ganesh, A., and Ma, Y. TILT: Transform invariant low-rank textures. In Kimmel, R., Klette, R., and Sugimoto, A. (eds.), Computer Vision – ACCV 2010, volume 6494 of Lecture Notes in Computer Science, pp. 314–328. Springer, 2011.