Practical Matrix Completion and Corruption Recovery using Proximal Alternating Robust Subspace Minimization
Yu-Xiang Wang, Choon Meng Lee, Loong-Fah Cheong, Kim-Chuan Toh
International Journal of Computer Vision manuscript No. (will be inserted by the editor)
Received: 5 September 2013 / Accepted: 26 June 2014
Abstract
Low-rank matrix completion is a problem of immense practical importance. Recent works on the subject often use the nuclear norm as a convex surrogate of the rank function. Despite its solid theoretical foundation, the convex version of the problem often fails to work satisfactorily in real-life applications. Real data often suffer from very few observations, with support not meeting the randomness requirements, ubiquitous presence of noise, and potentially gross corruptions, sometimes with these occurring simultaneously.

This paper proposes a Proximal Alternating Robust Subspace Minimization (PARSuMi) method to tackle the three problems. The proximal alternating scheme explicitly exploits the rank constraint on the completed matrix and uses the ℓ0 pseudo-norm directly in the corruption recovery step. We show that the proposed method for the non-convex and non-smooth model converges to a stationary point. Although it is not guaranteed to find the global optimal solution, in practice we find that our algorithm can typically arrive at a good local minimizer when it is supplied with a reasonably good starting point based on convex optimization. Extensive experiments with challenging synthetic and real data demonstrate that our algorithm succeeds in a much larger range of practical problems where convex optimization fails, and it also outperforms various state-of-the-art algorithms.

Keywords: matrix completion · matrix factorization · RPCA · robust · low-rank · sparse · nuclear norm · non-convex optimization · SfM · photometric stereo

Y.X. Wang, C.M. Lee, L.F. Cheong, K.C. Toh, National University of Singapore. E-mail: {yuxiangwang, leechoonmeng, eleclf, mattohkc}@nus.edu.sg. The support of the PSF grant 1321202075 is gratefully acknowledged.

1 Introduction

Completing a low-rank matrix from partially observed entries, also known as matrix completion, is a central task in many real-life applications. The same abstraction of this problem has appeared in diverse fields such as signal processing, communications, information retrieval, machine learning and computer vision. For instance, the missing data to be filled in may correspond to plausible movie recommendations (Koren et al 2009; Funk 2006), occluded feature trajectories for rigid or non-rigid structure from motion, namely SfM (Hartley and Schaffalitzky 2003; Buchanan and Fitzgibbon 2005) and NRSfM (Paladini et al 2009), relative distances of wireless sensors (Oh et al 2010), or pieces of uncollected measurements in DNA micro-array (Friedland et al 2006), just to name a few.
Fig. 1
Sampling pattern of the Dinosaur sequence: 316 features are tracked over 36 frames. The dark area represents locations where no data is available; the sparse highlights are injected gross corruptions. The middle stripe in grey contains the noisy observed data, occupying 23% of the full matrix. The task of this paper is to fill in the missing data and recover the corruptions.
The common difficulty of these applications lies in the scarcity of the observed data, uneven distribution of the support, noise, and, more often than not, the presence of gross corruptions in some observed entries. For instance, in the movie rating database Netflix (Bennett et al 2007), less than 1% of the entries are observed, and 90% of the observed entries correspond to 10% of the most popular movies. In photometric stereo, the missing data and corruptions (arising from shadow and specular highlight as modeled in Wu et al (2011b)) form contiguous blocks in images and are by no means random. In structure from motion, the observations fall into a diagonal band shape, and feature coordinates are often contaminated by tracking errors (see the illustration in Fig. 1). Therefore, in order for any matrix completion algorithm to work in practice, these aforementioned difficulties need to be tackled altogether. We refer to this problem as practical matrix completion. Mathematically, the problem to be solved is the following:

Given Ω and Ŵ_ij for all (i,j) ∈ Ω, find W and Ω̃, such that
rank(W) is small; card(Ω̃) is small; |W_ij − Ŵ_ij| is small for all (i,j) ∈ Ω \ Ω̃,

where Ω is the index set of observed entries whose locations are not necessarily selected at random, Ω̃ ⊆ Ω represents the index set of corrupted data, and Ŵ ∈ R^{m×n} is the measurement matrix, of which only the entries Ŵ_ij with (i,j) ∈ Ω are known, i.e., its support is contained in Ω. Furthermore, we define the projection P_Ω : R^{m×n} → R^{|Ω|} so that P_Ω(Ŵ) denotes the vector of observed data. The adjoint of P_Ω is denoted by P*_Ω.

Extensive theories and algorithms have been developed to tackle some aspects of the challenges listed in the preceding paragraph, but those tackling the full set of challenges are few and far between, thus resulting in a dearth of practical algorithms. Two dominant classes of approaches are nuclear norm minimization, e.g., Candès and Recht (2009); Candès et al (2011); Candès and Plan (2010); Chen et al (2011), and matrix factorization, e.g., Koren et al (2009); Buchanan and Fitzgibbon (2005); Okatani and Deguchi (2007); Chen (2008); Eriksson and Van Den Hengel (2010). Nuclear norm minimization methods minimize the convex relaxation of rank instead of the rank itself; they are supported by rigorous theoretical analysis and efficient numerical computation. However, the conditions under which they succeed are often too restrictive for them to work well in real-life applications (as reported in Shi and Yu (2011) and Jain et al (2012)). In contrast, matrix factorization is widely used in practice and is considered very effective for problems such as movie recommendation (Koren et al 2009) and structure from motion (Tomasi and Kanade 1992; Paladini et al 2009), despite its lack of rigorous theoretical foundation. Indeed, as one factorizes the matrix W into UV^T, the formulation becomes bilinear and the optimal solution is hard to obtain except in very specific cases (e.g., in Jain et al (2012)). A more comprehensive survey of the algorithms and a review of their strengths and weaknesses will be given in the next section.

In this paper, we attempt to solve the practical matrix completion problem under the prevalent case where the rank of the matrix W and the cardinality of Ω̃ are upper bounded by some known parameters r and N₀, via the following non-convex, non-smooth optimization model:

min_{W,E} (1/2)‖P_Ω(W − Ŵ + E)‖² + (ε/2)‖P_Ω̄(W)‖²
subject to
rank(W) ≤ r, W ∈ R^{m×n},
‖E‖₀ ≤ N₀, ‖E‖ ≤ K_E, E ∈ R^{m×n}_Ω, (1)

where R^{m×n}_Ω denotes the set of m×n matrices whose supports are subsets of Ω, ‖·‖ is the Frobenius norm, and ‖·‖₀ is the ℓ0 pseudo-norm (the number of non-zero entries); K_E is a finite constant introduced to facilitate the convergence proof. Note that the restriction of E to R^{m×n}_Ω is natural, since the role of E is to capture the gross corruptions in the observed data Ŵ_ij, (i,j) ∈ Ω. The bound constraint on ‖E‖ is natural in some problems when the true matrix W is bounded (e.g., given typical movie ratings of 0-10, the gross outliers can only lie in [-10, 10]). In other problems, we simply choose K_E to be some large multiple (say 20) of √N₀ × median(P_Ω(Ŵ)), so that the constraint is essentially inactive and has no impact on the optimization. Note that, without making any randomness assumption on the index set Ω, or assuming that the problem has a unique solution (W*, E*) such that the singular vector matrices of W* satisfy some inherent conditions like those in Candès et al (2011), the problem of practical matrix completion is generally ill-posed. This motivated us to include the Tikhonov regularization term (ε/2)‖P_Ω̄(W)‖² in (1), where Ω̄ denotes the complement of Ω and 0 < ε ≪ 1; among all the candidates in the optimal solution set of the non-regularized problem, the regularization favors the W with the smallest ‖P_Ω̄(W)‖. Notice that we only put a regularization on those elements of W in Ω̄, as we do not wish to perturb those elements of W in the fitting term. Finally, with the Tikhonov regularization and the bound constraint on ‖E‖, we can show that problem (1) has a global minimizer.

By defining H ∈ R^{m×n} to be the matrix such that

H_ij = 1 if (i,j) ∈ Ω, and H_ij = √ε if (i,j) ∉ Ω, (2)

and taking those elements of E and Ŵ in Ω̄ to be zero, we can rewrite the objective function in (1) in a compact form, and the problem becomes:

min_{W,E} (1/2)‖H ∘ (W + E − Ŵ)‖²
subject to rank(W) ≤ r, W ∈ R^{m×n},
‖E‖₀ ≤ N₀, ‖E‖ ≤ K_E, E ∈ R^{m×n}_Ω. (3)

In the above, the notation "∘" denotes the elementwise product between two matrices.

We propose PARSuMi, a proximal alternating minimization algorithm motivated by the algorithm in Attouch et al (2010), to solve (3). This involves solving two subproblems, each with an auxiliary proximal regularization term. It is important to emphasize that the subproblems in our case are non-convex and non-smooth, and hence it is essential to design appropriate algorithms to solve the subproblems to global optimality, at least empirically. We develop essential reformulations of the subproblems and design novel techniques to solve each subproblem efficiently, provably achieving the global optimum for one, and empirically so for the other. We also prove that our algorithm is guaranteed to converge to a limit point, which is necessarily a stationary point of (3). We emphasize here that the convergence is established even though one of the subproblems may not be solved to global optimality. Together with the initialization schemes we have designed based on the convex relaxation of (3), our method is able to solve challenging real matrix completion problems with corruptions robustly and accurately.
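To make the compact form (3) concrete, the following is a minimal sketch (ours, not from the paper; the function and variable names are illustrative) of how H in (2) is built and how the objective of (3) is evaluated for a candidate pair (W, E):

```python
import numpy as np

def parsumi_objective(W, E, W_hat, mask, eps):
    """Evaluate the objective of (3): 0.5 * || H o (W + E - W_hat) ||_F^2.

    mask  : boolean m-by-n array, True on the observed index set Omega
    W_hat : measurement matrix, zero-filled outside Omega
    eps   : small Tikhonov constant from (2), 0 < eps << 1
    """
    # H from (2): 1 on Omega, sqrt(eps) on the complement
    H = np.where(mask, 1.0, np.sqrt(eps))
    R = H * (W + E - W_hat)            # "o" is the elementwise product
    return 0.5 * np.sum(R ** 2)

# Feasibility of E in (3) would additionally require
#   np.count_nonzero(E) <= N0,  np.linalg.norm(E) <= K_E,
# and E supported on Omega (E[~mask] == 0).
```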
As we demonstrate in the experiments, PARSuMi is able to provide an excellent reconstruction of the unobserved feature trajectories in the classic Oxford Dinosaur sequence for SfM, despite the structured (as opposed to random) observation pattern and data corruptions. It is also able to solve photometric stereo to high precision despite severe violations of the Lambertian model (which underlies the rank-3 factorization) due to shadow, highlight and facial expression difference. Compared to state-of-the-art methods such as GRASTA (He et al 2011), Wiberg ℓ1 (Eriksson and Van Den Hengel 2010) and BALM (Del Bue et al 2012), our results are substantially better both qualitatively and quantitatively.

Note that in (3) we do not seek a convex relaxation of any form, but rather constrain the rank and the corrupted entries' cardinality directly in their original forms. While it is generally not possible to have an algorithm guaranteed to compute the global optimal solution, we demonstrate that, with appropriate initializations, the faithful representation of the original problem often offers a significant advantage over the convex relaxation approach in denoising and corruption recovery, and is thus more successful in solving real problems.

The rest of the paper is organized as follows. In Section 2, we provide a comprehensive review of the existing theories and algorithms for practical matrix completion, summarizing the strengths and weaknesses of nuclear norm minimization and matrix factorization. In Section 3, we conduct numerical evaluations of predominant matrix factorization methods, revealing the features that make some algorithms less likely to be trapped at local minima: specifically, parameterization on a subspace and second-order Newton-like iterations. Building upon these findings, we develop the PARSuMi scheme in Section 4 to simultaneously handle sparse corruptions, dense noise and missing data. The proof of convergence and a convex initialization scheme are also provided in that section. In Section 5, the proposed method is evaluated on both synthetic and real data and is shown to outperform the current state-of-the-art algorithms for robust matrix completion.

2 Existing theories and algorithms

2.1 Nuclear norm minimization

In its noiseless form, matrix completion is commonly relaxed to the convex problem

min_W { ‖W‖_* | P_Ω(W − Ŵ) = 0 }, (4)

in which rank(X) is replaced by the nuclear norm ‖X‖_* = Σ_i σ_i(X), the tightest convex relaxation of rank over the unit (spectral norm) ball. Candès and Recht (2009) showed that when the sampling is uniformly random and sufficiently dense, and the underlying low-rank subspace is incoherent with respect to the standard bases, then the remaining entries of the matrix can be exactly recovered. The guarantee was later improved in Candès and Tao (2010); Recht (2009), and extended to noisy data in Candès and Plan (2010); Negahban and Wainwright (2012), where the equality constraint is relaxed to

‖P_Ω(W − Ŵ)‖ ≤ δ.

Using similar assumptions and arguments, Candès et al (2011) and Chandrasekaran et al (2011) concurrently
proposed solutions to the related problem of robust principal component analysis (RPCA), where the low-rank matrix is to be recovered from sparse corruptions (with no missing data). This is formulated as

min_{W,E} { ‖W‖_* + λ‖E‖₁ | W + E = Ŵ }. (5)

A noisy extension of RPCA and an improvement of its guarantee were provided by Zhou et al (2010) and Ganesh et al (2010) respectively. Chen et al (2011) and Li (2013) combined (4) and (5) and provided guarantees for

min_{W,E} { ‖W‖_* + λ‖E‖₁ | P_Ω(W + E − Ŵ) = 0 }. (6)

In particular, the results in Chen et al (2011) lifted the uniform random support assumptions of previous works by laying out exact recovery conditions for a class of deterministic sampling (Ω) and corruption (Ω̃) patterns. (Candès et al (2011) actually considered missing data too, but their guarantee (Theorem 1.2) for (6) is only preliminary according to their own remarks; a stronger result was released later by the same group in Li (2013).)

We summarize the theoretical development in practical matrix completion achieved by each method in Table 1.

Table 1 Summary of the theoretical development for matrix completion and corruption recovery.

                   MC            RPCA          NoisyMC        StableRPCA       RMC         RMC
                   (Candès and   (Candès       (Candès and    (Zhou et         (Li 2013)   (Chen et
                   Recht 2009)   et al 2011)   Plan 2010)     al 2010)                     al 2011)
Missing data       Yes           Yes           Yes            No               Yes         Yes
Corruptions        No            Yes           No             Yes              Yes         Yes
Noise              No            No            Yes            Yes              No          No
Deterministic Ω    No            No            No             No               No          Yes
Deterministic Ω̃    No            No            No             No               No          Yes

It appears that researchers are moving towards analyzing all possible combinations of the problems; from past indications, it seems entirely plausible, albeit tedious, to show that the noisy extension

min_{W,E} { ‖W‖_* + λ‖E‖₁ | ‖P_Ω(W + E − Ŵ)‖ ≤ δ } (7)

will return a solution stable around the desired W and E under appropriate assumptions. Would that not solve the practical matrix completion problem altogether?

The answer is unfortunately no. While this line of research has provided a profound understanding of practical matrix completion itself, the actual performance of the convex surrogates on real problems (e.g., movie recommendation) is usually not competitive against non-convex approaches such as matrix factorization. Although the convex relaxation is amazingly equivalent to the original problem under certain conditions, those well versed in practical problems will know that those theoretical conditions are usually not satisfied by real data. Due to noise and model errors, real data are seldom truly low-rank (see the comments on the Jester joke dataset in Keshavan et al (2009)), nor are they as incoherent as randomly generated data. More importantly, observations are often structured (e.g., the diagonal band shape in SfM) and hence do not satisfy the random sampling assumption needed for the tight convex relaxation approach. As a consequence of all these factors, the W and E recovered by convex optimization are often neither low-rank nor sparse in practical matrix completion. This can be further explained by the so-called "Robin Hood" attribute of the ℓ1 norm (analogously, the nuclear norm is the ℓ1 norm in the spectral domain): it tends to steal from the rich and give to the poor, decreasing the inequity of the "wealth" distribution. Illustrations of the attribute will be given in Section 5.

Nevertheless, the convex relaxation approach has the advantage that one can design efficient algorithms to find, or approximately reach, the global optimal solution of the given convex formulation.
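The shrinkage mechanism behind the "Robin Hood" attribute can be made concrete with a toy illustration (ours, not from the paper): the proximal operator of the ℓ1 norm, which is the workhorse step inside virtually all first-order solvers of (5)-(7), subtracts the same amount τ from every nonzero magnitude, so large (corruption-sized) entries are systematically underestimated, and in a composite model such as (6) the unexplained remainder has to be absorbed by the other term and spread over many entries:

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1 (elementwise shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

b = np.array([10.0, -6.0, 0.3, 0.2])   # two large "corruptions", two small entries
print(soft_threshold(b, 0.5))          # [ 9.5 -5.5  0.   0. ]
# Each surviving entry is "taxed" by exactly tau, so the recovered sparse
# term underestimates the true corruption magnitudes; the deficit must be
# explained by the low-rank term instead.
```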
In this paper, we take advantage of the convex relaxation approach and use it to provide a powerful initialization for our algorithm to converge to the correct solution.

2.2 Matrix factorization and applications

Another widely-used method to estimate missing data in a low-rank matrix is matrix factorization (MF). It was at first considered as a special case of the weighted low-rank approximation problem with {0,1} weights by Gabriel and Zamir in 1979, and much later by Srebro and Jaakkola (2003). The buzz of the Netflix Prize further popularized the missing data problem as a standalone topic of research. Matrix factorization turns out to be a robust and efficient realization of the idea that people's preferences for movies are influenced by a small number of latent factors, and it has been used as a key component in almost all top-performing recommendation systems (Koren et al 2009), including BellKor's Pragmatic Chaos, the winner of the Netflix Prize (Koren 2009).

In computer vision, matrix factorization with missing data is recognized as an important problem too. Tomasi-Kanade affine factorization (Tomasi and Kanade 1992), Sturm-Triggs projective factorization (Sturm and Triggs 1996), and many techniques in non-rigid SfM and motion tracking (Paladini et al 2009) can all be formulated as matrix factorization problems. Missing data and corruptions emerge naturally due to occlusions and tracking errors. For a more exhaustive survey of computer vision problems that can be modelled by matrix factorization, we refer readers to Del Bue et al (2012).

Regardless of the application, the key idea is that when W = UV^T, one ensures that the required rank constraint is satisfied by restricting the factors U and V to be in R^{m×r} and R^{n×r} respectively. Since the (U,V) parameterization has a much smaller degree of freedom than the dimension of W, completing the missing data becomes a better-posed problem. This gives rise to the following optimization problem:

min_{U,V} ‖P_Ω(UV^T − Ŵ)‖² (8)

or its equivalent reformulation

min_U { ‖P_Ω(U V(U)^T − Ŵ)‖² | U^T U = I_r } (9)

where the factor V is now a function of U.

Unfortunately, (8) is not a convex optimization problem. The quality of the solutions one may obtain by minimizing this objective function depends on the specific algorithm and its initialization. Roughly speaking, the various algorithms for (8) may be grouped into three categories: alternating minimization, first-order gradient methods, and second-order Newton-like methods.

Simple approaches like alternating least squares (ALS), or equivalently PowerFactorization (Hartley and Schaffalitzky 2003), fall into the first category. They alternatingly fix one factor and minimize the objective over the other using the least squares method, as sketched below. A more sophisticated algorithm is BALM (Del Bue et al 2012), which uses the Augmented Lagrange Multiplier method to gradually impose additional problem-specific manifold constraints; the inner loop, however, is still alternating minimization.
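A minimal ALS sweep for (8) looks as follows (our illustrative sketch; the masked least squares problems decouple across the rows of U and V, and we assume every row and column has at least r observations):

```python
import numpy as np

def als_sweep(U, V, W_hat, mask):
    """One sweep of alternating least squares for (8), W ~ U V^T.

    U: m-by-r, V: n-by-r, mask: boolean m-by-n (True on Omega)."""
    m, n = W_hat.shape
    for i in range(m):                       # fix V, update row i of U
        obs = mask[i, :]
        Vo = V[obs, :]                       # rows of V seen by row i
        U[i, :] = np.linalg.lstsq(Vo, W_hat[i, obs], rcond=None)[0]
    for j in range(n):                       # fix U, update row j of V
        obs = mask[:, j]
        Uo = U[obs, :]
        V[j, :] = np.linalg.lstsq(Uo, W_hat[obs, j], rcond=None)[0]
    return U, V
```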
This category of methods has the reputation of reducing the objective value quickly in the first few iterations, but they usually take a large number of iterations to converge to a high quality solution (Buchanan and Fitzgibbon 2005).

First-order gradient methods are efficient, easy to implement, and able to scale up to million-by-million matrices if stochastic gradient descent is adopted; they are therefore very popular for large-scale recommendation systems. Typical approaches include Simon Funk's incremental SVD (Funk 2006), nonlinear conjugate gradient (Srebro and Jaakkola 2003) and, more sophisticatedly, gradient descent on the Grassmannian/Stiefel manifold, such as GROUSE (Balzano et al 2010) and OptManifold (Wen and Yin 2013). These methods, however, as we will demonstrate later, easily get stuck in local minima. (Our experiment on synthetic data shows that the strong Wolfe line search adopted by Srebro and Jaakkola (2003) and Wen and Yin (2013) somewhat ameliorates the issue, though it does not seem to help much on real data.)

The best performing class of methods are the second-order Newton-like algorithms, in that they demonstrate superior performance in both accuracy and speed of convergence (though each iteration requires more computation); hence they are suitable for small to medium scale problems requiring high accuracy solutions (e.g., SfM and photometric stereo in computer vision). Representatives of these algorithms include the damped Newton method (Buchanan and Fitzgibbon 2005), Wiberg(ℓ2) (Okatani and Deguchi 2007), the LM_S and LM_M methods of Chen (2008), and LM_GN, a variant of LM_M using Gauss-Newton (GN) to approximate the Hessian.

As these methods are of special importance in developing our PARSuMi algorithm, we conduct extensive numerical evaluations of them in Section 3 to understand their pros and cons, as well as the key factors that lead to some of them finding global optimal solutions more often than others.

It is worthwhile to note some delightful recent efforts to scale the first two classes of MF methods to internet scale, e.g., the parallel coordinate descent extension of ALS (Yu et al 2012) and the stochastic gradient methods in "Hogwild" (Recht et al 2011). It will be an interesting area of research to see if the ideas in these papers can be used to make the second-order methods more scalable.

In addition, there are a few other works in each category that take into account the corruption problem by changing the quadratic penalty term of (8) into the ℓ1-norm or the Huber function:

min_{U,V} ‖P_Ω(UV^T − Ŵ)‖₁, (10)

min_{U,V} Σ_{(i,j)∈Ω} Huber((UV^T − Ŵ)_ij). (11)

Notable algorithms to solve these formulations include the alternating linear programming (ALP) and alternating quadratic programming (AQP) methods in Ke and Kanade (2005), GRASTA (He et al 2011), which extends GROUSE, as well as Wiberg ℓ1 (Eriksson and Van Den Hengel 2010), which uses a second-order Wiberg-like iteration. While it is well known that the ℓ1-norm or Huber penalty term can better handle outliers, and the models (10) and (11) are seen to be effective in some problems, there is not much reason for a "convex" relaxation of the ℓ0 pseudo-norm (the cardinality of non-zero entries, which strictly speaking is not a norm), since the rank constraint imposed by matrix factorization is already highly non-convex. Empirically, we find that the ℓ1-norm penalty offers poor denoising ability against dense noise and also suffers from the "Robin Hood" attribute.
A comparison with this class of methods will be given later in Section 5, which shows that our method can better handle noise and corruptions.

The practical advantage of ℓ0 over ℓ1 penalties is well illustrated in Xiong et al (2011), where an ℓ0-based robust matrix factorization method is proposed that deals with corruptions and a given rank constraint. Our work is similar to Xiong et al (2011) in that we both eschew the convex surrogate ℓ1-norm in favor of using the ℓ0-norm directly. However, our approach treats both corruptions and missing data. More importantly, our treatment of the problem is different, and it results in a convergence guarantee that covers the algorithm of Xiong et al (2011) as a special case; this will be further explained in Section 4.

2.3 Emerging theory for matrix factorization

As we mentioned earlier, a fundamental drawback of matrix factorization methods for low-rank matrix completion is the lack of a proper theoretical foundation. However, thanks to the better understanding of low-rank structures available nowadays, some theoretical analysis of this problem is slowly emerging. This class of methods is essentially designed for solving the noisy matrix completion problem with an explicit rank constraint, i.e.,

min_W { ‖P_Ω(W − Ŵ)‖² | rank(W) ≤ r }. (12)

From a combinatorial-algebraic perspective, Király and Tomioka (2012) provided a necessary and sufficient condition for the existence of a unique rank-r solution to (12). It turns out that if the low-rank matrix is generic, then unique completability depends only on the support of the observations Ω. This suggests that the incoherence and random sampling assumptions typically required by various nuclear norm minimization methods may limit the portion of problems solvable by the latter to only a small subset of those solvable by matrix factorization methods.
Around the same time, Wang and Xu (2012) studied the stability of matrix factorization under arbitrary noise. They obtained a stability bound for the optimal solution of (12) around the ground truth, which turns out to be better than the corresponding bound for nuclear norm minimization in Candès and Plan (2010) by a factor of √min(m,n) (in the Big-O sense). The study, however, bypassed the practical problem of how to obtain the global optimal solution of this non-convex problem.

This gap is partially closed by the recent work of Jain et al (2012), in which the global minimum of (12) can be obtained up to an accuracy ε within O(log(1/ε)) iterations using a slight variation of the ALS scheme. The guarantee requires the observations to be noiseless and sampled uniformly at random, and the underlying subspace of W to be incoherent—basically all the assumptions of the convex approach—yet it still requires slightly more observations than nuclear norm minimization does. It does not, however, touch on whether the algorithm is able to find the global optimal solution when the data is noisy. Despite not achieving stronger theoretical results, nor operating under weaker assumptions, than the convex relaxation approach, this is the first guarantee of its kind for matrix factorization. Given its more effective empirical performance, we believe that there is great room for improvement on the theoretical front. A secondary contribution of this paper is to identify the potentially "right" algorithm, or rather the constituent elements of algorithms, for theoreticians to look deeper into.

3 Numerical evaluation of matrix factorization methods

To better understand the performance of different methods, we compare the following attributes quantitatively for all three categories of approaches that solve (8) or (9):
– Sample complexity: Number of samples required for exact recovery of uniformly random sampled observations in random low-rank matrices; an index typically used to quantify the performance of nuclear norm based matrix completion.
– Hits on global optimal [synthetic]: The proportion of random initializations that lead to the global optimal solution on random low-rank matrices with (a) increasing Gaussian noise, and (b) exponentially decaying singular values.
– Hits on global optimal [SfM]:
The proportion of random initializations that lead to the global optimal solution on the Oxford Dinosaur sequence (Buchanan and Fitzgibbon 2005) used in the SfM community.

(As a reference, we also included nuclear norm minimization solving (4) where applicable.)

Fig. 2 Exact recovery with increasing number of random observations. Algorithms (with random initialization) are evaluated on 100 randomly generated rank-4 matrices of dimension 100 × n. To account for small numerical errors, a result is considered "exact recovery" if the RMSE of the recovered entries is smaller than 10−. On the left, CVX (Grant and Boyd 2012) and TFOCS (Becker et al 2012) (in cyan) solve the nuclear norm based matrix completion (4); everything else aims to solve matrix factorization (8). On the right, the best solution of MF across all algorithms is compared to the CVX solver for nuclear norm minimization (solved with the highest numerical accuracy) and to a lower bound (below the bound, the number of samples is smaller than r for at least one row or column).

Fig. 3 Percentage of hits on global optimal with increasing level of noise. Five rank-4 matrices are generated by multiplying two standard Gaussian matrices of dimension 40 × 4 and 4 × 60. 30% of the entries are uniformly picked as observations with additive Gaussian noise N(0, σ). 24 different random initializations are tested for each matrix. The "global optimal" is assumed to be the solution with the lowest objective value across all testing algorithms and all initializations.

Fig. 4 Percentage of hits on global optimal for ill-conditioned low-rank matrices. Data are generated in the same way as in Fig. 3 with σ = 0.05, except that we further take the SVD and rescale the i-th singular value according to 1/α^i. The Frobenius norm is normalized to be the same as that of the original low-rank matrix. The exponent α is given on the horizontal axis.

Fig. 5 Cumulative histogram of the pixel RMSE for 100 randomly initialized runs conducted for each algorithm on the Dinosaur sequence. The curve summarizes how many runs of each algorithm correspond to the global optimal solution (with pixel RMSE 1.0847) on the horizontal axis. Note that the input pixel coordinates are normalized to [0, 1].

The sample complexity experiment in Fig. 2 shows that the best performing matrix factorization algorithm attains exact recovery with the number of observed entries at roughly 18%, while CVX for nuclear norm minimization needs roughly 36% (even worse for numerical solvers such as TFOCS). This seems to imply that the sample requirement of MF is fundamentally smaller than that of nuclear norm minimization. As MF assumes a known rank of the underlying matrix while nuclear norm methods do not, the results we observe are quite reasonable. In addition, among the different MF algorithms, some perform much better than others. The best few of them achieve something close to the lower bound. (The lower bound is given by the percentage of randomly generated data that have at least one column or row with fewer than r samples; clearly, having at least r samples in every column and row is a necessary condition for exact recovery.) This corroborates our intuition that MF is probably a better choice for problems with known rank.

From Fig. 3 and 4, we observe that the following classes of algorithms perform significantly better than the others in reaching the global optimal solution despite their non-convexity: the LM_X series (Chen 2008), Wiberg (Okatani and Deguchi 2007), the nonlinear conjugate gradient method (NLCG) (Srebro and Jaakkola 2003) and the curvilinear search on the Stiefel manifold (OptManifold (Wen and Yin 2013)). The percentage of global optimal hits from random initialization is promising even when the observations are highly noisy or when the condition number of the underlying matrix is very large. (When α = 3, the r-th singular value is almost as small as the spectral norm of the input noise.)

The common attribute of the four algorithms is that they are all based on the model (9), which parameterizes the factor V as a function of U and then optimizes over U alone. This parameterization essentially reduces the problem to finding the best subspace that fits the data. What is different between them is the way they update the solution in each iteration. OptManifold and NLCG adopt a strong Wolfe line search that allows the algorithm to take large step sizes, while the second-order methods approximate each local neighborhood with a convex quadratic function and jump directly to the minimum of the approximation. This difference appears to matter tremendously in the SfM experiment (see Fig. 5). We observe that only the second-order methods achieve the global optimal solution frequently, whereas the strong Wolfe line search adopted by both OptManifold and NLCG does not seem to help much on the real data like it did in the simulations with randomly generated data. Indeed, neither approach reaches the global optimal solution even once in the hundred runs, though they are rather close in quite a few runs. Despite these close runs, we remark that in applications like SfM it is important to actually reach the global optimal solution. Due to the large amount of missing data in the matrix, even slight errors in the sampled entries can cause the recovered missing entries to go totally haywire at a seemingly good local minimum (see Fig. 6). We thus refrain from giving any credit to local minima even if the RMSE_visible error, defined in (13), is very close to that of the global minimum:

RMSE_visible := ‖P_Ω(W_recovered − Ŵ)‖ / √|Ω|. (13)
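The quantity in (13) is straightforward to compute; a small helper (ours, in the spirit of the experiments; names are illustrative):

```python
import numpy as np

def rmse_visible(W_recovered, W_hat, mask):
    """RMSE over the observed (visible) entries only, as in (13)."""
    diff = (W_recovered - W_hat)[mask]
    return np.linalg.norm(diff) / np.sqrt(mask.sum())
```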
Fig. 6 Comparison of the feature trajectories corresponding to (a) a local minimum and (b) the global minimum of (8), given partial uncorrupted observations. Note that although the RMSE_visible values of (a) and (b) are very close, the filled-in values for the missing data in (a) are far off.

Another observation is that LM_GN seems to work substantially better than the other second-order methods with subspace or manifold parameterization, reaching the global minimum 93 times out of the 100 runs. Compared to LM_S and LM_M, the only difference is the use of the Gauss-Newton approximation of the Hessian. According to the analysis in Chen (2011), the Gauss-Newton Hessian provides the only non-negative convex quadratic approximation that preserves the so-called "zero-on-(n−r)" structure.

Table 2 Comparison of various second-order matrix factorization algorithms. (The Wiberg algorithm takes longer mainly because it sometimes does not converge and exhausts the maximum number of iterations.)

                                 DN           Wiberg         LM_S       LM_M        LM_GN
No. of hits at global min.       2            46             42         32          93
No. of hits on stopping cond.    75           47             99         93          98
Average run time (sec)           324          837            147        126         40
No. of variables                 (m+n)r       (m−r)r         mr         (m−r)r      (m−r)r
Hessian                          Yes          Gauss-Newton   Yes        Yes         Gauss-Newton
LM/Trust Region                  Yes          No             Yes        Yes         Yes
Largest linear system to solve   [(m+n)r]²    |Ω| × mr       mr × mr    [(m−r)r]²   [(m−r)r]²

To summarize the key findings of our experimental evaluation, we observe that: (a) the fixed-rank MF formulation requires fewer samples than nuclear norm minimization to achieve exact recovery; (b) compact parameterization on the subspace, strong line search, and second-order updates help MF algorithms avoid local minima in high-noise and poorly conditioned settings; (c) LM_GN with the Gauss-Newton update is able to reach the global minimum with a very high success rate on a challenging real SfM data sequence.

4 Proximal Alternating Robust Subspace Minimization (PARSuMi)

Our proposed PARSuMi method for problem (3) works in two stages. It first obtains a good initialization from an efficient convex relaxation of (3), which will be described in Section 4.6. This is followed by minimizing over the low-rank matrix W and the sparse matrix E alternatingly until convergence. The efficiency of our PARSuMi method depends on the fact that the two inner minimizations over W and E admit efficient solutions, which will be derived in Sections 4.1 and 4.3 respectively. Specifically, in step k, we compute W^{k+1} from either

min_W (1/2)‖H ∘ (W − Ŵ + E^k)‖² + (β₁/2)‖H ∘ (W − W^k)‖²
subject to rank(W) ≤ r, (14)

or its quadratic majorization (we will explain this further shortly), and E^{k+1} from

min_E (1/2)‖H ∘ (W^{k+1} − Ŵ + E)‖² + (β₂/2)‖E − E^k‖²
subject to ‖E‖₀ ≤ N₀, ‖E‖ ≤ K_E, E ∈ R^{m×n}_Ω, (15)

where H is defined as in (2). Note that the above iteration is different from applying a direct alternating minimization to (3). We have added the proximal regularization terms (β₁/2)‖H∘(W − W^k)‖² and (β₂/2)‖E − E^k‖² to make the objective functions in the subproblems coercive, hence ensuring that W^{k+1} and E^{k+1} are well defined. As shown in Attouch et al (2010), the proximal terms are critical to ensure the critical point convergence of the sequence. In addition, we have added a quadratic majorization safeguard step when computing W^{k+1}; this safeguards the convergence even if our computed W^{k+1} fails to be a global minimizer of (14). Further details of the algorithm and the proof of its convergence are provided in the subsequent sections.

4.1 Computation of W^{k+1} in (14)

Our solution for (14) consists of two steps. We first transform the rank-constrained minimization (14) into an equivalent subspace fitting problem, and then solve the new formulation using LM_GN.

Motivated by the findings in Section 3, where the most successful algorithms for solving (12) are based on the formulation (9), we now derive a similar equivalent reformulation of (14). Our reformulation of (14) is motivated by the N-parameterization of (12) due to Chen (2008), who considered the task of matrix completion as finding the best subspace to fit the partially observed data. In particular, Chen proposes to solve (12) using

min_N { Σ_i ŵ_i^T (I − P_i) ŵ_i | N^T N = I } (16)

where N is an m×r matrix whose column space is the underlying subspace to be reconstructed, and N_i is N with those rows corresponding to the missing entries in column i removed.
P_i = N_i N_i^+ is the projection onto span(N_i), with N_i^+ being the Moore-Penrose pseudo-inverse of N_i, and the objective function minimizes the sum of squared distances between ŵ_i and span(N_i), where ŵ_i is the vector of observed entries in the i-th column of Ŵ.

We now derive the counterpart of (16) for the regularized problem (14). First define the matrix H̄ ∈ R^{m×n} as follows:

H̄_ij = √(1+β₁) if (i,j) ∈ Ω, and H̄_ij = √(ε+εβ₁) if (i,j) ∉ Ω. (17)

Let B^k ∈ R^{m×n} be the matrix defined by

B^k_ij = (Ŵ_ij − E^k_ij + β₁ W^k_ij)/√(1+β₁) if (i,j) ∈ Ω,
B^k_ij = εβ₁ W^k_ij/√(ε+εβ₁) if (i,j) ∉ Ω. (18)

Define the diagonal matrices D_i ∈ R^{m×m} by

D_i = diag(H̄_i), i = 1,...,n, (19)

where H̄_i is the i-th column of H̄. It turns out that the N-parameterization of the regularized problem (14) has a similar form to (16), as shown below.

Proposition 1 (Equivalence of subspace parameterization) Let Q_i(N) = D_i N (N^T D_i² N)^{-1} N^T D_i, which is the m×m projection matrix onto the column space of D_i N. The problem (14) is equivalent to the following problem:

min_N f(N) := (1/2) Σ_{i=1}^n ‖B^k_i − Q_i(N) B^k_i‖²
subject to N^T N = I, N ∈ R^{m×r}, (20)

where B^k_i is the i-th column of B^k. If N* is an optimal solution of (20), then W^{k+1}, whose columns are defined by

W^{k+1}_i = D_i^{-1} Q_i(N*) B^k_i, (21)

is an optimal solution of (14).

Proof We can show by some algebraic manipulations that the objective function in (14) is equal to

(1/2)‖H̄ ∘ W − B^k‖² + constant.

Now note that we have

{ W ∈ R^{m×n} | rank(W) ≤ r } = { NC | N ∈ R^{m×r}, C ∈ R^{r×n}, N^T N = I }. (22)

Thus the problem (14) is equivalent to

min_N { f(N) | N^T N = I, N ∈ R^{m×r} } (23)

where f(N) := min_C (1/2)‖H̄ ∘ (NC) − B^k‖². To derive (20) from the above, we need to obtain f(N) explicitly as a function of N. For a given N, the unconstrained minimization problem over C in f(N) has a strictly convex objective function in C, and hence the unique global minimizer satisfies the following optimality condition:

N^T ((H̄ ∘ H̄) ∘ (NC)) = N^T (H̄ ∘ B^k). (24)

By considering the i-th column C_i of C, we get

N^T D_i² N C_i = N^T D_i B^k_i, i = 1,...,n. (25)

Since N has full column rank and D_i is positive definite, the coefficient matrix in the above equation is nonsingular, and hence C_i = (N^T D_i² N)^{-1} N^T D_i B^k_i. Now, with the optimal C_i above for the given N, we can show after some algebraic manipulations that f(N) is given as in (20). □

We can see that when β₁ ↓ 0, (20) reduces to (16) with ŵ_i appropriately modified to take into account E^k. Also, from the above proof, we see that the N-parameterization reduces the feasible region of W by restricting W to only those potential optimal solutions among the set of W satisfying the expression in (21). This seems to imply that it is not only equivalent but also advantageous to optimize over N instead of W. While we have no theoretical justification of this conjecture, it is consistent with our experiments in Section 3, which show the superior performance of the algorithms using subspace parameterization in finding global minima, and it vindicates the design motivations of the series of LM_X algorithms in Chen (2008).

Now that we have shown how to handle the regularization term and validated the equivalence of the transformation, the steps to solve (14) essentially generalize those of LM_GN (available in Section 3.2 and Appendix A of Chen (2011)) to account for the general mask H̄. The derivations of the key formulae and their meanings are given in this section.

In general, Levenberg-Marquardt solves a nonlinear problem with a sum-of-squares objective function

L(x) = (1/2) Σ_{i=1}^n ‖y_i − f_i(x)‖² (26)

by iteratively updating x as follows:

x ← x + (J^T J + λI)^{-1} J^T r,

where J = [J_1; ...; J_n] is the Jacobian matrix with J_i the Jacobian of f_i, r is the concatenated vector of the residuals r_i := y_i − f_i(x) for all i, and λ is the damping factor that interpolates between a Gauss-Newton update and gradient descent.
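Before stating the Jacobian formulae, here is a direct transcription of (20) and (21) (our illustrative sketch; Algorithm 1 below solves the same problem with Levenberg-Marquardt updates instead of this naive evaluation):

```python
import numpy as np

def f_and_W(N, B, H_bar):
    """Evaluate f(N) in (20) and recover W via (21).

    N     : m-by-r matrix with orthonormal columns
    B     : the matrix B^k from (18); column i is B[:, i]
    H_bar : the weight matrix from (17); D_i = diag(H_bar[:, i])
    """
    m, n = B.shape
    f_val, W = 0.0, np.zeros((m, n))
    for i in range(n):
        d = H_bar[:, i]
        DN = d[:, None] * N                        # D_i N
        # Q_i(N) B_i: projection of B_i onto the column space of D_i N
        coef = np.linalg.solve(DN.T @ DN, DN.T @ B[:, i])
        QB = DN @ coef
        f_val += 0.5 * np.sum((B[:, i] - QB) ** 2)
        W[:, i] = QB / d                           # (21): D_i^{-1} Q_i(N) B_i
    return f_val, W
```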
We may also interpret the iteration as a damped Newton method with a first-order approximation J^T J of the Hessian matrix. Note that the objective function of (20) can be expressed in the form of (26) by taking x := vec(N), data y_i := B^k_i, and functions

f_i(x) = Q_i(N) B^k_i = Q_i y_i.

Proposition 2
Let T ∈ R^{mr×mr} be the permutation matrix such that vec(X^T) = T vec(X) for any X ∈ R^{m×r}. The Jacobian of f_i(x) = Q_i(N) y_i is given as follows:

J_i(x) = (A_i^T y_i)^T ⊗ ((I − Q_i) D_i) + [(D_i r_i)^T ⊗ A_i] T. (27)

Also J^T J = Σ_{i=1}^n J_i^T J_i and J^T r = Σ_{i=1}^n J_i^T r_i, where

J_i^T J_i = (A_i^T y_i y_i^T A_i) ⊗ (D_i (I − Q_i) D_i) + T^T [(D_i r_i r_i^T D_i) ⊗ (A_i^T A_i)] T, (28)

J_i^T r_i = vec(D_i r_i (A_i^T y_i)^T). (29)

In the above, ⊗ denotes the Kronecker product.

Proof Let A_i = D_i N (N^T D_i² N)^{-1}. Given a sufficiently small δN, we can show that the directional derivative of f_i at N along δN is given by

f_i'(N; δN) = (I − Q_i) D_i δN A_i^T y_i + A_i δN^T D_i r_i.

By using the property that vec(AXB) = (B^T ⊗ A) vec(X), we have

vec(f_i'(N; δN)) = [(A_i^T y_i)^T ⊗ ((I − Q_i) D_i)] vec(δN) + [(D_i r_i)^T ⊗ A_i] vec(δN^T).

From here, the required result in (27) follows, since vec(δN^T) = T vec(δN).

To prove (28), we make use of the following properties of the Kronecker product: (A⊗B)(C⊗D) = (AC)⊗(BD) and (A⊗B)^T = A^T⊗B^T. By using these properties, we see that J_i^T J_i has four terms, two of which involve (I − Q_i)A_i or its transpose. But we can verify that Q_i A_i = A_i, and hence those two terms become 0. The remaining two terms are those appearing in (28), after using the fact that (I − Q_i)² = I − Q_i.

Next we prove (29). We have J_i^T r_i = vec(D_i (I − Q_i) r_i (A_i^T y_i)^T) + T^T vec(A_i^T r_i r_i^T D_i). By noting that A_i^T r_i = 0 and Q_i r_i = 0, we get the required result in (29). □

The complete procedure for solving (14) is summarized in Algorithm 1. In all our experiments, the initial λ is chosen to be a small constant, and ρ = 10.

Algorithm 1 Levenberg-Marquardt method for (14)
Input: Ŵ, E^k, W^k, H̄, objective function L(x) and initial N^k; numerical parameters λ > 0, ρ > 1.
Initialization: Compute y_i = B^k_i for i = 1,...,n, and x₀ = vec(N^k); set j = 0.
while not converged do
  1. Compute J^T r and J^T J using (29) and (28).
  2. Compute Δx = (J^T J + λI)^{-1} J^T r.
  3. while L(x_j + Δx) > L(x_j) do
       (1) λ = ρλ.
       (2) Δx = (J^T J + λI)^{-1} J^T r.
     end while
     λ = λ/ρ.
  4. Orthogonalize: N = orth[reshape(x_j + Δx)].
  5. Update x_{j+1} = vec(N).
  6. Iterate j = j + 1.
end while
Output: N^{k+1} = N, and W^{k+1} computed via (21) with N^{k+1} replacing N*.

4.2 Quadratic majorization of (14)

Recall that while the LM_GN method may be highly successful in computing a global minimizer of (14) empirically, W^{k+1} may occasionally fail to be a global minimizer. Thus, before deriving the update rule for E^{k+1}, we consider minimizing a quadratic majorization of (14) as a safeguard step to ensure the convergence of the PARSuMi iterations. Recall that (14) is equivalent to

W^{k+1} = argmin_W { (1/2)‖H̄ ∘ (W − B̂^k)‖² | rank(W) ≤ r }

where B̂^k_ij = B^k_ij / H̄_ij, with H̄ and B^k defined as in (17) and (18) respectively. For convenience, we denote the above objective function by F(W, B̂^k). Suppose that we have positive vectors p ∈ R^m and q ∈ R^n such that

H̄_ij² ≤ p_i q_j for all i = 1,...,m, j = 1,...,n. (30)

Note that the above inequality always holds if p, q are chosen to be

p_i = max{ H̄_ij | j = 1,...,n }, i = 1,...,m,
q_j = max{ H̄_ij | i = 1,...,m }, j = 1,...,n. (31)

Let G^k = ∇_W F(W^k, B̂^k). We majorize F(W, B̂^k) by bounding its Taylor expansion at W^k:

F(W, B̂^k) − F(W^k, B̂^k)
= ⟨G^k, W − W^k⟩ + (1/2)⟨W − W^k, (H̄ ∘ H̄) ∘ (W − W^k)⟩
≤ ⟨G^k, W − W^k⟩ + (1/2)⟨W − W^k, P (W − W^k) Q⟩
= (1/2)‖P^{1/2} W Q^{1/2} − U^k‖² − (1/2)‖P^{-1/2} G^k Q^{-1/2}‖², (32)

where P = diag(p), Q = diag(q), and U^k = P^{1/2} W^k Q^{1/2} − P^{-1/2} G^k Q^{-1/2}.

Proposition 3
The minimizer of the quadratic majorization function in (32) is given in closed form by

W^{k+1}_QM ∈ P^{-1/2} Π_r(U^k) Q^{-1/2}. (33)

Here Π_r(U^k) denotes the set of best rank-r approximations of U^k.

The proof is simple and is given in the Appendix. Note that this is a nonconvex minimization, yet we have an efficient closed-form solution thanks to the SVD.

As we shall see later in Algorithm 4, the global minimizer W^{k+1}_QM in (33) of the quadratic majorization function of F(W, B̂^k) is used as a safeguard when the computed solution W^{k+1} from (14) is inferior (which necessarily implies that W^{k+1} is not a global optimal solution) to W^{k+1}_QM. By doing so, the convergence of PARSuMi can be ensured, as we shall prove in Theorem 1.
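The safeguard step (33) is a single scaled truncated SVD; a minimal sketch (ours, under the choice (31); names are illustrative):

```python
import numpy as np

def qm_minimizer(W_k, B_hat_k, H_bar, r):
    """Global minimizer (33) of the quadratic majorization of F(., B_hat^k)."""
    G_k = (H_bar * H_bar) * (W_k - B_hat_k)   # gradient of F at W^k
    p = H_bar.max(axis=1)                     # (31): p_i = max_j H_bar_ij
    q = H_bar.max(axis=0)                     # (31): q_j = max_i H_bar_ij
    sp, sq = np.sqrt(p)[:, None], np.sqrt(q)[None, :]
    U_k = sp * W_k * sq - G_k / (sp * sq)     # P^{1/2} W^k Q^{1/2} - P^{-1/2} G^k Q^{-1/2}
    U, s, Vt = np.linalg.svd(U_k, full_matrices=False)
    s[r:] = 0.0                               # Pi_r: best rank-r approximation
    return ((U * s) @ Vt) / (sp * sq)         # P^{-1/2} Pi_r(U^k) Q^{-1/2}
```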
4.3 Sparse corruption recovery step (15)

In the sparse corruption step, we need to solve the ℓ0-constrained least squares minimization (15). This problem is combinatorial in nature but, fortunately, for our problem a closed-form solution can be obtained. Let x := P_Ω(E). Observe that (15) can be expressed in the following equivalent form:

min_x { ‖x − b‖² | ‖x‖₀ ≤ N₀, ‖x‖ − K_E ≤ 0 } (34)

where b = P_Ω(Ŵ − W^{k+1} + β₂ E^k)/(1 + β₂).

Algorithm 2 Closed-form solution of (15)
Input: Ŵ, W^{k+1}, E^k, Ω.
1. Compute b using (34).
2. Compute x using (35).
Output: E^{k+1} = P*_Ω(x).
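Algorithm 2 amounts to a top-N₀ selection followed by an optional radial rescaling; a direct implementation of the closed form (35) stated in Proposition 4 below (our illustrative sketch):

```python
import numpy as np

def solve_E_step(b, N0, K_E):
    """Closed-form solution of (34)/(35): keep the N0 largest-magnitude
    entries of b, then rescale if the kept vector violates the bound K_E."""
    x = np.zeros_like(b)
    idx = np.argsort(-np.abs(b))[:N0]     # support I: N0 largest magnitudes
    bI = b[idx]
    norm_bI = np.linalg.norm(bI)
    x[idx] = bI if norm_bI <= K_E else K_E * bI / norm_bI
    return x
```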
Algorithm 3 (Partially Majorized) Proximal Alternating Robust Subspace Minimization (PARSuMi)
Input: Observed data Ŵ, sample mask Ω, parameters r, N₀.
Initialization: W⁰ and E⁰ (typically by Algorithm 5, described in Section 4.6); k = 0.
repeat
  1a. Solve (14) by Algorithm 1 with W^k, E^k, N^k; obtain the updates W̃^{k+1} and Ñ^{k+1}.
  1b. Solve the quadratic majorization via (33) with W^k, E^k; obtain the update Ŵ^{k+1}.
  2. Assign W^{k+1} = argmin_{W ∈ {W̃^{k+1}, Ŵ^{k+1}}} F(W, B̂^k), and then assign the corresponding N^{k+1}.
  3. Solve (15) using Algorithm 2 with W^{k+1}, E^k; obtain the update E^{k+1}.
until ‖W^{k+1} − W^k‖ < tol·‖W^k‖ and ‖E^{k+1} − E^k‖ < tol·‖E^k‖, for a small relative tolerance tol.
Output: Accumulation points W and E.

Proposition 4 Let I be the set of indices of the N₀ largest (in magnitude) components of b. Then the nonzero components of the optimal solution x of (34) are given by

x_I = K_E b_I/‖b_I‖ if ‖b_I‖ > K_E, and x_I = b_I if ‖b_I‖ ≤ K_E. (35)

The proof (deferred to the Appendix) involves checking the optimality conditions of (34), assuming a known support set, and finding the optimal support set, in a decoupled fashion.

The procedure to obtain the optimal solution of (15) is summarized in Algorithm 2. We remark that this is a very special case of ℓ0-constrained optimization; the availability of the exact closed-form solution depends on both terms in (15) being decomposable into individual (i,j) terms. In general, if we change the operator M → H∘M in (15) to a general linear transformation (e.g., a sensing matrix in compressive sensing), or change the norm ‖·‖ of the proximal term to some other norm, such as the spectral norm or the nuclear norm, then the problem becomes NP-hard.

4.4 Algorithm

Our method is summarized in Algorithm 3. Note that we do not need to know the exact cardinality of the corrupted entries; N₀ can be taken as an upper bound on the allowable number of corruptions. As a rule of thumb, 10-15% of |Ω| is a reasonable size. The surplus in N₀ will only label a few noisy samples as corruptions, which should not affect the recovery of either W or E, so long as the remaining |Ω| − N₀ samples are still sufficient. The other parameter, r, is typically given by the physical model of the problem. For those problems where r is not known, choosing r is analogous to choosing the regularization parameter in other machine learning tasks: a large r will lead to overfitting and poorly estimated missing data, while an overly small r will cause underfitting of the observed data.

4.5 Convergence to a critical point

In this section, we show the convergence of Algorithm 3 to a critical point.

Note that, due to the non-convex nature of the subproblem (14), Algorithm 1 is guaranteed to converge only to a local minimum. Therefore, the result in Attouch et al (2010), which requires globally optimal solutions of all subproblems, cannot be directly applied in our case for the critical point convergence proof. Empirically, we cannot hope for LM_GN to always find the global optimal solution of (14) either, as our experiments on LM_GN in Section 3 clearly demonstrated. As a result, we design the partial majorization (Step 1b) in Algorithm 3 to safeguard against the case when the computed solution from Step 1a is not a global optimal solution. The safeguard step is powerful in that we do not need to assume anything about the computed solution of the subproblem before we can prove the overall critical point convergence.

We start our convergence proof by first defining an equivalent formulation of (3) in terms of closed, bounded sets. The convergence proof is then based on the indicator functions of these closed and bounded sets, which have the key lower semicontinuity property. Let K_W = 2‖Ŵ‖ + K_E. Define the closed and bounded sets:

W = { W ∈ R^{m×n} | rank(W) ≤ r, ‖H∘W‖ ≤ K_W },
E = { E ∈ R^{m×n}_Ω | ‖E‖₀ ≤ N₀, ‖E‖ ≤ K_E }.

We will first show that (3) is equivalent to the problem given in the next proposition.
Proposition 5
Let f(W,E) := (1/2)‖H ∘ (W + E − Ŵ)‖². The problem (3) is equivalent to the following problem:

min { f(W,E) | W ∈ W, E ∈ E }. (36)

Proof
Observe that the only difference between (3) and (36) is the inclusion of the bound constraint on ‖H∘W‖ in (36). To show the equivalence, we only need to show that any minimizer (W̄, Ē) of (3) must satisfy the bound constraint in W. By definition, we know that f(W̄, Ē) ≤ f(0, 0) = (1/2)‖Ŵ‖². Now for any (W, E) such that rank(W) ≤ r, E ∈ E and ‖H∘W‖ > K_W, we must have

‖H ∘ (W + E − Ŵ)‖ ≥ ‖H∘W‖ − ‖H ∘ (E − Ŵ)‖ > K_W − ‖E‖ − ‖Ŵ‖ ≥ ‖Ŵ‖.

Hence f(W,E) > (1/2)‖Ŵ‖² = f(0,0). This implies that we must have ‖H∘W̄‖ ≤ K_W. □

To establish the convergence of PARSuMi, it is more convenient for us to consider the following generic problem, which includes (36) as a special case. Let X, Y, Z be finite-dimensional inner product spaces, and let f : X → R ∪ {∞}, g : Y → R ∪ {∞} be lower semicontinuous functions. We consider the problem

min_{x,y} { L(x,y) := f(x) + g(y) + q(x,y) } (37)

where q(x,y) = (1/2)‖Ax + By − c‖² and A : X → Z, B : Y → Z are given linear maps. For (36), we have X = Z = R^{m×n}, Y = R^{m×n}_Ω, A(x) = H∘x, B(y) = H∘y, c = Ŵ, and f, g are the following indicator functions:

f(x) = 0 if x ∈ W, ∞ otherwise;  g(y) = 0 if y ∈ E, ∞ otherwise.

Note that in this case f and g are lower semicontinuous, since indicator functions of closed sets are lower semicontinuous (Rudin 1987).

To denote the corresponding majorization safeguard, we define, for fixed (x̂, ŷ) and given M ≻ A*A,

Q(x; x̂, ŷ) := q(x̂, ŷ) + ⟨∇_x q(x̂, ŷ), x − x̂⟩ + (1/2)‖x − x̂‖²_M, (38)

L̂(x; x̂, ŷ) := Q(x; x̂, ŷ) + f(x) + g(ŷ), (39)

where ‖·‖_M is defined as in the last part of Algorithm 4. Then we have that

q(x, ŷ) = Q(x; x̂, ŷ) − (1/2)‖x − x̂‖²_{M − A*A}, (40)

L(x, ŷ) = L̂(x; x̂, ŷ) − (1/2)‖x − x̂‖²_{M − A*A}. (41)

Consider the partially majorized proximal alternating minimization (PMPAM) method outlined in Algorithm 4, which we have modified from Attouch et al (2010). The algorithm alternates between minimizing x and y, but with the important addition of a quadratic Moreau-Yosida regularization term (also known as the proximal term) in each step. The importance of Moreau-Yosida regularization for convex matrix optimization problems has been demonstrated and studied in Liu et al (2009); Yang et al (2012); Wu et al (2011a). For our non-convex, non-smooth setting here, the importance of the proximal term will become clear when we prove the convergence of Algorithm 4. For our problem (36), the positive linear maps S and T in Algorithm 4 correspond to β₁(H∘H)∘ and β₂I respectively, where β₁, β₂ are given positive parameters. Our algorithm differs from that in Attouch et al (2010) by having the safeguard step (in Step 1b and Step 2) to ensure that critical point convergence can be achieved even if the computed solution in Step 1a is not globally optimal. Observe that one can bypass Step 1a in Algorithm 4 completely and always choose to use Step 1b. But the minimization in Step 1b, based on quadratic majorization, may not reduce the merit function L(x, y^k) + (1/2)‖x − x^k‖²_S as quickly as the minimization in Step 1a. Thus it is necessary to have Step 1a to ensure that Algorithm 4 converges at a reasonable speed.
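Schematically, the safeguard choice in Step 2 below is a one-line comparison of proximally regularized merit values (our sketch; L and S_norm_sq stand for the generic merit function and ‖·‖²_S above):

```python
def choose_x(L, x_tilde, x_hat, x_k, y_k, S_norm_sq):
    """Step 2 of Algorithm 4: keep the Step-1a solution x_tilde only if it
    wins on the proximally regularized merit value (condition (I))."""
    merit = lambda x: L(x, y_k) + 0.5 * S_norm_sq(x - x_k)
    return x_tilde if merit(x_tilde) <= merit(x_hat) else x_hat
```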
We note that a similar safeguard step can be introduced for the subproblem in Step 3 if the global optimality of y^{k+1} is not guaranteed.

Algorithm 4 Partially majorized proximal alternating minimization (PMPAM)
Input: ( x , y ) ∈ X × Y ; positive linear operators S and T .Choose M such that M (cid:31) A ∗ A + S in (38). repeat (cid:101) x k +1 ∈ argmin { L ( x, y k ) + (cid:107) x − x k (cid:107) S } . (cid:98) x k +1 ∈ argmin (cid:8) (cid:98) L ( x ; x k , y k ) (cid:9)
2. Consider condition (I): L(x̃^{k+1}, y^k) + ½‖x̃^{k+1} − x^k‖²_S ≤ L(x̂^{k+1}, y^k) + ½‖x̂^{k+1} − x^k‖²_S. Set x^{k+1} = x̃^{k+1} if condition (I) holds, and x^{k+1} = x̂^{k+1} otherwise.
3. y^{k+1} = argmin_y { L(x^{k+1}, y) + ½‖y − y^k‖²_T }.
until convergence
Output:
Accumulation points x̄ and ȳ.

In the above, S and T are given positive definite linear maps, and ‖x − x^k‖²_S = ⟨x − x^k, S(x − x^k)⟩, ‖y − y^k‖²_T = ⟨y − y^k, T(y − y^k)⟩. Note that Step 1b safeguards against the possibility that the computed x̃^{k+1} is not a global optimal solution of its subproblem; we assume that the global optimal solution x̂^{k+1} of the majorized subproblem can be computed analytically. For our problem (36), the global minimizer of the nonconvex subproblem in Step 1b can indeed be computed analytically, as discussed in Section 4.2. Next we show that any limit point of {(x^k, y^k)} is a stationary point of L even if the x̃^{k+1} computed in Step 1a is not a global minimizer of its subproblem.

Theorem 1
Let {(x^k, y^k)} be the sequence generated by Algorithm 4, and let (x̃^{k+1}, x̂^{k+1}) be the intermediate iterates at iteration k.

(a) For all k ≥ 0, we have

L(x̂^{k+1}, y^k) + ½‖x̂^{k+1} − x^k‖²_S + ½‖x̂^{k+1} − x^k‖²_{M − A*A − S} = L̂(x̂^{k+1}; x^k, y^k) ≤ L(x^k, y^k),   (42)
L(x^{k+1}, y^k) + ½‖x^{k+1} − x^k‖²_S ≤ L(x̂^{k+1}, y^k) + ½‖x̂^{k+1} − x^k‖²_S.   (43)

(b) For all k ≥ 0, we have

L(x^{k+1}, y^{k+1}) + ½‖x^{k+1} − x^k‖²_S + ½‖y^{k+1} − y^k‖²_T ≤ L(x^k, y^k).   (44)

Hence Σ_{k=0}^∞ (‖x^{k+1} − x^k‖²_S + ‖y^{k+1} − y^k‖²_T) < ∞, and lim_{k→∞} ‖x^{k+1} − x^k‖ = 0 = lim_{k→∞} ‖y^{k+1} − y^k‖.

(c) Let {(x^{k'}, y^{k'})} be any convergent subsequence of {(x^k, y^k)} with limit (x̄, ȳ). Then lim_{k→∞} L(x^k, y^k) = lim_{k→∞} L(x^{k+1}, y^k) = lim_{k→∞} L̂(x̂^{k+1}; x^k, y^k) = L(x̄, ȳ). Furthermore, lim_{k→∞} ‖x̂^{k+1} − x^k‖ = 0.

(d) Let {(x^{k'}, y^{k'})} be any convergent subsequence of {(x^k, y^k)} with limit (x̄, ȳ). Then (x̄, ȳ) is a stationary point of L.

The full proof is given in the Appendix. Here we briefly explain the four parts of Theorem 1: part (a) establishes the monotone decrease of the proximally regularized updates; leveraging part (a), part (b) ensures the existence of the limits; using parts (a), (b) and (c), part (d) then shows the critical point convergence of Algorithm 4.

4.6 Convex relaxation of (3) as initialization

Due to the non-convexity of the rank and ℓ₀ cardinality constraints, it is expected that the outcome of Algorithm 3 depends on its initialization. A natural choice for the initialization of PARSuMi is the convex relaxation of both the rank and the ℓ₀ function:

min { f(W, E) + λ‖W‖_* + γ‖E‖₁ | W ∈ ℝ^{m×n}, E ∈ ℝ^{m×n}_Ω }   (45)

where f(W, E) = ½‖H∘(W + E − Ŵ)‖², ‖·‖_* is the nuclear norm, and λ and γ are regularization parameters.

Problem (45) can be solved efficiently by the quadratic majorization-APG (accelerated proximal gradient) framework proposed by Toh and Yun (2010). At the k-th iteration with iterate (W̄^k, Ē^k), the majorization step replaces (45) with a quadratic majorization of f(W, E), so that W and E can be optimized independently, as we shall see shortly. Let G^k = (H∘H)∘(W̄^k + Ē^k − Ŵ). By some simple algebra, we have

f(W, E) − f(W̄^k, Ē^k) = ½‖H∘(W − W̄^k + E − Ē^k)‖² + ⟨W − W̄^k + E − Ē^k, G^k⟩
≤ ‖W − W̄^k‖² + ‖E − Ē^k‖² + ⟨W − W̄^k + E − Ē^k, G^k⟩
= ‖W − W̃^k‖² + ‖E − Ẽ^k‖² + constant,

where W̃^k = W̄^k − G^k/2 and Ẽ^k = Ē^k − G^k/2.
At each step of the APG method, one minimizes (45) with f(W, E) replaced by the above quadratic majorization. As the resulting problem is separable in W and E, we can minimize over them separately, yielding the following two subproblems:

W^{k+1} = argmin_W ½‖W − W̃^k‖² + λ‖W‖_*   (46)
E^{k+1} = argmin_E ½‖E − Ẽ^k‖² + γ‖E‖₁   (47)

The main reason for performing the above majorization is that the solutions to (46) and (47) are available in closed form: for (46), the minimizer is given by the singular value thresholding (SVT) operator, and for (47) by the well-known soft-thresholding operator (Donoho 1995). The APG algorithm, adapted from Beck and Teboulle (2009) and analogous to that in Toh and Yun (2010), is summarized below.

Algorithm 5
An APG algorithm for (45)
Input:
Initialize W⁰ = W̄⁰ = 0, E⁰ = Ē⁰ = 0, t₁ = 1, k = 0. repeat
1. Compute G^k = (H∘H)∘(W̄^k + Ē^k − Ŵ), and then W̃^k, Ẽ^k.
2. Update W^{k+1} by applying the SVT operator to W̃^k in (46).
3. Update E^{k+1} by applying the soft-thresholding operator to Ẽ^k in (47).
4. Update the step size: t_{k+1} = (1 + √(1 + 4t_k²))/2.
5. (W̄^{k+1}, Ē^{k+1}) = (W^{k+1}, E^{k+1}) + ((t_k − 1)/t_{k+1})(W^{k+1} − W^k, E^{k+1} − E^k)
until convergence
Output:
Accumulation points W and E.

As proved in Beck and Teboulle (2009), the APG algorithm, including the one above, has a very nice worst-case iteration complexity result: for any given ε > 0, the APG algorithm needs at most O(1/√ε) iterations to compute an ε-optimal (in terms of function value) solution.
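The two closed-form operators and one pass of steps 1-4 of Algorithm 5 fit in a few lines. The following numpy sketch is illustrative (names are ours): H is the weight mask, W_hat_obs the zero-filled observation matrix, and the thresholds are written directly as lam and gam, absorbing any constant factors; the extrapolation step 5 is omitted for brevity.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise soft-thresholding: closed-form minimizer of (47)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: closed-form minimizer of (46)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def apg_step(W_bar, E_bar, W_hat_obs, H, lam, gam, t):
    """One pass of steps 1-4 of Algorithm 5; all arrays are m-by-n."""
    G = (H * H) * (W_bar + E_bar - W_hat_obs)          # gradient of f at (W_bar, E_bar)
    W_next = svt(W_bar - G / 2.0, lam)                 # step 2, applied to W_tilde
    E_next = soft_threshold(E_bar - G / 2.0, gam)      # step 3, applied to E_tilde
    t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0  # step 4 (FISTA step size)
    return W_next, E_next, t_next
```

Since G is supported on Ω (H vanishes off Ω), E stays supported on Ω as long as it is initialized to zero, consistent with the constraint E ∈ ℝ^{m×n}_Ω.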
The tuning of the regularization parameters λ and γ in (45) is fairly straightforward. For λ, we use the singular values of the converged W as a reference: starting from a relatively large value, we reduce λ by a constant factor in each pass so as to obtain a W whose singular values beyond the r-th are much smaller than the first r singular values. For γ, we use the value 1/√max(m, n) suggested for RPCA (Candès et al 2011). In our experiments, we find that only a ballpark figure is needed, without much tuning; taking a fixed small λ and γ = 1/√max(m, n) serves the purpose well.

4.7 Other heuristics

In practice, we design two heuristics to further boost the quality of the convex initialization. These are tricks that allow PARSuMi to detect corrupted entries better, and they are always recommended.

We refer to the first heuristic as "Huber regression". The idea is that the quadratic loss term in our matrix completion step (14) is likely to spread the estimation error densely across all measurements, and there is no guarantee that the truly corrupted measurements will carry larger errors than the uncorrupted ones. On the other hand, we note that the quality of the subspace N^k obtained from LM_GN is usually good despite noisy/corrupted measurements; this is especially true when the first LM_GN step is initialized with Algorithm 5. Intuitively, we should therefore be better off with an intermediate step that uses N^{k+1} rather than W^{k+1} to detect the errors, that is, keeping N^{k+1} as a fixed input and finding the coefficients C and E simultaneously via

min_{E,C} ½‖H∘(N^{k+1}C − Ŵ + E)‖²  subject to ‖E‖₀ ≤ N₀.   (48)

To make it computationally tractable, we relax (48) to

min_{E,C} ½‖H∘(N^{k+1}C − Ŵ + E)‖² + η‖E‖₁   (49)

where η > 0 is a regularization parameter. This is equivalent to a Huber-loss regression for each column (when E is absorbed into the Huber penalty):

min_{C_j} Σ_{i=1}^m Huber_{η/H_ij}( H_ij((N^{k+1}C_j)_i − Ŵ_ij) ).   (50)

Since N^{k+1} is known, (49) can be solved very efficiently using the APG algorithm, whose derivation is similar to that of Algorithm 5, with soft-thresholding operations on C and E (an illustrative alternative solver is sketched below). To further reduce the Robin Hood effect (which haunts all ℓ₁-like penalties) and enhance sparsity, we may optionally apply iteratively reweighted Huber minimization (a slight variation of the method in Candès et al (2008)), that is, solving (50) for l_max iterations using an entrywise weighting factor inversely proportional to the previous iteration's fitting residual. In the end, the optimal columns C_j are concatenated into the optimal solution matrix C* of (49), and we set W^{k+1} = N^{k+1}C*. With this intermediate step between the W step and the E step, it is much easier for the E step to detect the support of the actual corrupted entries.

The above procedure can be used in conjunction with another heuristic that avoids adding false positives to the corruption set in the E step when the subspace N has not yet been accurately recovered. This is achieved by imposing a threshold η₁ on the minimum absolute value of the non-zero entries of E^k, and shrinking this threshold by a constant factor (say 0.8) in each iteration.
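The paper solves (49) with APG; purely as an illustration of the same convex objective, the sketch below instead alternates two exact block updates per column, a weighted least-squares step in the coefficient vector c and a soft-thresholding step in e, each of which decreases (49) monotonically. All function and variable names here are ours.

```python
import numpy as np

def huber_regression_column(N, w, h, eta, n_iter=100):
    """Per-column solve of (49): subspace basis N (m x r), observed column w (m,),
    weights h (m,; zero where missing), and l1 penalty eta on the corruption e."""
    m, r = N.shape
    c = np.zeros(r)
    e = np.zeros(m)
    obs = h > 0                                    # indices of observed entries
    for _ in range(n_iter):
        # c-step: exact weighted least squares on the corruption-corrected column
        A = h[obs, None] * N[obs]
        b = h[obs] * (w[obs] - e[obs])
        c = np.linalg.lstsq(A, b, rcond=None)[0]
        # e-step: exact entrywise soft-thresholding of the fitting residual;
        # the threshold eta / h_i^2 follows from the stationarity condition
        res = w[obs] - N[obs] @ c
        tau = eta / h[obs] ** 2
        e[obs] = np.sign(res) * np.maximum(np.abs(res) - tau, 0.0)
    return c, e
```

Eliminating e from the two updates recovers exactly the per-column Huber formulation (50), which is why this block-coordinate view and the Huber-regression view coincide.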
The "Huber regression" heuristic is used only when η > η₁, and hence only in a small number of iterations before the support of E has been reliably recovered. Afterwards, the pure PARSuMi iterations (without the Huber step) take over, correct the Robin Hood effect of the Huber loss, and converge to a high-quality solution.

Note that our critical point convergence guarantee in Section 4.5 is not hampered at all by the two heuristics, since after a small number of iterations we have η ≤ η₁ and we come back to the pure PARSuMi.

5 Experiments

In this section, we present the methodology and results of various experiments designed to evaluate the effectiveness of our proposed method. The experiments revolve around synthetic data and two real-life datasets: the Oxford Dinosaur sequence, which is representative of data matrices in SfM works, and the Extended YaleB face dataset (Lee et al 2005), which we use to demonstrate how PARSuMi works on photometric stereo problems.

In the synthetic data experiments, our method is compared with the state-of-the-art algorithms for the objective function in (10), namely Wiberg ℓ₁ (Eriksson and Van Den Hengel 2010) and GRASTA (He et al 2011). ALP and AQP (Ke and Kanade 2005) are left out since they were shown to be inferior to Wiberg ℓ₁ in Eriksson and Van Den Hengel (2010). For the sake of comparison, we perform the experiment on recovery effectiveness using the same small matrices as in Section 5.1 of Eriksson and Van Den Hengel (2010); other synthetic experiments are conducted with more reasonably-sized matrices. Whenever appropriate, we also include a comparison with a variant of RPCA that handles missing data (Wu et al 2011b), which solves (6) using the augmented Lagrange multiplier (ALM) algorithm (we will call it ALM-RPCA from here onwards); this serves as a representative of the nuclear norm based methods.

The real data from the SfM and photometric stereo problems contain many challenges typical of practical scenarios: large contiguous areas of missing data, and potentially highly corrupted observations which may not be sparse either. For instance, in the YaleB face dataset, grazing illumination tends to produce large areas of missing data (well over 50%) and often a large number of outliers too (due to specular highlights). The PARSuMi method outperformed a variety of other methods in these experiments, even uncovering hitherto unknown corruptions inherent in the Dinosaur data from SfM. The results also corroborate those obtained in the synthetic data experiments, in that our method can handle a substantially larger fraction of missing data and corruptions, thus providing empirical evidence for the efficacy of PARSuMi under practical scenarios.

For a summary of the parameters used in the experiments, please refer to the Appendix.

5.1 Convex Relaxation as an Initialization Scheme

We first investigate the results of our convex initialization scheme by testing on a randomly generated 100 ×
100 rank-4 matrix. A random selection of 70% of the entries are treated as missing and 10% as corrupted. Corruptions are generated by adding large uniform noise to the corrupted entries, and Gaussian noise N(0, σ²) with σ = 0.01 is added to all observed entries. From Fig. 7, we see that the convex relaxation outlined in Section 4.6 was able to recover the error support, but there is a considerable difference in magnitude between the recovered error and the ground truth, owing to the "Robin Hood" attribute of the ℓ₁-norm as a convex proxy of ℓ₀; the nuclear norm as a proxy of rank suffers from the same woe. Similar observations can be made on the results of the Dinosaur experiments, which we will show later.

Despite these shortcomings of the convex initialization, we find that it is a crucial step for PARSuMi to work well in practice. As can be seen from Fig. 7, the detected error support can be quite accurate, which makes the E-step of PARSuMi more likely to identify the true locations of the corrupted entries.

Fig. 7
The Robin Hood effect of Algorithm 5 on the detected sparse corruptions E_Init. Left: illustration of a random selection of detected E vs. true E. Note that the support is mostly detected, but the magnitude falls short.
Right: scatter plot of the detected E against the true E (perfect recovery falls on the y = x line, false positives on the y-axis and false negatives on the x-axis). (a) False negatives (b) False positives Fig. 8
Recovery of corruptions from poor initialization.

5.2 Recovery from poor initialization

We next investigate how PARSuMi copes when the initial E bears little resemblance to the true injected corruptions. Specifically, we test the cases where the initialization fails to detect many of the corrupted entries (false negatives) and where many entries are wrongly detected as corruptions (false positives). From Fig. 8, we see that PARSuMi is able to recover the corrupted entries to a level comparable to the magnitude of the injected Gaussian noise in both experiments. Note that a number of false positives persist in the second experiment; this is understandable, because false positives often contaminate an entire column or row, making it impossible to recover that column/row in later iterations even if the subspace is correctly detected. To avoid such an undesirable situation, we prefer false negatives over false positives when tuning Algorithm 5: in practice, it suffices to keep the initial E relatively sparse.

Fig. 9
A histogram representing the frequency of different magnitudes of RMSE in the estimates generated by each method.
In most of our experiments, we find that PARSuMi is often able to detect the corruptions perfectly from a simple all-zero initialization, even without the "Huber regression" heuristic. This is especially true when the data are randomly generated with a benign sampling pattern and well-conditioned singular values. However, in challenging applications such as SfM, a good convex initialization and the "Huber regression" heuristic are always recommended.

5.3 Recovery effectiveness from sparse corruptions

For easy benchmarking, we use the same synthetic data as in Section 5.1 of Eriksson and Van Den Hengel (2010) to investigate the quantitative effectiveness of our proposed method. A total of 100 random low-rank matrices with missing data and corruptions are generated and tested using PARSuMi, Wiberg ℓ₁ and GRASTA. In accordance with Eriksson and Van Den Hengel (2010), the ground truth low-rank matrix W ∈ ℝ^{m×n}, m = 7, n = 12, r = 3, is generated as W = UVᵀ, where U ∈ ℝ^{m×r} and V ∈ ℝ^{n×r} are generated from the uniform distribution on [−1, 1]. 20% of the data are designated as missing and 10% are added with corruptions, both at random locations, the magnitude of the corruptions following a uniform distribution. The recovery accuracy is measured by the root-mean-square error

RMSE = ‖W_recovered − W‖_F / √(mn).   (51)

Out of the 100 independent experiments, the number of runs that returned RMSE values of less than 5 is 100 for PARSuMi, 78 and 58 for Wiberg ℓ₁ (with two different initializations), and similarly 94 and 93 for GRASTA. These results are summarized in Fig. 9.

5.4 Recovery under varying missing data, corruptions and noise

We next generate rank-4 matrices with 60 columns (the elements of the factors U, V drawn independently from the uniform distribution on [−1, 1]), with varying proportions of missing data and corruptions and with Gaussian noise of standard deviation σ in the range [0, 0.1]. By fixing the Gaussian noise at a specific level, the results are rendered as phase diagrams showing the recovery precision as a function of the proportions of missing data and outliers. The precision is quantified as the difference between the recovered RMSE and the oracle bound RMSE_oracle (see the Appendix for details). As can be seen from Fig. 10, our algorithm attains near-optimal performance over an impressively large range of missing data and outliers at the fixed noise level shown; the phase diagrams for other noise levels look very much like Fig. 10, so we do not include them here.

For comparison, we also display the results for closely related methods, e.g., ALM-RPCA (Wu et al 2011b), GRASTA (He et al 2011), DRMF (Xiong et al 2011) and LM_GN (Chen 2011), as well as Algorithm 5 (our initialization). Wiberg ℓ₁ is omitted because it is too slow. Among all the methods compared, PARSuMi successfully reconstructs the largest range of matrices with almost optimal numerical accuracy. The results for DRMF and LM_GN are as expected, since these methods are not designed to handle both missing data and outliers.

5.5 SfM with missing and corrupted data on Dinosaur

In this section, we apply PARSuMi to the problem of SfM using the Dinosaur sequence and investigate how well the corrupted entries can be detected and recovered in real data. We have normalized the image pixel dimensions (width and height) to the range [0, 1]; all plots, unless otherwise noted, are shown in the normalized coordinates. To simulate data corruptions arising from wrong feature matches, we randomly add sparse errors in the range [−2, 2] to 1% of the sampled entries. This is a more realistic (and much larger) definition of outliers for SfM compared with the [−50, 50] pixel range used to evaluate Wiberg ℓ₁ in Eriksson and Van Den Hengel (2010): in SfM, data corruptions are typically matching failures, and depending on where the true matches are, the error induced by a matching failure can be arbitrarily large, whereas [−50, 50] in pixels is only about [−0.1, 0.1] in our normalized data, which can hardly be regarded as "gross" corruption. We conducted the experiment 10 times each for PARSuMi, Wiberg ℓ₁ (with SVD initialization) and GRASTA (random initialization, as recommended in the original paper), and counted the number of times each succeeds.

Fig. 10

Phase diagrams (darker is better) of the RMSE with varying proportions of missing data and corruptions, at a fixed level of Gaussian noise.

                                         PARSuMi        Wiberg ℓ₁     GRASTA
No. of successes                         9/10           0/10          0/10
Run time (mins): min/avg/max             2.2/2.9/5.2    76/105/143    0.2/0.5/0.6
Min RMSE (original pixel units)          1.454          2.715         22.9
Min RMSE excluding corrupted entries     0.3694         1.6347        21.73

Table 3

Summary of the Dinosaur experiments. Note that because there is no ground truth for the missing data, the RMSE is computed only over the observed entries, as in Buchanan and Fitzgibbon (2005).

As there is no ground truth to compare against, we cannot use the RMSE to evaluate the quality of the filled-in entries. Instead, we plot the feature trajectories of the recovered data matrix for a qualitative judgement. As noted in Buchanan and Fitzgibbon (2005), a correct recovery should consist of all-elliptical trajectories; therefore, if the recovered trajectories look like those in Fig. 6(b), we count the recovery as a success.

(a)
Input (b)
ALM-RPCA (c)
Wiberg ℓ₁ (d) GRASTA (e)
Our initialization (f)
PARSuMi
Fig. 11
Comparison of the recovered feature trajectories for the different methods. It is clear that under dense noise and gross outliers, neither convex relaxation nor the ℓ₁ error measure yields satisfactory results. Solving the original non-convex problem with (e) as the initialization produces a good solution.

The results are summarized in Table 3. Notably, PARSuMi managed to correctly detect the corrupted entries and fill in the missing data in 9 runs, while Wiberg ℓ₁ and GRASTA failed on all 10 attempts. Typical feature trajectories recovered by each method are

(a) Initialization via Algorithm 5 and the final recovered errors by PARSuMi (Algorithm 3) (b)
Difference between the recovered and the ground truth error (in original pixel units)
Fig. 12
Sparse corruption recovery in the Dinosaur experiments: the support of all injected outliers is detected by Algorithm 5 (see (a)), but the magnitudes fall short by roughly 20% (see (b)). Algorithm 3 is able to recover all injected sparse errors, together with the inherent tracking errors in the dataset (see the red spikes in (b)).

shown in Fig. 11. Note that only PARSuMi is able to recover the elliptical trajectories satisfactorily. For comparison, we also include the input (partially observed trajectories) and the results of our convex initialization in Fig. 11(a) and 11(e), respectively.

An interesting and somewhat surprising finding is that the result of PARSuMi is even better than the global optimal solution for data containing supposedly no corruptions, which can be obtained with the ℓ₂ method (see Fig. 6(b), obtained with no corruptions in the observed data)! In particular, the trajectories are now closed. The reason becomes clear when we look at Fig. 12(b), which shows two large spikes in the vectorized difference between the artificially injected corruptions and the corruptions recovered by PARSuMi. This suggests that there are hitherto unknown corruptions inherent in the Dinosaur data. We traced the two largest spikes back to the raw images and found that they are indeed data corruptions, corresponding to mismatched feature points in the original dataset; our method managed to recover the correct feature matches (left column of Fig. 13).

Fig. 13
Original tracking errors in the Dinosaur data identified (yellow box) and corrected by PARSuMi (green box with red star), in frame 13, feature 86 (a) and frame 15, feature 144 (b).
The result shows that PARSuMi recovered not only the artificially added errors, but also the intrinsic errors in the dataset. In Buchanan and Fitzgibbon (2005), it was observed that there is a mysterious increase in the objective function value upon closing the trajectories by imposing an orthogonality constraint on the factorized camera matrix; our discovery of these intrinsic tracking errors evidently explains this phenomenon. It is also the reason why the ℓ₂-based algorithms (see Fig. 6(b)) find a global minimum solution of poorer quality (trajectories fail to close the loop). To complete the story, we generated the 3D point cloud of the Dinosaur from the completed data matrix; the results, viewed from different directions, are shown in Fig. 14.

Fig. 14
3D point cloud of the reconstructed Dinosaur.
Fig. 15
Illustration of the synthetic data and their surface normals. Note that there are specular regions and shadows.

Negative inner products will be observed as zero; this is the so-called attached shadow. Images of non-convex objects often also contain cast shadows, due to the blocking of the light path. If these issues are teased out, then the seemingly naive Lambertian model is able to approximate many surfaces very well.

Wu et al (2011b) subscribed to this low-rank factorization model and proposed to treat all dark regions as missing data and all highlights as sparse corruptions, and then to use a variant of RPCA (identical to (6)) to recover the full low-rank matrix. The solution, however, was only tested on noise-free synthetic data and toy-scale real examples. Del Bue et al (2012) also applied their BALM to photometric stereo, attempting both synthetic and real data; their contribution is to impose the norm constraint on each normal vector during the optimization.

We compare PARSuMi with the aforementioned two methods on several photometric stereo datasets. Quantitatively, we use the Caesar and Elephant data from Wu et al (2011b) and compare the reconstructed surface normals against the ground truth. The data are illustrated in Fig. 15 and the comparison is detailed in Table 4. As we can see, PARSuMi has the smallest reconstruction error among the three methods in all experiments.

We also conducted a qualitative comparison of the methods on real-life data, using Subject 3 of the Extended YaleB dataset since it was initially used to evaluate BALM in Del Bue et al (2012) (the authors claimed that it is Subject 10 (Del Bue et al 2012, Figure 9), but a careful examination of all faces shows that it is in fact Subject 3). As we do not have any ground truth, we can only compare the reconstructions qualitatively. From Fig. 16, we can clearly see that PARSuMi is able to recover the missing pixels in the image much better than the other two methods. In particular, Fig. 16(a) and 16(b) show that PARSuMi's reconstruction (in the illuminated half of the face) has the fewest artifacts; this can be seen from the unnatural grooves that the red arrows point to in Fig. 16(b). Moreover, we know from the original image that the light comes from the right-hand side of the subject; thus all the pixels on the left side of

Dataset                PARSuMi       ALM-RPCA (Wu et al 2011b)   BALM (Del Bue et al 2012)   Oracle (a lower bound)
Elephant               (16.7 min)    7.87e-2 (1.1 min)           3.55 (1.1 min)              model error
Caesar                 (28.6 min)    2.71e-1 (7.2 min)           3.11 (5.2 min)              model error
Elephant + N(0, σ²)    (28.3 min)    2.62 (1.5 min)              4.37 (1.1 min)              1.70 + model error
Caesar + N(0, σ²)      (99.2 min)    2.53 (8.3 min)              4.06 (6.6 min)              1.73 + model error

Table 4
Angular error (in degrees) and runtime (in minutes) for the synthetic photometric stereo experiments. The lowest estimation error is labeled in boldface. The oracle column gives the information-theoretic limit; it depends on an unknown model error, since we are using a Lambertian model to fit data rendered with the Cook-Torrance model.

his face (e.g., the red ellipse area in Fig. 16(b)) should have negative filled-in values and therefore should be dark in the image. Neither BALM's nor ALM-RPCA's reconstructed images comply with this physical law.

To see this more clearly, we invert the pixel values of the reconstructed image in Fig. 16(c); this is equivalent to inverting the direction of the lighting. From the tag of the image, we know that the original lighting comes from the subject's right posterior and 40° from the top, so the inverted light should illuminate the left half of his face from 20° left frontal and 40° from below. As shown in the comparison, only PARSuMi's result reveals what should correctly be seen with a light shining from this direction.

In addition, we reconstruct the 3D depth map with the classic method of Horn (1990). In Fig. 16(d), the shape from PARSuMi reveals much richer depth information than those from the other two algorithms, whose reconstructions appear flattened. This is a known low-frequency bias problem in photometric stereo, and it is often caused by errors in the surface normal estimation (Nehab et al 2005). The fact that BALM and ALM-RPCA produce flatter reconstructions is a strong indication that their estimates of the surface normals are noisier than PARSuMi's.

From our experiments, we find that PARSuMi is able to successfully reconstruct the 3D faces of all 38 subjects with few artifacts. As illustrated in Fig. 17, our 3D reconstructions seem to capture the characteristic features of subjects across different ethnic groups. Moreover, owing to the robust ℓ₀ penalty, PARSuMi is able to effectively recover the input images from many different types of irregularities, e.g., specular regions, different facial expressions, or even image corruptions caused by hardware malfunction (see Fig. 18 and 20). This makes it possible for PARSuMi to be integrated reliably into engineering systems that function with minimal human interaction (to the best of our knowledge, all previous works that use this dataset for photometric 3D reconstruction manually removed a number of images of poor quality, e.g., Del Bue et al (2012)).

(a)
Comparison of the recovered image (b)
Comparison of the recovered image (details) (c)
Taking the negative of (b) to see the filled-in missing pixels. This is as if the lighting direction is inverted. (d)
Comparison of the reconstructed 3D surfaces (albedo rendered).
Fig. 16
Qualitative comparison of the algorithms on Subject 3. From left to right, the results are respectively for PARSuMi, BALM and ALM-RPCA. In (a), they are preceded by the original image and an image depicting the missing data in green.

Solving (JᵀJ + λI)δ = Jᵀr requires either the Cholesky factorization of a potentially dense mr × mr matrix, or the computation of J, which requires solving a small linear system of normal equations involving the m × r matrix N for each of the n columns. As the overall complexity scales merely linearly with the num-

Subject 02 (b)
Subject 5 (c)
Subject 10 (d)
Subject 15 (e)
Subject 12 (f)
Subject 22
Fig. 17
The reconstructed surface normals and 3D shapes for Asian (first row), Caucasian (second row) and African (third row) subjects, male (first column) and female (second column), in the Extended YaleB face database. (Zoom in to look at details.)

ber of columns n but cubically with m and r, PARSuMi is computationally attractive when solving problems with small m and r and potentially large n, e.g., photometric stereo and SfM (since the number of images is usually much smaller than the number of pixels and feature points). However, for a typical million-by-million data matrix as in social networks and collaborative filtering, PARSuMi would take an unrealistic amount of time to run.

Experimentally, we compare the runtime of our algorithm and the Wiberg ℓ₁ method on the Dinosaur experiment of Section 5.5. Our Matlab implementation is run on a 64-bit Windows machine with a 1.6 GHz Core i7 processor and 4 GB of memory. We see from Table 3 that there is a big gap in speed: the nearly 2-hour runtime of Wiberg ℓ₁ is discouragingly slow, whereas ours is vastly more efficient. On the other hand, as an online algorithm, GRASTA is inherently fast; examples in He et al (2011) show that it works in real time for live video surveillance. However, our experiment suggests that it is probably not appropriate for applications such as SfM, which require higher numerical accuracy.

The runtime comparison for the photometric stereo problems is shown in Table 4. We remark that PARSuMi is roughly ten times slower than the other methods. The pattern is consistent for the YaleB face data too, where PARSuMi takes 23.4 minutes to converge while

(a) Cast shadow and attached shadow are recovered. The region of cast shadow is now visible, and the attached shadow is also filled with meaningful negative values. (b)
Facial expressions are set to normal. (c)
Rare corruptions in image acquisition are recovered. (d)
Light comes 20 degrees from behind and 65 degrees from above.
Fig. 18
Illustrations of how PARSuMi recovers missing data and corruptions. From left to right: original image, input image with missing data labeled in green, reconstructed image, and detected sparse corruptions.
BALM and ALM-RPCA take only 4.8 and 1.7 minutes, respectively.

We note that PARSuMi is currently not optimized for computation. Speeding up the algorithm for application to large-scale datasets would require further effort (such as parallelization) and could be a new topic of research. For instance, the computation of the Jacobians J_i and the evaluation of the objective function can easily be done in parallel, and the Gauss-Newton update (a positive definite linear system of equations) can be solved using the conjugate gradient method, so that we do not even need to store the matrix in memory. Furthermore, since PARSuMi seeks to find the best subspace, using only a small portion of the data columns may be sufficient: if the subspace is correct, the rest of the columns can be recovered in linear time with our iteratively reweighted Huber regression technique (see Section 4.7).
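As an illustration of the conjugate gradient suggestion, the damped Gauss-Newton system (JᵀJ + λI)δ = Jᵀr can be solved matrix-free as sketched below. This is our own minimal rendering of a standard CG solve under the stated assumption that apply_J and apply_Jt are user-supplied callbacks evaluating Jv and Jᵀu; it is not the factorization-based implementation used in the paper.

```python
import numpy as np

def gauss_newton_step_cg(apply_J, apply_Jt, r, lam, dim, n_cg=50, tol=1e-10):
    """Conjugate gradient solve of (J^T J + lam*I) delta = J^T r without forming J."""
    b = apply_Jt(r)
    delta = np.zeros(dim)
    res = b.copy()                    # residual of the normal equations (delta = 0)
    p = res.copy()
    rs_old = res @ res
    for _ in range(n_cg):
        Ap = apply_Jt(apply_J(p)) + lam * p   # one matrix-free application
        alpha = rs_old / (p @ Ap)
        delta += alpha * p
        res -= alpha * Ap
        rs_new = res @ res
        if np.sqrt(rs_new) < tol:
            break
        p = res + (rs_new / rs_old) * p
        rs_old = rs_new
    return delta
```

Each CG iteration costs only two Jacobian applications, which is what makes the memory footprint independent of the mr × mr system size.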
A good direction for future research is perhaps how to choose the best subset of data to feed into PARSuMi.

In this paper, we have presented a practical algorithm (PARSuMi) for low-rank matrix completion in the presence of dense noise and sparse corruptions. Despite the non-convex and non-smooth optimization formulation, we are able to derive a set of update rules under the proximal alternating scheme such that convergence to a critical point can be guaranteed. The method was tested on both synthetic and real-life data with challenging sampling and corruption patterns. The various experiments we have conducted show that our method is able to detect and remove gross corruptions, suppress noise, and hence provide a faithful reconstruction of the missing entries. By virtue of the explicit constraints on both the matrix rank and the cardinality of the corruption set, and the novel reformulation, design and implementation of appropriate algorithms for the non-convex and non-smooth model, our method works significantly better in real-life problems such as SfM and photometric stereo than the state-of-the-art algorithms based on nuclear norm minimization, ℓ₂ matrix factorization and ℓ₁ robust matrix factorization.

Moreover, we have provided a comprehensive review of the existing results pertaining to the "practical matrix completion" problem considered in this paper. The review covered the theory of matrix completion and corruption recovery, and the theory and algorithms for matrix factorization. In particular, we conducted extensive numerical experiments which reveal (a) the advantages of matrix factorization over nuclear norm minimization when the underlying rank is known, and (b) the two key factors that affect the chance of ℓ₂-based factorization methods reaching a global optimal solution, namely "subspace parameterization" and the "Gauss-Newton" update. These findings provide critical insights into this difficult problem, and form the basis upon which we developed PARSuMi as well as its convex initialization.

The strong empirical performance of our algorithm calls for further analysis. For instance, obtaining theoretical conditions for the convex initialization to yield a good support of the corruptions should be plausible (following the line of research discussed in Section 2.1), and this in turn guarantees a good starting point for the algorithm proper. Characterizing how well the subsequent non-convex algorithm works given such an initialization, and how many samples are required to guarantee high-confidence recovery of the matrix, remain open questions for future study.

Other interesting topics include finding a cheaper but equally effective alternative to the LM_GN solver for solving (20), parallel/distributed computation, incorporating additional structural constraints, and selecting an optimal subset of data for subspace learning. Step by step, we hope this will eventually lead to a practically working robust matrix completion algorithm that can be confidently embedded in real-life applications.

A Appendix
A.1 Proofs
Proof (Proof of Proposition 4)
Given a subset I of {1, …, |Ω|} with cardinality at most N₀ such that b_I ≠ 0, let J = {1, …, |Ω|} \ I. Considering problem (34) for x ∈ ℝ^{|Ω|} supported on I, we get

v_I := min_{x_I} { ‖x_I − b_I‖² + ‖b_J‖² | ‖x_I‖² − K_E² ≤ 0 },

which is a convex minimization problem whose optimality conditions are given by

x_I − b_I + μ x_I = 0,   μ(‖x_I‖² − K_E²) = 0,   μ ≥ 0,

where μ is the Lagrange multiplier for the inequality constraint. First consider the case where μ > 0. Then we get x_I = K_E b_I/‖b_I‖ and 1 + μ = ‖b_I‖/K_E (hence ‖b_I‖ > K_E). This implies that v_I = ‖b‖² + K_E² − 2‖b_I‖K_E. On the other hand, if μ = 0, then we have x_I = b_I and v_I = ‖b_J‖² = ‖b‖² − ‖b_I‖². Hence

v_I = ‖b‖² + K_E² − 2‖b_I‖K_E  if ‖b_I‖ > K_E,   and   v_I = ‖b‖² − ‖b_I‖²  if ‖b_I‖ ≤ K_E.

In both cases, it is clear that v_I is minimized when ‖b_I‖ is maximized, and ‖b_I‖ is obviously maximized when I is chosen to be the set of indices corresponding to the N₀ largest (in magnitude) components of b. ⊓⊔
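Proposition 4 thus gives the closed-form solution of the corruption-recovery (E-step) subproblem (34): keep the N₀ largest-magnitude components of b and, if they stick out of the ℓ₂-ball of radius K_E, rescale them onto it. A minimal numpy rendering (names ours):

```python
import numpy as np

def e_step(b, N0, K_E):
    """Closed-form minimizer of (34) per Proposition 4."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    idx = np.argsort(-np.abs(b))[:N0]   # indices of the N0 largest |b_i|
    x[idx] = b[idx]
    nrm = np.linalg.norm(x)
    if nrm > K_E:                        # active constraint: the mu > 0 case
        x *= K_E / nrm
    return x
```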
Proof (Proof of Theorem 1) (a) The equality in (42) follows directly from (41). By the minimal property of x̂^{k+1}, we have

L̂(x̂^{k+1}; x^k, y^k) ≤ L̂(ξ; x^k, y^k)  for all ξ ∈ 𝒳.   (52)

Taking ξ = x^k gives L̂(x̂^{k+1}; x^k, y^k) ≤ L̂(x^k; x^k, y^k) = L(x^k, y^k), and the required inequality in (42) follows. On the other hand, the inequality (43) follows readily from the definition of x^{k+1}.

(b) If x^{k+1} = x̃^{k+1}, then from the definition of x^{k+1} and (42) we have

L(x^{k+1}, y^k) + ½‖x^{k+1} − x^k‖²_S ≤ L(x̂^{k+1}, y^k) + ½‖x̂^{k+1} − x^k‖²_S ≤ L(x^k, y^k).   (53)

On the other hand, if x^{k+1} = x̂^{k+1}, we have from (42) that

L(x^{k+1}, y^k) + ½‖x^{k+1} − x^k‖²_S ≤ L(x^k, y^k).   (54)

By the minimal property of y^{k+1}, we have, for all η ∈ 𝒴,

L(x^{k+1}, y^{k+1}) + ½‖y^{k+1} − y^k‖²_T ≤ L(x^{k+1}, η) + ½‖η − y^k‖²_T.

In particular, taking η = y^k gives

L(x^{k+1}, y^{k+1}) + ½‖y^{k+1} − y^k‖²_T ≤ L(x^{k+1}, y^k).   (55)

Combining (53)-(54) with (55) yields the inequality (44).

(c) Note that by using the result in part (b), we also have lim_{k'→∞} x^{k'+1} = x̄ and lim_{k'→∞} y^{k'+1} = ȳ. From (42), (43) and (52), we have, for all k ≥ 0 and all ξ ∈ 𝒳,

L(x^{k+1}, y^k) + ½‖x^{k+1} − x^k‖²_S ≤ L̂(ξ; x^k, y^k).   (56)

Thus for all ξ ∈ 𝒳,

lim sup_{k'→∞} f(x^{k'+1}) + q(x̄, ȳ) ≤ f(ξ) + Q(ξ; x̄, ȳ).   (57)

Taking ξ = x̄, we get

lim sup_{k'→∞} f(x^{k'+1}) ≤ f(x̄) + Q(x̄; x̄, ȳ) − q(x̄, ȳ) = f(x̄).   (58)

On the other hand, since f is lower semicontinuous, we have lim inf_{k'→∞} f(x^{k'+1}) ≥ f(x̄); thus lim_{k'→∞} f(x^{k'+1}) = f(x̄). Similarly, we can show that lim_{k'→∞} g(y^{k'+1}) = g(ȳ). As a result, we have

lim_{k'→∞} L(x^{k'+1}, y^{k'+1}) = L(x̄, ȳ).   (59)

Since {L(x^k, y^k)} is a nonincreasing sequence, the above result implies that

lim_{k→∞} L(x^k, y^k) = L(x̄, ȳ) = inf_k L(x^k, y^k).

Also, (53)-(54) and (55) imply that lim_{k→∞} L(x^{k+1}, y^k) = L(x̄, ȳ). From (42) and (43), we have

L(x^{k+1}, y^k) + ½‖x^{k+1} − x^k‖²_S ≤ L̂(x̂^{k+1}; x^k, y^k) ≤ L(x^k, y^k).
Thus lim_{k→∞} L̂(x̂^{k+1}; x^k, y^k) = L(x̄, ȳ). Now, by (42) and (43) again, we have

½‖x^{k+1} − x^k‖²_S + ½‖x̂^{k+1} − x^k‖²_{M − A*A − S} ≤ L̂(x̂^{k+1}; x^k, y^k) − L(x^{k+1}, y^k).

Thus lim_{k→∞} ‖x̂^{k+1} − x^k‖²_{M − A*A − S} = 0. Since M − A*A − S ≻ 0, we also get lim_{k→∞} ‖x̂^{k+1} − x^k‖ = 0.

(d) From the optimality of x̂^{k+1}, we have

0 ∈ ∂L̂(x̂^{k+1}; x^k, y^k) = ∂f(x̂^{k+1}) + A*(Ax^k + By^k − c) + M(x̂^{k+1} − x^k) = ∂f(x̂^{k+1}) + A*(Ax̂^{k+1} + By^{k+1} − c) − Δx^{k+1},

where Δx^{k+1} = −(M − A*A)(x̂^{k+1} − x^k) − A*B(y^k − y^{k+1}). Thus

Δx^{k+1} ∈ ∂_x L(x̂^{k+1}, y^{k+1}).   (60)

From the optimality of y^{k+1}, we have

0 ∈ ∂g(y^{k+1}) + B*(Ax^{k+1} + By^{k+1} − c) + T(y^{k+1} − y^k) = ∂_y L(x̂^{k+1}, y^{k+1}) + T(y^{k+1} − y^k) + B*A(x^{k+1} − x̂^{k+1}).

Hence

Δy^{k+1} := −T(y^{k+1} − y^k) − B*A(x^{k+1} − x̂^{k+1}) ∈ ∂_y L(x̂^{k+1}, y^{k+1}).

From parts (b) and (c), we have

lim_{k'→∞} ‖x̂^{k'+1} − x^{k'}‖ = 0,  lim_{k'→∞} ‖x̂^{k'+1} − x^{k'+1}‖ = 0,  lim_{k'→∞} x̂^{k'+1} = x̄,  lim_{k'→∞} y^{k'+1} = ȳ.

Thus lim_{k'→∞} Δx^{k'+1} = 0 = lim_{k'→∞} Δy^{k'+1}. By the closedness property of ∂L (Clarke 1990, Proposition 2.1.5), we get (0, 0) ∈ ∂L(x̄, ȳ). Thus (x̄, ȳ) is a stationary point of L. ⊓⊔

A.2 Software/code used
The point clouds in Fig. 14 are generated using Vincent's SfM Toolbox (Rabaud n.d.). The source code for BALM, GROUSE, GRASTA, Damped Newton, Wiberg, and LM_X used in the experiments was released by the corresponding authors of Del Bue et al (2012); Balzano et al (2010); He et al (2011); Buchanan and Fitzgibbon (2005); Okatani and Deguchi (2007); Wu et al (2011b) and Chen (2008). In particular, we are thankful that Balzano et al (2010) and Wu et al (2011b) shared with us customized versions of GROUSE and ALM-RPCA that are not yet released online. For Wiberg ℓ₁ (Eriksson and Van Den Hengel 2010), we optimized the computation of the Jacobian and adopted the commercial LP solver cplex; the optimized code performs identically to the released code on small-scale problems, but it is beyond our scope to verify this on larger-scale problems. In addition, we implemented Simon Funk's SVD ourselves. The ALS implementation is given in the released code package of LM_X. For OptManifold, TFOCS and CVX, we use the generic optimization packages released by the authors of Wen and Yin (2013); Becker et al (2011); Grant and Boyd (2008), customized for the particular problem. For NLCG, we implemented the derivations in Srebro and Jaakkola (2003) and used the generic NLCG package (Overton n.d.).

A.3 Additional experimental results
An illustration of the decomposition on Subject 3 of the Extended YaleB dataset is given in Fig. 19. Additional qualitative comparisons of the image recovery are given in Fig. 20.
A.4 The lower bounds in the experiments

– The lower bound in Fig. 2: the lower bound is attained by data sets that contain fewer than r data points per column and per row. It is clear from Király and Tomioka (2012) that this is an easy-to-check necessary condition for recoverability.

Fig. 20
Additional comparisons of the quality of face image recovery. From left to right: original image, missing data mask (in green), and the results for PARSuMi, BALM and missing-RPCA.

– The oracle RMSE for the phase diagrams: we adapt the oracle lower bound from Candès and Plan (2010) to represent the theoretical limit of recovery accuracy under noise. Our extended oracle bound under both sparse corruptions and Gaussian noise is

RMSE_oracle = σ √( (m + n − r) r / (p − e) ),   (61)

where p is the number of observed entries and e is the number of corrupted entries. This is used for benchmarking in our phase diagram experiments.

– The oracle angular error in Table 4: for the Caesar and
For the Caesar andElephant experiments, we use (61) (ignoring corruptionsby taking e = 0) but transformed it by takingarcsin (cid:112) − ( n · ˆ n ) , where ˆ n is the surface normal obtained by an oracle pro-jection of the noisily observed image. A.5 Summary of parameters used in the experiments – Parameters in our formulation : We assume r (the un-derlying rank) to be known. N is chosen to be an upperbound of the number of corrupted entries. In experiments,we use 120% of the actual number of corruptions. In prac-tice, we should choose N = 0 . | Ω | or 0 . | Ω | . (cid:15) = 1 e −
10 (almost negligible). K E = 20 (cid:113) N × median( P ) Ω ( (cid:99) W )(very large, negligible). In theory, we only need (cid:15) > K E < ∞ to ensure the convergence. In practice, unless itis meaningful to choose an effective K E , we will choose itlarge enough so that it has no impact on the optimization. – Parameters for PARSuMi : β = β = e − √ max m,n . ForAlgorithm 1, ρ = 10 and initial λ = 1 e − – Parameters for APG : γ = √ max m,n λ = 0 . References
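Collected in one place, these defaults can be read as the following sketch. This is only an illustrative summary (names are ours); the beta and lambda magnitudes are placeholders, marked as assumptions, since their exact values are given only qualitatively above.

```python
import numpy as np

def parsumi_defaults(m, n, W_obs, n_corr_est, r):
    """Illustrative defaults following Appendix A.5. W_obs is the vector of
    observed entries P_Omega(W_hat); n_corr_est estimates the corruption count."""
    N0 = int(1.2 * n_corr_est)                          # 120% of the estimated count
    return {
        "r": r,                                         # underlying rank, assumed known
        "N0": N0,                                       # bound on corrupted entries
        "eps": 1e-10,                                   # almost negligible
        "K_E": 20.0 * np.sqrt(N0) * np.median(np.abs(W_obs)),  # very large, negligible
        "gamma": 1.0 / np.sqrt(max(m, n)),              # APG, RPCA-suggested value
        "rho": 10.0,                                    # Algorithm 1
        "beta": 1e-2 / np.sqrt(max(m, n)),              # placeholder magnitude (assumption)
        "lam_init": 1e-4,                               # placeholder magnitude (assumption)
    }
```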
References

Attouch H, Bolte J, Redont P, Soubeyran A (2010) Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research 35(2):438–457
Balzano L, Nowak R, Recht B (2010) Online identification and tracking of subspaces from highly incomplete information. In: Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, IEEE, pp 704–711
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183–202
Becker SR, Candès EJ, Grant MC (2011) Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation 3(3):165–218
Becker SR, Candès EJ, Grant MC (2012) TFOCS: Templates for first-order conic solvers. http://cvxr.com/tfocs/
Bennett J, Lanning S, Netflix N (2007) The Netflix Prize. In: KDD Cup and Workshop in conjunction with KDD
Buchanan AM, Fitzgibbon AW (2005) Damped Newton algorithms for matrix factorization with missing data. In: CVPR, vol 2, pp 316–322
Candès E, Plan Y (2010) Matrix completion with noise. Proceedings of the IEEE 98(6):925–936
Candès E, Recht B (2009) Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6):717–772
Candès E, Li X, Ma Y, Wright J (2011) Robust principal component analysis? Journal of the ACM 58(3)
Candès EJ, Tao T (2010) The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory 56(5):2053–2080
Candès EJ, Wakin MB, Boyd SP (2008) Enhancing sparsity by reweighted ℓ₁ minimization. Journal of Fourier Analysis and Applications 14(5-6):877–905
Chandrasekaran V, Sanghavi S, Parrilo P, Willsky A (2011) Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization 21:572–596
Chen P (2008) Optimization algorithms on subspaces: Revisiting missing data problem in low-rank matrix. IJCV 80(1):125–142
Chen P (2011) Hessian matrix vs. Gauss-Newton Hessian matrix. SIAM Journal on Numerical Analysis 49(4):1417–1435
Chen Y, Jalali A, Sanghavi S, Caramanis C (2011) Low-rank matrix recovery from errors and erasures. In: Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on, IEEE, pp 2313–2317
Clarke FH (1990) Optimization and Nonsmooth Analysis, vol 5. SIAM
Del Bue A, Xavier J, Agapito L, Paladini M (2012) Bilinear modeling via augmented Lagrange multipliers (BALM). IEEE Transactions on Pattern Analysis and Machine Intelligence 34(8):1496–1508
Donoho DL (1995) De-noising by soft-thresholding. IEEE Transactions on Information Theory 41(3):613–627
Eriksson A, Van Den Hengel A (2010) Efficient computation of robust low-rank matrix approximations in the presence of missing data using the L1 norm. In: CVPR, pp 771–778
Friedland S, Niknejad A, Kaveh M, Zare H (2006) An algorithm for missing value estimation for DNA microarray data. In: Acoustics, Speech and Signal Processing (ICASSP), 2006 IEEE International Conference on, IEEE, vol 2
Funk S (2006) Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html
Gabriel KR, Zamir S (1979) Lower rank approximation of matrices by least squares with any choice of weights. Technometrics 21(4):489–498
Ganesh A, Wright J, Li X, Candès EJ, Ma Y (2010) Dense error correction for low-rank matrices via principal component pursuit. In: Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, IEEE, pp 1513–1517
Grant M, Boyd S (2008) Graph implementations for nonsmooth convex programs. In: Blondel V, Boyd S, Kimura H (eds) Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, Springer-Verlag Limited, pp 95–110
Grant M, Boyd S (2012) CVX: Matlab software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx
Hartley R, Schaffalitzky F (2003) PowerFactorization: 3D reconstruction with missing or uncertain data. In: Australia-Japan Advanced Workshop on Computer Vision, vol 74, pp 76–85
He J, Balzano L, Lui J (2011) Online robust subspace tracking from partial information. arXiv preprint arXiv:1109.3827
Horn BK (1990) Height and gradient from shading. International Journal of Computer Vision 5(1):37–75
Jain P, Netrapalli P, Sanghavi S (2012) Low-rank matrix completion using alternating minimization. arXiv preprint arXiv:1212.0467
Ke Q, Kanade T (2005) Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming. In: CVPR, vol 1, pp 739–746
Keshavan R, Montanari A, Oh S (2009) Low-rank matrix completion with noisy observations: a quantitative comparison. In: Communication, Control, and Computing, pp 1216–1222
Király F, Tomioka R (2012) A combinatorial algebraic approach for the identifiability of low-rank matrix completion. In: Langford J, Pineau J (eds) Proceedings of the 29th International Conference on Machine Learning (ICML-12), Omnipress, New York, NY, USA, pp 967–974
Koren Y (2009) The BellKor solution to the Netflix grand prize. Netflix prize documentation
Koren Y, Bell R, Volinsky C (2009) Matrix factorization techniques for recommender systems. Computer 42(8):30–37
Lee KC, Ho J, Kriegman DJ (2005) Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5):684–698
Li X (2013) Compressed sensing and matrix completion with constant proportion of corruptions. Constructive Approximation 37(1):73–99
Liu Y, Sun D, Toh K (2009) An implementable proximal point algorithmic framework for nuclear norm minimization. Mathematical Programming 133(1-2):1–38
Negahban S, Wainwright MJ (2012) Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research 13:1665–1697
Nehab D, Rusinkiewicz S, Davis J, Ramamoorthi R (2005) Efficiently combining positions and normals for precise 3D geometry. ACM Transactions on Graphics 24:536–543
Oh S, Montanari A, Karbasi A (2010) Sensor network localization from local connectivity: Performance analysis for the MDS-MAP algorithm. In: Information Theory Workshop (ITW), 2010 IEEE, IEEE, pp 1–5
Okatani T, Deguchi K (2007) On the Wiberg algorithm for matrix factorization in the presence of missing components. IJCV 72(3):329–337
Overton ML (n.d.) NLCG: Nonlinear conjugate gradient.
Paladini M, Del Bue A, Stosic M, Dodig M, Xavier J, Agapito L (2009) Factorization for non-rigid and articulated structure using metric projections. In: CVPR, pp 2898–2905
Rabaud V (n.d.) Vincent's Structure from Motion Toolbox. http://vision.ucsd.edu/~vrabaud/toolbox/
Recht B (2009) A simpler approach to matrix completion. arXiv preprint arXiv:0910.0651
Recht B, Re C, Wright S, Niu F (2011) Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems 24, pp 693–701
Rudin W (1987) Real Analysis. McGraw-Hill, New York
Shi X, Yu PS (2011) Limitations of matrix completion via trace norm minimization. ACM SIGKDD Explorations Newsletter 12(2):16–20
Srebro N, Jaakkola T (2003) Weighted low-rank approximations. In: Proceedings of the 20th International Conference on Machine Learning (ICML-2003), vol 20, p 720
Sturm P, Triggs B (1996) A factorization based algorithm for multi-image projective structure and motion. In: 4th European Conference on Computer Vision, Springer, pp 709–720
Toh K, Yun S (2010) An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization 6:615–640
Tomasi C, Kanade T (1992) Shape and motion from image streams under orthography: a factorization method. IJCV 9(2):137–154
Wang YX, Xu H (2012) Stability of matrix factorization for collaborative filtering. In: Langford J, Pineau J (eds) Proceedings of the 29th International Conference on Machine Learning (ICML-12), Omnipress, New York, NY, USA, pp 417–424
Wen Z, Yin W (2013) A feasible method for optimization with orthogonality constraints. Mathematical Programming, pp 1–38
Wu B, Ding C, Sun D, Toh KC (2011a) On the Moreau-Yosida regularization of the vector k-norm related functions. Preprint
Wu L, Ganesh A, Shi B, Matsushita Y, Wang Y, Ma Y (2011b) Robust photometric stereo via low-rank matrix completion and recovery. In: ACCV, pp 703–717
Xiong L, Chen X, Schneider J (2011) Direct robust matrix factorization for anomaly detection. In: Data Mining (ICDM), 2011 IEEE 11th International Conference on, IEEE, pp 844–853
Yang J, Sun D, Toh KC (2012) A proximal point algorithm for log-determinant optimization with group lasso regularization. To appear in SIAM Journal on Optimization
Yu HF, Hsieh CJ, Si S, Dhillon I (2012) Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In: Data Mining (ICDM), 2012 IEEE 12th International Conference on, IEEE, pp 765–774
Zhou Z, Li X, Wright J, Candès E, Ma Y (2010) Stable principal component pursuit. In: Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, IEEE, pp 1518–1522

(a)
The 64 original face images (b)
Input images with missing data (in green) (c)
The 64 recovered rank-3 face images (d)
Sparse corruptions detected