A Fast and Adaptive SVD-free Algorithm for General Weighted Low-rank Recovery
Aritra Dutta*†, Jingwei Liang*‡, Xin Li§
*Equal contributions.
†Division of Computer, Electrical and Mathematical Sciences & Engineering, King Abdullah University of Science and Technology, Saudi Arabia (E-mail: [email protected]).
‡School of Mathematical Sciences, Queen Mary University of London, UK (E-mail: [email protected]).
§Department of Mathematics, University of Central Florida, USA (E-mail: [email protected]).
Abstract.
This paper is devoted to proposing a general weighted low-rank recovery model and designing a fast SVD-free computational scheme to solve it. First, our generic weighted low-rank recovery model unifies several existing approaches in the literature. Moreover, our model readily extends to the non-convex setting. Algorithm-wise, most first-order proximal algorithms in the literature for low-rank recovery require computing a singular value decomposition (SVD). As SVD does not scale well with the dimension of the matrices, these algorithms become slower as the problem size grows. By incorporating the variational formulation of the nuclear norm into the sub-problem of proximal gradient descent, we avoid computing SVD, which results in significant speed-up. Moreover, our algorithm preserves the rank identification property of the nuclear norm [33], which further allows us to design a rank continuation scheme that asymptotically achieves the minimal per iteration complexity. Numerical experiments on both a toy example and real-world problems, including structure from motion (SfM) and photometric stereo, background estimation and matrix completion, demonstrate the superiority of our proposed algorithm.
Key words.
Low-rank recovery, weighted low-rank, nuclear norm, singular value decomposition, proximal gradient descent, alternating minimization, rank identification/continuation
AMS subject classifications.
Low-rank matrix recovery is an important problem to study as it covers many interesting problems arising from diverse fields including machine learning, data science, signal/image processing, and computer vision, to name a few. The goal of low-rank recovery is to recover or approximate a targeted matrix X̊ ∈ R^{m×n} whose rank is much smaller than its dimension. For example, matrix completion [54, 13], structure from motion [44], video segmentation [64, 28, 61], and image processing and signal retrieval [26, 59] exploit the inherent low-rank structure of the data.
For many problems of interest, instead of accessing the data X̊ directly, often we can only observe it through some agent (e.g. a linear operator) Ψ. A general observation model takes the following form
F = Ψ(X̊) + ε,    (1)
where Ψ : R^{m×n} → R^{d×ℓ} is the (observation) operator which is assumed to be bounded linear. For example, in the compressed sensing scenario, Ψ returns a linear measurement of X̊ which is a d-dimensional vector [26]; for matrix completion Ψ is a binary mask [13]. In the above model, the variable ε ∈ R^{d×ℓ} denotes additive noise (e.g. white Gaussian) and F ∈ R^{d×ℓ} is the obtained noise-contaminated observation.
Over the years, numerous low-rank promoting recovery models have been proposed in the literature, for example [39, 63, 59, 62, 61, 15], to mention a few. When the rank of X̊ is available, one can consider the following rank constrained weighted least squares
min_{X ∈ R^{m×n}} ‖(Ψ(X) − F) ⊙ W‖²  such that  rank(X) ≤ r,    (2)
where r = rank(X̊) is the rank of X̊, W ∈ R^{d×ℓ} is a non-negative weight matrix, and ⊙ is the Hadamard product. The motivation for considering a weight W is that (2) can handle more general noise models ε, rather than mere Gaussian noise [16, 22, 23]. A clear limitation of (2) is that, for many problems, it is in general impossible to know rank(X̊) a priori. As a result, instead of using the rank as a constraint, one can penalize it in the objective, which results in the rank regularized recovery model
min_{X ∈ R^{m×n}} (1/2)‖(Ψ(X) − F) ⊙ W‖² + τ rank(X),    (3)
where τ > 0 is the regularization parameter; to obtain a solution whose rank is close to r, one needs to choose τ properly. Moreover, due to the rank function, (2) and (3) are non-convex, imposing challenges to both theoretical analysis and algorithmic design.
In the literature, a popular approach to avoid non-convexity is to replace the rank function with its convex surrogate, the nuclear norm (a.k.a. trace norm) [35, 12]. Correspondingly, we obtain the following nuclear norm constrained form of (2)
min_{X ∈ R^{m×n}} ‖(Ψ(X) − F) ⊙ W‖²  such that  ‖X‖_* ≤ c,    (4)
where c is a predefined constant, e.g. c = ‖X̊‖_* if possible. Consequently, for (3), we arrive at the following unconstrained nuclear norm regularized recovery model
min_{X ∈ R^{m×n}} (1/2)‖(Ψ(X) − F) ⊙ W‖² + τ‖X‖_*.    (5)
Note that, besides the weighted ℓ₂ loss, one can also consider a general loss function f(X, F, W), which gives the following recovery model (similar to [30, 51])
min_{X ∈ R^{m×n}} f(X, F, W) + τ‖X‖_*.    (6)
For the rest of the paper, we mainly focus on model (5) and only present a short discussion of (6) in Section 2.3. Our recovery model (5) is connected with several established works in the literature, and moreover covers some of them as special cases. Therefore, in what follows, we present a short overview of the related literature.
Trace LASSO
When the entries of W are all 1's, problem (5) becomes the Trace LASSO considered in [25], i.e. nuclear norm regularized least squares. For this case, ε corresponds to additive white Gaussian noise. When Ψ ∈ R^{d×m}, F ∈ R^{d×n} and W = 1 ∈ R^{d×n}, (5) becomes
min_{X ∈ R^{m×n}} (1/2)‖ΨX − F‖² + τ‖X‖_*,    (7)
which is studied in [47]. Examples of (7) include multivariate linear regression, multi-class classification and multi-task learning [47]. As proposed in [29, 30], one can also consider a general loss function f(X, F) (as a special case of (6)) and solve
min_{X ∈ R^{m×n}} f(X, F) + τ‖X‖_*.    (8)
For different choices of f(X, F) in (8), one can recover affine rank minimization [55, 39], regularized semi-definite linear least squares [51], etc.
Weighted low-rank recovery
For the case where W is a general non-negative weight, there is also a train of works in the literature. For most of them, Ψ is the identity operator. We start with the rank constrained case
min_{X ∈ R^{m×n}} ‖(F − X) ⊙ W‖²  such that  rank(X) ≤ r,    (9)
which is well studied in the literature under different settings [38, 52, 53, 42, 56]. In [41], instead of considering a generic weight W, the authors proposed a general matrix-induced weighted norm
min_{X ∈ R^{m×n}} ‖F − X‖²_Q  such that  rank(X) ≤ r,    (10)
where Q ∈ R^{mn×mn} is symmetric positive definite and ‖F − X‖²_Q := vec(F − X)^⊤ Q vec(F − X), with vec(·) being the operator which maps matrices in R^{m×n} to vectors in R^{mn×1} by stacking the columns. We refer to [16, 41, 22, 23, 48, 20] and the references therein for more discussions. If we lift the constraint to the objective function, as in (3), we get the problem below:
min_{X ∈ R^{m×n}} (1/2)‖(X − F) ⊙ W‖² + τ rank(X),    (11)
which is studied in [21, 16]. The latest addition to this class of problems is the weighted singular value thresholding studied in [17], which takes the form
min_{X ∈ R^{m×n}} (1/2)‖(X − F)W‖² + τ‖X‖_*,    (12)
where W ∈ R^{n×n} is a weight matrix. Let UΣV^⊤ be an SVD of W with Σ = diag(σ₁, σ₂, ..., σ_n). Applying the unitary invariance of the norms (and the change of variable X → XU), problem (12) becomes
min_{X ∈ R^{m×n}} (1/2)‖(X − FU)Σ‖² + τ‖X‖_*,    (13)
which moreover can be equivalently written as
min_{X ∈ R^{m×n}} (1/2)‖(X − FU) ⊙ W_Σ‖² + τ‖X‖_*,    (14)
where W_Σ = 1(σ₁, σ₂, ..., σ_n) ∈ R^{m×n} and 1 ∈ R^{m×1} is the vector of all 1's. In Table 1 below, we summarize the above formulations studied in the literature to highlight the differences and connections between them.
Name | Formulation | Reference
SVD/PCA | min_{X: rank(X) ≤ r} ‖X − F‖² | [24, 31]
SVT | min_X τ‖X‖_* + (1/2)‖X − F‖² | [12]
Weighted low-rank (WLR) | min_{X: rank(X) ≤ r} ‖(X − F) ⊙ W‖² | [53, 52, 38]
General WLR (GWLR) | min_{X: rank(X) ≤ r} ‖X − F‖²_Q | [41, 43]
Weighted SVD | min_X rank(X) + (1/2)‖(X − F)W‖² | [21, 16]
Weighted SVT (WSVT) | min_X τ‖X‖_* + (1/2)‖(X − F)W‖² | [21, 16]
Nuclear norm regularized GWLR | min_X τ‖X‖_* + (1/2)‖(Ψ(X) − F) ⊙ W‖² | This work
Nuclear norm constrained | min_{X: ‖X‖_* ≤ t} f(X) | [29, 39, 13]
Trace norm minimization | min_X f(X) + τ‖X‖_* | [30, 11, 60, 62]
Table 1: SVT, weighted low-rank approximation and their variants.
In this paper, we propose a general model (5) for low-rank recovery. Based on the variational formulation of the nuclear norm, we propose an efficient algorithm which avoids computing SVD. More precisely, our contributions include the following aspects.
A generic low-rank recovery model
We propose general low-rank recovery models which cover several existing works as special cases. We provide a detailed comparison of our problem with the existing ones, both analytically and empirically. We believe problem (5), with our dedicated structure-dependent analysis, should be studied as a standalone problem to close the existing knowledge gap.
An efficient SVD-free algorithm
In the literature, numerous numerical schemes can be applied to solve (5), since it is the sum of a smooth function and a non-smooth one. However, most of these algorithms require computing SVD, which does not scale properly with the dimension of the problem [17, 30, 47, 44]. To efficiently solve (5), we propose an SVD-free method (see Algorithm 1). By combining proximal gradient descent [36] and the variational characterization of the nuclear norm, we design a "proximal gradient & alternating minimization method" which we coin ProGrAMMe. Our algorithm can also be applied to solve the general model (6) if the loss function f(X, F, W) is smoothly differentiable with Lipschitz continuous gradient. Moreover, our algorithm can be easily extended to the non-convex loss function case.
A rank continuation strategy
Based on the result of [33], we show that the sequence generated by Algorithm 1 can find the rank of the minimizer (to which the generated sequence converges) in a finite number of iterations, which we call the rank identification property. In turn, we design a rank continuation technique which leads to Algorithm 2. Compared to Algorithm 1, rank continuation is less sensitive to the initial parameter, and asymptotically achieves the minimal per iteration complexity.
Numerical comparisons
We evaluate our algorithms against 15 state-of-the-art weighted and unweighted low-rank approximation methods on various tasks, including structure from motion (SfM) and photometric stereo, background estimation from fully and partially observed data, and matrix completion. In these problems, different weights are used as deemed fit, ranging from binary weights to random large weights. We observe that in all tasks our weighted low-rank algorithm performs either better than or as well as the other algorithms. This indicates that our algorithm is robust and scalable to both binary and general weights on a diverse set of tasks.
Throughout the paper, R^m is a finite dimensional Euclidean space equipped with scalar product ⟨·,·⟩ and induced norm ‖·‖. We abuse the notation ‖·‖ for the Frobenius norm when the argument is a matrix. Id_m denotes the identity operator on R^m. Let S ⊂ R^n be a non-empty closed compact set; then ri(S) denotes its relative interior, and par(S) is the subspace which is parallel to span(S). The sub-differential of a proper closed convex function g : R^m → R ∪ {+∞} is the set-valued mapping defined by
∂g : R^m ⇒ R^m,  x ↦ {v ∈ R^m | g(x') ≥ g(x) + ⟨v, x' − x⟩, ∀x' ∈ R^m}.
Definition 1.1.
The proximal mapping (or proximal operator) of a proper closed convex function g : R^n → R ∪ {+∞} is defined as follows: let γ > 0,
prox_{γg}(y) = argmin_x { γ g(x) + (1/2)‖x − y‖² }.    (15)
For the nuclear norm, its proximal mapping is singular value thresholding (SVT) [12], which is the lifting of the vector soft-shrinkage thresholding to matrices [7].
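For reference, the following is a minimal sketch (Python/NumPy, ours rather than the paper's) of this SVT operator computed the standard way, via a full SVD; this per-iteration SVD is precisely what the SVD-free scheme of Section 2 is designed to avoid.

```python
import numpy as np

def svt(Z, tau):
    """Singular value thresholding: prox of tau * nuclear norm at Z, via a full SVD."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_thresh = np.maximum(s - tau, 0.0)   # soft-shrink the singular values
    return (U * s_thresh) @ Vt
```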
Lemma 1.1 (Variational formulation of the nuclear norm [49, 50]).
Let X ∈ R^{m×n} and U ∈ R^{m×r}, V ∈ R^{r×n} with r ≥ rank(X). We can write
‖X‖_* = min_{X = UV, U ∈ R^{m×r}, V ∈ R^{r×n}} ‖U‖‖V‖ = min_{X = UV, U ∈ R^{m×r}, V ∈ R^{r×n}} (1/2)(‖U‖² + ‖V‖²).
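As a quick numerical illustration of Lemma 1.1 (a sketch with illustrative dimensions, not from the paper), the factorization U = P S^{1/2}, V = S^{1/2} Q^⊤ built from an SVD X = P S Q^⊤ attains the minimum, so (1/2)(‖U‖² + ‖V‖²) coincides with ‖X‖_*:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 7)) @ rng.standard_normal((7, 40))   # a rank-7 matrix

P, s, Qt = np.linalg.svd(X, full_matrices=False)
U = P * np.sqrt(s)                 # U = P S^{1/2}   (m x r)
V = np.sqrt(s)[:, None] * Qt       # V = S^{1/2} Q^T (r x n), so X = U V

nuclear_norm = s.sum()
variational  = 0.5 * (np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2)
assert np.isclose(nuclear_norm, variational)   # the bound of Lemma 1.1 is attained
```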
Paper organization
The rest of the paper is organized as follows. In Section 2, we describe our proposed algorithm ProGrAMMe (Algorithm 1) and prove its global convergence property. In Section 3, we first prove the rank identification property of Algorithm 1 and then propose a rank continuation strategy in Algorithm 2. Extensive numerical experiments are provided in Section 4, followed by the conclusion of this paper.
Problem (5) is the sum of a smooth term and a non-smooth term. In the literature, numerical methods for such a structured problem are well studied, such as proximal gradient descent [36] (a.k.a. Forward-Backward splitting) and its various variants, including the celebrated FISTA [6, 14]. Indeed, our problem (5) can be handled by proximal gradient descent. However, such a method requires repeated SVD computations, which can significantly slow down its performance in many practical scenarios where the data size is large. Therefore, in this section, by combining proximal gradient descent and the variational formulation of the nuclear norm (cf. Lemma 1.1), we propose an SVD-free method for solving (5).
In this part, we provide a detailed derivation of our algorithm, which is a combination of proximal gradient descent, alternating minimization, and inertial acceleration. For convenience, denote f(X) = (1/2)‖(Ψ(X) − F) ⊙ W‖² and g(X) = ‖X‖_*.
Step 1 - Inertial proximal gradient descent
The first step to derive our algorithm is applying an inertial proximal gradient descent [36, 33] to solve problem (5). Since Ψ is a bounded linear mapping, we have the following simple lemma.
Lemma 2.1.
Let W̃ = W ⊙ W. The loss f(X) is smoothly differentiable with its gradient given by
∇f(X) = ∇Ψ(X)^*((Ψ(X) − F) ⊙ W̃),
which is L-Lipschitz continuous with L = ‖∇Ψ(X)‖² max_{i,j} W̃_{i,j}.
Below we provide two examples of ∇Ψ(X):
• In the compressed sensing scenario, Ψ ∈ R^{d×mn} is a linear measurement matrix, Ψ(X) = Ψ vec(X) and ‖∇Ψ(X)‖ = ‖Ψ‖.
• For the matrix completion problem, Ψ ∈ R^{m×n} is a binary mask, Ψ(X) = Ψ ⊙ X and ‖∇Ψ(X)‖ = 1.
In the literature, a routine approach to solve (5) is inertial proximal gradient descent. Let X_0 ∈ R^{m×n} be an arbitrary starting point; we consider the following iteration
Y_k = X_k + a_k(X_k − X_{k−1}),
Z_k = Y_k − γ ∇Ψ(Y_k)^*((Ψ(Y_k) − F) ⊙ W̃),
X_{k+1} = argmin_{X ∈ R^{m×n}} τ‖X‖_* + 1/(2γ)‖X − Z_k‖²,    (16)
where a_k ∈ [0, 1] is the inertial parameter, γ ∈ ]0, 2/L[ is the step-size, and the last line of (16) is nothing but the SVT of Z_k. Iteration (16) is a special case of the general inertial scheme proposed in [33], and we refer to [33] and the references therein for more discussion on inertial schemes.
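To make the gradient step of (16) concrete, here is a minimal sketch for the matrix-completion instance of Lemma 2.1, where Ψ(X) = M ⊙ X for a binary mask M (the function names are ours, not the paper's):

```python
import numpy as np

def grad_f(X, F, M, W):
    """Gradient of f(X) = 1/2 ||(M*X - F)*W||_F^2 for Psi(X) = M (Hadamard) X.
    The adjoint of the masking operator is masking again, hence the outer M."""
    W2 = W * W                      # Wtilde = W (Hadamard) W
    return M * ((M * X - F) * W2)

def gradient_step(Y, F, M, W, gamma):
    """One gradient step Z_k = Y_k - gamma * grad f(Y_k) of iteration (16)."""
    return Y - gamma * grad_f(Y, F, M, W)
```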
Step 2 - Alternating minimization
As computing the proximal mapping of the nuclear norm requires SVD, the goal of the second step is to avoid SVD when solving the SVT of Z_k, by incorporating the variational formulation of the nuclear norm. To this end, the subproblem of (16) reads
min_{X ∈ R^{m×n}} τ‖X‖_* + 1/(2γ)‖X − Z_k‖².    (17)
By plugging in the variational formulation of the nuclear norm (and scaling the objective by γ), we arrive at the following constrained minimization problem: let r ≥ rank(SVT_{τγ}(Z_k)),
min_{X ∈ R^{m×n}, U ∈ R^{m×r}, V ∈ R^{r×n}} (1/2)‖X − Z_k‖² + (τγ/2)(‖U‖² + ‖V‖²)  such that  X = UV.    (18)
Instead of considering the augmented Lagrangian of the constraint [11], we directly substitute the constraint X = UV into the objective, which leads to
min_{U ∈ R^{m×r}, V ∈ R^{r×n}} (1/2)‖UV − Z_k‖² + (τγ/2)(‖U‖² + ‖V‖²),    (19)
which is a smooth, bi-convex optimization problem in each component U and V.
Different from (17), problem (19) does not admit a closed-form solution. However, when either U or V is fixed, the problem becomes a simple least squares. Hence we can solve (19) via a simple alternating minimization, namely a two-block Gauss-Seidel iteration [3]: given U_0 ∈ R^{m×r}, V_0 ∈ R^{r×n},
U_{i+1} = Z_k V_i^⊤ (V_i V_i^⊤ + τγ Id_r)^{-1},
V_{i+1} = (U_{i+1}^⊤ U_{i+1} + τγ Id_r)^{-1} U_{i+1}^⊤ Z_k,    (20)
where Id_r denotes the identity operator on R^r. Substituting (20) into (16) as an inner loop, we obtain the following iterative scheme: let I ∈ N_+,
Y_k = X_k + a_k(X_k − X_{k−1}),
Z_k = Y_k − γ ∇Ψ(Y_k)^*((Ψ(Y_k) − F) ⊙ W̃),
Initialize U_0, V_0. For i = 0, ..., I − 1:
  U_{i+1} = Z_k V_i^⊤ (V_i V_i^⊤ + τγ Id_r)^{-1},
  V_{i+1} = (U_{i+1}^⊤ U_{i+1} + τγ Id_r)^{-1} U_{i+1}^⊤ Z_k,
X_{k+1} = U_I V_I.    (21)
Assembling the above steps, we obtain our proposed algorithm, the proximal gradient & alternating minimization method, which we call ProGrAMMe and summarize below in Algorithm 1.
Algorithm 1: A Proximal Gradient & Alternating Minimization Method (ProGrAMMe)
1: Compute W̃ = W ⊙ W and L, set γ ∈ ]0, 2/L[; choose r > 0 and I ∈ N_+;
2: while not convergent do
3:   Y_k = X_k + a_k(X_k − X_{k−1}),    // inertial step //
4:   Z_k = Y_k − γ ∇Ψ(Y_k)^*((Ψ(Y_k) − F) ⊙ W̃),    // gradient descent //
5:   Initialize U_0 ∈ R^{m×r}, V_0 ∈ R^{r×n},
6:   for i = 0, ..., I − 1 do    // inner loop //
7:     U_{i+1} = Z_k V_i^⊤ (V_i V_i^⊤ + τγ Id_r)^{-1},
8:     V_{i+1} = (U_{i+1}^⊤ U_{i+1} + τγ Id_r)^{-1} U_{i+1}^⊤ Z_k,
9:   end for
10:  X_{k+1} = U_I V_I.
11: end while
12: return X_{k+1}
Remark 2.2.
• One highlight of our algorithm is that, via the variational formulation of the nuclear norm, we relax the convex subproblem (17) to the non-convex problem (19).
• Every step of Algorithm 1 requires initializing U_0, V_0 for the inner loop, and the simplest way is to use the U_I, V_I from the previous step.
• For the controlling parameter r, theoretically it does not make any difference as long as it is larger than the rank of the solution of (5). However, practically it is crucial to the performance of Algorithm 1. A detailed discussion is provided in Section 3.
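The following is a minimal sketch of Algorithm 1 for the matrix-completion instance Ψ(X) = M ⊙ X (Python/NumPy). Parameter names follow the paper; the default values, initialization and stopping rule are our own assumptions. Note that only small r × r systems are solved and no SVD is computed:

```python
import numpy as np

def programme(F, M, W, tau, r, gamma, a=0.3, inner_iters=1,
              max_iters=500, tol=1e-6, seed=0):
    """Sketch of ProGrAMMe (Algorithm 1) for the completion operator Psi(X) = M*X."""
    m, n = F.shape
    W2 = W * W                                    # Wtilde = W (Hadamard) W
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((r, n))
    X_prev = np.zeros((m, n))
    X = np.zeros((m, n))
    I_r = np.eye(r)
    for _ in range(max_iters):
        Y = X + a * (X - X_prev)                                  # inertial step
        Z = Y - gamma * (M * ((M * Y - F) * W2))                  # gradient descent step
        for _ in range(inner_iters):                              # SVD-free inner loop (20)
            U = Z @ V.T @ np.linalg.inv(V @ V.T + tau * gamma * I_r)
            V = np.linalg.inv(U.T @ U + tau * gamma * I_r) @ U.T @ Z
        X_prev, X = X, U @ V
        if np.linalg.norm(X - X_prev) <= tol * max(1.0, np.linalg.norm(X_prev)):
            break
    return X, U, V
```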
Remark 2.3. In the literature, several SVD-free approaches have been proposed. For example, the variational formulation was also considered in [11], where the resulting problem was solved by the method of Lagrange multipliers, while we directly plug the constraint into the objective. In [62], a dual characterization of the nuclear norm was used and an SVD-free gradient descent was designed. The benefits of our approach, as we shall see later, are a simple convergence analysis (see Section 2.2) and extensions to more general settings (see Section 2.3).
Remark 2.4 (Per iteration complexity).
Comparing Algorithm 1 to proximal gradient descent (16), the only difference is Lines 5-9. For proximal gradient descent, since an SVD is needed, the per iteration complexity is O(mn min{m, n}). For Algorithm 1, suppose I = 1; then the complexity of Lines 7-8 is O((m + n + r)r). It can be concluded that the smaller the value of r (while still larger than the rank of the solutions), the lower the per iteration complexity of Algorithm 1. As a result, the choice of r is crucial to the practical performance of Algorithm 1. Therefore, in Section 3, a detailed discussion is provided on how to choose r.
Relation with existing work
Our algorithm is closely related to proximal splitting methods and their variants, as the first step in deriving Algorithm 1 is the inertial proximal gradient descent (for example, proximal gradient descent [36] and its accelerated versions including FISTA [6, 14], as in [51, 55, 30] where FISTA was adopted to solve low-rank recovery problems).
Our model (5) and Algorithm 1 share similarities with those of [10, 15, 11], but there are some fundamental differences. First of all, all these works consider only the case Ψ = Id, i.e. Ψ is an identity mapping:
min_X (1/2)‖(X − F) ⊙ W‖² + τ‖X‖_*.    (22)
In [10, 11], the authors consider directly applying matrix factorization to (22), which results in: let r > 0,
min_{U ∈ R^{m×r}, V ∈ R^{r×n}} (1/2)‖(UV − F) ⊙ W‖² + (τ/2)(‖U‖² + ‖V‖²),    (23)
which is a smooth and bi-convex optimization problem. Our approach, on the other hand, only applies matrix factorization to the subproblem of proximal gradient descent (16).
It is worth noting that (23) shares the same continuity properties as (19), hence can be handled by an alternating minimization algorithm. Both (23) and (19) are also special cases of the following non-convex problem
min_{U, V} f(U, V) + τ₁ g₁(U) + τ₂ g₂(V),    (24)
where f(U, V) is differentiable with Lipschitz continuous gradient, τ₁, τ₂ > 0, and g₁(·), g₂(·) are (non-smooth) regularization terms for U, V, respectively. Problem (24) was well studied in [2, 8]; for instance, the following algorithm was proposed in [2]:
U_{k+1} = argmin_{U ∈ R^{m×r}} { f(U, V_k) + τ₁ g₁(U) + (α/2)‖U − U_k‖² },
V_{k+1} = argmin_{V ∈ R^{r×n}} { f(U_{k+1}, V) + τ₂ g₂(V) + (γ/2)‖V − V_k‖² },
where α, γ > 0. Specialized to (23), the above updates read
U_{k+1} = argmin_U { (1/2)‖(UV_k − F) ⊙ W‖² + (τ/2)‖U‖² + (α/2)‖U − U_k‖² },
V_{k+1} = argmin_V { (1/2)‖(U_{k+1}V − F) ⊙ W‖² + (τ/2)‖V‖² + (γ/2)‖V − V_k‖² }.
It can be observed that, though sharing similarities, our proposed algorithm is different from the above schemes. The same holds for the algorithm proposed in [8].
In this part, we provide the global convergence analysis of Algorithm 1. The key of our proof is rewriting Algorithm 1 as an inexact version of inertial proximal gradient descent (16), whose convergence property is well established in the literature. Such an equivalence is obtained based on the result below from [11, Theorem 1].
Lemma 2.5 ([11, Theorem 1]).
Let X̂ be the unique minimizer of (17) with rank r̂ = rank(X̂), and (Û, V̂) a solution of (19) with r ≥ r̂. Then X̂ = ÛV̂.
The above lemma implies that, although we relaxed the strongly convex problem (17) to a non-convex one (19), we can recover the unique minimizer of (17) by solving (19). In turn, we can cast Algorithm 1 back to the proximal gradient descent (16), possibly with approximation errors due to the finite-step inner loop, and then prove its convergence. To this end, we first propose the following inexact characterization of Algorithm 1.
Proposition 2.6 (Inexact inertial proximal gradient descent).
For Algorithm 1, let I ∈ N_+. Then Algorithm 1 is equivalent to the following inexact inertial proximal gradient descent
Y_k = X_k + a_k(X_k − X_{k−1}),
Z_k = Y_k − γ ∇Ψ(Y_k)^*((Ψ(Y_k) − F) ⊙ W̃),
X_{k+1} = prox_{τγ g}(Z_k + e_k),    (25)
where e_k accounts for the truncation error due to the finite value of I, and has the form
e_k = P_I(S_I + s_k)Q_I^⊤ − Z_k,
with P_I S_I Q_I^⊤ being the SVD of X_{k+1} and s_k a diagonal matrix whose diagonal elements s_k(j, j), j = 1, ..., min{m, n}, satisfy
s_k(j, j) = τγ if S(j, j) ≥ τγ,  and  s_k(j, j) ∈ [0, τγ] if S(j, j) < τγ.    (26)
Proof.
Let the SVD of Z_k be Z_k = PSQ^⊤. Suppose the inner loop of Algorithm 1 is run for infinitely many steps, i.e. (19) is solved exactly; then we have
X'_{k+1} = U_{+∞}V_{+∞} = prox_{τγ g}(Z_k) = P T_{τγ}(S) Q^⊤,
where T_{τγ}(S) is the soft-thresholding operator [7]. Now we truncate the inner loop to I finite steps, that is, X_{k+1} = U_I V_I = P_I S_I Q_I^⊤, where P_I S_I Q_I^⊤ is the SVD of X_{k+1} and I is the number of inner iterations. Apparently, as I → +∞ we have P_I → P, S_I → T_{τγ}(S) and Q_I → Q. Let e_k be such that Z_k + e_k = P_I(S_I + s_k)Q_I^⊤, where s_k is the diagonal matrix in (26). As a result, we get e_k = P_I(S_I + s_k)Q_I^⊤ − Z_k, which concludes the proof.
The inexact formulation of Algorithm 1 allows us to prove its convergence in a rather simple fashion, as (25) is nothing but a special case of the inertial proximal gradient descent discussed in [33]. As a consequence, we have the following result regarding the global convergence of Algorithm 1.
Proposition 2.7 (Convergence of Algorithm 1).
For Algorithm 1, let the inertial sequence {a_k}_{k∈N} be such that lim sup_k a_k < 1 and ∑_{k∈N} a_k‖X_k − X_{k−1}‖ < +∞. If, moreover, the error e_k is such that ∑_{k∈N} k‖e_k‖ < +∞, then there exists X* ∈ argmin(f + τg) to which the sequence {X_k}_{k∈N} generated by Algorithm 1 converges.
The result follows directly from [33, Theorem 3]; hence we omit the proof here and refer to [33] for detailed discussions.
Remark 2.8.
The condition on the error e_k implies that the inner problem needs to be solved with increasing accuracy, i.e. an increasing number of inner iterations. However, in practice, a fixed choice of I works quite well; see our numerical examples later. The summability of a_k‖X_k − X_{k−1}‖ can be guaranteed by certain choices of a_k; see [33, Theorem 4]. One can also use an online approach to determine a_k such that the summability condition holds. For instance, let a ∈ [0, 1] and c > 0, δ > 0; then a_k can be chosen as
a_k = min{ a, c / (k^{1+δ} ‖X_k − X_{k−1}‖) }.
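A minimal sketch of this online rule (the default values of a, c and δ below are illustrative, not taken from the paper):

```python
def inertial_parameter(k, diff_norm, a=0.3, c=1.0, delta=0.1):
    """Online choice of a_k from Remark 2.8: a_k * ||X_k - X_{k-1}|| <= c / k^(1+delta),
    so the summability condition of Proposition 2.7 holds automatically."""
    if diff_norm == 0.0:
        return a
    return min(a, c / (k ** (1.0 + delta) * diff_norm))
```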
Remark 2.9 (FISTA-like inertial parameter). If the error term e_k can be carefully controlled, for example by gradually increasing the value of I, then according to [4] the FISTA rule for updating a_k can be applied, e.g. a_k = (k − 1)/(k + d) for d > 2. One can then use the lazy-start strategy of [34] for further speed-up.
Remark 2.10.
In [33], the authors also studied the local linear convergence of proximal gradient descent type methods, under the notion of "partial smoothness" (Definition 3.1). However, due to the presence of the error term e_k, the local linear convergence of (25) is more complicated than that of [33], where the analysis was obtained only for the exact case. Therefore, for this aspect, we forgo the theoretical analysis and only provide numerical discussions in Section 3. Throughout this paper, our main focus is (5), whose loss function is a simple weighted least squares. In this scope, we discuss several generalizations of Algorithm 1, including more general loss functions (e.g. (6)), the non-convex setting, and non-linear Ψ.
General loss function
Algorithm 1 is loss-function agnostic: for proximal gradient descent type methods, the only condition required of the smooth part is that it is smoothly differentiable with Lipschitz continuous gradient. Therefore, we can apply Algorithm 1 to solve the more general model (6), and the only change we need to make to Algorithm 1 is Line 4, which now reads
Z_k = Y_k − γ ∇f(Y_k, F, W),
where ∇ denotes the gradient of f(X, F, W) with respect to X. The global convergence result stays the same for the above update.
Non-convex loss function
Algorithm 1 does not require convexity of the loss function. In the literature, the convergence properties of proximal gradient descent type methods for non-convex optimization are well studied; most results are obtained under the Kurdyka-Łojasiewicz inequality, owing to the pioneering work [3]. As Algorithm 1 is a special case of inexact proximal gradient descent, it can also be applied to solve problems where the loss function f(X, F, W) is non-convex. Parameter-wise, there are two main differences between the non-convex and convex cases:
• For the step-size γ, different from the convex case whose upper bound is 2/L, the bound reduces to 1/L in the non-convex case.
• The conditions on the error are different. For the non-convex case, the error should be such that a descent property of a certain stability function (see e.g. [3]) is maintained. As a result, a line search might be needed for the number of inner-loop iterations.
Remark 2.11.
While keeping the loss function as least squares, it can still be non-convex because of the operator Ψ, in which case Ψ is a non-linear smooth mapping instead of a linear one. For this case, as long as Ψ is such that the gradient of the loss is Lipschitz continuous, global convergence of Algorithm 1 can be guaranteed.
As discussed above, the choice of the parameter r is crucial to the performance of Algorithm 1: let r* be the rank of a solution X* of problem (5); then the closer the value of r to r*, the better the practical performance of Algorithm 1, and the best performance is obtained when r = r*. However, in general it is impossible to know r* a priori, and usually an overestimate of r* is provided, which damps the efficiency of the algorithm. In this section, we first discuss the rank identification property of Algorithm 1, and then discuss a rank continuation strategy which asymptotically achieves the optimal per iteration complexity.
In [33], for proximal gradient descent type methods with non-smooth regularization, it was shown that the sequence generated by these methods has a so-called "finite activity identification property". For the nuclear norm, this means that after a finite number of iterations there holds rank(X_k) = rank(X*) for all k large enough, where X* is the solution to which X_k converges. In the theorem below, we show that Algorithm 1 also has this finite rank identification property. We continue to use the notations (f and g) of Section 2.1.
Theorem 3.1 (Rank identification).
For Algorithm 1, suppose the conditions of Proposition 2.7 hold; then X_k converges to X* ∈ Argmin(f + τg). If, moreover, the following non-degeneracy condition holds
−∇Ψ(X*)^*((Ψ(X*) − F) ⊙ W̃) ∈ τ ri(∂‖X*‖_*),    (27)
then there exists K > 0 such that for all k ≥ K there holds rank(X_k) = rank(X*).
To prove the result, we need the notion of partial smoothness, which was first introduced in [32]. Let M be a C²-smooth embedded submanifold of R^n around a point x. To lighten notation, henceforth we say C²-manifold instead of C²-smooth embedded submanifold of R^n. The natural embedding of a submanifold M into R^n permits to define a Riemannian structure on M, and we simply say M is a Riemannian manifold. T_M(x) denotes the tangent space to M at any point near x in M.
Definition 3.1 (Partial smoothness).
Let g be proper closed and convex; g is said to be partly smooth at x relative to a set M containing x if ∂g(x) ≠ ∅ and the following smoothness, sharpness, and continuity properties of g at x relative to M hold:
Smoothness: M is a C²-manifold around x, and g restricted to M is C² around x;
Sharpness: the tangent space T_M(x) coincides with T_x = par(∂g(x))^⊥;
Continuity: the set-valued mapping ∂g is continuous at x relative to M.
The nuclear norm is partly smooth along the set of fixed-rank matrices [32]. Other examples of partly smooth functions include the ℓ₁-norm for sparsity, the ℓ_{1,2}-norm for group sparsity, etc.; we refer to [33] and the references therein for more examples of partly smooth functions.
Proof of Theorem 3.1.
Since f is locally C²-smooth around X*, the smooth perturbation rule of partly smooth functions [32, Corollary 4.7] ensures that f + τg is partly smooth at X* relative to M_{X*} := {X ∈ R^{m×n} : rank(X) = rank(X*)}. By assumption, the sequence X_k created by Algorithm 1 converges to X* ∈ argmin(f + τg). The non-degeneracy condition (27) is equivalent to 0 ∈ ri(∂(f + τg)(X*)). Now (25) is equivalent to
Y_k − γ∇f(Y_k) − X_{k+1} + e_k ∈ γτ ∂g(X_{k+1})
⟺ (Y_k − γ∇f(Y_k)) − (X_{k+1} − γ∇f(X_{k+1})) + e_k ∈ γ ∂(f + τg)(X_{k+1}).    (28)
By the Baillon-Haddad theorem [5], Id − γ∇f is non-expansive, whence we get
dist(0, ∂(f + τg)(X_{k+1})) ≤ (1/γ)(‖(Id − γ∇f)(Y_k) − (Id − γ∇f)(X_{k+1})‖ + ‖e_k‖)
  ≤ (1/γ)(‖Y_k − X_{k+1}‖ + ‖e_k‖)
  ≤ (1/γ)(‖X_k − X_{k+1}‖ + a_k‖X_k − X_{k−1}‖ + ‖e_k‖).    (29)
Since X_k is convergent and ‖e_k‖ → 0, we have dist(0, ∂(f + τg)(X_k)) → 0. Owing to our assumptions, f + τg is sub-differentially continuous at every point of its domain, in particular at X* for 0, which in turn entails (f + τg)(X_k) → (f + τg)(X*). Altogether, this shows that the conditions of [27, Theorem 5.3] are fulfilled, and the rank identification result follows.
To demonstrate the rank identification property of Algorithm 1, the following low-rank recovery problem is considered as an illustration:
min_{X ∈ R^{m×n}} (1/2)‖(Ψ(X) − F) ⊙ W‖² + τ‖X‖_*,
where Ψ is a random measurement matrix with Ψ(X) = Ψ vec(X), and F = Ψ vec(X̊) + ε for a low-rank X̊, with ε being random Gaussian noise. Moreover, we choose τ = ‖ε‖. The problem is solved with ProGrAMMe and with proximal gradient descent (PGD); as shown in Figure 1, the rank of X_k for both schemes eventually becomes constant.
[Figure 1 shows rank(X_k) versus iteration for ProGrAMMe and PGD.]
Figure 1: Rank identification of Algorithm 1.
As remarked above, the choice of r in Algorithm 1 is crucial to the practical performance of the method. To overcome the difficulty of obtaining a tight estimate of the rank of the minimizers, in this section we introduce a rank continuation strategy, see Algorithm 2, which adaptively adjusts the rank of the output and asymptotically attains the optimal per iteration complexity.
Algorithm 2: ProGrAMMe with Rank Continuation
1: Compute W̃ = W ⊙ W and L, set γ ∈ ]0, 2/L[; choose r > 0 and I ∈ N_+;
2: while not convergent do
3:   Y_k = X_k + a_k(X_k − X_{k−1}),    // inertial step //
4:   Z_k = Y_k − γ ∇Ψ(Y_k)^*((Ψ(Y_k) − F) ⊙ W̃),    // gradient descent //
5:   Initialize U_0 ∈ R^{m×r}, V_0 ∈ R^{r×n},
6:   for i = 0, ..., I − 1 do    // inner loop //
7:     U_{i+1} = Z_k V_i^⊤ (V_i V_i^⊤ + τγ Id_r)^{-1},
8:     V_{i+1} = (U_{i+1}^⊤ U_{i+1} + τγ Id_r)^{-1} U_{i+1}^⊤ Z_k,
9:   end for
10:  X_{k+1} = U_I V_I,
11:  r = rank(U_I).    // rank continuation //
12: end while
13: return X_{k+1}
Remark 3.2. In Line 11, instead of using rank(X_{k+1}) to update r, we choose to use rank(U_I), since rank(X_{k+1}) ≤ min{rank(U_I), rank(V_I)} and it is less computationally demanding to evaluate the rank of U_I than that of X_{k+1}.
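A sketch of the rank-continuation step in Line 11. The rank is estimated from the small factor U_I; the explicit truncation of both factors and the tolerance below are our own additions, not prescribed by Algorithm 2:

```python
import numpy as np

def continue_rank(U, V, rel_tol=1e-8):
    """Estimate the numerical rank of U_I and shrink both factors accordingly.
    The SVD here is of an m x r matrix with small r, hence cheap."""
    P, s, Qt = np.linalg.svd(U, full_matrices=False)
    r_new = max(int(np.sum(s > rel_tol * s[0])), 1)
    U_new = P[:, :r_new] * s[:r_new]     # keep the leading r_new directions of U
    V_new = Qt[:r_new, :] @ V            # so that U_new @ V_new ~= U @ V
    return U_new, V_new, r_new
```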
Remark 3.3. Though the initial value of r is no longer as important as in Algorithm 1, where r is fixed, it is still beneficial to have a relatively good estimate of rank(X*) as it can further reduce the computational cost of the algorithm. Also, it is not desirable to compute rank(U_I) at every iteration, and a practical approach is to do it at a regular interval.
To illustrate the performance of rank continuation, we consider a low-rank recovery problem with randomly missing entries:
min_{X ∈ R^{m×n}} (1/2)‖(Ψ ⊙ X − F) ⊙ W‖² + τ‖X‖_*,    (30)
where Ψ is a random binary mask with 50% of the entries equal to 0 and F = Ψ ⊙ X̊ + ε with rank(X̊) = 10 and ε being random Gaussian noise. For this case we set τ = ‖ε‖.
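As a usage illustration, an instance of (30) can be generated and fed to the programme sketch given after Algorithm 1; the dimensions, noise level, weights and parameters below are illustrative placeholders, not the values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, true_rank = 300, 300, 10
X0 = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
Psi = (rng.random((m, n)) < 0.5).astype(float)     # binary mask, about 50% observed
eps = 0.01 * rng.standard_normal((m, n))           # placeholder noise level
F = Psi * X0 + eps
W = Psi.copy()                                     # weight = mask (an assumption)
tau = np.linalg.norm(eps)                          # tau = ||eps|| as in the text
L = 1.0                                            # here ||grad Psi|| = 1 and max Wtilde = 1
gamma = 1.8 / L                                    # illustrative step-size in ]0, 2/L[

X_hat, U, V = programme(F, Psi, W, tau, r=50, gamma=gamma, a=0.3, inner_iters=1)
print(np.linalg.norm(X_hat - X0) / np.linalg.norm(X0))   # relative recovery error
```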
Comparison to SVD-based PGD/FISTA
We first compare the performances of PGD/FISTA, Algorithm 1 (ProGrAMMe) and Algorithm 2 (ProGrAMMe-RC), with the following settings:
• The step-size γ is the same for all these schemes. Note that this choice of γ exceeds the upper bound of the step-size of FISTA; however, for this example FISTA converges.
• For FISTA, we adopt the "lazy start" strategy proposed in [34], which is faster than the standard FISTA scheme.
• For both Algorithm 1 (ProGrAMMe) and Algorithm 2 (ProGrAMMe-RC), we fix I and a_k, and choose r = 500 for ProGrAMMe, which is also the initial value of r for Algorithm 2.
All schemes are stopped when the relative error ‖X_k − X_{k−1}‖ falls below a fixed tolerance, and we report the average wall-clock time over 10 runs for these schemes:
Schemes:            PGD      FISTA    ProGrAMMe   ProGrAMMe-RC
Time (in seconds):  127.77   104.70   14.68       ≈ 8
As shown by the magenta line in Figure 2 (a) for ProGrAMMe-RC, it has several jumps, which are due to the updates of r.
Remark 3.4.
For the FISTA scheme, one can also consider the adaptive restart scheme of [46], which can further improve the performance of FISTA. However, as long as SVD is involved, it cannot compete with ProGrAMMe for the considered relatively large-scale problem.
Effect of different starting rank
To further understand the advantage of rank continuation over the static scheme, we conduct a comparison of Algorithm 2 under different initial values of r. Precisely, we consider:
• four different initial values of r;
• r is updated every 10 steps.
We use "nRC" to denote ProGrAMMe without rank continuation and "RC" with rank continuation, and the result is shown in Figure 2 (b). We observe:
• Without rank continuation, the smaller the value of r, the better the performance of ProGrAMMe.
• For the smaller initial values (such as r = 50), rank continuation actually becomes slower than the static scheme, and the extra time is mainly the overhead of computing rank(U_I).
From the above observations, we conclude:
• For problems where a tight estimate of the rank of the solution can be obtained, one can simply consider Algorithm 1;
• When the rank of the solutions is difficult to estimate, rank continuation can be applied to achieve acceleration.
We leave the comparison of ProGrAMMe with inertia to the next section.
Figure 2: Comparison of PGD/FISTA, Algorithm 1 and Algorithm 2. (a) Comparison against PGD/FISTA; (b) Comparison between rank continuation and no continuation. Note that the lines of the rank continuation schemes have several jumps, which are due to the updates of r.
To understand the effects of inertial acceleration, and to validate the strengths and flexibility of our recovery model and algorithm, in this section we perform numerical experiments on several low-rank recovery problems. Throughout this section, we typically use two different versions of ProGrAMMe: (i) ProGrAMMe-1, which runs a single inner iteration in each outer iteration, and (ii) ProGrAMMe-ε, which terminates the inner loop when the relative error of the inner iterates reaches an ε precision or the maximum number of inner iterations is reached, whichever occurs first. Throughout the section, for ProGrAMMe-ε, we use a fixed small ε and set the maximum number of inner iterations to 20, unless otherwise specified.
We continue with the matrix completion problem (30) to study the effect of inertial acceleration. Both Algorithms 1 and 2 are tested; the settings of the tests are:
• For both algorithms, we initialize r with the value 500; in terms of step-size, we keep the previous choice of γ.
• In total, 5 different choices of the inertial parameter a_k are considered: four constant choices and the FISTA-like rule a_k = (k − 1)/(k + d).
The results are shown in Figure 3, whose left panel is the comparison for Algorithm 1 without rank continuation. We observe:
• In general, inertial schemes demonstrate faster performance than the standard scheme; for example, the largest constant choice of a_k is about 30% faster than a_k ≡ 0.
• Among the constant choices, the larger the value of a_k, the faster the performance of the inertial scheme.
• Lazy-start FISTA performs no better than the largest constant choice of a_k.
The comparison for the rank continuation scheme is provided in Figure 3 (b), where similar observations can be made.
Figure 3: Effects of inertia. (a) No rank continuation; (b) Rank continuation.
For these experiments, we generate the low-rank matrix L as a product of two independent full-rank matrices of size m × r with r < m, whose elements are independent and identically distributed (i.i.d.) samples from a normal distribution N(0, 1). We use two different types of sparse noise: Gaussian noise and arbitrary large noise. For each case, we generate the sparse matrix S such that, for a sparsity level α ∈ (0, 1), the sparse support is created randomly.
• For random Gaussian noise, we construct the sparse matrix S_random whose elements are i.i.d. N(0, 1) random variables and form F as F = L + η S_random, where η controls the noise level and is set proportional to the magnitude of the entries of L.
• For large noise, we generate the sparse matrix S_sparse such that its elements are chosen from a symmetric interval of large magnitude and construct F as F = L + S_sparse.
We fix m, let ρ_r = rank(L)/m, where rank(L) varies, and set the sparsity level α ∈ (0, 1). For each (ρ_r, α) pair, we apply RPCA GD [63], NCF [19], and ProGrAMMe to recover a low-rank matrix X. We consider the RMSE, ‖F − X‖/√(mn), as the performance measure.
For each class of noise, we run the experiments 10 times and plot the average RMSEs for each (ρ_r, α). Note that RPCA GD and NCF use an operator T_α[S] that does not perform an explicit Euclidean projection onto the sparse support of S, as the exact projection onto S is expensive [19, 63, 18]. Inspired by this, for sparse noise, we design our weight matrix such that it has large weights on the sparse support of S and 1 otherwise. However, for random noise, we simply use a random weight matrix. From Figure 4 we see that for both random and sparse noise, ProGrAMMe has the lowest average RMSEs. Moreover, while the two PCP algorithms show significant differences in their RMSE diagrams, our ProGrAMMe produces almost similar RMSE diagrams and obtains lower values compared to the PCP algorithms for both types of noise.
Effect of the condition number of W on the convergence of ProGrAMMe
Problem (22) is tricky, as the condition number κ_W of the weight matrix W plays an important role in convergence [47, 17]. We perform a detailed empirical convergence analysis of ProGrAMMe-ε and ProGrAMMe-1 on synthetic data in Appendix A.2 by varying κ_W and compare with proximal algorithms. We observe that proximal algorithms are sensitive to κ_W (a higher κ_W translates to slower convergence), but both ProGrAMMe-ε and ProGrAMMe-1 are less sensitive to κ_W, maintain a stable convergence profile for different κ_W, and converge faster than proximal algorithms in almost all cases.
Figure 4: RMSE ‖F − X‖/√(mn) for different methods. The top row is for random noise, the bottom row is for sparse large noise. For both cases, the minimum and maximum entries of W are fixed.
To validate the strengths and flexibility of our proposed algorithms, we use three real-world problems: structure from motion (SfM), matrix completion with noise, and background estimation from fully and partially observed datasets. We compare our algorithms against 15 state-of-the-art weighted and unweighted low-rank approximation algorithms (see Table 3 in the Appendix).
Structure from motion and photometric stereo
SfM uses local image features, without prior knowledge of locations or pose, to infer a three-dimensional structure or motion. For these experiments, we used three popular datasets: the non-rigid occluded motion of a giraffe in the background (for non-rigid SfM), a toy dinosaur (for affine SfM), and the light directions and surface normals of a static face with a moving light source (for photometric stereo); see more details in Figure 5. The datasets have 69.35%, 23.08%, and 58.28% observable entries, respectively. Therefore, we use a binary mask as the weight W such that W_{ij} = 1 if the data is available at the (i, j)th position and W_{ij} = 0 otherwise. With this setup, our formulation works as a matrix completion problem.
Figure 5: Sample frames from the static Face, Giraffe and toy dinosaur sequences. The data matrix F for the Face sequence is of size 2944 × 20 (the Giraffe and dinosaur matrices have 240 rows and 72 columns, respectively), and the prior ranks of the sequences are 4, 6, and 4, respectively.
We compared ProGrAMMe-1 and ProGrAMMe-ε with the damped Newton algorithm of [10]. Admittedly, [10] obtains the best factorization pair X = UV in the sense that it gives the minimum loss within the observable entries, ‖(F − X) ⊙ W‖/‖W‖, in all cases. Additionally, we also calculated the loss outside the observable entries, that is, ‖(F − X) ⊙ (1 − W)‖/‖1 − W‖, where the performance of ProGrAMMe is better in all cases. The results are provided in Table 2 below.
Dataset | Method | ‖(X − F) ⊙ W‖/‖W‖ | ‖(X − F) ⊙ (1 − W)‖/‖1 − W‖
Giraffe | ProGrAMMe-ε | - | -
Giraffe | [10] | - | 364.2476
Giraffe | ProGrAMMe-1 | 4.4081 | 244.5437
Toy Dino | ProGrAMMe-ε | - | -
Toy Dino | [10] | - | 318.666
Toy Dino | ProGrAMMe-1 | 0.023 | -
Face | ProGrAMMe-ε | - | 0.7338
Face | [10] | - | 0.98
Table 2: Comparisons between ProGrAMMe-ε, ProGrAMMe-1, and the damped Newton method on structure from motion and photometric stereo datasets.
Matrix completion with noise on power-grid data
Matrix completion is an important special case of weighted low-rank estimation problems, as for this problem the weights reduce to {0, 1}. This set of experiments is inspired by [15]. The dataset and the codes for BIRSVD are collected from the authors' website (https://homepage.univie.ac.at/saptarshi.das/index.html). In this experiment, the test dataset contains 48 hours of temperature data sampled every 30 minutes over 20 European cities. The (i, j)th entry of the matrix represents the ith temperature measurement for the jth city. As prior information, we use the fact that temperature varies smoothly over time and the data matrix is low-rank. For this dataset, we use a rank-2 approximation. We run each algorithm with random initialization 10 times and plot the average results in Figure 6. Each algorithm is run for 50 global iterations or until a tolerance set to machine precision is attained, whichever occurs first. For BIRSVD, ProGrAMMe-1, and BLF (with ℓ₂ loss) we use the same value of τ.
Figure 6: Fidelity within and out of the data for the problem of matrix completion with noise on power-grid data. Here Ω denotes the percentage of missing data. We use ProGrAMMe-1 for these experiments.
For more discussion and detailed comparisons we refer the readers to Section A.3 in the Appendix.
Background estimation
Background estimation

Background estimation and moving object detection from a video sequence is a classic problem in computer vision, and it plays an important role in human activity recognition, tracking, and video analysis from surveillance cameras. In the conventional matrix decomposition framework used for background estimation, the video frames are concatenated in a data matrix F, and the background matrix X is of low rank [45], as the background frames are often static or close to static. The foreground, in contrast, is usually sparse. The desired target rank of the background is hard to determine due to some inherent challenges, such as changing illumination, occlusion, dynamic foreground, reflection, and other noisy artifacts. Robust PCA algorithms such as iEALM [35], APG [61], and ReProCS [26] sidestep the need to specify the rank. However, in some cases the target rank and the sparsity level can be part of the user-defined hyperparameters, in which case one might instead use a different approach, as in [64, 19, 63].

For this set of experiments, we use ProGrAMMe-ε and compare it with two different types of robust PCP formulations (see [59] for an overview). Note that the PCP formulations have a way to detect the sparse outliers, whereas our formulation does not. We overcome this challenge by using large weights. Similar to the heuristic used in the experiments for synthetic data, we choose a random subset of entries from the set [m] × [n] and use large weights at those entries. This is similar to the idea used in [19, 63, 18]: for a specified sparsity percentage α, the operator T_α[S] performs an approximate projection onto the sparse support of S. We argue that by randomly selecting α% of the elements of [m] × [n] and assigning large weights to them, we obtain the same effect. Our empirical evidence in Figure 7 supports this.
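A minimal sketch of this weighting heuristic follows; the dimensions, the sparsity level α, and the weight magnitude are illustrative placeholders, not the values used in the experiments.

```matlab
% Build a weight matrix that mimics an approximate projection onto a sparse
% support: a random alpha-fraction of the entries of [m] x [n] gets a large weight.
m = 240; n = 600;                              % illustrative frame-matrix dimensions
alpha = 0.10;                                  % target sparsity level (10%)
wmax  = 1e3;                                   % large weight (illustrative magnitude)
W = ones(m, n);                                % baseline weight on all entries
idx = randperm(m * n, round(alpha * m * n));   % random alpha% of the entries
W(idx) = wmax;                                 % assign the large weights there
```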
In our experiments, we use eight different video sequences: (i) the Basic sequence from the Stuttgart synthetic dataset [9], (ii) four video sequences from the CDNet2014 dataset [57], and (iii) three video sequences from the SBI dataset [40, 1]. We use the Stuttgart video sequence extensively as it is equipped with a foreground ground truth for each frame. For iEALM and APG, we set λ = 1/√m and use µ proportional to 1/‖F‖ and ρ = 1.5, where ‖F‖ is the spectral norm (maximum singular value) of F. For Best pair RPCA, RPCA GD, NCF, GoDec, and our ProGrAMMe-ε, we set r = 2 and the target sparsity to 10%; additionally, for GoDec, we set q = 2. For GRASTA, we set the parameters the same as those mentioned on the authors' website (https://sites.google.com/site/hejunzz/grasta). The qualitative analysis of the recovered backgrounds and foregrounds suggests that our method recovers a visually similar or better quality background and foreground compared to the other methods. Note that RPCA GD and ReProCS recover a fragmentary foreground with more false positives; moreover, GRASTA, iEALM, and APG cannot remove the static foreground object. See Section A.4 in the Appendix for more qualitative results (Figure 13) and detailed quantitative results (Figure 14) on MSSIM and PSNR.

Figure 7: Sample video frames (frames 100, 420, and 570) from the Stuttgart Basic video sequence. ProGrAMMe-ε provides a visually high quality background.

Background estimation from partially observed/missing data
We randomly select the set of observable entries in the data matrix F and perform our experiments on the Stuttgart Basic video. For this set of experiments, we use ProGrAMMe-1. As this is a missing-data case, we use a binary mask as the weight for ProGrAMMe-1. Figure 8 shows the qualitative results on differently subsampled videos. We provide a detailed quantitative evaluation of ProGrAMMe with respect to the ε-proximity metric d_ε(X, Y) of [19] in recovering the foreground objects, and report the execution time for different missing-data cases, in Figure 15 in Section A.5 of the Appendix.
Figure 8: Sample video frames from the Stuttgart Basic video sequence for the missing-data case. From top to bottom we use Ω = …, …, and 0.6, respectively.

In this paper, we proposed a generic weighted low-rank recovery model and designed an SVD-free fast algorithm for solving it. Our model covers several existing low-rank approaches in the literature as special cases and easily extends to the non-convex setting. Our proposed algorithm combines the proximal gradient descent method with the variational formulation of the nuclear norm, and hence does not require computing an SVD at each step. This makes the algorithm highly scalable to larger data, with a lower per-iteration complexity than methods that require an SVD. Moreover, based on a rank identification property, we designed a rank continuation scheme which asymptotically achieves the minimal per-iteration complexity. Numerical experiments on various problems and settings were performed, from which we observe superior performance of our proposed algorithm compared to a vast class of weighted and unweighted low-rank algorithms.
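To make the SVD-free idea concrete, the sketch below contrasts a standard SVT-based proximal gradient step with a step on a factored surrogate built from the variational characterization ‖X‖* = min over X = AB' of (‖A‖_F² + ‖B‖_F²)/2 (valid when the factor width is at least rank(X)). This is only an illustration of the principle, not the paper's Algorithm 1; the dimensions, step size, and τ are placeholders.

```matlab
% Illustrative comparison: one proximal gradient step with SVT versus one
% step on a factored, SVD-free surrogate of the same weighted objective
%   min_X 0.5*||(X - F).*W||_F^2 + tau*||X||_*
m = 200; n = 150; r = 5;
F = randn(m, n); W = rand(m, n);            % placeholder data and weights
gamma = 1 / max(W(:))^2;                    % step size for the smooth part
tau   = 0.1;                                % nuclear-norm weight (illustrative)

% (a) SVT-based step: gradient step, then prox of tau*||.||_* (needs a full SVD).
X = zeros(m, n);
G = X - gamma * ((X - F) .* W.^2);          % gradient of the weighted data term
[U, S, V] = svd(G, 'econ');
X = U * max(S - gamma * tau, 0) * V';       % singular value soft-thresholding

% (b) SVD-free step: keep X = A*B' and penalize 0.5*(||A||_F^2 + ||B||_F^2);
% only m x r and n x r matrix products appear, no SVD.
A = randn(m, r); B = randn(n, r);
R    = (A * B' - F) .* W.^2;                % weighted residual
Anew = (1 - gamma * tau) * A - gamma * (R  * B);
Bnew = (1 - gamma * tau) * B - gamma * (R' * A);
A = Anew;  B = Bnew;
```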
Algorithm | Abbr. | Appears in | Ref.
Inexact Augmented Lagrange Method of Multipliers | iEALM | Fig. 14 | [35]
Proximal Gradient | PG | Fig. 9, 10 | [30, 51]
Accelerated Proximal Gradient | APG | Fig. 7 | [30, 51]
Accelerated Proximal Gradient-II | APG-II | Fig. 9, 10 | [61]
Grassmannian Robust Adaptive Subspace Tracking Algorithm | GRASTA | Fig. 7, 8, 13, 14 | [28]
Go Decomposition | GoDec | Fig. 7, 8, 13 | [64]
Robust PCA Gradient Descent | RPCA GD | Fig. 4, 7, 8, 13, 14, 15 | [63]
Robust PCA Nonconvex Feasibility | NCF | Fig. 4, 7, 8, 13, 15 | [19]
Recursive Projected Compressed Sensing | ReProCS | Fig. 7, 8, 13, 14 | [26]
Best pair RPCA | | Fig. 15 | [18]
Fixed point Bregman | FPCA | Fig. 6, 11 | [39]
Expectation Maximization | EM | Fig. 6, 11, 12 | [53]
Bi-iterative regularized SVD | BIRSVD | Fig. 6, 11, 12 | [15]
Bilinear factorization | BLF | Fig. 6, 11 | [11]
Damped Newton | | Table 2 | [10]
Table 3: Algorithms compared in this paper.
A Addendum to the numerical experiments
In this section, we add some extra numerical experiments that complement the experiments and other claims in the main paper.
A.1 Table of baseline methods
In Table 3 we summarize all the methods compared in this paper.
A.2 Convergence behavior
In this section, we demonstrate the convergence of our algorithm(s). For this purpose, we generate the low-rank matrix L as a product of two independent full-rank matrices of size m × r with r < m, whose elements are independent and identically distributed (i.i.d.) samples from the normal distribution N(0, 1). We add Gaussian noise by generating a noise matrix E whose elements are i.i.d. N(0, 1) random variables as well, and construct F = L + E. We fix min_{i,j} W_{ij} and choose max_{i,j} W_{ij} from a predefined set Λ of values. At each instance i, we create an m × m weight matrix W by fusing min_{i,j} W_{ij} with Λ_i using the MATLAB function randperm.

We compare our ProGrAMMe-ε (Algorithm 1), its inexact counterpart ProGrAMMe-1, the proximal gradient (PG) algorithm, and its accelerated version, the accelerated proximal gradient (APG), in these experiments. For different condition numbers κ_W of the weight matrix W, we plot the functional value Φ(X_k) versus iterations in Figure 9, and the difference between consecutive iterates, ‖X_{k+1} − X_k‖, versus iterations in Figure 10. Note that, by construction, κ_W ranges from 1238.021 to 21043.1574. The convergence plots support our claims: although problem (5) belongs to the class of problems (8), the general algorithms used in [30, 51, 55, 37] fail to provide good convergence results when κ_W is large; our approaches are faster compared to those general approaches; and the performance of the exact and inexact versions of ProGrAMMe is essentially the same.
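A sketch of this synthetic construction is given below. The dimensions, the minimum weight, the fraction of entries receiving the large weight, and the entries of Λ are illustrative placeholders; their exact values are specified in the text.

```matlab
% Synthetic weighted low-rank recovery instance: F = L + E with a weight
% matrix W whose largest entry is drawn from a set Lambda (placeholder values).
m = 100; r = 10;
L = randn(m, r) * randn(r, m);          % rank-r ground truth from i.i.d. N(0,1) factors
E = randn(m, m);                        % i.i.d. N(0,1) noise
F = L + E;
wmin   = 1;                             % illustrative minimum weight
Lambda = [1e1, 1e2, 1e3, 5e3, 1e4];     % illustrative set of maximum weights
i      = 3;                             % pick one instance
W      = wmin * ones(m, m);
idx    = randperm(m * m, round(0.5 * m * m));  % fuse the two levels at random positions
W(idx) = Lambda(i);
ratioW = max(W(:)) / min(W(:));         % ratio of largest to smallest weight
```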
A.3 Matrix completion with missing data

In this part, we conduct extensive tests on the power-grid missing-data problem. For the first set of experiments, we ran the methods until the relative error ‖X_{k+1} − X_k‖/‖X_k‖ or the consecutive-iterate error ‖X_{k+1} − X_k‖ falls below machine precision, or a maximum number of iterations (500) is reached, whichever happens first.
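In code, this stopping rule amounts to the following sketch; the update function is a stand-in for one iteration of any of the tested solvers.

```matlab
% Stop when the relative error or the consecutive-iterate error drops below
% machine precision, or after 500 iterations, whichever happens first.
update = @(X) 0.5 * X;          % placeholder for one solver iteration
X = randn(20, 20); maxit = 500;
for k = 1:maxit
    Xnew   = update(X);
    relerr = norm(Xnew - X, 'fro') / norm(X, 'fro');
    abserr = norm(Xnew - X, 'fro');
    X = Xnew;
    if relerr < eps || abserr < eps, break; end
end
```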
Figure 9: Convergence in terms of functional value of the proximal gradient applied to the original problem (5) (Direct PG), the accelerated proximal gradient-II (APG-II), ProGrAMMe-1, and ProGrAMMe-ε. For all problems, the same τ (proportional to 1/λ) is used.

Figure 11 shows that the performance of ProGrAMMe-1 improves as it runs for more global iterations, even though it requires only a fraction of the execution time of the other methods. Next, in Figure 12, we ran more samples (50) with the same stopping criteria as before but with an increased number of iterations (1000). From Figure 12 we find that, for fidelity inside the sample, ProGrAMMe-1 performs better than the two previously best performing methods, EM and BIRSVD. For fidelity outside the sample, however, BIRSVD is still the best, although the behavior of ProGrAMMe improves compared to the previous cases.
Figure 10: Convergence in terms of the difference between consecutive iterates (‖X_{k+1} − X_k‖) of the proximal gradient applied to the original problem (5) (Direct PG), the accelerated proximal gradient-II (APG-II), ProGrAMMe-ε (Algorithm 1), and ProGrAMMe-1. For all problems, the same τ (proportional to 1/λ) is used.

Figure 11: Fidelity within and out of the data, run for a larger number of iterations (500). (a) Fidelity within data; (b) fidelity out of data; (c) execution time. Here Ω denotes the percentage of missing data. The last bar diagram shows the execution time of the different algorithms for different subsample ratios Ω. Although BLF with the ℓ loss has the least execution time, its performance is not so good for this set of experiments.

Figure 12: Fidelity within and out of the sample, run for more samples (50) over a higher number of iterations (1000). (a) Fidelity within the sample; (b) fidelity out of the sample. Here Ω denotes the percentage of missing data.

A.4 Background estimation

We show the qualitative results on 7 video sequences from the CDNet 2014 and SBI datasets in Figure 13. In almost all sequences, our ProGrAMMe-ε performs consistently well compared to the other state-of-the-art methods. We do not include iEALM or APG due to their higher execution time.

In Figure 14, we use two robust quantitative measures for the background estimation experiments on the Stuttgart Basic video: the peak signal-to-noise ratio (PSNR) and the mean structural similarity index measure (SSIM) [58]. PSNR is defined as 10 log₁₀ of the ratio of the peak signal energy to the mean square error (MSE) between the processed video signal and the ground truth. Let F(:, i) − X̂(:, i) be the reconstructed vectorized foreground frame and G(:, i) the corresponding ground truth frame; then the PSNR is 10 log₁₀(M_I²/MSE), where MSE = (1/(mn)) ‖F(:, i) − X̂(:, i) − G(:, i)‖² and M_I = 255 is the maximum possible pixel value of the image, as the pixels are represented using 8 bits per sample. For a reconstructed image with 8-bit depth, the PSNR lies between 30 and 50 dB, and higher is better, since we minimize the MSE between the images relative to the maximum signal value of the image.

For both measures, we focus on how the high-intensity regions of the image come through the noise and pay much less attention to the low-intensity regions. We remove the noisy components from the recovered foreground, F(:, i) − X̂(:, i), by using a threshold ε′, that is, we set the components below ε′ in E to 0. In our experiments, we set ε′ to a small positive threshold. To calculate the SSIM of each recovered foreground video frame, we consider an 11 × 11 Gaussian window with standard deviation (σ) 1.5 and take the corresponding ground truth as the reference image. Among the methods tested, ProGrAMMe-ε has the highest average SSIM (or MSSIM). To compare the PSNR of the recovered foreground frames, we use ProGrAMMe-ε, GRASTA [28], recursive projected compressive sensing (ReProCS) [26], inexact ALM (iEALM) [35], and RPCA GD [63]. iEALM has the highest average PSNR among all the methods, 30.05 dB, whereas our ProGrAMMe-ε has an average PSNR of 29.45 dB. However, ProGrAMMe-ε needs 9.48 seconds on average to produce the results, compared to an average execution time of 183 seconds for iEALM.
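A minimal sketch of the PSNR computation for one frame follows; the frame size, pixel values, and the threshold ε′ below are placeholders, not the values used in the experiments.

```matlab
% PSNR of one recovered foreground frame against its ground truth.
% f stands for F(:,i) - Xhat(:,i), g for G(:,i); both are vectorized frames.
m = 144; n = 176;                             % illustrative frame size
f = double(randi(255, m * n, 1));             % placeholder recovered foreground
g = double(randi(255, m * n, 1));             % placeholder ground-truth frame
epsp = 1;                                     % placeholder for the threshold epsilon'
f(abs(f) < epsp) = 0;                         % suppress low-intensity noisy components
MI   = 255;                                   % peak value for 8-bit images
MSE  = norm(f - g)^2 / (m * n);
PSNR = 10 * log10(MI^2 / MSE);
```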
Figure 13: Sample video frames from the CDNet 2014 and SBI datasets. ProGrAMMe provides a visually high quality background for almost all sequences. The red bounding boxes in the recovered BG denote shadows, ghosting effects of FG objects, static FG objects, etc.
A.5 Background estimation from partially observed/missing data
For background estimation on partially observed data, we use the ε-proximity metric d_ε(X, Y) proposed in [19] on the Stuttgart Basic video. The performance of RPCA feasibility [19] with respect to d_ε(X, Y) stays stable for all subsample ratios Ω. The performance of the best pair RPCA is also stable except for Ω = 0.3. The performance of RPCA GD [63] keeps degrading as we decrease Ω. Surprisingly, the performance of ProGrAMMe-1 gets better in this particular experiment as we decrease Ω. Furthermore, the average execution time of ProGrAMMe-1 is stable for different Ω, at around 8 seconds, while the next best average execution time, 19.67 seconds, is recorded for RPCA GD. The average execution times of NCF and best pair are 44 and 43 seconds, respectively.

Figure 14: Quantitative comparison of different algorithms on the Stuttgart Basic sequence. We compare the foreground recovered by different methods with respect to the foreground GT available for each frame on two different metrics: (a) mean SSIM and (b) PSNR.

Figure 15: Quantitative comparison between different algorithms on the Stuttgart Basic sequence for different levels of partially observed/missing data with respect to the d_ε(X, Y) metric. Panels (a)-(g) correspond to different numbers of observed entries |Ω| (given as fractions of (m, n)); panel (h) is a bar diagram showing the execution time of the different algorithms for each Ω. ProGrAMMe-1 has the least average execution time in all scenarios.

References

[1] http://sbmi2015.na.icar.cnr.it/SBIdataset.html.
[2] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality.
Mathematics of operationsresearch , 35(2):438–457, 2010.[3] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems:proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods.
Mathematical Program-ming , 137(1-2):91–129, 2013.[4] J.-F. Aujol and C. Dossal. Stability of over-relaxations for the forward-backward algorithm, application to fista.
SIAM Journal on Optimization , 25(4):2408–2433, 2015.[5] J. B. Baillon and G. Haddad. Quelques propriétés des opérateurs angle-bornés etn-cycliquement monotones.
IsraelJournal of Mathematics , 26(2):137–150, 1977.[6] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAMJournal on Imaging Science , 2(1):183–202, 2009.[7] T. Boas, A. Dutta, X. Li, K. P. Mercier, and E. Niderman. Shrinkage function and its applications in matrixapproximation.
Electronic Journal of Linear Algebra , 32:163–171, 2017.[8] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmoothproblems.
Mathematical Programming , 146(1-2):459–494, 2014.[9] S. Brutzer, B. Hoferlin, and G. Heidemann. Evaluation of background subtraction techniques for video surveillance.In
IEEE Computer Vision and Pattern Recognition , pages 1937–1944, 2011.[10] A. M. Buchanan and A. W. Fitzgibbon. Damped Newton algorithms for matrix factorization with missing data. In
IEEE Computer Vision and Pattern Recognition, volume 2, pages 316–322, 2005.
[11] R. Cabral, F. D. L. Torre, J. P. Costeira, and A. Bernardino. Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition. In , pages 2488–2495, 2013.
[12] J. F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[13] E. J. Candès and B. Recht. Exact matrix completion via convex optimization.
Foundations of ComputationalMathematics , 9(6):717–772, 2009.[14] A. Chambolle and C. Dossal. On the convergence of the iterates of the “fast iterative shrinkage/thresholdingalgorithm”.
Journal of Optimization Theory and Applications , 166(3):968–982, 2015.[15] S. Das and A. Neumaier. Regularized low rank approximation of weighted data sets, 2011.[16] A. Dutta.
Weighted Low-Rank Approximation of Matrices:Some Analytical and Numerical Aspects . PhD thesis,University of Central Florida, 2016.[17] A. Dutta, B. Gong, X. Li, and M. Shah. Weighted singular value thresholding and its application to backgroundestimation, 2017. arXiv:1707.00133.
[18] A. Dutta, F. Hanzely, J. Liang, and P. Richtárik. Best pair formulation & accelerated scheme for non-convex principal component pursuit.
IEEE Transactions on Signal Processing , 68:6128–6141, 2020.[19] A. Dutta, F. Hanzely, and P. Richtárik. A nonconvex projection method for robust PCA. In
Thirty-Third AAAIConference on Artificial Intelligence (AAAI-19) , pages 1468–1476, 2018.[20] A. Dutta and X. Li. A fast algorithm for a weighted low rank approximation. In , pages 93–96, 2017.[21] A. Dutta and X. Li. On a problem of weighted low-rank approximation of matrices.
SIAM Journal on MatrixAnalysis and Applications , 38(2):530–553, 2017.[22] A. Dutta and X. Li. Weighted low rank approximation for background estimation problems. In
The IEEE Interna-tional Conference on Computer Vision Workshops (ICCVW) , pages 1853–1861, 2017.[23] A. Dutta, X. Li, and P. Richtárik. Weighted low-rank approximation of matrices and background modeling, 2018.arXiv:1804.06252.[24] C. Eckart and G. Young. The approximation of one matrix by another of lower rank.
Psychometrika , 1(3):211–218,1936.[25] E. Grave, G. R Obozinski, and F. R Bach. Trace lasso: a trace norm regularization for correlated designs. In
Advances in Neural Information Processing Systems , pages 2187–2195, 2011.[26] H. Guo, C. Qiu, and N. Vaswani. An online algorithm for separating sparse and low-dimensional signal sequencesfrom their sum.
IEEE Transactions on Signal Processing , 62(16):4284–4297, 2014.[27] W. L. Hare and A. S. Lewis. Identifying active constraints via partial smoothness and prox-regularity.
Journal ofConvex Analysis , 11(2):251–266, 2004.[28] J. He, L. Balzano, and A. Szlam. Incremental gradient on the Grassmannian for online foreground and backgroundseparation in subsampled video.
IEEE Computer Vision and Pattern Recognition , pages 1937–1944, 2012.[29] M. Jaggi, M. Sulovsk, et al. A simple algorithm for nuclear norm regularized problems. In
Proceedings of the 27thinternational conference on machine learning (ICML-10) , pages 471–478, 2010.[30] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In
Proceedings of the 26th AnnualInternational Conference on Machine Learning , pages 457–464, 2009.[31] I. T. Jolliffee.
Principal component analysis . Springer-Verlag, second edition, 2002.[32] A. S. Lewis. Active sets, nonsmoothness, and sensitivity.
SIAM Journal on Optimization , 13(3):702–725, 2003.[33] J. Liang, J. Fadili, and G. Peyré. Activity identification and local linear convergence of forward–backward-typemethods.
SIAM Journal on Optimization , 27(1):408–437, 2017.[34] J. Liang, T. Luo, and C. Schönlieb. Improving fista: Faster, smarter and greedier. arXiv preprint arXiv:1811.01430 ,2018.[35] Z. Lin, M. Chen, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rankmatrices, 2010. arXiv1009.5055.[36] P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators.
SIAM Journal on NumericalAnalysis , 16(6):964–979, 1979.[37] Y. Liu, H. Cheng, F. Shang, and J. Cheng. Nuclear norm regularized least squares optimization on grassmannianmanifolds. In
Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence , UAI’14, pages515–524, 2014.[38] W. S. Lu, S. C. Pei, and P. H. Wang. Weighted low-rank approximation of general complex matrices and itsapplication in the design of 2-d digital filters.
IEEE Transactions on Circuits and Systems I: Fundamental Theoryand Applications , 44(7):650–655, 1997.[39] S. Ma, D. Goldfarb, and L. Chen. Fixed point and bregman iterative methods for matrix rank minimization.
Math-ematical Programming , 128(1-2):321–353, 2011.
[40] L. Maddalena and A. Petrosino. Towards benchmarking scene background initialization. In
New Trends in ImageAnalysis and Processing – ICIAP 2015 Workshops , pages 469–476, 2015.[41] J. H. Manton, R. Mehony, and Y. Hua. The geometry of weighted low-rank approximations.
IEEE Transactions onSignal Processing , 51(2):500–514, 2003.[42] I. Markovsky. Low-rank approximation: algorithms, implementation, applications, 2012. Springer.[43] I. Markovsky, J. C. Willems, B. De Moor, and S. Van Huffel.
Exact and approximate modeling of linear systems: abehavioral approach, Number 11 in Monographs on Mathematical Modeling and Computation . SIAM, 2006.[44] G. Mateos and G. Giannakis. Robust PCA as bilinear decomposition with outlier-sparsity regularization.
IEEETransaction on Signal Processing , 60(10):5176–5190, 2012.[45] N. Oliver, B. Rosario, and A. Pentland. A Bayesian computer vision system for modeling human interactions. In
International Conference on Computer Vision Systems , pages 255–272, 1999.[46] B. O’donoghue and E. Candes. Adaptive restart for accelerated gradient schemes.
Foundations of computationalmathematics , 15(3):715–732, 2015.[47] T. Pong, P. Tseng, S. Ji, and J. Ye. Trace norm regularization: Reformulations, algorithms, and multi-task learning.
SIAM J. on Optimization , 20(6):3465–3489, 2010.[48] I. Razenshteyn, Z. Song, and D. P. Woodruff. Weighted low rank approximations with provable guarantees. In
Proceedings of the forty-eighth annual ACM symposium on Theory of Computing , pages 250–263. ACM, 2016.[49] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear normminimization.
SIAM Review , 52(3):471–501, 2010.[50] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In
Pro-ceedings of the 22Nd International Conference on Machine Learning , ICML ’05, pages 713–719, 2005.[51] Z. Shen, K. Toh, and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized linear leastsquares problems.
Pacific J. Optim , pages 615–640, 2009.[52] D. Shpak. A weighted-least-squares matrix decomposition with application to the design of 2-d digital filters.
Proceedings of IEEE 33rd Midwest Symposium on Circuits and Systems , pages 1070–1073, 1990.[53] N. Srebro and T. Jaakkola. Weighted low-rank approximations. , pages 720–727, 2003.[54] M. Tao and J. Yang. Recovering low-rank and sparse components of matrices from incomplete and noisy observa-tions.
SIAM Journal on Optimization , 21(1):57–81, 2011.[55] K. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems.2009.[56] K. Usevich and I. Markovsky. Variable projection methods for affinely structured low-rank approximation inweighted 2-norms.
Journal of Computational and Applied Mathematics , 272:430–448, 2014.[57] Y. Wang, P.-M. Jodoin, F. Porikli, J. Konrad, Y. Benezeth, and P. Ishwar. Cdnet 2014: an expanded change detectionbenchmark dataset. In
Proceedings of the IEEE conference on computer vision and pattern recognition workshops ,pages 387–394, 2014.[58] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility tostructural similarity.
IEEE Transaction on Image Processing , 13(4):600–612, 2004.[59] F. Wen, R. Ying, P. Liu, and T.-K. Truong. Nonconvex regularized robust pca using the proximal block coordinatedescent algorithm.
IEEE Transactions on Signal Processing , 67(20):5402–5416, 2019.[60] Z. Wen, W. Yin, and Y. Zhang. Solving a low-rank factorization model for matrix completion by a nonlinearsuccessive over-relaxation algorithm.
Mathematical Programming Computation , 4(4):333–361, 2012.[61] J. Wright, Y. Peng, Y. Ma, A. Ganseh, and S. Rao. Robust principal component analysis: exact recovery of corr-puted low-rank matrices by convex optimization.
Proceedings of 22nd Advances in Neural Information Processingsystems , pages 2080–2088, 2009.
[62] Y. Xiao, L. Li, T. Yang, and L. Zhang. SVD-free convex-concave approaches for nuclear norm regularization. In
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17 , pages 3126–3132, 2017.[63] X. Yi, D. Park, Y. Chen, and C. Caramanis. Fast algorithms for robust PCA via gradient descent.
Advances inNeural Information Processing systems , pages 361–369, 2016.[64] T. Zhou and D. Tao. Godec: Randomized low-rank and sparse matrix decomposition in noisy case. In
Proceedings of the 28th International Conference on Machine Learning (ICML), pages 33–40, 2011.