ACCELERATED LINEARIZED BREGMAN METHOD

BO HUANG†, SHIQIAN MA∗†, AND DONALD GOLDFARB†

June 21, 2011

Abstract.

In this paper, we propose and analyze an accelerated linearized Bregman (ALB) method for solving the basis pursuit and related sparse optimization problems. This accelerated algorithm is based on the fact that the linearized Bregman (LB) algorithm is equivalent to a gradient descent method applied to a certain dual formulation. We show that the LB method requires $O(1/\epsilon)$ iterations to obtain an $\epsilon$-optimal solution and the ALB algorithm reduces this iteration complexity to $O(1/\sqrt{\epsilon})$ while requiring almost the same computational effort on each iteration. Numerical results on compressed sensing and matrix completion problems are presented that demonstrate that the ALB method can be significantly faster than the LB method.

Key words.

Convex Optimization, Linearized Bregman Method, Accelerated Linearized Bregman Method, Compressed Sensing, Basis Pursuit, Matrix Completion

AMS subject classifications.

1. Introduction.

In this paper, we are interested in the following optimization problem
$$\min_{x \in \mathbb{R}^n} J(x) \quad \text{s.t.} \quad Ax = b, \tag{1.1}$$
where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ and $J(x)$ is a continuous convex function. An important instance of (1.1) is the so-called basis pursuit problem when $J(x) := \|x\|_1 = \sum_{j=1}^n |x_j|$:
$$\min_{x \in \mathbb{R}^n} \|x\|_1 \quad \text{s.t.} \quad Ax = b. \tag{1.2}$$
Since the development of the new paradigm of compressed sensing [9, 11], the basis pursuit problem (1.2) has become a topic of great interest. In compressed sensing, $A$ is usually the product of a sensing matrix $\Phi$ and a transform basis matrix $\Psi$, and $b$ is a vector of the measurements of the signal $s = \Psi x$. The theory of compressed sensing guarantees that the sparsest solution (i.e., representation of the signal $s = \Psi x$ in terms of the basis $\Psi$) of $Ax = b$ can be obtained by solving (1.2) under certain conditions on the matrix $\Phi$ and the sparsity of $x$. This means that (1.2) gives the optimal solution of the following NP-hard problem [21]:
$$\min_{x \in \mathbb{R}^n} \|x\|_0 \quad \text{s.t.} \quad Ax = b, \tag{1.3}$$
where $\|x\|_0$ counts the number of nonzero elements of $x$.

Matrix generalizations of (1.3) and (1.2), respectively, are the so-called matrix rank minimization problem
$$\min_{X \in \mathbb{R}^{m \times n}} \operatorname{rank}(X) \quad \text{s.t.} \quad \mathcal{A}(X) = d, \tag{1.4}$$
and its convex relaxation, the nuclear norm minimization problem:
$$\min_{X \in \mathbb{R}^{m \times n}} \|X\|_* \quad \text{s.t.} \quad \mathcal{A}(X) = d, \tag{1.5}$$
where $\mathcal{A} : \mathbb{R}^{m \times n} \to \mathbb{R}^p$ is a linear operator, $d \in \mathbb{R}^p$, and $\|X\|_*$ (the nuclear norm of $X$) is defined as the sum of singular values of matrix $X$. A special case of (1.4) is the matrix completion problem:
$$\min_{X \in \mathbb{R}^{m \times n}} \operatorname{rank}(X) \quad \text{s.t.} \quad X_{ij} = M_{ij}, \ \forall (i,j) \in \Omega, \tag{1.6}$$
whose convex relaxation is given by:
$$\min_{X \in \mathbb{R}^{m \times n}} \|X\|_* \quad \text{s.t.} \quad X_{ij} = M_{ij}, \ \forall (i,j) \in \Omega. \tag{1.7}$$
The matrix completion problem has many interesting applications in online recommendation systems, collaborative filtering [35, 36], etc., including the famous Netflix problem [34]. It has been proved that under certain conditions, the solutions of the NP-hard problems (1.4) and (1.6) are given respectively by solving their convex relaxations (1.5) and (1.7), with high probability (see, e.g., [31, 8, 10, 30, 15]).

∗ Corresponding author.
† Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027. Email: {bh2359,sm2756,goldfarb}@columbia.edu. Research supported in part by NSF Grants DMS 06-06712 and DMS 10-16571, ONR Grants N00014-03-0514 and N00014-08-1-1118, and DOE Grants DE-FG01-92ER-25126 and DE-FG02-08ER-25856.

The linearized Bregman (LB) method was proposed in [42] to solve the basis pursuit problem (1.2). The method was derived by linearizing the quadratic penalty term in the augmented Lagrangian function that is minimized on each iteration of the so-called Bregman method introduced in [27] while adding a prox term to it. The linearized Bregman method was further analyzed in [5, 7, 41] and applied to solve the matrix completion problem (1.7) in [5].

Throughout this paper, we will sometimes focus our analysis on the basis pursuit problem (1.2). However, all of the analysis and results can be easily extended to (1.5) and (1.7). The linearized Bregman method depends on a single parameter $\mu > 0$ and solves the problem
$$\min_{x \in \mathbb{R}^n} g_\mu(x) := \|x\|_1 + \frac{1}{2\mu}\|x\|_2^2 \quad \text{s.t.} \quad Ax = b, \tag{1.8}$$
rather than the problem (1.2). Recently it was shown in [41] that the solution to (1.8) is also a solution to problem (1.2) as long as $\mu$ is chosen large enough. Furthermore, it was shown in [41] that the linearized Bregman method can be viewed as a gradient descent method applied to the Lagrangian dual of problem (1.8). This dual problem is an unconstrained optimization problem of the form
$$\min_{y \in \mathbb{R}^m} G_\mu(y), \tag{1.9}$$
where the objective function $G_\mu(y)$ is differentiable since $g_\mu(x)$ is strictly convex (see, e.g., [32]).
Motivated by this result, some techniques for speeding up the classical gradient descent method applied to this dual problem, such as taking Barzilai-Borwein (BB) steps [1] and incorporating it into a limited memory BFGS (L-BFGS) method [18], were proposed in [41]. Numerical results on the basis pursuit problem (1.2) reported in [41] show that the performance of the linearized Bregman method can be greatly improved by using these techniques.

Our starting point is also motivated by the equivalence between applying the linearized Bregman method to (1.2) and solving the Lagrangian dual problem (1.9) by the gradient descent method. Since the gradient of $G_\mu(y)$ can be shown to be Lipschitz continuous, it is well known that the classical gradient descent method with a properly chosen step size will obtain an $\epsilon$-optimal solution to (1.9) (i.e., an approximate solution $y^k$ such that $G_\mu(y^k) - G_\mu(y^*) \le \epsilon$) in $O(1/\epsilon)$ iterations. In [23], Nesterov proposed a technique for accelerating the gradient descent method for solving problems of the form (1.9) (see also [24]), and proved that using this accelerated method, the number of iterations needed to obtain an $\epsilon$-optimal solution is reduced to $O(1/\sqrt{\epsilon})$ with a negligible change in the work required at each iteration. Nesterov also proved that the $O(1/\sqrt{\epsilon})$ complexity bound is the best bound that one can get if one uses only first-order information. Based on the above discussion, we propose an accelerated linearized Bregman (ALB) method for solving (1.8) which is equivalent to an accelerated gradient descent method for solving the Lagrangian dual (1.9) of (1.8). As a by-product, we show that the basic and the accelerated linearized Bregman methods require $O(1/\epsilon)$ and $O(1/\sqrt{\epsilon})$ iterations, respectively, to obtain an $\epsilon$-optimal solution with respect to the Lagrangian for (1.8).

The rest of this paper is organized as follows.
In Section 2 we describe the original Bregman iterative method, as well as the linearized Bregman method. We motivate the methods and state some previously obtained theoretical results that establish the equivalence between the LB method and a gradient descent method for the dual of problem (1.8). We present our accelerated linearized Bregman method in Section 3. We also provide a theoretical foundation for the accelerated algorithm and prove complexity results for it and the unaccelerated method. In Section 4, we describe how the LB and ALB methods can be extended to basis pursuit problems that include additional convex constraints. In Section 5, we report preliminary numerical results on several compressed sensing basis pursuit and matrix completion problems. These numerical results show that our accelerated linearized Bregman method significantly outperforms the basic linearized Bregman method. We draw some conclusions in Section 6.

2. Bregman and Linearized Bregman Methods.

The Bregman method was introduced to the image processing community by Osher et al. in [27] for solving the total-variation (TV) based image restoration problems. The Bregman distance [4] with respect to a convex function $J(\cdot)$ between points $u$ and $v$ is defined as
$$D_J^p(u, v) := J(u) - J(v) - \langle p, u - v \rangle, \tag{2.1}$$
where $p \in \partial J(v)$, the subdifferential of $J$ at $v$. The Bregman method for solving (1.1) is given below as Algorithm 1. Note that the updating formula for $p^k$ (Step 4 in Algorithm 1) is based on the optimality conditions of Step 3 in Algorithm 1:
$$0 \in \partial J(x^{k+1}) - p^k + A^\top(Ax^{k+1} - b).$$
This leads to
$$p^{k+1} = p^k - A^\top(Ax^{k+1} - b).$$
It was shown in [27, 42] that the Bregman method (Algorithm 1) converges to a solution of (1.1) in a finite number of steps.

It is worth noting that for solving (1.1), the Bregman method is equivalent to the augmented Lagrangian method [17, 29, 33] in the following sense.

Theorem 2.1. The sequences $\{x^k\}$ generated by Algorithm 1 and by the augmented Lagrangian method, which computes for $k = 0, 1, \cdots$
$$\begin{cases} x^{k+1} := \arg\min_x \; J(x) - \langle \lambda^k, Ax - b \rangle + \frac{1}{2}\|Ax - b\|_2^2 \\ \lambda^{k+1} := \lambda^k - (Ax^{k+1} - b) \end{cases} \tag{2.2}$$
starting from $\lambda^0 = 0$, are exactly the same.

Algorithm 1 Original Bregman Iterative Method
1: Input: $x^0 = p^0 = 0$.
2: for $k = 0, 1, \cdots$ do
3:   $x^{k+1} = \arg\min_x \; D_J^{p^k}(x, x^k) + \frac{1}{2}\|Ax - b\|_2^2$;
4:   $p^{k+1} = p^k - A^\top(Ax^{k+1} - b)$;
5: end for

Proof. From Step 4 of Algorithm 1 and the fact that $p^0 = 0$, it follows that $p^k = -\sum_{j=1}^k A^\top(Ax^j - b)$. From the second equation in (2.2) and using $\lambda^0 = 0$, we get $\lambda^k = -\sum_{j=1}^k (Ax^j - b)$. Thus, $p^k = A^\top \lambda^k$ for all $k$. Hence it is easy to see that Step 3 of Algorithm 1 is exactly the same as the first equation in (2.2) and that the $x^{k+1}$ computed in Algorithm 1 and (2.2) are exactly the same. Therefore, the sequences $\{x^k\}$ generated by both algorithms are exactly the same.

Note that for $J(x) := \alpha\|x\|_1$, Step 3 of Algorithm 1 reduces to an $\ell_1$-regularized problem:
$$\min_x \; \alpha\|x\|_1 - \langle p^k, x \rangle + \frac{1}{2}\|Ax - b\|_2^2. \tag{2.3}$$
Although there are many algorithms for solving the subproblem (2.3), such as FPC [16], SPGL1 [39], FISTA [2], etc., it often takes them many iterations to do so. The linearized Bregman method was proposed in [42], and used in [28, 7, 6], to overcome this difficulty. The linearized Bregman method replaces the quadratic term $\frac{1}{2}\|Ax - b\|_2^2$ in the objective function that is minimized in Step 3 of Algorithm 1 by its linearization $\langle A^\top(Ax^k - b), x \rangle$ plus a proximal term $\frac{1}{2\mu}\|x - x^k\|_2^2$. Consequently the updating formula for $p^k$ is changed, since the optimality conditions for this minimization step become:
$$0 \in \partial J(x^{k+1}) - p^k + A^\top(Ax^k - b) + \frac{1}{\mu}(x^{k+1} - x^k).$$
In Algorithm 2 below we present a slightly generalized version of the original linearized Bregman method that includes an additional parameter $\tau$ that corresponds to the length of a gradient step in a dual problem.

Algorithm 2 Linearized Bregman Method
1: Input: $x^0 = p^0 = 0$, $\mu > 0$ and $\tau > 0$.
2: for $k = 0, 1, \cdots$ do
3:   $x^{k+1} = \arg\min_x \; D_J^{p^k}(x, x^k) + \tau\langle A^\top(Ax^k - b), x \rangle + \frac{1}{2\mu}\|x - x^k\|_2^2$;
4:   $p^{k+1} = p^k - \tau A^\top(Ax^k - b) - \frac{1}{\mu}(x^{k+1} - x^k)$;
5: end for

In [41], it is shown that when $\mu\|A\|^2 < 2$, where $\|A\|$ denotes the largest singular value of $A$, the iterates of the linearized Bregman method (Algorithm 2 with $\tau = 1$) converge to the solution of the following regularized version of problem (1.1):
$$\min_x \; J(x) + \frac{1}{2\mu}\|x\|_2^2 \quad \text{s.t.} \quad Ax = b. \tag{2.4}$$
We prove in Theorem 2.3 below an analogous result for Algorithm 2 for a range of values of $\tau$. However, we first prove, as in [41], that the linearized Bregman method (Algorithm 2) is equivalent to a gradient descent method
$$y^{k+1} := y^k - \tau \nabla G_\mu(y^k) \tag{2.5}$$
applied to the Lagrangian dual
$$\max_y \min_w \; \Bigl\{ J(w) + \frac{1}{2\mu}\|w\|_2^2 - \langle y, Aw - b \rangle \Bigr\}$$
of (2.4), which we express as the following equivalent minimization problem:
$$\min_y \; G_\mu(y) := -\Bigl\{ J(w^*) + \frac{1}{2\mu}\|w^*\|_2^2 - \langle y, Aw^* - b \rangle \Bigr\}, \tag{2.6}$$
where $w^* := \arg\min_w \{ J(w) + \frac{1}{2\mu}\|w\|_2^2 - \langle y, Aw - b \rangle \}$. To show that $G_\mu(y)$ is continuously differentiable, we rewrite $G_\mu(y)$ as
$$G_\mu(y) = -\Phi_\mu(\mu A^\top y) + \frac{\mu}{2}\|A^\top y\|_2^2 - b^\top y,$$
where
$$\Phi_\mu(v) \equiv \min_w \; \Bigl\{ J(w) + \frac{1}{2\mu}\|w - v\|_2^2 \Bigr\}$$
is strictly convex and continuously differentiable with gradient $\nabla\Phi_\mu(v) = \frac{v - \hat{w}}{\mu}$, and $\hat{w} = \arg\min_w \{ J(w) + \frac{1}{2\mu}\|w - v\|_2^2 \}$ (e.g., see Proposition 4.1 in [3]). From this it follows that $\nabla G_\mu(y) = Aw^* - b$. Hence the gradient method (2.5) corresponds to Algorithm 3 below.

Algorithm 3 Linearized Bregman Method (Equivalent Form)
1: Input: $\mu > 0$, $\tau > 0$ and $y^0 = \tau b$.
2: for $k = 0, 1, \cdots$ do
3:   $w^{k+1} := \arg\min_w \; \{ J(w) + \frac{1}{2\mu}\|w\|_2^2 - \langle y^k, Aw - b \rangle \}$;
4:   $y^{k+1} := y^k - \tau(Aw^{k+1} - b)$;
5: end for

Lemma 2.2 and Theorem 2.3 below generalize Theorem 2.1 in [41] by allowing a step length choice in the gradient step (2.5) and show that Algorithms 2 and 3 are equivalent. Our proof closely follows the proof of Theorem 2.1 in [41].
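As an illustration of how Step 3 of Algorithm 3 can be evaluated in practice, the following worked step completes the square in the objective and then specializes to $J(w) = \|w\|_1$. It is a sketch, not part of the original derivation; the componentwise notation $(A^\top y^k)_i$ and the scalar variable $t$ are introduced here only for illustration.

```latex
% Completing the square in the objective of Step 3 (valid for any J):
\begin{align*}
J(w) + \frac{1}{2\mu}\|w\|_2^2 - \langle y^k, Aw - b\rangle
  &= J(w) + \frac{1}{2\mu}\bigl\|w - \mu A^\top y^k\bigr\|_2^2
     - \frac{\mu}{2}\bigl\|A^\top y^k\bigr\|_2^2 + \langle y^k, b\rangle .
\end{align*}
% The last two terms do not depend on w. For J(w) = \|w\|_1 the remaining
% problem separates over coordinates and is solved by soft-thresholding:
\begin{align*}
w^{k+1}_i
  &= \arg\min_{t \in \mathbb{R}} \; |t|
     + \frac{1}{2\mu}\bigl(t - \mu (A^\top y^k)_i\bigr)^2 \\
  &= \mu \,\operatorname{sgn}\bigl((A^\top y^k)_i\bigr)\,
     \max\bigl\{\, |(A^\top y^k)_i| - 1,\; 0 \,\bigr\}.
\end{align*}
```

Under this specialization, each iteration of Algorithm 3 costs one product with $A^\top$, one product with $A$, and a componentwise thresholding.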

Lemma 2.2. $x^{k+1}$ computed by Algorithm 2 equals $w^{k+1}$ computed by Algorithm 3 if and only if
$$A^\top y^k = p^k - \tau A^\top(Ax^k - b) + \frac{1}{\mu}x^k. \tag{2.7}$$

Proof. By comparing Step 3 in Algorithms 2 and 3, it is obvious that $w^{k+1}$ is equal to $x^{k+1}$ if and only if (2.7) holds.

Theorem 2.3. The sequences $\{x^k\}$ and $\{w^k\}$ generated by Algorithms 2 and 3 are the same.

Proof. We prove by induction that equation (2.7) holds for all $k \ge 0$. Note that (2.7) holds for $k = 0$ since $p^0 = x^0 = 0$ and $y^0 = \tau b$. Now let us assume that (2.7) holds for all $0 \le k \le n - 1$; thus by Lemma 2.2, $w^{k+1} = x^{k+1}$ for all $0 \le k \le n - 1$. By iterating Step 4 in Algorithm 3 we get
$$y^n = y^{n-1} - \tau(Aw^n - b) = -\sum_{j=0}^{n} \tau(Ax^j - b). \tag{2.8}$$
By iterating Step 4 in Algorithm 2 we get
$$p^n = -\sum_{j=0}^{n-1} \tau A^\top(Ax^j - b) - \frac{1}{\mu}x^n,$$
which implies that
$$p^n - \tau A^\top(Ax^n - b) + \frac{1}{\mu}x^n = -\sum_{j=0}^{n} \tau A^\top(Ax^j - b) = A^\top y^n,$$
where the last equality follows from (2.8); thus by induction (2.7) holds for all $k \ge 0$, which implies by Lemma 2.2 that $x^k = w^k$ for all $k \ge 0$.

By letting $v^k = A^\top y^k$ and algebraically manipulating the last two terms in the objective function in Step 3 in Algorithm 3, Steps 3 and 4 in that algorithm can be replaced by
$$\begin{cases} w^{k+1} := \arg\min_w \; J(w) + \frac{1}{2\mu}\|w - \mu v^k\|_2^2 \\ v^{k+1} := v^k - \tau A^\top(Aw^{k+1} - b) \end{cases} \tag{2.9}$$
if we set $v^0 = \tau A^\top b$. Because Algorithms 2 and 3 are equivalent, convergence results for the gradient descent method can be applied to both of them. Thus we have the following convergence result.

Theorem 2.4. Let $J(w) \equiv \|w\|_1$. Then $G_\mu(y)$ in the dual problem (2.6) is continuously differentiable and its gradient is Lipschitz continuous with the Lipschitz constant $L \le \mu\|A\|^2$. Consequently, if the step length $\tau < \frac{2}{\mu\|A\|^2}$, the sequences $\{x^k\}$ and $\{w^k\}$ generated by Algorithms 2 and 3 converge to the optimal solution of (2.4).

Proof. When $J(x) = \|x\|_1$, $w^{k+1}$ in (2.9) reduces to
$$w^{k+1} = \mu \cdot \operatorname{shrink}(v^k, 1),$$
where the $\ell_1$ shrinkage operator is defined as
$$\operatorname{shrink}(z, \alpha) := \operatorname{sgn}(z) \circ \max\{|z| - \alpha, 0\}, \quad \forall z \in \mathbb{R}^n, \ \alpha > 0. \tag{2.10}$$
$G_\mu(y)$ is continuously differentiable since $g_\mu(x)$ is strictly convex. Since for any point $y$, $\nabla G_\mu(y) = Aw - b$, where $w = \mu \cdot \operatorname{shrink}(A^\top y, 1)$, it follows from the nonexpansiveness of the shrinkage operator,
$$\|\operatorname{shrink}(s, \alpha) - \operatorname{shrink}(t, \alpha)\|_2 \le \|s - t\|_2, \quad \forall s, t, \alpha,$$
that
$$\begin{aligned} \|\nabla G_\mu(y^1) - \nabla G_\mu(y^2)\|_2 &= \|\mu \cdot A \cdot \operatorname{shrink}(A^\top y^1, 1) - \mu \cdot A \cdot \operatorname{shrink}(A^\top y^2, 1)\|_2 \\ &\le \mu \cdot \|A\| \cdot \|A^\top(y^1 - y^2)\|_2 \\ &\le \mu\|A\|^2 \|y^1 - y^2\|_2, \end{aligned}$$
for any two points $y^1$ and $y^2$. Thus the Lipschitz constant $L$ of $\nabla G_\mu(\cdot)$ is bounded above by $\mu\|A\|^2$.

When $\tau < \frac{2}{\mu\|A\|^2}$, we have $\tau L < 2$ and thus $|1 - \tau L| < 1$. It then follows that the gradient descent method $y^{k+1} = y^k - \tau\nabla G_\mu(y^k)$ converges and therefore Algorithms 2 and 3 converge to $x^*_\mu$, the optimal solution of (2.4).

Before developing an accelerated version of the LB algorithm in the next section, we would like to comment on the similarities and differences between the LB method and Nesterov's composite gradient method [26] and the ISTA method [2] applied to problem (1.1) and related problems. The latter algorithms iterate Step 3 in the LB method (Algorithm 2) with $p^k = 0$, and never compute or update the subgradient vector $p^k$. More importantly, these methods solve the unconstrained problem
$$\min_{x \in \mathbb{R}^n} \; \|x\|_1 + \frac{1}{2\mu}\|Ax - b\|_2^2.$$
Hence, while these methods and the LB method both linearize the quadratic term $\|Ax - b\|_2^2$ while handling the nonsmooth term $\|x\|_1$ directly, they are very different.

Similar remarks apply to the accelerated LB method presented in the next section and fast versions of ISTA and Nesterov's composite gradient method.
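To make the proximal form (2.9) concrete for $J = \|\cdot\|_1$, the following is a minimal pure-Python sketch of the iteration $w^{k+1} = \mu \cdot \operatorname{shrink}(v^k, 1)$, $v^{k+1} = v^k - \tau A^\top(Aw^{k+1} - b)$. The function names (`shrink`, `matvec`, `rmatvec`, `linearized_bregman`), the dense list-of-rows matrix format, and the toy instance are assumptions made for illustration and do not come from the paper.

```python
def shrink(z, alpha):
    """Soft-thresholding (2.10), applied componentwise to the list z."""
    # max(t - alpha, 0) + min(t + alpha, 0) equals sgn(t) * max(|t| - alpha, 0).
    return [max(t - alpha, 0.0) + min(t + alpha, 0.0) for t in z]


def matvec(A, x):
    """Return A x for a dense matrix stored as a list of rows."""
    return [sum(a * t for a, t in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T y for a dense matrix stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def linearized_bregman(A, b, mu, tau, iters):
    """Iterate (2.9) with J = ||.||_1, starting from v^0 = tau * A^T b."""
    v = [tau * t for t in rmatvec(A, b)]
    w = [0.0] * len(A[0])
    for _ in range(iters):
        w = [mu * t for t in shrink(v, 1.0)]                  # w^{k+1}
        r = [s - t for s, t in zip(matvec(A, w), b)]          # A w^{k+1} - b
        v = [s - tau * t for s, t in zip(v, rmatvec(A, r))]   # v^{k+1}
    return w


# Hypothetical toy instance: a single equation x_1 + 2 x_2 = 2, so ||A||^2 = 5.
# With mu = 2, Theorem 2.4 asks for tau < 2 / (mu * ||A||^2) = 0.2.
x_lb = linearized_bregman([[1.0, 2.0]], [2.0], mu=2.0, tau=0.1, iters=100)
```

On this toy instance the iterates approach $(0, 1)$, which is the minimum $\ell_1$-norm solution of the single equation; for this value of $\mu$ it also solves (2.4).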

3. The Accelerated Linearized Bregman Algorithm.

Based on Theorem 2.3, i.e., the equivalence between the linearized Bregman method and the gradient descent method, we can accelerate the linearized Bregman method by techniques used to accelerate the classical gradient descent method. In [41], Yin considered several techniques, such as line search, BB step and L-BFGS, to accelerate the linearized Bregman method. Here we consider the acceleration technique proposed by Nesterov in [23, 24]. This technique accelerates the classical gradient descent method in the sense that it reduces the iteration complexity significantly without increasing the per-iteration computational effort. For the unconstrained minimization problem (1.9), Nesterov's accelerated gradient method replaces the gradient descent method (2.5) by the following iterative scheme:
$$\begin{cases} x^{k+1} := y^k - \tau\nabla G_\mu(y^k) \\ y^{k+1} := \alpha_k x^{k+1} + (1 - \alpha_k)x^k, \end{cases} \tag{3.1}$$
where the scalars $\alpha_k$ are specially chosen weighting parameters. A typical choice for $\alpha_k$ is $\alpha_k = \frac{[\ldots]}{k+2}$. If $\tau$ is chosen so that $\tau \le 1/L$, where $L$ is the Lipschitz constant for $\nabla G_\mu(\cdot)$, Nesterov's accelerated gradient method (3.1) obtains an $\epsilon$-optimal solution of (1.9) in $O(1/\sqrt{\epsilon})$ iterations, while the classical gradient method (2.5) takes $O(1/\epsilon)$ iterations. Moreover, the per-iteration complexities of (2.5) and (3.1) are almost the same since computing the gradient $\nabla G_\mu(\cdot)$ usually dominates the computational cost in each iteration. Nesterov's acceleration technique has been studied and extended by many others for nonsmooth minimization problems and variational inequalities, e.g., see [25, 26, 2, 38, 22, 12, 13, 14].

Our accelerated linearized Bregman method is given below as Algorithm 4.
The main difference between it and the basic linearized Bregman method (Algorithm 2) is that the latter uses the previous iterate $x^k$ and subgradient $p^k$ to compute the new iterate $x^{k+1}$, while Algorithm 4 uses extrapolations $\tilde{x}^k$ and $\tilde{p}^k$ that are computed as linear combinations of the two previous iterates and subgradients, respectively. Carefully choosing the sequence of weighting parameters $\{\alpha_k\}$ guarantees an improved rate of convergence.

Algorithm 4 Accelerated Linearized Bregman Method
1: Input: $x^0 = \tilde{x}^0 = \tilde{p}^0 = p^0 = 0$, $\mu > 0$, $\tau > 0$.
2: for $k = 0, 1, \cdots$ do
3:   $x^{k+1} = \arg\min_x \; D_J^{\tilde{p}^k}(x, \tilde{x}^k) + \tau\langle A^\top(A\tilde{x}^k - b), x \rangle + \frac{1}{2\mu}\|x - \tilde{x}^k\|_2^2$;
4:   $p^{k+1} = \tilde{p}^k - \tau A^\top(A\tilde{x}^k - b) - \frac{1}{\mu}(x^{k+1} - \tilde{x}^k)$;
5:   $\tilde{x}^{k+1} = \alpha_k x^{k+1} + (1 - \alpha_k)x^k$;
6:   $\tilde{p}^{k+1} = \alpha_k p^{k+1} + (1 - \alpha_k)p^k$.
7: end for

In the following, we first establish the equivalence between the accelerated linearized Bregman method and the corresponding accelerated gradient descent method (3.1), which we give explicitly as (3.2) below, applied to the dual problem (2.6). Based on this, we then present complexity results for both basic and accelerated linearized Bregman methods. Not surprisingly, the accelerated linearized Bregman method improves the iteration complexity from $O(1/\epsilon)$ to $O(1/\sqrt{\epsilon})$.

Theorem 3.1. The accelerated linearized Bregman method (Algorithm 4) is equivalent to the accelerated dual gradient descent method (3.2) starting from $\tilde{y}^0 = y^0 = \tau b$:
$$\begin{cases} w^{k+1} := \arg\min_w \; J(w) + \frac{1}{2\mu}\|w\|_2^2 - \langle \tilde{y}^k, Aw - b \rangle \\ y^{k+1} := \tilde{y}^k - \tau(Aw^{k+1} - b) \\ \tilde{y}^{k+1} := \alpha_k y^{k+1} + (1 - \alpha_k)y^k. \end{cases} \tag{3.2}$$
More specifically, the sequence $\{x^k\}$ generated by Algorithm 4 is exactly the same as the sequence $\{w^k\}$ generated by (3.2).

Proof. Note that Step 3 of Algorithm 4 is equivalent to
$$x^{k+1} := \arg\min_x \; J(x) - \langle \tilde{p}^k, x \rangle + \tau\langle A^\top(A\tilde{x}^k - b), x \rangle + \frac{1}{2\mu}\|x - \tilde{x}^k\|_2^2. \tag{3.3}$$
Comparing (3.3) with the first equation in (3.2), it is easy to see that $x^{k+1} = w^{k+1}$ if and only if
$$A^\top\tilde{y}^k = \tilde{p}^k + \tau A^\top(b - A\tilde{x}^k) + \frac{1}{\mu}\tilde{x}^k. \tag{3.4}$$
We will prove (3.4) in the following by induction. Note that (3.4) holds for $k = 0$ since $\tilde{y}^0 = \tau b$ and $\tilde{x}^0 = \tilde{p}^0 = 0$. As a result, we have $x^1 = w^1$. By defining $w^0 = 0$, we also have $x^0 = w^0$, and
$$A^\top\tilde{y}^1 = A^\top(\alpha_0 y^1 + (1 - \alpha_0)y^0) = \alpha_0 A^\top\tilde{y}^0 + \alpha_0\tau A^\top(b - Aw^1) + (1 - \alpha_0)A^\top y^0. \tag{3.5}$$
On the other hand,
$$p^1 = \tilde{p}^0 + \tau A^\top(b - A\tilde{x}^0) - \frac{1}{\mu}(x^1 - \tilde{x}^0) = A^\top\tilde{y}^0 - \frac{1}{\mu}x^1, \tag{3.6}$$
where for the second equality we used (3.4) for $k = 0$. Expressing $\tilde{p}^1$ and $\tilde{x}^1$ in terms of their affine combinations of $p^1$, $p^0$, $x^1$ and $x^0$, then substituting for $p^1$ using (3.6) and using the fact that $x^0 = p^0 = 0$, and finally using $y^0 = \tau b$ and (3.5), we obtain
$$\begin{aligned} \tilde{p}^1 + \tau A^\top(b - A\tilde{x}^1) + \frac{1}{\mu}\tilde{x}^1 &= \alpha_0 p^1 + (1 - \alpha_0)p^0 + \alpha_0\tau A^\top(b - Ax^1) + (1 - \alpha_0)\tau A^\top(b - Ax^0) + \frac{1}{\mu}(\alpha_0 x^1 + (1 - \alpha_0)x^0) \\ &= \alpha_0\Bigl(A^\top\tilde{y}^0 - \frac{1}{\mu}x^1\Bigr) + \alpha_0\tau A^\top(b - Ax^1) + (1 - \alpha_0)\tau A^\top b + \frac{1}{\mu}\alpha_0 x^1 \\ &= \alpha_0 A^\top\tilde{y}^0 + \alpha_0\tau A^\top(b - Ax^1) + (1 - \alpha_0)A^\top y^0 \\ &= \alpha_0 A^\top\tilde{y}^0 + \alpha_0\tau A^\top(b - Aw^1) + (1 - \alpha_0)A^\top y^0 \\ &= A^\top\tilde{y}^1. \end{aligned}$$
Thus we proved that (3.4) holds for $k = 1$. Now let us assume that (3.4) holds for $0 \le k \le n - 1$, which implies $x^k = w^k$, $\forall\, 0 \le k \le n$, since $x^0 = w^0$. We will prove that (3.4) holds for $k = n$.

First, note that
$$p^n = \tilde{p}^{n-1} + \tau A^\top(b - A\tilde{x}^{n-1}) - \frac{1}{\mu}(x^n - \tilde{x}^{n-1}) = A^\top\tilde{y}^{n-1} - \frac{1}{\mu}x^n, \tag{3.7}$$
where the first equality is from Step 4 of Algorithm 4 and the second equality is from (3.4) for $k = n - 1$. It follows that
$$\begin{aligned} \tilde{p}^n &= \alpha_{n-1}p^n + (1 - \alpha_{n-1})p^{n-1} \\ &= \alpha_{n-1}\Bigl(A^\top\tilde{y}^{n-1} - \frac{1}{\mu}x^n\Bigr) + (1 - \alpha_{n-1})\Bigl(A^\top\tilde{y}^{n-2} - \frac{1}{\mu}x^{n-1}\Bigr) \\ &= \alpha_{n-1}A^\top\tilde{y}^{n-1} + (1 - \alpha_{n-1})A^\top\tilde{y}^{n-2} - \frac{1}{\mu}\tilde{x}^n, \end{aligned} \tag{3.8}$$
where the last equality uses Step 5 of Algorithm 4. On the other hand, from (3.2) we have
$$\begin{aligned} A^\top\tilde{y}^n &= A^\top(\alpha_{n-1}y^n + (1 - \alpha_{n-1})y^{n-1}) \\ &= \alpha_{n-1}A^\top(\tilde{y}^{n-1} + \tau(b - Aw^n)) + (1 - \alpha_{n-1})A^\top(\tilde{y}^{n-2} + \tau(b - Aw^{n-1})) \\ &= \alpha_{n-1}A^\top\tilde{y}^{n-1} + (1 - \alpha_{n-1})A^\top\tilde{y}^{n-2} + \tau A^\top[b - A(\alpha_{n-1}x^n + (1 - \alpha_{n-1})x^{n-1})] \\ &= \alpha_{n-1}A^\top\tilde{y}^{n-1} + (1 - \alpha_{n-1})A^\top\tilde{y}^{n-2} + \tau A^\top(b - A\tilde{x}^n), \end{aligned} \tag{3.9}$$
where the third equality is from $w^n = x^n$ and $w^{n-1} = x^{n-1}$, and the last equality is from Step 5 of Algorithm 4. Combining (3.8) and (3.9) we get that (3.4) holds for $k = n$.

As with the linearized Bregman method, we can also use a simpler implementation for the accelerated linearized Bregman method in which the main computation at each step is a proximal minimization. Specifically, (3.2) is equivalent to the following three steps:
$$\begin{cases} w^{k+1} := \arg\min_w \; J(w) + \frac{1}{2\mu}\|w - \mu\tilde{v}^k\|_2^2 \\ v^{k+1} := \tilde{v}^k - \tau A^\top(Aw^{k+1} - b) \\ \tilde{v}^{k+1} := \alpha_k v^{k+1} + (1 - \alpha_k)v^k. \end{cases} \tag{3.10}$$
As before, this follows from letting $v^k = A^\top y^k$ and $\tilde{v}^k = A^\top\tilde{y}^k$ and completing the square in the objective function in the first equation of (3.2).

Next we prove iteration complexity bounds for both basic and accelerated linearized Bregman algorithms. Since these algorithms are standard gradient descent methods applied to the Lagrangian dual function and these results have been well established, our proofs will be quite brief.

Theorem 3.2. Let the sequence $\{x^k\}$ be generated by the linearized Bregman method (Algorithm 2) and $(x^*, y^*)$ be the pair of optimal primal and dual solutions for Problem (2.4). Let $\{y^k\}$ be the sequence generated by Algorithm 3 and suppose the step length $\tau \le \frac{1}{L}$, where $L$ is the Lipschitz constant for $\nabla G_\mu(y)$. Then for the Lagrangian function
$$\mathcal{L}_\mu(x, y) = J(x) + \frac{1}{2\mu}\|x\|_2^2 - \langle y, Ax - b \rangle, \tag{3.11}$$
we have
$$\mathcal{L}_\mu(x^*, y^*) - \mathcal{L}_\mu(x^{k+1}, y^k) \le \frac{\|y^* - y^0\|_2^2}{2\tau k}. \tag{3.12}$$
Thus, if we further have $\tau \ge \beta/L$, where $0 < \beta \le 1$, then $(x^{k+1}, y^k)$ is an $\epsilon$-optimal solution to Problem (2.4) with respect to the Lagrangian function if $k \ge \lceil C/\epsilon \rceil$, where $C := \frac{L\|y^* - y^0\|_2^2}{2\beta}$.

Proof. From (2.6) we get
$$G_\mu(y^k) = -\mathcal{L}_\mu(x^{k+1}, y^k). \tag{3.13}$$
By using the convexity of the function $G_\mu(\cdot)$ and the Lipschitz continuity of the gradient $\nabla G_\mu(\cdot)$, we get for any $y$,
$$\begin{aligned} G_\mu(y^k) - G_\mu(y) &\le G_\mu(y^{k-1}) + \langle \nabla G_\mu(y^{k-1}), y^k - y^{k-1} \rangle + \frac{L}{2}\|y^k - y^{k-1}\|_2^2 - G_\mu(y) \\ &\le G_\mu(y^{k-1}) + \langle \nabla G_\mu(y^{k-1}), y^k - y^{k-1} \rangle + \frac{1}{2\tau}\|y^k - y^{k-1}\|_2^2 - G_\mu(y) \\ &\le \langle \nabla G_\mu(y^{k-1}), y^{k-1} - y \rangle + \langle \nabla G_\mu(y^{k-1}), y^k - y^{k-1} \rangle + \frac{1}{2\tau}\|y^k - y^{k-1}\|_2^2 \\ &= \langle \nabla G_\mu(y^{k-1}), y^k - y \rangle + \frac{1}{2\tau}\|y^k - y^{k-1}\|_2^2 \\ &= \frac{1}{\tau}\langle y^{k-1} - y^k, y^k - y \rangle + \frac{1}{2\tau}\|y^k - y^{k-1}\|_2^2 \\ &\le \frac{1}{2\tau}\bigl(\|y - y^{k-1}\|_2^2 - \|y - y^k\|_2^2\bigr). \end{aligned} \tag{3.14}$$
Setting $y = y^{k-1}$ in (3.14), we obtain $G_\mu(y^k) \le G_\mu(y^{k-1})$ and thus the sequence $\{G_\mu(y^k)\}$ is non-increasing. Moreover, summing (3.14) over $k = 1, 2, \ldots, n$ with $y = y^*$ yields
$$n\bigl(G_\mu(y^n) - G_\mu(y^*)\bigr) \le \sum_{k=1}^{n}\bigl(G_\mu(y^k) - G_\mu(y^*)\bigr) \le \frac{1}{2\tau}\bigl(\|y^* - y^0\|_2^2 - \|y^* - y^n\|_2^2\bigr) \le \frac{1}{2\tau}\|y^* - y^0\|_2^2,$$
and this implies (3.12).

Before we analyze the iteration complexity of the accelerated linearized Bregman method, we introduce a lemma from [38] that we will use in our analysis.

Lemma 3.3 (Property 1 in [38]). For any proper lower semicontinuous convex function $\psi : \mathbb{R}^n \to (-\infty, +\infty]$ and any $z \in \mathbb{R}^n$, if
$$z_+ = \arg\min_x \; \Bigl\{ \psi(x) + \frac{1}{2}\|x - z\|_2^2 \Bigr\},$$
then
$$\psi(x) + \frac{1}{2}\|x - z\|_2^2 \ge \psi(z_+) + \frac{1}{2}\|z_+ - z\|_2^2 + \frac{1}{2}\|x - z_+\|_2^2, \quad \forall x \in \mathbb{R}^n.$$

The following theorem gives an iteration-complexity result for the accelerated linearized Bregman method. Our proof of this theorem closely follows the proof of Proposition 2 in [38].

Theorem 3.4. Let the sequence $\{x^k\}$ be generated by the accelerated linearized Bregman method (Algorithm 4) and $(x^*, y^*)$ be a pair of optimal primal and dual solutions for Problem (2.4). Let $\{\alpha_k\}$ be chosen as
$$\alpha_{k-1} = 1 + \theta_k(\theta_{k-1}^{-1} - 1), \tag{3.15}$$
where
$$\theta_{-1} := 1, \quad \text{and} \quad \theta_k = \frac{2}{k+2}, \ \forall k \ge 0. \tag{3.16}$$
Let the sequence $\{y^k\}$ be defined as in (3.2) and the step length $\tau \le \frac{1}{L}$, where $L$ is the Lipschitz constant of $\nabla G_\mu(y)$ and $G_\mu(\cdot)$ is defined by (3.13). We have
$$G_\mu(y^k) - G_\mu(y^*) \le \frac{2\|y^* - y^0\|_2^2}{\tau k^2}. \tag{3.17}$$
Thus, if we further have $\tau \ge \beta/L$, where $0 < \beta \le 1$, then $(x^{k+1}, y^k)$ is an $\epsilon$-optimal solution to Problem (2.4) with respect to the Lagrangian function (3.11) if $k \ge \lceil \sqrt{C/\epsilon} \rceil$, where $C := \frac{2L\|y^* - y^0\|_2^2}{\beta}$.

Proof. Let
$$z^k = y^{k-1} + \theta_{k-1}^{-1}(y^k - y^{k-1}) \tag{3.18}$$
and denote the linearization of $G_\mu(y)$ as
$$l_{G_\mu}(x; y) := G_\mu(y) + \langle \nabla G_\mu(y), x - y \rangle \le G_\mu(x). \tag{3.19}$$
Therefore the second equation in (3.2) is equivalent to
$$y^{k+1} := \arg\min_y \; G_\mu(\tilde{y}^k) + \langle \nabla G_\mu(\tilde{y}^k), y - \tilde{y}^k \rangle + \frac{1}{2\tau}\|y - \tilde{y}^k\|_2^2 = \arg\min_y \; l_{G_\mu}(y; \tilde{y}^k) + \frac{1}{2\tau}\|y - \tilde{y}^k\|_2^2.$$
Defining $\hat{y}^k := (1 - \theta_k)y^k + \theta_k y^*$, we have
$$\begin{aligned} G_\mu(y^{k+1}) &\le G_\mu(\tilde{y}^k) + \langle \nabla G_\mu(\tilde{y}^k), y^{k+1} - \tilde{y}^k \rangle + \frac{L}{2}\|y^{k+1} - \tilde{y}^k\|_2^2 \\ &\le l_{G_\mu}(y^{k+1}; \tilde{y}^k) + \frac{1}{2\tau}\|y^{k+1} - \tilde{y}^k\|_2^2 \\ &\le l_{G_\mu}(\hat{y}^k; \tilde{y}^k) + \frac{1}{2\tau}\|\hat{y}^k - \tilde{y}^k\|_2^2 - \frac{1}{2\tau}\|\hat{y}^k - y^{k+1}\|_2^2 \\ &= l_{G_\mu}((1 - \theta_k)y^k + \theta_k y^*; \tilde{y}^k) + \frac{1}{2\tau}\|(1 - \theta_k)y^k + \theta_k y^* - \tilde{y}^k\|_2^2 - \frac{1}{2\tau}\|(1 - \theta_k)y^k + \theta_k y^* - y^{k+1}\|_2^2 \\ &= l_{G_\mu}((1 - \theta_k)y^k + \theta_k y^*; \tilde{y}^k) + \frac{\theta_k^2}{2\tau}\|y^* + \theta_k^{-1}(y^k - \tilde{y}^k) - y^k\|_2^2 - \frac{\theta_k^2}{2\tau}\|y^* + \theta_k^{-1}(y^k - y^{k+1}) - y^k\|_2^2 \\ &= l_{G_\mu}((1 - \theta_k)y^k + \theta_k y^*; \tilde{y}^k) + \frac{\theta_k^2}{2\tau}\|y^* - z^k\|_2^2 - \frac{\theta_k^2}{2\tau}\|y^* - z^{k+1}\|_2^2 \\ &= (1 - \theta_k)\,l_{G_\mu}(y^k; \tilde{y}^k) + \theta_k\,l_{G_\mu}(y^*; \tilde{y}^k) + \frac{\theta_k^2}{2\tau}\|y^* - z^k\|_2^2 - \frac{\theta_k^2}{2\tau}\|y^* - z^{k+1}\|_2^2 \\ &\le (1 - \theta_k)G_\mu(y^k) + \theta_k G_\mu(y^*) + \frac{\theta_k^2}{2\tau}\|y^* - z^k\|_2^2 - \frac{\theta_k^2}{2\tau}\|y^* - z^{k+1}\|_2^2, \end{aligned} \tag{3.20}$$
where the second inequality is from (3.19) and $\tau \le 1/L$, the third inequality uses Lemma 3.3 with $\psi(x) := \tau\,l_{G_\mu}(x; \tilde{y}^k)$, the third equality uses (3.18), (3.2) and (3.15), and the last inequality uses (3.19). Therefore we get
$$\frac{1}{\theta_k^2}\bigl(G_\mu(y^{k+1}) - G_\mu(y^*)\bigr) \le \frac{1 - \theta_k}{\theta_k^2}\bigl(G_\mu(y^k) - G_\mu(y^*)\bigr) + \frac{1}{2\tau}\|y^* - z^k\|_2^2 - \frac{1}{2\tau}\|y^* - z^{k+1}\|_2^2.$$
From (3.16), it is easy to show that $\frac{1 - \theta_k}{\theta_k^2} \le \frac{1}{\theta_{k-1}^2}$ for all $k \ge 0$. Thus (3.20) implies that
$$\frac{1 - \theta_{k+1}}{\theta_{k+1}^2}\bigl(G_\mu(y^{k+1}) - G_\mu(y^*)\bigr) \le \frac{1 - \theta_k}{\theta_k^2}\bigl(G_\mu(y^k) - G_\mu(y^*)\bigr) + \frac{1}{2\tau}\|y^* - z^k\|_2^2 - \frac{1}{2\tau}\|y^* - z^{k+1}\|_2^2. \tag{3.21}$$
Summing (3.21) over $k = 0, 1, \ldots, n - 1$, we get
$$\frac{1 - \theta_n}{\theta_n^2}\bigl(G_\mu(y^n) - G_\mu(y^*)\bigr) \le \frac{1}{2\tau}\|y^* - z^0\|_2^2 = \frac{1}{2\tau}\|y^* - y^0\|_2^2,$$
which immediately implies (3.17).

Remark 3.5. The proof technique and the choice of $\theta_k$ used here are suggested in [38] for accelerating the basic algorithm. Other choices of $\theta_k$ can be found in [23, 24, 2, 38]. They all work here and give the same order of iteration complexity.
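The three-step form (3.10), together with the weights from (3.15) and (3.16), can be sketched in a few lines for $J = \|\cdot\|_1$. The pure-Python code below is a minimal illustration under stated assumptions: the helper names, the dense list-of-rows matrix format, and the toy instance are hypothetical and not taken from the paper, and the weight is computed as $\alpha_k = 1 + \theta_{k+1}(\theta_k^{-1} - 1)$, which is (3.15) with the index shifted by one.

```python
def shrink(z, alpha):
    """Soft-thresholding (2.10), applied componentwise to the list z."""
    return [max(t - alpha, 0.0) + min(t + alpha, 0.0) for t in z]


def matvec(A, x):
    """Return A x for a dense matrix stored as a list of rows."""
    return [sum(a * t for a, t in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T y for a dense matrix stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def accelerated_linearized_bregman(A, b, mu, tau, iters):
    """Iterate (3.10) with J = ||.||_1 and theta_k = 2 / (k + 2)."""
    v = [tau * t for t in rmatvec(A, b)]     # v^0 = tau * A^T b
    v_tilde = list(v)                        # extrapolated point, equal to v^0 at start
    w = [0.0] * len(A[0])
    for k in range(iters):
        w = [mu * t for t in shrink(v_tilde, 1.0)]                        # w^{k+1}
        r = [s - t for s, t in zip(matvec(A, w), b)]                      # A w^{k+1} - b
        v_new = [s - tau * t for s, t in zip(v_tilde, rmatvec(A, r))]     # v^{k+1}
        theta_k = 2.0 / (k + 2)
        theta_next = 2.0 / (k + 3)
        alpha = 1.0 + theta_next * (1.0 / theta_k - 1.0)                  # alpha_k
        v_tilde = [alpha * s + (1.0 - alpha) * t for s, t in zip(v_new, v)]
        v = v_new
    return w


# Hypothetical toy instance: x_1 + 2 x_2 = 2 with mu = 2, so mu * ||A||^2 = 10
# and tau = 0.1 satisfies tau <= 1/L for any L <= mu * ||A||^2.
x_alb = accelerated_linearized_bregman([[1.0, 2.0]], [2.0], mu=2.0, tau=0.1, iters=200)
```

On this instance the iterates approach $(0, 1)$. The example is too small to say anything about relative speed; it is only meant to show that the extrapolation step adds a single vector combination per iteration on top of the basic method.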

4. Extension to Problems with Additional Convex Constraints.

We now consider extensions of both the LB and ALB methods to problems of the form
$$\min_{x \in X} J(x) \quad \text{s.t.} \quad Ax = b, \tag{4.1}$$
where $X$ is a nonempty closed convex set in $\mathbb{R}^n$. It is not clear how to extend the LB and ALB methods (Algorithms 2 and 4) to problem (4.1) since we can no longer rely on the relationship
$$0 \in \partial J(x^{k+1}) - p^k + A^\top(Ax^k - b) + \frac{1}{\mu}(x^{k+1} - x^k)$$
to compute a subgradient $p^{k+1} \in \partial J(x^{k+1})$. Fortunately, the Lagrangian dual gradient versions of these algorithms do not suffer from this difficulty. All that is required to extend them to problem (4.1) is to include the constraint $w \in X$ in the minimization step in these algorithms. Note that the gradient of
$$\hat{\Phi}_\mu(v) = \min_{w \in X} \; \Bigl\{ J(w) + \frac{1}{2\mu}\|w - v\|_2^2 \Bigr\}$$
remains the same. Also it is clear that the iteration complexity results given in Theorems 3.2 and 3.4 apply to these algorithms as well.

Being able to apply the LB and ALB methods to problems of the form (4.1) greatly expands their usefulness. One immediate extension is to compressed sensing problems in which the signal is required to have nonnegative components. Also, (4.1) directly includes all linear programs. Applying the LB and ALB methods to such problems, with the goal of only obtaining approximate optimal solutions, will be the subject of a future paper.
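As an illustration of the nonnegative compressed sensing extension mentioned above, the sketch below adds the constraint $w \in X = \{w : w \ge 0\}$ to the minimization step of the dual gradient form with $J = \|\cdot\|_1$. The closed form $w = \mu \max(v - 1, 0)$ used here is a derivation made for this example (on the nonnegative orthant $\|w\|_1 = \sum_i w_i$, so each coordinate minimizes a one-dimensional quadratic over $w_i \ge 0$); it is not stated in the paper, and all function names and the toy data are hypothetical.

```python
def matvec(A, x):
    """Return A x for a dense matrix stored as a list of rows."""
    return [sum(a * t for a, t in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T y for a dense matrix stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def prox_l1_nonneg(v, mu):
    """argmin over w >= 0 of ||w||_1 + (1/(2*mu)) * ||w - mu*v||^2 (assumed closed form)."""
    return [mu * max(t - 1.0, 0.0) for t in v]


def linearized_bregman_nonneg(A, b, mu, tau, iters):
    """LB in the v-form of (2.9), with the w-step restricted to w >= 0."""
    v = [tau * t for t in rmatvec(A, b)]     # v^0 = tau * A^T b
    w = [0.0] * len(A[0])
    for _ in range(iters):
        w = prox_l1_nonneg(v, mu)                             # constrained w-step
        r = [s - t for s, t in zip(matvec(A, w), b)]          # A w - b
        v = [s - tau * t for s, t in zip(v, rmatvec(A, r))]   # dual gradient step
    return w


# Hypothetical toy instance: x_1 - 2 x_2 = 2 with x >= 0.
x_nonneg = linearized_bregman_nonneg([[1.0, -2.0]], [2.0], mu=2.0, tau=0.1, iters=500)
```

On this instance the iterates approach $(2, 0)$, whereas the unconstrained minimum $\ell_1$-norm solution of the same equation is $(0, -1)$, so the constraint changes the answer while the only change to the algorithm is the minimization step.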

5. Numerical Experiments.

In this section, we report some numerical results that demonstrate the effectiveness of the accelerated linearized Bregman algorithm. All numerical experiments were run in MATLAB 7.3.0 on a Dell Precision 670 workstation with an Intel Xeon(TM) 3.4 GHz CPU and 6 GB of RAM.

In this subsection, we compare theperformance of the accelerated linearized Bregman method against the performance of the basic linearizedBregman method on a variety of compressed sensing problems of the form (1.2).We use three types of sensing matrices A ∈ R m × n . Type (i): A is a standard Gaussian matrix generatedby the randn ( m, n ) function in MATLAB. Type (ii): A is ﬁrst generated as a standard Gaussian matrix andthen normalized to have unit-norm columns. Type (iii): The elements of A are sampled from a Bernoullidistribution as either +1 or −

1. We use two types of sparse solutions x ∗ ∈ R n with sparsity s (i.e., thenumber of nonzeros in x ∗ ). The positions of the nonzero entries of x ∗ are selected uniformly at random, andeach nonzero value is sampled either from (i) standard Gaussian (the randn function in MATLAB) or from(ii) [ − ,

1] uniformly at random (2 ∗ rand − J ( x ) = k x k , the linearized Bregman method reduces to thetwo-line algorithm: ( x k +1 := µ · shrink( v k , v k +1 := v k + τ A ⊤ ( b − Ax k +1 ) , where the ℓ shrinkage operator is deﬁned in (2.10). Similarly, the accelerated linearized Bregman can bewritten as:  x k +1 := µ · shrink(˜ v k , v k +1 := ˜ v k + τ A T ( b − Ax k +1 )˜ v k +1 := α k v k +1 + (1 − α k ) v k . oth algorithms are very simple to program and involve only one Ax and one A ⊤ y matrix-vector multipli-cation in each iteration.We ran both LB and ALB with the seed used for generating random number in MATLAB setting as 0.Here we set n = 2000 , m = 0 . × n, s = 0 . × m, µ = 5 for all data sets. We set τ = µ k A k . We terminatedthe algorithms when the stopping criterion(5.1) k Ax k − b k / k b k < − was satisﬁed or the number of iterations exceeded 5000. Note that (5.1) was also used in [41]. We reportthe results in Table 5.1. Table 5.1

Table 5.1. Comparison of linearized Bregman (LB) with accelerated linearized Bregman (ALB)

                                         Number of iterations    Relative error ‖x − x*‖/‖x*‖
Type of x*   n (m = 0 . n, s = 0 . m)    LB        ALB           LB           ALB

Standard Gaussian matrix A
Gaussian     2000                        5000+     330           5.1715e-3    1.4646e-5
Uniform      2000                        1681      214           2.2042e-5    1.5241e-5

Normalized Gaussian matrix A
Gaussian     2000                        2625      234           3.2366e-5    1.2664e-5
Uniform      2000                        5000+     292           1.2621e-2    1.5629e-5

Bernoulli +1/-1 matrix A
Gaussian     2000                        2314      222           4.2057e-5    1.0812e-5
Uniform      2000                        5000+     304           1.6141e-2    1.5732e-5

In Table 5.1, we see that for three out of six problems, LB did not achieve the desired convergence criterion within 5000 iterations, while ALB satisfied this stopping criterion in at most 330 iterations on all six problems. To further demonstrate the significant improvement that ALB achieved over LB, we plot in Figures 5.1, 5.2 and 5.3 the Euclidean norms of the residuals and the relative errors as functions of the iteration number that were obtained by LB and ALB applied to the same data sets. These figures also depict the non-monotonic behavior of the ALB method.

There are fast implementations of linearized Bregman [5] and other solvers [20, 37, 19, 40] for solving matrix completion problems. We do not compare the linearized Bregman and our accelerated linearized Bregman algorithms with these fast solvers here. Rather, our tests are focused only on comparing ALB with LB and verifying that the acceleration actually occurs in practice for matrix completion problems.

The nuclear norm matrix completion problem (1.7) can be rewritten as

$$
\min_X \|X\|_* \quad \text{s.t.} \quad \mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(M), \qquad (5.2)
$$

where $[\mathcal{P}_\Omega(X)]_{ij} = X_{ij}$ if $(i, j) \in \Omega$ and $[\mathcal{P}_\Omega(X)]_{ij} = 0$ otherwise.
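As a small illustration of the sampling operator in (5.2), the following sketch represents $\Omega$ as a boolean mask. The function name `project_omega` and the 2 × 2 data are hypothetical and chosen only for illustration.

```python
import numpy as np


def project_omega(X, mask):
    """[P_Omega(X)]_ij = X_ij if (i, j) is in Omega, and 0 otherwise.

    `mask` is a boolean array of the same shape as X marking the set Omega.
    """
    return np.where(mask, X, 0.0)


X = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[True, False], [False, True]])   # Omega = {(0, 0), (1, 1)}
PX = project_omega(X, mask)
print(PX)
```

Applying the operator twice changes nothing, and it never increases the Frobenius norm, which is the sense in which its operator norm equals 1.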
When the convex function $J(\cdot)$ is the nuclear norm of the matrix $X$, Step 3 of Algorithm 2 with inputs $X^k$, $P^k$ can be reduced to

$$
X^{k+1} := \arg\min_{X \in \mathbb{R}^{m \times n}} \; \mu \|X\|_* + \frac{1}{2} \left\| X - \left( X^k - \mu \left( \tau \mathcal{P}_\Omega(\mathcal{P}_\Omega X^k - \mathcal{P}_\Omega(M)) - P^k \right) \right) \right\|_F^2. \qquad (5.3)
$$

Fig. 5.1. Gaussian matrix $A$. Left: Gaussian $x^*$; Right: Uniform $x^*$. Each panel plots the residual and the relative error of LB and ALB against the iteration number.

Fig. 5.2. Normalized Gaussian matrix $A$. Left: Gaussian $x^*$; Right: Uniform $x^*$. Each panel plots the residual and the relative error of LB and ALB against the iteration number.

Fig. 5.3. Bernoulli matrix $A$. Left: Gaussian $x^*$; Right: Uniform $x^*$. Each panel plots the residual and the relative error of LB and ALB against the iteration number.

It is known (see, e.g., [5, 20]) that (5.3) has the closed-form solution

$$
X^{k+1} = \mathrm{Shrink}\left( X^k - \mu \left( \tau \mathcal{P}_\Omega(\mathcal{P}_\Omega X^k - \mathcal{P}_\Omega(M)) - P^k \right), \mu \right),
$$

where the matrix shrinkage operator is defined as

$$
\mathrm{Shrink}(Y, \gamma) := U \, \mathrm{Diag}(\max(\sigma - \gamma, 0)) \, V^\top,
$$

and $Y = U \, \mathrm{Diag}(\sigma) \, V^\top$ is the singular value decomposition (SVD) of the matrix $Y$. Thus, a typical iteration of the linearized Bregman method (Algorithm 2), with initial inputs $X^0 = P^0 = 0$, for solving the matrix completion problem (5.2) can be summarized as

$$
\begin{cases}
X^{k+1} := \mathrm{Shrink}\left( X^k - \mu \left( \tau \mathcal{P}_\Omega(\mathcal{P}_\Omega X^k - \mathcal{P}_\Omega(M)) - P^k \right), \mu \right), \\
P^{k+1} := P^k - \tau (\mathcal{P}_\Omega X^k - \mathcal{P}_\Omega M) - (X^{k+1} - X^k)/\mu.
\end{cases} \qquad (5.4)
$$

Similarly, a typical iteration of the accelerated linearized Bregman method (Algorithm 4), with initial inputs $X^0 = P^0 = \tilde{X}^0 = \tilde{P}^0 = 0$, for solving the matrix completion problem (5.2) can be summarized as

$$
\begin{cases}
X^{k+1} := \mathrm{Shrink}\left( \tilde{X}^k - \mu \left( \tau \mathcal{P}_\Omega(\mathcal{P}_\Omega \tilde{X}^k - \mathcal{P}_\Omega(M)) - \tilde{P}^k \right), \mu \right), \\
P^{k+1} := \tilde{P}^k - \tau (\mathcal{P}_\Omega \tilde{X}^k - \mathcal{P}_\Omega M) - (X^{k+1} - \tilde{X}^k)/\mu, \\
\tilde{X}^{k+1} := \alpha_k X^{k+1} + (1 - \alpha_k) X^k, \\
\tilde{P}^{k+1} := \alpha_k P^{k+1} + (1 - \alpha_k) P^k,
\end{cases} \qquad (5.5)
$$

where the sequence $\alpha_k$ is chosen according to Theorem 3.4.

We compare the performance of LB and ALB on a variety of matrix completion problems. We created matrices $M \in \mathbb{R}^{n \times n}$ with rank $r$ by the following procedure. We first created standard Gaussian matrices $M_L \in \mathbb{R}^{n \times r}$ and $M_R \in \mathbb{R}^{n \times r}$ and then we set $M = M_L M_R^\top$. The locations of the $p$ known entries in $M$ were sampled uniformly, and the values of these $p$ known entries were drawn from an iid Gaussian distribution. The ratio $p/n^2$ between the number of measurements and the number of entries in the matrix is denoted by "SR" (sampling ratio). The ratio between the dimension of the set of $n \times n$ rank-$r$ matrices, $r(2n - r)$, and the number of samples $p$ is denoted by "FR". In our tests, we fixed FR to 0.2 and 0.3 and $r$ to 10. We tested five matrices with dimension $n = 100, 200, 300, 400, 500$ and set the number $p$ to $r(2n - r)/\mathrm{FR}$. The random seed for generating random matrices in MATLAB was set to 0. $\mu$ was set to $5n$ (a heuristic argument for this choice can be found in [5]). We set the step length $\tau$ to $1/\mu$ since for matrix completion problems $\|\mathcal{P}_\Omega\| = 1$. We terminated the code when the residual on the observed entries, relative to the observed entries of the true matrix, was less than $10^{-4}$, i.e.,

$$
\|\mathcal{P}_\Omega(X^k) - \mathcal{P}_\Omega(M)\|_F / \|\mathcal{P}_\Omega(M)\|_F < 10^{-4}. \qquad (5.6)
$$

Note that this stopping criterion was used in [5]. We also set the maximum number of iterations to 2000.

We report the number of iterations needed by LB and ALB to reach (5.6) in Table 5.2. Note that performing the shrinkage operation, i.e., computing an SVD, dominates the computational cost in each iteration of LB and ALB. Thus, the per-iteration complexities of LB and ALB are almost the same and it is reasonable to compare the number of iterations needed to reach the stopping criterion. We report the relative error $\mathrm{err} := \|X^k - M\|_F / \|M\|_F$ between the recovered matrix $X^k$ and the true matrix $M$ in Table 5.2.
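The matrix iterations (5.4) and (5.5) translate almost line by line into NumPy. The following is a hedged sketch, not the authors' implementation: the names `matrix_shrink` and `bregman_matrix_completion` are hypothetical, $\Omega$ is represented as a boolean mask, the weights $\alpha_k = 1 + \theta_k(\theta_{k-1}^{-1} - 1)$ with $\theta_k = 2/(k+2)$ are an assumed standard choice standing in for Theorem 3.4, and the demo uses a much smaller matrix and a different sampling scheme than the experiments described above.

```python
import numpy as np


def matrix_shrink(Y, gamma):
    """Shrink(Y, gamma) = U Diag(max(sigma - gamma, 0)) V^T, using a thin SVD."""
    U, sigma, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(sigma - gamma, 0.0)) @ Vt


def bregman_matrix_completion(M, mask, mu, tau, tol=1e-4, max_iter=2000,
                              accelerated=False):
    """LB iteration (5.4), or ALB iteration (5.5) when accelerated=True.

    `mask` is a boolean array marking the observed set Omega.
    Returns the last iterate X and the number of iterations performed.
    """
    PM = np.where(mask, M, 0.0)               # P_Omega(M)
    PM_norm = np.linalg.norm(PM)
    X = np.zeros_like(M, dtype=float)
    P = np.zeros_like(X)
    X_t, P_t = X.copy(), P.copy()             # extrapolated pair; equals (X, P) for LB
    theta_prev = 1.0                          # assumed theta_{-1} = 1
    for k in range(max_iter):
        G = np.where(mask, X_t, 0.0) - PM     # P_Omega(X~^k) - P_Omega(M)
        X_new = matrix_shrink(X_t - mu * (tau * G - P_t), mu)
        P_new = P_t - tau * G - (X_new - X_t) / mu
        rel_res = np.linalg.norm(np.where(mask, X_new, 0.0) - PM) / PM_norm
        if rel_res < tol:                     # stopping criterion (5.6)
            return X_new, k + 1
        if accelerated:
            theta = 2.0 / (k + 2)
            alpha = 1.0 + theta * (1.0 / theta_prev - 1.0)
            X_t = alpha * X_new + (1.0 - alpha) * X
            P_t = alpha * P_new + (1.0 - alpha) * P
            theta_prev = theta
        else:
            X_t, P_t = X_new, P_new
        X, P = X_new, P_new
    return X, max_iter


# Small illustrative instance (sizes and sampling are assumptions).
rng = np.random.default_rng(0)
n, r = 30, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((n, r)).T
mask = rng.random((n, n)) < 0.6               # each entry observed with probability 0.6
mu = 5.0 * n                                  # mu = 5n, tau = 1/mu as in the text
for name, acc in (("LB", False), ("ALB", True)):
    X, iters = bregman_matrix_completion(M, mask, mu, 1.0 / mu, accelerated=acc)
    err = np.linalg.norm(X - M) / np.linalg.norm(M)
    print(f"{name}: {iters} iterations, err = {err:.2e}")
```

Because $\mathcal{P}_\Omega$ is idempotent, the sketch evaluates $\mathcal{P}_\Omega X - \mathcal{P}_\Omega M$ once per iteration and reuses it in both updates; the SVD inside `matrix_shrink` remains the dominant cost.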

Fig. 5.4. Comparison of LB and ALB on matrix completion problems with rank = 10, FR = 0.2, n = 200, 300, 400 and 500. Panels: Relative Error and Residual for LB and ALB.

Note that the non-monotonicity of ALB is far less pronounced on these problems.

Table 5.2. Comparison between LB and ALB on Matrix Completion Problems

FR = 0.2, rank = 10
n      SR      iter-LB    err-LB     iter-ALB    err-ALB
100    0.95    85         1.07e-4    63          1.11e-4
200    0.49    283        1.62e-4    171         1.58e-4
300    0.33    466        1.64e-4    261         1.60e-4
400    0.25    667        1.79e-4    324         1.65e-4
500    0.20    831        1.76e-4    398         1.65e-4

FR = 0.3, rank = 10
n      SR      iter-LB    err-LB     iter-ALB    err-ALB
100    0.63    294        1.75e-4    163         1.65e-4
200    0.33    1224       3.76e-4    289         1.83e-4
300    0.22    2000+      3.59e-3    406         1.93e-4
400    0.17    2000+      1.12e-2    455         1.80e-4
500    0.13    2000+      3.14e-2    1016        7.49e-3
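The SR column follows from the definitions $p = r(2n - r)/\mathrm{FR}$ and $\mathrm{SR} = p/n^2$. The short sketch below recomputes it; the helper names are hypothetical, and rounding $p$ to the nearest integer is an assumption, since the text does not say how non-integer values were handled.

```python
def num_samples(n, r, fr):
    """Number of observed entries p = r(2n - r)/FR (rounded; rounding is an assumption)."""
    return round(r * (2 * n - r) / fr)


def sampling_ratio(n, r, fr):
    """SR = p / n^2."""
    return num_samples(n, r, fr) / n ** 2


for fr in (0.2, 0.3):
    for n in (100, 200, 300, 400, 500):
        print(f"FR = {fr}, n = {n}: p = {num_samples(n, 10, fr)}, "
              f"SR = {sampling_ratio(n, 10, fr):.3f}")
```

For example, n = 100 and FR = 0.2 give p = 9500 and SR = 0.95. The computed ratios agree with the tabulated SR values to within about 0.01.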

Fig. 5.5. Comparison of LB and ALB on matrix completion problems with rank = 10, FR = 0.3, n = 200, 300, 400 and 500. Panels: Relative Error and Residual for LB and ALB.

6. Conclusions.

In this paper, we analyzed for the first time the iteration complexity of the linearized Bregman method. Specifically, we showed that for a suitably chosen step length, the method achieves a value of the Lagrangian of a quadratically regularized version of the basis pursuit problem that is within $\epsilon$ of the optimal value in $O(1/\epsilon)$ iterations. We also derived an accelerated version of the linearized Bregman method whose iteration complexity is reduced to $O(1/\sqrt{\epsilon})$ and presented numerical results on basis pursuit and matrix completion problems that illustrate this speed-up.

REFERENCES

[1] J. Barzilai and J. Borwein, Two point step size gradient methods, IMA Journal of Numerical Analysis, 8 (1988), pp. 141–148.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sciences, 2 (2009), pp. 183–202.
[3] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989.
[4] L. Bregman, The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming, USSR Computational Mathematics and Mathematical Physics, 7 (1967), pp. 200–217.
[5] J. Cai, E. J. Candès, and Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM J. on Optimization, 20 (2010), pp. 1956–1982.
[6] J.-F. Cai, S. Osher, and Z. Shen, Convergence of the linearized Bregman iteration for $\ell_1$-norm minimization, Mathematics of Computation, 78 (2009), pp. 2127–2136.
[7] J.-F. Cai, S. Osher, and Z. Shen, Linearized Bregman iterations for compressed sensing, Mathematics of Computation, 78 (2009), pp. 1515–1536.
[8] E. J. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics, 9 (2009), pp. 717–772.
[9] E. J. Candès, J. Romberg, and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory, 52 (2006), pp. 489–509.
[10] E. J. Candès and T. Tao, The power of convex relaxation: near-optimal matrix completion, IEEE Trans. Inform. Theory, 56 (2009), pp. 2053–2080.
[11] D. Donoho, Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), pp. 1289–1306.
[12] D. Goldfarb and S. Ma, Fast multiple splitting algorithms for convex optimization, tech. report, Department of IEOR, Columbia University. Preprint available at http://arxiv.org/abs/0912.4570, 2009.
[13] D. Goldfarb, S. Ma, and K. Scheinberg, Fast alternating linearization methods for minimizing the sum of two convex functions, tech. report, Department of IEOR, Columbia University. Preprint available at http://arxiv.org/abs/0912.4571, 2010.
[14] D. Goldfarb and K. Scheinberg, Fast first-order methods for composite convex optimization with line search, preprint, (2011).
[15] D. Gross, Recovering low-rank matrices from few coefficients in any basis, IEEE Transactions on Information Theory, 57 (2011), pp. 1548–1566.
[16] E. T. Hale, W. Yin, and Y. Zhang, Fixed-point continuation for $\ell_1$-minimization: Methodology and convergence, SIAM Journal on Optimization, 19 (2008), pp. 1107–1130.
[17] M. R. Hestenes, Multiplier and gradient methods, Journal of Optimization Theory and Applications, 4 (1969), pp. 303–320.
[18] D. C. Liu and J. Nocedal, On the limited memory BFGS method for large scale optimization, Mathematical Programming, Series B, 45 (1989), pp. 503–528.
[19] Y. Liu, D. Sun, and K.-C. Toh, An implementable proximal point algorithmic framework for nuclear norm minimization, To appear in Mathematical Programming, (2009).
[20] S. Ma, D. Goldfarb, and L. Chen, Fixed point and Bregman iterative methods for matrix rank minimization, Mathematical Programming Series A, 128 (2011), pp. 321–353.
[21] B. K. Natarajan, Sparse approximate solutions to linear systems, SIAM Journal on Computing, 24 (1995), pp. 227–234.
[22] A. Nemirovski, Prox-method with rate of convergence $O(1/t)$ for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems, SIAM Journal on Optimization, 15 (2005), pp. 229–251.
[23] Y. E. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$, Dokl. Akad. Nauk SSSR, 269 (1983), pp. 543–547.
[24] Y. E. Nesterov, Introductory lectures on convex optimization, 87 (2004), pp. xviii+236. A basic course.
[25] Y. E. Nesterov, Smooth minimization for non-smooth functions, Math. Program. Ser. A, 103 (2005), pp. 127–152.
[26] Y. E. Nesterov, Gradient methods for minimizing composite objective function, CORE Discussion Paper 2007/76, (2007).
[27] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, An iterative regularization method for total variation-based image restoration, SIAM Journal on Multiscale Modeling and Simulation, 4 (2005), pp. 460–489.
[28] S. Osher, Y. Mao, B. Dong, and W. Yin, Fast linearized Bregman iteration for compressive sensing and sparse denoising, Communications in Mathematical Sciences, 8 (2010), pp. 93–111.
[29] M. J. D. Powell, A method for nonlinear constraints in minimization problems, in Optimization, R. Fletcher, ed., Academic Press, New York, 1972, pp. 283–298.
[30] B. Recht, A simpler approach to matrix completion, To appear in Journal of Machine Learning Research, (2009).
[31] B. Recht, M. Fazel, and P. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review, 52 (2010), pp. 471–501.
[32] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, 1970.
[33] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[34] ACM SIGKDD and Netflix, Proceedings of KDD Cup and Workshop
[35] N. Srebro, Learning with Matrix Factorizations, PhD thesis, Massachusetts Institute of Technology, 2004.
[36] N. Srebro and T. Jaakkola, Weighted low-rank approximations, in Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.
[37] K.-C. Toh and S. Yun, An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems, Pacific J. Optimization, 6 (2010), pp. 615–640.
[38] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, submitted to SIAM J. Optim., (2008).
[39] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. on Scientific Computing, 31 (2008), pp. 890–912.
[40] Z. Wen, W. Yin, and Y. Zhang, Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm, preprint, (2010).
[41] W. Yin, Analysis and generalizations of the linearized Bregman method, SIAM Journal on Imaging Sciences, 3 (2010), pp. 856–877.
[42] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, Bregman iterative algorithms for $\ell_1$-minimization with applications to compressed sensing