Accelerated Primal-Dual Proximal Block Coordinate Updating Methods for Constrained Convex Optimization
Yangyang Xu†    Shuzhong Zhang‡

Abstract
Block coordinate update (BCU) methods enjoy low per-update computational complexity because each time only one or a few block variables need to be updated among a possibly large number of blocks. They are also easily parallelized and thus have been particularly popular for solving problems involving large-scale datasets and/or variables. In this paper, we propose a primal-dual BCU method for solving linearly constrained convex programs with multi-block variables. The method is an accelerated version of a primal-dual algorithm proposed by the authors, which applies randomization in selecting block variables to update and establishes an O(1/t) convergence rate under a convexity assumption. We show that the rate can be accelerated to O(1/t²) if the objective is strongly convex. In addition, if one block variable is independent of the others in the objective, we show that the algorithm can be modified to achieve a linear rate of convergence. The numerical experiments show that the accelerated method performs stably with a single set of parameters, while the original method needs to tune the parameters for different datasets in order to achieve a comparable level of performance.

Keywords: primal-dual method, block coordinate update, alternating direction method of multipliers (ADMM), accelerated first-order method.
Mathematics Subject Classification:
Motivated by the need to solve large-scale optimization problems and increasing capabilities in parallel computing, block coordinate update (BCU) methods have become particularly popular in recent years due to their low per-update computational complexity, low memory requirements, and their potential in a distributed computing environment. In the context of optimization, BCU first appeared in the form of block coordinate descent (BCD) type algorithms, which can be applied to solve unconstrained smooth problems or those with separable nonsmooth terms in the objective. In this paper, we consider the linearly constrained multi-block structured problem

\min_x f(x) + \sum_{i=1}^M g_i(x_i), \quad \text{s.t. } \sum_{i=1}^M A_i x_i = b, \qquad (1)

where x is partitioned into disjoint blocks (x_1, x_2, \ldots, x_M), f is a smooth convex function with Lipschitz continuous gradient, and each g_i is proper closed convex and possibly non-differentiable. Note that g_i can include an indicator function of a convex set X_i, and thus (1) can implicitly include certain separable block constraints in addition to the nonseparable linear constraint.

Many applications arising in statistical and machine learning, image processing, and finance can be formulated in the form of (1), including basis pursuit [7], constrained regression [23], the support vector machine in its dual form [10], and portfolio optimization [28], just to name a few.

Towards finding a solution for (1), we will first present an accelerated proximal Jacobian alternating direction method of multipliers (Algorithm 1), and then we generalize it to an accelerated randomized primal-dual block coordinate update method (Algorithm 2).

∗ This work is partly supported by NSF grants DMS-1719549 and CMMI-1462408.
† [email protected]. Department of Mathematical Sciences, Rensselaer Polytechnic Institute
‡ [email protected]. Department of Industrial & Systems Engineering, University of Minnesota
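As a concrete instance of (1) (our own illustration, following the basis pursuit reference above): taking f ≡ 0 and letting each g_i be the ℓ1-norm of the i-th block recovers the basis pursuit problem,

```latex
\min_x \; \|x\|_1 = \sum_{i=1}^M \|x_i\|_1, \quad \text{s.t.}\quad \sum_{i=1}^M A_i x_i = b,
```

which fits (1) with smooth part f \equiv 0 and separable nonsmooth parts g_i(x_i) = \|x_i\|_1.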
Assuming strong convexity of the objective function, we will establish O(1/t²) convergence rate results of the proposed algorithms by adaptively setting the parameters, where t is the total number of iterations. In addition, if we further assume smoothness and full-rankness, we obtain linear convergence of a modified method (Algorithm 3).

Our algorithms are closely related to randomized coordinate descent methods, primal-dual coordinate update methods, and accelerated primal-dual methods. In this subsection, let us briefly review the three classes of methods and discuss their relations to our algorithms.
Randomized coordinate descent methods
In the absence of linear constraints, Algorithm 2 specializes to randomized coordinate descent (RCD), which was first proposed in [31] for smooth problems and later generalized in [27, 38] to nonsmooth problems. It was shown that RCD converges sublinearly with rate O(1/t), which can be accelerated to O(1/t²) for convex problems, and it achieves a linear rate for strongly convex problems. By choosing multiple block variables at each iteration, [37] proposed to parallelize the RCD method and showed the same convergence results for parallelized RCD. This is similar to choosing m > 1 x-blocks at each iteration of our method.

Primal-dual coordinate update methods

In the presence of linear constraints, coordinate descent methods may fail to converge to a solution of the problem because, fixing all but one block, the selected block variable may be uniquely determined by the linear constraint. To perform coordinate updates on the linearly constrained problem (1), one effective approach is to update both primal and dual variables. Under this framework, the alternating direction method of multipliers (ADMM) is one popular choice. Originally, ADMM [14, 17] was proposed for solving two-block structured problems with separable objective (obtained by setting f = 0 and M = 2 in (1)), for which its convergence and convergence rate have been well established (see e.g. [2, 13, 22, 29]). However, directly extending ADMM to the multi-block setting such as (1) may fail to converge; see [6] for a divergence example of the ADMM even for solving a linear system of equations. Many efforts have been spent on establishing the convergence of multi-block ADMM under stronger assumptions (see e.g. [4, 6, 16, 25, 26]), such as strong convexity or orthogonality conditions on the linear constraint. Without additional assumptions, modification is necessary for the ADMM applied to multi-block problems to be convergent; see [12, 19, 20, 39] for example.
Very recently, [15] proposed a randomized primal-dual coordinate (RPDC) update method, whose asynchronous parallel version was then studied in [41]. Applied to (1), RPDC is a special case of Algorithm 2 with fixed parameters. It was shown that RPDC converges with rate O(1/t) under a convexity assumption. More generally than solving an optimization problem, primal-dual coordinate (PDC) update methods have also appeared in solving fixed-point or monotone inclusion problems [9, 34–36]. However, for these problems, the PDC methods are only shown to converge, and no convergence rate estimates are known unless additional assumptions are made, such as a strong monotonicity condition.

Accelerated primal-dual methods
It is possible to accelerate the rate of convergence from O(1/t) to O(1/t²) for gradient-type methods. The first acceleration result was shown by Nesterov [30] for solving smooth unconstrained problems. The technique has been generalized to accelerate gradient-type methods on possibly nonsmooth convex programs [1, 32]. Primal-dual methods for solving linearly constrained problems can also be accelerated by similar techniques. Under a convexity assumption, the augmented Lagrangian method (ALM) is accelerated in [21] from an O(1/t) convergence rate to O(1/t²) by applying a technique similar to that in [1] to the multiplier update, and [40] accelerates the linearized ALM using a technique similar to that in [32]. Assuming strong convexity of the objective, [18] accelerates the ADMM method, and the assumption is weakened in [40] to strong convexity of one component of the objective function. For solving bilinear saddle-point problems, various primal-dual methods can be accelerated if either the primal or the dual problem is strongly convex [3, 5, 11]. Without strong convexity, partial acceleration is still possible in terms of a rate depending on some other quantities; see e.g. [8, 33].

1.2 Contributions of this paper

We accelerate the proximal Jacobian ADMM [12] and also generalize it to an accelerated primal-dual coordinate updating method for linearly constrained multi-block structured convex programs, where the objective contains a nonseparable smooth function. With parameters fixed during all iterations, the generalized method reduces to that in [15] and enjoys an O(1/t) convergence rate under a mere convexity assumption. By adaptively setting the parameters at different iterations, we show that the accelerated method has an O(1/t²) convergence rate if the objective is strongly convex.
In addition, if there is one block variable that is independent of all others in the objective (but coupled in the linear constraint) and the corresponding component function is smooth, we modify the algorithm by treating that independent variable in a different way and establish a linear convergence result. Numerically, we test the accelerated method on quadratic programming and compare it to the (nonaccelerated) RPDC method in [15]. The results demonstrate that the accelerated method performs efficiently and stably with the parameters automatically set in accordance with the analysis, while the RPDC method needs to tune its parameters for different data in order to have comparable performance.

Notations.
For a positive integer M, we denote [M] = \{1, \ldots, M\}. We let x_S denote the subvector of x with blocks indexed by S; namely, if S = \{i_1, \ldots, i_m\}, then x_S = (x_{i_1}, \ldots, x_{i_m}). Similarly, A_S denotes the submatrix of A with columns indexed by S, and g_S denotes the sum of the component functions indicated by S. We use \nabla_i f(x) for the partial gradient of f with respect to x_i at x, and \nabla_S f(x) for that with respect to x_S. For a nondifferentiable function g, \tilde\nabla g(x) denotes a subgradient of g at x. We reserve I for the identity matrix and use \|\cdot\| for the Euclidean norm. Given a symmetric positive semidefinite (PSD) matrix W, for any vector v of appropriate size we define \|v\|_W^2 = v^\top W v, and

\Delta_W(v^+, v^o, v) = \frac{1}{2}\left[\|v^+ - v\|_W^2 - \|v^o - v\|_W^2 + \|v^+ - v^o\|_W^2\right]. \qquad (2)

If W = I, we simply write \Delta(v^+, v^o, v). Also, we denote

g(x) = \sum_{i=1}^M g_i(x_i), \quad F(x) = f(x) + g(x), \quad \Phi(\hat x, x, \lambda) = F(\hat x) - F(x) - \langle\lambda, A\hat x - b\rangle. \qquad (3)

Preparations.
A point (x^*, \lambda^*) is called a Karush-Kuhn-Tucker (KKT) point of (1) if

0 \in \partial F(x^*) - A^\top\lambda^*, \quad Ax^* - b = 0. \qquad (4)

For convex programs, the conditions in (4) are sufficient for x^* to be an optimal solution of (1), and they are also necessary if a certain qualification condition holds (e.g., the Slater condition: there is x in the interior of the domain of F such that Ax = b). Together with the convexity of F, (4) implies

\Phi(x, x^*, \lambda^*) \ge 0, \quad \forall x. \qquad (5)

We will use the following lemmas as basic facts. The first lemma is straightforward to verify from the definition of \|\cdot\|_W; the second one is similar to Lemma 3.3 in [15]; the third one is from Lemma 3.5 in [15].

Lemma 1.1
For any vectors u, v and symmetric PSD matrix W of appropriate sizes, it holds that

u^\top W v = \frac{1}{2}\left[\|u\|_W^2 - \|u - v\|_W^2 + \|v\|_W^2\right]. \qquad (6)

Lemma 1.2
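As a quick numerical sanity check of identity (6) (our own snippet; the random PSD matrix construction is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
G = rng.standard_normal((n, n))
W = G @ G.T  # a symmetric PSD matrix
u = rng.standard_normal(n)
v = rng.standard_normal(n)

def sqnorm_W(z):
    # ||z||_W^2 = z^T W z
    return z @ W @ z

lhs = u @ W @ v
rhs = 0.5 * (sqnorm_W(u) - sqnorm_W(u - v) + sqnorm_W(v))
# identity (6) holds up to floating-point rounding
```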
Given a function \phi, for a given x and a random vector \hat x, if for any \lambda (that may depend on \hat x) it holds that E\,\Phi(\hat x, x, \lambda) \le E\,\phi(\lambda), then for any \gamma > 0, we have

E\left[F(\hat x) - F(x) + \gamma\|A\hat x - b\|\right] \le \sup_{\|\lambda\|\le\gamma}\phi(\lambda).

Proof.
Let \hat\lambda = -\gamma\frac{A\hat x - b}{\|A\hat x - b\|} if A\hat x - b \ne 0, and \hat\lambda = 0 otherwise. Then

\Phi(\hat x, x, \hat\lambda) = F(\hat x) - F(x) + \gamma\|A\hat x - b\|.

In addition, since \|\hat\lambda\| \le \gamma, we have \phi(\hat\lambda) \le \sup_{\|\lambda\|\le\gamma}\phi(\lambda) and thus E\,\phi(\hat\lambda) \le \sup_{\|\lambda\|\le\gamma}\phi(\lambda). Hence, we have the desired result from E\,\Phi(\hat x, x, \hat\lambda) \le E\,\phi(\hat\lambda). \square

Lemma 1.3
Suppose

E\left[F(\hat x) - F(x^*) + \gamma\|A\hat x - b\|\right] \le \epsilon.

Then

E\|A\hat x - b\| \le \frac{\epsilon}{\gamma - \|\lambda^*\|}, \quad \text{and} \quad -\frac{\epsilon\|\lambda^*\|}{\gamma - \|\lambda^*\|} \le E\left[F(\hat x) - F(x^*)\right] \le \epsilon,

where (x^*, \lambda^*) satisfies the optimality conditions in (4), and we assume \|\lambda^*\| < \gamma.

Outline.
The rest of the paper is organized as follows. Section 2 presents the accelerated proximal Jacobian ADMM and its convergence results. In Section 3, we propose an accelerated primal-dual block coordinate update method with convergence analysis. Section 4 assumes more structure on problem (1) and modifies the algorithm of Section 3 to achieve linear convergence. Numerical results are provided in Section 5. Finally, Section 6 concludes the paper.
In this section, we propose an accelerated proximal Jacobian ADMM for solving (1). At each iteration, the algorithm updates all M block variables in parallel by minimizing a linearized proximal approximation of the augmented Lagrangian function, and then it renews the multiplier. Specifically, it iteratively performs the following updates:

x_i^{k+1} = \arg\min_{x_i}\left\langle\nabla_i f(x^k) - A_i^\top(\lambda^k - \beta_k r^k), x_i\right\rangle + g_i(x_i) + \frac{1}{2}\|x_i - x_i^k\|_{P_i^k}^2, \quad i = 1, \ldots, M, \qquad (7a)

\lambda^{k+1} = \lambda^k - \rho_k r^{k+1}, \qquad (7b)

where \beta_k and \rho_k are scalar parameters, P^k is an M \times M block diagonal matrix with P_i^k as its i-th diagonal block for i = 1, \ldots, M, and r^k = Ax^k - b denotes the residual. Note that (7a) consists of M independent subproblems, and they can be solved in parallel.

Algorithm 1 summarizes the proposed method. It reduces to the proximal Jacobian ADMM in [12] if \beta_k, \rho_k and P^k are fixed for all k and there is no nonseparable function f. We will show that adapting the parameters as the iterations progress can accelerate the convergence of the algorithm.

Algorithm 1:
Accelerated proximal Jacobian ADMM for (1)

Initialization: choose x^1, set \lambda^1 = 0, and let r^1 = Ax^1 - b
for k = 1, 2, \ldots do
    Choose parameters \beta_k, \rho_k and a block diagonal matrix P^k.
    Let x^{k+1} \leftarrow (7a) and \lambda^{k+1} \leftarrow (7b) with r^{k+1} = Ax^{k+1} - b.
    if a certain stopping criterion is satisfied then
        Return (x^{k+1}, \lambda^{k+1}).

Throughout the analysis in this section, we make the following assumptions.
Assumption 1
There exists (x^*, \lambda^*) satisfying the KKT conditions in (4).

Assumption 2

\nabla f is Lipschitz continuous with modulus L_f.

Assumption 3
The function g is strongly convex with modulus \mu > 0.

The first two assumptions are standard, and the third one is for showing the convergence rate of O(1/t²), where t is the number of iterations. Note that if f is strongly convex with modulus \mu_f > 0, we can let f \leftarrow f - \frac{\mu_f}{2}\|\cdot\|^2 and g \leftarrow g + \frac{\mu_f}{2}\|\cdot\|^2. This way, we have a convex function f and a strongly convex function g. Hence, Assumption 3 is without loss of generality. With only convexity, Algorithm 1 can be shown to converge at the rate O(1/t) with parameters fixed for all iterations, and the order 1/t is optimal as shown in the very recent work [24].

In this subsection, we show the O(1/t²) convergence rate result of Algorithm 1. First, we establish a result of running one iteration of Algorithm 1.

Lemma 2.1 (One-iteration analysis)

Under Assumptions 2 and 3, let \{(x^k, \lambda^k)\} be the sequence generated from Algorithm 1. Then for any k and (x, \lambda) such that Ax = b, it holds that

\Phi(x^{k+1}, x, \lambda) \le \frac{1}{2\rho_k}\left[\|\lambda - \lambda^k\|^2 - \|\lambda - \lambda^{k+1}\|^2 + \|\lambda^k - \lambda^{k+1}\|^2\right] - \beta_k\|r^{k+1}\|^2 - \frac{1}{2}\left[\|x^{k+1} - x\|^2_{P^k - \beta_k A^\top A + \mu I} - \|x^k - x\|^2_{P^k - \beta_k A^\top A} + \|x^{k+1} - x^k\|^2_{P^k - \beta_k A^\top A - L_f I}\right]. \qquad (8)

Using the above lemma, we are able to prove the following theorem.

Theorem 2.2
Under Assumptions 2 and 3, let \{(x^k, \lambda^k)\} be the sequence generated by Algorithm 1. Suppose that the parameters are set to satisfy

0 < \rho_k \le \beta_k, \quad P^k \succeq \beta_k A^\top A + L_f I, \quad \forall k \ge 1, \qquad (9)

and there exists a number k_0 such that for all k \ge 1,

\frac{k + k_0 + 1}{\rho_k} \le \frac{k + k_0}{\rho_{k-1}}, \qquad (10)

(k + k_0 + 1)(P^k - \beta_k A^\top A) \preceq (k + k_0)(P^{k-1} - \beta_{k-1} A^\top A + \mu I). \qquad (11)

Then, for any (x, \lambda) satisfying Ax = b, we have

\sum_{k=1}^t (k + k_0 + 1)\Phi(x^{k+1}, x, \lambda) + \sum_{k=1}^t \frac{k + k_0 + 1}{2}(2\beta_k - \rho_k)\|r^{k+1}\|^2 + \frac{t + k_0 + 1}{2}\|x^{t+1} - x\|^2_{P^t - \beta_t A^\top A + \mu I} \le \phi(x, \lambda), \qquad (12)

where

\phi(x, \lambda) = \frac{k_0 + 2}{2\rho_1}\|\lambda - \lambda^1\|^2 + \frac{k_0 + 2}{2}\|x - x^1\|^2_{P^1 - \beta_1 A^\top A}. \qquad (13)

In the next theorem, we provide a set of parameters that satisfy the conditions in Theorem 2.2 and establish the O(1/t²) convergence rate result.

Theorem 2.3 (Convergence rate of order 1/t²)

Under Assumptions 1 through 3, let \{(x^k, \lambda^k)\} be the sequence generated by Algorithm 1 with parameters set to

\beta_k = \rho_k = k\beta, \quad P^k = kP + L_f I, \quad \forall k \ge 1, \qquad (14)

where P is a block diagonal matrix satisfying 0 \prec P - \beta A^\top A \preceq \frac{\mu}{2}I. Then

\max\left\{\beta\|r^{t+1}\|^2, \|x^{t+1} - x^*\|^2_{P - \beta A^\top A}\right\} \le \frac{2}{t(t + k_0 + 1)}\phi(x^*, \lambda^*), \qquad (15)

where k_0 = \frac{2L_f}{\mu}, and \phi is defined in (13). In addition, letting

\gamma = \max\{1 + \|\lambda^*\|, 2\|\lambda^*\|\}, \quad T = \frac{t(t + 2k_0 + 3)}{2}, \quad \bar x^{t+1} = \frac{\sum_{k=1}^t (k + k_0 + 1)x^{k+1}}{T},

we have

|F(\bar x^{t+1}) - F(x^*)| \le \frac{1}{T}\max_{\|\lambda\|\le\gamma}\phi(x^*, \lambda), \qquad (16a)

\|A\bar x^{t+1} - b\| \le \frac{1}{T\max\{1, \|\lambda^*\|\}}\max_{\|\lambda\|\le\gamma}\phi(x^*, \lambda). \qquad (16b)

In this section, we generalize Algorithm 1 to a randomized setting where the user may choose to update a subset of blocks at each iteration. Instead of updating all M block variables, we randomly choose a subset of them to renew at each iteration. Depending on the number of processors (nodes, or cores), we can choose a single or multiple block variables for each update.
Our algorithm is an accelerated version of the randomized primal-dual coordinate update method recently proposed in [15], for which we shall use RPDC as its acronym. At each iteration, it performs a block proximal gradient update on a subset of randomly selected primal variables while keeping the remaining ones fixed, followed by an update to the multipliers. Specifically, at iteration k, it selects an index set S_k \subset \{1, \ldots, M\} with cardinality m and performs the following updates:

x_i^{k+1} = \begin{cases}\arg\min_{x_i}\langle\nabla_i f(x^k) - A_i^\top(\lambda^k - \beta_k r^k), x_i\rangle + g_i(x_i) + \frac{\eta_k}{2}\|x_i - x_i^k\|^2, & \text{if } i \in S_k, \\ x_i^k, & \text{if } i \notin S_k,\end{cases} \qquad (17a)

r^{k+1} = r^k + \sum_{i \in S_k} A_i(x_i^{k+1} - x_i^k), \qquad (17b)

\lambda^{k+1} = \lambda^k - \rho_k r^{k+1}, \qquad (17c)

where \beta_k, \rho_k and \eta_k are algorithm parameters whose values will be determined later. Note that we use \frac{\eta_k}{2}\|x_i - x_i^k\|^2 in (17a) for simplicity; it can be replaced by a PSD-matrix-weighted norm square term as in (7a), and our convergence results still hold.

Footnote: In fact, [15] presents a more general algorithmic framework. It assumes two groups of variables, each with a multi-block structure. Our method in Algorithm 2 is an accelerated version of one special case of Algorithm 1 in [15].

Algorithm 2 summarizes the above method. If the parameters \beta_k, \rho_k and \eta_k are fixed during all the iterations, i.e., constant parameters, the algorithm reduces to a special case of the RPDC method
in [15]. Adapting these parameters to the iterations, we will show that Algorithm 2 enjoys a faster convergence rate than RPDC if the problem is strongly convex.
Algorithm 2:
Accelerated randomized primal-dual block coordinate update method for (1)

Initialization: choose x^1, set \lambda^1 = 0, let r^1 = Ax^1 - b, and choose parameter m
for k = 1, 2, \ldots do
    Select S_k \subset \{1, 2, \ldots, M\} uniformly at random with |S_k| = m.
    Choose parameters \beta_k, \rho_k and \eta_k.
    Let x^{k+1} \leftarrow (17a) and \lambda^{k+1} \leftarrow (17c).
    if a certain stopping criterion is satisfied then
        Return (x^{k+1}, \lambda^{k+1}).
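To make the updates (17a)-(17c) concrete, the following minimal Python sketch implements Algorithm 2 with constant parameters (hence a nonadaptive RPDC special case) in the simple setting g_i = 0; the toy problem, the names, and the parameter values are our own illustration, chosen so that 0 < ρ ≤ θβ and η ≥ L_m + β‖A‖² as in Theorem 3.2 below:

```python
import numpy as np

def rpdc(A_blocks, b, grad_f, m, beta, rho, eta, iters, seed=0):
    """Algorithm 2 with constant parameters and g_i = 0: renew m randomly
    chosen blocks via (17a), update the residual incrementally via (17b),
    then update the multiplier via (17c)."""
    rng = np.random.default_rng(seed)
    M = len(A_blocks)
    x = [np.zeros(Ai.shape[1]) for Ai in A_blocks]
    lam = np.zeros(b.shape[0])
    r = -b.astype(float)                    # r^1 = A x^1 - b with x^1 = 0
    for _ in range(iters):
        S = rng.choice(M, size=m, replace=False)   # uniform random block subset
        grads = grad_f(x)                          # partial gradients at x^k
        lam_shift = lam - beta * r                 # lambda^k - beta_k * r^k
        for i in S:
            d = grads[i] - A_blocks[i].T @ lam_shift
            x_new = x[i] - d / eta                 # (17a) with g_i = 0
            r = r + A_blocks[i] @ (x_new - x[i])   # (17b)
            x[i] = x_new
        lam = lam - rho * r                        # (17c)
    return x, lam

# toy usage: min 0.5*||x||^2  s.t.  x_1 + x_2 + x_3 + x_4 = 1  (x_i* = 0.25)
A_blocks = [np.ones((1, 1)) for _ in range(4)]
b = np.array([1.0])
x, lam = rpdc(A_blocks, b, grad_f=lambda x: list(x), m=2,
              beta=1.0, rho=0.5, eta=5.0, iters=20000)
```

With S_k taken to be all M blocks and a matrix-weighted proximal term, the same loop body recovers the Jacobian updates (7a)-(7b) of Algorithm 1.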
Assumption 4
For any S \subset \{1, \ldots, M\} with |S| = m, \nabla_S f is Lipschitz continuous with a uniform constant L_m.

Note that if \nabla f is Lipschitz continuous with constant L_f, then L_m \le L_f and L_M = L_f. In addition, if x^+ and x only differ on a set S \subset [M] with cardinality m, then

f(x^+) \le f(x) + \langle\nabla f(x), x^+ - x\rangle + \frac{L_m}{2}\|x^+ - x\|^2. \qquad (18)

Similar to the analysis in Section 2, we first establish a result of running one iteration of Algorithm 2. Throughout this section, we denote \theta = \frac{m}{M}.

Lemma 3.1 (One iteration analysis)
Under Assumptions 3 and 4, let \{(x^k, \lambda^k)\} be the sequence generated from Algorithm 2. Then for any x such that Ax = b, it holds that

E\left[\Phi(x^{k+1}, x, \lambda^{k+1}) + (\beta_k - \rho_k)\|r^{k+1}\|^2 + \frac{\mu}{2}\|x^{k+1} - x\|^2\right] \le (1 - \theta)E\left[\Phi(x^k, x, \lambda^k) + \beta_k\|r^k\|^2 + \frac{\mu}{2}\|x^k - x\|^2\right] - E\left[\Delta_{\eta_k I - \beta_k A^\top A}(x^{k+1}, x^k, x) - \frac{L_m}{2}\|x^{k+1} - x^k\|^2\right]. \qquad (19)

When \mu = 0 (i.e., (1) is merely convex), Algorithm 2 has an O(1/t) convergence rate with fixed \beta_k, \rho_k, \eta_k. This can be shown from (19), and a similar result in a slightly different form has been established in [15, Theorem 3.6]. For completeness, we provide its proof in the appendix.

Theorem 3.2 (Un-accelerated convergence)

Under Assumptions 1 and 4, let \{(x^k, \lambda^k)\} be the sequence generated from Algorithm 2 with \beta_k = \beta, \rho_k = \rho, \eta_k = \eta for all k, satisfying

0 < \rho \le \theta\beta, \quad \eta \ge L_m + \beta\|A\|^2,

where \|A\| denotes the spectral norm of A. Then

\left|E[F(\bar x^t) - F(x^*)]\right| \le \frac{1}{1 + \theta(t - 1)}\max_{\|\lambda\|\le\gamma}\phi(x^*, \lambda), \qquad (20a)

E\|A\bar x^t - b\| \le \frac{1}{(1 + \theta(t - 1))\max\{1, \|\lambda^*\|\}}\max_{\|\lambda\|\le\gamma}\phi(x^*, \lambda), \qquad (20b)

where (x^*, \lambda^*) satisfies the KKT conditions in (4), \gamma = \max\{1 + \|\lambda^*\|, 2\|\lambda^*\|\}, and

\bar x^t = \frac{x^{t+1} + \theta\sum_{k=2}^t x^k}{1 + \theta(t - 1)}, \quad \phi(x, \lambda) = (1 - \theta)\left(F(x^1) - F(x)\right) + \frac{\eta}{2}\|x^1 - x\|^2 + \frac{\theta}{2\rho}\|\lambda\|^2.

When F is strongly convex, the above O(1/t) convergence rate can be accelerated to O(1/t²) by adaptively changing the parameters at each iteration. The following theorem is our main result. It shows an O(1/t²) convergence result under certain conditions on the parameters. Based on this theorem, we will give a set of parameters that satisfy these conditions, thus providing a specific scheme to choose the parameters.

Theorem 3.3
Under Assumptions 3 and 4, let \{(x^k, \lambda^k)\} be the sequence generated from Algorithm 2 with parameters satisfying the following conditions for a certain number k_0:

\theta(k + k_0 + 1) \ge 1, \quad \forall k \ge 1, \qquad (21a)

(\beta_{k-1} - \rho_{k-1})(k + k_0) \ge (1 - \theta)(k + k_0 + 1)\beta_k, \quad \forall 2 \le k \le t, \qquad (21b)

\frac{\theta(k + k_0 + 1) - 1}{\rho_{k-1}} \ge \frac{\theta(k + k_0 + 2) - 1}{\rho_k}, \quad \forall 2 \le k \le t - 1, \qquad (21c)

\frac{\theta(t + k_0 + 1) - 1}{\rho_{t-1}} \ge \frac{t + k_0 + 1}{\rho_t}, \qquad (21d)

\beta_k(k + k_0 + 1) \ge \beta_{k-1}(k + k_0), \quad \forall k \ge 2, \qquad (21e)

(k + k_0 + 1)(\eta_k - L_m)I \succeq \beta_k(k + k_0 + 1)A^\top A, \quad \forall k \ge 1, \qquad (21f)

(k + k_0)\eta_{k-1} + \mu(\theta(k + k_0 + 1) - 1) \ge (k + k_0 + 1)\eta_k, \quad \forall k \ge 2. \qquad (21g)

Then for any (x, \lambda) such that Ax = b, we have

(t + k_0 + 1)E\,\Phi(x^{t+1}, x, \lambda) + \sum_{k=2}^t(\theta(k + k_0 + 1) - 1)E\,\Phi(x^k, x, \lambda) \le (1 - \theta)(k_0 + 2)E\left[\Phi(x^1, x, \lambda) + \beta_1\|r^1\|^2 + \frac{\mu}{2}\|x^1 - x\|^2\right] + \frac{\eta_1(k_0 + 2)}{2}E\|x^1 - x\|^2 + \frac{\theta(k_0 + 3) - 1}{2\rho_1}E\|\lambda^1 - \lambda\|^2 - \frac{t + k_0 + 1}{2}E\|x^{t+1} - x\|^2_{(\mu + \eta_t)I - \beta_t A^\top A}. \qquad (22)

Specifying the parameters that satisfy (21), we obtain the O(1/t²) convergence rate of Algorithm 2.

Proposition 3.4
The following parameters satisfy all the conditions in (21):

\beta_k = \frac{\mu(\theta k + 2 + \theta)}{2\bar\rho\|A\|^2}, \quad \forall k \ge 1, \qquad (23a)

\rho_k = \begin{cases}\frac{\theta\beta_k}{6 - \theta}, & \text{for } 1 \le k \le t - 1, \\ \frac{(t + k_0 + 1)\rho_{t-1}}{\theta(t + k_0 + 1) - 1}, & \text{for } k = t,\end{cases} \qquad (23b)

\eta_k = \bar\rho\beta_k\|A\|^2 + L_m, \quad \forall k \ge 1, \qquad (23c)

where \bar\rho \ge 1 and

k_0 = \frac{4}{\theta} + \frac{2L_m}{\theta\mu}. \qquad (24)
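The schedule (23)-(24) can be precomputed once t is fixed; the helper below is a direct transcription of the displayed formulas (a sketch only; the function name and the default choice of the free parameter are ours):

```python
def adaptive_parameters(t, theta, mu, L_m, A_norm_sq, rho_bar=1.0):
    """Precompute the adaptive schedule (23)-(24) for k = 1, ..., t (t >= 2).

    theta     : m / M, fraction of blocks renewed per iteration
    mu        : strong convexity modulus of g (Assumption 3)
    L_m       : partial-gradient Lipschitz constant (Assumption 4)
    A_norm_sq : spectral norm squared ||A||^2
    rho_bar   : free parameter of Proposition 3.4 (>= 1)
    """
    k0 = 4.0 / theta + 2.0 * L_m / (theta * mu)                          # (24)
    beta = [mu * (theta * k + 2.0 + theta) / (2.0 * rho_bar * A_norm_sq)
            for k in range(1, t + 1)]                                    # (23a)
    rho = [theta * bk / (6.0 - theta) for bk in beta]                    # (23b), k < t
    rho[-1] = (t + k0 + 1.0) * rho[-2] / (theta * (t + k0 + 1.0) - 1.0)  # (23b), k = t
    eta = [rho_bar * bk * A_norm_sq + L_m for bk in beta]                # (23c)
    return beta, rho, eta
```

Note that β_k grows linearly in k, which is what drives the O(1/t²) rate, in the same spirit as the schedule (14) of Section 2.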
Under Assumptions 1, 3 and 4, let \{(x^k, \lambda^k)\} be the sequence generated from Algorithm 2 with parameters taken as in (23). Then

\left|E[F(\bar x^{t+1}) - F(x^*)]\right| \le \frac{1}{T}\max_{\|\lambda\|\le\gamma}\phi(x^*, \lambda), \quad E\|A\bar x^{t+1} - b\| \le \frac{1}{T\max\{1, \|\lambda^*\|\}}\max_{\|\lambda\|\le\gamma}\phi(x^*, \lambda), \qquad (25)

where \gamma = \max\{1 + \|\lambda^*\|, 2\|\lambda^*\|\},

\bar x^{t+1} = \frac{(t + k_0 + 1)x^{t+1} + \sum_{k=2}^t(\theta(k + k_0 + 1) - 1)x^k}{T},

\phi(x, \lambda) = (1 - \theta)(k_0 + 2)\left[F(x^1) - F(x) + \beta_1\|r^1\|^2 + \frac{\mu}{2}\|x^1 - x\|^2\right] + \frac{\eta_1(k_0 + 2)}{2}\|x^1 - x\|^2 + \frac{\theta(k_0 + 3) - 1}{2\rho_1}\|\lambda\|^2,

and

T = (t + k_0 + 1) + \sum_{k=2}^t(\theta(k + k_0 + 1) - 1).

In addition,

E\|x^{t+1} - x^*\|^2 \le \frac{\phi(x^*, \lambda^*)}{(t + k_0 + 1)\left(\frac{(\bar\rho - 1)\mu}{2\bar\rho}(\theta t + \theta + 2) + 2\mu + L_m\right)}.

In this section, we assume more structure on (1) and show that a linear rate of convergence is possible. If there is no linear constraint, Algorithm 2 reduces to the RCD method proposed in [31]. It is well known that RCD converges linearly if the objective is strongly convex. However, in the presence of linear constraints, mere strong convexity of the objective of the primal problem only ensures the smoothness of its Lagrangian dual function, but not its strong concavity. Hence, in general, we do not expect linear convergence by only assuming strong convexity of the primal objective function. To ensure linear convergence of both the primal and dual variables, we need additional assumptions.

Throughout this section, we suppose that there is at least one block variable that is absent from the nonseparable part of the objective, namely f. For convenience, we rename this block variable as y, and the corresponding component function and constraint coefficient matrix as h and B. Specifically, we consider the following problem:

\min_{x,y} f(x_1, \ldots, x_M) + \sum_{i=1}^M g_i(x_i) + h(y), \quad \text{s.t. } \sum_{i=1}^M A_i x_i + By = b. \qquad (26)

One example of (26) is the problem that appears while computing a point on the central path of a convex program.
Suppose we are interested in solving

\min_x f(x_1, \ldots, x_M), \quad \text{s.t. } \sum_{i=1}^M A_i x_i \le b, \; x_i \ge 0, \; i = 1, \ldots, M. \qquad (27)

Let y = b - \sum_{i=1}^M A_i x_i and use the log-barrier function. We have the log-barrier approximation of (27) as follows:

\min_{x,y} f(x_1, \ldots, x_M) - \mu\sum_{i=1}^M e^\top\log x_i - \mu e^\top\log y, \quad \text{s.t. } \sum_{i=1}^M A_i x_i + y = b, \qquad (28)

where e is the all-one vector. As \mu decreases, the approximation becomes more accurate.

Towards a solution to (26), we modify Algorithm 2 by updating the y-variable after the x-update. Since there is only a single y-block, to balance the x and y updates, we do not renew y in every iteration but instead update it with probability \theta = \frac{m}{M}. Hence, roughly speaking, the x and y variables are updated with the same frequency. The method is summarized in Algorithm 3.

In this section, we denote z = (x, y, \lambda). Assume h is differentiable. Similar to (4), a point z^* = (x^*, y^*, \lambda^*) is called a KKT point of (26) if

0 \in \partial F(x^*) - A^\top\lambda^*, \qquad (32a)

\nabla h(y^*) - B^\top\lambda^* = 0, \qquad (32b)

Ax^* + By^* - b = 0. \qquad (32c)

Besides Assumptions 3 and 4, we make two additional assumptions as follows.

Algorithm 3: Randomized primal-dual block coordinate update for (26)

Initialization: choose (x^1, y^1), set \lambda^1 = 0, and choose parameters \beta, \rho, \eta_x, \eta_y, m. Let r^1 = Ax^1 + By^1 - b and \theta = \frac{m}{M}.
for k = 1, 2, \ldots do
    Select an index set S_k \subset \{1, \ldots, M\} uniformly at random with |S_k| = m.
    Keep x_i^{k+1} = x_i^k, \forall i \notin S_k, and update

    x_i^{k+1} = \arg\min_{x_i}\left\langle\nabla_i f(x^k) - A_i^\top(\lambda^k - \beta r^k), x_i\right\rangle + g_i(x_i) + \frac{\eta_x}{2}\|x_i - x_i^k\|^2, \quad \text{if } i \in S_k. \qquad (29)

    Let r^{k+\frac{1}{2}} = r^k + \sum_{i \in S_k} A_i(x_i^{k+1} - x_i^k).
    With probability 1 - \theta keep y^{k+1} = y^k, and with probability \theta let y^{k+1} = \tilde y^{k+1}, where

    \tilde y^{k+1} = \arg\min_y h(y) - \left\langle B^\top(\lambda^k - \beta r^{k+\frac{1}{2}}), y\right\rangle + \frac{\eta_y}{2}\|y - y^k\|^2. \qquad (30)

    Let r^{k+1} = r^{k+\frac{1}{2}} + B(y^{k+1} - y^k). Update the multiplier by

    \lambda^{k+1} = \lambda^k - \rho r^{k+1}. \qquad (31)

    if a certain stopping criterion is satisfied then
        Return (x^{k+1}, y^{k+1}, \lambda^{k+1}).
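As an illustration of the y-update, for the log-barrier instance (28) we have B = I and h(y) = -\mu e^\top\log y, so (30) decouples into scalar problems with a closed-form positive root (the derivation and code are our own sketch):

```python
import numpy as np

def y_update_log_barrier(y_k, lam, beta, r_half, eta_y, mu):
    """Closed-form solution of (30) for h(y) = -mu * sum(log y) and B = I.

    Setting the derivative of
        h(y) - <lam - beta * r_half, y> + (eta_y / 2) * ||y - y_k||^2
    to zero and multiplying by y gives, componentwise, the quadratic
        eta_y * y^2 - (s + eta_y * y_k) * y - mu = 0,   s = lam - beta * r_half,
    whose unique positive root is returned.
    """
    s = lam - beta * r_half
    q = s + eta_y * y_k
    return (q + np.sqrt(q * q + 4.0 * eta_y * mu)) / (2.0 * eta_y)

# example: the returned point satisfies first-order optimality of (30)
y_k = np.array([1.0, 2.0])
lam = np.array([0.3, -0.2])
r_half = np.array([0.1, 0.0])
y_new = y_update_log_barrier(y_k, lam, beta=1.0, r_half=r_half, eta_y=2.0, mu=0.5)
```

The barrier keeps every component of the update strictly positive, so no projection is needed.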
There exists z^* = (x^*, y^*, \lambda^*) satisfying the KKT conditions in (32).

Assumption 6
The function h is strongly convex with modulus \nu, and its gradient \nabla h is Lipschitz continuous with constant L_h.

The strong convexity of F and h implies

F(x^{k+1}) - F(x^*) - \langle\tilde\nabla F(x^*), x^{k+1} - x^*\rangle \ge \frac{\mu}{2}\|x^{k+1} - x^*\|^2, \qquad (33a)

\langle y^{k+1} - y^*, \nabla h(y^{k+1}) - \nabla h(y^*)\rangle \ge \nu\|y^{k+1} - y^*\|^2. \qquad (33b)

Similar to Lemma 3.1, we first establish a result of running one iteration of Algorithm 3. It can be proven by arguments similar to those used for Lemma 3.1.
Lemma 4.1 (One iteration analysis)
Under Assumptions 3, 4, and 6, let \{(x^k, y^k, \lambda^k)\} be the sequence generated from Algorithm 3. Then for any k and (x, y, \lambda) such that Ax + By = b, it holds that

E\,\varphi(z^{k+1}, z) + (\beta - \rho)E\|r^{k+1}\|^2 + \frac{1}{\rho}E\,\Delta(\lambda^{k+1}, \lambda^k, \lambda) + E\left[\Delta_P(x^{k+1}, x^k, x) - \frac{L_m}{2}\|x^{k+1} - x^k\|^2\right] + E\,\Delta_Q(y^{k+1}, y^k, y)
\le (1 - \theta)E\,\varphi(z^k, z) + \beta(1 - \theta)E\|r^k\|^2 + \frac{1 - \theta}{\rho}E\,\Delta(\lambda^k, \lambda^{k-1}, \lambda) + \beta E\langle A(x^{k+1} - x), B(y^{k+1} - y^k)\rangle + \beta(1 - \theta)E\langle B(y^k - y), A(x^{k+1} - x^k)\rangle, \qquad (34)

where P = \eta_x I - \beta A^\top A, Q = \eta_y I - \beta B^\top B, and

\varphi(z^k, z) = F(x^k) - F(x) + \frac{\mu}{2}\|x^k - x\|^2 + \langle y^k - y, \nabla h(y^k)\rangle - \langle\lambda, Ax^k + By^k - b\rangle. \qquad (35)

In the following, we let

\Psi(z^k, z^*) = F(x^k) - F(x^*) - \langle\tilde\nabla F(x^*), x^k - x^*\rangle + \langle y^k - y^*, \nabla h(y^k) - \nabla h(y^*)\rangle, \qquad (36)

and

\psi(z^k, z^*; P, Q, \beta, \rho, c, \tau_1) = (1 - \theta)E\,\Psi(z^k, z^*) + \frac{\beta(1 - \theta)}{2}E\|r^k\|^2 + \frac{1}{2}E\|x^k - x^*\|^2_{P + \mu(1 - \theta)I} + \frac{1}{2}E\|y^k - y^*\|^2_{Q + \beta(1 - \theta)\tau_1 B^\top B} + \frac{1}{2\rho}E\left[\|\lambda^k - \lambda^*\|^2 - (1 - \theta)\|\lambda^{k-1} - \lambda^*\|^2 + \frac{1}{\theta}\|\lambda^k - \lambda^{k-1}\|^2\right]. \qquad (37)

The following theorem is key to establishing the linear convergence of Algorithm 3.

Theorem 4.2
Under Assumptions 3 through 6, let \{(x^k, y^k, \lambda^k)\} be the sequence generated from Algorithm 3 with \rho = \theta\beta. Let 0 < \alpha < \theta and

\gamma = \max\left\{\frac{\|A\|^2}{\alpha\mu}, \frac{\|B\|^2}{\alpha\nu}\right\}.

Choose \delta, \kappa \ge 0 such that

\begin{pmatrix} 1 - (1 - \theta)(1 + \delta) & (1 - \theta)(1 + \delta) \\ (1 - \theta)(1 + \delta) & \kappa - (1 - \theta)(1 + \delta) \end{pmatrix} \succeq \begin{pmatrix} \theta - \theta^2 & -\theta \\ -\theta & \theta - (1 - \theta)^2 \end{pmatrix}, \qquad (38)

and positive numbers \eta_x, \eta_y, c, \tau_1, \tau_2, \beta such that

P \succeq \beta(1 - \theta)\tau_2 A^\top A + L_m I, \qquad (39a)

Q \succeq cQ^\top Q + \frac{4c}{\rho}(1 - \theta)\left(1 + \frac{1}{\delta}\right)B^\top BB^\top B + \beta\tau_1 B^\top B. \qquad (39b)

Then it holds that

(1 - \alpha)E\,\Psi(z^{k+1}, z^*) + \frac{1}{2}E\|x^{k+1} - x^*\|^2_{P + (\alpha\mu + \mu)I - \beta\tau_2 A^\top A} + \frac{1}{2}E\|y^{k+1} - y^*\|^2_{Q + (\alpha\nu - cL_h^2)I}
+ \left(\beta - \frac{\rho}{\gamma}\right)E\|r^{k+1}\|^2 - \left(\frac{c}{\rho}\left(\kappa + 2(1 - \theta)\left(1 + \frac{1}{\delta}\right)\right) + 2c(\beta - \rho)\right)E\|B^\top r^{k+1}\|^2
+ \left(\frac{1}{2\rho} + c\,\sigma_{\min}(BB^\top)\right)E\left[\|\lambda^{k+1} - \lambda^*\|^2 - (1 - \theta)\|\lambda^k - \lambda^*\|^2 + \frac{1}{\theta}\|\lambda^{k+1} - \lambda^k\|^2\right]
\le \psi(z^k, z^*; P, Q, \beta, \rho, c, \tau_1). \qquad (40)

Using Theorem 4.2, a linear convergence rate of Algorithm 3 follows.
Under Assumptions 3 through 6, let \{(x^k, y^k, \lambda^k)\} be the sequence generated from Algorithm 3 with \rho = \theta\beta. Let 0 < \alpha < \theta and \gamma = \max\left\{\frac{\|A\|^2}{\alpha\mu}, \frac{\|B\|^2}{\alpha\nu}\right\}. Assume that B has full row rank and \max\{\|A\|, \|B\|\} \le 1. Choose \delta, \kappa, \eta_x, \eta_y, c, \beta, \tau_1, \tau_2 satisfying (38) and (39), and in addition,

\alpha\mu + \theta\mu > \beta\tau_2, \qquad (41a)

3\alpha\nu > cL_h^2 + \beta(1 - \theta)^2\tau_1, \qquad (41b)

\frac{1}{\gamma} > \frac{c}{\rho}\left(\kappa + 2(1 - \theta)\left(1 + \frac{1}{\delta}\right)\right) + 2c(\beta - \rho). \qquad (41c)

Then

\psi(z^{k+1}, z^*; P, Q, \beta, \rho, c, \tau_1) \le \frac{1}{1 + \eta}\,\psi(z^k, z^*; P, Q, \beta, \rho, c, \tau_1), \qquad (42)

where

\eta = \min\left\{\frac{\theta - \alpha}{1 - \theta}, \; \frac{\alpha\mu + \theta\mu - \beta\tau_2}{\eta_x + \mu(1 - \theta)}, \; \frac{3\alpha\nu - cL_h^2 - \beta(1 - \theta)^2\tau_1}{\eta_y + \beta(1 - \theta)^2\tau_1}, \; \frac{\frac{1}{\gamma} - \frac{c}{\rho}\left(\kappa + 2(1 - \theta)(1 + \frac{1}{\delta})\right) - 2c(\beta - \rho)}{\beta(1 - \theta)}, \; c\rho\,\sigma_{\min}(BB^\top)\right\} > 0.
Remark 4.1
We can always rescale A, B and b without essentially altering the linear constraints. Hence, the assumption \max\{\|A\|, \|B\|\} \le 1 can be made without loss of generality. From (42), it is easy to see that when P \succ 0 and Q \succ 0, (x^k, y^k) converges to (x^*, y^*) R-linearly in expectation. In addition, note that

\|\lambda^{k+1} - \lambda^*\|^2 - (1 - \theta)\|\lambda^k - \lambda^*\|^2 + \frac{1}{\theta}\|\lambda^{k+1} - \lambda^k\|^2
= \theta\|\lambda^{k+1} - \lambda^*\|^2 + 2(1 - \theta)\langle\lambda^{k+1} - \lambda^*, \lambda^{k+1} - \lambda^k\rangle + \left(\frac{1}{\theta} - 1 + \theta\right)\|\lambda^{k+1} - \lambda^k\|^2
\ge \left(\theta - \frac{(1 - \theta)^2}{\frac{1}{\theta} - 1 + \theta}\right)\|\lambda^{k+1} - \lambda^*\|^2 = \frac{\theta^2}{1 - \theta + \theta^2}\|\lambda^{k+1} - \lambda^*\|^2.

Hence, (42) also implies an R-linear convergence of \lambda^k to \lambda^* in expectation.

Remark 4.2

We give examples of parameters that satisfy the conditions required in Theorem 4.3. First consider the case of \theta = 1, i.e., all blocks are updated at each iteration. In this case, we can choose \delta = 0 and an appropriate \kappa to satisfy (38), \eta_x = \beta\|A\|^2 + L_f to satisfy (39a), and \alpha and \tau_2 such that \beta\tau_2 < (\alpha + \theta)\mu to ensure that (41a) holds. Finally, choose \eta_y sufficiently large and c sufficiently small, and all the other conditions in Theorem 4.3 are satisfied. Next consider the case of \theta < 1. We can choose \delta = \frac{\theta}{2(1 - \theta)} and a correspondingly large \kappa to satisfy (38), let \alpha = \frac{\theta}{4}, take \tau_1, \tau_2 sufficiently small, and set \eta_x = \beta(1 + (1 - \theta)\tau_2)\|A\|^2 + L_m and \eta_y > \beta(1 + \tau_1)\|B\|^2. With such choices, all the other conditions required in Theorem 4.3 hold when c is sufficiently small.

Remark 4.3
If there is only one x-block and there is no f function, then Algorithm 3 reduces to the so-called linearized ADMM. To show the linear convergence of the linearized ADMM, one scenario in [13, Theorem 3.1] assumes the strong convexity of g and h, the smoothness of h, and the full row-rankness of B. In Theorem 4.3 we make the same assumptions, and so our result can be considered as a generalization.

5 Numerical experiments

The aim of this section is to test the practical performance of the proposed algorithms. We test Algorithm 2 on the quadratic program

min_x F(x) = (1/2)x^⊤Qx + c^⊤x, s.t. Ax = b, x ≥ 0,  (43)

and Algorithm 3 on the log-barrier approximation of a linear program:

min_{x,y} c^⊤x − e^⊤ log x − e^⊤ log y, s.t. Ax + y = b, x_i ≤ u_i, ∀i.  (44)

Quadratic programming.
Two types of randomized implementations are considered: one with fixed parameters and the newly introduced one with adaptive parameters, which shall be called nonadaptive RPDC and adaptive RPDC, respectively. Note that the former reduces to the method proposed in [15] when applied to (43). The purpose of the experiment is to test the effect of acceleration for the latter approach.

The data was generated randomly as follows. We let Q = HDH^⊤ ∈ R^{n×n}, where H is a Gaussian randomly generated orthogonal matrix and D is a diagonal matrix with d_ii = 1 + (i−1)(L−1)/(n−1), i = 1, ..., n. Hence, the smallest and largest singular values of Q are 1 and L, respectively, and the objective of (43) is strongly convex with modulus 1. The components of c follow the standard Gaussian distribution, and those of b follow a uniform distribution on an interval [0, ·]. We set A = [B, I] ∈ R^{p×n} to guarantee the existence of feasible solutions, where B was generated according to the standard Gaussian distribution. In addition, we normalized A so that it has a unit spectral norm.

(Footnote: Besides the scenario in which g and h are strongly convex, h is smooth, and B is of full row rank, [13, Theorem 3.1] also shows linear convergence of the linearized ADMM under three other different scenarios.)

In the test, we fixed n = 2000, p = 200 and varied L among {10, 100, 1000}. For both nonadaptive and adaptive RPDC, we evenly partitioned x into 40 blocks, i.e., each block consists of 50 coordinates, and we set m = 40, i.e., all blocks are updated at each iteration. For the adaptive RPDC, we set the values of its parameters according to (23) with ρ = 1, and those for the nonadaptive RPDC were set based on Theorem 3.2 with ρ = β and η = 100 + β for all k, where β varied among {1, 10, 100, 1000}. Figures 1 through 3 plot the objective values and feasibility violations of Algorithm 2 under these two settings. From these results, we see that adaptive RPDC performed well for all three datasets with a single set of parameters, while the performance of the nonadaptive one was severely affected by the penalty parameter.

[Figure 1: Results by Algorithm 2 with adaptive and nonadaptive parameters for solving (43) with problem size n = 2000, p = 200 and condition number 10; the latter uses penalty parameters β ∈ {1, 10, 100, 1000}. Top row: difference of objective value to the optimal value |F(x^k) − F(x^*)|; bottom row: violation of feasibility ‖Ax^k − b‖.]

Linear programming.
In this test, we apply Algorithm 3 to the problem (44), where we let f(x) = c^⊤x, g(x) = −e^⊤ log x, and h(y) = −e^⊤ log y. The purpose of this experiment is to demonstrate the linear convergence of Algorithm 3.

We generated A and c according to the standard Gaussian distribution and b by a uniform distribution. The upper bound was set to u_i = 10 for all i. We treated x as a single block and set the penalty β to a value below 1, with η_x = β‖A‖^2 and η_y chosen accordingly. This setting satisfies the conditions required in Theorem 4.3 if α is sufficiently close to 1. Note that g and h do not have uniform strong convexity constants, but they are both strongly convex on a bounded set. Figure 4 shows the convergence behavior of Algorithm 3. From the figure, we can clearly see that the algorithm converges linearly to an optimal solution.

[Figure 2: Results by Algorithm 2 with adaptive and nonadaptive parameters for solving (43) with problem size n = 2000, p = 200 and condition number 100; the latter uses penalty parameters β ∈ {1, 10, 100, 1000}. Top row: difference of objective value to the optimal value |F(x^k) − F(x^*)|; bottom row: violation of feasibility ‖Ax^k − b‖.]

[Figure 3: Results by Algorithm 2 with adaptive and nonadaptive parameters for solving (43) with problem size n = 2000, p = 200 and condition number 1000; the latter uses penalty parameters β ∈ {1, 10, 100, 1000}. Top row: difference of objective value to the optimal value |F(x^k) − F(x^*)|; bottom row: violation of feasibility ‖Ax^k − b‖.]
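To make the quadratic-programming setup concrete, the following self-contained sketch generates data exactly as described above, at a reduced size, and runs a plain single-block linearized augmented-Lagrangian loop as a stand-in solver. The sizes (n = 60, p = 10), the parameters β = ρ = 10, the iteration count, the interval [0, 1] for b, and the solver loop itself are all illustrative assumptions; this is not the paper's Algorithm 2 or its tuned settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data generated as in the experiment (reduced size; the paper uses n = 2000, p = 200).
n, p, L = 60, 10, 10
H, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal H
d = 1 + np.arange(n) * (L - 1) / (n - 1)          # d_ii = 1 + (i-1)(L-1)/(n-1)
Q = (H * d) @ H.T                                 # Q = H D H^T, eigenvalues in [1, L]
c = rng.standard_normal(n)
B = rng.standard_normal((p, n - p))
A = np.hstack([B, np.eye(p)])                     # A = [B, I] guarantees feasibility
A /= np.linalg.norm(A, 2)                         # unit spectral norm
b = rng.uniform(0.0, 1.0, p)                      # assumed interval [0, 1]

# A simple linearized augmented-Lagrangian iteration (illustrative stand-in only).
beta = rho = 10.0
eta = L + beta * np.linalg.norm(A, 2) ** 2        # majorizes ||Q|| + beta ||A||^2
x, lam = np.zeros(n), np.zeros(p)
for _ in range(8000):
    grad = Q @ x + c - A.T @ lam + beta * A.T @ (A @ x - b)
    x = np.maximum(x - grad / eta, 0.0)           # projected (proximal) step keeps x >= 0
    lam -= rho * (A @ x - b)                      # dual update

print(np.linalg.norm(A @ x - b))                  # feasibility violation
```

Because the objective is strongly convex with modulus 1, the feasibility violation and the optimality residual of this stand-in decay rapidly, which is the qualitative behavior the adaptive and nonadaptive RPDC variants are compared on in Figures 1 through 3.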
[Figure 4: Results by Algorithm 3 on the problem (44). Left: difference of objective value to the optimal value |F(x^k) + h(y^k) − F(x^*) − h(y^*)| versus the number of epochs; right: violation of feasibility ‖Ax^k + By^k − b‖ versus the number of epochs.]

6 Conclusions

In this paper, we have proposed an accelerated proximal Jacobian ADMM and generalized it to an accelerated randomized primal-dual coordinate updating method for solving linearly constrained multi-block structured convex programs. We have shown that if the objective is strongly convex, then the methods achieve an O(1/t^2) convergence rate, where t is the total number of iterations. In addition, if one block variable is independent of the others in the objective and its part of the objective function is smooth, we have modified the primal-dual coordinate updating method to achieve linear convergence. Numerical experiments on quadratic programming and the log-barrier approximation of linear programming have demonstrated the efficacy of the newly proposed methods.

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[2] D. Boley. Local linear convergence of the alternating direction method of multipliers on quadratic or linear programs. SIAM Journal on Optimization, 23(4):2183–2207, 2013.
[3] K. Bredies and H. Sun. Accelerated Douglas-Rachford methods for the solution of convex-concave saddle-point problems. arXiv preprint arXiv:1604.06282, 2016.
[4] X. Cai, D. Han, and X. Yuan. The direct extension of ADMM for three-block separable convex minimization models is convergent when one function is strongly convex. Optimization Online, 2014.
[5] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
[6] C. Chen, B. He, Y. Ye, and X. Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155(1-2):57–79, 2016.
[7] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
[8] Y. Chen, G. Lan, and Y. Ouyang. Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814, 2014.
[9] P. L. Combettes and J.-C. Pesquet. Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM Journal on Optimization, 25(2):1221–1248, 2015.
[10] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[11] C. Dang and G. Lan. Randomized methods for saddle point computation. arXiv preprint arXiv:1409.8625, 2014.
[12] W. Deng, M.-J. Lai, Z. Peng, and W. Yin. Parallel multi-block ADMM with o(1/k) convergence. Journal of Scientific Computing, pages 1–25, 2016.
[13] W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889–916, 2015.
[14] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
[15] X. Gao, Y. Xu, and S. Zhang. Randomized primal-dual proximal block coordinate updates. arXiv preprint arXiv:1605.05969, 2016.
[16] X. Gao and S.-Z. Zhang. First-order algorithms for convex optimization with nonseparable objective and coupled constraints. Journal of the Operations Research Society of China, pages 1–29, 2015.
[17] R. Glowinski and A. Marrocco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis, 9(R2):41–76, 1975.
[18] T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588–1623, 2014.
[19] B. He, L. Hou, and X. Yuan. On full Jacobian decomposition of the augmented Lagrangian method for separable convex programming. SIAM Journal on Optimization, 25(4):2274–2312, 2015.
[20] B. He, M. Tao, and X. Yuan. Alternating direction method with Gaussian back substitution for separable convex programming. SIAM Journal on Optimization, 22(2):313–340, 2012.
[21] B. He and X. Yuan. On the acceleration of augmented Lagrangian method for linearly constrained optimization. Optimization Online, 2010.
[22] B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
[23] G. M. James, C. Paulson, and P. Rusmevichientong. Penalized and constrained regression. Technical report, 2013.
[24] H. Li and Z. Lin. Optimal nonergodic o(1/k) convergence rate: when linearized ADM meets Nesterov's extrapolation. arXiv preprint arXiv:1608.06366, 2016.
[25] M. Li, D. Sun, and K.-C. Toh. A convergent 3-block semi-proximal ADMM for convex minimization problems with one strongly convex block. Asia-Pacific Journal of Operational Research, 32(04):1550024, 2015.
[26] T. Lin, S. Ma, and S. Zhang. On the global linear convergence of the ADMM with multiblock variables. SIAM Journal on Optimization, 25(3):1478–1497, 2015.
[27] Z. Lu and L. Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1-2):615–642, 2015.
[28] H. Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
[29] R. D. Monteiro and B. F. Svaiter. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization, 23(1):475–507, 2013.
[30] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.
[31] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[32] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
[33] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao Jr. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 8(1):644–681, 2015.
[34] Z. Peng, T. Wu, Y. Xu, M. Yan, and W. Yin. Coordinate friendly structures, algorithms and applications. Annals of Mathematical Sciences and Applications, 1(1):57–119, 2016.
[35] Z. Peng, Y. Xu, M. Yan, and W. Yin. ARock: an algorithmic framework for asynchronous parallel coordinate updates. SIAM Journal on Scientific Computing, 38(5):A2851–A2879, 2016.
[36] J.-C. Pesquet and A. Repetti. A class of randomized primal-dual algorithms for distributed optimization. arXiv preprint arXiv:1406.6404, 2014.
[37] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, pages 1–52, 2012.
[38] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.
[39] Y. Xu. Hybrid Jacobian and Gauss-Seidel proximal block coordinate update methods for linearly constrained convex programming. arXiv preprint arXiv:1608.03928, 2016.
[40] Y. Xu. Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM Journal on Optimization, 27(3):1459–1484, 2017.
[41] Y. Xu. Asynchronous parallel primal-dual block update methods. arXiv preprint arXiv:1705.06391, 2017.
A Technical proofs: Section 2
In this section, we give the detailed proofs of the lemmas and theorems in Section 2. The following lemma will be used a few times. Note that when S = [M], the result is deterministic.

Lemma A.1
Let S be a uniformly selected subset of [M] with cardinality m, and let x^o be a vector independent of S. Suppose x^+ is a random vector dependent on S whose coordinates out of S are the same as those of x^o. Let β ∈ R, let λ^o and r^o be vectors independent of S, and let W be a positive semidefinite M × M block diagonal matrix. If

∇_S f(x^o) + ∇̃g_S(x^+_S) − A_S^⊤(λ^o − βr^o) + W_S(x^+_S − x^o_S) = 0,

then for any x, it holds that

E_S[ F(x^+) − F(x) + (μ/2)‖x^+ − x‖^2 − ⟨A(x^+ − x), λ^o − βr^o⟩ ]
≤ (1 − θ)[ F(x^o) − F(x) + (μ/2)‖x^o − x‖^2 − ⟨A(x^o − x), λ^o − βr^o⟩ ]
− (1/2) E_S[ ‖x^+ − x‖^2_W − ‖x^o − x‖^2_W + ‖x^+ − x^o‖^2_{W − L_m I} ],  (45)

where θ = m/M, L_m is given in Assumption 4, and the expectation is taken over S.

Proof. For any x, we have

⟨ x^+_S − x_S, ∇_S f(x^o) + ∇̃g_S(x^+_S) − A_S^⊤(λ^o − βr^o) + W_S(x^+_S − x^o_S) ⟩ = 0.

We split the left hand side of the above equation into four terms and bound each of them as below. First, we have

E_S⟨x^+_S − x_S, ∇_S f(x^o)⟩ = E_S⟨x^+ − x^o, ∇f(x^o)⟩ + E_S⟨x^o_S − x_S, ∇_S f(x^o)⟩
≥ E_S[ f(x^+) − f(x^o) − (L_m/2)‖x^+ − x^o‖^2 ] + θ[ f(x^o) − f(x) ]
= E_S[ f(x^+) − f(x) − (L_m/2)‖x^+ − x^o‖^2 ] − (1 − θ)[ f(x^o) − f(x) ],  (46)

where the first equality uses the fact that x^+_i = x^o_i for all i ∉ S, and the inequality follows from the uniform distribution of S, the convexity of f, and the inequality (18). Secondly, it follows from the strong convexity of g that

⟨ x^+_S − x_S, ∇̃g_S(x^+_S) ⟩ ≥ g_S(x^+_S) − g_S(x_S) + Σ_{i∈S} (μ/2)‖x^+_i − x_i‖^2
(47)Since g S ( x + S ) − g S ( x S ) = g ( x + ) − g ( x o )+ g S ( x oS ) − g S ( x S ) and E S [ g S ( x oS ) − g S ( x S )] = θ [ g ( x o ) − g ( x )],we have E S [ g S ( x + S ) − g S ( x S )] = E S [ g ( x + ) − g ( x o )] + θ [ g ( x o ) − g ( x )]= E S [ g ( x + ) − g ( x )] − (1 − θ )[ g ( x o ) − g ( x )] . (48)Similarly, it holds E S P i ∈ S µ k x + i − x i k = µ (cid:0) E S k x + − x k − (1 − θ ) k x o − x k (cid:1) . Hence, takingexpectation on both sides of (47) yields E S D x + S − x S , ˜ ∇ g S ( x + S ) E ≥ E S h g ( x + ) − g ( x ) + µ k x + − x k i − (1 − θ ) h g ( x o ) − g ( x ) + µ k x o − x k i . (49)Thirdly, by essentially the same arguments on showing (48), we have E S D x + S − x S , − A ⊤ S ( λ o − βr o ) E = − E S (cid:10) A ( x + − x ) , λ o − βr o (cid:11) + (1 − θ ) (cid:10) A ( x o − x ) , λ o − βr o (cid:11) . (50)Fourth, note (cid:10) x + S − x S , W S ( x + S − x oS ) (cid:11) = h x + − x, W ( x + − x o ) i , and thus by (6), E S (cid:10) x + S − x S , W S ( x + S − x oS ) (cid:11) = 12 E S (cid:2) k x + − x k W − k x o − x k W + k x + − x o k W (cid:3) . (51)The desired result is obtained by adding (46), (49), (50), and (51), and recalling F = f + g . (cid:3) .1 Proof of Lemma 2.1 From (7a), we have the optimality condition ∇ f ( x k ) − A ⊤ ( λ k − β k r k ) + ˜ ∇ g ( x k +1 ) + P k ( x k +1 − x k ) = 0 . Hence, for any x such that Ax = b , it follows from the definition of Φ in (3) and Lemma A.1 with S = [ M ], x o = x k , λ o = λ k , β = β k , x + = x k +1 , and W = P k thatΦ( x k +1 , x, λ ) ≤ D Ax k +1 − b, λ k − β k r k E − D Ax k +1 − b, λ E − E S h k x k +1 − x k P k + µI − k x k − x k P k + k x k +1 − x k k P k − L f I i . (52)Using the fact λ k +1 = λ k − ρ k ( Ax k +1 − b ), we have D Ax k +1 − b, λ k − λ E = 1 ρ k D λ k − λ k +1 , λ k − λ E (6) = 12 ρ k h k λ − λ k k − k λ − λ k +1 k + k λ k − λ k +1 k i . 
(53)In addition, we write r k = r k − r k +1 + r k +1 = r k +1 − A ( x k +1 − x k ) and have D Ax k +1 − b, − β k r k E = − β k k r k +1 k + β k D A ( x k +1 − x ) , A ( x k +1 − x k ) E (6) = − β k k r k +1 k + β k h k A ( x k +1 − x ) k − k A ( x k − x ) k + k A ( x k +1 − x k ) k i (54)Substituting (53) and (54) into (52) gives the inequality in (8). A.2 Proof of Theorem 2.2
First, we have t X k =1 k + k + 12 ρ k h k λ − λ k k − k λ − λ k +1 k i = k + 22 ρ k λ − λ k − t + k + 12 ρ t k λ − λ t +1 k + t X k =2 (cid:18) k + k + 12 ρ k − k + k ρ k − (cid:19) k λ − λ k k ≤ k + 22 ρ k λ − λ k . (55)24n addition, − t X k =1 k + k + 12 (cid:16) k x k +1 − x k P k − β k A ⊤ A + µI − k x k − x k P k − β k A ⊤ A (cid:17) = k + 22 k x − x k P − β A ⊤ A − t + k + 12 k x t +1 − x k P t − β t A ⊤ A + µI + 12 t X k =2 (cid:16) ( k + k + 1) k x k − x k P k − β k A ⊤ A − ( k + k ) k x k − x k P k − − β k − A ⊤ A + µI (cid:17) (11) ≤ k + 22 k x − x k P − β A ⊤ A − t + k + 12 k x t +1 − x k P t − β t A ⊤ A + µI . (56)Now multiplying k + k + 1 to both sides of (8) and adding it over k , we obtain (12) by using (55)and (56), and noting k λ k − λ k +1 k = ρ k k r k +1 k and k x k +1 − x k k P k − β k A ⊤ A − L f I ≥ A.3 Proof of Theorem 2.3
From the choice of k and the condition P − βA ⊤ A (cid:22) µ I , it is not difficult to verify( k + k + 1) h kP − kβA ⊤ A + L f I i (cid:22) ( k + k ) h ( k − P − ( k − βA ⊤ A + ( L f + µ ) I i , ∀ k ≥ . Hence, the condition in (11) holds. In addition, it is easy to see that all conditions in (9) and (10)also hold. Therefore, we have (12), which, by taking parameters in (14) and x = x ∗ , reduces to t X k =1 ( k + k + 1)Φ( x k +1 , x ∗ , λ ) + t X k =1 k ( k + k + 1)2 β k r k +1 k + t + k + 12 k x t +1 − x ∗ k t ( P − βA ⊤ A )+( L f + µ ) I ≤ φ ( x ∗ , λ ) , (57)where we have used the fact λ = 0.Letting λ = λ ∗ , we have from (5) and (57) that (by dropping nonnegative Φ( x k +1 , x ∗ , λ ∗ )’s): t ( t + k + 1)2 β k r t +1 k + t + k + 12 k x t +1 − x ∗ k t ( P − βA ⊤ A )+( L f + µ ) I ≤ φ ( x ∗ , λ ∗ ) , which indicates (15). In addition, from the convexity of F and (57), we have that for any λ , itholds t ( t +2 k +3)2 Φ(¯ x t +1 , x ∗ , λ ) ≤ φ ( x ∗ , λ ) , which together with Lemmas 1.2 and 1.3 implies (16). B Technical proofs: Section 3
In this section, we give the proofs of the lemmas and theorems in Section 3.

B.1 Proof of Lemma 3.1
From the update in (17a), we have the optimality condition: ∇ S k f ( x k ) − A ⊤ S k ( λ k − β k r k ) + ˜ ∇ g S k ( x k +1 S k ) + η k ( x k +1 S k − x kS k ) = 0 . (58)It follows from the update rule of λ that −h Ax k +1 − b, λ k i = −h Ax k +1 − b, λ k +1 i − ρ k k r k +1 k . Plugging (54) and the above equation into (45) with S = S k , λ o = λ k , β = β k , x o = x k , x + = x k +1 , W = η k I , and x satisfying Ax = b , we have the desired result by taking expectation and recallingthe definition of ∆ in (2) and Φ in (3). B.2 Proof of Theorem 3.2
Let β k = β, ρ k = ρ and η k = η in (19), and also note µ = 0 and η ≥ L m + β k A k . We have E h Φ( x k +1 , x, λ k +1 ) + ( β − ρ ) k r k +1 k i ≤ (1 − θ ) E h Φ( x k , x, λ k ) + β k r k k i − E h k x k +1 − x k ηI − βA ⊤ A − k x k − x k ηI − βA ⊤ A i . Summing the above inequality over k = 1 through t and noting ρ ≤ θβ give E (cid:2) Φ( x t +1 , x, λ t +1 ) + ( β − ρ ) k r t +1 k (cid:3) + θ t − X k =1 E Φ( x k +1 , x, λ k +1 ) (59) ≤ (1 − θ ) E (cid:2) Φ( x , x, λ ) + β k r k (cid:3) + 12 k x − x k ηI − βA ⊤ A . By the update of λ , it follows that θ t − X k =1 Φ( x k +1 , x, λ k +1 ) = θ t − X k =1 (cid:20) Φ( x k +1 , x, λ ) + 1 ρ h λ k +1 − λ, λ k +1 − λ k i (cid:21) = θ t − X k =1 Φ( x k +1 , x, λ ) + θ ρ t − X k =1 h k λ k +1 − λ k − k λ k − λ k + k λ k +1 − λ k k i = θ t − X k =1 Φ( x k +1 , x, λ ) + θ ρ " k λ t − λ k − λ − λ k + t − X k =1 k λ k +1 − λ k k (60)and Φ( x t +1 , x, λ t +1 ) = Φ( x t +1 , x, λ ) − h λ t − λ − ρr t +1 , r t +1 i = Φ( x t +1 , x, λ ) − h λ t − λ, r t +1 i + ρ k r t +1 k . (61)26ince ρ ≤ θβ , by Young’s inequality, it holds β k r t +1 k − h λ t − λ, r t +1 i + θ ρ k λ t − λ k ≥ . Then plugging (60) and (61) into (59), we have E Φ( x t +1 , x, λ ) + θ t − X k =1 E Φ( x k +1 , x, λ ) ≤ (1 − θ ) E (cid:2) Φ( x , x, λ ) + β k r k (cid:3) + 12 k x − x k ηI − βA ⊤ A + θ ρ E k λ − λ k ≤ E φ ( x, λ ) , (62)where in the last inequality we have used λ = 0, θ > k r k = k x − x k βA ⊤ A .Therefore, from the convexity of F , it follows that E Φ(¯ x t , x ∗ , λ ) ≤ θ ( t − E φ ( x ∗ , λ ) , ∀ λ , and weobtain the desired result from Lemmas 1.2 and 1.3. B.3 Proof of Theorem 3.3
We first establish a few inequalities below.
Proposition B.1 If (21e) , (21f) and (21g) hold, then − t X k =1 ( k + k + 1) E (cid:20) ∆ η k I − β k A ⊤ A ( x k +1 , x k , x ) − L m k x k +1 − x k k (cid:21) − µ ( t + k + 1)2 E k x t +1 − x k − t X k =2 µ (cid:0) θ ( k + k + 1) − (cid:1) E k x k − x k ≤ η ( k + 2)2 E k x − x k − ( t + k + 1)2 E k x t +1 − x k µ + η t ) I − β t A ⊤ A . (63) Proof.
This inequality can be easily shown by noting that for any 1 ≤ k ≤ t , the weight matrix of k x k +1 − x k k is β k ( k + k + 1) A ⊤ A − ( k + k + 1)( η k − L m ) I , which is negative semidefinite, andfor any 2 ≤ k ≤ t , the weight matrix of k x k − x k is (cid:2) β k − ( k + k ) − β k ( k + k + 1) (cid:3) A ⊤ A + (cid:2) ( k + k + 1) η k − ( k + k ) η k − − µ (cid:0) θ ( k + k + 1) − (cid:1)(cid:3) I, which is also negative semidefinite. (cid:3) Proposition B.2 If (21a) , (21c) and (21d) hold, then − t + k + 1 ρ t E ∆( λ t +1 , λ t , λ ) − t X k =2 θ ( k + k + 1) − ρ k − E ∆( λ k , λ k − , λ ) ≤ θ ( k + 3) − ρ E k λ − λ k . (64)27 roof. On the left hand side of (64), the coefficient of each k λ k +1 − λ k k is negative. For2 ≤ k ≤ t −
1, the coefficient of k λ k − λ k is θ ( k + k +2) − ρ k − θ ( k + k +1) − ρ k − , which is nonpositive; thecoefficient of k λ t − λ k is t + k +1 ρ t − θ ( t + k +1) − ρ t − , which is nonpositive; the coefficient of k λ t +1 − λ k is also nonpositive. Hence, dropping these nonpositive terms, we have the desired result. (cid:3) Now we are ready to prove Theorem 3.3.
Proof. [of Theorem 3.3]Multiplying k + k + 1 to both sides of (19), summing it up from k = 1 through t , and moving theterms about Φ( x k , x, λ k ) + µ k x k − x k and k r k k to the left hand side for 2 ≤ k ≤ t give( t + k + 1) E h Φ( x t +1 , x, λ t +1 ) + ( β t − ρ t ) k r t +1 k + µ k x t +1 − x k i + t X k =2 (cid:0) θ ( k + k + 1) − (cid:1) E h Φ( x k , x, λ k ) + µ k x k − x k i + t X k =2 (cid:0) ( β k − − ρ k − )( k + k ) − (1 − θ )( k + k + 1) β k (cid:1) E k r k k ≤ (1 − θ )( k + 2) E h Φ( x , x, λ ) + β k r k + µ k x − x k i (65) − t X k =1 ( k + k + 1) E (cid:20) ∆ η k I − β k A ⊤ A ( x k +1 , x k , x ) − L m k x k +1 − x k k (cid:21) . Hence, from (21b) and (63), it follows that( t + k + 1) E Φ( x t +1 , x, λ t +1 ) + t X k =2 (cid:0) θ ( k + k + 1) − (cid:1) E Φ( x k , x, λ k ) ≤ (1 − θ )( k + 2) E h Φ( x , x, λ ) + β k r k + µ k x − x k i + η ( k + 2)2 E k x − x k − t + k + 12 E k x t +1 − x k µ + η t ) I − β t A ⊤ A . (66)In addition, from the update of λ in (17c), we have h λ k +1 − λ, Ax k +1 − b i = − ρ k h λ k +1 − λ, λ k +1 − λ k i = − ρ k ∆( λ k +1 , λ k , λ ) , (67)and thus ( t + k + 1) E h λ t +1 − λ, Ax t +1 − b i + t X k =2 (cid:0) θ ( k + k + 1) − (cid:1) E h λ k − λ, Ax k − b i = − t + k + 1 ρ t E ∆( λ t +1 , λ t , λ ) − t X k =2 θ ( k + k + 1) − ρ k − E ∆( λ k , λ k − , λ ) (64) ≤ θ ( k + 3) − ρ E k λ − λ k . x k , x, λ ) = Φ( x k , x, λ k ) + h λ k − λ, Ax k − b i , we obtain the desired result by adding theabove inequality to (66). (cid:3) B.4 Proof of Proposition 3.4
Note that (24) implies k ≥ θ , and thus (21a) must hold. Also, it is easy to see that (21d) holdswith equality from the second equation of (23b). Since I (cid:23) A ⊤ A k A k , we can easily have (21f) byplugging in β k and η k defined in (23a) and (23c) respectively.To verify (21c), we plug in ρ k defined in the first equation of (23b), and it is equivalent to requiringthat for any 2 ≤ k ≤ t − θ ( k + k + 1) − θ ( k −
1) + 2 + θ ≥ θ ( k + k + 2) − θk + 2 + θ ⇐⇒ θ ( k + 1) − θk + 2 ≥ θ ( k + 1) − θk + 2 + θ . The inequality on the right hand side obviously holds, and thus we have (21c).Plugging in the formula of β k , (21e) is equivalent to( θk + 2 + θ )( k + k + 1) ≥ ( θk + 2)( k + k ) , which holds trivially, and thus (21e) follows.With the given β k and ρ k , (21b) becomes − θ ( θk + 2)( k + k ) ≥ ( k + k + 1)( θk + 2 + θ ) , ∀ ≤ k ≤ t, which is equivalent to − θ ≥ ( k +3)(3 θ +2)( k +2)(2 θ +2) . Note that k +3 k +2 is decreasing with respect to k ≥ − θ ≥ ( θ +3)(3 θ +2)( θ +2)(2 θ +2) . Hence, (21b) is satisfied from the fact k ≥ θ .Finally, we show (21g). Plugging in η k , we have that (21g) becomes( k + k ) (cid:16) µ θk + 2) + L m (cid:17) + µ (cid:0) θ ( k + k + 1) − (cid:1) ≥ ( k + k + 1) (cid:16) µ θk + 2 + θ ) + L m (cid:17) , ∀ k ≥ , which is equivalent to k + 1 ≥ θ + L m θµ . Hence, for k given in (24), (21g) must hold. Therefore,we have verified all conditions in (21). B.5 Proof of Theorem 3.5
From Proposition 3.4, we have the inequality in (22) that, as λ = 0, reduces to( t + k + 1) E Φ( x t +1 , x, λ ) + t X k =2 (cid:0) θ ( k + k + 1) − (cid:1) E Φ( x k , x, λ ) ≤ φ ( x, λ ) − t + k + 12 E k x t +1 − x k µ + η t ) I − β t A ⊤ A . (68)29or ρ ≥
1, we have ( µ + η t ) I − β t A ⊤ A (cid:23) (cid:18) ( ρ − µ ρ ( θt + θ + 2) + µ + L m (cid:19) I. (69)Letting x = x ∗ and using the convexity of F , we have from (68) and the above inequality that E (cid:2) F (¯ x t +1 ) − F ( x ∗ ) − (cid:10) λ, A ¯ x t +1 − b (cid:11)(cid:3) ≤ T E φ ( x ∗ , λ ) , ∀ λ, (70)which together with Lemmas 1.2 and 1.3 with γ = max(2 k λ ∗ k , k λ ∗ k ) indicates (25).In addition, note Φ( x t +1 , x ∗ , λ ∗ ) ≥ µ k x t +1 − x ∗ k . Hence, letting ( x, λ ) = ( x ∗ , λ ∗ ) in (68) and using (5), we have from (69) that t + k + 12 (cid:18) ( ρ − µ ρ ( θt + θ + 2) + 2 µ + L m (cid:19) E k x t +1 − x ∗ k ≤ φ ( x ∗ , λ ∗ ) , (71)and the proof is completed. C Technical proofs: Section 4
In this section, we provide the proofs of the lemmas and theorems in Section 4.
C.1 Proof of Lemma 4.1
Note r k +1 − r k = A ( x k +1 − x k ) + B ( y k +1 − y k ). Hence by (6), we have D A ( x k +1 − x ) , − βr k E = − β D A ( x k +1 − x ) , r k +1 E + β D A ( x k +1 − x ) , B ( y k +1 − y k ) E + β h k A ( x k +1 − x ) k − k A ( x k − x ) k + k A ( x k +1 − x k ) k i . (72)In addition, h A ( x k +1 − x ) , λ k i = h A ( x k +1 − x ) , λ k +1 + ρr k +1 i . Plugging this equation and (72) into(45) with x o = x k , λ o = λ k , x + = x k +1 , W = η x I and taking expectation yield E h F ( x k +1 ) − F ( x ) + µ k x k +1 − x k − (cid:10) A ( x k +1 − x ) , λ k +1 (cid:11) + ( β − ρ ) (cid:10) A ( x k +1 − x ) , r k +1 (cid:11)i + 12 E h k x k +1 − x k P − k x k − x k P + k x k +1 − x k k P − L m I i ≤ (1 − θ ) E h F ( x k ) − F ( x ) + µ k x k − x k − (cid:10) A ( x k − x ) , λ k − βr k (cid:11)i (73)+ β E D A ( x k +1 − x ) , B ( y k +1 − y k ) E , where P = η x I − βA ⊤ A . 30rom (30), the optimality condition for ˜ y k +1 is ∇ h (˜ y k +1 ) − B ⊤ λ k + βB ⊤ r k + + η y (˜ y k +1 − y k ) = 0 . (74)Since Prob( y k +1 = ˜ y k +1 ) = θ, Prob( y k +1 = y k ) = 1 − θ, we have E D y k +1 − y, ∇ h ( y k +1 ) − B ⊤ λ k + βB ⊤ r k + + η y ( y k +1 − y k ) E = (1 − θ ) E D y k − y, ∇ h ( y k ) − B ⊤ λ k + βB ⊤ r k + E , or equivalently, E D y k +1 − y, ∇ h ( y k +1 ) − B ⊤ λ k +1 + ( β − ρ ) B ⊤ r k +1 − βB ⊤ B ( y k +1 − y k ) + η y ( y k +1 − y k ) E = (1 − θ ) E D y k − y, ∇ h ( y k ) − B ⊤ λ k + βB ⊤ r k E + β (1 − θ ) E D B ( y k − y ) , A ( x k +1 − x k ) E . (75)Recall Q = η y I − βB ⊤ B . We have D y k +1 − y, − βB ⊤ B ( y k +1 − y k ) + η y ( y k +1 − y k ) E = 12 h k y k +1 − y k Q − k y k − y k Q + k y k +1 − y k k Q i . Therefore adding (75) to (73), noting Ax + By = b , and plugging (67) with ρ k = ρ , we have thedesired result. C.2 Proof of Theorem 4.2
Before proving Theorem 4.2, we establish a few inequalities. First, using Young’s inequality, wehave the following results.
Lemma C.1
For any τ_1, τ_2 > 0, it holds that

⟨ A(x^{k+1} − x^*), B(y^{k+1} − y^k) ⟩ ≤ (τ_1/2)‖A(x^{k+1} − x^*)‖^2 + 1/(2τ_1) · ‖B(y^{k+1} − y^k)‖^2,  (76)

⟨ B(y^k − y^*), A(x^{k+1} − x^k) ⟩ ≤ (τ_2/2)‖B(y^k − y^*)‖^2 + 1/(2τ_2) · ‖A(x^{k+1} − x^k)‖^2.  (77)

In addition, we are able to bound the λ-term by the y-term and the residual r. The proofs are given in Appendices C.4 and C.5.

Lemma C.2
For any δ > 0, we have

E‖B^⊤(λ^{k+1} − λ^*)‖^2 − (1 − θ)(1 + δ) E‖B^⊤(λ^k − λ^*)‖^2
≤ E[ L_h‖y^{k+1} − y^*‖^2 + ‖Q(y^{k+1} − y^k)‖^2 ] + 2(β − ρ) E‖B^⊤r^{k+1}‖^2
+ 2ρ(1 − θ)(1 + 1/δ) E[ ‖B^⊤r^{k+1}‖^2 + ‖B^⊤B(y^{k+1} − y^k)‖^2 ].  (78)

Lemma C.3 Assume (38). Then

(σ_min(BB^⊤)/2)[ ‖λ^{k+1} − λ^*‖^2 − (1 − θ)‖λ^k − λ^*‖^2 + (1/θ)‖λ^{k+1} − λ^k‖^2 ]
≤ ‖B^⊤(λ^{k+1} − λ^*)‖^2 − (1 − θ)(1 + δ)‖B^⊤(λ^k − λ^*)‖^2 + κ‖B^⊤(λ^{k+1} − λ^k)‖^2,  (79)

where σ_min(BB^⊤) denotes the smallest singular value of BB^⊤.

Lemma C.4
Let c, δ, τ , τ and κ be constants satisfying the conditions in Theorem 4.2. Then β E (cid:10) A ( x k +1 − x ∗ ) , B ( y k +1 − y k ) (cid:11) + β (1 − θ ) E (cid:10) B ( y k − y ∗ ) , A ( x k +1 − x k ) (cid:11) + c σ min ( BB ⊤ ) E (cid:2) k λ k +1 − λ ∗ k − (1 − θ ) k λ k − λ ∗ k + 1 θ k λ k +1 − λ k k (cid:3) ≤ E k x k +1 − x k k P − L m I + β τ E k A ( x k +1 − x ∗ ) k (80)+ 12 E k y k +1 − y k k Q + β (1 − θ )2 τ E k B ( y k − y ∗ ) k + 4 cL h E k y k +1 − y ∗ k + (cid:20) cρ (cid:18) κ + 2(1 − θ ) (cid:0) δ (cid:1)(cid:19) + 2 c ( β − ρ ) (cid:21) E k B ⊤ r k +1 k . Now we are ready to show Theorem 4.2.
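Before turning to the proof, note that the Young's-inequality pattern behind Lemma C.1, namely ⟨u, v⟩ ≤ (τ/2)‖u‖^2 + (1/(2τ))‖v‖^2 for any τ > 0, can be sanity-checked numerically, with u and v standing in for products such as A(x^{k+1} − x^*) and B(y^{k+1} − y^k). This is an illustrative check, not part of the proof; the sizes and sampling ranges are arbitrary.

```python
import numpy as np

# Young's inequality: 2<u, v> <= tau ||u||^2 + ||v||^2 / tau for any tau > 0,
# which follows from 0 <= || sqrt(tau) u - v / sqrt(tau) ||^2.
rng = np.random.default_rng(1)
for _ in range(1000):
    u, v = rng.standard_normal(12), rng.standard_normal(12)
    tau = rng.uniform(0.05, 20.0)
    lhs = u @ v
    rhs = 0.5 * tau * (u @ u) + 0.5 / tau * (v @ v)
    assert lhs <= rhs + 1e-12   # small slack for floating-point rounding
print("Young's inequality verified on 1000 random pairs")
```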
Proof [of Theorem 4.2]. Letting $(x,y,\lambda) = (x^*, y^*, \lambda^*)$ in (34), plugging (32) into it, and noting $Ax^* + By^* = b$, we have
$$\mathbb{E}\,\Psi(z^{k+1},z^*) + (\beta-\rho)\,\mathbb{E}\|r^{k+1}\|^2 + \mathbb{E}\Big[\Delta_P(x^{k+1},x^k,x^*) - \frac{L_m}{2}\|x^{k+1}-x^k\|^2\Big] + \mathbb{E}\,\Delta_Q(y^{k+1},y^k,y^*) + \frac{\mu}{2}\mathbb{E}\|x^{k+1}-x^*\|^2 + \frac1\rho\,\mathbb{E}\,\Delta(\lambda^{k+1},\lambda^k,\lambda^*)$$
$$\le (1-\theta)\,\mathbb{E}\,\Psi(z^k,z^*) + \beta(1-\theta)\,\mathbb{E}\|r^k\|^2 + \frac{1-\theta}{\rho}\,\mathbb{E}\,\Delta(\lambda^k,\lambda^{k-1},\lambda^*) + \frac{\mu(1-\theta)}{2}\mathbb{E}\|x^k-x^*\|^2 + \beta\,\mathbb{E}\big\langle A(x^{k+1}-x^*), B(y^{k+1}-y^k)\big\rangle + \beta(1-\theta)\,\mathbb{E}\big\langle B(y^k-y^*), A(x^{k+1}-x^k)\big\rangle, \quad (81)$$
where $\Psi$ is defined in (36). Note
$$\frac1\rho\Delta(\lambda^{k+1},\lambda^k,\lambda^*) = \frac{1}{2\rho}\Big[\|\lambda^{k+1}-\lambda^*\|^2 - (1-\theta)\|\lambda^k-\lambda^*\|^2 + \frac1\theta\|\lambda^{k+1}-\lambda^k\|^2\Big] - \frac\rho2\Big(\frac1\theta - 1\Big)\|r^{k+1}\|^2 - \frac{\theta}{2\rho}\|\lambda^k-\lambda^*\|^2,$$
and
$$\frac{1-\theta}{\rho}\Delta(\lambda^k,\lambda^{k-1},\lambda^*) = \frac{1}{2\rho}\Big[\|\lambda^k-\lambda^*\|^2 - (1-\theta)\|\lambda^{k-1}-\lambda^*\|^2 + \frac1\theta\|\lambda^k-\lambda^{k-1}\|^2\Big] - \frac\rho2\Big(\frac1\theta - (1-\theta)\Big)\|r^k\|^2 - \frac{\theta}{2\rho}\|\lambda^k-\lambda^*\|^2.$$
Plugging these two identities into (81) and applying Lemma C.4 to the last two cross terms, we have
$$\mathbb{E}\,\Psi(z^{k+1},z^*) + (\beta-\rho)\,\mathbb{E}\|r^{k+1}\|^2 + \mathbb{E}\Big[\Delta_P(x^{k+1},x^k,x^*) - \frac{L_m}{2}\|x^{k+1}-x^k\|^2\Big] + \mathbb{E}\,\Delta_Q(y^{k+1},y^k,y^*) + \frac{\mu}{2}\mathbb{E}\|x^{k+1}-x^*\|^2 - \frac\rho2\Big(\frac1\theta-1\Big)\mathbb{E}\|r^{k+1}\|^2 - \frac{\theta}{2\rho}\mathbb{E}\|\lambda^k-\lambda^*\|^2 + \Big(\frac{1}{2\rho} + \frac{c\,\sigma_{\min}(BB^\top)}{2}\Big)\mathbb{E}\Big[\|\lambda^{k+1}-\lambda^*\|^2 - (1-\theta)\|\lambda^k-\lambda^*\|^2 + \frac1\theta\|\lambda^{k+1}-\lambda^k\|^2\Big]$$
$$\le (1-\theta)\,\mathbb{E}\,\Psi(z^k,z^*) + \beta(1-\theta)\,\mathbb{E}\|r^k\|^2 - \frac\rho2\Big(\frac1\theta-(1-\theta)\Big)\mathbb{E}\|r^k\|^2 - \frac{\theta}{2\rho}\mathbb{E}\|\lambda^k-\lambda^*\|^2 + \frac{1}{2\rho}\mathbb{E}\Big[\|\lambda^k-\lambda^*\|^2 - (1-\theta)\|\lambda^{k-1}-\lambda^*\|^2 + \frac1\theta\|\lambda^k-\lambda^{k-1}\|^2\Big] + \frac{\mu(1-\theta)}{2}\mathbb{E}\|x^k-x^*\|^2 + \frac12\mathbb{E}\|x^{k+1}-x^k\|^2_{P-L_mI} + \frac{\beta\tau_1}{2}\mathbb{E}\|A(x^{k+1}-x^*)\|^2 + \frac12\mathbb{E}\|y^{k+1}-y^k\|_Q^2 + \frac{\beta(1-\theta)}{2\tau_2}\mathbb{E}\|B(y^k-y^*)\|^2 + 4cL_h^2\,\mathbb{E}\|y^{k+1}-y^*\|^2 + \Big[c\rho^2\Big(\kappa+2(1-\theta)\Big(1+\frac1\delta\Big)\Big) + 2c(\beta-\rho)^2\Big]\mathbb{E}\|B^\top r^{k+1}\|^2.$$
Using the definition in (2) to expand $\Delta_P(x^{k+1},x^k,x^*)$ and $\Delta_Q(y^{k+1},y^k,y^*)$ in the above inequality, and then rearranging terms, we have
$$\mathbb{E}\,\Psi(z^{k+1},z^*) + \Big((\beta-\rho) - \frac\rho2\Big(\frac1\theta-1\Big)\Big)\mathbb{E}\|r^{k+1}\|^2 - \Big[c\rho^2\Big(\kappa+2(1-\theta)\Big(1+\frac1\delta\Big)\Big) + 2c(\beta-\rho)^2\Big]\mathbb{E}\|B^\top r^{k+1}\|^2 + \frac12\mathbb{E}\Big[\|x^{k+1}-x^*\|_P^2 + \mu\|x^{k+1}-x^*\|^2 - \beta\tau_1\|A(x^{k+1}-x^*)\|^2\Big] + \frac12\mathbb{E}\|y^{k+1}-y^*\|_Q^2 - 4cL_h^2\,\mathbb{E}\|y^{k+1}-y^*\|^2 + \Big(\frac{1}{2\rho}+\frac{c\,\sigma_{\min}(BB^\top)}{2}\Big)\mathbb{E}\Big[\|\lambda^{k+1}-\lambda^*\|^2 - (1-\theta)\|\lambda^k-\lambda^*\|^2 + \frac1\theta\|\lambda^{k+1}-\lambda^k\|^2\Big]$$
$$\le (1-\theta)\,\mathbb{E}\,\Psi(z^k,z^*) + \Big(\beta(1-\theta) - \frac\rho2\Big(\frac1\theta-(1-\theta)\Big)\Big)\mathbb{E}\|r^k\|^2 + \frac12\mathbb{E}\|x^k-x^*\|_P^2 + \frac{\mu(1-\theta)}{2}\mathbb{E}\|x^k-x^*\|^2 + \frac12\mathbb{E}\|y^k-y^*\|_Q^2 + \frac{\beta(1-\theta)}{2\tau_2}\mathbb{E}\|B(y^k-y^*)\|^2 + \frac{1}{2\rho}\mathbb{E}\Big[\|\lambda^k-\lambda^*\|^2 - (1-\theta)\|\lambda^{k-1}-\lambda^*\|^2 + \frac1\theta\|\lambda^k-\lambda^{k-1}\|^2\Big]. \quad (82)$$
Since $\rho = \theta\beta$, it holds that
$$(\beta-\rho) - \frac\rho2\Big(\frac1\theta - 1\Big) = \frac{\beta-\rho}{2}, \qquad \beta(1-\theta) - \frac\rho2\Big(\frac1\theta-(1-\theta)\Big) \le \frac{\beta(1-\theta)}{2},$$
and thus the inequality (82) implies
$$\mathbb{E}\,\Psi(z^{k+1},z^*) + \frac{\beta-\rho}{2}\mathbb{E}\|r^{k+1}\|^2 - \Big[c\rho^2\Big(\kappa+2(1-\theta)\Big(1+\frac1\delta\Big)\Big)+2c(\beta-\rho)^2\Big]\mathbb{E}\|B^\top r^{k+1}\|^2 + \frac12\mathbb{E}\Big[\|x^{k+1}-x^*\|_P^2 + \mu\|x^{k+1}-x^*\|^2 - \beta\tau_1\|A(x^{k+1}-x^*)\|^2\Big] + \frac12\mathbb{E}\|y^{k+1}-y^*\|_Q^2 - 4cL_h^2\,\mathbb{E}\|y^{k+1}-y^*\|^2 + \Big(\frac{1}{2\rho}+\frac{c\,\sigma_{\min}(BB^\top)}{2}\Big)\mathbb{E}\Big[\|\lambda^{k+1}-\lambda^*\|^2 - (1-\theta)\|\lambda^k-\lambda^*\|^2 + \frac1\theta\|\lambda^{k+1}-\lambda^k\|^2\Big] \le \psi(z^k,z^*; P,Q,\beta,\rho,c,\tau_2), \quad (83)$$
where $\psi$ is defined in (37).

From (33), it follows that
$$(1-\alpha)\Psi(z^{k+1},z^*) + \frac{\alpha\mu}{2}\|x^{k+1}-x^*\|^2 + \frac{\alpha\nu}{2}\|y^{k+1}-y^*\|^2 \le \Psi(z^{k+1},z^*). \quad (84)$$
In addition, note that
$$\|r^{k+1}\|^2 = \|Ax^{k+1}+By^{k+1} - (Ax^*+By^*)\|^2 \le 2\|A\|^2\|x^{k+1}-x^*\|^2 + 2\|B\|^2\|y^{k+1}-y^*\|^2 \le \gamma\Big(\frac{\alpha\mu}{2}\|x^{k+1}-x^*\|^2 + \frac{\alpha\nu}{2}\|y^{k+1}-y^*\|^2\Big),$$
and thus
$$\frac1\gamma\|r^{k+1}\|^2 \le \frac{\alpha\mu}{2}\|x^{k+1}-x^*\|^2 + \frac{\alpha\nu}{2}\|y^{k+1}-y^*\|^2. \quad (85)$$
Adding (84) and (85) to (83) gives the desired result. $\Box$

C.3 Proof of Theorem 4.3
From $0 < \alpha < \theta$, the full row-rankness of $B$, and the conditions in (41), it is easy to see that $\eta > 0$. Since $\eta \le \frac{1-\alpha}{1-\theta}$, we have
$$\eta(1-\theta)\Psi(z^{k+1},z^*) \le (1-\alpha)\Psi(z^{k+1},z^*). \quad (86)$$
Note $\|A\| \le 1$, and thus
$$(\alpha\mu + \mu - \beta\tau_1) I \succeq \frac{\alpha\mu+\theta\mu-\beta\tau_1}{\eta_x+\mu(1-\theta)}(\eta_x I - \beta A^\top A) + \frac{\alpha\mu+\theta\mu-\beta\tau_1}{\eta_x+\mu(1-\theta)}\,\mu(1-\theta)I + \mu(1-\theta)I.$$
Hence, from $\eta \le 1 + \frac{\alpha\mu+\theta\mu-\beta\tau_1}{\eta_x+\mu(1-\theta)}$ and $P = \eta_x I - \beta A^\top A$, it follows that
$$\eta\,\|x^{k+1}-x^*\|^2_{P+\mu(1-\theta)I} \le \|x^{k+1}-x^*\|^2_{P+(\alpha\mu+\mu)I - \beta\tau_1 A^\top A}. \quad (87)$$
Similarly, since
$$(\alpha\nu - 8cL_h^2) I \succeq \frac{\alpha\nu - 8cL_h^2 - \beta(1-\theta)/\tau_2}{\eta_y + \beta(1-\theta)/\tau_2}(\eta_y I - \beta B^\top B) + \frac{\alpha\nu - 8cL_h^2 - \beta(1-\theta)/\tau_2}{\eta_y + \beta(1-\theta)/\tau_2}\cdot\frac{\beta(1-\theta)}{\tau_2} I + \frac{\beta(1-\theta)}{\tau_2} I,$$
$Q = \eta_y I - \beta B^\top B$, and $B^\top B \preceq I$, we have
$$\eta\,\|y^{k+1}-y^*\|^2_{Q + \frac{\beta(1-\theta)}{\tau_2}B^\top B} \le \|y^{k+1}-y^*\|^2_{Q + (\alpha\nu - 8cL_h^2)I}. \quad (88)$$
For the $r$-term, we note from the definition of $\eta$ that
$$\eta\,\frac{\beta(1-\theta)}{2} \le \Big(\frac{\beta(1-\theta)}{2} + \frac1\gamma\Big) - \Big(c\rho^2\Big(\kappa + 2(1-\theta)\Big(1+\frac1\delta\Big)\Big) + 2c(\beta-\rho)^2\Big).$$
In addition, since $\|B\| \le 1$, it holds that $\|B^\top r^{k+1}\|^2 \le \|r^{k+1}\|^2$, and thus
$$\eta\,\frac{\beta(1-\theta)}{2}\|r^{k+1}\|^2 \le \Big(\frac{\beta(1-\theta)}{2}+\frac1\gamma\Big)\|r^{k+1}\|^2 - \Big(c\rho^2\Big(\kappa+2(1-\theta)\Big(1+\frac1\delta\Big)\Big)+2c(\beta-\rho)^2\Big)\|B^\top r^{k+1}\|^2. \quad (89)$$
Finally, it is obvious that
$$\frac{\eta}{2\rho}\Big[\|\lambda^{k+1}-\lambda^*\|^2 - (1-\theta)\|\lambda^k-\lambda^*\|^2 + \frac1\theta\|\lambda^{k+1}-\lambda^k\|^2\Big] \le \Big(\frac{1}{2\rho}+\frac{c\,\sigma_{\min}(BB^\top)}{2}\Big)\Big[\|\lambda^{k+1}-\lambda^*\|^2 - (1-\theta)\|\lambda^k-\lambda^*\|^2 + \frac1\theta\|\lambda^{k+1}-\lambda^k\|^2\Big]. \quad (90)$$
Therefore, we obtain (42) by the definition of $\psi$ and adding (86) through (90).

C.4 Proof of Lemma C.2
Let $\tilde\lambda^{k+1} = \lambda^k - \rho(Ax^{k+1} + B\tilde y^{k+1} - b)$. Then from the update of $y$, we have
$$\mathbb{E}\|B^\top(\lambda^{k+1}-\lambda^*)\|^2 = \theta\,\mathbb{E}\|B^\top(\tilde\lambda^{k+1}-\lambda^*)\|^2 + (1-\theta)\,\mathbb{E}\|B^\top(\lambda^k - \lambda^* - \rho(Ax^{k+1}+By^k-b))\|^2. \quad (91)$$
Below we bound the two terms on the right-hand side of (91). First, the definition of $\tilde\lambda^{k+1}$ together with (74) implies
$$B^\top\tilde\lambda^{k+1} = \nabla h(\tilde y^{k+1}) + Q(\tilde y^{k+1}-y^k) + (\beta-\rho)B^\top(Ax^{k+1}+B\tilde y^{k+1}-b). \quad (92)$$
Hence, by Young's inequality and the condition in (32b), we have
$$\theta\,\mathbb{E}\|B^\top(\tilde\lambda^{k+1}-\lambda^*)\|^2 \le 2\theta\,\mathbb{E}\|\nabla h(\tilde y^{k+1}) - \nabla h(y^*) + Q(\tilde y^{k+1}-y^k)\|^2 + 2\theta(\beta-\rho)^2\,\mathbb{E}\|B^\top(Ax^{k+1}+B\tilde y^{k+1}-b)\|^2. \quad (93)$$
Since $\mathrm{Prob}(y^{k+1}=\tilde y^{k+1}) = \theta$ and $\mathrm{Prob}(y^{k+1}=y^k) = 1-\theta$, it follows that
$$\mathbb{E}\|\nabla h(y^{k+1})-\nabla h(y^*)+Q(y^{k+1}-y^k)\|^2 = \theta\,\mathbb{E}\|\nabla h(\tilde y^{k+1})-\nabla h(y^*)+Q(\tilde y^{k+1}-y^k)\|^2 + (1-\theta)\,\mathbb{E}\|\nabla h(y^k)-\nabla h(y^*)\|^2,$$
and hence
$$\theta\,\mathbb{E}\|\nabla h(\tilde y^{k+1})-\nabla h(y^*)+Q(\tilde y^{k+1}-y^k)\|^2 \le \mathbb{E}\|\nabla h(y^{k+1})-\nabla h(y^*)+Q(y^{k+1}-y^k)\|^2.$$
Similarly,
$$\theta(\beta-\rho)^2\,\mathbb{E}\|B^\top(Ax^{k+1}+B\tilde y^{k+1}-b)\|^2 \le (\beta-\rho)^2\,\mathbb{E}\|B^\top(Ax^{k+1}+By^{k+1}-b)\|^2.$$
Plugging the above two inequalities into (93) and applying Young's inequality and also the Lipschitz continuity of $\nabla h$ give
$$\theta\,\mathbb{E}\|B^\top(\tilde\lambda^{k+1}-\lambda^*)\|^2 \le 4\,\mathbb{E}\big[L_h^2\|y^{k+1}-y^*\|^2 + \|Q(y^{k+1}-y^k)\|^2\big] + 2(\beta-\rho)^2\,\mathbb{E}\|B^\top r^{k+1}\|^2. \quad (94)$$
In addition, from Young's inequality, it follows for any $\delta > 0$ that
$$\|B^\top(\lambda^k-\lambda^*-\rho(Ax^{k+1}+By^k-b))\|^2 \le (1+\delta)\|B^\top(\lambda^k-\lambda^*)\|^2 + \rho^2\Big(1+\frac1\delta\Big)\|B^\top(Ax^{k+1}+By^k-b)\|^2.$$
Note $\|B^\top(Ax^{k+1}+By^k-b)\|^2 \le 2\|B^\top r^{k+1}\|^2 + 2\|B^\top B(y^{k+1}-y^k)\|^2$. Therefore, plugging (94) and the above two inequalities into (91), we complete the proof.

C.5 Proof of Lemma C.3
It is straightforward to verify that
$$\|B^\top(\lambda^{k+1}-\lambda^*)\|^2 - (1-\theta)(1+\delta)\|B^\top(\lambda^k-\lambda^*)\|^2 + \kappa\|B^\top(\lambda^{k+1}-\lambda^k)\|^2 = \begin{bmatrix}\lambda^{k+1}-\lambda^*\\ \lambda^{k+1}-\lambda^k\end{bmatrix}^\top \left(\begin{bmatrix} 1-(1-\theta)(1+\delta) & (1-\theta)(1+\delta)\\ (1-\theta)(1+\delta) & \kappa-(1-\theta)(1+\delta)\end{bmatrix} \otimes BB^\top\right) \begin{bmatrix}\lambda^{k+1}-\lambda^*\\ \lambda^{k+1}-\lambda^k\end{bmatrix},$$
and
$$\begin{bmatrix}\lambda^{k+1}-\lambda^*\\ \lambda^{k+1}-\lambda^k\end{bmatrix}^\top \left(\begin{bmatrix} \theta & 1-\theta\\ 1-\theta & \frac1\theta-(1-\theta)\end{bmatrix} \otimes I\right) \begin{bmatrix}\lambda^{k+1}-\lambda^*\\ \lambda^{k+1}-\lambda^k\end{bmatrix} = \|\lambda^{k+1}-\lambda^*\|^2 - (1-\theta)\|\lambda^k-\lambda^*\|^2 + \frac1\theta\|\lambda^{k+1}-\lambda^k\|^2.$$
Hence, we have the desired result from (38) and the inequality $U \otimes V \succeq \sigma_{\min}(V)\,(U \otimes I)$ for any PSD matrices $U$ and $V$.

C.6 Proof of Lemma C.4
From (39a) and (39b), we have
$$\frac{\beta(1-\theta)\tau_2}{2}\|A(x^{k+1}-x^k)\|^2 \le \frac12\|x^{k+1}-x^k\|^2_{P-L_mI},$$
$$4c\|Q(y^{k+1}-y^k)\|^2 + 2c\rho^2(1-\theta)\Big(1+\frac1\delta\Big)\|B^\top B(y^{k+1}-y^k)\|^2 + \frac{\beta}{2\tau_1}\|B(y^{k+1}-y^k)\|^2 \le \frac12\|y^{k+1}-y^k\|_Q^2.$$
The desired result is then obtained by adding the above two inequalities together with $\beta$ times (76), $\beta(1-\theta)$ times (77), $c$ times both (78) and (79), and also noting $\lambda^{k+1}-\lambda^k = -\rho r^{k+1}$.
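The last step of the proof of Lemma C.3 above relies on $U \otimes V \succeq \sigma_{\min}(V)\,(U \otimes I)$ for PSD $U, V$, which holds because $U \otimes (V - \sigma_{\min}(V) I)$ is a Kronecker product of two PSD matrices. A small self-contained numeric check (ours, not from the paper; all helper names are invented), using random $2\times 2$ PSD factors so the smallest eigenvalue has a closed form:

```python
import math
import random

def kron(A, B):
    """Kronecker product of two matrices given as lists of lists."""
    return [[A[i][j] * B[k][l] for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

def quad(M, x):
    """Quadratic form x^T M x."""
    return sum(x[i] * M[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))

def rand_psd2(rng):
    """Random 2x2 PSD matrix built as M^T M."""
    m = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(2)]
    return [[sum(m[k][i] * m[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def sigma_min2(V):
    """Smallest eigenvalue (= smallest singular value) of a symmetric PSD 2x2 matrix."""
    a, b, c = V[0][0], V[0][1], V[1][1]
    return 0.5 * ((a + c) - math.sqrt((a - c) ** 2 + 4 * b * b))

I2 = [[1.0, 0.0], [0.0, 1.0]]
rng = random.Random(1)
for _ in range(500):
    U, V = rand_psd2(rng), rand_psd2(rng)
    s = sigma_min2(V)
    x = [rng.gauss(0, 1) for _ in range(4)]
    # check the PSD ordering  U (x) V  >=  sigma_min(V) * (U (x) I)  at a random direction
    assert quad(kron(U, V), x) >= s * quad(kron(U, I2), x) - 1e-9
```

The check only probes random directions, of course; the matrix inequality itself is the exact statement used in the proof.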