On the Duality Gap Convergence of ADMM Methods
Da Tang, Tsinghua University, Beijing, China
Tong Zhang, Baidu Inc, Beijing, China, and Rutgers University, NJ, USA
Abstract
This paper provides a duality gap convergence analysis for the standard ADMM as well as a linearized version of ADMM. It is shown that under appropriate conditions, both methods achieve linear convergence. However, the standard ADMM achieves a faster, accelerated convergence rate than that of the linearized ADMM. A simple numerical example is used to illustrate the difference in convergence behavior.
1 Introduction

This paper considers the following optimization problem:

  min_{w,v} [φ(w) + g(v)]  subject to  Aw − Bv = c,   (1)

where (w, v) ∈ R^n × R^m are unknown vectors, and A ∈ R^{p×n}, B ∈ R^{p×m}, and c ∈ R^p are known matrices and a known vector. In this paper, we assume that φ: R^n → R ∪ {+∞} and g: R^m → R ∪ {+∞} are convex functions.

A popular method for solving (1) is the Alternating Direction Method of Multipliers (ADMM) algorithm. It solves the problem by alternately optimizing the variables in the augmented Lagrangian function

  L(w, v, α; ρ) = φ(w) + g(v) + α^⊤(Aw − Bv − c) + (ρ/2)‖Aw − Bv − c‖²,   (2)

and the resulting procedure is summarized in Algorithm 1. In the algorithm, both G and H are symmetric positive semi-definite matrices. In the standard ADMM, we can set G = 0 and H = 0. The method of introducing the additional term ‖v − v_{t−1}‖²_G = (v − v_{t−1})^⊤ G (v − v_{t−1}) is often referred to as preconditioning. If we let G = βI − B^⊤B for a sufficiently large β > 0 so that G is positive semi-definite, then the minimization problem to obtain v_t in line 3 of Algorithm 1 becomes

  v_t = argmin_v [ g(v) − (B^⊤α_{t−1} + ρB^⊤(Aw_{t−1} − c) + ρGv_{t−1})^⊤ v + (ρβ/2) v^⊤v ],

which may be simpler to solve than the corresponding problem with G = 0, since the original quadratic term v^⊤B^⊤Bv is now replaced by v^⊤v. The additional term ‖w − w_{t−1}‖²_H can play a similar preconditioning role.

Algorithm 1 Preconditioned Standard ADMM
  1: Choose w_0, v_0, and α_0
  2: for t = 1, 2, . . . do
  3:   v_t = argmin_v [ g(v) − α_{t−1}^⊤ Bv + (ρ/2)‖Aw_{t−1} − Bv − c‖² + (ρ/2)‖v − v_{t−1}‖²_G ]
  4:   w_t = argmin_w [ φ(w) + α_{t−1}^⊤ Aw + (ρ/2)‖Aw − Bv_t − c‖² + (1/2)‖w − w_{t−1}‖²_H ]
  5:   α_t = α_{t−1} + ρ(Aw_t − Bv_t − c)
  6: end for
  Output: w_T, v_T, α_T

For simplicity, this paper focuses on the scenario where g(·) is strongly convex and φ(·) is smooth. The results allow g(·) to include a constraint v ∈ Ω for a convex set Ω, by setting g(v) = +∞ when v ∉ Ω. The same proof technique can also handle the other three cases with one objective function being smooth and one being strongly convex.
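To make the updates concrete, the following is a minimal NumPy sketch of Algorithm 1 with G = H = 0, specialized to quadratic φ(w) = (1/2)w^⊤Qw and g(v) = (1/2)v^⊤Λv so that lines 3 and 4 reduce to linear systems; the function name and the quadratic specialization are ours and are not part of the algorithm statement.

```python
import numpy as np

def admm_quadratic(Q, Lam, A, B, c, rho, T):
    """Algorithm 1 with G = H = 0 for phi(w) = w'Qw/2 and g(v) = v'Lam v/2."""
    w = np.zeros(A.shape[1])
    v = np.zeros(B.shape[1])
    alpha = np.zeros(A.shape[0])
    for _ in range(T):
        # line 3: grad g(v) - B'alpha - rho B'(A w - B v - c) = 0
        v = np.linalg.solve(Lam + rho * B.T @ B,
                            B.T @ alpha + rho * B.T @ (A @ w - c))
        # line 4: grad phi(w) + A'alpha + rho A'(A w - B v - c) = 0
        w = np.linalg.solve(Q + rho * A.T @ A,
                            -A.T @ alpha + rho * A.T @ (B @ v + c))
        # line 5: gradient ascent on the multiplier
        alpha = alpha + rho * (A @ w - B @ v - c)
    return w, v, alpha
```

With A = B = I and c = 0, this produces exactly the quadratic iterates analyzed in the lower bound construction of Section 3.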
The standard ADMM algorithm assumes that the optimization problem to obtain w_t is simple. If this optimization is difficult to perform, then we may also consider the linearized ADMM formulation, which replaces φ(w) by the quadratic approximation φ_H(w_{t−1}; w) defined as

  φ_H(w_{t−1}; w) = φ(w_{t−1}) + ∇φ(w_{t−1})^⊤(w − w_{t−1}) + (1/2)(w − w_{t−1})^⊤ H (w − w_{t−1}).

The resulting algorithm is described in Algorithm 2. Both H and G are symmetric positive semi-definite matrices. By setting H = β′I − ρA^⊤A, we can replace the term w^⊤A^⊤Aw by w^⊤w in the optimization of line 4 of Algorithm 2.

Algorithm 2 Preconditioned Linearized ADMM
  1: Choose w_0, v_0, and α_0
  2: for t = 1, 2, . . . do
  3:   v_t = argmin_v [ g(v) − α_{t−1}^⊤ Bv + (ρ/2)‖Aw_{t−1} − Bv − c‖² + (ρ/2)‖v − v_{t−1}‖²_G ]
  4:   w_t = argmin_w [ φ_H(w_{t−1}; w) + α_{t−1}^⊤ Aw + (ρ/2)‖Aw − Bv_t − c‖² ]
  5:   α_t = α_{t−1} + ρ(Aw_t − Bv_t − c)
  6: end for
  Output: w_T, v_T, α_T
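Under the choice H = β′I − ρA^⊤A just mentioned, line 4 of Algorithm 2 has an explicit solution: setting the gradient of the line-4 objective to zero, the quadratic terms in A^⊤A cancel and the update becomes a plain gradient-like step. A minimal sketch of this step (our notation; the inputs are NumPy arrays, and grad_phi is assumed to evaluate ∇φ):

```python
def linearized_w_step(grad_phi, w_prev, v_t, alpha_prev, A, B, c, rho, beta):
    """Line 4 of Algorithm 2 with H = beta*I - rho*A'A (beta chosen large
    enough that H is positive semi-definite).  No linear system is solved:
    the A'A terms cancel, leaving an explicit step of size 1/beta."""
    residual = A @ w_prev - B @ v_t - c
    return w_prev - (grad_phi(w_prev) + A.T @ alpha_prev
                     + rho * A.T @ residual) / beta
```

This is why the linearized variant is attractive when a system in A^⊤A is expensive to solve at every iteration.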
This paper compares the convergence behavior of the ADMM algorithm with that of the linearized ADMM algorithm for solving (1). Under the assumption that A is invertible, g(·) is λ strongly convex, and φ(·) is 1/γ smooth, it is shown that the standard ADMM achieves a worst case linear convergence rate of 1/(1 + Θ(√(λγ))) (with optimally chosen ρ), while the linearized ADMM achieves a slower worst case linear convergence rate of 1/(1 + Θ(λγ)).

The paper is organized as follows. Section 2 reviews related work. Section 3 provides a theoretical analysis for both standard and linearized ADMM. Section 4 provides a simple numerical example to illustrate the difference in convergence behavior. Concluding remarks are given in Section 5.

2 Related Work on ADMM and Linearized ADMM

In this section, we review some previous work on the convergence analysis of ADMM and linearized ADMM, focusing mainly on linear convergence results.
Many authors have studied the linear convergence of ADMM in recent years. For example, the authors of [6] presented a novel proof of the linear convergence of the ADMM algorithm. Moreover, their analysis applies to the more general case in which the objective function is the sum of more than two separable functions (φ and g in our case). However, the assumption on each separable function is very complex, and no explicit rate is obtained. Therefore their results are not directly comparable to ours.

Another work is [4], which presented an analysis of the linear convergence of generalized ADMM under certain conditions. More comprehensive results for the general form of constraint Aw − Bv = c were obtained later in [3] using similar ideas. In that paper, the authors presented an extension of the ADMM algorithm called Relaxed ADMM, which leads to linear convergence in the following four cases (it also requires that either A or B be invertible): φ is strongly convex and smooth; g is strongly convex and smooth; φ is smooth and g is strongly convex; g is smooth and φ is strongly convex. However, their analysis employs a technique for analyzing the dual objective of ADMM, which may be regarded as a relaxed Peaceman-Rachford splitting method; it can be used to prove dual convergence. In contrast, our analysis uses a very different argument that directly bounds the convergence of the primal objective function and the duality gap. Moreover, even when the required regularity conditions for linear convergence are not satisfied, our analysis immediately implies a sublinear O(1/t) convergence of the duality gap (assuming a finite solution exists for the underlying problem). Therefore the analysis of this paper contains a unified treatment that can simultaneously handle both linear and sublinear convergence, depending on the regularity conditions. In contrast, although sublinear results can be obtained using techniques similar to those of [3] (see the results in [2]), they require specialized treatment, and the obtained results are in different forms that are not compatible with the duality gap convergence of this paper. In this sense, the operator splitting proof techniques of [3, 2] and the objective function proof technique of this paper are complementary to each other. Another advantage of our proof technique is that it can be directly applied to linearized ADMM with minimal modifications.

Our analysis employs a technique similar to that of [10] (note that neither linear convergence nor duality gap convergence was studied in [10]). At the conceptual level, the technique is also closely related to the analysis of [1], but the actual execution differs quite significantly. One may view the analysis of this paper as a refined version of that in [10], in that we simultaneously handle the linear and sublinear cases depending on regularity conditions. Moreover, our analysis unifies the techniques used in [10] (which deals with primal objective convergence) and the techniques used in [1] (which deals with a special primal-dual objective convergence); our proof shows that the seemingly different results in these two papers can be proved using the same underlying argument. Although results similar to ours were presented in [1] for a procedure related to a specific form of preconditioned ADMM (see [1] for discussions), the authors did not analyze the standard ADMM (or its linearized version) under the general constraint Aw − Bv = c.
Therefore the results obtained in this paper for ADMM are different from those of [1]. Another result on the linear convergence of the standard ADMM can be found in a recent paper [9], which uses a technique different from what is presented in this paper and in [4, 3]; their results are not directly comparable to ours. Moreover, some other work on the convergence of ADMM-like procedures includes [7, 5, 12], which focused on different applications that are not related to our work.

One advantage of our proof technique is that it also handles linearized ADMM, with new results not available in the previous literature. Most of the previous work on linearized ADMM does not consider linear convergence; the few papers that do impose strong assumptions on the matrices A, B, or on the functions φ, g. There are several papers that considered the linear convergence of linearized ADMM. For example, [6] considered linearized ADMM, but as mentioned earlier, their rate is not explicit and they impose complex conditions that are incompatible with our results. Similarly, a linear convergence result for linearized ADMM was also obtained in [8], but only under the assumption g = 0 and some strong constraints on the matrices A and B. Again, their results are incompatible with ours. Some other work considered linearized ADMM in more general cases, but without linear convergence. For example, in [11], the authors consider the convergence of linearized ADMM in several different cases and obtain sublinear convergence of order 1/t. Similar sublinear results can be found in [10] for stochastic ADMM. As we have pointed out, our proof technique is closely related to that of [10], and it can handle both linearized and standard ADMM within the same theoretical framework.

3 Theoretical Analysis

This section provides our main results for the standard ADMM and the linearized ADMM. We derive upper bounds on their convergence rates, as well as worst case matching lower bounds for some specific problems.
Given any convex function h, we may define its convex conjugate

  h*(β) = sup_u [β^⊤u − h(u)],

and define the Bregman divergence of a convex function h(u) as

  D_h(u′, u) = h(u) − h(u′) − ∇h(u′)^⊤(u − u′).

We will assume that φ is 1/γ smooth:

  ∀ w, w′:  D_φ(w′, w) ≤ (1/(2γ))‖w′ − w‖²,

which also implies that

  D_φ(w′, w) ≥ (γ/2)‖∇φ(w′) − ∇φ(w)‖²,   D_{φ*}(u′, u) ≥ (γ/2)‖u′ − u‖².
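As a quick illustration of these definitions (our example, not part of the development), consider a quadratic h(u) = (1/2)u^⊤Qu with Q ≻ 0. Then

  h*(β) = sup_u [β^⊤u − (1/2)u^⊤Qu] = (1/2)β^⊤Q^{−1}β,   D_h(u′, u) = (1/2)(u − u′)^⊤Q(u − u′),

so h is 1/γ smooth exactly when σ_max(Q) ≤ 1/γ, in which case h* is γ strongly convex; this is the smoothness/strong-convexity duality expressed by the two implications above.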
We also assume that g is λ strongly convex:

  ∀ v, v′:  D_g(v′, v) ≥ (λ/2)‖v′ − v‖².

Assume also that (w*, v*, α*) is an optimal solution of (1), which satisfies the equalities

  Aw* − Bv* − c = 0,   A^⊤α* = −∇φ(w*),   w* = ∇φ*(−A^⊤α*),   B^⊤α* = ∇g(v*).   (3)

Given any α, taking the infimum over (w, v) of the Lagrangian

  φ(w) + g(v) + α^⊤(Aw − Bv − c),

we obtain the dual objective

  D(α) = −φ*(−A^⊤α) − g*(B^⊤α) − α^⊤c.

It is clear by definition that for any pair (w, v) that is feasible (that is, Aw − Bv − c = 0) and any α, we have φ(w) + g(v) ≥ D(α). The value φ(w) + g(v) − D(α) is referred to as the duality gap. The duality gap is always larger than the primal suboptimality [φ(w) + g(v)] − [φ(w*) + g(v*)]. Therefore if the duality gap is zero, then (w, v) solves (1).

We may also introduce the concept of a restricted duality gap, as in [1]. Consider regions B₁ ⊂ R^p and B₂ ⊂ R^m. Given any α̂, v̂, we can define the restricted duality gap

  G_{B₁×B₂}(α̂, v̂) = sup_{α ∈ B₁; v ∈ B₂} [ φ*(−A^⊤α̂) + g(v̂) − φ*(−A^⊤α) − g(v) + α̂^⊤(Bv + c) − α^⊤(Bv̂ + c) ].

If we pick (α, v) = (α*, v*), then

  D_{φ*}(−A^⊤α*, −A^⊤α̂) + D_g(v*, v̂) = φ*(−A^⊤α̂) + g(v̂) − φ*(−A^⊤α*) − g(v*) + α̂^⊤(Bv* + c) − α*^⊤(Bv̂ + c).

Therefore, as long as (α*, v*) ∈ B₁ × B₂, we have

  D_{φ*}(−A^⊤α*, −A^⊤α̂) + D_g(v*, v̂) ≤ G_{B₁×B₂}(α̂, v̂).

Assume that AA^⊤ is invertible, and let

  A^+ = A^⊤(AA^⊤)^{−1}   (4)

be the pseudo-inverse of A; then we may let ŵ = A^+(Bv̂ + c). It follows that Aŵ − Bv̂ − c = 0. If we set B₁ × B₂ = R^p × R^m, then we recover the unrestricted duality gap:

  G_{R^p×R^m}(α̂, v̂) = [φ(ŵ) + g(v̂)] − D(α̂),

where the maximum over (α, v) is attained at −A^⊤α = ∇φ(ŵ) and v = ∇g*(B^⊤α̂).
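These quantities are directly computable in simple cases. As a sanity check, the following is our code for the quadratic test problem φ(w) = (1/2)w^⊤Qw, g(v) = (1/2)v^⊤Λv (our choice of instance; its conjugates φ*(u) = (1/2)u^⊤Q^{−1}u and g*(u) = (1/2)u^⊤Λ^{−1}u are explicit):

```python
import numpy as np

def duality_gap(Q, Lam, A, B, c, w, v, alpha):
    """phi(w) + g(v) - D(alpha) for phi(w) = w'Qw/2 and g(v) = v'Lam v/2."""
    primal = 0.5 * w @ Q @ w + 0.5 * v @ Lam @ v
    u1, u2 = -A.T @ alpha, B.T @ alpha
    # D(alpha) = -phi*(-A'alpha) - g*(B'alpha) - alpha'c
    dual = (-0.5 * u1 @ np.linalg.solve(Q, u1)
            - 0.5 * u2 @ np.linalg.solve(Lam, u2)
            - alpha @ c)
    return primal - dual  # nonnegative whenever A w - B v = c
```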
In general, we have the following result.

Theorem 3.1

Assume that φ is 1/γ smooth and g is λ strongly convex. Assume that we can write H = A^⊤H̃A. Let σ_max(H) and σ_max(H̃) be the largest eigenvalues of H and H̃ respectively, let σ_min(A) be the smallest eigenvalue of (AA^⊤)^{1/2}, let σ_max(B) be the largest singular value of B, and let σ_max(G) be the largest singular value of G. Consider s ∈ [0, 1) and θ > 0 such that

  θ ≤ min( γρσ²_min(A)/(γσ_max(H) + 1),  sρ/σ_max(H̃),  (1 − s)λ/((ρ + σ_max(H̃))σ²_max(B) + (1 − s)ρσ_max(G)) ).

Let α̃_t = α_t + H̃A(w_t − w_{t−1}). Then for all (α, v) and w = ∇φ*(−A^⊤α), Algorithm 1 produces approximate solutions that satisfy

  Σ_{t=1}^T (1 + θ)^{t−T} r_t ≤ (1 + θ)^{−T} δ_0 − δ_T,   (5)
  Σ_{t=1}^T (1 + θ)^{t−T} r*_t ≤ (1 + θ)^{−T} δ_0 − δ_T,   (6)

where

  r_t = φ(w_t) + g(v_t) − φ(w) − g(v) − α̃_t^⊤(Aw − Bv − c) + α^⊤(Aw_t − Bv_t − c),
  r*_t = φ*(−A^⊤α̃_t) + g(v_t) − φ*(−A^⊤α) − g(v) + α̃_t^⊤(Bv + c) − α^⊤(Bv_t + c),
  δ_t = (ρ/2)‖Aw_t − Bv − c‖² + (1/2)‖Aw_t − Bv − c‖²_H̃ + (ρ(1 + θ)/2)‖v_t − v‖²_G + ((1 + θ)/(2ρ))‖α − α_t‖².

For arbitrary (α, v), the left hand sides of (5) and (6) can be difficult to interpret. We may choose specific values of (α, v) so that the results are easier to understand. By setting (α, w, v) = (α*, w*, v*) in Theorem 3.1 and using (3), we obtain the following corollary.

Corollary 3.1
Under the conditions of Theorem 3.1, we have

  Σ_{t=1}^T (1 + θ)^{t−T} [ max(D_φ(w*, w_t), D_{φ*}(−A^⊤α*, −A^⊤α̃_t)) + D_g(v*, v_t) ]
    + (ρ/2)‖A(w_T − w*)‖² + (1/2)‖A(w_T − w*)‖²_H̃ + ((1 + θ)/(2ρ))‖α_T − α*‖² + (ρ(1 + θ)/2)‖v_T − v*‖²_G
  ≤ (1 + θ)^{−T} [ (ρ/2)‖A(w_0 − w*)‖² + (1/2)‖A(w_0 − w*)‖²_H̃ + ((1 + θ)/(2ρ))‖α_0 − α*‖² + (ρ(1 + θ)/2)‖v_0 − v*‖²_G ].

Using the definition of the restricted duality gap, it is easy to see that (6) directly implies an upper bound on the restricted duality gap, in the same style as the results of [1]. Our result is more general than those of [1] because it can also be expressed in the form of Corollary 3.1, as well as in terms of the unrestricted duality gap, as stated below.

Corollary 3.2
Under the conditions of Theorem 3.1, let A^+ be the pseudo-inverse of A. Define

  δ*_0 = (1/2)[ (ρ + σ_max(H̃))‖A(w_0 − w*)‖² + ((1 + θ)/ρ)‖α_0 − α*‖² + ρ(1 + θ)‖v_0 − v*‖²_G ],
  b₁(δ) = sup_u { ‖α_0 + (A^+)^⊤∇φ(A^+(Bu + c))‖² : ‖u − v*‖²_G ≤ 2δ/(ρ(1 + θ)) },
  b₂(δ) = sup_β { ‖v_0 − ∇g*(B^⊤β)‖² : ‖β − α*‖ ≤ √(2ρδ/(1 + θ)) + σ_max(H̃)√(4(2 + θ)δ/ρ) },
  b(δ) = ((1 + θ)/ρ) b₁(δ) + ρ(1 + θ)σ_max(G) b₂(δ) + 2(ρ + σ_max(H̃))(σ²_max(B) b₂(δ) + ‖Aw_0 − Bv_0 − c‖²).

Then we have the following bound on the duality gap:

  [φ(A^+(Bv_T + c)) + g(v_T)] − D(α̃_T) ≤ (1 + θ)^{−T} b((1 + θ)^{−T} δ*_0).

Moreover, define

  v̄_T = Σ_{t=1}^T (1 + θ)^t v_t / Σ_{t=1}^T (1 + θ)^t,   ᾱ_T = Σ_{t=1}^T (1 + θ)^t α̃_t / Σ_{t=1}^T (1 + θ)^t.

Then

  [φ(A^+(Bv̄_T + c)) + g(v̄_T)] − D(ᾱ_T) ≤ b(δ*_0) / Σ_{t=1}^T (1 + θ)^t.

In the above results, consider the simple case of H = 0. Then the optimal value of θ is achieved when we take

  ρ = √(λ/γ) / (σ_min(A)√(σ²_max(B) + σ_max(G))),   θ = σ_min(A)√(γλ) / √(σ²_max(B) + σ_max(G)).
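To give a feel for these quantities (our instantiation, not part of the statement): with A = B = I (so σ_min(A) = σ_max(B) = 1) and G = 0, the formulas reduce to

  ρ = √(λ/γ),   θ = √(γλ),

so the linear factor per iteration is 1/(1 + √(γλ)); for instance, λγ = 0.02 gives 1/(1 + √0.02) ≈ 0.876. The choice ρ = √(λ/γ) is exactly the one that reappears in the lower bound construction and in the numerical example of Section 4.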
When θ > 0, this implies the following convergence from Corollary 3.1:

  max[D_φ(w*, w_T), D_{φ*}(−A^⊤α*, −A^⊤α̃_T)] + D_g(v*, v_T)
    + (ρ/2)‖A(w_T − w*)‖² + ((1 + θ)/(2ρ))‖α_T − α*‖² + (ρ(1 + θ)/2)‖v_T − v*‖²_G
  ≤ (1 + σ_min(A)√(γλ/(σ²_max(B) + σ_max(G))))^{−T} [ (ρ/2)‖A(w_0 − w*)‖² + ((1 + θ)/(2ρ))‖α_0 − α*‖² + (ρ(1 + θ)/2)‖v_0 − v*‖²_G ].

This implies ‖w* − w_T‖² = O((1 + θ)^{−T}), ‖v* − v_T‖² = O((1 + θ)^{−T}), and ‖α* − α_T‖² = O((1 + θ)^{−T}).

The linear convergence result holds when θ > 0. However, even when θ = 0 (and H = 0), we can still obtain the following sublinear convergence from Corollary 3.1:

  max[ D_φ(w*, w̄_T), (1/T) Σ_{t=1}^T D_{φ*}(−A^⊤α*, −A^⊤α̃_t) ] + D_g(v*, v̄_T)
  ≤ (1/(2T)) [ (ρ + σ_max(H̃))‖A(w_0 − w*)‖² + (1/ρ)‖α_0 − α*‖² + ρ‖v_0 − v*‖²_G ],

where w̄_T = T^{−1} Σ_{t=1}^T w_t and v̄_T = T^{−1} Σ_{t=1}^T v_t. This result does not require any additional assumption on φ, g, A, B.

Similar results hold for the unrestricted duality gap under the conditions of Corollary 3.2. For example, when θ = 0 but AA^⊤ is invertible, we obtain the sublinear convergence of the duality gap below:

  [φ(A^+(Bv̄_T + c)) + g(v̄_T)] − D(ᾱ_T) ≤ b(δ*_0)/T.
This bound can be compared to the main result of [1], which is stated in terms of the restricted duality gap (and where the authors studied a method that is related to, but not identical to, ADMM). Their result does not imply a bound on the unrestricted duality gap, because they did not obtain a counterpart of Corollary 3.1.

In the case where φ is smooth but g is not strongly convex, given any ε > 0, we can set λ = ε and apply ADMM with g(v) replaced by the strongly convex function g(v) + (λ/2)v^⊤v. With ρ chosen optimally, this leads to

  φ(A^+(Bv_T + c)) + g(v_T) − D(α_T) = O(ε)

when we take T = ln(1/(γε))/√(γε).

For linearized ADMM, we have the following counterpart of Theorem 3.1. Here we need to assume that A is invertible and that H is sufficiently large, so that σ_min(H) ≥ γ^{−1}.

Theorem 3.2
Assume that φ is 1/γ smooth and g is λ strongly convex, and that A is a square invertible matrix. Assume that we can write H = A^⊤H̃A. Let σ_min(H) and σ_max(H̃) be the smallest eigenvalue of H and the largest eigenvalue of H̃ respectively, and assume that σ_min(H) ≥ γ^{−1}. Let σ_min(A) be the smallest eigenvalue of (AA^⊤)^{1/2}, σ_max(B) the largest singular value of B, and σ_max(G) the largest singular value of G. Consider s ∈ [0, 1) and θ > 0 such that

  θ ≤ min( ρσ²_min(A)/σ_min(H),  sρ/σ_max(H̃),  (1 − s)λ/((ρ + σ_max(H̃))σ²_max(B) + (1 − s)ρσ_max(G)) ).

Let α̃_t = α_t + H̃A(w_t − w_{t−1}). Then for any (α, v) and w = ∇φ*(−A^⊤α), Algorithm 2 produces approximate solutions that satisfy

  Σ_{t=1}^T (1 + θ)^{t−T} r_t ≤ (1 + θ)^{−T} δ_0 − δ_T,   (7)
  Σ_{t=1}^T (1 + θ)^{t−T} r*_t ≤ (1 + θ)^{−T} δ_0 − δ_T,   (8)

where

  r_t = φ(w_{t−1}) + g(v_t) − φ(w) − g(v) − α̃_t^⊤(Aw − Bv − c) + α^⊤(Aw_{t−1} − Bv_t − c),
  r*_t = φ*(−A^⊤α̃_t) + g(v_t) − φ*(−A^⊤α) − g(v) + α̃_t^⊤(Bv + c) − α^⊤(Bv_t + c),
  δ_t = (1/2)[ ‖Aw_t − Bv − c‖²_H̃ + ρ‖Aw_t − Bv − c‖² + ρ(1 + θ)‖v_t − v‖²_G + ((1 + θ)/ρ)‖α_t − α‖² ].

Corollary 3.3
Under the conditions of Theorem 3.2, we have

  Σ_{t=1}^T (1 + θ)^{t−T} [ max(D_φ(w*, w_{t−1}), D_{φ*}(−A^⊤α*, −A^⊤α̃_t)) + D_g(v*, v_t) ]
    + (1/2)‖w_T − w*‖²_H + (ρ/2)‖A(w_T − w*)‖² + ((1 + θ)/(2ρ))‖α_T − α*‖² + (ρ(1 + θ)/2)‖v_T − v*‖²_G
  ≤ (1 + θ)^{−T} [ ((ρ + σ_max(H̃))/2)‖A(w_0 − w*)‖² + ((1 + θ)/(2ρ))‖α_0 − α*‖² + (ρ(1 + θ)/2)‖v_0 − v*‖²_G ].

Corollary 3.4
Under the conditions of Theorem 3.2, let A^+ be the pseudo-inverse of A. If we define δ*_0, b(δ), v̄_T, and ᾱ_T as in Corollary 3.2, then

  [φ(A^+(Bv_T + c)) + g(v_T)] − D(α̃_T) ≤ (1 + θ)^{−T} b((1 + θ)^{−T} δ*_0),
  [φ(A^+(Bv̄_T + c)) + g(v̄_T)] − D(ᾱ_T) ≤ b(δ*_0) / Σ_{t=1}^T (1 + θ)^t.

The requirement σ_min(H) ≥ γ^{−1} is the key difference between Theorem 3.1 and Theorem 3.2. The fast convergence of ADMM requires H to be of order O(ρ), which may be smaller than Θ(γ^{−1}). Consider the case H = Θ(γ^{−1})I for linearized ADMM; then the optimal ρ can be chosen as ρ = Θ(γ^{−1}). This leads to linear convergence with θ = Θ(λγ). The rate is slower than that of the standard ADMM, which can achieve θ = Θ(√(λγ)) at the optimal choice of ρ.

Similar to the case of standard ADMM, we could take θ = 0: as long as φ is 1/γ smooth and H satisfies σ_min(H) ≥ 1/γ, we can achieve the following sublinear convergence without additional assumptions:

  Σ_{t=1}^T [ max(D_φ(w*, w_{t−1}), D_{φ*}(−A^⊤α*, −A^⊤α̃_t)) + D_g(v*, v_t) ]
  ≤ (1/2)[ (ρ + σ_max(H̃))‖A(w_0 − w*)‖² + ρ‖v_0 − v*‖²_G + ρ^{−1}‖α_0 − α*‖² ].

A similar result holds for duality gap convergence when A is a square invertible matrix.

We now consider the quadratic case with A = B = I, c = 0, and

  φ(w) = (1/2)w^⊤Qw,   g(v) = (1/2)v^⊤Λv.

The optimal solution is w* = v* = α* = 0. We show that with appropriately chosen Q and Λ, such that φ is 1/γ smooth and both φ and g are λ strongly convex, the convergence rate of ADMM can be 1 − Θ(√(γλ)) and the convergence rate of linearized ADMM can be 1 − Θ(γλ).

ADMM
We assume that Q and Λ are diagonal matrices. The ADMM iterates satisfy the following equations (with G = 0):

  v_t = (Λ + ρI)^{−1}(α_{t−1} + ρw_{t−1}),
  w_t = (Q + ρI)^{−1}(ρv_t − α_{t−1}),
  α_t = α_{t−1} + ρ(w_t − v_t),

which implies

  v_t = (Λ + ρI)^{−1}(α_{t−1} + ρw_{t−1}),
  w_t = (Λ + ρI)^{−1}(Q + ρI)^{−1}(ρ²w_{t−1} − Λα_{t−1}),
  α_t = (Λ + ρI)^{−1}(Q + ρI)^{−1} Q(Λα_{t−1} − ρ²w_{t−1}).

We may write [w_t; α_t] = M[w_{t−1}; α_{t−1}]. Now we take Q = Λ = diag(λ, 1/γ), where we assume λ ≤ 1/γ. Then the largest eigenvalue of M, which determines the rate of convergence of ADMM, is

  max[ (ρ² + λ²)/(ρ + λ)², (ρ²γ² + 1)/(ργ + 1)² ].

The optimal ρ to minimize the above is ρ = √(λ/γ), and the maximum value is

  (1 + γλ)/(1 + √(γλ))².

This special case matches the convergence rate behavior of 1 − Θ(√(γλ)) we proved for the ADMM method.
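This eigenvalue computation is easy to verify numerically. The following check is ours, not from the paper: for each scalar component q of Q = Λ, the corresponding 2×2 block of M has eigenvalues 0 and (ρ² + q²)/(ρ + q)², and at ρ = √(λ/γ) both components give the common rate (1 + γλ)/(1 + √(γλ))².

```python
import numpy as np

def admm_block(q, rho):
    """2x2 block of M acting on (w_t, alpha_t) for a scalar component with
    Q = Lam = q, A = B = 1, c = 0, and G = 0."""
    return np.array([[rho**2, -q],
                     [-q * rho**2, q**2]]) / (q + rho)**2

lam, gamma = 0.2, 0.1
rho = np.sqrt(lam / gamma)           # the optimal choice derived above
for q in (lam, 1.0 / gamma):         # the two diagonal entries of Q = Lam
    radius = max(abs(np.linalg.eigvals(admm_block(q, rho))))
    assert np.isclose(radius, (rho**2 + q**2) / (rho + q)**2)
    assert np.isclose(radius, (1 + gamma * lam) / (1 + np.sqrt(gamma * lam))**2)
```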
Linearized ADMM

We assume that H, Q, and Λ are diagonal matrices. The linearized ADMM iterates satisfy the following equations (with G = 0):

  v_t = (Λ + ρI)^{−1}(α_{t−1} + ρw_{t−1}),
  w_t = (H + ρI)^{−1}((H − Q)w_{t−1} + ρv_t − α_{t−1}),
  α_t = α_{t−1} + ρ(w_t − v_t),

which implies that

  v_t = (Λ + ρI)^{−1}(α_{t−1} + ρw_{t−1}),
  w_t = (Λ + ρI)^{−1}(H + ρI)^{−1}(((ρI + Λ)(H − Q) + ρ²I)w_{t−1} − Λα_{t−1}),
  α_t = (Λ + ρI)^{−1}(H + ρI)^{−1} H(Λα_{t−1} + ρ(Λ − (ρI + Λ)H^{−1}Q)w_{t−1}).

Now let λ ≤ 1/γ, and take Q = Λ = diag(λ, 1/γ) and H = diag(2/γ, 1/γ). It follows that the convergence rate of linearized ADMM is no faster than the largest eigenvalue of

  M = 1/((ρ + h)(ρ + λ)) · [ ρ² + (ρ + λ)(h − q)    −λ ]
                            [ ρ(λh − (ρ + λ)q)       λh ]

with q = λ and h = 2/γ. When ρ ≤ h − λ, the largest eigenvalue of M is no less than

  (ρ² + (h − q)ρ + (h − q)λ) / (ρ² + (h + λ)ρ + hλ) ≥ (h − λ)/(h + λ) = 1 − O(λγ).

Similarly, it is not difficult to check that this eigenvalue is no less than 1 − O(λγ) when ρ ≥ h − λ. It follows that this special case matches the convergence rate behavior of 1 − Θ(γλ) we proved for the linearized ADMM method.

4 Numerical Example

Although we have obtained both worst case upper bounds and matching lower bounds for ADMM and linearized ADMM, the analysis only shows that in the worst case ADMM converges at the faster rate 1 − Θ(√(λγ)), while in the worst case linearized ADMM converges at the slower rate 1 − Θ(λγ). However, for any specific problem, both methods can converge faster than the corresponding worst case upper bounds obtained in this paper. In this section, we use a simple example to illustrate the actual convergence behavior of ADMM versus linearized ADMM at different choices of ρ, and to demonstrate the phenomenon that the former can converge significantly faster than the latter.

Consider the following 1-dimensional problem:
  φ(w) = (w/√γ) arctan(w/√γ) − (1/2) ln(1 + w²/γ) + (µ/2)w²,   g(v) = (1/12)v⁴ + (λ/2)v²,

with A = B = I and c = 0. It can be checked that φ(w) is (1/γ + µ) smooth and µ strongly convex, and that g(v) is λ strongly convex. We compare the convergence of ADMM versus linearized ADMM with different values of ρ. In linearized ADMM, we set h = 2(µ + 1/γ). Note that for this problem w* = v* = 0, and we can define the error of a solution (w, v) as √(w² + v²).

Figure 1 shows the convergence behavior when γ = 0.1 and λ = µ = 0.2. This is the situation where λγ = 0.02 is relatively small. In this case, we compare three different values of ρ: ρ = 0.2√(λ/γ), ρ = √(λ/γ), and ρ = 5√(λ/γ). At the latter two choices, the convergence rates for ADMM are 0.21 and 0.41, while the convergence rates for linearized ADMM at the first two choices are 0.51 and 0.53; ADMM converges faster than linearized ADMM at all three choices. Moreover, ADMM achieves a relatively fast convergence rate at the optimal choice ρ = √(λ/γ), while linearized ADMM is relatively insensitive to ρ.

Figure 2 shows the convergence behavior when γ = λ = µ = 1. This is the situation where λγ = 1 is relatively large. We compare the same three values of ρ: ρ = 0.2√(λ/γ), ρ = √(λ/γ), and ρ = 5√(λ/γ). The corresponding convergence rates for ADMM are 0.78, 0.49, and 0.64; the corresponding convergence rates for linearized ADMM are 0.82, 0.69, and 0.82. The relative convergence behaviors of ADMM and linearized ADMM are consistent with those of Figure 1.
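The experiment is straightforward to reproduce. Below is a condensed sketch (our code, under the stated setting A = B = 1, c = 0; the two exact minimizations are solved with a few Newton steps since φ and g are not quadratic, while the linearized w-step is explicit):

```python
import numpy as np

gamma, lam, mu = 0.1, 0.2, 0.2        # the setting of Figure 1
rho = np.sqrt(lam / gamma)            # the middle choice of rho
h = 2 * (mu + 1 / gamma)              # H used by linearized ADMM

phi_p  = lambda w: np.arctan(w / np.sqrt(gamma)) / np.sqrt(gamma) + mu * w
phi_pp = lambda w: 1 / (gamma + w**2) + mu
g_p    = lambda v: v**3 / 3 + lam * v
g_pp   = lambda v: v**2 + lam

def newton(f, fp, x):                 # solve f(x) = 0 for the 1-D subproblems
    for _ in range(50):
        x = x - f(x) / fp(x)
    return x

def run(T, linearized):
    w, v, alpha = 1.0, 1.0, 0.0       # arbitrary start; the solution is 0
    for _ in range(T):
        v = newton(lambda x: g_p(x) - alpha + rho * (x - w),
                   lambda x: g_pp(x) + rho, v)
        if linearized:                # explicit linearized w-step
            w = (h * w + rho * v - alpha - phi_p(w)) / (h + rho)
        else:                         # exact w-minimization (standard ADMM)
            w = newton(lambda x: phi_p(x) + alpha + rho * (x - v),
                       lambda x: phi_pp(x) + rho, w)
        alpha += rho * (w - v)
    return np.sqrt(w**2 + v**2)       # the error measure used in the figures

print(run(50, linearized=False), run(50, linearized=True))
```

Plotting the error after each iteration on a log scale reproduces the qualitative behavior of Figures 1 and 2.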
[Figure 1: Convergence of ADMM versus that of linearized ADMM (γ = 0.1, λ = µ = 0.2). Three panels, (a) ρ = 0.2√(λ/γ), (b) ρ = √(λ/γ), (c) ρ = 5√(λ/γ), each plotting log error against the number of iterations.]

[Figure 2: Convergence of ADMM versus that of linearized ADMM (γ = λ = µ = 1). Three panels, (a) ρ = 0.2√(λ/γ), (b) ρ = √(λ/γ), (c) ρ = 5√(λ/γ), each plotting log error against the number of iterations.]

5 Conclusion

This paper presents a new duality gap convergence analysis of standard ADMM versus linearized ADMM under conditions commonly studied in the literature. It is shown that in the worst case, the standard ADMM converges with an accelerated rate that is faster than that of the linearized ADMM. Matching lower bounds are obtained for specific problems. A simple numerical example illustrates this behavior. One consequence of our analysis is that, in theory, the standard ADMM does not require Nesterov's acceleration scheme, because it already enjoys the square-root convergence rate for smooth and strongly convex problems. On the other hand, linearized ADMM may still benefit from extra acceleration steps. Finally, the results obtained in this paper only show the worst case behaviors of both algorithms (under appropriate assumptions commonly used in the literature). In practice, both methods might converge faster, and it remains open to study such faster convergence rates under additional suitable assumptions.
Acknowledgment

The work was done during Da Tang's internship at Baidu Big Data Lab in Beijing. Tong Zhang would like to acknowledge NSF IIS-1250985, NSF IIS-1407939, and NIH R01AI116744 for supporting his research. The authors would also like to thank Wotao Yin for helpful discussions.

References

[1] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.
[2] Damek Davis and Wotao Yin. Convergence rate analysis of several splitting schemes. arXiv preprint arXiv:1406.4834, 2014.
[3] Damek Davis and Wotao Yin. Faster convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions. arXiv preprint arXiv:1407.5210, 2014.
[4] Wei Deng and Wotao Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, pages 1-28, 2012.
[5] Donald Goldfarb, Shiqian Ma, and Katya Scheinberg. Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming, 141(1-2):349-382, 2013.
[6] Mingyi Hong and Zhi-Quan Luo. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012.
[7] Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Explicit convergence rate of a distributed alternating direction method of multipliers. arXiv preprint arXiv:1312.1085, 2013.
[8] Qing Ling and Alejandro Ribeiro. Decentralized linearized alternating direction method of multipliers. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 5447-5451. IEEE, 2014.
[9] Robert Nishihara, Laurent Lessard, Benjamin Recht, Andrew Packard, and Michael I. Jordan. A general analysis of the convergence of ADMM. arXiv preprint arXiv:1502.02009, 2015.
[10] Hua Ouyang, Niao He, Long Tran, and Alexander Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, pages 80-88, 2013.
[11] Yuyuan Ouyang, Yunmei Chen, Guanghui Lan, and Eduardo Pasiliao Jr. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 8(1):644-681, 2015.
[12] Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392-400, 2013.
A Proof of Theorem 3.1
The fact that w_t minimizes the objective function in line 4 of Algorithm 1, together with the relationship between α_t and α_{t−1} in line 5, implies that

  ∇φ(w_t) + A^⊤α_t = H(w_{t−1} − w_t).   (9)

We thus obtain

  φ(w_t) − φ(w) + α_t^⊤A(w_t − w) + (Aw − Bv − c)^⊤H̃A(w_{t−1} − w_t)
  ≤ −(γ/2)‖∇φ(w_t) − ∇φ(w)‖² + ∇φ(w_t)^⊤(w_t − w) + α_t^⊤A(w_t − w) + (Aw − Bv − c)^⊤H̃A(w_{t−1} − w_t)
  = −(γ/2)‖A^⊤(α_t − α) + H(w_t − w_{t−1})‖² + (w_{t−1} − w_t)^⊤A^⊤H̃A(w_t − w) + (Aw − Bv − c)^⊤H̃A(w_{t−1} − w_t)
  = −(γ/2)‖A^⊤(α_t − α) + H(w_t − w_{t−1})‖² + (1/2)[ ‖Aw_{t−1} − Bv − c‖²_H̃ − ‖Aw_t − Bv − c‖²_H̃ − ‖Aw_t − Aw_{t−1}‖²_H̃ ].   (10)

In the above derivation, the inequality is a direct consequence of the smoothness of φ, which implies that for any w′ and w, φ(w) ≥ φ(w′) + ∇φ(w′)^⊤(w − w′) + (γ/2)‖∇φ(w) − ∇φ(w′)‖². The first equality is due to (9) and ∇φ(w) + A^⊤α = 0 (which follows from the assumption w = ∇φ*(−A^⊤α) of the theorem). The second equality is algebra.

We also have, from the optimality of v_t for the objective function in line 3 of Algorithm 1 and the relationship between α_t and α_{t−1} in line 5:

  ∇g(v_t) − B^⊤α_t = −ρG(v_t − v_{t−1}) + ρB^⊤A(w_{t−1} − w_t).   (11)

Therefore

  g(v_t) − g(v) + (λ/2)‖v_t − v‖² − α_t^⊤B(v_t − v)
  ≤ ∇g(v_t)^⊤(v_t − v) − α_t^⊤B(v_t − v)
  = ρ(v_t − v_{t−1})^⊤G(v − v_t) + ρ(w_t − w_{t−1})^⊤A^⊤B(v − v_t)
  = (ρ/2)[ ‖v − v_{t−1}‖²_G − ‖v − v_t‖²_G − ‖v_t − v_{t−1}‖²_G ]
    + (ρ/2)[ ‖Aw_{t−1} − Bv − c‖² + ‖Aw_t − Bv_t − c‖² − ‖Aw_t − Bv − c‖² − ‖Aw_{t−1} − Bv_t − c‖² ]
  = (ρ/2)[ ‖v − v_{t−1}‖²_G − ‖v − v_t‖²_G − ‖v_t − v_{t−1}‖²_G ] + (1/(2ρ))‖α_t − α_{t−1}‖²
    + (ρ/2)[ ‖Aw_{t−1} − Bv − c‖² − ‖Aw_t − Bv − c‖² − ‖Aw_{t−1} − Bv_t − c‖² ].   (12)

In the above derivation, the first inequality is due to the strong convexity of g(·). The first equality employs (11). The second equality is algebra, and the third equality is due to the relationship between α_t and α_{t−1} in line 5 of Algorithm 1.

Finally, we have

  −(α_t − α)^⊤(Aw_t − Bv_t − c) = −(1/ρ)(α_t − α)^⊤(α_t − α_{t−1})
  = (1/(2ρ))[ ‖α − α_{t−1}‖² − ‖α − α_t‖² − ‖α_t − α_{t−1}‖² ],   (13)

where the first equality uses the relationship between α_t and α_{t−1} in line 5 of Algorithm 1, and the second equality is algebra.

By adding (10), (12), and (13), we obtain

  φ(w_t) + g(v_t) − φ(w) − g(v) − α̃_t^⊤(Aw − Bv − c) + α^⊤(Aw_t − Bv_t − c)
  ≤ −(γ/2)‖A^⊤(α_t − α) + H(w_t − w_{t−1})‖² − (λ/2)‖v_t − v‖²
    + (1/2)[ ‖Aw_{t−1} − Bv − c‖²_H̃ − ‖Aw_t − Bv − c‖²_H̃ − ‖Aw_t − Aw_{t−1}‖²_H̃ ]
    + (ρ/2)[ ‖v − v_{t−1}‖²_G − ‖v − v_t‖²_G − ‖v_t − v_{t−1}‖²_G ]
    + (ρ/2)[ ‖Aw_{t−1} − Bv − c‖² − ‖Aw_t − Bv − c‖² − ‖Aw_{t−1} − Bv_t − c‖² ]
    + (1/(2ρ))[ ‖α − α_{t−1}‖² − ‖α − α_t‖² ],

which can be rewritten as the bound

  r_t ≤ X_t + Y_t − (ρ/2)‖v_t − v_{t−1}‖²_G + Z_t + (1 + θ)^{−1}δ_{t−1} − δ_t,

where

  X_t = −(γ/2)‖A^⊤(α_t − α) + H(w_t − w_{t−1})‖² − (1/2)‖w_t − w_{t−1}‖²_H + (θ/(2ρ))‖α_t − α‖²,
  Y_t = −(λ/2)‖v_t − v‖² + (ρθ/2)‖v − v_t‖²_G,
  Z_t = (ρθ/(2(1 + θ)))‖Aw_{t−1} − Bv − c‖² + (θ/(2(1 + θ)))‖Aw_{t−1} − Bv − c‖²_H̃ − (ρ/2)‖Aw_{t−1} − Bv_t − c‖²,

and we used

  (1/2)[ (1/(1 + θ))‖Aw_{t−1} − Bv − c‖²_H̃ − ‖Aw_t − Bv − c‖²_H̃ ]
    + (ρ/2)[ ‖v − v_{t−1}‖²_G − (1 + θ)‖v − v_t‖²_G ]
    + (ρ/2)[ (1/(1 + θ))‖Aw_{t−1} − Bv − c‖² − ‖Aw_t − Bv − c‖² ]
    + (1/(2ρ))[ ‖α − α_{t−1}‖² − (1 + θ)‖α − α_t‖² ]
  = (1 + θ)^{−1}δ_{t−1} − δ_t.

We can bound X_t as follows:

  X_t ≤ −(γ/2)‖A^⊤(α_t − α) + H(w_t − w_{t−1})‖² − (1/(2σ_max(H)))‖H(w_t − w_{t−1})‖² + (θ/(2ρ))‖α_t − α‖²
  ≤ (1/2) max_u [ −γ‖A^⊤(α_t − α) + u‖² − σ_max(H)^{−1}‖u‖² ] + (θ/(2ρ))‖α_t − α‖²
  = −(γ/(2(γσ_max(H) + 1)))‖A^⊤(α_t − α)‖² + (θ/(2ρ))‖α_t − α‖² ≤ 0,

where the last inequality uses ‖A^⊤(α_t − α)‖² ≥ σ²_min(A)‖α_t − α‖² and the condition on θ in the theorem. We also have

  Z_t ≤ (ρ/2)[ (θ(1 + σ_max(H̃)/ρ)/(1 + θ))‖Aw_{t−1} − Bv − c‖² − ‖Aw_{t−1} − Bv_t − c‖² ]
  ≤ (θρ(ρ + σ_max(H̃))/(2(ρ − θσ_max(H̃))))‖B(v_t − v)‖²,

where the second inequality uses the fact that

  (θ(1 + a)/(1 + θ))‖u‖² − ‖u′‖² ≤ (θ(1 + a)/(1 − θa))‖u − u′‖²  when θa < 1,

with a = σ_max(H̃)/ρ. Therefore

  Y_t + Z_t ≤ [ −λ/2 + (ρθ/2)σ_max(G) + (θ(ρ + σ_max(H̃))/(2(1 − s)))σ²_max(B) ]‖v_t − v‖² ≤ 0,

where we used ρ − θσ_max(H̃) ≥ (1 − s)ρ and the condition on θ. Therefore we obtain

  r_t ≤ X_t + Y_t + Z_t + (1 + θ)^{−1}δ_{t−1} − δ_t ≤ (1 + θ)^{−1}δ_{t−1} − δ_t.

Now, multiplying the above displayed inequality by (1 + θ)^{t−T} and summing over t = 1, . . . , T, we obtain (5).

In order to obtain (6), we simply note that (9) implies

  ∇φ*(−A^⊤α̃_t) − w_t = 0.   (14)

Therefore (10) can be replaced by the following inequality:

  φ*(−A^⊤α̃_t) − φ*(−A^⊤α) + (α_t − α)^⊤Aw_t + (−Bv − c)^⊤H̃A(w_{t−1} − w_t)
  ≤ ∇φ*(−A^⊤α̃_t)^⊤(−A^⊤α̃_t + A^⊤α) − (γ/2)‖−A^⊤α̃_t + A^⊤α‖² + (α_t − α)^⊤Aw_t + (−Bv − c)^⊤H̃A(w_{t−1} − w_t)
  = −w_t^⊤(A^⊤α_t + H(w_t − w_{t−1}) − A^⊤α) + (α_t − α)^⊤Aw_t − (γ/2)‖A^⊤(α_t − α) + H(w_t − w_{t−1})‖² + (−Bv − c)^⊤H̃A(w_{t−1} − w_t)
  = −(γ/2)‖A^⊤(α_t − α) + H(w_t − w_{t−1})‖² + (1/2)[ ‖Aw_{t−1} − Bv − c‖²_H̃ − ‖Aw_t − Bv − c‖²_H̃ − ‖Aw_t − Aw_{t−1}‖²_H̃ ],   (15)

where the first inequality uses the fact that φ* is γ strongly convex, which is a direct consequence of the fact that φ is 1/γ smooth. The first equality is due to (14) and the definition of α̃_t. The second equality is algebra.

Now note that the right hand side of (15) is the same as that of (10). Therefore the remainder of the proof follows the same argument as that of (5), where we simply use the addition of (15), (12), and (13) in place of the addition of (10), (12), and (13). This leads to (6).
B Proof of Corollary 3.2

We have from (6):

  2(1 + θ)^T [ φ*(−A^⊤α̃_T) + g(v_T) − φ*(−A^⊤α) − g(v) + α̃_T^⊤(Bv + c) − α^⊤(Bv_T + c) ]
  ≤ (ρ + σ_max(H̃))‖Aw_0 − Bv − c‖² + ρ(1 + θ)‖v_0 − v‖²_G + ((1 + θ)/ρ)‖α_0 − α‖²
  ≤ 2(ρ + σ_max(H̃))‖Aw_0 − Bv_0 − c‖² + (2(ρ + σ_max(H̃))σ²_max(B) + ρ(1 + θ)σ_max(G))‖v_0 − v‖² + ((1 + θ)/ρ)‖α_0 − α‖².

Now we set α = −(A^+)^⊤∇φ(A^+(Bv_T + c)) and v = ∇g*(B^⊤α̃_T). This choice achieves the maximum value of the left hand side over (α, v). With this choice, and the definition of the convex conjugate, we obtain

  2(1 + θ)^T [ φ(A^+(Bv_T + c)) + g(v_T) − D(α̃_T) ]
  ≤ (2(ρ + σ_max(H̃))σ²_max(B) + ρ(1 + θ)σ_max(G))‖v_0 − v‖² + ((1 + θ)/ρ)‖α_0 − α‖² + 2(ρ + σ_max(H̃))‖Aw_0 − Bv_0 − c‖².   (16)

From Corollary 3.1, we obtain

  (ρ/2)‖A(w_T − w*)‖² + ((1 + θ)/(2ρ))‖α_T − α*‖² + (ρ(1 + θ)/2)‖v_T − v*‖²_G ≤ (1 + θ)^{−T}δ*_0.   (17)

Therefore

  ‖A(w_T − w_{T−1})‖² ≤ 2‖A(w_T − w*)‖² + 2‖A(w_{T−1} − w*)‖² ≤ 4(2 + θ)(1 + θ)^{−T}δ*_0/ρ.

Moreover, (17) also implies

  ‖α_T − α*‖² ≤ 2ρ(1 + θ)^{−1}(1 + θ)^{−T}δ*_0.

Therefore

  ‖α̃_T − α*‖ ≤ ‖α_T − α*‖ + σ_max(H̃)‖A(w_T − w_{T−1})‖
  ≤ σ_max(H̃)√(4(2 + θ)(1 + θ)^{−T}δ*_0/ρ) + √(2ρ(1 + θ)^{−1}(1 + θ)^{−T}δ*_0).

It follows from the definition of b₂(·) that

  ‖v_0 − v‖² ≤ b₂((1 + θ)^{−T}δ*_0).

Similarly, we obtain from (17) that ‖v_T − v*‖²_G ≤ 2(1 + θ)^{−T}δ*_0/(ρ(1 + θ)), which implies

  ‖α_0 − α‖² ≤ b₁((1 + θ)^{−T}δ*_0).

Now the first desired bound of the corollary is obtained by plugging these estimates of ‖v_0 − v‖² and ‖α_0 − α‖² into (16).

For the second desired bound, we note from Jensen's inequality and (6) that

  φ*(−A^⊤ᾱ_T) + g(v̄_T) − φ*(−A^⊤α) − g(v) + ᾱ_T^⊤(Bv + c) − α^⊤(Bv̄_T + c)
  ≤ (Σ_{t=1}^T (1 + θ)^t)^{−1} [ (1/2)(ρ + σ_max(H̃))‖Aw_0 − Bv − c‖² + (ρ(1 + θ)/2)‖v_0 − v‖²_G + ((1 + θ)/(2ρ))‖α_0 − α‖² ].

Again we simply take the choice of (α, v) that achieves the maximum on the left hand side: α = −(A^+)^⊤∇φ(A^+(Bv̄_T + c)) and v = ∇g*(B^⊤ᾱ_T).

C Proof of Theorem 3.2
The basic proof structure is the same as that of Theorem 3.1. The fact that w_t minimizes the objective function in line 4 of Algorithm 2, together with the relationship between α_t and α_{t−1} in line 5, implies that

  ∇φ(w_{t−1}) + A^⊤α_t = H(w_{t−1} − w_t).   (18)

We thus obtain

  φ(w_{t−1}) − φ(w) + α_t^⊤A(w_t − w) + α^⊤A(w_{t−1} − w_t) + (Aw − Bv − c)^⊤H̃A(w_{t−1} − w_t)
  ≤ ∇φ(w_{t−1})^⊤(w_{t−1} − w) + α_t^⊤A(w_t − w) + α^⊤A(w_{t−1} − w_t) + (Aw − Bv − c)^⊤H̃A(w_{t−1} − w_t) − (γ/2)‖∇φ(w_{t−1}) − ∇φ(w)‖²
  = (H(w_{t−1} − w_t) − A^⊤α_t)^⊤(w_{t−1} − w) + α_t^⊤A(w_t − w) + α^⊤A(w_{t−1} − w_t) + (Aw − Bv − c)^⊤H̃A(w_{t−1} − w_t) − (γ/2)‖A^⊤(α_t − α) + H(w_t − w_{t−1})‖²
  = (w_t − w_{t−1})^⊤A^⊤(α_t − α) − (γ/2)‖A^⊤(α_t − α) + H(w_t − w_{t−1})‖²
    + (1/2)[ ‖Aw_{t−1} − Bv − c‖²_H̃ − ‖Aw_t − Bv − c‖²_H̃ + ‖w_t − w_{t−1}‖²_H ],   (19)

where the derivation uses arguments similar to those of (10): the first inequality uses the smoothness of φ, the first equality uses (18), and the second equality is algebra.

We also have, from the optimality of v_t for the objective function in line 3 of Algorithm 2 and the relationship between α_t and α_{t−1} in line 5, the inequality (12). Finally, we can also obtain (13).

By adding (19), (12), and (13), and using the shorthand Δw = w_t − w_{t−1} and Δα = α_t − α, we obtain

  r_t ≤ X_t + Y_t − (ρ/2)‖v_t − v_{t−1}‖²_G + Z_t + (1 + θ)^{−1}δ_{t−1} − δ_t,

where

  X_t = Δw^⊤(A^⊤Δα) − (γ/2)‖A^⊤Δα + HΔw‖² + (1/2)‖Δw‖²_H + (θ/(2ρ))‖Δα‖²,
  Y_t = −(λ/2)‖v_t − v‖² + (ρθ/2)‖v − v_t‖²_G,
  Z_t = (θ/(2(1 + θ)))‖w_{t−1} − A^{−1}(Bv + c)‖²_H + (ρθ/(2(1 + θ)))‖Aw_{t−1} − Bv − c‖² − (ρ/2)‖Aw_{t−1} − Bv_t − c‖²,

and, as in the proof of Theorem 3.1,

  (1/2)[ (1/(1 + θ))‖Aw_{t−1} − Bv − c‖²_H̃ − ‖Aw_t − Bv − c‖²_H̃ ]
    + (ρ/2)[ ‖v − v_{t−1}‖²_G − (1 + θ)‖v − v_t‖²_G ]
    + (ρ/2)[ (1/(1 + θ))‖Aw_{t−1} − Bv − c‖² − ‖Aw_t − Bv − c‖² ]
    + (1/(2ρ))[ ‖α − α_{t−1}‖² − (1 + θ)‖α − α_t‖² ]
  = (1 + θ)^{−1}δ_{t−1} − δ_t.

We can bound X_t as follows:

  X_t = −(HΔw)^⊤(γI − H^{−1})(A^⊤Δα) − (γ/2)(‖A^⊤Δα‖² + ‖HΔw‖²) + (1/2)‖Δw‖²_H + (θ/(2ρ))‖Δα‖²
  ≤ (γ − 1/σ_min(H))‖HΔw‖‖A^⊤Δα‖ − ((γ − 1/σ_min(H))/2)‖HΔw‖² − (γ/2)‖A^⊤Δα‖² + (θ/(2ρ))‖Δα‖²
  ≤ −(1/(2σ_min(H)))‖A^⊤Δα‖² + (θ/(2ρ))‖Δα‖² ≤ 0.

The first inequality uses the assumption γ − σ_min(H)^{−1} ≥ 0, together with ‖Δw‖²_H ≤ σ_min(H)^{−1}‖HΔw‖²; the last inequality uses the assumptions on θ. We can also use the same derivation as in Theorem 3.1 to show that Y_t + Z_t ≤ 0. Therefore

  r_t ≤ X_t + Y_t − (ρ/2)‖v_t − v_{t−1}‖²_G + Z_t + (1 + θ)^{−1}δ_{t−1} − δ_t ≤ (1 + θ)^{−1}δ_{t−1} − δ_t.

We can multiply the above by (1 + θ)^{t−T} and then sum over t = 1, . . . , T to obtain (7).

Similarly, we can prove a dual version of (19) below. The equation (18) and the definition of α̃_t in the theorem imply that

  w_{t−1} = ∇φ*(−A^⊤α̃_t).

We thus have

  φ*(−A^⊤α̃_t) − φ*(−A^⊤α) + (α_t − α)^⊤Aw_t + (−Bv − c)^⊤H̃A(w_{t−1} − w_t)
  ≤ ∇φ*(−A^⊤α̃_t)^⊤(−A^⊤α̃_t + A^⊤α) − (γ/2)‖A^⊤(α̃_t − α)‖² + (α_t − α)^⊤Aw_t + (−Bv − c)^⊤H̃A(w_{t−1} − w_t)
  = w_{t−1}^⊤(−(A^⊤α_t + H(w_t − w_{t−1})) + A^⊤α) + (α_t − α)^⊤Aw_t − (γ/2)‖A^⊤(α_t − α) + H(w_t − w_{t−1})‖² + (−Bv − c)^⊤H̃A(w_{t−1} − w_t)
  = (w_t − w_{t−1})^⊤A^⊤(α_t − α) − (γ/2)‖A^⊤(α_t − α) + H(w_t − w_{t−1})‖²
    + (1/2)[ ‖Aw_{t−1} − Bv − c‖²_H̃ − ‖Aw_t − Bv − c‖²_H̃ + ‖w_t − w_{t−1}‖²_H ].   (20)

In the above derivation, the first inequality uses the strong convexity of φ*, which follows from the smoothness of φ. The first equality uses the relationship between ∇φ*(−A^⊤α̃_t) and w_{t−1} and the relationship between α̃_t and α_t. The last equality is algebra. Note that the right hand sides of (19) and (20) are the same. Therefore, by adding (20), (12), and (13), we obtain

  r*_t ≤ X_t + Y_t − (ρ/2)‖v_t − v_{t−1}‖²_G + Z_t + (1 + θ)^{−1}δ_{t−1} − δ_t ≤ (1 + θ)^{−1}δ_{t−1} − δ_t.

We can multiply both sides by (1 + θ)^{t−T} and then sum over t = 1, . . . , T to obtain (8). This completes the proof.