Dual subgradient method for constrained convex optimization problems
Michael R. Metel∗ and Akiko Takeda†

∗RIKEN Center for Advanced Intelligence Project, Tokyo, Japan ([email protected])
†Department of Creative Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan; RIKEN Center for Advanced Intelligence Project, Tokyo, Japan ([email protected])
September 29, 2020
Abstract
This paper considers a general convex constrained problem setting where functions are not assumed to be differentiable nor Lipschitz continuous. Our motivation is in finding a simple first-order method for solving a wide range of convex optimization problems with minimal requirements. We study the method of weighted dual averages (Nesterov, 2009) in this setting and prove that it is an optimal method.
Keywords: convex optimization; subgradient method; non-smooth optimization; iteration complexity; constrained optimization
Introduction

In this work we are interested in constrained minimization problems,
\[
\begin{aligned}
\min_{x \in \mathbb{R}^d} \;\; & f(x) \\
\text{s.t.} \;\; & f_i(x) \le 0, \quad i = 1, \ldots, n, \\
& h_i(x) = 0, \quad i = 1, \ldots, p,
\end{aligned} \tag{1}
\]
where $f(x)$ and all $f_i(x)$ are convex functions and all $h_i(x)$ are affine functions from $\mathbb{R}^d$ to $\mathbb{R}$. Our goal is to develop an algorithm with a proven convergence rate to the constrained minimum of $f(x)$ without any further assumptions besides standard regularity conditions. In particular, we do not assume that $f(x)$ or the $f_i(x)$ are differentiable nor Lipschitz continuous, and we do not assume a priori that algorithm iterates are constrained to any bounded set. All that is required is that an optimal primal and dual solution exist and that strong duality holds.

Related first-order methods for constrained convex problems include the CoMirror algorithm of Beck et al. (2010) and the primal-dual methods of Lan and Zhou (2020) and Xu. After $K$ iterations, all of these algorithms have a proven rate of convergence of $O(1/\sqrt{K})$ towards an optimal solution, which is the best rate achievable using a first-order method, in the sense of matching the lower complexity bound for the unconstrained version of our problem setting (Nesterov, 2004, Section 3.2.1). All of these algorithms' convergence results rely on some combination of a compact feasible region, bounds on the subgradients, or bounds on the constraint functions, though.

If we consider the unconstrained problem with $n = p = 0$, recent works include that of Grimmer (2019), which proved that the convergence rate of the subgradient method holds under an assumption more relaxed than Lipschitz continuity, namely that $f(x) - f(x^*) \le D(\|x - x^*\|)$ holds, where $x^*$ is an optimal solution and $D(\cdot)$ is a non-negative non-decreasing function. The convergence rate of $O(1/\sqrt{K})$ for the general unconstrained convex optimization problem was, however, established earlier by Nesterov (2009) with the method of weighted dual averages.
The iterates of the algorithm can be shown to be bounded for a range of convex optimization problems, including unconstrained minimization without the assumption of a global Lipschitz parameter. The path taken in this paper is to apply the method of weighted dual averages, presented as Algorithm 1, to the general convex constrained problem (1) and establish the same rate of convergence as previous works under our more relaxed assumptions.

We define the Lagrangian function as
\[
L(x, \mu, \theta) := f(x) + \sum_{i=1}^n \mu_i f_i(x) + \sum_{i=1}^p \theta_i h_i(x),
\]
where $\mu \in \mathbb{R}^n_+$ and $\theta \in \mathbb{R}^p$, with the dual problem $\max_{\mu \in \mathbb{R}^n_+,\, \theta \in \mathbb{R}^p} \min_{x \in \mathbb{R}^d} L(x, \mu, \theta)$. The following assumptions are sufficient for (1) to be a convex optimization problem with an optimal primal and dual solution with strong duality.
Assumptions 1.
1. $f(x)$ and $f_i(x)$ for $i = 1, \ldots, n$ are convex functions, and $h_i(x)$ for $i = 1, \ldots, p$ are affine functions over $\mathbb{R}^d$.
2. Slater's condition holds: there exists an $\hat{x} \in \mathbb{R}^d$ such that $f_i(\hat{x}) < 0$ for $i = 1, \ldots, n$ and $h_i(\hat{x}) = 0$ for $i = 1, \ldots, p$.
3. There exists an optimal solution, denoted $x^*$.

These assumptions are sufficient since the optimal objective value $f(x^*)$ is finite by the continuity of $f(x)$, and given that Slater's condition holds, strong duality holds and there exists at least one dual optimal solution $(\mu^*, \theta^*)$ (Bertsekas, 2009, Prop. 5.3.5). Strong duality holds if and only if $(x^*, \mu^*, \theta^*)$ is a saddle point of $L(x, \mu, \theta)$ (Bertsekas, 2009, Prop. 3.4.1), i.e. for all $x \in \mathbb{R}^d$, $\mu \in \mathbb{R}^n_+$, and $\theta \in \mathbb{R}^p$,
\[
L(x^*, \mu, \theta) \le L(x^*, \mu^*, \theta^*) \le L(x, \mu^*, \theta^*). \tag{2}
\]
We will work with an unconstrained version of (1), written as
\[
\min_{x \in \mathbb{R}^d} \max_{\lambda \ge 0} F(x, \lambda) := f(x) + \lambda \tilde{f}(x),
\]
where
\[
\tilde{f}(x) := \max(f_1(x), f_2(x), \ldots, f_n(x), |h_1(x)|, |h_2(x)|, \ldots, |h_p(x)|).
\]
Let $\lambda^* = \sum_{i=1}^n \mu^*_i + \sum_{i=1}^p |\theta^*_i|$. By the fact that $\tilde{f}(x^*) = 0$ and complementary slackness, $\sum_{i=1}^n \mu^*_i f_i(x^*) = 0$,
\[
F(x^*, \lambda) = f(x^*) = L(x^*, \mu^*, \theta^*) \quad \forall \lambda \ge 0,
\]
and
\[
\begin{aligned}
L(x, \mu^*, \theta^*) &= f(x) + \sum_{i=1}^n \mu^*_i f_i(x) + \sum_{i=1}^p \theta^*_i h_i(x) \\
&\le f(x) + \sum_{i=1}^n \mu^*_i f_i(x) + \sum_{i=1}^p |\theta^*_i|\, |h_i(x)| \\
&\le f(x) + \lambda^* \tilde{f}(x) = F(x, \lambda^*),
\end{aligned}
\]
hence for all $(x, \lambda) \in \mathbb{R}^d \times \mathbb{R}_+$,
\[
F(x^*, \lambda) \le F(x, \lambda^*). \tag{3}
\]
We will use the following notation for the subgradients needed of $F(x, \lambda)$:
\[
\begin{aligned}
& g(x) \in \partial f(x), \qquad g_i(x) \in \partial f_i(x), \\
& \tilde{g}(x) \in \partial \tilde{f}(x) = \mathrm{Conv}\big(\{\partial f_i(x) : f_i(x) = \tilde{f}(x)\} \cup \{\partial |h_i(x)| : |h_i(x)| = \tilde{f}(x)\}\big), \\
& G_x(x, \lambda) \in \partial_x F(x, \lambda) = \partial f(x) + \lambda\, \partial \tilde{f}(x), \\
& G_\lambda(x, \lambda) = \nabla_\lambda F(x, \lambda) = \tilde{f}(x), \\
& G(x, \lambda) \in \partial F(x, \lambda),
\end{aligned}
\]
where for the subdifferential $\partial \tilde{f}(x)$, see for example (Nesterov, 2004, Lemma 3.1.10).
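To make the reformulation concrete, here is a small Python sketch of the aggregate constraint function $\tilde{f}(x) := \max(f_1(x), \ldots, f_n(x), |h_1(x)|, \ldots, |h_p(x)|)$ and one valid choice of subgradient $\tilde{g}(x)$ for a hypothetical instance; the functions $f$, $f_1$, and $h_1$ below are illustrative, not from the paper:

```python
import numpy as np

# Hypothetical instance of (1): f(x) = ||x||_1 (non-differentiable),
# one inequality constraint f_1(x) = x_1 + x_2 - 1 <= 0, and one
# equality constraint h_1(x) = x_1 - x_2 = 0.

def f(x):
    return np.abs(x).sum()

def g(x):
    # a subgradient of the l1 norm
    return np.sign(x)

def f_tilde(x):
    # f~(x) = max(f_1(x), |h_1(x)|)
    return max(x[0] + x[1] - 1.0, abs(x[0] - x[1]))

def g_tilde(x):
    # one element of the convex hull in the subdifferential formula:
    # a subgradient of whichever term attains the maximum
    if x[0] + x[1] - 1.0 >= abs(x[0] - x[1]):
        return np.array([1.0, 1.0])        # gradient of f_1
    s = 1.0 if x[0] >= x[1] else -1.0      # subgradient of |h_1|
    return np.array([s, -s])

x = np.array([2.0, 0.5])
print(f_tilde(x))  # 1.5 (both terms happen to be active here)
```

With these pieces, $G_x(x, \lambda) = g(x) + \lambda\, \tilde{g}(x)$ and $G_\lambda(x, \lambda) = \tilde{f}(x)$, which is all the algorithm below requires.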
Following the standard measure of convergence to a primal solution, with an optimal solution $x^*$, we define an algorithm's output $\bar{x}$ as an $(\epsilon_1, \epsilon_2)$-optimal solution if $f(\bar{x}) - f(x^*) \le \epsilon_1$ and $\tilde{f}(\bar{x}) \le \epsilon_2$. Convergence to an optimal solution is proven using the method of weighted dual averages of Nesterov (2009), presented as Algorithm 1. When convenient we will use the column vector $w := [x; \lambda] := [x^T, \lambda]^T$ and the notation $G_k := [G_x(w_k); -G_\lambda(w_k)] / \|G(w_k)\|$ (note that $\|G_k\| = 1$).

Algorithm 1: Method of weighted dual averages
Input: $w_0 = [x_0 \in \mathbb{R}^d;\ \lambda_0 \ge 0]$; $s_0 = \hat{s}_0 = \hat{x}_0 = 0$; $\beta_0 = 1$
for $k = 0, 1, \ldots, K-1$ do
  Compute $G(w_k) \in \partial F(w_k)$
  $s_{k+1} = s_k + [G_x(w_k); -G_\lambda(w_k)] / \|G(w_k)\|$
  $w_{k+1} = w_0 - s_{k+1} / \beta_k$
  $\beta_{k+1} = \beta_k + 1/\beta_k$
  $\hat{s}_{k+1} = \hat{s}_k + 1/\|G(w_k)\|$
  $\hat{x}_{k+1} = \hat{x}_k + x_k / \|G(w_k)\|$
end for
$\hat{s}_{K+1} = \hat{s}_K + 1/\|G(w_K)\|$; $\hat{x}_{K+1} = \hat{x}_K + x_K / \|G(w_K)\|$
return $\bar{x}_{K+1} = \hat{s}_{K+1}^{-1} \hat{x}_{K+1}$

In each iteration $w_{k+1} = w_0 - s_{k+1}/\beta_k$ is the maximizer of
\[
U_{s,\beta}(w) := -\langle s, w - w_0 \rangle - \frac{\beta}{2} \|w - w_0\|^2 \tag{4}
\]
for $s = s_{k+1}$ and $\beta = \beta_k$, with
\[
U_{s_{k+1},\beta_k}(w_{k+1}) = \frac{\|s_{k+1}\|^2}{2\beta_k}. \tag{5}
\]
In addition, $U_{s,\beta}(w)$ is strongly concave in $w$ with parameter $\beta$:
\[
U_{s,\beta}(w) \le U_{s,\beta}(w') + \langle \nabla U_{s,\beta}(w'), w - w' \rangle - \frac{\beta}{2}\|w - w'\|^2. \tag{6}
\]
Given that $G_\lambda(w_k) \ge 0$, it holds that $\lambda_{k+1} \ge \lambda_k$, with the $\lambda_k$ iterates always remaining feasible. The following property examines the case where Algorithm 1 terminates due to $\|G(w_k)\| = 0$.

Property 2. If $\|G(w_k)\| = 0$, then $x_k$ is an optimal solution to (1).

Proof. If $\|G(w_k)\| = 0$, this implies that $G_\lambda(w_k) = \tilde{f}(x_k) = 0$ and hence $x_k$ is a feasible solution. From $\|G_x(w_k)\| = 0$, $0 \in \partial_x F(x_k, \lambda_k)$, and as $F(x, \lambda_k)$ is convex in $x$, $x_k$ is a minimizer of $F(x, \lambda_k)$. It follows that for all $x \in \mathbb{R}^d$ feasible in (1),
\[
f(x_k) = f(x_k) + \lambda_k \tilde{f}(x_k) \le f(x) + \lambda_k \tilde{f}(x) \le f(x),
\]
since $\tilde{f}(x) \le 0$ for any feasible $x$.

A key property of Algorithm 1 is that, by redefining $G(w_k)$ appropriately, the iterates are bounded for quite general convex optimization problems. In particular, all that is required is that (11) in the proof below holds for the iterates to be bounded using (Nesterov, 2009, Theorem 3). For the sake of completeness we present the full proof for our application.
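As an illustration of how Algorithm 1 runs in practice, the following is a minimal Python sketch on an assumed toy instance, $\min x^2$ s.t. $1 - x \le 0$ (optimum $x^* = 1$, multiplier $\mu^* = 2$); the problem data and iteration count are illustrative, not from the paper:

```python
import numpy as np

# Toy instance of (1): f(x) = x^2, f_1(x) = 1 - x, so f~(x) = 1 - x,
# G_x(w) = 2x - lam, G_lam(w) = 1 - x. Optimum: x* = 1, mu* = 2.

def subgradients(x, lam):
    return 2.0 * x - lam, 1.0 - x

def weighted_dual_averages(K, x0=0.0, lam0=0.0):
    w0 = np.array([x0, lam0])
    w, s, beta = w0.copy(), np.zeros(2), 1.0
    s_hat = x_hat = 0.0
    for k in range(K + 1):
        gx, gl = subgradients(w[0], w[1])
        norm_G = np.hypot(gx, gl)
        if norm_G == 0.0:              # Property 2: current x is optimal
            return w[0]
        s_hat += 1.0 / norm_G          # the k = K pass performs only the
        x_hat += w[0] / norm_G         # final averaging step of Algorithm 1
        if k == K:
            break
        s += np.array([gx, -gl]) / norm_G
        w = w0 - s / beta              # maximizer of U_{s,beta} in (4)
        beta += 1.0 / beta
    return x_hat / s_hat               # the weighted average x-bar_{K+1}

x_bar = weighted_dual_averages(50000)
```

On this instance the returned weighted average approaches $x^* = 1$; note that no Lipschitz constant, bound on the feasible region, or step-size parameter is supplied.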
Property 3. For all iterates of Algorithm 1, it holds that $\|w_k - w^*\| \le \|w_0 - w^*\| + 1$, with the inequality being strict when $w_0 \ne w^*$.

Proof. From (5),
\[
\begin{aligned}
U_{s_{k+1},\beta_k}(w_{k+1}) &= \frac{\beta_{k-1}}{\beta_k} \cdot \frac{\|s_{k+1}\|^2}{2\beta_{k-1}} = \frac{\beta_{k-1}}{\beta_k} \cdot \frac{\|s_k + G_k\|^2}{2\beta_{k-1}} \\
&= \frac{\beta_{k-1}}{\beta_k} \left( \frac{\|s_k\|^2}{2\beta_{k-1}} + \frac{1}{\beta_{k-1}} \langle s_k, G_k \rangle + \frac{\|G_k\|^2}{2\beta_{k-1}} \right) \\
&= \frac{\beta_{k-1}}{\beta_k} \left( U_{s_k,\beta_{k-1}}(w_k) + \frac{1}{\beta_{k-1}} \langle s_k, G_k \rangle + \frac{1}{2\beta_{k-1}} \right) \\
&= \frac{\beta_{k-1}}{\beta_k} \left( U_{s_k,\beta_{k-1}}(w_k) + \langle w_0 - w_k, G_k \rangle + \frac{1}{2\beta_{k-1}} \right),
\end{aligned}
\]
using $w_k = w_0 - s_k/\beta_{k-1}$ in the last equality. Rearranging,
\[
\langle w_k - w_0, G_k \rangle = U_{s_k,\beta_{k-1}}(w_k) - \frac{\beta_k}{\beta_{k-1}} U_{s_{k+1},\beta_k}(w_{k+1}) + \frac{1}{2\beta_{k-1}} \le U_{s_k,\beta_{k-1}}(w_k) - U_{s_{k+1},\beta_k}(w_{k+1}) + \frac{1}{2\beta_{k-1}},
\]
since $\beta_k$ is increasing and $U_{s_{k+1},\beta_k}(w_{k+1}) \ge 0$. Telescoping these inequalities for $k = 1, \ldots, K$, and using the fact that $\|s_1\| = 1$,
\[
\sum_{k=1}^K \langle w_k - w_0, G_k \rangle \le U_{s_1,\beta_0}(w_1) - U_{s_{K+1},\beta_K}(w_{K+1}) + \sum_{k=1}^K \frac{1}{2\beta_{k-1}} = -U_{s_{K+1},\beta_K}(w_{K+1}) + \frac{1}{2}\left( \sum_{k=0}^{K-1} \frac{1}{\beta_k} + \frac{1}{\beta_0} \right). \tag{7}
\]
Expanding the recursion $\beta_K = \beta_{K-1} + 1/\beta_{K-1}$ gives $\beta_K = \beta_0 + \sum_{k=0}^{K-1} 1/\beta_k$, and since $1/\beta_0 = \beta_0$,
\[
\sum_{k=1}^K \langle w_k - w_0, G_k \rangle \le -U_{s_{K+1},\beta_K}(w_{K+1}) + \frac{\beta_K}{2}. \tag{8}
\]
Given the convexity of $F(x, \lambda)$ in $x$ and linearity in $\lambda$,
\[
F(x^*, \lambda_k) \ge F(x_k, \lambda_k) + \langle G_x(x_k, \lambda_k), x^* - x_k \rangle \tag{9}
\]
\[
F(x_k, \lambda^*) = F(x_k, \lambda_k) + G_\lambda(x_k, \lambda_k)(\lambda^* - \lambda_k). \tag{10}
\]
Subtracting (9) from (10) and using (3),
\[
0 \le \langle [G_x(w_k); -G_\lambda(w_k)], w_k - w^* \rangle. \tag{11}
\]
It follows that
\[
0 \le \sum_{k=0}^K \langle w_k - w^*, G_k \rangle = \sum_{k=0}^K \langle w_0 - w^*, G_k \rangle + \sum_{k=0}^K \langle w_k - w_0, G_k \rangle \le \langle w_0 - w^*, s_{K+1} \rangle - U_{s_{K+1},\beta_K}(w_{K+1}) + \frac{\beta_K}{2}, \tag{12}
\]
where the second inequality uses (8) together with the fact that the $k = 0$ term of the second sum vanishes, and $s_{K+1} = \sum_{k=0}^K G_k$ since $s_{k+1} = s_k + G_k$. Considering inequality (6) with $s = s_{K+1}$, $\beta = \beta_K$, $w = w^*$, and $w' = w_{K+1}$,
\[
U_{s_{K+1},\beta_K}(w^*) \le U_{s_{K+1},\beta_K}(w_{K+1}) + \langle \nabla U_{s_{K+1},\beta_K}(w_{K+1}), w^* - w_{K+1} \rangle - \frac{\beta_K}{2}\|w^* - w_{K+1}\|^2 = U_{s_{K+1},\beta_K}(w_{K+1}) - \frac{\beta_K}{2}\|w^* - w_{K+1}\|^2,
\]
given that $w_{K+1}$ is the maximizer of $U_{s_{K+1},\beta_K}(w)$, so its gradient vanishes there. Applying this inequality in (12),
\[
\begin{aligned}
0 &\le \langle w_0 - w^*, s_{K+1} \rangle - U_{s_{K+1},\beta_K}(w^*) - \frac{\beta_K}{2}\|w^* - w_{K+1}\|^2 + \frac{\beta_K}{2} \\
&= \langle w_0 - w^*, s_{K+1} \rangle + \langle s_{K+1}, w^* - w_0 \rangle + \frac{\beta_K}{2}\|w^* - w_0\|^2 - \frac{\beta_K}{2}\|w^* - w_{K+1}\|^2 + \frac{\beta_K}{2} \\
&= \frac{\beta_K}{2}\|w^* - w_0\|^2 - \frac{\beta_K}{2}\|w^* - w_{K+1}\|^2 + \frac{\beta_K}{2},
\end{aligned}
\]
where the first equality uses the definition (4) of $U_{s_{K+1},\beta_K}(w^*)$. Rearranging,
\[
\|w^* - w_{K+1}\|^2 \le \|w^* - w_0\|^2 + 1. \tag{13}
\]
As $K \ge 1$ was arbitrary, (13) covers all $k \ge 2$. Considering now $k = 1$,
\[
\|w_1 - w^*\|^2 = \|w_0 - w^* - G_0\|^2 = \|w_0 - w^*\|^2 - 2\langle w_0 - w^*, G_0 \rangle + 1 \le \|w_0 - w^*\|^2 + 1,
\]
where the last inequality uses (11). Now for all $k$,
\[
(\|w^* - w_0\| + 1)^2 = \|w^* - w_0\|^2 + 2\|w^* - w_0\| + 1 \ge \|w^* - w_k\|^2 + 2\|w^* - w_0\|,
\]
so that
\[
\|w^* - w_0\| + 1 \ge \|w^* - w_k\|, \tag{14}
\]
with (14) being strict when $w^* \ne w_0$.

In order to prove the convergence result of Algorithm 1, we require a bound on the norm of the subgradients $G(w_k)$.
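Property 3 is straightforward to check numerically. The snippet below runs the primal-dual updates of Algorithm 1 on an assumed toy instance ($\min x^2$ s.t. $1 - x \le 0$, with saddle point $w^* = [1; 2]$) and verifies the iterate bound; the instance is illustrative, not from the paper:

```python
import numpy as np

# Toy instance: f(x) = x^2, f~(x) = 1 - x, saddle point w* = [1, 2].
# Property 3: every iterate satisfies ||w_k - w*|| <= ||w_0 - w*|| + 1.
w_star = np.array([1.0, 2.0])
w0 = np.array([0.0, 0.0])
w, s, beta = w0.copy(), np.zeros(2), 1.0
max_dist = 0.0
for k in range(5000):
    max_dist = max(max_dist, float(np.linalg.norm(w - w_star)))
    gx, gl = 2.0 * w[0] - w[1], 1.0 - w[0]   # G_x(w_k), G_lam(w_k)
    norm_G = np.hypot(gx, gl)
    if norm_G == 0.0:
        break
    s += np.array([gx, -gl]) / norm_G
    w = w0 - s / beta
    beta += 1.0 / beta
print(max_dist <= np.linalg.norm(w0 - w_star) + 1.0)  # True
```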
Property 4. For all $k$ there exists a constant $L$ such that
\[
\|G(w_k)\| \le L(2\|w_0 - w^*\| + \lambda^* + 3).
\]

Proof. Recall that $g(x) \in \partial f(x)$ and $\tilde{g}(x) \in \partial \tilde{f}(x)$, so
\[
\|G(w_k)\| = \|[G_x(w_k); G_\lambda(w_k)]\| = \|[g(x_k) + \lambda_k \tilde{g}(x_k); \tilde{f}(x_k)]\| \le \|g(x_k)\| + \lambda_k \|\tilde{g}(x_k)\| + |\tilde{f}(x_k)|. \tag{15}
\]
By Property 3, the iterates of Algorithm 1 are bounded in a convex compact region, $w_k \in D := \{w : \|w - w^*\| \le \|w_0 - w^*\| + 1\}$. This implies that $x_k \in D_x := \{x : \|x - x^*\| \le \|w_0 - w^*\| + 1\}$ and $\lambda_k \in D_\lambda := \{\lambda : |\lambda - \lambda^*| \le \|w_0 - w^*\| + 1\}$. It follows that there exists an $L_1 \ge 0$ such that $f(x)$ is $L_1$-Lipschitz continuous on $D_x$ (Hiriart-Urruty and Lemaréchal, 1996, Theorem IV.3.1.2),
\[
|f(x) - f(x')| \le L_1 \|x - x'\|, \tag{16}
\]
for all $x, x' \in D_x$. Assuming that $w_0 \ne w^*$, $x_k \in \mathrm{Int}\, D_x$. For any $x \in \mathrm{Int}\, D_x$, taking $\theta > 0$ small enough that $x' = x + \theta\, g(x)/\|g(x)\| \in D_x$,
\[
\langle g(x), x' - x \rangle \le f(x') - f(x) \implies \langle g(x), x' - x \rangle \le L_1 \|x' - x\| \implies \left\langle g(x), \theta \frac{g(x)}{\|g(x)\|} \right\rangle \le L_1 \theta \implies \|g(x)\| \le L_1. \tag{17}
\]
If $w_0 = w^*$, then $x_k \in \mathrm{Int}\, D^\delta_x := \{x : \|x - x^*\| \le \delta + 1\}$ for any $\delta > 0$, and $L_1$ can be increased such that (16) holds over $D^\delta_x$, so that (17) holds for all $x \in D_x$. Similarly, there exists an $L_2 \ge 0$ such that $|\tilde{f}(x) - \tilde{f}(x')| \le L_2 \|x - x'\|$ and $\|\tilde{g}(x)\| \le L_2$ for all $x, x' \in D_x$. In addition,
\[
|\tilde{f}(x_k)| = |\tilde{f}(x_k) - \tilde{f}(x^*)| \le L_2 \|x_k - x^*\| \le L_2 (\|w_0 - w^*\| + 1),
\]
and $\lambda_k \le \|w_0 - w^*\| + 1 + \lambda^*$ from the definition of $D_\lambda$. Combining these bounds in (15) and taking $L = \max(L_1, L_2)$,
\[
\|G(w_k)\| \le \|g(x_k)\| + \lambda_k \|\tilde{g}(x_k)\| + |\tilde{f}(x_k)| \le L_1 + (\|w_0 - w^*\| + 1 + \lambda^*) L_2 + L_2 (\|w_0 - w^*\| + 1) \le L (2\|w_0 - w^*\| + \lambda^* + 3).
\]

For all $k$ the value of $\beta_k$ can be bounded as follows.
Property 5 (Nesterov, 2009, Lemma 3).
\[
\beta_k \le \frac{1}{1+\sqrt{3}} + \sqrt{2k+1}.
\]

We can now prove a convergence rate of $O(1/\sqrt{K})$ to an optimal solution of problem (1).
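Property 5 is easy to check numerically for the recursion $\beta_0 = 1$, $\beta_{k+1} = \beta_k + 1/\beta_k$ used in Algorithm 1; a quick Python sanity check:

```python
import math

# Verify beta_k <= 1/(1 + sqrt(3)) + sqrt(2k + 1) for the first 10^5 terms
# of the recursion beta_0 = 1, beta_{k+1} = beta_k + 1/beta_k.
beta, ok = 1.0, True
for k in range(100000):
    ok = ok and beta <= 1.0 / (1.0 + math.sqrt(3.0)) + math.sqrt(2.0 * k + 1.0)
    beta += 1.0 / beta
print(ok)  # True
```

Squaring the recursion shows why the $\sqrt{2k+1}$ growth is the right order: $\beta_{k+1}^2 = \beta_k^2 + 2 + 1/\beta_k^2$, so $\beta_k^2 \ge 2k + 1$.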
Theorem 6. Running Algorithm 1 for $K$ iterations,
\[
f(\bar{x}_{K+1}) - f(x^*) \le \frac{C(\|w_0 - w^*\|^2 + 1)}{2(K+1)} \left( \frac{1}{1+\sqrt{3}} + \sqrt{2K+1} \right)
\]
and
\[
\tilde{f}(\bar{x}_{K+1}) \le \frac{C(4(\|w_0 - w^*\| + 1)^2 + 1)}{2(K+1)} \left( \frac{1}{1+\sqrt{3}} + \sqrt{2K+1} \right),
\]
where $C := L(2\|w_0 - w^*\| + \lambda^* + 3)$.

Proof. Using equation (8), and recalling that $w_{K+1}$ maximizes $U_{s_{K+1},\beta_K}(w)$ defined in (4),
\[
\begin{aligned}
\frac{\beta_K}{2} &\ge \sum_{k=0}^K \langle w_k - w_0, G_k \rangle + U_{s_{K+1},\beta_K}(w_{K+1}) \\
&= \sum_{k=0}^K \langle w_k - w_0, G_k \rangle + \max_{w \in \mathbb{R}^{d+1}} \left\{ -\left\langle \sum_{k=0}^K G_k, w - w_0 \right\rangle - \frac{\beta_K}{2} \|w - w_0\|^2 \right\} \\
&= \max_{w \in \mathbb{R}^{d+1}} \left\{ \sum_{k=0}^K -\langle G_k, w - w_k \rangle - \frac{\beta_K}{2} \|w - w_0\|^2 \right\}. \tag{18}
\end{aligned}
\]
Like $\bar{x}_{K+1}$, let $\bar{w}_{K+1} := \hat{s}_{K+1}^{-1} \sum_{k=0}^K w_k / \|G(w_k)\|$ and $\bar{\lambda}_{K+1} := \hat{s}_{K+1}^{-1} \sum_{k=0}^K \lambda_k / \|G(w_k)\|$. Multiplying both sides of (18) by $\hat{s}_{K+1}^{-1}$,
\[
\begin{aligned}
\frac{\hat{s}_{K+1}^{-1} \beta_K}{2} &\ge \hat{s}_{K+1}^{-1} \max_{w \in \mathbb{R}^{d+1}} \left\{ \sum_{k=0}^K -\langle G_k, w - w_k \rangle - \frac{\beta_K}{2} \|w - w_0\|^2 \right\} \\
&= \hat{s}_{K+1}^{-1} \max_{w \in \mathbb{R}^{d+1}} \left\{ \sum_{k=0}^K \left( -\left\langle \frac{G_x(w_k)}{\|G(w_k)\|}, x - x_k \right\rangle + \frac{G_\lambda(w_k)}{\|G(w_k)\|} (\lambda - \lambda_k) \right) - \frac{\beta_K}{2} \|w - w_0\|^2 \right\} \\
&\ge \hat{s}_{K+1}^{-1} \max_{w \in \mathbb{R}^{d+1}} \left\{ \sum_{k=0}^K \left( \frac{F(x_k, \lambda_k) - F(x, \lambda_k)}{\|G(w_k)\|} + \frac{F(x_k, \lambda) - F(x_k, \lambda_k)}{\|G(w_k)\|} \right) - \frac{\beta_K}{2} \|w - w_0\|^2 \right\} \\
&= \max_{w \in \mathbb{R}^{d+1}} \left\{ \hat{s}_{K+1}^{-1} \sum_{k=0}^K \frac{F(x_k, \lambda) - F(x, \lambda_k)}{\|G(w_k)\|} - \frac{\hat{s}_{K+1}^{-1} \beta_K}{2} \|w - w_0\|^2 \right\} \\
&\ge \max_{w \in \mathbb{R}^{d+1}} \left\{ F(\bar{x}_{K+1}, \lambda) - F(x, \bar{\lambda}_{K+1}) - \frac{\hat{s}_{K+1}^{-1} \beta_K}{2} \|w - w_0\|^2 \right\} \\
&= \max_{w \in \mathbb{R}^{d+1}} \left\{ f(\bar{x}_{K+1}) + \lambda \tilde{f}(\bar{x}_{K+1}) - f(x) - \bar{\lambda}_{K+1} \tilde{f}(x) - \frac{\hat{s}_{K+1}^{-1} \beta_K}{2} \|w - w_0\|^2 \right\}, \tag{19}
\end{aligned}
\]
where the third inequality uses Jensen's inequality. Given the maximum function, the inequality holds for any choice of $w = [x; \lambda]$. We consider two cases, the first being $x = \bar{x}_{K+1}$ and $\lambda = \bar{\lambda}_{K+1} + 1$. From (19),
\[
\frac{\hat{s}_{K+1}^{-1} \beta_K}{2} \ge \tilde{f}(\bar{x}_{K+1}) - \frac{\hat{s}_{K+1}^{-1} \beta_K}{2} \left\| [\bar{x}_{K+1}; \bar{\lambda}_{K+1} + 1] - w_0 \right\|^2. \tag{20}
\]
Further,
\[
\begin{aligned}
\|[\bar{x}_{K+1}; \bar{\lambda}_{K+1} + 1] - w_0\| &\le \|\bar{w}_{K+1} - w_0\| + 1 = \|\bar{w}_{K+1} - w^* + w^* - w_0\| + 1 \\
&\le \|\bar{w}_{K+1} - w^*\| + \|w^* - w_0\| + 1 \\
&\le \hat{s}_{K+1}^{-1} \sum_{k=0}^K \frac{\|w_k - w^*\|}{\|G(w_k)\|} + \|w^* - w_0\| + 1 \\
&\le \hat{s}_{K+1}^{-1} \sum_{k=0}^K \frac{\|w_0 - w^*\| + 1}{\|G(w_k)\|} + \|w^* - w_0\| + 1 = 2(\|w_0 - w^*\| + 1), \tag{21}
\end{aligned}
\]
where the third inequality uses Jensen's inequality and the fourth inequality uses Property 3. Combining (20) and (21),
\[
\tilde{f}(\bar{x}_{K+1}) \le \frac{\hat{s}_{K+1}^{-1} \beta_K}{2} \left( 4(\|w_0 - w^*\| + 1)^2 + 1 \right).
\]
The second case will use $w = w^*$. Starting from (19),
\[
\frac{\hat{s}_{K+1}^{-1} \beta_K}{2} \ge f(\bar{x}_{K+1}) + \lambda^* \tilde{f}(\bar{x}_{K+1}) - f(x^*) - \bar{\lambda}_{K+1} \tilde{f}(x^*) - \frac{\hat{s}_{K+1}^{-1} \beta_K}{2} \|w^* - w_0\|^2 \ge f(\bar{x}_{K+1}) - f(x^*) - \frac{\hat{s}_{K+1}^{-1} \beta_K}{2} \|w^* - w_0\|^2,
\]
since $\tilde{f}(x^*) = 0$ and $\lambda^* \tilde{f}(\bar{x}_{K+1}) \ge 0$. Rearranging,
\[
f(\bar{x}_{K+1}) - f(x^*) \le \frac{\hat{s}_{K+1}^{-1} \beta_K}{2} \left( \|w_0 - w^*\|^2 + 1 \right).
\]
Using Properties 4 and 5, $\hat{s}_{K+1}^{-1} \beta_K$ can be bounded as follows:
\[
\frac{\hat{s}_{K+1}^{-1} \beta_K}{2} \le \frac{1}{2} \left( \sum_{k=0}^K \|G(w_k)\|^{-1} \right)^{-1} \left( \frac{1}{1+\sqrt{3}} + \sqrt{2K+1} \right) \le \frac{1}{2} \left( \sum_{k=0}^K C^{-1} \right)^{-1} \left( \frac{1}{1+\sqrt{3}} + \sqrt{2K+1} \right) = \frac{C}{2(K+1)} \left( \frac{1}{1+\sqrt{3}} + \sqrt{2K+1} \right).
\]

The convergence rate of $O(1/\sqrt{K})$ from Theorem 6 matches the lower complexity bound for minimizing the unconstrained version of (1), as discussed in the introduction. The following corollary establishes the $O(\min(\epsilon_1, \epsilon_2)^{-2})$ iteration complexity required to achieve an $(\epsilon_1, \epsilon_2)$-optimal solution.

Corollary 7. An $(\epsilon_1, \epsilon_2)$-optimal solution is obtained after running Algorithm 1 for
\[
K \ge \alpha \max\left( \frac{C_1^2}{\epsilon_1^2}, \frac{C_2^2}{\epsilon_2^2} \right)
\]
iterations, where $C_1 = C(\|w_0 - w^*\|^2 + 1)$, $C_2 = C(4(\|w_0 - w^*\| + 1)^2 + 1)$, and $\alpha = \frac{1}{2}\left( \frac{\sqrt{2}}{4(1+\sqrt{3})} + 1 \right)^2$.

Proof. From Theorem 6, for $i = 1, 2$ we need to compute a lower bound on $K$ which ensures that
\[
\frac{C_i}{2(K+1)} \left( \frac{1}{1+\sqrt{3}} + \sqrt{2K+1} \right) \le \epsilon_i.
\]
Since for $K \ge 1$,
\[
\frac{\sqrt{K}}{2(K+1)} \left( \frac{1}{1+\sqrt{3}} + \sqrt{2K+1} \right) = \frac{\sqrt{K}}{2(1+\sqrt{3})(K+1)} + \frac{\sqrt{K(2K+1)}}{2(K+1)} \le \frac{1}{4(1+\sqrt{3})} + \frac{1}{\sqrt{2}} = \frac{1}{\sqrt{2}} \left( \frac{\sqrt{2}}{4(1+\sqrt{3})} + 1 \right) = \sqrt{\alpha},
\]
it holds that
\[
\frac{C_i}{2(K+1)} \left( \frac{1}{1+\sqrt{3}} + \sqrt{2K+1} \right) \le \frac{\sqrt{\alpha}\, C_i}{\sqrt{K}}.
\]
To ensure convergence within $\epsilon_i$, it is sufficient that $\epsilon_i \ge \sqrt{\alpha}\, C_i / \sqrt{K}$, or that $K \ge \alpha (C_i / \epsilon_i)^2$. Taking the maximum over $i$ gives the result.

In this paper we have established the existence of a simple first-order method for the general convex constrained optimization problem without the need for differentiability nor Lipschitz continuity. We see this as a general-use algorithm for practitioners, since it requires minimal knowledge of the problem, with no parameter tuning for its implementation, while still achieving the optimal convergence rate for first-order methods.
Acknowledgments
The research of the first author is supported in part by JSPS KAKENHI Grant No. 19H04069. The research of the second author is supported in part by JSPS KAKENHI Grant Nos. 17H01699 and 19H04069.
References
Amir Beck, Aharon Ben-Tal, Nili Guttmann-Beck, and Luba Tetruashvili. The CoMirror algorithm for solving nonsmooth constrained convex problems. Operations Research Letters, 38(6):493-498, 2010.

Dimitri P. Bertsekas. Convex Optimization Theory. Athena Scientific, 2009.

Benjamin Grimmer. Convergence Rates for Deterministic and Stochastic Subgradient Methods without Lipschitz Continuity. SIAM Journal on Optimization, 29(2):1350-1365, 2019.

Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Convex Analysis and Minimization Algorithms I: Fundamentals. Springer-Verlag, 1996.

Guanghui Lan and Zhiqiang Zhou. Algorithms for stochastic optimization with functional or expectation constraints. Computational Optimization and Applications, 76:461-498, 2020.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science+Business Media, 2004.

Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Yangyang Xu. Primal-Dual Stochastic Gradient Method for Convex Programs with Many Functional Constraints.