Comparing different subgradient methods for solving convex optimization problems with functional constraints
Thi Lan Dinh ∗ Ngoc Hoang Anh Mai † January 5, 2021
Abstract
We provide a dual subgradient method and a primal-dual subgradient method for standard convex optimization problems with complexity O(ε^{-2}) and O(ε^{-2r}) for all r > 1, respectively. They are based on the recent work of Metel and Takeda [arXiv:2009.12769, 2020, pp. 1-12] and on Boyd's method [Lecture notes of EE364b, Stanford University, Spring 2013-14, pp. 1-39]. The efficiency of our methods is numerically illustrated in a comparison to the others.
Keywords: convex optimization, nonsmooth optimization, subgradient method
∗ Torus-Actions SAS; 3 avenue Didier Daurat, F-31400 Toulouse; France.
† CNRS; LAAS; 7 avenue du Colonel Roche, F-31400 Toulouse; France.

Introduction
The results presented in this paper are not very new and are mainly based on Metel-Takeda's and Boyd's ideas.

Given a convex function f : R^n → R and a closed convex domain C ⊂ R^n, consider the convex optimization problem (COP):

inf_{x ∈ C} f(x),   (1.1)

where f might be non-differentiable. It is well known that the optimal value and an optimal solution of problem (1.1) can be approximated as closely as desired by using subgradient methods. In some cases we can use interior point or Newton methods to solve problem (1.1). Although developed very early by Shor in the Soviet Union (see [8]), subgradient methods are still highly competitive with state-of-the-art methods such as interior point and Newton methods. They can be applied to problem (1.1) with a very large number of variables n because of their modest memory requirements.

We classify problem (1.1) into three groups as follows:

(a) Unconstrained case: C = R^n. It can be efficiently solved by various methods, in particular by Nesterov's method of weighted dual averages in [5]. His method has the optimal convergence rate O(ε^{-2}).

(b) Inexpensive projection on C. In this case the projected subgradient method introduced by Alber, Iusem and Solodov in [1] can be used for extremely large problems for which interior point or Newton methods cannot be used.

(c) COP with functional constraints (also known as standard form in the literature):

C = { x ∈ R^n : f_i(x) ≤ 0, i = 1, ..., m, Ax = b },   (1.2)

where f_i is a convex function on R^n, for i = 1, ..., m, A is a real matrix and b is a real vector. Metel and Takeda recently suggested the method of weighted dual averages [4] for solving this type of problem, based on Nesterov's idea. They also obtain the optimal convergence rate O(ε^{-2}) for their method. Earlier, Boyd proposed a primal-dual subgradient method in his lecture notes [2] without complexity analysis. We show that Boyd's method is suboptimal.

Contribution.
Our contribution with respect to the previous group (c) is twofold:

1. In Section 4, we provide an alternative dual subgradient method with several dual variables based on Metel-Takeda's method [4]. Our dual subgradient method has the same complexity O(ε^{-2}) as Metel-Takeda's. Moreover, it is much more efficient than Metel-Takeda's in practice, as shown in Section 6.

2. In Section 5, we provide a primal-dual subgradient method based on the nonsmooth penalty method and Boyd's method in [2, Section 8]. Our primal-dual subgradient method converges with complexity O(ε^{-2r}) for all r > 1, which is slightly worse than the optimal complexity O(ε^{-2}). In contrast, our dual subgradient method and primal-dual subgradient method are very efficient in practice even for COP with a large number of variables (n ≥ 1000).

Let f : R^n → R be a real-valued function on the Euclidean space R^n; f is called convex if

f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y),  ∀(x, y) ∈ R^n × R^n, ∀t ∈ [0, 1].   (2.3)

If f : R^n → R is a convex function, the subdifferential at a point x in R^n, denoted by ∂f(x), is defined as the set

∂f(x) := { g ∈ R^n : f(y) − f(x) ≥ g^⊤(y − x), ∀y ∈ R^n }.   (2.4)

A vector g in ∂f(x) is called a subgradient at x. The subdifferential is always a nonempty convex compact set. For instance, for f(x) = |x| on R, ∂f(0) = [−1, 1].

From now on, we focus on solving the constrained convex optimization problem with functional constraints:

p⋆ = inf_{x ∈ R^n} { f_0(x) : f_i(x) ≤ 0, i = 1, ..., m, Ax = b },   (2.5)

where f_i : R^n → R is a convex function, for i = 0, 1, ..., m, A ∈ R^{l×n} and b ∈ R^l. Let x⋆ be an optimal solution of problem (2.5), i.e., f_i(x⋆) ≤ 0 for i = 1, ..., m, Ax⋆ = b, and f_0(x⋆) = p⋆.
This section is devoted to recalling the simple subgradient method for problem (2.5). This method is introduced in [6, Section 3.2.4] by Nesterov for convex problems with only inequality constraints. We extend Nesterov's method to convex problems with both inequality and equality constraints in a trivial way.

It is fairly easy to see that problem (2.5) is equivalent to the problem:

p⋆ = inf_{x ∈ R^n} { f_0(x) : f(x) ≤ 0 },   (3.6)

where

f(x) = max{ f_1(x), ..., f_m(x), |a_1^⊤x − b_1|, ..., |a_l^⊤x − b_l| }.   (3.7)

Here a_j^⊤ is the j-th row of A, i.e., A = (a_1, ..., a_l)^⊤. If x is an optimal solution of problem (3.6), then x is also an optimal solution of problem (2.5).

Let g_0(x) ∈ ∂f_0(x), g(x) ∈ ∂f(x), and consider the following method to solve problem (3.6):

SG
Initialization: ε > 0, x^(0) ∈ R^n.
For k = 0, 1, ..., K do:
1. If f(x^(k)) ≤ ε, then x^(k+1) := x^(k) − (ε / ||g_0(x^(k))||²) g_0(x^(k));
2. Else, x^(k+1) := x^(k) − (f(x^(k)) / ||g(x^(k))||²) g(x^(k)).   (3.8)
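For concreteness, the following is a minimal Python sketch of method (3.8). The oracles f0, grad_f0, f and grad_f (the objective, the aggregated constraint (3.7), and one subgradient of each) are assumed to be supplied by the user; only numpy is used.

import numpy as np

def sg(x0, f0, grad_f0, f, grad_f, eps, K):
    """Minimal sketch of the simple subgradient method (3.8).

    Returns the best objective value found among the nearly feasible
    iterates (those with f(x) <= eps) and the corresponding point.
    """
    x = np.asarray(x0, dtype=float)
    best_val, best_x = np.inf, None
    for _ in range(K + 1):
        if f(x) <= eps:                          # nearly feasible: objective step
            if f0(x) < best_val:
                best_val, best_x = f0(x), x.copy()
            g = grad_f0(x)
            x = x - (eps / np.dot(g, g)) * g
        else:                                    # infeasible: constraint step
            g = grad_f(x)
            x = x - (f(x) / np.dot(g, g)) * g
    return best_val, best_x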
In order to guarantee convergence of method (3.8), let the following assumption hold:

Assumption 3.1. The norms of the subgradients of f_0, f_1, ..., f_m and the values of f_1, ..., f_m are bounded on any compact subset of R^n.

For every ε > 0 and K ∈ N, let I_ε(K) be a set of iterations and p_ε^(K) be a value defined as follows:

I_ε(K) := { k ∈ {0, 1, ..., K} : f(x^(k)) ≤ ε }  and  p_ε^(K) := min_{k ∈ I_ε(K)} f_0(x^(k)).   (3.9)

The optimal worst-case performance guarantee of method (3.8) is stated in the following theorem:

Theorem 3.2. If the number of iterations K in method (3.8) is big enough, K ≥ O(ε^{-2}), then I_ε(K) ≠ ∅ and

p_ε^(K) ≤ p⋆ + ε.   (3.10)

The proof of Theorem 3.2 is similar to the proof of [6, Theorem 3.2.3].

In this section, we provide an alternative dual subgradient method for problem (2.5). Originally developed by Metel and Takeda in [4], the dual subgradient method involves a single dual variable when they consider the Lagrangian function of problem (3.6) with the single constraint f(x) ≤ 0.
Our dual subgradient method is a generalization of Metel-Takeda's result and is based on the Lagrangian function of problem (2.5) with several dual variables.

Consider the following Lagrangian function of problem (2.5):

L(x, λ, ν) = f_0(x) + λ^⊤ F(x) + ν^⊤(Ax − b),   (4.11)

for x ∈ R^n, λ ∈ R^m_+ and ν ∈ R^l. Here F(x) := (F_1(x), ..., F_m(x)) with

F_i(x) = max{ f_i(x), 0 },  i = 1, ..., m.   (4.12)

The dual of problem (2.5) reads as:

d⋆ = sup_{λ ∈ R^m_+, ν ∈ R^l} inf_{x ∈ R^n} L(x, λ, ν).   (4.13)

Let the following assumption hold:

Assumption 4.1.
Strong duality holds for the primal-dual pair (2.5)-(4.13), i.e., p⋆ = d⋆, and (4.13) has an optimal solution (λ⋆, ν⋆).

It implies that (x⋆, λ⋆, ν⋆) is a saddle point of the Lagrangian function L, i.e.,

L(x⋆, λ, ν) ≤ L(x⋆, λ⋆, ν⋆) = p⋆ ≤ L(x, λ⋆, ν⋆),   (4.14)

for all x ∈ R^n, λ ∈ R^m_+ and ν ∈ R^l.

Given C as a subset of R^n, denote by conv(C) the convex hull generated by C, i.e., conv(C) = { Σ_{i=1}^r t_i a_i : a_i ∈ C, t_i ∈ [0, 1], Σ_{i=1}^r t_i = 1 }.

With z = (x, λ, ν), we use the following notation:
• g_0(x) ∈ ∂f_0(x);
• g_i(x) ∈ ∂F_i(x), where ∂F_i(x) = ∂f_i(x) if f_i(x) > 0, conv(∂f_i(x) ∪ {0}) if f_i(x) = 0, and {0} otherwise;
• G_x(z) ∈ ∂_x L(z) = ∂f_0(x) + Σ_{i=1}^m λ_i ∂F_i(x) + A^⊤ν;
• G_λ(z) = ∇_λ L(z) = F(x) and G_ν(z) = ∇_ν L(z) = Ax − b;
• G(z) = (G_x(z), −G_λ(z), −G_ν(z)).

Setting z^(k) := (x^(k), λ^(k), ν^(k)), we consider the following method:
DSG
Initialization: z^(0) = (x^(0), λ^(0), ν^(0)) ∈ R^n × R^m_+ × R^l; s^(0) = 0; x̂^(0) = 0; δ_0 = 0; β_0 = 1.
For k = 0, 1, ..., K do:
1. s^(k+1) = s^(k) + G(z^(k)) / ||G(z^(k))||;
2. z^(k+1) = z^(0) − s^(k+1) / β_k;
3. β_{k+1} = β_k + 1/β_k;
4. δ_{k+1} = δ_k + 1/||G(z^(k))||;
5. x̂^(k+1) = x̂^(k) + x^(k) / ||G(z^(k))||;
6. x̄^(k+1) = δ_{k+1}^{-1} x̂^(k+1).   (4.15)
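A minimal numpy sketch of one run of method (4.15) follows. The function oracle(z), which returns G(z) = (G_x(z), −G_λ(z), −G_ν(z)) stacked into a single vector, is an assumption supplied by the user; the weighted averaging in steps 4-6 follows the reconstruction of (4.15) given above.

import numpy as np

def dsg(z0, oracle, n, K):
    """Minimal sketch of the dual subgradient method (4.15).

    z0: initial point (x^(0), lambda^(0), nu^(0)) stacked as one vector.
    oracle(z): one element of G(z) as one vector of the same size as z.
    n: dimension of the primal block x inside z.
    Returns the weighted average of the primal iterates (step 6 of (4.15)).
    """
    z0 = np.asarray(z0, dtype=float)
    z = z0.copy()
    s = np.zeros_like(z)
    x_hat = np.zeros(n)
    delta, beta = 0.0, 1.0
    for _ in range(K + 1):
        G = oracle(z)
        norm_G = np.linalg.norm(G)
        x_k = z[:n].copy()                  # current primal block x^(k)
        s = s + G / norm_G                  # step 1
        z = z0 - s / beta                   # step 2 (dual averaging step)
        beta = beta + 1.0 / beta            # step 3
        delta = delta + 1.0 / norm_G        # step 4
        x_hat = x_hat + x_k / norm_G        # step 5 (weighted primal sum)
    return x_hat / delta                    # step 6: averaged primal point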
Since G_λ(z^(k)) = F(x^(k)) ≥ 0, it holds that λ_i^(k+1) ≥ λ_i^(k) ≥ λ_i^(0) ≥ 0, for i = 1, ..., m. Under Assumption 3.1, we obtain the convergence rate of order O(K^{-1/2}) in the following theorem:

Theorem 4.2.
Let ε > 0. If the number of iterations K in method (4.15) is big enough, K ≥ O(ε^{-2}), then

||F(x̄^(K+1))|| + ||Ax̄^(K+1) − b|| ≤ ε  and  f_0(x̄^(K+1)) ≤ p⋆ + ε.   (4.16)

The proof of Theorem 4.2, based on Lemma 8.3, proceeds exactly the same as the proofs in [4].

In this section, we extend the primal-dual subgradient method introduced in Boyd's lecture notes [2, Section 8]. The idea of our method is to replace the augmentation term of the augmented Lagrangian considered in [2, Section 8] by a more general penalty term. We also prove the convergence guarantee and provide the convergence rate order for this method.

Let s ∈ [1, 2] and ρ > 0. With F(x) defined as in (4.12), consider an equivalent problem of problem (2.5):
p⋆ = inf_{x ∈ R^n} f_0(x) + ρ( ||F(x)||^s + ||Ax − b||^s )
     s.t. F_i(x) ≤ 0, i = 1, ..., m, Ax = b.   (5.17)

Since x⋆ is an optimal solution of problem (2.5), x⋆ is also an optimal solution of problem (5.17).

Remark 5.1.
Instead of using the augmentation with the square of the l_2-norm ||·||² according to the definition of the augmented problem in [2, Section 8], we use an additional penalty term with the l_2-norm to the s-th power ||·||^s in problem (5.17). This strategy has been discussed in [7, pp. 513] for the case s = 1.

The Lagrangian of problem (5.17) has the form:

L_ρ(x, λ, ν) = f_0(x) + λ^⊤ F(x) + ν^⊤(Ax − b) + ρ( ||F(x)||^s + ||Ax − b||^s ),   (5.18)

for x ∈ R^n, λ ∈ R^m_+ and ν ∈ R^l.

Let us define a set-valued mapping T_ρ : R^n × R^m_+ × R^l → R^n × R^m_+ × R^l by

T_ρ(x, λ, ν) = ∂_x L_ρ(x, λ, ν) × (−∂_λ L_ρ(x, λ, ν)) × (−∂_ν L_ρ(x, λ, ν)) = ∂_x L_ρ(x, λ, ν) × {−F(x)} × {b − Ax},   (5.19)

where ∂_x L_ρ(x, λ, ν) = ∂f_0(x) + Σ_{i=1}^m λ_i ∂F_i(x) + A^⊤ν + ρ ∂(||F(·)||^s)(x) + ρ ∂(||A· − b||^s)(x). The explicit formulas for the subgradients in the subdifferentials ∂(||F(·)||^s)(x) and ∂(||A· − b||^s)(x) are provided in Appendix 8.2.

We do the simple iteration:

z^(k+1) = z^(k) − α_k T^(k),   (5.20)

where z^(k) = (x^(k), λ^(k), ν^(k)) is the k-th iterate of the primal and dual variables, T^(k) is any element of T_ρ(z^(k)), and α_k > 0 is the k-th step size.

By expanding (5.20) out, we can also write the method as:
PDS
Initialization: x^(0) ∈ R^n, λ^(0) ∈ R^m_+, ν^(0) ∈ R^l and (α_k)_{k ∈ N} ⊂ R_+.
For k = 0, 1, ..., K do:
1. ϱ^(k) = s ||F^(k)||^{s−2} F^(k) if F^(k) ≠ 0, and ϱ^(k) = 0 otherwise;
2. ς^(k) = s ||Ax^(k) − b||^{s−2} (Ax^(k) − b) if Ax^(k) ≠ b, and ς^(k) = 0 otherwise;
3. x^(k+1) = x^(k) − α_k [ g_0^(k) + Σ_{i=1}^m (λ_i^(k) + ρ ϱ_i^(k)) g_i^(k) + A^⊤(ν^(k) + ρ ς^(k)) ];
4. λ^(k+1) = λ^(k) + α_k F^(k) and ν^(k+1) = ν^(k) + α_k (Ax^(k) − b).   (5.21)

Here we note:
• g_0^(k) ∈ ∂f_0(x^(k)), g_i^(k) ∈ ∂F_i(x^(k)) and F_i^(k) := F_i(x^(k)), i = 1, ..., m;
• T^(k) = ( g_0^(k) + Σ_{i=1}^m (λ_i^(k) + ρ ϱ_i^(k)) g_i^(k) + A^⊤(ν^(k) + ρ ς^(k)), −F^(k), b − Ax^(k) ).

Remark that λ^(k) ≥ 0 since F_i^(k) ≥ 0, i = 1, ..., m. The case s = 2 is the standard primal-dual subgradient method in [2, Section 8].
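The following is a minimal numpy sketch of iteration (5.21). The oracles grad_f0, F and grad_F (one subgradient of f_0, the vector of F_i values, and a matrix whose rows are chosen subgradients g_i) are assumptions supplied by the user; the step size follows the rule (5.24) as reconstructed in Theorem 5.2 below.

import numpy as np

def pds(x0, lam0, nu0, grad_f0, F, grad_F, A, b, s=2.0, rho=0.5, delta=0.5, K=1000):
    """Minimal sketch of the primal-dual subgradient method (5.21)."""
    x = np.asarray(x0, dtype=float)
    lam = np.asarray(lam0, dtype=float)
    nu = np.asarray(nu0, dtype=float)
    for k in range(K + 1):
        Fk, r = F(x), A @ x - b
        # penalty subgradients (steps 1-2); zero when the residual vanishes
        rho_k = s * np.linalg.norm(Fk) ** (s - 2) * Fk if np.any(Fk) else np.zeros_like(Fk)
        sig_k = s * np.linalg.norm(r) ** (s - 2) * r if np.any(r) else np.zeros_like(r)
        # one element of T_rho(z^(k))
        Tx = grad_f0(x) + grad_F(x).T @ (lam + rho * rho_k) + A.T @ (nu + rho * sig_k)
        T = np.concatenate([Tx, -Fk, -r])
        gamma = (k + 1) ** (delta / 2.0 - 1.0)
        alpha = gamma / np.dot(T, T)             # step size rule (5.24)
        x = x - alpha * Tx                        # step 3
        lam = lam + alpha * Fk                    # step 4
        nu = nu + alpha * r
    return x, lam, nu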
Let Assumptions 3.1 and 4.1 hold. For every ε > 0 and K ∈ N, let

I_ε(K) := { k ∈ {0, 1, ..., K} : ||F(x^(k))|| + ||Ax^(k) − b|| ≤ ε }   (5.22)

and

p_ε^(K) := min_{k ∈ I_ε(K)} f_0(x^(k)).   (5.23)

We state that method (5.21) converges in the following theorem:

Theorem 5.2. Let δ ∈ (0, 1). Let the step size rule be

α_k = γ_k / ||T^(k)||²  with  γ_k = (k + 1)^{δ/2 − 1}.   (5.24)

Let ε > 0. If the number of iterations K in method (5.21) is big enough, K ≥ O(ε^{-2s/δ}), then I_ε(K) ≠ ∅ and

p_ε^(K) ≤ p⋆ + ε.   (5.25)

Table 1: The notation.
n — the number of variables of the COP
m — the number of inequality constraints of the COP
l — the number of equality constraints of the COP
SG — the COP solved by the subgradient method (3.8)
SingleDSG — the COP solved by the dual subgradient method with a single dual variable [4, Algorithm 1]
MultiDSG — the COP solved by the dual subgradient method with multiple dual variables (4.15)
PDS — the COP solved by the primal-dual subgradient method (5.21)
s — the power of the l_2-norm in the additional penalty term for PDS
ρ — the penalty coefficient for PDS
val — the approximate optimal value of the COP
val⋆ — the exact optimal value of the COP
gap — the relative optimality gap w.r.t. the exact value val⋆, i.e., gap = |val − val⋆| / (1 + max{|val⋆|, |val|})
infeas — the infeasibility of the approximate optimal solution
time — the running time in seconds
ε — the desired accuracy of the approximate solution
K — the number of iterations
Table 2: The value and the infeasibility at the k-th iteration.

Method     | Complexity           | val         | infeas
SG         | O(ε^{-2})            | f_0(x^(k))  | max{ f(x^(k)), 0 }
SingleDSG  | O(ε^{-2})            | f_0(x^(k))  | max{ f(x^(k)), 0 }
MultiDSG   | O(ε^{-2})            | f_0(x^(k))  | ||F(x^(k))|| + ||Ax^(k) − b||
PDS        | O(ε^{-2r}), ∀ r > 1  | f_0(x^(k))  | ||F(x^(k))|| + ||Ax^(k) − b||

The proof of Theorem 5.2 is based on Lemma 8.4. This proof is similar in spirit to the convergence proof of the standard primal-dual subgradient method in [2, Section 8]. A mathematical mistake in the proof in [2, Section 8] is corrected.
In this section we report results of numerical experiments obtained by solving convex optimization problems (COP) with functional constraints. The experiments are performed in Python 3.9.1. The implementation of methods (3.8), (4.15) and (5.21) is available online via the link: https://github.com/dinhthilan/COP. We use a desktop computer with an Intel(R) Pentium(R) CPU N4200 @ 1.10GHz and 4.00 GB of RAM. The notation for the numerical results is given in Table 1. The value and the infeasibility of SG, SingleDSG, MultiDSG and PDS at the k-th iteration are computed as in Table 2.

Randomly generated test problems.
• Setting: l = ⌈n/ ⌉, ε = 10^{− }, ρ = 1/s.

Table 3: The size of the test problems and the settings.
Id  Case  δ  K  |  n  m  l
10 1 22 1 0.5 10
100 1 153 1 0.99 10
10 21 25 2 0.5 10
100 201 156 2 0.99 10

We construct randomly generated test problems in the form:

min_{x ∈ R^n} { c^⊤ x : x ∈ Ω, Ax = b },   (6.26)

where c ∈ R^n, A ∈ R^{l×n}, b ∈ R^l and Ω is a convex domain such that:
• Every entry of c and A is taken in [−1, 1] with uniform distribution.
• The domain Ω is chosen in the following two cases:
  – Case 1: Ω := { x ∈ R^n : ||x|| ≤ 1 }.
  – Case 2: Ω := { x ∈ [−1, 1]^n : max{ −log(x + 1), x } ≤ 0 }.
• With a random point x⁰ in Ω, we take b := Ax⁰.
A minimal sketch of this construction for Case 1 is given below.
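The sketch below generates a Case 1 instance of problem (6.26); the dimensions n and l and the random seed are arbitrary choices here, not values fixed by the paper.

import numpy as np

def random_cop_case1(n, l, seed=0):
    """Generate a Case 1 instance of (6.26): min c^T x s.t. ||x|| <= 1, Ax = b."""
    rng = np.random.default_rng(seed)
    c = rng.uniform(-1.0, 1.0, size=n)            # objective vector
    A = rng.uniform(-1.0, 1.0, size=(l, n))       # equality constraint matrix
    x0 = rng.standard_normal(n)
    x0 = x0 / max(1.0, np.linalg.norm(x0))        # a random feasible point in the unit ball
    b = A @ x0                                    # so that Ax = b is feasible
    return c, A, b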
Let us apply SG, SingleDSG, MultiDSG and PDS to solve problem (6.26). The size of the test problems and the settings of our software are given in Table 3. The numerical results are displayed in Table 4. The convergence of SG, SingleDSG, MultiDSG and PDS with s ∈ {1, 1.5, 2} is illustrated in Figures 1, 2, 3 for Case 1 and Figures 4, 5, 6 for Case 2.

These figures show that:
• In Case 1, the values returned by SingleDSG, MultiDSG and PDS with s ∈ {1, 1.5, 2} have the same limit as the number of iterations increases, except for Id 1. In Id 1, the value returned by SingleDSG has a limit which differs from the common limit of the values returned by the others. We point out that the value of PDS with s = 2 has the fastest convergence rate. In this case the infeasibilities of all methods converge to zero and the infeasibility of SG has the fastest convergence rate.
• In Case 2, the values returned by MultiDSG and PDS with s ∈ {1, 1.5, 2} have the same limit, while the values provided by SG and SingleDSG converge to different limits, as the number of iterations increases. Moreover, the value of MultiDSG has the fastest convergence rate. We observe similar behavior when considering the infeasibilities of these methods.

For the timing comparison from Table 4, SingleDSG is the slowest in Case 1 while MultiDSG is the fastest one in this case. Nevertheless, MultiDSG takes the most time in Case 2.

Figure 1: Illustration for Case 1 with n = 10 (Id 1).
Figure 2: Illustration for Case 1 with n = 100 (Id 2).
Figure 3: Illustration for Case 1 with n = 1000 (Id 3).
Figure 4: Illustration for Case 2 with n = 10 (Id 4).
Figure 5: Illustration for Case 2 with n = 100 (Id 5).
Figure 6: Illustration for Case 2 with n = 1000 (Id 6).

Table 4: Numerical results for randomly generated test problems.
Id |  SG: val, infeas, time   |  SingleDSG: val, infeas, time  |  MultiDSG: val, infeas, time
1  | -0.6363, 0.2584, 1       | -0.9609, 0.0806, 1             | -0.8385, 0.0140, 1
2  |  0.2711, 0.2389, 19      | -0.9672, 0.1832, 35            | -0.9003, 0.0177, 19
3  | -0.0032, 0.0849, 2356    | -0.9938, 0.0748, 4704          | -0.8649, 0.0153, 1380
4  | -2.2255, 2.3955, 1       | -5.7702, 2.0534, 2             | -4.1255, 0.0056, 5
5  |  7.6548, 7.7407, 432     | -141.43, 7.0433, 84            | -37.712, 0.0097, 575
6  | -1.4176, 38.806, 1079    | -316.62, 35.684, 2382          | -401.07, 0.0389, 8059
Id |  PDS (s=1): val, infeas, time  |  PDS (s=1.5): val, infeas, time  |  PDS (s=2): val, infeas, time
1  | -0.8213, 0.0007, 1       | -0.8226, 0.0034, 1             | -0.8209, 0.0003, 1
2  | -0.8830, 0.0008, 20      | -0.9004, 0.0957, 20            | -0.8913, 0.0262, 20
3  | -0.8550, 0.0241, 2434    | -0.8428, 0.0043, 2673          | -0.8430, 0.0046, 2424
4  | -4.0708, 0.0025, 2       | -4.1017, 0.0029, 2             | -4.1037, 0.0041, 2
5  | -37.022, 0.0009, 404     | -37.453, 0.0214, 416           | -37.506, 0.0451, 413
6  | -398.71, 0.0435, 4802    | -399.63, 0.0475, 4889          | -399.76, 0.0323, 4602
Table 5: Linearly inequality constrained minimax test problems.
• Setting: l = 0, ε = 10^{− }, δ = 0. , ρ = 1/s, K = 10^ .
Id  Problem  val⋆  |  Size: n  m
We solve several test problems from [3, Table 4.1], including MAD8, Wong2 and Wong3, by using SG, SingleDSG, MultiDSG and PDS. The size of the test problems and the settings of our software are given in Table 5. The numerical results are displayed in Table 6. The convergence of SG, SingleDSG, MultiDSG and PDS with s ∈ {1, 1.5, 2} is plotted in Figures 7, 8 and 9 for the three test problems, respectively.

As one can see from Table 6, the value returned by PDS with s = 1 has the smallest gap w.r.t. the exact value while requiring less total time compared to SG, SingleDSG and MultiDSG. Moreover, the infeasibility of PDS with s = 1 converges to zero more efficiently than those of SG, SingleDSG and MultiDSG as the number of iterations increases.

Figures 7, 8 and 9 show that the values returned by PDS with s ∈ {1, 1.5, 2} converge with the same rate in this case, but the infeasibility of PDS with s = 1 converges to zero the fastest.
Table 6: Numerical results for linearly inequality constrained minimax problems.
Id |  SG: val, infeas     |  SingleDSG: val, infeas  |  MultiDSG: val, infeas
7  | 0.5065, 0.0006       | 0.4629, 0.0325           | 0.5037, 0.0038
8  | 653.00, 0.0000       | 22.352, 0.6159           | 23.964, 0.1017
9  | 198.03, 0.0000       | -39.638, 0.9788          | -38.697, 0.2851
Id |  PDS (s=1): val, infeas  |  PDS (s=1.5): val, infeas  |  PDS (s=2): val, infeas
7  | 0.5073, 0.0000       | 0.5070, 0.0000           | 0.5071, 0.0000
8  | 24.305, 0.0013       | 24.127, 0.0975           | 24.003, 0.1360
9  | -37.969, 0.0009      | -38.436, 0.6214          | -38.628, 0.8841
Id |  SG: gap, time       |  SingleDSG: gap, time    |  MultiDSG: gap, time
7  | 0.03%, 204           | 2.92%, 203               | 0.21%, 221
8  | 96.1%, 48            | 7.72%, 54                | 1.35%, 57
9  | 118%, 76             | 4.10%, 86                | 1.82%, 91
Id |  PDS (s=1): gap, time  |  PDS (s=1.5): gap, time  |  PDS (s=2): gap, time
7  | 0.01%, 113           | 0.01%, 116               | 0.01%, 113
8  | 0.00%, 29            | 0.71%, 21                | 1.20%, 27
9  | 0.01%, 45            | 1.18%, 48                | 1.65%, 43
Table 7: Numerical results for least absolute deviations.
• Setting: n = 3n̄, m = 0, l = m̄ = 2n̄, K = 10^ , ε = 10^{− }, ρ = 1/s, δ = 0. , where n̄ and m̄ denote the dimensions of D ∈ R^{m̄×n̄} in problem (6.27).

Id  |  n̄    |  n    |  l
10  |  10   |  30   |  20
11  |  100  |  300  |  200
12  |  1000 |  3000 |  2000
Consider the following problem:

min_{x ∈ R^{n̄}} ||Dx − w||_1,   (6.27)

where D ∈ R^{m̄×n̄} and w ∈ R^{m̄} have entries taken in [−1, 1] with uniform distribution. By adding slack variables y = Dx − w, (6.27) is equivalent to the COP:

min_{x,y} ||y||_1  s.t.  y = Dx − w.   (6.28)

A minimal sketch of this reformulation is given below.
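The sketch below builds the data of (6.28) under the setting of Table 7 (m̄ = 2n̄); the function names and the seed are arbitrary illustrative choices.

import numpy as np

def lad_as_cop(n_bar, seed=0):
    """Build the LAD instance (6.28): variables (x, y), objective ||y||_1,
    equality constraints encoding y = Dx - w."""
    rng = np.random.default_rng(seed)
    m_bar = 2 * n_bar
    D = rng.uniform(-1.0, 1.0, size=(m_bar, n_bar))
    w = rng.uniform(-1.0, 1.0, size=m_bar)
    # COP variables z = (x, y) with n = 3*n_bar; [-D  I](x, y)^T = -w  <=>  y = Dx - w
    A = np.hstack([-D, np.eye(m_bar)])
    b = -w
    f0 = lambda z: np.linalg.norm(z[n_bar:], 1)                            # objective ||y||_1
    g0 = lambda z: np.concatenate([np.zeros(n_bar), np.sign(z[n_bar:])])   # one subgradient
    return f0, g0, A, b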
Let us solve (6.28) by using SG, SingleDSG, MultiDSG and PDS. The size of the test problems and the settings of our software are given in Table 7. The numerical results are displayed in Table 8. The convergence of SG, SingleDSG, MultiDSG and PDS with s ∈ {1, 1.5, 2} is illustrated in Figures 10, 11 and 12.

The value returned by PDS with s = 2 has the fastest convergence rate while requiring less total time compared to SG, SingleDSG and MultiDSG in all cases. Moreover, PDS with s = 2 returns the smallest infeasibility at the final iteration.

Figure 10: Illustration for LAD with n = 10 (Id 10).
Figure 11: Illustration for LAD with n = 100 (Id 11).
Figure 12: Illustration for LAD with n = 1000 (Id 12).

Table 8: Numerical results for least absolute deviations.
Id |  SG: val, infeas, time  |  SingleDSG: val, infeas, time  |  MultiDSG: val, infeas, time
10 | 0.0004, 0.0010, 23      | 0.0030, 0.0017, 39             | 0.0017, 0.0013, 21
11 | 51.917, 0.0014, 252     | 0.1084, 0.0641, 507            | 0.0945, 0.0013, 108
12 | 810.46, 0.5941, 3682    | 87.849, 8.4173, 7578           | 11.535, 0.0013, 1858
Id |  PDS (s=1): val, infeas, time  |  PDS (s=1.5): val, infeas, time  |  PDS (s=2): val, infeas, time
10 | 0.0065, 0.0033, 14      | 0.0062, 0.0017, 15             | 0.0074, 0.0019, 14
11 | 0.0159, 0.0151, 132     | 0.0222, 0.0021, 116            | 0.0237, 0.0020, 120
12 | 0.0434, 43.281, 1787    | 0.0749, 0.0044, 1520           | 0.0739, 0.0020, 1443
Table 9: Numerical results for support vector machine.
• Setting: N = 200 × n̄, m = 0, ε = 10^{− }, δ = 0. , ρ = 1/s, K = 10^ , where n̄ is the dimension of the vectors z_i in problem (6.29).

Id  |  n̄  |  n     |  l
13  |  2   |  403   |  400
14  |  3   |  604   |  600
15  |  5   |  1006  |  1000
Consider the following problem:

min_{w,u} [ (1/N) Σ_{i=1}^N max{ 0, 1 − y_i (z_i^⊤ w − u) } + (1/2)||w||² ],   (6.29)

where z_i ∈ R^{n̄} and y_i ∈ {−1, 1} are taken as follows:
• For i = 1, ..., ⌊N/2⌋, we choose y_i = 1 and take z_i in [0, 1]^{n̄} with uniform distribution.
• For i = ⌊N/2⌋ + 1, ..., N, we choose y_i = −1 and take z_i in [−1, 0]^{n̄} with uniform distribution.
Letting τ_i = z_i^⊤ w − u, problem (6.29) is equivalent to the problem:

min_{w,u,τ} (1/N) Σ_{i=1}^N max{ 0, 1 − y_i τ_i } + (1/2)||w||²  s.t.  τ = Zw − ue,   (6.30)

where Z = (z_1, ..., z_N)^⊤ and e = (1, ..., 1)^⊤. A minimal sketch of this construction is given below.
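The sketch below generates the data (Z, y, e) and the equality constraints of (6.30); the dimensions and the seed are arbitrary illustrative choices.

import numpy as np

def svm_as_cop(n_bar, seed=0):
    """Build data for the SVM reformulation (6.30): N = 200*n_bar samples,
    positives uniform in [0,1]^n_bar, negatives uniform in [-1,0]^n_bar."""
    rng = np.random.default_rng(seed)
    N = 200 * n_bar
    y = np.concatenate([np.ones(N // 2), -np.ones(N - N // 2)])
    Z = np.vstack([rng.uniform(0.0, 1.0, size=(N // 2, n_bar)),
                   rng.uniform(-1.0, 0.0, size=(N - N // 2, n_bar))])
    e = np.ones(N)
    # COP variables (w, u, tau) with n = n_bar + 1 + N; [-Z  e  I](w, u, tau)^T = 0
    A = np.hstack([-Z, e.reshape(-1, 1), np.eye(N)])
    b = np.zeros(N)
    return Z, y, e, A, b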
Let us solve (6.30) by using SG, SingleDSG, MultiDSG and PDS. The size of the test problems and the settings of our software are given in Table 9. The numerical results are displayed in Table 10. Figures 13, 14 and 15 show the progress of SG, SingleDSG, MultiDSG and PDS with s ∈ {1, 1.5, 2}.

Figure 13: Illustration for SVM with n = 2 (Id 13).
Figure 14: Illustration for SVM with n = 3 (Id 14).
Figure 15: Illustration for SVM with n = 5 (Id 15).

Table 10: Numerical results for support vector machine.
Id |  SG: val, infeas, time  |  SingleDSG: val, infeas, time  |  MultiDSG: val, infeas, time
13 | 0.9764, 0.0014, 254     | 0.9823, 0.2576, 316            | 0.9956, 0.0610, 268
14 | 0.9821, 0.0012, 516     | 0.9793, 0.3068, 610            | 0.9930, 1.3704, 515
15 | 0.9978, 0.0014, 778     | 0.9892, 0.3616, 991            | 0.9979, 6.0966, 846
Id |  PDS (s=1): val, infeas, time  |  PDS (s=1.5): val, infeas, time  |  PDS (s=2): val, infeas, time
13 | 0.9574, 0.0998, 140     | 0.9492, 0.0957, 139            | 0.9019, 0.0955, 144
14 | 0.9678, 3.2489, 229     | 0.9612, 0.1171, 255            | 0.9436, 0.1170, 246
15 | 0.9945, 12.655, 430     | 0.9823, 0.1954, 452            | 0.9831, 0.1736, 422

We have tested the performance of the different subgradient methods in solving COP with functional constraints. We emphasize that PDS with s ∈ [1, 2]
and MultiDSG are typically the best choices for users. They provide better approximations in less computational time. For COP (2.5) with the number of inequality constraints larger than the number of equality ones (m > l), users should probably choose PDS with s close to 1 (see MAD8, Wong2 and Wong3). When the number of equality constraints of COP (2.5) is much larger than the number of inequality ones (l ≫ m), PDS with s close to 2 would be a good choice (see Case 1 and LAD). For COP (2.5) with a large number of equality constraints and a large number of inequality ones (m ≫ 1 and l ≫ 1), MultiDSG would be a good choice (see Case 2). Note that SG and SingleDSG handle all constraints through the single aggregated constraint f(x) ≤ 0 with f defined as in (3.7) and hence lose information of the dual problem.

As a topic of further applications, we would like to use MultiDSG and PDS for solving large-scale semidefinite programs (SDP) in the form:

p⋆ := inf_{y ∈ R^n} { f^⊤ y : C_i + B_i y ⪯ 0, i = 1, ..., m, Ay = b },   (7.31)

where f ∈ R^n, C_i ∈ S^{s_i}, B_i : R^n → S^{s_i} is a linear operator defined by B_i y = Σ_{j=1}^n y_j B_i^{(j)} with B_i^{(j)} ∈ S^{s_i}, A ∈ R^{l×n}, and b ∈ R^l. Here we denote by S^s the set of real symmetric matrices of size s. Obviously, SDP (7.31) is equivalent to a COP with functional constraints:

p⋆ := inf_{y ∈ R^n} { f^⊤ y : λ_max(C_i + B_i y) ≤ 0, i = 1, ..., m, Ay = b },   (7.32)

where λ_max(A) stands for the largest eigenvalue of a given symmetric matrix A. Finally, another interesting application is to solve large-scale nonsmooth optimization problems arising from machine learning by using MultiDSG and PDS.

In each iteration, z^(k+1) = z^(0) − s^(k+1)/β_k is the maximizer of

U_β^s(z) := −s^⊤(z − z^(0)) − (β/2)||z − z^(0)||²   (8.33)

for s = s^(k+1) and β = β_k, with

U_{β_k}^{s^(k+1)}(z^(k+1)) = ||s^(k+1)||² / (2β_k).   (8.34)

In addition, U_β^s(z) is strongly concave in z with parameter β:

U_β^s(z) ≤ U_β^s(z′) + ∇U_β^s(z′)^⊤(z − z′) − (β/2)||z − z′||².   (8.35)

We define G^(k) := G(z^(k)) / ||G(z^(k))||. Note that ||G^(k)|| = 1.

Lemma 8.1.
In method (4.15), it holds that ||z^(k) − z⋆|| ≤ ||z^(0) − z⋆|| + 1.

Proof. From (8.34),

U_{β_k}^{s^(k+1)}(z^(k+1)) = (β_{k−1}/β_k) · ||s^(k+1)||²/(2β_{k−1}) = (β_{k−1}/β_k) · ||s^(k) + G^(k)||²/(2β_{k−1})
 = (β_{k−1}/β_k) ( ||s^(k)||²/(2β_{k−1}) + (1/β_{k−1}) s^{(k)⊤}G^(k) + ||G^(k)||²/(2β_{k−1}) )
 = (β_{k−1}/β_k) ( U_{β_{k−1}}^{s^(k)}(z^(k)) + (1/β_{k−1}) s^{(k)⊤}G^(k) + 1/(2β_{k−1}) )
 = (β_{k−1}/β_k) ( U_{β_{k−1}}^{s^(k)}(z^(k)) + (z^(0) − z^(k))^⊤G^(k) + 1/(2β_{k−1}) ).

Rearranging,

(z^(k) − z^(0))^⊤G^(k) = U_{β_{k−1}}^{s^(k)}(z^(k)) − (β_k/β_{k−1}) U_{β_k}^{s^(k+1)}(z^(k+1)) + 1/(2β_{k−1}) ≤ U_{β_{k−1}}^{s^(k)}(z^(k)) − U_{β_k}^{s^(k+1)}(z^(k+1)) + 1/(2β_{k−1}),

since β_k is increasing. Telescoping these inequalities for k = 1, ..., K, and using the fact that ||s^(1)|| = 1,

Σ_{k=1}^K (z^(k) − z^(0))^⊤G^(k) ≤ U_{β_0}^{s^(1)}(z^(1)) − U_{β_K}^{s^(K+1)}(z^(K+1)) + Σ_{k=1}^K 1/(2β_{k−1})   (8.36)
 = 1/(2β_0) − U_{β_K}^{s^(K+1)}(z^(K+1)) + (1/2) Σ_{k=0}^{K−1} 1/β_k
 = −U_{β_K}^{s^(K+1)}(z^(K+1)) + (1/2) ( Σ_{k=0}^{K−1} 1/β_k + 1/β_0 ).

Expanding the recursion β_k = β_{k−1} + 1/β_{k−1}, we get Σ_{k=0}^{K−1} 1/β_k = β_K − β_0 = β_K − 1, so that

Σ_{k=1}^K (z^(k) − z^(0))^⊤G^(k) ≤ −U_{β_K}^{s^(K+1)}(z^(K+1)) + β_K/2.   (8.37)

Given the convexity of L(x, λ, ν) in x and linearity in (λ, ν),

L(x⋆, λ^(k), ν^(k)) ≥ L(x^(k), λ^(k), ν^(k)) + G_x(x^(k), λ^(k), ν^(k))^⊤(x⋆ − x^(k)),   (8.38)
L(x^(k), λ⋆, ν⋆) = L(x^(k), λ^(k), ν^(k)) + G_λ(x^(k), λ^(k), ν^(k))^⊤(λ⋆ − λ^(k)) + G_ν(x^(k), λ^(k), ν^(k))^⊤(ν⋆ − ν^(k)).   (8.39)

Subtracting (8.38) from (8.39) and using (4.14),

0 ≤ ( G_x(z^(k)), −G_λ(z^(k)), −G_ν(z^(k)) )^⊤ ( z^(k) − z⋆ ).   (8.40)

It follows that

0 ≤ Σ_{k=1}^K (z^(k) − z⋆)^⊤G^(k) = Σ_{k=1}^K (z^(0) − z⋆)^⊤G^(k) + Σ_{k=1}^K (z^(k) − z^(0))^⊤G^(k)
 ≤ Σ_{k=1}^K (z^(0) − z⋆)^⊤G^(k) − U_{β_K}^{s^(K+1)}(z^(K+1)) + β_K/2
 = (z^(0) − z⋆)^⊤ s^(K+1) − U_{β_K}^{s^(K+1)}(z^(K+1)) + β_K/2,   (8.41)

where the second inequality uses (8.37), and the last equality follows since s^(k+1) = s^(k) + G^(k). Considering inequality (8.35) with s = s^(K+1), β = β_K, z = z⋆, and z′ = z^(K+1),

U_{β_K}^{s^(K+1)}(z⋆) ≤ U_{β_K}^{s^(K+1)}(z^(K+1)) + ∇U_{β_K}^{s^(K+1)}(z^(K+1))^⊤(z⋆ − z^(K+1)) − (β_K/2)||z⋆ − z^(K+1)||² = U_{β_K}^{s^(K+1)}(z^(K+1)) − (β_K/2)||z⋆ − z^(K+1)||²,

given that z^(K+1) is the maximizer of U_{β_K}^{s^(K+1)}(z). Applying this inequality in (8.41),

0 ≤ (z^(0) − z⋆)^⊤ s^(K+1) − U_{β_K}^{s^(K+1)}(z⋆) − (β_K/2)||z⋆ − z^(K+1)||² + β_K/2
 = (z^(0) − z⋆)^⊤ s^(K+1) + s^{(K+1)⊤}(z⋆ − z^(0)) + (β_K/2)||z⋆ − z^(0)||² − (β_K/2)||z⋆ − z^(K+1)||² + β_K/2
 = (β_K/2)||z⋆ − z^(0)||² − (β_K/2)||z⋆ − z^(K+1)||² + β_K/2,

where the first equality uses the definition (8.33) of U_{β_K}^{s^(K+1)}(z⋆). Rearranging,

||z⋆ − z^(K+1)||² ≤ ||z⋆ − z^(0)||² + 1.   (8.42)

As K ≥ 1 was arbitrary, (8.42) covers every iterate z^(k) with k ≥ 2. Considering now the case k = 1,

||z^(1) − z⋆||² = ||z^(0) − z⋆ − G^(0)||² = ||z^(0) − z⋆||² − 2(z^(0) − z⋆)^⊤G^(0) + 1 ≤ ||z^(0) − z⋆||² + 1,

where the last inequality uses (8.40). Now, for all k,

( ||z⋆ − z^(0)|| + 1 )² = ||z⋆ − z^(0)||² + 2||z⋆ − z^(0)|| + 1 ≥ ||z⋆ − z^(k)||² + 2||z⋆ − z^(0)|| ≥ ||z⋆ − z^(k)||²,

so that ||z⋆ − z^(0)|| + 1 ≥ ||z⋆ − z^(k)||.

In order to prove the convergence result of method (4.15), we need to bound the norm of the subgradients G(z^(k)).

Lemma 8.2.
There exists a constant C > 0 such that ||G(z^(k))|| ≤ C, for all k ∈ N.

Proof. Recall that g_0(x) ∈ ∂f_0(x) and g_i(x) ∈ ∂F_i(x). Then

||G(z^(k))|| = ||( G_x(z^(k)), −G_λ(z^(k)), −G_ν(z^(k)) )|| = ||( g_0(x^(k)) + Σ_{i=1}^m λ_i^(k) g_i(x^(k)) + A^⊤ν^(k), −F(x^(k)), −Ax^(k) + b )||
 ≤ ||g_0(x^(k))|| + Σ_{i=1}^m λ_i^(k) ||g_i(x^(k))|| + ||A^⊤|| ||ν^(k)|| + ||F(x^(k))|| + ||Ax^(k) − b||.

Here we note ||A^⊤|| := max_{u ∈ R^l} { ||A^⊤u|| / ||u|| }. By Lemma 8.1, the iterates of method (4.15) are bounded in a convex compact region, z^(k) ∈ D := { z : ||z − z⋆|| ≤ ||z^(0) − z⋆|| + 1 }. This implies that x^(k) ∈ D_x := { x : ||x − x⋆|| ≤ ||z^(0) − z⋆|| + 1 }, λ^(k) ∈ D_λ := { λ : ||λ − λ⋆|| ≤ ||z^(0) − z⋆|| + 1 } and ν^(k) ∈ D_ν := { ν : ||ν − ν⋆|| ≤ ||z^(0) − z⋆|| + 1 }. The desired result follows due to Assumption 3.1.

Lemma 8.3.
There exists a constant C > 0 such that, for any K ∈ N iterations of method (4.15),

f_0(x̄^(K+1)) − p⋆ ≤ [ C(||z^(0) − z⋆||² + 1) / (2(K + 1)) ] ( 1/(1+√3) + √(2K+1) )

and

||F(x̄^(K+1))|| + ||Ax̄^(K+1) − b|| ≤ [ C(4(||z^(0) − z⋆|| + 1)² + 1) / (K + 1) ] ( 1/(1+√3) + √(2K+1) ).

Proof. Using equation (8.37), and recalling that z^(K+1) maximizes U_{β_K}^{s^(K+1)}(z) defined by (8.33),

β_K/2 ≥ Σ_{k=1}^K (z^(k) − z^(0))^⊤G^(k) + U_{β_K}^{s^(K+1)}(z^(K+1))
 = Σ_{k=1}^K (z^(k) − z^(0))^⊤G^(k) + max_{z ∈ R^{n+m+l}} { −( Σ_{k=0}^K G^(k) )^⊤(z − z^(0)) − (β_K/2)||z − z^(0)||² }
 = max_{z ∈ R^{n+m+l}} { Σ_{k=0}^K −G^{(k)⊤}(z − z^(k)) − (β_K/2)||z − z^(0)||² }.   (8.43)

Like x̄^(K+1), let z̄^(K+1) := δ_{K+1}^{-1} Σ_{k=0}^K z^(k)/||G(z^(k))||, λ̄^(K+1) := δ_{K+1}^{-1} Σ_{k=0}^K λ^(k)/||G(z^(k))|| and ν̄^(K+1) := δ_{K+1}^{-1} Σ_{k=0}^K ν^(k)/||G(z^(k))||. Multiplying both sides of (8.43) by δ_{K+1}^{-1},

δ_{K+1}^{-1} β_K/2 ≥ δ_{K+1}^{-1} max_z { Σ_{k=0}^K −G^{(k)⊤}(z − z^(k)) − (β_K/2)||z − z^(0)||² }
 = δ_{K+1}^{-1} max_z { Σ_{k=0}^K [ −G_x(z^(k))^⊤(x − x^(k)) + G_λ(z^(k))^⊤(λ − λ^(k)) + G_ν(z^(k))^⊤(ν − ν^(k)) ] / ||G(z^(k))|| − (β_K/2)||z − z^(0)||² }
 ≥ δ_{K+1}^{-1} max_z { Σ_{k=0}^K [ L(x^(k), λ^(k), ν^(k)) − L(x, λ^(k), ν^(k)) + L(x^(k), λ, ν) − L(x^(k), λ^(k), ν^(k)) ] / ||G(z^(k))|| − (β_K/2)||z − z^(0)||² }
 = max_z { δ_{K+1}^{-1} Σ_{k=0}^K [ L(x^(k), λ, ν) − L(x, λ^(k), ν^(k)) ] / ||G(z^(k))|| − δ_{K+1}^{-1}(β_K/2)||z − z^(0)||² }
 ≥ max_z { L(x̄^(K+1), λ, ν) − L(x, λ̄^(K+1), ν̄^(K+1)) − δ_{K+1}^{-1}(β_K/2)||z − z^(0)||² }
 = max_z { f_0(x̄^(K+1)) + F(x̄^(K+1))^⊤λ + (Ax̄^(K+1) − b)^⊤ν − f_0(x) − F(x)^⊤λ̄^(K+1) − (Ax − b)^⊤ν̄^(K+1) − δ_{K+1}^{-1}(β_K/2)||z − z^(0)||² },   (8.44)

where the third inequality uses Jensen's inequality. Given the maximum, the inequality holds for any choice of z. We consider two cases, the first being

x = x̄^(K+1), λ = λ̄^(K+1) + F(x̄^(K+1))/(2||F(x̄^(K+1))||), ν = ν̄^(K+1) + (Ax̄^(K+1) − b)/(2||Ax̄^(K+1) − b||).

From (8.44), since the terms involving f_0(x̄^(K+1)), λ̄^(K+1) and ν̄^(K+1) cancel,

δ_{K+1}^{-1} β_K/2 ≥ (1/2)( ||F(x̄^(K+1))|| + ||Ax̄^(K+1) − b|| ) − δ_{K+1}^{-1}(β_K/2) || ( x̄^(K+1), λ̄^(K+1) + F(x̄^(K+1))/(2||F(x̄^(K+1))||), ν̄^(K+1) + (Ax̄^(K+1) − b)/(2||Ax̄^(K+1) − b||) ) − z^(0) ||².   (8.45)

Further,

|| ( x̄^(K+1), λ̄^(K+1) + F(x̄^(K+1))/(2||F(x̄^(K+1))||), ν̄^(K+1) + (Ax̄^(K+1) − b)/(2||Ax̄^(K+1) − b||) ) − z^(0) ||
 ≤ ||z̄^(K+1) − z^(0)|| + ||F(x̄^(K+1))||/(2||F(x̄^(K+1))||) + ||Ax̄^(K+1) − b||/(2||Ax̄^(K+1) − b||)
 = ||z̄^(K+1) − z⋆ + z⋆ − z^(0)|| + 1
 ≤ ||z̄^(K+1) − z⋆|| + ||z⋆ − z^(0)|| + 1
 ≤ δ_{K+1}^{-1} Σ_{k=0}^K ||z^(k) − z⋆||/||G(z^(k))|| + ||z⋆ − z^(0)|| + 1
 ≤ δ_{K+1}^{-1} Σ_{k=0}^K ( ||z^(0) − z⋆|| + 1 )/||G(z^(k))|| + ||z⋆ − z^(0)|| + 1
 = 2( ||z^(0) − z⋆|| + 1 ),   (8.46)

where the third inequality uses Jensen's inequality and the fourth inequality uses Lemma 8.1. Combining (8.45) and (8.46),

||F(x̄^(K+1))|| + ||Ax̄^(K+1) − b|| ≤ δ_{K+1}^{-1} β_K ( 4(||z^(0) − z⋆|| + 1)² + 1 ).

The second case uses z = z⋆. Starting from (8.44),

δ_{K+1}^{-1} β_K/2 ≥ f_0(x̄^(K+1)) + F(x̄^(K+1))^⊤λ⋆ + (Ax̄^(K+1) − b)^⊤ν⋆ − f_0(x⋆) − F(x⋆)^⊤λ̄^(K+1) − (Ax⋆ − b)^⊤ν̄^(K+1) − δ_{K+1}^{-1}(β_K/2)||z⋆ − z^(0)||²
 = f_0(x̄^(K+1)) − f_0(x⋆) − δ_{K+1}^{-1}(β_K/2)||z⋆ − z^(0)||²,

since F(x⋆) = 0 and Ax⋆ = b. Rearranging,

f_0(x̄^(K+1)) − f_0(x⋆) ≤ (δ_{K+1}^{-1} β_K / 2)( ||z^(0) − z⋆||² + 1 ).

Using Lemma 8.2 and [5, Lemma 3], δ_{K+1}^{-1} β_K can be bounded as follows:

δ_{K+1}^{-1} β_K ≤ ( Σ_{k=0}^K ||G(z^(k))||^{-1} )^{-1} ( 1/(1+√3) + √(2K+1) ) ≤ ( Σ_{k=0}^K 1/C )^{-1} ( 1/(1+√3) + √(2K+1) ) = [ C/(K+1) ] ( 1/(1+√3) + √(2K+1) ).
Subgradient computing. Given C as a subset of R^n, recall that conv(C) stands for the convex hull generated by C. The subdifferential of F_i is computed by the formula:

∂F_i(x) = ∂f_i(x) if f_i(x) > 0, conv(∂f_i(x) ∪ {0}) if f_i(x) = 0, and {0} otherwise.   (8.47)

Since the subdifferential of the function ||·||^s is computed by the formula

∂(||·||^s)(z) = {2z} if s = 2, {s||z||^{s−2} z} if z ≠ 0 and s ∈ [1, 2), and {sg ∈ R^m : ||g|| ≤ 1} if z = 0 and s ∈ [1, 2),   (8.48)

we have

∂(||F(·)||^s)(x) ⊃ conv{ Σ_{i=1}^m a_i b_i : a ∈ ∂(||·||^s)(F(x)), b_i ∈ ∂F_i(x) } ⊃ s||F(x)||^{s−2} Σ_{i=1}^m F_i(x) ∂F_i(x) if F(x) ≠ 0, and ∂(||F(·)||^s)(x) ⊃ {0} otherwise,   (8.49)

and

∂(||A· − b||^s)(x) ⊃ conv{ A^⊤u : u ∈ ∂(||·||^s)(Ax − b) } ⊃ { s||Ax − b||^{s−2} A^⊤(Ax − b) } if Ax − b ≠ 0, and ∂(||A· − b||^s)(x) ⊃ {0} otherwise.   (8.50)
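A small Python helper implementing the particular subgradients singled out in (8.49) and (8.50); F_val (the vector F(x)), G_F (a matrix whose rows are chosen subgradients g_i ∈ ∂F_i(x)) and Ax_minus_b are assumed inputs supplied by the caller.

import numpy as np

def penalty_subgradients(F_val, G_F, A, Ax_minus_b, s):
    """Return one element each of the subdifferentials in (8.49) and (8.50)."""
    if np.any(F_val):
        gF = s * np.linalg.norm(F_val) ** (s - 2) * (G_F.T @ F_val)          # (8.49)
    else:
        gF = np.zeros(G_F.shape[1])
    if np.any(Ax_minus_b):
        gA = s * np.linalg.norm(Ax_minus_b) ** (s - 2) * (A.T @ Ax_minus_b)  # (8.50)
    else:
        gA = np.zeros(A.shape[1])
    return gF, gA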
Proof of Theorem 5.2. In [2, Section 8], the author uses the wrong statement that if (a_k)_{k=1}^∞ and (b_k)_{k=1}^∞ are two nonnegative real sequences such that Σ_{k=1}^∞ a_k b_k is bounded and Σ_{k=1}^∞ a_k diverges, then b_k → 0 as k → ∞. As a counterexample, take a_k = 1/k and b_k = 1 if k = i² for some i ∈ N, and b_k = 0 otherwise. Then Σ_{k=1}^∞ a_k b_k = Σ_{i=1}^∞ 1/i² is bounded and Σ_{k=1}^∞ a_k diverges, but b_k does not converge to 0. The following lemma avoids this issue and yields Theorem 5.2.

Lemma 8.4.
Let K = ∞ in method (5.21). For every k ∈ N, let i_k be any element belonging to

argmin_{1 ≤ i ≤ k} T^{(i)⊤}(z^(i) − z⋆).   (8.52)

Then |f_0(x^(i_k)) − p⋆| → 0 and ||F(x^(i_k))|| + ||Ax^(i_k) − b|| → 0 as k → ∞ with rate at least O((k + 1)^{−δ/(2s)}).   (8.53)

Proof. Let z⋆ := (x⋆, λ⋆, ν⋆). Let R > 0 with R ≥ ||z^(1)|| and R ≥ ||z⋆||. We start by writing out a basic identity:

||z^(k+1) − z⋆||² = ||z^(k) − z⋆||² − 2α_k T^{(k)⊤}(z^(k) − z⋆) + α_k²||T^(k)||² = ||z^(k) − z⋆||² − 2γ_k T^{(k)⊤}(z^(k) − z⋆)/||T^(k)||² + γ_k².   (8.54)

By summing it over k and rearranging the terms, we get

||z^(k+1) − z⋆||² + 2 Σ_{i=1}^k γ_i T^{(i)⊤}(z^(i) − z⋆)/||T^(i)||² = ||z^(1) − z⋆||² + Σ_{i=1}^k γ_i² ≤ 4R² + S.   (8.55)

The latter inequality is due to Σ_{i=1}^∞ γ_i² = S < ∞ (see Abel's summation formula). We argue that the sum on the left-hand side is nonnegative. First, we estimate a lower bound of

T^{(k)⊤}(z^(k) − z⋆) = ∂_x L_ρ(z^(k))^⊤(x^(k) − x⋆) − ∂_λ L_ρ(z^(k))^⊤(λ^(k) − λ⋆) − ∂_ν L_ρ(z^(k))^⊤(ν^(k) − ν⋆).   (8.56)

With F^(k) ≠ 0 and Ax^(k) ≠ b, the first term further expands to

∂_x L_ρ(z^(k))^⊤(x^(k) − x⋆) = g_0^{(k)⊤}(x^(k) − x⋆) + Σ_{i=1}^m ( λ_i^(k) + ρs||F^(k)||^{s−2} F_i^(k) ) g_i^{(k)⊤}(x^(k) − x⋆) + ν^{(k)⊤}A(x^(k) − x⋆) + ρs||Ax^(k) − b||^{s−2}(Ax^(k) − b)^⊤A(x^(k) − x⋆)
 ≥ f_0(x^(k)) − p⋆ + λ^{(k)⊤}F^(k) + ρs||F^(k)||^s + ν^{(k)⊤}(Ax^(k) − b) + ρs||Ax^(k) − b||^s  (since Ax⋆ = b).   (8.57)

This is due to the definition of subgradient: for the objective function, we have

g_0^{(k)⊤}(x^(k) − x⋆) ≥ f_0(x^(k)) − p⋆,   (8.58)

and for the constraints

g_i^{(k)⊤}(x^(k) − x⋆) ≥ F_i^(k) − F_i(x⋆) = F_i^(k), i = 1, ..., m.   (8.59)

Notice that λ_i^(k) + ρs||F^(k)||^{s−2}F_i^(k) is nonnegative since λ_i^(k) and F_i^(k) are nonnegative. Next, we have

−∂_λ L_ρ(z^(k))^⊤(λ^(k) − λ⋆) = −F^{(k)⊤}(λ^(k) − λ⋆)   (8.60)

and

−∂_ν L_ρ(z^(k))^⊤(ν^(k) − ν⋆) = (b − Ax^(k))^⊤(ν^(k) − ν⋆).   (8.61)

Using these and substituting into (8.56), we obtain

T^{(k)⊤}(z^(k) − z⋆) ≥ f_0(x^(k)) − p⋆ + λ^{⋆⊤}F^(k) + ν^{⋆⊤}(Ax^(k) − b) + ρs( ||F^(k)||^s + ||Ax^(k) − b||^s )
 = L(x^(k), λ⋆, ν⋆) − L(x⋆, λ⋆, ν⋆) + ρs( ||F^(k)||^s + ||Ax^(k) − b||^s ) ≥ 0.   (8.62)

The latter inequality is implied by using (4.14).
It is remarkable that (8.62) is still true even if F^(k) = 0 or Ax^(k) = b.

Since both terms on the left-hand side of (8.55) are nonnegative, for all k we have

||z^(k+1) − z⋆||² ≤ 4R² + S  and  2 Σ_{i=1}^k γ_i T^{(i)⊤}(z^(i) − z⋆)/||T^(i)||² ≤ 4R² + S.   (8.63)

The first inequality yields that there exists a positive real D satisfying ||z^(k)|| ≤ D, namely D = R + √(4R² + S). By assumption, the norm of the subgradients g_i^(k), i = 0, 1, ..., m, on the set {x : ||x|| ≤ D} is bounded, so it follows that ||T^(k)|| is bounded by some positive real C independent of k. The second inequality of (8.63) implies that

T^{(i_k)⊤}(z^(i_k) − z⋆) ≤ C²(4R² + S) / (2 Σ_{i=1}^k γ_i) ≤ W (k + 1)^{−δ/2},   (8.64)

for some W > 0 independent of k. The latter inequality is due to the asymptotic behavior of the zeta function. Moreover, (8.62) yields

( L(x^(i_k), λ⋆, ν⋆) − L(x⋆, λ⋆, ν⋆) ) + ρs||F(x^(i_k))||^s + ρs||Ax^(i_k) − b||^s ≤ T^{(i_k)⊤}(z^(i_k) − z⋆) ≤ W (k + 1)^{−δ/2}.   (8.65)

Since the three terms on the left-hand side are nonnegative, we obtain

||F(x^(i_k))|| ≤ [ W^{1/s} / (ρs)^{1/s} ] (k + 1)^{−δ/(2s)},  ||Ax^(i_k) − b|| ≤ [ W^{1/s} / (ρs)^{1/s} ] (k + 1)^{−δ/(2s)},   (8.66)

and

W (k + 1)^{−δ/2} ≥ L(x^(i_k), λ⋆, ν⋆) − L(x⋆, λ⋆, ν⋆) = f_0(x^(i_k)) − p⋆ + λ^{⋆⊤}F(x^(i_k)) + ν^{⋆⊤}(Ax^(i_k) − b) ≥ 0.   (8.67)

Using these, we have

W (k + 1)^{−δ/2} + ξ_{i_k} ≥ f_0(x^(i_k)) − p⋆ ≥ −ξ_{i_k},   (8.68)

where ξ_{i_k} = ||λ⋆|| ||F(x^(i_k))|| + ||ν⋆|| ||Ax^(i_k) − b||. Since ξ_{i_k} → 0 as k → ∞ with rate at least O((k + 1)^{−δ/(2s)}), this proves (8.53).

References

[1] Y. I. Alber, A. N. Iusem, and M. V. Solodov. On the projected subgradient method for nonsmooth convex optimization in a Hilbert space.
Mathematical Programming, 81(1):23–35, 1998.
[2] S. Boyd. Subgradient methods. Lecture notes of EE364b, Stanford University, Spring 2013–14:1–39, 2014.
[3] L. Lukšan and J. Vlček. Test problems for nonsmooth unconstrained and linearly constrained optimization. Technická zpráva, 798, 2000.
[4] M. R. Metel and A. Takeda. Dual subgradient method for constrained convex optimization problems. arXiv preprint arXiv:2009.12769, 2020.
[5] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
[6] Y. Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.
[7] J. Nocedal and S. Wright. Numerical Optimization. Springer Science & Business Media, 2006.
[8] N. Z. Shor. Minimization Methods for Non-differentiable Functions, volume 3. Springer Science & Business Media, 2012.