An Optimal Algorithm for Strongly Convex Minimization under Affine Constraints
Adil Salim, Laurent Condat, Dmitry Kovalev, Peter Richtárik
King Abdullah University of Science and Technology
February 17, 2021
Abstract
Optimization problems under affine constraints appear in various areas of machine learning. We consider the task of minimizing a smooth strongly convex function F(x) under the affine constraint Kx = b, with an oracle providing evaluations of the gradient of F and matrix-vector multiplications by K and its transpose. We provide lower bounds on the number of gradient computations and matrix-vector multiplications to achieve a given accuracy. Then we propose an accelerated primal–dual algorithm achieving these lower bounds. Our algorithm is the first optimal algorithm for this class of problems.

We consider the convex optimization problem

    min_{x ∈ X} F(x)  s.t.  Kx = b,    (1)

where F is a smooth and strongly convex function over X := R^d, b ∈ Y := R^p is a vector, and K is a nonzero p × d matrix, for some integers d ≥ p ≥ 1. We adopt the matrix-vector setting to set the ideas down, but the formalism holds more generally with any separable real Hilbert spaces X and Y and linear operator K: X → Y. We suppose that b is in the range of K; then the sought solution to (1), denoted by x⋆, exists and is unique, by strong convexity.

Problem (1) covers a large number of applications in machine learning [5, 33, 38] and beyond [6, 19, 39]. Examples include inverse problems in imaging [11], and the recovery of a model from partial measurements b of the model, in compressed sensing [20] or sketched-learning-type applications [23], for instance. In optimal transport, one often looks for measures with fixed marginals, which can be written as an affine equality constraint [32]. Network flow optimization takes the form of Problem (1), where b contains the incoming and outgoing rates at the source and sink nodes of a network, and K is the edge-node incidence matrix [43]. Decentralized optimization is a well-known instance of Problem (1), with K a gossip matrix (or its square root) and b = 0 [3, 18, 21, 25, 26, 27, 30, 35, 37, 42]. If additional affine constraints are added to the decentralized optimization problem, for instance that some elements or linear measurements of the sought model x⋆ are fixed, decentralized optimization reverts to Problem (1) with nonzero b.

For large-scale convex optimization problems like (1), primal–dual splitting algorithms [9, 15, 16, 24, 29, 34, 41] are well suited, as they are easy to implement and typically show state-of-the-art performance. Fully split algorithms do not require the ability to project onto the constraint space {x ∈ X : Kx = b}. Precisely, we say that an iterative algorithm is fully split if it produces a sequence of iterates (x^k)_{k≥0} ∈ X^N converging to the solution x⋆ of (1), using only computations of ∇F and multiplications by K and K^T, the transpose of K.
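To make this oracle model concrete, here is a minimal numpy sketch of a fully split interface to Problem (1). The instance itself (the sizes, the quadratic F, and the names grad_F, apply_K, apply_KT) is ours, chosen for illustration only:

```python
import numpy as np

# An illustrative instance of Problem (1): minimize F(x) = 0.5*||x - c||^2
# subject to K x = b, with b constructed in range(K) so that a feasible
# (hence unique) solution exists.
rng = np.random.default_rng(0)
d, p = 6, 3
K = rng.standard_normal((p, d))
x_feas = rng.standard_normal(d)
b = K @ x_feas                      # b in range(K) by construction
c = rng.standard_normal(d)

# The only three oracles a fully split algorithm is allowed to call:
grad_F = lambda x: x - c            # gradient of the smooth, strongly convex F
apply_K = lambda x: K @ x           # multiplication by K
apply_KT = lambda y: K.T @ y        # multiplication by K^T

# Feasibility check: the least-squares residual of K z = b is zero.
z, *_ = np.linalg.lstsq(K, b, rcond=None)
```

A fully split method never forms the projection onto {x : Kx = b}; it only composes the three callables above.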
There exist several fully split primal–dual algorithms well suited to solving Problem (1), and even more general problems [13, 14, 40]. In particular, we can mention the algorithm first proposed in [28] and rediscovered independently as the PDFP2O algorithm [12] and as the Proximal Alternating Predictor–Corrector (PAPC) algorithm [17]. For simplicity, we name it the PAPC algorithm. When applied to Problem (1) with F strongly convex, the PAPC algorithm has been proved to converge linearly in [34].

In this paper, we focus on the complexity of fully split algorithms for solving Problem (1), which is of primary importance in large-scale applications. That is, we study the number of gradient computations and matrix multiplications necessary to reach a given accuracy. We first derive lower bounds for these two quantities. No known algorithm matches these lower bounds, although nearly optimal algorithms exist in the case b = 0 [18]. We then propose a new accelerated primal–dual algorithm that matches the lower bounds and is thus optimal. Our algorithm can be viewed as an accelerated version of the PAPC algorithm.

In summary, our main contributions are the following:

• We provide complexity lower bounds for solving Problem (1) within the class of algorithms performing evaluations of ∇F and multiplications by K and K^T.
• We propose a new algorithm for solving Problem (1).
• We prove that the complexity of our algorithm matches the lower bounds; it is therefore optimal.

This paper is organized as follows. In Section 2, we introduce the notations and assumptions. We summarize our contributions in the light of prior work in Section 3. In Section 4, we define the class of algorithms under study and derive the corresponding complexity lower bounds for solving (1). Our main algorithm and our main result about its convergence and complexity are given in Section 5.
Our approach for deriving and analyzing this algorithm is provided in Section 6. We illustrate our convergence results by numerical experiments in Section 7. The technical proofs are postponed to the Appendix.
Let us make the formulation of the problem (1) more precise. The convex function F: X → R is L-smooth and µ-strongly convex, for some L ≥ µ > 0; that is, F is differentiable and satisfies, for every (x, x′) ∈ X²,

    F(x) + ⟨∇F(x), x′ − x⟩ + (µ/2)∥x − x′∥² ≤ F(x′),
    F(x′) ≤ F(x) + ⟨∇F(x), x′ − x⟩ + (L/2)∥x′ − x∥².

That is, ∇F is L-Lipschitz continuous and F − (µ/2)∥·∥² is convex. Moreover, the Bregman divergence of F is denoted by D_F(x, x′) := F(x) − F(x′) − ⟨∇F(x′), x − x′⟩ ≥ 0. We have 0 < µ ≤ L, and we denote by κ := L/µ ≥ 1 the condition number of F.

The kernel of the matrix K is denoted by ker(K) and its range by range(K). We define the symmetric positive semidefinite matrix W := K^T K. The largest eigenvalue of W is denoted by λmax(W) and its smallest positive eigenvalue by λ+min(W). We have 0 < λ+min(W) ≤ λmax(W), and we denote by χ(W) := λmax(W)/λ+min(W) ≥ 1 the condition number of W. The condition number κ (resp. χ(W)) measures the regularity of F (resp. K). The complexity results obtained in this paper are (nondecreasing) functions of κ (resp. χ(W)).

We can note that ker(K) = ker(W). If ker(W) = {0}, the solution x⋆ to the linear system Kx = b is unique, so that F does not play any role and Problem (1) reverts to solving this linear system. We allow for this case, but it is of course not the focus of this paper.

Finally, we denote by ι{b} the indicator function of {b}; that is, ι{b}: y ∈ Y ↦ 0 if y = b, +∞ otherwise. This function is convex and lower semicontinuous over Y. Denoting by ∂ the subdifferential operator [7, Section 16], we recall that ∂ι{b}(y) ≠ ∅ if and only if y = b.

Most algorithms able to solve Problem (1) using evaluations of ∇F and multiplications by K and K^T can be viewed as primal–dual algorithms. For instance, the Condat–Vũ algorithm [14, 40] and its variants, as well as the PAPC algorithm [12, 17, 28], can be applied to Problem (1). The PAPC algorithm applied to Problem (1) consists in iterating

    x^{k+1/2} := x^k − η∇F(x^k) − ηK^T y^k
    y^{k+1} := y^k + θ(K x^{k+1/2} − b)    (2)
    x^{k+1} := x^k − η∇F(x^k) − ηK^T y^{k+1}

for some parameters η, θ > 0.

Attempts to solve Problem (1), with b = 0, at an accelerated or (nearly) optimal rate have been made recently in the particular case of decentralized optimization [3, 18, 21, 25, 26, 27, 35, 42]. In this case, K is typically the square root of a gossip matrix, i.e. a symmetric positive semidefinite matrix supported by a graph, whose kernel is the consensus space. In particular, optimal decentralized algorithms have been proposed using acceleration techniques [1, 2, 4, 31] in [25] and, in a more general stochastic case, in [27].

Table 1: Comparison of the complexity of state-of-the-art algorithms with our results, in terms of gradient computations and matrix multiplications to find x ∈ X such that ∥x − x⋆∥ ≤ ε. The condition number of F is denoted by κ and the condition number of K^T K is denoted by χ.

| Algorithm                             | Gradient computations | Matrix multiplications |
| PAPC algorithm [34]                   | O((κ + χ) log(1/ε))   | O((κ + χ) log(1/ε))    |
| [29]                                  | O((κ + χ) log(1/ε))   | O((κ + χ) log(1/ε))    |
| [18] (case b = 0)                     | O(√κ log(1/ε))        | O(√(κχ) log(1/ε))      |
| Algorithm 1 (this paper, Theorem 2)   | O(√κ log(1/ε))        | O(√(κχ) log(1/ε))      |
| Lower bound (this paper, Theorem 1)   | Ω(√κ log(1/ε))        | Ω(√(κχ) log(1/ε))      |
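As a reference point, iteration (2) takes only a few lines of code. The sketch below is ours: it applies the iteration to a tiny instance with F(x) = ½∥x − c∥² (so ∇F(x) = x − c, L = µ = 1), and the step sizes η = 1/L and θ = 1/(η λmax(KK^T)) are one standard admissible choice, not the tuning analyzed in this paper.

```python
import numpy as np

def papc(grad_F, K, b, x0, y0, eta, theta, iters):
    """PAPC iteration (2): x_half is a primal predictor, y is the dual
    variable enforcing K x = b, and x is the corrected primal iterate."""
    x, y = x0.copy(), y0.copy()
    for _ in range(iters):
        x_half = x - eta * grad_F(x) - eta * (K.T @ y)
        y = y + theta * (K @ x_half - b)
        x = x - eta * grad_F(x) - eta * (K.T @ y)
    return x, y

# Tiny instance: minimize 0.5*||x - c||^2 subject to x_1 + x_2 = 1.
c = np.array([2.0, 0.0])
K = np.array([[1.0, 1.0]])
b = np.array([1.0])
grad_F = lambda x: x - c                  # L = mu = 1
eta = 1.0                                 # eta = 1/L
theta = 1.0 / (eta * 2.0)                 # 1/(eta*lambda_max(K K^T)); here ||K||^2 = 2
x, y = papc(grad_F, K, b, np.zeros(2), np.zeros(1), eta, theta, 100)
```

On this instance KK^T is scalar and θKK^T = 1, so the dual update is exact after one step and the iterates reach the projection of c onto the constraint line, x⋆ = (1.5, −0.5) with y⋆ = 0.5.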
Our approach can be seen as an extension of [25], which is not straightforward due to the presence of the right-hand side b. In the case where projecting onto the constraint space {x ∈ X : Kx = b} is possible, FISTA [8, 10] is an optimal algorithm for solving Problem (1). FISTA can be seen as Nesterov's acceleration [31] of the classical projected gradient algorithm.

In a nutshell, our approach consists in a rigorous combination of Nesterov's acceleration [31], to minimize a smooth and strongly convex function, and Chebyshev's acceleration [2, 4], for linear system solving. Our approach allows us to accelerate the PAPC algorithm and, for the first time, to achieve the asymptotic complexity lower bounds. Our results and the most relevant results of the literature are summarized in Table 1.

We now define the family of algorithms considered to solve Problem (1). Informally, this is the family of algorithms using gradient computations and matrix multiplications. Since no particular structure is assumed on K, any multiplication of the iterates by K must be followed by a multiplication by K^T, in order to map the iterates back into the optimization space X before an application of ∇F. Hence, we consider the wide class of Black-Box First-Order algorithms using ∇F, K and K^T, denoted by BBFO(∇F, K), which generate a sequence of vectors (x^n)_{n∈N} ∈ X^N such that

    x^{n+1} ∈ Span(x^0, ..., x^n, ∇F(x^0), ..., ∇F(x^n), K^T Span(b, Kx^0, ..., Kx^n, K∇F(x^0), ..., K∇F(x^n)))

and do not apply the operators ∇F, K and K^T to other vectors. It is important to note that the index n need not coincide with the iteration counter of an iterative algorithm: each x^n can correspond to an intermediate vector in X obtained after any computation or sequence of computations during the course of the algorithm.
Theorem 1 (Lower bounds). Let χ ≥ 1 and κ ≥ 1. There exist a vector b, a matrix K such that the condition number of K^T K is χ, and a smooth and strongly convex function F with condition number κ, such that the following holds: for any ε > 0, any BBFO(∇F, K) algorithm requires at least

• Ω(√(κχ) log(1/ε)) multiplications by K,
• Ω(√(κχ) log(1/ε)) multiplications by K^T,
• Ω(√κ log(1/ε)) computations of ∇F,

to output a vector x such that ∥x − x⋆∥ < ε, where x⋆ = argmin_{x: Kx=b} F(x). (We do not assume the knowledge of a solution x^0 to the linear system Kx = b; otherwise, one could get back to the case b = 0 using a change of variable.)

Theorem 1 provides lower bounds on the number of gradient computations and matrix multiplications needed to reach ε accuracy, which here means that ∥x − x⋆∥ ≤ ε.

Proof.
We follow the ideas of [35], in the context of decentralized optimization, to exhibit a worst-case function F and matrix K. Let χ ≥ 1.

"Bad" function F and "bad" matrix K. Consider the family of smooth and strongly convex functions (f_i)_{i=1}^n and the matrix W with condition number χ given by [35, Corollary 2]. Denote by κ the common condition number of the f_i. Set F(x_1, ..., x_n) := Σ_{i=1}^n f_i(x_i), K := √W, and b := 0. Then, the condition number of F is κ and the condition number of W = K^T K is χ. Moreover, W is a gossip matrix [35, Section 2.2].

BBFO(∇F, K) algorithms are decentralized optimization algorithms. Any BBFO algorithm using the operators ∇F, K, K^T can be rewritten as a function of ∇F and W = K^T K. Indeed, using b = 0,

    Span(x^0, ..., x^n, ∇F(x^0), ..., ∇F(x^n), K^T Span(b, Kx^0, ..., Kx^n, K∇F(x^0), ..., K∇F(x^n)))
    = Span(x^0, ..., x^n, ∇F(x^0), ..., ∇F(x^n), Span(Wx^0, ..., Wx^n, W∇F(x^0), ..., W∇F(x^n))).

Since W is a gossip matrix, BBFO(∇F, K) algorithms are therefore black-box optimization procedures using W, in the sense of [35, Section 3.1]. In other words, BBFO(∇F, K) algorithms are decentralized optimization algorithms over a network, in which communication amounts to multiplication by W and local computations correspond to evaluations of ∇F.

Any solution to (1) is a solution to a decentralized optimization problem. Since ker(W) is the consensus space, x⋆ = argmin_{x: Wx=0} F(x) can be written as x⋆ = (x̄⋆, ..., x̄⋆), where x̄⋆ = argmin (1/n) Σ_{i=1}^n f_i.

BBFO(∇F, K) algorithms cannot outperform the lower bounds of decentralized algorithms. As shown in [35, Corollary 2], for any ε > 0, any black-box optimization procedure using W requires at least Ω(√(κχ) log(1/ε)) communication rounds, and at least Ω(√κ log(1/ε)) gradient computations, to output x = (x_1, ..., x_n) such that ∥x − x⋆∥ < ε. In particular, for any ε > 0, any BBFO(∇F, K) algorithm requires at least Ω(√(κχ) log(1/ε)) multiplications by K^T K, and at least Ω(√κ log(1/ε)) computations of ∇F, to output such an x. Finally, one multiplication by W is equivalent to one multiplication by K followed by one multiplication by K^T.

Algorithm 1
(Proposed algorithm)

Parameters: x^0 ∈ X, N ∈ N*, λ1, λ2, η, θ, α > 0, τ ∈ (0, 1]
x_f^0 := x^0, u^0 := 0_X
for k = 0, 1, ... do
    x_g^k := τ x^k + (1 − τ) x_f^k
    x^{k+1/2} := (1 + ηα)^{−1} (x^k − η(∇F(x_g^k) − α x_g^k + u^k))
    r^k := θ (x^{k+1/2} − Chebyshev(x^{k+1/2}, K, b, N, λ1, λ2))
    u^{k+1} := u^k + r^k
    x^{k+1} := x^{k+1/2} − η(1 + ηα)^{−1} r^k
    x_f^{k+1} := x_g^k + (2τ/(2 − τ))(x^{k+1} − x^k)
end for

Algorithm 2 (Chebyshev iteration)

Parameters: z^0 ∈ X, K, b ∈ Y, N ∈ N*, λ1 ≥ λ2 > 0.
Set W := K^T K, ρ := (λ1 − λ2)²/16, ν := (λ1 + λ2)/2
γ^0 := −ν/2, p^0 := −K^T(K z^0 − b)/ν, z^1 := z^0 + p^0
for i = 1, ..., N − 1 do
    β^{i−1} := ρ/γ^{i−1}
    γ^i := −(ν + β^{i−1})
    p^i := (K^T(K z^i − b) + β^{i−1} p^{i−1})/γ^i
    z^{i+1} := z^i + p^i
end for
Output: z^N

In this section, we present our main algorithm, Algorithm 1, and our main convergence result, Theorem 2. The derivations and proofs are deferred to Section 6. Algorithm 2 implements the classical Chebyshev iteration [22]; see Section 6.3 for details. It is used as a subroutine in Algorithm 1 and denoted by Chebyshev, with its parameters passed as arguments. We stress that the Chebyshev iteration is diverted here from its usual use, which is solving linear systems, and serves as a preconditioner. Although Algorithm 1 runs a number N of Chebyshev iterations at every iteration, there is no approximation or truncation error: Algorithm 1 converges to the exact solution x⋆ of Problem (1). It is not an algorithm with an inner loop at every iteration solving a linear system to enforce the constraint Kx = b.

Theorem 2 (Convergence of Algorithm 1). Consider λ1 ≥ λmax(W) and λ2 such that 0 < λ2 ≤ λ+min(W). Denote χ := λ1/λ2 and let N ≥ √χ. Set the parameters τ, η, θ, α to τ := min{1, c1/√κ}, η := 1/(c2 τL), θ := 15/(19η), and α := µ, where c1, c2 > 0 are the absolute constants specified in the Appendix. Then, there exists C ≥ 0 such that

    (1/η)∥x^k − x⋆∥² + (2(1 − τ)/τ) D_F(x_f^k, x⋆) ≤ (1 + c3 min{1, 1/√κ})^{−k} C,

for an absolute constant c3 > 0. Moreover, for every ε > 0, Algorithm 1 finds x^k for which ∥x^k − x⋆∥ ≤ ε using O(√κ log(1/ε)) gradient computations and O(N√κ log(1/ε)) matrix multiplications by K or K^T.

A straightforward corollary of Theorem 2 is as follows.
Corollary 1 (Tight version of Theorem 2). Set the parameters λ1, λ2, N, τ, η, θ, α to λ1 = λmax(W), λ2 = λ+min(W), N = ⌈√χ(W)⌉, and τ, η, θ, α as in Theorem 2. Then, for every ε > 0, Algorithm 1 finds x^k for which ∥x^k − x⋆∥ ≤ ε using O(√κ log(1/ε)) gradient computations and O(√(κχ(W)) log(1/ε)) matrix multiplications by K or K^T.

The complexity result given by Corollary 1 is summarized in Table 1. In particular, the complexity of Algorithm 1 matches the lower bounds of Theorem 1, both in terms of gradient computations and matrix multiplications.
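The Chebyshev subroutine of Algorithm 2 translates directly into code. The sketch below follows the pseudocode line by line; the diagonal test matrix and the tolerance are ours, chosen for illustration:

```python
import numpy as np

def chebyshev(z0, K, b, N, lam1, lam2):
    """Algorithm 2: N steps of the Chebyshev iteration for K z = b
    (equivalently W z = K^T b, with W = K^T K), where lam1 >= lam2 > 0
    are upper/lower bounds on the positive eigenvalues of W."""
    rho = (lam1 - lam2) ** 2 / 16.0
    nu = (lam1 + lam2) / 2.0
    gamma = -nu / 2.0
    p = -K.T @ (K @ z0 - b) / nu
    z = z0 + p
    for _ in range(N - 1):
        beta = rho / gamma
        gamma = -(nu + beta)
        p = (K.T @ (K @ z - b) + beta * p) / gamma
        z = z + p
    return z

# Small test system with known solution: W = diag(1, 4, 9), so chi(W) = 9.
K = np.diag([1.0, 2.0, 3.0])
x_star = np.array([1.0, -2.0, 0.5])
b = K @ x_star
zN = chebyshev(np.zeros(3), K, b, 30, 9.0, 1.0)
```

Since z^N − x⋆ = T̃_N(W)(z^0 − x⋆), with χ = 9 and N = 30 the error contracts by roughly 2ζ^N = 2((√χ − 1)/(√χ + 1))^N = 2 · 0.5^30, far below the 1e-6 checked here.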
In this section, we explain how we derive our main algorithm from the PAPC algorithm and prove Theorem 2, step by step. First, we derive the primal–dual optimality conditions associated with Problem (1).
First, note that argmin_{Kx=b} F(x) = argmin_x F(x) + ι{b}(Kx). Define the strongly convex function G: x ↦ F(x) + ι{b}(Kx). Then, 0 ∈ ∂G(x⋆) = ∇F(x⋆) + K^T ∂ι{b}(Kx⋆) [7, Theorem 16.47]. This means that there exists y⋆ ∈ ∂ι{b}(Kx⋆) such that 0 = ∇F(x⋆) + K^T y⋆. Besides, ∂ι{b}(Kx⋆) is nonempty if and only if Kx⋆ = b. Finally, the pair (x⋆, y⋆) must satisfy

    0 = ∇F(x⋆) + K^T y⋆
    0 = −Kx⋆ + b.    (3)

These equations are called the primal–dual optimality conditions; they are also the first-order conditions associated with the Lagrangian function L(x, y) := F(x) + ⟨Kx − b, y⟩ of Problem (1). Moreover, (x⋆, y⋆) is called an optimal primal–dual pair. If (x⋆, y⋆) is an optimal primal–dual pair, then (x⋆, y⋆ + ȳ), where ȳ ∈ ker(K^T), is also an optimal primal–dual pair. Thus, in the sequel, we denote by (x⋆, y⋆) the unique optimal primal–dual pair such that y⋆ ∈ range(K); that is, such that

    0 = ∇F(x⋆) + K^T y⋆,  y⋆ ∈ range(K)
    0 = −Kx⋆ + b.    (4)

We can note that the sequence of iterates (x^k, y^k) of the PAPC algorithm, shown in (2), converges linearly to (x⋆, y⋆) [34, Theorem 8], as reported in Table 1.

The first step to derive Algorithm 1 is to propose a variant of the PAPC algorithm (2) using Nesterov's acceleration [31]. Nesterov's acceleration does not apply to primal–dual algorithms in general, but in our case, we manage to apply it. This intermediate algorithm is detailed below.
Algorithm 3
(Intermediate algorithm)

Parameters: x^0 ∈ X, y^0 = 0_Y, η, θ, α > 0, τ ∈ (0, 1]
Set x_f^0 = x^0
for k = 0, 1, 2, ... do
    x_g^k := τ x^k + (1 − τ) x_f^k
    x^{k+1/2} := (1 + ηα)^{−1} (x^k − η(∇F(x_g^k) − α x_g^k + K^T y^k))
    y^{k+1} := y^k + θ(K x^{k+1/2} − b)
    x^{k+1} := (1 + ηα)^{−1} (x^k − η(∇F(x_g^k) − α x_g^k + K^T y^{k+1}))
    x_f^{k+1} := x_g^k + (2τ/(2 − τ))(x^{k+1} − x^k)
end for

The convergence of Algorithm 3 is stated in Proposition 1.
Proposition 1 (Convergence of Algorithm 3). Consider λ1 ≥ λmax(W) and λ2 such that 0 < λ2 ≤ λ+min(W). Denote χ := λ1/λ2. Set the parameters of Algorithm 3 as τ := min{1, c1/√(κχ)}, η := 1/(c2 τL), θ := 1/(ηλ1), and α := µ, where c1, c2 > 0 are the absolute constants specified in the Appendix. Then,

    (1/η)∥x^k − x⋆∥² + (1/θ)∥y^k − y⋆∥² + (2(1 − τ)/τ) D_F(x_f^k, x⋆) ≤ (1 + min{c3/√(κχ), c4/χ})^{−k} C,    (5)

for absolute constants c3, c4 > 0, where C := (1/η)∥x^0 − x⋆∥² + (1/θ)∥y^0 − y⋆∥² + (2(1 − τ)/τ) D_F(x_f^0, x⋆).

Proposition 1 states the linear convergence of the distance between the iterates and the primal–dual optimal point. In particular, if λ1 = λmax(W) and λ2 = λ+min(W), then ∥x^k − x⋆∥ ≤ ε after

    O((√(κχ(W)) + χ(W)) log(1/ε))

gradient computations and matrix multiplications. Besides, Proposition 1 states the linear convergence of the Bregman divergence of F. Using (4), one can check that the Bregman divergence of F is equal to the restricted primal–dual gap; in particular, D_F(x_f^k, x⋆) = L(x_f^k, y⋆) − L(x⋆, y^k).

The proof of Proposition 1 is provided in the Appendix. The main tool of the proof is the following representation of Algorithm 3. We denote by Q the (d + p) × (d + p) matrix defined blockwise by

    Q := [ (1/η) I_X , 0 ; 0 , (1/θ) I_Y − (η/(1 + ηα)) K K^T ],    (6)

where I_X (resp. I_Y) is the identity matrix over X (resp. Y).

Lemma 1.
The following equality holds:

    Q [ x^{k+1} − x^k ; y^{k+1} − y^k ] = [ α(x_g^k − x^{k+1}) − (∇F(x_g^k) + K^T y^{k+1}) ; K x^{k+1} − b ].    (7)

Lemma 1, proven in the Appendix, enables us to view Algorithm 3 as a variant of the Forward–Backward algorithm involving monotone operators; see [7, Section 26.14] or [15] for more details. The Forward–Backward algorithm is a fixed-point algorithm. For instance, one can see in Equation (7) that a fixed point (x^k, y^k) = (x⋆, y⋆) is a solution to (3). Hence, Algorithm 3 can be viewed as an accelerated primal–dual fixed-point algorithm.

Our main Algorithm 1 is obtained as a particular instantiation of Algorithm 3. More precisely, we apply Chebyshev's acceleration [2, 4] in a way similar to how it was applied in the particular setting of decentralized optimization [35, 36].

Consider a polynomial P such that, for every eigenvalue t of W, P(t) ≥ 0 and P(t) = 0 ⇔ t = 0. Since Kx⋆ = b,

    Kx = b ⇔ K(x − x⋆) = 0 ⇔ K^T K(x − x⋆) = 0 ⇔ W(x − x⋆) = 0 ⇔ P(W)(x − x⋆) = 0 ⇔ √P(W)(x − x⋆) = 0 ⇔ √P(W) x = √P(W) x⋆.

Therefore, the problem

    min_{x ∈ X} F(x)  s.t.  √P(W) x = √P(W) x⋆,    (8)

is equivalent to Problem (1). Consequently, to solve Problem (1), one can apply Algorithm 3 by replacing K with √P(W) and b with √P(W) x⋆; we will see below that x⋆ is not needed in the computations, only b is. Since √P(W) is symmetric, this leads to the following algorithm:

    x_g^k := τ x^k + (1 − τ) x_f^k
    x^{k+1/2} := (1 + ηα)^{−1} (x^k − η(∇F(x_g^k) − α x_g^k + √P(W) y^k))
    y^{k+1} := y^k + θ(√P(W) x^{k+1/2} − √P(W) x⋆)    (9)
    x^{k+1} := (1 + ηα)^{−1} (x^k − η(∇F(x_g^k) − α x_g^k + √P(W) y^{k+1}))
    x_f^{k+1} := x_g^k + (2τ/(2 − τ))(x^{k+1} − x^k).

After applying the change of variable u^k := √P(W) y^k, we get:

    x_g^k := τ x^k + (1 − τ) x_f^k
    x^{k+1/2} := (1 + ηα)^{−1} (x^k − η(∇F(x_g^k) − α x_g^k + u^k))
    u^{k+1} := u^k + θ(P(W) x^{k+1/2} − P(W) x⋆)    (10)
    x^{k+1} := (1 + ηα)^{−1} (x^k − η(∇F(x_g^k) − α x_g^k + u^{k+1}))
    x_f^{k+1} := x_g^k + (2τ/(2 − τ))(x^{k+1} − x^k).

To obtain Algorithm 1 and Theorem 2, we have to choose a suitable polynomial P and show how to compute P(W) x^{k+1/2} − P(W) x⋆ efficiently.

The goal is to make Problem (8) better conditioned than Problem (1). For this, we want P to cluster all the positive eigenvalues of W around the same value, say 1 (the scaling of P does not matter, since it is compensated by the stepsizes). To that aim, the best choice is to set P as 1 minus a Chebyshev polynomial of appropriate degree [4, Theorem 6.1]. More precisely, let T_n be the Chebyshev polynomial of the first kind of degree n ≥ 0, which is such that {T_n(t) : t ∈ [−1, 1]} = [−1, 1]. Let λ1 ≥ λmax(W) and 0 < λ2 ≤ λ+min(W) be upper and lower bounds of the positive eigenvalues of W. Set χ := λ1/λ2 ≥ χ(W) ≥ 1. If λ1 = λ2, no preconditioning is necessary and we could just set P(W) = W. So, let us assume that λ2 < λ1 (the derivations can be shown to remain valid if λ2 = λ1).

For every n ≥ 1, we define the shifted Chebyshev polynomial T̃_n as

    T̃_n(t) = T_n((λ1 + λ2 − 2t)/(λ1 − λ2)) / T_n((λ1 + λ2)/(λ1 − λ2)).    (11)

Then, for every n ≥ 1, T̃_n(0) = 1, T̃_n(t) decreases monotonically for t ∈ [0, λ2], and

    max_{t ∈ [λ2, λ1]} |T̃_n(t)| = 1 / T_n((λ1 + λ2)/(λ1 − λ2)) = 2ζ^n/(1 + ζ^{2n}) < 1,  where  ζ = (√χ − 1)/(√χ + 1) < 1,    (12)

see [4, Corollary 6.1]. Hence, if N ≥ √χ, then

    max_{t ∈ [λ2, λ1]} |T̃_N(t)| ≤ 2e^{−2}/(1 + e^{−4}) < 4/15.    (13)

Indeed, −1/ln((t − 1)/(t + 1)) < t/2 for every t ≥ 1; therefore, by setting t = √χ, we obtain N ≥ √χ ⇒ N > −2/ln(ζ) ⇒ ζ^N < e^{−2} ⇒ 2ζ^N/(1 + ζ^{2N}) ≤ 2e^{−2}/(1 + e^{−4}) < 4/15. We thus set

    P := 1 − T̃_N    (14)

for some N ≥ √χ. Then, we have

    λmax(P(W)) ≤ max_{t ∈ [λ2, λ1]} P(t) ≤ 1 + max_{t ∈ [λ2, λ1]} |T̃_N(t)| ≤ 19/15,
    λ+min(P(W)) ≥ min_{t ∈ [λ2, λ1]} P(t) ≥ 1 − max_{t ∈ [λ2, λ1]} |T̃_N(t)| ≥ 11/15,
    χ(P(W)) ≤ 19/11.

Computing P(W)x − P(W)x⋆ without knowing x⋆. There remains to show how to compute P(W)x − P(W)x⋆, for any x ∈ X. Consider N ≥ √χ and P defined in (14). We now show that Algorithm 1 is equivalent to the iterations (10). Inspecting the iterations (10) and Algorithm 1, it is sufficient to prove that, for every x ∈ X,

    P(W)x − P(W)x⋆ = x − Chebyshev(x, K, b, N, λ1, λ2).    (15)

The vector z^N = Chebyshev(x, K, b, N, λ1, λ2) is the Nth iterate of the classical Chebyshev iteration for solving the linear system Kz = b, or equivalently Wz = K^T b, starting from the initial guess z^0 = x and using the recurrence relation of the Chebyshev polynomials T̃_N; see Algorithm 4 in [22] (among the several recurrence relations that can be used to compute T̃_n, we chose this one because it is proved to be numerically stable). In other words, z^N satisfies

    K^T(K z^N − b) = T̃_N(W)(K^T(K z^0 − b)),    (16)

so that ∥K z^N − b∥ converges linearly to zero as N → +∞. Since T̃_N(0) = 1, there exists a polynomial R̃_N such that T̃_N(X) = 1 + X R̃_N(X). Therefore,

    K^T(K z^N − b) = K^T(K z^0 − b) + W R̃_N(W)(K^T(K z^0 − b)),  i.e.,  W z^N = W(z^0 + R̃_N(W)(K^T(K z^0 − b))).

One can check by induction that z^N ∈ z^0 + range(W). Using I_X + W R̃_N(W) = T̃_N(W),

    z^N = z^0 + R̃_N(W)(K^T(K z^0 − b)) = z^0 + W R̃_N(W) z^0 − R̃_N(W) K^T b = T̃_N(W) z^0 − R̃_N(W) K^T b = T̃_N(W) z^0 − R̃_N(W) W x⋆ = T̃_N(W) z^0 − T̃_N(W) x⋆ + x⋆.

Finally, for every z^0 ∈ X,

    P(W) z^0 − P(W) x⋆ = z^0 − T̃_N(W) z^0 − x⋆ + T̃_N(W) x⋆ = z^0 − z^N.

In Section 6.3.2, we proved that Algorithm 3 applied to the equivalent Problem (8) is equivalent to our main Algorithm 1. Therefore, Theorem 2 follows from a straightforward application of Proposition 1 to Problem (8), using that N ≥ √χ implies λmax(P(W)) ≤ 19/15, λ+min(P(W)) ≥ 11/15 and χ(P(W)) ≤ 19/11; see Section 6.3.1.
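The key identity (15) can be checked numerically on a small diagonal example, where T̃_N(W) is evaluated eigenvalue-wise. The Chebyshev subroutine of Algorithm 2 is re-implemented here so that the snippet is self-contained; the sizes and test vectors are ours:

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb_poly

def chebyshev(z0, K, b, N, lam1, lam2):
    # Algorithm 2 (Chebyshev iteration) for K z = b.
    rho = (lam1 - lam2) ** 2 / 16.0
    nu = (lam1 + lam2) / 2.0
    gamma = -nu / 2.0
    p = -K.T @ (K @ z0 - b) / nu
    z = z0 + p
    for _ in range(N - 1):
        beta = rho / gamma
        gamma = -(nu + beta)
        p = (K.T @ (K @ z - b) + beta * p) / gamma
        z = z + p
    return z

def T(n, t):
    # Chebyshev polynomial of the first kind T_n evaluated at t.
    c = np.zeros(n + 1)
    c[n] = 1.0
    return cheb_poly.chebval(t, c)

N, lam1, lam2 = 5, 9.0, 1.0
K = np.diag([1.0, 2.0, 3.0])             # W = diag(1, 4, 9)
w = np.array([1.0, 4.0, 9.0])            # eigenvalues of W
x_star = np.array([0.3, -1.2, 2.0])
b = K @ x_star
z = np.array([1.0, 0.5, -0.7])

# Left-hand side of (15): z - Chebyshev(z, K, b, N, lam1, lam2).
lhs = z - chebyshev(z, K, b, N, lam1, lam2)
# Right-hand side: P(W)(z - x_star) with P = 1 - T~_N, evaluated eigenvalue-wise.
t_shift = T(N, (lam1 + lam2 - 2 * w) / (lam1 - lam2)) / T(N, (lam1 + lam2) / (lam1 - lam2))
rhs = (1.0 - t_shift) * (z - x_star)
```

Both sides agree to machine precision, even though the left-hand side never touches x⋆: the subroutine only uses b.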
We illustrate the performance of our Algorithm 1 in a compressed-sensing-type experiment: we want to recover a sparse vector x♯ ∈ X = R^d, with d = 1000, having 50 randomly chosen nonzero elements (equal to 1), from b = K x♯ ∈ Y = R^p, with p = 250, where K has random i.i.d. Gaussian elements and its nonzero singular values are modified so that they span the interval [1/√χ, 1] for some prescribed value of χ. Solving Problem (1) with F the ℓ1 norm yields perfect reconstruction with x⋆ = x♯. Thus, without pretending in any way that this is the best way to solve this estimation problem, we solve Problem (1) with F an L-smooth and µ-strongly convex approximation of the ℓ1 norm: we set F: x = (x_i)_{i=1}^d ∈ X ↦ Σ_{i=1}^d f(x_i), with f: t ∈ R ↦ √(t² + e²) + (e/2)t² for some e > 0, so that L = 1/e + e, µ = e, and κ = L/µ = 1 + 1/e². So, given a prescribed value of κ, we set e = √(1/(κ − 1)).

The plots in Figure 1 correspond to χ = 10 and κ = 10; other values gave similar plots. The computation time is roughly the same as the number of calls to K and K^T here. Both algorithms converge linearly, but Algorithm 1 has a much better rate, which corresponds visually to the slope of the curves in Figure 1.

Figure 1: Error ∥x − x⋆∥ with respect to the number of calls to K and K^T to obtain x (left), and to the number of calls to ∇F, equal to the number k of iterations, to obtain x = x^k (right), for the proposed Algorithm 1 and the PAPC algorithm.

References

[1] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods.
The Journal of Machine Learning Research, 18(1):8194–8244, 2017.
[2] M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
[3] Y. Arjevani, J. Bruna, B. Can, M. Gürbüzbalaban, S. Jegelka, and H. Lin. IDEAL: Inexact decentralized accelerated augmented Lagrangian method. arXiv preprint arXiv:2006.06733, 2020.
[4] W. Auzinger. Iterative solution of large linear systems. Lecture notes, 2011.
[5] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn., 4(1):1–106, 2012.
[6] H. H. Bauschke, R. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, editors. Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer-Verlag, New York, 2010.
[7] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, 2nd edition, 2017.
[8] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[9] R. I. Boţ, E. R. Csetnek, and C. Hendrich. Recent developments on primal–dual splitting methods with applications to convex minimization. In P. M. Pardalos and T. M. Rassias, editors, Mathematics Without Boundaries: Surveys in Interdisciplinary Research, pages 57–99. Springer New York, 2014.
[10] A. Chambolle and C. Dossal. On the convergence of the iterates of the "Fast Iterative Shrinkage/Thresholding Algorithm". J. Optim. Theory Appl., 166:968–982, 2015.
[11] A. Chambolle and T. Pock. An introduction to continuous optimization for imaging. Acta Numerica, 25:161–319, 2016.
[12] P. Chen, J. Huang, and X. Zhang. A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Problems, 29(2), 2013.
[13] P. L. Combettes and J.-C. Pesquet. Primal–dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum type monotone operators. Set-Val. Var. Anal., 20(2):307–330, 2012.
[14] L. Condat. A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optim. Theory Appl., 158(2):460–479, 2013.
[15] L. Condat, D. Kitahara, A. Contreras, and A. Hirabayashi. Proximal splitting algorithms: A tour of recent advances, with new twists. preprint arXiv:1912.00137, 2019.
[16] L. Condat, G. Malinovsky, and P. Richtárik. Distributed proximal splitting algorithms with rates and acceleration. preprint arXiv:2010.00952, 2020.
[17] Y. Drori, S. Sabach, and M. Teboulle. A simple algorithm for a class of nonsmooth convex-concave saddle-point problems. Oper. Res. Lett., 43(2):209–214, 2015.
[18] D. Dvinskikh and A. Gasnikov. Decentralized and parallel primal and dual accelerated methods for stochastic convex programming problems. Journal of Inverse and Ill-posed Problems, 2021. In press.
[19] R. Glowinski, S. J. Osher, and W. Yin, editors. Splitting Methods in Communication, Imaging, Science, and Engineering. Springer International Publishing, 2016.
[20] T. Goldstein and X. Zhang. Operator splitting methods in compressive sensing and sparse approximation. In R. Glowinski, S. J. Osher, and W. Yin, editors, Splitting Methods in Communication, Imaging, Science, and Engineering, pages 301–343, Cham, 2016. Springer International Publishing.
[21] E. Gorbunov, D. Dvinskikh, and A. Gasnikov. Optimal decentralized distributed algorithms for stochastic convex optimization. arXiv preprint arXiv:1911.07363, 2019.
[22] M. H. Gutknecht and S. Röllin. The Chebyshev iteration revisited. Parallel Computing, 28:263–283, 2002.
[23] N. Keriven, A. Bourrier, R. Gribonval, and P. Pérez. Sketching for large-scale learning of mixture models. Information and Inference: A Journal of the IMA, 7(3):447–508, 2018.
[24] N. Komodakis and J.-C. Pesquet. Playing with duality: An overview of recent primal–dual approaches for solving large-scale optimization problems. IEEE Signal Process. Mag., 32(6):31–54, November 2015.
[25] D. Kovalev, A. Salim, and P. Richtárik. Optimal and practical algorithms for smooth and strongly convex decentralized optimization. In Proc. of Conf. on Neural Information Processing Systems (NeurIPS), 2020.
[26] H. Li, C. Fang, W. Yin, and Z. Lin. Decentralized accelerated gradient methods with increasing penalty parameters. IEEE Transactions on Signal Processing, 68:4855–4870, 2020.
[27] H. Li, Z. Lin, and Y. Fang. Optimal accelerated variance reduced EXTRA and DIGing for strongly convex and smooth decentralized optimization. arXiv preprint arXiv:2009.04373, 2020.
[28] I. Loris and C. Verhoeven. On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty. Inverse Problems, 27(12), 2011.
[29] K. Mishchenko and P. Richtárik. A stochastic decoupling method for minimizing the sum of smooth and non-smooth functions. preprint arXiv:1905.11535, 2019.
[30] A. Nedic, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
[31] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publisher, Dordrecht, The Netherlands, 2004.
[32] G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5–6):355–607, 2019.
[33] N. G. Polson, J. G. Scott, and B. T. Willard. Proximal algorithms in statistics and machine learning. Statist. Sci., 30(4):559–581, 2015.
[34] A. Salim, L. Condat, K. Mishchenko, and P. Richtárik. Dualize, split, randomize: Fast nonsmooth optimization algorithms. preprint arXiv:2004.02635, 2020.
[35] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, pages 3027–3036, 2017.
[36] K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems, pages 2740–2749, 2018.
[37] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim., 25(2):944–966, 2015.
[38] S. Sra, S. Nowozin, and S. J. Wright.
Optimization for Machine Learning . The MIT Press,2011.[39] G. Stathopoulos, H. Shukla, A. Szucs, Y. Pu, and C. N. Jones. Operator splitting methods incontrol.
Foundations and Trends in Systems and Control , 3(3):249–362, 2016.[40] B. C. V˜u. A splitting algorithm for dual monotone inclusions involving cocoercive operators.
Adv. Comput. Math. , 38(3):667–681, April 2013.[41] M. Yan. A new primal-dual algorithm for minimizing the sum of three functions with a linearoperator.
J. Sci. Comput. , 76(3):1698–1717, September 2018.[42] H. Ye, L. Luo, Z. Zhou, and T. Zhang. Multi-consensus decentralized accelerated gradientdescent. arXiv preprint arXiv:2005.00797 , 2020.[43] M. Zargham, A. Ribeiro, A. Ozdaglar, and A. Jadbabaie. Accelerated dual descent for networkflow optimization.
IEEE Trans. Automat. Contr. , 59(4):905–920, 2013.15 ppendix
Contents

6.3.2 Efficient Computation of $P(W)x - P(W)x^\star$ without Knowing $x^\star$
6.4 Proof of Theorem 2
A.1 Preliminary Lemmas
A.2 End of the Proof of Proposition 1
A Proof of Proposition 1
We denote by $\|\cdot\|_Q$ (resp. $\langle\cdot,\cdot\rangle_Q$) the norm (resp. inner product) induced by $Q$ (defined in (6)). The norm $\|\cdot\|_Q$ satisfies the following.

A.1 Preliminary Lemmas
Lemma 2.
If parameters $\eta>0$ and $\theta>0$ satisfy
$$\eta\theta\lambda_{\max}(W)\le 1,\qquad(17)$$
and if $\alpha>0$, then the symmetric matrix $Q$ is positive definite and for all $x\in X$, $y\in Y$, the following inequality holds:
$$\frac{1}{\eta}\|x\|^2\;\le\;\frac{1}{\eta}\|x\|^2+\frac{\eta\alpha}{\theta(1+\eta\alpha)}\|y\|^2\;\le\;\left\|\begin{bmatrix}x\\y\end{bmatrix}\right\|_Q^2\;\le\;\frac{1}{\eta}\|x\|^2+\frac{1}{\theta}\|y\|^2.\qquad(18)$$

Proof.
The nonzero eigenvalues of $W=K^TK$ are the nonzero eigenvalues of $KK^T$, therefore $\lambda_{\max}(W)=\lambda_{\max}(KK^T)$. Consequently, using (17),
$$\frac{\eta}{1+\eta\alpha}\|K^Ty\|^2\;\le\;\frac{\eta}{1+\eta\alpha}\lambda_{\max}(W)\|y\|^2\;\le\;\frac{\|y\|^2}{\theta(1+\eta\alpha)}.$$
Therefore, since $\alpha\eta>0$,
$$\frac{1}{\eta}\|x\|^2+\left(\frac{1}{\theta}-\frac{1}{\theta(1+\eta\alpha)}\right)\|y\|^2\;\le\;\left\|\begin{bmatrix}x\\y\end{bmatrix}\right\|_Q^2=\frac{1}{\eta}\|x\|^2+\frac{1}{\theta}\|y\|^2-\frac{\eta}{1+\eta\alpha}\|K^Ty\|^2,$$
which proves in particular that $Q$ is positive definite, since $\frac{1}{\theta}-\frac{1}{\theta(1+\eta\alpha)}=\frac{\eta\alpha}{\theta(1+\eta\alpha)}>0$.

Besides, lines 5 to 7 of Algorithm 3 admit the following representation, which is at the core of the convergence proof.

Lemma 3.
The following equality holds:
$$Q\begin{bmatrix}x^{k+1}-x^k\\y^{k+1}-y^k\end{bmatrix}=\begin{bmatrix}\alpha(x_g^k-x^{k+1})-(\nabla F(x_g^k)+K^Ty^{k+1})\\Kx^{k+1}-b\end{bmatrix}.\qquad(19)$$

Proof.
Using the definition of $Q$, we have
$$Q\begin{bmatrix}x^{k+1}-x^k\\y^{k+1}-y^k\end{bmatrix}=\begin{bmatrix}\frac{1}{\eta}(x^{k+1}-x^k)\\\frac{1}{\theta}(y^{k+1}-y^k)-\frac{\eta}{1+\eta\alpha}KK^T(y^{k+1}-y^k)\end{bmatrix}.$$
From line 7 of Algorithm 3 it follows that
$$\frac{1}{\eta}(x^{k+1}-x^k)=\alpha(x_g^k-x^{k+1})-(\nabla F(x_g^k)+K^Ty^{k+1}),$$
and from line 6 of Algorithm 3,
$$y^{k+1}-y^k=\theta(Kx^{k+1/2}-b).$$
Hence,
$$Q\begin{bmatrix}x^{k+1}-x^k\\y^{k+1}-y^k\end{bmatrix}=\begin{bmatrix}\alpha(x_g^k-x^{k+1})-(\nabla F(x_g^k)+K^Ty^{k+1})\\(Kx^{k+1/2}-b)-\frac{\eta}{1+\eta\alpha}KK^T(y^{k+1}-y^k)\end{bmatrix}.$$
From lines 5 and 7 of Algorithm 3,
$$x^{k+1}-x^{k+1/2}=-\frac{\eta}{1+\eta\alpha}K^T(y^{k+1}-y^k),\qquad(20)$$
therefore
$$(Kx^{k+1/2}-b)-\frac{\eta}{1+\eta\alpha}KK^T(y^{k+1}-y^k)=K\left(x^{k+1/2}-\frac{\eta}{1+\eta\alpha}K^T(y^{k+1}-y^k)\right)-b=Kx^{k+1}-b.$$
Finally,
$$Q\begin{bmatrix}x^{k+1}-x^k\\y^{k+1}-y^k\end{bmatrix}=\begin{bmatrix}\alpha(x_g^k-x^{k+1})-(\nabla F(x_g^k)+K^Ty^{k+1})\\Kx^{k+1}-b\end{bmatrix}.$$

We now start the proof of Proposition 1.
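The spectral facts behind Lemmas 2 and 3 are easy to sanity-check numerically. The sketch below is not part of the paper: the matrix sizes and the values of $\eta$ and $\alpha$ are arbitrary choices satisfying (17). It verifies that $\lambda_{\max}(K^TK)=\lambda_{\max}(KK^T)$, that the block-diagonal matrix $Q$ read off from the proof of Lemma 3 is positive definite, and that the sandwich inequality (18) holds.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 8, 5
K = rng.standard_normal((p, d))
W = K.T @ K

# Nonzero eigenvalues of K^T K and K K^T coincide (used in the proof of Lemma 2).
lam_W = np.linalg.eigvalsh(W)
lam_KKt = np.linalg.eigvalsh(K @ K.T)
assert np.isclose(lam_W.max(), lam_KKt.max())

# Hypothetical parameters; theta is chosen so that eta * theta * lam_max(W) = 1, i.e. (17) holds.
eta, alpha = 0.1, 0.5
theta = 1.0 / (eta * lam_W.max())

# Block-diagonal Q, as read off from the proof of Lemma 3.
Q = np.block([
    [np.eye(d) / eta, np.zeros((d, p))],
    [np.zeros((p, d)), np.eye(p) / theta - (eta / (1 + eta * alpha)) * (K @ K.T)],
])
assert np.linalg.eigvalsh(Q).min() > 0  # Q is positive definite (Lemma 2)

# Check the sandwich inequality (18) on random points.
for _ in range(100):
    x, y = rng.standard_normal(d), rng.standard_normal(p)
    q = x @ x / eta + y @ y / theta - (eta / (1 + eta * alpha)) * np.linalg.norm(K.T @ y) ** 2
    lower = x @ x / eta + (eta * alpha / (theta * (1 + eta * alpha))) * (y @ y)
    upper = x @ x / eta + y @ y / theta
    assert lower <= q + 1e-12 <= upper + 1e-12
```

The loop exercises exactly the chain of inequalities proved in Lemma 2, with the quadratic form $\|\cdot\|_Q^2$ expanded as in the proof.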
Lemma 4.
Let $\alpha$ satisfy $0<\alpha\le\mu$. Then the following inequality holds:
$$-\frac{1}{2\eta}\|x^{k+1}-x^k\|^2\;\le\;-\frac{\eta}{4}\|K^Ty^{k+1}-K^Ty^\star\|^2+\eta\alpha^2\|x^{k+1}-x^\star\|^2+2\eta L\,D_F(x_g^k,x^\star).\qquad(21)$$

Proof.
From line 7 of Algorithm 3 and the optimality condition (4), $\nabla F(x^\star)+K^Ty^\star=0$, it follows that
$$\|x^{k+1}-x^k\|=\big\|\eta(K^Ty^{k+1}-K^Ty^\star)+\eta\big(\nabla F(x_g^k)-\nabla F(x^\star)-\alpha(x_g^k-x^\star)\big)+\eta\alpha(x^{k+1}-x^\star)\big\|.$$
Using $\|a+b+c\|^2\ge\frac12\|a\|^2-2\|b\|^2-2\|c\|^2$, we obtain
$$\|x^{k+1}-x^k\|^2\ge\frac{\eta^2}{2}\|K^Ty^{k+1}-K^Ty^\star\|^2-2\eta^2\alpha^2\|x^{k+1}-x^\star\|^2-2\eta^2\|\nabla F(x_g^k)-\nabla F(x^\star)-\alpha(x_g^k-x^\star)\|^2.$$
Denote $\bar F(x)=F(x)-\frac{\alpha}{2}\|x\|^2$. The function $\bar F$ is a convex and $(L-\alpha)$-smooth function, therefore $\|\nabla\bar F(x)-\nabla\bar F(x')\|^2\le 2(L-\alpha)D_{\bar F}(x,x')$. Therefore, we can lower bound the last term and get
$$\begin{aligned}\|x^{k+1}-x^k\|^2&\ge\frac{\eta^2}{2}\|K^Ty^{k+1}-K^Ty^\star\|^2-2\eta^2\alpha^2\|x^{k+1}-x^\star\|^2-4\eta^2(L-\alpha)D_{\bar F}(x_g^k,x^\star)\\&\ge\frac{\eta^2}{2}\|K^Ty^{k+1}-K^Ty^\star\|^2-2\eta^2\alpha^2\|x^{k+1}-x^\star\|^2-4\eta^2 L\,D_F(x_g^k,x^\star).\end{aligned}$$
Rearranging and dividing by $2\eta$ concludes the proof.

Our last lemma states the linear convergence of a Lyapunov function to zero.

Lemma 5. Consider $\lambda_1\ge\lambda_{\max}(W)$ and $0<\lambda_2\le\lambda_{\min}^+(W)$. Let parameter $\eta$ be defined by
$$\eta=\frac{1}{4\tau L}.\qquad(22)$$
Let parameter $\theta$ be defined by
$$\theta=\frac{1}{\eta\lambda_1}.\qquad(23)$$
Let parameter $\alpha$ be defined by
$$\alpha=\mu.\qquad(24)$$
Let parameter $\tau$ be defined by
$$\tau=\min\left\{1,\frac{1}{2}\sqrt{\frac{\mu\lambda_1}{L\lambda_2}}\right\}.\qquad(25)$$
Let $\Psi^k$ be the following Lyapunov function:
$$\Psi^k=\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star).\qquad(26)$$
Then the following inequality holds:
$$\Psi^{k+1}\le\left(1+\frac14\min\left\{\sqrt{\frac{\mu\lambda_2}{L\lambda_1}},\frac{\lambda_2}{\lambda_1}\right\}\right)^{-1}\Psi^k.\qquad(27)$$

Proof. We start from the identity
$$\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2=\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\left\|\begin{bmatrix}x^{k+1}-x^k\\y^{k+1}-y^k\end{bmatrix}\right\|_Q^2+2\left\langle\begin{bmatrix}x^{k+1}-x^k\\y^{k+1}-y^k\end{bmatrix},\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\rangle_Q.$$
Note that stepsize $\eta$ defined by (22) and stepsize $\theta$ defined by (23) satisfy (17), hence inequality (18) holds.
Using (18) and (19) we get
$$\begin{aligned}\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\le{}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\frac{1}{\eta}\|x^{k+1}-x^k\|^2+2\left\langle\begin{bmatrix}\alpha(x_g^k-x^{k+1})-(\nabla F(x_g^k)+K^Ty^{k+1})\\Kx^{k+1}-b\end{bmatrix},\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\rangle\\={}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\frac{1}{\eta}\|x^{k+1}-x^k\|^2+2\alpha\langle x_g^k-x^{k+1},x^{k+1}-x^\star\rangle\\&-2\left\langle\begin{bmatrix}\nabla F(x_g^k)+K^Ty^{k+1}\\-Kx^{k+1}+b\end{bmatrix}-\begin{bmatrix}\nabla F(x^\star)+K^Ty^\star\\-Kx^\star+b\end{bmatrix},\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\rangle,\end{aligned}$$
using the primal–dual optimality conditions (4). Rewrite
$$\begin{bmatrix}\nabla F(x_g^k)+K^Ty^{k+1}\\-Kx^{k+1}+b\end{bmatrix}-\begin{bmatrix}\nabla F(x^\star)+K^Ty^\star\\-Kx^\star+b\end{bmatrix}=\begin{bmatrix}\nabla F(x_g^k)-\nabla F(x^\star)\\0\end{bmatrix}+\begin{bmatrix}0&K^T\\-K&0\end{bmatrix}\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}.\qquad(28)$$
Using $\langle Az,z\rangle=0$ for any skew-symmetric matrix $A$, we obtain
$$-2\left\langle\begin{bmatrix}\nabla F(x_g^k)+K^Ty^{k+1}\\-Kx^{k+1}+b\end{bmatrix}-\begin{bmatrix}\nabla F(x^\star)+K^Ty^\star\\-Kx^\star+b\end{bmatrix},\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\rangle=-2\langle\nabla F(x_g^k)-\nabla F(x^\star),x^{k+1}-x^\star\rangle.\qquad(29)$$
Hence,
$$\begin{aligned}\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\le{}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\frac{1}{\eta}\|x^{k+1}-x^k\|^2+2\alpha\langle x_g^k-x^{k+1},x^{k+1}-x^\star\rangle-2\langle\nabla F(x_g^k)-\nabla F(x^\star),x^{k+1}-x^\star\rangle\\={}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\frac{1}{\eta}\|x^{k+1}-x^k\|^2-2\alpha\|x^{k+1}-x^\star\|^2+2\alpha\langle x_g^k-x^\star,x^{k+1}-x^\star\rangle\\&-2\langle\nabla F(x_g^k)-\nabla F(x^\star),x^{k+1}-x^\star\rangle.\end{aligned}$$
Using Young's inequality $2\langle a,b\rangle\le\|a\|^2+\|b\|^2$ we get
$$\begin{aligned}\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\le{}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\frac{1}{\eta}\|x^{k+1}-x^k\|^2-2\alpha\|x^{k+1}-x^\star\|^2+\alpha\|x_g^k-x^\star\|^2+\alpha\|x^{k+1}-x^\star\|^2\\&-2\langle\nabla F(x_g^k)-\nabla F(x^\star),x^{k+1}-x^\star\rangle\\={}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\frac{1}{\eta}\|x^{k+1}-x^k\|^2-\alpha\|x^{k+1}-x^\star\|^2+\alpha\|x_g^k-x^\star\|^2\\&-2\langle\nabla F(x_g^k)-\nabla F(x^\star),x^{k+1}-x^\star\rangle.\end{aligned}$$
Using line 4 of Algorithm 3 we have $x^k-x^\star=(x_g^k-x^\star)+\frac{1-\tau}{\tau}(x_g^k-x_f^k)$, and using line 8, $x^{k+1}-x^k=\frac{2-\tau}{2\tau}(x_f^{k+1}-x_g^k)$. Therefore, decomposing $x^{k+1}-x^\star=(x^{k+1}-x^k)+(x^k-x^\star)$,
$$\begin{aligned}\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\le{}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\alpha\|x^{k+1}-x^\star\|^2+\alpha\|x_g^k-x^\star\|^2-\frac{1}{2\eta}\|x^{k+1}-x^k\|^2\\&-\frac{2-\tau}{\tau}\left(\langle\nabla F(x_g^k)-\nabla F(x^\star),x_f^{k+1}-x_g^k\rangle+\frac{1}{2\eta}\,\frac{2-\tau}{4\tau}\|x_f^{k+1}-x_g^k\|^2\right)\\&-2\langle\nabla F(x_g^k)-\nabla F(x^\star),x_g^k-x^\star\rangle+\frac{2(1-\tau)}{\tau}\langle\nabla F(x_g^k)-\nabla F(x^\star),x_f^k-x_g^k\rangle.\end{aligned}$$
Since parameter $\eta$ defined by (22) satisfies $\eta\le\frac{2-\tau}{4\tau L}$, we get
$$\begin{aligned}\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\le{}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\alpha\|x^{k+1}-x^\star\|^2+\alpha\|x_g^k-x^\star\|^2-\frac{1}{2\eta}\|x^{k+1}-x^k\|^2\\&-\frac{2-\tau}{\tau}\left(\langle\nabla F(x_g^k)-\nabla F(x^\star),x_f^{k+1}-x_g^k\rangle+\frac{L}{2}\|x_f^{k+1}-x_g^k\|^2\right)\\&-2\langle\nabla F(x_g^k)-\nabla F(x^\star),x_g^k-x^\star\rangle+\frac{2(1-\tau)}{\tau}\langle\nabla F(x_g^k)-\nabla F(x^\star),x_f^k-x_g^k\rangle.\end{aligned}$$
Using $\mu$-strong convexity and $L$-smoothness of $F$ we get
$$\begin{aligned}\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\le{}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\alpha\|x^{k+1}-x^\star\|^2+\alpha\|x_g^k-x^\star\|^2-\frac{1}{2\eta}\|x^{k+1}-x^k\|^2\\&-\frac{2-\tau}{\tau}\big(D_F(x_f^{k+1},x^\star)-D_F(x_g^k,x^\star)\big)+\frac{2(1-\tau)}{\tau}\big(D_F(x_f^k,x^\star)-D_F(x_g^k,x^\star)\big)\\&-2\big(D_F(x_g^k,x^\star)+\tfrac{\mu}{2}\|x_g^k-x^\star\|^2\big)\\={}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\alpha\|x^{k+1}-x^\star\|^2+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star)-\frac{2-\tau}{\tau}D_F(x_f^{k+1},x^\star)\\&+(\alpha-\mu)\|x_g^k-x^\star\|^2-\frac{1}{2\eta}\|x^{k+1}-x^k\|^2-D_F(x_g^k,x^\star).\end{aligned}$$
Now, we define $\delta=\min\left\{1,\frac{1}{2\eta L}\right\}$. Since $\alpha$ defined by (24) satisfies the conditions of Lemma 4, we can use (21) and get
$$\begin{aligned}\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\le{}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\alpha\|x^{k+1}-x^\star\|^2+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star)-\frac{2-\tau}{\tau}D_F(x_f^{k+1},x^\star)\\&+(\alpha-\mu)\|x_g^k-x^\star\|^2-\frac{\delta}{2\eta}\|x^{k+1}-x^k\|^2-D_F(x_g^k,x^\star)\\\le{}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\alpha\|x^{k+1}-x^\star\|^2+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star)-\frac{2-\tau}{\tau}D_F(x_f^{k+1},x^\star)\\&-\frac{\eta\delta}{4}\|K^Ty^{k+1}-K^Ty^\star\|^2+\eta\alpha^2\delta\|x^{k+1}-x^\star\|^2+2\eta L\delta\,D_F(x_g^k,x^\star)\\&+(\alpha-\mu)\|x_g^k-x^\star\|^2-D_F(x_g^k,x^\star)\\\le{}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\alpha\|x^{k+1}-x^\star\|^2+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star)-\frac{2-\tau}{\tau}D_F(x_f^{k+1},x^\star)\\&-\frac{\eta\delta}{4}\|K^Ty^{k+1}-K^Ty^\star\|^2+\frac{\alpha^2}{2L}\|x^{k+1}-x^\star\|^2+(\alpha-\mu)\|x_g^k-x^\star\|^2\\={}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\left(\alpha-\frac{\alpha^2}{2L}\right)\|x^{k+1}-x^\star\|^2-\frac{\eta\delta}{4}\|K^Ty^{k+1}-K^Ty^\star\|^2\\&+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star)-\frac{2-\tau}{\tau}D_F(x_f^{k+1},x^\star)+(\alpha-\mu)\|x_g^k-x^\star\|^2,\end{aligned}$$
where we used $\delta\le\frac{1}{2\eta L}$, so that $2\eta L\delta\le 1$ and $\eta\alpha^2\delta\le\frac{\alpha^2}{2L}$. Using parameter $\alpha=\mu$ defined by (24) and using $\mu\le L$, we get
$$\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\le\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\frac{\mu}{2}\|x^{k+1}-x^\star\|^2-\frac{\eta\delta}{4}\|K^Ty^{k+1}-K^Ty^\star\|^2+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star)-\frac{2-\tau}{\tau}D_F(x_f^{k+1},x^\star).$$
For every $y\in\operatorname{range}(K)$, $\lambda_2\|y\|^2\le\lambda_{\min}^+(W)\|y\|^2\le\|K^Ty\|^2$. Using line 6 of Algorithm 3, one can check by induction that $y^k\in\operatorname{range}(K)$ for every $k\ge 0$. Moreover, using (4), $y^\star\in\operatorname{range}(K)$. Therefore,
$$\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\le\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\frac{\mu}{2}\|x^{k+1}-x^\star\|^2-\frac{\eta\delta\lambda_2}{4}\|y^{k+1}-y^\star\|^2+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star)-\frac{2-\tau}{\tau}D_F(x_f^{k+1},x^\star).$$
Using (18) we get
$$\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\le\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\min\left\{\frac{\eta\mu}{2},\frac{\eta\theta\delta\lambda_2}{4}\right\}\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star)-\frac{2-\tau}{\tau}D_F(x_f^{k+1},x^\star).$$
Using parameter $\theta$ defined by (23) and the definition of $\delta$ we get
$$\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\le\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\min\left\{\frac{\eta\mu}{2},\frac{\lambda_2}{4\lambda_1},\frac{\lambda_2}{8\eta L\lambda_1}\right\}\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star)-\frac{2-\tau}{\tau}D_F(x_f^{k+1},x^\star).$$
Plugging parameter $\eta$ defined by (22) we get
$$\begin{aligned}\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\le{}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\min\left\{\frac{\mu}{8\tau L},\frac{\lambda_2}{4\lambda_1},\frac{\tau\lambda_2}{2\lambda_1}\right\}\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\\&+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star)-\frac{2-\tau}{\tau}D_F(x_f^{k+1},x^\star)\\\le{}&\left\|\begin{bmatrix}x^k-x^\star\\y^k-y^\star\end{bmatrix}\right\|_Q^2-\min\left\{\frac{\mu}{8\tau L},\frac{\lambda_2}{4\lambda_1},\frac{\tau\lambda_2}{2\lambda_1}\right\}\left\|\begin{bmatrix}x^{k+1}-x^\star\\y^{k+1}-y^\star\end{bmatrix}\right\|_Q^2\\&+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star)-\left(1+\frac{\tau}{2}\right)\frac{2(1-\tau)}{\tau}D_F(x_f^{k+1},x^\star),\end{aligned}$$
where we used $\frac{2-\tau}{\tau}\ge\left(1+\frac{\tau}{2}\right)\frac{2(1-\tau)}{\tau}$. After rearranging and using the definition of $\Psi^k$ (26) we get
$$\Psi^k\ge\left(1+\min\left\{\frac{\tau}{2},\frac{\mu}{8\tau L},\frac{\lambda_2}{4\lambda_1},\frac{\tau\lambda_2}{2\lambda_1}\right\}\right)\Psi^{k+1}.$$
Plugging parameter $\tau$ defined by (25) we get
$$\Psi^k\ge\left(1+\frac14\min\left\{\sqrt{\frac{\mu\lambda_2}{L\lambda_1}},\frac{\lambda_2}{\lambda_1}\right\}\right)\Psi^{k+1}.$$

A.2 End of the Proof of Proposition 1

The conditions of Lemma 5 are satisfied, hence the following inequality holds for all $k$:
$$\Psi^{k+1}\le\left(1+\frac14\min\left\{\sqrt{\frac{\mu\lambda_2}{L\lambda_1}},\frac{\lambda_2}{\lambda_1}\right\}\right)^{-1}\Psi^k.\qquad(30)$$
After telescoping we get
$$\Psi^k\le\left(1+\frac14\min\left\{\sqrt{\frac{\mu\lambda_2}{L\lambda_1}},\frac{\lambda_2}{\lambda_1}\right\}\right)^{-k}\Psi^0.\qquad(31)$$
Inequality (18) implies $\Psi^0\le C$, where
$$C:=\frac{1}{\eta}\|x^0-x^\star\|^2+\frac{1}{\theta}\|y^0-y^\star\|^2+\frac{2(1-\tau)}{\tau}D_F(x_f^0,x^\star).$$
Hence, we obtain
$$\Psi^k\le\left(1+\frac14\min\left\{\sqrt{\frac{\mu\lambda_2}{L\lambda_1}},\frac{\lambda_2}{\lambda_1}\right\}\right)^{-k}C.\qquad(32)$$
It remains to lower bound $\Psi^k$ using (18) one more time:
$$\frac{1}{\eta}\|x^k-x^\star\|^2+\frac{\eta\alpha}{\theta(1+\eta\alpha)}\|y^k-y^\star\|^2+\frac{2(1-\tau)}{\tau}D_F(x_f^k,x^\star)\le\Psi^k.$$
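Three elementary facts carry the proof above: the norm lower bound $\|a+b+c\|^2\ge\frac12\|a\|^2-2\|b\|^2-2\|c\|^2$ behind Lemma 4, the skew-symmetry identity used in (29), and the bound $\lambda_2\|y\|^2\le\|K^Ty\|^2$ on $\operatorname{range}(K)$. They can be checked numerically; the sketch below is not part of the paper, and the random matrix sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 5, 8
K = rng.standard_normal((p, d))

# (i) Norm inequality behind Lemma 4: ||a+b+c||^2 >= ||a||^2/2 - 2||b||^2 - 2||c||^2.
for _ in range(1000):
    a, b, c = rng.standard_normal((3, d))
    assert np.linalg.norm(a + b + c) ** 2 >= 0.5 * (a @ a) - 2 * (b @ b) - 2 * (c @ c) - 1e-9

# (ii) <Az, z> = 0 for the skew-symmetric matrix A = [[0, K^T], [-K, 0]] used in (29).
A = np.block([[np.zeros((d, d)), K.T], [-K, np.zeros((p, p))]])
z = rng.standard_normal(d + p)
assert abs(z @ A @ z) < 1e-9

# (iii) For y in range(K): lam_min^+(W) ||y||^2 <= ||K^T y||^2, with W = K^T K.
lam = np.linalg.eigvalsh(K @ K.T)   # K has full row rank almost surely here
lam2 = lam[lam > 1e-10].min()       # smallest nonzero eigenvalue of W
for _ in range(100):
    y = K @ rng.standard_normal(d)  # an arbitrary element of range(K)
    assert lam2 * (y @ y) <= np.linalg.norm(K.T @ y) ** 2 + 1e-9
```

Fact (iii) is the step that lets the proof trade $\|K^Ty^{k+1}-K^Ty^\star\|^2$ for $\lambda_2\|y^{k+1}-y^\star\|^2$; it fails for general $y$, which is why the induction $y^k\in\operatorname{range}(K)$ matters.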