Restart of accelerated first order methods with linear convergence for non-strongly convex optimization
Teodoro Alamo, Pablo Krupa, Daniel Limon
Abstract—Accelerated first order methods, also called fast gradient methods, are popular optimization methods in the field of convex optimization. However, they are prone to suffer from oscillatory behaviour that slows their convergence when medium to high accuracy is desired. In order to address this, restart schemes have been proposed in the literature, which seek to improve the practical convergence by suppressing the oscillatory behaviour. This paper presents a restart scheme applicable to a broad class of accelerated first order methods. Under a quadratic functional growth condition, a linear convergence rate is proved for a large class of non-strongly convex functions. Moreover, the worst-case convergence rate is comparable to the one obtained using a (generally non-implementable) optimal fixed-rate restart strategy. We show numerical results comparing the proposed algorithm with other restart schemes.
Index Terms—Convex Optimization, Accelerated First Order Methods, Restart Schemes, Linear Convergence.
I. INTRODUCTION
In the field of convex optimization, first order methods (FOM) are a widespread class of optimization algorithms which only require evaluations of the objective function and its gradient [1], [2]. Some examples of these methods include: gradient descent [1], ISTA [3] and ADMM [4]. A subclass of FOM are the accelerated first order methods (AFOM), which are characterised by providing a convergence rate O(1/k²) in terms of the value of the objective function [5]. Some noteworthy examples are: Nesterov's fast gradient method [5], FISTA [3], MFISTA [6, §V.A] and accelerated ADMM [7], [8], [9], [10]. The use of AFOMs in the field of control is a heavily researched topic, especially in the field of model predictive control [11], [12], [13], [14], [15].

A drawback of AFOMs is that they often suffer from oscillating behaviour that slows them down [16]. In order to mitigate this, restart schemes have been proposed in the literature, which have been shown to improve the convergence in a practical setting by suppressing the oscillatory behaviour [16], [17]. In a restart scheme, the AFOM is stopped when a certain criterion is met and then restarted using the last value provided by the algorithm as the new initial condition. However, most of the restart schemes proposed in the literature do not guarantee linear convergence for non-strongly convex optimization problems.

The authors are with the Department of Systems Engineering and Automation, University of Seville, 41092 Seville, Spain (e-mails: [email protected], [email protected], [email protected]). Corresponding author: Teodoro Alamo. This work was supported in part by the Agencia Estatal de Investigación (AEI) under Grant PID2019-106212RB-C41/AEI/10.13039/501100011033, by MINERCO-Spain and FEDER funds under Grant DPI2016-76493-C3-1-R and by the MCIU-Spain and FSE under Grant FPI-2017.
A notable exception is the restart scheme presented in [18], which we label the optimal fixed-rate restart scheme, as it exhibits global linear convergence for convex optimization problems with non-strongly convex objective functions that satisfy a quadratic functional growth condition. This scheme restarts the AFOM after a fixed number of iterations. However, its drawback is that it requires prior knowledge of either the optimal value of the objective function or of the parameter that characterizes the quadratic functional growth, both of which are not easily available in most practical cases.

In this paper we present a novel restart scheme for AFOMs which exhibits linear convergence for non-strongly convex objective functions that satisfy a quadratic functional growth condition. Furthermore, it does not require hard-to-attain information about the objective function, such as its optimal value or the quadratic functional growth parameter. We provide a theoretical upper bound on the number of iterations needed to achieve a desired accuracy and show that the obtained convergence rate is comparable to the one that could be obtained using the optimal fixed-rate restart strategy [18], which is optimal for the class of AFOMs and optimization problems under consideration. We also show numerical results comparing the proposed scheme with other restart strategies of the literature. This paper extends the preliminary results presented for FISTA in the conference papers [19] and [20] by providing a restart algorithm with improved worst-case convergence rate that can be applied to a broad class of AFOM algorithms.

In Section II we formally present the class of optimization problems and AFOM algorithms under consideration. Section III describes the optimal fixed-rate restart strategy and provides its iteration complexity for our class of AFOMs. Section IV presents the novel implementable restart scheme with linear convergence. We show numerical results comparing the proposed scheme with other restart strategies in Section V. Finally, conclusions are drawn in Section VI.
Notation: Given a norm ‖·‖, we denote by ‖·‖∗ its dual norm: ‖x‖∗ := sup{xᵀz : ‖z‖ ≤ 1}. The ℓ₂-norm is denoted by ‖·‖₂. Z₊ denotes the set of non-negative integers. The set of integer numbers from i to j is denoted by Z_i^j, i.e. Z_i^j := {i, i+1, …, j−1, j}. Euler's number is denoted by e, and the natural logarithm by ln(·). ⌈x⌉ denotes the smallest integer greater than or equal to x, and ⌊x⌋ the largest integer smaller than or equal to x. The set of proper closed convex functions from Rⁿ to (−∞, ∞] is denoted by Γ_n. Given f ∈ Γ_n, we denote by dom(f) its effective domain, that is, dom(f) := {x ∈ Rⁿ : f(x) < ∞}. Further notation is given in Notations 1 and 2 in the following section.
II. PROBLEM STATEMENT
In this paper we are concerned with finding the solution of optimization problems given by

f∗ = min_{x ∈ Rⁿ} f(x),   (1)

which we assume are solvable and where f ∈ Γ_n. We use the following notation for the optimal set of (1), the projection operation onto it, and the level sets of f.

Notation 1. Given the solvable problem (1):

(i) The optimal set is denoted by Ω_f. That is, Ω_f = {x ∈ Rⁿ : f(x) = f∗}.
(ii) For every x ∈ Rⁿ we denote by x̄ the closest element to x in the optimal set Ω_f with respect to the norm ‖·‖, i.e. x̄ = arg min_{z ∈ Ω_f} ‖x − z‖.
(iii) Given ρ ∈ [0, ∞) we denote the level set V_f(ρ) = {x ∈ Rⁿ : f(x) − f∗ ≤ ρ}.

It is well known that if f ∈ Γ_n is strongly convex, then there exists µ > 0 such that

f(x) − f∗ ≥ (µ/2)‖x − x̄‖², ∀x ∈ dom(f).

This inequality is called the quadratic functional growth condition and it is satisfied, at least locally, for a large class of not necessarily strictly convex functions [18], [21]. For example, when f(x) = h(Ex) + cᵀx + I_X(x), where h : Rᵐ → R is a smooth strictly convex function, E ∈ R^{m×n} and I_X is the indicator function of a polyhedral set X [18], [22].

Let us consider a fixed point algorithm A that can be applied to solve (1), i.e. given an initial point x₀ ∈ dom(f), algorithm A generates a sequence {x_k} with k ≥ 1 such that lim_{k→∞} f(x_k) = f∗. We use the following notation to refer to the iterates provided by algorithm A.

Notation 2.
Suppose that the fixed point algorithm A is applied to solve problem (1) using as initial condition x₀. Given the integer k ≥ 1, we denote by A(x₀, k) the vector in Rⁿ corresponding to iteration k of the algorithm.

The following assumption characterizes the class of optimization problems and AFOM algorithms considered in this article.
Assumption 1.
We assume that:

(i) For every ρ > 0, f ∈ Γ_n satisfies a quadratic growth condition of the form

f(x) − f∗ ≥ (µ_ρ/2)‖x − x̄‖², ∀x ∈ V_f(ρ),

for some µ_ρ > 0.

(ii) For every x₀ ∈ dom(f), algorithm A satisfies

f(A(x₀, 1)) ≤ f(x₀) − (1/(2L_f))‖g(x₀)‖∗²,   (2)
f(A(x₀, k)) − f∗ ≤ (a_f/(k+1)²)‖x₀ − x̄₀‖², ∀k ≥ 1,   (3)

where a_f > 0, L_f > 0, and g(·) is a gradient operator satisfying g(x) = 0 ⇔ x ∈ Ω_f.

(iii) We denote n_ρ := max{1/2, √(2a_f/µ_ρ)}.

The improvement with respect to the initial condition x₀ stated in (2) is satisfied for most AFOMs because the first iteration often results from the application of a proximal gradient operator T_{L_f}(·) on x₀, thus resulting in the satisfaction of (2). In this case L_f > 0 is a Lipschitz constant and g(·) is the gradient mapping g(x) = L_f(x − T_{L_f}(x)) (see [3], [23] and [2, Chapter 10]). In any case, condition (2) can be easily enforced using as initial condition for A the result of the application of a proximal gradient operator on x₀. The convergence rate given in (3) is satisfied by most AFOM algorithms [1], [2]. The constant a_f is equal to 2L_f in FISTA and MFISTA and a multiple of L_f in other cases, e.g. when the Lipschitz constant L_f is not known and a backtracking strategy is implemented (see [23] and [2, Chapter 10]).

We now present a property on the iterates of A which serves as the basis for the development and convergence analysis of the optimization schemes of the following sections. An equivalent result can be found in [18, Subsection 5.2.2].

Property 1.
Suppose that Assumption 1 holds. Then, for every x₀ ∈ V_f(ρ),

f(A(x₀, k)) − f∗ ≤ (n_ρ/(k+1))² (f(x₀) − f∗), ∀k ≥ 1.   (4)

Proof.
Denote f₀ := f(x₀) and f_k := f(A(x₀, k)), ∀k > 0. Then,

f_k − f∗ ≤ (a_f/(k+1)²)‖x₀ − x̄₀‖² ≤ (2a_f/(µ_ρ(k+1)²))(f₀ − f∗) ≤ (n_ρ/(k+1))² (f₀ − f∗). ∎
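The proximal gradient operator T_{L_f} and the gradient mapping g(·) mentioned in the discussion of Assumption 1 can be made concrete with a short numerical sketch. The following is a minimal illustration of our own (an ℓ₁-regularized least squares toy problem, not taken from the paper's experiments):

```python
import numpy as np

# Sketch (our own toy illustration) of the proximal gradient operator T_{L_f}
# and the gradient mapping g(x) = L_f (x - T_{L_f}(x)) for the composite
# function f(x) = 0.5 * ||A x - b||^2 + lam * ||x||_1.

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_step(x, A, b, lam, Lf):
    """T_{L_f}(x): gradient step on the smooth part, then soft thresholding."""
    return soft_threshold(x - A.T @ (A @ x - b) / Lf, lam / Lf)

def gradient_mapping(x, A, b, lam, Lf):
    """g(x) = L_f (x - T_{L_f}(x)); it vanishes exactly at the minimizers."""
    return Lf * (x - prox_grad_step(x, A, b, lam, Lf))

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 30))
b = rng.standard_normal(20)
Lf = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the smooth gradient
x0 = np.zeros(30)
g0 = np.linalg.norm(gradient_mapping(x0, A, b, 0.1, Lf))  # nonzero: x0 is not optimal
```

Repeated application of `prox_grad_step` is plain ISTA; the norm of the gradient mapping shrinks as the iterates approach the optimal set, which is the optimality measure used throughout the paper.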
III. OPTIMAL FIXED-RATE RESTART SCHEME
This section describes the optimal fixed-rate restart scheme presented in [18, §5.2.2], in which A is restarted each time the iteration counter attains an optimal fixed number of iterations. We analyze, under Assumption 1, its iteration complexity.

Given v₀ ∈ V_f(ρ), a fixed-rate restart scheme takes the recursion

v_{j+1} = A(v_j, n), j ≥ 0,   (5)

where n ≥ 1 is a fixed integer. Under Assumption 1, the sequence {f(v_j)}_{j≥0} is non-increasing and converges monotonically to f∗ if n ≥ n_ρ (see Property 1). Given an accuracy parameter ε > 0, the following property states the number M of restarts required to satisfy f(v_{M−1}) − f(v_M) ≤ ε, and shows that the bound on the total number of iterations of A is minimized if n is chosen equal to ⌈e n_ρ⌉. See also [18, §5.2.2] for a similar result.

Property 2 (Optimal fixed-rate restart scheme). Let Assumption 1 hold. Given v₀ ∈ V_f(ρ) and an integer n satisfying n > n_ρ, consider the recursion (5). Then, given ε > 0:

(i) The inequality f(v_{M−1}) − f(v_M) ≤ ε is satisfied for every M ≥ M̄, where

M̄ := 1 + (1/(2(ln n − ln n_ρ))) ln((f(v₀) − f∗)/ε).   (6)

Algorithm 1:
Delayed exit condition on A
Prototype: [z, m] = A_d(r, n)
Require: r ∈ dom(f), n ∈ R
1: x₀ ← r, k ← 0
2: Initialize A with x₀
3: repeat
4:   k ← k + 1
5:   x_k ← A(x₀, k) if f(A(x₀, k)) ≤ f(x_{k−1}); x_{k−1} otherwise
6:   ℓ ← ⌊k/2⌋
7: until k ≥ n and f(x_ℓ) − f(x_k) ≤ (1/3)(f(x₀) − f(x_ℓ))
Output: z ← x_k, m ← k

(ii) If n = ⌈e n_ρ⌉, the total number of iterations of A required to attain f(v_{j−1}) − f(v_j) ≤ ε is upper bounded by

N̄∗_F := ⌈e n_ρ⌉ ⌈1 + (1/2) ln(1 + (f(v₀) − f∗)/ε)⌉.   (7)

In this case, we call recursion (5) the optimal fixed-rate restart scheme.

Proof.
See Appendix A.

One of the key properties of the optimal fixed-rate restart scheme is that it recovers the linear optimal convergence rate provided by Nesterov's fast gradient method for strongly convex functions [18], [24, §2.2]. That is, recalling that n_ρ = max{1/2, √(2a_f/µ_ρ)}, we easily obtain from Property 2.(ii) that an ε-accurate solution is obtained in

O(n_ρ ln((f(v₀) − f∗)/ε))   (8)

iterations.

Remark 1.
This optimal fixed-rate scheme is often non-implementable because the value of n_ρ is generally not available. However, the obtained bound is important because it provides the best theoretical convergence rate that could be obtained with a fixed-rate restart strategy. We also remark that this scheme can be implemented without requiring knowledge of the value of n_ρ if the value of f∗ is known in advance. In this case, a similar convergence result can be obtained if a restart is implemented each time the iteration counter k satisfies f(A(x₀, k)) − f∗ ≤ (f(x₀) − f∗)/e. See [18, §5.2.2] for further details.
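The fixed-rate recursion (5) can be sketched in a few lines of code. The following is a minimal example of our own (FISTA on a toy least squares problem stands in for A, and the restart period n is simply fixed by hand, playing the role of ⌈e n_ρ⌉, which is generally unknown):

```python
import math
import numpy as np

# Sketch of the fixed-rate restart recursion (5): an accelerated method
# (FISTA, standing in for A) is restarted every n iterations from its last
# iterate. The toy problem and the period n are our own illustrative choices.

def fista(x0, grad, prox, Lf, n_iter):
    """Run n_iter FISTA iterations from x0 and return the last iterate."""
    x_prev, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        x = prox(y - grad(y) / Lf)
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 20))
b = rng.standard_normal(40)
Lf = np.linalg.norm(A, 2) ** 2
grad = lambda z: A.T @ (A @ z - b)
prox = lambda z: z      # smooth unconstrained case: the prox is the identity

n = 25                  # fixed restart period (stands in for ceil(e * n_rho))
v = np.zeros(20)
for j in range(20):     # recursion (5): v_{j+1} = A(v_j, n)
    v = fista(v, grad, prox, Lf, n)
```

On this toy problem the restarted iterates converge quickly to the least squares solution; with a badly chosen period the scheme still converges, but the worst-case bound of Property 2 no longer applies.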
IV. PROPOSED RESTART SCHEME
In this section we propose a novel restart scheme (see Algorithm 2) that does not require knowledge of n_ρ and that attains a convergence rate similar to the one of the optimal fixed-rate restart strategy described in Section III. We start by presenting Algorithm 1, which implements a delayed exit condition on algorithm A. Algorithm 1 will then be used to derive the main result of this article: Algorithm 2.

Algorithm 2:
Optimal Algorithm based on A_d
Prototype: [z_out, j_out] = A∗(z₀)
Require: z₀ ∈ dom(f), ε > 0
1: m₀ ← 1, m₋₁ ← 1, j ← −1
2: repeat
3:   j ← j + 1
4:   s_j ← √((f(z_{j−1}) − f(z_j))/(f(z_{j−2}) − f(z_j))) if j ≥ 2; 0 otherwise
5:   n_j ← max{m_j, 4 s_j m_{j−1}}
6:   [z_{j+1}, m_{j+1}] ← A_d(z_j, n_j)
7: until f(z_j) − f(z_{j+1}) ≤ ε
Output: z_out ← z_{j+1}, j_out ← j

Fig. 1: Satisfaction of the delayed exit condition (10).

Given an initial condition x₀ and a scalar n, which serves as a lower bound on the number of iterations, Algorithm 1 generates a sequence {x_k}_{k≥0} that satisfies (see step 5)

f(x_k) = min{f(x_{k−1}), f(A(x₀, k))}, ∀k ≥ 1.

Therefore,

f(x_k) = min_{i=0,…,k} f(A(x₀, i)).   (9)

The algorithm exits after k ≥ n iterations if the following inequality is satisfied (see step 7):

f(x_ℓ) − f(x_k) ≤ (1/3)(f(x₀) − f(x_ℓ)),   (10)

where ℓ = ⌊k/2⌋. The outputs of the algorithm are z ∈ Rⁿ and m ∈ Z, where z = x_m and m ≥ n is the number of iterations required to satisfy the exit condition (10).

Intuitively, as illustrated in Figure 1, exit condition (10) detects a degradation in the performance of the iterations of A. Notice that at iteration m, the reduction corresponding to the last half of the iterations (from ⌊m/2⌋ to m) is no larger than one third of the reduction achieved in the first half of the iterations (from 0 to ⌊m/2⌋).

The following property characterizes the number of iterations required to attain the exit condition (10) of Algorithm 1. This result is instrumental to prove the convergence results of Algorithm 2.

Property 3.
Suppose that Assumption 1 holds. Then, the output [z, m] from the call [z, m] = A_d(r, n) of Algorithm 1 satisfies, for every r ∈ V_f(ρ):

(i) f(z) ≤ f(r) − (1/(2L_f))‖g(r)‖∗².
(ii) f(z) − f∗ ≤ (n_ρ/(m+1))² (f(r) − f∗).
(iii) n ∈ (0, ⌈4n_ρ⌉] ⟹ m ∈ [n, ⌈4n_ρ⌉].

Proof. See Appendix A.

We now introduce the main contribution of the article: Algorithm 2. This algorithm makes successive calls to Algorithm 1 (see step 6) using a minimum number of iterations n_j that is determined by the past evolution of the iterates z_j (see steps 4 and 5). The main properties of the iterates of Algorithm 2 are given in the following property and theorem.

Property 4.
Suppose that Assumption 1 holds and consider Algorithm 2 for a given initial condition z₀ ∈ V_f(ρ) and accuracy parameter ε > 0. Then:

(i) Property 3 can be applied to the iterates of Algorithm 2 (i.e., taking r ≡ z_j, n ≡ n_j, z ≡ z_{j+1} and m ≡ m_{j+1}).
(ii) The sequence {m_j} produced is non-decreasing. In particular,

m_j ≤ n_j ≤ m_{j+1}, ∀j ∈ Z₀^{j_out}.   (11)

(iii) The sequence {s_j} satisfies s_j ∈ (0, 1], ∀j ∈ Z₂^{j_out}.

Proof. See Appendix B.
Theorem 1.
Suppose that Assumption 1 holds and consider Algorithm 2 for a given initial condition z₀ ∈ V_f(ρ) and accuracy parameter ε > 0. Then:

(i) The number of calls to A_d (step 6) is bounded. That is, j_out is finite.
(ii) The number of iterations of A at each call of A_d (step 6) is upper bounded by ⌈4n_ρ⌉. That is,

m_{j+1} ≤ ⌈4n_ρ⌉, ∀j ∈ Z₀^{j_out}.   (12)

(iii) The total number of iterations of A performed by a call to Algorithm 2, which we denote by N_A, is upper bounded by N_A ≤ N̄_A, where

N̄_A := (e/2) ⌈4n_ρ⌉ ⌈5 + (1/ln 15) ln((f(z₀) − f∗)/ε)⌉.

Proof.
See Appendix B.
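To make the scheme concrete, the following is a minimal self-contained sketch of Algorithm 2 of our own. It is illustrative only: the toy problem data, the use of FISTA as a stand-in for A, and the reading of step 5 as n_j = max{m_j, 4 s_j m_{j−1}} are our assumptions; the inner routine monotonizes the iterates and applies the delayed exit condition (10) of Algorithm 1:

```python
import math
import numpy as np

# Minimal sketch of Algorithm 2 (our own rendering). FISTA on a toy least
# squares problem plays the role of A; A_d keeps the best iterate (step 5 of
# Algorithm 1) and exits via the delayed condition (10).

rng = np.random.default_rng(2)
Amat = rng.standard_normal((30, 15))
b = rng.standard_normal(30)
Lf = np.linalg.norm(Amat, 2) ** 2
f = lambda x: 0.5 * np.dot(Amat @ x - b, Amat @ x - b)
grad = lambda x: Amat.T @ (Amat @ x - b)

def A_d(r, n):
    """Monotonized FISTA run from r; returns (z, m) as in Algorithm 1."""
    x_prev, y, t = r.copy(), r.copy(), 1.0
    best_x, best_f = [r.copy()], [f(r)]
    k = 0
    while True:
        k += 1
        v = y - grad(y) / Lf                      # accelerated (FISTA) step
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        y = v + ((t - 1.0) / t_next) * (v - x_prev)
        x_prev, t = v, t_next
        if f(v) <= best_f[-1]:                    # keep the running best value
            best_x.append(v.copy()); best_f.append(f(v))
        else:
            best_x.append(best_x[-1]); best_f.append(best_f[-1])
        l = k // 2                                # delayed exit condition (10)
        if k >= n and best_f[l] - best_f[k] <= (best_f[0] - best_f[l]) / 3.0:
            return best_x[k], k

def restart_scheme(z0, eps):
    """Outer loop of Algorithm 2."""
    z, fs, m = [z0], [f(z0)], [1, 1]              # m_{-1} = m_0 = 1
    j = -1
    while True:
        j += 1
        if j >= 2 and fs[j - 2] - fs[j] > 0.0:    # step 4
            s = math.sqrt((fs[j - 1] - fs[j]) / (fs[j - 2] - fs[j]))
        else:
            s = 0.0
        n_j = max(m[-1], 4.0 * s * m[-2])         # step 5 (our reading)
        zj, mj = A_d(z[-1], n_j)                  # step 6
        z.append(zj); fs.append(f(zj)); m.append(mj)
        if fs[-2] - fs[-1] <= eps:                # step 7
            return z[-1], j

z_out, j_out = restart_scheme(np.zeros(15), eps=1e-9)
```

Note that the outer loop never needs n_ρ, f∗ or µ_ρ: the minimum run length n_j of each call is adapted purely from the observed decrease of the objective across restarts.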
Remark 2.
From Property 4.(i), we have that we can rearrange Property 3.(i) to read as ‖g(z_j)‖∗² ≤ 2L_f(f(z_j) − f(z_{j+1})). Therefore, the exit condition f(z_j) − f(z_{j+1}) ≤ ε implies ‖g(z_j)‖∗ ≤ √(2L_f ε). Since, as per Assumption 1.(ii), g(z_j) serves to characterize the optimality of z_j, we conclude that the exit condition of Algorithm 2 also serves to characterize the optimality of z_{j+1}. This means that the exit condition could be replaced by ‖g(z_j)‖∗ ≤ ε̃, where ε̃ > 0. In this case, the upper bound on the number of iterations given in Theorem 1.(iii) would be the same but replacing ε with ε̃²/(2L_f).

Algorithm 3: MFISTA
Prototype: [z, m] = A_MFISTA(x, n, E_c)
Require: x ∈ dom(f), n ∈ Z₊, exit condition E_c
1: y₀ = x₀ = x, t₀ = 1, k = 0
2: repeat
3:   k = k + 1
4:   v_k = T_{L_f}(y_{k−1})
5:   t_k = (1/2)(1 + √(1 + 4 t_{k−1}²))
6:   x_k = v_k if f(v_k) ≤ f(x_{k−1}); x_{k−1} otherwise
7:   y_k = x_k + (t_{k−1}/t_k)(v_k − x_k) + ((t_{k−1} − 1)/t_k)(x_k − x_{k−1})
8:   Compute exit condition E_c
9: until k ≥ n and E_c is true
Output: z = x_k, m = k

Note that Theorem 1.(iii) shows that the proposed algorithm attains the optimal linear convergence rate of the optimal fixed-rate restart scheme, in the sense that an ε-accurate solution is obtained in (8) iterations.

Comparing the upper bound provided in Theorem 1.(iii) with the upper bound N̄∗_F (7) of the optimal fixed-rate restart scheme presented in Section III, we have

N̄_A/N̄∗_F = ((e/2) ⌈4n_ρ⌉ ⌈5 + (1/ln 15) ln((f(z₀) − f∗)/ε)⌉) / (⌈e n_ρ⌉ ⌈1 + (1/2) ln(1 + (f(z₀) − f∗)/ε)⌉),

from where we obtain that

lim_{ε→0} N̄_A/N̄∗_F = e ⌈4n_ρ⌉/(⌈e n_ρ⌉ ln 15) ≤ e(4n_ρ + 1)/(e n_ρ ln 15) = (4/ln 15)(1 + 1/(4n_ρ)).

We conclude that the worst-case complexity of (the implementable) Algorithm 2 is comparable to the (generally) non-implementable optimal fixed-rate restart scheme (approximately 50% more iterations of A when ε tends to zero).

V. NUMERICAL RESULTS
We compare the proposed Algorithm 2 with other restart schemes of the literature by applying them to the weighted Lasso problem

min_x (1/N)‖Ax − b‖₂² + ‖Wx‖₁,   (13)

where x ∈ Rⁿ, A ∈ R^{N×n} is sparse, with a large fraction of its entries being zero (sparsity was generated by setting a fixed probability for each element of the matrix to be 0), n > N, and b ∈ R^N. Each nonzero element in A and b is obtained from a Gaussian distribution with zero mean and variance 1. W ∈ R^{n×n} is a diagonal matrix with elements obtained from a uniform distribution on the interval [0, α].

We note that problems (13) can be reformulated in such a way that they satisfy the quadratic growth condition [18, §6.3]. The restart schemes used for comparison are:

(i) Functional: The restart scheme proposed in [16] that uses the restart condition f(x_{k+1}) ≥ f(x_k).
(ii) Gradient: The restart scheme proposed in [16] that uses the restart condition ⟨g(y_k), x_k − x_{k+1}⟩ ≤ 0, where g : Rⁿ → Rⁿ denotes the gradient mapping operator [3].
(iii) Optimal: The restart scheme proposed in [18, §5.2.2], which requires knowing f∗.
(iv) GLCR: The restart FISTA algorithm with linear convergence proposed in [20, Alg. 2].

We use the MFISTA algorithm [6], [2] as the algorithm A for these tests. Given an initial point x ∈ dom(f), a minimum number of iterations n ≥ 0 and an exit condition E_c, the MFISTA algorithm is given by Algorithm 3, where T_{L_f} is the proximal gradient operator. This algorithm is a monotone AFOM, i.e., the sequence {f(x_k)}_{k≥0} it produces is non-increasing, that satisfies Assumption 1. Since it is monotone, it suffices to set the exit condition E_c as (10) and then directly use Algorithm 3 as A_d in step 6 of Algorithm 2.

In order to provide a fair comparison between the different restart schemes, we run each one of them until ‖g(x_k)‖∗ falls below a fixed small tolerance, where g(·) is the gradient mapping operator, which, as stated in Remark 2, is a valid characterization of the optimality of x_k.

Table I shows the results of solving 100 randomly generated problems (13) that share the values of N = 600, n = 800, and a common value of α.

TABLE I: Comparison between restart schemes: average, median, maximum and minimum number of iterations for each exit condition (Algorithm 2, Functional, Gradient, Optimal and GLCR).

Figure 2 shows the evolution of ‖g(x_k)‖∗ for each one of the restart schemes for one of the Lasso problems used to obtain the results of Table I. Additionally, it also shows the result of applying MFISTA without a restart scheme. As can be seen, the use of restart schemes can greatly improve the convergence of AFOMs, especially when small exit tolerances are desired.

Fig. 2: Evolution of the composite gradient mapping for a problem of Test 1.
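The MFISTA iteration of Algorithm 3 used in these tests can be sketched as follows. This is a minimal self-contained version of our own on a small instance of (13); the problem data are an illustrative toy and a fixed iteration budget replaces the exit condition E_c:

```python
import math
import numpy as np

# Sketch of the MFISTA iteration (Algorithm 3) on a small instance of the
# weighted Lasso (13). Toy data of our own; the monotone step guarantees
# that the sequence f(x_k) is non-increasing.

rng = np.random.default_rng(3)
N, n = 20, 30
A = rng.standard_normal((N, n))
b = rng.standard_normal(N)
w = rng.uniform(0.0, 0.1, size=n)            # diagonal of W
Lf = 2.0 * np.linalg.norm(A, 2) ** 2 / N     # Lipschitz constant of the smooth part
f = lambda x: np.dot(A @ x - b, A @ x - b) / N + np.dot(w, np.abs(x))

def T(y):
    """Proximal gradient operator for (13): gradient step plus weighted shrinkage."""
    v = y - (2.0 / N) * (A.T @ (A @ y - b)) / Lf
    return np.sign(v) * np.maximum(np.abs(v) - w / Lf, 0.0)

x_prev = y = np.zeros(n)
t = 1.0
history = [f(x_prev)]
for k in range(1, 301):
    v = T(y)
    t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
    x = v if f(v) <= f(x_prev) else x_prev   # monotone step of MFISTA
    y = x + (t / t_next) * (v - x) + ((t - 1.0) / t_next) * (x - x_prev)
    x_prev, t = x, t_next
    history.append(f(x))
```

Because of the monotone step, this routine can be combined with the delayed exit condition (10) and used directly as A_d inside Algorithm 2, as described above.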
VI. CONCLUSIONS

The main contribution of the paper is two-fold. We propose a delayed exit condition to detect degradation of the convergence of an accelerated algorithm A. We show that, under a quadratic growth condition, this delayed exit condition is attained in a finite number of iterations. Based on this exit condition, we propose a restart scheme for accelerated first order methods that retains their optimal linear convergence rate in the sense discussed above. Moreover, its worst-case complexity is similar to the best one that can be obtained if the parameters characterizing the convergence of the base algorithm A were known. That is, we show that the upper bound on the number of iterations of A of the proposed algorithm is similar to the one obtained for the optimal fixed-rate restart scheme, but without requiring knowledge of the aforementioned parameters. Finally, the numerical results indicate that the proposed algorithm is comparable, in practical terms, with other restart schemes of the literature.

APPENDIX
A. Proof of Properties 2 and 3

Proof of Property 2.
Suppose that the integer M is such that the inequality f(v_{M−1}) − f(v_M) > ε is satisfied. From Property 1 we have

f(v_{j+1}) − f∗ ≤ (n_ρ/n)² (f(v_j) − f∗), ∀j ≥ 0.

Using this inequality in a recursive manner we obtain

ε < f(v_{M−1}) − f(v_M) ≤ f(v_{M−1}) − f∗ ≤ (n_ρ/n)^{2(M−1)} (f(v₀) − f∗).

This leads to

M − 1 < (1/(2(ln n − ln n_ρ))) ln((f(v₀) − f∗)/ε).   (14)

Thus, we conclude that if M does not satisfy (14), then f(v_{M−1}) − f(v_M) ≤ ε. This proves the first claim.

Given ε > 0, v₀ ∈ V_f(ρ) and n > n_ρ, denote by S ≥ 1 the smallest number of restarts required to satisfy the condition f(v_{S−1}) − f(v_S) ≤ ε. We infer from the first claim of the property that S ≤ max{1, ⌈M̄⌉}, which, making use of the expression of M̄ (6), allows us to write:

S ≤ ⌈1 + (1/(2(ln n − ln n_ρ))) ln(1 + (f(v₀) − f∗)/ε)⌉,

where a 1 has been added to the argument of the logarithm in the expression of M̄ to guarantee that the above bound is no smaller than max{1, ⌈M̄⌉}. Since each restart requires n iterations of A, we conclude that N_F(n), the total number of iterations of A, is equal to nS. Thus,

N_F(n) ≤ n ⌈1 + (1/(2(ln n − ln n_ρ))) ln(1 + (f(v₀) − f∗)/ε)⌉.   (15)

Simple calculus yields that the value that minimizes the coefficient n/(ln n − ln n_ρ) is n∗ = e n_ρ. Since n has to be a positive integer, we choose the fixed restart rate given by n = ⌈e n_ρ⌉. Introducing this value in the bound (15) we finally obtain

N_F(⌈e n_ρ⌉) ≤ ⌈e n_ρ⌉ ⌈1 + (1/2) ln(1 + (f(v₀) − f∗)/ε)⌉. ∎

Proof of Property 3.
From (9) and Assumption 1 we have

f(z) = f(x_m) = min_{i=0,…,m} f(A(x₀, i)) (by (9)) ≤ f(A(x₀, 1)) ≤ f(x₀) − (1/(2L_f))‖g(x₀)‖∗² (by (2)).

The first claim now follows directly from x₀ = r. In view of (9) and Property 1 we have, for every k ∈ Z₀^m,

f(x_k) − f∗ ≤ f(A(x₀, k)) − f∗ (by (9)) ≤ (n_ρ/(k+1))² (f(x₀) − f∗) (by (4)).   (16)

The second claim now follows from x_m = z and x₀ = r. The inequality n ≤ m is trivially satisfied from step 7. Thus, in order to conclude the proof we show that inequality (10) is satisfied for k̂ = ⌈4n_ρ⌉ and ℓ̂ = ⌊⌈4n_ρ⌉/2⌋ ≥ ⌊2n_ρ⌋ ≥ 1 (where this last inequality follows from Assumption 1.(iii), which states that n_ρ ≥ 1/2). Since ℓ̂ + 1 ≥ 2n_ρ,

f(x_ℓ̂) − f∗ ≤ (n_ρ/(ℓ̂+1))² (f(x₀) − f∗) (by (16)) ≤ (n_ρ/(2n_ρ))² (f(x₀) − f∗) = (1/4)(f(x₀) − f∗).

This implies f(x_ℓ̂) ≤ (1/4) f(x₀) + (3/4) f∗ ≤ (1/4) f(x₀) + (3/4) f(x_k̂). Thus,

f(x_ℓ̂) − f(x_k̂) ≤ (1/4)(f(x₀) − f(x_k̂)) = (1/4)(f(x₀) − f(x_ℓ̂)) + (1/4)(f(x_ℓ̂) − f(x_k̂)).

From here we conclude f(x_ℓ̂) − f(x_k̂) ≤ (1/3)(f(x₀) − f(x_ℓ̂)). ∎

B. Proofs for Algorithm 2
This section contains the proofs of Property 4 and Theorem 1. The proof of Theorem 1 relies on Lemma 1, which states a technical and non-intuitive result on Algorithm 2.
Proof of Property 4.
Since z₀ ∈ dom(f), we have that z₀ ∈ V_f(ρ) for some ρ > 0. Additionally, each z_j is obtained from a call to Algorithm 1 (step 6). As such, in view of Property 3.(i), we have that the iterates z_j satisfy z_j ∈ V_f(ρ), ∀j ∈ Z₀^{j_out}. Therefore, Property 3 can be applied to each call to A_d, thus proving claim (i). That is, for every j ≥ 0, the iterates of Algorithm 2 satisfy:

f(z_{j+1}) ≤ f(z_j) − (1/(2L_f))‖g(z_j)‖∗²,   (17a)
f(z_{j+1}) − f∗ ≤ (n_ρ/(m_{j+1}+1))² (f(z_j) − f∗),   (17b)
n_j ∈ (0, ⌈4n_ρ⌉] ⟹ m_{j+1} ∈ [n_j, ⌈4n_ρ⌉].   (17c)

Next, due to step 5 we have m_j ≤ n_j, ∀j ∈ Z₀^{j_out}. Moreover, from (17c), we have that n_j ≤ m_{j+1}, ∀j ∈ Z₀^{j_out}, which proves claim (ii).

Finally, we prove claim (iii). From the exit condition (step 7), we have

f(z_{j−1}) − f(z_j) > ε, ∀j ∈ Z₁^{j_out}.   (18)

Additionally, from (17a) we have f(z_{j−2}) ≥ f(z_{j−1}), ∀j ∈ Z₂^{j_out}. Thus, by (18),

f(z_{j−2}) − f(z_j) ≥ f(z_{j−1}) − f(z_j) > ε > 0, ∀j ∈ Z₂^{j_out}.

Therefore, from step 4, taking j ≥ 2, we have

0 < s_j = √((f(z_{j−1}) − f(z_j))/(f(z_{j−2}) − f(z_j))) ≤ 1, ∀j ∈ Z₂^{j_out}. ∎

The proof of the following lemma relies upon some technical results on the iterates of Algorithm 2, namely Lemmas 2 and 3, which we include in Appendix C.
Lemma 1.
Consider Algorithm 2 with the initial condition z₀ ∈ V_f(ρ), and ε > 0. Suppose that Assumption 1 is satisfied and that j_out ≥ D, where

D := ⌈5 + (1/ln 15) ln((f(z₀) − f∗)/ε)⌉.

Then,

m_{ℓ+1} ≤ (1/√15) m_{ℓ+1+D}, ∀ℓ ∈ Z₀^{j_out − D}.

Proof.
The proof is obtained by reductio ad absurdum. If there is ℓ ∈ Z₀^{j_out − D} such that m_{ℓ+1} > (1/√15) m_{ℓ+1+D}, then we obtain from Lemma 3.(iv) (see Appendix C) that

D < 5 + (1/ln 15) ln((f(z₀) − f∗)/ε),

which contradicts the definition of D. ∎

Proof of Theorem 1.
Let T ∈ Z₊ be such that

f(z_j) − f(z_{j+1}) > ε, ∀j ∈ Z₀^T,   (19)

is satisfied. Then, defining d_j := f(z_j) − f(z_{j+1}), we have

f(z₀) − f(z_{T+1}) = Σ_{j=0}^{T} d_j ≥ (T+1) min_{j=0,…,T} d_j > (T+1) ε.

Thus,

T + 1 < (f(z₀) − f(z_{T+1}))/ε ≤ (f(z₀) − f∗)/ε ≤ ρ/ε,

from where we infer that the largest integer T satisfying (19) is bounded. Consequently, the exit condition of Algorithm 2 (step 7) is satisfied within a finite number of iterations, thus proving claim (i).

To prove claim (ii), we start by noting that both m₁ and m₂ are no larger than ⌈4n_ρ⌉. Indeed, from step 4 we have that s₀ = s₁ = 0, which, in virtue of step 5, implies that n₀ = m₀ = 1 and n₁ = m₁. Since n₀ = 1 is no larger than ⌈4n_ρ⌉ we have from (17c) that m₁ is also upper bounded by ⌈4n_ρ⌉. Moreover, since n₁ = m₁ ≤ ⌈4n_ρ⌉, we obtain by the same reasoning that m₂ ≤ ⌈4n_ρ⌉. We now prove that if j ≥ 2 and m_j ≤ ⌈4n_ρ⌉, then m_{j+1} ≤ ⌈4n_ρ⌉. From step 4 we have

s_j² = (f(z_{j−1}) − f(z_j))/(f(z_{j−2}) − f(z_j)) = 1 − (f(z_{j−2}) − f(z_{j−1}))/(f(z_{j−2}) − f(z_j))
≤ 1 − (f(z_{j−2}) − f(z_{j−1}))/(f(z_{j−2}) − f∗) = (f(z_{j−1}) − f∗)/(f(z_{j−2}) − f∗) ≤ (n_ρ/(m_{j−1}+1))² (by (17b)).

Thus, we have s_j m_{j−1} ≤ n_ρ. Therefore,

n_j = max{m_j, 4 s_j m_{j−1}} ≤ max{⌈4n_ρ⌉, 4n_ρ} = ⌈4n_ρ⌉,

which, along with (17c), leads to m_{j+1} ≤ ⌈4n_ρ⌉, thus proving the claim.

Finally, to prove claim (iii), we start by noting that the computation of each z_{j+1} is obtained from m_{j+1} iterations of A. Thus, by (12),

N_A = Σ_{j=0}^{j_out} m_{j+1} ≤ (1 + j_out) ⌈4n_ρ⌉.   (20)

Let us denote D := ⌈5 + (1/ln 15) ln((f(z₀) − f∗)/ε)⌉. Consider first the case j_out < D. Since both j_out and D are integers we infer from this inequality that 1 + j_out ≤ D. This, along with (20), implies that N_A ≤ ⌈4n_ρ⌉ D ≤ N̄_A.

Suppose now that j_out ≥ D. We first recall that Property 4.(ii) states that the sequence {m_{j+1}}_{j≥0} is non-decreasing. We now rewrite j_out as j_out = d + tD, where d ∈ Z₀^{D−1} and t is a non-negative integer. Thus,

N_A = Σ_{j=0}^{j_out} m_{j+1} = Σ_{j=0}^{d} m_{j+1} + Σ_{j=1}^{tD} m_{d+j+1} ≤ D m_{d+1} + D Σ_{i=1}^{t} m_{d+1+iD} = D Σ_{i=0}^{t} m_{d+1+iD}.

From Lemma 1 we have

m_{d+1+iD} ≤ (1/√15) m_{d+1+(i+1)D}, ∀i ∈ Z₀^{t−1}.

Thus,

N_A ≤ D Σ_{i=0}^{t} m_{d+1+tD} (1/√15)^{t−i}.

Using now m_{d+1+tD} ≤ m̄ := ⌈4n_ρ⌉ (see (12)) we obtain

N_A/(D m̄) ≤ Σ_{i=0}^{t} (1/√15)^{t−i} = Σ_{j=0}^{t} (1/√15)^j ≤ Σ_{j=0}^{∞} (1/√15)^j = √15/(√15 − 1) ≤ e/2.

Thus, N_A ≤ (e/2) m̄ D = (e/2) ⌈4n_ρ⌉ D. ∎

C. Technical results on the iterates of Algorithm 2
Lemma 2.
The function φ(s) : R → R, defined as

φ(s) := (1/s² − 1) · max{1, (4s)⁴},

satisfies φ(s) ≥ 15, ∀s ∈ (0, √15/4].

Proof.
We have that

φ(s) = 256(s² − s⁴) if s > 1/4, and φ(s) = 1/s² − 1 if s ≤ 1/4.

It is clear that φ(·) is monotonically decreasing in (0, 1/4]. Thus,

min_{s ∈ (0, √15/4]} φ(s) = min_{s ∈ [1/4, √15/4]} φ(s) = min_{s ∈ [1/4, √15/4]} 256(s² − s⁴).

We notice that the derivative of s² − s⁴ is 2s(1 − 2s²), which vanishes only once in the interval of interest (at s = 1/√2). From here we infer that s² − s⁴ is increasing in [1/4, 1/√2) and decreasing in (1/√2, √15/4]. Thus, the minimum is attained at the extremes of the interval [1/4, √15/4]. That is, we conclude that

min_{s ∈ (0, √15/4]} φ(s) = min{φ(1/4), φ(√15/4)} = min{15, 15} = 15. ∎

Lemma 3 (Technical results on the iterates of Alg. 2). Consider Algorithm 2 with the initial condition z₀ ∈ V_f(ρ), and ε > 0. Suppose that Assumption 1 is satisfied and that j_out ≥ 2. Suppose also that there are T ∈ Z₂^{j_out} and ℓ ∈ Z₀^{j_out − T} such that m_{ℓ+1} > (1/√15) m_{ℓ+1+T}. Then:

(i) s_j ∈ (0, √15/4], ∀j ∈ Z_{ℓ+2}^{ℓ+T}.
(ii) Σ_{j=ℓ+2}^{ℓ+T} ln(max{1, (4 s_j)⁴}) < 4 ln 15.
(iii) Σ_{j=ℓ+2}^{ℓ+T} ln(1/s_j² − 1) < ln((f(z₀) − f∗)/ε).
(iv) T < 5 + (1/ln 15) ln((f(z₀) − f∗)/ε).

Proof. Denote f_j = f(z_j), j ∈ Z₀^{j_out + 1}. From j ≥ 2 and step 4 of Algorithm 2 we have

s_j² = (f_{j−1} − f_j)/(f_{j−2} − f_j), ∀j ∈ Z₂^{j_out}.

The inequality s_j > 0, ∀j ∈ Z_{ℓ+2}^{ℓ+T}, follows from Property 4.(iii). In order to prove the first claim it remains to prove the inequality s_j ≤ √15/4, ∀j ∈ Z_{ℓ+2}^{ℓ+T}. We proceed by reductio ad absurdum. Suppose that there is j ∈ Z_{ℓ+2}^{ℓ+T} such that s_j > √15/4. In this case, by (11),

m_{j+1} ≥ n_j = max{m_j, 4 s_j m_{j−1}} ≥ 4 s_j m_{j−1} > √15 m_{j−1}.

From this and the non-decreasing nature of the sequence {m_j} (Property 4.(ii)) we obtain

m_{ℓ+1+T} ≥ m_{j+1} > √15 m_{j−1} ≥ √15 m_{ℓ+1}.

This contradicts the assumptions of the property, thus proving the first claim.

From the non-decreasing nature of the sequence {m_j} (Property 4.(ii)) we have, for every j ∈ Z_{ℓ+2}^{ℓ+T}, by (11),

m_{j+1} ≥ n_j = max{m_j, 4 s_j m_{j−1}} ≥ m_{j−1} · max{1, 4 s_j}.

Equivalently,

ln(max{1, 4 s_j}) ≤ ln(m_{j+1}/m_{j−1}), ∀j ∈ Z_{ℓ+2}^{ℓ+T}.

This implies

Σ_{j=ℓ+2}^{ℓ+T} ln(max{1, 4 s_j}) ≤ Σ_{j=ℓ+2}^{ℓ+T} ln(m_{j+1}/m_{j−1}) = ln((m_{ℓ+T} m_{ℓ+1+T})/(m_{ℓ+1} m_{ℓ+2})) ≤ ln(m_{ℓ+1+T}²/m_{ℓ+1}²) = 2 ln(m_{ℓ+1+T}/m_{ℓ+1}) < 2 ln √15 = ln 15.   (21)

The second claim is obtained multiplying the last inequality by 4. To prove the third claim we notice that

Π_{j=ℓ+2}^{ℓ+T} (1/s_j² − 1) = Π_{j=ℓ+2}^{ℓ+T} (f_{j−2} − f_{j−1})/(f_{j−1} − f_j) = (f_ℓ − f_{ℓ+1})/(f_{ℓ+T−1} − f_{ℓ+T}).

Since ℓ + T ≤ j_out we have f_{ℓ+T−1} − f_{ℓ+T} > ε > 0. Using this inequality we obtain

Π_{j=ℓ+2}^{ℓ+T} (1/s_j² − 1) < (f_ℓ − f_{ℓ+1})/ε ≤ (f₀ − f_{ℓ+1})/ε ≤ (f₀ − f∗)/ε,

from where the third claim directly follows. In order to prove the last claim of the property we sum the inequalities given by the second and third claims to obtain

Σ_{j=ℓ+2}^{ℓ+T} ln((1/s_j² − 1) · max{1, (4 s_j)⁴}) < ln((f₀ − f∗)/ε) + 4 ln 15.   (22)

From the first claim we have s_j ∈ (0, √15/4], ∀j ∈ Z_{ℓ+2}^{ℓ+T}. Thus, the left term of (22) can be lower bounded by means of the following inequality (Lemma 2):

15 ≤ (1/s² − 1) · max{1, (4s)⁴}, ∀s ∈ (0, √15/4].

That is,

Σ_{j=ℓ+2}^{ℓ+T} ln 15 < ln((f₀ − f∗)/ε) + 4 ln 15.

Equivalently,

(T − 1) ln 15 < ln((f₀ − f∗)/ε) + 4 ln 15.

Therefore,
T < ln (cid:18) f − f ∗ (cid:15) (cid:19) . (cid:4) R EFERENCES[1] Y. Nesterov,
Lectures on convex optimization. Springer, 2018, vol. 137.
[2] A. Beck, First-order methods in optimization. SIAM, 2017, vol. 25.
[3] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[5] Y. Nesterov, "A method of solving a convex programming problem with convergence rate $O(1/k^2)$," Sov. Math. Dokl., vol. 27, no. 2, pp. 372–376, 1983.
[6] A. Beck and M. Teboulle, "Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems," IEEE Transactions on Image Processing, vol. 18, no. 11, pp. 2419–2434, 2009.
[7] T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk, "Fast alternating direction optimization methods,"
SIAM Journal on Imaging Sciences, vol. 7, no. 3, pp. 1588–1623, 2014.
[8] P. Patrinos, L. Stella, and A. Bemporad, "Douglas-Rachford splitting: Complexity estimates and accelerated variants," in Proceedings of the 53rd IEEE Conference on Decision and Control (CDC). IEEE, 2014, pp. 4234–4239.
[9] I. Pejcic and C. N. Jones, "Accelerated ADMM based on accelerated Douglas-Rachford splitting," in , 2016, pp. 1952–1957.
[10] Y. Zheng, G. Fantuzzi, A. Papachristodoulou, P. Goulart, and A. Wynn, "Fast ADMM for homogeneous self-dual embedding of sparse SDPs," IFAC-PapersOnLine, vol. 50, no. 1, pp. 8411–8416, 2017.
[11] G. Stathopoulos, M. Korda, and C. N. Jones, "Solving the infinite-horizon constrained LQR problem using accelerated dual proximal methods," IEEE Transactions on Automatic Control, vol. 62, no. 4, pp. 1752–1767, 2016.
[12] P. Patrinos and A. Bemporad, "An accelerated dual gradient-projection algorithm for embedded linear model predictive control," IEEE Transactions on Automatic Control, vol. 59, no. 1, pp. 18–33, 2014.
[13] J. L. Jerez, P. J. Goulart, S. Richter, G. A. Constantinides, E. C. Kerrigan, and M. Morari, "Embedded online optimization for model predictive control at megahertz rates," IEEE Transactions on Automatic Control, vol. 59, no. 12, pp. 3238–3251, 2014.
[14] M. Alamir, "Monitoring control updating period in fast gradient based NMPC," in , 2013, pp. 3621–3626.
[15] P. Krupa, D. Limon, and T. Alamo, "Implementation of model predictive control in programmable logic controllers," IEEE Transactions on Control Systems Technology, 2020.
[16] B. O'Donoghue and E. Candes, "Adaptive restart for accelerated gradient schemes," Foundations of Computational Mathematics, pp. 1–18, 2013.
[17] P. Giselsson and S. Boyd, "Monotonicity and restart in fast gradient methods," in Proceedings of the 53rd IEEE Conference on Decision and Control (CDC), 2014, pp. 5058–5063.
[18] I. Necoara, Y. Nesterov, and F. Glineur, "Linear convergence of first order methods for non-strongly convex optimization," Mathematical Programming, pp. 1–39, 2018.
[19] T. Alamo, P. Krupa, and D. Limon, "Restart FISTA with global linear convergence," in Proceedings of the 18th European Control Conference (ECC). IEEE, 2019, pp. 1969–1974.
[20] ——, "Gradient based restart FISTA," in Proceedings of the 58th IEEE Conference on Decision and Control (CDC). IEEE, 2019, pp. 3936–3941.
[21] D. Drusvyatskiy and A. S. Lewis, "Error bounds, quadratic growth, and linear convergence of proximal methods," Mathematics of Operations Research, vol. 43, no. 3, pp. 919–948, 2018.
[22] P.-W. Wang and C.-J. Lin, "Iteration complexity of feasible descent methods for convex optimization," Journal of Machine Learning Research, vol. 15, pp. 1523–1548, 2014.
[23] Y. Nesterov, "Gradient methods for minimizing composite functions," Mathematical Programming, vol. 140, pp. 125–161, 2013.
[24] ——,