Faster FISTA

Abstract

The “fast iterative shrinkage-thresholding algorithm”, a.k.a. FISTA, is one of the most widely used algorithms in the literature. However, despite its optimal theoretical $O(1/k^2)$ convergence rate guarantee, in practice its performance is often not as desired owing to its (local) oscillatory behaviour. Over the years, various approaches have been proposed to overcome this drawback of FISTA. In this paper, we propose a simple yet effective modification to the algorithm which has two advantages: 1) it enables us to prove the convergence of the generated sequence; 2) it shows superior practical performance compared to the original FISTA. Numerical experiments are presented to illustrate the superior performance of the proposed algorithm.


Jingwei Liang
DAMTP, University of Cambridge
Cambridge, United Kingdom

Carola-Bibiane Schönlieb
DAMTP, University of Cambridge
Cambridge, United Kingdom

Index Terms: FISTA, Forward–Backward splitting, inertial schemes, convergence rates, acceleration

I. INTRODUCTION

A. Problem statement

In various fields of science and engineering, including signal/image processing, inverse problems and machine learning, many problems can be cast as a structured composite non-smooth optimization problem of the sum of two functions, which usually reads

$$\min_{x \in \mathcal{H}} \Phi(x) \overset{\text{def}}{=} R(x) + F(x), \tag{P}$$

where $\mathcal{H}$ is a real Hilbert space, and
(A.1) $R : \mathcal{H} \to \mathbb{R} \cup \{+\infty\}$ is proper, closed and convex;
(A.2) $F : \mathcal{H} \to \mathbb{R}$, and $\nabla F$ is $L$-Lipschitz continuous;
(A.3) $\mathrm{Argmin}(\Phi) \neq \emptyset$, i.e. the set of minimizers is non-empty.

Typical examples of (P) can be found in Section IV.

B. Forward–Backward splitting and FISTA schemes

One classical approach to solve problem (P) is the Forward–Backward splitting (FBS) method [7], whose non-relaxed iteration takes the form

$$x_{k+1} \overset{\text{def}}{=} \mathrm{prox}_{\gamma_k R}\big(x_k - \gamma_k \nabla F(x_k)\big), \quad \gamma_k \in \,]0, 2/L[, \tag{1}$$

where $\gamma_k$ is the step-size, and $\mathrm{prox}_{\gamma R}(\cdot) \overset{\text{def}}{=} \operatorname{argmin}_{x \in \mathcal{H}} \tfrac{1}{2}\|x - \cdot\|^2 + \gamma R(x)$ denotes the proximity operator of $R$. The convergence of the FBS iterates is guaranteed as long as $\gamma_k \in [\underline{\epsilon}, 2/L - \overline{\epsilon}]$ for sufficiently small constants $\underline{\epsilon}, \overline{\epsilon} > 0$. For the convergence rate, it is well established that the objective function value, i.e. $\Phi(x_k) - \inf_{x \in \mathcal{H}} \Phi(x)$, converges at the speed of $O(1/k)$, which is quite slow. Over the years, various schemes have been proposed to accelerate the method. Among them, the FISTA scheme [3] by Beck and Teboulle (henceforth denoted as “FISTA-BT”) is the best known; it achieves an $O(1/k^2)$ convergence rate for the objective function value.

Algorithm 1: FISTA-BT algorithm

Initial: $t_0 = 1$, $\gamma = 1/L$ and $x_0 \in \mathcal{H}$, $x_{-1} = x_0$.
repeat
    $t_k = \dfrac{1 + \sqrt{1 + 4 t_{k-1}^2}}{2}$, $\quad a_k = \dfrac{t_{k-1} - 1}{t_k}$,
    $y_k = x_k + a_k (x_k - x_{k-1})$,
    $x_{k+1} = \mathrm{prox}_{\gamma R}\big(y_k - \gamma \nabla F(y_k)\big)$.    (2)
    $k = k + 1$;
until convergence;

However, while FBS is sequence convergent, the convergence of the sequence generated by FISTA-BT has been a long-standing open problem. This question was settled in [4], followed by [2] in the continuous dynamical system case. In [4], Chambolle and Dossal proposed the following strategy for updating $a_k$: let $d > 2$ and

$$t_k = \frac{k + d}{d}, \quad a_k = \frac{t_{k-1} - 1}{t_k}. \tag{3}$$

The above choice is denoted as “FISTA-CD”. Under this setting, they managed to prove the convergence of the sequence while maintaining the $O(1/k^2)$ rate on the objective function values. Later in [1], the rate was proven to be actually $o(1/k^2)$.

* This work was partly supported by the Leverhulme Trust project “Breaking the non-convexity barrier”, the EPSRC grant “EP/M00483X/1”, the EPSRC centre “EP/N014588/1”, the Cantab Capital Institute for the Mathematics of Information, and the Global Alliance project “Statistical and Mathematical Theory of Imaging”.

C. Slow practical performance

In practice, it has been reported in several works [10], [11] that, despite the $O(1/k^2)$ convergence rate guarantee, FISTA-BT often has very slow practical performance, which is mainly caused by the oscillatory behaviour of the scheme. For the FISTA-CD scheme, when $d$ is close to $2$, it has almost the same performance as FISTA-BT; see the numerical experiments in Section IV.

In [6], it is reported that when $d$ is chosen in a certain range, such as $[50, \ldots]$, FISTA-CD can in practice be much faster than FISTA-BT; see also Section IV. A natural question arises: can the original FISTA-BT method also achieve such a boost in practical performance? The main purpose of this paper is to answer this question.

II. A MODIFIED FISTA SCHEME

In this section, we present the main contribution of this paper, a modified scheme of FISTA-BT.

A. Two observations

In the original FISTA-BT scheme, the update of $t_k$ reads $t_k = \frac{1 + \sqrt{1 + 4 t_{k-1}^2}}{2}$. We have the following observations by replacing the $1, 1, 4$ in the numerator with parameters $p, q, r$.

a) Parameter $r$: Let $r > 0$, and consider $t_k = \frac{1 + \sqrt{1 + r t_{k-1}^2}}{2}$. Then $a_k = \frac{t_{k-1} - 1}{t_k}$ satisfies

$$\begin{cases} r \in \,]0, 4[\,: & t_k \to \dfrac{4}{4 - r}, \quad a_k \to \dfrac{r}{4}, \\[2mm] r \in [4, +\infty[\,: & t_k \to +\infty, \quad a_k \to \dfrac{2}{\sqrt{r}}. \end{cases} \tag{4}$$

Observation I: $r$ controls the limiting value of $a_k$.

b) Parameters $p, q$: Let $r \in \,]0, 4]$ and $p, q > 0$. Consider $t_k = \frac{p + \sqrt{q + r t_{k-1}^2}}{2}$. Then $a_k = \frac{t_{k-1} - 1}{t_k}$ satisfies

$$\begin{cases} r \in \,]0, 4[\,: & t_k \to \dfrac{2p + \Delta}{4 - r}, \quad a_k \to 1 - \dfrac{4 - r}{2p + \Delta}, \\[2mm] r = 4\,: & t_k \to +\infty, \quad a_k \to 1, \end{cases} \tag{5}$$

where $\Delta = \sqrt{r p^2 + (4 - r) q}$.

Fix $r = 4$. Figure 1 shows the effects of different values of $p, q$, together with two different choices of $d$ in (3). Since $r = 4$, we have $a_k \to 1$; clearly, the smaller the values of $p, q$, the slower $a_k$ converges to $1$. For FISTA-CD, the bigger the value of $d$, the slower $a_k$ converges to $1$.

Fig. 1: Different effects of $p, q$ and $d$.

Observation II: $p, q$ control the speed at which $a_k$ converges to $1$.

Remark II.1. Fix $r = 4$. If $q \le (2 - p)^2$, then

$$t_k = \frac{p + \sqrt{q + 4 t_{k-1}^2}}{2} \implies t_k^2 - t_k \le t_{k-1}^2,$$

which is the key to proving the convergence of the sequence of the modified FISTA scheme (Algorithm 2).

B. A modified FISTA scheme

Based on the above observations, we propose a modified FISTA scheme, which is described in Algorithm 2. From now on, to distinguish Algorithm 2 from FISTA-BT and FISTA-CD, we shall call it “FISTA-Mod”.
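The limits stated in (4) and (5), and the slow-down described in Observation II, can be checked numerically. The following Python sketch is an illustration added for this purpose and is not code from the authors; the helper names `inertial_sequence` and `limit_values` are hypothetical, and the pair $(p, q) = (1/20, 1/2)$ mentioned afterwards is an assumed example choice.

```python
import math


def inertial_sequence(p, q, r, n_iter, t0=1.0):
    """Iterate t_k = (p + sqrt(q + r * t_{k-1}^2)) / 2 starting from t_0 = t0.

    Returns two lists (t, a) with a_k = (t_{k-1} - 1) / t_k for k = 1..n_iter.
    """
    t_prev = t0
    t_vals, a_vals = [], []
    for _ in range(n_iter):
        t = (p + math.sqrt(q + r * t_prev ** 2)) / 2.0
        a_vals.append((t_prev - 1.0) / t)
        t_vals.append(t)
        t_prev = t
    return t_vals, a_vals


def limit_values(p, q, r):
    """Closed-form limits of t_k and a_k from (5); only valid for 0 < r < 4."""
    delta = math.sqrt(r * p ** 2 + (4.0 - r) * q)
    t_inf = (2.0 * p + delta) / (4.0 - r)
    a_inf = 1.0 - (4.0 - r) / (2.0 * p + delta)
    return t_inf, a_inf
```

For instance, comparing `inertial_sequence(1.0, 1.0, 4.0, 1000)` (the FISTA-BT update) with `inertial_sequence(1/20, 1/2, 4.0, 1000)` shows that the second sequence $a_k$ is further from $1$ after the same number of iterations, in line with Observation II.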

Algorithm 2: A modified FISTA scheme

Initial: $p \in \,]0, 1]$, $q > 0$ and $r \in \,]0, 4]$, $t_0 = 1$, $\gamma = 1/L$ and $x_0 \in \mathcal{H}$, $x_{-1} = x_0$.
repeat
    $t_k = \dfrac{p + \sqrt{q + r t_{k-1}^2}}{2}$, $\quad a_k = \dfrac{t_{k-1} - 1}{t_k}$,
    $y_k = x_k + a_k (x_k - x_{k-1})$,
    $x_{k+1} = \mathrm{prox}_{\gamma R}\big(y_k - \gamma \nabla F(y_k)\big)$.    (6)
until convergence;

Remark II.2. When $r \in \,]0, 4[$, Algorithm 2 becomes a variant of the inertial Forward–Backward method [6].

C. Convergence rate of the objective function

In this part we present the global convergence properties of the FISTA-Mod scheme. We first show that FISTA-Mod preserves the optimal $O(1/k^2)$ convergence rate of FISTA-BT, and then prove the convergence of the sequence $\{x_k\}_{k \in \mathbb{N}}$.

Theorem II.3 (Convergence rate of the objective). For the FISTA-Mod scheme (6), let $r = 4$ and choose $p \in \,]0, 1]$, $q \in \,]0, (2 - p)^2]$. Then

$$\Phi(x_k) - \Phi(x^\star) \le \frac{2L}{p^2 (k + 1)^2} \|x_0 - x^\star\|^2. \tag{7}$$

Remark II.4. Compare this with the original convergence rate of FISTA-BT [3], which is $\Phi(x_k) - \Phi(x^\star) \le \frac{2L}{(k+1)^2} \|x_0 - x^\star\|^2$. The parameter $p$ appears in the obtained rate estimate, and $p = 1$ yields the smallest constant in the rate. Although $p < 1$ gives a bigger constant in the rate estimate, as we shall see below it allows us to prove the $o(1/k^2)$ convergence rate.

Sketch of proof. There are two key conditions for proving Theorem II.3:
• from $q \le (2 - p)^2$, one can show that $t_k^2 - t_k \le t_{k-1}^2$;
• for $p \in \,]0, 1]$, we have $t_k \ge \frac{(k+1) p}{2}$.
With the above results, following the proof of [3, Theorem 4.1] we can prove Theorem II.3.

Theorem II.5 (From $O(1/k^2)$ to $o(1/k^2)$). For the FISTA-Mod scheme (6), let $r = 4$ and choose $p \in \,]0, 1[$, $q > 0$ such that $p \le q$. Then $\Phi(x_k) - \Phi(x^\star) = o(1/k^2)$.

Sketch of proof. The key to establishing the $o(1/k^2)$ convergence rate is that with $0 < p < 1$, one can show that

$$\frac{p (1 - p)(k + 1)}{2} \le (1 - p) t_k \le t_{k-1}^2 - (t_k^2 - t_k).$$

Then following [1], [4] we obtain the desired result.
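The two conditions used in the proof sketch of Theorem II.3, namely $t_k^2 - t_k \le t_{k-1}^2$ and $t_k \ge (k+1)p/2$ for $r = 4$, can be sanity-checked numerically for a given pair $(p, q)$. The Python sketch below is an added illustration rather than part of the original analysis; the function name `check_rate_conditions` and its tolerance are hypothetical choices, and since it only inspects finitely many iterations it is a check, not a proof.

```python
import math


def check_rate_conditions(p, q, n_iter=1000, tol=1e-9):
    """Check, for r = 4 and t_0 = 1, that for k = 1..n_iter:

      (a) t_k^2 - t_k <= t_{k-1}^2   (up to a relative tolerance), and
      (b) t_k >= (k + 1) * p / 2.

    Returns True if both hold at every inspected iteration, else False.
    """
    t_prev = 1.0
    for k in range(1, n_iter + 1):
        t = (p + math.sqrt(q + 4.0 * t_prev ** 2)) / 2.0
        # Condition (a): the relative tolerance absorbs rounding when equality holds.
        if t * t - t > t_prev * t_prev + tol * t * t:
            return False
        # Condition (b): linear lower bound on t_k.
        if t < (k + 1) * p / 2.0 - tol:
            return False
        t_prev = t
    return True
```

With $p = q = 1$ (FISTA-BT) condition (a) holds with equality, whereas a pair violating $q \le (2 - p)^2$, such as $(p, q) = (1, 4)$, already fails condition (a) at the first iteration.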

Remark II.6. (i) For the original FISTA-BT scheme, we have strictly $t_{k-1}^2 - (t_k^2 - t_k) = 0$, hence we are unable to obtain the $o(1/k^2)$ convergence rate. (ii) One byproduct of Theorem II.5 is that the sequence $\{x_k\}_{k \in \mathbb{N}}$ can be shown to be bounded.

D. Convergence of the sequence

Theorem II.7 (Convergence of the sequence). For the FISTA-Mod scheme, let $r = 4$ and choose $p \in \,]0, 1[$, $q > 0$ such that $p \le q$. Then
(i) there exists an $x^\star \in \mathrm{Argmin}(\Phi)$ to which the sequence $\{x_k\}_{k \in \mathbb{N}}$ generated by FISTA-Mod converges weakly;
(ii) we have $\|x_k - x_{k-1}\| = o(1/k)$.

Sketch of proof. There are two key conditions for proving Theorem II.7:
• the sequence $\{x_k\}_{k \in \mathbb{N}}$ is bounded;
• the inertial parameter $\{a_k\}_{k \in \mathbb{N}}$ can be uniformly bounded from above by another sequence.
With the above results, following the proof of [4, Theorem 4.1] we can prove Theorem II.7.

III. LAZY START AND ADAPTIVE STRATEGY

Since the modified FISTA scheme (6) has three degrees of freedom compared to the original FISTA-BT, we can design strategies to make FISTA-Mod adaptive to the properties of the problems so that faster practical performance can be achieved.
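To make the three degrees of freedom concrete, the following Python sketch runs scheme (6) with $p$, $q$, $r$ exposed as arguments. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the specific problem $F(x) = \frac{1}{2}\|f - Kx\|^2$ and $R(x) = \lambda \|x\|_1$ (so the proximity operator is soft-thresholding), uses the fixed step-size $\gamma = 1/L$ with $L = \|K\|^2$, and the function names, the zero initial point and the stopping rule on $\|x_k - x_{k-1}\|$ are hypothetical choices.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximity operator of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def fista_mod(K, f, lam, p=1.0, q=1.0, r=4.0, max_iter=5000, tol=1e-10):
    """Sketch of scheme (6) for min_x 0.5 * ||f - K x||^2 + lam * ||x||_1.

    The defaults p = q = 1, r = 4 give the FISTA-BT update of t_k.
    Returns the last iterate and the number of iterations performed.
    """
    L = np.linalg.norm(K, 2) ** 2          # Lipschitz constant of grad F
    gamma = 1.0 / L
    x_prev = x = np.zeros(K.shape[1])      # x_{-1} = x_0 = 0 (assumed start)
    t_prev = 1.0
    for k in range(1, max_iter + 1):
        t = (p + np.sqrt(q + r * t_prev ** 2)) / 2.0
        a = (t_prev - 1.0) / t             # inertial parameter a_k
        y = x + a * (x - x_prev)
        grad = K.T @ (K @ y - f)           # gradient of F at y_k
        x_next = soft_threshold(y - gamma * grad, gamma * lam)
        x_prev, x, t_prev = x, x_next, t
        if np.linalg.norm(x - x_prev) <= tol:
            break
    return x, k
```

Calling `fista_mod(K, f, lam)` and `fista_mod(K, f, lam, p=1/20, q=1/2)` on the same data allows the iteration counts of the two parameter choices to be compared; the values $1/20$ and $1/2$ here are only an example.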

A. Lazy-start FISTA-Mod

A very well-known behaviour of FISTA schemes is that, when the minimisation problem (P) is strongly convex, both the trajectories of $\{\|x_k - x_{k-1}\|\}_{k \in \mathbb{N}}$ and $\{\Phi(x_k) - \Phi(x^\star)\}_{k \in \mathbb{N}}$ will oscillate once $a_k$ is too close to $1$. Such oscillation slows down the algorithm [10], [11], and eventually makes the scheme slower than the original FBS scheme [6].

In [6], it is reported that for the FISTA-CD scheme, when the value of $d$ is chosen relatively big (e.g. $d \in [50, \ldots]$), the FISTA-CD scheme achieves a much faster practical performance; see also the numerical experiments in Section IV. An intuitive explanation for this phenomenon is that it is the interplay of the following two properties:
(i) the problems considered in [6] are locally strongly convex at the minimiser;
(ii) a relatively bigger value of $d$ slows down the speed at which $a_k$ converges to $1$, as we have seen in Figure 1.
The interaction of the above two properties makes the algorithm achieve faster practical performance.

Since all the parameters of the original FISTA-BT scheme are fixed, the speed at which its $a_k$ converges to $1$ is also fixed. For the FISTA-Mod scheme, in contrast, we can adjust the values of $p, q$ so as to control the speed at which $a_k$ approaches $1$. In practice, we found that the following choices of $p, q$ work quite well, which we dub “lazy-start FISTA-Mod”:

Lazy-start FISTA-Mod: $p \in [\ldots, \ldots]$, $q \in \,]0, \ldots]$.

B. Adaptive to local strong convexity

In practice, many problems encountered are not globally strongly convex. However, when certain conditions are satisfied (see for instance [6]), the problems often have a so-called local quadratic growth around the minimiser (see [6, Proposition 12]). As a result, locally adaptive strategies can be applied to achieve the optimal convergence rate.

Let us fix $\gamma = 1/L$, and suppose that problem (P) is $\alpha$-strongly convex for some $\alpha > 0$. Then the optimal choice of $a_k$ should be, according to [6, Section 4.4],

$$a_k \equiv a^\star = \frac{(1 - \sqrt{\gamma \alpha})^2}{1 - \gamma \alpha}.$$

Define the following function of $\alpha$:

$$f(\alpha) = \frac{4 (1 - \sqrt{\gamma \alpha})^2}{1 - \gamma \alpha}.$$

With $p = q = 1$, (4) gives $a_k \to r/4$, so the choice $r = f(\alpha)$ yields $a_k \to a^\star$. Suppose that the local strong convexity of $\Phi$ changes gradually until reaching $\alpha$. We propose the following adaptive FISTA scheme to take advantage of this local condition.

Algorithm 3: Adaptive-FISTA

Initial: $p = 1$, $q = 1$ and $r = 4$, $t_0 = 1$, $\gamma = 1/L$ and $x_0 \in \mathcal{H}$, $x_{-1} = x_0$.
repeat
    Estimate the local strong convexity $\alpha_k$;
    $r_k = f(\alpha_k)$, $\quad t_k = \dfrac{p + \sqrt{q + r_k t_{k-1}^2}}{2}$, $\quad a_k = \dfrac{t_{k-1} - 1}{t_k}$,
    $y_k = x_k + a_k (x_k - x_{k-1})$,
    $x_{k+1} = \mathrm{prox}_{\gamma R}\big(y_k - \gamma \nabla F(y_k)\big)$.
until convergence;

For the rest of the paper, we shall call the above adaptive scheme “Ada-FISTA” for short.

IV. NUMERICAL EXPERIMENTS

In this section, we present numerical experiments on problems arising from linear inverse problems and image/video processing to demonstrate the advantages of FISTA-Mod and Ada-FISTA over the original FISTA-BT.

A. Linear inverse problem

We first present the numerical experiments on linear inverse problems. Consider the following forward observation of a vector $x_{\mathrm{ob}} \in \mathbb{R}^n$,

$$f = K x_{\mathrm{ob}} + w, \tag{8}$$

where $f \in \mathbb{R}^m$ is the observation, $K : \mathbb{R}^n \to \mathbb{R}^m$ is some linear operator, and $w \in \mathbb{R}^m$ stands for noise. To recover or approximate $x_{\mathrm{ob}}$, one can consider the following optimization problem

$$\min_{x \in \mathbb{R}^n} \tfrac{1}{2} \|f - K x\|^2 + \lambda R(x), \tag{P$_\lambda$}$$

where $\lambda > 0$ is the trade-off parameter and $R$ is the regulariser based on the prior knowledge on $x_{\mathrm{ob}}$.

We consider solving (P$_\lambda$) with $R$ being the $\ell_1$-, $\ell_{1,2}$- and $\ell_\infty$-norms. The observations are generated according to (8). Here $K$ is generated from the standard Gaussian ensemble, with the following parameters:
$\ell_1$-norm: $(m, n) = (768, \ldots)$, $x_{\mathrm{ob}}$ is …-sparse;
$\ell_{1,2}$-norm: $(m, n) = (512, \ldots)$, $x_{\mathrm{ob}}$ has … non-zero blocks of size …;
$\ell_\infty$-norm: $(m, n) = (1020, \ldots)$, $x_{\mathrm{ob}}$ has … saturated entries.

Fig. 2: Performance comparison of different FISTA schemes in terms of $\{\|x_k - x_{k-1}\|\}_{k \in \mathbb{N}}$ for linear inverse problems: (a) $\ell_1$-norm; (b) $\ell_{1,2}$-norm; (c) $\ell_\infty$-norm. The schemes compared are the original FISTA-BT [3], the sequence convergent FISTA-CD [4], and the proposed FISTA-Mod scheme. For FISTA-CD, two choices of $d$ are considered: $d = 2, 75$. For FISTA-Mod, $(p, q) = (1/\ldots, \ldots/\ldots)$ is considered. The black line is FISTA-BT, the blue line is FISTA-Mod, the red lines are FISTA-CD, and the green line is Ada-FISTA.

The results of these examples are presented in Figure 2, from which we obtain the following observations and conclusions:
• For all three examples, all the FISTA schemes exhibit local linear convergence. This is mainly due to the fact that the considered $\ell_1$-, $\ell_{1,2}$- and $\ell_\infty$-norms belong to the class of so-called “partly smooth functions”; we refer to [6] for a dedicated study of this local linear convergence behaviour;
• FISTA-CD with $d = 2$ has almost the same performance as the original FISTA-BT scheme; see the light red line and the black line in all three figures;
• FISTA-CD with $d = 75$ and FISTA-Mod have very close performance, and both of them are much faster than FISTA-CD with $d = 2$ and the original FISTA-BT. More precisely, for the $\ell_1$- and $\ell_{1,2}$-norms, FISTA-CD with $d = 75$ and FISTA-Mod are about … times faster, while for the $\ell_\infty$-norm the difference is about … times, which is quite significant;
• Ada-FISTA shows the fastest performance, especially for the $\ell_\infty$-norm, where it is almost … times faster than the original FISTA-BT.

Remark IV.1. It should be noted that a drawback of Ada-FISTA is that, when the problem is of very large scale, estimating $\alpha_k$ at each step can be very time consuming. A proper way to deal with this deficiency is to perform the estimation only every $\kappa$ steps, where $\kappa$ is properly chosen. For instance, for the experiments provided in Figure 2, $\alpha$ is estimated every … steps for the $\ell_1$- and $\ell_{1,2}$-norms and every … steps for the $\ell_\infty$-norm.

B. Total variation based image deconvolution

We also consider a 2D image processing problem, where $y$ is a degraded image generated according to (8) and $K$ is a circular convolution matrix with a Gaussian kernel. The anisotropic total variation (TV) [8] is applied for reconstruction, and the graph-cut algorithm [5] is applied for computing the proximity operator of TV.

The “cameraman” image is used for the experiments; the original, blurred and reconstructed images are shown in Figure 3(a)-(c). We compare only the performance of FISTA-BT and two settings of FISTA-Mod; the result is depicted in Figure 3(d). The result of this comparison is very similar to those of the linear inverse problems: the lazy-start FISTA-Mod shows superior performance to FISTA-BT.

C. Principal component pursuit

To conclude this paper, we consider the principal component pursuit (PCP) problem [9], and apply it to decompose a video sequence into its background and foreground components. Assume that a real matrix $y \in \mathbb{R}^{m \times n}$ can be written as

$$y = x_{l,\mathrm{ob}} + x_{s,\mathrm{ob}} + w,$$

where $x_{l,\mathrm{ob}}$ is low-rank, $x_{s,\mathrm{ob}}$ is sparse and $w$ is the noise. The PCP proposed in [9] attempts to provably recover $(x_{l,\mathrm{ob}}, x_{s,\mathrm{ob}})$ to a good approximation, by solving the following convex optimization problem

$$\min_{x_l, x_s \in \mathbb{R}^{m \times n}} \tfrac{1}{2} \|y - x_l - x_s\|_F^2 + \lambda_1 \|x_s\|_1 + \lambda_2 \|x_l\|_*, \tag{9}$$

where $\|\cdot\|_F$ is the Frobenius norm.

Observe that for fixed $x_l$, the minimizer of (9) is $x_s^\star = \mathrm{prox}_{\lambda_1 \|\cdot\|_1}(y - x_l)$. Thus, (9) is equivalent to

$$\min_{x_l \in \mathbb{R}^{m \times n}} {}^{1}\big(\lambda_1 \|\cdot\|_1\big)(y - x_l) + \lambda_2 \|x_l\|_*, \tag{10}$$

where ${}^{1}\big(\lambda_1 \|\cdot\|_1\big)(y - x_l) = \min_z \tfrac{1}{2} \|y - x_l - z\|_F^2 + \lambda_1 \|z\|_1$ is the Moreau envelope of $\lambda_1 \|\cdot\|_1$ of index $1$, and hence has a $1$-Lipschitz continuous gradient.

Fig. 3: Performance comparison of FISTA-BT and FISTA-Mod in terms of $\{\|x_k - x_{k-1}\|\}_{k \in \mathbb{N}}$ for TV based image deblurring: (a) original image; (b) blurred image; (c) deblurred image; (d) performance of FISTA-BT and FISTA-Mod.

Fig. 4: Performance comparison of FISTA-BT and FISTA-Mod in terms of $\{\|x_k - x_{k-1}\|\}_{k \in \mathbb{N}}$ for principal component pursuit: (a) original frame; (b) sparse component; (c) low-rank component; (d) performance of FISTA-BT and FISTA-Mod.

We again compare only the performance of FISTA-BT and two settings of FISTA-Mod; the result is depicted in Figure 4(d). Again, the result of this comparison is very similar to the previous examples: the lazy-start FISTA-Mod shows superior performance to FISTA-BT.

REFERENCES

[1] H. Attouch and J. Peypouquet, “The rate of convergence of Nesterov’s accelerated Forward–Backward method is actually $o(k^{-2})$”