Faster FISTA
Jingwei Liang
DAMTP, University of Cambridge
Cambridge, United Kingdom
Carola-Bibiane Schönlieb
DAMTP, University of Cambridge
Cambridge, United Kingdom
Abstract: The "fast iterative shrinkage-thresholding algorithm", a.k.a. FISTA, is one of the most widely used algorithms in the literature. However, despite its optimal theoretical $O(1/k^2)$ convergence rate guarantee, in practice its performance is often not as good as desired, owing to its (local) oscillatory behaviour. Over the years, various approaches have been proposed to overcome this drawback of FISTA. In this paper, we propose a simple yet effective modification to the algorithm which has two advantages: 1) it enables us to prove the convergence of the generated sequence; 2) it shows superior practical performance compared to the original FISTA. Numerical experiments are presented to illustrate the superior performance of the proposed algorithm.
Index Terms: FISTA, Forward–Backward splitting, Inertial schemes, Convergence rates, Acceleration
I. INTRODUCTION
A. Problem statement
In various fields of science and engineering, including signal/image processing, inverse problems and machine learning, many problems can be cast as a structured composite non-smooth optimization problem, namely the minimisation of the sum of two functions, which usually reads
$$\min_{x \in \mathcal{H}} \Phi(x) \overset{\text{def}}{=} R(x) + F(x), \tag{P}$$
where $\mathcal{H}$ is a real Hilbert space and
(A.1) $R : \mathcal{H} \to \mathbb{R} \cup \{+\infty\}$ is proper, closed and convex;
(A.2) $F : \mathcal{H} \to \mathbb{R}$, and $\nabla F$ is $L$-Lipschitz continuous;
(A.3) $\mathrm{Argmin}(\Phi) \neq \emptyset$, i.e. the set of minimizers is non-empty.
Typical examples of (P) can be found in Section IV.
B. Forward–Backward splitting and FISTA schemes
One classical approach to solve problem (P) is the Forward–Backward splitting (FBS) method [7], whose non-relaxed iteration takes the form
$$x_{k+1} \overset{\text{def}}{=} \mathrm{prox}_{\gamma_k R}\big(x_k - \gamma_k \nabla F(x_k)\big), \quad \gamma_k \in ]0, 2/L[, \tag{1}$$
where $\gamma_k$ is the step-size, and $\mathrm{prox}_{\gamma R}(\cdot) \overset{\text{def}}{=} \mathrm{argmin}_{x \in \mathcal{H}} \tfrac{1}{2}\|x - \cdot\|^2 + \gamma R(x)$ denotes the proximity operator of $R$.
The convergence of the FBS iterates is guaranteed as long as $\gamma_k \in [\underline{\epsilon}, 2/L - \overline{\epsilon}]$ with $\underline{\epsilon}, \overline{\epsilon} \in ]0, 1/L[$. For the convergence rate, it is well established that the objective function value, i.e. $\Phi(x_k) - \inf_{x \in \mathcal{H}} \Phi(x)$, converges at the speed of $O(1/k)$, which is quite slow. Over the years, various schemes have been proposed to accelerate the method. Among them, the FISTA scheme [3] by Beck and Teboulle (henceforth denoted as "FISTA-BT") is the best known, and achieves an $O(1/k^2)$ convergence rate for the objective function value.

Algorithm 1: FISTA-BT algorithm
Initial: $t_0 = 1$, $\gamma = 1/L$ and $x_0 \in \mathcal{H}$, $x_{-1} = x_0$.
repeat
    $t_k = \frac{1 + \sqrt{1 + 4 t_{k-1}^2}}{2}$, $a_k = \frac{t_{k-1} - 1}{t_k}$,
    $y_k = x_k + a_k (x_k - x_{k-1})$,
    $x_{k+1} = \mathrm{prox}_{\gamma R}\big(y_k - \gamma \nabla F(y_k)\big)$.    (2)
    $k = k + 1$;
until convergence;

However, while FBS is sequence convergent, the convergence of the sequence generated by FISTA-BT has been a long-standing open problem. This question was settled in [4], followed by [2] in the continuous dynamical system case. In [4], Chambolle and Dossal proposed the following strategy for updating $a_k$: let $d > 2$ and
$$t_k = \frac{k + d}{d}, \quad a_k = \frac{t_{k-1} - 1}{t_k}. \tag{3}$$
The above choice is denoted as "FISTA-CD". Under this setting, they managed to prove the convergence of the sequence while maintaining the $O(1/k^2)$ rate on the objective function values. Later, in [1], the rate was proven to be actually $o(1/k^2)$.

* This work was partly supported by Leverhulme Trust project "Breaking the non-convexity barrier", the EPSRC grant "EP/M00483X/1", EPSRC centre "EP/N014588/1", the Cantab Capital Institute for the Mathematics of Information, and the Global Alliance project "Statistical and Mathematical Theory of Imaging".
C. Slow practical performance
In practice, it has been reported in several works [10], [11] that, despite the $O(1/k^2)$ convergence rate guarantee, FISTA-BT is often very slow, which is mainly caused by the oscillatory behaviour of the scheme. For the FISTA-CD scheme, when $d$ is close to 2, it has almost the same performance as FISTA-BT; see the numerical experiments in Section IV.
In [6], it is reported that when $d$ is chosen in a certain range, such as $[50, \dots]$, FISTA-CD can be much faster than FISTA-BT in practice; see also Section IV. A natural question arises: can the original FISTA-BT method also achieve such a boost of performance in practice? The main purpose of this paper is to answer this question.
arXiv: [math.OC], Jul
II. A MODIFIED FISTA SCHEME
In this section, we present the main contribution of this paper, a modified scheme of FISTA-BT.
A. Two observations
In the original FISTA-BT scheme, the update of $t_k$ reads
$$t_k = \frac{1 + \sqrt{1 + 4 t_{k-1}^2}}{2}.$$
We make the following observations by replacing the constants 1, 1, 4 in the numerator with parameters $p, q, r$.
a) Parameter $r$: Let $r > 0$ and consider $t_k = \big(1 + \sqrt{1 + r t_{k-1}^2}\big)/2$. Then $a_k = (t_{k-1} - 1)/t_k$ satisfies
$$\begin{cases} r \in ]0, 4[: & t_k \to \dfrac{4}{4 - r}, \quad a_k \to \dfrac{r}{4}, \\[4pt] r \in [4, +\infty[: & t_k \to +\infty, \quad a_k \to \dfrac{2}{\sqrt{r}}. \end{cases} \tag{4}$$
Observation I: $r$ controls the limiting value of $a_k$.
b) Parameters $p, q$: Let $r \in ]0, 4]$ and $p, q > 0$. Consider $t_k = \big(p + \sqrt{q + r t_{k-1}^2}\big)/2$. Then $a_k = (t_{k-1} - 1)/t_k$ satisfies
$$\begin{cases} r \in ]0, 4[: & t_k \to \dfrac{2p + \Delta}{4 - r}, \quad a_k \to 1 - \dfrac{4 - r}{2p + \Delta}, \\[4pt] r = 4: & t_k \to +\infty, \quad a_k \to 1, \end{cases} \tag{5}$$
where $\Delta = \sqrt{r p^2 + (4 - r) q}$.
Fix $r = 4$. Figure 1 shows the effects of different values of $p, q$, together with two different choices of $d$ in (3). Since $r = 4$, we have $a_k \to 1$, and clearly the smaller the values of $p, q$, the slower $a_k$ converges to 1. For FISTA-CD, the bigger the value of $d$, the slower $a_k$ converges to 1.
Fig. 1: Different effects of $p, q$ and $d$.
Observation II: $p, q$ control the speed at which $a_k$ converges to 1.
Remark II.1. Fix $r = 4$. If $q \le (2 - p)^2$, then
$$t_k = \frac{p + \sqrt{q + 4 t_{k-1}^2}}{2} \implies t_k^2 - t_k \le t_{k-1}^2,$$
which is the key to proving the convergence of the sequence generated by the modified FISTA scheme (Algorithm 2).
B. A modified FISTA scheme
Based on the above observations, we propose a modified FISTA scheme, which is described in Algorithm 2. From now on, to distinguish Algorithm 2 from FISTA-BT and FISTA-CD, we shall call it "FISTA-Mod".
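Observations I and II can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names `inertial_sequence` and `predicted_limits` are hypothetical, and the parameter values in the usage lines are illustrative choices only. It iterates the scalar recursion for $t_k$ and $a_k$ and compares the result with the limits stated in (5).

```python
import math


def inertial_sequence(p=1.0, q=1.0, r=4.0, n_iter=1000, t0=1.0):
    """Iterate t_k = (p + sqrt(q + r * t_{k-1}^2)) / 2 and a_k = (t_{k-1} - 1) / t_k.

    Returns two lists [t_1, ..., t_n] and [a_1, ..., a_n].
    """
    ts, coeffs = [], []
    t_prev = t0
    for _ in range(n_iter):
        t = (p + math.sqrt(q + r * t_prev ** 2)) / 2.0
        coeffs.append((t_prev - 1.0) / t)
        ts.append(t)
        t_prev = t
    return ts, coeffs


def predicted_limits(p, q, r):
    """Limits of t_k and a_k as stated in (5); only valid for 0 < r < 4."""
    delta = math.sqrt(r * p ** 2 + (4.0 - r) * q)
    t_lim = (2.0 * p + delta) / (4.0 - r)
    a_lim = 1.0 - (4.0 - r) / (2.0 * p + delta)
    return t_lim, a_lim


# Observation I: with r < 4 the sequences settle at the limits in (5).
ts, coeffs = inertial_sequence(p=0.5, q=2.0, r=3.0)
t_lim, a_lim = predicted_limits(p=0.5, q=2.0, r=3.0)
print(ts[-1], t_lim, coeffs[-1], a_lim)

# Observation II: with r = 4, smaller p, q keep a_k away from 1 for longer.
_, a_bt = inertial_sequence(p=1.0, q=1.0, r=4.0)      # FISTA-BT rule
_, a_small = inertial_sequence(p=0.05, q=0.5, r=4.0)  # illustrative small p, q
print(a_bt[99], a_small[99])
```

With $p = q = 1$ the recursion is the FISTA-BT rule, so the second comparison contrasts the original scheme with one slower choice of $(p, q)$ at the same iteration index.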
Algorithm 2: A modified FISTA scheme
Initial: $p \in ]0, 1]$, $q > 0$ and $r \in ]0, 4]$, $t_0 = 1$, $\gamma = 1/L$ and $x_0 \in \mathcal{H}$, $x_{-1} = x_0$.
repeat
    $t_k = \frac{p + \sqrt{q + r t_{k-1}^2}}{2}$, $a_k = \frac{t_{k-1} - 1}{t_k}$,
    $y_k = x_k + a_k (x_k - x_{k-1})$,
    $x_{k+1} = \mathrm{prox}_{\gamma R}\big(y_k - \gamma \nabla F(y_k)\big)$.    (6)
until convergence;

Remark II.2. When $r \in ]0, 4[$, Algorithm 2 becomes a variant of the inertial Forward–Backward method [6].
C. Convergence rate of the objective function
In this part we present the global convergence properties of the FISTA-Mod scheme. We first show that FISTA-Mod preserves the optimal $O(1/k^2)$ convergence rate of FISTA-BT, and then prove the convergence of the sequence $\{x_k\}_{k \in \mathbb{N}}$.
Theorem II.3 (Convergence rate of the objective). For the FISTA-Mod scheme (6), let $r = 4$ and choose $p \in ]0, 1]$, $q \in ]0, (2 - p)^2]$. Then
$$\Phi(x_k) - \Phi(x^\star) \le \frac{2L}{p^2 (k + 1)^2} \|x_0 - x^\star\|^2. \tag{7}$$
Remark II.4. For comparison, the original convergence rate of FISTA-BT [3] is $\Phi(x_k) - \Phi(x^\star) \le \frac{2L}{(k+1)^2} \|x_0 - x^\star\|^2$. The parameter $p$ appears in the rate estimate (7), and $p = 1$ yields the smallest constant. Although $p < 1$ gives a bigger constant in the rate estimate, as we shall see below it allows us to prove an $o(1/k^2)$ convergence rate.
Sketch of proof. There are two key ingredients in the proof of Theorem II.3:
• from $q \le (2 - p)^2$, one can show that $t_k^2 - t_k \le t_{k-1}^2$;
• for $p \in ]0, 1]$, we have $t_k \ge \frac{(k+1) p}{2}$.
With these two results, following the proof of [3, Theorem 4.1] yields Theorem II.3.
Theorem II.5 (From $O(1/k^2)$ to $o(1/k^2)$). For the FISTA-Mod scheme (6), let $r = 4$ and choose $p \in ]0, 1[$, $q > 0$ such that $p \le q$. Then $\Phi(x_k) - \Phi(x^\star) = o(1/k^2)$.
Sketch of proof. The key to establishing the $o(1/k^2)$ convergence rate is that, with $0 < p < 1$, one can show that
$$\frac{p (1 - p)(k + 1)}{2} \le (1 - p) t_k \le t_{k-1}^2 - (t_k^2 - t_k).$$
Then, following [1], [4], we obtain the desired result.
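The quantities appearing in these sketches can be read off from one identity. The short derivation below is a reading aid under the assumptions $r = 4$, $p \le 1$ and $t_0 = 1$ (so that $t_k \ge 1$); it is not the authors' full proof.

```latex
% Square 2 t_k - p = \sqrt{q + 4 t_{k-1}^2} and rearrange:
\[
  4 t_k^2 - 4 p\, t_k + p^2 = q + 4 t_{k-1}^2
  \quad\Longrightarrow\quad
  t_{k-1}^2 - (t_k^2 - t_k) = (1 - p)\, t_k + \frac{p^2 - q}{4}.
\]
% First condition: if p \le 1, q \le (2 - p)^2 and t_k \ge 1, then
\[
  (1 - p)\, t_k + \frac{p^2 - q}{4}
  \;\ge\; (1 - p)\, t_k + \frac{p^2 - (2 - p)^2}{4}
  \;=\; (1 - p)(t_k - 1) \;\ge\; 0 .
\]
% Second condition: since \sqrt{q + 4 t_{k-1}^2} \ge 2 t_{k-1},
\[
  t_k \ge t_{k-1} + \frac{p}{2}
  \quad\Longrightarrow\quad
  t_k \ge 1 + \frac{k p}{2} \ge \frac{(k + 1)\, p}{2}.
\]
```

For $p = q = 1$ the right-hand side of the identity is exactly 0, which is the FISTA-BT case, whereas for $p < 1$ the term $(1 - p) t_k$ grows linearly in $k$.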
Remark II.6. (i) For the original FISTA-BT scheme, we have exactly $t_{k-1}^2 - (t_k^2 - t_k) = 0$, hence the above argument cannot be used to obtain the $o(1/k^2)$ convergence rate.
(ii) One byproduct of Theorem II.5 is that the sequence $\{x_k\}_{k \in \mathbb{N}}$ can be shown to be bounded.
D. Convergence of the sequence
Theorem II.7 (Convergence of the sequence). For the FISTA-Mod scheme, let $r = 4$ and choose $p \in ]0, 1[$, $q > 0$ such that $p \le q$. Then
(i) there exists an $x^\star \in \mathrm{Argmin}(\Phi)$ to which the sequence $\{x_k\}_{k \in \mathbb{N}}$ generated by FISTA-Mod converges weakly;
(ii) we have $\|x_k - x_{k-1}\| = o(1/k)$.
Sketch of proof. There are two key ingredients in the proof of Theorem II.7:
• the sequence $\{x_k\}_{k \in \mathbb{N}}$ is bounded;
• the inertial parameter $\{a_k\}_{k \in \mathbb{N}}$ can be uniformly bounded from above by another sequence.
With these two results, following the proof of [4, Theorem 4.1] yields Theorem II.7.
III. LAZY START AND ADAPTIVE STRATEGY
Since the modified FISTA scheme (6) has three degrees of freedom compared to the original FISTA-BT, we can design strategies to make FISTA-Mod adaptive to the properties of the problem, so that faster practical performance can be achieved.
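To make the three degrees of freedom concrete, here is a minimal pure-Python sketch of the FISTA-Mod iteration (6) with $p, q, r$ exposed as arguments. It is an illustration under stated assumptions rather than the authors' implementation: the names `fista_mod` and `soft_threshold`, the toy separable problem $F(x) = \tfrac{1}{2}\sum_i c_i (x_i - b_i)^2$, $R = \lambda \|x\|_1$, and the values $(p, q) = (1/20, 1/2)$ are all choices made here for demonstration.

```python
import math


def soft_threshold(v, tau):
    """Proximity operator of tau * |.| applied to a scalar v."""
    return math.copysign(max(abs(v) - tau, 0.0), v)


def fista_mod(grad_F, prox_R, L, x0, p=1.0, q=1.0, r=4.0, n_iter=500):
    """Sketch of iteration (6) for vectors stored as Python lists.

    grad_F(y)       -> gradient of F at y (list of floats)
    prox_R(v, g)    -> proximity operator of g * R evaluated at v
    L               -> Lipschitz constant of grad_F; the step-size is 1 / L
    p = q = 1, r = 4 gives the FISTA-BT update of t_k.
    """
    gamma = 1.0 / L
    t_prev = 1.0
    x_prev = list(x0)  # x_{-1} = x_0
    x = list(x0)
    for _ in range(n_iter):
        t = (p + math.sqrt(q + r * t_prev ** 2)) / 2.0
        a = (t_prev - 1.0) / t
        y = [xi + a * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        g = grad_F(y)
        x_prev = x
        x = prox_R([yi - gamma * gi for yi, gi in zip(y, g)], gamma)
        t_prev = t
    return x


# Toy problem (hypothetical data): separable quadratic plus l1 penalty.
c = [1.0, 0.5, 0.1]
b = [3.0, -2.0, 0.5]
lam = 0.2


def grad_F(y):
    return [ci * (yi - bi) for ci, yi, bi in zip(c, y, b)]


def prox_R(v, g):
    return [soft_threshold(vi, g * lam) for vi in v]


x_bt = fista_mod(grad_F, prox_R, L=max(c), x0=[0.0, 0.0, 0.0])
x_lazy = fista_mod(grad_F, prox_R, L=max(c), x0=[0.0, 0.0, 0.0], p=0.05, q=0.5)

# Closed-form minimiser of this separable problem, used as a reference.
x_star = [soft_threshold(bi, lam / ci) for bi, ci in zip(b, c)]
print(x_bt, x_lazy, x_star)
```

Passing `grad_F` and `prox_R` as callables keeps the sketch independent of the particular $F$ and $R$; the toy problem was chosen only because its minimiser is available in closed form.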
A. Lazy-start FISTA-Mod
A well-known behaviour of FISTA schemes is that, when the minimisation problem (P) is strongly convex, the trajectories of both $\{\|x_k - x_{k-1}\|\}_{k \in \mathbb{N}}$ and $\{\Phi(x_k) - \Phi(x^\star)\}_{k \in \mathbb{N}}$ oscillate once $a_k$ is too close to 1. Such oscillation slows down the algorithm [10], [11], and eventually makes the scheme slower than the original FBS scheme [6].
In [6], it is reported that for the FISTA-CD scheme, when the value of $d$ is chosen relatively big (e.g. $d \in [50, \dots]$), the scheme achieves much faster practical performance; see also the numerical experiments in Section IV. An intuitive explanation for this phenomenon is the interplay of the following two properties:
(i) the problems considered in [6] are locally strongly convex around the minimiser;
(ii) a relatively big value of $d$ slows down the speed at which $a_k$ converges to 1, as we have seen in Figure 1.
The interaction of these two properties makes the algorithm achieve faster practical performance.
Since all the parameters of the original FISTA-BT scheme are fixed, the speed at which $a_k$ converges to 1 is also fixed. For the FISTA-Mod scheme, in contrast, we can adjust the values of $p, q$ to control the speed at which $a_k$ approaches 1. In practice, we found that the following choices of $p, q$ work quite well, which we dub "lazy-start FISTA-Mod":
Lazy-start FISTA-Mod: $p \in [\dots, \dots]$, $q \in ]0, \dots$.
B. Adaptive to local strong convexity
In practice, many problems encountered are not globally strongly convex. However, when certain conditions are satisfied (see for instance [6]), the problems locally exhibit a so-called quadratic growth around the minimiser (see [6, Proposition 12]). As a result, locally adaptive strategies can be applied to achieve the optimal convergence rate.
Let us fix $\gamma = 1/L$, and suppose that problem (P) is $\alpha$-strongly convex for some $\alpha > 0$. Then, according to [6, Section 4.4], the optimal choice of $a_k$ is
$$a_k \equiv a^\star = \frac{(1 - \sqrt{\gamma\alpha})^2}{1 - \gamma\alpha} = \frac{1 - \sqrt{\gamma\alpha}}{1 + \sqrt{\gamma\alpha}}.$$
Define the following function of $\alpha$:
$$f(\alpha) = \frac{4 (1 - \sqrt{\gamma\alpha})^2}{1 - \gamma\alpha}.$$
Supposing that the local strong convexity of $\Phi$ changes gradually until reaching $\alpha$, we propose the following adaptive FISTA scheme to take advantage of this local condition.

Algorithm 3: Adaptive-FISTA
Initial: $p = 1$, $q = 1$ and $r = 4$, $t_0 = 1$, $\gamma = 1/L$ and $x_0 \in \mathcal{H}$, $x_{-1} = x_0$.
repeat
    Estimate the local strong convexity $\alpha_k$;
    $r_k = f(\alpha_k)$, $t_k = \frac{p + \sqrt{q + r_k t_{k-1}^2}}{2}$, $a_k = \frac{t_{k-1} - 1}{t_k}$,
    $y_k = x_k + a_k (x_k - x_{k-1})$,
    $x_{k+1} = \mathrm{prox}_{\gamma R}\big(y_k - \gamma \nabla F(y_k)\big)$.
until convergence;

For the rest of the paper, we shall call the above adaptive scheme "Ada-FISTA" for short.
IV. NUMERICAL EXPERIMENTS
In this section, we present numerical experiments on problems arising from linear inverse problems and image/video processing to demonstrate the advantages of FISTA-Mod and Ada-FISTA over the original FISTA-BT.
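Since Ada-FISTA enters these comparisons, it may help to first check the rule $r_k = f(\alpha_k)$ of Algorithm 3 in isolation. The sketch below is an assumption-laden illustration: it holds $\alpha$ fixed (no estimation step is modelled), uses hypothetical names `r_from_alpha` and `limiting_inertia`, and picks $L = 1$, $\alpha = 0.01$ arbitrarily. By (4), with $p = q = 1$ and $r = f(\alpha) < 4$ the inertial parameter should tend to $r/4 = a^\star$.

```python
import math


def r_from_alpha(alpha, gamma):
    """f(alpha) = 4 (1 - sqrt(gamma * alpha))^2 / (1 - gamma * alpha); needs gamma * alpha < 1."""
    s = math.sqrt(gamma * alpha)
    return 4.0 * (1.0 - s) ** 2 / (1.0 - gamma * alpha)


def limiting_inertia(r, n_iter=2000):
    """Run t_k = (1 + sqrt(1 + r * t_{k-1}^2)) / 2 from t_0 = 1 and return the last a_k."""
    t_prev = 1.0
    a = 0.0
    for _ in range(n_iter):
        t = (1.0 + math.sqrt(1.0 + r * t_prev ** 2)) / 2.0
        a = (t_prev - 1.0) / t
        t_prev = t
    return a


L_const = 1.0   # illustrative Lipschitz constant
alpha = 0.01    # illustrative strong convexity modulus
gamma = 1.0 / L_const

r = r_from_alpha(alpha, gamma)
s = math.sqrt(gamma * alpha)
a_star = (1.0 - s) / (1.0 + s)  # optimal inertial parameter for this alpha
a_limit = limiting_inertia(r)
print(r, a_limit, a_star)
```

In Algorithm 3 the value of $r_k$ changes with the estimate $\alpha_k$, so this fixed-$\alpha$ check only illustrates where the inertial parameter is being steered once the estimate has stabilised.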
A. Linear inverse problem
We first present numerical experiments on linear inverse problems. Consider the following forward observation of a vector $x_{\mathrm{ob}} \in \mathbb{R}^n$:
$$f = K x_{\mathrm{ob}} + w, \tag{8}$$
where $f \in \mathbb{R}^m$ is the observation, $K : \mathbb{R}^n \to \mathbb{R}^m$ is some linear operator, and $w \in \mathbb{R}^m$ stands for noise. To recover or approximate $x_{\mathrm{ob}}$, one can consider the following optimization problem
$$\min_{x \in \mathbb{R}^n} \tfrac{1}{2}\|f - K x\|^2 + \lambda R(x), \tag{$P_\lambda$}$$
where $\lambda > 0$ is the trade-off parameter and $R$ is the regulariser based on prior knowledge of $x_{\mathrm{ob}}$.
We consider solving ($P_\lambda$) with $R$ being the $\ell_1$-, $\ell_{1,2}$- and $\ell_\infty$-norms. The observations are generated according to (8). Here $K$ is generated from the standard Gaussian ensemble, with the following parameters:
$\ell_1$-norm: $(m, n) = (768, \dots)$, $x_{\mathrm{ob}}$ is $\dots$-sparse;
$\ell_{1,2}$-norm: $(m, n) = (512, \dots)$, $x_{\mathrm{ob}}$ has $\dots$ non-zero blocks of size $\dots$;
$\ell_\infty$-norm: $(m, n) = (1020, \dots)$, $x_{\mathrm{ob}}$ has $\dots$ saturated entries.

Fig. 2: Performance comparison of different FISTA schemes in terms of $\{\|x_k - x_{k-1}\|\}_{k \in \mathbb{N}}$ for linear inverse problems: (a) $\ell_1$-norm; (b) $\ell_{1,2}$-norm; (c) $\ell_\infty$-norm. The schemes compared are the original FISTA-BT [3], the sequence-convergent FISTA-CD [4], and the proposed FISTA-Mod scheme. For FISTA-CD, two choices of $d$ are considered: $d = 2, 75$. For FISTA-Mod, $(p, q) = (1/\dots, \dots/\dots)$ is considered. The black line is FISTA-BT, the blue line is FISTA-Mod, the red lines are FISTA-CD, and the green line is Ada-FISTA.

The results of these examples are presented in Figure 2, from which we draw the following observations and conclusions:
• For all three examples, all the FISTA schemes exhibit local linear convergence. This is mainly because the $\ell_1$-, $\ell_{1,2}$- and $\ell_\infty$-norms considered belong to the class of so-called "partly smooth functions"; we refer to [6] for a dedicated study of this local linear convergence behaviour;
• FISTA-CD with $d = 2$ has almost the same performance as the original FISTA-BT scheme; see the light red line and the black line in all three figures;
• FISTA-CD with $d = 75$ and FISTA-Mod have very close performance, and both are much faster than FISTA-CD with $d = 2$ and the original FISTA-BT. More precisely, for the $\ell_1$- and $\ell_{1,2}$-norms, FISTA-CD with $d = 75$ and FISTA-Mod are about $\dots$ times faster, while for the $\ell_\infty$-norm the difference is about $\dots$ times, which is quite significant;
• Ada-FISTA shows the fastest performance, especially for the $\ell_\infty$-norm, where it is almost $\dots$ times faster than the original FISTA-BT.
Remark IV.1. It should be noted that a drawback of Ada-FISTA is that, when the problem is of very large scale, estimating $\alpha_k$ at each step can be very time consuming. One way to deal with this deficiency is to perform the estimation only every $\kappa$ steps, where $\kappa$ is properly chosen. For instance, for the experiments in Figure 2, $\alpha$ is estimated every $\dots$ steps for the $\ell_1$- and $\ell_{1,2}$-norms and every $\dots$ steps for the $\ell_\infty$-norm.
B. Total variation based image deconvolution
We also consider a 2D image processing problem, where $y$ is a degraded image generated according to (8) and $K$ is a circular convolution matrix with a Gaussian kernel. The anisotropic total variation (TV) [8] is applied for reconstruction, and the graph-cut algorithm [5] is applied for computing the proximity operator of TV.
The "cameraman" image is used for the experiments; the original, blurred and reconstructed images are shown in Figure 3(a)-(c). We compare only the performance of FISTA-BT and two settings of FISTA-Mod; the result is depicted in Figure 3(d). The outcome is very similar to that of the linear inverse problems: the lazy-start FISTA-Mod performs better than FISTA-BT.
C. Principal component pursuit
To conclude this paper, we consider the principal component pursuit (PCP) problem [9], and apply it to decompose a video sequence into its background and foreground components. Assume that a real matrix $y \in \mathbb{R}^{m \times n}$ can be written as
$$y = x_{l,\mathrm{ob}} + x_{s,\mathrm{ob}} + w,$$
where $x_{l,\mathrm{ob}}$ is low-rank, $x_{s,\mathrm{ob}}$ is sparse and $w$ is the noise. The PCP proposed in [9] attempts to provably recover $(x_{l,\mathrm{ob}}, x_{s,\mathrm{ob}})$ to a good approximation by solving the following convex optimization problem
$$\min_{x_l, x_s \in \mathbb{R}^{m \times n}} \tfrac{1}{2}\|y - x_l - x_s\|_F^2 + \lambda_1 \|x_s\|_1 + \lambda_2 \|x_l\|_*, \tag{9}$$
where $\|\cdot\|_F$ is the Frobenius norm.
Observe that for fixed $x_l$, the minimizer of (9) is $x_s^\star = \mathrm{prox}_{\lambda_1 \|\cdot\|_1}(y - x_l)$. Thus, (9) is equivalent to
$$\min_{x_l \in \mathbb{R}^{m \times n}} {}^{1}\big(\lambda_1 \|\cdot\|_1\big)(y - x_l) + \lambda_2 \|x_l\|_*, \tag{10}$$
where ${}^{1}\big(\lambda_1 \|\cdot\|_1\big)(y - x_l) = \min_z \tfrac{1}{2}\|y - x_l - z\|_F^2 + \lambda_1 \|z\|_1$ is the Moreau envelope of $\lambda_1 \|\cdot\|_1$ of index 1, and hence has a 1-Lipschitz continuous gradient.

Fig. 3: Performance comparison of FISTA-BT and FISTA-Mod in terms of $\{\|x_k - x_{k-1}\|\}_{k \in \mathbb{N}}$ for TV based image deblurring: (a) original image; (b) blurred image; (c) deblurred image; (d) performance of FISTA-BT and FISTA-Mod.

Fig. 4: Performance comparison of FISTA-BT and FISTA-Mod in terms of $\{\|x_k - x_{k-1}\|\}_{k \in \mathbb{N}}$ for principal component pursuit: (a) original frame; (b) sparse component; (c) low-rank component; (d) performance of FISTA-BT and FISTA-Mod.

We again compare only the performance of FISTA-BT and two settings of FISTA-Mod; the result is depicted in Figure 4(d). As in the previous examples, the lazy-start FISTA-Mod performs better than FISTA-BT.
REFERENCES
[1] H. Attouch and J. Peypouquet, "The rate of convergence of Nesterov's accelerated Forward–Backward method is actually $o(k^{-2})$"