"FISTA" in Banach spaces with adaptive discretisations
““FISTA” in Banach spaces with adaptive discretisations
Antonin Chambolle ∗ and Robert Tovey † Wednesday 3 rd February, 2021
Abstract
FISTA is a popular convex optimisation algorithm which is known to converge at an optimal ratewhenever the optimisation domain is contained in a suitable Hilbert space. We propose a modifiedalgorithm where each iteration is performed in a subspace, and that subspace is allowed to changeat every iteration. Analytically, this allows us to guarantee convergence in a Banach space setting,although at a reduced rate depending on the conditioning of the specific problem. Numericallywe show that a greedy adaptive choice of discretisation can greatly increase the time and memoryefficiency in infinite dimensional Lasso optimisation problems.
The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) was proposed by Beck and Teboulle(2009) as an extension of Nesterov’s fast gradient method (see Nesterov (2004)) and is now a verypopular algorithm for minimising the sum of two convex functions. We write this as the problem ofcomputing u ∗ ∈ argmin u ∈ H E( u ) such that E( u ) := f( u ) + g( u ) , (1.1)where f : H → R is a convex differentiable function with L -Lipschitz gradient and g : H → R is a convexfunction, whose “proximity operator” is easy to compute. The iterates of the FISTA algorithm will bedenoted u n ∈ H . Beck and Teboulle (2009) showed that E( u n ) − E( u ∗ ) converges at the optimum rateof n − . Chambolle and Dossal (2015) showed (up to a minor modification) convergence of the iteratesin a general Hibert space setting. Many further works have gone on to demonstrate faster practicalconvergence rates for slightly modified variants of FISTA (Tao et al., 2016; Liang et al., 2017; Alamoet al., 2019).The contribution of this work is a modified FISTA algorithm which achieves two key properties: • we extend the convergence results to the case u ∗ ∈ H \ H where U := H is a Banach space. Theproven rate is slower than the optimal n − , but appears to be quite sharp in numerical examples. • we allow for adaptive discretisation such that u n ∈ U n ⊂ H . With simple guarantees on the rates ofconvergence, we show analytically that applying our FISTA algorithm with inexact discretisationdoes not reduce the proven rate of convergence. In the Hilbert space setting this is the original n − rate. In numerical examples we demonstrate that greedy adaptive discretisation strategiescan achieve much greater performance than uniform discretisation in both measures of time andcomputer memory requirements.There is much overlap between the techniques used in this work and those used in the literature ofinexact optimisation, although we believe that our interpretation for the use outside of Hilbert spacesis the key theoretical novelty. Some other works for FISTA-like algorithms include (Jiang et al., 2012;Villa et al., 2013). Of particular note, our stability estimate for FISTA in Theorem 4.8 is very similarto (Schmidt et al., 2011, Proposition 2) and (Aujol and Dossal, 2015, Proposition 3.3). This is then ∗ CEREMADE, CNRS & Universit´e Paris Dauphine, PSL Research University, Paris, email: [email protected] † MOKAPLAN, INRIA Paris, Paris, email: [email protected] a r X i v : . [ m a t h . O C ] F e b sed to analyse the convergence properties in our more general Banach space setting where all sourcesof inexactness come from subspace approximations. The ideas of Parpas (2017) are similar although inapplication to the proximal gradient method with an additional smoothing on the functional g. Thepermitted refinement steps are also more broad in our work. This chapter is organised as follows. Section 2 defines notation and the generic form of our proposedrefining FISTA algorithm, Algorithm 1. The main theoretical contribution of this work is the convergenceanalysis of Algorithm 1 which is split into two parts: first we outline the proof structure in Section 3,then we state the specific results in the case of FISTA in Section 4. 
The main results are Theorem 4.8and Lemma 4.10 which extend the convergence of FISTA to infinite dimensional Banach spaces withuniform/adaptively chosen subspaces U n respectively.Section 5 presents some general results for the application of Algorithm 1 and Section 6 gives a muchmore detailed discussion of adaptive refinement for Lasso minimisation. In particular, we describe howto choose efficient refining discretisations to approximate u ∗ , estimate the convergence of E, and identifythe support of u ∗ . The numerical results in Section 7 demonstrate these techniques in four differentmodels demonstrating the apparent sharpness of our convergence rates and the computational efficiencyof adaptive discretisations. We consider optimisation of (1.1) over two spaces, the Banach space ( U , |||·||| ) and Hilbert space( H , (cid:104)· , ·(cid:105) , (cid:107)·(cid:107) ), such that ∃ u ∗ ∈ U s . t . E( u ∗ ) = min u ∈ U E( u ) = inf u ∈ H E( u ) . (2.1)Furthermore, define the translated energyE : U ∪ H → R , E ( u ) := E( u ) − E( u ∗ ) . (2.2)Note that the value of E is uniquely defined, even if u ∗ may not be.The proposed generalised FISTA algorithm is stated in Algorithm 1 for an arbitrary choice of refiningsubspaces U n ⊂ U ∩ H for n = 0 , , . . . . The only difference is that on iteration n , all computations areperformed in the subspace U n . If U n = U = H then we recover the standard FISTA algorithm.Without loss of generality we will assume L = 1, i.e. ∇ f is 1-Lipschitz. To get the general statementof any of the results which follow, replace E with E L . A more important distinction is to say that theproperties of f and g are stated with respect to the Hilbert norm. In particular, (cid:107)∇ f( u ) − ∇ f( v ) (cid:107) ≤ (cid:107) u − v (cid:107) (2.3)for all u, v ∈ H and g is called ‘simple’ ifargmin u ∈ (cid:101) U (cid:107) u − v (cid:107) + g( u ) (2.4)is exactly computable for all v ∈ H and all (cid:101) U ∈ { U n } ∞ n =0 .One defining property of the FISTA algorithm is an appropriate choice of inertia, dictated by t n . Inparticular, we will say that ( t n ) ∞ n =0 is a FISTA stepsize if t = 1 , t n ≥ , and ρ n := t n − t n +1 + t n +1 ≥ n = 0 , , . . . (2.5)Finally we introduce the concept of an orthogonal projection in the setting of this work. For asubspace U n ⊂ U ∩ H we define the orthogonal projection Π n : ( U n ) ∗ → U n to be any extension of thefunction such that (cid:104) Π n u, v (cid:105) = (cid:104) u, v (cid:105) for all u ∈ ( U n ) ∗ , v ∈ U n . This definition is non-standard as we view ( U n ) ∗ embedded in U . Note that U n ⊂ H implies that( U n ) ∗ ⊃ H , therefore Π n is an extension of the classical orthogonal projection. It is convenient to2llow an extension here because U may be bigger than H , ( U n ) ∗ is the largest space such that Π n issufficiently well-defined. Beyond this, the Hahn-Banach theorem may allow the domain of Π n to beextended further, but the definition is no longer unique. For example, in 1D the value of (cid:68) δ , √ x (cid:69) is notwell-defined, but the value of (cid:10) δ , [0 , (cid:11) can be chosen to be consistent with some sequence of molifiers.To account for this possibility, we shall state that the domain of Π n is ( U n ) ∗ , with the closure computedin an appropriate topology dictated by |||·||| .The precise constants associated to a given rate are given in the statements of the theorems but,for convenience, are otherwise omitted from the text. 
For sequences ( a n ) ∞ n =1 ,( b n ) ∞ n =1 we will use thenotation: a n (cid:46) b n ⇐⇒ ∃ C, N > . t . a n ≤ Cb n for all n > N,a n (cid:39) b n ⇐⇒ a n (cid:46) b n (cid:46) a n . Algorithm 1
Refining sub-space FISTA Choose ( U n ) n ∈ N , u ∈ U and some FISTA stepsize choice ( t n ) v ← u , n ← repeat u n ← (1 − t n ) u n + t n v n u n +1 ← argmin u ∈ U n +1 (cid:107) u − u n + ∇ f( u n ) (cid:107) + g( u ) (cid:46) Only modification, U n +1 ⊂ U v n +1 ← (1 − t n ) u n + t n u n +1 n ← n + 1 until converged In this section we give an intuitive outline of the full proof for convergence of Algorithm 1 before givingformal theorems and proofs in the next section. First we recall the classical FISTA convergence guaranteegiven by Chambolle and Dossal (2015, Theorem 2): t N E ( u N ) + N − (cid:88) n =1 ρ n E ( u n ) + (cid:107) v N − u ∗ (cid:107) ≤ (cid:107) u − u ∗ (cid:107) (3.1)for some FISTA stepsize choice t N (cid:39) N and ρ n ≥ Step 1: Quantifying the stability
The first step is to generalise (3.1) to account for the adaptingsubspaces U n . In the notation of Algorithm 1, Theorem 4.3 shows that t N E ( u N ) + N − (cid:88) n =1 ρ n E ( u n ) + (cid:107) v N − w N (cid:107) ≤ (cid:107) u − w (cid:107) + (cid:107) w N (cid:107) −(cid:107) w (cid:107) + N (cid:88) n =1 t n E ( w n ) + (cid:104) v n − , w n − − w n (cid:105) (3.2)for w n = argmin u ∈ U n E( u ). The similarities to (3.1) are clear. If U n = U , then we have w n = u ∗ andthe two estimates agree. These extra terms quantify the robustness to changing of discretisation. Step 2: Quantifying the scaling properties
To show that the extra terms in (3.2) are small, weneed to quantify the approximation properties of U n . To do so, we construct a secondary sequence n < n < . . . and constants a U , a E ≥ k ∈ N n ≤ n k = ⇒ (cid:107) w n (cid:107) (cid:46) a kU , n ≥ n k = ⇒ E ( w n ) (cid:46) a − k E . (3.3)3he choice of exponential scaling is very natural if we consider the U n to be the subspace of functionsdiscretised on a uniform mesh. If that mesh sequentially refined, then the resolution of the mesh will beof order h k after k refinements and for some h <
1. The integer n k should then be the time when themesh has refined k times.The value of a U is dictated by the smoothness of the norm of U with respect to that of H . If (cid:107) u ∗ (cid:107) < ∞ ,then the value is a U = 1, otherwise there is some damage to the rate of convergence generated by thisfactor. On the other hand, the value of a E is dictated by the smoothness of E at the point u ∗ . If anefficient basis is used, then a E is large. The trade-off between a E and a U dictates the final convergencerate of the algorithm. Step 3: Generalising the convergence bound
In this step we combine the FISTA stability estimatewith the subspace approximation guarantees to provide a sharper estimate of stability with respect tothe parameters a E and a U . For example, if U n is of the form U n k = U n k +1 = . . . = U n k +1 − , then many terms on the right-hand side of (3.2) telescope to 0. The result of this is presented inLemma 4.5 but the moral is that the stability error in (3.2) should scale in proportion to the k (relatedto the resolution) rather than the total number of iterations N . Step 4: Sufficiently fast refinement
In Step 3 we developed a convergence bound, now we wish toshow that it is only worse than the classical (3.1) by a constant factor. In particular, it is equivalent torun Algorithm 1 for N iterations or the classical FISTA algorithm for N iterations on the fixed subspace U N . The estimate from (3.1) provides the estimate N E ( u N ) (cid:46) (cid:107) u − w N (cid:107) = O ( a KU ) for N ≤ n K .Lemma 4.6 shows that Algorithm 1 can achieve the same order of approximation so long as U n refinesufficiently quickly, i.e. n k are sufficiently small. Step 5: Sufficiently slow refinement
The result of Step 4 is sufficient to prove convergence, butnot yet a rate. If the subspaces refine too quickly, then the influence of (cid:107) u ∗ (cid:107) = ∞ will slow the rate ofconvergence. If the resolution is too coarse then we are overfitting to the discrete minimiser, but if theresolution is too fine then convergence of the energy is slow. Lemma 4.7 balances these two factors in anoptimal way for Algorithm 1 resulting in a convergence rate ofE ( u N ) (cid:46) a KU N (cid:46) N κ N for all N ∈ N , some κ ∈ [0 ,
1) depending on a U , a E . In particular, if u ∗ ∈ H then we recover the classicalrate with κ = 0. Step 6: Adaptivity
Up to this point we have implicitly focused on the case where U n (and n k ) arechosen a priori. Here we emphasise some of the challenges which are faced when extending results toallow for greedy adaptivity.Spatial adaptivity is robust, so long as the scaling properties of Step 2 are satisfied. The only otherconstraint is to ensure that the partial telescoping of Step 3 still holds. Lemma 4.5 shows that a sufficientcondition is U n k ⊂ U n k +1 ⊂ . . . ⊂ U n k +1 − . Convergence is most stable if the approximation spaces U n satisfy a monotone inclusion, breaking themonotonicity requires more care.Adaptively choosing the time of refinement is more challenging due to the non-descent property ofFISTA. The idea is that U n should be refined as soon as the value E( u n ) is sufficiently close to E( w n ).This is analysed in Theorem 4.9 which shows convergence results of the formmin n ≤ N E ( u n ) (cid:46) N κ N for all N ∈ N , with the same κ ∈ [0 ,
1) from Step 5. The penalty for accelerating the refinement time isa potential loss of stability in E( u n ), however, the asymptotic rate is equivalent and this behaviour hasnot been seen in numerical experiments. 4 Proof of convergence
In this section we follow the recipe motivated in Section 3 to prove convergence of two variants ofAlgorithm 1. Each of the main theorems and lemmas will be stated with a sketch proof in this section.The details of the proofs are either trivial or very technical and are therefore placed in Appendix A topreserve the flow of the argument.
For Step 1 of Section 3 we look to replicate the classical bound of the form in (3.1) for Algorithm 1. Theproofs in this step follow the classical arguments of Beck and Teboulle (2009); Chambolle and Dossal(2015) very closely.
We first wish to understand a single iteration of Algorithm 1. This is done through the following twolemmas.
Lemma 4.1 (equivalent to (Chambolle and Dossal, 2015, Lemma 1)) . Suppose ∇ f is 1-Lipschitz, forany u ∈ U n − define u := argmin u ∈ U n (cid:107) u − u + ∇ f( u ) (cid:107) + g( u ) . Then, for all w ∈ ( U n ) ∗ ⊃ H , we have E( u ) + (cid:107) u − Π n w (cid:107) ≤ E(Π n w ) + (cid:107) u − Π n w (cid:107) where Π n : ( U n ) ∗ → U n is the orthogonal projection. This is exactly the result of Chambolle and Dossal (2015) applied to the function u (cid:55)→ E(Π n u ). ApplyingLemma 4.1 to the iterates from Algorithm 1 gives a more explicit inequality. Lemma 4.2 (analogous to (Chambolle and Dossal, 2015, Theorem 2), (Beck and Teboulle, 2009, Theo-rem 1)) . Let U n ⊂ H ∩ U and w n ∈ U n be chosen arbitrarily and u n / v n be generated by Algorithm 1 forall n ∈ N . For all n ∈ N it holds that t n (E( u n ) − E( w n )) − ( t n − t n )(E( u n − ) − E( w n )) ≤ (cid:104) (cid:107) v n − (cid:107) − (cid:107) v n (cid:107) (cid:105) + (cid:104) v n − v n − , w n (cid:105) . (4.1)The proof is given in Theorem A.1 and is a result of the convexity of E for a well chosen w in Lemma 4.1. Lemma 4.2 gives us an understanding of a single iteration of Algorithm 1, summing over n then givesour generic convergence bound for any variant of Algorithm 1. Theorem 4.3.
Fix a sequence of subspaces { U n ⊂ U ∩ H s . t . n ∈ N } , arbitrary u ∈ U , and FISTAstepsize choice ( t n ) n ∈ N . Let u n and v n be generated by Algorithm 1. Then, for any choice of w n ∈ U n and N ∈ N we have t N E ( u N ) + N − (cid:88) n =1 ρ n E ( u n ) + (cid:107) v N − w N (cid:107) ≤ (cid:107) u − w (cid:107) − (cid:107) w (cid:107) + (cid:107) w N (cid:107) N (cid:88) n =1 t n E ( w n ) + (cid:104) v n − , w n − − w n (cid:105) . (4.2)The proof is given in Theorem A.2. This result is the key approximation for showing convergence ofFISTA with refining sub-spaces. In the classical setting, we have U n = U = H , w n = u ∗ and the extraterms on the right-hand side collapse to 0.The natural choice in (4.2) is w n = Π n u ∗ , however, there are simple counter-examples which giveE(Π n u ∗ ) = ∞ and so this inequality becomes useless. A simple example, if f( u ) = (cid:107) u (cid:107) , g is the5ndicator on the set D = { u ∈ L ([0 , . t . u ( x ) ≥ x } and U n is a set of piecewise constant functions,then u ∗ = x (cid:55)→ x . On the other hand, suppose one of the pixels of the discretisation is [ x − h, x + h ],then Π n u ∗ (cid:0) x + h / (cid:1) = argmin c ∈ R (cid:90) x + hx − h ( u ∗ ( x ) − c ) dx = argmin c ∈ R (cid:90) x + hx − h ( x − c ) dx = x < x + h / . In particular Π n u ∗ / ∈ D therefore E(Π n u ∗ ) = ∞ .The choice w n = argmin u ∈ U n E( u ) is much more robust and allows us to apply Algorithm 1 morebroadly. The penalty for this flexibility is a more complicated analysis, each time the sub-space changes,the system receives a ‘shock’ proportional to (cid:107) w n − w n +1 (cid:107) . In standard FISTA, the right-hand side of (4.2) is a constant. The following lemma minimises thegrowth of the ‘constant’ as a function of N by partially telescoping the sum on the right-hand side.Before progressing to the content of Step 3, we will first formalise the definition of the constants a U and a E introduced in Step 2. Definition 4.4.
Fix a U , a E ≥ and a sequence of subspaces { (cid:101) U k ⊂ H ∩ U s . t . k ∈ N } . We say that { (cid:101) U k } is a ( a U , a E )-discretisation for E if (cid:107) (cid:101) w k (cid:107) (cid:46) a kU and E ( (cid:101) w k ) (cid:46) a − k E for all k ∈ N and some choice (cid:101) w k ∈ argmin u ∈ (cid:101) U k E( u ) . In this section we will simply assume that such sequences exist and in Section 5 we will give somemore general examples.
Lemma 4.5.
Let u n , v n be generated by Algorithm 1, ( n k ∈ N ) ∞ k =0 be a monotone increasing sequence,and choose (cid:101) w k ∈ U n k ∩ U n k +1 ∩ . . . ∩ U n k +1 − for each k ∈ N . If such a sequence exists, then for all K ∈ N , n K ≤ N < n K +1 we have t N E ( u N ) + N − (cid:88) n =1 ρ n E ( u n ) + (cid:107) v N − (cid:101) w K (cid:107) ≤ C + (cid:107) (cid:101) w K (cid:107) N + 1) − n K ( (cid:101) w K )+ K (cid:88) k =1 n k − n k − ( (cid:101) w k − ) + (cid:104) v n k − , (cid:101) w k − (cid:101) w k +1 (cid:105) where C = (cid:107) u − (cid:101) w (cid:107) −(cid:107) (cid:101) w (cid:107) . The proof is given in Lemma A.4. The introduction of n k has greatly compressed the expression ofTheorem 4.3. On the right-hand side, we now only consider E evaluated on the sequence (cid:101) w k and thereare K elements to the sum rather than N . The aim of Step 4 is to show that n iterations of Algorithm 1 is no slower (up to a constant factor) than n iterations of classical FISTA on the space U n . In other words, we would like to ensure thatE( u n ) − min u ∈ U n E( u ) (cid:46) (cid:107) u − argmin u ∈ U n E( u ) (cid:107) n (4.3)uniformly in n . If this condition is not satisfied, then it indicates that computational effort has beenwasted by a poor choice of subspaces. This can be interpreted as an overfitting to the discretisationof E | U n rather than the desired function E | U . Combining the assumptions given by Definition 4.4 andthe result of Lemma 4.5, the following lemma proves the convergence of Algorithm 1 provided that therefinement times n k are sufficiently small (i.e. U n grows sufficiently quickly).6 emma 4.6. Suppose U n , u n , v n and n k satisfy the conditions of Lemma 4.5 and { (cid:101) U k } forms an ( a U , a E ) -discretisation for E with (cid:101) w k ∈ (cid:34) argmin u ∈ (cid:101) U k E( u ) (cid:35) ∩ U n k ∩ U n k +1 ∩ . . . ∩ U n k +1 − . If either: • a U > and n k (cid:46) a k E a kU , • or a U = 1 , (cid:80) ∞ k =1 n k a − k E < ∞ and (cid:80) ∞ k =1 (cid:107) (cid:101) w k − (cid:101) w k +1 (cid:107) < ∞ , then E ( u N ) (cid:46) a KU N for all n K ≤ N < n K +1 . The proof is given in Lemma A.5. We make two observations of the optimality of Lemma 4.6: • The convergence guarantee for N iterations of classical FISTA in the space U N isE ( u N ) (cid:46) (cid:107) u − argmin u ∈ U N E( u ) (cid:107) N + min u ∈ U N E ( u ) (cid:46) a KU N + a − K E . This is equivalent to Lemma 4.6 after the assumptions on n k . • If U is finite dimensional then the a U = 1 is almost trivially satisfied. Norms in finite dimensionsare equivalent and any discretisation can be achieved with a finite number of refinements (i.e. thesums over k are finite). In Lemma 4.6 we show that E( u n ) converges at a rate depending on k and n so long as k grows sufficientlyquickly. On the other hand, as k grows, the rate becomes worse and so we need to also put a lower limiton the growth of n k . The following lemma completes Step 5 by computing the global convergence rateof E ( u n ) when k grows at the minimum rate which is consistent with Lemma 4.6.As a special case, note that if a U = 1 then Lemma 4.6 already gives the optimal O ( N − ) convergencerate. This is in fact a special case of Aujol and Dossal (2015). If u ∗ ∈ H , then it is not possible to refine‘too quickly’ and the following lemma is not needed. Lemma 4.7.
Suppose u n and n k are sequences satisfying E ( u N ) (cid:46) a KU N where n K (cid:38) a K E a KU , then E ( u N ) (cid:46) N − κ ) where κ = log a U log a E + log a U . The proof is given in Lemma A.6.
We can summarise Lemmas 4.5 to 4.7 into a single theorem stating the convergence guarantees when U n and n k are chosen a priori. Theorem 4.8.
Let { (cid:101) U k s . t . k ∈ N } be an ( a U , a E ) -discretisation for E and choose any U n such that (cid:101) U k = U n k , (cid:101) w k ∈ (cid:34) argmin u ∈ (cid:101) U k E( u ) (cid:35) ∩ U n k +1 ∩ . . . ∩ U n k +1 − for all k ∈ N . Compute u n , v n by Algorithm 1. uppose that either: • a U > and n k (cid:39) a k E a kU , • or a U = 1 , (cid:80) ∞ k =1 n k a − k E < ∞ and (cid:80) ∞ k =1 (cid:107) (cid:101) w k − (cid:101) w k +1 (cid:107) < ∞ ,then E ( u N ) (cid:46) N − κ ) where κ = log a U log a E + log a U uniformly for N ∈ N . This theorem is very easy to implement numerically and requires very little knowledge of how to estimateE ( u n ). So long as a U and a E can be computed analytically, choosing (cid:101) U k to be uniform discretisationsand (cid:101) U k = U n k = . . . = U n k +1 − will give the stated convergence rate. There are two properties of the sequence ( U n ) ∞ which we may wish to decide adaptively: the refinementtimes n k and the exact spaces { U n s . t . n k ≤ n < n k +1 } . We will refer to these as time and spatialadaptivity respectively.Lemma 4.6 gives a sufficient condition on n k for converging at the rate O ( N κ − ) but it is notnecessary. Indeed for n ≤ n k the result showsE ( u n ) ≥ min u ∈ U n E ( u ) = O ( a − k E ) = O ( n κ − )which suggests that to converge faster than n κ − requires choosing smaller n k . As an example, inSection 7.2 we will see Algorithm 1 can converge at a linear rate, although this is not possible withoutadaptive refinement times. On the other hand, choice of spatial adaptivity has no impact on rate butcan impact computational efficiency. It will be permitted to use greedy discretisation techniques so longas each collection { U n s . t . n k ≤ n < n k +1 } satisfies a monotonicity condition.While Theorem 4.8 also allows for spatial adaptivity, we will simplify the requirements to a form whichseems more practical for implementation and also allows for time adaptivity. We make the assumptionthat the choice of n k depends only on the value E( u n k − ) and therefore make u n k − the root of ourrequirements. Lemma 4.6 suggests that a good refinement time strategy is to choose n k to be theminimal integer such that E ( u n k − ) (cid:46) a − k E . However, the value of E may be hard to estimate and sowe introduce a ‘backstop’ condition which guarantees that convergence is no slower than the rate givenby Theorem 4.8. The only constraint on spatial adaptivity is the piecewise monotone inclusion and that U n k (cid:51) u n k − . This is confirmed by the following theorem. Theorem 4.9.
Let { U n ⊂ H ∩ U s . t . n ∈ N } be a sequence of subspaces and n k ∈ N a monotoneincreasing sequence such that (cid:101) U k := U n k (cid:51) u n k − , (cid:101) w k ∈ (cid:34) argmin u ∈ (cid:101) U k E( u ) (cid:35) ∩ U n k +1 ∩ . . . ∩ U n k +1 − for all k ∈ N . Compute u n , v n by Algorithm 1.Suppose there exist a U , a E ≥ such that either: • a U > and n k (cid:46) a k E a kU , • or a U = 1 , (cid:80) ∞ k =1 n k a − k E < ∞ and (cid:80) ∞ k =1 (cid:107) (cid:101) w k − (cid:101) w k +1 (cid:107) < ∞ and both (cid:107) (cid:101) w k (cid:107) (cid:46) a kU and E ( u n K − ) (cid:46) a − K E . Whenever these conditions on n k are satisfied, then min n ≤ N E ( u n ) (cid:46) N − κ ) where κ = log a U log a E + log a U uniformly for N ∈ N . u n ) is not monotone but the magnitude of oscillation is guaranteedto decay in time. This behaviour is boundedness is lost in Theorem 4.9 although it still holds on asubsequence. Although we do not prove it here, it can be shown that the stronger condition (cid:101) w k ∈ (cid:34) argmin u ∈ (cid:101) U k E( u ) (cid:35) ∩ U n k +1 ∩ . . . ∩ U n k +1 − ∩ . . . ∩ U N (4.4)is sufficient to restore the stronger guarantee on E ( u n ). Again, monotonicity of U n corresponds withimproved stability of Algorithm 1.To enable more practical implementation of Theorem 4.9, the following lemma describes severalrefinement time strategies which provide sufficient condition for E ( u n ) (cid:46) a − k E . Lemma 4.10.
Let { (cid:101) U k s . t . k ∈ N } be a sequence of subspaces with some points u k ∈ (cid:101) U k and (cid:101) w k ∈ argmin u ∈ (cid:101) U k E( u ) . Suppose that (cid:107) (cid:101) w k (cid:107) (cid:46) a kU . Any of the following conditions are sufficient to show that { (cid:101) U k } is an ( a U , a E ) -discretisation for E :1. Small continuous gap refinement: E ( u k ) ≤ βa − k E for all k ∈ N , some β > .2. Small discrete gap refinement: E ( (cid:101) w k ) ≤ βa − k E and E ( u k ) − E ( (cid:101) w k − ) ≤ βa − k E for all k ∈ N , some β > .3. Small relative gap refinement: E ( u k ) − E ( (cid:101) w k − ) ≤ β E ( u k ) for all k ∈ N , some < β ≤ a E .4. Small continuous gradient refinement: ||| ∂ E( u k ) ||| ∗ ≤ βa − k E for all k ∈ N , some β > , and sublevelsets of E are |||·||| -bounded.5. Small discrete gradient refinement: E ( (cid:101) w k ) ≤ βa − k E and ||| Π k ∂ E( u k ) ||| ∗ ≤ βa − k E for all k ∈ N ,some β > , and sublevel sets of E are |||·||| -bounded. The operator Π k : H → (cid:101) U k is the orthogonalprojection. The proof is given in Lemma A.8. The refinement criteria described by Lemma 4.10 can be split intotwo groups. Cases (1), (3), and (4) justify that any choice of (cid:101) U k (cid:51) u k (or u n k − in Theorem 4.9) satisfiesthe required conditions. In cases (2) and (5), u k is sufficient to choose the refinement time n k but notthe subspace (cid:101) U k . In these cases one could, for example, choose (cid:101) U k to be a uniform discretisation withapriori estimates.Another splitting of the criteria is into gap and gradient computations. Typically, gradient norms (in(4) and (5)) should be easier to estimate than function gaps because they only require local knowledgerather than global, i.e. ∂ E( u n ) rather than an estimate of E( u ∗ ). Implicitly, the global informationcomes from an extra condition on E to assert that sublevel sets are bounded. In this work we consider the Lasso problem to be both a motivating example and source of numeri-cal examples, however, Algorithm 1 is much more broadly applicable. Here we justify this claim bydemonstrating that the constants a U and a E can be easily computed in many examples.The cases where U is finite or countable dimensional are quite straight forward. If the total numberof refinements is finite (i.e. U n = U N for all n ≥ N , some N ∈ N ) then a U = 1. This holds formost finite dimensional problems as well as the countable Lasso example discussed in detail in Section 6.The examples we will consider here are where H = L (Ω) for some compact domain Ω ⊂ R d , and { (cid:101) U n s . t . n ∈ N } are finite dimensional finite-element–like spaces. We will now clarify exactly what ismeant by these properties and then demonstrate that they are sufficient to find simple formulae for thevalues a U and a E . 9 efinition 5.1. Suppose U ⊂ L p (Ω) for some compact domain Ω ⊂ R d . We say that M k = { ω k , ω k , . . . } is a mesh if ω ki ⊂ Ω and | ω ki ∩ ω kj | = 0 for all i and j . We say that (cid:101) U k ⊂ U ∩ H is a finite element space if there exists a mesh M k such thatfor all u ∈ (cid:101) U k there exists u ki ∈ (cid:101) U k such that supp( u ki ) ⊂ ω ki , u ( x ) = u ki ( x ) for a.e. x ∈ ω ki , all i ∈ N . We say that a sequence of finite element sub-spaces ( (cid:101) U k ) ∞ k =0 is h -refining if there exists a basis { e , . . . 
, e N } ⊂ (cid:101) U ∩ L ∞ (Ω) such that for any u ki ∈ (cid:101) U k with supp( u ki ) ⊂ ω ki there exist α ∈ R d × d , β ∈ R d , γ ∈ R N such that < det( α ) (cid:46) h − kd , u ki ( x ) = N (cid:88) j =1 γ j e j ( α x + β ) for a.e. x ∈ ω ki , supp( e j ( α · + β )) ⊂ ω ki . We say that ( (cid:101) U k ) k is of order p if min w ∈ (cid:101) U k ||| w − u ∗ ||| (cid:46) u ∗ h kp . We allow the implicit constant to have any dependence on u ∗ so long as it is finite. For example, in thecase of Sobolev spaces we would expect an inequality of the form min w ∈ (cid:101) U k (cid:107) w − u ∗ (cid:107) W ,q (cid:46) h kp (cid:107) u ∗ (cid:107) W p,q (Strang, 1972). We note that any piecewise polynomial finite element space can be used to form a h -refining sequenceof subspaces. Wavelets with a compactly supported basis satisfy this definition but it is hard to specifythe value N in general. Similarly, a Fourier basis does satisfy the scaling properties but each basis vectorhas global support. Both of these exceptions are important and could be accounted for with furtheranalysis but we focus on the more standard finite element case. The following theorem shows that allfinite element spaces achieve the same basic rates of convergence in Algorithm 1, depending on theparticular properties of |||·||| and E. Theorem 5.2.
Suppose H = L (Ω) for some compact domain Ω ⊂ R d and (cid:107)·(cid:107) q (cid:46) |||·||| for some q ∈ [1 , ∞ ] . If ( (cid:101) U k ) k is a sequence of h -refining finite element spaces of order p then a U ≤ (cid:40) q ≥ √ h − d q < , a E ≥ (cid:40) h − p E is |||·||| -Lipschitz at u ∗ h − p E is |||·||| -smooth at u ∗ . If g is not Lipschitz at u ∗ but f is |||·||| -Lipschitz and min w ∈ (cid:101) U k {||| w − u ∗ ||| s . t . g( w ) ≤ g( u ∗ ) } (cid:46) min w ∈ (cid:101) U k ||| w − u ∗ ||| , then, again, a E ≥ h − p . The proof of this theorem is in Appendix B. The main take-home for this theorem is that thecomputation of a U and a E is typically very simple and clear given a particular choice of |||·||| and E. Theonly non-trivial case is when g is not smooth enough to estimate directly. So long as (cid:101) U k are sufficientlydense in sublevel sets of g, the smoothness of f is sufficient to give a convergence rate. We now focus on the concrete example of Lasso which will be used for numerical results in Section 7.We consider three forms of Lasso which will be referred to as continuous, countable, and discrete Lassodepending on whether the space U is M ([0 , d ), (cid:96) ( R ), or finite dimensional respectively. In each case,the energy can be written as E( u ) = (cid:107)A u − η (cid:107) (cid:96) + µ ||| u ||| (6.1)10or some A : U ∪ H → R m and µ >
0, where |||·||| = (cid:107)·(cid:107) .The aim of this section is to develop all of the necessary tools for implementing Algorithm 1 on theenergy (6.1) using the convergence guarantees of either Theorem 4.8 or Theorem 4.9. This includescomputing the rates a U and a E , estimating the continuous gap E ( u n ), and developing an efficientrefinement choice for U n . We start by estimating rates in the case U = M (Ω) where Ω = [0 , d . In this case we choose (cid:101) U k to bethe span of all piecewise constant functions on a mesh of squares with maximum side length 2 − k (i.e. h = / ).Application of Theorem 5.2 gives a U = 2 d although although (cid:107)·(cid:107) is too strong a metric and resultsin a E = 2 . This can be seen because for any u ∈ L (Ω d ), (cid:107) u − δ (cid:107) = sup ϕ ∈ C (Ω d ) , (cid:107) ϕ (cid:107) ∞ ≤ (cid:104) ϕ, u − δ (cid:105) = (cid:107) u (cid:107) + (cid:107) δ (cid:107) ≥ . (6.2)To improve our estimate of a E requires additional assumptions on A . For any w ∈ (cid:101) U k such that (cid:107) w (cid:107) ≤ (cid:107) u ∗ (cid:107) , we computeE( w ) − E( u ∗ ) = (cid:107)A w − η (cid:107) − (cid:107)A u ∗ − η (cid:107) + µ ( ||| w ||| − ||| u ∗ ||| ) (6.3) ≤ (cid:107)A w − η (cid:107) − (cid:107)A u ∗ − η (cid:107) (6.4)= (cid:10) A ( w + u ∗ ) − η, A ( w − u ∗ ) (cid:11) (6.5) ≤ max( (cid:107)A w − η (cid:107) , (cid:107)A u ∗ − η (cid:107) ) (cid:107)A ( w − u ∗ ) (cid:107) (6.6) ≤ (cid:112) w ) (cid:107)A ( w − u ∗ ) (cid:107) . (6.7)Choosing w = (cid:101) Π k u ∗ for any bounded linear extension to the orthogonal projection (cid:101) Π k : U → (cid:101) U k givesE( w ) − E( u ∗ ) (cid:46) (cid:13)(cid:13)(cid:13) A ( (cid:101) Π k u ∗ − u ∗ ) (cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) ( (cid:101) Π k − id) A ∗ (cid:13)(cid:13)(cid:13) (cid:96) → L ∞ ||| u ∗ ||| . Expanding the operator norm further, (cid:13)(cid:13)(cid:13) ( (cid:101) Π k − id) A ∗ (cid:13)(cid:13)(cid:13) (cid:96) → L ∞ = sup r ∈ R m max x ∈ Ω | [ (cid:101) Π k A ∗ r ]( x ) − [ A ∗ r ]( x ) |(cid:107) r (cid:107) (cid:96) ≤ sup r ∈ R m √ d − k (cid:107)∇ [ A ∗ r ] (cid:107) L ∞ (cid:107) r (cid:107) (cid:96) = √ d − k |A ∗ | (cid:96) → C (6.8)where √ d − k is the diameter of a d dimensional pixel of side length 2 − k .This computation confirms two things, firstly that the scaling constant is indeed a E = 2, and secondlythat the required smoothness to achieve a good rate with Algorithm 1 is that A ∗ : R m → C (Ω) isa bounded operator. This accounts for using the weaker topology of M (Ω) rather than L (Ω). InSection 6.5 we will show two practical examples were this (semi)norm is accurately computable.Inserting the computed rates into Theorem 4.8 or Theorem 4.9 gives the guaranteed convergence rate κ = log a U log a E + log a U = d d = ⇒ E ( u n ) (cid:46) n − − κ ) = n − d . (6.9)This rate can be used to infer the required resolution at each iteration, in particular on iteration n with n (cid:39) ( a E a U ) k we expect the resolution to be2 − k = (cid:0) a E a U (cid:1) k d (cid:39) n − d . (6.10) We now extend the rate computations to the case when U = (cid:96) ( R ), or a finite dimensional subspace.The key fact here is that, even when U is infinite dimensional, it is known (e.g. 
(Unser et al., 2016,Theorem 2) and (Boyer et al., 2019, Corollary 2)) that u ∗ can be chosen to have at most m non-zeros.11f this is the case, then u ∗ ∈ (cid:96) ( R ); this makes the analysis much simpler than in the continuous case.In this subsection we will ignore the discrete case as a simple sub-case of the countable.For countable Lasso we consider discretisation subspaces of the form (cid:101) U k = { u ∈ (cid:96) ( R ) s . t . i / ∈ J k = ⇒ u i = 0 } for some sets J k ⊂ N , i.e. infinite vectors with finitely many non-zeros. The key change in analysisfrom the continuous case is u ∗ ∈ (cid:96) ( R ), which leads to a U = 1 and the expected optimal rate of n − ,independent of a E or any additional properties of A . The number of refinements will also be finite,therefore n k = ∞ for some k and the remaining conditions of Theorems 4.8 and 4.9 hold trivially. Lemma 4.10 shows that adaptive refinement can be performed based on estimates of the function gap orthe gradient. In this subsection we provide estimates for these values which can be easily computed.
We start by computing estimates for discretised Lasso. This covers the cases when either continu-ous/countable Lasso is projected onto U n , or U is finite dimensional. For notation we will use thecontinuous case, to recover the other cases just replace continuous indexing with discrete (i.e. u ( x ) (cid:32) u i ).Let Π n : U → U n denote an orthogonal projection (uniquely determined on H ). For Lasso, thediscretised function u (cid:55)→ E(Π n u ) is equivalent to replacing u with Π n u , and A ∗ with Π n A ∗ . Discrete gradient
We can use Π n to compute the discrete sub-derivative at u n ∈ U n : ∂ n E(Π n u n )( x ) = [Π n A ∗ ( A u n − η )]( x ) + { + µ } u n ( x ) > − µ, µ ] u n ( x ) = 0 {− µ } u n ( x ) < n A ∗ ( A u n − η )]( x ) + µ Π n sign( u n ( x )) (6.12)where we define s + µ [ − ,
1] = [ s − µ, s + µ ] for all s ∈ R , µ ≥ |||·||| = (cid:107)·(cid:107) , the natural metric for ∂ n E is |||·||| ∗ = (cid:107)·(cid:107) ∞ which we can estimate ||| ∂ n E( u n ) ||| ∗ = max x ∈ Ω min {| v | s . t . v ∈ Π n A ∗ ( A u n − η )( x ) + µ sign( u n ( x )) } (6.13)= max x ∈ Ω | [Π n A ∗ ( A u n − η )( x ) + µ | u n ( x ) > | [Π n A ∗ ( A u n − η )( x ) − µ | u n ( x ) < | [Π n A ∗ ( A u n − η )( x ) | − µ, u n ( x ) = 0 (6.14)which can be used directly in Lemma 4.10. Discrete gap
We now move on to the discrete gap, E( u n ) − min u ∈ U n E( u ). This can be computedwith a dual representation, such as that used by Duval and Peyr´e (2017a),min u ∈ U n (cid:107)A u − η (cid:107) (cid:96) + µ ||| u ||| = min u ∈ H max ϕ ∈ R m ( A Π n u − η ) · ϕ + µ ||| Π n u ||| − (cid:107) ϕ (cid:107) (cid:96) (6.15)= max ϕ ∈ R m min u ∈ H ( A Π n u − η ) · ϕ + µ ||| Π n u ||| − (cid:107) ϕ (cid:107) (cid:96) (6.16)= max ϕ ∈ R m (cid:40) − η · ϕ − (cid:107) ϕ (cid:107) (cid:96) ||| Π n A ∗ ϕ ||| ∗ ≤ µ −∞ else (6.17)= − min ϕ ∈ R m (cid:107) ϕ (cid:107) (cid:96) + η · ϕ (cid:124) (cid:123)(cid:122) (cid:125) =:E ∗ ( ϕ ) + χ ( ||| Π n A ∗ ϕ ||| ∗ ≤ µ ) . (6.18)In particular, E( u ) − min u ∈ U n E( u ) = E( u ) + min ϕ ∈ R m s . t . ||| Π n A ∗ ϕ ||| ∗ ≤ µ E ∗ ( ϕ ) ≤ E( u ) + E ∗ ( ϕ ) (6.19)12or any feasible ϕ ∈ R m . We further derive the criticality condition, if ( u ∗ , ϕ ∗ ) is a saddle point, then ϕ ∗ = A u ∗ − η. (6.20)We remark briefly that it is more conventional to include the constraint in the definition of E ∗ . Wechoose to omit it here to highlight that it is only the constraint which changes between the discrete andcontinuous cases; E ∗ will remain the same.Given u n ∈ U n , the optimality condition motivates a simple rule for choosing ϕ : ϕ n := A u n − η, E( u ) − min u (cid:48) ∈ U n E( u (cid:48) ) ≤ E( u ) + min γ ≥ { E ∗ ( γϕ n ) s . t . γ ||| Π n A ∗ ϕ n ||| ∗ ≤ µ } (6.21)with optimal choice γ = max (cid:32) , min (cid:32) − η · ϕ n (cid:107) ϕ n (cid:107) (cid:96) , µ ||| Π n A ∗ ϕ n ||| ∗ (cid:33)(cid:33) . (6.22)To apply Algorithm 1, we are assuming that both f( u n ) = (cid:107) ϕ n (cid:107) (cid:96) and Π n ∇ f( u n ) = Π n A ∗ ϕ n are easilycomputable, therefore γ and E( u n ) + E ∗ ( γϕ n ) are also easy to compute. Extending the results of Section 6.3.1 to U = (cid:96) ( R ) is analytically very simple but computationally reliesheavily on the specific choice of A . The computations of gradients and gaps carry straight over replacingΠ n with the identity and adding the sets J n ⊂ N which define U n = { u ∈ (cid:96) s . t . i / ∈ J n = ⇒ u i = 0 } .In particular, ||| ∂ E( u n ) ||| ∗ = max i ∈ N | [ A ∗ ϕ n ] i + µ | [ u n ] i > | [ A ∗ ϕ n ] i − µ | [ u n ] i < | [ A ∗ ϕ n ] i | − µ,
0) [ u n ] i = 0 (6.23)E ( u n ) ≤ E( u n ) + E( γ ϕ n ) , γ = max (cid:32) , min (cid:32) − η · ϕ n (cid:107) ϕ n (cid:107) (cid:96) , µ |||A ∗ ϕ n ||| ∗ (cid:33)(cid:33) (6.24)where ϕ n = A u n − η ∈ R m is always exactly computable.In the countable case, the sets J n give a clear partition into known/unknown values in these defini-tions. For i ∈ J n the computation is the same as in Section 6.3.1, then for i / ∈ J n we know [ u n ] i = 0which simplifies the remaining computations. This leads to: ||| ∂ E( u n ) ||| ∗ = max (cid:18) max i ∈ J n | [ ∂ E( u n )] i | , max i/ ∈ J n | [ ∂ E( u n )] i | (cid:19) (6.25)= max (cid:18) ||| ∂ n E( u n ) ||| ∗ , max i/ ∈ J n | [ A ∗ ϕ n ] i | − µ (cid:19) (6.26) |||A ∗ ϕ n ||| ∗ = max (cid:18) max i ∈ J n | [ A ∗ ϕ n ] i | , max i/ ∈ J n | [ A ∗ ϕ n ] i | (cid:19) (6.27)= max (cid:18) ||| Π n A ∗ ϕ n ||| ∗ , max i/ ∈ J n | [ A ∗ ϕ n ] i | (cid:19) . (6.28)Both estimates only rely on an upper bound of max i/ ∈ J n | [ A ∗ ϕ n ] i | . One example computing this value isseen in Section 7.2. 13 .3.3 Bounds for continuous functionals Finally we extend the results of Section 6.3.1 to continuous Lasso. Similar to the countable case, theexact formulae can be written down immediately: ||| ∂ E( u n ) ||| ∗ = max x ∈ Ω | [ A ∗ ϕ n ]( x ) + µ | u n ( x ) > | [ A ∗ ϕ n ]( x ) − µ | u n ( x ) < | [ A ∗ ϕ n ]( x ) | − µ, u n ( x ) = 0 (6.29)E ( u n ) ≤ E( u n ) + E( γ ϕ n ) , γ = max (cid:32) , min (cid:32) − η · ϕ n (cid:107) ϕ n (cid:107) (cid:96) , µ |||A ∗ ϕ n ||| ∗ (cid:33)(cid:33) . (6.30)If these quantities are not analytically tractable then we use the mesh M n corresponding to U n todecompose the bounds: ||| ∂ E( u n ) ||| ∗ = max ω ni ∈ M n (cid:107)A ∗ ϕ n + µ (cid:107) L ∞ ( ω ni ) u n | ω ni > (cid:107)A ∗ ϕ n − µ (cid:107) L ∞ ( ω ni ) u n | ω ni < , (cid:107)A ∗ ϕ n (cid:107) L ∞ ( ω ni ) − µ ) u n | ω ni = 0 (6.31) |||A ∗ ϕ n ||| ∗ = max ω ni ∈ M n (cid:107)A ∗ ϕ n (cid:107) L ∞ ( ω ni ) . (6.32)Now, both estimates rely on cell-wise supremum norms of A ∗ ϕ n which we assume is sufficiently smooth.We will therefore use a cell-wise Taylor expansion to provide a simple and accurate estimate. For instance,let x i be the midpoint of the square ω ni , then (cid:107)A ∗ ϕ n (cid:107) L ∞ ( ω ni ) ≤ | [ A ∗ ϕ n ]( x i ) | + diam( ω ni )2 | [ ∇A ∗ ϕ n ]( x i ) | + diam( ω ni ) |A ∗ ϕ n | C . (6.33)In this work we chose a first order expansion because we are looking for extrema of A ∗ ϕ n , i.e. we aremost interested in the squares ω ni such that | [ A ∗ ϕ n ]( x i ) | ≈ µ, | [ ∇A ∗ ϕ n ]( x i ) | ≈ , [ ∇ A ∗ ϕ n ]( x i ) (cid:22) . (6.34)A zeroth order expansion would be optimally inefficient (approximating | [ ∇A ∗ ϕ n ]( x i ) | with |A ∗ ϕ n | C )and a second order expansion would possibly be the most elegant but harder to implement. We foundthat a first order expansion was simple and efficient.The bounds presented here for continuous Lasso emphasise the twinned properties required for adap-tive mesh optimisation. The mesh should be refined greedily to the structures of u ∗ , but also must besufficiently uniform to provide a good estimate for E( u ∗ ). This is a classical exploitation/explorationtrade-off; exploiting visible structure whilst searching for other structures which are not yet visible. The main motivation for using Lasso in applications is because it recovers sparse signals, in the caseof compressed sensing the support of u ∗ is also provably close to the ‘true’ support (Duval and Peyr´e,2017a; Poon et al., 2018). 
If u n ≈ u ∗ in the appropriate sense, then we should also be able to quantifythe statement supp( u n ) ≈ supp( u ∗ ). Such methods are referred to as safe screening rules (El Ghaouiet al., 2010) which gradually identify the support and allow the optimisation algorithm to constrain partsof the reconstruction to 0. In this subsection we propose a new simple screening rule which is capable ofgeneralising to our continuous subspace approximation setting. It is likely that more advanced methods(Bonnefoy et al., 2015; Ndiaye et al., 2017) can also be adapted, although that is beyond the scope ofthis work. The key difference is the allowance of inexact computations resulting from estimates such as(6.33).The work of Duval and Peyr´e (2017a); Poon et al. (2018) and many others characterise the supportof u ∗ very precisely. In particular, the support is at most m distinct points and are a subset of { x ∈ Ω s . t . |A ∗ ϕ ∗ | ( x ) = µ } (an equivalent statement holds for the countable case). Less formally, this canalso be seen from the the gradient computations in Section 6.3, for all x ∈ supp( u ∗ ) we have0 ∈ ∂ E( u ∗ )( x ) = [ A ∗ ϕ ∗ ]( x ) + µ sign( u ∗ ( x )) . (6.35)14euristically, we will use strong convexity of E ∗ and smoothness of A ∗ to quantify the statement:if E( u n ) + E ∗ ( γ ϕ n ) ≈ { x s . t . | [ A ∗ ϕ n ]( x ) | (cid:28) µ } ⊂ { x s . t . u ∗ ( x ) = 0 } . First we use the strong convexity of E ∗ , if γ ϕ n and ϕ ∗ are both dual-feasible then (cid:107) γ ϕ n − ϕ ∗ (cid:107) (cid:96) ≤ E ∗ ( γ ϕ n ) − E ∗ ( ϕ ∗ ) = E ∗ ( γ ϕ n ) + E( u ∗ ) ≤ E ∗ ( γ ϕ n ) + E( u n ) , (6.36)which gives an easily computable bound on (cid:107) γ ϕ n − ϕ ∗ (cid:107) (cid:96) . Now we estimate A ∗ ϕ n on the support of u ∗ : min x ∈ supp( u ∗ ) | [Π n A ∗ ϕ n ]( x ) | ≥ min x ∈ supp( u ∗ ) | [ A ∗ ϕ n ]( x ) | (6.37)= γ − min x ∈ supp( u ∗ ) | [ A ∗ γ ϕ n ]( x ) | (6.38) ≥ γ − min x ∈ supp( u ∗ ) | [ A ∗ ϕ ∗ ]( x ) | − | [ A ∗ γ ϕ n − A ∗ ϕ ∗ ]( x ) | (6.39)= γ − min x ∈ supp( u ∗ ) µ − | [ A ∗ γ ϕ n − A ∗ ϕ ∗ ]( x ) | (6.40) ≥ γ − ( µ − |A ∗ | (cid:96) → L ∞ (cid:107) γ ϕ n − ϕ ∗ (cid:107) (cid:96) ) . (6.41)Therefore, | [ A ∗ γ ϕ n ]( x ) | < µ − (cid:112) u n ) + E ∗ ( γ ϕ n )) |A ∗ | (cid:96) → L ∞ = ⇒ u ∗ ( x ) = 0 . (6.42)This equation is valid when x is either a continuous or countable index, the only distinction is to switchto (cid:96) ∞ in the norm of A ∗ . To make the equivalent statement on the discretised problem, simply replace γ with γ and A ∗ with Π n A ∗ . There are two short observations on this formula: • The convergence guarantee from Theorem 4.8 is for the quantity E( u n ) − E( u ∗ ), the more relevantquantity here is E ∗ ( γ ϕ n ) + E( u ∗ ) for which there is no proven rate. • In Section 6.1, |A ∗ | (cid:96) → C < ∞ was required to compute a rate of convergence, but only |A ∗ | (cid:96) → L ∞ < ∞ is needed to estimate the support. For numerical implementation of Lasso, we are required to accurately estimate several operator normsof A . For f to be 1-Lipschitz we must divide by (cid:107)A ∗ A(cid:107) , and the adaptivity described in Sections 6.1,6.3.3 and 6.4 requires the values of |A ∗ | (cid:96) → L ∞ , |A ∗ | (cid:96) → C , and |A ∗ | (cid:96) → C . The aim for this section is toprovide estimates of these norms and seminorms for the numerical examples presented in Section 7.We start by specifying the structure of A . By linearity, there must exist kernels ψ j ∈ H ∩ U ∗ suchthat ( A u ) j = (cid:104) ψ j , u (cid:105) for all u ∈ H ∪ U , j = 1 , . . . , m . 
In this form, the adjoint can be written precisely, A ∗ : R m → H by [ A ∗ r ]( x ) = m (cid:88) j =1 r j ψ j ( x ) for all x ∈ Ω , r ∈ R m . (6.43)The following lemma allows for exact computation of the operator norm of A , as needed to ensurethat f is 1-Lipschitz. Lemma 6.1. If A : H → R m has kernels ψ j ∈ H for j ∈ [ m ] , then (cid:107)A ∗ A(cid:107) = (cid:107)AA ∗ (cid:107) where AA ∗ ∈ R m × m has entries ( AA ∗ ) i,j = (cid:104) ψ i , ψ j (cid:105) and the matrix norm is the standard spectral norm.Proof. The operator A has a finite dimensional range, therefore it also has a singular value decomposition.This shows that (cid:107)A ∗ A(cid:107) = (cid:107)AA ∗ (cid:107) . To compute the entries of AA ∗ : R m → R m , observe that for any r ∈ R m ( AA ∗ r ) i = (cid:104) ψ i , A ∗ r (cid:105) = (cid:42) ψ i , m (cid:88) j =1 r j ψ j (cid:43) = m (cid:88) j =1 (cid:104) ψ i , ψ j (cid:105) r j (6.44)as required. 15f (cid:107)A ∗ A(cid:107) is not analytically tractable, then Lemma 6.1 enables it to be computed using standardfinite dimensional methods. The operator AA ∗ is always finite dimensional and can be computed withoutdiscretisation error.In the continuous case, when H = L (Ω) we also need to estimate the smoothness properties of A ∗ .A generic result for this is given in the following lemma. Lemma 6.2. If A : H → R m has kernels ψ j ∈ L (Ω) ∩ C k (Ω) for j ∈ [ m ] , then for all q + q ∗ = 1 , q ∈ [1 , ∞ ] , we have |A ∗ r | C k := sup x ∈ Ω |∇ k [ A ∗ r ] | ( x ) ≤ sup x ∈ Ω (cid:13)(cid:13) ( ∇ k ψ j ( x )) mj =1 (cid:13)(cid:13) (cid:96) q ∗ (cid:107) r (cid:107) (cid:96) q , (6.45) |A ∗ | (cid:96) → C k := sup (cid:107) r (cid:107) (cid:96) ≤ |A ∗ r | C k ≤ sup x ∈ Ω (cid:13)(cid:13) ( ∇ k ψ j ( x )) mj =1 (cid:13)(cid:13) (cid:96) q ∗ × (cid:40) q ≥ √ m − q q < . (6.46) Proof.
For the first inequality, we apply the H¨older inequality on R m : |∇ k [ A ∗ r ] | ( x ) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m (cid:88) j =1 ∇ k ψ j ( x ) r j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ m (cid:88) j =1 |∇ k ψ j ( x ) | q ∗ q ∗ (cid:107) r (cid:107) (cid:96) q = (cid:13)(cid:13) ( ∇ k ψ j ( x )) j (cid:13)(cid:13) (cid:96) q ∗ (cid:107) r (cid:107) (cid:96) q . For the second inequality, if q ≥ (cid:80) mj =1 r j ≤
1, then | r j | ≤ j and (cid:107) r (cid:107) q(cid:96) q ≤ (cid:107) r (cid:107) (cid:96) ≤
1. If q < (cid:107) r (cid:107) (cid:96) ≤
1, then we again use H¨older’s inequality: m (cid:88) j =1 r qj ≤ (cid:16) m (cid:88) j =1 Q ∗ (cid:17) Q ∗ (cid:16) m (cid:88) j =1 r qQj (cid:17) Q ≤ m − q for Q = q . We now present explicit examples of the computations in both Lemmas 6.1 and 6.2 for the numericalresults in Section 7. These examples are common in practical applications and we show the result ofTheorem 6.3 in the main text to demonstrate that, while sometimes hard to compute by hand, theconstants can estimated accurately in an efficient manner.
Theorem 6.3.
Suppose A : H → R m has kernels ψ j ∈ H = L ([0 , d ) for j ∈ [ m ] .Case 1: If ψ j ( x ) = (cid:40) x ∈ X j else for some collection X j ⊂ Ω such that X i ∩ X j = ∅ for all i (cid:54) = j , then (cid:107)A(cid:107) L → (cid:96) = max j (cid:113) | X j | . Case 2: If ψ j ( x ) = cos( a j · x ) for some frequencies a j ∈ R d with | a j | ≤ A , then (cid:107)A(cid:107) L → (cid:96) ≤ √ m, |A ∗ r | C k ≤ m − q A k (cid:107) r (cid:107) q , |A ∗ | (cid:96) → C k ≤ √ mA k for all r ∈ R m and q ∈ [1 , ∞ ] .Case 3: Suppose ψ j ( x ) = (2 πσ ) − d exp (cid:16) − | x − x j | σ (cid:17) for some regular mesh x j ∈ [0 , d and separation ∆ .i.e. { x j s . t . j ∈ [ m ] } = { x + ( j ∆ , . . . , j d ∆) s . t . j i ∈ [ (cid:98) m ] } or some x ∈ R d , (cid:98) m := d √ m . For all q + q ∗ = 1 , q ∈ (1 , ∞ ] , we get (cid:107)A(cid:107) L → (cid:96) ≤ (cid:18) (4 πσ ) − (cid:88) j = − (cid:98) m,..., (cid:98) m exp( − ∆ σ j ) (cid:19) d , (6.47) |A ∗ r | C ≤ (2 πσ ) − d (cid:18) (cid:88) j ∈ J exp (cid:16) − q ∗ ∆ σ max(0 , | j | − δ ) (cid:17) (cid:19) q ∗ (cid:107) r (cid:107) q , (6.48) |A ∗ r | C ≤ (2 πσ ) − d σ ∆ σ (cid:18) (cid:88) j ∈ J ( | j | + δ ) q ∗ exp (cid:16) − q ∗ ∆ σ max(0 , | j | − δ ) (cid:17) (cid:19) q ∗ (cid:107) r (cid:107) q , (6.49) |A ∗ r | C ≤ (2 πσ ) − d σ (cid:18) (cid:88) j ∈ J (cid:16) ∆ σ ( | j | + δ ) (cid:17) q ∗ exp (cid:16) − q ∗ ∆ σ max(0 , | j | − δ ) (cid:17) (cid:19) q ∗ (cid:107) r (cid:107) q , (6.50) where δ = √ d and J = { j ∈ Z d s . t . (cid:107) j (cid:107) (cid:96) ∞ ≤ (cid:98) m } . The case for q = 1 can be inferred from thestandard limit of (cid:107)·(cid:107) q ∗ → (cid:107)·(cid:107) ∞ for q ∗ → ∞ . The sharpest method for estimating these norms is of course to compute them. This is most cum-bersome for Gaussian kernels, however, the sums converge faster than exponentially and so the compu-tational burden should be very small.
We present four numerical Lasso examples. The first two are in 1D to demonstrate the performance ofdifferent variants of Algorithm 1, both with and without adaptivity. In particular, we explore sparseGaussian deconvolution and sparse signal recovery from Fourier data. We compare with the continuousbasis pursuit (CBP) discretisation (Ekanadham et al., 2011; Duval and Peyr´e, 2017b) which is alsodesigned to achieve super-resolution accuracy within a convex framework. More details of this methodwill be provided in Section 7.1.The next example is 2D reconstruction from Radon or X-ray data with wavelet-sparsity. As theforward operator is not sufficiently smooth, we must optimise in (cid:96) ( R ) which naturally leads to thechoice of a wavelet basis.Finally, we process a dataset which represents a realistic application in biological microscopy, referredto as STORM microscopy. In essence, the task is to perform 2D Gaussian de-blurring/super-resolutionand denoising to find the location of sparse spikes of signal.In this section, the main aim is to minimise the exact Lasso energy E ( u n ) and so this will beour main metric for the success of an algorithm, referred to as the ‘continuous gap’. Lemma 4.10only provides guarantees on the values of min n ≤ N E ( u n ) so it is this monotone estimate which isplotted. To clarify, as E( u ∗ ) is not known exactly, we always use the estimate min n ≤ N E ( u n ) ≈ min n ≤ N E( u n ) + min n (cid:48) ≤ n E ∗ ( γ ϕ n (cid:48) ). Another quantity of interest is minimisation of the discrete Lassoenergy min n ≤ N E( u n ) + min n (cid:48) ≤ n E ∗ ( γϕ n (cid:48) ) which will be referred to as the ‘discrete gap’. Note that forthe adaptive schemes this may not be exactly monotonic as the discrete dual problem changes with N .The code to reproduce these examples can be found on github . In this example we choose U = M ([0 , A : U → R with either random Fourier kernels:( A u ) j = (cid:90) cos( a j x ) u ( x ) , a j ∼ Uniform[ − , , j = 1 , , . . . , , µ = 0 . , (7.1)or Gaussian kernels on a regular grid:( A u ) j = (2 πσ ) − (cid:90) exp (cid:18) − ( x − ( j − σ (cid:19) u ( x ) , σ = 0 . , ∆ = , j = 1 , , . . . , , µ = 0 . . (7.2) https://github.com/robtovey/2020SpatiallyAdaptiveFISTA u ∗ through a convex discrete optimisation problem which is asymptotically exact in the limit h →
0. Itcan also be optimised with FISTA which allows for direct comparison with the uniform and adaptivemesh approaches. As explained by Ekanadham et al. (2011); Duval and Peyr´e (2017b), the idea is thatfor a fixed mesh, the kernels of A are expanded to first order on each pixel and a particular first orderbasis is also chosen. If u ∗ has only one Dirac spike in each pixel then the zeroth order information shouldcorrespond to the mass of the spike and additional first order information should determine the location.As shown in Section 6, in 1D we have a U = a E = 2. The estimates given in (6.9) and (6.10) indimension d = 1 predict that the adaptive energy will decay at a rate of E ( u n ) (cid:46) n so long as the pixelsize also decreases at a rate of h ∼ n . To achieve these rates, we implement a refinement criterion fromLemma 4.10 with guarantee of E ( u n k − ) (cid:46) − k using the estimates made in Section 6.3. We choosesubspaces U n to approximately enforceE( u n ) + E ∗ ( γ ϕ n ) ≤ u n ) + E ∗ ( γϕ n )) , (7.3)i.e. the continuous gap is bounded by twice the discrete gap. In particular, note that for γ ≈ γ ,E ∗ ( γ ϕ n ) = (cid:107) γ ϕ n (cid:107) + γ η · ϕ n = γ γ (cid:18) γ γ (cid:107) γϕ n (cid:107) + γη · ϕ n (cid:19) ≈ γ γ E ∗ ( γϕ n ) . (7.4)Converting this into a spatial refinement criteria, recall γ γ ≈ |||A ∗ ϕ n ||| ∗ ||| Π n A ∗ ϕ n ||| ∗ = max ω ni ∈ M n (cid:107)A ∗ ϕ n (cid:107) L ∞ ( ω ni ) max ω ni ∈ M n | Π n A ∗ ϕ n ( ω ni ) | ≈ max ω ni ∈ M n (cid:107)A ∗ ϕ n (cid:107) L ∞ ( ω ni ) | Π n A ∗ ϕ n ( ω ni ) | (7.5)is the maximum ratio of second vs. zeroth order Taylor approximations of A ∗ ϕ n on pixel ω ni . Thiswas found to be an efficient method of selecting pixels for refinement based on quantities which hadalready been computed. Note briefly that this greedy strategy directly targets uncertainty, refinementsalso happen outside of the support of u n to guarantee that this is representative of u ∗ . Such refinementis necessary to avoid critical points of E which are not global minimisers. Comparison of discretisation methods
In Figure 1 we compare the three core approaches: fixeduniform discretisation, adaptive discretisation, and CBP. In particular, we wish to observe their con-vergence properties as the number of pixels is allowed to grow. In each case we use a FISTA stepsizeof t n = n +1920 . The adaptive discretisation is started with one pixel and limited to 128, 256, or 512pixels while the fixed and CBP discretisations have uniform discretisations with the maximum numberof pixels. We observe: • The adaptive scheme is much more efficient, in both examples the adaptive scheme with 128 pixelsis at least as good as both fixed discretisations with 512 pixels. In fact, only a maximum of 214pixels were needed by the adaptive method in either example. • With Fourier kernels the uniform piecewise constant discretisation is more efficient than CBP butin the Gaussian case this is reversed. This suggests that the performance of CBP depends highlyon the smoothness of A . • The discrete gaps for non-adaptive optimisation behave as is common for FISTA, initial convergenceis polynomial until a locally linear regime activates (Tao et al., 2016). CBP is always slower toconverge than the piecewise constant discretisation. • The adaptive refinement criterion succeeds in keeping the continuous/discrete gaps close for all n .It is not completely fair to judge CBP with the continuous gap because, although it generates a continuousrepresentation, this continuous representation is not necessarily consistent with the discrete gap beingoptimised, unlike when discretised with finite element methods. On the other hand, this is still theintended interpretation of the algorithm and we have no more appropriate metric for success in this case.18igure 1: Rates of continuous/discrete gap convergence for different Lasso algorithms with 128, 256, or512 pixels. The ‘adaptive’ method uses the proposed algorithm. Both ‘fixed’ and ‘CBP’ use standardFISTA with a uniform discretisation. Comparison of FISTA variants
Comparison of FISTA variants Figure 2 compares many methods with either fixed or adaptive discretisations. Each adaptive scheme is allowed up to 1024 pixels and each uniform discretisation uses exactly 1024. The ‘Greedy FISTA’ implementation was proposed by Liang and Schönlieb (2018) and we include the adaptive variant despite a lack of convergence proof. The remaining FISTA algorithms use a FISTA time step of t_n = (n + a − 1)/a for the given value of a, as proposed by Chambolle and Dossal (2015). In this example CBP used the greedy FISTA implementation, which gave faster observed convergence. Figure 2 compares the discrete gaps because it is the accurate metric for fixed discretisations, and for the adaptive discretisation it should also be an accurate predictor of the continuous gap. The key observations are:

• Each algorithm displays very similar convergence properties. The main difference is that the reconstructions with fixed discretisations accelerate after roughly 10³–10⁴ iterations.

• During the initial ‘slow’ phase, adaptive and fixed discretisations appear to achieve very similar (discrete) convergence rates. The coarse-to-fine adaptivity is not slower than fixed discretisations in this regime.

• Lemma 4.10 accurately predicts the n^{−1} rate of the adaptive methods, mirrored in the fixed discretisations. This suggests that high-resolution discretisations are also initially limited by this n^{−1} rate before entering the asymptotic regime.

• The fastest FISTA algorithm is consistently the greedy one, although a = 20 is very comparable.

Figure 2: Discrete convergence of different algorithms. ‘Adaptive’ methods use Algorithm 1 with fewer than 1024 pixels and the remaining methods use a uniform discretisation of 1024 pixels.
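For reference, the stepsize rules compared in Figure 2 differ only in one line of a standard FISTA loop. The sketch below uses the t_n = (n + a − 1)/a rule; grad_f, prox_g and the Lipschitz constant L are assumed to be supplied by the caller, and the greedy restart heuristics of Liang and Schönlieb (2018) are deliberately omitted.

    import numpy as np

    def fista(grad_f, prox_g, L, u0, n_iter, a=20.0):
        # Sketch of FISTA with the Chambolle-Dossal stepsize t_n = (n+a-1)/a;
        # prox_g(point, weight) is the proximity operator of g with weight 1/L.
        u_old, v = u0.copy(), u0.copy()
        for n in range(1, n_iter + 1):
            t_now, t_next = (n + a - 1) / a, (n + a) / a
            u = prox_g(v - grad_f(v) / L, 1.0 / L)       # forward-backward step
            v = u + (t_now - 1) / t_next * (u - u_old)   # inertial extrapolation
            u_old = u
        return u

With a = 2 this recovers (up to the usual minor modification) the classical stepsize; larger a, such as a = 20, trades early momentum for robustness, matching the observations above.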
Comparison of fixed and adaptive discretisation Motivated by the findings in Figure 2, we now look more closely at the performance of the a = 20 and the greedy FISTA schemes. We have analytical results for the former, but the latter typically performs the best for non-adaptive optimisation and is never worse than a = 20 in the adaptive setting. The question is whether it is faster/more efficient to use the proposed adaptive scheme or to use a classical scheme at sufficiently high uniform resolution. The fixed discretisations use 1024 pixels (i.e. a constant pixel size of 2^{−10} in Figure 3) and the adaptive discretisation starts with two pixels and an upper limit of 1024. As expected, the fixed discretisation starts with a smaller continuous gap before plateauing at a sub-optimal value, while the adaptive scheme continues to decrease it.
At this accuracy, all methods would converge after a comparable number of iterations, demonstrating some equivalence between the two FISTA algorithms. Over the intermediate range of n, in both problems, the adaptive schemes coincide with the fixed schemes in both energy and minimum pixel size. On the other hand, we also see that the adaptive scheme achieves this energy in almost an order of magnitude less time and with fewer pixels.

In this example we consider A to be a 2D Radon transform. In particular, the rows of A correspond to integrals over the sets X_{Ii} where

X_{Ii} = { x ∈ [−1,1]² s.t. x · (cos θ_I, sin θ_I)^⊤ ∈ [ −1 + (i−1)/25, −1 + i/25 ) },  θ_I = (I/50)·180°,  for i, I = 1, 2, ..., 50. (7.6)

This is not exactly in the form analysed by Theorem 6.3, however for each I the sets {X_{Ii} s.t. i ∈ [50]} are disjoint, therefore we can apply Theorem 6.3 block-wise to estimate

‖A‖_{L²→ℓ²} ≤ √( Σ_{I∈[50]} max_{i∈[50]} |X_{Ii}| ) = √( Σ_{I∈[50]} max_{i∈[50]} ∫_{X_{Ii}} dx ) = √( Σ_{I∈[50]} max_{i∈[50]} (A𝟙)_{i,I} ), (7.7)

where 𝟙 is the constant-one function on [−1,1]².
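Since every quantity in (7.7) is an entry of the sinogram of the constant-one function, the bound is essentially free to evaluate. A sketch, assuming A_one is the 50 × 50 array (A𝟙)_{i,I} with rows indexed by offset i and columns by angle I:

    import numpy as np

    def radon_norm_bound(A_one):
        # (7.7): max over offsets i within each angle I, then sum over angles
        return np.sqrt(A_one.max(axis=0).sum())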
A is not smooth, therefore we cannot bound |A*|_{C^k} for any k > 0, and so we must look to minimise over ℓ¹ rather than L¹. The natural choice is to promote sparsity in a wavelet basis, which can be rearranged into the Lasso form:

min_{u∈U} ½‖Au − η‖²_{ℓ²} + µ‖W⁻¹u‖_{ℓ¹} = min_{û∈ℓ¹(R)} ½‖AWû − η‖²_{ℓ²} + µ‖û‖_{ℓ¹}. (7.8)

The minimisers are related by u* = Wû* and, for wavelet bases, W is orthonormal so ‖AW‖_{ℓ²→ℓ²} = ‖A‖_{L²→ℓ²}. From Section 6.3 we know that, to track convergence and perform adaptive refinement, it is sufficient to accurately bound |[W^⊤ A*φ_n]_j| for all j ∉ J_n. If W is a wavelet transformation then its columns, w_j ∈ L², are simply the wavelets themselves and we can use the bound

|⟨w_j, A*φ_n⟩| = |⟨w_j, 𝟙_{supp(w_j)} A*φ_n⟩| ≤ ‖𝟙_{supp(w_j)} A*φ_n‖_{L²} ≤ ‖𝟙_X A*φ_n‖_{L²} (7.9)

for all X ⊃ supp(w_j). In the case of the Radon transform, we can compute the left-hand side explicitly for the finitely many j ∈ J_n, but we wish to use the right-hand side in a structured way to avoid computing the infinitely many j ∉ J_n. To do this, we will take a geometrical perspective on the construction of wavelets to view them in a tree format.

Figure 3: Continuous convergence of adaptive (coarse-to-fine pixel size) compared with uniform discretisation (constant pixel size) with respect to number of iterations.

Figure 4: Continuous convergence of adaptive compared with uniform discretisation with respect to wall-clock time and total number of pixels (memory requirement).

Figure 5: Example of a 1D Haar wavelet tree: an index set J_n with its leaves leaf(J_n) and the induced mesh M_n; the wavelets w_{j,k} are arranged in a tree structure with their support underneath.

Tree structure of wavelets
Finite elements are constructed with a mesh, which provided a useful tool for adaptive refinement in Section 6.3.3. For wavelets, we will associate a tree with every discretisation and the leaves of the tree form a mesh. This perspective comes from the multi-resolution interpretation of wavelets. An example is seen in Figure 5 for 1D Haar wavelets, w_{j,k}(x) = √(2^k) ψ(2^k x − j) where ψ = 𝟙_{[0,½)} − 𝟙_{[−½,0)}. In higher dimensions, the only two things which change are the number of children (2^d for non-leaves) and that each node stores the coefficients of the 2^d − 1 wavelets at level k. The only change in our own implementation is to translate the support to [−1,1]². We briefly remark that the tree structuring of wavelets is not novel and appears more frequently in the Bayesian inverse problems literature, for example see Castillo and Rockova (2019).
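A sketch of this tree in code, for the 1D Haar case; the d-dimensional generalisation (2^d children and 2^d − 1 coefficients per node) is indicated in the comments, and the class and attribute names are illustrative only.

    class HaarNode:
        # node (k, j) stores the coefficient of w_{j,k}; in d dimensions it
        # would store the 2**d - 1 detail coefficients of level k instead
        def __init__(self, level, shift):
            self.level, self.shift = level, shift
            self.coeff = 0.0
            self.children = []       # empty <=> leaf; leaves form the mesh M_n

        def refine(self):
            # splitting a leaf halves its support (2**d children in dimension d)
            if not self.children:
                self.children = [HaarNode(self.level + 1, 2 * self.shift),
                                 HaarNode(self.level + 1, 2 * self.shift + 1)]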
Continuous gradient estimate In Section 7.1 we used the continuous gap as a measure for convergence; for wavelets we will use the continuous gradient. With the tree structure we can easily adapt the results of Section 6.3 to estimate gradients (or function gaps). In particular,

|||∂E(u_n)|||_* = max( |||∂_n E(u_n)|||_*, max_{j∉J_n} |⟨w_j, A*φ_n⟩| − µ ) (7.10)
 ≤ max( |||∂_n E(u_n)|||_*, max_{j∈leaf(J_n)} ‖𝟙_{supp(w_j)} A*φ_n‖_{L²} − µ ). (7.11)
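In code, (7.11) needs only the discrete gradient and one L² tail norm per leaf; the same quantities drive the refinement rule used in the experiments below. A sketch with hypothetical input arrays:

    import numpy as np

    def gradient_bound(discrete_grad, leaf_tail_norms, mu):
        # (7.11): continuous gradient <= max(discrete part, leaf tails - mu)
        tails = leaf_tail_norms - mu
        return max(discrete_grad, tails.max(initial=0.0))

    def leaves_to_refine(discrete_grad, leaf_tail_norms, mu, factor=10.0):
        # refine leaves whose tail keeps the bound above factor * discrete grad
        return np.flatnonzero(leaf_tail_norms - mu > factor * discrete_grad)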
Numerical results We consider two phantoms where u† is either a binary disc or the Shepp-Logan phantom. No noise is added to the Shepp-Logan data but 5% Gaussian white noise is added to the disc data. This is visualised in Figure 6. All optimisations shown will be spatially adaptive using Haar wavelets and initialised with U_0 = {x ↦ c s.t. c ∈ R}. The gradient metric shown throughout is the ℓ^∞ norm. Motivated by (7.11), the spatial adaptivity is chosen to refine nodes j ∈ leaf(J_n) to ensure that

‖𝟙_{supp(w_j)} A*φ_n‖_{L²} − µ ≤ 10 |||∂_n E(u_n)|||_*  for all j and n

(i.e. so that the continuous gradient is less than 10 times the discrete gradient). The first numerical results, shown in Figure 7, compare the same adaptive algorithms as shown in Figure 2. In these examples we see that the greedy FISTA and the a = 20 algorithms achieve almost linear convergence while a = 2 is significantly slower.

Our final application is a super-resolution/de-blurring inverse problem from biological microscopy. In mathematical terms, the observed data is a large number of sparse images which are corrupted by blurring and a large amount of noise; examples are seen in Figure 8. The task is to compute the centres of the spikes of signal in each image and then re-combine them into a single super-resolved image, as in Figure 10.

Figure 6: Phantoms and data used for wavelet-sparse tomography optimisation. The Shepp-Logan data is exact but the data for the disc phantom has 5% Gaussian white noise. Without noise the data would be uniform with respect to the angle.

Figure 7: Discrete convergence of different implementations of Algorithm 1 with an unlimited number of pixels.

Figure 8: Example of images from STORM dataset.

This technique is referred to as
Single Molecule Localisation Microscopy (SMLM), of which we consider the specific example of
Stochastic Optical Reconstruction Microscopy (STORM). Readers are directed to works such as Sage et al. (2015, 2019); Schermelleh et al. (2019) for further details. The Lasso formulation has previously been shown to be effective in the context of STORM, see for instance Huang et al. (2017); Denoyelle et al. (2019).

Here we use a simulated dataset provided as part of the 2016 SMLM challenge for benchmarking software in this application. The corresponding Lasso formulation is

(Au)_{i₁,i₂} = (2πσ²)^{−1} ∫_{[0, 6.4µm]²} exp( −|x − ∆ (i₁ − ½, i₂ − ½)^⊤|² / (2σ²) ) u(x) dx,  ∆ = 0.1 µm, (7.12)

for i₁, i₂ = 1, 2, ..., 64 and U = M([0 µm, 6.4 µm]²), where the sub-micron kernel width σ is fixed by the point-spread function of the challenge data. 3020 frames are provided, examples of which are shown in Figure 8. To process this dataset, image intensities were normalised to [0, 1], then a constant was subtracted to approximate 0-mean noise. The greedy FISTA algorithm was used for optimisation with µ = 0.15, a fixed iteration budget, and a cap on the number of pixels per image.
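For concreteness, the following is a minimal sketch of the observation operator (7.12) applied to a discrete measure u = Σ_k c_k δ_{x_k}; the function name is hypothetical and the values of σ and ∆ are illustrative placeholders rather than the exact challenge parameters.

    import numpy as np

    def storm_forward(points, masses, n=64, delta=0.1, sigma=0.15):
        # points: (K, 2) spike locations in microns; masses: (K,) weights
        centres = delta * (np.arange(n) + 0.5)   # camera pixel centres, (7.12)
        gx = np.exp(-(centres[:, None] - points[:, 0])**2 / (2 * sigma**2))
        gy = np.exp(-(centres[:, None] - points[:, 1])**2 / (2 * sigma**2))
        # separable 2D Gaussian kernel, normalised by (2*pi*sigma^2)^-1
        return (gy[:, None, :] * gx[None, :, :] * masses).sum(-1) \
               / (2 * np.pi * sigma**2)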
Finally, all the reconstructions were summed and the result is shown in Figure 10. The adaptive scheme used far fewer pixels per frame than a fixed discretisation of equivalent resolution would require. Lasso is compared with ThunderSTORM (Ovesný et al., 2014), a popular ImageJ plugin (Schindelin et al., 2012) which finds the location of signal using Fourier filtering. The performance of ThunderSTORM was rated very highly in the initial competition by Sage et al. (2015). Both methods compared here demonstrate the key structures of the reconstruction, however, both are sensitive to tuning parameters. In this example, Lasso has possibly recovered too little signal and ThunderSTORM contains spurious signal.

Figure 9 shows slightly faster convergence than the n^{−1/2} predicted by (6.10) in dimension d = 2. In this example we also implement the suggestion of Section 6.4 to remove pixels outside of the support of u*. From (6.42), any pixel ω_i^n satisfying

γ_0 ‖Π_n A*φ_n‖_{L^∞(ω_i^n)} ≤ µ (7.13)

guarantees that ω_i^n ∩ supp(u*) = ∅. The ratio of the left- and right-hand sides of (7.13), a threshold for pixel removal, is plotted in the first panel of Figure 9. Once this threshold becomes less than 1, we can start reducing the number of pixels instead of just continually refining, as seen in the right-hand panel after around 30 iterations.
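The pixel-removal test (7.13) reuses the per-pixel sup norms already maintained for refinement. A sketch with hypothetical inputs:

    import numpy as np

    def removable_pixels(gamma0, proj_sup_norms, mu):
        # (7.13): gamma0 * ||Pi_n A* phi_n||_{L^inf(omega_i)} <= mu implies
        # the pixel cannot intersect supp(u*), so it is safe to delete
        threshold = gamma0 * proj_sup_norms / mu  # quantity plotted in Figure 9
        return np.flatnonzero(threshold <= 1.0)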
Figure 9: Convergence of the STORM reconstructions; spatial axes are normalised to [0, 1] rather than [0, 6.4 µm].

Figure 10: Processed results of the STORM dataset. Top left: Lasso optimisation with Algorithm 1. Top right: Comparison with ThunderSTORM plugin. Bottom: Average data, no super-resolution or de-blurring.

Conclusions

In this work we have proposed a new adaptive variant of FISTA and provided convergence analysis. This algorithm allows FISTA to be applied outside of the classical Hilbert space setting, still with a guaranteed rate of convergence. We have presented several numerical examples where convergence with the refining discretisation is at least as fast as with a uniform discretisation, while being more efficient with regards to both memory and computation time.

In 1D we see good agreement with the theoretical rate. This rate also seems to be a good predictor for all variants of FISTA tested, although this is yet to be proven. Surprisingly, even the classical methods with a fixed discretisation initially seem limited to the slower adaptive rate for small n.

The results in 2D are similar: all tested FISTA methods converge at least at the guaranteed rate. The wavelet example was most impressive, achieving nearly linear convergence in energy. This is similar to the behaviour of classical FISTA, although it is also yet to be formally proven.

An interesting observation over all of the adaptive Lasso examples is that the classical oscillatory behaviour of FISTA has not occurred. With the monotone gaps plotted, oscillatory convergence should correspond to a piecewise constant descending gap. Either this behaviour only emerges for larger n, or the adaptivity provides a dampening effect for this oscillation.

Moving forward, it would be interesting to see how far the analysis extends to other optimisation algorithms. Other variants of FISTA, such as the ‘greedy’ implementation used here or the traditional Forward-Backward algorithm, should also be receptive to the analysis performed here. Furthermore, it would also be interesting to attempt to replicate this refinement argument to extend the primal-dual algorithm proposed by Chambolle and Pock (2011) or the Douglas-Rachford algorithm proposed by Douglas and Rachford (1956).

Acknowledgements

R.T. acknowledges funding from EPSRC grant EP/L016516/1 for the Cambridge Centre for Analysis, and the ANR CIPRESSI project grant ANR-19-CE48-0017-01 of the French Agence Nationale de la Recherche. Most of this work was done while A.C. was still in CMAP, CNRS and Ecole Polytechnique, Institut Polytechnique de Paris, Palaiseau, France.
References
Alamo, T., Limon, D., and Krupa, P. (2019). Restart FISTA with global linear convergence. In 2019 18th European Control Conference (ECC), pages 1969–1974. IEEE.

Aujol, J.-F. and Dossal, C. (2015). Stability of over-relaxations for the forward-backward algorithm, application to FISTA. SIAM Journal on Optimization, 25(4):2408–2433.

Beck, A. and Teboulle, M. (2009). Gradient-based algorithms with applications to signal recovery problems. In Convex Optimization in Signal Processing and Communications, pages 42–88. Cambridge University Press.

Bonnefoy, A., Emiya, V., Ralaivola, L., and Gribonval, R. (2015). Dynamic screening: Accelerating first-order algorithms for the lasso and group-lasso. IEEE Transactions on Signal Processing, 63(19):5121–5132.

Boyd, N., Schiebinger, G., and Recht, B. (2017). The alternating descent conditional gradient method for sparse inverse problems. SIAM Journal on Optimization, 27(2):616–639.

Boyer, C., Chambolle, A., Castro, Y. D., Duval, V., De Gournay, F., and Weiss, P. (2019). On representer theorems and convex regularization. SIAM Journal on Optimization, 29(2):1260–1281.

Bredies, K. and Pikkarainen, H. K. (2013). Inverse problems in spaces of measures. ESAIM: Control, Optimisation and Calculus of Variations, 19(1):190–218.

Castillo, I. and Rockova, V. (2019). Multiscale analysis of Bayesian CART. University of Chicago, Becker Friedman Institute for Economics Working Paper, (2019-127).

Catala, P., Duval, V., and Peyré, G. (2019). A low-rank approach to off-the-grid sparse superresolution. SIAM Journal on Imaging Sciences, 12(3):1464–1500.

Chambolle, A. and Dossal, C. (2015). On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. Journal of Optimization Theory and Applications, 166(3):968–982.

Chambolle, A. and Pock, T. (2011). A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145.

De Castro, Y., Gamboa, F., Henrion, D., and Lasserre, J.-B. (2016). Exact solutions to super resolution on semi-algebraic domains in higher dimensions. IEEE Transactions on Information Theory, 63(1):621–630.

Denoyelle, Q., Duval, V., Peyré, G., and Soubies, E. (2019). The sliding Frank–Wolfe algorithm and its application to super-resolution microscopy. Inverse Problems, 36(1):014001.

Douglas, J. and Rachford, H. H. (1956). On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American Mathematical Society, 82(2):421–439.

Duval, V. and Peyré, G. (2017a). Sparse spikes super-resolution on thin grids I: the LASSO. Inverse Problems, 33(5):055008.

Duval, V. and Peyré, G. (2017b). Sparse spikes super-resolution on thin grids II: the continuous basis pursuit. Inverse Problems, 33(9):095008.

Ekanadham, C., Tranchina, D., and Simoncelli, E. P. (2011). Recovery of sparse translation-invariant signals with continuous basis pursuit. IEEE Transactions on Signal Processing, 59(10):4735–4744.

El Ghaoui, L., Viallon, V., and Rabbani, T. (2010). Safe feature elimination in sparse supervised learning. Technical Report UCB/EECS-2010-126, EECS Department, University of California, Berkeley.

Huang, J., Sun, M., Ma, J., and Chi, Y. (2017). Super-resolution image reconstruction for high-density three-dimensional single-molecule microscopy. IEEE Transactions on Computational Imaging, 3(4):763–773.

Jiang, K., Sun, D., and Toh, K.-C. (2012). An inexact accelerated proximal gradient method for large scale linearly constrained convex SDP. SIAM Journal on Optimization, 22(3):1042–1064.

Liang, J., Fadili, J., and Peyré, G. (2017). Activity identification and local linear convergence of forward–backward-type methods. SIAM Journal on Optimization, 27(1):408–437.

Liang, J. and Schönlieb, C.-B. (2018). Improving FISTA: Faster, smarter and greedier. arXiv preprint arXiv:1811.01430.

Ndiaye, E., Fercoq, O., Gramfort, A., and Salmon, J. (2017). Gap safe screening rules for sparsity enforcing penalties. The Journal of Machine Learning Research, 18(1):4671–4703.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston, Dordrecht, London.

Ovesný, M., Křížek, P., Borkovec, J., Švindrych, Z., and Hagen, G. M. (2014). ThunderSTORM: a comprehensive ImageJ plug-in for PALM and STORM data analysis and super-resolution imaging. Bioinformatics, 30(16):2389–2390.

Parpas, P. (2017). A multilevel proximal gradient algorithm for a class of composite optimization problems. SIAM Journal on Scientific Computing, 39(5):S681–S701.

Poon, C., Keriven, N., and Peyré, G. (2018). The geometry of off-the-grid compressed sensing. arXiv preprint arXiv:1802.08464.

Sage, D., Kirshner, H., Pengo, T., Stuurman, N., Min, J., Manley, S., and Unser, M. (2015). Quantitative evaluation of software packages for single-molecule localization microscopy. Nature Methods, 12(8):717–724.

Sage, D., Pham, T.-A., Babcock, H., Lukes, T., Pengo, T., Chao, J., Velmurugan, R., Herbert, A., Agrawal, A., Colabrese, S., et al. (2019). Super-resolution fight club: assessment of 2D and 3D single-molecule localization microscopy software. Nature Methods, 16(5):387–395.

Schermelleh, L., Ferrand, A., Huser, T., Eggeling, C., Sauer, M., Biehlmaier, O., and Drummen, G. P. (2019). Super-resolution microscopy demystified. Nature Cell Biology, 21(1):72–84.

Schindelin, J., Arganda-Carreras, I., Frise, E., Kaynig, V., Longair, M., Pietzsch, T., Preibisch, S., Rueden, C., Saalfeld, S., Schmid, B., et al. (2012). Fiji: an open-source platform for biological-image analysis. Nature Methods, 9(7):676–682.

Schmidt, M., Roux, N. L., and Bach, F. R. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, pages 1458–1466.

Strang, G. (1972). Approximation in the finite element method. Numerische Mathematik, 19(1):81–98.

Tao, S., Boley, D., and Zhang, S. (2016). Local linear convergence of ISTA and FISTA on the LASSO problem. SIAM Journal on Optimization, 26(1):313–336.

Unser, M., Fageot, J., and Gupta, H. (2016). Representer theorems for sparsity-promoting ℓ¹ regularization. IEEE Transactions on Information Theory, 62(9):5167–5180.

Villa, S., Salzo, S., Baldassarre, L., and Verri, A. (2013). Accelerated and inexact forward-backward algorithms. SIAM Journal on Optimization, 23(3):1607–1633.
A Proofs for FISTA convergence
This section contains all of the statements and proofs of the results contained in Section 4.
A.1 Proofs for Step 3
Theorem A.1 (Lemma 4.2). Let U_n ⊂ H ∩ U and w_n ∈ U_n be chosen arbitrarily and u_n/v_n be generated by Algorithm 1 for all n ∈ N. For all n ∈ N it holds that

t_n² (E(u_n) − E(w_n)) − (t_n² − t_n)(E(u_{n−1}) − E(w_n)) ≤ ½ [ ‖v_{n−1}‖² − ‖v_n‖² ] + ⟨v_n − v_{n−1}, w_n⟩. (A.1)

Proof.
Modifying (Chambolle and Dossal, 2015, Theorem 2), for n ≥ 1 choose u = u_{n−1} and w = (1 − 1/t_n) u_{n−1} + (1/t_n) w_n. This gives

E(u_n) + ½ ‖(1/t_n) v_n − (1/t_n) w_n‖² ≤ E( (1 − 1/t_n) u_{n−1} + (1/t_n) w_n ) + ½ ‖(1/t_n) v_{n−1} − (1/t_n) w_n‖². (A.2)

By the convexity of E, this reduces to

E(u_n) − E(w_n) − (1 − 1/t_n)[E(u_{n−1}) − E(w_n)] ≤ (1/(2t_n²)) ‖v_{n−1} − w_n‖² − (1/(2t_n²)) ‖v_n − w_n‖² (A.3)
 = (1/t_n²) ( ½ [ ‖v_{n−1}‖² − ‖v_n‖² ] + ⟨v_n − v_{n−1}, w_n⟩ ). (A.4)

Multiplying through by t_n² gives (A.1).

Theorem A.2 (Theorem 4.3). Fix a sequence of subspaces {U_n ⊂ U ∩ H s.t. n ∈ N}, arbitrary u_0 ∈ U_0, and FISTA stepsize choice (t_n)_{n∈N}. Let u_n and v_n be generated by Algorithm 1. Then, for any choice of w_n ∈ U_n and N ∈ N we have (with ρ_n := t_n² − t_{n+1}² + t_{n+1})

t_N² ℰ(u_N) + Σ_{n=1}^{N−1} ρ_n ℰ(u_n) + ½‖v_N − w_N‖² ≤ ½‖u_0 − w_0‖² − ½‖w_0‖² + ½‖w_N‖² + Σ_{n=1}^{N} [ t_n ℰ(w_n) + ⟨v_{n−1}, w_{n−1} − w_n⟩ ]. (A.5)

Proof.
Theorem A.2 is just a summation of (A.1) over all n = 1, ..., N. To see this: first add and subtract E(u*) to each term on the left-hand side to convert E to ℰ(·) = E(·) − E(u*), then move ℰ(w_n) to the right-hand side. Now (A.1) becomes

t_n² ℰ(u_n) − (t_n² − t_n) ℰ(u_{n−1}) ≤ t_n ℰ(w_n) + ½ [ ‖v_{n−1}‖² − ‖v_n‖² ] + ⟨v_n − v_{n−1}, w_n⟩. (A.6)

Summing this inequality from n = 1 to n = N gives

t_N² ℰ(u_N) + Σ_{n=1}^{N−1} ( t_n² − t_{n+1}² + t_{n+1} )_{= ρ_n} ℰ(u_n) ≤ ½‖v_0‖² − ½‖v_N‖² + Σ_{n=1}^{N} [ t_n ℰ(w_n) + ⟨v_n − v_{n−1}, w_n⟩ ]. (A.7)

This is almost in the desired form, however, we would like to flip the roles of v_n/w_n in the final inner product term. Re-writing the right-hand side gives

Σ_{n=1}^{N} ⟨v_n − v_{n−1}, w_n⟩ = ⟨v_N, w_N⟩ − ⟨v_0, w_0⟩ + Σ_{n=1}^{N} ⟨v_{n−1}, w_{n−1} − w_n⟩. (A.8)

Noting that v_0 = u_0, the previous two equations combine to prove the statement of Theorem A.2.

The following lemma is used to produce a sharper estimate on sequences t_n.

Lemma A.3. If ρ_n = t_n² − t_{n+1}² + t_{n+1} ≥ 0 and t_n ≥ 1 for all n ∈ N, then t_n ≤ n − 1 + t_1.

Proof. This is trivially true for n = 1. Suppose it is true for n − 1;
then the condition on ρ_{n−1} gives

t_n² − t_n ≤ t_{n−1}² ≤ (n − 2 + t_1)² = (n − 1 + t_1)² − 2(n − 1 + t_1) + 1. (A.9)

Assume for contradiction that t_n > n − 1 + t_1. Since x ↦ x² − x is increasing for x ≥ ½, this gives (n − 1 + t_1)² − (n − 1 + t_1) < t_n² − t_n, which combined with (A.9) simplifies to n − 1 + t_1 < 1, contradicting t_1 ≥ 1 and n ≥ 1.

Lemma A.4 (Lemma 4.5). Let u_n, v_n be generated by Algorithm 1, (n_k ∈ N)_{k=0}^∞ be a monotone increasing sequence, and choose w̃_k ∈ U_{n_k} ∩ U_{n_k + 1} ∩ ... ∩ U_{n_{k+1} − 1} for each k ∈ N. If such a sequence exists, then for all K ∈ N, n_K ≤ N < n_{K+1} we have

t_N² ℰ(u_N) + Σ_{n=1}^{N−1} ρ_n ℰ(u_n) + ½‖v_N − w̃_K‖² ≤ C_0 + ½‖w̃_K‖² + ( ((N+1)² − n_K²)/2 ) ℰ(w̃_K) + Σ_{k=1}^{K} [ ( (n_k² − n_{k−1}²)/2 ) ℰ(w̃_{k−1}) + ⟨v_{n_k − 1}, w̃_{k−1} − w̃_k⟩ ]

where C_0 = ½‖u_0 − w̃_0‖² − ½‖w̃_0‖².

Proof. This is just a telescoping of the right-hand side of (A.5) with the introduction of n_k and the simplification w_n = w̃_k for n_k ≤ n < n_{k+1}:

½‖w_N‖² + Σ_{n=1}^{N} [ t_n ℰ(w_n) + ⟨v_{n−1}, w_{n−1} − w_n⟩ ] = ½‖w̃_K‖² + Σ_{n=n_K}^{N} t_n ℰ(w̃_K) + Σ_{k=1}^{K} [ Σ_{n=n_{k−1}}^{n_k − 1} t_n ℰ(w̃_{k−1}) + ⟨v_{n_k − 1}, w̃_{k−1} − w̃_k⟩ ]. (A.10)

By Lemma A.3, t_n ≤ n (since t_1 = 1 in Algorithm 1), so we can further simplify

Σ_{n=a}^{b−1} t_n ≤ Σ_{n=a}^{b−1} n = ( b(b−1) − a(a−1) )/2 ≤ (b² − a²)/2.
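Both stepsize rules used in this paper satisfy the hypotheses of Lemma A.3, which can be checked numerically; the following sketch verifies ρ_n ≥ 0 and t_n ≤ n − 1 + t_1 for the classical FISTA recursion and for t_n = (n + a − 1)/a.

    import numpy as np

    t = 1.0                                            # t_1 = 1
    for n in range(1, 10000):
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2       # classical FISTA rule
        assert t**2 - t_next**2 + t_next >= -1e-9      # rho_n >= 0 (here = 0)
        assert t <= n - 1 + 1.0 + 1e-9                 # bound of Lemma A.3
        t = t_next

    a = 20.0
    ts = (np.arange(1, 10000) + a - 1) / a             # Chambolle-Dossal rule
    assert np.all(ts[:-1]**2 - ts[1:]**2 + ts[1:] >= 0)   # rho_n >= 0, a >= 2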
A.2 Proof for Step 4

Lemma A.5 (Lemma 4.6). Suppose U_n, u_n, v_n and n_k satisfy the conditions of Lemma 4.5 and {Ũ_k} forms an (a_U, a_E)-discretisation for E with

w̃_k ∈ [ argmin_{u∈Ũ_k} E(u) ] ∩ U_{n_k} ∩ U_{n_k + 1} ∩ ... ∩ U_{n_{k+1} − 1}.

If either:

• a_U > 1 and n_k ≲ a_E^{k/2} a_U^{k/2},
• or a_U = 1, Σ_{k=1}^∞ n_k² a_E^{−k} < ∞ and Σ_{k=1}^∞ ‖w̃_k − w̃_{k+1}‖ < ∞,

then ℰ(u_N) ≲ a_U^K / N² for all n_K ≤ N < n_{K+1}.

Proof.
Starting from Lemma A.4 we get

t_N² ℰ(u_N) + ½‖v_N − w̃_K‖² ≤ C_0 + ½‖w̃_K‖² + ( ((N+1)² − n_K²)/2 ) ℰ(w̃_K) + Σ_{k=1}^{K} [ ( (n_k² − n_{k−1}²)/2 ) ℰ(w̃_{k−1}) + ⟨v_{n_k − 1}, w̃_{k−1} − w̃_k⟩ ] (A.11)
 ≤ C_0 + ½‖w̃_K‖² + ( (N+1)²/2 ) ℰ(w̃_K) + Σ_{k=1}^{K} [ (n_k²/2) ℰ(w̃_{k−1}) + ⟨v_{n_k − 1} − w̃_{k−1} + w̃_{k−1}, w̃_{k−1} − w̃_k⟩ ]. (A.12)

The inductive step now depends on the value of a_U.

Case a_U > 1: We simplify the inequality to

t_N² ℰ(u_N) + ½‖v_N − w̃_K‖² ≲ a_U^K + (N+1)² a_E^{−K} + Σ_{k=1}^{K} [ n_k² a_E^{−k} + a_U^{k/2} ‖v_{n_k − 1} − w̃_{k−1}‖ + a_U^k ] (A.13)
 ≤ C₁ [ a_U^{K+2} + Σ_{k=1}^{K} ( a_U^k + a_U^{k/2} ‖v_{n_k − 1} − w̃_{k−1}‖ ) ] (A.14)

for some C₁ > C_0. Choose C₂ ≥ ‖v_{n_1 − 1} − w̃_0‖ a_U^{−1/2} such that

½ C₂² ≥ C₁ a_U ( a_U + C₂ ) / ( a_U − 1 ). (A.15)

Assume ‖v_{n_k − 1} − w̃_{k−1}‖ ≤ C₂ a_U^{k/2} for 1 ≤ k ≤ K (trivially true for K = 1), then for N = n_{K+1} − 1,

½ ‖v_{n_{K+1} − 1} − w̃_K‖² ≤ C₁ [ a_U^{K+2} + Σ_{k=1}^{K} (1 + C₂) a_U^k ] (A.16)
 ≤ C₁ [ a_U^{K+2} + (1 + C₂) a_U^{K+2}/(a_U − 1) ] (A.17)
 ≤ C₁ ( a_U^{K+2}/(a_U − 1) ) ( a_U + C₂ ) ≤ ½ ( C₂ a_U^{(K+1)/2} )². (A.18)

This closes the induction and, since the left-hand side of (A.14) also controls t_N² ℰ(u_N), it proves ℰ(u_N) ≲ a_U^K / N².

Case a_U = 1: Denote b_k = ‖w̃_k − w̃_{k+1}‖ and note that ‖w̃_{k−1}‖ ≤ ‖w̃_0‖ + Σ_{k=0}^∞ b_k ≲ 1.
We therefore bound

t_N² ℰ(u_N) + ½‖v_N − w̃_K‖² ≲ (N+1)² a_E^{−K} + Σ_{k=1}^{K} [ n_k² a_E^{−k} + ( ‖v_{n_k − 1} − w̃_{k−1}‖ + 1 ) b_k ] (A.19)
 ≤ C₁ [ 1 + Σ_{k=1}^{K} ‖v_{n_k − 1} − w̃_{k−1}‖ b_k ] (A.20)

for some C₁ > 0.
Choose C₂ ≥ ‖v_{n_1 − 1} − w̃_0‖ such that

½ C₂² ≥ C₁ ( 1 + C₂ Σ_{k=1}^∞ b_k ). (A.21)

Assume ‖v_{n_k − 1} − w̃_{k−1}‖ ≤ C₂ for 1 ≤ k ≤ K (trivially true for K = 1), then for N = n_{K+1} − 1,

½ ‖v_{n_{K+1} − 1} − w̃_K‖² ≤ C₁ [ 1 + Σ_{k=1}^{K} ‖v_{n_k − 1} − w̃_{k−1}‖ b_k ] ≤ C₁ ( 1 + C₂ Σ_{k=1}^∞ b_k ) ≤ ½ C₂². (A.22)

Hence ‖v_{n_{K+1} − 1} − w̃_K‖ ≤ C₂ holds for all K and we have

t_N² ℰ(u_N) ≤ C a_U^K (A.23)

for all n_K ≤ N < n_{K+1}.

A.3 Proof for Step 5
Lemma A.6 (Lemma 4.7). Suppose u_n and n_k are sequences satisfying ℰ(u_N) ≲ a_U^K / N² where n_K ≳ a_E^{K/2} a_U^{K/2}, then ℰ(u_N) ≲ N^{−2(1−κ)} where κ = log a_U / (log a_E + log a_U).

Proof.
The proof is direct computation: from N ≥ n_K ≳ a_E^{K/2} a_U^{K/2},

2 log N ≥ 2 log C + K ( log a_E + log a_U ) (A.24)

which leads to

a_U^K = exp(K log a_U) ≤ exp( 2 log N · log a_U/(log a_E + log a_U) − 2 log C · log a_U/(log a_E + log a_U) ) ≲ N^{2κ} (A.25)

as required, since then ℰ(u_N) ≲ a_U^K / N² ≲ N^{2κ − 2} = N^{−2(1−κ)}.

A.4 Proofs for Step 6

Theorem A.7 (Theorem 4.9). Let {U_n ⊂ H ∩ U s.t. n ∈ N} be a sequence of subspaces and n_k ∈ N a monotone increasing sequence such that

Ũ_k := U_{n_k} ∋ u_{n_k − 1},  w̃_k ∈ [ argmin_{u∈Ũ_k} E(u) ] ∩ U_{n_k + 1} ∩ ... ∩ U_{n_{k+1} − 1}

for all k ∈ N. Compute u_n, v_n by Algorithm 1. Suppose there exist a_U, a_E ≥ 1 such that either:

• a_U > 1 and n_k ≲ a_E^{k/2} a_U^{k/2},
• or a_U = 1, Σ_{k=1}^∞ n_k² a_E^{−k} < ∞ and Σ_{k=1}^∞ ‖w̃_k − w̃_{k+1}‖ < ∞,

and both ‖w̃_k‖ ≲ a_U^{k/2} and ℰ(u_{n_K − 1}) ≲ a_E^{−K}. Whenever these conditions on n_k are satisfied, then

min_{n ≤ N} ℰ(u_n) ≲ N^{−2(1−κ)}  where κ = log a_U / (log a_E + log a_U)

uniformly for N ∈ N.

Proof. To apply Lemma A.5, we need

n_k ≲ { (a_E a_U)^{k/2} if a_U > 1;  k^{−1} a_E^{k/2} if a_U = 1 },  ‖w̃_k‖ ≲ a_U^{k/2},  and ℰ(w̃_k) ≲ a_E^{−k} (A.26)

and Σ_{k=1}^∞ ‖w̃_k − w̃_{k+1}‖ < ∞ when a_U = 1. The only one which is not directly assumed is easily verified:

ℰ(w̃_k) = min_{u∈Ũ_k} ℰ(u) ≤ ℰ(u_{n_k − 1}) ≲ a_E^{−k}. (A.27)

Therefore, the result of Lemma A.5 gives

ℰ(u_N) ≲ a_U^K / t_N² ≲ a_U^K / N². (A.28)

In the case a_U = 1, this is already the optimal rate and therefore sharp. If this were sharp for general a_U and E, then we would gain nothing by refining early (increasing K for fixed N); however, we can at least guarantee that refining early does not lose the optimal rate. If we fix N ≲ (a_E a_U)^{k/2}, then

min_{n ≤ N} ℰ(u_n) ≲ { a_E^{−k} if N > n_k;  a_U^k / N² if N ≤ n_k } ≲ N^{−2} a_U^k ≲ N^{−2(1−κ)} (A.29)

as required.

Lemma A.8 (Lemma 4.10). Let {Ũ_k s.t. k ∈ N} be a sequence of subspaces with some points u_k ∈ Ũ_k and w̃_k ∈ argmin_{u∈Ũ_k} E(u). Suppose that ‖w̃_k‖ ≲ a_U^{k/2}. Any of the following conditions are sufficient to show that {Ũ_k} is an (a_U, a_E)-discretisation for E:

1. Small continuous gap refinement: ℰ(u_k) ≤ β a_E^{−k} for all k ∈ N, some β > 0.
2. Small discrete gap refinement: ℰ(w̃_k) ≤ β a_E^{−k} and ℰ(u_k) − ℰ(w̃_{k−1}) ≤ β a_E^{−k} for all k ∈ N, some β > 0.
3. Small relative gap refinement: ℰ(u_k) − ℰ(w̃_{k−1}) ≤ β ℰ(u_k) for all k ∈ N, some 0 < β ≤ 1 − a_E^{−1}.
4. Small continuous gradient refinement: |||∂E(u_k)|||_* ≤ β a_E^{−k} for all k ∈ N, some β > 0, and sublevel sets of E are |||·|||-bounded.
5. Small discrete gradient refinement: ℰ(w̃_k) ≤ β a_E^{−k} and |||Π_k ∂E(u_k)|||_* ≤ β a_E^{−k} for all k ∈ N, some β > 0, and sublevel sets of E are |||·|||-bounded. The operator Π_k : H → Ũ_k is the orthogonal projection.

Proof. The proof is simply to justify that the conditions of Theorem A.7 are met for all refinement criteria described here. ‖w̃_k‖ ≲ a_U^{k/2} is enforced at every step and condition (2) guarantees the back-stop condition on n_k. To complete the requirements of Theorem A.7, we first need to show inductively that ℰ(w̃_{k−1}) ≤ C a_E^{−(k−1)}; then it follows that ℰ(u_{n_k − 1}) ≤ C a_E^{−k}.
To be explicit, when E has bounded sublevel sets, assume that the bound for {u ∈ H s.t. E(u) ≤ E(u_0)} is R > 0. For criteria (2) and (5) the bound on ℰ(w̃_k) is already assumed, but otherwise we need to perform the formal induction. Assume ℰ(w̃_{k−1}) ≤ C a_E^{−(k−1)}, then for each remaining adaptive criterion:

(1) ℰ(w̃_k) ≤ ℰ(u_k) ≤ β a_E^{−k} by definition; the induction holds if C ≥ β.
(3) ℰ(w̃_k) ≤ ℰ(u_k) ≤ (1/(1−β)) ℰ(w̃_{k−1}); the induction holds since β ≤ 1 − a_E^{−1}.
(4) ℰ(w̃_k) ≤ ℰ(u_k) ≤ ⟨∂E(u_k), u_k − u*⟩ ≤ R β a_E^{−k}; the induction holds for C ≥ Rβ.

In each case, with C sufficiently large, the induction holds. Now we can return to the precise condition of Theorem A.7, ℰ(u_{n_k − 1}) ≤ C′ a_E^{−k}. Assume it is true for k − 1.
For each adaptive criterion:

(1) ℰ(u_k) ≤ β a_E^{−k} is by definition.
(2) ℰ(u_k) ≤ β a_E^{−k} + ℰ(w̃_{k−1}) requires C′ ≥ β + a_E C.
(3) ℰ(u_k) ≤ (1/(1−β)) ℰ(w̃_{k−1}) requires C′ ≥ a_E C/(1−β).
(4) ℰ(u_k) ≤ R β a_E^{−k}; the induction holds for large C′.
(5) ℰ(u_k) ≤ R β a_E^{−k} + ℰ(w̃_{k−1}); the induction holds for C′ ≥ Rβ + a_E C.

This completes the requirements of Theorem A.7, therefore also this proof.
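The rate of Theorem A.7 can be sanity-checked numerically: feeding the guaranteed bounds n_K ≈ (a_E a_U)^{K/2} and ℰ(u_N) ≈ a_U^K/N² through a log-log fit recovers the slope −2(1 − κ). A sketch:

    import numpy as np

    aU, aE = 2.0, 2.0
    kappa = np.log(aU) / (np.log(aE) + np.log(aU))
    K = np.arange(2, 40)
    N = (aE * aU) ** (K / 2)            # iteration count at refinement K
    E = aU ** K / N ** 2                # energy bound of Lemma A.5
    slope = np.polyfit(np.log(N), np.log(E), 1)[0]
    print(slope, -2 * (1 - kappa))      # both equal -1 for aU = aE = 2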
B Proof of Theorem 5.2

The proof of Theorem 5.2 is the result of the following three lemmas. The first, Lemma B.1, is a general quantification of the equivalence between L^q and L² norms on finite dimensional subspaces. A special case occurs when q = 1 because the dual norm is a supremum rather than an integral. In Lemma B.2, this locality is exploited by finite element spaces, which we assumed had a basis with local support. Lemma B.3 then performs the computations for the a_E constant depending on the smoothness properties of E.

Lemma B.1.
Suppose H = L²(Ω) for some compact domain Ω ⊂ R^d and ‖·‖_q ≲ |||·||| for some q ∈ [1, ∞]. Let Ũ ⊂ U ∩ H be a finite dimensional subspace with orthonormal basis {e_j s.t. j = 1, ..., dim(Ũ)} ⊂ Ũ and orthogonal projection Π : H → Ũ. If these conditions hold, then:

• if q ≥ 2, then ‖Πw‖ ≲ |||w|||,
• otherwise, if e_j ∈ U*, then ‖Πw‖ ≤ √(dim(Ũ)) max_j |||e_j|||_* |||w|||, (B.1)
• otherwise, if e_j ∈ L^∞(Ω) and |{j : e_j(x) ≠ 0}| ≤ C for almost all x ∈ Ω, then ‖Πw‖ ≲ √C max_j ‖e_j‖_{L^∞} |||w||| (B.2)

uniformly for all w ∈ H.

Proof. The first statement for q ≥ 2 is immediate:

‖Πw‖ ≤ ‖w‖ = ‖w‖₂ ≲ ‖w‖_q ≲ |||w|||. (B.3)

The remaining statements come from the equivalence of norms on finite dimensional spaces. Note that

‖Πw‖ = ⟨Πw, Πw⟩/‖Πw‖ = ⟨Πw, w⟩/‖Πw‖ ≤ ( |||Πw|||_* / ‖Πw‖ ) |||w|||, (B.4)

therefore it is sufficient to bound |||·|||_*/‖·‖ on the subspace Ũ. Switching to the given basis, for u = Σ_j r_j e_j we have

‖u‖² = Σ_{j=1}^{dim(Ũ)} r_j² = ‖r‖²_{ℓ²}, (B.5)
|||u|||_* ≤ Σ_{j=1}^{dim(Ũ)} |r_j| |||e_j|||_* ≤ max_j |||e_j|||_* ‖r‖_{ℓ¹}, (B.6)
⟹ |||u|||_*/‖u‖ ≤ max_j |||e_j|||_* ‖r‖_{ℓ¹}/‖r‖_{ℓ²} ≤ max_j |||e_j|||_* √(dim(Ũ)). (B.7)

Alternatively, we can use the inequality

‖Πw‖ = ⟨Πw, w⟩/‖Πw‖ ≤ ( ‖Πw‖_∞ / ‖Πw‖ ) ‖w‖₁ ≲ ( ‖Πw‖_∞ / ‖Πw‖ ) ‖w‖_q ≲ ( ‖Πw‖_∞ / ‖Πw‖ ) |||w|||. (B.8)

This simplifies the equivalence constant because, for any u = Σ_j r_j e_j and µ > 1,
there exists a set of points x ∈ Ω with non-zero measure such that ‖u‖_∞ ≤ µ|u(x)|. This gives

‖u‖² = Σ_j r_j² ≥ Σ_{e_j(x) ≠ 0} r_j², (B.9)
‖u‖_∞ ≤ µ|u(x)| ≤ µ Σ_{e_j(x) ≠ 0} |r_j| ‖e_j‖_∞, (B.10)
⟹ ‖u‖_∞/‖u‖ ≤ µ max_j ‖e_j‖_∞ √C. (B.11)

This inequality holds for all µ > 1, hence also for µ = 1. Essentially, with the extra smoothness assumption on e_j, we can reduce the dimension of the problem to dim(Ũ) = C and use the previous result.
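The √(dim(Ũ)) factor in (B.7) is nothing more than the ℓ¹/ℓ² equivalence applied to the coefficient vector, which the following two-line check illustrates:

    import numpy as np

    r = np.random.default_rng(0).standard_normal(50)   # coefficients r_j of u
    assert np.abs(r).sum() <= np.sqrt(r.size) * np.linalg.norm(r)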
Lemma B.2. Suppose H = L²(Ω) for some compact domain Ω ⊂ R^d and ‖·‖_q ≲ |||·||| for some q ∈ [1, ∞]. Let (Ũ_k)_k be a sequence of h-refining finite element spaces with orthogonal projections Π̃_k : H → Ũ_k. If q ≥ 2, then ‖Π̃_k w‖ ≲ |||w|||; otherwise,

‖Π̃_k w‖ ≲ √( dim(Ũ_0) h^{−kd} ) |||w|||

uniformly for all w ∈ H.

Proof. Most of the conditions of Lemma B.1 are already satisfied. Denote C = dim(Ũ_0) and {e_j s.t. j ∈ [C]} the standard orthonormal basis of Ũ_0. The scaling properties of h-refining finite element spaces guarantee that the value of C satisfies the conditions of Lemma B.1 and a basis of Ũ_k is given by

{ x ↦ e_j(α_{i,k} x + β_{i,k}) s.t. i = 1, ..., |M_k|, j = 1, ..., C } (B.12)

for some α_{i,k} ∈ R^{d×d} and β_{i,k} ∈ R^d such that 0 < det(α_{i,k}) ≲ h^{−kd}. We now compute the scaling constant in Lemma B.1:

‖e_j(α_{i,k} · + β_{i,k})‖_∞ / ‖e_j(α_{i,k} · + β_{i,k})‖ = ‖e_j‖_∞ / √( ∫_{ω_i^k} e_j(α_{i,k} x + β_{i,k})² dx ) = ‖e_j‖_∞ √(det(α_{i,k})) ≲ h^{−kd/2}. (B.13)

This value is independent of j and gives the desired bound as a result of Lemma B.1.

Lemma B.3.
Let (Ũ_k)_k be a sequence of h-refining finite element spaces of order p with w̃_k = argmin_{u∈Ũ_k} E(u).

1. If E is |||·|||-Lipschitz at u*, then E(w̃_k) − E(u*) ≲ h^{kp} |||u*|||.
2. If E is |||·|||-smooth at u*, then E(w̃_k) − E(u*) ≲ h^{2kp} |||u*|||².
3. If f is |||·|||-Lipschitz at u* and min_{w∈Ũ_k} { |||w − u*||| s.t. g(w) ≤ g(u*) } ≲ min_{w∈Ũ_k} |||w − u*||| uniformly for k ∈ N, then E(w̃_k) − E(u*) ≲ h^{kp} |||u*|||.

Proof.
Each statement is by definition; observe

E(w) − E(u*) ≤ Lip(E) |||w − u*|||, (B.14)
E(w) − E(u*) ≤ ⟨∂E(w), w − u*⟩ = ⟨∇E(w) − ∇E(u*), w − u*⟩ ≤ Lip(∇E) |||w − u*|||², (B.15)
E(w) − E(u*) ≤ f(w) − f(u*) ≤ Lip(f) |||w − u*||| whenever g(w) ≤ g(u*). (B.16)

Minimising the right-hand side over w ∈ Ũ_k and substituting the definition of the order of the finite element space gives the desired result.
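As an illustration of the h^{kp} scaling used above, the best piecewise-constant (p = 1) approximation of a smooth stand-in for u* has uniform error proportional to the mesh width; a sketch:

    import numpy as np

    u = lambda x: np.sin(2 * np.pi * x)         # smooth stand-in for u*
    for k in range(1, 8):
        h, cells = 2.0**-k, 2**k
        edges = np.linspace(0.0, 1.0, cells + 1)
        mids = u((edges[:-1] + edges[1:]) / 2)  # piecewise-constant values
        xs = np.linspace(0.0, 1.0, 4096)
        idx = np.minimum((xs / h).astype(int), cells - 1)
        print(k, np.abs(u(xs) - mids[idx]).max() / h)   # ratio ~ constant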
C Operator norms for numerical examples

Theorem C.1 (Theorem 6.3). Suppose A : H → R^m has kernels ψ_j ∈ H = L²([0,1]^d) for j ∈ [m].

Case 1: If ψ_j(x) = 𝟙_{X_j}(x) for some collection X_j ⊂ Ω such that X_i ∩ X_j = ∅ for all i ≠ j, then ‖A‖_{L²→ℓ²} = max_j √(|X_j|).

Case 2: If ψ_j(x) = cos(a_j · x) for some frequencies a_j ∈ R^d with |a_j| ≤ A, then

‖A‖_{L²→ℓ²} ≤ √m,  |A*r|_{C^k} ≤ m^{1−1/q} A^k ‖r‖_q,  |A*|_{ℓ²→C^k} ≤ √m A^k

for all r ∈ R^m and q ∈ [1, ∞].

Case 3: Suppose ψ_j(x) = (2πσ²)^{−d/2} exp( −|x − x_j|²/(2σ²) ) for some regular mesh x_j ∈ [0,1]^d with separation ∆, i.e. {x_j s.t. j ∈ [m]} = {x_0 + (j_1∆, ..., j_d∆) s.t. j_i ∈ [m̂]} for some x_0 ∈ R^d, m̂ := m^{1/d}. For all 1/q + 1/q* = 1, q ∈ (1, ∞], we get

‖A‖_{L²→ℓ²} ≤ ( (4πσ²)^{−1/2} Σ_{j=−m̂}^{m̂} exp(−∆²j²/(4σ²)) )^{d/2}, (C.1)
|A*r|_{C⁰} ≤ (2πσ²)^{−d/2} ( Σ_{j∈J} exp( −(q*∆²/(2σ²)) max(0, |j| − δ)² ) )^{1/q*} ‖r‖_q, (C.2)
|A*r|_{C¹} ≤ (2πσ²)^{−d/2} σ^{−1} (∆/σ) ( Σ_{j∈J} (|j| + δ)^{q*} exp( −(q*∆²/(2σ²)) max(0, |j| − δ)² ) )^{1/q*} ‖r‖_q, (C.3)
|A*r|_{C²} ≤ (2πσ²)^{−d/2} σ^{−2} ( Σ_{j∈J} ( 1 + (∆/σ)²(|j| + δ)² )^{q*} exp( −(q*∆²/(2σ²)) max(0, |j| − δ)² ) )^{1/q*} ‖r‖_q, (C.4)

where δ = √d/2 and J = {j ∈ Z^d s.t. ‖j‖_{ℓ^∞} ≤ 2m̂}. The case q = 1 can be inferred from the standard limit ‖·‖_{q*} → ‖·‖_∞ as q* → ∞.

Proof case 1: From Lemma 6.1 we have

(AA*)_{i,j} = ⟨𝟙_{X_i}, 𝟙_{X_j}⟩ = |X_i ∩ X_j| = { |X_i| if i = j; 0 if i ≠ j }. (C.5)

Therefore, AA* is a diagonal matrix and ‖AA*‖_{ℓ²→ℓ²} = max_j |X_j| completes the result.

Proof case 2: The ψ_j are not necessarily orthogonal, however |⟨ψ_i, ψ_j⟩| ≤ 1, so

‖AA*‖_{ℓ²→ℓ²} ≤ ‖AA*‖_{ℓ^∞→ℓ^∞} ≤ m. (C.6)

Now looking to apply Lemma 6.2, note ‖∇^k ψ_j‖_∞ ≤ A^k, therefore

|A*r|_{C^k} ≤ A^k m^{1/q*} ‖r‖_q = A^k m^{1−1/q} ‖r‖_q, (C.7)
|A*|_{ℓ²→C^k} ≤ A^k min_{q∈[1,∞]} m^{1−1/q} m^{max(0, 1/q − 1/2)} = √m A^k. (C.8)
Proof case 3: In the Gaussian case, we build our approximations around the idea that sums of Gaussians should converge very quickly. The first example can be used to approximate the operator norm. Computing the inner products gives

⟨ψ_i, ψ_j⟩ = (2πσ²)^{−d} ∫_{[0,1]^d} exp( −|x − x_i|²/(2σ²) − |x − x_j|²/(2σ²) ) dx ≤ (2πσ²)^{−d} (πσ²)^{d/2} exp( −|x_i − x_j|²/(4σ²) ). (C.9)

Estimating the operator norm,

‖AA*‖_{ℓ²→ℓ²} ≤ ‖AA*‖_{ℓ^∞→ℓ^∞} = max_{i∈[m]} Σ_{j=1}^m |⟨ψ_i, ψ_j⟩| (C.10)
 = max_{i∈[m]} (4πσ²)^{−d/2} Σ_{j_1,...,j_d ∈ [m̂]} exp( −( (j_1∆ − i_1∆)² + ... + (j_d∆ − i_d∆)² )/(4σ²) ) (C.11)
 ≤ (4πσ²)^{−d/2} Σ_{j ∈ Z^d ∩ [−m̂,m̂]^d} exp( −( (j_1∆)² + ... + (j_d∆)² )/(4σ²) ) (C.12)
 = ( (4πσ²)^{−1/2} Σ_{j=−m̂}^{m̂} exp( −∆²j²/(4σ²) ) )^d. (C.13)

This is a nice approximation because it factorises simply over dimensions. Applying the results from Lemma 6.2, note

|ψ_j(x)| = (2πσ²)^{−d/2} exp( −|x − x_j|²/(2σ²) ),
|∇ψ_j(x)| = |(x − x_j)/σ²| ψ_j(x) = (2πσ²)^{−d/2} σ^{−1} ( |x − x_j|/σ ) exp( −|x − x_j|²/(2σ²) ),
|∇²ψ_j(x)| = |( −σ² Id + (x − x_j)(x − x_j)^⊤ )/σ⁴| ψ_j(x) ≤ (2πσ²)^{−d/2} σ^{−2} ( 1 + |x − x_j|²/σ² ) exp( −|x − x_j|²/(2σ²) ).

We now wish to sum these over j = 1, ..., m and produce an upper bound, independent of x. To do so we will use the following lemma.
Lemma C.2. Suppose q > 0. If the polynomial p(|x|) = Σ_k p_k |x|^k has non-negative coefficients and x ∈ [−m, m]^d, then

Σ_{‖j‖_{ℓ^∞} ≤ m} p(|j − x|) exp( −q|j − x|² ) ≤ Σ_{‖j‖_{ℓ^∞} ≤ 2m} p(|j| + δ) exp( −q max(0, |j| − δ)² )

where δ := √d/2 and j ∈ Z^d.

Proof. There exists x̂ ∈ [−½, ½]^d such that x + x̂ ∈ Z^d, therefore

Σ_{‖j‖_{ℓ^∞} ≤ m} p(|j − x|) exp( −q|j − x|² ) = Σ_{‖j‖_{ℓ^∞} ≤ m} p(|j − (x + x̂) + x̂|) exp( −q|j − (x + x̂) + x̂|² )
 ≤ Σ_{‖j‖_{ℓ^∞} ≤ 2m} p(|j + x̂|) exp( −q|j + x̂|² ) ≤ Σ_{j∈Z^d, ‖j‖_{ℓ^∞} ≤ 2m} p(|j| + δ) exp( −q max(0, |j| − δ)² )

as |x̂| ≤ δ and p has non-negative coefficients.

Now, continuing the proof of Theorem C.1, for m̂ = m^{1/d}, δ = √d/2 and J = {j ∈ Z^d s.t. ‖j‖_{ℓ^∞} ≤ 2m̂}, Lemma C.2 bounds

Σ_{j=1}^m |ψ_j(x)|^{q*} ≤ (2πσ²)^{−dq*/2} Σ_{j∈J} exp( −(q*∆²/(2σ²)) max(0, |j| − δ)² ),
Σ_{j=1}^m |∇ψ_j(x)|^{q*} ≤ (2πσ²)^{−dq*/2} σ^{−q*} (∆/σ)^{q*} Σ_{j∈J} (|j| + δ)^{q*} exp( −(q*∆²/(2σ²)) max(0, |j| − δ)² ),
Σ_{j=1}^m |∇²ψ_j(x)|^{q*} ≤ (2πσ²)^{−dq*/2} σ^{−2q*} Σ_{j∈J} ( 1 + (∆/σ)²(|j| + δ)² )^{q*} exp( −(q*∆²/(2σ²)) max(0, |j| − δ)² )

for all x ∈ Ω. In the worst case, these sums contain on the order of 2^d m terms.
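As a final sanity check, the bound (C.1) can be compared against the true operator norm in one dimension by discretising the kernels on a fine grid; all parameter values below are illustrative, not taken from the paper's experiments.

    import numpy as np

    sigma, delta, m = 0.15, 0.1, 64
    xs, dx = np.linspace(0.0, m * delta, 4096, retstep=True)
    centres = delta * (np.arange(m) + 0.5)
    Psi = np.exp(-(xs - centres[:, None])**2 / (2 * sigma**2)) \
          / np.sqrt(2 * np.pi * sigma**2)               # kernels psi_j, d = 1
    true_norm = np.sqrt(np.linalg.eigvalsh(Psi @ Psi.T * dx).max())
    j = np.arange(-m, m + 1)
    bound = np.sqrt((4 * np.pi * sigma**2)**-0.5
                    * np.exp(-delta**2 * j**2 / (4 * sigma**2)).sum())
    print(true_norm, bound)                             # bound >= true_norm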