[PDF] CPGD: Cadzow Plug-and-Play Gradient Descent for Generalised FRI

Abstract

Finite rate of innovation (FRI) is a powerful reconstruction framework enabling the recovery of sparse Dirac streams from uniform low-pass filtered samples. An extension of this framework, called generalised FRI (genFRI), has been recently proposed for handling cases with arbitrary linear measurement models. In this context, signal reconstruction amounts to solving a joint constrained optimisation problem, yielding estimates of both the Fourier series coefficients of the Dirac stream and its so-called annihilating filter, involved in the regularisation term. This optimisation problem is however highly non convex and non linear in the data. Moreover, the proposed numerical solver is computationally intensive and without convergence guarantee. In this work, we propose an implicit formulation of the genFRI problem. To this end, we leverage a novel regularisation term which does not depend explicitly on the unknown annihilating filter yet enforces sufficient structure in the solution for stable recovery. The resulting optimisation problem is still non convex, but simpler since linear in the data and with less unknowns. We solve it by means of a provably convergent proximal gradient descent (PGD) method. Since the proximal step does not admit a simple closed-form expression, we propose an inexact PGD method, coined as Cadzow plug-and-play gradient descent (CPGD). The latter approximates the proximal steps by means of Cadzow denoising, a well-known denoising algorithm in FRI. We provide local fixed-point convergence guarantees for CPGD. Through extensive numerical simulations, we demonstrate the superiority of CPGD against the state-of-the-art in the case of non uniform time samples.

Full PDF

PPREPRINT, UNDER REVIEW 1

CPGD: Cadzow Plug-and-Play Gradient Descentfor Generalised FRI

Matthieu Simeoni,

Member, IEEE,

Adrien Besson,

Associate Member, IEEE,

Paul Hurley,

Senior Member, IEEE, and Martin Vetterli,

Fellow, IEEE

Abstract —Finite rate of innovation (FRI) is a powerful re-construction framework enabling the recovery of sparse Diracstreams from uniform low-pass ﬁltered samples. An extensionof this framework, called generalised FRI (genFRI), has beenrecently proposed for handling cases with arbitrary linear mea-surement models. In this context, signal reconstruction amountsto solving a joint constrained optimisation problem, yielding esti-mates of both the Fourier series coefﬁcients of the Dirac streamand its so-called annihilating ﬁlter, involved in the regularisationterm. This optimisation problem is however highly non convexand non linear in the data. Moreover, the proposed numericalsolver is computationally intensive and without convergenceguarantee.In this work, we propose an implicit formulation of thegenFRI problem. To this end, we leverage a novel regularisa-tion term which does not depend explicitly on the unknownannihilating ﬁlter yet enforces sufﬁcient structure in the solutionfor stable recovery. The resulting optimisation problem is stillnon convex, but simpler since linear in the data and with lessunknowns. We solve it by means of a provably convergentproximal gradient descent (PGD) method. Since the proximal stepdoes not admit a simple closed-form expression, we propose aninexact PGD method, coined as Cadzow plug-and-play gradientdescent (CPGD). The latter approximates the proximal steps bymeans of Cadzow denoising, a well-known denoising algorithmin FRI. We provide local ﬁxed-point convergence guaranteesfor CPGD. Through extensive numerical simulations, we demon-strate the superiority of CPGD against the state-of-the-art in thecase of non uniform time samples.

Index Terms —ﬁnite rate of innovation, non bandlimited sam-pling, Dirac streams, non convex optimisation, Cadzow denoising,proximal gradient descent, alternating projections.

I. I

NTRODUCTION S AMPLING theorems lie at the foundation of modern digital signal processing as they permit the convenientnavigation between the analogue and digital worlds [1], [2].The most famous is undoubtedly the

Shannon sampling theorem [3], which states that bandlimited signals can be recoveredexactly from their discrete samples for a sufﬁcient samplingrate. This major result has had tremendous impact on the ﬁeldof signal processing and by extension on many ﬁelds of naturalsciences. But this unanimous celebration lead many scientists

M. Simeoni and A. Besson have contributed equally to this work.A. Besson is with E-Scopics, Saint-Cannat, France.M. Simeoni and M. Vetterli are with the Laboratoire de CommunicationsAudiovisuelles (LCAV), École Polytechnique Fédérale de Lausanne (EPFL),CH-1015, Lausanne, Switzerland.For part of this work, M. Simeoni was also with the Foundations of CognitiveSolutions group in the IBM Research Laboratory of Zurich.P. Hurley is with the Centre for Research in Data Science & Mathematicsat Western Sydney University (WSU), Australia. to start thinking about sampling theory exclusively in termsof bandlimitedness, which is only a sufﬁcient condition fora signal to admit a discrete representation. In fact, samplingtheorems can also be devised for non-bandlimited signals aslong as they possess ﬁnitely many degrees of freedom.This fact was brought to the attention of the signal processingcommunity in [4], where the authors introduced the ﬁniterate of innovation (FRI) framework. FRI is concerned withthe sampling of sparse non-bandlimited signals such as theprototypical sparse signal, namely the T -periodic stream ofDiracs : x ( t ) = (cid:213) k (cid:48) ∈ Z K (cid:213) k = x k δ ( t − t k − T k (cid:48) ) , ∀ t ∈ R , (1)with x k ∈ C and t k ∈ [ , T [ . In the FRI framework, the sparsityis measured in terms of its rate of innovation , deﬁned as thenumber of degrees of freedom per unit of time. For instance,the Dirac stream (1) has K degrees of freedom { x k , t k } k = ,..., K per period T , yielding a ﬁnite rate of innovation of ρ = K / T .Intuitively, any lossless sampling scheme for (1) must thereforehave a sampling rate at least as large as the rate of innovation ρ or it will be impossible to ﬁx all degrees of freedom.Blu et al. described a sampling scheme achieving thesecond best sampling rate after the critical innovation rate,permitting to perfectly recover the signal innovations fromthe knowledge of any K + consecutive Fourier coefﬁcientsof x [5]. Unfortunately, this scheme can be very sensitive tonoise perturbations in the collected samples. This is becausethe recovery of the innovations t k relies on the resolution ofa so-called annihilating equation which requires the Toeplitzmatrix built from the Fourier coefﬁcients to be rank deﬁcient.While this structural constraint is guaranteed to hold in thecase of noiseless recovery of Dirac streams, it can break downin the presence of noise, inevitable in practical applications.As a remedy, Blu et al. proposed to denoise the collectedsamples prior to solving the annihilating equation. To this end,they leveraged the well-known Cadzow algorithm [6], whichaims to retrieve the closest rank-deﬁcient Toeplitz matrix toa high-dimensional embedding of the data via an alternatingprojection method. When upgraded with this extra denoisingstep, simulations results from Blu et al. revealed that the overallaccuracy of the recovery procedure remains very good forsignal-to-noise ratios (SNR) as low as 5 dB [5]. While Cadzowalgorithm empirically provides accurate results after a few itera-tions, convergence in theory has however not been demonstratedto date, due to the non convex nature of the space of rank-deﬁcient matrices. Condat and Hirabayashi [7] revisited Cadzow a r X i v : . [ m a t h . O C ] J un PREPRINT, UNDER REVIEW denoising as a structured low-rank approximation (SLRA) problem and proposed a

Douglas-Rachford splitting algorithmto solve it [8], with higher accuracy than traditional Cadzowdenoising. Unfortunately, the gain in accuracy comes at theprice of signiﬁcantly higher computational cost, the Douglas-Rachford splitting method requiring many more iterations toconverge than Cadzow algorithm.In addition to their somewhat heuristic nature, neitherCadzow denoising nor its upgrade can handle more generaltypes of input measurements as considered in the generalisedFRI (genFRI) framework introduced by Pan et al. in [9].The latter extends FRI to very generic cases where themeasurements are related to the unknown Fourier coefﬁcientsof signals satisfying the annihilating property by a linear map.In such conﬁgurations, both the Fourier coefﬁcients and theircorresponding annihilating ﬁlter are unknown and must beestimated from the data. Pan et al. proposed to perform thisjoint estimation task by solving a constrained optimisationproblem which recovers the Fourier coefﬁcients, required tominimise a quadratic data-ﬁdelity term, and their correspondingannihilating ﬁlter coefﬁcients. The annihilating equation linkingthe two unknowns is explicitly enforced as a constraint. Thisoptimisation problem is highly non convex and non linear inthe data. They suggested to solve it via an iterative alternatingminimisation algorithm with multiple random initialisations [9].The proposed algorithm however comes without convergenceguarantees and is computationally intensive.In this paper, we propose an implicit formulation of thegenFRI problem in which only the Fourier coefﬁcients tobe annihilated are recovered. This formulation does notrely explicitly on the unknown annihilating ﬁlter but ratherleverages a structured low-rank regularisation constraint basedon a “Toeplitziﬁcation” linear operator, guaranteeing non-trivial solutions to the annihilating equation. The resultingoptimisation problem is still non convex, but simpler to analyseand solve since it is linear in the data and with less unknowns.We solve the implicit genFRI problem via proximal gradientdescent (PGD) [10], [11].We ﬁrst consider PGD with exact proximal steps which isshown to converge towards critical points of the implicit genFRIproblem. The latter is however impractical since the proximalstep involved at each iteration does not have a closed-formexpression. We therefore consider an inexact

PGD [12], withproximal steps approximated by means of alternating projec-tions. In the case of injective forward matrices, the approximateproximal step is shown to reduce to Cadzow denoising. Such anapproach is reminiscent of the plug-and-play (PnP) framework in which proximal operators involved in ﬁrst-order iterativemethods are replaced by generic denoisers [13], [14], [15]. Forthis reason, we name our reconstruction algorithm

CadzowPnP Gradient Descent (CPGD) . We demonstrate that CPGDconverges locally towards ﬁxed points of the update equationfor injective forward matrices. Through simulations of irregularand noisy time sampling of periodic stream of Diracs we showthat CPGD is almost always more accurate and more efﬁcient An efﬁcient Python implementation of CPGD is provided on our GitHubrepository [16]. than the procedure proposed by Pan et al. in [9], sometimesby several orders of magnitude.The remainder of the paper is organised as follows: • Preliminary concepts required for the understanding ofthe further sections are introduced in Section II. • Section III describes the genFRI problem and details theproposed implicit formulation. The CPGD algorithm isintroduced in Section IV. • Experiments and results are detailed in Section V andconcluding remarks are given in Section VI.Finally, note that all experiments and simulations are fullyreproducible using the benchmarking routines provided in ourGitHub repository [16].II. P

RELIMINARIES

In this section we introduce a linear operator, baptised

Toeplitziﬁcation operator , which transforms a vector into aToeplitz matrix. This operator will be used in the regularisationterm of our implicit genFRI optimisation problem. We thenbrieﬂy review the method of alternating projections [17] aswell as the FRI [4] framework and Cadzow denoising [7]. A. Toeplitziﬁcation Operator

Assume that we are given an arbitrary vector x ∈ C N , N = M + , with entries indexed as follows: x : = [ x − M , x − M + , . . . , x M − , x M ] T . Then, for any P ≤ M , we can embed x into the space T P ofToeplitz matrices of C ( N − P )×( P + ) by means of the following Toeplitziﬁcation operator : T P : (cid:40) C N → T P ⊂ C ( N − P )×( P + ) x (cid:55)→ [ T P ( x )] i , j : = x − M + P + i − j , (2)where i = , . . . , N − P , j = , . . . , P + . Note from (2) thatthe value of an entry [ T P ( x )] i , j of the matrix T P ( x ) dependsonly on the distance i − j between the row and column indexes: T P ( x ) is therefore a Toeplitz matrix and the vector x is calledits generator .The Toeplitziﬁcation operator (2) can be used to implementlinear convolutions. Indeed, it can be shown (see Appendix Aavailable as supplementary material) that the multiplication of T P ( x ) with a vector u = [ u , · · · , u P + ] T ∈ C P + returns thevalid part of the convolution between the two zero-paddedsequences ˜ x : = (cid:104) . . . , , x − M , . . . , x , . . . , x M , , . . . (cid:105) ∈ C Z and ˜ u : = (cid:104) . . . , , u , . . . , u P + , , . . . (cid:105) ∈ C Z . Similarly, the multipli-cation of T P ( x ) H with a vector v = [ v , · · · , v N − P ] T ∈ C N − P returns the valid part of the cross-correlation between ˜ x and thezero-padded sequence ˜ v : = (cid:104) . . . , , v , . . . , v N − P , , . . . (cid:105) ∈ C Z . The alternative appellation

Toeplitzication was used in [7]. See Appendix A for a formal deﬁnition of the valid part of a convolutionbetween zero-padded sequences.

IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 3

B. Inverse Toeplitziﬁcation Operator

The inverse Toeplitziﬁcation operator is the pseudoinverseof the Toeplitziﬁcation operator, mapping a Toeplitz matrix H ∈ C ( N − P )×( P + ) to its generator h ∈ C N . As we shallprove in Proposition 2, inverse Toeplitziﬁcation is achieved byaveraging across each diagonal of T P ( x ) . It is interesting tonote that this operation is also leveraged in Cadzow denoisingas described in [5], in order to map back the data from its highdimensional Toeplitz embedding. The formal interpretation ofthis inverse map as the pseudoinverse of the Toeplitziﬁcationoperator proposed hereafter is nevertheless not discussed in [5],nor anywhere else we may be aware of.To compute the pseudoinverse of T P , we ﬁrst need an expressionfor its adjoint map, detailed in the proposition hereafter. Proposition 1 (Adjoint operator of T P ) . The adjoint operatorT ∗ P of T P deﬁned in (2) is given byT ∗ P :  C ( N − P )×( P + ) → C N H (cid:55)→ h j = (cid:213) i = k + j − − P H ik , j = , . . . , N . (3) Proof.

The proof is described in Appendix B available assupplementary material of this manuscript. (cid:3)

Note that the adjoint map T ∗ P proceeds by summing acrosseach diagonal of the input matrix H . We are now ready to derivean expression for the (left) pseudoinverse of T P , described inthe proposition hereafter. Proposition 2 (Pseudoinverse of T P ) . The pseudoinverse T † P : C ( N − P )×( P + ) → C N of T P deﬁned in (2) is given byT † P = Γ − T ∗ P , (4) where Γ ∈ C N × N is a diagonal matrix with diagonal entriesgiven by: Γ i , i = min ( i , P + , N + − i ) , i = , . . . , N . (5) Proof.

The proof of this proposition is given in Appendix Cavailable as supplementary material of this manuscript. (cid:3)

Observe that the composition of T ∗ P and Γ − in the expressionof the pseudoinverse (4) indeed corresponds to a diagonalaveraging since T ∗ P ﬁrst sums across each diagonal of thematrix H ∈ C ( N − P )×( P + ) and Γ − then divides the sums bythe number of elements on each diagonal. C. The Method of Alternating Projections

In this section we brieﬂy discuss the method of alternatingprojections (MAP) [17], central to Cadzow denoising. It isused in computational mathematics to approximate projectionsonto intersecting sets. In its simplest form proposed by vonNeumann in 1933 [18], the MAP performs a cascade of n projection steps onto subsets {M , . . . , M K } of some Hilbertspace H , starting from a point z ∈ H : ˇ z = (cid:2) Π M K · · · Π M (cid:3) n ( z ) . (6) In (6), Π M k denotes the orthogonal projection map onto M k ,deﬁned for k = , . . . , K as Π M k : (cid:40) H → M k , z (cid:55)→ arg min x ∈M k (cid:107) z − x (cid:107) , for some norm (cid:107) · (cid:107) on H . In the case of closed linear subspaces {M , . . . , M K } , von Neumann and Halperin showed that [18],[19], [20] lim n →∞ (cid:13)(cid:13)(cid:13)(cid:2) Π M K · · · Π M (cid:3) n ( z ) − Π (cid:209) Kk = M k ( z ) (cid:13)(cid:13)(cid:13) = , ∀ z ∈ H . (7)The MAP equation (6) can therefore be used to approximatethe complex projection map Π (cid:209) Kk = M k . For closed convex sets {M , . . . , M K } , Bregman [17] showed moreover the weakconvergence of the MAP towards a point in the intersection (cid:209) Kk = M k . Strong convergence towards the actual projectionwas achieved by Dysktra’s MAP [21], one of the most popularvariant to von Neumann’s original algorithm. In the case of nonconvex intersecting sets, the convergence of the MAP has onlybeen established locally [22], [23], [24], [25]. For example,Andersson et al. considered in [24] the case of two (potentiallynon convex) ﬁnite-dimensional manifolds M , M ⊂ H andshowed the following local convergence result: Theorem 1. [24, Theorem 1.6] Let x ∈ M ∩ M be non-tangential , i.e. the angle between M and M at x is positive. Then, for z ∈ H and (cid:15) > , there exists δ ≥ such that, if (cid:107) x − z (cid:107) ≤ δ ,1) (cid:2) Π M Π M (cid:3) n ( z ) n →∞ → z ∞ ∈ M ∩ M ,2) (cid:13)(cid:13) z ∞ − Π M ∩M ( z ) (cid:13)(cid:13) < (cid:15) (cid:13)(cid:13) x − Π M ∩M ( z ) (cid:13)(cid:13) . Roughly speaking, Theorem 1 states that if the startingpoint z is close enough to a non-tangential point of M ∩ M (which as explained in [24] are all but very exceptional pointsof M ∩ M ), then the MAP converges to a point in M ∩ M .Moreover, the error (cid:13)(cid:13) z ∞ − Π M ∩M ( z ) (cid:13)(cid:13) can be made arbitrarilysmall with respect to (cid:13)(cid:13) x − Π M ∩M ( z ) (cid:13)(cid:13) . However, Theorem 1is difﬁcult to apply in practice since the value of δ guaranteeinga relative error below a given threshold (cid:15) is unknown. TheMAP is hence often used as a heuristic in non convex settingswith no convergence guarantees. This is notably the case ofCadzow denoising, discussed further in Section II-E. D. FRI in a Nutshell

The classical FRI framework, introduced in [4], aims atestimating the innovations {( x k , t k ) , k = , . . . , K } ⊂ C × [ , T [ ,of a T -periodic stream of Diracs: x ( t ) = (cid:213) k (cid:48) ∈ Z K (cid:213) k = x k δ ( t − t k − T k (cid:48) ) , ∀ t ∈ R . In standard FRI, the estimation procedure is divided into twostages. The locations t k are ﬁrst estimated by a nonlinearmethod, and then arranged into a Vandermonde system whosesolution yields the Dirac amplitudes [5]. The recovery of thelocations t k relies on the so-called annihilating equation , dating See [24, Deﬁnition 4.2] and [24, Deﬁnition 4.3] for a precise deﬁnition ofthe angle between two manifolds and the concept of non-tangentiality.

PREPRINT, UNDER REVIEW from Prony’s work [26], which cancels out the Fourier seriescoefﬁcients of x by convolving them with a particular ﬁlter,called the annihilating ﬁlter . The latter is deﬁned as the ﬁnite-tap sequence h = [· · · , , h , h , . . . , h K , , · · · ] ∈ C Z , with z -transform vanishing at roots { u k : = e − j π t k / T , k = , . . . , K } : H ( z ) = K (cid:213) k = h k z − k = K (cid:214) k = ( − u k z − ) . (8)For such a ﬁlter, we have indeed ( ˆ x ∗ h ) m = K (cid:213) k = h k ˆ x m − k = K (cid:213) k (cid:48) = x k (cid:48) (cid:32) K (cid:213) k = h k u − kk (cid:48) (cid:33) u mk (cid:48) = , m ∈ Z , (9)where ˆ x m = (cid:205) Kk = x k u mk , m ∈ Z , are the Fourier coefﬁcientsof x in (1). Notice that the roots u k of the z-transform H ( z ) in (8) of h are in one-to-one correspondence withthe locations t k . Recovering them amounts to estimating thecoefﬁcients h = [ h , . . . , h K ] ∈ C K + of h from the annihilatingequation (9). If for instance we have N = M + consecutiveFourier coefﬁcients of x , e.g. x = [ ˆ x − M , . . . , ˆ x M ] ∈ C M + ,we can extract the N − K equations from (9) correspondingto the convolution indices m = − M + K , . . . , M , and use theToeplitziﬁcation operator deﬁned in (2) to form the followingmatrix equation: T K ( x ) h = N − K , (cid:107) h (cid:107) (cid:44) . (10)Observe that any nontrivial element of the nullspace of T K ( x ) is a solution to (10). For M ≥ K , it can be shown [5] that T K ( x ) ∈ C ( N − K )×( K + ) has rank K and therefore has a nontrivialnullspace with dimension 1. Up to a multiplicative constant,the annihilating equation (10) admits hence a unique solution.The latter can be obtained numerically by means of total least-squares [5], which computes the eigenvector associated to thesmallest eigenvalue of T K ( x ) . In the critical case M = K , thematrix T K ( x ) is square , while in the oversampling case M > K it is rectangular and tall . As explained in [5], oversamplingmakes the estimation procedure more resilient to potential noiseperturbations in the Fourier coefﬁcients. In such cases, Blu etal. recommend moreover to perform Cadzow denoising on theFourier coefﬁcients x (see Section II-E) as well as replace (10)by a more general annihilating equation: T P ( x ) ˜ h = N − P , (cid:107) ˜ h (cid:107) (cid:44) . (11)with K ≤ P ≤ M , and ˜ h ∈ C P + . Again, it is possible to showthat T P ( x ) has rank K , and hence a nontrivial nullspace withdimension P + − K . Solutions to (11) are hence not uniquein this case, but all are equally valid for practical purposes.Moreover, the increased nullspace dimension makes Cadzowdenoising more efﬁcient at ﬁltering the noise component (seeSection II-E hereafter). In practice, the case P = M has beenreported to yield the best empirical performance [5]. Remember the link between the Toeplitziﬁcation operator and convolutiondiscussed in Section II-A. An eigenvalue exactly equal to zero may in practice be impossible toobtain due to numerical inaccuracies.

E. Cadzow Denoising

For strong noise perturbations, the generalised annihilatingequation (11) may fail to admit a nontrivial solution. Indeed,noisy generators x can yield full column rank matrices T P ( x ) with trivial nullspace. As a potential cure, Blu et al. proposeto denoise the Fourier coefﬁcients x prior to solving theannihilating equation. This denoising step attempts to transform T P ( x ) into a Toeplitz matrix with rank at most K , thusguaranteeing the existence of nontrivial solutions to (11). Thisoperation is carried out by means of Cadzow denoising [7],an alternating projection method (see Section II-C) appliedheuristically to the subspace T P of Toeplitz matrices and thesubset H K of matrices with rank at most K : H K : = (cid:110) M ∈ C ( N − P )×( P + ) | rank M ≤ K (cid:111) . (12)Using the notation introduced in Sections II-A, II-B andII-C, Cadzow denoising can be seen as processing the noisycoefﬁcients x as follows: ˇ x = T † P (cid:2) Π T P Π H K (cid:3) n T P ( x ) , (13)for some suitable n ∈ N . The inverse Toeplitziﬁcation operator T † P applied after the n iterations of the alternating projectionmethod is used to recover the denoised Fourier coefﬁcients ˇ x ∈ C N .Equation (13) allows us to develop an intuitive understandingas to why Cadzow denoising tends to perform better in practicewith values of P close to M in the generalised annihilatingequation (11). Indeed, it is easy to see that the maximal rank ofrectangular matrices in C ( N − P )×( P + ) ranges in (cid:110) K + , M + (cid:111) when P ranges in (cid:110) K , M (cid:111) . Consequently, the subset H K ofmatrices of rank at most K becomes “smaller and smaller”relatively to the ambient space as P increases towards M .This implies that the associated projection map Π H K is moreselective for large values of M , hence making Cadzow’salgorithm better at ﬁltering the noise component.Note that since H K is a non convex set the convergence ofthe MAP in (13) is not guaranteed. Nevertheless, experimentalresults [5], [7] suggest that Cadzow denoising almost alwaysconverges after a few iterations (typically n ≤ ), which couldtheoretically be explained by the local convergence result fromTheorem 1. We conclude this section by providing closed-formexpressions for the projection operators Π T P and Π H K , neededin (13).

1) Projection onto T P : As shown in Appendix D of thesupplementary material, the orthogonal projection operatoronto the subspace T P ⊂ C ( N − P )×( P + ) of rectangular Toeplitzmatrices can be written in terms of the Toeplitziﬁcation operatorand its pseudoinverse (see Section 2) as: Π T P = T P T † P = T P Γ − T ∗ P . For n , m ∈ Z , n < m , we denote by (cid:110) n , m (cid:111) the integer interval [ n , m ] ∩ Z . This can be seen from the closed-form expression of Π H K provided in (14). As explained in Section II-C, the assumptions of Theorem 1 are unfortu-nately very difﬁcult to verify in practice.

IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 5

2) Projection onto H K : The orthogonal projection operatoronto the space H K of matrices with rank at most K is givenby the Eckart-Young-Minsky theorem [27]. The latter statesthat the projection map Π H K ( X ) = arg min H ∈H K (cid:107) X − H (cid:107) F , X ∈ C ( N − P )×( P + ) , can be computed in closed-form as: Π H K ( X ) = U Λ K V H , X ∈ C ( N − P )×( P + ) , (14)where X = U Λ V H is the singular value decomposition of X , and Λ K is the diagonal matrix of sorted singular valuestruncated to the K strongest ones. Note that the output of theprojection map is unique as long as the K − th and ( K + )− thlargest singular values are different. Fortunately, the space ofmatrices failing to verify this condition is very small –moreprecisely it is thin , as discussed extensively in [24, Section 2].In practice moreover, ﬂoating-point arithmetic makes it veryunlikely that the K − th and ( K + )− th largest singular valuesbe exactly identical. Thus, the projection map Π H K can beconsidered single-valued for practical purposes.III. G ENERALISED

FRI

AS AN I NVERSE P ROBLEM

A. Generalised FRI

In Section II-D, we have described a procedure for recoveringthe locations t k from consecutive Fourier coefﬁcients of x .The issue of computing these Fourier coefﬁcients from acollection of arbitrary linear measurements y ∈ C L of x , L ≥ N now remains. Blu et al. [5] treated the simple scenario ofmeasurements resulting from regular time sampling with ideallow-pass preﬁltering. In such a case, they showed that, for awell chosen preﬁlter bandwidth, the Fourier coefﬁcients couldsimply be obtained by applying a discrete Fourier transformto the measurements y . For more general measurement types,the situation is more complex, and the Fourier coefﬁcients x ∈ C N must in general be estimated by solving a linearinverse problem: y = Gx + n , (15)where the forward matrix G ∈ C L × N , L ≥ N , is applicationdependent, and n is additive noise, usually assumed to be awhite Gaussian random vector. In [9], Pan et al. proposedthe generalised FRI (genFRI) optimisation problem to dealwith (15). The latter is a non convex constrained optimisationproblem whose objective is to jointly recover the Fouriercoefﬁcients x ∈ C N –required to minimise a quadraticdata-ﬁdelity term– and their corresponding annihilating ﬁltercoefﬁcients h ∈ C P + . The annihilating equation linking thetwo unknowns is explicitly enforced as a constraint, yieldingan optimisation problem of the form: min x ∈ C N h ∈ C P + (cid:107) Gx − y (cid:107) subject to (cid:40) T P ( x ) h = N − P , (cid:104) h , h (cid:105) = , (16)where h ∈ C P + is generated randomly accordedto the circularly-symmetric complex Gaussian distribu-tion C N ( , I P + ) . The normalisation constraint (cid:104) h , h (cid:105) = In [9], the authors have also considered the more natural normalisationconstraint (cid:107) h (cid:107) = . They claim however that this normalisation strategy isless successful experimentally. is used to exclude trivial solutions to the annihilating equationin (16) [9], [28].To solve (16), Pan et al. consider the corresponding La-grangian for a ﬁxed vector h ∈ C P + and show that, when G is full column rank, critical solutions x ∈ C N can be expressedin terms of the annihilating ﬁlter h as [9, Appendix A]: x = ξ − ( G H G ) − R P ( h ) H Ω − h R P ( h ) ξ , (17)where ξ : = ( G H G ) − G H y , Ω h : = R P ( h )( G H G ) − R P ( h ) H and R P is the right-dual of the Toeplitziﬁcation operator T P ,deﬁned as [9, Deﬁnition 1]: R P : (cid:40) C P + → C ( N − P )× N h (cid:55)→ R P ( h ) x : = T P ( x ) h , ∀ x ∈ C N . (18)By applying (18) to the canonical basis of C N , it is easy toshow that: R P ( h ) =  h P h P − · · · h · · · . . . . . . . . . . . . . . . ...... . . . . . . . . . . . . . . . · · · h P h P − · · · h  , for h = [ h , · · · , h P ] ∈ C P + .Pan et al. then use (17) to saturate (16) with respect to x ,obtaining an optimisation problem in terms of h only: min h ∈ C P + (cid:10) Ω − h T P ( ξ ) h , T P ( ξ ) h (cid:11) subject to (cid:104) h , h (cid:105) = . (19)Finally, they propose to solve (19) via an heuristic alternatingminimisation algorithm. At each iteration, the annihilatingﬁlter is updated as the solution to the constrained quadraticminimisation problem: arg min h ∈ C P + (cid:68) Ω − h n T P ( ξ ) h , T P ( ξ ) h (cid:69) (20)subject to (cid:104) h , h (cid:105) = , n ≥ . Observe that (20) corresponds to a relaxation of (19) where theunknown matrix Ω h has been replaced by the ﬁxed matrix Ω h n , constructed from the previous estimate h n . Pan et al. showmoreover in [9, Appendix B] that the iterates can be computedefﬁciently by solving, at each iteration, a linear system of size ( N + ) × ( N + ) :  T P ( ξ ) H h T P ( ξ ) − R P ( h n ) − R P ( h n ) H G H G h H   h n + (cid:96) v λ  =   , (21)where (cid:96) ∈ C N − P , v ∈ C N and λ ∈ C are auxiliary variables.In practice, the algorithm is stopped when the data mismatch (cid:107) Gx n − y (cid:107) falls below a certain threshold (cid:15) (typically the noiselevel), where x n ∈ C N is the estimate of the Fourier seriescoefﬁcients at iteration n , obtained by plugging h n in (17). Forgreater numerical stability, Pan et al. recommend to compute x n by solving a linear system with size ( N − P ) × ( N − P ) ,equivalent to (17) [9, Appendix B]: (cid:20) G H G R P ( h n ) H R P ( h n ) (cid:21) (cid:20) x n (cid:96) (cid:21) = (cid:20) G H y (cid:21) . (22) PREPRINT, UNDER REVIEW

Note that this heuristic iterative procedure comes withoutany strong or weak convergence guarantee: it is neither knownif the sequence {( h n , x n ) , n ∈ N } ⊂ C P + × C N converges to acritical point of (16), nor if { h n , n ∈ N } ⊂ C P + converges toa ﬁxed-point of (20). Moreover, the proposed stopping criterionrequires knowing the noise level, unknown in practice. When itis unknown, Pan et al. recommend performing the reconstruc-tion for a ﬁxed and arbitrary number of iterations (typicallyﬁfty). For optimal performance, they moreover suggest to runthe algorithm for multiple random initialisations of h ∈ C P + (typically ﬁfteen). The overall reconstruction procedure cantherefore be computationally intensive, since each iteration hascomplexity O (cid:0) ( N + ) + ( N − P ) (cid:1) = O (cid:0) N (cid:1) –the cost ofsolving the linear systems (21) and (22). Remark (Choice of P and numerical stability) . While theo-retically well-deﬁned for K ≤ P ≤ M, the iterative proceduredescribed above was only tested for P = K in [9]. This conser-vative choice was most likely motivated by numerical stabilityconsiderations: the inversion step in ξ = ( G H G ) − G H y canindeed be numerically unstable when G H G is ill-conditioned,which is less likely to be the case for tall matrices G obtainedwith P = K. We will touch on this again later.B. Implicit Generalised FRI

The annihilating equation constraint in (16) can be thoughtof as regularising the genFRI problem. Indeed, minimising thequadratic term (cid:107) Gx − y (cid:107) alone in the presence of noise wouldnot necessarily yield Fourier coefﬁcients x with nontrivialannihilating ﬁlter, which the annihilating constraint enforcesexplicitly. Unfortunately, this regularisation also complicatessigniﬁcantly the optimisation procedure. Indeed, it requires theintroduction of an extra unknown variable with non lineardependency on the data, namely the annihilating ﬁlter h .Moreover, the non linear constraint T P ( x ) h = N − P is highlynon convex, and state-of-the-art algorithms, such as alternatingminimisation or gradient descent [10], may suffer from gettingtrapped in local minima [29]. To circumvent these issues,we propose the following implicit formulation of the genFRIproblem, in which only the Fourier coefﬁcients are recovered: min x ∈ C N (cid:107) Gx − y (cid:107) subject to (cid:40) rank T P ( x ) ≤ K , (cid:107) x (cid:107) Γ ≤ ρ, (23)where K ≤ P ≤ M , ρ ∈] , + ∞] , and (cid:107) x (cid:107) Γ is the norm inducedby the diagonal and positive deﬁnite matrix Γ ∈ C N × N in (44): (cid:107) x (cid:107) Γ : = √ x H Γ x , ∀ x ∈ C N . (24)Similarly to (16), the quadratic term (cid:107) Gx − y (cid:107) in (23) isused to guarantee high ﬁdelity of the recovered coefﬁcients tothe observed data. Unlike (16), (23) leverages a regularisingrank constraint on T P ( x ) which does not explicitly involvethe unknown annihilating ﬁlter. As already discussed inSection II-E in the context of Cadzow denoising, requiring T P ( x ) to be of rank at most K is indeed a sufﬁcient condition This is notably the reason why Pan et al. recommend multiple randominitialisations of their algorithm in [9]. for the generalised annihilating equation (11) to admit nontrivialsolutions. This implicit regularisation greatly simpliﬁes thegenFRI problem, since it decouples the problem of estimatingthe Fourier coefﬁcients from the problem of estimating theannihilating ﬁlter. The normalisation constraint (cid:107) x (cid:107) Γ ≤ ρ enforces ﬁnite weighted energy (24) to the recovered Fouriercoefﬁcients. As shall be seen in Section IV, it can be relaxedwhen the forward matrix G is injective by setting ρ = + ∞ .Indeed, it is only used to ensure coercivity in underdeterminedcases where the forward matrix G has a nontrivial null space.Coercivity is indeed a key assumption [30] for the convergenceof the proximal gradient descent method envisioned in SectionIV-A. We conclude this section by noting that the choice of Γ as weighting matrix in the energy normalisation constraint isarbitrary and purely motivated by computational considerations.Indeed, any choice of positive deﬁnite weighting matrix in (24)would have been suitable for the sole purpose of making theobjective functional coercive. As explained in Section IV-Bhowever, deﬁning the weighting matrix as Γ greatly simpliﬁesthe computations involved at each iteration of the numericalsolver proposed in Section IV-A. Remark (On the choice of P ) . With similar developmentsto Section II-E for Cadzow denoising, it is possible to showthat the the set of matrices in C ( N − P )×( P + ) of rank at most Kbecomes “smaller and smaller” with respect to the ambientspace as P grows towards M. As a result, the rank constraintin (23) is more selective for values of P close to M, henceenforcing a stronger regularisation. We can hence expect (23) to perform better in practice for P = M. This is in contrastwith the explicit generalised FRI problem (16) , whose equalityconstraint is equally stringent for different values of P.

Remark (Case G = I ) . When G = I , the optimisationproblem (23) becomes a simple denoising problem, whichcould therefore be used as an alternative to Cadzow denoisingor its upgrade [7]. IV. O

PTIMISATION A LGORITHM

A. Non Convex Proximal Gradient Descent

The optimisation problem (23) can be rewritten in uncon-strained form as: min x ∈ C N (cid:107) Gx − y (cid:107) + ι H K ( T P ( x )) + ι B Γ ρ ( x ) , (25)where H K is the non convex set of matrices with rank lowerthan K deﬁned in (12), B Γ ρ : = { x ∈ C N : (cid:107) x (cid:107) Γ ≤ ρ } is the Γ -ball with radius ρ > , and ι H K : C ( N − P )×( P + ) → { , + ∞} , ι B Γ ρ : C N → { , + ∞} are indicator functions with domains H K and B Γ ρ , respectively. Observe that the unconstrainedoptimisation problem (54) can be written as a sum between a convex and differentiable quadratic term F ( x ) : = (cid:107) Gx − y (cid:107) , x ∈ C N , and a non convex and non differentiable term H ( x ) : = ι H K ( T P ( x )) + ι B Γ ρ ( x ) , x ∈ C N . IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 7

It is moreover easy to see that the gradient of F ∇ F ( x ) = G H ( Gx − y ) , x ∈ C N , (26)is β -Lipschitz continuous with respect to the Γ -norm (24), withLipschitz constant given by β = (cid:107) G H G (cid:107) Γ = sup (cid:8) (cid:107) G H Gx (cid:107) Γ : x ∈ C N , (cid:107) x (cid:107) Γ = (cid:9) = sup (cid:110) (cid:13)(cid:13)(cid:13) Γ / G H G Γ − / ˜ x (cid:13)(cid:13)(cid:13) : ˜ x ∈ C N , (cid:107) ˜ x (cid:107) = (cid:111) = (cid:13)(cid:13)(cid:13) Γ / G H G Γ − / (cid:13)(cid:13)(cid:13) . (27)It is hence possible to optimise (54) by means of proximalgradient descent (PGD) [10], an iterative method alternatingbetween gradient and proximal steps according to the followingupdate equation: x k + ∈ prox Γ τ H ( x k − τ ∇ F ( x k )) , (28)for k ≥ , x ∈ C N , τ > and prox Γ τ H deﬁned in (29). Givena current estimate x k ∈ C N , the update equation (48) decreasesthe value of the objective function (54) by selecting a proximalpoint [10] –with respect to H – of a target located at a distance τ from x k along the direction of steepest descent −∇ F ( x k ) .The operator mapping a point x ∈ C N to its proximal pointswith respect to H is called proximal operator , and is deﬁnedas [10]prox Γ τ H ( x ) :  C N → P (cid:16) C N (cid:17) , x (cid:55)→ arg min z ∈ C N τ (cid:107) x − z (cid:107) Γ + H ( z ) , (29)where P( C N ) is the power set of C N , and τ > controls therelative importance of H with respect to the squared distance to x measured in terms of the Γ -norm (24). The function H beingnon convex, the proximal operator (29) will in general return multiple proximal points, which can all be used interchangeablyin (48). The convergence of the sequence { x k } k ∈ N of PGDiterates (48) towards critical points of (54) is established bythe following theorem. Theorem 2 (Convergence of PGD for Arbitrary G ) . Assumethat ρ ∈] , + ∞[ in (54) , and τ < / β with β deﬁned in (48) .Then, any limit point x (cid:63) of the sequence { x k } k ∈ N generatedby (48) is a local minimum of (54) .Proof. The proof of this theorem is adapted from [30, Theorem1] and given in Appendix E available as supplementary materialof this manuscript. (cid:3)

As stated by Theorem 3 hereafter, the convergence of PGDfurthermore extends to the case ρ = + ∞ , at least for injectiveforward matrices G . Setting ρ = + ∞ in (23) is equivalent todropping the energy normalisation constraint, since (cid:107) x (cid:107) Γ ≤ + ∞ is trivially veriﬁed and hence the associated indicator function ι B Γ ρ in (54) is always null. Theorem 3 (Convergence of PGD for Injective G ) . Assumethat ρ = + ∞ in (54) , τ < / β with β deﬁned in (48) , and G ∈ C L × N in (54) is injective , i.e. ker ( G ) = { N } . Then, any limit point x (cid:63) of the sequence { x k } k ∈ N generated by (48) is a local minimum of (54) .Proof. The proof of this theorem is given in Appendix Eavailable as supplementary material of this manuscript. (cid:3)

A practical implication of Theorem 3 is that, for injectiveforward matrices G , PGD applied to the following relaxedimplicit genFRI problem is convergent: min x ∈ C N (cid:107) Gx − y (cid:107) + ι H K ( T P ( x )) , (30)where F ( x ) : = (cid:107) Gx − y (cid:107) , and H ( x ) : = ι H K ( T P ( x )) . Asdiscussed in Section IV-B, (30) should always be favouredover (54) for injective forward matrices G , since solving it viaPGD requires less computations at each proximal step. B. Cadzow PnP Gradient Descent

As seen in the previous section, PGD requires the compu-tation of the proximal operator (29) at each iteration, whichamounts to ﬁnding a minimiser to the following non convexoptimisation problem: ˇ x ∈ arg min z ∈ C N (cid:26) τ (cid:107) x − z (cid:107) Γ + ι H K ( T P ( z )) + ι B Γ ρ ( z ) (cid:27) , (31)for some input x ∈ C N . Observe that the proximal step (31) canbe seen as a generalised projection step, aiming to ﬁnd a point ˇ x as close as possible –in terms of the Γ -norm – from x whileverifying some convex and non convex constraints speciﬁedby the indicator functions. This is formalised by Proposition3, which shows that solutions to (31) can be identiﬁed withthose of an orthogonal projection problem: Proposition 3.

The proximal operator (29) of H ( x ) : = ι H K ( T P ( x )) + ι B Γ ρ ( x ) , for ρ ∈] , + ∞] and K ≤ P ≤ M isgiven byprox Γ τ H ( x ) = T † P Π T P ∩H K ∩ B ρ T P ( x ) , ∀ x ∈ C N , (32) where B ρ : = { X ∈ C ( N − P )×( P + ) : (cid:107) X (cid:107) F ≤ ρ } and Π T P ∩H K ∩ B ρ is the orthogonal projection operator onto T P ∩H K ∩ B ρ with respect to the Frobenius norm: Π T P ∩H K ∩ B ρ ( X ) :  C ( N − P )×( P + ) → P (cid:16) C ( N − P )×( P + ) (cid:17) , X (cid:55)→ arg min H ∈ T P ∩H K ∩ B ρ (cid:107) X − H (cid:107) F . Proof.

The proof of this proposition is given in Appendix Favailable as supplementary material of this manuscript. (cid:3)

Equation (65) provides us with a practical way of computingthe proximal set (29) associated to a point x ∈ C N . Unfortu-nately, the orthogonal projection operator Π T P ∩H K ∩ B ρ admitsno simple closed-form expression. We therefore propose toapproximate it by the method of alternating projections (MAP)(see Section II-C): Π T P ∩H K ∩ B ρ (cid:39) (cid:2) Π T P Π H K Π B ρ (cid:3) n , (33) Observe that the weighting matrix Γ puts more emphasis on the coefﬁcientsthat appear more often in the Toeplitz matrix T P ( x ) . PREPRINT, UNDER REVIEW

Algorithm 1

Cadzow PnP Gradient Descent (CPGD)

Require: y , G , T P , x , K ≤ P , τ < / β, n ∈ N , ρ > k:=0 repeat z k + : = x k − τ G H ( Gx k − y ) if ρ = + ∞ then x k + : = T † P (cid:2) Π T P Π H K (cid:3) n T P ( z k + ) else x k + : = T † P (cid:2) Π T P Π H K Π B ρ (cid:3) n T P ( z k + ) end if k ← k + until a stopping criterion is satisﬁed return x ( k ) for some n ∈ N . Observe that when ρ = + ∞ (which is possiblefor injective matrices G , see Theorem 3) we have Π B ρ = Id andthe right-hand side of (33) simpliﬁes to (cid:2) Π T P Π H K (cid:3) n . Note thatsince H K is non convex, the convergence as n grows to inﬁnityof the product (cid:2) Π T P Π H K Π B ρ (cid:3) n towards the actual projectionmap Π T P ∩H K ∩ B ρ is not guaranteed in general (see discussionin Section II-C). For the speciﬁc case ρ = + ∞ however, it ispossible to apply Theorem 1 to show the local convergence ofthe MAP (33): Corollary 1.

Let Z ∈ H K ∩ T P be a non tangential point [24,Deﬁnition 4.3]. Then, for X ∈ C ( N − P )×( P + ) and (cid:15) > , thereexists δ ≥ such that, if (cid:107) X − Z (cid:107) F ≤ δ ,1) (cid:2) Π T P Π H K (cid:3) n ( X ) n →∞ → X ∞ ∈ H K ∩ T P ,2) (cid:13)(cid:13) X ∞ − Π H K ∩ T P ( X ) (cid:13)(cid:13) F < (cid:15) (cid:13)(cid:13) X − Π H K ∩ T P ( X ) (cid:13)(cid:13) F . Proof.

Similarly to the proof of [31, Theorem 7], Corollary 1is obtained by applying Theorem 1 to the manifolds M = R K of matrices with rank exactly K –which is dense in H K [24,Proposition 2.1]– and M = T P . For more details, see the proofof [31, Theorem 7], which discusses the local convergenceof the MAP for H K ∩ H P where H P denotes the space ofrectangular Hankel matrices. Since Hankel matrices are justreﬂected Toeplitz matrices, the analysis extends to the case ofToeplitz matrices. (cid:3) Roughly speaking, Corollary 1 states that, if applied to amatrix X close enough to a non tangential point of T P ∩ H K (which as discussed in [24] for the case of Hankel matricesare all but very exceptional matrices of T P ∩ H K ), the MAP(33) converges to a point in T P ∩ H K . Moreover, the error (cid:13)(cid:13) X ∞ − Π H K ∩ T P ( X ) (cid:13)(cid:13) F can be made arbitrarily small withrespect to (cid:13)(cid:13) X − Π H K ∩ T P ( X ) (cid:13)(cid:13) F . While difﬁcult to verify inpractice, the local convergence result Corollary 1 reassures uson the well-foundedness of approximation (33).Plugging (33) into (65) ﬁnally yields the following approxi-mate proximal step:prox Γ τ H ( x ) (cid:39) T † P (cid:2) Π T P Π H K Π B ρ (cid:3) n T P ( x ) , ∀ x ∈ C N , (34)for some n ≥ . The PGD algorithm with approximate proximalstep (34) is provided in Algorithm 1. Observe that when ρ =+ ∞ , (34) reduces to Cadzow denoising (13). The effect ofheuristic (33) is hence to replace the proximal step in the PGDiterations by a generic denoising step. Such an approach is reminiscent of the plug-and-play (PnP) framework [15], [14]from image processing, which leverages generic denoisers toapproximate complex projection or proximal operators [13]. Forthis reason, we baptise our algorithm Cadzow PnP GradientDescent (CPGD) . In the next section, we study the convergenceof Algorithm 1.

C. Local Fixed-Point Convergence of CPGD

In Section IV-A, we established Theorems 2 and 3 whichshow the convergence of PGD towards critical points of (54).However, such results required the computation of exact proxi-mal steps (29) in the PGD iterations, and do not apply to CPGDwhich leverages the inexact proximal step (34). Convergenceof PGD in non convex setups with inexact proximal steps wasstudied in [12], [32]. The results established in both papersrequire the proximal step approximation errors incurred ateach iteration to be decreasing and summable, which may notnecessarily be the case for the MAP approximation (33). Itis nevertheless possible to demonstrate that the iterations ofCPGD are locally contractive , and therefore locally convergenttowards a ﬁxed point using the Banach contraction principle.Such a result is stated in Theorem 4 hereafter.

Theorem 4 (CPGD is a Local Contraction) . Let R K ⊂ C ( N − P )×( P + ) be the set of matrices of rank exactly K ≤ P ≤(cid:98) N / (cid:99) , and U τ, n : C N → C N the update CPGD mapU τ, n ( x ) : = H n ( x − τ ∇ F ( x )) , x ∈ C N , (35) with H n ( x ) : = T † P [ Π T P Π H K Π B ρ ] n T P ( x ) . Let G ∈ C L × N beinjective, and Γ be the diagonal and positive deﬁnite matrixdeﬁned in (44) . Deﬁne α : = λ min (cid:16) Γ / G H G Γ − / (cid:17) ,β : = λ max (cid:16) Γ / G H G Γ − / (cid:17) , where λ min ( M ) and λ max ( M ) denote the minimum and maxi-mum eigenvalues of a matrix M respectively.Then, U τ, n is locally well-deﬁned (single-valued) and Lips-chitz continuous with respect to the Γ -norm (cid:13)(cid:13) U τ, n ( x ) − U τ, n ( z ) (cid:13)(cid:13) Γ ≤ L τ (cid:107) x − z (cid:107) Γ , for all x , z ∈ C N such that T P ( x ) and T P ( z ) are in someneighbourhood of some matrix R ∈ R K . The Lipschitz constant L τ is given by L τ = max {| − τα | , | − τ β |} . Moreover, U τ, n is contractive , i.e. ≤ L τ < , for < τ < / β ,and L τ is minimised for τ = /( α + β ) .Proof. The proof of this theorem is given in Appendix Gavailable as supplementary material of this manuscript. (cid:3)

The following corollary shows the local convergence ofCPGD towards a ﬁxed-point of the update map (35):

Corollary 2 (CPGD Converges Locally) . With the samenotations as in Theorem 4, assume that all CPGD iterates { x k } k ∈ N are such that { T P ( x k + ) , T P ( x k )} ⊂ U k , ∀ k ∈ N , (36) IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 9

Figure 1: Illustration of condition (36) in Corollary 2. for some neighbourhood U k of some matrix R k ∈ R K . Assumefurther that < τ < / β . Then, x k k →∞ → x (cid:63) where x (cid:63) ∈ C N isa ﬁxed-point of U τ, n , i.e. U τ, n ( x (cid:63) ) = x (cid:63) . Moreover, we have (cid:107) x (cid:63) − x k (cid:107) Γ ≤ L k τ − L τ (cid:107) x − x (cid:107) Γ , ∀ k ≥ . (37) Proof.

The proof of this theorem is given in Appendix Havailable as supplementary material of this manuscript. (cid:3)

Remark (Speed of Convergence) . From Theorem 4 and (2) ,we see that the sequence { x k } k ∈ N converges the fastest whenL τ is minimised, i.e. τ = /( α + β ) . Remark (Fixed Points vs. Critical Points) . Note that Corollary2 is a much weaker result than Theorems 2 and 3. Indeed,Corollary 2 only shows the local convergence of CPGD towardsﬁxed points of U τ, n , which may not necessarily be critical pointsof the optimisation problem (54) . Theorems 2 and 3 on theother hand, show the global convergence of PGD with exactproximal step towards critical points of (54) . This is howeverthe price to pay for computing the proximal step (31) efﬁcientlyin practice. Remark (Geometric Interpretation of Condition (36)) . Roughlyspeaking, Corollary 2 guarantees the convergence of CPGDtowards a ﬁxed point of the update map (35) , providedthat the forward matrix G is injective, and that any twoconsecutive lifted estimates T P ( x k ) , T P ( x k + ) , are in a commonneighbourhood U k of some matrix R k ∈ R K . Note that thisis much less stringent than requiring the entire lifted path { T P ( x k )} k ∈ N to belong to some neighbourhood U of someﬁxed matrix R ∈ R K . Indeed, condition (36) allows the liftedestimates to travel from one neighbourhood of the manifold R K to another, provided that every visited neighbourhoodcontains at least two consecutive lifted estimates (see Figure 1for an illustration). This condition, although difﬁcult to verifyin practice, seems however likely to hold for ρ = + ∞ , smallenough step sizes, large enough n and x = N . Indeed, insuch a case, we have: • T P ( x ) ∈ H K is in some neighbourhood of R K since R K is dense in H K . • For n large enough, T P ( x k ) is very likely to be insome neighbourhood of R K , since the denoising stepin the update map (35) makes T P ( x k ) close to be in theintersection H K ∩ T P (see Corollary 1). • For a small enough step size τ , T P ( x k ) and T P ( x k + ) arelikely to belong to the same neighbourhood of R K . V. E

XPERIMENTS AND R ESULTS

In this section we validate the CPGD method numerically,considering as a testbed the scenario of irregular time samplingfrom [9, Section IV.A]. We assess both the reconstructionaccuracy and the computational complexity of the method, andcompare it to the state-of-the-art.

Remark (Reproducibility) . Special care has been taken intomaking the experiments and simulations of this section fullyreproducible. To reproduce the results, the reader is referredto the routines provided in our GitHub repository [16].A. Reconstruction Accuracy

We deﬁne a 1-periodic stream of K = Diracs (see Fig. 2): x ( t ) = (cid:213) m ∈ Z K (cid:213) k = x k δ ( t − t k − m ) , ∀ t ∈ R , (38)where the amplitudes x k ∈ R + and locations t k ∈ [ , [ are random, with log-normal and uniform distributionsrespectively. We then generate L = noisy samples as y l = M (cid:213) m = − M ˆ x m e j π m θ l + (cid:15) l , l = , . . . , L , (39)where ˆ x m = (cid:205) Kk = x k exp (− j π mt k ) , m = − M , . . . , M , are theFourier coefﬁcients of the Dirac stream x , θ l ∈ [ , [ are chosenuniformly at random, and (cid:15) l ∈ R are independent realisationsof a Gaussian random variable N ( , σ ) , for l ∈ (cid:110) , L (cid:111) .As explained in [9, Section IV.A], the measurements y l correspond to noisy samples of the low-pass ﬁltered Diracstream x at irregular times θ l (see Fig. 2), where the low-pass ﬁlter is chosen as an ideal low-pass ﬁlter with bandwidth M + ≤ L . Using the formalism of Section III, we can rewrite(39) in vector notation as y = Gx + (cid:15) , (40)where y = [ y , . . . , y L ] ∈ C L , x = [ ˆ x − M , . . . , ˆ x M ] ∈ C N = M + , (cid:15) = [ (cid:15) , . . . , (cid:15) L ] ∈ C L , and G ∈ C L × N is given by G =  e − j π M θ · · · · · · e j π M θ e − j π M θ · · · · · · e j π M θ ... · · · ... · · · ... e − j π M θ L − · · · · · · e j π M θ L − e − j π M θ L · · · · · · e j π M θ L  . Note that from the periodicity of complex exponentials, itis possible to ﬂip the columns of G so as to rewrite it as aVandermonde matrix [9]. This shows that G is injective –since M + ≤ L and the irregular time samples are all distinct.From the samples y and the data model (40), we considerrecovering the Fourier coefﬁcients x ∈ C N by means of threealgorithms: To avoid degenerate cases, the Diracs are required to have a minimumseparation distance of 1 % of the total period. To avoid degenerate cases, the sampling locations are required to have aminimal separation distance of 0.5% of the total period. The noise level is deﬁned as the standard deviation σ of the Gaussiandistribution. . . . . . . − . . . . . . . (a) β = , cutoff frequency M = . . . . . . − . . . . . . . (b) β = , cutoff frequency M = . . . . . . . . . . . . (c) β = , cutoff frequency M = . . . . . . − . . . . . . . (d) β = , cutoff frequency M = Figure 2: Dirac stream with K = sources (dark grey, round coloured heads) and noiseless irregular time samples (light grey,diamond heads), for an oversampling parameter β ∈ { , , , } . • The CPGD algorithm 1 with P = M , ρ = + ∞ (since G is injective) and step size τ = . / β (where β is set asdescribed in Theorem 4). • The state-of-the-art algorithm of Pan et al. [9], describedin Section III-A and referred to hereafter as GenFRI.For smooth integration, the Python 3 implementation ofGenFRI provided by Pan et al. on their ofﬁcial Githubrepository [33] was included in our own algorithmicinterface. Since the noise level is assumed to be unknown,we set –as recommended in [9, Section III-C]– the numberof inner iterations and random initialisations to theirdefault values, 50 and 15 respectively. • The baseline method, referred to hereafter as LS-Cadzow,which consists in applying Cadzow denoising to the least-squares estimate of the Fourier coefﬁcients  x LS = argmin x ∈ C N (cid:107) Gx − y (cid:107) , x LS-Cadzow = T † P (cid:2) Π T P Π H K (cid:3) n T P ( x LS ) . (41)We solve the least-squares optimisation problem in (41)by means of the lstsq function in the Python 3 package scipy [34], with cut-off ratio cond = − .For CPGD, we ﬁx the maximum number of iterations to 500and consider that convergence is reached if the iterate normis changed by less than 0.01 % between two iterations. ForCadzow denoising, we ﬁx the number of iterations to 10 forboth LS-Cadzow and CPGD. The reconstruction accuracy In practice this upper bound is never reached: CPGD almost alwaysconverges in less than 150 iterations. is assessed by matching the true Dirac locations t k to therecovered ones, denoted by ω k , for k between and K . To doso, we proceed as explained in Section II-D and infer the Diraclocations ω k from the z-transform roots of the annihilatingﬁlter associated to the Fourier coefﬁcients estimated by eachmethod. Then, we solve the following matching problem bymeans of the

Hungarian algorithm [35] min j ,..., j K ∈{ ,..., K } (cid:40) K K (cid:213) k = d ( t k , ω j k ) (cid:41) , (42)where d ( t , ω ) = min {| t − ω | , − | t − ω |} , ∀ t , ω ∈ [ , [ ,is the canonical distance on the periodised interval [ , [ .Finally, we report the average positioning error, correspondingto the value of the cost function (cid:205) Kk = d ( t k , ω i k )/ K for theindices { i , . . . , i K } solutions to the matching problem (42).This metric is computed for 192 noise realisations, variouscutoff frequencies M = β K with the oversampling factor β ∈ { , , , } (see Figs. 2a, 2b, 2c and 2d respectively) andvarious noise levels σ = max k = ,..., K | x k | × exp (cid:18) − PSNR (cid:19) , where the peak signal to noise ratio PSNR ranges from −30 dBto 30 dB. The conditioning numbers of the matrix G H G for the See [7, Fig. 2] for additional details on the procedure used to recover theDirac locations from the annihilating ﬁlter coefﬁcients. The Hungarian algorithm is implemented in the linear_sum_assignment function from the Python 3 package scipy [34].

IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 11 − − −

10 0 10 20 30

PSNR (dB) − − P o s i t i o n i n g E rr o r ( % o f T ) LS-CadzowCPGDGenFRI (a) β = , cutoff frequency M = − − −

10 0 10 20 30

PSNR (dB) − − P o s i t i o n i n g E rr o r ( % o f T ) LS-CadzowCPGDGenFRI (b) β = , cutoff frequency M = − − −

10 0 10 20 30

PSNR (dB) − − P o s i t i o n i n g E rr o r ( % o f T ) LS-CadzowCPGDGenFRI (c) β = , cutoff frequency M = − − −

10 0 10 20 30

PSNR (dB) − − P o s i t i o n i n g E rr o r ( % o f T ) LS-CadzowCPGDGenFRI (d) β = , cutoff frequency M = Figure 3: Positioning error (42) (in percent of period and log-scale) for LS-Cadzow, CPGD and GenFRI, various oversamplingparameters β ∈ { , , , } and a PSNR in {− , − , − , , , , } dB. For each case, plain lines and shaded areas representrespectively the median and inter-quartile region of the positioning error’s empirical distribution obtained from 192 independentnoise realisations. These results can be reproduced using the Python script reproduce_simulation_results.py locatedin the directory ./benchmarking/ of our GitHub repository [16]. β = β = β = β = κ ( G H G ) × × × Table I: Conditioning number of G H G for various values ofthe oversampling parameter β .different values of the oversampling parameter β are provided inTable I. The results of the experiments are displayed on Figs. 3and 5 –available as supplementary material of this manuscript.In Fig. 3 we plot, for different oversampling factors and PSNR,the median and inter-quartile region of the empirical distributionof the average positioning error of the three methods. In Fig. 5,we plot, for each source, different oversampling factors andPSNR, the median and inter-quartile region of the empiricaldistribution of the source location as estimated by the threemethods against the true source location. The conclusions thatcan be drawn from the results are the following: • Figs. 3a, 5a, 5b and 5c reveal that without oversamplingin the Fourier domain (i.e. β = and a minimal cutofffrequency of M = K = ), the three algorithms CPGD,GenFRI and LS-Cadzow perform similarly throughoutthe entire PSNR range. The average positioning error –almost indistinguishable for the three algorithms– goesfrom approximately 10 % of the period for PSNRs of −30 dB to 1 % of the period for PSNRs of 30 dB. • Figs. 3b, 5d, 5e and 5f reveal that with an oversamplingof β = –yielding a cutoff frequency of M = , thethree algorithms CPGD, GenFRI and LS-Cadzow startbehaving differently for PSNRs larger than 0 dB: CPGDhas the lowest positioning error, followed by LS-Cadzowand ﬁnally GenFRI. The inter-quartile regions of thepositioning errors distribution are however overlapping,which means that the differences in performance are notstatistically signiﬁcant. For PSNRs larger than 0 dB, allalgorithms have a lower positioning error than in the case β = . For high PSNRs, this improvement can be as highas one and a half order of magnitude. • Figs. 3c, 5g, 5h and 5i reveal that with an oversamplingof β = –yielding a cutoff frequency of M = ,the three algorithms CPGD, GenFRI and LS-Cadzowagain behave differently for PSNRs larger than 0 dB:CPGD has the lowest positioning error, followed byGenFRI and ﬁnally LS-Cadzow. For high PSNRs, thedifferences in performance among the three algorithmsbecome statistically signiﬁcant: the inter-quartile regionsof the positioning error’s empirical distribution do notoverlap anymore. CPGD moreover reaches a positioningerror as low as 0.01 % of the period, which is up to twoorders of magnitude smaller than the minimal positioning error of GenFRI or LS-Cadzow in this scenario. ForPSNRs greater than 10 dB, CPGD improves its positioningerror with respect to the case β = by a bit less than halfan order of magnitude. This is not the case for GenFRIand LS-Cadzow which both underperform with respectto the case β = –and even with respect to the case β = for LS-Cadzow. This can be explained by the largeconditioning number of the matrix G H G in this case(see Table I), affecting the numerical stability of bothalgorithms (see remark in Section III-A). • Figs. 3a, 5j, 5k and 5l reveal that with an oversampling of β = –yielding a critical bandwidth of M + = equalto the number of measurements L , CPGD is superior toGenFRI which is itself superior to the baseline methodLS-Cadzow in nearly all cases, with the exception of verylow PSNRs ( ∼ −30 dB), where the three methods havecomparable reconstruction accuracy. For PSNRs largerthan −20 dB, the differences in performance are statisti-cally signiﬁcant. CPGD is more accurate than GenFRIand LS-Cadzow by a few orders of magnitude (from 1to 3 orders of magnitude for PSNRs larger than −10 dB),reaching a minimal positioning error as low as 0.005 %of the period. For PSNRs greater than −20 dB, CPGDimproves its positioning error with respect to all previouscases β ∈ { , , } . Again, this is not the case for GenFRIand LS-Cadzow which both perform as good as or worsethan the case β = . This can be explained by the (very)large conditioning number of the matrix G H G in thiscase (see Table I), which severely affects the numericalstability of both algorithms.In conclusion, in these simulations CPGD is better at lever-aging oversampling in the Fourier domain to improve thereconstruction accuracy by several orders of magnitude withrespect to the non oversampled case. In particular, CPGDperforms best when the bandwidth of the low-pass ﬁlter ischosen as large as the number of measurements. As explainedin Section II-E, Cadzow denoising exhibits a similar behaviour.This similarity is not fortuitous: both algorithms leverage asimilar rank constraint which becomes more and more selective as the oversampling parameter increases. In contrast, GenFRIand LS-Cadzow are negatively affected by large oversamplingparameters, due to numerical stability issues. For the criticalbandwidth M + = L , CPGD notably outperforms GenFRIand LS-Cadzow by one to three orders of magnitude, and thiseven for PSNRs as low as −20 dB. B. Computational Complexity

As explained in Section III-A, solving the linear systems(21) and (22) are the most expensive operations performed ateach iteration of GenFRI, yielding an overall computationalcomplexity of O( N ) . For CPGD, the computational cost ofeach iteration is dominated by the succesive projections onto H K in the approximate proximal step, which are computedvia a SVD –see Algorithm 1 and (14). At a cursory glance, itmay seem that the overall complexity of CPGD is somewhatcomparable to the one of GenFRI, since computing the SVD ofa matrix with size ( N − P )×( P + ) has in general a computational Size N − − E x ec u t i o n T i m e ( s ) CPGD ∝ N . GenFRI ∝ N . Figure 4: Reconstruction times for CPGD and GenFRI forvarious bandwidth sizes N and K = . The reported times arefor a MacBook Pro (16-inch, 2019), Intel Core i7 (6C/12T) @2.6GHz with 32GB RAM. These results can be reproduced us-ing the Python script reproduce_execution_times.py located in the directory ./benchmarking/ of our GitHubrepository [16].complexity of O(( N − P ) ( P + ) + ( P + ) ) [36] which reducesto O( N ) when P = M . In practice however, projecting onto H K does not require to perform a complete SVD since only the K strongest eigenvalues and their associated eigenvectors areneeded. This truncated SVD can be performed very efﬁcientlyby means of the implicitly restarted Arnoldi method (IRAM) [37], or the implicitly restarted Lanczos method for Hermitianmatrices. When K (cid:28) P + , such methods are obviously muchmore efﬁcient than a wasteful standard SVD. IRAM is moreovera matrix-free method [38]: it does not need the processed matrixto be stored in memory but simply requires an algorithm forperforming matrix/vector products. In our context, since thetruncated SVDs are exclusively performed on Toeplitz matrices,such matrix/vector products can be efﬁciently implemented bymeans of FFTs thanks to the convenient links between Toeplitzmatrices and convolutions outlined in Section II-A.The CPGD implementation provided in our Github repository[16] leverages all these computational tricks. In Fig. 4, weshow that our implementation of CPGD is considerably fasterthan GenFRI. Fig. 4 reports the reconstruction times of CPGDand GenFRI for K = and bandwidth N = β M + = L with β ranging from 1 to 300. To save computational time, thereconstruction times were scaled from the execution time ofa single iteration of CPGD and GenFRI, assuming a typicalnumber of iterations of 100 and 15 × N . while GenFRIscales as N . . Note that this difference in scaling behaviouris overlooked by the complexity analysis above.VI. C ONCLUSION

We propose an implicit version of the generalised ﬁnite-rate-of-innovation (genFRI) problem for the recovery of the Fourierseries coefﬁcients of sparse Dirac streams with arbitrary linearsensing. This formulation relies on a novel regularisation termwhich enforces the annihilation of the recovered Fourier series

IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 13 coefﬁcients without explicitly involving the unknown annihi-lating ﬁlter. The resulting non convex optimisation problem isconsequently simpler and linear in the data. To solve it, wesuggest a proximal gradient descent (PGD) algorithm whichwe prove converges towards a critical point of the objectivefunction. We further introduce an inexact PGD method, coinedCadzow plug-and-play gradient descent (CPGD), where theintractable proximal steps involved in PGD are approximatedby means of alternating projections, akin to the popular Cadzowdenoising algorithm. We outline the resemblance of CPGDto PnP methods used in image processing and prove its localﬁxed-point convergence under relatively weak assumptions.Considering the traditional irregular time sampling testbed, wedemonstrate empirically that CPGD outperforms by severalorders of magnitude the state-of-the-art GenFRI algorithm, bothin terms of accuracy and reconstruction time.For future work, we plan on investigating accelerationtechniques for CPGD, such as approximate sketching-basedeigenvalue decomposition methods [39], [40], more com-putationally efﬁcient for large-scale problems. Applicationsof CPGD to acoustics and radio astronomy will also beinvestigated. R

EFERENCES[1] M. Vetterli, J. Kovaˇcevi´c, and V. K. Goyal,

Foundations of signalprocessing . Cambridge University Press, 2014.[2] B. Rimoldi,

Principles of Digital Communication: A Top-Down Approach .Cambridge University Press, 2016.[3] C. E. Shannon, “A mathematical theory of communication,”

ACMSIGMOBILE Mobile Computing and Communications Review , vol. 5,no. 1, pp. 3–55, 2001.[4] M. Vetterli, P. Marziliano, and T. Blu, “Sampling signals with ﬁniterate of innovation,”

IEEE Trans. Signal Process. , vol. 50, no. 6, pp.1417–1428, 2002.[5] T. Blu, P.-L. Dragotti, M. Vetterli, P. Marziliano, and L. Coulot, “Sparsesampling of signal innovations,”

IEEE Signal Process. Mag. , vol. 25,no. 2, pp. 31–40, 2008.[6] J. A. Cadzow, “Signal enhancement-a composite property mappingalgorithm,”

IEEE Trans. Acoust. , vol. 36, no. 1, pp. 49–62, 1988.[7] L. Condat, A. Hirabayashi, L. Condat, and A. Hirabayashi, “Cadzowdenoising upgraded : A new projection method for the recovery of diracpulses from noisy linear measurements,” 2015.[8] P. L. Combettes and J.-C. Pesquet, “Proximal splitting methods in signalprocessing,” in

Fixed-point algorithms for inverse problems in scienceand engineering . Springer, 2011, pp. 185–212.[9] H. Pan, T. Blu, and M. Vetterli, “Towards generalized FRI samplingwith an application to source resolution in radioastronomy,”

IEEE Trans.Signal Process. , vol. 65, no. 4, pp. 821–835, 2017.[10] N. Parikh, S. Boyd et al. , “Proximal algorithms,”

Foundations and Trendsin Optimization , vol. 1, no. 3, pp. 127–239, 2014.[11] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholdingalgorithm for linear inverse problems,”

SIAM journal on imaging sciences ,vol. 2, no. 1, pp. 183–202, 2009.[12] B. Gu, D. Wang, Z. Huo, and H. Huang, “Inexact proximal gradientmethods for non-convex and non-smooth optimization,” in

Thirty-SecondAAAI Conference on Artiﬁcial Intelligence , 2018.[13] H. Gupta, K. H. Jin, H. Q. Nguyen, M. T. McCann, and M. Unser, “Cnn-based projected gradient descent for consistent ct image reconstruction,”

IEEE transactions on medical imaging , vol. 37, no. 6, pp. 1440–1453,2018.[14] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plug-and-playpriors for model based reconstruction,” in . IEEE, 2013, pp. 945–948.[15] E. K. Ryu, J. Liu, S. Wang, X. Chen, Z. Wang, and W. Yin, “Plug-and-play methods provably converge with properly trained denoisers,” arXivpreprint arXiv:1905.05406 , 2019.[16] M. Simeoni, A. Besson, P. Hurley, and M. Vetterli, “Pyoneer: continuousrecovery of non-bandlimited periodic signals with ﬁnite rates of innova-tion,” https://github.com/matthieumeo/pyoneer, accessed: 2020-04-16. [17] R. Escalante and M. Raydan,

Alternating projection methods . SIAM,2011.[18] J. von Neumann, “The geometry of orthogonal spaces, functionaloperators-vol. ii,”

Annals of Math. Studies , vol. 22, 1950.[19] H. Bauschke, F. Deutsch, H. Hundal, and S.-H. Park, “Accelerating theconvergence of the method of alternating projections,”

Transactions ofthe American Mathematical Society , vol. 355, no. 9, pp. 3433–3461,2003.[20] I. Halperin, “The product of projection operators,”

Acta Sci.Math.(Szeged) , vol. 23, no. 1, pp. 96–99, 1962.[21] H. H. Bauschke and J. M. Borwein, “Dykstra’s alternating projectionalgorithm for two sets,”

Journal of Approximation Theory , vol. 79, no. 3,pp. 418–443, 1994.[22] A. S. Lewis, D. R. Luke, and J. Malick, “Local linear convergencefor alternating and averaged nonconvex projections,”

Foundations ofComputational Mathematics , vol. 9, no. 4, pp. 485–513, 2009.[23] A. S. Lewis and J. Malick, “Alternating projections on manifolds,”

Mathematics of Operations Research , vol. 33, no. 1, pp. 216–234, 2008.[24] F. Andersson and M. Carlsson, “Alternating projections on low-dimensional manifolds.”[25] M. K. Tam, “Alternating projection methods failure in the absence ofconvexity.”[26] C. Riche de Prony, “Essai expérimental et analytique sur les lois de ladilatabilité des ﬂuides élastique et sur celles de la force expansive de lavapeur de l’eau et de la vapeur de l’alkool, à différentes températures,”

J. de l’École Polytechnique , vol. 1, pp. 24–76, 1795.[27] C. Eckart and G. Young, “The approximation of one matrix by anotherof lower rank,”

Psychometrika , vol. 1, no. 3, pp. 211–218, 1936.[28] H. Pan, M. Simeoni, P. Hurley, T. Blu, and M. Vetterli, “Leap: Lookingbeyond pixels with continuous-space estimation of point sources,”

Astronomy & Astrophysics , vol. 608, p. A136, 2017.[29] T. F. Chan and C.-K. Wong, “Convergence of the alternating minimizationalgorithm for blind deconvolution,”

Linear Algebra and its Applications ,vol. 316, no. 1-3, pp. 259–285, 2000.[30] H. Li and Z. Lin, “Accelerated proximal gradient methods for nonconvexprogramming,” in

Advances in neural information processing systems ,2015, pp. 379–387.[31] F. Andersson, M. Carlsson, and P.-A. Ivert, “A fast alternating pro-jection method for complex frequency estimation,” arXiv preprintarXiv:1107.2028 , 2011.[32] Q. Yao, J. T. Kwok, F. Gao, W. Chen, and T.-Y. Liu, “Efﬁcient inexactproximal gradient algorithm for nonconvex problems,” arXiv preprintarXiv:1612.09069 , 2016.[33] H. Pan, T. Blu, M. Vetterli, and F. Dümbgen, “Uniﬁed algorithmicframework for reconstructing signals with ﬁnite rate of innovation,”https://github.com/hanjiepan/FRI_pkg, accessed: 2019-10-18.[34] E. Jones, T. Oliphant, P. Peterson et al.

Navalresearch logistics quarterly , vol. 2, no. 1-2, pp. 83–97, 1955.[36] G. H. Golub and C. F. Van Loan,

Matrix computations . JHU press,2012, vol. 3.[37] R. B. Lehoucq and D. C. Sorensen, “Deﬂation techniques for animplicitly restarted arnoldi iteration,”

SIAM Journal on Matrix Analysisand Applications , vol. 17, no. 4, pp. 789–821, 1996.[38] S. Diamond and S. Boyd, “Matrix-free convex optimization modeling,” in

Optimization and its applications in control and data sciences . Springer,2016, pp. 221–264.[39] D. P. Woodruff et al. , “Sketching as a tool for numerical linear algebra,”

Foundations and Trends in Theoretical Computer Science , vol. 10, no.1–2, pp. 1–157, 2014.[40] P. Indyk, A. Vakilian, and Y. Yuan, “Learning-based low-rank approxi-mations,” in

Advances in Neural Information Processing Systems , 2019,pp. 7400–7410.[41] A. B. Taylor, J. M. Hendrickx, and F. Glineur, “Exact worst-caseconvergence rates of the proximal gradient method for composite convexminimization,”

Journal of Optimization Theory and Applications , vol.178, no. 2, pp. 455–476, 2018.[42] A. Jung, “A ﬁxed-point of view on gradient methods for big data,”

Frontiers in Applied Mathematics and Statistics , vol. 3, p. 18, 2017.[43] F. R. Deutsch,

Best approximation in inner product spaces . SpringerScience & Business Media, 2012.[44] D. R. Luke, “Finding best approximation pairs relative to a convexand prox-regular set in a hilbert space,”

SIAM Journal on Optimization ,vol. 19, no. 2, pp. 714–739, 2008. [45] M. J. Lai and A. Varghese, “On convergence of the alternating projectionmethod for matrix completion and sparse recovery problems,” arXivpreprint arXiv:1711.02151 , 2017. A PPENDIX AT HE T OEPLITZIFICATION O PERATOR AND C ONVOLUTIONS

Consider the Toeplitziﬁcation operator deﬁned in (2). Whenmultiplied with a vector u = [ u , · · · u P + ] ∈ C P + , the matrix T P ( x ) returns the valid part of the convolution between thetwo zero-padded sequences: ˜ x : = (cid:104) · · · , , x − M , · · · , x , · · · , x M , , · · · (cid:105) ∈ C Z and ˜ u : = (cid:104) · · · , , u , · · · , u P + , , · · · (cid:105) ∈ C Z . Indeed, ( ˜ x ∗ ˜ u )[ k ] : = (cid:213) j ∈ Z ˜ x k − j ˜ u j = P + (cid:213) j = ˜ x k − j u j , k ∈ Z . The valid part corresponds to the indices k ∈ Z for which allthe terms in the summation are nonzero. This is the case when k ∈ {− M + P + , . . . , M + } . Consider i = k + M − P we get that the valid part of theconvolution is given by ( ˜ x ∗ ˜ u )[ i − M + P ] = P + (cid:213) j = x − M + P + i − j u j , = P + (cid:213) j = [ T P ( x )] i , j u j , i = , . . . , N − P , which corresponds precisely to T P ( x ) u . Using similar arguments, it is easy to show that themultiplication of T P ( x ) H ∈ C ( P + )×( N − P ) with a vector v = [ v , · · · , v N − P ] T ∈ C N − P returns the valid part of the cross-correlation ( ˜ x (cid:63) ˜ v )[ k ] : = (cid:213) j ∈ Z ˜ x ∗ j − k ˜ u j , k ∈ Z , between ˜ x and the zero-padded sequence ˜ v : = (cid:104) . . . , , v , . . . , v N − P , , . . . (cid:105) ∈ C Z . This time however, the valid part corresponds to the indices: k ∈ { M + − P , . . . , M + } . (cid:3) A PPENDIX BP ROOF OF P ROPOSITION H ∈ C ( N − P )×( P + ) and the followingFrobenius inner product (cid:104) T P ( x ) , H (cid:105) F = tr (cid:16) T HP ( x ) H (cid:17) = N − P (cid:213) i = P + (cid:213) k = T P ( x ) ik H ik = N − P (cid:213) i = P + (cid:213) k = ¯ x − M + P + i − k H iks = i − k + P = N − (cid:213) s = x − M + s (cid:32) (cid:213) i = k + s − P H ik (cid:33) (43)The term (cid:205) i = k + ( s − P ) H ik sums the elements of H along lineswith equation i = k + ( s − P ) . These lines have slope 1 andintercept b = s − P . Notice that these lines have nonnullintersection with the lattice ( k , i ) ∈ (cid:110) , P + (cid:111) × (cid:110) , N − P (cid:111) for b ∈ (cid:110) − P , N − P − (cid:111) . Indeed, the two extreme cases occurwhen the lines hit the points ( , N − P ) and ( P + , ) . Thishappens respectively when + b = N − P ⇒ b = N − P − and P + + b = ⇒ b = − P . Since s ∈ (cid:110) , N − (cid:111) the intercept b varies indeed in the range (cid:110) − P , N − P − (cid:111) and each termin the summation is nonnull. The summation (cid:205) i = k + ( s − P ) H ik corresponds then to summing across each diagonal of H . Weﬁnally get: (cid:104) T P ( x ) , H (cid:105) F = (cid:10) x , T ∗ P ( H ) (cid:11) , with T ∗ P :  C ( N − P )×( P + ) → C N H (cid:55)→ h j = (cid:205) i = k + j − − P H ik , j = , . . . , N . (cid:3) A PPENDIX CP ROOF OF P ROPOSITION T P , it is straightforward toobserve that the operator Γ = T ∗ P T P : C N → C N is a diagonalmatrix, with diagonal entries given by: Γ i , i =  i for i ≤ PP + for P < i ≤ N − PN + − i for N − P < i ≤ N . (44)The operator T † P = Γ − T ∗ P is hence a left inverse for T P : T † P T P = Γ − T ∗ P T P = ( T ∗ P T P ) − T ∗ P T P = I N . (45)Moreover, the latter is actually the pseudoinverse of T P . Indeed,we have trivially: T P T † P T P = T P , T † P T P T † P = T † P , ( T † P T P ) ∗ = T † P T P . Finally, we have ( T P T † P ) ∗ = T P Γ − H T ∗ P = T P T † P , (46) For n , m ∈ Z , n < m , we denote by (cid:110) n , m (cid:111) the 1D lattice [ n , m ] ∩ Z . IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 15 since Γ is diagonal and hence symmetric. T † P veriﬁes thus thedeﬁnition of the pseudoinverse of T P . (cid:3) A PPENDIX DP ROJECTION ON THE S UBSPACE OF T OEPLITZ M ATRICES

The operator T P is actually a surjection onto the subspace T P ⊂ C ( N − P )×( P + ) of rectangular Toeplitz matrices with size ( N − P ) × ( P + ) . Indeed, it is easy to see that every suchmatrix can be written as in (2) for some generator x ∈ C N .Moreover, we have from (45) that T † P T P = I N and hence from[1, Theorem 2.29], T P T † P is a projection operator onto the range T P of T P . Since T P T † P is moreover self-adjoint from (46), itis actually an orthogonal projection operator, which achievesthe proof. (cid:3) A PPENDIX EP ROOFS OF T HEOREMS AND

Lemma 1.

Consider the norm (cid:107) x (cid:107) : = (cid:112) (cid:104) x , x (cid:105) , x ∈ R n ,induced by some inner product (cid:104)· , ·(cid:105) on R n . Consider moreoverthe general problem: min x ∈ R n Φ ( x ) : = F ( x ) + H ( x ) , (47) where F : R n → R ∪ { + ∞} and H : R n → R ∪ { + ∞} arepotentially non convex functions such that:(A) F is a proper function, i.e. its domain is non empty, differentiable and with Lipschitz continuous gradient forsome Lipschitz constant ≤ β < + ∞ , (cid:107)∇ F ( x ) − ∇ F ( y )(cid:107) ≤ β (cid:107) x − y (cid:107) , ∀ x , y ∈ R n . (B) H is a proper and lower semicontinuous (lwsc) function(see supplementary material of [30] for a deﬁnition),potentially non smooth .(C) Φ = F + H is coercive , i.e. Φ is bounded from below and lim (cid:107) x (cid:107)→ + ∞ Φ ( x ) = + ∞ . Then, the iterates { x k } k ∈ N generated by the proximal gradientdescent applied to (47) : x k + ∈ prox τ H ( x k − τ ∇ F ( x k )) , k ≥ , (48) with τ < / β and x ∈ R n , are bounded . Moreover, any limitpoint x (cid:63) of { x k } k ∈ N is a local minimum of Φ .Proof. The lemma is easily shown by specifying the proof of[30, Theorem 1] to the non-accelerated case. For the sake ofcompleteness, it is provided hereafter. From the deﬁnition of the proximal operator,prox τ H ( x ) :  R n → P ( R n ) , x (cid:55)→ arg min z ∈ R n τ (cid:107) x − z (cid:107) + H ( z ) , we can reinterpret (48) as x k + ∈ arg min z ∈ R n τ (cid:107) z − x k (cid:107) + (cid:104)∇ F ( x k ) , z − x k (cid:105) + H ( z ) . (49)We have hence τ (cid:107) x k + − x k (cid:107) + (cid:104)∇ F ( x k ) , x k + − x k (cid:105) + H ( x k + ) ≤ H ( x k ) . From the Lipschitz continuity of ∇ F we have moreover Φ ( x k + ) ≤ H ( x k + ) + F ( x k ) + (cid:104)∇ F ( x k ) , x k + − x k (cid:105) + β (cid:107) x k + − x k (cid:107) ≤ H ( x k ) − τ (cid:107) x k + − x k (cid:107) − (cid:104)∇ F ( x k ) , x k + − x k (cid:105) + F ( x k ) + (cid:104)∇ F ( x k ) , x k + − x k (cid:105) + β (cid:107) x k + − x k (cid:107) = Φ ( x k ) − (cid:18) τ − β (cid:19) (cid:107) x k + − x k (cid:107) . (50)Since τ < / β we have hence ( / τ − β / ) ≥ and Φ ( x k + ) ≤ Φ ( x k ) ≤ Φ ( x ) , ∀ k ≥ . The sequence { Φ ( x k )} k ∈ N is hence bounded and since Φ iscoercive so is { x k } k ∈ N . The sequence { x k } k ∈ N admits hencelimit points. Moreover, since Φ ( x k ) is decreasing and boundedfrom below, it takes the same value Φ (cid:63) ∈ R at all of theselimit points. Summing (50), we obtain hence: (cid:18) τ − β (cid:19) + ∞ (cid:213) k = (cid:107) x k + − x k (cid:107) ≤ Φ ( x ) − Φ (cid:63) < + ∞ . Since τ < / β , we have necessarily (cid:205) + ∞ k = (cid:107) x k + − x k (cid:107) < + ∞ ,which yields lim k → + ∞ (cid:107) x k + − x k (cid:107) = . (51)From the optimality condition (49) and Items 1 and 3 ofProposition 1 of the supplementary material of [30], we havemoreover n ∈∇ F ( x k ) + τ ( x k + − x k ) + ∂ H ( x k + ) = ∂ Φ ( x k + ) − ∇ F ( x k + ) + ∇ F ( x k ) + τ ( x k + − x k ) , (52)where ∂ H : R n → P( R n ) and ∂ Φ : R n → P( R n ) denote the(set-valued) subdifferential operators of H and Φ respectively(see Deﬁnition 2 of the supplementary material of [30]).Equation (52) can moreover be rewritten as ∇ F ( x k + ) − ∇ F ( x k ) − τ ( x k + − x k ) ∈ ∂ Φ ( x k + ) . Furthermore, from the Lipschitz continuity of F , we have (cid:107)∇ F ( x k + ) − ∇ F ( x k ) − τ ( x k + − x k )(cid:107) ≤ (cid:18) β + τ (cid:19) (cid:107) x k + − x k (cid:107) , and hence from (51): lim (cid:107) x (cid:107)→ + ∞ (cid:13)(cid:13)(cid:13)(cid:13) ∇ F ( x k + ) − ∇ F ( x k ) − τ ( x k + − x k ) (cid:13)(cid:13)(cid:13)(cid:13) = . (53) Let { x k j } j ∈ N be a convergent subsequence of { x k } k ∈ N , withlimit x (cid:63) . Then, we have from (53) and Item 2 of Proposition1 of the supplementary material of [30]: n ∈ lim j → + ∞ ∂ Φ ( x k j ) = ∂ Φ ( x (cid:63) ) , which completes the proof. (cid:3) We now show Theorems 2 and 3 by applying Lemma 1 tothe implicit genFRI problem in unconstrained form (54) min x ∈ C N (cid:107) Gx − y (cid:107) + ι H K ( T P ( x )) + ι B Γ ρ ( x ) . (54)To do so, we must ﬁrst convert (54) into an optimisationproblem of the form (47), deﬁned over R n for some n ∈ N . Weachieve this by proceeding as in [2, Section 7.8] and identifying C N with R N (respectively C L with R L ) in the canonical way x ∈ C N ↔ ˆ x : = (cid:20) R ( x ) I ( x ) (cid:21) ∈ R N , where R and I denote the real and imaginary parts respectively.Such an identiﬁcation makes the canonical inner products andnorms on C N and R N (respectively C L and R L ) consistentwith one another, i.e. for all x , z ∈ C N , we have (cid:104) x , z (cid:105) C N = z H x = R ( z ) T R ( x ) + I ( z ) T I ( x ) = ˆ z T ˆ x = (cid:104) ˆ x , ˆ z (cid:105) R N , and (cid:107) x (cid:107) C N = √ x H x = (cid:113) (cid:107) R ( x )(cid:107) R N + (cid:107) I ( x )(cid:107) R N = (cid:107) ˆ x (cid:107) R N . Still following [2, Section 7.8], we moreover identify the linearmap G : C N → C L with a linear map ˆ G : R N → R L withmatrix representation: ˆ G : = (cid:20) R ( G ) − I ( G ) I ( G ) R ( G ) (cid:21) ∈ R L × N . Again, it is easy to show that the two operators are consistent,in the sense that (cid:99) Gx = ˆ G ˆ x , and (cid:154) G H x = ˆ G T ˆ x , ∀ x ∈ C N . Similarly, the Toeplitziﬁcation operator T P : C N → C ( N − P )×( P + ) is identiﬁed with the linear operator ˆ T P : R N → C ( N − P )×( P + ) deﬁned as ˆ T P ( ˆ x ) : = T P ( R ( x )) + jT P ( I ( x )) , ∀ x ∈ C N , where j is the complex 2-root of unity. From the linearity of T P , this deﬁnition yields indeed T P ( x ) = ˆ T P ( ˆ x ) . Finally, the Γ -ball B Γ ρ ⊂ C N is identiﬁed with B ˆ Γ ρ : = (cid:8) ˆ x ∈ R N : (cid:107) ˆ x (cid:107) ˆ Γ ≤ ρ (cid:9) , where ˆ Γ ∈ R N × N is a positive deﬁnite and diagonal matrixdeﬁned as ˆ Γ : = (cid:20) Γ N × N N × N Γ (cid:21) . Again, we trivially have (cid:107) ˆ x (cid:107) ˆ Γ = (cid:107) x (cid:107) Γ and hence x ∈ B Γ ρ ⇔ ˆ x ∈ B ˆ Γ ρ for all x ∈ C N . In summary, the optimisation problem (54) is hence equivalentto the following optimisation problem with search space R N : min ˆ x ∈ R N (cid:13)(cid:13) ˆ G ˆ x − ˆ y (cid:13)(cid:13) R L + ι H K (cid:16) ˆ T P ( ˆ x ) (cid:17) + ι B ˆ Γ ρ ( ˆ x ) . (55)Letting ˆ F ( ˆ x ) : = (cid:13)(cid:13) ˆ G ˆ x − ˆ y (cid:13)(cid:13) R L and ˆ H ( ˆ x ) : = ι H K (cid:16) ˆ T P ( ˆ x ) (cid:17) + ι B ˆ Γ ρ ( ˆ x ) we have ˆ F : R N → R + and ˆ H : R N → { , + ∞} ,so that (55) is indeed of the form (47). We must now verifyassumptions (A), (B) and (C) of Lemma 1: (A) ˆ F is proper, differentiable and ∇ ˆ F Lipschitz continuous. ˆ F is proper since ˆ F ( N ) = (cid:107) ˆ y (cid:107) R L = (cid:107) y (cid:107) C L < + ∞ . It is differentiable, with gradient given by ∇ ˆ F ( ˆ x ) = G T ( ˆ G ˆ x − ˆ y ) = (cid:155) ∇ F ( x ) , ˆ x ∈ R N . (56)The gradient (56) is moreover ˆ β - Lipschitz continuous withrespect to the norm (cid:107) · (cid:107) ˆ Γ on R N , and its Lipschitz constantis given by: ˆ β = (cid:107) ˆ G T ˆ G (cid:107) ˆ Γ = sup (cid:8) (cid:13)(cid:13) ˆ G T ˆ G ˆ x (cid:13)(cid:13) ˆ Γ : ˆ x ∈ R N , (cid:107) ˆ x (cid:107) ˆ Γ = (cid:9) = sup (cid:8) (cid:13)(cid:13) G H Gx (cid:13)(cid:13) Γ : x ∈ C N , (cid:107) x (cid:107) Γ = (cid:9) = β < + ∞ . (57) (cid:3) (B) ˆ H is proper and lower semicontinuous. ˆ H is proper sincefor all ρ > , and K ∈ N , ˆ H ( N ) = ι H K (cid:0) ( N − P )×( P + ) (cid:1) + ι B ˆ Γ ρ ( N ) = < + ∞ . The indicator functions are moreover lower semicontinuoussince the sets H K and B ˆ Γ ρ are both closed. Since T P is abounded linear operator, it is continuous and hence ˆ H is indeedlower semicontinuous as composition between continuous andlower semicontinuous functions. (cid:3) (C) ˆ Φ = ˆ F + ˆ H is coercive.

It is easy to see that ˆ Φ = ˆ F + ˆ H ≥ . To show that ˆ Φ is coercive, it is hence sufﬁcient to showthat lim (cid:107) ˆ x (cid:107) ˆ Γ → + ∞ ˆ Φ ( ˆ x ) = + ∞ . To this end, we distinguish two cases, which correspondrespectively to the assumptions of Theorems 2 and 3:1) ρ ∈] , + ∞[ : in this case ˆ Φ is trivially coercive since ι B ˆ Γ ρ ( ˆ x ) = + ∞ , ∀ (cid:107) ˆ x (cid:107) ˆ Γ ≥ ρ. ρ = + ∞ and G injective: When ρ = + ∞ , the term ι B ˆ Γ ρ isalways null and ˆ Φ simpliﬁes to ˆ Φ ( ˆ x ) = (cid:13)(cid:13) ˆ G ˆ x − ˆ y (cid:13)(cid:13) R L + ι H K (cid:16) ˆ T P ( ˆ x ) (cid:17) , ˆ x ∈ R N . From [2, Section 7.8], we have moreover that det (cid:16) ˆ G T ˆ G (cid:17) = | det ( G H G )| (cid:44) , (58) IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 17 since G is injective by assumption. From the reversetriangle inequality, we have hence (cid:13)(cid:13) ˆ G ˆ x − ˆ y (cid:13)(cid:13) R L ≥ σ min (cid:107) ˆ x (cid:107) R N − (cid:107) ˆ y (cid:107) R L , ∀ ˆ x ∈ R N , where σ min = (cid:113) λ min ( ˆ G T ˆ G ) > is the square root of theeigenvalue of ˆ G T ˆ G with lowest magnitude, which is nonnull from (58). From the equivalence of norms in ﬁnitedimensions, there exist moreover c , c > such that c (cid:107) ˆ x (cid:107) ˆ Γ ≤ (cid:107) ˆ x (cid:107) R N ≤ c (cid:107) ˆ x (cid:107) ˆ Γ , ∀ ˆ x ∈ R N . This yields (cid:13)(cid:13) ˆ G ˆ x − ˆ y (cid:13)(cid:13) R L ≥ σ min c (cid:107) ˆ x (cid:107) ˆ Γ − (cid:107) ˆ y (cid:107) R L , ∀ ˆ x ∈ R N , and hence lim (cid:107) ˆ x (cid:107) ˆ Γ → + ∞ (cid:13)(cid:13) ˆ G ˆ x − ˆ y (cid:13)(cid:13) R L ≥ lim (cid:107) ˆ x (cid:107) ˆ Γ → + ∞ σ min c (cid:107) ˆ x (cid:107) ˆ Γ = + ∞ , which shows that ˆ Φ is indeed coercive. (cid:3) We can hence apply Lemma 1, to show that the iterates { ˆ x k } k ∈ N ⊂ R N generated by PGD applied to (55): ˆ x k + ∈ prox ˆ Γ τ ˆ H (cid:16) ˆ x k − τ ∇ ˆ F ( ˆ x k ) (cid:17) , (59)with τ < / ˆ β and ˆ x ∈ R N , are bounded. Moreover, anylimit point ˆ x (cid:63) of { ˆ x k } k ∈ N is a critical point of (55), i.e. N ∈ ∂ ˆ Φ ( ˆ x (cid:63) ) .Observe ﬁnally, that the iterations (59) can be rewritten incomplex form as x k + ∈ prox Γ τ H ( x k − τ ∇ F ( x k )) , (60)with τ < / β and x ∈ C N , and where we have used (56),(57) and prox ˆ Γ τ ˆ H ( ˆ x ) = arg min ˆ z ∈ R N τ (cid:107) ˆ x − ˆ z (cid:107) Γ + ˆ H ( z ) = (cid:156) prox Γ τ H ( x ) , ∀ ˆ x ∈ R N , which follows trivially from the previous identiﬁcations. Byidentiﬁcation and equivalence between the real and complexoptimisation problems (55) and (54), we can hence concludethat limit points of the iterates { x k } k ∈ N ⊂ C N generated by(60) are critical points of (54), which achieves the proof. (cid:3) A PPENDIX FP ROOF OF P ROPOSITION x ∈ C N :prox Γ τ H ( x ) = arg min z ∈ C N (cid:26) τ (cid:107) x − z (cid:107) Γ + ι H K ( T P ( z )) + ι B Γ ρ ( z ) (cid:27) . (61) When mapped via the Toeplitziﬁcation operator T P , theproximal set (61) becomes T P (cid:16) prox Γ τ H ( x ) (cid:17) == (cid:8) T P ( ˇ x ) , ˇ x ∈ prox Γ τ H ( x ) (cid:9) = (cid:110) ˇ X ∈ T P , T † P ( ˇ X ) ∈ prox Γ τ H ( x ) (cid:111) = arg min Z ∈ T P (cid:26) τ (cid:107) T † P ( Z ) − x (cid:107) Γ + ι H K ( Z ) + ι B Γ ρ (cid:16) T † P ( Z ) (cid:17)(cid:27) = arg min Z ∈ T P ∩H K (cid:26) τ (cid:107) T † P ( Z ) − x (cid:107) Γ + ι B Γ ρ (cid:16) T † P ( Z ) (cid:17)(cid:27) , (62)where we have used the fact that T † P T P ( z ) = z for all z ∈ C N .We have moreover, ∀ Z ∈ T P : (cid:107) T † P ( Z ) − x (cid:107) Γ = (cid:104) Γ T † P ( Z ) − x , T † P ( Z ) − x (cid:105) = (cid:104) Γ T † P ( Z ) , T † P ( Z )(cid:105) + (cid:104) Γ x , x (cid:105) − (cid:104) Γ T † P ( Z ) , x (cid:105) − (cid:104) Γ x , T † P ( Z )(cid:105) = (cid:104) ΓΓ − T ∗ P ( Z ) , Γ − T ∗ P ( Z )(cid:105) + (cid:104) T ∗ P T P ( x ) , x (cid:105) − (cid:104) ΓΓ − T ∗ P ( Z ) , x (cid:105) − (cid:104) Γ x , Γ − T ∗ P ( Z )(cid:105) = (cid:104) Z , T P Γ − T ∗ P ( Z )(cid:105) F + (cid:104) T P ( x ) , T P ( x )(cid:105) F − (cid:104) Z , T P ( x )(cid:105) F − (cid:104) T P ( x ) , Z (cid:105) F = (cid:104) Z , Π T P ( Z )(cid:105) F + (cid:107) T P ( x )(cid:107) F − (cid:104) Z , T P ( x )(cid:105) F − (cid:104) T P ( x ) , Z (cid:105) F = (cid:104) Z , Z (cid:105) F + (cid:107) T P ( x )(cid:107) F − (cid:104) Z , T P ( x )(cid:105) F − (cid:104) T P ( x ) , Z (cid:105) F = (cid:107) Z (cid:107) F + (cid:107) T P ( x )(cid:107) F − R ((cid:104) Z , T P ( x )(cid:105) F ) = (cid:107) Z − T P ( x )(cid:107) F , (63)where we have used the fact that Γ = Γ H = T ∗ P T P and T P Γ − T ∗ P = Π T P (see Appendices C and D). With similararguments, we have ∀ Z ∈ T P : (cid:13)(cid:13)(cid:13) T † P ( Z ) (cid:13)(cid:13)(cid:13) Γ ≤ ρ ⇔ (cid:113) (cid:104) Γ T † P ( Z ) , T † P ( Z )(cid:105) ≤ ρ ⇔ (cid:112) (cid:104) Z , Z (cid:105) F ≤ ρ ⇔ (cid:107) Z (cid:107) F ≤ ρ. so that ι B Γ ρ (cid:16) T † P ( Z ) (cid:17) = ι B ρ ( Z ) , ∀ Z ∈ T P , (64)where B ρ : = (cid:8) Z ∈ C ( N − P )×( P + ) : (cid:107) Z (cid:107) F ≤ ρ (cid:9) . Plugging (63)and (64) into (62) hence yields T P (cid:16) prox Γ τ H ( x ) (cid:17) == arg min Z ∈ T P ∩H K (cid:26) τ (cid:107) T † P ( Z ) − x (cid:107) Γ + ι B Γ ρ (cid:16) T † P ( Z ) (cid:17)(cid:27) = arg min Z ∈ T P ∩H K (cid:26) τ (cid:107) Z − T P ( x )(cid:107) F + ι B ρ ( Z ) (cid:27) = arg min Z ∈ T P ∩H K ∩ B ρ (cid:26) τ (cid:107) Z − T P ( x )(cid:107) F (cid:27) = arg min Z ∈ T P ∩H K ∩ B ρ (cid:107) Z − T P ( x )(cid:107) F = Π T P ∩H K ∩ B ρ T P ( x ) . (65) Using the fact that T † P T P = I N we can ﬁnally rewrite (65) asprox Γ τ H ( x ) = T † P Π T P ∩H K ∩ B ρ T P ( x ) , which completes the proof. (cid:3) A PPENDIX GP ROOF OF T HEOREM (cid:96) canonical norm. Lemma 2in contrast assumes the Γ -norm as underlying norm, since thelatter is more natural for our particular problem. Lemma 2 (Contractive Gradient Descent) . Let G ∈ C L × N beinjective, and Γ be the diagonal and deﬁnite positive matrixdeﬁned in (44) . Deﬁne α : = λ min (cid:16) Γ / G H G Γ − / (cid:17) , (66) β : = λ max (cid:16) Γ / G H G Γ − / (cid:17) , (67) where λ min ( M ) and λ max ( M ) denote the minimum and maxi-mum eigenvalue of a matrix M respectively. Let τ ∈ R be apositive constant and consider the linear mapD τ : (cid:40) C N → C N , x (cid:55)→ x − τ G H ( Gx − y ) , (68) for some y ∈ C L . Then, D τ is Lipschitz continuous with respectto the norm induced by Γ : (cid:107) D τ ( x ) − D τ ( z )(cid:107) Γ ≤ L τ (cid:107) x − z (cid:107) Γ , ∀ x , z ∈ C N , with Lipschitz contant:L τ = max {| − τα | , | − τ β |} . (69) Moreover, D τ is contractive, i.e. ≤ L τ < , for < τ < / β ,and L τ is minimised for τ = /( α + β ) .Proof. We have (cid:107) D τ ( x ) − D τ ( z )(cid:107) Γ = (cid:13)(cid:13) ( I N − τ G H G )( x − z ) (cid:13)(cid:13) Γ ≤ (cid:13)(cid:13) I N − τ G H G (cid:13)(cid:13) Γ (cid:107) x − z (cid:107) Γ , = L τ (cid:107) x − z (cid:107) Γ where the Lipschitz constant L τ : = (cid:13)(cid:13) I N − τ G H G (cid:13)(cid:13) Γ > isthe operator norm of I N − τ G H G induced by the Γ -norm on C N : (cid:13)(cid:13) I N − τ G H G (cid:13)(cid:13) Γ = sup (cid:107) x (cid:107) Γ = (cid:107) (cid:16) I N − τ G H G (cid:17) x (cid:107) Γ = sup (cid:107) x (cid:107) Γ = (cid:13)(cid:13)(cid:13) Γ / (cid:16) I N − τ G H G (cid:17) x (cid:13)(cid:13)(cid:13) = sup (cid:107) ˜ x (cid:107) = (cid:13)(cid:13)(cid:13) Γ / (cid:16) I N − τ G H G (cid:17) Γ − / ˜ x (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) I N − τ Γ / G H G Γ − / (cid:13)(cid:13)(cid:13) . (70) Note that since G is injective, G H G is positive deﬁniteand hence we easily get [42] that the eigenvalues of I N − τ Γ / G H G Γ − / are contained in the interval [ − τ β, − τα ] , where β ≥ α > are deﬁned in (66) and (67) respectively. Itsspectral norm is hence given by: (cid:13)(cid:13)(cid:13) I N − τ Γ / G H G Γ − / (cid:13)(cid:13)(cid:13) = max {| − τα | , | − τ β |} , which proves (69). Finally, the restriction on τ for L τ to besmaller than one follows from basic algebra, and is discussedin [41]. (cid:3) The second lemma states that in a Hilbert space, orthogonalprojection maps onto closed convex sets are non-expansive.This is a known result of approximation theory [43], [44].

Lemma 3 (Non-Expansiveness of Closed Convex Projections) . Let H be some Hilbert space with some inner-product norm (cid:107) · (cid:107) and C ⊂ H a closed , convex set. Then the orthogonalprojection map onto C , deﬁned as Π C ( x ) = arg min z ∈C (cid:107) x − z (cid:107) , ∀ x ∈ H , is non-expansive , i.e. (cid:107) Π C ( x ) − Π C ( z )(cid:107) ≤ (cid:107) x − z (cid:107) , ∀ x , z ∈ H . Proof.

Lemma 3 is proven in [43, Theorem 5.5]. (cid:3)

The third lemma states that the singular value projectionmap Π H k is locally non-expansive in every neighbourhood ofthe manifold of matrices with rank exactly k . Lemma 4 (Local Non-Expansiveness of the Singular ValueProjection) . Let C m × n be the space of complex-valued rectan-gular matrices of size m × n, and H k ⊂ C m × n , R k ⊂ C m × n thesets of matrices with rank at most and exactly k ≤ max { m , n } respectively. Denote further by Π H k the orthogonal projectionmap onto H k given in (14) . Then, for every R ∈ R k , the map Π H k is well-deﬁned (single-valued) and locally non-expansive (cid:107) Π H k ( X ) − Π H k ( Z )(cid:107) F ≤ (cid:107) X − Z (cid:107) F , ∀ X , Y ∈ U , for some neighbourhood U (cid:51) R .Proof. Since R k is dense in H k [24, Proposition 2.1], we have Π H k = Π R k in a neighbourhood W of every R ∈ R k (see[23, Example 2.3] for a detailed proof of this fact). Moreover,[45, Lemma 3] tells us that, for every R ∈ R k , Π R k is, ina neighbourhood U (cid:51) R such that U ⊂ W , well-deﬁned(single-valued), continuous and differentiable, with gradientgiven by: ∇ Π R k = Π T R k ( R ) where T R k ( R ) ⊂ C m × n is thetangent plane of the manifold R k in R (see [23, Example2.2]). Since T R k ( R ) is by deﬁnition a linear subspace of C m × n ,the orthogonal projection operator Π T R k ( R ) is bounded withunit spectral norm. The map Π R k = Π H k is consequently 1-Lipschitz continuous (i.e. non-expansive) with respect to theFrobenius norm in the neighbourhood U of R ∈ R k . (cid:3) The last lemma ﬁnally, makes use of Lemmas 3 and 4 to showthat the denoising operator H n ( x ) = T † P [ Π T P Π H K Π B ρ ] n T P ( x ) is locally non-expansive with respect to the Γ -norm: IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 19

Lemma 5 (Local Non-Expansiveness of Denoiser) . Let C ( N − P )×( P + ) be the space of complex-valued rectangularmatrices of size ( N − P ) × ( P + ) , P ≤ (cid:98) N / (cid:99) , and H K ⊂ C ( N − P )×( P + ) , R K ⊂ C ( N − P )×( P + ) the sets of matriceswith rank at most and exactly K ≤ P respectively. LetH n ( x ) : = T † P [ Π T P Π H K Π B ρ ] n T P ( x ) , ∀ x ∈ C N , be the approximate proximal operator (34) . Then, H n is locallywell-deﬁned (single-valued) and non-expansive with respect tothe Γ -norm (cid:107) H n ( x ) − H n ( z )(cid:107) Γ ≤ (cid:107) x − z (cid:107) Γ , for all x , z ∈ C N such that T P ( x ) , T P ( z ) are in someneighbourhood of some matrix R ∈ R K .Proof. First, we have, for all x , z ∈ C N : (cid:107) H n ( x ) − H n ( z )(cid:107) Γ = (cid:13)(cid:13)(cid:13) T † P ( D n T P ( x ) − D n T P ( z )) (cid:13)(cid:13)(cid:13) Γ , (71)where D n = [ Π T P Π H K Π B ρ ] n . Notice that for X ∈ T P , wehave (cid:13)(cid:13)(cid:13) T † P ( X ) (cid:13)(cid:13)(cid:13) Γ = (cid:104) Γ T † P ( X ) , T † P ( X )(cid:105) = (cid:104) ΓΓ − T ∗ P ( X ) , T † P ( X )(cid:105) = (cid:104) X , T P T † P ( X )(cid:105) F = (cid:104) X , Π T P X (cid:105) F = (cid:104) X , X (cid:105) F = (cid:107) X (cid:107) F . Since the range of D n is T P , (71) becomes: (cid:107) H n ( x ) − H n ( z )(cid:107) Γ = (cid:107) D n T P ( x ) − D n T P ( z )(cid:107) F . Assuming now that T P ( x ) and T P ( z ) are in some neighbourhoodof some point R ∈ R K , we can invoke Lemmas 3 and 4recursively to obtain: (cid:107) D n T P ( x ) − D n T P ( z )(cid:107) F ≤ (cid:107) T P ( x ) − T P ( z )(cid:107) F = (cid:107) x − z (cid:107) Γ , where we have used: (cid:107) T P ( x )(cid:107) F = (cid:104) T P ( x ) , T P ( x )(cid:105) F = (cid:104) T ∗ P T P ( x ) , x (cid:105) = (cid:107) x (cid:107) Γ , ∀ x ∈ C N . We ﬁnally get (cid:107) H n ( x ) − H n ( z )(cid:107) Γ ≤ (cid:107) x − z (cid:107) Γ , for all x , z ∈ C N such that T P ( x ) , T P ( z ) are in some neigh-bourhood of some matrix R ∈ R K . (cid:3) We are now ready to show Theorem 4. Let U τ, n ( x ) : = H n ( x − τ ∇ F ( x )) = H n ( D τ ( x )) , x ∈ C N . Then, for every x , z ∈ C N such that T P ( x ) , T P ( z ) are insome neighbourhood of some matrix R ∈ R K , U τ, n is locallyLipschitz continuous as composition between two (locally)Lipschitz continuous functions H n and D τ , see Lemmas 5 and2 respectively. Moreover, the Lipschitz constant is the productof the Lipschitz constants of H n and D τ , 1 and L τ in (69)respectively. We have therefore (cid:13)(cid:13) U τ, n ( x ) − U τ, n ( z ) (cid:13)(cid:13) Γ ≤ L τ (cid:107) x − z (cid:107) Γ , for all x , z ∈ C N such that T P ( x ) , T P ( z ) are in some neigh-bourhood of some matrix R ∈ R K . Finally, the restriction on τ for L τ to be smaller than one results from Lemma 2. (cid:3) A PPENDIX HP ROOF OF C OROLLARY (cid:107) x k + − x k (cid:107) Γ ≤ (cid:107) U τ, n ( x k ) − U τ, n ( x k − )(cid:107) Γ ≤ L τ (cid:107) x k − x k − (cid:107) Γ , for all k ≥ and hence by induction (cid:107) x k + − x k (cid:107) Γ ≤ L k τ (cid:107) x − x (cid:107) Γ , ∀ k ≥ . (72)By assumption < τ < / β and therefore < L τ < . Wededuce hence from (72) that { x k } k ∈ N is a Cauchy sequence.Let j , k ∈ N with j > k : (cid:107) x j − x k (cid:107) Γ ≤ j − (cid:213) m = k (cid:107) x m + − x m (cid:107) Γ ≤ j − (cid:213) m = k L m τ (cid:107) x − x (cid:107) Γ = (cid:107) x − x (cid:107) Γ L k τ j − − k (cid:213) m = L m τ (73) ≤ (cid:107) x − x (cid:107) Γ L k τ ∞ (cid:213) m = L m τ = L k τ − L τ (cid:107) x − x (cid:107) Γ . For every (cid:15) > , we can choose a J ∈ N such that L J τ < (cid:15) ( − L τ )(cid:107) x − x (cid:107) Γ , and hence for all j > k > J (cid:107) x j − x k (cid:107) Γ < (cid:15) . The sequence { x k } k ∈ N is hence a Cauchy sequence, and since C N is complete, it converges towards a limit point x (cid:63) ∈ C N .We have moreover, since U τ, n is continuous x (cid:63) = lim n →∞ x k = lim n →∞ U τ, n ( x k − ) = U τ, n (cid:16) lim n →∞ x k − (cid:17) = U τ, n ( x (cid:63) ) , and hence x (cid:63) is a ﬁxed-point of U τ, n . Note moreover that,from (73) we get (cid:107) x (cid:63) − x k (cid:107) Γ = lim j → + ∞ (cid:107) x j − x k (cid:107) Γ ≤ lim j → + ∞ (cid:107) x − x (cid:107) Γ L k τ j − − k (cid:213) m = L m τ ≤ (cid:107) x − x (cid:107) Γ L k τ + ∞ (cid:213) m = L m τ = L k τ − L τ (cid:107) x − x (cid:107) Γ , which proves (37) of Corollary 2. (cid:3) . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (a) β = , M = , PSNR=-30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (b) β = , M = , PSNR=0 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (c) β = , M = , PSNR=30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (d) β = , M = , PSNR=-30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (e) β = , M = , PSNR=0 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (f) β = , M = , PSNR=30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (g) β = , M = , PSNR=-30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (h) β = , M = , PSNR=0 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (i) β = , M = , PSNR=30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (j) β = , M = , PSNR=-30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (k) β = , M = , PSNR=0 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (l) β = , M = , PSNR=30 dB. Figure 5: Actual vs. recovered Dirac locations for LS-Cadzow, CPGD and GenFRI, various oversampling parameters β ∈{ , , , } , a PSNR in {− , , } dB. For each case and each source (denoted with the same colours as in Fig. 2), the markersand horizontal lines represent respectively the median and inter-quartile region of the estimated locations’ empirical distributionobtained from 192 noise realisations. The closer a marker is from the line y = xx