CPGD: Cadzow Plug-and-Play Gradient Descent for Generalised FRI
Matthieu Simeoni, Adrien Besson, Paul Hurley, Martin Vetterli
PPREPRINT, UNDER REVIEW 1
CPGD: Cadzow Plug-and-Play Gradient Descentfor Generalised FRI
Matthieu Simeoni,
Member, IEEE,
Adrien Besson,
Associate Member, IEEE,
Paul Hurley,
Senior Member, IEEE, and Martin Vetterli,
Fellow, IEEE
Abstract —Finite rate of innovation (FRI) is a powerful re-construction framework enabling the recovery of sparse Diracstreams from uniform low-pass filtered samples. An extensionof this framework, called generalised FRI (genFRI), has beenrecently proposed for handling cases with arbitrary linear mea-surement models. In this context, signal reconstruction amountsto solving a joint constrained optimisation problem, yielding esti-mates of both the Fourier series coefficients of the Dirac streamand its so-called annihilating filter, involved in the regularisationterm. This optimisation problem is however highly non convexand non linear in the data. Moreover, the proposed numericalsolver is computationally intensive and without convergenceguarantee.In this work, we propose an implicit formulation of thegenFRI problem. To this end, we leverage a novel regularisa-tion term which does not depend explicitly on the unknownannihilating filter yet enforces sufficient structure in the solutionfor stable recovery. The resulting optimisation problem is stillnon convex, but simpler since linear in the data and with lessunknowns. We solve it by means of a provably convergentproximal gradient descent (PGD) method. Since the proximal stepdoes not admit a simple closed-form expression, we propose aninexact PGD method, coined as Cadzow plug-and-play gradientdescent (CPGD). The latter approximates the proximal steps bymeans of Cadzow denoising, a well-known denoising algorithmin FRI. We provide local fixed-point convergence guaranteesfor CPGD. Through extensive numerical simulations, we demon-strate the superiority of CPGD against the state-of-the-art in thecase of non uniform time samples.
Index Terms —finite rate of innovation, non bandlimited sam-pling, Dirac streams, non convex optimisation, Cadzow denoising,proximal gradient descent, alternating projections.
I. I
NTRODUCTION S AMPLING theorems lie at the foundation of modern digital signal processing as they permit the convenientnavigation between the analogue and digital worlds [1], [2].The most famous is undoubtedly the
Shannon sampling theorem [3], which states that bandlimited signals can be recoveredexactly from their discrete samples for a sufficient samplingrate. This major result has had tremendous impact on the fieldof signal processing and by extension on many fields of naturalsciences. But this unanimous celebration lead many scientists
M. Simeoni and A. Besson have contributed equally to this work.A. Besson is with E-Scopics, Saint-Cannat, France.M. Simeoni and M. Vetterli are with the Laboratoire de CommunicationsAudiovisuelles (LCAV), École Polytechnique Fédérale de Lausanne (EPFL),CH-1015, Lausanne, Switzerland.For part of this work, M. Simeoni was also with the Foundations of CognitiveSolutions group in the IBM Research Laboratory of Zurich.P. Hurley is with the Centre for Research in Data Science & Mathematicsat Western Sydney University (WSU), Australia. to start thinking about sampling theory exclusively in termsof bandlimitedness, which is only a sufficient condition fora signal to admit a discrete representation. In fact, samplingtheorems can also be devised for non-bandlimited signals aslong as they possess finitely many degrees of freedom.This fact was brought to the attention of the signal processingcommunity in [4], where the authors introduced the finiterate of innovation (FRI) framework. FRI is concerned withthe sampling of sparse non-bandlimited signals such as theprototypical sparse signal, namely the T -periodic stream ofDiracs : x ( t ) = (cid:213) k (cid:48) ∈ Z K (cid:213) k = x k δ ( t − t k − T k (cid:48) ) , ∀ t ∈ R , (1)with x k ∈ C and t k ∈ [ , T [ . In the FRI framework, the sparsityis measured in terms of its rate of innovation , defined as thenumber of degrees of freedom per unit of time. For instance,the Dirac stream (1) has K degrees of freedom { x k , t k } k = ,..., K per period T , yielding a finite rate of innovation of ρ = K / T .Intuitively, any lossless sampling scheme for (1) must thereforehave a sampling rate at least as large as the rate of innovation ρ or it will be impossible to fix all degrees of freedom.Blu et al. described a sampling scheme achieving thesecond best sampling rate after the critical innovation rate,permitting to perfectly recover the signal innovations fromthe knowledge of any K + consecutive Fourier coefficientsof x [5]. Unfortunately, this scheme can be very sensitive tonoise perturbations in the collected samples. This is becausethe recovery of the innovations t k relies on the resolution ofa so-called annihilating equation which requires the Toeplitzmatrix built from the Fourier coefficients to be rank deficient.While this structural constraint is guaranteed to hold in thecase of noiseless recovery of Dirac streams, it can break downin the presence of noise, inevitable in practical applications.As a remedy, Blu et al. proposed to denoise the collectedsamples prior to solving the annihilating equation. To this end,they leveraged the well-known Cadzow algorithm [6], whichaims to retrieve the closest rank-deficient Toeplitz matrix toa high-dimensional embedding of the data via an alternatingprojection method. When upgraded with this extra denoisingstep, simulations results from Blu et al. revealed that the overallaccuracy of the recovery procedure remains very good forsignal-to-noise ratios (SNR) as low as 5 dB [5]. While Cadzowalgorithm empirically provides accurate results after a few itera-tions, convergence in theory has however not been demonstratedto date, due to the non convex nature of the space of rank-deficient matrices. Condat and Hirabayashi [7] revisited Cadzow a r X i v : . [ m a t h . O C ] J un PREPRINT, UNDER REVIEW denoising as a structured low-rank approximation (SLRA) problem and proposed a
Douglas-Rachford splitting algorithmto solve it [8], with higher accuracy than traditional Cadzowdenoising. Unfortunately, the gain in accuracy comes at theprice of significantly higher computational cost, the Douglas-Rachford splitting method requiring many more iterations toconverge than Cadzow algorithm.In addition to their somewhat heuristic nature, neitherCadzow denoising nor its upgrade can handle more generaltypes of input measurements as considered in the generalisedFRI (genFRI) framework introduced by Pan et al. in [9].The latter extends FRI to very generic cases where themeasurements are related to the unknown Fourier coefficientsof signals satisfying the annihilating property by a linear map.In such configurations, both the Fourier coefficients and theircorresponding annihilating filter are unknown and must beestimated from the data. Pan et al. proposed to perform thisjoint estimation task by solving a constrained optimisationproblem which recovers the Fourier coefficients, required tominimise a quadratic data-fidelity term, and their correspondingannihilating filter coefficients. The annihilating equation linkingthe two unknowns is explicitly enforced as a constraint. Thisoptimisation problem is highly non convex and non linear inthe data. They suggested to solve it via an iterative alternatingminimisation algorithm with multiple random initialisations [9].The proposed algorithm however comes without convergenceguarantees and is computationally intensive.In this paper, we propose an implicit formulation of thegenFRI problem in which only the Fourier coefficients tobe annihilated are recovered. This formulation does notrely explicitly on the unknown annihilating filter but ratherleverages a structured low-rank regularisation constraint basedon a “Toeplitzification” linear operator, guaranteeing non-trivial solutions to the annihilating equation. The resultingoptimisation problem is still non convex, but simpler to analyseand solve since it is linear in the data and with less unknowns.We solve the implicit genFRI problem via proximal gradientdescent (PGD) [10], [11].We first consider PGD with exact proximal steps which isshown to converge towards critical points of the implicit genFRIproblem. The latter is however impractical since the proximalstep involved at each iteration does not have a closed-formexpression. We therefore consider an inexact
PGD [12], withproximal steps approximated by means of alternating projec-tions. In the case of injective forward matrices, the approximateproximal step is shown to reduce to Cadzow denoising. Such anapproach is reminiscent of the plug-and-play (PnP) framework in which proximal operators involved in first-order iterativemethods are replaced by generic denoisers [13], [14], [15]. Forthis reason, we name our reconstruction algorithm
CadzowPnP Gradient Descent (CPGD) . We demonstrate that CPGDconverges locally towards fixed points of the update equationfor injective forward matrices. Through simulations of irregularand noisy time sampling of periodic stream of Diracs we showthat CPGD is almost always more accurate and more efficient An efficient Python implementation of CPGD is provided on our GitHubrepository [16]. than the procedure proposed by Pan et al. in [9], sometimesby several orders of magnitude.The remainder of the paper is organised as follows: • Preliminary concepts required for the understanding ofthe further sections are introduced in Section II. • Section III describes the genFRI problem and details theproposed implicit formulation. The CPGD algorithm isintroduced in Section IV. • Experiments and results are detailed in Section V andconcluding remarks are given in Section VI.Finally, note that all experiments and simulations are fullyreproducible using the benchmarking routines provided in ourGitHub repository [16].II. P
RELIMINARIES
In this section we introduce a linear operator, baptised
Toeplitzification operator , which transforms a vector into aToeplitz matrix. This operator will be used in the regularisationterm of our implicit genFRI optimisation problem. We thenbriefly review the method of alternating projections [17] aswell as the FRI [4] framework and Cadzow denoising [7]. A. Toeplitzification Operator
Assume that we are given an arbitrary vector x ∈ C N , N = M + , with entries indexed as follows: x : = [ x − M , x − M + , . . . , x M − , x M ] T . Then, for any P ≤ M , we can embed x into the space T P ofToeplitz matrices of C ( N − P )×( P + ) by means of the following Toeplitzification operator : T P : (cid:40) C N → T P ⊂ C ( N − P )×( P + ) x (cid:55)→ [ T P ( x )] i , j : = x − M + P + i − j , (2)where i = , . . . , N − P , j = , . . . , P + . Note from (2) thatthe value of an entry [ T P ( x )] i , j of the matrix T P ( x ) dependsonly on the distance i − j between the row and column indexes: T P ( x ) is therefore a Toeplitz matrix and the vector x is calledits generator .The Toeplitzification operator (2) can be used to implementlinear convolutions. Indeed, it can be shown (see Appendix Aavailable as supplementary material) that the multiplication of T P ( x ) with a vector u = [ u , · · · , u P + ] T ∈ C P + returns thevalid part of the convolution between the two zero-paddedsequences ˜ x : = (cid:104) . . . , , x − M , . . . , x , . . . , x M , , . . . (cid:105) ∈ C Z and ˜ u : = (cid:104) . . . , , u , . . . , u P + , , . . . (cid:105) ∈ C Z . Similarly, the multipli-cation of T P ( x ) H with a vector v = [ v , · · · , v N − P ] T ∈ C N − P returns the valid part of the cross-correlation between ˜ x and thezero-padded sequence ˜ v : = (cid:104) . . . , , v , . . . , v N − P , , . . . (cid:105) ∈ C Z . The alternative appellation
Toeplitzication was used in [7]. See Appendix A for a formal definition of the valid part of a convolutionbetween zero-padded sequences.
IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 3
B. Inverse Toeplitzification Operator
The inverse Toeplitzification operator is the pseudoinverseof the Toeplitzification operator, mapping a Toeplitz matrix H ∈ C ( N − P )×( P + ) to its generator h ∈ C N . As we shallprove in Proposition 2, inverse Toeplitzification is achieved byaveraging across each diagonal of T P ( x ) . It is interesting tonote that this operation is also leveraged in Cadzow denoisingas described in [5], in order to map back the data from its highdimensional Toeplitz embedding. The formal interpretation ofthis inverse map as the pseudoinverse of the Toeplitzificationoperator proposed hereafter is nevertheless not discussed in [5],nor anywhere else we may be aware of.To compute the pseudoinverse of T P , we first need an expressionfor its adjoint map, detailed in the proposition hereafter. Proposition 1 (Adjoint operator of T P ) . The adjoint operatorT ∗ P of T P defined in (2) is given byT ∗ P : C ( N − P )×( P + ) → C N H (cid:55)→ h j = (cid:213) i = k + j − − P H ik , j = , . . . , N . (3) Proof.
The proof is described in Appendix B available assupplementary material of this manuscript. (cid:3)
Note that the adjoint map T ∗ P proceeds by summing acrosseach diagonal of the input matrix H . We are now ready to derivean expression for the (left) pseudoinverse of T P , described inthe proposition hereafter. Proposition 2 (Pseudoinverse of T P ) . The pseudoinverse T † P : C ( N − P )×( P + ) → C N of T P defined in (2) is given byT † P = Γ − T ∗ P , (4) where Γ ∈ C N × N is a diagonal matrix with diagonal entriesgiven by: Γ i , i = min ( i , P + , N + − i ) , i = , . . . , N . (5) Proof.
The proof of this proposition is given in Appendix Cavailable as supplementary material of this manuscript. (cid:3)
Observe that the composition of T ∗ P and Γ − in the expressionof the pseudoinverse (4) indeed corresponds to a diagonalaveraging since T ∗ P first sums across each diagonal of thematrix H ∈ C ( N − P )×( P + ) and Γ − then divides the sums bythe number of elements on each diagonal. C. The Method of Alternating Projections
In this section we briefly discuss the method of alternatingprojections (MAP) [17], central to Cadzow denoising. It isused in computational mathematics to approximate projectionsonto intersecting sets. In its simplest form proposed by vonNeumann in 1933 [18], the MAP performs a cascade of n projection steps onto subsets {M , . . . , M K } of some Hilbertspace H , starting from a point z ∈ H : ˇ z = (cid:2) Π M K · · · Π M (cid:3) n ( z ) . (6) In (6), Π M k denotes the orthogonal projection map onto M k ,defined for k = , . . . , K as Π M k : (cid:40) H → M k , z (cid:55)→ arg min x ∈M k (cid:107) z − x (cid:107) , for some norm (cid:107) · (cid:107) on H . In the case of closed linear subspaces {M , . . . , M K } , von Neumann and Halperin showed that [18],[19], [20] lim n →∞ (cid:13)(cid:13)(cid:13)(cid:2) Π M K · · · Π M (cid:3) n ( z ) − Π (cid:209) Kk = M k ( z ) (cid:13)(cid:13)(cid:13) = , ∀ z ∈ H . (7)The MAP equation (6) can therefore be used to approximatethe complex projection map Π (cid:209) Kk = M k . For closed convex sets {M , . . . , M K } , Bregman [17] showed moreover the weakconvergence of the MAP towards a point in the intersection (cid:209) Kk = M k . Strong convergence towards the actual projectionwas achieved by Dysktra’s MAP [21], one of the most popularvariant to von Neumann’s original algorithm. In the case of nonconvex intersecting sets, the convergence of the MAP has onlybeen established locally [22], [23], [24], [25]. For example,Andersson et al. considered in [24] the case of two (potentiallynon convex) finite-dimensional manifolds M , M ⊂ H andshowed the following local convergence result: Theorem 1. [24, Theorem 1.6] Let x ∈ M ∩ M be non-tangential , i.e. the angle between M and M at x is positive. Then, for z ∈ H and (cid:15) > , there exists δ ≥ such that, if (cid:107) x − z (cid:107) ≤ δ ,1) (cid:2) Π M Π M (cid:3) n ( z ) n →∞ → z ∞ ∈ M ∩ M ,2) (cid:13)(cid:13) z ∞ − Π M ∩M ( z ) (cid:13)(cid:13) < (cid:15) (cid:13)(cid:13) x − Π M ∩M ( z ) (cid:13)(cid:13) . Roughly speaking, Theorem 1 states that if the startingpoint z is close enough to a non-tangential point of M ∩ M (which as explained in [24] are all but very exceptional pointsof M ∩ M ), then the MAP converges to a point in M ∩ M .Moreover, the error (cid:13)(cid:13) z ∞ − Π M ∩M ( z ) (cid:13)(cid:13) can be made arbitrarilysmall with respect to (cid:13)(cid:13) x − Π M ∩M ( z ) (cid:13)(cid:13) . However, Theorem 1is difficult to apply in practice since the value of δ guaranteeinga relative error below a given threshold (cid:15) is unknown. TheMAP is hence often used as a heuristic in non convex settingswith no convergence guarantees. This is notably the case ofCadzow denoising, discussed further in Section II-E. D. FRI in a Nutshell
The classical FRI framework, introduced in [4], aims atestimating the innovations {( x k , t k ) , k = , . . . , K } ⊂ C × [ , T [ ,of a T -periodic stream of Diracs: x ( t ) = (cid:213) k (cid:48) ∈ Z K (cid:213) k = x k δ ( t − t k − T k (cid:48) ) , ∀ t ∈ R . In standard FRI, the estimation procedure is divided into twostages. The locations t k are first estimated by a nonlinearmethod, and then arranged into a Vandermonde system whosesolution yields the Dirac amplitudes [5]. The recovery of thelocations t k relies on the so-called annihilating equation , dating See [24, Definition 4.2] and [24, Definition 4.3] for a precise definition ofthe angle between two manifolds and the concept of non-tangentiality.
PREPRINT, UNDER REVIEW from Prony’s work [26], which cancels out the Fourier seriescoefficients of x by convolving them with a particular filter,called the annihilating filter . The latter is defined as the finite-tap sequence h = [· · · , , h , h , . . . , h K , , · · · ] ∈ C Z , with z -transform vanishing at roots { u k : = e − j π t k / T , k = , . . . , K } : H ( z ) = K (cid:213) k = h k z − k = K (cid:214) k = ( − u k z − ) . (8)For such a filter, we have indeed ( ˆ x ∗ h ) m = K (cid:213) k = h k ˆ x m − k = K (cid:213) k (cid:48) = x k (cid:48) (cid:32) K (cid:213) k = h k u − kk (cid:48) (cid:33) u mk (cid:48) = , m ∈ Z , (9)where ˆ x m = (cid:205) Kk = x k u mk , m ∈ Z , are the Fourier coefficientsof x in (1). Notice that the roots u k of the z-transform H ( z ) in (8) of h are in one-to-one correspondence withthe locations t k . Recovering them amounts to estimating thecoefficients h = [ h , . . . , h K ] ∈ C K + of h from the annihilatingequation (9). If for instance we have N = M + consecutiveFourier coefficients of x , e.g. x = [ ˆ x − M , . . . , ˆ x M ] ∈ C M + ,we can extract the N − K equations from (9) correspondingto the convolution indices m = − M + K , . . . , M , and use theToeplitzification operator defined in (2) to form the followingmatrix equation: T K ( x ) h = N − K , (cid:107) h (cid:107) (cid:44) . (10)Observe that any nontrivial element of the nullspace of T K ( x ) is a solution to (10). For M ≥ K , it can be shown [5] that T K ( x ) ∈ C ( N − K )×( K + ) has rank K and therefore has a nontrivialnullspace with dimension 1. Up to a multiplicative constant,the annihilating equation (10) admits hence a unique solution.The latter can be obtained numerically by means of total least-squares [5], which computes the eigenvector associated to thesmallest eigenvalue of T K ( x ) . In the critical case M = K , thematrix T K ( x ) is square , while in the oversampling case M > K it is rectangular and tall . As explained in [5], oversamplingmakes the estimation procedure more resilient to potential noiseperturbations in the Fourier coefficients. In such cases, Blu etal. recommend moreover to perform Cadzow denoising on theFourier coefficients x (see Section II-E) as well as replace (10)by a more general annihilating equation: T P ( x ) ˜ h = N − P , (cid:107) ˜ h (cid:107) (cid:44) . (11)with K ≤ P ≤ M , and ˜ h ∈ C P + . Again, it is possible to showthat T P ( x ) has rank K , and hence a nontrivial nullspace withdimension P + − K . Solutions to (11) are hence not uniquein this case, but all are equally valid for practical purposes.Moreover, the increased nullspace dimension makes Cadzowdenoising more efficient at filtering the noise component (seeSection II-E hereafter). In practice, the case P = M has beenreported to yield the best empirical performance [5]. Remember the link between the Toeplitzification operator and convolutiondiscussed in Section II-A. An eigenvalue exactly equal to zero may in practice be impossible toobtain due to numerical inaccuracies.
E. Cadzow Denoising
For strong noise perturbations, the generalised annihilatingequation (11) may fail to admit a nontrivial solution. Indeed,noisy generators x can yield full column rank matrices T P ( x ) with trivial nullspace. As a potential cure, Blu et al. proposeto denoise the Fourier coefficients x prior to solving theannihilating equation. This denoising step attempts to transform T P ( x ) into a Toeplitz matrix with rank at most K , thusguaranteeing the existence of nontrivial solutions to (11). Thisoperation is carried out by means of Cadzow denoising [7],an alternating projection method (see Section II-C) appliedheuristically to the subspace T P of Toeplitz matrices and thesubset H K of matrices with rank at most K : H K : = (cid:110) M ∈ C ( N − P )×( P + ) | rank M ≤ K (cid:111) . (12)Using the notation introduced in Sections II-A, II-B andII-C, Cadzow denoising can be seen as processing the noisycoefficients x as follows: ˇ x = T † P (cid:2) Π T P Π H K (cid:3) n T P ( x ) , (13)for some suitable n ∈ N . The inverse Toeplitzification operator T † P applied after the n iterations of the alternating projectionmethod is used to recover the denoised Fourier coefficients ˇ x ∈ C N .Equation (13) allows us to develop an intuitive understandingas to why Cadzow denoising tends to perform better in practicewith values of P close to M in the generalised annihilatingequation (11). Indeed, it is easy to see that the maximal rank ofrectangular matrices in C ( N − P )×( P + ) ranges in (cid:110) K + , M + (cid:111) when P ranges in (cid:110) K , M (cid:111) . Consequently, the subset H K ofmatrices of rank at most K becomes “smaller and smaller”relatively to the ambient space as P increases towards M .This implies that the associated projection map Π H K is moreselective for large values of M , hence making Cadzow’salgorithm better at filtering the noise component.Note that since H K is a non convex set the convergence ofthe MAP in (13) is not guaranteed. Nevertheless, experimentalresults [5], [7] suggest that Cadzow denoising almost alwaysconverges after a few iterations (typically n ≤ ), which couldtheoretically be explained by the local convergence result fromTheorem 1. We conclude this section by providing closed-formexpressions for the projection operators Π T P and Π H K , neededin (13).
1) Projection onto T P : As shown in Appendix D of thesupplementary material, the orthogonal projection operatoronto the subspace T P ⊂ C ( N − P )×( P + ) of rectangular Toeplitzmatrices can be written in terms of the Toeplitzification operatorand its pseudoinverse (see Section 2) as: Π T P = T P T † P = T P Γ − T ∗ P . For n , m ∈ Z , n < m , we denote by (cid:110) n , m (cid:111) the integer interval [ n , m ] ∩ Z . This can be seen from the closed-form expression of Π H K provided in (14). As explained in Section II-C, the assumptions of Theorem 1 are unfortu-nately very difficult to verify in practice.
IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 5
2) Projection onto H K : The orthogonal projection operatoronto the space H K of matrices with rank at most K is givenby the Eckart-Young-Minsky theorem [27]. The latter statesthat the projection map Π H K ( X ) = arg min H ∈H K (cid:107) X − H (cid:107) F , X ∈ C ( N − P )×( P + ) , can be computed in closed-form as: Π H K ( X ) = U Λ K V H , X ∈ C ( N − P )×( P + ) , (14)where X = U Λ V H is the singular value decomposition of X , and Λ K is the diagonal matrix of sorted singular valuestruncated to the K strongest ones. Note that the output of theprojection map is unique as long as the K − th and ( K + )− thlargest singular values are different. Fortunately, the space ofmatrices failing to verify this condition is very small –moreprecisely it is thin , as discussed extensively in [24, Section 2].In practice moreover, floating-point arithmetic makes it veryunlikely that the K − th and ( K + )− th largest singular valuesbe exactly identical. Thus, the projection map Π H K can beconsidered single-valued for practical purposes.III. G ENERALISED
FRI
AS AN I NVERSE P ROBLEM
A. Generalised FRI
In Section II-D, we have described a procedure for recoveringthe locations t k from consecutive Fourier coefficients of x .The issue of computing these Fourier coefficients from acollection of arbitrary linear measurements y ∈ C L of x , L ≥ N now remains. Blu et al. [5] treated the simple scenario ofmeasurements resulting from regular time sampling with ideallow-pass prefiltering. In such a case, they showed that, for awell chosen prefilter bandwidth, the Fourier coefficients couldsimply be obtained by applying a discrete Fourier transformto the measurements y . For more general measurement types,the situation is more complex, and the Fourier coefficients x ∈ C N must in general be estimated by solving a linearinverse problem: y = Gx + n , (15)where the forward matrix G ∈ C L × N , L ≥ N , is applicationdependent, and n is additive noise, usually assumed to be awhite Gaussian random vector. In [9], Pan et al. proposedthe generalised FRI (genFRI) optimisation problem to dealwith (15). The latter is a non convex constrained optimisationproblem whose objective is to jointly recover the Fouriercoefficients x ∈ C N –required to minimise a quadraticdata-fidelity term– and their corresponding annihilating filtercoefficients h ∈ C P + . The annihilating equation linking thetwo unknowns is explicitly enforced as a constraint, yieldingan optimisation problem of the form: min x ∈ C N h ∈ C P + (cid:107) Gx − y (cid:107) subject to (cid:40) T P ( x ) h = N − P , (cid:104) h , h (cid:105) = , (16)where h ∈ C P + is generated randomly accordedto the circularly-symmetric complex Gaussian distribu-tion C N ( , I P + ) . The normalisation constraint (cid:104) h , h (cid:105) = In [9], the authors have also considered the more natural normalisationconstraint (cid:107) h (cid:107) = . They claim however that this normalisation strategy isless successful experimentally. is used to exclude trivial solutions to the annihilating equationin (16) [9], [28].To solve (16), Pan et al. consider the corresponding La-grangian for a fixed vector h ∈ C P + and show that, when G is full column rank, critical solutions x ∈ C N can be expressedin terms of the annihilating filter h as [9, Appendix A]: x = ξ − ( G H G ) − R P ( h ) H Ω − h R P ( h ) ξ , (17)where ξ : = ( G H G ) − G H y , Ω h : = R P ( h )( G H G ) − R P ( h ) H and R P is the right-dual of the Toeplitzification operator T P ,defined as [9, Definition 1]: R P : (cid:40) C P + → C ( N − P )× N h (cid:55)→ R P ( h ) x : = T P ( x ) h , ∀ x ∈ C N . (18)By applying (18) to the canonical basis of C N , it is easy toshow that: R P ( h ) = h P h P − · · · h · · · . . . . . . . . . . . . . . . ...... . . . . . . . . . . . . . . . · · · h P h P − · · · h , for h = [ h , · · · , h P ] ∈ C P + .Pan et al. then use (17) to saturate (16) with respect to x ,obtaining an optimisation problem in terms of h only: min h ∈ C P + (cid:10) Ω − h T P ( ξ ) h , T P ( ξ ) h (cid:11) subject to (cid:104) h , h (cid:105) = . (19)Finally, they propose to solve (19) via an heuristic alternatingminimisation algorithm. At each iteration, the annihilatingfilter is updated as the solution to the constrained quadraticminimisation problem: arg min h ∈ C P + (cid:68) Ω − h n T P ( ξ ) h , T P ( ξ ) h (cid:69) (20)subject to (cid:104) h , h (cid:105) = , n ≥ . Observe that (20) corresponds to a relaxation of (19) where theunknown matrix Ω h has been replaced by the fixed matrix Ω h n , constructed from the previous estimate h n . Pan et al. showmoreover in [9, Appendix B] that the iterates can be computedefficiently by solving, at each iteration, a linear system of size ( N + ) × ( N + ) : T P ( ξ ) H h T P ( ξ ) − R P ( h n ) − R P ( h n ) H G H G h H h n + (cid:96) v λ = , (21)where (cid:96) ∈ C N − P , v ∈ C N and λ ∈ C are auxiliary variables.In practice, the algorithm is stopped when the data mismatch (cid:107) Gx n − y (cid:107) falls below a certain threshold (cid:15) (typically the noiselevel), where x n ∈ C N is the estimate of the Fourier seriescoefficients at iteration n , obtained by plugging h n in (17). Forgreater numerical stability, Pan et al. recommend to compute x n by solving a linear system with size ( N − P ) × ( N − P ) ,equivalent to (17) [9, Appendix B]: (cid:20) G H G R P ( h n ) H R P ( h n ) (cid:21) (cid:20) x n (cid:96) (cid:21) = (cid:20) G H y (cid:21) . (22) PREPRINT, UNDER REVIEW
Note that this heuristic iterative procedure comes withoutany strong or weak convergence guarantee: it is neither knownif the sequence {( h n , x n ) , n ∈ N } ⊂ C P + × C N converges to acritical point of (16), nor if { h n , n ∈ N } ⊂ C P + converges toa fixed-point of (20). Moreover, the proposed stopping criterionrequires knowing the noise level, unknown in practice. When itis unknown, Pan et al. recommend performing the reconstruc-tion for a fixed and arbitrary number of iterations (typicallyfifty). For optimal performance, they moreover suggest to runthe algorithm for multiple random initialisations of h ∈ C P + (typically fifteen). The overall reconstruction procedure cantherefore be computationally intensive, since each iteration hascomplexity O (cid:0) ( N + ) + ( N − P ) (cid:1) = O (cid:0) N (cid:1) –the cost ofsolving the linear systems (21) and (22). Remark (Choice of P and numerical stability) . While theo-retically well-defined for K ≤ P ≤ M, the iterative proceduredescribed above was only tested for P = K in [9]. This conser-vative choice was most likely motivated by numerical stabilityconsiderations: the inversion step in ξ = ( G H G ) − G H y canindeed be numerically unstable when G H G is ill-conditioned,which is less likely to be the case for tall matrices G obtainedwith P = K. We will touch on this again later.B. Implicit Generalised FRI
The annihilating equation constraint in (16) can be thoughtof as regularising the genFRI problem. Indeed, minimising thequadratic term (cid:107) Gx − y (cid:107) alone in the presence of noise wouldnot necessarily yield Fourier coefficients x with nontrivialannihilating filter, which the annihilating constraint enforcesexplicitly. Unfortunately, this regularisation also complicatessignificantly the optimisation procedure. Indeed, it requires theintroduction of an extra unknown variable with non lineardependency on the data, namely the annihilating filter h .Moreover, the non linear constraint T P ( x ) h = N − P is highlynon convex, and state-of-the-art algorithms, such as alternatingminimisation or gradient descent [10], may suffer from gettingtrapped in local minima [29]. To circumvent these issues,we propose the following implicit formulation of the genFRIproblem, in which only the Fourier coefficients are recovered: min x ∈ C N (cid:107) Gx − y (cid:107) subject to (cid:40) rank T P ( x ) ≤ K , (cid:107) x (cid:107) Γ ≤ ρ, (23)where K ≤ P ≤ M , ρ ∈] , + ∞] , and (cid:107) x (cid:107) Γ is the norm inducedby the diagonal and positive definite matrix Γ ∈ C N × N in (44): (cid:107) x (cid:107) Γ : = √ x H Γ x , ∀ x ∈ C N . (24)Similarly to (16), the quadratic term (cid:107) Gx − y (cid:107) in (23) isused to guarantee high fidelity of the recovered coefficients tothe observed data. Unlike (16), (23) leverages a regularisingrank constraint on T P ( x ) which does not explicitly involvethe unknown annihilating filter. As already discussed inSection II-E in the context of Cadzow denoising, requiring T P ( x ) to be of rank at most K is indeed a sufficient condition This is notably the reason why Pan et al. recommend multiple randominitialisations of their algorithm in [9]. for the generalised annihilating equation (11) to admit nontrivialsolutions. This implicit regularisation greatly simplifies thegenFRI problem, since it decouples the problem of estimatingthe Fourier coefficients from the problem of estimating theannihilating filter. The normalisation constraint (cid:107) x (cid:107) Γ ≤ ρ enforces finite weighted energy (24) to the recovered Fouriercoefficients. As shall be seen in Section IV, it can be relaxedwhen the forward matrix G is injective by setting ρ = + ∞ .Indeed, it is only used to ensure coercivity in underdeterminedcases where the forward matrix G has a nontrivial null space.Coercivity is indeed a key assumption [30] for the convergenceof the proximal gradient descent method envisioned in SectionIV-A. We conclude this section by noting that the choice of Γ as weighting matrix in the energy normalisation constraint isarbitrary and purely motivated by computational considerations.Indeed, any choice of positive definite weighting matrix in (24)would have been suitable for the sole purpose of making theobjective functional coercive. As explained in Section IV-Bhowever, defining the weighting matrix as Γ greatly simplifiesthe computations involved at each iteration of the numericalsolver proposed in Section IV-A. Remark (On the choice of P ) . With similar developmentsto Section II-E for Cadzow denoising, it is possible to showthat the the set of matrices in C ( N − P )×( P + ) of rank at most Kbecomes “smaller and smaller” with respect to the ambientspace as P grows towards M. As a result, the rank constraintin (23) is more selective for values of P close to M, henceenforcing a stronger regularisation. We can hence expect (23) to perform better in practice for P = M. This is in contrastwith the explicit generalised FRI problem (16) , whose equalityconstraint is equally stringent for different values of P.
Remark (Case G = I ) . When G = I , the optimisationproblem (23) becomes a simple denoising problem, whichcould therefore be used as an alternative to Cadzow denoisingor its upgrade [7]. IV. O
PTIMISATION A LGORITHM
A. Non Convex Proximal Gradient Descent
The optimisation problem (23) can be rewritten in uncon-strained form as: min x ∈ C N (cid:107) Gx − y (cid:107) + ι H K ( T P ( x )) + ι B Γ ρ ( x ) , (25)where H K is the non convex set of matrices with rank lowerthan K defined in (12), B Γ ρ : = { x ∈ C N : (cid:107) x (cid:107) Γ ≤ ρ } is the Γ -ball with radius ρ > , and ι H K : C ( N − P )×( P + ) → { , + ∞} , ι B Γ ρ : C N → { , + ∞} are indicator functions with domains H K and B Γ ρ , respectively. Observe that the unconstrainedoptimisation problem (54) can be written as a sum between a convex and differentiable quadratic term F ( x ) : = (cid:107) Gx − y (cid:107) , x ∈ C N , and a non convex and non differentiable term H ( x ) : = ι H K ( T P ( x )) + ι B Γ ρ ( x ) , x ∈ C N . IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 7
It is moreover easy to see that the gradient of F ∇ F ( x ) = G H ( Gx − y ) , x ∈ C N , (26)is β -Lipschitz continuous with respect to the Γ -norm (24), withLipschitz constant given by β = (cid:107) G H G (cid:107) Γ = sup (cid:8) (cid:107) G H Gx (cid:107) Γ : x ∈ C N , (cid:107) x (cid:107) Γ = (cid:9) = sup (cid:110) (cid:13)(cid:13)(cid:13) Γ / G H G Γ − / ˜ x (cid:13)(cid:13)(cid:13) : ˜ x ∈ C N , (cid:107) ˜ x (cid:107) = (cid:111) = (cid:13)(cid:13)(cid:13) Γ / G H G Γ − / (cid:13)(cid:13)(cid:13) . (27)It is hence possible to optimise (54) by means of proximalgradient descent (PGD) [10], an iterative method alternatingbetween gradient and proximal steps according to the followingupdate equation: x k + ∈ prox Γ τ H ( x k − τ ∇ F ( x k )) , (28)for k ≥ , x ∈ C N , τ > and prox Γ τ H defined in (29). Givena current estimate x k ∈ C N , the update equation (48) decreasesthe value of the objective function (54) by selecting a proximalpoint [10] –with respect to H – of a target located at a distance τ from x k along the direction of steepest descent −∇ F ( x k ) .The operator mapping a point x ∈ C N to its proximal pointswith respect to H is called proximal operator , and is definedas [10]prox Γ τ H ( x ) : C N → P (cid:16) C N (cid:17) , x (cid:55)→ arg min z ∈ C N τ (cid:107) x − z (cid:107) Γ + H ( z ) , (29)where P( C N ) is the power set of C N , and τ > controls therelative importance of H with respect to the squared distance to x measured in terms of the Γ -norm (24). The function H beingnon convex, the proximal operator (29) will in general return multiple proximal points, which can all be used interchangeablyin (48). The convergence of the sequence { x k } k ∈ N of PGDiterates (48) towards critical points of (54) is established bythe following theorem. Theorem 2 (Convergence of PGD for Arbitrary G ) . Assumethat ρ ∈] , + ∞[ in (54) , and τ < / β with β defined in (48) .Then, any limit point x (cid:63) of the sequence { x k } k ∈ N generatedby (48) is a local minimum of (54) .Proof. The proof of this theorem is adapted from [30, Theorem1] and given in Appendix E available as supplementary materialof this manuscript. (cid:3)
As stated by Theorem 3 hereafter, the convergence of PGDfurthermore extends to the case ρ = + ∞ , at least for injectiveforward matrices G . Setting ρ = + ∞ in (23) is equivalent todropping the energy normalisation constraint, since (cid:107) x (cid:107) Γ ≤ + ∞ is trivially verified and hence the associated indicator function ι B Γ ρ in (54) is always null. Theorem 3 (Convergence of PGD for Injective G ) . Assumethat ρ = + ∞ in (54) , τ < / β with β defined in (48) , and G ∈ C L × N in (54) is injective , i.e. ker ( G ) = { N } . Then, any limit point x (cid:63) of the sequence { x k } k ∈ N generated by (48) is a local minimum of (54) .Proof. The proof of this theorem is given in Appendix Eavailable as supplementary material of this manuscript. (cid:3)
A practical implication of Theorem 3 is that, for injectiveforward matrices G , PGD applied to the following relaxedimplicit genFRI problem is convergent: min x ∈ C N (cid:107) Gx − y (cid:107) + ι H K ( T P ( x )) , (30)where F ( x ) : = (cid:107) Gx − y (cid:107) , and H ( x ) : = ι H K ( T P ( x )) . Asdiscussed in Section IV-B, (30) should always be favouredover (54) for injective forward matrices G , since solving it viaPGD requires less computations at each proximal step. B. Cadzow PnP Gradient Descent
As seen in the previous section, PGD requires the compu-tation of the proximal operator (29) at each iteration, whichamounts to finding a minimiser to the following non convexoptimisation problem: ˇ x ∈ arg min z ∈ C N (cid:26) τ (cid:107) x − z (cid:107) Γ + ι H K ( T P ( z )) + ι B Γ ρ ( z ) (cid:27) , (31)for some input x ∈ C N . Observe that the proximal step (31) canbe seen as a generalised projection step, aiming to find a point ˇ x as close as possible –in terms of the Γ -norm – from x whileverifying some convex and non convex constraints specifiedby the indicator functions. This is formalised by Proposition3, which shows that solutions to (31) can be identified withthose of an orthogonal projection problem: Proposition 3.
The proximal operator (29) of H ( x ) : = ι H K ( T P ( x )) + ι B Γ ρ ( x ) , for ρ ∈] , + ∞] and K ≤ P ≤ M isgiven byprox Γ τ H ( x ) = T † P Π T P ∩H K ∩ B ρ T P ( x ) , ∀ x ∈ C N , (32) where B ρ : = { X ∈ C ( N − P )×( P + ) : (cid:107) X (cid:107) F ≤ ρ } and Π T P ∩H K ∩ B ρ is the orthogonal projection operator onto T P ∩H K ∩ B ρ with respect to the Frobenius norm: Π T P ∩H K ∩ B ρ ( X ) : C ( N − P )×( P + ) → P (cid:16) C ( N − P )×( P + ) (cid:17) , X (cid:55)→ arg min H ∈ T P ∩H K ∩ B ρ (cid:107) X − H (cid:107) F . Proof.
The proof of this proposition is given in Appendix Favailable as supplementary material of this manuscript. (cid:3)
Equation (65) provides us with a practical way of computingthe proximal set (29) associated to a point x ∈ C N . Unfortu-nately, the orthogonal projection operator Π T P ∩H K ∩ B ρ admitsno simple closed-form expression. We therefore propose toapproximate it by the method of alternating projections (MAP)(see Section II-C): Π T P ∩H K ∩ B ρ (cid:39) (cid:2) Π T P Π H K Π B ρ (cid:3) n , (33) Observe that the weighting matrix Γ puts more emphasis on the coefficientsthat appear more often in the Toeplitz matrix T P ( x ) . PREPRINT, UNDER REVIEW
Algorithm 1
Cadzow PnP Gradient Descent (CPGD)
Require: y , G , T P , x , K ≤ P , τ < / β, n ∈ N , ρ > k:=0 repeat z k + : = x k − τ G H ( Gx k − y ) if ρ = + ∞ then x k + : = T † P (cid:2) Π T P Π H K (cid:3) n T P ( z k + ) else x k + : = T † P (cid:2) Π T P Π H K Π B ρ (cid:3) n T P ( z k + ) end if k ← k + until a stopping criterion is satisfied return x ( k ) for some n ∈ N . Observe that when ρ = + ∞ (which is possiblefor injective matrices G , see Theorem 3) we have Π B ρ = Id andthe right-hand side of (33) simplifies to (cid:2) Π T P Π H K (cid:3) n . Note thatsince H K is non convex, the convergence as n grows to infinityof the product (cid:2) Π T P Π H K Π B ρ (cid:3) n towards the actual projectionmap Π T P ∩H K ∩ B ρ is not guaranteed in general (see discussionin Section II-C). For the specific case ρ = + ∞ however, it ispossible to apply Theorem 1 to show the local convergence ofthe MAP (33): Corollary 1.
Let Z ∈ H K ∩ T P be a non tangential point [24,Definition 4.3]. Then, for X ∈ C ( N − P )×( P + ) and (cid:15) > , thereexists δ ≥ such that, if (cid:107) X − Z (cid:107) F ≤ δ ,1) (cid:2) Π T P Π H K (cid:3) n ( X ) n →∞ → X ∞ ∈ H K ∩ T P ,2) (cid:13)(cid:13) X ∞ − Π H K ∩ T P ( X ) (cid:13)(cid:13) F < (cid:15) (cid:13)(cid:13) X − Π H K ∩ T P ( X ) (cid:13)(cid:13) F . Proof.
Similarly to the proof of [31, Theorem 7], Corollary 1is obtained by applying Theorem 1 to the manifolds M = R K of matrices with rank exactly K –which is dense in H K [24,Proposition 2.1]– and M = T P . For more details, see the proofof [31, Theorem 7], which discusses the local convergenceof the MAP for H K ∩ H P where H P denotes the space ofrectangular Hankel matrices. Since Hankel matrices are justreflected Toeplitz matrices, the analysis extends to the case ofToeplitz matrices. (cid:3) Roughly speaking, Corollary 1 states that, if applied to amatrix X close enough to a non tangential point of T P ∩ H K (which as discussed in [24] for the case of Hankel matricesare all but very exceptional matrices of T P ∩ H K ), the MAP(33) converges to a point in T P ∩ H K . Moreover, the error (cid:13)(cid:13) X ∞ − Π H K ∩ T P ( X ) (cid:13)(cid:13) F can be made arbitrarily small withrespect to (cid:13)(cid:13) X − Π H K ∩ T P ( X ) (cid:13)(cid:13) F . While difficult to verify inpractice, the local convergence result Corollary 1 reassures uson the well-foundedness of approximation (33).Plugging (33) into (65) finally yields the following approxi-mate proximal step:prox Γ τ H ( x ) (cid:39) T † P (cid:2) Π T P Π H K Π B ρ (cid:3) n T P ( x ) , ∀ x ∈ C N , (34)for some n ≥ . The PGD algorithm with approximate proximalstep (34) is provided in Algorithm 1. Observe that when ρ =+ ∞ , (34) reduces to Cadzow denoising (13). The effect ofheuristic (33) is hence to replace the proximal step in the PGDiterations by a generic denoising step. Such an approach is reminiscent of the plug-and-play (PnP) framework [15], [14]from image processing, which leverages generic denoisers toapproximate complex projection or proximal operators [13]. Forthis reason, we baptise our algorithm Cadzow PnP GradientDescent (CPGD) . In the next section, we study the convergenceof Algorithm 1.
C. Local Fixed-Point Convergence of CPGD
In Section IV-A, we established Theorems 2 and 3 whichshow the convergence of PGD towards critical points of (54).However, such results required the computation of exact proxi-mal steps (29) in the PGD iterations, and do not apply to CPGDwhich leverages the inexact proximal step (34). Convergenceof PGD in non convex setups with inexact proximal steps wasstudied in [12], [32]. The results established in both papersrequire the proximal step approximation errors incurred ateach iteration to be decreasing and summable, which may notnecessarily be the case for the MAP approximation (33). Itis nevertheless possible to demonstrate that the iterations ofCPGD are locally contractive , and therefore locally convergenttowards a fixed point using the Banach contraction principle.Such a result is stated in Theorem 4 hereafter.
Theorem 4 (CPGD is a Local Contraction) . Let R K ⊂ C ( N − P )×( P + ) be the set of matrices of rank exactly K ≤ P ≤(cid:98) N / (cid:99) , and U τ, n : C N → C N the update CPGD mapU τ, n ( x ) : = H n ( x − τ ∇ F ( x )) , x ∈ C N , (35) with H n ( x ) : = T † P [ Π T P Π H K Π B ρ ] n T P ( x ) . Let G ∈ C L × N beinjective, and Γ be the diagonal and positive definite matrixdefined in (44) . Define α : = λ min (cid:16) Γ / G H G Γ − / (cid:17) ,β : = λ max (cid:16) Γ / G H G Γ − / (cid:17) , where λ min ( M ) and λ max ( M ) denote the minimum and maxi-mum eigenvalues of a matrix M respectively.Then, U τ, n is locally well-defined (single-valued) and Lips-chitz continuous with respect to the Γ -norm (cid:13)(cid:13) U τ, n ( x ) − U τ, n ( z ) (cid:13)(cid:13) Γ ≤ L τ (cid:107) x − z (cid:107) Γ , for all x , z ∈ C N such that T P ( x ) and T P ( z ) are in someneighbourhood of some matrix R ∈ R K . The Lipschitz constant L τ is given by L τ = max {| − τα | , | − τ β |} . Moreover, U τ, n is contractive , i.e. ≤ L τ < , for < τ < / β ,and L τ is minimised for τ = /( α + β ) .Proof. The proof of this theorem is given in Appendix Gavailable as supplementary material of this manuscript. (cid:3)
The following corollary shows the local convergence ofCPGD towards a fixed-point of the update map (35):
Corollary 2 (CPGD Converges Locally) . With the samenotations as in Theorem 4, assume that all CPGD iterates { x k } k ∈ N are such that { T P ( x k + ) , T P ( x k )} ⊂ U k , ∀ k ∈ N , (36) IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 9
Figure 1: Illustration of condition (36) in Corollary 2. for some neighbourhood U k of some matrix R k ∈ R K . Assumefurther that < τ < / β . Then, x k k →∞ → x (cid:63) where x (cid:63) ∈ C N isa fixed-point of U τ, n , i.e. U τ, n ( x (cid:63) ) = x (cid:63) . Moreover, we have (cid:107) x (cid:63) − x k (cid:107) Γ ≤ L k τ − L τ (cid:107) x − x (cid:107) Γ , ∀ k ≥ . (37) Proof.
The proof of this theorem is given in Appendix Havailable as supplementary material of this manuscript. (cid:3)
Remark (Speed of Convergence) . From Theorem 4 and (2) ,we see that the sequence { x k } k ∈ N converges the fastest whenL τ is minimised, i.e. τ = /( α + β ) . Remark (Fixed Points vs. Critical Points) . Note that Corollary2 is a much weaker result than Theorems 2 and 3. Indeed,Corollary 2 only shows the local convergence of CPGD towardsfixed points of U τ, n , which may not necessarily be critical pointsof the optimisation problem (54) . Theorems 2 and 3 on theother hand, show the global convergence of PGD with exactproximal step towards critical points of (54) . This is howeverthe price to pay for computing the proximal step (31) efficientlyin practice. Remark (Geometric Interpretation of Condition (36)) . Roughlyspeaking, Corollary 2 guarantees the convergence of CPGDtowards a fixed point of the update map (35) , providedthat the forward matrix G is injective, and that any twoconsecutive lifted estimates T P ( x k ) , T P ( x k + ) , are in a commonneighbourhood U k of some matrix R k ∈ R K . Note that thisis much less stringent than requiring the entire lifted path { T P ( x k )} k ∈ N to belong to some neighbourhood U of somefixed matrix R ∈ R K . Indeed, condition (36) allows the liftedestimates to travel from one neighbourhood of the manifold R K to another, provided that every visited neighbourhoodcontains at least two consecutive lifted estimates (see Figure 1for an illustration). This condition, although difficult to verifyin practice, seems however likely to hold for ρ = + ∞ , smallenough step sizes, large enough n and x = N . Indeed, insuch a case, we have: • T P ( x ) ∈ H K is in some neighbourhood of R K since R K is dense in H K . • For n large enough, T P ( x k ) is very likely to be insome neighbourhood of R K , since the denoising stepin the update map (35) makes T P ( x k ) close to be in theintersection H K ∩ T P (see Corollary 1). • For a small enough step size τ , T P ( x k ) and T P ( x k + ) arelikely to belong to the same neighbourhood of R K . V. E
XPERIMENTS AND R ESULTS
In this section we validate the CPGD method numerically,considering as a testbed the scenario of irregular time samplingfrom [9, Section IV.A]. We assess both the reconstructionaccuracy and the computational complexity of the method, andcompare it to the state-of-the-art.
Remark (Reproducibility) . Special care has been taken intomaking the experiments and simulations of this section fullyreproducible. To reproduce the results, the reader is referredto the routines provided in our GitHub repository [16].A. Reconstruction Accuracy
We define a 1-periodic stream of K = Diracs (see Fig. 2): x ( t ) = (cid:213) m ∈ Z K (cid:213) k = x k δ ( t − t k − m ) , ∀ t ∈ R , (38)where the amplitudes x k ∈ R + and locations t k ∈ [ , [ are random, with log-normal and uniform distributionsrespectively. We then generate L = noisy samples as y l = M (cid:213) m = − M ˆ x m e j π m θ l + (cid:15) l , l = , . . . , L , (39)where ˆ x m = (cid:205) Kk = x k exp (− j π mt k ) , m = − M , . . . , M , are theFourier coefficients of the Dirac stream x , θ l ∈ [ , [ are chosenuniformly at random, and (cid:15) l ∈ R are independent realisationsof a Gaussian random variable N ( , σ ) , for l ∈ (cid:110) , L (cid:111) .As explained in [9, Section IV.A], the measurements y l correspond to noisy samples of the low-pass filtered Diracstream x at irregular times θ l (see Fig. 2), where the low-pass filter is chosen as an ideal low-pass filter with bandwidth M + ≤ L . Using the formalism of Section III, we can rewrite(39) in vector notation as y = Gx + (cid:15) , (40)where y = [ y , . . . , y L ] ∈ C L , x = [ ˆ x − M , . . . , ˆ x M ] ∈ C N = M + , (cid:15) = [ (cid:15) , . . . , (cid:15) L ] ∈ C L , and G ∈ C L × N is given by G = e − j π M θ · · · · · · e j π M θ e − j π M θ · · · · · · e j π M θ ... · · · ... · · · ... e − j π M θ L − · · · · · · e j π M θ L − e − j π M θ L · · · · · · e j π M θ L . Note that from the periodicity of complex exponentials, itis possible to flip the columns of G so as to rewrite it as aVandermonde matrix [9]. This shows that G is injective –since M + ≤ L and the irregular time samples are all distinct.From the samples y and the data model (40), we considerrecovering the Fourier coefficients x ∈ C N by means of threealgorithms: To avoid degenerate cases, the Diracs are required to have a minimumseparation distance of 1 % of the total period. To avoid degenerate cases, the sampling locations are required to have aminimal separation distance of 0.5% of the total period. The noise level is defined as the standard deviation σ of the Gaussiandistribution. . . . . . . − . . . . . . . (a) β = , cutoff frequency M = . . . . . . − . . . . . . . (b) β = , cutoff frequency M = . . . . . . . . . . . . (c) β = , cutoff frequency M = . . . . . . − . . . . . . . (d) β = , cutoff frequency M = Figure 2: Dirac stream with K = sources (dark grey, round coloured heads) and noiseless irregular time samples (light grey,diamond heads), for an oversampling parameter β ∈ { , , , } . • The CPGD algorithm 1 with P = M , ρ = + ∞ (since G is injective) and step size τ = . / β (where β is set asdescribed in Theorem 4). • The state-of-the-art algorithm of Pan et al. [9], describedin Section III-A and referred to hereafter as GenFRI.For smooth integration, the Python 3 implementation ofGenFRI provided by Pan et al. on their official Githubrepository [33] was included in our own algorithmicinterface. Since the noise level is assumed to be unknown,we set –as recommended in [9, Section III-C]– the numberof inner iterations and random initialisations to theirdefault values, 50 and 15 respectively. • The baseline method, referred to hereafter as LS-Cadzow,which consists in applying Cadzow denoising to the least-squares estimate of the Fourier coefficients x LS = argmin x ∈ C N (cid:107) Gx − y (cid:107) , x LS-Cadzow = T † P (cid:2) Π T P Π H K (cid:3) n T P ( x LS ) . (41)We solve the least-squares optimisation problem in (41)by means of the lstsq function in the Python 3 package scipy [34], with cut-off ratio cond = − .For CPGD, we fix the maximum number of iterations to 500and consider that convergence is reached if the iterate normis changed by less than 0.01 % between two iterations. ForCadzow denoising, we fix the number of iterations to 10 forboth LS-Cadzow and CPGD. The reconstruction accuracy In practice this upper bound is never reached: CPGD almost alwaysconverges in less than 150 iterations. is assessed by matching the true Dirac locations t k to therecovered ones, denoted by ω k , for k between and K . To doso, we proceed as explained in Section II-D and infer the Diraclocations ω k from the z-transform roots of the annihilatingfilter associated to the Fourier coefficients estimated by eachmethod. Then, we solve the following matching problem bymeans of the
Hungarian algorithm [35] min j ,..., j K ∈{ ,..., K } (cid:40) K K (cid:213) k = d ( t k , ω j k ) (cid:41) , (42)where d ( t , ω ) = min {| t − ω | , − | t − ω |} , ∀ t , ω ∈ [ , [ ,is the canonical distance on the periodised interval [ , [ .Finally, we report the average positioning error, correspondingto the value of the cost function (cid:205) Kk = d ( t k , ω i k )/ K for theindices { i , . . . , i K } solutions to the matching problem (42).This metric is computed for 192 noise realisations, variouscutoff frequencies M = β K with the oversampling factor β ∈ { , , , } (see Figs. 2a, 2b, 2c and 2d respectively) andvarious noise levels σ = max k = ,..., K | x k | × exp (cid:18) − PSNR (cid:19) , where the peak signal to noise ratio PSNR ranges from −30 dBto 30 dB. The conditioning numbers of the matrix G H G for the See [7, Fig. 2] for additional details on the procedure used to recover theDirac locations from the annihilating filter coefficients. The Hungarian algorithm is implemented in the linear_sum_assignment function from the Python 3 package scipy [34].
IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 11 − − −
10 0 10 20 30
PSNR (dB) − − P o s i t i o n i n g E rr o r ( % o f T ) LS-CadzowCPGDGenFRI (a) β = , cutoff frequency M = − − −
10 0 10 20 30
PSNR (dB) − − P o s i t i o n i n g E rr o r ( % o f T ) LS-CadzowCPGDGenFRI (b) β = , cutoff frequency M = − − −
10 0 10 20 30
PSNR (dB) − − P o s i t i o n i n g E rr o r ( % o f T ) LS-CadzowCPGDGenFRI (c) β = , cutoff frequency M = − − −
10 0 10 20 30
PSNR (dB) − − P o s i t i o n i n g E rr o r ( % o f T ) LS-CadzowCPGDGenFRI (d) β = , cutoff frequency M = Figure 3: Positioning error (42) (in percent of period and log-scale) for LS-Cadzow, CPGD and GenFRI, various oversamplingparameters β ∈ { , , , } and a PSNR in {− , − , − , , , , } dB. For each case, plain lines and shaded areas representrespectively the median and inter-quartile region of the positioning error’s empirical distribution obtained from 192 independentnoise realisations. These results can be reproduced using the Python script reproduce_simulation_results.py locatedin the directory ./benchmarking/ of our GitHub repository [16]. β = β = β = β = κ ( G H G ) × × × Table I: Conditioning number of G H G for various values ofthe oversampling parameter β .different values of the oversampling parameter β are provided inTable I. The results of the experiments are displayed on Figs. 3and 5 –available as supplementary material of this manuscript.In Fig. 3 we plot, for different oversampling factors and PSNR,the median and inter-quartile region of the empirical distributionof the average positioning error of the three methods. In Fig. 5,we plot, for each source, different oversampling factors andPSNR, the median and inter-quartile region of the empiricaldistribution of the source location as estimated by the threemethods against the true source location. The conclusions thatcan be drawn from the results are the following: • Figs. 3a, 5a, 5b and 5c reveal that without oversamplingin the Fourier domain (i.e. β = and a minimal cutofffrequency of M = K = ), the three algorithms CPGD,GenFRI and LS-Cadzow perform similarly throughoutthe entire PSNR range. The average positioning error –almost indistinguishable for the three algorithms– goesfrom approximately 10 % of the period for PSNRs of −30 dB to 1 % of the period for PSNRs of 30 dB. • Figs. 3b, 5d, 5e and 5f reveal that with an oversamplingof β = –yielding a cutoff frequency of M = , thethree algorithms CPGD, GenFRI and LS-Cadzow startbehaving differently for PSNRs larger than 0 dB: CPGDhas the lowest positioning error, followed by LS-Cadzowand finally GenFRI. The inter-quartile regions of thepositioning errors distribution are however overlapping,which means that the differences in performance are notstatistically significant. For PSNRs larger than 0 dB, allalgorithms have a lower positioning error than in the case β = . For high PSNRs, this improvement can be as highas one and a half order of magnitude. • Figs. 3c, 5g, 5h and 5i reveal that with an oversamplingof β = –yielding a cutoff frequency of M = ,the three algorithms CPGD, GenFRI and LS-Cadzowagain behave differently for PSNRs larger than 0 dB:CPGD has the lowest positioning error, followed byGenFRI and finally LS-Cadzow. For high PSNRs, thedifferences in performance among the three algorithmsbecome statistically significant: the inter-quartile regionsof the positioning error’s empirical distribution do notoverlap anymore. CPGD moreover reaches a positioningerror as low as 0.01 % of the period, which is up to twoorders of magnitude smaller than the minimal positioning error of GenFRI or LS-Cadzow in this scenario. ForPSNRs greater than 10 dB, CPGD improves its positioningerror with respect to the case β = by a bit less than halfan order of magnitude. This is not the case for GenFRIand LS-Cadzow which both underperform with respectto the case β = –and even with respect to the case β = for LS-Cadzow. This can be explained by the largeconditioning number of the matrix G H G in this case(see Table I), affecting the numerical stability of bothalgorithms (see remark in Section III-A). • Figs. 3a, 5j, 5k and 5l reveal that with an oversampling of β = –yielding a critical bandwidth of M + = equalto the number of measurements L , CPGD is superior toGenFRI which is itself superior to the baseline methodLS-Cadzow in nearly all cases, with the exception of verylow PSNRs ( ∼ −30 dB), where the three methods havecomparable reconstruction accuracy. For PSNRs largerthan −20 dB, the differences in performance are statisti-cally significant. CPGD is more accurate than GenFRIand LS-Cadzow by a few orders of magnitude (from 1to 3 orders of magnitude for PSNRs larger than −10 dB),reaching a minimal positioning error as low as 0.005 %of the period. For PSNRs greater than −20 dB, CPGDimproves its positioning error with respect to all previouscases β ∈ { , , } . Again, this is not the case for GenFRIand LS-Cadzow which both perform as good as or worsethan the case β = . This can be explained by the (very)large conditioning number of the matrix G H G in thiscase (see Table I), which severely affects the numericalstability of both algorithms.In conclusion, in these simulations CPGD is better at lever-aging oversampling in the Fourier domain to improve thereconstruction accuracy by several orders of magnitude withrespect to the non oversampled case. In particular, CPGDperforms best when the bandwidth of the low-pass filter ischosen as large as the number of measurements. As explainedin Section II-E, Cadzow denoising exhibits a similar behaviour.This similarity is not fortuitous: both algorithms leverage asimilar rank constraint which becomes more and more selective as the oversampling parameter increases. In contrast, GenFRIand LS-Cadzow are negatively affected by large oversamplingparameters, due to numerical stability issues. For the criticalbandwidth M + = L , CPGD notably outperforms GenFRIand LS-Cadzow by one to three orders of magnitude, and thiseven for PSNRs as low as −20 dB. B. Computational Complexity
As explained in Section III-A, solving the linear systems(21) and (22) are the most expensive operations performed ateach iteration of GenFRI, yielding an overall computationalcomplexity of O( N ) . For CPGD, the computational cost ofeach iteration is dominated by the succesive projections onto H K in the approximate proximal step, which are computedvia a SVD –see Algorithm 1 and (14). At a cursory glance, itmay seem that the overall complexity of CPGD is somewhatcomparable to the one of GenFRI, since computing the SVD ofa matrix with size ( N − P )×( P + ) has in general a computational Size N − − E x ec u t i o n T i m e ( s ) CPGD ∝ N . GenFRI ∝ N . Figure 4: Reconstruction times for CPGD and GenFRI forvarious bandwidth sizes N and K = . The reported times arefor a MacBook Pro (16-inch, 2019), Intel Core i7 (6C/12T) @2.6GHz with 32GB RAM. These results can be reproduced us-ing the Python script reproduce_execution_times.py located in the directory ./benchmarking/ of our GitHubrepository [16].complexity of O(( N − P ) ( P + ) + ( P + ) ) [36] which reducesto O( N ) when P = M . In practice however, projecting onto H K does not require to perform a complete SVD since only the K strongest eigenvalues and their associated eigenvectors areneeded. This truncated SVD can be performed very efficientlyby means of the implicitly restarted Arnoldi method (IRAM) [37], or the implicitly restarted Lanczos method for Hermitianmatrices. When K (cid:28) P + , such methods are obviously muchmore efficient than a wasteful standard SVD. IRAM is moreovera matrix-free method [38]: it does not need the processed matrixto be stored in memory but simply requires an algorithm forperforming matrix/vector products. In our context, since thetruncated SVDs are exclusively performed on Toeplitz matrices,such matrix/vector products can be efficiently implemented bymeans of FFTs thanks to the convenient links between Toeplitzmatrices and convolutions outlined in Section II-A.The CPGD implementation provided in our Github repository[16] leverages all these computational tricks. In Fig. 4, weshow that our implementation of CPGD is considerably fasterthan GenFRI. Fig. 4 reports the reconstruction times of CPGDand GenFRI for K = and bandwidth N = β M + = L with β ranging from 1 to 300. To save computational time, thereconstruction times were scaled from the execution time ofa single iteration of CPGD and GenFRI, assuming a typicalnumber of iterations of 100 and 15 × N . while GenFRIscales as N . . Note that this difference in scaling behaviouris overlooked by the complexity analysis above.VI. C ONCLUSION
We propose an implicit version of the generalised finite-rate-of-innovation (genFRI) problem for the recovery of the Fourierseries coefficients of sparse Dirac streams with arbitrary linearsensing. This formulation relies on a novel regularisation termwhich enforces the annihilation of the recovered Fourier series
IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 13 coefficients without explicitly involving the unknown annihi-lating filter. The resulting non convex optimisation problem isconsequently simpler and linear in the data. To solve it, wesuggest a proximal gradient descent (PGD) algorithm whichwe prove converges towards a critical point of the objectivefunction. We further introduce an inexact PGD method, coinedCadzow plug-and-play gradient descent (CPGD), where theintractable proximal steps involved in PGD are approximatedby means of alternating projections, akin to the popular Cadzowdenoising algorithm. We outline the resemblance of CPGDto PnP methods used in image processing and prove its localfixed-point convergence under relatively weak assumptions.Considering the traditional irregular time sampling testbed, wedemonstrate empirically that CPGD outperforms by severalorders of magnitude the state-of-the-art GenFRI algorithm, bothin terms of accuracy and reconstruction time.For future work, we plan on investigating accelerationtechniques for CPGD, such as approximate sketching-basedeigenvalue decomposition methods [39], [40], more com-putationally efficient for large-scale problems. Applicationsof CPGD to acoustics and radio astronomy will also beinvestigated. R
EFERENCES[1] M. Vetterli, J. Kovaˇcevi´c, and V. K. Goyal,
Foundations of signalprocessing . Cambridge University Press, 2014.[2] B. Rimoldi,
Principles of Digital Communication: A Top-Down Approach .Cambridge University Press, 2016.[3] C. E. Shannon, “A mathematical theory of communication,”
ACMSIGMOBILE Mobile Computing and Communications Review , vol. 5,no. 1, pp. 3–55, 2001.[4] M. Vetterli, P. Marziliano, and T. Blu, “Sampling signals with finiterate of innovation,”
IEEE Trans. Signal Process. , vol. 50, no. 6, pp.1417–1428, 2002.[5] T. Blu, P.-L. Dragotti, M. Vetterli, P. Marziliano, and L. Coulot, “Sparsesampling of signal innovations,”
IEEE Signal Process. Mag. , vol. 25,no. 2, pp. 31–40, 2008.[6] J. A. Cadzow, “Signal enhancement-a composite property mappingalgorithm,”
IEEE Trans. Acoust. , vol. 36, no. 1, pp. 49–62, 1988.[7] L. Condat, A. Hirabayashi, L. Condat, and A. Hirabayashi, “Cadzowdenoising upgraded : A new projection method for the recovery of diracpulses from noisy linear measurements,” 2015.[8] P. L. Combettes and J.-C. Pesquet, “Proximal splitting methods in signalprocessing,” in
Fixed-point algorithms for inverse problems in scienceand engineering . Springer, 2011, pp. 185–212.[9] H. Pan, T. Blu, and M. Vetterli, “Towards generalized FRI samplingwith an application to source resolution in radioastronomy,”
IEEE Trans.Signal Process. , vol. 65, no. 4, pp. 821–835, 2017.[10] N. Parikh, S. Boyd et al. , “Proximal algorithms,”
Foundations and Trendsin Optimization , vol. 1, no. 3, pp. 127–239, 2014.[11] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholdingalgorithm for linear inverse problems,”
SIAM journal on imaging sciences ,vol. 2, no. 1, pp. 183–202, 2009.[12] B. Gu, D. Wang, Z. Huo, and H. Huang, “Inexact proximal gradientmethods for non-convex and non-smooth optimization,” in
Thirty-SecondAAAI Conference on Artificial Intelligence , 2018.[13] H. Gupta, K. H. Jin, H. Q. Nguyen, M. T. McCann, and M. Unser, “Cnn-based projected gradient descent for consistent ct image reconstruction,”
IEEE transactions on medical imaging , vol. 37, no. 6, pp. 1440–1453,2018.[14] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plug-and-playpriors for model based reconstruction,” in . IEEE, 2013, pp. 945–948.[15] E. K. Ryu, J. Liu, S. Wang, X. Chen, Z. Wang, and W. Yin, “Plug-and-play methods provably converge with properly trained denoisers,” arXivpreprint arXiv:1905.05406 , 2019.[16] M. Simeoni, A. Besson, P. Hurley, and M. Vetterli, “Pyoneer: continuousrecovery of non-bandlimited periodic signals with finite rates of innova-tion,” https://github.com/matthieumeo/pyoneer, accessed: 2020-04-16. [17] R. Escalante and M. Raydan,
Alternating projection methods . SIAM,2011.[18] J. von Neumann, “The geometry of orthogonal spaces, functionaloperators-vol. ii,”
Annals of Math. Studies , vol. 22, 1950.[19] H. Bauschke, F. Deutsch, H. Hundal, and S.-H. Park, “Accelerating theconvergence of the method of alternating projections,”
Transactions ofthe American Mathematical Society , vol. 355, no. 9, pp. 3433–3461,2003.[20] I. Halperin, “The product of projection operators,”
Acta Sci.Math.(Szeged) , vol. 23, no. 1, pp. 96–99, 1962.[21] H. H. Bauschke and J. M. Borwein, “Dykstra’s alternating projectionalgorithm for two sets,”
Journal of Approximation Theory , vol. 79, no. 3,pp. 418–443, 1994.[22] A. S. Lewis, D. R. Luke, and J. Malick, “Local linear convergencefor alternating and averaged nonconvex projections,”
Foundations ofComputational Mathematics , vol. 9, no. 4, pp. 485–513, 2009.[23] A. S. Lewis and J. Malick, “Alternating projections on manifolds,”
Mathematics of Operations Research , vol. 33, no. 1, pp. 216–234, 2008.[24] F. Andersson and M. Carlsson, “Alternating projections on low-dimensional manifolds.”[25] M. K. Tam, “Alternating projection methods failure in the absence ofconvexity.”[26] C. Riche de Prony, “Essai expérimental et analytique sur les lois de ladilatabilité des fluides élastique et sur celles de la force expansive de lavapeur de l’eau et de la vapeur de l’alkool, à différentes températures,”
J. de l’École Polytechnique , vol. 1, pp. 24–76, 1795.[27] C. Eckart and G. Young, “The approximation of one matrix by anotherof lower rank,”
Psychometrika , vol. 1, no. 3, pp. 211–218, 1936.[28] H. Pan, M. Simeoni, P. Hurley, T. Blu, and M. Vetterli, “Leap: Lookingbeyond pixels with continuous-space estimation of point sources,”
Astronomy & Astrophysics , vol. 608, p. A136, 2017.[29] T. F. Chan and C.-K. Wong, “Convergence of the alternating minimizationalgorithm for blind deconvolution,”
Linear Algebra and its Applications ,vol. 316, no. 1-3, pp. 259–285, 2000.[30] H. Li and Z. Lin, “Accelerated proximal gradient methods for nonconvexprogramming,” in
Advances in neural information processing systems ,2015, pp. 379–387.[31] F. Andersson, M. Carlsson, and P.-A. Ivert, “A fast alternating pro-jection method for complex frequency estimation,” arXiv preprintarXiv:1107.2028 , 2011.[32] Q. Yao, J. T. Kwok, F. Gao, W. Chen, and T.-Y. Liu, “Efficient inexactproximal gradient algorithm for nonconvex problems,” arXiv preprintarXiv:1612.09069 , 2016.[33] H. Pan, T. Blu, M. Vetterli, and F. Dümbgen, “Unified algorithmicframework for reconstructing signals with finite rate of innovation,”https://github.com/hanjiepan/FRI_pkg, accessed: 2019-10-18.[34] E. Jones, T. Oliphant, P. Peterson et al.
Navalresearch logistics quarterly , vol. 2, no. 1-2, pp. 83–97, 1955.[36] G. H. Golub and C. F. Van Loan,
Matrix computations . JHU press,2012, vol. 3.[37] R. B. Lehoucq and D. C. Sorensen, “Deflation techniques for animplicitly restarted arnoldi iteration,”
SIAM Journal on Matrix Analysisand Applications , vol. 17, no. 4, pp. 789–821, 1996.[38] S. Diamond and S. Boyd, “Matrix-free convex optimization modeling,” in
Optimization and its applications in control and data sciences . Springer,2016, pp. 221–264.[39] D. P. Woodruff et al. , “Sketching as a tool for numerical linear algebra,”
Foundations and Trends in Theoretical Computer Science , vol. 10, no.1–2, pp. 1–157, 2014.[40] P. Indyk, A. Vakilian, and Y. Yuan, “Learning-based low-rank approxi-mations,” in
Advances in Neural Information Processing Systems , 2019,pp. 7400–7410.[41] A. B. Taylor, J. M. Hendrickx, and F. Glineur, “Exact worst-caseconvergence rates of the proximal gradient method for composite convexminimization,”
Journal of Optimization Theory and Applications , vol.178, no. 2, pp. 455–476, 2018.[42] A. Jung, “A fixed-point of view on gradient methods for big data,”
Frontiers in Applied Mathematics and Statistics , vol. 3, p. 18, 2017.[43] F. R. Deutsch,
Best approximation in inner product spaces . SpringerScience & Business Media, 2012.[44] D. R. Luke, “Finding best approximation pairs relative to a convexand prox-regular set in a hilbert space,”
SIAM Journal on Optimization ,vol. 19, no. 2, pp. 714–739, 2008. [45] M. J. Lai and A. Varghese, “On convergence of the alternating projectionmethod for matrix completion and sparse recovery problems,” arXivpreprint arXiv:1711.02151 , 2017. A PPENDIX AT HE T OEPLITZIFICATION O PERATOR AND C ONVOLUTIONS
Consider the Toeplitzification operator defined in (2). Whenmultiplied with a vector u = [ u , · · · u P + ] ∈ C P + , the matrix T P ( x ) returns the valid part of the convolution between thetwo zero-padded sequences: ˜ x : = (cid:104) · · · , , x − M , · · · , x , · · · , x M , , · · · (cid:105) ∈ C Z and ˜ u : = (cid:104) · · · , , u , · · · , u P + , , · · · (cid:105) ∈ C Z . Indeed, ( ˜ x ∗ ˜ u )[ k ] : = (cid:213) j ∈ Z ˜ x k − j ˜ u j = P + (cid:213) j = ˜ x k − j u j , k ∈ Z . The valid part corresponds to the indices k ∈ Z for which allthe terms in the summation are nonzero. This is the case when k ∈ {− M + P + , . . . , M + } . Consider i = k + M − P we get that the valid part of theconvolution is given by ( ˜ x ∗ ˜ u )[ i − M + P ] = P + (cid:213) j = x − M + P + i − j u j , = P + (cid:213) j = [ T P ( x )] i , j u j , i = , . . . , N − P , which corresponds precisely to T P ( x ) u . Using similar arguments, it is easy to show that themultiplication of T P ( x ) H ∈ C ( P + )×( N − P ) with a vector v = [ v , · · · , v N − P ] T ∈ C N − P returns the valid part of the cross-correlation ( ˜ x (cid:63) ˜ v )[ k ] : = (cid:213) j ∈ Z ˜ x ∗ j − k ˜ u j , k ∈ Z , between ˜ x and the zero-padded sequence ˜ v : = (cid:104) . . . , , v , . . . , v N − P , , . . . (cid:105) ∈ C Z . This time however, the valid part corresponds to the indices: k ∈ { M + − P , . . . , M + } . (cid:3) A PPENDIX BP ROOF OF P ROPOSITION H ∈ C ( N − P )×( P + ) and the followingFrobenius inner product (cid:104) T P ( x ) , H (cid:105) F = tr (cid:16) T HP ( x ) H (cid:17) = N − P (cid:213) i = P + (cid:213) k = T P ( x ) ik H ik = N − P (cid:213) i = P + (cid:213) k = ¯ x − M + P + i − k H iks = i − k + P = N − (cid:213) s = x − M + s (cid:32) (cid:213) i = k + s − P H ik (cid:33) (43)The term (cid:205) i = k + ( s − P ) H ik sums the elements of H along lineswith equation i = k + ( s − P ) . These lines have slope 1 andintercept b = s − P . Notice that these lines have nonnullintersection with the lattice ( k , i ) ∈ (cid:110) , P + (cid:111) × (cid:110) , N − P (cid:111) for b ∈ (cid:110) − P , N − P − (cid:111) . Indeed, the two extreme cases occurwhen the lines hit the points ( , N − P ) and ( P + , ) . Thishappens respectively when + b = N − P ⇒ b = N − P − and P + + b = ⇒ b = − P . Since s ∈ (cid:110) , N − (cid:111) the intercept b varies indeed in the range (cid:110) − P , N − P − (cid:111) and each termin the summation is nonnull. The summation (cid:205) i = k + ( s − P ) H ik corresponds then to summing across each diagonal of H . Wefinally get: (cid:104) T P ( x ) , H (cid:105) F = (cid:10) x , T ∗ P ( H ) (cid:11) , with T ∗ P : C ( N − P )×( P + ) → C N H (cid:55)→ h j = (cid:205) i = k + j − − P H ik , j = , . . . , N . (cid:3) A PPENDIX CP ROOF OF P ROPOSITION T P , it is straightforward toobserve that the operator Γ = T ∗ P T P : C N → C N is a diagonalmatrix, with diagonal entries given by: Γ i , i = i for i ≤ PP + for P < i ≤ N − PN + − i for N − P < i ≤ N . (44)The operator T † P = Γ − T ∗ P is hence a left inverse for T P : T † P T P = Γ − T ∗ P T P = ( T ∗ P T P ) − T ∗ P T P = I N . (45)Moreover, the latter is actually the pseudoinverse of T P . Indeed,we have trivially: T P T † P T P = T P , T † P T P T † P = T † P , ( T † P T P ) ∗ = T † P T P . Finally, we have ( T P T † P ) ∗ = T P Γ − H T ∗ P = T P T † P , (46) For n , m ∈ Z , n < m , we denote by (cid:110) n , m (cid:111) the 1D lattice [ n , m ] ∩ Z . IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 15 since Γ is diagonal and hence symmetric. T † P verifies thus thedefinition of the pseudoinverse of T P . (cid:3) A PPENDIX DP ROJECTION ON THE S UBSPACE OF T OEPLITZ M ATRICES
The operator T P is actually a surjection onto the subspace T P ⊂ C ( N − P )×( P + ) of rectangular Toeplitz matrices with size ( N − P ) × ( P + ) . Indeed, it is easy to see that every suchmatrix can be written as in (2) for some generator x ∈ C N .Moreover, we have from (45) that T † P T P = I N and hence from[1, Theorem 2.29], T P T † P is a projection operator onto the range T P of T P . Since T P T † P is moreover self-adjoint from (46), itis actually an orthogonal projection operator, which achievesthe proof. (cid:3) A PPENDIX EP ROOFS OF T HEOREMS AND
Lemma 1.
Consider the norm (cid:107) x (cid:107) : = (cid:112) (cid:104) x , x (cid:105) , x ∈ R n ,induced by some inner product (cid:104)· , ·(cid:105) on R n . Consider moreoverthe general problem: min x ∈ R n Φ ( x ) : = F ( x ) + H ( x ) , (47) where F : R n → R ∪ { + ∞} and H : R n → R ∪ { + ∞} arepotentially non convex functions such that:(A) F is a proper function, i.e. its domain is non empty, differentiable and with Lipschitz continuous gradient forsome Lipschitz constant ≤ β < + ∞ , (cid:107)∇ F ( x ) − ∇ F ( y )(cid:107) ≤ β (cid:107) x − y (cid:107) , ∀ x , y ∈ R n . (B) H is a proper and lower semicontinuous (lwsc) function(see supplementary material of [30] for a definition),potentially non smooth .(C) Φ = F + H is coercive , i.e. Φ is bounded from below and lim (cid:107) x (cid:107)→ + ∞ Φ ( x ) = + ∞ . Then, the iterates { x k } k ∈ N generated by the proximal gradientdescent applied to (47) : x k + ∈ prox τ H ( x k − τ ∇ F ( x k )) , k ≥ , (48) with τ < / β and x ∈ R n , are bounded . Moreover, any limitpoint x (cid:63) of { x k } k ∈ N is a local minimum of Φ .Proof. The lemma is easily shown by specifying the proof of[30, Theorem 1] to the non-accelerated case. For the sake ofcompleteness, it is provided hereafter. From the definition of the proximal operator,prox τ H ( x ) : R n → P ( R n ) , x (cid:55)→ arg min z ∈ R n τ (cid:107) x − z (cid:107) + H ( z ) , we can reinterpret (48) as x k + ∈ arg min z ∈ R n τ (cid:107) z − x k (cid:107) + (cid:104)∇ F ( x k ) , z − x k (cid:105) + H ( z ) . (49)We have hence τ (cid:107) x k + − x k (cid:107) + (cid:104)∇ F ( x k ) , x k + − x k (cid:105) + H ( x k + ) ≤ H ( x k ) . From the Lipschitz continuity of ∇ F we have moreover Φ ( x k + ) ≤ H ( x k + ) + F ( x k ) + (cid:104)∇ F ( x k ) , x k + − x k (cid:105) + β (cid:107) x k + − x k (cid:107) ≤ H ( x k ) − τ (cid:107) x k + − x k (cid:107) − (cid:104)∇ F ( x k ) , x k + − x k (cid:105) + F ( x k ) + (cid:104)∇ F ( x k ) , x k + − x k (cid:105) + β (cid:107) x k + − x k (cid:107) = Φ ( x k ) − (cid:18) τ − β (cid:19) (cid:107) x k + − x k (cid:107) . (50)Since τ < / β we have hence ( / τ − β / ) ≥ and Φ ( x k + ) ≤ Φ ( x k ) ≤ Φ ( x ) , ∀ k ≥ . The sequence { Φ ( x k )} k ∈ N is hence bounded and since Φ iscoercive so is { x k } k ∈ N . The sequence { x k } k ∈ N admits hencelimit points. Moreover, since Φ ( x k ) is decreasing and boundedfrom below, it takes the same value Φ (cid:63) ∈ R at all of theselimit points. Summing (50), we obtain hence: (cid:18) τ − β (cid:19) + ∞ (cid:213) k = (cid:107) x k + − x k (cid:107) ≤ Φ ( x ) − Φ (cid:63) < + ∞ . Since τ < / β , we have necessarily (cid:205) + ∞ k = (cid:107) x k + − x k (cid:107) < + ∞ ,which yields lim k → + ∞ (cid:107) x k + − x k (cid:107) = . (51)From the optimality condition (49) and Items 1 and 3 ofProposition 1 of the supplementary material of [30], we havemoreover n ∈∇ F ( x k ) + τ ( x k + − x k ) + ∂ H ( x k + ) = ∂ Φ ( x k + ) − ∇ F ( x k + ) + ∇ F ( x k ) + τ ( x k + − x k ) , (52)where ∂ H : R n → P( R n ) and ∂ Φ : R n → P( R n ) denote the(set-valued) subdifferential operators of H and Φ respectively(see Definition 2 of the supplementary material of [30]).Equation (52) can moreover be rewritten as ∇ F ( x k + ) − ∇ F ( x k ) − τ ( x k + − x k ) ∈ ∂ Φ ( x k + ) . Furthermore, from the Lipschitz continuity of F , we have (cid:107)∇ F ( x k + ) − ∇ F ( x k ) − τ ( x k + − x k )(cid:107) ≤ (cid:18) β + τ (cid:19) (cid:107) x k + − x k (cid:107) , and hence from (51): lim (cid:107) x (cid:107)→ + ∞ (cid:13)(cid:13)(cid:13)(cid:13) ∇ F ( x k + ) − ∇ F ( x k ) − τ ( x k + − x k ) (cid:13)(cid:13)(cid:13)(cid:13) = . (53) Let { x k j } j ∈ N be a convergent subsequence of { x k } k ∈ N , withlimit x (cid:63) . Then, we have from (53) and Item 2 of Proposition1 of the supplementary material of [30]: n ∈ lim j → + ∞ ∂ Φ ( x k j ) = ∂ Φ ( x (cid:63) ) , which completes the proof. (cid:3) We now show Theorems 2 and 3 by applying Lemma 1 tothe implicit genFRI problem in unconstrained form (54) min x ∈ C N (cid:107) Gx − y (cid:107) + ι H K ( T P ( x )) + ι B Γ ρ ( x ) . (54)To do so, we must first convert (54) into an optimisationproblem of the form (47), defined over R n for some n ∈ N . Weachieve this by proceeding as in [2, Section 7.8] and identifying C N with R N (respectively C L with R L ) in the canonical way x ∈ C N ↔ ˆ x : = (cid:20) R ( x ) I ( x ) (cid:21) ∈ R N , where R and I denote the real and imaginary parts respectively.Such an identification makes the canonical inner products andnorms on C N and R N (respectively C L and R L ) consistentwith one another, i.e. for all x , z ∈ C N , we have (cid:104) x , z (cid:105) C N = z H x = R ( z ) T R ( x ) + I ( z ) T I ( x ) = ˆ z T ˆ x = (cid:104) ˆ x , ˆ z (cid:105) R N , and (cid:107) x (cid:107) C N = √ x H x = (cid:113) (cid:107) R ( x )(cid:107) R N + (cid:107) I ( x )(cid:107) R N = (cid:107) ˆ x (cid:107) R N . Still following [2, Section 7.8], we moreover identify the linearmap G : C N → C L with a linear map ˆ G : R N → R L withmatrix representation: ˆ G : = (cid:20) R ( G ) − I ( G ) I ( G ) R ( G ) (cid:21) ∈ R L × N . Again, it is easy to show that the two operators are consistent,in the sense that (cid:99) Gx = ˆ G ˆ x , and (cid:154) G H x = ˆ G T ˆ x , ∀ x ∈ C N . Similarly, the Toeplitzification operator T P : C N → C ( N − P )×( P + ) is identified with the linear operator ˆ T P : R N → C ( N − P )×( P + ) defined as ˆ T P ( ˆ x ) : = T P ( R ( x )) + jT P ( I ( x )) , ∀ x ∈ C N , where j is the complex 2-root of unity. From the linearity of T P , this definition yields indeed T P ( x ) = ˆ T P ( ˆ x ) . Finally, the Γ -ball B Γ ρ ⊂ C N is identified with B ˆ Γ ρ : = (cid:8) ˆ x ∈ R N : (cid:107) ˆ x (cid:107) ˆ Γ ≤ ρ (cid:9) , where ˆ Γ ∈ R N × N is a positive definite and diagonal matrixdefined as ˆ Γ : = (cid:20) Γ N × N N × N Γ (cid:21) . Again, we trivially have (cid:107) ˆ x (cid:107) ˆ Γ = (cid:107) x (cid:107) Γ and hence x ∈ B Γ ρ ⇔ ˆ x ∈ B ˆ Γ ρ for all x ∈ C N . In summary, the optimisation problem (54) is hence equivalentto the following optimisation problem with search space R N : min ˆ x ∈ R N (cid:13)(cid:13) ˆ G ˆ x − ˆ y (cid:13)(cid:13) R L + ι H K (cid:16) ˆ T P ( ˆ x ) (cid:17) + ι B ˆ Γ ρ ( ˆ x ) . (55)Letting ˆ F ( ˆ x ) : = (cid:13)(cid:13) ˆ G ˆ x − ˆ y (cid:13)(cid:13) R L and ˆ H ( ˆ x ) : = ι H K (cid:16) ˆ T P ( ˆ x ) (cid:17) + ι B ˆ Γ ρ ( ˆ x ) we have ˆ F : R N → R + and ˆ H : R N → { , + ∞} ,so that (55) is indeed of the form (47). We must now verifyassumptions (A), (B) and (C) of Lemma 1: (A) ˆ F is proper, differentiable and ∇ ˆ F Lipschitz continuous. ˆ F is proper since ˆ F ( N ) = (cid:107) ˆ y (cid:107) R L = (cid:107) y (cid:107) C L < + ∞ . It is differentiable, with gradient given by ∇ ˆ F ( ˆ x ) = G T ( ˆ G ˆ x − ˆ y ) = (cid:155) ∇ F ( x ) , ˆ x ∈ R N . (56)The gradient (56) is moreover ˆ β - Lipschitz continuous withrespect to the norm (cid:107) · (cid:107) ˆ Γ on R N , and its Lipschitz constantis given by: ˆ β = (cid:107) ˆ G T ˆ G (cid:107) ˆ Γ = sup (cid:8) (cid:13)(cid:13) ˆ G T ˆ G ˆ x (cid:13)(cid:13) ˆ Γ : ˆ x ∈ R N , (cid:107) ˆ x (cid:107) ˆ Γ = (cid:9) = sup (cid:8) (cid:13)(cid:13) G H Gx (cid:13)(cid:13) Γ : x ∈ C N , (cid:107) x (cid:107) Γ = (cid:9) = β < + ∞ . (57) (cid:3) (B) ˆ H is proper and lower semicontinuous. ˆ H is proper sincefor all ρ > , and K ∈ N , ˆ H ( N ) = ι H K (cid:0) ( N − P )×( P + ) (cid:1) + ι B ˆ Γ ρ ( N ) = < + ∞ . The indicator functions are moreover lower semicontinuoussince the sets H K and B ˆ Γ ρ are both closed. Since T P is abounded linear operator, it is continuous and hence ˆ H is indeedlower semicontinuous as composition between continuous andlower semicontinuous functions. (cid:3) (C) ˆ Φ = ˆ F + ˆ H is coercive.
It is easy to see that ˆ Φ = ˆ F + ˆ H ≥ . To show that ˆ Φ is coercive, it is hence sufficient to showthat lim (cid:107) ˆ x (cid:107) ˆ Γ → + ∞ ˆ Φ ( ˆ x ) = + ∞ . To this end, we distinguish two cases, which correspondrespectively to the assumptions of Theorems 2 and 3:1) ρ ∈] , + ∞[ : in this case ˆ Φ is trivially coercive since ι B ˆ Γ ρ ( ˆ x ) = + ∞ , ∀ (cid:107) ˆ x (cid:107) ˆ Γ ≥ ρ. ρ = + ∞ and G injective: When ρ = + ∞ , the term ι B ˆ Γ ρ isalways null and ˆ Φ simplifies to ˆ Φ ( ˆ x ) = (cid:13)(cid:13) ˆ G ˆ x − ˆ y (cid:13)(cid:13) R L + ι H K (cid:16) ˆ T P ( ˆ x ) (cid:17) , ˆ x ∈ R N . From [2, Section 7.8], we have moreover that det (cid:16) ˆ G T ˆ G (cid:17) = | det ( G H G )| (cid:44) , (58) IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 17 since G is injective by assumption. From the reversetriangle inequality, we have hence (cid:13)(cid:13) ˆ G ˆ x − ˆ y (cid:13)(cid:13) R L ≥ σ min (cid:107) ˆ x (cid:107) R N − (cid:107) ˆ y (cid:107) R L , ∀ ˆ x ∈ R N , where σ min = (cid:113) λ min ( ˆ G T ˆ G ) > is the square root of theeigenvalue of ˆ G T ˆ G with lowest magnitude, which is nonnull from (58). From the equivalence of norms in finitedimensions, there exist moreover c , c > such that c (cid:107) ˆ x (cid:107) ˆ Γ ≤ (cid:107) ˆ x (cid:107) R N ≤ c (cid:107) ˆ x (cid:107) ˆ Γ , ∀ ˆ x ∈ R N . This yields (cid:13)(cid:13) ˆ G ˆ x − ˆ y (cid:13)(cid:13) R L ≥ σ min c (cid:107) ˆ x (cid:107) ˆ Γ − (cid:107) ˆ y (cid:107) R L , ∀ ˆ x ∈ R N , and hence lim (cid:107) ˆ x (cid:107) ˆ Γ → + ∞ (cid:13)(cid:13) ˆ G ˆ x − ˆ y (cid:13)(cid:13) R L ≥ lim (cid:107) ˆ x (cid:107) ˆ Γ → + ∞ σ min c (cid:107) ˆ x (cid:107) ˆ Γ = + ∞ , which shows that ˆ Φ is indeed coercive. (cid:3) We can hence apply Lemma 1, to show that the iterates { ˆ x k } k ∈ N ⊂ R N generated by PGD applied to (55): ˆ x k + ∈ prox ˆ Γ τ ˆ H (cid:16) ˆ x k − τ ∇ ˆ F ( ˆ x k ) (cid:17) , (59)with τ < / ˆ β and ˆ x ∈ R N , are bounded. Moreover, anylimit point ˆ x (cid:63) of { ˆ x k } k ∈ N is a critical point of (55), i.e. N ∈ ∂ ˆ Φ ( ˆ x (cid:63) ) .Observe finally, that the iterations (59) can be rewritten incomplex form as x k + ∈ prox Γ τ H ( x k − τ ∇ F ( x k )) , (60)with τ < / β and x ∈ C N , and where we have used (56),(57) and prox ˆ Γ τ ˆ H ( ˆ x ) = arg min ˆ z ∈ R N τ (cid:107) ˆ x − ˆ z (cid:107) Γ + ˆ H ( z ) = (cid:156) prox Γ τ H ( x ) , ∀ ˆ x ∈ R N , which follows trivially from the previous identifications. Byidentification and equivalence between the real and complexoptimisation problems (55) and (54), we can hence concludethat limit points of the iterates { x k } k ∈ N ⊂ C N generated by(60) are critical points of (54), which achieves the proof. (cid:3) A PPENDIX FP ROOF OF P ROPOSITION x ∈ C N :prox Γ τ H ( x ) = arg min z ∈ C N (cid:26) τ (cid:107) x − z (cid:107) Γ + ι H K ( T P ( z )) + ι B Γ ρ ( z ) (cid:27) . (61) When mapped via the Toeplitzification operator T P , theproximal set (61) becomes T P (cid:16) prox Γ τ H ( x ) (cid:17) == (cid:8) T P ( ˇ x ) , ˇ x ∈ prox Γ τ H ( x ) (cid:9) = (cid:110) ˇ X ∈ T P , T † P ( ˇ X ) ∈ prox Γ τ H ( x ) (cid:111) = arg min Z ∈ T P (cid:26) τ (cid:107) T † P ( Z ) − x (cid:107) Γ + ι H K ( Z ) + ι B Γ ρ (cid:16) T † P ( Z ) (cid:17)(cid:27) = arg min Z ∈ T P ∩H K (cid:26) τ (cid:107) T † P ( Z ) − x (cid:107) Γ + ι B Γ ρ (cid:16) T † P ( Z ) (cid:17)(cid:27) , (62)where we have used the fact that T † P T P ( z ) = z for all z ∈ C N .We have moreover, ∀ Z ∈ T P : (cid:107) T † P ( Z ) − x (cid:107) Γ = (cid:104) Γ T † P ( Z ) − x , T † P ( Z ) − x (cid:105) = (cid:104) Γ T † P ( Z ) , T † P ( Z )(cid:105) + (cid:104) Γ x , x (cid:105) − (cid:104) Γ T † P ( Z ) , x (cid:105) − (cid:104) Γ x , T † P ( Z )(cid:105) = (cid:104) ΓΓ − T ∗ P ( Z ) , Γ − T ∗ P ( Z )(cid:105) + (cid:104) T ∗ P T P ( x ) , x (cid:105) − (cid:104) ΓΓ − T ∗ P ( Z ) , x (cid:105) − (cid:104) Γ x , Γ − T ∗ P ( Z )(cid:105) = (cid:104) Z , T P Γ − T ∗ P ( Z )(cid:105) F + (cid:104) T P ( x ) , T P ( x )(cid:105) F − (cid:104) Z , T P ( x )(cid:105) F − (cid:104) T P ( x ) , Z (cid:105) F = (cid:104) Z , Π T P ( Z )(cid:105) F + (cid:107) T P ( x )(cid:107) F − (cid:104) Z , T P ( x )(cid:105) F − (cid:104) T P ( x ) , Z (cid:105) F = (cid:104) Z , Z (cid:105) F + (cid:107) T P ( x )(cid:107) F − (cid:104) Z , T P ( x )(cid:105) F − (cid:104) T P ( x ) , Z (cid:105) F = (cid:107) Z (cid:107) F + (cid:107) T P ( x )(cid:107) F − R ((cid:104) Z , T P ( x )(cid:105) F ) = (cid:107) Z − T P ( x )(cid:107) F , (63)where we have used the fact that Γ = Γ H = T ∗ P T P and T P Γ − T ∗ P = Π T P (see Appendices C and D). With similararguments, we have ∀ Z ∈ T P : (cid:13)(cid:13)(cid:13) T † P ( Z ) (cid:13)(cid:13)(cid:13) Γ ≤ ρ ⇔ (cid:113) (cid:104) Γ T † P ( Z ) , T † P ( Z )(cid:105) ≤ ρ ⇔ (cid:112) (cid:104) Z , Z (cid:105) F ≤ ρ ⇔ (cid:107) Z (cid:107) F ≤ ρ. so that ι B Γ ρ (cid:16) T † P ( Z ) (cid:17) = ι B ρ ( Z ) , ∀ Z ∈ T P , (64)where B ρ : = (cid:8) Z ∈ C ( N − P )×( P + ) : (cid:107) Z (cid:107) F ≤ ρ (cid:9) . Plugging (63)and (64) into (62) hence yields T P (cid:16) prox Γ τ H ( x ) (cid:17) == arg min Z ∈ T P ∩H K (cid:26) τ (cid:107) T † P ( Z ) − x (cid:107) Γ + ι B Γ ρ (cid:16) T † P ( Z ) (cid:17)(cid:27) = arg min Z ∈ T P ∩H K (cid:26) τ (cid:107) Z − T P ( x )(cid:107) F + ι B ρ ( Z ) (cid:27) = arg min Z ∈ T P ∩H K ∩ B ρ (cid:26) τ (cid:107) Z − T P ( x )(cid:107) F (cid:27) = arg min Z ∈ T P ∩H K ∩ B ρ (cid:107) Z − T P ( x )(cid:107) F = Π T P ∩H K ∩ B ρ T P ( x ) . (65) Using the fact that T † P T P = I N we can finally rewrite (65) asprox Γ τ H ( x ) = T † P Π T P ∩H K ∩ B ρ T P ( x ) , which completes the proof. (cid:3) A PPENDIX GP ROOF OF T HEOREM (cid:96) canonical norm. Lemma 2in contrast assumes the Γ -norm as underlying norm, since thelatter is more natural for our particular problem. Lemma 2 (Contractive Gradient Descent) . Let G ∈ C L × N beinjective, and Γ be the diagonal and definite positive matrixdefined in (44) . Define α : = λ min (cid:16) Γ / G H G Γ − / (cid:17) , (66) β : = λ max (cid:16) Γ / G H G Γ − / (cid:17) , (67) where λ min ( M ) and λ max ( M ) denote the minimum and maxi-mum eigenvalue of a matrix M respectively. Let τ ∈ R be apositive constant and consider the linear mapD τ : (cid:40) C N → C N , x (cid:55)→ x − τ G H ( Gx − y ) , (68) for some y ∈ C L . Then, D τ is Lipschitz continuous with respectto the norm induced by Γ : (cid:107) D τ ( x ) − D τ ( z )(cid:107) Γ ≤ L τ (cid:107) x − z (cid:107) Γ , ∀ x , z ∈ C N , with Lipschitz contant:L τ = max {| − τα | , | − τ β |} . (69) Moreover, D τ is contractive, i.e. ≤ L τ < , for < τ < / β ,and L τ is minimised for τ = /( α + β ) .Proof. We have (cid:107) D τ ( x ) − D τ ( z )(cid:107) Γ = (cid:13)(cid:13) ( I N − τ G H G )( x − z ) (cid:13)(cid:13) Γ ≤ (cid:13)(cid:13) I N − τ G H G (cid:13)(cid:13) Γ (cid:107) x − z (cid:107) Γ , = L τ (cid:107) x − z (cid:107) Γ where the Lipschitz constant L τ : = (cid:13)(cid:13) I N − τ G H G (cid:13)(cid:13) Γ > isthe operator norm of I N − τ G H G induced by the Γ -norm on C N : (cid:13)(cid:13) I N − τ G H G (cid:13)(cid:13) Γ = sup (cid:107) x (cid:107) Γ = (cid:107) (cid:16) I N − τ G H G (cid:17) x (cid:107) Γ = sup (cid:107) x (cid:107) Γ = (cid:13)(cid:13)(cid:13) Γ / (cid:16) I N − τ G H G (cid:17) x (cid:13)(cid:13)(cid:13) = sup (cid:107) ˜ x (cid:107) = (cid:13)(cid:13)(cid:13) Γ / (cid:16) I N − τ G H G (cid:17) Γ − / ˜ x (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) I N − τ Γ / G H G Γ − / (cid:13)(cid:13)(cid:13) . (70) Note that since G is injective, G H G is positive definiteand hence we easily get [42] that the eigenvalues of I N − τ Γ / G H G Γ − / are contained in the interval [ − τ β, − τα ] , where β ≥ α > are defined in (66) and (67) respectively. Itsspectral norm is hence given by: (cid:13)(cid:13)(cid:13) I N − τ Γ / G H G Γ − / (cid:13)(cid:13)(cid:13) = max {| − τα | , | − τ β |} , which proves (69). Finally, the restriction on τ for L τ to besmaller than one follows from basic algebra, and is discussedin [41]. (cid:3) The second lemma states that in a Hilbert space, orthogonalprojection maps onto closed convex sets are non-expansive.This is a known result of approximation theory [43], [44].
Lemma 3 (Non-Expansiveness of Closed Convex Projections) . Let H be some Hilbert space with some inner-product norm (cid:107) · (cid:107) and C ⊂ H a closed , convex set. Then the orthogonalprojection map onto C , defined as Π C ( x ) = arg min z ∈C (cid:107) x − z (cid:107) , ∀ x ∈ H , is non-expansive , i.e. (cid:107) Π C ( x ) − Π C ( z )(cid:107) ≤ (cid:107) x − z (cid:107) , ∀ x , z ∈ H . Proof.
Lemma 3 is proven in [43, Theorem 5.5]. (cid:3)
The third lemma states that the singular value projectionmap Π H k is locally non-expansive in every neighbourhood ofthe manifold of matrices with rank exactly k . Lemma 4 (Local Non-Expansiveness of the Singular ValueProjection) . Let C m × n be the space of complex-valued rectan-gular matrices of size m × n, and H k ⊂ C m × n , R k ⊂ C m × n thesets of matrices with rank at most and exactly k ≤ max { m , n } respectively. Denote further by Π H k the orthogonal projectionmap onto H k given in (14) . Then, for every R ∈ R k , the map Π H k is well-defined (single-valued) and locally non-expansive (cid:107) Π H k ( X ) − Π H k ( Z )(cid:107) F ≤ (cid:107) X − Z (cid:107) F , ∀ X , Y ∈ U , for some neighbourhood U (cid:51) R .Proof. Since R k is dense in H k [24, Proposition 2.1], we have Π H k = Π R k in a neighbourhood W of every R ∈ R k (see[23, Example 2.3] for a detailed proof of this fact). Moreover,[45, Lemma 3] tells us that, for every R ∈ R k , Π R k is, ina neighbourhood U (cid:51) R such that U ⊂ W , well-defined(single-valued), continuous and differentiable, with gradientgiven by: ∇ Π R k = Π T R k ( R ) where T R k ( R ) ⊂ C m × n is thetangent plane of the manifold R k in R (see [23, Example2.2]). Since T R k ( R ) is by definition a linear subspace of C m × n ,the orthogonal projection operator Π T R k ( R ) is bounded withunit spectral norm. The map Π R k = Π H k is consequently 1-Lipschitz continuous (i.e. non-expansive) with respect to theFrobenius norm in the neighbourhood U of R ∈ R k . (cid:3) The last lemma finally, makes use of Lemmas 3 and 4 to showthat the denoising operator H n ( x ) = T † P [ Π T P Π H K Π B ρ ] n T P ( x ) is locally non-expansive with respect to the Γ -norm: IMEONI, BESSON, HURLEY & VETTERLI: CADZOW PNP GRADIENT DESCENT FOR GENERALISED FRI 19
Lemma 5 (Local Non-Expansiveness of Denoiser) . Let C ( N − P )×( P + ) be the space of complex-valued rectangularmatrices of size ( N − P ) × ( P + ) , P ≤ (cid:98) N / (cid:99) , and H K ⊂ C ( N − P )×( P + ) , R K ⊂ C ( N − P )×( P + ) the sets of matriceswith rank at most and exactly K ≤ P respectively. LetH n ( x ) : = T † P [ Π T P Π H K Π B ρ ] n T P ( x ) , ∀ x ∈ C N , be the approximate proximal operator (34) . Then, H n is locallywell-defined (single-valued) and non-expansive with respect tothe Γ -norm (cid:107) H n ( x ) − H n ( z )(cid:107) Γ ≤ (cid:107) x − z (cid:107) Γ , for all x , z ∈ C N such that T P ( x ) , T P ( z ) are in someneighbourhood of some matrix R ∈ R K .Proof. First, we have, for all x , z ∈ C N : (cid:107) H n ( x ) − H n ( z )(cid:107) Γ = (cid:13)(cid:13)(cid:13) T † P ( D n T P ( x ) − D n T P ( z )) (cid:13)(cid:13)(cid:13) Γ , (71)where D n = [ Π T P Π H K Π B ρ ] n . Notice that for X ∈ T P , wehave (cid:13)(cid:13)(cid:13) T † P ( X ) (cid:13)(cid:13)(cid:13) Γ = (cid:104) Γ T † P ( X ) , T † P ( X )(cid:105) = (cid:104) ΓΓ − T ∗ P ( X ) , T † P ( X )(cid:105) = (cid:104) X , T P T † P ( X )(cid:105) F = (cid:104) X , Π T P X (cid:105) F = (cid:104) X , X (cid:105) F = (cid:107) X (cid:107) F . Since the range of D n is T P , (71) becomes: (cid:107) H n ( x ) − H n ( z )(cid:107) Γ = (cid:107) D n T P ( x ) − D n T P ( z )(cid:107) F . Assuming now that T P ( x ) and T P ( z ) are in some neighbourhoodof some point R ∈ R K , we can invoke Lemmas 3 and 4recursively to obtain: (cid:107) D n T P ( x ) − D n T P ( z )(cid:107) F ≤ (cid:107) T P ( x ) − T P ( z )(cid:107) F = (cid:107) x − z (cid:107) Γ , where we have used: (cid:107) T P ( x )(cid:107) F = (cid:104) T P ( x ) , T P ( x )(cid:105) F = (cid:104) T ∗ P T P ( x ) , x (cid:105) = (cid:107) x (cid:107) Γ , ∀ x ∈ C N . We finally get (cid:107) H n ( x ) − H n ( z )(cid:107) Γ ≤ (cid:107) x − z (cid:107) Γ , for all x , z ∈ C N such that T P ( x ) , T P ( z ) are in some neigh-bourhood of some matrix R ∈ R K . (cid:3) We are now ready to show Theorem 4. Let U τ, n ( x ) : = H n ( x − τ ∇ F ( x )) = H n ( D τ ( x )) , x ∈ C N . Then, for every x , z ∈ C N such that T P ( x ) , T P ( z ) are insome neighbourhood of some matrix R ∈ R K , U τ, n is locallyLipschitz continuous as composition between two (locally)Lipschitz continuous functions H n and D τ , see Lemmas 5 and2 respectively. Moreover, the Lipschitz constant is the productof the Lipschitz constants of H n and D τ , 1 and L τ in (69)respectively. We have therefore (cid:13)(cid:13) U τ, n ( x ) − U τ, n ( z ) (cid:13)(cid:13) Γ ≤ L τ (cid:107) x − z (cid:107) Γ , for all x , z ∈ C N such that T P ( x ) , T P ( z ) are in some neigh-bourhood of some matrix R ∈ R K . Finally, the restriction on τ for L τ to be smaller than one results from Lemma 2. (cid:3) A PPENDIX HP ROOF OF C OROLLARY (cid:107) x k + − x k (cid:107) Γ ≤ (cid:107) U τ, n ( x k ) − U τ, n ( x k − )(cid:107) Γ ≤ L τ (cid:107) x k − x k − (cid:107) Γ , for all k ≥ and hence by induction (cid:107) x k + − x k (cid:107) Γ ≤ L k τ (cid:107) x − x (cid:107) Γ , ∀ k ≥ . (72)By assumption < τ < / β and therefore < L τ < . Wededuce hence from (72) that { x k } k ∈ N is a Cauchy sequence.Let j , k ∈ N with j > k : (cid:107) x j − x k (cid:107) Γ ≤ j − (cid:213) m = k (cid:107) x m + − x m (cid:107) Γ ≤ j − (cid:213) m = k L m τ (cid:107) x − x (cid:107) Γ = (cid:107) x − x (cid:107) Γ L k τ j − − k (cid:213) m = L m τ (73) ≤ (cid:107) x − x (cid:107) Γ L k τ ∞ (cid:213) m = L m τ = L k τ − L τ (cid:107) x − x (cid:107) Γ . For every (cid:15) > , we can choose a J ∈ N such that L J τ < (cid:15) ( − L τ )(cid:107) x − x (cid:107) Γ , and hence for all j > k > J (cid:107) x j − x k (cid:107) Γ < (cid:15) . The sequence { x k } k ∈ N is hence a Cauchy sequence, and since C N is complete, it converges towards a limit point x (cid:63) ∈ C N .We have moreover, since U τ, n is continuous x (cid:63) = lim n →∞ x k = lim n →∞ U τ, n ( x k − ) = U τ, n (cid:16) lim n →∞ x k − (cid:17) = U τ, n ( x (cid:63) ) , and hence x (cid:63) is a fixed-point of U τ, n . Note moreover that,from (73) we get (cid:107) x (cid:63) − x k (cid:107) Γ = lim j → + ∞ (cid:107) x j − x k (cid:107) Γ ≤ lim j → + ∞ (cid:107) x − x (cid:107) Γ L k τ j − − k (cid:213) m = L m τ ≤ (cid:107) x − x (cid:107) Γ L k τ + ∞ (cid:213) m = L m τ = L k τ − L τ (cid:107) x − x (cid:107) Γ , which proves (37) of Corollary 2. (cid:3) . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (a) β = , M = , PSNR=-30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (b) β = , M = , PSNR=0 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (c) β = , M = , PSNR=30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (d) β = , M = , PSNR=-30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (e) β = , M = , PSNR=0 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (f) β = , M = , PSNR=30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (g) β = , M = , PSNR=-30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (h) β = , M = , PSNR=0 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (i) β = , M = , PSNR=30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (j) β = , M = , PSNR=-30 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (k) β = , M = , PSNR=0 dB. . . . . . . Recovered Dirac Locations . . . . . . T r u e D i r a c L o c a t i o n s LS-CadzowCPGDGenFRI (l) β = , M = , PSNR=30 dB. Figure 5: Actual vs. recovered Dirac locations for LS-Cadzow, CPGD and GenFRI, various oversampling parameters β ∈{ , , , } , a PSNR in {− , , } dB. For each case and each source (denoted with the same colours as in Fig. 2), the markersand horizontal lines represent respectively the median and inter-quartile region of the estimated locations’ empirical distributionobtained from 192 noise realisations. The closer a marker is from the line y = xx