Dictionary-Sparse Recovery From Heavy-Tailed Measurements
aa r X i v : . [ c s . I T ] J a n DICTIONARY-SPARSE RECOVERY FROM HEAVY-TAILED MEASUREMENTS
PEDRO ABDALLA ∗ AND CHRISTIAN KÜMMERLE † Abstract. The recovery of signals that are sparse not in a given basis , but rather sparse withrespect to an over-complete dictionary is one of the most flexible settings in the field of compressedsensing with numerous applications. As in the standard compressed sensing setting, it is possiblethat the signal can be reconstructed efficiently from few, linear measurements, for example bythe so-called ℓ -synthesis method .However, it has been less well-understood which measurement matrices provably work forthis setting. Whereas in the standard setting, it has been shown that even certain heavy-tailedmeasurement matrices can be used in the same sample complexity regime as Gaussian matrices,comparable results are only available for the restrictive class of sub-Gaussian measurement vectorsas far as the recovery of dictionary-sparse signals via ℓ -synthesis is concerned.In this work, we fill this gap and establish optimal guarantees for the recovery of vectors thatare (approximately) sparse with respect to a dictionary via the ℓ -synthesis method from linear,potentially noisy measurements for a large class of random measurement matrices. In particular,we show that random measurements that fulfill only a small-ball assumption and a weak momentassumption, such as random vectors with i.i.d. Student- 𝑡 entries with a logarithmic number ofdegrees of freedom, lead to comparable guarantees as (sub-)Gaussian measurements. Our resultsapply for a large class of both random and deterministic dictionaries.As a corollary of our results, we also obtain a slight improvement on the weakest assumptionon a measurement matrix with i.i.d. rows sufficient for uniform recovery in standard compressedsensing, improving on results by Mendelson and Lecué and Dirksen, Lecué and Rauhut.
1. Introduction
In the seminal works [CRT06a, CRT06b, Don06], the idea of compressed sensing was in-troduced as a paradigm to recover signals from undersampled and corrupted measurements.Mathematically, the goal is to recover a signal z ∈ ℝ 𝑑 from corrupted linear measurements y = Φ z + e ∈ ℝ 𝑚 , where e ∈ ℝ 𝑚 is an error vector satisfying k e k ≤ 𝜀 for some constant 𝜀 > and Φ ∈ ℝ 𝑚 × 𝑑 is a sensing matrix, often also called measurement matrix . The numberof measurements 𝑚 , which corresponds to the number of rows of Φ , is assumed to be muchsmaller than the ambient dimension 𝑑 , formally, 𝑚 = 𝑜 ( 𝑑 ) . Without any additional assumption,there is not much hope to estimate z just from knowledge of y and Φ . Recovery of Sparse Signals. H owever, the theory of compressed sensing suggests that ifthe vector of interest z has a low-dimensional structure such that it is, for example, a sparse vector with only few non-zero entries 𝑠 : = k z k : = (cid:12)(cid:12) { 𝑗 ∈ [ 𝑁 ] : ( z ) 𝑗 ≠ } (cid:12)(cid:12) ≪ 𝑑 , it can be stablyrecovered by the solving a convex optimization program called quadratically constrained basis Date : January 22,
Key words and phrases. compressed sensing, over-complete dictionaries, heavy-tailed distributions, ℓ -synthesis,small-ball assumption, null space property. ∗ Department of Mathematics, ETH Zürich, Zurich, Switzerland ([email protected]). † Department of Applied Mathematics & Statistics, Johns Hopkins University, Baltimore, USA([email protected]). pursuit (QCBP) or quadratically constrained ℓ -minimization [F R13](1) ˆ z = arg min k z k subject to k Φ z − y k ≤ 𝜀 such that ˆ z is close to z , at least if the measurement matrix Φ ∈ ℝ 𝑚 × 𝑑 fulfills appropriateconditions and 𝑚 = Θ ( 𝑠 log ( 𝑑 / 𝑠 )) [FR13]. Examples for such conditions on Φ are the restrictedisometry property [CT06, FR13] or variants of the so-called null space property [CDD09], see alsoSection 1.2 below. We note that under these type of conditions, methods different from eq. (1)have also shown to exhibit similar recovery guarantees, such as greedy and thresholding-basedmethods [Zha11, Fou11, BFH16, ZXQ19, ZL20].Therefore, a crucial question that lies at the core of compressed sensing is about whichmeasurement matrices Φ ∈ ℝ 𝑚 × 𝑑 fulfill such conditions in the information theoretically optimalregime of 𝑚 = Ω ( 𝑠 log ( 𝑑 / 𝑠 )) and therefore allow for robust recovery guarantees. All knownconstructions of measurement matrices to achieve this regime rely on some sort of randomness;it has been shown that matrices Φ with i.i.d. Gaussian entries [CT06], i.i.d. Bernoulli and othersub-Gaussian entries [BDDW08] allow for robust guarantees with high probability.In subsequent works, it was shown that distributions without strong concentration propertiesare equally suitable for the design of measurement matrices. In particular, it was shown thatmatrices with i.i.d. sub-exponential entries can be used with the same recovery guarantees asGaussian matrices [Kol11, Fou14], and finally, [ML17, DLR18] showed that this assumption canbe relaxed to the requirement of just Ω ( log ( 𝑑 )) small moments for the entrywise distribution,which was shown to be necessary up to a log-log factor [ML17].While the results on the mentioned random constructions allow for optimal sample com-plexity, such measurement matrices are often challenging to implement in applications. Inwireless communications [HBRN10] and radar [PEPC10], measurement matrices with a convo-lutional structure are used, and the related random partial circulant matrices have been shownto be fulfill a restricted isometry property for a number of measurements that is optimal upto logarithmic factors [RRT12, KMR14, MRW + without logarithmic factors.1.2. Recovery of Dictionary-Sparse Signals.
In most applications of compressed sensing, it isactually not the case that the signal of interest z itself has few non-zero entries, but rather, thesetting allows for a sparse representation such that there exists a matrix D ∈ ℝ 𝑑 × 𝑛 and a sparse vector x ∈ ℝ 𝑑 such that(2) z = Dx . This matrix D = (cid:2) d , . . . , d 𝑛 (cid:3) is called a dictionary (matrix) [EMR07]. The case mentionedabove corresponds to the situation in which the dictionary D = I is the identity matrix.In this paper, we focus on the case that this dictionary matrix D ∈ ℝ 𝑑 × 𝑛 is over-complete , i.e.,is such that 𝑛 > 𝑑 , allowing for a much larger class of signals than in the standard compressedsensing setting outlined above. We will refer to signals x ∈ ℝ 𝑑 as in eq. (2) with sparse z ∈ ℝ 𝑛 as dictionary -sparse signals. ICTIONARY-SPARSE RECOVERY FROM HEAVY-TAILED MEASUREMENTS 3
Dictionaries of interest are on the one hand pre-designed, application dependent dictionariessuch as Gabor frame [HS09], wavelet or shearlet [GL07] dictionaries. On the other hand, it hasbeen also fruitful to learn over-complete dictionaries from data first to subsequently sparselyencode signals based on the obtained dictionary [AEB06, WYG +
09, RB13].As mentioned, recovering signals as given by eq. (2) from few linear measurements provides avery general setup that has been studied in previous works [CENR11, ACP12, CWW14, DNW13].Given the measurement matrix Φ ∈ ℝ 𝑚 × 𝑛 and a measurement vector y = Φ z + e ∈ ℝ 𝑚 , with e ∈ ℝ 𝑚 and 𝜀 > as above, the ℓ -synthesis method [RSV08, LLM +
12] performs the signalrecovery by solving ˆ x = arg min k x k subject to k Φ Dx − y k ≤ 𝜀, ˆ z = Dˆx , (3)borrowing its name from the fact that the estimator ˆ z ∈ ℝ 𝑑 of the original signal z is synthesized by the dictionary D and the ℓ -minimizer ˆ x via the equality ˆ z = Dˆx .Any theoretical guarantees quantifying the accuracy of ℓ -synthesis eq. (3) need to dependon properties of the dictionary D ∈ ℝ 𝑑 × 𝑛 , and furthermore, on the relationship of measurementmatrix Φ and the dictionary D . Considering the matrix product ΦD as the measurement matrixin the setting of Section 1.1 gives us one possible approach of how to obtain guarantees.The first work with an extensive analysis of this dictionary-sparse case was provided byRauhut, Schnass and Vanderghenyst [RSV08], who showed that if the dictionary D fulfills therestricted isometry property (RIP) and if Φ ∈ ℝ 𝑚 × 𝑑 is a matrix with independent sub-Gaussiansrows with 𝑚 = Ω ( 𝑠 log ( 𝑛 / 𝑠 )) , the product ΦD also fulfills the RIP of sparsity order 𝑠 with highprobability. In [CWW14], Chen, Wang and Wang introduced a null space property tailoredto the dictionary case, which is defined as follows: If D ∈ ℝ 𝑑 × 𝑛 is a dictionary, it is said that Φ ∈ ℝ 𝑚 × 𝑑 satisfies the D -null space property ( D -NSP) of order 𝑠 if for any index set 𝑇 with | 𝑇 | ≤ 𝑠 and any vector v such that Dv ∈ ker Φ \ { } , there exists some u ∈ ker D such that k v 𝑇 + u k < k v 𝑇 𝑐 k . [CWW14] showed that the D -NSP is a necessary and sufficient condition forthe exact recovery of dictionary-sparse signals via ℓ -synthesis method eq. (3) in the noiselesscase where 𝜖 = .Using this notion of D -NSP, the authors of [CWW14] were also able relate exact recoveryguarantees of eq. (3) with a null space property of ΦD in the conventional sense of [FR13],which we recall as follows. Definition 1 ([FR13, Chapter 4]) . Let A ∈ ℝ 𝑚 × 𝑛 be a matrix and 𝑠 ∈ ℕ .(1) A is said to fulfill the null space property (NSP) of order 𝑠 if for all v ∈ ker ( A ) \ { } andall subsets 𝑇 ⊂ [ 𝑛 ] with | 𝑇 | ≤ 𝑠 , k v 𝑇 k < k v 𝑇 𝑐 k . (2) A is said to fulfill the robust null space property (robust NSP) of order 𝑠 with constants < 𝛾 < and 𝜏 > if inf v ∈ 𝑆 𝛾 k Av k ≥ 𝜏 , where 𝑆 𝛾 : = n v ∈ ℝ 𝑛 : k v 𝑇 k ≥ 𝛾 √ 𝑠 k v 𝑇 𝑐 k for some 𝑇 ⊂ [ 𝑛 ] with | 𝑇 | = 𝑠 o ∩ 𝕊 𝑛 − . While null space property is a necessary and sufficient condition for the successful recoveryof sparse vectors via eq. (1) in the noiseless case of 𝜀 = , the robust null space property See also Remark 1.
P. ABDALLA AND C. KÜMMERLE implies also robust recovery under noisy measurements and model error. [CWW14] obtainedthe following result.
Proposition 1 ([CWW14, Theorem 7.2]) . If the dictionary D ∈ ℝ 𝑑 × 𝑛 is full spark, i.e., if no set of 𝑑 columns of D is linear dependent, then Φ satisfies the D -NSP of order 𝑠 if and only if ΦD satisfiesthe NSP of the same order. We note that the full spark assumption for the dictionary D is not a strong assumption.Indeed, if D is a matrix whose independent columns are drawn from a continuous distribution,then it is full spark with probability one [ACM12].We now turn to the core questions addressed in this paper. Taking Proposition 1 as a startingpoint, Casazza, Chen and Lynch proved recently in [CCL20] that if the dictionary D is a fullspark deterministic matrix with columns bounded in ℓ -norm that furthermore fulfills therobust NSP of Definition 1, a measurement matrix Φ with i.i.d. sub-Gaussian rows results ina product ΦD that fulfills the robust NSP, if furthermore the number of measurements fulfills 𝑚 = Ω ( 𝑠 l og ( 𝑛 / 𝑠 )) . Via this robust NSP, their results ensure that for such dictionary andmeasurements, the ℓ -synthesis method eq. (3) recovers 𝑠 -dictionary sparse signals z ∈ ℝ 𝑑 with high probability, robustly from noisy measurements if 𝑚 = Ω ( 𝑠 log ( 𝑛 / 𝑠 )) .However, while being an improvement over the result of [RSV08] as the NSP assumptionon D is weaker than the RIP of [RSV08], the sub-Gaussianity assumption of [CCL20] on themeasurements Φ is still quite restrictive. This becomes in particular clear when comparingsuch an assumption to the variety of results on heavy-tailed measurement matrices for theconventional compressed sensing setting detailed in Section 1.1. In this paper, we thereforeaddress the following questions: (Q1) What are the weakest assumptions for suitable measurement matrices Φ for uniform recov-ery guarantees of dictionary-sparse signals via ℓ -synthesis?(Q2) Which joint assumptions on Φ and the dictionary matrix D come with good trade-offs? Our Contribution.
We provide results for both deterministic and random dictionaries D , ad-dressing (Q1) and (Q2) . In particular, we show in Theorems 1 and 2 that for the case of a fullspark dictionary D fulfilling a robust NSP, ΦD fulfills the robust NSP if Φ is a measurementmatrix with i.i.d. rows whose distributions only have a logarithmic number of finite, well-behavedmoments (and also fulfill a small-ball condition) , with high probability in the optimal regime of 𝑚 = Ω ( 𝑠 log ( 𝑛 / 𝑠 )) .Secondly, we prove two results about the setting where both Φ and D obey a random model.Via Theorem 4, we show that both Φ and D can obey heavy-tailed distributions while stillretaining the robust NSP of ΦD . Finally, we show with Theorem 5 that if the dictionary D ∈ ℝ 𝑑 × 𝑛 has i.i.d. entries whose distribution has log ( 𝑛 ) sub-Gaussian moments, it is sufficient to requireonly a bounded variance and a small-ball condition for the entries of the measurement matrix Φ .The established property of ΦD for each of the mentioned models on Φ and D implies that withhigh probability, the ℓ -synthesis method is able to uniformly recover cover all dictionary-sparsesignals of order 𝑠 , stably and robustly under measurement noise.Lastly, we prove a technical lemma that bounds the expectation of the sum of the largestcoordinates of a random vector with just few sub-Gaussian moments in Lemma 1, generalizinga result of [Men15]. This result is key for several of our results, and furthermore, improveson [ML17] and [DLR18] by relaxing the weakest known moment assumption for standardcompressed sensing in the optimal regime. This aspect is presented and discussed in Theorem 3. ICTIONARY-SPARSE RECOVERY FROM HEAVY-TAILED MEASUREMENTS 5
The structure of this paper is organized as follows. After reviewing related works, we providesome preliminaries in Section 2. In Section 3, we present our results for deterministic dictionarymatrices D and random measurements Φ , before we present our results for joint random modelson the dictionary D and the measurement matrix Φ in Section 4. Our main technical result ispresented in Section 5. Finally, we present the proofs of our results in Section 6, and concludewith a short discussion in Section 7.1.3. Related Works.
As mentioned above, results of similar flavor as ours can be found in[RSV08, CWW14, CCL20], albeit subject to considerably stronger assumptions on the measure-ment and dictionary matrices.Going beyond uniform results, the work [GE13] observes that under quite weak assump-tions on the dictionary D , signal recovery might still be possible for ℓ -synthesis eq. (3) evenwhen the coefficients are not uniquely recoverable. This idea has been extended in the re-cent work [MBKW20] by considering a non-uniform , i.e., signal-dependent analysis based onconic Gaussian mean widths. Supported by numerical evidence, the authors of [MBKW20]provide a characterization which signals might be uniquely recoverable via ℓ -synthesis despitea non-unique sparse coefficient representation.A different route for the recovery of signals that are sparse in an over-complete dictionaryis taken in [DNW13, GN15], where a compressive sampling matching pursuit (CoSaMP) isproposed and analyzed. This method contains a projection step that is in general not efficentlyimplementable, and furthermore, the presented theory requires the measurement matrix Φ tofulfill a RIP, which is stronger than the requirements of [CCL20].An approach that is related to ℓ -synthesis, but significantly different, is the ℓ -analysismodel, which as been studied in an arguably even larger body of literature, cf. [EMR07,CENR11, NDEG13, GKM20] and the references therein. Instead of recovering signals z as ineq. (2), this model assumes that 𝑧 ∈ ℝ 𝑑 is such that D ∗ z is sparse (which is not equivalent toour model for non-square dictionaries D ∈ ℝ 𝑑 × 𝑛 ) and solves the convex program ˆ z = arg min z ∈ ℝ 𝑑 k D ∗ z k subject to k Φ z − y k ≤ 𝜀. In this context, the notion of a D -restricted isometry property ( D -RIP) was shown to lead behelpful [CENR11], and it was shown to hold for random measurement matrices fulfilling strongconcentration property of enough measurements are provided [CENR11] which is strongerthan the D -NSP mentioned above. These results were extended to measurements constructedfrom subsampled bounded orthogonal systems [KNW15], which are of relevance in magneticresonance imaging.Both the ℓ -synthesis and the ℓ -analysis model come with their own set of advantages andlimitations. For extensive comparisons and discussion, we refer to [EMR07, LLM +
12, NDEG13].2.
Preliminaries and Background
To begin with, we introduce some notation. We write k · k 𝑝 for the standard ℓ 𝑝 -norm and 𝕊 𝑛 − 𝑝 , 𝔹 𝑛𝑝 for the unit sphere and unit ball in ℝ 𝑛 with respect to the ℓ 𝑝 norm, respectively. Whenthe sub-index is omitted, it is assumed that the we refer to the case the ℓ -norm. Moreover,for every matrix A we denote the standard operator norm by k A k . To avoid confusion, fora random variable 𝑋 , we denote its 𝐿 𝑝 -norm by k 𝑋 k 𝐿 𝑝 = 𝔼 [| 𝑋 | 𝑝 ] / 𝑝 for ≤ 𝑝 < ∞ , andwrite the essential supremum of 𝑋 as k 𝑋 k 𝐿 ∞ . For sets, for every natural number 𝑛 , definite [ 𝑛 ] : = { , . . . , 𝑛 } . We also denote 𝑇 𝑐 for the complement of the set 𝑇 . For 𝑚 and 𝑛 that are(varying) parameters, we write 𝑚 . 𝑛 if there exists an absolute constant 𝐶 > such that P. ABDALLA AND C. KÜMMERLE 𝑚 ≤ 𝐶𝑛 . Finally, 𝜎 𝑠 ( x ) denotes the ℓ -error of the best 𝑠 -term approximation of a vector x ∈ ℝ 𝑛 ,i.e., 𝜎 𝑠 ( x ) = inf {k x − z k : z ∈ ℝ 𝑛 is 𝑠 -sparse } .An important question in the context the recovery of signals that have 𝑠 -sparse coefficientswith respect to a over-complete dictionary D = [ d , . . . , d 𝑛 ] ∈ ℝ 𝑑 × 𝑛 with 𝑑 ≪ 𝑛 is whichdictionaries are admissible for successful recovery of z via ℓ -synthesis. The result of [RSV08]allows for dictionaries D 𝑑 × 𝑛 that fulfill a restricted isometry property (RIP) of order 𝑠 , which isfor example, fulfilled for incoherent dictionaries, i.e., those that fulfill the estimate 𝜇 ( D ) : = max 𝑖 ≠ 𝑗 |h d 𝑖 , d 𝑗 i |k d 𝑖 k k d 𝑗 k ≤ ( 𝑠 − ) . Using this incoherence estimate, it is possible to find deterministic, admissible dictionaries.However, it is well-known [FR13, Section 5.2] that this allows only for small sparsities of 𝑠 . √ 𝑑 .Random constructions for D using i.i.d. sub-Gaussian entries, on the other hand, satisfy arestricted isometry property of order 𝑠 for values of 𝑠 scaling almost linearly with 𝑑 .As observed by [CWW14, CCL20], it is possible to relax the RIP assumption on D to a weakernull space property as defined in Definition 1. Together with a suitable measurement matrix Φ ,such a dictionary will still enable stable recovery of coefficient vector x via ℓ -synthesis eq. (3),using the following standard result (see also [CCL20, Theorem 1.1] for a version). Proposition 2 ([FR13, Theorem 4.25]) . Suppose that a matrix A ∈ ℝ 𝑚 × 𝑛 satisfies the robust NSPof order 𝑠 of Definition 1 with constants < 𝛾 < and 𝜏 > . Then, for any x ∈ ℝ 𝑛 , thesolution ˆx = arg min k Ax − y k ≤ 𝜀 k x k with y = Ax + e and k e k ≤ 𝜀 approximates the vector x with ℓ -error k x − ˆx k ≤ ( + 𝛾 ) ( − 𝛾 ) 𝜎 𝑠 ( x ) √ 𝑠 + 𝜏 − 𝛾 𝜀, where 𝜎 𝑠 ( x ) is the best 𝑠 -term ℓ -approximation error of x . Using this proposition for A = ΦD , it is possible to derive guarantees for stable recoveryof the coefficient vector x via ℓ -synthesis eq. (3) in case that ΦD fulfills a robust NSP. InProposition 1, it was shown that for dictionaries D with full spark , the just slightly weaker NSPof ΦD is basically necessary for the success of ℓ -synthesis.Based on this argument, it has been noted in [CCL20] that actually, it is further necessarythat D fulfills a null space property, too. This is due to the fact that the null space of D iscontained in the null space of ΦD , i.e., Ker D ⊂ Ker Φ D .Building on this, Casazza, Chen and Lynch obtained the following result. Proposition 3 ([CCL20, see Theorem 3.3 and Corollary 3.6]) . Suppose that D ∈ ℝ 𝑑 × 𝑛 has therobust NSP of order 𝑠 with constant < 𝛾 < and 𝜏 > and its columns satisfy max {k d 𝑖 k : 𝑖 ∈[ 𝑛 ]} ≤ 𝜌 .Let the measurement matrix Φ = (cid:2) 𝜑 , . 
. . , 𝜑 𝑚 (cid:3) 𝑇 have independent rows that are distributedas a centered, sub-isotropic, sub-Gaussian vector 𝜑 , i.e., 𝜑 fulfills 𝔼 [ 𝜑 ] = , and there exist 𝑐 > and 𝜎 > such that(1) 𝔼 [|h 𝜑, z i |] ≥ 𝑐 for all z ∈ 𝕊 𝑑 − ,(2) ℙ (|h 𝜑, z i | ≥ 𝑡 ) ≤ (− 𝑡 /( 𝜎 )) for all z ∈ 𝕊 𝑑 − .If the number of measurements 𝑚 satisfies 𝑚 & 𝜎 𝑐 𝜏 𝜌𝛾 𝑠 log (cid:16) 𝑛𝑠 (cid:17) , ICTIONARY-SPARSE RECOVERY FROM HEAVY-TAILED MEASUREMENTS 7 then with probability at least − exp (− 𝑚 𝑐 𝜎 ) , ΦD fulfills the robust NSP of order 𝑠 withconstants 𝜌 and 𝜏 / 𝜎 . In this case, ℓ -synthesis eq. (3) provides stable and robust recovery of boththe coefficient vector x ∈ ℝ 𝑛 and the signal z ∈ ℝ 𝑑 from 𝑦 = Φ z + 𝑒 with k 𝑒 k ≤ 𝜀 such that k ˆx − x k . 𝜎 𝑠 ( x ) + 𝜏𝜎 𝜀 and k ˆz − z k . k D k (cid:16) 𝜎 𝑠 ( x ) + 𝜏𝜎 𝜀 (cid:17) . Remark 1.
Instead of the robust NSP of Definition 1, [CCL20] use the notion of a stable NSP , whichis similar, but leads to a slightly suboptimal guarantee. To see this, compare [CCL20, Theorem1.1] to [FR13, Theorem 4.22] (and to Proposition 2, for that matter). The robust NSP we use inDefinition 1 is very similar to what is called ℓ -robust NSP in [FR13, Chapter 4.3] and [DLR18].More specifically, it is a property that is shown in the proofs of [DLR18] to imply what is called the ℓ -robust NSP in [DLR18]. Proposition 3 is an improvement over the result of [RSV08] as its requirement on the dic-tionary D is weaker than the restricted isometry (RIP) assumption of [RSV08]. We refer to[CCW16, DLR18] for a discussion of the gap between NSP and RIP conditions.While the robust NSP requirement on the dictionary D is deterministic in its nature, it iswell-known that certifying such a condition for a given matrix D is in general NP-hard [TP13].Typically, this issue is addressed by using random constructions, see [FR13, ML17, DLR18] forexisting results. However, the existing theory does not address the intricacy of the ℓ -synthesismodel, as random constructions for both Φ and D might not directly imply a robust NSP for theproduct matrix ΦD . 3. Deterministic Dictionaries
In this section, we state our results for ℓ -synthesis in the case of deterministic dictionaries D ∈ ℝ 𝑑 × 𝑛 , which generalize Proposition 3 of [CCL20] directly to a much larger class of randommeasurement matrices Φ ∈ ℝ 𝑑 × 𝑛 .3.1. Main Results for Deterministic Dictionaries.
In particular, we assume the following forthe measurement matrix Φ ∈ ℝ 𝑚 × 𝑑 in Theorem 1 below. Assumption 1 ( Φ with independent rows with log ( 𝑛 / 𝑠 ) well-behaved moments) . Let Φ ∈ ℝ 𝑚 × 𝑑 be a matrix with i.i.d. rows distributed as 𝜑 /√ 𝑚 ∈ ℝ 𝑑 , where 𝜑 satisfies the conditions(1) 𝔼 [ 𝜑 ] = ,(2) There exists constants 𝐴 ∗ , 𝑐 > such that inf x ∈ 𝕊 𝑑 − ℙ (|h 𝜑, x i | ≥ 𝐴 ∗ ) ≥ 𝑐 ,(3) If 𝑛 ∈ ℕ denotes the number of columns of D and 𝑠 ∈ ℕ arbitrary, there exist constants 𝛼 ≥ and 𝜆 > such that for every vector a ∈ 𝕊 𝑑 − , k h 𝜑, a i k 𝐿 𝑝 ≤ 𝜆 𝑝 𝛼 for all ≤ 𝑝 . log 𝑛𝑠 . The first condition of Assumption 1 centers the distribution at and is not restrictive. Thesecond condition (2) is often called small ball assumption and is related to an important technicalcomponent in our analysis shared by [Men15, ML17, KM15]. It is a much weaker assumptionthan the sub-Gaussian assumption of Proposition 3 if we assume some mild normalization. Inparticular, if an isotropic random vector 𝜑 satisfies k h 𝜑, a i k 𝐿 + 𝜀 = 𝑂 ( k h 𝜑, a i k 𝐿 ) for some 𝜀 > ,then a simple Paley-Zygmund argument (cf. [Men15, Lemma 4.1]) shows that the small ballassumption is satisfied.Finally, condition (3) of Assumption 1 allows for distributions with much heavier tail behaviorthan just sub-Gaussian distributions; it is well-known that if a vector 𝜑 is sub-Gaussian and P. ABDALLA AND C. KÜMMERLE a ∈ ℝ 𝑑 , then k h 𝜑, a i k 𝐿 𝑝 = 𝑂 (√ 𝑝 k h 𝜑, a i k 𝐿 ) (see [Ver18] for a proof), i.e., the moment boundof (3) is fulfilled with 𝛼 = for all moments 𝑝 ∈ ℕ . On the other hand, (3) requires only the first log ( 𝑛 / 𝑠 ) moments to follow such behavior. Furthermore, the condition also includes randomvectors whose first log ( 𝑛 / 𝑠 ) moments follow the behavior of an exponential (for 𝛼 = ) randomvariable or even heavier ones (for 𝛼 > ). We refer the reader to [DLR18] for more details andexamples of similar distributions. Remark 2.
In the following, since the parameters 𝐴 ∗ , 𝑐 , 𝜆 and 𝛼 are constant for any givendistribution, they are considered as constant for the purposes of any big- 𝑂 type notation and thusomitted in such a notation, and similarly, in expressions using & and . . For measurement matrices Φ fulfilling Assumption 1, we obtain the following theorem. Theorem 1.
Suppose that D = [ d , . . . , d 𝑛 ] ∈ ℝ 𝑑 × 𝑛 has the robust NSP of order 𝑠 with constant < 𝛾 < and 𝜏 > and that its columns satisfy max {k d 𝑖 k : 𝑖 ∈ [ 𝑛 ]} ≤ 𝜌 = Θ ( ) .Assume that the measurement matrix Φ ∈ ℝ 𝑚 × 𝑑 satisfies Assumption 1 with moment growthparameter 𝛼 . If the number of measurements satisfies 𝑚 = Ω (cid:16) max (cid:16) 𝑠 log ( 𝑛 / 𝑠 ) , log ( 𝑛 / 𝑠 ) max ( 𝛼 − , ) (cid:17) (cid:17) , then with probability of at least − 𝑒 − Ω ( 𝑚 ) , ΦD fulfills the robust NSP of order 𝑠 with someconstants < 𝛾 < and e 𝜏 = Θ ( 𝜏 ) > . Thus, in this case, ℓ -synthesis eq. (3) provides stableand robust recovery of any coefficient vector x ∈ ℝ 𝑛 and corresponding signal z = Dx ∈ ℝ 𝑑 from 𝑦 = Φz + e with k e k ≤ 𝜀 such that k ˆx − x k . 𝜎 𝑠 ( x ) √ 𝑠 + 𝜏𝜀 and k ˆz − z k . k D k (cid:18) 𝜎 𝑠 ( x ) √ 𝑠 + 𝜏𝜀 (cid:19) . Remark 3.
Compared to the result of [CCL20] which we presented in Proposition 3, we obtainfurther a scaling in the the best 𝑠 -term ℓ -approximation error 𝜎 𝑠 ( x ) improved by /√ 𝑠 , whichis in line with the optimal results of standard compressed sensing, cf. [FR13, Chapter 4.3]. It iswell-known (e.g., [FR13, Theorem 4.22]) how to derive guarantees on the ℓ 𝑞 -error for ≤ 𝑞 ≤ based on such a property, we omit them for simplicity. Our proof is of Theorem 1 is based on the ideas of the small ball method [KM15, Men15,ML17, DLR18, CCL20], and proceeds by establishing lower bounds a certain empirical process.Crucially, unlike [CCL20], we do not rely on Gaussian widths in our argument, as this wouldrule out using assumptions as general as in Assumption 1. We refer to Section 6.3 for the details.Next, we provide a result that it is almost a corollary of Theorem 1—almost, as it is not quitea corollary, but follows from a modification of the proof.
Assumption 2 (Measurement matrices with log ( 𝑛 / 𝑠 ) well-behaved moments) . Let Φ ∈ ℝ 𝑚 × 𝑑 be measurement matrix with i.i.d. rows distributed as 𝜑 /√ 𝑚 ∈ ℝ 𝑑 , where 𝜑 = (cid:2) 𝜉 , . . . , 𝜉 𝑑 (cid:3) isa random vector with independent entries. Assume that for each 𝑖 ∈ [ 𝑑 ] , 𝜉 𝑖 satisfies the conditions(1) 𝔼 [ 𝜉 𝑖 ] = ,(2) 𝔼 [ 𝜉 𝑖 ] = (i.e., 𝜉 𝑖 has unit variance),(3) If 𝑛 ∈ ℕ denotes the number of columns of D and 𝑠 ∈ ℕ arbitrary, there exist constants 𝛼 ≥ and 𝜆 > such that k 𝜉 𝑖 k 𝐿 𝑝 ≤ 𝜆 𝑝 𝛼 for all ≤ 𝑝 . log 𝑛𝑠 . e 𝜏 only depends on the distribution parameters of 𝜑 . ICTIONARY-SPARSE RECOVERY FROM HEAVY-TAILED MEASUREMENTS 9
The theorem adapted to Assumption 2 can be stated as follows.
Theorem 2.
Suppose that D = [ d , . . . , d 𝑛 ] ∈ ℝ 𝑑 × 𝑛 has the robust NSP of order 𝑠 with constant < 𝛾 < and 𝜏 > and that its columns satisfy max {k d 𝑖 k : 𝑖 ∈ [ 𝑛 ]} ≤ 𝜌 = Θ ( ) .Assume that the measurement matrix Φ ∈ ℝ 𝑚 × 𝑑 satisfies Assumption 2 with moment growthparameter 𝛼 . If the number of measurements satisfies 𝑚 = Ω (cid:16) max (cid:16) 𝑠 log ( 𝑛 / 𝑠 ) , log ( 𝑛 / 𝑠 ) max ( 𝛼 − , ) (cid:17) (cid:17) , then with probability of at least − 𝑒 − Ω ( 𝑚 ) , ΦD fulfills the robust NSP of order 𝑠 with some constants < 𝛾 < and e 𝜏 > . Thus, in this case, ℓ -synthesis eq. (3) provides stable and robust recovery ofany coefficient vector x ∈ ℝ 𝑛 and corresponding signal z = Dx ∈ ℝ 𝑑 from 𝑦 = Φz + e with k e k ≤ 𝜀 such that k ˆx − x k . 𝜎 𝑠 ( x ) √ 𝑠 + 𝜏𝜀 and k ˆz − z k . k D k (cid:18) 𝜎 𝑠 ( x ) √ 𝑠 + 𝜏𝜀 (cid:19) . Even in the basis case that the dictionary D = I is the identity matrix (and thus, 𝑑 = 𝑛 ),Theorem 2 slightly improves [DLR18, Corollary 8] because it requires fewer moments of 𝜑 anddo not require the entries to be identically distributed: indeed, [DLR18, Corollary 8] requiresthe bound of condition (3) of Assumption 2 for the first log ( 𝑛 ) moments, whereas we onlyrequire that bound for the first log ( 𝑛 / 𝑠 ) moments.One crucial step towards proving Theorems 1 and 2 is the following proposition. Proposition 4.
Assume that 𝜑 , . . . , 𝜑 𝑚 are independent copies of a random vector 𝜑 fulfillingcondition (3) of either Assumption 1 or Assumption 2. Let D = [ d , . . . , d 𝑛 ] ∈ ℝ 𝑑 × 𝑛 be a dictionarywhose columns have bounded ℓ -norms, i.e., for all 𝑖 ∈ [ 𝑛 ] , k d 𝑖 k ≤ 𝜌 . Define 𝑉 : = 𝑚 − / Í 𝑚𝑖 = 𝜀 𝑖 𝜑 𝑖 where the 𝜖 𝑖 are independent Rademacher random variables that are independent of the { 𝜑 𝑖 } 𝑚𝑖 = .If also 𝑚 & log ( 𝑛 / 𝑠 ) max ( 𝛼 − , ) , then 𝔼 " 𝑠 Õ 𝑖 = (cid:0) ( D 𝑇 𝑉 ) ∗ 𝑖 (cid:1) / . 𝜌 r 𝑠 log (cid:16) 𝑛𝑠 (cid:17) , where ( D 𝑇 𝑉 ) ∗ 𝑖 denotes the 𝑖 -th coordinate of the non-increasing rearrangement of the vector D 𝑇 𝑉 . The proof of Proposition 4 is provided in Section 6.2. One ingredient for Proposition 4 is anextension of Mendelson’s [Men15, Lemma 6.5] that holds for slightly weaker assumptions. Werefer to Lemma 1 for details.Moreover, we note that the proof of Proposition 4 for Assumption 2 relies on a (non-standard)Khintchine inequality for distributions with weak moment assumptions, see Lemma 4 below.
Example 1.
Measurement matrices fulfilling Assumption 2 include many random ensembles withheavy-tailed distributions. For example, let 𝜉 , . . . , 𝜉 𝑑 be independent Student- 𝑡 random variablesof degree 𝑘 . Student- 𝑡 variables are neither sub-Gaussian nor sub-exponential and even do nothave finite 𝑟 -th moment if 𝑟 > 𝑘 [KN04]. On the other hand, the moments grow as the ones of asub-Gaussian random variable for 𝑟 ∈ { , . . . 𝑘 / } .Therefore, Theorem 2 implies that a measurement matrix Φ with independent rows whose entriesare i.i.d. Student- 𝑡 random variables of degree 𝑘 = Ω ( log ( 𝑛 / 𝑠 )) is suitable for robust recovery of 𝑠 -dictionary-sparse signals via ℓ -synthesis. We refer to [DLR18, Example 9] for more examples fordistributions fulfilling similar assumptions. We end this section by observing that it is not clear whether our results can be relaxed toallow the entries of the vector 𝜑 to be arbitrarily dependent.4. Random Dictionaries
The results we presented in Section 3 are all based on the dictionary D satisfying a null spaceproperty (NSP). Proposition 1 further suggests that the spark of D plays an important role inthe connection between a NSP of the measurement-dictionary product matrix ΦD and recoveryguarantees for ℓ -synthesis. However, it is in general NP-hard to verify whether D is of fullspark or whether D fulfills a null space property [TP13].This problem can typically be avoided by considering random matrices, similar to what hasbeen presented in Section 3 for the measurement matrices. In this way, the desired propertycan be established at least with high probability. In this section, we therefore consider differentsetups where both the dictionary D and the measurement matrix Φ are sampled at random.In particular, we suppose that the dictionary D = √ 𝑑 [ 𝜓 , . . . , 𝜓 𝑑 ] 𝑇 ∈ ℝ 𝑑 × 𝑛 is now generatedat random independently of Φ such that each 𝜓 𝑖 is an independent copy of a random vector 𝜓 ∈ ℝ 𝑑 . The normalization by √ 𝑑 is not really necessary but it maintains the results in thesame line of the previous results. For example, even for a standard Gaussian matrix G ∈ ℝ 𝑑 × 𝑛 ,the ℓ -norm of the columns are of order √ 𝑑 , so the maximum 𝜌 of the column ℓ -norms fulfills 𝜌 = Ω (√ 𝑑 ) ≫ . Without normalization, the scaling √ 𝑑 would appear in the constant 𝜏 of therobust NSP.4.1. Properties of Random Dictionaries.
In this section, we establish the robust null spaceproperty (robust NSP) of random dictionaries D whose rows are independent and fulfill weakmoment assumptions and a small-ball assumption. Assumption 3 ( D with independent rows with log ( 𝑛 / 𝑠 ) well-behaved moments fulfilling small--ball assumption) . Let 𝑠 ∈ ℕ , < 𝛾 < and 𝑆 𝛾 : = n x ∈ ℝ 𝑛 : k x 𝑇 k > 𝛾 √ 𝑠 k x 𝑇 𝑐 k for some | 𝑇 | ≤ 𝑠 o ∩ 𝕊 𝑛 − , let D ∈ ℝ 𝑑 × 𝑛 be a matrix with i.i.d. rows distributed as 𝜓 /√ 𝑑 ∈ ℝ 𝑛 , where 𝜓 satisfies theconditions(1) 𝔼 [ 𝜓 ] = ,(2) There exists constants 𝐴 ∗ , 𝑐 > such that inf x ∈ 𝑆 𝛾 ℙ (|h 𝜓, x i | ≥ 𝐴 ∗ ) ≥ 𝑐 ,(3) The entries 𝜉 𝑖 of the vector 𝜓 = ( 𝜉 , . . . , 𝜉 𝑛 ) (not necessarily independent) satisfy k 𝜉 𝑖 k 𝐿 𝑝 ≤ 𝜆 𝑝 𝛽 for some 𝜆 > , 𝛽 ≥ / , for all ≤ 𝑝 . log 𝑛𝑠 . For dictionaries D fulfilling Assumption 3, we can show the following proposition that resem-bles Proposition 4. Its proof relies likewise on our core technical result, Lemma 1. Proposition 5.
Suppose D ∈ ℝ 𝑑 × 𝑛 fulfills Assumption 3, let 𝑉 𝜓 : = √ 𝑑 Í 𝑑𝑖 = 𝜀 𝑖 𝜓 𝑖 with the 𝜓 𝑖 as in Assumption 3, and where the 𝜖 𝑖 are independent Rademacher random variables that areindependent of the { 𝜓 𝑖 } 𝑚𝑖 = . If 𝑑 & log ( 𝑛𝑠 ) max ( 𝛽 − , ) , then 𝔼 h 𝑠 Õ 𝑖 = (cid:0) ( 𝑉 𝜓 ) ∗ 𝑖 (cid:1) i / . r 𝑠 log (cid:16) 𝑛𝑠 (cid:17) . Using the small method, Proposition 5 implies the following theorem, see Section 6.4 fordetails.
ICTIONARY-SPARSE RECOVERY FROM HEAVY-TAILED MEASUREMENTS 11
Theorem 3.
Suppose D ∈ ℝ 𝑑 × 𝑛 fulfills Assumption 3. If 𝑑 & Ω (cid:18) max (cid:18) 𝑠 log (cid:16) 𝑛 / 𝑠 (cid:17) , log (cid:16) 𝑛 / 𝑠 (cid:17) max ( 𝛽 − , ) (cid:19) (cid:19) , then, with probability at least − 𝑒 − 𝑂 ( 𝑑 ) , D fulfills the robust NSP of order 𝑠 with constants < 𝛾 < and 𝜏 > . It is worth putting Theorem 3 into context of existing results for standard compressed sensingas introduced in Section 1.1. Theorem 3 provides an improvement of [DLR18, Corollary 8]because it allows dependency among entries of 𝜓 and a bound on fewer moments of 𝜓 , i.e.,only log ( 𝑛 / 𝑠 ) instead of log ( 𝑛 ) . Strictly speaking, our result does not improve [ML17, TheoremA] because Theorem 3 assumes a slightly stronger condition: The small ball assumption (2) ofAssumption 2 is taken in a larger set than in [ML17, Theorem A]. In particular, the authors in[ML17] assume the small ball assumption on the set Σ 𝑠 of 𝑠 -sparse vectors in the sphere whileour condition is satisfied, for example, if the small ball assumption holds in the convex hull of Σ 𝑠 intersected with the ℓ -sphere. On the other hand, Theorem 3, compared to [ML17, TheoremA], requires fewer moments on 𝜓 and decreases the factor 𝛽 − to 𝛽 − together with asignificant improvement in the failure probability. Remark 4.
One might wonder how close Assumption 3 comes to a minimal assumption to establisha null space property of order 𝑠 as in Theorem 3. [ML17, Theorem C] establishes that in fact, arandom vector with independent entries whose first log ( 𝑛 )/ log ( log ( 𝑛 )) moments are sub-Gaussian,with a probability of at least / , does not lead to an exact reconstruction property (which is evenslightly weaker than our robust NSP). In this sense, our sufficient condition involving the first log ( 𝑛 / 𝑠 ) moments closes parts of this gap. We recall that Proposition 1 suggests that the connection between a null space property ΦD and recovery guarantees of ℓ -synthesis is also related to spark of D . In order for D to havefull spark, we note that is sufficient to choose 𝜓 to fulfill any joint law such that, for everycollection of 𝑑 entries, the joint law of such collection is absolutely continuous with respect tothe Lebesgue measure in ℝ 𝑑 [BD09, CWW14]. We note that condition on 𝜓 of Assumption 3can be easily obtained with independent entries, and the fact that 𝜓 is drawn from a continuousdistribution does not hurt the small ball assumption (2) of Assumption 3. For example, every 𝜓 drawn from a continuous distribution with bounded density automatically satisfies the smallball assumption [DLR18].4.2. Main Results for Random Dictionaries.
In this section, we present our main results whenthe dictionary D ∈ ℝ 𝑑 × 𝑛 is generated at random. Our first model, addressed in Theorem 4,consists in D generated by 𝜓 as in Assumption 3 together with the measurement matrix Φ ∈ ℝ 𝑚 × 𝑑 of the previous chapter. The second model, addressed in Theorem 5, assumes strongerconditions for D but it will considerably relax the assumptions on the rows 𝜑 of the measurementmatrix Φ .We provide our first main result. In a nutshell, it establishes sparse recovery via the ℓ -synthesis method eq. (3) with high probability. To the best of our knowledge, it is the firstresult that considers both dictionary and measurement matrix with heavy tails. Theorem 4.
Suppose Φ ∈ ℝ 𝑚 × 𝑑 is a measurement matrix satisfying Assumption 1. Generate D = √ 𝑑 [ 𝜓 , . . . , 𝜓 𝑛 ] 𝑇 ∈ ℝ 𝑑 × 𝑛 independently, such that it fulfills Assumption 3. If the number of 𝜏 is an absolute constant that only depends on the distributional parameters of 𝜓 . rows of D satisfies 𝑑 = Ω ( max ( 𝑠 log ( 𝑛 / 𝑠 ) , log ( 𝑛 / 𝑠 ) max ( 𝛽 − , ) )) and the number of rows of Φ satisfies 𝑚 = Ω ( max ( 𝑠 log ( 𝑛 / 𝑠 ) , log ( 𝑛 / 𝑠 ) max ( 𝛼 − , ) )) , then, with probability at least − 𝑒 − 𝑂 ( 𝑑 ) − 𝑒 − Ω ( 𝑚 ) , D satisfies stable NSP with constants 𝛾 andsome e 𝜏 and furthermore, the ℓ -synthesis method eq. (3) provides stable and robust reconstructionof any coefficient vector x ∈ ℝ 𝑛 and signal z ∈ ℝ 𝑑 from 𝑦 = Φz + e with k e k ≤ 𝜀 such that k ˆ x − x k . 𝜎 𝑠 ( x ) √ 𝑠 + e 𝜏𝜖 k ˆ z − z k . k D k (cid:18) 𝜎 𝑠 ( x ) √ 𝑠 + e 𝜏𝜖 (cid:19) . In particular, it holds that 𝔼 k ˆ z − z k . 𝔼 [k D k] ( 𝜎 𝑠 ( x )/√ 𝑑 + e 𝜏𝜀 ) . The proof follows easily from the tools developed to prove the results presented in Section 3and Section 4.1 and is detailed in Section 6.5.Next, we consider the same problem before, but with different assumptions on our dictionary D ∈ ℝ 𝑑 × 𝑛 and the random vector 𝜑 ∈ ℝ 𝑑 that determines the distribution of the measurementmatrix Φ . This result explores the trade-off between the assumptions on the dictionary D and the measurement matrix Φ . Namely, we are now interested in random dictionaries D with independent entries drawn from a distribution with a few sub-Gaussian moments, andindependently sampled measurements 𝜑 whose distribution possesses only a finite secondmoment. Assumption 4 ( Φ with i.i.d. entries with finite variance) . Let Φ ∈ ℝ 𝑚 × 𝑑 be a matrix with i.i.d.rows distributed as 𝜑 /√ 𝑚 , where 𝜑 satisfies(1) 𝔼 [ 𝜑 ] = ,(2) There exists constants 𝐴 ∗ , 𝑐 > such that inf x ∈ 𝕊 𝑑 − ℙ (|h 𝜑, x i | ≥ 𝐴 ∗ ) ≥ 𝑐 ,(3) The entries 𝜉 𝑖 of 𝜑 have bounded variance, i.e., there exists 𝜆 > that k 𝜉 𝑖 k 𝐿 ≤ 𝜆 for all 𝑖 ∈ [ 𝑑 ] . Notice that the assumptions on Φ are very mild. Furthermore, a random dictionary D withi.i.d. entries with log ( 𝑛 ) sub-Gaussian moments satisfies the assumptions of Theorem 3, for 𝑑 large enough—the construction corresponds to the case in which the entries of 𝜓 are i.i.d.random variables with moment growth parameter 𝛽 = / . In particular, we can show thefollowing. Theorem 5.
Suppose Φ ∈ ℝ 𝑚 × 𝑑 is a random matrix satisfying Assumption 4, and supposethat D = ( D ) 𝑖 𝑗 = ( 𝑑 − / d ) 𝑖 𝑗 is a dictionary with i.i.d centered entries, such that the d 𝑖 𝑗 aredrawn independently from Φ , and their distribution’s first log ( 𝑛 ) moments are sub-Gaussian, i.e, k d 𝑖 𝑗 k 𝐿 𝑝 = 𝑂 (√ 𝑝 ) for 𝑝 ≤ log ( 𝑛 ) , for all 𝑖 ∈ [ 𝑑 ] and 𝑗 ∈ [ 𝑛 ] .Then, the same conclusions as in Theorem 4 hold. We highlight that the results also holds for independent but non identical entries of D , as canbe seen in the proof in Section 6.5. Note that we require in Theorem 5 not only 𝑂 ( log ( 𝑛 / 𝑠 )) ,but 𝑂 ( log ( 𝑛 )) sub-Gaussian moments, which deviates from previous results in this paper. ICTIONARY-SPARSE RECOVERY FROM HEAVY-TAILED MEASUREMENTS 13 Main Technical Lemma
In this section, we present a technical result that lies at the core of many of the resultspresented in this paper.
Lemma 1 (Generalization of [Men15, Lemma 6.5]) . There exists an absolute constant
𝐶 > forwhich the following holds. Let 𝑠 ∈ ℕ . Assume that 𝑧 , . . . , 𝑧 𝑛 are centered random variables withvariance that fulfill for each 𝑖 ∈ [ 𝑛 ] that for every 𝑝 ≤ ( 𝑛 / 𝑠 ) , k 𝑧 𝑖 k 𝐿 𝑝 ≤ 𝜆 √ 𝑝 . Then (4) 𝔼 " 𝑠 Õ 𝑖 = (cid:0) 𝑧 ∗ 𝑖 (cid:1) / ≤ 𝐶 𝜆 r 𝑠 log (cid:16) 𝑛𝑠 (cid:17) , where 𝑧 ∗ 𝑖 denotes the 𝑖 -th coordinate of the non-increasing rearrangement of the vector z = ( 𝑧 , . . . , 𝑧 𝑛 ) . Lemma 1 improves on Mendelson’s [Men15, Lemma 6.5], with the improvement beingtwofold: First, we do not require independence of the 𝑧 𝑖 , and neither do we require the 𝑧 𝑖 to have identical distributions. Both of these are, on the hand, requirements of [Men15, Lemma6.5]. Moreover, in order to establish eq. (4), [Men15, Lemma 6.5] requires the bound for thefirst 𝑂 ( log ( 𝑛 )) moments, whereas Lemma 1 only requires such a bound for the first 𝑂 ( log ( 𝑛 / 𝑠 )) moments. 6. Proofs
Proof of Lemma 1.
As a preparation, we present a simple consequence of Markov’s in-equality.
Lemma 2. [MRW +
18, Lemma 3.7] Assume that a random variable 𝑋 satisfies k 𝑋 k 𝐿 𝑝 ≤ 𝐴 . Then,for each 𝑘 > , ℙ (| 𝑋 | ≥ 𝑘𝐴 ) ≤ 𝑘 − 𝑝 . Proof of Lemma 1.
Defining 𝑍 𝑖 : = 𝑧 𝑖 | 𝑧 𝑖 | ≤ e 𝜆 √ 𝑝 for each 𝑖 ∈ [ 𝑛 ] , we obtain that 𝔼 h 𝑠 Õ 𝑖 = ( 𝑍 ∗ 𝑖 ) i ≤ 𝑠 e 𝜆 𝑝 if 𝑍 ∗ 𝑖 is the 𝑖 -th largest coordinate of the vector Z = ( 𝑍 , . . . , 𝑍 𝑛 ) . Now we have to deal with therandom variables 𝑌 𝑖 = 𝑧 𝑖 | 𝑧 𝑖 | > e 𝜆 √ 𝑝 . We estimate that 𝔼 h 𝑠 Õ 𝑖 = ( 𝑌 ∗ 𝑖 ) i ≤ 𝔼 𝑛 Õ 𝑖 = 𝑌 𝑖 = 𝑛 Õ 𝑖 = 𝔼 𝑌 𝑖 . Observe that by the Hölder inequality, 𝔼 [ 𝑌 𝑖 ] ≤ 𝔼 [| 𝑧 𝑖 | 𝑝 ] / 𝑝 ℙ (| 𝑧 𝑖 | ≥ e 𝜆 √ 𝑝 ) − / 𝑝 . Using the moment assumption for 𝑝 = log ( 𝑛 / 𝑠 ) , we obtain 𝔼 [| 𝑧 𝑖 | 𝑝 ] / 𝑝 ≤ 𝜆 ( p 𝑝 ) = 𝜆 log ( 𝑛 / 𝑠 ) . Applying Lemma 2 for 𝐴 = 𝜆 √ 𝑝 and 𝑘 = e , it further follows that ℙ (| 𝑧 𝑖 | ≥ 𝑒𝜆 √ 𝑝 ) − / 𝑝 ≤ ( e − 𝑝 ) − / 𝑝 = e − 𝑝 + = e 𝑠𝑛 and therefore 𝑛 Õ 𝑖 = 𝔼 𝑌 𝑖 ≤ 𝑛𝜆 log ( 𝑛 / 𝑠 ) e 𝑠𝑛 = 𝜆 𝑠 log ( 𝑛 / 𝑠 ) . To get our desired bound, we compute 𝔼 𝑠 Õ 𝑖 = ( D 𝑇 𝑉 ) ∗ 𝑖 ) ≤ 𝔼 [ 𝑠 Õ 𝑖 = ( 𝑍 ∗ 𝑖 ) ] + 𝔼 [ 𝑠 Õ 𝑖 = ( 𝑌 ∗ 𝑖 ) ] ≤ 𝑂 ( 𝜆 𝑠 log ( 𝑛 / 𝑠 )) + 𝑂 ( 𝜆 𝑠 log ( 𝑛 / 𝑠 )) and take square root on both sides of the inequality. Using Jensen’s inequality, we obtain thedesired result. (cid:3) Proof of Proposition 4.
In this section, we present the proof of Proposition 4.As a tool to prove Proposition 4 in the case of Assumption 1, we establish a Khintchine typeinequality for 𝑉 . Lemma 3.
Let 𝑉 : = 𝑚 − / Í 𝑚𝑖 = 𝜀 𝑖 𝜑 𝑖 where the 𝜑 𝑖 are independent and satisfy the conditionsof Assumption 1 and { 𝜀 𝑖 } 𝑚𝑖 = are independent Rademacher random variables. Then, for every a ∈ 𝕊 𝑑 − , if 𝑝 . log ( 𝑛𝑠 ) and 𝑚 & log ( 𝑛 / 𝑠 ) max ( 𝛼 − , ) , where 𝛼 is the moment growth parameterof Theorem 3, it holds that k h a , 𝑉 i k 𝐿 𝑝 . √ 𝑝. Proof of Lemma 3.
By definition of 𝑉 we see that k h 𝑎, 𝑉 i k 𝐿 𝑝 = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) √ 𝑚 𝑚 Õ 𝑖 = 𝜀 𝑖 h a , 𝜑 𝑖 i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) 𝐿 𝑝 . All random variables { 𝜀 𝑖 h 𝑑, 𝜑 𝑖 i} 𝑚𝑖 = are i.i.d copies of a random variable with the first 𝑝 momentsof order 𝑝 𝛼 due to condition (3) of Assumption 1. Thus, by [ML17, Lemma 2.8], there exists aconstant 𝑐 ( 𝛼 ) that depends only on 𝛼 such that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) √ 𝑚 𝑚 Õ 𝑖 = 𝜀 𝑖 h 𝑎, 𝜑 𝑖 i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) 𝐿 𝑝 ≤ 𝑐 ( 𝛼 ) 𝜆 √ 𝑝. (cid:3) To show Proposition 4, we need a statement analogue to Lemma 3 that holds for 𝜙 𝑖 as inAssumption 2. A challenge is that a moment bound on the marginals h a , 𝜑 𝑖 i is not directly givenby Assumption 2. We use a moment comparison argument to overcome this issue and establish aKhintchine inequality under weak moment assumptions (weaker than the standard subgaussianassumption). Similar arguments have already appeared in the literature, see Chapter 3 [KW92,Chapter 3]. Lemma 4 (Khintchine inequality under weak moment assumption) . Suppose 𝑋 = ( 𝑥 , . . . , 𝑥 𝑑 ) ∈ ℝ 𝑑 with independent mean zero 𝑥 𝑖 and also assume, for all 𝑖 ∈ [ 𝑑 ] , k 𝑥 𝑖 k 𝐿 𝑝 = 𝑂 (√ 𝑝 ) for ≤ 𝑝 ≤ 𝑘 .Then, for all vectors a ∈ ℝ 𝑑 , we have k h 𝑋, a i k 𝐿 𝑝 = 𝑂 (√ 𝑝 k 𝑎 k ) , 𝑝 ∈ [ , 𝑘 ] . Proof of Lemma 4.
By a standard symmetrization argument k h 𝑋, a i k 𝐿 𝑝 = (cid:13)(cid:13)(cid:13) 𝑑 Õ 𝑖 = a 𝑖 𝑥 𝑖 (cid:13)(cid:13)(cid:13) 𝐿 𝑝 ≤ (cid:13)(cid:13)(cid:13) 𝑛 Õ 𝑖 = 𝜀 𝑖 a 𝑖 𝑥 𝑖 (cid:13)(cid:13)(cid:13) 𝐿 𝑝 , ICTIONARY-SPARSE RECOVERY FROM HEAVY-TAILED MEASUREMENTS 15 where 𝜀 𝑖 are independent standard Rademacher random variables. So we may assume that ourvector 𝑋 has independent symmetric coordinates. Without loss of generality assume 𝑝 to be aneven integer. Then, by expanding in monomials, 𝔼 ( 𝑑 Õ 𝑖 = a 𝑖 𝑥 𝑖 ) 𝑝 = min ( 𝑝,𝑑 ) Õ 𝑖 = Õ 𝑖 <𝑖 <...<𝑖 𝑡 Õ 𝑑 + ...𝑑 𝑡 = 𝑝,𝑑 𝑖 ≥ (cid:18) 𝑝𝑑 , . . . , 𝑑 𝑡 (cid:19) 𝑡 Ö 𝑗 = a 𝑑 𝑗 𝑖 𝑗 𝔼 𝑥 𝑖𝑗 𝑥 𝑑 𝑗 𝑖 𝑗 ! Any odd moment vanishes due to the symmetry. The even moments (until "k") are subgaussianby hypothesis, so it follows that the 𝑝 -norm of our inner product h 𝑋, a i is dominated (up to anabsolute constant) by the 𝑝 -norm of h 𝑔, a i , where 𝑔 is a standard Gaussian vector. Notice thatthe random variable h 𝑔, a i follows a Gaussian distribution with zero mean and variance k a k .The proof now follows by the standard estimates for 𝑝 -norm of Gaussian distributions. (cid:3) Using Lemma 4, we now establish that a lemma that is analoguous to Lemma 3.
Lemma 5.
Let 𝑉 : = 𝑚 − / Í 𝑚𝑖 = 𝜀 𝑖 𝜑 𝑖 , where the 𝜑 𝑖 are independent satisfying the conditions ofAssumption 2 and { 𝜀 𝑖 } 𝑚𝑖 = are independent Rademacher random variables. Suppose that 𝑚 & log ( 𝑛 / 𝑠 ) max ( 𝛼 − , ) . Then, for every a ∈ 𝕊 𝑑 − and 𝑝 . log ( 𝑛𝑠 ) , k h a , 𝑉 i k 𝐿 𝑝 . √ 𝑝. Proof of Lemma 5.
Observe that the entries 𝑉 𝑗 : = 𝑚 − / Í 𝑚𝑖 = 𝜀 𝑖 ( 𝜑 𝑖 ) 𝑗 of 𝑉 are a normalized sumof i.i.d random variables with moments bounded of order 𝑝 𝛼 due to Assumption 2. By [DLR18,Lemma 2.8], we know that k 𝑉 𝑗 k 𝐿 𝑝 = 𝑂 (√ 𝑝 ) , for 𝑝 = 𝑂 ( log ( 𝑛 / 𝑠 )) . Moreover, by independenceof the entries of the 𝜑 𝑖 , the entries 𝑉 𝑗 are also independent, therefore Lemma 4 can be appliedto finish the proof. (cid:3) We now go ahead and combine the previous results to show Proposition 4.
Proof of Proposition 4.
Let 𝑉 : = 𝑚 − / Í 𝑚𝑖 = 𝜀 𝑖 𝜑 𝑖 , where is defined with independent 𝜑 𝑖 that arecentered random vectors fulfilling the last condition of Assumption 1 or of Assumption 2 andthe 𝜖 𝑖 are independent Rademacher random variables that are independent of the { 𝜑 𝑖 } 𝑚𝑖 = . Let D = [ d , . . . , d 𝑛 ] ∈ ℝ 𝑑 × 𝑛 be a dictionary such that k d 𝑖 k ≤ 𝜌 for all 𝑖 ∈ [ 𝑛 ] .First, we note that using the assumptions and Lemma 3 or Lemma 5, respectively, we see that k h d 𝑖 , 𝑉 i k = k d 𝑖 k kh d 𝑖 /k d 𝑖 k , 𝑉 i k . 𝜌. Defining 𝑧 𝑖 : = (cid:0) D 𝑇 𝑉 (cid:1) 𝑖 /k h d 𝑖 , 𝑉 i k 𝐿 = h d 𝑖 , 𝑉 i/k h d 𝑖 , 𝑉 i k 𝐿 , we can again use Lemma 3 or Lemma 5,respectively, to see that 𝑧 𝑖 fulfills the assumptions of Lemma 1 in each case, in particular k 𝑧 𝑖 k 𝐿 𝑝 ≤ 𝐶 𝜌 √ 𝑝 for all 𝑖 ∈ [ 𝑛 ] for some constant 𝐶 > . This then implies 𝔼 " 𝑠 Õ 𝑖 = (cid:0) ( D 𝑇 𝑉 ) ∗ 𝑖 (cid:1) / = 𝔼 " 𝑠 Õ 𝑖 = (cid:0) 𝑧 ∗ 𝑖 (cid:1) / . 𝜌 r 𝑠 log ( 𝑛𝑠 ) , which is the desired bound. (cid:3) Proofs of Theorems 1 and 2.
In this section, we detail the proofs of Theorems 1 and 2.The main idea of our proofs is based on the so-called small-ball method that was first developedby Koltchinskii and Mendelson [KM15, Men15, ML17]. As a preperation, we consider thefollowing definitions.
Definition 2. (Marginal tail function) For a random vector 𝜓 ∈ ℝ 𝑑 , a subset 𝑆 ⊂ ℝ 𝑑 and a fixed 𝐴 > , define the marginal tail function 𝑄 𝐴 ( 𝑆 ; 𝜓 ) by 𝑄 𝐴 ( 𝑆, 𝜓 ) : = inf x ∈ 𝑆 ℙ (|h x , 𝜓 i | ≥ 𝐴 ) . Definition 3. (Mean empirical width) For a random vector 𝜓 ∈ ℝ 𝑑 and a subset 𝑆 ⊂ ℝ 𝑑 , themean empirical width of 𝑆 is 𝑊 𝑚 ( 𝑆, 𝜓 ) : = 𝔼 sup x ∈ 𝑆 D x , 𝑚 𝑚 Õ 𝑖 = 𝜀 𝑖 𝜓 𝑖 E , where 𝜓 , . . . , 𝜓 𝑚 are i.i.d. copies of 𝜓 and 𝜀 , . . . , 𝜀 𝑚 are indepdendent Rademacher randomvariables that are independent of the { 𝜓 𝑖 } 𝑚𝑖 = . The following proposition is at the core of our proof.
Proposition 6 ([KM15, Theorem 1.5],[Tro15, Proposition 5.1]) . Fix a set 𝑆 ⊂ ℝ 𝑛 . Let 𝜓 ∈ ℝ 𝑛 bea random vector and let Ψ ∈ ℝ 𝑚 × 𝑛 be a random matrix whose rows are i.i.d copies of 𝜓 . Then, forany 𝑡 > and 𝐴 > , inf x ∈ 𝑆 k Ψ x k ≥ 𝐴 √ 𝑚𝑄 𝐴 ( 𝑆 ; 𝜓 ) − √ 𝑚𝑊 𝑚 ( 𝑆 ; 𝜓 ) − 𝐴𝑡, with probability at least − 𝑒 − 𝑡 / .Proof of Theorem 1. Since D = [ d , . . . , d 𝑛 ] ∈ ℝ 𝑑 × 𝑛 satisfies the robust NSP of order 𝑠 withconstants 𝛾 and 𝜏 , we know that(5) inf x ∈ 𝑆 𝛾 k Dx k ≥ 𝜏 for the set 𝑆 𝛾 : = n x ∈ ℝ 𝑛 : k x 𝑇 k > 𝛾 √ 𝑠 k x 𝑇 𝑐 k for some | 𝑇 | ≤ 𝑠 o ∩ 𝕊 𝑛 − .To show that ΦD fulfills the robust NSP of order 𝑠 with constants 𝛾 and 𝜏 , we need to showthat inf 𝑥 ∈ 𝑆 𝛾 k ΦD 𝑥 k & 𝜏 , To do this, we will use Proposition 6 for Ψ = ΦD and 𝜓 = D 𝑇 𝜑 /√ 𝑚 and 𝑆 = 𝑆 𝛾 .We first estimate the mean empirical width 𝑊 𝑚 ( 𝑆 𝛾 , D 𝑇 𝜑 /√ 𝑚 ) . To do this, we consider thesets Σ 𝑠 : = { x : k x k ≤ 𝑠, k x k = } and 𝑇 𝛾,𝑠 = (cid:8) x : k x 𝑇 k ≥ 𝛾 √ 𝑠 k x 𝑇 𝑐 k for some | 𝑇 | ≤ 𝑠 (cid:9) , anduse the following result. Lemma 6 ([DLR18, Lemma 2]) . Let 𝐷 𝑠 be the convex hull of the set Σ 𝑠 . Then 𝑇 𝛾,𝑠 ∩ 𝔹 𝑛 ⊂ ( + 𝛾 − ) 𝐷 𝑠 . Since 𝑆 𝛾 ⊂ 𝑇 𝛾,𝑠 ∩ 𝔹 𝑛 , we see by Hölder’s inequality and Lemma 6 that 𝑊 𝑚 ( 𝑆 𝛾 , D 𝑇 𝜑 /√ 𝑚 ) ≤ 𝑊 𝑚 ( 𝑇 𝛾,𝑠 ∩ 𝔹 𝑛 , D 𝑇 𝜑 /√ 𝑚 ) ≤ ( + 𝛾 − ) 𝑊 𝑚 ( 𝐷 𝑠 , D 𝑇 𝜑 /√ 𝑚 ) . Let 𝑉 : = 𝑚 − / Í 𝑚𝑖 = 𝜀 𝑖 𝜑 𝑖 . Since Í 𝑚𝑖 = 𝜀 𝑖 D 𝑇 𝜑 𝑖 /√ 𝑚 = D 𝑇 𝑉 , we see that(6) 𝑊 𝑚 ( 𝐷 𝑠 , D 𝑇 𝜑 /√ 𝑚 ) = 𝑚 − 𝔼 sup x ∈ 𝐷 𝑠 h x , D 𝑇 𝑉 i = 𝑚 − 𝔼 sup x ∈ Σ 𝑠 h x , D 𝑇 𝑉 i = 𝑚 − 𝔼 " 𝑠 Õ 𝑖 = (cid:0) ( D 𝑇 𝑉 ) ∗ 𝑖 (cid:1) / . ICTIONARY-SPARSE RECOVERY FROM HEAVY-TAILED MEASUREMENTS 17
In the second equality, we used the fact that the supremum of the linear form over theconvex hull 𝐷 𝑠 and over Σ 𝑠 coincides. Under the assumptions of Theorem 1, it holds that 𝑚 & log ( 𝑛 / 𝑠 ) max ( 𝛼 − , ) and therefore Proposition 4 implies that 𝑊 𝑚 ( 𝑆 𝛾 , D 𝑇 𝜑 /√ 𝑚 ) ≤ ( + 𝛾 − ) 𝑚 𝔼 " 𝑠 Õ 𝑖 = (cid:0) ( D 𝑇 𝑉 ) ∗ 𝑖 (cid:1) / . + 𝛾 − 𝜌𝑚 r 𝑠 log (cid:16) 𝑛𝑠 (cid:17) . √ 𝑚 r 𝑠𝑚 log (cid:16) 𝑛𝑠 (cid:17) , where 𝜌 = Θ ( ) is the uniform upper bound on the ℓ -norms of the columns of D .Furthermore, we lower bound 𝑄 𝐴 ( 𝑆 𝛾 ; D 𝑇 𝜑 ) such that 𝑄 𝐴 ( 𝑆 𝛾 ; D 𝑇 𝜑 /√ 𝑚 ) = inf x ∈ 𝑆 𝛾 ℙ (|h x , D 𝑇 𝜑 /√ 𝑚 i | ≥ 𝐴 ) = inf x ∈ 𝑆 𝛾 ℙ (|h Dx , 𝜑 /√ 𝑚 i | ≥ 𝐴 ) = inf x ∈ 𝑆 𝛾 ℙ (cid:18)(cid:12)(cid:12)(cid:12)D Dx k Dx k , 𝜑 E(cid:12)(cid:12)(cid:12) ≥ 𝐴 √ 𝑚 k Dx k (cid:19) ≥ inf z ∈ 𝕊 𝑛 − ℙ (cid:16) |h z , 𝜑 i | ≥ 𝐴𝜏 √ 𝑚 (cid:17) ≥ 𝑐 if 𝐴 = 𝐴 ∗ 𝜏 √ 𝑚 , where 𝐴 ∗ and 𝑐 are the constants of condition (2) of Assumption 1, using eq. (5) inthe first inequality.Putting this together with the above estimate for the mean empirical width, we obtain byProposition 6 that inf 𝑥 ∈ 𝑆 𝛾 k Φ D 𝑥 k & 𝐴 ∗ 𝑐𝜏 − √ 𝑚𝑊 𝑚 ( 𝑆 𝛾 , D 𝑇 𝜑 ) − 𝐴 ∗ 𝜏 √ 𝑚 𝑡 & 𝐴 ∗ 𝑐𝜏 − vt 𝑠 log (cid:16) 𝑛𝑠 (cid:17) 𝑚 − 𝐴 ∗ 𝜏 √ 𝑚 𝑡 & 𝐴 ∗ 𝑐𝜏 with probability at least − 𝑒 − Ω ( 𝑚 ) if 𝑚 & Ω (cid:0) 𝜏 ( 𝐴 ∗ ) 𝑐 𝑠 log ( 𝑛 / 𝑠 ) (cid:1) .This concludes the proof as it shows that under the conditions stated by Theorem 1, withprobability at least − 𝑒 − Ω ( 𝑚 ) , ΦD fulfills the robust NSP of order 𝑠 with constants 𝛾 and e 𝜏 : = Θ ( 𝜏 ) .Finally, on this event, the two inequalities bounding the distances k ˆx − x k and k ˆz − z k between the output of ℓ -synthesis eq. (3) ˆx and ˆz and follow from Proposition 2 and the factthat k ˆz − z k = k D ( ˆx − x ) k ≤ k D k k ˆx − x k . (cid:3) We continue with the proof of Theorem 2.
Proof of Theorem 2.
To show the statement of Theorem 2, we proceed analogously to the proof ofTheorem 1. Theorem 1 can be used to upper bound the mean empirical width 𝑊 𝑚 ( 𝑆 𝛾 , D 𝑇 𝜑 /√ 𝑚 ) ,but to lower bound the marginal teal function, we proceed slightly differently.Similarly as in the proof of Lemma 4 for a unit norm a ∈ 𝕊 𝑑 − 𝑝 , we 𝔼 (cid:2) |h 𝜑, a i | (cid:3) = 𝑑 Õ 𝑖 = Õ 𝑖 <𝑖 <...<𝑖 𝑡 Õ 𝑑 + ...𝑑 𝑡 = ,𝑑 𝑖 ≥ (cid:18) 𝑑 , . . . , 𝑑 𝑡 (cid:19) 𝑡 Ö 𝑗 = a 𝑑 𝑗 𝑖 𝑗 𝔼 [ 𝜉 𝑑 𝑗 𝑖 𝑗 ] ! By a standard symmetrization argument, we may assume that the 𝜉 𝑖 are symmetric randomvariables. Using the symmetry, we note that the 𝔼 [ 𝜉 𝑑 𝑗 𝑖 𝑗 ] are zero for all odd 𝑑 𝑗 . The onlypossibilities for non-zero terms are such that either(1) There exists 𝑖 𝑗 ∈ ℕ such that 𝑖 𝑗 = and Î 𝑡𝑗 = a 𝑑 𝑗 𝑖 𝑗 𝔼 [ 𝜉 𝑑 𝑗 𝑖 𝑗 ] = a 𝑖 𝑗 𝔼 [ 𝜉 𝑖 𝑗 ] . By condition (3) ofAssumption 2, a 𝑖 𝑗 𝔼 [ 𝜉 𝑖 𝑗 ] ≤ a 𝑖 𝑗 𝜆 𝛼 .(2) There exist 𝑖 𝑗 ≠ 𝑖 𝑘 ∈ ℕ such that 𝑑 𝑖 𝑗 = 𝑑 𝑖 𝑘 = and Î 𝑡𝑗 = a 𝑑 𝑗 𝑖 𝑗 𝔼 [ 𝜉 𝑑 𝑗 𝑖 𝑗 ] = a 𝑖 𝑗 a 𝑖 𝑘 𝔼 [ 𝜉 𝑖 𝑗 𝜉 𝑖 𝑘 ] = a 𝑖 𝑗 a 𝑖 𝑘 𝔼 [ 𝜉 𝑖 𝑗 ] 𝔼 [ 𝜉 𝑖 𝑘 ] ≤ a 𝑖 𝑗 a 𝑖 𝑘 𝜆 𝛼 𝜆 𝛼 = a 𝑖 𝑗 a 𝑖 𝑘 𝜆 𝛼 , using independence of the 𝜉 𝑖 . Therefore, 𝔼 (cid:2) |h 𝜑, a i | (cid:3) ≤ 𝜆 𝛼 (cid:0) k a k + k a k (cid:1) and thus k h 𝜑, a i | k 𝐿 . 𝜆 𝛼 k a k = 𝜆 𝛼 . Furthermore, by independence of the { 𝜉 𝑖 } 𝑚𝑖 = it follows that k h 𝜑, a i | k 𝐿 = k a k = . Finally,using the Paley-Zygmund inequality (cf., e.g., [FR13, Lemma 7.16]), we conclude that ℙ (|h 𝜑, a i | ≥ / ) ≥ (cid:0) 𝔼 (cid:2) |h 𝜑, a i | (cid:3) − / (cid:1) 𝔼 [|h 𝜑, a i | ] ≥
14 1 𝜆 𝛼 = 𝜆 − − 𝛼 − , which shows that 𝜑 actually fulfills condition (2) of Assumption 1 for 𝐴 ∗ = / and 𝑐 = 𝜆 − − 𝛼 − . Therefore, the statement of Theorem 2 follows from Theorem 1. (cid:3) Proof of Proposition 5 and Theorem 3.
Here, we show the results of Section 4.1.
Proof of Proposition 5.
The proof is very similar to the proof of Proposition 4. In particular, for 𝑉 𝜓 = 𝑑 − / Í 𝑑𝑗 = 𝜀 𝑗 𝜓 𝑗 , we can define 𝑧 𝑖 : = ( 𝑉 𝜓 ) 𝑖 /k ( 𝑉 𝜓 ) 𝑖 k = h e 𝑖 , 𝑉 𝜓 i/k h e 𝑖 , 𝑉 𝜓 i k for each 𝑖 ∈ [ 𝑛 ] ,we apply Lemma 5 as (3) of Assumption 3 coincides with (3) of Assumption 2. It follows that k 𝑧 𝑖 k 𝐿 𝑝 . √ 𝑝 for all 𝑖 ∈ [ 𝑛 ] if 𝑑 & log ( 𝑛 / 𝑠 ) max ( 𝛽 − , ) . A direct application of Lemma 1 showsthe result. (cid:3) Proof of Theorem 3.
This statement can be shown analogously to Theorem 1 by using Proposi-tion 6. In Proposition 6, we choose Ψ = D , whose rows are distributed as 𝜓 /√ 𝑑 , and 𝑆 = 𝑆 𝛾 . Themean empirical width can be bounded by Proposition 5, we bound the marginal tail functionsuch that 𝑄 𝐴 ( 𝑆 𝛾 ; 𝜓 /√ 𝑑 ) = inf x ∈ 𝑆 𝛾 ℙ (|h x , 𝜓 i | ≥ 𝐴 √ 𝑑 ) = inf x ∈ 𝑆 𝛾 ℙ (|h x , 𝜓 i | ≥ 𝐴 √ 𝑑 ) ≥ 𝑐 for 𝐴 = 𝐴 ∗ /( √ 𝑑 ) , where 𝐴 ∗ and 𝑐 are the constants of the small ball assumption of Assump-tion 3. (cid:3) Proofs of Theorems 4 and 5.
In this section, we only stress the differences between theproofs of Theorem 4 and Theorem 5 and the results above.
Proof of Theorem 4.
Using Theorem 3, it follows for the given assumptions, the dictionary D fulfills the robust NSP of order 𝑠 with constants 𝛾 and 𝜏 on an event 𝐸 with ℙ ( 𝐸 𝑐 ) = 𝑒 − Ω ( 𝑑 ) .So we may proceed as in the deterministic case, apply the small ball method for Φ D and theproblem boils down to prove that, for 𝑉 : = 𝑚 − / Í 𝑚𝑖 = 𝜀 𝑖 𝜑 𝑖 , where 𝜑 𝑖 are as in Assumption 1and 𝜀 , . . . , 𝜀 𝑚 independent Rademacher variables,(7) 𝔼 " 𝑠 Õ 𝑖 = ( D 𝑇 𝑉 ) ∗ 𝑖 ) / . r 𝜌𝑠 log (cid:16) 𝑛𝑠 (cid:17) , where 𝜌 : = max 𝑗 ∈[ 𝑛 ] 𝔼 k d 𝑗 k is a bound on the maximum of the expected squared column ℓ -norms of D . Importantly, we note that the maximum in 𝜌 is taken outside of the expectation.Furthermore, the expectation is now taken with respect to both Φ and D . The constant 𝜌 canbe estimated such that 𝜌 = max 𝑗 ∈[ 𝑛 ] 𝔼 k 𝑑 𝑗 k = max 𝑗 ∈[ 𝑛 ] 𝑑 Õ 𝑖 = 𝔼 [ 𝑑 𝑗𝑖 ] = 𝑂 ( ) ICTIONARY-SPARSE RECOVERY FROM HEAVY-TAILED MEASUREMENTS 19 since 𝑑 𝑗𝑖 = ( 𝜓 𝑖 ) 𝑗 /√ 𝑑 and 𝜓 𝑖 has ( 𝜓 𝑖 ) 𝑗 has a second moment of order 𝑂 ( ) for any 𝑖 ∈ [ 𝑛 ] and 𝑗 ∈ [ 𝑑 ] . The proof of eq. (7) is then a straightforward modification of the analogous boundfor deterministic dictionary, with the main difference that we condition on D and take theexpectation afterwards. Then we can conclude using Proposition 6. (cid:3) Proof of Theorem 5.
Recall the definition 𝑉 : = 𝑚 − / Í 𝑚𝑖 = 𝜀 𝑖 𝜑 𝑖 with 𝜑 𝑖 as in Assumption 4 andthe 𝜖 𝑖 as usual. Following the same approach as for Theorem 4, our problem boils down tobound eq. (7). The following proof is similar to [Men15, Lemma 6.5]. Proposition 7.
Consider D and Φ as in Theorem 5. Then, 𝔼 " 𝑠 Õ 𝑖 = ( D 𝑇 𝑉 ) ∗ 𝑖 ) / . r 𝑠 log (cid:16) 𝑒𝑛𝑠 (cid:17) . Proof.
Conditioning on 𝑉 , we obtain, for arbitrary 𝑡 > ,(8) ℙ (cid:16) ( D 𝑇 𝑉 ) ∗ 𝑗 ≥ 𝑡 (cid:12)(cid:12)(cid:12) 𝑉 (cid:17) ≤ (cid:18) 𝑛𝑗 (cid:19) ℙ (cid:16) |h d 𝑇 , 𝑉 i | > 𝑡 √ 𝑑 (cid:12)(cid:12)(cid:12) 𝑉 (cid:17) 𝑗 ≤ (cid:18) 𝑛𝑗 (cid:19) k h d 𝑇 , 𝑉 i k 𝑝𝑗𝑝 (√ 𝑑𝑡 ) 𝑝𝑗 , using the independence of the d , . . . , d 𝑛 . Since 𝑉 is fixed, we apply Lemma 4 for d and write ℙ (cid:16) ( D 𝑇 𝑉 ) ∗ 𝑗 ≥ 𝑡 | 𝑉 (cid:17) . (cid:18) 𝑛𝑗 (cid:19) (cid:18) √ 𝑝 k 𝑉 k √ 𝑑𝑡 (cid:19) 𝑝𝑗 Setting √ 𝑑𝑡 = 𝑢 k 𝑉 k p log ( 𝑒𝑛 / 𝑗 ) and 𝑝 = log ( 𝑒𝑛 / 𝑗 ) , it follows then that ℙ (cid:18) ( D 𝑇 𝑉 ) ∗ 𝑗 ≥ 𝑢 k 𝑉 k √ 𝑑 p log ( 𝑒𝑛 / 𝑗 ) (cid:12)(cid:12)(cid:12)(cid:12) 𝑉 (cid:19) ≤ (cid:16) 𝑒𝑢 (cid:17) 𝑗 log ( 𝑒𝑛𝑗 ) . Integrating the tails we obtain 𝔼 D ( D 𝑇 𝑉 ) ∗ 𝑗 ) . k 𝑉 k 𝑑 log ( 𝑒𝑛 / 𝑗 ) . We now take the expectationboth sides with respect to 𝑉 and get 𝔼 ( D 𝑇 𝑉 ) ∗ 𝑗 ) . log ( 𝑒𝑛 / 𝑗 ) because 𝔼 k 𝑉 k = 𝑂 ( 𝑑 ) . To seethis, we compute 𝔼 k 𝑉 k = 𝑑 Õ 𝑗 = 𝔼 [ 𝑉 𝑗 ] = 𝑑 Õ 𝑗 = 𝑚 𝑚 Õ 𝑖 = 𝔼 [( 𝜑 𝑖 ) 𝑗 ] + Õ 𝑖 ≠ 𝑘 𝔼 𝜀 𝑖 𝜀 𝑘 ( 𝜑 𝑗 ) 𝑖 ( 𝜑 𝑗 ) 𝑘 ! . Now, because of the bounded variance assumption, the first term in the summand is 𝑂 ( 𝑑 ) andby independence of the Rademacher sequence 𝜀 𝑖 the second term in the summand is zero,which shows 𝔼 k 𝑉 k = 𝑂 ( 𝑑 ) .To finish the proof, just sum over 𝑗 ∈ [ 𝑠 ] , take the square root both sides and apply Jensen’sinequality. (cid:3) The remainder of the proof is as for Theorem 4. (cid:3) Outlook
Outlook

In this work, we presented an analysis of the $\ell_{1}$-synthesis method under several very general assumptions on both the dictionary and the measurement matrix, which include in particular heavy-tailed measurements. Specialized to the case of a trivial dictionary, i.e., to the standard compressed sensing setting, we showed that control of only the first $\log(n/s)$ moments of the entries of a measurement matrix suffices to establish recovery guarantees with the optimal number of measurements.

The randomness assumptions we impose on our measurement matrix might not correspond to directly implementable designs in many practical applications of compressed sensing, where structural constraints are typically present. It would be interesting to see whether the presented tools can be helpful to analyze $\ell_{1}$-synthesis in such settings.

While prior research has shown that $\ell_{1}$-synthesis can successfully recover a dictionary-sparse signal $z$ even when its sparse representation in the dictionary is not unique, we have focused only on the joint recovery of the signal and the coefficient vector. It would be interesting to investigate how to establish such non-uniform signal recovery results from heavy-tailed measurements in future work.

Acknowledgements
C.K. is grateful for support from the NSF under grant NSF-IIS-1837991.
References

[ACM12] B. Alexeev, J. Cahill, and D. G. Mixon. Full spark frames. J. Fourier Anal. Appl., 18(6):1167–1194, 2012.
[ACP12] A. Aldroubi, X. Chen, and A. M. Powell. Perturbations of measurement matrices and dictionaries in compressed sensing. Appl. Comput. Harmon. Anal., 33(2):282–291, 2012.
[AEB06] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process., 54(11):4311–4322, 2006.
[AHPR17] B. Adcock, A. C. Hansen, C. Poon, and B. Roman. Breaking the coherence barrier: A new theory for compressed sensing. In Forum Math. Sigma, volume 5, pages 1–84. Cambridge University Press, 2017.
[BD09] T. Blumensath and M. E. Davies. Sampling theorems for signals from the union of finite-dimensional linear subspaces. IEEE Trans. Inf. Theory, 55(4):1872–1882, 2009.
[BDDW08] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constr. Approx., 28(3):253–263, 2008.
[BFH16] J.-L. Bouchot, S. Foucart, and P. Hitczenko. Hard thresholding pursuit algorithms: Number of iterations. Appl. Comput. Harmon. Anal., 41(2):412–435, 2016. Special issue on Sparse Representations with Applications in Imaging Science, Data Analysis, and Beyond, Part II.
[CCL20] P. G. Casazza, X. Chen, and R. G. Lynch. Preserving injectivity under subgaussian mappings and its application to compressed sensing. Appl. Comput. Harmon. Anal., 49(2):451–470, 2020.
[CCW16] J. Cahill, X. Chen, and R. Wang. The gap between the null space property and the restricted isometry property. Linear Algebra Appl., 501:363–375, 2016.
[CDD09] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best $k$-term approximation. J. Amer. Math. Soc., 22(1):211–231, 2009.
[CENR11] E. Candès, Y. Eldar, D. Needell, and P. Randall. Compressed sensing with coherent and redundant dictionaries. Appl. Comput. Harmon. Anal., 31(1):59–73, 2011.
[CRT06a] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory, 52(2):489–509, 2006.
[CRT06b] E. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, 2006.
[CT06] E. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inf. Theory, 52(12):5406–5425, 2006.
[CWW14] X. Chen, H. Wang, and R. Wang. A null space analysis of the $\ell_1$-synthesis method in dictionary-based compressed sensing. Appl. Comput. Harmon. Anal., 37(3):492–515, 2014.
[DLR18] S. Dirksen, G. Lecué, and H. Rauhut. On the gap between restricted isometry properties and sparse recovery conditions. IEEE Trans. Inf. Theory, 64(8):5478–5487, 2018.
[DNW13] M. A. Davenport, D. Needell, and M. B. Wakin. Signal space CoSaMP for sparse recovery with redundant dictionaries. IEEE Trans. Inf. Theory, 59(10):6820–6829, 2013.
[Don06] D. L. Donoho. Compressed sensing. IEEE Trans. Inf. Theory, 52(4):1289–1306, 2006.
[EMR07] M. Elad, P. Milanfar, and R. Rubinstein. Analysis versus synthesis in signal priors. Inverse Problems, 23(3):947–968, 2007.
[Fou11] S. Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM J. Numer. Anal., 49(6):2543–2563, 2011.
[Fou14] S. Foucart. Stability and robustness of $\ell_1$-minimizations with Weibull matrices and redundant dictionaries. Linear Algebra Appl., 441:4–21, 2014. Special issue on Sparse Approximate Solution of Linear Systems.
[FR13] S. Foucart and H. Rauhut. An invitation to compressive sensing. In A Mathematical Introduction to Compressive Sensing. Springer, 2013.
[GE13] R. Giryes and M. Elad. Can we allow linear dependencies in the dictionary in the sparse synthesis framework? In , pages 5459–5463, 2013.
[GKM20] M. Genzel, G. Kutyniok, and M. März. $\ell_1$-analysis minimization and generalized (co-)sparsity: When does recovery succeed? Appl. Comput. Harmon. Anal., 2020.
[GL07] K. Guo and D. Labate. Optimally sparse multidimensional representation using shearlets. SIAM J. Math. Anal., 39(1):298–318, 2007.
[GN15] R. Giryes and D. Needell. Greedy signal space methods for incoherence and beyond. Appl. Comput. Harmon. Anal., 39(1):1–20, 2015.
[HBRN10] J. Haupt, W. U. Bajwa, G. Raz, and R. Nowak. Toeplitz compressed sensing matrices with applications to sparse channel estimation. IEEE Trans. Inf. Theory, 56(11):5862–5875, 2010.
[HR17] I. Haviv and O. Regev. The restricted isometry property of subsampled Fourier matrices. In Geometric Aspects of Functional Analysis, pages 163–179. Springer, 2017.
[HS09] M. A. Herman and T. Strohmer. High-resolution radar via compressed sensing. IEEE Trans. Signal Process., 57(6):2275–2284, 2009.
[KM15] V. Koltchinskii and S. Mendelson. Bounding the smallest singular value of a random matrix without concentration. Int. Math. Res. Not. IMRN, 2015(23):12991–13008, 2015.
[KMR14] F. Krahmer, S. Mendelson, and H. Rauhut. Suprema of chaos processes and the restricted isometry property. Comm. Pure Appl. Math., 67(11):1877–1904, 2014.
[KN04] S. Kotz and S. Nadarajah. Multivariate t-Distributions and Their Applications. Cambridge University Press, 2004.
[KNW15] F. Krahmer, D. Needell, and R. Ward. Compressive sensing with redundant dictionaries and structured measurements. SIAM J. Math. Anal., 47(6):4606–4629, 2015.
[Kol11] V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer Science & Business Media, 2011.
[KW92] S. Kwapień and W. A. Woyczyński. Random Series and Stochastic Integrals: Single and Multiple. Birkhäuser, 1992.
[KW14] F. Krahmer and R. Ward. Stable and robust sampling strategies for compressive imaging. IEEE Trans. Image Process., 23(2):612–622, 2014.
[LDSP08] M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly. Compressed sensing MRI. IEEE Signal Processing Magazine, 25(2):72–82, 2008.
[LLM+12] Y. Liu, S. Li, T. Mi, H. Lei, and W. Yu. Performance analysis of $\ell_1$-synthesis with coherent frames. In , pages 2042–2046. IEEE, 2012.
[MBKW20] M. März, C. Boyer, J. Kahn, and P. Weiss. Sampling rates for $\ell_1$-synthesis. arXiv preprint arXiv:2004.07175, 2020.
[Men15] S. Mendelson. Learning without concentration. J. ACM, 62(3):21:1–21:25, 2015.
[ML17] S. Mendelson and G. Lecué. Sparse recovery under weak moment assumptions. J. Eur. Math. Soc., 19(3):881–904, 2017.
[MRW+18] S. Mendelson, H. Rauhut, R. Ward, et al. Improved bounds for sparse recovery from subsampled random convolutions. Ann. Appl. Probab., 28(6):3491–3527, 2018.
[NDEG13] S. Nam, M. Davies, M. Elad, and R. Gribonval. The cosparse analysis model and algorithms. Appl. Comput. Harmon. Anal., 34(1):30–56, 2013.
[PEPC10] L. C. Potter, E. Ertin, J. T. Parker, and M. Cetin. Sparsity and compressed sensing in radar imaging. Proc. IEEE, 98(6):1006–1020, 2010.
[RB13] S. Ravishankar and Y. Bresler. Learning sparsifying transforms. IEEE Trans. Signal Process., 61(5):1072–1086, 2013.
[RRT12] H. Rauhut, J. Romberg, and J. A. Tropp. Restricted isometries for partial random circulant matrices. Appl. Comput. Harmon. Anal., 32(2):242–254, 2012.
[RSV08] H. Rauhut, K. Schnass, and P. Vandergheynst. Compressed sensing and redundant dictionaries. IEEE Trans. Inf. Theory, 54(5):2210–2219, 2008.
[RV08] M. Rudelson and R. Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Comm. Pure Appl. Math., 61(8):1025–1045, 2008.
[TP13] A. M. Tillmann and M. E. Pfetsch. The computational complexity of the restricted isometry property, the nullspace property, and related concepts in compressed sensing. IEEE Trans. Inf. Theory, 60(2):1248–1259, 2013.
[Tro15] J. Tropp. Convex recovery of a structured signal from independent random linear measurements. In Sampling Theory, a Renaissance, pages 67–101. Springer, 2015.
[Ver18] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
[WYG+09] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., 31(2):210–227, 2009.
[Zha11] T. Zhang. Sparse recovery with orthogonal matching pursuit under RIP. IEEE Trans. Inf. Theory, 57(9):6215–6221, 2011.
[ZL20] Y.-B. Zhao and Z.-Q. Luo. Improved RIP-based bounds for guaranteed performance of several compressed sensing algorithms. arXiv preprint arXiv:2007.01451, 2020.
[ZXQ19] S. Zhou, N. Xiu, and H.-D. Qi. Global and quadratic convergence of Newton hard-thresholding pursuit. arXiv preprint arXiv:1901.02763, 2019.