SSparse Recovery of Fusion Frame Structured Signals
Ula¸s Ayaz ∗ April 9, 2018
Abstract
Fusion frames are collection of subspaces which provide a redundant representation of signalspaces. They generalize classical frames by replacing frame vectors with frame subspaces. Thispaper considers the sparse recovery of a signal from a fusion frame. We use a block sparsitymodel for fusion frames and then show that sparse signals under this model can be compressivelysampled and reconstructed in ways similar to standard Compressed Sensing (CS). In particularwe invoke a mixed (cid:96) /(cid:96) norm minimization in order to reconstruct sparse signals. In ourwork, we show that assuming a certain incoherence property of the subspaces and the aprioriknowledge of it allows us to improve recovery when compared to the usual block sparsity case. The problem of recovering sparse signals in R N from m < N measurements has drawn a lotof attention in recent years [7, 8, 13]. Compressed Sensing (CS) achieves such performance byimposing a sparsity model on the signal of interest. The sparsity model, combined with randomizedlinear acquisition, guarantees that certain non-linear reconstruction can be used to efficiently andaccurately recover the signal.Classical frames are nowadays a standard notion for modeling applications, in which redundancyplays a vital role such as filter bank theory, sigma-delta quantization, signal and image processingand wireless communications. Fusion frames are a relatively new concept which is potentially usefulwhen a single frame system is not sufficient to represent the whole sensing mechanism efficiently.Fusion frames were first introduced in [10] under the name of ‘frames of subspaces’ (see also thesurvey [9]). They analyze signals by projecting them onto multidimensional subspaces, in contrastto frames which consider only one-dimensional projections. In the literature, there have been anumber of applications where fusion frames have proven to be useful practically, such as distributedprocessing [19, 11], parallel processing [4, 20], packet encoding [5].In this paper, we consider the recovery of sparse signals from a fusion frame. Signals of interestare collections of vectors from fusion frame subspaces which can be considered as a ‘block’ signal. Inother words, say we have N subspaces, then we have a collection of N vectors which is the (block)signal we wish to recover. In addition to block structure, the signals from a fusion frame have theproperty that each block belongs to a particular fusion frame subspace. We are then allowed toobserve linear measurements of those vectors and we aim to reconstruct the original signal fromthose measurements. In order to do so, we use ideas from CS. We assume a ‘block’ sparsity model onthe signals to be recovered where a few of the vectors in the collection are assumed to be nonzero.For instance, we are not interested whether each vector is sparse or not within the subspace it ∗ Laboratory for Information & Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, USA. a r X i v : . [ c s . I T ] A p r elongs to. For the reconstruction, a mixed (cid:96) /(cid:96) minimization is invoked in order to capture thestructure that comes with the sparsity model.This problem was studied before in [6] where the authors proved that it is sufficient for recoveryto take m (cid:38) s ln( N ) random measurements. Here s is the sparsity level and N is the numberof subspaces. It is worth emphasizing that our model assumes that the signals of interest lie inparticular subspaces which is not assumed in block sparsity problems. 
In this paper, our goal is toimprove the results in [6] when the subspaces have a certain incoherence property , i.e., they havenontrivial mutual angles between them, and we assume to know them. Recently the authors in [2]have studied this problem in the uniform recovery setting. Our focus in this paper will be on thenonuniform recovery of signals from a fusion frame. To our best knowledge, a result of this natureis not available in the literature. We denote Euclidean norm of vectors by (cid:107) · (cid:107) and the spectral norm of matrices by (cid:107) · (cid:107) . Boldfacenotation refers to block vectors and matrices throughout this paper. Let [ N ] = { , , . . . , N } . Fora block matrix A = ( a ij B ij ) i ∈ [ m ] ,j ∈ [ N ] ∈ R md × Nd with blocks B ij ∈ R d × d , we denote the (cid:96) -thblock column by A (cid:96) = ( a i(cid:96) B i(cid:96) ) i ∈ [ m ] ∈ R md × d and the column submatrix restricted to S ⊂ [ N ] by A S = ( a ij B ij ) i ∈ [ m ] ,j ∈ S . Similarly for a block vector x = ( x i ) Ni =1 ∈ R Nd with x i ∈ R d , we denotethe vector x S = ( x i ) i ∈ S restricted to S . The complement [ N ] \ S is denoted S . The (cid:96) ∞ → (cid:96) ∞ and (cid:96) → (cid:96) ∞ norms of a matrix A ∈ R m × n are given as (cid:107) A (cid:107) ∞ = max i ∈ [ m ] n (cid:88) j =1 A ij , and (cid:107) A (cid:107) , ∞ = max i ∈ [ m ] n (cid:88) j =1 A ij / , respectively. Furthermore, for a block vector x = ( x i ) Ni =1 with x i ∈ R d we define the (cid:96) , ∞ -norm as (cid:107) x (cid:107) , ∞ = max i ∈ [ N ] (cid:107) x i (cid:107) . Given a block vector z with N blocks from R d , we define sgn( z ) ∈ R dN analogously to the scalarcase as sgn( z ) i = (cid:26) z i (cid:107) z i (cid:107) : (cid:107) z i (cid:107) (cid:54) = 0 , (cid:107) z i (cid:107) = 0 . A fusion frame for R d is a collection of N subspaces W j ⊂ R d and associated weights v j that satisfy A (cid:107) x (cid:107) ≤ N (cid:88) j =1 v j (cid:107) P j x (cid:107) ≤ B (cid:107) x (cid:107) (1)for all x ∈ R d and for some universal fusion frame bounds 0 < A ≤ B < ∞ , where P j ∈ R d × d denotesthe orthogonal projection onto the subspace W j . For simplicity we assume that the dimensions ofthe W j are equal, that is dim( W j ) = k . For a fusion frame ( W j ) Nj =1 , let us define the space H = { ( x j ) Nj =1 : x j ∈ W j , ∀ j ∈ [ N ] } ⊂ R d × N , N ] = { , . . . , N } . The mixed (cid:96) , -norm of a vector x ≡ ( x j ) Nj =1 ∈ H is defined as (cid:107) x (cid:107) , = N (cid:88) j =1 (cid:107) x j (cid:107) . Furthermore, the ‘ (cid:96) - block norm’ of x ∈ H (which is actually not even a quasi-norm) is defined as (cid:107) x (cid:107) = (cid:93) { j ∈ [ N ] : x j (cid:54) = 0 } . We say that a vector x ∈ H is s -sparse, if (cid:107) x (cid:107) ≤ s .In this paper we consider the recovery of sparse vectors from the set H which collects all vectorsfrom a fusion frame subspaces. However our results do not assume necessarily that subspaces satisfythe fusion frame property (1) but rather assume they satisfy an incoherence property as explainedin Section 2.3. A special case of the sparse recovery problem above appears when all subspaces coincide with theambient space W j = R d for all j . Then the problem reduces to the well studied joint sparsity setup [15] in which all the vectors have the same sparsity structure. In order to see this, say we have N vectors { x , . . . , x N } in R d and write them as rows of a matrix X ∈ R N × d . 
Assuming only s of the rows are non-zero, the d vectors consisting of the columns of the matrix X have the samecommon support set, i.e., are jointly sparse. To be more precise, if ( X i ) di =1 denote the columns of X , supp( X i ) = supp( X j ) for all i (cid:54) = j .Furthermore, our problem is itself a special case of block sparsity setup [14], with significantadditional structure that allows us to enhance existing results. Without the subspace structure, wewould have N vectors { x , . . . , x N } in R d where only s of them are non-zero, i.e., card { i ∈ [ N ] : x i (cid:54) = 0 } = s . The reason this is called block sparsity is because we count the vectors which arenon-zero as a block rather than checking if its entries are sparse.Another related concept is group sparsity [21] where each entry of the vector is assigned to apredefined group and sparsity counts the groups that are active in the support set of the vector. Inmathematical notation, consider a vector x of length p and assume that its coefficients are groupedinto sets { G i } Ni =1 , such that for all i ∈ [ N ], G i ⊂ [ p ]. The vector x is group s -sparse if the non-zerocoefficients lie in s of the groups { G i } Ni =1 . Formally,card { i ∈ [ N ] : supp( x ) ∩ G i (cid:54) = 0 } = s. This is similar to the block sparsity with p = N d except that the groups may be allowed to overlap,i.e., G i ∩ G j (cid:54) = ∅ . We also note that, for sparse recovery, a natural proxy for group sparsity becomesthe norm (cid:80) i (cid:107) x G i (cid:107) .Finally in the case d = 1, the projections equal 1, and hence the problem reduces to the standardCS recovery problem Ax = y with x ∈ R N and y ∈ R m . Our measurement model is as follows. Let x = ( x j ) Nj =1 ∈ H be an s -sparse and we observe m linear combinations of those vectors y = ( y i ) mi =1 = N (cid:88) j =1 a ij x j mi =1 , y i ∈ R d , a ij ). Let us denote the block matrices A I = ( a ij I d ) i ∈ [ m ] ,j ∈ [ N ] and A P =( a ij P j ) i ∈ [ m ] ,j ∈ [ N ] that consist of the blocks a ij I d and a ij P j respectively. Here I d is the identitymatrix of size d × d . Then we can formulate this measurement scheme as y = A I x = A P x , for x ∈ H . We replaced A I by A P since the relation P j x j = x j holds for all j ∈ [ N ]. We wish to recover x from those measurements. This problem can be formulated as the following optimization program( L
0) ˆ x = argmin x ∈H (cid:107) x (cid:107) s.t. A P x = y . The optimization program ( L
0) is NP-hard. Instead we propose the following convex program theformer ( L
1) ˆ x = argmin x ∈H (cid:107) x (cid:107) , s.t. A P x = y . We shall show that block sparse signals can be accurately recovered by solving ( L P denotes the N × N block diagonal matrix where the diagonals are projection matrices P j , j ∈ [ N ]. P S denotes the s × s block diagonal matrix with entries P i , i ∈ S . In this section we define an incoherence matrix Λ associated with a fusion frameΛ = ( α ij ) i,j ∈ [ N ] , (2)where α ij = (cid:107) P i P j (cid:107) for i (cid:54) = j ∈ [ N ] and α ii = 0. Note that (cid:107) P i P j (cid:107) equals the largest absolute valueof the cosines of the principle angles between W i and W j . Then, for a given support set S ⊂ [ N ],we denote Λ S as the submatrix of Λ with columns restricted to S , and Λ S as the the submatrixwith columns and rows restricted to S . Then we note the following norms (cid:107) Λ S (cid:107) , ∞ = max i ∈ [ N ] (cid:88) j ∈ S,j (cid:54) = i (cid:107) P i P j (cid:107) / and (cid:107) Λ S (cid:107) , ∞ = max i ∈ S (cid:88) j ∈ S,j (cid:54) = i (cid:107) P i P j (cid:107) / , and (cid:107) Λ S (cid:107) ∞ = max i ∈ [ N ] (cid:88) j ∈ S,j (cid:54) = i (cid:107) P i P j (cid:107) and (cid:107) Λ S (cid:107) ∞ = max i ∈ S (cid:88) j ∈ S,j (cid:54) = i (cid:107) P i P j (cid:107) . Moreover, we define the parameter λ as λ = max i (cid:54) = j (cid:107) P i P j (cid:107) , i, j ∈ [ N ] . Clearly λ equals the largest entry of Λ. In addition it holds that (cid:107) Λ S (cid:107) , ∞ ≤ (cid:107) Λ S (cid:107) ∞ ≤ λs for any S with | S | = s . If the subspaces are all orthogonal to each other, i.e. λ = 0 and Λ = 0,then only one measurement y = (cid:80) i a i x i suffices to recover x . In fact, since x i ⊥ x j , we have x i = 1 a i P i y. This observation suggests that fewer measurements are necessary when λ gets smaller. In this workour goal is to provide a solid theoretical understanding of this observation.4 Nonuniform recovery with Bernoulli matrices
In this section we study nonuniform recovery from fusion frame measurements. Our main resultstates that a fixed sparse signal can be recovered with high probability using a random draw of aBernoulli random matrix A ∈ R m × N whose entries take value ± W j ) Nj =1 for R d have dim( W j ) = k for all j . Theorem 3.1.
Let x ∈ H be s -sparse and S = supp( x ) . Let A ∈ R m × N be Bernoulli matrix anda fusion frame ( W j ) Nj =1 be given with the incoherence matrix Λ . Assume that m ≥ C (1 + (cid:107) Λ S (cid:107) ∞ ) ln( N ) ln( sk ) ln( ε − ) , (3) where C > is a universal constant. Then with probability at least − ε , ( L recovers x from y = A P x . If the subspaces are not equi-dimensional, one can replace sk term in Condition (3) by (cid:80) i ∈ S dim( W j ),where dim( W j ) = k j . We prove Theorem 3.1 in Section 3.3. The proof relies on the recovery con-dition of Lemma 3.3 via an inexact dual. Gaussian Case.
We also state a similar result for Gaussian random matrices without a proof.These matrices have entries as independent standard Gaussian random variables, i.e., A ij = g ij ∼N (0 , Theorem 3.2.
Let x ∈ H be s -sparse. Let A ∈ R m × N be a Gaussian matrix and ( W j ) Nj =1 be givenwith parameter λ ∈ [0 , and dim( W j ) = k for all j . Assume that m ≥ ˜ C (1 + λs ) ln (6 N k ) ln ( ε − ) , where ˜ C > is a universal constant. Then with probability at least − ε , ( L recovers x from y = A P x . The proof of this result can be found in [1, Section 3.3]. It also follows the inexact dual lemmahowever proceeds in a rather different way the Bernoulli case since the probabilistic tools we usefor proving Theorem 3.1 apply only for bounded random variables. Therefore we apply other toolswhich make proofs considerably long and so we do not present them here.
This section gives a sufficient condition for recovery of fixed sparse vectors based on an “inexactdual vector”. Sufficient conditions involving an exact dual vector were given in [17, 22]. Themodified inexact version is due to Gross [18]. Below, A |H restricts the matrix A to its range H . Lemma 3.3.
Let A ∈ R m × N and ( W j ) Nj =1 be a fusion frame for R d and x ∈ H with support S .Assume that (cid:107) [( A P ) ∗ S ( A P ) S ] − |H (cid:107) ≤ and max (cid:96) ∈ S (cid:107) ( A P ) ∗ S ( A P ) (cid:96) (cid:107) ≤ . (4) Suppose there exists a block vector u ∈ R Nd of the form u = A P ∗ h with block vector h ∈ R md suchthat (cid:107) u S − sgn( x S ) (cid:107) ≤ / and max i ∈ S (cid:107) u i (cid:107) ≤ / . (5) Then x is the unique minimizer of (cid:107) z (cid:107) , subject to A P z = A P x . roof. The proof follows [16, Theorem 4.32] and generalizes it to the block vector case. For con-venience we give the details here. Let ˆ x be a minimizer of (cid:107) z (cid:107) , subject to A P z = A P x . Then v = ˆ x − x ∈ H satisfies A P v = . We need to show that v = . First we observe that (cid:107) ˆ x (cid:107) , = (cid:107) x S + v S (cid:107) , + (cid:107) v S (cid:107) , = (cid:104) sgn( x S + v S ) , ( x S + v S ) (cid:105) + (cid:107) v S (cid:107) , ≥ (cid:104) sgn( x S ) , ( x S + v S ) (cid:105) + (cid:107) v S (cid:107) , = (cid:107) x S (cid:107) , + (cid:104) sgn( x S ) , v S (cid:105) + (cid:107) v S (cid:107) , . (6)For u = A P ∗ h it holds (cid:104) u S , v S (cid:105) = (cid:104) u , v (cid:105) − (cid:104) u S , v S (cid:105) = (cid:104) h , A P v (cid:105) − (cid:104) u S , v S (cid:105) = −(cid:104) u S , v S (cid:105) . Hence, (cid:104) sgn( x S ) , v S (cid:105) = (cid:104) sgn( x S ) − u S , v S (cid:105) + (cid:104) u S , v S (cid:105) = (cid:104) sgn( x S ) − u S , v S (cid:105) − (cid:104) u S , v S (cid:105) . The Cauchy-Schwarz inequality together with (5) yields |(cid:104) sgn( x S ) , v S (cid:105)| ≤ (cid:107) sgn( x S ) − u S (cid:107) (cid:107) v S (cid:107) + max i ∈ S (cid:107) u i (cid:107) (cid:107) v S (cid:107) , ≤ (cid:107) v S (cid:107) + 14 (cid:107) v S (cid:107) , . Together with (6) this yields (cid:107) ˆ x (cid:107) , ≥ (cid:107) x S (cid:107) , − (cid:107) v S (cid:107) + 34 (cid:107) v S (cid:107) , . We now bound (cid:107) v S (cid:107) . Since A P v = , we have ( A P ) S v S = − ( A P ) S v S and (cid:107) v S (cid:107) = (cid:107) [( A P ) ∗ S ( A P ) S ] − |H ( A P ) ∗ S ( A P ) S v S (cid:107) = (cid:107) − [( A P ) ∗ S ( A P ) S ] − |H ( A P ) ∗ S ( A P ) S v S (cid:107) ≤ (cid:107) [( A P ) ∗ S ( A P ) S ] − |H (cid:107)(cid:107) ( A P ) ∗ S ( A P ) S v S (cid:107) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( A P ) ∗ S (cid:88) i ∈ S ( A P ) i v i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:88) i ∈ S (cid:107) ( A P ) ∗ S ( A P ) i (cid:107)(cid:107) v i (cid:107) ≤ (cid:107) v S (cid:107) , . Hereby, we used the second condition in (4). Then we have (cid:107) ˆ x (cid:107) , ≥ (cid:107) x (cid:107) , + 14 (cid:107) v S (cid:107) , . Since ˆ x is an (cid:96) , -minimizer it follows that v S = . Therefore ( A P ) S v S = − ( A P ) S v S = . Since( A P ) S is injective, it follows that v S = , so that v = .To this end, we introduce the rescaled matrix ˜A P = √ m A P . The term (cid:107) [( ˜A P ) ∗ S ( ˜A P ) S ] − |H (cid:107) in (4) will be treated with Theorem 3.4 by noticing that (cid:107) ( ˜A P ) ∗ S ( ˜A P ) S − P S (cid:107) ≤ δ implies (cid:107) [( ˜A P ) ∗ S ( ˜A P ) S ] − |H (cid:107) ≤ (1 − δ ) − . The other terms in Lemma 3.3 will be estimated by the lem-mas in the next section. 
Throughout the proof, we use the notation E jj ( A ) to denote the s × s block diagonal matrix with the matrix A ∈ R d × d in its j -th diagonal entry and 0 elsewhere.6 .2 Auxiliary results We use the matrix Bernstein inequality [23], stated in Theorem 6.1 for convenience, in order tobound (cid:107) ( ˜A P ) ∗ S ( ˜A P ) S − P S (cid:107) . Recall the definition (2) of the incoherence matrix. Theorem 3.4.
Let A ∈ R m × N be a measurement matrix whose entries are i.i.d. Bernoulli randomvariables and ( W j ) Nj =1 be a fusion frame with the associated matrix Λ . Then, for δ ∈ (0 , , theblock matrix ˜A P satisfies (cid:107) ( ˜A P ) ∗ S ( ˜A P ) S − P S (cid:107) ≤ δ with probability at least − ε provided m ≥ δ − (cid:18) (cid:107) Λ S (cid:107) , ∞ + 23 max {(cid:107) Λ S (cid:107) , } (cid:19) ln(2 sk/ε ) . Proof.
We can assume that S = [ s ] without loss of generality. Denote Y (cid:96) = ( (cid:15) (cid:96)j P j ) j ∈ S for (cid:96) ∈ [ m ]as the (cid:96) -th block column vector of ( ˜A P ) ∗ S . Observing that E ( Y (cid:96) Y ∗ (cid:96) ) j,k = E ( (cid:15) (cid:96)j P j (cid:15) (cid:96)k P k ) = δ jk P j P k ,we have E Y (cid:96) Y ∗ (cid:96) = P S . Therefore, we can write( ˜A P ) ∗ S ( ˜A P ) S − P S = 1 m m (cid:88) (cid:96) =1 ( Y (cid:96) Y ∗ (cid:96) − E Y (cid:96) Y ∗ (cid:96) ) . This is a sum of independent self-adjoint random matrices. We denote the block matrices X (cid:96) := m ( Y (cid:96) Y ∗ (cid:96) − E Y (cid:96) Y ∗ (cid:96) ) which have mean zero. Moreover, (cid:107) X (cid:96) (cid:107) = 1 m max (cid:107) x (cid:107) =1 , x ∈H |(cid:104) Y (cid:96) Y ∗ (cid:96) x , x (cid:105) − (cid:104) P S x , x (cid:105)| = 1 m max (cid:107) x (cid:107) =1 , x ∈H (cid:12)(cid:12) (cid:107) Y ∗ (cid:96) x (cid:107) − (cid:107) x (cid:107) (cid:12)(cid:12) ≤ m max (cid:26) max (cid:107) x (cid:107) =1 , x ∈H (cid:107) Y ∗ (cid:96) x (cid:107) − , (cid:27) = 1 m max (cid:8) (cid:107) Y (cid:96) (cid:107) − , (cid:9) . We further bound the spectral norm of the block matrix Y ∗ (cid:96) . We separate a vector x ∈ R sd into s blocks of length d and denote x = ( x i ) si =1 . Defining the vector β ∈ R s with β i = (cid:107) x i (cid:107) we have (cid:107) Y ∗ (cid:96) (cid:107) = max (cid:107) x (cid:107) =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) s (cid:88) i =1 (cid:15) i P i x i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = max (cid:107) x (cid:107) =1 s (cid:88) i,j =1 (cid:15) i (cid:15) j (cid:104) P i x i , P j x j (cid:105)≤ max (cid:107) x (cid:107) =1 s (cid:88) i,j =1 |(cid:104) P i P j x j , x i (cid:105)| ≤ max (cid:107) x (cid:107) =1 s (cid:88) i,j =1 (cid:107) P i P j (cid:107)(cid:107) x i (cid:107) (cid:107) x j (cid:107) ≤ max (cid:107) x (cid:107) =1 s (cid:88) j =1 (cid:107) x j (cid:107) + max (cid:107) β (cid:107) =1 (cid:88) i (cid:54) = j (cid:107) P i P j (cid:107) β i β j ≤ (cid:107) β (cid:107) =1 (cid:104) β, Λ β (cid:105) ≤ (cid:107) Λ S (cid:107) . (7)This implies the estimate (cid:107) X (cid:96) (cid:107) ≤ max {(cid:107) Λ S (cid:107) , } m . E X (cid:96) = 1 m E ( Y (cid:96) Y ∗ (cid:96) Y (cid:96) Y ∗ (cid:96) + P S − Y (cid:96) Y ∗ (cid:96) P S − P S Y (cid:96) Y ∗ (cid:96) )= E m Y (cid:96) s (cid:88) j =1 P j Y ∗ (cid:96) + 1 m P S − m E ( Y (cid:96) Y ∗ (cid:96) ) P S − m P S E ( Y (cid:96) Y ∗ (cid:96) )= 1 m s (cid:88) i =1 E ii P i s (cid:88) j =1 P j P i − m P S . In the first equality above, we used the independence of (cid:15) (cid:96)j for j ∈ S and the fact that (cid:15) (cid:96)j = 1.Next, we estimate the variance parameter appearing in the noncommutative Bernstein inequalityas σ := (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) m (cid:88) (cid:96) =1 E ( X (cid:96) ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 1 m (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) s (cid:88) i =1 E ii P i s (cid:88) j =1 P j P i − P S (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 1 m (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) s (cid:88) i =1 E ii P i (cid:88) j ∈ [ s ] ,j (cid:54) = i P j P i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 1 m max i ∈ [ s ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) P i (cid:88) j ∈ [ s ] ,j (cid:54) = i P j P i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . 
We further estimate,max i ∈ [ s ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) P i (cid:88) j ∈ [ s ] ,j (cid:54) = i P j P i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = max i ∈ [ s ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j ∈ [ s ] ,j (cid:54) = i P i P j P i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ max i ∈ [ s ] (cid:88) j ∈ [ s ] j (cid:54) = i (cid:107) P i P j (cid:107)(cid:107) P j P i (cid:107) = max i ∈ [ s ] (cid:88) j ∈ [ s ] j (cid:54) = i (cid:107) P i P j (cid:107) . Finally we arrive at σ ≤ (cid:107) Λ S (cid:107) , ∞ m . We are now in the position of applying Theorem 6.1. It holds P (cid:16) (cid:107) ( ˜A P ) ∗ S ( ˜A P ) S − P S (cid:107) > δ (cid:17) = P (cid:32) (cid:107) m (cid:88) (cid:96) =1 X (cid:96) (cid:107) > δ (cid:33) ≤ sk exp (cid:32) − δ m/ (cid:107) Λ S (cid:107) , ∞ + max {(cid:107) Λ S (cid:107) , } δ/ (cid:33) ≤ sk exp (cid:32) − δ m (cid:107) Λ S (cid:107) , ∞ + max {(cid:107) Λ S (cid:107) , } (cid:33) , (8)where we used that δ ∈ (0 , sk appears in frontof the exponential instead of the dimension of X (cid:96) ∈ R sd × sd as asked by Theorem 6.1. In fact,Theorem 6.1 gives a better estimate if the matrices E X (cid:96) are not full rank, see (38) and the remarkafter Theorem 6.1. Indeed, in our case since rank( P j ) = dim( W j ) = k , we have rank( E X (cid:96) ) = sk which appears in (8). Bounding the right hand side of (8) by ε completes the proof.We now provide the analogous of the auxiliary lemmas in [16, Section 12.4] with slight modifi-cations. 8 emma 3.5. Let S be a subset of [ N ] with cardinality s and v ∈ R S × d be a block vector of size s with v j ∈ W j for j ∈ S . Assume that m ≥ (cid:107) Λ S (cid:107) , ∞ and max i ∈ S (cid:107) v i (cid:107) ≤ κ ≤ . Then, for t > , P (cid:18) max (cid:96) ∈ S (cid:107) ( ˜A P ) ∗ (cid:96) ( ˜A P ) S v (cid:107) ≥ κ (cid:107) Λ S (cid:107) , ∞ √ m + t (cid:19) ≤ N exp (cid:32) − t m κ (cid:107) Λ S (cid:107) , ∞ + 4 κ (cid:107) Λ S (cid:107) ∞ + tκ (cid:107) Λ S (cid:107) ∞ (cid:33) . Proof.
Fix (cid:96) ∈ S . We may assume without loss of generality that S = { , , . . . , s } . Observe thatfor i ∈ [ m ], (cid:15) i(cid:96) are independent from (cid:15) ij for j ∈ S . For simplicity we denote the correspondingmatrices as B = ( ˜A P ) ∗ (cid:96) and C = ( ˜A P ) S . The i -th block column and i -th block row are denoted as B i and B i respectively. Note that( ˜A P ) ∗ (cid:96) ( ˜A P ) S v = m (cid:88) i =1 B i C i v = m (cid:88) i =1 s (cid:88) j =1 m (cid:15) i(cid:96) (cid:15) ij P (cid:96) P j v j (9)for (cid:96) ∈ S . For convenience we introduce B i C i = 1 m (cid:0) (cid:15) i(cid:96) (cid:15) i P (cid:96) P (cid:15) i(cid:96) (cid:15) i P (cid:96) P · · · (cid:15) i(cid:96) (cid:15) is P (cid:96) P s (cid:1) . The sum of independent vectors in (9) will be bounded in (cid:96) norm using the vector valued Bernsteininequality Lemma 6.4. Observe that the vectors B i C i v have mean zero. Furthermore, m E (cid:107) B i C i v (cid:107) = 1 m E s (cid:88) j,k =1 (cid:15) i(cid:96) (cid:15) ij (cid:15) ik (cid:104) P (cid:96) P j v j , P (cid:96) P k v k (cid:105) = 1 m s (cid:88) j =1 (cid:107) P (cid:96) P j v j (cid:107) ≤ m s (cid:88) j =1 (cid:107) P (cid:96) P j (cid:107) (cid:107) v j (cid:107) ≤ κ m (cid:107) Λ S (cid:107) , ∞ where we used (cid:107) v j (cid:107) ≤ κ . We bound σ appearing in Lemma 6.4 simply due to Remark 6.5 by mσ ≤ m E (cid:107) B i C i v (cid:107) ≤ κ m (cid:107) Λ S (cid:107) , ∞ . For the uniform bound, observe that (cid:107) B i C i v (cid:107) = 1 m (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) s (cid:88) j =1 (cid:15) i(cid:96) (cid:15) ij P (cid:96) P j v j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ m s (cid:88) j =1 (cid:107) P (cid:96) P j (cid:107)(cid:107) v j (cid:107) ≤ κm (cid:107) Λ S (cid:107) ∞ . Then the vector valued Bernstein inequality (40) yields P (cid:18) (cid:107) ( ˜A P ) ∗ (cid:96) ( ˜A P ) S v (cid:107) ≥ κ (cid:107) Λ S (cid:107) , ∞ √ m + t (cid:19) ≤ exp − t / κ (cid:107) Λ S (cid:107) , ∞ m + κ (cid:107) Λ S (cid:107) ∞ m κ (cid:107) Λ S (cid:107) , ∞ √ m + t κ (cid:107) Λ S (cid:107) ∞ m . (cid:96) ∈ S ⊂ [ N ] and using that (cid:107) Λ S (cid:107) , ∞ √ m ≤ P (cid:18) max (cid:96) ∈ S (cid:107) ( ˜A P ) ∗ (cid:96) ( ˜A P ) S v (cid:107) ≥ κ (cid:107) Λ S (cid:107) , ∞ √ m + t (cid:19) ≤ N exp (cid:32) − t m κ (cid:107) Λ S (cid:107) , ∞ + 4 κ (cid:107) Λ S (cid:107) ∞ + tκ (cid:107) Λ S (cid:107) ∞ (cid:33) . This completes the proof.Next, we prove a similar auxiliary result.
Lemma 3.6.
Let S be subset of [ N ] with cardinality s and v ∈ R S × d be a block vector of size s with v j ∈ W j for j ∈ S . Assume that m ≥ (cid:107) Λ S (cid:107) , ∞ . Then, for t > , P (cid:18) (cid:107) [( ˜A P ) ∗ S ( ˜A P ) S − P S ] v (cid:107) ≥ (cid:18) (cid:107) Λ S (cid:107) , ∞ √ m + t (cid:19) (cid:107) v (cid:107) (cid:19) ≤ exp (cid:32) − mt (cid:107) Λ S (cid:107) ∞ + 2 (cid:107) Λ S (cid:107) , ∞ + t ( + (cid:107) Λ S (cid:107) ∞ ) (cid:33) . Proof.
Again we assume without loss of generality that (cid:107) v (cid:107) = 1 and S = { , , . . . , s } . As in theproof of Theorem 3.4, we rewrite the term that we need to bound as[( ˜A P ) ∗ S ( ˜A P ) S − P S ] v = 1 m m (cid:88) (cid:96) =1 ( Y (cid:96) Y ∗ (cid:96) − P S ) v (10)where Y (cid:96) = ( (cid:15) (cid:96)i P i ) si =1 is the (cid:96) -th block row of ( ˜A P ) S . We use the vector valued Bernstein inequality,Lemma 6.4 once again, in order to estimate the (cid:96) norm of this sum. Observe that E ( Y (cid:96) Y ∗ (cid:96) − P S ) v = as in the proof of Theorem 3.4. Furthermore, denoting Z = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) m m (cid:88) (cid:96) =1 ( Y (cid:96) Y ∗ (cid:96) − P S ) v (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) , we have E Z = m E (cid:13)(cid:13)(cid:13)(cid:13) m ( Y Y ∗ − P S ) v (cid:13)(cid:13)(cid:13)(cid:13) = 1 m E (cid:104) ( Y Y ∗ − P S ) v , ( Y Y ∗ − P S ) v (cid:105) = 1 m E ( (cid:104) Y Y ∗ v , Y Y ∗ v (cid:105) − (cid:104) Y Y ∗ v , v (cid:105) + (cid:104) v , v (cid:105) )= 1 m E ( (cid:104) Y Y ∗ Y Y ∗ v , v (cid:105) − (cid:107) Y ∗ v (cid:107) + 1) . We now estimate the first two terms in the last line above. First observe that due to Y ∗ Y =10 si =1 P i , it holds E (cid:104) Y Y ∗ Y Y ∗ v , v (cid:105) = (cid:104) E ( Y s (cid:88) i =1 P i Y ∗ ) v , v (cid:105) = (cid:104) s (cid:88) j =1 E jj ( P j s (cid:88) i =1 P i P j ) v , v (cid:105) = s (cid:88) j =1 (cid:104) P j s (cid:88) i =1 P i P j v j , v j (cid:105) = s (cid:88) i,j =1 (cid:104) P j P i P j v j , v j (cid:105)≤ (cid:88) i,j,i (cid:54) = j (cid:107) P i P j (cid:107) (cid:107) v j (cid:107) + s (cid:88) j =1 (cid:107) v j (cid:107) ≤ s (cid:88) j =1 (cid:107) v j (cid:107) (cid:88) i (cid:54) = j (cid:107) P i P j (cid:107) + 1 ≤ max j ∈ S (cid:88) i (cid:54) = j (cid:107) P i P j (cid:107) s (cid:88) j =1 (cid:107) v j (cid:107) ≤ (cid:107) Λ S (cid:107) , ∞ , where we used that Λ S is symmetric and (cid:107) v (cid:107) = 1. Secondly, since E (cid:107) Y ∗ v (cid:107) = E (cid:107) s (cid:88) i =1 (cid:15) i P i v i (cid:107) = E s (cid:88) i,j =1 (cid:15) i (cid:15) j (cid:104) v i , v j (cid:105) = s (cid:88) i =1 (cid:107) v i (cid:107) = 1 , we obtain E Z ≤ (cid:107) Λ S (cid:107) , ∞ m . For the uniform bound, we have1 m (cid:107) ( Y (cid:96) Y ∗ (cid:96) − P S ) v (cid:107) ≤ m (cid:107) Y (cid:96) Y ∗ (cid:96) (cid:107)(cid:107) v (cid:107) + 1 m (cid:107) v (cid:107) = 1 m (cid:107) Y (cid:96) (cid:107) + 1 m ≤ (cid:107) Λ S (cid:107) ∞ m . The last inequality follows from (7). Finally we estimate the weak variance simply by the strongvariance mσ ≤ E Z ≤ (cid:107) Λ S (cid:107) , ∞ m . Then the (cid:96) -valued Bernstein inequality (40) yields P (cid:18) (cid:107) [( ˜A P ) ∗ S ( ˜A P ) S − P S ] v (cid:107) ≥ (cid:18) (cid:107) Λ S (cid:107) , ∞ √ m + t (cid:19) (cid:107) v (cid:107) (cid:19) ≤ exp − t / (cid:107) Λ S (cid:107) , ∞ m + (4+2 (cid:107) Λ S (cid:107) ∞ ) m (cid:107) Λ S (cid:107) , ∞ √ m + t (2+ (cid:107) Λ S (cid:107) ∞ )3 m . Using that (cid:107) Λ S (cid:107) , ∞ √ m ≤
1, we obtain P (cid:18) (cid:107) [( ˜A P ) ∗ S ( ˜A P ) S − P S ] v (cid:107) ≥ (cid:18) (cid:107) Λ S (cid:107) , ∞ √ m + t (cid:19) (cid:107) v (cid:107) (cid:19) ≤ exp (cid:32) − mt (cid:107) Λ S (cid:107) ∞ + 2 (cid:107) Λ S (cid:107) , ∞ + t ( + (cid:107) Λ S (cid:107) ∞ ) (cid:33) . This completes the proof. 11emma 3.6 shows that the multiplication with ( ˜A P ) ∗ S ( ˜A P ) S − P S decreases the (cid:96) norm of thevectors with high probability. The next lemma shows that this is true for (cid:96) , ∞ norm as well. Lemma 3.7.
Assume the conditions of Lemma 3.6. Then, for t > , P (cid:18) (cid:107) [( ˜A P ) ∗ S ( ˜A P ) S − P S ] v (cid:107) , ∞ ≥ (cid:18) (cid:107) Λ S (cid:107) , ∞ √ m + t (cid:19) (cid:107) v (cid:107) , ∞ (cid:19) ≤ s · exp (cid:32) − mt (cid:107) Λ S (cid:107) ∞ + 2 (cid:107) Λ S (cid:107) , ∞ + t (cid:107) Λ S (cid:107) ∞ (cid:33) . Proof.
We can assume that S = [ s ] and (cid:107) v (cid:107) , ∞ = max i ∈ S (cid:107) v i (cid:107) = 1 by normalizing v by (cid:107) v (cid:107) , ∞ .As in (10), we write Z := [( ˜A P ) ∗ S ( ˜A P ) S − P S ] v = 1 m m (cid:88) (cid:96) =1 ( Y (cid:96) Y ∗ (cid:96) − P S ) v where Y (cid:96) = ( (cid:15) (cid:96)i P i ) si =1 is the (cid:96) -th row of ( ˜A P ) S . We can further write for i ∈ SZ i = s (cid:88) (cid:96) =1 m s (cid:88) j =1 j (cid:54) = i (cid:15) (cid:96)i (cid:15) (cid:96)j P i P j v j =: (cid:88) (cid:96) X (cid:96) . The vectors X (cid:96) are independent, thus we use Lemma 6.4 in order to bound (cid:107) Z i (cid:107) . Since we havedone similar estimations in the previous proofs, we skip some steps and obtain E (cid:107) Z i (cid:107) = m E (cid:107) X (cid:96) (cid:107) ≤ m s (cid:88) j =1 j (cid:54) = i (cid:107) P i P j (cid:107) (cid:107) v j (cid:107) ≤ m (cid:107) Λ S (cid:107) , ∞ , where we used that (cid:107) v j (cid:107) ≤
1. Furthermore, mσ ≤ m (cid:107) Λ S (cid:107) , ∞ . For any (cid:96) ∈ [ s ] we have theuniform bound (cid:107) X (cid:96) (cid:107) = 1 m (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) s (cid:88) j =1 ,j (cid:54) = i (cid:15) (cid:96)i (cid:15) (cid:96)j P i P j v j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ m s (cid:88) j =1 ,j (cid:54) = i (cid:107) P i P j (cid:107)(cid:107) v j (cid:107) ≤ m (cid:107) Λ S (cid:107) ∞ . Combining these with Lemma 6.4 and taking the union bound yield P (cid:18) max i ∈ S (cid:107) Z i (cid:107) ≥ (cid:107) Λ S (cid:107) , ∞ √ m + t (cid:19) ≤ s · exp (cid:32) − mt (cid:107) Λ S (cid:107) ∞ + 2 (cid:107) Λ S (cid:107) , ∞ + t (cid:107) Λ S (cid:107) ∞ (cid:33) . Lastly we present the following lemma before the proof of our main result.
Lemma 3.8.
For t ∈ (0 , ) , P (max i ∈ S (cid:107) ( ˜A P ) ∗ S ( ˜A P ) i (cid:107) ≥ t ) ≤ s + 1) N k exp (cid:32) − t m (cid:107) Λ S (cid:107) , ∞ (cid:33) . roof. Fix i ∈ S . Similarly as before, we write ( ˜A P ) ∗ S ( ˜A P ) i as a sum of independent matrices,( ˜A P ) ∗ S ( ˜A P ) i = 1 m m (cid:88) (cid:96) =1 (cid:15) (cid:96) (cid:15) (cid:96)i P P i (cid:15) (cid:96) (cid:15) (cid:96)i P P i ... (cid:15) (cid:96)s (cid:15) (cid:96)i P s P i =: 1 m m (cid:88) (cid:96) =1 Y (cid:96) , (11)where we assumed S = [ s ] for simplifying the notation. Above we introduced the block column vec-tors Y (cid:96) ∈ R sd × d which are independent and identically distributed rectangular matrices. Observealso that E Y (cid:96) = . In order to estimate the norm of the sum in (11) we will employ Theorem 6.3[23] which is a version of the noncommutative Bernstein inequality for rectangular matrices. Wefirst bound the variance parameter σ = max (cid:40) (cid:107) m (cid:88) (cid:96) =1 m E [ Y (cid:96) Y ∗ (cid:96) ] (cid:107) , (cid:107) m (cid:88) (cid:96) =1 m E [ Y ∗ (cid:96) Y (cid:96) ] (cid:107) (cid:41) . (12)We write E [ Y (cid:96) Y ∗ (cid:96) ] = (cid:80) sj =1 E jj ( P j P i P j ) . The first term on the right hand side of (12) is estimatedas (cid:107) m m (cid:88) (cid:96) =1 E [ Y (cid:96) Y ∗ (cid:96) ] (cid:107) = 1 m max j ∈ S (cid:107) P j P i P j (cid:107) ≤ m max j ∈ S (cid:107) P j P i (cid:107)(cid:107) P i P j (cid:107) ≤ λ m . (13)We used that P i = P i and i (cid:54)∈ S . Furthermore, E [ Y ∗ (cid:96) Y (cid:96) ] = (cid:80) sj =1 P i P j P i and1 m (cid:107) m (cid:88) (cid:96) =1 E [ Y ∗ (cid:96) Y (cid:96) ] (cid:107) = 1 m (cid:107) s (cid:88) j =1 P i P j P i (cid:107) ≤ m s (cid:88) j =1 (cid:107) P i P j (cid:107)(cid:107) P j P i (cid:107) ≤ (cid:107) Λ S (cid:107) , ∞ m . (14)Since (14) dominates (13), we have σ ≤ (cid:107) Λ S (cid:107) , ∞ m . For the uniform bound we obtain (cid:107) Y (cid:96) (cid:107) = sup (cid:107) x (cid:107) ≤ x ∈ R d (cid:107) Y (cid:96) x (cid:107) = sup (cid:107) x (cid:107) ≤ x ∈ R d s (cid:88) j =1 (cid:107) P j P i x (cid:107) ≤ sup (cid:107) x (cid:107) ≤ x ∈ R d s (cid:88) j =1 (cid:107) P j P i (cid:107) (cid:107) x (cid:107) ≤ (cid:107) Λ S (cid:107) , ∞ . We conclude that m (cid:107) Y (cid:96) (cid:107) ≤ (cid:107) Λ S (cid:107) , ∞ m . Combining these estimates, Theorem 6.3 yields P ( (cid:107) ( ˜A P ) ∗ S ( ˜A P ) i (cid:107) ≥ t ) ≤ s + 1) k exp − t / (cid:107) Λ S (cid:107) , ∞ m + t (cid:107) Λ S (cid:107) , ∞ m . Taking the union bound over i ∈ S ⊂ [ N ] and using that t ∈ (0 , ) yields P (max i ∈ S (cid:107) ( ˜A P ) ∗ S ( ˜A P ) i (cid:107) ≥ t ) ≤ s + 1) N k exp (cid:32) − t m (cid:107) Λ S (cid:107) , ∞ (cid:33) . Above, the dimension of the subspaces k appears instead the ambient dimension d for the samereasons explained in the proof of Theorem 3.4. This completes the proof.13 .3 Proof of Theorem 3.1 Essentially we follow the arguments in [16, Section 12.4]. We will construct an inexact dual vectoras in Lemma 3.3 satisfying the conditions there. To this end, we will use the so-called golfing scheme due to Gross [18]. We partition the m independent (block) rows of A P into L disjoint blocks ofsizes m , . . . , m L and L to be specified later with m = (cid:80) Lj =1 m j . These blocks correspond to rowsubmatrices of A P which are denoted by A P (1) ∈ R m d × Nd , . . . , A P ( L ) ∈ R m L d × Nd , i.e., A P = A P (1) A P (2) ... A P ( L ) } m } m ... } m L Set S = supp( x ). 
The golfing scheme starts with u (0) = and then inductively defines u ( n ) = 1 m n ( A P ( n ) ) ∗ ( A P ( n ) ) S (sgn( x S ) − u ( n − S ) + u ( n − , for n = 1 , . . . , L . The vector u = u ( L ) will serve as a candidate for the inxeact dual vector inLemma 3.3. Thus, we need to check if it satisfies the two conditions in (5). By construction u is inthe row space of A P , i.e., u = A P ∗ h for some vector h as required in Lemma 3.3. To simplify thenotation we introduce w ( n ) = sgn( x S ) − u ( n ) S . Observe that u ( n ) S − u ( n − S = 1 m n ( A P ( n ) ) ∗ S ( A P ( n ) ) S (sgn( x S ) − u ( n − S ) w ( n − − w ( n ) = 1 m n ( A P ( n ) ) ∗ S ( A P ( n ) ) S w ( n − w ( n ) = (cid:20) P S − m n ( A P ( n ) ) ∗ S ( A P ( n ) ) S (cid:21) w ( n − (15)Above we used that P S w ( n ) = w ( n ) . Furthermore we have u ( n ) − u ( n − = 1 m n ( A P ( n ) ) ∗ ( A P ( n ) ) S w ( n − u = u ( L ) = L (cid:88) n =1 m n ( A P ( n ) ) ∗ ( A P ( n ) ) S w ( n − , (16)where last line follows by a telescopic sum. We will later show that the matrices P S − m k ( A P ( k ) ) ∗ S ( A P ( k ) ) S are contractions and the norm of the residual vector w ( n ) decreases geometrically fast, thus u ( n ) becomes close to sgn( x S ) on its support set S . Particularly, we will prove that (cid:107) w ( L ) (cid:107) ≤ / L . In addition we also need that the off-support part of u remains small as well,satisfying the condition max i ∈ S (cid:107) u i (cid:107) ≤ / n with high probability (cid:107) w ( n ) (cid:107) , ∞ ≤ (cid:18) (cid:107) Λ S (cid:107) , ∞ √ m + q n (cid:19) (cid:107) w ( n − (cid:107) , ∞ , n ∈ [ L ] . (17)14et q (cid:48) n := (cid:107) Λ S (cid:107) , ∞ √ m + q n . Since (cid:107) w (0) (cid:107) , ∞ = (cid:107) sgn( x S ) (cid:107) , ∞ = 1, we have (cid:107) w ( n ) (cid:107) , ∞ ≤ n (cid:89) j =1 q (cid:48) j =: h n . Further assume that the following inequalities hold for each n with high probability, (cid:107) w ( n ) (cid:107) ≤ (cid:18) (cid:107) Λ S (cid:107) , ∞ √ m + r n (cid:19) (cid:107) w ( n − (cid:107) , n ∈ [ L ] , (18)max i ∈ S (cid:13)(cid:13)(cid:13)(cid:13) m n ( A P ( n ) ) ∗ i ( A P ( n ) ) S w ( n − (cid:13)(cid:13)(cid:13)(cid:13) ≤ h n (cid:107) Λ S (cid:107) , ∞ √ m + t n , n ∈ [ L ] . (19)The parameters q n , r n , t n will be specified later. Now let r (cid:48) n := (cid:107) Λ S (cid:107) , ∞ √ m + r n and t (cid:48) n := h n (cid:107) Λ S (cid:107) , ∞ √ m + t n .Then the relations in (15) and (18) yield (cid:107) sgn( x S ) − u S (cid:107) = (cid:107) w ( L ) (cid:107) ≤ (cid:107) sgn( x S ) (cid:107) L (cid:89) n =1 r (cid:48) n ≤ √ s L (cid:89) n =1 r (cid:48) n . Furthermore, (16) and (19) givemax i ∈ S (cid:107) u i (cid:107) = max i ∈ S (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L (cid:88) n =1 m n ( A P ( n ) ) ∗ i ( A P ( n ) ) S w ( n − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ L (cid:88) n =1 max i ∈ S (cid:13)(cid:13)(cid:13)(cid:13) m n ( A P ( n ) ) ∗ i ( A P ( n ) ) S w ( n − (cid:13)(cid:13)(cid:13)(cid:13) ≤ L (cid:88) n =1 t (cid:48) n . Next we define the probabilities p ( n ), p ( n ) and p ( n ) that (17), (18) and (19) do not holdrespectively. Then by Lemma 3.7 and independence of the blocks, p ( n ) ≤ ε, provided m n ≥ (cid:32) (cid:107) Λ S (cid:107) ∞ + 2 (cid:107) Λ S (cid:107) , ∞ q n + 2 (cid:107) Λ S (cid:107) ∞ q n (cid:33) ln( s/ε ) . 
(20)Also by Lemma 3.6 and independence of the blocks, p ( n ) ≤ ε provided m n ≥ (cid:32) (cid:107) Λ S (cid:107) ∞ + 2 (cid:107) Λ S (cid:107) , ∞ r n + 4 / / (cid:107) Λ S (cid:107) ∞ r n (cid:33) ln( ε − ) . (21)15imilarly, due to Lemma 3.5 and independence of the blocks, p ( n ) ≤ ε, provided m n ≥ (cid:32) h n (cid:107) Λ S (cid:107) , ∞ + 4 h n (cid:107) Λ S (cid:107) ∞ t n + h n (cid:107) Λ S (cid:107) ∞ t n (cid:33) ln( N/ε ) . (22)We now set the parameters L, m n , t n , r n , q n for n ∈ [ L ] such that (cid:107) sgn( x S ) − u S (cid:107) ≤ / i ∈ S (cid:107) u i (cid:107) ≤ / L = (cid:100) ln( s ) / ln ln( N ) (cid:101) + 3 ,m n ≥ c (1 + (cid:107) Λ S (cid:107) ∞ ) ln( N ) ln(2 Lε − ) ,r n = 14 (cid:112) ln( N ) ,t n = 12 n +3 ,q n = 18 . We can estimate each of (cid:107) Λ S (cid:107) , ∞ , (cid:107) Λ S (cid:107) , ∞ , (cid:107) Λ S (cid:107) ∞ by (cid:107) Λ S (cid:107) ∞ from above. Then by definitions of r (cid:48) n , t (cid:48) n , h n , q (cid:48) n , we obtain r (cid:48) n ≤ √ ln N , q (cid:48) n ≤ , h n ≤ n and t (cid:48) n ≤ n +2 for n = 1 , . . . , L . Furthermore, (cid:107) sgn( x S ) − u S (cid:107) ≤ √ s L (cid:89) n =1 r (cid:48) n ≤ , and max i ∈ S (cid:107) u i (cid:107) ≤ L (cid:88) n =1 t (cid:48) n ≤ . Next we bound the failure probabilities according to our choices of parameters above. Consideringalso the conditions (20), (21) and (22), we have p ( n ) , p ( n ) , p ( n ) ≤ ε/L. These yield L (cid:88) n =1 p ( n ) + p ( n ) + p ( n ) ≤ ε. The overall number of samples obey m = L (cid:88) n =1 m n ≥ c (1 + (cid:107) Λ S (cid:107) ∞ ) L ln( N ) ln( L/ε ) . (23)This is already very close to the proposed condition in the statement of our theorem. We willstrengthen this condition later. Next we look into the first part of Condition (4) of Lemma 3.3. ByTheorem 3.4, (cid:107) ( ˜A P ) ∗ S ( ˜A P ) S − P S (cid:107) ≤ / − ε provided m ≥ (cid:18) (cid:107) Λ S (cid:107) , ∞ + 83 max {(cid:107) Λ S (cid:107) , } (cid:19) ln(2 sk/ε ) .. (24)16his implies that (cid:107) [( ˜A P ) ∗ S ( ˜A P ) S ] − |H (cid:107) ≤
2. For the second part of Condition (4) we will useLemma 3.8. It says that P (max i ∈ S (cid:107) ( ˜A P ) ∗ S ( ˜A P ) i (cid:107) ≥ t ) ≤ s + 1) N k exp (cid:32) − t m (cid:107) Λ S (cid:107) , ∞ (cid:33) . Taking t = 1 implies that max i ∈ S (cid:107) ( ˜A P ) ∗ S ( ˜A P ) i (cid:107) ≤ − ε provided m ≥ (cid:107) Λ S (cid:107) , ∞ ln( N ( s + 1) k/ε ) . (25)Altogether we have shown that Conditions (4) and (5) of Lemma 3.3 hold simultaneously withprobability at least 1 − ε provided Conditions (23), (24) and (25) hold. Replacing ε by ε/
5, themain condition of Theorem 3.1 m ≥ C (1 + (cid:107) Λ S (cid:107) ∞ ) ln( N ) ln( sk ) ln( ε − )implies all three conditions above with an appropriate constant C since (cid:107) Λ S (cid:107) ≤ (cid:107) Λ S (cid:107) ∞ ≤ (cid:107) Λ S (cid:107) ∞ since Λ is symmetric.This ends the proof of our theorem. Remark 3.9.
The inexact dual method yields a relatively long and technical proof for sparserecovery and involves several auxiliary results. Other methods used in the compressed sensingliterature proved to be hard to apply for our particular case where we work with the block matrix A P which is more structured than a purely random Gaussian matrix. To name a few of othermethods, the exact dual approach developed by J. J. Fuchs [17] was used for subgaussian matricesin [3], a uniform recovery result for Gaussian matrices based on the concentration of measure ofLipschitz functions was given by one of the seminal papers by Cand´es and Tao [8] and atomic normapproach that was recently given in [12] with far-reaching applications. For instance, particularlythe exact dual approach involves taking the pseudo-inverse of A P which loses the structure given byprojection matrices P i . This structure is crucial because it allows us to prove our results involvingthe incoherence parameter λ which is the central theme of this paper. In this section we give an alternative result for the nonuniform recovery with Bernoulli matrices.This result involves the parameter λ instead of matrix Λ. Theorem 3.10.
Let x ∈ H be s -sparse. Let A ∈ R m × N be Bernoulli matrix and ( W j ) Nj =1 be givenwith parameter λ ∈ [0 , . Assume that m ≥ C (1 + λs ) ln( N sk ) ln( ε − ) , (26) where C > is a universal constant. Then with probability at least − ε , ( L recovers x from y = A P x . The proof of this theorem is similar to the one of Theorem 3.1 with slight modifications in theestimations. 17 emark 3.11.
Theorem 3.10 improves Theorem 3.1 in terms of the log-factors as log( s ) does notappear in Condition (26). Condition (3) is slightly better than (3) in terms of the incoherenceparameter, at least if there is a true gap between (cid:107) Λ (cid:107) ∞ and λs , which happens if the quantities (cid:107) P i P j (cid:107) are not all close to their maximal value. The equality is achieved when the subspaces areequi-angular. In the case that they are not equi-angular, even if only two subspaces align, then λ = 1. In this case, (26) suggests that we should not expect any improvement for the recovery of thesparse vectors with respect to the standard block sparse case. However, intuitively the orientationof the other subspaces might still be effective in the recovery process. A more average measureof incoherence of the subspaces is captured by (cid:107) Λ (cid:107) ∞ in (3), so Theorem 3.1 improves for generalorientations of the subspaces up to a slight drawback in the log-factors. Numerical experiments wehave run also support this result. In this section we show that nonuniform recovery for fusion frames with Bernoulli matrices are stableand robust under presence of noise. In other words we allow our signal x to be approximately sparse(compressible) and the measurements y to be noisy. Our measurement model then becomes y = A P x + e with (cid:107) e (cid:107) ≤ η √ m (27)for some η ≥
0. For the reconstruction we employ( L η ˆ x = argmin x ∈H (cid:107) x (cid:107) , s.t. (cid:107) A P x − y (cid:107) ≤ η √ m. The condition (cid:107) e (cid:107) ≤ η √ m in (27) is natural for a vector e = ( e j ) mj =1 . For instance, it is impliedby the bound (cid:107) e j (cid:107) ≤ η for all j ∈ [ m ]. We first define the best s -term approximation of a vector x as follows σ s ( x ) := inf (cid:107) z (cid:107) ≤ s (cid:107) x − z (cid:107) , . Compressible vectors are the ones with small σ s ( x ) . The next statement makes Lemma 3.3 stableand robust under noise and under passing from sparse to compressible vectors. It is an extensionof [16, Theorem 4.33] and its proof is entirely analogous to the one in there, so we skip it. Lemma 4.1.
Let A ∈ R m × N and ( W j ) Nj =1 be a fusion frame for R d and x ∈ H . Let S ⊂ [ N ] be the index set of the s largest (cid:96) -normed vectors x i of x . Assume that, for positive constants δ, β, γ, θ ∈ (0 , with b := θ + βγ/ (1 − δ ) < and (cid:107) ( A P ) ∗ S A P S − P S (cid:107) ≤ δ, (28)max (cid:96) ∈ S (cid:107) ( A P ) ∗ S ( A P ) (cid:96) (cid:107) ≤ β. (29) Suppose there exists a block vector u ∈ R Nd of the form u = A P ∗ h with block vector h ∈ R md suchthat (cid:107) u S − sgn( x S ) (cid:107) ≤ γ, (30)max i ∈ S (cid:107) u i (cid:107) ≤ θ, (31) (cid:107) h (cid:107) ≤ τ √ s. (32)18 et noisy measurements y = A P x + e be given with (cid:107) e (cid:107) ≤ η . Then the minimizer ˆ x of min z ∈H (cid:107) z (cid:107) , s.t. (cid:107) A P z − y (cid:107) ≤ η satisfies (cid:107) x − ˆ x (cid:107) ≤ C σ s ( x ) + ( C + C √ s ) η, where C = (cid:18) β − δ (cid:19) − b , C = 2 √ δ − δ + (cid:18) β − δ (cid:19) γ √ δ (1 − δ )(1 − b ) C = (cid:18) β − δ (cid:19) τ − b . In the remainder of this section, we prove a robust and stable version of the nonuniform recoveryresult Theorem 3.1 for Bernoulli matrices. We also state the result for the Gaussian case but donot prove it since it follows very similarly to the Bernoulli case.
Theorem 4.2.
Let x ∈ H and S ⊂ [ N ] with cardinality s be an index set of s largest (cid:96) -normedentries of x . Let A ∈ R m × N be a Bernoulli matrix and ( W j ) Nj =1 be given with parameter λ ∈ [0 , .Assume the measurement model in (27) and let ˆ x be a solution to ( L η . Provided m ≥ C (1 + (cid:107) Λ S (cid:107) ∞ ) ln( N ) ln( sk ) ln( ε − ) , (33) then with probability at least − ε , (cid:107) x − ˆ x (cid:107) ≤ C σ s ( x ) + C √ sη. (34) The constants
C, C , C > are universal. The proof is analogous to the one of [16, Theorem 12.22]. It invokes Lemma 4.1 which gives nec-essary conditions on the measurement matrices for the robust and stable nonuniform recovery withthem. Since the conditioning assumption (28) requires normalization of the matrix, we will workwith the matrix ˜A P = √ m A P . Then observe that the optimization problem ( L η is equivalent tomin z ∈H (cid:107) z (cid:107) , s.t. (cid:13)(cid:13)(cid:13)(cid:13) ˜A P z − √ m y (cid:13)(cid:13)(cid:13)(cid:13) ≤ η. Proof.
We follow the golfing scheme as in the proof of Theorem 3.1, see Section 3.3. In particular,we make the same choices of the parameters
L, r n , t n , q n , m n as before. We choose m n as follows m ≥ c (1 + (cid:107) Λ S (cid:107) ∞ ) ln( N ) L ln(2 Lε − ) ,m n ≥ c (1 + (cid:107) Λ S (cid:107) ∞ ) ln( N ) ln(2 Lε − ) , n ≥ . These choices change the number of overall samples m only up to a constant. Then Conditions (28),(29), (30), (31) are all satisfied for the normalized matrix ˜A P with probability at least 1 − ε withappropriate choices of variables δ, β, γ, θ . It remains to verify that the vector h ∈ R md constructedin Section 3.3 as u = ˜A ∗ P h satisfies Condition (32). For simplicity, assume without loss of generalitythat the first L values of n are used in the construction of the dual vector in (16). Then recall that u = L (cid:88) n =1 m n ( A P ( n ) ) ∗ ( A P ( n ) ) S w ( n − = L (cid:88) n =1 mm n ( ˜A ( n ) P ) ∗ ( ˜A ( n ) P ) S w ( n − . u = ˜A ∗ P h with h ∗ = (( h (1) ) ∗ , . . . , ( h ( L ) ) ∗ , , . . . ,
0) where h ( n ) = mm n ( ˜A ( n ) P ) S w ( n − ∈ R m n d , n = 1 , . . . , L (cid:48) . Then we have (cid:107) h (cid:107) = L (cid:88) n =1 (cid:107) h ( n ) (cid:107) = L (cid:88) n =1 mm n (cid:13)(cid:13)(cid:13)(cid:13)(cid:114) mm n ( ˜A ( n ) P ) S w ( n − (cid:13)(cid:13)(cid:13)(cid:13) L (cid:88) n =1 mm n (cid:13)(cid:13)(cid:13)(cid:13)(cid:114) m n ( A P ( n ) ) S w ( n − (cid:13)(cid:13)(cid:13)(cid:13) . We also recall the relation (15) of the vectors w ( n ) . This gives, for n ≥ (cid:13)(cid:13)(cid:13)(cid:13)(cid:114) m n ( A P ( n ) ) S w ( n − (cid:13)(cid:13)(cid:13)(cid:13) = (cid:28) m n ( A P ( n ) ) ∗ S ( A P ( n ) ) S w ( n − , w ( n − (cid:29) = (cid:28)(cid:18) m n ( A P ( n ) ) ∗ S ( A P ( n ) ) S − P S (cid:19) w ( n − , w ( n − (cid:29) + (cid:107) w ( n − (cid:107) = (cid:104) w ( n ) , w ( n − (cid:105) + (cid:107) w ( n − (cid:107) ≤ (cid:107) w ( n ) (cid:107) (cid:107) w ( n − (cid:107) + (cid:107) w ( n − (cid:107) . Recall from the assumption (18) that (cid:107) w ( n ) (cid:107) ≤ r (cid:48) n (cid:107) w ( n − (cid:107) ≤ (cid:107) w ( n − (cid:107) . Then we obtain (cid:13)(cid:13)(cid:13)(cid:13)(cid:114) m n ( A P ( n ) ) S w ( n − (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:107) w ( n − (cid:107) ≤ (cid:107) w (0) (cid:107) n − (cid:89) j =1 ( r (cid:48) j ) = 2 (cid:107) sgn( x ) S (cid:107) n − (cid:89) j =1 ( r (cid:48) j ) = 2 s n − (cid:89) j =1 ( r (cid:48) j ) . Assume that m ≤ C (1 + λs ) ln( N ) ln( sk ) ln(2 ε − ) so that m is just large enough to satisfy (33).Recall the definition of L = (cid:100) ln( s ) / ln ln( N ) (cid:101) + 3. Then by our choices of m n , we have mm n ≤ L for n ≥ mm ≤ c for some c >
0. (If m is much larger, one can rescale m n proportionally toachieve the same ratio.) This yields (cid:107) h (cid:107) ≤ s L (cid:88) n =1 mm n n − (cid:89) j =1 ( r (cid:48) j ) ≤ s c N ) + L (cid:88) n =2 L n − (cid:89) j =1
12 ln N ≤ C (cid:48) s (cid:18) L N ) 1[1 − / (2 ln( N ))] (cid:19) ≤ C (cid:48)(cid:48) s, where we used the convention (cid:81) j =1 ( r (cid:48) j ) = 1. Therefore, all conditions of Lemma 4.1 are satisfiedfor x and ˜A P with probability at least 1 − ε . This completes the proof.We now state the result for Gaussian matrices and skip the proof. Theorem 4.3.
Let x ∈ H . Let A ∈ R m × N be a Gaussian matrix and ( W j ) Nj =1 be given withparameter λ ∈ [0 , and dim( W j ) = k for all j . Assume the measurement model in (27) and let ˆ x be a solution to ( L η . If m ≥ ˜ C (1 + λs ) ln (6 N k ) ln ( ε − ) ,
10 15 20 25 30 35102030405060708090100110
SPARSITY o f M EAS UR E M E N T S N EE D E D BLOCK SPARSITY vs. FUSION FRAME
BLOCK SPARSITYFUSION FRAME
N=200, SUCCESS RATE = 96% (a)
NUMBER OF MEASUREMENTS P R O BAB I L I T Y O F R E C O VE R Y BLOCK SPARSITY vs. FUSION FRAME
FUSION FRAMEBLOCK SPARSITYN=200, s=20 (b)
Figure 1: ’block’ vs. ’fusion frame’ sparsity then with probability at least − ε − N − c , (cid:107) x − ˆ x (cid:107) ≤ C σ s ( x ) + C √ sη. The constants ˜ C, C , C , c > are universal. In this section, we present numerical experiments in order to highlight important aspects of thesparse reconstruction in the fusion frame (FF) setup. The experiments illustrate our theoreticalresults and show that when the subspaces are known, one can significantly improve the recoveryof sparse vectors. In all of our experiments, we use SPGL1 [24, 25] to solve the (cid:96) , -minimizationproblems. General setup:
We generate subspaces randomly, which allows us to generate fusion frames withdifferent values of λ and Λ. Particularly, for N subspaces in R d each with dimension k , we generate N · k random vectors from N (0 , I d ) and group them to form the basis for the subspaces. Each sucha random orientation of the subspaces yields a parameter λ . In order to obtain a different λ , it isenough to vary d or k . When N is fixed, λ increases with increasing k and decreasing d .For the measurement matrices, we generate the normalized matrix ˜ A = √ m A where A ∈ R m × N is a Gaussian matrix. For a sparsity level s , sparse vectors are generated in the following way: Wechoose the support set S uniformly at random, then we sample a Gaussian vector in each subspacein this support set. N is kept fixed throughout the experiment at hand. In our experiments wework with the parameter (cid:107) Λ S (cid:107) ∞ introduced in Section 2.3. Since the random subspaces are notequiangular, this parameter reflects the linear relation between m and s better than λ . We workwith the normalized parameter λ eff = (cid:107) Λ S (cid:107) ∞ s . Exact sparse case:
In Fig. 1, we show that the knowledge of the subspaces improves the recovery.To that end, we fix a fusion frame with N = 200 subspaces in R d with λ eff ≈ .
6. Then we vary21 .2 0.3 0.4 0.5 0.6 0.7 0.810152025303540455055 λ eff o f M E A S UR E M E N T S N EE D E D λ eff N=180; s=25;success rate=96%
Figure 2: m vs. λ eff the sparsity level s from 5 to 35, and generate an s -sparse vector x in the fusion frame. For each s , we vary the number of measurements m and compute empirical recovery rates via the programs(FF) ˆ x = argmin x ∈H (cid:107) x (cid:107) , s.t. A P x = y , (35)(block) ˆ x = argmin x (cid:107) x (cid:107) , s.t. A I x = y . (36)For the whole period, we leave the vector to be recovered fixed. Repeating this test 100 times withdifferent random A for each choice of parameters ( s, m, N ) provides an empirical estimate of thesuccess probability. In Fig. 1(a), we plot m which yields at least 96% success rate for each s . Thedifference in two plots is due to the incoherence of the subspaces, i.e., λ eff. Fig. 1(b) ( d = 3, k = 1)shows the transition from the unsuccessful regime to the successful regime for the sparsity level s = 20 for both cases (FF and block sparsity). The transition for the FF case occurs at a smallervalue m which reflects Fig. 1(a) in a different way. As a consequence, the assumption x ∈ H in(35) allows us to to recover x with far less measurements compared to (36) where such a constraintis not used.Theorem 3.1 suggests that there is a linear relation between the number of measurements m and the parameter λ eff . The experiment depicted in Fig. 2 is designed to reflect this relation. Wegenerate fusion frames with N = 180 subspaces with various λ eff which is managed by changing d and keeping k = 3 fixed. Then in each fusion frame, a vector x with sparsity s = 25 is generatedand the number of measurements m that suffices for recovery is determined. The plot yields analmost linear relation in parallel to the theoretical result. Stable case:
In this part, we generate scenarios that allude to the conclusions of Theorems 4.2and 4.3. In a fusion frame of N = 200 subspaces, we generate a signal x composed of x S , supportedon an index set S , and a signal z S supported on S . We then normalize x S and z S so that (cid:107) x S (cid:107) , = (cid:107) z S (cid:107) , = 1 and produce x = x S + θ z S where θ ∈ [0 , x is our compressible vector wherecompressibility is controlled with θ . For measurement, we choose the normalized Gaussian matrix A ∈ R m × N . We measure y = A P x and then run the program ( L
1) and measure the reconstructionerror (cid:107) x − ˆ x (cid:107) . We repeat this test 20 times for a fixed x with θ = 0 .
12 in order to obtain anaverage recovery error for different values of m . Fig. 3(a) reports the results of this experimentperformed for different fusion frames with various values of λ eff and also for the block sparsity22
10 20 30 40 50 60 70 8000.050.10.150.20.250.30.35
NUMBER OF MEASUREMENTS R E C O N S T RUC T I O N E RR O R RECONSTRUCTION OF COMPRESSIBLE SIGNALS with DIFFERENT λ eff λ eff = 0.68 λ eff =0.55 λ eff =0.4block sparsityN=180,s=20, θ =0.12 (a) NUMBER OF MEASUREMENTS R E C O N S T RUC T I O N E RR O R NOISY MEASUREMENTS with DIFFERENT λ eff λ eff = 0.24 λ eff =0.7 λ eff =0.9block sparsity N=200, s=20, σ = 0.06 (error) (b) Figure 3: (cid:107) ˆ x − x (cid:107) vs. ’m’case. The decrease in the reconstruction error with increasing m is natural even though it is notsuggested directly by the theoretical results. Indeed, one would expect that increasing the numberof measurements would enhance the recovery conditions and yield an improved reconstruction.For the noisy case, similarly, we generate noisy observations A P x S + σ e , of a sparse signal x S where (cid:107) x S (cid:107) = (cid:107) e (cid:107) = 1 and σ = 0 .
For the noisy case, we similarly generate noisy observations $A_P x_S + \sigma e$ of a sparse signal $x_S$, where $\|x_S\| = \|e\| = 1$ and $\sigma = 0.06$. Here, all entries of the noise vector $e$ are drawn i.i.d. from the standard Gaussian distribution and then normalized. We then run the robust $(L_{1,\eta})$ program and measure the reconstruction error $\|x - \hat{x}\|$. The average of this error is plotted against the number of measurements in Fig. 3(b) for different values of $\lambda_{\mathrm{eff}}$.

Fig. 4(a) depicts the relation between the reconstruction error and the noise level $\sigma$ for different values of $\lambda_{\mathrm{eff}}$. In this setup, $N = 200$, $s = 30$, and $m = 50$ are fixed, and a sparse vector $x$ in the fusion frame with a specific value of $\lambda_{\mathrm{eff}}$ is generated. For each value of $\sigma$ we plot the average reconstruction error. The results manifest the linear relation between $\sigma$ and $\|x - \hat{x}\|$ given in (34). Again, we obtain a better reconstruction quality when $\lambda_{\mathrm{eff}}$ is smaller.
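Continuing the same sketch, the robust program can be imitated by replacing the equality constraint with a norm bound on the residual; taking the bound equal to $\sigma$ assumes the noise level is known. This is a stand-in for $(L_{1,\eta})$, not its exact formulation.

```python
# Noisy observations of the sparse part x_S, recovered with a residual bound.
sigma = 0.06
E = rng.standard_normal((m, d))
E /= np.linalg.norm(E)                        # unit-norm noise, ||e|| = 1
Yn = A @ XS + sigma * E

C = cp.Variable((N, k))
Xr = cp.vstack([cp.reshape(U[j] @ C[j], (1, d)) for j in range(N)])
cp.Problem(cp.Minimize(cp.sum(cp.norm(C, p=2, axis=1))),
           [cp.norm(A @ Xr - Yn, 'fro') <= sigma]).solve()  # eta = sigma assumed known
print("noisy reconstruction error:", np.linalg.norm(Xr.value - XS))
```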
Finally, we examine the relation between compressibility and the reconstruction error using a different model than the one described earlier. In Fig. 4(b), we plot the results of an experiment in which we generate signals $x$ in a fusion frame with $N = 200$ whose sorted block norms $\|x_j\|$ decay according to a power law. In particular, for various values of $0 < q < 1$, we set $\|x_j\| = c\, j^{-1/q}$ with $c$ chosen such that $\|x\| = 1$. We then measure $x$ with Gaussian matrices $A$ and compute the average reconstruction errors via the (L1) program. Note that the higher the value of $q$, the less compressible the signal. The results indicate that the reconstruction error decreases as the compressibility of the signal increases, as predicted by (34). We also see an improvement in the reconstruction when the subspaces are more incoherent, i.e., when they have smaller $\lambda_{\mathrm{eff}}$.

Figure 4: $\|\hat{x} - x\|$ vs. $\sigma$ and $\|\hat{x} - x\|$ vs. $q$. (a) Reconstruction error vs. noise level $\sigma$ ($\lambda_{\mathrm{eff}} = 0.22, 0.68, 0.8$, and block sparsity; $N = 200$, $s = 30$, $d = 35$). (b) Reconstruction error vs. compressibility for different $\lambda_{\mathrm{eff}}$ ($\lambda_{\mathrm{eff}} = 0.85, 0.67, 0.55$, and block sparsity).
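A short sketch of this signal model, continuing the earlier snippets, under our own assumption that the normalization $\|x\| = 1$ is taken in the mixed $(2,1)$-norm:

```python
# Power-law model: block norms ||x_j|| = c * j^(-1/q), normalized so that
# sum_j ||x_j|| = 1 (an assumption; the paper does not specify the norm).
q = 0.6
decay = np.arange(1, N + 1, dtype=float) ** (-1.0 / q)
decay /= decay.sum()                          # choose c so the (2,1)-norm is 1
Xq = np.zeros((N, d))
for j in range(N):
    v = U[j] @ rng.standard_normal(k)
    Xq[j] = decay[j] * v / np.linalg.norm(v)  # block j has norm exactly decay[j]
# Measurement and (L1)-type recovery then proceed exactly as in the sketches above.
```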
The following theorem is the noncommutative Bernstein inequality due to Tropp [23, Theorem 1.4].

Theorem 6.1 (Matrix Bernstein inequality). Let $X_1, \dots, X_M \in \mathbb{R}^{d \times d}$ be a sequence of independent random self-adjoint matrices. Suppose that $\mathbb{E} X_\ell = 0$ and $\|X_\ell\| \leq K$ almost surely, and put
$$\sigma^2 := \Big\| \sum_{\ell=1}^{M} \mathbb{E} X_\ell^2 \Big\|.$$
Then, for all $t \geq 0$,
$$\mathbb{P}\Big( \Big\| \sum_{\ell=1}^{M} X_\ell \Big\| \geq t \Big) \leq 2d \exp\Big( \frac{-t^2/2}{\sigma^2 + Kt/3} \Big). \tag{37}$$
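As a quick numerical illustration, not part of the original text, the bound (37) can be checked against a toy ensemble of diagonal Rademacher matrices, for which $K = 1$, $\mathbb{E} X_\ell^2 = I$, and hence $\sigma^2 = M$:

```python
# Sanity check of (37): X_l = diag(eps_l) with i.i.d. Rademacher entries.
import numpy as np

rng = np.random.default_rng(1)
dim, M, trials, t = 8, 200, 20000, 40.0
hits = 0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=(M, dim))
    hits += np.abs(eps.sum(axis=0)).max() >= t  # ||sum_l X_l|| for diagonal X_l
empirical = hits / trials
bound = 2 * dim * np.exp(-(t**2 / 2) / (M + t / 3))
print(f"P(||sum X_l|| >= {t}): empirical {empirical:.4f} <= bound {bound:.4f}")
```

With these parameters the empirical tail is roughly 0.04, comfortably below the bound of about 0.38; the inequality holds but, as expected, is not tight.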
Remark 6.2. One can improve the tail bound (37) provided the $X_\ell$ are identically distributed and $\mathbb{E} X_\ell^2$ is not of full rank, say $\operatorname{rank}(\mathbb{E} X_\ell^2) = r < d$. Then (37) can be replaced by
$$\mathbb{P}\Big( \Big\| \sum_{\ell=1}^{M} X_\ell \Big\| \geq t \Big) \leq 2r \exp\Big( \frac{-t^2/2}{\sigma^2 + Kt/3} \Big). \tag{38}$$

Sketch of the proof.
We mainly improve [23, Corollary 3.7] under the above assumptions on the $X_\ell$. By [23, Theorem 3.6], for each $\theta > 0$ it holds that
$$\mathbb{P}\Big( \lambda_{\max}\Big( \sum_{\ell=1}^{M} X_\ell \Big) \geq t \Big) \leq e^{-\theta t} \operatorname{Tr} \exp\Big( \sum_{\ell=1}^{M} \ln \mathbb{E}\, e^{\theta X_\ell} \Big),$$
where $\lambda_{\max}$ denotes the largest eigenvalue. Moreover, [23, Lemma 6.7] states that, for $\theta > 0$,
$$\mathbb{E}\, e^{\theta X} \preceq \exp\big( g(\theta)\, \mathbb{E} X^2 \big)$$
for a self-adjoint, centered random matrix $X$, where $g(\theta) = e^{\theta} - \theta - 1$. Using this result we obtain
$$\mathbb{P}\Big( \lambda_{\max}\Big( \sum_{\ell=1}^{M} X_\ell \Big) \geq t \Big) \leq e^{-\theta t} \operatorname{Tr} \exp\Big( g(\theta) \sum_{\ell=1}^{M} \mathbb{E} X_\ell^2 \Big) = e^{-\theta t} \operatorname{Tr} \exp\big( g(\theta)\, M\, \mathbb{E} X^2 \big)$$
$$\leq e^{-\theta t}\, r\, \lambda_{\max}\big[ \exp\big( g(\theta)\, M\, \mathbb{E} X^2 \big) \big] = e^{-\theta t}\, r \exp\big( g(\theta)\, \lambda_{\max}( M\, \mathbb{E} X^2 ) \big).$$
The first equality holds because the $X_\ell$ are identically distributed. The second inequality is valid because, for a positive semidefinite matrix $B$ with rank $r$, we have $\operatorname{Tr} B \leq r \lambda_{\max}(B)$, and $\operatorname{rank}(c\, \mathbb{E} X_\ell^2) = \operatorname{rank}\big( \exp(c\, \mathbb{E} X_\ell^2) \big) = r$ for some $c > 0$. The rest of the proof proceeds in the same way as the proof of [23, Theorem 1.4].

We also give a rectangular version of the matrix Bernstein inequality, as it appears in [23, Theorem 1.6].
Theorem 6.3 (Matrix Bernstein, rectangular version). Let $\{Z_\ell\} \subset \mathbb{R}^{d_1 \times d_2}$ be a finite sequence of independent random matrices. Suppose that $\mathbb{E} Z_\ell = 0$ and $\|Z_\ell\| \leq K$ almost surely, and put
$$\sigma^2 := \max\Big\{ \Big\| \sum_{\ell} \mathbb{E}(Z_\ell Z_\ell^*) \Big\|, \Big\| \sum_{\ell} \mathbb{E}(Z_\ell^* Z_\ell) \Big\| \Big\}.$$
Then, for all $t \geq 0$,
$$\mathbb{P}\Big( \Big\| \sum_{\ell} Z_\ell \Big\| \geq t \Big) \leq (d_1 + d_2) \exp\Big( \frac{-t^2/2}{\sigma^2 + Kt/3} \Big).$$

The next lemma is a deviation inequality for sums of independent random vectors; it is a corollary of a Bernstein inequality for suprema of empirical processes [16, Corollary 8.44]. A similar result can also be found in [18, Theorem 12].
Lemma 6.4 (Vector Bernstein inequality). Let $Y_1, Y_2, \dots, Y_M$ be independent copies of a random vector $Y$ on $\mathbb{R}^n$ satisfying $\mathbb{E} Y = 0$. Assume $\|Y\|_2 \leq K$ almost surely. Let
$$Z = \Big\| \sum_{\ell=1}^{M} Y_\ell \Big\|_2, \qquad \mathbb{E} Z^2 = M\, \mathbb{E} \|Y\|_2^2, \qquad \sigma^2 = \sup_{\|x\|_2 \leq 1} \mathbb{E} |\langle x, Y \rangle|^2. \tag{39}$$
Then, for $t > 0$,
$$\mathbb{P}\big( Z \geq \sqrt{\mathbb{E} Z^2} + t \big) \leq \exp\Big( \frac{-t^2/2}{M \sigma^2 + 2K \sqrt{\mathbb{E} Z^2} + tK/3} \Big). \tag{40}$$
Remark 6.5. The so-called weak variance $\sigma^2$ in (39) can be estimated by
$$\sigma^2 = \sup_{\|x\|_2 \leq 1} \mathbb{E} |\langle x, Y \rangle|^2 \leq \mathbb{E} \sup_{\|x\|_2 \leq 1} |\langle x, Y \rangle|^2 = \mathbb{E} \|Y\|_2^2.$$
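As a final illustration, again not part of the original text, one can check (40), together with the estimate of Remark 6.5, for vectors drawn uniformly from the unit sphere of $\mathbb{R}^n$: there $K = 1$ and, by symmetry, $\sigma^2 = \mathbb{E} |\langle x, Y \rangle|^2 = 1/n$ for any unit vector $x$.

```python
# Sanity check of (40) for Y uniform on the unit sphere of R^n.
import numpy as np

rng = np.random.default_rng(2)
n, M, trials, t = 10, 500, 5000, 15.0
EZ2 = float(M)                                  # E Z^2 = M * E||Y||^2 = M
Z = np.empty(trials)
for i in range(trials):
    Y = rng.standard_normal((M, n))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # rows uniform on the sphere
    Z[i] = np.linalg.norm(Y.sum(axis=0))
empirical = (Z >= np.sqrt(EZ2) + t).mean()
bound = np.exp(-(t**2 / 2) / (M / n + 2 * np.sqrt(EZ2) + t / 3))  # sigma^2 = 1/n
print(f"empirical {empirical:.4f} <= bound {bound:.4f}")
```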
Acknowledgment

The author would like to thank the Hausdorff Center for Mathematics and RWTH Aachen University for their support, and acknowledges funding through the WWTF project SPORTS (MA07-004) and the ERC Starting Grant StG 258926.

References
[1] U. Ayaz. Sparse Recovery with Fusion Frames. Ph.D. thesis, Hausdorff Center for Mathematics, University of Bonn, 2014.

[2] U. Ayaz, S. Dirksen, and H. Rauhut. Uniform recovery of fusion frame structured sparse signals. (Submitted).

[3] U. Ayaz and H. Rauhut. Nonuniform sparse recovery with subgaussian matrices. ETNA, to appear.

[4] P. E. Bjørstad and J. Mandel. On the spectra of sums of orthogonal projections with applications to parallel computing. BIT, 31(1):76–88, 1991.

[5] B. G. Bodmann. Optimal linear transmission by loss-sensitive packet encoding. Appl. Comput. Harmon. Anal., 22(3):274–285, 2007.

[6] P. Boufounos, G. Kutyniok, and H. Rauhut. Sparse recovery from combined fusion frame measurements. IEEE Trans. Inform. Theory, 57(6):3864–3876, 2011.

[7] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.

[8] E. Candès and T. Tao. Near optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, 2006.

[9] P. Casazza and G. Kutyniok. Fusion Frames. Applied and Numerical Harmonic Analysis. Birkhäuser, Boston, MA, 2013.

[10] P. G. Casazza and G. Kutyniok. Frames of subspaces. In Wavelets, Frames and Operator Theory, pages 87–113, 2004.

[11] P. G. Casazza, G. Kutyniok, S. Li, and C. J. Rozell. Modeling sensor networks with fusion frames. In Wavelets XII, Special Session on Finite-Dimensional Frames, Time-Frequency Analysis, and Applications, volume 6701, page 11, 2007.

[12] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Found. Comput. Math., 12(6):805–849, 2012.

[13] D. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.

[14] Y. C. Eldar and H. Bölcskei. Block-sparsity: coherence and efficient recovery. In ICASSP, pages 2885–2888. IEEE, 2009.

[15] M. Fornasier and H. Rauhut. Recovery algorithms for vector valued data with joint sparsity constraints. SIAM J. Numer. Anal., 46(2):577–613, 2008.

[16] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Birkhäuser, 2013.

[17] J.-J. Fuchs. On sparse representations in arbitrary redundant bases. IEEE Trans. Inform. Theory, 50(6):1341–1344, 2004.

[18] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, 2011.

[19] G. Kutyniok, A. Pezeshki, R. Calderbank, and T. Liu. Robust dimension reduction, fusion frames, and Grassmannian packings. Appl. Comput. Harmon. Anal., 26(1):64–76, 2009.

[20] P. Oswald. Frames and space splittings in Hilbert spaces. Technical report, 1997.

[21] N. S. Rao, B. Recht, and R. D. Nowak. Universal measurement bounds for structured sparse signal recovery. J. Mach. Learn. Res. Proc. Track, 22:942–950, 2012.

[22] J. Tropp. Recovery of short, complex linear combinations via $\ell_1$ minimization. IEEE Trans. Inform. Theory, 51(4):1568–1570, 2005.

[23] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389–434, 2012.