SPARSE RECONSTRUCTION WITH MULTIPLE WALSH MATRICES
ENRICO AU-YEUNG
Abstract.
The problem of how to find a sparse representation of a signal is an important one in applied and computational harmonic analysis. It is closely related to the problem of how to reconstruct a sparse vector from its projection in a much lower-dimensional vector space. This is the setting of compressed sensing, where the projection is given by a matrix with many more columns than rows. We introduce a class of random matrices that can be used to reconstruct sparse vectors in this paradigm. These matrices satisfy the restricted isometry property with overwhelming probability. We also discuss an application in dimensionality reduction where we initially discovered this class of matrices.

1. Introduction and Motivation
In an influential survey paper by Bruckstein, Donoho, and Elad, the problem of finding a sparse solution to an underdetermined linear system is discussed in great detail [6]. This is an important problem in applied and computational harmonic analysis. Their survey provides plenty of inspiration for future directions of research, with both theoretical and practical considerations. To make this presentation complete, we provide a brief overview.

To motivate our discussion, we start by reviewing how sparsity and redundancy are brought to use. Suppose we have a signal which we regard as a nonzero vector y ∈ R^n, and suppose there are two available orthonormal bases Ψ and Φ. Then the vector can be expressed as a linear combination of the columns of Ψ or as a linear combination of the columns of Φ,

y = Ψα = Φβ.

An important example is to take Ψ to be the identity matrix and Φ to be the matrix for the discrete cosine transform. In this case, α is the representation of the signal in the time domain (or space domain) and β is the representation in the frequency domain. For some pairs of orthonormal bases, such as the ones we have just mentioned, either the coefficients α can be sparse, or the coefficients β can be sparse, but they cannot both be sparse. This interesting phenomenon is sometimes called the Uncertainty Principle:

‖α‖₀ + ‖β‖₀ ≥ √n.

Here, we have written ‖α‖₀ to denote the sparsity of α, which is the number of nonzero entries in the vector. This means that a signal cannot have fewer than √n nonzero entries in the time domain and the frequency domain combined. Since the signal is sparse in either the time domain or the frequency domain, but not in both, this leads to the idea of combining the two bases by concatenating the two matrices into one matrix A = [Ψ Φ].

By a representation of the signal y, we mean a column vector x so that y = Ax. The representation of the signal is not unique, because the column vectors of A are not linearly independent. From this observation, we are naturally led to consider a matrix A formed by combining more than two bases. The hope is that among the many possible ways of representing the signal y, there is at least one representation that is very sparse, i.e. most entries of x are zero. We want the vector x to be s-sparse, which means that at most s of the entries are nonzero. A natural question that arises is how to find the sparse representation of a given signal y.

There is a closely related problem that occurs commonly in signal and image processing. Suppose we begin with a vector x ∈ R^N that is s-sparse, which we consider to be our compressible signal. Using the matrix A ∈ R^{n×N}, we observe the vector y from the projection y = Ax. This leads to the following problem: given a matrix A ∈ R^{n×N}, where typically N is much larger than n, and given y ∈ R^n, how do we recover the s-sparse vector x ∈ R^N from the observation y = Ax? The term most commonly used in this setting is compressed sensing. The naive approach, to search over all possible s-sparse supports in R^N, is not feasible; the underlying combinatorial problem is NP-hard. Instead, the reconstruction of the vector x is accomplished by a non-linear operator Δ : R^n → R^N that solves the minimization problem

(P1)   min ‖x‖₁ subject to y = Ax.
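Because the objective in (P1) is piecewise linear, the problem can be rewritten as a linear program and handed to a generic LP solver. The sketch below is our own illustration, not anything from this paper: it uses the standard split x = u − v with u, v ≥ 0, so that ‖x‖₁ = Σ(u_i + v_i); the function name basis_pursuit and the sizes in the demo are hypothetical choices.

```python
# A minimal sketch of (P1) as a linear program (illustrative, not from the paper).
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||x||_1 subject to Ax = y via the split x = u - v, u, v >= 0."""
    n, N = A.shape
    c = np.ones(2 * N)                 # objective: sum(u) + sum(v) = ||x||_1
    A_eq = np.hstack([A, -A])          # equality constraint: A(u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:N] - res.x[N:]

# Demo: recover a 3-sparse vector from 40 Gaussian measurements.
rng = np.random.default_rng(0)
N, n, s = 120, 40, 3
A = rng.standard_normal((n, N)) / np.sqrt(n)
x_true = np.zeros(N)
x_true[rng.choice(N, size=s, replace=False)] = rng.standard_normal(s)
x_hat = basis_pursuit(A, A @ x_true)
print(np.max(np.abs(x_hat - x_true)))  # small (near machine precision) in typical runs
```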
The following definition plays a central role in this paper.
Definition 1.1.
A matrix A ∈ R^{m×N} is said to have the restricted isometry property (RIP) of order s and level δ_s ∈ (0, 1) if

(1 − δ_s) ‖x‖₂² ≤ ‖Ax‖₂² ≤ (1 + δ_s) ‖x‖₂²  for all s-sparse x ∈ R^N.

The restricted isometry property says that the columns of any sub-matrix with at most s columns are close to being orthogonal to each other. If the matrix A satisfies this property, then the solution to (P1) is unique, i.e. it is possible to reconstruct the s-sparse vector by minimizing the ℓ₁ norm of x, subject to y = Ax. For this reason, matrices that satisfy the RIP play a key role in compressed sensing. Some examples of random matrices that satisfy the RIP are the Gaussian, Bernoulli, and partial random Fourier matrices.

From the foundational papers of Donoho [11] and Candes, Romberg, and Tao [8, 9], the field of compressed sensing has been studied and extended by many others to include a broad range of theoretical issues and applications; see, for example, [7, 10, 2, 5, 19, 1, 12, 13, 25, 26, 20, 23, 22, 16, 27], and the comprehensive treatment found in [14].

The search for structured matrices that can be used in compressed sensing continues to be an active research area (see, e.g., [21]). Towards that goal, our contribution is to introduce a class of random matrices that satisfy the RIP with overwhelming probability. We also describe an application where we initially discovered this class of matrices.
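Definition 1.1 can also be probed numerically. The following experiment, which is our illustration and not part of the paper, samples random s-sparse unit vectors and records the worst deviation of ‖Ax‖₂² from 1 for a Gaussian matrix. Random sampling can only exhibit a lower bound on δ_s, since certifying the RIP exactly would require examining every support of size s.

```python
# Empirical probe of the RIP inequalities for a Gaussian matrix (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
m, N, s, trials = 80, 256, 5, 2000
A = rng.standard_normal((m, N)) / np.sqrt(m)   # scaling makes E||Ax||^2 = ||x||^2

worst = 0.0
for _ in range(trials):
    x = np.zeros(N)
    supp = rng.choice(N, size=s, replace=False)
    x[supp] = rng.standard_normal(s)
    x /= np.linalg.norm(x)                     # unit-norm s-sparse test vector
    worst = max(worst, abs(np.linalg.norm(A @ x) ** 2 - 1.0))
print("observed lower bound on delta_s:", worst)
```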
1.1. Application. Dimensionality reduction is another area where matrices that satisfy the RIP play an important role. A powerful tool is the Johnson-Lindenstrauss (JL) lemma. This lemma tells us that the distance between each pair of points in a high-dimensional space is nearly preserved if we project the points into a much lower-dimensional space using a suitable random linear mapping. Krahmer and Ward [16] showed that if a matrix satisfies the RIP, then we can use it to create such a mapping by randomizing the column signs of the matrix. For a precise statement, see [16]. Together with our matrix that satisfies the RIP,
their result allows one to create a matrix to be used in a JL-type embedding. To demonstrate this in a concrete setting, let us turn to an application in robust facial recognition.

The goal of object recognition is to use training samples from k distinct object classes to determine the class to which a new sample belongs. We arrange the given n_j training samples from the j-th class as columns of a matrix Y_j ≡ [v_{j,1}, v_{j,2}, ..., v_{j,n_j}] ∈ R^{m×n_j}. In the context of a face recognition system, we identify a w × h facial image with the vector v ∈ R^m (m = wh) given by stacking its columns. Therefore, the columns of Y_j are the training facial images of the j-th person. One effective approach for exploiting the structure of the Y_j in object recognition is to model the samples from a single class as lying on a linear subspace. Subspace models are flexible enough to capture the structure in real data, and it has been demonstrated that the images of faces under varying lighting and expressions lie on a low-dimensional subspace [3]. For our present discussion, we will assume that the training samples from a single class do lie on a single subspace.

Suppose we are given sufficiently many training samples of the j-th object class, Y_j ≡ [v_{j,1}, v_{j,2}, ..., v_{j,n_j}] ∈ R^{m×n_j}. Then, any new sample y_new ∈ R^m from the same class will approximately lie in the linear span of the training samples associated with object j, i.e.

y_new = c_{j,1} v_{j,1} + c_{j,2} v_{j,2} + ... + c_{j,n_j} v_{j,n_j},

for some coefficients c_{j,k} ∈ R, 1 ≤ k ≤ n_j. We define a new matrix Φ for the entire training set as the concatenation of the n training samples of all k object classes,

Φ = [Y_1, Y_2, Y_3, ..., Y_k] = [v_{1,1}, v_{1,2}, ..., v_{1,n_1}, v_{2,1}, ..., v_{k,n_k}].

The new sample y_new ∈ R^m can be expressed as a linear combination of all training samples, y_new = Φx, where the transpose of the vector x is of the form

x^T = [0, 0, ..., 0, c_{j,1}, c_{j,2}, ..., c_{j,n_j}, 0, 0, ..., 0] ∈ R^n,

i.e. x is the coefficient vector whose entries are zero, except those entries associated with the j-th class. The sparse vector x encodes the identity of the new sample y_new. The task of classifying a new sample amounts to solving the linear system y_new = Φx to recover the sparse vector x. For more details, see [28], where the authors presented strong experimental evidence to support this approach to robust facial recognition.
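The classification rule just described can be sketched in a few lines. The residual-based decision below follows the spirit of [28]; the function names and the choice of ℓ₁ solver (any routine such as the basis_pursuit sketch from the introduction can be passed in) are our own, illustrative choices.

```python
# A hedged sketch of sparse-representation classification (in the spirit of [28]).
import numpy as np

def classify(Phi, class_slices, y_new, l1_solver):
    """Return the index of the class whose training columns best explain y_new."""
    x = l1_solver(Phi, y_new)                # sparse coefficients for y_new = Phi x
    residuals = []
    for sl in class_slices:                  # sl selects the columns of one class
        xj = np.zeros_like(x)
        xj[sl] = x[sl]                       # keep only the class-j coefficients
        residuals.append(np.linalg.norm(y_new - Phi @ xj))
    return int(np.argmin(residuals))         # smallest residual wins
```

For k classes with n_j training columns each, class_slices would simply be the list of column ranges, e.g. [slice(0, n_1), slice(n_1, n_1 + n_2), ...].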
One practical issue that arises is that for face images without any pre-processing, the corresponding linear system y = Φx is very large. For example, if each face image is given at a typical resolution of 640 × 480 pixels, then the matrix Φ has m rows, where m is on the order of 10⁵. Even using scalable algorithms, such as linear programming, applying this approach directly to high-resolution images requires enormous computing power. Dimensionality reduction becomes indispensable in this setting. The projection from the image space to the much lower-dimensional feature space can be represented by a matrix P, where P has many more columns than rows. The linear system y = Φx then becomes

ỹ ≡ Py = PΦx.
The new sample y is replaced by its projection ỹ. The sparse vector x is reconstructed by solving the minimization problem

min ‖x‖₁ subject to ỹ = PΦx.

In the past, an enormous amount of effort was spent to develop feature-extraction methods for finding projections of images into lower-dimensional spaces. Examples of feature-extraction methods include EigenFace, FisherFace, and a host of creative techniques; see, e.g., [4]. For the approach to facial recognition that we have described, choosing a matrix P is no longer a difficult task. We can select a matrix P so that it nearly preserves the distance between every pair of vectors, i.e. ‖Px − Py‖ ≈ ‖x − y‖. As mentioned earlier, beginning with a matrix A that satisfies the RIP, the result of Krahmer and Ward allows one to create a matrix P to be used in a JL-type embedding.
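The following sketch illustrates the Krahmer-Ward recipe in miniature, under our own choice of sizes: start from a row-subsampled Hadamard-Walsh matrix (a standard RIP candidate), randomize its column signs, and observe that pairwise distances are nearly preserved. This is an illustration of the mechanism, not a verbatim implementation from [16].

```python
# Illustrative JL-type embedding: an RIP-style matrix with randomized column signs.
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
N, m, num_points = 256, 64, 20
H = hadamard(N) / np.sqrt(N)                 # orthogonal Walsh-Hadamard matrix
rows = rng.choice(N, size=m, replace=False)
A = H[rows, :] * np.sqrt(N / m)              # subsampled and rescaled rows
P = A * rng.choice([-1.0, 1.0], size=N)      # randomize the column signs

X = rng.standard_normal((N, num_points))     # points to embed
ratios = [np.linalg.norm(P @ (X[:, i] - X[:, j])) / np.linalg.norm(X[:, i] - X[:, j])
          for i in range(num_points) for j in range(i + 1, num_points)]
print("distortion range:", min(ratios), max(ratios))  # both concentrate near 1
```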
1.2. Notation. Before continuing further, we need to define some terminology. The Rademacher system {r_n(x)} on the interval [0, 1] is a set of orthogonal functions defined by

r_n(x) = sign(sin(2^{n+1} πx)), n = 0, 1, 2, 3, ....

The Rademacher system does not form a basis for L²([0, 1]). Every positive integer n can be written in the binary system as

n = 2^{n_1} + 2^{n_2} + ... + 2^{n_k},

where the integers n_j are uniquely determined by n_{j+1} < n_j. The Walsh functions {W_n(x)}_{n=0}^∞ are then given by

W_0(x) = 1, W_n(x) = r_{n_1}(x) r_{n_2}(x) ... r_{n_k}(x).

The Walsh system forms an orthogonal basis for L²([0, 1]). Let H_0 = 1, and for n ≥ 1,

H_n = (1/√2) [ H_{n−1}   H_{n−1} ]
             [ H_{n−1}  −H_{n−1} ].

Then, the column vectors of H_n form an orthogonal basis of R^{2^n}. Note that the matrix H_n has 2^n rows and 2^n columns. Because of its close connection to the Walsh system, a matrix of the form H_n is called a Hadamard-Walsh matrix.

The inner product of two vectors x and y is denoted by ⟨x, y⟩. The Euclidean norm of a vector x is denoted by ‖x‖. If a vector has at most s nonzero entries, we say that the vector is s-sparse. For clarity, we often label constants by C₁, C₂, ..., but we do not keep track of their precise values.

For a matrix A ∈ R^{m×N}, its operator norm is ‖A‖ = sup{‖Ax‖ : ‖x‖ = 1}. If x ∈ R^N, then we say that Γ is the support set of the vector if the entries of x are nonzero only on the set Γ, and we write supp(x) = Γ. We define

B_Γ = {x ∈ R^N : ‖x‖ = 1, supp(x) = Γ}.

We write A* for the adjoint (or transpose) of the matrix. Working with s-sparse vectors, there is another norm, defined by

‖A‖_Γ = sup{ |⟨Ax, y⟩| : x ∈ B_Γ, y ∈ B_Γ, |Γ| ≤ s }.
This norm is important because if the matrix A obeys the relation ‖I − A*A‖_Γ ≤ δ_s, then A satisfies the RIP of order s and level δ_s.

Let us introduce a model called Sparse City. From now on, we fix a positive integer m that is a power of 2, i.e. m = 2^k, and focus on a single Hadamard-Walsh matrix W with m rows and m columns. Let W_mn be the m × n matrix formed by selecting the first n columns of the matrix W. Let Θ be a bounded random variable with mean zero and unit second moment, i.e.

E(Θ) = 0, E(Θ²) = 1, |Θ| ≤ B.

To be precise, the random variable Θ is equally likely to take one of four possible values, which occur in symmetric pairs ±c and ±d chosen to satisfy the moment conditions above. Define the random vectors x₁, x₂, ..., x_b ∈ R^m, so that the entries of each vector are independent random variables drawn from the same probability distribution as Θ. For each vector x_j = (θ_{j1}, θ_{j2}, ..., θ_{jm}), we have E(θ_{jw}) = 0 and |θ_{jw}| ≤ B, for 1 ≤ w ≤ m. We associate a matrix D_j to each vector x_j, so that each D_j ∈ R^{m×m} is a diagonal matrix with the entries of x_j along the diagonal. To construct the matrix A, we concatenate b blocks of D_jW_mn, so that, written out in block form,

A = [ D₁W_mn | D₂W_mn | D₃W_mn | D₄W_mn | ...... | D_bW_mn ].

Note that the matrix A has m rows and nb columns. In our application, Walsh matrices are more appropriate than other orthogonal matrices, such as the discrete cosine transform (DCT). For illustration, if A ∈ R^{64×nb} with b = 320 blocks, then every entry of the Hadamard-Walsh matrix is ±1/8, and the values of Θ can be chosen so that each entry of 64A is one of only four integer values. Consider any vector y that contains only integer values, ranging from 0 to 255, which are typical for facial images. The product Ay can then be computed from (64A)y, and the calculation of (64A)y uses only integer-arithmetic operations.
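A small sketch of the construction may be helpful. The Hadamard-Walsh matrix is built by the recursion given above, and each block applies its own random diagonal. Since the exact four-point distribution of Θ is not essential for what follows, the sketch uses random signs as a stand-in satisfying E(Θ) = 0, E(Θ²) = 1, and |Θ| ≤ B.

```python
# A sketch of the Sparse City matrix A = [D_1 W_mn | ... | D_b W_mn] (illustrative).
import numpy as np

def hadamard_walsh(k):
    """2^k x 2^k Hadamard-Walsh matrix with orthonormal columns."""
    H = np.array([[1.0]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]]) / np.sqrt(2.0)
    return H

def sparse_city(k, n, b, rng):
    m = 2 ** k
    W_mn = hadamard_walsh(k)[:, :n]              # first n columns of W
    blocks = []
    for _ in range(b):
        theta = rng.choice([-1.0, 1.0], size=m)  # stand-in for the distribution of Theta
        blocks.append(theta[:, None] * W_mn)     # D_j @ W_mn without forming D_j
    return np.hstack(blocks)                     # m rows, n*b columns

rng = np.random.default_rng(3)
A = sparse_city(k=6, n=16, b=8, rng=rng)
print(A.shape)                                   # (64, 128)
```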
1.3. Main results. Our first result is that the matrix A satisfies the RIP in expectation.
Theorem 1.2.
Let W be the Hadamard-Walsh matrix with m rows and m columns. Let W_mn be the m × n matrix formed by selecting the first n columns of the matrix W. The matrix A ∈ R^{m×nb} is constructed by concatenating b blocks of D_jW_mn, so that

A = [ D₁W_mn | D₂W_mn | D₃W_mn | D₄W_mn | ...... | D_bW_mn ].

Each D_j ∈ R^{m×m} is a diagonal matrix, as defined in Section 1.2. Then, there exists a constant C > 0 such that for any 0 < δ_s ≤ 1, we have

E‖I − A*A‖_Γ ≤ δ_s

provided that m ≥ C·δ_s^{−2}·s·log⁴(nb) and m ≤ nb.

More precisely, there are constants C₁ and C₂ so that

(1)   E‖I − A*A‖_Γ ≤ √( C₁·s·log²(s)·log(mb)·log(nb) / m )

provided that m ≥ C₂·s·log²(s)·log(mb)·log(nb).

The next theorem tells us that the matrix satisfies the restricted isometry property with overwhelming probability.
Theorem 1.3.
Fix a constant δ_s > 0. Let A be the matrix specified in Theorem 1.2. Then, there exists a constant C > 0 such that

‖I − A*A‖_Γ < δ_s

with probability at least 1 − ε, provided that m ≥ C·δ_s^{−2}·s·log⁴(nb)·log(1/ε) and m ≤ nb.
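As a purely numerical illustration of Theorem 1.3 (with the usual caveat that random supports only probe the quantity from below), one can draw random supports Γ of size s and record the largest observed value of ‖I − A_Γ*A_Γ‖, where A_Γ keeps the columns indexed by Γ. The sizes below are our own choices.

```python
# Monte Carlo probe of ||I - A*A||_Gamma for the Sparse City matrix (illustrative).
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(4)
m, n, b, s, trials = 256, 32, 16, 6, 500
W = hadamard(m) / np.sqrt(m)                 # Hadamard-Walsh with orthonormal columns
A = np.hstack([rng.choice([-1.0, 1.0], size=m)[:, None] * W[:, :n] for _ in range(b)])

worst = 0.0
for _ in range(trials):
    Gamma = rng.choice(n * b, size=s, replace=False)
    B = A[:, Gamma]
    worst = max(worst, np.linalg.norm(np.eye(s) - B.T @ B, 2))  # spectral norm
print("observed lower bound on ||I - A*A||_Gamma:", worst)
```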
1.4. Related work. Gaussian and Bernoulli matrices satisfy the restricted isometry property (RIP) with overwhelmingly high probability, provided that the number of measurements m satisfies m = O(s log(N/s)). Although these matrices require the fewest measurements, they have limited use in practical applications. Storing an unstructured matrix, in which all the entries of the matrix are independent of each other, requires a prohibitive amount of storage. From a computational and applied viewpoint, this has motivated the need to find structured random matrices that satisfy the RIP. Let us review three of the most popular classes of random matrices that are appealing alternatives to the Gaussian matrices. For a broad discussion and other types of matrices, see [18] and [21].

The random subsampled Fourier matrix is constructed by randomly choosing m rows from the N × N discrete Fourier transform (DFT) matrix. In this case, it is important to note that the fast Fourier transform (FFT) algorithm can be used to significantly speed up the matrix-by-vector multiplication. A random subsampled Fourier matrix with m rows and N columns satisfies the RIP with high probability, provided that m ≥ C·δ_s^{−2}·s·log⁴(N), where C is a universal constant; see [23] for a precise statement.

The next type of structured random matrices are partial random Toeplitz and circulant matrices. These matrices naturally arise in applications where convolutions are involved. Recall that for a Toeplitz matrix, each entry a_ij in row i and column j is determined by the value of i − j, so that, for example, a₁₁ = a₂₂ = a₃₃ and a₁₂ = a₂₃ = a₃₄. To construct a random m × N Toeplitz matrix A, only N + m − 1 independent random entries are needed. Such a matrix A satisfies the RIP of order 3s with high probability for every δ ∈ (0, 1/3), provided that m ≥ C·s³·log(N/s), where C is a constant; see [15] for the precise statement.

There are many situations in signal processing where we encounter signals that are band-limited and are sparse in the frequency domain. The random demodulator matrix is suitable in this setting [27]. For motivation, imagine that we try to acquire a single high-frequency tone that lies within a wide spectral band. A low-rate sampler with an antialiasing filter will be oblivious to any tone whose frequency exceeds the passband of the filter. To deal with this problem, the random demodulator smears the tone across the entire spectrum so that it leaves a signature that a low-rate sampler can detect. Consider a signal whose highest frequency does not exceed W hertz. We can give a mathematical description of the system. Let D be a W × W diagonal matrix, with random numbers along the diagonal. Next, we consider the action of the sampler and suppose the sampling rate is R, where R divides W. Each sample is then the sum of W/R consecutive entries of the demodulated signal. The action of the sampling is specified by a matrix G with R rows and W columns, such that the r-th row has W/R consecutive ones, beginning in column (rW/R) + 1, for each r = 0, 1, 2, ..., R − 1.
For example, when W = 12 and R = 3, we have

G = [ 1 1 1 1 0 0 0 0 0 0 0 0 ]
    [ 0 0 0 0 1 1 1 1 0 0 0 0 ]
    [ 0 0 0 0 0 0 0 0 1 1 1 1 ].

Define the R × W matrix A = GDF, where F is the W × W discrete Fourier transform (DFT) matrix with the columns permuted; see [27] for further detail.
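The structure of G is easy to generate for any admissible pair (R, W); the short sketch below (our illustration) reproduces the W = 12, R = 3 example above.

```python
# Sampling matrix G of the random demodulator: W/R consecutive ones per row.
import numpy as np

def sampling_matrix(R, W):
    assert W % R == 0, "R must divide W"
    L = W // R
    G = np.zeros((R, W))
    for r in range(R):
        G[r, r * L:(r + 1) * L] = 1.0   # ones in columns rW/R + 1, ..., (r+1)W/R
    return G

print(sampling_matrix(3, 12))
```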
For a fixed δ > 0, if the sampling rate R is greater than or equal to C·δ^{−2}·s·log⁶(W), then an R × W random demodulator matrix A has the RIP of order s with constant δ_s ≤ δ, with probability at least 1 − O(W^{−1}).

In contrast to the classes of matrices described above, the class of structured random matrices introduced in Sparse City has a block form. The matrix A ∈ R^{m×nb} is constructed by concatenating b blocks. More precisely,

A = [ D₁W_mn | D₂W_mn | D₃W_mn | D₄W_mn | ...... | D_bW_mn ].

In each block D_jW_mn, the same m × n matrix W_mn is used, but each block has its own random diagonal matrix D_j. For compressed sensing to be useful in applications, we need to have suitable hardware and a data acquisition system. In seismic imaging, the signals are often measured by multiple sensors. A signal can be viewed as partitioned into many parts. Different sensors are responsible for measuring different parts of the signal. Each sensor is equipped with its own scrambler, which it uses to randomly scramble the measurements. The block structure of the sensing matrix A facilitates the design of a suitable data acquisition scheme tailored to this setting.
2. Mathematical tools

We collect together the tools we need to prove the main results. We begin with a fundamental result by Rudelson and Vershynin [23], followed by an extension of this result, and then a concentration inequality. In what follows, for vectors x, y ∈ R^n, the tensor x ⊗ y is the rank-one operator defined by (x ⊗ y)(z) = ⟨x, z⟩y. For a given subset Γ ⊆ {1, 2, ..., n}, the notation x^Γ is the restriction of the vector x to the coordinates in the set Γ.

Lemma 2.1. (Rudelson and Vershynin) Let x₁, x₂, x₃, ..., x_m, with m ≤ n, be vectors in R^n with uniformly bounded entries, ‖x_i‖_∞ ≤ K for all i. Then

(2)   E sup_{|Γ|≤s} ‖ Σ_{i=1}^{m} ε_i x_i^Γ ⊗ x_i^Γ ‖ ≤ M · sup_{|Γ|≤s} ‖ Σ_{i=1}^{m} x_i^Γ ⊗ x_i^Γ ‖^{1/2},

where the constant M equals C(K)·√s·log(s)·√(log n)·√(log m).

Since our next lemma is an extension of this lemma, we provide a review of the main ideas in the proof of Lemma 2.1. Let E₁ denote the left-hand side of (2). We will bound E₁ by the supremum of a Gaussian process. Let g₁, g₂, g₃, ..., g_m be independent standard normal random variables. The expected value of |g_i| is a constant that does not depend on the index i, and so

E₁ ≤ C₁ · E sup_{|Γ|≤s} ‖ Σ_{i=1}^{m} E|g_i| ε_i x_i^Γ ⊗ x_i^Γ ‖
   ≤ C₁ · E sup_{|Γ|≤s} ‖ Σ_{i=1}^{m} g_i x_i^Γ ⊗ x_i^Γ ‖
   = C₁ · E sup { | Σ_{i=1}^{m} g_i ⟨x_i, x⟩² | : |Γ| ≤ s, x ∈ B_Γ }.

To see that the last equality is true, consider the operator A on R^n defined by

Az = Σ_{i=1}^{m} g_i ⟨x_i, z⟩ x_i;

since A is a self-adjoint operator, it follows that

‖A‖_op = sup_{‖z‖=1} |⟨Az, z⟩| = sup_{‖z‖=1} | Σ_{i=1}^{m} g_i ⟨x_i, z⟩² |.

For each vector u in R^n, we consider the Gaussian process

G(u) = Σ_{i=1}^{m} g_i ⟨x_i, u⟩².

This Gaussian process is a random process indexed by vectors in R^n.
Thus, to obtain an upper bound on E₁, we need an estimate on the expected value of the supremum of a Gaussian process over an arbitrary index set. We use Dudley's theorem (see [24], Proposition 2.1) to obtain such an upper bound.
Let (X(t) : t ∈ T) be a Gaussian process with the associated pseudo-metric d(s, t) = (E|X(s) − X(t)|²)^{1/2}. Then there exists a constant K > 0 such that

E sup_{t∈T} X(t) ≤ K ∫₀^∞ √(log N(T, d, u)) du.

Here, T is an arbitrary index set, and the covering number N(T, d, u) is the smallest number of balls of radius u needed to cover the set T with respect to the pseudo-metric.

By applying Dudley's inequality with

T = ∪_{|Γ|≤s} B_Γ,

the above calculations show that

(3)   E₁ ≤ C ∫₀^∞ log^{1/2} N( ∪_{|Γ|≤s} B_Γ, ‖·‖_G, u ) du,

where N is the covering number. There is a semi-norm associated with the Gaussian process, so that if x and y are any two fixed vectors in R^n, then

‖x − y‖_G = ( E|G(x) − G(y)|² )^{1/2}
          = [ Σ_{i=1}^{m} ( ⟨x_i, x⟩² − ⟨x_i, y⟩² )² ]^{1/2}
          ≤ [ Σ_{i=1}^{m} ( ⟨x_i, x⟩ + ⟨x_i, y⟩ )² ]^{1/2} · max_{i≤m} |⟨x_i, x − y⟩|
          ≤ 2 sup_{|Γ|≤s, z∈B_Γ} [ Σ_{i=1}^{m} ⟨x_i, z⟩² ]^{1/2} · max_{i≤m} |⟨x_i, x − y⟩|
          = 2R · max_{i≤m} |⟨x_i, x − y⟩|,

where

R ≡ sup_{|Γ|≤s} ‖ Σ_{i=1}^{m} x_i^Γ ⊗ x_i^Γ ‖^{1/2}.

Thus, by a change of variable in the integral in (3), we see that

(4)   E₁ ≤ C₂ · R · √s · ∫₀^∞ log^{1/2} N( ∪_{|Γ|≤s} B_Γ, ‖·‖_X, u ) du.

Here, the semi-norm ‖x‖_X is defined by ‖x‖_X = max_{i≤m} |⟨x_i, x⟩|. It is sufficient to show that the integral in (4) is bounded by C(K)·log(s)·√(log n)·√(log m). This concludes our review of the main ideas in the proof of Lemma 2.1.

We extend the fundamental lemma of Rudelson and Vershynin. The proof follows the strategy of the proof of the original lemma, with an additional ingredient. The Gaussian process involved is replaced by a tensorized version, with the appropriate tensor norm.
Lemma 2.3. (Extension of the fundamental lemma of Rudelson and Vershynin) Let u₁, u₂, u₃, ..., u_k and v₁, v₂, v₃, ..., v_k, with k ≤ n, be vectors in R^n with uniformly bounded entries, ‖u_i‖_∞ ≤ K and ‖v_i‖_∞ ≤ K for all i. Then

(5)   E sup_{|Γ|≤s} ‖ Σ_{i=1}^{k} ε_i u_i^Γ ⊗ v_i^Γ ‖ ≤ M · [ sup_{|Γ|≤s} ‖ Σ_{i=1}^{k} u_i^Γ ⊗ u_i^Γ ‖^{1/2} + sup_{|Γ|≤s} ‖ Σ_{i=1}^{k} v_i^Γ ⊗ v_i^Γ ‖^{1/2} ],

where the constant M depends on K and the sparsity s.

Proof. Let E₂ denote the left-hand side of (5). Our plan is to bound E₂ by the supremum of a Gaussian process. Let g₁, g₂, g₃, ..., g_k be independent standard normal random variables. Then

E₂ = E sup { | Σ_{i=1}^{k} ε_i ⟨x_p, u_i⟩⟨v_i, x_q⟩ | : |Γ| ≤ s, x_p ∈ B_Γ, x_q ∈ B_Γ }
   ≤ C · E sup { | Σ_{i=1}^{k} g_i ⟨x_p, u_i⟩⟨v_i, x_q⟩ | : |Γ| ≤ s, x_p ∈ B_Γ, x_q ∈ B_Γ }.

When G(x) is a Gaussian process indexed by the elements x of an arbitrary index set T, Dudley's inequality states that

E sup_{x∈T} |G(x)| ≤ C · ∫₀^∞ log^{1/2} N(T, d, u) du,

with the pseudo-metric d given by d(x, y) = (E|G(x) − G(y)|²)^{1/2}. Our Gaussian process is indexed by two vectors x_p and x_q, so that

G(x_p, x_q) = Σ_i g_i ⟨x_p, u_i⟩⟨v_i, x_q⟩,

and the index set is T = ∪_{|Γ|≤s} B_Γ ⊗ B_Γ.
The pseudo-metric on T is given by

d((x_p, x_q), (y_p, y_q)) = [ Σ_{i=1}^{k} ( ⟨x_p, u_i⟩⟨v_i, x_q⟩ − ⟨y_p, u_i⟩⟨v_i, y_q⟩ )² ]^{1/2}
= (1/2) [ Σ_{i=1}^{k} ( ⟨x_p + y_p, u_i⟩⟨v_i, x_q − y_q⟩ + ⟨x_p − y_p, u_i⟩⟨v_i, x_q + y_q⟩ )² ]^{1/2}
≤ (1/2) · max_i ( |⟨u_i, x_p − y_p⟩|, |⟨v_i, x_q − y_q⟩| ) · [ Σ_{i=1}^{k} ( |⟨x_p + y_p, u_i⟩| + |⟨x_q + y_q, v_i⟩| )² ]^{1/2}
≤ Q · max_i ( |⟨u_i, x_p − y_p⟩|, |⟨v_i, x_q − y_q⟩| ),

where the quantity Q is defined by

Q = (1/2) sup { [ Σ_{i=1}^{k} ( |⟨x_p + y_p, u_i⟩| + |⟨x_q + y_q, v_i⟩| )² ]^{1/2} : (x_p, x_q) ∈ T }.

We bound the quantity Q in the following calculations:

Q² = (1/4) sup_{(x_p,x_q)∈T} Σ_{i=1}^{k} ( |⟨x_p + y_p, u_i⟩| + |⟨x_q + y_q, v_i⟩| )²
   ≤ (1/4) sup_{(x_p,x_q)∈T} { Σ_{i=1}^{k} |⟨x_p + y_p, u_i⟩|² + Σ_{i=1}^{k} |⟨x_q + y_q, v_i⟩|² + 2 Σ_{i=1}^{k} |⟨x_p + y_p, u_i⟩| · |⟨x_q + y_q, v_i⟩| }
   ≤ ‖ Σ_{i=1}^{k} u_i ⊗ u_i ‖_Γ + ‖ Σ_{i=1}^{k} v_i ⊗ v_i ‖_Γ + 2 ‖ Σ_{i=1}^{k} u_i ⊗ u_i ‖_Γ^{1/2} · ‖ Σ_{i=1}^{k} v_i ⊗ v_i ‖_Γ^{1/2}
   = ( ‖ Σ_{i=1}^{k} u_i ⊗ u_i ‖_Γ^{1/2} + ‖ Σ_{i=1}^{k} v_i ⊗ v_i ‖_Γ^{1/2} )² ≡ S².

We now define two norms. Let ‖x‖_∞^{(U)} = max_i |⟨x, u_i⟩| and ‖x‖_∞^{(V)} = max_i |⟨x, v_i⟩|. The above calculations show that the pseudo-metric satisfies

d((x_p, x_q), (y_p, y_q)) ≤ S · max( ‖x_p − y_p‖_∞^{(U)}, ‖x_q − y_q‖_∞^{(V)} ).

Let T̃ = ∪_{|Γ|≤s} B_Γ. Then T ⊆ T̃ ⊗ T̃. Moreover, the covering number of the set T and the covering number of the set T̃ must satisfy the relation

N(T, d, u) ≤ N(T̃ ⊗ T̃, d̃, u).

Here, d and d̃ are the pseudo-metrics for the corresponding index sets. Consequently, we have

∫₀^∞ log^{1/2} N(T, d, u) du ≤ S · ∫₀^∞ log^{1/2} N(T̃, ‖·‖_∞^{(U)}, u) du + S · ∫₀^∞ log^{1/2} N(T̃, ‖·‖_∞^{(V)}, u) du.

We have completed all the necessary modifications to the proof of the original lemma. The rest of the proof proceeds in exactly the same manner as the proof of the original lemma, almost verbatim, and we omit the repetition. □
In order to show that, with high probability, a random quantity does not deviate too much from its mean, we invoke a concentration inequality for sums of independent symmetric random variables in a Banach space (see [27], Proposition 19, which follows from [17], Theorem 6.17).
Proposition 2.4. (Concentration inequality) Let Y₁, Y₂, ..., Y_R be independent, symmetric random variables in a Banach space X. Assume that each random variable satisfies the bound ‖Y_j‖_X ≤ B almost surely, for 1 ≤ j ≤ R. Let Y = ‖Σ_j Y_j‖_X. Then there exists a constant C so that for all u, t ≥ 1,

P( Y > C[ u·E(Y) + t·B ] ) ≤ e^{−u²} + e^{−t}.

We define a sequence of vectors that depend on the entries of the matrix W_mn. Let y_{kw} ∈ R^{nb} be the vector whose entries indexed by (k−1)n + 1, (k−1)n + 2, (k−1)n + 3, ..., kn are the entries of row w of the matrix W_mn, while all other entries are zero. The next example illustrates the situation.
Consider the matrix W_mn with m = 4 rows and n = 2 columns,

W_mn = [ a(1,1)  a(1,2) ]
       [ a(2,1)  a(2,2) ]
       [ a(3,1)  a(3,2) ]
       [ a(4,1)  a(4,2) ],

and take b = 2, so that nb = 4. We define the vectors y_{11}, y_{12}, y_{13}, y_{14} by

y_{11} = (a(1,1), a(1,2), 0, 0)^T,   y_{12} = (a(2,1), a(2,2), 0, 0)^T,
y_{13} = (a(3,1), a(3,2), 0, 0)^T,   y_{14} = (a(4,1), a(4,2), 0, 0)^T,

and we define the vectors y_{21}, y_{22}, y_{23}, y_{24} by

y_{21} = (0, 0, a(1,1), a(1,2))^T,   y_{22} = (0, 0, a(2,1), a(2,2))^T,
y_{23} = (0, 0, a(3,1), a(3,2))^T,   y_{24} = (0, 0, a(4,1), a(4,2))^T.

Since the columns of W_mn come from an orthogonal matrix, we have the following relations:

Σ_{k=1}^{4} (a(k,1))² = 1,   Σ_{k=1}^{4} (a(k,2))² = 1,   Σ_{k=1}^{4} a(k,1)·a(k,2) = 0.

The rank-one operator y_{kw} ⊗ y_{kw} is defined by (y_{kw} ⊗ y_{kw})(z) = ⟨z, y_{kw}⟩ y_{kw}, for every z ∈ R⁴. Explicitly in matrix form, the rank-one operator y_{11} ⊗ y_{11} is

y_{11} ⊗ y_{11} = [ a(1,1)·a(1,1)  a(1,1)·a(1,2)  0  0 ]
                  [ a(1,2)·a(1,1)  a(1,2)·a(1,2)  0  0 ]
                  [ 0              0              0  0 ]
                  [ 0              0              0  0 ].
We can directly compute and verify that

Σ_{k=1}^{b} Σ_{w=1}^{m} y_{kw} ⊗ y_{kw} = I,

the identity matrix.
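The identity is easy to confirm numerically for a small instance; the check below (our illustration, assuming NumPy and SciPy) uses m = 4, n = 2, b = 2, exactly as in Example 2.5.

```python
# Numerical check of sum_{k,w} y_kw (x) y_kw = I for a small instance.
import numpy as np
from scipy.linalg import hadamard

m, n, b = 4, 2, 2
W_mn = (hadamard(m) / np.sqrt(m))[:, :n]        # orthonormal columns
S = np.zeros((n * b, n * b))
for k in range(b):
    for w in range(m):
        y = np.zeros(n * b)
        y[k * n:(k + 1) * n] = W_mn[w, :]       # row w of W_mn placed in block k
        S += np.outer(y, y)
print(np.allclose(S, np.eye(n * b)))            # True
```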
Remark 2.6. The vectors y_{kw} ∈ R^{nb} may seem cumbersome at first, but they enable us to write the matrix A*A in a manageable form. The matrix A is constructed from b blocks of D_jW_mn, and so the matrix has the form

A = [ D₁W_mn | D₂W_mn | D₃W_mn | D₄W_mn | ...... | D_bW_mn ],

which means that when b = 3, the matrix A*A has the form

A*A = [ (W_mn)*    0        0     ] [ D₁*D₁  D₁*D₂  D₁*D₃ ] [ W_mn    0      0   ]
      [    0    (W_mn)*     0     ] [ D₂*D₁  D₂*D₂  D₂*D₃ ] [   0   W_mn     0   ]
      [    0       0     (W_mn)*  ] [ D₃*D₁  D₃*D₂  D₃*D₃ ] [   0     0    W_mn  ].

For clarity, we have written out the form of A*A when b = 3; the pattern extends to the general case with b blocks. The key observation is that we can now write

A*A = Σ_{k=1}^{b} Σ_{j=1}^{b} Σ_{w=1}^{m} θ_{kw} θ_{jw} · y_{kw} ⊗ y_{jw}.

This expression for A*A plays a crucial role in the proof of Theorem 1.2.
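This expansion, too, can be sanity-checked numerically on a small random instance (our illustration; random signs again stand in for Θ):

```python
# Numerical check of A*A = sum_{k,j,w} theta_kw theta_jw y_kw (x) y_jw.
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(5)
m, n, b = 8, 4, 3
W_mn = (hadamard(m) / np.sqrt(m))[:, :n]
theta = rng.choice([-1.0, 1.0], size=(b, m))     # row k fills the diagonal of D_k
A = np.hstack([theta[k][:, None] * W_mn for k in range(b)])

S = np.zeros((n * b, n * b))
for k in range(b):
    for j in range(b):
        for w in range(m):
            yk = np.zeros(n * b); yk[k * n:(k + 1) * n] = W_mn[w, :]
            yj = np.zeros(n * b); yj[j * n:(j + 1) * n] = W_mn[w, :]
            S += theta[k, w] * theta[j, w] * np.outer(yj, yk)
print(np.allclose(A.T @ A, S))                   # True
```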
To show that a quantity P is bounded by a small constant, it is enough, as the next lemma tells us, to show that P is bounded by a small constant multiplied by (2 + √(P + 1)).

Lemma 2.7. Fix a constant c ≤ 1. If P > 0 and P ≤ c(2 + √(P + 1)), then P < 5c.

Proof. Let x = (P + 1)^{1/2} and note that x is an increasing function of P. The hypothesis of the lemma becomes

x² − 1 ≤ c(2 + x),

which implies that

x² − cx − (2c + 1) ≤ 0.

The polynomial on the left is strictly increasing when x ≥ c/2. Since x ≥ 1 and c ≤ 1, the polynomial is strictly increasing over the entire domain of interest, and thus

x ≤ [ c + √( c² + 4(2c + 1) ) ] / 2.

By substituting (P + 1)^{1/2} back in for x, this means

P ≤ c · [ c + 4 + √( c² + 8c + 4 ) ] / 2.

Since c ≤ 1, the bracketed factor is at most (5 + √13)/2 < 5, and this implies that P < 5c. □
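Since the constant in the conclusion is not sharp, a quick numerical scan (our illustration) confirms that 5c is a safe choice: at the boundary of the hypothesis, the quadratic in x can be solved explicitly.

```python
# Numerical check of Lemma 2.7: if P <= c(2 + sqrt(P + 1)) with c <= 1, then P < 5c.
import numpy as np

for c in np.linspace(0.01, 1.0, 100):
    # boundary case: x = sqrt(P + 1) solves x^2 - c x - (2c + 1) = 0
    x = (c + np.sqrt(c * c + 4 * (2 * c + 1))) / 2
    P = x * x - 1                  # the largest P allowed by the hypothesis
    assert P < 5 * c
print("bound P < 5c verified on the grid")
```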
3. Proof of RIP in expectation (Theorem 1.2)

The rank-one operators y_{kw} ⊗ y_{kw} are constructed so that

(6)   Σ_{k=1}^{b} Σ_{w=1}^{m} y_{kw} ⊗ y_{kw} = I.

As explained in Remark 2.6 in the last section, we have

(7)   A*A = Σ_{k=1}^{b} Σ_{j=1}^{b} Σ_{w=1}^{m} θ_{kw} θ_{jw} · y_{kw} ⊗ y_{jw}.

The proof of the theorem proceeds by breaking up I − A*A into different parts, then bounding the expected norm of each part separately. By combining equations (6) and (7), we see that

(8)   I − A*A = Σ_{k=1}^{b} Σ_{w=1}^{m} (1 − |θ_{kw}|²) y_{kw} ⊗ y_{kw} − Σ_{j≠k} Σ_{w=1}^{m} θ_{kw} θ_{jw} · y_{kw} ⊗ y_{jw}.

For the two sums on the right-hand side of (8), we will bound the expected norm of each sum separately. Define two random quantities Q₁ and Q₁' by

(9)   Q₁ = Σ_{k=1}^{b} Σ_{w=1}^{m} (1 − |θ_{kw}|²) y_{kw} ⊗ y_{kw}   and   Q₁' = Σ_{k=1}^{b} Σ_{w=1}^{m} (1 − |θ'_{kw}|²) y_{kw} ⊗ y_{kw},

where {θ'_{kw}} is an independent copy of {θ_{kw}}. This implies that Q₁ has the same probability distribution as Q₁'. To bound the expected norm of Q₁,

E‖Q₁‖_Γ = E‖Q₁ − E(Q₁')‖_Γ = E‖ E[Q₁ − Q₁' | Q₁] ‖_Γ ≤ E[ E( ‖Q₁ − Q₁'‖_Γ | Q₁ ) ] = E( ‖Q₁ − Q₁'‖_Γ ).

In the above equations, the first equality holds because Q₁' has mean zero (recall that E(Θ²) = 1). The second equality holds by the independence of Q₁ and Q₁'. The inequality in the third step is true by Jensen's inequality. Let

(10)   Y = Q₁ − Q₁' = Σ_{k=1}^{b} Σ_{w=1}^{m} ( |θ'_{kw}|² − |θ_{kw}|² ) y_{kw} ⊗ y_{kw}.

We randomize this sum. The random variable Y has the same probability distribution as

(11)   Y' = Σ_{k=1}^{b} Σ_{w=1}^{m} ε_{kw} ( |θ'_{kw}|² − |θ_{kw}|² ) y_{kw} ⊗ y_{kw},

where {ε_{kw}} are independent, identically distributed Bernoulli random variables, and

E‖Y‖_Γ = E‖Y'‖_Γ = E[ E( ‖Y'‖_Γ | {θ_{kw}}, {θ'_{kw}} ) ].

Let x_{kw} = | |θ'_{kw}|² − |θ_{kw}|² |^{1/2} y_{kw} in order to apply the lemma of Rudelson and Vershynin. To see that each x_{kw} is bounded, we note that B ≥ max_{k,w} | |θ'_{kw}|² − |θ_{kw}|² |^{1/2}, and so

‖x_{kw}‖_∞ ≤ max_{k,w} | |θ'_{kw}|² − |θ_{kw}|² |^{1/2} · ‖y_{kw}‖_∞ ≤ B/√m.

With the {θ_{kw}}, {θ'_{kw}} fixed, and with K = B/√m, we apply Lemma 2.1 to obtain

E[ ‖Y'‖_Γ | {θ_{kw}}, {θ'_{kw}} ] ≤ √( C·s·L/m ) · B · ‖ Σ_{k=1}^{b} Σ_{w=1}^{m} | |θ'_{kw}|² − |θ_{kw}|² | · y_{kw} ⊗ y_{kw} ‖_Γ^{1/2},

where L ≡ log²(s)·log(nb)·log(mb). To remove the conditioning, we apply the Cauchy-Schwarz inequality and the law of double expectation:

(12)   E‖Y‖_Γ ≤ √( C·s·L/m ) · B · ( E‖ Σ_{k=1}^{b} Σ_{w=1}^{m} | |θ'_{kw}|² − |θ_{kw}|² | · y_{kw} ⊗ y_{kw} ‖_Γ )^{1/2}.
By using the triangle inequality,

E‖ Σ_{k=1}^{b} Σ_{w=1}^{m} | |θ'_{kw}|² − |θ_{kw}|² | · y_{kw} ⊗ y_{kw} ‖_Γ ≤ 2 E‖ Σ_{k=1}^{b} Σ_{w=1}^{m} |θ_{kw}|² · y_{kw} ⊗ y_{kw} ‖_Γ,

and so the bound in (12) becomes

E‖Y‖_Γ ≤ √( C·s·L/m ) · ( E‖ Σ_{k=1}^{b} Σ_{w=1}^{m} |θ_{kw}|² · y_{kw} ⊗ y_{kw} ‖_Γ )^{1/2},

with the constants B and 2 absorbed into C. Since Σ_{k=1}^{b} Σ_{w=1}^{m} y_{kw} ⊗ y_{kw} = I and since ‖I‖_Γ = 1, we have

E‖Y‖_Γ ≤ √( C·s·L/m ) · ( E‖ Σ_{k=1}^{b} Σ_{w=1}^{m} (1 − |θ_{kw}|²) · y_{kw} ⊗ y_{kw} ‖_Γ + 1 )^{1/2}
        = √( C·s·L/m ) · ( E‖Q₁‖_Γ + 1 )^{1/2}
        ≤ √( C·s·L/m ) · ( E‖Y‖_Γ + 1 )^{1/2}.

Solutions of the inequality E ≤ α(E + 1)^{1/2} satisfy E ≤ 2α whenever α ≤ 1. Hence there are constants C₁ and C₂ such that if m ≥ C₂·s·L, then

(13)   E‖Q₁‖_Γ ≤ E‖Y‖_Γ ≤ √( C₁·s·L/m ).
We have now obtained a bound on the expected norm of the first sum in equation (8). To control the norm of the second sum, we next define

(14)   Q₂ = Σ_{j≠k} Σ_{w=1}^{m} θ_{kw} θ_{jw} · y_{kw} ⊗ y_{jw},

and we will apply a decoupling inequality. Let

(15)   Q₂' = Σ_{j≠k} Σ_{w=1}^{m} θ_{kw} θ'_{jw} · y_{kw} ⊗ y_{jw},

where {θ'_{kw}} is an independent sequence with the same distribution as {θ_{kw}}. Then

E‖Q₂‖_Γ ≤ C · E‖Q₂'‖_Γ.

We will break up Q₂' into two terms and control the norm of each one separately:

(16)   Q₂' = Σ_{j=1}^{b} Σ_{k=1}^{b} Σ_{w=1}^{m} θ_{kw} θ'_{jw} · y_{kw} ⊗ y_{jw} − Σ_{k=1}^{b} Σ_{w=1}^{m} θ_{kw} θ'_{kw} · y_{kw} ⊗ y_{kw}.

Denote the first term on the right by Q₄ and the second term on the right by Q₃. To bound ‖Q₃‖_Γ, note that the random quantity Q₃ has the same distribution as

(17)   Q₃' = Σ_{k=1}^{b} Σ_{w=1}^{m} ε_{kw} · u_{kw} ⊗ v_{kw},

where u_{kw} and v_{kw} are defined by

(18)   u_{kw} = |θ_{kw}| · y_{kw},   v_{kw} = |θ'_{kw}| · y_{kw},

and {ε_{kw}} is an independent Bernoulli sequence. Since max{ |θ_{kw}|, |θ'_{kw}| } ≤ B, we have

‖u_{kw}‖_∞ ≤ B/√m   and   ‖v_{kw}‖_∞ ≤ B/√m.

Apply Lemma 2.3 with {θ_{kw}, θ'_{kw}} fixed:

E[ ‖Q₃'‖_Γ | {θ_{kw}}, {θ'_{kw}} ] ≤ √( C·s·L/m ) · B · [ ‖ Σ_{k,w} |θ_{kw}|² · y_{kw} ⊗ y_{kw} ‖_Γ^{1/2} + ‖ Σ_{k,w} |θ'_{kw}|² · y_{kw} ⊗ y_{kw} ‖_Γ^{1/2} ].

Then, we use the law of double expectation and the Cauchy-Schwarz inequality, as in (12), to remove the conditioning:

E[ ‖Q₃'‖_Γ ] ≤ √( C·s·L/m ) · B · ( E[ ( ‖ Σ_{k,w} |θ_{kw}|² · y_{kw} ⊗ y_{kw} ‖_Γ^{1/2} + ‖ Σ_{k,w} |θ'_{kw}|² · y_{kw} ⊗ y_{kw} ‖_Γ^{1/2} )² ] )^{1/2}.

The two sequences of random variables {θ_{kw}} and {θ'_{kw}} are identically distributed, so using Jensen's inequality, we get

E[ ‖Q₃'‖_Γ ] ≤ √( C·s·L/m ) · B · ( E‖ Σ_{k,w} |θ_{kw}|² · y_{kw} ⊗ y_{kw} ‖_Γ )^{1/2}.

To bound the expected value on the right-hand side, we note that

( E‖ Σ_{k,w} |θ_{kw}|² · y_{kw} ⊗ y_{kw} ‖_Γ )^{1/2} ≤ ( E‖ Σ_{k,w} (1 − |θ_{kw}|²) · y_{kw} ⊗ y_{kw} ‖_Γ + 1 )^{1/2} = ( E‖Q₁‖_Γ + 1 )^{1/2}.

Recall from equation (13) that if m ≥ C·s·L, then E‖Q₁‖_Γ is bounded. The random variables {θ_{kw}} are bounded by the constant B, so we can conclude that there exist constants C₃ and C₄ such that if m ≥ C₃·s·L, then

(19)   E‖Q₃‖_Γ ≤ C₄ · √( s·L/m ).
Recall that the right side of (16) has two terms, Q₃ and Q₄. It remains to bound the expected norm of the other term,

Q₄ = Σ_{j=1}^{b} Σ_{k=1}^{b} Σ_{w=1}^{m} θ_{kw} θ'_{jw} · y_{kw} ⊗ y_{jw}.

To bound E‖Q₄‖_Γ, note that Q₄ has the same probability distribution as

Q₄' = Σ_{j=1}^{b} Σ_{k=1}^{b} Σ_{w=1}^{m} ε_w θ_{kw} θ'_{jw} · y_{kw} ⊗ y_{jw} = Σ_{w=1}^{m} ε_w ( Σ_{k=1}^{b} θ_{kw} y_{kw} ) ⊗ ( Σ_{j=1}^{b} θ'_{jw} y_{jw} ) = Σ_{w=1}^{m} ε_w u_w ⊗ v_w,

where u_w and v_w are defined by

(20)   u_w = Σ_{k=1}^{b} θ_{kw} y_{kw}   and   v_w = Σ_{j=1}^{b} θ'_{jw} y_{jw}.

The y_{kw} have disjoint supports for different values of k, so that

‖u_w‖_∞ ≤ B/√m,   ‖v_w‖_∞ ≤ B/√m.

Also note that

Σ_{w=1}^{m} u_w ⊗ v_w = Σ_{w=1}^{m} Σ_{k=1}^{b} Σ_{j=1}^{b} θ_{kw} θ'_{jw} y_{kw} ⊗ y_{jw},

and so, by comparing equation (7) with the above expression, we see that

Σ_{w=1}^{m} u_w ⊗ u_w   and   Σ_{w=1}^{m} v_w ⊗ v_w

are independent copies of A*A. By Lemma 2.3 and the Cauchy-Schwarz inequality,

E‖Q₄‖_Γ = E‖Q₄'‖_Γ ≤ √( C·s·L'/m ) · B · ( E‖A*A‖_Γ )^{1/2} ≤ C √( C·s·L'/m ) · ( E‖I − A*A‖_Γ + 1 )^{1/2},

where we have written L' = log²(s)·log(m)·log(nb). Since L' ≤ L(s, n, m, b) ≡ log²(s)·log(mb)·log(nb), we can conclude that

E‖Q₄‖_Γ ≤ C · √( s·L/m ) · ( E‖I − A*A‖_Γ + 1 )^{1/2}.
To recapitulate, we have shown that

E‖I − A*A‖_Γ ≤ E‖Q₁‖_Γ + E‖Q₂‖_Γ ≤ E‖Q₁‖_Γ + C·E‖Q₂'‖_Γ ≤ E‖Q₁‖_Γ + C( E‖Q₃‖_Γ + E‖Q₄‖_Γ ).

When m ≥ max(C₂, C₃)·s·L, we have established that

E‖Q₄‖_Γ ≤ C · √( s·L/m ) · ( E‖I − A*A‖_Γ + 1 )^{1/2},
E‖Q₁‖_Γ ≤ C · √( s·L/m ),
E‖Q₃‖_Γ ≤ C · √( s·L/m ).

Therefore, in conclusion, we have proven that

E‖I − A*A‖_Γ ≤ C · √( s·L/m ) · ( 2 + √( E‖I − A*A‖_Γ + 1 ) ).

By Lemma 2.7, with P = E‖I − A*A‖_Γ and c = √( C·s·L/m ), there exists a constant C₅ such that if m ≥ C₅·s·L, then we have

E‖I − A*A‖_Γ ≤ √( C₆·s·L/m ).
Recall that by definition, L = log²(s)·log(mb)·log(nb). This proves that the assertion (1) is true. It remains to prove that equation (1) implies that for any 0 < δ_s ≤ 1, we have E‖I − A*A‖_Γ ≤ δ_s provided that m ≥ C·δ_s^{−2}·s·log⁴(nb) and m ≤ nb.

If m ≤ nb, then log(mb) = log(m) + log(b) ≤ 2 log(nb). Hence, for s ≤ nb,

(21)   √( C₁·s·log²(s)·log(mb)·log(nb) / m ) ≤ √( C·s·log⁴(nb) / m ).

The right-hand side is bounded by δ_s when m ≥ C·δ_s^{−2}·s·log⁴(nb). This concludes the proof of Theorem 1.2.

4. Proof of the tail bound (Theorem 1.3)
Recall that I − A*A = Q₁ − Q₂, where the expressions for Q₁ and Q₂ are given in equations (9) and (14). To obtain a bound for P( ‖I − A*A‖_Γ > δ_s ), we will first find a bound for P( ‖Q₁‖_Γ > δ_s/2 ). Recall that Y = Q₁ − Q₁' from (10), and that Y' has the same probability distribution as Y, from (11). For any β > 0 and λ > 0,

(22)   P( ‖Q₁'‖_Γ < β ) · P( ‖Q₁‖_Γ > β + λ ) ≤ P( ‖Y‖_Γ > λ ).

This inequality holds because if ‖Q₁‖_Γ > β + λ and if ‖Q₁'‖_Γ < β, then

β + λ < ‖Q₁‖_Γ ≤ ‖Q₁'‖_Γ + ‖Y‖_Γ < β + ‖Y‖_Γ,

and so ‖Y‖_Γ must be greater than λ. Note that the median of a positive random variable is never bigger than two times the mean; therefore,

P( ‖Q₁'‖_Γ ≤ 2E‖Q₁'‖_Γ ) ≥ 1/2.

We can choose β = 2E‖Q₁'‖_Γ, so that (22) becomes

(1/2) · P( ‖Q₁‖_Γ > 2E‖Q₁'‖_Γ + λ ) ≤ P( ‖Y‖_Γ > λ ).

Since E‖Q₁‖_Γ = E‖Q₁'‖_Γ, we obtain

P( ‖Q₁‖_Γ > 2E‖Q₁‖_Γ + λ ) ≤ 2 P( ‖Y‖_Γ > λ ).

Let V_{kw} = | |θ'_{kw}|² − |θ_{kw}|² | · y_{kw} ⊗ y_{kw}. Then ‖V_{kw}‖_Γ ≤ K·s/m, where we define K = B². By Proposition 2.4, the concentration inequality gives us

P( ‖Y'‖_Γ > C[ u·E‖Y'‖_Γ + (Ks/m)·t ] ) ≤ e^{−u²} + e^{−t}.

From (13), we have the bound

E‖Y'‖_Γ ≤ √( C·s·K·L/m )

when m ≥ C·s·L. Combining the last two inequalities, we see that

P( ‖Y'‖_Γ > C[ u·√( s·K·L/m ) + (Ks/m)·t ] ) ≤ e^{−u²} + e^{−t}.

Fix a constant α, where 0 < α < 1/10. If we pick t = log(1/α) and u = √(log(1/α)), then the above inequality becomes

P( ‖Y'‖_Γ > C[ √( s·K·L·log(1/α)/m ) + (s·K/m)·log(1/α) ] ) ≤ 2α.

That means for some constant λ₀, we have P( ‖Y'‖_Γ > λ₀ ) ≤ 2α. With this constant λ₀, we use the bound for E‖Q₁‖_Γ in (13) to conclude that

2E‖Q₁‖_Γ + λ₀ ≤ C( √( s·L/m ) + √( s·K·L·log(1/α)/m ) + (s·K/m)·log(1/α) ).

Inside the bracket on the right side, when we choose m so that all three terms are less than 1, the middle term will dominate the other two terms. That means there exists a constant C such that if

m ≥ C·δ_s^{−2}·s·K·L·log(1/α),

then 2E‖Q₁‖_Γ + λ₀ ≤ δ_s/2. Combining the above inequalities, we obtain

P( ‖Q₁‖_Γ > δ_s/2 ) ≤ 2α.
Next, we want to obtain a bound for ‖Q₂‖_Γ. We saw from (16) that Q₂' = Q₄ − Q₃, and so to obtain a bound for ‖Q₂'‖_Γ, we will proceed to find bounds for ‖Q₃‖_Γ and ‖Q₄‖_Γ.

Recall that Q₃ has the same probability distribution as Q₃', and we showed that

E‖Q₃'‖_Γ ≤ √( C·s·L/m ).

With the same u_{kw} and v_{kw} defined in (18), we have ‖u_{kw} ⊗ v_{kw}‖_Γ ≤ K·s/m. Applying the concentration inequality (Proposition 2.4) with t = log(C/α) and u = √(log(C/α)), we obtain

P( ‖Q₃'‖_Γ > C[ √( s·K·L·log(C/α)/m ) + (s·K/m)·log(C/α) ] ) ≤ 2α/C.

Inside the probability bracket, when we choose m so that both terms on the right side are less than 1, the first term will dominate the second term. That means there exists a constant C such that if

m ≥ C·δ_s^{−2}·s·K·L·log(1/α),

then P( ‖Q₃'‖_Γ > C·δ_s ) ≤ 2α/C. Since Q₃ and Q₃' have the same probability distribution, we finally arrive at

P( ‖Q₃‖_Γ > C·δ_s ) ≤ 2α/C.

To obtain a bound for ‖Q₄‖_Γ, we follow a similar strategy. Recall that Q₄ has the same probability distribution as Q₄', and we showed that

E‖Q₄‖_Γ ≤ ( E‖I − A*A‖_Γ + 1 )^{1/2} · √( C·s·L/m ).

With the same u_w and v_w defined in (20), we have ‖u_w ⊗ v_w‖_Γ ≤ K·s/m. By the concentration inequality (Proposition 2.4) with t = log(C/α) and u = √(log(C/α)), we obtain

P( ‖Q₄‖_Γ > C[ √( s·K·L·log(C/α)/m ) + (s·K/m)·log(C/α) ] ) ≤ 2α/C.

That means there exists a constant C such that if m ≥ C·δ_s^{−2}·s·K·L·log(1/α), then

P( ‖Q₄‖_Γ > C·δ_s ) ≤ 2α/C.

To summarize, we have shown that for any 0 < α < 1/10, there exists a constant C such that when m ≥ C·δ_s^{−2}·s·K·L·log(1/α), all of the following inequalities hold:

(23)   P( ‖Q₁‖_Γ > δ_s/2 ) ≤ 2α,
(24)   P( ‖Q₃‖_Γ > C·δ_s ) ≤ 2α/C,
(25)   P( ‖Q₄‖_Γ > C·δ_s ) ≤ 2α/C.

Finally, to find a tail bound on ‖I − A*A‖_Γ, we note that

P( ‖I − A*A‖_Γ > δ_s ) ≤ P( ‖Q₁‖_Γ > δ_s/2 ) + P( ‖Q₂‖_Γ > δ_s/2 )
 ≤ P( ‖Q₁‖_Γ > δ_s/2 ) + C·P( ‖Q₂'‖_Γ > δ_s/(2C) )
 ≤ P( ‖Q₁‖_Γ > δ_s/2 ) + C·P( ‖Q₃‖_Γ > δ_s/(4C) ) + C·P( ‖Q₄‖_Γ > δ_s/(4C) ).

Combining this observation with inequalities (23), (24), (25), and adjusting the constants, we can conclude that for any 0 < α < 1/10, we have

P( ‖I − A*A‖_Γ > δ_s ) ≤ 8α

when m ≥ C·δ_s^{−2}·s·K·L·log(1/α). Recall that by definition, L = log²(s)·log(mb)·log(nb). Note that m ≤ nb, and hence equation (21) remains valid. To complete the proof, we replace 8α with ε. This completes the proof of the theorem.

References

[1] Richard Baraniuk, Volkan Cevher, Marco F. Duarte, and Chinmay Hegde. Model-based compressive sensing.
IEEE Trans. Inform. Theory, 56(4):1982–2001, 2010.
[2] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin. A simple proof of the restricted isometry property for random matrices. Constr. Approx., 28(3):253–263, 2008.
[3] Ronen Basri and David W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Trans. Pattern Anal. Mach. Intell., 25(3):218–233, 2003.
[4] Peter Belhumeur, J. Hespanha, and David Kriegman. Eigenfaces versus Fisherfaces: recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell., 19(7):711–720, 1997.
[5] Thomas Blumensath and Mike Davies. Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal., 27(3):265–274, 2009.
[6] Alfred Bruckstein, David L. Donoho, and Michael Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev., 51(1):34–81, 2009.
[7] Emmanuel Candes. The restricted isometry property and its implications for compressed sensing. C. R. Math. Acad. Sci. Paris, 346(9-10):589–592, 2008.
[8] Emmanuel Candes, Justin Romberg, and Terence Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.
[9] Emmanuel Candes, Justin Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, 2006.
[10] Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Compressed sensing and best k-term approximation. J. Amer. Math. Soc., 22(1):211–231, 2009.
[11] David L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.
[12] David L. Donoho and Jared Tanner. Counting faces of randomly projected polytopes when the projection radically lowers dimension. J. Amer. Math. Soc., 22(1):1–53, 2009.
[13] Simon Foucart. Sparse recovery algorithms: sufficient conditions in terms of restricted isometry constants. In Approximation Theory XIII: San Antonio 2010, volume 13 of Springer Proceedings in Mathematics, pages 65–77. Springer, New York, 2012.
[14] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Birkhäuser, New York, 2013.
[15] J. Haupt, W. U. Bajwa, G. Raz, S. J. Wright, and R. Nowak. Toeplitz-structured compressed sensing matrices. In Proc. 14th IEEE/SP Workshop Stat. Signal Process. (SSP'07), Madison, WI, Aug 2007, pages 294–298.
[16] Felix Krahmer and Rachael Ward. New and improved Johnson-Lindenstrauss embeddings via the restricted isometry property. SIAM J. Math. Anal., 43(3):1269–1281, 2011.
[17] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. A Series of Modern Surveys in Mathematics. Springer-Verlag, Berlin, 1991.
[18] Kezhi Li and Shuang Cong. State of the art and prospects of structured sensing matrices in compressed sensing. Front. Comput. Sci., 9(5):665–677, 2015.
[19] Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constr. Approx., 28(3):277–289, 2008.
[20] Holger Rauhut. Random sampling of sparse trigonometric polynomials. Appl. Comput. Harmon. Anal., 22(1):16–42, 2007.
[21] Holger Rauhut. Compressive sensing and structured random matrices. In Theoretical Foundations and Numerical Methods for Sparse Recovery, Radon Ser. Comput. Appl. Math., 9. Walter de Gruyter, Berlin, 2010.
[22] Holger Rauhut and Rachael Ward. Sparse Legendre expansions via ℓ₁-minimization. J. Approx. Theory, 164(5):517–533, 2012.
[23] Mark Rudelson and Roman Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Comm. Pure Appl. Math., 61(8):1025–1045, 2008.
[24] Michel Talagrand. Majorizing measures: the generic chaining. Ann. Probab., 24(3):1049–1103, 1996.
[25] Joel A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Trans. Inform. Theory, 50(10):2231–2242, 2004.
[26] Joel A. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Trans. Inform. Theory, 52(3):1030–1051, 2006.
[27] Joel A. Tropp, Jason N. Laska, Marco F. Duarte, Justin Romberg, and Richard Baraniuk. Beyond Nyquist: efficient sampling of sparse bandlimited signals. IEEE Trans. Inform. Theory, 56(1):520–544, 2010.
[28] John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., 31(2):210–227, 2009.
Enrico Au-Yeung, Department of Mathematical Sciences, DePaul University, Chicago
E-mail address: