Correcting Errors in Linear Measurements and Compressed Sensing of Multiple Sources
Alexander Petukhov and Inna Kozlov

July 6, 2018
Abstract
We present an algorithm for finding sparse solutions of the system of linear equations F x = y with rectangular matrices F of size n × N, where n < N, when the measurement vector y is corrupted by a sparse vector of errors e. We call our algorithm the ℓ1-greedy-generous algorithm (LGGA) since it combines both greedy and generous strategies in decoding.

The main advantage of LGGA over traditional error correcting methods is its ability to work efficiently directly on linear data measurements. It uses the natural residual redundancy of the measurements and does not require any additional redundant channel encoding.

We show how to use this algorithm for encoding and decoding multichannel sources. The algorithm has a significant advantage over existing straightforward decoders when the encoded sources have different density/sparsity of the information content. This property can be used for very efficient blockwise encoding of data sets with a non-uniform distribution of information; images are the most typical example of such sources.

An important feature of LGGA is its separation from the encoder. The decoder does not need any additional side information from the encoder except for the linear measurements and the knowledge that those measurements were created as a linear combination of different sources.

The problem of finding sparse solutions of a system of linear equations

F x = y,  x ∈ R^N,  y ∈ R^n,  N > n,   (1)

is interesting in many Information Theory related contexts. It can be treated as the reconstruction of a sparse data vector x compressed with the linear operator F. While, from the point of view of classical linear algebra, system (1) may not have a unique solution, the regularization of the problem in the form of solution sparsity allows one to guarantee uniqueness (or uniqueness almost surely) in many practically important cases.

Within this paper, we say that a vector a ∈ R^m is sparse if k := #{i | a_i ≠ 0, 1 ≤ i ≤ m} < m, i.e., it has some zero entries. The number k is also called the Hamming weight of the vector a. For a randomly selected matrix F and a vector y, if a sparse solution of system (1) exists, it is unique almost surely. In particular, a sparse vector x, i.e., a vector having at most n − 1 non-zero entries, can theoretically be restored from its measurements y almost surely.

Unfortunately, a straightforward search for sparse solutions to (1) is an NP-hard problem ([13]). The NP-hardness of the problem does not contradict the existence of algorithms for finding sparse solutions in some quite typical cases.

The recent Compressed Sensing (Compressive Sampling) studies gave a big push to the development of the theory and of affordable algorithms for finding sparse solutions. It turned out that for a reasonable (not very large) value of the sparsity, the vector x can be recovered precisely using the Linear Programming Algorithm (LPA), which finds the solution to (1) with the minimum ℓ1-norm ([1], [7], [15]). While, in practice, the number k characterizing the sparsity of the vectors x which can be recovered with ℓ1-minimization is far from the magic number n − 1,
this case is a well-studied and reliable tool for solving systems (1) in many applied problems.

The Orthogonal Greedy Algorithm (OGA) is a strong competitor of LPA. If it is implemented appropriately ([8], [14]), it outperforms LPA both in computational complexity and in the ability to recover sparse representations with a greater sparsity number k. In some papers (e.g., [8]), this modification of OGA is called Stagewise Orthogonal Matching Pursuit (StOMP = StOGA).

A very recent success in the efficient recovery of sparse representations is due to paper [12]. The authors suggested using a special (band diagonal) type of sensing matrices in combination with a belief-propagation-based algorithm. While the algorithm of [12] gives a very powerful tool for linear methods of data compression, it is oriented toward a special form of measuring matrix which seems to be theoretically less efficient from the point of view of error correcting capability.

We do not discuss here the advantages and disadvantages of different decoding algorithms in detail. For the goals of this paper, we are interested in algorithms supporting the opportunity to assign non-equal significance to the solution entries. Such algorithms use the idea of reweighted/greedy optimization (see [4], [5], [6], [9], [10], [11]). In what follows, we use our ℓ1-greedy algorithm (LGA, see Algorithm A below) as a starting point for the algorithms considered in this paper. LGA has 2 advantages over its competitors. First of all, it was shown numerically in [10] that LGA has the highest capacity of recovery of sparse/compressible data encoded with Gaussian matrices. Second, which is more important, it is easily adaptable to the needs of this paper.

The absence of a theoretical justification and the relatively high computational complexity are the main disadvantages of LGA. However, it should be emphasized that the main competitors of LGA also do not have an appropriate theoretical justification. As for the computational complexity, a fast version of LGA was developed in [11]. Its computational complexity is about the complexity of the regular ℓ1-minimization, whereas its reconstruction capacity is very close to that of the regular LGA and significantly higher than that of other, even more computationally expensive, algorithms.

Above and in what follows, saying that an algorithm has "the higher recovery capability", we mean that for the same n and N the algorithm is able to recover vectors x with a larger number k of non-zero entries. Of course, the relative capability of algorithms depends on the statistical model of the non-zero entries. We use only the Gaussian model, since a Gaussian random value has the highest information content among random values with the same variance. The reweighted algorithms are known to be inefficient for vectors x with the Bernoulli distribution (equally probable ±1 entries).

The optimal data encoding for transmission through a lossy channel can be split into 2 independent stages, which are source and channel encoding.

The first stage is source encoding, or compression. Usually, compression is a non-linear operation. In the ideal case of optimal compression, its output is a bitstream consisting of absolutely random bits with equally probable "0"s and "1"s. The compression reduces the natural redundancy of the encoded data.

The second stage is channel encoding. It plays the opposite role, introducing artificial redundancy. Usually (but not necessarily), this is a linear operator over a Galois (say, binary) field.
The redundancy introduced here is different from the redundancy removed at the compression stage. Its model is completely known to the receiver of the information (the decoder) and, in the case of moderate corruption, this model can be used for the perfect recovery of the transmitted data.

While the natural redundancy of the source can also be used for error correction, it is not as reliable for this purpose as channel encoding. Nevertheless, in some deeply studied problems a kind of "error correction" based on the natural redundancy is possible. Among those applied problems we just mention various data denoising methods and digital film restoration. The natural redundancy arises when a data model does not allow the digital data representation to take arbitrary values. For denoising, the belonging of the data to some smoothness class serves as protection against corruption, whereas the impossibility of big changes between frames following at a rate of 20–30 frames per second plays the same role for the recovery of moving images. In both cases, the redundancy model is known only approximately.

One more problem indirectly connected to error correction is image upsampling. The inpainting of missing information can be treated as correcting "errors" lost in a channel with erasures.

In the CS community, applications to error correcting codes were noticed practically simultaneously with the mainstream compression issue (cf. [2], [3], [15]). Repeating the commonly accepted channel encoding strategy, this approach consists in introducing redundancy by multiplying the data vector y (say, the "compressed" output in (1)) by a matrix B of dimension m × n, where m > n. We assume that rank B = n. Then the output vector z := B y = B F x is protected, at least theoretically, from the corruption of up to m − n − 1 entries. Take a matrix C of dimension m × (m − n) whose columns constitute a basis of the orthogonal complement of the space spanned by the columns of B. Assuming that e is a sparse vector of errors, compute the vector s := C^T (z + e), known in Coding Theory as a syndrome. The syndrome can be computed by the receiver from the corrupted information. At the same time, since C^T z = C^T B y = 0, the corruption vector e can be found as a sparse solution of the system C^T e = s.
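For illustration, here is a minimal Python sketch of this classical scheme (the dimensions and the random model below are chosen only as an example): B introduces the redundancy, the columns of C span the orthogonal complement of the range of B, and the syndrome therefore depends on the sparse error vector only.

import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
n, m, r = 128, 192, 10                    # data length, code length, number of errors

B = rng.standard_normal((m, n))           # redundant encoder, rank B = n almost surely
C = null_space(B.T)                       # m x (m - n), columns span range(B)^perp

y = rng.standard_normal(n)                # the "compressed" data vector from (1)
e = np.zeros(m)                           # sparse corruption of the code word
e[rng.choice(m, r, replace=False)] = rng.standard_normal(r)

z = B @ y                                 # protected code word
s = C.T @ (z + e)                         # syndrome: C^T z = 0, so s = C^T e
print(np.max(np.abs(C.T @ z)))            # numerically ~0: the syndrome "sees" only e

Recovering e from C^T e = s is itself a sparse-recovery problem of the type discussed in this paper; once e is found, y is recovered from z = (z + e) − e, e.g., by least squares.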
The scheme considered in the previous paragraph lies in the intersection of the mainstreams of CS and Coding Theory, based on the separation of source and channel encoding.

In this paper, we discuss a different data protection scheme which uses the residual space not occupied by the data for error correction purposes. In the literature on Coding Theory, a data transform providing simultaneous compression and protection from transmission errors is called Joint Source-Channel Coding. As we will see below, plain linear measurements have this property, provided that the encoded data are sparse enough in some representation system and the vector of corruption is also sparse. Therefore, in this case, the compressed data are restorable without any additional channel encoding. To our knowledge, this idea was first formulated in [16].

In this paper, we do not consider any methods for fighting corruption by noise in the entries, when the level of the corruption of the output vector y is relatively low but the corruption takes place at each entry. However, we will discuss the stability of our algorithm with respect to such corruption.

We emphasize that the encoding model accepted in this paper is analog encoding, i.e., we do not mean to apply an analog-to-digital transform to the measurements. The entries are stored (transmitted) in the form of their magnitudes (not in a bitwise representation). For this reason, we consider the sparse model of corruption, when errors are introduced directly into the entries of the vector y (not into bits).

The structure of the paper is as follows. In Section 2, we discuss how corrupted data can be recovered with the regular LGA algorithm. In Section 3, we show that the "generous" (LGGA) modification of LGA provides a much higher error correcting capability, provided that the error rate is known at least approximately. In Section 4, using the well-known theoretical equivalence of a lossy channel and multisource transmission, we discuss the advantages of joint encoding of a few sources of data. In Section 5, we design the adaptive ℓ1-greedy-generous algorithm (ALGGA), which does not require preliminary knowledge of either the error rate or the relative density of information in multiple sources.

While all results of the paper are based exclusively on numerical experiments, the stability of those results allows us to be optimistic about possible applications of algorithms designed on the greedy-generous principles introduced below.

In all numerical experiments presented in this paper we used random matrices F with n = 128 rows. Although we also ran experiments with other sizes of F, we decided not to include those results, because they just confirm the effects and the efficiency of the algorithms considered in the paper and do not bring new effects deserving extra journal space.

Let x ∈ R^N be a sparse vector with k non-zero entries. The vector x is encoded with the linear transform F x =: y ∈ R^n. The vector y is corrupted by a vector e ∈ R^n with r < n non-zero entries. The decoder receives the corrupted measurements

ỹ = F x + e.

The matrix F is also available to the decoder. It is required to find the data vector x. Of course, when x is found, the error vector can also be computed. It is easy to see that the problem can be rewritten in the form

ỹ = F̃ x̃,  where  F̃ := (F  I_n) ∈ R^{n × (N + n)},  x̃ := (x; e),   (2)

I_n is the identity n × n matrix, and (x; e) denotes the vector obtained by stacking x on top of e. Assuming that the columns of the matrix F are almost orthogonal to the columns of I_n and taking into account that both x and e are sparse vectors and k + r < n, we arrive at a standard problem of finding a sparse solution of an underdetermined system of linear equations with the matrix F̃. We emphasize that, although the columns of I_n and F span the same space R^n and do not introduce a separation between data and error spaces, the spans of the columns corresponding to the sparse vectors x and e are approximately orthogonal, so the errors can be separated from the data.
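A small numerical illustration of reformulation (2) and of the near-orthogonality just mentioned (a sketch under the Gaussian model used throughout the paper; the dimensions are examples only):

import numpy as np

rng = np.random.default_rng(1)
n, N, k, r = 128, 384, 20, 5

F = rng.standard_normal((n, N))
F_ext = np.hstack([F, np.eye(n)])          # extended matrix (F  I_n), size n x (N + n)

x = np.zeros(N); x[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
e = np.zeros(n); e[rng.choice(n, r, replace=False)] = rng.standard_normal(r)

y_corrupted = F @ x + e
x_ext = np.concatenate([x, e])             # extended vector (x; e)
print(np.allclose(F_ext @ x_ext, y_corrupted))          # True: (2) holds exactly

# largest normalized inner product between a column of F and a column of I_n
print(np.max(np.abs(F / np.linalg.norm(F, axis=0))))

For Gaussian F the last number behaves like sqrt(2 log(nN)/n) and stays noticeably smaller than 1 for the matrix sizes used here, which is the practical meaning of the near-orthogonality assumption.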
Thus, the error correction problem is reduced to a problem which can be solved with any standard CS decoder. The algorithm we are going to use for decoding is based on the iterative minimization of the functional

‖x‖_{w_j,1} := Σ_i w_{j,i} |x_i|,   (3)

which is the weighted ℓ1-norm. The initial idea to use reweighted ℓ1-minimization for sparse solutions of underdetermined systems is due to [4]. We will use the following algorithm, which has a higher recovery capability.

Algorithm A (the ℓ1-Greedy Algorithm (LGA) [10])
1. Initialization.
(a) Take w_{0,i} ≡ 1, i = 1, ..., N.
(b) Find a solution x^0 to (1) providing the minimum for (3).
(c) Set M_0 := max_i |x_{0,i}| and j := 1.
2. General Step.
(a) Set M_j := a M_{j−1} and the weights
    w_{j,i} := { ε, if |x_{j−1,i}| > M_j;  1, if |x_{j−1,i}| ≤ M_j. }
(b) Find a solution x^j to (1) providing the minimum for (3).
(c) Set j := j + 1 and go to step 2a.

The only difference of Algorithm A from the reweighted ℓ1-minimization of [4] consists in the dynamic updating of the weighting function. Nevertheless, on Gaussian input, LGA outperforms both the regular and the reweighted ℓ1-minimization significantly.
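A minimal sketch of Algorithm A in Python is given below; each weighted ℓ1-minimization is written as a linear program. This is only an illustration, not the fast implementation of [11], and the value of the decay factor a used here is a placeholder (only ε is fixed explicitly in the experimental settings below).

import numpy as np
from scipy.optimize import linprog

def weighted_l1_min(F, y, w):
    """min sum_i w_i |x_i|  s.t.  F x = y, posed as an LP in the variables (x, t)."""
    n, N = F.shape
    c = np.concatenate([np.zeros(N), w])            # objective: sum_i w_i t_i
    A_eq = np.hstack([F, np.zeros((n, N))])         # equality constraints F x = y
    I = np.eye(N)
    A_ub = np.block([[I, -I], [-I, -I]])            # x - t <= 0 and -x - t <= 0, i.e. |x_i| <= t_i
    b_ub = np.zeros(2 * N)
    bounds = [(None, None)] * N + [(0, None)] * N   # x free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=bounds, method="highs")
    return res.x[:N]

def lga(F, y, iters=30, a=0.9, eps=1e-3):
    """Algorithm A: l1-greedy decoding by dynamically reweighted l1-minimization."""
    N = F.shape[1]
    x = weighted_l1_min(F, y, np.ones(N))           # step 1: plain l1-minimization
    M = a * np.max(np.abs(x))                       # M_1 = a * M_0
    for _ in range(iters):
        w = np.where(np.abs(x) > M, eps, 1.0)       # the "club" of large coefficients gets weight eps
        x = weighted_l1_min(F, y, w)
        M *= a
    return x

For the error-correcting setting of this section, one simply passes the extended matrix F̃ and the corrupted measurement vector ỹ to lga.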
The results of a straightforward application of Algorithm A to the extended inputs ỹ and F̃ from (2) are shown below. Before presenting numerical results, we describe the settings applicable to all numerical experiments in this paper. We generate vectors x and e with, correspondingly, k and r non-zero entries drawn from the normal distribution with parameters (0, 1). The entries of the matrix F are also taken from the standard normal distribution. We run 200 independent trials with Algorithm A or its modifications. In each trial, the result is obtained after 30 iterations of the algorithm with a fixed decay factor a ∈ (0, 1) and ε = 0.001, or earlier if the club of large coefficients, whose magnitudes exceed M_j, reaches cardinality n.

On Figure 1, we present the graph of the relative success rate of the recovery of k-sparse data by Algorithm A when k + r = 57. The number 57 corresponds to the LGA reconstruction success rate of about 0.82 for sensing 57-sparse vectors with random 128 × 384 matrices. The horizontal axis reflects the values of k in the range 1 ÷ 57, whereas the number of errors for each point of the graph is computed as r = 57 − k.

Since the number 57 = k + r of "information" entries is kept constant, the almost constant behavior of the graph on Fig. 1 is, in general, predictable, except for the better performance of the recovery algorithm when we have fewer non-zero data entries in x and a greater number of errors. Such behavior can be explained by the fact that the errors are associated with the identity part of the matrix F̃, which has orthogonal columns, whereas the columns of the matrix F are only approximately orthogonal. Therefore, the errors can be found and corrected more easily than the data entries. Of course, if the entire extended matrix F̃ were constructed with the Gaussian distribution, the graph would be parallel to the k-axis.
Figure 1: Error correction with Algorithm A for F ∈ R^{128×256} when the total number of non-zero entries and errors is fixed (= 57).

Thus, Algorithm A works for error correction even better than was expected in advance. However, there is one very discouraging drawback of such error correction. If we run it on a 57-sparse vector whose CS measurements are error-free, then Algorithm A with the extended matrix F̃ ∈ R^{128×384} will recover it with probability about 0.82, whereas the same LGA algorithm applied to the matrix F ∈ R^{128×256} is able to recover vectors of the same sparsity with probability very close to 1. The probability 0.82 on F is reached by Algorithm A only at the sparsity k = 68. Thus, we lose about 20% of the efficiency just because of the suspicion that the data could be corrupted. The curves of the reconstruction success rates for N = 256 and N = 384 with n = 128, which demonstrate those losses, are presented on Fig. 2.
Figure 2: Algorithm A (error-free input) for F ∈ R^{128×256} (right curve) and F ∈ R^{128×384} (left curve).

From the point of view of multidimensional geometry, everything looks reasonable. We use k-dimensional subspaces of R^n which are all possible spans of k-subsets of a set of N vectors from R^n. If we increase that set by n vectors, then the number of k-planes drastically increases from C(N, k) to C(N + n, k). Therefore, the k-planes become harder to distinguish.

From the point of view of classical coding theory, those losses are also explainable. Error resilience requires some redundancy in the representation. The higher the projected error rate, the bigger the redundancy that has to be introduced in advance. If we are lucky and the channel is error-free, the redundancy is already spent (actually, in this case, wasted) at the encoding stage. Any improvement of the recovery due to the absence of errors is impossible.

There is a fundamental difference in our situation. We do not introduce any artificial redundancy. Therefore, for a lower rate of errors we may hope for an improved performance of the decoder on the encoded data recovery. The next section explains how this intuitive advantage can be transformed into actual benefits for the decoder.

Let us come back to Fig. 2 to specify what kind of error correcting algorithm we want to design. If we apply Algorithm A to the potentially corrupted vector ỹ obtained as a measurement with the extended matrix F̃, then the success rate will be described by the curve corresponding to the matrix of size 128 × (256 + 128) (i.e., N = 384), even when the data were sensed with a matrix F of size 128 × 256 (N = 256) and the number of errors is as small as r = 1; for r = 1, this curve is just a translation of the curve for N = 384 by 1 to the left, which is close to the error-free case for N = 384. While applying Algorithm A to the matrix F alone, we do not get sparse solutions at all, because the columns of the identity matrix do not have sparse representations in the dictionary of the columns of the matrix F. In this case, even if we have side information that only one entry of the data is corrupted, Algorithm A applied to ỹ and F is powerless, whereas applying it to F̃ we lose efficiency significantly. We wonder whether we can modify Algorithm A in such a way that the success rate curve becomes the translate of the curve for N = 256 by r.
When r = 1, one could, for instance, run Algorithm A on the 128 matrices F_i ∈ R^{128×257}, i = 1, 2, ..., 128, obtained by extending F with the i-th column of the identity matrix, and select the sparsest of the 128 solutions. Such an algorithm is computable but very computationally expensive and, because of this complexity, cannot be extended to r even slightly greater than 1.
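A sketch of this brute-force single-error decoder (it reuses the weighted_l1_min helper from the sketch of Algorithm A above):

import numpy as np

def decode_single_error(F, y_corrupted, tol=1e-6):
    """Try every error position i, solve min ||z||_1 s.t. (F  e_i) z = y, keep the sparsest z."""
    n, N = F.shape
    best_z, best_i, best_nnz = None, None, np.inf
    for i in range(n):                                   # candidate position of the single error
        F_i = np.hstack([F, np.eye(n)[:, [i]]])          # F_i in R^{n x (N + 1)}
        z = weighted_l1_min(F_i, y_corrupted, np.ones(N + 1))
        nnz = np.count_nonzero(np.abs(z) > tol)
        if nnz < best_nnz:
            best_z, best_i, best_nnz = z, i, nnz
    return best_z[:N], best_i, best_z[N]                 # data estimate, error position, error value

Already for r = 2 the number of candidate supports grows to n(n − 1)/2 linear programs per decoding, which is why this route does not scale and a different mechanism is needed.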
Thus, the main question of this section is how the information about the density of errors can be incorporated in Algorithm A so as to achieve the maximum possible success rate with minor or no change of the computational complexity.

Let us discuss the potential limits of the desired algorithm. First of all, when

k/N ≈ r/n,   (4)

i.e., when the density of actual errors in e is equal to the density of non-zero entries of x, and the non-zero entries of x and e have the same distributions of their magnitudes, we cannot hope to increase the efficiency while staying within the framework of LGA. Indeed, under those assumptions, we can think about the vector x̃ as a sparse vector with a randomly and independently selected index set of k + r non-zero components among N + n entries, encoded as ỹ with the matrix F̃. Definitely, the point k + r will rather correspond to the point of the curve for N + n (= 384 above) than for N (= 256 above). Thus, we cannot expect any improvement in the recovery of the vector x̃ ∈ R^{N+n} with respect to the success rate curve for N + n. Of course, as we noticed above, the result can be slightly better just because of the structure of F̃ involving the identity matrix.

At the same time, we have a right to expect an improvement when the equality of the proportions (4) is violated. For the case k/N ≫ r/n we may expect the success rate for the recovery of k + r non-zero entries to be close to the success curve corresponding to the n × N matrix, whereas for k/N ≪ r/n we want to see the recovery rate close to the recovery rate for the identity matrix I_n.

At first glance, the last request may look too challenging, because the identity matrix encoder y := I_n x allows the trivial "recovery" x := y, so the decoder must be extremely efficient. However, it is quite realistic. Indeed, when we encode x with only one non-zero entry (i.e., k = 1), we can recover it with high probability even if e has up to n − 2 non-zero entries, with a search of ∼ nN complexity. This complexity, as well as the required precision, is high but still realistic. Therefore, high efficiency of the projected algorithm for r close to n should not be considered a miracle.

Let us formulate our goals in developing a new recovery algorithm with error correcting capability. The algorithm has to provide a property which has the meaning of a 3-point interpolation. It has to work at least as well as Algorithm A on compressing matrices of size n × (N + n) when (4) takes place. Its efficiency has to approach the cases of compressing matrices of size n × N and n × n when r → 0 and k → 0, correspondingly.

Recall the logic of LGA: on each iteration, the entries believed to be large are given more freedom to participate in the representation of ỹ. This freedom is provided by the significantly smaller weight in the (weighted) ℓ1-norm given to the entries assumed to be large in the "optimal" (sparse) representation of ỹ. Then the algorithm does not worry too much about their contribution to the norm, so they can be used efficiently for the partial sum minimizing the residual. This is a typical greedy strategy, when large coefficients are selected into a separate set and, on the next iteration, the biggest coefficients in the representation of the residual extend "the club of large coefficients".

The greedy approach inspired us to the opposite, "generous", strategy. If we know that with large probability the channel is (almost) error-free, let us "generously" give a larger weight to the entire block of entries corresponding to the errors or, vice versa, set a smaller weight to the entire error block when its density is higher than the density of the data entries. For instance, if we want to modify the LPA algorithm according to the generous principle formulated above, then instead of the minimization problem

‖x̃‖_1 → min,  subject to F̃ x̃ = ỹ,

we solve the problem

‖x‖_1 + λ ‖e‖_1 → min,  subject to F̃ x̃ = ỹ,

where λ depends on the density of errors. The ℓ1-greedy algorithm from [10] (and its accelerated version from [11]) reweights the input data of its iterations according to the output of the previous iterations. The modification of the problem above does not require any significant changes in Algorithm A itself. We just introduce a different weight for the error components not included into the club of large coefficients. To be more precise, we replace the definition of the weights in item 2a of Algorithm A with

w_{j,i} := { ε, if |x_{j−1,i}| > M_j;  1, if |x_{j−1,i}| ≤ M_j and i ≤ N;  λ, if |x_{j−1,i}| ≤ M_j and i > N. }

We call this modification Algorithm B, or the ℓ1-greedy-generous algorithm (LGGA).
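In code, the change is minor: the weight given to a "small" coefficient now depends on whether its index belongs to the data block (the first N entries of x̃) or to the error block (the last n entries). The sketch below is an illustration; the value of λ is assumed to be supplied from the (approximately known) error density, and with λ = 1 the rule reduces to Algorithm A.

import numpy as np

def lgga_weights(x_prev, M, N, eps=1e-3, lam=0.5):
    """Weights for the next weighted l1-minimization in Algorithm B (LGGA)."""
    idx = np.arange(x_prev.size)
    large = np.abs(x_prev) > M
    w = np.empty_like(x_prev)
    w[large] = eps                        # greedy part: release the club of large coefficients
    w[~large & (idx < N)] = 1.0           # small coefficients in the data block
    w[~large & (idx >= N)] = lam          # generous part: one common weight for the error block
    return w

Replacing the weight update in the lga sketch above by a call to lgga_weights turns that sketch into Algorithm B.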
The efficiency of Algorithm B is illustrated on Fig. 3. We found the curves of the success rate for six values of r ranging from 1 to 90 (the graphs plotted with "◦"s from right to left), applying in Algorithm B a value of λ chosen according to the error rate in each case; the largest value used was 0.55.
Figure 3: Algorithm B for six values of r from 1 to 90 ("◦" curves from right to left), F̃ ∈ R^{128×384}; the error-free case r = 0 for N = 384 ("∇" curve).

On the same figure, we plot the graph corresponding to the error-free case for N = 384 (the graph plotted with "∇"s). We see that the curve of the success rate of Algorithm A for error-free input practically coincides with the success rate of Algorithm B for the case r = 5. So, under similar input conditions, Algorithm B corrects up to 5 errors while Algorithm A admits only error-free input.

To discuss the next issue, let us move, in the mind's eye, each of the graphs on Fig. 3 plotted with "◦"s to the right by the corresponding value r. Then we have the graphs of the recovery of k + r (data and error) entries. We see that for r equal to 15 and 45 (and definitely for r between them) the shifted graphs are very close to the graph for the error-free case with N = 384, whereas for r = 1 the shifted graph practically coincides with the error-free case with N = 256; thus, the shifted graphs lie between the error-free curves for N equal to 256 and 384. We note that for the case r = 90 the shifted graph lies far to the right of the curve for N = 256. This is not surprising, since encoding with the identity matrix I_n is not compression at all; therefore, for the matrix I_n separated from F, any number of entries is "recoverable" with the trivial "algorithm" x := y.

If we compare the graphs on Fig. 3 with the classical ℓ1-minimization algorithm, then it turns out that the ℓ1-minimization curve (for n = 128 and error-free input) is located near the Algorithm B curve for r = 15 (≈ 12% of the measurements, located at unknown positions, are corrupted) on Fig. 3. This observation promotes not only the ℓ1-greedy algorithm, which is a basic component of Algorithm B, but also supports one of the keystone paradigms of CS stating that the data measurements stored today can be used much more efficiently in a few years, when new algorithms have been designed.

Even staying within the sparse Gaussian model of errors, we have one parameter whose influence has not been considered yet: the magnitude of the errors. In classical error correcting codes over the binary field, the magnitude is not an issue. In real-valued encoding, the influence of the magnitude of the error vector on the efficiency of the recovery is not so obvious. Our experiments with increasing/decreasing the error magnitude by up to 10 times showed that the data reconstruction efficiency increases in all cases, even if we do not change the generous weight. There is a common sense explanation of this phenomenon. The Gaussian distribution has the highest information content among distributions with the same variance. The union of the Gaussian data and the Gaussian errors with equal variances has a Gaussian distribution, while the weighted union is not Gaussian anymore. In fact, it is sparser, independently of whether the weight is less or greater than 1.

In classical Information Theory, a channel with errors can equivalently be considered as a channel with 2 sources of information, i.e., the data transmitted through the channel and the undesirable errors. From this point of view, the errors consume a part of the channel capacity, reducing the capacity of the channel available for useful data. This observation leads to the idea of using the error correcting scheme considered in Section 3 for encoding/decoding data from a few sources by packing them into the same output vector.

To get some idea of how efficient this idea can be, we consider a model case of 4 sources. In our numerical experiments, each source produces a data vector x_i. We do not make any special assumption about the sparsity of each specific x_i. However, we bear in mind that the compound vector x has to be somehow sparse to make decoding possible.

What is especially interesting for us is to outperform Algorithm A when the sources of data contain different amounts of information, i.e., when the information is distributed non-uniformly between the blocks. In our settings, the input consists of three blocks with the fixed sparsities 64, 5, 3 and one block with the variable sparsity K. Thus, the total number of encoded non-zero entries is k = K + 64 + 5 + 3.
For Algorithm B, we set a separate weight for each of the four blocks, larger for the blocks with lower information content. We applied only "common sense" in setting the weights; no accurate optimization was performed. The result of the reconstruction is given on Fig. 4.

The numerical experiments show that the usage of Algorithm B for decoding data from a multisource input may bring significant benefits over the standard uniform decoding strategy. The reweighting of the blocks completely fits the philosophy of reweighted optimization. However, the new element in LGGA consists in combining both the greedy and the generous strategies. While the greedy strategy sets a small penalty weight in the weighted ℓ1-norm for the entries suspected of being used in the encoded data representation, the strategy of Algorithm B quite "generously" changes the weights for entire blocks of information. Setting larger weights in the ℓ1-norm for the blocks with low information content, we "temporarily exclude" them from consideration on the initial iterations of Algorithm B, helping the blocks with the higher density of information to be decoded more efficiently.
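The same generous device extends directly to any number of blocks: each block of x̃ gets its own weight for the small coefficients. In the sketch below, the block boundaries and weights are illustrative placeholders rather than the exact values used in the experiment of Fig. 4.

import numpy as np

def blockwise_weights(x_prev, M, block_bounds, block_weights, eps=1e-3):
    """block_bounds are index offsets, e.g. [0, 96, 192, 288, 384] for four equal blocks."""
    w = np.empty_like(x_prev)
    idx = np.arange(x_prev.size)
    large = np.abs(x_prev) > M
    w[large] = eps                                        # greedy part, as in Algorithm A
    for lo, hi, bw in zip(block_bounds[:-1], block_bounds[1:], block_weights):
        sel = ~large & (idx >= lo) & (idx < hi)
        w[sel] = bw                                       # one "generous" weight per block
    return w

For the 4-source experiment above, block_weights would typically contain larger values for the blocks with sparsities 5 and 3 and a smaller value for the dense block with sparsity 64.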
5, and k = F ∈ R × . of errors increases the capability of LGGA in error correction. Such phenomenon can be usedartificially for increasing the algorithm capability in reconstruction of sparse data. We give oneexample showing what benefits are reachable with that approach.Let us split the Gaussian matrix F ∈ R n × N into 2 submatrices F ∈ R n × N and F ∈ R n × N , N + N = N , and compose a compound sensing matrix Y = ( F , d F ) . For decoding data sensedwith the matrix Y , when n = N = N =
128 and d = .
1, we introduce little changes in the weightselection of Algorithm B. We define the weights as w j , i : = e , | x j − , i | > M j , i ≤ N ; e , | x j − , i | > M j / d , i > N ;1 , | x j − , i | ≤ M j , i ≤ N ; l , | x j − , i | ≤ M j , i > N . The result is presented on Fig. 5.This experiment obviously shows that the encoding with compound matrices is a very promisingtool for efficient encoding purposes. At the same time, it should be taken into account that there isa significant stability drawback of that approach. In Section 6, we have more extensive discussionabout the algorithm stability. However, this is absolutely obvious that the second half of the datahas the theoretical precision of the recnstruction 1 / d times lower than the precision of the first half.So this trick is kind of exchange of the higher precision for the larger number k . From the point ofview of settings of formal information theory, this is not progress at all. Anyway, this idea can beused when we do not need the same precision for different blocks of information or as a constructivebrick for other algorithms, where such compound matrices make a sence (e.g., cf. [12]). ℓ -Greedy-Generous Decoding Algorithm In Sections 3 and 4 we showed how Algorithm B, applied to linearly compressed data from multiplesources, can efficiently recover the encoded information. The tuning of Algorithm B uses the sideinformation about the density of the information in each input sources. S u cc e ss R a t e LGA, Gaussian F LGGA, compound Gaussian F Figure 5: Algorithm B on compound matrix vs. Algorithm A on Gaussian matrix, F ∈ R × . Embedding this side information in the code may bring additional (probably non-linear) proce-dures like protection from the corruption in the channel. While the amount of this side informationis tiny, this scheme violates the genuine CS architecture which is linear by nature. Moreover, inmost of practical cases, the information density values are not immediately available even for theencoder. Potentially, it may be extracted by the encoder from the raw data at some computationalexpenses comparable or exceeding CS expenses itself. Say, to get the sparsity information about animage the wavelet transform has to be applied. In other cases, like for the channel with errors, theencoder is not aware about the errors in the channel at all. The error rate of a stationary channel canalso be measured by sending the empty message to the decoder. Thus, to apply Algorithm B weneed the distribution of information over the blocks, assumed to be obtained from the encoder, andinformation about channel errors, assumed to be measured by the decoder.Due to reasonings above, we arrive at conclusion that the decoding algorithm having an internalestimator of the amount of information encoded in each block is very desirable. Such estimator hasto use only the raw CS encoder output. The block structure, i.e., the partitioning of the vector x intothe blocks with potentially different information density, also has to be known for the decoder.Below, we suggest an adaptive algorithm working on two sources of information. While it isquite efficient within the considered settings, it serves as just one successful example on the way toa really universal adaptive algorithm acting on the variety of block configurations as well as on thedifferent information contents of the blocks.We conducted some research bringing the algorithm for adapting weights between blocks, ac-cording to intermediate internal estimates made on LGGA iterations. 
Due to the iterative nature of LGA, we do not try to find the optimal weights for the blocks from the beginning. We rather dynamically update those weights in the direction of the higher sparsity of the result. For our estimates, we compute the 0.5-quasi-norm

‖x‖_{0.5} := (Σ_i √|x_i|)^2,

normalized by the Euclidean norm ‖x‖ := (Σ_i x_i^2)^{1/2}:

s(x) := ‖x‖_{0.5} / ‖x‖.

Since the vector x is unknown at the beginning of the decoding, we use heuristic methods measuring the relative behavior of this characteristic for the blocks of the vector x on consecutive iterations of LGGA.

We consider only the simplest case of two blocks of information with potentially different densities. We assume that the blocks have equal size; in the modeling settings of the previous section, N_1 and N_2 are equal to 128. Let us agree that the index sets are {1, ..., N/2} and {N/2 + 1, ..., N} for the first and the second block, correspondingly. We will use the superscripts 1 and 2 to denote the corresponding subvectors, and the subscripts "p" and "c" for the minimum ℓ2-norm (pseudoinverse) solution to F x = y and for the current estimate of x made by LGGA.

The next iteration weights will be functions of the value

S = (s(x_c^1) · s(x_p^2)) / (s(x_c^2) · s(x_p^1)).

Our heuristic approach is based on the experimental observation that, when the second block has a lower density of information and the algorithm transforming x_p into x_c has some sparsifying properties, then S > 1, and S grows with the growth of the non-uniformity of the information distribution between the blocks.

Accepting those results, we have to introduce a function setting the block weights in the greedy-generous algorithm. We set the weight

W = W(S) := { c_1 √S + c_2/√S, if S ≥ 1;  (c_1 √S + c_2/√S)^{−1}, if S < 1, }   (5)

where c_1 and c_2 are empirically chosen positive constants.

Algorithm C (adaptive ℓ1-Greedy-Generous Algorithm (ALGGA))
1. Preliminaries.
(a) Compute the minimum ℓ2-norm solution x_p := F^T (F F^T)^{−1} y.
(b) Compute s(x_p^1) and s(x_p^2).
2. Initialization.
(a) Set w_{0,i} ≡ 1, i = 1, ..., N, and W := 1.
(b) Find a solution x^0 to (1) providing the minimum for (3).
(c) Set M_0 := max_i |x_{0,i}| and j := 1.
3. General ALGGA Step.
(a) Find S and W = W(S).
(b) Set M_j := a M_{j−1} and the weights
    w_{j,i} := { ε, if |x_{j−1,i}| > M_j;  W^{−1}, if |x_{j−1,i}| ≤ M_j and i ≤ N/2;  W, if |x_{j−1,i}| ≤ M_j and i > N/2. }
(c) Find the solution x^j to (1) providing the minimum for (3).
(d) Set j := j + 1 and go to step 3a.
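A sketch of the adaptive machinery of Algorithm C for two equal blocks is given below: the sparsity measure s(x), the statistic S built from the pseudoinverse solution x_p and the current LGGA estimate x_c, and a weight function of the form (5). The constants c1 and c2 are placeholders satisfying c1 + c2 = 1 (so that W(1) = 1); they are not the empirically tuned values behind Fig. 6.

import numpy as np

def s(x):
    """Normalized 0.5-quasi-norm: ||x||_{0.5} / ||x||_2."""
    return np.sum(np.sqrt(np.abs(x))) ** 2 / np.linalg.norm(x)

def block_statistic(x_p, x_c, N):
    """S for two blocks of size N/2, from the pseudoinverse and current solutions."""
    h = N // 2
    return (s(x_c[:h]) * s(x_p[h:])) / (s(x_c[h:]) * s(x_p[:h]))

def W(S, c1=0.5, c2=0.5):
    """Block weight of the form (5); c1 and c2 are placeholder constants."""
    val = c1 * np.sqrt(S) + c2 / np.sqrt(S)
    return val if S >= 1 else 1.0 / val

# The pseudoinverse solution needed in step 1a:
#   x_p = F.T @ np.linalg.solve(F @ F.T, y)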
On Fig. 6, the horizontal axis k corresponds to the total number of non-zero entries in both blocks. The graphs plotted with "∗" correspond to the recovery with Algorithm B when the density of the information is known before decoding, whereas the graphs plotted with "◦" reflect the results obtained with the adaptive Algorithm C. There are four pairs of graphs.

Figure 6: Algorithm C ("◦", adaptive) vs. Algorithm B ("∗", optimal) success rates for 2-channel input, k = k_1 + k_2, for four fixed values of k_2 (the leftmost pair corresponding to k_2 = 37).

The leftmost pair corresponds to the uniform information distribution between the blocks. To be more precise, we set the number of non-zero entries in block 2 to 37, while the number k_1 of non-zero entries in block 1 is changing, giving the total number k = k_1 + 37. The number 37 is one half of the value corresponding to the level 0.5 of the success curve for Algorithm A. Thus, in the neighborhood of k = 74 = 2 × 37 we have non-zero entries uniformly distributed over the entire range 1 ÷ N, and W = W(S) ≈ 1. The remaining pairs correspond to smaller fixed values of k_2, the next one being k_2 = 15. In those non-uniform cases, the output of Algorithm C is very close to the output of Algorithm B (whose block weights are tuned to the known value of k_2), but has a huge advantage over Algorithm A.

We would like to emphasize that the function defining the weight W in (5) does not pretend to be optimal or somehow universal. We just wanted to show that the idea of efficient decoding of a multichannel code with unknown characteristics of the channels may have a quite satisfactory adaptive solution. Anyway, we believe that a significant part of the losses of Algorithm C for highly non-uniform distributions is mainly due to the statistical inconsistency of the estimate of S rather than to an imperfection of formula (5). It is obvious that a few non-zero values have an insignificant influence on the values of the sparsity estimates, especially for the pseudoinverse solutions s(x_p^j). Therefore, significant fluctuations of S may mislead formula (5) in setting W.

Experiments with a change of the variance of the data give an argument in favor of our claim above. Indeed, we introduced a scaling factor, in the range 0.1 ÷ 10, for one of the blocks and checked the reconstruction with Algorithm C. In all cases, a higher rate of recovery was observed, in particular for the factor 10.
ℓ1-Greedy-Generous Algorithm Stability

The stability of the algorithm with respect to noise in the measurements is a necessary requirement for its applicability to real-world problems. The noise may have very different origins; the most typical forms are the noise of transmission and the noise of quantization. Any noise reduces the precision of the result. However, technically, the goal of this section is to estimate the stability of the very fact of recovery of sparse representations with reasonable precision, rather than to fight for the best possible precision itself. The CS decoding is a highly non-linear operation, so stability means much more for CS than just the rate of the dependence of ‖x − x̂‖ on ‖y − ŷ‖, because even the continuity of such a dependence is under question. Generally, no algorithm can be stable for all k not exceeding n − 1. Indeed, each value of the measurement precision has its own theoretical maximum sparsity level k admitting the recovery. In particular, the case k = n − 1 requires unboundedly growing measurement precision when N/n ≥ a > 1 and n → ∞. For k/n = b < 1, reconstruction with moderate precision is possible up to some level of noise in y. For a noise/precision level above that threshold, no precision of reconstruction can be guaranteed. The closer b is to 1, the lower the threshold. We checked the result of reconstruction for many different levels of noise. The greedy-generous algorithm has shown high stability to the noise in the measurements y. Let us describe the settings of the numerical experiments supporting the stability claim.

In the noisy environment, we run the same adaptive LGGA with the same parameters, but we change the criterion of success. The threshold value for "success" was set as

δ = 1.5 · 2^{N H(k/N)/n} √n σ,

where H(p) := −p log_2 p − (1 − p) log_2(1 − p) and σ is the noise variance. When ‖x̂ − x‖ < δ, where x̂ is the estimate obtained with the algorithm, we say that the reconstruction of x is successful.

We use an indirect approach for setting the threshold δ. The reasoning is as follows. Due to the normalization of the matrix F, its random n × n submatrix is "almost" unitary. So, in the best case scenario, any deviation from the correct values in the y-domain will be converted into the same deviation in the x-domain. Let us obtain a numerical estimate for this conversion. First of all, the noise level σ at n entries gives the total noise √n σ in y. The level of the noise can be translated into a number of significant bits in the entries of y. Unfortunately, we cannot hope that all those bits will be used for the representation of the components of the vector x. Indeed, the linear encoding formula (1) assumes that the data vector x can be recovered from the measurements y. In particular, this means that the vector y has to contain not only the information about bits in the digital representation of x but also (maybe implicitly) the information about the indices of the non-zero entries. For encoding that information we need H(k/N) bits per entry. The total number of entries in x is N. The information about the indices has to be put into the vector y consisting of n entries, i.e., each entry of y has to "reserve" N H(k/N)/n bits for encoding the indices. Those bits inevitably decrease the precision of the reconstruction at least by the factor 2^{N H(k/N)/n}. Thus, we cannot expect the precision of the reconstruction ‖x̂ − x‖ to be better than 2^{N H(k/N)/n} √n σ. Numerical experiments show that this value is indeed very close to the precision of the estimate x̂. We introduce the additional factor 1.5 just to avoid unfair "losses" of success when the reconstruction error slightly exceeds this bound due to random fluctuations.
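For completeness, the success criterion of this section in code form (a direct transcription of the threshold δ defined above; the dimensions in the example call are chosen purely for illustration):

import numpy as np

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def success_threshold(n, N, k, sigma):
    """delta = 1.5 * 2^{N H(k/N) / n} * sqrt(n) * sigma."""
    return 1.5 * 2.0 ** (N * binary_entropy(k / N) / n) * np.sqrt(n) * sigma

print(success_threshold(n=128, N=256, k=50, sigma=0.03))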
The graphs of the success rate for a non-uniform distribution of the information, with the fixed value k_2 = 37 (the right triplet of curves) and a second fixed value of k_2 (the left triplet), for σ = 0 and two positive noise levels up to 0.03, are given on Fig. 7.

The exciting fact that the "best case scenario" estimate turns out to be a good tool for the prediction of the deviation of the algorithm output from the true vector is very optimistic for CS perspectives. Indeed, the threshold estimate is based on elementary information theory principles, which cannot be overcome by any algorithm.
Figure 7: Algorithm C success rate for the noise levels σ = 0 ("◦"), an intermediate value ("∗"), and 0.03 ("∇"), for the two fixed values of k_2.

Compression of information allowing this precision of reconstruction can be achieved if we separately encode the digital (quantized) information about the entries and the information about the indices. The index part of the information can be encoded with the optimal bit budget by an arithmetic encoder, whose output is extremely vulnerable to errors. In fact, the results of numerical modeling show that the linear compressor (1) encodes the combination of digital bits and indices in a very wise form. The indices are encoded implicitly; they are a part of the joint source-channel linear encoding. Within a wide range of losses of precision in the entry representation, the reconstruction does not destroy the structure (the index information), providing decoding with a quality progressively depending on the channel noise.

Conclusions

It is shown that Compressed Sensing encoding has a very strong internal error correcting capability. No special redundant encoding is required for error correction. Essentially, the error protection capability is defined by the amount of the space underloaded with data by the encoder. The less the data information content, the higher the natural error protection level of the code. Thus, Compressed Sensing provides a highly efficient Joint Source-Channel Encoding method.

We showed that a minor modification of the ℓ1-greedy decoding algorithm, which we call the ℓ1-greedy-generous algorithm (LGGA), allows errors to be corrected efficiently.

Using the well-known parallel between the combination of data and errors in a channel and the joint encoding of a few sources, we applied the same decoding algorithm to decoding a multi-source code. In the case of a non-uniform informational contribution of the sources to the encoder output, our new ℓ1-greedy-generous algorithm significantly outperforms all known algorithms, including the ℓ1-greedy one, in recovering Gaussian data encoded with Gaussian matrices.

While knowledge of the approximate distribution of information between blocks is very desirable, this distribution can be estimated dynamically during the iterations of LGGA. We suggest an algorithm automatically adapting the "generous" weights of the ℓ1-greedy-generous algorithm to input with an unknown (possibly non-uniform) distribution of information between blocks.

References

[1] E. J. Candès, T. Tao, Decoding by linear programming, IEEE Transactions on Information Theory, 51 (2005), 4203–4215.
[2] E. J. Candès, P. A. Randall, Highly robust error correction by convex programming, IEEE Transactions on Information Theory, 54 (2008), no. 7.
[3] E. J. Candès, M. Rudelson, T. Tao, R. Vershynin, Error correction via linear programming, Proc. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2005), IEEE, 2005, 295–308.
[4] E. J. Candès, M. B. Wakin, S. Boyd, Enhancing sparsity by reweighted ℓ1 minimization, J. of Fourier Anal. and Appl., special issue on sparsity, (2008), 877–905.
[5] R. Chartrand, W. Yin, Iteratively reweighted algorithms for compressive sensing,
in Proceedings of the Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2008, 3869–3872.
[6] I. Daubechies, R. DeVore, M. Fornasier, C. S. Güntürk, Iteratively re-weighted least squares minimization for sparse recovery, Commun. Pure Appl. Math., (2010), 1–38.
[7] D. Donoho, Compressed sensing, IEEE Trans. on Information Theory, (2006), 1289–1306.
[8] D. Donoho, Y. Tsaig, I. Drori, J. Starck, Sparse solutions of underdetermined linear equations by stagewise orthogonal matching pursuit, Report, Stanford University, 2006.
[9] S. Foucart, M. J. Lai, Sparsest solutions of underdetermined linear systems via ℓq-minimization for 0 < q ≤ 1, Appl. Comput. Harmon. Anal., 26 (2009), 395–407.
[10] I. Kozlov, A. Petukhov, Sparse solutions for underdetermined systems of linear equations, in Handbook of Geomathematics, W. Freeden, M. Z. Nashed, T. Sonar (Eds.), Springer, 2010, 1243–1259.
[11] I. Kozlov, A. Petukhov, Fast implementation of the ℓ1-greedy algorithm, in Recent Advances in Harmonic Analysis and Applications, D. Bilyk et al. (Eds.), Springer, 2012, to appear.
[12] F. Krzakala, M. Mézard, F. Sausset, Y. F. Sun, L. Zdeborová, Statistical physics-based reconstruction in compressed sensing, preprint, arXiv:1109.4424v1 [cond-mat.stat-mech], 20 Sep 2011.
[13] B. K. Natarajan, Sparse approximate solutions to linear systems, SIAM J. Comput., 24 (1995), 227–234.
[14] A. Petukhov, Fast implementation of orthogonal greedy algorithm for tight wavelet frames, Signal Processing, (2006), 471–479.
[15] M. Rudelson, R. Vershynin, Geometric approach to error correcting codes and reconstruction of signals, International Mathematical Research Notices, (2005), 4019–4041.
[16] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, Yi Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 31 (2009), no. 2, 210–227.