Sparse approximation based on a random overcomplete basis
Yoshinori Nakanishi-Ohno, Tomoyuki Obuchi, Masato Okada, Yoshiyuki Kabashima
Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan
Research Fellow of Japan Society for the Promotion of Science, Chiyoda, Tokyo 102-0083, Japan
Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama, Kanagawa 226-8502, Japan
(Dated: August 26, 2018)

Abstract

We discuss a strategy of sparse approximation that is based on the use of an overcomplete basis, and evaluate its performance when a random matrix is used as this basis. A small combination of basis vectors is chosen from a given overcomplete basis, according to a given compression rate, such that they compactly represent the target data with as small a distortion as possible. As a selection method, we study the ℓ₀- and ℓ₁-based methods, which employ the exhaustive search and ℓ₁-norm regularization techniques, respectively. The performance is assessed in terms of the trade-off relation between the representation distortion and the compression rate. First, we evaluate the performance analytically in the case that the methods are carried out ideally, using methods of statistical mechanics. The analytical result is then confirmed by performing numerical experiments on finite-size systems, and extrapolating the results to the infinite-size limit. Our result clarifies the fact that the ℓ₀-based method greatly outperforms the ℓ₁-based one. An interesting outcome of our analysis is that any small value of distortion is achievable for any fixed compression rate r in the large-size limit of the overcomplete basis, for both the ℓ₀- and ℓ₁-based methods. The difference between these two methods is manifested in the size of the overcomplete basis that is required in order to achieve the desired value of the distortion. As the desired distortion decreases, the required size grows in polynomial and exponential manners for the ℓ₀- and ℓ₁-based methods, respectively. Second, we examine the practical performances of two well-known algorithms, orthogonal matching pursuit and approximate message passing, when they are used to execute the ℓ₀- and ℓ₁-based methods, respectively. Our examination shows that orthogonal matching pursuit achieves a much better performance than the exact execution of the ℓ₁-based method, as well as approximate message passing. However, regarding the ℓ₀-based method, there is still room to design more effective greedy algorithms than orthogonal matching pursuit. Finally, we evaluate the performances of the algorithms when they are applied to image data compression.

∗ [email protected]
† [email protected]
‡ [email protected]
§ [email protected]

I. INTRODUCTION

Information processing based on the sparseness of various data is an active area of research. This sparseness means that data are typically expressed by a small combination of non-zero components when a proper basis is used. The significance of sparseness for information processing had already begun to be noted when principal component analysis was invented, in 1901 [1]. Low-rank approximation of a matrix is known to be a useful method of collaborative filtering for recommendation systems [2–4]. In neuroscience, the sparse-coding hypothesis has gradually been accepted as a method of elucidating visual and auditory systems [5–10].
Recent interest in information processing with sparse data has been triggered by compressed sensing, since it was demonstrated that ℓ₁-norm minimization can give exact solutions in a reasonable time, under appropriate conditions [11–14].

In this study, we discuss sparse data processing from a different viewpoint, namely that of sparse approximation. Sparse approximation refers to the process of representing target data by a small number of non-zero elements, the purpose of which is to achieve a better trade-off relation between the representation distortion and the compression rate [15–24]. We adopt a strategy of sparse approximation that utilizes an overcomplete basis (OCB). An OCB is also called a frame in the field of signal processing. OCBs contain more basis vectors than the dimension of the target data. This means that a better and smaller set of basis vectors may be chosen to compactly express the data. Therefore, in terms of the trade-off relation, the OCB-based strategy is expected to outperform naive strategies such as random projection.

For selecting basis vectors from an overcomplete basis, we discuss the ℓ₀- and ℓ₁-based methods, which employ the exhaustive search and ℓ₁-norm regularization techniques, respectively. Our adoption of these methods is motivated by their application in compressed sensing [25, 26]. Focusing on the trade-off relation, we evaluate the performance of sparse approximation from two different viewpoints. First, we theoretically analyze the ideal performance that is achieved when the ℓ₀- and ℓ₁-based methods are performed exactly, by using methods of statistical mechanics. We regard the distortion and the compression rate as the thermal averages of physical quantities derived from partition functions. In the large-system limit, these are assessed by the replica method and the saddle-point method [27, 28]. In order to validate the results of our analysis, we extrapolate physical quantities to this limit from finite-size results obtained using the exchange Monte Carlo method [29, 30] and quadratic programming. Second, we investigate the practical performance of the OCB-based strategy. We examine the performances of two well-known algorithms, orthogonal matching pursuit [31, 32] and approximate message passing [33], when they are employed to approximately execute the ℓ₀- and ℓ₁-based methods, respectively. We also apply the approximate algorithms to a task of image data compression and evaluate their performances, as a practical example.

The rest of this paper is organized as follows. In section II, we set up the problem of sparse approximation that we focus on, and explain the ℓ₀- and ℓ₁-based methods and related work. In section III, we analyze the ideal performances of these methods in terms of the trade-off relation. In section IV, we discuss the practical performance of the OCB-based strategy, and its application to image data. In section V, we conclude this paper.

II. PROBLEM SETTING

A. Sparse approximation using a random overcomplete basis
Given a data vector y ∈ ℝ^M and a compression rate r, the purpose of sparse approximation is to obtain a compressed representation x ∈ ℝ^N using a basis matrix A = (a₁, ..., a_N) ∈ ℝ^{M×N}, while keeping the representation distortion ǫ as small as possible. The compression rate r is defined as the ratio of the number of non-zero components of x to the dimension of the data vector. That is,

r = \frac{\|x\|_0}{M} ,  (1)

where ‖·‖₀ denotes the so-called ℓ₀-norm of a vector. The ℓ₀-norm represents the number of non-zero elements of a vector, defined as \|v\|_0 = \sum_i |v_i|_0, where |v_i|_0 is equal to 0 (v_i = 0) or 1 (v_i ≠ 0). We measure the distortion using the mean squared error, as

\epsilon = \frac{1}{2M} \|y - Ax\|_2^2 ,  (2)

where ‖·‖₂ is the ℓ₂-norm of a vector, defined as \|v\|_2 = \sqrt{\sum_i v_i^2}. Note that this representation distortion measures how closely a data vector y is described by a sparse representation x with a given basis A, and it is different from the reconstruction error often used to measure the distance between an original sparse signal and an estimated sparse representation x̂
in the field of signal processing and compressed sensing. For our purpose of an analytical evaluation of ǫ, we consider the case where the elements of the data vector y are independently and identically distributed (i.i.d.) random variables from the normal distribution whose mean and variance are 0 and σ_y², respectively, together denoted by N(0, σ_y²). The elements of the basis matrix A are also i.i.d. random variables from N(0, M⁻¹). Then, the matrix A is almost surely of rank min(M, N), and the distortion becomes a random variable.

If N = rM, the minimization of (2) is nothing but the method of least squares (LS), and the corresponding compressed vector is easily obtained as

\hat{x} = A^{+} y ,  (3)

where A⁺ is the pseudoinverse of A, given by

A^{+} = (A^{T} A)^{-1} A^{T} .  (4)

Let us call this the naive method, which is illustrated in figure 1 (a). In the large-size limit M → ∞, the corresponding distortion converges to

\epsilon_{\mathrm{naive}} = \frac{1-r}{2} \sigma_y^2 ,  (5)

with probability one. In general, in the limit M → ∞ certain random variables, such as ǫ, have the so-called self-averaging property, and will almost surely converge to their average values. This enables us to present a clear discussion, and hereafter we focus on this limit.

FIG. 1. Schematic diagrams of sparse approximation. (a) Naive method. (b) OCB-based strategy.
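As a concrete illustration of the naive method, the following NumPy sketch (our own minimal example, not the authors' code) draws y and A as above, computes the LS solution (3) via the pseudoinverse, and checks that the measured distortion approaches (5) for large M:

```python
import numpy as np

# Minimal sketch of the naive method: y in R^M, A with N = rM columns,
# entries of y ~ N(0, sigma_y^2) and of A ~ N(0, 1/M).
M, r, sigma_y = 1000, 0.5, 1.0
N = int(r * M)
rng = np.random.default_rng(0)

y = rng.normal(0.0, sigma_y, size=M)
A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))

x_hat = np.linalg.pinv(A) @ y                  # eq. (3): x = A^+ y
eps = np.sum((y - A @ x_hat) ** 2) / (2 * M)   # eq. (2)

print(eps, (1 - r) * sigma_y**2 / 2)           # measured vs. eq. (5)
```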
On the other hand, for N > rM we have a lot of options in choosing a combination of rM basis vectors from the matrix, as illustrated in figure 1 (b). If the chosen combination is more suitable for representing the data vector than one that is chosen randomly, then the distortion becomes smaller than ǫ_naive. This is the idea behind the OCB-based strategy. However, this strategy presents the problem of how to choose the combination of basis vectors. We investigate the performances of the ℓ₀- and ℓ₁-based methods.

B. Methods

1. ℓ₀-based method

The basic idea of the ℓ₀-based method is to minimize the distortion by choosing the best combination of rM basis (column) vectors from a given OCB. More generally, we would like to define the distortion as a function of the chosen combination of basis vectors, and to control it in a simple manner. This motivates us to introduce a binary vector c ∈ {0, 1}^N to store information on whether each basis vector is chosen (c_i = 1) or not (c_i = 0). We also introduce a distortion labelled by c, with

\epsilon(c|y, A) = \min_{x} \left\{ \frac{1}{2M} \|y - A(c \circ x)\|_2^2 \right\} ,  (6)

where ∘ is the Hadamard product of two vectors, defined as (v ∘ w)_i = v_i w_i. In addition, we define an entropy function s(ǫ|y, A) to represent the number of configurations c that give a value of ǫ for the distortion, as follows:

s(\epsilon|y, A) = \frac{1}{M} \ln \#\{ c \mid \|c\|_0 = rM \wedge \epsilon(c|y, A) = \epsilon \} ,  (7)

where # denotes the number of elements of a set; the entropy is defined on ǫ₋ ≤ ǫ ≤ ǫ₊ and cannot be negative, by definition. A typical shape of the entropy is depicted in figure 2. There are two zero points of the entropy function, and the smaller and larger ones are denoted by ǫ₋ and ǫ₊, respectively. The smaller zero point ǫ₋ of the entropy function, at which s(ǫ₋) = 0, gives the minimum value of the distortion

\epsilon_{-}(y, A) = \min_{c} \epsilon(c|y, A) \quad \text{subj. to} \quad \|c\|_0 = rM .  (8)

Hence, our original motivation for introducing the ℓ₀-based method, to find the minimum distortion led by the best combination of basis vectors, can be achieved through the evaluation of the entropy function.

FIG. 2. A schematic shape of the entropy function. The smaller zero point of the entropy, ǫ₋, corresponds to the minimum of the representation distortion connected to the best combination of the basis vectors, and the point giving the largest entropy value is ǫ_naive, because a random choice is considered to select one of the most typical combinations of basis vectors.

In addition, the evaluation of the entropy function is easier than the direct evaluation of ǫ₋, and moreover the entropy function provides more information about the space of the variables c, which can be useful for practical applications such as designing algorithms. Thus, the entropy function s(ǫ) is the primary object of our analysis in the ℓ₀-based method. A similar analysis has been proposed for examining the weight space structure of multilayer perceptrons [34].

2. ℓ₁-based method

The ℓ₀-based method is the most closely matched to the original idea of the OCB-based strategy. However, its algorithmic realization of searching combinations of basis vectors is computationally inefficient, because it requires an exponentially growing computational cost as the system size N increases. In practical situations, instead of the ℓ₀-based method, a method based on ℓ₁-norm regularization can be employed.
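For very small systems, however, the ℓ₀-based method of (6)–(8) can be executed literally. The sketch below (our own illustration; this kind of enumeration is feasible only for small M) scans all supports c with ‖c‖₀ = rM, solves the inner LS problem of (6) for each, and returns the minimum distortion of (8):

```python
import itertools
import numpy as np

def l0_exhaustive(y, A, k):
    """Exhaustive l0-based method: minimize eq. (6) over all supports
    with ||c||_0 = k, returning (eps_min, best_support), cf. eq. (8)."""
    M = y.shape[0]
    best_eps, best_S = np.inf, None
    for S in itertools.combinations(range(A.shape[1]), k):
        A_S = A[:, list(S)]
        x_S = np.linalg.lstsq(A_S, y, rcond=None)[0]   # inner LS of eq. (6)
        eps = np.sum((y - A_S @ x_S) ** 2) / (2 * M)
        if eps < best_eps:
            best_eps, best_S = eps, S
    return best_eps, best_S

# Tiny demo: alpha = M/N = 0.5, r = 0.5; only small M is tractable.
rng = np.random.default_rng(1)
M, N = 8, 16
y = rng.normal(size=M)
A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))
print(l0_exhaustive(y, A, k=M // 2))
```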
This motivates us to examine the following ℓ₁-based method. Our ℓ₁-based method arises from the following minimization problem:

\hat{\xi} = \arg\min_{\xi} \left\{ \frac{1}{2} \|y - A\xi\|_2^2 + \lambda \|\xi\|_1 \right\} ,  (9)

where ‖·‖₁ is the ℓ₁-norm of a vector, defined as \|v\|_1 = \sum_i |v_i|, with the absolute value denoted by |·|. The solution of this minimization problem, ξ̂, provides useful information for finding the compressed vector we desire. This minimization problem is equivalent to the least absolute shrinkage and selection operator, also known as LASSO [35]. The main benefit of the approach represented by (9) is the computational ease of performing the minimization. As the objective function of (9) is convex, its minimization can be exactly carried out with a computational time of O(N³), using versatile algorithms of quadratic programming. Furthermore, the ℓ₁-norm term in (9) results in a sparsifying effect in ξ̂, and its coefficient λ is adjusted according to the compression rate. Namely, λ is chosen so that ‖ξ̂‖₀ = rM.

Our aim in the analysis of the ℓ₁ case is to evaluate the distortion resulting from ξ̂. The expression of the distortion is given by

\epsilon_1 = \frac{1}{2M} \|y - A\hat{\xi}\|_2^2 .  (10)

An inconvenience presented by this distortion is that it is not minimized on the set of basis vectors chosen by ξ̂, owing to the presence of the ℓ₁-norm term. In order to remove this extra distortion, we determine again the values of the non-zero components by purely minimizing the distortion, after the support estimation of the compressed vector by the ℓ₁-norm regularization. This procedure is described as follows:

\epsilon_{\mathrm{LS1}} = \min_{x} \left\{ \frac{1}{2M} \|y - A(|\hat{\xi}|_0 \circ x)\|_2^2 \right\} ,  (11)

where |·|₀ of a vector is defined by (|v|_0)_i = |v_i|_0. This can be carried out by the method of LS for the sub-matrix of A that is composed of the columns corresponding to |ξ̂_i|₀ = 1. These two quantities, ǫ₁ and ǫ_LS1, are the objects of our analysis in the ℓ₁ case.
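To make the ℓ₁-based pipeline concrete, the following sketch solves (9) with plain iterative soft thresholding (ISTA, a standard convex solver chosen here for brevity, not the quadratic-programming routines used in the paper) and then refits the surviving support by LS as in (11):

```python
import numpy as np

def lasso_ista(y, A, lam, n_iter=5000):
    """Solve eq. (9) by iterative soft thresholding (ISTA)."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    xi = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = A.T @ (A @ xi - y)             # gradient of (1/2)||y - A xi||^2
        z = xi - g / L
        xi = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return xi

def ls_refit(y, A, support):
    """Eq. (11): re-solve by LS on the estimated support."""
    x = np.zeros(A.shape[1])
    x[support] = np.linalg.lstsq(A[:, support], y, rcond=None)[0]
    return x

rng = np.random.default_rng(2)
M, N, lam = 100, 200, 1.0                  # alpha = M/N = 0.5
y = rng.normal(size=M)
A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))

xi_hat = lasso_ista(y, A, lam)
S = np.flatnonzero(np.abs(xi_hat) > 1e-8)  # support; lam controls the rate r
x_ls = ls_refit(y, A, S)
eps1 = np.sum((y - A @ xi_hat) ** 2) / (2 * M)   # eq. (10)
eps_ls1 = np.sum((y - A @ x_ls) ** 2) / (2 * M)  # eq. (11)
print(len(S) / M, eps1, eps_ls1)
```

In practice λ would be tuned (for example by bisection) until len(S)/M matches the target rate r.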
C. Related Work

The problem of sparse approximation has been studied widely in the fields of signal processing, statistics, and information theory. Sparse approximation involves searching for an optimal small combination of given basis vectors, and it was proved to be NP-hard [15]. In our setting, we seek a linear combination of a given number of basis vectors to approximate a given signal with as small a representation distortion as possible [23]. This setting is also called N-term approximation [17, 18, 36]. As stated after equation (2), our purpose is to minimize the distortion in describing a given signal by a sparse representation. Note again that this distortion is different from the reconstruction error used to measure the distance between an original sparse signal and an estimated signal from scarce data in compressed sensing. Our motivation is similar to that of rate-distortion theory for lossy data compression in information theory [37].

We investigate the performances of the ℓ₀- and ℓ₁-based methods in solving the sparse approximation problem. In the ℓ₀ case, the exhaustive search is considered to be an absolute method for obtaining the most suitable representation. A major contribution of this paper is the theoretical analysis of the exhaustive search method for the sparse approximation problem, by using methods of statistical mechanics. In order to reduce the computational cost of the ℓ₀-based method, some greedy algorithms have been proposed. Orthogonal matching pursuit (OMP) is a well-known greedy algorithm [31, 32]. The approximation bounds of OMP were proved and have been improved theoretically by previous studies [21–23, 38, 39]. On the other hand, ℓ₁-based methods relying on convex relaxation, such as basis pursuit [40] and LASSO [35], are known to be useful. The problem of sparse approximation allows a distortion in a compressed representation, though it should be small, and we evaluate the performance of the method of ℓ₁-norm regularization equivalent to LASSO. In this paper, we are also interested in comparing greedy algorithms with the convex-relaxation approach.

III. ANALYSIS OF IDEAL PERFORMANCE

A. Analytical treatment in the limit M → ∞

We investigate the limit M → ∞, as stated above. For this purpose, we employ some statistical mechanical tools, which provide useful assistance in investigating this limit. According to the terminology of statistical mechanics, we call the limit M → ∞ the thermodynamic limit, and the average over y and A the configurational average, which is denoted by [·]_{y,A}. In taking the limit M → ∞, the aspect ratio of the basis matrix, α = M/N, is fixed.

1. ℓ₀-based method

A versatile technique of statistical mechanics is to introduce a generating function Z of an energy function H, called a partition function. This defines a canonical distribution p. In the ℓ₀ case, we define the energy function, partition function, and canonical distribution respectively as follows:

H(c; \beta|y, A) = -\frac{1}{\beta} \ln \int d_c x \, e^{-\frac{\beta}{2} \|y - A(c \circ x)\|_2^2} ,  (12)

Z(\mu, \beta|y, A) = \sum_{c} \delta(Mr - \|c\|_0) \, e^{-\mu H(c; \beta|y, A)} \equiv \mathrm{Tr}_c \, e^{-\mu H(c; \beta|y, A)} ,  (13)

p(c; \mu, \beta|y, A) = \frac{1}{Z(\mu, \beta|y, A)} \delta(Mr - \|c\|_0) \, e^{-\mu H(c; \beta|y, A)} ,  (14)

where ∫ d_c x_i is equal to ∫ dx_i (c_i = 1) or 1 (c_i = 0). The parameter β denotes the inverse temperature corresponding to the method of LS, and the limit β → +∞ will be taken in accordance with the execution of the method.
The parameter µ denotes the inverse temperature corresponding to the support estimation, and plays an important role in combining the entropy and the distortion values to depict the entropy curve shown in figure 2. The energy function is related to the distortion of a given basis-vector choice c as follows:

\frac{1}{M} \lim_{\beta\to\infty} H(c; \beta|y, A) = \epsilon(c|y, A) .  (15)

The cumulant generating function φ(µ|y, A) is obtained from Z by

\phi(\mu|y, A) = \lim_{\beta\to\infty} \frac{1}{M} \ln Z(\mu, \beta|y, A) ,  (16)

and is connected to the entropy (7) by the Legendre transformation in the large-M limit, as

\phi(\mu|y, A) = \max_{\epsilon_{-} \le \epsilon \le \epsilon_{+}} \{ s(\epsilon|y, A) - \mu\epsilon \} .  (17)

The maximization problem of (17) must be solved on the well-defined region of s, which requires appropriate bounds: the minimum value of the distortion ǫ₋ and the maximum value of the distortion ǫ₊. Overall, we can calculate the object of our analysis, s(ǫ), through the inverse Legendre transformation, once we have obtained φ. Therefore, we turn our attention to the calculation of φ.

The cumulant-generating function has the self-averaging property, as does the entropy, and we assess the configurational average, given by

\phi(\mu) = [\phi(\mu|y, A)]_{y,A} .  (18)

We employ the replica method in order to calculate this average; a detailed analysis is provided in Appendix A. Though it is possible that the correct solution to the ℓ₀-based method might break replica symmetry (RS), the result under the RS ansatz is given by

\phi(\mu) = \mathrm{extr}_{\hat\Theta} \left\{ \frac{1}{2} \ln\frac{1+\chi}{1+\chi+\mu(Q-q)} - \frac{\mu}{2}\,\frac{q+\sigma_y^2}{1+\chi+\mu(Q-q)} + \frac{1}{2}\left( \hat{r}r + \hat{Q}Q - \frac{\hat\chi\chi}{\mu} + \hat{q}q \right) + \frac{1}{\alpha} \int Dz \, \ln(1+Y) \right\} ,  (19)

where extr_Θ{·} denotes the operation of extremization with respect to Θ, \hat\Theta = \{Q, \chi, q, \hat{r}, \hat{Q}, \hat\chi, \hat{q}\}, and \int Dz = \int \frac{dz}{\sqrt{2\pi}} e^{-z^2/2}, and we set

Y \equiv \sqrt{\frac{\hat\chi + \hat{Q}}{\hat{Q} + \hat{q}}} \; e^{-\hat{r} + \frac{1}{2}\frac{\hat{q}}{\hat{Q}+\hat{q}} z^2} .  (20)

By applying the extremization condition, we obtain the following equations of state (EOSs):

\hat\chi = \mu \left\{ \frac{\Delta}{(1+\chi)(1+\chi+\mu\Delta)} + \frac{\mu(\sigma_y^2+q)}{(1+\chi+\mu\Delta)^2} \right\} ,  (21a)
\hat{Q} = \mu \left\{ \frac{1}{1+\chi+\mu\Delta} - \frac{\mu(\sigma_y^2+q)}{(1+\chi+\mu\Delta)^2} \right\} ,  (21b)
\hat{q} = \frac{\mu^2(\sigma_y^2+q)}{(1+\chi+\mu\Delta)^2} ,  (21c)
r = \frac{1}{\alpha} \int Dz \, \frac{Y}{1+Y} ,  (21d)
\chi = \frac{\mu r}{\hat\chi + \hat{Q}} ,  (21e)
Q = \frac{r(\hat\chi - \hat{q})}{(\hat\chi + \hat{Q})(\hat{Q} + \hat{q})} + \frac{1}{\alpha} \frac{\hat{q}}{(\hat{Q}+\hat{q})^2} \int Dz \, z^2 \frac{Y}{1+Y} ,  (21f)
q = \frac{1}{\alpha} \frac{\hat{q}}{(\hat{Q}+\hat{q})^2} \int Dz \, z^2 \left( \frac{Y}{1+Y} \right)^2 ,  (21g)

where we write ∆ = Q − q. From the EOSs, we obtain some simple and general relations, which we summarize here for later convenience:

\hat\chi + \hat{Q} = \frac{\mu r}{\chi} ,  (22a)
\hat{Q} + \hat{q} = \frac{\mu}{1+\chi+\mu\Delta} ,  (22b)
\hat\chi - \hat{q} = \frac{\mu\Delta}{(1+\chi)(1+\chi+\mu\Delta)} ,  (22c)
\chi = \frac{r}{1-r} .  (22d)

The relation involving the entropy, (17), enables us to employ a convenient parametric form of ǫ(µ) and s(µ) = s(ǫ(µ)), and (21, 22) allow us to simplify ǫ(µ), as

\epsilon(\mu) = -\frac{\partial \phi(\mu)}{\partial \mu} = \frac{\hat\chi}{2\mu^2} ,  (23)
s(\mu) = \phi(\mu) + \mu \epsilon(\mu) .  (24)

The explicit form of s(µ) is not enlightening, and therefore we omit it. As the value of µ is increased from µ = 0, the point (ǫ, s) moves along the entropy curve from the summit (µ = 0) in the direction of decreasing distortion (µ > 0), as shown in figure 2.
When the entropy curve crosses the zero-entropy line at µ = µ₀, the minimum distortion is given by

\epsilon_{-} = \epsilon(\mu_0) .  (25)

Here, we make a technical remark on the derivation of (19). In contrast to the usual prescription of the replica method, we require two different replica numbers for the present analysis, because we have two different integration variables, x and c, in the calculation of φ. Using (16, 18), and introducing a variable ν = µ/β, we can rewrite φ(µ) as

\phi(\mu) = \lim_{\nu\to 0} \frac{1}{M} \left[ \ln \mathrm{Tr}_c \left( \int d_c x \, e^{-\frac{\mu}{2\nu} \|y - A(c\circ x)\|_2^2} \right)^{\nu} \right]_{y,A} = \lim_{n\to 0} \lim_{\nu\to 0} \frac{1}{Mn} \ln \left[ \left\{ \mathrm{Tr}_c \left( \int d_c x \, e^{-\frac{\mu}{2\nu} \|y - A(c\circ x)\|_2^2} \right)^{\nu} \right\}^{n} \right]_{y,A} .  (26)

In the last line, we use the replica identity [ln X]_{y,A} = lim_{n→0} (1/n) ln [X^n]_{y,A}. We identify n and ν as the two replica numbers, and assume that they are natural numbers, which enables us to expand the powers and to calculate the configurational average. The remaining calculations follow the usual procedure of the replica method, and we assume the RS ansatz in the order parameters. Our present framework for calculating φ is actually similar to the one-step replica-symmetry-breaking (1RSB) ansatz. In this identification, ν is identified as the 1RSB breaking parameter (usually written as m), and each configuration of c corresponds to a pure state in the 1RSB free-energy landscape; the entropy can be regarded as complexity. The analytical results obtained on the basis of the RS assumption will be justified later, in a comparison with numerical calculations.

2. ℓ₁-based method

a. Derivation of ǫ₁. Similarly to the case of the ℓ₀-based method, the energy function, partition function, and canonical distribution of the ℓ₁ case are defined respectively as

H(\xi|y, A) = \frac{1}{2} \|y - A\xi\|_2^2 + \lambda \|\xi\|_1 ,  (27)
Z(\mu, \kappa|y, A) = \int d\xi \, e^{-\mu(H(\xi|y,A) + \kappa\|\xi\|_0)} ,  (28)
p(\xi; \mu, \kappa|y, A) = \frac{1}{Z(\mu, \kappa|y, A)} e^{-\mu(H(\xi|y,A) + \kappa\|\xi\|_0)} .  (29)

The parameter µ denotes the inverse temperature corresponding to the support estimation by the method of ℓ₁-norm regularization. The parameter κ is an auxiliary variable introduced to analyze the compression rate r; in the limit κ → 0, H is exactly the minimized object in (9). We also introduce the averaged free-energy density, given by

f(\mu, \kappa) = -\frac{1}{M\mu} [\ln Z(\mu, \kappa|y, A)]_{y,A} ,  (30)

which plays the role of the cumulant-generating function that is given by φ in the ℓ₀ case. In the limit µ → ∞, the minimizer of the energy function becomes dominant in p, and we focus on this limit. Any quantity of interest can be calculated from f. For example, the compression rate r and the distortion ǫ₁ are calculated as

r = \lim_{\mu\to\infty} \lim_{\kappa\to 0} \frac{\partial}{\partial\kappa} f(\mu, \kappa) ,  (31)
\epsilon_1 = \lim_{\mu\to\infty} \left( \mu\frac{\partial}{\partial\mu} - \lambda\frac{\partial}{\partial\lambda} \right) f(\mu, 0) .  (32)

An analytically compact form of f is assessed by using the replica method in the limit M → ∞, through the replica identity, as

f(\mu, \kappa) = -\lim_{n\to 0} \frac{1}{M\mu n} \ln [Z^n(\mu, \kappa|y, A)]_{y,A} .  (33)

As in the ℓ₀ case, we assume the replica-symmetric solution. The details of the necessary calculations are presented in Appendix B. The result is given by

f(\mu\to\infty, \kappa) = \mathrm{extr}_{\hat\Theta} \left\{ \frac{P+\sigma_y^2}{2(1+\chi_p)} - \frac{1}{2}(\hat{P}P - \hat\chi_p\chi_p) - \frac{\hat\chi_p}{\alpha\hat{P}^2} \left( (1+2\theta_{+}\theta_{-})\,\mathrm{erfc}(\theta_{+}) - \frac{2\theta_{-}}{\sqrt{\pi}} e^{-\theta_{+}^2} \right) \right\} ,  (34)

where \hat\Theta = \{P, \chi_p, \hat{P}, \hat\chi_p\}, \theta_{\pm} = \frac{\lambda \pm \sqrt{2\kappa\hat{P}}}{\sqrt{2\hat\chi_p}}, and erfc(·) is the complementary error function, defined as \mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} dt \, e^{-t^2}. The extremization condition gives the following EOSs for the present case:

\hat\chi_p = \frac{P+\sigma_y^2}{1+\chi_p} ,  (35a)
\hat{P} = \frac{1}{1+\chi_p} ,  (35b)
\chi_p = \frac{1}{\alpha\hat{P}} \left( \mathrm{erfc}(\theta_{+}) + \sqrt{\frac{2\kappa\hat{P}}{\pi\hat\chi_p}} \, e^{-\theta_{+}^2} \right) ,  (35c)
P = \frac{\hat\chi_p}{\alpha\hat{P}^2} \left( (1+2\theta_{+}\theta_{-})\,\mathrm{erfc}(\theta_{+}) - \frac{2\theta_{-}}{\sqrt{\pi}} e^{-\theta_{+}^2} \right) + \frac{\kappa}{\alpha\hat{P}} \mathrm{erfc}(\theta_{+}) .  (35d)

By using (31, 32), we obtain

r = \frac{1}{\alpha} \mathrm{erfc}(\theta) ,  (36)
\epsilon_1 = \frac{P+\sigma_y^2}{2(1+\chi_p)} - \frac{1}{2}(\hat{P}P - \hat\chi_p\chi_p) - \frac{\hat\chi_p}{\alpha\hat{P}^2} \left( (1-2\theta^2)\,\mathrm{erfc}(\theta) + \frac{2\theta}{\sqrt{\pi}} e^{-\theta^2} \right) ,  (37)

where \theta = \frac{\lambda}{\sqrt{2\hat\chi_p}}. In addition, a simple formula

\epsilon_1 = \frac{1}{2}\hat\chi_p  (38)

is derived from the EOSs of (35) in the limit κ → 0, and a useful relation

\chi_p = \frac{r}{1-r} ,  (39)

which is similar to (22d), is offered by (35, 36).

b. Derivation of ǫ_LS1.
We also evaluate ǫ_LS1, as defined in (11). The computations are rather technical, and we defer the details to Appendix B. Here, we present an outline of the analysis, and the result.

Again, we use the energy function defined in the ℓ₀ case, but here the argument is |ξ|₀, determined by p(ξ). Thus, we obtain

H(|\xi|_0; \beta|y, A) = -\frac{1}{\beta} \ln \int d_{|\xi|_0} x \, e^{-\frac{\beta}{2} \|y - A(|\xi|_0 \circ x)\|_2^2} .  (40)

Since the vector ξ is drawn from p, we calculate the average value of (1/M) H(|ξ|₀) over p, in addition to the configurational average. Taking the limit µ → ∞ and then β → ∞ afterwards, we obtain the desired distortion ǫ_LS1 as follows:

\epsilon_{\mathrm{LS1}} = \lim_{\beta\to\infty} \lim_{\mu\to\infty} \frac{1}{M} \left[ \int d\xi \, p(\xi; \mu, 0|y, A) \, H(|\xi|_0; \beta|y, A) \right]_{y,A} .  (41)

By utilizing the replica method again, we can calculate this. We defer the details of the calculations to Appendix B, and here write down the resultant formula:

\epsilon_{\mathrm{LS1}} = \mathrm{extr}_{\hat\Theta_{\mathrm{LS1}}} \Biggl\{ \frac{P+\sigma_y^2}{2(1+\chi_q)} \left( \frac{\chi_c}{\chi_p} \right)^2 - \frac{C+\sigma_y^2}{1+\chi_q}\,\frac{\chi_c}{\chi_p} + \frac{Q+\sigma_y^2}{2(1+\chi_q)} - (\hat{C}C - \hat\chi_c\chi_c) - \frac{1}{2}(\hat{Q}Q - \hat\chi_q\chi_q) - \frac{\hat\chi_p}{\alpha\hat{Q}^2} \Biggl[ \left( \frac{\hat\chi_q}{\hat\chi_p} - \frac{2\hat\chi_c}{\hat\chi_p}\frac{\hat{C}}{\hat{P}} + (1+2\theta^2)\frac{\hat{C}^2}{\hat{P}^2} \right) \mathrm{erfc}(\theta) + \frac{2\theta}{\sqrt{\pi}} \left( \frac{\hat\chi_c}{\hat\chi_p} - \frac{\hat{C}}{\hat{P}} \right) \frac{\hat{C}}{\hat{P}} \, e^{-\theta^2} \Biggr] \Biggr\} ,  (42)

where \hat\Theta_{\mathrm{LS1}} = \{C, \chi_c, Q, \chi_q, \hat{C}, \hat\chi_c, \hat{Q}, \hat\chi_q\}, and \theta = \frac{\lambda}{\sqrt{2\hat\chi_p}}. One point to remark on is that we should not take the extremization condition with respect to \hat\Theta = \{P, \chi_p, \hat{P}, \hat\chi_p\} in this expression. Instead, we should substitute the extremizer of (34) into it. Applying the extremization condition with respect to \hat\Theta_{\mathrm{LS1}} gives

\hat\chi_q = \frac{P+\sigma_y^2}{1+\chi_q}\left(\frac{\chi_c}{\chi_p}\right)^2 - \frac{2(C+\sigma_y^2)}{1+\chi_q}\,\frac{\chi_c}{\chi_p} + \frac{Q+\sigma_y^2}{1+\chi_q} ,  (43a)
\hat{Q} = \frac{1}{1+\chi_q} ,  (43b)
\hat\chi_c = -\frac{P+\sigma_y^2}{(1+\chi_q)(1+\chi_p)}\,\frac{\chi_c}{\chi_p} + \frac{C+\sigma_y^2}{(1+\chi_q)(1+\chi_p)} ,  (43c)
\hat{C} = -\frac{1}{1+\chi_q}\,\frac{\chi_c}{\chi_p} ,  (43d)
\chi_q = \frac{1}{\alpha\hat{Q}} \mathrm{erfc}(\theta) ,  (43e)
Q = \frac{\hat\chi_p}{\alpha\hat{Q}^2} \Biggl[ \left( \frac{\hat\chi_q}{\hat\chi_p} - \frac{2\hat\chi_c}{\hat\chi_p}\frac{\hat{C}}{\hat{P}} + (1+2\theta^2)\frac{\hat{C}^2}{\hat{P}^2} \right) \mathrm{erfc}(\theta) + \frac{2\theta}{\sqrt{\pi}} \left( \frac{\hat\chi_c}{\hat\chi_p} - \frac{\hat{C}}{\hat{P}} \right) \frac{\hat{C}}{\hat{P}} \, e^{-\theta^2} \Biggr] ,  (43f)
\chi_c = \frac{1}{\alpha\hat{Q}} \left( -\frac{\hat{C}}{\hat{P}} \mathrm{erfc}(\theta) + \frac{2\theta}{\sqrt{\pi}}\,\frac{\hat\chi_c}{\hat\chi_p} \, e^{-\theta^2} \right) ,  (43g)
C = -\frac{\hat\chi_p}{\alpha\hat{Q}} \Biggl[ \left( -\frac{\hat\chi_c}{\hat\chi_p\hat{P}} + (1+2\theta^2)\frac{\hat{C}}{\hat{P}^2} \right) \mathrm{erfc}(\theta) - \frac{2\theta}{\sqrt{\pi}}\,\frac{\hat{C}}{\hat{P}} \, e^{-\theta^2} \Biggr] .  (43h)

From the EOSs, we can obtain the following simple relations:

\chi_q = \frac{r}{1-r} = \chi_p ,  (44a)
\hat{Q} = \hat{P} = \frac{1}{1+\chi_p} ,  (44b)
\epsilon_{\mathrm{LS1}} = \frac{1}{2}\hat\chi_q .  (44c)

We now make some comments regarding the derivation of (42). In order to calculate the configurational average, we are required to deal with two different factors: Z in p = (1/Z)e^{−µH}, and the logarithm in H. Correspondingly, as in the ℓ₀ case, we introduce replicas of two different kinds: n replicas to handle 1/Z, and ν replicas to handle the logarithm. Using them, we can rewrite (41) as

\epsilon_{\mathrm{LS1}} = \lim_{\beta\to\infty} \lim_{\mu\to\infty} \lim_{n\to 0} \lim_{\nu\to 0} -\frac{1}{M\beta\nu} \ln \left[ Z^{n-1}(\mu, 0|y, A) \int d\xi \, e^{-\mu H(\xi|y,A)} \left( \int d_{|\xi|_0} x \, e^{-\frac{\beta}{2}\|y - A(|\xi|_0\circ x)\|_2^2} \right)^{\nu} \right]_{y,A} .  (45)

It is now possible to calculate the configurational average by assuming that n and ν are natural numbers, and we can follow the usual prescription of the replica method. However, there remains a technical point concerning the limits n → 0 and ν → 0: the solution at n = ν = 0 has an unusual property. The extremization condition with respect to the order parameters yields several different solutions. Among these solutions, by employing a versatile tool of spin-glass theory for analyzing a probabilistic model conditioned by another probabilistic model, called the Franz-Parisi potential, we should choose the one analytically connected to \hat\Theta in (34) in the limit ν → 0. This is achieved by the remark given below (42) [41].
B. Numerical validation using simulations on finite M

1. ℓ₀-based method

We examine the analytical results using numerical simulations of finite-size systems. When M is sufficiently small, we can obtain the cumulant-generating function φ by exhaustively searching all possible combinations of basis vectors. When M is less small, we use the exchange Monte Carlo (MC) method to sample basis-vector combinations obeying the canonical distribution at various temperature points [29, 30], and then estimate the cumulant-generating function φ using the multi-histogram method [42].

In all simulations, we set α = 0.5 and σ_y² = 1. We treat two values of r, 0.2 and 0.4, and evaluate φ at various values of µ. We conduct the exhaustive search for M ≤ 25 (15) in the case of r = 0.2 (0.4), and use the exchange MC method for larger M. The configurational average is calculated by taking the median over 1000 different samples of (y, A). The error bars are estimated by the bootstrap method.

FIG. 3. Cumulant-generating function φ and entropy density s of the ℓ₀-based method with σ_y² = 1, α = 0.5, and r = 0.2, 0.4. (a) Plots of numerically evaluated φ at µ = 1 against 1/M. The lines are given by the linear regression. On the vertical axis, the circles and crosses represent the extrapolated and analytical values in the M → ∞ limit, respectively. The lengths of the error bars are comparable to the sizes of the symbols. (b) Plots of φ against µ in the M → ∞ limit. The lines and circles represent the analytical and extrapolated values, respectively. The lengths of the error bars are comparable to the sizes of the symbols. (c) Plots of s against ǫ in the M → ∞ limit. The lines and circles represent the analytical and extrapolated values, respectively. These are calculated from the values of φ in (b).

The procedure for our MC method will now be explained. At every temperature point, we randomly choose the initial vector c among those satisfying ‖c‖₀ = rM. For r = 0.2, the number of MC steps required for thermalization and sufficient sampling increases with the system size over M = 30 to 50, while for r = 0.4 it does so over M = 20 to 40, respectively.
The first half of the MC steps are discarded for thermalization. One MC step consists of two parts: first, updating c once at every temperature point, and then attempting an exchange once between every pair of neighboring temperature points. In each update of c, we randomly choose one index i such that c_i = 1 and another j such that c_j = 0, and flip both into the opposite states. That is, we set c_i = 0 and c_j = 1, and accept or reject this trial according to the Metropolis criterion, based on the energy values calculated from H (12). The Metropolis criterion is also used in the exchange of the vectors c at different temperature points.
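The pair-flip update just described is simple to implement. The following sketch (our own minimal illustration at a single temperature, without the replica-exchange and multi-histogram machinery of the full method) performs the Metropolis updates of c at inverse temperature µ on the β → ∞ energy of (12), (15):

```python
import numpy as np

def energy(c, y, A):
    """M * eps(c|y, A): LS energy on the support of c, cf. eqs. (6), (12)."""
    S = np.flatnonzero(c)
    x_S = np.linalg.lstsq(A[:, S], y, rcond=None)[0]
    return 0.5 * np.sum((y - A[:, S] @ x_S) ** 2)

def metropolis_sweep(c, y, A, mu, rng):
    """One sweep of pair flips that keep ||c||_0 fixed."""
    E = energy(c, y, A)
    for _ in range(c.size):
        i = rng.choice(np.flatnonzero(c))        # chosen column to drop
        j = rng.choice(np.flatnonzero(c == 0))   # unchosen column to add
        c[i], c[j] = 0, 1
        E_new = energy(c, y, A)
        dE = E_new - E
        if dE <= 0 or rng.random() < np.exp(-mu * dE):
            E = E_new                            # accept the flip
        else:
            c[i], c[j] = 1, 0                    # reject: undo the flip
    return c

rng = np.random.default_rng(6)
M, N, r, mu = 20, 40, 0.2, 1.0
y = rng.normal(size=M)
A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))
c = np.zeros(N, dtype=int)
c[rng.choice(N, size=int(r * M), replace=False)] = 1
for _ in range(100):
    c = metropolis_sweep(c, y, A, mu, rng)
print(energy(c, y, A) / M)   # distortion of the sampled support
```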
The results of the numerical simulations are presented in figure 3. Figure 3(a) shows the results for the cumulant-generating function at µ = 1. On the vertical axis, the circles represent extrapolated values from finite-size results. The extrapolation lines are given by the linear regression using an asymptotic form φ ≈ a + bM⁻¹ + cM⁻¹ ln M⁻¹. The regression is conducted by employing the method of least squares, as follows:

\min_{a,b,c} \sum_{M} \left( a + \frac{b}{M} + \frac{c}{M}\ln\frac{1}{M} - \phi(M) \right)^2 .  (46)

The asymptotic form is based on Stirling's formula and is exact at µ = 0, which motivates us to use the form even for µ ≠ 0. The cumulant-generating function and entropy density in the limit M → ∞ are presented in figures 3 (b) and 3 (c), respectively. The lines represent the analytical results. The circles represent the extrapolated values from the numerical results. The analytical solutions are seen to be consistent with the numerical ones. Hence, the numerical results clearly validate the analytical results for the ℓ₀-based method.

2. ℓ₁-based method

Similarly to the case of the ℓ₀-based method, we examine the analytical results of the ℓ₁-based method by performing numerical simulations on finite-size systems. We carry out the ℓ₁-norm regularization using quadratic programming, and evaluate the distortion before the method of LS, ǫ₁; the distortion after the method of LS, ǫ_LS1; and the compression rate r. The values of α and σ_y² are fixed as α = 0.5 and σ_y² = 1 for all simulations. We treat two values of λ, equal to 1 and 2. We calculate (9) and (11) using quadratic programming and the method of LS for a series of system sizes from M = 50 upwards. The results are presented in figure 4. On the vertical axes, the circles and crosses represent extrapolated and analytical values in the M → ∞ limit, respectively. The extrapolation lines are given by the linear regression using the asymptotic forms ǫ₁ ≈ a + bM⁻¹, ǫ_LS1 ≈ c + dM⁻¹, and r ≈ e + fM⁻¹. We see that the analytical solutions are very close to the extrapolated values. This agreement clearly demonstrates the reliability of the analytical results.

FIG. 4. Plots of numerically evaluated values of the ℓ₁-based method with σ_y² = 1, α = 0.5, and λ = 1, 2. The extrapolation lines are given by the linear regression. On the vertical axes, the circles and crosses represent the extrapolated and analytical values in the M → ∞ limit, respectively. The lengths of the error bars are comparable to the sizes of the symbols. (a) Distortion before the method of LS, ǫ₁. (b) Distortion after the method of LS, ǫ_LS1. (c) Compression rate, r.
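The finite-size extrapolations above are ordinary least-squares fits. Here is a minimal sketch of the regression (46) (our own, with synthetic inputs standing in for the measured values of φ(M); the intercept a is the extrapolated M → ∞ value):

```python
import numpy as np

# Fit phi(M) = a + b/M + (c/M) ln(1/M) by least squares, as in eq. (46).
rng = np.random.default_rng(3)
M_vals = np.array([30.0, 35.0, 40.0, 45.0, 50.0])
a_true, b_true, c_true = -0.1, 0.5, 0.3
phi_M = (a_true + b_true / M_vals
         + (c_true / M_vals) * np.log(1.0 / M_vals)
         + rng.normal(0.0, 1e-3, size=M_vals.size))  # synthetic stand-in

# Design matrix of the three basis functions.
X = np.column_stack([np.ones_like(M_vals), 1.0 / M_vals,
                     np.log(1.0 / M_vals) / M_vals])
coef = np.linalg.lstsq(X, phi_M, rcond=None)[0]
print(coef)   # approximately (a_true, b_true, c_true)
```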
FIG. 5. Results of the analysis in the M → ∞ limit with σ_y² = 1 and α = 0.5. (a) Trade-off relations of the naive, ℓ₀-based, and ℓ₁-based methods (before and after the method of LS). (b) Relation between the rate and the regularization coefficient in the ℓ₁-based method.

C. Comparison in the trade-off relation
We compare the ideal performance in the M → ∞ limit for the different methods in terms of the trade-off relation between the representation distortion and the compression rate. Figure 5(a) shows the trade-off relations in the case of α = 0.5. We see that both of the OCB-based methods achieve a better trade-off relation than the naive one. In the OCB-based strategy, the ℓ₀-based method significantly outperforms the ℓ₁-based one, even if the method of LS is operated after carrying out support estimation by the ℓ₁-norm regularization. We attribute the inferiority of the ℓ₁-based method to the regularization term. Indeed, as shown in figure 5 (b), the regularization term is necessary to decrease the rate, but it distorts the original purpose of minimizing the distortion, as is clearly seen from (27).

For a further comparison of the OCB-based methods, figure 6 shows the trade-off relations where different values of α control the degree of overcompleteness.

FIG. 6. Trade-off relations at various values of α in the case of σ_y² = 1. (a) Results of the ℓ₀-based method. (b) Results of the ℓ₁-based method (after the method of LS). For both methods, at a fixed r, the distortion ǫ becomes smaller as α decreases.

Figures 6 (a) and 6 (b) present the results of the ℓ₀- and ℓ₁-based methods, respectively. In the ℓ₁-based method, the method of LS has been operated after the support estimation. Both methods achieve a better trade-off relation as the degree of overcompleteness increases, that is, as α decreases. Another interesting observation is the superiority of the ℓ₀-based method over the ℓ₁-based one, regardless of the degree of overcompleteness.
1. The large limit of the degree of overcompleteness, α → 0

From figure 6 we see that the distortion becomes smaller as α decreases, both for the ℓ₀- and ℓ₁-based methods. An interesting question is whether the distortion vanishes or not in the limit α → 0, or more quantitatively, how ǫ scales with α in this limit.

FIG. 7. Plots of ǫ against α in the small-α limit, derived by solving (21, 43) numerically. The left panel represents ǫ₀, and the right panel represents ǫ_LS1. The lines are the fits based on our analytical formulas, (47, 49), and these show excellent agreement with the points obtained by the numerical evaluations.

Deferring the detailed calculations to Appendices A 2 and B 2, here we summarize our analytical results on the behavior of ǫ in the limit α → 0:

\epsilon_0 \propto \alpha^{\frac{r}{1-r}} \to 0 ,  (47)
\epsilon_1 \to \frac{1}{2}(1-r)\sigma_y^2 = O(1) ,  (48)
\epsilon_{\mathrm{LS1}} \propto |\ln\alpha|^{-1} \to 0 .  (49)

The asymptotic behaviors of ǫ₀ and ǫ_LS1 are examined using numerical solutions of the corresponding EOSs, (21, 43), in figure 7. Our analytic formulas show an excellent agreement with the numerical results.

We stress the consequences of (47–49). First, they give a firm indication that it is reasonable to apply the method of LS after the ℓ₁-norm regularization, which is heuristically employed in related problems such as compressed sensing in practical situations. The difference between (48) and (49) indicates that the method of LS actually diminishes the distortion, and even eliminates it in the ideal limit α → 0, which never happens with only the use of the ℓ₁-norm regularization. Second, (47) provides a general bound for the computational cost of searching for the appropriate basis vectors. From (47), given a target value of the distortion ǫ̂ and data of length M, the required size N_req(ǫ̂, M) of the basis matrix to achieve this distortion value scales as

N_{\mathrm{req}}(\hat\epsilon, M) \propto M \hat\epsilon^{-\frac{1-r}{r}} . \quad (\ell_0)  (50)

This grows in a polynomial manner as the target distortion value ǫ̂ decreases, and the magnitude of its (negative) exponent grows as the compression rate r decreases. This quantitative information will provide a theoretical basis for designing algorithms. Finally, (49) manifests the limit of the ℓ₁-based method. The size N_req required to achieve the target distortion ǫ̂ in this case scales as

N_{\mathrm{req}}(\hat\epsilon, M) \propto M e^{c/\hat\epsilon} , \quad (\ell_1 + \mathrm{LS})  (51)

with a positive constant c. This grows exponentially as ǫ̂ decreases, which is considered to be reasonable: if it were polynomial, versatile algorithms exactly solving the ℓ₁-norm regularization could be applied to solve the problem with a computational cost of polynomial order in the system size and the precision, which is believed not to be possible. However, (51) can still be useful, because it provides a quantitative comparison between the data size M and the acceptable distortion ǫ̂ in a unified manner.
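As a worked illustration of (50) and (51) (with all undetermined prefactors and the constant in the exponent set to one, so only the growth rates, not the absolute values, are meaningful), one can tabulate how fast the required basis size grows as the target distortion shrinks:

```python
import numpy as np

# Scaling of the required basis size N_req with the target distortion,
# eqs. (50) and (51); prefactors are unknown and set to 1 here.
M, r = 64, 0.5
for eps_hat in [0.1, 0.05, 0.02]:
    n_l0 = M * eps_hat ** (-(1 - r) / r)   # eq. (50): polynomial growth
    n_l1 = M * np.exp(1.0 / eps_hat)       # eq. (51): exponential growth
    print(f"eps_hat={eps_hat:.2f}  N_l0 ~ {n_l0:.3g}  N_l1+LS ~ {n_l1:.3g}")
```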
IV. EXAMINATION OF PRACTICAL PERFORMANCE

A. Algorithms and their performances

A lot of computational time is required to conduct the exhaustive search used in the ℓ₀-based method. However, certain greedy algorithms are considered to work well for practical applications. Orthogonal matching pursuit (OMP, figure 8) is a greedy algorithm that may be suitable for the present purpose [31, 32]. OMP only requires a computational time of order O(M³) for the current purpose. We compare the performance of OMP with the ideal performances of both the ℓ₀- and ℓ₁-based methods.

Input: a data vector y, a basis matrix A, a rate r.
Initialization: x^(0) = 0, U = {1, 2, ..., N}, S^(0) = ∅.
Iteration: repeat from n = 1 until n = rM:
  r = y − A x^(n−1),
  j = arg max_{k ∈ U \ S^(n−1)} { |a_k^T r| },
  S^(n) = S^(n−1) ∪ {j},
  x^(n) = arg min_x { ‖y − Ax‖₂² } subj. to supp(x) ⊂ S^(n).
Output: a compressed vector x̂ = x^(rM).

FIG. 8. The procedure of OMP. ∅ is the empty set. supp(·) is the support set.

In addition to OMP, we also examine approximate message passing (AMP, figure 9), as a representative algorithm carrying out the ℓ₁-norm regularization. From the viewpoint of quadratic programming, the ℓ₁-norm regularization is solved exactly using versatile algorithms, which require a computational time of order O(M³). In contrast, AMP only requires a computational time of order O(M²) per update. Despite the low computational cost, AMP is known to be able to recover the results of those versatile algorithms in certain reasonable situations [33].

Input: a data vector y, a basis matrix A, a regularization coefficient λ, a tuning parameter δ.
Initialization: x^(0) = 0, χ^(0) = 0, r^(0) = y.
Iteration: repeat until convergence at n = n̂:
  Q̂ = 1/(1 + χ^(n−1)),
  r^(n) = (1 − Q̂) r^(n−1) + Q̂ (y − A x^(n−1)),
  h = A^T r^(n) + Q̂ x^(n−1),
  χ^(n) = (1 − δ) χ^(n−1) + (δ/(Q̂M)) Σ_i Θ(|h_i| − λ),
  x_i^(n) = (1 − δ) x_i^(n−1) + (δ/Q̂) sign(h_i)(|h_i| − λ) Θ(|h_i| − λ), for i = 1, ..., N.
Output: a compressed vector x̂ = x^(n̂).

FIG. 9. The procedure of AMP. sign(·) is the sign function. Θ(·) is the Heaviside step function.

The present case, where the basis matrix A and the data vector y are generated from i.i.d. normal distributions, is expected to be one such situation. Hence, we can fairly compare the result of AMP with the ideal performance of the ℓ₁-based method, and therefore with that of OMP.

We evaluate the performances of OMP and AMP when they are employed for sparse approximation with the OCB-based strategy. We examine the case with σ_y² = 1 and α = 0.5.
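For concreteness, a compact sketch of both procedures follows (our own transcription of figures 8 and 9; where superscripts were ambiguous in the source, such as the 1/(1+χ) form of Q̂, the reading is our assumption rather than a verified detail of the original implementation). The fixed points of the damped iteration below satisfy the LASSO optimality conditions of (9):

```python
import numpy as np

def omp(y, A, k):
    """Orthogonal matching pursuit, following figure 8: greedily add the
    column most correlated with the residual, then refit by LS."""
    M, N = A.shape
    S, x = [], np.zeros(N)
    for _ in range(k):
        res = y - A @ x
        scores = np.abs(A.T @ res)
        scores[S] = -np.inf                 # exclude already chosen columns
        S.append(int(np.argmax(scores)))
        x = np.zeros(N)
        x[S] = np.linalg.lstsq(A[:, S], y, rcond=None)[0]
    return x

def amp_l1(y, A, lam, delta=0.1, n_iter=2000):
    """Damped AMP-style iteration for the l1 regularization, as we read
    figure 9; small delta helps convergence."""
    M, N = A.shape
    x, chi, res = np.zeros(N), 0.0, y.copy()
    for _ in range(n_iter):
        Q = 1.0 / (1.0 + chi)
        res = (1 - Q) * res + Q * (y - A @ x)
        h = A.T @ res + Q * x
        active = np.abs(h) > lam
        chi = (1 - delta) * chi + delta * active.sum() / (Q * M)
        x = (1 - delta) * x + (delta / Q) * np.sign(h) * (np.abs(h) - lam) * active
    return x

# Demo on the random OCB setting of the paper (alpha = 0.5, r ~ 0.5).
rng = np.random.default_rng(4)
M, N = 100, 200
y = rng.normal(size=M)
A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))
for x in (omp(y, A, k=M // 2), amp_l1(y, A, lam=0.65)):
    print(np.count_nonzero(x) / M, np.sum((y - A @ x) ** 2) / (2 * M))
```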
Figure 10 presents the results of the performance evaluations of OMP and AMP. Figure 10(a) shows the results for finite-size systems, from M = 50 upwards, and the extrapolation by the linear regression using an asymptotic form of ǫ ≈ a + bM⁻¹. The compression rate is set to r = 0.5 in OMP, and λ is set to 0.65 when evaluating AMP, so that r ≈ 0.5. We evaluate the performance of AMP based on the distortion after the method of LS. In figure 10 (b), we compare the extrapolated performances of OMP and AMP at various rates with the achievable trade-off relation analyzed in section III. The AMP result compares well with the ideal performance of the ℓ₁-based method, while that of OMP does not reach the ideal result of the ℓ₀-based method. However, a notable finding is that OMP considerably outperforms the ℓ₁-based results. This motivates the exploration of better algorithms for the ℓ₀-based method in the context of sparse approximation. Such exploration is currently under way.

FIG. 10. Performances of OMP and AMP in the case with σ_y² = 1 and α = 0.5. The performance of AMP is evaluated after the method of LS. (a) Plots of the numerically evaluated distortions in the case of r = 0.5. The extrapolation lines are given by the linear regression. On the vertical axis, the symbols represent extrapolated values in the M → ∞ limit. The lengths of the error bars are comparable to the sizes of the symbols. In OMP, r is set to 0.5, and in AMP, λ is set to 0.65, so that r ≈ 0.5. (b) Trade-off relations in the M → ∞ limit. The circles and crosses represent extrapolated values of OMP and AMP, respectively.
FIG. 11. Application of sparse approximation with the OCB-based strategy to image data compression. The degree of overcompleteness is α = 0.5. (a) Original image data. (b) Compressed image data obtained using OMP. The compression rate is r = 0.5. The PSNR is 28.2. The time required is approximately 55 sec. (c) Compressed image data obtained using AMP. The regularization coefficient is λ = 0.65, so that r ≈ 0.5. The AMP-compressed representation is given after the method of LS. The PSNR is 22.9. The time required is approximately 4.5 sec. The image is © Playboy Enterprises, Inc.
B. Application to image data
We investigate the performance of sparse approximation when it is applied to a task of image data compression. We compress image data composed of 256 × 256 pixels. The experimental procedure of compression is as follows. First, the image data are normalized so as to set the mean and variance to 0 and 1, respectively. Next, the 256 × 256 pixels are randomly permuted, in order to obtain 1024 column vectors whose dimension is 64. Following these operations, the data can be regarded as random numbers with a mean and variance of 0 and 1, which brings the properties of the data close to the situation that we have already studied theoretically and numerically. Finally, setting r = 0.5, we compress each of the column vectors into a compressed vector by using a 64 × 128 random matrix, namely α = 0.5. We examine the performances of OMP and AMP. When applying AMP, we set the regularization coefficient to 0.65, so that r ≈ 0.5, and the method of LS is operated after the support estimation by the ℓ₁-norm regularization. The results of the experiments are presented in figure 11. Although OMP requires a computational time that is several times larger than that of AMP, OMP outperforms AMP in terms of appearance and the peak signal-to-noise ratio (PSNR), defined as

\mathrm{PSNR} = 10 \log_{10} \frac{N_{\mathrm{pix}}}{\sum_{ij} (\hat{I}_{ij} - I_{ij})^2} ,  (52)

where I = {I_ij} and Î = {Î_ij} represent an original image and a compressed image, respectively, and N_pix is the number of image pixels.

If the scope of application is limited to image data compression, more convenient bases, such as a discrete wavelet transform, will achieve much better results in performance and computational time [43, 44]. However, in general contexts it is not easy to find a proper basis for sparse approximation in advance. A solution to this problem is to use blind compressed sensing and related techniques such as dictionary learning [45–47], but the computational costs are rather high. Our OCB-based strategy may overcome this difficulty, because it avoids the learning of the dictionary by preparing many candidates for basis vectors and choosing a suitable combination. Our theoretical analysis and numerical experiments positively support this possibility.
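The pipeline just described is easy to reproduce. Below is a minimal sketch (our own; it reuses the omp routine from the earlier listing, and the synthetic array stands in for a real 256 × 256 grayscale image) of the normalize–permute–compress loop and the PSNR of (52), with the image scaled to unit peak:

```python
import numpy as np

def compress_image(img, solver, r=0.5, alpha=0.5, dim=64, seed=0):
    """Normalize a 256x256 image, permute pixels into 64-dim columns,
    compress each with a random OCB (alpha = M/N), and reassemble."""
    rng = np.random.default_rng(seed)
    mean, std = img.mean(), img.std()
    z = (img - mean) / std                       # zero mean, unit variance
    perm = rng.permutation(z.size)               # random pixel permutation
    cols = z.ravel()[perm].reshape(dim, -1)      # 64 x 1024 column vectors

    N = int(dim / alpha)                         # 128 basis vectors
    A = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(dim, N))
    k = int(r * dim)                             # rM nonzeros per column
    rec = np.column_stack([A @ solver(c, A, k) for c in cols.T])

    out = np.empty(z.size)
    out[perm] = rec.ravel()                      # undo the permutation
    return out.reshape(img.shape) * std + mean

def psnr(orig, comp):
    """Eq. (52), assuming images scaled to unit peak."""
    return 10 * np.log10(orig.size / np.sum((comp - orig) ** 2))

img = np.random.default_rng(5).random((256, 256))  # stand-in image
rec = compress_image(img, omp, r=0.5, alpha=0.5)
print(psnr(img, rec))
```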
V. CONCLUSION

In the present paper, sparse-data processing has been discussed from the viewpoint of sparse approximation. We have focused on a strategy of sparse approximation that is based on a random OCB, and have discussed the use of the ℓ₀- and ℓ₁-based methods. We have analyzed the ideal performances of these methods in the large-system limit in a statistical-mechanical manner, which has been validated by numerical simulations on finite-size systems and their extrapolation to the infinite-size limit. Our results have indicated that the ℓ₀-based method outperforms the naive and ℓ₁-based methods in terms of the trade-off relation between the representation distortion and the compression rate. A notable result is that any small distortion is achievable for any finite fixed value of the compression rate, by increasing the degree of overcompleteness, for both the ℓ₀- and ℓ₁-based methods. This result allows us to determine both the theoretical limit of the OCB-based strategy and the limit of practical algorithms based on the ℓ₁ regularization. In addition, it provides a firm basis for the use of the method of LS after the ℓ₁ regularization, which is frequently applied in related problems such as compressed sensing in practical situations.

In addition to the ideal performance analyzed in section III, we also investigated the practical performance of our strategy in section IV. We evaluated the performances of OMP and AMP as algorithms to approximately perform the ℓ₀- and ℓ₁-based methods, respectively. Our evaluation showed that OMP surpasses both AMP and the exact execution of the ℓ₁-based method, in terms of the trade-off relation. This suggests that greedy algorithms are more suitable for sparse approximation using our strategy than convex relaxation algorithms, although there is still room to design more effective greedy algorithms than OMP. We are currently undertaking further research in this direction.

We considered the application of our method to image data compression, as a practical example, and evaluated its performance when OMP and AMP are utilized. OMP outperforms AMP in appearance and PSNR, although OMP requires a computational time that is several times larger. In order to efficiently decrease the computational time of our strategy, it is important to find a proper basis. This suggests the use of some prior knowledge in constructing the overcomplete basis. Some further possibilities, such as combining our methods with dictionary learning, are still open, and would be interesting to address in future work.

ACKNOWLEDGMENTS
This work was supported by JSPS KAKENHI Grant Numbers 13J04920 (YN-O), 26870185 (TO), 25120009 (MO), and 25120013 (YK).
Appendix A: Calculations for the ℓ₀-based method

1. Derivation of φ

Based on (26), we define

\psi(n, \nu, \mu) = \frac{1}{M} \ln \left[ \left\{ \mathrm{Tr}_c \left( \int d_c x \, e^{-\frac{\mu}{2\nu} \|y - A(c\circ x)\|_2^2} \right)^{\nu} \right\}^{n} \right]_{y,A} .  (A1)

The cumulant-generating function φ is recovered from ψ as φ(µ) = lim_{n,ν→0} (1/n) ψ(n, ν, µ). When (n, ν) are positive integers, we obtain

\psi(n, \nu, \mu) = \frac{1}{M} \ln \mathrm{Tr}_{\{c^a\}} \mathrm{Tr}_{\{x^{a\alpha}\}|\{c^a\}} \left[ e^{-\frac{\mu}{2\nu} \sum_{j=1}^{M} \sum_{a=1}^{n} \sum_{\alpha=1}^{\nu} \left( y_j - \sum_i A_{ji} c_i^a x_i^{a\alpha} \right)^2 } \right]_{y,A} ,  (A2)

where \mathrm{Tr}_{\{c^a\}} = \prod_{a=1}^{n} \sum_{c^a} \delta(Mr - \sum_i c_i^a), and \mathrm{Tr}_{\{x^{a\alpha}\}|\{c^a\}} = \prod_{a=1}^{n} \prod_{\alpha=1}^{\nu} \int d_{c^a} x^{a\alpha}. Let us introduce the variables s_j^{a\alpha} = \sum_i A_{ji} c_i^a x_i^{a\alpha} and Q^{(a\alpha)(b\beta)} = \frac{1}{M} \sum_i (c_i^a x_i^{a\alpha})(c_i^b x_i^{b\beta}). According to the central limit theorem, we regard the variables {s_j^{aα}} as random variables that follow a zero-mean multivariate normal distribution, with covariances [s_j^{a\alpha} s_k^{b\beta}]_A = \delta_{jk} Q^{(a\alpha)(b\beta)}. Using these variables, we obtain

\psi(n, \nu, \mu) = \frac{1}{M} \ln \mathrm{Tr}_{\{c^a\}} \mathrm{Tr}_{\{x^{a\alpha}\}|\{c^a\}} \mathrm{Tr}_Q \left( \left[ e^{-\frac{\mu}{2\nu} \sum_a \sum_\alpha (y - s^{a\alpha})^2} \right]_{y, \{s^{a\alpha}\}|Q} \right)^{M} ,  (A3)

where \mathrm{Tr}_Q = \prod_{(a\alpha),(b\beta)} \int dQ^{(a\alpha)(b\beta)} \, \delta\!\left( M Q^{(a\alpha)(b\beta)} - \sum_i (c_i^a x_i^{a\alpha})(c_i^b x_i^{b\beta}) \right), and the brackets [·]_{y,{s^{aα}}|Q} denote the average over y and s^{aα}, which is conditioned by the covariance Q^{(aα)(bβ)} as explained above.

After introducing the Fourier representation of the delta function, δ(·) ∝ ∫ dx̃ e^{x̃(·)}, the saddle-point method is employed to obtain

\psi(n, \nu, \mu) = \mathrm{extr}_{\Theta} \Biggl\{ \ln \left[ e^{-\frac{\mu}{2\nu}\sum_a \sum_\alpha (y - s^{a\alpha})^2} \right]_{y,\{s^{a\alpha}\}|Q} + \sum_a \tilde{r}^a r + \sum_{(a\alpha),(b\beta)} \tilde{Q}^{(a\alpha)(b\beta)} Q^{(a\alpha)(b\beta)} + \frac{1}{\alpha} \ln \sum_{\{c^a\}} \mathrm{Tr}_{\{x^{a\alpha}\}|\{c^a\}} \, e^{ -\sum_a \tilde{r}^a c^a - \sum_{(a\alpha),(b\beta)} \frac{\tilde{Q}^{(a\alpha)(b\beta)}}{2} (c^a x^{a\alpha})(c^b x^{b\beta}) } \Biggr\} ,  (A4)

where Θ = {Q, r̃, Q̃}. For the extremizer, we search the subspace with (Q^{(aα)(bβ)}, Q̃^{(aα)(bβ)}) equal to (Q, Q̃) (a = b, α = β), (q₁, −q̃₁) (a = b, α ≠ β), or (q₀, −q̃₀) (a ≠ b), with r̃^a = r̃. This is the RS ansatz in the present formula with two replica numbers n and ν. Then, we obtain

\psi(n, \nu, \mu) = \mathrm{extr}_{\tilde\Theta} \Biggl\{ \ln \int Dy \, Dw \left( \int Dv \left( \int Du \, e^{-\frac{\mu}{2\nu}\left( \sigma_y y - \sqrt{q_0}\,w - \sqrt{q_1-q_0}\,v - \sqrt{Q-q_1}\,u \right)^2} \right)^{\nu} \right)^{n} + \frac{1}{2} n\tilde{r}r + \frac{1}{2} n\nu\tilde{Q}Q - \frac{1}{2} n\nu(\nu-1)\tilde{q}_1 q_1 - \frac{1}{2} n(n-1)\nu^2 \tilde{q}_0 q_0 + \frac{1}{\alpha} \ln \int Dz \left( \int Dt \sum_{c} \left( \mathrm{Tr}_{x|c} \, e^{ -\tilde{r}\nu c - \frac{\tilde{Q}+\tilde{q}_1}{2} c x^2 + \left( t\sqrt{\tilde{q}_1 - \tilde{q}_0} + z\sqrt{\tilde{q}_0} \right) c x } \right)^{\nu} \right)^{n} \Biggr\} ,  (A5)

where \tilde\Theta = \{Q, q_1, q_0, \tilde{r}, \tilde{Q}, \tilde{q}_1, \tilde{q}_0\} and \mathrm{Tr}_{x|c} = \int d_c x. We assume that (A5) holds not only for positive integers (n, ν) but also for real numbers (n, ν). In taking the limits (n, ν) → (0, 0), we define χ = β(Q − q₁), q = q₀, \hat{r} = \tilde{r}, \hat{Q} = \nu(\tilde{Q}+\tilde{q}_1) - \nu^2\tilde{q}_0, \hat\chi = \nu\tilde{q}_1, and \hat{q} = \nu^2\tilde{q}_0, which are assumed to be of the order O(1) in these limits. Following some straightforward calculations, the replica identity is given by

\phi(\mu) = \lim_{n\to 0} \lim_{\nu\to 0} \frac{1}{n} \psi(n, \nu, \mu) ,  (A6)

thus yielding (19).

2. The limit α → 0 in the ℓ₀ case

We examine the behavior of the zero point of the entropy, ǫ₋, in the large-size limit of the basis matrix, α → 0. The parameter µ₀ corresponding to the zero point ǫ₋ can be formally written using (23, 24), as

\mu_0 = -\frac{\hat\chi(\mu_0)}{2\phi(\mu_0)} = -\frac{1}{2}\hat\chi \left\{ \frac{1}{2}\ln\frac{1+\chi}{1+\chi+\mu\Delta} - \frac{\mu}{2}\frac{q+\sigma_y^2}{1+\chi+\mu\Delta} + \frac{1}{2}\left( \hat{r}r + \hat{Q}Q - \frac{\hat\chi\chi}{\mu} + \hat{q}q \right) + \frac{1}{\alpha}\int Dz \, \ln\!\left( \sqrt{\frac{\hat\chi+\hat{Q}}{\hat{Q}+\hat{q}}} \, e^{-\hat{r} + \frac{1}{2}\frac{\hat{q}}{\hat{Q}+\hat{q}}z^2} \right) \right\}^{-1} .  (A7)

A numerical calculation indicates the behavior µ₀ → ∞ as α → 0, while Q̂, q̂, Q, q, χ ∼ O(1) are kept finite. We will determine the scalings of the relevant variables for α → 0. In this limit, Y should vanish in order to cancel the vanishing α, yielding

\frac{1}{\alpha} Y \propto \frac{\sqrt{\mu}}{\alpha} e^{-\hat{r}} = O(1) \;\Rightarrow\; \hat{r} = \tilde{r}_0 - \rho\ln\alpha, \quad \mu \propto \alpha^{-\rho} ,  (A8)

where r̃₀ = O(1) and we introduce an exponent ρ controlling the divergence speed of r̂ and µ. Since we assume the divergence of µ, ρ must be larger than unity. The value of ρ is determined by solving (A7) in a self-consistent manner. The scaling of the remaining order parameter χ̂ is determined by

\hat\chi \to \frac{\mu^2 (q + \sigma_y^2)}{(1+\chi+\mu\Delta)^2} \to \infty .  (A9)

Now we know all of the scalings of the order parameters, and can reduce (A7) to its dominant part, as

\mu_0 \approx \frac{\mu_0^2 (q+\sigma_y^2)/(1+\chi+\mu_0\Delta)^2}{\rho r \ln(1/\alpha) + \ln(1+\chi+\mu_0\Delta)} .  (A10)

By solving this in the leading scaling, we obtain

\rho = \frac{1}{1-r}, \qquad \mu_0 \approx \frac{e(1+\chi)^2}{\Delta}\,\alpha^{-\frac{r}{1-r}} + O(1) .  (A11)

By inserting (A9, A11) into (23), we obtain (47).
Appendix B: Some calculations for the ℓ₁-based methods

1. Derivations of f and ǫ_LS1

Based on (45), we introduce

\psi_1(n, \nu, \beta, \mu, \kappa) = \frac{1}{M} \ln \left[ Z^{n-1}(\mu, \kappa|y, A) \int d\xi \, e^{-\mu(H(\xi|y,A)+\kappa\|\xi\|_0)} \left( \int d_{|\xi|_0} x \, e^{-\frac{\beta}{2}\|y - A(|\xi|_0\circ x)\|_2^2} \right)^{\nu} \right]_{y,A} .  (B1)

By calculating this in the case of positive integers (n, ν), we obtain

\psi_1(n, \nu, \beta, \mu, \kappa) = \frac{1}{M} \ln \mathrm{Tr}_{\{\xi^a\}} \mathrm{Tr}_{\{x^\alpha\}} \left( \left[ e^{-\frac{\mu}{2}\sum_a \left( y_j - \sum_i A_{ji}\xi_i^a \right)^2 - \frac{\beta}{2}\sum_\alpha \left( y_j - \sum_i A_{ji} |\xi_i^1|_0 x_i^\alpha \right)^2} \right]_{y,A} \right)^{M} ,  (B2)

where \mathrm{Tr}_{\{\xi^a\}} = \prod_{a=1}^{n} \int d\xi^a \, e^{-\mu(\lambda\sum_i|\xi_i^a| + \kappa\sum_i |\xi_i^a|_0)}, and \mathrm{Tr}_{\{x^\alpha\}} = \prod_{\alpha=1}^{\nu} \int d_{|\xi^1|_0} x^\alpha. Let us introduce the variables s'^a_j = \sum_i A_{ji}\xi_i^a, s^\alpha_j = \sum_i A_{ji}|\xi_i^1|_0 x_i^\alpha, P^{ab} = \frac{1}{M}\sum_i \xi_i^a \xi_i^b, C^{a\alpha} = \frac{1}{M}\sum_i (|\xi_i^1|_0 x_i^\alpha)\xi_i^a, and Q^{\alpha\beta} = \frac{1}{M}\sum_i (|\xi_i^1|_0 x_i^\alpha)(|\xi_i^1|_0 x_i^\beta). As in the ℓ₀ case, we can regard the variables {s'^a_j, s^α_j} as random variables from a zero-mean multivariate normal distribution, with the covariances [s'^a_j s'^b_k]_A = \delta_{jk} P^{ab}, [s'^a_j s^\alpha_k]_A = \delta_{jk} C^{a\alpha}, and [s^\alpha_j s^\beta_k]_A = \delta_{jk} Q^{\alpha\beta}. The application of the central limit theorem here is justified by the nonzeroness of the compression rate r shown in figure 5 (b) and derived from (36). Using these variables, we obtain

\psi_1(n, \nu, \beta, \mu, \kappa) = \frac{1}{M} \ln \mathrm{Tr}_{\{\xi^a\}} \mathrm{Tr}_{\{x^\alpha\}} \mathrm{Tr}_P \mathrm{Tr}_C \mathrm{Tr}_Q \left( \left[ e^{-\frac{\mu}{2}\sum_a (y - s'^a)^2 - \frac{\beta}{2}\sum_\alpha (y - s^\alpha)^2} \right]_{y, \{s'^a, s^\alpha\}|P, C, Q} \right)^{M} ,  (B3)

where \mathrm{Tr}_P = \prod_{a,b} \int dP^{ab} \delta(MP^{ab} - \sum_i \xi_i^a\xi_i^b), \mathrm{Tr}_C = \prod_{a,\alpha} \int dC^{a\alpha} \delta(MC^{a\alpha} - \sum_i (|\xi_i^1|_0 x_i^\alpha)\xi_i^a), and \mathrm{Tr}_Q = \prod_{\alpha,\beta} \int dQ^{\alpha\beta} \delta(MQ^{\alpha\beta} - \sum_i (|\xi_i^1|_0 x_i^\alpha)(|\xi_i^1|_0 x_i^\beta)). After introducing the Fourier representation of the delta function, the saddle-point method is employed to obtain

\psi_1(n, \nu, \beta, \mu, \kappa) = \mathrm{extr}_{\Theta} \Biggl\{ \ln \left[ e^{-\frac{\mu}{2}\sum_a (y - s'^a)^2 - \frac{\beta}{2}\sum_\alpha (y - s^\alpha)^2} \right]_{y, \{s'^a, s^\alpha\}|P, C, Q} + \sum_{a,b} \tilde{P}^{ab} P^{ab} + \sum_{a,\alpha} \tilde{C}^{a\alpha} C^{a\alpha} + \sum_{\alpha,\beta} \tilde{Q}^{\alpha\beta} Q^{\alpha\beta} + \frac{1}{\alpha} \ln \mathrm{Tr}_{\{\xi^a\}} \mathrm{Tr}_{\{x^\alpha\}} \, e^{-\sum_{a,b} \frac{\tilde{P}^{ab}}{2}\xi^a\xi^b - \sum_{a,\alpha} \tilde{C}^{a\alpha} \xi^a (|\xi^1|_0 x^\alpha) - \sum_{\alpha,\beta} \frac{\tilde{Q}^{\alpha\beta}}{2} (|\xi^1|_0 x^\alpha)(|\xi^1|_0 x^\beta)} \Biggr\} ,  (B4)

where Θ = {P, C, Q, P̃, C̃, Q̃}. For the extremizer, we search the subspace with (P^{ab}, P̃^{ab}) equal to (P, P̃) (a = b) or (p, −p̃) (a ≠ b); (C^{aα}, C̃^{aα}) equal to (C, −C̃) (a = 1) or (c, −c̃) (a ≠ 1); and (Q^{αβ}, Q̃^{αβ}) equal to (Q, Q̃) (α = β) or (q, −q̃) (α ≠ β). This is the RS assumption for the present case. Thus, we obtain

\psi_1(n, \nu, \beta, \mu, \kappa) = \mathrm{extr}_{\tilde\Theta_{\mathrm{LS1}}, \tilde\Theta_1} \Biggl\{ \ln \int Dy\,Dz\,Dw \left( \int Dv\, e^{-\frac{\mu}{2}(\sigma_y y - \sqrt{p}\,w - \sqrt{P-p}\,v)^2} \right)^{n-1} \times \int Dv\, e^{-\frac{\mu}{2}(\sigma_y y - \sqrt{p}\,w - \sqrt{P-p}\,v)^2} \left( \int Du\, e^{-\frac{\beta}{2}\left(\sigma_y y - \frac{c}{\sqrt{p}}w - \frac{C-c}{\sqrt{P-p}}v - \sqrt{q - \frac{(C-c)^2}{P-p} - \frac{c^2}{p}}\,z - \sqrt{Q-q}\,u\right)^2} \right)^{\nu} + \frac{1}{2}n\tilde{P}P - \frac{1}{2}n(n-1)\tilde{p}p - \nu\tilde{C}C - (n-1)\nu\tilde{c}c + \frac{1}{2}\nu\tilde{Q}Q - \frac{1}{2}\nu(\nu-1)\tilde{q}q + \frac{1}{\alpha} \ln \int Dz\,Dw\,Dv\,Du \left( \mathrm{Tr}_\xi\, e^{-\frac{\tilde{P}+\tilde{p}}{2}\xi^2 + (u\sqrt{\tilde{p}-\tilde{c}} + v\sqrt{\tilde{c}})\xi} \right)^{n-1} \times \mathrm{Tr}_\xi\, e^{-\frac{\tilde{P}+\tilde{p}+\tilde{C}-\tilde{c}}{2}\xi^2 + (u\sqrt{\tilde{p}-\tilde{c}} + w\sqrt{\tilde{C}-\tilde{c}} + v\sqrt{\tilde{c}})\xi} \left( \mathrm{Tr}_x\, e^{-\frac{\tilde{Q}+\tilde{q}+\tilde{C}-\tilde{c}}{2}(|\xi|_0 x)^2 + (z\sqrt{\tilde{q}-\tilde{c}} + w\sqrt{\tilde{C}-\tilde{c}} + v\sqrt{\tilde{c}})|\xi|_0 x} \right)^{\nu} \Biggr\} ,  (B5)

where \tilde\Theta_{\mathrm{LS1}} = \{C, c, Q, q, \tilde{C}, \tilde{c}, \tilde{Q}, \tilde{q}\} and \tilde\Theta_1 = \{P, p, \tilde{P}, \tilde{p}\}.

The free-energy density f is now derived as

f(\mu, \kappa) = -\lim_{n\to 0} \lim_{\nu\to 0} \frac{1}{\mu n} \psi_1(n, \nu, \beta, \mu, \kappa)  (B6)
= \mathrm{extr}_{\tilde\Theta_1} \left\{ \frac{1}{2\mu}\ln(1+\mu(P-p)) + \frac{1}{2}\frac{P+\sigma_y^2}{1+\mu(P-p)} - \frac{1}{2\mu}(\tilde{P}P + \tilde{p}p) - \frac{1}{\mu\alpha} \int Dv \, \ln \mathrm{Tr}_\xi \, e^{-\frac{\tilde{P}+\tilde{p}}{2}\xi^2 + v\sqrt{\tilde{p}}\,\xi} \right\} .  (B7)

In the limit µ → ∞, we introduce χ_p = µ(P − p), \hat{P} = \mu^{-1}(\tilde{P}+\tilde{p}), and \hat\chi_p = \mu^{-2}\tilde{p}, which are assumed to be of the order O(1). Taking the µ → ∞ limit in (B7) leads to (34). On the other hand, in order to evaluate ǫ_LS1, in addition to \hat\Theta = \{P, \chi_p, \hat{P}, \hat\chi_p\}, in taking the limit µ → ∞ we define the parameters χ_c = β(C − c), χ_q = β(Q − q), \hat{C} = \beta^{-1}(\tilde{C}+\tilde{c}), \hat\chi_c = \beta^{-2}\tilde{c}, \hat{Q} = \beta^{-1}(\tilde{Q}+\tilde{q}), and \hat\chi_q = \beta^{-2}\tilde{q}, which are assumed to be of the order O(1). Then, through the formula

\epsilon_{\mathrm{LS1}} = \lim_{\beta\to\infty} \lim_{\mu\to\infty} \lim_{n\to 0} \lim_{\nu\to 0} -\frac{1}{\beta\nu} \psi_1(n, \nu, \beta, \mu, 0) ,  (B8)

we obtain (42).

2. The limit α → 0 in the ℓ₁ case

The EOSs (35) show that in the limit α → 0,

\chi_p, \hat{P}, \hat\chi_p = O(1) .  (B9)

From (36) and the asymptotic form of the complementary error function erfc(·), we see that in the limit α → 0,

\frac{e^{-\theta^2}}{\alpha\sqrt{\pi}\,\theta} = O(1) \;\Rightarrow\; \theta = O\!\left(\sqrt{|\ln\alpha|}\right) \to \infty ,  (B10)

which is realized by controlling λ as O(\sqrt{|\ln\alpha|}). Using these scalings, and the asymptotic expansion of the complementary error function for large θ in (35d), we obtain

P = O(\theta^{-2}) = O(|\ln\alpha|^{-1}) \to 0 .  (B11)

By inserting these scalings into (38), we obtain (48).

The asymptotic form of ǫ_LS1 can be obtained similarly. Following some lengthy but straightforward calculations, we obtain

\hat\chi_q = O(|\ln\alpha|^{-1}) \to 0 ,  (B12a)
\hat{Q} = O(1) ,  (B12b)
\hat\chi_c = O(|\ln\alpha|^{-1}) \to 0 ,  (B12c)
\hat{C} = O(1) ,  (B12d)
\chi_q = O(1) ,  (B12e)
Q = O(|\ln\alpha|^{-1}) \to 0 ,  (B12f)
\chi_c = O(1) ,  (B12g)
C = O(|\ln\alpha|^{-1}) \to 0 .  (B12h)

By substituting these scalings into (44c), we obtain (49).

[1] K. Pearson, Philosophical Magazine, (11), 559–572 (1901).
[2] C. Eckart and G. Young, Psychometrika, (3), 211–218 (1936).
[3] X. Su and T.M. Khoshgoftaar, Advances in Artificial Intelligence, 421425 (2009).
[4] I. Markovsky, Low Rank Approximation: Algorithms, Implementation, Applications, Springer, London (2012).
[5] B.A. Olshausen and D.J. Field, Nature, 607–609 (1996).
[6] B.A. Olshausen and D.J. Field, Vision Research, (23), 3311–3325 (1997).
[7] B.A. Olshausen and D.J. Field, Current Opinion in Neurobiology, (4), 481–487 (2004).
[8] H. Terashima and H. Hosoya, Network: Computation in Neural Systems, (4), 253–267 (2009).
[9] H. Terashima and M. Okada, in Advances in Neural Information Processing Systems 25, 2312–2320 (2012).
[10] H. Terashima, H. Hosoya, T. Tani, N. Ichinohe and M. Okada, Neurocomputing, (1), 14–21 (2013).
[11] D.L. Donoho, IEEE Transactions on Information Theory, (4), 1289–1306 (2006).
[12] E.J. Candès and T. Tao, IEEE Transactions on Information Theory, (12), 4203–4215 (2005).
[13] E.J. Candès, J. Romberg and T. Tao, IEEE Transactions on Information Theory, (2), 489–509 (2006).
[14] E.J. Candès and T. Tao, IEEE Transactions on Information Theory, (12), 5406–5425 (2006).
[15] B.K. Natarajan, SIAM Journal on Computing, (2), 227–234 (1995).
[16] G. Davis, S. Mallat, and M. Avellaneda, Constructive Approximation, (1), 57–98 (1997).
[17] V.N. Temlyakov, Advances in Computational Mathematics, (3), 249–265 (1998).
[18] V.N. Temlyakov, Journal of Approximation Theory, (1), 117–145 (1999).
[19] V.N. Temlyakov, Advances in Computational Mathematics, (2), 213–227 (2000).
[20] V.N. Temlyakov, Nonlinear Methods of Approximation, (1), 33–107 (2003).
[21] A.C. Gilbert, S. Muthukrishnan, and M.J. Strauss, Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 243–252 (2003).
[22] J.A. Tropp, A.C. Gilbert, S. Muthukrishnan, M.J. Strauss, Proceedings of the International Conference on Image Processing, I-37–I-40 (2003).
[23] J.A. Tropp, IEEE Transactions on Information Theory, (10), 2231–2242 (2004).
[24] D.L. Donoho, M. Elad, and V.N. Temlyakov, IEEE Transactions on Information Theory, (1), 6–18 (2006).
[25] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing, Springer, Berlin (2013).
[26] I. Rish and G. Grabarnik, Sparse Modeling: Theory, Algorithms, and Applications, CRC Press, Boca Raton (2014).
[27] H. Nishimori, Statistical Physics of Spin Glasses and Information Processing: An Introduction, Oxford University Press, Oxford (2001).
[28] V. Dotsenko, Introduction to the Replica Theory of Disordered Statistical Systems, Cambridge University Press, Cambridge (2001).
[29] R.H. Swendsen and J.-S. Wang, Physical Review Letters, (21), 2607–2609 (1986).
[30] K. Hukushima and K. Nemoto, Journal of the Physical Society of Japan, 1604–1608 (1996).
[31] Y.C. Pati, R. Rezaiifar and P.S. Krishnaprasad, 40–44 (1993).
[32] G.M. Davis, S.G. Mallat and Z. Zhang, SPIE Journal of Optical Engineering, (7), 2183–2191 (1994).
[33] D.L. Donoho, A. Maleki and A. Montanari, Proceedings of the National Academy of Sciences, (45), 18914–18919 (2009).
[34] R. Monasson and D. O'Kane, Europhysics Letters, (2), 85–90 (1994).
[35] R. Tibshirani, Journal of the Royal Statistical Society: Series B (Methodological), (1), 267–288 (1996).
[36] A. Cohen, W. Dahmen, and R. DeVore, Journal of the American Mathematical Society, (1), 211–231 (2009).
[37] T.M. Cover and J.A. Thomas, Elements of Information Theory, 2nd edition, Wiley, Hoboken (2006).
[38] A. Das and D. Kempe, Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, 45–54 (2008).
[39] A. Das and D. Kempe, Proceedings of the 28th International Conference on Machine Learning, 1057–1064 (2011).
[40] S.S. Chen, D.L. Donoho, and M.A. Saunders, SIAM Journal on Scientific Computing, (1), 33–61 (1998).
[41] S. Franz and G. Parisi, Physical Review Letters, (13), 2486 (1997).
[42] A.M. Ferrenberg and R.H. Swendsen, Physical Review Letters, (23), 2635–2638 (1988).
[43] A. Skodras, C. Christopoulos and T. Ebrahimi, IEEE Signal Processing Magazine, (5), 36–58 (2001).
[44] E.J. Candès and M.B. Wakin, IEEE Signal Processing Magazine, (2), 21–30 (2008).
[45] R. Rubinstein, A.M. Bruckstein and M. Elad, Proceedings of the IEEE, (6), 1045–1057 (2010).
[46] S. Gleichman and Y.C. Eldar, IEEE Transactions on Information Theory, (10), 6958–6975 (2011).
[47] A. Sakata and Y. Kabashima, Europhysics Letters, 28008-p1–28008-p6 (2013).