Fundamental limits and algorithms for sparse linear regression with sublinear sparsity
Lan V. Truong, Department of Engineering, The University of Cambridge. Email: [email protected]
Abstract
We establish exact asymptotic expressions for the normalized mutual information and the minimum mean-square error (MMSE) of sparse linear regression in the sub-linear sparsity regime. Our result is achieved by a simple generalization of the adaptive interpolation method in Bayesian inference from linear regimes to sub-linear ones. A modification of the well-known approximate message passing (AMP) algorithm to approach the MMSE fundamental limit is also proposed. Our results show that the traditional linear assumption between the signal dimension and the number of observations in the replica and adaptive interpolation methods is not necessary for sparse signals. They also show how to modify the existing well-known AMP algorithms for linear regimes to sub-linear ones.
I. INTRODUCTION
Estimating a signal from linear random projections has a myriad of applications, such as compressed sensing, code division multiple access in communications, error correction via sparse superposition codes, and Boolean group testing. The main question is to find information-theoretic limits for the estimation of a signal from the knowledge of its noisy random linear projections. Using the replica method, Tanaka [1] and Guo and Verdú [2] show that, under a posterior mean estimator and a linear relation between the signal dimension and the number of observations, the multiuser channel can be decoupled: each signal experiences an equivalent single-signal Gaussian channel, whose signal-to-noise ratio (SNR) suffers a degradation due to multiple-signal interference. The replica method, although very interesting, is based on some non-rigorous assumptions.

In more recent years, an adaptive interpolation method has been proposed to prove the fundamental limits predicted by the replica method in a rigorous way [3]–[5]. Roughly speaking, this method interpolates between the original problem and the mean-field replica solution in small steps, each step involving its own set of trial parameters and Gaussian mean-fields in the spirit of Guerra and Toninelli [3], [6]. The set of trial parameters can be adjusted in various ways so that we obtain upper and lower bounds that eventually match.

Although the results achieved by the replica method and its adaptive interpolation counterpart are very interesting, they are constrained to the case where the number of observations scales linearly with the signal dimension. In [7], Reeves et al. consider a binary k-sparse linear regression problem where the number of observations m is sub-linear in the signal dimension n; the authors established an "All-or-Nothing" information-theoretic phase transition at a critical sample size m* = 2k log(n/k)/log(1 + k/∆_n) for the two regimes k/∆_n = Ω(1) and k = o(√n), with ∆_n being the noise variance. Their results are based on the assumption that the sparse signal is uniformly distributed on the set {v ∈ {0,1}^n : ‖v‖_0 = k}. Sharp information-theoretic bounds were established in [8] and [9] for support recovery in linear and phase retrieval models, respectively. In addition, the "All-or-Nothing" phenomenon was also considered for Bernoulli group testing [10] and for sparse spiked matrix estimation in [11], [12], and [13]. In [14], this phenomenon was investigated for generalized linear models in sub-linear regimes with Bernoulli and Bernoulli-Rademacher distributed vectors.

In this paper, we consider the same k-sparse linear regression as [7] but over a more general signal domain. However, we assume that the signal is sparse in the expected sense, as in [14]. We show that the normalized mutual information and the minimum mean-square error can be estimated exactly when k = O(n^α) and m = δn^α for some α ∈ (0, 1]. Our result is achieved by a simple generalization of the adaptive interpolation method in Bayesian inference for linear regimes [3], [15] to sub-linear ones. A modification of the well-known Approximate Message Passing (AMP) algorithm [16] achieves this MMSE fundamental limit. AMP was initially proposed for sparse signal recovery and compressed sensing [17]–[19]. AMP algorithms achieve state-of-the-art performance for several high-dimensional statistical estimation problems, including compressed sensing and low-rank matrix estimation [16], [20].

II. PROBLEM SETTING
Let S ∈ R^n be a signal observed via a linear model with measurement matrix A ∈ R^{m×n}. Let {∆_n}_{n=1}^∞ be a positive sequence. We consider the same linear model as [3]:

Y = AS + W√∆_n,   (1)

where A ∈ R^{m×n}, S = (S_1, S_2, ..., S_n)^T ∈ R^n, W ∈ R^m, and Y ∈ R^m. Instead of assuming that m = nδ for some δ > 0, as in the standard literature on replica and adaptive interpolation methods, we assume that m = δn^α for some δ > 0 and α ∈ (0, 1]. We also assume:
• {S_n}_{n=1}^∞ is an i.i.d. sequence with S_i ∼ ˜P and E_{S∼P}[S²] < ∞ for all i ∈ [n].
• W ∼ N(0, I_m).
• ∆_n can be any function of n and α such that ∆_n = Ω_n(1).

III. MAIN RESULT
For this case, we assume that the sensing matrix A has i.i.d. Gaussian components. As in [3], let

Σ(u; v)^{−2} := δn^{α−1}/(u + v),   (2)

ψ(u; v) := (δ/2) [ log(1 + u/v) − u/(u + v) ].   (3)

(The constraint on ∆_n above is less strict than the one in [3], where the authors assumed that ∆_n is fixed.)
Define the following sequence of Replica Symmetric (RS) potentials:

f_{n,RS}(E; ∆_n) := ψ(E; ∆_n) + i_{n,den}(Σ(E; ∆_n)),   (4)

where i_{n,den}(Σ) = n^{1−α} I(S; S + ˜W Σ) is the normalized mutual information of a scalar Gaussian denoising model Y = S + ˜W Σ with S ∼ ˜P, ˜W ∼ N(0, 1), and Σ^{−2} an effective signal-to-noise ratio:

i_{n,den}(Σ) := −n^{1−α} E_{S,˜W}[ log ∫ ˜P(x) exp( −( (x − S)²/(2Σ²) − (x − S)˜W/Σ ) ) dx ].   (5)

Our main result is the following:

Theorem 1.
Let k = O(n^α) and m = δn^α for some 0 < α ≤ 1 and δ > 0. Assume that A is a Gaussian random matrix with A_ij ∼ N(0, 1/n^α), and that {S_i}_{i∈[n]} are i.i.d. and distributed according to a discrete prior ˜P(s) = (1 − k/n) δ(s) + (k/n) P(s), where P(s) = Σ_{b=1}^B p_b δ(s − a_b) with a finite number B of constant terms and max_b |a_b| ≤ s_max. Let ν_n = n^{α−1} E_{S∼P}[S²]. Then, in the large-system limit, the following holds:

lim_{n→∞} [ I(S; Y)/n^α − min_{E∈[0,ν_n]} f_{n,RS}(E; ∆_n) ] = 0,   (6)

lim_{n→∞} [ (1/n^α) Σ_{i=1}^n E[(S_i − Ŝ_i)²] − n^{1−α} ˜E(∆_n) ] = 0,   (7)

where Ŝ = E[S | Y] is the MMSE estimator, and ˜E(∆_n) is the unique global minimizer of min_{E∈[0,ν_n]} f_{n,RS}(E; ∆_n) w.r.t. E for all ∆_n ≠ ∆_{n,RS}. Here, ∆_{n,RS} is the point where argmin_{E∈[0,ν_n]} f_{n,RS}(E; ∆_n) is not unique.

Remark 2.
Some remarks are in order.
• For α = 1 (i.e., m = δn) and ∆_n = ∆ for some fixed ∆ > 0, our results recover the classical results of [1], [5], [21], [22]. In these classical papers, the authors assume that {S_n}_{n=1}^∞ are i.i.d. and S ∼ ˜P, which is a fixed distribution.
• As k = o(n^α), it holds that

m = δn^α   (8)
 = ω( 2k log(n/k) / log(1 + (k/∆) Σ_{b=1}^B p_b a_b²) )   (9)
 = ω(m*),   (10)

where m* is the "All-or-Nothing" critical sample size in Eqn. (3) of [7] or Cor. 1 of [8]. Our result shows that, under the MMSE estimator, the linear regression model can be decomposed into sub-AWGN channels, and the final MMSE estimate equals the MMSE of a (time-varying-SNR) sub-AWGN channel in the large-system limit.

IV. ALGORITHM
In this section, we propose a modification of the Approximate Message Passing (AMP) algorithm [16] that makes it work in sub-linear regimes, as follows.
Algorithm 1
AMP for sub-linear regimes.
Input: observation y, matrix sizes m, n, other parameters α, δ, number of iterations itermax; t = 1.
Initialize τ = √( ∆_n n^{1−α} + E_{S∼P}[S²]/δ ), z = 0, x̂ = 0, d = 0.
repeat
  z ← y − A x̂ + (n^{1−α}/δ) z d
  h ← A^T z + x̂
  x̂ ← η(h, τ), d ← Mean( (dη/dx)(h, τ) )
  τ ← √( ∆_n n^{1−α} + (n^{1−α}/δ) E[ (η(S + τW, τ) − S)² ] )
  t ← t + 1
until t = itermax
Output: x̂.

We compare the theoretical MMSE fundamental limit in Theorem 1 and the MSE of Algorithm 1 on a specific example, as follows. Let ∆_n = ∆ = s²_max < ∞ and

˜P(s) = (1 − k/n) δ(s) + (k/(2n)) ( δ(s − √∆) + δ(s + √∆) ),

which is the Bernoulli-Rademacher distribution (cf. [14]). With this assumption, we have

i_{n,den}(Σ) = n^{1−α} I(S; S + ˜W Σ)   (11)
 = n^{1−α} [ H(Y) − (1/2) log(2πe Σ²) ],   (12)

where

f_Y(y) = (1 − k/n) (1/(Σ√(2π))) exp( −y²/(2Σ²) ) + (k/(2n)) (1/(Σ√(2π))) [ exp( −(y − √∆)²/(2Σ²) ) + exp( −(y + √∆)²/(2Σ²) ) ].   (13)

For this prior distribution, we run Algorithm 1 for itermax = 10 iterations with the denoiser defined as follows:

η(x, τ) = E[S | S + τZ = x]   (14)
 = (k/n) √∆ [ exp(x√∆/τ²) − exp(−x√∆/τ²) ] / ( 2(1 − k/n) exp(∆/(2τ²)) + (k/n) [ exp(x√∆/τ²) + exp(−x√∆/τ²) ] ).   (15)

This denoiser has the following derivative:

dη(x, τ)/dx = (2∆/τ²)(1 − k/n)(k/n) exp(∆/(2τ²)) [ exp(x√∆/τ²) + exp(−x√∆/τ²) ] / ( 2(1 − k/n) exp(∆/(2τ²)) + (k/n)[ exp(x√∆/τ²) + exp(−x√∆/τ²) ] )² + (4∆/τ²)(k/n)² / ( 2(1 − k/n) exp(∆/(2τ²)) + (k/n)[ exp(x√∆/τ²) + exp(−x√∆/τ²) ] )².   (16)
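As a concrete illustration, the following is a minimal runnable sketch of Algorithm 1 for this Bernoulli-Rademacher example. It is not the authors' implementation: all parameter values are illustrative, the scalar expectation in the τ update is estimated by Monte Carlo, and the derivative of η is taken by finite differences rather than via the closed form (16).

```python
import numpy as np

# Sketch of Algorithm 1 (AMP for sub-linear regimes) with the posterior-mean
# denoiser eta of Eq. (15). The Onsager factor n^(1-alpha)/delta and the tau
# update follow the reconstruction above; parameter values are illustrative.
rng = np.random.default_rng(1)
n, alpha, delta, Delta, itermax = 200, 0.5, 0.5, 0.3, 10
m = int(delta * n**alpha)
rho = n**(alpha - 1.0)                 # k/n with k = n^alpha

def eta(x, tau):
    """E[S | S + tau*Z = x] for P_tilde = (1-rho) d_0 + (rho/2) d_{+-sqrt(Delta)}."""
    u = np.clip(x * np.sqrt(Delta) / tau**2, -30.0, 30.0)   # avoid overflow
    num = rho * np.sqrt(Delta) * (np.exp(u) - np.exp(-u))
    den = 2.0*(1.0-rho)*np.exp(min(Delta/(2.0*tau**2), 30.0)) \
          + rho*(np.exp(u) + np.exp(-u))
    return num / den

def eta_prime(x, tau, h=1e-5):
    # numerical derivative, used instead of the closed form (16) for brevity
    return (eta(x + h, tau) - eta(x - h, tau)) / (2.0*h)

def sample_signal(size):
    spikes = rng.choice([-np.sqrt(Delta), np.sqrt(Delta)], size)
    return np.where(rng.random(size) < rho, spikes, 0.0)

# Problem instance: Y = A S + sqrt(Delta_n) W with A_ij ~ N(0, 1/n^alpha).
S = sample_signal(n)
A = rng.normal(0.0, 1.0/np.sqrt(n**alpha), (m, n))
y = A @ S + np.sqrt(Delta) * rng.normal(size=m)

xhat, z, d = np.zeros(n), np.zeros(m), 0.0
tau = np.sqrt(Delta * n**(1.0-alpha) + Delta / delta)   # E_P[S^2] = Delta here
for _ in range(itermax):
    z = y - A @ xhat + (n**(1.0-alpha)/delta) * z * d   # residual + Onsager term
    h_vec = A.T @ z + xhat
    xhat = eta(h_vec, tau)
    d = float(np.mean(eta_prime(h_vec, tau)))
    # tau update, scalar expectation estimated by Monte Carlo
    s = sample_signal(5000)
    mse = np.mean((eta(s + tau*rng.normal(size=5000), tau) - s)**2)
    tau = np.sqrt(Delta * n**(1.0-alpha) + (n**(1.0-alpha)/delta) * mse)

mse_amp = float(np.mean((xhat - S)**2))
```

The per-coordinate MSE `mse_amp` can then be compared against the theoretical value n^{1−α}˜E(∆_n) of Theorem 1; with only m = ⌊δn^α⌋ measurements, each run is cheap even for moderate n.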
Fig. 1. MMSE and the MSE of Algorithm 1 as a function of SNR at α = 0. and δ = 0. for n = 200. Here, SNR := −10 log₁₀ ∆_n (dB).

Fig. 2. MMSE and the MSE of Algorithm 1 as a function of α at δ = 0. and SNR = −10 log₁₀ ∆_n = 5 dB for n = 200.

Figs. 1 and 2 show that the AMP performance (Monte-Carlo simulation) is very close to the MMSE fundamental limit in Theorem 1 for n = 200.

V. PROOF OF THE MAIN RESULT
The proof of Theorem 1 is based on [3], [15], with some modifications in the concentration inequalities and normalization factors to account for the new setting. Given the model (1), the likelihood of the observation y given S = s and A is

P(y | s, A) = (2π∆_n)^{−m/2} exp( −‖y − As‖²/(2∆_n) ).   (17)

From Bayes' formula we then get the posterior distribution of x = [x_1, x_2, ..., x_n] ∈ R^n given the observation y and sensing matrix A:

P(x | y, A) = Π_{i=1}^n ˜P(x_i) P(y | x, A) / ∫ { Π_{i=1}^n ˜P(x_i) dx_i } P(y | x, A).   (18)

Replacing the observation y by its explicit expression (1) as a function of the signal and the noise, we obtain

P(x | y = As + w√∆_n, A) = Π_{i=1}^n ˜P(x_i) e^{−H(x; A, s, w)} / Z(A, s, w),   (19)

where we call

H(x; A, s, w) := (1/∆_n) Σ_{μ=1}^m ( (1/2)[A(x − s)]_μ² − [A(x − s)]_μ w_μ √∆_n )   (20)

the Hamiltonian of the model, and the normalization factor is by definition the partition function:

Z(A, s, w) := ∫ { Π_{i=1}^n ˜P(x_i) dx_i } e^{−H(x; A, s, w)}.   (21)

Our principal quantity of interest is

f_n = −(1/n^α) E_{A,S,W}[ log Z(A, S, W) ]   (22)
 = −(1/n^α) E_{A,S,W}[ log ∫ { Π_{i=1}^n ˜P(x_i) dx_i } exp( −(1/∆_n) Σ_{μ=1}^m ( (1/2)[A(x − S)]_μ² − [A(x − S)]_μ W_μ √∆_n ) ) ],   (23)

where the W_μ are i.i.d. ∼ N(0, 1). By Bayes' rule,

P(y | A) = P(y | x, A) Π_{i=1}^n ˜P(x_i) / P(x | y = As + w√∆_n, A),   (24)

we have

P(y | A) = (2π∆_n)^{−m/2} Z(A, s, w) e^{−‖w‖²/2}.   (25)

It follows that

I(S; Y)/n^α = (1/n^α) E_{A,S,Y}[ log( P(S, Y | A) / (˜P(S) P(Y | A)) ) ]   (26)
 = f_n − h(Y | A, S)/n^α + (1/(2n^α)) E[‖W‖²] + (m/(2n^α)) log(2π∆_n)   (27)
 = f_n − (m/(2n^α)) log(2πe∆_n) + m/(2n^α) + (m/(2n^α)) log(2π∆_n)   (28)
 = f_n.   (29)

Hence, in order to obtain (6), it is enough to show that

lim_{n→∞} [ f_n − min_{E∈[0,ν_n]} f_{n,RS}(E; ∆_n) ] = 0.   (30)
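To make the identity I(S; Y)/n^α = f_n concrete, the normalized mutual information can be evaluated by brute force for a tiny instance, since the discrete prior allows exact enumeration of P(y|A). A hedged sketch (all sizes and trial counts are illustrative, not from the paper; the constant (2π∆)^{−m/2} cancels in the log-ratio):

```python
import numpy as np
from itertools import product

# Monte-Carlo estimate of I(S;Y)/n^alpha for a tiny Bernoulli-Rademacher
# instance, computing the marginal P(y|A) by enumerating all 3^n signals.
rng = np.random.default_rng(2)
n, alpha, delta, Delta = 8, 0.5, 1.0, 0.3
m = max(1, int(delta * n**alpha))
rho = n**(alpha - 1.0)

vals = np.array([0.0, np.sqrt(Delta), -np.sqrt(Delta)])
probs = np.array([1.0 - rho, rho/2.0, rho/2.0])
configs = np.array(list(product(range(3), repeat=n)))   # all 3^n configurations
X = vals[configs]                                       # (3^n, n) candidate signals
logp_X = np.log(probs)[configs].sum(axis=1)             # log prior of each config

trials, mi = 200, 0.0
for _ in range(trials):
    A = rng.normal(0.0, 1.0/np.sqrt(n**alpha), (m, n))
    s = vals[rng.choice(3, n, p=probs)]
    y = A @ s + np.sqrt(Delta) * rng.normal(size=m)
    ll = -((y - X @ A.T)**2).sum(axis=1) / (2.0*Delta)  # log P(y|x,A) + const
    ll_s = -((y - A @ s)**2).sum() / (2.0*Delta)
    mi += (ll_s - np.logaddexp.reduce(logp_X + ll)) / trials

mi_norm = mi / n**alpha     # Monte-Carlo estimate of I(S;Y)/n^alpha
```

For n beyond a few tens this enumeration is infeasible, which is exactly why the asymptotic characterization (6) is useful.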
Let W^{(k)} = [W_μ^{(k)}]_{μ=1}^m, ˜W^{(k)} = [˜W_i^{(k)}]_{i=1}^n, and Ŵ = [Ŵ_i]_{i=1}^n, all with i.i.d. N(0, 1) entries, for k = 1, 2, ..., K_n, where K_n is chosen later. Define Σ_k := Σ(E_k; ∆_n), where the trial parameters {E_k}_{k=1}^{K_n} are determined later on. As in [3], the (perturbed) (k, t)-interpolating Hamiltonian for this problem is defined as

H_{k,t;ε}(x; Θ) := Σ_{k'=k+1}^{K_n} h( x, S, A, W^{(k')}, K_n ∆_n ) + Σ_{k'=1}^{k−1} h_mf( x, S, ˜W^{(k')}, K_n Σ_{k'}² ) + h( x, S, A, W^{(k)}, K_n/γ_k(t) ) + h_mf( x, S, ˜W^{(k)}, K_n/λ_k(t) ) + ε Σ_{i=1}^n ( (1/2) x_i² − x_i S_i − x_i Ŵ_i/√ε ).   (31)

Here, Θ := { S, {W^{(k)}, ˜W^{(k)}}_{k=1}^{K_n}, Ŵ, A }, k ∈ [K_n], t ∈ [0, 1], and

h(x, S, A, W, σ) := (1/σ) Σ_{μ=1}^m ( (1/2)[A x̄]_μ² − √σ [A x̄]_μ W_μ ),   (32)

h_mf(x, S, ˜W, σ) := (1/σ) Σ_{i=1}^n ( (1/2) x̄_i² − √σ x̄_i ˜W_i ),   (33)

where x̄ = x − S and x̄_i = x_i − S_i.

The (k, t)-interpolating model corresponds to an inference model in which one has access to the following sets of noisy observations of the signal S:

{ Z^{(k')} = AS + W^{(k')} √(K_n ∆_n) }_{k'=k+1}^{K_n},   (34)

{ ˜Z^{(k')} = S + ˜W^{(k')} Σ_{k'} √K_n }_{k'=1}^{k−1},   (35)

{ Z^{(k)} = AS + W^{(k)} √(K_n/γ_k(t)) },   (36)

{ ˜Z^{(k)} = S + ˜W^{(k)} √(K_n/λ_k(t)) }.   (37)

The first and third sets of observations correspond to inference channels similar to the original model (1), but with a higher noise variance, proportional to K_n; these correspond to the first and third terms in (31). The second and fourth sets instead correspond to decoupled Gaussian denoising models, with the associated "mean-field" second and fourth terms in (31).
The last term in (31) is a perturbation term corresponding to a Gaussian "side channel" Y' = S√ε + Ẑ, whose signal-to-noise ratio ε tends to zero at the end of the proof. The noise variances are proportional to K_n in order to keep the average signal-to-noise ratio independent of K_n. Perturbed versions of the original and final (decoupled) models are obtained by setting (k = 1, t = 0) and (k = K_n, t = 1), respectively. The interpolation is performed on both k and t. For each fixed k, as t changes from 0 to 1, the observation in (36) is removed from the original model and added to the decoupled model. An interesting point is that the (k, t = 1)- and (k + 1, t = 0)-interpolating models are statistically equivalent. This is an adjusted version of the classical interpolation model in [6], in which an interpolating path k ∈ [K_n] is added; this is called the adaptive interpolation method. See [3] for a more detailed discussion.

Consider a set of observations [y, ˜y] from the following channels:

y = AS + W/√γ_k(t),  ˜y = S + ˜W/√λ_k(t),   (38)

where W ∼ N(0, I_m), ˜W ∼ N(0, I_n), t ∈ [0, 1] is the interpolating parameter, and the "signal-to-noise functions" {γ_k(t), λ_k(t)}_{k=1}^{K_n} satisfy

γ_k(0) = ∆_n^{−1},  γ_k(1) = 0,   (39)

λ_k(0) = 0,  λ_k(1) = Σ_k^{−2},   (40)

as well as the constraint

δn^{α−1}/( γ_k(t)^{−1} + E_k ) + λ_k(t) = δn^{α−1}/( ∆_n + E_k ) = Σ_k^{−2},   (41)

and thus

dλ_k(t)/dt = −(dγ_k(t)/dt) δn^{α−1}/(1 + γ_k(t) E_k)².   (42)

We also require γ_k(t) to be strictly decreasing in t.
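A minimal numerical sketch of one admissible choice of the signal-to-noise functions: take γ_k linear and strictly decreasing (an illustrative choice; any strictly decreasing path with the stated endpoints works) and solve the constraint (41) for λ_k, so that the boundary conditions (39)-(40) hold and the total effective SNR stays equal to Σ_k^{−2} for all t.

```python
import numpy as np

# Check of the interpolation-path construction (39)-(41).
# Parameter values are illustrative.
n, alpha, delta, Delta, E_k = 200, 0.5, 0.5, 0.3, 0.01
c = delta * n**(alpha - 1.0)          # delta * n^(alpha-1)

def gamma(t):
    return (1.0 - t) / Delta          # gamma(0) = 1/Delta_n, gamma(1) = 0

def lam(t):
    g = gamma(t)
    meas = c * g / (1.0 + g * E_k)    # = c / (gamma^{-1} + E_k), vanishes at g = 0
    return c / (Delta + E_k) - meas   # solve (41) for lambda_k(t)

Sigma_inv2 = c / (Delta + E_k)        # Sigma_k^{-2}
ts = np.linspace(0.0, 1.0, 101)
total = [c*gamma(t)/(1.0 + gamma(t)*E_k) + lam(t) for t in ts]
```

Here `total` is constant and equal to Σ_k^{−2} along the whole path, which is the point of the construction: SNR is only moved between the measurement channel and the mean-field channel, never created or destroyed.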
The (k, t)-interpolating model has an associated posterior distribution, Gibbs expectation ⟨−⟩_{k,t;ε}, and (k, t)-interpolating free energy f_{k,t;ε}:

P_{k,t;ε}(x | Θ) := Π_{i=1}^n ˜P(x_i) e^{−H_{k,t;ε}(x; Θ)} / ∫ { Π_{i=1}^n ˜P(x_i) dx_i } e^{−H_{k,t;ε}(x; Θ)},   (43)

⟨V(X)⟩_{k,t;ε} := ∫ dx V(x) P_{k,t;ε}(x | Θ),   (44)

f_{k,t;ε} := −(1/n) E_Θ[ log ∫ { Π_{i=1}^n dx_i ˜P(x_i) } e^{−H_{k,t;ε}(x; Θ)} ].   (45)

Lemma 3.
Let P have finite second moment. Then, for the initial and final systems,

| f_{1,0;ε} − f_{1,0;0} | ≤ O(ε n^{α−1}) E_{S∼P}[S²],   (46)

| f_{K_n,1;ε} − f_{K_n,1;0} | ≤ O(ε n^{α−1}) E_{S∼P}[S²].   (47)

Proof.
Using arguments similar to Lemma 1, Section II in [3], we have

| f_{1,0;ε} − f_{1,0;0} | ≤ (ε/2) E_{S∼˜P}[S²]   (48)
 = (ε/2) (k/n) E_{S∼P}[S²]   (49)
 = O(ε n^{α−1}) E_{S∼P}[S²].   (50)

The other inequality follows similarly.
Now, defining

Σ_mf^{−2}( {E_k}_{k=1}^{K_n}; ∆_n ) := (1/K_n) Σ_{k=1}^{K_n} Σ_k^{−2},   (51)

from (31) we have

H_{K_n,1}(x; Θ) = Σ_{k=1}^{K_n} h_mf( x, S, ˜W^{(k)}, K_n Σ_k² )   (52)
 = Σ_{k=1}^{K_n} Σ_{i=1}^n ( x̄_i²/(2 K_n Σ_k²) − x̄_i ˜W_i^{(k)}/(√K_n Σ_k) )   (53)
 = Σ_mf^{−2} Σ_{i=1}^n ( (1/2) x̄_i² − Σ_mf x̄_i Σ_{k=1}^{K_n} (Σ_mf/(√K_n Σ_k)) ˜W_i^{(k)} ).   (54)

Since

˜W_i := Σ_{k=1}^{K_n} (Σ_mf/(√K_n Σ_k)) ˜W_i^{(k)} ∼ N(0, 1),   (55)

it holds from (54) that

H_{K_n,1}(x; Θ) = Σ_mf^{−2} Σ_{i=1}^n ( (1/2) x̄_i² − Σ_mf x̄_i ˜W_i ).   (56)

Hence, we have

f_{K_n,1} = −(1/n) E[ Σ_{i=1}^n log ∫ dx_i ˜P(x_i) e^{−Σ_mf^{−2}( x̄_i²/2 − Σ_mf x̄_i ˜W_i )} ]   (57)
 = −E[ log ∫ dx ˜P(x) e^{−Σ_mf^{−2}( x̄²/2 − Σ_mf x̄ ˜W )} ]   (58)
 = n^{α−1} i_{n,den}( Σ_mf( {E_k}_{k=1}^{K_n}; ∆_n ) ),   (59)

where (59) follows from (5). Similarly, we can show that

f_{1,0} = −(1/n) E[ log ∫ { Π_{i=1}^n dx_i ˜P(x_i) } e^{−H(x; A, S, W)} ]   (60)
 = f_n n^{α−1}.   (61)

In addition, we can prove (with X̄ = X − S) that

df_{k,t;ε}/dt = (1/K_n) ( A_{k,t;ε} + B_{k,t;ε} ),   (62)

A_{k,t;ε} := (dγ_k(t)/dt) (1/(2n)) Σ_{μ=1}^m E[ ⟨ [A X̄]_μ² − √(K_n/γ_k(t)) [A X̄]_μ W_μ^{(k)} ⟩_{k,t;ε} ],   (63)

B_{k,t;ε} := (dλ_k(t)/dt) (1/(2n)) Σ_{i=1}^n E[ ⟨ X̄_i² − √(K_n/λ_k(t)) X̄_i ˜W_i^{(k)} ⟩_{k,t;ε} ],   (64)
X and all quenched random variables Θ , and (cid:104)−(cid:105) k,t ; ε is the Gibbs average withHamiltonian (31).Now, since E [ W kµ ] = 0 , it is easy to see that n − α A k,t ; ε = dγ k ( t ) dt n α m (cid:88) µ =1 E (cid:2)(cid:10)(cid:2) A ¯ X (cid:3) µ (cid:11) k,t ; ε (cid:3) (65) = dγ k ( t ) dt δ k , t; ε , (66)where ymmse k , t; ε := 1 m E (cid:2)(cid:13)(cid:13) A ( (cid:104) X (cid:105) k,t ; ε − S ) (cid:13)(cid:13) (cid:3) . (67)is refered to as“measurement minimum mean-square error”.For B k,t ; ε , we proceed similarly and find n − α B k,t ; ε = dλ k ( t ) dt n α n (cid:88) i =1 E [ (cid:104) ¯ X i (cid:105) k,t ; ε ] (68) = dλ k ( t ) dt n α E [ (cid:107)(cid:104) X (cid:105) k,t ; ε − S (cid:107) ] (69) = − dγ k ( t ) dt γ k ( t ) E k ) δ n α − mmse k , t; ε , (70)where the normalized minimum mean-square-error (MMSE) defined as mmse k , t; ε := 1 n α E [ (cid:107)(cid:104) X (cid:105) k,t ; ε − S (cid:107) ] . (71)By the construction, we have the following coherency property: The ( k, t = 1) and ( k + 1 , t = 0) models areequivalent (the Hamiltonian is invariant under this change) and thus f k, ε = f k +1 , ε for any k [3]. This impliesthat the ( k, t ) -interpolating free energy satisfies f , ε = f K n , ε + K n (cid:88) k =1 ( f k, ε − f k, ε ) (72) = f K n , ε − K n (cid:88) k =1 (cid:90) dt df k,t ; ε dt . (73)It follows that f n = n − α f K n , ε − n − α K n (cid:88) k =1 (cid:90) dt df k,t ; ε dt (74) = n − α f K n , ε − K n n − α K n (cid:88) k =1 (cid:90) dt (cid:18) A k,t ; ε + B k,t ; ε (cid:19) . (75) DRAFT2
From (66), (70), and (75), we obtain

∫_{a_n}^{b_n} dε f_n = n^{1−α} ∫_{a_n}^{b_n} dε f_{1,0;ε}   (76)
 = ∫_{a_n}^{b_n} n^{1−α} dε { f_{K_n,1;ε} − f_{K_n,1;0} } + ∫_{a_n}^{b_n} dε i_{n,den}( Σ_mf( {E_k}_{k=1}^{K_n}; ∆_n ) ) − (δ/(2K_n)) Σ_{k=1}^{K_n} ∫_{a_n}^{b_n} dε ∫_0^1 dt (dγ_k(t)/dt) ( ymmse_{k,t;ε} − (mmse_{k,t;ε} n^{α−1})/(1 + γ_k(t)E_k)² ).   (77)

Using a proof similar to that of [3], the following lemma can be verified to hold in the new setting:

Lemma 4.
Fix a discrete P with bounded support. For any sequences K_n → +∞ and 0 < a_n < b_n < 1 (tending to zero slowly enough in the application), and trial parameters {E_k = E_k^{(n)}(ε)}_{k=1}^{K_n} that are differentiable, bounded, and non-increasing in ε, we have

∫_{a_n}^{b_n} dε (1/K_n) Σ_{k=1}^{K_n} ∫_0^1 dt (dγ_k(t)/dt) ( ymmse_{k,t;ε} − (mmse_{k,t;ε} n^{α−1})/(1 + γ_k(t) mmse_{k,t;ε} n^{α−1}) ) = O( max{ o((b_n − a_n)/∆_n), a_n^{−2} n^{−γ} } )   (78)

as n → ∞, for some 0 < γ < 1/2.

Proof. This lemma is a generalization of Eq. (93) in [3]. The proof of this lemma can be found in Appendix A.

In addition, we can prove another interesting fact.
Lemma 5.
The following holds:

| mmse_{k,t;ε} − mmse_{k,0;ε} | = O( n^α/K_n ).   (79)

Proof.
Observe that

mmse_{k,t;ε} − mmse_{k,0;ε} = ∫_0^t (d mmse_{k,ν;ε}/dν) dν.   (80)

Now, we have

n^α (d mmse_{k,ν;ε}/dν) = (d/dν) E[ ‖⟨X⟩_{k,ν;ε} − S‖² ]   (81)
 = (d/dν) E[ ‖⟨X⟩_{k,ν;ε}‖² ] − 2 E[ S^T (d/dν)⟨X⟩_{k,ν;ε} ].   (82)

Next, it is easy to see that

(d/dν)⟨X⟩_{k,ν;ε} = ⟨X⟩_{k,ν;ε} ⟨ dH_{k,ν;ε}(X; Θ)/dν ⟩_{k,ν;ε} − ⟨ X · dH_{k,ν;ε}(X; Θ)/dν ⟩_{k,ν;ε}.   (83)
Define

q_{x,s} := (1/n^α) Σ_{i=1}^n x_i s_i,   (84)

which is a normalized overlap between x and s. Let X' be a replica of X, i.e., P_{k,ν;ε}(x, x' | Θ) = P_{k,ν;ε}(x | Θ) P_{k,ν;ε}(x' | Θ); then it holds that

E[ S^T (d/dν)⟨X⟩_{k,ν;ε} ] = n^α E[ ⟨q_{X,S}⟩_{k,ν;ε} ⟨ dH_{k,ν;ε}(X; Θ)/dν ⟩_{k,ν;ε} − ⟨ q_{X,S} · dH_{k,ν;ε}(X; Θ)/dν ⟩_{k,ν;ε} ]   (85)
 = n^α E[ ⟨ q_{X,S} ( dH_{k,ν;ε}(X'; Θ)/dν − dH_{k,ν;ε}(X; Θ)/dν ) ⟩_{k,ν;ε} ],   (86)

where ⟨−⟩ in (86) denotes the Gibbs average over the pair (X, X'). Now, observe that

dH_{k,ν;ε}(X; Θ)/dν = (d/dν) h( X, S, A, W^{(k)}, K_n/γ_k(ν) ) + (d/dν) h_mf( X, S, ˜W^{(k)}, K_n/λ_k(ν) ),   (87)

where

(d/dν) h( X, S, A, W^{(k)}, K_n/γ_k(ν) ) = (1/(2K_n)) (dγ_k(ν)/dν) ( Σ_{μ=1}^m [A X̄]_μ² − √(K_n/γ_k(ν)) Σ_{μ=1}^m [A X̄]_μ W_μ^{(k)} ),   (88)

and

(d/dν) h_mf( X, S, ˜W^{(k)}, K_n/λ_k(ν) ) = (1/(2K_n)) (dλ_k(ν)/dν) ( Σ_{i=1}^n X̄_i² − √(K_n/λ_k(ν)) Σ_{i=1}^n X̄_i ˜W_i^{(k)} ).   (89)

Using the facts that E[W_μ^{(k)}] = 0 and E[˜W_i^{(k)}] = 0, we finally have

E[ S^T (d/dν)⟨X⟩_{k,ν;ε} ] = (n^α/(2K_n)) (dγ_k(ν)/dν) E[ ⟨ q_{X,S} ( g(X, S) − g(X', S) ) ⟩_{k,ν;ε} ],   (90)

where

g(x, s) := Σ_{μ=1}^m [A x̄]_μ² − ( δn^{α−1}/(1 + γ_k(ν)E_k)² ) Σ_{i=1}^n x̄_i².   (91)
Hence, by the Cauchy-Schwarz inequality, we obtain

| E[ S^T (d/dν)⟨X⟩_{k,ν;ε} ] | ≤ (n^α/(2K_n)) |dγ_k(ν)/dν| √( E[ ⟨q²_{X,S}⟩_{k,ν;ε} ] E[ ⟨(g(X,S) − g(X',S))²⟩_{k,ν;ε} ] ).   (92)

Now, we have

E[ ⟨q²_{X,S}⟩_{k,ν;ε} ] = (1/n^{2α}) Σ_{i,j=1}^n E[ ⟨X_i X_j⟩_{k,ν;ε} S_i S_j ]   (93)
 ≤ (1/n^{2α}) Σ_{i,j=1}^n √( E[ ⟨X_i X_j⟩²_{k,ν;ε} ] E[ S_i² S_j² ] )   (94)
 ≤ (1/n^{2α}) Σ_{i,j=1}^n E[ S_i² S_j² ]   (95)
 ≤ (1/n^{2α}) ( n (k/n) E_{S∼P}[S⁴] + n² (k/n)² (E_{S∼P}[S²])² )   (96)
 = O( (1/n^{2α}) ( n^α + n^{2α} ) )   (97)
 = O(1),   (98)

where the last step follows since P has bounded support. By using the Cauchy-Schwarz inequality and the facts that E_k ≥ 0 and γ_k(ν) ≥ 0 for all ν ∈ [0, 1], it can be shown similarly that

E[ ⟨(g(X, S) − g(X', S))²⟩_{k,ν;ε} ] = O(n^{2α}).   (99)

Finally, since by the Nishimori identity E[ ‖⟨X⟩_{k,ν;ε}‖² ] = E[ S^T ⟨X⟩_{k,ν;ε} ], (82) reduces to n^α (d mmse_{k,ν;ε}/dν) = −E[ S^T (d/dν)⟨X⟩_{k,ν;ε} ]. Combining this with (92), (98), (99), and ∫_0^1 |dγ_k(ν)/dν| dν = γ_k(0) = ∆_n^{−1} = O(1), we obtain

n^α | mmse_{k,t;ε} − mmse_{k,0;ε} | ≤ ∫_0^1 | E[ S^T (d/dν)⟨X⟩_{k,ν;ε} ] | dν   (100)
 ≤ (n^α/(2K_n)) ∫_0^1 |dγ_k(ν)/dν| O(n^α) dν   (101)
 = O( n^{2α}/K_n ).   (102)

This concludes our proof of Lemma 5.

Returning to the proof of our main theorem, define, as in [3], the following identity:

ψ(E_k; ∆_n) := −(δ/2) ∫_0^1 dt (dγ_k(t)/dt) ( E_k/(1 + γ_k(t)E_k) − E_k/(1 + γ_k(t)E_k)² ),   (103)

which coincides with the closed form (3), and let

˜f_{n,RS}( {E_k}_{k=1}^{K_n}; ∆_n ) := i_{n,den}( Σ_mf( {E_k}_{k=1}^{K_n}; ∆_n ) ) + (1/K_n) Σ_{k=1}^{K_n} ψ(E_k; ∆_n).   (104)
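As a sanity check, the path integral defining ψ in (103) can be compared numerically against the closed form in (3). A small sketch with illustrative parameter values, using a linear path γ(t) = (1 − t)/∆ (any strictly decreasing path with the same endpoints gives the same value):

```python
import numpy as np

# Numerical check that
#   -(delta/2) * int_0^1 gamma'(t) [ E/(1+gamma E) - E/(1+gamma E)^2 ] dt
# equals psi(E; Delta) = (delta/2) [ log(1 + E/Delta) - E/(E + Delta) ].
delta, Delta, E = 0.5, 0.3, 0.1

t = np.linspace(0.0, 1.0, 200001)
gamma = (1.0 - t) / Delta
dgamma = -1.0 / Delta                                   # gamma'(t), constant here
f = dgamma * (E/(1.0 + gamma*E) - E/(1.0 + gamma*E)**2)
dt = t[1] - t[0]
lhs = -0.5 * delta * np.sum((f[:-1] + f[1:]) / 2.0) * dt   # trapezoidal rule
rhs = 0.5 * delta * (np.log1p(E/Delta) - E/(E + Delta))
```

The agreement (up to quadrature error) is what makes the telescoping argument work: the per-step potential accumulated along the interpolation path reproduces the closed-form ψ of the RS potential.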
From (77) and Lemma 4, we obtain, as a_n → 0,

∫_{a_n}^{2a_n} dε f_n = ∫_{a_n}^{2a_n} n^{1−α} dε { f_{K_n,1;ε} − f_{K_n,1;0} } + ∫_{a_n}^{2a_n} dε ˜f_{n,RS}( {E_k}_{k=1}^{K_n}; ∆_n ) + (δ/(2K_n)) ∫_{a_n}^{2a_n} dε Σ_{k=1}^{K_n} ∫_0^1 dt (dγ_k(t)/dt) · γ_k(t)( E_k − mmse_{k,t;ε} n^{α−1} )² / ( (1 + γ_k(t)E_k)² (1 + γ_k(t) mmse_{k,t;ε} n^{α−1}) ) + O( max{ o(a_n/∆_n), a_n^{−2} n^{−γ} } ).   (105)

Now, since γ_k(t) is non-increasing in t ∈ [0, 1], it holds that dγ_k(t)/dt ≤ 0, so the third term on the right-hand side of (105) is non-positive. Hence, from (105), we obtain

∫_{a_n}^{2a_n} dε f_n ≤ ∫_{a_n}^{2a_n} n^{1−α} dε { f_{K_n,1;ε} − f_{K_n,1;0} } + ∫_{a_n}^{2a_n} dε ˜f_{n,RS}( {E_k}_{k=1}^{K_n}; ∆_n ) + O( max{ o(a_n/∆_n), a_n^{−2} n^{−γ} } ).   (106)

By setting E_k = argmin_{E∈[0,ν_n]} f_{n,RS}(E; ∆_n) for all k ∈ [K_n], from (106) and Lemma 3, we have

f_n ≤ O(1) ε + min_{E∈[0,ν_n]} f_{n,RS}(E; ∆_n) + O( max{ o(1/∆_n), a_n^{−2} n^{−γ} } )   (107)

for some γ > 0. By taking ε → 0, we obtain the upper bound

f_n − min_{E∈[0,ν_n]} f_{n,RS}(E; ∆_n) ≤ O( max{ o(1/∆_n), a_n^{−2} n^{−γ} } ).   (108)

On the other hand, from Lemma 5 and (105), by choosing K_n = Ω(n^{α+b}) for some b > 0, we obtain

∫_{a_n}^{2a_n} dε f_n = ∫_{a_n}^{2a_n} n^{1−α} dε { f_{K_n,1;ε} − f_{K_n,1;0} } + ∫_{a_n}^{2a_n} dε ˜f_{n,RS}( {E_k}_{k=1}^{K_n}; ∆_n ) + (δ/(2K_n)) ∫_{a_n}^{2a_n} dε Σ_{k=1}^{K_n} ∫_0^1 dt (dγ_k(t)/dt) · γ_k(t)( E_k − mmse_{k,0;ε} n^{α−1} )² / ( (1 + γ_k(t)E_k)² (1 + γ_k(t) mmse_{k,0;ε} n^{α−1}) ) + O( max{ o(a_n/∆_n), a_n^{−2} n^{−γ} } ).   (109)
By choosing E_k = mmse_{k,0;ε} n^{α−1} for all k ∈ [K_n], it holds that

E_k = mmse_{k,0;ε} n^{α−1}   (110)
 = (1/n^α) E[ ‖S − E[S | Y_{(k,0;ε)}]‖² ] n^{α−1}   (111)
 ≤ (1/n^α) E[ ‖S‖² ] n^{α−1}   (112)
 = (n/n^α) E_{S∼˜P}[S²] n^{α−1}   (113)
 = (n/n^α)(n^α/n) E_{S∼P}[S²] n^{α−1}   (114)
 = E_{S∼P}[S²] n^{α−1}   (115)
 = ν_n,   (116)

where ν_n is defined in Theorem 1 and Y_{(k,0;ε)} denotes the observations available in the (k, t = 0; ε)-interpolating model; (112) follows from the fact that MMSE estimation yields the lowest MSE. Hence, from (109), we have

∫_{a_n}^{2a_n} dε f_n = ∫_{a_n}^{2a_n} n^{1−α} dε { f_{K_n,1;ε} − f_{K_n,1;0} } + ∫_{a_n}^{2a_n} dε ˜f_{n,RS}( {E_k}_{k=1}^{K_n}; ∆_n ) + O( max{ o(a_n/∆_n), a_n^{−2} n^{−γ} } ).   (117)

Now, let Σ_k^{−2} := δn^{α−1}/(E_k + ∆_n) for all k ∈ [K_n]; then Σ_k^{−2} ≥ δn^{α−1}/(ν_n + ∆_n) by (116). For a given ∆_n, set ψ_{∆_n}(Σ^{−2}) := ψ( δn^{α−1}/Σ^{−2} − ∆_n; ∆_n ). Since ψ_{∆_n}(·) is a convex function, from (104) it is easy to see that

˜f_{n,RS}( {E_k}_{k=1}^{K_n}; ∆_n ) = i_{n,den}( Σ_mf( {E_k}_{k=1}^{K_n}; ∆_n ) ) + (1/K_n) Σ_{k=1}^{K_n} ψ_{∆_n}(Σ_k^{−2})   (118)
 ≥ i_{n,den}( Σ_mf( {E_k}_{k=1}^{K_n}; ∆_n ) ) + ψ_{∆_n}( Σ_mf^{−2}( {E_k}_{k=1}^{K_n}; ∆_n ) )   (119)
 ≥ min_{Σ² ∈ (0, (ν_n+∆_n)/(δn^{α−1})]} ( i_{n,den}(Σ) + ψ_{∆_n}(Σ^{−2}) )   (120)
 ≥ min_{E∈[0,ν_n]} f_{n,RS}(E; ∆_n).   (121)

From (117) and (121), we obtain the lower bound

f_n ≥ min_{E∈[0,ν_n]} f_{n,RS}(E; ∆_n) + O(1) ε + O( max{ o(1/∆_n), a_n^{−2} n^{−γ} } ).   (122)
From (108) and (122), we have

f_n − min_{E∈[0,ν_n]} f_{n,RS}(E; ∆_n) = O( max{ o(1/∆_n), a_n^{−2} n^{−γ} } ),   (123)

or

I(S; Y)/n^α − min_{E∈[0,ν_n]} f_{n,RS}(E; ∆_n) = O( max{ o(1/∆_n), a_n^{−2} n^{−γ} } ),   (124)

which leads to (30) by choosing a sequence a_n → 0 such that a_n^{−2} n^{−γ} → 0.

Using the same argument as [15] (or by setting (k = 1, t = 0, ε = 0) in (197) of the proof of Lemma 4), we have the following relation:

ymmse_{1,0;0} = ( mmse_{1,0;0} n^{α−1} ) / ( 1 + mmse_{1,0;0} n^{α−1}/∆_n ) + o_n(1).   (125)

On the other hand, it can be shown that, as n → ∞,

ymmse_{1,0;0} = (2/(δn^α)) dI(S; Y)/d∆_n^{−1}   (126)
 = (2/δ) d f_{n,RS}( ˜E(∆_n); ∆_n )/d∆_n^{−1}   (127)
 = ˜E(∆_n) / ( 1 + ˜E(∆_n)/∆_n ),   (128)

where (126) follows from [15]. From (125) and (128), we obtain (7).

APPENDIX A
PROOF OF LEMMA 4

For α = 1, it is known from [3, Eq. (93)] that

∫_{a_n}^{b_n} dε (1/K_n) Σ_{k=1}^{K_n} ∫_0^1 dt (dγ_k(t)/dt) ( ymmse_{k,t;ε} − mmse_{k,t;ε}/(1 + γ_k(t) mmse_{k,t;ε}) ) = O( a_n^{−2} n^{−γ} )   (129)

for some 0 < γ < 1/2. Now, assume that 0 ≤ α < 1. Observe that

n^{α−1} mmse_{k,t;ε} = (1/n) Σ_{i=1}^n E[ (S_i − ⟨X_i⟩_{k,t;ε})² ]   (130)
 = (1/n) Σ_{i=1}^n E[ (⟨S_i − X_i⟩_{k,t;ε})² ]   (131)
 = (1/n) Σ_{i=1}^n E[ ⟨X̄_i⟩²_{k,t;ε} ],   (132)

since by definition X̄_i := X_i − S_i for all i ∈ [n].

Now, by [15, Section 6], for all k ∈ [K_n] we have

ymmse_{k,t;ε} = Y_{1,k} − Y_{2,k},   (133)

where

Y_{1,k} := E[ (1/m) Σ_{μ=1}^m (W_μ^{(k)})² · (1/n) Σ_{i=1}^n ⟨X_i X̄_i⟩_{k,t;ε} ],   (134)

Y_{2,k} := √γ_k(t) E[ (1/m) Σ_{μ=1}^m W_μ^{(k)} ⟨ [A X̄]_μ (1/n) Σ_{i=1}^n X_i X̄_i ⟩_{k,t;ε} ].   (135)

Our proof of Lemma 4 for α < 1 is simpler than the proof in [3] for α = 1.
More specifically, the proof of the concentration inequality in (169) is simplified by making use of the signal sparsity when α < 1.
By the law of large numbers, (1/m) Σ_{μ=1}^m (W_μ^{(k)})² = 1 + o_n(1) almost surely, so we have

Y_{1,k} = (1/n) E[ Σ_{i=1}^n ⟨X_i X̄_i⟩_{k,t;ε} ] + o_n(1)   (136)
 = (1/n) Σ_{i=1}^n E[ ⟨X̄_i (S_i + X̄_i)⟩_{k,t;ε} ] + o_n(1)   (137)
 = (1/n) Σ_{i=1}^n E[ ⟨X̄_i²⟩_{k,t;ε} ] + (1/n) Σ_{i=1}^n E[ ⟨X̄_i⟩_{k,t;ε} S_i ] + o_n(1)   (138)
 = (1/n) Σ_{i=1}^n E[ ⟨X̄_i²⟩_{k,t;ε} ] + E[ ⟨q̄_{X,S}⟩_{k,t;ε} ] + o_n(1),   (139)

where

q̄_{X,S} := (1/n) Σ_{i=1}^n S_i X̄_i.   (140)

Here, (137) follows from X_i = X̄_i + S_i.

Now, for any a, b, observe that

E[⟨ab⟩_{k,t;ε}] = E[⟨a⟩_{k,t;ε}] E[⟨b⟩_{k,t;ε}] + E[ ⟨( a − E[⟨a⟩_{k,t;ε}] ) b⟩_{k,t;ε} ]   (141)
 = E[⟨a⟩_{k,t;ε}] E[⟨b⟩_{k,t;ε}] + O( E[ √( ⟨( a − E[⟨a⟩_{k,t;ε}] )²⟩_{k,t;ε} ⟨b²⟩_{k,t;ε} ) ] )   (142)
 = E[⟨a⟩_{k,t;ε}] E[⟨b⟩_{k,t;ε}] + O( √( E[ ⟨( a − E[⟨a⟩_{k,t;ε}] )²⟩_{k,t;ε} ] E[ ⟨b²⟩_{k,t;ε} ] ) ),   (143)

where (142) and (143) follow from the Cauchy-Schwarz inequality. Let a = (1/n) Σ_{i=1}^n X_i X̄_i and b = W_μ^{(k)} [A X̄]_μ; then we have

E[ W_μ^{(k)} ⟨ [A X̄]_μ (1/n) Σ_{i=1}^n X_i X̄_i ⟩_{k,t;ε} ] = E[⟨ab⟩_{k,t;ε}].   (144)
\tag{144}
\end{align}
On the other hand, by the Cauchy–Schwarz inequality, we also have
\begin{align}
\mathbb{E}[\langle b^2\rangle_{k,t;\varepsilon}] &= \mathbb{E}\Big[\Big\langle \big(W_\mu^{(k)}[A\bar{X}]_\mu\big)^2\Big\rangle_{k,t;\varepsilon}\Big] \tag{145}\\
&\le \mathbb{E}\bigg[\sqrt{\big\langle (W_\mu^{(k)})^4\big\rangle_{k,t;\varepsilon}\,\big\langle ([A\bar{X}]_\mu)^4\big\rangle_{k,t;\varepsilon}}\bigg] \tag{146}\\
&\le \sqrt{\mathbb{E}\big[\big\langle (W_\mu^{(k)})^4\big\rangle_{k,t;\varepsilon}\big]\,\mathbb{E}\big[\big\langle ([A\bar{X}]_\mu)^4\big\rangle_{k,t;\varepsilon}\big]} \tag{147}\\
&= \sqrt{\mathbb{E}\big[(W_\mu^{(k)})^4\big]\,\mathbb{E}\big[\big\langle ([A\bar{X}]_\mu)^4\big\rangle_{k,t;\varepsilon}\big]} \tag{148}\\
&= \sqrt{3\,\mathbb{E}\big[\big\langle ([A\bar{X}]_\mu)^4\big\rangle_{k,t;\varepsilon}\big]} \tag{149}\\
&= \sqrt{3\,\mathbb{E}\big[\big\langle ([AX]_\mu - [AS]_\mu)^4\big\rangle_{k,t;\varepsilon}\big]} \tag{150}\\
&= \sqrt{3\,\mathbb{E}\bigg[\Big\langle \sum_{i=0}^4 \binom{4}{i}(-1)^{4-i}[AX]_\mu^i\,[AS]_\mu^{4-i}\Big\rangle_{k,t;\varepsilon}\bigg]} \tag{151}\\
&= \sqrt{3\sum_{i=0}^4 \binom{4}{i}(-1)^{4-i}\,\mathbb{E}\big[\big\langle [AX]_\mu^i\big\rangle_{k,t;\varepsilon}\,[AS]_\mu^{4-i}\big]} \tag{152}\\
&\le \sqrt{3\sum_{i=0}^4 \binom{4}{i}\sqrt{\mathbb{E}\big[\big\langle [AX]_\mu^{2i}\big\rangle_{k,t;\varepsilon}\big]\,\mathbb{E}\big[[AS]_\mu^{2(4-i)}\big]}} \tag{153}\\
&= \sqrt{3\sum_{i=0}^4 \binom{4}{i}\sqrt{\mathbb{E}\big[[AS]_\mu^{2i}\big]\,\mathbb{E}\big[[AS]_\mu^{2(4-i)}\big]}} \tag{154}\\
&= \sqrt{\frac{3}{n^2}\sum_{i=0}^4 \binom{4}{i}\sqrt{\mathbb{E}\big[[\sqrt{n}AS]_\mu^{2i}\big]\,\mathbb{E}\big[[\sqrt{n}AS]_\mu^{2(4-i)}\big]}}. \tag{155}
\end{align}
Now, we have
\begin{align}
[\sqrt{n}AS]_\mu = \sum_{i=1}^n \sqrt{n}\,A_{\mu,i}S_i. \tag{156}
\end{align}
Note that
\begin{align}
\mathrm{Var}\big([\sqrt{n}AS]_\mu\big) = \sum_{i=1}^n n\,\mathbb{E}[A_{\mu,i}^2]\,\mathbb{E}_{S_i\sim\tilde{P}}[S_i^2] = n\cdot n\cdot\frac{1}{n\,n^{\alpha}}\cdot\frac{n^{\alpha}}{n}\,\mathbb{E}_{S\sim P}[S^2] = \mathbb{E}_{S\sim P}[S^2]. \tag{157}
\end{align}
Hence, by the CLT, we have
\begin{align}
[\sqrt{n}AS]_\mu \to \mathcal{N}\big(0, \mathbb{E}_{S\sim P}[S^2]\big). \tag{158}
\end{align}
It follows that $\mathbb{E}\big[[\sqrt{n}AS]_\mu^{2(4-i)}\big]$ and $\mathbb{E}\big[[\sqrt{n}AS]_\mu^{2i}\big]$ are bounded for each $i \in \{0,1,\dots,4\}$.
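The variance computation (157) and the boundedness of the moments appearing in (155) can be checked by Monte Carlo. The sketch below assumes, for illustration only, a standard Gaussian nonzero part $P = \mathcal{N}(0,1)$ and the normalization $\mathbb{E}[A_{\mu,i}^2] = 1/(n\,n^{\alpha})$ implied by (157); under these assumptions the empirical variance of $[\sqrt{n}AS]_\mu$ should be close to $\mathbb{E}_{S\sim P}[S^2] = 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha, trials = 2000, 0.5, 1000
k = n ** alpha                            # expected number of nonzeros (sublinear)

# Sparse prior P~: nonzero w.p. n^{alpha-1}; nonzero part P = N(0,1), so E_P[S^2] = 1
S = (rng.random((trials, n)) < k / n) * rng.standard_normal((trials, n))
# Assumed normalization E[A_{mu,i}^2] = 1/(n * n^alpha); one matrix row per trial
A = rng.standard_normal((trials, n)) / np.sqrt(n * k)
Z = np.sqrt(n) * np.sum(A * S, axis=1)    # samples of [sqrt(n) A S]_mu

# Variance approaches E_{S~P}[S^2] = 1 as in (157)-(158); moments stay bounded
assert abs(float(np.var(Z)) - 1.0) < 0.2
assert float(np.mean(Z ** 4)) < 20.0
```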
Hence, $\mathbb{E}[\langle b^2\rangle_{k,t;\varepsilon}]$ goes to zero uniformly in $\mu$ as $n \to \infty$ by observing (155).

Furthermore, we have
\begin{align}
\mathbb{E}\big[\big\langle \big(a - \mathbb{E}[\langle a\rangle_{k,t;\varepsilon}]\big)^2\big\rangle_{k,t;\varepsilon}\big] &= \mathbb{E}[\langle a^2\rangle_{k,t;\varepsilon}] - \big(\mathbb{E}[\langle a\rangle_{k,t;\varepsilon}]\big)^2 \tag{159}\\
&\le \mathbb{E}[\langle a^2\rangle_{k,t;\varepsilon}] \tag{160}\\
&= \mathbb{E}\bigg[\Big\langle \Big(\frac{1}{n}\sum_{i=1}^n X_i\bar{X}_i\Big)^2\Big\rangle_{k,t;\varepsilon}\bigg] \tag{161}\\
&= \mathbb{E}\bigg[\Big\langle \Big|\frac{1}{n}\sum_{i=1}^n X_i\bar{X}_i\Big|^2\Big\rangle_{k,t;\varepsilon}\bigg] \tag{162}\\
&\le \mathbb{E}\bigg[\Big\langle \Big(\frac{1}{n}\sum_{i=1}^n |X_i||\bar{X}_i|\Big)^2\Big\rangle_{k,t;\varepsilon}\bigg] \tag{163}\\
&\le \mathbb{E}\bigg[\Big\langle \Big(\frac{1}{n}\sum_{i=1}^n |X_i|\cdot 2s_{\max}\Big)^2\Big\rangle_{k,t;\varepsilon}\bigg] \tag{164}\\
&= 4s_{\max}^2\,\mathbb{E}\bigg[\Big\langle\Big(\frac{1}{n}\sum_{i=1}^n |X_i|\Big)^2\Big\rangle_{k,t;\varepsilon}\bigg] \tag{165}\\
&\le \frac{4s_{\max}^2}{n}\,\mathbb{E}\bigg[\Big\langle\sum_{i=1}^n X_i^2\Big\rangle_{k,t;\varepsilon}\bigg] \tag{166}\\
&= 4s_{\max}^2\,\mathbb{E}_{S\sim\tilde{P}}[S^2] \tag{167}\\
&= 4s_{\max}^2\,\frac{n^{\alpha}}{n}\,\mathbb{E}_{S\sim P}[S^2] \tag{168}\\
&= O_n\big(n^{-(1-\alpha)}\big) \to 0 \tag{169}
\end{align}
as $0 \le \alpha < 1$, where (164) follows from the fact that $|\bar{X}_i| = |X_i - S_i| \le |X_i| + |S_i| \le 2s_{\max}$, (166) follows from the Cauchy–Schwarz inequality, and (167) follows from the Nishimori identity $\mathbb{E}[\langle X_i^2\rangle_{k,t;\varepsilon}] = \mathbb{E}[S_i^2]$.
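The rate at which the bound (167)–(169) vanishes is driven purely by the sparsity of the prior: $\mathbb{E}_{S\sim\tilde{P}}[S^2] = n^{\alpha-1}\mathbb{E}_{S\sim P}[S^2]$. A quick Monte Carlo sketch, assuming for illustration a $\pm 1$ nonzero part (so $\mathbb{E}_{S\sim P}[S^2] = 1$ and $|S| \le s_{\max} = 1$):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical sparsity exponent; nonzero part P = Unif{-1,+1}: E_P[S^2] = 1, |S| <= s_max
alpha, s_max = 0.5, 1.0

def avg_second_moment(n, trials=100):
    """Empirical (1/n) sum_i E[S_i^2] under the sparse prior P~."""
    mask = rng.random((trials, n)) < n ** (alpha - 1)
    signs = rng.choice([-1.0, 1.0], size=(trials, n))
    return float(np.mean((mask * signs) ** 2))

m3, m4 = avg_second_moment(10**3), avg_second_moment(10**4)
# matches n^{alpha-1} E_P[S^2], as in (168)
assert abs(m3 - 1e3 ** (alpha - 1)) < 0.3 * 1e3 ** (alpha - 1)
assert abs(m4 - 1e4 ** (alpha - 1)) < 0.3 * 1e4 ** (alpha - 1)
# hence the bound 4 s_max^2 n^{alpha-1} E_P[S^2] in (167)-(169) shrinks as n grows
assert 4 * s_max ** 2 * m4 < 4 * s_max ** 2 * m3
```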
From (143), (144), (155), and (169), we obtain
\begin{align}
\mathbb{E}\bigg[W_\mu^{(k)}\Big\langle [A\bar{X}]_\mu\,\frac{1}{n}\sum_{i=1}^n X_i\bar{X}_i\Big\rangle_{k,t;\varepsilon}\bigg] &= \mathbb{E}[\langle a\rangle_{k,t;\varepsilon}]\,\mathbb{E}[\langle b\rangle_{k,t;\varepsilon}] + o_n(1) \tag{170}\\
&= \mathbb{E}\bigg[\Big\langle \frac{1}{n}\sum_{i=1}^n X_i\bar{X}_i\Big\rangle_{k,t;\varepsilon}\bigg]\,\mathbb{E}\Big[W_\mu^{(k)}\big\langle [A\bar{X}]_\mu\big\rangle_{k,t;\varepsilon}\Big] + o_n(1), \tag{171}
\end{align}
where $o_n(1) \to 0$ uniformly in $\mu$. It follows that
\begin{align}
Y_{2,k} &= \sqrt{\gamma_k(t)}\,\frac{1}{m}\sum_{\mu=1}^m \mathbb{E}\bigg[W_\mu^{(k)}\Big\langle [A\bar{X}]_\mu\,\frac{1}{n}\sum_{i=1}^n X_i\bar{X}_i\Big\rangle_{k,t;\varepsilon}\bigg] \tag{172}\\
&= \sqrt{\gamma_k(t)}\,\mathbb{E}\bigg[\Big\langle \frac{1}{n}\sum_{i=1}^n X_i\bar{X}_i\Big\rangle_{k,t;\varepsilon}\bigg]\bigg(\frac{1}{m}\sum_{\mu=1}^m \mathbb{E}\Big[W_\mu^{(k)}\big\langle [A\bar{X}]_\mu\big\rangle_{k,t;\varepsilon}\Big]\bigg) + o_n(1). \tag{173}
\end{align}
Now, by [15, Eq. (26)], we have
\begin{align}
\mathrm{ymmse}_k(t;\varepsilon) = \frac{1}{m\sqrt{\gamma_k(t)}}\sum_{\mu=1}^m \mathbb{E}\Big[W_\mu^{(k)}\big\langle [A\bar{X}]_\mu\big\rangle_{k,t;\varepsilon}\Big]. \tag{174}
\end{align}
Hence, from (173) and (174), we obtain
\begin{align}
Y_{2,k} &= \gamma_k(t)\,\mathbb{E}\bigg[\Big\langle \frac{1}{n}\sum_{i=1}^n X_i\bar{X}_i\Big\rangle_{k,t;\varepsilon}\bigg]\,\mathrm{ymmse}_k(t;\varepsilon) + o_n(1) \tag{175}\\
&= \gamma_k(t)\,\mathrm{ymmse}_k(t;\varepsilon)\,\tilde{Y}_{1,k} + o_n(1), \tag{176}
\end{align}
where
\begin{align}
\tilde{Y}_{1,k} = \mathbb{E}\bigg[\Big\langle \frac{1}{n}\sum_{i=1}^n X_i\bar{X}_i\Big\rangle_{k,t;\varepsilon}\bigg]. \tag{177}
\end{align}
It follows from (133), (136), and (176) that
\begin{align}
\mathrm{ymmse}_k(t;\varepsilon) &= Y_{1,k} - Y_{2,k} \tag{178}\\
&= \tilde{Y}_{1,k} + o_n(1) - \tilde{Y}_{1,k}\,\gamma_k(t)\,\mathrm{ymmse}_k(t;\varepsilon) + o_n(1) \tag{179}\\
&= \tilde{Y}_{1,k} - \tilde{Y}_{1,k}\,\gamma_k(t)\,\mathrm{ymmse}_k(t;\varepsilon) + o_n(1). \tag{180}
\end{align}
This leads to
\begin{align}
\mathrm{ymmse}_k(t;\varepsilon) = \frac{\tilde{Y}_{1,k}}{1 + \gamma_k(t)\tilde{Y}_{1,k}} + o_n(1). \tag{181}
\end{align}
Then, it holds that
\begin{align}
\mathrm{ymmse}_k(t;\varepsilon) - \frac{\mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1}}{1 + \gamma_k(t)\,\mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1}} = \frac{\tilde{Y}_{1,k}}{1 + \gamma_k(t)\tilde{Y}_{1,k}} - \frac{\mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1}}{1 + \gamma_k(t)\,\mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1}} + o_n(1). \tag{182}
\end{align}
Now, observe that
\begin{align}
\big|\tilde{Y}_{1,k} - \mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1}\big| &= \bigg|\frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\langle\bar{X}_i^2\rangle_{k,t;\varepsilon}\big] - \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\langle\bar{X}_i\rangle_{k,t;\varepsilon}^2\big] + \mathbb{E}\big[\langle \bar{q}_{X,S}\rangle_{k,t;\varepsilon}\big]\bigg| + o_n(1) \tag{183}\\
&\le \bigg|\frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\langle\bar{X}_i^2\rangle_{k,t;\varepsilon}\big] - \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\langle\bar{X}_i\rangle_{k,t;\varepsilon}^2\big]\bigg| + \big|\mathbb{E}\big[\langle\bar{q}_{X,S}\rangle_{k,t;\varepsilon}\big]\big| + o_n(1) \tag{184}\\
&\le \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\langle\bar{X}_i^2\rangle_{k,t;\varepsilon}\big] + \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[|S_i\langle\bar{X}_i\rangle_{k,t;\varepsilon}|\big] + o_n(1) \tag{185}\\
&= \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\langle(X_i - S_i)^2\rangle_{k,t;\varepsilon}\big] + \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[|S_i\langle X_i - S_i\rangle_{k,t;\varepsilon}|\big] + o_n(1) \tag{186}\\
&\le \frac{2}{n}\sum_{i=1}^n \mathbb{E}\big[\langle X_i^2 + S_i^2\rangle_{k,t;\varepsilon}\big] + \frac{1}{n}\sum_{i=1}^n \sqrt{\mathbb{E}[S_i^2]\,\mathbb{E}\big[\langle X_i - S_i\rangle_{k,t;\varepsilon}^2\big]} + o_n(1) \tag{187}\\
&\le \frac{2}{n}\sum_{i=1}^n \mathbb{E}\big[\langle X_i^2 + S_i^2\rangle_{k,t;\varepsilon}\big] + \frac{1}{n}\sum_{i=1}^n \sqrt{2\,\mathbb{E}[S_i^2]\,\mathbb{E}\big[\langle S_i^2 + X_i^2\rangle_{k,t;\varepsilon}\big]} + o_n(1) \tag{188}\\
&= 6\,\mathbb{E}_{S\sim\tilde{P}}[S^2] + o_n(1) \tag{189}\\
&= 6\,\frac{n^{\alpha}}{n}\,\mathbb{E}_{S\sim P}[S^2] + o_n(1) \tag{190}\\
&=: f(n) + o_n(1), \tag{191}
\end{align}
where $f(n) := 6\,n^{\alpha-1}\,\mathbb{E}_{S\sim P}[S^2] = O\big(n^{-(1-\alpha)}\big) = o(1)$ uniformly in $k, t$ if $0 \le \alpha < 1$. Here, (187) follows from the Cauchy–Schwarz inequality, (189) follows from the i.i.d. assumption of the sequences $\{S_i\}_{i=1}^n$ and $\{X_i\}_{i=1}^n$ under $\tilde{P}$ together with the Nishimori identity $\mathbb{E}[\langle X_i^2\rangle_{k,t;\varepsilon}] = \mathbb{E}[S_i^2]$, and (191) follows from the assumption that $\mathbb{E}_{S\sim P}[S^2] < \infty$.

Now, let
\begin{align}
g_{k,t}(x) := \frac{x}{1 + \gamma_k(t)x}. \tag{192}
\end{align}
It is easy to see that $g_{k,t}(x)$ is an increasing function for $x \ge 0$. Moreover, we have
\begin{align}
0 < g'_{k,t}(x) = \frac{1}{(1 + \gamma_k(t)x)^2} \le 1 \tag{193}
\end{align}
uniformly in $k, t$ for all $x \ge 0$ (since $\gamma_k(t) \ge 0$ uniformly in $k, t$).
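The properties of $g_{k,t}$ claimed around (192)–(193) — monotonicity, the derivative bound, and the resulting 1-Lipschitz behavior exploited in the Taylor step below — are easy to check numerically; the grid and the sample values of $\gamma$ and $f$ are illustrative.

```python
import numpy as np

def g(x, gamma):
    """g_{k,t}(x) = x / (1 + gamma * x), gamma playing the role of gamma_k(t) >= 0."""
    return x / (1 + gamma * x)

def g_prime(x, gamma):
    return 1.0 / (1 + gamma * x) ** 2

xs = np.linspace(0.0, 50.0, 2001)
for gamma in (0.0, 0.1, 1.0, 10.0):
    assert np.all(np.diff(g(xs, gamma)) > 0)            # increasing on x >= 0
    d = g_prime(xs, gamma)
    assert np.all((0 < d) & (d <= 1))                   # 0 < g' <= 1, as in (193)
    # hence g is 1-Lipschitz: |g(x + f) - g(x)| <= f, the step behind (195)-(197)
    f = 0.37
    assert np.all(np.abs(g(xs + f, gamma) - g(xs, gamma)) <= f + 1e-12)
```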
Hence, we have
\begin{align}
\bigg|\mathrm{ymmse}_k(t;\varepsilon) - \frac{\mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1}}{1 + \gamma_k(t)\,\mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1}}\bigg| &= \bigg|\frac{\tilde{Y}_{1,k}}{1+\gamma_k(t)\tilde{Y}_{1,k}} - \frac{\mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1}}{1+\gamma_k(t)\,\mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1}}\bigg| + o_n(1) \tag{194}\\
&\le \big|g_{k,t}\big(\mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1} \pm f(n)\big) - g_{k,t}\big(\mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1}\big)\big| + o_n(1) \tag{195}\\
&= |g'_{k,t}(\theta)f(n)| + o_n(1) \tag{196}
\end{align}
for some $\theta > 0$ by Taylor's expansion. This means that
\begin{align}
\bigg|\mathrm{ymmse}_k(t;\varepsilon) - \frac{\mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1}}{1 + \gamma_k(t)\,\mathrm{mmse}_k(t;\varepsilon)\,n^{\alpha-1}}\bigg| \le \tilde{f}(n) \tag{197}
\end{align}
uniformly in $k, t$, where $\tilde{f}(n) := f(n) + o_n(1)$.
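The telescoping step (199)–(201) that follows depends only on the endpoint increment $\gamma_k(1) - \gamma_k(0) = 1/\Delta_n$, not on the shape of the interpolation path, since $\int_0^1 \gamma_k'(t)\,dt$ telescopes by the fundamental theorem of calculus. A quadrature sketch with an arbitrary (and deliberately nonlinear) illustrative path:

```python
import numpy as np

Delta = 0.25
t = np.linspace(0.0, 1.0, 100001)

# A deliberately nonlinear increasing path with gamma(1) - gamma(0) = 1/Delta
gamma = 0.3 + t ** 3 / Delta
dgdt = np.gradient(gamma, t)

# Trapezoidal quadrature of int_0^1 gamma'(t) dt
integral = float(np.sum(0.5 * (dgdt[1:] + dgdt[:-1]) * np.diff(t)))

# The integral telescopes to gamma(1) - gamma(0) = 1/Delta regardless of shape
assert abs(integral - 1.0 / Delta) < 1e-3
```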
Then, it holds that
\begin{align}
\int_{a_n}^{b_n} d\varepsilon\,\frac{1}{K_n}\sum_{k=1}^{K_n}\int_0^1 dt\,\frac{d\gamma_k(t)}{dt}\bigg(\mathrm{ymmse}_k(t;\varepsilon) - \frac{\mathrm{mmse}_k(t;\varepsilon)}{n^{1-\alpha} + \gamma_k(t)\,\mathrm{mmse}_k(t;\varepsilon)}\bigg)
&\le \int_{a_n}^{b_n} d\varepsilon\,\frac{1}{K_n}\sum_{k=1}^{K_n}\int_0^1 dt\,\frac{d\gamma_k(t)}{dt}\,\tilde{f}(n) \tag{198}\\
&= (b_n - a_n)\,\tilde{f}(n)\,\frac{1}{K_n}\sum_{k=1}^{K_n}\big(\gamma_k(1) - \gamma_k(0)\big) \tag{199}\\
&= (b_n - a_n)\,\tilde{f}(n)\,\frac{1}{K_n}\sum_{k=1}^{K_n}\frac{1}{\Delta_n} \tag{200}\\
&= (b_n - a_n)\,\tilde{f}(n)\,\frac{1}{\Delta_n} \tag{201}\\
&= o\bigg(\frac{b_n - a_n}{\Delta_n}\bigg) \tag{202}
\end{align}
as $a_n, b_n \to 0$. Hence, we have
\begin{align}
\int_{a_n}^{b_n} d\varepsilon\,\frac{1}{K_n}\sum_{k=1}^{K_n}\int_0^1 dt\,\frac{d\gamma_k(t)}{dt}\bigg(\mathrm{ymmse}_k(t;\varepsilon) - \frac{\mathrm{mmse}_k(t;\varepsilon)}{n^{1-\alpha} + \gamma_k(t)\,\mathrm{mmse}_k(t;\varepsilon)}\bigg) = o\bigg(\frac{b_n - a_n}{\Delta_n}\bigg). \tag{203}
\end{align}
From (129) and (203), we obtain (78) for all $0 \le \alpha \le 1$.

ACKNOWLEDGEMENTS
The author is extremely grateful to Prof. Ramji Venkataramanan of the University of Cambridge for many suggestions to improve the manuscript.

REFERENCES

[1] T. Tanaka. A statistical-mechanics approach to large-system analysis of CDMA multiuser detectors. IEEE Transactions on Information Theory, 48(11):2888–2910, 2002.
[2] D. Guo, S. Shamai, and S. Verdú. Mutual information and minimum mean-square error in Gaussian channels. IEEE Transactions on Information Theory, 51(4):1261–1282, 2005.
[3] J. Barbier and N. Macris. The adaptive interpolation method: a simple scheme to prove replica formulas in Bayesian inference. Probability Theory and Related Fields, 174:1133–1185, 2019.
[4] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. In Proceedings of the 31st Conference on Learning Theory, pages 728–731, 2018.
[5] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.
[6] F. Guerra and F. L. Toninelli. The thermodynamic limit in mean field spin glass models. Communications in Mathematical Physics, 230:71–79, 2002.
[7] G. Reeves, J. Xu, and I. Zadik. The all-or-nothing phenomenon in sparse linear regression. In Proceedings of the Thirty-Second Conference on Learning Theory, pages 2652–2663, 2019.
[8] J. Scarlett and V. Cevher. Limits on support recovery with probabilistic models: An information-theoretic framework. IEEE Transactions on Information Theory, 63(1):593–620, 2017.
[9] L. V. Truong and J. Scarlett. Support recovery in the phase retrieval model: Information-theoretic fundamental limit. IEEE Transactions on Information Theory, 66(12):7887–7910, 2020.
[10] L. V. Truong, M. Aldridge, and J. Scarlett. On the all-or-nothing behavior of Bernoulli group testing. IEEE Journal on Selected Areas in Information Theory, 1(3):669–680, 2020.
[11] J. Barbier and N. Macris. 0-1 phase transitions in sparse spiked matrix estimation. arXiv:1911.05030, 2019.
[12] C. Luneau, J. Barbier, and N. Macris. Information theoretic limits of learning a sparse rule. arXiv:2006.11313, 2020.
[13] J. Niles-Weed and I. Zadik. The all-or-nothing phenomenon in sparse tensor PCA. arXiv:2007.11138, 2020.
[14] J. Barbier, N. Macris, and C. Rush. All-or-nothing statistical and computational phase transitions in sparse spiked matrix estimation. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[15] J. Barbier, M. Dia, N. Macris, and F. Krzakala. The mutual information in random linear estimation. In 54th Annual Allerton Conference on Communication, Control, and Computing, pages 625–632, 2016.
[16] M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785, 2011.
[17] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[18] E. Candès and M. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008.
[19] C. A. Metzler, A. Maleki, and R. G. Baraniuk. From denoising to compressed sensing. IEEE Transactions on Information Theory, 62(9):5117–5144, 2016.
[20] A. Montanari and R. Venkataramanan. Estimation of low-rank matrices via approximate message passing. Annals of Statistics, 2020.
[21] D. Guo and S. Verdú. Randomly spread CDMA: asymptotics via statistical physics. IEEE Transactions on Information Theory, 51(6):1983–2010, 2005.
[22] G. Reeves and H. D. Pfister. The replica-symmetric prediction for random linear estimation with Gaussian matrices is exact. IEEE Transactions on Information Theory, 65(4):2252–2283, 2019.