It was "all" for "nothing": sharp phase transitions for noiseless discrete channels
aa r X i v : . [ m a t h . S T ] F e b It was “all” for “nothing”:sharp phase transitions for noiseless discrete channels
Jonathan Niles-Weed ∗ Ilias Zadik † February 25, 2021
Abstract
We establish a phase transition known as the “all-or-nothing” phenomenon for noiselessdiscrete channels. This class of models includes the Bernoulli group testing model and theplanted Gaussian perceptron model. Previously, the existence of the all-or-nothing phenomenonfor such models was only known in a limited range of parameters. Our work extends the resultsto all signals with arbitrary sublinear sparsity.Over the past several years, the all-or-nothing phenomenon has been established in variousmodels as an outcome of two seemingly disjoint results: one positive result establishing the“all” half of all-or-nothing, and one impossibility result establishing the “nothing” half. Ourmain technique in the present work is to show that for noiseless discrete channels, the “all”half implies the “nothing” half, that is a proof of “all” can be turned into a proof of “nothing.”Since the “all” half can often be proven by straightforward means—for instance, by the first-moment method—our equivalence gives a powerful and general approach towards establishingthe existence of this phenomenon in other contexts.
Contents
A Auxilary results and important preliminary concepts 14 ∗ Courant Institute of Mathematical Sciences and Center for Data Science, New York University; e-mail: [email protected] . JNW is supported in part by NSF grant DMS-201529. † Center for Data Science, New York University ; e-mail: [email protected].
IZ is supported by a CDS Moore-Sloanpostdoctoral fellowship. Convex analysis 16C Proof of Theorem 1: Turning “all” into “nothing” 18
C.1 Proof of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19C.2 Proof of Lemma 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
D Proof of Corollary 1: Establishing the “all” 21E Applications: the Proofs 23
E.1 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23E.2 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
F Remaining proofs 27
A surprising feature of high-dimensional inference problems is the presence of phase transitions ,where the behavior of estimator changes abruptly as the parameters of a problem vary. Often,these transitions help illuminate fundamental limitations of an optimal estimation procedure, byshowing, for instance, that a certain inference task is impossible when the noise is too large orthe number of samples too few. There is a large and growing literature on proving rigorously thepresence of such transitions and on establishing their implications for learning and inference tasksin a variety of settings. [see, e.g. MM09]A particularly striking phase transition is known as the all-or-nothing phenomenon [GZ17,RXZ19b, Zad19]. In problems evincing this phenomenon, there is a sharp break: below a criticalnumber of samples, it is impossible to infer almost any information about a parameter of interest,but as soon as that critical point is reached, it is possible to infer the parameter almost perfectly.Such problems exhibit a sharp dichotomy, where either perfect inference is possible or nothing is.In this work, we develop general tools for proving the all-or-nothing phenomenon for a classof models we call “noiseless discrete channels.” In such models, we fix a function g and observeidentically distributed copies of a pair ( Y, X ) ∈ Y × R L generated by Y = g ( X, θ ) , where X is a random draw from some known distribution on R L , and θ is an unknown parameterto be estimated. Under the assumption that |Y| < ∞ , we can view g as a discrete channel,parametrized by θ , which maps R L to Y , and our goal is to ascertain how many samples (i.e., howmany uses of this channel) we need to reliably recover θ .We highlight two special cases of the above model which have seen recent attention: • Group testing [Dor43, AJS19]: θ ∈ { , } N indicates a subset of infected individuals in apopulation, and X ∈ { , } N indicates a random subset chosen to be tested as a batch. Weobserve g ( X, θ ) = 1(Support( X ) ∩ Support( θ ) = ∅ ), where for a vector v ∈ R N , Support( v ) ⊆ [ N ] denotes the set of the non-zero coordinates of v . How many tests do we need to determinewhich individuals are infected? • Planted Gaussian perceptron [ZK16]: in this simple “teacher-student” setting, θ ∈ { , } N represents the weights of a “teacher” one-layer neural network, and we observe g ( X, θ ) =1( P Nj =1 θ j X j ≥ X j are i.i.d. standard Gaussian random variables. How manysamples do we need for a “student” to learn the teacher’s hidden weights?2oth models have recently been studied in the all-or-nothing framework [TAS20, LBM20]. However,the range of parameters for which the all-or-nothing phenomenon has been rigorously established ineither model is limited. [TAS20] show that all-or-nothing holds for group testing in the extremelysparse regime when the number of infected individuals is o ( N ε ) for all ε >
0. Their proof iscombinatorial and proceeds by the second-moment method. [LBM20] give a heuristic derivation ofthe all-or-nothing phenomenon for the planted Gaussian perceptron based on the replica methodfrom statistical physics, and establish that this phenomenon holds if k θ k := |{ i ∈ [ N ] : θ i = 0 }| isboth ω ( N ) and o ( N ). We give a simple criterion for the all-or-nothing phenomenon to hold in noiseless discrete channels.For such settings, we measure success in terms of the minimum mean squared error (MMSE) andthe signal is assumed to lie on the Euclidean unit sphere. The “all” property corresponds to avanishing MMSE, while the “nothing” property corresponds to MMSE being asymptotically equalto one, which is the mean squared error achieved by the trivial zero estimator. As a corollary ofour result, we show that the all-or-nothing phenomenon holds for all relevant sparsity regimes inboth the group testing and planted perceptron models, substantially generalizing prior work.Our key technical contribution is to show that, under suitable conditions, proving a version of the“all” condition immediately implies that the “nothing” condition holds as well. More specifically, weshow that if the posterior distribution concentrates tightly enough around the the true parameterfor all n ≥ (1 + ǫ ) n ∗ for some critical n ∗ , then for n ≤ (1 − ǫ ) n ∗ no recovery is possible. In otherwords, for these models, “all” implies “nothing” in a suitable sense. Crucially, the “all” conditioncan often be proven directly, by simple means, as it suffices to establish that a specific estimatoris successful, via for example a simple “union bound” or “first-moment” argument. On the otherhand, the “nothing” lower bound requires proving the failure of any estimation method, and hastypically been proven by using more subtle techniques, such as delicate second moment methodarguments (see e.g. [RXZ19b] for the regression setting and [TAS20] for the Bernoulli group testingsetting). Our “all” implies “nothing” result shows that this complication is unnecessary for a classof noiseless discrete channels.We apply our techniques to both non-adaptive Bernoulli group testing and the planted Gaussianperceptron model. We report the following. • For the
Bernoulli group testing model (BGT), we focus on the case, common in the grouptesting literature, where there are k infected individuals, with k = o ( N ) . We model theinfected individuals as a binary k -sparse vector on the unit sphere, and as mentioned abovewe measure success in terms of the MMSE. In the BGT setting each individual is assumedto participate in any given test in an i.i.d. fashion, and independently with everything, withprobability νk , for some ν = ν k satisfying q = (1 − νk ) k . Here q ∈ (0 ,
1) is a fixed constant,again as customary in the literature of Bernoulli group testing [AJS19]. We show as anapplication of our technique that the all-or-nothing phenomenon holds for the BGT design for all k = o ( N ) and for any q ≤ at the critical number of tests n ∗ q = k log Nk /h ( q ) , where h ( q ) denotes the (rescaled) binary entropy at q defined in (16). In words, with less than n q samples the MMSE is not better than “random guess”, while with more than n q samplesit is almost zero. To the best of our knowledge this result was known before only in the casewhere k = o ( N ε ) for all ε > q = [TAS20].3 For the
Gaussian perceptron model , we focus on the case where θ is a a binary k -sparsevector on the unit sphere. We study a more general class of noiseless boolean models thanthe Gaussian perceptron, where Y i = 1( h X i , θ i ∈ A ) for some arbitrary Borel A ⊆ R with(standard) Gaussian mass equal to . Equivalently we consider any Boolean function f : R → {− , } which is balanced under the standard Gaussian measure, i.e. E f ( Z ) = 0 , Z ∼ N (0 , , and assume Y i = f ( h X i , θ i ) . Notice that the perceptron model corresponds to thecase A = [0 , + ∞ ) and f ( t ) = 21( t > −
1, but it includes other interesting models such asthe symmetric binary perceptron A = [ − u, u ] with u the median of | Z | , Z ∼ N (0 ,
1) which hasrecently been studied in the statistical physics literature [APZ19]. We apply our technique inthis setting to prove a generic result; all such models exhibit the all-or-nothing phenomenonat the same critical sample size n ∗ = k log Nk .
To the best of our knowledge this sharp phase transition was known before only in the casewhere A = [0 , + ∞ ) and k is ω ( N ) and o ( N ) [LBM20] All-or-Nothing
The all-or-nothing phenomenon has been investigated in a variety of models,and with different techniques [GZ17, NZ20, BM19, RXZ19a, BMR20, LBM20, TAS20, RXZ19b].More specifically, the phenomenon was initially observed in the context of the maximum likelihoodestimator for sparse regression in [GZ17] and was later established in the context of MMSE forsparse regression [RXZ19b, RXZ19a], sparse (tensor) PCA [BM19, NZ20, BMR20], Bernoulli grouptesting [TAS20] and generalized linear models [LBM20].A common theme of these works is that all-or-nothing behavior can arise when the parameterto be recovered is sparse, with sparsity sublinear in the dimensions of the problem. Though it isexpected that this phenomenon should arise for all sublinear scalings, technical difficulties oftenrestrict the range of applicability of rigorous results. In the present work we circumvent thischallenging by showing that a version of the “all” condition suffices to establish the all-or-nothingphenomenon for the whole sublinear regime. As mentioned above, usually the “all” result is easierto establish than the “nothing” result. Leveraging this, we are able to establish the all-or-nothingphase transitions throughout the sublinear sparsity regimes of both the Bernoulli group testingand Gaussian perceptron models, where only partial results have been established before [TAS20,LBM20]. “All” implies “Nothing”
As mentioned already, our key technical contribution is showing thata version of an “all” result suffices to establish the all-or-nothing phenomenon. This potentiallycounterintuitive result relates to a technique used in information theory known as the area theorem [MMU08, KKM +
17, RP16]. A heuristic explanation of this connection in the regression contextappears in [RXZ19b, Section 1.1.]; however, despite this intuition, the authors of [RXZ19b] do notproceed by this route. To the best of our knowledge, our results are the first to rigorously prove thatin certain sparse learning settings, an “all” result indeed implies the all-or-nothing sparse phasetransition.
Bernoulli group testing
Now, we comment on our contribution for the BGT model, as com-pared to the BGT literature. In the Bernoulli group testing model, it is well-known that for all k = o ( N ) and q = (1 − νk ) k , it is possible to obtain a vanishing MMSE (“all”) with access to41 + ǫ ) n ∗ q = (1 + ǫ ) k log Nk /h ( q ) tests [AJS19, TAS20]. Furthermore, it is also known as a folk-lore result that if q = 1 / − ǫ ) n ∗ / test it is impossible to achieve an “all” result[AJS19, TAS20]. To the best of our knowledge, this (weak) negative result of “all” being impossibleis not known when q = and one has access to fewer than (1 − ǫ ) n ∗ q tests. Finally, as mentionedabove, [TAS20] do establish the strong negative “nothing” result that if k = o ( N δ ) for all δ > q = with less (1 − ǫ ) n ∗ / it is impossible to achieve a non-trivial MMSE [TAS20].In the present work, we show as a corollary of our methods that for all k = o ( N ) and q ≤ ,“nothing” holds when the number of tests is fewer than (1 − ǫ ) n ∗ q , substantially improving theliterature of impossibility results in Bernoulli group testing. It should finally be mentioned thatwhile to the best of our knowledge, the appropriate “all” result needed for our argument to workis not known for any q < it has been established before when q = [see, e.g. IZ20, Lemma 1.3.]. Gaussian perceptron model
For the Gaussian perceptron model to the best of our knowledgethe most relevant result is in [LBM20] where the authors prove the all-or-nothing phenomenon at n ∗ = k log Nk samples when k is ω ( N ) and o ( N ). While they characterize the free energy of themodel and therefore provide more precise results than we do, their results apply to a restrictedsparsity regime. We do not precisely characterize the limiting free energy, but our much simplerargument shows that the all-or-nothing phenomenon holds for all sparsity levels k = o ( N ). The family of models
We define a sequence of observational models we study in this work,indexed by N ∈ N . Assume that an unknown parameter, or “signal”, θ ∈ R N is drawn from someuniform prior P Θ = ( P Θ ) N supported on a discrete subset Θ of the unit sphere in R N . We set | Θ | = M = M N and assume that for any θ, θ ′ ∈ Θ it holds h θ, θ ′ i ≥ . For some distribution D = D N supported on R L , where L = L N , we assume that for n = n N i.i.d. samples X i ∼ D X , i = 1 , , . . . , n we observe ( Y i , X i ) , i = 1 , , . . . , n where Y i = g ( X i , θ ) , i = 1 , , . . . , n. (1)The function g = g N : R L × R N → Y is referred to as the channel. We assume throughout that Y is finite and of cardinality that remains constant as N grows, e.g. Y = { , } . We denote by Y n the n -dimensional vector with entries Y i , i = 1 , , . . . , n and X n the n × L matrix with columns thevectors X i , i = 1 , , . . . , n . We write P = P N for the joint law of ( Y n , X n , θ ).We are given access to the pair ( Y n , X n ),and our goal is to recover θ . We measure recoverywith n samples in terms of the minimum mean squared error (MMSE),MMSE N ( n ) = E k θ − E [ θ | Y n , X n ] k . (2) The all-or-nothing phenomenon
We say that a sequence of models (( P Θ ) N , g N , D N ) satisfiesthe all-or-nothing phenomenon with critical sequence of sample sizes n c = ( n c ) N iflim N →∞ MMSE N ( ⌊ βn c ⌋ ) = (cid:26) β <
10 if β > . (3)This condition expresses a very sharp phase transition: when β >
1, we can identify the signalnearly perfectly, but when β <
1, we can do no better than a trivial estimator which alwaysoutputs zero. 5 ssumptions
To establish our result we make throughout the following further assumptions onour models.Recall that we have assumed that our prior P Θ is the uniform distribution on some finite subsetof cardinality M = M N . We assume throughout that M N → ∞ as N → ∞ . We also make thefollowing assumption, which requires that the distribution P Θ is sufficiently spread out. Assumption 1.
For θ and θ ′ chosen independently from P Θ we havelim δ → + lim N → + ∞ log( M P ⊗ ( h θ, θ ′ i ≥ − δ ))log M = 0 . (4)Moreover, we assume for any θ ∈ Θ, if θ ′ ∼ P Θ , then for any ǫ > N P Θ ( h θ ′ , θ i ≥ ǫ ) = 0 . (5)Assumption (4) guarantees that the that for two independent draws from the prior θ, θ ′ , theasymptotic probability that θ ′ is very near to θ is dominated by the probability that θ = θ ′ .This condition is the same as the one employed by [NZ20] in the analysis of the all-or-nothingphenomenon for Gaussian models.Assumption (5) implies that a sample from the prior P Θ does not correlate with any fixed vectorin Θ.We make also assumptions on the probability a θ ′ ∈ Θ \ { θ } is able to fit the observationsgenerated by the signal θ . Assumption 2.
We assume there exists a fixed function R : [0 , → [0 , N , suchthat P N ( g ( X, θ ) = g ( X, θ ′ )) = R ( h θ, θ ′ i ) ∀ N ∈ N , θ, θ ′ ∈ Θ . That is, that the probability that g ( X, θ ) and g ( X, θ ′ ) agree is a function of h θ, θ ′ i alone. We assumethat R is continuous at 0 + and strictly increasing on [0 , Notice that since our prior distribution is a uniform distribution over the finite parameter space Θand our observation model is noiseless, the posterior distribution of θ given Y n , X n satisfies thatfor any θ ′ , P ( θ ′ | Y n , X n ) = P ( θ ′ ) P ( Y n | X n , θ ′ ) P ( Y n | X n ) ∝ P ( Y n | X n , θ ′ ) = n Y i =1 Y i = g ( X i , θ ′ )) . In words, the posterior distribution is simply the uniform measure over the vectors θ ′ ∈ Θ satisfying Y i = g ( X i , θ ′ ) , i = 1 , , . . . , n. (6)As an easy corollary, the distance of the posterior mean from the ground truth vector, or equivalentlythe MMSE N ( n ), can be naturally related to the behavior of the following “counting” randomvariables. Definition 1.
For any N ∈ N and δ ∈ [0 , Z N,δ = Z N,δ ( Y n , X n ) be the random variablewhich is equal to the number of solutions θ ′ ∈ Θ of equations (6) where k θ − θ ′ k ≥ δ. Using the definition above, the following simple Lemma holds.6 roposition 1.
For θ ′ drawn from the posterior distribution of θ given Y n , X n it holds almostsurely that P ( k θ − θ ′ k ≥ δ | Y n , X n ) = Z N,δ Z N, . (7) Hence,
MMSE N ( n ) = 12 E Z δ =0 Z N,δ Z N, dδ. (8) Furthermore, under the assumption thatfor all δ ∈ (0 , , lim N P ( Z N,δ >
0) = 0 (9) it also holds lim N MMSE N ( n ) = 0 . (10)Proposition 1 offers a clean combinatorial way of establishing a vanishing MMSE (“all”) in ourcontext; one needs to prove the absence of solutions of (6) which are at a constant distance from θ . A clear benefit of such an approach is that one could possibly establish such a result by tryinga (possibly conditional) union bound—or “first moment”—argument. We investigate further thesuccess of such an approach in what follows.We consider the following critical sample size , n ∗ = ( n ∗ ) N = (cid:22) H ( θ ) H ( Y ) (cid:23) , (11)where by H ( · ) we refer to the Shannon entropy of a discrete random variable and Y = g ( X, θ ) fora sample of X ∼ D N and θ ∼ ( P Θ ) N . The significance of the sample size n ∗ is highlighted in thefollowing proposition which establishes that (9) can only hold if the number of samples is at least n ∗ . Proposition 2.
Suppose that Assumption 1 is true. Then if the condition (9) holds for somesequence of sample sizes n = n N , then lim inf N nn ∗ ≥ . While we defer the proof of Proposition 2 to the Appendix, we highlight some aspects for itwhich will be important in what follows. The key identity behind the proof of the Proposition is italways holds that H ( θ ) − H ( θ | Y n , X n ) = nH ( Y ) − D( P ( Y n , X n ) k Q ( Y n , X n )) (12) ≤ nH ( Y ) , (13)where 1) D stands for the Kullback-Leibler (KL) divergence (see e.g. [PW15, Section 6]), 2) P ( Y n , X n ) stands for the joint law of ( Y n , X n ) generated by the observation model (1) and 3) Q ( Y n , X n ) stands for the law of a “null” model where the columns of X n are i.i.d. samples drawnfrom D and the entries of Y n are drawn in an i.i.d. fashion from the distribution of Y = g ( X, θ ) but independently from X n . As an outcome P, Q share the same marginals but are distinct as7oint distributions, as for example the latter has no hidden signal . The identity (12) follows fromalgebraic manipulations which can be found in Appendix B. The inequality in (13) is implied bythe non-negativity of the KL divergence.The proof of the proposition is based on the fact that the positive result (9) is strong enough ,not only to imply the “all” condition (10), but also to imply that the entropy of the posterior is ofsmaller order of magnitude than the entropy of the prior (see Proposition 4). This property allowsus to conclude that the left hand side of (13) is (1 − o (1)) H ( θ ) which concludes the proof.Now we present the main technical result of the present work. We establish that if Proposition2 is tight, that is if (9) can be proven to be true when n ≥ (1 + ǫ ) n ∗ for arbitrary ǫ >
0, then (9)is a sufficient to establish that the the all-or-nothing phenomenon holds at sample size n ∗ as well. Theorem 1 (“all” implies “nothing”) . Suppose that Assumptions 1, 2 are true. Assume that if n ≥ (1 + ǫ ) n ∗ , for some arbitrary fixed ǫ > , then (9) holds. Then if n ≤ (1 − ǫ ) n ∗ for arbitraryfixed ǫ ∈ (0 , , it holds lim N MMSE N ( n ) = 1 . (14) In particular, the all-or-nothing phenomenon (3) holds at critical samples sizes n c = n ∗ . We provide here some intuition behind such a potentially surprising implication. Notice thatif (9) holds at sample sizes (1 + ǫ ) n ∗ for arbitrary fixed ǫ >
0, then from the sketch of the proofof Proposition 2 the inequality (13) needs to hold (approximately) with equality. In fact one canshow that at n = n ∗ , it must necessarily hold thatlim N D( P ( Y n ∗ , X n ∗ ) k Q ( Y n ∗ , X n ∗ )) H ( θ ) = 0 . (15)At an intuitive level, (15) seems already a significant step towards what we desire to prove. Indeed,(15) suggests that ( Y n , X n ) drawn from our model P are close in distribution to the samples( Y n , X n ) drawn from the null model Q . This strongly suggests that outperforming the randomguess in the mean-squared error should be impossible.While we think that this argument hints at the right direction for proving the “nothing” prop-erty, we do not know a complete proof along these lines. The reason is that one cannot conclude that P and Q are “sufficiently close”, e.g. in the total variation sense, to argue the above. The reasonis that the KL distance in (15) vanishes only after rescaled with the factor H ( θ ) = log M N → + ∞ .For this reason, (15) does not imply any nontrivial bound for the total variation distance between P, Q . Notably though, such an obstacle has already been tackled in the literature of “nothing”results in the context of sparse tensor PCA [BMV +
18, NZ20]. In these cases the “nothing” resultcan be established by the use of the I-MMSE formula combined with weak detection lower boundsuch as (15). The I-MMSE formula is an identity for Gaussian channels between the derivative(with respect to the continuous signal to noise ratio (SNR) ) of the corresponding KL divergenceand the MMSE for this value of SNR. We are not aware of any such formula for the noiseless dis-crete models considered in this work. Nevertheless, inspired by the I-MMSE connection, we studythe discrete derivative of the KL divergence for noiseless models. Specifically we prove a result ofpotentially independent interest, that a vanishing discrete derivative (with respect to the samplesize n ) of the sequence D( P ( Y n , X n ) k Q ( Y n , X n )) /H ( θ ) , n ∈ N , at n ≤ n ∗ , implies indeed a trivialMMSE at n < n ∗ . We conclude then the result by using classical real analysis results, to showthat the convexity of the vanishing sequence for n ≤ n ∗ , implies its the discrete derivative of thesequence is also vanishing for n ≤ n ∗ . 8 .3 The case of boolean channels: a simple condition One can naturally ask whether for various models of interest there exist a simple sufficient conditionwhich can establish the positive result (9) at n ∗ (e.g. by a union bound argument) and thereforeprove the all-or-nothing phenomenon. In this subsection, we provide such a simple sufficient con-dition for the subclass of boolean (or 1-bit) noiseless models , i.e. when Y = { , } . Perhaps notsurprisingly, our result follows from an appropriate “union-bound” or “first-moment” argument. Inthe next section we apply our condition to various such models.Notice that in these boolean binary settings the critical sample size simplifies to n ∗ = ⌊ H ( θ ) h ( p ) ⌋ where p = P ( g ( X, θ ) = 1) and h is the binary entropy h ( t ) = − t log t − (1 − t ) log(1 − t ) , t ∈ (0 , , (16)where log is, as always in this work, with base e .To proceed, we need some additional definitions. The first is about the two possible outcomesof the channel, and it extends Assumption 2 to further properties on the probability of a fixed θ ′ ∈ Θ \ { θ } satisfying (6). Assumption 3.
There exist fixed functions R : [0 , → [0 , , R : [0 , → [0 , N, such that for all N ∈ N and θ, θ ′ ∈ Θ it holds R ( h θ, θ ′ i ) = P ( g ( X, θ ) = g ( X, θ ′ ) = 1) , (17) R ( h θ, θ ′ i ) = P ( g ( X, θ ) = g ( X, θ ′ ) = 0) . (18)For all ρ ∈ [0 , R ( ρ ) = R ( ρ ) + R ( ρ ) , where R ( ρ ) is as in Assumption 2. We assume that both R and R are increasing on [0 , Definition 2.
Given a non-decreasing continuous function r : [ − , → R ≥ , we say { P Θ } admitsan overlap rate function r lim sup N M N log P ⊗ [ h θ ′ , θ i ≥ ρ ] ≤ − r ( ρ ) , where θ and θ ′ are independent draws from P Θ .We state our result. Corollary 1.
Let Y = { , } and let p = P ( g ( X, θ ) = 1) ∈ (0 , be constant. Suppose thatAssumptions 1, 2 and 3 are true. If { P Θ } admits an overlap rate r ( ρ ) satisfying r ( ρ ) ≥ h ( p ) (cid:18) p log R ( ρ ) p + (1 − p ) log R ( ρ )(1 − p ) (cid:19) ∀ ρ ∈ [0 , , (19) then the all-or-nothing phenomenon holds at n ∗ = ⌊ log Mh ( p ) ⌋ . In this subsection we use our results, and specifically Corollary 1, to establish the all-or-nothingphenomenon for various sparse boolean models of interest.9 .1 Application 1: Nonadaptive Bernoulli group testing
We start with nonadaptive Bernoulli group testing. In this context, we fix some parameter k = k N ∈ N with k → + ∞ and k = o ( N ) , which corresponds to k infected individuals out of a populationof cardinality N . We also fix a constant q ∈ (0 , / N individuals at a time. The logic is that by doing so we may be able to use fewertests, say, from testing each individual separately, and still recover the infected individuals. Noticethat such a Bernoulli group testing model is characterized by the two parameters q and k = k N . The Model
We assume a uniform prior P Θ , which following our notation we encode as theuniform measure on the k -sparse binary vectors on the sphere in N -dimensions, i.e.Θ = { θ ∈ { , √ k } N : k θ k = k } , where there is a natural correspondence between the identities of the infected individuals and thesupport of the vectors θ ∈ Θ. For each k , we define ν = ν k to be the unique positive numbersatisfying (cid:16) − νk (cid:17) k = q . The group of individuals being tested is modeled by the binary vector X ∈ { , } N , with D =Bernoulli( νk ) ⊗ N . In words, we choose whether in individual participates at any given test inde-pendently from everything and with probability ν/k , with the parameter ν chosen so that theprobability that each group contains no infected individuals is exactly q .We model the channel by the step function at √ k , i.e. Y = g ( X, θ ) = 1( h X i , θ i ≥ √ k ) , whichsimply outputs the information of whether at least one of the individuals in the selected group isinfected (which is equivalent to √ k h X i , θ i = | Support( X i ) ∩ Support( θ ) | ≥
1) or not. The samplesize n corresponds to the number of tests conducted.Corollary 1 when applied to this context establishes the following result. Theorem 2.
Let q ∈ (0 , ] be a constant. Suppose k = o ( N ) and ν k satisfies (1 − ν k k ) k = q . Thenthe non-adaptive Bernoulli group testing model satisfies the all-or-nothing phenomenon at numberof tests n ∗ = (cid:22) log ( Nk ) h ( q ) (cid:23) = (1 + o (1)) k log Nk h ( q ) . The q = 1 / k = N o (1) case of Theorem 2 was proved by [TAS20]. Our theorem extends theirresult to all sublinear sparsities and all q ≤ /
2. Whether a similar result holds for q > / In this subsection, we turn our study to a family of what we call as
Sparse Balanced Gaussian(SBG) models . Every such model can be characterized by some sparsity parameter k = k N = o ( N )and a fixed “balanced” Borel subset A ⊆ R with P ( Z ∈ A ) = 1 / Z ∼ N (0 , he Model We assume as above that the signal θ is sampled from the uniform measure on the k -sparse binary vectors on the sphere in N -dimensions, i.e. Θ = { θ ∈ { , √ k } N : k θ k = k } . Weassume that the distribution for X ∈ R N is given by the standard Gaussian measure D = N (0 , I N ) . Finally the channel is given by the formula Y = g ( X, θ ) = 1( h X i , θ i ∈ A ) . We highlight two models of this class that have been studied in different contexts. • The case A = [0 , + ∞ ) corresponds to the well-studied Gaussian perceptron model with asparse planted signal Y = 1( h X, θ i ≥ . Variants of the Gaussian perceptron model hasreceived enormous attention in learning theory and statistical physics (see e.g. [ZK16,BKM +
19] and references therein). Recently the sparse version of it has been studied by [LBM20]. • The case A = [ − u, u ] for some u with u such that P ( | Z | ≤ u ) = , which correspondsto what is known as the symmetric binary perceptron model with a sparse planted signal Y = 1( |h X, θ i| ≤ u ) [APZ19].We establish a general result that all SBG models exhibit the all-or-nothing phenomenon at thesame critical sample size. Theorem 3.
Suppose k = o ( N ) and A ⊆ R be an arbitrary fixed Borel subset with P ( Z ∈ A ) = 1 / for Z ∼ N (0 , . Then the Sparse Balanced Gaussian model defined by k and A exhibits the all-or-nothing phenomenon at n ∗ = ⌊ log ( Nk ) log 2 ⌋ = (1 + o (1)) k log Nk . In the context of Gaussian perceptron it has recently been proven [LBM20] that the all-or-nothing phenomenon holds for any ω ( N ) = k = o ( N ). Theorem 3 generalizes this result to any k = o ( N ), even constant. To our knowledge, the existence of such a transition for the symmetricbinary perceptron and other SBG models is new. References [AJS19] Matthew Aldridge, Oliver Johnson, and Jonathan Scarlett. Group testing: an informa-tion theory perspective. arXiv preprint arXiv:1902.06002 , 2019.[APZ19] Benjamin Aubin, Will Perkins, and Lenka Zdeborov´a. Storage capacity in symmetricbinary perceptrons.
J. Phys. A , 52(29):294003, 32, 2019.[BKM +
19] Jean Barbier, Florent Krzakala, Nicolas Macris, L´eo Miolane, and Lenka Zdeborov´a.Optimal errors and phase transitions in high-dimensional generalized linear models.
Proc. Natl. Acad. Sci. USA , 116(12):5451–5460, 2019.[BM19] Jean Barbier and Nicolas Macris. 0-1 phase transitions in sparse spiked matrix estima-tion. arXiv:1911.05030, 2019.[BMR20] Jean Barbier, Nicolas Macris, and Cynthia Rush. All-or-nothing statistical and com-putational phase transitions in sparse spiked matrix estimation. In Larochelle et al.[LRH + +
18] Jess Banks, Cristopher Moore, Roman Vershynin, Nicolas Verzelen, and Jiaming Xu.Information-theoretic bounds and phase transitions in clustering, sparse PCA, andsubmatrix localization.
IEEE Trans. Inform. Theory , 64(7):4872–4994, 2018.11Bor85] Christer Borell. Geometric bounds on the Ornstein-Uhlenbeck velocity process.
Z.Wahrsch. Verw. Gebiete , 70(1):1–13, 1985.[Dor43] Robert Dorfman. The detection of defective members of large populations.
The Annalsof Mathematical Statistics , 14(4):436–440, 1943.[GZ17] David Gamarnik and Ilias Zadik. High dimensional linear regression with binary coef-ficients: Mean squared error and a phase transition.
Conference on Learning Theory(COLT) , 2017.[HUL93] Jean-Baptiste Hiriart-Urruty and Claude Lemar´echal.
Convex analysis and minimiza-tion algorithms. I , volume 305 of
Grundlehren der Mathematischen Wissenschaften[Fundamental Principles of Mathematical Sciences] . Springer-Verlag, Berlin, 1993. Fun-damentals.[IZ20] Fotis Iliopoulos and Ilias Zadik. Group testing and local search: is there acomputational-statistical gap?, 2020.[KKM +
17] S. Kudekar, S. Kumar, M. Mondelli, H. D. Pfister, E. S¸a¸soˇglu, and R. L. Urbanke.Reed–muller codes achieve capacity on erasure channels.
IEEE Transactions on Infor-mation Theory , 63(7):4298–4316, 2017.[LBM20] Cl´ement Luneau, Jean Barbier, and Nicolas Macris. Information theoretic limits oflearning a sparse rule. In Larochelle et al. [LRH + +
20] Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, andHsuan-Tien Lin, editors.
Advances in Neural Information Processing Systems 33:Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020,December 6-12, 2020, virtual , 2020.[MM09] Marc Mezard and Andrea Montanari.
Information, physics, and computation . OxfordUniversity Press, 2009.[MMU08] C. Measson, A. Montanari, and R. Urbanke. Maxwell construction: The hidden bridgebetween iterative and maximum a posteriori decoding.
IEEE Transactions on Infor-mation Theory , 54(12):5277–5307, 2008.[NZ20] Jonathan Niles-Weed and Ilias Zadik. The all-or-nothing phenomenon in sparse tensorPCA. In Larochelle et al. [LRH + Analysis of Boolean functions . Cambridge University Press, NewYork, 2014.[PW15] Yury Polyanskiy and Yihong Wu. Lecture notes on information theory. 2015.[RP16] G. Reeves and H. D. Pfister. The replica-symmetric prediction for compressed sensingwith gaussian matrices is exact. In , pages 665–669, 2016.[RXZ19a] Galen Reeves, Jiaming Xu, and Ilias Zadik. All-or-nothing phenomena: From single-letter to high dimensions. In , pages 654–658, 2019.12RXZ19b] Galen Reeves, Jiaming Xu, and Ilias Zadik. The all-or-nothing phenomenon in sparselinear regression. In Alina Beygelzimer and Daniel Hsu, editors,
Proceedings of theThirty-Second Conference on Learning Theory , volume 99 of
Proceedings of MachineLearning Research , pages 2652–2663, Phoenix, USA, 25–28 Jun 2019. PMLR.[She99] William Fleetwood Sheppard. On the application of the theory of error to cases ofnormal distribution and normal correlation.
Philosophical Transactions of the RoyalSociety of London. Series A, Containing Papers of a Mathematical or Physical Char-acter , (192):101–167, 1899.[TAS20] L. V. Truong, M. Aldridge, and J. Scarlett. On the all-or-nothing behavior of bernoulligroup testing.
IEEE Journal on Selected Areas in Information Theory , 1(3):669–680,2020.[Zad19] Ilias Zadik.
Computational and statistical challenges in high dimensional statisticalmodels.
PhD thesis, Massachusetts Institute of Technology; Cambridge MA, 2019.[ZK16] Lenka Zdeborov´a and Florent Krzakala. Statistical physics of inference: thresholds andalgorithms.
Advances in Physics , 65(5):453–552, aug 2016.13
Auxilary results and important preliminary concepts
We start with an elementary lemma.
Lemma 1.
Under our assumptions, the sequence of critical sample sizes n ∗ defined in (11) satisfies,(i) lim N n ∗ = + ∞ (ii) n ∗ = (1 + o (1)) H ( θ ) H ( Y ) , as N → + ∞ .Proof. Recall that in all our models we assume H ( θ ) = log M N → + ∞ as N → + ∞ . Furthermore,we assume that Y = g ( X, θ ) is a random variable supported on a subset of Y where |Y| = O (1).Hence, H ( Y ) ≤ log |Y| = O (1) . In particular in all our models it holdslim N H ( θ ) H ( Y ) = + ∞ . This establishes the first property. The second property follows since for any real valued sequence X N , N ∈ N with lim N x N = + ∞ , it holds lim N x N ⌊ x N ⌋ = 1 . We continue with a crucial proposition for our main result. This proposition establishes aconnection between the critical sample size n ∗ , the estimation error manifested in the form of theentropy of the posterior distribution.To properly establish it we need to some additional notation, and the definition of an appropriate“null” distribution on the observation ( Y n , X n ). Definition 3.
We denote by P n = P n ( Y n , X n ) the law of the observation under our model, i.e.for any measurable A , P n ( A ) = E θ ∼ P Θ P (( Y n , X n , θ ) ∈ A ) . In words, P n generates ( Y n , X n ) byfirst sampling θ from the prior, independently sampling X n in an i.i.d. fashion from D , and thengenerating Y n by the conditional law P ( Y n | θ, X n ).Notice that in the noiseless case studied in this work, the latter conditional law greatly simplifiesto a dirac mass at Y n = ( g ( X i , θ )) ni =1 . Definition 4.
Denote, as usual, by Y = g ( X, θ ) the random variable where X ∼ D and θ ∼ P Θ are independent. We define by Q n = Q n ( Y n , X n ) the “null” distribution on n samples, where theobservations are generated as follows. We sample X n in an i.i.d. fashion from D and then generate Y n in an i.i.d. fashion from the law of Y , independently from X n .Notice first that the marginals of Y i under Q n are identical to the marginals of Y i under P n .Yet, the joint law of Q n does not include any “signal” θ and Y n are independent of X n . Naturally,is not possible to estimate any signal θ with observations coming from the null model Q n .The following proposition holds. Proposition 3.
Under our assumptions, ( θ, Y n , X n ) generated according to P n , and Q n definedin Definition 4 we have, I( θ ; Y n | X n ) = nH ( Y ) − D( P n k Q n ) . (20) and therefore (cid:18) − H ( θ | Y n , X n ) H ( θ ) (cid:19) + D( P n k Q n ) H ( θ ) = (1 + o (1)) nn ∗ . (21)14 roof. Note that (21) follows from (20) directly from part (ii) of Lemma 1, along with the identityI( θ ; Y n | X n ) = H ( θ | X n ) − H ( θ | Y n , X n ) = H ( θ ) − H ( θ | Y n , X n ) , where the second equality uses that X n is independent of θ .We now prove (21). We haveI( θ ; Y n | X n ) = E X n (cid:20) E θ,Y n | X n log P( Y n | X n , θ )P( Y n | X n ) (cid:21) = E X n (cid:20) E θ,Y n | X n log 1P( Y n | X n ) (cid:21) = E X n (cid:20) E θ,Y n | X n log 1Q( Y n ) (cid:21) − E X n (cid:20) E θ,Y n | X n log P( Y n | X n )Q( Y n ) (cid:21) The second term is D( P n k Q n ). For the first term, we have that Q ( Y n ) = Q ni =1 Q ( Y i ) by thedefinition of Q . Since by assumption also Q ( Y i ) = P ( Y i ) for each i we conclude E X n (cid:20) E θ,Y n | X n log 1Q( Y n ) (cid:21) = n E X (cid:20) E θ,Y | X log 1P( Y ) (cid:21) = nH ( Y ) . The proof is complete.
Proposition 4.
Under our assumptions the following holds. Suppose that (9) holds for somesequence of sample size n = n N . Then we also have lim N → + ∞ H ( θ | Y n , X n ) H ( θ ) = 0 . (22) Proof.
Recall that since the prior is uniform and the model is noiseless, the posterior is simply theuniform distribution over the solutions θ ′ of the system of equations (6). Hence, using the notationof Definition 1, Z N, is the random variable which is equal to the number of such solutions. Hence,(22) is equivalent to lim N → + ∞ E log Z N, log M = 0 . where we used that H ( θ ) = log M .Now fix a δ ∈ (0 ,
2] and notice that almost surely,log Z N, ≤ log( |{ θ ′ ∈ Θ : h θ, θ ′ i ≥ − δ }| + Z N,δ ) ≤ log(2 max {|{ θ ′ ∈ Θ : h θ, θ ′ i ≥ − δ }| , Z N,δ } ) ≤ log |{ θ ′ ∈ Θ : h θ, θ ′ i ≥ − δ }| + 1( Z N,δ ≥
1) log( Z N,δ ) + O (1) ≤ log |{ θ ′ ∈ Θ : h θ, θ ′ i ≥ − δ }| + 1( Z N,δ ≥
1) log M + O (1)= log( M P θ ′ ∼ P Θ ( h θ, θ ′ i ≥ − δ Z N,δ ≥
1) log M + O (1) , where at the last step we used that the prior is uniformly distributed.15ence, since lim N log M N = + ∞ , we conclude that for any δ ∈ (0 , E log Z N, log M ≤ E log( M P θ ′ ∼ P Θ ( h θ, θ ′ i ≥ − δ ))log M + P ( Z N,δ ≥
1) + o (1) ≤ log( M P ⊗ ( h θ, θ ′ i ≥ − δ ))log M + P ( Z N,δ ≥
1) + o (1) , where in the last inequality we used Jensen’s inequality and the fact that the logarithm is concave.Now we send first N to infinity and then δ to zero. The second and third terms on the right handside of the last inequality vanishes by the first limit since (9), and the first term vanishes by thedouble limit by using Assumption 1. The proof follows.We state and prove here a foklore result in the statistical physics literature called the “Nishimori”identity, which will be useful in what follows. Lemma 2.
It always holds that if θ ′ is a random variable drawn from the posterior distribution P θ | Y n ,X n that MMSE N ( n ) = 1 − E h θ, θ ′ i . Proof.
Bayes’s rule implies that the joint distribution of h θ ′ , θ i is identical with the distributionof h θ ′ , θ ′′ i of two independent random variables drawn from the posterior distribution of θ given Y n , X n . Therefore, E h E [ θ | Y n , X n ] , θ i = E h θ ′ , θ i = E h θ ′ , θ ′′ i = E k E [ θ | Y n , X n ] k . The result follows sinceMMSE N ( n ) = 1 + E k E [ θ | Y n , X n ] k − E h E [ θ | Y n , X n ] , θ i . B Convex analysis
Towards employing certain analytic techniques we consider the following function defined on R > ,which simply linear interpolates between the values the sequence D( P n k Q n ) /H ( θ ) , n ∈ N . Weestablish that the analytic properties of this function express various fundamental statistical prop-erties of the inference setting of interest. Here and throughout this section, P n , Q n are defined asin Definitions 3, 4. Definition 5.
Let D N : (0 , + ∞ ) → [0 , + ∞ ) , N ∈ N be the sequence of functions defined by D N ( β ) := (1 − βn ∗ + ⌊ βn ∗ ⌋ ) D ( P ⌊ βn ∗ ⌋ || Q ⌊ βn ∗ ⌋ ) H ( θ ) + ( βn ∗ − ⌊ βn ∗ ⌋ ) D ( P ⌊ βn ∗ ⌋ +1 || Q ⌊ βn ∗ ⌋ +1 ) H ( θ ) . (23)Notice that the normalization of the argument of D N is appropriately chosen such that D N (1) =D( P n ∗ k Q n ∗ ) /H ( θ ) . Proposition 5.
Consider the sequence of functions D N , per Definition 5. Then under our as-sumptions the following hold. a) For each N , D N is a convex, increasing, nonnegative function.(b) For all fixed β > , lim sup N D N ( β ) = β − N H ( θ | X ⌊ βn ∗ ⌋ , Y ⌊ βn ∗ ⌋ ) H ( θ ) . (c) For all fixed β > and for each N , the function D N is left differentiable at β and the leftderivative at β satisfies ( D N ) ′− ( β ) = 1 − H ( Y ⌈ βn ∗ ⌉ | Y ⌈ βn ∗ ⌉− , X ⌈ βn ∗ ⌉ ) H ( Y ) + o (1) , (24) where the o (1) term tends to zero as N → + ∞ .Proof. We start with part (a).Since D N is a linear interpolation of the sequence D( P n k Q n ) H ( θ ) , n ∈ N and H ( θ ) >
0, it suffices toshow the same properties for the sequence D( P n k Q n ) , n ∈ N . The nonnegativitiy is obvious. For afixed n ∈ N we have using the identity (20) from Proposition 3 thatD( P n +1 k Q n +1 ) − D( P n k Q n ) = H ( Y ) − I( θ ; Y n +1 | X n +1 ) + I( θ ; Y n | X n )By the chain rule for mutual information and its definition we haveI( θ ; Y n +1 | X n +1 ) = I( θ ; Y n +1 | X n +1 , Y n ) + I( θ ; Y n | X n ) . Combining the above and using the definition of the mutual information and the fact that ourchannel Y i = g ( X i , θ ) is noiseless, we obtainD( P n +1 k Q n +1 ) − D( P n k Q n ) = H ( Y ) − I( θ ; Y n +1 | X n +1 , Y n )= H ( Y ) − H ( Y n +1 | X n +1 , Y n ) + H ( Y n +1 | X n +1 , Y n , θ )= H ( Y ) − H ( Y n +1 | X n +1 , Y n ) (25)= I( Y n +1 ; X n +1 , Y n ) (26)Now the increasing property of the sequence follows from the fact that the mutual information isnon-negative. For the convexity, it suffices to show that the right hand side of (26) is nondecreasing.Indeed, notice that for each n from the fact that conditioning reduces entropy,I( Y n +1 ; X n +1 , Y n ) = H ( Y n +1 ) − H ( Y n +1 | X n +1 , Y n ) ≥ H ( Y n +1 ) − H ( Y n +1 | X , . . . , X n +1 , Y , . . . Y n )= I( Y n +1 ; X , . . . , X n +1 , Y , . . . Y n )= I( Y n ; X , . . . , X n , Y , . . . Y n − )= I( Y n ; X n , Y n − ) . This completes the proof of part (a).For part (b) notice that from Proposition 3 we have for each fixed β > P ⌊ βn ∗ ⌋ k Q ⌊ βn ∗ ⌋ ) H ( θ ) = (1 + o (1)) ⌊ βn ∗ ⌋ n ∗ − H ( θ | X ⌊ βn ∗ ⌋ , Y ⌊ βn ∗ ⌋ ) H ( θ )= β − H ( θ | X ⌊ βn ∗ ⌋ , Y ⌊ βn ∗ ⌋ ) H ( θ ) + o (1) , (27)17ince n ∗ → + ∞ by Lemma 1. By (25),D( P ⌊ βn ∗ ⌋ +1 k Q ⌊ βn ∗ ⌋ +1 ) H ( θ ) − D( P ⌊ βn ∗ ⌋ k Q ⌊ βn ∗ ⌋ ) H ( θ ) ≤ H ( Y ) H ( θ ) = o (1) , since H ( Y ) = O (1) and H ( θ ) → ∞ .Since D N ( β ) is a convex combination of D( P ⌊ βn ∗⌋ k Q ⌊ βn ∗⌋ ) H ( θ ) and D( P ⌊ βn ∗⌋ +1 k Q ⌊ βn ∗⌋ +1 ) H ( θ ) , we concludethat lim sup N D N ( β ) = β − N H ( θ | X ⌊ βn ∗ ⌋ , Y ⌊ βn ∗ ⌋ ) H ( θ ) , as we wanted.For part (c), recall that D N is the piecewise linear interpolation of the convex sequence D( P n k Q n ) H ( θ ) .By [HUL93, Proposition I.4.1.1], D N possesses a left derivative on the interior of its domain, sothat D N is left-differentiable at β for all β >
0. Moreover, this left derivative is simply the slopeof the segment which connects ( ⌈ βn ∗ ⌉− n ∗ , D( P ⌈ βn ∗⌉− k Q ⌈ βn ∗⌉− ) H ( θ ) ) and ( ⌈ βn ∗ ⌉ n ∗ , D( P ⌈ βn ∗⌉ k Q ⌈ βn ∗⌉ ) H ( θ ) ), whichequals D( P ⌈ βn ∗⌉ k Q ⌈ βn ∗⌉ ) H ( θ ) − D( P ⌈ βn ∗⌉− k Q ⌈ βn ∗⌉− ) H ( θ ) ⌈ βn ∗ ⌉ n ∗ − ( ⌈ βn ∗ ⌉− n ∗ ) = n ∗ D( P ⌈ βn ∗ ⌉ k Q ⌈ βn ∗ ⌉ ) − D( P ⌈ βn ∗ ⌉− k Q ⌈ βn ∗ ⌉− ) H ( θ )= (1 + o (1)) D( P ⌈ βn ∗ ⌉ k Q ⌈ βn ∗ ⌉ ) − D( P ⌈ βn ∗ ⌉− k Q ⌈ βn ∗ ⌉− ) H ( Y ) , where we use the second part of Lemma 1 for n ∗ and the o (1) term tends to zero as n ∗ tends toinfinity. Now using (25) we conclude that the slope is(1 + o (1)) − H ( Y ⌈ βn ∗ ⌉ | Y ⌈ βn ∗ ⌉− , X ⌈ βn ∗ ⌉ ) H ( Y ) ! = 1 − H ( Y ⌈ βn ∗ ⌉ | Y ⌈ βn ∗ ⌉− , X ⌈ βn ∗ ⌉ ) H ( Y ) + o (1) . The proof is complete.
C Proof of Theorem 1: Turning “all” into “nothing”
Recall that from Proposition 4, condition (9) implies that at (1 + ǫ ) n ∗ samples the entropy of theposterior distribution is of smaller order than the entropy of the prior. Our first result towardsproving Theorem 1 establishes two implications of this property of the entropy of the posterior. Lemma 3.
Suppose that for all ǫ > and n ≥ (1 + ǫ ) n ∗ , (22) holds. Then we have,(1) (KL closeness) lim N D( P n ∗ k Q n ∗ ) H ( θ ) = 0 , (28) and(2) (prediction “nothing”) for any fixed ǫ > , if n ≤ (1 − ǫ ) n ∗ , then lim N H ( Y n +1 | Y n , X n +1 ) H ( Y ) = 1 . (29)18n words, the sublinear entropy of the posterior implies 1) a “KL-closeness” between the planteddistribution P n ∗ and the null distribution Q n ∗ , and 2) that the entropy of the observation Y n +1 conditioned on knowing the past observations Y n and X n +1 is (almost) equal to the unconditionalentropy of Y = Y n +1 . While the first condition is, as already mentioned, hard to interpret (becauseof the H ( θ ) normalization), the second condition has rigorous implication of the recovery problemof interest, because of the following lemma. We emphasize to the reader that towards establishingthis lemma the use of an assumption such as Assumption 2 is crucial. Lemma 4.
Suppose that (29) holds. Then for any fixed ǫ ∈ (0 , , if n = n N ≤ (1 − ǫ ) n ∗ , then (14) holds. Notice that combining Proposition 4, the part (2) of Lemma 3 and Lemma 4, the proof ofTheorem 1 follows in a straightforward manner. We proceed by establishing the two Lemmas.
C.1 Proof of Lemma 3
Proof.
We start with establishing (28). Notice that for any ǫ >
0, combining Proposition 5 part(b) for β = 1 + ǫ and the condition (22), we havelim sup N D N (1 + ǫ ) = ǫ. Using now that D N is increasing from Proposition 5 part (a), we concludelim sup N D N (1) ≤ ǫ, or as ǫ > D N is non-negative,lim N D N (1) = 0 . (30)The identity (28) follows because for β = 1, βn ∗ ∈ N and therefore D N (1) = D( P n ∗ k Q n ∗ ) H ( θ ) . We now show that (28) implies (29). Notice that using Proposition 5 part (a), D N is a sequenceof increasing, convex and non-negative functions which we restrict to be defined on the compactinterval [0 , N sup β ∈ [0 , | D N ( β ) | = lim N D N (1) = 0 . Therefore D N converges uniformly to the zero function.Now to establish our result notice that since conditioning reduces entropy it suffices to considerthe case where n = ⌈ βn ∗ ⌉ − β ∈ (0 , D N are convex and converge uniformly to 0 in the openinterval (0 ,
1) we can conclude the left derivatives of D N ( β ) converge to the derivative of the zerofunction as well, i.e. lim N ( D N ) ′− ( β ) = 0 . Using now (24) from Proposition 5 part (c) for this β we conclude the proof.19 .2 Proof of Lemma 4 Proof.
Fix some ǫ ∈ (0 ,
1) and assume n ≤ (1 − ǫ ) n ∗ . Denote the probability distribution ˜ P n +1 on( Y n +1 , X n +1 ) where ( Y n , X n ) are drawn from P n and Y n +1 , X n +1 are drawn independently fromthe marginals P Y , D respectively. Notice that ˜ P is carefully chosen so thatD( P n +1 k ˜ P n +1 ) = E log P ( Y n +1 | Y n , X n +1 ) P ( Y n +1 ) = H ( Y ) − H ( Y n +1 | Y n − , X n +1 ) . Hence using (29), D( P n +1 k ˜ P n +1 ) = o ( H ( Y )) = o (1) , where we used the assumption that H ( Y ) ≤ log |Y| = O (1) . Using Pinsker’s inequality we concludelim N d TV ( P n +1 , ˜ P n +1 ) = 0 . Now we denote by θ ′ a sample from the posterior distribution P n ( θ | Y n , X n ). Using the totalvariation guarantee we have P n +1 (cid:8) g ( X n +1 , θ ′ ) = Y n +1 (cid:9) = ˜ P n +1 (cid:8) g ( X n +1 , θ ′ ) = Y n +1 (cid:9) + o (1) . Under ˜ P , because of its definition, we can write Y n +1 as g ( X n +1 , θ ′′ ), where θ ′′ ∼ P θ is independentof everything else. Using Assumption 2 we conclude E R ( h θ, θ ′ i ) = E R ( h θ ′′ , θ ′ i ) + o (1) . Furthermore, using (5) and the fact that θ ′′ is independent from θ ′ we conclude that for any ǫ > , lim sup N E R ( h θ, θ ′ i ) ≤ R ( ǫ )Hence by continuity of R at 0 we havelim sup N E R ( h θ, θ ′ i ) ≤ R (0) . Recall that 0 is the unique minimizer of R on [0 ,
1] which allows us to conclude E | R ( h θ, θ ′ i ) − R (0) | = o (1) . and therefore by Markov’s inequality for any ε > P ( R ( h θ, θ ′ i ) > R (0) + ε ) = o (1) . As R is strictly increasing we conclude that, for any ǫ > P ( h θ, θ ′ i > ǫ ) = o (1) . Since the integrand is bounded from above we conclude thatlim sup N E h θ, θ ′ i ≤ . Using now Lemma 2, we conclude lim inf N MMSE N ( n ) ≥ , which concludes the proof. 20 Proof of Corollary 1: Establishing the “all”
Proof.
We apply Theorem 1. We fix some ǫ > n ≥ (1 + ǫ ) n ∗ , (9)holds. We also assume without loss of generality that n ≤ Cn ∗ for an absolute positive constant C ,since the random variables Z N,δ are decreasing in the stochastic order as functions of the samplessize n . By assumption p is a fixed constant in (0 ,
1) independent of N . Hence since h ( p ) is a positiveconstant itself we conclude from the definition of n ∗ that for all n = Θ( n ∗ ) it also holds n = Θ(log M ) . (31)In particular, n → + ∞ as N → + ∞ .Consider n the number of samples where Y i = 1 and notice that n is distributed as a Binomialdistribution Bin( n, p ) . We condition on the event that F = {| n − np | ≤ √ n log log n } . Standardlarge deviation theory on the Binomial distribution yields that since p ∈ (0 ,
1) and n → + ∞ , itholds lim N P ( F ) = 1 . Therefore, by Markov’s inequality it suffices to prove that for every δ > N E [ Z N,δ F )] = 0 . (32)or equivalently by linearity of expectation and the independence of Y i , X i , i = 1 , , . . . , n given θ ,lim N E F ) X θ ′ : k θ ′ − θ k ≥ δ P n \ i =1 { Y i = g ( X i , θ ′ ) } (cid:12)(cid:12)(cid:12)(cid:12) n ! = 0 . Now fix any θ ′ with k θ ′ − θ k ≥ δ or equivalently ρ = h θ, θ ′ i ≤ − δ and some n satisfying F .Using the definitions of R i , i = 0 , P n \ i =1 { Y i = g ( X i , θ ′ ) } (cid:12)(cid:12)(cid:12)(cid:12) n ! equals (cid:18) nn (cid:19) P (cid:0) g ( X, θ ) = g ( X, θ ′ ) = 1 (cid:1) n P (cid:0) g ( X, θ ) = g ( X, θ ′ ) = 0 (cid:1) n − n = (cid:18) nn (cid:19) R ( ρ ) n R ( ρ ) n − n = exp (cid:16) nh ( n n ) + n log R ( ρ ) + ( n − n ) log R ( ρ ) + o ( n ) (cid:17) (33)where h is defined in (16), and we used the standard application of Stirling’s formula log (cid:0) nnx (cid:1) = nh ( x ) + o ( n ) when x is bounded away from 0 and 1. The last expression equals toexp (cid:16) n (cid:16) h ( n n ) + n n log R ( ρ ) + (1 − n n ) log R ( ρ ) (cid:17) + o ( n ) (cid:17) = exp ( n ( h ( p ) + p log R ( ρ )) + (1 − p ) log R ( ρ )) + o ( n )) (34)= exp (cid:18) n (cid:18) p log R ( ρ ) p + (1 − p ) log R ( ρ )(1 − p ) (cid:19) + o ( n ) (cid:19) , (35)21nd for (34) we used the continuity of h and that n /n = p (1 + O (log log n/ √ n )) = p (1 + o (1)) , since n → + ∞ . Importantly, since p ∈ (0 ,
1) the o ( n ) term in (35) can be taken to hold uniformlyover the specific choices of n satisfying F .Using (35) it suffices to establish for G ( ρ, p ) = p log R ( ρ ) p + (1 − p ) log R ( ρ )(1 − p ) thatlim N E θ X ρ ∈R : ρ ≤ − δ |{ θ ′ ∈ Θ : h θ, θ ′ i = ρ }| e nG ( ρ,p )+ o ( n ) = 0 , (36)where R denotes the support of the overlap distribution of two independent samples from the prior P Θ . Now since the prior is uniform over Θ if ρ is drawn from the law of the inner product betweentwo independent samples from the prior, (36) is equivalent withlim N E ρ ρ ≤ − δ M e nG ( ρ,p )+ o ( n ) = 0 . (37)Now notice that since n ∗ = (1 + o (1)) log Mh ( p ) by Proposition 1 we have M e nG ( ρ,p )+ o ( n ) = exp (log M + n ∗ G ( ρ, p ) + ( n − n ∗ ) G ( ρ, p ) + o ( n ))= exp ( n ∗ h ( p ) + n ∗ G ( ρ, p ) + ( n − n ∗ ) G ( ρ, p ) + o ( n + log M ))= exp (cid:18) n ∗ (cid:18) p log R ( ρ ) p + (1 − p ) log R ( ρ )(1 − p ) (cid:19) + ( n − n ∗ ) G ( ρ, p ) + o (log M ) (cid:19) , (38)where we used that n is of order log M , by (31).Now Assumption 3 implies that the functions R i , i = 0 , ,
1] and Assumption2 that their sum is strictly increasing in [0 , R (1) = p, R (1) = 1 − p . Hence we conclude that for some δ ′ > ρ ≤ − δ , min { log R ( ρ ) p , log R ( ρ )1 − p } ≤ min { log R (1) p , log R (1)1 − p } − δ ′ = − δ ′ and max { log R ( ρ ) p , log R ( ρ )1 − p } ≤ max { log R (1) p , log R (1)1 − p } = 0 . Hence, since p ∈ (0 , G ( ρ, p ) we conclude that for δ ′′ = δ ′ min { p, − p } > ρ ≤ − δ , G ( ρ, p ) ≤ − δ ′′ . Hence since n ≥ (1 + ǫ ) n ∗ and n ∗ = Θ(log M ) weconclude that for all ρ ≤ − δ , ( n − n ∗ ) G ( ρ, p ) ≤ − ǫδ ′′ n ∗ = − Ω(log M ) . (39)Combining (38) with (39), and then using n ∗ = (1 + o (1)) log Mh ( p ) = log Mh ( p ) + o (log M ) , we conclude E ρ ρ ≤ − δ M e nG ( ρ,p )+ o ( n ) ≤ e − Ω(log M ) E ρ exp (cid:18) n ∗ ( p log R ( ρ ) p + (1 − p ) log R ( ρ )(1 − p ) ) (cid:19) = e − Ω(log M ) E ρ exp (cid:18) log Mh ( p ) ( p log R ( ρ ) p + (1 − p ) log R ( ρ )(1 − p ) ) (cid:19) . N M log E ρ exp ( W ( ρ, p ) log M ) = 0 , (40)for W ( ρ, p ) , h ( p ) ( p log R ( ρ ) p + (1 − p ) log R ( ρ )(1 − p ) ) . To prove it, let us fix a positive integer k . We have E ρ exp ( W ( ρ, p ) log M ) ≤ k − X ℓ =0 P [ ρ ≥ ℓ/k ] sup t ∈ [ ℓ/k, ( ℓ +1) /k ) exp ( W ( t, p ) log M ) ≤ k · max ≤ ℓ E Applications: the Proofs In this section we present the proofs for the three families of models we establish the all-or-nothingphenomenon using our technique. The proof concept remains the same across the different models;we apply Corollary 1 and check that all assumptions apply. E.1 Proof of Theorem 2 Proof. We apply Corollary 1. Notice that for any fixed θ ∈ Θ the random variable √ k h X i , θ i followsa Binomial distribution Bin( k, νk ). Therefore p = 1 − P ( g ( X, θ ) = 0) = 1 − (cid:16) − νk (cid:17) k = 1 − q. (41)Hence h ( p ) = h ( q ) and the critical sample size is indeed n ∗ = (cid:22) log ( Nk ) h ( q ) (cid:23) , and Stirling’s formulaimplies that since k = o ( N ), H ( θ ) = log (cid:0) Nk (cid:1) = (1 + o (1)) k log Nk and therefore it also holds n ∗ = (1 + o (1)) k log Nk h ( q ) . We now check the assumptions of the Corollary.23 ssumption 1 We start with Assumption 1, which concerns properties of the prior. We use [NZ20,Lemma 6] to conclude that the prior P Θ admits the overlap rate function r ( t ) = t, per Defi-nition 2. Now, notice that the first part of Assumption 1 is directly implied by the fact that r (1 + δ ) = 1 + δ > δ > 0. For the second part notice that since the law of the prioris permutation-invariant with respect to the N dimensions, for any fixed θ ∈ Θ and θ ′ chosen fromthe prior, h θ, θ ′ i is equal in distribution to the law of h θ, θ ′ i where θ, θ ′ are two independent samplesfrom the prior. Hence using the overlap rate function r ( t ) = t we have that for M N = (cid:0) Nk (cid:1) it holdsthat for any ǫ > P (cid:0) h θ, θ ′ i > ǫ (cid:1) ≤ exp ( − ( ǫ + o (1)) log M N ) = o (1) , as desired. Assumptions 2, 3 For Assumption 2 and Assumption 3 we directly compute by elementarycombinatorics the functions R i ( ρ ) , i = 1 , R ( ρ ). Recall that { g ( X, θ ′ ) = 1 } is the event thatthe supports of θ ′ and X have a non-empty intersection. Given a N -dimensional vector v ∈ R n ,we denote its support by S ( v ) := { i ∈ [ N ] : v i = 0 } . First, fix some ρ ∈ [0 , 1] and we compute R ρ (1) by considering two arbitrary θ, θ ′ which share ρk indices in their support. Notice that for theargument to be non-vacuous we assume also that ρ = ℓ/k for some ℓ ∈ { , , , . . . , k } . 
Conditioningon whether S ( X ) intersects S ( θ ) ∩ S ( θ ′ ), we have R ρ (1) = P ( g ( X, θ ) = g ( X, θ ′ ) = 1)= P ( S ( X ) ∩ S ( θ ) = ∅ , S ( X ) ∩ S ( θ ′ ) = ∅ )= 1 − (cid:16) − νk (cid:17) ℓ | {z } case S ( X ) ∩ S ( θ ) ∩ S ( θ ′ ) = ∅ + (cid:16) − νk (cid:17) ℓ (cid:18) − (cid:16) − νk (cid:17) k − ℓ (cid:19) | {z } case S ( X ) ∩ S ( θ ) ∩ S ( θ ′ )= ∅ = 1 − (cid:16) − νk (cid:17) k + (cid:16) − νk (cid:17) k − ℓ = 1 − q + q − ρ . Likewise, R ρ (0) = P ( g ( X, θ ) = g ( X, θ ′ ) = 0)= P ( S ( X ) ∩ S ( θ ) = S ( X ) ∩ S ( θ ′ ) = ∅ )= (cid:16) − νk (cid:17) k − ℓ = q − ρ We obtain R ( ρ ) = 1 − q + 2 q − ρ . It can be straightforwardly checked that all three functions are strictly increasing and continuousin [0 , Condition (19) Finally, we need to check the condition (19). First notice that since r ( t ) = t weneed to show that for all ρ ∈ [0 , ρh ( p ) ≥ (cid:18) p log R ( ρ ) p + (1 − p ) log R ( ρ )(1 − p ) (cid:19) , 24r using the definition of h , 0 ≥ p log R ( ρ ) p − ρ + (1 − p ) log R ( ρ )(1 − p ) − ρ . (42)Notice that for any ρ , R ( ρ )(1 − p ) − ρ = 1 . Therefore it suffices to show that for every ρ ∈ [0 , R ( ρ ) ≤ p − ρ or equivalently with respect to q = 1 − p ,1 − q + q − ρ ≤ (1 − q ) − ρ . To prove the latter, recall that q ≤ and consider the function f ( ρ ) = (1 − q ) − ρ − q − ρ . Noticethat f (0) = f (1) = 1 − q and therefore it suffices to prove that f is concave in [0 , f is f ′′ ( ρ ) = log(1 − q ) (1 − q ) − ρ − (log q ) q − ρ = (1 − q ) − ρ (cid:18) log(1 − q ) − (log q ) ( q − q ) − ρ (cid:19) ≤ (1 − q ) − ρ (cid:18) log(1 − q ) − (log q ) ( q − q ) (cid:19) , since q ≤ . Hence, it suffices to show log(1 − q ) ≤ (log q ) ( q − q ) or (1 − q ) log(1 − q ) ≥ q log q. To prove the latter consider the function g ( q ) = (1 − q ) log(1 − q ) − q log q, q ∈ (0 , ] . Notice that g (0 + ) = g (1 / 2) = 0, and that for each q ∈ (0 , ) g ′′ ( q ) = q − q (1 − q ) < . Hence, g ( q ) is concave on theinterval [0 , / g ( q ) ≥ min { g (0 + ) , g (1 / } = 0 as we wanted. The proof is complete. E.2 Proof of Theorem 3 Proof. We apply Corollary 1. Notice that since any fixed θ ∈ Θ lies on the unit sphere in R N and X i ∼ N (0 , I N ), it holds that h X i , θ i ∼ N (0 , p = P ( g ( X i , θ ) = 1) = P ( h X i , θ i ∈ A ) = 12 . Hence indeed the critical sample size is n ∗ = ⌊ log ( Nk ) h ( ) ⌋ = (1 + o (1)) ⌊ k log Nk ⌋ , where we haveStirling’s formula and the assumption that k = o ( N ).We now check the assumptions of the Corollary. Assumption 1 Assumption 1 concerns properties of the prior P Θ and they are already establishedin the corresponding part of Theorem 2, since the prior is identical.25 ssumptions 2, 3 For Assumption 2 and Assumption 3, we study the functions R i () , i = 0 , R ( ρ ).Now we compute the functions. Recall that { g ( X, θ ) = 1 } = {h X i , θ i ∈ A } and that for any θ, θ ′ ∈ Θ with h θ, θ ′ i = ρ the pair h X i , θ i , h X i , θ ′ i is a bivariate pair of standard Gaussians withcorrelation ρ . Letting ( Z, Z ρ ) be such a pair, we therefore have R ρ (1) = P ( g ( X, θ ) = g ( X, θ ′ ) = 1) = P ( Z ∈ A, Z ρ ∈ A ) ,R ρ (0) = P ( g ( X, θ ) = g ( X, θ ′ ) = 0) = P ( Z A, Z ρ A ) ,R ( ρ ) = P ( g ( X, θ ) = g ( X, θ ′ )) = P ( Z ∈ A, Z ρ ∈ A ) + P ( Z A, Z ρ A ) . 
E.2 Proof of Theorem 3

Proof. We apply Corollary 1. Notice that since any fixed $\theta \in \Theta$ lies on the unit sphere of $\mathbb{R}^N$ and $X_i \sim \mathcal{N}(0, I_N)$, it holds that $\langle X_i, \theta \rangle \sim \mathcal{N}(0, 1)$. Hence
\[ p = \mathbb{P}(g(X_i, \theta) = 1) = \mathbb{P}(\langle X_i, \theta \rangle \in A) = \frac{1}{2}. \]
Hence indeed the critical sample size is $n^* = \big\lfloor \log \binom{N}{k} / h(\tfrac{1}{2}) \big\rfloor = (1 + o(1))\, k \log_2 \frac{N}{k}$, where we have used Stirling's formula and the assumption that $k = o(N)$. We now check the assumptions of the Corollary.

Assumption 1 Assumption 1 concerns properties of the prior $P_\Theta$, and these are already established in the corresponding part of the proof of Theorem 2, since the prior is identical.

Assumptions 2, 3 For Assumption 2 and Assumption 3, we study the functions $R_\rho(i)$, $i = 0, 1$, and $R(\rho)$. Recall that $\{g(X, \theta) = 1\} = \{\langle X_i, \theta \rangle \in A\}$ and that for any $\theta, \theta' \in \Theta$ with $\langle \theta, \theta' \rangle = \rho$, the pair $(\langle X_i, \theta \rangle, \langle X_i, \theta' \rangle)$ is a bivariate pair of standard Gaussians with correlation $\rho$. Letting $(Z, Z_\rho)$ be such a pair, we therefore have
\begin{align*}
R_\rho(1) &= \mathbb{P}(g(X,\theta) = g(X,\theta') = 1) = \mathbb{P}(Z \in A, Z_\rho \in A), \\
R_\rho(0) &= \mathbb{P}(g(X,\theta) = g(X,\theta') = 0) = \mathbb{P}(Z \notin A, Z_\rho \notin A), \\
R(\rho) &= \mathbb{P}(g(X,\theta) = g(X,\theta')) = \mathbb{P}(Z \in A, Z_\rho \in A) + \mathbb{P}(Z \notin A, Z_\rho \notin A).
\end{align*}
Furthermore, because $A$ is balanced we have for any $\rho \in [0,1]$:
\begin{align*}
R(\rho) &= \mathbb{P}(\{Z \in A, Z_\rho \in A\} \cup \{Z \notin A, Z_\rho \notin A\}) \\
&= \mathbb{P}(Z \in A, Z_\rho \in A) + \mathbb{P}(Z \notin A, Z_\rho \notin A) \\
&= \mathbb{P}(Z \in A, Z_\rho \in A) + \big(1 - \mathbb{P}(Z \in A) - \mathbb{P}(Z_\rho \in A) + \mathbb{P}(Z \in A, Z_\rho \in A)\big) \\
&= 2\, \mathbb{P}(Z \in A, Z_\rho \in A) = 2 R_\rho(1) = 2 R_\rho(0).
\end{align*}
The uniform limits are all strictly increasing with respect to $\rho \in [0,1]$ and continuous at $0^+$, by Lemma 5 applied to $A$ and $A^c$.

Condition (19) Finally, we need to check condition (19). First notice that, exactly as in Theorem 2, the prior admits the overlap rate function $r(t) = t$, and therefore the condition is equivalent to (42). Notice that (42) simplifies, since $p = 1/2$, to
\[ R_\rho(1) \le 2^{\rho - 2}, \quad \text{or} \quad \mathbb{P}(Z \in A, Z_\rho \in A) \le 2^{\rho - 2}. \tag{43} \]
By Borell's noise stability theorem [Bor85], since $\mathbb{P}(Z \in A) = 1/2 = \mathbb{P}(Z \ge 0)$,
\[ \mathbb{P}(Z \in A, Z_\rho \in A) \le \mathbb{P}(Z \ge 0, Z_\rho \ge 0) = \frac{1}{4} \Big( 1 + \frac{2}{\pi} \arcsin \rho \Big), \]
where the equality is by Sheppard's formula [She99]. Hence it suffices to show that for all $\rho \in [0,1]$ it holds that $2^\rho - 1 \ge \frac{2}{\pi} \arcsin(\rho)$. We consider the function $g(\rho) = 2^\rho - 1 - \frac{2}{\pi} \arcsin(\rho)$ for $\rho \in [0,1]$, for which we want to show $g(\rho) \ge 0$ on $[0,1]$. Note that $g(0) = g(1) = 0$ and $g(1/2) = \sqrt{2} - 1 - 1/3 > 0$. We claim that there is no root of $g$ in $(0,1)$, which implies the result by Bolzano's theorem. Arguing by contradiction, if there were such a root, then the derivative
\[ g'(\rho) = 2^\rho \ln 2 - \frac{2}{\pi} \frac{1}{\sqrt{1 - \rho^2}} \]
would have two roots in $(0,1)$ by Rolle's theorem. Rearranging, this is equivalent to the equation
\[ \rho \ln 2 + \frac{1}{2} \ln(1 - \rho^2) = \ln\Big( \frac{2}{\pi \ln 2} \Big) \]
having two roots in $(0,1)$. But the function on the left side is concave and vanishes at $\rho = 0$, so it takes each negative value at most once. Since $\ln\big( \frac{2}{\pi \ln 2} \big) < 0$, we reach a contradiction. The proof is complete.
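Sheppard's formula and the elementary inequality $2^\rho - 1 \ge \frac{2}{\pi} \arcsin(\rho)$ used in the proof admit a direct numerical check; the sketch below (sample sizes and grids are arbitrary choices) confirms both:

```python
# Checks for the proof of Theorem 3: Sheppard's formula
# P(Z >= 0, Z_rho >= 0) = (1/4)(1 + (2/pi) arcsin(rho)), and the inequality
# 2^rho - 1 >= (2/pi) arcsin(rho) on [0, 1]. Values are illustrative only.
import numpy as np

rng = np.random.default_rng(3)
Z, W = rng.standard_normal((2, 2_000_000))
for rho in (0.2, 0.5, 0.9):
    Z_rho = rho * Z + np.sqrt(1 - rho ** 2) * W   # correlation-rho Gaussian
    mc = ((Z >= 0) & (Z_rho >= 0)).mean()
    sheppard = 0.25 * (1 + (2 / np.pi) * np.arcsin(rho))
    print(f"rho={rho}: Monte Carlo={mc:.4f}, Sheppard={sheppard:.4f}")

# g(rho) = 2^rho - 1 - (2/pi) arcsin(rho) vanishes at the endpoints and stays
# positive in the interior, matching the Bolzano/Rolle argument above.
rho = np.linspace(0, 1, 100_001)
g = 2 ** rho - 1 - (2 / np.pi) * np.arcsin(rho)
assert g[1:-1].min() > 0 and abs(g[0]) < 1e-12 and abs(g[-1]) < 1e-12
print("interior minimum of g:", g[1:-1].min())
```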
F Remaining proofs

Proof of Proposition 1. The equality (7) follows in a straightforward manner from the observation that the posterior of $\theta$ given $Y^n, X^n$ is the uniform measure over the solutions $\theta'$ of the equations (6), together with the definition of $Z_{N,\delta}$. For (8), notice that if $\theta'$ is drawn from the posterior $P_{\theta \mid Y^n, X^n}$, then
\begin{align*}
\mathrm{MMSE}_N(n) = \mathbb{E}\, \big\|\theta - \mathbb{E}[\theta \mid Y^n, X^n]\big\|^2 &= \frac{1}{2} \mathbb{E}\, \|\theta - \theta'\|^2 \\
&= \frac{1}{2} \int_{\delta = 0}^{2} \mathbb{P}\big( \|\theta - \theta'\|^2 \ge \delta \big)\, d\delta \\
&= \frac{1}{2} \int_{\delta = 0}^{2} \mathbb{E}\, \mathbb{P}\big( \|\theta - \theta'\|^2 \ge \delta \,\big|\, Y^n, X^n \big)\, d\delta \\
&= \frac{1}{2} \mathbb{E} \int_{\delta = 0}^{2} \mathbb{P}\big( \|\theta - \theta'\|^2 \ge \delta \,\big|\, Y^n, X^n \big)\, d\delta,
\end{align*}
where in the second line we have used Lemma 2, and where in the last line we are allowed to exchange the order of integration by Tonelli's theorem, as all integrands are non-negative. Finally, notice that (7) allows us to conclude (8).

For the second part, fix some arbitrary $\epsilon \in (0, 2]$ and set $A = \{Z_{N,\epsilon} > 0\}$. From (9) we have $\mathbb{P}(A) = o(1)$. Notice that for any $\epsilon'$ with $2 \ge \epsilon' \ge \epsilon$, it holds almost surely that $Z_{N,\epsilon'} \mathbf{1}(A^c) = 0$. Hence, using (8), we have
\begin{align*}
\mathrm{MMSE}_N(n) \le \mathbb{E} \int_{\delta = 0}^{2} \frac{Z_{N,\delta}}{Z_{N,0}}\, d\delta &= \mathbb{E} \int_{\delta = 0}^{2} \frac{Z_{N,\delta}}{Z_{N,0}} \mathbf{1}(A^c)\, d\delta + \mathbb{E} \int_{\delta = 0}^{2} \frac{Z_{N,\delta}}{Z_{N,0}} \mathbf{1}(A)\, d\delta \\
&\le \mathbb{E} \int_{\delta = 0}^{\epsilon} \frac{Z_{N,\delta}}{Z_{N,0}}\, d\delta + 2\, \mathbb{P}(A) \le \epsilon + o(1).
\end{align*}
Therefore $\limsup_N \mathrm{MMSE}_N(n) \le \epsilon$. As $\epsilon \in (0,2]$ was arbitrary, we conclude (10).

Proof of Proposition 2. Using (21) from Proposition 3 and the fact that the KL divergence is non-negative, we have
\[ 1 - \frac{H(\theta \mid Y^n, X^n)}{H(\theta)} \le (1 + o(1)) \frac{n}{n^*}. \tag{44} \]
Using now Proposition 4 we conclude that (22) holds. Combining (22) with (44) concludes the result.

Lemma 5. Let $Z$ and $Z_\rho$ be a bivariate pair of standard Gaussians with correlation $\rho$. Then for any Borel set $A \subseteq \mathbb{R}$ such that $\mathbb{P}(Z \in A) \in (0, 1)$, the function $\rho \mapsto \mathbb{P}(Z \in A, Z_\rho \in A)$ is strictly increasing on $[0,1]$ and continuous on $[0,1]$.

Proof. Write $\gamma$ for the standard Gaussian measure on $\mathbb{R}$. We recall [see, e.g., O'D14, Proposition 11.37] that there exists an orthonormal basis $\{h_k\}_{k \ge 0}$ of $L^2(\gamma)$ such that, for any $f \in L^2(\gamma)$,
\[ \mathbb{E}[f(Z) f(Z_\rho)] = \sum_{k \ge 0} \rho^k \hat f_k^2, \]
where the coefficients $\{\hat f_k\}_{k \ge 0}$ are defined by $f = \sum_{k \ge 0} \hat f_k h_k$ in $L^2(\gamma)$. Moreover, $h_0 = 1$, so that if $f$ is not $\gamma$-a.s. constant, there exists a $k > 0$ with $\hat f_k \ne 0$. We obtain that, for any non-constant $f$, the function $\mathbb{E}[f(Z) f(Z_\rho)] = \sum_{k \ge 0} \rho^k \hat f_k^2$ is strictly increasing on $[0,1]$. Moreover, $\sum_{k \ge 0} \hat f_k^2 = \mathbb{E}[f(Z)^2] < +\infty$, hence the function $\mathbb{E}[f(Z) f(Z_\rho)] = \sum_{k \ge 0} \rho^k \hat f_k^2$ is also continuous on $[0,1]$. Applying this result to the non-constant function $f(x) = \mathbf{1}(x \in A) \in L^2(\gamma)$ completes the proof.
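The Hermite expansion at the heart of Lemma 5 can also be checked numerically. In the sketch below (illustrative only), we take $A = (-\infty, a]$, compute the coefficients $\hat f_k$ of $f = \mathbf{1}(\cdot \le a)$ in the orthonormal Hermite basis using the identity $\int_{-\infty}^a \mathrm{He}_k\, d\gamma = -\mathrm{He}_{k-1}(a)\, \varphi(a)$ for $k \ge 1$, and compare the truncated series $\sum_k \rho^k \hat f_k^2$ with a Monte Carlo estimate of $\mathbb{P}(Z \in A, Z_\rho \in A)$:

```python
# Numerical check of E[f(Z) f(Z_rho)] = sum_k rho^k fhat_k^2 for f = 1(. <= a),
# the expansion used in the proof of Lemma 5. The set A = (-inf, a] and all
# parameter values are illustrative only.
import math
import numpy as np

def gauss_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def gauss_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def indicator_coeffs(a, K):
    """fhat_k for f = 1(. <= a) in the orthonormal basis He_k / sqrt(k!),
    iterating b_k = He_k(a)/sqrt(k!) to avoid overflow for large k."""
    fhat = np.empty(K)
    fhat[0] = gauss_cdf(a)
    b_prev, b = 1.0, a                    # b_0 and b_1
    fhat[1] = -gauss_pdf(a) * b_prev      # uses He_0(a) = 1
    for k in range(2, K):
        fhat[k] = -gauss_pdf(a) * b / math.sqrt(k)   # uses He_{k-1}/sqrt((k-1)!)
        b_prev, b = b, (a * b - math.sqrt(k - 1) * b_prev) / math.sqrt(k)
    return fhat

a, K = 0.3, 200
fhat2 = indicator_coeffs(a, K) ** 2

rng = np.random.default_rng(4)
Z, W = rng.standard_normal((2, 2_000_000))
for rho in (0.0, 0.4, 0.8):
    series = np.sum(rho ** np.arange(K) * fhat2)
    mc = ((Z <= a) & (rho * Z + math.sqrt(1 - rho ** 2) * W <= a)).mean()
    print(f"rho={rho}: truncated series={series:.4f}, Monte Carlo={mc:.4f}")

# Strict monotonicity in rho, as in the lemma: every term is nondecreasing
# in rho and fhat_1 = -phi(a) != 0, so the sum is strictly increasing.
grid = np.linspace(0, 0.99, 50)
vals = [np.sum(r ** np.arange(K) * fhat2) for r in grid]
assert np.all(np.diff(vals) > 0)
```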