High-dimensional inference: a statistical mechanics perspective
JEAN BARBIER
Abstract.
Statistical inference is the science of drawing conclusions about some system from data. In modern signal processing and machine learning, inference is done in very high dimension: very many unknown characteristics about the system have to be deduced from a lot of high-dimensional noisy data. This "high-dimensional regime" is reminiscent of statistical mechanics, which aims at describing the macroscopic behavior of a complex system based on the knowledge of its microscopic interactions. It is by now clear that there are many connections between inference and statistical physics. This article aims at emphasizing some of the deep links connecting these apparently separated disciplines through the description of paradigmatic models of high-dimensional inference in the language of statistical mechanics. This article has been published in the issue on artificial intelligence of Ithaca, an Italian popularization-of-science journal. The selected topics and references are highly biased and not intended to be exhaustive in any way. Its purpose is to serve as an introduction to the statistical mechanics of inference through a very specific angle that corresponds to my own tastes and limited knowledge.

1. Statistical inference: old and new
Statistical inference aims at accurately describing some system, through the design of an appropriate probability distribution, based on data about this system and, potentially, some assumptions about it. Designing both statistically and computationally efficient inference procedures is thus crucial in virtually all fields of science. Classical statistics is mostly concerned with the regime where the system of study is rather "simple", or "low-dimensional". Namely, it is parametrised by few quantities of interest and the amount of accessible data is large. But in the big-data era, contemporary signal processing and machine learning tasks require performing inference in the so-called high-dimensional (high-d) regime. This means that even if the amount of data as well as its dimensionality may be (very) large, the number of unknown parameters characterizing the system under study is also huge. Therefore totally new statistical tools are required to make sense of the data in order to "extract the signal from the noise".

1.1. Classical "low-dimensional" statistics.
In classical statistics the signal, namely the information of interest/unknown parameter to recover from the data, is low-dimensional. To be more precise, let us denote the signal x ∈ R^p and the data y = y(x) ∈ R^n, which depends on the signal. For example, consider a simple experiment where one tries to infer whether a coin is fair, namely, whether P(head) = x = 1 − P(tail) equals 1/2.
In this example the parameter space has dimension p = 1: x ∈ [0, 1] is a scalar. A natural protocol to answer that question is: toss the coin n times and record the number n_h ∈ {0, . . . , n} of times it fell on head. Then in the limit n ≫ 1 the law of large numbers, which is a fundamental statistical property at the core of the apparent predictability of the world in spite of its inherent probabilistic nature, predicts that the empirical mean converges to the statistical mean. This translates here to

(1/n) ∑_{i=1}^n 1(toss_i = head) = n_h/n = x + o_n(1)

with a correction o_n(1) → 0 as n → ∞ (here 1(⋅) is the indicator function). Less basic examples could be to infer the earth gravitational constant from the recording of n ≫ 1 falling-body experiments (p = 1), or the average height and weight of some large population of n individuals together with the associated variances (in which case p = 4). In all these examples p is small and fixed while n is large: p ≪ n.

The canonical way to infer the unknown parameters x is through maximum likelihood estimation (MLE). The probabilistic way to represent a random process of data generation is a probability distribution of observing the data conditional on the (unknown) parameters P(y | x), called likelihood. For example, in the coin tossing experiment, an obvious observation is that conditional on the bias parameter x all tosses are independent. Therefore, mapping the binary data-space {head, tail} to {1, 0}, each toss is a Bernoulli experiment with P(y_i = 1 | x) = x. Therefore the likelihood that the random variable (r.v.) N_h takes value k ∈ {0, . . . , n}, N_h being the random number of heads among n trials, is the binomial law of (unknown) success probability x: P(N_h = k | x) = (n choose k) x^k (1 − x)^{n−k}. In this experiment the only data is the outcome n_h of N_h, as the order of the tosses is irrelevant. Therefore P(N_h = n_h | x) is the likelihood of the data.

It is conceptually useful to introduce the likelihood function L(x | y) := P(y | x). It is important to think of L(x | y) really as a function of the parameters given the data (the data being fixed, it cannot be modified), not the other way around as suggested by the conditional probability distribution P(y | x). This is the reason behind the introduction of a specific notation L(x | y) emphasising this correct interpretation. MLE says that one should take as estimate x̂ = x̂(y) of the unknown parameters x the value that maximizes the (logarithm of the) likelihood function given the measured data. In coin tossing L(x | n_h) := P(N_h = n_h | x), so MLE gives

x̂ ∈ argmax_{x∈[0,1]} ln L(x | n_h) = argmax_{x∈[0,1]} {n_h ln x + (n − n_h) ln(1 − x)}.

The function {⋯} is concave so its unique maximiser is easily found to be x̂(n_h) = n_h/n. The principled approach of MLE therefore allows to recover the natural choice suggested by the law of large numbers. This simple but powerful method has driven classical statistics for more than a century since its development by C. Gauss, P. S. Laplace or F. Edgeworth. Finally, MLE can be shown to be optimal in the classical limit of statistics p/n → 0.
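To make this concrete, here is a minimal numerical sketch of the MLE computation for the coin-tossing experiment (my illustration, not from the article): it maximizes the log-likelihood above over a grid of candidate biases and recovers the closed-form answer x̂ = n_h/n.

import numpy as np

# Minimal MLE sketch for coin tossing (all modelling choices here are mine).
rng = np.random.default_rng(0)
x_true, n = 0.3, 10_000
n_h = rng.binomial(n, x_true)            # number of heads among n tosses

grid = np.linspace(1e-3, 1 - 1e-3, 999)  # candidate biases in (0, 1)
log_lik = n_h * np.log(grid) + (n - n_h) * np.log(1 - grid)

x_mle = grid[np.argmax(log_lik)]
print(f"empirical mean n_h/n = {n_h / n:.4f}, grid MLE = {x_mle:.4f}")
# Both agree with the closed-form maximiser n_h/n up to the grid resolution.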
1.2. High-dimensional statistics.

The high-d regime generally refers to statistical settings in which both the number of parameters and of data points (which can themselves be high-d) are large and comparable: p/n → δ > 0 as p, n → ∞ together, with δ an order one constant. This is to be contrasted with the classical limit δ → 0. (The MLE estimator x̂ is then optimal in the following sense: in the Bayesian optimal setting described soon, in the regime n ≫ p, x̂ is both a Bayes estimator, namely it minimizes the Bayes risk associated with the distribution of the signal P(x), and minimax for the mean-square loss; see the upcoming section on decision theory and Chapter 12 in the great book [1].)
To be more precise, we require

p / (n × SNR_d) → δ > 0,   (1)

where the signal-to-noise ratio (SNR) per data point SNR_d is a measure of the information content about x carried by a single data point, in average. The definition of SNR_d depends on the model under study, but is always related to a natural way of comparing the (average) signal amplitude with the one of the noise that corrupts the data. When it is of order one we recover the usual definition of the high-d regime p/n → δ > 0. In the coin tossing experiment, the SNR of the single data point y = n_h ∈ {0, . . . , n} is given by the variance of the associated r.v. N_h. The variance of the binomial distribution Bin(n, x) is nx(1 − x), so SNR_d = nx(1 − x) is large. Therefore p ÷ (n × SNR_d) = 1/(1 × nx(1 − x)) → 0 as n → ∞. An equally valid interpretation is: we have n data points (y_i)_{i=1}^n, each one being the outcome of a Bernoulli experiment. Each y_i ∈ {0, 1} has variance x(1 − x) = O(1). Then p ÷ (n × SNR_d) = 1/(n × x(1 − x)) → 0 again: coin tossing belongs to classical statistics. In contrast, think of modern deep neural networks trained on data-bases with millions of images. The number of parameters/synaptic weights defining these complex models is of the same order, or even much bigger. See, e.g., [35] for an introduction to machine learning for physicists.

It is not exaggerating to call that a data-revolution, and high-d statistics is the theoretical powerhouse at its core. The other key pillars of this revolution are the amount of accessible data, as well as the modern computers and specific computational units able to process such huge data sets, like graphics processing units (GPUs).

Understanding the high-d regime requires totally new concepts and mathematical tools, and for solving actual applied problems, we need new algorithms. Speaking about algorithms, in the high-d world where the data is so massive, not only statistical efficiency matters (namely, the capacity of an algorithm to extract the relevant information, independently of any "speed concern") but also computational efficiency, as it quickly becomes a bottleneck. Researchers working in applications of high-d statistics must always keep in mind these two considerations, a key feature of the field that makes it so interesting and challenging at the same time. In this article we will mainly focus our attention on the information-theoretic (or statistical) limitations to inference and leave the algorithmic considerations aside.

2. Basics of Bayesian inference
2.1. Breaking the curse of dimensionality using a-priori knowledge.
We have said that in classical statistics MLE estimation is optimal. This is not true anymore in the high-d regime. Because the amount of data is comparable to the number of unknown parameters to infer, this may create degeneracies in the solution of MLE, in the sense that the set argmax_{x∈R^p} ln L(x | y) may have a huge cardinality, all solutions being equally bad. This is one issue related to the curse of dimensionality: in the high-d regime the volume of the space in which the signal x lives increases so fast (exponentially fast with p) that the available data becomes relatively sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality, but such an amount is never accessible (compute e^p for, say, p = 10^6 to get an idea).

The way out is to compensate for the scarcity of data by additional assumptions about x, namely, a-priori knowledge. Quoting D. Mackay: "you cannot do inference without making assumptions", see the amazing book [2]. This can be formalised through a probability distribution P(x) that depends on the signal only, and therefore is completely independent of the data y. This distribution, called prior, translates in the language of probability the whole set of a-priori assumptions made by the statistician about x, before the data is collected. It is crucial that once the data is acquired, the prior is not modified accordingly, otherwise this may create bias. Such assumptions could be, for example, that the signal is binary x ∈ {−1, 1}^p with uniform independent and identically distributed (i.i.d.) components x_i (like the bits received from some communication source). This very basic assumption would translate in P(x) = ∏_{i=1}^p (δ_{x_i,−1} + δ_{x_i,1})/2. But maybe in addition one may know that actually the signal is sparse, meaning it has a fraction ρ of entries equal to 0. In this case P(x) = ∏_{i=1}^p {ρ δ_{x_i,0} + ((1 − ρ)/2)(δ_{x_i,−1} + δ_{x_i,1})}. And so on: the richer the set of assumptions, the more complex is the prior. (The sparsity assumption allows to reconstruct very high-d signals from relatively few data points, and, e.g., to invert apparently under-determined systems of linear equations. This forms the basis of a whole field of research in mathematics and signal processing called compressive sensing, see the excellent introduction [3] or [23] for a statistical mechanics approach.)

2.2. Combining assumptions and data: Bayes formula.
One of the most elegant probabilistic statements ever is the so-called Bayes formula. It was discovered by a reverend named T. Bayes before his death in 1761. Independently of Bayes, P. S. Laplace formalised similar ideas in 1774. Jeffreys, one of the fathers of modern Bayesian statistics, later wrote that Bayes's theorem "is to the theory of probability what the Pythagorean theorem is to geometry". As simple as it is, it encapsulates in a single equation a powerful recipe, more useful than ever in the context of inference in high dimensions. It says that the proper way to combine a-priori knowledge and data is through a simple multiplication followed by a normalization:

P(x | y) = P(x) P(y | x) / P(y).   (2)

That is, posterior = prior × likelihood / evidence. The posterior distribution P(x | y) signifies P(the parameters take value x given the data is y and our a-priori knowledge). "Posterior" is in the sense of a-posteriori the data has been collected. The posterior distribution is therefore the multiplication of the prior, which formalises all our assumptions about the signal, with the likelihood, which models the data-generating process conditional on the signal. The posterior combines in a single probability distribution all the information we have about the signal, as well as our uncertainty about it through, e.g., the posterior variance

Var(x | y) = E[∥x − E[x | y]∥² | y]

(defining generically E[g(x) | y] := ∫ dx g(x) P(x | y)). The normalization P(y) = ∫ dx′ P(x′) P(y | x′)
is called the evidence. It is the marginal distribution of the data. Note that in high dimensions this distribution can be very hard, if not impossible, to compute exactly, as it requires performing a p-dimensional integral, with p very large. This is often the main computational bottleneck in Bayesian inference and there is a whole field of research related to finding good approximation techniques for the evidence.
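As an illustration, here is a small numerical sketch of the Bayes formula (2) for the coin-bias problem (all modelling choices here are mine): prior, likelihood and evidence are combined exactly as in (2) on a discrete grid of candidate biases, showing that the evidence is just a normalization, trivial here but intractable in high dimension.

import numpy as np

# Bayes formula (2) on a discrete parameter grid (sketch; choices are mine).
n, n_h = 50, 35
grid = np.linspace(0.01, 0.99, 99)

prior = np.exp(-((grid - 0.5) ** 2) / 0.02)      # a-priori belief: coin nearly fair
prior /= prior.sum()

# Likelihood P(N_h = n_h | x) up to the x-independent binomial coefficient,
# which cancels between the numerator and the evidence.
likelihood = grid ** n_h * (1 - grid) ** (n - n_h)
evidence = np.sum(prior * likelihood)            # P(y): the normalization
posterior = prior * likelihood / evidence        # Bayes formula (2)

print("posterior mean:", np.sum(grid * posterior))
print("MLE n_h/n =", n_h / n)                    # the prior pulls the estimate toward 1/2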
2.3. Information-theoretic limits: the Bayesian optimal setting.

From now on we will restrict our discussion to the
Bayesian optimal setting. This means that the statistician knows the model underlying the data-generating process (but of course not the signal x), which translates into the correct likelihood P(y | x). In addition she also correctly exploits all a-priori information about the signal, namely, the signal has indeed been randomly generated from the prior P(x) she uses. Therefore in this setting the posterior distribution is the "correct one", and all the upcoming discussion applies only to this case.

The Bayesian optimal setting is fundamental: as we will argue, any optimal estimator, whatever meaningful notion of optimality is considered, relies on the correct posterior. Therefore in order to study the information-theoretic limits of inference (namely the best results one can aim for independently of any computational concern), one needs to precisely study inference in this optimal setting. The performance of estimators for x in this setting cannot be outperformed by any algorithm, even those allowed to run for infinite time. The Bayesian optimal setting is at the core of information theory [14].

The mismatched setting, where the likelihood used does not properly describe the data generation and/or the prior is biased (i.e., does not correspond to the probability distribution from which the signal was generated), is much more complicated and goes beyond the present discussion. In the repeated coin tossing experiment, a wrong a-priori assumption could be that the tosses are independent, while for some reason the tosses were correlated; this could be due, e.g., to a tiny demon living inside the coin that would modify the probability x_i = x_i(y_{i−1}) of head for toss i as a function of the previous tossing result y_{i−1}. Yet, a lot of the phenomenology, which is inherent to the high-dimensionality, remains the same in the Bayesian optimal and mismatched (more realistic) settings.
2.4. Optimality of estimators: Bayesian decision theory.

In order to quantify the performance of the statistician in estimating x, we need to define a proper reconstruction error associated with a given estimator x̂ = x̂(y). This error metric is called loss in statistics, and can be thought of as an energy function. There are many possible choices, whose relevance depends on the specific application at hand. One canonical choice is the 0−1 loss: ℓ(x̂, x) = 1 − 1(x̂ = x). This choice makes sense when the signal is discrete, like in communications where the signal is made of bits. The loss cannot be computed as it depends on the unknown signal. Therefore in order to define a notion of "goodness" of an estimator we define the posterior risk (our best estimate of the loss):

r(x̂ | y) := ∫ dx P(x | y) ℓ(x̂, x),

which is simply equal to 1 − P(x̂ | y) for the 0−1 loss, as well as the Bayes risk

r(x̂) := ∫ dy P(y) r(x̂ | y).

Both can in theory be computed from the knowledge of the data only and the underlying statistical model; in practice this may be computationally very demanding in high dimension. We directly get
that the optimal estimator, optimal in the sense of minimizing the posterior (or Bayes) risk, is in the case of the 0−1 loss

x̂_MAP(y) := argmin_{x̂∈R^p} r(x̂ | y) = argmax_{x̂∈R^p} P(x̂ | y).

MAP stands for maximum a-posteriori estimator. Another common choice that is more appropriate for real-valued signals is the L2 loss: ℓ(x̂, x) = ∥x̂ − x∥². The associated posterior risk is called the mean-square error, and the estimator that minimizes it is the minimum mean-square error (MMSE) estimator:

x̂_MMSE(y) := argmin_{x̂∈R^p} ∫ dx P(x | y) ∥x̂ − x∥² = ∫ dx x P(x | y) =: E[x | y].

The second equality is easily shown by equating the gradient w.r.t. x̂ of the posterior risk to the all-zeros vector (by convexity this yields the unique minimizer). Therefore the MMSE estimator is "simply" the posterior mean. Of course in general this may be very costly to compute because there are two p-dimensional integrals: the evidence (necessary to normalize the posterior), and then the integral ∫ dx x P(x | y). Therefore in many practical applications one prefers the MAP estimator as it bypasses this issue (but it is sub-optimal for any loss other than the 0−1 loss).

For the L2 loss, the inference error associated with the MMSE estimator is, naturally, the minimum mean-square error:

MMSE_p := (1/p) ∥E[x | y] − x∥²,   MMSE := lim_p E_{x,y} MMSE_p = lim_p (1/p) E_y Var(x | y).   (3)

By lim_p we always mean the "thermodynamic limit" p → ∞. We consider the average error, the symbol E meaning the expectation with respect to the signal x and the data y (conditional on x), seen as r.vs. drawn from their respective distributions (prior and likelihood). This quantity summarizes in a single number all the complexity of the high-d problem, by providing the optimal error one can aim for with any algorithm, in average over all possible realisations of the problem.

One may wonder whether the number MMSE is sufficient to describe the problem, as it might a-priori be the case that the r.v. MMSE_p (random through (x, y)) fluctuates a lot (i.e., has O(1) variance). But in well-defined high-d inference problems in the Bayesian optimal setting it concentrates; it is said to be self-averaging in physics terminology. This means that it actually does not fluctuate much for large systems, and becomes deterministic in the limit p, n → ∞:

MMSE_p = MMSE + o_p(1),

where o_p(1) is by definition a quantity such that lim_p o_p(1) = 0.
Therefore it is sufficient to focus on the asymptotic averaged MMSE in order to capture/predict the behavior of fixed large (typical) instances of the problem. The self-averaging of error metrics in high-d Bayes-optimal inference is very generic [4, 5] (in mismatched settings this is not necessarily the case). It is a non-trivial manifestation of the phenomenon of concentration of measure in high-d models, which is at the core of the determinism/predictability of these complex random systems.
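The following scalar sketch (my example, not the article's) makes the decision-theoretic notions concrete: for a sparse three-point prior and Gaussian likelihood, it computes the exact posterior and contrasts the MAP estimator (posterior mode) with the MMSE estimator (posterior mean), whose spread is quantified by the posterior variance.

import numpy as np

# MAP vs MMSE on a scalar problem (sketch; the model choices are mine).
rng = np.random.default_rng(1)
support = np.array([-1.0, 0.0, 1.0])
prior = np.array([0.1, 0.8, 0.1])              # sparse: x = 0 with probability 0.8

x_star = rng.choice(support, p=prior)
y = x_star + rng.normal()                       # Gaussian likelihood, unit noise

likelihood = np.exp(-0.5 * (y - support) ** 2)
posterior = prior * likelihood
posterior /= posterior.sum()                    # evidence = sum over the support

x_map = support[np.argmax(posterior)]           # MAP: posterior mode / ground state
x_mmse = np.sum(support * posterior)            # MMSE: posterior mean
var = np.sum((support - x_mmse) ** 2 * posterior)
print(f"y={y:+.2f}  MAP={x_map:+.1f}  MMSE={x_mmse:+.3f}  posterior var={var:.3f}")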
3. High-dimensional inference as statistical mechanics
Let us now establish clear connections between what we have discussed so far on high-d inference and statistical mechanics.
3.1. The posterior as a Gibbs-Boltzmann distribution.
The problem, that we already mentioned, of normalizing a high-d probability distribution like the posterior (i.e., computing the evidence) should ring a bell for physicists: the very same thing happens in statistical mechanics, where one of the main tasks is to compute the partition function

Z(J) := ∑_σ exp{−β H(σ; J)}

(if σ_i is binary the sum is over σ ∈ {−1, 1}^p), which normalizes the Gibbs-Boltzmann distribution:

P_GB(σ; J) = (1/Z(J)) exp{−β H(σ; J)}.   (4)

Here H(σ; J) is the Hamiltonian/energy function defining the model, and σ (often binary) are the spins. J is the quenched randomness, namely, a set of fixed variables that parametrise the Hamiltonian. β is the inverse temperature.

We can push the analogy further: the posterior distribution given by the Bayes formula (2) can naturally be thought of as a Gibbs-Boltzmann distribution in the context of high-d inference. Simply re-write it in exponential form and identify the (possibly real-valued) variables x representing the unknown signal with the spins σ, and the data y with the quenched randomness J:

P(x | y) = (1/Z(y)) exp{ln P(x) + ln P(y | x)}.

So the partition function is the evidence, and the Hamiltonian is (minus) the log-prior plus log-likelihood, while β = 1.
In many interesting models the prior factorises over the signal entries

P(x) = ∏_{i=1}^p P(x_i)

(i.e., they are i.i.d.), and the likelihood factorises too over the data points

P(y | x) = ∏_{j=1}^n Q(y_j | x)

(i.e., the data points are conditionally i.i.d.). In this case we recover a familiar form for the posterior:

P(x | y) = (1/Z(y)) exp{∑_{i=1}^p ln P(x_i) + ∑_{j=1}^n ln Q(y_j | x)}.

So the local terms (ln P(x_i))_{i=1}^p act as external magnetic fields, while the log-likelihood terms (ln Q(y_j | x))_{j=1}^n are interactions between spins that correlate them in a highly non-trivial way. If these were pairwise, we would recover an instance of the Ising model, see below.

Coming back to the notion of optimal estimators: with our statistical mechanics interpretation in mind, we now understand that MAP estimation is equivalent to finding the ground state, namely the spin configuration that minimizes the energy. Instead, computing the MMSE estimator relies on sampling the posterior/Gibbs-Boltzmann distribution; these are among the main algorithmic tasks in statistical mechanics, and correspond exactly to the inference task. Suddenly it becomes almost obvious that high-d inference and statistical mechanics are very close cousins. We will discuss concrete examples. Yet this re-interpretation is not simply an observation: it allows to import the massive amount of analysis techniques and concepts developed for more than a century in statistical mechanics into the world of inference and data-sciences.
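To illustrate the dictionary, here is a minimal Metropolis sampler of a posterior written in Gibbs-Boltzmann form at β = 1 (a sketch under assumptions of mine: binary spins, flat prior and per-component Gaussian likelihood, so the Hamiltonian is just the negative log-likelihood up to constants).

import numpy as np

# Metropolis sampling of a posterior seen as a Gibbs-Boltzmann measure (sketch).
rng = np.random.default_rng(2)
p = 200
x_star = rng.choice([-1.0, 1.0], size=p)        # hidden signal
y = x_star + rng.normal(size=p)                 # data: signal plus Gaussian noise

def neg_log_post(x):
    # H(x; y) = -ln P(y|x) up to constants; the flat prior drops out
    return 0.5 * np.sum((y - x) ** 2)

x = rng.choice([-1.0, 1.0], size=p)             # random initial spin configuration
for step in range(20_000):
    i = rng.integers(p)
    dE = 2 * x[i] * y[i]                        # closed-form energy change if spin i flips
    if dE < 0 or rng.random() < np.exp(-dE):    # Metropolis rule at beta = 1
        x[i] = -x[i]

print("energy per spin:", neg_log_post(x) / p)
print("overlap with the signal:", np.mean(x * x_star))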
3.2. Paradigmatic models of statistical mechanics, and order parameters.
One physicist's reflex that has spread to virtually all scientific areas is to consider a toy model containing all the relevant features of more complex/realistic models, yet simple enough to be tackled analytically, in order to understand fundamental phenomena that should apply more broadly. Let us start by introducing two such models in statistical physics. We will later connect them to inference. By far the best understood and most paradigmatic model in statistical mechanics is the fully-connected Ising model, also called the
Curie-Weiss model (CW), defined by the Hamiltonian (with J > 0, h ∈ R and σ ∈ {−1, 1}^p)

H_CW(σ; J, h) = −(J/p) ∑_{i<j} σ_i σ_j − h ∑_{i=1}^p σ_i.

Statistical mechanics uses macroscopic quantities called order parameters for describing a complex system. For the CW model it is simply the magnetisation

m_p(σ) := (1/p) ∑_{i=1}^p σ_i.

Then m := lim_p ⟨m_p(σ)⟩ describes whether the system is in an ordered ferromagnetic phase (if m ≠ 0)
or a disordered paramagnetic (or ergodic) phase (if m = 0); here we introduced the standard physics notation ⟨ ⋅ ⟩ to denote an average with respect to the Gibbs-Boltzmann distribution (4). (Strictly speaking this dichotomy holds whenever h ≠ 0. In a system with a global sign-flip symmetry, like here when the external field is null, H_CW(σ; J, h = 0) = H_CW(−σ; J, h = 0), one needs to break this symmetry by introducing a small external field, and then taking the limit of this field to 0 after the thermodynamic limit. The resulting magnetisation value m_± := lim_{h→0^±} lim_p (1/p) ∑_{i=1}^p ⟨σ_i⟩_h will depend, below the critical temperature 1/β_c, on whether the limit h → 0 is taken from above or from below.) In this simple model the concentration of measure implies m_p(σ) = m + o_p(1).

Another important model of spin system is the disordered version of the CW model. The Sherrington-Kirkpatrick (SK) model, or mean-field spin glass, is defined by the Hamiltonian

H_SK(σ; J, h) = −∑_{i<j} (J_ij/√p) σ_i σ_j − h ∑_{i=1}^p σ_i.   (5)

The quenched interactions J_ij ∼ N(0, 1) are i.i.d. realisations of a normal random variable.

Let us open a technical parenthesis that is not crucial for the remainder of the discussion. A single scalar order parameter like the magnetisation is not enough anymore to describe the phenomenology of the SK model. Instead one needs to consider a richer distributional order parameter

P(q) := lim_p E_J P(q_p | J),

which is the asymptotic probability distribution of the overlap

q_p := (1/p) ∑_{i=1}^p σ_i^{(1)} σ_i^{(2)}.

Here σ^{(1)} and σ^{(2)} are (conditionally on J) i.i.d. random vectors drawn from the same Gibbs-Boltzmann distribution; they are often called replicas. In the case of the CW model the asymptotic distribution of the magnetisation was simply a Dirac mass δ_m, so m fully described the system. But here the concentration of measure is much more subtle: the overlap does not concentrate at low temperature (q_p ≠ lim_p E_J ⟨q_p⟩ + o_p(1)), but its distribution does: P(q_p | J) converges in distribution to P(q) as p → ∞. The shape of this distribution then allows to describe the various phases of the model (paramagnetic, ferromagnetic, spin glass, etc). Therefore the presence of disorder J changes drastically the model and its phenomenology, and describing it goes beyond the scope of this article, see [6, 7] for more details. The SK model has generated a whole field of research at the crossroads of physics, information theory, computer science and mathematics. And as we will soon realize, this model and its non-disordered cousin the CW model are both deeply connected to statistical inference too.

What is the order parameter in high-d inference? A natural candidate is an error metric, and we will focus on the MMSE (3). The MMSE characterizes the information-theoretic phases: the information-theoretically possible inference phase where the MMSE is relatively small, and the impossible inference regime where it is comparatively high. Note that in general the location of the information-theoretic phase transition separating these phases does not depend on which error metric is used to probe the phase diagram; we will come back to these notions.
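For concreteness, the CW magnetisation can be obtained from the standard mean-field self-consistency m = tanh(β(Jm + h)), a textbook result that this article does not derive; iterating it numerically exhibits the paramagnetic-to-ferromagnetic transition at βJ = 1 (sketch below, mine).

import numpy as np

# Fixed-point iteration of the (standard, not derived here) Curie-Weiss
# mean-field equation m = tanh(beta*(J*m + h)): m = 0 above the critical
# temperature (beta*J < 1) and m != 0 below it (beta*J > 1).
def cw_magnetisation(beta, J=1.0, h=1e-6, iters=2000):
    m = 0.5                                    # start from a magnetised state
    for _ in range(iters):
        m = np.tanh(beta * (J * m + h))        # self-consistent update
    return m

for beta in (0.5, 0.9, 1.1, 2.0):
    print(f"beta = {beta:.1f}: m = {cw_magnetisation(beta):+.4f}")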
3.3. Thermodynamics: free entropy and phase transitions.

A key quantity in statistical mechanics is the free entropy (or minus the free energy):

f_p = (1/(βp)) ln Z.

It contains all thermodynamic information about the model: the non-analyticity points of its thermodynamic limit f := lim_p f_p correspond to the locations of phase transitions. A phase transition is when a complex system experiences a change in the behavior of certain order parameters when external control parameters are varied. The canonical example is water, whose phase may be characterized by a local density of molecules or an average correlation length (two order parameters) while the temperature and/or the pressure evolve (two control parameters).

One of the main uses of the free entropy is that it gives access to (some) order parameters and their fluctuations; it is the moment generating function, the order parameter(s) being the moments. E.g., in the CW model the magnetisation and its fluctuations are obtained by taking derivatives w.r.t. the field h:

f′_p = ⟨m_p⟩,   f″_p = βp ⟨(m_p − ⟨m_p⟩)²⟩,   (6)

where the symbol ′ means a h-derivative. For disordered systems like the SK model, we generally consider the expected free entropy

E f_p = (1/(βp)) E_J ln Z(J).

It is equivalent to the non-averaged free entropy, but more practical to compute as it is independent of a particular realisation of the interactions J. The equivalence is again a consequence of the concentration of measure, which implies f_p(J) = lim_p E f_p + o_p(1). Note that even if the overlap is not self-averaging, like in the SK model and other spin glasses at low temperature (or combinatorial optimisation problems [6]), the free energy is always self-averaging (for well-defined models). (As an aside: the solution of the SK model was found by G. Parisi [10, 12], and the rigorous proof of the Parisi solution was obtained by F. Guerra [9] and M. Talagrand [8], and later re-proved by D. Panchenko [7].)

Let us discuss further the notion of phase transition. There exist many types of phase transitions; sometimes they are quite smooth (these are of the second order type because they correspond to a discontinuity of a second-order derivative of the asymptotic free entropy), and sometimes very sharp and discontinuous (of the first order type, namely, a discontinuity of a first order derivative of f).

Examples of phase transitions are: the recovery of a memory by the brain once enough stimuli in the direction of the memorized pattern are provided (a simple model of associative memory is the Hopfield model); here the order parameter is the overlap with the memorized pattern and the control parameter is the amount of stimuli. A crash in the financial market, where suddenly all prices drop together. The sudden rigidity transition that happens when you randomly pack enough balls in a box (this is called the jamming transition, and it is related to computer memory optimization or error correcting codes in communication). When communicating bits through a given noisy channel, there exists a maximum rate of information transmission; communication above this sharp threshold is impossible as information gets lost due to noise. This limit is called the
Shannon capacity [13, 14] and really is nothing but a phase transition. The order parameter is the quality of recovery of the information bits, the control parameter is the communication rate. A final one: say you want to train a classification algorithm that, when given a large data-base of labeled training examples, is able to distinguish pictures of dogs and cats. There exists a minimum number of training examples below which, no matter the power of the computer, the algorithm will never be able to properly classify the images; this is an information-theoretic transition. The order parameter is the classification performance of the algorithm, the control parameter is the size of the training set, see, e.g., [24, 25].
3.4. Information theory: Shannon entropy and mutual information.
Information theory in the context of inference is mainly concerned with the following question: when does data contain enough information so that it can be used to infer something about the process that generated it? To address this question and, connected to that, to understand what is the cousin notion of the free entropy in high-d inference, we first need to recall what the
Shannon entropy is. The understanding of this object is fundamental and gave rise to the birth of information theory [13], so we will spend some time discussing it in detail. As understood by Shannon, it is related to the older notion of entropy in thermodynamics and statistical mechanics, as we will see, thus the name. Its definition for a discrete r.v. is

H(x) := ∑_x P(x) ln(1/P(x)) = E_x ln(1/P(x)).

We slightly abuse notation and use the same symbol x for a r.v. and its outcome (while information theory usually denotes the r.v. by a capital letter X and an outcome by x). We will restrict our discussion to discrete r.vs. as the interpretation is a bit more subtle in the continuous case, but a lot of the intuition generalizes.

The Shannon entropy H(x) = E_x h(x) is the expectation of the information content h(x) := −ln P(x), or surprise, of the outcome x. If this outcome has low probability then observing it is quite surprising, and it brings a lot of information as it was not expected: h(x) is high. If instead P(x) is close to 1 it is not surprising to observe x, so this outcome brings low information: h(x) is low. Said differently: if the outcome of a r.v. is very probable, it is no surprise (and generally uninteresting) when it happens, because it was expected. However, if an outcome is unlikely to occur, it is much more informative if it happens to be observed. The term information content must be understood as a potential information gain if x is observed. When using the logarithm in base 2, the information content and entropy are expressed in "bits".
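A few lines of code make the definition tangible (a sketch of mine; the function names are arbitrary): the entropy in bits of a discrete distribution, and the surprise of individual outcomes, anticipating the fair/biased coin comparison below.

import numpy as np

def entropy_bits(probs):
    # H(x) = sum_x P(x) log2(1/P(x)), with the convention 0 log 0 = 0
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return float(np.sum(probs * np.log2(1 / probs)))

print(entropy_bits([0.5, 0.5]))      # fair coin: 1.0 bit
print(entropy_bits([0.1, 0.9]))      # biased coin: ~0.47 bits (cf. below)
# surprise h(x) = -log2 P(x): rare outcomes are more informative
print(-np.log2(0.1), -np.log2(0.9))  # ~3.32 bits vs ~0.15 bits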
Imagine you are in the desert and suddenly it rains like hell. Worse, it rains cows that play piano! What?! It is amazingly surprising, no? The probability of this event is actually so low that it brings an enormous amount of information; in this case it should lead you to the conclusion that you are dreaming. If instead you are in the desert and it is super sunny and hot, it is not surprising at all; this does not bring more information than what you already know, and if you are dreaming, it is unlikely that this observation will help you realize it. Another example: the knowledge that some particular number will not be the winning one of a lottery provides very little information, because any particular chosen number will almost certainly not win. However, knowledge that a particular number will win a lottery has high informational value because it communicates the outcome of a very low probability event.

The entropy can also be interpreted as a measure of unpredictability of the r.v. x, or of uninformation/lack of knowledge about what x's outcome will be: the more surprising the outcomes are in expectation, the more unpredictable is the actual outcome, which also means the less we know about x before observing it. H(x) quantifies the expected amount of missing information necessary to determine the outcome of x before observing it. This can be confusing because previously we said that H(x) is an expected information content, while now we speak about a measure of uninformation. There is no paradox: an information H(x) is gained in average when x's outcome is actually observed. But prior to observing the outcome, H(x) is a measure of uninformation about it. Put differently: observing the outcome x converts in average H(x) units of uninformation into information. So it is just a matter of conceptually placing ourselves before x is observed (in which case the interpretation as a measure of uninformation may be more natural), or after x is observed (where the interpretation as an expected information content seems to fit better). But at the end this is the same thing.

An example might help: the outcome of a toss of a fair coin x_fair ∼ Ber(1/2) is much more unpredictable than the outcome of a strongly biased coin x_bias ∼ Ber(1/10), or equivalently our lack of knowledge about what x_fair will be is higher: we are more uninformed. But when observing the outcome of the fair coin, we then gain more information than with the unfair one, because it is in average more surprising. In the first case, which has entropy H(x_fair) of one bit, betting on one side or the other is the same statistically. While in the second case, where H(x_bias) = (1/10) log₂ 10 + (9/10) log₂(10/9) ≈ 0.47 bits, the outcome is much more predictable, we are less uninformed (= more informed); it would be an error not to bet on the outcome x_bias = 0.

To summarize, the entropy H(x) of the r.v. x quantifies: i) its average information content, i.e., the expected information gain when observing the outcome x; ii) the average uninformation/lack of knowledge about the outcome x prior to observing it; iii) its unpredictability; iv) how little "structured" its distribution is: the higher the entropy of x, the less structured is its distribution; v) when expressed in bits, H(x) is the expected number of binary "yes/no" questions required to determine the outcome before it is observed, or equivalently, the expected number of binary questions that the outcome x has answered after being observed.

Similarly the conditional entropy is:

H(x | y) := ∑_{x,y} P(y) P(x | y) ln(1/P(x | y)).

It is the expected information revealed by evaluating the outcome of x given that you already know the outcome of y. Or equivalently, it is the expected remaining amount of unpredictability of x given that y has already been observed. The entropy has many important properties that make it a "good" definition of information content, one of the main ones being that it is additive for independent r.vs.:

H(x, y) = H(y) + H(x) if P(x, y) = P(x) P(y),

and many other ones such as its non-negativity (for discrete r.vs.) and the chain rule

H(x, y) = H(x | y) + H(y) = H(y | x) + H(x).

(It is easy to prove these facts from the definition of the entropy.) But all these justifications are not enough to prove that it is indeed the correct definition. Maybe other functions verify all these properties and have a similar interpretation. The mathematical proof that the entropy indeed is the correct definition comes from the source coding theorem of C. Shannon, the father of information theory, see [2, 13, 14]. Let us describe it in words. We consider binary symbols but the following reasoning applies to more generic discrete alphabets.

Roughly, the source coding theorem says that if a source generates strings (x_1, x_2, . . . , x_n) of n ≫ 1 i.i.d. copies of x, then there exists a compressed code C_δ for this source of cardinality |C_δ| ≈ 2^{nH(x)} ≤ 2^n, and this independently of the risk 0 < δ < 1 taken (as n → ∞). Let us understand what that means, and why it implies that H(x) is the proper definition of the information content carried by the random variable x.

a) First, why introduce a source of long strings? The information content of the source, whatever it means, must be n times the one of x alone, because information must be additive for independent variables (x_1, x_2, . . . , x_n). As a consequence the expected information content per symbol of the source equals the one of x. So studying the source or x is the same from an information-theoretic point of view. But as n gets large, Shannon understood that the concentration of measure, in the form of the law of large numbers, will help a lot in the analysis.

b) What is a code? A code C_δ is any alternative "maximally compressed" representation of the set of strings. Namely it is a set of smaller cardinality than 2^n, the number of possible strings, such that a random string from the source has an associated element in the code with probability (over the strings) at least 1 − δ. And at the same time the code has smallest possible cardinality. A code C_δ therefore "encodes" part of the information (as tuned by δ) about the source through a bijective mapping between C_δ and a subset of the 2^n possible strings.
The risk we take is in the sense that with probability < δ a string generated by the source will not have an associated element in C_δ, so its information is lost when coding. A constructive way of defining C_δ is to rank all possible strings according to their probability. Then add a first element C_1 to C_δ, associated to the most probable string. Then add a second element C_2 to C_δ, mapped to the second most probable string, and so on, until the sum of the probabilities of the strings mapped to the code elements exceeds 1 − δ.

c) The information content of this code expressed in bits is naturally defined as the number of binary symbols necessary to represent any element of the code: log₂ |C_δ|. Moreover Shannon showed that log₂ |C_δ| → nH(x) as n → ∞. A-priori this information content is less than the one of the original source, as some risk δ has been taken when compressing/mapping the source to C_δ; we talk about lossy compression.

d) But the crucial observation is that the quantity H(x), defining the cardinality of the code necessary to compress the source up to risk 0 < δ < 1, is independent of the risk in the limit n ≫ 1. This means that as long as we allow ourselves a tiny probability of error δ (independent of n), compression down to nH(x) bits is possible. But even if we are allowed a large probability of error, we still can compress the source only down to nH(x) bits. This strongly suggests that nH(x) is the fundamental information content of the source. As a consequence of this and point a), the information content of x is also (1/n) log₂ |C_δ| → H(x). This ends the reasoning.

Let us give a high-level proof of the source coding theorem. The point is that, as n gets larger, by the law of large numbers almost all strings actually generated by the random source are typical, so that only the typical sequences need to be encoded during the compression (the others are too improbable and therefore are not coded). Let us consider for simplicity Bernoulli variables x ∼ Ber(ρ). All typical sequences have approximately the same number nρ of ones and n(1 − ρ) of zeros. Indeed, the number R of ones in the string follows a binomial distribution R ∼ Bin(n, ρ). The relative fluctuation of R is O(1/√n), so R concentrates onto its mean when n gets large. (Relative fluctuations of the order O(1/√n) of macroscopic quantities like R are typical of the complex systems treated in statistical mechanics. That the relative fluctuations vanish is the reason why such random systems can be analyzed and described by asymptotically, as n → +∞, deterministic observables, converging on their ensemble mean.) This implies that with high probability the only possibly observed strings are those with R values very close to nρ: this informally defines the typical set. So the probability of a typical sequence x_typ = (x_1, . . . , x_n) is

P(x_typ) = ∏_i P(x_typ,i) ≈ ρ^{nρ} (1 − ρ)^{n(1−ρ)}.

Denote this probability of a typical string P_typ := ρ^{nρ}(1 − ρ)^{n(1−ρ)}. What is the information content/surprise in bits of a typical outcome?

log₂(1/P_typ) = −n(ρ log₂ ρ + (1 − ρ) log₂(1 − ρ)) = nH(x).   (7)

So the proof strategy is: i) as n gets large only typical sequences/outcomes are observed; they carry almost all the probability mass. So when defining the code C_δ we only need to code these typical outcomes; doing so it is maximally compressed. The number of typical strings is exponentially large in n (this follows from the asymptotic equipartition principle), so even if we allow a risk δ very close to 1 (but independent of n) and therefore only code a small fraction of the typical sequences, there are still approximately as many at leading (exponential) order as n gets large. E.g., if there are exp(an) typical sequences and we only code (1 − δ) exp(an) = exp(an + ln(1 − δ)) of them, there is the same number at leading order for any fixed a > 0 and 1 > δ > 0 as n ≫ 1. So independently of δ the number |C_δ| of typical sequences necessary to code is the same at leading exponential order. ii) The question then becomes: can we count them, i.e., evaluate |C_δ| at leading order? By definition all typical sequences have approximately the same probability P_typ, and they carry almost all the mass. Therefore

∑_{x typical} P(x) ≈ N_typ P_typ ≈ 1,

where N_typ is the number of typical sequences. With what we said previously, N_typ equals |C_δ| at leading order. This implies that there are approximately N_typ ≈ 1/P_typ = 2^{nH(x)} typical sequences (from (7)). We can thus count them at leading order. This allows to estimate the expected information content per symbol as (1/n) log₂ |C_δ| ≈ H(x), which is the same as the expected information content of x by definition of the source. The same argument extends to more general (non-binary) alphabets.
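The concentration at the heart of this argument is easy to verify numerically. The following sketch (mine) samples long Ber(ρ) strings and checks that the per-symbol surprise −(1/n) log₂ P(string) concentrates around H(x) with O(1/√n) fluctuations, i.e., essentially every generated string is typical.

import numpy as np

# Typicality check for i.i.d. Ber(rho) strings (sketch; parameters are mine).
rng = np.random.default_rng(3)
rho, n, samples = 0.11, 10_000, 2000
H = rho * np.log2(1 / rho) + (1 - rho) * np.log2(1 / (1 - rho))

strings = rng.random((samples, n)) < rho          # i.i.d. Ber(rho) strings
ones = strings.sum(axis=1)
surprise = -(ones * np.log2(rho) + (n - ones) * np.log2(1 - rho)) / n

print(f"H(x) = {H:.4f} bits")
print(f"per-symbol surprise: mean {surprise.mean():.4f}, std {surprise.std():.4f}")
# The std is O(1/sqrt(n)): essentially every sampled string is typical.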
A connected information-theoretic quantity is the mutual information:

I(x; y) := H(x) − H(x | y) = H(y) − H(y | x).

It is interpreted as a measure of the mutual dependence of x and y. It quantifies the "amount of information" obtained about one r.v. through observing the other one. And indeed it vanishes if and only if the r.vs. are independent: I(x; y) ≥ 0, with equality iff P(x, y) = P(x) P(y). In an inference problem where we want to recover the parameters x from the data y(x), the last form has a particularly nice interpretation: H(y) − H(y | x) is the total information carried by the data minus the remaining unpredictability/uninformation about the data when the signal is known, which is therefore the noise contribution. E.g., in a Gaussian denoising model y = √λ x + z with z ∼ N(0, 1), we have H(y | x) = H(z) = ln(2πe)/2. The mutual information is thus the information carried by the data purely about the signal. As such it quantifies the information-theoretic limits of inference, and computing it in high-d settings is a key goal of information theory. In the Gaussian denoising model it is an exercise to show that its explicit expression reads (here x*, x are i.i.d. from P(x))

I(x; y) = (λ/2) E[x²] − E_{x*,z} ln E_x exp{λ x* x + √λ z x − (λ/2) x²}.   (8)
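Formula (8) is straightforward to evaluate by Monte Carlo. In the sketch below (mine), it is specialized to the Rademacher prior P = (δ_{−1} + δ_1)/2, for which the inner expectation E_x is a two-point sum; as λ grows the estimate approaches ln 2 nats = 1 bit, the full entropy of the sign.

import numpy as np

# Monte Carlo evaluation of (8) for a Rademacher prior (sketch; in nats).
rng = np.random.default_rng(4)
z = rng.normal(size=200_000)

def mi_rademacher(lam):
    x_star = rng.choice([-1.0, 1.0], size=z.size)
    # E_x exp{lam x* x + sqrt(lam) z x - lam x^2/2} is a two-point sum over x = +/-1
    inner = 0.5 * (np.exp(lam * x_star + np.sqrt(lam) * z - lam / 2)
                   + np.exp(-lam * x_star - np.sqrt(lam) * z - lam / 2))
    return lam / 2 * 1.0 - np.mean(np.log(inner))   # E[x^2] = 1 for Rademacher

for lam in (0.5, 1.0, 2.0, 5.0):
    print(f"lambda={lam}: I={mi_rademacher(lam):.4f} nats (ln 2 = {np.log(2):.4f})")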
3.5. Denoising, the I-MMSE formula and the information-theoretic interpretation of the free entropy.

Consider the general denoising model, where x can be a vector, a matrix, etc.:

y = √λ x + z.

The r.v. z has the same dimension as the signal and has i.i.d. standard normal entries. The SNR is λ > 0, and the task is to recover x from y. There exists a general identity called the I-MMSE formula [15] relating the mutual information and the MMSE for the denoising model:

(d/dλ) (1/p) I(x; y) = (1/2) E_{x,y} MMSE_p = (1/(2p)) E ∥x − E[x | y]∥².

This relation is, for high-d inference (under Gaussian noise), the equivalent of the thermodynamic identity f′_p = ⟨m_p⟩ in (6).

Let us clarify the connection between free entropy and mutual information. The expected log-partition function in the Bayes formula reads

E_y ln Z(y) = ∫ dy P(y) ln P(y) = −H(y).

Therefore the expected free entropy is linked to the Shannon entropy of the data:

−E_y f_p(y) = (1/p) H(y).
So the mutual information verifies

(1/p) I(x; y) = −E f_p(y) − (1/p) H(y | x).   (9)

The term (1/p) H(y | x) = (1/p) H(z) = (1/2) ln(2πe) is trivial (because the noise has i.i.d. components). Another way to see the connection is by starting from the thermodynamic definition of the free entropy: the Shannon entropy of the Gibbs-Boltzmann distribution (the posterior) minus the internal energy (recall β = 1):

p E f_p(y) = H(x | y) − E⟨H(x; y)⟩,   (10)

where H(x; y) = −ln P(y | x) − ln P(x) is the Hamiltonian defining the posterior. We focus on the Gaussian denoising model. Using the Bayes formula (2) the internal energy verifies

∫ dx dy P(y) P(x | y) H(x; y) = ∫ dx dy P(x) P(y | x) H(x; y) = ∫ dx dz P(x) P(z) H(x; √λ x + z)

using the change of variable y = √λ x + z. As the noise is i.i.d. Gaussian, the likelihood P(y | x) is a multivariate Gaussian measure with mean √λ x and identity covariance, and thus P(z) is a standard multivariate Gaussian after the change of variable. Therefore

H(x; y) = ∥√λ x − y∥²/2 + (p/2) ln(2π) − ln P(x).

We finally reach that the internal energy is

E⟨H(x; y)⟩ = E∥z∥²/2 + (p/2) ln(2π) − E ln P(x) = (p/2) ln(2πe) + H(x).

Using this as well as (10) in (1/p) I(x; y) = (1/p) H(x) − (1/p) H(x | y), we recover (9) and −E f_p = (1/p) H(y). Therefore a physicist trying to compute the free entropy and an information theorist computing the mutual information are actually aiming for the very same goal.

Thanks to the I-MMSE relation the MMSE order parameter can be derived from I(x; y), or the magnetisation from the free entropy in statistical mechanics models, at least "in theory". Indeed, computing the p-dimensional integrals necessary to obtain the thermodynamic potentials (mutual information, free entropy) or the order parameters "directly" is generally a daunting task. But as we will discuss towards the end, in some high-d problems, this can be reduced to a (much) simpler scalar optimisation problem thanks to the concentration of measure phenomenon.

Consider a factorized prior P(x) = ∏_i P(x_i). In this setting, is the denoising model a "good" example of a high-d inference problem, with phase transitions and a rich phase diagram? No. Indeed, in the model y = √λ x + z each data point y_i(x_i, z_i) is only a function of a single signal and noise component. The r.vs. (x_i, z_i)_{i≤p} are i.i.d. by the factorization assumption. As a consequence the MMSE of the whole signal x equals the MMSE of a single entry, as they are all statistically equivalent: E MMSE_p = E[(E[x | y] − x)²]. This quantity is easily shown to be

E[(E[x | y] − x)²] = E[x²] − E_{z,x*}[ x* (E_x[x e^{−(√λ(x*−x)+z)²/2}] / E_x[e^{−(√λ(x*−x)+z)²/2}]) ],   (11)

where x, x* are i.i.d. from P and z is a standard Gaussian r.v. Plotting this MMSE order parameter as a function of the control parameter λ, we get a smooth continuous non-increasing curve that vanishes as λ → ∞. Not so exciting. This is because the variables (x_i) are in fact decoupled and the problem collapses onto p parallel equivalent low-dimensional/scalar inference problems y_i = √λ x_i + z_i. And all are statistically equivalent, so studying one is enough. Something is missing in the model in order to turn the picture into something richer. The denoising model lacks a key ingredient of complex systems: correlations among the signal entries induced by non-trivial interactions between the (x_i) in the Hamiltonian.
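Both (11) and the I-MMSE relation can be checked numerically. In the sketch below (mine), the prior is again Rademacher, in which case the posterior mean in (11) reduces to tanh(λx* + √λ z) (with x* = +1 by symmetry), and a finite difference of (8) reproduces mmse/2.

import numpy as np

# Scalar mmse (11) and an I-MMSE check for the Rademacher prior (sketch;
# the closed forms below are my specialization of (8) and (11)).
rng = np.random.default_rng(5)
z = rng.normal(size=400_000)

def mmse(lam):
    # eq. (11) with P = Rademacher; by symmetry we may set x* = +1
    return 1.0 - np.mean(np.tanh(lam + np.sqrt(lam) * z))

def mi(lam):
    # eq. (8) with P = Rademacher: I = lam - E ln cosh(lam + sqrt(lam) z)
    return lam - np.mean(np.log(np.cosh(lam + np.sqrt(lam) * z)))

lam, eps = 1.0, 1e-3
dI = (mi(lam + eps) - mi(lam - eps)) / (2 * eps)
print(f"dI/dlam ~ {dI:.4f}   mmse/2 = {mmse(lam) / 2:.4f}")   # should agree
# mmse(lam) decreases smoothly from 1 to 0: no phase transition here,
# as the scalar problem is fully decoupled.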
4. A paradigm of high-d inference: the spike Wigner model

In high-d inference an important model is the spike Wigner (SW) model, also called low-rank matrix factorisation. As we will discuss, it is a close cousin of the Ising and SK models in statistical mechanics. It was introduced in random matrix theory as a simple model of principal components analysis [16], which is the most widely used dimensionality reduction technique. Let z = (z_ij)_{i,j=1}^p be a noise matrix with i.i.d. standard normal entries z_ij ∼ N(0, 1); this is called a Wigner matrix. In the SW model the data is (the upper triangular part of)

R^{p×p} ∋ y = √(λ/p) xx^⊺ + z, or componentwise, y_ij = √(λ/p) x_i x_j + z_ij for 1 ≤ i < j ≤ p.   (12)

The signal x is a realisation of the prior P(x) = ∏_{i=1}^p P(x_i). Using that the likelihood is the standard multivariate Gaussian measure, the posterior reads (constant terms are simplified with the normalization)

P(x | y) = (1/Z(y)) exp{∑_{i=1}^p ln P(x_i) − (1/2) ∑_{i<j} (y_ij − √(λ/p) x_i x_j)²}.   (13)

Now we see pairwise interactions in the Hamiltonian, so the (x_i) are not decoupled anymore and the model cannot be reduced to independent scalar inference problems: this system is complex. Note that the information about the sign of x is lost by the ±x symmetry of this measure whenever P(x_i) is even, e.g., when considering a signal with ±1 entries: P(x | y) = P(−x | y), so that E[x | y] = 0 (the all-zeros vector). Therefore it makes more sense in general to consider the rank-one matrix xx^⊺ = (x_i x_j)_{i,j=1}^p as the hidden signal (called "spike"). Anyway, if the statistician can recover the spike, she may access x up to a global sign by extracting its leading eigenvector. The noise z represents an uncontrolled source of randomness that corrupts the spike. The statistician's task is then to infer xx^⊺ as accurately as possible given y and the knowledge of the data-generating process (namely the model (12), but not the specific realization of x nor z). We could generalise to other types of noise (not only Gaussian nor additive), but the qualitative picture would not change much.

The scaling 1/√p of the SNR in (12) is there to make the inference task neither impossible nor trivial. Any other scaling would turn the problem, in the large-system limit p → ∞, into a model with not much interest. By "uninteresting" we mean that the (asymptotic average) spike-MMSE

MMSE := lim_p (1/p²) E ∥E[xx^⊺ | y] − xx^⊺∥²

would be essentially equal to 0 for a SNR scaling p^{−γ} ≫ 1/√p, or equal to its maximum value for a scaling p^{−γ} ≪ 1/√p, and this independently of λ = O(1). Here ∥ ⋅ ∥ is the Frobenius norm and E[xx^⊺ | y] := ∫ dx xx^⊺ P(x | y) is the MMSE estimator of the spike. But precisely for the scaling γ = 1/2 a rich phase diagram emerges with information-theoretic phase transitions.
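Here is a quick numerical illustration (mine): generating data according to (12) and estimating the spike direction with naive PCA, i.e., the top eigenvector of y. That PCA succeeds only above a spectral threshold (λ = 1 for this normalization) is a classical random-matrix fact, not derived in this article; symmetrizing the noise below is a practical choice since (12) only specifies the upper triangle.

import numpy as np

# Spike Wigner data (12) and naive PCA estimation of the spike (sketch).
rng = np.random.default_rng(6)
p = 1000
x = rng.choice([-1.0, 1.0], size=p)                 # Rademacher signal

for lam in (0.5, 4.0):                              # below / above the spectral threshold
    z = rng.normal(size=(p, p))
    z = (z + z.T) / np.sqrt(2)                      # symmetrized Wigner noise
    y = np.sqrt(lam / p) * np.outer(x, x) + z       # data matrix, model (12)
    v = np.linalg.eigh(y)[1][:, -1]                 # top eigenvector of y
    print(f"lambda={lam}: overlap |<v,x>|/sqrt(p) = {abs(v @ x) / np.sqrt(p):.3f}")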
Let us understand more precisely why this is the proper SNR scaling and, connected to that, why we are indeed in the high-d regime. For the inference task not to be trivial we need to place ourselves in the high-d regime. As we explained already, this means that the total SNR per parameter, i.e., n × SNR_d ÷ p, must be of order one. Here n = p(p − 1)/2 is the number of data points (y_ij)_{i<j} and SNR_d = E[(√(λ/p) x_i x_j)²] = (E[x²])² λ/p. We verify that

(p(p − 1)/2 × (E[x²])² λ/p) ÷ p = (E[x²])² λ/2 + O(1/p)

is indeed O(1), as we assume E[x²] = O(1). This explains the scaling 1/√p in the observation model (12): we are in the high-d regime.

Examples of applications of this model are (we consider in all cases that the prior factorizes as P(x) = ∏_{i≤p} P(x_i)):

● Sparse principal components analysis:
In the simplest case the prior P = Ber(ρ) is Bernoulli. The task is to estimate the hidden sparse low-rank representation xx^⊺ of y.

● Submatrix localization:
Again P = Ber(ρ). One has then to extract a submatrix of y of size ρp × ρp with larger mean than the background noise matrix; this is an important model of hidden structure in computer science.

● Community detection in the stochastic block model (SBM):
The (assortative) SBM is a network model where edges between nodes belonging to the same community are more probably observed. Given these observed edges, the task is to infer the community to which each node belongs. For example, assume you know the network of friendships in some social network. Under the hypothesis that people voting for the same political party (among two) are connected in this network with higher probability than when they vote for opposite parties, is it possible to guess the two communities of voters (up to a global permutation)? Recovering two communities of sizes ρp and (1 − ρ)p in a SBM of p vertices is information-theoretically "equivalent" to the SW model with prior (see [17] for the precise meaning of equivalence)

P = ρ δ_{√((1−ρ)/ρ)} + (1 − ρ) δ_{−√(ρ/(1−ρ))}.   (14)

● Z/2-synchronization: The prior is Rademacher, P = (δ_{−1} + δ_1)/2. The task is to infer the node states x ∈ {−1, 1}^p (up to a global sign) from noisy pairwise products y. A possible interpretation: imagine that you can ask pairs (i, j) of individuals whether they agree (+1) or not (−1) on some binary "yes/no" question, but you cannot ask any individual i alone what is her/his opinion on the question, and you have no a-priori idea about it. Moreover the answers (y_ij) you collect are transmitted through a very noisy (Gaussian) communication channel. Naively, you would guess that the pair of individuals (i, j) have the same opinion whenever y_ij (equal to √(λ/p) x_i x_j + z_ij, where x_i is the opinion of individual i) is positive, because z_ij is centered; for the pairs such that y_kl < 0 you would guess x_k x_l = −1. But this pair-by-pair guess does not combine the information redundantly carried by the (y_ij) over many (all) pairs. Can you optimally infer the opinion of each individual (up to a global flip), i.e., who are the "synchronized individuals"? The naive approach is sub-optimal. What one needs to do is to use the posterior (13) and compute the MMSE estimator (in case the goal is to minimize the MSE) or the MAP estimator (if instead one wants to maximize the probability of finding x).
4.1. Link to the Curie-Weiss and Sherrington-Kirkpatrick models.

As promised, we now establish a clear connection between the CW and SK models from statistical mechanics and the SW model from high-d inference. We consider the binary case x ∈ {−1, 1}^p with Rademacher prior; this corresponds to the Z/2-synchronization problem. Denote the signal x*, where the * emphasizes that it is the true one, which is fixed when performing inference, while x are the variables/spins that fluctuate according to the posterior. In the Rademacher case the prior gives a constant contribution that simplifies with the partition function and can therefore be dropped from the posterior. Then, as seen from (13), the Hamiltonian of the SW model reads, when expressing the data as a function of the signal and noise using y_ij = √(λ/p) x*_i x*_j + z_ij and simplifying all x-independent terms with the normalization,

H_SW(x; y) = −∑_{i<j} ((λ/p) x*_i x*_j + √(λ/p) z_ij) x_i x_j

with x ∈ {−1, 1}^p. This is exactly the SK Hamiltonian (5) when only the noise term z_ij is present and λ is set to one. The additional signal-related term −∑_{i<j} (λ/p) x*_i x*_j x_i x_j is called the planted term, and inference models are planted statistical mechanics models. The planted term plays the role of an external magnetic field that tends to align the spins in the signal direction; it carries the information. In contrast the noise term, which competes with the planted one, tends to align the spins in a random direction that is uncorrelated with the signal. Depending on the value of the SNR λ, which plays a similar role as the inverse temperature β, one term wins against the other: for high enough λ > λ_c the planted term wins and the spins "magnetise/polarise" in the signal direction. Here λ_c is the so-called information-theoretic threshold (see the next section for more details). This polarisation is quantified by the overlap between a sample x from the posterior and the signal x*:

m*_p = (1/p) ∑_{i=1}^p x_i x*_i.   (15)

Let m* := lim_p E⟨m*_p⟩. We write the posterior mean E[ ⋅ | y] using the bracket notation ⟨ ⋅ ⟩ from statistical mechanics to emphasize that P(x | y) is a Gibbs-Boltzmann distribution; E is the average over all quenched variables (x*, y) (or equivalently (x*, z)). After some manipulations one can show that the expected spike-MMSE relates to this overlap order parameter as

E MMSE_p = (1/p²) E ∥x*(x*)^⊺ − ⟨xx^⊺⟩∥² = (E[(x*)²])² − E⟨(m*_p)²⟩.

As in the CW model, the concentration of measure implies (in the Bayesian optimal setting) m*_p = m* + o_p(1), and therefore concentration of the expected MMSE (and actually of the non-averaged one as well) towards the asymptotic average MMSE as p → ∞:

MMSE_p → (E[(x*)²])² − (m*)² =: MMSE.   (16)

Although the Hamiltonian H_SW much resembles the one of the SK model, as there is disorder, the phenomenology of the SW model is closer to the one of the CW model due to the planted term. The order parameter m*_p is the counterpart in planted problems of the magnetisation m_p in the CW model, and it concentrates, while the overlap does not in the SK model; that makes a huge difference.
Figure 1. From [20]. Plot of the spike-MMSE, the MSE of the AMP algorithm and of naive PCA for the SW model with prior (14) with ρ = 0.05. Information-theoretic and algorithmic first order phase transitions are observed. A computational-to-statistical gap (hard phase) is present between the information-theoretic and algorithmic thresholds.
This lack of concentration complicates drastically the analysis of the SK model, see [7], and of other models with replica symmetry breaking [6, 12]; replica symmetry breaking is the statistical mechanics terminology for the lack of self-averaging of the order parameters. In contrast the CW model and high-d inference models in the Bayesian optimal setting are replica symmetric, i.e., the order parameters do concentrate towards their mean as p → ∞ [4, 5].
5. Information-theoretic and algorithmic phase transitions
Until now our discussion was mostly conceptual. But can we practically compute the main high-d quantities we introduced (mutual information, free entropy, MMSE) in order to understand and predict the behavior of algorithms for high-d inference problems? We continue to focus on the SW model as a representative example, but the following discussion applies more generically.
5.1. "Single-letter formulas" for mean-field models: the magic of the concentration of measure.
Deriving single-letter formulas for high-d quantities is often possible for problems belonging to the class of mean-field models. Such formulas usually come in the form of an optimization problem over a function of a scalar parameter. In mean-field models each spin/variable interacts with extensively many other ones, i.e., with O(p) of them: we speak in this case about a dense model. Another class of mean-field models are sparse/dilute models, where the network of interactions between variables is such that in the limit p → ∞, variables (x_i) interact with a random subset of finitely many, O(1), other ones. The SW model is a dense mean-field model, as each variable x_i interacts with all the other ones through the pairwise interactions ((y_ij − √(λ/p) x_i x_j)²)_{j≤p}. For such models there exists an arsenal of powerful methods from statistical mechanics that are able to reduce the evaluation of high-d quantities to low-dimensional optimisation problems, in particular the replica method developed in the context of spin glasses [6, 12]. Such a high-d to low-d reduction is another beautiful manifestation of the concentration of measure.
Assume again that the prior factorizes with i.i.d. x_i ∼ P₀. The replica method (or its close cousin the cavity method [6, 12]) predicts that the mutual information for the SW model verifies, as p → ∞ (denote v := E_{P₀}[x²], let x∗, x be i.i.d. from P₀, and let z ∼ N(0, 1) be a standard normal r.v.),
(1/p) I(x; y) → min_{q∈[0,v]} { (λ/4)(q − v)² + I(x; √(λq) x + z) }.
Here I(x; √(λq) x + z) is the mutual information of the Gaussian denoising model with SNR λq, given by (8) changing λ to λq. Therefore we can get an actual formula for the mutual information. Equating to 0 the q-derivative of the function i^(RS)(q, λ) := (λ/4)(q − v)² + I(x; √(λq) x + z) above, called the replica-symmetric potential, its minimizer q_min verifies
(λ/2)(q_min − v) + (d/dq) I(x; √(λq) x + z)|_{q=q_min} = 0.
By the I-MMSE formula it gives the fixed point equation
q_min = v − mmse(x | √(λ q_min) x + z),   (17)
where mmse(x | √(λ q_min) x + z) is the MMSE of the scalar denoising model; it is given by (11) with λ replaced by λ q_min. Whenever unique, the minimizer of the replica-symmetric potential can be shown to be equal to m∗ := lim_{p→∞} E⟨m∗_p⟩ (recall (15)). Therefore from (16) we also get a "single-letter formula" for the MMSE:
MMSE = v² − q²_min.   (18)
It is absolutely amazing that such high-d objects, which depend on so many random variables, can be reduced to such simple formulas! There is something very peculiar happening here: both at the level of the mutual information and of the MMSE the simple scalar denoising model appears. The analysis of the high-d SW model therefore collapses onto the analysis of an inference problem for a single signal component corrupted by Gaussian noise, with an SNR λ q_min given by a non-trivial fixed point equation. This observation is generic for dense mean-field models. For sparse problems things are a bit more subtle but essentially the same type of reduction from a high-d to a low-d problem happens too.
Note that we said at some point that the scalar denoising model was not so interesting in itself as there was no phase transition in its MMSE. But here, even if this simple model appears, the complexity of the SW model is revealed in the fact that the fixed point equation (17) may have more than one solution.
So from one SNR value λ to a close one λ + ε, the solution q_min that minimizes the replica-symmetric potential (and then gives the MMSE through (18)) may change discontinuously: a phase transition then occurs.
All these results can even be turned into mathematically rigorous statements. Complementary to the replica method, there exist the so-called cavity and interpolation methods [7, 9, 26–28], applied to the SW model in [17]. Recently an evolution of the interpolation method for high-d inference, called the adaptive interpolation method, has had great success in proving such formulas (including the ones given above for the SW model) [24, 29, 30]. There exists also an "algorithmic approach" to proving high-d replica-symmetric formulas [18, 31]. For those interested in knowing more about these proof techniques see [20, 22].
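To make formulas (17) and (18) concrete, here is a minimal numerical sketch for the Rademacher prior (so v = 1), for which the scalar denoising MMSE has the closed form mmse(γ) = 1 − E[tanh(γ + √γ Z)]; the λ values and the quadrature order below are arbitrary choices of mine, not taken from the figures.

import numpy as np

# Gauss-Hermite quadrature nodes/weights to compute E_Z[g(Z)], Z ~ N(0,1).
t, w = np.polynomial.hermite.hermgauss(81)
nodes, weights = np.sqrt(2.0) * t, w / np.sqrt(np.pi)

def mmse_scalar(gamma):
    # mmse(x | sqrt(gamma) x + z) for Rademacher x: the posterior mean is
    # tanh(sqrt(gamma) y), and the Nishimori identity gives E[tanh^2] = E[tanh].
    return 1.0 - np.sum(weights * np.tanh(gamma + np.sqrt(gamma) * nodes))

def q_min(lam, q0=1e-4, iters=1000):
    # Fixed-point iteration of (17): q <- v - mmse(lam * q), with v = 1.
    q = q0
    for _ in range(iters):
        q = 1.0 - mmse_scalar(lam * q)
    return q

for lam in [0.5, 1.5, 3.0]:
    q = q_min(lam)
    print(f"lambda = {lam}: q_min = {q:.4f}, MMSE = v^2 - q_min^2 = {1 - q**2:.4f}")

Below λ = 1 the only fixed point is q_min = 0 and MMSE = v² = 1 (no inference possible), while above it q_min > 0. For this prior the transition happens to be continuous; the discontinuous jumps of Figure 1 require sparser priors such as (14).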
5.2. Phase transitions and phase diagram. Equipped with the explicit formula (18) for the MMSE we are ready to explore the phase diagram of the problem. In Figure 1 the MMSE as well as the MSEs reached by two algorithms for the SW model are plotted. These algorithms are principal component analysis (PCA) and the approximate message-passing (AMP) algorithm. In PCA one computes the eigenvector of the data matrix y associated with the maximum eigenvalue; above the algorithmic threshold this eigenvector estimator starts to align with the signal so that the MSE decreases.
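For illustration, PCA takes a few lines. In this sketch (the dimension and λ values are arbitrary choices of mine) the top eigenvector of y is essentially uncorrelated with the signal below the spectral threshold, which is λ = 1 with this normalization, and aligns progressively above it.

import numpy as np

rng = np.random.default_rng(2)
p = 2000
x_star = rng.choice([-1, 1], size=p)

for lam in [0.5, 1.5, 4.0]:  # below and above the spectral threshold lam = 1
    Z = rng.standard_normal((p, p))
    noise = np.triu(Z, 1) + np.triu(Z, 1).T          # symmetric, variance 1 off-diagonal
    Y = np.sqrt(lam / p) * np.outer(x_star, x_star) + noise
    evals, evecs = np.linalg.eigh(Y)                 # spectral decomposition of the data
    v_top = evecs[:, -1]                             # eigenvector of the top eigenvalue
    overlap = abs(v_top @ x_star) / np.sqrt(p)       # in [0,1]; x*/sqrt(p) is a unit vector
    print(f"lambda = {lam}: overlap of the PCA direction with the signal = {overlap:.3f}")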
Figure 2. From [21]. Phase diagram of the SW model with Bernoulli parameters x_i ∼ Ber(ρ) as a function of the sparsity ρ and of the inverse of the total SNR given by snr := λρ². There is no phase transition in the system if ρ exceeds a critical value (see [21]).
We will not discuss the AMP algorithm, but essentially what matters is that it is conjectured by many to be optimal among all low-complexity/practical algorithms in a broad class of high-d inference problems. Here we indeed observe that AMP requires a lower SNR than PCA to perform well (i.e., it can perform better at higher noise levels). And when it works it matches the MMSE estimator performance. Moreover its performance in the limit p → ∞ can be rigorously predicted. This allows one to obtain the curves presented here; see [11, 21, 23, 24] for details.
What we observe is a generic scenario in high-d inference, with two types of phase transitions delimiting three phases: i) the impossible phase is the regime where even the optimal MMSE estimator performs poorly (not better than random guessing). It is therefore information-theoretically impossible to infer anything about the signal better than random guessing. This is not a computational issue, there is simply not enough information. The information-theoretic threshold is denoted λ_c. ii) The SNR regime λ ∈ (λ_c, λ_algo), where in this problem the algorithmic threshold is λ_algo = 1, is the hard phase. Hard is meant in the algorithmic sense: we do not know any computationally efficient algorithm able to match the performance of the optimal MMSE estimator. iii) Finally λ > λ_algo corresponds to the easy phase: in this regime we do know a computationally efficient algorithm (AMP) able to match the MMSE. In this model with this specific prior both the information-theoretic and the algorithmic transition of AMP are sharp/discontinuous: they are of the first order type. Sometimes they are instead continuous, like here for the PCA estimate. The presence of a hard phase defines a so-called computational-to-statistical gap (another name for the hard regime), and understanding whether such a gap is fundamental or not is one of the main open questions in the field. By fundamental we mean whether or not there actually exists in this region a polynomial-time (in p) algorithm able to beat AMP and match the MMSE.
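We do not derive AMP here, but its structure is simple enough to sketch: iterate a matrix-vector product, subtract a memory correction (the "Onsager term"), and apply the scalar Bayes denoiser of an effective Gaussian channel, which for the Rademacher prior is a tanh. The following minimal sketch uses my own illustrative parameters and is not the implementation behind Figure 1.

import numpy as np

rng = np.random.default_rng(3)
p, lam = 2000, 2.0
x_star = rng.choice([-1, 1], size=p)
Z = rng.standard_normal((p, p))
Y = np.sqrt(lam / p) * np.outer(x_star, x_star) + np.triu(Z, 1) + np.triu(Z, 1).T

A = (np.sqrt(lam) / p) * Y               # rescaled data: (lam/p) x*(x*)^T + zero-mean noise
x_hat = 1e-3 * rng.standard_normal(p)    # almost uninformative initialization
x_old = np.zeros(p)

for t in range(50):
    b = (lam / p) * np.sum(1.0 - x_hat**2)  # Onsager term: lam times the mean of tanh'
    m = A @ x_hat - b * x_old               # effective field, approx. lam*q x* + sqrt(lam*q) g
    x_old = x_hat
    x_hat = np.tanh(m)                      # Bayes denoiser E[x|m] for a Rademacher signal

print(f"AMP overlap with the signal: {abs(x_hat @ x_star) / p:.3f}")

In the limit p → ∞ the overlap reached by this iteration is tracked exactly by the scalar fixed-point iteration of (17) started from an (almost) uninformative point. When (17) has several fixed points, AMP gets stuck on the one reachable from uninformative initializations: this is precisely the hard phase.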
Figure 3. From [19]. All-or-nothing information-theoretic and AMP algorithmic phase transitions for the SW model with Bernoulli parameters x_i ∼ Ber(ρ). As the sparsity ρ decreases both transitions become sharper: an all-or-nothing transition appears in the limit ρ → 0. The horizontal axis is on a log scale, and is relative to the information-theoretic threshold λ_c(ρ), itself a function of ρ (see [19] for its expression). The statistical-to-algorithmic gap diverges as ρ → 0: it becomes algorithmically harder to infer the signal.
These three regimes are separated by phase transitions. Consider the SW model with Bernoulli prior of mean ρ. We show the phase transition lines in the (1/(λρ²), ρ) plane (these are the control parameters; snr := λρ² is the natural SNR parameter) in Figure 2. Predicting the performance of the MMSE and AMP estimators at each point allows one to draw the phase diagram of the problem. We observe large regions in green where AMP is optimal, and the hard phase in orange. This is similar to the phase diagram of water in the (temperature, pressure) plane with the solid, liquid and gas phases. This kind of picture allows one to read off the fundamental and algorithmic limitations of signal reconstruction as the control parameters are varied.
Let us mention another interesting observation. It was proven recently in [19] (based on conjectures in [21]) that all-or-nothing phase transitions happen in the regime of very high sparsity ρ → 0: when the effective dimension of the signal is much smaller than its ambient dimension p, the signal can either be perfectly inferred, or not at all. There is no crossover between these two behaviors like in Figure 1, which is for a finite sparsity ρ. Here the effective dimension of the signal is ρp, i.e., the expected number of non-zero components; it vanishes when compared to the ambient dimension p as ρ → 0.
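The phase-transition lines of Figure 2 can in principle be traced by solving the fixed point equation (17) for the Bernoulli prior at each point of the plane. Here is a hedged Monte Carlo sketch; the sparsity, the snr values and the parametrization below are illustrative guesses of mine. When the iteration started from an uninformative point and the one started from the fully informative point q = v disagree, several fixed points coexist: this coexistence is the fingerprint of a first-order transition and of the hard phase, and locating where the global minimizer of the potential switches between the two branches gives the information-theoretic line.

import numpy as np

rng = np.random.default_rng(4)
rho, N = 0.02, 200_000
x = (rng.random(N) < rho).astype(float)       # Monte Carlo samples of x ~ Ber(rho)
z = rng.standard_normal(N)
v = rho                                       # v = E[x^2] for the Bernoulli prior

def posterior_mean(y, g):
    # E[x|y] for y = sqrt(g) x + z, x ~ Ber(rho): a sigmoid of the log-likelihood ratio.
    return 1.0 / (1.0 + np.exp(-(np.log(rho / (1 - rho)) + np.sqrt(g) * y - g / 2)))

def fixed_point(lam, q0, iters=200):
    # Iterate q <- v - mmse(lam*q) = E[(E[x|y])^2] (using the Nishimori identity).
    q = q0
    for _ in range(iters):
        g = lam * q + 1e-12
        f = posterior_mean(np.sqrt(g) * x + z, g)
        q = np.mean(f**2)
    return q

for snr in [0.5, 0.8, 1.1]:                   # snr := lam * rho^2 (illustrative values)
    lam = snr / rho**2
    q_un = fixed_point(lam, q0=rho**2)        # uninformative start: the AMP branch
    q_in = fixed_point(lam, q0=v)             # informative start: candidate IT branch
    print(f"snr = {snr}: q_uninformative = {q_un:.4f}, q_informative = {q_in:.4f}")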
This phenomenology seems very generic and happens in a broad class of other high-d inference models [32]. The success of modern signal processing and machine learning in high-d regimes is believed to be partly due to the structure of the data itself: even if high-dimensional, it has a lower effective dimensionality, which is then exploited by algorithms. Therefore designing and analysing simple models that are tractable and serve as idealized paradigms for this setting is of fundamental interest.
Concluding remarks
We discussed the modern regime of high-d statistics. Focusing on the spike Wigner model as a paradigm of high-d inference, we have shown that inference can be recast in the statistical mechanics language. As in more physical models like spin systems (and virtually any sufficiently complex system) the SW model has phase transitions separating different algorithmic regimes of inference. For the sake of pedagogy we focused on the SW model. But a large part of the concepts we introduced, of the phenomenology we presented and of the conclusions we drew are much more general and apply to an extremely large class of inference and learning problems. In order to get a broader view and learn about many more examples of high-d inference models that can be treated using the statistical mechanics approach I recommend the excellent review [11]. See also the article [24]. For mathematically oriented readers see [22] and [20]. Classical references are the books [33, 34]. We end this article by emphasizing again that the selection of topics and the provided references are highly subjective. The field is huge and rapidly expanding, and it is a doomed task to cover it all in a finite number of pages. Nevertheless I hope that this contribution may motivate the reader to dive deeper into this fascinating field. A complementary set of references can be found in [36].
References
[1] L. Wasserman, All of statistics: a concise course in statistical inference, Springer Science & Business Media (2013)
[2] D. MacKay, Information theory, inference and learning algorithms, Cambridge University Press (2003)
[3] E. J. Candès and M. B. Wakin, An introduction to compressive sampling, IEEE Signal Processing Magazine (2008), https://authors.library.caltech.edu/10092/
[4] J. Barbier, Overlap matrix concentration in optimal Bayesian inference, Information and Inference: a Journal of the IMA (2020)
[5] J. Barbier and D. Panchenko, Strong replica symmetry in high-dimensional optimal Bayesian inference, (2020), https://arxiv.org/abs/2005.03115
[6] M. Mézard and A. Montanari, Information, physics, and computation, Oxford University Press (2009), https://web.stanford.edu/~montanar/RESEARCH/book.html
[7] D. Panchenko, The Sherrington-Kirkpatrick model, Springer Science & Business Media (2013)
[8] M. Talagrand, The Parisi formula, Annals of Mathematics (2006)
[9] F. Guerra, Broken replica symmetry bounds in the mean field spin glass model, Communications in Mathematical Physics (2003)
[10] G. Parisi, A sequence of approximated solutions to the Sherrington-Kirkpatrick model for spin glasses, Journal of Physics A: Mathematical and General (1980)
[11] L. Zdeborová and F. Krzakala, Statistical physics of inference: thresholds and algorithms, Advances in Physics (2016)
[12] M. Mézard, G. Parisi and M. Virasoro, Spin glass theory and beyond: an introduction to the replica method and its applications, World Scientific Publishing Company (1987)
[13] C. E. Shannon, A mathematical theory of communication, The Bell System Technical Journal (1948)
[14] T. M. Cover and J. A. Thomas, Elements of information theory, John Wiley & Sons (1999)
[15] D. Guo, S. Shamai and S. Verdú, Mutual information and minimum mean-square error in Gaussian channels, IEEE Transactions on Information Theory (2005)
[16] I. M. Johnstone, On the distribution of the largest eigenvalue in principal components analysis, The Annals of Statistics (2001)
[17] M. Lelarge and L. Miolane, Fundamental limits of symmetric low-rank matrix estimation, Probability Theory and Related Fields (2019)
[18] M. Dia, J. Barbier, N. Macris, F. Krzakala, T. Lesieur and L. Zdeborová, Mutual information for symmetric rank-one matrix estimation: a proof of the replica formula, Advances in Neural Information Processing Systems (2016)
[19] J. Barbier, N. Macris and C. Rush, All-or-nothing statistical and computational phase transitions in sparse spiked matrix estimation, Advances in Neural Information Processing Systems (2020)
[20] L. Miolane, Fundamental limits of inference: a statistical physics approach, PhD thesis (2020), https://hal.archives-ouvertes.fr/tel-02446988
[21] T. Lesieur, F. Krzakala and L. Zdeborová, Constrained low-rank matrix estimation: phase transitions, approximate message passing and applications, Journal of Statistical Mechanics: Theory and Experiment (2017)
[22] J. Barbier, Mean-field theory of high-dimensional Bayesian inference, course given at the school "Mathematical and Computational Aspects of Machine Learning", Scuola Normale Superiore di Pisa (2020)
[23] F. Krzakala, M. Mézard, F. Sausset, Y. Sun and L. Zdeborová, Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices, Journal of Statistical Mechanics: Theory and Experiment (2012)
[24] J. Barbier, F. Krzakala, N. Macris, L. Miolane and L. Zdeborová, Optimal errors and phase transitions in high-dimensional generalized linear models, Proceedings of the National Academy of Sciences (2019)
[25] J. Barbier, Phase transitions: from physics to computer science, online "Basic Notions Seminar" of the ICTP Mathematics Department (2020)
[26] M. Talagrand, Mean field models for spin glasses: volume I: basic examples, Springer Science & Business Media (2010)
[27] M. Talagrand, Mean field models for spin glasses: volume II: advanced replica-symmetry and low temperature, Springer Science & Business Media (2010)
[28] F. Guerra and F. L. Toninelli, The thermodynamic limit in mean field spin glass models, Communications in Mathematical Physics (2002)
[29] J. Barbier and N. Macris, The adaptive interpolation method: a simple scheme to prove replica formulas in Bayesian inference, Probability Theory and Related Fields (2019)
[30] J. Barbier and N. Macris, The adaptive interpolation method for proving replica formulas. Applications to the Curie-Weiss and Wigner spike models, Journal of Physics A: Mathematical and Theoretical (2019)
[31] J. Barbier, N. Macris, M. Dia and F. Krzakala, Mutual information and optimality of approximate message-passing in random linear estimation, IEEE Transactions on Information Theory (2020)
[32] C. Luneau, J. Barbier and N. Macris, Information theoretic limits of learning a sparse rule, Advances in Neural Information Processing Systems (2020)
[33] A. Engel and C. Van den Broeck, Statistical mechanics of learning, Cambridge University Press (2001)
[34] H. Nishimori, Statistical physics of spin glasses and information processing: an introduction, Clarendon Press (2001)
[35] P. Mehta, M. Bukov, C.-H. Wang, A. G. R. Day, C. Richardson, C. K. Fisher and D. J. Schwab, A high-bias, low-variance introduction to Machine Learning for physicists, Physics Reports (2019)
[36] A. Montanari, Mean field asymptotics in high-dimensional statistics: a few references, https://web.stanford.edu/~montanar/OTHER/TALKS/oops_refs.pdf
(Jean Barbier) International Center for Theoretical Physics, Trieste, Italy.
Email address: