The peculiar statistical mechanics of Optimal Learning Machines
Matteo Marsili
The Abdus Salam International Centre for Theoretical Physics, Strada Costiera 11, 34151 Trieste, Italy, and Istituto Nazionale di Fisica Nucleare (INFN), Sezione di Trieste, Italy
July 24, 2019
Abstract
Optimal Learning Machines (OLM) are systems that extract maximally informative representations of the environment they are in contact with, or of the data they are presented. It has recently been suggested that these systems are characterised by an exponential distribution of energy levels. In order to understand the peculiar properties of OLM within a broader framework, I consider an ensemble of optimisation problems over functions of many variables, part of which describe a sub-system and the rest account for its interaction with a random environment. The number of states of the sub-system with a given value of the objective function obeys a stretched exponential distribution, with exponent γ, and the interaction part is drawn at random from the same distribution, independently for each configuration of the whole system. Systems with γ = 1 then correspond to OLM, and we find that they sit at the boundary between two regions with markedly different properties. For all γ > 0 the system exhibits a freezing phase transition. The transition is discontinuous for γ < 1 and continuous for γ > 1. The region γ > 1 corresponds to learnable energy landscapes, and the behaviour of the sub-system becomes predictable as the size of the environment exceeds a critical threshold. For γ < 1, instead, the energy landscape is unlearnable and the behaviour of the system becomes more and more unpredictable as the size of the environment increases. Sub-systems with γ = 1 (OLM) feature a behaviour which is independent of the relative size of the environment. This is consistent with the expectation that efficient representations should be largely independent of the level of detail of the description of the environment.

Living systems rely in many ways on the efficiency of the internal representation they form of their environment [1, 2]. For example, in order for a bacterium to respond to challenges, it has to encode a representation of the environment in its internal state.
This suggests that the metabolism or the gene regulatory network can be regarded as learning machines, which have evolved to perform tasks not so dissimilar from pattern recognition in artificial intelligence (e.g. deep neural networks). Here we focus on a particular ideal limit of what we call optimal learning machines (OLM). These are machines that extract representations that are maximally informative on the generative process of the states of the environment or of the data. It has been shown [3] that OLM so defined are characterised by an exponential distribution of energy levels, independently of architectural details or of the nature of what is represented. This implies a linear behaviour of the entropy S(E) = νE + S_0 with the energy. This prediction can be tested empirically, since it implies statistical criticality in a finite sample, as shown in Refs. [4, 5]. This phenomenon amounts to the observation of broad frequency distributions, i.e. that the number of states observed k times in the sample behaves as m_k ∼ k^{−ν−1}. Statistical criticality is ubiquitous in empirical data of natural systems that supposedly express efficient representations (see e.g. [5, 6, 7, 8]), as well as in efficient representations in statistical learning [9, 10, 11]. The parameter ν gauges the trade-off between signal and noise, and Ref. [3] shows that the point ν = 1 corresponds to the most compressed lossless representation. In a finite sample, the case ν = 1 corresponds to Zipf's law [12, 4], which is observed e.g. in language [6], neural coding [7] and the immune system [8]. In deep neural networks, Ref. [9] shows that layers with ν ≈ 1 are those that best reproduce the statistics of the training sample.
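As a minimal numerical illustration of this finite-sample signature (the state space size, sample size and random seed below are arbitrary choices, not taken from the cited analyses), one can draw a sample from a Zipf-distributed set of states and check that the number m_k of states observed exactly k times decays roughly as k^{−ν−1} = k^{−2} for ν = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
R, N = 50_000, 100_000                  # number of states and sample size (arbitrary)
p = 1.0/np.arange(1, R + 1)             # Zipf's law: frequency of rank r proportional to 1/r
p /= p.sum()
sample = rng.choice(R, size=N, p=p)     # draw N observations of the state
counts = np.bincount(sample, minlength=R)
k = np.arange(1, 11)
m_k = np.array([(counts == kk).sum() for kk in k])   # states observed exactly k times
slope = np.polyfit(np.log(k), np.log(m_k), 1)[0]     # expected near -2 for nu = 1
print(slope)
```

The fitted log-log slope comes out close to −2, with deviations at small k due to discreteness and sampling noise.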
This lends some support to the idea that biological systems and machine learning operate close to the ideal limit of OLM. This evidence suggests that understanding the properties of systems with an exponential energy density may shed light both on learning machines in artificial intelligence and on those in Nature [1, 2]. This is the goal of the present paper: to reveal the peculiar properties of systems with exponential energy density within a wider class of systems. This is done by studying systems with a stretched exponential density of states, which interpolates between OLM and more familiar physical systems, such as the Random Energy Model (REM) [13].

We focus on a generic model, introduced in [14], of a system that optimises a complex function over a large number of variables. The system is composed of a sub-system and its environment. The components of the objective function of the sub-system and of its interaction with the environment obey a stretched exponential distribution with exponent γ > 0. The case γ = 2 coincides with the REM, whereas the case γ = 1 describes efficient representations. A well defined thermodynamic limit exists for all values of γ when the size of the system diverges, with a fixed ratio µ between the sizes of the environment and of the sub-system. For all values of γ the model is described by a Gibbs distribution over the states of the sub-system, which corresponds to a generalised REM with stretched exponential distributions. As shown in Ref. [15], this model exhibits a freezing phase transition as the strength ∆ of the interactions in the sub-system varies (see Fig. 1). Yet the nature of the phase transition differs substantially depending on whether γ < 1 or γ > 1 [15]. The regime γ > 1 is characterised by a continuous transition and a disordered region that shrinks as the relative size of the environment increases. For γ < 1, instead, the phase transition is sharp (first order), with a disordered region that gets larger for bigger environments.
Systems with an exponential distribution (i.e. γ = 1) therefore have a very peculiar behaviour, because they are located exactly at the transition between these two regions. The freezing phase transition for γ = 1 occurs at a critical point that is independent of the size of the environment. This is suggestive for OLM, whose internal state should not depend on the degree of detail in the description of the environment. Furthermore, OLM exhibit Zipf's law exactly at the phase transition. This is the only point, in the whole phase diagram, where the (analogue of the) specific heat diverges, as a consequence of the appearance of a broad distribution of u at the point γ = ∆ = 1 within the phase diagram of Fig. 1.

(The entropy here is defined as the logarithm of the number of energy levels at energy E. Zipf's law is the observation that the frequency of the r-th most frequent outcome in a dataset scales as 1/r, or that the number of outcomes observed k times behaves as m_k ∼ k^{−2}.)

Figure
1. Phase diagram of the random optimisation problem pictorially described in the top centre, as a function of the two main parameters, γ and ∆. Here γ controls the statistics of the objective function, with γ = 2 (dashed grey line) and γ = 1 (full grey line) corresponding to the REM and to OLM, respectively. ∆ quantifies the relative strength of the interactions in the sub-system with respect to those with the environment. Different lines correspond to different values of the ratio µ between the size of the environment and the size of the sub-system (from top to bottom for γ > 1). The shaded regions below the lines correspond to the disordered (weak interaction) phase, in the three cases. The phase transitions are continuous for γ > 1 and discontinuous for γ < 1. The point at γ = ∆ = 1 denotes the point where Zipf's law occurs, and is the only point where the (analogue of the) specific heat diverges.

The next section reviews the derivation of the exponential density of states for OLM. The following one introduces the problem and discusses its properties, whereas Section 4 derives the thermodynamic description. General remarks are drawn in the final section.
For completeness, this section provides a self-contained derivation of the exponential density of states for optimal learning machines, in the context of the present paper. Imagine a data generating process p(x⃗), where x⃗ ∈ R^n is a very high-dimensional vector (n ≫ 1). Examples of possible systems are digital pictures, where x⃗ specifies the intensity of the different pixels; the time series of a stock, where each component of x⃗ is the return of the stock on a particular day; the neural activity of a population of neurons in a particular region of the brain in a particular time interval, where x⃗ specifies the activity of each neuron; etc.

We assume that the entropy H[x⃗] = −Σ_x⃗ p(x⃗) log p(x⃗) is proportional to n, so that H[x⃗]/n ≃ h is finite. We also assume that x⃗ satisfies the Asymptotic Equipartition Property (AEP) [16]. This states that points x⃗ drawn from p(x⃗) almost surely belong to the typical set

A_ε = { x⃗ : | −(1/n) log p(x⃗) − h | < ε },   (1)

for any ε > 0, in the limit of large n. This implies that all typical points x⃗ have the same probability p(x⃗) ≃ e^{−nh}, to leading order. Still, p(x⃗) contains information on the statistical dependencies that we aim at representing.

A representation is a function of the data

s : x⃗ → s(x⃗) ∈ S,   (2)

with |S| < +∞. The first requirement of an efficient representation is that, upon conditioning on s, x⃗ should contain only irrelevant details. If this is true, the AEP should apply to data generated from p(x⃗|s), i.e.

−log p(x⃗|s) ≃ u_s ≡ −Σ_x⃗ p(x⃗|s) log p(x⃗|s),   (3)

for all x⃗ ∈ A_ε such that s(x⃗) = s. Notice that Eq.
(3) holds exactly when s(x⃗) provides a complete description of the distribution, as in the case where p(x⃗) = F[s(x⃗)]. The AEP identifies the variable u_s in Eq. (3) as the natural coordinate for distinguishing noise (i.e. irrelevant details) from relevant details. Two points x⃗ and x⃗′ with u_{s(x⃗)} ≠ u_{s(x⃗′)} cannot belong to the same typical set, and hence should differ by relevant details. Instead, if u_{s(x⃗)} = u_{s(x⃗′)}, the difference between the two points can be attributed to noise, even if s(x⃗) ≠ s(x⃗′).

If there are W(u) configurations s of the representation with u_s = u, then the entropy log W(u) measures the amount of information the representation s is unable to untangle. More precisely, of the total information content

H[s] = −Σ_s p(s) log p(s),   (4)

the part

H[s|u] = Σ_s p(s) log W(u_s)   (5)

measures the number of bits that cannot be distinguished from noise. Notice that, for all x⃗ such that s(x⃗) = s, we have p(x⃗) ≃ p(x⃗|s) p(s). Taking the logarithm of this equation yields

p(s) ≡ Σ_{x⃗ : s(x⃗)=s} p(x⃗) = (1/Z) e^{u_s},   (6)

with Z ≃ e^{nh}.

The second requirement of a maximally informative representation is that, for any fixed value of H[s], H[s|u] should be as small as possible, so that the amount

H[u] = H[s] − H[s|u]   (7)

of informative bits is as large as possible. It is easy to see that the minimisation of H[s|u] over W(u), at a fixed value of the entropy H[s], leads to an exponential distribution of u,

W(u) = W_0 e^{−νu},   (8)

where the parameter ν enters as a Lagrange multiplier in the minimisation of H[s|u] − νH[s], to enforce the constraint on H[s].
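A quick numerical sketch (on an arbitrary grid of u values, not taken from the paper) illustrates the role of ν in Eq. (8): combining W(u) = W_0 e^{−νu} with p(s) ∝ e^{u_s} from Eq. (6), the induced distribution p(u) ∝ W(u)e^u is exactly flat at ν = 1, which makes H[u] as large as possible within this family:

```python
import numpy as np

def H(p):
    """Shannon entropy (in nats) of a normalised distribution."""
    return -(p*np.log(p)).sum()

u = np.linspace(0.0, 8.0, 400)          # grid of u values (arbitrary choice)

def p_of_u(nu):
    """p(u) proportional to W(u) e^u with W(u) ~ e^{-nu*u}, cf. Eqs. (6) and (8)."""
    w = np.exp(-nu*u)*np.exp(u)
    return w/w.sum()

p1 = p_of_u(1.0)
print(p1.min(), p1.max())                # essentially flat at nu = 1: both ~ 1/400
print(H(p1), H(p_of_u(0.5)), H(p_of_u(2.0)))   # H[u] is largest at nu = 1
```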
Notice that, when ν = 1, the problem reduces to that of the unconstrained maximisation of H[u], and we recover log W(u) = log W_0 − u. Such a linear behaviour between energy and entropy, as discussed in Refs. [4, 12], corresponds to Zipf's law and to a uniform distribution p(u) = W(u)e^u/Z of u_s. Indeed, the second requirement is analogous to demanding that u_s should have a distribution which is as broad as possible.

3 An ensemble of optimisation problems
Consider a system described by a configuration s = (σ_1, …, σ_n) of n binary (or spin) variables σ_i = ±1. The system is in contact with an environment, whose configuration t = (τ_1, …, τ_m) is specified by m binary (or spin) variables τ_j = ±1. As in Ref. [14], we consider the problem of finding the maximum

(s*, t*) = arg max_{(s,t)} U(s, t)   (9)

of an objective function that can be divided into two parts,

U(s, t) = u_s + v_{t|s}.   (10)

Here u_s depends on the interactions of the variables within the system, and v_{t|s} accounts for the interactions with the environment. The number of states with u_s > u is given by

|{s : u_s > u}| = 2^n e^{−(u/∆)^γ},   u > 0.   (11)

This can be realised by drawing each u_s at random from a stretched exponential distribution, which results in a rough energy landscape, as in the REM [13]. Yet there is no need to assume such a rough energy landscape for the sub-system. For the environment, we assume that v_{t|s} is drawn from a distribution

P{v_{t|s} ≥ x} = e^{−x^γ},   γ > 0,   (13)

independently, for each s and t. Therefore s* depends on the realisation v_{t|s} of the interaction with the environment. For ∆ ≫ 1 we expect the optimisation to depend weakly on the environment, and to be dominated by the term u_s. In this case, s* will likely be one of the few states s with values of u_s close to the maximum u_0 = max_s u_s, i.e. the probability

p(s|u) = P{s* = s | u}   (14)

that s* = s will be dominated by few values of s. Hence, the entropy

H[s] = −Σ_s p(s|u) log p(s|u)   (15)

will be small for ∆ ≫ 1. When ∆ ≪ 1, instead, we expect that the environment v_{t|s} dominates the optimisation, and hence that s* will be broadly distributed over an exponential number of states. This corresponds to an extensive entropy H[s] ∝ n. Our main focus will be on the transition between these two regimes.

Ref. [14] discusses several examples of systems where this generic description may apply.
For example, a protein domain is a sequence s of amino acids that has been optimised, in the course of evolution, for a specific function, e.g. regulating the flux of ions across the cellular membrane. This function depends on the interaction (v_{t|s}) with other molecules in the cell, and on their specific composition t. Each sequence in a protein database can be thought of as a realisation of the optimisation process above, for a different choice of v_{t|s}. Likewise, a word s in a sentence is chosen to best express a concept, depending on the other words t of that sentence.

One way to define a smooth landscape satisfying Eq. (11) is to assume that u_s depends only on the (Hamming) distance d = |s − s_0| from a reference state s_0. It is sufficient to equate the entropy Σ(u) = n[1 − (u/∆)^γ] log 2 to the logarithm of the number C(n,d) of states s at distance d from s_0. This gives

u_s = u_0 [1 + (d/n) log_2(d/n) + (1 − d/n) log_2(1 − d/n)]^{1/γ},   d = |s − s_0|.   (12)

The function u_s defined in this way is smooth, apart from the point s_0, where |u_s − u_{s_0}| ≃ −(d/(γn)) log(d/n) + … has a singular behaviour.

3.1 The Gibbs distribution

As shown in Ref. [14], Extreme Value Theory (EVT) can be invoked to integrate out the degrees of freedom of the environment, by observing that, for m ≫ 1,

max_{(s,t)} U(s, t) = max_s [ u_s + max_t v_{t|s} ]   (16)
                    ≅ max_s [ u_s + a_m + η_s/β_m ],   (17)

where η_s is a random variable which follows a Gumbel distribution P{η_s ≤ x} = e^{−e^{−x}}, and

a_m = (m log 2)^{1/γ}   and   β_m = γ (m log 2)^{1−1/γ}.   (18)

The knowledge of the distribution of η_s allows us to compute the probability that s* = s, which is the probability that u_s + a_m + η_s/β_m ≥ u_{s′} + a_m + η_{s′}/β_m for all s′ ≠ s. The result reads [14]

p(s|u) = (1/Z) e^{β_m u_s},   Z = Σ_s e^{β_m u_s},   (19)

which is a Gibbs distribution with inverse temperature β_m.
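The EVT step in Eqs. (16)–(18) is easy to verify numerically in the case γ = 1, where a_m = m log 2 and β_m = 1 (the environment size, number of repetitions and seed below are arbitrary choices): the maximum of 2^m i.i.d. exponential draws, recentred by a_m, should be approximately Gumbel distributed, with mean equal to the Euler–Mascheroni constant ≈ 0.5772 and variance π²/6 ≈ 1.6449:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 16                                   # environment size (arbitrary)
M = 2**m                                 # number of environment configurations t
a_m = m*np.log(2)                        # EVT location a_m = (m log 2)^{1/gamma} at gamma = 1
maxima = np.array([rng.exponential(size=M).max() for _ in range(500)])
eta = maxima - a_m                       # beta_m = 1 for gamma = 1, so no rescaling needed
# eta should be approximately Gumbel: mean ~ 0.5772, variance ~ pi^2/6
print(eta.mean(), eta.var())
```

For general γ the same construction gives the location a_m = (m log 2)^{1/γ} and scale 1/β_m with β_m = γ(m log 2)^{1−1/γ} of Eq. (18).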
Note that, for γ > 1, β_m → ∞ as m → ∞, so the entropy H[s] is expected to decrease as the size of the environment increases. On the contrary, β_m → 0 for γ < 1, which means that larger and larger environments make the sub-system's behaviour less predictable. For γ = 1, instead, β_m = 1, i.e. the distribution of s is independent of the size of the environment. In this case, Eq. (19) coincides with Eq. (6). Note also that the parameter ν discussed in Section 2 is given by ν = 1/∆.

(The EVT result above is asymptotic, but it is derived taking the maximum over 2^m random variables v_{t|s}, which is an astronomically large number for m ≫ 1.)

Can the function u_s be learned from a series of experiments, when it is not known in advance? Let p(s) be the distribution that encodes the current state of knowledge about the system. For an extensive quantity q_s ∝ n, it is possible to compute its distribution

p(q) = Σ_s p(s) δ(q − q_s).

If q_s is a self-averaging quantity, we expect its distribution to be sharply peaked around a typical value q_typ = ⟨q⟩. Imagine running an experiment where the value q_exp is measured. If q_exp ≈ q_typ within experimental errors, then the current theory is confirmed; otherwise it has to be revised. In the latter case, the standard recipe to update the theory is given by Large Deviation Theory [17]. This maintains that the new distribution should be such that ⟨q⟩_new = q_exp, without assuming anything else. More precisely, the amount of information that the measurement gives on the state s is given by the mutual information I(s, q) = D_KL(p_new || p). Hence, p_new should be the distribution with ⟨q⟩_new = q_exp for which D_KL(p_new || p) is minimal. The distribution that satisfies this requirement is

p_new(s) = (1/Z(g)) p(s) e^{g q_s},   Z(g) = ∫ dq p(q) e^{gq},   (20)

where g is adjusted in such a way as to satisfy ⟨q⟩_new = q_exp.
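The update in Eq. (20) is an exponential tilting of p(s). A minimal sketch (with a made-up observable q_s and target q_exp, purely for illustration) finds g by bisection, exploiting the fact that ⟨q⟩_new is monotonically increasing in g:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=1000)          # hypothetical observable q_s over 1000 states
p = np.full(1000, 1e-3)            # current belief p(s): uniform

def tilted_mean(g):
    """<q> under p_new(s) proportional to p(s) exp(g q_s), Eq. (20)."""
    w = p*np.exp(g*q)
    return (w @ q)/w.sum()

q_exp = 0.8                        # measured value, different from the current <q> ~ 0
lo, hi = -20.0, 20.0               # bisection works because tilted_mean is monotone in g
for _ in range(100):
    mid = 0.5*(lo + hi)
    lo, hi = (mid, hi) if tilted_mean(mid) < q_exp else (lo, mid)
g = 0.5*(lo + hi)
print(g, tilted_mean(g))           # the tilted mean matches q_exp
```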
This process can be continued with additional measurements of different observables q′_s, q″_s, …, and, in principle, it leads to inferring

β_m u_s = g q_s + g′ q′_s + g″ q″_s + …   (21)

to the desired accuracy from a series of experiments.

This recipe, however, only works for quantities for which the distribution p(q) falls off faster than exponentially as q → ±∞, which corresponds to γ ≥ 1. If −log p(q) ≃ c|q|^γ for |q| → ∞ with γ < 1, then the integral defining Z(g) in Eq. (20) is not defined. There is no well defined way to incorporate the observation q_exp ≠ q_typ into the distribution p(s) and to update our state of knowledge in this case. In this sense, γ = 1 separates the region of learnable systems (γ ≥ 1) from that (γ < 1) of systems for which u_s cannot be learned through a series of experiments.

The thermodynamic limit is defined as the limit n, m → ∞ with µ = m/n finite. The largest value of u_s is of the order u_0 = ∆(n log 2)^{1/γ}, so that

β_m u_s = n (log 2) γ µ^{1−1/γ} ν_s   (22)

is extensive when the intensive variable ν_s = ∆ u_s/u_0 varies in the interval [0, ∆]. Likewise, the free energy −log Z is also extensive. Hence the model of Eq. (19), with u_s drawn from the stretched exponential distribution of Eq. (11), coincides with a generalised REM, which has been discussed in Ref. [15]. This section re-derives and discusses its properties in the present setting. We refer to the appendix for detailed calculations and discuss the main results here.

Fig. 2 (left) shows the entropy density Σ(u)/n as a function of u/u_0. This is the logarithm of the number of states at a given value of u, divided by n. For γ > 1 this is a concave function, so the thermodynamics can be computed in the usual manner. For a given value of β_m, the partition function Z is dominated by the point where Σ(u) is tangent to the line of slope −β_m (dashed lines). Notice that, by Eq. (22), β_m is controlled by µ.
As long as

m/n = µ < µ_c = ∆^{−γ/(γ−1)},   (γ > 1)   (23)

Z is dominated by an intermediate point u* ∈ [0, u_0), for which an exponential (in n) number of states contribute to the sum in Z. Accordingly, the entropy

H[s] ≡ Σ(u*) = n (1 − µ/µ_c) log 2   (24)

is extensive, and it vanishes linearly with µ/µ_c as µ → µ_c^− (see Fig. 2, right), for all values of γ > 1. As a consequence, for µ < µ_c, the probability p(s|u) is exponentially small in n, for all s, including s_0.

(As observed in [16], a distribution that would reproduce q_exp is p_new(s) = (1 − ε)p(s) + ε p_1(s), for any p_1(s) such that Σ_s p_1(s) q_s = q_typ + (q_exp − q_typ)/ε. A possible interpretation is that, a priori, if q_exp ≠ q_typ then with probability 1 − ε we should discard the observation q_exp and keep the old theory p, and with probability ε, instead, we should discard p(s) altogether and take p_1(s) as our new theory. In the first case we don't learn anything; in the second, the current state of knowledge is wiped out altogether. Notice that if it were possible to measure q̃_s = q_s^α instead of q_s, for a small enough α the distribution of q̃ may fall off sufficiently fast, thus leading us back to the case γ ≥ 1.)

(The existence of the thermodynamic limit relies on the choice of the same distribution for both u_s and v_{t|s}. Under a different distribution, the thermodynamic limit would require a specific scaling of v_{t|s} with m.)

Figure
2. (Left) Logarithm of the number of states at a given value of u, as a function of u/u_0, for γ = 2, 1 and 1/2 (from top to bottom). The red dashed lines highlight the construction which determines the point that dominates the partition function Z. (Right) Phase transition in the entropy H[s]/n as a function of µ/µ_c.

For µ > µ_c the slope β_m is larger than that of the curve Σ(u) at u_0, hence Z is dominated by states with u_s ≃ u_0. Accordingly, the probability p(s|u) is finite, as is the entropy H[s]. The phase diagram in the (µ, ∆) plane is shown in Fig. 3 (left). In summary, the typical behaviour of the REM holds in the whole region γ > 1.

Figure
3. Phase diagram in the (∆, µ) plane for γ > 1 (left) and γ < 1 (right). Two values of γ are shown in each case. The shaded region corresponds to the disordered phase, where the entropy H[s] is extensive.

For γ < 1, instead, Σ(u) is a convex function of u, and the construction above fails to work. For β_m small enough, the partition function is dominated by the point u = 0, whereas for large β_m it is dominated by states with u_s ≈ u_0. As a result, the entropy is

H[s] = n log 2   for   m/n = µ > µ_c = (γ∆)^{γ/(1−γ)},   (γ < 1)   (25)

whereas H[s]/n → 0 as n → ∞ for µ < µ_c. The transition between the two regimes is discontinuous, as shown in Fig. 2 (right). Notice that, since β_m is a decreasing function of µ for γ < 1 (see Eq. 22), the transition is also reversed.

The case γ = 1 is discussed in the appendix. The phase transition occurs at the point ∆_c = 1 for all values of µ. As shown in Fig. 4 (left), the entropy decreases sharply from H[s] ≃ n log 2 to a finite value. At the transition, the distribution of u extends across the whole range [0, n log 2], which is signalled by the divergence of the (analogue of the) specific heat

C_v = ⟨(u_s − ⟨u_s⟩)²⟩,   (26)

Figure
4. Phase transition at ∆ = 1 for γ = 1. Left: behaviour of the entropy H[s]/n for different values of n, as a function of ∆. Centre: behaviour of C_v (see text) across the transition (note the log-scale on the y-axis). Right: H[u] vs ∆ for the same values of n.

as shown in Fig. 4 (centre). This divergence is usually taken as a signature of a second order phase transition. In the ensemble of systems discussed here, it occurs only at γ = 1. Finally, Fig. 4 (right) shows the behaviour of the entropy H[u] of the random variable u, which in an efficient representation is taken as a measure of the amount of useful information. In an infinite system H[u] ≃ −log|∆ − 1| diverges at ∆ = 1, whereas for a finite system it reaches its maximum H[u] ≃ log(n log 2) + 1 at ∆ = 1.

We remark that the thermodynamic description discussed above holds for any system for which the number of energy levels at energy E is given by W(E) = e^{n[log 2 − (−E/n)^γ]} for E ≤ 0, irrespective of the relation between the energy E_s and the configuration s of the system. The case where the 2^n energy levels are drawn at random, independently, from the same distribution P{E_s ≤ nx} = e^{−n(−x)^γ}, for x < 0, provides a particular (ensemble of) realisation(s) of this system. Yet this is not the only way in which a function E_s with a given number W(E) = |{s : E_s = E}| of states at energy E can be realised. In particular, energy landscapes where E_s is drawn independently for each s are not ideal paradigms for learning machines. First, because we expect some sort of continuity in the representation, so that similar objects s and s′ have similar energies E_s ≈ E_{s′}. Second, because random landscapes are characterised by an extremely slow dynamics [18]. Hence, a smooth energy landscape is a desirable property of OLM, both because of the continuity of the representation and because of the dynamical accessibility of the equilibrium state Eq. (19).
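The γ = 1 transition at ∆ = 1 can be reproduced by a direct finite-n sampling of the generalised REM (the size n = 18, the ∆ grid and the seed are arbitrary choices): draw 2^n energy levels u_s from an exponential distribution with mean ∆, form the Gibbs weights of Eq. (19) with β_m = 1, and track H[s] and the specific-heat analogue C_v of Eq. (26):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 18
N = 2**n                     # number of states of the sub-system
ntilde = n*np.log(2)

def gibbs_stats(delta):
    u = rng.exponential(scale=delta, size=N)   # gamma = 1 levels, cf. Eq. (11)
    w = np.exp(u - u.max())                    # Gibbs weights at beta_m = 1
    p = w/w.sum()
    H = -(p*np.log(p)).sum()                   # entropy H[s]
    u_mean = p @ u
    Cv = p @ (u - u_mean)**2                   # Eq. (26)
    return H, Cv

deltas = np.linspace(0.5, 2.0, 16)
H, Cv = np.array([gibbs_stats(d) for d in deltas]).T
print(H[0]/ntilde, H[-1]/ntilde)   # near 1 in the disordered phase, small beyond it
print(deltas[Cv.argmax()])         # C_v has its maximum near the transition
```

At these small sizes the maximum of C_v is broad and somewhat displaced from ∆ = 1; it sharpens as n grows, consistently with the finite-n curves of Fig. 4.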
Figure 1 puts on the same phase diagram systems with very different statistical properties. The right side (γ > 1) describes REM-like behaviour, typical of disordered systems in physics. The left side (γ < 1) describes unlearnable systems with a first order phase transition. Optimal learning machines, which are characterised by γ = 1, sit exactly at the boundary between these two regimes.

This lends itself to a number of interesting, though speculative, comments. First, among the systems studied in this paper, OLM have the widest variation of thermodynamically accessible energy levels u. Indeed, the range of energies is given by u_0 = ∆(n log 2)^{1/γ}, which is a decreasing function of γ. Yet, for γ < 1 only u_s = 0 and u_s = u_0 are thermodynamically accessible, so the range of thermodynamically accessible values of u is maximal for γ = 1. This is consistent with the fact that the energy is the natural coordinate in learning, because it corresponds to the coding cost −log p(s). Maximally informative representations use the energy spectrum as efficiently as possible [3].

It is interesting to relate the phase transition for γ = 1 to the trade-off between resolution H[s] and noise H[s|u] discussed in Ref. [3] (see also Section 2). As ∆ varies, H[s|u] varies linearly with H[s], where the slope ν = 1/∆ is related to the Lagrange multiplier that is used to enforce the constraint on H[s] in the minimisation of H[s|u] [3]. This means that when H[s] is reduced by one bit, the noise is reduced by 1/∆ bits. Therefore the region ∆ < 1 describes noisy representations, and corresponds to values of H[s] larger than the value H_c for which ∆ = 1. The region H[s] < H_c corresponds to ∆ > 1. In this region, reductions in the resolution come at the expense of a loss of information on the generative model.
In supervised learning, it is reasonable to surmise that compression for ∆ > 1 occurs at the expense of details of the generative model that are irrelevant with respect to the specific input-output task that the machine is learning. Hence the representation depends significantly on the output. Conversely, for ∆ < 1 we expect that the representation depends mostly on the input and only weakly on the output. This leads to the conjecture that maximally informative representations have a universal nature for ∆ ≤ 1: they depend mostly on the input data, and are largely independent of the specific input-output relation that the machine is learning. In this picture, the phase transition at ∆ = 1 marks the point where the ergodicity in the space of representations (and the symmetry with respect to different outputs) gets (spontaneously) broken. This conjecture can in principle be disproved or confirmed by further research on specific architectures.

(As an analogy, the critical temperature in a ferromagnetic Ising model marks the point where the response to a small external magnetic field changes dramatically. In the paramagnetic region the response is continuous, whereas in the ferromagnetic phase it is discontinuous. A possible way to confirm this conjecture might be to probe the response of maximally informative representations to changes in the output, at different values of H[s]. The change should be small and "continuous" in the ∆ < 1 phase and sharp in the ∆ > 1 phase.)

We have also seen that systems with γ < 1 cannot be learned from a series of experiments, and OLM sit exactly at the boundary between learnable and unlearnable systems. In order to appreciate the possible significance of this observation, let us consider a larger system

U(s, t, z, …) = u_s + v_{t|s} + y_{z|t,s} + …   (27)

with q additional variables z = (ζ_1, …, ζ_q). As in Ref. [14], the different terms in Eq. (27) can be defined as

u_s = E[U(s, t, z, …) | s]   (28)
v_{t|s} = E[U(s, t, z, …) − u_s | s, t]   (29)
y_{z|t,s} = E[U(s, t, z, …) − u_s − v_{t|s} | s, t, z]   (30)
…   (31)

with E[U|x] representing the expected value over the distribution of U at given x, i.e. the best estimate of the objective function when the variable x is fixed. Let us also assume that v_{t|s} and y_{z|t,s} are drawn independently from stretched exponential distributions with exponents γ_v and γ_y, respectively.
In the limit where q ∝ m ∝ n ≫ 1, the derivation of Section 3.1 shows that the statistics of the variable

s* = arg max_s { u_s + max_t [ v_{t|s} + max_z ( y_{z|t,s} + … ) ] }   (32)

still follows the Gibbs distribution of Eq. (19), but the value of β is dominated by the variables t if γ_v < γ_y, and by the variables z otherwise. (Note that the decomposition in Eq. (27) is not unique, since one could as well define U(s, t, z, …) = u_s + w_{z|s} + x_{t|z,s} + …. Hence, without loss of generality, one can focus on the decomposition for which γ_v ≤ γ_y ≤ ….) Therefore, the most relevant set of variables are those with the smallest value of γ. In this sense, systems with γ = 1 are characterised
by the most relevant variables that can be implemented in a physically accessible system. This also offers a guideline for finding relevant variables in high-dimensional data, as those for which the sample exhibits statistical criticality (see Refs. [19, 20] for attempts in this direction). Furthermore, for γ_v = γ_y = γ = 1 one recovers Eq. (19) with β_m = 1. In words, the behaviour of OLM is invariant if further details are added to the problem, which is a desirable property of efficient representations. For example, the classification of a dataset of images should be invariant with respect to the resolution of the images, beyond a certain level.

A further unique property of systems with γ = 1 is that the system can, in principle, be further decomposed into sub-systems with the same properties. More precisely, one can find variables p = (π_1, …, π_l) and r = (ρ_1, …, ρ_{n−l}) such that s = (p, r) and u_s = w_p + z_{r|p}, with w_p and z_{r|p} again having distributions that asymptotically behave as an exponential. In particular, critical systems with ∆ = 1 admit sub-systems that are also "poised" at the critical point ∆ = 1. It is tempting to regard this remarkable self-similarity as a distinguishing feature of living systems. For example, both the abundance of metabolites [21] and gene expression levels [22] inside cells have been reported to obey Zipf's law.

On the contrary, systems with γ > 1 exhibit a behaviour which is more and more predictable the smaller the number n of variables is (i.e. for large µ). Within the simple class of models discussed here, the possibility to describe a complex system in terms of few variables (which was regarded as a wonderful gift by Wigner [23]) emerges as a typical property of physical systems with γ > 1.

Acknowledgements
Interesting discussions and useful comments with J. Barbier, J.-P. Bouchaud and S. Franz are gratefully acknowledged.
A The Statistical mechanics approach for $\gamma \neq 1$

The maximum value of $u_s$, from EVT, is given by
$$ \max_s u_s = \Delta\, \tilde n^{1/\gamma} \left[ 1 + \xi/(\gamma \tilde n) \right], \qquad (33) $$
where $\xi$ is a random variable drawn from a Gumbel distribution. Here and in what follows we introduce the shorthand $\tilde n = n \log 2$. Therefore, neglecting $1/\tilde n$ corrections, we introduce the intensive variable $\nu_s$ by
$$ u_s = \tilde n^{1/\gamma} \nu_s, \qquad \nu_s \in [0, \Delta]. \qquad (34) $$
We focus on the case where the size of the heat bath $m = \mu n$ is proportional to $n$. Then
$$ \beta_m u_s = \tilde n\, \gamma \mu^{1 - 1/\gamma}\, \nu_s \qquad (35) $$
is extensive, and the number of configurations with $u_s \ge \tilde n^{1/\gamma} \nu$ is
$$ 2^n\, P\{u_s \ge \tilde n^{1/\gamma} \nu\} = e^{\tilde n [1 - (\nu/\Delta)^{\gamma}]}. \qquad (36) $$

(This was regarded as a wonderful gift by Wigner [23].)

In the annealed approximation, we can compute the partition function as
$$ Z \simeq \frac{\gamma \tilde n}{\Delta^{\gamma}} \int_0^{\Delta} d\nu\, \nu^{\gamma - 1}\, e^{\tilde n f(\nu)}, \qquad (37) $$
$$ f(\nu) = 1 - (\nu/\Delta)^{\gamma} + \gamma \mu^{1 - 1/\gamma} \nu. \qquad (38) $$
For $\gamma > 1$ the free energy $f(\nu)$ is a concave function and $Z$ can be computed by saddle point. The saddle point value reads
$$ \nu^* = \Delta^{\gamma/(\gamma - 1)}\, \mu^{1/\gamma}. \qquad (39) $$
As long as $\nu^* < \Delta$, the annealed approximation is valid. This holds as long as
$$ \frac{m}{n} = \mu < \mu_c = \Delta^{-\gamma/(\gamma - 1)}. \qquad (40) $$
The saddle point calculation yields
$$ Z \simeq \sqrt{\frac{2\pi \gamma \mu \tilde n}{(\gamma - 1)\mu_c}}\; e^{\tilde n [1 + (\gamma - 1)\mu/\mu_c]}. \qquad (41) $$
This allows us to compute the entropy for $\mu < \mu_c$, which is given by
$$ H[s] = \log Z - \beta_m \langle u_s \rangle \qquad (42) $$
$$ \simeq \tilde n (1 - \mu/\mu_c), \qquad (43) $$
which vanishes as $\mu \to \mu_c^-$.

The saddle point approximation cannot be used for $\gamma < 1$ because the function $f(\nu)$ is convex. Indeed, the integral in Eq. (37) is dominated either by the point $\nu = 0$ or by the point $\nu = \Delta$. As long as $f(0) > f(\Delta)$ the former dominates, and we have
$$ Z \simeq e^{\tilde n} \left( 1 + O(\tilde n^{-1/\gamma}) \right). \qquad (44) $$
The condition $f(0) > f(\Delta)$ is equivalent to
$$ \frac{m}{n} = \mu > \mu_c = (\gamma \Delta)^{\gamma/(1 - \gamma)}. \qquad (45) $$
As long as this condition is satisfied, the entropy $H[s] = \tilde n$ is asymptotically the same as the entropy of a flat distribution over the states $s$.
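The saddle-point analysis above is easy to check numerically. The sketch below (my own illustration, not code from the paper; the function names are invented) verifies that, for $\gamma > 1$, the analytic saddle point of Eq. (39) maximises the free energy $f(\nu)$ of Eq. (38) on $[0, \Delta]$, and that it reaches the band edge $\nu = \Delta$ exactly at $\mu = \mu_c$ of Eq. (40).

```python
# Sanity check of the annealed saddle point for gamma > 1 (Eqs. 38-40).
# f(nu) = 1 - (nu/Delta)**gamma + gamma * mu**(1 - 1/gamma) * nu is concave,
# so its maximum on [0, Delta] can be located by brute force and compared
# with the analytic saddle point nu* = Delta**(gamma/(gamma-1)) * mu**(1/gamma).

def f(nu, gamma, Delta, mu):
    """Annealed free energy density, Eq. (38)."""
    return 1.0 - (nu / Delta) ** gamma + gamma * mu ** (1.0 - 1.0 / gamma) * nu

def nu_star(gamma, Delta, mu):
    """Saddle point, Eq. (39)."""
    return Delta ** (gamma / (gamma - 1.0)) * mu ** (1.0 / gamma)

def mu_c(gamma, Delta):
    """Critical bath-to-system size ratio, Eq. (40)."""
    return Delta ** (-gamma / (gamma - 1.0))

gamma, Delta = 2.0, 0.8
mu = 0.5 * mu_c(gamma, Delta)            # inside the annealed phase
ns = nu_star(gamma, Delta, mu)
assert 0.0 < ns < Delta                   # saddle point lies inside [0, Delta]

# brute-force maximisation on a fine grid agrees with the analytic result
grid = [i * Delta / 10**5 for i in range(10**5 + 1)]
nu_best = max(grid, key=lambda nu: f(nu, gamma, Delta, mu))
assert abs(nu_best - ns) < 1e-4

# at mu = mu_c the saddle point hits the band edge nu = Delta
assert abs(nu_star(gamma, Delta, mu_c(gamma, Delta)) - Delta) < 1e-9
```

Running the same brute-force comparison with $\mu > \mu_c$ pins the grid maximum to the boundary $\nu = \Delta$, which is the freezing described in the text.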
When $\mu < \mu_c$ the annealed approximation ceases to be valid, because the partition function is dominated by few states. When $Z$ is dominated by the point $\nu = \Delta$, i.e. for $\mu < \mu_c$, the annealed partition function can be estimated with the change of variables $\nu = \Delta - z/\tilde n$, so that the free energy becomes
$$ \tilde n f \simeq \gamma \mu^{1 - 1/\gamma} \Delta \tilde n - \left[ (\mu/\mu_c)^{1 - 1/\gamma} - \gamma \right] \frac{z}{\Delta} + \ldots $$
and the integral yields
$$ Z = \frac{\gamma}{(\mu/\mu_c)^{1 - 1/\gamma} - \gamma}\; e^{\tilde n \gamma \mu^{1 - 1/\gamma} \Delta}, \qquad (46) $$
which suggests that the probability of the states with $\beta_m u_s = \tilde n \gamma \mu^{1 - 1/\gamma} \Delta$ is of order one:
$$ P\{s^* = s\} \simeq \frac{(\mu/\mu_c)^{1 - 1/\gamma} - \gamma}{\gamma}. \qquad (47) $$
Notice that this does not vanish as $\mu \to \mu_c^-$, which is a further signature of a first order phase transition.

B The case $\gamma = 1$

In the case $\gamma = 1$ we can resort to a simple approximation, assuming that the spectrum of possible values of $u$ is limited to the range $[0, u_0]$, with $u_0 = \Delta n \log 2$, and that the number of energy levels in the interval $[u, u + du)$ is given by
$$ N(u)\, du \simeq \frac{2^n}{\Delta}\, e^{-u/\Delta}\, \theta(u_0 - u)\, du. \qquad (48) $$
We can obtain most quantities of interest from the partition function
$$ Z(\lambda) = \sum_s e^{\lambda u_s} = \int_0^{u_0} N(u)\, e^{\lambda u}\, du. \qquad (49) $$
Indeed, $Z(1)$ yields the normalisation of the distribution over $s$, and the derivatives of $\log Z(\lambda)$ with respect to $\lambda$, computed at $\lambda = 1$, yield the moments of the distribution of $u_s$. Also, the entropy reads
$$ H[s] = -\sum_s p_s \log p_s = \log Z(1) - \left. \frac{\partial}{\partial \lambda} \log Z(\lambda) \right|_{\lambda = 1}. \qquad (50) $$
Within the approximation above, we find
$$ Z(\lambda) \simeq \frac{2^n}{\Delta} \int_0^{u_0} e^{-(1/\Delta - \lambda) u}\, du = 2^n\, \frac{1 - 2^{-n(1 - \lambda \Delta)}}{1 - \lambda \Delta}. \qquad (51) $$
The expected value of $u$ reads
$$ \langle u \rangle = \left. \frac{\partial}{\partial \lambda} \log Z(\lambda) \right|_{\lambda = 1} = \left[ \frac{1}{1 - e^{-\chi}} - \frac{1}{\chi} \right] u_0, \qquad \chi = (\Delta - 1)\, n \log 2, \qquad (52) $$
where we have introduced the scaling variable $\chi$. For $\Delta < 1$ the leading behaviour for $n \to \infty$ is obtained for $-\chi \gg 1$, whereas for $\Delta > 1$ it is obtained for $\chi \gg 1$.
Hence
$$ \langle u \rangle \simeq \frac{\Delta}{1 - \Delta} + \theta(\Delta - 1)\, u_0, \qquad (53) $$
where $\theta(x) = 1$ for $x \ge 0$ and $\theta(x) = 0$ otherwise is the Heaviside function. The specific heat is given by
$$ C_v = u_0^2 \left[ \frac{1}{\chi^2} - \frac{e^{\chi}}{(e^{\chi} - 1)^2} \right]. \qquad (54) $$
The entropy reads
$$ H[s] = \frac{n \log 2 + \chi}{1 - e^{\chi}} + \frac{u_0}{\chi} + \log(n \log 2) + \log \frac{1 - e^{-\chi}}{\chi} \qquad (55) $$
$$ \simeq \frac{\Delta}{\Delta - 1} - \log|\Delta - 1| + \theta(1 - \Delta)\, n \log 2 + \ldots \qquad (56) $$
It is also possible to compute the entropy of the variable $u$,
$$ H[u] \simeq \log u_0 + \log \frac{1 - e^{-\chi}}{\chi} + 1 - \frac{\chi}{e^{\chi} - 1}. \qquad (57) $$
For $\Delta \neq 1$ and $n \to \infty$ one finds $H[u] \to \log \frac{\Delta}{|\Delta - 1|} + 1$, whereas at $\Delta = 1$ one finds $H[u] \simeq \log(n \log 2) + 1$.

B.1 A refined approach

The approach discussed so far relies on the annealed approximation for the partition function. This approximation is accurate in the disordered phase, but it does not work in the frozen phase. Indeed, for $\Delta > \Delta_c$ the partition function is dominated by few states and it is not self-averaging. The probability $p(s|u)$ is a function of $u = \{u_s\}$ and, as such, it attains different values depending on the realisation of $u$. As a result, the entropy $H[s]$ is also a random variable. In order to appreciate this effect, we compute the function
$$ \Omega(\lambda) = \left\langle p(s|u)^{\lambda} \right\rangle_{s,u} = \sum_s \int_0^{\infty} \frac{du}{\Delta}\, e^{-u/\Delta} \left\langle p(s|u)^{\lambda} \right\rangle_{u_{-s}|u_s = u}, \qquad (58) $$
where $\langle \ldots \rangle_{s,u}$ stands for the average over $s$ and $u$, whereas $\langle \ldots \rangle_{u_{-s}|u_s = u}$ denotes the average over all values of $u_{s'}$ for $s' \neq s$, with $u_s = u$ held fixed. Now,
$$ \left\langle p(s|u)^{\lambda} \right\rangle_{u_{-s}|u_s = u} = e^{(1+\lambda)u} \left\langle \Bigl( \sum_{s'} e^{u_{s'}} \Bigr)^{-1-\lambda} \right\rangle \qquad (59) $$
$$ = \frac{e^{(1+\lambda)u}}{\Gamma(1+\lambda)} \int_0^{\infty} dt\, t^{\lambda}\, e^{-t e^{u}} \prod_{s' \neq s} \left\langle e^{-t e^{u_{s'}}} \right\rangle_{u_{s'}} \qquad (60) $$
$$ = \frac{e^{(1+\lambda)u}}{\Gamma(1+\lambda)} \int_0^{\infty} dt\, t^{\lambda}\, e^{-t e^{u}} \left\langle e^{-t e^{u'}} \right\rangle_{u'}^{N-1}, \qquad (61) $$
with $N = 2^n \gg 1$.
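Before moving on with the refined calculation, the closed form for $\langle u \rangle$ in Eq. (52) and its limits in Eq. (53) can be validated numerically. The sketch below (my own check, not from the paper; the helper names are invented) compares the exact expression against a midpoint-rule quadrature of the tilted level density $N(u)\, e^{u} \propto e^{(1 - 1/\Delta)u}$ on $[0, u_0]$.

```python
# Numerical check of Eq. (52): with level density N(u) ~ exp(-u/Delta) on
# [0, u0] and Gibbs weight exp(u), the mean energy is
#   <u> = u0 * ( 1/(1 - exp(-chi)) - 1/chi ),  chi = (Delta - 1) * n * log 2.
import math

def mean_u_exact(Delta, n):
    """Closed form of Eq. (52)."""
    u0 = Delta * n * math.log(2.0)
    chi = (Delta - 1.0) * n * math.log(2.0)
    return u0 * (1.0 / (1.0 - math.exp(-chi)) - 1.0 / chi)

def mean_u_quadrature(Delta, n, bins=200_000):
    """Midpoint-rule average of u under p(u) ~ exp((1 - 1/Delta) * u)."""
    u0 = Delta * n * math.log(2.0)
    b = 1.0 - 1.0 / Delta
    num = den = 0.0
    for i in range(bins):
        u = (i + 0.5) * u0 / bins
        w = math.exp(b * u - abs(b) * u0)   # common rescaling avoids overflow
        num += u * w
        den += w
    return num / den

for Delta in (0.5, 2.0):                    # below and above the transition
    exact = mean_u_exact(Delta, n=20)
    numeric = mean_u_quadrature(Delta, n=20)
    assert abs(exact - numeric) / exact < 1e-3
```

For $\Delta = 0.5$ the result stays of order $\Delta/(1-\Delta)$, while for $\Delta = 2$ it is pushed up to $u_0 + \Delta/(1-\Delta)$, in line with the two branches of Eq. (53).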
The term $\langle e^{-te^{u'}} \rangle_{u'}^{N-1}$ is vanishingly small unless $t \ll 1$. Hence, for $\Delta > 1$ and $t \ll 1$ we can write
$$ \left\langle e^{-te^{u'}} \right\rangle = \frac{t^{1/\Delta}}{\Delta}\, \Gamma(-1/\Delta, t) \qquad (62) $$
$$ \cong 1 - \Gamma(1 - 1/\Delta)\, t^{1/\Delta} + \ldots \qquad (63) $$
Anticipating that $p(s|u)$ is non-negligible for values of $u_s = \Delta \log N + x$, we compute
$$ \left\langle p(s|u)^{\lambda} \right\rangle_{u_{-s}|u_s = \Delta \log N + x} \cong \frac{\left( N^{\Delta} e^{x} \right)^{1+\lambda}}{\Gamma(1+\lambda)} \int_0^{\infty} dt\, t^{\lambda}\, e^{-N^{\Delta} t e^{x} + N \log[1 - t^{1/\Delta} \Gamma(1 - 1/\Delta, t)]} $$
$$ = \frac{1}{\Gamma(1+\lambda)} \int_0^{\infty} dz\, z^{\lambda}\, e^{-z + N \log[1 - N^{-1} e^{-x/\Delta} z^{1/\Delta} \Gamma(1 - \Delta^{-1}, N^{-\Delta} e^{-x} z)]} $$
$$ \to \frac{1}{\Gamma(1+\lambda)} \int_0^{\infty} dz\, z^{\lambda}\, e^{-z - \Gamma(1 - 1/\Delta)\, e^{-x/\Delta} z^{1/\Delta}}, \quad \text{as } n \to \infty, $$
where we set $z = N^{\Delta} e^{x} t$ in the second line and took the limit $n, N \to \infty$. Inserting this in Eq. (58), with the change of variables $x = u - \Delta \log N$, we observe that $p(u) \to e^{-x/\Delta}/N$, with a factor $1/N$ that cancels the sum over $s$. Therefore, with $y = x/\Delta$, we find
$$ \Omega(\lambda) = \frac{1}{\Gamma(1+\lambda)} \int_{-\infty}^{\infty} dy\, e^{-y} \int_0^{\infty} dz\, z^{\lambda}\, e^{-z - \Gamma(1 - 1/\Delta)\, z^{1/\Delta} e^{-y}}. \qquad (64) $$
Shifting $y \to y + \log[\Gamma(1 - 1/\Delta)\, z^{1/\Delta}]$, the integrals separate and one finds
$$ \Omega(\lambda) = \frac{\Gamma(1 + \lambda - 1/\Delta)}{\Gamma(1+\lambda)\, \Gamma(1 - 1/\Delta)}. \qquad (65) $$
Note that $\Omega(0) = 1$, as required by normalisation. The knowledge of $\Omega(\lambda)$ allows us to compute observables in the $\Delta > 1$ region. For example, the probability that two replicas end up in the same state, i.e. that $s^*_1 = s^*_2$, is given by $\Omega(1) = 1 - 1/\Delta$. Likewise, the probability that $q+1$ replicas coincide is
$$ P\{s^*_1 = s^*_2 = \ldots = s^*_{q+1}\} = \Omega(q) = \frac{\Delta - 1}{\Delta\, q!} \prod_{k=2}^{q} \left( k - \frac{1}{\Delta} \right), \qquad (66) $$
which vanishes linearly as $\Delta \to 1^+$, for all $q \ge 1$, and decays as $q^{-1/\Delta}$ for $q \gg 1$.
The expected value of the entropy is given by
$$ \langle H[s] \rangle_u = -\left. \frac{\partial}{\partial \lambda} \log \Omega(\lambda) \right|_{\lambda = 0} \qquad (67) $$
$$ = \psi(1) - \psi(1 - 1/\Delta) \qquad (68) $$
$$ \simeq \begin{cases} \dfrac{\Delta}{\Delta - 1} - \dfrac{\pi^2}{6} (\Delta - 1) + \ldots, & \Delta \to 1^+ \\[2mm] \dfrac{\pi^2}{6\Delta} + \dfrac{1.202}{\Delta^2} + \ldots, & \Delta \to \infty \end{cases} \qquad (69) $$
The leading divergence as $\Delta \to 1^+$ matches the one found within the annealed approximation. Its variance can also be computed:
$$ V[H[s]] = \left. \frac{\partial^2 \Omega}{\partial \lambda^2} \right|_{\lambda = 1} \qquad (70) $$
$$ = \frac{\Delta - 1}{\Delta} \left[ \psi'(2 - 1/\Delta) - \psi'(2) + \left( \psi(2 - 1/\Delta) - \psi(2) \right)^2 \right]. \qquad (71) $$
Interestingly, $V[H[s]] \to 0$ as $\Delta \to 1^+$.

References

[1] Gašper Tkačik and William Bialek. Information processing in living systems.
Annual Review of Condensed Matter Physics, 7(1):89–117, 2016.
[2] Jorge Hidalgo, Jacopo Grilli, Samir Suweis, Miguel A. Muñoz, Jayanth R. Banavar, and Amos Maritan. Information-based fitness and the emergence of criticality in living systems. Proceedings of the National Academy of Sciences, 111(28):10095–10100, 2014.
[3] Ryan John Cubero, Junghyo Jo, Matteo Marsili, Yasser Roudi, and Juyong Song. Statistical criticality arises in most informative representations. Journal of Statistical Mechanics: Theory and Experiment, 2019(6):063402, June 2019.
[4] Thierry Mora and William Bialek. Are biological systems poised at criticality? Journal of Statistical Physics, 144(2):268–302, 2011.
[5] Miguel A. Muñoz. Colloquium: Criticality and dynamical scaling in living systems. Rev. Mod. Phys., 90:031001, July 2018.
[6] George Kingsley Zipf. Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press, 1932.
[7] Gašper Tkačik, Thierry Mora, Olivier Marre, Dario Amodei, Stephanie E. Palmer, Michael J. Berry, and William Bialek. Thermodynamics and signatures of criticality in a network of neurons. Proceedings of the National Academy of Sciences, 112(37):11508–11513, 2015.
[8] Javier D. Burgos and Pedro Moreno-Tovar. Zipf-scaling behavior in the immune system. Biosystems, 39(3):227–232, 1996.
[9] Juyong Song, Matteo Marsili, and Junghyo Jo. Resolution and relevance trade-offs in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2018(12):123406, December 2018.
[10] M. E. Rule, M. Sorbaro, and M. H. Hennig. Optimal encoding in stochastic latent-variable models. ArXiv e-prints, arXiv:1802.10361, February 2018.
[11] Ryan Cubero, Matteo Marsili, and Yasser Roudi. Minimum description length codes are critical. Entropy, 20(10):755, October 2018.
[12] L. Aitchison, N. Corradi, and P. E. Latham. Zipf's law arises naturally when there are underlying, unobserved variables. PLoS Computational Biology, 12:e1005110, December 2016.
[13] B. Derrida. Random-energy model: Limit of a family of disordered models. Phys. Rev. Lett., 45:79–82, July 1980.
[14] Matteo Marsili, Iacopo Mastromatteo, and Yasser Roudi. On sampling and modeling complex systems. Journal of Statistical Mechanics: Theory and Experiment, 2013(09):P09003, 2013.
[15] J. P. Bouchaud and M. Mézard. Universality classes for extreme-value statistics. Journal of Physics A: Mathematical and General, 30(23):7997, 1997.
[16] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[17] E. T. Jaynes. Information theory and statistical mechanics. Phys. Rev., 106:620–630, May 1957.
[18] Gérard Ben Arous, Anton Bovier, and Véronique Gayrard. Aging in the random energy model. Phys. Rev. Lett., 88:087201, February 2002.
[19] R. J. Cubero, M. Marsili, and Y. Roudi. Finding informative neurons in the brain using Multi-Scale Relevance. ArXiv e-prints, February 2018.
[20] Silvia Grigolon, Silvio Franz, and Matteo Marsili. Identifying relevant positions in proteins by critical variable selection. Molecular BioSystems, 12(7):2147–2158, 2016.
[21] Shumpei Sato, Makoto Horikawa, Takeshi Kondo, Tomohito Sato, and Mitsutoshi Setou. A power law distribution of metabolite abundance levels in mice regardless of the time and spatial scale of analysis. Scientific Reports, 8(1):10315, 2018.
[22] Chikara Furusawa and Kunihiko Kaneko. Zipf's law in gene expression. Phys. Rev. Lett., 90:088102, February 2003.
[23] Eugene P. Wigner. The unreasonable effectiveness of mathematics in the natural sciences. Richard Courant Lecture in Mathematical Sciences delivered at New York University, May 11, 1959. Communications on Pure and Applied Mathematics, 13(1):1–14, 1960.