Phase Diagram of Restricted Boltzmann Machines and Generalised Hopfield Networks with Arbitrary Priors
Adriano Barra, Giuseppe Genovese, Peter Sollich, Daniele Tantari
Abstract.
Restricted Boltzmann Machines are described by the Gibbs measure of a bipartite spin glass, which in turn corresponds to the one of a generalised Hopfield network. This equivalence allows us to characterise the state of these systems in terms of retrieval capabilities, both at low and high load. We study the paramagnetic-spin glass and the spin glass-retrieval phase transitions, as the pattern (i.e. weight) distribution and spin (i.e. unit) priors vary smoothly from Gaussian real variables to Boolean discrete variables. Our analysis shows that the presence of a retrieval phase is robust and not peculiar to the standard Hopfield model with Boolean patterns. The retrieval region is larger when the pattern entries and retrieval units get more peaked and, conversely, when the hidden units acquire a broader prior and therefore have a stronger response to high fields. Moreover, at low load retrieval always exists below some critical temperature, for every pattern distribution ranging from the Boolean to the Gaussian case.

Introduction
The genesis of modern AI can be traced quite far back in time. Beyond the pioneering and historical contributions around the beginning of the last century, the most celebrated milestones are the neuron model of McCulloch and Pitts [38], the Rosenblatt perceptron [42], and the Hebb learning rule [30]. The latter was, in turn, exploited by Hopfield many years later to write his celebrated paper on neural networks from the connectionist perspective [32]. There has been a growing stream of studies of neural networks ever since, with the subject attracting the interest of various communities, from biological systems to signal processing and information theory [31, 20, 22, 29]. The physics angle on the topic is mainly represented by the statistical mechanics of spin glasses [39]. In particular, problems of great biological and technological relevance, such as the capability to learn or retrieve memories, find a simple formulation in a genuine statistical mechanics language [32, 4, 5, 31, 20, 22, 43]. However, the models used to implement these two crucial features of neural networks – learning and retrieval – often start from quite different assumptions. For instance, in modern machine learning approaches such as deep learning [35, 29], network weights are normally taken as real, enabling the use of gradient descent for learning and inference. On the other side, the standard theory of pattern retrieval, as exemplified by the Amit-Gutfreund-Sompolinsky analysis of associative neural networks [4, 5], assumes Boolean patterns. Nevertheless, the two most utilised models for machine learning and retrieval, i.e. restricted Boltzmann machines (RBMs) and associative Hopfield networks, are known to be equivalent [10, 15, 36, 34, 23].
Their relation is easily understood from the point of view of bipartite spin glasses: on the one hand the Gibbs measure of such systems is the same as the one of Restricted Boltzmann Machines, on the other hand bipartite spin glasses constitute a class of disordered systems in which the Hopfield model for neural networks can be embedded. For these reasons in this paper we analyse spin glasses defined on a bipartite network. We study the retrieval in these networks while varying both spin/unit priors and pattern/weight distributions continuously between the Boolean and the real Gaussian limits. We show that the presence of a ferromagnetic region of retrieval is not peculiar to the standard Hopfield model, but it occurs also in the case of continuous units and weights when these take the form of a Gaussian “softening” of Boolean variables. Moreover, while retrieval disappears for Gaussian weights at high load, in the low-load limit our generalised Hopfield networks always have a retrieval phase throughout the entire range of pattern distributions ranging from the Boolean to the Gaussian cases. This implies a degree of robustness in the machine-learning set-up, where weights evolve on real axes and one usually works at low load, i.e. with a small number of features, to avoid overfitting [35, 48].
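As a concrete illustration of the equivalence just described, one can verify on a tiny system that integrating out Gaussian hidden units of a bipartite (RBM-like) partition function reproduces the quadratic Hopfield energy. The sketch below is our own check, not code from the paper; for simplicity the coupling is scaled by the number of visible units, and the Gaussian hidden-unit integral is done analytically via $\mathbb{E}\,e^{x\tau} = e^{x^2/2}$:

```python
import math
import itertools
import random

def z_bipartite(xi, beta):
    """Bipartite partition function with binary sigma and Gaussian tau,
    the tau-integral done analytically: E_tau exp(x*tau) = exp(x^2/2)."""
    n, p = len(xi[0]), len(xi)
    total = 0.0
    for sigma in itertools.product([-1, 1], repeat=n):
        log_w = 0.0
        for mu in range(p):
            x = math.sqrt(beta / n) * sum(xi[mu][i] * sigma[i] for i in range(n))
            log_w += 0.5 * x * x          # u(x) = x^2/2 for a Gaussian hidden prior
        total += math.exp(log_w)
    return total / 2 ** n                 # uniform average over {-1,+1}^n

def z_hopfield(xi, beta):
    """Hopfield partition function with H = -(N/2) * sum_mu m_mu^2."""
    n, p = len(xi[0]), len(xi)
    total = 0.0
    for sigma in itertools.product([-1, 1], repeat=n):
        m2 = sum((sum(xi[mu][i] * sigma[i] for i in range(n)) / n) ** 2 for mu in range(p))
        total += math.exp(beta * n * m2 / 2)
    return total / 2 ** n

rng = random.Random(1)
xi = [[rng.choice([-1, 1]) for _ in range(6)] for _ in range(3)]
assert abs(z_bipartite(xi, 0.7) - z_hopfield(xi, 0.7)) < 1e-9 * z_hopfield(xi, 0.7)
```

The two partition functions agree configuration by configuration, since $\exp[\tfrac{\beta}{2n}(\sum_i \xi_i\sigma_i)^2] = \exp[\tfrac{\beta n}{2} m^2]$.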
Date: August 1, 2017.

1.1. Generalised Hopfield Models and Restricted Boltzmann Machines.
The Hopfield model introduced in [32] is a celebrated paradigm for neural networks in which the neurons are represented by $N$ spins, taking values $\pm 1$. The energy function of the system is defined in terms of $p$ so-called patterns, denoted by $\xi^\mu$, $\mu = 1, \dots, p$. It is natural to take the patterns to be $N$-dimensional random vectors with independent and identically distributed components, which makes the Hopfield model a spin glass. Given an instance of the patterns, the Hamiltonian and the Gibbs measure of this system are
\[
H_{N,p}(\sigma|\xi) := -\sum_{\mu=1}^{p} \frac{N}{2}\, m_\mu^2\,, \qquad G_{N,p}(\sigma|\xi) := \frac{e^{-\beta H_{N,p}(\sigma|\xi)}}{\mathbb{E}_\sigma\, e^{-\beta H_{N,p}(\sigma|\xi)}}\,, \tag{1.1}
\]
where $\beta > 0$ is the inverse temperature, $\beta = 1/T$, $\mathbb{E}_\sigma$ denotes the statistical expectation with respect to the spin configurations in $\{-1,+1\}^N$, and
\[
m_\mu := \frac{1}{N}\sum_{i=1}^{N} \xi_i^\mu \sigma_i
\]
are the pattern overlaps, or Mattis magnetisations [37]. Intuitively, the spin configurations selected by this Hamiltonian have the best possible overlap with the quenched patterns. In particular, when the Gibbs average of $m_\mu$ is non-zero for some $\mu$ we say that this pattern is being retrieved. For a short but comprehensive summary of the main known results on this model we refer to section II.B of [36]. A generalisation of the Hopfield model is obtained by replacing $m_\mu^2$ in (1.1) with a generic even function $u(m)$:
\[
H_{N,p}(\sigma|\xi) = -\sum_{\mu=1}^{p} u(\sqrt{N}\, m_\mu)\,. \tag{1.2}
\]
It is physically interesting, but not necessary, to consider convex $u$ [42, 24, 25, 31, 22, 46]. Any convex, even and smooth $u$ can be expressed as the cumulant generating function of a sub-Gaussian symmetric probability distribution with unit variance [28]. Interpreting the random variables with this distribution as ancillary spins, we obtain a correspondence between generalised Hopfield models and bipartite spin glass models. The latter are defined as follows: consider a bipartite system, with one part containing $N_1$ spins denoted $\sigma$ and the other $N_2$ spins written as $\tau$.
Also let $N = N_1 + N_2$, $\alpha = N_2/N$ and define the partition function
\[
Z_{N_1,N_2}(\beta;\xi) = \mathbb{E}_{\sigma,\tau} \exp\left(\sqrt{\frac{\beta}{N}} \sum_{i=1}^{N_1}\sum_{\mu=1}^{N_2} \xi_i^\mu \sigma_i \tau_\mu\right). \tag{1.3}
\]
Setting $u(x) = \ln \mathbb{E}_\tau e^{x\tau}$, the cumulant generating function of the random variable $\tau$, and marginalising over all $\tau$, we clearly obtain the partition function of a generalised Hopfield model with interaction $u$, as claimed. Therefore we can think of the $\xi_i^\mu$ as patterns, each entry being independently drawn from $P_\xi(\xi_i^\mu)$. On the other hand, (1.3) can be viewed as a Restricted Boltzmann Machine, where a layer of visible units $\sigma$ interacts with a layer of hidden units $\tau$ through the weights $\xi$. The standard Hopfield model is recovered when the $\xi$ and the $\sigma$ are binary and the $\tau_\mu$ are Gaussian variables, but we study in this paper a much larger class of priors $P_\sigma(\sigma_i)$, $P_\tau(\tau_\mu)$ and $P_\xi(\xi_i^\mu)$. This corresponds in the generalised Hopfield model to varying the pattern distribution, the spin prior and the form of the interaction $u$. Here we investigate the general phase diagram, especially with regards to the existence of a retrieval phase (focusing on single pattern retrieval) and its interplay with the spin glass phase. Similar models of RBMs with generic priors have recently been studied using belief propagation and related methods in [36, 23, 47, 34].

1.2. Model and RS Equations.
We shall use random variables which interpolate between Gaussian and binary distributions. Let $\Omega \in [0,1]$, $g \sim \mathcal{N}(0,1)$ and let $\varepsilon$ be a symmetric random variable taking values $\pm 1$. We define $\zeta$ as
\[
\zeta(\Omega) = \sqrt{\Omega}\, g + \sqrt{1-\Omega}\, \varepsilon
\]
and we denote by $D(\Omega)$ its probability distribution. Of course $\mathbb{E}[\zeta] = 0$ and $\mathbb{E}[\zeta^2] = 1$ for all $\Omega$. Throughout we will draw both the patterns and the spins from $D(\Omega)$, i.e. $\xi_i^\mu \sim D(\Omega_\xi)$, $\sigma_i \sim D(\Omega_\sigma)$ and $\tau_\mu \sim D(\Omega_\tau)$ for $\Omega_\xi, \Omega_\sigma, \Omega_\tau \in [0,1]$. It will be useful to define the shorthand $\delta = \sqrt{1 - \Omega_\xi}$.
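A quick numerical sanity check of these interpolating variables (our own sketch, not code from the paper): samples of $\zeta(\Omega)$ should have zero mean and unit variance for every $\Omega$, reducing to $\pm 1$ spins at $\Omega = 0$ and to a standard Gaussian at $\Omega = 1$.

```python
import math
import random

def sample_zeta(omega: float, rng: random.Random) -> float:
    """One draw of zeta(Omega) = sqrt(Omega)*g + sqrt(1-Omega)*eps,
    with g standard Gaussian and eps a symmetric +/-1 variable."""
    g = rng.gauss(0.0, 1.0)
    eps = rng.choice([-1.0, 1.0])
    return math.sqrt(omega) * g + math.sqrt(1.0 - omega) * eps

rng = random.Random(0)
for omega in (0.0, 0.3, 1.0):
    xs = [sample_zeta(omega, rng) for _ in range(200_000)]
    mean = sum(xs) / len(xs)
    var = sum(x * x for x in xs) / len(xs) - mean ** 2
    # E[zeta] = 0 and E[zeta^2] = 1 for all Omega, up to sampling error
    assert abs(mean) < 0.02 and abs(var - 1.0) < 0.02
```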
Figure 1.
Three equivalent architectures of neural networks: in a restricted Boltzmann machine (RBM) (consisting of $N_1 = 5$ $\sigma$ variables and $N_2 = 3$ $\tau$ variables in the figure) the role of hidden and visible units can be exchanged, and marginalising over the hidden units one gets two dual generalised Hopfield models (GHMs), where the visible layer of the RBM constitutes the network and the hidden layer determines the interaction.

To allow for retrieval phases in our analysis, we assume there are some numbers $\ell_1$ and $\ell_2$ of condensed patterns with pattern overlaps or Mattis magnetizations
\[
m_\mu(\sigma) = \frac{1}{N_1}\sum_{i=1}^{N_1} \xi_i^\mu \sigma_i\,, \quad \mu = 1, \dots, \ell_1\,, \tag{1.4}
\]
\[
n_i(\tau) = \frac{1}{N_2}\sum_{\mu=1}^{N_2} \xi_i^\mu \tau_\mu\,, \quad i = 1, \dots, \ell_2 \tag{1.5}
\]
of order unity. We consider, for the sake of simplicity, the possible retrieval of a single pattern, i.e. $\ell_1 = \ell_2 = 1$, or pure state ansatz. The general case of mixed states is a straightforward generalisation [20] and can be considered a finer characterisation of the retrieval region we are going to describe. On the other hand, the possible presence of frozen but disordered states (spin glass region) can be described by introducing the overlaps
\[
q(\sigma^a, \sigma^b) = \frac{1}{N_1}\sum_{i=1}^{N_1} \sigma_i^a \sigma_i^b\,, \qquad r(\tau^a, \tau^b) = \frac{1}{N_2}\sum_{\mu=1}^{N_2} \tau_\mu^a \tau_\mu^b\,, \tag{1.6}
\]
between two configurations $(\sigma^a, \tau^a)$ and $(\sigma^b, \tau^b)$ sampled from the Gibbs measure with the same pattern realisation, and the self-overlaps $Q(\sigma)$ and $R(\tau)$ in the case $a = b$.
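To make the role of the Mattis overlap concrete, here is a small numerical illustration (our own sketch, with an assumed system size $N_1 = 2000$): a noisy copy of a stored Boolean pattern has overlap of order unity with it, while an unrelated random configuration has overlap of order $1/\sqrt{N_1}$.

```python
import random

rng = random.Random(2)
N = 2000
xi = [rng.choice([-1.0, 1.0]) for _ in range(N)]   # one Boolean pattern (Omega_xi = 0)

def mattis(sigma):
    # pattern overlap (1.4): m = (1/N) sum_i xi_i sigma_i
    return sum(x * s for x, s in zip(xi, sigma)) / N

# a noisy copy of the pattern (10% of spins flipped) still retrieves it: m = O(1)
noisy = [s if rng.random() > 0.1 else -s for s in xi]
# an unrelated random configuration does not: m = O(1/sqrt(N))
unrelated = [rng.choice([-1.0, 1.0]) for _ in range(N)]

assert mattis(xi) == 1.0
assert mattis(noisy) > 0.7
assert abs(mattis(unrelated)) < 0.1
```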
From a fairly standard replica calculation and the replica symmetry assumption (see Appendix A for more details), one gets that in the thermodynamic limit the Gibbs averages of the order parameters converge to the solutions of the following system:
\[
m = \left\langle \xi \langle \sigma \rangle_{\sigma|z,\xi} \right\rangle_{z,\xi} \tag{1.7}
\]
\[
n = \left\langle \xi \langle \tau \rangle_{\tau|\eta,\xi} \right\rangle_{\eta,\xi} \tag{1.8}
\]
\[
q = \left\langle \langle \sigma \rangle^2_{\sigma|z,\xi} \right\rangle_{z,\xi} \tag{1.9}
\]
\[
r = \left\langle \langle \tau \rangle^2_{\tau|\eta,\xi} \right\rangle_{\eta,\xi} \tag{1.10}
\]
\[
Q = \left\langle \langle \sigma^2 \rangle_{\sigma|z,\xi} \right\rangle_{z,\xi} \tag{1.11}
\]
\[
R = \left\langle \langle \tau^2 \rangle_{\tau|\eta,\xi} \right\rangle_{\eta,\xi} \tag{1.12}
\]
Here $z$ and $\eta$ are standard Gaussian random variables, while $\xi$ is sampled from $P_\xi$. The distributions of $\sigma$ and $\tau$ being averaged over are proportional to, respectively,
\[
P_\sigma(\sigma)\, e^{\beta(1-\alpha)\Omega_\tau m \xi \sigma + \sqrt{\beta\alpha r}\, z \sigma + \beta\alpha(R-r)\sigma^2/2}\,, \tag{1.13}
\]
\[
P_\tau(\tau)\, e^{\beta\alpha \Omega_\sigma n \xi \tau + \sqrt{\beta(1-\alpha) q}\, \eta \tau + \beta(1-\alpha)(Q-q)\tau^2/2}\,. \tag{1.14}
\]
These equations are valid also for more general spin priors $P_\sigma(\sigma)$ and $P_\tau(\tau)$, provided one then defines $\Omega_\sigma$ (and similarly $\Omega_\tau$) as the high-field response of the spins, in the sense that the average of $\sigma$ over $P_\sigma(\sigma)\, e^{h\sigma}$ approaches $\Omega_\sigma h$ for large $h$. We will repeatedly need averages over the distributions (1.13), (1.14). Taking the first as an example, the prior as defined is $P_\sigma(\sigma) \propto \sum_\varepsilon \exp[-(\sigma - \varepsilon\sqrt{1-\Omega_\sigma})^2/(2\Omega_\sigma)]$.
Thus the distribution (1.13) of $\sigma$ has the generic form
\[
Z_\sigma^{-1} \sum_\varepsilon e^{-\sigma^2/(2\gamma_\sigma) + (\phi_\sigma \varepsilon + h_\sigma)\sigma} \tag{1.15}
\]
where we have set $\phi_\sigma = \sqrt{1-\Omega_\sigma}/\Omega_\sigma$ and
\[
\gamma_\sigma^{-1} = \Omega_\sigma^{-1} - \beta\alpha(R - r)\,, \qquad h_\sigma = \beta(1-\alpha)\Omega_\tau m \xi + \sqrt{\beta\alpha r}\, z\,. \tag{1.16}
\]
Averages over the distribution (1.15) then follow from the effective single spin partition function
\[
Z_\sigma = \int d\sigma \sum_\varepsilon e^{-\sigma^2/(2\gamma_\sigma) + (\phi_\sigma \varepsilon + h_\sigma)\sigma} \propto \sum_\varepsilon e^{\gamma_\sigma(\phi_\sigma \varepsilon + h_\sigma)^2/2}\,, \tag{1.17}
\]
giving
\[
\langle \sigma \rangle_{\sigma|z,\xi} = \partial_{h_\sigma} \ln Z_\sigma = \frac{\sum_\varepsilon \gamma_\sigma(\phi_\sigma \varepsilon + h_\sigma)\, e^{\gamma_\sigma(\phi_\sigma \varepsilon + h_\sigma)^2/2}}{\sum_\varepsilon e^{\gamma_\sigma(\phi_\sigma \varepsilon + h_\sigma)^2/2}} \tag{1.18}
\]
\[
= \gamma_\sigma h_\sigma + \gamma_\sigma \phi_\sigma \tanh(\gamma_\sigma \phi_\sigma h_\sigma)\,. \tag{1.19}
\]
The average of $\sigma^2$ can similarly be found from
\[
\langle \sigma^2 \rangle_{\sigma|z,\xi} - \langle \sigma \rangle^2_{\sigma|z,\xi} = \partial^2_{h_\sigma} \ln Z_\sigma = \partial_{h_\sigma} \langle \sigma \rangle_{\sigma|z,\xi} = \gamma_\sigma + \gamma_\sigma^2 \phi_\sigma^2 \left[1 - \tanh^2(\gamma_\sigma \phi_\sigma h_\sigma)\right] \tag{1.20}
\]
hence
\[
\langle \sigma^2 \rangle_{\sigma|z,\xi} = \gamma_\sigma + \gamma_\sigma^2(h_\sigma^2 + \phi_\sigma^2) + 2\gamma_\sigma^2 \phi_\sigma h_\sigma \tanh(\gamma_\sigma \phi_\sigma h_\sigma)\,. \tag{1.21}
\]
Analogous results hold for the averages of $\tau$ over the distribution (1.14). The RBM and equivalent Hopfield model defined above generalizes a number of existing models that are included as special cases. For $\Omega_\sigma = 0$, $\Omega_\tau = 1$ and $\Omega_\xi = 0$ we recover the standard Hopfield model, while if $\Omega_\xi = 1$ we have the analog Hopfield model studied in [11, 12, 15] (see also [19] for the associated Mattis model). For $\Omega_\sigma = \Omega_\tau = 0$ we recover the bipartite Sherrington-Kirkpatrick (SK) model studied in [13, 9]. In this case it is known that the thermodynamics is not affected by the pattern distribution [26]. Throughout this paper we consider only fully-connected networks: results on the sparse case, restricted to the Hopfield model, can be found in [45, 1, 2].

1.3. Summary and Further Comments.
The aim of this paper is to study the phase diagram of Restricted Boltzmann Machines with generic priors and pattern/weight distributions as defined above. In general one expects three phases: a high-temperature (or paramagnetic) phase in which the free energy equals its annealed bound and all the order parameters are zero; a glassy phase where all pattern overlaps are still zero but replica symmetry breaking (RSB) is expected; and finally a retrieval phase in which the overlap still has a glassy structure, but now one or more pattern overlaps have nonzero mean values. The precise organisation of the thermodynamic states is unknown in the glassy and retrieval regions. In particular, while in the glassy phase it is supposed to be similar to the one of the SK model [39, 46], the understanding of the retrieval phase remains severely limited [20, 18, 46] and represents an open challenge for theoretical and mathematical physics. Throughout the paper the starting point for our analysis will be equations (1.7–1.12). We will study them analytically and numerically in the various regimes. The high-temperature transition is well understood by exact methods for the standard Hopfield model [20, 18], for the analog Hopfield model [11, 12, 15] and the bipartite SK model [13]. Moving beyond these special cases, in Section 2 we give a theoretical prediction for the transition of the order parameter $q$ as the distributions of the priors and patterns vary. We will see that the transition is independent of the particular pattern distribution. We find explicit expressions for the transition line for $\Omega_\sigma = 0$ (one layer made of $\pm 1$ spins) and (with totally different methods in Appendix B) for $\Omega_\sigma = \Omega_\tau = 1$. The remaining intermediate cases are studied by numerically solving the self-consistency equations (1.7–1.12) for the order parameters. Next we analyse the retrieval region, considering the retrieval of one single pattern.
A simple argument shows that no retrieval is possible for $\Omega_\sigma = \Omega_\tau = 0$: retrieval requires giving up an $O(N)$ amount of entropy in the $\sigma$-system. This is worthwhile only if we can gain an extensive amount of energy. The pattern being retrieved gives a field of $O(\sqrt{N})$ acting on a $\tau$ spin, so the response of the latter needs to be also $O(\sqrt{N})$ to get an overall $O(N)$ energy gain. This is impossible for binary $\tau$, but is possible for $\tau$ with a Gaussian tail, for which the cumulant generating function grows quadratically at infinity. Hence we always consider $\Omega_\tau > 0$. In Section 3 we look at the low-load regime in which $\alpha = 0$. It turns out that the transitions in $m$ and the replica overlap $q$ occur at the same temperature $T = \Omega_\tau$, and both transitions are continuous (as is known to be correct for the standard Hopfield model [20, 18]). First we concern ourselves with binary spins, $\Omega_\sigma = 0$, and then analyse $\Omega_\sigma > 0$, where it turns out that we need to add an appropriate spherical cut-off. In Section 4 we study numerically the retrieval transition at high load, i.e. $\alpha \in (0,1)$, so that the number of patterns is proportional to the system size. First we vary separately $\Omega_\tau$ and $\Omega_\xi$ while keeping $\Omega_\sigma = 0$ fixed. At $\Omega_\tau = 1$ we see absence of retrieval in the analog Hopfield model ($\Omega_\xi = 1$), as expected from [12]. An analysis at $T = 0$ shows, furthermore, that the most efficient retrieval is given by the standard Hopfield model. Moving on to $\Omega_\sigma > 0$ we find that the model is well-defined only for high temperature. However, it is interesting that while for $\Omega_\sigma = \Omega_\tau = 1$ (Gaussian bipartite model) the divergence of the partition function coincides with the glassy transition, in the intermediate cases there is still a region of retrieval in the phase diagram. Finally, when we regularise the model, again with a spherical cut-off, we observe a standard retrieval phase, with a reentrant behaviour of the transition line.
The latter would suggest an RSB scenario, as in the standard Hopfield model [40]. In Appendix A we derive equations (1.7–1.12) and in Appendix B we briefly analyse the Gaussian bipartite spin glass via Legendre duality, a method introduced for the spherical spin glass in [27]. The model we analyse is, for $\Omega_\sigma > 0$, a neural network with soft spins. Soft spin networks were introduced at an early stage of the development of the field by Hopfield in [33], but then were not much studied. From the (bipartite) spin glass perspective, soft spins (spherical or Gaussian) permit analytic methods to be more easily applied, compared to the more commonly studied binary $\pm 1$ spins. Indeed there is a substantial number of results in the literature. In [17] and [44] (see also [46]) two similar models of spherical neural networks are introduced, with spherical spins and quadratic(-like) interactions. The authors find the free energy to be RS and no retrieval region. However, in [17] it is noted that retrieval appears when a quartic term is added to the Hamiltonian. More recently, in [7] a spherical spin glass model was considered with random interaction given by a Wishart random matrix, which is closely related to the work in [17, 44, 46]. The authors find the free energy (which one can argue to be RS by comparison with the Wigner matrix case [16, 14, 27]) and its fluctuations for all temperatures. No retrieval is observed. Finally, a spherical bipartite spin glass is analysed in [6] for high temperatures, far from the critical point, and the authors find the free energy in a variational form. Interestingly enough, for this model our analysis yields the same paramagnetic/spin glass transition line as for the bipartite SK model, and no retrieval (see Sections 4.3, 4.4 and Appendix B). As for the RSB scenario there are only a few results for bipartite models.
To the best of our knowledge these are limited to 1RSB for the standard Hopfield model (see [21, 20]) and to a partial mathematical investigation of the bipartite (in fact multipartite) SK model [9, 41]. Therefore we will restrict ourselves to RS approximations, when needed.

2. Transition to the spin glass phase
At very high temperature ($\beta = 0$) the distributions (1.13), (1.14) have no external effective fields and the thermodynamic state is completely random, with order parameters $m = n = q = r = 0$. Lowering the temperature, a spin glass transition to frozen but disordered states takes place, creating nonzero overlaps $q$ and $r$ while $m$ and $n$ remain zero. Assuming this transition is continuous, we can linearise equations (1.9), (1.10) for small $q$ and $r$:
\[
q \sim \beta\alpha r \left\langle \sigma^2 \right\rangle^2 + o(r)\,, \tag{2.1}
\]
\[
r \sim \beta(1-\alpha) q \left\langle \tau^2 \right\rangle^2 + o(q)\,. \tag{2.2}
\]
Figure 2.
Spin glass transition line $T_c(\alpha)$ for different spin priors. The two continuous lines (lower, $\Omega_\sigma = \Omega_\tau = 0$, bipartite SK; upper, $\Omega_\sigma = \Omega_\tau = 1$, bipartite Gaussian) are completely symmetric w.r.t. exchange of the two network layers, i.e. the transformation $\alpha \to 1 - \alpha$. In the middle, the Hopfield critical line, $\Omega_\sigma = 0$, $\Omega_\tau = 1$.

Here $\langle \cdot \rangle$ denotes the expectation value w.r.t. (1.13) and (1.14) with $q = r = 0$ (in particular without the random field). The resulting transition criterion is
\[
\beta^2 \alpha(1-\alpha) \left\langle \sigma^2 \right\rangle^2 \left\langle \tau^2 \right\rangle^2 = \beta^2 \alpha(1-\alpha)\, Q^2 R^2 = 1\,, \tag{2.3}
\]
where, using (1.13) and (1.14), $Q$ and $R$ are the solutions of
\[
Q = \left\langle \sigma^2 \right\rangle = Z_\sigma^{-1}\, \mathbb{E}_\sigma\, \sigma^2 e^{\beta\alpha R \sigma^2/2}\,, \tag{2.4}
\]
\[
R = \left\langle \tau^2 \right\rangle = Z_\tau^{-1}\, \mathbb{E}_\tau\, \tau^2 e^{\beta(1-\alpha) Q \tau^2/2}\,. \tag{2.5}
\]
This result does not depend on the particular pattern distribution $P_\xi(\xi)$ (see also [3]), but it does clearly involve the spin priors. With these priors fixed, the transition takes place at an inverse temperature $\beta_c(\alpha) > 0$ that is a function of $\alpha$. For $\beta < \beta_c(\alpha)$ one finds that the self-overlaps are the solutions of
\[
Q = \frac{1 - \beta\alpha \Omega_\sigma^2 R}{(1 - \beta\alpha \Omega_\sigma R)^2}\,, \tag{2.6}
\]
\[
R = \frac{1 - \beta(1-\alpha)\Omega_\tau^2 Q}{(1 - \beta(1-\alpha)\Omega_\tau Q)^2}\,. \tag{2.7}
\]
The relation (2.6) can be derived directly from (1.21) with $\gamma_\sigma^{-1} = \Omega_\sigma^{-1} - \beta\alpha R$ and $h_\sigma = 0$, and similarly for (2.7). Solving (2.6), (2.7) together with (2.3), $T_c(\alpha) = 1/\beta_c(\alpha)$ satisfies
i) $\lim_{\alpha \to 0} T_c(\alpha) = \Omega_\tau$, $\lim_{\alpha \to 1} T_c(\alpha) = \Omega_\sigma$;
ii) $\lim_{\Omega_\sigma \to 0} T_c(\alpha) = (1-\alpha)\Omega_\tau + \frac{1}{2}\left(\sqrt{\alpha(1-\alpha)} + \left[\alpha(1-\alpha) + 4\Omega_\tau(1-\Omega_\tau)(1-\alpha)\sqrt{\alpha(1-\alpha)}\right]^{1/2}\right)$,
and of course the symmetric expression for $\Omega_\tau \to 0$, which is obtained by replacing in ii) $\Omega_\tau$ by $\Omega_\sigma$ and $\alpha$ by $1 - \alpha$. Relation ii) recovers a number of known special cases. For $\Omega_\sigma = \Omega_\tau = 0$, one gets the critical line of the bipartite SK model $T_c = \sqrt{\alpha(1-\alpha)}$ as found in [13] (see also [9]).
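The line ii) and its known limits can be cross-checked numerically for one Boolean layer ($\Omega_\sigma = 0$, hence $Q = 1$): solve the criterion (2.3) with $R$ from (2.7) by bisection in $T$. A minimal sketch (our own check, not code from the paper):

```python
import math

def critical_temperature(alpha: float, om_t: float) -> float:
    """Spin-glass transition for one Boolean layer (Omega_sigma = 0, so Q = 1):
    solve beta^2 alpha(1-alpha) R^2 = 1 with R from (2.7), by bisection in T."""
    b, s = 1.0 - alpha, math.sqrt(alpha * (1.0 - alpha))
    def crit(T):
        R = T * (T - b * om_t ** 2) / (T - b * om_t) ** 2   # (2.7) in the paramagnet
        return s * R / T                                     # beta * sqrt(alpha(1-alpha)) * Q * R
    lo, hi = b * om_t + 1e-12, 10.0                          # crit decreases through 1 above the pole
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if crit(mid) > 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

a = 0.2
# bipartite SK limit (Omega_tau = 0) and standard Hopfield limit (Omega_tau = 1)
assert abs(critical_temperature(a, 0.0) - math.sqrt(a * (1 - a))) < 1e-9
assert abs(critical_temperature(a, 1.0) - (1 - a + math.sqrt(a * (1 - a)))) < 1e-9
# intermediate prior against the closed form ii)
om_t, s = 0.5, math.sqrt(a * (1 - a))
Tc_ii = (1 - a) * om_t + 0.5 * (s + math.sqrt(a * (1 - a) + 4 * om_t * (1 - om_t) * (1 - a) * s))
assert abs(critical_temperature(a, om_t) - Tc_ii) < 1e-9
```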
When $\Omega_\sigma = 0$ and $\Omega_\tau = 1$ one has the standard Hopfield model and finds $T_c = 1 - \alpha + \sqrt{\alpha(1-\alpha)}$ [5, 15]. The case of both Gaussian priors ($\Omega_\sigma = \Omega_\tau = 1$) can be found independently using the Legendre duality between the Gaussian bipartite spin glass model and the spherical Hopfield model studied in [44, 17, 7], see Appendix B. The general bimodal case can be analysed numerically; the results are shown in Fig. 2.

3. Transition to retrieval I: low load
In the low-load regime the size of one layer is negligible w.r.t. the total size of the system, i.e. $\alpha = 0$ or $\alpha = 1$. In this case it is possible to obtain equation (1.7) without any RS approximation, since the model becomes a generalised ferromagnet. This can be studied only in terms of the pattern overlaps, without the need to consider $q$ and $r$ [20, 18]. Focusing on $\alpha = 0$ and linearising (1.7) in $m$ we get
\[
m = \beta \Omega_\tau m + O(m^3)\,,
\]

Figure 3.
Soft bipartite model at low load ($\alpha = 0$): for a generic pattern distribution (here $\delta = 0.$…) a spontaneous magnetisation appears at $T = \Omega_\tau$, diverging at $T = \Omega_\sigma \Omega_\tau$.

which shows a bifurcation at $T = \Omega_\tau$. As in the known special cases, it therefore remains true for generic $\Omega_\sigma$, $\Omega_\tau$ and $\Omega_\xi$ that the spin glass and low-load retrieval transitions occur at the same temperature. We next consider the strength of retrieval at temperatures below the transition: the inner average of equation (1.7) is, using (1.19) with $\gamma_\sigma = \Omega_\sigma$,
\[
\langle \sigma \rangle_{\sigma|\xi} = \Omega_\sigma \beta \Omega_\tau m \xi + \sqrt{1-\Omega_\sigma}\, \tanh\!\big(\sqrt{1-\Omega_\sigma}\, \beta \Omega_\tau m \xi\big)\,.
\]
To carry out the remaining average over $\xi$, which by assumption is drawn from the bimodal distribution $D(\Omega_\xi)$ with peaks at $\pm\delta = \pm\sqrt{1-\Omega_\xi}$, we set (see Sec. 1.2) $\xi = \delta\varepsilon + \sqrt{\Omega_\xi}\, g$. As $\langle \sigma \rangle_{\sigma|\xi}$ is odd in $\xi$, the two possible values of $\varepsilon = \pm 1$ give the same contribution to $\langle \xi \langle \sigma \rangle_{\sigma|\xi} \rangle$ and we have to average only over $g$. After an integration by parts this gives
\[
m = f_{\beta,\Omega}(m) \tag{3.1}
\]
with
\[
f_{\beta,\Omega}(m) = \beta \Omega_\sigma \Omega_\tau m + \sqrt{1-\Omega_\sigma} \left\{ \delta\, \bar t\big(\beta\sqrt{1-\Omega_\sigma}\,\Omega_\tau \delta m, \sqrt{v}\big) + \beta\sqrt{1-\Omega_\sigma}\,\Omega_\tau \Omega_\xi m \left[1 - t_2\big(\beta\sqrt{1-\Omega_\sigma}\,\Omega_\tau \delta m, \sqrt{v}\big)\right] \right\}.
\]
Here we have introduced the abbreviations
\[
\bar t(a,b) = \langle \tanh(a + bg) \rangle_g\,, \qquad t_2(a,b) = \langle \tanh^2(a + bg) \rangle_g \tag{3.2}
\]
where the averages are over a zero mean, unit variance Gaussian random variable $g$. We have also defined
\[
v = \beta^2 (1-\Omega_\sigma)\, \Omega_\tau^2 \Omega_\xi\, m^2\,. \tag{3.3}
\]
For binary spins ($\Omega_\sigma = 0$), $|\sigma| = 1$ and so $f_{\beta,\Omega}(m) = \langle \xi \langle \sigma \rangle_{\sigma|\xi} \rangle$ is bounded (between $-\langle|\xi|\rangle$ and $+\langle|\xi|\rangle$). This ensures that a non-trivial solution $m$ of (3.1)
always exists below the retrieval transition. The zero temperature limit of $m$ can be found explicitly: for $\beta \to \infty$, $\langle \sigma \rangle_{\sigma|\xi} \to \mathrm{sgn}(m\xi)$, so $f_{\beta,\Omega}(m) \to \mathrm{sgn}(m) \langle \xi\, \mathrm{sgn}(\xi) \rangle$ and therefore $m \to \pm\langle|\xi|\rangle$ with
\[
\langle|\xi|\rangle = \sqrt{\frac{2\Omega_\xi}{\pi}}\, e^{-\delta^2/(2\Omega_\xi)} + \delta\, \mathrm{erf}\!\left(\frac{\delta}{\sqrt{2\Omega_\xi}}\right). \tag{3.4}
\]
For generic soft spins ($\Omega_\sigma > 0$), on the other hand, $f_{\beta,\Omega}(m)$ is no longer bounded but grows as $\beta \Omega_\sigma \Omega_\tau m$ for large $|m|$. The spontaneous magnetisation, which is the solution of $m = f_{\beta,\Omega}(m)$, therefore diverges at $T_c = \Omega_\sigma \Omega_\tau$ as temperature is lowered; see Fig. 3. For lower $T$ the model is ill-defined, as we are going to see in more detail in the next section, thus we need to regularise the spin distribution in at least one network layer. To fix the choice of regularisation we note that for a large system, every rotationally invariant weight on the vector of $\sigma$-spins is equivalent to a rigid constraint at some fixed radius. Without loss of generality we therefore regularize by multiplying the $\sigma$-prior by the spherical constraint $\delta(N_1 - \sum_{i=1}^{N_1} \sigma_i^2)$. The resulting prior still depends on $\Omega_\sigma$; at $\Omega_\sigma = 1$ it is a uniform distribution on the sphere and we obtain the spherical

Figure 4.
Soft model with spherical constraint at low load ($\alpha = 0$). Spontaneous magnetisation still occurs at $T = \Omega_\tau$, increasing until $T = 0$. Left panels $\delta = 1$, right panels $\delta = 0$. As $\Omega_\sigma \to 0$, $m$ approaches the value (3.4) at low $T$. But at any $\Omega_\sigma > 0$, $m$ eventually peels off from this asymptote to reach $m = 1$ for $T \to 0$. Lower panels show the behaviour of $\gamma_\sigma$: it tends to zero linearly at low temperature, $\gamma_\sigma \approx T/\Omega_\tau$, while for $T \geqslant \Omega_\tau$, $\gamma_\sigma = \Omega_\sigma$.

Hopfield model studied in [44, 17, 7]. At $\Omega_\sigma = 0$, on the other hand, the regularisation constraint is redundant and we recover the standard Hopfield model. One can now analyse the regularised model using similar replica computations to those above. The only difference is an extra Gaussian factor $e^{-\omega \sigma^2/2}$ in the effective $\sigma$-spin distribution. Here $\omega$ is a Lagrange multiplier that is determined from the spherical constraint $Q = 1$. It changes the variance of the two Gaussian peaks from $\Omega_\sigma$ to $\gamma_\sigma = (\Omega_\sigma^{-1} + \omega)^{-1}$. Accordingly, instead of $f_{\beta,\Omega}$ in (3.1) one obtains a modified function
\[
f_{\beta,\Omega,\gamma_\sigma}(m) = \beta\gamma_\sigma \Omega_\tau m + \gamma_\sigma \phi_\sigma \left\{ \delta\, \bar t\big(\beta\gamma_\sigma\phi_\sigma \Omega_\tau \delta m, \sqrt{v}\big) + \beta\gamma_\sigma\phi_\sigma \Omega_\tau \Omega_\xi m \left[1 - t_2(\dots)\right] \right\} \tag{3.5}
\]
where the arguments of $t_2$ are the same as for $\bar t$. Note that the first term $\beta\Omega_\sigma\Omega_\tau m$ has become $\beta\gamma_\sigma\Omega_\tau m$, and all occurrences of $\sqrt{1-\Omega_\sigma} = \Omega_\sigma \phi_\sigma$ have been replaced by $\gamma_\sigma \phi_\sigma$. Accordingly, also $v$ now has the more general form
\[
v = \beta^2 \gamma_\sigma^2 \phi_\sigma^2\, \Omega_\tau^2 \Omega_\xi\, m^2\,. \tag{3.6}
\]
The value of $\omega$, or equivalently $\gamma_\sigma$, is determined from the condition $Q = \langle \sigma^2 \rangle_{\sigma,\xi} = 1$, where $Q$ can be worked out using (1.21) as
\[
Q = \gamma_\sigma + \gamma_\sigma^2(\beta^2 \Omega_\tau^2 m^2 + \phi_\sigma^2) + 2\beta\gamma_\sigma^2\phi_\sigma \Omega_\tau \delta m\, \bar t(\dots) + 2\beta^2\gamma_\sigma^2\phi_\sigma^2 \Omega_\tau^2 \Omega_\xi m^2 \left[1 - t_2(\dots)\right]. \tag{3.7}
\]
The last two terms are proportional to the last two terms in (3.5), and hence to $(1 - \beta\gamma_\sigma\Omega_\tau)m$; if one traces back through the derivation this comes from the fact that both results are proportional to $\langle h_\sigma \tanh(\gamma_\sigma\phi_\sigma h_\sigma) \rangle$.
With this simplification one obtains the equivalent expression
\[
Q = \gamma_\sigma + \gamma_\sigma^2 \phi_\sigma^2 + \beta\gamma_\sigma \Omega_\tau m^2 (2 - \beta\gamma_\sigma \Omega_\tau) = 1\,. \tag{3.8}
\]
For $\Omega_\sigma \to 0$ one has $\gamma_\sigma \approx \Omega_\sigma$, which vanishes, while $\gamma_\sigma \phi_\sigma = \sqrt{1-\Omega_\sigma}\, \gamma_\sigma/\Omega_\sigma \to 1$. For this limiting case of Boolean $\sigma$-spins the constraint (3.8) is therefore automatically satisfied, as expected. More generally, while $f_{\beta,\Omega,\gamma_\sigma}(m)$ behaves as $\beta\gamma_\sigma\Omega_\tau m$ as $m \to \infty$, this first term is not the leading contribution, because $\gamma_\sigma \sim 1/m^2$ for large $m$. The last two terms in $f$ give a nonzero constant asymptote. Near $m = 0$, on the other hand, $f_{\beta,\Omega,\gamma_\sigma}(m)$ goes as $\beta \Omega_\tau (\gamma_\sigma + \gamma_\sigma^2 \phi_\sigma^2) m$. From equation (3.8), $\gamma_\sigma + \gamma_\sigma^2 \phi_\sigma^2 = 1 + O(m^2)$, thus the ferromagnetic transition remains at $T_c = \Omega_\tau$ in the model with the spherical constraint. (One easily checks that $\gamma_\sigma + \gamma_\sigma^2 \phi_\sigma^2 = 1$ implies as the physical solution $\gamma_\sigma = \Omega_\sigma$, so that the regularizer $\omega$ increases smoothly from zero at the transition.) For temperatures below $T_c$ one generally has to find $m$ numerically. Results are shown in Fig. 4. As expected for a regularized model, $m$ remains finite at all $T$. In the low-temperature limit it always reaches its maximum value $m \to 1$. One can easily check this from (3.5) and (3.8): the latter implies for $m = 1$ that $\beta\gamma_\sigma\Omega_\tau \to 1$ (see the lower plots in Fig. 4). Hence the first term on the r.h.s. of (3.5) also approaches unity, as it should from $m = f_{\beta,\Omega,\gamma_\sigma}(m)$, while the other terms in (3.5) vanish in the limit.

4. Transition to retrieval II: high load
Now we study the entire phase diagram of the model, in particular with regards to the presence and stability of a retrieval region. We now use the full definition of $\gamma_\sigma$ and $h_\sigma$ from (1.16), along with the analogous definition for $\gamma_\tau$:
\[
\gamma_\sigma^{-1} = \Omega_\sigma^{-1} - \beta\alpha(R - r)\,, \qquad \gamma_\tau^{-1} = \Omega_\tau^{-1} - \beta(1-\alpha)(Q - q)\,, \qquad h_\sigma = \beta(1-\alpha)\Omega_\tau m\xi + \sqrt{\beta\alpha r}\, z\,. \tag{4.1}
\]
Furthermore we abbreviate the variance of the Gaussian part of $\gamma_\sigma\phi_\sigma h_\sigma$ as
\[
v = \beta^2(1-\alpha)^2 \gamma_\sigma^2\phi_\sigma^2\, \Omega_\tau^2 \Omega_\xi\, m^2 + \beta\alpha\gamma_\sigma^2\phi_\sigma^2\, r \tag{4.2}
\]
where compared to (3.3) we again have the replacement of $\sqrt{1-\Omega_\sigma}$ by $\gamma_\sigma\phi_\sigma$, and otherwise the incorporation of the $\alpha$-dependence and the new term proportional to $r$. Then, taking the averages w.r.t. $\xi$ and $z$ we have, using (1.19), (1.21) and integrating by parts where appropriate,
\[
m = \left\langle \xi \langle \sigma \rangle_{\sigma|z,\xi} \right\rangle_{z,\xi} = \beta(1-\alpha)\gamma_\sigma\Omega_\tau m + \gamma_\sigma\phi_\sigma \left[ \delta\, \bar t\big(\beta(1-\alpha)\gamma_\sigma\phi_\sigma\Omega_\tau\delta m, \sqrt{v}\big) + \beta(1-\alpha)\gamma_\sigma\phi_\sigma\Omega_\tau\Omega_\xi m \big(1 - t_2(\dots)\big) \right]
\]
\[
q = \left\langle \langle \sigma \rangle^2_{\sigma|z,\xi} \right\rangle_{z,\xi} = \left\langle \big(\gamma_\sigma h_\sigma + \gamma_\sigma\phi_\sigma \tanh(\gamma_\sigma\phi_\sigma h_\sigma)\big)^2 \right\rangle_{z,\xi}
\]
\[
= \beta^2(1-\alpha)^2\gamma_\sigma^2\Omega_\tau^2 m^2 + \beta\alpha\gamma_\sigma^2 r + \gamma_\sigma^2\phi_\sigma^2\, t_2(\dots) + 2\beta(1-\alpha)\gamma_\sigma^2\phi_\sigma\Omega_\tau\delta m\, \bar t(\dots) + 2\gamma_\sigma v \left[1 - t_2(\dots)\right]
\]
\[
= \beta(1-\alpha)\gamma_\sigma\Omega_\tau \big(2 - \beta(1-\alpha)\Omega_\tau\gamma_\sigma\big)\, m^2 + \beta\alpha\gamma_\sigma^2(1 + 2\gamma_\sigma\phi_\sigma^2)\, r + \gamma_\sigma^2\phi_\sigma^2 (1 - 2\beta\alpha\gamma_\sigma r)\, t_2(\dots)
\]
\[
Q = \left\langle \langle \sigma^2 \rangle_{\sigma|z,\xi} \right\rangle_{z,\xi} = q + \gamma_\sigma + \gamma_\sigma^2\phi_\sigma^2 \left[1 - t_2(\dots)\right],
\]
where all $\tanh$-averages $\bar t$ and $t_2$ are evaluated for the same parameters, as given in the equation for $m$. In the final expression for $q$ we have eliminated the $\bar t$ term using the expression for $m$. Repeating the same argument for the effective distribution of the $\tau$ spins, we get the equations for the other order parameters simply by exchanging labels appropriately and replacing $\alpha$ with $1 - \alpha$, bearing in mind also that the corresponding magnetization parameter is $n = 0$.
This gives the following additional equations:
\[
r = \beta(1-\alpha)\gamma_\tau^2(1 + 2\gamma_\tau\phi_\tau^2)\, q + \gamma_\tau^2\phi_\tau^2\big(1 - 2\beta(1-\alpha)\gamma_\tau q\big)\, t_2\big(0,\, \gamma_\tau\phi_\tau\sqrt{\beta(1-\alpha)q}\big) \tag{4.3}
\]
\[
R = r + \gamma_\tau + \gamma_\tau^2\phi_\tau^2 \left[1 - t_2\big(0,\, \gamma_\tau\phi_\tau\sqrt{\beta(1-\alpha)q}\big)\right]. \tag{4.4}
\]

4.1. One Boolean layer.
In the case where the $\sigma$-spins are Boolean, $\Omega_\sigma = 0$, the saddle point equations simplify considerably. From (4.1), one has as before $\gamma_\sigma \approx \Omega_\sigma \to 0$ and $\gamma_\sigma\phi_\sigma \to 1$. This leads to
\[
m = \delta\, \bar t\big(\beta(1-\alpha)\Omega_\tau\delta m, \sqrt{v}\big) + \beta(1-\alpha)\Omega_\tau\Omega_\xi m \left[1 - t_2(\dots)\right] \tag{4.5}
\]
\[
q = t_2(\dots) \tag{4.6}
\]
where, after inserting the expression (4.3) for $r$, the Gaussian field variance can be written as $v = \beta^2(1-\alpha)^2 V$ with
\[
V = \Omega_\tau^2\Omega_\xi m^2 + \frac{\alpha}{1-\alpha}\gamma_\tau^2(1 + 2\gamma_\tau\phi_\tau^2)\, q + \frac{\alpha}{1-\alpha}\gamma_\tau^2\phi_\tau^2\left[\big(\beta(1-\alpha)\big)^{-1} - 2\gamma_\tau q\right] t_2\big(0,\, \gamma_\tau\phi_\tau\sqrt{\beta(1-\alpha)q}\big)\,. \tag{4.7}
\]
Solutions of (4.5) are shown in Fig. 5. Starting from the standard Hopfield phase diagram ($\Omega_\xi = 0$ and $\Omega_\tau = 1$), the retrieval region gradually disappears with increasing $\Omega_\xi$ or decreasing $\Omega_\tau$. In the first case it shifts towards the $T$-axis, as the critical temperature for $\alpha = 0$ is independent of $\Omega_\xi$. In the second case, both the retrieval and spin glass transition lines shift towards the $\alpha$-axis, as the critical $\alpha$ at $T = 0$ is independent of $\Omega_\tau$, as we will see shortly.

4.2. Zero temperature limit.
Useful insight into the Ω_σ = 0 case can be obtained by further specializing to the limit T → 0 (i.e. β → ∞). In this limit

t̄(βa, βb) → ⟨sgn(a + bη)⟩_η = erf( a/(√2 b) )   (4.8)

Figure 5.
Phase diagrams with one Boolean layer (Ω_σ = 0). Left panel: (T, α) phase diagram for Ω_τ = 1 and different values of δ. The retrieval transition line moves towards the T-axis as δ decreases, while the critical temperature at α = 0 remains fixed. Right panel: phase diagram for δ = 1 and different values of Ω_τ. Both transition lines move towards the α-axis as Ω_τ decreases, while now the critical load at T = 0 is fixed.

and, putting w = β(a + bη),

β[1 − t(βa, βb)] = β ∫ dw (√(2π) βb)⁻¹ exp[−(w−βa)²/(2β²b²)] [1 − tanh²(w)]
→ ∫ dw (√(2π) b)⁻¹ exp[−a²/(2b²)] [1 − tanh²(w)] = √(2/π) b⁻¹ exp[−a²/(2b²)].   (4.9)

If we set v = [β(1−α)]² V as before and then apply the above large-β identities in the equation (4.5) for m we get

m = δ erf( Ω_τ δ m/√(2V) ) + Ω_τ Ω_ξ m √(2/(πV)) exp( −Ω_τ² δ² m²/(2V) ).   (4.10)

The equation (4.6) for q has a limit in terms of C = β(1−α)(1−q):

C = √(2/(πV)) exp( −Ω_τ² δ² m²/(2V) ).   (4.11)

Finally, for V in (4.7) the zero temperature limit is simple, as t(0, βb) → 1 and q → 1, giving

V = Ω_τ² Ω_ξ m² + (α/(1−α)) γ_τ² = Ω_τ² Ω_ξ m² + (α/(1−α)) (Ω_τ⁻¹ − C)⁻².   (4.12)

One can reduce these three equations to a single one for x = Ω_τ m/√(2V), which reads

x = F_{δ,α}(x),  F_{δ,α}(x) = [ δ erf(δx) − (2/√π) δ² x e^{−δ²x²} ] [ 2α/(1−α) + 2(1−δ²)( δ erf(δx) − (2/√π) δ² x e^{−δ²x²} )² ]^{−1/2}.   (4.13)

We leave the derivation of this result to the end of this section. One sees that F_{δ,α}(x) is strictly increasing, starting from zero and approaching δ [2α/(1−α) + 2(1−δ²)δ²]^{−1/2} for large x (Fig. 6). Note also that Ω_τ has no effect on the value of x, and only affects the coefficient in the linear relation between x and m.
For fixed δ, a first order phase transition occurs in the self-consistency condition (4.13) as α increases. The transition value α_c(δ) is largest for δ = 1 and decreases to zero quite rapidly as δ → 0, see Fig. 7. For α < α_c(δ) a non-zero solution of (4.13) exists, with x (thus m) growing as α decreases.
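The first order transition can be located numerically by scanning for the largest fixed point of the zero-temperature map. The sketch below assumes the form F_{δ,α}(x) = D(x)/[2α/(1−α) + 2(1−δ²)D(x)²]^{1/2} with D(x) = δ erf(δx) − (2/√π)δ²x e^{−δ²x²}, i.e. with the load convention α = N₂/N so that α/(1−α) is the number of patterns per visible unit; the function names, grid and bisection depth are our choices:

```python
import math

def F(x, delta, alpha):
    """Zero-temperature map F_{delta,alpha}(x); D collects the erf terms."""
    D = delta * math.erf(delta * x) \
        - (2.0 / math.sqrt(math.pi)) * delta**2 * x * math.exp(-(delta * x) ** 2)
    return D / math.sqrt(2.0 * alpha / (1.0 - alpha) + 2.0 * (1.0 - delta**2) * D**2)

def retrieval_x(delta, alpha, x_max=10.0, steps=4000):
    """Largest x > 0 solving x = F(x); returns 0.0 when only the trivial
    solution survives (no retrieval)."""
    prev_x = x_max
    prev_g = F(prev_x, delta, alpha) - prev_x        # negative for large x
    for k in range(1, steps + 1):
        x = x_max * (1.0 - k / steps)
        if x <= 0.0:
            break
        g = F(x, delta, alpha) - x
        if prev_g < 0.0 <= g:                        # bracketed the largest fixed point
            lo, hi = x, prev_x                       # F(lo) >= lo, F(hi) < hi
            for _ in range(60):                      # bisection refinement
                mid = 0.5 * (lo + hi)
                if F(mid, delta, alpha) >= mid:
                    lo = mid
                else:
                    hi = mid
            return 0.5 * (lo + hi)
        prev_x, prev_g = x, g
    return 0.0
```

With this convention, for δ = 1 the non-trivial solution is present at α = 0.10 but has disappeared by α = 0.15, and for small α the fixed point approaches 1/√(2(1−δ²)), matching the limits discussed above.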
In particular, as α → 0, x = F_{δ,α}(x) → 1/√(2(1−δ²)) = 1/√(2Ω_ξ). In this low-load limit one then recovers for m the corresponding low-load expression of Section 3, as we show below.
We remark that since, for any 0 < α < 1, F_{δ,α}(x) → 0 as δ → 0, one also has m → 0 (with a first order phase transition, see Fig. 7). For α = 0, on the other hand, we see from (4.14) that m → √(2/π) as δ → 0, which is consistent with the data shown in Fig. 7. Thus the Hopfield model retrieves Gaussian patterns only at α = 0, but not at high load.

Figure 6.
Plot of F_{δ,α}(x). It tends uniformly to zero as δ → 0 at fixed α (left panel), while it approaches 1/√(2(1−δ²)) as α → 0 at fixed δ (right panel).

Figure 7.
Left panel: magnetization versus α for different values of δ; at α_c(δ) a first order phase transition occurs. The low-load pattern overlap m(α = 0; δ) tends to √(2/π) as δ → 0. Right panel: α_c(δ) plotted versus 1−δ: α_c(1) = 0.1…, while α_c(δ) → 0 rapidly as δ → 0.

We close this section by outlining the derivation of (4.13). Bearing in mind δ = √(1−Ω_ξ), the equation (4.10) for m becomes

m = δ erf(δx) + (1−δ²) (2/√π) x e^{−δ²x²},   (4.14)

while for C one gets

C = (2/√π) (x/(Ω_τ m)) exp(−δ²x²).   (4.15)

Thus

V = Ω_τ² (1−δ²) m² + (α/(1−α)) [ Ω_τ⁻¹ − (2/√π) x exp(−δ²x²)/(Ω_τ m) ]⁻²
  = Ω_τ² m² { 1−δ² + (α/(1−α)) [ m − (2/√π) x exp(−δ²x²) ]⁻² }
  = Ω_τ² m² { 1−δ² + (α/(1−α)) [ δ erf(δx) − (2/√π) δ² x exp(−δ²x²) ]⁻² }.   (4.16)

Now we set

F_{δ,α}(x) = { 2(1−δ²) + (2α/(1−α)) [ δ erf(δx) − (2/√π) δ² x e^{−δ²x²} ]⁻² }^{−1/2},   (4.17)

so that x = Ω_τ m/√(2V) = F_{δ,α}(x), and we readily get (4.13).

4.3. Soft models.
Models with Gaussian spins on both layers are typically ill-defined at low temperature, due to the occurrence of negative eigenvalues in the interaction matrix. In the fully Gaussian model (Ω_σ = Ω_τ = 1) the line where the partition function diverges coincides exactly with the paramagnetic/spin glass transition.

Figure 8. I(x) vs x for different values of β. At β̂_c, I(x) is tangent to x.

In this case, the effective single-site distributions P(σ|z,ξ) and P(τ|η,ξ) are respectively proportional to

exp[ β(1−α)Ω_τ m ξ σ + √(βα r) z σ − ½(1 − βα(R−r)) σ² ]   (4.18)

exp[ βα Ω_σ n ξ τ + √(β(1−α) q) η τ − ½(1 − β(1−α)(Q−q)) τ² ].   (4.19)

Both these distributions are therefore Gaussian with variances Σ_σ, Σ_τ, defined by Σ_σ⁻¹ = 1 − βα(R−r) and Σ_τ⁻¹ = 1 − β(1−α)(Q−q). The equations for Q and R read

Q = ⟨⟨σ²⟩_{σ|z,ξ}⟩_{z,ξ} = q + Σ_σ   (4.20)

R = ⟨⟨τ²⟩_{τ|η,ξ}⟩_{η,ξ} = r + Σ_τ.   (4.21)

Thus

Σ_σ = 1/(1 − βα Σ_τ),  Σ_τ = 1/(1 − β(1−α) Σ_σ),   (4.22)

and one has to study the equation I(Σ_σ) = Σ_σ, where

I(Σ_σ) = (1 − β(1−α)Σ_σ) / (1 − βα − β(1−α)Σ_σ).   (4.23)

The function I(x) is a hyperbola diverging at x = (1−βα)/(β(1−α)), see Fig. 8. It is positive only for x below this value, so this is the range we need to consider since Σ_σ > 0. For small β one has a solution near Σ_σ = 1 which increases with β. At some β̂_c, I(x) becomes tangent to x, and for still larger β there are no intersections. After some calculations using (4.22) one finds for the threshold β̂_c

β̂_c² α(1−α) Σ_σ² (1 − β̂_c(1−α)Σ_σ)⁻² = β̂_c² α(1−α) Σ_σ² Σ_τ² = 1,   (4.24)

which exactly coincides with the paramagnetic/spin glass transition temperature (2.3), as anticipated. We note that we can also compute the divergence of the partition function of the model directly, by diagonalising the interaction matrix (i.e.
the weight matrix):

Z_N(β, α; ξ) = E_{σ,τ} exp( √(β/N) Σ_{i=1}^{N₁} Σ_{μ=1}^{N₂} ξ_i^μ σ_i τ_μ ) = E_σ exp( (β/2N) Σ_{i,j=1}^{N₁} Σ_{μ=1}^{N₂} ξ_i^μ ξ_j^μ σ_i σ_j ) = E_σ exp( (βα/2) Σ_{i,j=1}^{N₁} M_ij σ_i σ_j ),   (4.25)

where M = (1/N₂) ξξ^T is a Wishart matrix, so its empirical eigenvalue spectrum converges to the Marchenko-Pastur distribution for large N, which is nonzero only between (1 ± √((1−α)/α))². Using a suitable orthogonal transformation on the spin variables we can diagonalise M, so that

Z_N(β, α; ξ) = E_σ exp( (βα/2) Σ_i λ_i σ_i² ).

This is well-defined as long as max_i [βα λ_i] < 1. Using the largest eigenvalue from Marchenko-Pastur, max_i λ_i = (1 + √((1−α)/α))² for large N, we get for the critical temperature

T_c(α) = (√α + √(1−α))².

It can be checked that the spin glass transition line numerically computed in Section 2 coincides with T = T_c(α). In the general case 0 < Ω_σ, Ω_τ < 1 we simply remark that (recall that the g are N(0,1) and ε = ±1)

Σ_{i,μ} ξ_i^μ σ_i τ_μ = √((1−Ω_σ)(1−Ω_τ)) Σ_{i,μ} ξ_i^μ ε_i ε_μ   (4.26)
+ √(Ω_τ(1−Ω_σ)) Σ_{i,μ} ξ_i^μ ε_i g_μ + √(Ω_σ(1−Ω_τ)) Σ_{i,μ} ξ_i^μ g_i ε_μ   (4.27)
+ √(Ω_σ Ω_τ) Σ_{i,μ} ξ_i^μ g_i g_μ.   (4.28)

Of course the terms in (4.26) and (4.27) have well-defined thermodynamic properties for all T, so we just need to rescale T_c as

T_c(α) = Ω_σ Ω_τ (√α + √(1−α))².   (4.29)

This generalises what happens at low load, where a divergence in m appears at T_c = Ω_σ Ω_τ (see Fig. 3). Note that such a critical temperature is lower than the one for the paramagnetic/spin glass transition.

4.4. Spherical Constraints.
As before, we can remove the singularity in the partition function by adding the spherical constraint δ(N₁ − Σ_{i=1}^{N₁} σ_i²) to the σ-prior. The saddle point equations above remain valid with the replacement γ_σ⁻¹ = Ω_σ⁻¹ − βα(R−r) + ω, with ω ≥ 0 (or directly γ_σ, see also Section 3) satisfying

Q = q + γ_σ + (γ_σφ_σ)² [1 − t(β(1−α)γ_σφ_σ Ω_τ δ m, √v)] = 1.   (4.30)

For binary σ, i.e. Ω_σ → 0, one has γ_σφ_σ → 1 and the constraint (4.30) is automatically satisfied. For Gaussian σ (Ω_σ = 1), on the other hand, φ_σ = 0 and hence γ_σ = 1−q.
Starting from the low-load solution at α = 0 and increasing α, it is possible to find numerically the solution of the saddle point equations together with the constraint (4.30). The results, presented in Fig. 9, indicate that the retrieval region is robust also in the high-load regime, disappearing only as Ω_σ → 1. The retrieval transition line exhibits re-entrant behaviour as in the standard Hopfield model, which might point to underlying RSB effects [40].
In principle one can ask further what happens in a model where both layers have a spherical constraint. In this case we simply need to put an additional Gaussian factor e^{−ω_τ τ²/2} into the effective τ-spin distribution, where the additional Lagrange multiplier ω_τ can be found by fixing the radius R = 1. As a consequence, the paramagnetic to spin glass transition line (2.3) becomes

β² α(1−α) Q² R² = β² α(1−α) = 1.   (4.31)

This is valid for the bipartite SK model (Ω_σ = Ω_τ = 0) but also for generic Ω_σ and Ω_τ. As T_c = √(α(1−α)) → 0 for α → 0 and retrieval is expected only below the paramagnetic to spin glass transition, this indicates that the double spherical constraint removes the possibility of a retrieval phase, even at low load. What is happening is that the high-field response Ω_τ is weakened and becomes γ_τ = Ω_τ/(1 + Ω_τ ω_τ).

Figure 9.
Retrieval regions of the soft model with a spherical constraint on the σ-layer, for different Ω_σ and fixed Ω_τ = δ = 1.

Moreover, the equations of Section 4 still apply if we replace Ω_τ by γ_τ and set γ_τ⁻¹ = Ω_τ⁻¹ − β(1−α)(1−q) + ω_τ. In the paramagnetic regime γ_σ and γ_τ satisfy

Q = γ_σ + (γ_σφ_σ)² = 1 ⟹ γ_σ = Ω_σ,  R = γ_τ + (γ_τφ_τ)² = 1 ⟹ γ_τ = Ω_τ,   (4.32)

while q = 0, giving for the response γ_τ → 1/(γ_τ⁻¹ + β(1−α)) = (Ω_τ⁻¹ + β(1−α))⁻¹. This is not sufficient for retrieval, not even at low load (α = 0), where βγ_τ = β Ω_τ/(1 + β Ω_τ) < 1 and the critical temperature is T = 0 (β → ∞). Intuitively, because of the spherical cut-off, the tail of the hidden unit prior is simply not sufficient to give, after marginalising out the visible units, an appropriate function u (see Section 1.1) to get spontaneous magnetisation in the low load ferromagnetic model.

5. Conclusions and outlooks
In this paper we have investigated the phase diagram of Restricted Boltzmann Machines with different unit and weight distributions, ranging from centred (real) Gaussian to Boolean variables. We have highlighted the retrieval capabilities of these networks, using their duality with generalised Hopfield models.
Our analysis is mainly based on the study of the self-consistency relations for the order parameters and offers a nearly complete description of the properties of these systems. For this rather large class of models we have drawn the phase diagram, which is made up of three phases, namely paramagnetic, spin glass and retrieval, and studied the phase transitions between them.
We stress that, while in associative neural networks patterns are often restricted to the binary case, there is at present much research activity in the area of Boltzmann machines with real weights. Our analysis shows that retrieval is possible at high load for any pattern distribution interpolating between Boolean and Gaussian statistics, with the exception of the purely Gaussian limit. In this Gaussian case high-load retrieval fails, but retrieval is recovered even here at low load.
A complete analysis of the paramagnetic-spin glass transition and the spin glass-retrieval transition is very useful for the study of modern deep neural networks, where the crucial learning phase is often initiated with a step of unsupervised learning through Restricted Boltzmann Machines [29, 35]. A first attempt to link the properties of the phase diagram to the challenges of training a Restricted Boltzmann Machine from data and extracting statistically relevant features can be found in [8].
Acknowledgements
A.B. acknowledges partial financial support by the National Group of Mathematical Physics GNFM-INdAM. G.G. is supported by the NCCR SwissMAP. D.T. is supported by Scuola Normale Superiore and by the National Group of Mathematical Physics GNFM-INdAM.

Appendix A. Derivation of equations (1.7)-(1.12)
Consider a bipartite system with N₁ σ-spins and N₂ τ-spins, N = N₁ + N₂, α = N₂/N, and partition function

Z_N(β, α; ξ) = E_{σ,τ} exp( √(β/N) Σ_{i=1}^{N₁} Σ_{μ=1}^{N₂} ξ_i^μ σ_i τ_μ ),   (A.1)

with the expectation being over generic spin distributions P_σ(σ) and P_τ(τ). We assume there are ℓ₁ = O(1) condensed patterns associated with the first ℓ₁ σ-variables and similarly ℓ₂ condensed patterns associated with the first ℓ₂ τ-variables, and two families of overlaps

m_μ(σ) = (1/N₁) Σ_{i>ℓ₁} ξ_i^μ σ_i,  n_i(τ) = (1/N₂) Σ_{μ>ℓ₂} ξ_i^μ τ_μ,   (A.2)

and (with a, b replica indices)

q_{ab} = (1/N₁) Σ_{i>ℓ₁} σ_i^a σ_i^b,  r_{ab} = (1/N₂) Σ_{μ>ℓ₂} τ_μ^a τ_μ^b.   (A.3)

Then

Z_N(β, α; ξ) = E_{σ,τ} exp( √(β/N) Σ_{i≤ℓ₁} Σ_{μ>ℓ₂} ξ_i^μ σ_i τ_μ + √(β/N) Σ_{μ≤ℓ₂} Σ_{i>ℓ₁} ξ_i^μ σ_i τ_μ )
× exp( √(β/N) Σ_{i>ℓ₁} Σ_{μ>ℓ₂} ξ_i^μ σ_i τ_μ + √(β/N) Σ_{i≤ℓ₁} Σ_{μ≤ℓ₂} ξ_i^μ σ_i τ_μ )   (A.4)
∼ E_{σ,τ} exp( N₂ √(β/N) Σ_{i≤ℓ₁} n_i(τ) σ_i + N₁ √(β/N) Σ_{μ≤ℓ₂} m_μ(σ) τ_μ + √(β/N) Σ_{i>ℓ₁} Σ_{μ>ℓ₂} ξ_i^μ σ_i τ_μ ),

where we have neglected the last, non-extensive, term of (A.4). Constraining the values of the overlaps we get

Z_N = ∫ {dm_μ dm̂_μ dn_i dn̂_i} exp( −iN ( Σ_{i≤ℓ₁} n_i n̂_i + Σ_{μ≤ℓ₂} m_μ m̂_μ ) )
× E_{σ,τ} exp( N₂ √(β/N) Σ_{i≤ℓ₁} n_i σ_i + N₁ √(β/N) Σ_{μ≤ℓ₂} m_μ τ_μ )   (A.5)
× E_{σ,τ} exp( (i/α) Σ_{i≤ℓ₁} n̂_i Σ_{μ>ℓ₂} ξ_i^μ τ_μ + (i/(1−α)) Σ_{μ≤ℓ₂} m̂_μ Σ_{i>ℓ₁} ξ_i^μ σ_i + √(β/N) Σ_{i>ℓ₁} Σ_{μ>ℓ₂} ξ_i^μ σ_i τ_μ ).
We recall the definition of Ω_{σ,τ} and u_{σ,τ} from the Introduction: u_{σ,τ} is the cumulant generating function of P_{σ,τ}, to wit

u_{σ,τ}(h) = ln E_{P_{σ,τ}}[e^{hx}]  and  lim_{N→∞} (1/N) u_{σ,τ}(√N x) = Ω_{σ,τ} x²/2.   (A.6)

Then the terms in the second line of (A.5) become

E_{σ,τ} exp( N₂ √(β/N) Σ_{i≤ℓ₁} n_i σ_i + N₁ √(β/N) Σ_{μ≤ℓ₂} m_μ τ_μ ) = exp( βN ( (α²Ω_σ/2) Σ_{i≤ℓ₁} n_i² + ((1−α)²Ω_τ/2) Σ_{μ≤ℓ₂} m_μ² ) ),   (A.7)

while, after introducing replicas and averaging over the disorder, the last term in (A.5) gives (with u_ξ the cumulant generating function associated with the patterns)

E_ξ exp( √(β/N) Σ_{a=1}^n Σ_{i>ℓ₁} Σ_{μ>ℓ₂} ξ_i^μ σ_i^a τ_μ^a ) = exp( Σ_{i>ℓ₁} Σ_{μ>ℓ₂} u_ξ( √(β/N) Σ_{a=1}^n σ_i^a τ_μ^a ) ) ∼ exp( (β/2N) Σ_{i>ℓ₁} Σ_{μ>ℓ₂} Σ_{a,b=1}^n σ_i^a σ_i^b τ_μ^a τ_μ^b ).

(Here we have used that the patterns have unit variance, hence u_ξ(x) = x²/2 + …, and neglected corrections in 1/N.) This term becomes exp( (βN α(1−α)/2) Σ_{ab} q_{ab} r_{ab} ) once it is expressed in terms of the order parameters q and r, bearing in mind that the missing spins σ_1, …, σ_{ℓ₁} and τ_1, …, τ_{ℓ₂} constitute a vanishing fraction of the total number.
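The unit-variance normalisation used here is easy to check explicitly. If the pattern prior has the interpolating form ξ = δε + √(1−δ²) g (our assumption for the explicit form, with ε = ±1 a fair coin and g a standard Gaussian, independent), its cumulant generating function is available in closed form because the two contributions add, and its small-argument expansion is x²/2 + O(x⁴) regardless of δ:

```python
import math

def u_xi(x, delta):
    """Cumulant generating function ln E[exp(x*xi)] for xi = delta*eps + sqrt(1-delta^2)*g,
    eps = +/-1 and g ~ N(0,1) independent: ln cosh for the Boolean part,
    a pure quadratic for the Gaussian part."""
    return math.log(math.cosh(delta * x)) + 0.5 * (1.0 - delta**2) * x**2
```

At δ = 1 this reduces to ln cosh x (Boolean patterns, Ω_ξ = 0) and at δ = 0 to x²/2 (Gaussian patterns, Ω_ξ = 1); in all cases u_ξ(x) = x²/2 + O(x⁴), which is all the disorder average above relies on.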
Now averaging over the spin variables we get the other two terms in the last line of (A.5), where we also include the contributions from constraining the q and r order parameters:

E_σ exp( (i/(1−α)) Σ_{a=1}^n Σ_{μ≤ℓ₂} m̂_{μa} Σ_{i>ℓ₁} ξ_i^μ σ_i^a + (i/(1−α)) Σ_{a,b=1}^n q̂_{ab} Σ_{i>ℓ₁} σ_i^a σ_i^b ) = exp( N(1−α) ⟨ ln E_σ exp( (i/(1−α)) ( Σ_{a=1}^n Σ_{μ≤ℓ₂} m̂_{μa} ξ^μ σ^a + Σ_{a,b=1}^n q̂_{ab} σ^a σ^b ) ) ⟩_ξ )

and

E_τ exp( (i/α) Σ_{a=1}^n Σ_{i≤ℓ₁} n̂_{ia} Σ_{μ>ℓ₂} ξ_i^μ τ_μ^a + (i/α) Σ_{a,b=1}^n r̂_{ab} Σ_{μ>ℓ₂} τ_μ^a τ_μ^b ) = exp( Nα ⟨ ln E_τ exp( (i/α) ( Σ_{a=1}^n Σ_{i≤ℓ₁} n̂_{ia} ξ_i τ^a + Σ_{a,b=1}^n r̂_{ab} τ^a τ^b ) ) ⟩_ξ ).

Collecting all the terms we get an expression for E[Z_N^n] which depends on the parameters m_{μa}, n_{ia}, q_{ab} and r_{ab}:

E[Z_N^n] = ∫ {dm_{μa} dm̂_{μa}} {dn_{ia} dn̂_{ia}} {dq_{ab} dq̂_{ab}} {dr_{ab} dr̂_{ab}} e^{N f({m_{μa}},{n_{ia}},{q_{ab}},{r_{ab}})},   (A.8)

with

f = − (β(1−α)²Ω_τ/2) Σ_{μ,a} m_{μa}² − (βα²Ω_σ/2) Σ_{i,a} n_{ia}² − (βα(1−α)/2) Σ_{a,b} q_{ab} r_{ab}
+ (1−α) ⟨ ln E_σ exp( β(1−α)Ω_τ Σ_{a=1}^n (m_a · ξ) σ^a + (βα/2) Σ_{a,b=1}^n r_{ab} σ^a σ^b ) ⟩_ξ
+ α ⟨ ln E_τ exp( βα Ω_σ Σ_{a=1}^n (n_a · ξ) τ^a + (β(1−α)/2) Σ_{a,b=1}^n q_{ab} τ^a τ^b ) ⟩_ξ.   (A.9)

By a saddle point calculation we obtain immediately

i m̂_{μa} = β(1−α)² Ω_τ m_{μa},  i n̂_{ia} = βα² Ω_σ n_{ia},  i q̂_{ab} = (βα(1−α)/2) r_{ab},  i r̂_{ab} = (βα(1−α)/2) q_{ab},   (A.10)

and in the RS ansatz, assuming that

m_{μa} = m_μ,  n_{ia} = n_i,  q_{ab} = Q δ_{ab} + q(1−δ_{ab}),  r_{ab} = R δ_{ab} + r(1−δ_{ab}),   (A.11)

taking the limit n → 0 and extremizing (A.9) we get the saddle point equations (1.7)-(1.12).

Appendix B. Gaussian bipartite and spherical Hopfield model
The bipartite system with Gaussian priors on both layers (Ω_σ = Ω_τ = 1) can be related to a spherical Hopfield model [44, 17, 7] via Legendre duality, as in [27]. In fact, integrating over the radius r√N we have

Z_g(β) = ∫ dr (e^{−Nr²/2}/√(2π)^N) ∫ dΣ_{r√N}(σ) e^{−βH(σ)} = ∫ dr (e^{−Nr²/2}/√(2π)^N) Z_s^{r√N}(β)
= ∫ dr (e^{−Nr²/2}/√(2π)^N) r^{N−1} ∫ dΣ_{√N}(σ) e^{−βr²H(σ)} = ∫ dr (e^{−Nr²/2}/√(2π)^N) r^{N−1} Z_s^{√N}(βr²),   (B.1)

where dΣ_r(σ) is the uniform measure over the sphere of radius r and (Z_g, Z_s) are respectively the partition functions of the Gaussian and spherical models. Thus the two free energies, f_g and f_s, are related by

−βf_g = sup_r [ −r²/2 −
(1/2) ln(2π) + ln(r) − βf_s(βr²) ]   (B.2)

and so the Gaussian free energy comes from the spherical free energy calculated at the optimal radius, given by

r² = [ 1 − β ∂_β(−βf_s(β))|_{βr²} ]⁻¹.   (B.3)

Since r² = Q, the self overlap of the σ-spins (first layer), and using the expression for the spherical free energy from [17, 7], we have, in the high-temperature region,

Q = 1/(1 − βα R(Q))  and  R(Q) = 1/(1 − β(1−α) Q).   (B.4)

These are exactly the corresponding equations of Section 2 with Ω_σ = Ω_τ = 1. Moreover, again from [17, 7], the critical line for the spherical model is given by (1 − β(1−α))² = β²α(1−α). Thus we obtain the critical line for the Gaussian model (2.3) by replacing β → βQ:

β² α(1−α) Q² (1 − β(1−α)Q)⁻² = 1 = β² α(1−α) Q² R².   (B.5)

References

[1] E. Agliari, A. Annibale, A. Barra, A.C.C. Coolen, D. Tantari,
Immune networks: multitasking capabilities near saturation, J. Phys. A: Math. Theor. 46, 415003 (2013).[2] E. Agliari, A. Annibale, A. Barra, A. C. C. Coolen and D. Tantari,
Immune networks: multi-tasking capabilities at medium load, J. Phys. A: Math. Theor. 46, 335101 (2013).[3] E. Agliari, A. Barra, C. Longo, D. Tantari,
Neural Networks retrieving Boolean patterns in a sea of Gaussian ones ,Journal of Statistical Physics 1-20, DOI 10.1007/s10955-017-1840-9, (2017)[4] D.J. Amit, H. Gutfreund, H. Sompolinsky,
Spin Glass model of neural networks, Phys. Rev. A, 1007-1018, (1985).[5] D.J. Amit, H. Gutfreund, H. Sompolinsky, Storing infinite numbers of patterns in a spin glass model of neural networks, Phys. Rev. Lett., 1530-1533, (1985).[6] A. Auffinger, W.-K. Chen, Free energy and complexity of spherical bipartite models, J. Stat. Phys. 157, 40-59, (2014).[7] J. Baik, J. O. Lee,
Fluctuations of the free energy of the spherical Sherrington-Kirkpatrick model, J. Stat. Phys. (2016): 185-224.[8] A. Barra, G. Genovese, P. Sollich and D. Tantari,
Phase transitions in Restricted Boltzmann Machines with generic priors ,preprint arXiv:1612.03132 (2016)[9] A. Barra, P. Contucci, E. Mingione, D. Tantari,
Multi-Species Mean Field Spin Glasses. Rigorous Results, Annales Henri Poincaré
16 (3), 691-708 (2015)[10] A. Barra, A. Bernacchia, E. Santucci, P. Contucci,
On the equivalence among Hopfield neural networks and restricted Boltzmann machines, Neur. Net., 1-9, (2012).[11] A. Barra, F. Guerra, About the ergodic regime in the analogical Hopfield neural networks: Moments of the partition function, J. Math. Phys. 49, 125217, (2008).[12] A. Barra, G. Genovese, F. Guerra,
The Replica Symmetric Behaviour of the Analogical Neural Network , J. Stat. Phys.142, 654, (2010).[13] A. Barra, G. Genovese, F. Guerra,
Equilibrium statistical mechanics of bipartite spin systems , J. Phys. A: Math.Theor. , 245002 (2011).[14] A. Barra, G. Genovese, F. Guerra, D. Tantari, About a solvable mean field model of a Gaussian spin glass , J. Phys.A: Math. Theor. , 155002, (2014);[15] A. Barra, G. Genovese, F. Guerra, D. Tantari, How glassy are neural networks? , J. Stat. Mech. P07009 (2012).[16] G. Ben Arous, A. Dembo, A. Guionnet,
Aging of spherical spin glasses , Prob. Theor. Related Fields 120, 1, (2001).[17] D. Bollé, T. M. Nieuwenhuizen, I. P. Castillo,
A spherical Hopfield model, J. Phys. A: 10269-10277, (2003).[18] A. Bovier, Statistical mechanics of disordered systems. A mathematical perspective, Cambridge University Press, (2006).[19] A. Bovier, A.C.D. van Enter, B. Niederhauser.
Stochastic symmetry-breaking in a Gaussian Hopfield model
J. Statist.Phys., 95(1-2):181-213, (1999).[20] A. C. C. Coolen, R. Kühn, P. Sollich.
Theory of Neural Information Processing . Oxford University Press, (2005).[21] A. Crisanti, D. J. Amit and H. Gutfreund,
Saturation Level of the Hopfield Model for Neural Network, Europhys. Lett., 337-341, (1986).
[22] A. Engel, C. Van den Broeck,
Statistical mechanics of Learning , Cambridge Press (2001).[23] M. Gabrie, E. W. Tramel, and F. Krzakala,
Training restricted Boltzmann machine via the Thouless-Anderson-Palmer free energy. In Advances in Neural Information Processing Systems, pages 640-648, (2015).[24] E. Gardner,
The space of interactions in neural network models , J. Phys. A 21, 257-270, (1988).[25] E. Gardner, B. Derrida,
Optimal storage properties of neural network models , J. Phys. A 21, 271-284 (1988).[26] G. Genovese,
Universality in Bipartite Mean Field Spin Glasses , J. Math. Phys. , 123304, (2012);[27] G. Genovese, D. Tantari, Legendre Duality of Spherical and Gaussian Spin Glasses , Math. Phys. Anal. Geom. 18, 1,(2015).[28] G. Genovese, D. Tantari,
Non-Convex Multipartite Ferromagnets, J. Stat. Phys.: 492-513, (2016).[29] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press (2016).[30] D.O. Hebb,
The organization of behaviour: a neuropsychological theory, Psych. Press (1949).[31] J. Hertz, A. Krogh, R. G. Palmer.
Introduction to the theory of neural computation , Santa Fe Institute Studies in theSciences of Complexity; Lecture Notes, Redwood City, Ca.: Addison-Wesley, (1991).[32] J.J. Hopfield,
Neural networks and physical systems with emergent collective computational abilities , Proc. Nat. Acad.Sci. USA , 2554-2558 (1982).[33] J.J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons ,Proc. Nat. Acad. Sci. .10 (1984): 3088-3092.[34] H. Huang, Statistical mechanics of unsupervised feature learning in a restricted Boltzmann machine with binarysynapses , arXiv preprint arXiv:1612.01717 (2016)[35] Y. LeCun, Y. Bengio, G. Hinton,
Deep learning , Nature (7553): 436-444, (2015).[36] M. Mezard,
Mean-field message-passing equations in the Hopfield model and its generalizations , Phys. Rev. E 95,022117 (2017).[37] D.C. Mattis,
Solvable spin system with random interactions , Phys. Lett., 56(A):421-2, 1976.[38] W.S. McCulloch, W. Pitts,
A logical calculus of the ideas immanent in nervous activity , Bull. Math. Biophys. ,115-133, (1943).[39] M. Mézard, G. Parisi, M. A. Virasoro, Spin glass theory and beyond , World Scientific, Singapore, (1987).[40] J.-P. Naef, A. Canning,
Reentrant spin glass behaviour in the replica symmetric solution of the Hopfield neural network model, J. de Phys.
I 2.3 (1992): 247-250.[41] D. Panchenko,
The Free Energy in a Multi-Species Sherrington-Kirkpatrick Model , Ann. Prob. 43, 3494-3513 (2015).[42] F. Rosenblatt,
The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Psych. Rev., 386-408, (1958).[43] H.S. Seung, H. Sompolinsky, N. Tishby, Statistical mechanics of learning from examples, Phys. Rev. A (8): 6056, (1992).[44] M. Shcherbina, B. Tirozzi, Rigorous Solution of the Gardner Problem, Comm. Math. Phys. 234, p. 383-422 (2003).[45] P. Sollich, D. Tantari, A. Annibale, A. Barra,
Extensive parallel processing on scale free networks , Phys. Rev. Lett. , 238106, (2014).[46] M. Talagrand,
Mean Field Models for Spin Glasses , Vol. 1,2, Springer-Verlag Berlin Heidelberg (2011).[47] E. W. Tramel, A. Manoel, F. Caltagirone, M. Gabrie, and F. Krzakala,
Inferring sparsity: Compressed sensing using generalized restricted Boltzmann machines, arXiv preprint arXiv:1606.03956, (2016).[48] M. Welling, C. Sutton, Learning in Markov random fields with contrastive free energies, Proc. Intern. Workshop on AI and Statistics (AISTATS05), 397-404, (2005).
Adriano Barra: Dipartimento di Matematica e Fisica Ennio De Giorgi, Università del Salento, Lecce,Italy
E-mail address : [email protected] Giuseppe Genovese: Institut für Mathematik, Universität Zürich, CH-8057 Zürich, Switzerland.
E-mail address : [email protected] Peter Sollich: Department of Mathematics, King’s College London, London WC2R 2LS, UK.
E-mail address : [email protected] Daniele Tantari: Scuola Normale Superiore, Centro Ennio de Giorgi, Piazza dei Cavalieri 3, I-56100Pisa, Italy.
E-mail address: [email protected]