Recognition Capabilities of a Hopfield Model with Auxiliary Hidden Neurons
Marco Benedetti, Victor Dotsenko, Giulia Fischetti, Enzo Marinari, Gleb Oshanin
Università di Roma La Sapienza, Rome, Italy
Sorbonne Université, CNRS, Laboratoire de Physique Théorique de la Matière Condensée (UMR 7600), 4 Place Jussieu, F-75252 Paris Cedex 05, France
CNR-Nanotec and INFN, Sezione di Roma 1, Rome, Italy
(Dated: January 14, 2021)

We study the recognition capabilities of the Hopfield model with auxiliary hidden layers, which emerge naturally upon a Hubbard-Stratonovich transformation. We show that the recognition capabilities of such a model at zero temperature outperform those of the original Hopfield model, due to a substantial increase of the storage capacity and the lack of a naturally defined basin of attraction. The modified model does not fall abruptly into a regime of complete confusion when the memory load exceeds a sharp threshold.
Introduction -
Modeling neural networks as Ising spin systems is a rich and interesting field. Starting from the seminal paper by Little [1], it was realized that disordered spin systems can store information, working as content-addressable memories. In this context, the model proposed by Hopfield [2] (H from now on) has often served as a reference. Its phase diagram [3-5] contains a retrieval phase, where one can use a system composed of N neurons to "store" patterns containing N symbols. By storage one means that patterns can be recovered: starting from the exact pattern we have stored, or from a damaged pattern where a fraction η of the spins do not coincide with the configuration we want to retrieve, we end up close enough to it. Despite being robust in many respects, this model has one essential shortcoming: it is impossible to store P different uncorrelated patterns if α ≡ P/N is larger than a critical value α_c ≈ 0.138. When α > α_c, every memory is abruptly forgotten, and no pattern can be retrieved. During the last decades, several approaches have been proposed to remedy this and other related issues (see for example [6-20]). Here we propose to tackle this problem from a different perspective.

Models and Techniques -
H [2] describes a system of N binary neurons σ_i = ±1, i ∈ {1, ..., N}, with a long-range, spin-glass-like Hamiltonian [21]

$$H[J, \sigma] = -\frac{1}{2} \sum_{i,j=1}^{N} J_{ij}\, \sigma_i \sigma_j\,, \qquad J_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} \xi_i^\mu \xi_j^\mu\,.$$

The quenched coupling matrix J_ij is defined according to Hebb's learning rule [22], where ξ_i^μ = ±1, μ ∈ {1, ..., P}, i ∈ {1, ..., N} are the P configurations (patterns) that we want to be able to retrieve. The partition function reads

$$Z = \sum_{\{\sigma\}} \exp\Big\{ \frac{\beta}{2N} \sum_{i,j=1}^{N} \sum_{\mu=1}^{P} \xi_i^\mu \xi_j^\mu \sigma_i \sigma_j \Big\}\,, \qquad (1)$$

where β is the inverse temperature of the system. Recently [23] it has been shown that H can be thought of as the result of a Hubbard-Stratonovich transformation,

$$Z = \int_{-\infty}^{+\infty} \!\cdots\! \int_{-\infty}^{+\infty} \prod_{\mu=1}^{P} dX_\mu \sum_{\{\sigma\}} \exp\big\{ -\beta \tilde H[\xi, X, \sigma] \big\}\,, \qquad (2)$$

where the X_μ are P Gaussian auxiliary variables and

$$\tilde H[\xi, X, \sigma] \equiv \frac{N}{2} \sum_{\mu=1}^{P} X_\mu^2 + \sum_{\mu=1}^{P} \sum_{i=1}^{N} \sigma_i \xi_i^\mu X_\mu\,. \qquad (3)$$

Integrating over the X_μ in eq. (2) leads back to eq. (1), and corresponds to assuming complete thermalization of the X_μ variables. In this Letter we follow a different strategy: we regard the model defined by eq. (2) as fundamental (the X-model from now on), considering the continuous auxiliary variables as hidden neurons of our system. The hidden neurons enter the learning dynamics on the same footing as the two-state σ variables. At T = 0, energy barriers can and do break the equivalence between the two models, making joint thermalization of the X_μ and of the σ_i variables impossible. Numerical simulations convincingly demonstrate that this has drastic consequences on the retrieval properties of the system.

At T = 0, the recognition process in the X-model is led by a steepest-descent procedure, with sequential updating. One sweep is composed of two steps. First one fixes all the X_μ variables to minimize the energy given the σ_i: the optimal value is X_μ^opt = −(1/N) Σ_i σ_i ξ_i^μ. Then one updates all the σ_i for fixed {X_μ}. When nothing changes in a full sweep, we have reached a fixed point.
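The two-step sweep just described translates directly into code. Here is a minimal sketch of the T = 0 steepest-descent dynamics under the definitions above; the function names and structure are ours, not the paper's:

```python
import numpy as np

def make_patterns(P, N, rng):
    """P random uncorrelated binary patterns xi[mu, i] = +/-1."""
    return rng.choice([-1, 1], size=(P, N))

def sweep(sigma, xi):
    """One sweep of the X-model at T = 0.

    Step 1: set every hidden neuron to its optimum given sigma,
            X_mu = -(1/N) sum_i sigma_i xi_i^mu.
    Step 2: update each sigma_j once, sequentially, at fixed {X_mu};
            flipping sigma_j changes the energy of eq. (3) by
            dE = -2 sigma_j sum_mu X_mu xi_j^mu, so we flip iff dE < 0.
    Returns the number of spins flipped."""
    N = sigma.size
    X = -(xi @ sigma) / N
    flips = 0
    for j in range(N):
        if -2.0 * sigma[j] * (X @ xi[:, j]) < 0.0:
            sigma[j] = -sigma[j]
            flips += 1
    return flips

def relax(sigma, xi, max_sweeps=10_000):
    """Iterate sweeps until a fixed point (no flips in a full sweep).
    Returns the number of sweeps used."""
    for s in range(1, max_sweeps + 1):
        if sweep(sigma, xi) == 0:
            return s
    return max_sweeps
```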
We study both the X-model and H for different values of α, η and N, with α = 0.05, 0.08, 0.1 to 0.18 in steps of 0.01, 0.2, 0.22, 0.25 and 0.3 (for the X-model we have also added simulations at higher values of α, both below and above one). We have used η = 0, 0.01, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25 and 0.35. For both systems we have studied 10 samples for N = 128 and 256, 2·10 samples for N = 512, 10 samples for N = 1024, 10 samples for N = 2048, 20 samples for N = 4096, and a small variable number of samples (normally of order 10) for N = 8192.
FIG. 1: The average overlap ⟨ω⟩ as a function of α for H. Left: N = 8192 and several values of η. Right: η = 0 and several values of N.

Finite T Monte Carlo -
We first analyzed the finite-temperature structure of both H and the X-model, and verified that at T > 0 the two models behave consistently, as expected from the exact equivalence implied by eq. (2).
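For completeness, here is a minimal sketch of how such a finite-temperature Monte Carlo can be organized (our own construction, not the authors' code). Since eq. (3) is quadratic in each X_μ, given σ the X_μ are independent Gaussians with mean −m_μ/N and variance 1/(βN), where m_μ = Σ_i σ_i ξ_i^μ; they can therefore be resampled exactly by heat bath, alternated with Metropolis flips of the binary neurons:

```python
import numpy as np

def mc_sweep(sigma, xi, beta, rng):
    """One Monte Carlo sweep of the X-model at inverse temperature beta."""
    N = sigma.size
    m = xi @ sigma                              # m_mu = sum_i sigma_i xi_i^mu
    # Heat bath on the hidden neurons: exact Gaussian conditional given sigma.
    X = -m / N + rng.normal(scale=np.sqrt(1.0 / (beta * N)), size=m.size)
    # Metropolis on the binary neurons at fixed {X_mu}.
    for j in range(N):
        dE = -2.0 * sigma[j] * (X @ xi[:, j])   # energy change of flipping sigma_j
        if dE < 0.0 or rng.random() < np.exp(-beta * dE):
            sigma[j] = -sigma[j]
```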
The average overlap ⟨ω⟩ -
Next we work at T = 0, with the steepest-descent procedure described above. Here the X-model is allowed to behave differently from H. We start by measuring the average overlap ⟨ω⟩ between the stored pattern and the stable spin configuration reached when the energy minimization ends (we expect to have recognition when ⟨ω⟩ is large). Our results for H are shown in Fig. 1. On the left, we use the largest available value of N and plot the values of ⟨ω⟩ for different values of η. On the right, we select η = 0 and show the N-dependence of ⟨ω⟩. Things for H work as expected (see [3]). Increasing N at η = 0, the well-known transition builds up close to α_c ≈ 0.138. For η > 0, the transition moves to lower values of α, but stays very similar in nature and shape. Even at our highest value of η = 0.35 (where the overlap of the starting point with the original pattern is as low as 0.3), recognition still works at small enough α.
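As a usage illustration, one recognition run then consists of damaging a stored pattern and relaxing it; the parameter values below are illustrative, not the paper's grid, and the code reuses the functions of the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha, eta = 1024, 0.10, 0.10
P = max(1, int(alpha * N))

xi = make_patterns(P, N, rng)
sigma = xi[0].copy()                 # start from the stored pattern mu = 0 ...
sigma[rng.random(N) < eta] *= -1     # ... flipping each spin with probability eta

relax(sigma, xi)
omega = sigma @ xi[0] / N            # overlap with the stored pattern
print(f"omega = {omega:.3f}")
```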
Things are very different for the X-model (Fig. 2). Already in the η = 0 case, the sharp transition of H contrasts with a smooth, non-monotonic behavior of the X-model. At low α we still get a recognition phase, which persists to higher values of α than in H (and we will attribute some importance to this effect). Increasing α, we see that ⟨ω⟩ smoothly decreases, and reaches a minimum at intermediate values of α, where, for N = 8192, ⟨ω⟩ drops well below one. Increasing α further, ⟨ω⟩ starts to grow.
FIG. 2: As in Fig. 1, but for the X-model. Inset: ⟨ω⟩ as a function of α for N = 1024, η = 0.1.

This second, high-α regime does not correspond to a recognition phase. To gain some intuition about this, notice that the number of hidden neurons X_μ in our model is equal to P. Hence, it is clear that for very large α the X_μ are numerous enough to satisfy, by themselves, all the constraints of the problem. In turn, this implies that, as α → ∞, any configuration σ can be accommodated in an energy minimum simply by relaxing the hidden neurons to their optimal values, making the dynamics ineffective.

The behavior of the system when η > 0 is qualitatively similar. As in H, we still have a recognition region at low α, which shrinks for increasing η. We still have a smooth decrease of ⟨ω⟩ for increasing α, and a slow asymptotic increase, which becomes slower for increasing η. The asymptotic value for α → ∞ is exactly the initial overlap 1 − η (inset of Fig. 2). This clearly confirms that the large-α regime is not a recognition regime, but rather a regime of ineffective dynamics.

To get a deeper insight into this asymptotic behaviour, we can use the same technique adopted in [25], and look into what happens under a step of the steepest-descent dynamics. Consider any binary neuron configuration σ. By plugging the expression for X_μ^opt into the Hamiltonian of eq. (3), one sees that the change in energy upon flipping σ_j, after the X_μ variables have thermalized to their optimal value given σ and the patterns ξ^μ, is

$$\Delta_j \tilde H = \frac{2}{N} \sum_{\mu=1}^{P} \Big( 1 + \sigma_j \xi_j^\mu \sum_{h \neq j} \sigma_h \xi_h^\mu \Big)\,. \qquad (4)$$

The second contribution in the brackets is what one gets for H (it shows that σ_j is pulled by all the memories ξ^μ, with a strength proportional to the overlap between the memory ξ^μ and σ), while the additional 1 comes because of the interaction with the X variables. It is a memory-independent constant price that one has to pay, due to the fact that we are flipping a spin "against the will" of the X_μ, which were optimal for σ. This effectively introduces a threshold in the dynamics of σ, and a stabilizing effect for the configuration of the binary neurons. This stabilizing effect dominates the dynamics in the high-α region, making every configuration stable, and the network useless. On the contrary, and we take this as one of our important findings, the presence of the X_μ degrees of freedom helps the learning in the small-α regime, enlarging the recognition region as compared to H, and, what is maybe even more important, eliminating complete confusion for α larger than a sharp threshold. (If we include the X_μ degrees of freedom in the counting, and redefine α for the X-model accordingly, the comparison is still in favor of the X-model for η = 0, while at higher η the two pictures become very similar.)
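Eq. (4) is easy to check numerically. The following self-contained test (ours, for illustration) compares the predicted Δ_j H̃ with the direct energy difference of eq. (3), keeping the hidden neurons at their optimal values for the unflipped configuration:

```python
import numpy as np

def H_tilde(sigma, X, xi):
    """Energy of eq. (3): (N/2) sum_mu X_mu^2 + sum_mu X_mu sum_i sigma_i xi_i^mu."""
    return 0.5 * sigma.size * np.sum(X**2) + X @ (xi @ sigma)

rng = np.random.default_rng(1)
P, N, j = 40, 200, 7
xi = rng.choice([-1, 1], size=(P, N))
sigma = rng.choice([-1, 1], size=N)

X = -(xi @ sigma) / N                       # X_mu at their optimum given sigma
m_rest = xi @ sigma - sigma[j] * xi[:, j]   # sum_{h != j} sigma_h xi_h^mu
predicted = (2.0 / N) * np.sum(1.0 + sigma[j] * xi[:, j] * m_rest)

flipped = sigma.copy()
flipped[j] = -flipped[j]
direct = H_tilde(flipped, X, xi) - H_tilde(sigma, X, xi)
assert np.isclose(predicted, direct)        # eq. (4) reproduces the direct difference
```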
FIG. 3: The probability distribution P(ω) for H: η = 0.

The probability distribution P(ω) -
Even if ⟨ω⟩ gives us a good amount of information, it is appropriate to analyze the behavior of the full probability distribution P(ω). In Fig. 3 we show P(ω) at η = 0 for different values of α, for H. From left to right and from top to bottom we plot P(ω) for increasing values of α. The horizontal scales of the four frames are very different, and the plots are in linear-log scale. We first show (top left) results for the low values of α, where P(ω) is concentrated close to ω = 1. Going right from there, we plot again α = 0.13, to show that, just below the critical point, a few points are already at low overlap; the number of these points decreases as N increases. In the bottom left frame, at α = 0.14, the peak at high ω is still leading for all our values of N (remember that the y scale is a log scale), but a peak at low ω is also present. In the bottom right frame we show higher values of α: now the low-ω peaks start to dominate. Their location is very stable, and only shifts very slightly.

FIG. 4: As in Fig. 3, but for η = 0.25.

FIG. 5: As in Fig. 3, but for the X-model: η = 0 and η = 0.25.
In Fig. 4 we show P(ω) at η = 0.25 for H. After a rescaling of the value of α, everything is analogous to η = 0. The position of the low-ω peak is again remarkably constant. The only clear difference is that here, at high-intermediate values of α, a three-peak structure is visible (there is a clear peak at high, but smaller than one, values of ω). This is, for H, exactly what we expected.

As we show in Fig. 5, things are again very different for the X-model. Here we only need two frames for each value of η to clearly show our data. In the X-model we do not see any trace of bi-modality, but only a smooth behavior. For both η = 0 and η = 0.25, at low α we see that the mass of the distribution is centered close to one. When α increases, the distribution first shifts to lower values of ω, and eventually to larger ones, its N → ∞ limit developing a narrow peak and being centered at 1 − η.
FIG. 6: Recognition rate ρ as a function of α, for H and the X-model at several values of N, with ε = 0.033: η = 0 (left) and η = 0.15 (right).

The recognition rate -
In order to get further insight, we define the recognition rate ρ as the probability that a minimization run ends with ω ≥ 0.967 [3]: the threshold for recognition is ε = 1 − 0.967 = 0.033. For H, which undergoes a sharp transition, selecting a different threshold would give the same asymptotic result. We show in Fig. 6 ρ as a function of α, both for H and the X-model, for η = 0 in the left frame and for η = 0.15 in the right one. Even if, as we have seen in detail, the two models work very differently, the plots for H and for the X-model are similar, both at η = 0 and at η > 0. The X-model has a wider learning phase. We can say that there is a very-low-α regime, where the new X variables are irrelevant since they are not needed for recognition, and a very-large-α regime, where they fix the system on the observed pattern but cannot lead to recognition. Only in the region where α is slightly larger than α_c are they put to good use, helping the memorization. Also, when α increases, they avoid complete confusion: the X memory becomes less efficient if too many patterns are shown, but the blackout catastrophe is avoided.
The dynamical exponent -
We have also analyzed the rate of the learning dynamics. We assume that the number of sweeps S needed to reach the stable state scales asymptotically as N^ζ, plus N-dependent sub-leading corrections. In a sweep we include, for the X-model, both the cost of putting the X_μ in their optimal position and the cost of updating every σ_i once. In the absence of any slowing down we expect to find ζ = 0. We define an effective exponent, dependent on two values of N, as

$$\zeta(N_1, N_2) \equiv \log\big( S(N_2)/S(N_1) \big) \Big/ \log\big( N_2/N_1 \big)\,. \qquad (5)$$
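Eq. (5) translates into a one-line helper; S(N) would be the number of sweeps returned by relax, averaged over samples (the names are ours):

```python
import numpy as np

def zeta(S1, N1, S2, N2):
    """Effective exponent of eq. (5) from average sweep counts S1 = S(N1), S2 = S(N2)."""
    return np.log(S2 / S1) / np.log(N2 / N1)

# e.g. zeta(S_4096, 4096, S_8192, 8192) for the pair of sizes used in Figs. 7 and 8
```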
In Fig. 7 we show ζ as a function of α for H: on the right, η = 0 and different pairs of N; on the left, different η values, using N₁ = 4096 and N₂ = 8192. The effective exponent for η = 0 is small at small α, develops a δ-function-like peak of height close to one at α_c, and eventually decreases to an asymptotic large-α value close to 0.6. As expected, H is critical close to α_c. The situation at η > 0 is similar, until at our highest value η = 0.35 a sizeable critical peak cannot be detected anymore.

FIG. 7: The effective exponent ζ(N₁, N₂) for H. Left: N₁ = 4096, N₂ = 8192 and several values of η. Right: η = 0 and several pairs of N.

FIG. 8: As in Fig. 7, but for the X-model.

In Fig. 8 we show the same plot for the X-model and, again, here the situation is different. There is always a peak at low α (larger than α_c), but the N-dependence is not abrupt, and does not suggest that a δ-function behavior is emerging (even if one would need very large values of N to make sharp claims about this). Also, at η > 0 the peak is less pronounced than for H, and the effective exponent has a smooth, slow decay at large α.

Conclusions -
The introduction of hidden layers in the Hopfield model leads to interesting new features in the zero-temperature associative-memory performance. In our model, the probability distribution of the overlap, as well as its average value, differs markedly from the one of the Hopfield model. As a consequence, the recognition performance is improved. More importantly, the interaction between visible and hidden neurons has a stabilizing effect on the zero-temperature dynamics, which prevents the blackout catastrophe. This, together with the smaller value of the dynamical scaling exponent, implying a faster recognition process, suggests that our atypical hidden layers may considerably improve the functioning of Hopfield-like neural systems. This opens an interesting perspective for further research in the field of associative memory.
Acknowledgments -
We are very grateful to Stefano Fusi and Marc Mézard for precious conversations. We have been supported by funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (Grant No. 694925-LotglasSy).

[1] W. A. Little, Mathematical Biosciences 19, 101 (1974).
[2] J. Hopfield, Proceedings of the National Academy of Sciences 79, 2554 (1982).
[3] D. J. Amit, H. Gutfreund, and H. Sompolinsky, Annals of Physics 173, 30 (1987).
[4] J. L. van Hemmen and R. Kühn, Physical Review Letters 57, 913 (1986).
[5] B. M. Forrest, Journal of Physics A: Mathematical and General 21, 245 (1988).
[6] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Physical Review E 84, 066106 (2011).
[7] E. Gardner and B. Derrida, Journal of Physics A: Mathematical and General 21, 271 (1988).
[8] E. Rolls and A. Treves, Neural Networks and Brain Function (Oxford University Press, 1997).
[9] S. Fusi, P. J. Drew, and L. F. Abbott, Neuron 45, 599 (2005).
[10] S. Fusi and L. F. Abbott, Nature Neuroscience 10, 485 (2007).
[11] A. Barra, A. Bernacchia, E. Santucci, and P. Contucci, Neural Networks 34, 1 (2012).
[12] E. Agliari, A. Barra, A. De Antoni, and A. Galluzzi, Neural Networks 38, 52 (2013).
[13] E. Agliari, A. Barra, A. Galluzzi, F. Guerra, D. Tantari, and F. Tavani, Physical Review Letters 114, 028103 (2015).
[14] C. Baldassi, F. Gerace, H. J. Kappen, C. Lucibello, L. Saglietti, E. Tartaglione, and R. Zecchina, Physical Review Letters 120, 268103 (2018).
[15] C. Baldassi, F. Pittorino, and R. Zecchina, Proceedings of the National Academy of Sciences 117, 161 (2020).
[16] F. Schönsberg, Y. Roudi, and A. Treves, Physical Review Letters 126, 018301 (2021).
[17] G. Parisi, Journal of Physics A: Mathematical and General 19, L617 (1986).
[18] E. Marinari, Neural Computation 31, 503 (2019).
[19] J. L. van Hemmen, G. Keller, and R. Kühn, Europhysics Letters (EPL) 5, 663 (1988).
[20] J. L. van Hemmen, L. B. Ioffe, R. Kühn, and M. Vaas, Physica A 163, 386 (1990).
[21] M. Mézard, G. Parisi, and M. Virasoro, Spin Glass Theory and Beyond (World Scientific, 1987).
[22] D. O. Hebb, Journal of Clinical Psychology, 307 (1950).
[23] M. Mézard, Physical Review E 95, 022117 (2017).
[24] "A finite size scaling analysis of the Hopfield and of the X model," in preparation.
[25] V. Folli, M. Leonetti, and G. Ruocco, Frontiers in Computational Neuroscience 10, 144 (2016).