Recognition Capabilities of a Hopfield Model with Auxiliary Hidden Neurons
Marco Benedetti, Victor Dotsenko, Giulia Fischetti, Enzo Marinari, Gleb Oshanin
Università di Roma La Sapienza, Rome, Italy
Sorbonne Université, CNRS, Laboratoire de Physique Théorique de la Matière Condensée (UMR 7600), 4 Place Jussieu, F-75252 Paris Cedex 05, France
CNR-Nanotec and INFN, Sezione di Roma 1, Rome, Italy
(Dated: January 14, 2021)

We study the recognition capabilities of the Hopfield model with auxiliary hidden layers, which emerge naturally upon a Hubbard-Stratonovich transformation. We show that the recognition capabilities of such a model at zero temperature outperform those of the original Hopfield model, due to a substantial increase of the storage capacity and the lack of a naturally defined basin of attraction. The modified model does not fall abruptly into a regime of complete confusion when the memory load exceeds a sharp threshold.
Introduction -
Modeling neural networks as Ising spin systems is a rich and interesting field. Starting from the seminal paper by Little [1], it was realized that disordered spin systems can store information, working as content-addressable memories. In this context, the model proposed by Hopfield [2] (H from now on) has often served as a reference. Its phase diagram [3-5] contains a retrieval phase, where one can use a system composed of N neurons to "store" patterns containing N symbols. By storage one means that patterns can be recovered: starting from the exact pattern we have stored, or from a damaged pattern where a fraction η of the spins do not coincide with the configuration we want to retrieve, we end up close enough to it. Despite being robust in many respects, this model has one essential shortcoming: it is impossible to store P different uncorrelated patterns if α ≡ P/N is larger than a critical value α_c ≈ 0.138. When α > α_c, every memory is abruptly forgotten, and no pattern can be retrieved. During the last decades, several approaches have been proposed to remedy this and other related issues (see for example [6-20]). Here we propose to tackle this problem from a different perspective.

Models and Techniques -
H [2] describes a system of N binary neurons σ_i = ±1, i ∈ {1, ..., N}, with a long-range, spin-glass-like Hamiltonian [21]

$$H[J, \sigma] = -\frac{1}{2} \sum_{i,j=1}^{N} J_{ij}\, \sigma_i \sigma_j\,, \qquad J_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} \xi_i^\mu \xi_j^\mu\,.$$

The quenched coupling matrix J_ij is defined according to Hebb's learning rule [22], where ξ_i^μ = ±1, μ ∈ {1, ..., P}, i ∈ {1, ..., N} are the P configurations (patterns) that we want to be able to retrieve. The partition function reads

$$Z = \sum_{\{\sigma\}} \exp\Big\{ \frac{\beta}{2N} \sum_{i,j=1}^{N} \sum_{\mu=1}^{P} \xi_i^\mu \xi_j^\mu \sigma_i \sigma_j \Big\}\,, \qquad (1)$$

where β is the inverse temperature of the system. Recently [23] it has been shown that H can be thought of as the result of a Hubbard-Stratonovich transformation,

$$Z = \int_{-\infty}^{+\infty} \!\cdots\! \int_{-\infty}^{+\infty} \prod_{\mu=1}^{P} dX_\mu \sum_{\{\sigma\}} \exp\big\{ -\beta \tilde H[\xi, X, \sigma] \big\}\,, \qquad (2)$$

where the X_μ are P Gaussian auxiliary variables and

$$\tilde H[\xi, X, \sigma] \equiv \frac{N}{2} \sum_{\mu=1}^{P} X_\mu^2 + \sum_{\mu=1}^{P} \sum_{i=1}^{N} \sigma_i \xi_i^\mu X_\mu\,. \qquad (3)$$

Integrating over the X_μ in eq. (2) leads back to eq. (1), and corresponds to assuming complete thermalization of the X_μ variables. In this Letter we follow a different strategy: we regard the model defined by eq. (2) as fundamental (the X-model from now on), considering the continuous auxiliary variables as hidden neurons of our system. The hidden neurons enter the learning dynamics on the same footing as the two-state σ variables. At T = 0, energy barriers can and do break the equivalence between the two models, making joint thermalization of the X_μ and of the σ_i variables impossible. Numerical simulations convincingly demonstrate that this has drastic consequences on the retrieval properties of the system.

At T = 0, the recognition process in the X-model is led by a steepest-descent procedure, with sequential updating. One sweep is composed of two steps. First one fixes all the X_μ variables to minimize the energy given the σ_i: the optimal value is X_μ^opt = −(1/N) Σ_i σ_i ξ_i^μ. Then one updates all the σ_i for fixed {X_μ}. When nothing changes in a full sweep, we have reached a fixed point.
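The two-step sweep just described translates directly into code. Here is a minimal sketch of the T = 0 steepest-descent dynamics under the definitions above; the function names and structure are ours, not the paper's:

```python
import numpy as np

def make_patterns(P, N, rng):
    """P random uncorrelated binary patterns xi[mu, i] = +/-1."""
    return rng.choice([-1, 1], size=(P, N))

def sweep(sigma, xi):
    """One sweep of the X-model at T = 0.

    Step 1: set every hidden neuron to its optimum given sigma,
            X_mu = -(1/N) sum_i sigma_i xi_i^mu.
    Step 2: update each sigma_j once, sequentially, at fixed {X_mu};
            flipping sigma_j changes the energy of eq. (3) by
            dE = -2 sigma_j sum_mu X_mu xi_j^mu, so we flip iff dE < 0.
    Returns the number of spins flipped."""
    N = sigma.size
    X = -(xi @ sigma) / N
    flips = 0
    for j in range(N):
        if -2.0 * sigma[j] * (X @ xi[:, j]) < 0.0:
            sigma[j] = -sigma[j]
            flips += 1
    return flips

def relax(sigma, xi, max_sweeps=10_000):
    """Iterate sweeps until a fixed point (no flips in a full sweep).
    Returns the number of sweeps used."""
    for s in range(1, max_sweeps + 1):
        if sweep(sigma, xi) == 0:
            return s
    return max_sweeps
```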
We study both the X-model and H for different values of α, η and N, with α = 0.05, 0.08, 0.1 to 0.18 in steps of 0.01, 0.2, 0.22, 0.25 and 0.3 (for the X-model we have also added simulations at higher values of α, both below and above one). We have used η = 0, 0.01, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25 and 0.35. For both systems we have studied 10 samples for N = 128 and 256, 2·10 samples for N = 512, 10 samples for N = 1024, 10 samples for N = 2048, 20 samples for N = 4096, and a small variable number of samples (normally of order 10) for N = 8192.
FIG. 1: The average overlap ⟨ω⟩ as a function of α for H. Left: N = 8192 and several values of η. Right: η = 0 and several values of N.

Finite T Monte Carlo -
We first analyzed the finite-temperature structure of both H and the X-model, and verified that at T > 0 the two models behave consistently, as expected from the exact equivalence implied by eq. (2).
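For completeness, here is a minimal sketch of how such a finite-temperature Monte Carlo can be organized (our own construction, not the authors' code). Since eq. (3) is quadratic in each X_μ, given σ the X_μ are independent Gaussians with mean −m_μ/N and variance 1/(βN), where m_μ = Σ_i σ_i ξ_i^μ; they can therefore be resampled exactly by heat bath, alternated with Metropolis flips of the binary neurons:

```python
import numpy as np

def mc_sweep(sigma, xi, beta, rng):
    """One Monte Carlo sweep of the X-model at inverse temperature beta."""
    N = sigma.size
    m = xi @ sigma                              # m_mu = sum_i sigma_i xi_i^mu
    # Heat bath on the hidden neurons: exact Gaussian conditional given sigma.
    X = -m / N + rng.normal(scale=np.sqrt(1.0 / (beta * N)), size=m.size)
    # Metropolis on the binary neurons at fixed {X_mu}.
    for j in range(N):
        dE = -2.0 * sigma[j] * (X @ xi[:, j])   # energy change of flipping sigma_j
        if dE < 0.0 or rng.random() < np.exp(-beta * dE):
            sigma[j] = -sigma[j]
```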
The average overlap ⟨ω⟩ -
Next we work at T = 0, with the steepest-descent procedure described above. Here the X-model is allowed to behave differently from H. We start by measuring the average overlap ⟨ω⟩ between the stored pattern and the stable spin configuration reached when the energy minimization ends (we expect to have recognition when ⟨ω⟩ is large). Our results for H are shown in Fig. 1. On the left, we use the largest available value of N and plot the values of ⟨ω⟩ for different values of η. On the right, we select η = 0 and show the N-dependence of ⟨ω⟩. Things for H work as expected (see [3]). Increasing N at η = 0, the well-known transition builds up close to α_c ≈ 0.138. For η > 0, the transition moves to lower values of α, but stays very similar in nature and shape. Even at our highest value of η = 0.35 (where the overlap of the starting point with the original pattern is as low as 0.3), recognition still works at small enough α.
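As a usage illustration, one recognition run then consists of damaging a stored pattern and relaxing it; the parameter values below are illustrative, not the paper's grid, and the code reuses the functions of the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha, eta = 1024, 0.10, 0.10
P = max(1, int(alpha * N))

xi = make_patterns(P, N, rng)
sigma = xi[0].copy()                 # start from the stored pattern mu = 0 ...
sigma[rng.random(N) < eta] *= -1     # ... flipping each spin with probability eta

relax(sigma, xi)
omega = sigma @ xi[0] / N            # overlap with the stored pattern
print(f"omega = {omega:.3f}")
```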
Things are very different for the X-model (Fig. 2). Already in the η = 0 case, the sharp transition of H contrasts with a smooth, non-monotonic behavior of the X-model. At low α we still get a recognition phase, which persists to higher values of α than in H (and we will attribute some importance to this effect). Increasing α, we see that ⟨ω⟩ smoothly decreases, and reaches a minimum at intermediate values of α, where, for N = 8192, ⟨ω⟩ drops well below one. Increasing α further, ⟨ω⟩ starts to grow.
FIG. 2: As in Fig. 1, but for the X-model. Inset: ⟨ω⟩ as a function of α for N = 1024, η = 0.1.

This second, high-α regime does not correspond to a recognition phase. To gain some intuition about this, notice that the number of hidden neurons X_μ in our model is equal to P. Hence, it is clear that for very large α the X_μ are numerous enough to satisfy, by themselves, all the constraints of the problem. In turn, this implies that, as α → ∞, any configuration σ can be accommodated in an energy minimum simply by relaxing the hidden neurons to their optimal values, making the dynamics ineffective.

The behavior of the system when η > 0 is qualitatively similar. As in H, we still have a recognition region at low α, which shrinks for increasing η. We still have a smooth decrease of ⟨ω⟩ for increasing α, and a slow asymptotic increase, which becomes slower for increasing η. The asymptotic value for α → ∞ is exactly the initial overlap 1 − η (inset of Fig. 2). This clearly confirms that the large-α regime is not a recognition regime, but rather a regime of ineffective dynamics.

To get a deeper insight into this asymptotic behaviour, we can use the same technique adopted in [25], and look into what happens under a step of the steepest-descent dynamics. Consider any binary neuron configuration σ. By plugging the expression for X_μ^opt into the Hamiltonian of eq. (3), one sees that the change in energy upon flipping σ_j, after the X_μ variables have thermalized to their optimal value given σ and the patterns ξ^μ, is

$$\Delta_j \tilde H = \frac{2}{N} \sum_{\mu=1}^{P} \Big( 1 + \sigma_j \xi_j^\mu \sum_{h \neq j} \sigma_h \xi_h^\mu \Big)\,. \qquad (4)$$

The second contribution in the brackets is what one gets for H (it shows that σ_j is pulled by all the memories ξ^μ, with a strength proportional to the overlap between the memory ξ^μ and σ), while the additional 1 comes because of the interaction with the X variables. It is a memory-independent constant price that one has to pay, due to the fact that we are flipping a spin "against the will" of the X_μ, which were optimal for σ. This effectively introduces a threshold in the dynamics of σ, and a stabilizing effect for the configuration of the binary neurons. This stabilizing effect dominates the dynamics in the high-α region, making every configuration stable, and the network useless. On the contrary, and we take this as one of our important findings, the presence of the X_μ degrees of freedom helps the learning in the small-α regime, enlarging the recognition region as compared to H, and, what is maybe even more important, eliminating complete confusion for α larger than a sharp threshold. (If we include the X_μ degrees of freedom in the counting, and redefine α for the X-model accordingly, the comparison is still in favor of the X-model for η = 0, while at higher η the two pictures become very similar.)
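Eq. (4) is easy to check numerically. The following self-contained test (ours, for illustration) compares the predicted Δ_j H̃ with the direct energy difference of eq. (3), keeping the hidden neurons at their optimal values for the unflipped configuration:

```python
import numpy as np

def H_tilde(sigma, X, xi):
    """Energy of eq. (3): (N/2) sum_mu X_mu^2 + sum_mu X_mu sum_i sigma_i xi_i^mu."""
    return 0.5 * sigma.size * np.sum(X**2) + X @ (xi @ sigma)

rng = np.random.default_rng(1)
P, N, j = 40, 200, 7
xi = rng.choice([-1, 1], size=(P, N))
sigma = rng.choice([-1, 1], size=N)

X = -(xi @ sigma) / N                       # X_mu at their optimum given sigma
m_rest = xi @ sigma - sigma[j] * xi[:, j]   # sum_{h != j} sigma_h xi_h^mu
predicted = (2.0 / N) * np.sum(1.0 + sigma[j] * xi[:, j] * m_rest)

flipped = sigma.copy()
flipped[j] = -flipped[j]
direct = H_tilde(flipped, X, xi) - H_tilde(sigma, X, xi)
assert np.isclose(predicted, direct)        # eq. (4) reproduces the direct difference
```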
FIG. 3: The probability distribution P(ω) for H: η = 0.

The probability distribution P(ω) -
Even if ⟨ω⟩ gives us a good amount of information, it is appropriate to analyze the behavior of the full probability distribution P(ω). In Fig. 3 we show P(ω) at η = 0 for different values of α, for H. From left to right and from top to bottom we plot P(ω) for increasing values of α. The horizontal scales of the four frames are very different, and the plots are in linear-log scale. We first show (top left) results for the low values of α, where P(ω) is concentrated close to ω = 1. Going right from there, we plot again α = 0.13, to show that, just below the critical point, a few points are already at low overlap; the number of these points decreases as N increases. In the bottom left frame, at α = 0.14, the peak at high ω is still leading for all our values of N (remember that the y scale is a log scale), but a peak at low ω is also present. In the bottom right frame we show higher values of α: now the low-ω peaks start to dominate. Their location is very stable, and only shifts very slightly.

FIG. 4: As in Fig. 3, but for η = 0.25.

FIG. 5: As in Fig. 3, but for the X-model: η = 0 and η = 0.25.
In Fig. 4 we show P(ω) at η = 0.25 for H. After a rescaling of the value of α, everything is analogous to η = 0. The position of the low-ω peak is again remarkably constant. The only clear difference is that here, at high-intermediate values of α, a three-peak structure is visible (there is a clear peak at high, but smaller than one, values of ω). This is, for H, exactly what we expected.

As we show in Fig. 5, things are again very different for the X-model. Here we only need two frames for each value of η to clearly show our data. In the X-model we do not see any trace of bi-modality, but only a smooth behavior. For both η = 0 and η = 0.25, at low α we see that the mass of the distribution is centered close to one. When α increases, the distribution first shifts to lower values of ω, and eventually to larger ones, its N → ∞ limit developing a narrow peak and being centered at 1 − η.
FIG. 6: Recognition rate ρ as a function of α, for H and the X-model at several values of N, with ε = 0.033: η = 0 (left) and η = 0.15 (right).

The recognition rate -
In order to get further insight, we define the recognition rate ρ as the probability that a minimization run ends with ω ≥ 0.967 [3]: the threshold for recognition is ε = 1 − 0.967 = 0.033. For H, which undergoes a sharp transition, selecting a different threshold would give the same asymptotic result. We show in Fig. 6 ρ as a function of α, both for H and the X-model, for η = 0 in the left frame and for η = 0.15 in the right one. Even if, as we have seen in detail, the two models work very differently, the plots for H and for the X-model are similar, both at η = 0 and at η > 0. The X-model has a wider learning phase. We can say that there is a very-low-α regime, where the new X variables are irrelevant since they are not needed for recognition, and a very-large-α regime, where they fix the system on the observed pattern but cannot lead to recognition. Only in the region where α is slightly larger than α_c are they put to good use, helping the memorization. Also, when α increases, they avoid complete confusion: the X memory becomes less efficient if too many patterns are shown, but the blackout catastrophe is avoided.
The dynamical exponent -
We have also analyzed the rate of the learning dynamics. We assume that the number of sweeps S needed to reach the stable state scales asymptotically as N^ζ, plus N-dependent sub-leading corrections. In a sweep we include, for the X-model, both the cost of putting the X_μ in their optimal position and the cost of updating every σ_i once. In the absence of any slowing down we expect to find ζ = 0. We define an effective exponent, dependent on two values of N, as

$$\zeta(N_1, N_2) \equiv \log\big( S(N_2)/S(N_1) \big) \Big/ \log\big( N_2/N_1 \big)\,. \qquad (5)$$
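Eq. (5) translates into a one-line helper; S(N) would be the number of sweeps returned by relax, averaged over samples (the names are ours):

```python
import numpy as np

def zeta(S1, N1, S2, N2):
    """Effective exponent of eq. (5) from average sweep counts S1 = S(N1), S2 = S(N2)."""
    return np.log(S2 / S1) / np.log(N2 / N1)

# e.g. zeta(S_4096, 4096, S_8192, 8192) for the pair of sizes used in Figs. 7 and 8
```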
In Fig. 7 we show ζ as a function of α for H: on the right, η = 0 and different pairs of N; on the left, different η values, using N₁ = 4096 and N₂ = 8192. The effective exponent for η = 0 is small at small α, develops a δ-function-like peak of height close to one at α_c, and eventually decreases to an asymptotic large-α value close to 0.6. As expected, H is critical close to α_c. The situation at η > 0 is similar, until at our highest value η = 0.35 a sizeable critical peak cannot be detected anymore.

FIG. 7: The effective exponent ζ(N₁, N₂) for H. Left: N₁ = 4096, N₂ = 8192 and several values of η. Right: η = 0 and several pairs of N.

FIG. 8: As in Fig. 7, but for the X-model.

In Fig. 8 we show the same plot for the X-model and, again, here the situation is different. There is always a peak at low α (larger than α_c), but the N-dependence is not abrupt, and does not suggest that a δ-function behavior is emerging (even if one would need very large values of N to make sharp claims about this). Also, at η > 0 the peak is less pronounced than for H, and the effective exponent has a smooth, slow decay at large α.

Conclusions -
The introduction of hidden layers in the Hopfield model leads to interesting new features in the zero-temperature associative-memory performance. In our model, the probability distribution of the overlap, as well as its average value, differs markedly from the one of the Hopfield model. As a consequence, the recognition performance is improved. More importantly, the interaction between visible and hidden neurons has a stabilizing effect on the zero-temperature dynamics, which prevents the blackout catastrophe. This, together with the smaller value of the dynamical scaling exponent, implying a faster recognition process, suggests that our atypical hidden layers may considerably improve the functioning of Hopfield-like neural systems. This opens an interesting perspective for further research in the field of associative memory.
Acknowledgments -
We are very grateful to Stefano Fusi and Marc Mézard for precious conversations. We have been supported by funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (Grant No. 694925-LotglasSy).

[1] W. A. Little, Mathematical Biosciences 19, 101 (1974).
[2] J. Hopfield, Proceedings of the National Academy of Sciences 79, 2554 (1982).
[3] D. J. Amit, H. Gutfreund, and H. Sompolinsky, Annals of Physics 173, 30 (1987).
[4] J. L. van Hemmen and R. Kühn, Physical Review Letters 57, 913 (1986).
[5] B. M. Forrest, Journal of Physics A: Mathematical and General 21, 245 (1988).
[6] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Physical Review E 84, 066106 (2011).
[7] E. Gardner and B. Derrida, Journal of Physics A: Mathematical and General 21, 271 (1988).
[8] E. Rolls and A. Treves, Neural Networks and Brain Function (Oxford University Press, 1997).
[9] S. Fusi, P. J. Drew, and L. F. Abbott, Neuron 45, 599 (2005).
[10] S. Fusi and L. F. Abbott, Nature Neuroscience 10, 485 (2007).
[11] A. Barra, A. Bernacchia, E. Santucci, and P. Contucci, Neural Networks 34, 1 (2012).
[12] E. Agliari, A. Barra, A. De Antoni, and A. Galluzzi, Neural Networks 38, 52 (2013).
[13] E. Agliari, A. Barra, A. Galluzzi, F. Guerra, D. Tantari, and F. Tavani, Physical Review Letters 114, 028103 (2015).
[14] C. Baldassi, F. Gerace, H. J. Kappen, C. Lucibello, L. Saglietti, E. Tartaglione, and R. Zecchina, Physical Review Letters 120, 268103 (2018).
[15] C. Baldassi, F. Pittorino, and R. Zecchina, Proceedings of the National Academy of Sciences 117, 161 (2020).
[16] F. Schönsberg, Y. Roudi, and A. Treves, Physical Review Letters 126, 018301 (2021).
[17] G. Parisi, Journal of Physics A: Mathematical and General 19, L617 (1986).
[18] E. Marinari, Neural Computation 31, 503 (2019).
[19] J. L. van Hemmen, G. Keller, and R. Kühn, Europhysics Letters (EPL) 5, 663 (1988).
[20] J. L. van Hemmen, L. B. Ioffe, R. Kühn, and M. Vaas, Physica A 163, 386 (1990).
[21] M. Mézard, G. Parisi, and M. Virasoro, Spin Glass Theory and Beyond (World Scientific, 1987).
[22] D. O. Hebb, Journal of Clinical Psychology, 307 (1950).
[23] M. Mézard, Physical Review E 95, 022117 (2017).
[24] "A finite size scaling analysis of the Hopfield and of the X model," in preparation.
[25] V. Folli, M. Leonetti, and G. Ruocco, Frontiers in Computational Neuroscience 10, 144 (2016).