Efficiency of local learning rules in threshold-linear associative networks
Francesca Schönsberg, Yasser Roudi, and Alessandro Treves
SISSA, Scuola Internazionale Superiore di Studi Avanzati, Trieste, Italy
Kavli Institute for Systems Neuroscience & Centre for Neural Computation, NTNU, Trondheim, Norway
(Dated: July 27, 2020)

We show that associative networks of threshold-linear units endowed with Hebbian learning can operate closer to the Gardner optimal storage capacity than their binary counterparts and even surpass this bound. This is largely achieved through a sparsification of the retrieved patterns, which we analyze for theoretical and empirical distributions of activity. As reaching the optimal capacity via non-local learning rules like back-propagation requires slow and neurally implausible training procedures, our results indicate that one-shot self-organized Hebbian learning can be just as efficient.
INTRODUCTION
Local learning rules, those that change synaptic weights depending solely on pre- and post-synaptic activation, are generally considered to be more biologically plausible than non-local ones. They can be implemented through extensively studied processes in the synapses [1] and they allow neural networks to self-organise into content-addressable memory devices [2–4]. But how effective are local learning rules? Quite ineffective, has been the received wisdom since the 80's, when back-propagation algorithms came to the fore. However, this common sense is based on analysing networks of binary units [3, 5–7], while neurons in the brain are not binary. A better, but still mathematically simple, description of neuronal input-current-to-impulse-frequency transduction is via a threshold-linear (TL) transfer function [8–10], which is also the activation function predominantly adopted in recent deep learning applications [11–14] (often called, in that context, Rectified Linear Units, or ReLU). Therefore, one may ask whether the results from the 80's, highlighting the contrast between the effective, iterative procedures used in machine learning and the self-organized, one-shot, perhaps computationally ineffective Hebbian learning, are valid beyond binary units [15].
The Hopfield model [3], which includes a simple Hebbian [2] prescription for structuring all the connection weights in one go, had been analysed and found to be able to retrieve only up to p_max ≃ 0.14 N activity patterns, in a network of N binary units [5] with C = N − 1 connections per unit; the optimal capacity computed by Gardner [7] is instead p_max = 2C, about 14 times higher. This optimal capacity can be approached with iterative procedures based on the notion of a desired output for each unit, and of progressively reducing the difference between current and desired output – like backpropagation, in multi-layer neural networks. This consolidated the impression that unsupervised, Hebbian plasticity may well be of biological interest, but is rather inefficient and unsuitable for performance-driven machine learning applications. The negative characterization was not redeemed by the finding that in sparsely connected nets, those where C ≪ N, the pattern capacity α_c = p_max/C can be closer to the Gardner bound (a factor 3 away) [6], and can approach it when the coding is sparse, i.e., when the fraction of units active in each pattern is f ≪ 1 [16].

Here we compute the maximal storage capacity à la Gardner in networks of TL units, as a function of the fraction f of active units, and test our results by learning those weights with a TL perceptron. We show, first, that while for stored patterns that are binary such errorless capacity is larger than the Hebbian capacity no matter how sparse the code, this does not, in general, hold for non-binary stored patterns, and for other distributions the Hebbian capacity can even surpass the Gardner bound. This perhaps surprising violation of the bound arises because the Gardner calculation imposes an infinite output precision [17], while Hebbian learning exploits its loose precision to sparsify the retrieved pattern. In other words, with TL units, Hebbian capacity can get much closer to the optimal capacity, or even surpass it, by permitting errors and retrieving a sparser version of the stored pattern. Experimentally observed firing activity distributions from the inferior-temporal cortex [18], which can be taken as patterns to be stored, would be sparsified by about 50% by Hebbian learning, and would reach about 50%–80% of the Gardner capacity.
MODEL DESCRIPTION
We consider a network of N units and p patterns of activity, {η_i^μ} (i = 1,..,N; μ = 1,..,p), each representing one memory stored in the connection weights via some procedure. Each η_i^μ is drawn independently for each unit i and each memory μ from a common distribution Pr(η). We denote the activity of each unit i by v_i and assume that it is determined by the activity of the C units feeding into it, through a TL activation function

v_i = g [ h_i − ϑ ]_+ ,   h_i({v_j}) = (1/√C) Σ_j J_ij v_j ,   (1)

where [x]_+ = x for x > 0 and zero otherwise; the gain g and the threshold ϑ are fixed parameters, taken to be set for the whole network. In a fully connected, recurrent network, C = N − 1, but we intend to consider also more general cases of diluted connectivity. The storage capacity α_c ≡ p_max/C is then the maximal number of memories that the network can store and individually retrieve, per unit. The synaptic weights J_ij are taken to satisfy the spherical normalization condition, for all i,

Σ_{j≠i} J_ij² = C .   (2)

We are interested in finding the set of J_ij that satisfy Eq. (2), such that the patterns {η_i^μ} are self-consistent solutions of Eq. (1), namely that for all i and μ we have

h_i^μ = ϑ + η_i^μ/g   if η_i^μ > 0 ,   h_i^μ ≤ ϑ   if η_i^μ = 0 .
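To make the model concrete, the following minimal Python sketch implements the TL update of Eq. (1) and checks the self-consistency (retrieval) condition above for a given weight matrix. It is an illustrative sketch only: the function names and the choice of a random, non-learned weight matrix are our own assumptions, not part of the model specification.

```python
import numpy as np

def tl_update(v, J, g=10.0, theta=0.0):
    """One synchronous threshold-linear update: v_i = g [h_i - theta]_+,
    with h_i = (1/sqrt(C)) sum_j J_ij v_j (C = inputs per unit, Eq. (1))."""
    C = J.shape[1] - 1                      # full connectivity with zero diagonal
    h = J @ v / np.sqrt(C)
    return g * np.maximum(h - theta, 0.0)

def is_fixed_point(eta, J, g=10.0, theta=0.0, tol=1e-10):
    """Check the retrieval conditions: h_i = theta + eta_i/g where eta_i > 0,
    and h_i <= theta where eta_i = 0."""
    C = J.shape[1] - 1
    h = J @ eta / np.sqrt(C)
    active = eta > 0
    ok_active = np.allclose(h[active], theta + eta[active] / g, atol=tol)
    ok_silent = np.all(h[~active] <= theta + tol)
    return ok_active and ok_silent

# Toy usage with a random pattern and random weights (which, of course,
# will generally NOT satisfy the fixed-point condition):
rng = np.random.default_rng(0)
N = 200
J = rng.normal(size=(N, N)); np.fill_diagonal(J, 0.0)
J *= np.sqrt((N - 1) / (J**2).sum(axis=1, keepdims=True))   # spherical constraint, Eq. (2)
eta = np.where(rng.random(N) < 0.2, rng.exponential(1.0, N), 0.0)
print(is_fixed_point(eta, J))   # almost surely False for random J
```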
REPLICA ANALYSIS

Adapting the procedure introduced by Elizabeth Gardner [7] for binary units to our network, we evaluate the fractional volume of the space of the interactions J_ij which satisfy Eqs. (1) and (2); using the replica trick we obtain the standard order parameters m_a = (1/√C) Σ_j J_ij^a and q_ab = (1/C) Σ_j J_ij^a J_ij^b, corresponding, respectively, to the average of the weights within each replica and to their overlap between replicas (Supplemental Material at [URL], Sect. A). Assuming the replica symmetric ansatz simplifies q_ab ≡ q and m_a ≡ m. We focus on increasing the number p of stored memories, in the C → ∞ limit, up to when the volume of the compatible weights shrinks to a single point, i.e., there is a unique solution, and the maximal storage capacity has been reached. For this purpose we take the limit q → 1, corresponding to the case where all the replicated weights are equal, implying that only one configuration exists which satisfies the equations. Adding a further memory pattern would then make it impossible, in general, to satisfy all equations.

We have derived a system of two equations for the maximal storage capacity α_c = p_max/C of a network of threshold-linear units:

0 = − f ( x + d₁/(g√d) ) + (1 − f) ∫_x^∞ Dt (t − x)   (3)
1/α_c = f [ x² + d₂/(g²d) + 2x d₁/(g√d) + 1 ] + (1 − f) ∫_x^∞ Dt (t − x)²

where we have introduced the averages over Pr(η): d₁ ≡ ⟨η_i^μ⟩, d₂ ≡ ⟨(η_i^μ)²⟩ and d ≡ d₂ − d₁²; x = (ϑ − d₁ m)/√d is the normalized difference between the threshold and the mean input, f = Pr(η > 0) is the fraction of active units, and Dt ≡ dt e^{−t²/2}/√(2π) denotes the Gaussian measure.
The two equations yield x (hence implicitly setting an optimal value for ϑ) and α_c. Note that both equations can be understood as averages over units, respectively of the actual input and of the square input, which determine the amount of quenched noise and hence the storage capacity. The storage capacity then depends on the proportion f of active units, but also on the gain g and on the moments d₁ and d₂ of the distribution Pr(η). As can be seen in Fig. 1a, at fixed g, the storage capacity increases as more and more units remain below threshold, ceasing to contribute to the quenched noise. In fact, the storage capacity diverges, for f → 0, as

α_c ≈ { 2 f ln( 1/(√(2π) f) ) }^{−1} ;   (4)

see Supplemental Material at [URL], Sect. B, for the full derivation. At fixed f, there is an initially fast increase with g, followed by a plateau for larger values of g. One can show that α_c → g²/(g² + 1) as f → 1, i.e., when all the units in the memory patterns are above threshold, so that α_c < 1 for any finite g. At first sight this may seem absurd: a linear system of N independent equations and N variables always has an inverse solution, which would lead to a storage capacity of (at least) one. Similar to what was already noted in [17], however, the inverse solution does not generally satisfy the spherical constraint in Eq. (2); but it does, in our case, in the limit g → ∞, and this can also be understood as the reason why the capacity is highest when g is very large. In practice, Fig. 1 indicates that over a broad range of f values the storage capacity approaches its g → ∞ limit already for moderate values of the gain, while the dependence on d₁ and d₂ is only noticeable for small g, as can be seen by comparing Fig. 1c and d. In the g → ∞ limit, one sees that Eqs. (3) depend on Pr(η) only through f.

FIG. 1. Dependence of the Gardner capacity α_c on different parameters. α_c is plotted in (a) as a function of g and f, in (b) as a function of a = d₁²/d₂ for different values of f (g = 10), and in (c) and (d) as a function of d₁ and d₂, for a small value of g and for g = 10, respectively, at fixed f. Note that f restricts the available range of a, as a cannot be larger than f; the inaccessible ranges of a values are shadowed in (b), (c) and (d).

Eqs. (3), at g → ∞, have been verified by explicitly training a threshold-linear perceptron of N = 100 units and C = N connections with p binary patterns. We have evaluated α_c = p_max/C numerically by estimating p_max as the maximal number of patterns which can be retrieved with no errors. The numerical values for the storage capacity are depicted as red diamonds in Fig. 2; they follow the profile of the solid line describing the g → ∞ limit of Eq. (3). See Supplemental Material at [URL], Sect. C, for details about the algorithm.
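As an illustration of how the g → ∞ limit of Eq. (3) can be evaluated in practice, the short Python sketch below solves the reduced pair of saddle-point equations numerically for a given f (the reduced form used here, f x = (1 − f) ∫_x^∞ Dt (t − x) and 1/α_c = f(x² + 1) + (1 − f) ∫_x^∞ Dt (t − x)², is Eq. (44) of the Supplemental Material). The root-finding bracket is an implementation choice, not part of the analysis.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def gaussian_tails(x):
    """Return I1 = int_x^inf Dt (t - x) and I2 = int_x^inf Dt (t - x)^2,
    with Dt the standard Gaussian measure."""
    Q = norm.sf(x)            # int_x^inf Dt
    G = norm.pdf(x)           # int_x^inf Dt t
    I1 = G - x * Q
    I2 = (1.0 + x**2) * Q - x * G
    return I1, I2

def gardner_capacity(f):
    """Gardner storage capacity of TL units in the g -> infinity limit of Eq. (3)."""
    def saddle(x):
        I1, _ = gaussian_tails(x)
        return f * x - (1.0 - f) * I1
    x = brentq(saddle, -10.0, 20.0)          # bracket chosen generously
    _, I2 = gaussian_tails(x)
    return 1.0 / (f * (x**2 + 1.0) + (1.0 - f) * I2)

for f in (0.5, 0.1, 0.01):
    print(f, gardner_capacity(f))   # capacity grows as the coding gets sparser
```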
COMPARISON WITH A HEBBIAN RULE: THEORETICAL ANALYSIS

As described in the Introduction, when used in a fully connected Hopfield network of binary units, and with no sparse coding, Hebbian learning could only reach ∼ 1/14 of the Gardner bound. With very sparse connectivity, however, the same network can get to 1/π of the bound, even if it gets there through a second-order phase transition, where the retrieved pattern has vanishing overlap with the stored one [6]. With TL units, the appropriate comparison to the capacity à la Gardner is therefore that of a Hebbian network with extremely diluted connectivity, as in this limit the weights J_ij and J_ji are effectively independent and noise reverberation through loops is negligible. The capacity of such a sparsely connected TL network was evaluated analytically in [19]. Whereas in the g → ∞ limit the Gardner capacity depends on Pr(η) only via f, for Hebbian networks it does depend on the distribution, and most importantly on the sparsity a,

a = ⟨η_i^μ⟩² / ⟨(η_i^μ)²⟩ ,   (5)

whose relation to f depends on the distribution [19]. Here we focus on three examples of binary, ternary and quaternary distributions (Supplemental Material at [URL], Sect. D); in these examples, the parameters f and a are related through f = a, 9a/5 and 9a/4, respectively. The results of the numerical comparisons are shown in Fig. 2. Both the Hebbian and the Gardner capacity diverge in the sparse coding limit. One can see in Fig. 2a that when attention is restricted to binary patterns, the Gardner capacity seems to provide an upper bound to the capacity reached with Hebbian learning; more structured distributions of activity, however, dispel such a false impression: the quaternary example already shows higher capacity for sufficiently sparse patterns.
FIG. 2. Hebbian capacity vs. the Gardner bound. (a) α_c^H as a function of f for different sample distributions of stored patterns, compared to the universal α_c^G bound for errorless retrieval, i.e., the g → ∞ limit of Eq. (3); the red diamonds are reached with explicit TL perceptron training. (b) The sparsification of the stored patterns at retrieval, for Hebbian networks loaded at their capacity.

The bound, in fact, would only apply to perfect errorless retrieval, whereas Hebbian learning creates attractors which are, up to the Hebbian capacity limit, correlated with but not identical to the stored patterns, similarly to what occurs with binary units [5]. In particular, we notice that when considering TL units and Hebbian learning, in order to reach close to the capacity limit, the threshold has to be such as to produce sparser patterns at retrieval, in which only the units with the strongest inputs get activated. The sparsity a_r = ⟨v_i^μ⟩²/⟨(v_i^μ)²⟩ of the retrieved memory can be calculated [19]; see Supplemental Material at [URL], Sect. D, for an outline. Fig. 2b shows the ratio of the sparsity of the retrieved pattern produced by Hebbian learning, denoted as a_r^H, to that of the stored pattern a, vs. f. As one can see, except for the binary patterns at low f, the retrieved patterns, at the storage capacity, are always sparser than the stored ones. The largest sparsification happens for quaternary patterns, for which the Hebbian capacity overtakes the bound on errorless retrieval at low f. Sparser patterns emerge because, to reach close to α_c^H, ϑ has to be such as to inactivate most of the units with intermediate activity levels in the stored pattern. Of course, the perspective is different if α_c^H is considered as a function of a_r instead of a, in which case the Gardner capacity remains unchanged, as it implies retrieval with a_r = a, and lies above α_c^H for each of the three sample distributions; see Fig. 1 of the Supplemental Material at [URL].

COMPARISON WITH A HEBBIAN RULE: EXPERIMENTAL DATA
Having established that the Hebbian capacity of TL networks can surpass the Gardner bound on errorless retrieval for some distributions, we ask what would happen with the distribution of firing rates naturally occurring in the brain. Here we consider published distributions of single units in infero-temporal visual cortex, recorded while the animals were watching short naturalistic movies [18]. Such distributions can be taken as examples of patterns elicited by the visual stimulus, to be stored with Hebbian learning, given appropriate conditions, and later retrieved using attractor dynamics, triggered by a partial cue [20–24]. How many such patterns can be stored, and with what accompanying sparsification?

Fig. 3a and b show the analysis of two sample distributions (of the top right and top left cells in Fig. 2 of [18]). The observed distributions, in blue, with the "Gardner" label, are those we assume could be stored, and which could be retrieved exactly as they are with a suitable training procedure bound by the Gardner capacity. In orange, instead, we plot the distribution that would be retrieved following Hebbian learning operating at its capacity; see Supplemental Material at [URL], Sect. F, for the estimation of the retrieved distribution. Note that the absolute scale of the retrieved firing rate is arbitrary; what is fixed is only the shape of the distribution, which is sparser (as is clear already from the higher bar at zero). The pattern in Fig. 3a, which has a < 0.5, could also be fitted with a one-parameter exponential shape having f = 2a (see Supplemental Material at [URL], Sect. E). In that panel we also report the values of α_c^{H exp} and a_r^{H exp} calculated assuming the continuous exponential, instead of those for the observed discrete distribution (α_c^{H naive} and a_r^{H naive}). Fig. 3c shows both α_c^G(f) and α_c^{H exp}(f); on top of these curves and in the inset we have indicated as diamonds the values calculated for the 9 empirical distributions present in [18], and as circles the fitted values for those which could be fitted to an exponential.

There are three conclusions that we can draw from these data. First, the Hebbian capacity from the empirical distributions is about 80% of that of the exponential fit, when available. Second, for distributions like those of these neurons, the capacity achieved by Hebbian learning is about 50%–80% of the Gardner capacity for errorless retrieval, depending on the neuron and on whether we take its discrete distribution "as is", or fit it with a continuous exponential. Third, Hebbian learning leads to retrieved patterns which are sparser than the stored ones by a factor of 2 or more, depending on f.

FIG. 3. Hebbian learning vs. the Gardner errorless bound for experimental data. (a,b) Examples of the histograms of two experimentally recorded spike counts (blue) and the retrieved distribution, if the patterns were stored using Hebbian learning (orange). Note that the retrieved distributions à la Gardner would be the same as the stored patterns. (c) Analytically calculated universal Gardner capacity α_c^G(f) (blue), i.e., the g → ∞ limit of Eq. (3), compared to α_c^{H exp} for the Hebbian learning of an exponential distribution (orange). The diamonds are the values obtained with the 9 original discrete distributions, and the circles the values for those 4 that can be fit to an exponential distribution. The asterisks mark the two cells whose distributions are plotted in (a) and (b). (d) Sparsification of the retrieved patterns, for Hebbian learning.

DISCUSSION
The general notion of attractor neural networks (ANN) has been instrumental in conceptualizing the storage of long-term memories, including those experimentally accessible, such as spatially selective memories in rodents. For example, the activity of place [25, 26] and grid cells [27] has been analyzed in terms of putative low-dimensional attractor manifolds [28, 29]. In detail, however, the applicability of advanced mathematical results [30, 31] has been challenged by several cortically implausible assumptions incorporated in the early models. In particular, an understanding of the effectiveness of Hebbian learning has been hampered by the fact that a natural benchmark, the Gardner bound on the storage capacity, had been derived initially only for binary units. The bound for binary units, as described in the Introduction, was found to be far above the Hebbian capacity, but then no binary or quasi-binary pattern of activity has ever been observed in the cerebral cortex. A few studies have considered non-binary units: TL networks have been shown to be less susceptible to spin-glass effects [32] and to mix-ups of memory states [33], but in the framework of calculations à la Gardner they have focused on issues other than associative networks storing sparse representations. For instance, in [17] the authors carried out a replica analysis for a generic gain function, but then focused their study on the tanh activation function and on activity patterns that were not neurally motivated. Clopath and Brunel [34] considered monotonically increasing activation functions under the constraint of non-negative weights as a model of cerebellar Purkinje cells and, more recently, it has been shown with a replica analysis [35] that TL units can largely tolerate perturbations in the exact values of the weights and of the inputs.

Here, we report the analytical derivation of the Gardner capacity for networks of TL units, validate the result with TL perceptron training, and compare it with the performance of networks with Hebbian weights. We argue that the comparison has to be framed in the context of the difference between stored and retrieved patterns, which becomes more salient the higher the storage load of the Hebbian network. It remains to be assessed whether a stability parameter, comparable to the κ used in the original Gardner calculations [7], could be considered also for TL units. Further understanding would also derive from the comparison between the maximal information content per synapse when patterns are stored via Hebbian or iterative learning, as previously performed for binary units [36].

For typical cortical distributions of activity in visual cortex, Hebbian one-shot local learning already utilizes 50%–80% of the available capacity for errorless retrieval, leading to markedly sparser retrieved activity. In the extreme case in which only the most active cells remain active, those retrieved memories cannot be regarded as the full pattern, with its entire information content, but more as a pointer, effective perhaps to address the full memory elsewhere, as posited in index theories of two-stage memory retrieval [37].
ACKNOWLEDGMENTS
We thank Rémi Monasson and Aldo Battista for useful comments and discussions. This research received funding from the EU Marie Skłodowska-Curie Training Network 765549 "M-Gate", and partial support from Human Frontier Science Program RGP0057/2016, the Research Council of Norway (Centre for Neural Computation, grant number 223262; NORBRAIN1, grant number 197467), and the Kavli Foundation.

[1] T. H. Brown, E. W. Kairiss, and C. L. Keenan, Annual Review of Neuroscience, 475 (1990).
[2] D. O. Hebb, The Organization of Behavior: a Neuropsychological Theory (J. Wiley; Chapman & Hall, 1949).
[3] J. J. Hopfield, Proceedings of the National Academy of Sciences, 2554 (1982).
[4] D. J. Amit, Modeling Brain Function: The World of Attractor Neural Networks (Cambridge University Press, 1992).
[5] D. J. Amit, H. Gutfreund, and H. Sompolinsky, Annals of Physics, 30 (1987).
[6] B. Derrida, E. Gardner, and A. Zippelius, EPL (Europhysics Letters), 167 (1987).
[7] E. Gardner, Journal of Physics A: Mathematical and General, 257 (1988).
[8] H. K. Hartline and F. Ratliff, The Journal of General Physiology, 357 (1957).
[9] H. K. Hartline and F. Ratliff, The Journal of General Physiology, 1049 (1958).
[10] A. Treves, Journal of Physics A: Mathematical and General, 2631 (1990).
[11] V. Nair and G. E. Hinton, in Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010) pp. 807–814.
[12] A. L. Maas, A. Y. Hannun, and A. Y. Ng, in Proc. ICML, 1 (2013) p. 3.
[13] K. He, X. Zhang, S. Ren, and J. Sun, in Proceedings of the IEEE International Conference on Computer Vision (2015) pp. 1026–1034.
[14] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
[15] U. Pereira and N. Brunel, Neuron, 227 (2018).
[16] M. V. Tsodyks and M. V. Feigel'man, EPL (Europhysics Letters), 101 (1988).
[17] D. Bollé, R. Kühn, and J. van Mourik, Journal of Physics A: Mathematical and General, 3149 (1993).
[18] A. Treves, S. Panzeri, E. T. Rolls, M. Booth, and E. A. Wakeman, Neural Computation, 601 (1999).
[19] A. Treves and E. T. Rolls, Network: Computation in Neural Systems, 371 (1991).
[20] J. M. Fuster and J. P. Jervey, Science, 952 (1981).
[21] Y. Miyashita, Nature, 817 (1988).
[22] K. Nakamura and K. Kubota, Journal of Neurophysiology, 162 (1995).
[23] D. J. Amit, S. Fusi, and V. Yakovlev, Neural Computation, 1071 (1997).
[24] E. T. Rolls and A. Treves, Neural Networks and Brain Function (Oxford University Press, Oxford, 1998).
[25] J. O'Keefe and J. Dostrovsky, Brain Research (1971).
[26] J. O'Keefe, Experimental Neurology, 78 (1976).
[27] T. Hafting, M. Fyhn, S. Molden, M.-B. Moser, and E. I. Moser, Nature, 801 (2005).
[28] F. P. Battaglia and A. Treves, Neural Computation, 431 (1998).
[29] K. Yoon, M. A. Buice, C. Barry, R. Hayman, N. Burgess, and I. R. Fiete, Nature Neuroscience, 1077 (2013).
[30] R. Monasson and S. Rosay, Physical Review E, 032803 (2014).
[31] R. Monasson and S. Rosay, Physical Review Letters, 098101 (2015).
[32] A. Treves, Journal of Physics A: Mathematical and General, 2645 (1991).
[33] Y. Roudi and A. Treves, Physical Review E, 041906 (2003).
[34] C. Clopath and N. Brunel, PLoS Computational Biology (2013).
[35] C. Baldassi, E. M. Malatesta, and R. Zecchina, Physical Review Letters, 170602 (2019).
[36] J.-P. Nadal and G. Toulouse, Network: Computation in Neural Systems, 61 (1990).
[37] T. J. Teyler and P. DiScenna, Behavioral Neuroscience, 147 (1986).

SUPPLEMENTAL MATERIAL

A. Derivation of the storage capacity
We start by considering a single threshold-linear unit whose activity is denoted by u. The neuron receives C inputs v_j, for j = 1···C, through synaptic weights J_j. The activity of the neuron is determined through the threshold-linear activation function as

u = g [ h − ϑ ]_+ ,   h({v}) = (1/√C) Σ_j J_j v_j .   (1)

We assume that we have p patterns, indexed by μ = 1···p, of activity over the inputs, that we denote by ξ_j^μ. To each input pattern μ we also associate a desired output activity of the neuron, that we denote by η^μ. We are interested in finding how many patterns can be stored in the synaptic weights, such that the input activity elicits the desired output activity, assuming that the synaptic weights satisfy the spherical constraint

Σ_j J_j² = C .   (2)

This task essentially boils down to calculating the expectation, over the distribution of η and ξ, of the logarithm of the fractional volume V of the interaction space, defined as

V = [ ∫ Π_j dJ_j δ( Σ_j J_j² − C ) Π_μ { (1 − δ_{η^μ,0}) δ( h^μ − ϑ − η^μ/g ) + δ_{η^μ,0} Θ( ϑ − h^μ ) } ] / [ ∫ Π_j dJ_j δ( Σ_j J_j² − C ) ] .   (3)

For calculating ⟨log V⟩_{η,ξ}, we use the replica trick. The initial task is to compute the replicated average ⟨V^n⟩_ξ, namely

⟨V^n⟩_ξ = Π_{a=1,..,n} Π_μ [ ∫ Π_j dJ_j^a δ( Σ_j (J_j^a)² − C ) ⟨ (1 − δ_{η^μ,0}) δ( h^{a,μ} − ϑ − η^μ/g ) + δ_{η^μ,0} Θ( ϑ − h^{a,μ} ) ⟩ ] / [ ∫ Π_j dJ_j^a δ( Σ_j (J_j^a)² − C ) ] .   (4)

We first compute the numerator. To compute the averages over ξ in the numerator, we note that the delta function can be written as

δ( h^{a,μ} − ϑ − η^μ/g ) = ∫ (dx_μ^a/2π) exp{ i x_μ^a ( (1/√C) Σ_j J_j^a ξ_j^μ − ϑ − η^μ/g ) }
 = ∫ (dx_μ^a/2π) exp[ −(i x_μ^a/g)( η^μ + gϑ ) ] exp[ i x_μ^a Σ_j J_j^a ξ_j^μ / √C ] .   (5)

For the average of the Heaviside function, we write

Θ( ϑ − h^{a,μ} ) = ∫_0^∞ dλ_μ^a δ[ λ_μ^a − ( ϑ − h^{a,μ} ) ]
 = ∫_0^∞ (dλ_μ^a/2π) ∫_{−∞}^{∞} dy_μ^a exp[ i y_μ^a ( λ_μ^a − ( ϑ − h^{a,μ} ) ) ]
 = ∫_0^∞ (dλ_μ^a/2π) ∫_{−∞}^{∞} dy_μ^a exp[ i y_μ^a ( λ_μ^a − ϑ ) ] exp[ i y_μ^a Σ_j J_j^a ξ_j^μ / √C ] .   (6)

We now use the identities in Eqs. (5) and (6) to compute the following quantity, which appears in the numerator of Eq. (4), assuming independently drawn ξ:

e^{CM} ≡ ⟨ Π_{μ,a} [ (1 − δ_{η^μ,0}) δ( h^{a,μ} − ϑ − η^μ/g ) + δ_{η^μ,0} Θ( ϑ − h^{a,μ} ) ] ⟩_{ξ,η}
 = Π_μ ⟨ (1 − δ_{η^μ,0}) ⟨ Π_a δ( h^{a,μ} − ϑ − η^μ/g ) ⟩_{ξ^μ} + δ_{η^μ,0} ⟨ Π_a Θ( ϑ − h^{a,μ} ) ⟩_{ξ^μ} ⟩_{η^μ} .   (7)

In order to compute the average of the delta functions in Eq. (7), we use the approximation

⟨exp(x)⟩ ≈ exp{ ⟨x⟩ + ( ⟨x²⟩ − ⟨x⟩² )/2 }   (8)

to calculate the following average:

⟨ exp( i Σ_{a,j} x_μ^a J_j^a ξ_j^μ / √C ) ⟩_{ξ^μ}
 = exp{ (i/√C) Σ_{a,j} x_μ^a J_j^a ⟨ξ_j^μ⟩ − (1/2C) Σ_{a,b,j,k} x_μ^a x_μ^b J_j^a J_k^b ⟨ξ_j^μ ξ_k^μ⟩ + (1/2C) ( Σ_{a,j} x_μ^a J_j^a ⟨ξ_j^μ⟩ )( Σ_{b,k} x_μ^b J_k^b ⟨ξ_k^μ⟩ ) }
 = exp{ (i/√C) Σ_{a,j} x_μ^a J_j^a ⟨ξ_j^μ⟩ − (1/2C) Σ_{a,b,j} x_μ^a x_μ^b J_j^a J_j^b ⟨(ξ_j^μ)²⟩ + (1/2C) Σ_{a,b,j} x_μ^a x_μ^b J_j^a J_j^b ⟨ξ_j^μ⟩² }   (9)

where in going from the second to the third line of Eq. (9) we have used the fact that ⟨ξ_j^μ ξ_k^μ⟩ = ⟨ξ_j^μ⟩⟨ξ_k^μ⟩ for j ≠ k. Expanding the second exponential in the second line of Eq. (5), and introducing, in analogy with the main text, d_inp,1 ≡ ⟨ξ_j^μ⟩, d_inp,2 ≡ ⟨(ξ_j^μ)²⟩ and d_inp ≡ d_inp,2 − d_inp,1², together with the order parameters m_a = (1/√C) Σ_j J_j^a and q_ab = (1/C) Σ_j J_j^a J_j^b, we can write, in the large-C limit,

⟨ Π_a δ( h^{a,μ} − ϑ − η^μ/g ) ⟩_{ξ^μ} = ∫_{−∞}^{∞} [ Π_a dx_μ^a/2π ] exp[ −(i/g)( η^μ + gϑ ) Σ_a x_μ^a + i d_inp,1 Σ_a x_μ^a m_a − (d_inp/2)( Σ_a (x_μ^a)² + 2 Σ_{a<b} x_μ^a x_μ^b q_ab ) ] .
Denoting the fraction of active output units by f = Pr(η^μ > 0), we can rewrite Eq. (13) as

M(q, m) = (p/C) log{ ⟨(1 − δ_{η^μ,0})⟩_{η^μ} ⟨I₁(q, m, η^μ)⟩_{η^μ} + ⟨δ_{η^μ,0}⟩_{η^μ} I₀(q, m) }
 = (p/C) log[ f ⟨I₁(q, m, η^μ)⟩_{η^μ} + (1 − f) I₀(q, m) ] .   (30)

Simplifying, for the sake of visualization, Eqs. (28) and (29) as

I₁(q, m, η^μ) = ∫ Dt Y^n ,   I₀(q, m) = ∫ Dt K^n   (31)

where

Y ≡ ∫ (dx_μ/2π) exp[ −i ( η^μ/g + ϑ − d_inp,1 m − t √(q d_inp) ) x_μ − ( d_inp(1 − q)/2 ) x_μ² ]
K ≡ ∫_0^∞ (dλ_μ/2π) ∫_{−∞}^{∞} dy_μ exp[ i ( λ_μ − ϑ + d_inp,1 m + t √(q d_inp) ) y_μ − ( d_inp(1 − q)/2 ) y_μ² ]   (32)

one can use again a^n ≈ 1 + n log a and log(1 + a) ≈ a, valid for n → 0, to write M(q, m) as

M(q, m) = (p/C) log[ f ⟨ ∫ Dt Y^n ⟩_{η^μ} + (1 − f) ∫ Dt K^n ]
 = (p/C) log[ ∫ Dt ( f ⟨ 1 + n log Y ⟩_{η^μ} + (1 − f)( 1 + n log K ) ) ]
 = (p/C) log[ 1 + n ( f ∫ Dt ⟨log Y⟩_{η^μ} + (1 − f) ∫ Dt log K ) ]
 = (p/C) n ( f ∫ Dt ⟨log Y⟩_{η^μ} + (1 − f) ∫ Dt log K ) .   (33)

Turning back to the original notation, we can further develop the terms composing the above approximation. The first one yields

∫ Dt ⟨log Y⟩_{η^μ} = ∫ Dt ⟨ log ∫ (dx_μ/2π) exp[ −i ( η^μ/g + ϑ − d_inp,1 m − t √(q d_inp) ) x_μ − ( d_inp(1 − q)/2 ) x_μ² ] ⟩_{η^μ}
 = ∫ Dt ⟨ log [ exp{ −( d_inp,1 m − η^μ/g − ϑ + t √(q d_inp) )² / ( 2 d_inp(1 − q) ) } √( 2π/(d_inp(1 − q)) ) (1/2π) ] ⟩_{η^μ}
 = −(1/2) log 2π − (1/2) log[ d_inp(1 − q) ] − [ ⟨( d_inp,1 m − η^μ/g − ϑ )²⟩_{η^μ} + q d_inp ] / ( 2 d_inp(1 − q) )   (34)

and the second one yields

∫ Dt log K = ∫ Dt log ∫_0^∞ (dλ_μ/2π) ∫_{−∞}^{∞} dy_μ exp[ i ( λ_μ − ϑ + d_inp,1 m + t √(q d_inp) ) y_μ − ( d_inp(1 − q)/2 ) y_μ² ]
 = ∫ Dt log ∫_0^∞ (dλ_μ/2π) √( 2π/(d_inp(1 − q)) ) exp[ −( d_inp,1 m + λ_μ − ϑ + t √(q d_inp) )² / ( 2 d_inp(1 − q) ) ]
 = ∫ Dt log ∫_{( d_inp,1 m − ϑ + t √(q d_inp) ) / √( d_inp(1 − q) )}^{∞} (dz/√(2π)) e^{−z²/2}   (35)

where in the last passage we made a simple change of variables. Therefore we can rewrite Eq. (30) as

M(q, m) = (p/C) n { f [ −(1/2) log[ 2π d_inp(1 − q) ] − ( ⟨( d_inp,1 m − η^μ/g − ϑ )²⟩_{η^μ} + q d_inp ) / ( 2 d_inp(1 − q) ) ] + (1 − f) ∫ Dt log H(u) }   (36)

where

u ≡ ( d_inp,1 m − ϑ + t √(q d_inp) ) / √( d_inp(1 − q) ) ,   H(u) ≡ ∫_u^∞ (dt/√(2π)) e^{−t²/2} .

Now we can evaluate the derivatives

dG/dm̂ = dG/dq̂ = dG/dE = dG/dm = dG/dq = 0   (37)

where G = G(q, q̂, m, m̂, E) is given by Eq. (21), and set them to zero to find the extremum of Eq. (21), with W(m̂, q̂, E) given by Eq. (26) and M(q, m) given by Eq. (36). From the first three derivatives set to zero, which involve only the second and third terms of Eq. (21), and assuming C q̂ ≫ m̂ and |C(1 − q)| ≫ m̂ as C → ∞, we obtain the relations

m̂ = −m √C / (q − 1) ,   q̂ = q / (1 − q)² ,   E = 1/(1 − q) − q/(q − 1)² .   (38)

Substituting them into Eq. (21), we have to perform the last two derivatives. dG/dm can be simply evaluated, applying the Leibniz integral rule d/dx H(f(x)) = d/dx ∫_{f(x)}^∞ (dt/√(2π)) e^{−t²/2} = −(1/√(2π)) exp( −f(x)²/2 ) df(x)/dx, yielding

dG/dm = 0 = −( f d_inp,1/d_inp )( d_inp,1 m − ⟨η^μ⟩/g − ϑ ) − √( q/(d_inp(1 − q)) ) (1 − f) ( d_inp,1/√(2π) ) ∫ Dt H(u)^{−1} e^{−u²/2} .   (39)

The derivative in q requires, in addition, an integration by parts of the term multiplied by (1 − f), which enables one to reach the simplified form

dG/dq = 0 = (α/q) { f [ ( ⟨( d_inp,1 m − η^μ/g − ϑ )²⟩_{η^μ} + q d_inp ) / d_inp ] + ( (1 − f)(1 − q)/(2π) ) ∫ Dt H(u)^{−2} e^{−u²} }   (40)

where α ≡ p/C is the storage capacity.

As explained in the main text, we take the limit q → 1, in which the storage capacity α becomes the critical one, α_c. Note that in this limit

lim_{q→1} u = +∞ if t > ( ϑ − d_inp,1 m )/√(d_inp) ,   −∞ if t < ( ϑ − d_inp,1 m )/√(d_inp) .   (41)

This enables us to further simplify the above equations, using

lim_{x→−∞} H(x) ≈ 1 ,   lim_{x→∞} H(x) ≈ ( 1/(√(2π) x) ) e^{−x²/2} ( 1 − 1/x² ) ≈ ( 1/(√(2π) x) ) e^{−x²/2}

where in the second approximation we have kept only the leading term of the large-x expansion of H(x). The simple application of the limit q → 1, with

x = ( ϑ − d_inp,1 m )/√(d_inp) ,

leads to the final set of equations for the critical storage capacity:

f ( x + d_out,1/(g √d_inp) ) = (1 − f) ∫_x^∞ Dt (t − x)
1/α_c = f [ x² + d_out,2/(g² d_inp) + 2 x d_out,1/(g √d_inp) + 1 ] + (1 − f) ∫_x^∞ Dt (t − x)²   (42)

where d_out,1 and d_out,2 are defined in the same way as d_inp,1 and d_inp,2, except that the averages are now over the output distribution η.

Going from the calculation reported above for the threshold-linear perceptron, it is straightforward to calculate the optimal capacity of a network of threshold-linear units. Considering the network defined through Eq. (1) of the main text, the corresponding volume we need to calculate can be written as

V_T = [ ∫ Π_{i, j≠i} dJ_ij Π_i δ( Σ_{j≠i} J_ij² − C ) Π_{i,μ} { (1 − δ_{η_i^μ,0}) δ( h_i^μ − ϑ − η_i^μ/g ) + δ_{η_i^μ,0} Θ( ϑ − h_i^μ ) } ] / [ ∫ Π_{i, j≠i} dJ_ij Π_i δ( Σ_{j≠i} J_ij² − C ) ] .   (43)

Since V_T can be written as the product of the individual volumes of the connection weights towards each unit, V_T = Π_{i=1}^N V_i, and thus ⟨log V_T⟩_η = N ⟨log V_i⟩_η, we are essentially dealing with individual perceptrons like the one just studied. Putting d_inp,1 = d_out,1 = d₁, d_inp,2 = d_out,2 = d₂ and thus d_inp = d_out = d for all i, we arrive at the equations presented in Eq. (3) of the main text.

As explained in the main text, we evaluate the maximal storage capacity in the limit g → ∞, which is already approached for moderate values of g. Eq. (3) of the main text in the g → ∞ limit reduces to

f x = (1 − f) ∫_x^∞ Dt (t − x)
1/α_c = f ( x² + 1 ) + (1 − f) ∫_x^∞ Dt (t − x)²   (44)

which provides the universal α_c^G bound for errorless retrieval, dependent on the distribution of the patterns only through f.

B. Derivation of the limits
From Eq. (3) of the main text it is possible to evaluate the two limits of very sparse and non-sparse coding. First, a simple substitution at f = 1 leads to

x = − d₁/(g √d)   (45)
α_c^{−1} = 1 + 1/g² .   (46)

The case f → 0 requires a little more work. The first of Eqs. (3) can be rewritten as

f/(1 − f) = [ 1/( x + d₁/(g √d) ) ] ∫_x^∞ Dt (t − x) = [ 1/( x + d₁/(g √d) ) ] ( e^{−x²/2}/√(2π) − x ∫_x^∞ Dt ) .   (47)

As f goes to zero, for the left-hand side to be equal to the right-hand side, we should have x → ∞. We therefore use the expansion

∫_x^∞ Dt = ( e^{−x²/2}/√(2π) ) [ 1/x − 1/x³ + O(1/x⁵) ]

to write the right-hand side of Eq. (47) as

f/(1 − f) ≈ e^{−x²/2} / ( √(2π) x³ ) .   (48)

We find a solution to Eq. (48) through the following iterative procedure. We first solve the leading term for f → 0 and x → ∞, namely f ≈ e^{−x²/2}/√(2π), yielding

x ≈ √( 2 ln( 1/(√(2π) f) ) ) .   (49)

We then insert x from Eq. (49) into exp(−x²/2) = √(2π) f x³ to obtain the logarithmic correction

e^{−x²/2} ≈ √(2π) f x³
x ≈ √( 2 ln( 1/(√(2π) f x³) ) )
x ≈ √( 2 ln( 1/(√(2π) f) ) ) √( 1 − 3 ln x / ln( 1/(√(2π) f) ) )
  ≈ √( 2 ln( 1/(√(2π) f) ) ) ( 1 − (3/4) ln[ 2 ln( 1/(√(2π) f) ) ] / ln( 1/(√(2π) f) ) ) ,   (50)

where in the last passage we have used the Taylor expansion of the square root, √(1 − y) = 1 − y/2 + O(y²), around y = 0, since for f → 0, ln x / ln( 1/(√(2π) f) ) → 0; one can check that this x is indeed a solution to Eq. (47) for f → 0. For α_c, we apply the same Taylor expansion as before:

α_c = { f [ ⟨( x + η/(g √d) )²⟩ + 1 ] + (1 − f) ∫_x^∞ Dt (t − x)² }^{−1}
 = { f [ ⟨( x + η/(g √d) )²⟩ + 1 ] + (1 − f) ( −x e^{−x²/2}/√(2π) + (1 + x²) ∫_x^∞ Dt ) }^{−1}
 ≈ { f x² − x e^{−x²/2}/√(2π) + (1 + x²)( e^{−x²/2}/√(2π) )( 1/x − 1/x³ + 3/x⁵ ) }^{−1}
 ≈ { f x² + ( e^{−x²/2}/√(2π) )( −x + (1 + x²)( x⁴ − x² + 3 )/x⁵ ) }^{−1}
 ≈ { f x² + ( e^{−x²/2}/√(2π) )( 2/x³ + 3/x⁵ ) }^{−1}
 ≈ { f x² + √(2/π) e^{−x²/2}/x³ }^{−1} .

To summarise, in the limit f → 0,

x ≈ √( 2 ln( 1/(√(2π) f) ) ) ( 1 − (3/4) ln[ 2 ln( 1/(√(2π) f) ) ] / ln( 1/(√(2π) f) ) )
α_c ≈ { f x² + √(2/π) e^{−x²/2}/x³ }^{−1} .   (51)

Substituting x into α_c to the leading order leads to the sparse-coding expression presented in the main text.

C. Training algorithm
For the purpose of assessing whether the Gardner capacity for errorless retrieval can be reached with explicit training, we can decompose a network of, say, N + 1 = 10001 units into N + 1 independent threshold-linear perceptrons. A threshold-linear perceptron is just a 1-layer feedforward neural network with N inputs and one output, the activity of which is given by a threshold-linear activation function,

[h]_+ = max(0, h) .   (52)

The network is trained with p patterns. One can then think of the input as a matrix ξ̄ of dimension [N × p] and of the output as a vector η⃗ of dimension [1 × p].

The aim of the algorithm is to tune the weights such that all p patterns can be memorized. In order to tune the weights we start from an initial connectivity vector J⃗ of dimension [1 × N] and estimate the output η̂⃗ as

h⃗ = J⃗ ξ̄ ,   η̂⃗ = g [ h⃗ ]_+   (53)

where g is the gain parameter. We then compare the output η̂⃗ with the desired output η⃗ through the loss function

L(η̂⃗) = Σ_{μ=1}^p (1/2)( η̂^μ − η^μ )² .   (54)

The TL perceptron algorithm can be seen as simply a stripped-down version of backpropagation, for a 1-layer network: the weights J⃗ are modified by gradient descent to minimize the loss during the steps k = 1..k_MAX, where k_MAX is the number of steps needed for gradient descent to reach the minimum, dL(J⃗_k)/dJ⃗_k = 0. If at the minimum L(J⃗_{k_MAX}) = 0, then at least one set of weights exists for errorless retrieval at that value of p. The storage capacity α_c = p_max/N is evaluated by estimating p_max as the highest value of p for which L(J⃗_{k_MAX}) = 0 can be reached. Initializing the weights around zero facilitates reaching the minimum. The chain derivative that in general implements gradient descent in backpropagation, in this case reduces to

J⃗_{k+1} = J⃗_k + γ (g/p) ( η⃗ − η̂⃗ ) Θ(η̂⃗) ξ̄^T   (55)

where Θ(η̂⃗) is the Heaviside step function applied element-wise to η̂⃗, and γ is a learning rate, which we vary in order to facilitate reaching the minimum.
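A minimal Python sketch of this training loop is given below; array shapes follow the matrix notation above, while the stopping rule, the learning rate, the gain and the toy pattern statistics are illustrative choices of ours, not the exact settings used for Fig. 2 (the text above notes that γ has to be tuned to reach the minimum).

```python
import numpy as np

def train_tl_perceptron(xi, eta, g=1.0, gamma=0.02, k_max=50_000, tol=1e-10):
    """Gradient descent on the quadratic loss of Eq. (54) for one TL perceptron.
    xi  : (N, p) input patterns, eta : (p,) desired outputs.
    Returns (J, final_loss); errorless storage corresponds to final_loss ~ 0."""
    N, p = xi.shape
    J = 1e-3 * np.random.default_rng(0).normal(size=N)    # initialise near zero
    loss = np.inf
    for _ in range(k_max):
        h = J @ xi                                         # fields, shape (p,)
        eta_hat = g * np.maximum(h, 0.0)                   # TL output, Eq. (53)
        loss = 0.5 * np.sum((eta_hat - eta) ** 2)          # Eq. (54)
        if loss < tol:
            break
        # update of Eq. (55): J <- J + (gamma*g/p) (eta - eta_hat) Theta(eta_hat) xi^T
        J += (gamma * g / p) * ((eta - eta_hat) * (eta_hat > 0)) @ xi.T
    return J, loss

# Toy usage: p binary patterns on N inputs, binary desired outputs, load below capacity
rng = np.random.default_rng(1)
N, p, f = 100, 30, 0.5
xi = (rng.random((N, p)) < f).astype(float)
eta = (rng.random(p) < f).astype(float)
J, loss = train_tl_perceptron(xi, eta)
print("errorless" if loss < 1e-9 else f"residual loss {loss:.3g}")
```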
D. Hebbian capacity and sparsity of the retrieved pattern

From the calculation reported in [1], it can be shown that, for a network of threshold-linear units as described in Eq. (1) of the main text, in which p patterns are stored through Hebbian learning, the storage capacity α_c can be found as the value of α for which there are values v_c and w_c that solve the equation

A₁(w, v) − α A₂(w, v) = 0   (56)

at a single point in the (w, v) plane, where

A₁(w, v) = ( a v / (1 − a) ) ⟨ ( η/⟨η⟩ − 1 )( x φ(x) + σ(x) ) ⟩   (57)
A₂(w, v) = ⟨ ( x² + 1 ) φ(x) + x σ(x) ⟩   (58)

and

x ≡ w + v η/⟨η⟩   (59)
φ(x) = [ 1 + erf( x/√2 ) ]/2 = erfc( −x/√2 )/2   (60)
σ(x) = e^{−x²/2}/√(2π) .   (61)

The auxiliary variables v and w, defined as in [2] and dependent on the threshold ϑ, quantify the signal-to-noise ratio of, respectively, the specific signal of the pattern to be retrieved and that of the background, both relative to the noise due to memory loading [1]. Estimating v_c and w_c translates into optimizing the threshold ϑ such that it maximizes the storage capacity. In [1] the above expressions are reported assuming a = ⟨η⟩ = ⟨η²⟩, but the equations above do not make this assumption and can be derived easily from the calculation reported in [1].

Following [1], the average of the activity and the average of the square activity in the patterns retrieved with Hebbian weights are calculated considering that the field, i.e., the input received by a cell with activity η in the memory, is normally distributed around a mean field proportional to x. If we call z a random variable normally distributed with mean zero and variance one, x is already the mean field properly normalized. With the threshold-linear transfer function, the output will be g(x + z) for x + z > 0, which happens with probability φ(x). Therefore the average activity and the average square activity are

⟨V⟩ = g ⟨ ∫_{−x_c(η)}^∞ Dz [ x_c(η) + z ] ⟩_η = g ⟨ x_c φ(x_c) + σ(x_c) ⟩_η   (62)
⟨V²⟩ = g² ⟨ ∫_{−x_c(η)}^∞ Dz [ x_c(η) + z ]² ⟩_η = g² ⟨ (1 + x_c²) φ(x_c) + x_c σ(x_c) ⟩_η   (63)
x_c ≡ w_c + v_c η/⟨η⟩ ,   (64)

and the sparsity of the retrieved memory is thus a_r^H = ⟨V⟩²/⟨V²⟩.

As reported in the main text, we have compared capacity values using a binary, a ternary, a quaternary and an exponential distribution:

p(η) = (1 − a) δ(η) + a δ(η − 1)   (65)
p(η) = (1 − 9a/5) δ(η) + (3a/2) δ(η − 1/3) + (3a/10) δ(η − 5/3)   (66)
p(η) = (1 − 9a/4) δ(η) + (3a/2) δ(η − 2/9) + (3a/5) δ(η − 5/9) + (3a/20) δ(η − 20/9)   (67)
P(η) = (1 − 2a) δ(η) + 4a exp(−2η) .   (68)

One can see that all distributions are such that ⟨η⟩ = ∫_0^∞ dη P(η) η = a and ⟨η²⟩ = ∫_0^∞ dη P(η) η² = a, so that a coincides with the sparsity ⟨η⟩²/⟨η²⟩ of the network. The fraction of active units is thus related to a as f = a, 9a/5, 9a/4 and 2a, respectively.

As a supplement to Fig. 2 of the main text, reproduced here in the 3 separate panels of the upper row of Fig. 1, we show a comparison between the Hebbian capacity and the Gardner one when plotted as a function of the output sparsity (in the bottom row of Fig. 1). The Gardner storage capacity is now, in each of these 3 cases, above the Hebbian capacity, taken as a function of the output sparsity instead of the input one.

FIG. 1. Supplementary to Fig. 2 of the main text. Comparison between the Hebbian and Gardner storage capacity for 3 discrete distributions. The upper row uses as sparsity parameter that of the input (stored) pattern, the lower row that of the retrieved pattern. The Gardner capacity is that given by Eq. (3) of the main text.
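As a quick consistency check of Eqs. (65)–(68) and of the stated relations between f and a, the following Python snippet (an illustrative verification of ours, not part of the original analysis) evaluates the first two moments of each distribution for a chosen value of a:

```python
import numpy as np
from scipy.integrate import quad

a = 0.1   # any admissible sparsity value

# (value, weight) pairs for the discrete distributions (65)-(67); the weight of
# eta = 0 is fixed by normalization and equals 1 - f.
discrete = {
    "binary":     [(1.0, a)],
    "ternary":    [(1/3, 3*a/2), (5/3, 3*a/10)],
    "quaternary": [(2/9, 3*a/2), (5/9, 3*a/5), (20/9, 3*a/20)],
}

for name, support in discrete.items():
    f = sum(w for _, w in support)                 # fraction of active units
    m1 = sum(w * v for v, w in support)            # <eta>
    m2 = sum(w * v**2 for v, w in support)         # <eta^2>
    print(f"{name:10s}  f = {f:.3f}  <eta> = {m1:.3f}  <eta^2> = {m2:.3f}  f/a = {f/a:.2f}")

# exponential distribution (68): P(eta) = (1 - 2a) delta(eta) + 4a exp(-2 eta)
f = quad(lambda e: 4*a*np.exp(-2*e), 0, np.inf)[0]
m1 = quad(lambda e: 4*a*np.exp(-2*e) * e, 0, np.inf)[0]
m2 = quad(lambda e: 4*a*np.exp(-2*e) * e**2, 0, np.inf)[0]
print(f"exponential  f = {f:.3f}  <eta> = {m1:.3f}  <eta^2> = {m2:.3f}  f/a = {f/a:.2f}")
# all four cases give <eta> = <eta^2> = a, and f/a = 1, 9/5, 9/4, 2
```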
E. Analytical derivation for the exponential distribution
In order to facilitate the comparison, we extend the analytical calculations in [1] and evaluate explicitly the expressions of Eqs. (57) and (58) for the exponential distribution. In general, for A₁ we write

A₁ = ( a v/(1 − a) ) ∫_0^∞ dη P(η) ( η/⟨η⟩ − 1 ) ∫_{−∞}^{x(η)} Dz ( x(η) − z )
  = ( a v/(1 − a) ) { ∫_w^∞ Dz ∫_{(z−w)⟨η⟩/v}^∞ dη P(η) ( η/⟨η⟩ − 1 )( x(η) − z ) + ∫_{−∞}^w Dz ∫_0^∞ dη P(η) ( η/⟨η⟩ − 1 )( x(η) − z ) }   (69)

with x(η) ≡ w + vη/⟨η⟩. Substituting Eq. (68) we obtain

A₁^exp = ( a v/(1 − a) ) ( A₁.1 + A₁.2 + A₁.3 )
A₁.1 = ∫_{−∞}^w Dz ∫_0^∞ dη 4a e^{−2η} ( η/a − 1 )( w + vη/a − z )
A₁.2 = ∫_{−∞}^w Dz (1 − 2a)( z − w )
A₁.3 = ∫_w^∞ Dz ∫_{(z−w)a/v}^∞ dη 4a e^{−2η} ( η/a − 1 )( w + vη/a − z ) .   (70)

Solving the integrals leads to

A₁.1 = (1 − 2a) σ(w) + [ v(1 − a)/a + (1 − 2a) w ] φ(w)
A₁.2 = (2a − 1)( σ(w) + w φ(w) )
A₁.3 = exp( 2aw/v + 2a²/v² ) [ ( v(1 − a)/a ) φ( −w − 2a/v ) + σ( w + 2a/v ) − ( w + 2a/v ) φ( −w − 2a/v ) ] .   (71)

Thus

A₁^exp = v² φ(w) + exp( 2aw/v + 2a²/v² ) { v² φ( −w − 2a/v ) + ( a v/(1 − a) ) [ σ( w + 2a/v ) − ( w + 2a/v ) φ( −w − 2a/v ) ] } .   (72)

For A₂ we have, substituting Eq. (68),

A₂^exp = A₂.1 + A₂.2 + A₂.3
A₂.1 = ∫_{−∞}^w Dz ∫_0^∞ dη 4a e^{−2η} ( w + vη/a − z )²
A₂.2 = ∫_{−∞}^w Dz (1 − 2a)( w − z )²
A₂.3 = ∫_w^∞ Dz ∫_{(z−w)a/v}^∞ dη 4a e^{−2η} ( w + vη/a − z )² .   (73)

Solving the integrals leads to

A₂.1 = 2a [ σ(w)( w + v/a ) + φ(w)( 1 + w² + vw/a + v²/(2a²) ) ]
A₂.2 = (1 − 2a) [ w σ(w) + (1 + w²) φ(w) ]
A₂.3 = ( v²/a ) exp( 2aw/v + 2a²/v² ) φ( −w − 2a/v ) .   (75)

Thus

A₂^exp = 2v( σ(w) + w φ(w) ) + w σ(w) + (1 + w²) φ(w) + ( v²/a ) [ φ(w) + exp( 2aw/v + 2a²/v² ) φ( −w − 2a/v ) ] .   (76)

F. Comparison with real data
In the real activity distributions we use, each neuron emits, in time bins of fixed duration (we use 100 ms), 0, ..., n, ..., n_max spikes, with relative frequency c_n, such that Σ_{n=0}^{n_max} c_n = 1. These values are taken from Fig. 2 of [3] and correspond to the histograms in blue in Fig. 2 below (and in Fig. 3 of the main text); they are assumed to be the distributions of the patterns to be stored. If the weights are those described by the Gardner calculation, these patterns can be retrieved as they are, and their distribution remains the same. If they are stored with Hebbian weights close to the maximal Hebbian capacity, however, the retrieved distributions look different, and they can be derived as follows.

The firing rate V of a neuron in retrieving a stored pattern η is assumed proportional to w + vη/⟨η⟩ + z [1], where the parameters w and v are appropriately rescaled signal-to-noise ratios (general and pattern-specific), such that the normally distributed random variable z, of zero mean and unitary variance, is taken to describe all other non-constant (noise) terms, besides η itself. Averaging over z one can write, as in Eq. (62), that at the maximal capacity

⟨V⟩(η) = ḡ ∫_{−x_c(η)}^∞ Dz [ x_c(η) + z ] = ḡ [ x_c φ(x_c) + σ(x_c) ]   (77)

where x_c ≡ w_c + v_c η/⟨η⟩, and at the saddle point the parameters w and v take the values w_c and v_c that maximize capacity, as explained in [1]. This implies setting an optimal value for the threshold ϑ, which in the analysis is absorbed into the parameter w, and which determines the sparsity of the retrieved distribution. The gain ḡ remains, however, a free parameter, which affects neither sparsity nor capacity. It is a rescaled version of the original gain g in the hypothetical TL transfer function. In other words, the maximal Hebbian capacity determines the shape of the retrieval activity distribution, but not its scale (e.g., in spikes per second).

To produce a histogram that details the frequency with which the neuron would produce n spikes at retrieval, e.g. again in bins of 100 ms, one has to set this undetermined scale. We set it arbitrarily, with the rough requirement that the frequency of producing n_max spikes at retrieval be below what it is in the observed distribution, taken to describe storage, and negligible for n_max + 1 spikes. Having set the scale ḡ, the frequency with which the neuron emits n spikes at retrieval, with 0 < n < n_max, is the probability that n − 1/2 < V < n + 1/2; that is, it is a sum over contributions from each η, such that

n − 1/2 < ḡ ( w_c + v_c η/⟨η⟩ + z ) < n + 1/2
(n − 1/2)/ḡ − x_c < z < (n + 1/2)/ḡ − x_c   (78)

i.e.,

P_r(n) = Σ_{η=0}^{η_max} c_η [ φ( (n + 1/2)/ḡ − x_c ) − φ( (n − 1/2)/ḡ − x_c ) ] ,   (79)

with appropriate expressions for the two extreme bins. These are the distributions shown in Fig. 3 of the main text, and in Fig. 2 below. The value of ḡ we took satisfies the a priori requirements and allows us to keep the same number of bins in the retrieved memory as in the stored one (and the coefficients sum up to one, to a very good approximation).
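For concreteness, a minimal Python sketch of this binning procedure is given below; the stored-histogram values, the saddle-point parameters w_c, v_c and the scale ḡ in the example are placeholder numbers of ours, since the actual values depend on the recorded cell and on the capacity calculation of Sect. D.

```python
import numpy as np
from scipy.stats import norm

def retrieved_histogram(c, w_c, v_c, g_bar):
    """Retrieved spike-count distribution, Eq. (79).
    c[n] is the stored relative frequency of emitting n spikes in a bin;
    spike counts play the role of the stored activity values eta."""
    n_max = len(c) - 1
    eta = np.arange(n_max + 1, dtype=float)
    eta_mean = np.sum(c * eta)                      # <eta> over the stored histogram
    x_c = w_c + v_c * eta / eta_mean                # mean field for each stored level
    edges = (np.arange(n_max + 2) - 0.5) / g_bar    # bin edges (n - 1/2)/g_bar
    edges[0], edges[-1] = -np.inf, np.inf           # extreme bins absorb the tails
    upper = norm.cdf(edges[1:, None] - x_c[None, :])
    lower = norm.cdf(edges[:-1, None] - x_c[None, :])
    return (upper - lower) @ c                      # sum over eta weighted by c_eta

# Placeholder example: a stored histogram with n_max = 5 and arbitrary w_c, v_c, g_bar
c = np.array([0.55, 0.20, 0.12, 0.07, 0.04, 0.02])
p_r = retrieved_histogram(c, w_c=-1.0, v_c=2.0, g_bar=1.0)
print(p_r, p_r.sum())   # retrieved distribution; probabilities sum to 1
```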
Analysis of the other recorded cells

Supplementary to Fig. 3 in the main text, we report in Fig. 2 the same analysis for all 9 single cells reported (using 100 ms bins) in [3]. In each panel we write the capacity à la Gardner and the Hebbian one (calculated without fitting an exponential) for the 9 empirical distributions, as well as the sparsity of the original distribution and the sparsity of the one that would be retrieved with Hebbian weights. For simplicity of visualization we also show the storage capacity values against each other, calculated à la Gardner and à la Hebb (again, without fitting an exponential), as a single scatterplot for the 9 distributions, in Fig. 3.

[1] A. Treves and E. T. Rolls, Network: Computation in Neural Systems, 371 (1991).
[2] A. Treves, Physical Review A, 2418 (1990).
[3] A. Treves, S. Panzeri, E. T. Rolls, M. Booth, and E. A. Wakeman, Neural Computation, 601 (1999).

FIG. 2. Supplementary to Fig. 3 of the main text.
FIG. 3. Comparison between the values of the storage capacity à la Gardner and à la Hebb for the 9 recorded distributions.