Formalising the Use of the Activation Function in Neural Inference
Dalton A R Sakthivadivel∗

Stony Brook University, Stony Brook, New York, 11794-5281

(Dated: 10th February 2021)

We investigate how activation functions can be used to describe neural firing in an abstract way, and in turn, why they work well in artificial neural networks. We discuss how a spike in a biological neurone belongs to a particular universality class of phase transitions in statistical physics. We then show that the artificial neurone is, mathematically, a mean field model of biological neural membrane dynamics, which arises from modelling spiking as a phase transition. This allows us to treat selective neural firing in an abstract way, and formalise the role of the activation function in perceptron learning. Along with deriving this model and specifying the analogous neural case, we analyse the phase transition to understand the physics of neural network learning. Together, it is shown that there is not only a biological meaning, but a physical justification, for the emergence and performance of canonical activation functions; implications for neural learning and inference are also discussed.

∗ [email protected]
I. INTRODUCTION
The perceptron learning algorithm, developed by McCulloch and Pitts in 1943, is one of the earliest applications of biological principles for computation to mathematics, or to machines [1]. A simple model, the perceptron consists of a single logic gate, and is only capable of classification using linearly separable functions, like AND and OR. Nonetheless, recent algorithms have deviated only slightly from the original developments by McCulloch and Pitts; in many cases, these simply stack perceptrons or add features onto the original algorithm, such as in deep neural networks or convolutional neural networks. Clearly, the contribution of the single-layer perceptron remains relevant today.

Somewhat anomalous in the perceptron, and indeed in further models, is the critical importance of the activation function. McCulloch and Pitts recognised that neural firing occurs in an all-or-none fashion, and that any function with a rapid transition between two 'standard' behaviours would suffice to describe this [2]. In other words, a specific class of functions is generally used for activation functions, which can be described as discontinuous or nearly discontinuous, vertically asymmetric about a 'switching point,' and bounded from below. A concrete example is the sigmoid function originally used by McCulloch and Pitts. Interestingly, a class of activation functions that are bounded from below but exhibit asymptotically linear behaviour for inputs greater than a critical threshold, such as ReLU, ELU, Mish, and Swish, have been experimentally evaluated as providing the best performance for a large number of network architectures and tests [3–5].

While the application of an activation function is justified by the biological facts, and its success is obvious, it is still assembled primarily phenomenologically. Clearly, the activation function was integral to the application of neural networks as logical devices that computed binary variables—but the precise mechanism that justifies these functions' role in inference, and the physiological relevance of this function, both remain unclear. Most proofs of the previous statement also yield little insight into the relevance of the activation function, and especially of the specific shape elaborated on above. These proofs often rely on what could be summarised as the power of non-linearity, which allows the approximation of non-linear or non-polynomial functions. Consider that data generating processes are governed by a dynamical system, which could be a high-dimensional stochastic system or partial differential equation, the solutions to which are typically non-linear or non-polynomial in character. Then, the necessity of such a function becomes clear. More precisely, the theorem offered in [6] states that the set of possible neural network configurations N is dense in the space of continuous real-valued functions, or, that any real-valued function is contained in or is a limit point of N, if and only if the activation function on N is non-polynomial. In other words, given arbitrary width and depth, the property of being a 'universal approximator' is precisely that of having a non-polynomial activation function; the theorem is stated formally below. Still, this leads to little insight about the biological plausibility of, or physical motivation for, the specific functions used. To investigate this, we employ a model from statistical mechanics called the Ising model.
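For reference, the theorem of [6] can be stated formally. The notation here is ours, paraphrasing the reference rather than quoting it, and mild regularity conditions on the activation σ are suppressed. Writing N_σ for the set of functions computed by networks with activation σ,

\[
\forall f \in C(\mathbb{R}^n),\ \forall \varepsilon > 0,\ \forall K \subset \mathbb{R}^n \text{ compact},\ \exists g \in N_\sigma : \sup_{x \in K} |f(x) - g(x)| < \varepsilon
\quad \Longleftrightarrow \quad \sigma \text{ is not a polynomial.}
\]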
The Ising model wasdevised by Wilhelm Lenz and Ernst Ising in the 1920s todescribe magnetism in metals, and the loss of magnet-isation when magnets are heated [7]. The Ising modelis a model of the electronic structure of a metal, whereelectrons in metallic atoms are defined with a propertycalled ‘spin,’ pointed either up or down. When all spinsare positively aligned, or, all lattice sites take values of 1,the system is magnetised. Like many systems in nature,it exhibits a phase transition at a critical temperature, inwhich the magnetisation of a cooled metal is lost abovea critical temperature, or regained when cooled. In fact,many other phase transitions can be proven to take char-acteristics of this one—phase transitions lie in one of afew universality classes , meaning they show the same a r X i v : . [ q - b i o . N C ] F e b characteristics no matter what the underlying dynamicsare. In particular, the Ising universality class containssuch different phenomena as physical transitions from li-quid to gas, and the behaviour of compactified particlesin string theory [8, 9].Similar to this class of behaviour, a neural spike isa sort of transition where, once past a critical point (athreshold potential) a spike is initiated, and the systemgoes from disordered uptake (random diffusion of ionsacross the membrane) to ordered uptake (uptake of Na + and other positive ions to initiate depolarisation). Wewill use this to model the spiking behaviour of the neuralcell as an Ising model with an appropriate phase trans-ition. In so doing, we recover an expression for both thehyperbolic tangent and unbounded-above linear activa-tion functions. II. MAIN RESULTSA. Modelling Neural Dynamics
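As a purely numerical illustration of the transition just described (not part of the original derivation), the following minimal Metropolis simulation of a two-dimensional Ising lattice shows the magnetisation collapsing above a critical temperature. Lattice size, sweep counts, and the temperature grid are illustrative assumptions.

```python
import numpy as np

def metropolis_sweep(spins, beta, J=1.0, rng=None):
    """One Metropolis sweep over a 2D periodic Ising lattice."""
    rng = rng or np.random.default_rng()
    n = spins.shape[0]
    for _ in range(n * n):
        i, j = rng.integers(0, n, size=2)
        # Energy change from flipping spin (i, j): dE = 2*J*s_ij*(sum of neighbours)
        nb = (spins[(i + 1) % n, j] + spins[(i - 1) % n, j]
              + spins[i, (j + 1) % n] + spins[i, (j - 1) % n])
        dE = 2.0 * J * spins[i, j] * nb
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] *= -1
    return spins

def magnetisation_vs_temperature(n=16, temps=np.linspace(1.0, 4.0, 13),
                                 equil=200, measure=50, seed=0):
    """Estimate |m|(T); the exact 2D critical point is near T_c ~ 2.27 J/k_B."""
    rng = np.random.default_rng(seed)
    results = []
    for T in temps:
        spins = np.ones((n, n))          # start ordered (cold start)
        beta = 1.0 / T
        for _ in range(equil):           # equilibrate before measuring
            metropolis_sweep(spins, beta, rng=rng)
        m = np.mean([abs(metropolis_sweep(spins, beta, rng=rng).mean())
                     for _ in range(measure)])
        results.append((T, m))
    return results

if __name__ == "__main__":
    for T, m in magnetisation_vs_temperature():
        print(f"T = {T:.2f}  |m| = {m:.3f}")
```

Below roughly T ≈ 2.27 (in units of J/k_B), |m| stays near one; above it, |m| decays towards zero, which is the behaviour the mean field treatment below captures qualitatively.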
II. MAIN RESULTS

A. Modelling Neural Dynamics

Neural spikes are both a regular, periodic phenomenon, and a highly complex, non-equilibrium process. As an example of self-organisation, neural firing emerges from complex but quantifiable dynamics, here involving ionic equilibria and membrane selectivity [10]. The neurone is surrounded by ions in its extracellular fluid, meaning it is subject to diffusion of these ions across its cell membrane, through ion channels. It maintains a negative resting potential of −70 mV, which requires active transport of positive Na+ ions out of the cell. This results in a persistent concentration gradient, which is precisely what allows a spike to occur. When a critical voltage is reached, previously closed voltage-gated ion channels open. The positive Na+ ions flow along this concentration gradient through these now open channels, leading to an upwards spike in voltage.

We have simplified this model of neural dynamics to be a stationary process coupled to a bath. We can then approximate the system as being in a local equilibrium, meaning we can use a simple equilibrium Ising model to describe the system.

Following this, we simplify an Ising model maintaining a non-equilibrium steady state to an equilibrium one with anomalous local effects. Consider a particular formulation of the Ising model coupled to a thermal bath, which undergoes a rapid quench and magnetises in response to this sudden cooling. When the quench is removed, the Ising model heats up again. For periodic quenching, the dynamics themselves will be periodic, but will obey the typical zero-field transition in spin-based magnetisation. As such, the phases induced by the quench can be made distinct, e.g., alternating disordered phases with thermal fluctuations and ordered phases with positive magnetisation.

This provides a model of neural dynamics, in which the crucial simplification is ignoring the source of the external quench, thereby restricting us only to local (intracellular) interactions within the system. This has no consequence for our model, since firing is the organisation of channel dynamics within the cell.
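The alternation between quench-induced phases can be sketched with a toy discrete-time mean field update; this is our own illustration rather than the model above, and the relaxation rate, quench depth, and reheating rate are all assumed parameters.

```python
import numpy as np

def quenched_mean_field(steps=400, T_hot=1.5, T_c=1.0, quench_period=100,
                        quench_depth=1.2, reheat_rate=0.02):
    """Toy trajectory: temperature is periodically quenched below T_c,
    then relaxes at a constant rate back towards the bath temperature T_hot.
    Magnetisation relaxes towards the mean field fixed point of m = tanh(T_c*m/T)."""
    T, m = T_hot, 0.0
    history = []
    for t in range(steps):
        if t % quench_period == 0 and t > 0:
            T = max(T - quench_depth, 1e-3)       # sudden cooling (the 'input')
        T = min(T + reheat_rate, T_hot)           # constant-rate reheating ('ion pumps')
        target = np.tanh(T_c * max(m, 0.05) / T)  # small seed breaks the m = 0 symmetry
        m += 0.2 * (target - m)                   # relax towards the fixed point
        history.append((t, T, m))
    return history

for t, T, m in quenched_mean_field()[::40]:
    print(f"t={t:3d}  T={T:.2f}  m={m:.2f}")
```

Each quench drops the temperature below T_c and the magnetisation grows towards one (the ordered, 'firing' phase); as the system reheats past T_c, m decays back towards zero (the disordered phase).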
B. The Ising Case for Neural Firing

As stated, both systems are capable of exhibiting two, differently stereotyped dynamics, or 'phases.' In the Ising model, one is a high temperature paramagnetic phase, where the spins in the model are disorganised, weakly correlated with one another, and subject to random fluctuations. The magnetisation m, or average spin, is zero in this case; this is a result of the random configuration of spins, which creates non-alignment, or independence, of spins, in which an approximately equal number should be occupying states −1 and 1. The other is a low temperature ferromagnetic phase, in which the spins are organised and aligned in one direction. Here, m is either −1 or 1. When the interaction energy across n spin pairs is −nJ, the energy is at a minimum. The converse is also true: when the energy in the system decreases, spins must align and take a lower energy configuration to satisfy this. The final state, m = −1 or m = 1, is the magnetised one.
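As a concrete check of the alignment argument (a worked example of ours, using the Hamiltonian introduced in the next subsection), a single pair of spins contributes

\[
E_{ij} = -J s_i s_j =
\begin{cases}
-J, & s_i = s_j \quad \text{(aligned, lower energy)}\\
+J, & s_i \neq s_j \quad \text{(anti-aligned, higher energy)},
\end{cases}
\]

so a lattice whose n spin pairs are all aligned attains the minimum total energy −nJ, which is the magnetised state.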
C. The Transition to Magnetisation

To connect macroscopic observables to microscopic state variables, we often rely on formalisms from statistical mechanics. One such technique used in the study of phase transitions is a particular type of coarse graining called mean field theory (MFT), which formulates a model of the macroscopic level change that results from large numbers of microscopic changes. MFT is quantitatively incorrect in 2D; nonetheless, for both computational and pedagogical reasons, we will demonstrate this using the mean field approach. MFT gives qualitatively correct results and requires only small numerical corrections to describe the magnetisation dynamics.

Here, we will briefly state the derivation for the phase transition in the Ising model. A full derivation is given in [11], with pedagogical commentary. A spin lattice in zero field is described by its Hamiltonian \hat{H} in the following way:

\hat{H} = -J \sum_{\langle i,j \rangle} s_i s_j, \quad \text{with } s_n \in \{-1, 1\}.

\hat{H} gives the total energy of the system, which in turn gives its dynamics, as the sum over all neighbouring spins. Here, a spin is a channel state, which at any time t either contains an Na+ ion or does not.

MFT assumes that at large length scales, a system converges to its average dynamics, with only large fluctuations playing a role in the dynamics of the system. We use this to coarse-grain the model by removing second order fluctuations, which are assumed to be vanishingly small. The local interaction term in \hat{H} is

s_i s_j = (\langle s_i \rangle + \sigma_{s_i})(\langle s_j \rangle + \sigma_{s_j}),

which is a product of two variables under the influence of a random displacement. We assert that in this system, the spins tend towards a similar mean and fluctuations necessarily decrease; therefore, in describing the dynamics leading to an ordered transition, we may assume the fluctuations become small. This means that we can rewrite s_i s_j as the following:

\langle s_i \rangle \langle s_j \rangle + \langle s_j \rangle \sigma_{s_i} + \langle s_i \rangle \sigma_{s_j} + \sigma_{s_i} \sigma_{s_j}.

Since we have assumed the fluctuations are small, the final term will vanish. If we use this, and an expansion of the random displacement into a fluctuation about a mean, this becomes:

s_i s_j \approx \langle s_i \rangle \langle s_j \rangle + \langle s_j \rangle (s_i - \langle s_i \rangle) + \langle s_i \rangle (s_j - \langle s_j \rangle).

We also use the fact that as the phase transitions, the average spin value \langle s_n \rangle will approach a magnetisation value m, corresponding to the organisation of spins needed to produce magnetisation. Then, we can rewrite \hat{H}, replacing spins with this mean field:

-J \sum_{\langle i,j \rangle} \left[ m(s_i + s_j) - m^2 \right].

If we take spin states as being highly correlated, the i's and j's become equal and the sum over neighbours reduces to the number of connections, half the number of neighbours z, across all sites in the lattice. This gives a scaling factor of z:

-zJ \sum_{i=1}^{N} \left[ m s_i - \frac{m^2}{2} \right].

We further simplify to the following:

\hat{H}_{\mathrm{MF}} = -zJm \sum_{i=1}^{N} s_i + \frac{NzJm^2}{2}.

Once we have \hat{H}_{\mathrm{MF}}, it can be used to tell us the actual coordination of spins—the magnetisation m, where m must satisfy the self-consistency condition m = \langle s \rangle.

We use \hat{H}_{\mathrm{MF}} with the canonical partition function to get

Z = \sum_{\{s\}} e^{-\beta \hat{H}_{\mathrm{MF}}},

where \beta is a particular thermodynamical quantity, 1/(k_B T). Expanding the site-wise partition function and using some hyperbolic identities, we find

Z = e^{-\beta NzJm^2/2} \left( 2\cosh(\beta zJm) \right)^N.
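For completeness, the intermediate step hidden in 'expanding the site-wise partition function' is the factorisation of the sum over configurations into independent single-site sums (our own expansion, following the standard route):

\[
Z = e^{-\beta NzJm^2/2} \prod_{i=1}^{N} \sum_{s_i = \pm 1} e^{\beta zJm s_i}
  = e^{-\beta NzJm^2/2} \left( e^{\beta zJm} + e^{-\beta zJm} \right)^N
  = e^{-\beta NzJm^2/2} \left( 2\cosh(\beta zJm) \right)^N.
\]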
Finally, to find our magnetisation m, we must minimise the free energy of the system with respect to m. We get the free energy per site from our partition function as:

F = -\frac{1}{\beta N} \ln Z = \frac{1}{2} zJm^2 - \frac{1}{\beta} \ln\left( 2\cosh(\beta zJm) \right).

Now, we calculate the m which minimises the free energy:

\frac{\partial F}{\partial m} = 0 \implies m = \tanh\left( \frac{zJm}{k_B T} \right),

where we have used the previous definition of the thermodynamic beta. If we define the critical temperature T_c as T_c = zJ/k_B, then this simplifies to:

m = \tanh\left( \frac{T_c m}{T} \right),   (1)

the plot of which is contained in Figure 1.
Figure 1. Magnetisation as a function of temperature. A phase transition is evident in this curve at T = T_c, where the magnetisation becomes non-zero.

Immediately, we observe a hyperbolic tangent function arise in this mean field model. The curve bifurcates as T → 0 into the two stable solutions m = −1 and m = 1, given by the ends of (1). We disregard the zero solution at T < T_c as energetically unfavourable.

Note that this tanh curve is a different sort of activation function; rather than determining magnetisation, it determines either of the possible magnetised states. In principle, we could maintain this bifurcation: suppose we defined two different firing patterns, where, upon receiving an input and crossing a critical point, all channels either contained an Na+ ion or all channels did not. This could represent the firing of an inhibitory neurone causing selective inactivity in a firing excitatory neurone, which would normally communicate a signal. Then, we would have something corresponding to a critical input creating a single spike according to the statistics of the input. In this case, the activation function determines whether a stimulus is likely to elicit excitatory or inhibitory neural spikes, perhaps comprising a different sort of classification.

However, much like this is not the main feature of the Ising model phase transition, this is not the major point of this paper. Instead, we restrict the magnetisation to m = 1, and examine the resultant analogy to firing dynamics. This will also allow us to determine the salient features of the previously mentioned class of activation functions.
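The curve in Figure 1 can be reproduced numerically by solving the self-consistency condition (1) with fixed-point iteration. This sketch is our own illustration; T_c = 1 and the starting point m0 are arbitrary choices, with m0 > 0 selecting the positive branch.

```python
import numpy as np

def solve_magnetisation(T, T_c=1.0, m0=0.9, tol=1e-10, max_iter=10_000):
    """Solve m = tanh(T_c * m / T) by fixed-point iteration.
    Starting from m0 > 0 selects the positive (m = 1) branch below T_c."""
    m = m0
    for _ in range(max_iter):
        m_next = np.tanh(T_c * m / T)
        if abs(m_next - m) < tol:
            break
        m = m_next
    return m

# Below T_c the iteration settles on a non-zero magnetisation;
# above T_c the only fixed point is m = 0.
for T in [0.2, 0.5, 0.8, 0.95, 1.05, 1.5, 2.0]:
    print(f"T/T_c = {T:.2f}  ->  m = {solve_magnetisation(T):.4f}")
```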
D. Unbounded-Above Activation Functions

It is clear from Figure 1 that, at temperatures above T_c, the only solution is m = 0. Below T_c, the solution is m = −1 or m = 1, following the branches of tanh(βzJm). We emphasise that, when restricted to m = 1, this behaviour can be fit to another hyperbolic tangent function, going from zero to one. So, for some parameter a, our function resembles the following:

m(T) = -\frac{1}{2} \tanh\left( a(T - T_c) \right) + \frac{1}{2},   (2)

which becomes positive when we set the temperature to the negated neural state, as suggested earlier—recall the neural membrane voltage is itself negative. This reproduces the neural activation function, where the threshold T_c is a bias and the switching behaviour represents spiking or not spiking.
Figure 2. Magnetisation fit to a hyperbolic tangent. The previously defined magnetisation curve can be fit to another hyperbolic tangent curve, satisfying the typical firing-or-not-firing activation function. Hyperparameter a is set to a = 6, and T_c = 1. A second hyperparameter multiplying the critical temperature can be used to bring the fit even closer.

Since T, the temperature, depends on the quench, we can combine the temperature of the bath T with the quench, T(t) = T − δ(t − t_c) Q. Here, t_c is a time when cooling is applied, and the delta function δ(x) returns one for an argument of zero, and zero everywhere else. Thus, cooling only acts on the temperature when t = t_c.

Note that, while an interaction with a thermal field could have been included in the Hamiltonian, we opt to couple it to the temperature later in the paper, for a variety of reasons—not least to follow through on our approximation of anomalously introduced local effects, and our desire to begin with a time-independent, and thus equilibrium, Hamiltonian.

We may now parameterise motion along this curve due to changes in temperature in time. Recall we have negated temperature, as it is equal to neural membrane voltage, −V. We have already coupled our quench to temperature as a subtractive element that restores it to order.

Suppose the model heats up linearly. This is accurate with respect to the neurone, which uses pumps to eject positive ions at a constant rate. Indeed, we have previously defined a quench as a perturbation from equilibrium. In reality, it is the influx of positive ions due to some firing event adjacent to the neural cell. The model will heat up again as soon as it loses heat, or pumps ions out of its cell body. We assume the neural pump acts with a constant speed, so that the time spent in the m = 1 regime can be parameterised as linear in time t. Suppose also that this rate is one degree per second, such that dT/dt = 1 and T(t) = t − Q for t_c < t ≤ Q. Then, the number of spikes emitted, or the time spent in m = 1, is the integral of (2) from zero to Q. This is because, under our assumptions, the time to cool is equal to one kelvin per unit time. In that case, the time to cool back to zero perturbation when Q is subtracted from T is exactly Q.

Given the analogy drawn in II A and II B holds, clearly, the mean field description of neural firing leads to a function that describes neural firing based on perturbations to equilibrium, as described above. Then, the number of spikes emitted in this time is the time spent in m = 1, or the time before cooling, which is the previously described integral of (2) with respect to temperature. This integral is evaluated as follows:

\frac{1}{2} \int^{Q} \left[ \tanh(T) + 1 \right] \mathrm{d}T = Q - \frac{1}{2} \ln(2) + \frac{1}{2} \ln\left( 1 + e^{-2Q} \right) + C.

Clearly, we have the linear term dominating for Q ≫ 0. For C = 0, we recover our ReLU function. This behaviour also reproduces that of the exponential linear unit for C = −α, differing by no more than α anywhere. In general, we have an expression for linear or approximately linear, unbounded-above activation functions.

If we choose to fit a sigmoid function to our magnetisation, rather than the hyperbolic tangent, then we have a similar result:

\int^{Q} \frac{1}{1 + e^{-T}} \, \mathrm{d}T = Q + \ln\left( 1 + e^{-Q} \right) + C.

The relevance of this with respect to neural firing is that the saturating activation function is only a binary classification case, with one spike—a single logic gate. More complex learning, such as the encoding of complex stimuli, on the other hand, requires many spikes. Hard quenches, or strong inputs, mean more time spent in the m = 1 regime; thus, stronger inputs mean more spikes get emitted. We then recover ReLU and ELU as functions for firing rate, by counting spikes over time; since time spent magnetised, or time before heating, corresponds directly to quench strength, so too does spike count.
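The evaluation above can be checked numerically (our own verification, not from the text) by integrating the fitted activation from 0 to Q and comparing against ReLU and softplus. The grid of Q values is an arbitrary choice.

```python
import numpy as np
from scipy.integrate import quad

def spike_count_tanh(Q):
    """Time spent in m = 1: integral of (1/2)(tanh(T) + 1) from 0 to Q."""
    return quad(lambda T: 0.5 * (np.tanh(T) + 1.0), 0.0, Q)[0]

def spike_count_sigmoid(Q):
    """Same, with the logistic fit: integral of 1/(1 + exp(-T)) from 0 to Q."""
    return quad(lambda T: 1.0 / (1.0 + np.exp(-T)), 0.0, Q)[0]

def relu(Q):
    return max(Q, 0.0)

def softplus(Q):
    return np.log1p(np.exp(Q))

for Q in [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]:
    print(f"Q={Q:5.1f}  tanh-int={spike_count_tanh(Q):7.3f}  "
          f"ReLU={relu(Q):7.3f}  "
          f"sigmoid-int={spike_count_sigmoid(Q):7.3f}  "
          f"softplus-ln2={softplus(Q) - np.log(2):7.3f}")
```

For large Q both integrals approach Q, the linear regime of ReLU; for small Q they round off smoothly, as ELU and softplus do, and the sigmoid integral matches softplus up to the constant ln 2.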
III. DISCUSSION

A. Firing Rates and Sparse Neural Codes

Evidence suggests that neurones rely on sparse coding to efficiently communicate stimuli, especially in high-noise or high-dimensional environments. In fact, many separate neural coding schemes have been considered to emerge from sparsity, which neural networks employ due to energy constraints and to cope with dimensionality [12]. Broadly, sparse coding states that different firing rates, which contain representations of information by encoding features of a stimulus, will be sparsely distributed in a neural network. In large neural populations, key neurones will be firing at various rates and most other neurones will not be firing at all. Such a sparse code is advantageous for efficient learning by decorrelating inputs, which allows features to be coded independently. Crucially, this leads to a robust representation, and is equivalent to reducing the coding of redundant features while preserving coded information [13].

The rectifier, or the unbounded-above activation function we discuss, has indeed been shown to improve representation in deep neural networks by precisely these mechanisms [14]. Some neurones are firing with a particular rate, lying on the linear portion of the curve, and others are resting, lying on the portion of the curve valued at zero. The coding benefits highlighted are exactly those found in sparse representations in biological neural networks, where disentangling is referred to as decorrelating inputs, which assists in learning high-dimensional data. They also utilise sparsity as a rich but energy efficient coding scheme, showing that in deep neural networks, sparse representations take fewer computational resources while showing high training accuracy.

In our model, this sparsity is reproduced by local effects such as quenches of different magnitudes acting on particular neurones in the network.
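A quick toy illustration of rectifier-induced sparsity, in the spirit of [14] but not taken from it: applying ReLU to zero-mean pre-activations silences roughly half the units, whereas a saturating activation leaves essentially none exactly at zero.

```python
import numpy as np

rng = np.random.default_rng(42)
pre_activations = rng.standard_normal((1000, 256))  # toy layer: 1000 inputs, 256 units

relu_out = np.maximum(pre_activations, 0.0)
relu_sparsity = np.mean(relu_out == 0.0)    # fraction of exactly-silent units

tanh_out = np.tanh(pre_activations)
tanh_sparsity = np.mean(tanh_out == 0.0)    # saturating units are almost never exactly zero

print(f"ReLU: {relu_sparsity:.1%} of outputs exactly zero")  # ~50%
print(f"tanh: {tanh_sparsity:.1%} of outputs exactly zero")  # ~0%
```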
B. Energy-Based Learning
Recently, machine learning has been reformulated within an energy-based framework—in particular, a paradigm based on energy minimisation has been proposed, where choosing a network configuration that minimises energy is equivalent to finding an output that minimises loss [15]. Here, the energy of a configuration is used as a penalty, following the idea that physical systems seek to minimise free energy and that this underlies the stability of a given system state. This appeals to statistical mechanical ideas about energy minimisation, which we have already used in discussing the Ising model—the configuration chosen by a system always obeys a minimisation principle. As such, this can be used as a measurement of error, where we designate high energy states as being incorrect in both the physical and statistical sense. A useful way of thinking about the idea of free energy minimisation is that the free energy is defined as

F = E − TS.
For clamped energy levels, clearly, maximising entropy is equivalent to minimising free energy, since

ΔF = −TΔS

for constant energy and temperature. Then, free energy minimisation is a natural consequence of the second law of thermodynamics, which states that systems will always produce greater entropy. In the information theoretic sense, defined by Jaynes as essentially equivalent to the thermodynamical sense, maximising entropy is choosing the best model of observed variables [16]. Thus, we have a direct application to our inference or learning process.

Following this, we examine why a neural Ising model spikes. Clearly, when the temperature decreases, the entropic contribution to free energy decreases as well. Hence, minimisation of free energy occurs when total energy is minimised. We observed this happen when the Hamiltonian was in a magnetised state. In the sense of an error signal, when an input—a temperature-lowering quench—arrives, the error in the system is high as long as the Ising model occupies a high energy state, which is unlikely given the physical and statistical scenario. By magnetising, or spiking, the system decreases this error through responding to the input, which is equivalent to choosing a free energy minimising stable state. In the energy-based learning scheme, loss functions are often arrived at by explicitly considering the marginalised Gibbs distribution over the inputs to the system, and learning is performed by minimising the resultant free energy in the zero-temperature limit.

This accords with other, more biological ideas concerning energy minimisation in learning, wherein neural spikes learn the relationship between stimulus and evoked response, and minimisation of energy underlies learning. It has been found that real neural networks, in vitro, minimise variational free energy when learning representations of stimuli [17]. Variational free energy is an information-theoretic notion closely related to the thermodynamical Helmholtz free energy, although whether only by statistical mechanical analogy or also by physical principles remains controversial [18, 19].
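To make the zero-temperature limit explicit (a standard rendering, ours rather than verbatim from [15]): with a Gibbs distribution over candidate outputs y for input x, the free energy is

\[
F_\beta(x) = -\frac{1}{\beta} \ln \sum_{y} e^{-\beta E(x, y)},
\qquad
\lim_{\beta \to \infty} F_\beta(x) = \min_{y} E(x, y),
\]

so minimising the zero-temperature free energy is exactly selecting the minimum-energy, i.e. lowest-loss, output.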
C. Self-Similarity and Criticality

We note one final implication by suggesting a relationship between this result and mean field theory applied to neural populations. In particular, we note that to recover non-linear firing statistics, the collective dynamics of neural populations are almost ubiquitously described using a sigmoid function [20]. Much like the activation function in the artificial neural network, however, this is justified by performance rather than explicit theory, and is given heuristically. The results in this paper undoubtedly extrapolate to the case of neurones as a subunit and neural populations as a mean field, thus justifying this phenomenology in the same way. This relationship is consistent with the self-similarity observed in the human cortex. Results on self-similarity in the Ising lattice suggest that both are at criticality, or the so-called 'edge of chaos.' This is consistent with observations of critical dynamics in the brain, and is important for both biological and artificial neural networks, which have been shown to perform best at criticality [5, 21, 22].
IV. CONCLUSION
We have seen that a neural spike is analogous to the phase transition in the Ising model; as such, we have motivated the designs of historical and modern artificial neural networks, and in particular, the concept and typical form of the activation function. In so doing, we have also examined how ideal learning necessarily invokes the non-linear processes in the neurone, and utilises energy minimisation, by modelling this process with an Ising model and applying other statistical mechanical ideas.

[1] Tara H Abraham. (Physio)logical circuits: The intellectual origins of the McCulloch–Pitts neural network. Journal of the History of the Behavioral Sciences, 38(1):3–25, 2002.
[2] Walter McCulloch and Warren Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In F Pereira, C J C Burges, L Bottou, and K Q Weinberger, editors, Advances in Neural Information Processing Systems, volume 25, pages 1097–1105. Curran Associates, Inc., 2012.
[4] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, pages 807–814, Madison, WI, USA, 2010. Omnipress.
[5] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. On the impact of the activation function on deep neural networks training. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2672–2680. PMLR, 2019.
[6] Moshe Leshno, Vladimir Y Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6:861–867, 1993.
[7] Stephen Brush. History of the Lenz–Ising model. Reviews of Modern Physics, 39(4):883–893, 1967.
[8] A D Bruce and N B Wilding. Scaling fields and universality of the liquid–gas critical point. Physical Review Letters, 68:193–196, 1992.
[9] Alexander Migdal. Turbulence, string theory and Ising model. arXiv e-prints, arXiv:1912.00276, 2019.
[10] A L Hodgkin and A F Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4):500–544, 1952.
[11] Dalton A R Sakthivadivel. A pedagogical discussion of magnetisation in the mean field Ising model. arXiv e-prints, arXiv:2102.00960, 2021.
[12] Michael Beyeler, Emily L Rounds, Kristofor D Carlson, Nikil Dutt, and Jeffrey L Krichmar. Neural correlates of sparse coding and dimensionality reduction. PLOS Computational Biology, 15(6):1–33, 2019.
[13] Peter Földiák. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64:165–170, 1990.
[14] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323. JMLR Workshop and Conference Proceedings, 2011.
[15] Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang. A tutorial on energy-based learning. In G Bakir, T Hofman, B Schölkopf, A Smola, and B Taskar, editors, Predicting Structured Data. MIT Press, 2006.
[16] Edwin Thompson Jaynes. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.
[17] Takuya Isomura and Karl Friston. In vitro neural networks minimise variational free energy. Scientific Reports, 8(1):1–14, 2018.
[18] Alex B Kiefer. Psychophysical identity and free energy. Journal of the Royal Society Interface, 17:20200370, 2020.
[19] Mel Andrews. The math is not the territory: Navigating the free energy principle, 2020. Available from http://philsci-archive.pitt.edu/18315/.
[20] Gustavo Deco, Viktor K Jirsa, Peter A Robinson, Michael Breakspear, and Karl Friston. The dynamic brain: From spiking neurons to neural masses and cortical fields. PLOS Computational Biology, 4(8):1–35, 2008.
[21] Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 7103–7114. Curran Associates, Inc., 2017.
[22] Zhengyu Ma, Gina G Turrigiano, Ralf Wessel, and Keith B Hengen. Cortical circuit dynamics are homeostatically tuned to criticality in vivo.