Formalising the Use of the Activation Function in Neural Inference
Dalton A R Sakthivadivel∗

Stony Brook University, Stony Brook, New York, 11794-5281

(Dated: 10th February 2021)

We investigate how activation functions can be used to describe neural firing in an abstract way, and in turn, why they work well in artificial neural networks. We discuss how a spike in a biological neurone belongs to a particular universality class of phase transitions in statistical physics. We then show that the artificial neurone is, mathematically, a mean field model of biological neural membrane dynamics, which arises from modelling spiking as a phase transition. This allows us to treat selective neural firing in an abstract way, and formalise the role of the activation function in perceptron learning. Along with deriving this model and specifying the analogous neural case, we analyse the phase transition to understand the physics of neural network learning. Together, it is shown that there is not only a biological meaning, but a physical justification, for the emergence and performance of canonical activation functions; implications for neural learning and inference are also discussed.

∗ [email protected]
I. INTRODUCTION
The perceptron learning algorithm, developed by McCulloch and Pitts in 1943, is one of the earliest applications of biological principles for computation to mathematics, or to machines [1]. A simple model, the perceptron consists of a single logic gate, and is only capable of classification using linearly separable functions, like AND and OR. Nonetheless, recent algorithms have deviated only slightly from the original developments by McCulloch and Pitts; in many cases, these simply stack perceptrons or add features onto the original algorithm, such as in deep neural networks or convolutional neural networks. Clearly, the contribution of the single-layer perceptron remains relevant today.

Somewhat anomalous in the perceptron, and indeed in further models, is the critical importance of the activation function. McCulloch and Pitts recognised that neural firing occurs in an all-or-none fashion, and that any function with a rapid transition between two 'standard' behaviours would suffice to describe this [2]. In other words, a specific class of functions is generally used for activation functions, which can be described as discontinuous or nearly discontinuous, vertically asymmetric about a 'switching point,' and bounded from below. A concrete example is the sigmoid function originally used by McCulloch and Pitts. Interestingly, a class of activation functions that are bounded from below but exhibit asymptotically linear behaviour for inputs greater than a critical threshold, such as ReLU, ELU, Mish, and Swish, have been experimentally evaluated as providing the best performance for a large number of network architectures and tests [3–5].

While the application of an activation function is justified by the biological facts, and its success is obvious, it is still assembled primarily phenomenologically. Clearly, the activation function was integral to the application of neural networks as logical devices that computed binary variables—but the precise mechanism that justifies these functions' role in inference, and the physiological relevance of this function, both remain unclear. Most proofs of the previous statement also yield little insight into the relevance of the activation function, and especially of the specific shape elaborated on above. These proofs often rely on what could be summarised as the power of non-linearity, which allows the approximation of non-linear or non-polynomial functions. Consider that data generating processes are governed by a dynamical system, which could be a high-dimensional stochastic system or partial differential equation, the solutions to which are typically non-linear or non-polynomial in character. Then, the necessity of such a function becomes clear. More precisely, the theorem offered in [6] states that the set of possible neural network configurations N is dense in the space of continuous real-valued functions, or, that any real-valued function is contained in or is a limit point of N, if and only if the activation function on N is non-polynomial. In other words, given arbitrary width and depth, the property of being a 'universal approximator' is precisely that of having a non-polynomial activation function; the theorem is stated formally below. Still, this leads to little insight about the biological plausibility of, or physical motivation for, the specific functions used. To investigate this, we employ a model from statistical mechanics called the Ising model.
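For reference, the theorem of [6] can be stated formally. The notation here is ours, paraphrasing the reference rather than quoting it, and mild regularity conditions on the activation σ are suppressed. Writing N_σ for the set of functions computed by networks with activation σ,

\[
\forall f \in C(\mathbb{R}^n),\ \forall \varepsilon > 0,\ \forall K \subset \mathbb{R}^n \text{ compact},\ \exists g \in N_\sigma : \sup_{x \in K} |f(x) - g(x)| < \varepsilon
\quad \Longleftrightarrow \quad \sigma \text{ is not a polynomial.}
\]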
The Ising model wasdevised by Wilhelm Lenz and Ernst Ising in the 1920s todescribe magnetism in metals, and the loss of magnet-isation when magnets are heated [7]. The Ising modelis a model of the electronic structure of a metal, whereelectrons in metallic atoms are defined with a propertycalled ‘spin,’ pointed either up or down. When all spinsare positively aligned, or, all lattice sites take values of 1,the system is magnetised. Like many systems in nature,it exhibits a phase transition at a critical temperature, inwhich the magnetisation of a cooled metal is lost abovea critical temperature, or regained when cooled. In fact,many other phase transitions can be proven to take char-acteristics of this one—phase transitions lie in one of afew universality classes , meaning they show the same a r X i v : . [ q - b i o . N C ] F e b characteristics no matter what the underlying dynamicsare. In particular, the Ising universality class containssuch different phenomena as physical transitions from li-quid to gas, and the behaviour of compactified particlesin string theory [8, 9].Similar to this class of behaviour, a neural spike isa sort of transition where, once past a critical point (athreshold potential) a spike is initiated, and the systemgoes from disordered uptake (random diffusion of ionsacross the membrane) to ordered uptake (uptake of Na + and other positive ions to initiate depolarisation). Wewill use this to model the spiking behaviour of the neuralcell as an Ising model with an appropriate phase trans-ition. In so doing, we recover an expression for both thehyperbolic tangent and unbounded-above linear activa-tion functions. II. MAIN RESULTSA. Modelling Neural Dynamics
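As a purely numerical illustration of the transition just described (not part of the original derivation), the following minimal Metropolis simulation of a two-dimensional Ising lattice shows the magnetisation collapsing above a critical temperature. Lattice size, sweep counts, and the temperature grid are illustrative assumptions.

```python
import numpy as np

def metropolis_sweep(spins, beta, J=1.0, rng=None):
    """One Metropolis sweep over a 2D periodic Ising lattice."""
    rng = rng or np.random.default_rng()
    n = spins.shape[0]
    for _ in range(n * n):
        i, j = rng.integers(0, n, size=2)
        # Energy change from flipping spin (i, j): dE = 2*J*s_ij*(sum of neighbours)
        nb = (spins[(i + 1) % n, j] + spins[(i - 1) % n, j]
              + spins[i, (j + 1) % n] + spins[i, (j - 1) % n])
        dE = 2.0 * J * spins[i, j] * nb
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] *= -1
    return spins

def magnetisation_vs_temperature(n=16, temps=np.linspace(1.0, 4.0, 13),
                                 equil=200, measure=50, seed=0):
    """Estimate |m|(T); the exact 2D critical point is near T_c ~ 2.27 J/k_B."""
    rng = np.random.default_rng(seed)
    results = []
    for T in temps:
        spins = np.ones((n, n))          # start ordered (cold start)
        beta = 1.0 / T
        for _ in range(equil):           # equilibrate before measuring
            metropolis_sweep(spins, beta, rng=rng)
        m = np.mean([abs(metropolis_sweep(spins, beta, rng=rng).mean())
                     for _ in range(measure)])
        results.append((T, m))
    return results

if __name__ == "__main__":
    for T, m in magnetisation_vs_temperature():
        print(f"T = {T:.2f}  |m| = {m:.3f}")
```

Below roughly T ≈ 2.27 (in units of J/k_B), |m| stays near one; above it, |m| decays towards zero, which is the behaviour the mean field treatment below captures qualitatively.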
II. MAIN RESULTS

A. Modelling Neural Dynamics

Neural spikes are both a regular, periodic phenomenon, and a highly complex, non-equilibrium process. As an example of self-organisation, neural firing emerges from complex but quantifiable dynamics, here involving ionic equilibria and membrane selectivity [10]. The neurone is surrounded by ions in its extracellular fluid, meaning it is subject to diffusion of these ions across its cell membrane, through ion channels. It maintains a negative resting potential of −70 mV, which requires active transport of positive Na+ ions out of the cell. This results in a persistent concentration gradient, which is precisely what allows a spike to occur. When a critical voltage is reached, previously closed voltage-gated ion channels open. The positive Na+ ions flow along this concentration gradient through these now open channels, leading to an upwards spike in voltage.

We have simplified this model of neural dynamics to be a stationary process coupled to a bath. We can then approximate the system as being in a local equilibrium, meaning we can use a simple equilibrium Ising model to describe the system.

Following this, we simplify an Ising model maintaining a non-equilibrium steady state to an equilibrium one with anomalous local effects. Consider a particular formulation of the Ising model coupled to a thermal bath, which undergoes a rapid quench and magnetises in response to this sudden cooling. When the quench is removed, the Ising model heats up again. For periodic quenching, the dynamics themselves will be periodic, but will obey the typical zero-field transition in spin-based magnetisation. As such, the phases induced by the quench can be made distinct, e.g., alternating disordered phases with thermal fluctuations and ordered phases with positive magnetisation.

This provides a model of neural dynamics, in which the crucial simplification is ignoring the source of the external quench, thereby restricting us only to local (intracellular) interactions within the system. This has no consequence for our model, since firing is the organisation of channel dynamics within the cell.
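The alternation between quench-induced phases can be sketched with a toy discrete-time mean field update; this is our own illustration rather than the model above, and the relaxation rate, quench depth, and reheating rate are all assumed parameters.

```python
import numpy as np

def quenched_mean_field(steps=400, T_hot=1.5, T_c=1.0, quench_period=100,
                        quench_depth=1.2, reheat_rate=0.02):
    """Toy trajectory: temperature is periodically quenched below T_c,
    then relaxes at a constant rate back towards the bath temperature T_hot.
    Magnetisation relaxes towards the mean field fixed point of m = tanh(T_c*m/T)."""
    T, m = T_hot, 0.0
    history = []
    for t in range(steps):
        if t % quench_period == 0 and t > 0:
            T = max(T - quench_depth, 1e-3)       # sudden cooling (the 'input')
        T = min(T + reheat_rate, T_hot)           # constant-rate reheating ('ion pumps')
        target = np.tanh(T_c * max(m, 0.05) / T)  # small seed breaks the m = 0 symmetry
        m += 0.2 * (target - m)                   # relax towards the fixed point
        history.append((t, T, m))
    return history

for t, T, m in quenched_mean_field()[::40]:
    print(f"t={t:3d}  T={T:.2f}  m={m:.2f}")
```

Each quench drops the temperature below T_c and the magnetisation grows towards one (the ordered, 'firing' phase); as the system reheats past T_c, m decays back towards zero (the disordered phase).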
B. The Ising Case for Neural Firing

As stated, both systems are capable of exhibiting two, differently stereotyped dynamics, or 'phases.' In the Ising model, one is a high temperature paramagnetic phase, where the spins in the model are disorganised, weakly correlated with one another, and subject to random fluctuations. The magnetisation m, or average spin, is zero in this case; this is a result of the random configuration of spins, which creates non-alignment, or independence, of spins, in which an approximately equal number should be occupying states −1 and 1. The other is a low temperature ferromagnetic phase, in which the spins are organised and aligned in one direction. Here, m is either −1 or 1. When the interaction energy across n spin pairs is −nJ, the energy is at a minimum. The converse is also true: when the energy in the system decreases, spins must align and take a lower energy configuration to satisfy this. The final state, m = −1 or m = 1, is the magnetised one.
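As a concrete check of the alignment argument (a worked example of ours, using the Hamiltonian introduced in the next subsection), a single pair of spins contributes

\[
E_{ij} = -J s_i s_j =
\begin{cases}
-J, & s_i = s_j \quad \text{(aligned, lower energy)}\\
+J, & s_i \neq s_j \quad \text{(anti-aligned, higher energy)},
\end{cases}
\]

so a lattice whose n spin pairs are all aligned attains the minimum total energy −nJ, which is the magnetised state.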
C. The Transition to Magnetisation

To connect macroscopic observables to microscopic state variables, we often rely on formalisms from statistical mechanics. One such technique used in the study of phase transitions is a particular type of coarse graining called mean field theory (MFT), which formulates a model of the macroscopic level change that results from large numbers of microscopic changes. MFT is quantitatively incorrect in 2D; nonetheless, for both computational and pedagogical reasons, we will demonstrate this using the mean field approach. MFT gives qualitatively correct results and requires only small numerical corrections to describe the magnetisation dynamics.

Here, we will briefly state the derivation for the phase transition in the Ising model. A full derivation is given in [11], with pedagogical commentary. A spin lattice in zero field is described by its Hamiltonian \hat{H} in the following way:

\hat{H} = -J \sum_{\langle i,j \rangle} s_i s_j, \quad \text{with } s_n \in \{-1, 1\}.

\hat{H} gives the total energy of the system, which in turn gives its dynamics, as the sum over all neighbouring spins. Here, a spin is a channel state, which at any time t either contains an Na+ ion or does not.

MFT assumes that at large length scales, a system converges to its average dynamics, with only large fluctuations playing a role in the dynamics of the system. We use this to coarse-grain the model by removing second order fluctuations, which are assumed to be vanishingly small. The local interaction term in \hat{H} is

s_i s_j = (\langle s_i \rangle + \sigma_{s_i})(\langle s_j \rangle + \sigma_{s_j}),

which is a product of two variables under the influence of a random displacement. We assert that in this system, the spins tend towards a similar mean and fluctuations necessarily decrease; therefore, in describing the dynamics leading to an ordered transition, we may assume the fluctuations become small. This means that we can rewrite s_i s_j as the following:

\langle s_i \rangle \langle s_j \rangle + \langle s_j \rangle \sigma_{s_i} + \langle s_i \rangle \sigma_{s_j} + \sigma_{s_i} \sigma_{s_j}.

Since we have assumed the fluctuations are small, the final term will vanish. If we use this, and an expansion of the random displacement into a fluctuation about a mean, this becomes:

s_i s_j \approx \langle s_i \rangle \langle s_j \rangle + \langle s_j \rangle (s_i - \langle s_i \rangle) + \langle s_i \rangle (s_j - \langle s_j \rangle).

We also use the fact that as the phase transitions, the average spin value \langle s_n \rangle will approach a magnetisation value m, corresponding to the organisation of spins needed to produce magnetisation. Then, we can rewrite \hat{H}, replacing spins with this mean field:

-J \sum_{\langle i,j \rangle} \left[ m(s_i + s_j) - m^2 \right].

If we take spin states as being highly correlated, the i's and j's become equal and the sum over neighbours reduces to the number of connections, half the number of neighbours z, across all sites in the lattice. This gives a scaling factor of z:

-zJ \sum_{i=1}^{N} \left[ m s_i - \frac{m^2}{2} \right].

We further simplify to the following:

\hat{H}_{\mathrm{MF}} = -zJm \sum_{i=1}^{N} s_i + \frac{NzJm^2}{2}.

Once we have \hat{H}_{\mathrm{MF}}, it can be used to tell us the actual coordination of spins—the magnetisation m, where m must satisfy the self-consistency condition m = \langle s \rangle.

We use \hat{H}_{\mathrm{MF}} with the canonical partition function to get

Z = \sum_{\{s\}} e^{-\beta \hat{H}_{\mathrm{MF}}},

where \beta is a particular thermodynamical quantity, 1/(k_B T). Expanding the site-wise partition function and using some hyperbolic identities, we find

Z = e^{-\beta NzJm^2/2} \left( 2\cosh(\beta zJm) \right)^N.
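For completeness, the intermediate step hidden in 'expanding the site-wise partition function' is the factorisation of the sum over configurations into independent single-site sums (our own expansion, following the standard route):

\[
Z = e^{-\beta NzJm^2/2} \prod_{i=1}^{N} \sum_{s_i = \pm 1} e^{\beta zJm s_i}
  = e^{-\beta NzJm^2/2} \left( e^{\beta zJm} + e^{-\beta zJm} \right)^N
  = e^{-\beta NzJm^2/2} \left( 2\cosh(\beta zJm) \right)^N.
\]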
Finally, to find our magnetisation m, we must minimise the free energy of the system with respect to m. We get the free energy per site from our partition function as:

F = -\frac{1}{\beta N} \ln Z = \frac{1}{2} zJm^2 - \frac{1}{\beta} \ln\left( 2\cosh(\beta zJm) \right).

Now, we calculate the m which minimises the free energy:

\frac{\partial F}{\partial m} = 0 \implies m = \tanh\left( \frac{zJm}{k_B T} \right),

where we have used the previous definition of the thermodynamic beta. If we define the critical temperature T_c as T_c = zJ/k_B, then this simplifies to:

m = \tanh\left( \frac{T_c m}{T} \right),   (1)

the plot of which is contained in Figure 1.
Figure 1. Magnetisation as a function of temperature. A phase transition is evident in this curve at T = T_c, where the magnetisation becomes non-zero.

Immediately, we observe a hyperbolic tangent function arise in this mean field model. The curve bifurcates as T → 0 into the two stable solutions m = −1 and m = 1, given by the ends of (1). We disregard the zero solution at T < T_c as energetically unfavourable.

Note that this tanh curve is a different sort of activation function; rather than determining magnetisation, it determines either of the possible magnetised states. In principle, we could maintain this bifurcation: suppose we defined two different firing patterns, where, upon receiving an input and crossing a critical point, all channels either contained an Na+ ion or all channels did not. This could represent the firing of an inhibitory neurone causing selective inactivity in a firing excitatory neurone, which would normally communicate a signal. Then, we would have something corresponding to a critical input creating a single spike according to the statistics of the input. In this case, the activation function determines whether a stimulus is likely to elicit excitatory or inhibitory neural spikes, perhaps comprising a different sort of classification.

However, much like this is not the main feature of the Ising model phase transition, this is not the major point of this paper. Instead, we restrict the magnetisation to m = 1, and examine the resultant analogy to firing dynamics. This will also allow us to determine the salient features of the previously mentioned class of activation functions.
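The curve in Figure 1 can be reproduced numerically by solving the self-consistency condition (1) with fixed-point iteration. This sketch is our own illustration; T_c = 1 and the starting point m0 are arbitrary choices, with m0 > 0 selecting the positive branch.

```python
import numpy as np

def solve_magnetisation(T, T_c=1.0, m0=0.9, tol=1e-10, max_iter=10_000):
    """Solve m = tanh(T_c * m / T) by fixed-point iteration.
    Starting from m0 > 0 selects the positive (m = 1) branch below T_c."""
    m = m0
    for _ in range(max_iter):
        m_next = np.tanh(T_c * m / T)
        if abs(m_next - m) < tol:
            break
        m = m_next
    return m

# Below T_c the iteration settles on a non-zero magnetisation;
# above T_c the only fixed point is m = 0.
for T in [0.2, 0.5, 0.8, 0.95, 1.05, 1.5, 2.0]:
    print(f"T/T_c = {T:.2f}  ->  m = {solve_magnetisation(T):.4f}")
```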
D. Unbounded-Above Activation Functions

It is clear from Figure 1 that, at temperatures above T_c, the only solution is m = 0. Below T_c, the solution is m = −1 or m = 1, following the branches of tanh(βzJm). We emphasise that, when restricted to m = 1, this behaviour can be fit to another hyperbolic tangent function, going from zero to one. So, for some parameter a, our function resembles the following:

m(T) = -\frac{1}{2} \tanh\left( a(T - T_c) \right) + \frac{1}{2},   (2)

which becomes positive when we set the temperature to the negated neural state, as suggested earlier—recall the neural membrane voltage is itself negative. This reproduces the neural activation function, where the threshold T_c is a bias and the switching behaviour represents spiking or not spiking.
Figure 2. Magnetisation fit to a hyperbolic tangent. The previously defined magnetisation curve can be fit to another hyperbolic tangent curve, satisfying the typical firing-or-not-firing activation function. Hyperparameter a is set to a = 6, and T_c = 1. A second hyperparameter multiplying the critical temperature can be used to bring the fit even closer.

Since T, the temperature, depends on the quench, we can combine the temperature of the bath T with the quench, T(t) = T − δ(t − t_c) Q. Here, t_c is a time when cooling is applied, and the delta function δ(x) returns one for an argument of zero, and zero everywhere else. Thus, cooling only acts on the temperature when t = t_c.

Note that, while an interaction with a thermal field could have been included in the Hamiltonian, we opt to couple it to the temperature later in the paper, for a variety of reasons—not least to follow through on our approximation of anomalously introduced local effects, and our desire to begin with a time-independent, and thus equilibrium, Hamiltonian.

We may now parameterise motion along this curve due to changes in temperature in time. Recall we have negated temperature, as it is equal to neural membrane voltage, −V. We have already coupled our quench to temperature as a subtractive element that restores it to order.

Suppose the model heats up linearly. This is accurate with respect to the neurone, which uses pumps to eject positive ions at a constant rate. Indeed, we have previously defined a quench as a perturbation from equilibrium. In reality, it is the influx of positive ions due to some firing event adjacent to the neural cell. The model will heat up again as soon as it loses heat, or pumps ions out of its cell body. We assume the neural pump acts with a constant speed, so that the time spent in the m = 1 regime can be parameterised as linear in time t. Suppose also that this rate is one degree per second, such that dT/dt = 1 and T(t) = t − Q for t_c < t ≤ Q. Then, the number of spikes emitted, or the time spent in m = 1, is the integral of (2) from zero to Q. This is because, under our assumptions, the time to cool is equal to one kelvin per unit time. In that case, the time to cool back to zero perturbation when Q is subtracted from T is exactly Q.

Given the analogy drawn in II A and II B holds, clearly, the mean field description of neural firing leads to a function that describes neural firing based on perturbations to equilibrium, as described above. Then, the number of spikes emitted in this time is the time spent in m = 1, or the time before cooling, which is the previously described integral of (2) with respect to temperature. This integral is evaluated as follows:

\frac{1}{2} \int^{Q} \left[ \tanh(T) + 1 \right] \mathrm{d}T = Q - \frac{1}{2} \ln(2) + \frac{1}{2} \ln\left( 1 + e^{-2Q} \right) + C.

Clearly, we have the linear term dominating for Q ≫ 0. For C = 0, we recover our ReLU function. This behaviour also reproduces that of the exponential linear unit for C = −α, differing by no more than α anywhere. In general, we have an expression for linear or approximately linear, unbounded-above activation functions.

If we choose to fit a sigmoid function to our magnetisation, rather than the hyperbolic tangent, then we have a similar result:

\int^{Q} \frac{1}{1 + e^{-T}} \, \mathrm{d}T = Q + \ln\left( 1 + e^{-Q} \right) + C.

The relevance of this with respect to neural firing is that the saturating activation function is only a binary classification case, with one spike—a single logic gate. More complex learning, such as the encoding of complex stimuli, on the other hand, requires many spikes. Hard quenches, or strong inputs, mean more time spent in the m = 1 regime; thus, stronger inputs mean more spikes get emitted. We then recover ReLU and ELU as functions for firing rate, by counting spikes over time; since time spent magnetised, or time before heating, corresponds directly to quench strength, so too does spike count.
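The evaluation above can be checked numerically (our own verification, not from the text) by integrating the fitted activation from 0 to Q and comparing against ReLU and softplus. The grid of Q values is an arbitrary choice.

```python
import numpy as np
from scipy.integrate import quad

def spike_count_tanh(Q):
    """Time spent in m = 1: integral of (1/2)(tanh(T) + 1) from 0 to Q."""
    return quad(lambda T: 0.5 * (np.tanh(T) + 1.0), 0.0, Q)[0]

def spike_count_sigmoid(Q):
    """Same, with the logistic fit: integral of 1/(1 + exp(-T)) from 0 to Q."""
    return quad(lambda T: 1.0 / (1.0 + np.exp(-T)), 0.0, Q)[0]

def relu(Q):
    return max(Q, 0.0)

def softplus(Q):
    return np.log1p(np.exp(Q))

for Q in [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]:
    print(f"Q={Q:5.1f}  tanh-int={spike_count_tanh(Q):7.3f}  "
          f"ReLU={relu(Q):7.3f}  "
          f"sigmoid-int={spike_count_sigmoid(Q):7.3f}  "
          f"softplus-ln2={softplus(Q) - np.log(2):7.3f}")
```

For large Q both integrals approach Q, the linear regime of ReLU; for small Q they round off smoothly, as ELU and softplus do, and the sigmoid integral matches softplus up to the constant ln 2.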
III. DISCUSSION

A. Firing Rates and Sparse Neural Codes

Evidence suggests that neurones rely on sparse coding to efficiently communicate stimuli, especially in high-noise or high-dimensional environments. In fact, many separate neural coding schemes have been considered to emerge from sparsity, which neural networks employ due to energy constraints and to cope with dimensionality [12]. Broadly, sparse coding states that different firing rates, which contain representations of information by encoding features of a stimulus, will be sparsely distributed in a neural network. In large neural populations, key neurones will be firing at various rates and most other neurones will not be firing at all. Such a sparse code is advantageous for efficient learning by decorrelating inputs, which allows features to be coded independently. Crucially, this leads to a robust representation, and is equivalent to reducing the coding of redundant features while preserving coded information [13].

The rectifier, or the unbounded-above activation function we discuss, has indeed been shown to improve representation in deep neural networks by precisely these mechanisms [14]. Some neurones are firing with a particular rate, lying on the linear portion of the curve, and others are resting, lying on the portion of the curve valued at zero. The coding benefits highlighted are exactly those found in sparse representations in biological neural networks, where disentangling is referred to as decorrelating inputs, which assists in learning high-dimensional data. They also utilise sparsity as a rich but energy efficient coding scheme, showing that in deep neural networks, sparse representations take fewer computational resources while showing high training accuracy.

In our model, this sparsity is reproduced by local effects such as quenches of different magnitudes acting on particular neurones in the network.
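A quick toy illustration of rectifier-induced sparsity, in the spirit of [14] but not taken from it: applying ReLU to zero-mean pre-activations silences roughly half the units, whereas a saturating activation leaves essentially none exactly at zero.

```python
import numpy as np

rng = np.random.default_rng(42)
pre_activations = rng.standard_normal((1000, 256))  # toy layer: 1000 inputs, 256 units

relu_out = np.maximum(pre_activations, 0.0)
relu_sparsity = np.mean(relu_out == 0.0)    # fraction of exactly-silent units

tanh_out = np.tanh(pre_activations)
tanh_sparsity = np.mean(tanh_out == 0.0)    # saturating units are almost never exactly zero

print(f"ReLU: {relu_sparsity:.1%} of outputs exactly zero")  # ~50%
print(f"tanh: {tanh_sparsity:.1%} of outputs exactly zero")  # ~0%
```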
B. Energy-Based Learning
Recently, machine learning has been reformulated within an energy-based framework—in particular, a paradigm based on energy minimisation has been proposed, where choosing a network configuration that minimises energy is equivalent to finding an output that minimises loss [15]. Here, the energy of a configuration is used as a penalty, following the idea that physical systems seek to minimise free energy and that this underlies the stability of a given system state. This appeals to statistical mechanical ideas about energy minimisation, which we have already used in discussing the Ising model—the configuration chosen by a system always obeys a minimisation principle. As such, this can be used as a measurement of error, where we designate high energy states as being incorrect in both the physical and statistical sense. A useful way of thinking about the idea of free energy minimisation is that the free energy is defined as

F = E − TS.
For clamped energy levels, clearly, maximising entropy is equivalent to minimising free energy, since

ΔF = −TΔS

for constant energy and temperature. Then, free energy minimisation is a natural consequence of the second law of thermodynamics, which states that systems will always produce greater entropy. In the information theoretic sense, defined by Jaynes as essentially equivalent to the thermodynamical sense, maximising entropy is choosing the best model of observed variables [16]. Thus, we have a direct application to our inference or learning process.

Following this, we examine why a neural Ising model spikes. Clearly, when the temperature decreases, the entropic contribution to free energy decreases as well. Hence, minimisation of free energy occurs when total energy is minimised. We observed this happen when the Hamiltonian was in a magnetised state. In the sense of an error signal, when an input—a temperature-lowering quench—arrives, the error in the system is high as long as the Ising model occupies a high energy state, which is unlikely given the physical and statistical scenario. By magnetising, or spiking, the system decreases this error through responding to the input, which is equivalent to choosing a free energy minimising stable state. In the energy-based learning scheme, loss functions are often arrived at by explicitly considering the marginalised Gibbs distribution over the inputs to the system, and learning is performed by minimising the resultant free energy in the zero-temperature limit.

This accords with other, more biological ideas concerning energy minimisation in learning, wherein neural spikes learn the relationship between stimulus and evoked response, and minimisation of energy underlies learning. It has been found that real neural networks, in vitro, minimise variational free energy when learning representations of stimuli [17]. Variational free energy is an information-theoretic notion closely related to the thermodynamical Helmholtz free energy, although whether only by statistical mechanical analogy or also by physical principles remains controversial [18, 19].
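To make the zero-temperature limit explicit (a standard rendering, ours rather than verbatim from [15]): with a Gibbs distribution over candidate outputs y for input x, the free energy is

\[
F_\beta(x) = -\frac{1}{\beta} \ln \sum_{y} e^{-\beta E(x, y)},
\qquad
\lim_{\beta \to \infty} F_\beta(x) = \min_{y} E(x, y),
\]

so minimising the zero-temperature free energy is exactly selecting the minimum-energy, i.e. lowest-loss, output.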
C. Self-Similarity and Criticality

We note one final implication by suggesting a relationship between this result and mean field theory applied to neural populations. In particular, we note that to recover non-linear firing statistics, the collective dynamics of neural populations are almost ubiquitously described using a sigmoid function [20]. Much like the activation function in the artificial neural network, however, this is justified by performance rather than explicit theory, and is given heuristically. The results in this paper undoubtedly extrapolate to the case of neurones as a subunit and neural populations as a mean field, thus justifying this phenomenology in the same way. This relationship is consistent with the self-similarity observed in the human cortex. Results on self-similarity in the Ising lattice suggest that both are at criticality, or the so-called 'edge of chaos.' This is consistent with observations of critical dynamics in the brain, and is important for both biological and artificial neural networks, which have been shown to perform best at criticality [5, 21, 22].
IV. CONCLUSION
We have seen that a neural spike is analogous to the phase transition in the Ising model; as such, we have motivated the designs of historical and modern artificial neural networks, and in particular, the concept and typical form of the activation function. In so doing, we have also examined how ideal learning necessarily invokes the non-linear processes in the neurone, and utilises energy minimisation, by modelling this process with an Ising model and applying other statistical mechanical ideas.

[1] Tara H Abraham. (Physio)logical circuits: The intellectual origins of the McCulloch–Pitts neural network. Journal of the History of the Behavioral Sciences, 38(1):3–25, 2002.
[2] Walter McCulloch and Warren Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In F Pereira, C J C Burges, L Bottou, and K Q Weinberger, editors, Advances in Neural Information Processing Systems, volume 25, pages 1097–1105. Curran Associates, Inc., 2012.
[4] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, pages 807–814, Madison, WI, USA, 2010. Omnipress.
[5] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. On the impact of the activation function on deep neural networks training. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2672–2680. PMLR, 2019.
[6] Moshe Leshno, Vladimir Y Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6:861–867, 1993.
[7] Stephen Brush. History of the Lenz–Ising model. Reviews of Modern Physics, 39(4):883–893, 1967.
[8] A D Bruce and N B Wilding. Scaling fields and universality of the liquid–gas critical point. Physical Review Letters, 68:193–196, 1992.
[9] Alexander Migdal. Turbulence, string theory and Ising model. arXiv e-prints, arXiv:1912.00276, 2019.
[10] A L Hodgkin and A F Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4):500–544, 1952.
[11] Dalton A R Sakthivadivel. A pedagogical discussion of magnetisation in the mean field Ising model. arXiv e-prints, arXiv:2102.00960, 2021.
[12] Michael Beyeler, Emily L Rounds, Kristofor D Carlson, Nikil Dutt, and Jeffrey L Krichmar. Neural correlates of sparse coding and dimensionality reduction. PLOS Computational Biology, 15(6):1–33, 2019.
[13] Peter Földiák. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64:165–170, 1990.
[14] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323. JMLR Workshop and Conference Proceedings, 2011.
[15] Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang. A tutorial on energy-based learning. In G Bakir, T Hofman, B Schölkopf, A Smola, and B Taskar, editors, Predicting Structured Data. MIT Press, 2006.
[16] Edwin Thompson Jaynes. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.
[17] Takuya Isomura and Karl Friston. In vitro neural networks minimise variational free energy. Scientific Reports, 8(1):1–14, 2018.
[18] Alex B Kiefer. Psychophysical identity and free energy. Journal of the Royal Society Interface, 17:20200370, 2020.
[19] Mel Andrews. The math is not the territory: Navigating the free energy principle, 2020. Available from http://philsci-archive.pitt.edu/18315/.
[20] Gustavo Deco, Viktor K Jirsa, Peter A Robinson, Michael Breakspear, and Karl Friston. The dynamic brain: From spiking neurons to neural masses and cortical fields. PLOS Computational Biology, 4(8):1–35, 2008.
[21] Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 7103–7114. Curran Associates, Inc., 2017.
[22] Zhengyu Ma, Gina G Turrigiano, Ralf Wessel, and Keith B Hengen. Cortical circuit dynamics are homeostatically tuned to criticality in vivo.