Large Deviation Approach to Random Recurrent Neuronal Networks: Parameter Inference, Activity Prediction, and Fluctuation-Induced Transitions
Alexander van Meegen,$^{1,2}$ Tobias Kühn,$^{1,2,3}$ and Moritz Helias$^{1,2}$

$^{1}$Institute of Neuroscience and Medicine (INM-6), Institute for Advanced Simulation (IAS-6), and JARA-Institute Brain Structure-Function Relationships (INM-10), Jülich Research Centre, Jülich, Germany
$^{2}$Department of Physics, Faculty 1, RWTH Aachen University, Aachen, Germany
$^{3}$Laboratoire de Physique de l'ENS, CNRS, Paris, France

(Dated: September 21, 2020)

Statistical field theory captures collective non-equilibrium dynamics of neuronal networks, but it does not address the inverse problem of searching the connectivity to implement a desired dynamics. We here show for an analytically solvable network model that the effective action in statistical field theory is identical to the rate function in large deviation theory; using field-theoretical methods we derive this rate function. It takes the form of a Kullback-Leibler divergence and enables data-driven inference of model parameters and Bayesian prediction of time series.
Introduction.–
Biological neuronal networks are systems with many degrees of freedom and intriguing properties: their units are coupled in a directed, non-symmetric manner, so that they typically operate outside thermodynamic equilibrium [1, 2]. Even more challenging is the question of how this non-equilibrium dynamics is used to perform information processing. A rigorous understanding is still lacking but imperatively needed to shed light on the remarkable computational abilities of these networks. The primary method to study neuronal networks has been mean-field theory [3–8]. Its field-theoretical basis has been exposed only recently [9, 10], which promises highly efficient schemes to study effects beyond the mean-field approximation. However, to understand the parallel and distributed information processing performed by neuronal networks, the study of the forward problem – from the microscopic parameters of the model to its dynamics – is not sufficient. One additionally faces the inverse problem of determining the parameters of the model given a desired dynamics and thus function. Formally, one needs to link statistical physics with concepts from information theory and statistical inference.

We here expose a tight relation between statistical field theory of neuronal networks, large deviation theory, information theory, and inference. To this end, we generalize the probabilistic view of large deviation theory, which yields rigorous results for the leading-order behavior in the network size $N$ [11, 12], to arbitrary single-unit dynamics and transfer functions. We then show that the central quantity of large deviation theory, the rate function, is identical to the effective action in statistical field theory. Having rendered this comprehensive picture, a third relation is found: Bayesian inference and prediction are naturally formulated within this framework, spanning the arc to information processing. Concretely, we devise a method of parameter inference from transient data and perform Bayes-optimal prediction of the time-dependent network activity.

Model.–
We consider random networks of $N$ nonlinearly interacting units $x_i(t)$ driven by an external input $\xi_i(t)$. The dynamics of the units are governed by the stochastic differential equation
$$\dot{x}_i(t) = -\nabla U(x_i(t)) + \sum_{j=1}^{N} J_{ij}\, \phi(x_j(t)) + \xi_i(t). \qquad (1)$$
In the absence of recurrent and external inputs, the units undergo an overdamped motion in a potential $U(x)$. The $J_{ij}$ are independent and identically Gaussian-distributed random coupling weights with zero mean and variance $\langle J_{ij}^2 \rangle = g^2/N$, where the coupling strength $g$ controls the heterogeneity of the weights. The time-varying external inputs $\xi_i(t)$ are independent Gaussian white-noise processes with zero mean and correlation functions $\langle \xi_i(t)\, \xi_j(t') \rangle = D\, \delta_{ij}\, \delta(t-t')$. The model corresponds to the one studied in [4] if the external input vanishes, $D = 0$, the potential is quadratic, $U(x) = x^2/2$, and the transfer function is sigmoidal, $\phi(x) = \tanh(x)$; for $D = 1$, $U(x) = -\log(A^2 - x^2)$, and $\phi(x) = x$ it corresponds to the one in [11], which is inspired by the dynamical spin glass model of [13].
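As an illustration of Eq. (1), the dynamics can be integrated with a standard Euler-Maruyama scheme. The sketch below is not the simulation code used for the figures; the quadratic potential, the tanh transfer function, and all parameter values are assumptions chosen to match the special case of [4] with additive noise.

```python
import numpy as np

def simulate_network(N=1000, g=1.5, D=0.1, dt=1e-2, T=100.0, seed=0):
    """Euler-Maruyama integration of Eq. (1) with U(x) = x^2/2 and phi = tanh (assumed)."""
    rng = np.random.default_rng(seed)
    # Independent Gaussian couplings with zero mean and variance g^2 / N
    J = rng.normal(0.0, g / np.sqrt(N), size=(N, N))
    steps = int(T / dt)
    x = rng.standard_normal(N)           # random initial condition (assumption)
    traj = np.empty((steps, N))
    for step in range(steps):
        drift = -x + J @ np.tanh(x)      # -grad U(x_i) + sum_j J_ij phi(x_j)
        # White noise with <xi_i(t) xi_j(t')> = D delta_ij delta(t - t')
        x = x + dt * drift + np.sqrt(D * dt) * rng.standard_normal(N)
        traj[step] = x
    return traj
```

Trajectories generated in this way serve as the raw data for the inference and prediction steps discussed below.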
Field theory.–

The field-theoretical treatment of Eq. (1) employs the Martin–Siggia–Rose–de Dominicis–Janssen path integral formalism [14–17]. We denote the expectation over paths across different realizations of the noise $\xi$ as
$$\langle \cdot \rangle_{x|J} \equiv \langle \langle \cdot \rangle_{x|J,\xi} \rangle_\xi = \int \mathcal{D}x \int \mathcal{D}\tilde{x}\; (\cdot)\, e^{S_0(x,\tilde{x}) - \tilde{x}^{\mathrm{T}} J \phi(x)},$$
where $\langle \cdot \rangle_{x|J,\xi}$ integrates over the unique solution of Eq. (1) given one realization $\xi$ of the noise (see Appendix A). Here, $S_0(x,\tilde{x}) = \tilde{x}^{\mathrm{T}} (\dot{x} + \nabla U(x)) + \frac{D}{2} \tilde{x}^{\mathrm{T}} \tilde{x}$ is the action of the uncoupled neurons. We use the shorthand notation $a^{\mathrm{T}} b = \sum_{i=1}^{N} \int_0^T \mathrm{d}t\, a_i(t)\, b_i(t)$.

For large $N$, the system becomes self-averaging, a property known from many disordered systems with large numbers of degrees of freedom: the collective behavior is stereotypical, independent of the realization $J_{ij}$. This holds for observables of the form $\sum_{i=1}^{N} \ell(x_i)$, where $\ell$ is an arbitrary functional of a single unit's trajectory. It is therefore convenient to introduce the scaled cumulant-generating functional
$$W_N(\ell) := \frac{1}{N} \ln \big\langle \big\langle e^{\sum_{i=1}^{N} \ell(x_i)} \big\rangle_{x|J} \big\rangle_J, \qquad (2)$$
where the prefactor $1/N$ makes sure that $W_N$ is an intensive quantity, reminiscent of the bulk free energy [18]. In fact, we will show that the $N$-dependence vanishes in the limit $N \to \infty$ because the system decouples.

Performing the average over $J$ and introducing the auxiliary field $C(t,t') := g^2 N^{-1} \sum_{i=1}^{N} \phi(x_i(t))\, \phi(x_i(t'))$ as well as the conjugate field $\tilde{C}$, we can write $W_N$ as [9, 19]
$$W_N(\ell) = \frac{1}{N} \ln \int \mathcal{D}C \int \mathcal{D}\tilde{C}\; e^{-\frac{N}{g^2} C^{\mathrm{T}} \tilde{C} + N\, \Omega_\ell(C,\tilde{C})}, \qquad (3)$$
$$\Omega_\ell(C,\tilde{C}) := \ln \int \mathcal{D}x \int \mathcal{D}\tilde{x}\; e^{S_0(x,\tilde{x}) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C \tilde{x} + \phi^{\mathrm{T}} \tilde{C} \phi + \ell(x)}.$$
The effective action is defined as the Legendre transform of $W_N(\ell)$,
$$\Gamma_N(\mu) := \int \mathcal{D}x\, \mu(x)\, \ell_\mu(x) - W_N(\ell_\mu), \qquad (4)$$
where $\ell_\mu$ is determined implicitly by the condition $\mu = W_N'(\ell_\mu)$ and the derivative $W_N'(\ell)$ has to be understood as a generalized derivative, the coefficient of the linearization akin to a Fréchet derivative [20].

Note that $W_N$ and $\Gamma_N$ are, respectively, generalizations of a cumulant-generating functional and of the effective action [21] because both map a functional ($\ell$ or $\mu$) to the reals. For the choice $\ell(x) = j^{\mathrm{T}} x$, we recover the usual cumulant-generating functional of the single unit's trajectory (see Appendix D) and the corresponding effective action.
Rate function.–

Self-averaging corresponds to a sharply peaked distribution of an observable over realizations of $J$: because the distribution is very narrow, the observable always attains the same value. However, this can only hold for observables averaged over all units, reminiscent of the central limit theorem. Therefore, it is natural to restrict the observables to network-averaged ones; formally, this can be achieved without loss of generality using the empirical measure
$$\mu(y) := \frac{1}{N} \sum_{i=1}^{N} \delta(x_i - y), \qquad (5)$$
since $\frac{1}{N} \sum_{i=1}^{N} \ell(x_i) = \int \mathcal{D}y\, \mu(y)\, \ell(y)$. Of particular interest is the leading-order exponential behavior of the distribution of empirical measures $P(\mu) = \langle \langle P(\mu|x) \rangle_{x|J} \rangle_J$ across realizations of $J$ and $\xi$, described by the rate function [see e.g. 22]
$$H(\mu) := -\lim_{N \to \infty} \frac{1}{N} \ln P(\mu). \qquad (6)$$
For large $N$, the probability of an empirical measure that does not correspond to the minimum, $H'(\bar{\mu}) = 0$, is exponentially suppressed. Put differently, the system is self-averaging and the statistics of any network-averaged observable can be obtained using $\bar{\mu}$.

Similar to field theory, it is convenient to introduce the scaled cumulant-generating functional of the empirical measure. Because $\frac{1}{N} \sum_{i=1}^{N} \ell(x_i) = \int \mathcal{D}y\, \mu(y)\, \ell(y)$ holds for an arbitrary functional $\ell(x_i)$ of the single unit's trajectory $x_i$, Eq. (2) has the form of the scaled cumulant-generating functional for $\mu$ at finite $N$.

Using a saddle-point approximation for the integrals over $C$ and $\tilde{C}$ in Eq. (3) (see Appendix A), we get
$$W_\infty(\ell) = -\frac{1}{g^2} C_\ell^{\mathrm{T}} \tilde{C}_\ell + \Omega_\ell(C_\ell, \tilde{C}_\ell). \qquad (7)$$
Both $C_\ell$ and $\tilde{C}_\ell$ are determined self-consistently by the saddle-point equations $C_\ell = g^2\, \partial_{\tilde{C}} \Omega_\ell(C,\tilde{C})|_{C_\ell,\tilde{C}_\ell}$ and $\tilde{C}_\ell = g^2\, \partial_C \Omega_\ell(C,\tilde{C})|_{C_\ell,\tilde{C}_\ell}$, where $\partial_C$ denotes a partial functional derivative.

From the scaled cumulant-generating functional, Eq. (7), we obtain the rate function via a Legendre transformation [23]: $H(\mu) = \int \mathcal{D}x\, \mu(x)\, \ell_\mu(x) - W_\infty(\ell_\mu)$ with $\ell_\mu$ implicitly defined by $\mu = W_\infty'(\ell_\mu)$. Comparing with Eq. (4), we observe that the rate function is equivalent to the effective action: $H(\mu) = \lim_{N \to \infty} \Gamma_N(\mu)$. The equation $\mu = W_\infty'(\ell_\mu)$ can be solved for $\ell_\mu$ to obtain a closed expression for the rate function viz. effective action (see Appendix B)
$$H(\mu) = \int \mathcal{D}x\, \mu(x) \ln \frac{\mu(x)}{\langle \delta(\dot{x} + \nabla U(x) - \eta) \rangle_\eta}, \qquad (8)$$
where $\eta$ is a zero-mean Gaussian process with a correlation function that is determined by $\mu(x)$,
$$C_\eta(t,t') = D\, \delta(t-t') + g^2 \int \mathcal{D}x\, \mu(x)\, \phi(x(t))\, \phi(x(t')). \qquad (9)$$
For $D = 1$, $U(x) = -\log(A^2 - x^2)$, and $\phi(x) = x$, Eq. (8) can be shown to be equivalent to the mathematically rigorous result obtained in the seminal work by Ben Arous and Guionnet (see Appendix C).

The rate function Eq. (8) takes the form of a Kullback-Leibler divergence. Thus, it possesses a unique minimum at
$$\bar{\mu}(x) = \langle \delta(\dot{x} + \nabla U(x) - \eta) \rangle_\eta, \qquad (10)$$
which corresponds to the well-known self-consistent stochastic dynamics that is obtained in field theory [4, 9, 10, 19]. Note that the correlation function of the effective stochastic input $\eta$ at the minimum depends self-consistently on $\bar{\mu}(x)$ through Eq. (9).
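For the quadratic potential $U(x) = x^2/2$, used here as an illustrative special case, the minimizer $\bar{\mu}$ describes a linear filter of the effective noise $\eta$, and Eqs. (9) and (10) combine into a self-consistency condition for the stationary power spectrum:

```latex
% Self-consistency of the stationary power spectrum at the minimum \bar{\mu},
% assuming U(x) = x^2/2 so that each unit obeys \dot{x} = -x + \eta.
\begin{align}
  \dot{x}(t) = -x(t) + \eta(t)
  \quad &\Rightarrow \quad
  S_x(f) = \frac{S_\eta(f)}{1 + (2\pi f)^2}, \\
  S_\eta(f) = D + g^2 S_{\phi(x)}(f)
  \quad &\Rightarrow \quad
  S_x(f) = \frac{D + g^2 S_{\phi(x)}(f)}{1 + (2\pi f)^2}.
\end{align}
```

Because $S_{\phi(x)}$ is itself a functional of $S_x$ through $\bar{\mu}$, this relation has to be solved self-consistently, for instance by fixed-point iteration; the resulting autocorrelation $C_x$ is the one used for prediction below.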
Parameter Inference.–

Thus far, we considered the network-averaged statistics of the activity for given statistics of the connectivity and the external input. The rate function opens the way to address the inverse problem: given the network-averaged activity statistics, encoded in the corresponding empirical measure $\mu$, what are the statistics of the connectivity and the external input, i.e. $g$ and $D$?

We determine the parameters using maximum likelihood estimation. Using Eq. (6) and Eq. (8), the likelihood of the parameters is given by
$$\ln P(\mu | g, D) \simeq -N\, H(\mu | g, D),$$
where $\simeq$ denotes equality in the limit $N \to \infty$ and we made the dependence on $g$ and $D$ explicit. The maximum likelihood estimate is given by the minimum, with respect to the parameters $g$ and $D$, of the Kullback-Leibler divergence on the right-hand side. Only the cross entropy $-\int \mathcal{D}x\, \mu(x) \ln \langle \delta(\dot{x} + \nabla U(x) - \eta) \rangle_\eta$ depends, by Eq. (9), on the parameters; thus maximizing the likelihood is equivalent to minimizing the cross entropy. The derivative of the cross entropy with respect to the parameters yields
$$\partial_a \ln P(\mu | g, D) \simeq -\frac{N}{2} \operatorname{tr}\Big( (C_0 - C_\eta)\, \frac{\partial C_\eta^{-1}}{\partial a} \Big),$$
where we used $\langle \delta(\dot{x} + \nabla U(x) - \eta) \rangle_\eta = Z^{-1} \exp(-S(x))$ with $S(x) = \frac{1}{2} (\dot{x} + \nabla U(x))^{\mathrm{T}} C_\eta^{-1} (\dot{x} + \nabla U(x))$ and normalization $Z = \sqrt{\det(2\pi C_\eta)}$ (see Appendix E), defined $C_0(t,t') \equiv \int \mathcal{D}x\, \mu(x)\, (\dot{x}(t) + \nabla U(x(t)))\, (\dot{x}(t') + \nabla U(x(t')))$, and abbreviated $a \in \{g, D\}$. The derivative vanishes for $C_0 = C_\eta$. Assuming stationarity, in the Fourier domain this condition reads
$$S_{\dot{x} + \nabla U(x)}(f) = D + g^2 S_{\phi(x)}(f), \qquad (11)$$
where $S_X(f)$ denotes the network-averaged power spectrum of the observable $X$. Using non-negative least squares [24], Eq. (11) allows a straightforward inference of $g$ and $D$ (see Fig. 1).
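A minimal numerical sketch of this inference step follows, assuming trajectories sampled as an array x of shape (timesteps, units) with time step dt, a quadratic potential, and a tanh transfer function (all assumptions for illustration, not the exact analysis code behind Fig. 1). The two non-negative regression coefficients in Eq. (11) are the estimates of $D$ and $g^2$.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_g_D(x, dt, phi=np.tanh, grad_U=lambda y: y):
    """Fit Eq. (11), S_{xdot + grad U(x)}(f) = D + g^2 S_{phi(x)}(f), by non-negative least squares."""
    def network_spectrum(signal):
        # Network-averaged periodogram estimate of the power spectrum
        fourier = np.fft.rfft(signal, axis=0) * dt
        duration = signal.shape[0] * dt
        return (np.abs(fourier) ** 2 / duration).mean(axis=1)

    drive = np.gradient(x, dt, axis=0) + grad_U(x)     # \dot{x} + grad U(x)
    S_drive = network_spectrum(drive)
    S_phi = network_spectrum(phi(x))
    design = np.column_stack([np.ones_like(S_phi), S_phi])
    (D_hat, g2_hat), residual = nnls(design, S_drive)  # coefficients constrained to be >= 0
    return np.sqrt(g2_hat), D_hat, residual
```

The residual of the fit is the mean squared error of Eq. (11) that also serves as one of the model comparison criteria below.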
Figure 1: Maximum likelihood parameter estimation in networks with different single-unit potentials $U(x) = x^2/2 + s \ln \cosh x$ and external noise $D$ (A). Activity of three randomly chosen units (different hues) for various coupling strengths $g$ as indicated in the title, with $s$ and $D$ color coded (B–D). Parameter estimation via non-negative least squares based on Eq. (11) (E), and the power spectra on the left-hand (dark, solid lines) and right-hand sides (light, dashed lines) of Eq. (11) for the inferred parameters (F). Further parameters: N = …, temporal discretization dt = …, simulation time T = …, time span discarded to reach the steady state T = ….

Model Comparison.–

Parameter estimation allows us to determine the statistical properties of the recurrent connectivity $g$ and the external input $D$. However, this leaves the potential $U$ and the transfer function $\phi$ unspecified. We determine $U$ and $\phi$ using model comparison techniques [25].

We consider two options to obtain $U$ and $\phi$: comparing the mean squared error in Eq. (11) for the inferred parameters and comparing the likelihood of the inferred parameters. For the latter option, we can use the rate function from Eq. (6) and Eq. (8). Given two choices $U_i$, $\phi_i$, $i \in \{1, 2\}$, with corresponding inferred parameters $\hat{g}_i$, $\hat{D}_i$, we have
$$\ln \frac{P(\mu | U_1, \phi_1, \hat{g}_1, \hat{D}_1)}{P(\mu | U_2, \phi_2, \hat{g}_2, \hat{D}_2)} \simeq -N\, (H_1 - H_2) \qquad (12)$$
with $H_i \equiv H(\mu | U_i, \phi_i, \hat{g}_i, \hat{D}_i)$. The difference $H_1 - H_2$ equals the difference of the minimal cross entropies for the respective choices $U_i$, $\phi_i$. Assuming an infinite observation time, this difference can be expressed as an integral that is straightforward to evaluate numerically (see Appendix F).

To illustrate the procedure, we consider the potential
$$U(x) = \frac{x^2}{2} - s \ln \cosh x,$$
which is bistable for $s > 1$ [5], and determine $s$ using the mean squared error and the cross entropy difference (see Fig. 2). Parameter estimation yields estimates $\hat{g}$ and $\hat{D}$ that depend on $s$ (Fig. 2 A, B). The mean squared error displays a clear minimum at the true value of $s$ (Fig. 2 C), whereas the maximal cross entropy difference occurs at a value larger than the true one (Fig. 2 D).
Figure 2: Model comparison in a network with single-unit potential $U(x) = x^2/2 - s \ln \cosh x$. Maximum likelihood estimates $\hat{g}$ and $\hat{D}$ for given choices of $s$ (A, B). True values of $g$ and $D$ indicated as gray lines; estimates at the true value of $s$ indicated as gray symbols. Mean squared error between the left- and right-hand side of Eq. (11) for given $s$ (C). Cross entropy difference between the model at the true value of $s$ and at the given $s$ (D). Further parameters as in Fig. 1.

The latter effect arises because the cross entropy is dominated by the parameter estimates; thus the mean squared error provides a more reliable criterion in this case.

Activity Prediction.–
Parameter inference and model comparison allow us to fully determine the model from data. Using this information, we can proceed to a functional aspect: predicting the future activity of a unit of the recurrent neuronal network from the knowledge of its recent past.

If the potential of the model is quadratic, $U(x) \propto x^2$, the measure $\bar{\mu}$ that minimizes the rate function corresponds to a Gaussian process. For Gaussian processes, it is possible to perform Bayes-optimal prediction based only on the correlation function [25, 26]. Denoting the correlation function of the process as $C_x$, the prediction is given by
$$\hat{x} = k^{\mathrm{T}} K^{-1} x \qquad (13)$$
with $K_{ij} = C_x(t_i, t_j)$, $k_i = C_x(t_i, \hat{t})$, and $x_i = x(t_i)$. Here $\hat{t}$ denotes the time point of the prediction and $\{t_i\}$ a set of time points where the activity is known. The predicted value $\hat{x}$ itself is Gaussian distributed with variance
$$\sigma_{\hat{x}}^2 = \kappa - k^{\mathrm{T}} K^{-1} k \qquad (14)$$
where $\kappa = C_x(\hat{t}, \hat{t})$. The variance $\sigma_{\hat{x}}^2$ quantifies the uncertainty associated with the prediction $\hat{x}$.

Given the inferred parameters, we determine the self-consistent autocorrelation function $C_x$ using Eq. (9) and Eq. (10). We use this self-consistent autocorrelation function to predict the future activity of two arbitrary units using Eq. (13) and Eq. (14) (Fig. 3 A, B).

Figure 3: Prediction of the future network activity (A, B). Full trajectory of two arbitrary units shown as gray solid curves, training data as black dots, and prediction $\hat{x}$ with error bars determined by $\sigma_{\hat{x}}$ as gray symbols. The network-averaged mean squared error, indicated as gray symbols, is well captured by the predicted variance $\sigma_{\hat{x}}^2$, shown as solid curve (C). The error increases on half of the timescale of the autocorrelation function: $1 - \sigma_{\hat{x}}^2/\sigma_x^2$, shown as symbols, decreases asymptotically as $C \exp(-2\tau/\tau_c)$, indicated as lines, for $g = 1.8$, $1.5$, and $1.3$ (D). Parameters: g = …, s = D = …; further parameters as in Fig. 1.

To quantify the error, we calculate the network-averaged mean squared error relative to the variance obtained from the self-consistent autocorrelation function $C_x(0)$ and compare it to the variability predicted by Eq. (14), yielding a good correspondence (Fig. 3 C). The timescale of the error is determined by half of the timescale of the autocorrelation function (see Appendix G). We plot $1 - \sigma_{\hat{x}}^2/\sigma_x^2$ against an exponential decay $C \exp(-2\tau/\tau_c)$, where $C_x(\tau)/C_x(0) \sim \exp(-\tau/\tau_c)$, and find a very good agreement (Fig. 3 D). Since $\tau_c$ diverges for $g \to 1^+$ (cf. [4]), the timescale of the error diverges as well.
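A minimal sketch of the prediction step, Eqs. (13) and (14), assuming the stationary autocorrelation function $C_x$ has already been obtained from the self-consistency condition and is passed in as a callable of the time lag; the jitter term is an assumption added for numerical stability.

```python
import numpy as np

def predict_activity(t_train, x_train, t_pred, C_x, jitter=1e-8):
    """Bayes-optimal prediction for a zero-mean Gaussian process with stationary
    autocorrelation C_x(tau): x_hat = k^T K^{-1} x, var = C_x(0) - k^T K^{-1} k."""
    t_train = np.asarray(t_train, dtype=float)
    K = C_x(t_train[:, None] - t_train[None, :]) + jitter * np.eye(len(t_train))
    K_inv_x = np.linalg.solve(K, np.asarray(x_train, dtype=float))
    x_hat, var_hat = [], []
    for t_star in np.atleast_1d(t_pred):
        k = C_x(t_train - t_star)                              # k_i = C_x(t_i, t_hat)
        x_hat.append(k @ K_inv_x)                              # Eq. (13)
        var_hat.append(C_x(0.0) - k @ np.linalg.solve(K, k))   # Eq. (14)
    return np.array(x_hat), np.array(var_hat)
```

For instance, with an exponential autocorrelation C_x = lambda tau: sigma2 * np.exp(-np.abs(tau) / tau_c), the predicted variance grows towards sigma2 on the timescale tau_c / 2, consistent with the derivation in Appendix G.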
Discussion.–

In this Letter, we found a tight link between the field-theoretical approach to neuronal networks and its counterpart based on large deviation theory. We obtained the rate function of the empirical measure for the widely used and analytically solvable model of a recurrent neuronal network [4] by field-theoretical methods. This rate function generalizes the seminal result by Ben Arous and Guionnet [11, 12] to arbitrary potentials and transfer functions. Intriguingly, our derivation elucidates that the rate function is identical to the effective action and takes the form of a Kullback-Leibler divergence, akin to Sanov's theorem for sums of independent and identically distributed random variables [22, 23]. The rate function can thus be interpreted as a distance between an empirical measure, for example given by data, and the activity statistics of the network model. This result allows us to address the inverse problem: 1) inferring the parameters of the connectivity and external input from a set of trajectories, 2) determining the potential and the transfer function, and 3) predicting future activity in a Bayes-optimal way.

The exposed link between the effective action defined within statistical field theory and the rate function, central to large deviation theory, opens the door to applying established field-theoretical techniques, such as the loopwise expansion [21], to obtain systematic corrections beyond the mean-field limit. Such sub-exponential corrections to the rate function are important for small or sparse networks with non-vanishing mean connectivity, to explain correlated neuronal activity, and to study information processing in finite-size networks with realistically limited resources. More generally, the link allows the systematic derivation of results using field theory and subsequently proving them in a mathematically rigorous manner within the large deviation framework.
Acknowledgments.– We are grateful to Olivier Faugeras and Etienne Tanré for helpful discussions on LDT of neuronal networks, to Anno Kurth for pointing us to the Fréchet derivative, and to Alexandre René, David Dahmen, Kirsten Fischer, and Christian Keup for feedback on an earlier version of the manuscript. This work was partly supported by the Helmholtz young investigator's group VH-NG-1028 and European Union Horizon 2020 grant 785907 (Human Brain Project SGA2).
Appendix

A. Scaled Cumulant Generating Functional
Here, we derive the scaled cumulant generating functional and the saddle-point equations. The first steps of the derivation are akin to the manipulations presented in [9, 19], thus we keep the presentation concise. We interpret the stochastic differential equations governing the network dynamics in the Itô convention. Using the Martin–Siggia–Rose–de Dominicis–Janssen path integral formalism, the expectation $\langle \cdot \rangle_{x|J}$ of some arbitrary functional $G(x)$ can be written as
$$\langle \langle G(x) \rangle_{x|J,\xi} \rangle_\xi = \int \mathcal{D}x\; \langle \delta(\dot{x} + \nabla U(x) - J\phi(x) - \xi) \rangle_\xi\, G(x) = \int \mathcal{D}x \int \mathcal{D}\tilde{x}\; e^{S_0(x,\tilde{x}) - \tilde{x}^{\mathrm{T}} J \phi(x)}\, G(x),$$
where we used the Fourier representation $\delta(x) = \frac{1}{2\pi i} \int_{-i\infty}^{i\infty} e^{\tilde{x} x}\, \mathrm{d}\tilde{x}$ in every time step in the second step and defined the action
$$S_0(x,\tilde{x}) = \tilde{x}^{\mathrm{T}} (\dot{x} + \nabla U(x)) + \frac{D}{2} \tilde{x}^{\mathrm{T}} \tilde{x}.$$
An additional average over realizations of the connectivity $J_{ij} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, N^{-1} g^2)$ only affects the term $-\tilde{x}^{\mathrm{T}} J \phi(x)$ in the action and results in
$$\langle e^{-\tilde{x}^{\mathrm{T}} J \phi(x)} \rangle_J = \int \mathcal{D}C \int \mathcal{D}\tilde{C}\; e^{-\frac{N}{g^2} C^{\mathrm{T}} \tilde{C} + \frac{1}{2} \tilde{x}^{\mathrm{T}} C \tilde{x} + \phi(x)^{\mathrm{T}} \tilde{C} \phi(x)},$$
where we introduced the network-averaged auxiliary field
$$C(u,v) = \frac{g^2}{N} \sum_{i=1}^{N} \phi(x_i(u))\, \phi(x_i(v))$$
via a Hubbard–Stratonovich transformation. The average over the connectivity and the subsequent Hubbard–Stratonovich transformation decouple the dynamics across units; afterwards the units are only coupled through the global fields $C$ and $\tilde{C}$.

Now, we consider the scaled cumulant generating functional of the empirical density
$$W_N(\ell) = \frac{1}{N} \ln \big\langle \big\langle e^{\sum_{i=1}^{N} \ell(x_i)} \big\rangle_{x|J} \big\rangle_J.$$
Using the above results and the abbreviation $\phi(x) \equiv \phi$, it can be written as
$$W_N(\ell) = \frac{1}{N} \ln \int \mathcal{D}C \int \mathcal{D}\tilde{C}\; e^{-\frac{N}{g^2} C^{\mathrm{T}} \tilde{C} + N\, \Omega_\ell(C,\tilde{C})},$$
$$\Omega_\ell(C,\tilde{C}) = \ln \int \mathcal{D}x \int \mathcal{D}\tilde{x}\; e^{S_0(x,\tilde{x}) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C \tilde{x} + \phi^{\mathrm{T}} \tilde{C} \phi + \ell(x)},$$
where the $N$ in front of the single-particle cumulant generating functional $\Omega_\ell$ results from the factorization of the $N$ integrals over $x_i$ and $\tilde{x}_i$ each; thus it is a hallmark of the decoupled dynamics. Next, we approximate the $C$ and $\tilde{C}$ integrals in a saddle-point approximation, which yields
$$W_N(\ell) = -\frac{1}{g^2} C_\ell^{\mathrm{T}} \tilde{C}_\ell + \Omega_\ell(C_\ell, \tilde{C}_\ell) + \mathcal{O}(\ln(N)/N),$$
where $C_\ell$ and $\tilde{C}_\ell$ are determined by the saddle-point equations
$$C_\ell = g^2\, \partial_{\tilde{C}} \Omega_\ell(C,\tilde{C})\big|_{C_\ell,\tilde{C}_\ell}, \qquad \tilde{C}_\ell = g^2\, \partial_C \Omega_\ell(C,\tilde{C})\big|_{C_\ell,\tilde{C}_\ell}.$$
Here, $\partial_C$ denotes a partial functional derivative. In the limit $N \to \infty$, the remainder $\mathcal{O}(\ln(N)/N)$ vanishes and the saddle-point approximation becomes exact.

B. Rate Function
Here, we derive the rate function from the scaled cumulant generating functional. According to the Gärtner-Ellis theorem [23], we obtain the rate function via the Legendre transformation
$$H(\mu) = \int \mathcal{D}x\, \mu(x)\, \ell_\mu(x) - W_\infty(\ell_\mu) \qquad (15)$$
with $\ell_\mu$ implicitly defined by
$$\mu = W_\infty'(\ell_\mu). \qquad (16)$$
Due to the saddle-point equations, the derivative of the cumulant generating functional in Eq. (16) simplifies to $W_\infty'(\ell_\mu) = (\partial_\ell \Omega_\ell)(C_\ell, \tilde{C}_\ell)|_{\ell_\mu}$, where the derivative only acts on the $\ell$ that is explicit in $\Omega_\ell(C_\ell, \tilde{C}_\ell)$ and not on the implicit dependencies through $C_\ell$, $\tilde{C}_\ell$. Thus, Eq. (16) yields
$$\mu(x) = \frac{\int \mathcal{D}\tilde{x}\; e^{S_0(x,\tilde{x}) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C_{\ell_\mu} \tilde{x} + \phi^{\mathrm{T}} \tilde{C}_{\ell_\mu} \phi + \ell_\mu(x)}}{\int \mathcal{D}x \int \mathcal{D}\tilde{x}\; e^{S_0(x,\tilde{x}) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C_{\ell_\mu} \tilde{x} + \phi^{\mathrm{T}} \tilde{C}_{\ell_\mu} \phi + \ell_\mu(x)}}.$$
Taking the logarithm and using $W_\infty(\ell_\mu) + \frac{1}{g^2} C_{\ell_\mu}^{\mathrm{T}} \tilde{C}_{\ell_\mu} = \Omega_{\ell_\mu}(C_{\ell_\mu}, \tilde{C}_{\ell_\mu})$ leads to
$$\ell_\mu(x) = \ln \frac{\mu(x)}{\int \mathcal{D}\tilde{x}\; e^{S_0(x,\tilde{x}) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C_{\ell_\mu} \tilde{x}}} + W_\infty(\ell_\mu) + \frac{1}{g^2} C_{\ell_\mu}^{\mathrm{T}} \tilde{C}_{\ell_\mu} - \phi^{\mathrm{T}} \tilde{C}_{\ell_\mu} \phi.$$
Inserting $\ell_\mu(x)$ into the Legendre transformation (15) yields
$$H(\mu) = \int \mathcal{D}x\, \mu(x) \ln \frac{\mu(x)}{\int \mathcal{D}\tilde{x}\; e^{S_0(x,\tilde{x}) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C_{\ell_\mu} \tilde{x}}} + \frac{1}{g^2} C_{\ell_\mu}^{\mathrm{T}} \tilde{C}_{\ell_\mu} - C_\mu^{\mathrm{T}} \tilde{C}_{\ell_\mu}$$
with $C_\mu(u,v) = \int \mathcal{D}x\, \mu(x)\, \phi(x(u))\, \phi(x(v))$. Identifying $\mu(x)$ in the saddle-point equation
$$C_{\ell_\mu} = g^2\, \partial_{\tilde{C}} \Omega_\ell(C,\tilde{C})\big|_{C_{\ell_\mu},\tilde{C}_{\ell_\mu}} = g^2\, \frac{\int \mathcal{D}x \int \mathcal{D}\tilde{x}\; \phi\phi\; e^{S_0(x,\tilde{x}) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C_{\ell_\mu} \tilde{x} + \phi^{\mathrm{T}} \tilde{C}_{\ell_\mu} \phi + \ell_\mu(x)}}{\int \mathcal{D}x \int \mathcal{D}\tilde{x}\; e^{S_0(x,\tilde{x}) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C_{\ell_\mu} \tilde{x} + \phi^{\mathrm{T}} \tilde{C}_{\ell_\mu} \phi + \ell_\mu(x)}}$$
yields $C_{\ell_\mu}(u,v) = g^2 \int \mathcal{D}x\, \mu(x)\, \phi(x(u))\, \phi(x(v))$ and thus $C_{\ell_\mu} = g^2 C_\mu$. Accordingly, the last two terms in the Legendre transformation cancel and we arrive at
$$H(\mu) = \int \mathcal{D}x\, \mu(x) \ln \frac{\mu(x)}{\int \mathcal{D}\tilde{x}\; e^{S_0(x,\tilde{x}) + \frac{g^2}{2} \tilde{x}^{\mathrm{T}} C_\mu \tilde{x}}} \qquad (17)$$
where still $C_\mu(u,v) = \int \mathcal{D}x\, \mu(x)\, \phi(x(u))\, \phi(x(v))$.

In the main text, we use the notation
$$\int \mathcal{D}\tilde{x}\; e^{S_0(x,\tilde{x}) + \frac{g^2}{2} \tilde{x}^{\mathrm{T}} C_\mu \tilde{x}} = \langle \delta(\dot{x} + \nabla U(x) - \eta) \rangle_\eta$$
with $C_\eta = D\delta + g^2 C_\mu$ appearing in the rate function. Indeed, using the Martin–Siggia–Rose–de Dominicis–Janssen formalism, we have
$$\langle \delta(\dot{x} + \nabla U(x) - \eta) \rangle_\eta = \int \mathcal{D}\tilde{x}\; e^{\tilde{x}^{\mathrm{T}} (\dot{x} + \nabla U(x))}\, \langle e^{-\tilde{x}^{\mathrm{T}} \eta} \rangle_\eta = \int \mathcal{D}\tilde{x}\; e^{\tilde{x}^{\mathrm{T}} (\dot{x} + \nabla U(x)) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C_\eta \tilde{x}},$$
which shows that the two notations are equivalent since $\tilde{x}^{\mathrm{T}} (\dot{x} + \nabla U(x)) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C_\eta \tilde{x} = S_0(x,\tilde{x}) + \frac{g^2}{2} \tilde{x}^{\mathrm{T}} C_\mu \tilde{x}$ for $C_\eta = D\delta + g^2 C_\mu$.

C. Equivalence to Ben Arous and Guionnet (1995)
Here, we show explicitly that the rate function we obtained generalizes the rate function obtained by Ben Arous and Guionnet. We start with Theorem 4.1 in [11] adapted to our notation: Define
$$Q(x) := \int \mathcal{D}\tilde{x}\; e^{\tilde{x}^{\mathrm{T}} (\dot{x} + \nabla U(x)) + \frac{1}{2} \tilde{x}^{\mathrm{T}} \tilde{x}}$$
and
$$G(\mu) := \int \mathcal{D}x\, \mu(x) \log \Big( \big\langle e^{g\, y^{\mathrm{T}} (\dot{x} + \nabla U(x)) - \frac{g^2}{2} y^{\mathrm{T}} y} \big\rangle_y \Big),$$
where $\langle \cdot \rangle_y$ is the expectation value over a zero-mean Gaussian process $y$ with $C_\mu(u,v) = \int \mathcal{D}x\, \mu(x)\, x(u)\, x(v)$, written as $\langle \cdot \rangle_y = \int \mathcal{D}y \int \mathcal{D}\tilde{y}\; (\cdot)\, e^{\tilde{y}^{\mathrm{T}} y + \frac{1}{2} \tilde{y}^{\mathrm{T}} C_\mu \tilde{y}}$. With the Kullback-Leibler divergence $D_{\mathrm{KL}}(\mu \| Q)$, Theorem 4.1 states that the function
$$\tilde{H}(\mu) = \begin{cases} D_{\mathrm{KL}}(\mu \| Q) - G(\mu) & \text{if } D_{\mathrm{KL}}(\mu \| Q) < \infty \\ +\infty & \text{otherwise} \end{cases}$$
is a good rate function.

Now we relate $\tilde{H}$ to the rate function derived above, Eq. (17). Using the Onsager–Machlup action, we can write
$$D_{\mathrm{KL}}(\mu \| Q) = \int \mathcal{D}x\, \mu(x) \log \frac{\mu(x)}{e^{-S_{\mathrm{OM}}(x)}} + \mathrm{const.}$$
with $S_{\mathrm{OM}}(x) = \frac{1}{2} (\dot{x} + \nabla U(x))^{\mathrm{T}} (\dot{x} + \nabla U(x))$. Next, we transform $g y \to y$, $\tilde{y}/g \to \tilde{y}$ and solve the integral over $y$ in $G(\mu)$:
$$\int \mathcal{D}y\; e^{-\frac{1}{2} y^{\mathrm{T}} y + y^{\mathrm{T}} (\dot{x} + \nabla U(x) + \tilde{y})} \propto e^{S_{\mathrm{OM}}[x] + \tilde{y}^{\mathrm{T}} (\dot{x} + \nabla U(x)) + \frac{1}{2} \tilde{y}^{\mathrm{T}} \tilde{y}}.$$
The Onsager–Machlup actions in the logarithms in $D_{\mathrm{KL}}(\mu \| Q)$ and $G(\mu)$ cancel and we arrive at
$$\tilde{H}(\mu) = \int \mathcal{D}x\, \mu(x) \log \frac{\mu(x)}{\int \mathcal{D}\tilde{y}\; e^{\tilde{y}^{\mathrm{T}} (\dot{x} + \nabla U(x)) + \frac{1}{2} \tilde{y}^{\mathrm{T}} (g^2 C_\mu + \delta) \tilde{y}}}$$
up to an additive constant that we set to zero. Since $C_\mu(u,v) = \int \mathcal{D}x\, \mu(x)\, x(u)\, x(v)$, the rate function by Ben Arous and Guionnet is thus equivalent to Eq. (8) with $\phi(x) = x$ and $D = 1$.

D. Relation to Sompolinsky, Crisanti, Sommers (1988)
Here, we relate the approach that we laid out in the main text to the approach pioneered by Sompolinsky, Crisanti, and Sommers [4] (reviewed in [9, 10]), using our notation for consistency. Therein, the starting point is the scaled cumulant-generating functional
$$\hat{W}_N(j) = \frac{1}{N} \ln \big\langle \big\langle e^{j^{\mathrm{T}} x} \big\rangle_{x|J} \big\rangle_J,$$
which gives rise to the cumulants of the trajectories. For the linear functional $\ell(x) = j^{\mathrm{T}} x$, we have $\sum_{i=1}^{N} \ell(x_i) = j^{\mathrm{T}} x$ and thus $W_N(j^{\mathrm{T}} x) = \hat{W}_N(j)$. Put differently, the scaled cumulant-generating functional of the trajectories $\hat{W}_N(j)$ is a special case of the more general scaled cumulant-generating functional $W_N(\ell)$ we consider in this manuscript. Of course, one can start from the scaled cumulant-generating functional of the observable of interest and derive the corresponding rate function. Conversely, we show below how to obtain the rate function of a specific observable from the rate function of the empirical measure.

Contraction Principle
Here, we relate the rather general rate function of the empirical measure $H(\mu)$ to the rate function of a particular observable, $I(C)$. As an example, we choose the correlation function
$$C(u,v) = \frac{g^2}{N} \sum_{i=1}^{N} \phi(x_i(u))\, \phi(x_i(v))$$
because it is a quantity that arises naturally during the Hubbard–Stratonovich transformation. The generic approach to this problem is given by the contraction principle [23]:
$$I(C) = \inf_{\mu\ \mathrm{s.t.}\ C = g^2 \int \mathcal{D}x\, \mu(x)\, \phi\phi} H(\mu).$$
Here, the infimum is constrained to the empirical measures that give rise to the correlation function $C$, i.e. those that fulfill $C(u,v) = g^2 \int \mathcal{D}x\, \mu(x)\, \phi(x(u))\, \phi(x(v))$. Writing $H(\mu)$ as the Legendre transform of the scaled cumulant-generating functional, $H(\mu) = \inf_\ell [\int \mathcal{D}x\, \mu(x)\, \ell(x) - W_\infty(\ell)]$, the empirical measure only appears linearly. Using a Lagrange multiplier $k(u,v)$, the infimum over $\mu$ leads to the constraint $\ell(x) = g^2\, \phi^{\mathrm{T}} k \phi$ and we arrive at
$$I(C) = \inf_k \big[ k^{\mathrm{T}} C - W_\infty(g^2\, \phi^{\mathrm{T}} k \phi) \big].$$
Once again, we see how to relate $W_N(\ell)$ to a specific observable – this time for the choice $\ell(x) = g^2\, \phi^{\mathrm{T}} k \phi$. Up to this point, the discussion applies to any observable. For the current example, we can proceed a bit further. With the redefinition $\tilde{C} + g^2 k \to \tilde{C}$, we get
$$W_\infty(g^2\, \phi^{\mathrm{T}} k \phi) = \operatorname{extr}_{C,\tilde{C}} \Big[ -\frac{1}{g^2} C^{\mathrm{T}} \tilde{C} + C^{\mathrm{T}} k + \Omega(C,\tilde{C}) \Big],$$
$$\Omega(C,\tilde{C}) = \ln \int \mathcal{D}x \int \mathcal{D}\tilde{x}\; e^{S_0(x,\tilde{x}) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C \tilde{x} + \phi^{\mathrm{T}} \tilde{C} \phi},$$
which made $\Omega$ independent of $k$. Now we can take the infimum over $k$, leading to
$$I(C) = \operatorname{extr}_{\tilde{C}} \Big[ \frac{1}{g^2} C^{\mathrm{T}} \tilde{C} - \Omega(C,\tilde{C}) \Big]. \qquad (18)$$
The remaining extremum gives rise to the condition
$$C = g^2\, \frac{\int \mathcal{D}x \int \mathcal{D}\tilde{x}\; \phi\phi\; e^{S_0(x,\tilde{x}) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C \tilde{x} + \phi^{\mathrm{T}} \tilde{C} \phi}}{\int \mathcal{D}x \int \mathcal{D}\tilde{x}\; e^{S_0(x,\tilde{x}) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C \tilde{x} + \phi^{\mathrm{T}} \tilde{C} \phi}},$$
i.e. a self-consistency condition for the correlation function.

As a side remark, we mention that the expression in the brackets of Eq. (18) is the joint effective action for $C$ and $\tilde{C}$, because for $N \to \infty$, the action equals the effective action. This result is therefore analogous to the finding that the effective action in the Onsager–Machlup formalism is given as the extremum of its counterpart in the Martin–Siggia–Rose–de Dominicis–Janssen formalism [27, eq. (24)]. The only difference is that here, we are dealing with second-order statistics and not just mean values. The origin of this finding is the same in both cases: we are only interested in the statistics of the physical quantity (the one without tilde, $x$ or $C$, respectively). Therefore we only introduce a source field ($k$ in the present case) for this one, but not for the auxiliary field, which amounts to setting the source field of the latter to zero. This is translated into the extremum in Eq. (18) over the auxiliary variable [27, appendix 5].

E. Log-Likelihood Derivative
Here, we calculate the derivatives of the log-likelihood with respect to the parameters $g$ and $D$. In terms of the rate function, we have
$$\partial_a \ln P(\mu | g, D) \simeq -N\, \partial_a H(\mu | g, D),$$
where $a$ denotes either $g$ or $D$. The parameters appear only in the cross entropy,
$$\partial_a H(\mu) = -\int \mathcal{D}x\, \mu(x)\, \partial_a \ln \langle \delta(\dot{x} + \nabla U(x) - \eta) \rangle_\eta,$$
through the correlation function
$$C_\eta(u,v) = D\, \delta(u-v) + g^2 \int \mathcal{D}x\, \mu(x)\, \phi(x(u))\, \phi(x(v)).$$
Above, we showed that
$$\langle \delta(\dot{x} + \nabla U(x) - \eta) \rangle_\eta = \int \mathcal{D}\tilde{x}\; e^{\tilde{x}^{\mathrm{T}} (\dot{x} + \nabla U(x)) + \frac{1}{2} \tilde{x}^{\mathrm{T}} C_\eta \tilde{x}}.$$
Because $\tilde{x}$ appears at most quadratically in the exponent, the integral is solvable and we get
$$\langle \delta(\dot{x} + \nabla U(x) - \eta) \rangle_\eta = \frac{e^{-\frac{1}{2} (\dot{x} + \nabla U(x))^{\mathrm{T}} C_\eta^{-1} (\dot{x} + \nabla U(x))}}{\sqrt{\det(2\pi C_\eta)}}.$$
Note that the normalization $1/\sqrt{\det(2\pi C_\eta)}$ does not depend on the potential $U$. Now we can take the derivatives of $\ln \langle \delta(\dot{x} + \nabla U(x) - \eta) \rangle_\eta$ and get
$$\partial_a \ln \langle \delta(\dot{x} + \nabla U(x) - \eta) \rangle_\eta = -\frac{1}{2} (\dot{x} + \nabla U(x))^{\mathrm{T}}\, \frac{\partial C_\eta^{-1}}{\partial a}\, (\dot{x} + \nabla U(x)) - \frac{1}{2}\, \partial_a \operatorname{tr} \ln C_\eta,$$
where we used $\ln \det C = \operatorname{tr} \ln C$. With this, we arrive at
$$\partial_a H(\mu) = \frac{1}{2} \operatorname{tr}\Big( C_0\, \frac{\partial C_\eta^{-1}}{\partial a} \Big) + \frac{1}{2} \operatorname{tr}\Big( \frac{\partial C_\eta}{\partial a}\, C_\eta^{-1} \Big),$$
where the integral over the empirical measure gave rise to $C_0 = \int \mathcal{D}x\, \mu(x)\, (\dot{x} + \nabla U(x))\, (\dot{x} + \nabla U(x))$ and we used $\partial_a \operatorname{tr} \ln C = \operatorname{tr}\big( \frac{\partial C}{\partial a} C^{-1} \big)$. Finally, using $\frac{\partial C}{\partial a} C^{-1} = C\, C^{-1} \frac{\partial C}{\partial a}\, C^{-1} = -C\, \frac{\partial C^{-1}}{\partial a}$, we get
$$\partial_a \ln P(\mu | g, D) \simeq -\frac{N}{2} \operatorname{tr}\Big( (C_0 - C_\eta)\, \frac{\partial C_\eta^{-1}}{\partial a} \Big)$$
as stated in the main text.
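To make the connection to Eq. (11) explicit, note that for stationary statistics both kernels depend only on the time lag and are therefore diagonal in the Fourier domain:

```latex
% Stationary case: C_0 and C_eta are functions of the time lag only, so the
% condition C_0 = C_eta holds frequency by frequency.
\begin{align}
  C_0(\tau) = \int \mathcal{D}x\, \mu(x)\, \big(\dot{x} + \nabla U(x)\big)(t)\,\big(\dot{x} + \nabla U(x)\big)(t+\tau)
  \;\xrightarrow{\;\mathcal{F}\;}\; S_{\dot{x} + \nabla U(x)}(f), \\
  C_\eta(\tau) = D\,\delta(\tau) + g^2 \int \mathcal{D}x\, \mu(x)\, \phi(x(t))\,\phi(x(t+\tau))
  \;\xrightarrow{\;\mathcal{F}\;}\; D + g^2\, S_{\phi(x)}(f),
\end{align}
```

so that $C_0 = C_\eta$ is equivalent to $S_{\dot{x} + \nabla U(x)}(f) = D + g^2 S_{\phi(x)}(f)$ for all $f$, which is Eq. (11) of the main text.

F. Cross Entropy Difference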
Here, we express the cross entropy difference
$$H_1 - H_2 := H(\mu | U_1, \phi_1, \hat{g}_1, \hat{D}_1) - H(\mu | U_2, \phi_2, \hat{g}_2, \hat{D}_2)$$
in a form that can be evaluated numerically. Using the rate function, we get
$$H_1 - H_2 = \int \mathcal{D}x\, \mu(x) \ln \frac{\langle \delta(\dot{x} + \nabla U_2(x) - \eta_2) \rangle_{\eta_2}}{\langle \delta(\dot{x} + \nabla U_1(x) - \eta_1) \rangle_{\eta_1}}$$
with $C_{\eta_i} = \hat{D}_i \delta + \hat{g}_i^2 \int \mathcal{D}x\, \mu(x)\, \phi_i \phi_i$. Again, we use
$$\langle \delta(\dot{x} + \nabla U_i(x) - \eta_i) \rangle_{\eta_i} = \frac{e^{-\frac{1}{2} (\dot{x} + \nabla U_i(x))^{\mathrm{T}} C_{\eta_i}^{-1} (\dot{x} + \nabla U_i(x))}}{\sqrt{\det(2\pi C_{\eta_i})}}$$
to arrive at
$$H_1 - H_2 = \frac{1}{2} \operatorname{tr}\big( C_1 C_{\eta_1}^{-1} \big) + \frac{1}{2} \operatorname{tr} \ln C_{\eta_1} - \frac{1}{2} \operatorname{tr}\big( C_2 C_{\eta_2}^{-1} \big) - \frac{1}{2} \operatorname{tr} \ln C_{\eta_2}$$
with $C_i = \int \mathcal{D}x\, \mu(x)\, (\dot{x} + \nabla U_i(x))\, (\dot{x} + \nabla U_i(x))$. For stationary correlation functions over infinite time intervals, we can evaluate the traces as integrals over the power spectra:
$$\operatorname{tr}(A B^{-1}) \propto \int_{-\infty}^{\infty} \frac{\tilde{A}(f)}{\tilde{B}(f)}\, \mathrm{d}f, \qquad \operatorname{tr} \ln A \propto \int_{-\infty}^{\infty} \ln \tilde{A}(f)\, \mathrm{d}f.$$
With this, we get
$$H_1 - H_2 \propto \int_{-\infty}^{\infty} \frac{S_{\dot{x} + \nabla U_1(x)}(f)}{\hat{D}_1 + \hat{g}_1^2 S_{\phi_1(x)}(f)}\, \mathrm{d}f + \int_{-\infty}^{\infty} \ln\big( \hat{D}_1 + \hat{g}_1^2 S_{\phi_1(x)}(f) \big)\, \mathrm{d}f - \int_{-\infty}^{\infty} \frac{S_{\dot{x} + \nabla U_2(x)}(f)}{\hat{D}_2 + \hat{g}_2^2 S_{\phi_2(x)}(f)}\, \mathrm{d}f - \int_{-\infty}^{\infty} \ln\big( \hat{D}_2 + \hat{g}_2^2 S_{\phi_2(x)}(f) \big)\, \mathrm{d}f.$$
Accordingly, the cross entropy difference can be evaluated with integrals over the respective power spectra, which can be obtained using the Fast Fourier Transform.
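A numerical sketch of this final expression, assuming the network-averaged power spectra of both models have already been estimated on a common frequency grid with spacing df (for instance with the spectrum routine sketched after Eq. (11)); function and argument names are illustrative.

```python
import numpy as np

def cross_entropy_difference(S_drive_1, S_phi_1, g_1, D_1,
                             S_drive_2, S_phi_2, g_2, D_2, df):
    """Evaluate H_1 - H_2 up to the common proportionality constant.
    S_drive_i: spectrum of xdot + grad U_i(x); S_phi_i: spectrum of phi_i(x)."""
    def model_term(S_drive, S_phi, g, D):
        S_eta = D + g ** 2 * S_phi                    # D_i + g_i^2 S_{phi_i(x)}(f)
        return np.sum(S_drive / S_eta + np.log(S_eta)) * df
    return model_term(S_drive_1, S_phi_1, g_1, D_1) - model_term(S_drive_2, S_phi_2, g_2, D_2)
```

A smaller (more negative) value of H_1 - H_2 indicates that the first model has the lower cross entropy and hence the higher likelihood; in Fig. 2 this comparison is carried out across candidate values of s.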
G. Timescale of Prediction Error
We here relate the timescale of the prediction error to the timescale of the autocorrelation function $C_x(\tau)/C_x(0) \sim \exp(-\tau/\tau_c)$. The predicted variance in the continuous-time limit is determined by the corresponding limit of Eq. (14),
$$\sigma_{\hat{x}}^2 = C_x(\hat{t}, \hat{t}) - \int_0^T \int_0^T C_x(\hat{t}, u)\, C_x^{-1}(u,v)\, C_x(v, \hat{t})\, \mathrm{d}u\, \mathrm{d}v,$$
where $[0, T]$ denotes the training interval. Writing $\hat{t} = T + \tau$ and approximating $C_x(T + \tau, u) \approx C_x(T, u)\, e^{-\tau/\tau_c}$, we get
$$\sigma_{\hat{x}}^2 \approx C_x(\hat{t}, \hat{t}) - e^{-2\tau/\tau_c}\, C_x(T, T),$$
where we used $\int_0^T C_x^{-1}(u,v)\, C_x(v, T)\, \mathrm{d}v = \delta(u - T)$. Using stationarity, $C_x(u,v) = C_x(v-u)$, we arrive at
$$\sigma_{\hat{x}}^2 / \sigma_x^2 \approx 1 - e^{-2\tau/\tau_c}$$
where $C_x(0) = \sigma_x^2$. Thus, for large $\tau$, the timescale of the prediction error is given by $\tau_c/2$.

[1] M. I. Rabinovich, P. Varona, A. I. Selverston, and H. D. Abarbanel, Rev. Mod. Phys. 78, 1213 (2006).
[2] H. Sompolinsky, Physics Today 41, 70 (1988).
[3] S.-I. Amari, IEEE Transactions on Systems, Man and Cybernetics SMC-2, 643 (1972), ISSN 2168-2909.
[4] H. Sompolinsky, A. Crisanti, and H. J. Sommers, Phys. Rev. Lett. 61, 259 (1988).
[5] M. Stern, H. Sompolinsky, and L. F. Abbott, Phys. Rev. E 90, 062710 (2014).
[6] J. Kadmon and H. Sompolinsky, Phys. Rev. X 5, 041030 (2015).
[7] J. Aljadeff, M. Stern, and T. Sharpee, Phys. Rev. Lett. 114, 088101 (2015).
[8] A. van Meegen and B. Lindner, Phys. Rev. Lett. 121, 258302 (2018).
[9] J. Schuecker, S. Goedeke, and M. Helias, Phys. Rev. X 8, 041029 (2018).
[10] A. Crisanti and H. Sompolinsky, Phys. Rev. E 98, 062120 (2018).
[11] G. Ben Arous and A. Guionnet, Probability Theory and Related Fields 102, 455 (1995), ISSN 1432-2064.
[12] A. Guionnet, Probability Theory and Related Fields 109, 183 (1997).
[13] H. Sompolinsky and A. Zippelius, Phys. Rev. Lett. 47, 359 (1981).
[14] P. Martin, E. Siggia, and H. Rose, Phys. Rev. A 8, 423 (1973).
[15] H.-K. Janssen, Zeitschrift für Physik B Condensed Matter 23, 377 (1976).
[16] C. Chow and M. Buice, J. Math. Neurosci. 5, 1 (2015).
[17] J. A. Hertz, Y. Roudi, and P. Sollich, Journal of Physics A: Mathematical and Theoretical 50, 033001 (2017).
[18] N. Goldenfeld, Lectures on Phase Transitions and the Renormalization Group (Perseus Books, Reading, Massachusetts, 1992).
[19] J. Schuecker, S. Goedeke, D. Dahmen, and M. Helias, arXiv:1605.06758 [cond-mat.dis-nn] (2016).
[20] M. S. Berger, Nonlinearity and Functional Analysis (Elsevier, 1977), 1st ed., ISBN 9780120903504.
[21] J. Zinn-Justin, Quantum Field Theory and Critical Phenomena (Clarendon Press, Oxford, 1996).
[22] M. Mézard and A. Montanari, Information, Physics, and Computation (Oxford University Press, 2009).
[23] H. Touchette, Physics Reports 478, 1 (2009).
[24] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems (SIAM, 1995).
[25] D. J. MacKay, Information Theory, Inference and Learning Algorithms (Cambridge University Press, 2003).
[26] G. Matheron, Economic Geology 58, 1246 (1963).
[27] J. Stapmanns, T. Kühn, D. Dahmen, T. Luu, C. Honerkamp, and M. Helias, Phys. Rev. E 101 (2020).