Large Deviations Properties of Maximum Entropy Markov Chains from Spike Trains
Rodrigo Cofré∗, Cesar Maldonado, and Fernando Rosas

CIMFAV, Facultad de Ingeniería, Universidad de Valparaíso, Valparaíso, Chile
IPICYT/División de Matemáticas Aplicadas, San Luis Potosí
Centre of Complexity Science and Department of Mathematics, Imperial College London, London, UK
Department of Electrical and Electronic Engineering, Imperial College London, London, UK
May 21, 2018
Abstract
We consider the maximum entropy Markov chain inference approach to characterize the collective statistics of neuronal spike trains, focusing on the statistical properties of the inferred model. We review large deviations techniques useful in this context to describe properties of accuracy and convergence in terms of sampling size. We use these results to study the statistical fluctuation of correlations, distinguishability and irreversibility of maximum entropy Markov chains. We illustrate these applications using simple examples where the large deviation rate function is explicitly obtained for maximum entropy models of relevance in this field.
Keywords
Computational neuroscience; spike train statistics; maximum entropy principle; large deviation theory; out-of-equilibrium statistical mechanics; thermodynamic formalism; entropy production.
Spiking neuronal networks are perhaps the most sophisticated information processing devices that are available for scientific inquiry. There exists already an understanding of their basic mechanisms and functionality: they are composed of interconnected neurons which fire action potentials (a.k.a. "spikes") collectively in order to accomplish specific tasks, e.g. sensory information processing or motor control [43]. However, the interdependencies in the spiking activity of populations of neurons can be extremely complex. In effect, these interdependencies can involve neighboring or also distant cells, being established either via structural connections, i.e. physical mediums such as synapses, or by functional connections reflected through spike correlations [17].

Understanding the way in which neuronal networks process information requires disentangling structural and functional connections while clarifying their interplay, which is a challenging but critical issue [38, 18]. For this aim, networks of spiking neurons are usually measured using in-vitro or in-vivo multi-electrode arrays, which connect neurons to electronic sensors specially designed for spike detection. Recent progress in acquisition techniques allows the simultaneous measurement of the spiking activity of increasingly large populations of neurons, enabling the collection of large amounts of experimental data [16]. Prominent examples of spike train recordings have been obtained from vertebrate retina (salamander, rabbit, degus) [44, 55, 57] and cat cortex [32].

However, despite the progress in multi-electrode and neuroimaging recording techniques, modeling the collective spike train statistics is still one of the key open challenges in computational neuroscience.

(∗ Electronic address: [email protected]; Corresponding author.)
Analysis of recorded data has shown that, although the neuronal activity is highly variable (even when the same stimulus is presented repeatedly), the statistics of the response is highly structured [11, 48]. Therefore, it seems that much of the inner dynamics of neuronal networks is encoded in the statistical structure of the spikes. Unfortunately, traditional methods of estimation, inference, and model selection are not well-suited for this scenario, since the number of possible binary patterns that a neuronal network can adopt grows exponentially with the size of the population. As a matter of fact, even long experimental recordings usually contain only a small subset of the possible spiking patterns, which makes the empirical frequencies poor estimators of the underlying probability distribution. For practical purposes, this induces dramatic limitations, as standard inference tools become unreliable as soon as the number of considered neurons grows beyond 10 [44].

Given the binary nature of the spiking data, it is natural to relate neuronal networks and digital communication systems via Shannon's information theory. A perhaps more subtle way of establishing this link is provided by the physics literature that studies stochastic spin systems. In fact, a succession of research efforts has helped develop a framework to study spike train statistics based on tools of statistical physics, namely the maximum entropy principle (MEP), which provides an intuitive and tractable procedure to build a statistical model of the whole neuronal network. In 2006, Schneidman et al. [44] and Pillow et al. [40] used the MEP to characterize the spike train statistics of the vertebrate retina responding to natural stimuli, constraining only range-one features, namely firing rates and instantaneous pairwise interactions. Since then, the MEP approach has become a standard tool to build probability measures in the field of spike train statistics [40, 44, 53, 57].
This approach has triggered fruitful analyses of the neural code, including works about criticality [52], and redundancy and error correction [55], among other intriguing and promising topics. Although relatively successful, this approach for linking neuronal populations and statistical mechanics is based on assumptions that go against fundamental biological knowledge. Firstly, these works assume that the spike patterns are statistically independent of past and future activities of the network. In fact, and not surprisingly, there exists strong evidence supporting the fact that memory effects play a major role in spike train statistics [51, 33, 57]. Secondly, most of the literature that applies statistical mechanics to analyze neuronal data uses tools that assume that the underlying system is in thermodynamic equilibrium. However, it has been recognized that being out-of-equilibrium is one of the distinctive properties of living systems [45, 12, 41]. Consequently, any statistical description that is consistent with the out-of-equilibrium condition of living neuronal networks should reflect some degree of time asymmetry (i.e. time irreversibility) [49].

As a way of addressing the above observations, some recent publications study maximum entropy Markov chains (MEMC) based on a variational principle from the thermodynamic formalism of dynamical systems (see for instance [8, 57, 5]). This framework is an extension of the classic approach based on the MEP that considers correlations of spikes among neurons simultaneously and with different time delays as constraints, being able in this way to account for various memory effects.

Most of the literature on spike train statistics via the MEP pays little attention to the fact that model estimation is done based on finite data (errors due to statistical fluctuations are likely to occur in this context). As the MEP can be seen as a statistical inference procedure, it is natural to inquire about the uncertainty (i.e.
fluctuations and convergence properties) related to the inferred MEMC, or, in other words, to ask about the robustness of the inference as a function of the sampling size of the underlying data set. Quantifying this error is particularly relevant in light of recent results suggesting that the parameters inferred by the MEP approach in the context of experimental biological recordings are sharply poised at criticality [36, 55]. On the other hand, once the MEMC has been inferred, it is also important to quantify how well a sample of the MEMC reproduces the average values of features of interest, and how likely it is that a sample of the MEMC produces a "rare" or unlikely event.

In order to provide some first steps in addressing the above issues, this paper studies the MEMC framework using tools from large deviation theory (LDT) [14, 13]. We exploit the fact that the average values of features obtained from samples of the MEMC satisfy a large deviation property, and use LDT techniques to estimate their fluctuations in terms of the sampling size. We also show how to compute the rate functions using the tilted transition matrix technique and the Gärtner-Ellis theorem. It is to be noted that there is a large body of theoretical work linking the maximum entropy principle and large deviations [14, 21]. However, these techniques have been scarcely used in spike train analysis (only to study the i.i.d. case [2, 35, 30, 34]), most likely because of the lack of a suitable introduction of these concepts within the neuroscientific literature. Consequently, another goal of this paper is to provide an accessible introduction to the MEMC and LDT formalisms for the community of computational neuroscience, avoiding some technicalities while preserving the core ideas and intuitions.
This article is part of a more ambitious program that attempts to build a more unified theoretical structure and a complete toolbox helpful to approach spike train statistics using the thermodynamic formalism [8, 9].

The rest of this paper is organized as follows. Section 2 presents the basic definitions and tools needed to apply large deviations techniques further in the paper. In particular, we present the maximum entropy principle framed in the thermodynamic formalism as a variational principle. In Section 3 we introduce some basic aspects of the theory of large deviations. In Section 4 we focus on the empirical averages of features. We present some examples of relevance in spike train statistics, where we are able to compute explicitly the rate function for each feature in the maximum entropy potential. In Section 5 we present further applications of the theory of large deviations in this field with a list of illustrative examples, and finally we present our conclusions in Section 6.

This section introduces the general definitions, notations and conventions that are used throughout the paper, providing in turn the necessary background for the unfamiliar reader.
Let us consider a set of measurements from a network of $N$ interacting neurons. The "raw data" consists of $N$ continuous signals containing the extra-cellular potential (electrical potential measured outside of the cell) of each of the neurons, recorded over the length of the experiment. This data is processed by spike sorting algorithms [42, 23], which are signal processing techniques designed to extract the spiking activity of each neuron.

Neurons have a minimal characteristic time in which no two spikes can occur, called the "refractory period" [22], which provides a natural time-scale that can be used for "binning" (i.e. for discretizing) the time index of the measurements, denoted by $\Delta t_b$. Denoting the time index by the integer variable $t$, one can say that $x_t^k = 1$ whenever the $k$-th neuron emits a spike during the $t$-th time bin, while $x_t^k = 0$ otherwise. This standard procedure transforms experimental data into sequences of binary patterns (see Figure 1).

A spike pattern is the spike-state of all the measured neurons at time bin $t$, which is denoted by $x_t := [x_t^k]_{k=1}^N$. A spike block is a consecutive sequence of spike patterns, denoted by $x_{t,r} := [x_s]_{s=t}^r$. While the length of the spike block $x_{t,r}$ is $r - t + 1$, it is also useful to consider spike blocks of infinite length starting from time $t = 0$, which are denoted by $x$. Finally, in this paper we consider that a spike train is either a spike block of finite length or an infinite sequence of spiking patterns, which will be useful later when discussing asymptotic properties. The set of all possible spike blocks of length $R$ corresponding to a network of $N$ neurons is denoted by $\mathcal{A}_R^N$. The set of all spike blocks of infinite length is denoted by $\Omega \equiv \mathcal{A}_{\mathbb{N}}^N$, which is a useful mathematical object as clarified below. Let us define the natural projection $\mathrm{proj}_R : \Omega \to \mathcal{A}_R^N$ given by $\mathrm{proj}_R(x) = x_{0,R-1}$.
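The binning procedure just described can be sketched in a few lines of code. This is a toy illustration, not the spike sorting pipeline of the paper; the function name `bin_spikes` and its signature are our own choices.

```python
import numpy as np

def bin_spikes(spike_times, n_neurons, t_end, dt_b):
    """Discretize spike times (in seconds) into a binary array x[k, t].

    spike_times: list of arrays; spike_times[k] holds the spike times of
    neuron k. Names and signature are illustrative, not from the paper.
    """
    n_bins = int(np.ceil(t_end / dt_b))
    x = np.zeros((n_neurons, n_bins), dtype=np.uint8)
    for k in range(n_neurons):
        bins = (np.asarray(spike_times[k]) // dt_b).astype(int)
        bins = bins[bins < n_bins]
        x[k, bins] = 1  # several spikes in one bin count as a single spike
    return x

# Toy example: 2 neurons, 1 second of activity, 10 ms bins
x = bin_spikes([[0.005, 0.012], [0.031]], n_neurons=2, t_end=1.0, dt_b=0.01)
print(x[:, :5])
```

Note that assigning 1 regardless of how many spikes fall in a bin implements the convention mentioned above for bins wider than the refractory period.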
Following the machine-learning nomenclature, a feature is a function that extracts a property of interest from the data. Formally, a feature is defined as a function $f : \Omega \to \mathbb{R}$ that associates a real number to each $x \in \Omega$. The feature $f$ is said to have a temporal range, or simply a range, $R$ if $f$ only depends on the first $R$ spike patterns of the spike train, i.e. if for every $x, y \in \Omega$, $x_{0,R-1} = y_{0,R-1}$ implies $f(x) = f(y)$.

(Footnote: When binning, it can sometimes be useful to go beyond the refractory period. In those cases, two spikes may occur within the same time bin. The convention is to consider this event equivalent to just one spike.)

(Figure 1: Using the bin size $\Delta t_b$, the spiking activity is transformed into binary patterns in discrete time. The figure illustrates the notation used throughout this paper.)

A special class of features, on which this work focuses, are binary functions consisting of finite products of spike states, i.e.
$$f_l(x) = \prod_{k=1}^{q} x_{t_k}^{i_k}.$$
Above, $l$ is a shorthand notation for the set $\{t_k, i_k\}_{k=1}^q$, where $[t_k]_{k=1}^q$ and $[i_k]_{k=1}^q$ are collections of time and neuron indexes respectively, and $q$ is the number of spikes considered by the feature. Correspondingly, for a given index $l$, one has $f_l(x) = 1$ if and only if the $i_k$-th neuron spikes at time $t_k$ for all $k \in \{1, \ldots, q\}$ in the spike train $x$, while $f_l(x) = 0$ otherwise. Note that, when considering features of range $R \geq 1$, the firing times $t_k$ are constrained within the interval $\{0, \ldots, R-1\}$. We define the reduced feature $\tilde f : \mathcal{A}_R^N \to \mathbb{R}$ such that $\tilde f(x_{0,R-1}) = \tilde f(\mathrm{proj}_R(x)) = f(x)$.

For a given spiking neuronal network involved in a particular experimental protocol, the measured activity usually contains a significant amount of stochasticity that is characteristic of measurements at this spatiotemporal scale.
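As a concrete illustration of the product features $f_l$ before moving on, consider the following toy sketch; the helper name `make_feature` is ours, not the paper's.

```python
import numpy as np

def make_feature(pairs):
    """Return f_l for l = {(i_k, t_k)}: the product of the chosen spike states.

    `pairs` is a list of (neuron_index, time_index) tuples; the helper
    name is illustrative, not from the paper.
    """
    def f(x):  # x is a spike block: binary array of shape (N, R)
        return int(all(x[i, t] == 1 for i, t in pairs))
    return f

# Range-2 spike block for N = 2 neurons
x = np.array([[1, 0],
              [1, 1]])
rate_0 = make_feature([(0, 0)])           # neuron 0 fires at time 0
pair_01 = make_feature([(0, 0), (1, 0)])  # synchronous pair at time 0
lagged = make_feature([(0, 0), (1, 1)])   # neuron 0 at t=0, neuron 1 at t=1
print(rate_0(x), pair_01(x), lagged(x))   # all equal 1 for this block
```

The first two features have range one, while the third has range two, since it involves two different time indexes.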
This randomness is caused mostly by:

(i) the random variation in the ionic flux of charges crossing the cellular membrane per unit time at the post-synaptic button, due to the binding of neurotransmitter;

(ii) the fluctuations in the current resulting from the large number of openings and closings of ion channels [46, 29];

(iii) noise coming from electrical synapses [7].

In order to capture this stochasticity within our modeling, it is natural to endow $\Omega$ with a probabilistic structure. For this, we assume that there exists a probability distribution $p\{\cdot\}$ over $\Omega$ that quantifies the intrinsic randomness associated to the spiking phenomena. From this point of view, all $A \subset \Omega$ are events that might take place with probability $p\{A\}$. Following a standard practice in computational neuroscience, we assume that the stochastic process generating the spikes is stationary, i.e. that its statistics do not change in time. As we will discuss below, this assumption is crucial for the maximum entropy inference. Although an extension of our approach to a non-stationary scenario is possible, we focus here on the stationary case, as it greatly simplifies the presentation. Using the stationarity assumption, given the probability distribution of the whole process $p\{\cdot\}$, one can define a unique corresponding probability distribution over $\mathcal{A}_R^N$ following the natural projection, given by:
$$p_R\{B\} := p\{\mathrm{proj}_R^{-1}(B)\}, \quad B \in \mathcal{A}_R^N. \quad (1)$$

As a consequence of assuming a stochastic process guiding the neuronal activity, a feature $f : \Omega \to \mathbb{R}$ becomes a random variable. Consequently, the statistics of $f$ are defined by $p\{f = a\} = p\{x \in \Omega \mid f(x) = a\}$. In particular, considering the feature $f(x) = x_t^k$, one can note that individual spike-states (as well as spike patterns and spike blocks) become discrete random variables. As a convention, we denote by $X_t^k$ a random spike-state that follows an implicit underlying probability distribution $p\{\cdot\}$, while lower-case expressions (e.g.
$x_t^k$) are used for denoting concrete realizations of these random variables. The mean value of a feature $f$ with respect to the probability $p\{\cdot\}$ is given by:
$$\mathbb{E}_p\{f\} = \sum_{x \in \Omega} f(x)\, p\{x\}.$$
For the case of features of range $R$, the mean value can be expressed alternatively as:
$$\mathbb{E}_p\{f\} = \sum_{x_{0,R-1} \in \mathcal{A}_R^N} \tilde f(x_{0,R-1})\, p_R\{x_{0,R-1}\} = \mathbb{E}_{p_R}\{\tilde f\},$$
which is a finite sum. Above, $\tilde f$ is the reduced feature, as defined in Section 2.2.

Let us consider spiking data of the form $x_{0,T-1}$, where $T$ is the sample length. Although in general the underlying probability measure $p\{\cdot\}$ that governs the spiking activity is unknown, it is useful to use the available data to estimate the mean values of specific features. If $f$ is a feature of range $R$, the empirical average value of $f$ from the sample $x_{0,T-1}$ is
$$A_T(f) = \frac{1}{T - R + 1} \sum_{i=0}^{T-R} f(x_{i,R-1+i}). \quad (2)$$
In particular, for features of range one, the previous expression becomes $A_T(f) = \frac{1}{T}\sum_{i=0}^{T-1} f(x_i)$. An interesting question is under which conditions $A_T(f) \to \mathbb{E}_p\{f\}$ as $T$ grows. This, and other convergence issues, are explored in Section 4.

Inference of the statistical model with the MEP
Following Section 2.3, the probability measure $p\{\cdot\}$ represents the inherent stochasticity of the spiking neural population under consideration. As $p\{\cdot\}$ is unknown, one would like to infer it from data. In the sequel, we first introduce the general MEP as a method for inferring $p\{\cdot\}$. Then, we show this principle for the case where only synchronous constraints are considered, and finally, we present the case where non-synchronous correlations are included to constrain the maximization problem.

The MEP was first proposed by E. T. Jaynes as a way of estimating probability distributions when the information for performing the inference is scarce [24]. Rooted in principles of statistical physics, this approach selects a probability measure that satisfies the evidence supported by the available information while leaving all other aspects as random as possible. For quantifying the corresponding randomness, Jaynes shows that the most natural metric is the Shannon entropy [25]. The probability measure found by this procedure is known as the maximum entropy distribution.

Formally, the MEP is a concave constrained maximization problem, where the constraints that define the optimization space correspond to the available information that guides the inference process. Accordingly, if additional constraints are introduced then the optimization space is reduced; this corresponds to the informative power of new information, which restricts the space of models that are consistent with it.

The inference procedure based on the MEP follows these steps:

I. Choose $K$ features of interest $f_1, \ldots, f_K$ (c.f. Section 2.2).

II. Using the available data $x_{0,T-1}$, compute the empirical average of each feature, $A_T(f_k) := c_k$.

III. Assuming stationarity, define the space of statistical models $\mathcal{M}(c_1, \ldots, c_K) \subset \mathcal{M}$ given by
$$\mathcal{M}(c_1, \ldots, c_K) = \{g \in \mathcal{M} \mid \mathbb{E}_g\{f_1\} = c_1, \ldots, \mathbb{E}_g\{f_K\} = c_K\},$$
where $\mathcal{M}$ is the set of probability measures and $\mathcal{M}(c_1, \ldots
, c_K)$ is the family of probability measures that are consistent with the empirical mean values $c_1, \ldots, c_K$ obtained in Step II.

IV. Defining the entropy rate of the stochastic process as
$$\mathcal{S}\{p\} = \lim_{t \to \infty} \frac{1}{t} \sum_{x_{0,t-1} \in \mathcal{A}_t^N} p_t\{x_{0,t-1}\} \log \frac{1}{p_t\{x_{0,t-1}\}}, \quad (3)$$
find the maximum entropy process, characterized by the probability measure
$$\hat p = \arg\max_{q \in \mathcal{M}(c_1, \ldots, c_K)} \mathcal{S}\{q\}. \quad (4)$$

Some small remarks are to be made about this procedure. One can think of it as a data-driven algorithm, whose input is the data $x_{0,T-1}$ and whose output is the maximum entropy measure $\hat p$. The first two steps of the process are known in the machine learning literature as "feature selection" and "feature extraction", respectively (see e.g. [39, 4]). The goal of these steps is to reduce the dimensionality of the input for the subsequent stages, in order to prevent the selected model from overfitting the data (i.e. from including in the model effects of noise and biases due to the finiteness of the data). Hence, what drives the model selection stages is not the whole data but the quantities $c_1, \ldots, c_K$, which define the space to be explored in Step IV.

Steps III and IV are known as "model selection". According to the machine learning jargon, these steps deliver a generative model, in the sense that the obtained model can later be used to generate new data. In this sense, it is interesting that although the data is finite, the entropy rate calculated in Step IV is computed over all spike blocks of all lengths $t$, which is possible due to the generative nature of the candidate models. The inputs to the model selection stages are not the whole data $x_{0,T-1}$ but only the values $c_1, \ldots, c_K$, which represent the knowledge obtained from the data that guides the search in the space of candidate models.
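Steps I and II above can be sketched in code; here is a minimal illustration (helper and variable names are ours) computing the empirical averages $c_k$ of equation (2) for range-one features from a binned spike array:

```python
import numpy as np

def empirical_average(feature, x, R):
    """A_T(f) as in Eq. (2): average of a range-R feature over a sample.

    `x` is a binary array of shape (N, T); `feature` maps an (N, R) block
    to a number. Names are illustrative, not from the paper.
    """
    N, T = x.shape
    vals = [feature(x[:, i:i + R]) for i in range(T - R + 1)]
    return float(np.mean(vals))

rng = np.random.default_rng(0)
x = (rng.random((3, 1000)) < 0.2).astype(np.uint8)  # 3 neurons, T = 1000

# Step I: choose features (firing rate of neuron 0, synchronous pair 0-1)
f1 = lambda block: block[0, 0]
f2 = lambda block: block[0, 0] * block[1, 0]

# Step II: extract the constraints c_k
c1 = empirical_average(f1, x, R=1)
c2 = empirical_average(f2, x, R=1)
print(c1, c2)  # c1 near 0.2 and c2 near 0.04 for these independent neurons
```

Steps III and IV are where the actual maximization takes place, as developed in the following subsections.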
Moreover, as these quantities represent all the available knowledge one has about the underlying stochastic process generating the spikes, one would like to select a model that reflects this information while making no further assumptions. By recalling the work of Claude Shannon on the analysis of information sources (c.f. [10] and references therein), one can interpret the entropy rate as a measure of how hard it is to predict the realization of a stochastic process. This implies, in turn, that the maximum entropy measure within $\mathcal{M}(c_1, \ldots, c_K)$ is the most random, i.e. unstructured, among those which satisfy the constraints $A_T(f_1) = c_1, \ldots, A_T(f_K) = c_K$. Although the framework presented above is general enough to encompass both the case when considering only synchronous constraints and the case when considering also non-synchronous constraints, the methods used to find the maximum entropy measure are different. In Section 3.2 we present the method for finding the maximum entropy measure when only synchronous features are selected, leaving for Section 3.3 the more general situation including non-synchronous constraints.

Assuming only synchronous interactions is equivalent to considering only features of range one (i.e. features that consider neurons at the same time index, c.f. Section 2.2), which leads to restricting the candidate models to those where the present state is statistically independent of past and future states. Moreover, by the assumption of stationarity, this leads to modeling the collective spiking activity as a sequence of i.i.d. random variables. The challenge, in this case, is to estimate the corresponding distribution as reliably as possible. A large portion of the literature on maximum entropy spike train statistics focuses on synchronous interactions between neurons, implicitly neglecting interactions across time.
Although this assumption induces a strong simplification, the resulting models have proved to be rich in structure and can provide interesting results and insights about the neural code [40, 44]. In the following, we recall how this problem can be addressed using the MEP.

As a consequence of the assumptions of temporal independence and stationarity, it can be shown that (4) is reduced to
$$\hat p = \arg\max_{q \in \mathcal{M}(c_1, \ldots, c_K)} \sum_{x \in \mathcal{A}_1^N} q\{x\} \log \frac{1}{q\{x\}}, \quad (5)$$
where $\mathcal{M}(c_1, \ldots, c_K)$ corresponds to the set of distributions $q$ over $\mathcal{A}_1^N$ (c.f. range-one projections in (1)) such that the constraints $\mathbb{E}_q\{f_k\} = c_k$ are satisfied for each $k = 1, \ldots, K$. Note that the above sum is over the $2^N$ possible spike patterns, making (5) a simpler condition than (4). In fact, following a simple argument based on Lagrange multipliers and the concavity of the entropy, it can be shown that the distribution $\hat p$ that solves (5) is unique. Moreover, it is a Boltzmann-Gibbs distribution [25]:
$$\hat p\{x\} = \frac{e^{-\mathcal{H}_\beta(x)}}{Z(\beta)} \quad \forall x \in \mathcal{A}_1^N; \qquad Z(\beta) = \sum_{x \in \mathcal{A}_1^N} e^{-\mathcal{H}_\beta(x)}, \quad (6)$$
where $\mathcal{H}_\beta$ is referred to as the energy or potential function,
$$\mathcal{H}_\beta(x) = \sum_{k=1}^{K} \beta_k \tilde f_k(x), \quad (7)$$
and $\beta \in \mathbb{R}^K$ is the vector of Lagrange multipliers. Following the statistical physics literature, $Z(\beta)$ is called the partition function, whose logarithm is referred to as the free energy.

Conversely, from the uniqueness property of the maximum entropy distribution one can conclude that there is only one Boltzmann-Gibbs distribution $\hat p$ that belongs to $\mathcal{M}(c_1, \ldots, c_K)$, being the only solution of (5). Interestingly, this alternative approach is much easier than solving the original optimization problem: one only needs to find the values of the parameter vector $\beta$ that reproduce the empirical average values $c_1, \ldots, c_K$. Moreover, it is known that for any Boltzmann-Gibbs distribution $\hat p$ the following holds:
$$\frac{\partial \ln Z(\beta)}{\partial \beta_k} = \mathbb{E}_{\hat p}\{\tilde f_k\}.$$
(8)

Therefore, using (8), one can find the appropriate values of $\beta$ for which $\mathbb{E}_{\hat p}\{\tilde f_k\} = c_k$ are satisfied.

(Footnote: $\mathcal{M}(c_1, \ldots, c_K)$ is not easy to parametrize, and hence the application of standard techniques of convex optimization (e.g. gradient descent) is not straightforward.)

A generalization of the previous approach is to include as constraints the average values of features corresponding to interactions in the spiking activity across time. This is a natural assumption in biological spiking networks, as it is expected that the spike of one neuron influences the future activity of the network. However, for practical purposes this problem cannot be solved exactly for systems with
large $N$ [57], so alternative procedures are needed. For the interested reader, we refer to [37, 54, 55, 52].

Considering features of maximum range $R$ induces interdependencies and a corresponding "memory" in the model of length $R - 1$, and thus it is natural to look for the best-suited Markov chain over the state space $\mathcal{A}_R^N$. A Markov chain model allows one to express the probability of a spike train $x_{0,T}$, for $T > R$, as
$$p\{x_{0,T}\} = \pi\{x_{0,R-1}\}\, P\{x_{1,R} \mid x_{0,R-1}\} \cdots P\{x_{T-R+1,T} \mid x_{T-R,T-1}\},$$
where $P$ is a transition probability matrix and $\pi$ is a corresponding invariant probability distribution (which is unique due to the ergodicity assumption, c.f. Section 3.3.1) associated to $P$. Note that, due to the stationarity condition, the transition probabilities $P\{\cdot \mid \cdot\}$ are constant over the realization of the whole process (see Appendix A for more details).

Following the MEP as described in Section 3.1, we look for a procedure for finding a Markov transition matrix $P$ that maximizes its entropy rate while satisfying constraints given by the empirical averages of observables $f_1, \ldots, f_K$. For ergodic Markov chains, a well-known calculation (that can be found e.g. in [10]) shows that the entropy rate, as given by (3), is equivalent to the following simple expression:
$$\mathcal{S}_{KS}(\pi, P) = -\sum_{i \in \mathcal{A}_R^N} \pi_i \sum_{j \in \mathcal{A}_R^N} P_{ij} \log P_{ij}, \quad (9)$$
where $\pi_i = \pi\{x_{0,R-1} = i\}$ and $P_{ij} = P\{j \mid i\}$ for $i, j \in \mathcal{A}_R^N$. It is important to notice that (9) corresponds to the Kolmogorov-Sinai entropy (KSE) in the dynamical systems literature [50]. In general, (9) is larger when, for a fixed $i$, the conditional probabilities $P_{ij}$ are closer to a uniform distribution, i.e. when knowing the transition statistics gives little certainty about the next step.

It can be shown that, if the considered features do not involve correlations across time (i.e. they are features of range 1, c.f. Section 2.2), then the resulting transition probabilities are such that the corresponding stochastic process is i.i.d. (i.e.
when $P_{ij} = \pi_j$). Interestingly, in this scenario equation (9) reduces to the Shannon entropy of $\pi$. This clarifies that this approach based on Markov chains extends the classical MEP and the results presented in Section 3.2.

Finding the MEMC raises, however, some extra technicalities with respect to the time-independent case. Recall that the goal is no longer to estimate a probability distribution, but to reconstruct from data a transition matrix $P$ and a corresponding invariant measure $\pi$. The challenge is that $P$ and $\pi$ are not independent parameters of the process ($\pi$ has to be the eigenvector associated with the unitary eigenvalue of $P$ [6]), and therefore one cannot directly apply Lagrange multipliers to the entropy rate function (9). In the sequel we explore an alternative route to build the MEMC based on the transfer matrix technique. This technique is computationally simple, and also provides further insightful connections with statistical physics and thermodynamics.

(Footnote: Note that $P\{\cdot \mid \cdot\}$ has a consistency requirement: for $w, y \in \mathcal{A}_R^N$, $P\{w \mid y\} > 0$ only when $y_{1,R-1} = w_{0,R-2}$.)

3.3.1 Transfer Matrix Method

In order to find the MEMC associated with non-synchronous constraints, we follow the same ideas presented above in the time-independent case, but using different tools. We present them here.

Let us consider the set of features chosen to constrain the maximization of the entropy rate (Step I in Section 3.1). We assume that the chosen features have a finite maximum range $R$. From these features one can build an energy function $\mathcal{H}_\beta$ of finite range $R$ as in (7), i.e. as a linear combination of these features. Using this energy function we build a matrix denoted by $\mathcal{L}_{\mathcal{H}_\beta}$, so that for every $y, w \in \mathcal{A}_R^N$ its entries are given as follows:
$$\mathcal{L}_{\mathcal{H}_\beta}(y, w) = \begin{cases} e^{\mathcal{H}_\beta(y w_{R-1})} & \text{if } y_{1,R-1} = w_{0,R-2}, \\ 0 & \text{otherwise}. \end{cases} \quad (10)$$
By $y w_{R-1}$ we mean the word obtained by concatenation of $y$ and $w_{R-1}$.
In the particular case of energy functions associated to range-one features, the above matrix is defined as $\mathcal{L}_{\mathcal{H}_\beta}(y, w) = e^{\mathcal{H}_\beta(y)}$. Assuming $\mathcal{H}_\beta > -\infty$, the elements of the matrix $\mathcal{L}_{\mathcal{H}_\beta}$ are non-negative, which in turn implies ergodicity. Moreover, the matrix is primitive by construction, thus it satisfies the Perron-Frobenius theorem [47]. Hereafter $\mathcal{L}_{\mathcal{H}_\beta}$ will be referred to as the Ruelle-Perron-Frobenius (RPF) matrix. Let us denote by $\rho$ the largest eigenvalue of $\mathcal{L}_{\mathcal{H}_\beta}$, which, because the matrix satisfies the Perron-Frobenius theorem, is an eigenvalue of multiplicity one, strictly larger in modulus than the rest of the eigenvalues [47]. We denote by $U$ and $V$ the left and right eigenvectors of $\mathcal{L}_{\mathcal{H}_\beta}$ corresponding to the eigenvalue $\rho$. Notice that $U_i > 0$ and $V_i > 0$ for all $i \in \mathcal{A}_R^N$.

The RPF matrix is not the Markov matrix we are looking for; moreover, it is not a stochastic matrix, but it can be converted into one. We recall that for an irreducible matrix $M$ with spectral radius $\lambda$ and positive right eigenvector $v$ associated to $\lambda$, the stochasticization of $M$ is the following stochastic matrix:
$$\mathcal{S}(M) = \frac{1}{\lambda} D^{-1} M D, \quad (11)$$
where $D$ is the diagonal matrix with diagonal entries $D(i,i) = v_i$. So, in our context, the MEMC transition matrix is given as follows:
$$P = \mathcal{S}(\mathcal{L}_{\mathcal{H}_\beta}), \quad (12)$$
whose unique stationary probability measure $\pi$ is explicitly given by
$$\pi_i := \frac{U_i V_i}{\langle U, V \rangle}, \quad \forall i \in \mathcal{A}_R^N, \quad (13)$$
where $\langle U, V \rangle$ is the standard inner product (we refer the reader to [47] for details and proofs).

In the previous section we have shown how to obtain the transition matrix and the invariant measure of a Markov chain. However, we have not yet included the constraints (we have just used the features to build the energy function); in other words, we have not yet fit the parameters of the MEMC.
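The construction (11)-(13) is straightforward to implement numerically. The following is a minimal sketch under the assumption of a generic primitive non-negative transfer matrix; the toy matrix and the function names are ours, not the paper's.

```python
import numpy as np

def stochasticize(L):
    """Turn a primitive non-negative transfer matrix L into the MEMC:
    P = (1/rho) D^{-1} L D as in (11)-(12), and pi as in (13)."""
    ev, right = np.linalg.eig(L)
    k = np.argmax(ev.real)
    rho = ev[k].real                         # Perron eigenvalue
    V = np.abs(right[:, k].real)             # right Perron eigenvector
    evT, left = np.linalg.eig(L.T)
    U = np.abs(left[:, np.argmax(evT.real)].real)  # left Perron eigenvector
    P = L * V[None, :] / V[:, None] / rho    # Eqs. (11)-(12)
    pi = U * V / np.dot(U, V)                # Eq. (13)
    return P, pi, rho

def ks_entropy(P, pi):
    """Kolmogorov-Sinai entropy rate, Eq. (9)."""
    mask = P > 0
    logP = np.zeros_like(P)
    logP[mask] = np.log(P[mask])
    return float(-np.sum(pi[:, None] * P * logP))

# Toy 2-state transfer matrix e^{H_beta}
L = np.array([[np.e, 1.0],
              [1.0, np.e]])
P, pi, rho = stochasticize(L)
print(np.allclose(P.sum(axis=1), 1.0))  # P is stochastic
print(np.allclose(pi @ P, pi))          # pi is invariant
print(ks_entropy(P, pi))
```

One can verify by hand that $\pi P = \pi$ follows from $U$ being a left eigenvector of $\mathcal{L}_{\mathcal{H}_\beta}$, which is what the last check confirms numerically.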
In order to fit the maximum entropy parameters, let us introduce the following quantity:
$$\mathcal{P}[\mathcal{H}_\beta] = \sup_{q \in \mathcal{M}_{st}} \left\{ \mathcal{S}\{q\} + \mathbb{E}_q\{\mathcal{H}_\beta\} \right\}, \quad (14)$$
where $\mathcal{M}_{st}$ is the set of all stationary probability measures on $\mathcal{A}_R^N$ and $\mathbb{E}_q\{\mathcal{H}_\beta\} = \sum_{k=1}^K \beta_k \mathbb{E}_q\{f_k\}$ is the average value of $\mathcal{H}_\beta$ with respect to $q$. Solving the optimization problem (14), one gets the Markov measure we are looking for. Indeed, one knows from the thermodynamic formalism (see [3]) that for our energy function $\mathcal{H}_\beta$ of range $R \geq 1$, there exists a unique translation-invariant (stationary) Markov measure $p$ associated to $\mathcal{H}_\beta$ for which there is a constant $M > 1$ such that
$$M^{-1} \leq \frac{p\{x_{0,n}\}}{\exp\left(\sum_{k=1}^{n-R+1} \mathcal{H}(x_{k,k+R-1}) - (n - R + 1)\, \mathcal{P}[\mathcal{H}_\beta]\right)} \leq M, \quad (15)$$
and which attains the supremum in (14), that is, $\mathcal{P}[\mathcal{H}_\beta] = \mathcal{S}\{p\} + \mathbb{E}_p\{\mathcal{H}_\beta\}$. The quantity $\mathcal{P}[\mathcal{H}_\beta]$ is called the topological pressure, which plays the role of the free energy in statistical mechanics. The measure $p$, as defined by (15), is known as a Gibbs measure in the sense of Bowen. Note that one can show that MEMCs are particular cases of these measures, associated to finite-range energy functions. Moreover, (6) is a particular case of (15), when $M = 1$ and $\mathcal{H}_\beta$ is an energy function of range one.

The average values of the features, their correlations, as well as their higher cumulants, can be obtained by taking successive derivatives of the topological pressure with respect to their conjugate parameters $\beta$. This explains the important role played by the topological pressure in this framework. In general,
$$\frac{\partial^n \mathcal{P}[\mathcal{H}_\beta]}{\partial \beta_k^n} = \kappa_n \quad \forall k \in \{1, \ldots, K\}, \quad (16)$$
where $\kappa_n$ is the cumulant of order $n$ (see Appendix B).
In particular, taking the first derivative:

$$\frac{\partial \mathcal{P}[H_\beta]}{\partial \beta_k} = \mathbb{E}_p\{f_k\} = c_k, \qquad \forall k \in \{1, \ldots, K\}, \qquad (17)$$

where $\mathbb{E}_p\{f_k\}$ is the average of $f_k$ with respect to $p$ (the maximum entropy measure), which is equal (by assumption) to the average value $c_k$ of $f_k$ with respect to the empirical measure from the data, i.e. the constraint of the maximization problem. These equations suggest a relationship with the logarithm of the free energy, or log-partition function, of the Boltzmann–Gibbs distribution. Indeed, for range-one potentials (time-independent maximum entropy distributions) $\rho(\beta) = Z(\beta)$ and $\mathcal{P}[H_\beta] = \ln Z(\beta)$, which relates (8) with (17) (for a detailed example see Section 5.2). Estimating the MEMC parameters becomes computationally expensive for big matrices; however, efficient algorithms for the Markov maximum entropy problem exist in the literature [37].

Figure 2: Algorithmic view of the MEMC: inputs are the spike trains $\{x_i\}_{i=1}^{T}$ and the average values of a set of features. The output is the MEMC transition matrix $P$.

This subsection reviews two elementary tools for studying the convergence of random variables, while providing corresponding references. In the sequel, first the central limit theorem is introduced in Subsection 4.1.1, and then large deviation theory is discussed in Subsection 4.1.2.
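Before moving on, the moment-matching condition (17) can be solved numerically. The sketch below (our own, and not the algorithm of [37]) uses Newton's method on a range-one model, where the pressure is $P(\beta) = \ln \sum_x e^{\beta f(x)}$ and its second derivative is the (positive) variance of $f$:

```python
import numpy as np

def fit_beta(f_vals, c, iters=200):
    """Solve dP/dbeta = c for P(beta) = ln sum_x exp(beta * f(x)) by
    Newton's method; P''(beta) is the variance of f under the model."""
    beta = 0.0
    for _ in range(iters):
        w = np.exp(beta * f_vals)
        prob = w / w.sum()
        mean = prob @ f_vals
        var = prob @ f_vals**2 - mean**2
        beta -= (mean - c) / var          # Newton step on the moment condition
    return beta

# hypothetical range-one feature over four states; the model average is
# e^beta / (e^beta + 3), so the target c = 0.3 gives beta = ln(9/7)
f_vals = np.array([0.0, 0.0, 0.0, 1.0])
beta = fit_beta(f_vals, 0.3)
```

Since the pressure is strictly convex in $\beta$, the moment condition has a unique solution for any $c$ in the interior of the feature's range.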
Let us first assume that one has access to arbitrarily large data sequences. Consider $t \in \mathbb{N}$, let $x_{0,t-1}$ be the spike block of length $t$ (which is allowed to grow), and let $f : \Omega \to \mathbb{R}$ be an arbitrary feature (not necessarily belonging to the set of features chosen to fit the MEMC). In this section we establish asymptotic properties of $A_t(f)$ sampled with respect to the MEMC characterized by $p\{\cdot\}$.

Throughout this work, we will assume that $p\{\cdot\}$ is an ergodic Markov probability measure, meaning that every spike block in $\mathcal{A}^{NR}$ is attainable from every other block of the Markov chain within $R$ time steps, as discussed in Section 3. Thanks to the ergodicity assumption, the empirical averages are guaranteed to become statistically more accurate as the sampling size grows [28], i.e., $A_t(f) \to \mathbb{E}_p\{f\}$.

However, the above result does not clarify the rate at which the estimation accuracy improves. To answer this question, one can use the central limit theorem (CLT) for ergodic Markov chains (see [27]). This theorem states that there exists a constant $\sigma > 0$ such that, for large values of $t$, the random variable $\frac{\sqrt{t}}{\sigma}\big[A_t(f) - \mathbb{E}\{f\}\big]$ is distributed as a standard Gaussian random variable, with $\sigma$ being the square root of the auto-covariance of $f$ [27]. Technically, the central limit theorem says that

$$p\left\{ \frac{\sqrt{t}}{\sigma}\big[A_t(f) - \mathbb{E}\{f\}\big] \leq x \right\} \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-s^2/2}\, ds,$$

where the convergence is in distribution. This, in turn, implies that "typical" fluctuations of $A_t(f)$ around its mean value $\mathbb{E}\{f\}$ are of order $\sigma/\sqrt{t}$.

Although the central limit theorem for ergodic Markov chains accurately describes typical events (fluctuations around the mean value), it does not say anything about the likelihood of larger fluctuations. While it is clear that the probability of such large fluctuations goes to zero as the sample size increases, it is valuable to describe the corresponding rate of decrease.
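The $\sigma/\sqrt{t}$ scaling can be checked exactly on a toy chain. The sketch below (our own, hypothetical two-state chain) computes the exact variance of $A_t(f)$ from the lagged covariances, and confirms that $t \cdot \mathrm{Var}(A_t(f))$ approaches the CLT constant $\sigma^2$ (equal to $34/27$ for this particular chain) as $t$ grows:

```python
import numpy as np

def var_of_average(P, f, t):
    """Exact Var(A_t(f)) for a stationary ergodic chain with transition
    matrix P and feature f, via Var = (1/t^2) sum_r (t-|r|) C(r)."""
    w, V = np.linalg.eig(P.T)
    pi = np.abs(V[:, np.argmax(w.real)].real)
    pi = pi / pi.sum()
    fc = f - pi @ f                                   # centered feature
    cov = lambda r: float((pi * fc) @ np.linalg.matrix_power(P, r) @ fc)
    return sum((t - abs(r)) * cov(abs(r)) for r in range(-t + 1, t)) / t**2

# hypothetical sticky two-state chain; f indicates occupation of state 1
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
f = np.array([0.0, 1.0])
v100 = 100 * var_of_average(P, f, 100)
v1000 = 1000 * var_of_average(P, f, 1000)             # closer to sigma^2
```

For this chain $C(r) = \tfrac{2}{9}\,(0.7)^r$, so $\sigma^2 = C(0) + 2\sum_{r\ge 1} C(r) = 34/27$.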
In particular, one says that an empirical average $A_t(f)$ satisfies a large deviation principle (LDP) with rate function $I_f$, defined as

$$I_f(s) := -\lim_{t\to\infty} \frac{1}{t} \log p\{A_t(f) > s\}, \qquad (18)$$

if the above limit exists. Intuitively, the above condition implies that for large $t$, $p\{A_t(f) > s\} \approx e^{-t I_f(s)}$. In particular, if $s > \mathbb{E}_p\{f\}$, the law of large numbers (LLN) guarantees that $p\{A_t(f) > s\}$ tends to zero as $t$ grows; the rate function quantifies the speed at which this happens.

Calculating $I_f$ directly, i.e. by using definition (18), can be a formidable task. However, the Gärtner–Ellis theorem provides a smart shortcut for avoiding this problem [14]. To this end, let us introduce the scaled cumulant generating function (SCGF) associated to the random variable $f$:

$$\lambda_f(k) := \lim_{t\to\infty} \frac{1}{t} \ln \mathbb{E}_p\Big[e^{t k A_t(f)}\Big], \qquad k \in \mathbb{R}, \qquad (19)$$

when the limit exists (further details about cumulant generating functions are found in Appendix B). Note that, while $A_t(f)$ is an empirical average taken over a sample, the expectation in (19) is taken over the probability distribution given by the corresponding model $p\{\cdot\}$. If $\lambda_f$ is differentiable, then the Gärtner–Ellis theorem ensures that the average $A_t(f)$ satisfies an LDP with rate function given by the Legendre transform of $\lambda_f$, that is,

$$I_f(s) = \max_{k\in\mathbb{R}} \{ks - \lambda_f(k)\}. \qquad (20)$$

In summary, one can study the large deviations of the empirical averages $A_t(f)$ by first computing their SCGF from the selected model and then finding its Legendre transform.

One of the most useful applications of the LDP is to estimate the likelihood that $A_t(f)$ adopts a value far from its expected value. To illustrate this, let us assume that $I_f(s)$ is a positive, differentiable, convex function. Then, by the properties of convex functions, $I_f(s)$ has a unique global minimum.
Denoting this minimum by $s^*$, it follows from the differentiability of $I_f(s)$ that $I_f(s^*) = 0$. (The name SCGF comes from the fact that the $n$-th cumulant of $f$ can be obtained by $n$ successive differentiations of $\lambda_f(k)$ with respect to $k$, evaluating the result at $k = 0$. Also, a classical result in large deviation theory states that $I_f(s)$ is convex if $\lambda_f(k)$ is differentiable [13]; for a discussion of the differentiability of $\lambda_f(k)$, see [56].) Using properties of the Legendre transform, $s^* = \lambda_f'(0) = \lim_{t\to\infty} \mathbb{E}_p(f)$. This is the LLN, i.e., $A_t(f)$ concentrates around $s^*$. Consider a value $s \neq s^*$ and assume that $I_f(s)$ admits a Taylor series around $s^*$:

$$I_f(s) = I_f(s^*) + I_f'(s^*)(s - s^*) + \frac{I_f''(s^*)}{2}(s - s^*)^2 + O\big((s - s^*)^3\big).$$

Since $s^*$ corresponds to a zero and a minimum of $I_f(s)$, the first two terms of this series vanish, and as $I_f(s)$ is a convex function, $I_f''(s^*) > 0$. For large values of $t$, we obtain from (18)

$$p\{A_t(f) > s\} \approx e^{-t I_f(s)} \approx e^{-t\, \frac{I_f''(s^*)(s - s^*)^2}{2}}, \qquad (21)$$

so the small deviations of $A_t(f)$ around $s^*$ are Gaussian-distributed, as for i.i.d. sums, with $1/I_f''(s^*) = \lambda_f''(0) = \sigma^2$. In this sense, large deviation theory can be seen as an extension of the CLT, because it gives information not only about the small deviations around $s^*$ but also about the large (non-Gaussian) deviations of $A_t(f)$.

In this section, we focus on the statistical properties of features sampled from the inferred MEMC. For example, one may be interested in measuring the probability of obtaining "rare" average values of features like firing rates, pairwise correlations, triplets, or spatiotemporal events.
This characterization is relevant because these features are likely to play an important role in neuronal information processing, and rare values may hinder the whole enterprise of conveying information. We show in this section how to obtain the large deviation rate functions of arbitrary features through the Gärtner–Ellis theorem via the SCGF. In particular, we show that the SCGF can be obtained directly from the inferred Markov transition matrix $P$.

Consider a MEMC taking values on the state space $\mathcal{A}^{NR}$ with transition matrix $P$. Let $f$ be a feature of range $R$ and $k \in \mathbb{R}$. We introduce $\widetilde{P}^{(f)}(k)$, the transition matrix of $P$ tilted by $f$ and parametrized by $k$, whose elements are given by:

$$\widetilde{P}^{(f)}_{ij}(k) = P_{ij}\, e^{k f(j)}, \qquad i, j \in \mathcal{A}^{NR}. \qquad (22)$$

For transition matrices $P$ inferred from the MEP, the tilted transition matrix can be built directly from the spectral properties of the transfer matrix (10), as follows:

$$\widetilde{P}^{(f)}_{ij}(k) = \frac{e^{H_\beta(i,j)}\, V_j}{V_i\, \rho}\, e^{k f(j)} = \frac{e^{H_\beta(i,j) + k f(j)}\, V_j}{V_i\, \rho}, \qquad i, j \in \mathcal{A}^{NR}, \qquad (23)$$

where $V$ is the right eigenvector of the transfer matrix $L$. Here we also use the shorthand notation $H_\beta(i,j)$ to indicate that the energy function takes contributions from the blocks $i$ and $j$. Remarkably, the feature $f$ does not need to belong to the set of features chosen to fit the MEMC.

Now, we can take advantage of the structure of the given process to obtain more explicit expressions for the SCGF $\lambda_f(k)$. For instance, if one considers i.i.d. random variables $X$, then from the very definition one obtains

$$\lambda(k) = \lim_{t\to\infty} \frac{1}{t} \ln \mathbb{E}\big[e^{kX}\big]^t = \ln \mathbb{E}\big[e^{kX}\big],$$

which is the case of range-one features. So, using equation (22), the maximum eigenvalue of the tilted matrix, denoted $\rho\big(\widetilde{P}^{(f)}(k)\big)$, is

$$\rho\big(\widetilde{P}^{(f)}(k)\big) = \sum_{j \in \mathcal{A}^{N}} \pi_j\, e^{k f(j)}.$$
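The tilting (22) is straightforward to implement. In the sketch below (our own, with a hypothetical two-state chain) the SCGF is read off as the logarithm of the leading eigenvalue of the tilted matrix, and the rate function follows by a numerical Legendre transform on a grid of $k$:

```python
import numpy as np

def scgf(P, f, k):
    """lambda_f(k) = ln of the leading eigenvalue of the tilted matrix
    P~_ij(k) = P_ij * exp(k f(j)), cf. eq. (22)."""
    Pt = P * np.exp(k * f)[None, :]          # tilt column j by exp(k f(j))
    return float(np.log(np.max(np.linalg.eigvals(Pt).real)))

def rate_function(P, f, s, ks=np.linspace(-15.0, 15.0, 3001)):
    """Rate function I_f(s) = max_k [k s - lambda_f(k)], cf. eq. (20)."""
    lams = np.array([scgf(P, f, k) for k in ks])
    return float(np.max(ks * s - lams))

# hypothetical two-state chain; f indicates occupation of state 1,
# whose stationary probability is pi_1 = 1/3
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
f = np.array([0.0, 1.0])
```

At $k = 0$ the tilted matrix is $P$ itself, so $\lambda_f(0) = \ln \rho(P) = 0$; and the rate function vanishes at the stationary mean of $f$ while being positive away from it.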
Since $\widetilde{P}^{(f)}$ is a positive matrix, the Perron–Frobenius theorem ensures the uniqueness of $\rho$. Next, for the case of additive features one deals with positive Markov chains, and under the assumption of ergodicity a straightforward calculation (see for instance [15]) leads to

$$\lambda_f(k) = \ln\Big(\rho\big(\widetilde{P}^{(f)}(k)\big)\Big). \qquad (24)$$

It can also be proved that $\lambda_f(k)$ is differentiable in this case [15], setting up the scene to apply the Gärtner–Ellis theorem, which bypasses the direct calculation of $p\{A_T(f) > s\}$ in (18): having $\lambda_f(k)$, its Legendre transform yields the rate function of $f$, as shown in Figure 3.

A stochastic process is said to be in equilibrium if one cannot notice the effect of time on it. It is worth noticing that non-equilibrium stochastic processes are natural candidates to model spike train statistics, as time plays a non-symmetrical role [9]. One of the consequences of including features of range
$R > 1$ as constraints in the maximum entropy problem is that it opens the possibility of breaking the time-reversal symmetry present in time-independent models. This captures the irreversible character of the underlying biological process and thus allows fitting statistical models that are more realistic from the biological point of view.

To characterize this mathematically, we study how the distribution $p\{\cdot\}$ changes when the time order is reversed. For this aim, let us consider a spike block $x_{0,T-1} = x_0, x_1, \ldots, x_{T-1}$ containing $T$ spike patterns, and define the time-reversed spike block $x^{(R)}_{0,T-1}$ obtained by re-ordering the time indices in reverse order, i.e., $x^{(R)}_{0,T-1} = x_{T-1}, x_{T-2}, \ldots, x_1, x_0$.

A spiking network modeled by $p\{\cdot\}$ is said to be in equilibrium if $p\{x_{0,T}\} = p\{x^{(R)}_{0,T}\}$ for all $x$ [31]. For a homogeneous discrete-time ergodic Markov chain characterized by the Markov measure $p(\pi, P)$ taking values in $\mathcal{A}^{NR}$, being in equilibrium is equivalent to satisfying the detailed balance conditions, given by the following set of equalities:

$$\pi_i P_{ij} = \pi_j P_{ji}, \qquad \forall i, j \in \mathcal{A}^{NR}.$$

Figure 3: Algorithmic view of the method: inputs are the maximum entropy Markov transition matrix and a feature. From the inputs the tilted transition matrix can be built. The rate function of the feature is obtained as the Legendre transform of the log maximum eigenvalue of the tilted transition matrix. Using this framework we can estimate the large deviations of the average values of the features.

Conversely, when these conditions are not satisfied, the statistical model of the spiking activity is said to be a non-equilibrium system. Since non-equilibrium is expected to occur generically in neuronal network models, one would like to quantify how far from equilibrium the inferred MEMC is.
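Detailed balance is easy to check numerically, and the same flux matrix $\pi_i P_{ij}$ also gives the entropy production of eq. (25) below. A sketch (our own, with toy matrices; pairs with a zero flux in either direction, where (25) would diverge, are skipped):

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an ergodic stochastic matrix P."""
    w, V = np.linalg.eig(P.T)
    pi = np.abs(V[:, np.argmax(w.real)].real)
    return pi / pi.sum()

def in_equilibrium(P, tol=1e-10):
    """Detailed balance check: pi_i P_ij == pi_j P_ji for all i, j."""
    pi = stationary(P)
    flux = pi[:, None] * P                    # flux_ij = pi_i P_ij
    return bool(np.max(np.abs(flux - flux.T)) < tol)

def entropy_production(P):
    """IEP = (1/2) sum_ij (pi_i P_ij - pi_j P_ji) ln(pi_i P_ij / pi_j P_ji),
    cf. eq. (25); zero-flux pairs are skipped."""
    pi = stationary(P)
    flux = pi[:, None] * P
    iep, n = 0.0, len(pi)
    for i in range(n):
        for j in range(n):
            if flux[i, j] > 0 and flux[j, i] > 0:
                iep += 0.5 * (flux[i, j] - flux[j, i]) * np.log(flux[i, j] / flux[j, i])
    return iep

P_rev = np.array([[0.7, 0.3],
                  [0.4, 0.6]])               # any 2-state chain is reversible
P_cyc = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.8, 0.1, 0.1]])          # biased cycle: breaks detailed balance
```

For the biased cycle the stationary measure is uniform, and the IEP evaluates to $0.7 \ln 8$, while the reversible chain gives exactly zero.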
For this purpose one can define the information entropy production (IEP) for $p$, given by

$$\mathrm{IEP}(p) := \lim_{t\to\infty} \frac{1}{t} \ln\left[\frac{p\{x_{0,t-1}\}}{p\{x^{(R)}_{0,t-1}\}}\right],$$

when the limit exists. For the maximum entropy Markov measure $p(\pi, P)$, the IEP is explicitly given by:

$$\mathrm{IEP}(\pi, P) = \frac{1}{2} \sum_{i,j \in \mathcal{A}^{NR}} \big[\pi_i P_{ij} - \pi_j P_{ji}\big] \log \frac{\pi_i P_{ij}}{\pi_j P_{ji}} \qquad (25)$$

(see [19] for the calculation). We remark that it is still possible to obtain the information entropy production rate in the non-stationary case. Clearly, for features of range one, $\mathrm{IEP} = 0$ always, meaning that the process is time-reversible; therefore the probabilities of every path and its corresponding time-reversed path are equal. For features of range
$R > 1$, $\mathrm{IEP} > 0$ generically (we refer the interested reader to [9] for details and examples).

However, since in practice one only has access to a limited amount of data, a natural question is to ask for the entropy production of the system up to a finite amount of time. It turns out that this characterization can be obtained through an LDP. With this in mind, one may define the following feature:

$$W_T(x_{0,T-1}) = \frac{1}{T} \ln\left[\frac{p(x_{0,T-1})}{p(x^{(R)}_{0,T-1})}\right].$$

Since we have assumed that samples are produced by a stationary ergodic Markov chain characterized by $p(\pi, P)$, the ergodic theorem assures that, for $p$-almost every sample, the quantity $W_t$ converges as $t$ goes to infinity, and its limit is by definition the IEP:

$$\lim_{t\to\infty} W_t(x_{0,t-1}) = \mathrm{IEP}(\pi, P).$$

Once we have the convergence of $W_t$, we may ask for its large deviation properties. Following the same idea as above, and following [26], we introduce the matrix

$$F_{ij}(k) = P_{ij} \exp\left(k \ln\left[\frac{\pi_i P_{ij}}{\pi_j P_{ji}}\right]\right), \qquad i, j \in \mathcal{A}^{NR}.$$

This matrix helps us to build the SCGF associated to $W_t$, through the logarithm of its maximum eigenvalue $\rho_F(k)$. Using the Gärtner–Ellis theorem, one gets the rate function $I_W(s)$ of the IEP.

It is clear that there exists a relationship between the accuracy of the estimation and the sample size: the larger the sample size, the more information is available and the more the uncertainty diminishes. In the context of maximum entropy models, this idea has been well conceptualized using tools from information geometry [1, 2]. The main idea of this approach is that the maximum entropy models form a manifold of probability measures whose coordinates are the parameters $\beta$. Consider a spike train dataset $x_{0,T-1}$ consisting of $T$ spike patterns obtained from a spiking neuronal network. Given a set of features $\{f_k\}_{k=1}^{K}$ and their empirical averages, one may infer the parameters $\beta = (\beta_1, \ldots, \beta_K)$ characterizing the MEMC $p(\pi, P)$.
We may use the inferred MEMC to generate a sample $x'_{0,T-1}$ of the same size as the original dataset. Considering the same set of features, one may apply the MEP again to infer a new set of parameters $\beta'$ from $x'_{0,T-1}$, which is, in general, expected to be different from $\beta$. These maximum entropy models will belong to a certain volume in the manifold, which decreases as the sample size increases [2]. On the other hand, increasing the sample size of $x'_{0,T-1}$, one expects that the Markov chain $p'(\pi', P')$ specified by $\beta'$ gets "closer" to the one characterized by $\beta$. The idea of relating a distance in the parameter space with a distance in the space of probability measures can be rigorously formulated using large deviations techniques.

Let us start by defining the relative entropy between these two MEMCs (Gibbs measures in the sense of Bowen (15)), which provides a notion of "distance". To do so in the context of MEMCs, consider a Gibbs measure $p$ associated to the energy function $H_\beta$, and let $q$ be another Gibbs measure. The Ruelle–Föllmer theorem gives an expression for the relative entropy density between two Gibbs measures in terms of the pressure, the entropy rate, and the expected value of the energy function with respect to $q$ (see [21]):

$$d(q \mid p) = \mathcal{P}[H_\beta] - \mathcal{S}(q) - \mathbb{E}_q(H_\beta). \qquad (26)$$

Observe that if $d(q \mid p) = 0$, we recover the variational characterization of Gibbs measures (14).

Consider the potential $H_\beta = \sum_{k=1}^{K} \beta_k f_k$ associated with a MEMC $p(\pi, P)$. Given a set of empirical averages $A_t(f_k)$ generated by a sample of $p(\pi, P)$, we obtain new maximum entropy parameters $\beta'$.
The probability that the maximum entropy parameters $\beta'$, associated with an ergodic Markov chain $p'(\pi', P')$, get close to $\beta$ follows the large deviation principle [13]:

$$\lim_{\delta\to 0} \lim_{t\to\infty} -\frac{1}{t} \ln P\big(|\beta - \beta'| \in \Delta_\delta\big) = d(p \mid p'), \qquad (27)$$

where $\Delta_\delta = [-\delta, \delta]^K$ (recall that the relative entropy is not a metric: it is neither symmetric, nor does it satisfy the triangle inequality). Writing $\delta_\beta = \beta - \beta'$ and choosing $\Delta_\delta$ close to $0$, we informally rewrite the above in the form:

$$-\frac{1}{t} \ln P\big(|\delta_\beta| \in \Delta_\delta\big) \xrightarrow[t\to\infty]{} d(p \mid p'). \qquad (28)$$

Thus, for large $t$ we get:

$$P\big(|\delta_\beta| \in \Delta_\delta\big) \approx e^{-t\, d(p \mid p')},$$

which implies that close-by parameters are associated to close-by probability measures [2]. Consider now two MEMCs $p(\pi, P)$ and $p'(\pi', P')$ specified by $H_\beta$ and $H_{\beta'}$, respectively, with the same family of features. We say that the MEMCs are $\epsilon$-indistinguishable if:

$$-\ln P\big(|\delta_\beta| \in \Delta_\delta\big) \leq \epsilon. \qquad (29)$$

As both MEMCs satisfy the variational principle (14), the relative entropy between $p$ and $p'$ (26) reads:

$$d(p \mid p') = \mathcal{P}[H_{\beta'}] - \mathcal{P}[H_\beta] + \mathbb{E}_p(H_\beta) - \mathbb{E}_p(H_{\beta'}). \qquad (30)$$

Taking the Taylor expansion of $d(p \mid p')$ around $\beta' = \beta$, we get:

$$d(p \mid p') \approx d(p \mid p) + \sum_k \frac{\partial d(p \mid p')}{\partial \beta'_k}\Big|_{\beta'=\beta} (\beta_k - \beta'_k) + \frac{1}{2} \sum_{k,j} \frac{\partial^2 d(p \mid p')}{\partial \beta'_k \partial \beta'_j}\Big|_{\beta'=\beta} (\beta_k - \beta'_k)(\beta_j - \beta'_j).$$

Since $d(p \mid p')$ is minimized at $\beta' = \beta$, we obtain

$$d(p \mid p') \approx \frac{1}{2} \sum_{k,j} \frac{\partial^2 d(p \mid p')}{\partial \beta'_k \partial \beta'_j}\Big|_{\beta'=\beta} (\beta_k - \beta'_k)(\beta_j - \beta'_j).$$
Taking the second derivative of $d(p \mid p')$ from (30), one also has

$$\frac{\partial^2 d(p \mid p')}{\partial \beta'_k \partial \beta'_j} = \frac{\partial^2 \mathcal{P}[H_{\beta'}]}{\partial \beta'_k \partial \beta'_j} = L_{kj}. \qquad (31)$$

The second partial derivatives of the topological pressure with respect to the parameters $\beta'_k$ and $\beta'_j$ can be conveniently arranged in a matrix $L$ with components $L_{kj}$. Given two MEMCs specified by $H_\beta$ and $H_{\beta'}$, in the limit of large $t$ they are $\epsilon$-indistinguishable if:

$$\frac{1}{2}\, (\delta_\beta)^{\mathsf{T}} L\, (\delta_\beta) \leq \frac{\epsilon}{t}, \qquad (32)$$

where $\mathsf{T}$ denotes transpose. The matrix $L$ can be obtained from data without the need to fit the parameters. Equation (32) characterizes a region of indistinguishable models in the space of MEMCs, whose volume can be calculated in the large-$t$ limit using spectral properties of the matrix $L$ [2]. This result generalizes a previous result for maximum entropy distributions with range-one energy functions in [35].

In this section we illustrate the presented methods in some simple scenarios. In these examples we follow a set of steps:

(A) Choose the observables and build the energy function (7).
(B) Build the transfer matrix (10).
(C) Compute the free energy and find the maximum entropy parameters using (17).
(D) Build the Markov transition matrix using (12).
(E) Choose the observable to examine and build the tilted transition matrix using (22).
(F) Compute the Legendre transform of the log maximum eigenvalue of the tilted transition matrix to obtain the rate function (24).

For the sake of clarity, in this section we focus on small neuronal networks. It is clear, however, that the extension of these techniques to larger neural populations is straightforward.

5.1 First example: Maximum entropy model of a range-2 feature
Consider spiking data from two interacting neurons. We measure only the average value of a range-2 feature from the spiking data to fit a MEMC. The feature, denoted by $\tilde f$, is given by $\tilde f(x_{0,1}) = x_0^2 \cdot x_1^1$, which detects when a spike of the second neuron is followed by a spike of the first one. The system can be described with the help of the energy function $H(x_{0,1}) = \beta\, \tilde f(x_{0,1})$. For a given dataset of $T$ spike blocks of range 2, the empirical average reads

$$A_T(\tilde f) = c, \qquad (33)$$

meaning that in the data this event appears a fraction $c$ of the time.

The transfer matrix $L_H$ (cf. (10)) associated with this energy function is indexed by the 16 states of the system, which in this case is the set of spike blocks of two neurons and length two. As $L_H$ is primitive by construction (cf. (10)), it satisfies the hypotheses of the Perron–Frobenius theorem. In fact, its unique maximum eigenvalue is $\rho(\beta) = e^\beta + 3$. Given the restriction (33) and using (17), we obtain the following relationship between the parameter $\beta$ and the value of the restriction $c$:

$$\frac{\partial \mathcal{P}[H]}{\partial \beta} = \frac{\partial \log(e^\beta + 3)}{\partial \beta} = \frac{e^\beta}{e^\beta + 3} = c.$$

This equation can be solved numerically. Using the obtained value of $\beta$ in equation (12), one can find the corresponding Markov transition matrix. Note that, among all the Markov chains that match the restriction exactly, the selected one maximizes the KSE. Moreover, it is direct to check that the variational principle (14) is satisfied. Examples of values of $\beta$ for different values of $c$, together with the IEP (25) for each value of $\beta$, are given in Table 1 (columns: $c$, $\beta$, IEP). Using equation (22) we obtain the tilted transition matrix $\widetilde{P}^{(f)}_{ij}(k)$ for each of the values in Table 1. In Figure 4, for each value of $\beta$ we compute the SCGF $\lambda_f(k)$ (24) and the associated Legendre transform (rate function) $I_f(s)$.
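For this example the constraint equation can in fact be inverted in closed form; a quick sketch (our own) confirms that $\beta = \ln\big(3c/(1-c)\big)$ satisfies $e^\beta/(e^\beta+3) = c$:

```python
import numpy as np

def beta_from_constraint(c):
    """Invert e^beta / (e^beta + 3) = c for 0 < c < 1, i.e. solve the
    moment condition for the pressure P[H] = ln(e^beta + 3)."""
    return float(np.log(3.0 * c / (1.0 - c)))

c = 0.4
beta = beta_from_constraint(c)
```

Note that $c = 1/4$ gives $\beta = 0$: if the event occurs exactly a quarter of the time, the feature carries no information beyond the uniform model.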
In Figure 5, we compute the rate function for each value of the IEP in the table, illustrating for this example the symmetry relationship (43).

Figure 4: A) SCGF (24) for the feature $f$ of the first example, computed at the parameter values given in Table 1. B) Rate function for the same feature, computed at the same parameter values as the SCGF; each function is obtained by taking the Legendre transform of the respective SCGF in A).

Figure 5: Gallavotti–Cohen symmetry relationship for the IEP, for the values in Table 1. Left: SCGF $\lambda_W(k)$. Right: rate function of the IEP feature $W$, $I_W(s)$.

5.2 Second example: Ising model for three neurons

Let us now consider a network of three neurons, focusing on range-one features. In this example we consider features related to the firing rates and synchronous pairwise correlations (Ising model [44, 55]). Specifically, we consider the following energy function:

$$H(x) = \beta_1 x^1 + \beta_2 x^2 + \beta_3 x^3 + \beta_4\, x^1 x^2 + \beta_5\, x^1 x^3 + \beta_6\, x^2 x^3,$$

with the six parameters $\beta_1, \ldots, \beta_6$. Following (10), the transfer matrix $L_H$, indexed by the states of $\mathcal{A}^3$, has identical rows, each containing the values $e^{H(x)}$ for the eight spike patterns $x$, ranging from $e^0 = 1$ (no spikes) to $e^{\beta_1+\beta_2+\beta_3+\beta_4+\beta_5+\beta_6}$ (all three neurons spiking). Its maximum eigenvalue is

$$\rho(\beta) = 1 + e^{\beta_1} + e^{\beta_2} + e^{\beta_3} + e^{\beta_1+\beta_2+\beta_4} + e^{\beta_1+\beta_3+\beta_5} + e^{\beta_2+\beta_3+\beta_6} + e^{\beta_1+\beta_2+\beta_3+\beta_4+\beta_5+\beta_6}.$$

The right eigenvector associated to this eigenvalue has all components equal to 1. We obtain the topological pressure $\mathcal{P}[H] = \log \rho(\beta)$. To find the MEMC parameters we solve this set of equations:

$$\frac{\partial \mathcal{P}[H]}{\partial \beta_k} = A_T(f_k) = c_k. \qquad (34)$$

From equation (34), given constraints on the average values of the features, we can solve the maximum entropy problem.
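The pressure of this Ising model is a plain eight-term sum, so relation (34) can be checked directly by finite differences. A sketch (our own; the parameter values are illustrative, not those of Table 2):

```python
import numpy as np
from itertools import product

def pressure(beta):
    """Topological pressure P[H] = ln rho(beta) for the 3-neuron Ising model,
    where rho(beta) is the sum of exp(H(x)) over the 8 spike patterns x."""
    b1, b2, b3, b12, b13, b23 = beta
    rho = 0.0
    for x1, x2, x3 in product([0, 1], repeat=3):
        rho += np.exp(b1*x1 + b2*x2 + b3*x3 + b12*x1*x2 + b13*x1*x3 + b23*x2*x3)
    return np.log(rho)

beta0 = np.zeros(6)                # all parameters zero: independent fair neurons
# dP/dbeta_1 at beta0 is the firing rate of neuron 1, here 1/2
h = 1e-6
e1 = np.eye(6)[0]
rate1 = (pressure(beta0 + h * e1) - pressure(beta0 - h * e1)) / (2 * h)
```

With all parameters at zero, $\rho = 8$ and every firing rate is $1/2$, which the finite-difference derivative reproduces.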
Take for example the constraints in Table 2, whose columns are the empirical averages $A_T(f_k)$, the constraint values $c_k$, the fitted parameters $\beta_k$, the perturbations $\delta\beta_k$, and the perturbed averages $\tilde c_k$, for the features $x^1$, $x^2$, $x^3$, $x^1 x^2$, $x^1 x^3$, $x^2 x^3$. In Figure 6 we illustrate the rate functions for each feature in the model.

Figure 6: A) Rate functions for the firing rates of each neuron of the Ising model. The minimum of each rate function coincides with the expected value of the corresponding firing rate in Table 2. B) Rate functions for the pairwise interactions, computed from the parameters in Table 2.
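Steps (A)–(F) above can be strung together end-to-end. The sketch below (our own, for a range-one energy over two states, using the column convention $L_{ij} = e^{H(j)}$ as one of the possible range-one conventions) goes from the transfer matrix to the rate function:

```python
import numpy as np

def rate_function_pipeline(H, f, s, ks=np.linspace(-15.0, 15.0, 6001)):
    """Steps (B)-(F): transfer matrix -> stochasticization (12) ->
    tilted matrix (22) -> SCGF (24) -> Legendre transform (20)."""
    L = np.tile(np.exp(H)[None, :], (len(H), 1))   # (B) range-one transfer matrix
    w, V = np.linalg.eig(L)
    idx = np.argmax(w.real)
    rho, v = w[idx].real, np.abs(V[:, idx].real)
    P = L * v[None, :] / v[:, None] / rho          # (D) stochasticization
    lams = np.array([np.log(np.max(np.linalg.eigvals(P * np.exp(k * f)[None, :]).real))
                     for k in ks])                  # (E)-(F) tilt and take SCGF
    return float(np.max(ks * s - lams))             # Legendre transform

# two states with H = (0, beta); this chain is i.i.d. with p = e^b / (1 + e^b),
# so for beta = 0 the feature f (indicator of state 1) has mean 1/2
H = np.array([0.0, 0.0])
f = np.array([0.0, 1.0])
I_mean = rate_function_pipeline(H, f, 0.5)          # vanishes at the mean
I_tail = rate_function_pipeline(H, f, 0.8)          # positive away from it
```

For this i.i.d. case the rate function reduces to the Kullback–Leibler divergence between Bernoulli($s$) and Bernoulli($1/2$), which is what the pipeline recovers numerically.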
Conclusion
In the past few years, new experimental techniques combined with clever ideas from statistical mechanics have made it possible to infer maximum entropy models of spike trains directly from experimental recordings. However, a very important issue, namely quantifying the accuracy of the estimation obtained from a finite empirical sample, is usually ignored in this field. This is probably because the maximum entropy approach has a dual nature: on one side it is a convex optimization problem, which provides a unique solution independent of the sampling size; on the other hand, it is a Bayesian inference procedure, from which it is more natural to ask this question. As we have discussed in the introduction, this characterization is relevant in the field of computational neuroscience since, in practice, experimental recordings are performed during a finite amount of time, which causes fluctuations in the estimated quantities.

A fundamental goal of spike train analysis over networks of sensory neurons is to build accurate statistical models that predict the response of the network to a stimulus of interest. In particular, the aim of statistical inference of spiking neurons using the MEP is that the fitted parameters shed light on some aspects of the neuronal code; it is therefore extremely important to quantify the accuracy of the statistical procedure. Additionally, one may be interested in measuring properties of the inferred statistical model characterizing the spiking neuronal network, for example the convergence rate of a sample, or the probability of rare events of features like firing rates, pairwise correlations, triplets, or spatiotemporal events, mainly because these features are likely to play an important role in neuronal information processing. It is possible that rare and unlikely events are generated by internal states of the neuronal tissue and not driven by the external stimulus.
Events that are unlikely to occur deserve a better understanding, as they may carry important information about the network's internal structure and may play a role in organizing a coherent dynamic to convey sensory information to the cerebral cortex.

The present contribution addressed this issue using tools from large deviations theory in the context of the MEMC. In particular, we showed that the transfer matrix technique used to build the MEMC is well adapted to computing large deviation rate functions using the Gärtner–Ellis theorem. We also provided tools to investigate how sharply determined the parameters of a MEMC are with respect to the amount of empirical data, using the concept of $\epsilon$-indistinguishability. Additionally, we presented a non-trivial relation between the distance in the parameter space and the distance in the manifold of maximum entropy probability measures using an LDP.

We have illustrated our method using simple examples. However, these examples might give the false impression that large deviation rate functions can always be calculated explicitly. In fact, exact and explicit expressions can be found only in small, simple cases; fortunately, numerical methods to evaluate rate functions exist [56]. Here we have focused our attention on large deviations properties of maximum entropy models arising from spike train statistics; however, these results can be used in other fields of application of maximum entropy models.
Acknowledgements
We thank Ruben Herzog and Adrian Palacios for providing us with the retinal data and for helping with Figure 6. RC was supported by ERC advanced grant "Bridges", CONICYT-PAI Inserción 79160120 and Proyecto REDES ETAPA INICIAL, Convocatoria 2017 REDI170457. CM was supported, at the early stage of this project, by the CONICYT-FONDECYT Postdoctoral Grant No. 3140572. FR acknowledges the support of the European Union's H2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 702981.

The following abbreviations are used in this manuscript:

MEP Maximum entropy principle
MEMC Maximum entropy Markov chain
SCGF Scaled cumulant generating function
CLT Central limit theorem
LLN Law of large numbers
LDP Large deviation principle
IEP Information entropy production
KSE Kolmogorov-Sinai entropy
NESS Non-equilibrium steady states
Symbol list

$x_n^k$ — Spiking state of neuron $k$ at time $n$.
$x_n$ — Spike pattern at time $n$.
$x_{t_1,t_2}$ — Spike block from time $t_1$ to $t_2$.
$A_T(f)$ — Empirical average value of the feature $f$ considering $T$ spike patterns.
$\mathcal{A}^{NR}$ — Set of spike blocks of $N$ neurons and length $R$.
$\mathcal{S}[p]$ — Entropy of the probability measure $p$.
$H$ — Energy function.
$\mathcal{P}[H]$ — Free energy or topological pressure.
$\lambda_f(k)$ — Scaled cumulant generating function of $f$.
$I_f(s)$ — Rate function of $f$.

A Discrete-time Markov chains and spike train statistics
Consider the random process $\{X_n : n \geq 0\}$ taking values on $\mathcal{A}^{NR}$. One can assume that the spiking activity of the neuronal network is modeled by some discrete-time Markov process whose transition probabilities are obtained by means of the maximum entropy method described in Section 3. In this setting, $\mathcal{A}^{NR}$ is the state space of the Markov chain, and thus, if $X_n = x_{n,n+R-1}$, we say that the process is in the state $x_{n,n+R-1}$ at time $n$. The transition probabilities are given as follows:

$$P\big[X_n = x^{(n)} \mid X_{n-1} = x^{(n-1)}, \ldots, X_0 = x^{(0)}\big] = P\big[X_n = x^{(n)} \mid X_{n-1} = x^{(n-1)}\big], \qquad (35)$$

where we use the shorthand notation $x^{(n)} := x_{n,n+R-1}$. We emphasize that in this paper the states are spike blocks of finite length $R$. Throughout this paper we have only considered homogeneous Markov chains, that is, (35) is independent of $n$.

Since transitions are considered between blocks of the form $x_{n-R,n-1} \to x_{n-R+1,n}$, the block $x_{n-R+1,n-1}$ must be common for the transition to be possible. Consider two spike blocks $i, j \in \mathcal{A}^{NR}$ of range $R \geq 1$. We say that the transition from state $i$ to state $j$ is allowed if $i$ and $j$ share the common sub-block, i.e. the last $R-1$ spike patterns of $i$ coincide with the first $R-1$ spike patterns of $j$.

Now, we define the transition matrix $P_R : \mathcal{A}^{NR} \times \mathcal{A}^{NR} \to \mathbb{R}$, whose entries are given by the transition probabilities, as follows:

$$(P_R)_{ij} := \begin{cases} P[j \mid i] > 0, & \text{if } i \to j \text{ is allowed}, \\ 0, & \text{otherwise}. \end{cases} \qquad (36)$$

Note that $P$ has $2^{NR} \times 2^{NR}$ entries, but it is a sparse matrix, since each row has at most $2^N$ non-zero entries. A stochastic matrix $P$ is defined from the transition probabilities (36) satisfying:

$$P[j \mid i] \geq 0, \qquad \sum_{j \in \mathcal{A}^{NR}} P[j \mid i] = 1,$$

for all states $i, j \in \mathcal{A}^{NR}$. Moreover, by construction, for any pair of states there exists a path of maximum length $R$ in the graph of transition probabilities going from one to the other, which means that the Markov chain is primitive.

B Cumulant generating function
In general, in order to obtain the scaled cumulant generating function (as considered in Section 4.1.2), one deals with the moments of order $r$ of a real-valued random variable $f$, given by $m_r = \mathbb{E}(f^r)$, for $r \in \mathbb{N}$. Provided that it has a Taylor expansion about the origin, the moment generating function (or Fourier–Laplace transform) is

$$M(k) = \mathbb{E}(e^{kf}) = \mathbb{E}\big(1 + kf + \cdots + k^r f^r / r! + \cdots\big) = \sum_{r=0}^{\infty} m_r\, k^r / r!\,.$$

The cumulants $\kappa_r$ are the coefficients in the Taylor expansion of the cumulant generating function, defined as the logarithm of the moment generating function, that is,

$$\log M(k) = \sum_{r} \kappa_r\, k^r / r!\,.$$

The relationship between the first moments and cumulants can be obtained by extracting coefficients from the expansion, as follows:

$$\kappa_1 = m_1, \qquad \kappa_2 = m_2 - m_1^2, \qquad \kappa_3 = m_3 - 3 m_2 m_1 + 2 m_1^3,$$
$$\kappa_4 = m_4 - 4 m_3 m_1 - 3 m_2^2 + 12 m_2 m_1^2 - 6 m_1^4,$$

and so on. In particular, $\kappa_1$ is the mean of $f$, $\kappa_2$ is the variance, and $\kappa_3$ and $\kappa_4$ are related to the skewness and the kurtosis, respectively.

C Linear response
Within the framework we have built, we can quantify how a small perturbation of the maximum entropy parameters (associated to given features) affects the average values of other features of the MEMC. It is important to quantify this perturbation because the maximum entropy parameters are obtained with finite accuracy, due to finite-sample effects. Fixing $\beta$, we can obtain the average value of a given feature $f_k$ with respect to the MEMC without the need to sample, using the Gibbs–Jaynes principle for the KSE [21], which asserts that, for a translation-invariant probability measure $p$, the entropy rate $\mathcal{S}_{KS}(p)$ is maximal under the constraints $\mathbb{E}_p\{f_k\} = c_k$, for all $k \in \{1, \ldots, K\}$, if and only if $p$ is a Gibbs measure associated to the energy $H_\beta = \sum_k \beta_k f_k$, where $\mathbb{E}_p\{f_k\} = \frac{\partial \mathcal{P}[H_\beta]}{\partial \beta_k} = c_k$.

Now, let us consider a perturbed version of the energy, denoted by $H_{\beta + \delta\beta}$. Using a Taylor expansion, we compute the average value of an arbitrary feature, here denoted by $f_k$, with respect to the MEMC associated to the perturbed energy, in terms of the unperturbed one:

$$\mathbb{E}_{p_{\beta+\delta\beta}}\{f_k\} = \frac{\partial \mathcal{P}[H_{\beta+\delta\beta}]}{\partial \beta_k} \qquad (37)$$
$$= \frac{\partial \mathcal{P}[H_\beta]}{\partial \beta_k} + \sum_j \frac{\partial^2 \mathcal{P}[H_\beta]}{\partial \beta_k \partial \beta_j}\, \delta\beta_j + O(\delta\beta_j^2) \qquad (38)$$
$$= \mathbb{E}_{p_\beta}\{f_k\} + \sum_j \frac{\partial^2 \mathcal{P}[H_\beta]}{\partial \beta_k \partial \beta_j}\, \delta\beta_j + O(\delta\beta_j^2) = \tilde c_k. \qquad (39)$$

From (37) to (38) we take a Taylor expansion of $\mathcal{P}[H_{\beta+\delta\beta}]$ about $H_\beta$. From (38) to (39) we use the Gibbs–Jaynes principle for the KSE. We see from (39) that a small perturbation of a parameter $\beta_j$ influences the average values of all other features in the energy function (as $f_k$ is arbitrary), and the magnitude of the perturbation is controlled by the second derivatives of the topological pressure of the unperturbed energy $\mathcal{P}[H_\beta]$.

D Time correlations from the topological pressure
For a pair of finite-range features $f_k, f_j$ of a stationary Markov chain, the covariance at lag $r$ does not depend on the time $n$, only on the lag $r$, and is defined as
\[ C_{f_k,f_j}(r) := E_p\{f_k(x_n) f_j(x_{n+r})\} - E_p\{f_k(x_n)\}\, E_p\{f_j(x_{n+r})\}, \]
where $E_p$ stands for the expected value with respect to the Markov measure $p$. For MEMC with potentials of range $R > 1$ there is a positive time correlation between pairs of features $f_k(x_n)$ and $f_j(x_{n+r})$, which we denote $\sigma_{f_k,f_j}$; indeed, one can show that (Green-Kubo formula):
\[ \sigma_{f_k,f_j} = C_{f_k,f_j}(0) + \sum_{r=1}^{\infty} C_{f_k,f_j}(r) + \sum_{r=1}^{\infty} C_{f_j,f_k}(r). \tag{40} \]
The pairwise time correlations between features can be obtained from the topological pressure:
\[ \sigma_{f_k,f_j} = \frac{\partial^2 P[H_\beta]}{\partial \beta_k \partial \beta_j} = \frac{\partial \mu[f_j]}{\partial \beta_k}. \tag{41} \]
For a MEMC taking values on $\mathcal{A}^{NR}$ characterized by $p(\pi, P)$:
\[ \frac{\partial^2 P[H_\beta]}{\partial \beta_k \partial \beta_j} = E_p\{f_k f_j\} - E_p\{f_k\} E_p\{f_j\} + \sum_{r=1}^{\infty} \Big( \sum_{y,w \in \mathcal{A}^{NR}} f_k(y) f_j(w)\, \pi_y P^r_{yw} - E_p\{f_k\} E_p\{f_j\} \Big) + \sum_{r=1}^{\infty} \Big( \sum_{y,w \in \mathcal{A}^{NR}} f_j(y) f_k(w)\, \pi_y P^r_{yw} - E_p\{f_j\} E_p\{f_k\} \Big). \]
For a MEMC fitted through a range-one energy function, $\{f(x_n)\}$ is an i.i.d. process and the variance of $f$ is simply $C_f(0)$. These terms are the linear response coefficients. For a MEMC associated to an energy function formed by $K$ features, they can be conveniently arranged in a $K \times K$ symmetric matrix; this symmetry is known as the Onsager reciprocity relations [20]:
\[ \sigma_{f_k,f_j} = \frac{\partial^2 P[H_\beta]}{\partial \beta_k \partial \beta_j} = \frac{\partial E_p\{f_j\}}{\partial \beta_k}. \tag{42} \]

E Gallavotti-Cohen fluctuation theorem
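The Gallavotti-Cohen symmetry of the SCGF discussed in this appendix can be verified numerically. For the current-type entropy production per trajectory, $W_t = \sum_n \log(P_{x_n x_{n+1}} / P_{x_{n+1} x_n})$, the tilted matrix $M(k)_{xy} = P_{xy}^{1+k} P_{yx}^{-k}$ satisfies $M(-1-k) = M(k)^T$, so its leading eigenvalue, and hence $\lambda_W(k)$, is symmetric about $k = -1/2$. A sketch with a hypothetical cyclically biased three-state chain (this functional omits the bounded boundary term of the IEP, which does not change the SCGF):

```python
import numpy as np

# Hypothetical irreversible three-state chain with a cyclic bias (all entries positive).
P = np.array([[0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7],
              [0.7, 0.2, 0.1]])

def scgf(k):
    """SCGF of the entropy production: log spectral radius of the tilted matrix
    M(k)_{xy} = P_xy^(1+k) * P_yx^(-k)."""
    M = P ** (1 + k) * (P.T) ** (-k)
    return np.log(np.max(np.abs(np.linalg.eigvals(M))))

# Gallavotti-Cohen symmetry: lambda_W(k) = lambda_W(-1 - k).
for k in [0.2, 0.5, 1.3]:
    assert np.isclose(scgf(k), scgf(-1.0 - k), atol=1e-10)

# lambda_W(0) = 0, since M(0) = P is stochastic with spectral radius 1.
assert np.isclose(scgf(0.0), 0.0, atol=1e-10)
```

Since $M(-1-k)$ is exactly the transpose of $M(k)$, the two spectra coincide and the symmetry holds to machine precision, independently of the chosen transition probabilities.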
The Gallavotti-Cohen fluctuation theorem refers to a universal property of the IEP, i.e., one that is independent of the parameters of the MEMC. It is a statement about the SCGF and the rate function of the IEP [26], namely,
\[ \lambda_W(k) = \lambda_W(-k-1), \qquad I_W(s) = I_W(-s) - s. \tag{43} \]
This symmetry can be seen as a generalization of the Green-Kubo formula (40) and the Onsager reciprocity relations (42) to situations far from equilibrium. It is a relationship that holds for a general class of stochastic processes including Markov chains [31].
These properties have an impact on the large deviations of the time-averaged entropy production rate $W_t/t$ of a sample trajectory $x_{0,t-1}$ of the Markov chain $p(\pi, P)$. In our framework, the following relationship always holds:
\[ \frac{p\{W_t/t \approx s\}}{p\{W_t/t \approx -s\}} \asymp e^{ts}. \]
This means that positive fluctuations of $W_t/t$ are exponentially more probable than negative fluctuations of equal magnitude.

References

[1] S. Amari. Information geometry of multiple spike trains. In S. Grün and S. Rotter, editors,
Analysis of Parallel Spike Trains, volume 7 of Springer Series in Computational Neuroscience, part 11, pages 221–253. Springer, 2010. DOI: 10.1007/978-1-4419-5675.

[2] V. Balasubramanian. Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Computation, 9(2), 1997.

[3] R. Bowen. Equilibrium States and the Ergodic Theory of Anosov Diffeomorphisms. Second revised version, volume 470 of Lect. Notes in Math. Springer-Verlag, 2008.

[4] G. Brown, A. Pocock, M. Zhao, and M. Luján. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. Journal of Machine Learning Research, 13:27–66, 2012.

[5] B. Cessac and R. Cofré. Estimating maximum entropy distributions from periodic orbits in spike trains. Research report RR-8329, INRIA, July 2013.

[6] G. Chliamovitch, A. Dupuis, and B. Chopard. Maximum entropy rate reconstruction of Markov dynamics. Entropy, 6(17), 2015.

[7] R. Cofré and B. Cessac. Dynamics and spike trains statistics in conductance-based integrate-and-fire neural networks with chemical and electric synapses. Chaos, Solitons and Fractals, 50(8):13–31, 2013.

[8] R. Cofré and B. Cessac. Exact computation of the maximum entropy potential of spiking neural networks models. Physical Review E, 107(5):368–368, 2014.

[9] R. Cofré and C. Maldonado. Information entropy production of maximum entropy Markov chains from spike trains. Entropy, 20(34), 2018.

[10] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, second edition, 2006.

[11] L. Croner, K. Purpura, and E. Kaplan. Response variability in retinal ganglion cells of primates. PNAS, 90:8128–8130, 1993.

[12] M. Deem. Mathematical adventures in biology. Phys. Today, 60(1):42–47, 2007.

[13] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications, volume 38 of Stochastic Modelling and Applied Probability. Springer-Verlag, Berlin, 2010. Corrected reprint of the second (1998) edition.

[14] R. Ellis. Entropy, Large Deviations and Statistical Mechanics. Springer, Berlin, 1985.

[15] R. S. Ellis. The theory of large deviations and applications to statistical mechanics. In Long-Range Interacting Systems. Oxford Univ. Press, 2010.

[16] E. Ferrea, A. Maccione, L. Medrihan, T. Nieus, D. Ghezzi, P. Baldelli, F. Benfenati, and L. Berdondini. Large-scale, high-resolution electrophysiological imaging of field potentials in brain slices with microelectronic multielectrode arrays. Frontiers in Neural Circuits, 6(80), 2012.

[17] K. Friston. Functional and effective connectivity: a review. Brain Connect., 1(1):13–36, 2011.

[18] E. Ganmor, R. Segev, and E. Schneidman. The architecture of functional interaction networks in the retina. The Journal of Neuroscience, 31(8):3044–3054, 2011.

[19] P. Gaspard. Time-reversed dynamical entropy and irreversibility in Markovian random processes. J. Statist. Phys., 117(3-4):599–615, 2004.

[20] P. Gaspard. Random paths and current fluctuations in nonequilibrium statistical mechanics. Journal of Mathematical Physics, 55(7), 2014.

[21] H. Georgii. Probabilistic aspects of entropy. In A. Greven, G. Keller, and G. Warnecke, editors, Entropy. Princeton University Press, 2003.

[22] W. Gerstner and W. Kistler. Spiking Neuron Models. Cambridge University Press, 2002.

[23] D. N. Hill, S. B. Mehta, and D. Kleinfeld. Quality metrics to accompany spike sorting of extracellular signals. The Journal of Neuroscience, 31(24):8699–8705, 2011.

[24] E. Jaynes. Information theory and statistical mechanics. Phys. Rev., 106(620), 1957.

[25] E. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.

[26] D.-Q. Jiang, M. Qian, and M.-P. Qian. Mathematical Theory of Nonequilibrium Steady States. Springer, 2004.

[27] G. L. Jones. On the Markov chain central limit theorem. Probab. Surv., 1:299–320, 2004.

[28] D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. AMS, 2009.

[29] D. Linaro, M. Storace, and M. Giugliano. Accurate and fast simulation of channel noise in conductance-based model neurons by diffusion approximation. PLoS Comput. Biol., 7(3):e1001102, 2011.

[30] J. Macke, I. Murray, and P. Latham. Estimation bias in maximum entropy models. Entropy, 15(0), 2013.

[31] C. Maes. The fluctuation theorem as a Gibbs property. J. Stat. Phys., 95:367–392, 1999.

[32] O. Marre, S. El Boustani, Y. Frégnac, and A. Destexhe. Prediction of spatiotemporal patterns of neural activity from pairwise correlations. Physical Review Letters, 102(13), 2009.

[33] O. Marre, S. El Boustani, Y. Frégnac, and A. Destexhe. Prediction of spatiotemporal patterns of neural activity from pairwise correlations. Physical Review Letters, 102(13), 2009.

[34] M. Marsili, I. Mastromatteo, and Y. Roudi. On sampling and modeling complex systems. Journal of Statistical Mechanics: Theory and Experiment, 9(P09003), 2013.

[35] I. Mastromatteo and M. Marsili. On the criticality of inferred models. J. Stat. Mech., 2011.

[36] T. Mora and W. Bialek. Are biological systems poised at criticality? J. Stat. Phys., 144, 2011.

[37] H. Nasser and B. Cessac. Parameter estimation for spatio-temporal maximum entropy distributions: Application to neural spike trains. Entropy, 16(4), 2014.

[38] M. Okatan, M. A. Wilson, and E. N. Brown. Analyzing functional connectivity using a network likelihood model of ensemble neural spiking activity. Neural Computation, 17(9):1927–1961, September 2005.

[39] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.

[40] J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, and E. P. Simoncelli. Spatio-temporal correlations and visual signaling in a complete neuronal population. Nature, 454(7206):995–999, 2008.

[41] I. Prigogine. Nonequilibrium Statistical Mechanics. Monographs in Statistical Physics. Interscience Publishers, John Wiley & Sons, 1962.

[42] R. Q. Quiroga, Z. Nadasdy, and Ben-Shaul. Unsupervised spike sorting with wavelets and superparamagnetic clustering. Neural Computation, 16:1661–1678, 2004.

[43] F. Rieke, D. Warland, R. de Ruyter van Steveninck, and W. Bialek. Spikes: Exploring the Neural Code. M.I.T. Press, 1996.

[44] E. Schneidman, M. Berry II, R. Segev, and W. Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440:1007–1012, 2006.

[45] E. Schrödinger. What Is Life? The Physical Aspect of the Living Cell. Cambridge University Press, 1944.

[46] T. Schwalger, K. Fisch, J. Benda, and B. Lindner. How noisy adaptation of neurons shapes interspike interval histograms and correlations. PLoS Comput. Biol., 6(12):e1001026, 2010.

[47] E. Seneta. Non-negative Matrices and Markov Chains. Springer, 2006.

[48] M. Shadlen and W. Newsome. The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J. Neurosci., 18(10):3870–3896, 1998.

[49] P. Shi and H. Qian. Irreversible stochastic processes, coupled diffusions and systems biochemistry. In J. Feng, W. Fu, and F. Sun, editors, Frontiers in Computational and Systems Biology, pages 175–201. Springer, 2010.

[50] Y. Sinai. Gibbs measures in ergodic theory. Russ. Math. Surveys, 27(4):21–69, 1972.

[51] A. Tang, D. Jackson, J. Hobbs, W. Chen, J. Smith, H. Patel, A. Prieto, D. Petrusca, M. Grivich, A. Sher, P. Hottowy, W. Dabrowski, A. Litke, and J. Beggs. A maximum entropy model applied to spatial and temporal correlations from cortical networks in vitro. The Journal of Neuroscience, 28(2):505–518, 2008.

[52] G. Tkačik, T. Mora, O. Marre, D. Amodei, M. Berry II, and W. Bialek. Thermodynamics for a network of neurons: Signatures of criticality. PNAS, 112, 2015.

[53] G. Tkačik, O. Marre, D. Amodei, E. Schneidman, W. Bialek, and M. Berry II. Searching for collective behavior in a large network of sensory neurons. PLoS Computational Biology, 10, 2013.

[54] G. Tkačik, J. Prentice, V. Balasubramanian, and E. Schneidman. Optimal population coding by noisy spiking neurons. PNAS, 107(32):14419–14424, 2010.

[55] G. Tkačik, O. Marre, T. Mora, D. Amodei, M. Berry II, and W. Bialek. The simplest maximum entropy model for collective behavior in a neural network. J. Stat. Mech., page P03011, 2013.

[56] H. Touchette. A basic introduction to large deviations: Theory, applications, simulations. http://arxiv.org/pdf/1106.4146v3.pdf, 2012.

[57] J. Vasquez, A. Palacios, O. Marre, M. Berry II, and B. Cessac. Gibbs distribution analysis of temporal correlation structure on multicell spike trains from retina ganglion cells.