Searching for collective behavior in a network of real neurons
Gašper Tkačik, Olivier Marre, Dario Amodei, Elad Schneidman, William Bialek, Michael J Berry II
Gašper Tkačik,a Olivier Marre,b,d Dario Amodei,d,e Elad Schneidman,c William Bialek,e,f and Michael J. Berry II d

a Institute of Science and Technology Austria, A-3400 Klosterneuburg, Austria; b Institut de la Vision, INSERM U968, UPMC, CNRS U7210, CHNO Quinze-Vingts, F-75012 Paris, France; c Department of Neurobiology, Weizmann Institute of Science, 76100 Rehovot, Israel; d Department of Molecular Biology, e Joseph Henry Laboratories of Physics, and f Lewis–Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544 USA

(Dated: June 14, 2013)

Maximum entropy models are the least structured probability distributions that exactly reproduce a chosen set of statistics measured in an interacting network. Here we use this principle to construct probabilistic models which describe the correlated spiking activity of populations of up to 120 neurons in the salamander retina as it responds to natural movies. Already in groups as small as 10 neurons, interactions between spikes can no longer be regarded as small perturbations in an otherwise independent system; for 40 or more neurons pairwise interactions need to be supplemented by a global interaction that controls the distribution of synchrony in the population. Here we show that such “K-pairwise” models—being systematic extensions of the previously used pairwise Ising models—provide an excellent account of the data. We explore the properties of the neural vocabulary by: 1) estimating its entropy, which constrains the population's capacity to represent visual information; 2) classifying activity patterns into a small set of metastable collective modes; 3) showing that the neural codeword ensembles are extremely inhomogeneous; 4) demonstrating that the state of individual neurons is highly predictable from the rest of the population, allowing the capacity for error correction.
Keywords: entropy, information, multi–information, neural networks, Monte Carlo, correlation
I. INTRODUCTION
Physicists have long hoped that the functional behavior of large, highly interconnected neural networks could be described by statistical mechanics [1–3]. The goal of this effort has been not to simulate the details of particular networks, but to understand how interesting functions can emerge, collectively, from large populations of neurons. The hope, inspired by our quantitative understanding of collective behavior in systems near thermal equilibrium, is that such emergent phenomena will have some degree of universality, and hence that one can make progress without knowing all of the microscopic details of each system. A classic example of work in this spirit is the Hopfield model of associative or content–addressable memory [1], which is able to recover the correct memory from any of its subparts of sufficient size. Because the computational substrate of neural states in these models was binary “spins,” and the memories were realized as locally stable states of the network dynamics, methods of statistical physics could be brought to bear on theoretically challenging issues such as the storage capacity of the network or its reliability in the presence of noise [2, 3]. On the other hand, precisely because of these abstractions, it has not always been clear how to bring the predictions of the models into contact with experiment.

Recently it has been suggested that the analogy between statistical physics models and neural networks can be turned into a precise mapping, and connected to experimental data, using the maximum entropy framework [4]. In a sense, the maximum entropy approach is the opposite of what we usually do in making models or theories. The conventional approach is to hypothesize some dynamics for the network we are studying, and then calculate the consequences of these assumptions; inevitably, the assumptions we make will be wrong in detail. In the maximum entropy method, however, we are trying to strip away all our assumptions, and find models of the system that have as little structure as possible while still reproducing some set of experimental observations.

The starting point of the maximum entropy method for neural networks is that the network could, if we don't know anything about its function, wander at random among all possible states. We then take measured, average properties of the network activity as constraints, and each constraint defines some minimal level of structure. Thus, in a completely random system neurons would generate action potentials (spikes) or remain silent with equal probability, but once we measure the mean spike rate for each neuron we know that there must be some departure from such complete randomness. Similarly, absent any data beyond the mean spike rates, the maximum entropy model of the network is one in which each neuron spikes independently of all the others, but once we measure the correlations in spiking between pairs of neurons, an additional layer of structure is required to account for these data. The central idea of the maximum entropy method is that, for each experimental observation that we want to reproduce, we add only the minimum amount of structure required.

An important feature of the maximum entropy approach is that the mathematical form of a maximum entropy model is exactly equivalent to a problem in statistical mechanics. That is, the maximum entropy construction defines an “effective energy” for every possible state of the network, and the probability that the system will be found in a particular state is given by the Boltzmann distribution in this energy landscape. Further, the energy function is built out of terms that are related to the experimental observables that we are trying to reproduce. Thus, for example, if we try to reproduce the correlations among spiking in pairs of neurons, the energy function will have terms describing effective interactions among pairs of neurons. As explained in more detail below, these connections are not analogies or metaphors, but precise mathematical equivalencies.

Minimally structured models are attractive, both because of the connection to statistical mechanics and because they represent the absence of modeling assumptions. Of course, these features do not guarantee that such models will provide an accurate description of a real system. Interest in maximum entropy approaches to networks of real neurons was triggered by the observation that, for groups of N = 10−15 ganglion cells in the vertebrate retina, maximum entropy models based on the mean spike probabilities of individual neurons and correlations between pairs of cells indeed generate successful predictions for the probabilities of all the combinatorial patterns of spiking and silence in the network as it responds to naturalistic sensory inputs [4]. In particular, the maximum entropy approach made clear that genuinely collective behavior in the network can be consistent with relatively weak correlations among pairs of neurons, so long as these correlations are widespread, shared among most pairs of cells in the system. This approach has now been used to analyze the activity in a variety of neural systems [5–14], the statistics of natural visual scenes [15–17], the structure and activity of biochemical and genetic networks [18, 19], the statistics of amino acid substitutions in protein families [20–26], the rules of spelling in English words [27], and the directional ordering in flocks of birds [28].

One of the lessons of statistical mechanics is that systems with many degrees of freedom can behave in qualitatively different ways from systems with just a few degrees of freedom. If we can study only a handful of neurons (e.g., N ∼ 10 as in Ref [4]), we can try to extrapolate based on the hypothesis that the group of neurons that we analyze is typical of a larger population. These extrapolations can be made more convincing by looking at a population of N = 40 neurons, and within such larger groups one can also try to test more explicitly whether the hypothesis of homogeneity or typicality is reliable [5, 8]. All these analyses suggest that, in the salamander retina, the roughly 200 interconnected neurons that represent a small patch of the visual world should exhibit dramatically collective behavior. In particular, the states of these large networks should cluster around local minima of the energy landscape, much as for the attractors in the Hopfield model of associative memory [1]. Further, this collective behavior means that responses will be substantially redundant, with the behavior of one neuron largely predictable from the state of other neurons in the network; stated more positively, this collective response allows for pattern completion and error correction. Finally, the collective behavior suggested by these extrapolations is a very special one, in which the probability of particular network states, or equivalently the degree to which we should be surprised by the occurrence of any particular state, has an anomalously large dynamic range [29]. If correct, these predictions would have a substantial impact on how we think about coding in the retina, and about neural network function more generally. Correspondingly, there is some controversy about all these issues [30–33].

FIG. 1: A schematic of the experiment. (A) Four frames from the natural movie stimulus, showing swimming fish and water plants. (B) The responses of a set of 120 neurons to a single stimulus repeat; black dots designate spikes. (C) The raster for a zoomed-in region designated by a red square in (B), showing the responses discretized into ∆τ = 20 ms time bins, where σ_i = +1 represents a spike and σ_i = −1 silence.

Here we return to the salamander retina, in experiments that exploit a new generation of multi–electrode arrays and associated spike–sorting algorithms [34]. As schematized in Fig 1, these methods make it possible to record from N = 100−
200 ganglion cells in the relevant, densely interconnected patch, while projecting natural movies onto the retina. Access to these large populations poses new problems for the inference of maximum entropy models, both in principle and in practice. What we find is that, with extensions of algorithms developed previously [35], it is possible to infer maximum entropy models for more than one hundred neurons, and that with nearly two hours of data there are no signs of “overfitting.” We have built models that match the mean probability of spiking for individual neurons, the correlations between spiking in pairs of neurons, and the distribution of summed activity in the network (i.e., the probability that K out of the N neurons spike in the same small window of time [36–38]). We will see that models which satisfy all these experimental constraints provide a strikingly accurate description of the states taken on by the network as a whole, that these states are collective, and that the collective behavior predicted by our models has implications for how the retina encodes visual information.

II. MAXIMUM ENTROPY
The idea of maximizing entropy has its origin in thermodynamics and statistical mechanics. The idea that we can use this principle to build models of systems that are not in thermal equilibrium is more recent, but still more than fifty years old [39]. Here we provide a description of this approach which we hope makes the ideas accessible to a broad audience.

We imagine a neural system exposed to a stationary stimulus ensemble, in which simultaneous recordings of N neurons can be made. In small windows of time, as we see in Fig 1, a single neuron i either does (σ_i = +1) or does not (σ_i = −1) generate an action potential or spike [40]; the state of the entire network in that time bin is therefore described by a “binary word” {σ_i}. As the system responds to its inputs, it visits each of these states with some probability P_expt({σ_i}). Even before we ask what the different states mean, for example as codewords in a representation of the sensory world, specifying this distribution requires us to determine the probability of each of 2^N possible states. Once N increases beyond ∼20, brute force sampling from data is no longer a general strategy for “measuring” the underlying distribution.

Even when there are many, many possible states of the network, experiments of reasonable size can be sufficient to estimate the averages or expectation values of various functions of the state of the system, ⟨f_µ({σ_i})⟩_expt, where the averages are taken across data collected over the course of the experiment. The goal of the maximum entropy construction is to search for the probability distribution P^({f_µ})({σ_i}) that matches these experimental measurements but otherwise is as unstructured as possible. Minimizing structure means maximizing entropy [39], and for any set of moments or statistics that we want to match, the form of the maximum entropy distribution can be found analytically:

  P^({f_µ})({σ_i}) = (1/Z({g_µ})) exp(−H),   (1)
  H({σ_i}) = −Σ_{µ=1}^{L} g_µ f_µ({σ_i}),   (2)
  Z({g_µ}) = Σ_{{σ_i}} exp(−H),   (3)

where H({σ_i}) is the effective “energy” function or the Hamiltonian of the system, and the partition function Z({g_µ}) ensures that the distribution is normalized. The couplings g_µ must be set such that the expectation values of all constraint functions {⟨f_µ⟩_P}, µ = 1, …, L, over the distribution P match those measured in the experiment:

  ⟨f_µ⟩_P ≡ Σ_{{σ_i}} f_µ({σ_i}) P({σ_i}) = ∂ log Z / ∂g_µ = ⟨f_µ⟩_expt.   (4)

These equations might be hard to solve, but they are guaranteed to have exactly one solution for the couplings g_µ given any set of measured expectation values [47].

Why should we study the neural vocabulary, P({σ_i}), at all? In much previous work on neural coding, the focus has been on constructing models for a “codebook” which can predict the response of the neurons to arbitrary stimuli, P({σ_i}|stimulus) [14, 50], or on building a “dictionary” that describes the stimuli consistent with particular patterns of activity, P(stimulus|{σ_i}) [40]. In a natural setting, stimuli are drawn from a space of very high dimensionality, so constructing these “encoding” and “decoding” mappings between the stimuli and responses is very challenging and often involves making strong assumptions about how stimuli drive neural spiking (e.g. through linear filtering of the stimulus) [48–51]. By trying to understand directly the total distribution of responses, P({σ_i}), rather than the conditional distribution, P({σ_i}|stimulus), we take a very different approach.

Already when we study the smallest possible network, i.e. a pair of interacting neurons, the usual approach is to measure the correlation between spikes generated in the two cells, and to dissect this correlation into contributions which are intrinsic to the network and those which are ascribed to common, stimulus driven inputs. The idea of decomposing correlations dates back to a time when it was hoped that correlations among spikes could be used to map the synaptic connections between neurons [52]. In fact, in a highly interconnected system, the dominant source of correlations between two neurons—even if they are entirely intrinsic to the network—will always be through the multitude of indirect paths involving other neurons [53]. Regardless of the source of these correlations, however, the question of whether they are driven by the stimulus or are intrinsic to the network is not a question that the brain can answer.
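Before continuing, it may help to make the construction of Eqs (1)–(4) concrete. For a network small enough that the sum over all 2^N states is tractable, the couplings g_µ can be found by gradient ascent on the likelihood, increasing g_µ whenever ⟨f_µ⟩_expt exceeds ⟨f_µ⟩_P. The following is a minimal sketch in Python; it is illustrative only (the procedure actually used for large N is described in Appendix B), and all function and variable names are ours.

    import numpy as np
    from itertools import product

    def fit_maxent(f_expt, features, n_neurons, eta=0.1, n_iter=5000):
        """Solve Eq (4) by gradient ascent, for small N where all 2^N states fit in memory.

        f_expt   : measured expectation values <f_mu>_expt
        features : function mapping a state (array of +/-1) to the vector f_mu
        """
        states = np.array(list(product([-1, 1], repeat=n_neurons)))  # all 2^N binary words
        F = np.array([features(s) for s in states])                  # f_mu for every state
        g = np.zeros(F.shape[1])                                     # couplings g_mu
        for _ in range(n_iter):
            logp = F @ g                        # -H for each state, Eq (2)
            p = np.exp(logp - logp.max())
            p /= p.sum()                        # Boltzmann distribution, Eqs (1), (3)
            f_model = p @ F                     # <f_mu>_P
            g += eta * (f_expt - f_model)       # moment matching, Eq (4)
        return g, p

    # example feature set: mean rates plus pairwise products, for a handful of neurons
    def pairwise_features(s):
        iu = np.triu_indices(len(s), k=1)
        return np.concatenate([s, np.outer(s, s)[iu]])

Because the maximum entropy problem is concave in the couplings, this simple ascent converges to the unique solution guaranteed by Eq (4), provided the step size eta is small enough.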
We, as external observers, can repeat the stimulus exactly, and search for correlations conditional on the stimulus, but this is not accessible to the organism. The brain has access only to the output of the retina: the patterns of activity which are drawn from the distribution P({σ_i}). If the responses {σ_i} are codewords for the visual stimulus, then the entropy of this distribution sets the capacity of the code to carry information. Word by word, −log P({σ_i}) determines how surprised the brain should be by each particular pattern of response, including the possibility that the response was corrupted by noise in the retinal circuit and thus should be corrected or ignored [54]. In a very real sense, what the brain “sees” are sequences of states drawn from P({σ_i}). In the same spirit that many groups have studied the statistical structures of natural scenes [55–60], we would like to understand the statistical structure of the codewords that represent these scenes.

The maximum entropy method is not a model for network activity. Rather it is a framework for building models, and to implement this framework we have to choose which functions of the network state f_µ({σ_i}) we think are interesting. The hope is that while there are 2^N states of the system as a whole, there is a much smaller number of measurements, {f_µ({σ_i})}, with µ = 1, …, L and L ≪ 2^N, which will be sufficient to capture the essential structure of the collective behavior in the system. We emphasize that this is a hypothesis, and must be tested.

How should we choose the functions f_µ({σ_i})? In this work we consider three classes of possibilities:

(A) We expect that networks have very different behaviors depending on the overall probability that neurons generate spikes as opposed to remaining silent. Thus, our first choice of functions to constrain in our models is the set of mean spike probabilities or firing rates, which is equivalent to constraining ⟨σ_i⟩ for each neuron i. These constraints contribute a term to the energy function

  H^(1) = −Σ_{i=1}^{N} h_i σ_i.   (5)

Note that ⟨σ_i⟩ = 2 r̄_i ∆τ − 1, where r̄_i is the mean spike rate of neuron i, and ∆τ is the size of the time slices that we use in our analysis, as in Fig 1. Maximum entropy models that constrain only the firing rates of all the neurons (i.e. H = H^(1)) are called “independent models”; we denote their distribution functions by P^(1).

(B) As a second constraint we take the correlations between neurons, two by two. This corresponds to measuring

  C_ij = ⟨σ_i σ_j⟩ − ⟨σ_i⟩⟨σ_j⟩   (6)

for every pair of cells ij. These constraints contribute a term to the energy function

  H^(2) = −Σ_{i,j=1}^{N} J_ij σ_i σ_j.   (7)

It is more conventional to think about correlations between two neurons in terms of their spike trains. If we define

  ρ_i(t) = Σ_n δ(t − t_n^i),   (8)

where neuron i spikes at times t_n^i, then the spike–spike correlation function is [40]

  C^spike_ij(t − t′) = ⟨ρ_i(t) ρ_j(t′)⟩ − ⟨ρ_i⟩⟨ρ_j⟩,   (9)

and we also have the average spike rates r̄_i = ⟨ρ_i⟩. The correlations among the discrete spike/silence variables σ_i, σ_j then can be written as

  C_ij = 4 ∫_0^{∆τ} dt ∫_0^{∆τ} dt′ C^spike_ij(t − t′).   (10)

Maximum entropy models that constrain average firing rates and correlations (i.e.
H = H^(1) + H^(2)) are called “pairwise models”; we denote their distribution functions by P^(1,2).

(C) Firing rates and pairwise correlations focus on the properties of particular neurons. As an alternative, we can consider quantities that refer to the network as a whole, independent of the identity of the individual neurons. A simple example is the “distribution of synchrony” (also called the “population firing rate”), that is, the probability P_N(K) that K out of the N neurons spike in the same small slice of time. We can count the number of neurons that spike by summing all of the σ_i, remembering that we have σ_i = +1 for spikes and σ_i = −1 for silence, so that

  P_N(K) = ⟨ δ( Σ_{i=1}^{N} σ_i , 2K − N ) ⟩,   (11)

where

  δ(n, n) = 1;   (12)
  δ(n, m ≠ n) = 0.   (13)

If we know the distribution P_N(K), then we know all its moments, and hence we can think of the functions f_µ({σ_i}) that we are constraining as being

  f_1({σ_i}) = Σ_{i=1}^{N} σ_i,   (14)
  f_2({σ_i}) = ( Σ_{i=1}^{N} σ_i )²,   (15)
  f_3({σ_i}) = ( Σ_{i=1}^{N} σ_i )³,   (16)

and so on. Because there are only N neurons, there are only N + 1 possible values of K, and hence only N unique moments. Constraining all of these moments contributes a term to the energy function

  H^(K) = −Σ_{K=1}^{N} λ_K ( Σ_{i=1}^{N} σ_i )^K = −V( Σ_{i=1}^{N} σ_i ),   (17)

where V is an effective potential [37, 38]. Maximum entropy models that constrain average firing rates, correlations, and the distribution of synchrony (i.e. H = H^(1) + H^(2) + H^(K)) are called “K-pairwise models”; we denote their distribution functions by P^(1,2,K).

It is important that the mapping between maximum entropy models and a Boltzmann distribution with some effective energy function is not an analogy, but rather a mathematical equivalence. In using the maximum entropy approach we are not assuming that the system of interest is in some thermal equilibrium state (note that there is no explicit temperature in Eq (1)), nor are we assuming that there is some mysterious force which drives the system to a state of maximum entropy. We are also not assuming that the temporal dynamics of the network is described by Newton's laws or Brownian motion on the energy landscape. What we are doing is making models that are consistent with certain measured quantities, but otherwise have as little structure as possible. As noted above, this is the opposite of what we usually do in building models or theories—rather than trying to impose some hypothesized structure on the world, we are trying to remove all structures that are not strictly required by the data.

The mapping to a Boltzmann distribution is not an analogy, but if we take the energy function more literally we are making use of analogies. Thus, the term H^(1) that emerges from constraining the mean spike probabilities of every neuron is analogous to a magnetic field being applied to each spin, where spin “up” (σ_i = +1) marks a spike and spin “down” (σ_i = −1) denotes silence. Similarly, the term H^(2) that emerges from constraining the pairwise correlations among neurons corresponds to a “spin–spin” interaction which tends to favor neurons firing together (J_ij > 0) or not (J_ij < 0). Finally, constraining the distribution of synchrony generates the term H^(K), which we can interpret as resulting from the interaction between all the spins/neurons in the system and one other, hidden degree of freedom, such as an inhibitory interneuron. These analogies can be useful, but need not be taken literally.
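All three classes of constraints are simple functionals of the binarized data. As an illustration of how they would be measured from an array of binary words, here is a minimal sketch (Python; names are ours, and the shapes are assumptions about how the data might be stored):

    import numpy as np

    def constraint_statistics(sigma):
        """Empirical constraints from binarized activity.

        sigma : array of shape (T, N), entries +1 (spike) or -1 (silence),
                one row per Delta-tau = 20 ms time bin.
        """
        mean = sigma.mean(axis=0)                      # <sigma_i>, fixes H^(1)
        C = np.cov(sigma, rowvar=False, bias=True)     # C_ij of Eq (6), fixes H^(2)
        K = (sigma == 1).sum(axis=1)                   # number of spiking cells per bin
        # P_N(K) of Eq (11), fixes H^(K)
        P_K = np.bincount(K, minlength=sigma.shape[1] + 1) / len(K)
        return mean, C, P_K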
III. CAN WE LEARN THE MODEL?

We have applied the maximum entropy framework to the analysis of one large experimental data set on the responses of ganglion cells in the salamander retina to a repeated, naturalistic movie. These data are collected using a new generation of multi–electrode arrays that allow us to record from a large fraction of the neurons in a 450 × 450 µm patch, which contains a total of ∼200 ganglion cells. The correlations between spikes are strongest on the ∼20 ms timescale, and so we chose to discretize the spike train into ∆τ = 20 ms bins.

Maximum entropy models have a simple form [Eq (1)] that connects precisely with statistical physics. But to complete the construction of a maximum entropy model, we need to impose the condition that averages in the maximum entropy distribution match the experimental measurements, as in Eq (4). This amounts to finding all the coupling constants {g_µ} in Eq (2). This is, in general, a hard problem. We need not only to solve this problem, but also convince ourselves that our solution is meaningful, and that it does not reflect overfitting to the limited set of data at our disposal. A detailed account of the numerical solution to this inverse problem is given in Appendix B.

FIG. 2: Learning the pairwise maximum entropy model for a 100 neuron subset. A subgroup of 100 neurons from our set of 160 has been sorted by firing rate. At left, the statistics of the neural activity: correlations C_ij = ⟨σ_iσ_j⟩ − ⟨σ_i⟩⟨σ_j⟩ (top), firing rates (equivalent to ⟨σ_i⟩; middle), and the distribution of correlation coefficients c_ij (bottom). The red distribution is the distribution of differences between two halves of the experiment, and the small red error bar marks the standard deviation of correlation coefficients in fully shuffled data. At right, the parameters of a pairwise maximum entropy model [H from Eq (19)] that reproduces these data: coupling constants J_ij (top), fields h_i (middle), and the distribution of couplings in this group of neurons.

In Fig 2 we show an example of N = 100 neurons from a small patch of the salamander retina, responding to naturalistic movies. We notice that correlations are weak, but widespread, as in previous experiments on smaller groups of neurons [4, 5, 8, 45, 46]. Because the data set is very large, the threshold for reliable detection of correlations is very low; if we shuffle the data completely by permuting time and repeat indices independently for each neuron, the standard deviation of correlation coefficients,

  c_ij = (⟨σ_iσ_j⟩ − ⟨σ_i⟩⟨σ_j⟩) / √[(1 − ⟨σ_i⟩²)(1 − ⟨σ_j⟩²)],   (18)

is vastly smaller than the typical correlations that we observe, as shown in Fig 2C. More subtly, this means that only a few percent of the correlation coefficients are within error bars of zero, and there is no sign that there is a finite fraction of pairs that have truly zero correlation—the distribution of correlations across the population seems continuous. Note that, as customary, we report normalized correlation coefficients (c_ij, between −1 and 1), while maximum entropy formally constrains an equivalent set of unnormalized second order moments, C_ij [Eq (6)].

We began by constructing maximum entropy models that match the mean spike rates and pairwise correlations, i.e. “pairwise models,” whose distribution is, from Eqs (5, 7),

  P^(1,2)({σ_i}) = (1/Z) exp[−H({σ_i})],
  H = −Σ_{i=1}^{N} h_i σ_i − Σ_{i,j=1}^{N} J_ij σ_i σ_j.   (19)

When we reconstruct the coupling constants of the maximum entropy model, we see that the “interactions” J_ij among neurons are widespread, and almost symmetrically divided between positive and negative values; for more details see Appendix B. Figure 3 shows that the model we construct really does satisfy the constraints, so that the differences, for example, between the measured and predicted correlations among pairs of neurons are within the experimental errors in the measurements.

FIG. 3: Reconstruction precision for a 100 neuron subset. Given the reconstructed Hamiltonian of the pairwise model, we used an independent Metropolis Monte Carlo (MC) sampler to assess how well the constrained model statistics (mean firing rates (A), covariances (B), plotted on y-axes) match the measured statistics (corresponding x-axes). Error bars on data computed by bootstrapping; error bars on MC estimates obtained by repeated MC runs generating a number of samples equal to the original data size. (C) The distribution of the difference between true and model values for the ∼5 × 10³ covariance matrix elements, normalized by the estimated error bar in the data; red overlay is a Gaussian with zero mean and unit variance. The distribution has nearly Gaussian shape with a width of ≈1, showing that the learning algorithm reconstructs the covariance statistics to within measurement precision.

With N = 100 neurons, measuring the mean spike probabilities and all the pairwise correlations means that we estimate N(N + 1)/2 = 5050 separate quantities. One might worry that the errors in estimates of the different C_ij are sufficiently strongly correlated that we don't really know the matrix as a whole with high precision, even though the individual elements are measured very accurately. This is a question about overfitting: is it possible that the parameters {h_i, J_ij} are being finely tuned to match even the statistical errors in our data? To test for overfitting (Fig 4), we exploit the fact that
the stimuli consist of a short movie repeated many times. We can choose a random 90% of these repeats from which to learn the parameters of the maximum entropy model, and then check that the probability of the data in the other 10% of the experiment is predicted to be the same, within errors. We see in Fig 4 that this is true, and that it remains true as we expand from N = 10 neurons (for which we surely have enough data) out to N = 120, where we might have started to worry. Taken together, Figs 2, 3, and 4 suggest strongly that our data and algorithms are sufficient to construct maximum entropy models, reliably, for networks of more than one hundred neurons.

FIG. 4: A test for overfitting. (A) The per-neuron average log-probability of data (log-likelihood, L = ⟨log P(σ)⟩_expt / N) under the pairwise model of Eq (19), computed on the training repeats (black dots) and on the testing repeats (red dots), for the same group of N = 100 neurons shown in Figs 2 and 3. Here the repeats have been reordered so that the training repeats precede testing repeats; in fact, the choice of test repeats is random. (B) The ratio of the log-likelihoods on test vs training data, shown as a function of the network size N. Error bars are the standard deviation across 30 subgroups at each value of N.
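Checks like those in Figs 3 and 4 require samples drawn from the fitted model. A minimal single-spin-flip Metropolis sampler for the pairwise model of Eq (19) could look like the sketch below. This is illustrative, not the sampler used for the results above; we assume J is symmetric with zero diagonal and count each pair once (factors of two depend on how the double sum in Eq (19) is counted).

    import numpy as np

    def metropolis_sample(h, J, n_sweeps, rng=None):
        """Metropolis sampling of H = -sum_i h_i s_i - sum_{i<j} J_ij s_i s_j."""
        rng = rng or np.random.default_rng()
        N = len(h)
        sigma = rng.choice([-1, 1], size=N)
        samples = np.empty((n_sweeps, N), dtype=int)
        for t in range(n_sweeps):
            for i in rng.permutation(N):
                # energy change of flipping spin i (J has zero diagonal)
                dE = 2 * sigma[i] * (h[i] + J[i] @ sigma)
                if dE <= 0 or rng.random() < np.exp(-dE):
                    sigma[i] = -sigma[i]
            samples[t] = sigma           # one sample per full sweep
        return samples

Comparing constraint_statistics(samples) with constraint_statistics(data) then implements the consistency check of Fig 3, up to equilibration and autocorrelation of the Markov chain, which a production version would have to handle carefully.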
IV. DO THE MODELS WORK?

How well do our maximum entropy models describe the behavior of large networks of neurons? The models predict the probability of occurrence for all possible combinations of spiking and silence in the network, and it seems natural to use this huge predictive power to test the models. In small networks, this is a useful approach. Indeed, much of the interest in the maximum entropy approach derives from the success of models based on mean spike rates and pairwise correlations, as in Eq (19), in reproducing the probability distribution over states in networks of size N = 10−15 [4, 9]. With N = 10, there are 2^10 = 1024 possible combinations of spiking and silence, and reasonable experiments are sufficiently long to estimate the probabilities of all of these individual states. But with N = 100, there are 2^100 ∼ 10^30 possible states, and so it is not possible to “just measure” all the probabilities. Thus, we need another strategy for testing our models.

Striking (and model–independent) evidence for nontrivial collective behavior in these networks is obtained by asking for the probability that K out of the N neurons generate a spike in the same small window of time, as shown in Fig 5. This distribution, P_N(K), should become Gaussian at large N if the neurons are independent, or nearly so, and we have noted that the correlations between pairs of cells are weak. One might thus expect P_N(K) to be very well approximated by an independent model, with fractional errors on the order of the correlation coefficients. Already in groups of N = 10 cells, however, there are substantial departures from the predictions of an independent model (Fig 5A). In groups of N = 40 cells, we see K = 10 cells spiking synchronously with a probability far larger than expected from an independent model (Fig 5B), and the departure from independence is even larger at N = 100 (Fig 5C).

Maximum entropy models that match the mean spike rate and pairwise correlations in a network make an unambiguous, quantitative prediction for P_N(K), with no adjustable parameters. In smaller groups of neurons, certainly for N = 10, this prediction is quite accurate, and accounts for most of the difference between the data and the expectations from an independent model, as shown in Fig 5. But even at N = 40 we see small deviations between the data and the predictions of the pairwise model. Because the silent state is highly probable, we can measure P_N(K = 0) very accurately, and the pairwise models make errors of nearly a factor of three at N = 100. These errors are negligible when compared to the many orders of magnitude differences from an independent model, but they are highly significant. The pattern of errors also is important, since in the real networks silence persists as being highly probable even at N = 100 (and beyond), which is surprising [37], and the pairwise model doesn't quite capture this.

FIG. 5: Predicted vs measured probability of K simultaneous spikes (spike synchrony). (A–C) P_N(K) for subnetworks of size N = 10, 40, 100. At N = 10 (A) we already see large deviations from an independent model, but these are captured by the pairwise model. At N = 40 (B), the pairwise models miss the tail of the distribution, where P(K) is very small. At N = 100 (C), the deviations between the pairwise model and the data are more substantial. (D) The probability of silence in the network, as a function of population size; error bars are s.d. across 30 subgroups of a given size N. Throughout, red shows the data, grey the independent model, and black the pairwise model.

If a model based on pairwise correlations doesn't quite account for the data, it is tempting to try to include correlations among triplets of neurons. But at N = 100 there are N(N − 1)(N − 2)/6 ≈ 1.6 × 10^5 of these triplets, so a model that includes these correlations is much more complex than one that stops with pairs. An alternative is to use P_N(K) itself as a constraint on our models, as explained above in relation to Eq (17). This defines the “K-pairwise model,”

  P^(1,2,K)({σ_i}) = (1/Z) exp[−H({σ_i})],   (20)
  H({σ_i}) = −Σ_{i=1}^{N} h_i σ_i − Σ_{i,j=1}^{N} J_ij σ_i σ_j − V( Σ_{i=1}^{N} σ_i ),

where the “potential” V is chosen to match the observed distribution P_N(K). As noted above, we can think of this
potential as providing a global regulation of the network activity, such as might be implemented by inhibitory interneurons with (near) global connectivity. Whatever the mechanistic interpretation of this model, it is important that it is not much more complex than the pairwise model: matching P_N(K) adds only ∼N parameters to our model, while the pairwise model already has ∼N²/2. Figure 6 shows the parameters of the K-pairwise model for the same group of N = 100 neurons shown in Fig 2. Notice that the pairwise interaction terms J_ij remain roughly the same; the local fields h_i are also similar but have a shift towards more negative values.

FIG. 6: K-pairwise model for the same group of N = 100 cells shown in Fig 2. The neurons are again sorted in order of decreasing firing rate. (A) Pairwise interactions, J_ij, and the comparison with the interactions of the pairwise model, (B). (C) Single-neuron fields, h_i, and the comparison with the fields of the pairwise model, (D). (E) The global potential, V(K), where K is the number of synchronous spikes.

Since we didn't make explicit use of the triplet correlations in constructing the K-pairwise model, we can test the model by predicting these correlations. In Fig 7A we show

  C_ijk ≡ ⟨ (σ_i − ⟨σ_i⟩)(σ_j − ⟨σ_j⟩)(σ_k − ⟨σ_k⟩) ⟩   (21)

as computed from the real data and from the models, for a single group of N = 100 neurons. We see that pairwise models capture the rankings of the different triplets, so that more strongly correlated triplets are predicted to be more strongly correlated, but these models miss quantitatively, overestimating the positive correlations and failing to predict significantly negative correlations. These systematic errors are largely corrected in the K-pairwise model, despite the fact that adding a constraint on P_N(K) doesn't add any information about the identity of the neurons in the different triplets. It is also interesting that this improvement in our predictions (as well as that in Fig 8 below) occurs even though the numerical value of the effective potential V(K) is quite small, as shown in Fig 6E. Fixing the distribution of global activity thus seems to capture something about the network that individual spike probabilities and pairwise correlations have missed.

An interesting effect is shown in Fig 7B, where we look at the average absolute deviation between predicted and measured C_ijk, as a function of the group size N. With increasing N the ratio between the total number of (predicted) three-point correlations and (fitted) model parameters is increasing (from roughly 2 at N = 10 to roughly 40 at N = 120), leading us to believe that predictions will grow progressively worse. Nevertheless, the average error in three-point prediction stays constant with network size, for both pairwise and K-pairwise models. An attractive explanation is that, as N increases, the models encompass larger and larger fractions of the interacting neural patch and thus decrease the effects of “hidden” units, neurons that are present but not included in the model; such unobserved units, even if they only interacted with other units in a pairwise fashion, could introduce effective higher-order interactions between observed units, thereby causing three-point correlation predictions to deviate from those of the pairwise model [61]. The accuracy of the K-pairwise predictions is not quite as good as the errors in our measurements (dashed line in Fig 7B), but still very good, a clear improvement over the pairwise model.

Maximum entropy models assign an effective energy to every possible combination of spiking and silence in the network, E = H({σ_i}) from Eq (20).
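For concreteness, the energy assignment of Eq (20) is only a few lines of code. In the illustrative sketch below (our names, our storage conventions), V is stored as a table over the N + 1 possible values of K, and J is symmetric with zero diagonal, with each pair counted once as in the sampler above; if Eq (20)'s double sum over all i, j is intended instead, the factor of 1/2 should be dropped.

    import numpy as np

    def k_pairwise_energy(sigma, h, J, V):
        """Energy of one activity pattern under the K-pairwise model, Eq (20).

        sigma : array of +/-1 of length N
        V     : array of length N+1, the potential at K = 0 .. N spikes
        """
        K = int((sigma == 1).sum())           # number of spiking neurons
        pair = 0.5 * sigma @ J @ sigma        # sum over pairs i < j (zero diagonal)
        return -h @ sigma - pair - V[K]

    def sorted_energies(patterns, h, J, V):
        """Energies of a set of patterns; np.searchsorted on the result then
        gives the cumulative distributions C_<(E) and C_>(E) of Eqs (22)-(24)."""
        return np.sort([k_pairwise_energy(s, h, J, V) for s in patterns])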
Learning the model means specifying all the parameters in this expression, so that the mapping from states to energies is completely determined. The energy determines the probability of the state, and while we can't estimate the probabilities of all possible states, we can ask whether the distribution of energies that we see in the data agrees with the predictions of the model. Thus, if we have a set of states drawn out of a distribution Q({σ_i}), we can count the number of states that have energies lower than E,

  C_<(E) = Σ_{{σ_i}} Q({σ_i}) Θ[E − H({σ_i})],   (22)

where Θ(x) is the Heaviside step function,

  Θ(x > 0) = 1;  Θ(x < 0) = 0.   (23)

Similarly, we can count the number of states that have energy larger than E,

  C_>(E) = Σ_{{σ_i}} Q({σ_i}) Θ[H({σ_i}) − E].   (24)

Now we can take the distribution Q({σ_i}) to be the distribution of states that we actually see in the experiment, or we can take it to be the distribution predicted
by the model, and if the model is accurate we should find that the cumulative distributions are similar in these two cases. Results are shown in Fig 8.

FIG. 7: Predicted vs real connected three-point correlations, C_ijk from Eq (21). (A) Measured C_ijk (x-axis) vs predicted by the model (y-axis), shown for an example 100 neuron subnetwork. The ∼1.6 × 10^5 triplets are binned into 1000 equally populated bins; error bars in x are s.d. across the bin. The corresponding values for the predictions are grouped together, yielding the mean and the s.d. of the prediction (y-axis). Inset shows a zoom-in of the central region, for the K-pairwise model. (B) Error in predicted three-point correlation functions as a function of subnetwork size N. Shown are mean absolute deviations of the model prediction from the data, for pairwise (black) and K-pairwise (red) models; error bars are s.d. across 30 subnetworks at each N, and the dashed line shows the mean absolute difference between two halves of the experiment. Inset shows the distribution of three-point correlations (grey filled region) and the distribution of differences between two halves of the experiment (dashed line); note the logarithmic scale.

We see that the distribution of energies in the data and the model are very similar. There is an excellent match in the “low energy” (high probability) region, and then as we look at the high energy tail (C_>(E)) we see that theory and experiment match far out into the tail. Thus the distribution of energies, which is an essential construct of the model, seems to match the data across >90% of the states that we see.

The successful prediction of the cumulative distribution C_>(E) is especially striking because it extends to E ∼ 25. At these energies, the probability of any single state is predicted to be e^−25 ∼ 10^−11, which means that these states should occur roughly once per fifty years (!). This seems ridiculous—what are such rare states doing in our analysis, much less as part of the claim that theory and experiment are in quantitative agreement? The key is that there are many, many of these rare states—so many, in fact, that the theory predicts we should see a substantial number of them in an experiment of this length. Among the combinations of spiking and silence that we observe in a population of N = 120 neurons, a large fraction of the distinct patterns occur only once, which means we really don't know anything about their probability of occurrence. We can't say that the probability of any one of these rare states is being predicted correctly by the model, since we can't measure it, but we can say that the distribution of (log) probabilities—that is, the distribution of energies—across the set of observed states is correct, down to the ∼10% level. The model thus is predicting things far beyond what can be inferred directly from commonly observed patterns of activity.

FIG. 8: Predicted vs real distributions of energy, E. (A) The cumulative distribution of energies, C_<(E) from Eq (22), for the K-pairwise models (red) and the data (black), in a population of 120 neurons. Inset shows the high energy tails of the distribution, C_>(E) from Eq (24); dashed line denotes the energy that corresponds to the probability of seeing the pattern once in an experiment. (B) Relative difference in the first two moments (mean, ⟨E⟩, dashed; standard deviation, σ_E = √(⟨E²⟩ − ⟨E⟩²), solid) of the distribution of energies evaluated over real data and a sample from the corresponding model (black = pairwise; red = K-pairwise). Error bars are s.d. over 30 subnetworks at a given size N.

Finally, the structure of the models we are considering is that the state of each neuron—an Ising spin—experiences an “effective field” from all the other spins, determining the probability of spiking vs. silence. This effective field consists of an intrinsic bias for each neuron, plus the effects of interactions with all the other neurons:

  h_eff,i = (1/2) [ H(σ_1, …, σ_i = −1, …, σ_N) − H(σ_1, …, σ_i = +1, …, σ_N) ].   (25)

If the model is correct, then the probability of spiking is simply related to the effective field,

  P(σ_i = +1 | h_eff,i) = 1 / (1 + e^{−2 h_eff,i}).   (26)
FIG. 9: Effective field and spiking probabilities in a network of N = 120 neurons. Given any configuration of N − 1 neurons, the model predicts the spiking probability of the remaining N-th neuron through Eqs (25, 26); the effective field h_eff is fully determined by the parameters of the maximum entropy model and the state of the network. For each activity pattern in the recorded data we computed the effective field, and binned these values (shown on the x-axis). For every bin we estimated from data the probability that the N-th neuron spiked (black circles; error bars are s.d. across 120 cells). This is compared with a parameter-free prediction (red line) from Eq (26). The gray shaded region shows the distribution of the values of h_eff over all 120 neurons and all patterns in the data.

To test this relationship, we can choose one neuron, compute the effective field from the states of all the other neurons, at every moment in time, then collect all those moments when h_eff is in some narrow range, and see how often the neuron spikes. We can then repeat this for every neuron, in turn. If the model is correct, spiking probability should depend on the effective field according to Eq (26). We emphasize that there are no new parameters to be fit, but rather a parameter-free relationship to be tested. The results are shown in Fig 9. We see that, throughout the range of fields that are well sampled in the experiment, there is good agreement between the data and Eq (26). As we go into the tails of the distribution, we see some deviations, but error bars also are (much) larger.
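A sketch of this test, reusing k_pairwise_energy from above, is given below (illustrative Python; the binning scheme and all names are ours):

    import numpy as np

    def effective_field(i, sigma, h, J, V):
        """h_eff,i of Eq (25): half the energy difference between silence and spiking."""
        up, down = sigma.copy(), sigma.copy()
        up[i], down[i] = 1, -1
        return 0.5 * (k_pairwise_energy(down, h, J, V) - k_pairwise_energy(up, h, J, V))

    def check_logistic_link(data, h, J, V, n_bins=20):
        """Measured spike fraction vs h_eff, against the prediction of Eq (26).

        data : array of shape (T, N) with entries +/-1
        """
        fields = np.array([[effective_field(i, s, h, J, V) for i in range(len(s))]
                           for s in data])
        spikes = (data == 1).ravel().astype(float)
        edges = np.linspace(fields.min(), fields.max(), n_bins + 1)
        centers = 0.5 * (edges[:-1] + edges[1:])
        idx = np.clip(np.digitize(fields.ravel(), edges) - 1, 0, n_bins - 1)
        measured = np.array([spikes[idx == b].mean() if (idx == b).any() else np.nan
                             for b in range(n_bins)])
        predicted = 1.0 / (1.0 + np.exp(-2.0 * centers))  # Eq (26), no free parameters
        return centers, measured, predicted

Note that predicted depends only on the bin centers: as the text emphasizes, there is nothing to fit at this stage, so any agreement between measured and predicted is a genuine test of the model.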
V. WHAT DO THE MODELS TEACH US?

We have seen that it is possible to construct maximum entropy models which match the mean spike probabilities of each cell, the pairwise correlations, and the distribution of summed activity in the network, and that our data are sufficient to ensure that all the parameters of these models are well determined, even when we consider groups of N = 100 neurons or more. Figures 7 through 9 indicate that these models give a fairly accurate description of the distribution of states—the myriad combinations of spiking and silence—taken on by the network as a whole. In effect we have constructed a statistical mechanics for these networks, not by analogy or metaphor but in quantitative detail. We now have to ask what we can learn about neural function from this description.

A. Basins of attraction
In the Hopfield model, the dynamics of the neural network corresponds to motion on an energy surface. Simple learning rules can sculpt the energy surface to generate multiple local minima, or attractors, into which the system can settle. These local minima can represent stored memories, or the solutions to various computational problems [105, 106]. If we imagine monitoring a Hopfield network over a long time, the distribution of states that it visits will be dominated by the local minima of the energy function. Thus, even if we can't take the details of the dynamical model seriously, it still should be true that the energy landscape determines the probability distribution over states in a Boltzmann-like fashion, with multiple energy minima translating into multiple peaks of the distribution.

In our maximum entropy models, we find a range of J_ij values encompassing both signs (Figs 2D and F), as in spin glasses [102]. The presence of such competing interactions generates “frustration,” where (for example) triplets of neurons cannot find a combination of spiking and silence that simultaneously minimizes all the terms in the energy function [4]. In the simplest model of spin glasses, these frustration effects, distributed throughout the system, give rise to a very complex energy landscape, with a proliferation of local minima [102]. Our models are not precisely Hopfield models, nor are they instances of the standard (more random) spin glass models. Nonetheless, by looking at the pairwise J_ij terms in the energy function of our models, we find that 48 ± 2% of all interacting triplets of neurons are frustrated, across different subnetworks of various sizes.

FIG. 10: The number of identified metastable patterns. Every recorded pattern is assigned to its basin of attraction by descending on the energy landscape. The number of distinct basins is shown as a function of the network size, N, for K-pairwise models (black line). Gray lines show the subsets of those basins that are encountered multiple times in the recording (more than 10 times, dark gray; more than 100 times, light gray). Error bars are s.d. over 30 subnetworks at every N. Note the logarithmic scale for the number of MS states.

To find the local minima of the landscape, we take each pattern of activity observed in the data and move “downhill” on the energy function H({σ_i}) from Eq (20) (see Appendix C). When we can no longer move downhill, we have identified a locally stable pattern of activity, or a “metastable state,” MS_α = {σ_i^α}, such that a flip of any single spin—switching the state of any one neuron from spiking to silent, or vice versa—increases the energy or decreases the probability of the new state. This procedure also partitions the space of all 2^N possible patterns into domains, or basins of attraction, centered on the metastable states, and compresses the microscopic description of the retinal state to a number α identifying the basin to which that state belongs.

Figure 10 shows how the number of metastable states that we identify in the data grows with the size N of the network. At very small N, the only stable configuration is the all-silent state, but for N > 30 the metastable states start to proliferate. Indeed, we see no sign that the number of metastable states is saturating, and the growth is certainly faster than linear in the number of neurons. Moreover, the total numbers of possible metastable states in the models' energy landscapes could be substantially higher than shown, because we only count those states that are accessible by descending from patterns observed in the experiment. It thus is possible that these real networks exceed the “capacity” of model networks [2, 3].

Figure 11A provides a more detailed view of the most prominent metastable states, and the “energy valleys” that surround them. The structure of the energy valleys can be thought of as clustering the patterns of neural activity, although in contrast to the usual formulation of clustering we don't need to make an arbitrary choice of metric for similarity among patterns. Nonetheless, we can measure the overlap C_µν between all pairs of patterns {σ_i^µ} and {σ_i^ν} that we see in the experiment,

  C_µν = (1/N) Σ_{i=1}^{N} σ_i^µ σ_i^ν,   (27)

and we find that patterns which fall into the same valley are much more correlated with one another than they are with patterns that fall into other valleys (Fig 11B). If we start at one of the metastable states and take a random “uphill” walk in the energy landscape (Appendix C), we eventually reach a transition state where there is a downhill path into other metastable states, and a selection of these trajectories is shown in Fig 11C. Importantly, the transition states are at energies quite high relative to the metastable states (Fig 11D), so the peaks of the probability distribution are well resolved from one another. In many cases it takes a large number of steps to find the transition state, so that the metastable states are substantially separated in Hamming distance.

Individual neurons in the retina are known to generate rather reproducible responses to naturalistic stimuli [34, 45], but even a small amount of noise in the response of single cells is enough to ensure that groups of N = 100 neurons almost never generate the same response to two repetitions of the same visual stimulus. It is striking, then, that when we show the same movie again, the retina revisits the same basin of attraction with very high probability, as shown in Fig 12. The same metastable states and corresponding valleys are identifiable from different subsets of the full population, providing a measure of redundancy that we explore more fully below. Further, the transitions into and out of these valleys are very rapid, with a time scale of just ∼2.5 ∆τ. In summary, the neural code for visual signals seems to respect the structure inferred from the energy landscape, despite the fact that the energy landscape is constructed without reference to the visual stimuli.
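One simple variant of the downhill descent described above is steepest descent on single spin flips, sketched below (illustrative Python, reusing k_pairwise_energy from above; Appendix C describes the procedure actually used):

    import numpy as np

    def descend(sigma, h, J, V):
        """Greedy single-spin-flip descent on the K-pairwise energy, Eq (20).

        Repeatedly flips the spin that lowers the energy the most, and stops
        when no single flip decreases the energy. The fixed point is the
        metastable state whose basin contains the starting pattern.
        """
        sigma = sigma.copy()
        E = k_pairwise_energy(sigma, h, J, V)
        while True:
            trial_E = []
            for i in range(len(sigma)):
                sigma[i] = -sigma[i]
                trial_E.append(k_pairwise_energy(sigma, h, J, V))
                sigma[i] = -sigma[i]
            best = int(np.argmin(trial_E))
            if trial_E[best] >= E:
                return sigma              # local minimum: a metastable state
            sigma[best] = -sigma[best]
            E = trial_E[best]

    # assigning every observed pattern to its basin of attraction:
    # basins = {tuple(descend(s, h, J, V)) for s in data}

Because the energy strictly decreases at every step and the state space is finite, the descent always terminates, and counting the distinct fixed points reached from the data reproduces the kind of census shown in Fig 10.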
FIG. 11: Energy landscape in a N = 120 neuron K-pairwise model. (A) The 10 most frequently occurring metastable (MS) states (active neurons for each in red), and 50 randomly chosen activity patterns for each MS state (black dots represent spikes). MS 1 is the all-silent basin. (B) The overlaps, C_µν, between all pairs of identified patterns belonging to basins 2, …, 10. (C) The structure of the energy landscape explored with Monte Carlo. Starting in the all-silent state, single spin-flip steps are taken until the configuration crosses the energy barrier into another basin. Here, two such paths are depicted (green, ultimately landing in the basin of MS 9; purple, landing in the basin of MS 5) as projections into 3D space of scalar products (overlaps) with MS 1, 5, and 9. (D) The detailed structure of the energy landscape. The 10 MS patterns from (A) are shown in the energy (y-axis) vs log basin size (x-axis) diagram (silent state at lower right corner). At left, transitions frequently observed in MC simulations starting in each of the 10 MS states, as in (C). The most frequent transitions are decays to the silent state. Other frequent transitions (and their probabilities) are shown using vertical arrows between respective states. Typical transition statistics (for MS 3 decaying into the silent state) shown in the inset: the distribution of spin-flip attempts needed, P(L), and the distribution of energy barriers, P(E*), over 1000 observed transitions.
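The overlap matrix of Eq (27), used to display the valley structure in Fig 11B, is immediate to compute (an illustrative sketch; names are ours):

    import numpy as np

    def overlaps(patterns):
        """Pairwise overlaps C_munu of Eq (27).

        patterns : array of shape (M, N) with entries +/-1; returns (M, M).
        """
        P = np.asarray(patterns, dtype=float)
        return P @ P.T / P.shape[1]

Sorting the rows and columns by basin label (e.g., by the metastable state returned by descend above) makes the block structure of Fig 11B visible: within-basin overlaps are large, between-basin overlaps small.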
Central to our understanding of neural coding is theentropy of the responses [40]. Conceptually, the en-tropy measures the size of the neural vocabulary: with N neurons there are 2 N possible configurations of spik-ing and silence, but since not all of these have equalprobabilities—some, like the all-silent pattern, may occurorders of magnitude more frequently than others, such asthe all-spikes pattern—the effective number of configura-tions is reduced to 2 S ( N ) , where S ( N ) is the entropy ofthe vocabulary for the network of N neurons. Further-more, if the patterns of spiking and silence really arecodewords for the stimulus, then the mutual informationbetween the stimulus and response, I ( { σ i } ; stimulus), canbe at most the entropy of the codewords, S [ P ( { σ i } )].Thus, the entropy of the system’s output bounds the in-formation transmission. This is true even if the outputwords are correlated in time; temporal correlations implythat the entropy of state sequences is smaller than ex-pected from the entropy of single snapshots, as studiedhere, and hence the limits on information transmissionare even more stringent [14].We cannot sample the distribution—and thus esti-mate the entropy directly—for large sets of neurons,but we know that maximum entropy models with con-straints { f µ } put an upper bound to the true entropy, S [ P ( { σ i } )] ≤ S [ P ( { f µ } ) ( { σ i } )]. Unfortunately, even com-puting the entropy of our model distribution is not sim-ple. Naively, we could draw samples out of the modelvia Monte Carlo, and since simulations can run longerthan experiments, we could hope to accumulate enough samples to make a direct estimate of the entropy, per-haps using more sophisticated methods for dealing withsample size dependences [82]. But this is terribly ineffi-cient (see Appendix D). An alternative is to make morethorough use of the mathematical equivalence betweenmaximum entropy models and statistical mechanics.The first approach to entropy estimation involves ex-tending our maximum entropy models of Eq (2) by in-troducing a parameter analogous to the temperature T in statistical physics: P ( { f µ } ) T ( { σ i } ) = 1 Z T ( { g µ } ) e −H ( { σ i } ) /T . (28)Thus, for T = 1, the distribution in Eq (28) is exactlyequal to the maximum entropy model with parameters { g µ } , but by varying T and keeping the { g µ } constant,we access a one-parameter family of distributions. Un-like in statistical physics, T here is purely a mathemat-ical device, and we do not consider the distributions at T (cid:54) = 1 as describing any real network of neurons. One cannevertheless compute, for each of these distributions attemperature T , the heat capacity C ( T ), and then ther-modynamics teaches us that C ( T ) = T ∂S ( T ) /∂T ; wecould thus invert this relation to compute the entropy: S [ P ( { f µ } ) ] = S ( T = 1) = (cid:90) C ( T ) T dT. (29)The heat capacity might seem irrelevant since thereis no “heat” in our problem, but this quantity is di-rectly related to the variance of energy, C ( T ) = σ E /T ,with σ E as in Fig 8. The energy, in turn, is related to3 MS patterns neu r on s P ( M S | t ) t [s] 00.20.40.60.810 100 20001 t [ms] au t o c o rr . s il e n t M S M S c oh e r e n ce τ = 48 ms AB CD
FIG. 12:
Basin assignments are reproducible across stimulus repeats and across subnetworks. (A) Most frequently occurring MS patterns collected from 30 subnetworks of size N = 120 out of a total population of 160 neurons; patterns have been clustered into 12 clusters (colors). (B) The probability (across stimulus repeats) that the population is in a particular basin of attraction at any given time. Each line corresponds to one pattern from (A); patterns belonging to the same cluster are depicted in the same color. Inset shows the detailed structure of several transitions out of the all-silent state; overlapping lines of the same color show that the same transition is identified robustly across different subnetwork choices of 120 neurons out of 160. (C) On about half of the time bins, the population is in the all-silent basin; on the remaining time bins, the coherence (the probability of being in the dominant basin divided by the probability of being in every possible non-silent basin) is high. (D)
The average autocorrelation function of traces in (B), showing the typical time the population stays within a basin (dashed red line is the best exponential fit with τ = 48 ms, or about 2.5 time bins).

the logarithm of the probability, and hence the heat capacity is the variance in how surprised we should be by any state drawn out of the distribution. In practice, we can draw sample states from a Monte Carlo simulation, compute the energy of each such state, and estimate the variance over a long simulation. Importantly, it is well known that such estimates stabilize long before we have collected enough samples to visit every state of the system [72]. Thus, we start with the inferred maximum entropy model, generate a dense family of distributions at different T spanning the values from 0 to 1, and, from each distribution, generate enough samples to estimate the variance of energy and thus C(T); finally, we do the integral in Eq (29).

Interestingly, the mapping to statistical physics gives us other, independent ways of estimating the entropy. The most likely state of the network, in all the cases we have explored, is complete silence. Further, in the K-pairwise models, this probability is reproduced exactly, since it is just P_N(K = 0). Mathematically, this probability is given by

P_silence = (1/Z) exp[−E(silence)],   (30)

where the energy of the silent state is easily computed from the model just by plugging in to the Hamiltonian in Eq (20); in fact, we could choose our units so that the silent state has precisely zero energy, making this even easier. But then we see that, in this model, estimating the probability of silence (which we can do directly from the data) is the same as estimating the partition function Z, which usually is very difficult since it involves summing over all possible states. Once we have Z, we know from statistical mechanics that

−ln Z = ⟨E⟩ − S,   (31)

and we can estimate the average energy from a single Monte Carlo simulation of the model at the "real" T = 1 (c.f. Fig 8).

Finally, there are more sophisticated Monte Carlo resampling methods that generate an estimate of the "density of states" [83], related to the cumulative distributions C_<(E) and C_>(E) discussed above, and from this density we can compute the partition function directly. As explained in Appendix D, the three different methods of entropy estimation agree to better than 1% on groups of N = 120 neurons.

Figure 13A shows the entropy per neuron of the K-pairwise model as a function of network size, N. For comparison, we also plot the independent entropy, i.e., the entropy of the non-interacting maximum entropy model that matches the measured mean spike rates.
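To make the heat capacity route concrete, the following minimal sketch (our code, not the authors') carries out the integration of Eq (29) for a small pairwise Ising model, where exact enumeration is possible and the answer can be checked directly; in the real application the variance of the energy at each T comes from Monte Carlo samples rather than from exact probabilities, and all parameters below are hypothetical random numbers rather than fitted values.

```python
# Sketch of entropy estimation by heat capacity integration, Eq (29),
# checked against exact enumeration on a small pairwise Ising model.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N = 10                                        # small enough to enumerate 2^N states
h = rng.normal(0.0, 0.5, N)                   # assumed fields
J = np.triu(rng.normal(0.0, 0.3, (N, N)), 1)  # assumed couplings, i < j

states = np.array(list(product([-1, 1], repeat=N)))
E = -states @ h - np.einsum('si,ij,sj->s', states, J, states)  # H({sigma})

def probs(T):
    """Exact Boltzmann probabilities of P_T, Eq (28), by enumeration."""
    logw = -E / T
    return np.exp(logw - np.logaddexp.reduce(logw))

def heat_capacity(T):
    """C(T) = (variance of energy)/T^2; with Monte Carlo one would
    estimate the variance from samples instead of exact probabilities."""
    p = probs(T)
    Em = p @ E
    return p @ (E - Em) ** 2 / T**2

# S(1) = integral_0^1 C(T)/T dT (in nats); the integrand vanishes as
# T -> 0, so starting the grid slightly above zero is harmless.
Ts = np.linspace(0.02, 1.0, 200)
vals = np.array([heat_capacity(T) / T for T in Ts])
S_bits = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(Ts)) / np.log(2)

p1 = probs(1.0)                               # exact entropy for comparison
S_exact = -np.sum(p1 * np.log2(p1))
print(S_bits, S_exact)                        # the two should agree closely
```

FIG. 13: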
Entropy and multi-information from the K-pairwise model. (A) Independent entropy per neuron, S^(1)/N, in black, and the entropy of the K-pairwise models per neuron, S^(1,2,K)/N, in red, as a function of N. Dashed lines are fits from (B). (B) Independent entropy scales linearly with N (black dashed line). Multi-information I_N of the K-pairwise models is shown in dark red. Dashed red line is the best quadratic fit for the dependence of log I_N on log N; this can be rewritten as I_N ∝ N^{γ(N)}, where γ(N) (shown in inset) is the effective scaling of multi-information with system size N. In both panels, error bars are s.d. over 30 subnetworks at each size N.
Even a difference of ∆S = 0.05 bits of entropy per neuron means that for N = 200 neurons the vocabulary of neural responses is restricted by a factor of 2^{N∆S} ∼ 10^3. The multi-information,

I_N = S[P^(1)({σ_i})] − S[P^(1,2,K)({σ_i})],   (32)

measures the amount of statistical structure between N neurons due to pairwise interactions and the K-spike constraint. As we see in Fig 13B, the multi-information initially grows quadratically (γ = 2) as a function of N. While this growth slows as N increases, it is still faster than linear (γ > 1), so that at N = 120 neurons we have not yet reached the extensive scaling regime where the entropy per neuron would be constant. These results are consistent with suggestions in Ref [4] based on much smaller groups of cells; in particular, the changeover towards extensive entropy growth could happen at a scale of N of order a few hundred neurons.

C. Coincidences and surprises
Usually we expect that, as the number of elements N in a system becomes large, the entropy S(N) becomes proportional to N, and the distribution becomes nearly uniform over ∼ 2^{S(N)} states. This is the concept of "typicality" in information theory [73] and the "equivalence of ensembles" in statistical physics [74, 75]. At N = 120 we have S(N) = 19.9 ± 0.58 bits, so that 2^S ∼ 10^6, and for the full N = 160 neurons in our experiment the number of states is even larger. In a uniform distribution, if we pick two states at random, the probability that these states are the same is given by P_c = 2^{−S(N)}. On the hypothesis of uniformity, this probability is sufficiently small that large groups of neurons should never visit the same state twice during the course of a one hour experiment. In fact, if we choose two moments in time at random from the experiment, the probability that even the full 160-neuron state that we observe will be the same is a few percent (Fig 14A), many orders of magnitude larger than the uniform prediction.
FIG. 14:
Coincidence probabilities. (A) The probability that the combination of spikes and silences is exactly the same at two randomly chosen moments of time, as a function of the size of the population. The real networks are orders of magnitude away from the predictions of an independent model, and this behavior is captured precisely by the K-pairwise model. (B)
Extrapolating the N dependence of P_c to large N.

Figure 14 shows how the coincidence probability depends on N. We expect that the negative logarithm of the coincidence probability, like the entropy itself, will grow linearly with N; equivalently, we should see an exponential decay of the coincidence probability as we increase the size of the system. This is exactly true if the neurons are independent, even if different cells have different probabilities of spiking, provided that we average over possible choices of N neurons out of the population. But the real networks are far from this prediction, as we can see in Fig 14A. Larger and larger groups of neurons do seem to approach a "thermodynamic limit" in which −ln P_c ∝ N (Fig 14B), but the limiting ratio −(ln P_c)/N is of order 10^{-2}.
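Both the empirical coincidence probability and its independent-neuron counterpart are easy to compute; here is a minimal sketch (our variable names, with surrogate data standing in for the binarized recordings). For independent cells, the probability that two random time bins match is the product over cells of p_i^2 + (1 − p_i)^2.

```python
# Minimal sketch of the two curves in Fig 14A; `spikes` is assumed to be
# a binary (0/1) array of shape (T_bins, N) built from the recordings.
import numpy as np

def coincidence_probability(spikes):
    """Plug-in estimate of the probability that two randomly chosen
    time bins show the identical spiking/silence pattern."""
    words, counts = np.unique(spikes, axis=0, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p ** 2)

def coincidence_independent(spikes):
    """Same quantity if neurons were independent: a product over cells
    of the probability that cell i agrees with itself across two bins."""
    p_spike = spikes.mean(axis=0)
    return np.prod(p_spike ** 2 + (1 - p_spike) ** 2)

# toy usage with surrogate data (replace with the binarized recordings)
rng = np.random.default_rng(1)
spikes = (rng.random((280_000, 20)) < 0.03).astype(np.int8)
print(coincidence_probability(spikes), coincidence_independent(spikes))
```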
D. Redundancy and predictability

In the retina we usually think of neurons as responding to the visual stimulus, and so it is natural to summarize their response as spike rate vs. time in a (repeated) movie, the post-stimulus time histogram (PSTH). We can do this for each of the cells in the population that we study; one example is in the top row of Fig 15A. This example illustrates common features of neural responses to naturalistic sensory inputs: long epochs of near zero spike probability, interrupted by brief transients containing a small number of spikes [84]. Can our models predict this behavior, despite the fact that they make no explicit reference to the visual input?

The maximum entropy models that we have constructed predict the distribution of states taken on by the network as a whole, P({σ_i}). From this we can construct the conditional distribution, P(σ_i | {σ_j≠i}), which tells us the probability of spiking in one cell given the current state of all the other cells, and hence we have a prediction for the spike probability in one neuron at each moment in time. Further, we can repeat this construction using not all the neurons in the network, but only a group of N, with variable N.

As the stimulus movie proceeds, all of the cells in the network are spiking, dynamically, so that the state of the system varies. Through the conditional distribution P(σ_i | {σ_j≠i}), this varying state predicts a varying spike probability for the one cell in the network on which we are focusing, and we can plot this predicted probability
FIG. 15:
Predicting the firing probability of a neuron from the rest of the network. (A)
Probability per unit time (spike rate) of a single neuron. Top, in red, experimental data. Lower traces, in black, predictions based on states of other neurons in an N-cell group, as described in the text. Solid lines are the mean prediction across all trials, and thin lines are the envelope ± one standard deviation. (B) Cross-correlation (CC) between predicted and observed spike rates vs. time, for each neuron in the N = 120 group. Green points are averages of CC computed from every trial, whereas blue points are the CC computed from average predictions. (C) Dependence of CC on the population size N. Thin blue lines follow single neurons as predictions are based on increasing population sizes; red line is the cell illustrated in (A), and the line with error bars shows mean ± s.d. across all cells. Green line shows the equivalent mean behavior computed for the green points in (B).

vs. time in the same way that we would plot a conventional PSTH. On each repeat of the movie, the states of the network are slightly different, and hence the predicted PSTH is slightly different. What we see in Fig 15A is that, as we use more and more neurons in the network to make the prediction, the PSTH based on collective effects alone, trial by trial, starts to look more and more like the real PSTH obtained by averaging over trials. In particular, the predicted PSTH has near zero spike probability over most of the time, the short epochs of spiking are at the correct moments, and these epochs have the sharp onsets observed experimentally. These are features of the data which are very difficult to reproduce in models that, for example, start by linearly filtering the visual stimulus through a receptive field [85–89]. In contrast, the predictions in Fig 15 make no reference to the visual stimulus, only to the outputs of other neurons in the network.

We can evaluate the predictions of spike probability vs. time by computing the correlation coefficient between our predicted PSTH and the experimental PSTH, as has been done in many other contexts [85, 90, 91]. Since we generate a prediction for the PSTH on every presentation of the movie, we can compute the correlation from these raw predictions and then average, or average the predictions and then compute the correlation; results are shown in Figs 15B and C. We see that correlation coefficients can reach ∼
0.8, on average, or even higher for particular cells. Predictions seem of more variable quality for cells with lower average spike rate, but this is a small effect. The quality of average predictions, as well as the quality of single-trial predictions, still seems to grow gradually as we include more neurons, even at N ∼ 120.
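In the pairwise model, the conditional distribution used for these predictions has a simple closed form: flipping σ_i changes the energy by an amount linear in the local field, so P(σ_i = +1 | {σ_j≠i}) is a logistic function of that field. A minimal sketch with assumed parameter arrays (our notation; the K-pairwise model would add the V(K) contribution to the energies of the two candidate states):

```python
# Sketch of the prediction / error-correction step for a pairwise model.
# `h` (N,) and `J` (N, N, upper-triangular) are assumed fitted parameters;
# `sigma` is one observed pattern in the +/-1 convention.
import numpy as np

def p_spike_given_rest(i, sigma, h, J):
    """P(sigma_i = +1 | rest) for H = -sum h*s - sum_{i<j} J*s*s.
    Flipping sigma_i changes the energy by -2*(h_i + sum_j J_ij sigma_j),
    so the conditional is a logistic function of the local field."""
    Jsym = J + J.T                      # symmetrize the couplings
    field = h[i] + Jsym[i] @ sigma - Jsym[i, i] * sigma[i]
    return 1.0 / (1.0 + np.exp(-2.0 * field))

# usage: predicted spike probability of cell 0 from the rest of a pattern
rng = np.random.default_rng(2)
N = 120
h, J = rng.normal(0, 0.1, N), np.triu(rng.normal(0, 0.05, (N, N)), 1)
sigma = rng.choice([-1, 1], size=N)
print(p_spike_given_rest(0, sigma, h, J))
```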
VI. DISCUSSION

It is widely agreed that neural activity in the brain is more than the sum of its parts: coherent percepts, thoughts, and actions require the coordinated activity of many neurons in a network, not the independent activity of many individual neurons. It is not so clear, however, how to build bridges between this intuition about collective behavior and the activity of individual neurons.

One set of ideas is that the activity of the network as a whole may be confined to some very low dimensional trajectory, such as a global, coherent oscillation. Such oscillatory activity is observable in the summed electrical activity of large numbers of neurons (the EEG) and should be reflected as oscillations in the (auto-)correlation functions of spike trains from individual neurons. On a more refined level, dimensionality reduction techniques like PCA allow the activity patterns of a neural network to be viewed on a low-dimensional manifold, facilitating visualization and intuition [92–95]. A very different idea is provided by the Hopfield model, in which collective behavior is expressed in the stabilization of many discrete patterns of activity, combinations of spiking and silence across the entire network [1, 2]. Taken together, these many patterns can span a large fraction of the full space of possibilities, so that there need be no dramatic dimensionality reduction in the usual sense of this term.

The claim that a network of neurons exhibits collective behavior is really the claim that the distribution of states taken on by the network has some nontrivial structure that cannot be factorized into contributions from individual cells or perhaps even smaller subnetworks. Our goal in this work has been to build a model of this distribution, and to explore the structure of that model. We emphasize that building a model is, in this view, the first step rather than the last step. But building a model is challenging, because the space of states is very large and data are limited.

An essential step in searching for collective behavior has been to develop experimental techniques that allow us to record not just from a large number of neurons, but from a large fraction of the neurons in a densely interconnected region of the retina [34, 67]. In large networks, even measuring the correlations among pairs of neurons can become problematic: individual elements of the correlation matrix might be well determined from small data sets, but much larger data sets are required to be confident that the matrix as a whole is well determined. Thus, long, stable recordings are even more crucial than usual. To use the maximum entropy approach, we have to be sure that we can actually find the models that reproduce the observed expectation values (Figs 2, 3) and that we have not, in the process, fit to spurious correlations that arise from the finite size of our data set (Fig 4). Once these tests are passed, we can start to assess the accuracy of the model as a description of the network as a whole. In particular, we found that the pairwise model began to break down at a network size N ≥
40 (Fig 5). However, by adding the constraint that reproduces the probability of K out of N neurons spiking synchronously (Fig 6), which is a statistic that is well-sampled and does not greatly increase the model's complexity, we could again recover good performance (Figs 7–9).

Perhaps the most useful global test of our models is to ask about the distribution of state probabilities: how often should we see combinations of spiking and silence that occur with probability P? This has the same flavor as asking for the probability of every state, but does not suffer from the curse of dimensionality. Since maximum entropy models are mathematically identical to the Boltzmann distribution in statistical mechanics, this question about the frequency of states with probability P is the same as asking how many states have a given energy E; we can avoid binning along the E axis by asking for the number of states with energies smaller (higher probability) or larger (lower probability) than E. Figure 8 shows that these cumulative distributions computed from the model agree with experiment far into the tail of low probability states. These states are so rare that, individually, they almost never occur, but there are so many of these rare states that, in aggregate, they make a measurable contribution to the distribution of energies. Indeed, most of the states that we see in the data are rare in this sense, and their statistical weight is correctly predicted by the model.

The maximum entropy models that we construct from the data do not appear to simplify along any conventional axes. The matrix of correlations among spikes in different cells (Fig 1A) is of full rank, so that principal component analysis does not yield significant dimensionality reduction. The matrix of "interactions" in the model (Fig 1D) is neither very sparse nor of low rank, perhaps because we are considering a group of neurons all located (approximately) within the radius of the typical dendritic arbor, so that all cells have a chance to interact with one another. Most importantly, the interactions that we find are not weak (Fig 1F), and together with being widespread this means that their impact is strong. Technically, we cannot capture the impact within low orders of perturbation theory (Appendix E), but qualitatively this means that the behavior of the network as a whole is not in any sense "close" to the behavior of non-interacting neurons. Thus, if our models work, it is not simply because correlations are weak, as had been suggested [32].

Having convinced ourselves that we can build models which give an accurate description of the probability distribution over the states of spiking and silence in the network, we can ask what these models teach us about function. As emphasized in Ref [4], one corollary of collective behavior is the possibility of error correction or pattern completion: we can predict the spiking or silence of one neuron by knowing the activity of all the other neurons. With a population of N = 100 cells, the quality of these predictions becomes quite high (Fig 15). The natural way of testing these predictions is to look at the probability of spiking vs.
time in the stimulus movie. Although we make no reference to the stimulus, we reproduce the sharp peaks of activity and extended silences that are so characteristic of the response to naturalistic inputs, and so difficult to capture in conventional models where each individual neuron responds to the visual stimulus as seen through its receptive field [85].

One of the dominant concepts in thinking about the retina has been the idea that the structure of receptive fields serves to reduce the redundancy of natural images and enhance the efficiency of information transmission to the brain [96–99] (but see [45, 100]). While one could argue that the observed redundancy among neurons is less than expected from the structure of natural images or movies, none of what we have described here would happen if the retina truly "decorrelated" its inputs. Far from being almost independent, the combinations of spiking and silence in different cells cluster into basins of attraction defined by the local minima of energy in our models, and the spiking of each neuron is highly predictable from the activity of other neurons in the network, even without reference to the visual stimulus (Fig 15).

With N = 120 neurons, our best estimate of the entropy corresponds to significant occupancy of roughly one million distinct combinations of spiking and silence. Each state could occur with a different probability, and (aside from normalization) there are no constraints; each of these probabilities could be seen as a separate parameter describing the network activity. It is appealing to think that there must be some simplification, that we won't need a million parameters, but it is not obvious that any particular simplification strategy will work. Indeed, there has been the explicit claim that the approach we have taken here will not work for large networks [32]. Thus, it may seem surprising that we can write down a relatively simple model, with all parameters determined from experiment, and have this model predict so much of the structure in the distribution of states. Surprising or not, it certainly is important that, as the community contemplates monitoring the activity of ever larger numbers of neurons [101], we can identify theoretical approaches that have the potential to tame the complexity of these large systems.

Some cautionary remarks about the interpretation of our models seem in order. Using the maximum entropy method does not mean there is some hidden force maximizing the entropy of neural activity, or that we are describing neural activity as being in something like thermal equilibrium; all we are doing is building maximally agnostic models of the probability distribution over states. Even in the context of statistical mechanics, there are infinitely many models for the dynamics of the system that will be consistent with the equilibrium distribution, so we should not take the success of our models to mean that the dynamics of the network corresponds to something like Monte Carlo dynamics on the energy landscape. It is tempting to look at the couplings J_ij between different neurons as reflecting genuine, mechanistic interactions, but even in the context of statistical physics we know that this interpretation need not be so precise: we can achieve a very accurate description of the collective behavior in large systems even if we do not capture every microscopic detail, and the interactions that we do describe in the most successful of models often are effective interactions mediated by degrees of freedom that we need not treat explicitly.
Finally, the fact that a maximum entropy model which matches a particular set of experimental observations is successful does not mean that this choice of observables (e.g., pairwise correlations) is unique or minimal. For all these reasons, we do not think about our models in terms of their parameters, but rather as a description of the probability distribution P({σ_i}) itself, which encodes the collective behavior of the system.

The striking feature of the distribution over states is its extreme inhomogeneity. The entropy of the distribution is not that much smaller than it would be if the neurons made independent decisions to spike or be silent, but the shape of the distribution is very different; the network builds considerable structure into the space of states, without sacrificing much capacity. The probability of the same state repeating is many orders of magnitude larger than expected for independent neurons, and this really is quite startling (Fig 14). If we extrapolate to the full population of ∼
250 neurons in this correlated, interconnected patch of the retina, the probability that two randomly chosen states of the system are the same is roughly one percent. Thus, some combination of spiking and silence across this huge population should repeat exactly every few seconds. This is true despite the fact that we are looking at the entire visual representation of a small patch of the world, and the visual stimuli are fully naturalistic. Although complete silence repeats more frequently, a wide range of other states also recur, so that many different combinations of spikes and silence occur often enough that we (or the brain) can simply count them to estimate their probability. This would be absolutely impossible in a population of nearly independent neurons, and it has been suggested that these repeated patterns provide an anchor for learning [12]. It is also possible that the detailed structure of the distribution, including its inhomogeneity, is matched to the statistical structure of visual inputs in a way that goes beyond the idea of redundancy reduction, occupying a regime in which strongly correlated activity is an optimal code [16, 17, 103].

Building a precise model of activity patterns required us to match the statistics of global activity (the probability that K out of N neurons spike in the same small window of time). Elsewhere we have explored a very simple model in which we ignore the identity of the neurons and match only this global behavior [37]. This model already has a lot of structure, including the extreme inhomogeneity that we have emphasized here. In the simpler model we can exploit the equivalence between maximum entropy models and statistical mechanics to argue that this inhomogeneity is equivalent to the statement that the population of neurons is poised near a critical surface in its parameter space, and we have seen hints of this from analyses of smaller populations as well [5, 8]. The idea that biological networks might organize themselves to critical points has a long history, and several different notions of criticality have been suggested [29]. A sharp question, then, is whether the full probability distributions that we have described here correspond to a critical system in the sense of statistical physics, and whether we can find more direct evidence for criticality in the data, perhaps without the models as intermediaries.

Finally, we note that our approach to building models for the activity of the retinal ganglion cell population is entirely unsupervised: we are making use only of structure in the spike trains themselves, with no reference to the visual stimulus. In this sense, the structures that we discover here are structures that could be discovered by the brain, which has no access to the visual stimulus beyond that provided by these neurons. While there are more structures that we could use (notably, the correlations across time), we find it remarkable that so much is learnable from just an afternoon's worth of data. As it becomes more routine to record the activity of such (nearly) complete sensory representations, it will be interesting to take the organism's point of view [40] more fully, and try to extract meaning from the spike trains in an unsupervised fashion.

Acknowledgments
We thank A Cavagna, I Giardina, T Mora, SE Palmer, GJ Stephens, and A Walczak for many helpful discussions. This work was supported in part by NSF Grants IIS-0613435, PHY-0957573, and CCF-0939370, and by NIH Grants R01 EY14196 and P50 GM071508. Additional support was provided by the Fannie and John Hertz Foundation, the Swartz Foundation, and the WM Keck Foundation.
Appendix A: Experimental methods
Electrophysiology.
We analyzed the recordings from tiger salamander (Ambystoma tigrinum) retinal ganglion cells responding to naturalistic movie clips, as in the experiments of Refs. [4, 34, 45]. In brief, animals were euthanized according to institutional animal care standards. The retina was isolated from the eye under dim illumination and transferred as quickly as possible into oxygenated Ringer's medium, in order to optimize the long-term stability of recordings. Tissue was flattened and attached to a dialysis membrane using polylysine. The retina was then lowered with the ganglion cell side against a multi-electrode array. Arrays were first fabricated in university cleanroom facilities [69]. Subsequently, production was contracted out to a commercial MEMS foundry for higher volume production (Innovative Micro Technologies, Santa Barbara, CA). Raw voltage traces were digitized and stored for off-line analysis using a 252-channel preamplifier (MultiChannel Systems, Germany). The recordings were sorted using custom spike sorting software developed specifically for the new dense array [34]. 234 neurons passed the standard tests for waveform stability and the lack of refractory period violations. Of those, 160 cells whose firing rates were most stable across stimulus repeats were selected for further analysis. Within this group, the mean fraction of interspike intervals (ISI) shorter than 2 ms (i.e., possible refractory violations) was of order 10^{-3}.

Stimulus display.
The stimulus consisted of a short (t = 19 s) grayscale movie clip of swimming fish and water plants in a fish tank, which was repeated 297 times. The stimulus was presented using standard optics, at a rate of 30 frames per second, and gamma corrected for the display.

Data preparation. We randomly selected 30 subgroups of N = 10, 20, ..., 120 cells for analysis from the total of 160 sorted cells. In sum, we analyzed 30 × 12 = 360 groups of neurons, which we denote by S^ν_N, where N denotes the subgroup size and ν = 1, ..., 30 indexes the subgroups. Responses were discretized into ∆t = 20 ms time bins, as in our previous work [4, 5, 8]. The state of the retina was represented by σ_i(t) = +1 (−1) if neuron i spiked at least once (was silent) in a given time bin t. This binary description is incomplete only in the small fraction of time bins that contain more than one spike; we treat these bins as σ_i = +1. Across the entire experiment, the mean probability of non-silence (that is, σ_i = +1) is a few percent, and each subgroup contributes T ≈ 2.8 × 10^5 N-bit binary samples during the course of the experiment (297 repeats × 950 bins per repeat).
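The binarization step itself is straightforward; a minimal sketch with assumed inputs (per-cell arrays of spike times in seconds):

```python
# Sketch of the binarization described above: sigma_i(t) = +1 if cell i
# fires at least once in a dt = 20 ms bin, else -1; multi-spike bins -> +1.
import numpy as np

def binarize(spike_times, t_start, t_stop, dt=0.02):
    """spike_times: list of per-cell spike time arrays (seconds).
    Returns an array of +/-1 words with shape (n_bins, n_cells)."""
    edges = np.arange(t_start, t_stop + dt, dt)
    sigma = np.empty((len(edges) - 1, len(spike_times)), dtype=np.int8)
    for i, times in enumerate(spike_times):
        counts, _ = np.histogram(times, bins=edges)
        sigma[:, i] = np.where(counts > 0, 1, -1)
    return sigma

# usage on toy data: 3 cells over 19 s of "recording"
rng = np.random.default_rng(3)
toy = [np.sort(rng.uniform(0, 19, rng.poisson(30))) for _ in range(3)]
print(binarize(toy, 0.0, 19.0).shape)   # (950, 3)
```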
Appendix B: Learning maximum entropy models from data

We used a modified version of our previously published learning procedure to compute the maximum entropy models given measured constraints [35]; the proof of convergence for the core of this L1-regularized maximum entropy algorithm is given in Ref. [76]. Our new algorithm can use arbitrary functions as constraints, not only single and pairwise marginals as before. Parameters of the Hamiltonian are learned sequentially, in an order which greedily optimizes a bound on the log likelihood, and we use a variant of histogram Monte Carlo to estimate the values of constrained statistics during learning steps [77]. Monte Carlo induces sampling errors on our estimates of these statistics, which provide an implicit regularization for the parameters of the Hamiltonian [76]. We verified the correctness of the algorithm explicitly for groups of 10 and 20 neurons, where exact numerical solutions are feasible. We also verified that our MC sampling had a long enough "burn-in" time to equilibrate, even for groups of maximal size (N = 120), by starting the sampling repeatedly from the same vs. random initial conditions (100 runs each) and comparing the constrained statistics, as well as the average and variance of the energy and magnetization, across these runs; none of the statistics depended significantly on the initial state (two-sample Kolmogorov-Smirnov test at significance level 0.05).

Figure S1 provides a summary of the models we have learned for populations of different sizes. In small networks there is a systematic bias to the distribution of J_ij parameters, but as we look to larger networks this
FIG. S1:
Interactions in the (K-)pairwise model. (A)
The distributions of pairwise couplings, J_ij, in pairwise models of Eq (19), for different network sizes (N). The distribution is pooled over 30 networks at each N. (B) The mean (solid) and s.d. (dashed) of the distributions in (A) as a function of network size (black); the mean and s.d. of the corresponding distributions for K-pairwise models as a function of network size (red).

vanishes and the distribution of J_ij becomes symmetric. Importantly, the distribution remains quite broad, with the standard deviation of J_ij across all pairs declining only slightly. In particular, the typical coupling does not decline as ∼ 1/√N, as would be expected in conventional spin glass models [102]. This implies, as emphasized previously [8], that the "thermodynamic limit" (very large N) for these systems will be different from what we might expect based on traditional physics examples.

We withheld a random selection of 20 stimulus repeats (test set) for model validation, while training the model on the remaining 277 repeats. On training data, we computed the constrained statistics (mean firing rates, covariances, and the K-spike distribution), and used bootstrapping to estimate the error bars on each of these quantities; the constraints were the only input to the learning algorithm. Figure 1 shows an example reconstruction for a pairwise model for N = 100 neurons; the precision of the learning algorithm is shown in Fig. 2.

The dataset consists of a total of T ≈ 2.8 × 10^5 binary pattern samples, but due to the structure of the stimulus, the number of statistically independent samples must be smaller: while the repeats are statistically independent, the samples within each repeat are not, because they are driven by the stimulus. The variance of a binary variable given its mean, ⟨σ_i⟩, is σ_i^2 = 1 − ⟨σ_i⟩^2; with R independent repeats, the variance of the estimated mean rate should decrease as σ_{i,R}^2 = σ_i^2/R. By repeatedly estimating the statistical errors with different subsets of repeats, and comparing the expected scaling of the error in the original data set and in a data set where we shuffle time bins randomly, thereby destroying the repeat structure, we can estimate the effective number of independent samples; we find this to be T_indep ∼ 10^5, about 37% of the total number of samples, T.

We note that our largest models have several thousand constrained statistics that are estimated from at least 15× as many statistically independent samples. Moreover, the vast majority of these statistics are pairwise correlation coefficients that can be estimated extremely tightly from the data, often with relative errors below 1%, so we do not expect overfitting on general grounds. Nevertheless, we explicitly checked that there is no overfitting by comparing the log likelihood of the data under the learned maximum entropy model, for each of the 360 subgroups S^ν_N, on the training and testing sets, as shown in Fig. 3.
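The published algorithm is more elaborate than we can reproduce here, but the underlying logic of all such fits is moment matching: adjust each parameter until the model's expectation values equal the measured ones. A deliberately plain sketch of that logic (gradient ascent with Metropolis sampling; our code, not the L1-regularized greedy procedure described above):

```python
# Sketch of moment-matching ("Boltzmann learning") for a pairwise model.
import numpy as np

def metropolis_sample(h, J, n_samples, n_burn=2000, rng=None):
    """Approximate samples from P ~ exp(sum h*s + sum_{i<j} J*s*s)."""
    rng = rng or np.random.default_rng()
    N = len(h)
    Jsym = J + J.T
    s = rng.choice([-1, 1], size=N)
    out = np.empty((n_samples, N), dtype=np.int8)
    for t in range(n_burn + n_samples):
        i = rng.integers(N)
        dE = 2 * s[i] * (h[i] + Jsym[i] @ s)   # energy change if s_i flips
        if dE < 0 or rng.random() < np.exp(-dE):
            s[i] = -s[i]
        if t >= n_burn:
            out[t - n_burn] = s
    return out

def fit_pairwise(data, n_iter=200, eta=0.1, rng=None):
    """Match <s_i> and <s_i s_j> by ascending the log likelihood;
    `data` is an array of +/-1 words, shape (T, N)."""
    rng = rng or np.random.default_rng(0)
    data = data.astype(float)
    N = data.shape[1]
    m_data = data.mean(0)
    C_data = (data.T @ data) / len(data)
    h, J = np.zeros(N), np.zeros((N, N))
    for _ in range(n_iter):
        samp = metropolis_sample(h, J, 5000, rng=rng).astype(float)
        h += eta * (m_data - samp.mean(0))
        dC = C_data - (samp.T @ samp) / len(samp)
        J += eta * np.triu(dC, 1)              # update couplings, i < j
    return h, J
```

The fixed point is the same in both cases (the model reproduces the constrained statistics); the published method differs in how it orders the updates, regularizes, and recycles Monte Carlo samples.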
Appendix C: Exploring the energy landscape

To find the metastable (MS) states, we start with a pattern {σ_i} that appears in the data, and attempt to flip spins i = 1, ..., N from their current state into −σ_i, in order of increasing i. A flip is retained if the energy of the new configuration is smaller than before the flip. When none of the spins can be flipped, the resulting pattern is recorded as the MS state. The set of MS states found can depend on the manner in which the descent is performed, in particular when some of the states visited during descent lie on the "ridges" between multiple basins of attraction. Note that whether a pattern is a MS state or not is independent of the descent method; what depends on the method is which MS states are found by starting from the data patterns. To explore the structure of the energy landscape in Fig 11, we started 1000 Metropolis MC simulations repeatedly in each of the 10 most common MS states of the model; after each attempted spin flip, we checked whether the resulting state is still in the basin of attraction of the starting MS state (by invoking the descent method above), or whether it has crossed the energy barrier into another basin. We histogrammed the transition probabilities into other MS basins of attraction and, for particular transitions, we tracked the transition paths to extract the number of spin-flip attempts and the energy barriers. The "basin size" of a given MS state is the number of patterns in the recorded data from which the given MS state is reached by descending on the energy landscape. The results presented in Fig 11 are typical of the transitions we observe across multiple subnetworks of 120 neurons.
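A minimal sketch of this zero-temperature descent (assuming an `energy` callable that implements the fitted Hamiltonian; sweeps repeat until no single-spin flip lowers the energy):

```python
# Sketch of the greedy descent used to assign a pattern to its MS state.
import numpy as np

def descend(sigma, energy):
    """Flip spins in order of increasing index whenever the flip lowers
    the energy; repeat sweeps until no flip is accepted. The fixed point
    is the metastable (MS) state whose basin contains `sigma`."""
    s = sigma.copy()
    improved = True
    while improved:
        improved = False
        for i in range(len(s)):
            E0 = energy(s)
            s[i] = -s[i]
            if energy(s) < E0:
                improved = True      # keep the flip, energy went down
            else:
                s[i] = -s[i]         # undo the flip
    return s

# usage with a toy pairwise energy (h, J as in Appendix B)
rng = np.random.default_rng(4)
N = 20
h, J = rng.normal(0, 0.1, N), np.triu(rng.normal(0, 0.2, (N, N)), 1)
energy = lambda s: -(h @ s) - s @ J @ s
ms_state = descend(rng.choice([-1, 1], size=N), energy)
```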
Appendix D: Computing the entropy and partition function of the maximum entropy distributions

Entropy estimation is a challenging problem. As explained in the text, the usual approach of counting samples and identifying frequencies with probabilities will fail catastrophically in all the cases of interest here, even if we are free to draw samples from our model rather than from real data. Within the framework of maximum entropy models, however, the equivalence to statistical mechanics gives us several tools. Here we summarize the evidence that these multiple tools lead to consistent answers, so that we can be confident in our estimates.
FIG. S2:
Precision of entropy estimates. (A)
Entropy estimation using heat capacity integration (x-axis) from Eq (29) versus entropy estimation using the Wang-Landau sampling method (y-axis) [83]. Each plot symbol is one subnetwork of either N = 100 or N = 120 neurons (circles = pairwise models, crosses = K-pairwise models). The two methods yield results that agree to within ∼ 1%. (B) Fractional difference between the heat capacity method and the entropy determined from the all-silent pattern. The histogram is over 30 networks at N = 100 and 30 at N = 120, for the K-pairwise model. (C) Fractional difference between the Wang-Landau sampling method and the entropy determined from the all-silent pattern. Same convention as in (B).
Our first try at entropy estimation is based on the heat capacity integration in Eq (29). To begin, with N = 10, 20 neurons, we can enumerate all 2^N states of the network, and hence we can find the maximum entropy distributions exactly (with no Monte Carlo sampling). From these distributions we can also compute the entropy exactly, and it agrees with the results of the heat capacity integration. Indeed, there is good agreement for the entire distribution, with a negligibly small Jensen-Shannon divergence between the exact maximum entropy solutions and the solutions from our reconstruction procedure. As a second check, now useable for all N, we note that the entropy is zero at T = 0, but S = N bits at T = ∞. Thus we can do the heat capacity integration from T = 1 to T = ∞ instead of from T = 0 to T = 1, and we get essentially the same result for the entropy (mean relative difference below 1% across 30 networks at N = 100 and N = 120).

Leaning further on the mapping to statistical physics, we realize that the heat capacity is a summary statistic for the density of states. There are Monte Carlo sampling methods, due to Wang and Landau [83] (WL), that aim specifically at estimating this density, and those allow us to compute the entropy from a single simulation run. The results, in Fig S2A, are in excellent agreement with the heat capacity integration.

K-pairwise models have the attractive feature that, by construction, they match exactly the probability of the all-silent pattern, P(K = 0), seen in the data. As explained in the main text, this means that we can "measure" the partition function, Z, of our model directly from the probability of silence. Then we can compute the average energy ⟨E⟩ from a single MC sampling run, and find the entropy for each network. As shown in Figs S2B and C, the results agree both with the heat capacity integration and with the Wang-Landau method, to an accuracy of better than 1%.

Finally, there are methods that allow us to estimate entropy by counting samples even in cases where the number of samples is much smaller than the number of states [82] (NSB). The NSB method is not guaranteed to work in all cases, but the comparison with the entropy estimates from heat capacity integration (Fig S3A) suggests that, so long as N < 50, NSB estimates are reliable (see also [107]). Figure S3B shows that the NSB estimate of the entropy does not depend on the sample size for N < 50; if we draw from our models a number of samples equal to the number found in the data, and then ten times more, we see that the estimated entropy changes by just a few percent, within the error bars. This is another signature of the accuracy of the NSB estimator for N < 50. As N increases, these direct estimates of entropy become significantly dependent on the sample size, and start to disagree with the heat capacity integration. The magnitude of these systematic errors depends on the structure of the underlying distribution, and it is thus interesting that NSB estimates of the entropy from our model and from the real data agree with one another up to N = 120, as shown in Fig S3C.
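The silence-based estimate, in particular, takes only a few lines once a model sampler is available; a hedged sketch with assumed inputs, using the convention E(silence) = 0 so that Z = 1/P(silence):

```python
# Sketch of the "probability of silence" route to the entropy, Eqs (30-31),
# at T = 1: S = <E> + ln Z, with Z read off from the silent-state
# probability (matched exactly by the K-pairwise model).
import numpy as np

def entropy_from_silence(p_silence_data, samples, energy):
    """p_silence_data: fraction of all-silent bins in the data;
    samples: Monte Carlo samples drawn from the model;
    energy: the model Hamiltonian, with E(silence) = 0 by convention."""
    logZ = -np.log(p_silence_data)            # Eq (30) with E(silence) = 0
    E_mean = np.mean([energy(s) for s in samples])
    S_nats = E_mean + logZ                    # Eq (31) rearranged
    return S_nats / np.log(2)                 # report in bits
```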
Appendix E: Are real networks in the perturbative regime?

The pairwise correlations between neurons in this system are quite weak. Thus, if we make a model for the activity of just two neurons, treating them as independent is a very good approximation. It might seem that this statement is invariant to the number of neurons that we consider: either correlations are weak, or they are strong. But this misses the fact that weak but widespread correlations can have a non-perturbative effect on the structure of the probability distribution. Nonetheless, it has been suggested that maximum entropy methods are successful only because correlations are weak, and hence that we can't really capture non-trivial collective behaviors with this approach [32].

While independent models fail to explain the behavior of even small groups of neurons [4], it is possible that groups of neurons might be in a weakly perturbative regime, where the contribution of pairwise interactions could be treated as a small perturbation to the independent Hamiltonian, if the expansion were carried out in the correct representation [32]. Of course, with finite N, all quantities must be analytic functions of the coupling constants, and so we expect that, if carried to sufficiently high order, any perturbative scheme will converge, although this convergence may become much slower at larger N, signaling genuinely collective behavior in large networks.

To make the question of whether correlations are weak or strong precise, we ask whether we can approximate the maximum entropy distribution with the leading orders of perturbation theory. There are a number of reasons to think that this won't work [62, 63, 80, 81], but in light of the suggestion from Ref [32] we wanted to explore this explicitly. If correlations are weak, there is a simple relationship between the correlations C_ij and the corresponding interactions J_ij [32, 79]. We see in Fig S4A that this relationship is violated, and the consequence is that models built by assuming this perturbative relationship are easily distinguishable from the data even at N = 15 (Fig S4B). We conclude that treating correlations as a small perturbation is inconsistent with the data. Indeed, if we try to compute the entropy itself, it can be shown that even going out to fourth order in perturbation theory is not enough once N > 10 [80, 81].
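For reference, the lowest-order perturbative couplings tested in Fig S4 can be computed directly from data. A sketch with our own naming, assuming the 0/1 representation σ̃_i = (1 + σ_i)/2 discussed in the figure caption:

```python
# Sketch of the lowest-order perturbative coupling, J_ij = log c_ij,
# in the 0/1 representation; rare pairs that never co-spike would need
# regularization before taking the logarithm.
import numpy as np

def perturbative_couplings(sigma):
    """sigma: (T, N) array of +/-1 words. Returns J_ij = log c_ij with
    c_ij = <s_i s_j> / (<s_i><s_j>) for s = (1 + sigma)/2 in {0, 1}."""
    s = (1 + sigma) / 2.0
    m = s.mean(axis=0)                   # <s_i>, spike probabilities
    ss = (s.T @ s) / len(s)              # <s_i s_j>
    J = np.log(ss / np.outer(m, m))
    np.fill_diagonal(J, 0.0)             # self-terms are not couplings
    return J
```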
FIG. S3:
Sample-based entropy estimation. (A)
The bias in entropy estimates computed directly from samples drawn from K-pairwise models. The NSB entropy estimate [82], in bits per neuron, computed using ∼ 2.8 × 10^5 samples from the model (same size as the experimental data set), on the y-axis; the true entropy (from the heat capacity integration method) on the x-axis. Each dot represents one subnetwork of a particular size (N, different colors). For small networks (N ≤ 40) the bias is negligible, but estimation from samples significantly underestimates the entropy for larger networks. (B) The fractional bias of the estimator as a function of N (black dots = data from (A), gray dots = using 10-fold more samples). Red line shows the mean ± s.d. over 30 subnetworks at each size. (C) The NSB estimation of entropy from samples drawn from the model (x-axis) vs. the samples from the real experiment (y-axis); each dot is a subnetwork of a given size (color as in (A)). The data entropy estimate is slightly smaller than that of the model, as is expected for the true entropy; for estimates from finite data this would only be expected if the biases on data vs. MC samples were the same.
FIG. S4:
Perturbative vs. exact solutions for the pairwise maximum entropy models. (A)
The comparison of couplings J_ij for groups of N = 5, 10, 15, 20 neurons, computed using the exact maximum entropy reconstruction algorithm, with the lowest order perturbation theory result, J_ij = log c_ij, where c_ij = ⟨σ̃_i σ̃_j⟩/(⟨σ̃_i⟩⟨σ̃_j⟩) and σ̃_i = (1 + σ_i)/2 [32, 79]. In the case of larger networks, the perturbative J_ij deviate more and more from equality (black line). Inset: the average absolute difference between the true and perturbative couplings, normalized by the average true coupling. (B) The exact pairwise model, Eq (19), can be compared to the distribution P_expt({σ_i}) sampled from the data; the olive line shows the Jensen-Shannon divergence (corrected for finite sample size) between the two distributions, for four example networks of size N = 5, 10, 15, 20. The blue line shows the same comparison in which the pairwise model parameters, g = {h_i, J_ij}, were calculated perturbatively. The black line shows the D_JS between two halves of the data for the four selected networks.
10 [80, 81].2 [1] JJ Hopfield (1982) Neural networks and physical sys-tems with emergent collective computational abilities.
Proc Natl Acad Sci (USA) 79: 2554–2558. [2] DJ Amit (1989)
Modeling Brain Function: The World of Attractor Neural Networks (Cambridge University Press, Cambridge). [3] J Hertz, A Krogh & RG Palmer (1991)
Introduction to the Theory of Neural Computation (Addison Wesley, Redwood City). [4] E Schneidman, MJ Berry II, R Segev & W Bialek (2006) Weak pairwise correlations imply strongly correlated network states in a neural population.
Nature 440: 1007–1012. [5] G Tkačik, E Schneidman, MJ Berry II & W Bialek (2006) Ising models for networks of real neurons. arXiv.org: q-bio/0611072. [6] S Yu, D Huang, W Singer & D Nikolic (2008) A small world of neuronal synchrony.
Cereb Cortex
J Neurosci arXiv.org:
J Neurosci
J Neurosci
Nature
Proc Natl Acad Sci (USA)
J Physiol Paris
PLoS Comput Biol e1002922. [15] G Tkačik, JS Prentice, JD Victor & V Balasubramanian (2010) Local statistics in natural scenes predict the saliency of synthetic textures. Proc Natl Acad Sci (USA)
Phys Rev Lett
Proc Natl Acad Sci (USA)
Proc Natl Acad Sci (USA)
Information Flow in Biological Networks (Dissertation, Princeton University). [20] W Bialek & R Ranganathan (2007) Rediscovering the power of pairwise interactions. arXiv.org:
Phys Rev Lett
Proc Natl Acad Sci (USA)
Cell
Proc Natl Acad Sci (USA)
PLoS One e28766. [26] JI Sulkowska, F Morcos, M Weigt, T Hwa & JN Onuchic (2012) Genomics-aided structure prediction. Proc Natl Acad Sci (USA)
Phys Rev E
Proc Natl Acad Sci (USA)
J Stat Phys
Curr Opin Neurobiol
Front Comput Neurosci PLoS Comput Biol e1000380. [33] Y Roudi, J Tyrcha & J Hertz (2009) The Ising model for neural data: model quality and approximate methods for extracting functional connectivity. Phys Rev E
J Neurosci arXiv.org: rons to natural visual stimuli. Neural Comput
J Stat Mech
P03011. [38] M Okun, P Yger, SL Marguet, F Gerard-Mercier, A Benucci, S Katzner, L Busse, M Carandini & KD Harris (2012) Population rate dynamics and multi-neuron firing patterns in sensory cortex.
J Neurosci
Phys Rev
Spikes: Exploring the Neural Code (MIT Press, Cambridge). [41] DN Mastronarde (1983) Interactions between ganglion cells in cat retina.
J Neurophysiol
J Neurophysiol
Neuron
Nat Neurosci
Neuron
J Neurophysiol
Nonlinear Programming: Theory and Algorithms (Wiley & Sons, Hoboken NJ, USA). [48] J Keat, P Reinagel, RC Reid & M Meister (2001) Predicting every spike: a model for the responses of visual neurons.
Neuron
J Neurophysiol
Nature
Progress on Deciphering the Retinal Code (Dissertation, Princeton University). [52] D Perkel & T Bullock (1968) Neural coding.
Neurosci Res Program Bull Phys Rev E
Neural Comp
J Opt Soc Am A Phys Rev Lett
Network Nat Neurosci Phys Rev Lett
Adv Neural Info Proc Sys
Phys Rev Lett
Proc Natl Acad Sci (USA)
Phys Rev Lett
J Physiol Paris
J Stat Phys
Phys Rev Lett
Nat Neurosci µm-spacing 519-electrode arrays for in vitro retinal studies. Nuc Inst Methods A
Proceedings MEA Meeting, pp. 197, Stett A ed., BIOPRO Baden-Württemberg, 2008. [70] D Amodei (2011)
Network-scale Electrophysiology: Measuring and Understanding the Collective Behavior of Neural Circuits (Dissertation, Princeton University). [71] S Nirenberg, SM Carcieri, AL Jacobs & PE Latham (2001) Retinal ganglion cells act largely as independent encoders.
Nature
Monte Carlo Simulations in Statistical Physics (Cambridge University Press, Cambridge, UK). [73] TM Cover & JA Thomas (1991)
Elements of Information Theory (Wiley, New York). [74] OE Lanford III (1973) Entropy and equilibrium states in classical statistical mechanics.
Statistical Mechanics and Mathematical Problems,
A Lenard ed., pp. 1–113. Springer Verlag (Berlin). [75] LD Landau and EM Lifshitz (1977)
Statistical Physics (Pergamon, Oxford). [76] M Dudik, SJ Phillips & RE Schapire (2004) Performance guarantees for regularized maximum entropy density estimation.
Proceedings 17th Annual Conference on Learning Theory. [77] AM Ferrenberg & RH Swendsen (1988) New Monte Carlo technique for studying phase transitions. Phys Rev Lett
IEEE Trans Inf Theory
J Phys A
An Information Theoretic Study of Neural Populations (Dissertation, University of California at Santa Barbara). [81] F Azhar & W Bialek (2010) When are correlations strong? arXiv.org:
Adv Neural Info Proc Syst
Phys Rev Lett
Proc Natl Acad Sci (USA)
J Neurosci
J Physiol
J Gen Physiol
Neuron
J Neurosci
J Neurosci
J Neurosci
Neuron
J Neurosci
Neuron
J Neurosci Methods
Psychol Rev
Sensory Communication, pp 217–234. MIT Press (Cambridge, USA). [98] JJ Atick & AN Redlich (1990) Towards a theory of early visual processing.
Neural Comput Nature
Network
Neuron
Spin Glass Theory and Beyond (World Scientific, Singapore). [103] G Tkačik, JS Prentice, V Balasubramanian & E Schneidman (2010) Optimal population coding by noisy spiking neurons.
Proc Natl Acad Sci (USA)
Phys Rev Lett
Biol Cybern
Science