The Geometry of Information Coding in Correlated Neural Populations
Rava Azeredo da Silveira and Fred Rieke

Department of Physics, École Normale Supérieure; Laboratoire de Physique de l'ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris; Institute of Molecular and Clinical Ophthalmology Basel; Faculty of Science, University of Basel; Department of Physiology and Biophysics, University of Washington
February 2, 2021
Abstract
Neurons in the brain represent information in their collective activity. The fidelity of this neural population code depends on whether and how variability in the response of one neuron is shared with other neurons. Two decades of studies have investigated the influence of these noise correlations on the properties of neural coding. We provide an overview of the theoretical developments on the topic. Using simple, qualitative, and general arguments, we discuss, categorize, and relate the various published results. We emphasize the relevance of the fine structure of noise correlation, and we present a new approach to the issue. Throughout, we emphasize a geometrical picture of how noise correlations impact the neural code.
When citing this paper, please use the following: Azeredo da Silveira R, Rieke F. 2020. The geometry of information coding in correlated neural populations. Annu. Rev. Neurosci. Submitted. DOI: 10.1146/annurev-neuro-120320-082744
The quantitative study of information processing by neurons was born from investigations that correlated neural responses with parameters that characterize an 'external event' such as a physical stimulus or a motor action (Hubel, 1995). Responses of single neurons to simple stimuli have revealed many key properties of coding. These are often summarized in the form of receptive fields or, equivalently, tuning curves. Except in rare cases, however, physical stimuli and motor actions are coded, or represented, in the activity of an entire population of neurons. How, then, are rich signals represented in the collective activity of many neurons?

The issue of noise is central. Responses of a single, ideal, noiseless neuron can encode an infinite amount of information. By contrast, a real, noisy neuron disposes of a finite bandwidth. In a population of neurons, the noise in each individual neuron can be reduced by averaging. This is the simplest view on population coding: the population enhances the representative signal by averaging out the noise. But there are two features of population coding that make it a richer problem. First, physiological properties differ among cells, so that different neurons represent different aspects of the stimulus. Second, noise in the responses of individual neurons is correlated and, hence, its impact on information coding has to be considered collectively, not one cell at a time. These two aspects of the problem are intimately related: neurons acquire diverse properties because of the specificity of the connections they make to other neurons, and this also shapes the correlation in the noise. For example, divergence of common inputs may permit parallel channels to each encode a different aspect of the inputs but may also result in strong noise correlations between the channels.
More generally, the architecture of neural circuits shapes the structures of both signal and noise.

A great deal of research in quantitative neuroscience attempts to relate the geometry or statistics of neural responses to sensory stimuli or task parameters. This problem is difficult because it is high-dimensional, as both signal and noise are specified by a number of possible patterns that grows exponentially with the number of neurons. Statistical physics exemplifies a possible way to tame this complexity: phenomena such as phase transitions and superconductivity were explained by identifying the collective variables most relevant to the dynamics of measured quantities. Once these relevant variables (specific combinations of the microscopic variables) were identified, phenomena of interest could be explained simply in terms of energy stored in the collective variables or of fluctuations thereof. The understanding of neural coding would similarly benefit from the identification of analogous collective variables. Indeed, a great deal of effort is expended today to develop methods that can extract 'low-dimensional' or 'latent' variables from recordings of neural populations. An important consideration here is the need to consider the structure of the average population activity (e.g., in response to a set of stimuli) as well as the statistics of the variability about this average, and how the two relate.

In the past two decades, progress on understanding how coding depends on the geometry of signal and noise has been promoted by a simplifying choice, namely, a focus on pairwise correlations. These, unlike higher-order statistical moments, can be measured within the duration of typical neural recordings.
Many neural systems exhibit non-negligible pairwise correlations (Hatsopoulos et al., 1998, Mastronarde, 1989, Ozden et al., 2008, Perkel et al., 1967, Sasaki et al., 1989, Zohary et al., 1994, Shlens et al., 2008, Usrey & Reid, 1999, Vaadia et al., 1995, Bair et al., 2001, Fiser et al., 2004, Kohn & Smith, 2005, Smith & Kohn, 2008, Lee et al., 1998, Ecker et al., 2010, Graf et al., 2011, Goris et al., 2014, Lin et al., 2015). Early on, also, pairwise noise correlations were hailed as relevant to coding and behavior: the limits they imposed on noise reduction by averaging across neurons were hypothesized to account for the surprisingly similar detection thresholds of small populations of neurons and entire animals (Zohary et al., 1994, Bair et al., 2001).

These experimental findings and some other early investigations (Johnson, 1980, Vogels, 1990, Oram et al., 1998) motivated a series of studies that set heuristic arguments on firm bases and expanded on them, using detailed models of population coding (Abbott & Dayan, 1999, Sompolinsky et al., 2001, Wilke & Eurich, 2002, Romo et al., 2003, Golledge et al., 2003, Averbeck & Lee, 2003, Shamir & Sompolinsky, 2004, 2006, Averbeck & Lee, 2006, Averbeck et al., 2006, Josic et al., 2009) and general information-theoretic arguments (Panzeri et al., 1999, Pola et al., 2003). In addition to elucidating how noise correlation can limit coding, some of the early work (Abbott & Dayan, 1999, Wilke & Eurich, 2002) raised the possibility that noise correlation need not always harm coding. More recent investigations (Ecker et al., 2011, Hu et al., 2014, Azeredo da Silveira & Berry II, 2014, Moreno-Bote et al., 2014, Franke et al., 2016, Zylberberg et al., 2016) expanded the panorama of possible scenarios by showing that noise correlation can be harmless or appreciably beneficial to the neural code. The key here was the consideration of the fine structure of correlation, beyond its magnitude.
Indeed, analyses of retinal (Franke et al., 2016, Zylberberg et al., 2016) and cortical recordings (Averbeck & Lee, 2004, 2006, Montani et al., 2007, Graf et al., 2011, Lin et al., 2015, Montijn et al., 2016) have illustrated the beneficial impact of specific structures of noise correlations on coding.

Here, we review theoretical developments on neural population coding in the presence of correlated noise. We provide an overview of the topic that combines heuristic arguments, the study of simple models, and general mathematical statements. To ensure a formal unity, we focus primarily upon the mutual (Shannon) information as a means to quantify the neural code, and we comment on its relations with other, related quantities. Section 2 introduces the problem of neural population coding. Section 3 reviews early, heuristic arguments that pointed to a potentially detrimental role of noise correlation in coding. Section 4 presents a general, qualitative argument that encompasses more recent models, and delineates the conditions under which noise correlations are detrimental, inconsequential, or beneficial. Section 5 presents a model-independent point of view of the problem by expressing the mutual information in a form that delineates the role of different types of correlation. Section 6 examines the coding problem from a geometrical point of view that complements and further clarifies the results described in earlier sections and in the recent literature.

Sensory stimuli are coded in the activity of populations of neurons. One of the fundamental problems in neuroscience is that of elucidating the nature of this code; this problem can be divided into two parts. On the encoding side, we would like to know what properties of the population activity are relevant to the representation of information, and how these properties are manipulated by the brain.
On the decoding side, we would like to identify the mathematical operation that retrieves a physical stimulus (or some feature of it) from the output of a population of neurons. We can then also ask how such a mathematical operation is implemented by neurons. Here, we are concerned exclusively with the encoding side of the problem. Earlier reviews (see, e.g., (Averbeck et al., 2006)) discuss the impact of correlations on decoding.

Population coding is a much richer problem than single-cell coding because it is high-dimensional. The number of population states grows exponentially with the number of neurons, allowing for combinatorial codes. This is true even for noiseless neurons, as cells come in different functional (and genetic) types and even cells of a given type present physiological variability. The situation is further complicated by the fact that neurons are noisy: a given physical stimulus can elicit one of a number of population activity patterns. (We are not making any philosophical statement about noise as a sort of fundamental randomness. Instead, we refer to noise in a procedural way: for example, we say that the neural response is noisy if it varies from one trial to the next of an identical stimulus. This variability may result from biochemical stochasticity, but it may also reflect the purely deterministic dynamics of a complex system, such as interference between coding of the visual stimulus with other, unrelated neural activity elicited by other stimuli or internal processing.) The mean population response to a sensory stimulus and its variability are given by the joint statistics of the firing of neurons. Thus, the fundamental problem of neural population encoding amounts to asking how information about a physical stimulus is represented by this complicated mathematical object.
We say 'information about the physical stimulus' rather than specifically the stimulus itself because a neural population may represent properties associated with the stimulus, such as one of its attributes, a hidden event that may have caused the stimulus, or even a 'meta-property' of the stimulus such as the probability with which it occurs in a specific environment.

Following the bulk of the theoretical literature to date, we make two simplifications to make this general problem more approachable. First, we consider only pairwise correlations; we do not take into account or discuss the potential effects of higher-order correlations, which are more difficult to estimate precisely from limited experimental data (see Refs. (Cayco-Gajic et al., 2015, Zylberberg & Shea-Brown, 2015, Montijn et al., 2016) for examples of recent studies of neural coding in the presence of higher-order correlations). Second, we assume that the output of each individual neuron can be represented by a scalar variable. This means, in particular, that we do not consider temporal representations of information, such as those associated with specific spike patterns. We think of the output of the neural population as divided in successive time bins, and the activity of each neuron in each time bin as defined by a single number (such as the spike count). We examine the problem through the lens of mathematical quantities that provide a characterization of the coding performance independently of the choice of a putative decoder. Whenever possible, we choose to explain theoretical results in terms of the mutual (Shannon) information (Cover, 1999). It quantifies information on a well-founded axiomatic basis, but has the disadvantage that it is often difficult to calculate analytically. Besides its theoretical foundation, our motivation in aligning various results in the framework of a single mathematical 'figure of merit' of the neural code is to provide as much unity as possible to the discussion.
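The scalar-per-neuron-per-bin representation described above is straightforward to construct from raw spike times. The sketch below is purely illustrative (the function name and arguments are ours, not from any particular toolkit):

```python
import numpy as np

def bin_spike_counts(spike_times, t_stop, bin_width):
    """Reduce each neuron's spike train to one scalar (a spike count)
    per time bin, as assumed throughout the text.

    spike_times: list of 1-D arrays, one array of spike times per neuron.
    Returns an integer array of shape (n_neurons, n_bins)."""
    n_bins = int(round(t_stop / bin_width))
    edges = np.linspace(0.0, t_stop, n_bins + 1)
    return np.stack([np.histogram(st, bins=edges)[0] for st in spike_times])

# two neurons, 0.4 s of activity, 100-ms bins
counts = bin_spike_counts([np.array([0.05, 0.12]), np.array([0.31])],
                          t_stop=0.4, bin_width=0.1)
# counts[i, t] is the scalar response r_i of neuron i in time bin t
```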
Initial investigations suggested that noise correlation was detrimental to neural coding. This conclusion was based on several (simplifying) hypotheses: noise correlations were assumed to be positive, as suggested by neural recordings, and uniform in a population of neurons with similar tuning properties. Noise correlation was thus viewed as a 'bug' in neural processing, and possibly an unavoidable one due to the tight interconnections of neurons.

When we say "noise correlation harms or benefits coding," we tacitly assume a comparison between a correlated neural population and another neural population in all matters identical but in which noise correlations have been removed. This independent population may not be realizable in a real circuit due to interconnectedness of neurons, but it provides a natural benchmark. Since we disregard higher-order correlations, the comparison is between a correlated neural population and a neural population in which neurons have identical mean responses and single-cell variability around their mean responses, but in which pairwise correlations are vanishing, i.e., a population of independent neurons with matched single-cell response statistics. In practice, when analyzing data, there are several ways to implement this comparison. A model-independent approach is to create an artificial data set by shuffling recordings of individual neurons among different experimental trials, in the population recording, so as to retain single-cell statistics while eliminating the same-trial correlations. If it is possible to fit a model to the population activity statistics, it is also possible to compare this model to a parallel model in which the average single-cell activity and noise variance are left unchanged while correlations of second and higher order are set to zero.

It is easy to see why positive noise correlation can be detrimental to coding from the following simple model (Zohary et al., 1994, Bair et al., 2001).
Imagine that you want to discriminate two stimuli, A and B, from the output of a population of N neurons. For the sake of simplicity, we assume binary neurons, i.e., the response of neuron i, $r_i$, can take the value 0 or 1. If all the neurons in the population are identical in their response properties, the state of the population is entirely characterized by the number of active neurons,

$$k = \sum_{i=1}^{N} r_i. \qquad (1)$$

On average over trials, $\langle k \rangle_s = N p(s)$, where the brackets, $\langle \cdot \rangle_s$, denote an average over the distribution of population activity in the presence of stimulus, s, and p(s) is the probability that a neuron is activated by the stimulus s = A or B. From trial to trial, k fluctuates about this average quantity. The population output will discriminate the two stimuli as long as the difference in the mean outputs, $N \, |p(\mathrm{A}) - p(\mathrm{B})|$, is much larger than the typical magnitude of these fluctuations,

$$\sqrt{\left\langle (k - \langle k \rangle)^2 \right\rangle_s} = \sqrt{\sum_{i,j=1}^{N} \left\langle \left[ r_i - p(s) \right] \left[ r_j - p(s) \right] \right\rangle_s} = \sqrt{N \left[ 1 + (N-1)\, c(s) \right] p(s) \left[ 1 - p(s) \right]}, \qquad (2)$$

where c(s) is the pairwise correlation of two neurons in the presence of stimulus s, defined as

$$c(s) = \frac{\left\langle \left[ r_i - p(s) \right] \left[ r_j - p(s) \right] \right\rangle_s}{\sqrt{\left\langle \left[ r_i - p(s) \right]^2 \right\rangle_s \left\langle \left[ r_j - p(s) \right]^2 \right\rangle_s}} = \frac{\left\langle \left[ r_i - p(s) \right] \left[ r_j - p(s) \right] \right\rangle_s}{p(s) \left[ 1 - p(s) \right]}. \qquad (3)$$

Assuming that the correlation does not depend much on the stimulus, c(A) ≈ c(B) ≈ c, we can define a 'signal-to-noise ratio' (SNR) that characterizes the faithfulness of the code in discriminating the stimuli A and B, as

$$\mathrm{SNR} = \frac{N \left[ p(\mathrm{A}) - p(\mathrm{B}) \right]^2}{\left[ 1 + (N-1)\, c \right] p \left( 1 - p \right)}, \qquad (4)$$

where p lies somewhere between p(A) and p(B).
This quantity is also the square of the 'sensitivity index' used in statistics and generally denoted by d′.

The important point is that the SNR differs qualitatively for c = 0 and c > 0. For independent neurons (c = 0), the SNR grows linearly and indefinitely with N. Each neuron added to the population carries an incremental piece of information so that, roughly speaking, the coding performance grows in proportion to the size of the population. This is to be contrasted with the case of positively correlated neurons (c > 0): beyond a population size N* ≈ 1/c, positive correlation limits the coding performance and the SNR saturates to a finite value at larger population sizes (Fig. 1B). Each successive neuron added to a growing population carries a decreasing amount of information, since its variability is shared in part with that of all the other neurons in the population. In large populations, the activity of an added neuron is 'dictated' by that of the other neurons and, hence, it does not provide any incremental information.

Because of the form of the scaling with population size in Eq. (4), the effect of noise correlation can be strong even in relatively small populations with modest values of correlation. For example, in the presence of 10% correlation (c = 0.1, a typical value for cortical and retinal neurons), noise correlation has an appreciable effect already in a population of a few dozen neurons. For N = 100, and assuming p(A) − p(B) ≈ p ≈ 0.5, the signal-to-noise ratio amounts to 9, as opposed to 100 for an independent population of neurons. For N = 1000, the signal-to-noise ratio grows to 10, as opposed to 1000 for an independent population of neurons. More generally, while the signal-to-noise ratio increases by one unit for every independent neuron added, in a correlated population it increases by an amount (1 − c)/(1 + N c) when one neuron is added to a population with N neurons. With typical values of c ≈ 0.1, this quantity drops rapidly to zero in populations with more than 100 neurons.

There are at least two other ways of intuiting this result. The signal-to-noise ratio acquires a factor of N in its denominator because each neuron shares a fraction of its variability with all the other neurons in the population. As a result, any 'error' committed by a neuron will be enhanced by a factor of N, since neurons share their variability. Consequently, the variability in the population response will be greatly enhanced. In other words, positive correlation broadens the distribution of population responses. Yet another way to think about this result is that positive correlation induces neurons to respond similarly: it is as if positive correlation yields a reduced 'effective size' of the population, and, hence, suppresses the coding capacity. In the extreme case of 100% correlation (c = 1), all neurons in the population behave identically, and the population as a whole cannot code for any more information than a single neuron does.

It is instructive to see how the conclusions obtained from the simple model are reflected by a fundamental information-theoretic quantity, the mutual (Shannon) information. There are several equivalent ways to express the mutual information; for our purposes we adopt the form given in Eq. (5) below.
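The numbers quoted above follow directly from Eq. (4); a minimal sketch, using the text's illustrative values p(A) − p(B) ≈ p ≈ 0.5:

```python
def snr(N, c, dp=0.5, p=0.5):
    """Signal-to-noise ratio of Eq. (4) for N identical binary neurons
    with uniform pairwise noise correlation c; dp stands for p(A) - p(B)."""
    return N * dp**2 / ((1 + (N - 1) * c) * p * (1 - p))

print(round(snr(100, 0.0)))   # 100: independent neurons, SNR grows like N
print(round(snr(100, 0.1)))   # 9: c = 0.1 already suppresses the SNR strongly
print(round(snr(1000, 0.1)))  # 10: close to the saturation value dp^2 / (c p (1-p)) = 10
```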
Figure 1:
Dependence of the signal-to-noise ratio on the number of neurons (N) and the correlation strength (c). A. Probability densities of the activity for several combinations of N and c, in a homogeneous population of neurons that respond somewhat more strongly to stimulus A (darker shaded regions) than to stimulus B. For the purpose of illustration, neural responses are taken to be Gaussian. The overlap between the two distributions decreases steadily with N in the case of independent neurons (c = 0, black), but only minimally in the case with c = 0.2. B. Dependence of the signal-to-noise ratio on N. Closed circles indicate parameter values as in panel A.

In terms of the conditional response distributions, the mutual information reads

$$I = \left\langle \sum_{\mathbf{r}} P(\mathbf{r} \,|\, s) \log_2 \left( \frac{P(\mathbf{r} \,|\, s)}{\left\langle P(\mathbf{r} \,|\, s) \right\rangle_S} \right) \right\rangle_S, \qquad (5)$$

where $\mathbf{r} \equiv (r_1, \ldots, r_N)$ is the vector of population responses (or population activity), $s \in S$ denotes a stimulus (and S is the set of possible stimuli), and $\langle \cdot \rangle_S$ indicates an average over all stimuli. In our simple model, there are two stimuli, s = A or B, and $\mathbf{r}$ labels the $2^N$ possible states of the population:

$$\mathbf{r} = (n_1, \ldots, n_N), \qquad (6)$$

where $n_i = 0$ if neuron i is silent and $n_i = 1$ if neuron i is firing. In this case, assuming that the two stimuli are equiprobable,

$$\left\langle P(\mathbf{r} \,|\, s) \right\rangle_S = \frac{1}{2} \left[ P(\mathbf{r} \,|\, \mathrm{A}) + P(\mathbf{r} \,|\, \mathrm{B}) \right], \qquad (7)$$

we can rewrite Eq. (5) as

$$I = H - \frac{1}{2} \sum_{\mathbf{r}} \left[ P(\mathbf{r} \,|\, \mathrm{A}) \log_2 \left( 1 + \frac{P(\mathbf{r} \,|\, \mathrm{B})}{P(\mathbf{r} \,|\, \mathrm{A})} \right) + P(\mathbf{r} \,|\, \mathrm{B}) \log_2 \left( 1 + \frac{P(\mathbf{r} \,|\, \mathrm{A})}{P(\mathbf{r} \,|\, \mathrm{B})} \right) \right], \qquad (8)$$

where $H = \log_2(2) = 1$ bit is the entropy associated with the stimulus.

The second term on the right-hand side of Eq. (8) is referred to as the noise entropy and is a measure of the variability in the neural response that is not due to the variability in the stimulus. In other words, this term quantifies the amount of uninformative variability in the response.
The noise entropy is a sum of terms, each of which corresponds to a particular realization of the population activity. From Eq. (8), it appears immediately that a given term vanishes if either of the conditional response probabilities, P(r|A) or P(r|B), vanishes; indeed, if a given stimulus prevents a particular activity pattern, the latter is informative: it 'codes' for the other stimulus. Thus, the noise entropy grows as the overlap between the two conditional response probabilities increases, and, correspondingly, the mutual information is suppressed. If the overlap of the conditional distributions does not decrease as N increases, then the mutual information, I, saturates and never reaches the stimulus entropy, H. In this case, it is impossible to recover the full information about the stimulus from the neural population response even in an infinitely large population, in agreement with the picture from consideration of the signal-to-noise ratio.

Because neurons all come with identical properties in our simple model, there is a single informative quantity: the total spike count, k (Eq. (1)). More generally, information is represented in a higher-dimensional variable. Multiple studies (Abbott & Dayan, 1999, Sompolinsky et al., 2001, Wilke & Eurich, 2002, Romo et al., 2003, Golledge et al., 2003, Averbeck & Lee, 2003, Shamir & Sompolinsky, 2004, 2006, Averbeck & Lee, 2006, Averbeck et al., 2006, Josic et al., 2009, Ecker et al., 2011, Hu et al., 2014, Azeredo da Silveira & Berry II, 2014, Moreno-Bote et al., 2014, Franke et al., 2016, Zylberberg et al., 2016) have explored how noise correlation can affect the neural code in this case, by exploiting higher-dimensional versions of the structure illustrated in Fig. 1.
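The saturation of I below the stimulus entropy in the homogeneous model can be checked numerically. The sketch below is one illustrative construction (not the model of any specific study cited here): positive correlations arise from a Gaussian fluctuation of the activation probability shared across neurons. Because the population is exchangeable, the spike count k is a sufficient statistic, and the sum over the $2^N$ states in Eq. (8) collapses onto a sum over k. All parameter values are arbitrary choices.

```python
import numpy as np

def count_dist(N, p_mean, sigma):
    """P(k | s) for N exchangeable binary neurons whose activation
    probability is p_mean plus a shared Gaussian fluctuation of width
    sigma (sigma > 0 induces positive pairwise noise correlations)."""
    if sigma > 0:
        eps = np.linspace(-4 * sigma, 4 * sigma, 161)
        w = np.exp(-0.5 * (eps / sigma) ** 2)
        w /= w.sum()
    else:
        eps, w = np.array([0.0]), np.array([1.0])
    p = np.clip(p_mean + eps, 1e-9, 1 - 1e-9)
    k = np.arange(N + 1)
    logfact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, N + 1)))))
    log_binom = logfact[N] - logfact[k] - logfact[N - k]
    log_pmf = (log_binom[None, :] + k[None, :] * np.log(p)[:, None]
               + (N - k)[None, :] * np.log(1 - p)[:, None])
    return w @ np.exp(log_pmf)   # mixture over the shared fluctuation

def mutual_info_bits(N, pA, pB, sigma):
    """Mutual information of Eq. (5) for two equiprobable stimuli."""
    PA, PB = count_dist(N, pA, sigma), count_dist(N, pB, sigma)
    Pbar = 0.5 * (PA + PB)
    kl = lambda P: np.sum(P * np.log2(np.maximum(P, 1e-300) / Pbar))
    return 0.5 * kl(PA) + 0.5 * kl(PB)

# independent neurons (sigma = 0): I keeps growing toward H = 1 bit;
# shared noise (sigma = 0.1): I saturates well below 1 bit
for N in (100, 400):
    print(N, mutual_info_bits(N, 0.55, 0.45, 0.0),
             mutual_info_bits(N, 0.55, 0.45, 0.1))
```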
The main (and important) departure from our simple model was the generalization to heterogeneous neural populations: the average single-neuron response to a given stimulus was assumed to vary from neuron to neuron, and likewise pairwise correlations were different from pair to pair. Most studies started with a set of 'tuning curves' (average response as a function of stimulus parameter) assigned to the neurons in the population. Noise, including pairwise correlations, was either estimated from neural recordings or posited on theoretical grounds. The fidelity of the population code was then evaluated in terms of a chosen figure of merit, such as the mutual information or a decoding error variance. The results obtained thus depended on the specifics of the assumptions involved in setting the forms of the tuning curves and of the noise model; to explore a range of behaviors, the latter had to be varied. For example, many early investigations used model neurons responding to a continuous stimulus with broad tuning curves, and assumed noise models in which the pairwise correlation depended only upon the tuning preferences of the two neurons in the pair. Later studies included more sophisticated forms of heterogeneity and dependences, such as the dependence of the pairwise correlation not only upon the tuning preferences but also upon the stimulus itself.

To illustrate how coding depends on the manner in which neurons are correlated, we take a more general but more qualitative approach. As stimulus parameters are varied, the responses of the N neurons in the population trace out a hypersurface in the N-dimensional space of the population responses (Fig. 2A). If the tuning curves are sufficiently smooth, this hypersurface can be approximated locally by a hyperplane. (There exist important examples in which this approximation is not valid (Sreenivasan & Fiete, 2011, Blanco Malerba et al., 2020).)
Single-trial population responses depart from this hyperplane due to noise; the orientation of the hyperplane and the geometry of the noise define M 'informative dimensions' or 'informative modes',

$$m_i = \sum_{j=1}^{N} a_{ij} r_j, \qquad (9)$$

where the $a_{ij}$ are numerical prefactors and $i = 1, \ldots, M$. One can think of these variables as chosen to maximize the mutual information with the stimulus or to correspond to optimal decoding dimensions. If the noise is isotropic in the N-dimensional space of population responses, then the informative dimensions coincide with the hyperplane defined by the tuning curves; in general, however, the 'informative hyperplane' (defined by the coefficients $a_{ij}$) and the 'signal hyperplane' are distinct (Fig. 2C).

In the simplest and most commonly studied case of a one-dimensional stimulus, the informative mode,

$$m = \sum_{j=1}^{N} a_j r_j, \qquad (10)$$

lies along the vector with elements $a_j$. The quantity in Eq. (10) plays a role analogous to the spike count (Eq. (1)) in our simple model. By analogy with the simple model, the 'strength of the signal' carried by the informative mode is obtained by averaging over the noise, and grows linearly with the size of the population: $\langle m \rangle \sim O(N)$. How much information a mode represents depends also upon the uncertainty of its value. Early studies considered cases in which this uncertainty, as measured by its variance, grew either linearly or quadratically with N (Abbott & Dayan, 1999, Sompolinsky et al., 2001, Wilke & Eurich, 2002, Romo et al., 2003). If neurons are independent, the variance of the informative modes grows linearly with the size of the population, so that each mode can represent reliably up to about $\sqrt{N}$ different states of the stimuli. In this case, the mutual information grows logarithmically in N.
If, however, positive correlation corrupts an informative mode, its typical amplitude grows linearly with the size of the population and its variance grows quadratically; in this case, the informative mode can represent reliably only O(1) different states of the stimulus, i.e., the mutual information saturates to a finite value smaller than the stimulus entropy. More recent studies (Ecker et al., 2011, Hu et al., 2014, Azeredo da Silveira & Berry II, 2014, Franke et al., 2016, Zylberberg et al., 2016) (but see also Refs. (Abbott & Dayan, 1999, Wilke & Eurich, 2002)) introduced examples in which noise correlation may in fact suppress
Figure 2:
Signal and noise together define the informative direction for neural coding. A. Two-neuron illustration of quantities relevant for defining the signal-to-noise ratio in a specific direction. The thin green line illustrates how the signal changes as the stimulus is varied. The blue ellipse illustrates the distribution of noisy responses corresponding to a mean response located at one point along the green line. The signal-to-noise ratio along a test direction (black) at an angle θ relative to the signal direction can then be determined from the projection, $s_{\mathrm{test}}$, of the signal vector, s, and the projection of the noise, $\sigma_{\mathrm{test}}$, in the test direction. B. Signal and noise as a function of the angle between the test and signal directions, for the situation depicted in panel A. The green circles occur when the test and signal directions are the same. C. The signal-to-noise ratio as a function of θ. Insets show the signal direction (green) and the informative direction, the direction that maximizes the signal-to-noise ratio (red).

the variance of informative modes relative to the independent case, thereby enhancing the resolution of the code.

We can discuss these different cases by exploiting Eq. (10), which we can rewrite as

$$m = \sum_{i=1}^{N} a_i \langle r_i \rangle + \sum_{i=1}^{N} a_i \eta_i \equiv \langle m \rangle + \mu, \qquad (11)$$

where $\langle \cdot \rangle$ denotes an average over the noise and the $\eta_i$ are N correlated random variables with vanishing mean. The second term, µ, represents the uncertainty on the magnitude of the informative mode and is the projection of the population noise along the informative direction defined by the vector with elements $a_j$ (Fig. 2). The informative mode can encode about as many different states of the stimulus as the ratio between the first term, $\langle m \rangle$, and the standard deviation of the second term, µ, in Eq. (11).
Its variance is calculated as

$$\left\langle \mu^2 \right\rangle = \sum_{i=1}^{N} a_i^2 \left\langle \eta_i^2 \right\rangle + \sum_{i=1}^{N} a_i \sum_{j \neq i} a_j \left\langle \eta_i \eta_j \right\rangle = \sum_{i=1}^{N} a_i \left( a_i \left\langle \eta_i^2 \right\rangle + Q_i \right), \qquad (12)$$

where

$$Q_i = \sum_{j \neq i} a_j \left\langle \eta_i \eta_j \right\rangle. \qquad (13)$$

The first term in Eq. (12) represents the contribution of independent neuron variance, and the second term represents the contribution of correlated variability among neurons. Generically, $Q_i$ can behave as a function of the population size in one of four ways, listed as follows.

(i) $Q_i = 0$.
(ii) $Q_i \approx \pm a_i \langle \eta^2 \rangle \, \tilde{n} \, c$.
(iii) $Q_i \approx a_i \langle \eta^2 \rangle \, \tilde{N} \, c$.
(iv) $Q_i \approx - a_i \langle \eta^2 \rangle \, \tilde{N} \, c$.
Here, $\langle \eta^2 \rangle$ corresponds to the typical scale of the single-cell variance, and $c \sim O(1) > 0$ to the typical scale of the pairwise correlation. $Q_i$ also depends on an 'effective population size' ($\tilde{n}$ or $\tilde{N}$) that corresponds to the magnitude of the correlated noise mode relevant to coding. Generically, $\tilde{n} \sim O(1)$ and $\tilde{N} \sim O(N)$, though $\tilde{N}$ can scale more weakly with N (see below). Without loss of generality, we exhibit a prefactor $a_i$ in these expressions, for the sake of convenience given the form of Eq. (12). This form is natural, also, in the case of most models considered in the literature, in which the total spike count in the population is uninformative. For example, for neurons with broad tuning curves that tile the stimulus space densely, so that the total spike count in the population is roughly independent of the stimulus, the elements of the informative vector sum to zero, i.e., $\sum_{i=1}^{N} a_i = 0$. In a population with uniform correlations (Abbott & Dayan, 1999, Wilke & Eurich, 2002), i.e., $\langle \eta_i \eta_j \rangle = \langle \eta^2 \rangle \, c$ for all $i \neq j$, the quantity $Q_i$ amounts to $- a_i \langle \eta^2 \rangle \, c$.

We can now organize the various results which appear in the literature among these four categories:
1. Independence (case i). If neurons are independent, $\langle \mu^2 \rangle$ grows like N, so that the informative mode can represent about $\langle m \rangle / \sqrt{\langle \mu^2 \rangle} \sim \sqrt{N}$ different states of the stimulus.
2. Strongly detrimental noise correlation (case iii). Early models (Abbott & Dayan, 1999, Sompolinsky et al., 2001, Wilke & Eurich, 2002) assume smooth, broad tuning curves, so that a given stimulus activates most neurons in the population. As a result, the informative vector contains a macroscopic fraction of non-vanishing elements, $a_i$, which vary slowly with $i$. If the covariance of the noise, $\langle \eta_i \eta_j \rangle$, also varies slowly with $j$ over the population, it can 'interfere constructively' with $a_j$, meaning that the noise is large in directions in which the informative mode is also large. In this case, $Q_i$ grows like $N$ and $\langle \mu^2 \rangle$ grows like $N^2$, and the informative mode can represent only $O(1)$ different states of the stimulus. In other words, the performance of the code (as measured, e.g., by the mutual information) saturates for large neural populations.
3. Weakly detrimental or weakly beneficial noise correlation (case ii). Some early studies (Abbott & Dayan, 1999, Wilke & Eurich, 2002) noted that noise correlations that are uniform over the population can lead to an improvement in the coding performance. Indeed, if $\langle \eta_i \eta_j \rangle = \langle \eta^2 \rangle c$ is independent of $i$ and $j$, then $Q_i = -a_i \langle \eta^2 \rangle c$, and
\[
\langle \mu^2 \rangle = \sum_{i=1}^{N} a_i^2 \langle \eta^2 \rangle (1 - c) \sim (1 - c)\, O(N). \tag{14}
\]
As a result, the informative mode can represent a number of different states of the stimulus that depends upon the population size and the strength of the noise correlation as $\sqrt{N/(1-c)}$. This moderate improvement of the coding performance was obtained in studies that allowed for neuron-to-neuron variability in tuning curve properties (Shamir & Sompolinsky, 2006, Ecker et al., 2011). This variability implied rapid fluctuations of the elements of the informative vector, $a_i$, as a function of the neuron index, $i$. If, by contrast, the noise covariance, $\langle \eta_i \eta_j \rangle$, varies smoothly over the population, then the 'destructive interference' between these two terms yields, again, a linear scaling of $Q_i$ with $N$, similar to the case of perfectly uniform correlation. More generally, whether the noise correlation is strongly detrimental or weakly detrimental/beneficial depends upon whether the interference between informative vector and noise covariance is constructive or destructive, respectively, over the population.
4. Strongly beneficial noise correlation (case iv). Recent studies noted that the 'destructive interference' between the elements of the informative vector and the noise covariance can lead to an appreciable suppression of the uncertainty (Azeredo da Silveira & Berry II, 2014, Franke et al., 2016, Zylberberg et al., 2016). This occurs if $a_j$ and $\langle \eta_i \eta_j \rangle$ both vary slowly as a function of $j$, over the population, but are, roughly speaking, 'out of phase': positive noise correlations are suppressed for neurons which contribute to a greater degree to the 'strength of the signal', and vice versa. As a consequence, the quantity $Q_i$ becomes negative and scales with $N$, and the variance of the noise is calculated as
\[
\langle \mu^2 \rangle \approx \alpha N \langle \eta^2 \rangle \left( 1 - \tilde{N} c \right), \tag{15}
\]
where $\alpha$ is a positive number of $O(1)$ and $c$ is the typical scale of the (positive) pairwise noise correlation as before. Generically, $\tilde{N}$ scales linearly with $N$, so that uncertainty is strongly suppressed by noise correlation, through the term $1 - \tilde{N} c$. The informative mode can then represent a number of different states of the stimulus that depends upon the population size and the strength of the noise correlation as $\sqrt{N/(1 - \tilde{N} c)}$. The important point, here, is that the denominator is strongly suppressed as a function of population size. This results, in particular, in an appreciable enhancement of the coding performance when $\tilde{N} \sim O(1/c)$. The right-hand side of Eq. (15) remains non-negative since the covariance of the noise is positive semi-definite. In this formulation, we assume that the scale of the correlation, characterized by $c$, is fixed; as $\tilde{N}$ increases, the covariance matrix becomes increasingly constrained by this condition, and, depending on its structure, one or several small eigenvalues may emerge.
As these tend to zero, the informative vector rotates with respect to the eigenvectors of the noise covariance matrix and, as a consequence, the scaling of $\tilde{N}$ becomes weaker. This limiting regime in the vicinity of a singular noise covariance matrix interpolates between the scalings in cases (ii) and (iv), thereby allowing the term $1 - \tilde{N} c$ in Eq. (15) to remain non-negative. We return to the discussion of the behavior of the informative direction as a function of the structure of the noise in Sec. 6, where we provide further illustration in a concrete model.

The dependence of this boost upon the population size is a signature of the collective effect at play here: in a correlated system, the behavior of a neuron is affected by all $N - 1$ other neurons. If $c \sim 0.1$, as observed experimentally, the effect of correlation upon coding is appreciable already in populations as small as tens or hundreds of neurons (Azeredo da Silveira & Berry II, 2014). A specific incarnation of this phenomenon occurs in a model of broadly tuned neurons in which the dependence of the correlation between a pair of neurons upon the difference in their tuning preference is allowed to be non-monotonic (Franke et al., 2016, Zylberberg et al., 2016).

The list just outlined catalogs the various ways in which noise correlation can affect the coding of stimuli along an informative dimension by shaping the variability in the population response. Our discussion of the 'interference' between elements of the informative vector and the noise covariances can be seen as a generalization of the 'sign rule' (Hu et al., 2014), according to which positive noise correlation is favorable in a pair of neurons with negative signal correlation, and vice versa.

There is one case, however, which was not covered: this is when there is no informative dimension in the sense we discussed above. To be specific, consider the case in which the average magnitude of the activity in the 'informative' dimension, $\langle m \rangle$, is independent of the stimulus. Information about the stimulus can still be encoded in the noise itself: if correlation depends upon the stimulus, then different patterns of population activity can discriminate stimuli. We return to this case in the next section, in a more systematic treatment of the mutual information.

Finally, above we have considered only pairwise correlations. In the presence of higher-order correlations, additional kinds of scalings occur. If real neural systems are dominated by the strong co-activation of groups of neurons corresponding to higher-order correlation, the analyses developed so far may have a limited relevance to our understanding of population coding.
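The four scaling regimes cataloged above can be illustrated with a small numerical sketch. The specific choices below (a uniform or alternating informative vector `a`, uniform correlations of strength `c`, and an 'out-of-phase' covariance with scale `gamma`) are our own illustrative assumptions, not taken from the studies cited above.

```python
import numpy as np

# Variance along the informative direction, <mu^2> = a^T C a, relative to
# the independent case with matched single-neuron variances.
def var_ratio(a, C):
    return (a @ C @ a) / (a @ np.diag(np.diag(C)) @ a)

N, c = 1000, 0.1
ones = np.ones((N, N))
a_flat = np.ones(N) / N                  # slowly varying informative vector
a_alt = np.resize([1.0, -1.0], N) / N    # informative vector summing to zero
C_uni = (1 - c) * np.eye(N) + c * ones   # uniform positive correlations

print(var_ratio(a_flat, np.eye(N)))      # case (i): 1, independence
print(var_ratio(a_flat, C_uni))          # case (iii): ~ 1 + Nc, detrimental
print(var_ratio(a_alt, C_uni))           # case (ii): 1 - c, mildly beneficial

# Case (iv): positive correlations only between neurons whose a_i have
# opposite signs; gamma < 1 keeps the covariance positive definite.
u1, u2 = (a_alt > 0).astype(float), (a_alt < 0).astype(float)
gamma = 0.99
C_anti = np.eye(N) + (2 * gamma / N) * (np.outer(u1, u2) + np.outer(u2, u1))
print(var_ratio(a_alt, C_anti))          # ~ 1 - gamma: strong suppression
```

In case (iii) the discriminability saturates as the population grows, while in case (iv) the noise along the informative direction is suppressed by the factor $1 - \gamma$, boosting the coding performance accordingly.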
Many of the studies of neural population coding to date have relied upon specific models of neural populations, and have focused on one central question: how does the coding performance scale with the number of neurons, in particular in the limit of large populations? Moreover, most of these studies quantified coding through the Fisher information. Other information-theoretic quantities, such as the mutual (Shannon) information, are more fundamental (Cover, 1999, Brunel & Nadal, 1998, Kang & Sompolinsky, 2001, Wei & Stocker, 2016) and avoid difficulties associated with the Fisher information (Bethge et al., 2002). The Fisher information is local in stimulus space, whereas the mutual information quantifies the accuracy of stimulus representation over the entire stimulus space. The use of the Fisher information also relies upon some restrictive assumptions, and yields only a lower bound on the coding resolution, which may or may not be tight.

Coding in neural populations can be examined from a general perspective by expressing the mutual information in a form that isolates the impact of different types of correlation (Panzeri et al., 1999, Pola et al., 2003). We discuss the implications of this decomposition here; in App. B, we provide a derivation of the decomposition, which follows and somewhat simplifies that in Refs. (Panzeri et al., 1999, Pola et al., 2003). The central result is a reformulation of the mutual information as a sum of three terms:
\[
I = I_{\text{independent}} + I^{(1)}_{\text{correlated}} + I^{(2)}_{\text{correlated}}. \tag{16}
\]
Each of the terms can be expressed as a function of the joint probability, $P(\mathbf{r}, s)$, between stimulus, $s$, and population response, $\mathbf{r}$, and transformations of this joint probability, such as $\bar{P}(\mathbf{r}|s)$, defined in Eq. (31), which denotes the conditional probability of the response in a population of independent neurons with matched mean and variance. In App. B, we show that the three terms in Eq.
(16) can be written as
\[
I_{\text{independent}} \equiv \left\langle \sum_{\mathbf{r}} \bar{P}(\mathbf{r}|s) \log \left( \frac{\bar{P}(\mathbf{r}|s)}{\langle \bar{P}(\mathbf{r}|s) \rangle_S} \right) \right\rangle_S, \tag{17}
\]
\[
I^{(1)}_{\text{correlated}} = \left\langle \sum_{\mathbf{r}} P(\mathbf{r}|s) \log \left( \frac{P(\mathbf{r}|s)/\bar{P}(\mathbf{r}|s)}{\langle P(\mathbf{r}|s) \rangle_S / \langle \bar{P}(\mathbf{r}|s) \rangle_S} \right) \right\rangle_S, \tag{18}
\]
and
\[
I^{(2)}_{\text{correlated}} = \left\langle \sum_{\mathbf{r}} \left[ P(\mathbf{r}|s) - \bar{P}(\mathbf{r}|s) \right] \log \left( \frac{\prod_{i=1}^{N} P(r_i)}{\langle \bar{P}(\mathbf{r}|s) \rangle_S} \right) \right\rangle_S. \tag{19}
\]
The benefit of this reformulation of the mutual information is that each of these three terms comes with a transparent interpretation. The term $I_{\text{independent}}$ represents the information carried by conditionally independent neurons; indeed, if there is no noise correlation, $P(\mathbf{r}|s) = \bar{P}(\mathbf{r}|s)$, and both $I^{(1)}_{\text{correlated}}$ and $I^{(2)}_{\text{correlated}}$ vanish. Thus, $I_{\text{independent}}$ accounts for the amount of information carried by the population which is not affected by noise correlations. In App. B, we show that $I_{\text{independent}}$ can be broken down further to isolate the impact of signal correlations.

The term $I^{(1)}_{\text{correlated}}$ is the formal analog to the term $I_{\text{independent}}$, but where the information is carried by noise correlations rather than by the structure of the mean response. Thus, it accounts for stimulus coding by the noise correlations themselves: the same way differential firing rates characterize different stimuli, non-uniform noise correlation can also specify the stimulus. Figure 3B illustrates an example of the effects captured by this term in the simple case of a two-neuron population that encodes a binary stimulus, $s = \mathrm{A}$ or $\mathrm{B}$. Here, the mean responses to stimuli A and B are identical, yet the identity of the stimulus can be inferred from the two-neuron response due to the differences in noise correlation.
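The decomposition in Eqs. (16)-(19) can be checked numerically on a toy example in the spirit of Figure 3B: two binary neurons, identical mean responses under the two stimuli, and stimulus-dependent correlation. All probability values below are made up for illustration.

```python
import numpy as np

# Toy 2-neuron, 2-stimulus code; P[s][r1, r2] is P(r1, r2 | s).
ps = {"A": 0.5, "B": 0.5}
P = {"A": np.array([[0.4, 0.1], [0.1, 0.4]]),      # correlated noise under A
     "B": np.array([[0.25, 0.25], [0.25, 0.25]])}  # independent noise under B

def marg(p):  # product of marginals: the independent surrogate P_bar(r|s)
    return np.outer(p.sum(axis=1), p.sum(axis=0))

Pbar = {s: marg(P[s]) for s in P}
Pmix = sum(ps[s] * P[s] for s in P)        # <P(r|s)>_S
Pbarmix = sum(ps[s] * Pbar[s] for s in P)  # <P_bar(r|s)>_S
Pri = marg(Pbarmix)                        # prod_i P(r_i)

I_total = sum(ps[s] * np.sum(P[s] * np.log(P[s] / Pmix)) for s in P)
I_ind = sum(ps[s] * np.sum(Pbar[s] * np.log(Pbar[s] / Pbarmix)) for s in P)
I1 = sum(ps[s] * np.sum(P[s] * np.log((P[s] / Pbar[s]) / (Pmix / Pbarmix)))
         for s in P)
I2 = sum(ps[s] * np.sum((P[s] - Pbar[s]) * np.log(Pri / Pbarmix)) for s in P)

print(np.isclose(I_total, I_ind + I1 + I2))  # True: the decomposition is exact
```

Because the mean responses match under A and B, $I_{\text{independent}}$ and $I^{(2)}_{\text{correlated}}$ vanish here and the entire mutual information is carried by the correlation term $I^{(1)}_{\text{correlated}}$.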
Generalizations of this mechanism have been studied in various models of neural population coding (Shamir & Sompolinsky, 2004, Josic et al., 2009, Zylberberg, 2018), and stimulus-dependence of correlations has been proposed as supporting visual coding in direction-selective middle-temporal neurons in monkeys (Ponce-Alvarez et al., 2013). As we show in App. B, both terms $I_{\text{independent}}$ and $I^{(1)}_{\text{correlated}}$ are non-negative; they capture occurrences in which variations of mean response or noise correlations as a function of stimulus are informative.
Figure 3:
Representation of information by stimulus-dependent noise. A. Illustration of a situation in which the variance of the response of a single neuron encodes information about the stimulus identity. For example, the noisy response denoted by the black circle is more likely to be induced by stimulus B than by stimulus A. B. Illustration of a similar situation, in which the information is encoded in correlations of the noise in the joint response of two neurons. In both panels A and B, the mean response is the same for stimulus A and stimulus B, and, hence, uninformative about their identity.
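In the Gaussian picture, the situation of panel B can be quantified directly: with equal means, the Kullback-Leibler divergence between the two response distributions is driven entirely by the covariances, and it is positive whenever the correlations differ. The covariance values below are illustrative choices of ours.

```python
import numpy as np

# KL(N(0, C0) || N(0, C1)) for zero-mean Gaussians: a covariance difference
# alone renders the two stimuli discriminable.
def kl_gauss(C0, C1):
    k = C0.shape[0]
    return 0.5 * (np.trace(np.linalg.inv(C1) @ C0) - k
                  + np.log(np.linalg.det(C1) / np.linalg.det(C0)))

CA = np.array([[1.0, 0.8], [0.8, 1.0]])  # strong noise correlation under A
CB = np.array([[1.0, 0.0], [0.0, 1.0]])  # no correlation under B
print(kl_gauss(CA, CB) > 0)   # True, despite identical mean responses
```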
Finally, $I^{(2)}_{\text{correlated}}$ represents the increment or decrement of information due to the interplay between signal correlation and noise correlation. From Eq. (19), it is apparent that $I^{(2)}_{\text{correlated}}$ vanishes if either noise correlation is absent ($P(\mathbf{r}|s) = \bar{P}(\mathbf{r}|s)$) or signal correlation is absent ($\langle \bar{P}(\mathbf{r}|s) \rangle_S = \prod_{i=1}^{N} P(r_i)$). Furthermore, and unlike the other two components of the mutual information, this component can be positive or negative. For a given stimulus, $s$, each population response, $\mathbf{r}$, yields a positive contribution if $\left[ P(\mathbf{r}|s) - \bar{P}(\mathbf{r}|s) \right] \log \left( \prod_{i=1}^{N} P(r_i) \,/\, \langle \bar{P}(\mathbf{r}|s) \rangle_S \right)$ is positive, and a negative contribution otherwise. In other words, the contribution to the information of a population response is positive if noise correlation favors this population response while signal correlation disfavors it, and vice versa, and it is negative if both noise and signal correlations either favor or disfavor the population response. (When we say 'favor' and 'disfavor', as usual we are comparing the case of a population of correlated neurons to the case of a population of independent neurons.) The form of $I^{(2)}_{\text{correlated}}$ in Eq. (19) captures another generalized formulation of the sign rule we mentioned in the previous section (Hu et al., 2014). It also illustrates in the language of mutual information, and without reference to any specific model, the kind of 'constructive versus destructive interference' discussed in the previous section.

The reader might wonder about the merits of what may seem like a mere mathematical exercise. Many studies today start from data and end in data, and seek to make sense of data rather than to propose a theoretical framework. In this context, we find it refreshing to examine the question from a more abstract theoretical point of view, not tied to a specific model.
It adds to our understanding to be able to consider the same neural mechanism from multiple points of view. Having said this, we emphasize that the framework just outlined can indeed be put to the task of analyzing data. It was recently used, for example, to uncover the organization of assemblies of neurons with redundant and synergistic coding of visual information in monkey cortex (Nigam et al., 2019).

Information coding and the geometry of noise correlation
The breakdown of the mutual information discussed in the previous section teases apart the various contributions from signal and noise correlations, but it provides neither a quantitative nor a geometrical view of how the structures of signal and noise together impact coding. This section develops such a geometrical view by revisiting the task of discriminating two stimuli, A and B, discussed in Sec. 3.

To streamline the mathematical treatment, we consider a limiting case in which the mutual information takes a simple form, namely, the case in which stimulus A is presented with probability $\varphi \ll 1$. In this limit, the mutual information can be written as
\[
I = \varphi \sum_{\mathbf{r}} P(\mathbf{r}|\mathrm{A}) \ln \left( \frac{P(\mathbf{r}|\mathrm{A})}{P(\mathbf{r}|\mathrm{B})} \right) + O(\varphi^2), \tag{20}
\]
where $\mathbf{r}$ is the vector of neural responses, $\mathbf{r}^T = (r_1, \ldots, r_N)$. This approximation is valid when $\varphi P(\mathbf{r}|\mathrm{A})/P(\mathbf{r}|\mathrm{B}) \ll 1$, i.e., when the two conditioned response distributions, $P(\mathbf{r}|\mathrm{A})$ and $P(\mathbf{r}|\mathrm{B})$, overlap considerably. If the noise is Gaussian, the mutual information can be expressed in terms of the response mean and covariance, as
\[
I = \frac{1}{2} \varphi \ln \left( \frac{\mathrm{Det}(C_{\mathrm{B}})}{\mathrm{Det}(C_{\mathrm{A}})} \right) + \frac{1}{2} \varphi\, \mathrm{Tr} \left( C_{\mathrm{A}} C_{\mathrm{B}}^{-1} - \mathbb{1} \right) + \frac{1}{2} \varphi\, m^T C_{\mathrm{B}}^{-1} m, \tag{21}
\]
where $C_s$ ($s = \mathrm{A}, \mathrm{B}$) is the covariance of the noise in response to stimulus $s$, $\mathbb{1}$ is the identity matrix, and $m$ is the 'signal vector' which, in this case, is simply the difference between the mean responses to the two stimuli, i.e.,
\[
m_i \equiv \langle r_i \rangle_{\mathrm{A}} - \langle r_i \rangle_{\mathrm{B}}. \tag{22}
\]
Here $i$ indicates the $i$th neuron, and $\langle \cdot \rangle_s$ denotes the average over the conditional probability, $P(\mathbf{r}|s)$. To obtain Eq. (21), we have further assumed that the spike counts take large values, so that their discrete nature becomes unimportant and the sum over population patterns can be replaced by an integral.

Equation (21) can be interpreted particularly transparently in the case in which the variances of the single-neuron responses do not depend on the stimulus. In this case, the first two terms depend only upon the noise correlation; they correspond to the term $I^{(1)}_{\text{correlated}}$ in Eq. (16). The third term in Eq. (21) describes the interplay between signal and noise correlation, and corresponds to the term $I^{(2)}_{\text{correlated}}$ in Eq. (16). This term is the formal equivalent to the so-called 'linear Fisher information' used in many earlier studies. It can also be viewed as the signal-to-noise ratio discussed in Sec. 3, or the square of the sensitivity index, $d'$.

We can examine the contribution of the interplay of signal and noise correlation by comparing the third term in Eq.
(21) for correlated and independent neurons, again in the case in which the single-neuron variances do not depend on the stimulus. For independent neurons, the mutual information reduces to
\[
I_0 \equiv \frac{1}{2} \varphi\, m^T C_0^{-1} m, \tag{23}
\]
where $C_0$ is the diagonal matrix obtained from $C_{\mathrm{B}}$ by setting all off-diagonal elements to zero. If neurons are recorded from individually, one has access only to the moments $m$ and $C_0$, and the mutual information is estimated according to Eq. (23). By contrast, when neurons in a population are recorded from simultaneously, and the statistics of the population responses are fitted to a multivariate Gaussian, then the mutual information is given by the richer Eq. (21).

Since the covariance matrix in Eq. (23) is diagonal, the mutual information for independent neurons can be rewritten as
\[
I_0 = \frac{1}{2} \varphi \sum_{i=1}^{N} \frac{m_i^2}{\sigma_i^2}, \tag{24}
\]
where $\sigma_i^2$ is the variance of the activity of neuron $i$. $I_0$ grows linearly as $N$ increases, a property of the limit of a rare stimulus considered here. (Beyond a crossover size, the first-order expansion in $\varphi$ breaks down, and, asymptotically, the mutual information increases logarithmically in $N$.) By analogy with Eq. (24), a natural way to calculate $I^{(2)}_{\text{correlated}}$, indeed an approach followed by much of the literature (starting with Refs. (Abbott & Dayan, 1999, Sompolinsky et al., 2001, Wilke & Eurich, 2002)), is to diagonalize the covariance matrix, to obtain
\[
I^{(2)}_{\text{correlated}} = \frac{1}{2} \varphi \sum_{i=1}^{N} \frac{\tilde{m}_i^2}{\lambda_i}, \tag{25}
\]
where $\tilde{m}_i$ are the elements of the vector $m$ in the new basis in which $C_{\mathrm{B}}$ is diagonal, and $\lambda_i$ is the $i$th eigenvalue of the covariance matrix $C_{\mathrm{B}}$. The argument is then that, if the structure of pairwise correlations is such that the eigenvectors of the covariance matrix (the 'correlated modes') are not sparse and involve contributions from a sizable fraction of the neurons in the population, then the eigenvalues will scale with the population size. In this case, the sum in Eq. (25) will yield a weaker scaling with $N$ than the sum in Eq. (24).
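As a sanity check on the quantities just defined, the eigenbasis sum of Eq. (25) can be compared with a direct evaluation of the quadratic form in Eq. (21). The covariance and signal vector below are random toy choices and carry no biological meaning.

```python
import numpy as np

rng = np.random.default_rng(1)
N, phi = 5, 0.01
A = rng.normal(size=(N, N))
CB = A @ A.T + N * np.eye(N)         # a positive-definite toy covariance
m = rng.normal(size=N)               # toy signal vector

lam, U = np.linalg.eigh(CB)          # eigenvalues/eigenvectors of C_B
m_t = U.T @ m                        # signal vector in the eigenbasis
I2_eig = 0.5 * phi * np.sum(m_t**2 / lam)        # Eq. (25)
I2_dir = 0.5 * phi * m @ np.linalg.inv(CB) @ m   # third term of Eq. (21)
I0 = 0.5 * phi * np.sum(m**2 / np.diag(CB))      # Eq. (24), independent case
print(np.isclose(I2_eig, I2_dir))    # True
```

The comparison of `I2_eig` (or `I2_dir`) with `I0` is exactly the comparison the text describes, and it motivates the information ratio introduced next.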
In particular, if a few eigenvalues remain small as $N$ increases, then these eigenvalues dominate the sum in Eq. (25) and the latter asymptotes to a constant. In other words, $I^{(2)}_{\text{correlated}}$ saturates to a finite value in arbitrarily large populations of neurons.

This approach is not entirely satisfactory because it is difficult to compare Eq. (24) and Eq. (25) since both the numerator and the denominator differ. Indeed, the numerator in Eq. (25) depends upon the signal vector as well as the structure of the noise covariance. Furthermore, some of the eigenvalues in Eq. (25) may take small values, and one may wonder what dominates the sum: the larger terms associated with small eigenvalues or the more numerous, smaller terms associated with larger eigenvalues. To resolve these ambiguities, it is possible instead to calculate an 'information ratio' that quantifies by how much noise correlations suppress or enhance coding as compared to the case of an independent population of neurons (Azeredo da Silveira & Rieke, 2020). This ratio can be expressed in a compact form, as
\[
R_I \equiv \frac{I^{(2)}_{\text{correlated}}}{I_0} = \frac{\mathrm{Det}(\tilde{\chi})}{\mathrm{Det}(\chi)}, \tag{26}
\]
where $\chi$ is the correlation matrix corresponding to the covariance matrix $C_{\mathrm{B}}$, and $\tilde{\chi}$ is the projection of $\chi$ on the $(N-1)$-dimensional space orthogonal to the vector $v$, with elements $v_i \equiv m_i / \sigma_i$. The information ratio depends upon the spectra of the two matrices, $\chi$ and $\tilde{\chi}$. While $\chi$ depends only upon noise correlation, $\tilde{\chi}$ incorporates an interaction between the noise correlation and the modified signal vector.

To intuit the behavior of the information ratio defined in Eq. (26), it is instructive to examine information coding with two correlated neurons. The covariance of the noise reads
\[
C_{\mathrm{B}} = \begin{pmatrix} \sigma_1^2 & c\,\sigma_1 \sigma_2 \\ c\,\sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix}, \tag{27}
\]
where $\sigma_1$ and $\sigma_2$ are the standard deviations in the activities of the two neurons, and $c$ is the correlation of the noise. In this simple case, the information ratio takes the form
\[
R_I = \frac{1 - 2c/(\zeta + \zeta^{-1})}{1 - c^2}, \tag{28}
\]
where
\[
\zeta \equiv \frac{m_1}{\sigma_1} \left( \frac{m_2}{\sigma_2} \right)^{-1}. \tag{29}
\]
Since the parameter $\zeta$ can take any real value, inspection of the form of the information ratio reveals that large volumes in the space of model parameters yield $R_I > 1$ as well as $R_I <$
1. Specifically, the information ratio is larger than unity, i.e., noise correlation is beneficial to information coding, when $c > 2/(\zeta + \zeta^{-1})$. This relation, again, can be viewed as a generalization of the 'sign rule': it dictates how strong correlation ought to be to benefit coding as a function of the signal vector and the single-neuron variances.

This simple example also helps shed light on the discussion in Sec. 4. There, we invoked an 'informative dimension'. Similarly, here, we can ask whether there is an especially informative dimension in the two-dimensional space of the two-neuron population activity: in which direction should the unit vector, $e$, point in order to maximize the mutual information $I(s; x)$, where $s = \mathrm{A}$ or $\mathrm{B}$ and $x \equiv e^T \mathbf{r} = e_1 r_1 + e_2 r_2$? This problem is solved easily, and the unit vector that maximizes $I(s; x)$, call it $e^*$, can be expressed in terms of the signal vector as well as the variances and correlation of the pair of neurons. What is more interesting, though, is that the mutual information $I(s; x = e^{*T} \mathbf{r})$ matches $I^{(2)}_{\text{correlated}}$ exactly: that is, the one-dimensional variable $x$ recovers the entirety of the useful information contained in the two-dimensional activity of the neuron pair. The dimension defined by the vector $e^*$ in the space of population activity is thus an 'informative dimension' in the sense of Sec. 4.

In general, the informative dimension does not align with the signal vector. For example, in the limit of weak correlation, $|c| \ll 1$, and comparable variances, $|\sigma_1/\sigma_2 - 1| \ll 1$, the informative dimension is obtained by rotating the signal vector by an angle $\approx c\,(m_1^2 - m_2^2)/m^2 - 2\,(\sigma_1/\sigma_2 - 1)\, m_1 m_2 / m^2$, where $m^2 \equiv m_1^2 + m_2^2$. The fidelity of coding depends upon the noise along this direction, i.e., the variance of the projection of the noise along $e^*$. A contrasting picture has been discussed in the literature in recent years: a number of authors have argued that the fidelity of coding depends, rather, on the presence of what they call 'differential correlations' (Moreno-Bote et al., 2014, Kohn et al., 2016). These are taken to be present if the covariance matrix of the noise in the population activity contains a component proportional to $m m^T$, i.e., a component along the signal direction (Fig. 2A). The central conclusion from this line of research is that differential correlations limit coding performance in that they cause the information represented in the neural population to saturate asymptotically, as $N \to \infty$.

A picture that emerges from the argument summarized in this section and illustrated in Fig. 2 is that, if the covariance matrix contains small eigenvalues, then the information represented in the neural population (equivalently, the signal-to-noise ratio) can be large. This holds even if there is appreciable noise along the signal vector, provided that the eigenvectors corresponding to the low-noise directions are not orthogonal to the signal vector. More generally, figures of merit of the coding performance of a population of neurons, such as the mutual information or the signal-to-noise ratio, depend upon the full structure of the noise covariance in relation to the signal vector, and not exclusively upon the projection of the noise along the dimension defined by the signal vector. Rather, in scenarios such as the ones discussed above, what matters is the projection of the noise along an informative dimension in general distinct from that along the signal vector. This is true in the case of finite values of $N$ and in the asymptotic limit with $N \to \infty$.
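For the two-neuron case, both the closed form of Eq. (28) and the claim that a single informative direction recovers the full linear information can be verified numerically. In this sketch we take the optimal direction to be the linear-discriminant direction $e^* \propto C_{\mathrm{B}}^{-1} m$, which maximizes the signal-to-noise ratio of the projection; the parameter values are arbitrary.

```python
import numpy as np

m1, m2, s1, s2, c = 1.0, -0.5, 1.0, 2.0, 0.4
m = np.array([m1, m2])
CB = np.array([[s1**2, c * s1 * s2], [c * s1 * s2, s2**2]])

# Information ratio: direct evaluation vs. the closed form of Eq. (28).
I2 = m @ np.linalg.inv(CB) @ m           # prop. to I2; the factor phi/2 cancels
I0 = (m1 / s1)**2 + (m2 / s2)**2         # prop. to the independent information
zeta = (m1 / s1) / (m2 / s2)
R_closed = (1 - 2 * c / (zeta + 1 / zeta)) / (1 - c**2)
print(np.isclose(I2 / I0, R_closed))     # True

# The projection onto e* recovers the full quadratic form m^T C_B^{-1} m.
e = np.linalg.inv(CB) @ m
e /= np.linalg.norm(e)
snr_1d = (e @ m)**2 / (e @ CB @ e)       # SNR of x = e^T r
print(np.isclose(snr_1d, I2))            # True
```

Note that `e` is not parallel to `m` here, which illustrates the statement above that the informative dimension generally does not align with the signal vector.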
While some of the mathematical statements can simplify in the asymptotic limit, this limit may be far from natural for neural systems; for example, individual neurons may receive input from a modest number of presynaptic neurons, and correlations in this collection of presynaptic neurons will shape signaling in the postsynaptic neuron.

The discussion above aims at unifying various results in the literature using a common metric for neural coding, the mutual information. We build intuition about the impact of noise correlations on coding by developing a geometrical picture of the structure of signal and noise. Our goal is to highlight situations in which noise correlations are beneficial, detrimental, or inconsequential for the fidelity of the neural population code rather than to consider specific examples that fall into one category or another. Below, we summarize the assumptions that form the basis of our discussion and we touch upon some of the open questions that it raises.

Obstacles to understanding coding in populations of neurons arise largely from the high dimensionality of the problem. Experiments by necessity probe a small subregion of the space of interesting stimuli, and the presence of response nonlinearities (such as adaptation) implies that insights gleaned from such experiments often do not generalize to all stimuli. The space of neural responses is similarly high dimensional, and is impossible to probe completely in the finite duration of an experiment. But even an incomplete understanding of the role of noise correlations in neural responses is helpful to guide experimental design. For instance, understanding the neural code requires experiments that not only measure noise correlations but also take into account their relation to the encoded signal. Response variability may lie in a direction in which it impacts the encoded signal minimally. As an example, consider a population of orientation-tuned V1 neurons.
A change in stimulus orientation will increase activity in some neurons and decrease activity in others. Noise that produces correlated fluctuations in firing rate that are uniform across the population will interact minimally with the signal created by changes in orientation.

Our discussion centers around correlations between pairs of neurons, and neglects higher-order correlations. This is a matter of practicality: pairwise correlations have been measured extensively, while we know much less about the properties of higher-order correlations in real neural circuits. This is changing with the availability of experimental approaches that allow a large number of cells to be monitored simultaneously. Theoretical frameworks that account for geometrical relations between signal and higher-order noise correlations are likely to play an important role in the development of new experimental protocols and methods of data analysis, just as has been the case hitherto for pairwise correlations.

We chose to focus this review on how structures of signal and noise interact from a statistical point of view, rather than examining the neural mechanisms that produce such structures. Single-cell properties and the connectivity of real circuits constrain the structure of both signal and noise, and a finer understanding of the connection between biological constraints and network properties, on the one hand, and the statistics of population response, on the other, will help interpret empirical observations (Doiron et al., 2016, Rosenbaum et al., 2017, Huang et al., 2019, Trousdale et al., 2012, Ocker et al., 2017, Ostojic et al., 2009, Mastrogiuseppe & Ostojic, 2018, Schuessler et al., 2020, Tannenbaum & Burak, 2017, Goris et al., 2014, Lin et al., 2015, Pernice et al., 2011, Pernice & da Silveira, 2018, de la Rocha et al., 2007, Vidne et al., 2012). A ubiquitous example is the divergence of a common input into parallel circuits, which can create both signal and noise correlations in those circuits.
This is just one of the many ways in which real circuit mechanisms shape signal and noise. Returning to the picture of collective variables from statistical physics, we can hope in the future to understand the relation between the mechanisms that shape interactions between neurons, the impact of those interactions on the structure of signal and noise, and how these combine to yield a representation of information in neural populations.

Much of the literature in computational neuroscience focuses upon population coding in the 'thermodynamic limit' in which $N \to \infty$, that is, the limit of large populations of neurons. This is not, however, the only relevant limit. Individual neurons receive input from a finite set of other neurons; what matters for the postsynaptic neuron is the representation of information in its finite presynaptic pool. From a mathematical point of view, also, populations of moderate sizes may be the relevant ones: for particular structures of pairwise noise correlation, the condition $\tilde{N} c \approx 1$ can already be reached in populations of moderate size.

Appendices

A Noise correlation versus signal correlation

When we consider the noisy response of a neural population to an ensemble of stimuli, there are two possible averaging procedures: we can calculate averages (moments) over the noise or over the ensemble of stimuli. Loosely speaking, the former yields noise correlation while the latter yields signal correlation. To be more specific, we consider a population of $N$ neurons and we denote their outputs by $r_1, \ldots, r_N$. The statistics of population response is given by the conditional probability
\[
P(r_1, \ldots, r_N | s), \tag{30}
\]
where $s$ refers to a stimulus chosen from a set of discrete stimuli or drawn from a density over continuous stimuli. Noise correlations are non-vanishing if
\[
P(r_1, \ldots, r_N | s) \neq \prod_{i=1}^{N} P(r_i | s) \equiv \bar{P}(r_1, \ldots, r_N | s), \tag{31}
\]
where $P(r_i | s)$ is obtained from $P(r_1, \ldots, r_N | s)$ by averaging out all $r_j$, with $j \neq i$. The probability function in Eq.
(30) specifies the noise correlations; these characterize the population variability in response to a given stimulus, and, hence, are themselves functions of the stimulus. By contrast, signal correlation is a property of the statistics of the population response over the ensemble or density of stimuli. Signal correlations are obtained from the probability function
\[
\left\langle \prod_{i=1}^{N} P(r_i | s) \right\rangle_s \equiv \bar{P}(r_1, \ldots, r_N), \tag{32}
\]
where $\langle \cdot \rangle_s$ denotes an average over the ensemble or density of stimuli. Signal correlations are non-vanishing if
\[
\left\langle \prod_{i=1}^{N} P(r_i | s) \right\rangle_s \neq \prod_{i=1}^{N} \langle P(r_i | s) \rangle_s. \tag{33}
\]
Here, our focus will be on noise correlation and its effect upon sensory coding. Whenever we mention 'pairwise correlation' between two neurons labeled by $i$ and $j$, we refer to the quantity calculated as
\[
c_{ij} = \frac{\langle (r_i - \langle r_i \rangle)(r_j - \langle r_j \rangle) \rangle}{\sqrt{\langle (r_i - \langle r_i \rangle)^2 \rangle \, \langle (r_j - \langle r_j \rangle)^2 \rangle}}, \tag{34}
\]
where the average denoted by $\langle \cdot \rangle$ is weighed by the conditional probability in Eq. (30). This correlation coefficient follows the usual definition of a covariance normalized by the corresponding standard deviations. By analogy, we can define pairwise signal correlation as
\[
c^{\text{signal}}_{ij} = \frac{\left\langle (\langle r_i \rangle - \langle\langle r_i \rangle\rangle_s)(\langle r_j \rangle - \langle\langle r_j \rangle\rangle_s) \right\rangle_s}{\sqrt{\left\langle (\langle r_i \rangle - \langle\langle r_i \rangle\rangle_s)^2 \right\rangle_s \left\langle (\langle r_j \rangle - \langle\langle r_j \rangle\rangle_s)^2 \right\rangle_s}}, \tag{35}
\]
where the average denoted by $\langle \cdot \rangle_s$ is weighed by the 'prior probability' over stimuli, $P(s)$.

B Breakdown of the mutual information in terms of signal and noise
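The two averaging procedures in Eqs. (34) and (35) can be illustrated on synthetic data: two hypothetical neurons whose mean responses both increase with a scalar stimulus (positive signal correlation) while their trial-to-trial noise is independent (near-zero noise correlation). The tuning slopes and trial counts below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
stimuli = np.linspace(0.0, 1.0, 20)
trials = 2000
noise_cs, means = [], []
for s in stimuli:
    mean_resp = np.array([1.0 + 2.0 * s, 0.5 + 1.0 * s])  # toy tuning curves
    r = mean_resp + rng.normal(size=(trials, 2))           # independent noise
    noise_cs.append(np.corrcoef(r.T)[0, 1])  # Eq. (34), at fixed stimulus
    means.append(r.mean(axis=0))
means = np.array(means)
c_signal = np.corrcoef(means.T)[0, 1]        # Eq. (35), across stimuli
print(np.mean(noise_cs), c_signal)   # near-zero noise corr., near-unit signal corr.
```

The noise correlation of Eq. (34) is estimated within each stimulus condition, while the signal correlation of Eq. (35) is estimated from the stimulus-to-stimulus variation of the mean responses; the two measures dissociate cleanly in this construction.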
We follow Refs. (Panzeri et al., 1999, Pola et al., 2003), which introduce a breakdown of the mutual information into several terms that exhibit the various ways in which noise correlations may influence coding performance. We spell out a somewhat modified, and possibly more direct, derivation here. We reformulate the mutual information (Eq. (5)) in a form that emphasizes the contributions of signal and noise correlations.

To achieve this, we invoke the independent (marginalized) probabilities defined in Eqs. (31) and (32), and rewrite the mutual information in terms of the independent conditional probability, $P_{\mathrm{ind}}(\mathbf{r} \mid s)$, and ratios that carry the contribution of correlations, $P(\mathbf{r} \mid s)/P_{\mathrm{ind}}(\mathbf{r} \mid s)$ and $\langle P(\mathbf{r} \mid s) \rangle_S / \langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S$, as

$$I = \left\langle \sum_{\mathbf{r}} P(\mathbf{r} \mid s) \left[ \log \left( \frac{P(\mathbf{r} \mid s)/P_{\mathrm{ind}}(\mathbf{r} \mid s)}{\langle P(\mathbf{r} \mid s) \rangle_S / \langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S} \right) + \log \left( \frac{P_{\mathrm{ind}}(\mathbf{r} \mid s)}{\langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S} \right) \right] \right\rangle_S. \qquad (36)$$

We further separate independent probabilities from correlated ones by adding and subtracting the mutual information corresponding to an independent population of neurons, i.e., the term

$$I_{\mathrm{independent}} \equiv \left\langle \sum_{\mathbf{r}} P_{\mathrm{ind}}(\mathbf{r} \mid s) \log \left( \frac{P_{\mathrm{ind}}(\mathbf{r} \mid s)}{\langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S} \right) \right\rangle_S. \qquad (37)$$

This manipulation allows us to obtain the form in Eq. (16), i.e.,

$$I = I_{\mathrm{independent}} + I^{(1)}_{\mathrm{correlated}} + I^{(2)}_{\mathrm{correlated}}, \qquad (38)$$

where $I^{(1)}_{\mathrm{correlated}}$ is defined in Eq. (18) and

$$I^{(2)}_{\mathrm{correlated}} = \left\langle \sum_{\mathbf{r}} \left[ P(\mathbf{r} \mid s) - P_{\mathrm{ind}}(\mathbf{r} \mid s) \right] \log \left( \frac{P_{\mathrm{ind}}(\mathbf{r} \mid s)}{\langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S} \right) \right\rangle_S. \qquad (39)$$

The quantity $I_{\mathrm{independent}}$ represents the information carried by conditionally independent neurons; indeed, if there is no noise correlation, $P(\mathbf{r} \mid s) = P_{\mathrm{ind}}(\mathbf{r} \mid s)$, and both $I^{(1)}_{\mathrm{correlated}}$ and $I^{(2)}_{\mathrm{correlated}}$ vanish.
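To make the decomposition concrete, here is a minimal numerical sketch in Python with made-up probabilities (the variable names `cond`, `I_c1`, and so on are ours, not from the text). It evaluates Eqs. (5), (31), (34), (37), and (39) by brute force for two binary neurons and two equiprobable stimuli, and checks that the three terms sum to the full mutual information as in Eq. (38):

```python
import math
from itertools import product

# Toy example (hypothetical numbers): N = 2 binary neurons, two equiprobable
# stimuli. cond[s][(r1, r2)] is the conditional probability P(r1, r2 | s);
# all entries are strictly positive, so every logarithm below is finite.
cond = {
    0: {(0, 0): 0.61, (0, 1): 0.19, (1, 0): 0.09, (1, 1): 0.11},
    1: {(0, 0): 0.07, (0, 1): 0.23, (1, 0): 0.33, (1, 1): 0.37},
}
stimuli = list(cond)
responses = list(product((0, 1), repeat=2))
prior = {s: 1.0 / len(stimuli) for s in stimuli}

def marginal(s, i):
    """P(r_i | s): marginalize the joint conditional over the other neuron."""
    m = {0: 0.0, 1: 0.0}
    for r, p in cond[s].items():
        m[r[i]] += p
    return m

def noise_corr(s):
    """Pairwise noise correlation c_12 of Eq. (34), for binary responses."""
    m1, m2 = marginal(s, 0)[1], marginal(s, 1)[1]
    cov = cond[s][(1, 1)] - m1 * m2
    return cov / math.sqrt(m1 * (1 - m1) * m2 * (1 - m2))

# Conditionally independent surrogate P_ind(r | s) = prod_i P(r_i | s), Eq. (31)
cond_ind = {
    s: {r: marginal(s, 0)[r[0]] * marginal(s, 1)[r[1]] for r in responses}
    for s in stimuli
}

def stim_avg(dist):
    """Average a conditional distribution over the stimulus prior."""
    return {r: sum(prior[s] * dist[s][r] for s in stimuli) for r in responses}

P_avg, Pind_avg = stim_avg(cond), stim_avg(cond_ind)

# Full mutual information, Eq. (5)
I = sum(prior[s] * cond[s][r] * math.log2(cond[s][r] / P_avg[r])
        for s in stimuli for r in responses)

# I_independent, Eq. (37): information carried by the independent surrogate
I_ind = sum(prior[s] * cond_ind[s][r] * math.log2(cond_ind[s][r] / Pind_avg[r])
            for s in stimuli for r in responses)

# I_correlated^(1): stimulus dependence of the correlations themselves
# (the first logarithm in Eq. (36))
I_c1 = sum(prior[s] * cond[s][r]
           * math.log2((cond[s][r] / cond_ind[s][r]) / (P_avg[r] / Pind_avg[r]))
           for s in stimuli for r in responses)

# I_correlated^(2), Eq. (39): interplay of signal and noise correlations
I_c2 = sum(prior[s] * (cond[s][r] - cond_ind[s][r])
           * math.log2(cond_ind[s][r] / Pind_avg[r])
           for s in stimuli for r in responses)

print(f"c_12(s=0) = {noise_corr(0):+.3f}, c_12(s=1) = {noise_corr(1):+.3f}")
print(f"I = {I:.6f} bits; I_ind + I_c1 + I_c2 = {I_ind + I_c1 + I_c2:.6f} bits")
assert abs(I - (I_ind + I_c1 + I_c2)) < 1e-12  # Eq. (38) holds exactly
```

Because the two stimuli here induce noise correlations of opposite sign, the ratio $P(\mathbf{r} \mid s)/P_{\mathrm{ind}}(\mathbf{r} \mid s)$ depends on the stimulus, and $I^{(1)}_{\mathrm{correlated}}$ is strictly positive in this example.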
The quantity $I_{\mathrm{independent}}$ can be broken down further, to extract the contribution from signal correlation, by bringing in the single-cell marginalized probabilities,

$$P(r_i) \equiv \langle P(r_i \mid s) \rangle_s. \qquad (40)$$

With these, we can rewrite $I_{\mathrm{independent}}$ as

$$I_{\mathrm{independent}} = \left\langle \sum_{\mathbf{r}} P_{\mathrm{ind}}(\mathbf{r} \mid s) \log \left( \frac{P_{\mathrm{ind}}(\mathbf{r} \mid s)}{\prod_{i=1}^{N} P(r_i)} \right) \right\rangle_S - \left\langle \sum_{\mathbf{r}} P_{\mathrm{ind}}(\mathbf{r} \mid s) \log \left( \frac{\langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S}{\prod_{i=1}^{N} P(r_i)} \right) \right\rangle_S = I^{(1)}_{\mathrm{independent}} - I^{(2)}_{\mathrm{independent}}, \qquad (41)$$

with

$$I^{(1)}_{\mathrm{independent}} = \sum_{i=1}^{N} \left\langle \sum_{r_i} P(r_i \mid s) \log \left( \frac{P(r_i \mid s)}{\langle P(r_i \mid s) \rangle_s} \right) \right\rangle_s \qquad (42)$$

and

$$I^{(2)}_{\mathrm{independent}} = \sum_{\mathbf{r}} \langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S \log \left( \frac{\langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S}{\prod_{i=1}^{N} \langle P(r_i \mid s) \rangle_S} \right). \qquad (43)$$

By comparing the form of Eq. (42) with that of Eq. (5), we see that it expresses the sum of the information carried by $N$ independent neurons. Since $I^{(2)}_{\mathrm{independent}}$ vanishes in the absence of signal correlation, $I^{(1)}_{\mathrm{independent}}$ amounts to the total mutual information if both signal and noise correlations vanish. The quantity $I^{(2)}_{\mathrm{independent}}$ thus represents the loss of information due to signal correlation; indeed, $I^{(2)}_{\mathrm{independent}}$ is nothing but the difference between the entropy of the marginalized independent distribution, $\prod_{i=1}^{N} \langle P(r_i \mid s) \rangle_S$, and the entropy of the marginalized correlated distribution, $\langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S$,

$$I^{(2)}_{\mathrm{independent}} = - \sum_{\mathbf{r}} \prod_{i=1}^{N} \langle P(r_i \mid s) \rangle_S \log \left( \prod_{i=1}^{N} \langle P(r_i \mid s) \rangle_S \right) + \sum_{\mathbf{r}} \langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S \log \left( \langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S \right), \qquad (44)$$

and, as such, is non-negative.

The quantity $I^{(1)}_{\mathrm{correlated}}$ represents the information carried by noise correlation. By rewriting the logarithm in Eq.
(18) as

$$\log \left( \frac{P(\mathbf{r} \mid s)/P_{\mathrm{ind}}(\mathbf{r} \mid s)}{\langle P(\mathbf{r} \mid s)/P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S} \right) + \log \left( \frac{\langle P(\mathbf{r} \mid s)/P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S}{\langle P(\mathbf{r} \mid s) \rangle_S / \langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S} \right), \qquad (45)$$

then using convexity and Cauchy-Schwarz inequalities, one can show that $I^{(1)}_{\mathrm{correlated}}$ is non-negative. It vanishes if the ratio $P(\mathbf{r} \mid s)/P_{\mathrm{ind}}(\mathbf{r} \mid s)$ is independent of the stimulus. Thus, $I^{(1)}_{\mathrm{correlated}}$ accounts for stimulus coding by the values of the noise correlations themselves (Fig. 3): in the same way that differential firing rates characterize different stimuli, non-uniform noise correlation can also specify the stimulus.

Finally, the quantity $I^{(2)}_{\mathrm{correlated}}$ represents the increment or decrement of information due to the interplay between signal correlation and noise correlation. This appears if we rewrite its expression to emphasize the contribution of signal correlation, by replacing the ratio $P_{\mathrm{ind}}(\mathbf{r} \mid s) / \langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S$ by the product

$$\frac{\prod_{i=1}^{N} P(r_i)}{\langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S} \cdot \frac{P_{\mathrm{ind}}(\mathbf{r} \mid s)}{\prod_{i=1}^{N} P(r_i)}. \qquad (46)$$

We then rewrite $I^{(2)}_{\mathrm{correlated}}$ as

$$I^{(2)}_{\mathrm{correlated}} = \left\langle \sum_{\mathbf{r}} \left[ P(\mathbf{r} \mid s) - P_{\mathrm{ind}}(\mathbf{r} \mid s) \right] \log \left( \frac{\prod_{i=1}^{N} P(r_i)}{\langle P_{\mathrm{ind}}(\mathbf{r} \mid s) \rangle_S} \right) \right\rangle_S + \left\langle \sum_{\mathbf{r}} \left[ P(\mathbf{r} \mid s) - P_{\mathrm{ind}}(\mathbf{r} \mid s) \right] \log \left( \frac{P_{\mathrm{ind}}(\mathbf{r} \mid s)}{\prod_{i=1}^{N} P(r_i)} \right) \right\rangle_S, \qquad (47)$$

but the second term in fact yields a vanishing contribution:

$$\left\langle \sum_{\mathbf{r}} \left[ P(\mathbf{r} \mid s) - P_{\mathrm{ind}}(\mathbf{r} \mid s) \right] \log \left( \frac{P_{\mathrm{ind}}(\mathbf{r} \mid s)}{\prod_{i=1}^{N} P(r_i)} \right) \right\rangle_S = \sum_{i=1}^{N} \left\langle \sum_{\mathbf{r}} \left[ P(\mathbf{r} \mid s) - P_{\mathrm{ind}}(\mathbf{r} \mid s) \right] \log \left( \frac{P(r_i \mid s)}{P(r_i)} \right) \right\rangle_S = \sum_{i=1}^{N} \left\langle \sum_{r_i} \left[ P(r_i \mid s) - P_{\mathrm{ind}}(r_i \mid s) \right] \log \left( \frac{P(r_i \mid s)}{P(r_i)} \right) \right\rangle_S = 0, \qquad (48)$$

since $P_{\mathrm{ind}}(r_i \mid s) = P(r_i \mid s)$. Hence, $I^{(2)}_{\mathrm{correlated}}$ (Eq.
(39)) can be written in the simpler form given in Eq. (19).

DISCLOSURE STATEMENT
The authors are not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.
ACKNOWLEDGMENTS
Support was provided by the CNRS through UMR 8023, the SNSF Sinergia Project CRSII5 173728, and the National Institutes of Health (EY028111 and EY028542).

References
Abbott LF, Dayan P. 1999. The effect of correlated variability on the accuracy of a population code. Neural Comput.
Azeredo da Silveira R, Rieke F. 2020. Neural coding and the geometry of signal and noise. Preprint.
Bair W, Zohary E, Newsome WT. 2001. Correlated firing in macaque visual area MT: time scales and relationship to behavior. J Neurosci.
Brunel N, Nadal JP. 1998. Mutual information, Fisher information, and population coding. Neural Comput.
Graf AB, Kohn A, Jazayeri M, Movshon JA. 2011. Decoding the activity of neuronal populations in macaque primary visual cortex. Nat Neurosci.
Nigam S, Pojoga S, Dragoi V. 2019. Synergistic coding of visual information in columnar networks. Neuron.
Pola G, Thiele A, Hoffmann KP, Panzeri S. 2003. An exact method to quantify the information transmitted by different mechanisms of correlational coding. Network.
Tannenbaum NR, Burak Y. 2017. Theory of nonstationary Hawkes processes. Phys Rev E.
Wilke SD, Eurich CW. 2002. Representational accuracy of stochastic neural populations. Neural Comput.