Two-population model for MTL neurons: The vast majority are almost silent
Andrew Magyar and John Collins
Physics Department, Pennsylvania State University, University Park, PA 16802, USA
(Dated: 7 November 2014)

Recordings in the human medial temporal lobe have found many neurons that respond to pictures (and related stimuli) of just one particular person out of those presented. It has been proposed that these are concept cells, responding to just a single concept. However, a direct experimental test of the concept-cell idea appears impossible, because it would need the measurement of the response of each cell to enormous numbers of other stimuli. Here we propose a new statistical method for analysis of the data that gives a more powerful way to assess how close data are to the concept-cell idea. It exploits the large number of sampled neurons to give sensitivity to situations where the average response sparsity is much less than one response for the number of presented stimuli. We show that a conventional model in which a single sparsity is postulated for all neurons gives an extremely poor fit to the data. In contrast, a model with two dramatically different populations gives an excellent fit to data from the hippocampus and entorhinal cortex. In the hippocampus, one population comprises 7% of the cells, with a 2.6% sparsity, but the much larger remaining fraction, 93%, responds to only 0.1% of the stimuli. This results in an extreme bias in the reported responsiveness of neurons compared with that of a typical neuron. Finally, we show how to allow for the fact that some of the reported identified units correspond to multiple neurons, and find that our conclusions at the neural level are quantitatively changed but strengthened, with an even stronger difference between the two populations.
I. INTRODUCTION
A long-standing debate (e.g., [4–6, 26, 27, 30, 31]) concerns whether biological neural systems use local or distributed coding for the representation of high-level entities, like individual people. Although the consensus [10] is that only distributed coding is used, a number of experiments find results that are very suggestive of local coding. For example, Hahnloser, Kozhevnikov and Fee [11] find neurons in zebra finches that each fire only at one particular time during a bird's singing of its own song. In a sensory context in the medial temporal lobe (MTL) of human epileptic patients, Quian Quiroga et al. [32] report many examples of neurons that fire only in response to a stimulus that contains a particular person. It has been suggested that these neurons are concept cells [29], responding to one concept out of many possible.

It is therefore important to measure how close such neurons are to the local coding situation. (For our purposes, a local code means that there exist single neurons for particular concepts, ideas, or other entities, so that activity of one of these neurons above some threshold indicates exclusively the presence of the corresponding entity. With distributed coding, measurements from several or many neurons are needed to determine whether a particular entity is present.) There are some interesting conceptual issues associated with what precisely should be meant by that statement, and we will discuss these further in Sec. VII C. Among these is the question of whether a particular neuron should be considered as actually participating in the representation of a particular concept, when it is measured that the neuron responds strongly to the presence (in some sense) of that concept in the stimulus. It might be, for example, that, instead, the neuron represents an episodic memory whose content includes the concept. For our immediate purposes, we do not actually have to solve this issue.
The problem we address is that any neuron that does local coding for a high-level concept will respond to only a very small fraction of stimuli, so that direct experimental measurements have great difficulty testing whether the coding is actually local. Consider the classic case where the concepts represented are actually those detected in a current stimulus. For example, the concept might be that of a particular individual human in a visual scene. To measure whether a particular cell implements local coding for a concept, one would need presentation of stimuli that cover a large fraction of the concepts known to the subject. This is evidently far beyond current experimental capabilities, at least at present. Measurements such as those in [22, 32] use stimuli corresponding to only about 100 distinct entities (people, famous buildings, etc.).

As one of us has shown [9] in collaboration with Jin, a substantial complication in the interpretation of the data is that there are wide differences in the response properties of the neurons in question. A small percentage respond to several distinct stimuli out of those presented, while a vast majority respond on average to much less than one. This was quantitatively deduced from summary statistics for the data that were given in [32]. (No further breakdown of the data was given.) We made a prediction for the fraction of neurons as a function of the number of stimuli to which they make an above-threshold response.

In this paper, we present and use a more complete version of the novel statistical method underlying our earlier work, to deduce the conceptual-coding properties of neurons. Applying it to more recent and more detailed data from [22], we will find that in fact the vast majority of the measured neurons respond on average to much less than one stimulus in the approximately 100 that are presented.
A direct measurement on a single neuron would need thousands of conceptually distinct stimuli to achieve the same result. The effectiveness of our method arises from the large number of neurons probed. A measure of the method's power is the number of neuron-stimulus trials, i.e., the number of stimuli times the number of neurons. This is in the hundreds of thousands for the data of [22, 32].

From properties of the set of neuron-stimulus trials, treated as a sample, we deduce estimated response properties for the "universe" of all neurons in particular brain regions and all relevant stimuli. We will formulate the results in terms of the sparsity of individual neurons. Our definition is that the sparsity of a particular neuron is the fraction of stimuli to which it gives a response above some appropriate threshold (e.g., as chosen in [22, 32]). Our aim is to estimate the distribution of sparsity over neurons in a particular brain region. The simplest model [40] is that all neurons have (approximately) the same sparsity. But, as we will show, such a model is in dramatic disagreement with the data; a more general distribution with widely varying sparsities for different neurons is needed. We will estimate such a distribution, and show that most neurons in the areas concerned have a sparsity below 10⁻³.

These results confirm and extend those we made earlier [9]. In particular, the prediction mentioned above is successfully tested, with a quantitative measure of its (excellent) goodness of fit to the newer data. (Minor changes in parameters are needed for the different set of data.)

It has previously been pointed out, notably in [24, 36], that many neurons are unresponsive or even silent, within the limits of experimental measurements over a limited period, as is quantitatively supported by our results. Hence a problem is that reported measurements of responsive neurons can give a very biased picture of the nature of neural coding.
Our method provides a tool both for measuring the bias and for compensating for it. In our discussion section, Sec. VII, we include remarks on the implications of our results for the nature of neural coding of high-level concepts.

II. OUTLINE
After setting up our methods, we apply them to published data from Mormann et al. [22], which presents measurements of the number of neurons as a function of the number of stimuli (out of a total of around 100) to which a particular neuron gives an above-threshold response. The data are for four different areas of the MTL, pooled across several patients.

For the distribution of sparsity, we use a model whose general form has multiple populations of cells, each characterized by a particular sparsity and a fraction of the total neural population. We regard an experiment as using a random sample of neurons and stimuli. The model is in fact a particular kind of mixture model, in statistical terminology, with one component of the mixture for each population. To fit the model to data, an appropriate method is the maximum likelihood method, and it allows us not only to measure the parameters of the model and their uncertainties, but also to evaluate the goodness of fit with the aid of a χ² function. We show that the plots given in [22] (numbers of neurons responding to a particular number of stimuli) give sufficient statistics for fitting the model, in the sense of standard statistical theory. The much more elaborate statistical treatment used by Waydo et al. [40] is actually unnecessary, and, importantly, did not provide a measure of goodness of fit.

One motivation for using populations of neurons each with a particular value of sparsity arises from work by Attwell and Laughlin [1] and by Lennie [18]. There it is shown that to optimize the energy consumption by neurons transmitting a given amount of information, a particular value of sparsity is preferred, around a few percent. Even if other constraints besides energy consumption are important, the analyses suggest that sparsity is a parameter that could be adjusted (e.g., by evolution) to optimize the performance of a neural system.
Therefore it is sensible to propose models in which particular populations of neurons have particular values of sparsity. Another motivation for trying a multiple-population model is that recent experiments have detected multiple populations of neurons located in the various song-related regions of the brains of zebra finches [16]. These sets of neurons have strikingly different properties as regards how often they spike.

We first show that the simplest possible model, with one population of neurons all having the same sparsity, provides an unacceptably bad fit to the data. Such a model is quite often used, for example [40] by the experimental group whose data we fit. The nature of the deviations between the single-population model and the data will show that at least one more population of very sparsely responding neurons is needed. As an intermediate step we show that a better fit is obtained by adding a population of silent neurons, i.e., of neurons that gave no above-threshold responses whatever in the experiment. This provides a greatly improved fit, as expected, but the fit is still poor. Interestingly, the silent neurons are in the majority.

Our main model has two populations of neurons, each with a particular sparsity. It has three parameters: the sparsities of the two populations and the relative size of the two populations. We will find greatly improved fits. In the case of the hippocampus, the fit is completely acceptable by the appropriate χ² criterion, although in the other areas the fits are poorer, particularly for the PHC. In all cases, we find that at most only about 5% of the neurons have a normal sparsity of a few percent. The remaining 95% or more of the neurons respond ultra-sparsely, with a sparsity around 0.1%. Thus the vast majority of the neurons are almost silent: they respond on average to much less than one out of the approximately 100 distinct stimuli presented in the experiments.
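The arithmetic behind "almost silent" is easy to check. The following illustrative Python sketch (sparsity values taken from the percentages quoted above; not a fit) computes the binomial probability that a neuron of sparsity α gives zero above-threshold responses to roughly 100 stimuli.

```python
from math import comb

def p_k(k, S, alpha):
    """Binomial probability that a neuron of sparsity alpha
    responds to exactly k of S presented stimuli."""
    return comb(S, k) * alpha**k * (1 - alpha)**(S - k)

S = 100  # approximate number of distinct stimuli per session

# Sparsities quoted in the text: a few percent for the "normal"
# population, about 0.1% for the ultra-sparse population.
for alpha in (0.03, 0.001):
    print(f"alpha = {alpha}: mean responses = {alpha * S:.2f}, "
          f"P(no response at all) = {p_k(0, S, alpha):.3f}")
```

With α = 0.001, a neuron responds on average to only 0.1 of the roughly 100 stimuli and appears completely silent in about 90% of sessions, which is why such a population is so hard to characterize from direct single-neuron measurements.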
The neurons that are reported as responding are therefore an extremely biased sample of all neurons in the tested regions. Our statistical methods provide a way to quantify this bias and to compensate for it.

Our finding of a large number of silent cells quantitatively supports the conclusions of Shoham, O'Connor and Segev [36] about neural dark matter. It is similar to, but more extreme than, a conclusion by Olshausen and Field [24] about the (apparent) silence of many neurons in area V1 of the visual cortex.

There are other possible definitions of sparsity and/or selectivity besides the one we chose, and other definitions can give what appear to be dramatically different conclusions. A notable example is the definition of Treves and Rolls [39], who [35] find much larger values of sparsity, around 30%–40%. We will argue that, although the two definitions are equivalent in a certain limit, the high values for Rolls-Treves sparsity can be completely misleading as to the nature of neural coding. We will point out that it is possible that neurons could have a relatively low number of spikes in response to many stimuli but a dramatically higher and readily identifiable response to just a few stimuli. In that case, it can be that the Rolls-Treves sparsity measures mostly the variability of the common low responses, while our sparsity measures the fraction of the selective high responses.

Finally, we will comment on whether our conclusions are robust, on their implications, and on the possibility raised by Waydo et al. [40] that there may be even more apparently silent neurons, so that the true conclusions about ultra-sparsely responding neurons are even more extreme by a substantial factor than what we have fitted.

III. GENERAL MULTI-POPULATION MODEL AND ITS STATISTICAL ANALYSIS
We define the sparsity of a unit or neuron as the fraction of total stimuli that elicit a response from that unit. For our purposes, distinct stimuli are distinct visual images of people, objects, etc., such as used in [32]. The threshold for defining a response is set by the experimenters [32]. Only a very small subset of the total possible stimuli are presented in a particular experimental context, and any precise estimates of the sparsity distribution are likely to depend on the circumstances of the experiment.

This definition of sparsity is appropriate where the measured neurons typically have a low firing rate, and occasionally have a much higher firing rate, under specific circumstances. This applies to typical neurons in the kind of data we analyze — see examples in [22, 32]. Related examples that suggest when our definition is appropriate can be found in neural activity in the HVC area of zebra finches during their song [11, 16]. RA-projecting and X-projecting neurons have occasional high firing rates at consistent points in a bird's song, with few spikes elsewhere. Our thresholded definition of sparsity would be appropriate (with different values for the RA- and X-projecting neurons, of course). The high responses can be usefully read out, not just by experimentalists, but by other "reader" neurons [7]. In contrast, interneurons fire at a high rate during much of the song, but with fairly consistent patterns of various levels of activity. For these, the thresholded, binary definition of sparsity would be less appropriate. It might well be that the interneurons have very low sparsity, since deviations of several standard deviations above the mean firing rate are likely to be rare, according to a visual examination of the figures in [11, 16].
But this would be misleading as to the nature of the interneurons' firing properties.

In the data that we analyze from [22], units were identified from electrode recordings using spike-sorting techniques, which cannot always distinguish individual neurons. Thus, the firing patterns reported for some units represent the aggregate firing of multiple neurons rather than that of a single neuron. The simplest version of our model ignores this distinction, and assumes that each recorded unit consists of only a single neuron. But, as we will explain in Sec. VI, the general principles of our methods apply perfectly well at the unit level. We will show how to transform from a neural-level model to a unit-level model, and we will see how to interpret numerical results of fits at the unit level in terms of single-neuron properties. This will result in no change in our qualitative results, but a strengthening of our conclusions about the presence of a large number of very sparsely responding neurons.
A. Model Definition
In the data we analyze, the unit firing patterns are treated in a binary fashion. The threshold for a response is defined by the experimentalists [22, 32]. We assume:

• The sparsity of each neuron remains constant over the course of the experiment.

• The recorded neurons are statistically independent of each other.

• The neurons are partitioned into distinct populations, each with fractional abundance f_i, where i labels the population.

• All neurons in population i have the same sparsity, α_i.

The statistical independence of the recorded neurons was verified experimentally [32].

Our ultimate goal is to find an appropriate set of populations, with their fractional abundances and sparsities. However, as will become apparent later, the data have limited ability to distinguish populations with sparsities that are close to each other: in that situation the effect is close to that of a single population with a single averaged sparsity. In contrast, the data do provide the ability to distinguish populations of widely different sparsities. So our practical goal will be to find the simplest model that is consistent with reported data; the model is thus meant only to be a useful approximation to reality.

In this paper, we will start from a very simple model with one population of neurons, and elaborate it in two stages. This will give three models:

1. Single-population model: All neurons have the same sparsity, α. This model has one parameter.

2. Two-population model, with one population silent: The active population has sparsity α_D and abundance f_D, while the other population is silent, with α_S = 0 and abundance 1 − f_D. This model has two parameters.

3. Full two-population model: Two active populations are present, one with parameters α_D and f_D, and one ultra-sparse population with parameters α_US and f_US = 1 − f_D. This model has three parameters.
The labeling of the populations is defined by choosing α_US < α_D.

For each model, we first produce the best fit to the data in [22], by using maximum likelihood estimates (MLE) for the parameters [19]. We will do this for each of the four regions of the MTL for which data are reported. Then we test the goodness of the fit by using a χ² analysis.

B. Mathematical characterization of data and model
To implement the MLE of the parameters, we first derive the likelihood function for the model. This is defined [19] as the probability of the data given the model and its parameters.

Given that the neurons are treated as binary (responsive or not responsive), the data from an experiment in which N neurons are recorded during the presentation of S stimuli can be fully represented by a collection of NS Bernoulli random variables, X_js:
\[
X_{js} = \begin{cases} 1, & \text{if neuron } j \text{ responds to stimulus } s, \\ 0, & \text{if neuron } j \text{ does not respond to } s. \end{cases} \tag{1}
\]
(Here and elsewhere, we use a notational convention that is common in the statistical literature: X_js with an upper-case X denotes a random variable in the technical statistical sense, while the corresponding symbol x_js with a lower-case x denotes a numerical value of the random variable that results from a particular set of experimental observations.)

Our model in its general form has M populations, with population i having fractional abundance f_i and sparsity α_i. The fractional abundances add to unity:
\[
\sum_{i=1}^{M} f_i = 1, \tag{2}
\]
so that there are 2M − 1 independent parameters. If we knew the population i_j to which a neuron j belongs, then the probability distribution of X_js would be
\[
\mathrm{Prob}\left(X_{js} = x_{js} \,\middle|\, \alpha_{i_j}\right)
= \begin{cases} \alpha_{i_j}, & \text{if } x_{js} = 1, \\ 1 - \alpha_{i_j}, & \text{if } x_{js} = 0 \end{cases}
\;=\; \alpha_{i_j}^{x_{js}} \left(1 - \alpha_{i_j}\right)^{1 - x_{js}}. \tag{3}
\]
Then the probability of a particular outcome in the showing of S stimuli to the whole set of neurons is
\[
\mathrm{Prob}\left(\{X_{js} = x_{js}\} \,\middle|\, \{\alpha_{i_j}\}\right)
= \prod_{j,s} \mathrm{Prob}\left(X_{js} = x_{js} \,\middle|\, \alpha_{i_j}\right)
= \prod_{j} \left[ \alpha_{i_j}^{\sum_{s=1}^{S} x_{js}} \left(1 - \alpha_{i_j}\right)^{S - \sum_{s=1}^{S} x_{js}} \right]. \tag{4}
\]
Here {X_js = x_js} and {α_{i_j}} denote the whole array of the quantities notated.

It is now convenient to define two sets of auxiliary random variables. One is the number of responses that a particular neuron makes:
\[
K_j = \sum_{\text{stimuli } s} X_{js}. \tag{5}
\]
The second is the number of neurons N_k that give k responses to the stimuli:
\[
N_k = \sum_{\text{neurons } j} \delta_{k, K_j}, \tag{6}
\]
where, as usual, the Kronecker delta δ_{αβ} obeys δ_{αβ} = 1 if α = β and δ_{αβ} = 0 otherwise.

We can then write the probability of the data (given the set of α_{i_j}) in terms of the values k_j of the random variables K_j alone:
\[
\mathrm{Prob}\left(\{X_{js} = x_{js}\} \,\middle|\, \{\alpha_{i_j}\}\right)
= \prod_{j} \left[ \alpha_{i_j}^{k_j} \left(1 - \alpha_{i_j}\right)^{S - k_j} \right]. \tag{7}
\]
But we do not know the values of each neuron's sparsity, so the probability of the data given the model parameters is given by summing over the possible sparsity values weighted by their probabilities:
\[
\mathrm{Prob}(\{X_{js} = x_{js}\} \,|\, \{\alpha_i, f_i\})
= \prod_{j=1}^{N} \left[ \sum_{i=1}^{M} f_i\, \alpha_i^{k_j} (1 - \alpha_i)^{S - k_j} \right]. \tag{8}
\]
This depends only on the values of the random variables K_j, and not on any more detailed properties of the data. That is, the values of K_j are sufficient statistics for fitting the model.

Therefore it is useful to compute the probabilities for the K_j. First, for one neuron of given sparsity α, the probability of k responses, in the presentation of S random images, is the binomial distribution:
\[
P(K = k \,|\, \alpha) = \binom{S}{k} \alpha^k (1 - \alpha)^{S-k}. \tag{9}
\]
Here the factor \(\binom{S}{k} = S!/(k!\,(S-k)!)\) counts the number of ways of getting k responses to S stimuli. The distribution (9) has mean αS and standard deviation \(\sqrt{\alpha(1-\alpha)S}\), which is approximated by \(\sqrt{\alpha S}\) for the actual situation of α ≪ 1. The ratio of standard deviation to mean is \(\sqrt{(1-\alpha)/(\alpha S)} \simeq 1/\sqrt{\alpha S}\), which goes to zero as S → ∞. If we plot the distribution of k/S, i.e., the distribution of the fractional response rate, this can be regarded as a smearing of the delta function δ(k/S − α).

In the model, there are M distinct neuronal populations, each with sparsity α_i and fractional abundance f_i. If the population to which the neuron belongs is unknown, then its probability of responding to k out of S stimuli is
\[
\epsilon_k \equiv P(K = k) = \sum_i f_i\, P(K = k \,|\, \alpha_i)
= \sum_{i=1}^{M} f_i \binom{S}{k} \alpha_i^k (1 - \alpha_i)^{S-k}. \tag{10}
\]
With the aid of Eq. (2), one can check that the probabilities in Eq. (10) correctly sum to unity: \(\sum_{k=0}^{S} \epsilon_k = 1\).

We already know that the K_j are sufficient statistics for fitting the model to the data. So we need the probabilities for the set of K_j:
\[
\mathrm{Prob}\left( \bigwedge_{j=1}^{N} K_j = k_j \right) = \prod_{j=1}^{N} \epsilon_{k_j} = \prod_{k=0}^{S} \epsilon_k^{n_k}, \tag{11}
\]
where "∧" denotes "and". In the last part of this equation, we have simply counted the number of neurons with a given value of k, for every possible value of k. Since the result depends only on the values of n_k, these themselves form a set of sufficient statistics for fitting the model to the data. Ref. [22] provided results for these numbers, and no further information on the raw experimental data is needed to fit the parameters of the model; examination of the other quantities used in the work of Waydo et al. [40] is unnecessary.

C. Likelihood and its analysis
Since the random variables N_k are sufficient statistics, we need the probabilities for them. From these we obtain the likelihood function that we will use for fitting the model to the data. It is obtained from Eq. (11) by counting the number of different arrays of K_j that give a given set of N_k's. We then get
\[
L(\{\alpha_i, f_i\} \,|\, \mathrm{Data}) \equiv \mathrm{Prob}(\{N_k = n_k\} \,|\, \{\alpha_i, f_i\})
= \frac{N!}{\prod_{k=0}^{S} n_k!} \prod_{k=0}^{S} \epsilon_k^{n_k}. \tag{12}
\]
In this likelihood function, the dependence on the model parameters, {α_i, f_i}, is contained in the ε_k.

Note that in making the transformation from a distribution over X_js to K_j and then to N_k, we have greatly reduced the number of data items to be considered, from NS to N (the number of neurons) to S (the number of stimuli). Moreover, as can be seen from the data in [22], only the first few N_k are nonzero in reality, at least to a good approximation.

To estimate the values of the parameters of the model, we use the maximum-likelihood method, as is appropriate for this situation. The methods we use are, in fact, almost identical to long-established methods [2, 19] that are regularly used for analyzing data from scattering experiments in high-energy physics (and more generally, scattering experiments in physics). This close similarity arises because both in the scattering experiments and in the neural data, we have a large number of independent trials, and the probability of a non-trivial outcome is small. A non-trivial outcome in the physics case is a scattering event between pairs of particles in the beams in an experiment, while in the neural case it is an above-threshold response by a particular neuron to a particular stimulus. The physics analog of the neural N_k is the number of scattering events in a certain bin of kinematics. Although our methods are informed by those used in the physics case, we will provide a self-contained treatment appropriate to the neural case.
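As a concrete illustration, the mixture probabilities of Eq. (10) and the logarithm of the likelihood of Eq. (12) can be sketched in a few lines of Python. The multinomial prefactor N!/∏ n_k! does not depend on the model parameters, so it can be dropped for maximization; the population parameters and the histogram below are placeholders, not fitted values.

```python
from math import comb, log

def binom_pmf(k, S, alpha):
    # P(K = k | alpha): binomial probability of k responses to S stimuli, Eq. (9)
    return comb(S, k) * alpha**k * (1 - alpha)**(S - k)

def eps(k, S, populations):
    # epsilon_k of Eq. (10); populations is a list of (f_i, alpha_i) pairs
    return sum(f * binom_pmf(k, S, a) for f, a in populations)

def log_likelihood(n, S, populations):
    # ln L from Eq. (12), dropping the parameter-independent factor N!/prod(n_k!)
    return sum(n_k * log(eps(k, S, populations))
               for k, n_k in enumerate(n) if n_k > 0)

S = 100
pops = [(0.05, 0.03), (0.95, 0.001)]   # placeholder two-population model
# Normalization check, Eq. (13): the epsilon_k sum to unity
assert abs(sum(eps(k, S, pops) for k in range(S + 1)) - 1.0) < 1e-12

n = [900, 60, 25, 10, 5] + [0] * 96    # hypothetical response histogram n_k
print(log_likelihood(n, S, pops))
```

Maximizing this ln L numerically over (f_D, α_D, α_US) with any generic optimizer reproduces the MLE procedure described in the text; the actual fits reported below were performed in Mathematica.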
An important tool we will take over from high-energy physics is an appropriately defined χ² function that is closely related to the logarithm of the likelihood function. Minimization of a suitably defined [2, 12] χ² is equivalent to maximizing likelihood. The minimum value of χ² provides a very convenient measure of goodness of fit, to assess how consistent the data are with the model. The shape of the χ² function near the minimum can also be used to estimate the uncertainties on the fitted values of the parameters of the model, and also the correlations in the uncertainties.

The ε_k are subject to the normalization constraint
\[
\sum_{k=0}^{S} \epsilon_k = 1, \tag{13}
\]
while the n_k are constrained by
\[
\sum_{k=0}^{S} n_k = N. \tag{14}
\]
The maximum likelihood method finds the best fit to the data by maximizing L({α_i, f_i}) with respect to the parameters f_i and α_i.

The data show that the probability of a neural response is much less than one. Thus we can usefully make approximations appropriate for the situation that ε_k ≪ 1 and hence n_k ≪ N for k ≥ 1. Hence ε_0 is close to unity, while n_0 is less than N only by a small fraction. Under these approximations, the likelihood function simplifies to
\[
L \approx \prod_{k=1}^{S} \frac{e^{-N\epsilon_k} \left(N\epsilon_k\right)^{n_k}}{n_k!}, \tag{15}
\]
i.e., a product of Poisson distributions for each of the non-zero k values. A derivation is given in the Appendix.

D. Error analysis
Suppose that we have estimated the best-fit parameters by maximizing likelihood with respect to the parameters {α_i, f_i}. Then, taking the likelihood function to be approximately Gaussian in the vicinity of its maximum, we can estimate the confidence intervals and correlations of the model parameters [19] from the covariance matrix, defined by
\[
\mathrm{cov}(\theta_i, \theta_j) = -\left(H^{-1}\right)_{ij}, \tag{16}
\]
where the Hessian matrix is
\[
H_{ij} = \frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \ln L. \tag{17}
\]
The diagonal terms yield the parameter variances, cov(θ_i, θ_i) = σ_i². Correlations between model parameters, ρ_ij, can be determined from the off-diagonal elements of Eq. (16):
\[
\rho_{ij} = \frac{\mathrm{cov}(\theta_i, \theta_j)}{\sigma_i \sigma_j}. \tag{18}
\]
Two important properties of a Poisson distribution \(e^{-N\epsilon_k}(N\epsilon_k)^{n_k}/n_k!\) are its mean and standard deviation:
\[
\langle N_k \rangle = N\epsilon_k, \qquad \mathrm{s.d.}(N_k) = \sqrt{N\epsilon_k}. \tag{19}
\]
Thus we can characterize the typical values of n_k by \(n_k = N\epsilon_k \pm \sqrt{N\epsilon_k}\).

We will assess goodness of fit by using the following χ² function:
\[
\chi^2(k_{\max};\, n_1, n_2, \ldots) = \sum_{k=1}^{k_{\max}} \frac{\left(n_k - N\epsilon_k\right)^2}{N\epsilon_k}. \tag{20}
\]
This corresponds to −2 ln L, to within an additive constant, when the Poisson distributions are replaced by Gaussian approximations near their peak. But this approximation is only suitable when Nε_k is reasonably much larger than one. So, in Eq. (20), we truncated the sum over k to k ≤ k_max. The restriction should be to those values of k such that the expected value of n_k is bigger than one or two, i.e., where there are noticeable neural responses. Beyond k_max, there are very few neural responses, and therefore little information for fitting the parameters of the model.

After we have performed a maximum-likelihood estimate of the parameters of the model, statistical theory [19] predicts a mean and standard deviation for χ²:
\[
\chi^2 = N_{\mathrm{dof}} \pm \sqrt{2 N_{\mathrm{dof}}}, \tag{21}
\]
where the number of degrees of freedom, N_dof, is the number of data values used (i.e., the number of values of k in the truncated sum) minus the number of parameters fitted (which will be 1, 2 or 3, in the particular implementations of the model that we use). If the value of χ² falls much outside this range, that indicates that the model does not agree with the data.

IV. MULTI-UNIT CONSIDERATIONS
The preceding analysis assumes that the recorded units all consist of a single neuron. However, only a fraction of the recorded units contain a single neuron, while the rest are composed of several neurons. We now show that, given our general multi-population model at the neuron level, a version of the model also applies at the unit level. That is, each unit has a sparsity, which can be calculated as a function of the sparsities of its constituent neurons, and there are populations of units with different values of sparsity. This is the general picture. Some simplifications and useful approximations can be made in the application to real data, as we will see in Sec. VI.

Suppose first that a particular unit consists of R neurons of known sparsities. Let r = 1, …, R label the neurons, and let α_{i_r} be the sparsity of neuron r. The unit's sparsity, i.e., the probability that the unit responds to a presented stimulus, is given by
\[
\alpha' \equiv \mathrm{Prob}\left(\text{response} \,\middle|\, R, \alpha_{i_1}, \ldots, \alpha_{i_R}\right) = 1 - \prod_{r=1}^{R} \left(1 - \alpha_{i_r}\right). \tag{22}
\]
Then the probability that the unit responds to k of S stimuli, given R and the α_{i_r}, is a binomial distribution:
\[
\mathrm{Prob}(K = k \,|\, R, \alpha_{i_1}, \ldots, \alpha_{i_R}) = \binom{S}{k} (\alpha')^k (1 - \alpha')^{S-k}. \tag{23}
\]
So far, we have assumed that R and the sparsities α_{i_r} are known. Next, we assess the unit sparseness in terms of the probability distribution of the neural sparsities. Let ε^unit_{k,R} be the total probability that the unit responds to k stimuli, given the number R of neurons in the unit. Then ε^unit_{k,R} is given by summing Eq. (23) over the distributions of the α_{i_r}. In the M-population model for the neurons,
\[
\epsilon^{\mathrm{unit}}_{k,R} = \int d\alpha_{i_1} \cdots d\alpha_{i_R}\; \mathrm{Prob}(K = k \,|\, R, \alpha_{i_1}, \ldots, \alpha_{i_R})\, \mathrm{Prob}(\alpha_{i_1}, \ldots, \alpha_{i_R})
= \sum_{i_1=1}^{M} \cdots \sum_{i_R=1}^{M} f_{i_1} f_{i_2} \cdots f_{i_R} \binom{S}{k} (\alpha')^k (1 - \alpha')^{S-k}, \tag{24}
\]
with α′ given by Eq. (22).
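A direct transcription of Eqs. (22) and (24) is straightforward for small M and R. The following Python sketch (illustrative only; the population parameters are placeholders) evaluates the unit sparsity and the unit-level response probabilities by brute-force enumeration of population assignments.

```python
from math import comb
from itertools import product

def unit_sparsity(alphas):
    """alpha' of Eq. (22): probability that a unit made of neurons
    with the given sparsities responds to a presented stimulus."""
    p_silent = 1.0
    for a in alphas:
        p_silent *= 1.0 - a
    return 1.0 - p_silent

def eps_unit(k, S, R, populations):
    """epsilon^unit_{k,R} of Eq. (24), by enumerating all M^R
    assignments of the R neurons to populations (f_i, alpha_i)."""
    total = 0.0
    for combo in product(populations, repeat=R):
        f_prod = 1.0
        for f, _ in combo:
            f_prod *= f
        ap = unit_sparsity([a for _, a in combo])
        total += f_prod * comb(S, k) * ap**k * (1 - ap)**(S - k)
    return total

# Placeholder two-population model; a two-neuron unit's sparsity is
# slightly below the sum of its constituents' sparsities:
pops = [(0.05, 0.03), (0.95, 0.001)]
print(unit_sparsity([0.03, 0.001]))   # 1 - 0.97 * 0.999
```

Summing eps_unit over k = 0, …, S gives unity for any R, since the f_i sum to one; this is the unit-level analog of the normalization in Eq. (13).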
This is in fact a version of the original multi-population model, whose ε_k is given in Eq. (10), with a more complicated labeling of the populations. If the number of neurons in the unit, R, is sampled from a distribution, g(R), then the total probability that a unit responds to k of S stimuli, ε^unit_k, is given by
\[
\epsilon^{\mathrm{unit}}_k = \sum_{R=1}^{\infty} g(R)\, \epsilon^{\mathrm{unit}}_{k,R}. \tag{25}
\]
Again, this is a case of a multi-population model. Eq. (25) is then substituted for ε_k in the likelihood function, Eq. (15), and the rest of the analysis proceeds in an identical fashion as with the single-neuron unit model.

The combination of Eqs. (24) and (25) shows that from the populations at the neural level, with their abundances and sparsities, we obtain a (larger) set of populations at the unit level: in all cases, Eqs. (10), (24), and (25), ε_k is a linear combination of binomial distributions. Although the structure of the populations appears complicated, considerable approximate simplifications will become apparent when we fit data.

V. FITS TO DATA
The data [22] being analyzed come from recordings of single-neuron activity in four regions of the human MTL: the hippocampus (Hipp), the entorhinal cortex (EC), the amygdala (Amy), and the parahippocampal cortex (PHC). Altogether, 1194 neurons/units were detected in the hippocampus, 844 in the entorhinal cortex, 947 in the amygdala, and 293 in the parahippocampal cortex, accumulated over patients. During the experiment, the patients were shown a randomized sequence of images of famous individuals, landmarks, animals, and objects. For each session, the patients were shown on average 97 images, each of which was presented six times during the random sequence. Histograms of the number of responses per unit for neurons in the four MTL regions were then produced. The values are given in Table I, which we read from graphs in Fig. 3 of [22].

In this section we show the results of fits to the data with the three model implementations of our general scheme that we summarized in Sec. III A. We start with a simple model with one population of neurons with a single sparsity. Analyzing the disagreement between data and model leads us first to add a population of completely silent neurons, and then to allow the second population a non-zero sparsity. We fit the parameter(s) of each model separately for each of the brain regions for which data were given. In Table II, we tabulate the numerical results of the fits of each model in each region. In Figs. 1–4, we show the results graphically; each figure shows the results of the three fits for a single region.
A. One-Population Model
First, we attempt to fit the data in the four regions assuming only one population of neurons with sparsity α. In this case, Eq. (10) becomes

    \epsilon_k = \binom{S}{k} \alpha^k (1 - \alpha)^{S-k}.    (26)

Substituting this into the likelihood function and maximizing with respect to α yields a maximum-likelihood estimate (MLE) for the sparsity. For this and the other fits, the maximization was performed numerically in Mathematica, and we used the exact formula (12) for the likelihood, without any approximation such as (15).

The parameters of the fits are shown numerically in Table II(a). In the top row of each of Figs. 1–4, the data for each region are compared with the expectation values of the response counts N_k:

    \langle N_k \rangle = N \epsilon_k.    (27)

The error bars in the plots are the one-standard-deviation variations that the model predicts for the N_k over repetitions of the whole experiment.

We assess the goodness of fit of the model to the data by the χ² test, with χ² defined in Eq. (20). The enormous values of χ² (see Table II(a)) show that the fits are extremely bad, given the expected range, Eq. (21).

The nature of the bad fit is seen in the plots in the top row of each of Figs. 1–4. Because the value of n_0, the number of non-responding neurons, is much larger than the other n_k, we show two versions of each plot. The right-hand plots have the k = 0 bin omitted and a changed vertical scale, to better exhibit the other bins. The data considerably exceed the best fit in the k = 0 bin in all 4 regions, but undershoot the fit in the other bins. We have a choice. If the sparsity is large enough to give a sufficient number of cells that respond to multiple stimuli, then, compared with data, too few cells are non-responsive. If instead the sparsity is low enough to reproduce the number of non-responsive cells (i.e., n_0), then there is much too small a probability for multiple responses. In either case, the model cannot reproduce all the data.
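Since Eq. (26) makes the N·S unit-stimulus pairs independent Bernoulli trials, the MLE for the single sparsity has a closed form: the total number of responses divided by N·S. The following Python sketch (ours, not the paper's Mathematica code) applies it to the hippocampus counts of Table I and reproduces the mismatch pattern described above:

```python
from math import comb

# Response histogram n_k for the hippocampus, read from Table I
n = [1019, 113, 30, 17, 7, 4, 1, 2, 0, 0, 0, 0, 1, 0, 0]
S = 97          # average number of images per session
N = sum(n)      # 1194 units

# MLE of a single binomial sparsity: total responses / (N * S)
alpha = sum(k * nk for k, nk in enumerate(n)) / (N * S)

# Expected counts <N_k> = N * eps_k, Eqs. (26)-(27)
expected = [N * comb(S, k) * alpha**k * (1 - alpha)**(S - k)
            for k in range(len(n))]
```

With these numbers the fitted expectation falls short of the data in the k = 0 bin and overshoots it at k = 1, the trade-off described in the text.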
 k:      0    1   2   3  4  5  6  7  8  9  10  11  12  13  14
Hipp  1019  113  30  17  7  4  1  2  0  0   0   0   1   0   0
EC     761   45  15   9  4  8  0  0  1  0   0   0   0   1   0
Amy    842   61  17  15  3  3  1  0  1  1   0   0   0   1   2
PHC    244   13  11   7  3  0  4  1  2  4   3   0   1   0   0

TABLE I: Number of neurons n_k responding to k images, as reported by [22], in four MTL regions.

TABLE II: Maximum-likelihood estimates of the parameters, with their one-standard-deviation uncertainties, and the χ² goodness-of-fit values, in the four MTL regions, for: (a) the one-population model, with sparsity α; (b) the two-population model with one population totally silent, with parameters α_D and f_D; (c) the two-population model with two active populations, with parameters α_D, f_D, α_US, and f_US = 1 − f_D, and with two versions, χ²(5) and χ²(10), of the χ² measuring the goodness of fit.

B. One Active Population, One Silent Population Model
A simple and natural improvement to the model is to add a population of completely silent neurons that do not respond to any of the stimuli used (cf. [36]). That is, we make a 2-population model with one active and one silent population of neurons. The silent population has sparsity zero. For the active population, let α_D be its sparsity, and let f_D be its fractional abundance. In this case, Eq. (10) becomes

    \epsilon_k = f_D \binom{S}{k} \alpha_D^k (1 - \alpha_D)^{S-k},    (28)

for k ≥ 1, while the k = 0 term gains the extra contribution 1 − f_D of the silent population, so that ε_0 = 1 − f_D + f_D(1 − α_D)^S. The results of the fits are shown in Table II(b) and in the middle row of each of Figs. 1–4. The fits are improved, but the χ² values are still substantially too large for a good fit. Notice how the silent population contains by far the majority of the neurons in all four regions.

The pattern of deviations between data and model is now an excess for the data in the k = 1 bins and a deficit at higher k. That is, the number of cells responding exactly once is substantially higher compared with the extrapolation of the numbers of cells with multiple responses. This indicates that a better model would be to replace the silent neural population by a slightly active population. To fit the data, this population must have a very small sparsity, so that it predominantly gives contributions to the k = 0 and k = 1 bins only.

C. Two-Population Model
Therefore our final model uses two active populations, each with a particular sparsity. One population we call the distributed population, with a sparsity α_D and fractional abundance f_D. The other population we call the ultra-sparse population, with sparsity α_US. The fractional abundance of the ultra-sparse population is f_US = 1 − f_D. Then Eq. (10) becomes

    \epsilon_k = (1 - f_D) \binom{S}{k} \alpha_{US}^k (1 - \alpha_{US})^{S-k} + f_D \binom{S}{k} \alpha_D^k (1 - \alpha_D)^{S-k}.    (29)

The labeling of the populations is defined by α_US < α_D.

FIG. 1: Comparison of data from the hippocampus with fits for the number of neurons n_k that respond to k stimuli, for each of the three models. The red circles indicate the experimental values and blue dots connected by lines indicate the model predictions for the expectation values of n_k. The blue error bars indicate the model's prediction for the one-standard-deviation variation of experimental results on repetition of the experiment. The top plots are the fits from the one-population model. The middle plots are fits from the model with one active and one silent population. The bottom fits are for the model with two active populations. The left-hand plots are with the zero-response bin included, and the right-hand plots are without them, to show more clearly the other bins.

The results of a maximum-likelihood fit are shown numerically in Table II(c), and graphically in the bottom row of each of Figs. 1–4. The fits are much improved. For both the hippocampus and the entorhinal cortex, we have good fits, with the model being consistent with the data. The fit in the amygdala is less reliable, and the model poorly fits the data in the parahippocampal cortex. In all cases, the ultra-sparse population is in the vast majority, around 90% or more, while at the same time having a very small sparsity, 10^{-3} or smaller. Thus each neuron in the ultra-sparse population responds on average to at most about 0.1 stimuli in a session. We only see the effects of the ultra-sparse population because the data are from a large number of neurons. In contrast, the remaining few percent of neurons in the other population typically respond to several stimuli in each session.

For the parahippocampal cortex (PHC), a different or more general model is clearly needed. We observe that the functionality of the PHC is much different from that of the hippocampus and the entorhinal cortex, so it is not surprising that its neural coding properties should be different.

FIG. 2: The same as Fig. 1, but for the entorhinal cortex.
The hippocampus is the classical locus of episodic memory storage, and the entorhinal cortex is its main source of input (and output).

An alternative view of the fit is shown in Fig. 5. Here we show how the neural responses are predicted by the model to arise from the different populations. The bottom parts of the bars, shaded gray, show the expectation values for the part of N_k coming from above-threshold responses by neurons in the D population. Stacked above these are open bars, showing the contribution from the US population. In the bins with more than one response, i.e., k ≥ 2, almost all the responses are from the D population, with only a small contamination from the ultra-sparse population, primarily at k = 2. In contrast, in the k = 1 bin, there is a relatively small fraction of responses from the D population, from the tail of a distribution with its peak at several responses. The majority of the k = 1 bin is from the ultra-sparse population. However, this represents only the tip of the iceberg, so to speak. The vast majority of the ultra-sparse neurons give no above-threshold responses; they appear in the k = 0 bin, which is much too tall to be shown in Fig. 5.

The sparsity and fraction of the D population can be determined from the bins with k ≥ 2, essentially independently of the k = 1 bin. There are several bins involved, so the shape of the distribution of N_k from a single-sparsity fit is confirmed, as can be seen from the lowest plots in the right-hand column of Figs. 1–4, certainly for the Hipp and EC regions. Extrapolating the fit for the D population to the k = 1 bin falls far short of the data, by a factor of at least 5. This then determines that there is an ultra-sparse population, whose average sparsity is determined to a good approximation by the excess in the k = 1 bin relative to the total number of non-D neurons:

    \alpha_{US} \simeq \frac{N_1(\text{excess above extrapolation})}{N (1 - f_D) S}.    (30)

Our actual best fit allows for the contamination of the bins of higher k by the ultra-sparse population.
The existence and size of the ultra-sparse population is determined by the large excess of the measured value of N_1 compared with the extrapolation from the bins of larger k, whose relative sizes correspond to a sparsity of a few per cent.

FIG. 3: The same as Fig. 1, but for the amygdala.

D. Uncertainties and correlations in fitted parameters
We computed uncertainties and correlations in the fitted values of the models' parameters by the method described in Sec. III D. The uncertainties are reported in Table II. Correlations between the three parameters of the full two-population model were calculated using Eq. (18), and are shown in Table III.
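Two pieces of this computation can be illustrated compactly in Python (a sketch under our assumptions, not the method of Sec. III D itself): for the one-population model the standard error of the MLE follows from the Fisher information of N·S Bernoulli trials, and a covariance matrix is normalized into correlation coefficients in the standard way, which we assume is the content of Eq. (18):

```python
from math import sqrt

# One-population Hipp fit (cf. Table II(a)): the Fisher information of
# N*S independent Bernoulli trials gives Var(alpha) ~ alpha(1-alpha)/(N*S).
n = [1019, 113, 30, 17, 7, 4, 1, 2, 0, 0, 0, 0, 1, 0, 0]
S, N = 97, sum(n)
alpha = sum(k * nk for k, nk in enumerate(n)) / (N * S)
sigma_alpha = sqrt(alpha * (1 - alpha) / (N * S))

def correlations(C):
    """Normalize a covariance matrix C into correlation coefficients,
    rho_ij = C_ij / sqrt(C_ii * C_jj)."""
    m = len(C)
    return [[C[i][j] / sqrt(C[i][i] * C[j][j]) for j in range(m)]
            for i in range(m)]
```

For the hippocampus this gives a standard error on α of order 10^-4, i.e., a few per cent relative uncertainty.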
E. Comparisons of the three versions of the model
We can see that the results of fitting the first single-population model with its single sparsity were intermediate sparsities, compromising between the extremes of the two populations in the full model. The single-population model therefore incorrectly represents the actual neural sparsity. Our fitted values of sparsity in Table II(a) roughly match those found by Waydo et al. [40] in their fit of a pure one-population model to similar data.

         ρ(α_US, α_D)   ρ(α_US, f_D)   ρ(α_D, f_D)
Hipp         0.46          -0.52          -0.62
EC           0.34          -0.33          -0.41
Amy          0.33          -0.31          -0.37
PHC          0.30          -0.23          -0.19

TABLE III: Correlations between parameters of the two-active-population model, assuming all recorded units are comprised of a single neuron. The presence of units consisting of two neurons did not affect the correlations between parameters to within two significant digits.

In the second model, with a set of exactly silent neurons, the fit for the responsive neurons is qualitatively similar to the D neurons in the full model: a sparsity of a percent to a few percent, and a minority abundance.

FIG. 4: The same as Fig. 1, but for the parahippocampal cortex.
Relative to the full model, the value for the active population's sparsity is still biased downwards, while its fractional abundance is biased upwards; these properties give a compromise between the effects of the two populations of neurons in the full model.

Merely introducing extra parameters increases the goodness of fit, but only by an expected decrease of one unit in χ² per parameter, as in Eq. (21). So the improved fits from adding an extra population are highly significant. In all cases, the p-values for the poor fits of the first two models, as computed from the χ² distribution, are far below any conventional threshold of statistical significance.

VI. MULTI-NEURON UNIT RESULTS
The results presented thus far have been under the assumption that all recorded units are composed of a single neuron. However, it is known that some units are in fact composed of multiple neurons [22, 32]. To gain an idea of the effect of such multiunits on our fits and on our conclusions about neural properties, we apply the general analysis from Sec. IV.
A. Implementation of multi-unit model
To fit the data taking into consideration the presence of multi-neuron units, we must maximize the likelihood function, Eq. (15), after replacing the ε_k with the function ε^unit_k of Eq. (25). In the case of the model with two populations at the neural level, i.e., with M = 2, Eq. (24) becomes

    \epsilon^{\rm unit}_{k,R} = \sum_{l=0}^{R} \binom{R}{l} f_D^l f_{US}^{R-l} \binom{S}{k} (\alpha'_{l,R})^k (1 - \alpha'_{l,R})^{S-k},    (31)

where

    \alpha'_{l,R} = 1 - (1 - \alpha_D)^l (1 - \alpha_{US})^{R-l},    (32)

and, of course, f_US = 1 − f_D. Here l and R − l are the numbers of neurons in the unit that are in the D and US populations.

FIG. 5: Plots of neuron responses predicted by the three-parameter maximum-likelihood fits to data in four regions of the MTL. The plots are of number of neurons as a function of number of responses. The shaded bars represent neurons in the almost-silent population while the open bars correspond to the distributed population. The red circles indicate the experimental values.

The result is that at the unit level, we have multiple populations, each with its distinct sparsity. The different populations correspond to the different terms in the summation in Eq. (31). For units with R = 1, these just correspond to the original two neural populations. For R = 2, there are three populations. One of these, the most common case, is where both neurons in the unit are US neurons, giving a sparsity 1 − (1 − α_US)² ≈ 2α_US, twice that of a single neuron.
The second most common situation is where one neuron is in each neural population; these units have sparsity 1 − (1 − α_D)(1 − α_US) ≈ α_D + α_US ≈ α_D, where the last approximation follows from the fact, confirmed by our detailed fit later, that α_US is much less than α_D. The third population, a small fraction f_D² of the units, has a larger sparsity 1 − (1 − α_D)² ≈ 2α_D.

If only single-neuron and double-neuron units existed, i.e., if only R = 1 and R = 2 occur, then the total number of populations at the unit level would be 5, and these would appear in the formula (25) for ε^unit_k. If larger multi-units occur, there are even more populations of units, each with its particular sparsity. This seems like a very complicated situation, but it provides no issue of principle in the MLE of the parameters of the populations at the neural level, except that there is little data about the exact distribution of the number of neurons in a unit.

However, as we will see in more detail shortly, considerable simplifications occur because the vast majority of neurons are extremely sparsely firing. This property is reflected at the unit level, and we will see that from a 2-population model at the neural level, the unit-level data are reasonably accurately given by a 2-population model at the unit level. This justifies a posteriori the success of a 2-population model applied to unit-level data, and one can see how to relate properties of the neural populations to properties of the unit populations. The reasons come from the two most common kinds of unit. The most common situation is that the neurons in a unit are all ultra-sparse, so that the unit itself responds ultra-sparsely, typically with at most one neuron at a time. The second most common situation is where exactly one of the neurons in a unit is in the D population. Then by far the most common response from the unit is due to the single D neuron.

B. Fit with multi-units
To understand how this works, we make a simplified model in which we assume that R ≤ 2, i.e., that all measured units are either composed of a single neuron or of two neurons. Let p be the fraction of single units. Then we have, for the total probability ε^unit_k of a unit responding to k stimuli,

    \epsilon^{\rm unit}_k = p\, \epsilon^{\rm unit}_{k,1} + (1-p)\, \epsilon^{\rm unit}_{k,2}.    (33)

From the number of single units reported in [32], we estimate p = 0.34, and we will use this value from now on.

We then used this in the formalism that we have already set up. The resulting fitted values of the parameters and the χ² values are shown in Table IV and, for the case of the hippocampus, the plots of the predicted and recorded n_k are shown in Fig. 6.

TABLE IV: Maximum-likelihood parameter estimates and χ² values for the two-population model allowing for multi-units, with 0.34 N single units and 0.66 N double units in each region.

Compared to the values
fitted for the original two-population model, Table II(c), the χ² did not change greatly, although the goodness of fit is somewhat improved, notably in the amygdala and PHC. Thus the multi-unit model fits all four regions at least as well as the basic two-population model, without any extra fitted parameters. However, the values of some of the parameters did change. Most notably, the estimate of α_US decreased by about 40%. The earlier value is simply a weighted average of the effect of single units containing one US neuron, with sparsity α_US, and double units containing two US neurons, with unit sparsity 2α_US. The fitting of the US population is determined primarily from the n_0 and n_1 bins, so the response data by itself cannot significantly determine the existence of these two populations of units; it just gives a weighted average of the sparsities.

FIG. 6: Plots of MLE fits to the Hipp with the multi-unit model.

In contrast, the value of α_D is only slightly lower than in the earlier fit; this is because most of the relevant data concerns units that have one D neuron. Given the sparseness at the unit level, as in Table II(c), which is tied to measured data, the neural sparsity must be less when there are multiunits.

As to the population fractions, the value of f_D is reduced by a factor of about two thirds. This is because, for a given f_D at the neural level, there are two chances of having a D neuron in a double unit. Thus the effective value of f_D at the unit level is the following weighted average:

    f_D\, p + 2 f_D (1-p) = f_D (2-p),    (34)

which determines the difference between the values of f_D in Tables II(c) and IV fairly well, given the value p = 0.34.

C. Overall view of effects of multi-units
We see that allowing for multi-units has actually strengthened our conclusion that there is a large fraction of extremely sparsely responding neurons. Effectively, the existence of multi-units has diluted the effect in the data relative to the situation at the neural level. The original 2-population fit in Table II(c) characterizes the measured data at the unit level. At the neural level, as supported by Table IV, the fraction of US neurons must be even closer to unity, and their sparsity substantially less than the already small values at the unit level. This qualitative result is independent of the exact numbers of multi-units and of the distribution of numbers of neurons in a unit.
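The multi-unit machinery of Eqs. (31)–(33) is compact enough to sketch directly. Here is an illustrative Python version (the parameter values in the usage below are placeholders of ours, not the fitted values of Table IV):

```python
from math import comb

def alpha_prime(l, R, a_D, a_US):
    """Eq. (32): sparsity of a unit with l D-neurons and R-l US-neurons."""
    return 1 - (1 - a_D)**l * (1 - a_US)**(R - l)

def eps_unit_kR(k, R, S, f_D, a_D, a_US):
    """Eq. (31): response probability of an R-neuron unit in the
    two-population (D + US) model at the neural level."""
    f_US = 1 - f_D
    total = 0.0
    for l in range(R + 1):
        ap = alpha_prime(l, R, a_D, a_US)
        total += (comb(R, l) * f_D**l * f_US**(R - l)
                  * comb(S, k) * ap**k * (1 - ap)**(S - k))
    return total

def eps_unit_k(k, S, f_D, a_D, a_US, p=0.34):
    """Eq. (33): mixture of single units (fraction p) and double units."""
    return (p * eps_unit_kR(k, 1, S, f_D, a_D, a_US)
            + (1 - p) * eps_unit_kR(k, 2, S, f_D, a_D, a_US))
```

Substituting eps_unit_k for ε_k in the likelihood and maximizing over (f_D, α_D, α_US) reproduces the fitting procedure of this section; note that alpha_prime(0, 2, ...) is approximately 2α_US, the dilution effect discussed above.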
VII. DISCUSSION
Our primary result is that in the hippocampus (and other areas of the MTL), a vast majority (90% or higher) of neurons respond ultra-sparsely to stimuli in the class presented: around one in a thousand stimuli, or even less. Those neurons that respond more readily, i.e., to a few per cent of the stimuli, comprise a rather small fraction (several per cent) of the cells. (In the paper reporting the data we use, it is not stated which hippocampal region the recorded cells belong to.)

We have devised methods that treat the measured neurons and stimuli as statistical samples, and allow the deduction of properties of the ultra-sparse population even though these neurons respond on average to less than a tenth of a stimulus out of the approximately 100 used to obtain the data analyzed [22].

There is the widely-known issue [11, 13, 24, 36, 38] that in many areas of the brain, there appear to be many silent or almost silent cells, i.e., cells that gave no detectable spikes at all in particular experimental situations or that spike very rarely. This gives the problem [24] that responsive neurons reported in the literature can be a very biased sample of all neurons in a probed brain region. Our methods (and potential future improvements) give a way of quantitatively measuring and correcting the bias.
A. Choice of above-threshold responsiveness to define sparsity
Care is needed in interpreting our results. Most importantly, the choice to classify a neuron binarily, as responsive or not, is at its most natural when the neuron normally has a low firing rate and has substantially higher activity under particular circumstances, with fairly few border-line cases.

The simplest case is, of course, when the neuron is strictly binary. That is, it gives exactly zero spikes under normal circumstances, and gives several spikes only in one situation, as for the HVC(RA) neurons of zebra finches in [11]. But the binary classification is also sensible for cells such as the pyramidal neuron in a human hippocampus whose responses are shown in Fig. 3A of [14]. It gives a consistently high response only to a soaring-eagle picture, out of those presented; nevertheless, its response to other presented pictures, while small, is non-zero.

We do not need to commit to the exact semantics of such a neuron to say that the categorical information coded in the high response of the neuron is useful to the subject. This information can be readily used for read-out [7] by a downstream neuron, and therefore used to guide further behavior, etc.

In contrast, for an interneuron such as the one in Fig. 3B of [14], the application of a response threshold is much less sensible. For this neuron the normal state is a fairly high firing rate somewhat correlated with the stimulus and its presentation. One can imagine occasional excursions above some chosen threshold, e.g., the 5-σ threshold used in [22]. But if these are only small excursions, their meaning and utility is not obvious. Of course, if under some other specific circumstances not probed in the reported data the interneuron had a much higher firing rate, then it would be natural to use a thresholded response criterion.
This is a situation which our statistical methods are designed to address; if the cell is typical of a certain class of cells, then one would see examples of the rare large excursions of firing rate in other cell-stimulus combinations.

The above remarks imply that for a better application of our methods, one should extend the criteria for the responsive cells and their responses. Not merely should there be a number of spikes above some threshold relative to the pre-stimulus situation, but the above-threshold distribution of spike numbers should go well above the threshold. It would also be useful to classify cells by their firing rate in the non-responsive state, so that our analysis by populations and sparsity is applied to cells that are similar in characteristics.

B. Populations
We have identified two very different populations of cells in the areas examined. It is possible that these populations are anatomically or functionally different, as with the different kinds of cells in the HVC and RA areas of zebra finches [11, 16]. But this is not a necessary deduction. Indeed, we think it is unlikely that our two populations correspond directly to the pyramidal-cell and interneuron populations reported in [14]. The presence of the different populations might simply represent different semantic properties of the cells' firing relative to the nature of the particular class of stimuli used. One can at best assume only that the situation is typical. For example, suppose one chose stimuli in a different class, e.g., musical tunes instead of famous people. Then some cells that had very low sparsity in the famous-people situation could have much higher sparsity in the music situation, and vice versa. Since such changing contexts are common experience, it is reasonable that this situation is typical. The different populations correspond to different semantic domains, and the chosen stimuli sample these domains.

The populations and their sparsities must then be treated as being relative to the general class of stimuli used, e.g., famous individuals, landmarks, animals, or objects in the case of the data from [22] that we analyzed.
C. Cell semantics
It has been suggested that the neurons under discussion are concept cells [29]; each cell responds to the presence, in some sense, of a particular concept in the current stimulus. One part of the motivation for this is that the responses often appear to be genuinely conceptual; for example, a cell might respond (within the experimental data) only to stimuli involving a particular person, e.g., to multiple different pictures of the person, even in disguise, to the written name of the person, and to the spoken name.

In this section, we make some suggestions about how our results quantitatively impact this issue. We call this subject the semantics of the cells, i.e., the question of what meaning should be attached to their responses.

In the first place, our results strengthen the basic case for conceptual cells, by showing how sparse the responses typically are. However, this case can only be made in conjunction with the measurements of the conceptual properties of the responses. To understand this more clearly, it is important to recall that there are two kinds of measurement involved.

First, there are screening sessions, with many different unrelated stimuli. The number of different objects and people averaged to 97, and it was our analysis of these data that led to our deduction that there are many very sparsely firing cells.

The screening sessions find pairs of stimulus and neuron where a high response is obtained. Then, in testing sessions, variants on the response-causing stimuli were used. It is the testing sessions that establish that many responses are indeed conceptual, e.g., that a neuron responds to a picture, to the written name, and to the spoken name of a particular person.
However, it is the screening-session data that determine the sparseness of the cells' responses, because the screening sessions have the largest number of identifiably distinct concepts.

It is tempting to say that because a cell consistently responds to a variety of very different stimuli related to a particular person, e.g., Jennifer Aniston [32], the cell actually represents Jennifer Aniston, i.e., that its firing above threshold codes that the concept of Jennifer Aniston is being processed or has been detected. This is the simplest version of the concept-cell idea [29].

We now address three questions about the exact correctness of this interpretation. One is whether the concept involved is in fact the obvious one, e.g., Jennifer Aniston in the case just mentioned. The second is whether the concept is one that in some sense corresponds to the current stimulus, be it visual or auditory. Our third question is whether the neural conceptual representation is strictly local, i.e., whether the above-threshold firing of one of these cells codes only a single concept. Our discussion of these questions will provoke a fourth question: whether the representation for only one concept is active at a given time, or whether the representations of multiple concepts are (more or less) simultaneously active, on a time scale of a few hundred milliseconds.

First, it appears necessary that a concept, like Jennifer Aniston, is coded and represented in the subject's brain. But the reported cell can be a downstream consequence. For example, it might code the activation of an episodic memory of a show in which she participated. Evidently, such downstream activated concepts are related to the Jennifer Aniston concept. Support for the idea that the apparent concept cells may code downstream concepts is that the Jennifer Aniston cell was later found [29] to respond to a picture of Lisa Kudrow, a co-star of Aniston's in a television series.
That individual episodic memories may be among the concepts involved (for some appropriate definition of "concept") is suggested by the well-known central role of the hippocampus in the episodic memory system.

We know that, at least in humans, recall of memories can be based not directly on a current stimulus but on cues generated by internal cognition. In general, this removes an obligatory link between stimulus properties and conceptual neural responses. For the experimental measurements under discussion, this issue need not be important, because the experimental protocol was explicitly designed to have the subjects' attention focused on the stimuli.

Once one allows that downstream concepts are activated, one should expect that multiple concepts are simultaneously active, with perhaps only one being selected at a given time for conscious attention. This is essentially the same property that computer search engines have. We can regard these as associative memory systems. A cue, e.g., a search string, is provided as input, and the result is a list of items containing or related to the cue; these are the activated concepts. The user can click on one item in the list to get its content; this is analogous to conscious memory recall.

The next question is whether or not the representation is local. Normally a dichotomy is made simply between local representations and distributed representations. In the case of distributed representations, even when they are sparse, it is generally assumed that the individual neurons involved in a distributed representation of an object themselves represent features or properties of the object in question; see, for example, Refs. [10, 23] and references therein. Such representations were called "iconic" by Wickelgren [41–43].

But a further possibility is the use of what Wickelgren [41–43] called "chunk assemblies".
Each of these is a relatively small set of cells, the activation of which codes the presence or processing of the associated concept. The individual cells in a chunk assembly do not themselves code features corresponding to the concept. This provides an important modification to the concept-cell idea. Quantitatively, coding using chunk assemblies is characterized by the number of cells in each assembly and by the number of chunk assemblies in which each cell participates (neither of which need be an exactly fixed number). Local coding is the limiting case in which each cell participates in the assembly for exactly one concept, as opposed to participating in merely a very small fraction of the concepts stored in a system. We can regard our work as a step in determining quantitative properties of chunk assemblies. When only a small number of stimuli are used, a cell in a chunk assembly will behave quite similarly to a cell providing a local representation.

We now work out a relation between the sparsity of cell responses and some of the coding properties. The properties of interest are the total number of concepts coded, the number of concepts that each cell codes (i.e., the number of chunk assemblies it participates in), and the number of concepts that are simultaneously active. Of course, neither of the last two numbers need be fixed, but it will be useful to treat them as single or representative numbers to get an idea of the relation to sparsity. If only one concept were active at a time, and if the coding were local (so there is only one concept per cell), then the sparsity would be

    \alpha = \frac{1}{N_{\rm total}},    (35)

at least in some average sense, where N_{\rm total} is the total number of concepts stored overall, and given that both the number of concepts activated and the number stored by the cell are small compared with N_{\rm total}. The number of concepts per cell in the case of a local representation is unity.
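The arithmetic of this relation can be sketched in a few lines of Python. The function name and the generalization used here (sparsity approximately equal to the number of assemblies per cell times the number of simultaneously active concepts, divided by the total number of stored concepts) are our illustrative assumptions, valid only when that product is far below the total:

```python
# Hypothetical sketch of the sparsity-coding relation; the function and the
# numbers below are illustrative assumptions, not fitted values.

def expected_sparsity(assemblies_per_cell, active_concepts, n_total):
    """Chance that a given cell fires above threshold for a random stimulus,
    assuming the cell joins `assemblies_per_cell` of the `n_total` stored
    concepts and `active_concepts` concepts are active at once (with the
    product of the two assumed much smaller than `n_total`)."""
    return assemblies_per_cell * active_concepts / n_total

# Strictly local coding with one active concept at a time reproduces Eq. (35):
n_total = 10**6
assert expected_sparsity(1, 1, n_total) == 1 / n_total

# Conversely, a measured sparsity constrains only the product of the two
# unknowns (cf. Eq. (36)):
alpha_measured = 6e-4
print(alpha_measured * n_total)   # assemblies per cell x active concepts
```

The point of the sketch is that a sparsity measurement alone cannot separate the two coding parameters; it fixes only their product.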
The simplest ideas about local/grandmother-cell representations would also assign unity to the number of simultaneously active concepts. Given the number of concepts we should expect to be represented in the human brain (probably hundreds of thousands if not millions), even our small measured sparsity is much too large to be consistent with a local representation. In this discussion, we use the word "concept" very broadly, such that it includes, for example, individual episodic memories.

An estimate of 10,000 to 30,000 has been quoted by Waydo et al. [40] as the number of objects that a typical adult recognizes, on the basis of work by Biederman [3]. But this is surely a substantial underestimate of the number of concepts that are coded in a human brain. First, the number of words in a language known to an adult is in the tens of thousands, and the number of concepts is surely substantially higher. Second, it is known that humans can remember thousands of pictures [15, 37] presented during a single day. The total number of memories created over a lifetime must be orders of magnitude larger. Even allowing that on a long time scale many of these will be forgotten, this indicates that the number of concepts/objects represented can easily be in the hundreds of thousands or millions.

For illustration, assume that the number of concepts remembered is N_{\rm total} = 10^6. Then a measured sparsity of around 6 \times 10^{-4}, as we found for the majority population from our analysis, implies that

    (\text{concepts per cell}) \times (\text{concepts simultaneously active}) \simeq \alpha N_{\rm total} \simeq 600.    (36)

We leave further analysis to the future, but use this estimate as a suggestion about the quantitative properties of concept coding.

D. Confirmation and relation to previous work
Our results confirm and substantially strengthen results found by one of us and Jin in [9]. There we used earlier data from [32] that only provided values for the number of units with one response, N_1, the number with two or more responses, N_{\geq 2}, and the average number of responses from responsive units. We ruled out a one-population model, and fit the three parameters of the two-population model from the three reported summary statistics. There was therefore no test of goodness of fit for the three-parameter model. Instead, the shape of the N_k distribution was a prediction, which is successfully tested in this paper, for the case of the hippocampus and entorhinal cortex at least.

In the present paper, we also give a more systematic account of the statistical methods, and provide a breakdown by brain region, using newer data. Although the exact parameters of the fits are a little different, the main picture is confirmed and tested. The differences could be accounted for by differences in the subjects and in detailed experimental procedures, and by our different treatment of multiunits.

E. Treves-Rolls definition of sparseness
An alternative measure of sparseness, proposed by Rolls and Tovee [34] and reviewed by Treves and Rolls [35], is calculated from the firing rates of neurons in response to stimuli. The average sparseness reported [35] with this definition was 0.33 ± 0.13 (for spatial view cells in the macaque hippocampus). This is dramatically different from what we found with our definition of sparsity. It is worth understanding how this difference might arise (aside from a conceivable difference between species). We will demonstrate that the Treves-Rolls definition can be very misleading as to the nature of neural coding.

They define the sparseness of a neuron in response to a sample of S stimuli as

    a^s_{\rm TR} = \frac{\langle r \rangle^2}{\langle r^2 \rangle} = \frac{\left( \sum_{j=1}^{S} r_j / S \right)^2}{\left( \sum_{j=1}^{S} r_j^2 \right) / S},    (37)

where r_j is the mean firing rate of the neuron in response to stimulus j and \langle \ldots \rangle denotes an average over all presented stimuli. In the case that the neuron is binary in its responses, i.e., that it gives a large response with some fixed firing rate to some stimuli and is exactly silent for all other stimuli, this definition gives the same result as our definition of sparsity.

Note, however, that there is a difference: the above definition is applied to one neuron with the stimuli actually presented in an experiment, whereas ours applies to a universe of stimuli. Our sparsity is something that must be inferred from statistical arguments applied to data, whereas theirs is directly defined as a property of the data. Even so, if neurons are exactly binary and all have the same sparsity (in our sense), then the Treves-Rolls sparseness is a good estimator of our sparsity.

The definition (37) is in fact the unique combination of the first and second moments of the firing rate that gives \alpha for exactly binary neurons (always in the limit of large S). However, as we will now show, Rolls-Treves sparseness and our thresholded sparsity can have widely different values if the neurons are not strictly binary.

We first observe that Rolls and Treves only reported a value of sparsity averaged over all neurons. Now Ison et al. [14] in their Fig. S2 showed the distribution across neurons of various measures of sparseness and selectivity. There is a wide range of Rolls-Treves sparseness, from less than 0.1 (the most common) to another peak close to 1 (for putative interneurons). The average of this distribution is quite misleading as to the properties of individual neurons. Furthermore, all the distributions in that figure are measures of sparseness with respect to the presented stimuli, and no attempt is made to infer an underlying sparseness or selectivity defined with respect to a whole class of stimuli, as we do here and as Waydo et al. did in [40].

It is well known that many (but not all) hippocampal neurons respond strongly to certain specific behaviorally relevant stimuli or situations, while responding weakly or not at all at other times. The data we analyzed simply give a particularly notable example. Let us refer to the situations to which a cell responds strongly as "on-target" and the other situations as "off-target". Suppose that the on-target responses are very rare, as we have found, but that the off-target responses, while being of low rate, are nevertheless non-zero. For example, the hippocampal place cells found in rats are known to have dramatically different firing rates when an animal is in the place field of the cell (producing an on-target response) compared to when the subject is out of the place field (producing an off-target response).

We will now show, by constructing an appropriate class of models, that the Treves-Rolls sparseness can be dominated by properties of the off-target firing while being quite insensitive to the on-target responses. At the same time, an appropriate choice of threshold can make it quite unambiguous which responses are on-target and which are off-target.
The model is intended not to be (at this point) an actual representation of data, but just a reasonable counterexample, to show how a high average value of Rolls-Treves sparseness can be compatible with a very low sparsity in our sense.

In this model each neuron has a categorical response to a certain class of stimuli, with probability \alpha. But instead of assuming purely binary neurons, we postulate a pseudo-binary model in which the on-target and off-target firing both come from a distribution of firing rates, but with different distributions. Let us assume that the off-target and on-target firing rates have the distributions P_0(r) and P_1(r) respectively. [Footnote: Analysis of a large, unbiased set of neurons in the rat MTL reported in [21] suggests a log-normal distribution for both P_0(r) and P_1(r). However, the exact forms of the distributions P_0(r) and P_1(r) are not important in this analysis, so long as they are largely non-overlapping.] Then the total distribution is a mixture:

    P(r) = (1 - \alpha) P_0(r) + \alpha P_1(r).    (38)

We let r_0 and \Delta r_0 be the mean and standard deviation of the off-target responses, and let r_1 and \Delta r_1 be the same quantities for the on-target responses. Naturally, we should assume that r_1 is sufficiently much higher than r_0 that on-target responses can be detected adequately reliably; this depends on the tail of P_0(r) falling sufficiently rapidly as r increases.

A calculation yields the Rolls-Treves sparseness (relative to all stimuli) as

    a^s_{\rm TR} = \frac{\left[ (1-\alpha) r_0 + \alpha r_1 \right]^2}{(1-\alpha) \left[ r_0^2 + (\Delta r_0)^2 \right] + \alpha \left[ r_1^2 + (\Delta r_1)^2 \right]}.    (39)

This reproduces the value \alpha in the case of binary neurons, i.e., when the standard deviations are negligible and r_0 \to 0. If instead \alpha is very small (much less than the ratio r_0/r_1 of off- to on-target mean responses), then the Treves-Rolls sparseness approaches 1/(1 + (\Delta r_0)^2/r_0^2). This is just the Rolls-Treves sparseness calculated purely from the off-target distribution.
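The counterexample can be checked numerically. The sketch below is our illustration, not the paper's analysis code; it draws off- and on-target rates from two log-normal distributions (cf. the footnote citing [21]), with every parameter value an arbitrary assumption chosen only to make the two distributions largely non-overlapping:

```python
# Numerical sketch of the pseudo-binary counterexample; all shapes and
# parameter values are illustrative assumptions, not fits to data.
import numpy as np

def lognormal_moments(mu, sigma):
    """Mean and standard deviation of a log-normal distribution."""
    m = np.exp(mu + sigma**2 / 2)
    return m, m * np.sqrt(np.expm1(sigma**2))

def a_tr_mixture(alpha, r0, dr0, r1, dr1):
    """Rolls-Treves sparseness of the mixture, Eq. (39)."""
    num = ((1 - alpha) * r0 + alpha * r1)**2
    den = (1 - alpha) * (r0**2 + dr0**2) + alpha * (r1**2 + dr1**2)
    return num / den

alpha = 1e-4                     # proportion of on-target stimuli
rng = np.random.default_rng(0)
n = 2_000_000

# Off-target rates are low, on-target rates much higher (non-overlapping).
on = rng.random(n) < alpha
r = np.where(on,
             rng.lognormal(np.log(30.0), 0.3, n),   # on-target, P1(r)
             rng.lognormal(np.log(0.5), 1.0, n))    # off-target, P0(r)

a_tr_sim = r.mean()**2 / (r**2).mean()              # definition, Eq. (37)

r0, dr0 = lognormal_moments(np.log(0.5), 1.0)
r1, dr1 = lognormal_moments(np.log(30.0), 0.3)
a_tr = a_tr_mixture(alpha, r0, dr0, r1, dr1)
off_limit = 1.0 / (1.0 + dr0**2 / r0**2)            # off-target-dominated limit

print(a_tr_sim, a_tr, off_limit)
```

With these numbers the simulated and analytic values agree and sit near the off-target limit (roughly 0.37), more than three orders of magnitude above \alpha, illustrating how the Treves-Rolls measure can say almost nothing about the on-target responses.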
In this case the Rolls-Treves sparseness says nothing about the on-target responses.

If the two distributions P_0 and P_1 are distinct enough, then it is possible to set a threshold r_{\rm th} that gives reliable detection of on-target and off-target states. Given an off-target stimulus, we need the false positive rate to be substantially less than \alpha, so that above-threshold responses are predominantly for on-target stimuli. This requires

    \int_{r_{\rm th}}^{\infty} P_0(r) \, dr \ll \alpha,    (40)

i.e., that the threshold is sufficiently far out on the tail of P_0.

Given an on-target stimulus, we need it to be reliably detected, so that the false negative rate (relative to \alpha) is small, i.e.,

    \int_0^{r_{\rm th}} P_1(r) \, dr \ll 1.    (41)

This is arranged if the bulk of the on-target distribution is beyond the threshold.

For any given value of \alpha we can arrange to satisfy all these conditions with appropriate functions for the distributions P_0(r) and P_1(r). That is, the Rolls-Treves sparseness can be dominated by its off-target value, while we have reliable discrimination between on-target and off-target stimuli, despite the low proportion \alpha of on-target stimuli.

In this situation, there is no contradiction between a rather high value of Rolls-Treves sparseness, like 0.33, and an extremely low value of our thresholded sparsity.

F. What properties of neural responses are determined?
Although our model provides a good fit to the data (at least in the hippocampus and entorhinal cortex), it should not be supposed that the model gives an exact characterization of neural responses to stimuli.

The first issue is simply that if we replaced one particular neural population of a particular sparsity \alpha and fractional abundance f by several populations with sparsities not too far from the original single value, and with a total abundance summing to f, the result would not be very distinguishable from the original case. In the limit of a large number of stimuli, the number of responses from one population would be clustered at k = \alpha S, with a standard deviation \sqrt{\alpha S} that is much smaller than the mean. The fractional standard deviation is 1/\sqrt{\alpha S}. But with the actual values of S and \alpha, the standard deviation is comparable to the mean; indeed, for the ultra-sparse population the standard deviation is much larger than the mean. Thus splitting one population into a set of populations nearby in sparsity produces little measurable effect.

The lack of ability to distinguish populations of nearby sparsity is particularly notable for the ultra-sparse population. This population primarily populates the k = 1 bin. Thus our measurement of a sparsity for the ultra-sparse population is really a measurement of a weighted average of the sparsities of the ultra-sparsely responding neurons. What is properly deduced from our analysis is that there are relatively few neurons that respond with a sparsity of a few per cent, and a much larger number that respond much more sparsely. A more precise understanding of how the multiunit cases arise from an underlying neural response, e.g., [8], would lead to a more precise estimation of the population parameters at the neural level.

A second issue [11, 13, 24, 36, 38] is that there may be many silent cells, i.e., cells that gave no detectable spikes at all in the measurements.
Silent cells are to be contrasted with the many non-responsive cells that were included and were important in our analysis; non-responsive cells did give detected spikes, but never enough to count as an above-threshold response. On the basis of results in [13], Waydo et al. [40] argue that this is potentially a very large effect. They suggest that "as many as 120–140 neurons are within the sampling region of a tetrode in the CA1 region of the hippocampus", but say that they only identified "1–5 units per electrode". Of course, some of these units are multiunits, corresponding to two or more neurons.

To quantify the effects of silent cells, let K be the ratio of the total number of cells to the number of detected cells. The suggestion [13, 36, 40] is that K could be as much as an order of magnitude. The effect of silent cells on our measurement of the sparsity \alpha_D of the higher-sparsity population would be negligible. This is simply because these cells typically respond to multiple stimuli in the data, and are actually detected. The value of \alpha_D is obtained primarily from the relative numbers of cells with k = 2, k = 3, etc. responses. But their fraction in the total neuronal population, f_D, must be decreased by a factor K.

For the ultra-sparse population, the sparsity \alpha_{\rm US} is decreased relative to our fits by a factor K (while the population abundance gets closer to 100%). This is simply because \alpha_{\rm US} is, to a first approximation, the ratio of N_1 to N_0 + N_1 (after taking out the estimated contamination of these bins from the other population). N_1 is fixed by data, but N_0 + N_1 is increased by a factor of about K. Effectively, if there is a population of completely silent cells, then the measured number N_0 of cells that gave no above-threshold responses should be changed to approximately K N_0.
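The bookkeeping for the factor K can be summarized in a short sketch. The helper function below is hypothetical, not part of the paper's analysis pipeline; the input values are the unit-level hippocampus numbers quoted in the abstract (7% of cells at 2.6% sparsity, 93% at 0.1%), and K = 10 stands in for the order-of-magnitude suggestion of [13, 36, 40]:

```python
# Hypothetical sketch of the silent-cell correction described in the text;
# K is unknown, and K = 10 is only an order-of-magnitude illustration.

def correct_for_silent_cells(f_d, alpha_d, alpha_us, K):
    """Rescale fitted two-population parameters when only a fraction 1/K of
    cells is detected: alpha_D is essentially unchanged, while the
    D-population abundance and the ultra-sparse sparsity both shrink by K."""
    f_d_corr = f_d / K            # detected D cells diluted among ~K times more cells
    alpha_us_corr = alpha_us / K  # N1 fixed by data, but N0 + N1 grows by ~K
    f_us_corr = 1.0 - f_d_corr    # ultra-sparse abundance approaches 100%
    return f_d_corr, alpha_d, f_us_corr, alpha_us_corr

print(correct_for_silent_cells(f_d=0.07, alpha_d=0.026, alpha_us=0.001, K=10))
```

Only the ultra-sparse parameters change appreciably: that population's sparsity drops by the factor K while its abundance approaches 100%, which is the sense in which silent cells would strengthen, rather than weaken, the two-population conclusion.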
(The approximations consist of neglecting N_2 with respect to N_1, and of neglecting the small number of cells in the D population for which all the stimuli were off-target.)

The effect of possible large numbers of silent cells is to substantially strengthen our conclusions, even if it makes some of the numbers less certain. Of course, if at some time in the future a more precise understanding of electrode properties were obtained, then a useful estimate of the silent-cell factor K could be derived, and our numerical results could be adjusted accordingly. See [8, 25] for recent work on this issue.

A final issue is that the measurements of sparsity are relative to a particular set of stimuli. We need to regard the stimuli as being chosen as a sample from an enormous number in a general subject area known to the patient. The general topics were chosen to correspond to areas that were well known to the subjects. Given that the neural responses were very specific, one must suppose that if a very different topic were chosen (e.g., scientific subjects instead of movie stars), the responses (or lack thereof) by particular neurons could be very different. Some previously responsive cells might even become silent, and vice versa. Similar phenomena are seen with the place-field behavior of hippocampal neurons when an animal's environment is changed [17]. So any estimate of the sparsity of the response of a particular neuron is relative to the subject area of the stimuli. Nevertheless, it is reasonable that the distribution of sparsity across a population of hippocampal neurons would not be much changed between different stimulus sets.

It is known that the hippocampus is involved generally in episodic memory, so over the whole neural population one should expect to get responses to stimuli of all subject areas. There should not be specialization for the particular subject areas used for the measurements, except for the fact that the topics are ones for which the subjects have abundant memories.
Therefore our results concerning the neuronal population as a whole should be regarded as broadly applicable. That is, we should expect similar results for other topics for the stimuli. This conclusion also makes it acceptable that we analyzed data pooled over many patients rather than from just one patient.

Appendix A: Poisson approximation for likelihood
We now review the derivation of the Poisson approximation (15) to the likelihood function (12). The derivation applies when the values of \epsilon_k for non-zero k are small; more precisely, when \sum_{k=1}^{S} \epsilon_k \ll 1. We start by separating the k = 0 factor from the product, and then use Eqs. (13) and (14) to write n_0 and \epsilon_0 in terms of the corresponding quantities for k \geq 1:

    L(\{\alpha_i, f_i\}) = \frac{N!}{(N - n_{\geq 1})!} \, (1 - \epsilon_{\geq 1})^{N - n_{\geq 1}} \prod_{k=1}^{S} \frac{\epsilon_k^{n_k}}{n_k!},    (A1)

where n_{\geq 1} = \sum_{k=1}^{S} n_k and \epsilon_{\geq 1} = \sum_{k=1}^{S} \epsilon_k. Taking the logarithm of both sides of Eq. (A1) gives

    \ln L = \ln N! - \ln[(N - n_{\geq 1})!] + (N - n_{\geq 1}) \ln(1 - \epsilon_{\geq 1}) + \ln \prod_{k=1}^{S} \frac{\epsilon_k^{n_k}}{n_k!}.    (A2)

For large N, we can use Stirling's approximation to O(1/N):

    \ln N! = N \ln N - N + \tfrac{1}{2} \ln(2\pi N) + O(1/N).    (A3)

Applying this formula to the first two terms in (A2) yields

    \ln L = N \ln N + (N - n_{\geq 1}) \ln\left( \frac{1 - \epsilon_{\geq 1}}{N - n_{\geq 1}} \right) - n_{\geq 1} + \ln \prod_{k=1}^{S} \frac{\epsilon_k^{n_k}}{n_k!} + O(1/N)
         = n_{\geq 1} \ln N + (N - n_{\geq 1}) \ln\left( \frac{1 - \epsilon_{\geq 1}}{1 - n_{\geq 1}/N} \right) - n_{\geq 1} + \ln \prod_{k=1}^{S} \frac{\epsilon_k^{n_k}}{n_k!} + O(1/N).    (A4)

(The approximation worsens beyond the order 1/N error estimate if n_{\geq 1} gets close to N. But since the \epsilon_k are small, this situation is of very low probability. Henceforth the error estimates will apply not too far from the peak of the likelihood distribution.)

The first term can be combined with the product term:

    n_{\geq 1} \ln N + \ln \prod_{k=1}^{S} \frac{\epsilon_k^{n_k}}{n_k!} = \ln\left[ N^{n_{\geq 1}} \prod_{k=1}^{S} \frac{\epsilon_k^{n_k}}{n_k!} \right] = \ln \prod_{k=1}^{S} \frac{(N \epsilon_k)^{n_k}}{n_k!}.    (A5)

We can simplify the remaining logarithm in Eq. (A4):

    (N - n_{\geq 1}) \ln\left( \frac{1 - \epsilon_{\geq 1}}{1 - n_{\geq 1}/N} \right) = (N - n_{\geq 1}) \left( \frac{n_{\geq 1}/N - \epsilon_{\geq 1}}{1 - n_{\geq 1}/N} + O\left[ \left( \frac{\epsilon_{\geq 1} - n_{\geq 1}/N}{1 - n_{\geq 1}/N} \right)^2 \right] \right).    (A6)

For a multinomial distribution,

    \epsilon_{\geq 1} - n_{\geq 1}/N = \sum_{k=1}^{S} (\epsilon_k - n_k/N) = \sum_{k=1}^{S} O\left( \sqrt{\epsilon_k/N} \right),    (A7)

where we have used the standard deviation of the distribution to estimate the typical value of |\epsilon_k - n_k/N|. Thus, adding the independent fluctuations in quadrature, we can write

    O\left[ \left( \frac{\epsilon_{\geq 1} - n_{\geq 1}/N}{1 - n_{\geq 1}/N} \right)^2 \right] = O\left( \frac{1}{N} \sum_{k=1}^{S} \epsilon_k \right) = O(\epsilon_{\geq 1}/N).    (A8)

Thus, Eq. (A6) can be simplified to

    (N - n_{\geq 1}) \ln\left( \frac{1 - \epsilon_{\geq 1}}{1 - n_{\geq 1}/N} \right) = n_{\geq 1} - N \epsilon_{\geq 1} + O(\epsilon_{\geq 1}) + O(1/N).    (A9)

Substituting Eqs. (A5) and (A9) into Eq. (A4) yields

    \ln L = -N \epsilon_{\geq 1} + \ln \prod_{k=1}^{S} \frac{(N \epsilon_k)^{n_k}}{n_k!} + O(\epsilon_{\geq 1}) + O(1/N).    (A10)

Therefore the likelihood function in our approximation is given by a product of Poisson distributions:

    L \approx \prod_{k=1}^{S} e^{-N \epsilon_k} \frac{(N \epsilon_k)^{n_k}}{n_k!}.    (A11)

[1] D. Attwell and S.B. Laughlin, "An Energy Budget for Signaling in the Grey Matter of the Brain", J. Cereb. Blood Flow Metab., 1133–1145 (2001).
[2] S. Baker and R.D. Cousins, "Clarification of the Use of Chi-Square and Likelihood Functions in Fits to Histograms", Nucl. Instrum. Meth., 437–442 (1984).
[3] I. Biederman, "Recognition-by-Components: A Theory of Human Image Understanding", Psychol. Rev., 115–147 (1987).
[4] J. Bowers, "On the Biological Plausibility of Grandmother Cells: Implications for Neural Network Theories in Psychology and Neuroscience", Psychol. Rev., 220–251 (2009).
[5] J. Bowers, "More on Grandmother Cells and the Biological Implausibility of PDP Models of Cognition: A Reply to Plaut and McClelland (2010) and Quian Quiroga and Kreiman (2010)", Psychol. Rev., 300–306 (2010).
[6] J. Bowers, "Postscript: Some Final Thoughts on Grandmother Cells, Distributed Representations, and PDP Models of Cognition", Psychol. Rev., 306–308 (2010).
[7] G. Buzsáki, "Neural Syntax: Cell Assemblies, Synapsembles, and Readers", Neuron, 362–385 (2010).
[8] L.A. Camuñas-Mesa and R.Q. Quiroga, "A Detailed and Fast Model of Extracellular Recordings", Neural Comput., 1191–1212 (2013).
[9] J. Collins and D.Z. Jin, "Grandmother cells and the storage capacity of the human brain", arXiv:q-bio/060301.
[10] P. Földiák and D. Endres, "Sparse coding", Scholarpedia, 3(1):2984 (2008).
[11] R.H.R. Hahnloser, A.A. Kozhevnikov, and M.S. Fee, "An ultra-sparse code underlies the generation of neural sequences in a songbird", Nature, 65–70 (2002).
[12] T. Hauschild and M. Jentzel, "Comparison of maximum likelihood estimation and chi-square statistics applied to counting experiments", Nucl. Instrum. Meth., 384–401 (2001).
[13] D.A. Henze, Z. Borhegyi, J. Csicsvari, A. Mamiya, K.D. Harris, and G. Buzsáki, "Intracellular Features Predicted by Extracellular Recordings in the Hippocampus In Vivo", J. Neurophysiol., 390–400 (2000).
[14] M.J. Ison, F. Mormann, M. Cerf, C. Koch, I. Fried, and R.Q. Quiroga, "Selectivity of pyramidal cells and interneurons in the human medial temporal lobe", J. Neurophysiol., 1713–1721 (2011).
[15] T. Konkle, T.F. Brady, G.A. Alvarez, and A. Oliva, "Conceptual Distinctiveness Supports Detailed Visual Long-Term Memory for Real-World Objects", J. Exp. Psychol. Gen., 558–578 (2010).
[16] A.A. Kozhevnikov and M.S. Fee, "Singing-Related Activity of Identified HVC Neurons in the Zebra Finch", J. Neurophysiol., 4271–4283 (2007).
[17] I. Lee, G. Rao, and J.J. Knierim, "A Double Dissociation between Hippocampal Subfields: Differential Time Course of CA3 and CA1 Place Cells for Processing Changed Environments", Neuron 42, 803–815 (2004).
[18] P. Lennie, "The Cost of Cortical Computation", Curr. Biol., 493–497 (2003).
[19] L. Lyons, "Statistics for Nuclear and Particle Physicists" (Cambridge University Press, 1986).
[20] J.F. Miller, M. Neufang, A. Solway, A. Brandt, M. Trippel, I. Mader, S. Hefft, M. Merkow, S.M. Polyn, J. Jacobs, M.J. Kahana, and A. Schulze-Bonhage, "Neural Activity in Human Hippocampal Formation Reveals the Spatial Context of Retrieved Memories", Science, 1111–1114 (2013).
[21] K. Mizuseki and G. Buzsáki, "Preconfigured, Skewed Distribution of Firing Rates in the Hippocampus and Entorhinal Cortex", Cell Rep., 1–12 (2013).
[22] F. Mormann, S. Kornblith, R.Q. Quiroga, A. Kraskov, M. Cerf, I. Fried, and C. Koch, "Latency and Selectivity of Single Neurons Indicate Hierarchical Processing in the Human Medial Temporal Lobe", J. Neurosci., 8865–8872 (2008).
[23] B.A. Olshausen and D.J. Field, "Sparse Coding of Sensory Inputs", Curr. Opinion Neurobiol., 481–487 (2004).
[24] B.A. Olshausen and D.J. Field, "How Close Are We to Understanding V1?", Neural Comput., 1665–1699 (2005); "What is the other 85% of V1 doing?", in "23 Problems in Systems Neuroscience", J.L. van Hemmen and T.J. Sejnowski (eds.) (Oxford University Press, 2006).
[25] C. Pedreira, J. Martinez, M.J. Ison, and R. Quian Quiroga, "How many neurons can we see with current spike sorting algorithms?", J. Neurosci. Meth. 211, 58–65 (2012).
[26] D.C. Plaut and J.L. McClelland, "Locating Object Knowledge in the Brain: Comment on Bower's (2009) Attempt to Revive the Grandmother Cell Hypothesis", Psychol. Rev., 284–288 (2010).
[27] D.C. Plaut and J.L. McClelland, "Postscript: Parallel Distributed Processing in Localist Models Without Thresholds", Psychol. Rev., 289–290 (2010).
[28] R.Q. Quiroga, "Decoding Visual Inputs From Multiple Neurons in the Human Temporal Lobe", J. Neurophysiol., 1997–2007 (2007).
[29] R.Q. Quiroga, "Concept cells: the building blocks of declarative memory functions", Nat. Rev. Neurosci., 587–597 (2012).
[30] R.Q. Quiroga and G. Kreiman, "Measuring Sparseness in the Brain: Comment on Bowers (2009)", Psychol. Rev., 291–297 (2010).
[31] R.Q. Quiroga and G. Kreiman, "Postscript: About Grandmother Cells and Jennifer Aniston Neurons", Psychol. Rev., 297–299 (2010).
[32] R.Q. Quiroga, L. Reddy, G. Kreiman, C. Koch, and I. Fried, "Invariant visual representation by single neurons in the human brain", Nature, 1102–1107 (2005).
[33] E.T. Rolls, "The mechanisms for pattern completion and pattern separation in the hippocampus", Front. Syst. Neurosci., 74 (2013).
[34] E.T. Rolls and M. Tovee, "Sparseness of the Neuronal Representation of Stimuli in the Primate Visual Cortex", J. Neurophys., 713–726 (1995).
[35] E.T. Rolls and A. Treves, "The neuronal coding of information in the brain", Prog. Neurobiol., 448–490 (2011).
[36] S. Shoham, D.H. O'Connor, and R. Segev, "How silent is the brain: is there a 'dark matter' problem in neuroscience?", J. Comp. Physiol. A, 777–784 (2006).
[37] L. Standing, "Learning 10,000 pictures", Q. J. Exp. Psychol., 207–222 (1973).
[38] L.T. Thompson and P.J. Best, "Place Cells and Silent Cells in the Hippocampus of Freely-Behaving Rats", J. Neurosci., 2382–2390 (1989).
[39] A. Treves and E.T. Rolls, "What determines the capacity of autoassociative memories in the brain?", Network, 371–397 (1991).
[40] S. Waydo, A. Kraskov, R. Quian Quiroga, I. Fried, and C. Koch, "Sparse Representation in the Human Medial Temporal Lobe", J. Neurosci., 10232–10234 (2006).
[41] W.A. Wickelgren, "Learned specification of concept neurons", Bull. Math. Biophys., 123–142 (1969).
[42] W.A. Wickelgren, "Webs, Cell Assemblies, and Chunking in Neural Nets", Concepts in Neuroscience, 1–53 (1992).
[43] W.A. Wickelgren, "Webs, Cell Assemblies, and Chunking in Neural Nets: Introduction", Can. J. Exper. Psych. 53