Invariant expectation values in the sampling of discrete frequency distributions
arXiv [physics.data-an] (May 2013)
Paolo Rossi
Dipartimento di Fisica dell'Università di Pisa and I.N.F.N., Sezione di Pisa, Largo Bruno Pontecorvo 3, I-56127 Pisa, Italy
(Dated: May 2, 2013)

The general relationship between an arbitrary frequency distribution and the expectation value of the frequency distributions of its samples is discussed. A wide set of measurable quantities ("invariant moments"), whose expectation value does not in general depend on the size of the sample, is constructed and illustrated by applying the results to the Ewens sampling formula. Invariant moments are especially useful in the sampling of systems characterized by the absence of an intrinsic scale. Distribution functions that may parametrize the samples of scale-free distributions are considered, and their invariant expectation values are computed. The conditions under which the scaling limit of such distributions may exist are described.
PACS numbers: 02.50.Ng, 89.20-a, 89.75.Da
I. INTRODUCTION
In many interesting physical, biological and social phenomena, whenever no intrinsic scale for the relevant variables is present, the emergence of "scaling laws" is phenomenologically observed [1]. However, strictly speaking, a power law is not a proper way of fitting empirical data, since no choice of the exponent can keep the higher moments of a power law distribution from diverging, while every phenomenological distribution leads to finite values for all moments. This is not just a technicality: it is rather a reflection of the fact that the long tail of a power law distribution is in practice cut off by the existence of some "hidden" scale, irrelevant in the scaling region, but eventually forcing some upper limit on the variables describing the system. It would therefore be convenient to parametrize the data by means of more regular distribution functions, sufficiently damped for very large values of the variables, but admitting power law distributions as regular limits when the control parameter implementing the cutoff is sent to its limiting value.

A related issue concerns the effects of sampling, which may be nontrivial even when we restrict our attention to the expectation values of the sampled variables. On average sampling does not affect the distributions of individual objects belonging to different kinds, but when we consider frequency distributions (that is, the number of kinds that are represented k times in a given population) we cannot in general expect the frequency distribution in the samples to be the same as in the original population, even after averaging over many different samples, basically because the cutoff induced by sampling acts differently (and in general nontrivially) at different scales. It is therefore quite important to be able to extract from the frequency distribution of the samples some information reflecting directly some intrinsic property of the underlying distribution.

Our purpose is therefore threefold.
First, we want to discuss the general relationship existing between an arbitrary frequency distribution and the expectation value of the frequency distributions of its samples, and to construct observables whose expectation values turn out to be independent of the sample size, therefore coinciding with the value taken by the same observables in the full distribution.

Moreover, we want to study classes of distributions whose samples preserve the functional dependence on the parameters present in the original distribution, establishing the connection between the (a priori unknown) values of the parameters of the distribution and the (empirically measured) parameters of the sample distributions.

Finally, we want to study the scaling limit of these distributions (when it exists), in order to explore the possibility of their use for the phenomenological description of systems that are theoretically expected to show scaling in the limit when all empirical cutoffs (including those induced by sampling) disappear.

In Section II we establish the notation and the general framework of our analysis.

In Section III we construct a wide set of combinations of expectation values that do not depend on the sample size.

In Section IV we apply our approach to the popular Ewens sampling formula, showing that its features are consistent with the general pattern and computing its invariant expectation values.

In Section V we consider the limiting case of a small sampling applied to a large population.

In Section VI we focus on the case when the original population and its samples are sufficiently large in comparison with typical frequency values, finding a useful mathematical relationship between the generating function of the expectation values of the sample distributions and the generating function of the original distribution.

In Section VII we analyze a class of distributions (the so-called negative binomial distributions) admitting a scaling limit and enjoying the property that the distribution of expectation values of the samples has the same mathematical form as the original distribution. We also compute in closed form the values of the basic invariant expectation values for these distributions.

Finally, in Section VIII we analyze the scaling limit itself and discuss the conditions under which one may expect this limit to be a sensible description of the original system.

The Appendices are devoted to the proofs of some mathematical results and to a discussion of the issue of correlation between random samples.

II. THE GENERAL FRAMEWORK
We are considering a set of $N$ objects ("individuals") belonging to $S$ different kinds ("species"), and we assume that the set contains $\hat N_a$ objects of the $a$-th kind, subject to the constraint $\sum_a \hat N_a = N$. A sample is a set of $n$ objects, containing $\hat n_a$ objects of the $a$-th kind, subject to the constraint $\sum_a \hat n_a = n$.

The probability $P\{\hat n_a\}$ of extracting a specific sample $\{\hat n_a\}$ from a given set $\{\hat N_a\}$ is obtained from the multivariate hypergeometric distribution

$$P\{\hat n_a\} = \binom{N}{n}^{-1} \prod_{a=1}^{S} \binom{\hat N_a}{\hat n_a}.$$

We can easily compute the relevant expectation values, obtaining in particular

$$\langle \hat n_a \rangle = \hat N_a \frac{n}{N} = n\,\hat p_a, \qquad \langle \hat n_a^2 \rangle - \langle \hat n_a \rangle^2 = \frac{N-n}{N-1}\, n\,\hat p_a (1-\hat p_a),$$

where $\hat p_a \equiv \hat N_a/N$ is the probability of extracting an object of the $a$-th kind in a single extraction.

It may be useful to consider also the limit of small samples, $\hat n_a \ll \hat N_a$. In this limit the probability of a specific sample is well approximated by the multinomial distribution

$$P\{\hat n_a\} = n! \prod_{a=1}^{S} \frac{(\hat p_a)^{\hat n_a}}{\hat n_a!}.$$

A frequency distribution is a set of values $\{N_k\}$, where $N_k$ is the number of kinds such that for each of them there are $k$ objects in the original set. According to the definition, the following conditions must be satisfied:

$$\sum_{k=1}^{N} N_k = S, \qquad \sum_{k=1}^{N} k\, N_k = N.$$

The frequency distribution of a sample is a set of values $\{n_l\}$, satisfying the conditions

$$\sum_{l=0}^{n} n_l = S, \qquad \sum_{l=1}^{n} l\, n_l = n.$$

Notice that the frequency distribution of a sample formally includes the (unobservable) value $n_0$, corresponding to the number of kinds, present in the original set, which are not represented in the sample.

It is in principle possible to compute the probability of any sample distribution $\{n_l\}$ as a function of a given set $\{N_k\}$. To this purpose it is convenient to define the intermediate variables $N_{kl}$, representing the (random) number of kinds characterized by $k$ objects in the original set and by $l$ ($l \le k$) objects in the sample.
The variables $N_{kl}$ are strongly constrained, since they must satisfy all the conditions:

$$\sum_{l=0}^{n} N_{kl} = N_k, \qquad \sum_{k=1}^{N} N_{kl} = n_l.$$

The probability $P\{N_{kl}\}$ of a specific configuration $\{N_{kl}\}$ follows from the general probability formula [2]:

$$P\{N_{kl}\} = \binom{N}{n}^{-1} \prod_{k=1}^{N} \left[ N_k! \prod_{l=0}^{k} \frac{1}{N_{kl}!} \binom{k}{l}^{N_{kl}} \right],$$

subject to the constraint $\sum_{l=0}^{n} N_{kl} = N_k$.

The probability $P\{n_l\}$ of finding a frequency distribution $\{n_l\}$ in a sample is obtained by summing the probabilities $P\{N_{kl}\}$ over all configurations satisfying the constraint $\sum_{k=1}^{N} N_{kl} = n_l$. The corresponding multivariate generating function can be defined as

$$\varepsilon^{(n)}(\{t_l\}) \equiv \sum_{\{n_l\}} P\{n_l\} \prod_{l=0}^{n} t_l^{n_l} = \sum_{\{N_{kl}\}} P\{N_{kl}\} \prod_{k=1}^{N} \prod_{l=0}^{k} t_l^{N_{kl}}.$$

It is also possible (and it will be quite convenient) to define a cumulative generating function for the probability of finding the frequency distributions $P\{n_l\}$ for samples of all possible sizes:

$$E(x; \{t_l\}) \equiv \sum_{n=0}^{N} \binom{N}{n}\, \varepsilon^{(n)}(\{t_l\})\, x^n = \sum_{\{N_{kl}\}} \prod_{k=1}^{N} \left( N_k! \prod_{l=0}^{k} \frac{1}{N_{kl}!} \left[ \binom{k}{l} x^l t_l \right]^{N_{kl}} \right) = \prod_{k=1}^{N} \left[ \sum_{l=0}^{k} \binom{k}{l} x^l t_l \right]^{N_k},$$

where we used the explicit expression of $P\{N_{kl}\}$ and all the relevant constraints.

The expectation values $\langle n_l \rangle$ can be computed starting from the above expressions and from the relationship

$$\langle n_l \rangle = \sum_{k=1}^{N} \langle N_{kl} \rangle = \sum_{k=1}^{N} \sum_{\{N_{jm}\}} N_{kl}\, P\{N_{jm}\}.$$

Straightforward manipulations lead to the results [2]

$$\langle N_{kl} \rangle = N_k\, \frac{\binom{k}{l}\binom{N-k}{n-l}}{\binom{N}{n}}, \qquad \langle n_l \rangle = \sum_{k=1}^{N} N_k\, \frac{\binom{k}{l}\binom{N-k}{n-l}}{\binom{N}{n}}.$$

It is easy to check that the following relationships are satisfied:

$$\sum_{l=0}^{n} \langle n_l \rangle = \sum_{k=1}^{N} N_k = S, \qquad \sum_{l=0}^{n} l\, \langle n_l \rangle = \left( \sum_{k=1}^{N} k\, N_k \right) \frac{n}{N} = n.$$
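The hypergeometric expressions above are straightforward to evaluate exactly. A minimal numerical sketch (in Python; the toy population numbers are illustrative choices, not taken from the text) computes $\langle n_l\rangle$ from a given $\{N_k\}$ and checks the two sum rules:

```python
from fractions import Fraction
from math import comb

def expected_sample_freqs(N_counts, n):
    """Exact <n_l> = sum_k N_k C(k,l) C(N-k,n-l) / C(N,n), for l = 0..n.

    N_counts maps k -> N_k (number of kinds represented k times);
    exact rational arithmetic avoids overflow in the binomial ratios.
    """
    N = sum(k * Nk for k, Nk in N_counts.items())
    denom = comb(N, n)
    return {l: Fraction(sum(Nk * comb(k, l) * comb(N - k, n - l)
                            for k, Nk in N_counts.items()), denom)
            for l in range(n + 1)}

# toy population: 3 singleton kinds, 2 kinds with 2 objects, 1 kind with 4
N_counts = {1: 3, 2: 2, 4: 1}          # S = 6 kinds, N = 11 objects
freqs = expected_sample_freqs(N_counts, n=5)

# sum rules: sum_l <n_l> = S and sum_l l <n_l> = n, exactly
assert sum(freqs.values()) == 6
assert sum(l * v for l, v in freqs.items()) == 5
```

Note that the $l=0$ entry is the expected number of kinds missing from the sample, which is unobservable in practice but needed for the first sum rule to close.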
In order to fully appreciate the relevance of considerations based on the expectation values, we must evaluate the weight of the fluctuations. Taking second derivatives of the generating function $E(x;\{t_l\})$ one obtains:

$$\langle n_l^2 \rangle - \langle n_l \rangle^2 = \sum_{k,k'} N_k N_{k'} \binom{k}{l}\binom{k'}{l} \left[ \frac{\binom{N-k-k'}{n-2l}}{\binom{N}{n}} - \frac{\binom{N-k}{n-l}}{\binom{N}{n}}\, \frac{\binom{N-k'}{n-l}}{\binom{N}{n}} \right] + \sum_{k} N_k \binom{k}{l} \left[ \frac{\binom{N-k}{n-l}}{\binom{N}{n}} - \binom{k}{l} \frac{\binom{N-2k}{n-2l}}{\binom{N}{n}} \right].$$

Notice that in the large $N$ limit the term quadratic in $N_k$ is depressed by a power of $1/N$. This observation suggests that very important limits of the above results may be obtained when considering large populations.

III. INVARIANT EXPECTATION VALUES
It is very important to be able to define a set of expectation values that are independent of the size of the sample, and therefore may reflect very directly the properties of the original frequency distribution. Let's consider the following combinations of expectation values:

$$\langle m^{(n)}_{\{p_i\}} \rangle \equiv \binom{n}{P}^{-1} \sum_{\{q_i\}} \prod_{i=1}^{I} \binom{q_i}{p_i} \frac{\partial}{\partial t_{q_i}}\, \varepsilon^{(n)}(\{t_l\}) \Big|_{\{t_l=1\}},$$

where the $p_i$ are $I$ arbitrary positive integer numbers, subject only to the constraint that $P \equiv \sum_i p_i \le n$.

The definition of the quantities appearing in the r.h.s. implies that the derivatives with respect to $t_{q_i}$ are the joint factorial moments of the distribution; therefore we are dealing with weighted combinations of joint factorial moments. When some of the indices $p_i$ are equal to one, the expectation values may be expressed in terms of a combination of lower rank moments ($I' < I$).

It is possible to recognize that the quantities $\langle m^{(P)}_{\{p_i\}} \rangle$ are related (up to a trivial combinatorial factor taking into account the existence of $n_p$ coincident values among the indices $p_i$) to the probability of finding the configuration $\{p_i\}$ in a sample containing $P$ elements, and are therefore strictly connected with the probabilities $P\{n_p\}$.

Exploiting the properties of the generating function $E(x;\{t_l\})$ we prove in Appendix A that

$$\langle m^{(n)}_{\{p_i\}} \rangle = \langle m^{(N)}_{\{p_i\}} \rangle \equiv M_{\{p_i\}}$$

for all sets $\{p_i\}$ such that $P \le n$. Hence the expectation values of the nontrivial invariant moments $m_{\{p_i\}}$, evaluated for samples of arbitrary size $n \ge P$, coincide with the moments $M_{\{p_i\}}$ of the original frequency distribution. If the original set was generated by a random process, the $M_{\{p_i\}}$ will also be expectation values.
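For a single index $p$ the combination reduces to $m^{(n)}_p = \binom{n}{p}^{-1}\sum_{q \ge p}\binom{q}{p}\, n_q$ (the basic invariant moments discussed below). A small exact check of the size-independence, written as a Python sketch with an arbitrary toy population, applies the hypergeometric expectation values of Section II:

```python
from fractions import Fraction
from math import comb

def expected_freqs(N_counts, n):
    """Exact hypergeometric <n_l> for a sample of size n."""
    N = sum(k * Nk for k, Nk in N_counts.items())
    return {l: Fraction(sum(Nk * comb(k, l) * comb(N - k, n - l)
                            for k, Nk in N_counts.items()), comb(N, n))
            for l in range(n + 1)}

def basic_moment(freqs, size, p):
    """m_p = C(size,p)^{-1} * sum_q C(q,p) * freqs[q]."""
    return sum(comb(q, p) * fq for q, fq in freqs.items()) / Fraction(comb(size, p))

# toy population {k: N_k}: N = 11 objects, S = 6 kinds (illustrative numbers)
N_counts = {1: 3, 2: 2, 4: 1}
M = {p: basic_moment(N_counts, 11, p) for p in (1, 2, 3)}

for n in (4, 6, 9):
    freqs = expected_freqs(N_counts, n)
    for p in (1, 2, 3):
        # invariance: <m_p^{(n)}> equals M_p exactly, for every n >= p
        assert basic_moment(freqs, n, p) == M[p]
```

The equalities hold as exact rational identities, not just numerically, which is a direct check of the Appendix A result for single-index moments.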
Recalling that $\varepsilon^{(N)}(\{t_l\}) \equiv \prod_k t_k^{N_k}$, we may now generate a representation of all $P\{n_p\}$ in terms of $N_k$, without making use of the coefficients $N_{kl}$.

The properties of the binomial coefficients make it possible to invert the relationship between invariant moments and joint factorial moments, thus finding that

$$\prod_{i=1}^{I} \frac{\partial}{\partial t_{q_i}}\, \varepsilon^{(n)}(\{t_l\}) \Big|_{\{t_l=1\}} = \sum_{\{p_i\}} \prod_{i=1}^{I} (-1)^{p_i - q_i} \binom{p_i}{q_i} \Big\langle \binom{n}{P} m^{(n)}_{\{p_i\}} \Big\rangle = \sum_{\{p_i\}} \prod_{i=1}^{I} (-1)^{p_i - q_i} \binom{p_i}{q_i} \binom{n}{P} M_{\{p_i\}}.$$

The basic invariant moments are

$$m^{(n)}_p = \binom{n}{p}^{-1} \sum_{q=p}^{n} \binom{q}{p}\, n_q.$$

According to the inversion formula,

$$\langle n_l \rangle = \frac{\partial \varepsilon^{(n)}}{\partial t_l}\Big|_{\{t_m=1\}} = \sum_{p=l}^{n} (-1)^{p-l} \binom{p}{l} \binom{n}{p} \langle m^{(n)}_p \rangle = \sum_{p=l}^{n} (-1)^{p-l} \binom{p}{l} \binom{n}{p} M_p.$$

One may define generating functions for the expectation values of $n_l$ and for the basic invariant moments:

$$f^{(n)}(t) \equiv \sum_{l=0}^{n} \langle n_l \rangle\, t^l, \qquad g^{(n)}(z) \equiv f^{(n)}\Big(1 + \frac{z}{n}\Big) = \sum_{p=0}^{n} \binom{n}{p} M_p \Big(\frac{z}{n}\Big)^p.$$

Notice that a special case of the above formulas is

$$F(t) \equiv \sum_{k=0}^{N} N_k\, t^k, \qquad G(z) \equiv F\Big(1 + \frac{z}{N}\Big) = \sum_{p=0}^{N} \binom{N}{p} M_p \Big(\frac{z}{N}\Big)^p.$$

It is immediate to recognize that $g^{(n)}(0) = G(0) = M_0 \equiv S$, and $\dfrac{dg^{(n)}}{dz}(0) = \dfrac{dG}{dz}(0) = M_1 = 1$.

IV. APPLICATION TO EWENS SAMPLING FORMULA
The multivariate Ewens distribution [3, 4], called in genetics the Ewens sampling formula, describes a specific probability distribution for the partitions of $n$ into parts, and found its main applications in the context of the neutral theory of evolution and in the unified neutral theory of biodiversity [5, 6]. Since Ewens formula and its possible generalizations have been the subject of a wide and still growing literature [7–10], it may be interesting to apply the results presented in Section III to this specific instance. In our notation Ewens probability distribution takes the form

$$P\{n_l\} = \frac{1}{\aleph_n} \prod_{l=1}^{n} \frac{1}{n_l!} \Big(\frac{\theta}{l}\Big)^{n_l}, \qquad \aleph_n \equiv \frac{\Gamma(\theta+n)}{n!\, \Gamma(\theta)},$$

where $0 < \theta < \infty$ and $\sum_l l\, n_l = n$.

Joint factorial moments of the Ewens distribution are easily computed [11], and one can show that

$$\prod_{i=1}^{I} \frac{\partial}{\partial t_{q_i}} \sum_{\{n_l\}} P\{n_l\} \prod_{l=0}^{n} t_l^{n_l} \Big|_{\{t_l=1\}} = \frac{\aleph_{n-Q}}{\aleph_n} \prod_i \frac{\theta}{q_i},$$

where $Q \equiv \sum_i q_i \le n$.

We are then left with the task of computing the summations appearing in the equation

$$\langle m^{(n)}_{\{p_i\}} \rangle = \binom{n}{P}^{-1} \sum_{\{q_i\}} \prod_{i=1}^{I} \binom{q_i}{p_i} \frac{\aleph_{n-Q}}{\aleph_n} \prod_i \frac{\theta}{q_i} = \frac{P!\,(n-P)!\,\Gamma(\theta)}{\Gamma(\theta+n)} \prod_{i=1}^{I} \frac{\theta}{p_i} \sum_{\{q_i\}} \prod_{i=1}^{I} \binom{q_i-1}{p_i-1} \frac{\Gamma(\theta+n-Q)}{\Gamma(\theta)\,(n-Q)!}.$$

We prove in Appendix B that

$$\sum_{\{q_i \ge p_i\}} \prod_{i=1}^{I} \binom{q_i-1}{p_i-1} \frac{\Gamma(\theta+n-Q)}{\Gamma(\theta)\,(n-Q)!} = \frac{\Gamma(\theta+n)}{\Gamma(\theta+P)\,(n-P)!},$$

hence

$$\langle m^{(n)}_{\{p_i\}} \rangle = \frac{P!\,\Gamma(\theta)}{\Gamma(\theta+P)} \prod_{i=1}^{I} \frac{\theta}{p_i} \equiv \frac{1}{\aleph_P} \prod_{i=1}^{I} \frac{\theta}{p_i},$$

showing explicitly that the expectation values of the invariant moments of the Ewens distribution are independent of the sample size and related to the probability of the configuration $\{p_i\}$ in the sampling of $P$ elements.

We stress that invariant moments, because of their independence from the size of the sample, may become a highly valuable tool in testing the applicability of Ewens distribution (and of the conceptual assumptions underlying its derivation) to the interpretation of actual empirical data.

V. LARGE POPULATION AND SMALL SAMPLES
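Before specializing to limiting regimes, it is worth noting that the size-independence derived in the previous section is easy to probe numerically. For the single index $p=2$ the Ewens result gives $\langle m^{(n)}_2\rangle = \theta/(2\aleph_2) = 1/(1+\theta)$ for every $n \ge 2$. A Monte Carlo sketch in Python (the Chinese-restaurant construction used here is a standard way to sample Ewens partitions, assumed rather than taken from the text):

```python
import random
from math import comb

def ewens_partition(n, theta, rng):
    """Sample species counts from the Ewens distribution via the Chinese
    restaurant process: individual i+1 founds a new species with probability
    theta/(theta+i), else joins species a with probability counts[a]/(theta+i)."""
    counts = []
    for i in range(n):
        r = rng.uniform(0.0, theta + i)
        if r < theta:
            counts.append(1)
        else:
            r -= theta
            for a, c in enumerate(counts):
                if r < c:
                    counts[a] += 1
                    break
                r -= c
    return counts

def m2(counts):
    """Basic invariant moment m_2 = sum_a C(c_a,2) / C(n,2)."""
    n = sum(counts)
    return sum(comb(c, 2) for c in counts) / comb(n, 2)

rng = random.Random(1)
theta, reps = 2.0, 4000
estimates = {n: sum(m2(ewens_partition(n, theta, rng)) for _ in range(reps)) / reps
             for n in (10, 50, 200)}
# all three averages cluster around 1/(1+theta) ~ 0.333, independently of n
```

The quantity $m_2$ is the probability that two individuals drawn from the sample belong to the same species, so the $n$-independence of its average is the sampling-formula analogue of the classic homozygosity result $1/(1+\theta)$.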
A significant simplification occurs when $N \to \infty$ while all other variables are kept finite. Setting $\tilde x \equiv N x$ and $t_0 = 1$ in the cumulative generating function and taking the large $N$ limit we obtain

$$\tilde E(\tilde x; \{t_l\}) \equiv \sum_{n=1}^{\infty} \tilde\varepsilon^{(n)}(\{t_l\}) \frac{\tilde x^n}{n!} = \prod_{k=1}^{\infty} \Big[ \sum_{l=0}^{k} \binom{k}{l} \Big(\frac{\tilde x}{N}\Big)^l t_l \Big]^{N_k} \to \prod_{k=1}^{\infty} \Big[ 1 + \sum_{l=1}^{\infty} \Big(\frac{k \tilde x}{N}\Big)^l \frac{t_l}{l!} \Big]^{N_k}.$$

Let's now define (for $n$, $l$ different from zero) the following set of coefficients:

$$c^{(n)}(\{n_l\}) \equiv (-1)^{s-1} (s-1)! \prod_{l=1}^{n} \frac{1}{n_l!} \Big(\frac{1}{l!}\Big)^{n_l},$$

where $s = \sum_l n_l$ and $n = \sum_l l\, n_l$, and notice that the definition of $\tilde E$ implies that

$$\ln \tilde E(\tilde x; \{t_l\}) = \sum_{n=1}^{\infty} \Big[ \sum_{\{n_l\}} c^{(n)}(\{n_l\}) \prod_{l=1}^{n} \big( \tilde\varepsilon^{(l)}(\{t_l\}) \big)^{n_l} \Big] \tilde x^n.$$

It is also possible to recognize that, under the same assumptions,

$$\ln \tilde E(\tilde x; \{t_l\}) = \sum_{n=1}^{\infty} \Big[ \sum_{\{n_l\}} c^{(n)}(\{n_l\}) \prod_{l=1}^{n} t_l^{n_l} \Big] \tilde m^{(N)}_n\, \tilde x^n,$$

where we introduced the large $N$ limit of the basic invariant moments: $\tilde m^{(N)}_p \equiv \sum_q N_q (q/N)^p$.

Comparing the two results we conclude that, for each value of $n$,

$$\sum_{\{n_l\}} c^{(n)}(\{n_l\}) \prod_{l=1}^{n} \big( \tilde\varepsilon^{(l)}(\{t_l\}) \big)^{n_l} = \Big[ \sum_{\{n_l\}} c^{(n)}(\{n_l\}) \prod_{l=1}^{n} t_l^{n_l} \Big] \tilde m^{(N)}_n.$$

These equations allow in principle for the recursive determination of all the $\tilde\varepsilon^{(n)}(\{t_l\})$ in terms of the $\tilde m^{(N)}_p$ (with $p \le n$), starting from the initial condition $\tilde\varepsilon^{(1)} = t_1$. Higher rank invariant moments ($I > 1$) in the large $N$ limit become polynomials in the basic moments. However, one must keep in mind that, when the set $\{N_k\}$ is not fixed, but generated by a probability distribution (as in the case of the Ewens formula), the expectation values of the products of basic moments appearing in the l.h.s. do not coincide with the products of the expectation values.

VI. LARGE POPULATIONS AND LARGE SAMPLES
When $k, l \ll N, n$ one may systematically exploit the property that, for small $a$ and $b$,

$$\frac{\binom{N-a}{n-b}}{\binom{N}{n}} \to \rho^b (1-\rho)^{a-b}, \qquad \rho \equiv \frac{n}{N}.$$

Expressing $N$ and $n$ in terms of $N_{kl}$ one may then obtain

$$P\{N_{kl}\} \to \prod_{k=1}^{N} \Big[ N_k! \prod_{l=0}^{k} \frac{1}{N_{kl}!}\, P_{kl}^{N_{kl}} \Big], \qquad P_{kl} \equiv \binom{k}{l} \rho^l (1-\rho)^{k-l}.$$

As shown in Appendix C, when computing expectation values with the above probability distribution, the constraint $\sum_l l\, n_l = n$ becomes irrelevant in the large $N$ limit, and expectation values of products of $N_{kl}$ with different values of the index $k$ factorize into products of expectation values computed for each separate value of $k$. We can therefore compute directly the generating function for a fixed sample size, generalizing the multivariate multinomial distribution:

$$\varepsilon^{(n)}(\{t_l\}) = \prod_{k=1}^{N} \sum_{\{N_{kl}\}} \Big[ N_k! \prod_{l=0}^{k} \frac{1}{N_{kl}!} (P_{kl} t_l)^{N_{kl}} \Big] = \prod_{k=1}^{N} \Big[ \sum_{l=0}^{k} P_{kl} t_l \Big]^{N_k}.$$

A consistency check is easily obtained by observing that $\varepsilon^{(n)}(1) = 1$, because of the property that $\sum_{l=0}^{k} P_{kl} = 1$. The expectation value of $n_l$ turns out to be

$$\langle n_l \rangle = \sum_{k=1}^{N} N_k\, P_{kl},$$

and one may check that the conditions on $\sum_l \langle n_l \rangle$ and on $\sum_l l\, \langle n_l \rangle$ are still satisfied.

We can also estimate the behavior of fluctuations when $k, l \ll N, n$:

$$\langle n_l^2 \rangle - \langle n_l \rangle^2 = \sum_{k=1}^{N} N_k\, P_{kl} (1 - P_{kl}).$$

The above expression is always smaller than $\langle n_l \rangle$, and as a consequence fluctuations become unimportant for sufficiently large values of $\langle n_l \rangle$.

In the same limit we may derive a notable relationship between the generating function of the original frequency distribution and the generating function of the expectation values of its samples. In fact we may recognize that for sufficiently large $N$ and $n$

$$g^{(n)}(z) = \sum_{p=0}^{\infty} M_p \frac{z^p}{p!} = G(z).$$
Since in general $f^{(n)}(t) = g^{(n)}\big(n(t-1)\big)$ and $F(t) = G\big(N(t-1)\big)$, it is then easy to check that in the limit under consideration

$$f^{(n)}(t) = F(1 - \rho + \rho t).$$

As a direct consequence of these results, whenever the (size-independent) function $\gamma(z) \equiv G(z) - G(0)$ can be cast into a form exhibiting no explicit parametric dependence on $N$, the expectation values $\langle n_l \rangle$ can be obtained from $N_k$ by the replacement $N \to n$.

Notice that in the limit $k, l \ll N, n$ the definition of the basic invariant moments $m^{(n)}_p$ simplifies to

$$m^{(n)}_p \to \frac{p!}{n^p} \sum_{l=p}^{n} n_l \binom{l}{p}.$$

It is worth analyzing in this limit the explicit expression of the second basic invariant moment:

$$M_2 \to \frac{1}{N^2} \sum_{k=1}^{N} k(k-1)\, N_k = \sum_{a=1}^{S} \Big(\frac{\hat N_a}{N}\Big)^2 - \frac{1}{N} = \sum_{a=1}^{S} \Big\langle \Big(\frac{\hat n_a}{n}\Big)^2 \Big\rangle - \frac{1}{n} \equiv \alpha^{-1}.$$

As shown in Appendix D, the above results may be used also in order to parametrize the expected correlation between samples under the assumption of independent random sampling.
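The replacement of the hypergeometric kernel by the binomial $P_{kl}$ is easy to test numerically. A Python sketch (the population numbers are illustrative choices) compares the exact $\langle n_l\rangle$ of Section II with the thinned sum $\sum_k N_k P_{kl}$:

```python
from fractions import Fraction
from math import comb

def exact_nl(N_counts, n, l):
    """Exact hypergeometric <n_l> = sum_k N_k C(k,l) C(N-k,n-l) / C(N,n)."""
    N = sum(k * Nk for k, Nk in N_counts.items())
    num = sum(Nk * comb(k, l) * comb(N - k, n - l) for k, Nk in N_counts.items())
    return float(Fraction(num, comb(N, n)))   # Fraction avoids float overflow

def thinned_nl(N_counts, n, l):
    """Binomial-thinning approximation <n_l> = sum_k N_k C(k,l) rho^l (1-rho)^(k-l)."""
    N = sum(k * Nk for k, Nk in N_counts.items())
    rho = n / N
    return sum(Nk * comb(k, l) * rho**l * (1 - rho)**(k - l)
               for k, Nk in N_counts.items())

# regime k << N, n: many kinds, all with few representatives
N_counts = {1: 500, 2: 200, 3: 100, 5: 40}   # N = 1400 objects
n = 700
for l in (1, 2, 3):
    e, t = exact_nl(N_counts, n, l), thinned_nl(N_counts, n, l)
    # relative deviation is O(k/N): at the percent level or below here
```

The deviation grows with $k$ and $l$, consistently with the stated validity condition $k, l \ll N, n$.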
VII. A CLASS OF DISTRIBUTIONS AND ITS PROPERTIES
As mentioned in the Introduction, distributions found in samples may often correspond to systems whose asymptotic ($N \to \infty$) distribution is expected to obey a scaling law. However, the exponent of the scaling law will in general be nontrivial, in contrast with the prediction offered by the simplest neutral models. An example of empirical and theoretical evidence for nontriviality is offered by surname frequency distributions (see Ref. [12] for a recent review), recalling that surnames are expected to mimic the behavior of selectively neutral alleles. It is therefore especially interesting to consider parametrizations that may reflect nontriviality of exponents, and in particular the class of negative binomial distributions [13], which can be obtained starting from the generating function

$$F_c(t) = \frac{N (1-x)^{1-c}}{c\, x} \big[ 1 - (1 - x t)^c \big] = \sum_{k=1}^{\infty} \frac{N (1-x)^{1-c}}{x\, \Gamma(1-c)} \frac{\Gamma(k-c)}{k!} (x t)^k,$$

where $0 < x < 1$ and $0 \le c < 1$. The asymptotic behaviour for large $k$ is easily obtained by observing that in this limit

$$\frac{\Gamma(k-c)}{k!} \to k^{-1-c}, \qquad N_k \to \frac{N (1-x)^{1-c}}{x\, \Gamma(1-c)}\, x^k\, k^{-1-c}.$$

We can now compute the generating function for the expectation values of the samples according to the general rule previously discussed, and obtain

$$f_c(t) = f_c(0) + \frac{n (1-y)^{1-c}}{c\, y} \big[ 1 - (1 - y t)^c \big],$$

where we have defined $y = \dfrac{\rho x}{1 - x + \rho x}$.

The distribution of the samples has the same form as the original distribution, once the replacements $N \to n$ and $x \to y$ have been performed, and therefore we obtain the asymptotic behaviour

$$\langle n_l \rangle \to \frac{n (1-y)^{1-c}}{y\, \Gamma(1-c)}\, y^l\, l^{-1-c}.$$

It is possible to define a combination of parameters independent of the dimension of the sample:

$$\beta = \frac{N(1-x)}{x} = \frac{n(1-y)}{y}.$$

It is useful to represent $x$ and $y$ in a form showing explicitly their dependence on the dimension of the sample and on the invariant parameter $\beta$:

$$x = \frac{N}{\beta + N}, \qquad y = \frac{n}{\beta + n}.$$
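The closure of this family under sampling can be verified numerically. A Python sketch (the parameter values $\beta = 50$, $c = 0.3$, $N = 10^4$, $n = 10^3$ are illustrative choices) builds $N_k$ from the negative binomial form, applies the binomial thinning of Section VI, and compares the result with the same functional form after $N \to n$, $x \to y$:

```python
from math import comb, exp, lgamma, log

def nb_coeffs(total, x, c, kmax):
    """k -> total * (1-x)^(1-c) / (x Gamma(1-c)) * Gamma(k-c)/k! * x^k,
    computed in log space to avoid overflow at large k."""
    lpref = log(total) + (1 - c) * log(1 - x) - log(x) - lgamma(1 - c)
    return {k: exp(lpref + lgamma(k - c) - lgamma(k + 1) + k * log(x))
            for k in range(1, kmax + 1)}

beta, c = 50.0, 0.3
N, n = 10_000, 1_000
x, y = N / (beta + N), n / (beta + n)
rho = n / N
# same y via the two equivalent expressions
assert abs(y - rho * x / (1 - x + rho * x)) < 1e-12

Nk = nb_coeffs(N, x, c, kmax=4000)        # original distribution
pred = nb_coeffs(n, y, c, kmax=4000)      # predicted sample distribution
for l in (1, 2, 3, 4):
    thinned = sum(Nk[k] * comb(k, l) * rho**l * (1 - rho)**(k - l)
                  for k in Nk if k >= l)
    # thinned sum reproduces the negative binomial form with N -> n, x -> y
```

The agreement is exact up to truncation of the $k$-sum, since $f_c(t) = F_c(1-\rho+\rho t)$ holds term by term for this family.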
It is now possible to evaluate the expectation values of the invariant moments from the expression

$$\gamma_c(z) \equiv G_c(z) - G_c(0) = \frac{\beta}{c} \Big[ 1 - \Big(1 - \frac{z}{\beta}\Big)^c \Big],$$

showing no explicit parametric dependence on $N$; we therefore obtain (for $p \ne 0$)

$$M_p = \beta^{1-p}\, \frac{\Gamma(p-c)}{\Gamma(1-c)} \to \frac{\beta^{1-p}}{\Gamma(1-c)}\, p!\, p^{-1-c}, \qquad \alpha^{-1} \equiv M_2 = \frac{1-c}{\beta}.$$

The $c \to 0$ limit of the above results is the Fisher logarithmic series:

$$F(t) = -\beta \ln(1 - x t), \qquad N_k = \beta\, \frac{x^k}{k},$$

and

$$f(t) = \beta \ln\Big(\frac{\beta + N}{\beta + n}\Big) - \beta \ln(1 - y t), \qquad \langle n_l \rangle = \beta\, \frac{y^l}{l}.$$

The generating function of the invariant moments is obtained from $\gamma(z) = -\beta \ln(1 - z/\beta)$, and as a consequence the expected values of the invariant moments ($p \ne 0$) are exactly $M_p = (p-1)!\, \beta^{1-p}$.

Notice in particular the relationship $\beta = \alpha$, peculiar to the Fisher distribution. By comparing these results with the large $\theta$ limit of the invariant moments of the Ewens distribution, we may easily check Watterson's [15] and Hubbell's [5] observation that in this limit $\theta$ is strictly connected to Fisher's $\alpha$.

VIII. THE SCALING LIMIT
Let's now consider very large systems, and assume that we can gather information only through the sampling of $n$ objects belonging to the system, with $n$ large but not necessarily comparable to $N$.

The analysis of the invariant moments may then allow us to check the applicability of a phenomenological description of the samples based on some distribution falling into the classes discussed in the previous Sections. In the case of a positive response to the check it is then possible to find numerical estimates of the parameter $\beta$ and of the exponent $c$. Such estimates are clearly meaningful only if $\beta$ does not turn out to be significantly greater than $n$.

Under these assumptions, we can infer a description of the original system, and in the case $N \gg n$ such a description will correspond to computing the limit $x \to 1$. For large $k$, the original distribution is expected to be well described by the scaling form

$$N_k \to \frac{N^c\, \beta^{1-c}}{\Gamma(1-c)}\, \frac{1}{k^{1+c}}.$$

Appendix A: Expectation values of the invariant moments
Let's consider the cumulative generating function for the expectation value of a given invariant moment (setting $P = \sum_i p_i$) for samples of all possible sizes:

$$\sum_{n=0}^{N} \binom{N}{n} \binom{n}{P} \langle m^{(n)}_{\{p_i\}} \rangle\, x^n \equiv \sum_{\{q_i\}} \prod_{i=1}^{I} \binom{q_i}{p_i} \frac{\partial}{\partial t_{q_i}}\, E(x; \{t_l\}) \Big|_{\{t_l=1\}}.$$

Let's now observe that, due to the property that

$$\ln E(x; \{t_l\}) = \sum_{k=1}^{N} N_k \ln \Big[ \sum_{l=0}^{k} \binom{k}{l} t_l\, x^l \Big],$$

the derivatives appearing in the above defined cumulative generating function, computed at $t_l = 1$, can be expressed as summations (over the $\{k_i\}$ indices) of products of $N_{k_i}$ times a universal $x$-dependent factor

$$\Big[ \prod_{i=1}^{I} \binom{k_i}{q_i} \Big] x^Q (1+x)^{N-K},$$

where $Q = \sum_i q_i$ and $K = \sum_i k_i$. However, the following identity holds:

$$\binom{q_i}{p_i} \binom{k_i}{q_i} = \binom{k_i}{p_i} \binom{k_i - p_i}{q_i - p_i}.$$

As a consequence the cumulative generating function is proportional to the factor

$$\Big[ \prod_{i=1}^{I} \binom{k_i}{p_i} \Big] x^P (1+x)^{N-K} \sum_{\{q_i\}} \Big[ \prod_{i=1}^{I} \binom{k_i - p_i}{q_i - p_i} \Big] x^{Q-P} = \Big[ \prod_{i=1}^{I} \binom{k_i}{p_i} \Big] x^P (1+x)^{N-P}.$$

The summations over the indices $\{k_i\}$ may now be formally performed, and, by matching the coefficient of $x^N$ in the two sides of the equation, the result can be easily recognized to coincide with the expected value of the invariant moment computed for the original distribution, times the combinatorial factor $\binom{N}{P}$. In conclusion we find

$$\sum_{n=0}^{N} \binom{N}{n} \binom{n}{P} \langle m^{(n)}_{\{p_i\}} \rangle\, x^n = \binom{N}{P} \langle m^{(N)}_{\{p_i\}} \rangle\, x^P (1+x)^{N-P}.$$

Expanding the r.h.s.
in powers of $x$ and noticing that $\binom{N}{n}\binom{n}{P} = \binom{N}{P}\binom{N-P}{n-P}$, we finally obtain

$$\sum_{n=0}^{N} \binom{N}{P} \binom{N-P}{n-P} \langle m^{(n)}_{\{p_i\}} \rangle\, x^n = \sum_{n=0}^{N} \binom{N}{P} \binom{N-P}{n-P} \langle m^{(N)}_{\{p_i\}} \rangle\, x^n,$$

implying immediately that, as long as $n \ge P$,

$$\langle m^{(n)}_{\{p_i\}} \rangle = \langle m^{(N)}_{\{p_i\}} \rangle \equiv M_{\{p_i\}}.$$

Appendix B: Proof of a combinatorial formula
The identity

$$\sum_{\{q_i \ge p_i\}} \prod_{i=1}^{I} \binom{q_i - 1}{p_i - 1} \frac{\Gamma(\theta + n - Q)}{\Gamma(\theta)\,(n-Q)!} = \frac{\Gamma(\theta + n)}{\Gamma(\theta + P)\,(n-P)!},$$

where $Q = \sum_i q_i$ and $P = \sum_i p_i$, can be proven by recalling that, for real numbers $\alpha$ and positive integers $q \ge p$,

$$(1-x)^{-\alpha} = \sum_{m=0}^{\infty} \frac{\Gamma(\alpha + m)}{\Gamma(\alpha)\, m!}\, x^m, \qquad (1-x)^{-p} = \sum_{q=p}^{\infty} \binom{q-1}{p-1}\, x^{q-p}.$$

Hence, expanding in powers of $x$ the two sides of the identity

$$\Big[ \prod_{i=1}^{I} (1-x)^{-p_i} \Big] (1-x)^{-\theta} = (1-x)^{-(\theta + P)}$$

and exchanging the order of summations in the l.h.s., we get

$$\sum_{\{q_i \ge p_i\}} \prod_{i=1}^{I} \binom{q_i - 1}{p_i - 1} \frac{\Gamma(\theta + m + P - Q)}{\Gamma(\theta)\,(m + P - Q)!} = \frac{\Gamma(\theta + P + m)}{\Gamma(\theta + P)\, m!}.$$

The desired result is then obtained by setting $m = n - P$.

Appendix C: Fluctuations of the sample size in the unconstrained large N limit
The multivariate generating function $\varepsilon(\{t_l\})$ was computed in Section VI after relaxing the constraint $\sum_l l\, n_l = n$. In order to show that the constraint is automatically satisfied in the large $N$ limit, we construct a generating function for the expectation values of the powers of $\nu = \sum_l l\, n_l$ in the large $N$ distribution:

$$\eta(w) \equiv \sum_{\nu} P(\nu)\, w^{\nu} = \sum_{\{N_{kl}\}} P\{N_{kl}\}\, w^{\sum_l l\, n_l} = \prod_{k=1}^{N} \Big( \sum_{\{N_{kl}\}} N_k! \prod_{l=0}^{k} \frac{1}{N_{kl}!} (P_{kl} w^l)^{N_{kl}} \Big),$$

and applying the multinomial formula we obtain

$$\eta(w) = \prod_{k=1}^{N} \Big( \sum_{l=0}^{k} P_{kl} w^l \Big)^{N_k} = \prod_{k=1}^{N} (1 - \rho + \rho w)^{k N_k} = (1 - \rho + \rho w)^N.$$

Expanding the result in powers of $w$ we immediately obtain $P(\nu) = P_{N\nu}$. Since $\nu$ is distributed according to the binomial distribution, the relevant expectation values are

$$\Big\langle \frac{\nu}{N} \Big\rangle = \rho \equiv \frac{n}{N}, \qquad \Big\langle \frac{\nu^2}{N^2} \Big\rangle - \Big\langle \frac{\nu}{N} \Big\rangle^2 = \frac{1}{N}\, \rho (1 - \rho).$$

Hence fluctuations of $\nu/N$ around $\rho$ vanish like $1/N$ in the large $N$ limit.

Appendix D: Correlation between samples
An important test of randomness in sampling is offered by the measure of the correlation between two different samples. Let's consider two random samples, characterized by the sets of values $\{\hat n_a\}$ and $\{\hat m_a\}$ and by their sizes $n$ and $m$. The index $a$ labels different kinds, as in Section II. The correlation between the two samples is

$$C = \frac{\sum_{a=1}^{S} \hat n_a \hat m_a}{\sqrt{\sum_{a=1}^{S} \hat n_a^2}\, \sqrt{\sum_{a=1}^{S} \hat m_a^2}}.$$

Replacing $\hat n_a$ and $\hat m_a$ with their expectation values, computed in Section II, we obtain (in the large $N$ limit)

$$\sum_{a=1}^{S} \langle \hat n_a \rangle \langle \hat m_a \rangle = n m \sum_{a=1}^{S} \hat p_a^2, \qquad \sum_{a=1}^{S} \langle \hat n_a^2 \rangle = n^2 \Big( \sum_{a=1}^{S} \hat p_a^2 + \frac{1}{n} - \frac{1}{N} \Big), \qquad \sum_{a=1}^{S} \langle \hat m_a^2 \rangle = m^2 \Big( \sum_{a=1}^{S} \hat p_a^2 + \frac{1}{m} - \frac{1}{N} \Big).$$

By making use of the results presented in Section VI we can now express the expected value of the correlation between samples in the form

$$\langle C \rangle = \frac{\alpha^{-1} + N^{-1}}{\sqrt{\alpha^{-1} + n^{-1}}\, \sqrt{\alpha^{-1} + m^{-1}}}.$$

For samples of equal size $n$ the expected value of the correlation takes the form

$$\langle C \rangle = \frac{n}{\alpha + n}\, \frac{\alpha + N}{N}.$$

Acknowledgments
I am indebted to Sergio Caracciolo for an important observation on the rôle of fluctuations. I am also indebted to Roberto Dvornicich, Steve Shore and Ettore Vicari for critical reading of the manuscript.

[1] M.E.J. Newman, Power laws, Pareto distributions and Zipf's law, Contemporary Physics 46 (2005) 323-351
[2] D. Zelterman, Discrete Distributions (Chapter 6), Wiley (2004)
[3] W.J. Ewens, The Sampling Theory of Selectively Neutral Alleles, Theoretical Population Biology 3 (1972) 87-112
[4] S. Karlin and J. McGregor, Addendum to a Paper of W. Ewens, Theoretical Population Biology 3 (1972) 113-116
[5] S.P. Hubbell, The Unified Neutral Theory of Biodiversity and Biogeography, Princeton University Press (2001)
[6] J. Rosindell, S.P. Hubbell and R.S. Etienne, The Unified Neutral Theory of Biodiversity and Biogeography at Age Ten, Trends in Ecology and Evolution 26 (2011) 340-348
[7] R.C. Griffiths and S. Lessard, Ewens' sampling formula and related formulae: combinatorial proofs, extensions to variable population size and applications to ages of alleles, Theoretical Population Biology 68 (2005) 167-177
[8] R.S. Etienne, A new sampling formula for neutral biodiversity, Ecology Letters 8 (2005) 253-260
[9] S. Lessard, An Exact Sampling Formula for the Wright-Fisher Model and a Solution to a Conjecture About the Finite-Island Model, Genetics 177 (2007) 1249-1254
[10] A. Lambert, Species abundance distributions in neutral models with immigration or mutation and general lifetimes, Journal of Mathematical Biology 63 (2011) 57-72
[11] N.L. Johnson, S. Kotz and N. Balakrishnan, Discrete Multivariate Distributions (Chapter 41), Wiley (1997)
[12] P. Rossi, Surname distribution in population genetics and in statistical physics, to appear in Physics of Life (2013)
[13] J.M. Hilbe,