Dependence of exponents on text length versus finite-size scaling for word-frequency distributions
Álvaro Corral (1,2,3,4) and Francesc Font-Clos (5)

(1) Centre de Recerca Matemàtica, Edifici C, Campus Bellaterra, E-08193 Barcelona, Spain
(2) Departament de Matemàtiques, Facultat de Ciències, Universitat Autònoma de Barcelona, E-08193 Barcelona, Spain
(3) Barcelona Graduate School of Mathematics, Edifici C, Campus Bellaterra, E-08193 Barcelona, Spain
(4) Complexity Science Hub Vienna, Josefstädter Straße 39, 1080 Vienna, Austria
(5) ISI Foundation, Via Chisola 5, 10126 Torino, Italy
Some authors have recently argued that a finite-size scaling law for the text-length dependence of word-frequency distributions cannot be conceptually valid. Here we give solid quantitative evidence for the validity of such a scaling law, using both careful statistical tests and analytical arguments based on the generalized central-limit theorem applied to the moments of the distribution (and obtaining a novel derivation of Heaps' law as a by-product). We also find that the picture of word-frequency distributions with power-law exponents that decrease with text length [Yan and Minnhagen, Physica A 444, 828 (2016)] is not supported by rigorous statistical analysis.
INTRODUCTION
Many complex processes in biology, social science, economy, Internet science, or cognitive science are mimicked by the occurrence of words in texts. Indeed, the statistics of insects in plants [1], molecules in cells [2], inhabitants in cities [3], followers of religions [4], telephone calls to people [4], employees in companies [5], links to sites in the world-wide web [6], chords in musical pieces [7], etc., share with word-frequency distributions the property of being broadly distributed, or "heavy tailed". And most of these phenomena are described, at least asymptotically, by power-law distributions with exponents close to 2; in such cases one can talk about the fulfillment of Zipf's law [8, 9].

A fundamental problem is how these systems evolve, in particular, how they grow to reach a state for which a power law, or even Zipf's law, holds [10-12]. In Ref. [13], Bernhardsson, da Rocha, and Minnhagen challenge the "Zipf's view", proposing that the distribution of word frequencies in a text or collection of texts (of the same author) changes with text length as

D_L(k) = A e^{-k/(cL)} / k^{γ(L)},   (1)

where k is the absolute frequency (number of tokens) of the different words (word types), L is the text length in number of tokens, D_L(k) is the probability mass function of k (i.e., the distribution of word frequencies), γ(L) is a power-law exponent, c is a constant parameter (independent of L), and A is a normalizing constant. Bernhardsson et al.'s equation [Eq. (1)] should apply to individual texts or collections when one considers parts of length L of the whole. The key ingredient of that approach to model the change of D_L(k) with L is the explicit dependence of the exponent γ on the text length L, decreasing with increasing L. Note also that Eq. (1) implies that, for the largest k, the word-frequency distribution decays exponentially (in contrast to the algebraic decay proposed in Zipf's law).

Alternatively, in Ref. [14], we (together with Boleda) argue that the variability of the statistics of words in a text with its length is more simply explained by a scaling law,

D_L(k) = (1 / (L V_L)) g(k/L),   (2)

where V_L is the size of the vocabulary (number of different words, i.e., word types) for a fraction of text of length L, and g(z) is an unspecified scaling function, the same for any value of L (but not necessarily the same for different authors).

Note that the scaling-law paradigm, Ref. [14] and Eq. (2), does not assume any particular parametric shape of D_L(k) [in contrast to Ref. [13] and Eq. (1), which give a truncated gamma distribution]; the scaling-law paradigm only states that, for a fixed text, all the D_L(k)'s have the same shape no matter the value of L, but at different characteristic scales given by L. In other words, the shape parameters of the distribution do not change with L, whatever the form of this distribution is (we do not enter here into such debate [9, 15-17]), and it is only a scale parameter that changes, proportionally to L. Both exponential tails and power-law tails are allowed by the scaling function g(z); what is "forbidden" are text-length-dependent exponents γ(L). Moreover, the scaling paradigm represented by Eq. (2) does not involve any free parameter, as there is no restriction on the scaling function g, and L and V_L are given directly by the text.
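As a hedged illustration of Eq. (2) (our own minimal sketch, not code from Ref. [14]; the file name and the naive whitespace tokenization are placeholder assumptions), the rescaled empirical distribution can be built as follows:

```python
from collections import Counter

def word_frequency_distribution(tokens):
    """Empirical pmf D_L(k): fraction of word types having frequency k."""
    freqs = Counter(tokens)                    # frequency k_i of each type
    V = len(freqs)                             # vocabulary size V_L
    counts_of_counts = Counter(freqs.values()) # number of types with each k
    return {k: n / V for k, n in counts_of_counts.items()}, V

def rescaled_distribution(tokens):
    """Rescale as in Eq. (2): under scaling, L*V_L*D_L(k) versus k/L
    should collapse onto the same function g for every fragment length L."""
    L = len(tokens)
    D, V = word_frequency_distribution(tokens)
    return {k / L: L * V * Dk for k, Dk in D.items()}

# Usage sketch (hypothetical file): overlay fragments of different lengths.
# tokens = open("moby_dick.txt").read().lower().split()
# for n in (1, 2, 5, 10):
#     collapsed = rescaled_distribution(tokens[: len(tokens) // n])
```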
In fact, the scaling law is just a finite-size scaling law [18-21] (which was not explicitly mentioned in Ref. [14]).

Subsequently, Yan and Minnhagen claimed that this scaling law is "fundamentally impossible" [22], "cannot be conceptually valid" and "is not borne out by the data" [23]. These statements constitute good examples of the sometimes counterintuitive nature of scaling laws. Let us summarize the points of these authors to make it clear that their critique is not relevant.

• First, in their Fig. 1, they find that the scaling law does not hold for k = 1.

• Second, in their Fig. 2 they show that the scaling law does not work well for, let us say, k ≲ 10.

• Third, it is argued that a "Randomness view", based on the concepts of "Random Group Formation", "Random Book Transformation", and "Metabook", predicts the right form of D_L(k), which is that of Refs. [13, 24], i.e., Eq. (1) above.

It is quite clear that the first and second criticisms of Yan and Minnhagen [23] are not fundamental, as they simply imply that the scaling law can only be valid beyond the low-frequency limit, so the scaling law could be rewritten as

D_L(k) = (1 / (L V_L)) g(k/L), for k ≳ 10.

This is not surprising at all, as it is well known in statistical physics that scaling laws are usually observed only asymptotically (see Appendix I at the end). It is remarkable then that, for texts, scaling is attained just after the first decade in frequencies (i.e., for words that appear more than about 10 times). It is also remarkable that, despite the fact that Yan and Minnhagen [23] stretch the scaling hypothesis down to very small fragments of texts (215525 tokens divided into 500 fragments, yielding about 400 tokens in each one, for the case of Moby-Dick), the scaling law is still fulfilled reasonably well beyond the first decade in k, as we detail below. Naturally, the appropriate way to further test the validity of the scaling law is the opposite one, analyzing larger and larger texts. The third point of these authors [23] is also not justified, as they do not provide any statistical evidence supporting the claim that Eq. (1) fits the empirical data at an acceptable confidence level.

In this paper we revise the evidence for the finite-size scaling law for word-frequency distributions, Ref. [14] and Eq. (2), comparing this approach with the one of Ref. [13] and Eq. (1). In Sec. 2 we summarize the main claims in Ref. [13] and in which way they relate to the validity of the scaling law; next, we compare the performance of different fits related to the two approaches; subsequently we use a direct method to test the validity of the scaling hypothesis applied to word-frequency distributions; and then we compare with other ways of assessing errors in scaling. The penultimate section presents a novel theoretical calculation of the scaling of the moments of the distribution using the generalized central-limit theorem, connecting them with the scaling law (and yielding a derivation of Heaps' law as a by-product). The conclusions and two appendices are at the end. As the empirical evidence in favor of the scaling law used in Ref. [14] was essentially "visual" (collapse of rescaled plots in log-log), and the theoretical arguments were reduced to a heuristic derivation, the present paper provides a substantial improvement in support of the validity of a finite-size scaling law in word-frequency distributions.

VALIDITY OF THE FINITE-SIZE SCALING LAW FOR WORD-FREQUENCY DISTRIBUTIONS
Let us explain Yan and Minnhagen's points [23] in more detail. They base their analysis on the empirical value of the number of types with frequency equal to or greater than k, defined as

N_L(≥ k) = V_L S_L(k) = V_L Σ_{k'=k}^{∞} D_L(k'),

where S_L(k) is the empirical complementary cumulative distribution of the frequency, and N_L(≥ k) turns out to be nothing else than the empirical rank associated to frequency k. In terms of S_L(k) the scaling law (2) transforms into [14]

S_L(k) = (1/V_L) G(k/L), for k ≳ 10,

under a continuous approximation, with G(z) a new scaling function directly related to g(z) [14]. So, for N_L(≥ k) one has that the scaling law (2) can be written as

N_L(≥ k) = V_L S_L(k) = G(k/L), for k ≳ 10.   (3)

Then, in their Figs. 1(a) and 1(b), Yan and Minnhagen [23] compare this cumulative number evaluated for the complete text, N_{L_tot}(≥ k), with the vocabulary size V_L for variable L, which verifies, by definition, V_L = N_L(≥ 1) (note that L_tot is the length of the complete text). The disagreement between N_{L_tot}(≥ k) and N_L(≥ 1) for the same values of the ratio between frequency and length (k/L_tot = 1/L) makes it clear that the scaling law, for any L ≠ L_tot and under the form given by Eq. (3), does not work for the corresponding k = 1 (the hapax legomena, these are the types that appear just once in a text sample of length L). Nevertheless, this disagreement does not invalidate the scaling law (2) for k > 1. In their Fig. 2 they plot N_L(≥ k) versus k/L for all k and different L, and indeed find deviations with respect to the scaling law; but let us note that these deviations are restricted to k ≤ 10 or so, with scaling holding for k > 10 or so (see our Appendices I and II, anyhow).

To make our thesis totally clear, in Fig. 1 we present N_L(≥ k) for the same data as in Fig. 2 of Ref. [23] (i.e., Moby-Dick, by Herman Melville, and Harry Potter, books 1 to 7, by J. K. Rowling), but adding symbols (instead of only lines, as in Ref. [23]). It is apparent that even in the extreme case of 500 equal fragments of the full text, the scaling law only fails for very small frequencies. We have verified that the scaling law holds for many other texts [14], even for Finnegans Wake, by James Joyce, which constitutes an extreme case of experimental literary creation, yielding an unusual, somewhat concave relation between N_L(≥ k) and k (in log-log) [25]; in any case, the scaling function g(z) does not care about concavity or convexity.

Additionally, in Fig. 2 we perform the data collapse associated to the scaling law in terms of the probability mass function D_L(k) for Harry Potter, presented in Ref. [23] as a counter-example to the scaling law. As is shown, the collapse is excellent: after proper rescaling [Fig. 2(b)], all curves collapse onto a single, length-independent function, even for very small frequencies (smaller than 10). So we can write, for this text,

D_L(k) ≃ (1 / (L V_L)) g(k/L), for all k.

Notice that this is the original form under which the scaling law for word-frequency distributions was presented [14], and not the one in Eq. (3). In other words, deviations from the scaling law play an even more minor role in the representation in terms of D_L(k), in comparison to the representation in terms of N_L(≥ k). Finding a functional form for g(z) [whose particular shape in the case of Harry Potter is unveiled in our Fig. 2(b)] is a delicate issue, and it is not our interest here [in Ref. [14], when considering lemmatized texts, a double power law was proposed for the sake of illustration]; nevertheless, in the next section we show that the empirical data, for each text and different values of L, are compatible with a unique g(z) characterized by a power-law tail, with an exponent close to two. A sketch of the rank-based representation N_L(≥ k) used in this section is given below.
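The empirical rank N_L(≥ k) is straightforward to compute; the following fragment is our own hedged illustration (the function name is a placeholder, not from the paper):

```python
from collections import Counter
import bisect

def cumulative_rank(tokens):
    """Empirical N_L(>=k) for every observed k: the number of word types
    whose frequency is >= k, i.e., the rank associated to frequency k."""
    freqs = sorted(Counter(tokens).values())   # ascending type frequencies
    ks = sorted(set(freqs))
    # types with frequency >= k = total types - types with frequency < k
    return {k: len(freqs) - bisect.bisect_left(freqs, k) for k in ks}

# Under Eq. (3), plotting N_L(>=k) against k/L for fragments of different
# lengths L should trace a single curve G(k/L), except for k <~ 10:
# L = len(tokens)
# collapsed = {k / L: N for k, N in cumulative_rank(tokens).items()}
```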
PROPER FITTING OF THE POWER-LAW TAIL

Looking at our Fig. 2(a), where D_L(k) is shown with no rescaling, one could conclude, as Yan and Minnhagen [23] do, that for different text lengths one gets different shapes for D_L(k). Indeed, a visual inspection of the plot seems to show different slopes for different L-values, corresponding to different exponents γ(L). However, the data collapse after rescaling in Fig. 2(b) demonstrates that all distributions have the same shape, given by g(z), but at different scales, given by L (remember that in log-log a rescaling is seen just as a shift). And the deviations for k ≤ 10 do not play any relevant role. The apparently different slopes of the different curves in Fig. 2(a) are a visual artifact caused by the convex (but close to linear) log-log shape of the curves. In other words, the larger L, the larger the part of g(z) one sees beyond the tail. As the body of the distribution does not decay as fast as the tail, the larger the part of the body one sees, the smaller the apparent exponent, which is a sort of average between the body and the tail. This illustrates how a simple replotting under a rescaled form can unveil a common pattern in distributions given at different scales. And let us repeat that we are not interested here in providing a parametric model for this convex shape; for that, just see Ref. [9].

In order to support their point, Yan and Minnhagen [23] fit power laws [Eq. (1) with c = ∞] to the word-frequency distributions for different L, finding an exponent γ that drifts with L, reaching γ ≃ 2.4 for the shortest fragments (while Zipf's law corresponds to γ = 2, strictly speaking). The authors do not mention which fitting method they use, nor the fitting range, but we can demonstrate that power-law fits for the whole range of k in Moby-Dick and Harry Potter are in general rejected after rigorous goodness-of-fit tests, no matter if one fits continuous [26] or discrete power laws [9]. Taking Harry Potter and applying the maximum-likelihood fitting plus the Kolmogorov-Smirnov test detailed in Ref. [9], we only get one case of a non-rejectable discrete power law defined in the whole range k ≥ 1 (with an exponent close to 2 and a standard error of 0.04). If, instead, we consider a power law

D_L(k) ∝ 1/k^γ   (4)

defined only for k ≥ k_cut, with k_cut a cut-off value of k verifying k_cut > 1, we find non-rejectable power laws for all L, with stable exponents when k_cut is large enough. For the case of Harry Potter we find that, for k_cut growing proportionally to L, the exponents γ turn out to be stabilized, with values very close to two (see Table I). So, g(z) has a power-law tail, valid beyond a fixed value of z = k/L, and we interpret the stability of γ under k_cut ∝ L as a signature of the existence of a well-defined, L-independent scaling function g(z). The fitting procedure is sketched below.
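As a hedged illustration of this kind of fit, here is a minimal sketch of maximum-likelihood estimation plus a Kolmogorov-Smirnov goodness-of-fit test with Monte Carlo p-value, in the spirit of Refs. [9, 26, 27] but simplified to a continuous power law for k ≥ k_cut (all names and structure are our own, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_power_law(k, k_cut):
    """ML exponent for a continuous power law D(k) ~ k^(-gamma), k >= k_cut."""
    tail = np.asarray(k, dtype=float)
    tail = tail[tail >= k_cut]
    gamma = 1.0 + len(tail) / np.sum(np.log(tail / k_cut))  # Hill estimator
    return gamma, tail

def ks_statistic(tail, gamma, k_cut):
    """KS distance between empirical and fitted CDFs, evaluated at the data."""
    x = np.sort(tail)
    cdf = 1.0 - (k_cut / x) ** (gamma - 1.0)       # fitted power-law CDF
    emp = np.arange(1, len(x) + 1) / len(x)        # empirical CDF
    return np.max(np.abs(emp - cdf))

def p_value(tail, gamma, k_cut, n_sim=1000):
    """Monte Carlo p-value: fraction of synthetic power-law samples whose
    refitted KS distance exceeds the empirical one."""
    d_emp = ks_statistic(tail, gamma, k_cut)
    count = 0
    for _ in range(n_sim):
        u = 1.0 - rng.random(len(tail))            # uniform in (0, 1]
        synth = k_cut * u ** (-1.0 / (gamma - 1.0))  # inverse-CDF sampling
        g, t = fit_power_law(synth, k_cut)
        count += ks_statistic(t, g, k_cut) > d_emp
    return count / n_sim
```

A fit is accepted when the p-value is above the chosen threshold; the procedure is then repeated for increasing k_cut until the exponent stabilizes.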
TABLE I. Results of the goodness-of-fit test [27] of a discrete power-law distribution D_L(k) ∝ 1/k^γ in the range k ≥ k_cut, for Harry Potter (for which L_tot = 1108955). Notice the stability of the γ exponent for k_cut ∝ L. Note also that for L = L_tot/500 the difference between Yan and Minnhagen's result (which is γ ≃ 2.4) and our result here with k_cut = 2 (which is γ = 2.16) is rather large. It has been pointed out that inappropriate fitting methods can lead to biased results [4, 26].

    L           γ           k_cut   V_L   p-value
    L_tot       … ± …       …       …     …
    L_tot/2     … ± …       …       …     …
    L_tot/5     … ± …       …       …     …
    L_tot/10    2.09 ± …    …       …     …
    L_tot/20    2.18 ± …    …       …     …
    L_tot/50    2.14 ± …    …       …     …
    L_tot/100   2.17 ± …    …       …     …
    L_tot/200   2.14 ± …    …       …     …
    L_tot/500   2.16 ± …    …       …     …

TESTING OF THE SCALING HYPOTHESIS
A more direct and non-parametric way to test the existence of scaling is to use the two-sample Kolmogorov-Smirnov test [28], which compares two data sets under the null hypothesis that both of them come from the same population, and therefore have the same underlying theoretical distribution (which is unknown and remains unknown after the test). But in the case of scaling we are not dealing with the same distribution, but with distributions which have the same shape at different scales, i.e., distributions that are the same except for a scale parameter; then, rescaling the distributions by their scale parameter would lead to the same distribution (under the null hypothesis that scaling holds). This procedure to test the fulfillment of scaling has been used before for continuous distributions [29, 30].

The Kolmogorov-Smirnov test is probably the best-accepted test for comparing continuous distributions, but word-frequency data are discrete, and after rescaling become discretized over different sets (as the scaling factors L of the two distributions can be very different, in general). So, our first step, in order to avoid this problem, is to approximate the discrete empirical distributions by continuous ones, by adding to each frequency a random term, in this way k → k + u (where u is a uniform random number between -0.5 and 0.5; note that the original k's are natural numbers). The second step is to remove small frequencies (remember that scaling is a large-scale property); in our case we remove values of k below 4. Then, the third step is to perform the rescaling

k → k ⟨k⟩ / ⟨k²⟩,

where the moments of k are the original empirical ones (calculated for the discrete distribution). In a simple case (with no power laws involved [29, 30]) we would have rescaled just by the mean ⟨k⟩; in this case the rescaling is a bit more involved [31, 32]. Notice that this rescaling is totally equivalent to dividing k by L, as shown in another section below; nevertheless, our choice is more general and makes the rescaling applicable when the data do not come from a text.

Once these three steps have been done, the two-sample Kolmogorov-Smirnov test [28] is performed for all pairs of samples given by different L, restricting the samples to a common support, i.e., a common minimum value is taken as the minimum value of the sample with the smallest L (which has the largest minimum when the frequencies are rescaled). Figure 4 shows the p-value of this test for several texts and different divisions of the texts, up to L_tot/50. The fact that the p-value appears as uniformly distributed between zero and one is an indication that the scaling null hypothesis holds. A sketch of the three-step procedure is given below.
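A minimal sketch of the procedure (the jitter width of 0.5 and the cut at k < 4 follow the text; everything else, including names and the use of scipy's standard two-sample test, is our own illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

def rescaled_sample(freqs):
    """freqs: array of word frequencies k_i (one per type) of a fragment."""
    k = np.asarray(freqs, dtype=float)
    scale = np.mean(k) / np.mean(k**2)        # empirical <k>/<k^2> of the
                                              # original discrete frequencies
    k = k + rng.uniform(-0.5, 0.5, k.size)    # step 1: continuous approximation
    k = k[k >= 4]                             # step 2: drop small frequencies
    return k * scale                          # step 3: rescale (~ k/L)

def scaling_test(freqs_a, freqs_b):
    """Two-sample KS test of the scaling hypothesis for two fragments."""
    a, b = rescaled_sample(freqs_a), rescaled_sample(freqs_b)
    lo = max(a.min(), b.min())                # restrict to a common support
    return ks_2samp(a[a >= lo], b[b >= lo]).pvalue
```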
RELATIVE ERRORS OF THE SCALING LAW

Although the proper way to compare statistical distributions is by means of statistical tests (as done in the previous section), Yan and Minnhagen [23] use instead relative errors. They show numbers for the relative error provided by the scaling law, and compare it with the error of the so-called random-group-formation hypothesis. We explain why their comparison is not appropriate. First, for the scaling law, the empirical values of N_L(≥ k) are compared, for fixed ratio k/L, with N_{L_tot}(≥ k), and the errors are claimed to be large. Second, for the random group formation the error is claimed to be much smaller, but in this case the empirical data N_L(≥ k) are compared with random samples of the same length L, and not with a distribution of a different length. It is obvious that this procedure has to yield better results, and this constitutes a totally biased comparison.

But further, the errors provided by Yan and Minnhagen [23] for the scaling law are inflated. Our Fig. 5 shows the relative difference or error between the true value N_L(≥ k) and the value approximated by the scaling law, N_{L_tot}(≥ k'), with k' = L_tot k/L, which is

ε_L(k) = [N_{L_tot}(≥ k') - N_L(≥ k)] / N_L(≥ k).

Note that, in general, the replacement in the denominator of N_L(≥ k) by N_{L_tot}(≥ k') inflates the reported error, as, when there are deviations, this number is systematically below N_L(≥ k). The results of our analysis (Fig. 5) show that the errors are not as big as reported by Yan and Minnhagen [23]. Dividing Harry Potter in up to 20 parts, the relative error provided by the scaling law is almost always below 0.2, with the remarkable exception of the case k = 1 for L = L_tot/20. Dividing the text into smaller parts yields a relative error always below 0.3 for k > 10. But the error for small k is further reduced if one uses for comparison the probability mass function D_L(k) instead of N_L(≥ k). Remember that the original form of the scaling law was reported for D_L(k) and not for N_L(≥ k). Our Fig. 2(b) speaks for itself. A sketch of the error computation is given below.
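A hedged sketch of ε_L(k) as just defined (taking as inputs dicts k -> N(≥ k), as produced e.g. by the hypothetical cumulative_rank helper sketched earlier):

```python
def relative_error(rank_L, rank_tot, L, L_tot):
    """epsilon_L(k) = [N_Ltot(>=k') - N_L(>=k)] / N_L(>=k), k' = k*L_tot/L.

    rank_L, rank_tot: dicts mapping k -> N(>=k) for the fragment and the
    full text. Since N(>=k) is a right-continuous step function of k,
    N_Ltot(>=k') equals its value at the smallest observed k >= k'."""
    ks_tot = sorted(rank_tot)
    out = {}
    for k, n_L in rank_L.items():
        k_prime = k * L_tot / L
        candidates = [kk for kk in ks_tot if kk >= k_prime]
        if candidates:                      # skip k' beyond the largest freq
            out[k] = (rank_tot[candidates[0]] - n_L) / n_L
    return out
```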
SCALING OF MOMENTS FROM THE GENERALIZED CENTRAL-LIMIT THEOREM, HEAPS' LAW, AND RELATION WITH THE SCALING LAW

We start this section dealing with a distribution D_L(k) that has a power-law tail with an exponent in the range 1 < γ < 2. We consider the moments ⟨k⟩ and ⟨k²⟩ not as the moments of the theoretical distribution (which would be equal to infinity) but as the moments of a finite sample, whose size is just the size of the vocabulary V_L (by definition); that is,

⟨k⟩ = (1/V_L) Σ_{i=1}^{V_L} k_i and ⟨k²⟩ = (1/V_L) Σ_{i=1}^{V_L} k_i².

Due to the power-law behavior for large k, the generalized central-limit theorem [32, 33] allows one to obtain the scaling properties of these sums, assuming that the individual frequencies are independent (or weakly dependent). Indeed, Σ_{i=1}^{V_L} k_i does not scale linearly with V_L but superlinearly, as Σ_{i=1}^{V_L} k_i ∝ V_L^{1/(γ-1)}. Moreover, if k has a power-law-tailed distribution with exponent γ, so does k², but with exponent γ' fulfilling γ' - 1 = (γ - 1)/2 (so that 1 < γ' < 2), and then the generalized central-limit theorem also applies to k², to give Σ_{i=1}^{V_L} k_i² ∝ V_L^{2/(γ-1)}.

On the other hand, we can also use the exact result Σ_{i=1}^{V_L} k_i = L (the definition of text length), from which we obtain the classical Heaps' law (also called Herdan's law in linguistics) [14, 34-37],

V_L ∝ L^{γ-1},   (5)

and therefore the moments fulfill

⟨k⟩ ∝ L^{2-γ} and ⟨k²⟩ ∝ L^{3-γ}   (6)

(and, in general, ⟨k^m⟩ ∝ L^{m+1-γ}). A numerical check of the superlinear scaling is sketched below.
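The superlinear scaling of the sums is easy to check numerically; the following minimal sketch is our own (the value γ = 1.5 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 1.5                        # pmf tail exponent, in (1, 2)

def powerlaw_sample(V):
    """V i.i.d. 'frequencies' with tail P(k > x) ~ x^-(gamma-1), by inversion."""
    u = 1.0 - rng.random(V)        # uniform in (0, 1]
    return u ** (-1.0 / (gamma - 1.0))

# The ratio sum(k_i) / V^(1/(gamma-1)) fluctuates (it converges to a stable
# law, not to a constant) but stays of order one as V grows over 3 decades,
# exhibiting the superlinear scaling sum(k_i) ~ V^(1/(gamma-1)) = V^2 here.
for V in (10**3, 10**4, 10**5, 10**6):
    ratio = powerlaw_sample(V).sum() / V ** (1.0 / (gamma - 1.0))
    print(V, ratio)
```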
This result is compatible with a scaling law of the form

D_L(k) = (1/L^γ) g(k/L).   (7)

The case considered in the literature [32, 38] assumes that g(z) has an intermediate power-law decay with exponent γ followed by a much faster decay (exponential or so) for the largest k's. The pure power-law tail considered above is included in this framework when g(z) goes to zero abruptly, transforming the pure power law into a truncated power law. Indeed, if the power law is truncated at k_max, using a continuous approximation we get

⟨k^m⟩ = ∫_1^∞ k^m D_L(k) dk ≃ ∫_1^{k_max} k^m D_L(k) dk = L^{m+1-γ} ∫_{1/L}^{k_max/L} z^m g(z) dz ∼ L^{m+1-γ},   (8)

because (for m > γ - 1) the integral tends to a constant when L is large, taking into account that k_max is the maximum of k from a sample of size V_L and scales in the same way as Σ k_i, i.e., as L. To see this one can just calculate, for any P, the percentiles k_P of the distribution of the maximum of V_L frequencies, which verify [1 - S_L(k_P)]^{V_L} = P. Substituting a power law for S_L(k) (with exponent γ - 1) we get

k_P ∝ (1 - P^{1/V_L})^{-1/(γ-1)} ≃ (V_L / (-ln P))^{1/(γ-1)} ∝ L.

So, as all the percentiles of the maximum scale with L, the distribution of the maximum scales with L too [we arrive at the last result using that P^{1/V_L} = e^{(ln P)/V_L} ≃ 1 + (ln P)/V_L, valid for large V_L, and also Heaps' law (5)].

Therefore, a power-law tail for the distribution of frequencies, with exponent γ, is somehow equivalent to an upper-truncated power-law tail, with an effective cut-off k_max ∝ V_L^{1/(γ-1)} ∝ L, which makes the features of the distribution at the largest k scale with L. The scaling law (2) follows directly from Eq. (7) using Heaps' law (5), although notice that the version of the law given by Eq. (2) is non-parametric, in the sense that the value of the exponent γ does not appear in the law (which is good if the determination of the exponent contains errors).

Moreover, as a by-product we obtain another form for the scaling law,

D_L(k) = (⟨k⟩³/⟨k²⟩²) g(k ⟨k⟩/⟨k²⟩),   (9)

using the scaling (6) of the moments with L and the scaling law (7), with the scaling function g being the same as before, except for proportionality factors. This scaling has been used previously for self-organized critical phenomena, but under different conditions [31]. Remember that here the moments are not those of the theoretical distribution but the ones corresponding to a sample of size V_L. The equivalence of both scaling laws, Eqs. (2) and (9), is empirically shown in Fig. 6 by means of the proportionality between ⟨k²⟩ and L²/V_L.

However, real distributions of frequencies are not well described at the largest frequencies by scaling functions g(z) that decay either exponentially or abruptly (i.e., that are sharply truncated), as shown in Table I and Fig. 3. Instead, we expect the tail of g(z) [and therefore the tail of D_L(k)] to be another power law, with an exponent larger than γ. Remarkably, this framework is also described by the scaling law (7) and the scaling of moments (6), the key point being that one can change the upper limit of the integral from infinity to k_max, and k_max still scales linearly with L, so the derivation is the same as in Eq. (8).

In order to support empirically the fulfillment of a scaling law of the form given by Eq. (7), we follow the approach presented in Ref. [39]. If such a scaling law holds, the distance between the different rescaled distributions in log-scale,

(ln k_i - δ ln L, ln D_L(k_i) + γ ln L),

should be minimal when the right values of the exponents are substituted. Notice that we have introduced an extra exponent δ, which we expect to become equal to 1. We proceed by minimizing such distances as a function of the exponents δ and γ, resulting in values of δ very close to 1 indeed and values of γ in the range 1.5 to 1.95 when different texts are used, see Table II (a sketch of the minimization is given after the table).
TABLE II. Exponents δ and γ obtained after the minimization of the distance between the rescaled word-frequency distributions, optimizing the data collapse, for 7 different texts. Text length L varies from L_tot to L_tot/10. Different seeds are used in the algorithm in order to avoid local minima.

    Text and author                                Language   δ      γ
    Clarissa by S. Richardson                      English    0.96   1.50
    Moby-Dick by H. Melville                       English    1.05   1.87
    Ulysses by J. Joyce                            English    1.04   1.94
    El Quijote by M. de Cervantes                  Spanish    0.96   1.64
    La Regenta by L. A. Clarín                     Spanish    0.86   1.52
    Artamène by the Scudéry siblings               French     1.04   1.63
    Le Vicomte de Bragelonne by A. Dumas (father)  French     0.96   1.58
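A hedged sketch of the collapse-distance minimization in the spirit of Ref. [39] (our own simplified objective: the summed variance, over a common grid, of the shifted log-log curves; the algorithm actually used for Table II may differ):

```python
import numpy as np
from scipy.optimize import minimize

def collapse_cost(params, curves):
    """curves: list of (L, k, D), with k sorted ascending and D = D_L(k).
    Shift each curve to (ln k - delta*ln L, ln D + gamma*ln L), interpolate
    on a common grid, and measure the spread between curves."""
    delta, gamma = params
    shifted = [(np.log(k) - delta * np.log(L), np.log(D) + gamma * np.log(L))
               for L, k, D in curves]
    lo = max(x.min() for x, _ in shifted)      # common support in log-scale
    hi = min(x.max() for x, _ in shifted)
    grid = np.linspace(lo, hi, 50)
    ys = np.array([np.interp(grid, x, y) for x, y in shifted])
    return ys.var(axis=0).sum()                # zero for a perfect collapse

def fit_collapse(curves, seed=(1.0, 2.0)):
    """Minimize the spread over (delta, gamma); restarting from several
    seeds helps avoid local minima, as noted in the caption of Table II."""
    return minimize(collapse_cost, seed, args=(curves,),
                    method="Nelder-Mead").x
```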
DISCUSSION AND CONCLUSIONS
As an important remark, we want to clarify that we are not against the so-called random-group-formation hypothesis [24], as in some previous research we have made use of randomness to explain real texts [40]. Our conclusion was that real texts are not random, but the first appearance of a word is close to random, so the word-frequency distribution (related to Zipf's law) and the type-token growth curve (related to Heaps' law) remain the same for real texts and for random versions of them. The reason is that the word-frequency distribution is independent of word order, and the type-token growth curve only depends on the first appearance of a word. Other properties of real texts are different from those of random texts, such as inter-appearance distances [40-42].

Summarizing, the empirical facts are clear: a finite-size scaling law gives a very good approximation for the distribution of word frequencies for different fragments of text of length L. The shape of D_L(k) is the same for all L, and it is only a scaling factor proportional to L that makes the difference for different L. It is the parametric proposal of Ref. [13] that is not well supported by solid statistical testing. In any case, if the theory held by Yan and Minnhagen [23] is valid, then it must contain the scaling law in some limit. If not, the theory is irrelevant for real texts.

In conclusion, we show how the sort of scaling arguments usual in statistical physics, and in particular finite-size scaling (2), can describe complex processes much better than parametric formulas (1). Finally, in order to avoid misunderstandings, let us state that although curve fitting is a very honorable approach in science (when done correctly [9, 26, 27]), a scaling approach has nothing to do with that [23].

ACKNOWLEDGEMENTS
We are grateful to R. Ferrer-i-Cancho for drawing our attention to Ref. [13], and to G. Boleda for facilitating the beginning of this research. Yan and Minnhagen's criticisms have allowed us to explain in much more detail the validity of our scaling law and to arrive at the new results presented here. Research projects in which this work is included are FIS2012-31324 and FIS2015-71851-P, from Spanish MINECO, and 2014SGR-1307, from AGAUR.
APPENDIX I
We explain here the difference between a power law and a scaling law, and how scaling laws in statistical physics usually only hold asymptotically. Let us start with a scale transformation. This is an operation that stretches and/or contracts a function, i.e.,

T[f(x, y)] = c f(x/a, y/b),

where f(x, y) is a (in this example bivariate) real function, a, b, c are constant and positive scale factors, and T is the scale transformation. If we ask which functions are invariant under scale transformations (i.e., which functions do not change when they are stretched and/or compressed), we find that a solution is

f(x, y) = (1/x^β) g(y/x^α),   (10)

with α = ln b / ln a and β = -ln c / ln a, and g an arbitrary function, called the scaling function (we also consider x > 0 and y > 0). If invariance is required for any value of a > 0, the previous solution (10) turns out to be the only solution [38]. One refers to Eq. (10) as a scaling form or a scaling law.
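As a quick check of the invariance (our own verification, spelled out for completeness), apply T to the form (10) and use that α = ln b / ln a gives a^α = b, while β = -ln c / ln a gives a^β = 1/c:

```latex
T[f(x,y)] = c\, f(x/a,\, y/b)
          = \frac{c\, a^{\beta}}{x^{\beta}}\,
            g\!\left(\frac{y\, a^{\alpha}}{b\, x^{\alpha}}\right)
          = \frac{1}{x^{\beta}}\, g\!\left(\frac{y}{x^{\alpha}}\right)
          = f(x,y).
```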
Notice that a power law is a special case of a scaling law, just taking the arbitrary scaling function g to be a constant C. In fact, in one dimension (i.e., for univariate functions f(x)) the only scaling laws (the only scale-invariant functions for any value of the scale factor a) are the power laws [10, 43], so f(x) = C/x^β. Although the terms "scaling law" and "power law" are sometimes taken as synonyms of each other, it is clear that they are only equivalent for univariate functions. For bivariate (and multivariate) functions one needs to be more careful in distinguishing both concepts (as we do here).

Let us stress that scaling is a fundamental pillar of 20th-century statistical physics [44]. In our case, we propose that a (bivariate) scaling law holds for D_L(k): we identify k = y, L = x, and D_L(k) = f(x, y), and assume Heaps' law for the usual scaling law to hold (more details in Ref. [14]). Alternatively, we may identify N_L(≥ k), rather than D_L(k), with f(x, y), with no necessity of using Heaps' law. In any case this does not necessarily imply that the scaling function g has a power-law shape. We do not care here about the functional form of g; this is just the shape shown in Fig. 2(b) (for the particular book under consideration there).

We provide in Fig. 7 a practical example of how scaling laws in statistical physics usually hold only for large x and y (i.e., large L and k). We display the rescaled size distribution D_L(k) of a critical Galton-Watson branching process [45] with its number of generations bounded by a finite L and with offspring distribution given by a binomial distribution with 2 trials. This process is totally equivalent to percolation in the Bethe lattice [46]. The figure shows the deviation from the scaling law for k ≤ 10, but nevertheless it has been proved analytically that finite-size scaling holds in this system [46]. Ironically, in this case the scaling function is well approximated by the function proposed by Bernhardsson et al. [13] [our Eq. (1)], but with a constant exponent γ = β/α = 3/2.

APPENDIX II
Let us see how discreteness effects alter scaling. Naturally, the discrete nature of word-frequency distributions comes from the fact that the fundamental unit is the count of word tokens. If we assume that scaling holds for all k, even for k = 1 and k = 2, the finite-size scaling law, under the form given by Eq. (3), implies that

N_L(≥ 2) = N_{L/2}(≥ 1),

and we can relate this to the size of the vocabulary for each text length, so

V_L - n_L^s(1) = V_{L/2},

where n_L(k) counts the number of types with frequency (exactly equal to) k, and the superscript s denotes that we are under the scaling hypothesis.

On the other hand, for a random text, V_{L/2} can be calculated from V_L as

V_{L/2} = V_L - n_{L/2}(0) = V_L - Σ_{k≥1} h_{0,k} n_L(k),

where h_{0,k} gives the probability of getting 0 tokens of a certain type when a fragment of text of length L/2 is extracted from a text of length L in which the same type has frequency k, see Eq. (5) of Ref. [14]. Comparing both equations for V_{L/2} we get n_L^s(1) = Σ_{k≥1} h_{0,k} n_L(k). But using that h_{0,k} ≤ 1/2^k (see below) and n_L(k) < n_L(1) for k > 1,

n_L^s(1) = n_{L/2}(0) = Σ_{k≥1} h_{0,k} n_L(k) < n_L(1) Σ_{k≥1} 1/2^k = n_L(1)

(extending the sum to infinity). Thus, the scaling hypothesis yields, for a random text and for k = 1, fewer types than it should. This can be seen by looking carefully at some of the plots in Fig. 2 of Ref. [14], but not in Fig. 3 there or in Fig. 2(b) here, as the deviations are rather small [just notice that the empirical D_L(k) is proportional to n_L(k)].

The fact that h_{0,k} ≤ 1/2^k comes from the fact that h_{0,k} is given by the hypergeometric distribution (as we assume we take tokens from the larger text with no replacement), and then

h_{0,k} = C(k, 0) C(L - k, L/2) / C(L, L/2)
        = [(L/2)! / (L/2 - k)!] / [L! / (L - k)!]
        = (L/2)(L/2 - 1)···(L/2 - k + 1) / [L(L - 1)···(L - k + 1)]
        = Π_{j=0}^{k-1} (L/2 - j)/(L - j),   (11)-(15)

where C(·,·) denotes the binomial coefficient, the factor for j = 0 equals 1/2, and all the remaining factors are smaller than 1/2. This yields h_{0,k} ≤ 1/2^k, with strict inequality for k > 1. In this way we show how discreteness effects break scaling for the lowest frequencies, but, as can be seen in the plots, this effect is very small.
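A tiny numerical check of this bound (our own illustration, with arbitrary hypothetical values of L and k; exact rational arithmetic via Python's fractions module):

```python
from fractions import Fraction

def h0(L, k):
    """Probability that a type with frequency k in a text of length L leaves
    no token in a random half of length L/2; the hypergeometric product of
    Eqs. (11)-(15)."""
    assert L % 2 == 0 and 0 < k <= L // 2
    p = Fraction(1)
    for j in range(k):                      # product of (L/2 - j)/(L - j)
        p *= Fraction(L // 2 - j, L - j)
    return p

L = 1000
for k in range(1, 8):
    print(k, float(h0(L, k)), 2.0**-k)      # h0 <= 2^-k, equality only at k=1
```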
[1] S. Pueyo and R. Jovani. Comment on "A keystone mutualism drives pattern in a power function". Science, 313:1739c-1740c, 2006.
[2] C. Furusawa and K. Kaneko. Zipf's law in gene expression. Phys. Rev. Lett., 90:088102, 2003.
[3] Y. Malevergne, V. Pisarenko, and D. Sornette. Testing the Pareto against the lognormal distributions with the uniformly most powerful unbiased test applied to the distribution of cities. Phys. Rev. E, 83:036111, 2011.
[4] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Rev., 51:661-703, 2009.
[5] R. L. Axtell. Zipf distribution of U.S. firm sizes. Science, 293:1818-1820, 2001.
[6] L. A. Adamic and B. A. Huberman. Zipf's law and the Internet. Glottometrics, 3:143-150, 2002.
[7] J. Serrà, A. Corral, M. Boguñá, M. Haro, and J. Ll. Arcos. Measuring the evolution of contemporary western popular music. Sci. Rep., 2:521, 2012.
[8] A. Corral, G. Boleda, and R. Ferrer-i-Cancho. Zipf's law for word frequencies: Word forms versus lemmas in long texts. PLoS ONE, 10(7):e0129031, 2015.
[9] I. Moreno-Sánchez, F. Font-Clos, and A. Corral. Large-scale analysis of Zipf's law in English texts. PLoS ONE, 11(1):e0147073, 2016.
[10] M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Cont. Phys., 46:323-351, 2005.
[11] B. Corominas-Murtra, R. Hanel, and S. Thurner. Understanding scaling through history-dependent processes with collapsing sample space. Proc. Natl. Acad. Sci. USA, 112(17):5348-5353, 2015.
[12] V. Loreto, V. D. P. Servedio, S. H. Strogatz, and F. Tria. Dynamics on expanding spaces: Modeling the emergence of novelties. In M. Degli Esposti et al., editors, Creativity and Universality in Language, pages 59-83. Springer, 2016.
[13] S. Bernhardsson, L. E. Correa da Rocha, and P. Minnhagen. The meta book and size-dependent properties of written language. New J. Phys., 11:123015, 2009.
[14] F. Font-Clos, G. Boleda, and A. Corral. A scaling law beyond Zipf's law and its relation to Heaps' law. New J. Phys., 15:093033, 2013.
[15] R. Ferrer i Cancho and R. V. Solé. Two regimes in the frequency of words and the origin of complex lexicons: Zipf's law revisited. J. Quant. Linguist., 8(3):165-173, 2001.
[16] M. A. Montemurro. Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A, 300(3-4):567-578, 2001.
[17] W. Li, P. Miramontes, and G. Cocho. Fitting ranked linguistic data with two-parameter functions. Entropy, 12(7):1743-1764, 2010.
[18] E. Brézin. An investigation of finite size scaling. J. Phys., 43:15-22, 1982.
[19] M. N. Barber. Finite-size scaling. In C. Domb and J. L. Lebowitz, editors, Phase Transitions and Critical Phenomena, Vol. 8, pages 145-266. Academic Press, London, 1983.
[20] V. Privman. Finite-size scaling theory. In V. Privman, editor, Finite Size Scaling and Numerical Simulation of Statistical Systems, pages 1-98. World Scientific, Singapore, 1990.
[21] A. Corral, R. Garcia-Millan, and F. Font-Clos. Exact derivation of a finite-size scaling law and corrections to scaling in the geometric Galton-Watson process. PLoS ONE, 11(9):e0161586, 2016.
[22] X.-Y. Yan and P. Minnhagen. Comment on 'A scaling law beyond Zipf's law and its relation to Heaps' law'. arXiv, 1404.1461v1, 2014.
[23] X. Yan and P. Minnhagen. Randomness versus specifics for word-frequency distributions. Physica A, 444:828-837, 2016.
[24] S. K. Baek, S. Bernhardsson, and P. Minnhagen. Zipf's law unzipped. New J. Phys., 13(4):043004, 2011.
[25] J. Kwapień and S. Drożdż. Physical approach to complex systems. Phys. Rep., 515:115-226, 2012.
[26] A. Deluca and A. Corral. Fitting and goodness-of-fit test of non-truncated and truncated power-law distributions. Acta Geophys., 61:1351-1394, 2013.
[27] A. Corral, A. Deluca, and R. Ferrer-i-Cancho. A practical recipe to fit discrete power-law distributions. arXiv, 1209.1270, 2012.
[28] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in FORTRAN. Cambridge University Press, Cambridge, 2nd edition, 1992.
[29] A. Corral. Statistical tests for scaling in the inter-event times of earthquakes in California. Int. J. Mod. Phys. B, 23:5570-5582, 2009.
[30] E. Lippiello, A. Corral, M. Bottiglieri, C. Godano, and L. de Arcangelis. Scaling behavior of the earthquake intertime distribution: Influence of large shocks and time scales in the Omori law. Phys. Rev. E, 86:066119, 2012.
[31] O. Peters, A. Deluca, A. Corral, J. D. Neelin, and C. E. Holloway. Universality of rain event size distributions. J. Stat. Mech., P11030, 2010.
[32] A. Corral. Scaling in the timing of extreme events. Chaos Solit. Fract., 74:99-112, 2015.
[33] J.-P. Bouchaud and A. Georges. Anomalous diffusion in disordered media: statistical mechanisms, models and physical applications. Phys. Rep., 195:127-293, 1990.
[34] R. Baeza-Yates and G. Navarro. Block addressing indices for approximate text retrieval. J. Am. Soc. Inform. Sci., 51(1):69-82, 2000.
[35] A. Kornai. How many words are there? Glottometrics, 2:61-86, 2002.
[36] L. Lü, Z.-K. Zhang, and T. Zhou. Zipf's law leads to Heaps' law: Analyzing their relation in finite-size systems. PLoS ONE, 5(12):e14139, 2010.
[37] M. A. Serrano, A. Flammini, and F. Menczer. Modeling statistical properties of written text. PLoS ONE, 4(4):e5372, 2009.
[38] K. Christensen and N. R. Moloney. Complexity and Criticality. Imperial College Press, London, 2005.
[39] A. Deluca and A. Corral. Scale invariant events and dry spells for medium-resolution local rain data. Nonlinear Proc. Geophys., 21:555-567, 2014.
[40] F. Font-Clos and A. Corral. Log-log convexity of type-token growth in Zipf's systems. Phys. Rev. Lett., 114:238701, 2015.
[41] E. G. Altmann, J. B. Pierrehumbert, and A. E. Motter. Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS ONE, 4(11):e7678, 2009.
[42] A. Corral, R. Ferrer-i-Cancho, and A. Díaz-Guilera. Universal complex structures in written language. arXiv, 0901.2924, 2009.
[43] A. Corral. Scaling and universality in the dynamics of seismic occurrence and beyond. In A. Carpinteri and G. Lacidogna, editors, Acoustic Emission and Critical Phenomena, pages 225-244. Taylor and Francis, London, 2008.
[44] H. E. Stanley. Scaling, universality, and renormalization: Three pillars of modern critical phenomena. Rev. Mod. Phys., 71:S358-S366, 1999.
[45] A. Corral and F. Font-Clos. Criticality and self-organization in branching processes: application to natural hazards. In M. Aschwanden, editor, Self-Organized Criticality Systems, pages 183-228. Open Academic Press, Berlin, 2013.
[46] R. Garcia-Millan, F. Font-Clos, and A. Corral. Finite-size scaling of survival probability in branching processes. Phys. Rev. E, 91:042122, 2015.
FIG. 1. The total number of words N_L(≥ k) with a relative frequency greater than or equal to k/L, for varying L = L_tot/n (n = 1, 2, 5, 10, 20, 50, 100, 200, 500), with L_tot the length of the complete text. We have taken the same books as in Ref. [23], Moby-Dick (a) and Harry Potter (b), exactly reproducing panels (a) and (b) of Fig. 2 in Ref. [23], but also including some additional values of n. Deviations from the scaling law are always in the regime of very low frequencies, as expected due to discreteness effects (which are due to the fact that word tokens are discrete).

FIG. 2. (a) The probability mass function D_L(k) of the absolute frequency k, for varying subsets of length L = L_tot/n of Harry Potter, displaying a seeming change of shape. (b) Same curves, but plotting D_L(k) L V_L versus k/L, as proposed in Ref. [14] and stated here in Eq. (2). All curves collapse onto a single, length-independent scaling function g(k/L), in agreement with Eq. (2). Note the excellent data collapse: even the deviations for small k are negligible for this text. This is at odds with Eq. (1): a length-dependent exponent in D_L(k), as proposed by Yan and Minnhagen, is not compatible with the data collapse shown in the figure.

FIG. 3. Exponents of the power-law tails as a function of text length, for seven texts in English, Spanish, and French. Fits are performed as in Refs. [8, 27], with k_cut restricted to the last two decades of the distribution, and accepted for the k_cut that yields a p-value larger than 0.20 (calculated from 1000 Monte Carlo simulations). The error bars correspond to one standard deviation. Types are defined at the word-lemma-tag level; more details on text processing are given in Ref. [8].

FIG. 4. p-values of two-sample Kolmogorov-Smirnov tests for word-frequency distributions rescaled as explained in the text. One of the samples has text length L and the other L', ranging from L_tot/50 to L_tot. One of the data sets shown corresponds to testing the first fragment of length L with the same fragment of length L'; for the other data set, other fragments are chosen. The texts are the same seven used in the previous figure. p-values below 0.05 should lead to the rejection of the scaling hypothesis with a 0.95 confidence level.

FIG. 5. Relative error [N_{L_tot}(≥ k') - N_L(≥ k)]/N_L(≥ k) of the approximation to N_L(≥ k) given by the scaling of the distribution of the whole text, N_{L_tot}(≥ k'), with k' = k L_tot/L, for Harry Potter. Symbols correspond to the same values of n as in the previous figures.

FIG. 6. Linear relation between the second moment ⟨k²⟩ of the distribution of word frequencies and the ratio L²/V_L of the squared text length to the vocabulary size, for diverse texts used in the previous figures [8]. A line with a linear coefficient 0.01 is shown for comparison.

FIG. 7. Rescaled size distribution D_L(k) as a function of rescaled size k for a critical Galton-Watson branching process with a binomial offspring distribution (with a maximum of two offspring) and with different values of the maximum number of generations L (L = 25, 50, 100, 200, 400). A finite-size scaling law, of the form D_L(k) = g(k/L²)/L³, holds, but only for k > 10, roughly; the scaling function behaves asymptotically as g(z) ≃ 1/(√π z^{3/2}). Curiously, when the offspring distribution is geometric (for which the size distribution coincides with the escape time of a random walker moving between an absorbing and a reflecting boundary [21]), the deviation takes the opposite sign, see Ref. [32]. But beyond the smallest values of k, scaling is well fulfilled in both cases.