Word frequency-rank relationship in tagged texts
Andrés Chacoma a, Damián H. Zanette b,∗

a Instituto de Física Enrique Gaviola, Consejo Nacional de Investigaciones Científicas y Técnicas and Universidad Nacional de Córdoba, Ciudad Universitaria, 5000 Córdoba, Pcia. de Córdoba, Argentina

b Centro Atómico Bariloche and Instituto Balseiro, Comisión Nacional de Energía Atómica and Universidad Nacional de Cuyo, Consejo Nacional de Investigaciones Científicas y Técnicas, Av. Bustillo 9500, 8400 San Carlos de Bariloche, Pcia. de Río Negro, Argentina

∗ Corresponding author. Email addresses: [email protected] (Andrés Chacoma), [email protected] (Damián H. Zanette)
Abstract
We analyze the frequency-rank relationship in sub-vocabularies corresponding to three different grammatical classes (nouns, verbs, and others) in a collection of literary works in English, whose words have been automatically tagged according to their grammatical role. Comparing with a null hypothesis which assumes that words belonging to each class are uniformly distributed across the frequency-ranked vocabulary of the whole work, we disclose statistically significant differences between the three classes. These results point to the fact that frequency-rank relationships may reflect linguistic features associated with grammatical function.
Keywords: Frequency-rank statistics, Grammatical function, Language processing, Quantitative linguistics
1. Introduction: Frequency-rank relationship and Zipf's law
The relationship between word frequencies and ranks in written texts is, beyond any doubt, the most thoroughly studied among the statistical properties of language usage. Following G. K. Zipf's prescription [1], the rank of a word is defined as its position in a list where all the different words of a text are arranged in decreasing order by their number of occurrences, or frequency. Zipf himself pointed out the ubiquitous regularity according to which the frequency of a word is approximately proportional to the inverse of its rank, at least for large ranks and low frequencies, a systematic feature which came to be known as Zipf's law. The linguistic relevance of this regularity has been discussed in connection with the reinforcement in the use of certain words in detriment of others, as a text progresses and its semantic contents grow [2, 3]. The underlying mechanism is not unlike preferential attachment [4] or sample space reducing [5], which are well known to generate algebraic (power-law) functional relations and distributions.

Thanks to recent advances in machine learning and natural language processing techniques [6], it has lately become possible to automatically annotate digitized texts, assigning a tag to each word token which indicates its grammatical role in the corresponding sentence. This possibility opens a plethora of new lines of computational research on the statistical properties of language, now discerning between word classes with different lexical functions. In a recent contribution [7], we presented a statistical analysis of vocabulary growth –namely, the appearance of new words as a text gets longer– in a corpus of automatically tagged literary works. Scaling properties in the relationship between vocabulary size and text length are encompassed by Heaps' law [8, 9]. Encouragingly, we found that these properties are different for each grammatical class considered, which suggests that meaningful linguistic information is enclosed in such features.

Here, we extend this kind of analysis to the frequency-rank relationship in the same collection of tagged texts. It should be understood, however, that it is not our aim to assess whether Zipf's law holds within grammatical classes. Instead, we are interested in whether and how the lexical function of each class has a statistically discernible effect on the frequency-rank relationship. In the next section, we present the corpus of texts on which we work, and the null hypothesis that we use for comparison to determine the statistical significance of our results. Quantitative measures to evaluate the difference between real frequency-rank relationships and our null hypothesis are proposed and computed in Sect. 3, where we show that significant differences appear in the relationships corresponding to each grammatical class. In Sect. 4, we sketch an interpretation of the results by means of a simple model, which highlights the disparate distribution of words belonging to each grammatical class within the whole vocabulary of each text. Finally, our main results are summarized in Sect. 5.
2. Materials and methods: Tagged corpus and null hypothesis
In this paper, we study a corpus of 75 literary works in English, authored by six well-known British and North American writers of the nineteenth and twentieth centuries. The individual text lengths vary between N ≈ 800 and about 3 × 10^5 word tokens, with vocabularies of between V ≈ 300 and about 2 × 10^4 word types in size.
Digital versions of the texts were downloaded from Project Gutenberg [10] and Faded Page [11]. The list of works in the corpus, indicating author, title, publication year, length, and vocabulary size, can be found in a recent open-access publication [7]. The collection of texts used in our study is available online in a public repository [12].

As a first step, content not belonging to the original works was removed manually from each file. Then, each text was automatically processed using the Natural Language Toolkit library (NLTK) [13, 6]. The crucial stage in the process is tagging, which consists of a classification of all the word tokens into the 35 lexical categories recognized by the NLTK part-of-speech tagger. NLTK tagging uses a combination of techniques, driven by hidden Markov model training [14]. In order to increase the statistical significance of sampling, we have aggregated the 35 categories into just three broad grammatical classes. These are nouns, which include singular and plural common and proper nouns, and personal pronouns; verbs in all persons and tenses; and others, comprising the remaining categories. The result of processing each text is a sequence of word tokens, each of them with a new attribute indicating its grammatical class. For clarity in the presentation, we call each one of the three classes a tag.
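To make the processing stage concrete, the following Python sketch tokenizes a text with NLTK, tags it, and aggregates the part-of-speech tags into the three classes. The tag sets below are an assumption based on the Penn Treebank tag set reported by NLTK; the exact aggregation of the 35 categories used in our study may differ in detail, and note that NLTK's default tagger is the averaged perceptron rather than a hidden-Markov-model tagger.

```python
import nltk
from collections import Counter

# One-time downloads of the required NLTK resources:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

# Assumed mapping from Penn Treebank tags to the three broad classes.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS", "PRP"}        # nouns and personal pronouns
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}  # verbs in all persons and tenses

def tag_classes(text):
    """Return [(word, class)] with class in {'nouns', 'verbs', 'others'}."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    def to_class(tag):
        if tag in NOUN_TAGS:
            return "nouns"
        if tag in VERB_TAGS:
            return "verbs"
        return "others"
    # Keep alphabetic tokens only; lowercase for word-type counting.
    return [(word.lower(), to_class(tag)) for word, tag in tagged if word.isalpha()]

def ranked_frequencies(tagged_tokens, cls):
    """Decreasingly ordered frequencies f_1 >= f_2 >= ... of one tag's sub-vocabulary."""
    counts = Counter(word for word, c in tagged_tokens if c == cls)
    return sorted(counts.values(), reverse=True)
```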
Given the tagged vocabulary of each text, our focus is here put on the frequency-rank relationship within the sub-vocabulary corresponding to each one of the three tags. Our working hypothesis is that this relationship may shed light, from a statistical viewpoint, on the different linguistic roles of the three grammatical classes.

Let w_r be the word type of rank r in the whole vocabulary (i.e., the r-th most frequent word) and f_r its number of occurrences (i.e., its frequency) in the text. Consider now any subset of the vocabulary –namely, any sub-vocabulary– that contains w_r. If the word types in this sub-vocabulary are ranked by their frequency, the rank r' of w_r will necessarily satisfy r' ≤ r. Moreover, given two word types w_{r_1} and w_{r_2} with ranks r_1 < r_2 in the original vocabulary, their ranks in the sub-vocabulary (provided that both of them belong to it) will satisfy r'_1 < r'_2. These remarks show that the frequency-rank relation in the sub-vocabulary is shifted to lower ranks with respect to that of the whole vocabulary, and that the original relative ordering of word types is preserved.

The full curve in Fig. 1 shows the frequency-rank relationship for the whole vocabulary of one of the tagged texts in the corpus, A. Huxley's Eyeless in Gaza, for which V = 19068. Overall, the relationship is the same as found for non-tagged texts, closely following Zipf's law, f_r ∝ r^{-1}, for intermediate ranks. In fact, a linear fit in logarithmic scale over the intermediate-rank interval starting at r = 10 yields a slope very close to -1, with a correlation coefficient close to unity. The symbols in Fig. 1 correspond to the sub-vocabularies of the three tags, whose sizes are V_nouns = 9401, V_verbs = 4715, and V_others = 4952. Note that, for the first few values of r, data for others and the whole vocabulary coincide. This is due to the fact that, as expected, the most frequent words in the vocabulary (the, of, and, to, a) are function words –i.e., words without lexical meaning [15]– which are tagged as others.

Figure 1: Frequency-rank relationship (number of occurrences f_r vs. rank r) in the tagged version of A. Huxley's Eyeless in Gaza. The full curve stands for the whole vocabulary, while symbols correspond to the sub-vocabularies formed by each tag (nouns, verbs, and others). The dashed straight line has slope -1.

The most frequent word in nouns is he, with rank r = 6 in the whole vocabulary, followed by it (r = 10), she (r = 12), you (r = 21), and I (r = 22). As for verbs, the most frequent words are was (r = 8), had (r = 13), be (r = 23), were (r = 28), and said (r = 30). Consequently, for low ranks, data for nouns and verbs lie below those corresponding to others. We point out in passing that, except perhaps for the tag nouns, these frequency-rank relationships do not admit a satisfactory power-law fitting over more than a decade in the rank, which raises a doubt as to whether Zipf's law holds for the respective sub-vocabularies.

To disclose whether the frequency-rank relationship within each tag contains information of linguistic relevance, we propose to compare results obtained for the respective sub-vocabularies in each text of the corpus with those obtained for random sub-vocabularies of the same sizes. Our null hypothesis is that they cannot be statistically discerned from each other. Any significant departure, on the other hand, can possibly be interpreted in terms of the different linguistic role of each tag. In the following, we provide exact results concerning the frequency-rank relationship in random sub-vocabularies.

Consider a text with a vocabulary formed by V different words, where –as above– the number of occurrences of w_r, the word with rank r, is f_r. From this vocabulary, take a random subset of V' words. What is the probability that w_r shifts to rank r' in the sub-vocabulary? This question can be recast as a handsome problem in probability theory: From an ordered sequence of V distinct elements, V' of them are extracted at random, and their order is maintained. What is the probability p(r → r') that the element at place r in the original sequence ends at place r' in the resulting sub-sequence? Suitable combinatorial arguments make it possible to find that

p(r \to r') = \binom{V}{V'}^{-1} \binom{r-1}{r'-1} \binom{V-r}{V'-r'} .   (1)

As expected, this result is well defined for V' ≤ V and r' ≤ r only. Moreover, it is required that V' - r' ≤ V - r, because the number of words with ranks above r' in the sub-vocabulary must be at most equal to that above r in the original vocabulary. For any other combination of V, V', r and r', we have p(r → r') = 0. Note also that, if V' < V, p(r → r') is not normalized with respect to r':

\sum_{r'=1}^{r} p(r \to r') = \frac{V'}{V} < 1 .   (2)

This fact takes into account the possibility that w_r is not chosen to belong to the sub-vocabulary, in which case its contribution to p(r → r') becomes "lost". On the other hand,

\sum_{r=r'}^{V} p(r \to r') = 1 ,   (3)

because the word which ends at rank r' in the sub-vocabulary must necessarily come from a rank r between r' and V, both inclusive.

Using p(r → r'), it is possible to compute the average value of r' as a function of r –namely, the expected rank of w_r in the sub-vocabulary– which turns out to be

\langle r' \rangle = \frac{\sum_{r'=1}^{r} r' \, p(r \to r')}{\sum_{r'=1}^{r} p(r \to r')} = 1 + (r-1) \frac{V'-1}{V-1} .   (4)

This average is a linear function of r, starting at ⟨r'⟩ = 1 for r = 1 and ending at ⟨r'⟩ = V' for r = V.
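As a numerical check on Eq. (1) and its properties (2)-(4), the following sketch evaluates p(r → r') with exact integer arithmetic. It is a direct transcription of the formulas above, not code from the original study; the parameter values are arbitrary.

```python
from math import comb

def p_shift(r_new, r, V, Vp):
    """Probability p(r -> r') that the rank-r word of a V-word vocabulary
    lands at rank r' in a random sub-vocabulary of V' words, Eq. (1)."""
    if not (1 <= r_new <= r <= V and Vp - r_new <= V - r and r_new <= Vp <= V):
        return 0.0
    return comb(r - 1, r_new - 1) * comb(V - r, Vp - r_new) / comb(V, Vp)

V, Vp, r = 200, 60, 35

# Eq. (2): summing over r' gives the inclusion probability V'/V < 1.
assert abs(sum(p_shift(rp, r, V, Vp) for rp in range(1, r + 1)) - Vp / V) < 1e-12

# Eq. (3): summing over the source rank r gives 1 for any fixed r'.
rp = 10
assert abs(sum(p_shift(rp, rr, V, Vp) for rr in range(rp, V + 1)) - 1.0) < 1e-12

# Eq. (4): the conditional mean rank is linear in r.
num = sum(rp * p_shift(rp, r, V, Vp) for rp in range(1, r + 1))
den = sum(p_shift(rp, r, V, Vp) for rp in range(1, r + 1))
assert abs(num / den - (1 + (r - 1) * (Vp - 1) / (V - 1))) < 1e-9
```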
Note that ⟨r'⟩ coincides with the value of r' expected for w_r provided that the sub-vocabulary is uniformly distributed along the whole vocabulary (see also Sect. 4). The standard deviation of r' can also be exactly calculated:

\sigma_{r'} = \sqrt{ \frac{(V-V')(V'-1)(V-r)(r-1)}{(V-1)^2 (V-2)} } .   (5)

It vanishes at the two ends, r = 1 and r = V, and reaches a maximum at r = (V+1)/2. In turn, the expected frequency of the word of rank r' in the random sub-vocabulary is given by a sum of suitably weighted contributions coming from all the words with r ≥ r' in the original vocabulary, namely,

\langle f_{r'} \rangle = \sum_{r=r'}^{V} f_r \, p(r \to r') ,   (6)

where we have taken into account that, as stated above, the word which ends at rank r' in the sub-vocabulary keeps the frequency f_r it had at its original rank r. The corresponding standard deviation is σ_{f'} = \sqrt{\langle f_{r'}^2 \rangle - \langle f_{r'} \rangle^2}, with

\langle f_{r'}^2 \rangle = \sum_{r=r'}^{V} f_r^2 \, p(r \to r') .   (7)

Note that, in contrast with the expected rank and its standard deviation, Eqs. (4) and (5), ⟨f_{r'}⟩ and ⟨f_{r'}^2⟩ depend on the specific collection of word frequencies of each text. It is interesting to mention that, if f_r ∝ r^{-1} –namely, if word frequencies strictly follow Zipf's law– Eq. (6) implies that ⟨f_{r'}⟩ ∝ (r')^{-1} for sufficiently large values of V, V', and r'. Under these conditions, therefore, the random sub-vocabulary preserves Zipf's law.

Actually, in our comparison of real tagged texts with the null hypothesis, we do not analyze word frequencies directly, but rather their logarithm, g_r = log f_r. In this way –much as when Zipf frequency-rank relationships are plotted in logarithmic scale– we partially balance the largely disparate contributions of low and high ranks over the interval of variation of frequencies. For the quantities g_r, in full analogy with Eqs. (6) and (7), we define the mean values

\langle g_{r'} \rangle = \sum_{r=r'}^{V} g_r \, p(r \to r') , \qquad \langle g_{r'}^2 \rangle = \sum_{r=r'}^{V} g_r^2 \, p(r \to r') ,   (8)

and use them to compute the standard deviation σ_{g'} = \sqrt{\langle g_{r'}^2 \rangle - \langle g_{r'} \rangle^2}. For the sake of brevity, in the following we call g_r the log-frequency of the word type of rank r.
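A minimal sketch of how the null-hypothesis baselines of Eqs. (4)-(8) can be computed for a given text. It assumes the function p_shift defined above and a frequency list already ranked in decreasing order; for realistic vocabulary sizes the exact binomials become expensive, and whatever numerical shortcuts the original computation may have used are not reproduced here.

```python
from math import log10, sqrt

def null_rank_stats(r, V, Vp):
    """Expected rank and its standard deviation, Eqs. (4) and (5)."""
    mean = 1 + (r - 1) * (Vp - 1) / (V - 1)
    var = (V - Vp) * (Vp - 1) * (V - r) * (r - 1) / ((V - 1) ** 2 * (V - 2))
    return mean, sqrt(var)

def null_logfreq_stats(r_new, freqs, Vp):
    """Expected log-frequency and its standard deviation at sub-vocabulary
    rank r', Eqs. (6)-(8), with g_r = log10 f_r."""
    V = len(freqs)
    m1 = m2 = 0.0
    for r in range(r_new, V + 1):
        w = p_shift(r_new, r, V, Vp)  # defined in the previous sketch
        g = log10(freqs[r - 1])
        m1 += w * g
        m2 += w * g * g
    return m1, sqrt(max(m2 - m1 * m1, 0.0))
```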
3. Results: Rank and frequency anomalies within tags
As an illustration of the comparison between frequencies and ranks in the sub-vocabulary corresponding to a specific tag in a text and its random counterpart, in Figs. 2 and 3 we present results for nouns in M. Twain's The Prince and the Pauper. For this work, the whole vocabulary consists of V = 10869 word types, while the nouns sub-vocabulary has V_nouns = 4873.

The colored line in the main panel of Fig. 2 shows the rank r' in the nouns sub-vocabulary versus the rank r in the whole vocabulary. The dashed line stands for the linear functional dependence expected from our null hypothesis, Eq. (4). As clarified by the close-up shown in the upper-left inset, the dashed line is surrounded by a shaded band, whose width is given by the standard deviation expected for a random sub-vocabulary of the same size, Eq. (5). We see in this example that, although the relationship between r' and r in the actual text roughly follows a linear profile, deviations from the null-hypothesis prediction are statistically significant. In the inset close-up, for instance, the difference between data for the real text and the null-hypothesis average is around twice the standard deviation.

Figure 2: Main panel: Relationship between the ranks r' in the nouns sub-vocabulary and r in the whole vocabulary for M. Twain's The Prince and the Pauper. The colored line corresponds to data for the actual work, and the dashed straight line is the null-hypothesis prediction for the average ⟨r'⟩, Eq. (4). The width of the narrow band around the dashed straight line equals the predicted standard deviation. Upper-left inset: Close-up of the main panel for ranks beyond r = 2000. Lower-right inset: The rank anomaly δ_{r'}, as given by the first of Eqs. (9), computed for the same data.

The main panel of Fig. 3, in turn, shows analogous results for the log-frequency g_{r'} versus r'. In this case, data for the actual text lie systematically below the null-hypothesis prediction and, except in the large-rank tail, well outside the standard-deviation band.
Figure 3: Main panel: The log-frequency g_{r'} (defined as a decimal logarithm) versus the rank r' in the nouns sub-vocabulary of M. Twain's The Prince and the Pauper. Symbols correspond to data for the actual work, and the dashed line is the null-hypothesis prediction for the average ⟨g_{r'}⟩, Eq. (8). The width of the narrow band around the dashed line equals the predicted standard deviation. Inset: The log-frequency anomaly δ_{g'}, as given by the second of Eqs. (9), computed for the same data.

To quantify the difference between the frequency-rank relationships in the sub-vocabularies corresponding to each tag and their random counterparts, we introduce the relative anomalies for ranks and log-frequencies,

\delta_{r'} = \frac{r' - \langle r' \rangle}{\sigma_{r'}} , \qquad \delta_{g'} = \frac{g_{r'} - \langle g_{r'} \rangle}{\sigma_{g'}} ,   (9)

where averages and standard deviations are calculated for the null hypothesis, as in Sect. 2. The quantities r' and g_{r'}, on the other hand, are those corresponding to the sub-vocabularies of each actual text. Note that both δ_{r'} and δ_{g'} are to be computed for each individual tag, and that they are functions of the rank r'. We point out, moreover, that δ_{g'} is independent of the logarithm base used to define g_r.

The lower-right inset of Fig. 2 shows the relative anomaly δ_{r'} versus r' for The Prince and the Pauper. Horizontal shaded bands bounded by integer values of δ_{r'} help to appraise the difference between real data and the null hypothesis, relative to the null-hypothesis standard deviation. This difference varies between negative and positive values, up to five or six times as large as σ_{r'}. Except for the smallest ranks, δ_{r'} remains mostly positive. For these nouns, therefore, the actual rank in the sub-vocabulary is larger than expected by chance.

Analogously, the inset of Fig. 3 shows δ_{g'} versus r'. In this case, the log-frequency relative anomaly is always negative, showing that the actual values of g_{r'} are below those expected by chance, with a difference up to six times the standard deviation. Note that, in this plot, the horizontal axis is limited to the lower ranks. For larger ranks, δ_{g'} displays sharp saw-like oscillations around or close to zero. These oscillations originate in the difference between the precise position of the "steps" in the tail of the frequency-rank relationship (see Fig. 1) for real data and the null hypothesis. Since this artifact affects the words with the lowest frequencies only, in the following we limit the consideration of the average anomaly to words with more than ten occurrences, f_{r'} > 10.

In order to condense the information provided by the relative anomalies into a single quantity for each text and tag, we define the respective mean anomalies as

\Delta_{r'} = \frac{1}{V'} \sum_{r'=1}^{V'} \delta_{r'} , \qquad \Delta_{g'} = \frac{1}{\bar r'} \sum_{r'=1}^{\bar r'} \delta_{g'} .   (10)

The sum in Δ_{g'} is limited to \bar r', the maximum rank in the sub-vocabulary for which f_{r'} > 10.
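The anomaly quantifiers of Eqs. (9) and (10) then take only a few lines. This sketch assumes the helper functions defined above; its inputs are the whole-vocabulary frequency list and, for each sub-vocabulary rank r', the whole-vocabulary rank r of the corresponding word (the function name and the skipping of degenerate end points with vanishing σ_{r'} are our own choices).

```python
def mean_anomalies(sub_ranks, whole_freqs, min_count=10):
    """Mean anomalies Delta_r' and Delta_g', Eqs. (9)-(10).
    sub_ranks: for r' = 1, 2, ..., V' (decreasing-frequency order), the
    whole-vocabulary rank r of the word at sub-vocabulary rank r'.
    whole_freqs: decreasingly ordered frequencies of the whole vocabulary."""
    V, Vp = len(whole_freqs), len(sub_ranks)
    d_rank, d_logf = [], []
    for rp, r in enumerate(sub_ranks, start=1):
        mu_r, sd_r = null_rank_stats(r, V, Vp)
        if sd_r > 0:                        # skip the degenerate end points
            d_rank.append((rp - mu_r) / sd_r)
        if whole_freqs[r - 1] > min_count:  # f_{r'} > 10 cutoff for Delta_g'
            mu_g, sd_g = null_logfreq_stats(rp, whole_freqs, Vp)
            d_logf.append((log10(whole_freqs[r - 1]) - mu_g) / sd_g)
    return sum(d_rank) / len(d_rank), sum(d_logf) / len(d_logf)
```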
Figures 4 and 5 show the main results of our analysis, namely, the mean anomalies Δ_{r'} and Δ_{g'} for each text and tag, versus the size of the whole vocabulary of the text.

We see in Fig. 4 that the rank mean anomaly Δ_{r'} reveals a well-defined difference between the statistical behavior of others, on the one hand, and nouns and verbs, on the other. Specifically, for the smaller vocabularies, Δ_{r'} is predominantly positive for the former and negative for the latter. Between nouns and verbs, in turn, Δ_{r'} tends to be closer to positive values for the former. As V grows, the mean anomalies for others and nouns approach each other, while for verbs it remains negative for most texts. As a guide to the eye, we have plotted as lines the least-square linear fittings of Δ_{r'} versus V for each tag (which are seen as curves because of the logarithmic scale in the horizontal axis).

Figure 4: The rank mean anomaly Δ_{r'} for each text and tag versus the size of the whole vocabulary V. Curves stand for least-square linear fittings of Δ_{r'} vs. V.

Figure 5 shows that the difference of others with the other two tags is more conspicuous when it comes to the log-frequency mean anomaly. In fact, Δ_{g'} is always positive for others. Moreover, it now displays a non-monotonic dependence on V, with larger values for intermediate vocabulary sizes. The curve in the plot represents a third-degree polynomial fitting of Δ_{g'} versus V. For nouns and verbs, on the other hand, Δ_{g'} is almost always negative, and there is no obvious separation between the two tags. Closer inspection, however, discloses a different trend with V, with nouns mirroring the non-monotonic dependence of others, and verbs showing an overall growth of Δ_{g'} with the vocabulary size. In the plot, the curves for nouns and verbs are third- and first-order polynomial fittings, respectively.

Figure 5: The log-frequency mean anomaly Δ_{g'} for each text and tag versus the size of the whole vocabulary V. Curves are polynomial fittings of Δ_{g'} vs. V, as detailed in the main text.

Finally, in Fig. 6 we disregard the information on the vocabulary size and plot Δ_{g'} versus Δ_{r'}. Although nouns and verbs exhibit a considerable overlapping in the quadrant of negative mean anomalies, the former also occupies a sizable zone with Δ_{g'} < 0 < Δ_{r'} (cf. Figs. 2 and 3). Others, in turn, is well separated from the other two tags, as expected. Note the clear positive correlation between the two mean anomalies. The oblique straight line corresponds to the least-square linear fitting of the whole set of points, with a high Pearson's correlation coefficient.

Figure 6: The log-frequency mean anomaly Δ_{g'} versus the rank mean anomaly Δ_{r'} for each tag in the 75 works of the corpus. The straight line stands for the least-square fitting of the whole set of points.
4. Interpretation: A model sub-vocabulary
Our conjecture is that, as advanced in the discussion of Fig. 1 (Sect. 2), the disparate behavior of the frequency-rank relationship for each tag is related to mutual differences in the distribution of the corresponding sub-vocabularies over the ranked list of words of the whole text. To test this idea, we consider a stylized model consisting of a sub-vocabulary formed by V' words, whose most and least frequent words respectively have ranks r_min and r_max in the whole vocabulary (r_min < r_max). We assume moreover that the V' words are uniformly distributed between r_min and r_max, so that the word of rank r' in the sub-vocabulary has rank

r = r_{min} + \frac{r'-1}{V'-1} (r_{max} - r_{min})   (11)

in the whole vocabulary. Equivalently,

r' = 1 + (r - r_{min}) \frac{V'-1}{r_{max} - r_{min}} .   (12)

Note that this expression for r' coincides with ⟨r'⟩ as given by Eq. (4) if r_min = 1 and r_max = V.

To estimate the rank mean anomaly Δ_{r'} for this model sub-vocabulary, we first assume that, in the first of Eqs. (9), the standard deviation σ_{r'} can be replaced by an effective value σ^{eff}_{r'}, independent of r'. Within this simplifying assumption, and taking into account Eqs. (9) and (10), we have

\Delta_{r'} = \frac{1}{V' \sigma^{eff}_{r'}} \sum_{r'=1}^{V'} (r' - \langle r' \rangle) .   (13)

In the sum, the functional relation between the expected rank ⟨r'⟩ and the actual rank r' is determined by Eq. (4), with r given as a function of r' by Eq. (11). Our estimation for Δ_{r'} can be written explicitly, either by exactly calculating the sum in Eq. (13) or by approximating it by an integral. Assuming V, V' ≫ 1, we get

\Delta_{r'} = \frac{V'}{2 \sigma^{eff}_{r'}} \left( 1 - \frac{r_{min} + r_{max}}{V} \right) .   (14)

This result makes apparent that, within the model hypotheses, the sign of Δ_{r'} is straightforwardly determined by how V compares with r_min + r_max. In particular, if both r_min and r_max are sufficiently small, the rank mean anomaly is positive, and vice versa.

Figure 7: Left panel: Regions in the plane spanned by r_min and r_max where the estimate for Δ_{r'} given by Eq. (13) is either negative or positive, as indicated by labels. The shaded region below the line r_max - r_min = V' is forbidden. Right panel: As in the left panel, for Δ_{g'}, Eq. (17).

The leftmost panel of Fig. 7 helps to assess this connection more quantitatively. In the plane spanned by r_min and r_max, their values are limited to the region with 0 < r_min < r_max < V and r_max - r_min > V', indicated by the upper-left non-shaded triangle. Within this region, Δ_{r'} is positive or negative depending on whether r_min + r_max is less or larger than V, as indicated by the labels. On average, therefore, relatively small or large values of r_min and r_max respectively determine positive or negative values of Δ_{r'}.

In terms of the frequency-rank distribution of words, a negative value of the rank mean anomaly –as obtained in our corpus for nouns and verbs– can be interpreted to indicate a sub-vocabulary which is relatively shifted towards larger ranks, namely, a set of words of relatively low frequencies. On the other hand, Δ_{r'} > 0 –as obtained for others in the corpus– should correspond to a sub-vocabulary with rather high frequencies.
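The sign structure in the left panel of Fig. 7 follows directly from Eq. (14). A small sketch (function name and parameter values are ours, chosen only for illustration) evaluates the estimate in units of the unknown effective deviation:

```python
def rank_anomaly_estimate(r_min, r_max, V, Vp, sigma_eff=1.0):
    """Model estimate of the rank mean anomaly, Eq. (14), in units of
    the effective standard deviation sigma_eff."""
    if not (0 < r_min < r_max < V and r_max - r_min > Vp):
        raise ValueError("parameters outside the allowed region of Fig. 7")
    return Vp / (2 * sigma_eff) * (1 - (r_min + r_max) / V)

# A sub-vocabulary concentrated at low ranks (high frequencies) gives a
# positive anomaly, as found for the tag 'others':
print(rank_anomaly_estimate(r_min=10, r_max=5000, V=20000, Vp=4000))     # > 0
# Shifted towards large ranks (low frequencies), the anomaly is negative,
# as for 'nouns' and 'verbs':
print(rank_anomaly_estimate(r_min=4000, r_max=19000, V=20000, Vp=4000))  # < 0
```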
This interpretation provides statistical significance to our remark in connection with Fig. 1 (Sect. 2) that function words, which belong to others, are among the most frequent words in any text. Words belonging to nouns and verbs, in contrast, are more specific to the contents developed in the text, and are mostly relegated to the low-frequency range.

Along this same line of argument, we can outline an explanation for the fact that the rank mean anomalies for the three tags approach values closer to zero as the whole vocabulary increases in size, as shown in Fig. 4. In fact, as a literary work becomes longer and its vocabulary grows [7], the fraction of function words in the tag others should decrease. In sufficiently large vocabularies, words such as adjectives and adverbs, which also belong to others but are grammatically linked to the other two classes, are expected to overcome in number the relatively limited set of function words. Indeed, the number of function words in English has been estimated to be around 500 [16].

To get a similar estimate for the log-frequency mean anomaly Δ_{g'}, besides choosing a sub-vocabulary which is uniformly distributed between ranks r_min and r_max, it is necessary to advance a hypothesis for the frequency-rank relationship in the whole vocabulary from which the sub-vocabulary is selected. Quite naturally, we propose a Zipf-like relation of the form f_r = f_1 r^{-z}, with a generic Zipf exponent z. The constant f_1 gives the number of occurrences of the most frequent word. Using Eq. (11), we get the corresponding log-frequency

g_{r'} = \log f_1 - z \log \left[ r_{min} + \frac{r'-1}{V'-1} (r_{max} - r_{min}) \right]   (15)

as a function of r'. Since, from our proposal for f_r, it is not possible to give an explicit expression for the average ⟨g_{r'}⟩, we assume that it can be approximately evaluated as

\langle g_{r'} \rangle = \log f_1 - z \log \left[ 1 + \frac{r'-1}{V'-1} (V-1) \right] ,   (16)

namely, the same expression as in Eq. (15) for a sub-vocabulary which is uniformly distributed along the entire vocabulary –i.e., with r_min = 1 and r_max = V; cf. Eq. (12).

Replacing these forms of g_{r'} and ⟨g_{r'}⟩ in the second of Eqs. (10) and assuming, as for the rank mean anomaly, that the standard deviation σ_{g'} can be approximated by an r'-independent effective value σ^{eff}_{g'}, we get

\Delta_{g'} = \frac{z}{\sigma^{eff}_{g'}} \left( \log V - \frac{r_{max} \log r_{max} - r_{min} \log r_{min}}{r_{max} - r_{min}} \right) .   (17)

To get this result, we have approximated the sum in the second of Eqs. (10) by an integral, and supposed that V, V' ≫ 1 and that all words in the sub-vocabulary have frequencies above f_r = 10.
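Equation (17) can be probed in the same spirit; this companion sketch (again with illustrative, hypothetical parameter values) reproduces the sign regions shown in the right panel of Fig. 7:

```python
from math import log

def logfreq_anomaly_estimate(r_min, r_max, V, z=1.0, sigma_eff=1.0):
    """Model estimate of the log-frequency mean anomaly, Eq. (17), in units
    of the effective standard deviation sigma_eff, for Zipf exponent z."""
    boundary = (r_max * log(r_max) - r_min * log(r_min)) / (r_max - r_min)
    return z / sigma_eff * (log(V) - boundary)

# Same qualitative picture as for Eq. (14): low-rank (high-frequency)
# sub-vocabularies yield a positive anomaly, high-rank ones a negative one.
print(logfreq_anomaly_estimate(r_min=10, r_max=5000, V=20000))     # > 0
print(logfreq_anomaly_estimate(r_min=4000, r_max=19000, V=20000))  # < 0
```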
The rightmost panel in Fig. 7 shows the regions of positive and negative Δ_{g'}, as estimated by Eq. (17), on the plane spanned by r_min and r_max. Taking into account the results presented in Fig. 5, we see that the analysis of the sign of Δ_{g'} qualitatively confirms the conclusions obtained from that of Δ_{r'}. Concretely, the mostly positive values of Δ_{g'} for the tag others are a consequence of relatively small r_min and r_max. Words in this tag, therefore, accumulate in the range of large frequencies. In contrast, words in nouns and verbs tend to be distributed towards larger ranks and lower frequencies.

While the same argument advanced in connection with Δ_{r'} may explain why the log-frequency mean anomaly decreases as V grows –as observed, at least, for others and nouns– an explanation for its small values for small V is still pending. The overall non-monotonic dependence of Δ_{g'} on the vocabulary size for others and nouns remains an open problem.

5. Summary and conclusion

In this contribution, we have studied the statistics of word frequencies and ranks within grammatical classes in a corpus of 75 literary works in English. The words of each text were tagged using an automatized analyzer of natural language, and then grouped into three classes: nouns, verbs, and others. For the sub-vocabulary formed by the words of each class, in each text we have characterized the changes in rank and the frequency-rank relationship with respect to the whole vocabulary. To assess the statistical significance of the results, we carried out a comparison with a null hypothesis, where the sub-vocabularies were assumed to be randomly distributed in the frequency-ranked word list of the whole text.

Overall, the quantifiers which we have defined to measure differences with the null hypothesis have revealed significant departures for the three grammatical classes. More importantly, these departures are different for each class, which indicates that the distribution of words in the whole ranking depends on their grammatical function. This points to the fact that the frequency-rank relationships within each sub-vocabulary may contain relevant linguistic information.

In order to outline a preliminary interpretation of our results, we introduced a model sub-vocabulary formed by words evenly distributed in a defined portion of the whole list of ranked words. For this model, and within suitable approximations, we can explicitly estimate the difference with the null hypothesis, and thus compare with results for the real texts. Our conclusion is that, statistically speaking, words belonging to others have higher frequencies than those in nouns and verbs. This difference, however, tends to fade out as the whole vocabularies become richer, i.e., for longer texts. A full explanation of the features observed in our quantifiers would nevertheless require expert analysis from the side of linguistics which, we acknowledge, we are not able to provide here.

Our study complements a recent contribution where we studied the statistics of appearance of new words along a text (Heaps' law), discerning between the same three grammatical classes as here, and also comparing with a random-like null hypothesis [7].
There, we concluded that the appearance of verbs and others is, respectively, more and less retarded with respect to the expectation by chance, while nouns approximately follow the null hypothesis. According to the present results, the ordering verbs-nouns-others seems to also show up in the anomalies of the frequency-rank relationships.

Much as our recent Heaps' analysis of tagged texts, the present results on the frequency-rank relationship within grammatical classes add to the characterization of statistical regularities with a role in scientific and technological developments in natural language processing, understanding, and learning [17, 18]. In similar contexts, techniques inspired by statistical physics may contribute to the rapidly growing current interest in artificial intelligence [19].

Acknowledgements
This work was partially supported by grants from CONICET (PIP 1122015010028), FonCyT (PICT-2017-0973), SeCyT-UNC (Argentina) and MinCyT Córdoba (PID PGC 2018).
References

[1] G. K. Zipf, The Psycho-Biology of Language (Houghton Mifflin, Boston, MA, 1935).
[2] H. A. Simon, On a class of skew distribution functions, Biometrika 42, 425 (1955).
[3] D. H. Zanette, M. A. Montemurro, Dynamics of text generation with realistic Zipf's distribution, J. Quant. Ling. 12, 29 (2005).
[4] A. L. Barabási, R. Albert, Emergence of scaling in random networks, Science 286, 509 (1999).
[5] B. Corominas-Murtra, R. Hanel, S. Thurner, Sample space reducing cascading processes produce the full spectrum of scaling exponents, Sci. Rep. 7, 11223 (2017).
[6] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (O'Reilly, Sebastopol, CA, 2009).
[7] A. Chacoma, D. H. Zanette, Heaps' law and Heaps functions in tagged texts: Evidences of their linguistic relevance, R. Soc. Open Sci. 7, 200008 (2020).
[8] H. S. Heaps, Information Retrieval: Computational and Theoretical Aspects (Academic Press, New York, NY, 1978).
[9] M. Gerlach, E. G. Altmann, Stochastic model for the vocabulary growth in natural languages, Phys. Rev. X 3, 021006 (2013).
[10] Project Gutenberg, https://www.gutenberg.org
[11] Faded Page, https://www.fadedpage.com
[12] Public online repository with the collection of tagged texts used in this study.
[13] NLTK Project, Natural Language Toolkit, https://www.nltk.org
[14] X. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development (Prentice Hall, Upper Saddle River, NJ, 2001).
[15] T. Klammer, M. R. Schulz, A. Della Volpe, Analyzing English Grammar, 7th ed. (Longman, Boston, MA, 2012).
[16] D. Caplan, Neurolinguistics and Linguistic Aphasiology (Cambridge University Press, Cambridge, 1987).
[17] Y. Lu, S. Singhal, F. Strub, O. Pietquin, A. Courville, Supervised seeded iterated learning for interactive language learning, arXiv:2010.02975 (2020).
[18] M. Namazifar, A. Papangelis, G. Tur, D. Hakkani-Tür, Language model is all you need: Natural language understanding as question answering, arXiv:2011.03023 (2020).
[19] Z. Ghahramani, Probabilistic machine learning and artificial intelligence, Nature 521, 452 (2015).