Rank-frequency relation for Chinese characters
W.B. Deng,^{1,2,3} A.E. Allahverdyan,^{1,4,*} B. Li,^{5} and Q.A. Wang^{1,3}

^1 Laboratoire de Physique Statistique et Systèmes Complexes, ISMANS, 44 ave. Bartholdi, Le Mans 72000, France
^2 Complexity Science Center and Institute of Particle Physics, Hua-Zhong Normal University, Wuhan 430079, China
^3 IMMM, UMR CNRS 6283, Université du Maine, 72085 Le Mans, France
^4 Yerevan Physics Institute, Alikhanian Brothers Street 2, Yerevan 375036, Armenia
^5 Department of Chinese Literature, University of Heilongjiang, Harbin 150080, China

^* Email: [email protected]
We show that the Zipf's law for Chinese characters perfectly holds for sufficiently short texts (a few thousand different characters). The scenario of its validity is similar to that of the Zipf's law for words in short English texts. For long Chinese texts (or for mixtures of short Chinese texts), rank-frequency relations for Chinese characters display a two-layer, hierarchic structure that combines a Zipfian power-law regime for frequent characters (first layer) with an exponential-like regime for less frequent characters (second layer). For these two layers we provide different (though related) theoretical descriptions that include the range of low-frequency characters (hapax legomena). The comparative analysis of rank-frequency relations for Chinese characters versus English words illustrates the extent to which the characters play for Chinese writers the same role as the words for those writing within alphabetical systems.
PACS numbers: 89.75.Fb, 89.75.Da, 05.65.+b
I. INTRODUCTION
Rank-frequency relations provide a coarse-grained view on the structure of a text: one extracts the normalized frequencies of the different words, orders them in a non-increasing way f_1 ≥ f_2 ≥ ..., and studies the frequency f_r as a function of its rank r. One widely known aspect of this rank-frequency relation, which holds for texts written in many alphabetical languages, is the Zipf's law; see [1–4] for reviews, [5–8] for modern instances of the law, and [9] for extensive lists of references on the subject. This regularity was first discovered by Estoup [10]:

    f_r \propto r^{-\gamma} \quad {\rm with} \quad \gamma \approx 1.   (1)

The message of a power-law rank-frequency relation is that there is no single group of dominating words in a text; rather, the words hold some type of hierarchic, scale-invariant organization. This contrasts with an exponential-like form of the rank-frequency relation, which would display a dominant group of words that is representative for the text.

The simple form of the Zipf's law hides the mechanism behind it. Hence there is no consensus on the origin of the law, as witnessed by the different theories proposed to explain it [11–17]. An influential group of theories explains the law from certain general premises of the language [11–14], e.g. that the language trades off between maximizing the information transfer and minimizing the speaking-hearing effort [11], or that the language employs its words via the optimal setting of information theory [12]. The general problem of derivations from this group is that explaining the Zipf's law for the language (and verifying it for a frequency dictionary) does not yet mean explaining the law for a concrete text, where the frequency of the same word varies widely from one text to another and is far from its value in a frequency dictionary.

It was held once that the Zipf's law is not especially informative, since it is recovered by very simple stochastic models, where words are generated through random combinations of letters and the space symbol, seemingly reproducing the f_r ∝ r^{-1} shape of the law [15]. But the reproduction is elusive, since the model is based on features that are certainly unrealistic for natural languages, e.g. it predicts a huge redundancy (many words having the same frequency and length) [18]. More recent opinions reviewed in [19] indicate that the Zipf's law is informative and not reducible to any trivial statistical regularity. These opinions are confirmed by a recent derivation of the Zipf's law from the ideas of latent semantic analysis [17]. The derivation accounts for generalizations of the Zipf's law for high and low frequencies, and also describes (simultaneously with the Zipf's law) the hapax legomena effect; see Appendix A for the glossary of the linguistic terms used.

[Footnote: Hapax legomena means literally the set of words that appear in the text only once. We shall employ this term in a broader sense, as the set of words that appear a few times, so that sufficiently many words have the same frequency. The description of this set is sometimes referred to as the frequency spectrum.]

However, the Zipf's law was so far found to be absent for the rank-frequency relation of Chinese characters [20–25], which play—sociologically, psychologically and (to some extent) linguistically—the same role for Chinese readers and writers as the words do in Indo-European languages [26–28].
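To make the procedure described above concrete, here is a minimal sketch (our illustration, not taken from the paper) of extracting a rank-frequency relation; the tokenization choices, whitespace-separated words for English and single characters for Chinese, as well as the sample strings, are our own assumptions.

```python
# Minimal sketch of extracting the rank-frequency relation f_r of Eq. (1).
from collections import Counter

def rank_frequency(units):
    """Return normalized frequencies f_1 >= f_2 >= ... >= f_n."""
    counts = Counter(units)
    N = sum(counts.values())                    # total number of units in the text
    return [c / N for c in sorted(counts.values(), reverse=True)]

# English: tokenize by whitespace; Chinese: iterate over single characters.
english = "the quick brown fox jumps over the lazy dog and the fox".split()
chinese = "春眠不觉晓处处闻啼鸟夜来风雨声花落知多少"   # a short sample (a classic poem)

print(rank_frequency(english)[:3])   # top-3 word frequencies
print(rank_frequency(chinese)[:3])   # top-3 character frequencies
```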
Rank-frequency relations for Chinese characters were first studied by Zipf and coauthors, who did not find the Zipf's law [29]. They claimed to find another power law, with exponent γ = 2 [29], but this result was later shown to be incorrect [21], since it was not based on any goodness-of-fit measure. It was also proposed that the data obtained by Zipf are reasonably fit by a logarithmic function f_r = a + b ln(c + r) with constants a, b and c [21]. The result on the absence of the Zipf's law was then confirmed by other studies [22–25, 30]. All these authors agree that the proper Zipf's law is absent (more generally, a power law is absent), but they have different opinions on the (non-power-law) form of the rank-frequency relation for Chinese characters: logarithmic [21], or exponential f_r ∝ e^{−dr} (where d > 0). Hence the invalidity of the Zipf's law for Chinese characters has contributed to the ongoing debate (coming from linguistics and experimental psychology) on whether and to which extent the Chinese writing system is similar to phonological writing systems [36–38]; in particular, to which extent it is based on characters in contrast to words.

Results reported in this work amount to the following:

– The Zipf's law holds for sufficiently short (a few thousand different characters) Chinese texts written in Classic or Modern Chinese. Short texts are important, because they are building blocks for understanding long texts. For the sake of rank-frequency relations, but also more generally, one can argue that long texts are just mixtures (joinings) of smaller, thematically homogeneous pieces. This premise of our approach is fully confirmed by our results.

[Footnote: Applications of the Zipf's law to automatic keyword recognition are based on this fact [33], because keywords are located mostly in the validity range of the Zipf's law. A related set of applications of this law refers to distinguishing between artificial and natural texts, fraud detection [34], etc.; see [35] for a survey of applications in natural language processing.]

[Footnote: We stress already here that the Zipf's law holds for Chinese [25] and Japanese words [39]. This is expected and intuitively follows from the possibility of literal translation from Chinese to English, where (almost) each Chinese word is mapped to an English one (see our glossary in Appendix A for definitions of various special terms). In this sense, the validity of the Zipf's law for Chinese words is consistent with the validity of this law for English texts.]

[Footnote: The Modern Chinese texts we studied are written with simplified characters, while our Classic Chinese texts are written with traditional characters. Reforms started in mainland China in the late 1940's simplified about 2235 characters. Traditional characters are still used officially in Hong Kong and Taiwan.]

– The validity scenario of the Zipf's law for short Chinese texts is basically the same as for short English texts: the rank-frequency relation separates into three ranges. (1) The range of small ranks (more frequent characters) that contains mostly function characters; we call it the pre-Zipfian range. (2)
The (Zipfian) range of middle ranks, which contains mostly content characters. (3)
The range of rare characters, where many characters have the same small frequency (hapax legomena).

– The essential difference between Chinese characters and English words comes in for long texts, or upon mixing (joining) different short texts. When mixing different English texts, the range of ranks where the Zipf's law is valid quickly increases, roughly combining the validity ranges of the separate texts. Hence for a long text the major part of the overall frequency is carried by the Zipfian range. When mixing different Chinese texts, the validity range of the Zipf's law increases very slowly. Instead there emerges another, exponential-like regime in the rank-frequency relation that involves a much larger range of ranks. However, the Zipfian range of ranks is still (more) important, since it carries some 40% of the overall frequency. This overall frequency of the Zipfian range is approximately constant for all (numerous and semantically very different) Chinese texts we studied.

– We describe these two regimes via different (though closely related) theories that are based on the recent approach to rank-frequency relations [17]. This description includes rather precise theories for rare characters (the hapax legomena range) both for long and short Chinese texts.

This work is organized as follows. The next section gives a short introduction to Chinese characters and their differences and similarities with English words. Section III uncovers the Zipf's law for short Chinese texts and compares it with the English situation. Section IV studies the fate of the Zipf's law for long Chinese texts. We summarize in the last section. Appendix A contains the glossary of the linguistic terms used. Appendix B refers to the interference experiments distinguishing between Chinese characters and English words. Appendix C collects information on the studied Chinese texts. Appendix D lists the key-characters of one studied modern Chinese text. Appendix E recalls the Kolmogorov-Smirnov test that is employed for checking the quality of our numerical fitting.

[Footnote 5: Here and below we refer to a typical Indo-European alphabetical language as English, meaning that for the sake of the present discussion differences between various Indo-European and/or Uralic languages are not essential. Likewise, we expect that the basic features of the rank-frequency analysis of Chinese characters will apply to those languages (e.g. Japanese) where the Chinese characters are used.]
II. CHINESE CHARACTERS VERSUS ENGLISH WORDS
Here we briefly recall the main differences and similarities between Chinese characters and English words; see Footnote 5 in this context. This subject has generated several controversies ("myths", as it was put in [37]), even among expert sinologists [27, 28, 36–38, 40]. This section is not needed for presenting our results (hence it can be skipped upon first reading), but it is necessary for a deeper understanding of our results and motivations.

The main qualitative conclusion of this section is that, in contrast to English words, Chinese characters generally have more different meanings; they are more flexible and can combine with other characters to convey different specific meanings. So there are characters which appear many times in a text, but whose concrete meanings differ from one place to another.

The unit of the Chinese writing system is the character: a spatially marked pattern of strokes phonologically realized as a single syllable (please consult Appendix A for a glossary of the various linguistic terms used in the paper). Generally, each character denotes a morpheme or several different morphemes.

The Chinese writing system evolved by emphasizing the concept of the character-morpheme, to some extent blurring the concept of the multi-syllable word. In particular, spaces in the Chinese writing system are put between characters and not between multi-syllable words. Thus a given sentence can have different meanings when separated into different sequences of words [40], and parsing a string of Chinese characters into words became a non-trivial computational problem; see [43] for a recent review.

[Footnote: An immediate question is whether Chinese readers benefit from reading a character-written text where the word boundaries are indicated explicitly. For normal sentences the readers do not benefit, i.e. it does not matter whether the word boundaries are indicated explicitly or not [41]. But for difficult sentences the benefit is there [42].]

Psycholinguistic research shows that the characters are important cognitive and perceptual units for Chinese writers and readers [26–28]; e.g., Chinese characters are more directly related to their meanings than English words are to theirs [28]; see Appendix B for additional details. The explanation of this effect would be that characters (compared to English words) are perceived holistically as meaning-carrying objects, while English words are yet to be reconstructed from a sequence of their constituents (phonemes and syllables).

[Footnote: To get a fuller picture of this effect, let us denote by τ_f(E) and τ_f(C) the English and Chinese phonology activation times, respectively, while τ_m(E) and τ_m(C) stand for the respective meaning activation times. The phonology activation time is the time passed between seeing a word in English (or a character in Chinese) and pronouncing it; likewise for the meaning activation time. These quantities satisfy [28]: τ_f(E) < τ_m(C) ≃ τ_f(C) < τ_m(E).]

One-character words dominate in the following specific sense. Some 54% of modern Chinese word tokens are single-character; two-character word tokens amount to 42%; the remaining words have three or more characters [45]. For modern Chinese word types the situation is different: single-character words amount to some 10%, against 66% for two-character words [45].
Classic Chinese texts have more single-character words (tokens); the percentage varies between some 60% and 80% for texts written in different periods. The modern Chinese has ≈ […].

A minor part of the multi-character words are multi-character morphemes, i.e. their separate characters do not normally appear alone (they are fully bound). Examples are the two-character Chinese words for grape "葡萄" (pú táo), dragonfly "蜻蜓" (qīng tíng) and olive "橄榄" (gǎn lǎn). Estimates show that some 10% of all characters are fully bound [37].

A related set of examples is provided by two-character words where the separate characters do have an independent meaning, but this meaning is not directly related to the meaning of the word; e.g. "东西" (dōng xī) means thing, but literally it amounts to east-west, and "手足" (shǒu zú) means close partnership, but literally hand-foot.

The majority of the multi-character words are semantic compounds: their separate characters can stand alone and are related to the overall meaning of the word. Importantly, in most cases the separate meanings of the component characters are wider than the (relatively unique) meaning of the compound two-character word. An example of this situation is the two-character Chinese word for train "火车" (huǒ chē): its first character "火" (huǒ) has the meanings of fire, heat, popular, anger, etc., while the second character "车" (chē) has the meanings of vehicle, machine, wheeled, lathe, castle, etc.

Note that in Chinese there is a certain freedom in grouping morphemes into different combinations. Hence it is not easy to distinguish semantic compounds from lexical phrases.

At this point we shall argue that in general Chinese characters have a larger number of different meanings than English words. This statement will certainly appear controversial if it is taken without proper caution and explained without proper usage of linguistic terms (see our glossary in Appendix A); consult Footnote 12 in this context.

[Footnote: A simpler explanation would be that the characters are perceived as pictograms directly pointing to their meaning. In its literal form this explanation is not correct, since pictogram-characters are not frequent in Chinese [37, 44].]
First of all, note the difference between polysemes and homographs: polysemes are two related meanings of the same character (word); homographs are two characters (words) that are written in the same way, but whose meanings are far from each other. Now many characters are simultaneously homographs and polysemes; e.g., the character "明" (míng) means brilliant, light, clear, next, etc. Here the first three meanings are related and can be viewed as polysemes. The fourth meaning, next, is clearly different from the previous three; hence this is a homograph. Another example is the character "发" (fā or fà), which can mean hair, send out, fermentation, etc. All these three meanings are clearly different; hence we have homographs. Note the following peculiarity of the above two examples: the first example is a non-heteronym (homophonic) character, i.e. it is read in the same way irrespective of whether it means light or next. The second example is a heteronym character: it is written in the same way, but is read differently depending on its meaning. In most cases, heteronym characters—those which are written in the same way, but have different pronunciations—have at least two sufficiently different meanings. The disambiguation of their meaning is to be provided by the context of the sentence and/or the shared experience of the writer and reader.

[Footnote: Note that polysemes are defined to be related meanings of the same word, while homographs are defined to be different words. This is natural, but also to some extent conventional; e.g., one can still define homographs as far-away meanings of the same word.]

[Footnote: Note that homophony in Chinese is much larger than homography: on average a syllable has around 12–13 meanings [26]. Hence, in a sense, characters help to resolve the homophony of Chinese speech. This argument is frequently presented as an advantage of the character-based writing system, though it is not clear whether this system is here not solving the problem that was invited by its usage [44].]

Surely, English words can also be ambiguous in meaning (e.g. get means obtain, but also understand = have knowledge), but there is an essential difference. The major contribution to the meaning ambiguity in English is polysemy: one word has somewhat different, but closely related meanings. In contrast, many Chinese characters have widely different meanings, i.e. they are homographs rather than polysemes.

However, we are not aware of any quantitative comparison between homography of Chinese versus English. This may be related to the fact that it is sometimes not easy to distinguish between polysemy and homophony (see the glossary in Appendix A). Still, the above statement on Chinese characters having a larger number of different meanings can be quantitatively illustrated via the relative prevalence of heteronyms in Chinese. The amount of heteronyms in English is negligible; e.g., in the rather complete list of heteronyms presented in [46] we noted only 74 heteronyms, and only three of them had more than 2 meanings. This is a tiny amount of the overall number of English words (> […]). To compare this with the Chinese situation, we note that at least some 14% of modern Chinese and 25% of traditional characters are heteronyms, which normally have at least two widely different meanings. Within the most frequent 5700 modern characters the number of heteronyms is even larger and amounts to 22% [45].

[Footnote: Not counting those heteronyms that arise because an English word happens to coincide with a foreign special name, e.g. Nancy (the English name) and Nancy, the city in France.]
Chinese nouns are generally less abstract: whenever English creates a new word via conceptualizing an existing one, Chinese tends to explain the meaning via certain basic characters (morphemes). Several basic examples of this scenario include: length = long + short "长短" (cháng duǎn), landscape = mountains + water "山水" (shān shuǐ), adult = big + person "大人" (dà rén), population = person + mouth "人口" (rén kǒu), astronomy = heaven + script "天文" (tiān wén), universe = great + emptiness "太空" (tài kōng). English tools for making abstract words include prefixes (poly-, super-, pro-, etc.) and suffixes (-tion, -ment). These tools either do not have Chinese analogs, or their usage can generally be suppressed.

English words have inflections to indicate the tense of verbs, the number of nouns or the degree of adjectives. Chinese characters generally do not have such linguistic attributes; their role is carried by the context of the sentence(s).

To summarize this section, the differences between Chinese and English writing systems can be viewed in the context of two features: emphasizing the role of base (root) morphemes, and delegating the meaning to the context of the sentence whenever this is possible [26].
[Footnote 12: One should not conclude that on average the Chinese character has more meanings than the English word, because there is a large number of characters—between 10 and 14%, depending on the type of the dictionary employed [47]—that do not have a lexical meaning, i.e. they are either function words (grammatical meaning mainly) or characters that cannot appear alone (bound characters). If now the number of meanings of each character is estimated via the number of entries in the explanatory dictionary—which is the more or less traditional way of counting meanings, though it mixes up homography and polysemy—the average number of meanings per Chinese character appears to be around 1.8–2 [47]. This is smaller than the average number of (necessarily polysemic) meanings of an English word, which amounts to 2.3.]

[Footnote: Chinese expresses temporal ordering via context, e.g. by adding words like tomorrow or yesterday, or by aspects. The difference between tense and aspect is that the former implicitly assumes an external observer, whose reference time is compared with the time of the event described by the sentence. Aspects order events according to whether they are completed, or to which extent they are habitual. Indo-European languages tie up tense and aspect; the tie is weaker for Slavic Indo-European languages. Chinese has several aspects, including perfective, imperfective and neutral.]

[Footnote: Chinese has certain affixes, but they can be and are suppressed whenever the issue is clear from the context.]

The quantitative conclusion to be drawn from the above discussion is that Chinese characters have more different meanings; they are flexible, and they can combine with other characters to convey different specific meanings. Anticipating our results, we expect to see a group of characters which appear many times in the text, but whose concrete meanings differ in different places of the text.
III. THE ZIPF’S LAW FOR SHORT TEXTS
We studied several Chinese and English texts of different lengths and genres, written in different epochs; see Tables I, II and III. Some Chinese texts were written using modern characters, others employ traditional Chinese characters; see Tables I and II. Chinese texts are described in Appendix C. English texts are described in Table III. The texts can be classified as short (the total number N of characters or words is of order 10^4) and long (N > 10^5). They generally have different rank-frequency characteristics, so we discuss them separately.

For fitting the empiric results we employed the linear least-square method (linear fitting), but we also checked its results with other methods (the KS test, non-linear fitting and the maximum likelihood method). We start with a brief reminder of the linear fitting method.

A. Linear fitting
For each Chinese text we extract the ordered frequencies of the different characters [the number of different characters is n; the overall number of characters in the text is N]:

    \{f_r\}_{r=1}^{n}, \qquad f_1 \ge \cdots \ge f_n, \qquad \sum_{r=1}^{n} f_r = 1.   (2)

Exactly the same method is applied to English texts for studying the rank-frequency relation of words.

We fit the data \{f_r\}_{r=1}^{n} with a power law \hat f_r = c\, r^{-\gamma}. Hence we represent the data as

    \{y_r(x_r)\}_{r=1}^{n}, \qquad y_r = \ln f_r, \qquad x_r = \ln r,   (3)

and fit it to the linear form \{\hat y_r = \ln c - \gamma x_r\}_{r=1}^{n}. The two unknowns \ln c and \gamma are obtained from minimizing the sum of squared errors [linear fitting]

    SS_{\rm err} = \sum_{r=1}^{n} (y_r - \hat y_r)^2.   (4)

It is known since Gauss that this minimization produces

    -\gamma^* = \frac{\sum_{k=1}^{n}(x_k - \bar x)(y_k - \bar y)}{\sum_{k=1}^{n}(x_k - \bar x)^2}, \qquad \ln c^* = \bar y + \gamma^* \bar x,   (5)

where we defined

    \bar y \equiv \frac{1}{n}\sum_{k=1}^{n} y_k, \qquad \bar x \equiv \frac{1}{n}\sum_{k=1}^{n} x_k.   (6)
FIG. 1: (Color online) Frequency versus rank for the short modern Chinese text KLS (N = 20226, n = 2048); see Appendix C for its description. Red line: the Zipf curve f_r = 0.169 r^{−0.97}; see Table I. Arrows and red numbers indicate the validity range of the Zipf's law. Blue line: the numerical solution of (17, 18) for c = 0.169 (here r_min = 62). The step-wise behavior of f_r for r > r_max refers to the hapax legomena.

As a measure of fitting quality one can take

    \min_{c,\gamma} SS_{\rm err}(c,\gamma) = SS_{\rm err}(c^*,\gamma^*) \equiv SS^*_{\rm err}.   (7)

This is however not the only relevant quality measure. Another (more global) aspect of this quality is the squared coefficient of correlation between \{y_r\}_{r=1}^{n} and \{\hat y_r\}_{r=1}^{n} [2, 48]:

    R^2 = \frac{\big[\sum_{k=1}^{n}(y_k - \bar y)(\hat y^*_k - \bar{\hat y}^*)\big]^2}{\sum_{k=1}^{n}(y_k - \bar y)^2 \sum_{k=1}^{n}(\hat y^*_k - \bar{\hat y}^*)^2},   (8)

where

    \hat y^* = \{\hat y^*_r = \ln c^* - \gamma^* x_r\}_{r=1}^{n}, \qquad \bar{\hat y}^* \equiv \frac{1}{n}\sum_{k=1}^{n} \hat y^*_k.   (9)

For the linear fitting (5) the squared correlation coefficient is equal to the coefficient of determination,

    R^2 = \sum_{k=1}^{n}(\hat y^*_k - \bar y)^2 \Big/ \sum_{k=1}^{n}(y_k - \bar y)^2,   (10)

the amount of variation in the data explained by the fitting [2, 48]. Hence SS^*_{\rm err} → 0 and R^2 → 1 indicate a perfect fit. We minimize SS_{\rm err} over c and γ for r_min ≤ r ≤ r_max and find the maximal value of r_max − r_min for which SS^*_{\rm err} and 1 − R^2 are smaller than, respectively,
0.05 and 0.[…]. This maximal value of r_max − r_min also determines the final fitted values c^* and γ^* of c and γ, respectively; see Tables I, II, III and Fig. 1. Thus c^* and γ^* are found simultaneously with the validity range [r_min, r_max] of the law. Whenever there is no risk of confusion, we refer for simplicity to c^* and γ^* as c and γ, respectively.
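The fitting procedure can be summarized in code. The following sketch (ours; the paper used its own implementation) performs the least-squares fit (5, 6) in the logarithmic coordinates (3) and scans for the widest window [r_min, r_max] satisfying the quality criteria; the R^2 threshold and the minimal window width below are assumptions, since only the SS^*_err < 0.05 threshold survives in our source.

```python
# Sketch of the linear fitting of Eqs. (3)-(10) with a brute-force scan
# over candidate windows [r_min, r_max].
import numpy as np

def fit_window(f, r_lo, r_hi):
    """Least-squares fit (5, 6) of ln f_r = ln c - gamma ln r on ranks r_lo..r_hi."""
    r = np.arange(r_lo, r_hi + 1)
    x, y = np.log(r), np.log(f[r_lo - 1:r_hi])
    gamma = -np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    ln_c = y.mean() + gamma * x.mean()               # Eq. (5)
    ss_err = np.sum((y - (ln_c - gamma * x)) ** 2)   # Eq. (4)
    r2 = 1.0 - ss_err / np.sum((y - y.mean()) ** 2)  # Eq. (10)
    return gamma, np.exp(ln_c), ss_err, r2

def zipf_range(f, ss_max=0.05, r2_min=0.995, min_width=10):  # thresholds assumed
    """Widest admissible window and its fitted (c, gamma); brute force."""
    f = np.asarray(f, dtype=float)
    best = None
    for r_lo in range(1, len(f)):
        for r_hi in range(r_lo + min_width, len(f) + 1):
            gamma, c, ss, r2 = fit_window(f, r_lo, r_hi)
            if ss < ss_max and r2 > r2_min and \
               (best is None or r_hi - r_lo > best[1] - best[0]):
                best = (r_lo, r_hi, c, gamma)
    return best   # (r_min, r_max, c, gamma) or None
```

For real texts the quadratic scan over windows is slow; coarsening the grid of candidate boundaries gives the same answer in practice.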
B. Empiric results on the Zipf's law

Here are the results produced via the above linear fitting.
TABLE I: Parameters of the modern Chinese texts (see Appendix C for further details). N is the total number of characters in the text; n is the number of different characters. The Zipf's law f_r = c r^{−γ} holds for the ranks r_min ≤ r ≤ r_max; see Section III A. The last columns give the overall frequencies Σ_k f_k of the pre-Zipfian and Zipfian ranges.

A power rank-frequency relation with exponent γ ≈ 1 thus holds in the middle range of ranks, with prefactor c < 0.25; see Tables I, II and Fig. 1. Both for r < r_min and r > r_max the frequencies lie below the Zipf curve; see Fig. 1. Though the Zipfian range |r_max − r_min| is a few times smaller than the maximal rank n (see Tables I and II and Figs. 1 and 2), it is relevant, since it contains a sizable amount of the overall frequency: for Chinese texts (short or long) the Zipfian range carries some 40% of the overall frequency, i.e. Σ_{k=r_min}^{r_max} f_k ≃ 0.4.

In the pre-Zipfian range 1 ≤ r < r_min the overall number of function and empty characters exceeds the number of content characters. Function and empty characters serve for establishing grammatical constructions, e.g. "的" (de), "是" (shì), "了" (le), "不" (bù), "在" (zài). (We list them separately, though for our purposes they can be joined together; the main difference between them is that the empty characters are not used alone.)

[Footnote: We present that meaning of the character which is most relevant in the context of the text.]

But the majority of characters in the Zipfian range do have a specific meaning (content characters). A subset of those content characters has a meaning that is specific for the text and can serve as its key-characters; see Appendix D and Table IX for an example.

Let us take as an example the modern Chinese text KLS; see Table I (this text concerns military activities; see Appendix C). The pre-Zipfian range of this text contains […].

TABLE III: Parameters of four English texts and their mixtures: The Age of Reason (AR) by T. Paine, 1794 (the major source of British deism); The Time Machine (TM) by H. G. Wells, 1895 (a science fiction classic); Thoughts on the Funding System and its Effects (TF) by P. Ravenstone, 1824 (economics); Dream Lover (DL) by J. MacIntyre, 1987 (a romance novella). TF & TM means joining the texts TF and TM. Shown are: the total number of words N, the number of different words n, the lower r_min and upper r_max ranks of the Zipfian domain, the fitted values of c and γ, the overall frequencies of the pre-Zipfian and Zipfian ranges, and the difference d between the total frequency of the Zipfian domain obtained empirically and its value according to the Zipf's law: d = Σ_{k=r_min}^{r_max} (c k^{−γ} − f_k).

For r > r_max we meet the hapax legomena effect: characters occur only a few times in the text (i.e. f_r N = 1, 2, ... is a small integer), and many characters have the same frequency f_r [3]. The effect is not described by any smooth rank-frequency relation, including the Zipf's law. Hence for short texts the Zipf's law holds up to as high ranks as possible, in the sense that for r > r_max no smooth rank-frequency relation is possible at all.

Note that the very existence of hapax legomena is a non-trivial effect, since one can easily imagine (artificial) texts where (say) no character appears only once. The theory reviewed below allows us to explain the hapax legomena range together with the Zipf's law. It also predicts a generalization of the Zipf's law to ranks r < r_min that is more adequate (than the Zipf's law) to the empiric data; see Figs. 1 and 2.
FIG. 2: (Color online) Frequency vs. rank for the English text AR (N = 22641, n = 1706); see Table III. Red line: the Zipf curve f_r = 0.178 r^{−1.038}; arrows indicate r_min = 32 and r_max = 339. Other notations have the same meanings as in Fig. 1.

All the above results hold for relatively short English texts as well [17]; see Table III and Fig. 2. In particular, the Zipfian range of English texts also contains mainly content words, including the keywords. This is known and is routinely used in document processing [33].

We thus conclude that, as far as short texts are concerned, the Zipf's law holds for Chinese characters in the same way as it does for English words.

To check our results on fitting the empiric data for word frequencies to the Zipf's law, we carried out three alternative tests.

First, we applied the Kolmogorov-Smirnov (KS) test to decide on the fitting quality of the data with the Zipf's law (in the range [r_min, r_max]). The test was carried out both with and without transforming to the logarithmic coordinates (3), and it fully confirmed our results; see Table IV. For a detailed presentation of the KS test results see Appendix E and Table X therein.

Second, it was recently shown that even when the applicability range [r_min, r_max] of a power law is known, the linear least-square method (that we employed above) may not give accurate estimates of the exponent γ of the power law [49–51]. It was then argued that the method of Maximum Likelihood Estimation (MLE) is more reliable in this context. Hence, to show that our results are robust, we calculated γ using the MLE method, as suggested in [49–51]. We found that the difference with the linear least-square method is quite small (changes come only in the third decimal place); see Table IV.

Third, we checked whether our results on the power-law exponent γ are stable with respect to non-linear fitting schemes, which do not employ the logarithmic coordinates (3) but operate directly with the form (2). Again, non-linear fitting (carried out via routines of Mathematica 7) produces very similar results for γ; see Table IV.

One reason for such a good coincidence between our linear fitting results and the alternative tests is that we use rather strict criteria (SS^*_err < 0.05 and R^2 > 0.[…]) for determining first the Zipfian range [r_min, r_max] and then the parameters of the Zipf's law. Another reason is that in the vicinity of r_max the number of different words having the same frequency is not large (it is smaller than 10). Hence there are no problems with lack of data points or systematic biases that can plague the applicability of the least-square method for determining the exponent γ.
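As one simple illustration of such a check—not the exact protocol of Appendix E, which is not reproduced here—the KS statistic can be computed by comparing the empirical frequencies, renormalized over the Zipfian window, with the fitted curve:

```python
# Sketch: KS distance between the empirical frequencies and the fitted
# Zipf curve, both renormalized to probability vectors over [r_min, r_max].
import numpy as np

def ks_distance(f, c, gamma, r_min, r_max):
    r = np.arange(r_min, r_max + 1)
    emp = np.asarray(f[r_min - 1:r_max], dtype=float)
    fit = c * r ** (-gamma)
    F_emp = np.cumsum(emp / emp.sum())   # empirical CDF over the window
    F_fit = np.cumsum(fit / fit.sum())   # fitted CDF over the window
    return np.abs(F_emp - F_fit).max()   # the KS statistic D
```

A p-value can then be attached, e.g., by bootstrapping synthetic samples from the fitted law, in the spirit of [49–51].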
C. Theoretical description of the Zipf's law and hapax legomena

1. Assumptions of the model

A theoretical description of the Zipf's law that is specifically applicable to short English texts was recently proposed in [17]; it is reviewed below. The theory is based on the ideas of latent semantic analysis and the concept of the mental lexicon [17]. We briefly recall it to demonstrate the following points.

– The rank-frequency relations for short Chinese and English texts can be described by the same theory.

– The theory allows us to extrapolate the Zipf's law to high and low frequencies (including hapax legomena).

– It allows us to understand the bound c < 0.25 for the prefactor of the Zipf's law (since the law does not apply for all frequencies, c is not fixed by normalization).

– The theory confirms the intuitive expectation about the difference between the Zipfian and hapax legomena ranges: in the first case the probability of a word is equal to its frequency (frequent words); in the hapax legomena range, both the probability and the frequency are small and differ from each other.

– In the following section the theory is employed for describing the rank-frequency relation of Chinese characters outside of the validity range of the Zipf's law.

Our model for deriving the Zipf's law together with the description of the hapax legomena makes four assumptions (see [17] for further details). Below we refer to the units of the text as words; whenever this theory is applied to Chinese texts, characters are meant instead of words.

• The bag-of-words picture focuses on the frequencies of the words that occur in a text and neglects their mutual disposition (i.e. the syntactic structure) [52]. This is a natural assumption for a theory describing word frequencies, which are invariant with respect to an arbitrary permutation of the words in a text. The latter point was recently verified in [53].

Given n different words \{w_k\}_{k=1}^{n}, the joint probability for w_k to occur ν_k ≥ 0 times in a text T is assumed to be multinomial:

    \pi[\nu|\theta] = \frac{N!\, \theta_1^{\nu_1} \cdots \theta_n^{\nu_n}}{\nu_1! \cdots \nu_n!}, \qquad \nu = \{\nu_k\}_{k=1}^{n}, \quad \theta = \{\theta_k\}_{k=1}^{n},   (11)

where N = Σ_{k=1}^{n} ν_k is the length of the text (the overall number of words), ν_k is the number of occurrences of w_k, and θ_k is the probability of w_k. Hence according to (11) the text is regarded as a sample of word realizations drawn independently with probabilities θ_k.

The bag-of-words picture is well known in computational linguistics [52]. But for our purposes it is incomplete, because it implies that each word has the same probability in different texts. In contrast, it is well known (and routinely confirmed by rank-frequency analysis) that the same words do not occur with the same frequencies in different texts.

• To improve on this point we make θ a random vector with a text-dependent density P(θ|T) (a similar, but stronger assumption was made in [52]). With this assumption the variation of the word frequencies from one text to another is explained by the randomness of the word probabilities.

We now have three random objects: the text T, the probabilities θ and the occurrence numbers ν. Since θ was introduced to explain the relation of T with ν, it is natural to assume that the triple (T, θ, ν) forms a Markov chain: the text T influences the observed ν only via θ. Then the probability p(ν|T) of ν in a given text T reads

    p(\nu|T) = \int d\theta\, \pi[\nu|\theta]\, P(\theta|T).   (12)

This form of p(ν|T) is basic for probabilistic latent semantic analysis [54], a successful method of computational linguistics. There the density P(θ|T) of the latent variables θ is determined from data fitting. We shall deduce P(θ|T) theoretically.
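A toy simulation makes the generative picture (11, 12) concrete. In the sketch below (ours), the θ_k are drawn i.i.d. from a density ∝ (a + θ)^{−2} (of the form used below in (15)), sorted to mimic the ordering constraint, and renormalized to sum to one, a crude surrogate for the exact delta-function constraint appearing in (14) below; the text is then a multinomial sample of length N.

```python
# Toy bag-of-words generator in the spirit of Eqs. (11)-(14).
import numpy as np

rng = np.random.default_rng(0)

def sample_text(n=2000, N=20000, c=0.17):
    a = c / n
    u = rng.random(n)
    theta = a * u / (1.0 - u)           # inverse-CDF draws from p ~ (a+theta)^(-2)
    theta = np.sort(theta)[::-1]        # ordering constraint (cf. chi_T in Eq. (13))
    theta /= theta.sum()                # surrogate for the constraint sum(theta) = 1
    counts = rng.multinomial(N, theta)  # the multinomial text of Eq. (11)
    return np.sort(counts[counts > 0])[::-1] / N   # empirical rank-frequency relation

f = sample_text()
print(f[:5])   # the most frequent "words" of the simulated text
```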
TABLE IV: Comparison between different methods of estimating the exponent γ of the Zipf's law (1): LLS (linear least-square), NLS (nonlinear least-square), MLE (maximum likelihood estimation). We also present the p-value of the KS test comparing the empiric word frequencies in the range [r_min, r_max] with the Zipf's law fitted within the linear least-square method (LLS); for a more detailed presentation of the KS results see Appendix E. Recall that the p-values have to be sufficiently larger than 0.[…] for the fit to be acceptable.

Texts  γ (LLS)  γ (NLS)  γ (MLE)  p-value
TF     1.032    1.033    1.035    0.865
TM     1.041    1.036    1.039    0.682
AR     1.038    1.042    1.044    0.624
DL     1.039    1.034    1.035    0.812
AQZ    1.03     1.028    1.027    0.587
KLS    0.97     0.975    0.973    0.578
CQF    0.985    0.983    0.981    0.962
SBZ    0.972    0.967    0.973    0.796
WJZ    0.999    0.993    0.995    0.852
HLJ    1.01     1.015    1.011    0.923

• The text-conditioned density P(θ|T) is generated from a prior density P(θ) via conditioning on the ordering of w = \{w_k\}_{k=1}^{n} in T:

    P(\theta|T) = P(\theta)\, \chi_T(\theta, w) \Big/ \int d\theta'\, P(\theta')\, \chi_T(\theta', w).   (13)

Thus if the different words of T are ordered as (w_1, ..., w_n) with respect to the decreasing frequency of their occurrence in T (i.e. w_1 is more frequent than w_2), then χ_T(θ, w) = 1 if θ_1 ≥ ... ≥ θ_n, and χ_T(θ, w) = 0 otherwise.

• The a priori density P(θ) of the word probabilities in (13) can be related to the mental lexicon (store of words) of the author prior to generating a concrete text. For simplicity we assume that the probabilities θ_k are distributed identically [see [17] for a verification of this assumption] and that the dependence among them is due to Σ_{k=1}^{n} θ_k = 1 only:

    P(\theta) \propto u(\theta_1) \cdots u(\theta_n)\, \delta\Big(\sum_{k=1}^{n} \theta_k - 1\Big),   (14)

where δ(x) is the delta function and the normalization ensuring ∫_0^∞ Π_{k=1}^{n} dθ_k P(θ) = 1 is omitted.

2. Zipf's law

It remains to specify the function u(θ) in (14). Ref. [17] reviews in detail the experimentally established features of the human mental lexicon (see [55] in this context) and deduces from them that the suitable function u(θ) is

    u(f) = (n^{-1} c + f)^{-2},   (15)

where c will be related to the prefactor of the Zipf's law.

The above equations (11–15), together with the features n ≫ 1 and N ≫ n (n is the number of different words, while N is the total number of words in the text), lead to the final outcome of the theory: the probability p_r(ν|T) for the character (or word) with rank r to appear ν times in a text T (with N total characters and n different characters) [17] is

    p_r(\nu|T) = \frac{N!}{\nu!\,(N-\nu)!}\, \varphi_r^{\nu}\, (1 - \varphi_r)^{N-\nu},   (16)

where the effective probability φ_r of the character is found from two equations for the two unknowns µ and φ_r:

    \frac{r}{n} = \int_{n\varphi_r}^{\infty} d\theta\, \frac{e^{-\mu\theta}}{(c+\theta)^2} \Big/ \int_{0}^{\infty} d\theta\, \frac{e^{-\mu\theta}}{(c+\theta)^2},   (17)

    \int_{0}^{\infty} d\theta\, \frac{\theta\, e^{-\mu\theta}}{(c+\theta)^2} = \int_{0}^{\infty} d\theta\, \frac{e^{-\mu\theta}}{(c+\theta)^2},   (18)

where c is a constant that will later be shown to coincide with the prefactor of the Zipf's law.

For c ≲ 0.25, the parameter cµ determined from (18) is small and is found via integration by parts:

    \mu \simeq c^{-1}\, e^{-\gamma_E - 1/c},   (19)

where γ_E = 0.5772... is the Euler constant. For cµ → 0, (17) becomes

    \frac{r}{n} = \frac{c\, e^{-n\varphi_r \mu}}{c + n\varphi_r}.   (20)

Recall that according to (16), φ_r is the probability of the character (or, in the English situation, the word) with rank r. If φ_r is sufficiently large, φ_r N ≫ 1, the character with rank r appears in the text many times and its frequency f_r ≡ ν/N is close to its maximally probable value φ_r; see (16). Hence the frequency f_r can be obtained via the probability φ_r. This is the case in the Zipfian domain, since according to our empirical results (both for Chinese and English) n f_r ≳ 1 for r ≤ r_max, and—upon identifying φ_r = f_r—the above condition φ_r N ≫ 1 follows from N/n ≫ 1; see Tables I, II and III.

Let us return to (20). For r > r_min, n φ_r µ = n f_r µ ≪ 1; see (19) and Figs. 1 and 2. We get from (20):

    f_r = c\, (r^{-1} - n^{-1}).   (21)

This is the Zipf's law generalized by the factor n^{-1} at high ranks r; this cut-off factor ensures a faster [than r^{-1}] decay of f_r for large r.
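Eqs. (17, 18) are straightforward to solve numerically. The sketch below (ours, using SciPy; the bracketing intervals are assumptions) first finds µ from (18) for a given c, and then inverts (17) for the effective probability φ_r:

```python
# Sketch: numerical solution of Eqs. (17, 18).
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def I(c, mu, weight=lambda t: 1.0, lo=0.0):
    """Integral of weight(t) * exp(-mu t) / (c + t)^2 over [lo, inf)."""
    val, _ = quad(lambda t: weight(t) * np.exp(-mu * t) / (c + t) ** 2,
                  lo, np.inf, limit=200)
    return val

def solve_mu(c):
    """Eq. (18): equate the theta-weighted and plain integrals."""
    return brentq(lambda mu: I(c, mu, weight=lambda t: t) - I(c, mu), 1e-4, 50.0)

def phi(r, n, c, mu):
    """Eq. (17), inverted for phi_r at rank r."""
    denom = I(c, mu)
    g = lambda s: I(c, mu, lo=s) / denom - r / n
    return brentq(g, 0.0, 1e3) / n   # the lower integration limit is n*phi_r

c, n = 0.17, 2048                    # e.g. roughly the values for the text KLS
mu = solve_mu(c)                     # compare with the estimate (19)
print([phi(r, n, c, mu) for r in (10, 100, 1000)])
```

This is the construction behind the blue theory curves in Figs. 1 and 2.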
Figs. 1 and 2 show that (21) reproduces well the empirical behavior of f_r for r > r_min. Our derivation shows that c is the prefactor of the Zipf's law, and that our assumption c ≲ 0.25 made above (19) agrees with the observations; see Tables I, II and III.

For a given prefactor c and number of different characters n, (17) predicts the Zipfian range [r_min, r_max] in agreement with the empirical results; see Figs. 1 and 2.

For r < r_min it is no longer true that n f_r µ ≪ 1, although f_r N = φ_r N ≫ 1 still holds; see Figs. 1 and 2. We do not expect any better agreement between theory and observations for r < r_min, since the behavior of frequencies in this range is irregular and changes significantly from one text to another.

D. Hapax legomena

1. Hapax legomena as a consequence of the generalized Zipf's law

According to (16), the probability φ_r is small for r ≫ r_max, and hence the occurrence number ν ≡ f_r N of the character with rank r is a small integer (e.g. 1 or 2) that cannot be approximated by a continuous function of r; see Figs. 1 and 2. In particular, the reasoning after (20) on the equality between frequency and probability does not apply, although we see in Figs. 1 and 2 that (21) roughly reproduces the trend of f_r even for r > r_max.

To describe this hapax legomena range, define r_k as the rank at which ν ≡ f_r N jumps from the integer k to k+1 (hence the number of characters that appear k+1 times is r_k − r_{k+1}). Since φ_r reproduces well the trend of f_r even for r > r_max (see Fig. 1), r_k can be theoretically predicted from (21) by equating its left-hand side to k/N:

    \hat r_k = \Big[\frac{k}{Nc} + \frac{1}{n}\Big]^{-1}, \qquad k = 0, 1, 2, ...   (22)

Eq. (22) is exact for k = 0, and it agrees well with r_k for k ≥ 1; see Table V. Note that (22) involves only the parameters N, n and c that appear in the (generalized) Zipf's law.
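The prediction (22) and its empirical counterpart are one-liners. In the sketch below (ours), r_k is computed as the number of characters occurring more than k times, which reproduces the definition above and gives r̂_0 = n exactly:

```python
# Sketch: hapax-legomena ranks, Eq. (22) versus the data.
import numpy as np

def r_hat(k, N, n, c):
    """Eq. (22): predicted rank where nu_r = f_r * N jumps from k to k+1."""
    return 1.0 / (k / (N * c) + 1.0 / n)

def r_emp(counts, k):
    """Empirical r_k: the number of characters occurring more than k times."""
    return int((np.asarray(counts) > k).sum())

# relative error reported in Table V (schematically):
#   abs(r_hat(k, N, n, c) - r_emp(counts, k)) / r_emp(counts, k)
```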
2. Comparing with previous theories of hapax legomena

Several theories were proposed over the years for describing the hapax legomena range; see [56] for a review. To be precise, these theories were proposed for rare words (not for rare Chinese characters), but since the Zipf's law applies to characters, we expect these theories to be relevant. We now compare the predictions of the main theories with (22). The latter turns out to be superior.

Recall that for obtaining (22) it is necessary to employ the generalized (by the factor n^{-1}) form (21) of the Zipf's law. The correction factor is not essential in the proper Zipfian domain (where (21) is close to a pure power law), but it is crucial for obtaining a good agreement with the empiric data in the hapax legomena range; see Figs. 1 and 2. The influence of this correcting factor can be neglected for k ≫ Nc/n in (22), where we get

    \hat r_{k-1} - \hat r_k \propto k^{-1}(k-1)^{-1}   (23)

for the number of characters having frequency k/N. This relation, which is a crude particular case of (22), is sometimes called the second Zipf's law, or the Lotka's law [3, 56]. The applicability of (23) is however limited; e.g., it does not apply to the data shown in Table V.

Another approach to the frequencies of rare words was proposed in [57]; see [56] for a review. Its basic result (24) was recently recovered from a partial maximization of entropy (the random group formation approach) [58]. It makes the following prediction for the number nP(k) of characters that appear in the text k times (i.e. P(k) is a prediction for (r_{k−1} − r_k)/n):

    P(k) \propto e^{-bk}\, k^{-\gamma}, \qquad 1 \le k \le f_1 N,   (24)

where we omitted the normalization ensuring Σ_{k=1}^{f_1 N} P(k) = 1, and where the constants b > 0 and γ > 0 are determined via the total number of characters N, the number of different characters n and the maximal frequency f_1 [58]. Distributions similar to (24) (i.e. exponentially modified power laws) were derived from partial maximization of entropy […].

[Footnote: Ref. [58] presented a broad range of applications, but it did not study Chinese characters. We acknowledge one of the referees of this work, who informed us that such unpublished studies do exist: Chinese characters are within the applicability range of Ref. [58], as we confirm in Table V. The predictions of (24) for the AQZ text that we reproduce in Table V were communicated to us by the referee.]

[Footnote: Please do not mix up P(k) with the density of character probabilities that appears in (14, 15). Indeed, P(k) is defined as an empiric frequency; it has a discrete argument and applies to any collection of objects, including one generated by any probabilistic mechanism. In contrast, (14, 15) amount to a density of probabilities that has continuous argument(s) and assumes a specific generative model. P(k) in (24) does not apply outside of the hapax legomena range, where for all k we must have P(k) = 1/n.]

However, it is expected that for n ≫ 1, (24) applies for P(k) with sufficiently small values of k, i.e. within the hapax legomena range.

The results predicted by (24) are compared with our data in Table V. For clarity, we transform (24) into a prediction r̃_k for the quantities r_k:

    \tilde r_l = n\Big[1 - \sum_{k=1}^{l} P(k)\Big], \qquad l \ge 1,   (25)

i.e. we go over to the cumulative distribution function Σ_{k=1}^{l} P(k). While the predictions of (25) are in a certain agreement with the data, their accuracy is inferior (at least by an order of magnitude) as compared to the predictions of (22); see Table VI. The reason for this inferiority is that though both (24) and (22) use three input parameters, (24) is not sufficiently specific to the studied text.

Finally, let us turn to the Waring-Herdan approach, which predicts for nP(k) (the number of characters that appear in the text k times) a version of the Yule's distribution [56]:

    P(k+1) = P(k)\, \frac{a + k - 1}{x + k}, \qquad k \ge 1,   (26)

where a and x are expressed via three (the same number as in the previous two approaches) input parameters: N (the overall number of characters), n (the number of distinct characters) and nP(1) (the number of characters that appear only once) [56]:

    a = \Big(\frac{1}{1 - P(1)} - P(1) - 1\Big)^{-1}, \qquad x = a - P(1).   (27)

Eqs. (26, 27) are turned into a prediction r′_k for r_k. As Table VI shows, these predictions are also inferior as compared to those of (22), especially for k ≥ 2.

[Footnote: Eq. (26) can be viewed as a consequence of the Simon's model of text generation. This model does not apply to real texts, as was recently demonstrated in [53]. Nevertheless, (26) keeps its relevance as a convenient fitting expression; see also [56] in this context.]
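For comparison, the prediction (25) built from (24) can be sketched as follows (our code; the parameter values in the usage example are placeholders, since only γ = 1.443 for AQZ survives in our source):

```python
# Sketch: the prediction (25) built from the distribution (24).
import numpy as np

def r_tilde(l, n, N, f1, b, gamma):
    k = np.arange(1, int(f1 * N) + 1)
    P = np.exp(-b * k) * k ** (-gamma)
    P /= P.sum()                        # normalization of Eq. (24)
    return n * (1.0 - P[:l].sum())      # Eq. (25)

# hypothetical usage (all parameter values below are placeholders):
# print(r_tilde(1, n, N, f1, b=0.01, gamma=1.443))
```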
E. Summary

It is to be concluded from this section that—as far as the applicability of the Zipf's law to short texts is concerned—Chinese characters behave similarly to English words. In particular, both situations can be adequately described by the same theory; the hapax legomena range of short texts is described via the generalized Zipf's law.

We should like to stress again why the consideration of short texts is important. One can argue that—at least for the sake of rank-frequency relations—long texts are just mixtures (joinings) of shorter, thematically homogeneous pieces (this premise is fully confirmed below). Hence the task of studying rank-frequency relations separates into two parts: first understanding short texts, and then long ones. We now move to the second part.

FIG. 3: Schematic representation of the various ranges under mixing (joining) two English (upper figure) and two Chinese (lower figure) texts. P_k, Z_k and H_k denote, respectively, the pre-Zipfian, Zipfian and hapax legomena ranges of text k (k = 1, 2).

IV. RANK-FREQUENCY RELATION FOR LONG TEXTS AND MIXTURES OF TEXTS

A. Mixing English texts

When mixing (joining) different English texts, the validity range of the Zipf's law increases due to the acquisition of more high-rank words, i.e. r_min stays approximately fixed, while r_max increases; see Table III. The overall precision of the Zipf's law also increases upon mixing, as Table III shows.

[Footnote: Upon joining two texts A and B, the word frequencies get mixed: f_k(A&B) = N_A/(N_A+N_B) f_k(A) + N_B/(N_A+N_B) f_k(B), where N_A and f_k(A) are, respectively, the total number of words and the frequency of word k in text A.]

TABLE V: The hapax legomena range of Chinese characters, demonstrated for 4 short Chinese texts. The first and second texts are in Modern Chinese, the other two in Classic Chinese; see Tables I and II. r_k is defined before (22) and is found from empirical data, while r̂_k is calculated from (22); see Section III D. For each text the columns give k, r_k, r̂_k and the relative error |r̂_k − r_k|/r_k.

TABLE VI: Comparison of the relative errors of r̂_k (given by (22)), r̃_k and r′_k in approximating the data r_k; see Section III D 2. Here r̃_k is defined by (25, 24), and r′_k is the prediction made by (26, 27). For AQZ the parameters in (24) are γ = 1.443 and b = 0.[…]; for KLS, γ = 1.[…] and b = 0.[…]. The error of r̂_k is always the smallest; the only exception is the case k = 2 of the KLS text. Recall that r′_1 = r_1 by definition. Columns: k, |r̃_k − r_k|/r_k, |r′_k − r_k|/r_k, |r̂_k − r_k|/r_k.

The rough picture of the evolution of the rank-frequency relation under mixing two texts is summarized as follows; see Table III and Fig. 3 for a schematic illustration. The majority of the words in the Zipfian range of the mixture (e.g. AR & TM) come from the Zipfian ranges of the separate texts. In particular, all the words that appear in the Zipfian ranges of the separate texts appear as well in the Zipfian range of the mixture (e.g. the Zipfian ranges of AR and TM have 130 common words). There are also relatively smaller contributions to the Zipfian range of the mixture from the pre-Zipfian and hapax legomena ranges of the separate texts: note from Table III that the Zipfian range of the mixture AR & TM is 82 words larger than the union of the two separate Zipfian ranges, which is (307 + 290) minus 130 common words.
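The mixing rule from the footnote above is immediate to implement; a minimal sketch (ours):

```python
# Sketch: length-weighted mixing of the word frequencies of two texts A and B.
def mix(freq_a, N_a, freq_b, N_b):
    """freq_a, freq_b: dicts word -> normalized frequency; N_a, N_b: text lengths."""
    w_a = N_a / (N_a + N_b)
    w_b = N_b / (N_a + N_b)
    return {w: w_a * freq_a.get(w, 0.0) + w_b * freq_b.get(w, 0.0)
            for w in set(freq_a) | set(freq_b)}
```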
Some of the words that appear only in the Zipfian range of one of the separate texts appear in the hapax legomena range of the mixture; other words move from the pre-Zipfian ranges of the separate texts to the Zipfian range of the mixture. But these are relatively minor effects: the rough effect of mixing is visualized by saying that the Zipfian ranges of both texts combine into a larger Zipfian range of the mixture and acquire additional words from the other ranges of the separate texts; see Fig. 3. Note that the keywords of the separate texts stay in the Zipfian range of the mixture; e.g., after joining all four above texts, the keywords of each text are still in the Zipfian range (which now contains almost 900 words); see Table III.

The results on the behavior of the Zipf's law under mixing are new, but their overall message—the validity of the Zipf's law improves upon mixing—is expected, since it is known that the Zipf's law holds not only for short but also for long English texts and for frequency dictionaries (huge mixtures of various texts) [1–4].

FIG. 4: (Color online) Rank-frequency distribution for the mixture of CQF and SBZ (N = 54651, n = 2528); see Tables I and II and Appendix C. The scale of the frequency is chosen such that the exponential-like range of the rank-frequency relation for r > 500 is made visible. For comparison, the dashed blue line shows the curve f_r = 0.0022 e^{−0.0022 r}. For the present example the exponential-like range is essentially mixed with the hapax legomena, since for frequencies f_r with r > r_max the number of different words having this frequency is larger than 10. Recall that the Zipf's law holds for r_min < r < r_max; see Tables I and II.

B. Mixing Chinese texts

1. Stability of the Zipfian range

The situation for Chinese texts is different. Upon mixing two Chinese texts the validity range of the Zipf's law increases, but much more slowly than for English texts; see Tables I and II. The validity ranges of the separate texts do not combine (in the above sense of English texts). Though the common characters in the Zipfian ranges of the separate texts do appear in the Zipfian range of the mixture, a sizable number of those characters that appeared in the Zipfian range of only one text do not show up in the Zipfian range of the mixture.

[Footnote: As an example, let us consider in detail the mixing of the two Chinese texts SBZ and CQF; see Table II. The Zipfian ranges of CQF and SBZ contain, respectively, 306 and 319 characters. Among them 133 characters are common. The balance of the characters upon mixing is calculated as follows: 306 (from the Zipfian range of CQF) + 319 (from the Zipfian range of SBZ) − 133 (common characters) − 50 (characters from the Zipfian range of CQF that do not appear in the Zipfian range of CQF & SBZ) − 54 (characters from the Zipfian range of SBZ that do not appear in the Zipfian range of CQF & SBZ) + 27 (characters that enter the Zipfian range of CQF & SBZ from the pre-Zipfian ranges of CQF or SBZ) = 415 (characters in the Zipfian range of CQF & SBZ).]

Importantly, the overall frequency of the Zipfian domain for very different Chinese texts (mixtures, long texts) is approximately the same and amounts to ≃ 0.4. In contrast, for English texts this overall frequency grows with the number of different words in the text; see Table III. This is consistent with the fact that for English texts the Zipfian range increases upon mixing.
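The bookkeeping of the footnote above amounts to set operations on the Zipfian ranges; a sketch (ours), where each argument is assumed to be the set of characters whose ranks fall into [r_min, r_max] of the respective text:

```python
# Sketch: how the Zipfian range of a mixture relates to those of its parts.
def range_balance(zipf_a, zipf_b, zipf_mix):
    common = zipf_a & zipf_b                  # e.g. 133 for CQF & SBZ
    lost_a = zipf_a - zipf_mix                # dropped from the mixture's range
    lost_b = zipf_b - zipf_mix
    gained = zipf_mix - (zipf_a | zipf_b)     # entered from other ranges
    return {"common": len(common), "lost_a": len(lost_a),
            "lost_b": len(lost_b), "gained": len(gained),
            "mixture": len(zipf_mix)}
```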
2. Emergence of the exponential-like range

The majority of the characters that appear in the Zipfian ranges of the separate texts, but do not appear in the Zipfian range of the mixture, move to the hapax legomena range of the mixture. Then, for larger mixtures and longer texts, a new, exponential-like range of the rank-frequency relation emerges from within the hapax legomena range.

To illustrate the emergence of the exponential-like range, let us start with Fig. 4. Here only two short texts are mixed, and hence the exponential-like range cannot be reliably distinguished from the hapax legomena: for all frequencies with ranks r > r_max (i.e. for all frequencies beyond the Zipfian range), the number of different characters having exactly the same frequency is larger than 10. (We conventionally take this number as the borderline of the hapax legomena.) However, a trace of the exponential-like range is seen even within the hapax legomena; see Fig. 4.

[Footnote: Recall in this context that in the hapax legomena range many characters have the same frequency; hence no smooth rank-frequency relation is reliable there.]

For bigger mixtures or longer texts, the exponential-like range clearly differentiates from the hapax legomena. In this context, we define r_b as the borderline rank of the hapax legomena: for r > r_b, the number of characters having the frequency f_{r_b} is larger than 10. Then the exponential-like range

    f_r = a\, e^{-br} \quad {\rm with} \quad a < b   (28)

exists for the ranks r_max < r ≲ r_b (provided that r_b is sufficiently larger than r_max); see Table VII. Put differently, the exponential-like range extends from ranks just above the upper rank r_max of the Zipfian range up to the ranks where the hapax legomena start. Tables I, II, VII and Fig. 5 show that the exponential-like range is not only sizable by itself, but (for sufficiently long texts or sufficiently big mixtures) it is also bigger than the Zipfian range. This, of course, does not mean that the Zipfian range becomes less important, since, as we saw above, it carries nearly 40% of the overall frequency; see Tables I and II. The exponential-like range also carries a non-negligible frequency, though it is a few times smaller than that of the Zipfian and pre-Zipfian ranges; see Tables I, II and VII.

Finally, we stress that we considered various Chinese texts, written with simplified or traditional characters, in Modern Chinese or in different versions of Classic Chinese; see Tables I, II and Appendix C.

TABLE VII: Parameters of the exponential-like range (lower and upper ranks and the overall frequency) for a few long Chinese texts; see also Tables I and II. Here n is the number of different characters. Recall that the lowest rank of the exponential-like range is r_max + 1, where r_max is the upper rank of the Zipfian range; the highest rank of the exponential-like range is denoted by r_b; see Tables I and II.

Texts     n     Rank range   Overall frequency
PFSJ      3820  584–1437     0.12816
SHZ       4376  591–1618     0.14317
SJ        4932  536–1336     0.12887
14 texts  5018  626–1223     0.12291

FIG. 5: (Color online) Rank-frequency distribution of the long modern Chinese text PFSJ (N = 705130, n = 3820). The exponential behavior of the frequency f_r is visible for r > r_max; the full line is f_r = 0.00165 e^{−0.00165 r}. The boundary between the exponential-like range and the hapax legomena can be defined as the rank r_b at which the number of words having the same frequency f_{r_b} is equal to 10; for the present example r_b = 1437. The Zipf's law holds for ranks r_min < r < r_max, where r_min = 67 and r_max = 583; see Table I.
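Both the borderline rank r_b and the exponential fit (28) are easy to extract; a sketch (ours; the threshold of 10 follows the convention above, and tied frequencies are assumed to be represented identically, e.g. as counts/N):

```python
# Sketch: locate r_b and fit the exponential-like range (28).
import numpy as np
from collections import Counter

def borderline(f, threshold=10):
    """First rank at which more than `threshold` ranks share one frequency."""
    mult = Counter(f)                       # multiplicity of each frequency value
    for r, fr in enumerate(f, start=1):
        if mult[fr] > threshold:
            return r - 1                    # last rank before the hapax legomena
    return len(f)

def fit_exponential(f, r_lo, r_hi):
    """Least squares on ln f_r = ln a - b r over ranks r_lo..r_hi."""
    r = np.arange(r_lo, r_hi + 1)
    slope, ln_a = np.polyfit(r, np.log(f[r_lo - 1:r_hi]), 1)
    return np.exp(ln_a), -slope             # (a, b) of Eq. (28)
```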
As far as the rank-frequency relations are concerned, all these texts demonstrate the same features, showing that the peculiarities of these relations are rooted in certain very basic features of Chinese characters. They do not depend on specific details of the texts.

C. Theoretical description of the exponential-like regime

Now we search for a theoretical description of the exponential-like regime of the rank-frequency relation of Chinese characters. This description will simultaneously account for the hapax legomena range (rare words) of long Chinese texts.

We proceed with the theory outlined in sections III C 1 and III C 2. There we saw that the Zipf's law results from the choice (15) of the prior density for word probabilities $\theta$. Now we need to generalize (15). Recall that the choice of prior densities is the main problem of Bayesian statistics [61, 62]. (We stress that, for a continuous event space, this problem is not solved by the maximum entropy method; on the contrary, this method itself needs the prior density as one of its inputs [61].) One way to approach this problem is to look for a natural group in the space of events (e.g. the translation group if the event space is the real line) and then define the non-informative prior density as the one which is invariant with respect to that group [61, 62]. Our event space is the simplex $\theta \in S_n$: the set of n non-negative numbers (word probabilities) that sum to one. The natural group on the simplex is the multiplicative group [61] (in a sense this is the only group that preserves probability relations [62]), and the corresponding non-informative density is the Haldane's prior [61–63], which is given by (14) under

u(f) = (n^{-1} c + f)^{-1}, \quad c \to 0.   (29)

The formal Haldane's prior is recovered from (29) under $c \equiv 0$; a small but finite constant c is necessary for making the density normalizable.

Note that the prior density (15), which supports the Zipf's law, is far from being non-informative. This is natural, because it relates to a definite organization of the mental lexicon [17].

Now the exponential-like regime of the rank-frequency relation can be deduced from a prior density that is intermediate between the Zipfian prior (15) and the non-informative Haldane's prior (29):

u(f) = (n^{-1} c_\beta + f)^{-\beta}, \quad 1 < \beta < 2,   (30)

where $\beta$ and $c_\beta > 0$ are parameters. The frequency $f_r$ at rank r is now determined from

r/n = \int_{f_r}^{\infty} d\theta\, e^{-\mu\theta} (c_\beta + \theta)^{-\beta} \Big/ \int_{0}^{\infty} d\theta\, e^{-\mu\theta} (c_\beta + \theta)^{-\beta},   (31)

where the parameter $\mu$ is found from

\int_{0}^{\infty} d\theta\, \theta\, e^{-\mu\theta} (c_\beta + \theta)^{-\beta} = \int_{0}^{\infty} d\theta\, e^{-\mu\theta} (c_\beta + \theta)^{-\beta}.   (32)

Figs. 6 and 7 compare these analytical predictions with data. The fit is seen to be good for parameters $\beta$ and $c_\beta$ that are not very far from (29). The fact that prior densities close to the non-informative (Haldane's) prior generate an exponential-like shape of the rank-frequency relations is intuitive, since such a shape means that a relatively small group of words carries the major part of the frequency.

FIG. 6: (Color online) The rank-frequency relation f(r) for characters from the text PFSJ ($N = 705130$, $n = 3820$; $\beta = 1.2$, $c_\beta = 0.012$, $\mu = 0.0805$); see Table I. The blue line denotes the numerical solution of (31, 32) at the indicated parameters $\beta$ and $c_\beta$. The dashed blue line indicates the exponential-like regime, $f_r = 0.00165\, e^{-0.00165\, r}$.

FIG. 7: (Color online) The rank-frequency relation f(r) for characters from the text SJ ($N = 572864$, $n = 4932$; $\beta = 1.2$, $c_\beta = 0.006$, $\mu = 0.0629$); see Table II. For parameters and notations see Fig. 6. The dashed line shows $f_r = 0.00161\, e^{-0.00161\, r}$.
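For concreteness, here is a minimal numerical sketch (assuming NumPy and SciPy) of how (31, 32) can be solved: first $\mu$ is fixed by the balance condition (32), then $f_r$ is found by bisection on the tail integral in (31). Note that (32) normalizes the mean of $\theta$ to one, so the returned $f_r$ is in the rescaled units of (31, 32) rather than a raw frequency; all names below are ours, not the paper's.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def w(theta, mu, beta, c):
    # integration weight of Eqs. (31, 32): e^{-mu*theta} (c_beta + theta)^{-beta}
    return np.exp(-mu * theta) * (c + theta) ** (-beta)

def solve_mu(beta, c):
    # Eq. (32): int theta*w dtheta = int w dtheta fixes mu
    def balance(mu):
        num = quad(lambda t: t * w(t, mu, beta, c), 0, np.inf)[0]
        den = quad(lambda t: w(t, mu, beta, c), 0, np.inf)[0]
        return num - den
    return brentq(balance, 1e-3, 1e3)

def f_of_r(r, n, mu, beta, c):
    # Eq. (31): r/n equals the normalized tail weight above f_r
    Z = quad(lambda t: w(t, mu, beta, c), 0, np.inf)[0]
    def cond(f):
        return quad(lambda t: w(t, mu, beta, c), f, np.inf)[0] / Z - r / n
    return brentq(cond, 1e-9, 1e4)

# parameters of Fig. 6 (PFSJ): beta = 1.2, c_beta = 0.012, n = 3820
beta, c, n = 1.2, 0.012, 3820
mu = solve_mu(beta, c)
print(mu)                              # comes out near 0.08 for these values
print(f_of_r(600, n, mu, beta, c))    # rescaled model frequency at rank 600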
As confirmed by Figs. 6 and 7, the predictions of (31, 32) that describe the exponential-like regime are not applicable in the Zipfian range.

Importantly, (31, 32) allow us to describe the hapax legomena range of long Chinese texts. Following section III D, we equate the solution $f_r$ of (31, 32) to k/N and determine from this $r = \hat{r}_k$: the rank in the hapax legomena range where the frequency jumps from k/N to (k + 1)/N. This $\hat{r}_k$ agrees well with the empirical data for the hapax legomena range of long Chinese texts, and the agreement is better than for the approach based on (24, 25); see Table VIII and cf. with Table VI.
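A small sketch of this construction, reusing f_of_r, mu, beta and c from the previous snippet (again with our own, illustrative names): $\hat{r}_k$ is the rank at which the model frequency crosses k/N. The factor n below converts k/N into the rescaled units of (31, 32), which is our reading of the normalization in (32) and should be checked against the paper's conventions.

from scipy.optimize import brentq

def hat_r(k, N, n, mu, beta, c):
    # rank at which the model frequency of (31, 32) crosses k/N;
    # theta has unit mean by Eq. (32), so the target k/N is multiplied by n
    # (an assumption of this sketch) to convert it to theta units
    target = n * k / N
    return brentq(lambda r: f_of_r(r, n, mu, beta, c) - target, 1.0, n - 1.0)

# e.g. for PFSJ (N = 705130, n = 3820): the k = 1 crossing rank
print(hat_r(1, 705130, 3820, mu, beta, c))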
Note that the range of rare words (hapax legomena) relates to that part of the rank-frequency relation which is closest to it, i.e. for long Chinese texts it relates to the exponential-like regime and not to the Zipfian regime.

Though suggestive, the above theoretical results are still preliminary. The full theory of the rank-frequency relations for Chinese characters should explain how, specifically, non-Zipfian relations result from mixing texts that separately hold the Zipf's law.

V. DISCUSSION

A. Summary of results

As implied by the rank-frequency relation for characters, short Chinese texts demonstrate the same Zipf's law—together with its generalization to high and low frequencies (rare words)—as short English texts; see section III. Assuming that authors write mainly relatively short texts (longer texts are obtained by mixing shorter ones), this similarity implies that Chinese characters play the same role as English words; see Footnote 5 in this context. Recall from section II that a priori there are several factors which prevent a direct analogy between words and characters.

As compared to English, there are two novelties of the rank-frequency relation of Chinese characters in long texts. (1) The overall frequency of the Zipfian range (the range of middle ranks, where the Zipf's law holds) stabilizes at ≃ 0.4. This holds for all texts we studied (written in different epochs and genres, with different types of characters; see Tables I, II and Appendix C). A similar stabilization effect holds as well for the overall frequency of the pre-Zipfian range, for both English and Chinese texts; see Tables I, II and III. (2) There is a range with an exponential-like rank-frequency relation. It emerges for relatively longer texts from within the range of rare words (hapax legomena). The range of ranks where the exponential-like regime holds is larger than that of the Zipf's law, but its overall frequency is a few times smaller; see Tables I, II and VII.

Both these results are absent for English texts: there the overall frequency of the Zipfian range grows with the length of the text, while there is no exponential-like regime; the Zipfian range ends with the hapax legomena; see Table III and Fig. 2.

The results (1) and (2) imply that long Chinese texts do have a hierarchic structure: there is a group of characters that hold the Zipf's law with a nearly universal overall frequency ≃ 0.4, and yet another group of relatively less frequent characters that display the exponential-like range of the rank-frequency relation.

TABLE VIII: The hapax legomena range for two Chinese texts; see Tables I and II; cf. with Table V. We compare the relative errors for, respectively, $\hat{r}_k$ (given by (22)) and $\tilde{r}_k$ in approximating the data $r_k$; see section III D 2. Here $\tilde{r}_k$ is defined by (25, 24). For PFSJ the parameters in (24) are $\gamma = 1.302$ and $b = 0.\ldots$; for the other text, $\gamma = 1.299$ and $b = 0.\ldots$. The relative error $|\hat{r}_k - r_k|/r_k$ is always significantly smaller. Columns: k, $|\tilde{r}_k - r_k|/r_k$ and $|\hat{r}_k - r_k|/r_k$ for each of the two texts.

B. Interpretation of results

Chinese characters differ from English words, since only long Chinese texts have the above hierarchic structure. The underlying reason for the hierarchic structure is to be sought in the linguistic differences between Chinese characters and English words, as outlined in section II. In particular, the features 4, 6 and 7 discussed in section II can mean that certain homographic content characters play multiple roles in different parts of a long Chinese text. They are hence distinguished and appear in the Zipfian range of the long text with an (approximately) stable overall frequency ≃ 0.4. Since this frequency is sizable, and since the range of ranks covered by the Zipf's law is relatively small, there is a relatively large range of ranks that has to have a relatively small overall frequency; cf. Tables I, II with Table VII. It is then natural that in this range there emerges an exponential-like regime, associated with a faster (compared to a power law) decay of frequency with rank.

Recall that the stabilization holds as well for the overall frequency of the pre-Zipfian domain, both for English and Chinese texts. The explanation of this effect is similar to that given above (but to some extent it is also more transparent): the pre-Zipfian range contains mostly function characters, which are not text-specific and are used in different texts. Hence upon mixing the pre-Zipfian range has a stable overall frequency.

The above explanation for the coexistence of the Zipfian and exponential-like ranges suggests that there is a relation between the characters that appear in the Zipfian range of long texts and homography. As a preliminary support for this hypothesis, we considered the following construction. Assuming that a mixture is formed from separate texts $T_1, \ldots, T_k$, we looked at characters that appear in the Zipfian ranges of all the separate texts $T_1, \ldots, T_k$; see Table II for examples. This guarantees that these characters appear in the Zipfian range of the mixture. Then we estimated (via an explanatory dictionary of Chinese characters) the average number of different meanings for these characters. This average number turned out to be around 8, which is larger than the average number of meanings of an arbitrary Chinese character (i.e. when the average is taken over all characters in the dictionary), known to be not larger than 2 [47].

We would like to stress, however, that the above connection between the uncovered hierarchic structure and the number of meanings is preliminary, since we currently lack a reliable scheme for relating the rank-frequency relation of a given text to its semantic features; for a recent review on (lexical) meaning and its disambiguation within machine-learning algorithms see [2].

C. Conclusion

The above discussion makes clear that a theory for the rank-frequency relation of a long text, as it emerges from mixing different short texts, is currently lacking. Such a theory was not urgently needed for English texts, because there the (generalized) Zipf's law (21) describes well both long and short texts. But the example of Chinese characters clearly shows that the changes of the rank-frequency relation under mixing are essential.
Hence a theory of this effect is needed.

Finally, one of the main open questions is whether the uncovered hierarchic structure is really specific to Chinese characters, or whether it will show up as well for English texts, but at the level of the rank-frequency relation for morphemes rather than words. Factorizing English words into proper morphemes is not straightforward, but still possible.

Acknowledgments

This work is supported by the Region des Pays de la Loire under Grant 2010-11967, by the National Natural Science Foundation of China (Grant Nos. 10975057, 10635020, 10647125 and 10975062), by the Programme of Introducing Talents of Discipline to Universities under Grant No. B08033, and by the PHC CAI YUAN PEI Programme (LIU JIN OU [2010] No. 6050) under Grant No. 2010008104.

Appendix A: Glossary

• Classic Chinese (wén yán): the written language employed in China till the early 20th century. It lost its official status and was replaced by Modern Chinese after the May Fourth Movement in 1919. Modern Chinese keeps many elements of Classic Chinese. As compared to Modern Chinese, Classic Chinese has the following peculiarities. (1) It is more lapidary: texts contain almost two times fewer characters, since Classic Chinese is dominated by one-character words. (2) It lacks punctuation signs and affixes. (3) It relies more on the context. (4) It frequently omits grammatical subjects.

• Content word (character): a word that has a meaning which can be explained independently of any sentence in which the word may occur. Content words are said to have lexical meaning, rather than indicating a syntactic (grammatical) function, as a function word does.

• Empty Chinese characters—e.g. "几" (jǐ) or "已" (yǐ)—serve for establishing numerals for nouns, aspects for verbs, etc. In contrast to function characters, they cannot be used alone, i.e. they are fully bound.

• Frequency dictionary: collects words used in some activity (e.g. in exact science, or in daily newspapers, etc.) and orders those words according to their frequency of usage. Frequency dictionaries can be viewed as big mixtures of different texts.

• Function word (character): a word that has little lexical meaning or an ambiguous meaning, but instead serves to express grammatical relationships with other words within a sentence, or to specify the attitude or mood of the speaker. Such words are said to have a mainly grammatical meaning, e.g. the or and.

• Hapax legomena: literally, the set of words (characters) that appear only once in a text. We employ this term in a broader sense: the set of words (characters) that appear in a text only a few times. Operationally, this set is characterized by the fact that sufficiently many words (characters) have the same frequency. Texts written by human subjects contain a sizable set of hapax legomena. This is a non-trivial fact, since it is not difficult to imagine an artificial text (or a purposefully modified natural text) that contains no words appearing only once.

• Homophones: two different words that are pronounced in the same way, but may be written differently (and hence normally have different meanings), e.g. rain and reign.

• Homographs: two different words (or characters) that are written in the same way, but may be pronounced differently, e.g. shower [precipitation] and shower [the one who shows]. This example is a proper homograph, since the pronunciation is different.
Another example (of both homography and homonymy) is present [gift] and present [the current moment of time]. Note that the distinction between homographs and polysemes is not sharp and sometimes difficult to make. There are various boundary situations, e.g. the verb forms read [present] and read [past] may qualify as homographs, but the meanings expressed are close to each other.

• Homonyms: two words (or characters) that are simultaneously homographs and homophones, e.g. left [past of leave] and left [opposite of right]. Some homonyms started out as polysemes, but then developed a substantial difference in meaning, e.g. close [near] and close [to shut (lips)].

• Heteronyms: two homographs that are not homophones, i.e. they are written in the same way, but are pronounced differently. Normally, heteronyms have at least two sufficiently different meanings, indicated by the different pronunciations.

• Key-word (key-character): a content word (character) that characterizes a given text with its specific subject. The operational definition of a key-word (key-character) is that its frequency in the given text is much larger than in a frequency dictionary, which is obtained by mixing together many different texts.

• Language family: a set of related languages that are believed (or proved) to originate from a common ancestor language.

• Latent semantic analysis: the analysis of word frequencies and word-word correlations (hence semantic relations) in a text, based on the idea of hidden (latent) variables that control the usage of words; see [64] for reviews.

• Literal translation: word-for-word translation, with (possibly) changed word ordering, as necessary for making the grammar of the translated text more understandable. This notion contrasts with phrasal translation, where the meaning of each given phrase is translated. A literal translation can misrender idioms and/or shades of meaning, but these aspects are minor for the gross (statistical) features of a text, e.g. for the rank-frequency relation of its words.

• Logographic writing system: a writing system based on the direct coding of morphemes.

• Mental lexicon: the store of words in the long-term memory. The words from the mental lexicon are employed on-line for expressing thoughts via phrases and sentences; see [55] for detailed theories of the mental lexicon. Ref. [55] argues that in addition to the mental lexicon humans possess a mental syllabary that is activated during the phonologization of a word already extracted from the mental lexicon.

• Morpheme: the smallest part of speech or writing that has a separate (not necessarily unique) meaning, e.g. cats has two morphemes: cat and -s. The first morpheme can stand alone. The second one expresses the grammatical meaning of plurality, but it is a bound morpheme, since it can appear only together with other morphemes.

• Phoneme: a class of speech sounds that are perceived as equivalent in a given language. An alternative definition: the smallest unit that can change the meaning. Hence normally several different sounds (frequently not distinguished by native speakers) enter into a single phoneme.

• Pictogram: a graphic symbol that represents an idea or concept through pictorial resemblance to that idea or concept.

• Polysemes: related meanings of the same word, e.g. the English word get means obtain/have, but also understand (= have knowledge). Another example is that many English nouns are simultaneously verbs (e.g. advocate [person] and advocate [to defend]).
• Syllable: the minimal phonetic unit characterized by the acoustic integrity of its components (sounds), e.g. the word body is composed of two syllables, bo- and -dy, while consider consists of three syllables, con- -si- -der. In phonetic languages such as Russian the factorization of a word into syllables (syllabification) is straightforward, since the number of syllables directly relates to the number of vowels. In non-phonetic languages such as English, the correct syllabification can be complicated and not readily available to non-experts. Indo-European languages typically have many syllables, e.g. the total number of English syllables is more than 10 000. However, 80% of speech employs only 500-600 frequent syllables [55]. It was argued, based on psycholinguistic studies, that the frequent syllables are also stored in the long-term memory, analogously to the mental lexicon [55]. The total number of Chinese syllables is much smaller, around 500 (about 1200 together with tones) [47, 55]. Syllabification in Chinese is generally straightforward too, also because each character corresponds to a syllable.

• Token: a particular instance of a word; a word as it appears in some text.

• Type: the general sort of word; a word as it appears in a dictionary.

• Writing system: the process or result of recording spoken language using a system of visual marks on a surface. There are two major types of writing systems: logographic (Sumerian cuneiform, Egyptian hieroglyphs, Chinese characters) and phonographic. The latter includes syllabic writing (e.g. Japanese hiragana) and alphabetic writing (English, Russian, German); syllabic writing encodes syllables, while alphabetic writing encodes phonemes.

Appendix B: Interference experiments

The general scheme of interference experiments in psychology is as follows [28, 40]. There are two tasks, the main one and the auxiliary one. Each task is defined via specific instructions. The subjects are asked to carry out the main task while trying to ignore the auxiliary task. The performance times for carrying out the main task in the presence of the auxiliary one are then compared with the performance times of the main task when the auxiliary task is absent. Interference means that the auxiliary task impedes the main one.

There is a rough qualitative regularity noted in many experiments: interference decreases upon increasing the complexity of the main task or upon decreasing the complexity of the auxiliary task.

The best-known example of an interference experiment is the Stroop effect, where the main task is to name the color of words. The auxiliary task is not to pay attention to the meaning of those words. The experiment is designed such that there is an incongruency between the semantic meaning of the word and its color, e.g. the word red is written in black. As compared to the situation when the incongruency is absent, i.e. the word red is written in red, the reaction time of performing the main task is sizably larger. This is the essence of the Stroop effect: the semantic meaning interferes with the color perception. It appears that the Stroop effect is larger for Chinese characters than for English words; see [28] for a review. This is one (but not the only) way to show that getting to the meaning of a Chinese character is faster than getting to the meaning of an English word.

Another known interference phenomenon is the word-inferiority versus word-superiority effect.
In English these effects amount to the following [65].

If English-speaking subjects are asked to trace out (and count) a specific letter in a text, they make fewer errors when the text is meaningless, i.e. when it consists of meaningless strings of letters [27, 66]. This is related to the fact that English words are recognized and stored as wholes. Hence the recognition of words—and moving from one letter to another—interferes with the task of identifying the letter in a single word, and the English-speaking subjects make more errors when tracing out a letter in a meaningful text. This is the word-inferiority effect.

In contrast, if English-speaking subjects are presented with a single word for a short amount of time, and are then asked about the letters of this word, their answers are (statistically) more correct if the word is meaningful (i.e. it is a real word, not a meaningless sequence of symbols). This word-superiority effect is understood by noting that a single word is recalled and/or remembered better due to its meaning.

In contrast to this, Chinese-speaking adults display the word-superiority effect where the naive analogy with English would suggest word inferiority: they make fewer errors in tracing out a given character in a string of meaningful characters than in tracing it out in a list of meaningless pseudo-characters [27].

A possible interpretation of this effect is that, on one hand, the definition of Chinese words and their boundaries is somewhat fuzzy, so that the analogue of the English word-inferiority effect is not effective. On the other hand, the Chinese sentence is perceived as a whole, inviting analogies with the English word-superiority effect. Note that when Chinese subjects are asked to trace out a specific stroke within a character, one expectedly (and in full analogy with the English situation) finds that it is easier for them to trace out the stroke in a meaningless pseudo-character than in a meaningful character [27].

Appendix C: A list of the studied texts

1) Two short modern Chinese texts:

- 昆仑殇, Kūn Lún Shāng (KLS) by Shu Ming Bi, 1987 (the total number of characters N = 20226, the number of different characters n = 2047). The text is about the arduous military training in the troops of the Kun Lun mountain.

- 阿Q正传, Ah Q Zhèng Zhuàn (AQZ) by Xun Lu, 1922 (N = 18153, n = 1553). The story traces the "adventures" of a hypocrite and conformist called Ah Q, who is famous for what he presents as "spiritual victories".

2) Two long modern Chinese texts:

- 平凡的世界, Píng Fán de Shì Jiè (PFSJ) by Yao Lu, 1986 (N = 705130, n = 3820). The novel depicts the stories of many ordinary people, including labor and love, setbacks and pursuit, pain and joy, daily life and huge social conflicts.

- 水浒传, Shuǐ Hǔ Zhuàn (SHZ) by Nai An Shi, 14th century (N = 704936, n = 4376). The story tells how a group of 108 outlaws gathered at Mount Liang formed a sizable army before they were eventually granted amnesty by the government and sent on campaigns to resist foreign invaders and suppress rebel forces.

3) Four short classic Chinese texts:

- 春秋繁露, Chūn Qiū Fán Lù (CQF), by Zhong Shu Dong, 179-104 BC (Vol. …–Vol. …, N = 30017, n = 1661). A commentary on the Confucian thought and teachings.

- 僧宝传, Sēng Bǎo Zhuàn (SBZ), by Hong Hui, 1124 (Vol. …–Vol. …, N = 24634, n = 1959). A commentary on the Buddhist thought and teachings. Biographies of great Buddhist masters.

- 武经总要, Wǔ Jīng Zǒng Yào (WJZ), by Gong Liang Zeng and Du Ding, 1040-1044 (Vol. …–Vol. …, N = 26330, n = 1708).
A Chinese military compendium, covering a wide range of subjects, from naval warships to different types of catapults.

- 虎玲经, Hǔ Líng Jīng (HLJ), by Dong Xu, 1004 (Vol. …–Vol. …, N = 26559, n = 1837). Reviews various military strategies and relates them to factors of geography and climate.

4) A long classic Chinese text:

- 史记, Shǐ Jì (SJ), by Qian Sima, 109 to 91 BC (N = 572864, n = 4932). Contains imperial biographies, tables, treatises, and biographies of feudal houses and eminent persons.

Appendix D: Key-characters of the modern Chinese text KLS

Here is the list of the key-characters in the pre-Zipfian and Zipfian ranges (Table IX) of the modern Chinese text 昆仑殇, Kūn Lún Shāng (KLS), written by Shu-Ming Bi in 1987. The text is about the arduous military training in the troops of the Kun Lun mountain.

TABLE IX: Key-characters of the modern Chinese text 昆仑殇, Kūn Lún Shāng (KLS).

No.  Rank  Character  Pinyin  English         Frequency
1     14    号         hào     horn            157
2     32    军         jūn     army             86
3     44    兵         bīng    soldier          67
4    113    队         duì     troop            38
5    118    令         lìng    command          37
6    123    部         bù      troop            36
7    152    战         zhàn    fight/war        28
8    156    命         mìng    command          28
9    180    防         fáng    protect          24
10   213    血         xuè     blood            20
11   216    立         lì      stand straight   20
12   224    功         gōng    honor            19
13   225    枪         qiāng   gun              19
14   252    官         guān    officer          16
15   295    锅         guō     pan              14
16   299    保         bǎo     protect          14
17   300    卫         wèi     protect          13
18   352    营         yíng    camp             11
19   355    谋         móu     strategy         11
20   360    烧         shāo    burn             11
21   394    烈         liè     martyr           10
22   407    团         tuán    regiment         10

Appendix E: Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test (KS test) [67, 68] is used to determine whether a data sample agrees with a reference probability distribution. The basic idea of the KS test is as follows.

We need to determine whether a given set $X_1, X_2, \ldots, X_n$ is generated by i.i.d. sampling of a random variable with cumulative probability distribution F(x) (the null hypothesis). To this end we calculate the empirical cumulative distribution function (CDF) $F_n(x)$ for $X_1, X_2, \ldots, X_n$:

F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{X_i \le x},   (33)

where $I_{X_i \le x}$ equals 1 if $X_i \le x$ and 0 otherwise. Next we define

D_n = \sup_x |F_n(x) - F(x)|.   (34)

The advantage of using $D_n$ (against other measures of distance between $F_n(x)$ and F(x)) is that if the null hypothesis is true, the probability distribution of $D_n$ does not depend on F(x). In that case it was shown that for $n \to \infty$, the cumulative probability distribution of $\sqrt{n} D_n$ converges to

P(\sqrt{n} D_n \le x) \equiv f(x) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 x^2}.   (35)

For not rejecting the null hypothesis we need the observed value of $\sqrt{n} D_n$ to be sufficiently small. To quantify that smallness we take a parameter (significance level) $\alpha$ ($0 < \alpha < 1$) and define $\kappa_\alpha$ as the unique solution of

f(\kappa_\alpha) = 1 - \alpha.   (36)

Now the null hypothesis is not rejected provided that

\sqrt{n} D^*_n < \kappa_\alpha,   (37)

where $D^*_n$ is the observed (calculated) value of $D_n$. Condition (37) ensures that if the null hypothesis is true, the probability to reject it is bounded from above by $\alpha$. Hence in practice $\alpha$ is taken sufficiently small, e.g. $\alpha = 0.05$ or $\alpha = 0.1$. To quantify the goodness of the null hypothesis one calculates the p-value p: the maximal value of $\alpha$ for which (37) still holds. For the hypothesis to be reliable one needs p not to be very small. As an empirical criterion of reliability one frequently takes $p > 0.1$.
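The following Python sketch implements (33)-(35) directly (assuming NumPy and SciPy; scipy.stats.kstwobign tabulates the limiting distribution f(x) of Eq. (35)). It is a minimal illustration of the test itself, not the fitting pipeline behind Table X.

import numpy as np
from scipy.stats import kstwobign

def ks_test(sample, cdf):
    # D_n of Eq. (34) for the empirical CDF of Eq. (33), and the p-value
    # from the limiting Kolmogorov distribution of Eq. (35)
    x = np.sort(np.asarray(sample))
    n = len(x)
    F = cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - F)   # supremum from above
    d_minus = np.max(F - np.arange(0, n) / n)      # supremum from below
    d = max(d_plus, d_minus)
    p = kstwobign.sf(np.sqrt(n) * d)               # p = 1 - f(sqrt(n) * D_n)
    return d, p

# toy usage: a uniform sample against the uniform CDF on [0, 1]
rng = np.random.default_rng(0)
print(ks_test(rng.uniform(size=200), lambda x: x))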
The frequencies $f_r$ in the Zipfian range $[r_{\min}, r_{\max}]$ are fitted to the power law, and then also to the theoretical prediction described in section III C. With the null hypothesis that the empirical data follow the numerical fittings and/or the theoretical results, we calculated the maximum differences (test statistics) D and the corresponding p-values of the KS tests. From Table X one sees that all the test statistics D are quite small, while the p-values are much larger than 0.1. We conclude that, from the viewpoint of the KS test, the numerical fittings and theoretical results characterize the empirical data in the Zipfian range reasonably well.

TABLE X: Kolmogorov-Smirnov test (KS test) for the fitting quality of our results (the texts are defined in Tables I and II). In the KS test, D and p denote the maximum difference (test statistic) and the p-value, respectively. $D_1$ and $p_1$ are calculated from the KS test between the empirical data and the numerical fitting, $D_2$ and $p_2$ between the empirical data and the theoretical result, $D_3$ and $p_3$ between the numerical fitting and the theoretical result; see section III A. Note that, for making the testing even more rigorous, the presented KS characteristics are obtained in the original coordinates (2); similar results are obtained in the logarithmic coordinates (3) that are employed for the linear fitting.

Texts   D_1      p_1     D_2      p_2     D_3      p_3
TF      0.0418   0.865   0.0365   0.939   0.0381   0.912
TM      0.0529   0.682   0.0562   0.593   0.0581   0.568
AR      0.0564   0.624   0.0469   0.783   0.0443   0.825
DL      0.0451   0.812   0.0421   0.865   0.0472   0.761
AQZ     0.0586   0.587   0.0565   0.623   0.0601   0.564
KLS     0.0592   0.578   0.0641   0.496   0.0626   0.521
CQF     0.0341   0.962   0.0415   0.863   0.0421   0.857
SBZ     0.0461   0.796   0.0558   0.635   0.0616   0.538
WJZ     0.0427   0.852   0.0475   0.753   0.0524   0.691
HLJ     0.0375   0.923   0.0412   0.875   0.0425   0.862

[1] R.E. Wyllys, Library Trends, , 53 (1981).
[2] C.D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing (MIT Press, 1999).
[3] H. Baayen, Word Frequency Distributions (Kluwer Academic Publishers, 2001).
[4] W.T. Li, Glottometrics, , 14 (2002).
[5] N. Hatzigeorgiu, G. Mikros, and G. Carayannis, Journal of Quantitative Linguistics, , 175 (2001).
[6] B.D. Jayaram and M.N. Vidya, Journal of Quantitative Linguistics, , 293 (2008).
[7] L. Lü, Z.K. Zhang and T. Zhou, PLoS ONE, (12), e14139 (2010).
[8] J. Baixeries, B. Elvevag and R. Ferrer-i-Cancho, PLoS ONE, (3), e53227 (2013).
[9] http://en.wikipedia.org/wiki/Zipf's law ; http://ccl.pku.edu.cn/doubtfire/NLP/Statistical Approach/Zip law/references%20on%20zipf%27s%20law.htm
[10] J.B. Estoup, Gammes Sténographiques (Institut Sténographique de France, Paris, 1916).
[11] R. Ferrer-i-Cancho and R. Solé, PNAS, , 788 (2003); M. Prokopenko et al., JSTAT, P11025 (2010).
[12] B. Mandelbrot, An information theory of the statistical structure of language, in Communication Theory, ed. by W. Jackson (Butterworths, London, 1953); B. Mandelbrot, Fractal Geometry of Nature (W.H. Freeman, New York, 1983).
[13] B. Corominas-Murtra et al., Phys. Rev. E, , 036115 (2011).
[14] D. Manin, Cognitive Science, , 1075 (2008).
[15] G.A. Miller, Am. J. Psyc., , 311 (1957); W.T. Li, IEEE Trans. Inform. Theory, , 1842 (1992).
[16] M.V. Arapov and Yu.A. Shrejder, in Semiotics and Informatics, v. 10, p. 74 (VINITI, Moscow, 1978); I. Kanter and D.A. Kessler, Phys. Rev. Lett., , 4559 (1995); B.M. Hill, J. Am. Stat. Ass., , 1017 (1974); G. Troll and P. beim Graben, Phys. Rev. E, , 1347 (1998); A. Czirok et al., ibid., , 6371 (1996); K.E. Kechedzhi et al., ibid. (2005).
[17] A.E. Allahverdyan, Weibing Deng and Q.A. Wang, Phys. Rev. E, , 062804 (2013).
[18] D. Howes, Am. J. Psyc., , 269 (1968).
[19] R. Ferrer-i-Cancho and B. Elvevag, PLoS ONE, , 9411 (2010).
[20] K.H. Zhao, Am. J. Phys., , 449 (1990).
[21] R. Rousseau and Q. Zhang, Scientometrics, , 201 (1992).
[22] D.H. Wang et al., Physica A, , 545 (2005).
[23] S. Shtrikman, Journal of Information Science, , 142 (1994).
[24] Le Quan Ha et al., Extension of Zipf's Law to Words and Phrases, in Proceedings of the 19th International Conference on Computational Linguistics, pp. 1-6 (2002).
[25] Q. Chen, J. Guo and Y. Liu, Journal of Quantitative Linguistics, , 232 (2012).
[26] D. Aaronson and S. Ferres, J. Memory and Language, , 136 (1986).
[27] H.C. Chen, Reading comprehension in Chinese, in H.C. Chen and O.J.L. Tzeng (eds.), Language Processing in Chinese, pp. 175-205 (Elsevier, Amsterdam, 1992).
[28] R. Hoosain, Speed of getting at the phonology and meaning of Chinese words, in Cognitive Neuroscience Studies of the Chinese Language, H.S.R. Kao, C.K. Leong and D.G. Gao (eds.) (Hong Kong University Press, Hong Kong, 2002).
[29] G.K. Zipf, Selected Studies of the Principle of Relative Frequency in Language (Harvard University Press, Cambridge MA, 1932).
[30] L. Lü, Z.K. Zhang and T. Zhou, Sci. Rep., , 1082 (2013).
[31] C.K. Hu and W.C. Kuo, Universality and Scaling in the Statistical Data of Literary Works, POLA Forever, 115-139 (2005).
[32] J. Elliott et al., Language identification in unknown signals, in Proceedings of the 18th Conference on Computational Linguistics, pp. 1021-1025 (2000); J. Elliott and E. Atwell, Journal of the British Interplanetary Society, , 13 (2000).
[33] H.P. Luhn, IBM J. Res. Devel., , 159 (1958).
[34] S.M. Huang et al., Decision Support Systems, , 70 (2008).
[35] D.M.W. Powers, Applications and explanations of Zipf's law, in D.M.W. Powers (ed.), New Methods in Language Processing and Computational Natural Language Learning (NEMLAP3/CONLL98), ACL, 1998, pp. 151-160.
[36] G. Sampson, Linguistics, , 117 (1994).
[37] J. DeFrancis, Visible Speech: The Diverse Oneness of Writing Systems (University of Hawaii Press, Honolulu, 1989).
[38] J.L. Packard, The Morphology of Chinese: A Linguistic and Cognitive Approach (Cambridge University Press, Cambridge, 2000).
[39] K. Turner, Visualizing Zipf's Law in Japanese, available at http://classes.soe.ucsc.edu/cmps161/Winter12/projects/katurner/proj/paper/paper.pdf
[40] R. Hoosain, Psychological reality of the word in Chinese, in H.C. Chen and J.L. Tseng (eds.), Language Processing in Chinese, pp. 111-130 (Amsterdam, Netherlands, 1992).
[41] I.M. Liu et al., Chinese Journal of Psychology, , 25 (1974).
[42] S.H. Hsu and K.C. Huang, Perceptual and Motor Skills, , 355 (2000); ibid., , 81 (2000).
[43] X. Luo, A maximum entropy Chinese character-based parser, in Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (2003).
[44] Wm. C. Hannas, Asia's Orthographic Dilemma (University of Hawaii Press, 1997).
[45] C.Y. Chen et al., Some distributional properties of Mandarin Chinese, in Proceedings of the First Pacific Asia Conference on Formal and Computational Linguistics, p. 81 (Taipei, 1993).
[46] http://myweb.tiscali.co.uk/wordscape/wordlist/homogrph.html
[47] N.V. Obukhova, Quantitative Linguistics and Automatic Text Analysis (Proc. of Tartu University), , 119 (1986).
[48] N.J.D. Nagelkerke, A note on a general definition of the coefficient of determination, Biometrika, (3), 691 (1991).
[49] M.L. Goldstein, S.A. Morris, and G.G. Yen, Eur. Phys. J. B, , 255 (2004).
[50] H. Bauke, Eur. Phys. J. B, , 167 (2007).
[51] A. Clauset, C.R. Shalizi and M.E.J. Newman, SIAM Rev., , 4 (2009).
[52] R.E. Madsen et al., Modeling word burstiness using the Dirichlet distribution, in Proc. Intl. Conf. Machine Learning (2005).
[53] S. Bernhardsson, L.E. Correa da Rocha, and P. Minnhagen, Physica A, , 330 (2010); New J. Phys., , 123015 (2009).
[54] T. Hofmann, Probabilistic latent semantic analysis, in Uncertainty in Artificial Intelligence (1999).
[55] W.J.M. Levelt et al., Beh. Brain Sciences, , 1 (1999).
[56] J. Tuldava, Journal of Quantitative Linguistics, , 38 (1996).
[57] D. Krallmann, Statistische Methoden in der stilistischen Textanalyse (Inaug.-Dissert., Bonn, 1966).
[58] S.K. Baek, S. Bernhardsson and P. Minnhagen, New Journal of Physics, , 043004 (2011).
[59] Y. Dover, Physica A, , 591 (2004).
[60] E.V. Vakarin and J.P. Badiali, Phys. Rev. E, , 036120 (2006).
[61] E.T. Jaynes, IEEE Trans. Syst. Science & Cyb., , 227 (1968).
[62] M. Jaeger, Int. J. Approx. Reas., , 217 (2005).
[63] J. Haldane, Proceedings of the Cambridge Philosophical Society, , 55 (1932).
[64] T. Hofmann, Probabilistic latent semantic analysis, in Uncertainty in Artificial Intelligence (1999).
[65] A.F. Healy and A. Drewnowski, Journal of Experimental Psychology: Human Perception and Performance, , 413 (1983).
[66] Reading Chinese Script: A Cognitive Analysis, edited by J. Wang, A.W. Imhoff and H.-C. Chen (Lawrence Erlbaum Associates, New Jersey, 1999).
[67] A.N. Kolmogorov, Giornale dell'Istituto Italiano degli Attuari, , 77 (1933).
[68] P.T. Nicholls, J. Am. Soc. Information Sci., , 40.