Explanation and exact formula of Zipf’s law evaluated from rank-share combinatorics
A Shyklo
ABSTRACT
This work demonstrates, using simple combinatorics, that ranks and shares are statistically dependent on one another. It presents a formula for the rank-share distribution and illustrates that Zipf’s law follows from the expected values of the ranks in this distribution. All conclusions, formulas and charts presented here were tested against publicly available statistical data from different areas. The correlation coefficient between the calculated values and the statistics provided by the Bureau of Labor Statistics was 0.99899. Monte-Carlo simulations were performed as additional evidence.
Introduction
The mysterious Zipf’s law has astonished researchers for over 100 years. It was initially presented by Jean-Baptiste Estoup [1] in 1908, who observed a strange proportional dependency between the frequencies of word usage in texts. Later it was observed in many languages that the frequency of the most common words is proportional to 1/rank. For example, the word “the” is the most commonly used word in the English language. The second most common, “of”, is used about half as often as the first. The third, “and”, is used about a third as often as the first, and so on. This dependency was popularized by, and named after, the Harvard University linguist George Kingsley Zipf [2]. It was used in 1913 by the German physicist Felix Auerbach in the “Law of Population Concentration” to describe the size distribution of cities. In 1991, Wentian Li demonstrated [3] that randomly generated texts follow the same frequency distribution as real languages. The same recognizable empirical distribution appears in a great deal of research: in the use of words, city populations, last names, the distribution of wealth, the frequency of natural disasters, market behavior and so on. It is also visible in the 80/20 rule.
There have been multiple attempts to explain it, and many publications have done so partially [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]; however, the exact math behind it remained unknown, even after more than a century of research. This work started as a practical attempt to apply the latest statistical formulas to real-life data. Working with large datasets, we were surprised to find inaccuracies in the existing equations. Closer observation of various data samples revealed a statistical dependency between rank, share and the number of participants.
Further analysis led to a solid understanding of the combinatorics driving the ranking process and to an exact formula for the rank-share distribution, which provides the key to understanding Zipf’s law and the Pareto principle.
Results
To demonstrate the dependency between rank and share, let’s assume that we have a combined volume T shared between N participants. Using combinatorics, we can calculate that there are

$$\frac{(T+N-1)!}{(N-1)!\,T!}$$

ways to split the volume. If we sort and rank each case and count how many times the share of some rank k is equal to a certain number S, we can calculate the probability of this event. If this outcome (rank k has share S) appears x times, its probability can be calculated as:

$$P(T,N,k,S) = x \cdot \frac{(N-1)!\,T!}{(T+N-1)!}$$
To illustrate this, let’s look at a simplified example. Assume that we have 3 companies, which sold 10 items combined (N = 3, T = 10). There are 66 possible combinations of how they can split the market volume. If we sort each combination and count how many times Rank 1 has each value from 0 to 10, we can create the following chart (Figure 1). From this chart we can see, for example, that the probability of Rank 1 having a share of 7 out of 10 (or 70%) is 12/66, or 18.1818%. In this way we can calculate the probability of every share value for Rank 1. Similar charts can be created for Rank 2 and Rank 3.
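The counts in this example can be reproduced by brute-force enumeration. The following is a minimal sketch, not the authors' code; the function name `rank_counts` is hypothetical and the approach (listing every ordered split and sorting it) only scales to small T and N.

```python
from collections import Counter
from itertools import product
from math import comb


def rank_counts(total, n):
    """Count how often each rank holds each share over all ordered splits.

    Returns (number_of_splits, counts), where counts[k][s] is the number of
    splits of `total` items among `n` participants in which rank k
    (1 = largest) holds exactly s items.
    """
    counts = {k: Counter() for k in range(1, n + 1)}
    splits = 0
    # Enumerate the first n-1 shares; the last one is fixed by the total.
    for head in product(range(total + 1), repeat=n - 1):
        last = total - sum(head)
        if last < 0:
            continue
        splits += 1
        ranked = sorted(head + (last,), reverse=True)
        for k, share in enumerate(ranked, start=1):
            counts[k][share] += 1
    return splits, counts


splits, counts = rank_counts(total=10, n=3)
print(splits, comb(10 + 3 - 1, 3 - 1))       # both are 66
print(counts[1][7], counts[1][7] / splits)   # 12 and 0.1818...
print([counts[3][s] for s in range(11)])     # combinations per share for Rank 3
```

Dividing any entry of `counts[k]` by `splits` gives the probability P(T, N, k, S) defined above.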
Figure 1: Number of combinations vs shares for Rank 1 of 3 (T=10)
Figure 2: Number of combinations vs shares for Ranks 2 and 3 (T=10, N=3). Rank 3 panel data labels for shares 0 to 10: 30, 21, 12, 3, 0, 0, 0, 0, 0, 0, 0.
Using this logic we can calculate Probability Density Functions (PDFs) for various ranks, N and T.
Figure 3: Probability Density Functions for all ranks (N from 2 to 5)
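We were able to derive the universal formula for the PDFs of the rank-share distribution:

$$P(N,k,S) = \frac{N!\,(N-1)}{(k-1)!\,(N-k)!} \sum_{i=k}^{d} (-1)^{i-k}\,\frac{(N-k)!}{(i-k)!\,(N-i)!}\,(1-iS)^{N-2}, \qquad d = \min\{N, \lfloor 1/S \rfloor\}$$

where S is the share of rank k among N participants.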
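As a numerical sanity check, the PDF can be coded directly and integrated. This is a sketch under stated assumptions: the prefactor is written as N(N-1)·C(N-1, k-1), which equals N!(N-1)/((k-1)!(N-k)!), the sum runs up to d = min(N, ⌊1/S⌋), and the names `rank_share_pdf` and `integrate` are hypothetical helpers, not part of the authors' package.

```python
from math import comb, floor


def rank_share_pdf(n, k, s):
    """Continuous PDF of the share s (0 < s <= 1) held by rank k of n."""
    if not 0 < s <= 1:
        return 0.0
    d = min(n, floor(1 / s))          # index of the interval containing s
    norm = n * (n - 1) * comb(n - 1, k - 1)
    total = 0.0
    for i in range(k, d + 1):         # empty when s > 1/k, so the PDF is 0
        total += (-1) ** (i - k) * comb(n - k, i - k) * (1 - i * s) ** (n - 2)
    return norm * total


def integrate(f, a, b, steps=20000):
    """Midpoint-rule integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (j + 0.5) * h) for j in range(steps))


n = 4
for k in range(1, n + 1):
    area = integrate(lambda s: rank_share_pdf(n, k, s), 0.0, 1.0)
    mean = integrate(lambda s: s * rank_share_pdf(n, k, s), 0.0, 1.0)
    print(k, round(area, 4), round(mean, 4))
```

Each printed area should be close to 1, and each mean is the expected share of that rank.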
To verify this formula, we tested it against statistical data and performed Monte-Carlo simulations. Figure 4 represents the same distributions for N = 4, calculated from 500000 random outcomes.
[Figure 3 panels: Share PDFs of 2, 3, 4 and 5 participants, one curve per rank; x-axis: share (%), y-axis: probability (%)]
Figure 4: Monte-Carlo Simulation of N = 4
The expected values of the share for each rank can be calculated as:

$$S = \frac{1}{N} \sum_{i=k}^{N} \frac{1}{i}$$

where S represents the share for rank k, and N is the number of participants.
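A short worked example may help here. The sketch below evaluates the expected-value formula with exact fractions; the function name `expected_shares` is hypothetical.

```python
from fractions import Fraction


def expected_shares(n):
    """Exact expected share of every rank k = 1..n, as fractions."""
    return [sum(Fraction(1, i) for i in range(k, n + 1)) / n
            for k in range(1, n + 1)]


for n in (2, 3, 4):
    shares = expected_shares(n)
    print(n, [str(s) for s in shares], sum(shares))
```

For N = 3 this gives 11/18, 5/18 and 1/9, and for every N the expected shares add up to 1, as they must for shares of a whole.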
Figure 5: Expected values of the rank-share distribution.
Figure 6: Expected values for Zipf’s law in double log scale calculated for N from 1 to 100
The expected values of the rank-share distribution give us the dependency between rank and frequency known as Zipf’s law. To test this, we applied it to publicly available datasets with known N. For example, we know that Canada has 13 provinces and territories, Brazil has 27 federative units, and the US has 50 states. Statistics on the area distribution for each country are available from [18], [19], [20].
Figure 7: Shares of the area of the states in US, Canada, Brazil on double log scale (real vs calc.)
Another example is the distribution of letters in European languages, published in [21]. We know exactly how many letters each language has.
Figure 8: Frequencies of letter usage in languages on double log scale (real vs expected values).
Table 1: Correlation coefficients between real and calculated distributions of letter usage.
Language     Correlation    Language      Correlation
Czech        0.972696909    Icelandic     0.984039287
Danish       0.983906839    Italian       0.974405202
Dutch        0.972386224    Polish        0.99029575
Esperanto    0.979436537    Portuguese    0.984913919
Finnish      0.977592954    Spanish       0.985479455
French       0.971366449    Swedish       0.969368024
German       0.98786782     Turkish       0.991741606
To achieve a more accurate verification, we combined numbers provided by the Bureau of Labor Statistics of the US Department of Labor [22]. We analyzed the Occupational Employment and Wages distributions across 22 categories in more than 50 US cities. We ranked the categories for each city, calculated the average share for each rank from 1 to 22, and compared the results to the calculated values. The observed correlation coefficient was 0.99899712.
Conclusion
We demonstrated that rank and share are statistically related and derived an exact formula for the rank-share distribution. This is a universal law that can be applied to any area, which is why we find it in completely different places (words, population, markets). The expected values of the new distribution give us the dependency between rank and frequency known as Zipf’s law. We can rank the distribution of words, the frequency of natural disasters, city populations or the distribution of income, and observe the same pattern caused by simple rank-share combinatorics.
Discussion
The PDF formula of the rank-share distribution contains binomial coefficients and could probably be simplified using binomial identities. The rank-share distribution possibly belongs to the binomial series; it still needs to be classified. We spent significant time trying to derive it from known distributions. For some ranks we were able to partially derive a relationship between the rank-share and negative binomial distributions; however, a universal relationship has yet to be established. For this work we concentrated on continuous solutions, assuming that T (the number of shared items) is large enough to be treated as ∞, but it would be interesting to derive an exact formula for discrete solutions that includes T as a parameter.
Methods
We started by analyzing big sets of statistical data and observed strong, recognizable patterns between ranks, shares and the number of participants. We also noticed that the possible share values for each rank lie within a certain range, and worked out the logic for these ranges: the maximum share of Rank k is always 1/k, and the minimum share is 0 for all ranks except Rank 1 (where the minimum is 1/N). To explain the logic behind this, imagine the case where we have just two participants. They split the shares in some proportion (50/50, 60/40 and so on). Participant ranked [...] distribution between [...]
Figure 9: Expected values for N=22, calculated vs. average statistics (series: Average Stat, Calculated)
Continuous Formula for the Last Rank
Assume we have 5 participants (N = 5). How can we count the number of combinations for Rank 5? For a given value of $S_5$, the minimum value of $S_4$ is $S_5$ and the maximum is $(T-S_5)/(N-1)$. For each $S_4$, the minimum value of $S_3$ is $S_4$ and the maximum is $(T-S_5-S_4)/(N-2)$. For each $S_3$, the minimum value of $S_2$ is $S_3$ and the maximum is $(T-S_5-S_4-S_3)/(N-3)$. For each $S_2$, we have just one value of $S_1 = T-S_2-S_3-S_4-S_5$.
Figure 10: Combinations for last rank of 5
Based on this logic we can calculate the quantity (number of combinations) for the last rank as:

$$Q(S_N) = \int_{S_N}^{\frac{T-S_N}{N-1}} \int_{S_{N-1}}^{\frac{T-S_N-S_{N-1}}{N-2}} \cdots \int_{S_3}^{\frac{T-S_N-\cdots-S_3}{2}} 1 \; dS_2 \cdots dS_{N-2}\, dS_{N-1}$$

We used the SymPy Python package to evaluate these integrals and found the pattern:

$$Q(T,N,S_N) = \frac{(T-N S_N)^{(N-2)}}{(N-1)!\,(N-2)!}$$

When we normalize it, we obtain the probability for the last rank:

$$P(T,N,S_N) = \frac{N\,(N-1)\,(T-N S_N)^{(N-2)}}{T^{(N-1)}}$$
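The normalized last-rank formula can be cross-checked against exact discrete counts. The sketch below is an illustration rather than the authors' SymPy computation: it assumes the standard stars-and-bars count for splits in which every share is at least m, and the helper names `last_rank_count` and `last_rank_pdf` are hypothetical.

```python
from math import comb


def last_rank_count(total, n, s):
    """Number of ordered splits of `total` among `n` whose smallest share is s."""
    def at_least(m):
        # Splits in which every share is >= m: hand m items to each
        # participant first, then split the rest freely (stars and bars).
        rest = total - n * m
        return comb(rest + n - 1, n - 1) if rest >= 0 else 0
    return at_least(s) - at_least(s + 1)


def last_rank_pdf(total, n, s):
    """Continuous formula N(N-1)(T - N*S)^(N-2) / T^(N-1)."""
    if s < 0 or n * s > total:
        return 0.0
    return n * (n - 1) * (total - n * s) ** (n - 2) / total ** (n - 1)


# Small discrete case: T = 10, N = 3, smallest share 0..3.
print([last_rank_count(10, 3, s) for s in range(4)])

# Large T: the exact probability approaches the continuous formula.
T, N, S = 10_000, 5, 500
exact = last_rank_count(T, N, S) / comb(T + N - 1, N - 1)
print(exact, last_rank_pdf(T, N, S))
```

For large T the two printed values agree to within about one percent, which is the sense in which the continuous formula treats T as effectively infinite.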
The expected value of the last rank can be calculated as $1/N^2$.
Continuous Formula for N-1 Rank
In our example, for the second-lowest rank (N-1), we should start from $S_4$. The right part of the diagram remains the same. In the left part, the minimum of $S_5$ is 0, but the maximum can be either $S_4$ or $(T - 4S_4)$, whichever is less.
Figure 11: Combinations for rank 4 of 5
There are two different equations, depending on whether $S_4 > (T - 4S_4)$. We can calculate the results for both integral equations. For continuous solutions, we can represent the quantities for $S_4$ of N=5 as two polynomials (unnormalized, T = 1): $4(1-4S_4)^3 - 4(1-5S_4)^3$ and $4(1-4S_4)^3$. We can also analyze the condition and see that the first polynomial works for $S_4 < 1/5$ and the second one for $S_4 \ge 1/5$. For this example, the continuous solution can be presented as the following graph:
The universal formula for the first polynomial, applied on the interval between 0 and 1/N, is:

$$(N-1)\left[\left(1-(N-1)\,S_{N-1}\right)^{(N-2)} - \left(1-N S_{N-1}\right)^{(N-2)}\right]$$

The universal formula for the second polynomial, applied on the interval between 1/N and 1/(N-1), is:

$$(N-1)\left(1-(N-1)\,S_{N-1}\right)^{(N-2)}$$

To get exact PDF formulas, we should also normalize the equations. We can also calculate the expected value for rank N-1 as

$$\frac{1}{N^2} + \frac{1}{N\,(N-1)}$$

Middle Ranks Formulas
We can follow this path to see that the higher the rank, the more integrals we need for the continuous solution. The solution of these integrals is a set of polynomials, with different functions on the intervals 1 to 1/2, 1/2 to 1/3, 1/3 to 1/4, 1/4 to 1/5, and so on. In general, the PDFs for the ranks of N participants can be represented as sets of polynomial functions of degree (N-2). For example, assuming T=1, we can calculate the unnormalized polynomials for N = 3 to 5 as:
Figure 12. Rank 4 of 5 as a combination of two polynomials
Table 2: Polynomials for share PDFs calculated for N from 3 to 5

S (for N=3)    R1        R2          R3
0 - 1/3        0         2S          -(3S-1)
1/3 - 1/2      (3S-1)    -2(2S-1)    0
1/2 - 1        (1-S)     0           0

S (for N=4)    R1                  R2                  R3           R4
0 - 1/4        0                   6S^2                -3S(7S-2)    (4S-1)^2
1/4 - 1/3      (4S-1)^2            -42S^2 + 24S - 3    3(3S-1)^2    0
1/3 - 1/2      -11S^2 + 10S - 2    3(2S-1)^2           0            0
1/2 - 1        (1-S)^2             0                   0            0

S (for N=5)    R1                            R2                            R3                         R4                     R5
0 - 1/5        0                             24S^3                         36S^2(1-4S)                4S(61S^2 - 27S + 3)    (1-5S)^3
1/5 - 1/4      (5S-1)^3                      -476S^3 + 300S^2 - 60S + 4    6((1-3S)^3 - 2(1-4S)^3)    4(1-4S)^3              0
1/4 - 1/3      -131S^3 + 117S^2 - 33S + 3    4((1-2S)^3 - 3(1-3S)^3)       6(1-3S)^3                  0                      0
1/3 - 1/2      (1-S)^3 - 4(1-2S)^3           4(1-2S)^3                     0                          0                      0
1/2 - 1        (1-S)^3                       0                             0                          0                      0

Here are some more graphical representations:
Figure 12: Graphical representations of polynomials for PDFs for ranks 2 and 3 of 5 (panels: N5 R3 polynomials, N5 R2 polynomials)
Universal formula for PDF.
When we analyzed the polynomials for ranks N and N-1, we realized that they can be presented as a sum of terms of the form $(1-iS)^{(N-2)}$ multiplied by coefficients $(a_1, a_2, a_3, a_4, a_5, \ldots)$, where i is between 1 and N:

$$P(N,k,S) \sim a_1 (1-1S)^{(N-2)} + a_2 (1-2S)^{(N-2)} + a_3 (1-3S)^{(N-2)} + a_4 (1-4S)^{(N-2)} + a_5 (1-5S)^{(N-2)} + \ldots$$
To calculate the coefficients for all polynomial equations, we created a Python package that searched through combinations of coefficients and returned the corresponding matrix for a given polynomial. These are the results calculated for all polynomials with N from 3 to 5:
Table 3: Coefficients for polynomial equations calculated for N from 3 to 5
S (for N=3)    R1          R2          R3
0 - 1/3        -           [0 2 -2]    [0 0 1]
1/3 - 1/2      [1 -2 0]    [0 2 0]     -
1/2 - 1        [1 0 0]     -           -

S (for N=4)    R1            R2            R3            R4
0 - 1/4        -             [0 3 -6 3]    [0 0 3 -3]    [0 0 0 1]
1/4 - 1/3      [1 -3 3 0]    [0 3 -6 0]    [0 0 3 0]     -
1/3 - 1/2      [1 -3 0 0]    [0 3 0 0]     -             -
1/2 - 1        [1 0 0 0]     -             -             -

S (for N=5)    R1               R2                  R3               R4              R5
0 - 1/5        -                [0 4 -12 12 -4]     [0 0 6 -12 6]    [0 0 0 4 -4]    [0 0 0 0 1]
1/5 - 1/4      [1 -4 6 -4 0]    [0 4 -12 12 0]      [0 0 6 -12 0]    [0 0 0 4 0]     -
1/4 - 1/3      [1 -4 6 0 0]     [0 4 -12 0 0]       [0 0 6 0 0]      -               -
1/3 - 1/2      [1 -4 0 0 0]     [0 4 0 0 0]         -                -               -
1/2 - 1        [1 0 0 0 0]      -                   -                -               -
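A coefficient vector from Table 3 can be turned back into an ordinary polynomial in S by expanding each term with the binomial theorem. The sketch below illustrates that expansion; the function name `expand` is hypothetical and is not the authors' package.

```python
from math import comb


def expand(coeffs):
    """Expand sum_i a_i * (1 - i*S)^(N-2) into powers of S.

    `coeffs` is a Table 3 style vector [a_1, ..., a_N]; the result lists the
    polynomial coefficients from the constant term up to S^(N-2).
    """
    n = len(coeffs)
    degree = n - 2
    poly = [0] * (degree + 1)
    for i, a in enumerate(coeffs, start=1):
        for p in range(degree + 1):
            # (1 - i*S)^degree = sum_p C(degree, p) * (-i)^p * S^p
            poly[p] += a * comb(degree, p) * (-i) ** p
    return poly


print(expand([1, -3, 0, 0]))      # [-2, 10, -11], i.e. -11S^2 + 10S - 2
print(expand([0, 0, 0, 4, -4]))   # [0, 12, -108, 244], i.e. 4S(61S^2 - 27S + 3)
```

The first vector is Rank 1 of N=4 on the interval 1/3 - 1/2, and the second is Rank 4 of N=5 on the interval 0 - 1/5.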
The coefficients clearly follow a binomial pattern. When we normalize the dependency, we obtain the following formula for the PDFs of the rank-share distribution:

$$P(N,k,d,S) = \frac{N!\,(N-1)}{(k-1)!\,(N-k)!} \sum_{i=k}^{d} (-1)^{i-k}\,\frac{(N-k)!}{(i-k)!\,(N-i)!}\,(1-iS)^{N-2}$$

where S represents the share for rank k, N is the number of participants, and d represents the range (d=1 for [1/2 – 1], d=2 for [1/3 – 1/2], d=3 for [1/4 – 1/3], and so on up to N). d is determined by S and cannot be more than N, so it can be calculated as $\min\{N, \lfloor 1/S \rfloor\}$.
Verification
To verify the equations, we tested them against publicly available datasets from various sources [18],[19],[20],[21],[22] with a known number of categories. We ranked and normalized each dataset so that the shares lie between 0 and 100%. For example, we used data extracts from the Bureau of Labor Statistics of the US Department of Labor.
Table 4: Example of occupational employment and wages dataset for various US towns.
Major occupational group    Percent of total employment: Birmingham    Montgomery    Anchorage    Fairbanks    Flagstaff
Total, all occupations
Management
Business and financial operations
Computer and mathematical
Architecture and engineering
Life, physical, and social science
Community and social services
Legal
Education, training, and library
Arts, design, entertainment, sports, and media
Healthcare practitioner and technical
Healthcare support
Protective service
Food preparation and serving related
Building and grounds cleaning and maintenance
Personal care and service
Sales and related
Office and administrative support
Farming, fishing, and forestry
Construction and extraction
Installation, maintenance, and repair
Production
Transportation and material moving
The data were then transformed to calculate expected values for all ranks from 1 to 22, like this:
Table 5: Example of transformed and ranked data to evaluate expected values for shares (N=22)
Rank    Birmingham    Montgomery    Anchorage    Fairbanks    Flagstaff    Exp. Value
For verification, we combined data from more than 50 US towns.
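The transformation can be sketched in a few lines. The example below is illustrative only: it uses SYNTHETIC random data in place of the Bureau of Labor Statistics extracts (which are not reproduced here), and the helper names `average_ranked_shares` and `pearson` are hypothetical.

```python
import random


def expected_shares(n):
    """Expected share of rank k = 1..n: (1/n) * sum_{i=k..n} 1/i."""
    return [sum(1 / i for i in range(k, n + 1)) / n for k in range(1, n + 1)]


def average_ranked_shares(datasets):
    """Normalize each dataset, rank it in descending order, average per rank."""
    n = len(datasets[0])
    sums = [0.0] * n
    for values in datasets:
        total = sum(values)
        ranked = sorted((v / total for v in values), reverse=True)
        for k, share in enumerate(ranked):
            sums[k] += share
    return [s / len(datasets) for s in sums]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# SYNTHETIC stand-in data: 50 hypothetical "cities" with 22 categories each.
# Exponential draws, once normalized, are uniform over all possible splits.
rng = random.Random(0)
cities = [[rng.expovariate(1.0) for _ in range(22)] for _ in range(50)]

observed = average_ranked_shares(cities)
r = pearson(observed, expected_shares(22))
print(round(r, 4))
```

With real data, `cities` would hold one list of 22 category values per town, and `r` would be the correlation coefficient compared against the calculated distribution.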
Monte-Carlo Simulation.
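We used Wolfram Mathematica to perform Monte Carlo simulations for the PDFs (for N from 2 to 6). The following code was used (shown for N = 4, where three random cut points split 100 units into four shares):

    (* 500000 trials, each with 3 random integer cut points between 0 and 100 *)
    m = RandomInteger[100, {500000, 3}];
    m2 = Sort /@ m;
    (* the four shares are the gaps between consecutive sorted cut points *)
    m3 = Transpose[{m2[[All, 1]], m2[[All, 2]] - m2[[All, 1]], m2[[All, 3]] - m2[[All, 2]], 100 - m2[[All, 3]]}];
    (* sort shares within each trial in ascending order, so column 4 is Rank 1 *)
    m4 = Sort /@ m3;
    (* plot how often each share value occurs for Rank 1 *)
    ListPlot[Values[KeySort[Counts[m4[[All, 4]]]]]]

Figure 13: Monte-Carlo Simulation for N3 and N5 (panels: N3, N5)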
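The same simulation idea can be sketched in Python. This is an illustrative variant, not a port of the Mathematica code: it assumes continuous cut points on [0, 1] instead of integers from 0 to 100, uses far fewer trials, and the function name `simulate_ranked_shares` is hypothetical.

```python
import random


def simulate_ranked_shares(n, trials, seed=0):
    """Split a unit total among n participants `trials` times.

    Each trial draws n-1 uniform cut points on [0, 1]; the gaps between
    consecutive cut points are the shares, returned largest first.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        cuts = sorted(rng.random() for _ in range(n - 1))
        edges = [0.0] + cuts + [1.0]
        gaps = [b - a for a, b in zip(edges, edges[1:])]
        results.append(sorted(gaps, reverse=True))
    return results


n = 4
samples = simulate_ranked_shares(n, trials=20_000)
for k in range(1, n + 1):
    column = [row[k - 1] for row in samples]
    simulated = sum(column) / len(column)
    expected = sum(1 / i for i in range(k, n + 1)) / n
    # Columns: rank, simulated mean, expected mean, largest observed share.
    print(k, round(simulated, 3), round(expected, 3), round(max(column), 3))
```

The simulated means should land close to the expected values, and the largest observed share of rank k stays below 1/k, in line with the range logic described in the Methods section.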
References