Fast Entropy Estimation for Natural Sequences
Andrew D. Back,* Daniel Angus, and Janet Wiles
School of ITEE, The University of Queensland, Brisbane, QLD, 4072, Australia.
* Contact email: [email protected]
It is well known that accurately estimating the Shannon entropy of symbolic sequences requires a large number of samples. When some aspects of the data are known, it is plausible to use this knowledge to compute entropy more efficiently. A number of methods with various assumptions have been proposed which can be used to calculate entropy for small sample sizes. In this paper, we examine this problem and propose a method for estimating the Shannon entropy for a set of ranked symbolic "natural" events. Using a modified Zipf-Mandelbrot-Li law and a new rank-based coincidence counting method, we propose an efficient algorithm which enables the entropy to be estimated with surprising accuracy using only a small number of samples. The algorithm is tested on some natural sequences and shown to yield accurate results with very small amounts of data.
PACS numbers: 89.70.Cf, 89.70.Eg, 89.75.-k, 89.75.Da, 02.50.Cw
Keywords: Entropy, natural sequences, coincidence counting, Zipf-Mandelbrot-Li law
I. INTRODUCTION
Machine learning methods typically rely on forming models based on statistical properties of observed data. An area of importance in this regard is information-theoretic methods, which involve computing Shannon entropy and mutual information. The idea that the randomness of a message can give a measure of the information it conveys formed the basis of Shannon's entropy theory, which gives a means of assigning a value to the information carried within a message [1], [2]. The way in which Shannon formulated this principle is that, given a single random variable x which may take M distinct values, and is in this sense symbolic, where each value occurs independently with probability p(x_i), i ∈ [1, M], the single-symbol Shannon entropy is defined as:

H(X) = -\sum_{i=1}^{M} p(x_i) \log(p(x_i))    (1)

This extends to the case where the probabilities of multiple symbols occurring together are taken into account. The general N-gram entropy, which is a measure of the information due to the statistical probability of N adjacent symbols occurring consecutively, can be derived as

H_N(X|B) = -\sum_{i,j} p(b_i, x_j) \log(p(x_j | b_i))    (2)

where b_i ∈ P_{N-1} is a block of N-1 symbols, x_j is an arbitrary symbol following b_i, p(b_i, x_j) is the probability of the N-gram (b_i, x_j), and p(x_j | b_i) is the conditional probability of x_j occurring after b_i, given by p(b_i, x_j)/p(b_i).

One of the limitations of computing entropy accurately is the dependence on large amounts of data, even more so when computing N-gram entropy. Estimates of entropy based on letter, word and N-gram statistics have often relied on large data sets [3], [4]. The reliance on long data sequences to estimate the probability distributions used to calculate entropy, and attempts to overcome this in coding schemes, is discussed in [5], where the authors provide an estimate of letter entropy extrapolated for infinite text lengths. A method of estimating the number of samples required to compute entropy was proposed in [6], which showed that a very large number of samples may be required to do this accurately.

Various approaches to estimating entropy over finite sample sizes have been considered. A method of computing the entropy of dynamical systems which corrects for statistical fluctuations of the sample data over finite sample sizes has been proposed in [7]. Estimation techniques using small datasets have been proposed in [8], and an online approach for estimating entropy in limited-resource environments was proposed in [9]. Entropy estimation over short symbolic sequences was considered in the context of dynamical time series models based on logistic maps and correlated Markov chains, where an effective shortened sequence length was proposed which accounted for the correlation effect [10]. A novel approach for calculating entropy using the idea of estimating probabilities from a quadratic function of the inverse number of symbol coincidences was proposed in [11]. A limitation of this method was that it assumed equiprobable symbols. The difficulty of estimating entropy due to the heavy-tailed distribution of natural sequences has been recognized, where it has been shown that the bias of classical estimators depends on the sample size and the characteristics of the heavy-tailed distribution [12]. A Bayesian approach to inferring the probability distributions has been considered at length in [13] and [14].
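As a concrete illustration of (1) and (2), the following minimal Python sketch computes the plug-in (histogram) estimates of the single-symbol and N-gram entropies from an observed sequence. The function names are ours; this is exactly the kind of direct estimator whose heavy sample requirements motivate the present paper.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Plug-in estimate of the single-symbol entropy (1), in bits."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def ngram_entropy(seq, N):
    """Plug-in estimate of the N-gram entropy (2), in bits:
    H_N = -sum_{b,x} p(b, x) log p(x | b) over blocks b of N-1 symbols."""
    grams = Counter(tuple(seq[i:i + N]) for i in range(len(seq) - N + 1))
    blocks = Counter(g[:-1] for g in grams.elements())
    total = sum(grams.values())
    return -sum((c / total) * math.log2(c / blocks[g[:-1]])
                for g, c in grams.items())

print(shannon_entropy("abracadabra"))   # ~2.04 bits
print(ngram_entropy("abracadabra", 2))  # conditional next-letter entropy
```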
A computationally efficient method for calculating entropy based on a James-Stein-type shrinkage estimator was proposed by Hausser and Strimmer in [15].

In this paper, by considering a model for the probability distributions of natural sequence data, we propose an extension to the algorithm in [11] which enables a fast method of estimating entropy using a small number of samples. The proposed algorithm is derived in the subsequent sections and simulations are given showing its effectiveness.

II. PROPOSED ALGORITHM FOR ESTIMATING ENTROPY

A. Coincidence Counting For Equiprobable Symbols
Computing Shannon entropy by estimating the symbol probabilities with conventional histogram plug-in methods is effective for small alphabet sizes; however, for non-equiprobable symbols with a large alphabet size, a very large number of symbols may be required. For a given alphabet size M, to estimate the entropy with some degree of accuracy it is normally required to estimate the probabilities of M symbols. Another approach is to adopt a parametric model of the symbol probabilities. In this case, the idea is to form an invertible model J(M) of the relationship between the model parameters and some observable statistical feature of the data. The model is then inverted, and the statistical features of the actual data are observed, which enables the model parameters, and hence the entropy, to be estimated.

The method of coincidence detection considers a discrete (or symbolic) random variable x which may take on a finite number M of distinct values x_i ∈ {x_1, ..., x_M} with probabilities p(x_i), i ∈ [1, M]. Consider the case where p(x_i) = p(x_j) ∀ i, j ∈ [1, M], that is, the symbols are equiprobable. Hence we may proceed as follows. The probability of drawing any symbol on the first try followed by any other, different symbol on the second try, that is, any two non-repeating symbols, is

\tilde{F}(2; M) = M(M-1)/M^2    (3)

and hence the probability of drawing any two repeating or identical symbols out of the entire set is

F(2; M) = 1 - M(M-1)/M^2    (4)

Extending this to n draws, the probability of drawing any symbol on the first try followed by a different symbol on each subsequent draw up to the n-th draw, that is, no repeating symbols among the n symbols, is

\tilde{F}(n; M) = M(M-1)···(M-n+1)/M^n    (5)

That is, (5) is the probability of no repeating symbols in the entire sequence. The reason for this formulation is that, by excluding all repeating symbols, it enables us to compute the probability of any repeating symbols over a given sequence, and hence the exact probability of a coincident event at a specific sample instance, which, by definition in (7), must be at the n-th sample since we have discounted the probabilities up to the (n-1)-th sample.
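To make the construction in (3)-(7) concrete, the sketch below evaluates \tilde{F}(n; M) and the first-coincidence distribution f(n; M) numerically. It is a direct transcription of the equations (the function names are ours), assuming iid equiprobable draws.

```python
def F_tilde(n, M):
    """Probability of no repeats among n iid equiprobable draws, eq. (5)."""
    p = 1.0
    for k in range(min(n, M + 1)):  # any n > M draws must contain a repeat
        p *= (M - k) / M
    return p

def f_first(n, M):
    """Probability the first coincidence falls exactly at draw n, eq. (7).
    Since F(n; M) = 1 - F_tilde(n, M) by eq. (6), the difference telescopes:
    f(n; M) = F(n; M) - F(n-1; M) = F_tilde(n-1, M) - F_tilde(n, M)."""
    return F_tilde(n - 1, M) - F_tilde(n, M)

# The first coincidence must land somewhere in n = 2, ..., M+1,
# so these probabilities sum to one:
M = 27
assert abs(sum(f_first(n, M) for n in range(2, M + 2)) - 1.0) < 1e-12
```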
Therefore, it follows that the probability of drawing any q_n ∈ [2, ..., n] identical symbols (i.e., one or more repeating symbols in any position) out of the entire set is given by

F(n; M) = 1 - M(M-1)···(M-n+1)/M^n    (6)

To compute the probability of a first coincidence occurring exactly at the n-th symbol for 1 < n < M, it is necessary to compute the probability of drawing no repeating symbols in the entire sequence up to the (n-1)-th draw, given by \tilde{F}(n-1; M), and consequently the probability of drawing any q_{n-1} ∈ [2, ..., n-1] identical symbols, given by F(n-1; M). Hence the required probability is given by ([11]):

f(n; M) = F(n; M) - F(n-1; M)    (7)

The expectation of the discrete parameter n with its associated probability f(n; M) is given by:

E[n] = J(n; M)    (8)
     = \sum_{n=0}^{M} n f(n; M)    (9)

Since n is a function of M, define

D(M) = E[n].    (10)

The innovative approach of [11] is to recognize that an invertible smooth curve can be constructed with D(M) as a function of M by using a sequence of uniform iid random data. Now, since the Shannon entropy H_N(X; M) is defined as a function of M, and for equiprobable symbols we have

H(M) = log(M)    (11)

this indicates that if the unknown value of M can be estimated directly from the data, then the entropy can be determined.

A model for estimating M can be obtained by forming an appropriate (e.g., polynomial) model, inverting the original equation found in (9), as

\hat{M}(D) = G(Θ; D)    (12)
           = \sum_{i=0}^{n_p} θ_i D^i    (13)

with appropriate values for the parameters θ_i obtained by fitting a curve to an ensemble of data. In [11], setting n_p = 2, the values obtained were θ_0 = 0.…, θ_1 = -0.…, θ_2 = 0.…. The entropy can then be estimated as

\hat{H} = log(\hat{M})    (14)

Experimentally, this approach was shown to provide good accuracy using only a small number of symbol coincidence distance observations [11]. The limitation, however, is the assumption of equiprobable symbol probabilities. In the next section we propose a new algorithm which extends this method to the case of non-equiprobable symbols.

B. Coincidence Counting For Non-Equiprobable Symbols
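The curve-fitting step in (12)-(13) can be reproduced numerically. Since the quadratic coefficients θ_i reported in [11] are not legible in this copy, the sketch below simply refits a quadratic (n_p = 2) to the analytic forward curve D(M) and then applies (14); the grid of alphabet sizes and the use of base-2 logarithms are our own assumptions.

```python
import math
import numpy as np

def D_of_M(M):
    """Forward model D(M) = E[n], eqs. (8)-(10), computed via the identity
    E[n] = sum_{n>=0} P(first coincidence > n) = sum_{n=0}^{M} F_tilde(n, M)."""
    total, no_repeat = 0.0, 1.0
    for n in range(M + 1):
        total += no_repeat           # adds F_tilde(n, M); F_tilde(0, M) = 1
        no_repeat *= (M - n) / M     # extends the product to F_tilde(n+1, M)
    return total

# Fit the inverse polynomial model M_hat(D) = sum_i theta_i D^i, eqs. (12)-(13).
Ms = np.arange(2, 200)
Ds = np.array([D_of_M(int(M)) for M in Ms])
theta = np.polynomial.polynomial.polyfit(Ds, Ms, deg=2)

def entropy_equiprobable(distances):
    """Estimate H_hat = log2(M_hat), eq. (14), from observed first-coincidence
    distances of an (assumed) equiprobable source."""
    D = float(np.mean(distances))
    M_hat = float(np.polynomial.polynomial.polyval(D, theta))
    return math.log2(M_hat)
```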
For natural sequences, including natural language, a mechanism to model the non-equiprobable symbolic probabilities is to use a Zipfian law, where the probability of information events can generally be ranked into monotonically decreasing order. For natural language, it has been shown that Zipf's law approximates the distribution of probabilities of letters or words across a corpus of sufficient size for the larger probabilities [16]. We do not rely on Zipf's law to provide a universal model of human language or other natural sequences (see, for example, the discussions in [17], [18], [19]). Nevertheless, Zipfian laws have proven to be useful as a means of statistically characterizing the observed behaviour of symbolic sequences of data ([20]) and are useful in forming a model of symbolic information transmission which is organized on the basis of sentences made by words in interaction with each other [21]. Here we adopt the Zipf-Mandelbrot-Li law described in [6] as a model for natural sequences with a non-equiprobable distribution of symbols.

In the former case, we have a model defined by f(n; M) from which a smooth invertible model J(n; M) is obtained. Thus we can obtain a model G(Θ; D) which enables the entropy to be estimated directly from the symbol coincidences. To derive a model for the non-equiprobable case, one approach is to model individual D_i and assume some form of discrete probability related to each distance.

The method we propose is that, following (7)-(9), a model J'(n; M, r) is defined for each symbol, indexed by rank r. Therefore, for any given M, each symbol of a specified rank r can be treated as being equiprobable. Thus, if the probability can be determined for each symbol in terms of its rank, and this can be related to the overall entropy, then the same approach can be followed as for the equiprobable case.

Consider a reformulation of (6) where:

\tilde{F}(n; M) = M(M-1)···(M-n+1)/M^n
               = (M/M)·((M-1)/M)·((M-2)/M)···((M-n+1)/M)
               = 1·(1 - 1/M)·(1 - 2/M)···(1 - (n-1)/M)
               = 1·(1 - P_1)·(1 - P_2)···(1 - P_{n-1})    (15)

using the identity (M-n+1)/M = 1 - (n-1)/M, where P_h is the probability of drawing, on the next draw, a coincidence with one of the h symbols drawn independently so far. (If this were cast in the classic case of drawing colored objects from a bag, it would be with replacement.) In the case of equiprobable symbols, we have

\tilde{P}_h(M) = 1 - h/M    (16)

Now, for a natural sequence where the probability of occurrence of a given word can be defined in terms of rank, the Zipf-Mandelbrot-Li law provides an expression for the probability to be used in (15), where ([6], [20], [22]):

P(r; M) = γ'/(r + β)^α    (17)

and for iid samples, the constants can be computed as ([17]):

α = log(M+1)/log(M),  β = M/(M+1),  γ_M = M^{α-1}/(M-1)^α    (18)

and

γ' = γ/κ    (19)

where

\sum_{i=1}^{M} p(i) = 1,  \sum_{r=1}^{M} γ/(r + β)^α = κ    (20)

This approach provides an equiprobable representation of the symbols by considering a different model for each symbol rank, according to the rank. Moreover, once a model is found for one rank, the whole model can be identified.
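The Zipf-Mandelbrot-Li probabilities (17)-(20) are straightforward to evaluate. In the sketch below (our function name), γ is set to 1 and absorbed into the normalization constant κ, which is equivalent to using γ' = γ/κ.

```python
import numpy as np

def zipf_mandelbrot_li(M):
    """Ranked probabilities P(r; M) = gamma' / (r + beta)^alpha, eqs. (17)-(20),
    with alpha and beta from eq. (18) and gamma' chosen so the M values sum to 1."""
    alpha = np.log(M + 1) / np.log(M)
    beta = M / (M + 1)
    r = np.arange(1, M + 1)
    p = (r + beta) ** -alpha        # gamma = 1; kappa rescales it away
    return p / p.sum()              # gamma' = gamma / kappa, eqs. (19)-(20)

p = zipf_mandelbrot_li(27)
print(-(p * np.log2(p)).sum())      # model entropy for 27 symbols, in bits
```

For M = 27, the printed value should correspond (assuming base-2 logarithms) to the true model entropy H_a(27) quoted in Sec. III A.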
Hence, adopting a probabilistic model according to the symbolic rank, we define

F(n; r, M) = 1 - \prod_{h=1}^{n} (1 - P_h(r, M))    (21)

where

P_h(r, M) = h γ'/(r + β)^α    (22)

Therefore, the same approach as before can be adopted by defining

f(n; r, M) = F(n; r, M) - F(n-1; r, M)    (23)

Hence, we now have E_r[n] = J'(n; r, M) and

D_r(M) = \sum_{n=0}^{M} n f(n; r, M)    (24)

Using a similar approach to the previous equiprobable case, a per-symbolic-rank model for estimating M can be obtained by prescribing J'(n; r, M) in (24), and then inverting this to become

\hat{M}_r(D) = G_r(Θ; M, D_r)    (25)

Note that although it is technically feasible to derive the exact model J'(n; r, M) in terms of (17)-(23), it is not necessary to do so in practice, as is evident from the curve fitting approach proposed in [11] and adopted here.

FIG. 1. Rank-based entropy models for D(M) = J'(n, r, M), shown for ranks r = 1, ..., 5. Note that the symbol distances are measured according to their rank.

Now, unlike the model proposed initially in [11], natural sequence data consists of a non-equiprobable set of symbols, and so we cannot simply use (14) to estimate entropy in a single step as before. However, given an estimate \hat{M}_r(D) from the observed inter-symbol distance, it now becomes possible to apply this parameter to the Zipf-Mandelbrot-Li set of equations, in addition to our rank-based probability model of symbol drawings, and obtain an overall estimate for the entire set of symbolic probabilities. While this can be achieved using, for example, D_1, clearly it is possible to form an estimate which uses D_i for i = 1..n according to any desired criterion such as least squares or any other norm. Having then estimated \hat{P}_h(r, M), the entropy can be easily estimated as

\hat{H}(r, X) = -\sum_{h=1}^{\hat{M}} \hat{P}_h(r, M) log(\hat{P}_h(r, M))    (26)

which defines the rank-r Shannon entropy estimate. In the next section, we demonstrate the performance of the model in various simulations.
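Equations (21)-(24) define the rank-r forward model. A minimal sketch follows (our function name; we sum until the survival probability vanishes rather than truncating at n = M, which is a numerically equivalent assumption for the curves of interest):

```python
import math

def D_r(M, r):
    """Rank-r expected first-coincidence distance D_r(M), eqs. (21)-(24).
    Uses E_r[n] = sum_{n>=0} S(n), where S(n) = prod_{h=1}^{n} (1 - P_h(r, M))
    is the probability of no coincidence in the first n draws."""
    alpha = math.log(M + 1) / math.log(M)
    beta = M / (M + 1)
    kappa = sum((k + beta) ** -alpha for k in range(1, M + 1))
    p_r = (r + beta) ** -alpha / kappa       # eq. (17) with gamma' = 1/kappa
    total, surv, h = 1.0, 1.0, 1
    while h * p_r < 1.0 and surv > 1e-12:    # P_h = h * p_r, eq. (22)
        surv *= 1.0 - h * p_r
        total += surv
        h += 1
    return total

# Traces the shape of the curves in Fig. 1 for ranks r = 1..5:
for r in range(1, 6):
    print(r, D_r(27, r))
```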
III. EXAMPLE RESULTS

A. Synthetic Entropy Model of English Text
In this example, a set of data is simulated using the Zipf-Mandelbrot-Li model with 27 symbols corresponding to the 26 letters and a space. The rank-based entropy estimation algorithm described in the previous section is used to estimate the model by counting the coincidences of the symbols. In the first instance, we simply compute the average symbol distance D_1 and then apply this to the inverted model. Note that a different model applies to each rank, as shown in Fig. 1. The rank-based entropy models for D(M) = J'(n, r, M) are inverted and the resulting models are shown in Fig. 2.

FIG. 2. Inverse rank-based entropy models for \hat{M}_r(D) = G_r(Θ; M, D_r), shown for ranks r = 1..5. Each model is derived from the initial rank-based model which describes the symbolic distance D_r as a function of M.

Here, a power-based model is used,

\hat{M}_r(D) = a D_r^b + c    (27)

where a = 0.…, b = 4.…, c = 4.…. In the synthetic simulation results, using only 25 symbol coincidences, where the true entropy is H_a(27) = 4.261, application of the rank-based entropy model described in the previous section yields an estimated entropy of H_e(27) = 4.266, indicating the efficacy of the method.
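A rough end-to-end sketch of this synthetic experiment follows, reusing zipf_mandelbrot_li and D_r from the earlier sketches. Two points are our own assumptions rather than the paper's stated procedure: the measurement step is taken to be the mean distance between successive occurrences of the rank-1 symbol, and the forward model is inverted on a grid of candidate M values rather than through the fitted power law (27), whose constants are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
M_true = 27
p = zipf_mandelbrot_li(M_true)               # from the earlier sketch
seq = rng.choice(M_true, size=400, p=p)      # symbol 0 has rank 1

# Mean distance between successive occurrences of the rank-1 symbol.
idx = np.flatnonzero(seq == 0)
D1 = float(np.diff(idx).mean())

# Invert the rank-1 forward model on a grid of candidate alphabet sizes.
M_hat = min(range(2, 200), key=lambda m: abs(D_r(m, 1) - D1))

# Entropy of the fitted Zipf-Mandelbrot-Li model, cf. eq. (26).
q = zipf_mandelbrot_li(M_hat)
H_hat = float(-(q * np.log2(q)).sum())
print(M_hat, H_hat)
```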
B. Entropy of English Text: Tom Sawyer
In this example, the classic English-language text Tom Sawyer was used to test the algorithm. In this case, the rank-1 model was again used, where the highest-ranked symbol corresponds to the space character. Commencing at Chapter 2 of the text, the intersymbol distance was estimated as D(50) = 6.03, which leads to an estimated entropy of H_e(27) = 4.…, as compared with the actual entropy H_a(27) = 4.4. Moreover, the result was obtained using fewer than 300 characters, or 50 words, which is quite remarkable.
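One plausible reading of this measurement protocol is sketched below: the mean gap between successive space characters (the rank-1 symbol) over roughly the first 50 words of the chapter. The helper name and the exact windowing are our assumptions.

```python
def mean_rank1_distance(text, n_words=50):
    """Mean distance between successive occurrences of the space character,
    measured over approximately the first n_words words of the given text."""
    positions = [i for i, ch in enumerate(text) if ch == " "][:n_words + 1]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)

# e.g., with `chapter2` holding the raw text of Chapter 2:
# D_50 = mean_rank1_distance(chapter2)   # the paper reports D(50) = 6.03
```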
IV. CONCLUSION
Shannon entropy is a well-known method of measuring the information content in a sequence of probabilistic symbolic events. In this paper, we have proposed a fast algorithm for estimating Shannon entropy for natural sequences. Using a modified Zipf-Mandelbrot-Li law and a coincidence counting method, we have demonstrated a method which gives extremely fast performance in comparison to other techniques and yet is simple to implement. Examples have been given which show the efficacy of the proposed methodology. It would be of interest to apply this method to various real-world applications to compare the theoretical results against experimentally obtained results. In terms of information-theoretic analytical tools, it may be of interest to consider just how few samples may be required in order to obtain useful results. In order to make the most use of available data, future work could consider optimal strategies for deriving accurate models from multiple symbol ranks; this could be expected to yield fruitful results, especially when there is some 'noise' in the data, e.g., some symbols are missing. Another area of interest in future work will be to analyze the bias of the model, as considered in [23].
Acknowledgments
The authors would like to acknowledge partial support from the Australian Research Council Centre of Excellence for the Dynamics of Language, and helpful discussions with Dr Yvonne Yu and Dr Paul Vrbik.

[1] C. E. Shannon, Bell System Technical Journal 27, 379 (1948).
[2] C. E. Shannon, Bell System Technical Journal 27, 623 (1948).
[3] W. Ebeling and T. Pöschel, Europhysics Letters 26, 241 (1994).
[4] I. Moreno-Sánchez, F. Font-Clos, and Á. Corral, PLOS ONE (2016).
[5] T. Schürmann and P. Grassberger, Chaos 6, 414 (1996).
[6] A. D. Back, D. Angus, and J. Wiles, submitted to IEEE Trans. on Information Theory (2018).
[7] P. Grassberger, Physics Letters A 128, 369 (1988).
[8] J. A. Bonachela, H. Hinrichsen, and M. A. Muñoz, Journal of Physics A: Mathematical and Theoretical, 1 (2008).
[9] M. Paavola, An Efficient Entropy Estimation Approach, Ph.D. thesis, University of Oulu (2011).
[10] A. Lesne, J.-L. Blanc, and L. Pezard, Phys. Rev. E 79, 046208 (2009).
[11] J. Montalvão, D. Silva, and R. Attux, Electronics Letters 48, 1059 (2012).
[12] M. Gerlach, F. Font-Clos, and E. G. Altmann, Phys. Rev. X 6, 021009 (2016).
[13] I. Nemenman, F. Shafee, and W. Bialek, in Advances in Neural Information Processing Systems 14, edited by T. G. Dietterich, S. Becker, and Z. Ghahramani (MIT Press, Cambridge, MA, 2002), pp. 471-478.
[14] T. Schürmann, Neural Computation 27, 2097 (2015).
[15] J. Hausser and K. Strimmer, Journal of Machine Learning Research 10, 1469 (2009).
[16] S. T. Piantadosi, Psychonomic Bulletin & Review 21, 1112 (2014).
[17] W. Li, IEEE Transactions on Information Theory 38, 1842 (1992).
[18] W. Li, Glottometrics 5, 14 (2002).
[19] Á. Corral, G. Boleda, and R. Ferrer-i-Cancho, PLoS ONE 10, e0129031 (2015).
[20] M. A. Montemurro, Physica A 300, 567 (2001).
[21] R. Ferrer i Cancho and R. V. Solé, Proceedings of the Royal Society of London B 268, 2261 (2001).
[22] B. Mandelbrot, The Fractal Geometry of Nature (W. H. Freeman, New York, 1983).
[23] T. Schürmann, Journal of Physics A: Mathematical and General 37 (2004).