Correlations of record events as a test for heavy-tailed distributions
arXiv: [physics.data-an], Jan
J. Franke, G. Wergen, J. Krug
Institute of Theoretical Physics, University of Cologne, Zülpicher Strasse 77, 50937 Köln, Germany
(Dated: November 9, 2018)
A record is an entry in a time series that is larger or smaller than all previous entries. If the time series consists of independent, identically distributed random variables with a superimposed linear trend, record events are positively (negatively) correlated when the tail of the distribution is heavier (lighter) than exponential. Here we use these correlations to detect heavy-tailed behavior in small sets of independent random variables. The method consists of converting random subsets of the data into time series with a tunable linear drift and computing the resulting record correlations.
Determining the probability distribution underlying a given data set, or at least its behavior for large argument, is of pivotal importance for predicting the behavior of the system: If the data is drawn from a distribution with heavy tails, one needs to prepare for large events. Of particular relevance is the case when the probability density displays a power law decay, as this implies a drastic enhancement of the probability of extreme events. This is one of the reasons for the persistent interest in the observation and modeling of power law distributions, which have been associated with critical, scale-invariant behavior [1, 2] in diverse contexts ranging from complex networks [3] to paleontology [4], foraging behavior of animals [5], citation distributions [6] and many more [7].
However, when trying to infer the tail behavior of the underlying distribution from a finite data set, one faces the problem that the number of entries of large absolute value is very small. This implies that even though binning the entries by magnitude and plotting them would yield an approximate representation of the probability density, this process becomes inconclusive in particular in the tail of the probability density. Furthermore, in small data sets, extreme outliers can strongly affect the results of methods like maximum likelihood estimators such that leaving out even one of these extreme and possibly spurious data points renders the outcome of the test insignificant. A case in point is the problem of estimating the distribution of fitness effects of beneficial mutations in evolution experiments, which are expected on theoretical grounds to conform to one of the universality classes of extreme value theory (EVT) [8].
Because beneficial mutations are rare, the corresponding data sets are typically limited to a few dozen values, and the determination of the tail behavior can be very challenging [9, 10].
In this Letter we present a method for detecting heavy tails in empirical data that works reliably for small data sets (on the order of a few dozen entries) and is robust with respect to removal of extreme entries. The test is based on the statistics of records of subsamples of the data set. Similar to conventional record-based statistical tests [11–13], and in contrast to the bulk of methods available in this field [7], our approach is non-parametric and, hence, does not require any hypothesis about the underlying distribution. Rather than aiming at reliable estimates of the parameters of the distribution (such as the power law exponent), the main purpose of our method is to distinguish between distributions that are heavy-tailed and those that are not.
Record statistics and record correlations.
Given a time series $\{x_1, \ldots, x_N\}$ of random variables (RVs), the $n$th RV is said to be a record if it exceeds all previous RVs $\{x_j\}_{j<n}$.
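A minimal Python sketch of the record definition may help fix ideas. It considers upper records only and treats the first entry as a record by convention (an assumption made here); the helper names `is_record` and `record_indices` are hypothetical and not taken from the original work.

```python
from typing import List, Sequence


def is_record(series: Sequence[float], n: int) -> bool:
    """Return True if the n-th entry (1-indexed) exceeds all previous entries.

    Assumption: the first entry counts as a record, since there is
    nothing before it to exceed.
    """
    current = series[n - 1]
    return all(current > previous for previous in series[: n - 1])


def record_indices(series: Sequence[float]) -> List[int]:
    """Return the 1-indexed positions of all upper records in the series."""
    return [n for n in range(1, len(series) + 1) if is_record(series, n)]


if __name__ == "__main__":
    print(record_indices([1.0, 3.0, 2.0, 5.0, 4.0]))  # prints [1, 2, 4]
```

Ties are not counted as records in this sketch, which matches the strict reading of "exceeds".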
[...] 0. Thus for large $n$, there are two competing contributions to $J(n)$ determining the sign of the correlations.
To classify the behavior of the correlations in terms of the EVT classes [1, 22], consider the generalized Pareto distribution [23] $f(\eta) = (1 + \kappa\eta)^{-(\kappa+1)/\kappa}$, which reproduces the three classes as $\kappa < 0$ (Weibull), $\kappa > 0$ (Fréchet) and $\kappa = 0$ (Gumbel), respectively. Computing $I(n)$ separately for these three cases [18], it was shown that, up to multiplicative terms in $\log(n)$ or slower, one has $I(n) \sim n^{-(2+\kappa)}$ and therefore [21] $J(n) \approx \kappa\, n\, I(n)$, showing that the sign of correlations is directly determined by the extreme value index $\kappa$ [25].
In the Gumbel class ($\kappa = 0$), more refined calculations for the generalized Gaussian densities $f_\beta(\eta) \sim \exp(-|\eta|^\beta)$ show that correlations are negative for $\beta > 1$ and positive for $\beta < 1$. [...] $n$ dependence: While for $\beta < 1$ correlations grow with $n$ up to a limiting value, for $\beta = 1$ they are independent of $n$. The special, marginal role of the exponential distribution was also encountered in a study of near-extreme events [24], where the integral (3) appears in a different context.
To sum up, correlations between record events in time series with a linear drift allow a clear distinction between underlying probability densities that decay like an exponential or faster for large argument, and densities with heavier tails, by looking for positive correlations that grow in $n$. Using these two criteria, we now present a distribution-free test for heavy tails in data sets of i.i.d. random variables.
Description of the test.
Consider a data set with $N$ entries, $x_1, x_2, \ldots, x_N$, that can reasonably be argued to consist of independent samples from the same distribution [26]. Then for each $n < N$, one can pick uniformly at random a subset of $n$ entries and add a linear trend according to the index in the subset (see Eq. (1)), thus forming a set of random variables with linear trend. For each $n$, there are $\binom{N}{n}$ possible subsets [27], which can be used to compute the fraction of times the $n$th entry is a record, $\hat p_n(c)$, the corresponding fraction $\hat p_{n-1}(c)$ for the $(n-1)$th entry, and the fraction $\hat p_{n,n-1}(c)$ of times both entries are records, for each value of a suitably chosen range of $c$ [28]. The number $s$ of subsets used for each value of $c$ will be referred to as 'internal statistics'. Finally, one obtains an estimate for the correlations
$$\hat l_{n,n-1}(c) = \frac{\hat p_{n,n-1}(c)}{\hat p_n(c)\,\hat p_{n-1}(c)},$$
where the hat serves to indicate that we are dealing with one fixed time series of length $N$ and its sub-series, rather than many independent realizations. In the following we refer to $\hat l_{n,n-1}(c)$ as the heavy tail indicator (HTI).
To see how the test works in practice, consider Fig. 1. Two data sets of size $N = 64$ each are presented, one drawn from a standard Gaussian distribution, the other from a symmetric Lévy stable distribution with parameter $\mu = 1.3$. A standard approach to inferring the shape of the distribution is to estimate the cumulative distribution function by rank ordering the data along the $x$-axis (inset). In the example this shows that one distribution is broader than the other, but does not allow one to distinguish between a difference in scale (as for two Gaussians of different standard deviation) and a difference in shape. In contrast, the two data sets come apart quite clearly under application of the test, showing that $\hat l_{n,n-1}(c) > 1$ for the Lévy data and $\hat l_{n,n-1}(c) < 1$ for the Gaussian data.
Figure 1. A first example of the proposed test, with $N = 64$ i.i.d. RVs drawn from a Gaussian with unit variance (squares) and a symmetric Lévy distribution $L_\mu(x)$ with $\mu = 1.3$. Inset: Comparing the cumulative distribution function $F(x)$ (lines) to its empirical estimate from the 64 data points shows that one distribution is broader than the other but does not allow for a clear distinction between the two data sets. Main plot: This difference is however clearly seen under application of the record-based test for subsamples of size $n = 16$. Dotted and dashed-dotted lines show the prediction for $l_{n,n-1}(c)$ for independent RVs.
Fluctuations.
The lines in the main part of Fig. 1 show the predicted correlation $l_{n,n-1}(c)$ obtained from simulations of independent RVs. The estimated HTI $\hat l_{n,n-1}(c)$ obtained from subsamples of the two finite data sets deviates from these predictions, reflecting the fact that the ensemble of subsamples is not independent. The deviations depend on the data set in a random way (compare Fig. 4), and understanding how the magnitude of the deviations depends on the test parameters $N$, $n$ and $s$ is clearly important for a quantitative assessment of the significance of the test. Figure 2 explores these sample-to-sample fluctuations by computing $\hat l_{n,n-1}(c)$ for a large number $S$ ('external statistics') of different data sets and recording the mean and the mean squared deviation for different distributions. The fluctuations are large for power law distributions and decrease significantly for representatives of the Gumbel and Weibull classes. The latter implies that it is very unlikely for positive correlations to be produced by chance if the underlying distribution is not of heavy tail type; the observation of an HTI exceeding unity can therefore be taken as a strong indication of heavy tailed behavior in the data.
Figure 2. Sample-to-sample fluctuations of the HTI $\hat l_{n,n-1}$ for different distributions. Lines show mean values of the correlation $l_{n,n-1}(c)$ obtained from simulations of independent RVs (labeled exact), symbols show the mean value of the HTI, and error bars indicate the standard deviation of the fluctuations for the symmetric Lévy-stable distribution with tail parameter $\mu = 1.$[...] and the uniform distribution (top), and the Pareto distribution with $\mu = 2.$[...] and the Gaussian distribution (bottom). [...] $s = 10$[...] (Pareto) or $s = 10$[...] (all other) and averaged over $S = 10$[...] independent data sets. Insets show how the correlations at the value $c^* = 0.25$, where correlations deviate maximally from unity, grow as a function of $n$ while keeping $N$ fixed.
The effect of the internal statistics on the sample-to-sample fluctuations is quantified in Fig. 3, where their magnitude can be seen to saturate to a limiting value with increasing $s$. Furthermore, the limiting value depends on the ratio $n/N$: The smaller the subset of the initial data set that is used, the more precise the results can be made by using large internal statistics. This behavior underlines a particular strength of our approach, namely that the combinatorially large number of subsequences can be used (up to a point) to reduce fluctuations due to the finite size of the data set. On the other hand, $n$ should not be chosen too small, as the amplitude of correlations generally increases with $n$ [21] (see inset of Fig. 2). For the examples presented here, we found $n/N = 1/4$ with $N = 64$ to yield the best compromise between these two contradicting requirements, see also Fig. 3.
Figure 3. Magnitude of sample-to-sample fluctuations for three of the cases considered in Fig. 2. With increasing internal statistics $s$, the sample-to-sample fluctuations decrease to a limiting value (main plot). This limiting value increases with $n/N$ (inset), indicating that best results in terms of fluctuations are obtained by considering short subsequences. In the main plot $N = 64$ and $n = 16$.
Application.
As an application of our approach, we consider the ISI citation data set first analyzed by Redner [6], consisting of citation data for 783339 papers published in 1981 and cited between 1981 and June 1997. Due to the large size of this data set, the existence of a power law tail with exponent $\mu \approx$ [...] $N = 64$ papers each (Fig. 4). Despite the substantial fluctuations between the three subsets, the HTI lies clearly above unity in all cases. The small size of the chosen subsets implies that only a few (if any) data points in the subsets come from the extreme tails of the distribution. The lower panel in Fig. 4 illustrates the robustness of the test with respect to the removal of putative outliers.
Summary.
In conclusion, in this Letter we propose a record-based distribution-free test for heavy tails that works particularly well for small data sets. It was shown that the test is very versatile and quite robust to the removal of outliers, thus complementing standard methods like maximum likelihood estimates [7]. While record statistics has a long history of yielding distribution-free tests [11–13], our approach is conceptually novel in that we make systematic use of the combinatorial proliferation of subsets of the original data set, which are then manipulated by adding a linear drift. We expect our method to be particularly useful in situations where the size of the data set is intrinsically limited, as in the assignment of an EVT universality class to the distribution of beneficial mutations in population genetics [9, 10]. In particular, the test can be used to strengthen the evidence in favor of heavy-tailed behavior in situations where conventional parametric tests have insufficient statistical power. By combining our test with standard approaches such as the maximum likelihood method, the tail parameters can then also be estimated.
Figure 4. Top: Three randomly chosen subsets of length $N = 64$ each from the ISI citation data set [6]. The HTI was computed with internal statistics $s = 10$[...] and $n = 16$. The main plot shows attractive correlations in all three cases, the inset verifies growth of these correlations with $n$. Bottom: Removing the largest and even the top two entries of data set 2 does not change the result of the test. In data set 3, which is a somewhat extreme case in that the largest value is more than a factor 10 greater than the second largest, the correlations remain attractive upon removal of the largest entry but the magnitude of correlations no longer increases with $n$.
The authors would like to thank Sid Redner for supplying the ISI data and Ivan Szendro for useful discussions. We acknowledge financial support from Studienstiftung des deutschen Volkes (JF), Friedrich-Ebert-Stiftung (GW) and DFG within BCGS and SFB 680.
[1] D. Sornette, Critical Phenomena in Natural Sciences. Springer, Berlin (2004).
[2] K. Christensen and N.R. Moloney,
Complexity and Criticality. Imperial College Press, London (2005).
[3] A.-L. Barabási and R. Albert, Science, 5439 (1999); M.E.J. Newman, SIAM Review, 2 (2003).
[4] M.E.J. Newman and P. Sibani, Proc. R. Soc. Lond. B, 1593 (1999).
[5] G.M. Viswanathan et al., Nature, 413 (1996); A.M. Edwards et al., Nature, 1044 (2007).
[6] S. Redner, Eur. Phys. J. B, 131 (1998).
[7] A. Clauset, C.R. Shalizi and M.E.J. Newman, SIAM Review, 661 (2009).
[8] P. Joyce, D.R. Rokyta, C.J. Beisel and H.A. Orr, Genetics, 1627 (2008).
[9] D.R. Rokyta et al., J. Mol. Evol., 368 (2008).
[10] C.R. Miller, P. Joyce and H.A. Wichman, Genetics, 185 (2011).
[11] F.G. Foster and A. Stuart, J. Roy. Stat. Soc. B, 1 (1954).
[12] N. Glick, Amer. Math. Monthly, 2 (1978).
[13] S. Gulati and W. J. Padgett, Parametric and Nonparametric Inference from Record-Breaking Data (Springer, New York 2003).
[14] V.B. Nevzorov, Records: Mathematical Theory, Am. Math. Soc. (2001).
[15] J. Krug, J. Stat. Mech.: Theor. Exp. P07001 (2007).
[16] S.N. Majumdar and R.M. Ziff, Phys. Rev. Lett., 050601 (2008).
[17] I. Eliazar and J. Klafter, Phys. Rev. E, 061117 (2009).
[18] J. Franke, G. Wergen and J. Krug, J. Stat. Mech.: Theor. Exp. P10013 (2010).
[19] S. Sabhapandit, EPL, 20003 (2011).
[20] R. Ballerini and S. Resnick, J. Appl. Prob., 487 (1985); Adv. Appl. Prob., 801 (1987).
[21] G. Wergen, J. Franke and J. Krug, J. Stat. Phys., 1206 (2011).
[22] L. de Haan and A. Ferreira, Extreme Value Theory: An Introduction. Springer, Berlin (2006).
[23] J. Pickands III, Ann. Stat., 119 (1975).
[24] S. Sabhapandit and S.N. Majumdar, Phys. Rev. Lett., 140201 (2007).
[25] For $\kappa > 1$, corresponding to distributions without a mean, the correction term $J(n)$ is a decreasing function of $n$ and the correlations vanish asymptotically for large $n$. However, for finite $n$ there are positive correlations of substantial magnitude, and the proposed test is still applicable.
[26] The i.i.d. assumption can be checked using a conventional record-based test, see [12]. Numerical simulations show that weak correlations in the data (e.g., generated by a first order autoregressive process) do not invalidate our method.
[27] In addition, permutations of the subsets can be considered, which are not equivalent in the presence of drift.
[28] The range of cc
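The subsampling procedure outlined under "Description of the test" can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the drift is assumed to take the form $x_k + ck$ for the $k$th entry of the subset, subsets are drawn in random order, and the function name `heavy_tail_indicator` and its parameters are hypothetical.

```python
import random
from typing import Optional, Sequence


def is_last_record(values: Sequence[float]) -> bool:
    """True if the last entry strictly exceeds all earlier entries."""
    return all(values[-1] > v for v in values[:-1])


def heavy_tail_indicator(
    data: Sequence[float],
    n: int,
    c: float,
    s: int,
    rng: Optional[random.Random] = None,
) -> float:
    """Estimate p_{n,n-1} / (p_n * p_{n-1}) from s random subsets of size n.

    Assumption: the linear drift is x_k + c * k with k the 1-indexed
    position inside the subset.
    """
    if not 2 <= n <= len(data):
        raise ValueError("need 2 <= n <= len(data)")
    rng = rng or random.Random()
    hits_n = hits_prev = hits_both = 0
    for _ in range(s):
        subset = rng.sample(list(data), n)  # random subset in random order
        drifted = [x + c * k for k, x in enumerate(subset, start=1)]
        rec_n = is_last_record(drifted)          # n-th entry is a record
        rec_prev = is_last_record(drifted[:-1])  # (n-1)-th entry is a record
        hits_n += rec_n
        hits_prev += rec_prev
        hits_both += rec_n and rec_prev
    if hits_n == 0 or hits_prev == 0:
        return float("nan")  # indicator undefined without any record events
    return (hits_both / s) / ((hits_n / s) * (hits_prev / s))


if __name__ == "__main__":
    rng = random.Random(1)
    gaussian = [rng.gauss(0.0, 1.0) for _ in range(64)]
    pareto = [rng.paretovariate(1.5) for _ in range(64)]
    for name, sample in (("Gaussian", gaussian), ("Pareto", pareto)):
        print(name, heavy_tail_indicator(sample, n=16, c=0.25, s=20000, rng=rng))
```

In practice one would evaluate the indicator over a range of $c$ and several values of $n$; values clearly above unity that grow with $n$ would point toward a heavy tail.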