A new quantity for statistical analysis: "Scaling invariable Benford distance"

Abstract

For the first time, we introduce the "Scaling invariable Benford distance" and the "Benford cyclic graph", which can be used to analyze any data set. Using the quantity and the graph, we analyze some data sets with common distributions, such as normal and exponential, and find that different data sets have markedly different values of the "Scaling invariable Benford distance" and different features in the "Benford cyclic graph". We also explore the influence of data size on the "Scaling invariable Benford distance", and find that it first decreases as the data size increases, then approaches a fixed value when the size is large enough.


arXiv [physics.data-an]

A new quantity for statistical analysis: "Scaling invariable Benford distance"

Peiyan Luo and Yongqing Li
College of Nuclear Technology and Automation Engineering, Chengdu University of Technology, Chengdu, China
College of Physical Science and Technology, Sichuan University, Chengdu, China
(Dated: October 11, 2018)

I. INTRODUCTION

The nine digits 1–9 obtained from first-digit analysis of typical everyday numbers might be expected to occur randomly and thus with equal frequency. But in 1881, Newcomb [1] found that the first-digit proportions of many numbers were quite unequal, and in 1938, Benford [2] gave the exact expression $P_B(d) = \log_{10}(1 + 1/d)$, where $P_B(d)$ is the probability of first digit $d$ occurring in a data set. This is known as Benford's law. Since the law was established, much research [3–16] has been done on theoretical extensions, on analyzing more cases, and on applying the law to other fields. Evidently, not all data sets have the same first-digit distribution as Benford's law. Moreover, a data set of finite size yields rational first-digit proportions, while Benford's law gives irrational numbers; in other words, a finite data set will never exactly meet the logarithmic law. Meanwhile, we usually do not know how the first-digit proportions change as the data size increases, nor do we know the limit of the proportions.

The question is how large the difference is between the first-digit proportions of a data set and the first-digit law, and this is the focal issue of this paper. A data set satisfying Benford's law is scaling symmetric, which means that it still satisfies the law when every member of the data set is multiplied by an arbitrary number. If the data set is not consistent with the law, it is scaling asymmetric. So we propose a new quantity that does not vary with units or scales; such a quantity is necessary because the units and scales of a data set are always chosen artificially. This is the "Scaling invariable Benford distance" discussed in the following sections.

II. "SCALING INVARIABLE BENFORD DISTANCE"

Let $A$ be a data set of finite size. First, a distance $I[A]$ is defined using Benford's law,

$$I[A] = \sqrt{\sum_{d=1}^{9} \left[ P_A(d) - P_B(d) \right]^2} \tag{1}$$

where $P_A(d)$ is the probability of first digit $d$ occurring in $A$. Then, a new data set $A_\alpha$ is obtained by transforming $A$, $A_\alpha = \{ x_\alpha \mid x_\alpha = x \times 10^{\alpha},\ x \in A \}$, where $\alpha$ is a random number between 0 and 1. $I[A_\alpha]$ is defined in the same way as $I[A]$,

$$I[A_\alpha] = \sqrt{\sum_{d=1}^{9} \left[ P_{A_\alpha}(d) - P_B(d) \right]^2} \tag{2}$$

where $P_{A_\alpha}(d)$ is the probability of first digit $d$ occurring in $A_\alpha$. Generally, $I[A] \neq I[A_\alpha]$.

Now, we define a new quantity $I_{inva}[A_\alpha]$ in Eq. (3), and name it the "Scaling invariable Benford distance",

$$I_{inva}[A_\alpha] = \int_0^1 I[A_{\alpha+\beta}] \, d\beta \tag{3}$$

Multiplying every member by 10 leaves all first digits unchanged, so $I[A_\gamma]$ is periodic in $\gamma$ with period 1, and its integral over one full period does not depend on the starting point $\alpha$. Hence $I_{inva}[A_\alpha] = I_{inva}[A_\beta] = I_{inva}[A]$, where $I_{inva}[A_\beta]$ and $I_{inva}[A]$ are defined in the same way as $I_{inva}[A_\alpha]$. That is to say, $I_{inva}[A_\alpha]$ is a fixed value that is independent of $\alpha$.

Thus, for any data set $A_\alpha$, there is a quantity, the "Scaling invariable Benford distance", that does not change with units or scales. The central question is addressed through this quantity, which measures the difference between the first-digit proportions of a data set and Benford's law.

III. ANALYSIS OF DATA SETS

A. Data sets satisfying and approximating Benford's law

The data set $X$ shown in Eq. (4) is consistent with Benford's law,

$$X = \{ x \mid x = 10^{\beta},\ \beta \in [0, 1) \text{ with uniform distribution} \} \tag{4}$$

Let $Y$ be an arbitrary data set,

$$Y = \{ y \mid y \text{ is an arbitrary nonzero number} \} \tag{5}$$

Then a data set $Z$ can be constructed from $X$ and $Y$,

$$Z = \{ z \mid z = x \times y,\ x \in X,\ y \in Y \} \tag{6}$$

For nonzero $Y$ with any distribution, the data set $Z$ satisfies the first-digit law, which is proved below in a way different from Hamming [17].

First, $Y$ can be rewritten as

$$Y = \{ y(\beta_0, t) \mid y(\beta_0, t) = 10^{\beta_0 + N(t)},\ \beta_0 \in [0, 1),\ N(t) \text{ is an integer} \} \tag{7}$$

where only $\beta_0$ contributes to the first digit. Then,

$$\begin{aligned}
Z &= \{ z \mid z = 10^{\beta + \beta_0 + N(t)} \} \\
&= \{ z \mid z = 10^{\tilde\beta + N(t)},\ \tilde\beta \in [\beta_0, \beta_0 + 1) \} \\
&= \{ z \mid z = 10^{\tilde\beta + N(t)},\ \tilde\beta \in [\beta_0, 1) \} \cup \{ z \mid z = 10^{\tilde\beta + N(t)},\ \tilde\beta \in [1, \beta_0 + 1) \} \\
&= \{ z \mid z = 10^{\tilde\beta + N(t)},\ \tilde\beta \in [\beta_0, 1) \} \cup \{ z \mid z = 10^{\tilde\beta + N(t) + 1},\ \tilde\beta \in [0, \beta_0) \}
\end{aligned} \tag{8}$$

In addition, data set $X$ can also be expressed as

$$X = \{ z \mid z = 10^{\beta},\ \beta \in [\beta_0, 1) \} \cup \{ z \mid z = 10^{\beta},\ \beta \in [0, \beta_0) \} \tag{9}$$

Thus, data set $Z$ has the same first-digit distribution as $X$. That is to say, $Z$ satisfies Benford's law.

Notice that data sets like $Z$ form a large category because $Y$ is arbitrary. Thus, a real-world data set that approximates Benford's law may be a data set like $Z$, for example the USA census data "pop-2009" [3]. Here, we produce a data set named "pop-c" with the same distribution as "pop-2009", where "pop-c" is generated from two data sets: $X$ defined in Eq. (4), and $Y$ with the distribution shown in graph (c) of FIG. 1. The distributions of "pop-2009", "pop-c" and $X$ are also shown in FIG. 1.

TABLE I. Distribution functions

Normal     $p(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$,  $x \in (-\infty, \infty)$
Exponent   $p(x) = e^{-x}$,  $x \in [0, \infty)$
Uniform    p(x) = 1/…,  x ∈ [1, …
Constant   $p(x) = 1$,  $x = 9$

TABLE II. $I_{inva}[A_\alpha]$ for different data sets

Normal   Exponent   Uniform   Constant
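The claim of part III A, that $Z = x \times y$ follows the first-digit law for an arbitrary nonzero $Y$, can be checked numerically. The following is a minimal sketch rather than the authors' code: it uses only the Python standard library, pairs each $x$ with one independently drawn $y$ (one possible reading of Eq. (6)), and takes $Y$ uniform on [1, 7) purely as an illustrative choice. The sample size and the names `first_digit`, `digit_proportions` and `distance_to_benford` are assumptions introduced here.

```python
import math
import random

# Benford probabilities P_B(d) = log10(1 + 1/d) for d = 1..9
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def first_digit(x):
    """Leading decimal digit of a nonzero number, read from scientific notation."""
    return int(f"{abs(x):.17e}"[0])

def digit_proportions(data):
    """Observed proportions of first digits 1..9 in a data set of nonzero numbers."""
    counts = [0] * 9
    for x in data:
        counts[first_digit(x) - 1] += 1
    return [c / len(data) for c in counts]

def distance_to_benford(data):
    """Euclidean distance I[A] between observed and Benford proportions, Eq. (1)."""
    props = digit_proportions(data)
    return math.sqrt(sum((p - b) ** 2 for p, b in zip(props, BENFORD)))

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 20000                # illustrative sample size (assumption)
X = [10 ** rng.random() for _ in range(n)]      # Eq. (4)
Y = [rng.uniform(1.0, 7.0) for _ in range(n)]   # an arbitrary non-Benford Y (assumption)
Z = [x * y for x, y in zip(X, Y)]               # Eq. (6), paired draws

print("I[Y] =", round(distance_to_benford(Y), 4))  # Y alone is far from Benford
print("I[Z] =", round(distance_to_benford(Z), 4))  # Z is close, up to sampling noise
```

Because the data set is finite, `distance_to_benford(Z)` is small but not exactly zero, which is the point made in the introduction about finite samples.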

B. Analyzing some common data sets

Now, we analyze some typical data sets, whose distribution functions are shown in TABLE I; these four cases are labeled Normal, Exponent, Uniform and Constant, respectively. The members of each case are computer-generated random numbers, and the size of each data set is set to 10^[…].

For each data set, we give the "Scaling invariable Benford distance" $I_{inva}[A_\alpha]$ [Eq. (3)] in TABLE II. Each data set has a definite value of $I_{inva}[A_\alpha]$, where data set Exponent has a very small value and Constant a large one. Interestingly, for data sets with a normal distribution that has zero mean but arbitrary standard deviation, the calculated $I_{inva}[A_\alpha]$ is the same, a fixed value of about 0.087 in TABLE II. So a data set with any normal distribution, which can easily be transformed to zero mean, has a single value of the "Scaling invariable Benford distance".

Then, we calculate $I[A_\alpha]$ defined in Eq. (2) for any given $\alpha$, which is evenly distributed on the interval [0, 1). The results are shown in FIG. 2 in polar coordinates, where the radial coordinate is $I[A_\alpha]$ and the angular coordinate is transformed from $\alpha$. Evidently, different data sets have different figure features, and such a graph is called a "Benford cyclic graph" here. Moreover, if we change the units or scales of a data set, we get a "Benford cyclic graph" of the same shape but rotated. For instance, in FIG. 3, the upper two graphs show the results for data sets transformed from the above data set Normal, where all members are multiplied by 2 (left graph) and 5 (right graph); correspondingly, the graphs are rotated counterclockwise through 108 and 252 degrees compared to graph (a) of FIG. 2. The lower two graphs show the results for data sets transformed similarly from Uniform.

FIG. 1. Distributions of the data sets: (a) "pop-2009" [3] (line) and "pop-c" (dot), (b) $X$ defined in Eq. (4), (c) $Y$ produced by us. (Axes: probability density versus N.)

FIG. 2. As the radial coordinate, $I[A_\alpha]$ [Eq. (2)] varies with the angular coordinate, which is transformed from $\alpha \in [0, 1)$ with uniform distribution. The four graphs are the results for four data sets [TABLE I]: (a) Normal, (b) Exponent, (c) Uniform and (d) Constant.

That is, both the "Scaling invariable Benford distance" and the "Benford cyclic graph" can be used to identify and classify data sets; the former is easy to use, while the latter gives much more information.
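The quantities of Eqs. (2) and (3) and the points of a "Benford cyclic graph" can be computed with a few lines of code. The sketch below is one possible implementation under stated assumptions, not the authors' program: the scaled set is taken as $x \times 10^{\alpha}$, the integral in Eq. (3) is approximated by an average over an even grid of $\alpha$, the polar angle is assumed to be $360^\circ \times \alpha$, and all function names, the grid of 360 steps and the sample size are hypothetical choices. It relies on the fact that the first digit of $x$ depends only on the fractional part of $\log_{10}|x|$.

```python
import bisect
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]  # P_B(d), d = 1..9
EDGES = [math.log10(d) for d in range(1, 11)]  # digit d <=> mantissa in [EDGES[d-1], EDGES[d])

def frac(v):
    """Fractional part in [0, 1); guards against v % 1.0 rounding up to 1.0."""
    f = v % 1.0
    return 0.0 if f >= 1.0 else f

def sorted_mantissas(data):
    """Sorted fractional parts of log10|x|; these alone fix every first digit."""
    return sorted(frac(math.log10(abs(x))) for x in data if x != 0)

def count_in(ms, lo, hi):
    """Number of sorted mantissas m with lo <= m < hi."""
    return bisect.bisect_left(ms, hi) - bisect.bisect_left(ms, lo)

def distance(ms, alpha=0.0):
    """I[A_alpha] of Eq. (2): multiplying by 10**alpha shifts every mantissa by alpha."""
    total = 0.0
    for d in range(9):
        lo, hi = frac(EDGES[d] - alpha), frac(EDGES[d + 1] - alpha)
        if lo < hi:
            c = count_in(ms, lo, hi)
        else:  # the shifted digit interval wraps around 1
            c = count_in(ms, lo, 1.0) + count_in(ms, 0.0, hi)
        total += (c / len(ms) - BENFORD[d]) ** 2
    return math.sqrt(total)

def cyclic_graph(data, steps=360):
    """(angle in degrees, I[A_alpha]) pairs; the angle 360*alpha is an assumed mapping."""
    ms = sorted_mantissas(data)
    return [(360.0 * k / steps, distance(ms, k / steps)) for k in range(steps)]

def invariant_distance(data, steps=360):
    """Eq. (3) approximated by averaging I[A_alpha] over an even grid of alpha."""
    return sum(r for _, r in cyclic_graph(data, steps)) / steps

rng = random.Random(1)                                   # fixed seed for repeatability
normal = [rng.gauss(0.0, 1.0) for _ in range(20000)]     # Normal of TABLE I, assumed size
print(round(invariant_distance(normal), 3))                     # the text quotes about 0.087 for Normal
print(round(invariant_distance([3.7 * x for x in normal]), 3))  # rescaled data, nearly the same value
```

The pairs returned by `cyclic_graph` can be passed to any polar plotting routine. Sorting the mantissas once and counting with `bisect` keeps the cost of each $\alpha$ step independent of the data size apart from a logarithmic factor.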

IV. THE INFLUENCE OF THE DATA SIZE ON FIRST DIGIT PROPORTIONS

Here, we explore the influence of the data size on first-digit proportions using the "Scaling invariable Benford distance". We calculate $I_{inva}[A_\alpha]$ for three data sets mentioned above: data set $X$ defined in Eq. (4), and Exponent and Normal from TABLE I. The results are shown in FIG. 4, where the data size ranges from 10^[…] to 10^[…]. Evidently, if the data size is less than 10^[…], $I_{inva}[A_\alpha]$ decreases greatly as the size increases, and then it approaches a fixed value. Furthermore, for data set $X$ the limiting value of $I_{inva}[A_\alpha]$ is zero because its first-digit proportions are consistent with Benford's law, while the limiting value is nonzero for the other data sets.
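The size dependence described above can be reproduced qualitatively with a short experiment. This is a sketch under assumptions, not the procedure used for FIG. 4: the sizes, the seed, the grid of 200 steps and the function name `invariant_distance` are illustrative choices, the scaled set is taken as $x \times 10^{\alpha}$, and the integral of Eq. (3) is approximated by a grid average.

```python
import bisect
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]  # P_B(d), d = 1..9
EDGES = [math.log10(d) for d in range(1, 11)]            # mantissa boundaries of the digits

def invariant_distance(data, steps=200):
    """Grid approximation of Eq. (3), using sorted mantissas of log10|x|."""
    ms = sorted(math.log10(abs(x)) % 1.0 for x in data if x != 0)
    n, total = len(ms), 0.0
    for k in range(steps):
        alpha = k / steps
        sq = 0.0
        for d in range(9):
            # digit d+1 occurs when the mantissa shifted by alpha falls in [EDGES[d], EDGES[d+1])
            lo, hi = (EDGES[d] - alpha) % 1.0, (EDGES[d + 1] - alpha) % 1.0
            c = bisect.bisect_left(ms, hi) - bisect.bisect_left(ms, lo)
            if lo >= hi:  # interval wraps past 1: count [lo, 1) plus [0, hi)
                c += n
            sq += (c / n - BENFORD[d]) ** 2
        total += math.sqrt(sq)
    return total / steps

rng = random.Random(2)               # fixed seed for repeatability
sizes = [100, 1000, 10000, 100000]   # illustrative sizes (assumption)
results = {}
for n in sizes:
    benford_set = [10 ** rng.random() for _ in range(n)]  # X of Eq. (4)
    exponent = [rng.expovariate(1.0) for _ in range(n)]   # Exponent of TABLE I
    results[n] = (invariant_distance(benford_set), invariant_distance(exponent))
    print(n, round(results[n][0], 4), round(results[n][1], 4))
```

In such a run the value for $X$ keeps shrinking as the size grows, while the value for Exponent levels off at a nonzero number, in line with the behavior described for FIG. 4.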

FIG. 3. As the radial coordinate, $I[A_\alpha]$ [Eq. (2)] varies with the angular coordinate, which is transformed from $\alpha \in [0, 1)$ with uniform distribution. The upper two graphs show the results for data sets transformed from Normal [TABLE I], where all members are multiplied by 2 (left graph) and 5 (right graph). The lower two graphs show the results for data sets transformed similarly from Uniform [TABLE I].
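The rotation angles quoted for the rescaled data sets can be recovered with a short calculation. It assumes that the scaled set is $x \times 10^{\alpha}$ and that one full turn of the graph corresponds to $\alpha$ running over one unit interval; the symbol $\theta(k)$ for the rotation produced by a factor $k$ is introduced here for illustration only.

```latex
\begin{aligned}
I\big[(kA)_{\alpha}\big] &= I\big[A_{\alpha + \log_{10} k}\big], \\
\theta(k) &= 360^\circ \times \big(\log_{10} k \bmod 1\big), \\
\theta(2) &= 360^\circ \times 0.30103 \approx 108.4^\circ, \\
\theta(5) &= 360^\circ \times 0.69897 \approx 251.6^\circ, \\
\theta(2) + \theta(5) &= 360^\circ \quad \text{since } 2 \times 5 = 10 .
\end{aligned}
```

These values agree, after rounding, with the 108 and 252 degrees stated in part III B. The sense of rotation depends on how $\alpha$ is mapped to the polar angle.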

V. CONCLUSION

According to the introduction of this paper, it is too risky to assert that the first-digit proportions of a real-world data set are consistent with Benford's law. For example, the data set of distances of stars in the Milky Way [3] is supposed to fit the law extremely well. However, the "Scaling invariable Benford distance" of this case is 0.0438, while it is just 0.0308 for data set Exponent with the same data size (48111). Evidently, Exponent cannot be a case that satisfies the law, because the limiting value of its $I_{inva}[A_\alpha]$ is a nonzero number; so neither can the star distances. Such circumstances are handled by the "Scaling invariable Benford distance" first introduced in this paper.

FIG. 4. $I_{inva}[A_\alpha]$ [Eq. (3)] varies with data size, which ranges from 10^[…] to 10^[…]. The graphs are the results for data sets (a) $X$ defined in Eq. (4), (b) Exponent, (c) Normal; both (b) and (c) are from TABLE I.

Using this new quantity, we have analyzed some typical data sets, and the results show that different data sets have markedly different values of the "Scaling invariable Benford distance". We have also explored how the quantity varies with the data size, and found that $I_{inva}[A_\alpha]$ approaches a fixed value when the size is large enough; the value is zero for data set $X$ [Eq. (4)], which fits the digit law, and nonzero for the other data sets. In addition, we have introduced the "Benford cyclic graph", which can also identify and classify data sets, as the "Scaling invariable Benford distance" does, and in part III A we have given a proof, different from that of Hamming [17], that a large category of data sets satisfy Benford's law.

In general, the "Scaling invariable Benford distance" and the "Benford cyclic graph" can be used to analyze any data set and can be regarded as a statistical method, extending the research and applications of Benford's law. For instance, one application is identifying the authenticity of given data sets, which formerly had to approximate the logarithmic law but now need not. Another example is that we can test whether the distribution of a data set is what it is considered to be, such as the normal distribution discussed in part III B. Regarding these points, further study needs to be done on the deeper meaning and broader application of the "Scaling invariable Benford distance" and the "Benford cyclic graph".

Moreover, our results raise an interesting question: we apply statistics to deal with real-world data sets, which usually carry uncertainty, but statistics gives results only in the limiting case, while the data size is always limited. We do not know what problems this gap between limited data sets and the infinity assumed in statistics will bring. If there are such problems, methods like ours, namely the "Scaling invariable Benford distance" and the "Benford cyclic graph", may be an inspiration because they are based upon the analysis of limited data.

This work was supported by the National Magnetic Confinement Fusion Program of China (No. 2014GB125004) and the National Natural Science Foundation of China (No. 11575121).

[1] S. Newcomb, Am. J. Math. 4, 39 (1881).
[2] F. Benford, Proc. Am. Phil. Soc. 78, 551 (1938).
[3] A. E. Kossovsky, Benford's Law (Theory, the General Law of Relative Quantities and Forensic Fraud Detection Applications) (World Scientific Publishing Co. Pte. Ltd., 2015).
[4] M. J. Nigrini, Benford's Law (Applications for Forensic Accounting, Auditing, and Fraud Detection) (John Wiley & Sons, Inc., Hoboken, New Jersey, 2012).
[5] A. Berger and T. P. Hill,