A new quantity for statistical analysis: "Scaling invariable Benford distance"
arXiv [physics.data-an]
Peiyan Luo and Yongqing Li ∗
College of Nuclear Technology and Automation Engineering, Chengdu University of Technology, Chengdu, China
College of Physical Science and Technology, Sichuan University, Chengdu, China
(Dated: October 11, 2018)

For the first time, we introduce the "Scaling invariable Benford distance" and the "Benford cyclic graph", which can be used to analyze any data set. Using the quantity and the graph, we analyze some data sets with common distributions, such as normal and exponential, and find that different data sets have markedly different values of the "Scaling invariable Benford distance" and different features in the "Benford cyclic graph". We also explore the influence of data size on the "Scaling invariable Benford distance" and find that it first decreases as the data size increases, then approaches a fixed value when the size is large enough.
I. INTRODUCTION
The nine digits 1–9 obtained by first-digit analysis of typical everyday numbers might be expected to occur randomly and thus with equal frequency. But in 1881, Newcomb [1] found that the first-digit proportions of many numbers were quite unequal, and in 1938, Benford [2] gave the exact expression $P_B(d) = \log_{10}(1 + 1/d)$, where $P_B(d)$ is the probability of first digit $d$ occurring in a data set. This is known as Benford's law. Since the law was established, much research [3–16] has been done on theoretical extensions, on analyzing more cases, and on applying the law to other fields. Clearly, not all data sets have the same first-digit distribution as Benford's law. Moreover, a data set of finite size yields rational first-digit proportions, while Benford's law gives irrational numbers; in other words, a finite data set will never exactly satisfy the logarithmic law. Meanwhile, we usually do not know how the first-digit proportions change as the data size increases, nor do we know the limit of the proportions.

The question is how large the difference is between the first-digit proportions of a data set and the first-digit law; this is the focal issue of this paper. A data set satisfying Benford's law is scaling symmetric, which means that it still satisfies the law when every member of the data set is multiplied by an arbitrary number. If the data set is not consistent with the law, it is scaling asymmetric. We therefore propose a new quantity that does not vary with units or scales; such a quantity is necessary because the units and scales of a data set are always chosen artificially. This is the "Scaling invariable Benford distance" discussed below.

∗ [email protected]

II. "SCALING INVARIABLE BENFORD DISTANCE"
Let $A$ be a data set of finite size. First, a distance $I[A]$ is defined using Benford's law,

$$I[A] = \sqrt{\sum_{d=1}^{9} \left[ P_A(d) - P_B(d) \right]^2} \tag{1}$$

where $P_A(d)$ is the probability of first digit $d$ occurring in $A$. Then a new data set $A_\alpha$ is obtained by transforming $A$, $A_\alpha = \{ x_\alpha \mid x_\alpha = x \times 10^{\alpha},\ x \in A \}$, where $\alpha$ is a random number between 0 and 1. $I[A_\alpha]$ is defined in the same way as $I[A]$,

$$I[A_\alpha] = \sqrt{\sum_{d=1}^{9} \left[ P_{A_\alpha}(d) - P_B(d) \right]^2} \tag{2}$$

where $P_{A_\alpha}(d)$ is the probability of first digit $d$ occurring in $A_\alpha$. Generally, $I[A] \neq I[A_\alpha]$.

Now we define a new quantity $I_{\rm inva}[A_\alpha]$ in Eq. (3) and name it the "Scaling invariable Benford distance",

$$I_{\rm inva}[A_\alpha] = \int_0^1 I[A_{\alpha+\beta}]\, d\beta \tag{3}$$

Multiplying every member of a data set by 10 leaves all first digits unchanged, so $I[A_\alpha]$ is periodic in $\alpha$ with period 1 and its integral over one full period does not depend on the starting point. Hence $I_{\rm inva}[A_\alpha] = I_{\rm inva}[A_\beta] = I_{\rm inva}[A]$, where $I_{\rm inva}[A_\beta]$ and $I_{\rm inva}[A]$ have the same definition as $I_{\rm inva}[A_\alpha]$. That is to say, $I_{\rm inva}[A_\alpha]$ is a fixed value that is independent of $\alpha$.

Thus, for any data set $A_\alpha$ there is a quantity, the "Scaling invariable Benford distance", that does not change with units or scales. It addresses the central question by measuring the difference between the first-digit proportions of a data set and Benford's law.

III. ANALYSIS OF DATA SETS

A. Data sets satisfying and approximating Benford's law
The data set $X$ given in Eq. (4) is consistent with Benford's law,

$$X = \{ x \mid x = 10^{\beta},\ \beta \in [0, 1) \text{ with uniform distribution} \} \tag{4}$$

Let $Y$ be an arbitrary data set,

$$Y = \{ y \mid y \text{ is an arbitrary nonzero number} \} \tag{5}$$

Then data set $Z$ can be formed from $X$ and $Y$,

$$Z = \{ z \mid z = x \times y,\ x \in X,\ y \in Y \} \tag{6}$$

For nonzero $Y$ with any distribution, data set $Z$ satisfies the first-digit law. This can be proved simply, as follows, in a different way from Hamming [17].

First, $Y$ can be rewritten as

$$Y = \{ y(\beta_0, t) \mid y(\beta_0, t) = 10^{\beta_0 + N(t)},\ \beta_0 \in [0, 1),\ N(t) \text{ is an integer} \} \tag{7}$$

where only $\beta_0$ contributes to the first digit. Then,

$$\begin{aligned}
Z &= \{ z \mid z = 10^{\beta + \beta_0 + N(t)} \} \\
  &= \{ z \mid z = 10^{\tilde\beta + N(t)},\ \tilde\beta \in [\beta_0, \beta_0 + 1) \} \\
  &= \{ z \mid z = 10^{\tilde\beta + N(t)},\ \tilde\beta \in [\beta_0, 1) \} \cup \{ z \mid z = 10^{\tilde\beta + N(t)},\ \tilde\beta \in [1, \beta_0 + 1) \} \\
  &= \{ z \mid z = 10^{\tilde\beta + N(t)},\ \tilde\beta \in [\beta_0, 1) \} \cup \{ z \mid z = 10^{\tilde\beta + N(t) + 1},\ \tilde\beta \in [0, \beta_0) \}
\end{aligned} \tag{8}$$

In addition, data set $X$ can also be expressed as

$$X = \{ z \mid z = 10^{\beta},\ \beta \in [\beta_0, 1) \} \cup \{ z \mid z = 10^{\beta},\ \beta \in [0, \beta_0) \} \tag{9}$$

Thus data set $Z$ has the same first-digit distribution as $X$. That is to say, $Z$ satisfies Benford's law.

Note that data sets like $Z$ form a large category because $Y$ is arbitrary. Thus a real-world data set that approximates Benford's law may be a data set like $Z$, for example the USA census data "pop-2009" [3]. Here we produce a data set named "pop-c" with the same distribution as "pop-2009", where "pop-c" is formed from two data sets: $X$ defined in Eq. (4) and $Y$ with the distribution shown in graph (c) of FIG. 1. The distributions of "pop-2009", "pop-c" and $X$ are also shown in FIG. 1.

TABLE I. Distribution functions

Data set    Distribution function
Normal      $p(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, $x \in (-\infty, \infty)$
Exponent    $p(x) = e^{-x}$, $x \in [0, \infty)$
Uniform     p(x) = 1/ , x ∈ [1,
Constant    $p(x) = 1$, $x = 9$

TABLE II. $I_{\rm inva}[A_\alpha]$ for different data sets

Normal    Exponent    Uniform    Constant
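As a rough numerical check of the argument in this subsection, the following sketch (Python with NumPy, not part of the original analysis) evaluates the distance of Eq. (1) for a sample of $X$ from Eq. (4), for one particular $Y$, and for their product $Z$ from Eq. (6). The choice of $Y$ as a normal sample with mean 50 and standard deviation 10, the sample size, the pairing of each $x$ with one independently drawn $y$, and all function names are assumptions made for illustration only.

```python
import numpy as np

# Benford's law: P_B(d) = log10(1 + 1/d) for d = 1..9
P_B = np.log10(1.0 + 1.0 / np.arange(1, 10))

def first_digit_probs(data):
    """Empirical first-digit proportions P_A(d), d = 1..9, of nonzero data."""
    x = np.abs(np.asarray(data, dtype=float))
    x = x[x > 0]
    mantissa = x / 10.0 ** np.floor(np.log10(x))
    # Guard against rounding that leaves the mantissa just outside [1, 10)
    mantissa = np.where(mantissa >= 10.0, mantissa / 10.0, mantissa)
    mantissa = np.where(mantissa < 1.0, mantissa * 10.0, mantissa)
    digits = np.floor(mantissa).astype(int)
    return np.bincount(digits, minlength=10)[1:10] / x.size

def benford_distance(data):
    """Distance I[A] of Eq. (1)."""
    return np.sqrt(np.sum((first_digit_probs(data) - P_B) ** 2))

rng = np.random.default_rng(0)
n = 100_000                               # sample size (assumption)
X = 10.0 ** rng.uniform(0.0, 1.0, n)      # Eq. (4)
Y = rng.normal(50.0, 10.0, n)             # one arbitrary choice of Y (assumption)
Z = X * Y                                 # Eq. (6), one y per x

for name, a in (("X", X), ("Y", Y), ("Z", Z)):
    print(name, benford_distance(a))
```

If the argument holds, the distances printed for $X$ and $Z$ should both be close to zero up to sampling noise, while the distance for $Y$ alone should be much larger.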
B. Analyzing some common data sets
Now we analyze some typical data sets whose distribution functions are shown in TABLE I; the four cases are labeled Normal, Exponent, Uniform and Constant, respectively. The members of each data set are random numbers generated by computer, and the size of each data set is set to a fixed power of 10.

For each data set, TABLE II gives the "Scaling invariable Benford distance" $I_{\rm inva}[A_\alpha]$ [Eq. (3)]. Each data set has a definite value: Exponent has a very small value and Constant a large one. Interestingly, for data sets with a normal distribution of zero mean but arbitrary standard deviation, the calculated $I_{\rm inva}[A_\alpha]$ is the same, about 0.087 in TABLE II. This is expected, since changing the standard deviation of a zero-mean normal distribution only rescales the data. Thus any normally distributed data set, once shifted to zero mean, has a single value of the "Scaling invariable Benford distance".

Then we calculate $I[A_\alpha]$ defined in Eq. (2) for any given $\alpha$, which is evenly distributed on the interval $[0, 1)$. The results are shown in FIG. 2, where the radial coordinate is $I[A_\alpha]$ and the angular coordinate is transformed from $\alpha$. Different data sets give graphs with clearly different features, and such a graph is called a "Benford cyclic graph" here. Moreover, if we change the units or scales of a data set, we get a "Benford cyclic graph" of the same shape but rotated. For instance, in FIG. 3, the upper two graphs show the results for data sets obtained from the above data set Normal by multiplying all members by 2 (left graph) and by 5 (right graph); correspondingly, the graphs are rotated counterclockwise through 108 and 252 degrees relative to graph (a) of FIG. 2. The lower two graphs show the results for data sets obtained similarly from Uniform.

FIG. 1. Distributions of the data sets: (a) "pop-2009" [3] (line) and "pop-c" (dot), (b) $X$ defined in Eq. (4), (c) $Y$ produced by us. Axes are labeled $N$ and probability density.

FIG. 2. $I[A_\alpha]$ [Eq. (2)] as the radial coordinate versus the angular coordinate, which is transformed from $\alpha \in [0, 1)$ with uniform distribution. The four graphs are the results for the four data sets of TABLE I: (a) Normal, (b) Exponent, (c) Uniform and (d) Constant.

That is, both the "Scaling invariable Benford distance" and the "Benford cyclic graph" can be used to identify and classify data sets; the former is easy to use, while the latter gives much more information.
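The two tools of this subsection can be sketched numerically as follows (Python with NumPy). This is an illustrative sketch, not the authors' code: the integral of Eq. (3) is approximated by an average over an even grid of $\alpha$ values, the sample size and grid resolution are arbitrary choices, and the function names are hypothetical. Uniform is left out because its interval is not legible in TABLE I. Assuming the polar angle is $360^\circ \times \alpha$, multiplying the data by 2 or 5 corresponds to rotations of $360^\circ \times \log_{10} 2$ and $360^\circ \times \log_{10} 5$, which the last line evaluates.

```python
import numpy as np

P_B = np.log10(1.0 + 1.0 / np.arange(1, 10))  # Benford's law, d = 1..9

def benford_distance(data):
    """I[A] of Eq. (1) for an array of nonzero numbers."""
    x = np.abs(np.asarray(data, dtype=float))
    x = x[x > 0]
    mantissa = x / 10.0 ** np.floor(np.log10(x))
    # Guard against rounding that leaves the mantissa just outside [1, 10)
    mantissa = np.where(mantissa >= 10.0, mantissa / 10.0, mantissa)
    mantissa = np.where(mantissa < 1.0, mantissa * 10.0, mantissa)
    digits = np.floor(mantissa).astype(int)
    p_a = np.bincount(digits, minlength=10)[1:10] / x.size
    return np.sqrt(np.sum((p_a - P_B) ** 2))

def cyclic_curve(data, n_alpha=100):
    """I[A_alpha] of Eq. (2) on the even grid alpha = k / n_alpha."""
    data = np.asarray(data, dtype=float)
    alphas = np.arange(n_alpha) / n_alpha
    radii = np.array([benford_distance(data * 10.0 ** a) for a in alphas])
    return alphas, radii

def scaling_invariant_distance(data, n_alpha=100):
    """Grid average of I[A_alpha], approximating the integral of Eq. (3)."""
    return cyclic_curve(data, n_alpha)[1].mean()

rng = np.random.default_rng(1)
n = 20_000  # sample size chosen for this sketch (assumption)
samples = {
    "Normal": rng.normal(0.0, 1.0, n),
    "Exponent": rng.exponential(1.0, n),
    "Constant": np.full(n, 9.0),
}
for name, a in samples.items():
    # Rescaling by 2 should leave the averaged distance nearly unchanged
    print(name, scaling_invariant_distance(a), scaling_invariant_distance(2.0 * a))

# Rescaling by 10**0.25 shifts the curve by 25 of the 100 grid steps
_, base = cyclic_curve(samples["Normal"])
_, scaled = cyclic_curve(samples["Normal"] * 10.0 ** 0.25)
print(np.allclose(np.roll(base, -25), scaled))

# Rotation angles (degrees) implied by multiplying the data by 2 and by 5
print(360.0 * np.log10(2.0), 360.0 * np.log10(5.0))
```

The pairs `(alphas, radii)` returned by `cyclic_curve` are the data one would draw in polar coordinates to obtain a graph of the kind shown in FIG. 2.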
IV. THE INFLUENCE OF THE DATA SIZE ON FIRST DIGIT PROPORTIONS
Here we explore the influence of the data size on the first-digit proportions using the "Scaling invariable Benford distance". We calculate $I_{\rm inva}[A_\alpha]$ for three of the data sets mentioned above: data set $X$ defined in Eq. (4), and Exponent and Normal from TABLE I. The results are shown in FIG. 4, where the data size varies over a range of powers of 10. When the data size is small, $I_{\rm inva}[A_\alpha]$ decreases sharply as the size increases, and then it approaches a limiting value. For data set $X$ the limit is zero because its first-digit proportions are consistent with Benford's law, while the limit is nonzero for the other data sets.
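The size dependence described above can be explored with a short sketch (Python with NumPy). The list of sizes, the random seed, the grid used to approximate the integral of Eq. (3), and the function names are all assumptions made for illustration; they are not the settings used for FIG. 4.

```python
import numpy as np

P_B = np.log10(1.0 + 1.0 / np.arange(1, 10))  # Benford's law, d = 1..9

def benford_distance(data):
    """I[A] of Eq. (1) for an array of nonzero numbers."""
    x = np.abs(np.asarray(data, dtype=float))
    x = x[x > 0]
    mantissa = x / 10.0 ** np.floor(np.log10(x))
    # Guard against rounding that leaves the mantissa just outside [1, 10)
    mantissa = np.where(mantissa >= 10.0, mantissa / 10.0, mantissa)
    mantissa = np.where(mantissa < 1.0, mantissa * 10.0, mantissa)
    digits = np.floor(mantissa).astype(int)
    p_a = np.bincount(digits, minlength=10)[1:10] / x.size
    return np.sqrt(np.sum((p_a - P_B) ** 2))

def scaling_invariant_distance(data, n_alpha=50):
    """Grid average of I[A_alpha], approximating the integral of Eq. (3)."""
    data = np.asarray(data, dtype=float)
    alphas = np.arange(n_alpha) / n_alpha
    return float(np.mean([benford_distance(data * 10.0 ** a) for a in alphas]))

def size_scan(sizes, seed=2):
    """Averaged distance versus sample size for X, Exponent and Normal."""
    rng = np.random.default_rng(seed)
    generators = {
        "X": lambda n: 10.0 ** rng.uniform(0.0, 1.0, n),   # Eq. (4)
        "Exponent": lambda n: rng.exponential(1.0, n),
        "Normal": lambda n: rng.normal(0.0, 1.0, n),
    }
    return {name: [scaling_invariant_distance(gen(n)) for n in sizes]
            for name, gen in generators.items()}

sizes = [10, 100, 1_000, 10_000, 100_000]  # illustrative sizes (assumption)
results = size_scan(sizes)
for name, values in results.items():
    print(name, np.round(values, 4))
```

Each printed row lists the averaged distance for increasing sample size, so the trend toward a limiting value can be read off directly for each data set.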
FIG. 3. $I[A_\alpha]$ [Eq. (2)] as the radial coordinate versus the angular coordinate, which is transformed from $\alpha \in [0, 1)$ with uniform distribution. The upper two graphs show the results for data sets transformed from Normal [TABLE I], where all members are multiplied by 2 (left graph) and by 5 (right graph). The lower two graphs show the results for data sets transformed similarly from Uniform [TABLE I].
V. CONCLUSION
As discussed in the introduction, it is risky to assert that the first-digit proportions of a real-world data set are consistent with Benford's law. For example, the data set of distances of stars in the Milky Way [3] is supposed to fit the law extremely well. However, the "Scaling invariable Benford distance" of this data set is 0.0438, while it is just 0.0308 for data set Exponent with the same data size (48111). Exponent cannot satisfy the law, because its $I_{\rm inva}[A_\alpha]$ approaches a nonzero value; the star distances, which have a larger value, therefore cannot satisfy it either. Such cases can be handled with the "Scaling invariable Benford distance" first introduced in this paper.

FIG. 4. $I_{\rm inva}[A_\alpha]$ [Eq. (3)] versus data size, which varies over a range of powers of 10. The graphs show the results for (a) $X$ defined in Eq. (4), (b) Exponent and (c) Normal; (b) and (c) are from TABLE I.

Using this new quantity, we have analyzed some typical data sets; the results show that different data sets have markedly different values of the "Scaling invariable Benford distance". We have also explored how the quantity varies with the data size and found that $I_{\rm inva}[A_\alpha]$ approaches a fixed value when the size is large enough; the value is zero for data set $X$ [Eq. (4)], which fits the digit law, and nonzero for the other data sets. In addition, we have introduced the "Benford cyclic graph", which, like the "Scaling invariable Benford distance", can identify and classify data sets, and in Sec. III A we have given a proof, different from that of Hamming [17], that a large category of data sets satisfies Benford's law.

In general, the "Scaling invariable Benford distance" and the "Benford cyclic graph" can be used to analyze any data set and can be regarded as a statistical method, extending the research on and applications of Benford's law. For instance, one application is assessing the authenticity of given data sets, which formerly had to approximate the logarithmic law but no longer need to. Another is checking whether the distribution of a data set is what it is assumed to be, such as the normal distribution discussed in Sec. III B. On these points, further study is needed to uncover the deeper meaning and broader application of the "Scaling invariable Benford distance" and the "Benford cyclic graph".

Moreover, our results raise an interesting question: we apply statistics to real-world data sets, which usually carry uncertainty, but statistics gives results only in the limiting case, whereas the data size is always finite. We do not know what problems this gap between finite data sets and the infinite limit of statistics will bring. If such problems exist, methods like those discussed in this paper, namely the "Scaling invariable Benford distance" and the "Benford cyclic graph", may offer inspiration because they are based on the analysis of finite data.

This work was supported by the National Magnetic Confinement Fusion Program of China (No. 2014GB125004) and the National Natural Science Foundation of China (No. 11575121).

[1] S. Newcomb, Am. J. Math. 4, 39 (1881)
[2] F. Benford, Proc. Am. Phil. Soc. 78, 551 (1938)
[3] A. E. Kossovsky, Benford's Law (Theory, the General Law of Relative Quantities and Forensic Fraud Detection Applications) (World Scientific Publishing Co. Pte. Ltd., 2015)
[4] M. J. Nigrini, Benford's Law (Applications for Forensic Accounting, Auditing, and Fraud Detection) (John Wiley & Sons, Inc., Hoboken, New Jersey, 2012)
[5] A. Berger and T. P. Hill,