Benford or not Benford: new results on digits beyond the first
September 22, 2018
Abstract
In this paper, we will see that the proportion of $d$ as $p$-th digit, where $p > 1$ and $d \in \llbracket 0, 9 \rrbracket$, in data (obtained thanks to the model developed hereunder) is more likely to follow a law whose probability distribution is determined by a specific upper bound, rather than the generalization of Benford's Law to digits beyond the first one. These probability distributions fluctuate around theoretical values determined by Hill in 1995. Knowing beforehand the value of the upper bound can be a way to find a better adjusted law than Hill's one.

Introduction
Benford's Law is really amazing: according to it, the first digit $d$, $d \in \llbracket 1, 9 \rrbracket$, of numbers in many naturally occurring collections of data does not follow a discrete uniform distribution; it rather follows a logarithmic distribution. Having been discovered by Newcomb in 1881 ([11]), this law was definitively brought to light by Benford in 1938 ([2]). He proposed the following probability distribution: the probability for $d$ to be the first digit of a number is equal to $\log(1 + \frac{1}{d})$. Most of the empirical data, such as physical data (Knuth in [10] or Burke and Kincanon in [4]), demographic and economic data (Nigrini and Wood in [12]) or genome data (Friar et al. in [6]), follow approximately Benford's Law, to such an extent that this law is used to detect possible frauds in lists of socio-economic data ([15]) or in scientific publications ([5]).

In [3], Blondeau Da Silva, building a rather relevant representative model, showed that, in this case, the proportion of each $d$ as leading digit, $d \in \llbracket 1, 9 \rrbracket$, structurally fluctuates. It strengthens the fact that, concerning empirical data sets, this law often appears to be a good approximation of reality, but no more than an approximation ([7]). We can note that there also exist distributions known to disobey Benford's Law ([13] and [1]).

Generalizing Benford's Law, Hill ([8]) extends the law to digits beyond the first one: the probability for $d$, $d \in \llbracket 0, 9 \rrbracket$, to be the $p$-th digit of a number is equal to $\sum_{j=10^{p-2}}^{10^{p-1}-1} \log(1 + \frac{1}{10j+d})$.

Building a model very similar to that described in [3], the naturally occurring data will be considered as the realizations of independent random variables following the hereinafter constraints: (a) the data is strictly positive and is upper-bounded by an integer $n$, a constraint which is often valid in data sets, the physical, biological and economical quantities being limited; (b) each random variable is considered to follow a discrete uniform distribution whereby the integers with at least $p$ digits ($p > 1$) up to $i$ are equally likely to occur ($i$ being uniformly randomly selected in $\llbracket 10^{p-1}, n \rrbracket$). This model relies on the fact that the random variables are not always the same.

Through this article we will demonstrate that the predominance of 0 over 1 (and of 1 over 2, and so on) as $p$-th digit ($p > 1$) is anything but surprising. Hill's probabilities became standard values that should exactly be followed by most naturally occurring collections of data. However the reality is that the proportion of each $d$ as $p$-th digit structurally fluctuates. There is not a single law but numerous distinct laws that we will hereafter examine.

Let $p$ and $d$ be two integers such that $p > 1$ and $d \in \llbracket 0, 9 \rrbracket$. Let $m$ be a strictly positive integer such that $m \ge 10^{p-1}$. Let $\mathcal{U}\{10^{p-1}, m\}$ denote the discrete uniform distribution whereby integers between $10^{p-1}$ and $m$ are equally likely to be observed.

Let $n$ be a strictly positive integer such that $n \ge 10^{p-1}$. Let us consider the random experiment $E_n$ of tossing two independent dice. The first one is a fair $(n+1-10^{p-1})$-sided die showing $n+1-10^{p-1}$ different numbers from $1$ to $n+1-10^{p-1}$. The number $i$ rolled on it defines the number of faces on the second die. It thus shows $i$ different numbers from $10^{p-1}$ to $i+10^{p-1}-1$. The sample space $\Omega_n$ is defined as follows:
\[ \Omega_n = \{ (i,j) : i \in \llbracket 1, n+1-10^{p-1} \rrbracket \text{ and } j \in \llbracket 10^{p-1}, i+10^{p-1}-1 \rrbracket \}. \]
Our probability measure is denoted by $\mathrm{P}$.

Let us denote by $D_{(n,p)}$ the random variable from $\Omega_n$ to $\llbracket 0, 9 \rrbracket$ that maps each element $\omega$ of $\Omega_n$ to the $p$-th digit of the second component of $\omega$.

As our aim is to determine the probability that the $p$-th digit of the integer obtained thanks to the second throw is $d$, it can be considered with no consequence on our results that we first select an integer $i$ equal to or less than $n$ among the integers with at least $p$ digits (following the $\mathcal{U}\{10^{p-1}, n\}$ discrete uniform distribution); afterwards we select another integer with at least $p$ digits equal to or less than $i$ (following the $\mathcal{U}\{10^{p-1}, i\}$ discrete uniform distribution).

Through the proposition below, we will express the value of $\mathrm{P}(D_{(n,p)} = d)$, i.e. the probability that the $p$-th digit of our second throw in the random experiment $E_n$ is $d$.

Proposition 2.1.
Let $k$ denote the integer such that $k = \max\{ i \in \mathbb{N} : 10^{i+p} \le n \}$, with the convention $k = -1$ when no such $i$ exists. Let $l$ denote the positive integer such that:
\[ l = \left\lfloor \frac{n - (10^{p-1}+d)10^{k+1}}{10^{k+2}} \right\rfloor + 10^{p-2}. \]
The value of $\mathrm{P}(D_{(n,p)} = d)$ is:
\[ \frac{1}{n+1-10^{p-1}} \Bigg( \sum_{i=0}^{k} \bigg( \sum_{j=10^{p-2}}^{10^{p-1}-1} \; \sum_{b=(10j+d)10^{i}}^{(10j+(d+1))10^{i}-1} \frac{b - ((9j+d)10^{i} + 10^{p-2} - 1)}{b+1-10^{p-1}} + \sum_{j=10^{p-2}-1}^{10^{p-1}-1} \; \sum_{a=\max(10^{p+i-1},\,(10j+(d+1))10^{i})}^{\min(10^{p+i}-1,\,(10(j+1)+d)10^{i}-1)} \frac{10^{i}(j+1) - 10^{p-2}}{a+1-10^{p-1}} \bigg) + r_{(n,d,p)} \Bigg), \]
where $r_{(n,d,p)}$ is, if the $p$-th digit of $n$ is $d$:
\[ \sum_{j=10^{p-2}}^{l} \; \sum_{b=(10j+d)10^{k+1}}^{\min(n,\,(10j+(d+1))10^{k+1}-1)} \frac{b - ((9j+d)10^{k+1} + 10^{p-2} - 1)}{b+1-10^{p-1}} + \sum_{j=10^{p-2}-1}^{l-1} \; \sum_{a=\max(10^{p+k},\,(10j+(d+1))10^{k+1})}^{(10(j+1)+d)10^{k+1}-1} \frac{10^{k+1}(j+1) - 10^{p-2}}{a+1-10^{p-1}}, \]
or where $r_{(n,d,p)}$ is, if the $p$-th digit of $n$ is different from $d$:
\[ \sum_{j=10^{p-2}}^{l} \; \sum_{b=(10j+d)10^{k+1}}^{(10j+(d+1))10^{k+1}-1} \frac{b - ((9j+d)10^{k+1} + 10^{p-2} - 1)}{b+1-10^{p-1}} + \sum_{j=10^{p-2}-1}^{l} \; \sum_{a=\max(10^{p+k},\,(10j+(d+1))10^{k+1})}^{\min(n,\,(10(j+1)+d)10^{k+1}-1)} \frac{10^{k+1}(j+1) - 10^{p-2}}{a+1-10^{p-1}}. \]

Proof.
Let us denote by $F_{(n,p)}$ the random variable from $\Omega_n$ to $\llbracket 1, n+1-10^{p-1} \rrbracket$ that maps each element $\omega$ of $\Omega_n$ to the first component of $\omega$. It returns the number obtained on the first throw of the unbiased $(n+1-10^{p-1})$-sided die. For each $q \in \llbracket 1, n+1-10^{p-1} \rrbracket$, we have:
\[ \mathrm{P}(F_{(n,p)} = q) = \frac{1}{n+1-10^{p-1}}. \tag{1} \]
According to the law of total probability, we state:
\[ \mathrm{P}(D_{(n,p)} = d) = \sum_{q=1}^{n+1-10^{p-1}} \mathrm{P}(D_{(n,p)} = d \mid F_{(n,p)} = q)\, \mathrm{P}(F_{(n,p)} = q). \tag{2} \]
Thereupon two cases appear in determining the value, for $q \in \llbracket 1, n+1-10^{p-1} \rrbracket$, of $\mathrm{P}(D_{(n,p)} = d \mid F_{(n,p)} = q)$. Let $k_q$ be the integer such that $k_q = \max\{ k \in \mathbb{N} : 10^{p+k} \le q + 10^{p-1} - 1 \}$ in both cases.

Let us study the first case, where the $p$-th digit of $q + 10^{p-1} - 1$ is $d$. For all $i$ in $\llbracket 0, k_q \rrbracket$, there exist $9 \times 10^{p-2}$ sequences of $10^{i}$ consecutive integers from $(10j+d)10^{i}$ to $(10j+(d+1))10^{i} - 1$, where $j \in \llbracket 10^{p-2}, 10^{p-1}-1 \rrbracket$, whose $p$-th digit is $d$. The highest of these integers is $(10(10^{p-1}-1)+(d+1))10^{k_q} - 1$, the last $(p+k_q)$-digit number in this case. Thus, from $10^{p-1}$ to $10^{p+k_q} - 1$, the number of integers whose $p$-th digit is $d$ is:
\[ \sum_{i=0}^{k_q} \; \sum_{j=10^{p-2}}^{10^{p-1}-1} \; \sum_{b=(10j+d)10^{i}}^{(10j+(d+1))10^{i}-1} 1 = 9 \times 10^{p-2} \sum_{i=0}^{k_q} 10^{i} = 10^{p-2}(10^{k_q+1} - 1). \]
This equality still holds true for $k_q = -1$. Sums of this type are considered null in the rest of the article. From $10^{p+k_q}$ to $q + 10^{p-1} - 1$, there exist $t$ sequences of $10^{k_q+1}$ consecutive integers from $(10j+d)10^{k_q+1}$ to $(10j+(d+1))10^{k_q+1} - 1$, where $j \in \llbracket 10^{p-2}, 10^{p-2}+t-1 \rrbracket$, whose $p$-th digit is $d$. There also exist $q + 10^{p-1} - 1 - (10(10^{p-2}+t)+d)10^{k_q+1} + 1$ additional integers in this case, between $(10(10^{p-2}+t)+d)10^{k_q+1}$ and $q + 10^{p-1} - 1$. Finally the total number of integers whose $p$-th digit is $d$ is:
\[ 10^{p-2}(10^{k_q+1} - 1) + t \times 10^{k_q+1} + q + 10^{p-1} - 1 - (10(10^{p-2}+t)+d)10^{k_q+1} + 1, \]
i.e. $q + 10^{p-1} - 1 - \big( (9(10^{p-2}+t)+d)10^{k_q+1} + 10^{p-2} - 1 \big)$.

It may be inferred that:
\[ \mathrm{P}(D_{(n,p)} = d \mid F_{(n,p)} = q) = \frac{q + 10^{p-1} - 1 - \big( (9(10^{p-2}+t)+d)10^{k_q+1} + 10^{p-2} - 1 \big)}{q}, \tag{3} \]
the $p$-th digit of $q + 10^{p-1} - 1$ being $d$.

In the second case, we consider the integers $q + 10^{p-1} - 1$ whose $p$-th digit is different from $d$. On the basis of the previous case, the total number of integers whose $p$-th digit is $d$ is, where $t$ is the number of sequences of consecutive integers lower than $q + 10^{p-1} - 1$:
\[ 10^{p-2}(10^{k_q+1} - 1) + t \times 10^{k_q+1}, \]
i.e. $10^{k_q+1}(10^{p-2}+t) - 10^{p-2}$.

It can be concluded that:
\[ \mathrm{P}(D_{(n,p)} = d \mid F_{(n,p)} = q) = \frac{10^{k_q+1}(10^{p-2}+t) - 10^{p-2}}{q}, \tag{4} \]
the $p$-th digit of $q + 10^{p-1} - 1$ being different from $d$.

Using equalities (1), (2), (3) and (4), we get our result.

For example, we get:

Examples. Let us first determine the value of $\mathrm{P}(D_{(10003,5)} = 2)$. The probability that the fifth digit of a randomly selected number in $\llbracket 10000, 10000 \rrbracket$ is 2 is $\frac{0}{1}$, that in $\llbracket 10000, 10001 \rrbracket$ is $\frac{0}{2}$, that in $\llbracket 10000, 10002 \rrbracket$ is $\frac{1}{3}$ and that in $\llbracket 10000, 10003 \rrbracket$ is $\frac{1}{4}$. Hence we have:
\[ \mathrm{P}(D_{(10003,5)} = 2) = \frac{1}{4} \Big( \frac{0}{1} + \frac{0}{2} + \frac{1}{3} + \frac{1}{4} \Big) = \frac{7}{48} \approx 0.1458. \]
It is the second case of Proposition 2.1, where $n = 10003$, $d = 2$, $p = 5$, $k = -1$ and $l = 1000$.

Let us now determine the value of $\mathrm{P}(D_{(1113,3)} = 1)$ (first case of Proposition 2.1); in this case, we have $k = 0$ and $l = 11$:
\[ \mathrm{P}(D_{(1113,3)} = 1) = \frac{1}{1014} \Big( \sum_{j=10}^{99} \frac{j-9}{10j-98} + \sum_{j=10}^{98} \sum_{a=10j+2}^{10(j+1)} \frac{j-9}{a-99} + \sum_{a=992}^{999} \frac{90}{a-99} + \sum_{a=1000}^{1009} \frac{90}{a-99} + \sum_{b=1010}^{1019} \frac{b-919}{b-99} + \sum_{a=1020}^{1109} \frac{100}{a-99} + \sum_{b=1110}^{1113} \frac{b-1009}{b-99} \Big) \]
\[ = \frac{1}{1014} \Big( \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \dots + \frac{1}{11} + \frac{2}{12} + \frac{2}{13} + \dots + \frac{89}{891} + \frac{90}{892} + \frac{90}{893} + \dots + \frac{90}{910} + \frac{91}{911} + \dots + \frac{100}{920} + \frac{100}{921} + \dots + \frac{100}{1010} + \frac{101}{1011} + \dots + \frac{104}{1014} \Big). \]

Let us determine the value of $\mathrm{P}(D_{(212,2)} = 9)$ (second case of Proposition 2.1); in this case, we have $k = 0$ and $l = 1$:
\[ \mathrm{P}(D_{(212,2)} = 9) = \frac{1}{203} \Big( \frac{9}{10} + \sum_{j=1}^{8} \sum_{a=10(j+1)}^{10(j+1)+8} \frac{j}{a-9} + \sum_{a=100}^{189} \frac{9}{a-9} + \sum_{b=190}^{199} \frac{b-180}{b-9} + \sum_{a=200}^{212} \frac{19}{a-9} \Big) \]
\[ = \frac{1}{203} \Big( \frac{1}{10} + \frac{1}{11} + \dots + \frac{1}{19} + \frac{2}{20} + \frac{2}{21} + \dots + \frac{8}{89} + \frac{9}{90} + \frac{9}{91} + \dots + \frac{9}{180} + \frac{10}{181} + \frac{11}{182} + \dots + \frac{19}{190} + \frac{19}{191} + \dots + \frac{19}{203} \Big). \]

It is natural that we take a specific look at the values of $n$ positioned one rank before the integers for which the number of digits has just increased.

To this end we will consider the sequence $\big( \mathrm{P}(D_{(n,p)} = d) \big)_{n \in \mathbb{N} \setminus \llbracket 0, 10^{p-1}-1 \rrbracket}$. In the interests of simplifying notation, we will denote this sequence by $(P_{(d,n,p)})_{n \in \mathbb{N} \setminus \llbracket 0, 10^{p-1}-1 \rrbracket}$. Let us study the subsequence $(P_{(d,\varphi_{(d,p)}(n),p)})_{n \in \mathbb{N} \setminus \llbracket 0, p-1 \rrbracket}$, where $\varphi_{(d,p)}$ is the function from $\mathbb{N} \setminus \llbracket 0, p-1 \rrbracket$ to $\mathbb{N}$ that maps $n$ to $10^{n} - 1$. We get the result below:
Proposition 3.1. The subsequence $(P_{(d,\varphi_{(d,p)}(n),p)})_{n \in \mathbb{N} \setminus \llbracket 0, p-1 \rrbracket}$ converges to:
\[ 10^{-1} + \frac{n_{(d,p)} + m_{(d,p)} - 9\, l_{(d,p)} - d \times k_{(d,p)}}{9 \times 10^{p-1}} + \frac{1}{90} \ln\Big( \frac{10^{p-1}+d}{10^{p-1}} \Big) + \frac{1}{9} \ln\Big( \frac{10^{p}}{10^{p} - 10 + d + 1} \Big), \]
where:
\[ k_{(d,p)} = \sum_{j=10^{p-2}}^{10^{p-1}-1} \ln\Big( \frac{10j+(d+1)}{10j+d} \Big), \qquad l_{(d,p)} = \sum_{j=10^{p-2}}^{10^{p-1}-1} j \ln\Big( \frac{10j+(d+1)}{10j+d} \Big), \]
\[ m_{(d,p)} = \sum_{j=10^{p-2}}^{10^{p-1}-2} \ln\Big( \frac{10(j+1)+d}{10j+(d+1)} \Big), \qquad n_{(d,p)} = \sum_{j=10^{p-2}}^{10^{p-1}-2} j \ln\Big( \frac{10(j+1)+d}{10j+(d+1)} \Big). \]

Proof.
Let n be a positive integer such that n ≥ p . According to Proposition2.1, we have P ( d,φ ( d,p ) ( n ) ,p ) = P ( d, n − ,p ) i.e. , knowing that in this case k =5ax { i ∈ N : 10 i + p ≤ n − } = n − p − n − p − (cid:16) n − p (cid:88) i =0 (cid:0) p − − (cid:88) j =10 p − (10 j +( d +1))10 i − (cid:88) b =(10 j + d )10 i b − ((9 j + d )10 i + 10 p − − b + 1 − p − + p − − (cid:88) j =10 p − − p + i − , (10( j +1)+ d )10 i − (cid:88) a =max(10 p + i − , (10 j +( d +1))10 i ) i ( j + 1) − p − a + 1 − p − (cid:1)(cid:17) . Let us denote by b ( i,d,p ) the positive number: p − − (cid:88) j =10 p − (10 j +( d +1))10 i − (cid:88) b =(10 j + d )10 i b − ((9 j + d )10 i + 10 p − − b + 1 − p − , and by a ( i,d,p ) the positive number: p − − (cid:88) j =10 p − − p + i − , (10( j +1)+ d )10 i − (cid:88) a =max(10 p + i − , (10 j +( d +1))10 i ) i ( j + 1) − p − a + 1 − p − . Thus we have: P ( d,φ ( d,p ) ( n ) ,p ) = 110 n − p − n − p (cid:88) i =0 (cid:16) b ( i,d,p ) + a ( i,d,p ) (cid:17) . Let us first find an appropriate lower bound of P ( d,φ ( d,p ) ( n ) ,p ) . We have: b ( i,d,p ) = p − − (cid:88) j =10 p − (cid:0) i − (10 j +( d +1))10 i − (cid:88) b =(10 j + d )10 i (9 j + d )10 i + 10 p − − p − b + 1 − p − (cid:1) = 9 × p + i − − p − − (cid:88) j =10 p − ((9 j + d )10 i + 10 p − − p − ) (10 j +( d +1))10 i − (cid:88) b =(10 j + d )10 i b + 1 − p − Recall that for all integers ( p, q ), such that 1 < p < q :ln( q + 1 p ) ≤ q (cid:88) k = p k ≤ ln( qp − . 
(5)Consequently, we obtain, for i ≥ b ( i,d,p ) ≥ × p + i − − p − − (cid:88) j =10 p − (9 j + d )10 i ln( (10 j + ( d + 1))10 i − p − (10 j + d )10 i − p − ) ≥ × p + i − − p − − (cid:88) j =10 p − (9 j + d )10 i (cid:0) ln( 10 j + ( d + 1)10 j + d ) + ln(1 + p − j +( d +1) i (10 j + d ) − p − ) (cid:1) ≥ × p + i − − d × i p − − (cid:88) j =10 p − ln( 10 j + ( d + 1)10 j + d ) − × i p − − (cid:88) j =10 p − j ln( 10 j + ( d + 1)10 j + d ) − p − − (cid:88) j =10 p − (9 j + d )10 i ln(1 + p − j +( d +1) i (10 j + d ) − p − ) (cid:1) . k ( d,p ) the positive number (cid:80) p − − j =10 p − ln( j +( d +1)10 j + d ) and l ( d,p ) thepositive number (cid:80) p − − j =10 p − j ln( j +( d +1)10 j + d ). Knowing that for all x ∈ ] −
1; + ∞ [,we have ln(1 + x ) ≤ x , we obtain: b ( i,d,p ) ≥ × p + i − − d × i k ( d,p ) − × i l ( d,p ) − p − − (cid:88) j =10 p − (9 j + d )10 i p − j +( d +1) i (10 j + d ) − p − ≥ × p + i − − d × i k ( d,p ) − × i l ( d,p ) − p − − (cid:88) j =10 p − i p − i × p − − p − ≥ × p + i − − d × i k ( d,p ) − × i l ( d,p ) − × p − i i − . Similarly, we have thanks to inequalities (5): a ( i,d,p ) ≥ p − − (cid:88) j =10 p − (10 i ( j + 1) − p − ) ln( (10( j + 1) + d )10 i + 1 − p − (10 j + ( d + 1))10 i + 1 − p − )+ (10 p − i − p − ) ln( (10 p − + d )10 i + 1 − p − p + i − + 1 − p − )+ (10 p − i − p − ) ln( 10 p + i + 1 − p − (10 p −
10 + d + 1)10 i + 1 − p − ) ≥ i p − − (cid:88) j =10 p − j ln( 10( j + 1) + d j + ( d + 1) ) + 10 i p − − (cid:88) j =10 p − j ln(1 + × (10 p − − j +1)+ d (10 j + d + 1)10 i + 1 − p − )+ (10 i − p − ) (cid:16) p − − (cid:88) j =10 p − (cid:0) ln( 10( j + 1) + d j + ( d + 1) ) + ln(1 + × (10 p − − j +1)+ d (10 j + d + 1)10 i + 1 − p − ) (cid:1)(cid:17) + (10 p − i − p − ) (cid:0) ln( 10 p − + d p − ) + ln(1 + d (10 p − − p − d p − i + 1 − p − ) (cid:1) + (10 p − i − p − )(ln( 10 p p −
10 + d + 1 ) + ln(1 + (10 p − − − d − p (10 p −
10 + d + 1)10 i + 1 − p − )) . Let us denote by m ( d,p ) the positive number (cid:80) p − − j =10 p − ln( j +1)+ d j +( d +1) ) and n ( d,p ) the positive number (cid:80) p − − j =10 p − j ln( j +1)+ d j +( d +1) ): a ( i,d,p ) ≥ i n ( d,p ) + (10 i − p − ) (cid:16) m ( d,p ) + p − − (cid:88) j =10 p − ln(1 + × (10 p − − j +1)+ d (10 j + d + 1)10 i + 1 − p − ) (cid:17) + (10 p − i − p − ) ln( 10 p − + d p − ) + (10 p − i − p − ) ln( 10 p p −
10 + d + 1 ) . Hence we have: P ( d,φ ( d,p )( n ) ,p ) ≥ n (cid:16) a (0 ,d,p ) + b (0 ,d,p ) + n − p (cid:88) i =1 (cid:0) × p + i − − d × i k ( d,p ) − × i l ( d,p ) + 10 i n ( d,p ) + 10 i m ( d,p ) + 10 p − i ln( 10 p − + d p − ) + 10 p − i ln( 10 p p −
10 + d + 1 ) − × p − − p − (cid:0) m ( d,p ) + ln( 10 p − + d p − ) + ln( 10 p p −
10 + d + 1 ) (cid:1) + (10 i − p − ) p − − (cid:88) j =10 p − ln(1 + × (10 p − − j +1)+ d (10 j + d + 1)10 i + 1 − p − ) (cid:1)(cid:17) .
7n light of the following equality (cid:80) n − pi =1 i = n − p +1 − , we have: P ( d,φ ( d,p )( n ) ,p ) ≥ − + 10 − p +1 ( n ( d,p ) + m ( d,p ) − l ( d,p ) − dk ( d,p ) )9 + 10 − p − + d p − )+ 19 ln( 10 p p −
10 + d + 1 ) + (cid:15) ( d,n,p ) , where (cid:15) ( d,n,p ) is: a (0 ,d,p ) + b (0 ,d,p ) n − p − n + dk ( d,p ) + 9 l ( d,p ) − n ( d,p ) − m ( d,p ) × n − − p − × n ln( 10 p − + d p − ) − p × n ln( 10 p p −
10 + d + 1 ) − p − ( n − p )10 n − p − ( n − p )10 n (cid:0) m ( d,p ) + ln( 10 p − + d p − )+ ln( 10 p p −
10 + d + 1 ) (cid:1) + 110 n p − (cid:88) i =1 (10 i − p − ) p − − (cid:88) j =10 p − ln(1 + × (10 p − − j +1)+ d (10 j + d + 1)10 i + 1 − p − ) . Knowing that for all x ∈ ] −
1; + ∞ [, we have ln(1 + x ) ≤ x , we obtain, for all i ∈ { , ..., p − } : p − − (cid:88) j =10 p − ln(1 + × (10 p − − j +1)+ d (10 j + d + 1)10 i + 1 − p − ) ≤ p − − (cid:88) j =10 p − × (10 p − − j +1)+ d (10 j + d + 1)10 i + 1 − p − ≤ p − p p − d + 2 ≤ p From the above upper bound and the definition of (cid:15) ( d,n,p ) , it may be deducedthat lim n → + ∞ (cid:15) ( d,n,p ) = 0.Let us now find an appropriate upper bound of P ( d,φ ( d,p ) ( n ) ,p ) . Thanks toinequalities (5): b ( i,d,p ) ≤ × p + i − − p − − (cid:88) j =10 p − ((9 j + d )10 i + 10 p − − p − )ln( (10 j + ( d + 1))10 i + 1 − p − (10 j + d )10 i + 1 − p − ) ≤ × p + i − − p − − (cid:88) j =10 p − ((9 j + d )10 i + 10 p − − p − ) (cid:0) ln( 10 j + ( d + 1)10 j + d )+ ln(1 + p − − j +( d +1) i (10 j + d ) + 1 − p − ) (cid:1) ≤ × p + i − − d × i k ( d,p ) − × i l ( d,p ) + 10 p − k ( d,p ) . a ( i,d,p ) ≤ p − − (cid:88) j =10 p − i ( j + 1) ln( (10( j + 1) + d )10 i − p − (10 j + ( d + 1))10 i − p − )+ 10 p − i ln( (10 p − + d )10 i − p − p + i − − p − ) + 10 p − i ln( 10 p + i − p − (10 p −
10 + d + 1)10 i − p − ) ≤ i n ( d,p ) + 10 i p − − (cid:88) j =10 p − j ln(1 + × p − j +1)+ d (10 j + d + 1)10 i − p − )+ 10 i (cid:16) m ( d,p ) + p − − (cid:88) j =10 p − ln(1 + × p − j +1)+ d (10 j + d + 1)10 i − p − ) (cid:17) + 10 p − i (cid:0) ln( 10 p − + d p − ) + ln(1 + d × p − p − d p − i − p − ) (cid:1) + 10 p − i (cid:0) ln( 10 p p −
10 + d + 1 ) + ln(1 + p − − d − p (10 p −
10 + d + 1)10 i − p − ) (cid:1) . Hence we have: P ( d,φ ( d,p )( n ) ,p ) ≤ n − p − n − p (cid:88) i =0 (cid:16) × p + i − − d × i k ( d,p ) − × i l ( d,p ) + 10 i m ( d,p ) + 10 i n ( d,p ) + 10 p − i ln( 10 p − + d p − ) + 10 p − i ln( 10 p p −
10 + d + 1 )+ 10 p − k ( d,p ) + 10 i p − − (cid:88) j =10 p − j ln(1 + × p − j +1)+ d (10 j + d + 1)10 i − p − )+ 10 i p − − (cid:88) j =10 p − ln(1 + × p − j +1)+ d (10 j + d + 1)10 i − p − )+ 10 p − i ln(1 + d × p − p − d p − i − p − )+ 10 p − i ln(1 + p − − d − p (10 p −
10 + d + 1)10 i − p − ) (cid:17) . In light of the following equality (cid:80) n − pi =0 i = n − p +1 − , we have: lim n → + ∞ ( 110 n − p − n − p (cid:88) i =0 × p + i − ) = 10 − lim n → + ∞ ( − n − p − n − p (cid:88) i =0 d × i k ( d,p ) ) = − dk ( d,p ) × p − lim n → + ∞ ( − n − p − n − p (cid:88) i =0 × i l ( d,p ) ) = − l ( d,p ) − p lim n → + ∞ ( 110 n − p − n − p (cid:88) i =0 i n ( d,p ) ) = m ( d,p ) × p − lim n → + ∞ ( 110 n − p − n − p (cid:88) i =0 i n ( d,p ) ) = n ( d,p ) × p − lim n → + ∞ ( 110 n − p − n − p (cid:88) i =0 p − i ln( 10 p − + d p − )) = 190 ln( 10 p − + d p − )lim n → + ∞ ( 110 n − p − n − p (cid:88) i =0 p − i ln( 10 p p −
10 + d + 1 )) = 19 ln( 10 p p −
10 + d + 1 ))lim n → + ∞ ( 110 n − p − n − p (cid:88) i =0 p − k ( d,p ) ) = 0 . x ∈ ] −
1; + ∞ [, we have ln(1 + x ) ≤ x , we obtain, for i ≥ i p − − (cid:88) j =10 p − j ln(1 + × p − j +1)+ d (10 j + d + 1)10 i − p − ) ≤ i + p − p p − p − i − p − = 10 i +1 i − ≤ i p − − (cid:88) j =10 p − ln(1 + × p − j +1)+ d (10 j + d + 1)10 i − p − ) ≤ i p p − p − i − p − ≤ × p − p − i ln(1 + d × p − p − d p − i − p − ) ≤ p − i d × p − p − d p − i − p − ≤ p − i p − i − p − ≤ p − i ln(1 + p − − d − p (10 p −
10 + d + 1)10 i − p − ) ≤ p − i p − i − p − ≤ . Thanks to P ( d,φ ( d,p ) ( n ) ,p ) upper bound and the above inequalities, the resultfollows.Let us denote by α ( d,p ) the limit of ( P ( d,φ ( d,p ) ( n ) ,p ) ) n ∈ N \ (cid:74) ,p − (cid:75) . Here is a fewvalues of P ( d,φ ( d,p ) ( n ) ,p ) : d P ( d,φ ( d, (2) , P ( d,φ ( d, (3) , P ( d,φ ( d, (4) , P ( d,φ ( d, (5) , α ( d, . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P ( d,φ ( d, ( n ) , and α ( d, , for n ∈ (cid:74) , (cid:75) . These values arerounded to the nearest ten-thousandth. d P ( d,φ ( d, (3) , P ( d,φ ( d, (4) , P ( d,φ ( d, (5) , α ( d, . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P ( d,φ ( d, ( n ) , and α ( d, , for n ∈ (cid:74) , (cid:75) . These values arerounded to the nearest ten-thousandth.10 Graphs of ( P ( d,n,p ) ) n ∈ N \ (cid:74) , p − − (cid:75) Let us plot graphs of sequences ( P ( d,n, ) n ∈ N \ (cid:74) , p − − (cid:75) for values of n from10 to 1000 (Figure 1). Then we plot graphs of ( P ( d,n, ) n ∈ N \ (cid:74) , p − − (cid:75) , for n ∈ (cid:74) , (cid:75) (Figure 2).Figure 1: For d ∈ (cid:74) , (cid:75) , graphs of( P ( d,n, ) n ∈ N \ (cid:74) , p − − (cid:75) . Figure 2: For d ∈ (cid:74) , (cid:75) , graphs of( P ( d,n, ) n ∈ N \ (cid:74) , p − − (cid:75) . Note thatpoints have not been all represented.Let us plot two additional graphs of P ( d,n, versus log( n ) and P ( d,n, versus log( n ) for values of n from 10 to 2000000:Figure 3: For d ∈ (cid:74) , (cid:75) , graphsof P ( d,n, versus log( n ). Note thatpoints have not been all ploted. Thefirst five values of the above definedsubsequence, for each d , being repre-sented by bigger plots. Figure 4: For d ∈ (cid:74) , (cid:75) , graphsof P ( d,n, versus log( n ). Note thatpoints have not been all ploted. 
The first four values of the above-defined subsequence, for each $d$, are represented by bigger markers.

Through Figures 3 and 4, the proportion of each $d$ as $p$-th digit, $d \in \llbracket 0, 9 \rrbracket$, seems to fluctuate and consequently not to follow Benford's Law. Each "pseudo cycle" seems to be composed of $9 \times 10^{p-2}$ short waves. Note that these observations were not obvious in view of Figures 1 and 2.

We can also prove the following result:

Proposition 4.1. For all $n \in \mathbb{N} \setminus \llbracket 0, 10^{p-1}-1 \rrbracket$ such that $n \ge 10^{p-1} + 9$ and for all $(a,b) \in \llbracket 0, 9 \rrbracket^{2}$ such that $a < b$, we have:
\[ P_{(a,n,p)} > P_{(b,n,p)}. \]

The relative position of the graphs of $P_{(d,n,p)}$, for $d \in \llbracket 0, 9 \rrbracket$, can be observed on Figures 1, 2 and 3.

Proof. Let $(a,b) \in \llbracket 0, 9 \rrbracket^{2}$ such that $a < b$. For all $m \in \llbracket 10^{p-1}, n \rrbracket$, let us denote by $E_{(a,m)}$ the subset of $\mathbb{N}$ such that $E_{(a,m)} = \{ j \le m : \text{the } p\text{-th digit of } j \text{ is } a \}$. For all $e \in E_{(b,m)}$, we consider $e' = e - (b-a) \times 10^{dg-p}$, where $dg$ is the number of digits of the integer $e$. It is clear that $e' \in E_{(a,m)}$, and distinct values of $e$ give distinct values of $e'$. Thus we get: $|E_{(a,m)}| \ge |E_{(b,m)}|$. We also have $P_{(a,10^{p-1}+a,p)} = \frac{1}{(a+1)^{2}} > P_{(b,10^{p-1}+a,p)} = 0$. The result follows.

Remark 4.2. For $n \in \mathbb{N} \setminus \llbracket 0, 10^{p-1}-1 \rrbracket$, we have, if $n < 10^{p-1} + d$, $P_{(d,n,p)} = 0$. Hence for all $n \in \mathbb{N} \setminus \llbracket 0, 10^{p-1}-1 \rrbracket$ and for all $(a,b) \in \llbracket 0, 9 \rrbracket^{2}$ such that $a < b$, we have: $P_{(a,n,p)} \ge P_{(b,n,p)}$.

Let us henceforth provide the following equality:

Proposition 4.3.
\[ P_{(d,n,p)} = \frac{1}{n+1-10^{p-1}} \Big( P_{(d,10^{k+p}-1,p)} \times (10^{k+p} - 10^{p-1}) + r_{(n,d,p)} \Big), \]
where $k = \max\{ i \in \mathbb{N} : 10^{i+p} \le n \}$.

Proof. Results are directly derived from Proposition 2.1.

$9 \times 10^{p-2}$ additional subsequences

To definitively bring to light the fact that the sequence $(P_{(d,n,p)})_{n \in \mathbb{N} \setminus \llbracket 0, 10^{p-1}-1 \rrbracket}$ does not converge, we will show that there exist additional subsequences that converge to limits different from that of $(P_{(d,\varphi_{(d,p)}(n),p)})_{n \in \mathbb{N} \setminus \llbracket 0, p-1 \rrbracket}$.

For $i \in \llbracket 10^{p-2}, 10^{p-1}-1 \rrbracket$, let us in this way study the $9 \times 10^{p-2}$ subsequences $(P_{(d,\psi_{(d,p,i)}(n),p)})_{n \in \mathbb{N} \setminus \llbracket 0, p-1 \rrbracket}$, where $\psi_{(d,p,i)}$ is the function from $\mathbb{N} \setminus \llbracket 0, p-1 \rrbracket$ to $\mathbb{N}$ that maps $n$ to $(10i + (d+1))10^{n-p+1} - 1$. We get the result below:
Proposition 5.1. Let $i \in \llbracket 10^{p-2}, 10^{p-1}-1 \rrbracket$. The subsequence $(P_{(d,\psi_{(d,p,i)}(n),p)})_{n \in \mathbb{N} \setminus \llbracket 0, p-1 \rrbracket}$ converges to:
\[ \frac{\alpha_{(d,p)} 10^{p-1} + i + 1 - 10^{p-2} - k_{(d,p,i)}\, d - 9\, l_{(d,p,i)} + m_{(d,p,i)} + n_{(d,p,i)} + 10^{p-2} \ln\big( \frac{10^{p-1}+d}{10^{p-1}} \big)}{10i + d + 1}, \]
where:
\[ k_{(d,p,i)} = \sum_{j=10^{p-2}}^{i} \ln\Big( \frac{10j+(d+1)}{10j+d} \Big), \qquad l_{(d,p,i)} = \sum_{j=10^{p-2}}^{i} j \ln\Big( \frac{10j+(d+1)}{10j+d} \Big), \]
\[ m_{(d,p,i)} = \sum_{j=10^{p-2}}^{i-1} \ln\Big( \frac{10(j+1)+d}{10j+(d+1)} \Big), \qquad n_{(d,p,i)} = \sum_{j=10^{p-2}}^{i-1} j \ln\Big( \frac{10(j+1)+d}{10j+(d+1)} \Big). \]

Proof. Let $i \in \llbracket 10^{p-2}, 10^{p-1}-1 \rrbracket$. Thanks to Proposition 4.3, we have, for $n \in \mathbb{N} \setminus \llbracket 0, p-1 \rrbracket$:
\[ P_{(d,\psi_{(d,p,i)}(n),p)} = \frac{1}{(10i + (d+1))10^{n-p+1} - 10^{p-1}} \Big( P_{(d,10^{n}-1,p)} \times (10^{n} - 10^{p-1}) + r_{(\psi_{(d,p,i)}(n),d,p)} \Big). \]
The $p$-th digit of $\psi_{(d,p,i)}(n)$ is $d$, so the first expression of $r$ in Proposition 2.1 applies, with $k + 1 = n - p + 1$ and $l = i$. The first term of $r_{(\psi_{(d,p,i)}(n),d,p)}$ can be simplified as follows:
\[ \sum_{j=10^{p-2}}^{i} \; \sum_{b=(10j+d)10^{n-p+1}}^{(10j+(d+1))10^{n-p+1}-1} \Big( 1 - \frac{(9j+d)10^{n-p+1} + 10^{p-2} - 10^{p-1}}{b+1-10^{p-1}} \Big) \]
\[ = 10^{n-p+1}(i - 10^{p-2} + 1) - \sum_{j=10^{p-2}}^{i} \big( (9j+d)10^{n-p+1} + 10^{p-2} - 10^{p-1} \big) \sum_{b=(10j+d)10^{n-p+1}}^{(10j+(d+1))10^{n-p+1}-1} \frac{1}{b+1-10^{p-1}} \]
\[ \underset{n \to +\infty}{\sim} 10^{n-p+1}(i - 10^{p-2} + 1) - \sum_{j=10^{p-2}}^{i} (9j+d)10^{n-p+1} \ln\Big( \frac{10j+(d+1)}{10j+d} \Big), \]
thanks to inequalities (5).

The second term of $r_{(\psi_{(d,p,i)}(n),d,p)}$ can be simplified as follows:
\[ \sum_{j=10^{p-2}-1}^{i-1} \; \sum_{a=\max(10^{n},\,(10j+(d+1))10^{n-p+1})}^{(10(j+1)+d)10^{n-p+1}-1} \frac{10^{n-p+1}(j+1) - 10^{p-2}}{a+1-10^{p-1}} \]
\[ = \big( 10^{n-p+1} 10^{p-2} - 10^{p-2} \big) \sum_{a=10^{n}}^{(10^{p-1}+d)10^{n-p+1}-1} \frac{1}{a+1-10^{p-1}} + \sum_{j=10^{p-2}}^{i-1} \big( 10^{n-p+1}(j+1) - 10^{p-2} \big) \sum_{a=(10j+(d+1))10^{n-p+1}}^{(10(j+1)+d)10^{n-p+1}-1} \frac{1}{a+1-10^{p-1}} \]
\[ \underset{n \to +\infty}{\sim} 10^{n-1} \ln\Big( \frac{10^{p-1}+d}{10^{p-1}} \Big) + \sum_{j=10^{p-2}}^{i-1} 10^{n-p+1}(j+1) \ln\Big( \frac{10(j+1)+d}{10j+(d+1)} \Big), \]
thanks to inequalities (5).

Knowing that $P_{(d,10^{n}-1,p)} \underset{n \to +\infty}{\sim} \alpha_{(d,p)}$ (see Proposition 3.1), the result follows.

Let us denote by $\alpha_{(d,p,i)}$ the limit of $(P_{(d,\psi_{(d,p,i)}(n),p)})_{n \in \mathbb{N} \setminus \llbracket 0, p-1 \rrbracket}$. Here are a few values of $P_{(d,\psi_{(d,p,i)}(n),p)}$:

[Table 3: Values of $P_{(d,\psi_{(d,2,7)}(n),2)}$, for four values of $n$, and of $\alpha_{(d,2,7)}$ (case $i = 7$). These values are rounded to the nearest ten-thousandth.]

[Table 4: Values of $P_{(d,\psi_{(d,3,23)}(n),3)}$, for three values of $n$, and of $\alpha_{(d,3,23)}$ (case $i = 23$). These values are rounded to the nearest ten-thousandth.]

As a result, the sequence $(P_{(d,n,p)})_{n \in \mathbb{N} \setminus \llbracket 0, 10^{p-1}-1 \rrbracket}$ does not converge. The $9 \times 10^{p-2}$ convergent subsequences confirm the remarks raised by Figures 3 and 4 about the existence of "pseudo cycles" in the graph of $(P_{(d,n,p)})_{n \in \mathbb{N} \setminus \llbracket 0, 10^{p-1}-1 \rrbracket}$.

From Figures 3 and 4, we notice that there exist fluctuations in the graph of $(P_{(d,n,p)})_{n \in \mathbb{N} \setminus \llbracket 0, 10^{p-1}-1 \rrbracket}$. We define $C_{(d,p)}$ as follows:

Definition 5.2.
\[ C_{(d,p)} = \frac{1}{9 \times 10^{p-2}} \sum_{i=10^{p-2}}^{10^{p-1}-1} \alpha_{(d,p,i)}. \]

Figure 5 below shows the different values of $\alpha_{(0,2,i)}$, for $i \in \llbracket 1, 9 \rrbracket$, and also the values of $P_{(0,n,2)}$ versus $\log(n)$.

[Figure 5: Graph of $P_{(0,n,2)}$ versus $\log(n)$. Not all points have been represented. Lines whose equation is $y = \alpha_{(0,2,i)}$, for $i \in \llbracket 1, 9 \rrbracket$, have also been plotted; two of them are almost coincident.]

The values of $C_{(d,p)}$ are reported in Tables 5 and 6 (for $p = 2$ and $p = 3$, respectively), next to the probabilities of the generalized Benford's Law; according to Hill ([9]), their closeness is absolutely normal. We furthermore note, thanks to Table 5, that $C_{(0,2)}$ slightly underestimates $\sum_{j=1}^{9} \log(1 + \frac{1}{10j})$, as can be inferred from Figure 5.

Conclusion
To conclude, through our model, we have seen that the proportion of $d$ as $p$-th digit, $d \in \llbracket 0, 9 \rrbracket$, in certain naturally occurring collections of data is more likely to follow a law whose probability distribution is $(d, P_{(d,n,p)})_{d \in \llbracket 0, 9 \rrbracket}$, where $n$ is the smallest integer upper bound of the physical, biological or economical quantities considered, rather than the generalized Benford's Law. Knowing beforehand the value of the upper bound $n$ can be a way to find a better adjusted law than Benford's one.

[Table 5: Values of $C_{(d,2)}$ and probabilities associated with the second digit ([8]), $\sum_{j=1}^{9} \log(1 + \frac{1}{10j+d})$, for $d \in \llbracket 0, 9 \rrbracket$. These values are rounded to the nearest thousandth.]

[Table 6: Values of $C_{(d,3)}$ and probabilities associated with the third digit ([8]), $\sum_{j=10}^{99} \log(1 + \frac{1}{10j+d})$, for $d \in \llbracket 0, 9 \rrbracket$. These values are rounded to the nearest thousandth.]

The results of the article would have been the same in terms of fluctuations of the proportion of $d \in \llbracket 0, 9 \rrbracket$ as $p$-th digit, of limits of subsequences, or of results on central values, if our uniformly randomly selected discrete uniform distributions were lower-bounded by a positive integer different from $10^{p-1}$: the first terms in the proportion formulas rapidly become negligible. Through our model we understand that the predominance of 0 as $p$-th digit (followed by that of 1 and so on) is anything but surprising in experimental data: it is only due to the fact that, in the lexicographical order, 0 appears before 1, 1 appears before 2, etc.

However the limits of our model rest on the assumption that the random variables used to obtain our data are not the same and follow discrete uniform distributions that are uniformly randomly selected. In certain naturally occurring collections of data this cannot conceivably be justified. Studying the cases where the random variables follow other distributions (and are not necessarily randomly selected) sketches some avenues for future research on the subject.
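The generalized Benford's Law used as the comparison point throughout the article can be evaluated directly from Hill's formula recalled in the introduction. The following minimal sketch assumes that $\log$ denotes the base-10 logarithm; the function name `hill_probability` is hypothetical and not part of the original script.

```python
import math


def hill_probability(d, p):
    """Probability that the p-th significant digit equals d under Hill's
    generalized Benford's Law (p >= 2, d in 0..9).

    Hypothetical helper: it sums log10(1 + 1/(10*j + d)) over all
    (p-1)-digit prefixes j, i.e. j from 10**(p-2) to 10**(p-1) - 1.
    """
    if p < 2 or not 0 <= d <= 9:
        raise ValueError("expected p >= 2 and 0 <= d <= 9")
    return sum(
        math.log10(1 + 1 / (10 * j + d))
        for j in range(10 ** (p - 2), 10 ** (p - 1))
    )


if __name__ == "__main__":
    # Second-digit distribution (p = 2), one value per digit d.
    for d in range(10):
        print(d, round(hill_probability(d, 2), 4))
```

For a fixed $p$, the ten values sum to 1 because the logarithms telescope over all $p$-digit prefixes, which gives a quick sanity check before comparing them with the values of $C_{(d,p)}$.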
References

[1] T. W. Beer. Terminal digit preference: beware of Benford's law. Journal of Clinical Pathology, 62(2):192, 2009.
[2] F. Benford. The law of anomalous numbers. Proceedings of the American Philosophical Society, 78:127–131, 1938.
[3] S. Blondeau Da Silva. Benford or not Benford: a systematic but not always well-founded use of an elegant law in experimental fields. arXiv:1804.06186 [math.PR], 2018.
[4] J. Burke and E. Kincanon. Benford's law and physical constants: the distribution of initial digits. American Journal of Physics, 59:952, 1991.
[5] A. Diekmann. Not the first digit! Using Benford's law to detect fraudulent scientific data. Journal of Applied Statistics, 34(3):321–329, 2007.
[6] J. L. Friar, T. Goldman, and J. Pérez-Mercader. Genome sizes and the Benford distribution. PLOS ONE, 7(5), 2012.
[7] N. Gauvrit and J.-P. Delahaye. Pourquoi la loi de Benford n'est pas mystérieuse. Mathématiques et sciences humaines, 182(2):7–15, 2008.
[8] T. Hill. The significant-digit phenomenon. The American Mathematical Monthly, 102(4):322–327, 1995.
[9] T. Hill. A statistical derivation of the significant-digit law. Statistical Science, 10(4):354–363, 1995.
[10] D. Knuth. The Art of Computer Programming 2. Addison-Wesley, New-York, 1969.
[11] R. Newcomb. Note on the frequency of use of the different digits in natural numbers. American Journal of Mathematics, 4:39–40, 1881.
[12] M. Nigrini and W. Wood. Assessing the integrity of tabulated demographic data. 1995. Preprint.
[13] R. A. Raimi. The first digit problem. American Mathematical Monthly, 83(7):521–538, 1976.
[14] G. Van Rossum. Python tutorial, volume Technical Report CS-R9526. 1995. Centrum voor Wiskunde en Informatica (CWI).
[15] H. Varian. Benford's law (letters to the editor). The American Statistician, 26(3):62–65, 1972.

Appendix: Python script
Using Proposition 2.1, we can determine the terms of $(P_{(d,n,p)})_{n \in \mathbb{N} \setminus \llbracket 0, 10^{p-1}-1 \rrbracket}$, for $d \in \llbracket 0, 9 \rrbracket$. To this end, we have created a script with the Python programming language (Python Software Foundation, Python Language Reference, version 3, see [14]). The implemented function expvalProp has three parameters: the rank n of the wanted term of the sequence, the value d of the considered digit and the position p of this digit. Here is the algorithm used:

    import math

    def expvalProp(n, d, p):
        # k is the largest i such that 10**(i+p) <= n (k = -1 if none).
        k = -1
        while 10**(k+p+1) <= n:
            k = k+1
        l = math.floor((n-(10**(p-1)+d)*10**(k+1))/10**(k+2))+10**(p-2)
        S = 0  # accumulates the sums over a
        T = 0  # accumulates the sums over b
        # Main double sum of Proposition 2.1 (empty when k = -1).
        if k != -1:
            for i in range(0, k+1):
                for j in range(10**(p-2), 10**(p-1)):
                    for b in range((10*j+d)*10**i, (10*j+(d+1))*10**i):
                        T = T+(b-((9*j+d)*10**i+10**(p-2)-1))/(b+1-10**(p-1))
                for j in range(10**(p-2)-1, 10**(p-1)):
                    for a in range(max(10**(p+i-1), (10*j+(d+1))*10**i),
                                   min(10**(p+i), (10*(j+1)+d)*10**i)):
                        S = S+((j+1)*10**i-10**(p-2))/(a+1-10**(p-1))
        # Remainder r(n,d,p): first case, the p-th digit of n is d.
        if (math.floor(n/10**(k+1))-10*math.floor(n/10**(k+2))) == d:
            for j in range(10**(p-2), l+1):
                for b in range((10*j+d)*10**(k+1),
                               min(n, (10*j+(d+1))*10**(k+1)-1)+1):
                    T = T+(b-((9*j+d)*10**(k+1)+10**(p-2)-1))/(b+1-10**(p-1))
            for j in range(10**(p-2)-1, l):
                for a in range(max(10**(p+k), (10*j+(d+1))*10**(k+1)),
                               (10*(j+1)+d)*10**(k+1)):
                    S = S+((j+1)*10**(k+1)-10**(p-2))/(a+1-10**(p-1))
        # Remainder r(n,d,p): second case, the p-th digit of n is not d.
        else:
            for j in range(10**(p-2), l+1):
                for b in range((10*j+d)*10**(k+1), (10*j+(d+1))*10**(k+1)):
                    T = T+(b-((9*j+d)*10**(k+1)+10**(p-2)-1))/(b+1-10**(p-1))
            for j in range(10**(p-2)-1, l+1):
                for a in range(max(10**(p+k), (10*j+(d+1))*10**(k+1)),
                               min(n, (10*(j+1)+d)*10**(k+1)-1)+1):
                    S = S+((j+1)*10**(k+1)-10**(p-2))/(a+1-10**(p-1))
        return (S+T)/(n+1-10**(p-1))
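As a cross-check of the closed formula, the two-dice experiment $E_n$ can also be evaluated directly from its definition, through the law of total probability used in the proof of Proposition 2.1. The sketch below is not part of the original script: the names `pth_digit` and `brute_force_prob` are hypothetical, and exact rational arithmetic is an illustrative choice made here to avoid rounding questions.

```python
from fractions import Fraction


def pth_digit(m, p):
    """Return the p-th digit of m, counted from the left (1-indexed)."""
    return int(str(m)[p - 1])


def brute_force_prob(n, d, p):
    """Exact value of P(D_(n,p) = d), computed from the definition of E_n.

    For each outcome q of the first die (1 <= q <= n + 1 - 10**(p-1)),
    the second die is uniform on [10**(p-1), q + 10**(p-1) - 1]; the
    conditional probabilities are averaged over q.
    """
    low = 10 ** (p - 1)
    if p < 2 or n < low or not 0 <= d <= 9:
        raise ValueError("expected p >= 2, n >= 10**(p-1) and 0 <= d <= 9")
    faces = n + 1 - low
    count = 0            # integers in [low, q + low - 1] whose p-th digit is d
    total = Fraction(0)  # running sum of count / q
    for q in range(1, faces + 1):
        if pth_digit(q + low - 1, p) == d:
            count += 1
        total += Fraction(count, q)
    return total / faces


if __name__ == "__main__":
    # First example of the article: n = 10003, p = 5, d = 2.
    print(brute_force_prob(10003, 2, 5))  # 7/48
```

This direct evaluation costs one pass over the faces of the first die, so it is only practical for moderate $n$, but it gives an independent reference against which the output of expvalProp can be compared.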