On estimating the alphabet size of a discrete random source
PHILIP GINZBOORG
Abstract.
We are concerned with estimating the alphabet size N from a stream of symbols taken uniformly at random from that alphabet. We define and analyze a memory-restricted variant of an algorithm that has been proposed earlier for this purpose. The alphabet size N can be estimated in O(√N) time and space by the memory-restricted variant of this algorithm.

1. Introduction
A discrete random source picks symbols uniformly at random from a finite alphabet of size N; any one of the symbols is picked with the same probability 1/N. In this paper we investigate a memory-restricted variant of a simple algorithm that estimates N from the stream of symbols emitted by such a source. This algorithm was introduced by Brassard and Bratley [1] for finding the cardinality of a set, and further analysed by Flajolet [4] and by Flajolet and Sedgewick [5]. Using this algorithm to estimate the alphabet size of a discrete random source has been proposed by Montalvão et al. [8].

The working of this algorithm is based on the same phenomenon as the Birthday Problem [10]; it needs O(√N) time (number of observed symbols) on average to estimate the alphabet size, which may be an advantage when N is large. The space (number of stored symbols) needed for making an estimate is O(√N) on average; but in the worst case the space needed will be N + 1 symbols and the runtime will be N + 1 multiplied by a constant.

The main contribution of the present paper is in sections 4 and 5, where we define and analyse the behavior of this algorithm when its internal memory is limited to at most c < N symbols. We have also included original derivations of known results for the case when there is no such restriction. This is done for ease of reference, and to show how to analyse this algorithm using relatively simple means.

In the next section we explain how the algorithm works. Section 3 describes the results of experiments with a pseudorandom number generator as the source of symbols. In sections 4 and 5 we define a memory-restricted variant of the algorithm and then consider the effects of limited memory on the accuracy of estimation. Section 6 contains the theoretical calculations. The paper ends with a conclusion in section 7.
Key words and phrases. Online algorithms, analysis of algorithms, discrete random source, parameter estimation, Birthday Problem.

2. The algorithm
The stream of symbols coming from a discrete random source is divided online into adjacent, variable-sized blocks of symbols, such that one block includes a single pair of identical (matching) symbols. The sizes of these blocks comprise a sequence of random variables W[1], W[2], . . . . An example using uppercase letters as symbols will clarify this. Suppose that the following sequence of symbols, coming from a discrete random source, is observed: A, B, K, D, E, I, M, D, A, D, C, K, A, C, J, I, . . .

The first repeating symbol in that sequence is D, and the size of the block (A, B, K, D, E, I, M, D) that contains the first pair of matching symbols D is eight. Therefore, we record W[1] = 8, discard the beginning of the sequence up to and including the second occurrence of D, and continue scanning. The continuation of the sequence is A, D, C, K, A, C, J, I, . . . .

Now the first repeating symbol in the sequence is A; we have a block (A, D, C, K, A) of size five with two identical symbols A. Therefore, we record W[2] = 5, discard the beginning of the sequence up to and including the second occurrence of A, and again continue scanning. In this manner we get a series of block sizes W[1], W[2], W[3], and so on.

The sample mean W̄_l from l realizations of W is

    W̄_l = (W[1] + W[2] + ··· + W[l]) / l.

The expected value and variance of W̄_l are E(W) and Var(W)/l, respectively. The estimation of the alphabet size N from W̄_l in the algorithm is based on those statistics.

We remark that the measurement of W̄_l may tolerate partial loss of symbols in the stream. For instance, if we delete, say, every tenth symbol from the stream, the remaining symbols will still (i) include all of the alphabet, and (ii) be uniformly distributed in the resulting sequence. As a consequence, the statistics of W̄_l will remain the same after that deletion.

In section 6 below it is shown that the theoretical mean and variance of the block size W are:

(2.1)  E(W) ≈ √(πN/2) + 2/3, and
(2.2)  Var(W) = 2N + E(W) − E(W)².

When N is large, the variance is approximately

(2.3)  Var(W) ≈ (2 − π/2)·N − (1/3)·√(πN/2)   (large N).

These expressions are the answer to an extension of the Birthday Problem [10]: "What are (1) the average size, and (2) the variance, of a group of people where exactly two group members share the same birthday?"

An estimator N̂ of the alphabet size from the sample mean, which is based on the statistical method of moments, is obtained as follows. We invert (2.1), so that N is expressed as a function g at the point E(W),

(2.4)  N ≈ g(E(W)) = (2/π)·(E(W) − 2/3)².
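To make the block-splitting step concrete, here is a minimal sketch (our own illustration in Python; the paper itself only gives pseudocode later, in Algorithm 1). It splits a symbol stream into blocks as described above and reproduces W[1] = 8 and W[2] = 5 for the example sequence.

    def block_sizes(stream):
        """Split a stream of symbols into blocks, each ending at its first repeated symbol."""
        seen = set()
        size = 0
        for s in stream:
            size += 1
            if s in seen:          # the matching pair closes the current block
                yield size
                seen.clear()
                size = 0
            else:
                seen.add(s)

    print(list(block_sizes("ABKDEIMDADCKACJI")))   # -> [8, 5]; the trailing C, J, I form an unfinished block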
From a first-order Taylor series expansion of g around the point E(W), it can be shown that g(E(W)) approximately equals the average value of g(W̄_l) [2]. We apply this approximation and replace E(W) in (2.4) by the sample mean W̄_l from l realizations of W:

(2.5)  g(W̄_l) = (2/π)·(W̄_l − 2/3)².

Since N is an integer, we will round down (2.5) with the floor function, resulting in the estimator N̂:

(2.6)  N̂ = ⌊(2/π)·(W̄_l − 2/3)²⌋.

The average time (number of observed symbols) needed to make an estimate is the mean of the sum W[1] + W[2] + ··· + W[l], i.e. the average time is l·E(W). Since the mean block size E(W) is O(√N) by equation (2.1), the average time needed to make an estimate by this algorithm is O(√N). The space required to estimate N by this algorithm is also O(√N) on average.

A theoretical calculation in section 6.4 shows that the result of estimating N by equation (2.5) will have a positive bias α > 0: E(N̂) = N(1 + α). This is why we round down, rather than up, to obtain an integer in (2.6). For large N, α is approximately 0.27/l; and a more accurate estimator for large N can be obtained if we divide the right hand side of equation (2.5) by (1 + α):

(2.7)  N̂ = ⌊(1/(1 + 0.27/l)) · (2/π) · (W̄_l − 2/3)²⌋   (large N).

The performance of estimators (2.6) and (2.7) is compared in the next section.

We will use the coefficient of variation CV = √Var(N̂)/E(N̂) to characterize the extent of variability of N̂ in relation to its mean. The smaller this coefficient, the more precise is N̂. A first-order approximation to the squared CV is

(2.8)  (CV)² ≈ (8/(l·π)) · (2 − E(W)·(E(W) − 1)/N).

For large alphabet sizes it reduces to

(2.9)  (CV)² ≈ 1.09/l   (large N).

Inverting (2.9) we get the number of blocks l that would be needed to estimate a large N with a given coefficient of variation:

(2.10)  l ≈ ⌈1.09/(CV)²⌉   (large N).

For example, if we want a 10 percent CV, then we would need to observe 109 blocks; this value of l multiplied by the average block size E(W) = √(πN/2) + 2/3 ≈ 1.25√N + 0.67 gives an average measurement time of about 136.6√N + 72.7 symbols. Correspondingly, a 15 percent CV would need about 49 blocks, or roughly 61.4√N + 32.7 symbols, and a 5 percent CV about 436 blocks, or roughly 546.4√N + 290.7 symbols.
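Written out in code, the sample-size rule (2.10) and the estimators (2.6) and (2.7) look as follows. This is a sketch under our own naming; the constants 1.09 and 0.27 are the large-N values derived in section 6.

    import math

    def blocks_needed(target_cv):
        """Number of blocks l for a target coefficient of variation, eq. (2.10)."""
        return math.ceil(1.09 / target_cv ** 2)

    def estimate_alphabet_size(block_sizes, corrected=True):
        """Estimate N from observed block sizes by eq. (2.6), or by eq. (2.7) if corrected."""
        l = len(block_sizes)
        w_bar = sum(block_sizes) / l                   # sample mean of the block sizes
        n_hat = (2 / math.pi) * (w_bar - 2 / 3) ** 2
        if corrected:
            n_hat /= 1 + 0.27 / l                      # divide out the positive bias, eq. (2.7)
        return math.floor(n_hat)

    print(blocks_needed(0.10))                         # -> 109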
If the distribution of symbols in the stream is non-uniform, the above algorithm will tend to underestimate N, because when some symbols occur in the stream more often than others, the average block size W̄_l tends to be smaller than when all symbols are equally likely to occur. Brassard and Bratley [1] note that in this case the algorithm may still be used to probabilistically estimate a lower bound on N.

3. Experiments
The working of the algorithm with estimators (2.6) and (2.7), and l set by equation (2.10) from a target CV, was tried with six alphabet sizes N = 10, 100, . . . , 10^6. In each of these six experiments N was held constant and the symbols came from a pseudorandom sequence. This sequence was generated by Matlab's randi function, which, by default, uses the Mersenne Twister method for producing pseudorandom numbers at the time of this writing. The target coefficient of variation in these measurements was set to 10 percent, resulting in l = 109 by equation (2.10). For each value of N both estimators computed their N̂ after 109 block sizes had been collected, and this operation was repeated twenty thousand times.

Empirical bias and coefficient of variation were computed from the resulting set of estimates. They are listed as percentage points in Table 1. The empirical bias is the relative error between the mean of the twenty thousand values of N̂ and the actual alphabet size N: bias = (mean(N̂) − N)/N; the empirical coefficient of variation CV is the standard deviation of the twenty thousand values of N̂ divided by their mean.
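A rough re-creation of this experimental protocol is sketched below. It is an assumption-laden stand-in, not the original Matlab code: Python's Mersenne Twister generator replaces randi, and the number of repetitions is reduced from twenty thousand so that the sketch runs quickly.

    import math
    import random
    import statistics

    def run_experiment(n, l=109, repeats=1000, seed=1):
        """Empirical bias and CV of estimator (2.7) for a uniform source with alphabet size n."""
        rng = random.Random(seed)
        estimates = []
        for _ in range(repeats):
            total = 0
            for _ in range(l):                     # collect l block sizes
                seen = set()
                w = 0
                while True:
                    s = rng.randrange(n)
                    w += 1
                    if s in seen:                  # the matching pair ends the block
                        break
                    seen.add(s)
                total += w
            w_bar = total / l
            estimates.append(math.floor((2 / math.pi) * (w_bar - 2 / 3) ** 2 / (1 + 0.27 / l)))
        mean_est = statistics.mean(estimates)
        cv = statistics.stdev(estimates) / mean_est
        return (mean_est - n) / n, cv

    print(run_experiment(1000))    # bias close to zero, CV close to the 10 percent target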
Table 1. Empirical bias and CV as percentage points for two estimators; the target CV was set to 10 %. (Columns: N = 10, 100, . . . , 10^6; rows: CV % and bias % for estimation by equation (2.6) and by equation (2.7).)
When N is 10 and 100, the estimation by equation (2.7) is less accurate than by equation (2.6). But (2.7) is more accurate when N is large, in the range 10^3, 10^4, 10^5, and 10^6. Observe that in that range the bias in estimating with (2.7) is reduced by 0.25 percent compared to estimating with (2.6). This is because when l = 109, the first term in equation (2.7) is approximately 1 − 2.5·10^−3. All in all, estimation by equation (2.7) is best for measuring alphabet sizes in the order of 10^3 and higher.

4. Memory-restricted algorithm
From equation (2.1) we see that a mean block size E(W) of √(πN/2), or about 1.25√N symbols, is sufficient to estimate N. But the observed block size instances W[k] will sometimes exceed E(W) and grow, up to N + 1 in the worst case.

On the one hand, we need to store up to N previously observed symbols in computer memory in order to accurately measure all W[k]. On the other hand, the computer's memory available for this purpose may be limited to hold fewer than N symbols when N is large. Let us denote this limit on the number of stored symbols by c.

If N is bigger than our storage capacity c, then we cannot accurately measure block sizes W[k] that are greater than c. As a result our estimates of the alphabet size will be smaller than N.

When N > c, it is sensible to clip the data. This means that any block size that is greater than c is replaced by c + 1 during measurement. It is also sensible to report the number of times Y that the event W[k] > c has happened during the measurement: if Y is zero, then the memory limit c did not impact the measurement; and the ratio between Y and l can serve as a rough approximation to the probability of W > c. We will mark clipped block sizes by W_c:

    W_c = W, if W ≤ c;   and   W_c = c + 1, otherwise.

Listing Algorithm 1 shows the pseudocode of the resulting memory-restricted algorithm. We let the indentation separate code blocks; there are no end statements at the end of function definitions, if conditions and for loops.

Algorithm 1 includes two functions. The first, getBlockSize, repeatedly calls getSymbol to read a symbol from the stream emitted by a discrete random source, until a matching pair of symbols is found or the memory limit c is reached; it then returns the clipped block size W_c. Please note that getBlockSize uses a table, denoted by T, to store the unique symbols that have been observed so far in a block. Initially, before we start reading symbols into the block, the table T is empty. The data structure used for holding T in computer memory can be a hash table keyed by the observed symbols.

Recall that the sample size l that is needed to obtain a target precision CV can be computed by equation (2.10). The second function, EstimateN, calls getBlockSize l times to obtain the sample mean of l block sizes; it then computes the estimate of N by equation (2.7).

5. Effects of limited memory on performance
In this section we consider the following question: at what values of the memory limit c will the error due to block sizes that were clipped by the memory-restricted algorithm remain small; say, one percent or less? We shall assume that N is large and that the sample size l is big enough, so that the positive bias α ≈ 0.27/l in the estimator can be neglected.

To answer this question we have done a series of experiments with a pseudorandom sequence of symbols where the estimator in equation (2.7) used W_c rather than W. This pseudorandom sequence was generated by Matlab's randi function. The alphabet sizes N were 10^2, 10^3, . . . , 10^6, and c was set to ⌈K√N⌉ with K = 2.6, . . . , 3.0, where ⌈·⌉ denotes the ceiling function: ⌈x⌉ is the smallest integer ≥ x.
Algorithm 1. Measuring large N given a sample size l and a memory limit c.

    function getBlockSize(c)
        T ← ∅                           ▷ initialize the table of observed symbols
        for j = 1 : c do
            s ← getSymbol               ▷ get a symbol from the stream
            if s ∈ T then
                return j                ▷ matching pair found: the block size is j
            T ← T ∪ {s}                 ▷ insert s into the table
        return c + 1                    ▷ no match within c symbols: we have hit the memory limit c

    function EstimateN(l, c)            ▷ l can be computed from the target CV by eq. (2.10)
        X ← 0
        Y ← 0
        for j = 1 : l do
            W_c ← getBlockSize(c)
            X ← X + W_c                 ▷ accumulate l block sizes
            if W_c = c + 1 then
                Y ← Y + 1               ▷ count hits of the memory limit c
        return ⌊(1/(1 + 0.27/l)) · (2/π) · (X/l − 2/3)²⌋, Y    ▷ see equation (2.7)
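For reference, a direct Python transcription of Algorithm 1 might look as follows; the stream is modelled by a caller-supplied get_symbol function, and the function names are our own.

    import math
    import random

    def get_block_size(get_symbol, c):
        """Clipped block size W_c, storing at most c symbols (cf. getBlockSize)."""
        seen = set()                       # the table T of observed symbols
        for j in range(1, c + 1):
            s = get_symbol()               # read the next symbol from the stream
            if s in seen:
                return j                   # matching pair found: the block size is j
            seen.add(s)
        return c + 1                       # no match within c symbols: memory limit hit

    def estimate_n_clipped(get_symbol, l, c):
        """Estimate N from l clipped block sizes by eq. (2.7); also report Y (cf. EstimateN)."""
        x = 0                              # sum of the clipped block sizes
        y = 0                              # number of times the memory limit was hit
        for _ in range(l):
            w_c = get_block_size(get_symbol, c)
            x += w_c
            if w_c == c + 1:
                y += 1
        n_hat = math.floor((2 / math.pi) * (x / l - 2 / 3) ** 2 / (1 + 0.27 / l))
        return n_hat, y

    rng = random.Random(7)
    print(estimate_n_clipped(lambda: rng.randrange(10 ** 6), l=109, c=2900))   # c = ceil(2.9 * sqrt(N))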
Table 2. Empirical bias and CV as percentage points when block sizes are clipped at c = ⌈K√N⌉; l = 109. (Columns: N = 100, 10^3, . . . , 10^6; rows: bias % and CV % for K = 2.7, 2.8, 2.9, 3.0.)
This range of c was chosen based on a theoretical calculation that we will describe later. For each value of N and K an estimate N̂ was computed by equation (2.7) after 109 block sizes had been collected, and this operation was repeated twenty thousand times. Empirical bias and coefficient of variation were computed from these results. (Recall that the empirical bias is the relative error between the mean of the twenty thousand values of N̂ and the actual alphabet size N: bias = (mean(N̂) − N)/N; the empirical coefficient of variation CV is the standard deviation of the twenty thousand values of N̂ divided by their mean.)

In Table 2 we have listed the bias and coefficient of variation as percentage points for a subset of measurements where K is between 2.7 and 3.0.
Notice, first, that because clipping the data reduces its variability, the CV values of the clipped estimator for any N and K in Table 2 are less than the CV values of the estimator for that N in Table 1.

Second, the absolute value of the bias decreases as K grows from 2.6 to 3.0; and at K = 2.9 it is below one percent for all values of N that we have tried.

The conclusion from the experimental data is that a storage capacity of ⌈2.9√N⌉ symbols should be enough to achieve one percent accuracy in the measurement. The time needed for doing one such measurement is at most 109 · ⌈2.9√N⌉ symbols.

A theoretical calculation below leads to the same conclusion.

Let us denote by ε the systematic error of the theoretical mean E(W_c) relative to E(W):

(5.1)  E(W_c) = E(W) · (1 − ε).

The fact that we are recording W_c, rather than W, together with W_c ≤ W, introduces a systematic error ε₂ in our measurement of the alphabet size N; it can be estimated by substituting E(W)(1 − ε) in place of E(W) in equation (2.4). After this substitution we discard the term −2/3, because in our scenario it can be reasonably assumed to be small compared to the mean block size. The result is

(5.2)  E(N̂) = N(1 − ε₂) ≈ (2/π) · E(W)² · (1 − ε)² ≈ N · (1 − 2ε + ε²).

We see that

(5.3)  ε₂ ≈ ε(2 − ε).

Next, we will derive an analytical expression for ε by applying the law of total expectation. For notational convenience we will denote the probability of the event W > c by p in that derivation. Firstly,

    E(W) = p · E(W | W > c) + (1 − p) · E(W | W ≤ c).

By rearranging this equation, so that E(W | W ≤ c) is on the left hand side, we get

(5.4)  E(W | W ≤ c) = (E(W) − p · E(W | W > c)) / (1 − p).

Secondly,

(5.5)  E(W_c) = p · (c + 1) + (1 − p) · E(W | W ≤ c).

Substituting the right hand side of (5.4) into (5.5) we obtain

    E(W_c) = E(W) − p · (E(W | W > c) − (c + 1)) = E(W) · (1 − p · (E(W | W > c) − (c + 1)) / E(W)).

Therefore (cf. equation (5.1)),

(5.6)  ε = Pr(W > c) · (E(W | W > c) − (c + 1)) / E(W).

This enables us to compute numerically the value of the systematic error ε for a given N and c: both E(W | W > c) and E(W) in the second term can be computed by equation (6.9) (the latter by setting c = 1); and Pr(W > c) by equation (6.1). It is expedient to first compute the logarithm of Pr(W > c),

(5.7)  ln(Pr(W > c)) = Σ_{k=1}^{c−1} ln(N − k) − (c − 1) · ln(N),

and then exponentiate the result.

When c/N is small, the probability Pr(W > c) ≈ exp(−c(c − 1)/(2N)). (This approximation is obtained from ln(Pr(W > c)) using a truncated Taylor series of the logarithm, ln(1 − x) ≈ −x; see, e.g., Feller [3].) Substituting K√N in place of c in this approximation and simplifying, we get

(5.8)  Pr(W > K√N) ≈ e^{−K²/2}   (if K/√N is small).

In Table 3 we have listed the calculated values of the theoretical bias −ε₂ ≈ −ε(2 − ε) as percentage points for N = 100, 10^3, . . . , 10^6 and K = 2.7, 2.8, 2.9, 3.0.
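As an illustration of this numerical recipe, the sketch below (our own code) computes E(W | W > c) by the nested formula (6.9), Pr(W > c) by (5.7), and then the theoretical bias −ε(2 − ε) of equations (5.3) and (5.6); values of the kind listed in Table 3 can be reproduced this way.

    import math

    def cond_mean_block_size(n, c):
        """E(W | W > c) by the nested formula (6.9); c = 1 gives the unconditional E(W)."""
        m = 1.0                                  # m_N = 1
        for k in range(n - 1, c - 1, -1):        # unwind m_k = 1 + (N - k)/N * m_{k+1}
            m = 1.0 + (n - k) / n * m
        return c + m

    def log_prob_w_gt(n, c):
        """ln Pr(W > c) by equation (5.7)."""
        return sum(math.log(n - k) for k in range(1, c)) - (c - 1) * math.log(n)

    def theoretical_bias(n, c):
        """Relative bias -eps(2 - eps) of the clipped estimator, eqs. (5.3) and (5.6)."""
        e_w = cond_mean_block_size(n, 1)         # E(W), since W >= 2
        eps = math.exp(log_prob_w_gt(n, c)) * (cond_mean_block_size(n, c) - (c + 1)) / e_w
        return -eps * (2 - eps)

    n = 10 ** 4
    print(theoretical_bias(n, math.ceil(2.9 * math.sqrt(n))))   # a fraction of a percent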
Table 3. Theoretical bias as percentage points when block sizes are clipped at c = ⌈K√N⌉. (Columns: N = 100, 10^3, . . . , 10^6; rows: K = 2.7, 2.8, 2.9, 3.0.)

From Table 3 we see that for c ≥ ⌈2.9√N⌉ the error of the estimator due to clipped block sizes will remain less than one percent. The experimental data summarized in Table 2 confirms this.

In Table 4 we have listed the difference between the theoretical prediction of bias in Table 3 and the empirical bias in Table 2.
Table 4. The difference between the theoretical predictions in Table 3 and the experimental data in Table 2. (Columns: N = 100, 10^3, . . . , 10^6; rows: K = 2.7, 2.8, 2.9, 3.0.)

At N = 100 there is a significant difference between the two; at N = 10^3 and N = 10^4 the difference is about four to six times smaller; and at the high range of alphabet sizes, where N = 10^5 and N = 10^6, the two are almost identical.
In summary, we have found that when the number of blocks l is set to l = 109, in order to achieve ten percent precision in the measurement, and c = ⌈2.9√N⌉, the underestimate of N is less than one percent. With these settings the space and time used by the algorithm are at most ⌈2.9√N⌉ and 109 · ⌈2.9√N⌉, respectively. Furthermore, as we increase the memory limit c beyond ⌈2.9√N⌉, the underestimate of N decreases sharply. For example, when c = ⌈4.5√N⌉, a theoretical calculation shows that the underestimate of N is in the order of 0.001 percent.

This phenomenon can be explained qualitatively by examining the behavior of the two terms on the right hand side of equation (5.6), as we increase the memory limit c from zero to N.

When c = 0, the value of E(W | W > c) in the numerator of the second term of (5.6) equals E(W) ≈ 1.25√N; and in the other extreme, when c = N, the value of E(W | W > c) is N + 1. Thus, the value of the second term in equation (5.6) moves down from 1 − 1/E(W) at c = 0, to zero at c = N.

The first term of (5.6), that is the probability Pr(W > c), which equals 100 percent when c = 0, takes a dramatic dive as we increase c beyond 1.25√N. For example, when N = 10^6, it can be computed by equation (5.8) that although Pr(W > 0.25√N) is about 96.9 percent and Pr(W > 1.25√N) is about 45.8 percent, Pr(W > 3√N) is only about 1.1 percent, and near c = 4.5√N the probability is in the order of 0.003 percent only.

Moreover, the behavior of Pr(W > c), as we increase the memory limit c from zero to N, can be characterized by

    Pr(W > c) ≤ e^{−c(c−1)/(2N)} = e^{−c²/(2N)} · e^{c/(2N)} ≤ e^{−c²/(2N)} · √e,

where the first inequality is obtained from ln(Pr(W > c)) using the relation ln(1 − x) ≤ −x, which is valid for 0 ≤ x < 1; and the second, by noting that exp(c/(2N)) attains its largest value √e ≈ 1.65 at c = N. Thus, for instance, it can be computed that Pr(W > 3√N) ≤ 1.1·10^−2 · √e.

Compared to the sharp decline of the first term of equation (5.6), the decrease of the second term is rather gradual. Continuing the example of N = 10^6, it can be computed that at c = 0.25√N the value of the second term is about 82 percent, at c = 1.25√N it is about 46 percent, at c = 3√N it is about 24 percent, and at c = 4.5√N it is still about 16 percent. Thus, the rapid decrease of Pr(W > c) dominates equation (5.6) when the memory limit c exceeds 1.25√N, and helps to drive down the size of the systematic error.

6. Theoretical calculations
The calculations in this section include five parts: (6.1) computation of the conditional moments E(W^j | W > c); (6.2) derivation of equation (2.1) for the mean block size E(W); (6.3) derivation of equation (2.2) for the variance Var(W) of the block size; (6.4) computation of the bias of the estimator; (6.5) derivation of equations (2.8) and (2.9) for the coefficient of variation.

6.1. Computation of the conditional moments E(W^j | W > c). We now turn to computing the conditional moments E(W^j | W > c), where j is a positive integer and c = 0, 1, . . . , N − 1. They are defined by

    E(W^j | W > c) = (1/Pr(W > c)) · Σ_{k=c+1}^{N+1} Pr(W = k) · k^j.

Please note that since a block of input data ends with a matching symbol, a block must include at least two symbols: W ≥ 2. For that reason, E(W^j | W > 0) and E(W^j | W > 1) are the same as the unconditional moment E(W^j).

On the one hand, the probability distribution function of the block size W is well known, because it is needed in solving the Birthday Problem [10]:

(6.1)  Pr(W ≤ k) = 1 − N^(k)/N^k   and   Pr(W > k) = N^(k)/N^k,

where the symbol N^(k) denotes the descending factorial N(N − 1)···(N − (k − 1)). The probability of the event W = k that we need to compute E(W^j | W > c) is the difference between Pr(W > k − 1) and Pr(W > k):

(6.2)  Pr(W = k) = (N^(k−1)/N^{k−1}) · (k − 1)/N.
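As a quick numerical sanity check of (6.1) and (6.2) (our own illustration, for a moderate alphabet size), the distribution of W can be tabulated with a running product; for N = 365 it sums to one and yields the familiar birthday-problem mean block size of about 24.6.

    def block_size_pmf(n):
        """Pr(W = k) for k = 2, ..., N + 1, by equation (6.2)."""
        pmf = {}
        surv = 1.0                          # running value of Pr(W > k - 1), starting at Pr(W > 1) = 1
        for k in range(2, n + 2):
            pmf[k] = surv * (k - 1) / n     # eq. (6.2)
            surv *= (n - (k - 1)) / n       # update to Pr(W > k)
        return pmf

    pmf = block_size_pmf(365)
    print(sum(pmf.values()))                        # -> 1.0 (up to rounding)
    print(sum(k * p for k, p in pmf.items()))       # -> about 24.6, the mean block size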
But on the other hand, direct computation of E(W^j | W > c) from the definition can be numerically difficult for large alphabet sizes. We will therefore derive, via one-step analysis, a nested, computationally efficient formula for E(W^j | W > c). Below, the formula for E(W^j) is derived first; it is then generalized to obtain E(W^j | W > c).

Let us denote by a_k the growth of the power W^j when the block size W is incremented from k to k + 1:

(6.3)  a_k = (k + 1)^j − k^j.

(When j = 1, a_k degenerates to one for all positive k. Also, a_0 = 1 independently of j.)

Suppose that we have observed k symbols without finding a matching pair (that is, without seeing two identical symbols). Let us denote by m_k the difference

(6.4)  m_k = E(W^j | W > k) − k^j.

For example, in the simplest case of j = 1, m_k is the mean number of symbols that we will observe from that moment until we find a matching pair; m_k in this case is the difference between the expected value of block sizes that are greater than k, and k.

Since there are only N different symbols in the alphabet, the index k of m_k runs between 0 and N. If k = N, then the next, (k + 1)st symbol will surely match one of the previously seen; for this reason the value of E(W^j | W > N) is (N + 1)^j. From this and the above definition we know that m_N is (N + 1)^j − N^j. In the other limiting case of k = 0, we have m_0 = E(W^j).

The next, (k + 1)st symbol matches one of the already observed symbols with probability k/N; and it does not match any of these symbols with probability (N − k)/N. In the first, "match" alternative, m_k = a_k; in the second, "no match" alternative, m_k = a_k + m_{k+1}. This defines a recursive relation between m_k and m_{k+1}, for k = 0, 1, 2, . . . , N − 1:

(6.5)  m_k = (k/N) · a_k + ((N − k)/N) · (a_k + m_{k+1}) = a_k + ((N − k)/N) · m_{k+1}.

To solve this recursion it is expedient to first reverse its order (by formally replacing k with N − 1 − k in (6.5)), so that the known term m_N = a_N comes first, and the term m_0 = E(W^j), which we wish to compute, comes last:

(6.6)  m_{N−k−1} = a_{N−k−1} + ((k + 1)/N) · m_{N−k}.

To get a closed-form formula for E(W^j) we start with k = 0 and compute m_{N−1}:

    m_{N−1} = a_{N−1} + (1/N) · m_N = a_{N−1} + a_N/N.

Next we set k = 1, and substitute what we have just computed into the right hand side of (6.6) in place of m_{N−1}. This operation is repeated for k = 2, 3, and so on, until at k = N − 1 we reach m_0. The resulting expression is

(6.7)  E(W^j) = 1 + (a_1 + ((N−1)/N)·(a_2 + ((N−2)/N)·(a_3 + ··· + (a_{N−2} + (2/N)·(a_{N−1} + (1/N)·a_N)) ··· ))).

Finally, observe that by equation (6.4) the conditional moment E(W^j | W > c), where c = 0, 1, 2, . . . , N − 1, can be computed in a similar manner, but starting the recursion (6.5) from c^j + m_c rather than from m_0. The resulting expression is

(6.8)  E(W^j | W > c) = c^j + (a_c + ((N−c)/N)·(a_{c+1} + ((N−(c+1))/N)·(a_{c+2} + ··· + (a_{N−2} + (2/N)·(a_{N−1} + a_N/N)) ··· ))).

Setting j = 1 in (6.8) we get the formula for computing the conditional expectation E(W | W > c) that we need in section 5:

(6.9)  E(W | W > c) = c + (1 + ((N−c)/N)·(1 + ((N−(c+1))/N)·(1 + ··· + (2/N)·(1 + 1/N) ··· ))).

Let us illustrate this computation when N = 3:

    E(W | W > c) = 1 + (1 + (2/3)·(1 + 1/3)) = 26/9 ≈ 2.89,   c = 1;
                 = 2 + (1 + 1/3) = 10/3 ≈ 3.33,               c = 2;
                 = 3 + 1 = 4,                                 c = 3.

Multiplying the terms in (6.7) gives

(6.10)  E(W^j) = 1 + Σ_{k=1}^{N} (N^(k)/N^k) · a_k,

where the symbol N^(k) in the summand again denotes the descending factorial N(N − 1)···(N − (k − 1)). Equation (6.7) is better for numerical computations than equation (6.10). This is because the number of multiplications and divisions in equation (6.7) is about N times smaller, and also because in evaluating (6.7) we do not need to multiply together many tiny, almost-zero numbers.

Still, for j = 1 there is a known asymptotic expansion of the second term of (6.10); and we will next use this result to derive simpler formulas for the mean and the variance of the block size W.

6.2. Derivation of equation (2.1) for the mean block size E(W). When j = 1, equation (6.10) reduces to

(6.11)  E(W) = 1 + Q(N),   where   Q(N) = Σ_{k=1}^{N} N^(k)/N^k.

The asymptotic expansion of Q(N) in descending powers of N is given in Knuth's treatise [7]. Plugging this expansion into (6.11) produces

(6.12)  E(W) = √(πN/2) + 2/3 + (1/12)·√(π/(2N)) − 4/(135N) + (1/288)·√(π/(2N³)) + O(N^−2).

The error from truncating the expansion in (6.12) to the first two terms on the right is about 1/√N: that error is less than one for N ≥ 2. We therefore truncate and obtain equation (2.1). These results about the mean E(W) are known; see Flajolet, Gardy and Thimonier [6] and Sedgewick and Flajolet [9].

6.3. Derivation of equation (2.2) for the variance of the block size.
The function a_k when j = 2 is (k + 1)² − k², that is, a_k = 2k + 1. In this case, equation (6.10) reduces to

    E(W²) = 1 + Σ_{k=1}^{N} (N^(k)/N^k)·(2k + 1) = 1 + 2·Σ_{k=1}^{N} (N^(k)/N^k)·k + Q(N).

If we now rewrite the sum in the second term on the right hand side as a nested formula,

    Σ_{k=1}^{N} (N^(k)/N^k)·k = (N/N)·(1 + ((N−1)/N)·(2 + ((N−2)/N)·(3 + ··· + ((2/N)·(N−1 + (1/N)·N)) ··· ))),

then it can be noticed immediately that the contents of the innermost pair of brackets, N − 1 + (1/N)·N, equal N. Next we can compute the contents of the penultimate matching pair of brackets, N − 2 + (2/N)·N, and the result is again N, and so on, until in the end the contents of the outermost pair of brackets equal N as well. Thus, the value of the whole sum is (N/N)·N = N.

All in all, when j = 2, E(W²) = 1 + 2N + Q(N); and since E(W) = 1 + Q(N), by equation (6.11), we have

    E(W²) = 2N + E(W).
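The identity E(W²) = 2N + E(W) is easy to confirm numerically with the nested recursion (6.7); the following short check is our own sketch.

    def moment(n, j):
        """E(W^j) by the nested formula (6.7)."""
        a = lambda k: (k + 1) ** j - k ** j
        m = a(n)                                   # innermost term a_N
        for k in range(n - 1, 0, -1):              # fold in a_{N-1}, ..., a_1
            m = a(k) + (n - k) / n * m
        return 1.0 + m                             # the leading "1 +" is a_0

    for n in (3, 10, 365):
        e1, e2 = moment(n, 1), moment(n, 2)
        print(n, round(e2, 6), round(2 * n + e1, 6))   # the last two columns agree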
Finally, replacing E(W²) by 2N + E(W) in the definition of variance,

    Var(W) = E(W²) − E(W)²,

results in equation (2.2).

6.4. The bias of the estimator.
Formally expanding g(W̄_l) as a second-order Taylor polynomial around the mean E(W̄_l); applying the expectation operator with respect to W̄_l to both sides; and then replacing in the resulting expression (i) E(W̄_l) with E(W), and (ii) Var(W̄_l) with Var(W)/l, gives

(6.13)  E(N̂) = E(g(W̄_l)) ≈ g(E(W)) + (g''(E(W))/2) · Var(W)/l.

The first term g(E(W)) on the right hand side of (6.13) can be replaced by N, because we have chosen g(E(W)) so that it is a rather good approximation to N. The second term in (6.13) is part of the theoretical bias (systematic error) in our estimate of N; its absolute value decreases as the size l of the sample increases. We denote the relative value of this bias by α:

(6.14)  E(N̂) = N(1 + α),   where   α ≈ (1/N) · (g''(E(W))/2) · Var(W)/l.

The function g(W) is (2/π)·(W − 2/3)². Its second derivative is the constant 4/π. Recall that Var(W) = 2N − E(W)² + E(W). These and (6.14) imply that

(6.15)  α ≈ (1/N) · (2/π) · (2N − E(W)² + E(W))/l.

By equation (6.20) below, the variance of the block size W is about 0.43N − 0.42√N when N is large. Thus, the bias for large N is

(6.16)  α ≈ (1/N) · (2/π) · (0.43N − 0.42√N)/l ≈ (0.27 − 0.27/√N)/l ≈ 0.27/l.

6.5. Coefficient of variation.
Formally expanding g(W̄_l) as a first-order Taylor polynomial around the mean E(W̄_l); moving g(E(W̄_l)) from the right to the left hand side; and applying the expectation operator with respect to W̄_l to the square of both sides in the resulting equation, produces a first-order approximation to the variance of the estimator:

(6.17)  Var(N̂) ≈ [g'(E(W̄_l))]² · Var(W̄_l).

Replacing in that expression E(W̄_l) with E(W), and Var(W̄_l) with Var(W)/l, gives

(6.18)  Var(N̂) ≈ [g'(E(W))]² · Var(W)/l.

This, together with the first-order approximation E(N̂) ≈ N, gives the following approximation of the coefficient of variation CV = √Var(N̂)/E(N̂):

(6.19)  CV ≈ |g'(E(W))| · √(Var(W)/l) / N.

The first derivative of the function g(W) = (2/π)·(W − 2/3)² at the point E(W) is (4/π)·(E(W) − 2/3); its square is approximately 8N/π by equation (2.4). Squaring both sides of (6.19) and substituting on the right-hand side (i) 8N/π for the square of the derivative, and (ii) 2N − E(W)² + E(W) for the variance of W, results in equation (2.8):

    (CV)² ≈ (8/(l·π)) · (2 − E(W)·(E(W) − 1)/N).

Let us assume that N is large. Then, firstly, E(W) is much greater than one, and by (2.2) the variance of W is about 2N − E(W)² + E(W); and secondly, by (2.1), E(W) ≈ √(πN/2) and E(W)² ≈ πN/2 + (4/3)·√(πN/2). Thus, for large N the variance of W is about

(6.20)  Var(W) ≈ (2 − π/2)·N − (1/3)·√(πN/2) ≈ 0.43N − 0.42√N   (large N).

The squared coefficient of variation (CV)² of N̂ for large alphabet sizes reduces to equation (2.9):

    (CV)² ≈ (1.09 − 1.06/√N)/l ≈ 1.09/l   (large N).

7. Conclusion
A variant of a simple algorithm for probabilistically estimating the alphabet size N from a stream of symbols output by a discrete uniform random source has been studied. It divides the stream of symbols online into adjacent blocks such that one block includes a single pair of identical symbols; and an estimate of N is computed from the average size of l of these blocks. That estimate's coefficient of variation (standard deviation divided by the mean) is approximately √(1.09/l) when N is large. Increasing l decreases the coefficient of variation, which results in a more precise estimate of N, but it also increases the average time (number of symbols) it takes to complete the measurement. When estimating large N given a target coefficient of variation CV, l should be set to ⌈1.09/(CV)²⌉.

We have analysed the effects of limited space (the number of symbols that can be stored in computer memory) on the accuracy of estimation, when an algorithm that can store at most c symbols replaces any block size greater than c by c + 1, and characterised the resulting underestimate of N.

It was found that when the number of observed blocks l is set to l = 109, in order to achieve ten percent precision in the measurement, N = 100, 10^3, . . . , 10^6, and the space limit c = ⌈2.9√N⌉, the underestimate of N is less than one percent. With these settings, the space and time used by the algorithm are at most ⌈2.9√N⌉ and 109 · ⌈2.9√N⌉, respectively. Furthermore, as we increase the space limit c beyond ⌈2.9√N⌉ the underestimate of N decreases sharply. For example, when c = ⌈4.5√N⌉, a theoretical calculation shows that the underestimate of N is about 0.001 percent.
In conclusion, the algorithm studied in this paper can quite accurately estimate large alphabet sizes N using space and time of at most ⌈K√N⌉ and l · ⌈K√N⌉, respectively, where the constant K is in the order of 10.

References

[1] Brassard, G., Bratley, P., Algorithmics: Theory and Practice, Prentice Hall (1988), section 8.3.3, pp. 234-235.
[2] Casella, G., Berger, R., Statistical Inference, second edn., chap. 5.5.4, Thomson Learning (2002).
[3] Feller, W., An Introduction to Probability Theory and Its Applications, vol. I, third edn., section II.3, p. 33, John Wiley & Sons (1968). Equation (3.4).
[4] Flajolet, P., Counting by coin tossings, invited lecture at ASIAN'04: the Ninth Asian Computing Science Conference (Chiang Mai, December 2004), pp. 9-11. URL http://algo.inria.fr/flajolet/Publications/Slides/asian04.pdf
[5] Flajolet, P., Sedgewick, R., Analytic Combinatorics, section VI.9, p. 417, Cambridge University Press (2009). Notes VI.24 and VI.25.
[6] Flajolet, P., Gardy, D., Thimonier, L., Birthday paradox, coupon collectors, caching algorithms, and self-organizing search, Discrete Applied Mathematics, 207-229 (1992).
[7] Knuth, D.E., The Art of Computer Programming, Volume I: Fundamental Algorithms, third edn., chap. 1.2.11.3, p. 120, Addison-Wesley (1997). Equation (25).
[8] Montalvão, J., Silva, D.G., Attux, R., Simple entropy estimator for small datasets, Electronics Letters (17), pp. 1059-1061 (2012).
[9] Sedgewick, R., Flajolet, P., An Introduction to the Analysis of Algorithms, second edn., chap. 9.3, p. 487, Addison-Wesley (2013). Theorem 9.1.
[10] Wikipedia, Birthday problem (2016). URL http://en.wikipedia.org/wiki/Birthday_problem

Huawei Technologies and Aalto University, School of Electrical Engineering, Department of Communications and Networking, Finland.
Current address: Huawei Technologies, Itämerenkatu 9, 00180 Helsinki, Finland
E-mail address: