Cardinality estimation using Gumbel distribution
Aleksander Łukasiewicz and Przemysław Uznański
Institute of Computer Science, University of Wrocław, Poland.
Abstract
Cardinality estimation is the task of approximating the number of distinct elements in a large dataset with possibly repeating elements. LogLog and HyperLogLog (c.f. Durand and Flajolet [ESA 2003], Flajolet et al. [Discrete Math. Theor. Comput. Sci. 2007]) are small-space sketching schemes for cardinality estimation, which both have strong theoretical guarantees of performance and are highly effective in practice. This makes them a highly popular solution, with many implementations in big-data systems (e.g. Algebird, Apache DataSketches, BigQuery, Presto and Redis). However, despite their simple and elegant formulation, the analyses of both LogLog and HyperLogLog are extremely involved, spanning tens of pages of analytic combinatorics and complex function analysis.

We propose a modification to both LogLog and HyperLogLog that replaces the discrete geometric distribution with a continuous Gumbel distribution. This leads to a very short, simple and elementary analysis of the estimation guarantees, and to smoother behavior of the estimator.
Introduction.

In the cardinality estimation problem we are presented with a dataset consisting of many items that might be repeating. Our goal is to process this dataset efficiently, to estimate the number n of distinct elements it contains. Here, efficiently means in small auxiliary space and with fast processing per item. A natural scenario to consider is stream processing of a dataset, with the stream of events consisting of element insertions to the multiset and queries of the multiset cardinality.

A folklore information-theoretic analysis reveals that this problem over a universe of u elements requires at least u bits of memory to answer queries exactly. However, in many practical settings it suffices to provide an approximation of the cardinality. An example scenario is estimating the number of unique addresses in packets that a router observes, in order to detect malicious behaviors and attacks. Here the limited computational capabilities of the router and the sheer volume of data observed over e.g. a day call for specialized solutions.

The theoretical study of this problem was initiated by the seminal work of Flajolet and Martin [20]. Two follow-up lines of research emerged. From the first, we mention [6, 7, 8, 11, 22, 23, 29] on the upper-bound side and [6, 10, 27, 28, 35] on the lower-bound side. Those works focus on (ε, δ)-guarantees, meaning that they guarantee outputting a (1 + ε)-multiplicative approximation of the number of distinct elements with probability at least 1 − δ. The high-level takeaway message is that one can construct approximate schemes that provide a (1 + ε)-multiplicative approximation to the number of distinct elements using on the order of ε^{-2} space, and that this dependency on ε is tight. More specifically, the work of Błasiok [11] settles the bit complexity of the problem by providing an O(ε^{-2} log δ^{-1} + log n) bits of space upper bound, and this complexity is optimal by a matching lower bound [28]. To achieve such small space usage, a number of issues have to be resolved, and a very sophisticated machinery of expanders and pseudo-randomness is deployed.

The other line of work is more practical in nature, and focuses on providing variance bounds for efficient algorithms. The bounds are usually of the form ∼ 1/√k, where k is some measure of the space complexity of the algorithm (usually corresponding to the number of parallel estimation processes). This includes the works [9, 12, 14, 16, 18, 19, 21, 24, 30, 31, 33, 34]. We now focus on two specific algorithms, namely LogLog [16], later refined to HyperLogLog [19]. The standard deviations guaranteed are approximately 1.30/√k and 1.04/√k respectively, when using k integer registers. Both are based on the simple principle of observing the maximal number of trailing zeroes in the binary representations of hashes of elements in the stream, although they vary in the way they extract the final estimate from this observed value (we discuss those details in the following section). In addition to being easy to state and coming with theoretical guarantees, they are highly practical. We note follow-up works on algorithmic engineering of practical variants [17, 26, 36], with actual implementations e.g. in Algebird [1], BigQuery [2], Apache DataSketches [3], Presto [4] and Redis [5].

Despite their simplicity and popularity, LogLog and HyperLogLog are exceptionally tough to analyze. Both papers analyzing LogLog and later HyperLogLog use a heavy machinery of tools from analytic combinatorics and complex function analysis to establish the algorithms' guarantees, such as the Mellin transform from complex analysis, poissonization of the algorithm, and analytical depoissonization (unpacking the main tool used there requires another tens of pages from [32]). Additionally, all of this is presented in a highly compressed form. Thus the analysis is not easily digestible by a typical computer scientist, and has to be accepted "as is" in a black-box manner, without actually unpacking it.

This creates an unsatisfactory situation where one of the most popular and most elegant algorithms for the cardinality estimation problem has to be treated as a black box from the perspective of its performance guarantees. This is an obstacle both in terms of popularization of the LogLog and HyperLogLog algorithms, and in terms of scientific progress. We note that those algorithms are generally omitted from the majority of theoretical courses on streaming and big-data algorithms.
Our contribution.
Our contribution is twofold. First, we observe that a key part of the LogLog and HyperLogLog algorithms is counting the trailing zeroes in the binary representation of a hash of an element. This random variable is distributed according to the geometric distribution. Both LogLog and HyperLogLog use the maximal count of trailing zeroes observed over all elements to estimate the cardinality. However, the maximum of many independent random variables drawn from identical geometric distributions is not itself distributed according to a geometric distribution. This is unwieldy to handle in the analysis in [19]. We propose to replace the geometric distribution with the Gumbel distribution, which has the following crucial property:

If X_1, ..., X_k are independent random variables drawn from a Gumbel distribution, then Z = max(X_1, ..., X_k) − ln(k) is also distributed according to the same Gumbel distribution.

This lets us simplify the extraction of the value of k from max(X_1, ..., X_k), since we are always dealing with the same type of error (Gumbel-distributed) on top of the value ln(k).

Our second contribution comes in the form of a simple analysis of the performance guarantees of the estimation. Instead of analyzing the variance of the estimator itself, we show bounds on the intermediate process of maxima of Gumbel random variables. This requires an application of some basic probabilistic inequalities and multinomial identities to bound it in the context of stochastic averaging (we discuss this later in the paper).

The key concept used in virtually all cardinality estimation results can be summarized as follows: given a universe U of elements, we start by picking a hash function h. Then, given a subset M ⊆ U whose cardinality we want to estimate, we proceed by applying h to every element of M and operate only on M′ = {h(x) : x ∈ M} ⊂ [0, 1]. From M′ we compute an observable, i.e. a quantity that depends only on the underlying set and is independent of replications. The final step is estimating the cardinality from the observable.

For example, [7] uses h : M → [0, 1] and the value y = min M′ = min_{x ∈ M} h(x) as an observable. We expect y ∼ 1/(n + 1), and thus y^{-1} − 1 is an estimator of n. However, since we need to overcome the variance, we might need to average over many independent instances of the process in order to achieve a good estimation. In this particular example, to get a (1 + ε)-approximation we need to average over O(ε^{-2}) independent repetitions of the algorithm. Therefore, the total memory usage becomes O(ε^{-2} log n) bits.
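To make this concrete, here is a toy Python simulation of this min-based estimator. It is a sketch under our own assumptions, not code from [7]: seeding random.Random per (salt, element) simulates the ideal hash h, and we invert the averaged minimum rather than averaging 1/y directly, since 1/y itself is heavy-tailed.

```python
import random

def min_estimate(stream, reps=200, seed=0):
    """Toy min-observable estimator: per repetition, h(x) is simulated by
    seeding random.Random with (salt, x); y = min_x h(x) has E[y] = 1/(n+1)."""
    rng = random.Random(seed)
    ys = []
    for _ in range(reps):
        salt = rng.getrandbits(64)
        h = lambda x: random.Random(hash((salt, x))).random()
        ys.append(min(h(x) for x in stream))  # duplicates do not change the min
    y_bar = sum(ys) / len(ys)  # average y first: 1/y alone is heavy-tailed
    return 1.0 / y_bar - 1.0

stream = [i % 1000 for i in range(5000)]  # 1000 distinct elements, repeated
print(min_estimate(stream))               # close to 1000
```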
Stochastic averaging.

Stochastic averaging is a technique that in this setting works as follows: instead of processing each element in each of the k processes independently (which is a bottleneck), we partition the input into k disjoint sub-inputs M = M_1 ∪ ... ∪ M_k, and have each observable follow the processing of only a single sub-input. This is achieved by picking a second hash function h′ : M → {1, ..., k}; when processing an element x, it is assigned to M_i where i = h′(x) is decided solely by the hash of x. Thus we expect each M_i to contain roughly n/k elements. Note that the actual numbers of elements in the M_i follow a multinomial distribution, which presents an additional challenge in the analysis.
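This behavior is easy to see empirically; a minimal sketch, where the sha256-based bucketing hash is an illustrative stand-in for h′:

```python
import hashlib

# Partitioning by a hash h' : U -> {1, ..., k}: each distinct element lands in
# one bucket, so bucket loads follow Multinomial(n; 1/k, ..., 1/k).
k, n = 64, 100_000
loads = [0] * k
for x in range(n):
    i = int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:4], "big") % k
    loads[i] += 1
print(min(loads), n // k, max(loads))  # loads concentrate around n/k = 1562
```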
LogLog sketching.

Consider the following: we hash the elements to bitstrings, that is, h : M → {0, 1}^∞, and consider the bit patterns observed. For each element find bit(x) such that h(x) has the prefix 0^{bit(x)} 1. The value bit(x) = c should be observed once every ∼ 2^c different hashes, and can be used to estimate the cardinality. The observable used in LogLog is the value of max_x bit(x) among all elements. Since we expect its value to be roughly of order log n, we can maintain the value of max_x bit(x) using O(log log n) bits.

A single observable produces a value t = max_x bit(x). Denote the observables produced over separate sub-streams as t_1, ..., t_k. We expect the values of t_i to be such that 2^{t_i} ∼ n/k. One can easily show that for any t_i we have E[2^{t_i}] = ∞, thus arithmetic averaging over the 2^{t_i} is not a feasible strategy. However, a geometric average works in this setting, and we expect k · (∏_i 2^{t_i})^{1/k} to be an estimate of n (one needs a normalizing constant that depends solely on k). The variance analysis shows that the standard deviation of the estimation is roughly 1.30/√k.
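As an illustration of the observable, the following Python sketch computes bit(x) from a sha256-based hash (an assumed stand-in for the ideal bitstring hash) and shows that max_x bit(x) grows like log₂ n:

```python
import hashlib
import math

def bit_pattern(x):
    """bit(x): h(x) has prefix 0^bit(x) 1, i.e. the number of leading zeroes."""
    h = int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")
    return format(h, "0256b").index("1")

# max_x bit(x) over n distinct elements is typically around log2(n), so it
# fits in O(log log n) bits; 2^t is a crude single-observable estimate of n.
n = 10_000
t = max(bit_pattern(x) for x in range(n))
print(t, math.log2(n))  # same order of magnitude
```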
HyperLogLog sketching.

HyperLogLog ([19]) improves over LogLog with the following observation: a harmonic average achieves better concentration than the geometric average. Thus HyperLogLog is constructed by substituting the estimate k² · (∑_i 2^{-t_i})^{-1}, again with some normalizing constant (depending on k). The resulting algorithm has standard deviation roughly 1.04/√k.

In fact, it can be shown that the harmonic average is optimal in this setting: among observables that consist of taking a maximum over values of a hash function, the harmonic average gives both the maximum likelihood estimator and the minimum variance estimator (see e.g. [13]). However, those claims hold in the strict sense only without stochastic averaging.
Computation model.

We assume oracle access to a perfect source of randomness, that is, a hash function h : [u] → {0, 1}^∞. If the sketch demands it, we allow it to access multiple independent such sources, which can be simulated from a single one with the help of bit or arithmetic operations. The oracle access is a standard assumption in this line of work (c.f. the discussion in [31]), meant to decouple the bit storage of randomness from the algorithm analysis.

Besides that, we assume the standard RAM model, with words of size log u and standard arithmetic operations on those words taking constant time.
Gumbel distribution.

We use the following distribution, which originates from extreme value theory.

Definition 3.1 (Gumbel distribution [25]). Let Gumbel(μ) denote the distribution given by the following CDF: F(x) = e^{-e^{-(x-μ)}}. Its probability density function is given by f(x) = e^{-e^{-(x-μ)}} · e^{-(x-μ)}.

We note that when x → ∞, then f(x) ≈ e^{-(x-μ)}, thus the Gumbel distribution has an exponential tail on the positive side. The distribution has a doubly exponential tail when x → −∞. We also have the following basic properties when X ∼ Gumbel(μ) (c.f. [25]):

E[X − μ] = γ ≈ 0.577,   Var[X] = π²/6 ≈ 1.645,   (1)

and

E[e^{-X}] = e^{-μ} ∫_{−∞}^{∞} e^{-e^{-x}} e^{-2x} dx = e^{-μ},   (2)

Var[e^{-X}] = E[(e^{-X})²] − e^{-2μ} = e^{-2μ} ∫_{−∞}^{∞} e^{-e^{-x}} e^{-3x} dx − e^{-2μ} = e^{-2μ}.   (3)

Property 3.2 (Sampling from the Gumbel distribution). If t ∈ [0, 1] is drawn uniformly at random, then X = −ln(−ln t) + μ has the distribution Gumbel(μ).

[Figure 1: Distribution of max{X_1, ..., X_k} for several values of k, where the X_i are iid random variables distributed according to the discrete geometric distribution (on the left) and the Gumbel distribution (on the right). The discrete distribution, given by f_k(x) = (1 − 2^{-x-1})^k − (1 − 2^{-x})^k, is drawn with continuous intermediate values for smooth drawing.]

The following property is the key property used in our algorithm analysis. It essentially states that the Gumbel distribution is invariant under taking the maximum of independent samples (up to normalization).

Property 3.3. If x_1, x_2, ..., x_n ∼ Gumbel(0) are independent random variables, then for Z = max(x_1, ..., x_n) we have Z ∼ Gumbel(ln n).

Proof. Pr(Z < x) = ∏_i Pr(x_i < x) = (e^{-e^{-x}})^n = e^{-e^{-(x - ln n)}}.

In fact, the Fisher–Tippett–Gnedenko theorem (c.f. [15]) states that, for any distribution D, if for some sequences a_n, b_n the limit lim_{n→∞} (max(X_1, ..., X_n) − b_n)/a_n converges to some non-degenerate distribution, where X_1, ..., X_n ∼ D are independent, then it converges to one of three possible distribution families: a Fréchet, a Weibull or a Gumbel distribution. Thus, those three distributions can be viewed as counterparts of the normal distribution with respect to taking maxima (instead of repeated additions).
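Both Property 3.2 and Property 3.3 are easy to check numerically; a minimal sketch, using Python's random module as the source of the uniform t:

```python
import math
import random

rng = random.Random(1)

def gumbel(mu=0.0):
    # Property 3.2: t uniform on [0, 1) gives -ln(-ln t) + mu ~ Gumbel(mu)
    return mu - math.log(-math.log(rng.random()))

# Property 3.3: the max of n iid Gumbel(0) samples is Gumbel(ln n), whose
# mean is ln(n) + gamma with gamma ~ 0.5772 (Equation (1)).
n, trials = 1000, 2000
mean = sum(max(gumbel() for _ in range(n)) for _ in range(trials)) / trials
print(mean, math.log(n) + 0.5772)  # the two values nearly agree
```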
Multinomial distribution.

We now discuss the multinomial distribution and its role in analyzing stochastic averaging.
Definition 3.4.
We say that X_1, ..., X_k are distributed according to the Multinomial(n; p_1, ..., p_k) distribution, for some p_i with ∑_i p_i = 1, if for any n_1 + ... + n_k = n there is

Pr[X_1 = n_1 ∧ ... ∧ X_k = n_k] = (n choose n_1, ..., n_k) · p_1^{n_1} · ... · p_k^{n_k}.

Consider the process of distributing n identical balls into k urns, where the probability of any ball landing in urn i is p_i, fully independently between balls. Then the numbers of balls in the urns, X_1, ..., X_k, follow the Multinomial(n; p_1, ..., p_k) distribution.

Let f be some real-valued function. Suppose we have a stochastic process estimating the cardinality of a stream; that is, if n distinct elements appear, the process outputs a value that is concentrated around its expected value f(n). Now we apply stochastic averaging, splitting the stream into sub-streams and feeding each sub-stream to a separate estimation process, say with n_i elements going into sub-stream i. We can look at the following random variables:

S_n = E[∑_i f(n_i)]   and   P_n = E[∏_i f(n_i)].

We expect S_n ≈ k · f(n/k) and P_n ≈ f(n/k)^k. Deriving actual concentration bounds for specifically chosen functions f gives us insight into how well the harmonic average or the geometric average performs when concentrating cardinality estimation processes under stochastic averaging. The analysis of stochastic averaging for a generic function f (under some sanity constraints) has been done in [13]. We derive a stronger set of bounds for two very specific functions: f(x) = 1/(x + 1) and f(x) = ln(x + 1).

The following algorithm shows that, if we are fine with slower updates, the Gumbel distribution plays nicely into estimating cardinality. The main idea is to hash each element into a real value distributed according to the Gumbel distribution, and take the maximum across all values.
Algorithm 1: Cardinality estimation using Gumbel distribution.
Procedure Init()
    pick h_1, ..., h_k : U → [0, 1] as independent hash functions
    X_1 ← −∞, ..., X_k ← −∞
Procedure Update(x)
    for 1 ≤ i ≤ k do
        v ← −ln(−ln h_i(x))   // Gumbel(0) RV
        X_i ← max(v, X_i)
Procedure GeometricEstimate()
    return Z = exp(−γ + (1/k) ∑_i X_i)
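For concreteness, the following is a minimal Python transcription of Algorithm 1; the sha256-derived hash functions stand in for the ideal random oracles of our computation model, and the class and method names are ours.

```python
import hashlib
import math

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

class GumbelSketch:
    """Sketch of Algorithm 1: k independent Gumbel(0) hashes per element,
    registers keep the running maxima."""

    def __init__(self, k):
        self.k = k
        self.X = [-math.inf] * k

    def _h(self, i, x):
        # h_i(x): deterministic, approximately uniform on (0, 1)
        d = hashlib.sha256(f"{i}|{x}".encode()).digest()
        return (int.from_bytes(d[:8], "big") + 1) / (2 ** 64 + 2)

    def update(self, x):
        for i in range(self.k):
            v = -math.log(-math.log(self._h(i, x)))  # Gumbel(0) RV
            self.X[i] = max(self.X[i], v)

    def geometric_estimate(self):
        return math.exp(-GAMMA + sum(self.X) / self.k)

sk = GumbelSketch(k=256)
for x in range(2000):
    sk.update(x % 500)          # 500 distinct elements, repeated
print(sk.geometric_estimate())  # close to 500
```

Note the O(k) hash evaluations per update; removing this per-element overhead is exactly what stochastic averaging accomplishes below.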
Theorem 4.1. Applied to a stream of n distinct elements, Algorithm 1 outputs Z such that |Z − n| ≤ n · (π k^{-1/2} + O(k^{-1})) holds with constant probability 5/6. It uses k real-value registers and spends O(k) operations per processed element of the input. Thus, setting k = ε^{-2} gives a constant probability of Algorithm 1 outputting a (1 + ε)-multiplicative estimation of the cardinality.

Proof. We analyze Algorithm 1 after processing a stream of n distinct elements. For each X_i, its value is a maximum of n random variables drawn from the Gumbel(0) distribution, so by Property 3.3 we have that X_i ∼ Gumbel(ln n). Moreover, repeated occurrences of elements in the stream do not change the state of the algorithm.

By Equation (1), E[X_i] = γ + ln n and Var[X_i] = π²/6. For X = ∑_i X_i there is E[X] = kγ + k ln n and Var[X] = kπ²/6. By Chebyshev's inequality:

Pr(|X − E[X]| ≥ π√k) ≤ 1/6.

Since Z = exp(−γ + X/k), we have that (with probability at least 5/6)

n · exp(−π k^{-1/2}) ≤ Z ≤ n · exp(π k^{-1/2}).

4.1 Stochastic averaging.

We refine Algorithm 1 with stochastic averaging. The application of the technique is straightforward, but we need to take care of the initialization of the X_i registers.
Algorithm 2: Cardinality estimation using Gumbel distribution and stochastic averaging.
Procedure Init()
    pick h : U → {1, ..., k} and r : U → [0, 1] as independent hash functions
    for 1 ≤ i ≤ k do
        X_i ← −ln(−ln u_i), where u_i is picked uniformly from [0, 1]   // Gumbel(0) RV
Procedure Update(x)
    c ← h(x)
    v ← −ln(−ln r(x))   // Gumbel(0) RV
    X_c ← max(v, X_c)
Procedure GeometricEstimate()
    return Z = k · exp(−γ + (1/k) ∑_i X_i)
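A Python sketch of Algorithm 2 follows. As before, sha256 stands in for the ideal hashes h and r; deriving the initial Gumbel(0) register values from a per-bucket seed hash is our assumption.

```python
import hashlib
import math

GAMMA = 0.5772156649015329

class GumbelAveragingSketch:
    """Sketch of Algorithm 2: a bucket hash plays the role of h, a value hash
    plays the role of r, and each update touches a single register."""

    def __init__(self, k, seed=0):
        self.k = k
        self.X = [-math.log(-math.log(self._u(f"init|{seed}|{i}")))
                  for i in range(k)]  # Gumbel(0) initial values

    @staticmethod
    def _u(s):
        # deterministic, approximately uniform on (0, 1)
        d = hashlib.sha256(s.encode()).digest()
        return (int.from_bytes(d[:8], "big") + 1) / (2 ** 64 + 2)

    def update(self, x):
        c = int.from_bytes(
            hashlib.sha256(f"bucket|{x}".encode()).digest()[:8], "big") % self.k
        v = -math.log(-math.log(self._u(f"value|{x}")))  # Gumbel(0) RV
        self.X[c] = max(self.X[c], v)  # O(1) work per element

    def geometric_estimate(self):
        return self.k * math.exp(-GAMMA + sum(self.X) / self.k)

sk = GumbelAveragingSketch(k=1024)
for x in range(100_000):
    sk.update(x % 20_000)        # 20000 distinct elements, repeated
print(sk.geometric_estimate())   # near 20000 (biased up by about k; see proof)
```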
Theorem 4.2. Applied to a stream of n distinct elements, Algorithm 2 outputs Z such that |Z − n| ≤ πn k^{-1/2} + O(n k^{-1} + k) holds with probability 2/3. It uses k real-value registers and spends a constant number of operations per processed element of the input. Thus, setting k = ε^{-2} gives a constant probability of Algorithm 2 outputting a (1 + ε)-multiplicative estimation of the cardinality, assuming n ≥ k^{3/2} = ε^{-3}.

Proof. We analyze Algorithm 2 after processing a stream S of n distinct elements. Let n_1, ..., n_k be the respective numbers of unique items hashed by h into the buckets {1, ..., k}. It follows that n_1, ..., n_k ∼ Multinomial(n; 1/k, ..., 1/k). For each X_i, its value is a maximum of n_i + 1 random variables drawn from the Gumbel(0) distribution (taking into account the n_i updates to its value and its initial value). Thus, conditioned on specific values of n_1, ..., n_k, each X_i follows a Gumbel distribution; more specifically, X_i | n_1, ..., n_k ∼ Gumbel(ln(n_i + 1)). We also observe that, for i ≠ j, X_i | n_1, ..., n_k and X_j | n_1, ..., n_k are independent random variables.

Denote X = ∑_i X_i and Y = ∑_i ln(n_i + 1). We split our analysis of X into two parts. First, an analysis almost identical to the one from Theorem 4.1 follows: E[X_i | n_1, ..., n_k] = γ + ln(n_i + 1) and Var[X_i | n_1, ..., n_k] = π²/6, so by Chebyshev's inequality

Pr(|X − (kγ + Y)| ≥ π√k | n_1, ..., n_k) ≤ 1/6.

We can drop the conditional part and write

Pr(|X − (kγ + Y)| ≥ π√k) ≤ 1/6.   (4)

We now show the concentration of the second part of the sum. First, by convexity we get

Y = ∑_i ln(n_i + 1) ≤ k ln(n/k + 1).   (5)

By Lemma 4.3 (with t = ln 6) we get that

Pr[Y ≥ k ln(n/k) − ln 6] ≥ 5/6.   (6)

Combining Equations (4), (5) and (6), we reach that the following bound holds with probability at least 2/3:

kγ + (k ln(n/k) − ln 6) − π√k ≤ X ≤ kγ + k ln((n + k)/k) + π√k,

or equivalently, since Z = k exp(−γ + X/k),

n · (1 − π k^{-1/2} − O(k^{-1})) ≤ Z ≤ (n + k) · (1 + π k^{-1/2} + O(k^{-1})).
Lemma 4.3. Let n_1, ..., n_k ∼ Multinomial(n; 1/k, ..., 1/k) and let Y = ∑_i ln(n_i + 1). Then Y ≥ k ln(n/k) − t with probability at least 1 − e^{-t}.

Proof. Consider E[e^{-Y}]. We have

E[e^{-Y}] = E[∏_i 1/(n_i + 1)]
  = ∑_{i_1+...+i_k=n} Pr[n_1 = i_1 ∧ ... ∧ n_k = i_k] · ∏_j 1/(i_j + 1)
  = ∑_{i_1+...+i_k=n} k^{-n} (n choose i_1, ..., i_k) ∏_j 1/(i_j + 1)
  = k^{-n} ∑_{i_1+...+i_k=n} n!/((i_1 + 1)! · ... · (i_k + 1)!)
  = k^{-n} ∑_{i_1+...+i_k=n} (n + k choose i_1 + 1, ..., i_k + 1) · n!/(n + k)!
  ≤ k^{-n} · k^{n+k} · n!/(n + k)!
  ≤ (k/n)^k.

Thus, for any t > 0, by Markov's inequality,

Pr[Y ≤ k ln(n/k) − t] = Pr[e^{-Y} ≥ e^{t} · (k/n)^{k}] ≤ Pr[e^{-Y} ≥ e^{t} · E[e^{-Y}]] ≤ e^{-t}.

4.2 Discretization.

The presented sketches use k real-value registers, which is a disadvantage when compared with LogLog and HyperLogLog, where only k integers are used, each taking O(log log n) bits. We now discuss how to reduce the memory footprint of the algorithms.
Simple rounding.

First we note that rounding the registers to the nearest multiple of ε, for some ε > 0, incurs at most an exp(ε) = 1 + ε + O(ε²) multiplicative distortion, both with the estimation procedure GeometricEstimate() from Algorithms 1 and 2 and with the estimation procedure HarmonicEstimate() from Algorithms 4 and 5 (see Appendix). For example, for Algorithm 1 we have, assuming the X′_i are the rounded registers, |X′_i − X_i| ≤ ε, and so for Z′ = exp(−γ + (1/k) ∑_i X′_i) there is Z′/Z = exp((1/k) ∑_i (X′_i − X_i)), so exp(−ε) ≤ Z′/Z ≤ exp(ε). Since each register stores w.h.p. values of magnitude 2 log n, it can be implemented on integer registers using O(log(log(n)/ε)) = O(log log n + log ε^{-1}) bits.
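This distortion bound can be checked directly; the register values below are synthetic stand-ins, not the output of any of the algorithms:

```python
import math
import random

# Rounding each register to the nearest multiple of eps changes the geometric
# estimate by at most a factor of exp(eps) = 1 + eps + O(eps^2).
eps = 0.01
rng = random.Random(2)
X = [rng.gauss(5, 1) for _ in range(1024)]  # synthetic register values
Xr = [round(x / eps) * eps for x in X]      # rounded registers, |Xr - X| <= eps
ratio = math.exp(sum(xr - x for xr, x in zip(Xr, X)) / len(X))
print(ratio)  # Z'/Z, guaranteed within [exp(-eps), exp(eps)]
```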
Randomized rounding.

We now show how to eliminate the log ε^{-1} term. We define the following shift-rounding, for a shift value c ∈ [0, 1]:

f_c(x) := ⌊x + c⌋ − c.

We note two key properties:
1. Shift-rounding commutes with maximum, that is, for any x_1, ..., x_k we have max(f_c(x_1), ..., f_c(x_k)) = f_c(max(x_1, ..., x_k)).
2. If c ∼ U[0, 1], then f_c(x) ∼ U[x − 1, x], where U[a, b] denotes the uniform distribution on the range [a, b].
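Both properties can be verified numerically; a minimal sketch:

```python
import math
import random

def f(x, c):
    return math.floor(x + c) - c  # shift-rounding f_c

rng = random.Random(3)
xs = [rng.uniform(-5, 5) for _ in range(10)]
c = rng.random()

# Property 1: f_c commutes with maximum (f_c is monotone non-decreasing).
assert f(max(xs), c) == max(f(x, c) for x in xs)

# Property 2: over c ~ U[0,1], f_c(x) is uniform on [x-1, x]; mean = x - 1/2.
x = 1.234
m = sum(f(x, rng.random()) for _ in range(100_000)) / 100_000
print(m)  # close to x - 0.5 = 0.734
```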
We thus show how to adapt Algorithm 2 using shift-rounding.

Algorithm 3: Algorithm 2 with shift-rounding.
Procedure Init()
    pick h : U → {1, ..., k} and r : U → [0, 1] as independent hash functions
    for 1 ≤ i ≤ k do
        c_i is picked uniformly from [0, 1]
        X′_i ← ⌊−ln(−ln u_i) + c_i⌋ − c_i, where u_i is picked uniformly from [0, 1]   // shift-rounded Gumbel(0) RV
Procedure Update(x)
    c ← h(x)
    v ← ⌊−ln(−ln r(x)) + c_c⌋ − c_c   // shift-rounded Gumbel(0) RV
    X′_c ← max(v, X′_c)
Procedure GeometricEstimate()
    return Z = k · exp(−γ + 1/2 + (1/k) ∑_i X′_i)

The analysis of Algorithm 3 comes from the following invariant: if Algorithms 3 and 2 are run side by side on the same input stream, then at any given moment X′_i = f_{c_i}(X_i). Thus we have X′_i ∼ Gumbel(ln(n_i + 1)) − U[0, 1], and so

E[X′_i] = γ − 1/2 + ln(n_i + 1), and Var[X′_i] = π²/6 + 1/12.

Additionally, the X′_i are independent, as the X_i were. Thus an analogue of Theorem 4.2 applies to Algorithm 3, with slightly worse constants.
Theorem 4.4. Applied to a stream of n distinct elements, Algorithm 3 outputs Z such that |Z − n| = O(n k^{-1/2} + k) holds with probability 2/3. It uses k integer registers of size O(log log n) bits each and spends a constant number of operations per processed element of the input.

We note that since each X′_i takes values only in the set Z − c_i, of magnitude at most 2 log n w.h.p., it can be stored using O(log log n) bits. The values c_i do not need to be stored explicitly, as they can be extracted by picking a hash function c : {1, ..., k} → [0, 1] and setting c_i = c(i). We note that an analogous adaptation is straightforward for the other algorithms presented in this paper.
References

[1] Algebird HyperLogLog implementation. https://twitter.github.io/algebird/datatypes/approx/hyperloglog.html. Accessed: 2020-08-01.
[2] Counting uniques faster in BigQuery with HyperLogLog++. https://cloud.google.com/blog/products/gcp/counting-uniques-faster-in-bigquery-with-hyperloglog. Accessed: 2020-08-01.
[3] HyperLogLog Sketch. https://datasketches.apache.org/docs/HLL/HLL.html. Accessed: 2020-08-01.
[4] Presto HyperLogLog functions. https://prestodb.github.io/docs/current/functions/hyperloglog.html. Accessed: 2020-08-01.
[5] Redis PFCOUNT command. https://redis.io/commands/pfcount. Accessed: 2020-08-01.
[6] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In STOC 1996, pages 20–29.
[7] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM 2002, pages 1–10.
[8] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In SODA 2002, pages 623–632. ACM/SIAM.
[9] K. Beyer, R. Gemulla, P. J. Haas, B. Reinwald, and Y. Sismanis. Distinct-value synopses for multiset operations. Communications of the ACM, 52(10):87–95, 2009.
[10] J. Brody and A. Chakrabarti. A multi-round communication lower bound for gap hamming and some consequences. In CCC 2009, pages 358–368.
[11] J. Błasiok. Optimal streaming and tracking distinct elements with high probability. In SODA 2018, pages 2432–2448.
[12] A. Chen, J. Cao, L. Shepp, and T. Nguyen. Distinct counting with a self-learning bitmap. Journal of the American Statistical Association, 106(495):879–890, 2011.
[13] P. Clifford and I. A. Cosma. A statistical analysis of probabilistic counting algorithms. Scandinavian Journal of Statistics, 39(1):1–14, 2012.
[14] E. Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis. IEEE Transactions on Knowledge and Data Engineering, 27(9):2320–2334, 2015.
[15] L. De Haan and A. Ferreira. Extreme value theory: an introduction. Springer Science & Business Media, 2007.
[16] M. Durand and P. Flajolet. Loglog counting of large cardinalities (extended abstract). In ESA 2003, pages 605–617.
[17] O. Ertl. New cardinality estimation algorithms for HyperLogLog sketches. CoRR, abs/1702.01284, 2017.
[18] C. Estan, G. Varghese, and M. E. Fisk. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw., 14(5):925–937, 2006.
[19] P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science, pages 137–156, 2007.
[20] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182–209, 1985.
[21] L. Gerin and P. Chassaing. Efficient estimation of the cardinality of large data sets. Discrete Mathematics & Theoretical Computer Science, 2006.
[22] P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB 2001, pages 541–550.
[23] P. B. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In SPAA 2001, pages 281–291.
[24] F. Giroire. Order statistics and estimating cardinalities of massive data sets. Discret. Appl. Math., 157(2):406–427, 2009.
[25] E. J. Gumbel. Les valeurs extrêmes des distributions statistiques. In Annales de l'Institut Henri Poincaré, volume 5, pages 115–158, 1935.
[26] S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT 2013, pages 683–692.
[27] P. Indyk and D. P. Woodruff. Tight lower bounds for the distinct elements problem. In FOCS 2003, pages 283–288.
[28] T. S. Jayram and D. P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with sub-constant error. In SODA 2011, pages 1–10.
[29] D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS 2010, pages 41–52.
[30] J. Lumbroso. An optimal cardinality estimation algorithm based on order statistics and its full analysis. Discrete Mathematics & Theoretical Computer Science, 2010.
[31] S. Pettie and D. Wang. Information theoretic limits of cardinality estimation: Fisher meets Shannon. CoRR, abs/2007.08051, 2020.
[32] W. Szpankowski. Average case analysis of algorithms on sequences, volume 50. John Wiley & Sons, 2011.
[33] D. Ting. Streamed approximate counting of distinct elements: beating optimal batch methods. In KDD 2014, pages 442–451. ACM.
[34] A. Viola, C. Martínez, J. Lumbroso, and A. Helmi. Data streams as random permutations: the distinct element problem. Discrete Mathematics & Theoretical Computer Science, 2012.
[35] D. P. Woodruff. Optimal space lower bounds for all frequency moments. In SODA 2004, pages 167–175.
[36] Q. Xiao, Y. Zhou, and S. Chen. Better with fewer bits: Improving the performance of cardinality estimation of large data streams. In INFOCOM 2017, pages 1–9.
A Harmonic average estimation.
Algorithm 4: Improved estimation for Algorithm 1.
Procedure Init()
    // identical as in Algorithm 1
Procedure Update(x)
    // identical as in Algorithm 1
Procedure HarmonicEstimate()
    return Z = k · (∑_i exp(−X_i))^{-1}
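The harmonic read-out only changes how the registers of Algorithm 1 are interpreted. A minimal sketch, checked on registers simulated directly as Gumbel(ln n) samples (per Property 3.3):

```python
import math
import random

def harmonic_estimate(X):
    """HarmonicEstimate() read-out on registers X_1..X_k of Algorithm 1:
    E[exp(-X_i)] = 1/n for X_i ~ Gumbel(ln n), so k / sum_i exp(-X_i) ~ n."""
    return len(X) / sum(math.exp(-x) for x in X)

rng = random.Random(5)
n, k = 10_000, 1024
X = [math.log(n) - math.log(-math.log(rng.random())) for _ in range(k)]
print(harmonic_estimate(X))  # close to 10000, relative error ~ k**-0.5
```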
Theorem A.1. Applied to a stream of n distinct elements, Algorithm 4 outputs Z such that |Z − n| ≤ n · (2 k^{-1/2} + O(k^{-1})) holds with constant probability 3/4. It uses k real-value registers and spends O(k) operations per processed element of the input. Thus, setting k = ε^{-2} gives a constant probability of Algorithm 4 outputting a (1 + ε)-multiplicative estimation of the cardinality.

Proof. We analyze Algorithm 4 after processing a stream of n distinct elements. For each X_i, its value is a maximum of n random variables drawn from the Gumbel(0) distribution, so by Property 3.3 we have that X_i ∼ Gumbel(ln n). Moreover, repeated occurrences of elements in the stream do not change the state of the algorithm.

Denote U_i = e^{-X_i}. By Equations (2) and (3) we have E[U_i] = 1/n and Var[U_i] = 1/n². Denoting U = ∑_i U_i, we have E[U] = k/n and Var[U] = k/n². Thus, by a standard application of Chebyshev's inequality,

Pr[|U − k/n| ≥ 2√k/n] ≤ 1/4.

Taking into account that Z = k/U, we reach the claim.

A.1 Stochastic averaging.
Algorithm 5: Improved estimation for Algorithm 2.
Procedure Init()
    // identical as in Algorithm 2
Procedure Update(x)
    // identical as in Algorithm 2
Procedure HarmonicEstimate()
    return Z = k² · (∑_i exp(−X_i))^{-1} − 1
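Analogously for Algorithm 5; in the sketch below the registers are simulated directly, drawing X_i ∼ Gumbel(ln(n_i + 1)) with multinomial bucket loads, matching the analysis in Theorem A.2:

```python
import math
import random

def harmonic_estimate_averaged(X):
    """HarmonicEstimate() of Algorithm 5: E[sum_i exp(-X_i)] ~ k^2/(n+1)
    (up to the beta_k correction), hence the k^2/U - 1 read-out."""
    k = len(X)
    return k * k / sum(math.exp(-x) for x in X) - 1

rng = random.Random(6)
n, k = 100_000, 1024
counts = [0] * k
for _ in range(n):
    counts[rng.randrange(k)] += 1  # Multinomial(n; 1/k, ..., 1/k)
X = [math.log(c + 1) - math.log(-math.log(rng.random())) for c in counts]
print(harmonic_estimate_averaged(X))  # close to 100000
```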
Theorem A.2. Applied to a stream of n distinct elements, Algorithm 5 outputs Z such that |Z − n| = O(n k^{-1/2} + n exp(−n/k)) holds with constant probability 3/4. It uses k real-value registers and spends a constant number of operations per processed element of the input. Thus, setting k = ε^{-2} gives a constant probability of Algorithm 5 outputting a (1 + ε)-multiplicative estimation of the cardinality, assuming n ≥ k log k = ε^{-2} log ε^{-2}.

Proof. We analyze Algorithm 5 after processing a stream S of n distinct elements. Let n_1, ..., n_k be the respective numbers of unique items hashed by h into the buckets {1, ..., k}. It follows that n_1, ..., n_k ∼ Multinomial(n; 1/k, ..., 1/k). For each X_i, its value is a maximum of n_i + 1 random variables drawn from the Gumbel(0) distribution (taking into account the n_i updates to its value and its initial value). Thus, conditioned on specific values of n_1, ..., n_k, each X_i follows a Gumbel distribution; more specifically, X_i | n_1, ..., n_k ∼ Gumbel(ln(n_i + 1)). We also observe that, for i ≠ j, X_i | n_1, ..., n_k and X_j | n_1, ..., n_k are independent random variables.

Denote U_i = e^{-X_i} and U = ∑_i U_i. We derive the following bound on the conditional expected value:

E[U | n_1, ..., n_k] = ∑_i E[U_i | n_1, ..., n_k]
  = ∑_i exp(−ln(n_i + 1))   (by Equation (2))
  = ∑_i 1/(n_i + 1),

and on the conditional variance:

Var[U | n_1, ..., n_k] = ∑_i Var[U_i | n_1, ..., n_k]   (independence)
  = ∑_i exp(−2 ln(n_i + 1))   (by Equation (3))
  ≤ ∑_i 2/((n_i + 1)(n_i + 2)).

Denote V = ∑_i 1/(n_i + 1) and W = ∑_i 2/((n_i + 1)(n_i + 2)). Also, let β_k = (1 − 1/k)^k ≤ 1/e be a constant dependent only on k. We have

E[U] = E_{n_1,...,n_k ∼ Multinomial}[E[U | n_1, ..., n_k]]
  = E_{n_1,...,n_k ∼ Multinomial}[V]   (definition of V)
  = (k²/(n + 1)) · (1 − β_k^{(n+1)/k}),   (by Lemma A.3)

and

Var[U] = E[Var[U | n_1, ..., n_k]] + Var[E[U | n_1, ..., n_k]]   (law of total variance)
  ≤ E[W] + Var[V]   (definitions of W and V)
  ≤ 4k³/(n + 1)² + 2β_k^{(n+1)/k} · k⁴/(n + 1)².   (Lemmas A.4 and A.5)

Applying Chebyshev's inequality, we have, with constant probability at least 3/4,

|U − k²/(n + 1)| ≤ (k²/(n + 1)) · (β_k^{(n+1)/k} + O(k^{-1/2}) + O(β_k^{(n+1)/(2k)})).

The claim follows from the fact that Z = k² U^{-1} − 1.

Lemma A.3. Let n_1, ..., n_k ∼ Multinomial(n; 1/k, ..., 1/k) and let V = ∑_i 1/(n_i + 1). Then E[V] = (k²/(n + 1)) · (1 − β_k^{(n+1)/k}).

Proof. Consider the following:

E[V] = k · E[1/(n_1 + 1)]
  = k ∑_{i=0}^{n} Pr[n_1 = i] · 1/(i + 1)
  = k ∑_{i=0}^{n} (n choose i) · ((k − 1)^{n−i}/k^n) · 1/(i + 1)
  = (k/(n + 1)) ∑_{i=0}^{n} (n + 1 choose i + 1) · (k − 1)^{n−i}/k^n
  = (k/(n + 1)) · (k^{n+1} − (k − 1)^{n+1})/k^n
  = (k²/(n + 1)) · (1 − (1 − 1/k)^{n+1}),

and (1 − 1/k)^{n+1} = β_k^{(n+1)/k}.
Lemma A.4. Let n_1, ..., n_k ∼ Multinomial(n; 1/k, ..., 1/k) and let W = ∑_i 2/((n_i + 1)(n_i + 2)). Then E[W] ≤ 2k³/(n + 1)².

Proof. Consider the following:

E[W] = k · E[2/((n_1 + 1)(n_1 + 2))]
  = k ∑_{i=0}^{n} Pr[n_1 = i] · 2/((i + 1)(i + 2))
  = k ∑_{i=0}^{n} (n choose i) · ((k − 1)^{n−i}/k^n) · 2/((i + 1)(i + 2))
  = (2k/((n + 1)(n + 2))) ∑_{i=0}^{n} (n + 2 choose i + 2) · (k − 1)^{n−i}/k^n
  ≤ (2k/((n + 1)(n + 2))) · k^{n+2}/k^n
  = 2k³/((n + 1)(n + 2)) ≤ 2k³/(n + 1)².
Lemma A.5. Let n_1, ..., n_k ∼ Multinomial(n; 1/k, ..., 1/k) and let V = ∑_i 1/(n_i + 1). Then Var[V] ≤ 2k³/(n + 1)² + (2k⁴/(n + 1)²) · β_k^{(n+1)/k}, where β_k ≈ 1/e.

Proof. Consider the following: E[V²] = E[∑_i 1/(n_i + 1)²] + E[∑_{i≠j} 1/((n_i + 1)(n_j + 1))]. Since 1/(n_i + 1)² ≤ 2/((n_i + 1)(n_i + 2)), by Lemma A.4 the first term satisfies

E[∑_i 1/(n_i + 1)²] ≤ E[W] ≤ 2k³/(n + 1)².

For the second term, consider

E[∑_{i≠j} 1/((n_i + 1)(n_j + 1))] = k(k − 1) · E[1/((n_1 + 1)(n_2 + 1))]
  = k(k − 1) ∑_{i,j≥0} Pr[n_1 = i ∧ n_2 = j] · 1/((i + 1)(j + 1))
  = k(k − 1) ∑_{i,j≥0} (n choose i, j, n − i − j) · ((k − 2)^{n−i−j}/k^n) · 1/((i + 1)(j + 1))
  = (k(k − 1)/((n + 1)(n + 2))) ∑_{i,j≥0} (n + 2 choose i + 1, j + 1, n − i − j) · (k − 2)^{n−i−j}/k^n
  ≤ (k(k − 1)/((n + 1)(n + 2))) ∑_{i,j≥0} (n + 2 choose i, j, n + 2 − i − j) · (k − 2)^{n+2−i−j}/k^n
  = (k(k − 1)/((n + 1)(n + 2))) · k^{n+2}/k^n
  ≤ k⁴/((n + 1)(n + 2)).

Additionally, the following bound holds:

Var[V] = E[V²] − (E[V])²
  ≤ 2k³/(n + 1)² + k⁴/((n + 1)(n + 2)) − (k⁴/(n + 1)²) · (1 − β_k^{(n+1)/k})²
  ≤ 2k³/(n + 1)² + (2k⁴/(n + 1)²) · β_k^{(n+1)/k}.
A.2 Discretization.

We note that the techniques used in Algorithm 3 can be applied with harmonic estimation, leading to the following algorithm.
Algorithm 6: Improved estimation for Algorithm 3.
Procedure Init()
    // identical as in Algorithm 3
Procedure Update(x)
    // identical as in Algorithm 3
Procedure HarmonicEstimate()
    return Z = (e − 1) · k² · (∑_i exp(−X′_i))^{-1} − 1
Theorem A.6. Applied to a stream of n distinct elements, Algorithm 6 outputs Z such that |Z − n| = O(n k^{-1/2} + n exp(−n/k)) holds with constant probability. It uses k integer registers of size O(log log n) bits each and spends a constant number of operations per processed element of the input.