arXiv [math.NA]

Heavy Hitters and Bernoulli Convolutions
Alexander Kushkuley (Salesforce/Demandware, [email protected])

May 29, 2019
Abstract
A very simple event frequency approximation algorithm that is sensitive to event timeliness is suggested. The algorithm iteratively updates a categorical click-distribution, producing (a path of) a random walk on a standard n-dimensional simplex. Under certain conditions, this random walk is self-similar and corresponds to a biased Bernoulli convolution. Evaluation of the algorithm naturally leads to estimation of moments of biased (finite and infinite) Bernoulli convolutions.

To quote [2], "there is a need to estimate the count of a given item i (or event or combination thereof) during some period of time t ... Typically, items with highest counts, commonly known as heavy hitters, are of most interest." This note is an attempt to redefine the event counting problem (cf. [1], [2], [3]). In many cases, the most important factor is an event's recent "popularity rank" (cf. e.g. [3]) and not its long-run frequency. Hence, instead of n item-event counters, consider a time-dependent discrete probability distribution $P = (p_1, p_2, \cdots, p_n)$ as an estimate for the relative frequencies (ranks) of the items involved. An occurrence of an event with index i can be represented by a delta function distribution $\delta_i$ on the set $\{1, \cdots, n\}$, triggering an update of the estimated probability distribution P by an application of the convex mixture rule $P \to \alpha P + (1-\alpha)\delta_i$. In other words, arrival of an event i reduces the ranks of all other events while tilting the estimated event rank-distribution towards event-item i in the simplest way possible. Thus we arrive at the following heavy hitters approximation algorithm.

Algorithm 1. Fix a number $\alpha < 1$ that is close to $1$. If an item $j \in \{1, 2, \cdots, n\}$ was clicked (event number j did occur), set $p_i \to \alpha p_i$ for $i = 1, \cdots, n$, $i \neq j$, and set $p_j \to \alpha p_j + 1 - \alpha$.

One practical problem with the above is that all frequencies (probabilities) are updated simultaneously.
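In practice the update of Algorithm 1 is a one-liner. A minimal sketch follows (the function name `update_ranks` and the concrete value of $\alpha$ are illustrative, not from the paper):

```python
def update_ranks(p, j, alpha=0.9):
    """One step of Algorithm 1: every rank decays by alpha and the
    released mass (1 - alpha) is placed on the clicked item j."""
    return [alpha * pi + (1.0 - alpha) * (1.0 if i == j else 0.0)
            for i, pi in enumerate(p)]

# four items, uniform start; item 2 is clicked once
p = update_ranks([0.25, 0.25, 0.25, 0.25], j=2, alpha=0.9)
assert abs(sum(p) - 1.0) < 1e-12   # still a probability distribution
assert p[2] == max(p)              # the clicked item now has the top rank
```

Since the rule is a convex mixture of two points of the simplex, the updated vector stays on the simplex and no renormalization pass is needed.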
There are, however, some advantages:

(1) decreasing $\alpha$ gives higher priority to recent events and, vice versa, increasing $\alpha$ biases the ranking towards "idling" event items;

(2) therefore, the sensitivity of this ranking scheme to new events can be easily controlled (even at runtime) by adjusting just one parameter.

Remark 1
Suppose that it is desirable that an item should lose half of its rank if it was idle while a list it belongs to was updated T times. It is quite obvious that this can be achieved by setting the parameter $\alpha$ to $\exp(-\log(2)/T)$. For example, if $T = 10$ then $\alpha \approx 0.93$ (cf. [10]).

The close relationship between Algorithm 1 and Bernoulli convolutions (cf. [4]) is the subject of the rest of this paper.
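Before turning to Bernoulli convolutions, the half-life rule of Remark 1 can be checked numerically (a sketch; `alpha_for_half_life` is an illustrative name):

```python
import math

def alpha_for_half_life(T):
    """Decay factor alpha with alpha**T = 1/2: an item idle for T
    updates loses exactly half of its rank (Remark 1)."""
    return math.exp(-math.log(2.0) / T)

a = alpha_for_half_life(10)
assert abs(a - 0.933) < 1e-3       # alpha is about 0.93 for T = 10
assert abs(a ** 10 - 0.5) < 1e-12  # ten idle updates halve the rank
```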
Suppose that incoming event frequencies follow a fixed discrete distribution $Q = (q_1, q_2, \cdots, q_n)$, $\sum_{i=1}^n q_i = 1$, and let $Y_t = (y_{1,t}, \cdots, y_{n,t})$ be the probability distribution vector ($\sum_{i=1}^n y_{i,t} = 1$ for all t) of our (relative) frequency estimates at times $t = 0, 1, \cdots$. Essentially, Algorithm 1 computes a path of a random walk on a standard $(n-1)$-dimensional simplex $\sigma_{n-1} \subset \mathbb{R}^n$ defined by the iterative rule

$$Y_{t+1} = \alpha Y_t + (1-\alpha)\delta_i \quad \text{with probability } q_i, \quad i = 1, 2, \cdots, n \qquad (1)$$

where $\delta_i$ is the i-th vertex of the simplex $\sigma_{n-1}$ or, in other words, the i-th unit vector in standard Euclidean coordinates in $\mathbb{R}^n$. The update rule for the i-th coordinate on iteration $t+1$ is

$$y_{i,t+1} = \begin{cases} \alpha y_{i,t} & \text{with probability } 1 - q_i \\ \alpha y_{i,t} + 1 - \alpha & \text{with probability } q_i \end{cases} \qquad (2)$$

Let us fix a coordinate for a while, omitting the index i. Let $\xi_m$, $m = 1, \cdots, t$, be random biased Bernoulli variables such that $P(\xi_m = 0) = 1 - q$ and $P(\xi_m = 1) = q$. It is well known (see e.g. [4]) that on step t the one-dimensional random walk (2) corresponds to a random variable

$$y_t = \alpha^t y_0 + (1-\alpha)\sum_{m=0}^{t-1} \xi_m \alpha^m \qquad (3)$$

which, up to a mostly irrelevant free term, is a convolution of t biased Bernoulli variables. The infinite biased Bernoulli convolution (cf. e.g. [4]) is obtained from (3) by setting $t = \infty$ or, similarly, by driving the random process (2) an infinite number of steps.

Remark 2
It is well known (see e.g. [5] for a precise statement) that the Bernoulli convolution $(1-\alpha)\sum_{m=0}^{\infty} \xi_m \alpha^m$ is absolutely continuous (with respect to the Lebesgue measure on the line) for almost all sufficiently large values of the parameter $\alpha$. For these values of $\alpha$ the weak limit y of the sequence of random variables $y_t$ does exist, and only this case will be considered in this paper.

Lemma 1

$$E(y_t) = \alpha^t y_0 + (1 - \alpha^t)\, q \qquad (4)$$

Indeed, by definition (2),

$$E(y_t) = \alpha E(y_{t-1}) + (1-\alpha)\, q \qquad (5)$$

and hence by induction $E(y_t) = \alpha^t y_0 + (1-\alpha)(1 + \alpha + \cdots + \alpha^{t-1})\, q$, which is the same as (4).

Lemma 2
$$VAR(y_t) = (1 - \alpha^{2t})\, \frac{1-\alpha}{1+\alpha}\, (q - q^2) \qquad (6)$$

Proof. It follows from the definition (2) that

$$E(y_t^2) = \alpha^2 E(y_{t-1}^2) + (1-\alpha)^2 q + 2\alpha(1-\alpha) E(y_{t-1})\, q$$

and therefore by (5)

$$VAR(y_t) = E(y_t^2) - E(y_t)^2 = \alpha^2\, VAR(y_{t-1}) + (1-\alpha)^2 (q - q^2) \qquad (7)$$

From here, by the same inductive argument as in Lemma 1, we get

$$VAR(y_t) = \frac{1 - \alpha^{2t}}{1 - \alpha^2}\, (1-\alpha)^2 (q - q^2) = (1 - \alpha^{2t})\, \frac{1-\alpha}{1+\alpha}\, (q - q^2)$$

As an obvious consequence of Lemmas 1 and 2 (cf. Remark 2) we have

Corollary 1. The infinite Bernoulli convolution defined by (2) has expectation $q$ and variance $\frac{1-\alpha}{1+\alpha}(q - q^2)$.

Remark 3
Under the assumption that the sought-for limits exist (Remark 2), Corollary 1 can be established by passing to the limit in the recurrence relations (5), (7) and then solving for the expectation and variance respectively.
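Lemmas 1–2 and Corollary 1 are easy to confirm by Monte Carlo simulation of the one-dimensional walk (2). The sketch below uses arbitrary illustrative parameters, sample sizes, and tolerances:

```python
import random

def walk(q, alpha, t, rng):
    """t steps of the one-dimensional walk (2), started at y_0 = 0."""
    y = 0.0
    for _ in range(t):
        y = alpha * y + (1.0 - alpha) * (1.0 if rng.random() < q else 0.0)
    return y

rng = random.Random(0)
q, alpha = 0.3, 0.9
samples = [walk(q, alpha, 150, rng) for _ in range(5000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# Corollary 1: E(y) = q and VAR(y) = (1-alpha)/(1+alpha) * (q - q^2)
assert abs(mean - q) < 0.01
assert abs(var - (1 - alpha) / (1 + alpha) * (q - q * q)) < 0.002
```

With $t = 150$ the bias term $\alpha^t y_0$ of (4) is negligible, so the empirical mean and variance match the limiting values of Corollary 1.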
Here is an example demonstrating that passing to a limit as suggested in Corollary 1 is not always possible.
Example 1
Assuming that the starting point of the random walk (2) is non-zero, we have

$$E\left(\frac{1}{y_{t+1}}\right) = (1-q)\, E\left(\frac{1}{\alpha y_t}\right) + q\, E\left(\frac{1}{\alpha y_t + 1 - \alpha}\right) = \frac{1-q}{\alpha}\, E\left(\frac{1}{y_t}\right) + q\, E\left(\frac{1}{1 - \alpha(1 - y_t)}\right)$$

Passing here to the limit as $t \to \infty$ yields

$$\frac{\alpha - 1 + q}{\alpha}\, E\left(\frac{1}{y}\right) = q\, E\left(\frac{1}{1 - \alpha(1-y)}\right) \qquad (8)$$

which is obviously wrong if $\alpha \leq 1 - q$ (the left-hand side is then non-positive while the right-hand side is positive), and therefore the condition $\alpha > 1 - q$ is necessary for the existence of a continuous limit $\lim_{t\to\infty} 1/y_t$. If $q > 1 - q$, the condition $\alpha > 1 - q$ follows from the well-known necessary condition $\alpha > q^q (1-q)^{1-q}$ for non-singularity of the Bernoulli convolution $\lim_{t\to\infty} y_t$ (cf. e.g. [5]). For (8) to be true, however, we need non-singularity of the inverse of the Bernoulli convolution. Essentially, the question one can ask is this: for what values of $\alpha$ (if any) does $\lim_{t\to\infty} 1/y_t$ satisfying (8) exist?

We will compute variances of random vectors generated by (1) and some other similar random walks. As before, it is assumed that the continuous limit $Y = \lim_{t\to\infty} Y_t$ does exist. It follows from (4)-(5) and Corollary 1 that

$$E(Y_{t+1}) = \alpha E(Y_t) + (1-\alpha)\, Q \qquad (9)$$

$$E(Y_t) = \alpha^t Y_0 + (1 - \alpha^t)\, Q, \qquad E(Y) = Q$$

In what follows, all vectors are assumed to be column vectors, so that for vectors
A, B their outer product is $AB^T$, where $B^T$ is the row-vector transpose of B. A diagonal matrix with the elements of a vector A on its main diagonal will be denoted by $diag(A)$. Using the rule (1) we get

$$E(Y_{t+1} Y_{t+1}^T) = \alpha^2 E(Y_t Y_t^T) + (1-\alpha)^2 \sum_{i=1}^n q_i\, \delta_i \delta_i^T + \alpha(1-\alpha) \sum_{i=1}^n q_i \left( E(Y_t \delta_i^T) + E(\delta_i Y_t^T) \right)$$
$$= \alpha^2 E(Y_t Y_t^T) + (1-\alpha)^2\, diag(Q) + \alpha(1-\alpha) \left( E(Y_t)\, Q^T + Q\, E(Y_t^T) \right) \qquad (10)$$

In the same way, using (9), we compute

$$E(Y_{t+1})\, E(Y_{t+1}^T) = \alpha^2 E(Y_t)\, E(Y_t^T) + (1-\alpha)^2\, Q Q^T + \alpha(1-\alpha) \left( E(Y_t)\, Q^T + Q\, E(Y_t^T) \right)$$

and subtracting this from (10) we obtain the recurrence relation

$$VAR(Y_{t+1}) = \alpha^2\, VAR(Y_t) + (1-\alpha)^2 \left( diag(Q) - Q Q^T \right)$$

which is perfectly similar to (7). Hence, in accordance with Lemma 2, we have

Theorem 1
The covariance matrix of the finite n-dimensional Bernoulli convolution defined by (1) is

$$VAR(Y_t) = (1 - \alpha^{2t})\, \frac{1-\alpha}{1+\alpha} \left( diag(Q) - Q Q^T \right)$$

The covariance matrix of the corresponding infinite n-dimensional Bernoulli convolution is

$$VAR(Y) = \frac{1-\alpha}{1+\alpha} \left( diag(Q) - Q Q^T \right)$$

Let $1_n$ be the n-vector with all its coordinates equal to one. It is easy to check that $VAR(Y_t)(1_n) = VAR(Y)(1_n) = 0$. This is not surprising, since the coordinates of $Y_t$ sum up to one. The matrix $diag(Q) - Q Q^T$ is a symmetric rank-one perturbation of a diagonal matrix, and the spectral structure of such matrices is well studied. We just mention

Corollary 2
If the bias probabilities $q_i$ are pairwise distinct, then all the non-zero eigenvalues of the covariance matrix of the n-dimensional Bernoulli convolution (1) are the distinct roots of the equation

$$\sum_{i=1}^n \frac{q_i}{q_i - \lambda} = 0$$

On the other hand, we have

Example 2
The only eigenvalues of the covariance matrix of the unbiased ($q_i = 1/n$, $i = 1, \cdots, n$) n-dimensional Bernoulli convolution are $0$ and $1/n$.

As a slight generalization of (1), fix $m > 1$ points $v_1, \cdots, v_m$ in $\mathbb{R}^n$ and a discrete probability distribution $Q = (q_1, q_2, \cdots, q_m)$. Define a random walk by the rule

$$Y'_{t+1} = \alpha Y'_t + (1-\alpha)\, v_i \quad \text{with probability } q_i, \quad i = 1, 2, \cdots, m \qquad (11)$$

Let V be the $n \times m$ matrix that has the coordinates of $v_1, \cdots, v_m$ as its columns. For random vectors defined by (11), equation (9) turns into

$$E(Y'_{t+1}) = \alpha E(Y'_t) + (1-\alpha)\, V Q$$
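Theorem 1 and Corollary 2 can be illustrated in plain Python. A sketch, in which the distribution Q, the bisection bracket, and the helper names are all chosen for illustration:

```python
def cov_matrix(Q, alpha):
    """(1-alpha)/(1+alpha) * (diag(Q) - Q Q^T): covariance of the
    infinite n-dimensional Bernoulli convolution (Theorem 1)."""
    c = (1.0 - alpha) / (1.0 + alpha)
    n = len(Q)
    return [[c * ((Q[i] if i == j else 0.0) - Q[i] * Q[j]) for j in range(n)]
            for i in range(n)]

Q, alpha = [0.5, 0.3, 0.2], 0.9
M = cov_matrix(Q, alpha)
# VAR(Y) annihilates 1_n: every row of diag(Q) - QQ^T sums to zero
for row in M:
    assert abs(sum(row)) < 1e-12

# Corollary 2: the non-zero eigenvalues solve sum_i q_i / (q_i - lam) = 0.
# The secular function increases between consecutive q_i, so one root can
# be bracketed in (0.2, 0.3) and located by bisection.
def secular(lam):
    return sum(q / (q - lam) for q in Q)

lo, hi = 0.2 + 1e-9, 0.3 - 1e-9
for _ in range(60):
    mid = (lo + hi) / 2.0
    if secular(mid) > 0.0:
        hi = mid
    else:
        lo = mid
assert abs(secular((lo + hi) / 2.0)) < 1e-3
```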
Let $Y' = \lim_{t\to\infty} Y'_t$. From the proof of Theorem 1 we have

Corollary 3

$$E(Y'_t) = \alpha^t Y'_0 + (1 - \alpha^t)\, V Q, \qquad E(Y') = V Q$$
$$VAR(Y'_t) = (1 - \alpha^{2t})\, \frac{1-\alpha}{1+\alpha}\, V \left( diag(Q) - Q Q^T \right) V^T$$

$$VAR(Y') = \frac{1-\alpha}{1+\alpha}\, V \left( diag(Q) - Q Q^T \right) V^T$$

and in the one-dimensional case

Corollary 4

$$E(Y'_t) = \alpha^t Y'_0 + (1 - \alpha^t) \sum_{i=1}^m v_i q_i, \qquad E(Y') = \sum_{i=1}^m v_i q_i$$

$$VAR(Y'_t) = (1 - \alpha^{2t})\, \frac{1-\alpha}{1+\alpha} \left( \sum_{i=1}^m v_i^2 (q_i - q_i^2) - 2 \sum_{i<j} v_i v_j q_i q_j \right)$$

$$VAR(Y') = \frac{1-\alpha}{1+\alpha} \left( \sum_{i=1}^m v_i^2 (q_i - q_i^2) - 2 \sum_{i<j} v_i v_j q_i q_j \right)$$

Note that setting here $m = 2$, $v_1 = 0$, $v_2 = 1$ we, not surprisingly, recover equations (4) and (6). Moreover, consider the case when all points $v_1, \cdots, v_m$ belong to the complex plane. Then $Y'_t$, $t = 1, 2, \cdots$, is a sequence of complex random variables, and again from the proof of Theorem 1 we have

Theorem 2. Let $v_1, \cdots, v_m \in \mathbb{C}$, $m > 1$. Then for the sequence of complex random variables $Y'_t$, $t = 1, 2, \cdots$, defined by (11) we have

$$E(Y'_t) = \alpha^t Y'_0 + (1 - \alpha^t) \sum_{i=1}^m v_i q_i, \qquad E(Y') = \sum_{i=1}^m v_i q_i \qquad (12)$$

$$VAR(Y'_t) = (1 - \alpha^{2t})\, \frac{1-\alpha}{1+\alpha} \left( \sum_{i=1}^m |v_i|^2 (q_i - q_i^2) - 2 \sum_{i<j} \mathrm{Re}(v_i \bar{v}_j)\, q_i q_j \right)$$

Indeed, $VAR(Y'_t) = E(Y'_t \bar{Y}'_t) - E(Y'_t)\, E(\bar{Y}'_t)$, and as in the proof of Theorem 1

$$E(Y'_{t+1} \bar{Y}'_{t+1}) = \alpha^2 E(Y'_t \bar{Y}'_t) + \alpha(1-\alpha) \left( E(Y'_t)\, E(\bar{Y}') + E(\bar{Y}'_t)\, E(Y') \right) + (1-\alpha)^2 \sum_{i=1}^m q_i |v_i|^2 \qquad (13)$$

On the other hand,

$$E(Y'_{t+1})\, E(\bar{Y}'_{t+1}) = \alpha^2 E(Y'_t)\, E(\bar{Y}'_t) + \alpha(1-\alpha) \left( E(Y'_t)\, E(\bar{Y}') + E(\bar{Y}'_t)\, E(Y') \right) + (1-\alpha)^2 E(\bar{Y}')\, E(Y')$$

and it follows from (12) that

$$E(\bar{Y}')\, E(Y') = \sum_{i=1}^m |v_i|^2 q_i^2 + 2 \sum_{i<j} \mathrm{Re}(v_i \bar{v}_j)\, q_i q_j$$

Subtracting, we obtain the recurrence

$$VAR(Y'_{t+1}) = \alpha^2\, VAR(Y'_t) + (1-\alpha)^2 \left( \sum_{i=1}^m |v_i|^2 (q_i - q_i^2) - 2 \sum_{i<j} \mathrm{Re}(v_i \bar{v}_j)\, q_i q_j \right)$$

Corollary 5. If all the points $v_k = e^{i\theta_k} \in \mathbb{C}$, $k = 1, 2, \cdots, m$, $m > 1$, belong to the unit circle, then

$$VAR(Y'_t) = 4\, (1 - \alpha^{2t})\, \frac{1-\alpha}{1+\alpha} \sum_{i<j} \sin^2\!\left( \frac{\theta_i - \theta_j}{2} \right) q_i q_j$$

Indeed, since $\sum_i q_i = 1$ and $|v_i| = 1$, the bracket in Theorem 2 equals $2\sum_{i<j}(1 - \cos(\theta_i - \theta_j))\, q_i q_j = 4\sum_{i<j} \sin^2((\theta_i - \theta_j)/2)\, q_i q_j$.

Algorithm 1 introduces an approximately $\alpha^{-1}$-times-per-iteration "velocity boost" for recent heavy hitters. As we saw above, Algorithm 1 approximates the mean of a fixed incoming click distribution in the long run. Lemmas 1, 2 and a straightforward application of the Chebyshev inequality (cf. e.g. [9] for a vector version) give a reasonable estimate of the quality of this approximation.
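This Chebyshev-type concentration can be sanity-checked by simulation. A sketch with illustrative parameters ($\alpha = 0.99$, $q = 1/2$, arbitrary sample size):

```python
import random

def walk(q, alpha, t, rng):
    """t steps of the one-dimensional walk (2), started at y_0 = 0."""
    y = 0.0
    for _ in range(t):
        y = alpha * y + (1.0 - alpha) * (1.0 if rng.random() < q else 0.0)
    return y

rng = random.Random(1)
q, alpha = 0.5, 0.99
eps = (1.0 - alpha) ** 0.5   # = 0.1
samples = [walk(q, alpha, 1500, rng) for _ in range(1000)]
inside = sum(1 for y in samples if abs(y - q) < eps) / len(samples)
# Chebyshev gives P(|y - q| >= sqrt(1-alpha)) <= (q - q^2)/(1+alpha),
# about 1/8 here; empirically the concentration is much sharper
assert inside >= 0.875
```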
Corollary 7. The following estimates hold for the random variables $y_i = \lim_{t\to\infty} y_{i,t}$ and for the random vector $Y = \lim_{t\to\infty} Y_t$:

$$P(|y_i - q_i| \geq \epsilon) \leq \frac{1-\alpha}{\epsilon^2 (1+\alpha)}\, (q_i - q_i^2), \quad i = 1, \cdots, n \qquad (15)$$

In particular,

$$P\left( |y_i - q_i| \geq \sqrt{1-\alpha} \right) \leq \frac{q_i - q_i^2}{1+\alpha}, \quad i = 1, \cdots, n \qquad (16)$$

and

$$P(\|Y - Q\| \geq \epsilon) \leq \frac{1-\alpha}{\epsilon^2 (1+\alpha)} \left( 1 - \sum_i q_i^2 \right)$$

Remark 4. It follows from (16) that for any $q_i$ and large enough $\alpha$, about $(7/8)$-th of the limit distribution belongs to the narrow interval $[q_i - \sqrt{1-\alpha},\, q_i + \sqrt{1-\alpha}]$.

Example 4. For $\alpha = 0.99$, $q = 1/2$ and sufficiently large t, the value of $y_t$ will belong to the interval $[0.4, 0.6]$ with probability about $7/8$.

It is obvious that the estimate (15) works better for large values of q, i.e. for the above-mentioned heavy hitters. More precisely, setting $\epsilon \leftarrow \epsilon q_i$ in (15) we get

Corollary 8. The estimate $P(|y_i - q_i| \geq \epsilon q_i) \leq \epsilon$ holds for

$$q_i \geq \frac{1}{1 + \frac{1+\alpha}{1-\alpha}\, \epsilon^3}$$

Example 5. For $\epsilon = 1/10$ and $\alpha = 1 - \epsilon^3 = 0.999$ this boils down to $P(|y_i - q_i| \geq q_i/10) \leq 1/10$ if $q_i \geq 1/3$. In other words, for a large enough number of iterations, click probabilities that are slightly above $1/3$ can be approximated up to $10\%$ relative error with $90\%$ confidence.

For a finite Bernoulli convolution obtained after t iterations of Algorithm 1, we get from (4) and (6)

Corollary 9. If $y_{i,0} = q_i$, $i = 1, \cdots, n$, then for any $t = 1, 2, \cdots$

$$P(|y_{i,t} - q_i| \geq \epsilon) \leq \frac{(1 - \alpha^{2t})(1-\alpha)}{\epsilon^2 (1+\alpha)}\, (q_i - q_i^2), \quad i = 1, \cdots, n$$

In particular,

$$P(|y_{i,t} - q_i| \geq \sqrt{1-\alpha}) \leq \frac{1 - \alpha^{2t}}{1+\alpha}\, (q_i - q_i^2), \quad i = 1, \cdots, n$$

and if $Y_0 = Q$ then

$$P(\|Y_t - Q\| \geq \epsilon) \leq \frac{(1 - \alpha^{2t})(1-\alpha)}{\epsilon^2 (1+\alpha)} \left( 1 - \sum_i q_i^2 \right)$$

Moments of unbiased Bernoulli convolutions were studied in [6], [7], [8]. Some basic properties of the moments of biased infinite Bernoulli convolutions are briefly discussed in this section. It makes sense to consider the central moments $E(y - q)^n$ (cf. Corollary 1). Hence, we replace the sequence $y_t$ with the sequence $y_t - q$, which from now on will be denoted by the same letter.
The transformation rule (2) thus changes to

$$y_{m+1} = \begin{cases} \alpha y_m - (1-\alpha)\, q & \text{with probability } 1 - q \\ \alpha y_m + (1-\alpha)(1-q) & \text{with probability } q \end{cases} \qquad (17)$$

For the expectations of the random variable sequence $y_m^n$, $m = 1, 2, \cdots$, that translates into

$$E(y_{m+1}^n) = (1-q)\, E\left( (\alpha y_m - (1-\alpha)q)^n \right) + q\, E\left( (\alpha y_m + (1-\alpha)(1-q))^n \right)$$

Opening the brackets and passing to the limit (that is assumed to exist) results in the identity

$$E(y^n) = \alpha^n E(y^n) + \sum_{k=1}^n \binom{n}{k} \alpha^{n-k} (1-\alpha)^k \left( (-q)^k (1-q) + q (1-q)^k \right) E(y^{n-k})$$

Finally, after relabeling $M_k = E(y^k)$ and noting that the $k = 1$ term vanishes, we obtain for $n \geq 2$ the recurrence (cf. [7])

$$M_n = \frac{1}{1-\alpha^n} \sum_{k=1}^n \binom{n}{k} \alpha^{n-k} (1-\alpha)^k \left( (1-q)(-q)^k + q(1-q)^k \right) M_{n-k} \equiv \frac{q - q^2}{1-\alpha^n} \sum_{k=2}^n \binom{n}{k} \alpha^{n-k} (1-\alpha)^k \left( (-1)^k q^{k-1} + (1-q)^{k-1} \right) M_{n-k} \qquad (18)$$

Obviously, $M_0 = 1$ and $M_1 = 0$. It is now a simple matter to write down a few central moments of the infinite Bernoulli convolution (2):

Example 6

$$M_2 = \frac{1-\alpha}{1+\alpha}\, (q - q^2)$$

$$M_3 = \frac{(1-\alpha)^2}{1+\alpha+\alpha^2}\, (q - q^2)(1 - 2q)$$

$$M_4 = \frac{(1-\alpha)^2}{(1+\alpha)(1+\alpha^2)}\, (q - q^2) \left[ \frac{6\alpha^2}{1+\alpha}\, (q - q^2) + (1-\alpha)(1 - 3q + 3q^2) \right]$$

Let $\mu_y$ be the measure associated with the infinite Bernoulli convolution y generated by rule (17), and let $(\,.\,)^*$ denote the reflection $x \to -x$. Denote also by $y^*$ the infinite Bernoulli convolution generated by the rule (17) with interchanged probabilities $q \to 1 - q$. It is probably worth mentioning

Corollary 10.

(i) for any interval $[a, b]$, $\mu_{y^*}([a, b]^*) = \mu_y([a, b])$;

(ii) $y^* = -y$, and therefore

(iii) $E(y^n) = (-1)^n E(y^{*n})$, $n = 0, 1, \cdots$;

(iv) as polynomials of q, the central moments $M_n(q) \equiv E(y^n)(q)$ are semi-invariant with respect to the involution $\tau: q \to 1 - q$, that is, $M_n^\tau(q) = (-1)^n M_n(q)$.

Indeed, statements (i) and (ii) follow from definition (17).
Statement (iii) follows from (ii) or (iv), and the proof of (iv) is a straightforward induction based on (18). Moreover, for the central moments $M_n \equiv M_n(q)$ as polynomials of q we have

Corollary 11. $M_n(q)$ is a polynomial of $q(1-q)$ if n is even, and is a polynomial of $q(1-q)$ times $1 - 2q$ if n is odd.

This is an easy consequence of Corollary 10. Just note that it follows from Corollary 10 (iv) that $M_n(q)$ is divisible by $q - 1/2$ if n is odd.

Lemma 3. If $q \leq 1 - q$ then

(i) all central moments $M_n$ are non-negative;

(ii) $M_n \leq (1-q)^n$ for all $n = 0, 1, \cdots$;

(iii) $\lim_{n\to\infty} M_{2n}^{1/2n} = 1 - q$.

Proof. The first statement follows directly from (18). The second statement is obvious. Statement (iii) is just a recollection of the well-known fact that the sequence of n-norms $\left( \int_{-q}^{1-q} |x|^n\, d\mu_y(x) \right)^{1/n}$ converges to the $\infty$-norm $\max\{|y|\} = 1 - q$.

Although the random variable y is not non-negative, the following still holds.

Theorem 3. If $q \leq 1 - q$ then $\lim_{n\to\infty} M_n^{1/n} = 1 - q$.

Proof. The sequence $M_{2n}^{1/2n}$, $n = 0, 1, \cdots$, of even-numbered central moments is non-decreasing by Hölder's inequality and converges to $1 - q$ by Lemma 3. Hence, for any $\epsilon > 0$ there is $k_0 = k_0(\epsilon)$ such that

$$M_n \geq (1 - q - \epsilon)^n \qquad (19)$$

for all even n such that $n \geq k_0$. In particular, $M_{n-k} \geq (1 - q - \epsilon)^{n-k}$ for all odd k and n such that $k \leq n - k_0$. Using this fact, we will show that an estimate similar to (19) holds for any large odd number n.
Indeed, it follows from (18), Lemma 3 (i) and (19) that for any odd $n > k_0 + 2$

$$M_n \geq \frac{q\, (1 - q - \epsilon)^n}{1-\alpha^n} \sum_{\substack{k \leq n - k_0 \\ k \text{ odd}}} \binom{n}{k} \alpha^{n-k} (1-\alpha)^k \qquad (20)$$

It is easy to see, however, that the sum in (20) is equal to

$$\frac{1}{2}\left( 1 - (2\alpha - 1)^n \right) - \sum_{\substack{k > n - k_0 \\ k \text{ odd}}} \binom{n}{k} \alpha^{n-k} (1-\alpha)^k$$

References

[1] Graham Cormode, Marios Hadjieleftheriou, "Finding Frequent Items in Data Streams", Proceedings of the VLDB Endowment, Volume 1, Issue 2, August 2008

[2] Anshumali Shrivastava, Arnd Christian König, Mikhail Bilenko, "Time Adaptive Sketches (Ada-Sketches) for Summarizing Data Streams", SIGMOD'16, June 26-July 01, 2016, San Francisco, CA, USA

[3] Chen-Yu Hsu, Piotr Indyk, Dina Katabi, Ali Vakilian, "Learning-Based Frequency Estimation Algorithms", ICLR 2019

[4] Yuval Peres, Wilhelm Schlag, Boris Solomyak, "Sixty Years of Bernoulli Convolutions", in Fractal Geometry and Stochastics II (Greifswald/Koserow, 1998), volume 46 of Progr. Probab., pages 39-65, Birkhäuser, Basel, 2000

[5] Pablo Shmerkin, "On the Exceptional Set for Absolute Continuity of Bernoulli Convolutions", arXiv:1303.3992v2, 2014

[6] Pawel J. Szablowski, "On Moments of Cantor and Related Distributions", arXiv:1403.0386, 2014

[7] E. A. Timofeev, "Asymptotic Formula for the Moments of Bernoulli Convolutions", Modeling and Analysis of Information Systems, 23:2, 185-194, 2016

[8] C. Escribano, M. A. Sastre, E. Torrano, "Moments of Infinite Convolutions of Symmetric Bernoulli Distributions", Journal of Computational and Applied Mathematics 153 (2003), 191-199

[9] Ferentinos, "On Tchebycheff type inequalities", Trabajos Estadíst. Investigación Oper. 33 (1982), 125-132

[10]