A Bayesian nonparametric approach to count-min sketch under power-law data streams
Emanuele Dolera ([email protected]), University of Pavia
Stefano Favaro ([email protected]), University of Torino and Collegio Carlo Alberto
Stefano Peluchetti ([email protected]), Cogent Labs
Abstract
The count-min sketch (CMS) is a randomized data structure that provides estimates of tokens' frequencies in a large data stream, using a compressed representation of the data obtained by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view on the CMS to develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token's frequency in the stream, given the hashed data, and in turn corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves remarkable performance in the estimation of low-frequency tokens. This is a desirable feature in the context of natural language processing, where the power-law behaviour of the data is common.
When processing large data streams, it is critical to represent the data in compact structures that allow statistical information to be extracted efficiently. Sketching algorithms, or simply sketches, are randomized data structures that can be easily updated and queried to perform time- and memory-efficient estimation of statistics of large data streams of tokens.

(Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).)

Sketches have found numerous applications in, e.g., machine learning (Aggarwal and Yu, 2010), security analysis (Dwork et al., 2010), natural language processing (Goyal et al., 2009), computational biology (Zhang et al., 2014), social networks (Song et al., 2009) and games (Harrison, 2010). Of particular interest is the problem of estimating the unknown frequency of a token in a stream, which is typically referred to as a "point query". A notable approach to address point queries is the count-min sketch (CMS) (Cormode and Muthukrishnan, 2005b,a), which uses random hashing to obtain a compressed, or approximate, representation of tokens' frequencies in the stream. The CMS achieves the goal of using a memory-efficient representation of the data stream, while having provable theoretical guarantees on the estimated point query via hashed frequencies. In recent years, there has been increasing interest in improving the performance of the CMS by means of learning models that better exploit properties of the data (Cai et al., 2018; Aamand et al., 2019; Hsu et al., 2019). In this paper, we focus on the learning-augmented CMS of Cai et al. (2018), which relies on Bayesian nonparametric (BNP) modeling of a data stream of tokens.

The learning-augmented CMS of Cai et al. (2018) assumes that tokens in a stream are modeled as random samples from an unknown discrete distribution, which is endowed with a Dirichlet process (DP) prior (Ferguson, 1973). Under this BNP framework, the predictive distribution induced by the DP provides a natural generative scheme for tokens. That is, predictive distributions, combined with both a restriction property and a finite-dimensional projective property of the DP, lead to the posterior distribution of a point query, given the hashed frequencies.
This is referred to as the CMS-DP. Interestingly, the posterior mode recovers the CMS estimate of Cormode and Muthukrishnan (2005a), while other CMS-DP estimates, e.g. the posterior mean and median, may be viewed as CMS estimates with shrinkage. The CMS-DP improves over the CMS on several aspects: i) it incorporates a priori knowledge on the data into the estimates; ii) it assumes an unknown and unbounded number of distinct tokens, which is typically expected in large datasets; iii) it allows the uncertainty induced by the process of random hashing to be modeled, via the posterior distribution induced by a point query.

We extend the BNP approach of Cai et al. (2018) to develop a novel learning-augmented CMS under power-law data streams. Power-law distributions occur in many situations of scientific interest, and have significant consequences for the understanding of natural and man-made phenomena (Clauset et al., 2009). Here, we assume that tokens in the stream are modeled as random samples from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior (Prünster, 2002; Lijoi et al., 2005). The NIGP comes as a "forced" choice, since it is the sole discrete nonparametric prior that combines: i) a power-law tail behaviour, in contrast with the exponential tail behaviour of the DP; ii) both a restriction property and a finite-dimensional projective property analogous to those of the DP, which are critical to compute and work with the posterior distribution of a point query given the hashed frequencies. While this prior choice limits the flexibility of tuning the prior to the power-law degree of the data, the NIGP is arguably still a sensible choice of practical interest. Under the NIGP prior, we compute the posterior distribution of a point query, given the stored hashed frequencies, and in turn corresponding BNP estimates. Applications to synthetic and real data show that our approach outperforms the CMS and the CMS-DP in the estimation of low-frequency tokens.
This is a desirable feature in the context of natural language processing (Goyal et al., 2012, 2009; Pitel and Fouquier, 2015), where the power-law behaviour of the data stream is common.

The paper is structured as follows. In Section 2 we review the BNP approach to the CMS of Cai et al. (2018), and in Section 3 we extend this approach to develop a novel learning-augmented CMS under power-law data streams. Section 4 contains numerical experiments, whereas Section 5 concludes with final remarks and future work.
To introduce the BNP approach of Cai et al. (2018), let $X_m = (X_1, \dots, X_m)$ be a large data stream of tokens taking values in a (possibly infinite) measurable space of symbols $\mathcal{V}$. The stream $X_m$ is available for inferential purposes only through its compressed representation obtained by means of random hashing. Specifically, let $J$ and $N$ be positive integers, set $[J] = \{1, \dots, J\}$ and $[N] = \{1, \dots, N\}$, and let $h_1, \dots, h_N$, with $h_n : \mathcal{V} \to [J]$, be a collection of hash functions drawn uniformly at random from a pairwise independent hash family $\mathcal{H}$. For mathematical convenience it is assumed that $\mathcal{H}$ is a perfectly random hash family, that is, for $h_n$ drawn uniformly at random from $\mathcal{H}$ the random variables $(h_n(x))_{x \in \mathcal{V}}$ are i.i.d. as a Uniform distribution over $[J]$. In practice, as discussed in Cai et al. (2018), real-world hash functions yield only small perturbations from perfect hash functions. Hashing $X_m$ through $h_1, \dots, h_N$ creates $N$ vectors of $J$ buckets $\{C_n\}_{n \in [N]}$, where $C_n = (C_{n,1}, \dots, C_{n,J})$, with $C_{n,j}$ obtained by aggregating the frequencies of all $x$ such that $h_n(x) = j$. Every $C_{n,j}$ is initialized at zero, and whenever a new token $X_i$ is observed we set $C_{n,h_n(X_i)} \leftarrow C_{n,h_n(X_i)} + 1$ for every $n \in [N]$. Under this setting, the goal consists in estimating the frequency $f_v$ of a token of type $v \in \mathcal{V}$ in $X_m$, i.e. the point query $f_v = \sum_{1 \le i \le m} \mathbb{1}_{\{v\}}(X_i)$. In particular, the CMS of Cormode and Muthukrishnan (2005a) estimates $f_v$ with
$$\hat{f}^{(\mathrm{CMS})}_v = \min_{n \in [N]} C_{n,h_n(v)}. \quad (1)$$
We refer to Appendix A for a detailed account of the CMS and a theoretical (probabilistic) guarantee for the estimator (1).

Differently from the CMS of Cormode and Muthukrishnan (2005a), the CMS-DP of Cai et al.
(2018) estimates $f_v$ by relying on the following modeling assumptions on the data stream $X_m$: i) the symbols $v_j$ in $\mathcal{V}$ are distributed according to an unknown probability measure $P(\cdot) = \sum_{j \ge 1} p_j \delta_{v_j}(\cdot)$ on $\mathcal{V}$; ii) $P$ is distributed as a DP prior (Ferguson, 1973) with diffuse probability (base) measure $\nu$ on $\mathcal{V}$ and mass parameter $\alpha > 0$. Then, the tokens $X_i$ are modeled as random samples from a DP, i.e.,
$$X_m \mid P \overset{\text{iid}}{\sim} P, \qquad P \sim \mathrm{DP}(\alpha, \nu) \quad (2)$$
for $m \ge 1$. Under (2), a point query induces the posterior distribution of $f_v$, given $\{C_{n,h_n(v)}\}_{n \in [N]}$, for $v \in \mathcal{V}$. CMS-DP estimates of $f_v$ are obtained as functionals of the posterior distribution, e.g. its mode, mean or median. The computation of the posterior distribution of $f_v$ relies on the predictive distribution of the DP prior, namely the conditional distribution of an additional token given the stream of tokens. This is combined with two critical properties of the DP: P1) the restriction property which, due to the perfectly random $\mathcal{H}$, implies that the prior governing the tokens hashed into each of the $J$ buckets is a DP prior with mass parameter $\alpha/J$; P2) the finite-dimensional projective property which, due to the perfectly random $\mathcal{H}$, implies that the prior governing the multinomial hashed frequencies $C_n$ is a $J$-dimensional symmetric Dirichlet distribution with parameter $\alpha/J$.

Now, we outline the BNP approach of Cai et al. (2018) based on properties P1) and P2). Because of the discreteness of $P \sim \mathrm{DP}(\alpha, \nu)$, a random sample $X_m$ from $P$ induces a random partition of $\{1, \dots, m\}$ into subsets labelled by distinct symbols in $\mathcal{V}$. See Appendix C. The predictive distribution of the DP provides the conditional distribution, given $X_m$, over which partition subset a new token $X_{m+1}$ will join; the size of that subset is precisely the frequency $f_v$ we seek to estimate. However, since we only have access to the hashed frequencies, the object of interest is the distribution $p_{f_v}(\cdot\,; m, \alpha)$ of $f_v$.
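The hashed data structure shared by the CMS and its BNP variants ($N$ rows of $J$ counters updated through $h_1, \dots, h_N$, with the classical point-query estimate (1) given by the row-wise minimum) can be sketched as follows. The modular hash construction below is an illustrative stand-in for a pairwise independent family over integer tokens, not the paper's implementation.

```python
import random

class CountMinSketch:
    """Minimal count-min sketch: N hash rows, J buckets per row."""

    def __init__(self, J, N, seed=0):
        rng = random.Random(seed)
        self.J, self.N = J, N
        # One (a, b) pair per row: h(x) = ((a*x + b) mod p) mod J is a
        # standard pairwise-independent construction over integer tokens.
        self.p = (1 << 61) - 1  # a Mersenne prime
        self.ab = [(rng.randrange(1, self.p), rng.randrange(self.p))
                   for _ in range(N)]
        self.C = [[0] * J for _ in range(N)]

    def _h(self, n, x):
        a, b = self.ab[n]
        return ((a * x + b) % self.p) % self.J

    def update(self, x):
        # C[n, h_n(x)] <- C[n, h_n(x)] + 1 for every row n
        for n in range(self.N):
            self.C[n][self._h(n, x)] += 1

    def query(self, v):
        # CMS point-query estimate (1): minimum over the N rows
        return min(self.C[n][self._h(n, v)] for n in range(self.N))
```

Because counters only ever aggregate frequencies, the estimate (1) never underestimates the true frequency; collisions can only inflate it.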
This distribution follows by marginalizing out the sampling information $X_m$, with respect to the DP prior, from the conditional distribution of $f_v$ given $X_m$. According to property P1), for a single $h_n$ the distribution $p_{f_v}(\cdot\,; c_{n,h_n(v)}, \alpha/J)$ coincides with the posterior distribution of $f_v$, given $C_{n,h_n(v)} = c_{n,h_n(v)}$; the posterior distribution of $f_v$ given $\{C_{n,h_n(v)}\}_{n \in [N]}$ follows by the independence assumption on $\mathcal{H}$ and Bayes theorem. To conclude, it remains to estimate the prior's parameter $\alpha > 0$ based on the hashed frequencies. According to property P2), and by the independence assumption on $\mathcal{H}$, the $N$ vectors $\{C_n\}_{n \in [N]}$ are i.i.d. as a Dirichlet-multinomial distribution with symmetric parameter $\alpha/J$. This fact provides an explicit expression for the likelihood function of the hashed frequencies, and thus leads to a Bayesian estimation of $\alpha$.

We extend the BNP approach of Cai et al. (2018) to develop a novel learning-augmented CMS under power-law data streams. In this respect, it is natural to assume that tokens in a stream $X_m$ are modeled as random samples from an unknown discrete distribution $P$, and then to endow $P$ with a prior distribution $Q$ with power-law tail behaviour. Critical constraints on the choice of $Q$ arise directly from the approach of Cai et al. (2018). In particular, the prior $Q$ must feature both a restriction property and a finite-dimensional projective property analogous to those of the DP prior. This is required to compute and work with the posterior distribution of a point query, given the hashed data stream $X_m$. To the best of our knowledge, the NIGP prior (Prünster, 2002; Lijoi et al., 2005) is the sole discrete nonparametric prior with power-law tail behaviour that features both a restriction property and a finite-dimensional projective property analogous to those of the DP. This paves the way to our learning-augmented CMS under data streams with power-law behaviour.
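For the CMS-DP baseline above, property P2) makes each row $C_n$ Dirichlet-multinomial with symmetric parameter $\alpha/J$; the resulting log-likelihood for $\alpha$ can be sketched as follows (the helper name is ours, and the $\alpha$-free multinomial coefficient is omitted since it does not affect maximization over $\alpha$).

```python
from math import lgamma

def dm_loglik(alpha, rows):
    """Log-likelihood of hashed rows {C_n} under the CMS-DP model:
    each row is Dirichlet-multinomial with symmetric parameter alpha/J.
    (Multinomial coefficients are constant in alpha and omitted.)"""
    J = len(rows[0])
    ll = 0.0
    for c in rows:
        m = sum(c)
        ll += lgamma(alpha) - lgamma(alpha + m)
        ll += sum(lgamma(alpha / J + cj) - lgamma(alpha / J) for cj in c)
    return ll
```

For a single observation and $J = 2$ buckets the marginal probability of either bucket is $1/2$ regardless of $\alpha$, which gives a quick sanity check on the implementation.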
The DP and the NIGP are discrete random probability measures belonging to the class of homogeneous normalized completely random measures (hNCRMs) (James, 2002; Prünster, 2002; Regazzini et al., 2003; Pitman, 2006; Lijoi and Prünster, 2010). Let the measurable space $\mathcal{V}$ be endowed with its Borel $\sigma$-field $\mathcal{F}$. A completely random measure (CRM) $\mu$ on $\mathcal{V}$ is defined as a random measure such that for any $A_1, \dots, A_k$ in $\mathcal{F}$, with $A_i \cap A_j = \emptyset$ for $i \ne j$, the random variables $\mu(A_1), \dots, \mu(A_k)$ are mutually independent (Kingman, 1993). Any CRM $\mu$ with no fixed points of discontinuity and no deterministic drift is represented as $\mu = \sum_{j \ge 1} \xi_j \delta_{v_j}$, where the $\xi_j$'s are positive random jumps and the $v_j$'s are $\mathcal{V}$-valued random locations. Then, $\mu$ is characterized by the Lévy–Khintchine representation
$$\mathbb{E}\Big[e^{-\int_{\mathcal{V}} f(v)\,\mu(\mathrm{d}v)}\Big] = e^{-\int_{\mathbb{R}_+ \times \mathcal{V}} [1 - e^{-\xi f(v)}]\,\gamma(\mathrm{d}\xi, \mathrm{d}v)}, \quad (3)$$
where $f : \mathcal{V} \to \mathbb{R}$ is a measurable function such that $\int |f|\,\mathrm{d}\mu < +\infty$, and $\gamma$ is a measure on $\mathbb{R}_+ \times \mathcal{V}$ such that $\int_B \int_{\mathbb{R}_+} \min\{\xi, 1\}\,\gamma(\mathrm{d}\xi, \mathrm{d}v) < +\infty$ for any $B \in \mathcal{F}$. For our purposes it is useful to separate the jump and location parts of the Lévy intensity measure $\gamma$ by writing it as $\gamma(\mathrm{d}\xi, \mathrm{d}v) = \rho(\mathrm{d}\xi; v)\,\nu(\mathrm{d}v)$, where $\nu$ denotes a measure on $(\mathcal{V}, \mathcal{F})$ and $\rho$ denotes a transition kernel on $\mathcal{B}(\mathbb{R}_+) \times \mathcal{V}$, with $\mathcal{B}(\mathbb{R}_+)$ being the Borel $\sigma$-field of $\mathbb{R}_+$; i.e. $v \mapsto \rho(A; v)$ is $\mathcal{F}$-measurable for any $A \in \mathcal{B}(\mathbb{R}_+)$, and $\rho(\cdot\,; v)$ is a measure on $(\mathbb{R}_+, \mathcal{B}(\mathbb{R}_+))$ for any $v \in \mathcal{V}$. In particular, if $\rho(\cdot\,; v) = \rho(\cdot)$ for any $v$, then the jumps of $\mu$ are independent of their locations. In this case, the CRM $\mu$ is termed a homogeneous CRM. See Appendix B.

hNCRMs are obtained by normalizing CRMs. To define the NIGP, we first introduce the normalized generalized Gamma process (NGGP) (James, 2002; Prünster, 2002; Lijoi et al., 2007), which is a hNCRM including both the DP and the NIGP as special cases.
The NGGP is useful to understand the power-law tail behaviour featured by the NIGP, in contrast with the exponential tail behaviour of the DP, as well as predictive properties of the NIGP. A generalized Gamma process (GGP) $\mu$ on $\mathcal{V}$ is a CRM characterized, through the Lévy–Khintchine formula (3), by the Lévy intensity measure $\gamma(\mathrm{d}\xi, \mathrm{d}v) = \rho_\sigma(\mathrm{d}\xi)\,\alpha\nu(\mathrm{d}v)$, where: i) $\alpha > 0$ is the mass parameter; ii) $\nu$ is a diffuse probability (base) measure on $\mathcal{V}$, governing the location part of $\mu$; iii) $\rho_\sigma$, with $\sigma \in [0, 1)$, is a rate measure on $\mathbb{R}_+$ governing the jump part of $\mu$, such that
$$\rho_\sigma(\mathrm{d}\xi) = \frac{2^{-(1-\sigma)}}{\Gamma(1-\sigma)}\,\xi^{-(1+\sigma)}\,e^{-\xi/2}\,\mathbb{1}_{\mathbb{R}_+}(\xi)\,\mathrm{d}\xi. \quad (4)$$
See Appendix B. The total mass $\mu(\mathcal{V})$ is finite (almost surely) (Lijoi et al., 2007), and then the NGGP is defined as
$$P = \frac{\mu}{\mu(\mathcal{V})} = \sum_{j \ge 1} p_j \delta_{v_j}, \quad (5)$$
where the $p_j = \xi_j / \mu(\mathcal{V})$, $j \ge 1$, are random probabilities such that $p_j \in (0, 1)$ for any $j \ge 1$ and $\sum_{j \ge 1} p_j = 1$ almost surely. For short, we write $P \sim \mathrm{NGGP}(\alpha, \sigma, \nu)$. For $\sigma = 0$ the GGP reduces to the Gamma process (Kingman, 1993), and hence the NGGP becomes a DP with mass parameter $\alpha/2$. The NIGP with mass parameter $\alpha > 0$ is defined as the NGGP with $\sigma = 1/2$, and, for short, we write $P \sim \mathrm{NIGP}(\alpha, \nu)$. See Appendix C.

The NIGP features a restriction property analogous to that of the DP. That is, if $A \subset \mathcal{V}$ and $P_A$ is the random probability measure on $A$ induced by $P \sim \mathrm{NIGP}(\alpha, \nu)$ on $\mathcal{V}$, then $P_A \sim \mathrm{NIGP}(\alpha\nu(A), \nu_A/\nu(A))$, where $\nu_A$ is the projection of $\nu$ to $A$. The restriction property of the NIGP follows from the definition of the NIGP as a normalized GGP, which has a Poisson process representation admitting the Poisson coloring theorem. See Appendix B and Chapter 5 of Kingman (1993) for details. To the best of our knowledge, hNCRM priors are the sole discrete nonparametric priors featuring the restriction property. The NIGP also features a finite-dimensional projective property analogous to that of the DP. That is, if $\{B_1, \dots, B_k\}$ is a measurable $k$-partition of $\mathcal{V}$, for any $k \ge 1$, then $P \sim \mathrm{NIGP}(\alpha, \nu)$ is such that
$$(P(B_1), \dots, P(B_k)) \overset{d}{=} \left(\frac{W_1}{\sum_{i=1}^k W_i}, \dots, \frac{W_k}{\sum_{i=1}^k W_i}\right), \quad (6)$$
with $\overset{d}{=}$ denoting equality in distribution, where the $W_i$'s are independent random variables distributed as an inverse Gaussian (IG) distribution (Seshadri, 1993) with shape parameter $\alpha\nu(B_i)$ and scale parameter $1$, for $i = 1, \dots, k$. The distribution of $(P(B_1), \dots, P(B_k))$ is referred to as the normalized IG distribution (Lijoi et al., 2005; Hadjicharalambous et al., 2011) with parameter $(\alpha\nu(B_1), \dots, \alpha\nu(B_k))$.
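The law (6) can be simulated by normalizing independent IG draws. A sketch using the Michael, Schucany and Haas transform follows; mapping the "shape $a_i$, scale $1$" parameterization above to the common $\mathrm{IG}(\mu = a_i, \lambda = a_i^2)$ parameterization is our assumption.

```python
import math
import random

def sample_ig(mu, lam, rng):
    """Inverse Gaussian IG(mu, lam) via Michael, Schucany & Haas (1976)."""
    nu = rng.gauss(0.0, 1.0) ** 2
    x = (mu + (mu * mu * nu) / (2.0 * lam)
         - (mu / (2.0 * lam)) * math.sqrt(4.0 * mu * lam * nu + (mu * nu) ** 2))
    if rng.random() <= mu / (mu + x):
        return x
    return mu * mu / x

def sample_normalized_ig(shapes, rng):
    """Draw (P(B_1),...,P(B_k)) as in (6): independent IG weights, normalized.
    Assumes W_i ~ IG(mu=a_i, lam=a_i**2) as one reading of 'shape a_i, scale 1'."""
    w = [sample_ig(a, a * a, rng) for a in shapes]
    t = sum(w)
    return [wi / t for wi in w]
```

With equal shapes, symmetry forces each normalized coordinate to have expectation $1/k$, which gives a cheap Monte Carlo check.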
The finite-dimensional projective property of the NIGP follows directly from the definition of the NIGP through its finite-dimensional distributions (Lijoi et al., 2005), for which a peculiar additive property of the IG distribution is critical. See Appendix D. To the best of our knowledge, the DP prior and the NIGP prior are the sole hNCRM priors featuring the finite-dimensional projective property.

Before describing the power-law tail behaviour of the NGGP prior, and hence the power-law tail behaviour featured by the NIGP prior, we recall the sampling structure of the NGGP. Hereafter, we denote by $(a)_{(n)}$ the ascending factorial of $a$ of order $n$, i.e., $(a)_{(n)} = \prod_{0 \le i \le n-1}(a + i)$. Let $P \sim \mathrm{NGGP}(\alpha, \sigma, \nu)$, with $\sigma \in (0, 1)$. Because of the discreteness of $P$, a random sample of tokens $X_m$ from $P$ induces a random partition of the set $\{1, \dots, m\}$ into $1 \le K_m \le m$ partition subsets, labelled by distinct symbols $\mathbf{v} = \{v_1, \dots, v_{K_m}\}$, with corresponding frequencies $(N_1, \dots, N_{K_m})$ such that $1 \le N_i \le m$ and $\sum_{1 \le i \le K_m} N_i = m$. For any $1 \le r \le m$, let $M_{r,m} \ge 0$ denote the random number of distinct symbols with frequency $r$, i.e. $M_{r,m} = \sum_{1 \le i \le K_m} \mathbb{1}_{\{r\}}(N_i)$, so that $\sum_{1 \le r \le m} M_{r,m} = K_m$ and $\sum_{1 \le r \le m} r M_{r,m} = m$. The distribution of $M_m = (M_{1,m}, \dots, M_{m,m})$ is defined on the set $\mathcal{M}_{m,k} = \{(m_1, \dots, m_m) : m_i \ge 0,\ \sum_{1 \le i \le m} m_i = k,\ \sum_{1 \le i \le m} i\,m_i = m\}$. See Appendix C. For $\mathbf{m} \in \mathcal{M}_{m,k}$,
$$\Pr[M_m = \mathbf{m}] = V_{m,k}\, m! \prod_{i=1}^m \left(\frac{(1-\sigma)_{(i-1)}}{i!}\right)^{m_i} \frac{1}{m_i!}, \quad (7)$$
where
$$V_{m,k} = \frac{\alpha^k\, 2^{m-k}\, e^{\frac{\alpha}{2\sigma}}}{\Gamma(m)} \int_0^{+\infty} x^{m-1}\, e^{-\frac{\alpha}{2\sigma}(1+2x)^\sigma}\, (1+2x)^{k\sigma - m}\, \mathrm{d}x.$$
In the next proposition we state the predictive distribution of $P \sim \mathrm{NGGP}(\alpha, \sigma, \nu)$ as a function of the sampling information $X_m$ through the statistic $M_m$. The predictive distribution of the DP prior arises by letting $\sigma \to 0$, whereas the predictive distribution of the NIGP prior arises by setting $\sigma = 1/2$. See Appendix C.

Proposition 1.
For any $m \ge 1$, let $X_m$ be a random sample from $P \sim \mathrm{NGGP}(\alpha, \sigma, \nu)$, with $\sigma \in (0, 1)$, and let $X_m$ feature $K_m = k$ partition subsets, labelled by $\mathbf{v} = \{v_1, \dots, v_{K_m}\}$, with frequencies $(N_1, \dots, N_{K_m})$ such that $M_{r,m} = m_r$ for $1 \le r \le m$. Let $\mathbf{v}_r = \{v_i \in \mathbf{v} : N_i = r\}$, i.e., the labels of the partition subsets with frequency $r$ in $\mathbf{v}$, and $\mathbf{v}_0 = \mathcal{V} - \mathbf{v}$, i.e., the labels in $\mathcal{V}$ not in $\mathbf{v}$. Then,
$$\Pr[X_{m+1} \in \mathbf{v}_r \mid X_m] = \begin{cases} \dfrac{V_{m+1,k+1}}{V_{m,k}}, & r = 0,\\[2mm] \dfrac{V_{m+1,k}}{V_{m,k}}\,(r-\sigma)\,m_r, & r \ge 1. \end{cases} \quad (8)$$

Let $P \sim \mathrm{NGGP}(\alpha, \sigma, \nu)$ with $\sigma \in (0, 1)$ and, from the definition of $P$ in (5), let $(p_{(j)})_{j \ge 1}$ denote the decreasingly ordered random probabilities $p_j$ of $P$. By combining the rate measure (4) with Proposition 23 of Gnedin et al. (2007), as $j \to +\infty$ the $p_{(j)}$'s follow a power-law distribution of exponent $s = \sigma^{-1}$. See Pitman (2003) and references therein for details. That is, the parameter $\sigma \in (0, 1)$ controls the power-law tail behaviour of $P$ through the small probabilities $p_{(j)}$: the larger $\sigma$, the heavier the tail of $P$. At the sampling level, the power-law behaviour of $P \sim \mathrm{NGGP}(\alpha, \sigma, \nu)$ emerges directly from the large-$m$ asymptotic behaviour of the statistics $K_m$ and $M_{r,m}/K_m$ induced by (8). In particular, let $X_m$ be a random sample from $P$. Then, in Proposition 3 of Lijoi et al. (2007) it is shown that, as $m \to +\infty$,
$$\frac{K_m}{m^\sigma} \to S_\sigma \quad (9)$$
almost surely, where $S_\sigma$ is a positive and finite (almost surely) random variable (Pitman, 2006). Moreover, as $m \to +\infty$,
$$\frac{M_{r,m}}{K_m} \to \frac{\sigma\,(1-\sigma)_{(r-1)}}{r!} \quad (10)$$
almost surely. Equation (9) shows that the number $K_m$ of distinct symbols in $X_m$, for large $m$, grows as $m^\sigma$. This is the growth of the number of distinct symbols in random samples from a power-law distribution of exponent $s = \sigma^{-1}$. Moreover, Equation (10) shows that $p_{\sigma,r} = \sigma(1-\sigma)_{(r-1)}/r!$ is the large-$m$ asymptotic proportion of the number of distinct symbols with frequency $r$. Then $p_{\sigma,r} \approx c_\sigma r^{-\sigma-1}$ for large $r$, for a constant $c_\sigma$.
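The limiting proportions in (10) form a proper probability distribution over $r$, computable in log-space via $(1-\sigma)_{(r-1)} = \Gamma(r-\sigma)/\Gamma(1-\sigma)$; a minimal sketch:

```python
from math import exp, lgamma

def p_sigma_r(sigma, r):
    """Limiting fraction of distinct symbols with frequency r, eq. (10):
    sigma * (1-sigma)_{(r-1)} / r!, evaluated in log-space via
    (1-sigma)_{(r-1)} = Gamma(r-sigma)/Gamma(1-sigma)."""
    return sigma * exp(lgamma(r - sigma) - lgamma(1.0 - sigma) - lgamma(r + 1.0))
```

For $\sigma = 1/2$ half of the distinct symbols are asymptotically singletons ($p_{1/2,1} = 1/2$), and the ratio $p_{\sigma,r}/p_{\sigma,2r}$ approaches $2^{\sigma+1}$, reflecting the $r^{-\sigma-1}$ tail.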
This is the distribution of the number of distinct symbols with frequency $r$ in random samples from a power-law distribution of exponent $s = \sigma^{-1}$. See Figure 1.

[Figure 1: $K_m$ and $M_{r,K}$ under $P \sim \mathrm{NGGP}(\alpha, \sigma, \nu)$, for $\alpha = 10$ and $\alpha = 100$: $\sigma = 0$ (blue, solid) and three larger values of $\sigma$ (red, dash-dot; yellow, dashed; purple, dotted).]

Because of its uniqueness in combining a power-law tail behaviour with both a restriction property and a finite-dimensional projective property, the NIGP prior comes as a "forced" choice within our problem of extending the BNP approach of Cai et al. (2018) to deal with power-law data streams. While this choice limits the flexibility of tuning the prior to the power-law degree of the data, in the sense that the NIGP prior is defined as a NGGP prior with $\sigma = 1/2$, it is still a sensible choice of practical interest in applications. In particular, if one were forced to choose a single value of $\sigma \in (0, 1)$, without information on the power-law degree of the data, $\sigma = 1/2$ would arguably be a sensible and safe choice. Hereafter, we assume that tokens in a stream $X_m$ are modeled as random samples from the NIGP, i.e.,
$$X_m \mid P \overset{\text{iid}}{\sim} P, \qquad P \sim \mathrm{NIGP}(\alpha, \nu)$$
for $m \ge 1$. The tokens $X_i$ are hashed through a collection of hash functions $h_1, \dots, h_N$ drawn uniformly at random from a pairwise independent hash family $\mathcal{H}$ which, for mathematical convenience, is assumed to be perfectly random. Under this BNP setting, we combine the predictive distribution of the NIGP with both its restriction property and its finite-dimensional projective property to develop a learning-augmented CMS under power-law data streams. This is referred to as the CMS-NIGP. In particular, we show that a point query induces the posterior distribution of the frequency $f_v$ of a token of type $v$ in $X_m$, given the hashed frequencies $\{C_{n,h_n(v)}\}_{n \in [N]}$, for $v \in \mathcal{V}$. CMS-NIGP estimates of $f_v$ are obtained as suitable functionals of the posterior distribution, e.g.
its mode, mean or median.

The predictive distribution of $P \sim \mathrm{NIGP}(\alpha, \nu)$, i.e. Equation (8) with $\sigma = 1/2$, induces the conditional distribution of $f_v$ given $M_m$. However, since we only have access to the hashed frequencies $\{C_{n,h_n(v)}\}_{n \in [N]}$, the object of interest is the distribution $p_{f_v}(\cdot\,; m, \alpha)$ of $f_v$. Therefore $M_m$ must be marginalized out, under the NIGP prior, from the conditional distribution of $f_v$ given $M_m$. That is, the distribution of $f_v$ is obtained as
$$p_{f_v}(\ell; m, \alpha) = \Pr[f_v = \ell] = \sum_{\mathbf{m} \in \mathcal{M}_{m,k}} \Pr[X_{m+1} \in \mathbf{v}_\ell \mid M_m = \mathbf{m}]\,\Pr[M_m = \mathbf{m}]$$
for $\ell = 0, 1, \dots, m$, where the predictive distribution $\Pr[X_{m+1} \in \mathbf{v}_\ell \mid M_m = \mathbf{m}]$ arises from (8) with $\sigma = 1/2$, and the distribution $\Pr[M_m = \mathbf{m}]$ arises from (7) with $\sigma = 1/2$. The next proposition combines (8) and (7) with $\sigma = 1/2$ to compute the distribution $p_{f_v}(\ell; m, \alpha)$. In this respect, we exploit the fact that the predictive distribution of the NGGP is a function of simple sufficient statistics of the data stream $X_m$, i.e. the statistics $K_m$ and $(K_m, M_{r,m})$ for $r = 1, \dots, m$. This peculiar feature of the NGGP prior (Bacallado et al., 2017) allows a workable expression for $p_{f_v}(\ell; m, \alpha)$, from a purely computational perspective, to be obtained. See Appendix E.

Proposition 2.
For any $m \ge 1$, let $X_m$ denote a random sample of tokens from $P \sim \mathrm{NIGP}(\alpha, \nu)$. Then,
$$p_{f_v}(\ell; m, \alpha) = \begin{cases} \dbinom{m}{\ell}\,\dfrac{\Gamma(\ell+\frac12)\,2^{\ell-m-\frac12}\,\alpha^{m-\ell+\frac32}\,e^{\alpha}}{\pi\,\Gamma(m+1)} \displaystyle\int_0^1 K_{m-\ell-\frac12}\!\Big(\dfrac{\alpha}{\sqrt{x}}\Big)\, x^{\frac{\ell-m}{2}-\frac74}\,(1-x)^m\,\mathrm{d}x, & \ell = 0, 1, \dots, m-1,\\[3mm] \dfrac{2^m\,\alpha\,(\frac12)_{(m)}}{\Gamma(m+1)} \displaystyle\int_0^{+\infty} \dfrac{x^m\,e^{-\alpha(\sqrt{1+2x}-1)}}{(1+2x)^{m+\frac12}}\,\mathrm{d}x, & \ell = m, \end{cases} \quad (11)$$
where $K_\nu(\cdot)$ is the modified Bessel function of the second kind, or Macdonald function, of order $\nu$.

Uniformity of the hash functions $h \sim \mathcal{H}$ implies that each hash function $h_n$ induces a $J$-partition of $\mathcal{V}$, say $\{B_{h_n,1}, \dots, B_{h_n,J}\}$, with base measure $\nu(B_{h_n,j}) = 1/J$ for each block. By the restriction property of the NIGP, the hash function $h_n$ turns a global $P \sim \mathrm{NIGP}(\alpha, \nu)$ that governs the distribution of $X_m$ into a collection of bucket-specific $P_j \sim \mathrm{NIGP}(\alpha/J, J\nu_{B_{h_n,j}})$, for $j = 1, \dots, J$, that govern the distribution of the sole tokens hashed there. This, combined with Proposition 2, leads to the posterior distribution, for the single hash function $h_n$, of $f_v$ given $C_{n,h_n(v)}$, i.e.,
$$\Pr[f_v = \ell \mid C_{n,h_n(v)} = c_{n,h_n(v)}] = p_{f_v}\Big(\ell;\, c_{n,h_n(v)},\, \frac{\alpha}{J}\Big) \quad (12)$$
for $\ell = 0, 1, \dots, c_{n,h_n(v)}$ and $n \in [N]$. Then, the posterior distribution of $f_v$ given $\{C_{n,h_n(v)}\}_{n \in [N]}$ follows from the posterior distribution (12) by exploiting the independence assumption on $\mathcal{H}$ and Bayes theorem. This posterior distribution, which is reported in the next theorem, is the core of the CMS-NIGP. See Appendix E.

Theorem 3.
Let $h_1, \dots, h_N$ be hash functions drawn at random from a truly random hash family $\mathcal{H}$. For any $m \ge 1$, let $X_m$ be a random sample of tokens from $P \sim \mathrm{NIGP}(\alpha, \nu)$, and let $\{C_{n,h_n(v)}\}_{n \in [N]}$ be the hashed frequencies induced from $X_m$ through $h_1, \dots, h_N$, i.e. $C_{n,h_n(v)} = \sum_{1 \le i \le m} \mathbb{1}_{\{h_n(v)\}}(h_n(X_i))$ for $n \in [N]$ and $v \in \mathcal{V}$. Then, the posterior distribution of $f_v$, given $\{C_{n,h_n(v)}\}_{n \in [N]}$, is
$$\Pr[f_v = \ell \mid \{C_{n,h_n(v)}\}_{n \in [N]} = \{c_{n,h_n(v)}\}_{n \in [N]}] \propto \prod_{n \in [N]} p_{f_v}\Big(\ell;\, c_{n,h_n(v)},\, \frac{\alpha}{J}\Big) \quad (13)$$
for $\ell = 0, 1, \dots, \min_{n \in [N]} c_{n,h_n(v)}$, where $p_{f_v}$ is given by (11) and hence involves the modified Bessel function of the second kind, or Macdonald function.

While computing the posterior distribution of $f_v$ is more than what is required from the classical CMS, it leads to two main advantages: i) the posterior distribution of $f_v$ allows different CMS estimates of $f_v$ to be computed according to the specification of suitable loss functions, e.g. the posterior mean under a quadratic loss, the posterior median under the absolute loss, the posterior mode under the 0-1 loss; ii) the posterior distribution of $f_v$ provides a natural tool to quantify the uncertainty of CMS estimates, e.g. via the posterior variance or, in general, via credible intervals arising from suitable concentration inequalities. With respect to i), Cai et al. (2018) showed that the posterior mode recovers the CMS estimate, and they applied the posterior mean to improve CMS estimates of low-frequency tokens. In our context of power-law data streams, we will consider the posterior mean, which is shown to provide better estimates of low-frequency tokens.
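Given per-row posteriors such as (12), the product form of the posterior can be combined numerically in log-space and then normalized with the log-sum-exp device; a generic sketch follows, where the per-row log-posterior arrays are assumed to be computed elsewhere.

```python
from math import exp, log

def combine_rows(log_rows):
    """Combine per-hash-row log-posteriors log p(l | c_n), l = 0..L-1, into
    the joint posterior of (13): sum the logs, then normalize (log-sum-exp)."""
    L = min(len(r) for r in log_rows)  # f_v cannot exceed any hashed count
    logpost = [sum(r[l] for r in log_rows) for l in range(L)]
    mx = max(logpost)
    z = mx + log(sum(exp(v - mx) for v in logpost))
    return [exp(v - z) for v in logpost]

def posterior_mean(post):
    """Posterior-mean point estimate (quadratic loss)."""
    return sum(l * p for l, p in enumerate(post))
```

Working with log-probabilities keeps the product over $N$ rows from underflowing in float64 even when individual terms are tiny.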
With respect to ii), to the best of our knowledge, there are no studies investigating the problem of assessing the uncertainty of CMS estimates. The BNP approach improves over CMS algorithms by providing a natural tool, i.e. the posterior distribution, for quantifying the uncertainty of CMS estimates.

To conclude our posterior analysis of $f_v$, it remains to estimate the prior's parameter $\alpha > 0$ from the collection of hashed frequencies. In particular, this step requires the distribution of $\{C_n\}_{n \in [N]}$, that is, the likelihood function of the hashed frequencies. According to the finite-dimensional projective property of the NIGP prior, for a single hash function $h_n$ the distribution of the hashed frequencies $C_n$ is obtained by integrating the normalized IG distribution (6) with parameter $(\alpha/J, \dots, \alpha/J)$ against the multinomial counts $(c_n)$. Then, the distribution of $\{C_n\}_{n \in [N]}$ follows by the independence assumption on the hash family $\mathcal{H}$. That is,
$$\Pr[\{C_n\}_{n \in [N]} = \{c_n\}_{n \in [N]}] = \prod_{n \in [N]} \frac{m\,\big(\frac{\alpha}{J}\big)^{m+\frac{J}{2}}\,e^{\alpha}}{(\pi/2)^{\frac{J}{2}}\,\prod_{j=1}^J c_{n,j}!} \int_0^{+\infty} x^{m-1}\, \frac{\prod_{j=1}^J K_{c_{n,j}-\frac12}\big(\frac{\alpha}{J}\sqrt{1+2x}\big)}{(1+2x)^{\frac{m}{2}-\frac{J}{4}}}\,\mathrm{d}x. \quad (14)$$
See Appendix F. Equation (14) provides an explicit expression for the likelihood function of the hashed frequencies $\{c_n\}_{n \in [N]}$, and thus it allows the parameter $\alpha$ to be estimated. Here, we adopt an empirical Bayes approach to estimate $\alpha$: we maximize, with respect to $\alpha$, the likelihood function of the hashed frequencies. Alternatively, a fully Bayesian approach can be considered by placing a prior distribution on $\alpha$. The maximization of the likelihood function is performed by evaluating (14) on a grid of exponentially spaced values, which gives an initial bracketing of the maximum. Then, we apply the golden section search derivative-free optimization algorithm (Press et al., 2007) to maximize (14) up to a given absolute tolerance for $\alpha$.
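The empirical-Bayes step just described, i.e. bracketing the maximizer on an exponentially spaced grid and then refining with golden-section search, can be sketched as follows; the log-likelihood `ll` is an assumed callable, e.g. the logarithm of the hashed-frequency likelihood.

```python
import math

def golden_section_max(f, lo, hi, tol=1e-3):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0  # 1/phi ~ 0.618
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:          # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

def empirical_bayes_alpha(ll, grid=None):
    """Bracket the maximizer of ll(alpha) on an exponential grid, then refine."""
    grid = grid or [2.0 ** k for k in range(-5, 20)]
    vals = [ll(a) for a in grid]
    i = max(range(len(grid)), key=vals.__getitem__)
    lo = grid[max(i - 1, 0)]
    hi = grid[min(i + 1, len(grid) - 1)]
    return golden_section_max(ll, lo, hi)
```

Golden-section search needs no derivatives, which fits a likelihood that is only available through numerical quadrature.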
See Appendix F.

We apply the CMS-NIGP to synthetic data and to real data. For the CMS-NIGP estimator of $f_v$, we consider the posterior mean $\hat{f}^{(\mathrm{NIGP})}_v$, which follows from: i) the maximization of (14) with respect to the parameter $\alpha$ for the given data stream of tokens; ii) the evaluation, under the selected (optimal) $\alpha$, of (13) for the given data stream of tokens. To ensure numerical stability in float64, it is imperative to work in log-space and employ the log-sum-exp trick for all summations, including the quadratures used to evaluate integrals. In general, the function $K_\nu(x)$ is difficult to evaluate for an arbitrary real-valued $\nu$. However, for the special case $\nu = c_{n,j} - 1/2$ considered in (14), $K_{c_{n,j}-1/2}(x)$ admits a finite-sum representation that simplifies its evaluation. See Appendix F. For (13), instead, numerically accurate implementations of the required Bessel functions are available (Stan and Boost C++). We checked that the number of employed quadrature points resulted in converged numerical estimates, and that the evaluations of (13) passed basic sanity checks. The computational complexity of evaluating (14) directly is $O(NQJ)$, where $Q$ is the number of quadrature points; we reduced this cost by caching the evaluations of $K_{c_{n,j}-1/2}$ at each step of the optimization. The evaluation of (13) instead has $O(NQ\min_n c_{n,h_n(v)})$ computational complexity. The evaluations of (13) and (14) were performed on a MacBook Pro, and take about ten minutes.

We compare the CMS-NIGP estimator $\hat{f}^{(\mathrm{NIGP})}_v$ with: i) the CMS estimator $\hat{f}^{(\mathrm{CMS})}_v$ of Cormode and Muthukrishnan (2005a), namely the minimum hashed frequency based on $N$ hash functions; ii) the CMS-DP estimator $\hat{f}^{(\mathrm{DP})}_v$ of Cai et al. (2018), corresponding to the posterior mean under the DP prior. We also consider the count-mean-min (CMM) estimator $\hat{f}^{(\mathrm{CMM})}_v$ discussed in the work of Goyal et al. (2012). The CMM relies on the same summary statistics, i.e. the buckets $\{C_n\}_{n \in [N]}$, applied in the CMS, CMS-DP and CMS-NIGP. This facilitates a fair comparison among estimators, since the storage requirement and sketch update complexity are unchanged. In particular, Goyal et al. (2012) show that the CMM estimator stands out in the estimation of low-frequency tokens (see Figure 1 of Goyal et al. (2012)), which is a desirable feature in the context of natural language processing, where the power-law behaviour of the data stream of tokens is common. Hereafter, we compare the estimators $\hat{f}^{(\mathrm{NIGP})}_v$, $\hat{f}^{(\mathrm{DP})}_v$, $\hat{f}^{(\mathrm{CMS})}_v$ and $\hat{f}^{(\mathrm{CMM})}_v$ in terms of the MAE (mean absolute error) between true frequencies and their corresponding estimates. Because of space limitations, the comparison of $\hat{f}^{(\mathrm{NIGP})}_v$ with $\hat{f}^{(\mathrm{CMS})}_v$ and $\hat{f}^{(\mathrm{CMM})}_v$ on synthetic data is reported in Appendix G.
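For half-integer orders such as $\nu = c_{n,j} - 1/2$, the Macdonald function reduces to a finite sum of elementary terms; a sketch of this standard representation:

```python
import math

def bessel_k_half_integer(n, x):
    """K_{n+1/2}(x) via its finite-sum representation:
    K_{n+1/2}(x) = sqrt(pi/(2x)) e^{-x} * sum_{k=0}^{n} (n+k)! / (k! (n-k)! (2x)^k).
    This covers the half-integer orders c_{n,j} - 1/2 appearing in (14)."""
    s = sum(math.factorial(n + k)
            / (math.factorial(k) * math.factorial(n - k) * (2.0 * x) ** k)
            for k in range(n + 1))
    return math.sqrt(math.pi / (2.0 * x)) * math.exp(-x) * s
```

In particular, $K_{1/2}(x) = \sqrt{\pi/(2x)}\,e^{-x}$ and $K_{3/2}(x) = \sqrt{\pi/(2x)}\,e^{-x}(1 + 1/x)$, so no special-function library is needed for these orders.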
The comparison of f̂_v^(NIGP) with f̂_v^(CMS) on real data is also in Appendix G. We consider datasets of tokens simulated from Zipf's distribution with (exponent) parameter s > 1, denoted by Z_s. The parameter s controls the tail behaviour of the Zipf's distribution: the smaller s, the heavier the tail of the distribution, i.e., the larger the fraction of symbols appearing as low-frequency tokens. Here, we generate synthetic datasets of m = 500,000 tokens from Zipf's distributions with five increasing values of the parameter s. We make use of a pairwise independent (2-universal) hash family, with the following pairs of hashing parameters: i) J = 320 and N = 2; ii) J = 160 and N = 4. Table 1 and Table 2 report the MAE of the estimators f̂_v^(DP) and f̂_v^(NIGP). From Table 1 and Table 2, it is clear that f̂_v^(NIGP) performs remarkably better than f̂_v^(DP) in the estimation of low-frequency tokens. In particular, in both Table 1 and Table 2, over the low-frequency bins the MAE of f̂_v^(NIGP) is always smaller than the MAE of f̂_v^(DP), i.e. f̂_v^(NIGP) outperforms f̂_v^(DP). This behaviour becomes more and more evident as the parameter s decreases, that is, the heavier the tail of the distribution, the more the estimator f̂_v^(NIGP) outperforms the estimator f̂_v^(DP). In particular, for the dataset with the smallest s the CMS-NIGP outperforms the CMS-DP for tokens with frequency smaller than 256, whereas for the dataset with the largest s the CMS-NIGP outperforms the CMS-DP for tokens with frequency smaller than 16. A comparison between f̂_v^(NIGP), f̂_v^(CMS) and f̂_v^(CMM) is reported in Appendix G. This comparison reveals that the CMS-NIGP outperforms the CMS in the estimation of low-frequency tokens for both choices of hashing parameters, whereas the CMS-NIGP outperforms the CMM in the estimation of low-frequency tokens for the choice of hashing parameters J = 160 and N = 4.
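The synthetic-data pipeline just described can be sketched end-to-end (a minimal illustration under stated assumptions: numpy's Zipf sampler, a standard 2-universal family h_n(x) = ((a_n x + b_n) mod p) mod J with p prime, and a common count-mean-min rule that de-biases each hashed count by the expected collision mass before averaging; some variants use the median instead. All names are ours, and we use a smaller stream here than in the experiments):

```python
import numpy as np

P = 2_147_483_647  # Mersenne prime 2^31 - 1

def build_sketch(stream, J, N, seed=0):
    """Hash integer tokens into N rows of J buckets with h_n(x) = ((a_n x + b_n) mod P) mod J."""
    rng = np.random.default_rng(seed)
    a = rng.integers(1, P, size=N)
    b = rng.integers(0, P, size=N)
    x = stream % P  # reduce first so products stay within int64
    C = np.zeros((N, J), dtype=np.int64)
    for n in range(N):
        cols = ((a[n] * x + b[n]) % P) % J
        np.add.at(C[n], cols, 1)
    return C, a, b

def cms_estimate(C, a, b, J, v):
    """Count-min: minimum hashed count over the N rows."""
    return int(C[np.arange(len(C)), ((a * v + b) % P) % J].min())

def cmm_estimate(C, a, b, J, v, m):
    """Count-mean-min: subtract the expected collision mass (m - count)/(J - 1)
    from each row, average, and cap at the count-min bound."""
    counts = C[np.arange(len(C)), ((a * v + b) % P) % J]
    debiased = counts - (m - counts) / (J - 1)
    return min(float(debiased.mean()), float(counts.min()))

rng = np.random.default_rng(1)
stream = rng.zipf(a=1.6, size=100_000)  # Zipf exponent s = 1.6, m = 100,000 (illustrative)
C, a, b = build_sketch(stream, J=320, N=2)
```

The count-min estimate always upper-bounds the true frequency, while the de-biased count-mean-min estimate can fall below it (and can even be negative for very rare tokens).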
In general, our experiments show that f̂_v^(NIGP) underestimates large-frequency tokens. To explain this underestimation phenomenon, we observe that the posterior distribution p_{f_v}(ℓ; c_{n,h_n(v)}, α/J) in (12) is a decreasing function of ℓ ∈ {0, 1, …, m}. In other terms, the posterior distribution of f_v assigns more probability mass to small values of f_v. Such a decreasing behaviour of p_{f_v}(ℓ; c_{n,h_n(v)}, α/J), which is inherited from the predictive distribution of the NIGP prior, provides an intuitive explanation of the empirical evidence that the larger f_v, the more f̂_v^(NIGP) underestimates f_v, i.e. for each value of the parameter s of Zipf's distribution the MAE increases along the rows of Table 1 and Table 2. This underestimation phenomenon for large f_v becomes more evident as s becomes larger, namely as the tail of Zipf's distribution becomes lighter and hence the fraction of symbols with low frequency becomes smaller. For instance, we observe that for the bin (128, 256] the MAE for the largest value of s is larger than the MAE for the smallest value of s. We also present an application of the CMS-NIGP to textual datasets, for which the distribution of words is typically a power-law distribution; see Clauset et al. (2009) and references therein. Here, we consider the 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/) and the Enron dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/). The 20 Newsgroups dataset consists of approximately 2 million tokens with approximately 53,000 distinct tokens, whereas the Enron dataset consists of m = 6412175 tokens with k = 28102 distinct tokens. Following the experiments in Cai et al. (2018), we make use of a pairwise independent (2-universal) hash family, with the following hashing parameters: i) J = 12000 and N = 2; ii) J = 8000 and N = 4. By means of the goodness-of-fit test proposed in Clauset et al. (2009), we found that the 20 Newsgroups dataset and the Enron dataset fit a power-law distribution with exponent close to s = 2
, respectively. Table 3 reports the MAE of f̂_v^(DP) and f̂_v^(NIGP) applied to the 20 Newsgroups dataset and to the Enron dataset. The results of Table 3 confirm the behaviour observed on the Zipf synthetic data. That is, f̂_v^(NIGP) outperforms f̂_v^(DP) for low-frequency tokens, whereas f̂_v^(DP) of Cai et al. (2018) performs better than f̂_v^(NIGP) for high-frequency tokens. Table 3 also contains a comparison with f̂_v^(CMM), revealing that f̂_v^(NIGP) is competitive with f̂_v^(CMM) in the estimation of low-frequency tokens for both choices of hashing parameters. Under the BNP approach to the CMS of Cai et al. (2018), the restriction property of the DP is critical to compute the posterior distribution of a point query, given the hashed frequencies, whereas the finite-dimensional projective property of the DP is desirable for ease of estimating the prior's parameters, since it provides the likelihood function of the hashed frequencies. The NIGP prior is the sole discrete nonparametric prior with power-law behaviour that satisfies both the restriction property and the finite-dimensional projective property. This made the NIGP a somewhat "forced" prior choice for our problem of extending the work of Cai et al. (2018) to power-law data streams. By relying on the restriction property and the finite-dimensional projective property of the NIGP, in this paper we have introduced the CMS-NIGP, a learning-augmented CMS for power-law data streams of tokens. The CMS-NIGP exploits BNP modeling to incorporate into the CMS, through the NIGP prior, a sensible power-law behaviour for the data stream. CMS-NIGP estimates of a point query are obtained as functionals, e.g. mean, mode or median, of the posterior distribution of the point query given the stored hashed frequencies. Applications to synthetic and real data have shown that the CMS-NIGP outperforms the CMS and the CMS-DP in the estimation of low-frequency tokens.
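The per-bin MAE reported in Tables 1–3 can be reproduced with a small helper (our own code; the bins are the dyadic intervals (0,1], (1,2], (2,4], …, (128,256] used in the tables):

```python
def binned_mae(true_freqs, estimates):
    """Mean absolute error per dyadic frequency bin (0,1], (1,2], (2,4], ..., (128,256].
    Tokens whose true frequency falls outside all bins are ignored."""
    edges = [0, 1, 2, 4, 8, 16, 32, 64, 128, 256]
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        errs = [abs(f - e) for f, e in zip(true_freqs, estimates) if lo < f <= hi]
        if errs:
            out[(lo, hi)] = sum(errs) / len(errs)
    return out
```

Stratifying by true frequency, rather than averaging over the whole vocabulary, is what makes the low-frequency behaviour of the estimators visible.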
The flaw of the CMS in the estimation of low-frequency tokens is well known, especially in the context of natural language processing (Goyal et al., 2012, 2009; Pitel and Fouquier, 2015), and the CMS-NIGP is a new proposal to compensate for this flaw. In particular, the CMS-NIGP turns out to be competitive with the CMM of Goyal et al. (2012) in the estimation of low-frequency tokens. The NGGP (James, 2002; Prünster, 2002; Lijoi et al., 2007) and the generalized negative Binomial process (GNBP) of Zhou et al. (2016) are examples of nonparametric priors with power-law behaviour, and we considered them before turning to the NIGP. Both the NGGP and the GNBP have the restriction property, which, however, leads to estimators of a point query that are more involved, from a mathematical and computational perspective, than estimators under the NIGP prior. See Appendix E. Neither the NGGP nor the GNBP has the finite-dimensional projective property. The lack of the finite-dimensional projective property makes the use of the likelihood function of the hashed frequencies impractical and, in addition, the complicated form of the predictive distributions induced by the NGGP and the GNBP makes it hard to apply likelihood-free methods to estimate the prior's parameters. We are aware that the NIGP prior limits the flexibility of tuning the prior to the power-law degree
Table 1: Synthetic data: MAE for f̂_v^(NIGP) and f̂_v^(DP), case J = 320, N = 2. Each cell reports f̂_v^(DP) / f̂_v^(NIGP); the five Zipf datasets Z_{s_1}, …, Z_{s_5} are ordered by increasing s.

Bins        Z_{s_1}               Z_{s_2}              Z_{s_3}              Z_{s_4}            Z_{s_5}
(0,1]       1,057.61 / 231.31     626.85 / 134.75      306.70 / 65.71       51.38 / 12.91      32.43 / 7.16
(1,2]       1,194.67 / 287.43     512.43 / 119.22      153.57 / 37.03       288.27 / 61.87     47.84 / 9.88
(2,4]       1,105.16 / 262.18     472.59 / 95.78       2,406.00 / 353.73    133.31 / 26.90     53.97 / 10.09
(4,8]       1,272.02 / 302.89     783.88 / 175.10      457.57 / 83.30       117.76 / 21.58     69.47 / 14.28
(8,16]      1,231.63 / 257.08     716.52 / 136.66      377.99 / 66.44       411.21 / 77.39     80.43 / 20.15
(16,32]     1,252.18 / 248.41     829.17 / 190.05      286.98 / 41.99       501.00 / 90.29     9.61 / 15.36
(32,64]     1,309.14 / 284.12     780.70 / 139.52      413.95 / 67.30       216.84 / 48.00     9.89 / 28.90
(64,128]    1,716.76 / 312.59     946.20 / 125.07      1,869.23 / 353.10    63.05 / 65.91      13.38 / 66.18
(128,256]   1,102.96 / 97.91      1,720.49 / 273.50    199.87 / 110.32      45.98 / 130.94     17.03 / 125.75

Table 2: Synthetic data: MAE for f̂_v^(NIGP) and f̂_v^(DP), case J = 160, N = 4. Each cell reports f̂_v^(DP) / f̂_v^(NIGP); the five Zipf datasets Z_{s_1}, …, Z_{s_5} are ordered by increasing s.

Bins        Z_{s_1}               Z_{s_2}              Z_{s_3}              Z_{s_4}            Z_{s_5}
(0,1]       2,206.09 / 0.9        1,254.85 / 0.25      420.76 / 0.18        153.20 / 0.32      56.08 / 0.38
(1,2]       2,333.06 / 0.5        1,326.71 / 0.70      549.12 / 0.82        180.71 / 1.24      47.48 / 1.45
(2,4]       2,266.35 / 1.3        1,267.97 / 2.47      482.45 / 2.53        182.18 / 2.66      56.87 / 2.74
(4,8]       2,229.22 / 4.6        1,371.27 / 4.67      538.91 / 5.28        250.32 / 5.96      50.30 / 5.42
(8,16]      2,207.42 / 10.5       1,159.29 / 10.68     487.69 / 10.86       245.09 / 10.28     23.70 / 11.75
(16,32]     2,279.80 / 20.7       1,211.41 / 19.21     529.77 / 22.08       293.68 / 21.57     24.41 / 23.37
(32,64]     2,301.99 / 42.6       1,280.17 / 43.14     632.45 / 42.64       118.26 / 44.49     30.95 / 44.03
(64,128]    2,241.57 / 92.2       1,112.41 / 94.43     419.42 / 95.19       177.61 / 95.10     28.78 / 93.34
(128,256]   2,235.40 / 170.0      1,133.85 / 173.87    522.21 / 185.83      128.09 / 180.41    31.46 / 179.51

Table 3: Real data: MAE for f̂_v^(NIGP), f̂_v^(DP) and f̂_v^(CMM). Left block: J = 12000 and N = 2; right block: J = 8000 and N = 4.
Each cell reports f̂_v^(DP) / f̂_v^(NIGP) / f̂_v^(CMM).

            J = 12000, N = 2                                   J = 8000, N = 4
Bins        20 Newsgroups            Enron                     20 Newsgroups            Enron
(0,1]       46.39 / 11.34 / 5.41     12.20 / 3.00 / 0.90       53.39 / 0.39 / 4.50      70.98 / 0.41 / 51.00
(1,2]       16.60 / 3.53 / 2.16      13.80 / 3.06 / 2.00       30.49 / 1.40 / 2.00      47.38 / 1.47 / 27.20
(2,4]       38.40 / 7.71 / 7.91      61.49 / 12.55 / 9.90      32.49 / 2.70 / 4.80      52.49 / 3.25 / 3.90
(4,8]       59.39 / 10.40 / 35.70    88.39 / 17.36 / 17.32     38.69 / 5.97 / 6.23      53.08 / 6.17 / 10.50
(8,16]      54.29 / 11.34 / 45.40    23.40 / 4.58 / 9.52       25.29 / 11.97 / 13.50    56.98 / 11.28 / 22.20
(16,32]     17.80 / 9.85 / 20.99     55.09 / 11.58 / 21.00     24.99 / 21.25 / 21.60    89.98 / 19.82 / 20.60
(32,64]     40.79 / 25.65 / 58.86    128.48 / 39.46 / 134.47   39.69 / 42.81 / 39.22    108.37 / 47.07 / 61.38
(64,128]    25.99 / 57.95 / 91.59    131.08 / 54.42 / 110.27   22.09 / 91.06 / 86.32    55.67 / 87.32 / 66.50
(128,256]   13.59 / 126.07 / 186.92  50.68 / 119.04 / 140.43   25.79 / 205.58 / 183.96  80.76 / 178.23 / 90.20

of the data, in the sense that the NIGP is defined as a NGGP with σ = 1/2. However, we believe that a NIGP prior is still a sensible choice of practical interest, especially in light of the fact that estimating σ under the NGGP prior is a difficult task (Lijoi et al., 2007). Moreover, if one were forced to choose a specific value of σ ∈ (0, 1) without any information on the power-law degree of the data, σ = 1/2 would arguably be a sensible and safe choice. Many fruitful directions for future work remain open, especially with respect to the use of the BNP approach to develop learning-augmented CMSs that adapt to the power-law degree of the data stream of tokens. Moreover, based on the promising empirical results of the BNP approach to the CMS, we also encourage research extending the BNP approach to other queries, e.g. range queries and inner product queries. This line of research would broaden the range of applications of the CMS, especially for data streams with power-law behaviour.
Stefano Favaro is also affiliated with IMATI-CNR "Enrico Magenes" (Milan, Italy). Emanuele Dolera and Stefano Favaro received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant agreement No 817257. Emanuele Dolera and Stefano Favaro gratefully acknowledge the financial support from the Italian Ministry of Education, University and Research (MIUR), "Dipartimenti di Eccellenza" grant 2018-2022.
References
Aamand, A., Indyk, P., and Vakilian, A. (2019). (Learned) frequency estimation algorithms under Zipfian distribution. arXiv preprint arXiv:1908.05198.
Aggarwal, C. and Yu, P. (2010). On classification of high-cardinality data streams. In Proceedings of the 2010 SIAM International Conference on Data Mining.
Bacallado, S., Battiston, M., Favaro, S., and Trippa, L. (2017). Sufficientness postulates for Gibbs-type priors and hierarchical generalizations. Statistical Science, 32:487–500.
Cai, D., Mitzenmacher, M., and Adams, R. P. (2018). A Bayesian nonparametric view on count-min sketch. In Advances in Neural Information Processing Systems.
Clauset, A., Shalizi, C. R., and Newman, M. E. (2009). Power-law distributions in empirical data. SIAM Review, 51:661–703.
Cormode, G. and Muthukrishnan, S. (2005a). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55:58–75.
Cormode, G. and Muthukrishnan, S. (2005b). Summarizing and mining skewed data streams. In Proceedings of the 2005 SIAM International Conference on Data Mining.
Dwork, C., Naor, M., Pitassi, T., Rothblum, G., and Yekhanin, S. (2010). Pan-private streaming algorithms. In Proceedings of The First Symposium on Innovations in Computer Science.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pages 209–230.
Gnedin, A., Hansen, B., and Pitman, J. (2007). Notes on the occupancy problem with infinitely many boxes: general asymptotics and power laws. Probability Surveys, 4:146–171.
Goyal, A., Daumé, H., and Cormode, G. (2012). Sketch algorithms for estimating point queries in NLP. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
Goyal, A., Daumé, H., and Venkatasubramanian, S. (2009). Streaming for large scale NLP: language modeling. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Hadjicharalambous, G., Favaro, S., and Prünster, I. (2011). On a class of distributions on the simplex. Journal of Statistical Planning and Inference, 141:2987–3004.
Harrison, B. (2010). Move prediction in the game of Go. Ph.D. thesis, Harvard University.
Hsu, C.-Y., Indyk, P., Katabi, D., and Vakilian, A. (2019). Learning-based frequency estimation algorithms. In International Conference on Learning Representations.
James, L. F. (2002). Poisson process partition calculus with applications to exchangeable models and Bayesian nonparametrics. arXiv preprint arXiv:math/0205093.
Kingman, J. (1993). Poisson processes. Wiley Online Library.
Lijoi, A., Mena, R. H., and Prünster, I. (2005). Hierarchical mixture modeling with normalized inverse-Gaussian priors. Journal of the American Statistical Association, 100:1278–1291.
Lijoi, A., Mena, R. H., and Prünster, I. (2007). Controlling the reinforcement in Bayesian nonparametric mixture models. Journal of the Royal Statistical Society Series B, 69:715–740.
Lijoi, A. and Prünster, I. (2010). Models beyond the Dirichlet process. In Bayesian Nonparametrics, Hjort, N. L., Holmes, C. C., Müller, P. and Walker, S. G., Eds. Cambridge University Press.
Pitel, G. and Fouquier, G. (2015). Count-min-log sketch: approximately counting with approximate counters. In Proceedings of the 1st International Symposium on Web Algorithms.
Pitman, J. (2003). Poisson-Kingman partitions. In Science and Statistics: A Festschrift for Terry Speed. Institute of Mathematical Statistics.
Pitman, J. (2006). Combinatorial stochastic processes. Lecture Notes in Mathematics. Springer-Verlag.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (2007). Numerical recipes 3rd edition: The art of scientific computing. Cambridge University Press.
Prünster, I. (2002). Random probability measures derived from increasing additive processes and their application to Bayesian statistics. Ph.D. thesis, University of Pavia.
Regazzini, E., Lijoi, A., and Prünster, I. (2003). Distributional results for means of normalized random measures with independent increments. The Annals of Statistics, 31:560–585.
Seshadri, V. (1993). The inverse Gaussian distribution. Oxford University Press.
Song, H., Cho, T., Dave, V., Zhang, Y., and Qiu, L. (2009). Scalable proximity estimation and link prediction in online social networks. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement.
Zhang, Q., Pell, J., Canino-Koning, R., Howe, A., and Brown, C. (2014). These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PloS One, 9.
Zhou, M., Favaro, S., and Walker, S. G. (2016). Frequency of frequencies distributions and size-dependent exchangeable random partitions. Journal of the American Statistical Association, 112:1623–1635.

A The CMS
For any m ≥ 1, let X_m = (X_1, …, X_m) be a data stream of tokens taking values in a measurable space of symbols V. A point query over X_m asks for the estimation of the frequency f_v of a token of type v ∈ V in X_m, i.e. f_v = Σ_{1≤i≤m} 1_{X_i = v}. The goal of the CMS of Cormode and Muthukrishnan (2005b,a) is to estimate f_v based on a compressed representation of X_m obtained by random hashing. In particular, let J and N be positive integers, set [J] = {1, …, J} and [N] = {1, …, N}, and let h_1, …, h_N, with h_n : V → [J], be a collection of hash functions drawn uniformly at random from a pairwise independent hash family H. That is, a random hash function h ∈ H has the property that, for all v_1, v_2 ∈ V such that v_1 ≠ v_2, the probability that v_1 and v_2 hash to values j_1, j_2 ∈ [J], respectively, is

Pr[h(v_1) = j_1, h(v_2) = j_2] = 1/J².

Hashing X_m through h_1, …, h_N creates N vectors of J buckets {(C_{n,1}, …, C_{n,J})}_{n∈[N]}, with C_{n,j} obtained by aggregating the frequencies of all x with h_n(x) = j. Every C_{n,j} is initialized at zero, and whenever a new token X_i is observed we set C_{n,h_n(X_i)} ← C_{n,h_n(X_i)} + 1 for every n ∈ [N]. After m tokens, C_{n,j} = Σ_{1≤i≤m} 1_{h_n(X_i) = j}, and f_v ≤ C_{n,h_n(v)} for any v ∈ V. Under this setting, the CMS of Cormode and Muthukrishnan (2005a) estimates f_v with the smallest hashed frequency among {C_{n,h_n(v)}}_{n∈[N]}, i.e.

f̂_v^(CMS) = min_{n∈[N]} C_{n,h_n(v)}.

That is, f̂_v^(CMS) returns the count associated with the fewest collisions. This provides an upper bound on the true count. For an arbitrary data stream with m tokens, the CMS satisfies the following guarantee.

Theorem 1. (Cormode and Muthukrishnan, 2005a) Let J = ⌈e/ε⌉ and N = ⌈log(1/δ)⌉, with ε > 0 and δ > 0. Then, the estimate f̂_v^(CMS) satisfies f̂_v^(CMS) ≥ f_v and, with probability at least 1 − δ, the estimate f̂_v^(CMS) satisfies f̂_v^(CMS) ≤ f_v + εm.

B CRMs and hNCRMs
Let V be a measurable space endowed with its Borel σ-field F. A CRM μ on V is defined as a random measure such that, for any A_1, …, A_k in F with A_i ∩ A_j = ∅ for i ≠ j, the random variables μ(A_1), …, μ(A_k) are mutually independent (Kingman, 1993). Any CRM μ with no fixed points of discontinuity and no deterministic drift can be represented as μ = Σ_{j≥1} ξ_j δ_{v_j}, where the ξ_j's are positive random jumps and the v_j's are V-valued random locations. Then, μ is characterized by the Lévy–Khintchine representation

E[exp{−∫_V f(v) μ(dv)}] = exp{−∫_{R₊×V} [1 − e^{−ξ f(v)}] γ(dξ, dv)},

where f : V → R is a measurable function such that ∫ |f| dμ < +∞ and γ is a measure on R₊ × V such that ∫_B ∫_{R₊} min{ξ, 1} γ(dξ, dv) < +∞ for any B ∈ F. The measure γ, referred to as the Lévy intensity measure, characterizes μ: it contains all the information on the distributions of the jumps and locations of μ. For our purposes it will often be useful to separate the jump and location parts of γ by writing it as γ(dξ, dv) = ρ(dξ; v) ν(dv), where ν denotes a measure on (V, F) and ρ denotes a transition kernel on B(R₊) × V, with B(R₊) being the Borel σ-field of R₊; that is, v ↦ ρ(A; v) is F-measurable for any A ∈ B(R₊), and ρ(·; v) is a measure on (R₊, B(R₊)) for any v ∈ V. In particular, if ρ(·; v) = ρ(·) for any v, then the jumps of μ are independent of their locations, and γ and μ are termed homogeneous. See (Kingman, 1993) and references therein. CRMs are closely connected to Poisson processes. Indeed, μ can be represented as a linear functional of a Poisson process Π on R₊ × V with mean measure γ. To state this precisely, Π is a random subset of R₊ × V and, if N(A) = card{Π ∩ A} for any A ⊂ B(R₊) ⊗ F such that γ(A) < +∞, then

Pr[N(A) = k] = e^{−γ(A)} (γ(A))^k / k!,  for k ≥ 0.
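The Poisson process representation suggests a direct simulation scheme. A minimal sketch under simplifying assumptions of our own choosing: a finite-activity homogeneous intensity γ(dξ, dv) = e^{−ξ} dξ ν(dv), with ν uniform on [0, 1], so that γ(R₊ × [0,1]) = ν([0,1]) is finite and μ has a Poisson number of jumps, with i.i.d. Exp(1) jump sizes and i.i.d. ν-distributed locations:

```python
import numpy as np

def sample_crm(nu_total, seed=0):
    """Sample mu = sum_j xi_j * delta_{v_j} from the Poisson process representation
    with mean measure gamma(dxi, dv) = e^{-xi} dxi nu(dv), nu uniform on [0, 1]
    with total mass nu_total. Since the jump part integrates to 1, the number of
    points is Poisson(nu_total); jumps are Exp(1) and locations Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    n = rng.poisson(nu_total)
    jumps = rng.exponential(1.0, size=n)   # density e^{-xi} on R+
    locs = rng.uniform(0.0, 1.0, size=n)   # normalized nu
    return jumps, locs

def mu_of_set(jumps, locs, lo, hi):
    """Evaluate mu(A) = sum_j xi_j 1{v_j in A} for the half-open interval A = (lo, hi]."""
    return float(jumps[(locs > lo) & (locs <= hi)].sum())

jumps, locs = sample_crm(nu_total=50.0, seed=0)
```

Complete randomness is visible here: the point counts and jump masses over disjoint location sets are built from independent pieces of the same Poisson process, and E[μ([0,1])] = nu_total · E[ξ] = nu_total in this example.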
Then, for any A ∈ F,

μ(A) = ∫_A ∫_{R₊} ξ N(dv, dξ).

See (Kingman, 1993) and references therein. An important property of CRMs is their almost sure discreteness (Kingman, 1993), which means that their realizations are discrete measures with probability 1. This fact essentially entails discreteness of random probability measures obtained as transformations of CRMs, such as hNCRMs. hNCRMs (James, 2002; Prünster, 2002; Regazzini et al., 2003; Pitman, 2006; Lijoi and Prünster, 2010) are defined in terms of a suitable normalization of CRMs. Let μ be a homogeneous CRM on V such that 0 < μ(V) < +∞ almost surely. Then, the random probability measure

P = μ / μ(V)   (1)

is termed an hNCRM. Because of the almost sure discreteness of μ, P is discrete almost surely. That is,

P = Σ_{j≥1} p_j δ_{v_j},

where the p_j = ξ_j / μ(V), for j ≥ 1, are random probabilities such that p_j ∈ (0, 1) for any j ≥ 1 and Σ_{j≥1} p_j = 1 almost surely. Both the conditions of finiteness and positiveness of μ(V) are clearly required for the normalization (1) to be well-defined, and it is natural to express these conditions in terms of the Lévy intensity measure γ of the CRM μ. It is enough to have ρ(R₊) = +∞ and 0 < μ(V) < +∞. In particular, the former is equivalent to requiring that μ has infinitely many jumps on any bounded set: in this case μ is also called an infinite activity process. The previous conditions can also be strengthened to necessary and sufficient conditions, but we do not pursue this here. See (Kingman, 1993).

C NGGP priors, and proof of Proposition 1
Let V be a measurable space endowed with its Borel σ-field F. For any m ≥ 1, let X_m be a random sample of tokens from P ~ NGGP(α, σ, ν). Because of the discreteness of P, the random sample X_m induces a random partition of the set {1, …, m} into K_m = k ≤ m partition subsets, labelled by distinct symbols v* = {v*_1, …, v*_{K_m}} in V, with frequencies N_m = (N_1, …, N_{K_m}) = (n_1, …, n_k) such that N_i > 0 and Σ_{1≤i≤K_m} N_i = m. Distributional properties of the random partition induced by X_m have been investigated in, e.g., James (2002), Pitman (2003), Lijoi et al. (2007), De Blasi et al. (2013) and Bacallado et al. (2017). In particular,

Pr[K_m = k, N_m = (n_1, …, n_k)] = (1/k!) (m! / (n_1! ⋯ n_k!)) V_{m,k} ∏_{i=1}^k (1−σ)_{(n_i−1)},   (2)

where

V_{m,k} = ((α 2^{σ−1})^k / Γ(m)) ∫_0^{+∞} x^{m−1} (2^{−1}+x)^{kσ−m} exp{−(α 2^{σ−1}/σ) [(2^{−1}+x)^σ − 2^{−σ}]} dx.   (3)

Now, let P_{m,k} = {(n_1, …, n_k) : n_i ≥ 1 and Σ_{1≤i≤k} n_i = m} denote the set of partitions of m into k ≤ m blocks. Then, the distribution of K_m follows by marginalizing (2) over the set P_{m,k}, that is,

Pr[K_m = k] = Σ_{(n_1,…,n_k)∈P_{m,k}} (1/k!) (m! / (n_1! ⋯ n_k!)) V_{m,k} ∏_{i=1}^k (1−σ)_{(n_i−1)} = V_{m,k} σ^{−k} C(m, k; σ),   (4)

where C(m, k; σ) denotes the (central) generalized factorial coefficient (Charalambides, 2005), which is defined as C(m, k; σ) = (k!)^{−1} Σ_{0≤i≤k} (−1)^i binom(k, i) (−iσ)_{(m)}, with the proviso C(0, 0; σ) = 1 and C(m, 0; σ) = 0 for any m ≥ 1. For any 1 ≤ r ≤ m, let M_{r,m} ≥ 0 denote the number of distinct symbols with frequency r in X_m, i.e. M_{r,m} = Σ_{1≤i≤K_m} 1_{N_i = r}, so that Σ_{1≤r≤m} M_{r,m} = K_m and Σ_{1≤r≤m} r M_{r,m} = m. Then, the distribution of M_m = (M_{1,m}, …, M_{m,m}) follows directly from (2), i.e.

Pr[M_m = m] = V_{m,k} m! ∏_{i=1}^m ((1−σ)_{(i−1)} / i!)^{m_i} (1/m_i!) 1_{M_{m,k}}(m),   (5)

where M_{m,k} = {(m_1, …, m_m) : m_i ≥ 0, Σ_{1≤i≤m} m_i = k and Σ_{1≤i≤m} i m_i = m}. The distribution (5) is referred to as the sampling formula of the random partition with distribution (2). For any m ≥ 1, let X_m be a random sample from P ~ NGGP(α, σ, ν) featuring K_m = k partition subsets, labelled by distinct symbols v* = {v*_1, …, v*_{K_m}} in V, with frequencies N_m = (n_1, …, n_k). The predictive distribution of P provides the conditional distribution of X_{m+1} given X_m. That is, for A ∈ F,

Pr[X_{m+1} ∈ A | X_m] = (V_{m+1,k+1} / V_{m,k}) ν(A) + (V_{m+1,k} / V_{m,k}) Σ_{i=1}^k (n_i − σ) δ_{v*_i}(A)   (6)

for any m ≥ 1. We refer to Bacallado et al. (2017) for a characterization of (6) in terms of a meaningful Pólya-like urn scheme. The predictive distribution (6) provides the fundamental ingredient of the proof of Proposition 1.

Proof of Proposition 1.
The proof follows from the predictive distribution (6) by setting A = {v} and A = {v_r}. We conclude by showing that the distributional properties of a random sample from P ~ DP(α/2, ν) follow from those of a random sample from P ~ NGGP(α, σ, ν) by letting σ → 0. For any m ≥ 1, let X_m be a random sample from P ~ DP(α/2, ν) featuring K_m = k partition subsets, labelled by distinct symbols v* = {v*_1, …, v*_{K_m}} in V, with frequencies N_m = (n_1, …, n_k). The distribution of the random partition induced by X_m follows from (2) by letting σ → 0. Indeed,

lim_{σ→0} V_{m,k} = lim_{σ→0} ((α 2^{σ−1})^k / Γ(m)) ∫_0^{+∞} x^{m−1} (2^{−1}+x)^{kσ−m} exp{−(α 2^{σ−1}/σ)[(2^{−1}+x)^σ − 2^{−σ}]} dx
= ((α/2)^k / Γ(m)) 2^{−α/2} ∫_0^{+∞} x^{m−1} (2^{−1}+x)^{−(m+α/2)} dx
= (α/2)^k / (α/2)_{(m)}.   (7)

Therefore, by combining the distribution (2) with (7) and letting σ → 0, the distribution of the random partition induced by a random sample X_m from P ~ DP(α/2, ν) follows directly. That is,

Pr[K_m = k, N_m = (n_1, …, n_k)] = (1/k!) (m! / (n_1! ⋯ n_k!)) ((α/2)^k / (α/2)_{(m)}) ∏_{i=1}^k (n_i − 1)!.

The distribution of K_m follows by combining the distribution (4) with (7), and from the fact that lim_{σ→0} σ^{−k} C(m, k; σ) = |s(m, k)|, where |s(m, k)| denotes the signless Stirling number of the first kind (Charalambides, 2005). That is,

Pr[K_m = k] = ((α/2)^k / (α/2)_{(m)}) |s(m, k)|.

In a similar manner, the distribution of M_m under the DP prior, which is referred to as the Ewens sampling formula (Ewens, 1972), follows by combining the sampling formula (5) with (7) and letting σ → 0. Finally, we consider the predictive distribution of P ~ DP(α/2, ν). For any m ≥ 1, let X_m be a random sample from P ~ DP(α/2, ν) featuring K_m = k partition subsets, labelled by distinct symbols v* = {v*_1, …, v*_{K_m}} in V, with frequencies N_m = (n_1, …, n_k). The predictive distribution of P follows by combining the predictive distribution (6) with (7) and letting σ → 0. That is, for A ∈ F,

Pr[X_{m+1} ∈ A | X_m] = ((α/2) / (α/2 + m)) ν(A) + (1 / (α/2 + m)) Σ_{i=1}^k n_i δ_{v*_i}(A)   (8)

for any m ≥ 1. The predictive distribution (8) is at the basis of the CMS-DP proposed in Cai et al. (2018). In particular, Equation 4 in Cai et al. (2018) follows from the predictive distribution (8) by setting A = {v} and A = {v_r}.

D The NIGP prior
For σ = 1/2 the NGGP prior reduces to the NIGP prior (Prünster, 2002; Lijoi et al., 2005). An alternative definition of the NIGP prior is given through its family of finite-dimensional distributions. This alternative definition relies on the IG distribution (Seshadri, 1993). In particular, a random variable W has IG distribution with shape parameter a ≥ 0 and scale parameter b ≥ 0 if it has density function, with respect to the Lebesgue measure, given by

f_W(w; a, b) = (a e^{ab} / √(2π)) w^{−3/2} exp{−(a²/w + b²w)/2} 1_{R₊}(w).

Let (W_1, …, W_k) be a collection of independent random variables such that W_i is distributed according to the IG distribution with shape parameter a_i and scale parameter 1, for i = 1, …, k. The normalized IG distribution with parameter (a_1, …, a_k) is the distribution of the random variable

(P_1, …, P_k) = (W_1 / Σ_{i=1}^k W_i, …, W_k / Σ_{i=1}^k W_i).

The distribution of the random variable (P_1, …, P_{k−1}) is absolutely continuous with respect to the Lebesgue measure on R^{k−1}, and its density function on the (k−1)-dimensional simplex coincides with

f_{(P_1,…,P_{k−1})}(p_1, …, p_{k−1}; a_1, …, a_k)   (9)
= 2 (∏_{i=1}^k a_i e^{a_i} / √(2π)) (∏_{i=1}^{k−1} p_i^{−3/2}) (1 − Σ_{i=1}^{k−1} p_i)^{−3/2} × A(p)^{−k/4} K_{−k/2}(√(A(p))),

where A(p) = Σ_{i=1}^{k−1} a_i²/p_i + a_k²/(1 − Σ_{i=1}^{k−1} p_i), and K_{−k/2} denotes the modified Bessel function of the second kind, or Macdonald function, with parameter −k/2. If the random variable (P_1, …, P_k) is distributed according to a normalized IG distribution with parameter (a_1, …, a_k), and if m_1 < m_2 < ⋯ < m_{r−1} < k are positive integers, then

(Σ_{i=1}^{m_1} P_i, Σ_{i=m_1+1}^{m_2} P_i, …, Σ_{i=m_{r−1}+1}^{k} P_i)

is a random variable distributed according to a normalized IG distribution with parameter (Σ_{1≤i≤m_1} a_i, Σ_{m_1+1≤i≤m_2} a_i, …, Σ_{m_{r−1}+1≤i≤k} a_i).
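The aggregation property just stated can be checked by simulation. A sketch (our own code; the only assumption is the standard correspondence between the IG(a, 1) distribution above and numpy's Wald parametrization, namely mean a and Wald shape a²):

```python
import numpy as np

def sample_normalized_ig(a, size, seed=0):
    """Draw `size` vectors (P_1,...,P_k) from the normalized IG distribution
    with parameter (a_1,...,a_k): W_i ~ IG(shape a_i, scale 1), which in
    numpy's wald(mean, scale) parametrization has mean a_i and shape a_i^2,
    then normalize by the total W_1 + ... + W_k."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    W = np.stack([rng.wald(ai, ai * ai, size=size) for ai in a], axis=1)
    return W / W.sum(axis=1, keepdims=True)

P = sample_normalized_ig([1.0, 2.0, 3.0], size=20_000, seed=3)
```

By the aggregation property, (P_1 + P_2, P_3) drawn this way has the same distribution as a normalized IG draw with parameter (a_1 + a_2, a_3), which can be verified by comparing histograms of the two samples.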
This projective property of the normalized IG distribution follows from the additive property of the IG distribution (Seshadri, 1993). To define the NIGP prior through its family of finite-dimensional distributions, let V be a measurable space endowed with its Borel σ-field F. Let P = {Q_{B_1,…,B_k} : B_1, …, B_k ∈ F for k ≥ 1} be a family of probability distributions, and let ν̃ = αν be a diffuse (base) measure on V with ν̃(V) = α. If {B_1, …, B_k} denotes a measurable k-partition of V and Δ_{k−1} is the (k−1)-dimensional simplex, then set

Q_{B_1,…,B_k}(C) = ∫_{C ∩ Δ_{k−1}} f_{(P_1,…,P_{k−1})}(p_1, …, p_{k−1}; a_1, …, a_k) dp_1 ⋯ dp_{k−1}

for any C in the Borel σ-field of R^k, where f_{(P_1,…,P_{k−1})} is the density function (9) of the normalized IG distribution with a_i = ν̃(B_i), for i = 1, …, k. According to Proposition 3.9.2 of Regazzini (2001), the NIGP is the unique random probability measure admitting P as its family of finite-dimensional distributions. The projective property of P ~ NIGP(α, ν) follows directly from: i) the definition of P through its family of finite-dimensional distributions; ii) the projective property of the normalized IG distribution. In particular, for any finite family of sets {A_1, …, A_k} in F, let {B_1, …, B_h} be a measurable h-partition of V that is finer than the partition generated by the family of sets {A_1, …, A_k}. Then, Q_{A_1,…,A_k}(C) = Q_{B_1,…,B_h}(C′) for any C in the Borel σ-field of R^k, with C′ = {(x_1, …, x_h) ∈ [0,1]^h : (Σ_{i : B_i ⊆ A_1} x_i, …, Σ_{i : B_i ⊆ A_k} x_i) ∈ C}. See (Lijoi et al., 2005).

E Proof of Proposition 2, and proof of Theorem 3
To prove Proposition 2, we start with the following lemma, stated under the assumption that X_m is a random sample from P ~ NGGP(α, σ, ν). The proof of Proposition 2 then follows by setting σ = 1/2. Let

p_{f_v}(ℓ; m, α, σ) = Σ_{m ∈ M_{m,k}} Pr[X_{m+1} ∈ {v_ℓ} | M_m = m] Pr[M_m = m],   ℓ = 0, 1, …, m,

where the predictive distributions Pr[X_{m+1} ∈ {v_ℓ} | M_m = m] are displayed in Equation 5, and the distribution Pr[M_m = m] is displayed in Equation 4. For σ ∈ (0, 1), let f_σ denote the density function of the positive σ-stable random variable X_σ, i.e. E[exp{−tX_σ}] = exp{−t^σ} for any t > 0.

Lemma 1. For any m ≥ 1, let X_m be a random sample from P ~ NGGP(α, σ, ν). Then, for ℓ = 0, 1, …, m,

p_{f_v}(ℓ; m, α, σ) =
  (σ (ℓ−σ) binom(m, ℓ) (1−σ)_{(ℓ−1)} / Γ(1−σ+ℓ)) ∫_0^{+∞} ∫_0^1 (1/h^σ) f_σ(hp) exp{−(h/2)(α 2^{σ−1}/σ)^{1/σ} + (α 2^{σ−1}/σ) 2^{−σ}} p^{m−ℓ} (1−p)^{−σ+ℓ} dp dh,   if ℓ < m,

  (α 2^m (1−σ)_{(m)} / Γ(m+1)) ∫_0^{+∞} (x^m / (1+2x)^{m+1−σ}) exp{−(α 2^{σ−1}/σ)[(2^{−1}+x)^σ − 2^{−σ}]} dx,   if ℓ = m.   (10)

Proof.
We start by considering the case $\ell = 0$. The probability $p_{f_v}(0; m, \alpha, \sigma)$ follows by combining Proposition 1 with the distribution of $K_m$ displayed in (4). Indeed, we can write the following expression:
\[
p_{f_v}(0; m, \alpha, \sigma) = \sum_{\mathbf{m}\in\mathcal{M}_{m,k}} \Pr[X_{m+1}\in v_0 \mid \mathbf{M}_m = \mathbf{m}]\,\Pr[\mathbf{M}_m = \mathbf{m}] \tag{11}
\]
\[
= \sum_{\mathbf{m}\in\mathcal{M}_{m,k}} \frac{V_{m+1,k+1}}{V_{m,k}}\,\Pr[\mathbf{M}_m = \mathbf{m}] = \sum_{k=1}^{m} V_{m+1,k+1}\,\sigma^{-k}\,\mathscr{C}(m,k;\sigma). \tag{12}
\]
Then, the expression of $p_{f_v}(0; m, \alpha, \sigma)$ in (10) follows by combining (12) with $V_{m+1,k+1}$ displayed in (3): the sum over $k$ is evaluated by means of Equation 13 of Favaro et al. (2015), which turns it into an integral against the positive $\sigma$-stable density $f_\sigma$; the Gamma identity
\[
(2^{-1}+u)^{-(1-\sigma)} = \frac{1}{\Gamma(1-\sigma)} \int_0^{+\infty} y^{-\sigma} \exp\{-y(2^{-1}+u)\}\,\mathrm{d}y,
\]
together with the changes of variable
\[
p = \frac{y}{x\big(\frac{\alpha}{\sigma 2^{\sigma}}\big)^{1/\sigma} + y} \qquad\text{and}\qquad h = \frac{x}{1-p},
\]
then yields
\[
p_{f_v}(0; m, \alpha, \sigma) = \frac{\sigma}{\Gamma(1-\sigma)} \int_0^{+\infty}\!\!\int_0^1 \frac{1}{h^{\sigma}}\, f_\sigma(hp)\, \exp\Big\{-h\Big(\frac{\alpha}{\sigma 2^{\sigma}}\Big)^{1/\sigma} + \frac{\alpha}{\sigma 2^{\sigma}}\Big\}\, p^{m}\,(1-p)^{-\sigma-1}\,\mathrm{d}p\,\mathrm{d}h.
\]
This completes the case $\ell = 0$. Now, we consider $\ell > 0$. The probability $p_{f_v}(\ell; m, \alpha, \sigma)$ follows by combining Proposition 1 with the distribution of $(K_m, \mathbf{N}_m)$ displayed in (2). In particular, we can write
\[
p_{f_v}(\ell; m, \alpha, \sigma) = \sum_{\mathbf{m}\in\mathcal{M}_{m,k}} \Pr[X_{m+1}\in v_\ell \mid \mathbf{M}_m = \mathbf{m}]\,\Pr[\mathbf{M}_m = \mathbf{m}] = (\ell-\sigma) \sum_{k=1}^{m} \frac{V_{m+1,k}}{V_{m,k}} \sum_{j=1}^{k} \Pr[K_m = k,\, N_j = \ell]
\]
\[
= (\ell-\sigma)\binom{m}{\ell}(1-\sigma)_{(\ell-1)} \sum_{k=1}^{m} V_{m+1,k}\,\mathscr{C}(m-\ell, k-1;\sigma)\,\sigma^{1-k}. \tag{13}
\]
Then, the expression of $p_{f_v}(\ell; m, \alpha, \sigma)$ in (10) follows by combining (13) with $V_{m+1,k}$ displayed in (3). If $\ell = m$, then $\mathscr{C}(0,k-1;\sigma) = 0$ for all $k > 1$, so that only the term $k = 1$ survives and
\[
p_{f_v}(m; m, \alpha, \sigma) = (1-\sigma)_{(m)}\, V_{m+1,1} = (1-\sigma)_{(m)}\, \frac{\alpha}{\Gamma(m+1)} \int_0^{+\infty} \frac{x^{m}}{(2^{-1}+x)^{m+1-\sigma}}\, \exp\Big\{-\frac{\alpha}{\sigma}\big[(2^{-1}+x)^{\sigma} - 2^{-\sigma}\big]\Big\}\,\mathrm{d}x,
\]
which coincides with (10) upon writing $(2^{-1}+x)^{-(m+1-\sigma)} = 2^{m+1-\sigma}(1+2x)^{-(m+1-\sigma)}$. If $\ell < m$, then applying Equation 13 of Favaro et al. (2015) to $\sum_{k=1}^{m-\ell} \mathscr{C}(m-\ell,k;\sigma)\,x^{k}$, the Gamma identity
\[
(2^{-1}+u)^{-(1-\sigma+\ell)} = \frac{1}{\Gamma(1-\sigma+\ell)} \int_0^{+\infty} y^{-\sigma+\ell} \exp\{-y(2^{-1}+u)\}\,\mathrm{d}y,
\]
and the same changes of variable $p = y/(x(\frac{\alpha}{\sigma 2^{\sigma}})^{1/\sigma}+y)$ and $h = x/(1-p)$ as in the case $\ell = 0$, we obtain
\[
p_{f_v}(\ell; m, \alpha, \sigma) = \frac{\sigma(\ell-\sigma)\binom{m}{\ell}(1-\sigma)_{(\ell-1)}}{\Gamma(1-\sigma+\ell)} \int_0^{+\infty}\!\!\int_0^1 \frac{1}{h^{\sigma}}\, f_\sigma(hp)\, \exp\Big\{-h\Big(\frac{\alpha}{\sigma 2^{\sigma}}\Big)^{1/\sigma} + \frac{\alpha}{\sigma 2^{\sigma}}\Big\}\, p^{m-\ell}\,(1-p)^{-\sigma+\ell-1}\,\mathrm{d}p\,\mathrm{d}h.
\]

Remark 2.
Here we present an alternative representation of $p_{f_v}(\ell; m, \alpha, \sigma)$ in (10). It provides a useful tool for implementing a straightforward Monte Carlo evaluation of $p_{f_v}(\ell; m, \alpha, \sigma)$. For $\ell = m$, rearranging the double integral of the representation above gives
\[
p_{f_v}(m; m, \alpha, \sigma) = \frac{(1-\sigma)_{(m)}}{\Gamma(m+1)}\; \mathbb{E}\left[\exp\left\{-XY\left(\frac{\alpha}{\sigma 2^{\sigma}}\right)^{1/\sigma} + \frac{\alpha}{\sigma 2^{\sigma}}\right\}\right],
\]
where $Y$ is a Beta random variable with parameter $(m-\ell+\sigma, 1-\sigma+\ell)$ and $X$ is a random variable, independent of $Y$, distributed according to a polynomially tilted $\sigma$-stable distribution of order $\sigma$, i.e.
\[
\Pr[X \in \mathrm{d}x] = \frac{\Gamma(\sigma+1)}{\Gamma(2)}\, x^{-\sigma}\, f_\sigma(x)\,\mathrm{d}x.
\]
For $\ell < m$, the same rearrangement gives
\[
p_{f_v}(\ell; m, \alpha, \sigma) = (\ell-\sigma)\binom{m}{\ell}(1-\sigma)_{(\ell-1)}\, \frac{\Gamma(m-\ell+\sigma)}{\Gamma(\sigma)\,\Gamma(m+1)}\; \mathbb{E}\left[\exp\left\{-XY\left(\frac{\alpha}{\sigma 2^{\sigma}}\right)^{1/\sigma} + \frac{\alpha}{\sigma 2^{\sigma}}\right\}\right],
\]
with $X$ and $Y$ as above. According to this alternative representation, $p_{f_v}(\ell; m, \alpha, \sigma)$ allows for a Monte Carlo evaluation by sampling from a Beta random variable and from a polynomially tilted $\sigma$-stable random variable of order $\sigma$. See, e.g., Devroye (2009).

Proof of Proposition 2. The proof follows by a direct application of Lemma 1 by setting $\sigma = 1/2$. First, let us recall that the density function of the $(1/2)$-stable positive random variable coincides with the IG density function (Seshadri, 1993) with shape parameter $a = 2^{-1/2}$ and scale parameter $b = 0$.
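A quick numerical check of this identity, assuming only the stated Laplace transform $\mathbb{E}[\exp\{-tX_{1/2}\}] = e^{-\sqrt{t}}$: the positive $1/2$-stable random variable admits the classical elementary representation $X = 1/(2Z^2)$ with $Z$ standard Gaussian, so the identity can be verified by Monte Carlo (this sketch is illustrative and not part of the original derivation).

```python
import math
import random

def sample_half_stable(rng):
    # X = 1/(2 Z^2), Z ~ N(0,1), has density (2*sqrt(pi))^-1 * x^{-3/2} * exp(-1/(4x)),
    # i.e. the positive 1/2-stable law with E[exp(-t X)] = exp(-sqrt(t)).
    z = rng.gauss(0.0, 1.0)
    return 1.0 / (2.0 * z * z)

def mc_laplace(t, n=200_000, seed=0):
    # Monte Carlo estimate of E[exp(-t X)] over n samples of X.
    rng = random.Random(seed)
    return sum(math.exp(-t * sample_half_stable(rng)) for _ in range(n)) / n

est = mc_laplace(1.0)  # compare with math.exp(-1.0)
```

With a fixed seed and $2\cdot 10^{5}$ samples the estimate agrees with $e^{-\sqrt{t}}$ to about two decimal places.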
That is, we write
\[
f_{1/2}(w) = \frac{1}{2\sqrt{\pi}}\, w^{-3/2}\, \exp\left\{-\frac{1}{4w}\right\}.
\]
For $\ell = m$,
\[
p_{f_v}(m; m, \alpha) = \frac{\sqrt{2}\,\alpha\,2^{m}\,(\tfrac12)_{(m)}}{\Gamma(m+1)} \int_0^{+\infty} \frac{x^{m}}{(1+2x)^{m+1/2}}\, \exp\left\{-\sqrt{2}\,\alpha\left[\sqrt{1+2x} - 1\right]\right\}\mathrm{d}x.
\]
For $\ell < m$,
\[
p_{f_v}(\ell; m, \alpha) = \frac{2^{-1}(\ell-2^{-1})\binom{m}{\ell}(\tfrac12)_{(\ell-1)}}{\Gamma(2^{-1}+\ell)} \int_0^{+\infty}\!\!\int_0^1 \frac{1}{\sqrt{h}}\, f_{1/2}(hp)\, e^{-2\alpha^{2} h + \sqrt{2}\alpha}\, p^{m-\ell}(1-p)^{-2^{-1}+\ell-1}\,\mathrm{d}p\,\mathrm{d}h
\]
\[
= (\ell-2^{-1})\binom{m}{\ell}(\tfrac12)_{(\ell-1)}\, \frac{e^{\sqrt{2}\alpha}}{4\sqrt{\pi}\,\Gamma(2^{-1}+\ell)} \int_0^1 \left(\int_0^{+\infty} h^{-2}\, \exp\left\{-2\alpha^{2}h - \frac{1}{4hp}\right\}\mathrm{d}h\right) p^{m-\ell-3/2}\,(1-p)^{\ell-3/2}\,\mathrm{d}p
\]
[Equation 3.471.9 of Gradshteyn and Ryzhik (2007)]
\[
= (\ell-2^{-1})\binom{m}{\ell}(\tfrac12)_{(\ell-1)}\, \frac{\sqrt{2}\,\alpha\, e^{\sqrt{2}\alpha}}{\sqrt{\pi}\,\Gamma(2^{-1}+\ell)} \int_0^1 K_{-1}\!\left(\frac{\sqrt{2}\,\alpha}{\sqrt{p}}\right) p^{m-\ell-1}\,(1-p)^{-2^{-1}+\ell-1}\,\mathrm{d}p,
\]
where $K_{-1}$ is the modified Bessel function of the second kind, or Macdonald function, with parameter $-1$.

Remark 3.
Here we present an alternative representation of $p_{f_v}(\ell; m, \alpha)$ in Proposition 2. It provides a useful tool for implementing a straightforward Monte Carlo evaluation of $p_{f_v}(\ell; m, \alpha)$. For $\ell = m$,
\[
p_{f_v}(m; m, \alpha) = \frac{\sqrt{2}\,\alpha\,2^{m}(\tfrac12)_{(m)}}{\Gamma(m+1)} \int_0^{+\infty} \frac{x^{m}}{(1+2x)^{m+1/2}}\, e^{-\sqrt{2}\alpha[\sqrt{1+2x}-1]}\,\mathrm{d}x = \frac{(\tfrac12)_{(m)}}{\Gamma(m+1)}\; \mathbb{E}\left[\exp\left\{-2\alpha^{2}XY + \sqrt{2}\,\alpha\right\}\right],
\]
where $Y$ is a Beta random variable with parameter $(1/2, m+1/2)$ and $X$ is a random variable, independent of $Y$, distributed according to a polynomially tilted IG distribution of order $1/2$, that is
\[
\Pr[X \in \mathrm{d}x] = \frac{\Gamma(3/2)}{\Gamma(2)}\, x^{-1/2}\, \frac{x^{-3/2}}{2\sqrt{\pi}}\, \exp\left\{-\frac{1}{4x}\right\}\mathrm{d}x = \frac{1}{4}\, x^{-2}\, \exp\left\{-\frac{1}{4x}\right\}\mathrm{d}x.
\]
For $\ell < m$,
\[
p_{f_v}(\ell; m, \alpha) = (\ell-2^{-1})\binom{m}{\ell}(\tfrac12)_{(\ell-1)}\, \frac{\sqrt{2}\,\alpha\,e^{\sqrt{2}\alpha}}{\sqrt{\pi}\,\Gamma(2^{-1}+\ell)} \int_0^1 K_{-1}\!\left(\frac{\sqrt{2}\,\alpha}{\sqrt{p}}\right) p^{m-\ell-1}(1-p)^{-2^{-1}+\ell-1}\,\mathrm{d}p
\]
\[
= (\ell-2^{-1})\binom{m}{\ell}(\tfrac12)_{(\ell-1)}\, \frac{\sqrt{2}\,\alpha\,e^{\sqrt{2}\alpha}\,\Gamma(m-\ell)}{\sqrt{\pi}\,\Gamma(m+2^{-1})}\; \mathbb{E}\left[K_{-1}\!\left(\frac{\sqrt{2}\,\alpha}{\sqrt{Y}}\right)\right],
\]
where $Y$ is a Beta random variable with parameter $(m-\ell, 2^{-1}+\ell)$. According to this alternative representation, $p_{f_v}(\ell; m, \alpha)$ allows for a straightforward Monte Carlo evaluation by sampling from a Beta random variable and from a polynomially tilted IG random variable of order $1/2$. See, e.g., Devroye (2009).

Proof of Theorem 3. Because of the assumption of independence of the hash family, we can factorize the marginal likelihood of $(\mathbf{c}_1, \dots, \mathbf{c}_N)$, i.e. of hash functions $h_1, \dots, h_N$, into the product of the marginal likelihoods of $\mathbf{c}_n = (c_{n,1}, \dots, c_{n,J})$, i.e. of each hash function.
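Given this factorization, the per-hash-function posteriors $\Pr[f_v = \ell \mid C_{n,h_n(v)} = c_{n,h_n(v)}]$ can be combined across the $N$ independent hash functions in log-space; a minimal sketch (the per-row posterior arrays and the flat prior below are hypothetical placeholders, not the Bessel-integral expressions of Proposition 2):

```python
import math

def combine_row_posteriors(log_rows, log_prior):
    # Posterior over l from N independent hash rows:
    #   log p(l) = (1 - N) * log prior(l) + sum_n log p_n(l), then normalize.
    n_rows = len(log_rows)
    scores = [(1 - n_rows) * lp + sum(row[l] for row in log_rows)
              for l, lp in enumerate(log_prior)]
    m = max(scores)  # log-sum-exp normalization for stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_norm) for s in scores]

# Hypothetical per-row posteriors over l = 0, 1, 2 (log scale) and a flat prior.
rows = [[math.log(p) for p in (0.2, 0.5, 0.3)],
        [math.log(p) for p in (0.1, 0.6, 0.3)]]
prior = [math.log(1.0 / 3.0)] * 3
post = combine_row_posteriors(rows, prior)
```

With a single row the function returns that row's posterior unchanged, and with several rows it returns the renormalized product, as in the derivation above.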
This, combined with Bayes theorem, leads to
\[
\Pr[f_v = \ell \mid \{C_{n,h_n(v)}\}_{n\in[N]} = \{c_{n,h_n(v)}\}_{n\in[N]}]
\]
[Bayes theorem and independence of the hash family]
\[
= \frac{\Pr[f_v = \ell]}{\Pr[\{C_{n,h_n(v)}\}_{n\in[N]} = \{c_{n,h_n(v)}\}_{n\in[N]}]} \prod_{n=1}^{N} \Pr[C_{n,h_n(v)} = c_{n,h_n(v)} \mid f_v = \ell]
\]
\[
= \frac{\Pr[f_v = \ell]}{\Pr[\{C_{n,h_n(v)}\}_{n\in[N]} = \{c_{n,h_n(v)}\}_{n\in[N]}]} \prod_{n=1}^{N} \frac{\Pr[C_{n,h_n(v)} = c_{n,h_n(v)},\, f_v = \ell]}{\Pr[f_v = \ell]}
\]
\[
= \frac{(\Pr[f_v = \ell])^{1-N}}{\Pr[\{C_{n,h_n(v)}\}_{n\in[N]} = \{c_{n,h_n(v)}\}_{n\in[N]}]} \prod_{n=1}^{N} \Pr[C_{n,h_n(v)} = c_{n,h_n(v)}]\, \Pr[f_v = \ell \mid C_{n,h_n(v)} = c_{n,h_n(v)}]
\]
\[
= (\Pr[f_v = \ell])^{1-N} \prod_{n=1}^{N} \Pr[f_v = \ell \mid C_{n,h_n(v)} = c_{n,h_n(v)}] \;\propto\; \prod_{n=1}^{N} \Pr[f_v = \ell \mid C_{n,h_n(v)} = c_{n,h_n(v)}]
\]
[Proposition 2 and Equation 9]
\[
= \prod_{n\in[N]}
\begin{cases}
\binom{c_{n,h_n(v)}}{\ell}\, \dfrac{\sqrt{2}\,\frac{\alpha}{J}\, e^{\sqrt{2}\alpha/J}}{\pi} \displaystyle\int_0^1 K_{-1}\!\left(\frac{\sqrt{2}\,\alpha}{J\sqrt{x}}\right) x^{c_{n,h_n(v)}-\ell-1}\,(1-x)^{\ell-3/2}\,\mathrm{d}x, & \ell = 0, 1, \dots, c_{n,h_n(v)}-1,\\[2ex]
\dfrac{\sqrt{2}\,\frac{\alpha}{J}\, 2^{c_{n,h_n(v)}}\, (\tfrac12)_{(c_{n,h_n(v)})}}{\Gamma(c_{n,h_n(v)}+1)} \displaystyle\int_0^{+\infty} \frac{x^{c_{n,h_n(v)}}}{(1+2x)^{c_{n,h_n(v)}+1/2}}\, e^{-\sqrt{2}\frac{\alpha}{J}[\sqrt{1+2x}-1]}\,\mathrm{d}x, & \ell = c_{n,h_n(v)},
\end{cases}
\]
where $K_{-1}$ is the modified Bessel function of the second kind, or Macdonald function, with parameter $-1$.

Estimation of α

We start by deriving the marginal likelihood corresponding to the hashed frequencies $(\mathbf{c}_1, \dots, \mathbf{c}_N)$ induced by the collection of hash functions $h_1, \dots, h_N$. In particular, according to the definition of $P \sim \mathrm{NIGP}(\alpha, \nu)$ through its family of finite-dimensional distributions, for a single hash function $h_n$ the marginal likelihood of $\mathbf{c}_n = (c_{n,1}, \dots, c_{n,J})$ is obtained by integrating the normalized IG distribution with parameter $(\alpha/J, \dots, \alpha/J)$ against the multinomial counts $\mathbf{c}_n$. In particular, by means of the normalized IG distribution (9), the marginal likelihood of $\mathbf{c}_n$ has the following expression:
\[
p(\mathbf{c}_n; \alpha) = \frac{m!}{\prod_{i=1}^{J} c_{n,i}!} \int_{\{(p_1,\dots,p_{J-1})\,:\, p_i \in (0,1) \text{ and } \sum_{i=1}^{J-1} p_i \leq 1\}} \prod_{i=1}^{J-1} p_i^{c_{n,i}} \Big(1 - \sum_{i=1}^{J-1} p_i\Big)^{c_{n,J}} f_{(P_1,\dots,P_{J-1})}(p_1, \dots, p_{J-1})\,\mathrm{d}p_1 \cdots \mathrm{d}p_{J-1}.
\]
Substituting the normalized IG density, applying the change of variable $p_i = x_i / \sum_{i=1}^{J} x_i$ for $i = 1, \dots, J-1$ and $z = \sum_{i=1}^{J} x_i$, representing $\big(\sum_{i=1}^{J} x_i\big)^{-m}$ through a Gamma integral, and evaluating each coordinate integral by Equation 3.471.9 of Gradshteyn and Ryzhik (2007), we obtain
\[
p(\mathbf{c}_n; \alpha) = \frac{m\,\big(\frac{\alpha}{J}\big)^{m+\frac{J}{2}}\, e^{\alpha}}{(\pi/2)^{J/2} \prod_{j=1}^{J} c_{n,j}!} \int_0^{+\infty} y^{m-1}\,(1+2y)^{-\frac{m}{2}+\frac{J}{4}} \prod_{j=1}^{J} K_{c_{n,j}-\frac12}\Big(\frac{\alpha}{J}\sqrt{1+2y}\Big)\,\mathrm{d}y.
\]
Because of the independence of the hash family, $h_1, \dots, h_N$ leads to the following marginal likelihood of $\{c_{n,j}\}_{n\in[N],\, j\in[J]}$:
\[
p(\mathbf{c}_1, \dots, \mathbf{c}_N; \alpha) = \prod_{n\in[N]} \frac{m\,\big(\frac{\alpha}{J}\big)^{m+\frac{J}{2}}\, e^{\alpha}}{(\pi/2)^{J/2} \prod_{j=1}^{J} c_{n,j}!} \int_0^{+\infty} y^{m-1}\,(1+2y)^{-\frac{m}{2}+\frac{J}{4}} \prod_{j=1}^{J} K_{c_{n,j}-\frac12}\Big(\frac{\alpha}{J}\sqrt{1+2y}\Big)\,\mathrm{d}y. \tag{14}
\]
The marginal likelihood of $\{c_{n,j}\}_{n\in[N],\, j\in[J]}$ in (14) is applied to estimate the mass parameter $\alpha$. This is the empirical Bayes approach to the estimation of $\alpha$.
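The resulting empirical Bayes problem is a one-dimensional maximization over $\alpha > 0$, conveniently carried out over $\log\alpha$; a minimal sketch with a derivative-free golden-section search (the quadratic toy objective below is a hypothetical stand-in for the actual log-marginal likelihood in (14)):

```python
import math

def maximize_over_alpha(log_lik, lo=1e-4, hi=1e4, iters=200):
    # Golden-section search for the maximizer of log_lik(alpha) over alpha > 0,
    # parameterized in log-space for numerical stability.
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = log_lik(math.exp(c)), log_lik(math.exp(d))
    for _ in range(iters):
        if fc > fd:
            # Maximum lies in [a, d]: shrink from the right.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = log_lik(math.exp(c))
        else:
            # Maximum lies in [c, b]: shrink from the left.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = log_lik(math.exp(d))
    return math.exp((a + b) / 2.0)

# Toy stand-in for the empirical Bayes objective, maximized at alpha = 3.
alpha_hat = maximize_over_alpha(lambda a: -(math.log(a) - math.log(3.0)) ** 2)
```

Any bracketed unimodal maximizer would do equally well here; the point is only that the constraint $\alpha > 0$ disappears once the search runs over $\log\alpha$.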
In particular, we consider the following problem:
\[
\operatorname*{arg\,max}_{\alpha > 0}\; \prod_{n\in[N]} V_{n,m,\alpha,J} \int_0^{+\infty} F_{n,m,\alpha,J}(y)\,\mathrm{d}y,
\]
where
\[
V_{n,m,\alpha,J} = \frac{m\,\big(\frac{\alpha}{J}\big)^{m+\frac{J}{2}}\, e^{\alpha}}{(\pi/2)^{J/2} \prod_{j=1}^{J} c_{n,j}!} \qquad\text{and}\qquad F_{n,m,\alpha,J}(y) = y^{m-1}\,(1+2y)^{-\frac{m}{2}+\frac{J}{4}} \prod_{j=1}^{J} K_{c_{n,j}-\frac12}\Big(\frac{\alpha}{J}\sqrt{1+2y}\Big),
\]
under the constraint that $\alpha > 0$. To avoid overflow/underflow issues in the above optimization problem, here we work in log-space. That is, we consider the following equivalent optimization problem:
\[
\operatorname*{arg\,max}_{\alpha > 0}\; \sum_{n\in[N]} \left[\log(V_{n,m,\alpha,J}) + \log\Big(\int_0^{+\infty} F_{n,m,\alpha,J}(y)\,\mathrm{d}y\Big)\right] = \operatorname*{arg\,max}_{\alpha > 0}\; \sum_{n\in[N]} \left[v_{n,m,\alpha,J} + \log\Big(\int_0^{+\infty} \exp\{f_{n,m,\alpha,J}(y)\}\,\mathrm{d}y\Big)\right],
\]
with $v_{n,m,\alpha,J} = \log(V_{n,m,\alpha,J})$ and $f_{n,m,\alpha,J}(y) = \log(F_{n,m,\alpha,J}(y))$. For the computation of the integral we use double exponential quadrature (Takahasi and Mori, 1974), which approximates $\int_{-1}^{+1} f(y)\,\mathrm{d}y$ with $\sum_{j} w_j f(y_j)$ for appropriate weights $w_j \in \mathcal{W}$ and coordinates $y_j \in \mathcal{Y}$. Integrals of the form $\int_a^b f(y)\,\mathrm{d}y$ for $-\infty \leq a \leq b \leq +\infty$ are handled via change of variable formulas. To avoid underflow/overflow issues it is necessary to apply the "log-sum-exp" trick to the above integral. That is,
\[
\log\Big(\int_0^{+\infty} \exp\{f_{n,m,\alpha,J}(y)\}\,\mathrm{d}y\Big) = f^{*} + \log\Big(\int_0^{+\infty} \exp\{f_{n,m,\alpha,J}(y) - f^{*}\}\,\mathrm{d}y\Big),
\]
with $f^{*} = \max_{y\in\mathcal{Y}} f_{n,m,\alpha,J}(y)$. The computation of $\log(K_{c_{n,j}-1/2}(x))$ is performed via the following finite-sum representation of $K_{c_{n,j}-1/2}(x)$, which holds for $K_{v}(x)$ when $v$ is a half-integer; recall that $K_{v}(x)$ is symmetric in $v$. In particular,
\[
K_{c_{n,i}-\frac12}\Big(\frac{\alpha}{J}\sqrt{1+2y}\Big) = \sqrt{\frac{\pi}{2}}\, \frac{\exp\big\{-\frac{\alpha}{J}\sqrt{1+2y}\big\}}{\big(\frac{\alpha}{J}\sqrt{1+2y}\big)^{1/2}} \sum_{j=0}^{c_{n,i}-1} \frac{(j+c_{n,i}-1)!}{j!\,(c_{n,i}-1-j)!}\, \Big(2\,\frac{\alpha}{J}\sqrt{1+2y}\Big)^{-j}.
\]
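The half-integer finite-sum representation above can be evaluated in log-space with the log-gamma function; a sketch (the integer order $n$ below corresponds to $c_{n,j}-1$), checked against the closed forms $K_{1/2}(x) = \sqrt{\pi/(2x)}\,e^{-x}$ and $K_{3/2}(x) = \sqrt{\pi/(2x)}\,e^{-x}(1+1/x)$:

```python
import math

def log_bessel_k_half(n, x):
    # log K_{n+1/2}(x) for integer n >= 0 and x > 0, via the finite sum
    #   K_{n+1/2}(x) = sqrt(pi/(2x)) e^{-x} sum_{j=0}^{n} (n+j)!/(j!(n-j)!) (2x)^{-j}.
    # By symmetry K_{-v} = K_v, this also covers v = -(n + 1/2).
    terms = [math.lgamma(n + j + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
             - j * math.log(2.0 * x) for j in range(n + 1)]
    m = max(terms)  # log-sum-exp over the (all-positive) terms
    log_sum = m + math.log(sum(math.exp(t - m) for t in terms))
    return 0.5 * math.log(math.pi / (2.0 * x)) - x + log_sum
```

In the marginal likelihood (14) each factor is then `log_bessel_k_half(c - 1, (alpha / J) * math.sqrt(1 + 2 * y))` with `c = c_{n,j}`; since the order depends on $j$ only through $c_{n,j}$, the values can be cached over the unique counts.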
The computation of log-factorials is done via a specialized implementation of the log-gamma function. In order to increase efficiency in the optimization, we cache the log-factorials and, anew for each $\alpha$ and $y$, the values of $\log(K_{c_{n,j}-1/2}(\frac{\alpha}{J}\sqrt{1+2y}))$ across $j$. In particular, as the dependency on $j$ goes through $c_{n,j}$ only, we can exploit the fact that many duplicates exist, i.e. the complexity scales in the number of unique $c_{n,j}$. All code is implemented in LuaJIT (https://luajit.org) using the SciLua library (https://scilua.org).

G Additional experiments
We present additional experiments on the application of the CMS-NIGP to synthetic and real data. First, we recall the synthetic and real data to which the CMS-NIGP is applied. As regards synthetic data, we consider datasets of $m = 500{,}000$ tokens sampled from Zipf distributions under five values of the tail parameter $s$. As regards real data, we consider: i) the 20 Newsgroups dataset; ii) the Enron dataset, which consists of $m = 6{,}412{,}175$ tokens with $k = 28{,}102$ distinct tokens. Tables 1, 2, 3 and 4 report the MAE (mean absolute error) between true frequencies and their corresponding estimates via: i) the CMS-NIGP estimate $\hat f_v^{(\mathrm{NIGP})}$; ii) the CMS estimate $\hat f_v^{(\mathrm{CMS})}$; iii) the CMS-DP estimate $\hat f_v^{(\mathrm{DP})}$; iv) the CMM estimate $\hat f_v^{(\mathrm{CMM})}$.

References
Bacallado, S., Battiston, M., Favaro, S., and Trippa, L. (2017). Sufficientness postulates for Gibbs-type priors and hierarchical generalizations. Statistical Science, 32:487–500.
Cai, D., Mitzenmacher, M., and Adams, R. P. (2018). A Bayesian nonparametric view on count-min sketch. In Advances in Neural Information Processing Systems.
Charalambides, C. A. (2005). Combinatorial methods in discrete distributions, volume 600. John Wiley & Sons.
Cormode, G. and Muthukrishnan, S. (2005a). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55:58–75.
Cormode, G. and Muthukrishnan, S. (2005b). Summarizing and mining skewed data streams. In Proceedings of the 2005 SIAM International Conference on Data Mining.
De Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Prünster, I., and Ruggiero, M. (2013). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37:212–229.
Devroye, L. (2009). Random variate generation for exponentially and polynomially tilted stable distributions. ACM Transactions on Modeling and Computer Simulation, 19.
Ewens, W. (1972). The sampling theory of selectively neutral alleles. Theoretical Population Biology, 3:87–112.
Favaro, S., Nipoti, B., and Teh, Y. (2015). Random variate generation for Laguerre-type exponentially tilted alpha-stable distributions. Electronic Journal of Statistics, 9:1230–1242.
Gradshteyn, I. and Ryzhik, I. (2007). Table of integrals, series and products. Academic Press.
James, L. F. (2002). Poisson process partition calculus with applications to exchangeable models and Bayesian nonparametrics. arXiv preprint arXiv:math/0205093.
Kingman, J. (1993). Poisson processes. Wiley Online Library.
Lijoi, A., Mena, R. H., and Prünster, I. (2005). Hierarchical mixture modeling with normalized inverse-Gaussian priors. Journal of the American Statistical Association, 100:1278–1291.
Lijoi, A., Mena, R. H., and Prünster, I. (2007). Controlling the reinforcement in Bayesian non-parametric mixture models. Journal of the Royal Statistical Society Series B, 69:715–740.
Lijoi, A. and Prünster, I. (2010). Models beyond the Dirichlet process. In Bayesian Nonparametrics, Hjort, N. L., Holmes, C. C., Müller, P. and Walker, S. G., Eds. Cambridge University Press.
Pitman, J. (2003). Poisson–Kingman partitions. In Science and Statistics: A Festschrift for Terry Speed. Institute of Mathematical Statistics.
Pitman, J. (2006). Combinatorial stochastic processes. Lecture Notes in Mathematics. Springer-Verlag.
Prünster, I. (2002). Random probability measures derived from increasing additive processes and their application to Bayesian statistics. Ph.D. Thesis, University of Pavia.
Regazzini, E. (2001). Foundations of Bayesian statistics and some theory of Bayesian nonparametric methods. Lecture Notes, Stanford University.
Regazzini, E., Lijoi, A., and Prünster, I. (2003). Distributional results for means of normalized random measures with independent increments. The Annals of Statistics, 31:560–585.
Seshadri, V. (1993). The inverse Gaussian distribution. Oxford University Press.
Takahasi, H. and Mori, M. (1974). Double exponential formulas for numerical integration. Publications of the Research Institute for Mathematical Sciences, 9(3):721–741.

Table 1:
Synthetic data: MAE for $\hat f_v^{(\mathrm{NIGP})}$, $\hat f_v^{(\mathrm{CMM})}$ and $\hat f_v^{(\mathrm{CMS})}$, case $J = 320$, $N = 2$. Columns are grouped by the five Zipf scenarios Z1–Z5; within each scenario the columns are $\hat f_v^{(\mathrm{CMS})}$, $\hat f_v^{(\mathrm{CMM})}$, $\hat f_v^{(\mathrm{NIGP})}$.

Bins v     CMS CMM NIGP (Z1) | CMS CMM NIGP (Z2) | CMS CMM NIGP (Z3) | CMS CMM NIGP (Z4) | CMS CMM NIGP (Z5)
(0,1]      1,061.3 161.72 231.31 | 629.40 62.19 134.75 | 308.11 81.10 65.71 | 51.65 1.04 12.91 | 32.65 1.02 7.16
(1,2]      1,197.9 169.74 287.43 | 514.31 102.42 119.22 | 154.20 2.00 37.03 | 289.50 2.04 61.87 | 48.15 2.01 9.88
(2,4]      1,108.3 116.37 262.18 | 474.82 52.10 95.78 | 2,419.51 2215.85 353.73 | 134.05 3.40 26.90 | 54.34 10.50 10.09
(4,8]      1,275.9 378.04 302.89 | 786.73 214.46 175.10 | 460.13 258.90 83.30 | 118.40 6.44 21.58 | 69.85 6.03 14.28
(8,16]     1,236.1 230.32 257.08 | 719.84 232.24 136.66 | 380.05 139.50 66.44 | 413.13 129.03 77.39 | 80.80 13.10 20.15
(16,32]    1,256.8 221.98 248.41 | 831.70 79.73 190.05 | 288.59 23.90 41.99 | 503.60 364.30 90.29 | 9.86 22.39 15.36
(32,64]    1,312.8 235.87 284.12 | 783.90 184.99 139.52 | 415.58 54.82 67.30 | 217.81 82.92 48.00 | 10.22 30.90 28.90
(64,128]   1,721.7 766.29 312.59 | 950.31 304.36 125.07 | 1,875.50 1762.20 353.10 | 64.01 97.40 65.91 | 13.75 96.98 66.18
(128,256]  1,107.7 334.57 97.91 | 1,727.19 1488.38 273.50 | 202.09 163.61 110.32 | 46.80 156.71 130.94 | 17.51 181.38 125.75

Table 2: Synthetic data: MAE for $\hat f_v^{(\mathrm{NIGP})}$, $\hat f_v^{(\mathrm{CMM})}$ and $\hat f_v^{(\mathrm{CMS})}$, case $J = 160$, $N = 4$. Columns are grouped by the five Zipf scenarios Z1–Z5 as in Table 1.
Bins v     CMS CMM NIGP (Z1) | CMS CMM NIGP (Z2) | CMS CMM NIGP (Z3) | CMS CMM NIGP (Z4) | CMS CMM NIGP (Z5)
(0,1]      212.1 590.48 0.94 | 262.00 146.11 0.25 | 424.8 130.90 0.18 | 154.79 47.10 0.32 | 56.7 1.01 0.38
(1,2]      339.8 359.57 0.56 | 332.75 63.21 0.70 | 552.0 65.00 0.82 | 182.72 2.01 1.24 | 48.2 2.03 1.45
(2,4]      270.9 69.42 1.33 | 277.80 301.89 2.47 | 487.3 163.55 2.53 | 184.70 97.15 2.66 | 57.8 14.35 2.74
(4,8]      234.6 339.95 4.69 | 375.74 579.94 4.67 | 545.2 243.08 5.28 | 252.53 62.70 5.96 | 51.1 8.30 5.42
(8,16]     213.3 313.37 10.57 | 165.73 152.53 10.68 | 493.2 196.20 10.86 | 247.33 29.70 10.28 | 24.1 14.11 11.75
(16,32]    283.0 23.30 20.72 | 217.20 22.94 19.21 | 535.5 154.30 22.08 | 295.90 190.92 21.57 | 25.0 23.20 23.37
(32,64]    305.7 133.09 42.66 | 284.61 209.13 43.14 | 637.8 150.05 42.64 | 120.62 71.86 44.49 | 31.7 40.40 44.03
(64,128]   244.5 102.43 92.26 | 120.21 118.42 94.43 | 425.1 198.60 95.19 | 180.30 113.75 95.10 | 29.2 94.73 93.34
(128,256]  237.4 294.43 170.09 | 141.30 573.12 173.87 | 525.9 267.15 185.83 | 129.70 176.50 180.41 | 32.1 119.19 179.51

Table 3: Real data ($J = 12000$ and $N = 2$): MAE for $\hat f_v^{(\mathrm{NIGP})}$, $\hat f_v^{(\mathrm{DP})}$ and $\hat f_v^{(\mathrm{CMS})}$.
Bins for true v    CMS DP NIGP (20 Newsgroups) | CMS DP NIGP (Enron)
(0,1]      46.4 46.39 11.34 | 12.2 12.20 3.00
(1,2]      16.6 16.60 3.53 | 13.8 13.80 3.06
(2,4]      38.4 38.40 7.71 | 61.5 61.49 12.55
(4,8]      59.4 59.39 10.40 | 88.4 88.39 17.36
(8,16]     54.3 54.29 11.34 | 23.4 23.40 4.58
(16,32]    17.8 17.80 9.85 | 55.1 55.09 11.58
(32,64]    40.8 40.79 25.65 | 128.5 128.48 39.46
(64,128]   26.0 25.99 57.95 | 131.1 131.08 54.42
(128,256]  13.6 13.59 126.07 | 50.7 50.68 119.04

Table 4:
Real data ($J = 8000$ and $N = 4$): MAE for $\hat f_v^{(\mathrm{NIGP})}$, $\hat f_v^{(\mathrm{DP})}$ and $\hat f_v^{(\mathrm{CMS})}$.

Bins for true v    CMS DP NIGP (20 Newsgroups) | CMS DP NIGP (Enron)