Learning-augmented count-min sketches via Bayesian nonparametrics
Emanuele Dolera∗, Stefano Favaro†, and Stefano Peluchetti‡
∗Department of Mathematics, University of Pavia, Italy
†Department of Economics and Statistics, University of Torino and Collegio Carlo Alberto, Italy
‡Cogent Labs, Tokyo, Japan
February 10, 2021
Abstract
The count-min sketch (CMS) is a time and memory efficient randomized data structure that provides estimates of tokens' frequencies in a data stream, i.e. point queries, based on random hashed data. Learning-augmented CMSs improve the CMS by learning models that allow to better exploit data properties. In this paper, we focus on the learning-augmented CMS of Cai, Mitzenmacher and Adams (NeurIPS 2018).
Keywords:
Bayesian nonparametrics; count-min sketch; Dirichlet process prior; empirical Bayes; likelihood-free estimation; Pitman-Yor process prior; point query; power-law data stream; random hashing; range query.
When processing large data streams, it is critical to represent the data in compact structures that allow to efficiently extract statistical information. Sketching algorithms, or simply sketches, are compact randomized data structures that can be easily updated and queried to perform a time and memory efficient estimation of statistics of large data streams of tokens. Sketches have found applications in machine learning (Aggarwal and Yu, 2010), security analysis (Dwork et al., 2010), natural language processing (Goyal et al., 2009), computational biology (Zhang et al., 2014; Leo Elworth et al., 2020), social networks (Song et al., 2009) and games (Harrison, 2010). Of particular interest is the problem of estimating the frequency of a token in the stream, also referred to as "point query", and more generally the problem of estimating the overall frequency of a collection of $s \ge 1$ tokens, i.e. "$s$-range query". A notable approach to point queries is the count-min sketch (CMS) of Cormode and Muthukrishnan (2005), which uses random hashing to obtain a compressed, or approximated, representation of the data. The CMS achieves the goal of using a compact data structure to save time and memory, while having provable theoretical guarantees on the frequencies estimated from hashed data. Nevertheless, there are several aspects of the CMS that can be improved. First, the CMS provides only point estimates, although the random hashing procedure may induce substantial uncertainty, especially for low-frequency tokens. Second, the CMS relies on a finite universe (or population) of tokens, although it is common for large data streams to have an unbounded number of unique tokens. Third, often there exists a priori knowledge of the data, and it is desirable to incorporate this knowledge into CMS estimates.

The emerging field of learning-augmented CMSs aims at improving the CMS by augmenting it with learning models that allow to better exploit data properties (Cai et al., 2018; Aamand et al., 2019; Hsu et al., 2019). In this paper, we focus on the learning-augmented CMS of Cai et al. (2018), which relies on Bayesian nonparametric (BNP) modeling of a data stream via Dirichlet process (DP) priors (Ferguson, 1973); this is referred to as the CMS-DP. The BNP approach of Cai et al. (2018) assumes that tokens in the stream are modeled as random samples from an unknown discrete distribution, which is endowed with a DP prior. Then, the CMS-DP arises by combining the predictive distribution induced by the DP prior with both a restriction property and a finite-dimensional projective property of the DP. While the restriction property is critical to compute the posterior distribution of a point query, given the hashed frequencies, the finite-dimensional projective property is desirable for ease of estimating the prior's parameters. Cai et al. (2018) showed that the posterior mode recovers the CMS estimate of Cormode and Muthukrishnan (2005), while other CMS-DP estimates, e.g. the posterior mean and median, may be viewed as CMS estimates with shrinkage. Because of the DP prior assumption, the CMS-DP relies on an unknown and unbounded number of distinct tokens in the universe, which is typically expected in large data streams. Most importantly, the BNP approach of Cai et al.
(2018) allows to incorporate a priori knowledge of the data stream into CMS estimates, and it leads to modeling the uncertainty of CMS estimates through the posterior distribution of a point query.

The interplay between the predictive distribution and the restriction property of the DP is the cornerstone of the proof-method of Cai et al. (2018) for computing the posterior distribution of a point query. While this method provides an intuitive derivation of the CMS-DP, it is tailored to point queries under DP priors. That is, because of both the peculiar predictive distribution and the peculiar restriction property of the DP, the method of Cai et al. (2018) cannot be used when considering other nonparametric priors, or within the more general problem of estimating range queries. This is a critical limitation of the BNP approach of Cai et al. (2018), especially with respect to the flexibility of incorporating a priori knowledge of the data stream into CMS estimates. In this paper, we present an alternative, and more flexible, proof-method for computing the posterior distribution of a point query, given the hashed frequencies. Our method relies neither on the predictive distribution of the DP prior nor on its restriction property, but directly makes use of the definition of the posterior distribution of a point query in the BNP framework under a DP prior. This is referred to as the "direct" proof-method, whereas the method of Cai et al. (2018) is referred to as the "indirect" proof-method. Besides recovering the posterior distribution of a point query obtained in Cai et al. (2018), our "direct" proof-method is such that: i) it allows us to make use of the Pitman-Yor process (PYP) prior (Perman et al., 1992; Pitman, 1995; Pitman and Yor, 1997), which is arguably the most popular generalization of the DP prior; ii) it can be readily applied to the problem of estimating $s$-range queries, for any $s \ge 1$. The PYP prior features a power-law tail behaviour (De Blasi et al., 2015), in contrast with the exponential tail behaviour of the DP. The PYP has neither a restriction property nor a finite-dimensional projective property analogous to those of the DP, and hence the "indirect" proof-method of Cai et al. (2018) cannot be used for estimating point queries. Our "direct" proof-method thus becomes critical to compute the posterior distribution of a point query, given the hashed frequencies, under the PYP prior; this is combined with an estimate of the prior's parameters, which is obtained by a likelihood-free estimation approach relying on the predictive distribution induced by the PYP prior. This procedure leads to a novel learning-augmented CMS, which is referred to as the CMS-PYP. CMS-PYP estimates of point queries arise as mean functionals of the posterior distribution. Applications to synthetic and real data show that the CMS-PYP outperforms both the CMS and the CMS-DP in the estimation of low-frequency tokens. This is known to be a critical feature in the context of natural language processing (Goyal et al., 2012; Pitel and Fouquier, 2015), where it is indeed common to encounter power-law data streams.

The paper is structured as follows. In Section 2 we review the BNP approach of Cai et al. (2018), and we present our "direct" proof-method to compute the posterior distribution of a point query, given the hashed frequencies. In Section 3 we exploit the "direct" proof-method to develop the CMS-PYP. In Section 4 we apply the "direct" proof-method to extend the BNP approach of Cai et al. (2018) to the more general problem of estimating $s$-range queries, for $s \ge 1$.
Section 5 contains numerical experiments for the CMS-PYP, whereas in Section 6 we discuss our work and directions for future work. Proofs are deferred to the appendices.
For $m \ge 1$, let $X_m = (X_1, \dots, X_m)$ be a data stream of tokens taking values in a (possibly infinite) measurable space of symbols $\mathcal{V}$, e.g., language $n$-grams, IP addresses, hashtags, or URLs. The stream $X_m$ is available for inferential purposes through its compressed representation by random hashing. Specifically, for any pair of positive integers $(J, N)$ such that $[J] = \{1, \dots, J\}$ and $[N] = \{1, \dots, N\}$, let $h_1, \dots, h_N$, with $h_n : \mathcal{V} \to [J]$, be a collection of hash functions drawn uniformly at random from a pairwise independent hash family $\mathcal{H}$. For mathematical convenience, it is assumed that $\mathcal{H}$ is a perfectly random hash family, that is, for $h_n$ drawn uniformly at random from $\mathcal{H}$ the random variables $(h_n(x))_{x \in \mathcal{V}}$ are i.i.d. as a Uniform distribution over $[J]$. In practice, as discussed in Cai et al. (2018), real-world hash functions yield only small perturbations from perfect hash functions. Hashing $X_m$ through $h_1, \dots, h_N$ creates $N$ vectors of $J$ buckets $\{(C_{n,1}, \dots, C_{n,J})\}_{n \in [N]}$, with $C_{n,j}$ obtained by aggregating the frequencies of all the $X_i$'s with $h_n(X_i) = j$. That is, every $C_{n,j}$ is initialized at zero, and whenever a new token $X_i$ with $h_n(X_i) = j$ is observed we set $C_{n,j} \leftarrow C_{n,j} + 1$ for every $n \in [N]$. The goal is to estimate the frequency $f_v$ of a symbol $v \in \mathcal{V}$ in the stream, i.e. the point query
$$f_v = \sum_{i=1}^m \mathbb{1}_{\{X_i\}}(v),$$
based on the hashed frequencies $\{C_{n,h_n(v)}\}_{n \in [N]}$. The CMS of Cormode and Muthukrishnan (2005) estimates $f_v$ with the smallest hashed frequency among $\{C_{n,h_n(v)}\}_{n \in [N]}$; see Appendix A for a theoretical guarantee on CMS estimates. Instead, the CMS-DP of Cai et al. (2018) estimates $f_v$ by relying on BNP modeling of the stream $X_m$ via DP priors. In this section, we review the BNP approach of Cai et al. (2018), and we present our "direct" proof-method to compute the posterior distribution of a point query, given the hashed frequencies.
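To fix ideas, the following is a minimal sketch of the CMS data structure just described, i.e. the update and the point query; it is an illustration of ours, not the reference implementation, and it uses Python's built-in hash with per-row random seeds as a stand-in for a pairwise independent hash family.

```python
import random

class CountMinSketch:
    """Minimal count-min sketch: N hash rows of J buckets each."""

    def __init__(self, J, N, seed=0):
        rng = random.Random(seed)
        self.J, self.N = J, N
        # one random seed per hash function h_1, ..., h_N
        self.seeds = [rng.getrandbits(64) for _ in range(N)]
        self.C = [[0] * J for _ in range(N)]  # hashed frequencies C_{n,j}

    def _h(self, n, x):
        # stand-in for h_n : V -> [J]
        return hash((self.seeds[n], x)) % self.J

    def update(self, x):
        # when a new token x is observed, set C_{n,h_n(x)} += 1 for all n
        for n in range(self.N):
            self.C[n][self._h(n, x)] += 1

    def point_query(self, v):
        # CMS estimate of f_v: smallest hashed frequency over the N rows
        return min(self.C[n][self._h(n, v)] for n in range(self.N))
```

Since hash collisions can only inflate the counters, the point query never underestimates $f_v$; the known guarantees bound how much it can overestimate.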
The DP is a discrete (almost surely) random probability measure. A simple and intuitive definition of the DP follows from the stick-breaking construction (Sethuraman, 1994). For any $\theta > 0$, let: i) $(V_i)_{i \ge 1}$ be independent random variables identically distributed as a Beta distribution with parameter $(1, \theta)$; ii) $(Y_i)_{i \ge 1}$ be random variables, independent of $(V_i)_{i \ge 1}$, and independent and identically distributed according to a non-atomic distribution $\nu$ on $\mathcal{V}$. If we set $P_1 = V_1$ and $P_j = V_j \prod_{1 \le i \le j-1}(1 - V_i)$ for $j \ge 2$, which ensures that $\sum_{i \ge 1} P_i = 1$ almost surely, then $P = \sum_{j \ge 1} P_j \delta_{Y_j}$ is a DP on $\mathcal{V}$ with (base) distribution $\nu$ and mass parameter $\theta$. For short, we write $P \sim \mathrm{DP}(\theta, \nu)$.
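The stick-breaking construction translates directly into code; the following sketch (ours, for illustration) truncates the infinite construction at a finite number of sticks, which is an approximation we introduce only to make it runnable.

```python
import random

def dp_stick_breaking(theta, n_sticks, seed=0):
    """Approximate P ~ DP(theta, nu) by the first n_sticks weights of the
    stick-breaking construction: P_j = V_j * prod_{i<j} (1 - V_i), with
    V_i ~ Beta(1, theta) i.i.d.; atom labels Y_j (i.i.d. from nu) are
    represented by their index j."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(n_sticks):
        v = rng.betavariate(1.0, theta)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights  # sums to 1 - remaining; the residual is truncation error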
We refer to the monograph of Ghosal and van der Vaart (2017), and references therein, for a comprehensive account of DP priors in BNPs. For the purposes of the present paper, it is useful to recall two peculiar properties of the DP prior: the finite-dimensional projective property and the restriction property (Ferguson, 1973; Regazzini, 2001). The finite-dimensional projective property is stated as follows: if $\{B_1, \dots, B_k\}$ is a measurable $k$-partition of $\mathcal{V}$, for any $k \ge 1$, then $P \sim \mathrm{DP}(\theta, \nu)$ is such that $(P(B_1), \dots, P(B_k))$ is distributed according to a Dirichlet distribution with parameter $(\theta\nu(B_1), \dots, \theta\nu(B_k))$. The restriction property is stated as follows: if $A \subset \mathcal{V}$ and $P_A$ denotes the random probability measure on $A$ induced by $P \sim \mathrm{DP}(\theta, \nu)$ on $\mathcal{V}$, then $P_A \sim \mathrm{DP}(\theta\nu(A), \nu_A/\nu(A))$, where $\nu_A$ denotes the projection of $\nu$ to the set $A$.

Because of the discreteness of $P \sim \mathrm{DP}(\theta, \nu)$, a random sample $(X_1, \dots, X_m)$ from $P$ induces a random partition of the set $\{1, \dots, m\}$ into $1 \le K_m \le m$ partition subsets, labelled by distinct symbols $\mathbf{v} = \{v_1, \dots, v_{K_m}\}$, with corresponding frequencies $(N_1, \dots, N_{K_m})$ such that $1 \le N_i \le m$ and $\sum_{1 \le i \le K_m} N_i = m$. For any $1 \le r \le m$, let $M_{r,m}$ be the number of distinct symbols with frequency $r$, i.e. $M_{r,m} = \sum_{1 \le i \le K_m} \mathbb{1}_{\{r\}}(N_i)$, such that $\sum_{1 \le r \le m} M_{r,m} = K_m$ and $\sum_{1 \le r \le m} r M_{r,m} = m$. Then the distribution of $M_m = (M_{1,m}, \dots, M_{m,m})$ is defined on the set $\mathcal{M}_{m,k} = \{(m_1, \dots, m_m) : m_i \ge 0,\ \sum_{1 \le i \le m} m_i = k,\ \sum_{1 \le i \le m} i\,m_i = m\}$. In particular, let $(a)_{(n)}$ denote the rising factorial of $a$ of order $n$, i.e. $(a)_{(n)} = \prod_{0 \le i \le n-1}(a + i)$. For any $\mathbf{m} \in \mathcal{M}_{m,k}$, it holds
$$\Pr[M_m = \mathbf{m}] = \frac{m!\,\theta^{\sum_{i=1}^m m_i}}{(\theta)_{(m)}} \prod_{i=1}^m \frac{1}{i^{m_i}\,m_i!}. \qquad (1)$$
See Chapter 3 of Pitman (2006) for details on (1). Now, let $\mathbf{v}_r = \{v_i \in \mathbf{v} : N_i = r\}$, i.e. the labels in $\mathbf{v}$ with frequency $r$, and let $\mathbf{v}_0 = \mathcal{V} - \mathbf{v}$, i.e. the labels in $\mathcal{V}$ not belonging to $\mathbf{v}$; then the predictive distribution induced by $P \sim \mathrm{DP}(\theta, \nu)$ is
$$\Pr[X_{m+1} \in \mathbf{v}_r \mid X_m] = \Pr[X_{m+1} \in \mathbf{v}_r \mid M_m = \mathbf{m}] = \begin{cases} \dfrac{\theta}{\theta + m} & \text{if } r = 0, \\[2mm] \dfrac{r\,m_r}{\theta + m} & \text{if } r \ge 1, \end{cases} \qquad (2)$$
for any $m \ge 1$.
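The predictive distribution (2) depends on the sample only through the frequency counts, which a small helper (names ours, for illustration) makes explicit; iterating such draws is the familiar Chinese-restaurant sampling scheme for DP data.

```python
def dp_predictive(counts, theta):
    """Predictive distribution (2) of X_{m+1} given the frequency counts
    of the m observed tokens: returns the probability of a new symbol
    and a dict mapping each observed symbol to its probability."""
    m = sum(counts.values())
    p_new = theta / (theta + m)                      # X_{m+1} in v_0
    p_old = {v: c / (theta + m) for v, c in counts.items()}
    return p_new, p_old
```

For a symbol of frequency $r$ the per-symbol probability is $r/(\theta + m)$, so summing over the $m_r$ such symbols recovers the $r\,m_r/(\theta+m)$ term in (2).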
The DP prior is the sole discrete nonparametric prior whose predictive distribution is such that: i) the probability that $X_{m+1}$ belongs to $\mathbf{v}_0$ does not depend on $X_m$; ii) the probability that $X_{m+1}$ belongs to $\mathbf{v}_r$ depends on $X_m$ only through the statistic $M_{r,m}$. See Bacallado et al. (2017) for a detailed account.

The BNP approach of Cai et al. (2018) assumes that tokens in the stream $X_m$ are modeled as random samples from an unknown discrete distribution $P$, which is endowed with a DP prior. That is, the tokens $X_i$'s are modeled as
$$X_m \mid P \overset{\text{iid}}{\sim} P, \qquad P \sim \mathrm{DP}(\theta, \nu) \qquad (3)$$
for any $m \ge 1$.
If the $X_i$'s are hashed through hash functions $h_1, \dots, h_N$ drawn uniformly at random from $\mathcal{H}$, then under (3) a point query induces the posterior distribution of $f_v$, given the hashed frequencies $\{C_{n,h_n(v)}\}_{n \in [N]}$, for $v \in \mathcal{V}$. CMS-DP estimates of $f_v$ are obtained as functionals of the posterior distribution of $f_v$ given $\{C_{n,h_n(v)}\}_{n \in [N]}$, e.g. mode, mean, median. The CMS-DP of Cai et al. (2018) arises by combining the predictive distribution induced by the DP prior with two distributional properties of the DP: i) the restriction property of the DP which, since $\mathcal{H}$ is perfectly random, implies that the prior governing the tokens hashed in each of the $J$ buckets is a DP prior with mass parameter $\theta/J$; ii) the finite-dimensional projective property of the DP which, since $\mathcal{H}$ is perfectly random, implies that the prior distribution governing the multinomial hashed frequencies $C_n = (C_{n,1}, \dots, C_{n,J})$ is a $J$-dimensional symmetric Dirichlet distribution with parameter $\theta/J$.

Because of the discreteness of $P \sim \mathrm{DP}(\theta, \nu)$, the random sample $X_m$ from $P$ induces a random partition of the set $\{1, \dots, m\}$ into subsets labelled by distinct symbols in $\mathcal{V}$. The predictive distribution (2) of the DP prior provides the conditional distribution, given $X_m$, of the partition subset $v \in \mathcal{V}$ that a new token $X_{m+1}$ will join. The size of that subset is the frequency $f_v$, for an arbitrary $v \in \mathcal{V}$, that we seek to estimate. However, since we only have access to the hashed frequencies $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$, the object of interest is the marginal distribution $p_{f_{X_{m+1}}}(\cdot\,; m, \theta)$ of $f_{X_{m+1}}$. This distribution follows by marginalizing out the sampling information $X_m$, with respect to the DP prior, from the conditional distribution of $f_{X_{m+1}}$ given $X_m$. Due to the peculiar dependency of the predictive distribution (2) on $X_m$, i.e. a dependence only through the statistics $M_{r,m}$'s, this marginalization can be carried out explicitly by applying the distribution (1). By combining (2) with (1), it holds
$$p_{f_{X_{m+1}}}(l; m, \theta) = \Pr[f_{X_{m+1}} = l] = \frac{\theta}{\theta + m}\,\frac{(m - l + 1)_{(l)}}{(\theta + m - l)_{(l)}} \qquad (4)$$
for $l = 0, 1, \dots, m$. We refer to Cai et al. (2018) for details on the derivation of the distribution (4). Uniformity of $h_n \sim \mathcal{H}$ implies that each hash function $h_n$ induces a $J$-partition of $\mathcal{V}$, say $\{B_{h_n,1}, \dots, B_{h_n,J}\}$, and the $\nu$-measure of each $B_{h_n,j}$ is $1/J$. By the restriction property of the DP, $h_n$ turns the global $P \sim \mathrm{DP}(\theta, \nu)$ that governs the distribution of $X_m$ into $J$ bucket-specific DPs, say $P_j \sim \mathrm{DP}(\theta/J, J\nu_{B_{h_n,j}})$ for $j = 1, \dots, J$, each governing the distribution of the sole tokens hashed there. Accordingly, the posterior distribution of $f_{X_{m+1}}$, given $C_{n,h_n(X_{m+1})} = c_n$, coincides with $p_{f_{X_{m+1}}}(\cdot\,; c_n, \theta/J)$. This is precisely the "indirect" proof-method of the next theorem, which is the main result of Cai et al. (2018).

Theorem 1.
Let $h_n$ be a hash function drawn at random from a pairwise independent perfectly random family $\mathcal{H}$. Let $X_m$ be a random sample of tokens from $P \sim \mathrm{DP}(\theta, \nu)$ and let $X_{m+1}$ be an additional sample. If $C_{n,h_n(X_{m+1})}$ is the hashed frequency from $X_m$ through $h_n$, i.e. $C_{n,h_n(X_{m+1})} = \sum_{1 \le i \le m} \mathbb{1}_{\{h_n(X_i)\}}(h_n(X_{m+1}))$, then
$$p_{f_{X_{m+1}}}\Big(l;\, c_n, \frac{\theta}{J}\Big) = \Pr[f_{X_{m+1}} = l \mid C_{n,h_n(X_{m+1})} = c_n] = \frac{\frac{\theta}{J}}{\frac{\theta}{J} + c_n}\,\frac{(c_n - l + 1)_{(l)}}{(\frac{\theta}{J} + c_n - l)_{(l)}}, \qquad l \ge 0.$$

For a collection of hash functions $h_1, \dots, h_N$ drawn at random from $\mathcal{H}$, the posterior distribution of $f_{X_{m+1}}$ given the hashed frequencies $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$ follows directly from Theorem 1 by the independence assumption on $\mathcal{H}$ and by an application of Bayes theorem. In particular, according to Theorem 2 of Cai et al. (2018), we have
$$\Pr[f_{X_{m+1}} = l \mid \{C_{n,h_n(X_{m+1})}\}_{n \in [N]} = \{c_n\}_{n \in [N]}] \propto \prod_{n \in [N]} \frac{\frac{\theta}{J}}{\frac{\theta}{J} + c_n}\,\frac{(c_n - l + 1)_{(l)}}{(\frac{\theta}{J} + c_n - l)_{(l)}} \qquad (5)$$
for $l \ge 0$.
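Theorem 1 and (5) are straightforward to evaluate numerically. The following sketch (function names ours) computes the normalized posterior over $l$ and its mean, i.e. the CMS-DP "shrinkage" estimate discussed above; rising factorials are handled in log-space via lgamma for stability.

```python
import math

def cms_dp_posterior(cs, theta, J):
    """Posterior (5) of f_{X_{m+1}} given hashed frequencies cs = [c_1, ..., c_N]
    under a DP(theta, nu) prior; returns (pmf over l = 0..min(cs), posterior mean)."""
    def log_rising(a, n):  # log of the rising factorial (a)_{(n)}
        return math.lgamma(a + n) - math.lgamma(a)

    t = theta / J
    lmax = min(cs)  # (c_n - l + 1)_{(l)} vanishes for l > c_n
    logp = []
    for l in range(lmax + 1):
        s = 0.0
        for c in cs:
            s += math.log(t) - math.log(t + c)
            s += log_rising(c - l + 1, l) - log_rising(t + c - l, l)
        logp.append(s)
    mx = max(logp)
    w = [math.exp(x - mx) for x in logp]  # normalize stably
    z = sum(w)
    pmf = [x / z for x in w]
    mean = sum(l * p for l, p in enumerate(pmf))
    return pmf, mean
```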
This leads to the posterior distribution of $f_v$ given the hashed frequencies $\{C_{n,h_n(v)}\}_{n \in [N]}$, for an arbitrary $v \in \mathcal{V}$. CMS-DP estimates of $f_v$ are obtained as functionals of (5), e.g. mean, mode, median. It remains to estimate the prior's parameter $\theta > 0$; under the DP prior, the distribution of $C_n$ is available in closed form. By the finite-dimensional projective property of the DP, and since $\mathcal{H}$ is perfectly random, $C_n$ is distributed according to a Dirichlet-multinomial distribution with symmetric parameter $\theta/J$, for any $n \in [N]$. Then, by the independence assumption on $\mathcal{H}$, the distribution of $\{C_n\}_{n \in [N]}$ is
$$\Pr[\{C_n\}_{n \in [N]} = \{c_n\}_{n \in [N]}] = \prod_{n \in [N]} \frac{m!}{(\theta)_{(m)}} \prod_{j=1}^{J} \frac{(\frac{\theta}{J})_{(c_{n,j})}}{c_{n,j}!}. \qquad (6)$$
Equation (6) provides the likelihood function induced by the hashed frequencies $\{c_n\}_{n \in [N]}$. The explicit form of the likelihood function of $\{c_n\}_{n \in [N]}$ allows for Bayesian estimation of the prior's parameter $\theta$. In particular, Cai et al. (2018) adopt an empirical Bayes approach to estimate $\theta$, though a fully Bayes approach can also be applied.
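For concreteness, a minimal empirical Bayes sketch along these lines: evaluate the Dirichlet-multinomial log-likelihood (6) of the observed rows of hashed counts, and maximize it over $\theta$. Function names are ours; SciPy's scalar optimizer is assumed to be available.

```python
import math
from scipy.optimize import minimize_scalar

def dm_loglik(theta, counts, J):
    """Log of (6): each row of hashed counts is Dirichlet-multinomial
    with symmetric parameter theta/J; rows are independent."""
    ll = 0.0
    for row in counts:
        m = sum(row)
        ll += math.lgamma(m + 1) + math.lgamma(theta) - math.lgamma(theta + m)
        for c in row:
            ll += (math.lgamma(theta / J + c) - math.lgamma(theta / J)
                   - math.lgamma(c + 1))
    return ll

def empirical_bayes_theta(counts, J):
    # maximize the log-likelihood over theta > 0 (log-parametrized search)
    res = minimize_scalar(lambda t: -dm_loglik(math.exp(t), counts, J),
                          bounds=(-5, 15), method="bounded")
    return math.exp(res.x)
```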
The interplay between the predictive distribution and the restriction property of the DP is the cornerstone of the "indirect" proof-method of Theorem 1. This method imposes two strong constraints on the choice of the prior distribution: i) the predictive distribution induced by the prior must depend on the sampling information $X_m$ through simple statistics of $X_m$, whose marginalization under the prior can be carried out explicitly; ii) the prior distribution must have a restriction property analogous to that of the DP prior. Discrete nonparametric priors obtained by normalizing homogeneous completely random measures (James, 2002; Prünster, 2002; Pitman, 2003; Regazzini et al., 2003; James et al., 2009) are the sole nonparametric priors satisfying constraint ii); this follows from the fact that completely random measures have a Poisson process representation admitting the Poisson coloring theorem (Kingman, 1993, Chapter 5). However, from Regazzini (1978), it follows that the DP is the unique normalized homogeneous completely random measure which satisfies constraint i). The DP prior is thus the unique discrete nonparametric prior which satisfies both constraint i) and constraint ii). Because of this peculiar feature of the DP prior, the "indirect" proof-method cannot be used when considering other nonparametric priors in BNPs. In this respect, the most popular generalization of the DP prior is the PYP prior. From Zabell (1997), the PYP is the unique discrete nonparametric prior satisfying constraint i); however, the PYP prior does not belong to the class of priors obtained by normalizing homogeneous completely random measures, and hence it does not satisfy constraint ii).

We present an alternative, and more flexible, derivation of Theorem 1. Our "direct" proof-method relies neither on the predictive distribution of the DP nor on its restriction property, but directly makes use of the definition of the posterior distribution of a point query in the BNP framework under a DP prior. To simplify the notation, we remove the subscript $n$ from $h_n$. For $h \sim \mathcal{H}$, we are interested in computing the posterior distribution
$$\Pr[f_{X_{m+1}} = l \mid C_{h(X_{m+1})} = c] = \Pr\Big[f_{X_{m+1}} = l \,\Big|\, \sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\Big] = \frac{\Pr\big[f_{X_{m+1}} = l,\ \sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\big]}{\Pr\big[\sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\big]} \qquad (7)$$
for $l = 0, 1, \dots, m$.

First, we consider the denominator of (7). Uniformity of $h$ implies that $h$ induces a $J$-partition $\{B_1, \dots, B_J\}$ of $\mathcal{V}$ such that $B_j = \{v \in \mathcal{V} : h(v) = j\}$ and $\nu(B_j) = J^{-1}$ for $j = 1, \dots, J$. Then, the finite-dimensional projective property of the DP implies that $P(B_j)$ is distributed according to a Beta distribution with parameter $(\theta/J, \theta(1 - 1/J))$ for $j = 1, \dots, J$. Hence, we write the denominator of (7) as follows:
$$\begin{aligned} \Pr\Big[\sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\Big] &= J\binom{m}{c}\,\mathbb{E}[(P(B_j))^{c+1}(1 - P(B_j))^{m-c}] \\ &= J\binom{m}{c}\int_0^1 p^{c+1}(1-p)^{m-c}\,\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta - \frac{\theta}{J})}\,p^{\frac{\theta}{J}-1}(1-p)^{\theta-\frac{\theta}{J}-1}\,\mathrm{d}p \\ &= J\binom{m}{c}\,\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta - \frac{\theta}{J})}\,\frac{\Gamma(\frac{\theta}{J} + c + 1)\Gamma(\theta - \frac{\theta}{J} + m - c)}{\Gamma(\theta + m + 1)}. \end{aligned} \qquad (8)$$
This completes the study of the denominator of the posterior distribution (7). Now, we consider the numerator of (7). Let us define the event $B(m, l) = \{X_1 = \cdots = X_l = X_{m+1},\ \{X_{l+1}, \dots, X_m\} \cap \{X_{m+1}\} = \emptyset\}$. Then, we write
$$\begin{aligned} \Pr\Big[f_{X_{m+1}} = l,\ \sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\Big] &= \binom{m}{l}\Pr\Big[B(m,l),\ \sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\Big] \\ &= \binom{m}{l}\Pr\Big[B(m,l),\ \sum_{i=l+1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c - l\Big]. \end{aligned} \qquad (9)$$
That is, the distribution of $(f_{X_{m+1}}, C_j)$ is determined by the distribution of the random variables $(X_1, \dots, X_{m+1})$. Let $\Pi(s, k)$ denote the set of all possible partitions of the set $\{1, \dots, s\}$ into $k$ disjoint subsets $\pi_1, \dots, \pi_k$, with $n_i$ the cardinality of $\pi_i$. In particular, from Equation 3.5 of Sangalli (2006), for any measurable $A_1, \dots, A_{m+1}$ we have
$$\Pr[X_1 \in A_1, \dots, X_{m+1} \in A_{m+1}] = \sum_{k=1}^{m+1}\frac{\theta^k}{(\theta)_{(m+1)}}\sum_{(\pi_1, \dots, \pi_k) \in \Pi(m+1, k)}\prod_{i=1}^{k}(n_i - 1)!\,\nu(\cap_{m' \in \pi_i} A_{m'})$$
for $m \ge 1$.
Let $\mathscr{V}$ be the Borel $\sigma$-algebra of $\mathcal{V}$, and let $\nu_{\pi_1, \dots, \pi_k}$ be the probability measure on $(\mathcal{V}^{m+1}, \mathscr{V}^{m+1})$ defined as
$$\nu_{\pi_1, \dots, \pi_k}(A_1 \times \cdots \times A_{m+1}) = \prod_{1 \le i \le k}\nu(\cap_{m' \in \pi_i} A_{m'});$$
this measure attaches to the event $B(m, l)$ a value that is either 0 or 1. In particular, $\nu_{\pi_1, \dots, \pi_k}(B(m, l)) = 1$ if and only if one of the $\pi_i$'s is equal to the set $\{1, \dots, l, m+1\}$. Hence, based on the measure $\nu_{\pi_1, \dots, \pi_k}$, we can write
$$\begin{aligned} &\Pr\Big[B(m,l),\ \sum_{i=l+1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c - l\Big] \\ &= \sum_{k=2}^{m-l+1}\frac{\theta^k}{(\theta)_{(m+1)}}\sum_{(\pi_1, \dots, \pi_{k-1}) \in \Pi(m-l, k-1)} l!\,\prod_{i=1}^{k-1}(n_i - 1)!\;\nu_{\pi_1, \dots, \pi_k}\Big(\sum_{i=l+1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c - l\Big) \\ &= \frac{\theta\,(\theta)_{(m-l)}}{(\theta)_{(m+1)}}\,l!\,\sum_{r=1}^{m-l}\frac{\theta^r}{(\theta)_{(m-l)}}\sum_{(\pi_1, \dots, \pi_r) \in \Pi(m-l, r)}\prod_{i=1}^{r}(n_i - 1)!\;\sum_{j=1}^{J}\nu(B_j)\,\nu_{\pi_1, \dots, \pi_r}\Big(\sum_{i=1}^{m-l}\mathbb{1}_{\{h(X_i)\}}(j) = c - l\Big) \\ &= \frac{\theta\,(\theta)_{(m-l)}}{(\theta)_{(m+1)}}\,l!\,\sum_{r=1}^{m-l}\frac{\theta^r}{(\theta)_{(m-l)}}\sum_{(\pi_1, \dots, \pi_r) \in \Pi(m-l, r)}\prod_{i=1}^{r}(n_i - 1)!\;\nu_{\pi_1, \dots, \pi_r}\Big(\sum_{i=1}^{m-l}\mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c - l\Big). \end{aligned}$$
Now,
$$\sum_{r=1}^{m-l}\frac{\theta^r}{(\theta)_{(m-l)}}\sum_{(\pi_1, \dots, \pi_r) \in \Pi(m-l, r)}\prod_{i=1}^{r}(n_i - 1)!\,\nu_{\pi_1, \dots, \pi_r}(\cdot)$$
is the distribution of a random sample $(X_1, \dots, X_{m-l})$ under $P \sim \mathrm{DP}(\theta, \nu)$. Again, the distribution of $(X_1, \dots, X_{m-l})$ is given in Equation 3.5 of Sangalli (2006). In particular, using the fact that $P(B_j)$ is distributed as a Beta distribution with parameter $(\theta/J, \theta(1 - 1/J))$, for $j = 1, \dots, J$, we write the following identities:
$$\begin{aligned} \Pr\Big[B(m,l),\ \sum_{i=l+1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c - l\Big] &= \frac{\theta\,(\theta)_{(m-l)}}{(\theta)_{(m+1)}}\,l!\,\binom{m-l}{c-l}\,\mathbb{E}[(P(B_j))^{c-l}(1 - P(B_j))^{m-c}] \\ &= \frac{\theta\,(\theta)_{(m-l)}}{(\theta)_{(m+1)}}\,l!\,\binom{m-l}{c-l}\int_0^1 p^{c-l}(1-p)^{m-c}\,\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta - \frac{\theta}{J})}\,p^{\frac{\theta}{J}-1}(1-p)^{\theta-\frac{\theta}{J}-1}\,\mathrm{d}p \\ &= \frac{\theta\,(\theta)_{(m-l)}}{(\theta)_{(m+1)}}\,l!\,\binom{m-l}{c-l}\,\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta - \frac{\theta}{J})}\,\frac{\Gamma(\frac{\theta}{J} + c - l)\Gamma(\theta - \frac{\theta}{J} + m - c)}{\Gamma(\theta + m - l)}, \end{aligned}$$
and from (9)
$$\begin{aligned} \Pr\Big[f_{X_{m+1}} = l,\ \sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\Big] &= \binom{m}{l}\frac{\theta\,(\theta)_{(m-l)}}{(\theta)_{(m+1)}}\,l!\,\binom{m-l}{c-l}\,\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta - \frac{\theta}{J})}\,\frac{\Gamma(\frac{\theta}{J} + c - l)\Gamma(\theta - \frac{\theta}{J} + m - c)}{\Gamma(\theta + m - l)} \\ &= \frac{\theta\,m!}{(c-l)!\,(m-c)!}\,\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta - \frac{\theta}{J})}\,\frac{\Gamma(\frac{\theta}{J} + c - l)\Gamma(\theta - \frac{\theta}{J} + m - c)}{\Gamma(\theta + m + 1)}. \end{aligned} \qquad (10)$$
This completes the study of the numerator of the posterior distribution (7). By combining (7) with (8) and (10) we obtain
$$\Pr[f_{X_{m+1}} = l \mid C_{h(X_{m+1})} = c] = \frac{\frac{\theta\,m!}{(c-l)!(m-c)!}\,\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta-\frac{\theta}{J})}\,\frac{\Gamma(\frac{\theta}{J}+c-l)\Gamma(\theta-\frac{\theta}{J}+m-c)}{\Gamma(\theta+m+1)}}{J\binom{m}{c}\,\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta-\frac{\theta}{J})}\,\frac{\Gamma(\frac{\theta}{J}+c+1)\Gamma(\theta-\frac{\theta}{J}+m-c)}{\Gamma(\theta+m+1)}} = \frac{\frac{\theta}{J}}{\frac{\theta}{J}+c}\,\frac{(c-l+1)_{(l)}}{(\frac{\theta}{J}+c-l)_{(l)}} \qquad (11)$$
for $l = 0, 1, \dots, m$. The posterior distribution (11) coincides with the posterior distribution in Theorem 1, i.e. the posterior distribution obtained in Cai et al. (2018).
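A quick Monte Carlo sanity check of (11) is easy to run and, in our experience, reassuring: simulate DP streams via the predictive scheme, hash symbols with a perfectly random hash, tabulate $(f_{X_{m+1}}, C_{h(X_{m+1})})$, and compare the empirical conditional law with the closed form. The sketch below is ours, with arbitrary demonstration parameters.

```python
import random
from collections import Counter, defaultdict
from math import lgamma, exp

def mc_check(theta=1.0, J=4, m=20, trials=200_000, seed=1):
    rng = random.Random(seed)
    joint = defaultdict(Counter)  # joint[c][l] = #trials with (C = c, f = l)
    for _ in range(trials):
        counts, buckets, stream = Counter(), {}, []
        for i in range(m + 1):  # sample X_1, ..., X_{m+1} via the predictive (2)
            if rng.random() < theta / (theta + i):
                v = len(buckets)
                buckets[v] = rng.randrange(J)  # perfectly random hash of the new symbol
            else:
                v = rng.choices(stream, k=1)[0]  # old symbol, prob. proportional to count
            stream.append(v)
            counts[v] += 1
        x_new = stream[-1]
        f = counts[x_new] - 1  # frequency of X_{m+1} among X_1, ..., X_m
        c = sum(1 for x in stream[:-1] if buckets[x] == buckets[x_new])
        joint[c][f] += 1
    c = 2  # compare empirical vs (11) at an example value of C
    tot, t = sum(joint[c].values()), theta / J
    for l in range(c + 1):
        exact = t / (t + c) * exp(lgamma(c + 1) - lgamma(c - l + 1)      # (c-l+1)_{(l)}
                                  - (lgamma(t + c) - lgamma(t + c - l)))  # (t+c-l)_{(l)}
        print(l, joint[c][l] / tot, exact)
```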
Besides providing an alternative proof of Theorem 1, in the next sections we show that our "direct" proof-method is such that: i) it allows us to make use of the PYP prior; ii) it can be readily applied to the more general problem of estimating $s$-range queries, for any $s \ge 1$.

We extend the BNP approach of Cai et al. (2018) to develop a learning-augmented CMS under power-law data streams. In this respect, we assume that tokens in the stream $X_m$, for $m \ge 1$,
are modeled as random samples from an unknown discrete distribution $P$, which is endowed with a prior $Q$ with power-law tail behaviour. Among discrete nonparametric priors $Q$ with power-law tail behaviour, the PYP prior stands out for both its mathematical tractability and its interpretability, and hence it is the natural candidate for applications within the broad class of priors considered in De Blasi et al. (2015). See also Bacallado et al. (2017), and references therein, for a detailed account of priors with power-law tail behaviour. The PYP does not have a restriction property analogous to that of the DP, and hence the "indirect" proof-method of Cai et al. (2018) cannot be applied to obtain the posterior distribution of a point query, given the hashed data. Moreover, the PYP does not have a finite-dimensional projective property analogous to that of the DP, and hence the prior's parameters cannot be estimated via an empirical Bayes procedure as in Cai et al. (2018). In this section, we make use of the "direct" proof-method of Section 2 to compute the posterior distribution of a point query, given the hashed frequencies, under a PYP prior. Then, we exploit the tractable form of the predictive distribution induced by the PYP prior to implement a likelihood-free approach to estimate the prior's parameters. This procedure leads to the CMS-PYP, a novel learning-augmented CMS under power-law data streams.

Among the various possible definitions of the PYP, a simple and intuitive one follows from the stick-breaking construction of Pitman (1995) and Pitman and Yor (1997).
For any $\alpha \in [0, 1)$ and $\theta > -\alpha$, let: i) $(V_i)_{i \ge 1}$ be independent random variables such that $V_i$ is distributed according to a Beta distribution with parameter $(1 - \alpha, \theta + i\alpha)$; ii) $(Y_i)_{i \ge 1}$ be random variables, independent of $(V_i)_{i \ge 1}$, and independent and identically distributed according to a non-atomic distribution $\nu$ on $\mathcal{V}$. If we set $P_1 = V_1$ and $P_j = V_j \prod_{1 \le i \le j-1}(1 - V_i)$ for $j \ge 2$, which ensures that $\sum_{i \ge 1} P_i = 1$ almost surely, then $P = \sum_{j \ge 1} P_j \delta_{Y_j}$ is a PYP on $\mathcal{V}$ with (base) distribution $\nu$, discount parameter $\alpha$ and mass parameter $\theta$. For short, we write $P \sim \mathrm{PYP}(\alpha, \theta, \nu)$. We refer to Perman et al. (1992) and Pitman and Yor (1997) for an alternative definition of the PYP through a suitable transformation of the $\alpha$-stable completely random measure (Kingman, 1993); see also Pitman (2006). The DP arises as a special case of the PYP by setting $\alpha = 0$. For the purposes of the present paper, it is useful to recall the power-law tail behaviour featured by the PYP prior. Let $P \sim \mathrm{PYP}(\alpha, \theta, \nu)$ with $\alpha \in (0, 1)$, and let $(P_{(j)})_{j \ge 1}$ be the decreasingly ordered random probabilities $P_j$'s of $P$. Then, as $j \to +\infty$, the $P_{(j)}$'s follow a power-law distribution of exponent $c = \alpha^{-1}$; see Pitman and Yor (1997) and references therein. That is, the parameter $\alpha \in (0, 1)$
controls the power-law tail behaviour of the PYP through the small probabilities $P_{(j)}$'s: the larger $\alpha$, the heavier the tail of $P$. Because of the discreteness of $P \sim \mathrm{PYP}(\alpha, \theta, \nu)$, a random sample $(X_1, \dots, X_m)$ from $P$ induces a random partition of $\{1, \dots, m\}$ into $1 \le K_m \le m$ partition subsets, labelled by distinct symbols $\mathbf{v} = \{v_1, \dots, v_{K_m}\}$, with frequencies $(N_1, \dots, N_{K_m})$ such that $1 \le N_i \le m$ and $\sum_{1 \le i \le K_m} N_i = m$. For any $1 \le r \le m$, let $M_{r,m}$ be the number of distinct symbols with frequency $r$, i.e. $M_{r,m} = \sum_{1 \le i \le K_m} \mathbb{1}_{\{r\}}(N_i)$, such that $\sum_{1 \le r \le m} M_{r,m} = K_m$ and $\sum_{1 \le r \le m} r M_{r,m} = m$. The distribution of $M_m = (M_{1,m}, \dots, M_{m,m})$ is defined on the set $\mathcal{M}_{m,k} = \{(m_1, \dots, m_m) : m_i \ge 0,\ \sum_{1 \le i \le m} m_i = k,\ \sum_{1 \le i \le m} i\,m_i = m\}$. In particular, for any $\mathbf{m} \in \mathcal{M}_{m,k}$,
$$\Pr[M_m = \mathbf{m}] = m!\,\frac{(\frac{\theta}{\alpha})_{(\sum_{i=1}^m m_i)}}{(\theta)_{(m)}}\prod_{i=1}^m\Big(\frac{\alpha(1-\alpha)_{(i-1)}}{i!}\Big)^{m_i}\frac{1}{m_i!}. \qquad (12)$$
Also,
$$\Pr[K_m = k] = \frac{(\frac{\theta}{\alpha})_{(k)}}{(\theta)_{(m)}}\,\mathscr{C}(m, k; \alpha) \qquad (13)$$
for any $k = 1, \dots, m$, where $\mathscr{C}(m, k; \alpha) = (k!)^{-1}\sum_{0 \le i \le k}\binom{k}{i}(-1)^i(-i\alpha)_{(m)}$ is the generalized factorial coefficient (Charalambides, 2005), with the proviso $\mathscr{C}(0, 0; \alpha) = 1$ and $\mathscr{C}(m, 0; \alpha) = 0$ for $m \ge 1$.
See Chapter 3 of Pitman (2006) for details on (12) and (13). Now, let $\mathbf{v}_r = \{v_i \in \mathbf{v} : N_i = r\}$, i.e. the labels in $\mathbf{v}$ with frequency $r$, and let $\mathbf{v}_0 = \mathcal{V} - \mathbf{v}$, i.e. the labels in $\mathcal{V}$ not belonging to $\mathbf{v}$; then the predictive distribution induced by $P \sim \mathrm{PYP}(\alpha, \theta, \nu)$ is
$$\Pr[X_{m+1} \in \mathbf{v}_r \mid X_m] = \Pr[X_{m+1} \in \mathbf{v}_r \mid M_m = \mathbf{m}] = \begin{cases} \dfrac{\theta + k\alpha}{\theta + m} & \text{if } r = 0, \\[2mm] \dfrac{m_r(r - \alpha)}{\theta + m} & \text{if } r \ge 1, \end{cases} \qquad (14)$$
for any $m \ge 1$.
For $\alpha = 0$ the predictive distribution (14) reduces to that of the DP prior (2). The PYP prior is the sole discrete nonparametric prior whose predictive distribution is such that: i) the probability that $X_{m+1}$ belongs to $\mathbf{v}_0$ depends on $X_m$ only through the statistic $K_m$; ii) the probability that $X_{m+1}$ belongs to $\mathbf{v}_r$ depends on $X_m$ only through the statistic $M_{r,m}$. See Bacallado et al. (2017) for a detailed account.

At the sampling level, the power-law tail behaviour of $P \sim \mathrm{PYP}(\alpha, \theta, \nu)$ emerges directly from the analysis of the large-$m$ asymptotic behaviour of the statistics $K_m$ and $M_{r,m}/K_m$; see Chapter 3 of Pitman (2006) and references therein. Let $X_m$ be a random sample from $P$. Theorem 3.8 of Pitman (2006) shows that, as $m \to +\infty$,
$$\frac{K_m}{m^\alpha} \to S_{\alpha,\theta} \qquad (15)$$
almost surely, where $S_{\alpha,\theta}$ is a positive and finite (almost surely) random variable. See Dolera and Favaro (2020a) and Dolera and Favaro (2020b) for refinements of (15). Moreover, it follows from (15) that, as $m \to +\infty$,
$$\frac{M_{r,m}}{K_m} \to \frac{\alpha(1-\alpha)_{(r-1)}}{r!} \qquad (16)$$
almost surely. Equation (15) shows that the number $K_m$ of distinct symbols in $X_m$, for large $m$, grows as $m^\alpha$; this is precisely the growth of the number of distinct symbols in $m \ge 1$ random samples from a power-law distribution of exponent $c = \alpha^{-1}$. Moreover, Equation (16) shows that $p_{\alpha,r} = \alpha(1-\alpha)_{(r-1)}/r!$ is the large-$m$ asymptotic proportion of the number of distinct symbols with frequency $r$ in $X_m$; then $p_{\alpha,r} \approx c_\alpha r^{-\alpha-1}$ for large $r$, for a constant $c_\alpha$. This is precisely the distribution of the number of distinct symbols with frequency $r$ in $m \ge 1$ random samples from a power-law distribution of exponent $c = \alpha^{-1}$. See Figure 1.

Figure 1: $K_m$ and $M_{r,m}/K_m$ under $P \sim \mathrm{PYP}(\alpha, \theta, \nu)$, for $\alpha = 0, 0.25, 0.5, 0.75$.
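Sampling from the PYP predictive (14) requires only the frequency counts and the number of distinct symbols; the following sketch (ours, for illustration) generalizes the DP predictive scheme of Section 2, with $\alpha = 0$ recovering it, and can be used to eyeball the $K_m \asymp m^\alpha$ growth in (15).

```python
import random

def sample_pyp_stream(m, alpha, theta, seed=0):
    """Sample (X_1, ..., X_m) from P ~ PYP(alpha, theta, nu) via the
    predictive distribution (14); distinct symbols are represented by
    the integers 0, 1, 2, ..., standing in for i.i.d. draws from nu."""
    rng = random.Random(seed)
    stream, counts = [], []
    for i in range(m):
        k = len(counts)  # number of distinct symbols so far
        # new symbol with probability (theta + k*alpha) / (theta + i);
        # old symbol v with probability (counts[v] - alpha) / (theta + i)
        if rng.random() < (theta + k * alpha) / (theta + i):
            counts.append(0)
            v = k
        else:
            v = rng.choices(range(k), weights=[c - alpha for c in counts])[0]
        counts[v] += 1
        stream.append(v)
    return stream

# e.g. len(set(sample_pyp_stream(100_000, 0.5, 1.0))) is of the order of
# sqrt(100_000), in line with the K_m ~ m^alpha growth in (15)
```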
For $m \ge 1$, let $X_m$ be a data stream of tokens taking values in a (possibly infinite) measurable space of symbols $\mathcal{V}$. Under our BNP approach, we assume that the tokens in the stream $X_m$ are modeled as random samples from an unknown discrete distribution $P$, which is endowed with a PYP prior. That is, the tokens $X_i$'s are modeled as
$$X_m \mid P \overset{\text{iid}}{\sim} P, \qquad P \sim \mathrm{PYP}(\alpha, \theta, \nu) \qquad (17)$$
for any $m \ge 1$. As in Cai et al. (2018), we assume that the $X_i$'s are hashed through hash functions $h_1, \dots, h_N$ drawn uniformly at random from a pairwise independent perfectly random hash family $\mathcal{H}$. Then, under (17), a point query induces the posterior distribution of $f_v$, given the hashed frequencies $\{C_{n,h_n(v)}\}_{n \in [N]}$, for $v \in \mathcal{V}$. CMS-PYP estimates of $f_v$ are obtained as functionals of the posterior distribution of $f_v$ given $\{C_{n,h_n(v)}\}_{n \in [N]}$, e.g. mode, mean, median. For a single $h_n \sim \mathcal{H}$, in the next theorem we adapt the "direct" proof-method of Section 2 to compute the posterior distribution of $f_{X_{m+1}}$, given $C_{n,h_n(X_{m+1})}$, under the PYP prior. This provides the posterior distribution of the point query $f_v$, given $C_{n,h_n(v)}$, for an arbitrary symbol $v \in \mathcal{V}$.

Theorem 2.
Let $h_n$ be a hash function drawn at random from a pairwise independent perfectly random family $\mathcal{H}$. Let $X_m$ be a random sample of tokens from $P \sim \mathrm{PYP}(\alpha, \theta, \nu)$ and let $X_{m+1}$ be an additional sample. Let $C_{n,h_n(X_{m+1})}$ be the hashed frequency from $X_m$ through $h_n$, i.e. $C_{n,h_n(X_{m+1})} = \sum_{1 \le i \le m}\mathbb{1}_{\{h_n(X_i)\}}(h_n(X_{m+1}))$. If $G_{K_m}(t; \alpha, \theta)$ is the probability generating function of the distribution (13), for $t > 0$, then
$$\Pr[f_{X_{m+1}} = l \mid C_{n,h_n(X_{m+1})} = c_n] = \frac{\theta}{J}\binom{c_n}{l}\frac{(\theta+\alpha)_{(m-l)}}{(\theta)_{(m+1)}}(1-\alpha)_{(l)}\times\frac{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\,G_{K_{m-l-i}}(J^{-1};\alpha,\theta+\alpha)}{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\,G_{K_{m-i+1}}(J^{-1};\alpha,\theta)}, \qquad l \ge 0. \qquad (18)$$

Proof.
We follow the "direct" proof-method of Section 2, adapted to the PYP prior. We remove the subscript $n$ from the hash function $h_n$. Then, for the hash function $h \sim \mathcal{H}$, we are interested in computing the posterior distribution
$$\Pr[f_{X_{m+1}} = l \mid C_{h(X_{m+1})} = c] = \frac{\Pr\big[f_{X_{m+1}} = l,\ \sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\big]}{\Pr\big[\sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\big]} \qquad (19)$$
for $l = 0, 1, \dots, m$. Uniformity of $h$ implies that $h$ induces a $J$-partition $\{B_1, \dots, B_J\}$ of $\mathcal{V}$ such that $B_j = \{v \in \mathcal{V} : h(v) = j\}$ and $\nu(B_j) = J^{-1}$ for $j = 1, \dots, J$. Then, we write the denominator of (19) as follows:
$$\begin{aligned} \Pr\Big[\sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\Big] &= J\binom{m}{c}\,\mathbb{E}[(P(B_j))^{c+1}(1 - P(B_j))^{m-c}] \\ &= J\binom{m}{c}\sum_{i=0}^{m-c}\binom{m-c}{i}(-1)^{m-c-i}\,\mathbb{E}[(P(B_j))^{m-i+1}] \\ &= J\binom{m}{c}\sum_{i=0}^{m-c}\binom{m-c}{i}(-1)^{m-c-i}\sum_{k=1}^{m-i+1}\frac{(\frac{\theta}{\alpha})_{(k)}}{(\theta)_{(m-i+1)}}\,J^{-k}\,\mathscr{C}(m-i+1, k; \alpha), \end{aligned} \qquad (20)$$
where the last equality follows from the moment formulae in Equation 3.3 of Sangalli (2006). This completes the study of the denominator of the posterior distribution (19). Now, we consider the numerator of (19). Let us define the event $B(m, l) = \{X_1 = \cdots = X_l = X_{m+1},\ \{X_{l+1}, \dots, X_m\} \cap \{X_{m+1}\} = \emptyset\}$. We write
$$\begin{aligned} \Pr\Big[f_{X_{m+1}} = l,\ \sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\Big] &= \binom{m}{l}\Pr\Big[B(m,l),\ \sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\Big] \\ &= \binom{m}{l}\Pr\Big[B(m,l),\ \sum_{i=l+1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c - l\Big]. \end{aligned} \qquad (21)$$
That is, the distribution of $(f_{X_{m+1}}, C_j)$ is determined by the knowledge of the distribution of $(X_1, \dots, X_{m+1})$. Let $\Pi(s, k)$ denote the set of all possible partitions of the set $\{1, \dots, s\}$ into $k$ disjoint subsets $\pi_1, \dots, \pi_k$, with $n_i$ the cardinality of $\pi_i$. From Equation 3.5 of Sangalli (2006), for any measurable $A_1, \dots, A_{m+1}$ we have that
$$\Pr[X_1 \in A_1, \dots, X_{m+1} \in A_{m+1}] = \sum_{k=1}^{m+1}\frac{\prod_{i=0}^{k-1}(\theta + i\alpha)}{(\theta)_{(m+1)}}\sum_{(\pi_1, \dots, \pi_k) \in \Pi(m+1, k)}\prod_{i=1}^{k}(1-\alpha)_{(n_i-1)}\,\nu(\cap_{m' \in \pi_i} A_{m'})$$
for $m \ge 1$.
Let $\mathscr{V}$ be the Borel $\sigma$-algebra of $\mathcal{V}$, and let $\nu_{\pi_1, \dots, \pi_k}$ be the probability measure on $(\mathcal{V}^{m+1}, \mathscr{V}^{m+1})$ defined as
$$\nu_{\pi_1, \dots, \pi_k}(A_1 \times \cdots \times A_{m+1}) = \prod_{1 \le i \le k}\nu(\cap_{m' \in \pi_i} A_{m'});$$
this measure attaches to the event $B(m, l)$ a value that is either 0 or 1. In particular, $\nu_{\pi_1, \dots, \pi_k}(B(m, l)) = 1$ if and only if one of the $\pi_i$'s is equal to the set $\{1, \dots, l, m+1\}$. Hence, based on the measure $\nu_{\pi_1, \dots, \pi_k}$, we can write
$$\begin{aligned} &\Pr\Big[B(m,l),\ \sum_{i=l+1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c - l\Big] \\ &= \sum_{k=2}^{m-l+1}\frac{\prod_{i=0}^{k-1}(\theta + i\alpha)}{(\theta)_{(m+1)}}\sum_{(\pi_1, \dots, \pi_{k-1}) \in \Pi(m-l, k-1)}(1-\alpha)_{(l)}\prod_{i=1}^{k-1}(1-\alpha)_{(n_i-1)}\,\nu_{\pi_1, \dots, \pi_k}\Big(\sum_{i=l+1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c - l\Big) \\ &= \frac{\theta\,(\theta+\alpha)_{(m-l)}}{(\theta)_{(m+1)}}(1-\alpha)_{(l)}\sum_{r=1}^{m-l}\frac{\prod_{i=0}^{r-1}(\theta + \alpha + i\alpha)}{(\theta+\alpha)_{(m-l)}}\sum_{(\pi_1, \dots, \pi_r) \in \Pi(m-l, r)}\prod_{i=1}^{r}(1-\alpha)_{(n_i-1)}\sum_{j=1}^{J}\nu(B_j)\,\nu_{\pi_1, \dots, \pi_r}\Big(\sum_{i=1}^{m-l}\mathbb{1}_{\{h(X_i)\}}(j) = c - l\Big) \\ &= \frac{\theta\,(\theta+\alpha)_{(m-l)}}{(\theta)_{(m+1)}}(1-\alpha)_{(l)}\sum_{r=1}^{m-l}\frac{\prod_{i=0}^{r-1}(\theta + \alpha + i\alpha)}{(\theta+\alpha)_{(m-l)}}\sum_{(\pi_1, \dots, \pi_r) \in \Pi(m-l, r)}\prod_{i=1}^{r}(1-\alpha)_{(n_i-1)}\,\nu_{\pi_1, \dots, \pi_r}\Big(\sum_{i=1}^{m-l}\mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c - l\Big). \end{aligned}$$
Now,
$$\sum_{r=1}^{m-l}\frac{\prod_{i=0}^{r-1}(\theta + \alpha + i\alpha)}{(\theta+\alpha)_{(m-l)}}\sum_{(\pi_1, \dots, \pi_r) \in \Pi(m-l, r)}\prod_{i=1}^{r}(1-\alpha)_{(n_i-1)}\,\nu_{\pi_1, \dots, \pi_r}(\cdot)$$
is the distribution of a random sample $(X_1, \dots, X_{m-l})$ from $P \sim \mathrm{PYP}(\alpha, \theta + \alpha, \nu)$. Again, an expression for this distribution of $(X_1, \dots, X_{m-l})$ is given in Equation 3.5 of Sangalli (2006). In particular, we write the following identities:
$$\begin{aligned} \Pr\Big[B(m,l),\ \sum_{i=l+1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c - l\Big] &= \frac{\theta\,(\theta+\alpha)_{(m-l)}}{(\theta)_{(m+1)}}(1-\alpha)_{(l)}\binom{m-l}{c-l}\,\mathbb{E}[(P(B_j))^{c-l}(1 - P(B_j))^{m-c}] \\ &= \frac{\theta\,(\theta+\alpha)_{(m-l)}}{(\theta)_{(m+1)}}(1-\alpha)_{(l)}\binom{m-l}{c-l}\sum_{i=0}^{m-c}\binom{m-c}{i}(-1)^{m-c-i}\,\mathbb{E}[(P(B_j))^{m-l-i}] \\ &= \frac{\theta\,(\theta+\alpha)_{(m-l)}}{(\theta)_{(m+1)}}(1-\alpha)_{(l)}\binom{m-l}{c-l}\sum_{i=0}^{m-c}\binom{m-c}{i}(-1)^{m-c-i}\sum_{k=1}^{m-l-i}\frac{(\frac{\theta+\alpha}{\alpha})_{(k)}}{(\theta+\alpha)_{(m-l-i)}}\,J^{-k}\,\mathscr{C}(m-l-i, k; \alpha), \end{aligned}$$
where the expected values are computed with respect to $P \sim \mathrm{PYP}(\alpha, \theta + \alpha, \nu)$, and from (21)
$$\begin{aligned} \Pr\Big[f_{X_{m+1}} = l,\ \sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\Big] &= \binom{m}{l}\frac{\theta\,(\theta+\alpha)_{(m-l)}}{(\theta)_{(m+1)}}(1-\alpha)_{(l)}\binom{m-l}{c-l} \\ &\quad\times\sum_{i=0}^{m-c}\binom{m-c}{i}(-1)^{m-c-i}\sum_{k=1}^{m-l-i}\frac{(\frac{\theta+\alpha}{\alpha})_{(k)}}{(\theta+\alpha)_{(m-l-i)}}\,J^{-k}\,\mathscr{C}(m-l-i, k; \alpha). \end{aligned} \qquad (22)$$
This completes the study of the numerator of the posterior distribution (19).
By combining (19) with (20) and (22) we obtain
$$\begin{aligned} &\Pr\Big[f_{X_{m+1}} = l \,\Big|\, \sum_{i=1}^m \mathbb{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\Big] \\ &\quad= \frac{\theta}{J}\binom{c}{l}\frac{(\theta+\alpha)_{(m-l)}}{(\theta)_{(m+1)}}(1-\alpha)_{(l)}\times\frac{\sum_{i=0}^{m-c}\binom{m-c}{i}(-1)^{m-c-i}\sum_{k=1}^{m-l-i}\frac{(\frac{\theta+\alpha}{\alpha})_{(k)}}{(\theta+\alpha)_{(m-l-i)}}\,J^{-k}\,\mathscr{C}(m-l-i, k; \alpha)}{\sum_{i=0}^{m-c}\binom{m-c}{i}(-1)^{m-c-i}\sum_{k=1}^{m-i+1}\frac{(\frac{\theta}{\alpha})_{(k)}}{(\theta)_{(m-i+1)}}\,J^{-k}\,\mathscr{C}(m-i+1, k; \alpha)}. \end{aligned} \qquad (23)$$
The proof is completed by combining (23) with the probability generating function of the distribution (13).

Theorem 2 extends Theorem 1 to the more general BNP framework (17). In particular, Theorem 1 can be recovered from Theorem 2 by setting $\alpha = 0$; see Appendix B. For $\alpha \in (0, 1)$, the posterior distribution (18) can also be expressed in terms of the positive $\alpha$-stable distribution (Zolotarev, 1986); see Appendix C. If $g_\alpha$ denotes the density function of a positive $\alpha$-stable distribution, then
$$\Pr[f_{X_{m+1}} = l \mid C_{n,h_n(X_{m+1})} = c_n] = \alpha\binom{c_n}{l}\frac{\Gamma(\theta+\alpha+m-l)}{\Gamma(\theta+m+1)}(1-\alpha)_{(l)}\times\frac{\displaystyle\int_0^{+\infty}\!\!\int_0^{+\infty} g_\alpha(h)\,g_\alpha(x)\,x^{-\theta-\alpha}\,\frac{\big(\frac{h}{x}(J-1)^{1/\alpha}\big)^{m-c_n}}{\big(\frac{h}{x}(J-1)^{1/\alpha}+1\big)^{\theta+m-l+\alpha}}\,\mathrm{d}x\,\mathrm{d}h}{\displaystyle\int_0^{+\infty}\!\!\int_0^{+\infty} g_\alpha(h)\,g_\alpha(x)\,x^{-\theta}\,\frac{\big(\frac{h}{x}(J-1)^{1/\alpha}\big)^{m-c_n}}{\big(\frac{h}{x}(J-1)^{1/\alpha}+1\big)^{\theta+m+1}}\,\mathrm{d}x\,\mathrm{d}h} \qquad (24)$$
for $l \ge 0$.
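For small $m$, the posterior (18) can be evaluated exactly from the generalized factorial coefficients; the sketch below (our names) does so, and also makes visible the alternating-sign instability for large $m$ that motivates the integral representation (24).

```python
import math
from functools import lru_cache

def rising(a, n):
    """Rising factorial (a)_{(n)}, computed as a plain product (a may be negative)."""
    out = 1.0
    for j in range(n):
        out *= a + j
    return out

@lru_cache(maxsize=None)
def gfc(n, k, alpha):
    # generalized factorial coefficient C(n, k; alpha), via the explicit sum in (13)
    return sum(math.comb(k, i) * (-1) ** i * rising(-i * alpha, n)
               for i in range(k + 1)) / math.factorial(k)

def pgf_K(n, t, alpha, theta):
    # G_{K_n}(t; alpha, theta) = sum_k Pr[K_n = k] t^k, with Pr[K_n = k] from (13)
    if n == 0:
        return 1.0
    return sum(rising(theta / alpha, k) / rising(theta, n)
               * gfc(n, k, alpha) * t ** k for k in range(1, n + 1))

def pyp_posterior(c, m, alpha, theta, J):
    """Posterior (18) of f_{X_{m+1}} given one hashed frequency c; exact in
    exact arithmetic, but in float64 only reliable for small m because of
    the alternating sums."""
    den = sum(math.comb(m - c, i) * (-1) ** (m - c - i)
              * pgf_K(m - i + 1, 1 / J, alpha, theta)
              for i in range(m - c + 1))
    post = []
    for l in range(c + 1):
        num = sum(math.comb(m - c, i) * (-1) ** (m - c - i)
                  * pgf_K(m - l - i, 1 / J, alpha, theta + alpha)
                  for i in range(m - c + 1))
        post.append((theta / J) * math.comb(c, l)
                    * rising(theta + alpha, m - l) / rising(theta, m + 1)
                    * rising(1 - alpha, l) * num / den)
    return post  # sums to 1 up to floating-point error
```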
Expression (24) is useful for the numerical evaluation of the posterior distribution (18), since it avoids numerical issues in the evaluation of summations of $(m - c_n)$ terms with alternating signs; see Section 5. Figure 2 shows the behaviour of the posterior distribution (24) for different values of the prior's parameter $(\alpha, \theta)$, keeping $m$, $J$ and $c_n$ fixed. For $\alpha = 0$, i.e. for the CMS-DP of Cai et al. (2018), the posterior distribution (24) is either monotonically decreasing or monotonically increasing. Here, the additional parameter $\alpha \in (0, 1)$ allows for a more flexible shape of the posterior distribution of $f_{X_{m+1}}$, given $C_{n,h_n(X_{m+1})}$.
Figure 2: Posterior distribution of $f_{X_{m+1}}$ given $C_{n,h_n(X_{m+1})} = c_n$ under $P \sim \mathrm{PYP}(\alpha, \theta, \nu)$, with $m = 1000$, $J = 50$ and $c_n = 20$; panels correspond to $\alpha \in \{0.2, 0.4, 0.6, 0.8\}$ and $\theta \in \{0.0, 13.3, 26.7, 40.0\}$.

For a collection of hash functions $h_1, \dots, h_N$ drawn at random from $\mathcal{H}$, the posterior distribution of $f_{X_{m+1}}$ given $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$ follows directly from Theorem 2 by the independence assumption on $\mathcal{H}$ and by an application of Bayes theorem; see Appendix D for the proof. In particular, the following expression holds true:
$$\Pr[f_{X_{m+1}} = l \mid \{C_{n,h_n(X_{m+1})}\}_{n \in [N]} = \{c_n\}_{n \in [N]}] \propto \prod_{n \in [N]}\frac{\theta}{J}\binom{c_n}{l}\frac{(\theta+\alpha)_{(m-l)}}{(\theta)_{(m+1)}}(1-\alpha)_{(l)}\times\frac{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\,G_{K_{m-l-i}}(J^{-1};\alpha,\theta+\alpha)}{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\,G_{K_{m-i+1}}(J^{-1};\alpha,\theta)} \qquad (25)$$
or, by (24),
$$\Pr[f_{X_{m+1}} = l \mid \{C_{n,h_n(X_{m+1})}\}_{n \in [N]} = \{c_n\}_{n \in [N]}] \propto \prod_{n \in [N]}\alpha\binom{c_n}{l}\frac{\Gamma(\theta+\alpha+m-l)}{\Gamma(\theta+m+1)}(1-\alpha)_{(l)}\times\frac{\int_0^{+\infty}\int_0^{+\infty} g_\alpha(h)\,g_\alpha(x)\,x^{-\theta-\alpha}\,\frac{(\frac{h}{x}(J-1)^{1/\alpha})^{m-c_n}}{(\frac{h}{x}(J-1)^{1/\alpha}+1)^{\theta+m-l+\alpha}}\,\mathrm{d}x\,\mathrm{d}h}{\int_0^{+\infty}\int_0^{+\infty} g_\alpha(h)\,g_\alpha(x)\,x^{-\theta}\,\frac{(\frac{h}{x}(J-1)^{1/\alpha})^{m-c_n}}{(\frac{h}{x}(J-1)^{1/\alpha}+1)^{\theta+m+1}}\,\mathrm{d}x\,\mathrm{d}h} \qquad (26)$$
for $l \ge 0$.
Expression (25) provides the posterior distribution of the point query $f_v$ given the hashed frequencies $\{C_{n,h_n(v)}\}_{n \in [N]}$, for an arbitrary $v \in \mathcal{V}$. CMS-PYP estimates of $f_v$ are obtained as functionals of the posterior distribution (25), e.g. mode, mean, median. The evaluation of (26) requires care to achieve numerical stability and efficiency, especially as the densities $g_\alpha$ are not available in closed form; see Section 5.

To apply the posterior distribution (26), it remains to estimate the prior's parameter $(\alpha, \theta)$ given the hashed frequencies $\{C_n\}_{n \in [N]}$, with $C_n = (C_{n,1}, \dots, C_{n,J})$. For ease of exposition, we denote by $\mathbf{C}$ the $N \times J$ matrix with entries $C_{n,j}$ for $n \in [N]$ and $j \in [J]$. Assuming that the matrix $\mathbf{C}$ has been computed from $m$ tokens, the sum of the entries of each row of $\mathbf{C}$ is equal to the sample size $m$. Since the PYP does not have a restriction property analogous to that of the DP, under the model (17) the distribution of $\mathbf{C}$ is not available in closed form. Hence, the prior's parameter $(\alpha, \theta)$ cannot be estimated following the empirical Bayes approach adopted by Cai et al. (2018) in the context of the DP prior. Instead, here we estimate $(\alpha, \theta)$ by relying on the minimum Wasserstein distance method (Bernton et al., 2019). This method estimates $(\alpha, \theta)$ by selecting the value of $(\alpha, \theta)$ that minimizes the expected Wasserstein distance between a summary statistic of the data and the corresponding summary statistic of synthetic data generated under the model (17). In our context, a natural choice for the summary statistic is the matrix $\mathbf{C}$. By construction, the rows of $\mathbf{C}$ are independent and identically distributed; moreover, since $\mathcal{H}$ is assumed to be a perfectly random hash family, each column of $\mathbf{C}$ is exchangeable. Then, we can define the reference summary statistic $\mathbf{C}$ as a vector of length $NJ$ containing the (unordered) entries of the matrix $\mathbf{C}$. For any fixed $m' \ge 1$ and prior's parameter $(\alpha, \theta)$, let $\widetilde{X}_{m'} = (\widetilde{X}_1, \dots, \widetilde{X}_{m'})$ be a random sample from $P \sim \mathrm{PYP}(\alpha, \theta, \nu)$, i.e. $\widetilde{X}_{m'}$ is modeled as (17). For a moderate sample size $m'$, generating random variates from $\widetilde{X}_{m'}$ is straightforward by means of the predictive distribution (14) of the PYP. These random variates, by a direct transformation through the hash functions $h_1, \dots, h_N$ drawn at random from $\mathcal{H}$, lead to random variates of the hashed frequencies, and hence to random variates of the reference summary statistic, denoted by $\widetilde{\mathbf{C}}(\alpha, \theta, m')$.

In practice, the sample size $m$ is such that $m \gg m'$, and the computational cost of sampling from the predictive distribution (14) of the PYP scales super-linearly in $m'$. To account for this mismatch we scale the entries of $\widetilde{\mathbf{C}}(\alpha, \theta, m')$ by $m/m'$, so that each row of $\widetilde{\mathbf{C}}(\alpha, \theta, m')\,m/m'$ sums to $m$. Now, we are interested in finding $(\hat\alpha, \hat\theta)$ such that
$$(\hat\alpha, \hat\theta) = \operatorname*{arg\,min}_{(\alpha, \theta)}\ \mathbb{E}\Big[W_1\Big(\mathbf{C},\ \widetilde{\mathbf{C}}(\alpha, \theta, m')\,\frac{m}{m'}\Big)\Big], \qquad (27)$$
where $W_1$ is the Wasserstein distance of order 1, and the expectation is taken with respect to $\widetilde{\mathbf{C}}$. To fully specify the optimization problem we choose $\rho(x, y) = |x - y|$ as the distance underlying $W_1$; with this choice, $W_1$ can be computed in closed form efficiently (Bernton et al., 2019).
We use a Monte Carlo approximation of the expectation in (27), i.e.
$$\frac{1}{R}\sum_{r=1}^{R} W_1\Big(\mathbf{C},\ \widetilde{\mathbf{C}}_r(\alpha, \theta, m')\,\frac{m}{m'}\Big) \qquad (28)$$
for $R \ge 1$, where $(\widetilde{\mathbf{C}}_1(\alpha, \theta, m'), \dots, \widetilde{\mathbf{C}}_R(\alpha, \theta, m'))$ are independent and identically distributed as $\widetilde{\mathbf{C}}(\alpha, \theta, m')$. We refer to the work of Bernton et al. (2019) for a theoretical and empirical analysis of the minimum Wasserstein distance method. To improve the Monte Carlo approximation (28), whose variability might be detrimental to the minimization of (27), we fix the same random numbers underlying the routines used for generating random variates from the predictive distribution (14) of the PYP across all values of $(\alpha, \theta)$. Moreover, the optimization is carried out via noise-robust Bayesian optimization (Letham et al., 2019). We report experimental results in Section 5.
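With $\rho(x, y) = |x - y|$, the order-1 Wasserstein distance between two empirical distributions of equal size is just the mean absolute difference of sorted values, so (28) takes only a few lines. The sketch below is ours: sample_pyp_stream is the helper shown earlier, hash_counts is a hypothetical hashing helper in the spirit of Section 2, and fixing the stream seeds across $(\alpha, \theta)$ values implements the common-random-numbers device mentioned above.

```python
import random
import numpy as np

def hash_counts(stream, J, N, seed=0):
    """Hash a token stream into the N x J matrix of counts C."""
    rng = random.Random(seed)
    seeds = [rng.getrandbits(64) for _ in range(N)]
    C = np.zeros((N, J))
    for x in stream:
        for n in range(N):
            C[n, hash((seeds[n], x)) % J] += 1
    return C

def objective(alpha, theta, C_ref, m, m_prime, J, N, R=25, seed=0):
    """Monte Carlo approximation (28) of the expected distance in (27);
    C_ref is the length-N*J vector of observed hashed counts."""
    total = 0.0
    for r in range(R):
        stream = sample_pyp_stream(m_prime, alpha, theta, seed=seed + r)
        C_sim = hash_counts(stream, J, N, seed=seed).ravel() * (m / m_prime)
        # W_1 with rho = |x - y|: mean absolute difference of sorted samples
        total += np.abs(np.sort(C_ref) - np.sort(C_sim)).mean()
    return total / R
```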
Range queries provide a natural multidimensional generalization of point queries; we refer to Chapter 5 of Cormode et al. (2012) for a comprehensive account of range queries and their generalizations. For $m \ge 1$, let $X_m$ be a data stream of tokens taking values in a (possibly infinite) measurable space of symbols $\mathcal{V}$. For any pair of positive integers $(J, N)$, let $h_1, \dots, h_N$, with $h_n : \mathcal{V} \to [J]$, be a collection of hash functions drawn uniformly at random from a pairwise independent perfectly random hash family $\mathcal{H}$. Then, for $s \ge 1$, an $s$-range query is defined as the overall frequency $\bar f_s$ of a finite collection of arbitrary symbols $\{v_1, \dots, v_s\} \subset \mathcal{V}$, i.e.
$$\bar f_s = \sum_{r=1}^{s} f_{v_r},$$
to be estimated based on the hashed frequencies $\{(C_{n,h_n(v_1)}, \dots, C_{n,h_n(v_s)})\}_{n \in [N]}$. A 1-range query corresponds to a point query. In its use of the predictive distribution and the restriction property of the DP, the "indirect" proof-method of Cai et al. (2018) exploits the unidimensional nature of the problem of estimating point queries; in particular, it cannot be used within the more general problem of estimating $s$-range queries. In this section we show that the "direct" proof-method of Section 2 can be readily applied to the estimation of $s$-range queries, for any $s \ge 1$.
We focus on the DP prior, though the same arguments apply to the PYP prior. This leads to an extension of the CMS-DP to the more general problem of estimating range queries.

Following the work of Cai et al. (2018), we assume that tokens in the stream $X_m$ are modeled as random samples from an unknown discrete distribution $P$, which is endowed with a DP prior, i.e. the BNP framework (3). Under this BNP framework, an $s$-range query induces the posterior distribution of the frequencies $(f_{v_1}, \dots, f_{v_s})$ given the hashed frequencies $\{(C_{n,h_n(v_1)}, \dots, C_{n,h_n(v_s)})\}_{n \in [N]}$, for arbitrary $\{v_1, \dots, v_s\} \subset \mathcal{V}$. This posterior distribution, in turn, induces the posterior distribution of the $s$-range query $\bar f_s$ given $\{(C_{n,h_n(v_1)}, \dots, C_{n,h_n(v_s)})\}_{n \in [N]}$. CMS-DP estimates of $\bar f_s$ are obtained as suitable functionals of the posterior distribution of $\bar f_s$ given $\{(C_{n,h_n(v_1)}, \dots, C_{n,h_n(v_s)})\}_{n \in [N]}$. In order to compute the posterior distribution of $(f_{v_1}, \dots, f_{v_s})$ given $\{(C_{n,h_n(v_1)}, \dots, C_{n,h_n(v_s)})\}_{n \in [N]}$, it is natural to consider $s$ additional random samples $(X_{m+1}, \dots, X_{m+s})$. In particular, for any $r = 1, \dots, s$, let $f_{X_{m+r}}$ be the frequency of $X_{m+r}$ in $X_m$, i.e.
$$f_{X_{m+r}} = \sum_{i=1}^m \mathbb{1}_{\{X_i\}}(X_{m+r}),$$
and let $C_{n,h_n(X_{m+r})}$ be the hashed frequency of all the $X_i$'s, $i = 1, \dots, m$, such that $h_n(X_i) = h_n(X_{m+r})$, i.e.
$$C_{n,h_n(X_{m+r})} = \sum_{i=1}^m \mathbb{1}_{\{h_n(X_i)\}}(h_n(X_{m+r})).$$
Now, let $\mathbf{X}_s = (X_{m+1}, \dots, X_{m+s})$, let $\mathbf{f}_{\mathbf{X}_s} = (f_{X_{m+1}}, \dots, f_{X_{m+s}})$ and, for $n \in [N]$, let $\mathbf{C}_{n,h_n(\mathbf{X}_s)} = (C_{n,h_n(X_{m+1})}, \dots, C_{n,h_n(X_{m+s})})$. For each $h_n$ we are interested in computing the posterior distribution
$$\Pr[\mathbf{f}_{\mathbf{X}_s} = \mathbf{l}_s \mid \mathbf{C}_{n,h_n(\mathbf{X}_s)} = \mathbf{c}_n] = \frac{\Pr[\mathbf{f}_{\mathbf{X}_s} = \mathbf{l}_s,\ \mathbf{C}_{n,h_n(\mathbf{X}_s)} = \mathbf{c}_n]}{\Pr[\mathbf{C}_{n,h_n(\mathbf{X}_s)} = \mathbf{c}_n]} \qquad (29)$$
for $\mathbf{l}_s \in \{0, 1, \dots, m\}^s$. For hash functions $h_1, \dots, h_N$ drawn at random from $\mathcal{H}$, the posterior distribution of $\mathbf{f}_{\mathbf{X}_s}$ given $\{\mathbf{C}_{n,h_n(\mathbf{X}_s)}\}_{n \in [N]}$ follows from the posterior distribution (29) by the independence assumption on $\mathcal{H}$ and Bayes theorem. Our "direct" proof-method for the CMS-DP can be readily extended to compute the posterior distribution (29); see Appendix E. As an example, in the next theorem we consider 2-range queries. See Appendix F.

Theorem 3.
Let $h_n$ be a hash function drawn at random from a pairwise independent perfectly random family $\mathcal{H}$. For $m \ge 1$, let $X_m$ be a random sample of tokens from $P \sim \mathrm{DP}(\theta, \nu)$ and let $(X_{m+1}, X_{m+2})$ be a pair of additional random samples. Let $C_{n,h_n(X_{m+1})}$ and $C_{n,h_n(X_{m+2})}$ be the hashed frequencies induced from $X_m$ through $h_n$, i.e. $C_{n,h_n(X_{m+1})} = \sum_{1 \le i \le m}\mathbb{1}_{\{h_n(X_i)\}}(h_n(X_{m+1}))$ and $C_{n,h_n(X_{m+2})} = \sum_{1 \le i \le m}\mathbb{1}_{\{h_n(X_i)\}}(h_n(X_{m+2}))$. Then, we write
$$\Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2 \mid C_{n,h_n(X_{m+1})} = c_{n,1}, C_{n,h_n(X_{m+2})} = c_{n,2}] = \frac{\mathrm{Num}(l_1, l_2, c_{n,1}, c_{n,2})}{\mathrm{Den}(c_{n,1}, c_{n,2})}, \qquad l_1, l_2 \ge 0,$$
with:

i) $\displaystyle \mathrm{Den}(c_{n,1}, c_{n,2}) = J\,\mathbb{1}\{c_{n,1} = c_{n,2} =: c\}\,\frac{(\tfrac{\theta}{J})_{(c+2)}\,(\theta - \tfrac{\theta}{J})_{(m-c)}}{c!\,(m-c)!} + J(J-1)\,\frac{(\tfrac{\theta}{J})_{(c_{n,1}+1)}\,(\tfrac{\theta}{J})_{(c_{n,2}+1)}\,(\theta - \tfrac{2\theta}{J})_{(m-c_{n,1}-c_{n,2})}}{c_{n,1}!\,c_{n,2}!\,(m-c_{n,1}-c_{n,2})!}$;

ii) $\displaystyle \mathrm{Num}(l_1, l_2, c_{n,1}, c_{n,2}) = \mathbb{1}\{l_1 = l_2 =: l,\ c_{n,1} = c_{n,2} =: c\}\,\theta(l+1)\,\frac{(\tfrac{\theta}{J})_{(c-l)}\,(\theta - \tfrac{\theta}{J})_{(m-c)}}{(c-l)!\,(m-c)!} + \mathbb{1}\{c_{n,1} = c_{n,2} =: c\}\,\frac{\theta^2}{J}\,\frac{(\tfrac{\theta}{J})_{(c-l_1-l_2)}\,(\theta - \tfrac{\theta}{J})_{(m-c)}}{(c-l_1-l_2)!\,(m-c)!} + \frac{J-1}{J}\,\theta^2\,\frac{(\tfrac{\theta}{J})_{(c_{n,1}-l_1)}\,(\tfrac{\theta}{J})_{(c_{n,2}-l_2)}\,(\theta - \tfrac{2\theta}{J})_{(m-c_{n,1}-c_{n,2})}}{(c_{n,1}-l_1)!\,(c_{n,2}-l_2)!\,(m-c_{n,1}-c_{n,2})!}$.

Theorem 3 extends Theorem 1 to the more general problem of 2-range queries. For a collection of hash functions $h_1, \dots, h_N$ drawn at random from $\mathcal{H}$, the posterior distribution of $(f_{X_{m+1}}, f_{X_{m+2}})$ given $\{(C_{n,h_n(X_{m+1})}, C_{n,h_n(X_{m+2})})\}_{n \in [N]}$ follows from Theorem 3 by the independence assumption on $\mathcal{H}$ and Bayes theorem, i.e.
$$\Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2 \mid \{(C_{n,h_n(X_{m+1})}, C_{n,h_n(X_{m+2})})\}_{n \in [N]} = \{(c_{n,1}, c_{n,2})\}_{n \in [N]}] \propto \prod_{n \in [N]}\Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2 \mid C_{n,h_n(X_{m+1})} = c_{n,1}, C_{n,h_n(X_{m+2})} = c_{n,2}] \qquad (30)$$
for $l_1, l_2 \ge 0$.
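A direct transcription of Theorem 3 (as stated above) into code; since the denominator is constant in $(l_1, l_2)$, normalizing the numerator over the support suffices. All names are ours, and the block should be read as a sketch under the stated formulas, assuming $J > 2$ so that $\theta - 2\theta/J > 0$.

```python
import math

def log_rising(a, n):
    return math.lgamma(a + n) - math.lgamma(a)

def range2_posterior(c1, c2, m, theta, J):
    """Posterior of (f_{X_{m+1}}, f_{X_{m+2}}) given hashed frequencies
    (c1, c2) for a single hash function, under a DP(theta, nu) prior;
    returns ({(l1, l2): prob}, posterior mean of the 2-range query l1 + l2)."""
    t = theta / J

    def prod(log_terms):  # product of rising factorials / factorials, via logs
        return math.exp(sum(log_terms))

    num = {}
    for l1 in range(c1 + 1):
        for l2 in range(c2 + 1):
            val = 0.0
            if c1 == c2 and l1 == l2:  # case X_{m+1} = X_{m+2}, same bucket
                val += theta * (l1 + 1) * prod([
                    log_rising(t, c1 - l1), log_rising(theta - t, m - c1),
                    -math.lgamma(c1 - l1 + 1), -math.lgamma(m - c1 + 1)])
            if c1 == c2 and c1 - l1 - l2 >= 0:  # distinct symbols, same bucket
                val += theta ** 2 / J * prod([
                    log_rising(t, c1 - l1 - l2), log_rising(theta - t, m - c1),
                    -math.lgamma(c1 - l1 - l2 + 1), -math.lgamma(m - c1 + 1)])
            if m - c1 - c2 >= 0:  # distinct symbols, distinct buckets
                val += (J - 1) / J * theta ** 2 * prod([
                    log_rising(t, c1 - l1), log_rising(t, c2 - l2),
                    log_rising(theta - 2 * t, m - c1 - c2),
                    -math.lgamma(c1 - l1 + 1), -math.lgamma(c2 - l2 + 1),
                    -math.lgamma(m - c1 - c2 + 1)])
            num[(l1, l2)] = val
    z = sum(num.values())
    post = {k: v / z for k, v in num.items()}
    mean = sum((l1 + l2) * p for (l1, l2), p in post.items())
    return post, mean
```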
See Appendix G. The distribution (30) leads to the posterior distribution of $(f_{v_1}, f_{v_2})$ given $\{(C_{n,h_n(v_1)}, C_{n,h_n(v_2)})\}_{n \in [N]}$, for arbitrary $v_1, v_2 \in \mathcal{V}$. CMS-DP estimates of the 2-range query $\bar f_2 = f_{v_1} + f_{v_2}$ are then obtained as functionals of the posterior distribution of $\bar f_2$. To conclude, it remains to estimate the prior's parameter $\theta > 0$; this can be done via the empirical Bayes approach of Section 2.

We present numerical experiments for the CMS-PYP introduced in Section 3. First, we consider the problem of estimating the prior's parameter $(\alpha, \theta)$ by means of the likelihood-free approach of Section 3. Then, we apply the CMS-PYP to synthetic and real data, and we compare its performance with the CMS of Cormode and Muthukrishnan (2005), the CMS-DP of Cai et al. (2018), and the count-mean-min (CMM) of Goyal et al. (2012).

We present an empirical study of the likelihood-free estimation approach detailed in Section 3. We start with a scenario where the data generating process is (17) (PYP-DGP). In particular, we generate 10 synthetic datasets of $m = 300000$ tokens each, for different values of the prior's parameter $(\alpha, \theta)$; see Table 5.1 for the values of $(\alpha, \theta)$. For each dataset, the estimation of the prior's parameter $(\alpha, \theta)$ is performed by means of (27) and (28) with $m' = 100000$ and $R = 25$. The optimization procedure is based on Letham et al. (2019), as implemented by the AX library; see https://ax.dev/ for details. The stochastic objective function (28) is evaluated a total of 50 times for each dataset. The results in Table 5.1 support the proposed inference procedure. It is also apparent that, for the datasets under consideration, the parameter $\alpha$ is more easily identified than the parameter $\theta$.

Table 5.1: Prior's parameter $(\alpha, \theta)$ estimates, under PYP-DGP.

We also consider synthetic datasets generated from Zipf's distributions with (exponent) parameter $c > 1$ ($Z_c$-DGP). We recall that the parameter $c$ controls the tail behaviour of the Zipf's distribution: the smaller $c$,
To ensure numerical stability in float64, we work in log-space as much as possible: we compute the (natural) logarithm of each multiplicative term of (26), and we exponentiate back only as the final step. For the computation of integrals we use double exponential quadrature (Takahasi and Mori, 1974), which approximates $\int_{-1}^{+1} f(y)\,\mathrm{d}y$ with $\sum_{1\le j\le m} w_j f(y_j)$ for appropriate weights $w_j \in W$ and coordinates $y_j \in Y$. Integrals of the form $\int_a^b f(y)\,\mathrm{d}y$ for $-\infty \le a \le b \le +\infty$ are handled via change-of-variable formulas. In order to avoid underflow/overflow issues it is necessary to apply the "log-sum-exp" trick to integrals. That is, for $f(y) > 0$,
\[
\log\Big(\int \exp\{l(y)\}\,\mathrm{d}y\Big) = l^* + \log\Big(\int \exp\{l(y)-l^*\}\,\mathrm{d}y\Big),
\]
where $l(y) = \log f(y)$ and $l^* = \max_{y\in Y} l(y)$. Nested integrals are handled by a nested application of the "log-sum-exp" trick. Particular care is required for the $\alpha$-stable density function $g_\alpha(\cdot)$, which lacks a closed form expression. We used the integral representation of Saa and Venegeroles (2011), which allows for an accurate computation of $\log g_\alpha(\cdot)$; Saa and Venegeroles (2011) empirically verify the accuracy of the proposed approximation using an adaptive Gaussian quadrature method. In our own testing, we observed a negligible maximum absolute error for $\log g_\alpha(\cdot)$ at $\alpha = 1/2$, the case in which $g_\alpha(\cdot)$ is available in closed form. The values of $\log g_\alpha(\cdot)$ over the quadrature coordinates are pre-computed for a given $\alpha$.

We compare the CMS-PYP estimator $\hat f^{(\mathrm{PYP})}_v$ with respect to: i) the CMS estimator $\hat f^{(\mathrm{CMS})}_v$ of Cormode and Muthukrishnan (2005), namely the minimum hashed frequency with respect to the $N$ hash functions; ii) the CMS-DP estimator $\hat f^{(\mathrm{DP})}_v$ of Cai et al. (2018), corresponding to the mean of the posterior distribution (5). We also consider the CMM estimator $\hat f^{(\mathrm{CMM})}_v$ of Goyal et al. (2012). The CMM relies on the same summary statistics used in the CMS, the CMS-DP and the CMS-PYP, i.e. the hashed frequencies $\{C_n\}_{n\in[N]}$. This facilitates a fair comparison among estimators, since the storage requirements and the sketch update complexity are unchanged. Goyal et al. (2012) show that the CMM estimator stands out in the estimation of low-frequency tokens (Figure 1 of Goyal et al. (2012)), a desirable feature in the context of natural language processing, where power-law data streams of tokens are common. Hereafter, we compare $\hat f^{(\mathrm{PYP})}_v$, $\hat f^{(\mathrm{DP})}_v$, $\hat f^{(\mathrm{CMS})}_v$ and $\hat f^{(\mathrm{CMM})}_v$ in terms of the MAE (mean absolute error) between true frequencies and their estimates. The comparison among $\hat f^{(\mathrm{PYP})}_v$, $\hat f^{(\mathrm{CMS})}_v$ and $\hat f^{(\mathrm{CMM})}_v$ on synthetic data is reported in Appendix H, and so is the comparison between $\hat f^{(\mathrm{PYP})}_v$ and $\hat f^{(\mathrm{CMS})}_v$ on real data.
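The log-space integration strategy described above can be made concrete with a short self-contained sketch: double exponential (tanh-sinh) quadrature on $(-1, 1)$ combined with the log-sum-exp trick. The node and weight formulas are the standard tanh-sinh ones; the example integrand is ours.

```python
import numpy as np
from scipy.special import logsumexp

def tanh_sinh_lognodes(n=160, h=0.05):
    """Tanh-sinh nodes y_j on (-1, 1) and log-weights log(w_j)."""
    t = h * np.arange(-n, n + 1)
    u = 0.5 * np.pi * np.sinh(t)
    y = np.tanh(u)
    # log w_j = log(h * pi/2 * cosh(t) / cosh(u)^2), with log cosh(u)
    # computed stably to avoid overflow for large |u|
    logcosh_u = np.abs(u) + np.log1p(np.exp(-2.0 * np.abs(u))) - np.log(2.0)
    log_w = np.log(0.5 * np.pi * h) + np.log(np.cosh(t)) - 2.0 * logcosh_u
    return y, log_w

def log_integral(log_f, n=160, h=0.05):
    """log of the integral of f over (-1, 1), never leaving log-space."""
    y, log_w = tanh_sinh_lognodes(n, h)
    return logsumexp(log_f(y) + log_w)

# f(y) = exp(a*y) overflows float64 for a = 2000, but its log-integral
# log((e^a - e^{-a})/a) = a + log1p(-e^{-2a}) - log(a) is still recovered.
a = 2000.0
print(log_integral(lambda y: a * y))
print(a + np.log1p(-np.exp(-2 * a)) - np.log(a))  # the two values agree closely
```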
With regards to synthetic data, we consider datasets generated from Zipf's distributions for five values of the exponent $c$. Each dataset consists of $m = 500000$ tokens. We make use of a 2-universal hash family, with the following pairs of hashing parameters: i) $J = 320$ and $N = 2$; ii) $J = 160$ and $N = 4$. Table 3 and Table 4 report the MAE of the estimators $\hat f^{(\mathrm{DP})}_v$ and $\hat f^{(\mathrm{PYP})}_v$. From Table 3 and Table 4, it is clear that $\hat f^{(\mathrm{PYP})}_v$ performs remarkably better than $\hat f^{(\mathrm{DP})}_v$ in the estimation of low-frequency tokens. In particular, in both tables, on the bins of low frequencies the MAE of $\hat f^{(\mathrm{PYP})}_v$ is always smaller than the MAE of $\hat f^{(\mathrm{DP})}_v$, i.e. $\hat f^{(\mathrm{PYP})}_v$ outperforms $\hat f^{(\mathrm{DP})}_v$. This behaviour becomes more and more evident as the parameter $c$ decreases; that is, the heavier the tail of the distribution, the more the estimator $\hat f^{(\mathrm{PYP})}_v$ outperforms the estimator $\hat f^{(\mathrm{DP})}_v$. For a fixed exponent $c$, the gap between the MAEs of $\hat f^{(\mathrm{PYP})}_v$ and $\hat f^{(\mathrm{DP})}_v$ reduces as $v$ increases, and this reduction is much more evident as $c$ becomes large. For any exponent $c$ we expect a frequency threshold, say $v^*(c)$, such that $\hat f^{(\mathrm{PYP})}_v$ underestimates $f_v$ for $v > v^*(c)$; from Table 3 and Table 4, for any two exponents $c_1$ and $c_2$ such that $c_1 < c_2$ it will be $v^*(c_1) > v^*(c_2)$. A comparison among $\hat f^{(\mathrm{PYP})}_v$, $\hat f^{(\mathrm{CMS})}_v$ and $\hat f^{(\mathrm{CMM})}_v$ is reported in Appendix H. This comparison reveals that the CMS-PYP outperforms the CMS in the estimation of low-frequency tokens for both choices of hashing parameters, whereas the CMS-PYP outperforms the CMM in the estimation of low-frequency tokens for $J = 160$ and $N = 4$.

We conclude by presenting an application of the CMS-PYP to textual datasets, for which the distribution of words is typically a power-law distribution; see Clauset et al. (2009), and references therein, for a detailed account of power-law distributions in applications. Here we consider the 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/) and the Enron dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/). The 20 Newsgroups dataset consists of $m = 2765300$ tokens with 53975 distinct tokens, whereas the Enron dataset consists of $m = 6412175$ tokens with 28102 distinct tokens. Following the experiments in Cai et al. (2018), we make use of a 2-universal hash family, with the following hashing parameters: i) $J = 12000$ and $N = 2$; ii) $J = 8000$ and $N = 4$. By means of the goodness-of-fit test proposed in Clauset et al. (2009), we found that the 20 Newsgroups and Enron datasets fit power-law distributions with exponents $\nu$ slightly above 2, with $\nu = 2.1$ for the Enron dataset.
Table 5 reports the MAEs of the estimators $\hat f^{(\mathrm{DP})}_v$ and $\hat f^{(\mathrm{PYP})}_v$ applied to the 20 Newsgroups dataset and to the Enron dataset. The results in Table 5 confirm the behaviour observed on the Zipf synthetic data: $\hat f^{(\mathrm{PYP})}_v$ outperforms $\hat f^{(\mathrm{DP})}_v$ for low-frequency tokens. Table 5 also contains a comparison with respect to $\hat f^{(\mathrm{CMM})}_v$, revealing that $\hat f^{(\mathrm{PYP})}_v$ is competitive with $\hat f^{(\mathrm{CMM})}_v$ in the estimation of low-frequency tokens for both choices of the hashing parameters.

Bins v     |    Z_{c_1}         |    Z_{c_2}        |     Z_{c_3}          |    Z_{c_4}        |   Z_{c_5}
           |   DP        PYP    |   DP       PYP    |   DP         PYP     |   DP       PYP    |   DP     PYP
(0,1]      | 1,057.61    0.86   | 626.85     3.10   | 306.70       92.35   |  51.38     3.40   | 32.43   30.69
(1,2]      | 1,194.67    1.71   | 512.43     1.85   | 153.57       29.62   | 288.27    88.25   | 47.84   45.35
(2,4]      | 1,105.16    3.22   | 472.59     1.52   | 2,406.00  1,155.12   | 133.31    16.47   | 53.97   50.85
(4,8]      | 1,272.02    6.14   | 783.88     7.34   | 457.57      115.39   | 117.76     8.15   | 69.47   66.67
(8,16]     | 1,231.63   11.68   | 716.52     8.73   | 377.99       92.64   | 411.21   129.23   | 80.43   77.75
(16,32]    | 1,252.18   24.13   | 829.17    14.66   | 286.98       59.95   | 501.00   160.75   |  9.61    7.55
(32,64]    | 1,309.14   42.80   | 780.70    34.63   | 413.95      155.87   | 216.84    94.49   |  9.89    6.79
(64,128]   | 1,716.76   94.68   | 946.20    78.05   | 1,869.23  1,557.88   |  63.05    81.62   | 13.38   11.17
(128,256]  | 1,102.96  178.55   | 1,720.49  347.06  | 199.87       99.75   |  45.98   134.87   | 17.03   14.37

Table 3: Synthetic data: MAE for $\hat f^{(\mathrm{PYP})}_v$ and $\hat f^{(\mathrm{DP})}_v$, case $J = 320$, $N = 2$. Column groups $Z_{c_1}, \dots, Z_{c_5}$ correspond to the five Zipf datasets; within each group, DP $= \hat f^{(\mathrm{DP})}_v$ and PYP $= \hat f^{(\mathrm{PYP})}_v$.
Bins v     |    Z_{c_1}         |    Z_{c_2}        |   Z_{c_3}        |   Z_{c_4}         |   Z_{c_5}
           |   DP        PYP    |   DP        PYP   |  DP       PYP    |  DP        PYP    |  DP      PYP
(0,1]      | 2,206.09    0.60   | 1,254.85    1.00  | 420.76    0.96   | 153.20    29.34   | 56.08   15.07
(1,2]      | 2,333.06    1.00   | 1,326.71    2.00  | 549.12    1.95   | 180.71    21.77   | 47.48    5.49
(2,4]      | 2,266.35    1.70   | 1,267.97    3.60  | 482.45    3.36   | 182.18    16.62   | 56.87    6.80
(4,8]      | 2,229.22    3.90   | 1,371.27    6.00  | 538.91    6.16   | 250.32    40.26   | 50.30    4.66
(8,16]     | 2,207.42    7.70   | 1,159.29   11.90  | 487.69   11.66   | 245.09   102.23   | 23.70    6.04
(16,32]    | 2,279.80   12.60   | 1,211.41   20.60  | 529.77   22.95   | 293.68    55.98   | 24.41   15.09
(32,64]    | 2,301.99   34.10   | 1,280.17   44.80  | 632.45   43.65   | 118.26    32.92   | 30.95   24.33
(64,128]   | 2,241.57   73.20   | 1,112.41   95.60  | 419.42   96.05   | 177.61    66.87   | 28.78   24.68
(128,256]  | 2,235.40  130.60   | 1,133.85  175.10  | 522.21  186.95   | 128.09    93.83   | 31.46   38.13

Table 4: Synthetic data: MAE for $\hat f^{(\mathrm{PYP})}_v$ and $\hat f^{(\mathrm{DP})}_v$, case $J = 160$, $N = 4$. Column groups $Z_{c_1}, \dots, Z_{c_5}$ correspond to the five Zipf datasets; within each group, DP $= \hat f^{(\mathrm{DP})}_v$ and PYP $= \hat f^{(\mathrm{PYP})}_v$.
           |              J = 12000, N = 2                 |               J = 8000, N = 4
           |    20 Newsgroups      |        Enron          |    20 Newsgroups      |        Enron
Bins v     |  DP     CMM     PYP   |  DP      CMM     PYP  |  DP     CMM     PYP   |  DP      CMM     PYP
(0,1]      | 46.39    5.41   1.16  |  12.20    0.90   0.99 | 53.39    4.50    1.00 |  70.98  51.00    1.00
(1,2]      | 16.60    2.16   1.96  |  13.80    2.00   1.99 | 30.49    2.00    2.00 |  47.38  27.20    2.00
(2,4]      | 38.40    7.91   2.93  |  61.49    9.90   3.76 | 32.49    4.80    3.30 |  52.49   3.90    3.90
(4,8]      | 59.39   35.70   6.10  |  88.39   17.32   7.50 | 38.69    6.23    6.70 |  53.08  10.50    6.80
(8,16]     | 54.29   45.40  11.65  |  23.40    9.52  11.97 | 25.29   13.50   12.60 |  56.98  22.20   11.90
(16,32]    | 17.80   20.99  21.14  |  55.09   21.00  20.78 | 24.99   21.60   21.90 |  89.98  20.60   20.60
(32,64]    | 40.79   58.86  45.85  | 128.48  134.47  43.78 | 39.69   39.22   43.70 | 108.37  61.38   47.80
(64,128]   | 25.99   91.59  88.47  | 131.08  110.27  81.38 | 22.09   86.32   92.19 |  55.67  66.50   88.10
(128,256]  | 13.59  186.92 170.27  |  50.68  140.43 171.99 | 25.79  183.96  207.58 |  80.76  90.20  179.30

Table 5: Real data: MAE for $\hat f^{(\mathrm{PYP})}_v$, $\hat f^{(\mathrm{DP})}_v$ and $\hat f^{(\mathrm{CMM})}_v$. Within each column group, DP $= \hat f^{(\mathrm{DP})}_v$, CMM $= \hat f^{(\mathrm{CMM})}_v$ and PYP $= \hat f^{(\mathrm{PYP})}_v$.
Discussion
This paper contributes to the CMS-DP of Cai et al. (2018), which is a learning-augmented CMS via BNP modeling of the data stream of tokens. While Cai et al. (2018) showed that BNP is a powerful tool for developing robust learning-augmented CMSs, the ideas and methods behind the CMS-DP are tailored to point queries under DP priors, and they cannot be used for other priors or more general queries. In this paper, we presented an alternative, and more flexible, derivation of the CMS-DP such that: i) it allows the use of the Pitman-Yor process (PYP) prior, which is arguably the most popular generalization of the DP prior; ii) it can be readily applied to the more general problem of estimating range queries. Our alternative derivation of the CMS-DP led to the following main contributions: i) the development of the CMS-PYP, which is a novel learning-augmented CMS under power-law data streams; ii) the extension of the BNP approach of Cai et al. (2018) to the more general problem of estimating $s$-range queries, for $s \ge 1$.
Applications to synthetic and real data have shown that the CMS-PYP outperforms the CMS and the CMS-DP in the estimation of low-frequency tokens. The weakness of the CMS in the estimation of low-frequency tokens is well known, especially in the context of natural language processing (Goyal et al., 2012; Goyal et al., 2009; Pitel and Fouquier, 2015), and the CMS-PYP is a new proposal to compensate for this weakness. In particular, the CMS-PYP proves competitive with the CMM of Goyal et al. (2012) in the estimation of low-frequency tokens.

Our study paves the way to many fruitful directions for future work. On the theoretical side, it would be of interest to investigate properties of the posterior distribution in Theorem 2. Nothing is known about theoretical properties of the BNP approach to the CMS, from either an asymptotic or a non-asymptotic point of view. For $\alpha = 0$, Cai et al. (2018) showed that the posterior mode recovers the CMS estimate of Cormode and Muthukrishnan (2005), while other CMS-DP estimates, e.g. the posterior mean and median, may be viewed as CMS estimates with shrinkage. It is natural to ask whether there exists a similar interplay between the posterior distribution in Theorem 2 and variations of the CMS for power-law data streams, e.g. the CMM. Consistency properties, in the Bayesian sense, of the posterior distribution are also of interest. Another direction of interest consists in using the posterior estimates of point queries and the linear sketch properties of the CMS for large-scale streaming algorithms, e.g. for large text or streaming graph applications. Lastly, one may be interested in extending the CMS-PYP to accommodate different update operations, such as the conservative update, as well as different types of queries, such as range and inner product queries.

Acknowledgement
Emanuele Dolera and Stefano Favaro received funding from the European Research Council (ERC) underthe European Union’s Horizon 2020 research and innovation programme under grant agreement No 817257.Emanuele Dolera and Stefano Favaro gratefully acknowledge the financial support from the Italian Ministryof Education, University and Research (MIUR), “Dipartimenti di Eccellenza” grant 2018-2022.
References
Aamand, A., Indyk, P. and Vakilian, A. (2019). Frequency estimation algorithms under Zipfian distribution. Preprint arXiv:1908.05198.
Aggarwal, C. and Yu, P. (2010). On classification of high-cardinality data streams. In Proceedings of the 2010 SIAM International Conference on Data Mining.
Bacallado, S., Battiston, M., Favaro, S. and Trippa, L. (2017). Sufficientness postulates for Gibbs-type priors and hierarchical generalizations. Statistical Science, 487-500.
Bernton, E., Jacob, P.E., Gerber, M. and Robert, C.P. (2019). On parameter estimation with the Wasserstein distance. Information and Inference, 657-676.
Cai, D., Mitzenmacher, M. and Adams, R.P. (2018). A Bayesian nonparametric view on count-min sketch. In Advances in Neural Information Processing Systems.
Charalambides, C.A. (2005). Combinatorial methods in discrete distributions. Wiley.
Chung, K., Mitzenmacher, M. and Vadhan, S.P. (2013). Why simple hash functions work: exploiting the entropy in a data stream. Theory of Computing, 897-945.
Clauset, A., Shalizi, C.R. and Newman, M.E.J. (2009). Power-law distributions in empirical data. SIAM Review, 661-703.
Cormode, G., Garofalakis, M. and Haas, P.J. (2012). Synopses for massive data: samples, histograms, wavelets, sketches. Foundations and Trends in Databases.
Cormode, G. and Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 58-75.
De Blasi, P., Favaro, S., Lijoi, A., Mena, R.H., Prünster, I. and Ruggiero, M. (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 212-229.
Dolera, E. and Favaro, S. (2020). A Berry-Esseen theorem for Pitman's α-diversity. Annals of Applied Probability, 847-869.
Dolera, E. and Favaro, S. (2020). Rates of convergence in de Finetti's representation theorem, and Hausdorff moment problem. Bernoulli, 1294-1322.
Dwork, C., Naor, M., Pitassi, T., Rothblum, G. and Yekhanin, S. (2010). Pan-private streaming algorithms. In Proceedings of the Symposium on Innovations in Computer Science.
Efron, B. and Morris, C. (1973). Stein's estimation rule and its competitors - an empirical Bayes approach. Journal of the American Statistical Association, 117-130.
Ewens, W. (1972). The sampling theory of selectively neutral alleles. Theoretical Population Biology, 87-112.
Favaro, S., Nipoti, B. and Teh, Y.W. (2015). Random variate generation for Laguerre-type exponentially tilted alpha-stable distributions. Electronic Journal of Statistics, 1230-1242.
Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 209-230.
Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.
Goyal, A., Daumé, H. and Cormode, G. (2012). Sketch algorithms for estimating point queries in NLP. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
Goyal, A., Daumé, H. and Venkatasubramanian, S. (2009). Streaming for large scale NLP: language modeling. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.
Gradshteyn, I.S. and Ryzhik, I.M. (2007). Table of integrals, series, and products. Academic Press.
Harrison, B.A. (2010). Move prediction in the game of Go. Ph.D. Thesis, Harvard University.
Hsu, C., Indyk, P., Katabi, D. and Vakilian, A. (2019). Learning-based frequency estimation algorithms. In Proceedings of the International Conference on Learning Representations.
James, L.F. (2002). Poisson process partition calculus with applications to exchangeable models and Bayesian nonparametrics. Preprint arXiv:math/0205093.
James, L.F., Lijoi, A. and Prünster, I. (2009). Posterior analysis for normalized random measures with independent increments. Scandinavian Journal of Statistics, 76-97.
Kingman, J.F.C. (1993). Poisson processes. Oxford University Press.
Leo Elworth, L.A., Wang, Q., Kota, P.K., Barberan, C.J., Coleman, B., Balaji, A., Gupta, G., Baraniuk, R.G., Shrivastava, A. and Treangen, T.J. (2020). To petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics. Nucleic Acids Research.
Letham, B., Karrer, B., Ottoni, G. and Bakshy, E. (2019). Constrained Bayesian optimization with noisy experiments. Bayesian Analysis, 495-519.
Perman, M., Pitman, J. and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 21-39.
Pitel, G. and Fouquier, G. (2015). Count-min-log sketch: approximately counting with approximate counters. In Proceedings of the 1st International Symposium on Web Algorithms.
Pitman, J. (1995). Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 145-158.
Pitman, J. (2003). Poisson-Kingman partitions. In Science and Statistics: A Festschrift for Terry Speed, Goldstein, D.R. Eds. Institute of Mathematical Statistics.
Pitman, J. (2006). Combinatorial stochastic processes. Lecture Notes in Mathematics, Springer Verlag.
Pitman, J. and Yor, M. (1997). The two parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 855-900.
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (2007). Numerical recipes 3rd edition: The art of scientific computing. Cambridge University Press.
Prünster, I. (2002). Random probability measures derived from increasing additive processes and their application to Bayesian statistics. Ph.D. Thesis, University of Pavia.
Regazzini, E. (1978). Intorno ad alcune questioni relative alla definizione del premio secondo la teoria della credibilità. Giornale dell'Istituto Italiano degli Attuari, 77-89.
Regazzini, E. (2001). Foundations of Bayesian statistics and some theory of Bayesian nonparametric methods. Lecture Notes, Stanford University.
Regazzini, E., Lijoi, A. and Prünster, I. (2003). Distributional results for means of normalized random measures with independent increments. The Annals of Statistics, 560-585.
Saa, A. and Venegeroles, R. (2011). Alternative numerical computation of one-sided Lévy and Mittag-Leffler distributions. Physical Review E.
Sangalli, L.M. (2006). Some developments of the normalized random measures with independent increments. Sankhyā A, 461-487.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 639-650.
Song, H.H., Cho, T.W., Dave, V., Zhang, Y. and Qiu, L. (2009). Scalable proximity estimation and link prediction in online social networks. In Proceedings of the ACM SIGCOMM Conference on Internet Measurement.
Takahasi, H. and Mori, M. (1974). Double exponential formulas for numerical integration. Publications of the Research Institute for Mathematical Sciences, 721-741.
Zabell, S.L. (1997). The continuum of inductive methods revisited. In The Cosmos of Science: Essays in Exploration, Earman, J. and Norton, J.D. Eds. University of Pittsburgh Press.
Zhang, Q., Pell, J., Canino-Koning, R., Howe, A.C. and Brown, C.T. (2014). These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS ONE.
Zolotarev, V.M. (1986). One dimensional stable distributions. American Mathematical Society.
Appendix A The CMS
For any $m \ge 1$, let $X_m = (X_1, \dots, X_m)$ be a data stream of tokens taking values in a measurable space of symbols $\mathbb{V}$. A point query over $X_m$ asks for the estimation of the frequency $f_v$ of a token of type $v \in \mathbb{V}$ in $X_m$, i.e. $f_v = \sum_{1\le i\le m}\mathbf{1}_{\{X_i\}}(v)$. The goal of the CMS of Cormode and Muthukrishnan (2005) is to estimate $f_v$ based on a compressed representation of $X_m$ by random hashing. Specifically, for any pair of positive integers $(J, N)$, with $[J] = \{1, \dots, J\}$ and $[N] = \{1, \dots, N\}$, let $h_1, \dots, h_N$, with $h_n: \mathbb{V} \to [J]$, be a collection of hash functions drawn uniformly at random from a pairwise independent hash family $\mathcal{H}$. That is, a random hash function $h \in \mathcal{H}$ has the property that, for all $v_1, v_2 \in \mathbb{V}$ such that $v_1 \ne v_2$ and all $j_1, j_2 \in [J]$,
\[
\Pr[h(v_1) = j_1,\ h(v_2) = j_2] = \frac{1}{J^2}.
\]
Hashing $X_m$ through $h_1, \dots, h_N$ creates $N$ vectors of $J$ buckets $\{(C_{n,1}, \dots, C_{n,J})\}_{n\in[N]}$, with $C_{n,j}$ obtained by aggregating the frequencies of all $x$ with $h_n(x) = j$. Every $C_{n,j}$ is initialized at zero, and whenever a new token $X_i$ is observed we set $C_{n,h_n(X_i)} \leftarrow C_{n,h_n(X_i)} + 1$ for every $n \in [N]$. After $m$ tokens, $C_{n,j} = \sum_{1\le i\le m}\mathbf{1}_{\{h_n(X_i)\}}(j)$ and $f_v \le C_{n,h_n(v)}$ for any $v \in \mathbb{V}$. Under this setting, the CMS estimate of $f_v$ is the smallest hashed frequency among $\{C_{n,h_n(v)}\}_{n\in[N]}$, i.e.
\[
\hat f^{(\mathrm{CMS})}_v = \min_{n\in[N]} C_{n,h_n(v)}.
\]
That is, $\hat f^{(\mathrm{CMS})}_v$ returns the count associated with the fewest collisions; this provides an upper bound on the true count. For an arbitrary data stream with $m$ tokens, the CMS satisfies the following guarantee: if $J = \lceil e/\varepsilon\rceil$ and $N = \lceil\log(1/\delta)\rceil$, with $\varepsilon > 0$ and $\delta > 0$,
then $\hat f^{(\mathrm{CMS})}_v$ satisfies $\hat f^{(\mathrm{CMS})}_v \ge f_v$ and, with probability at least $1-\delta$, $\hat f^{(\mathrm{CMS})}_v \le f_v + \varepsilon m$. See Cormode and Muthukrishnan (2005) for a detailed account of the CMS.
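As a companion to this description, the following is a minimal sketch of the CMS. The hash family $h(x) = ((ax + b) \bmod p) \bmod J$ over integer-encoded tokens is a standard pairwise independent construction (up to the slight non-uniformity introduced by the final $\bmod\,J$); it is not necessarily the family used in the experiments.

```python
import random

class CountMinSketch:
    """Minimal CMS with hashes h(x) = ((a*x + b) mod p) mod J
    over integer-encoded tokens."""

    P = (1 << 61) - 1  # a Mersenne prime larger than any token id

    def __init__(self, J, N, seed=0):
        rng = random.Random(seed)
        self.J, self.N = J, N
        self.ab = [(rng.randrange(1, self.P), rng.randrange(self.P))
                   for _ in range(N)]
        self.C = [[0] * J for _ in range(N)]   # N rows of J buckets

    def _h(self, n, x):
        a, b = self.ab[n]
        return ((a * x + b) % self.P) % self.J

    def update(self, x):                       # one pass over the stream
        for n in range(self.N):
            self.C[n][self._h(n, x)] += 1

    def query(self, v):                        # CMS point-query estimate
        return min(self.C[n][self._h(n, v)] for n in range(self.N))

cms = CountMinSketch(J=320, N=2)
for token in [1, 2, 2, 3, 3, 3]:
    cms.update(token)
print(cms.query(3))   # >= 3, with the usual (epsilon, delta) guarantee
```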
Appendix B Recovering Theorem 1 from Theorem 2

We show how Theorem 2 reduces to Theorem 1 by setting $\alpha = 0$. First, we recall two identities involving the generalized factorial coefficient $\mathscr{C}(m,k;\alpha)$ and the signless Stirling number of the first kind $|s(m,k)|$. See Chapter 2 of Charalambides (2005) for details. In particular, it holds that
\[
\sum_{k=0}^{m} a^{k}\,|s(m,k)| = (a)_{(m)} \tag{31}
\]
for $a > 0$, and
\[
\lim_{\alpha\to 0}\frac{\mathscr{C}(m,k;\alpha)}{\alpha^{k}} = |s(m,k)|. \tag{32}
\]
Hereafter, we apply the identities (31) and (32) to show that Theorem 2 reduces to Theorem 1 when $\alpha = 0$. In this respect, we rewrite the posterior distribution (18) as follows:
\[
\Pr[f_{X_{m+1}}=l \mid C_{n,h_n(X_{m+1})}=c_n]
= \frac{\theta}{J}\binom{c_n}{l}\frac{(\theta+\alpha)_{(m-l)}}{(\theta)_{(m+1)}}(1-\alpha)_{(l)}\,
\frac{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\,G_{K_{m-l-i}}(J^{-1};\alpha,\theta+\alpha)}{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\,G_{K_{m-i+1}}(J^{-1};\alpha,\theta)}
\]
[by the distribution (13)]
\[
= \frac{\theta}{J}\binom{c_n}{l}\frac{(\theta+\alpha)_{(m-l)}}{(\theta)_{(m+1)}}(1-\alpha)_{(l)}\,
\frac{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\sum_{k=0}^{m-l-i}\frac{1}{J^k}\frac{(\theta/\alpha+1)_{(k)}}{(\theta+\alpha)_{(m-l-i)}}\,\mathscr{C}(m-l-i,k;\alpha)}{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\sum_{k=0}^{m-i+1}\frac{1}{J^k}\frac{(\theta/\alpha)_{(k)}}{(\theta)_{(m-i+1)}}\,\mathscr{C}(m-i+1,k;\alpha)}
\]
\[
= \frac{\theta}{J}\binom{c_n}{l}\frac{(\theta+\alpha)_{(m-l)}}{(\theta)_{(m+1)}}(1-\alpha)_{(l)}\,
\frac{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\sum_{k=0}^{m-l-i}\frac{1}{J^k}\frac{\prod_{t=0}^{k-1}(\theta+\alpha+t\alpha)}{(\theta+\alpha)_{(m-l-i)}}\frac{\mathscr{C}(m-l-i,k;\alpha)}{\alpha^{k}}}{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\sum_{k=0}^{m-i+1}\frac{1}{J^k}\frac{\prod_{t=0}^{k-1}(\theta+t\alpha)}{(\theta)_{(m-i+1)}}\frac{\mathscr{C}(m-i+1,k;\alpha)}{\alpha^{k}}}.
\]
Then, letting $\alpha\to 0$ in the last expression and applying the identity (32),
\[
\lim_{\alpha\to 0}\Pr[f_{X_{m+1}}=l \mid C_{n,h_n(X_{m+1})}=c_n]
= \frac{\theta}{J}\binom{c_n}{l}\frac{(\theta)_{(m-l)}}{(\theta)_{(m+1)}}\,l!\,
\frac{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}\frac{(-1)^{m-c_n-i}}{(\theta)_{(m-l-i)}}\sum_{k=0}^{m-l-i}\big(\tfrac{\theta}{J}\big)^{k}|s(m-l-i,k)|}{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}\frac{(-1)^{m-c_n-i}}{(\theta)_{(m-i+1)}}\sum_{k=0}^{m-i+1}\big(\tfrac{\theta}{J}\big)^{k}|s(m-i+1,k)|}
\]
[by the identity (31)]
\[
= \frac{\theta}{J}\binom{c_n}{l}\frac{(\theta)_{(m-l)}}{(\theta)_{(m+1)}}\,l!\,
\frac{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\,\frac{(\theta/J)_{(m-l-i)}}{(\theta)_{(m-l-i)}}}{\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\,\frac{(\theta/J)_{(m-i+1)}}{(\theta)_{(m-i+1)}}}
\]
[by the definition of the Gauss hypergeometric function (Gradshteyn and Ryzhik, 2007, Chapter 9)]
\[
= \frac{\theta}{J}\binom{c_n}{l}\frac{(\theta)_{(m-l)}}{(\theta)_{(m+1)}}\,l!\,
\frac{\frac{\Gamma(\theta)\Gamma(c_n-l+\frac{\theta}{J})\Gamma(\theta-\frac{\theta}{J}+m-c_n)}{\Gamma(\frac{\theta}{J})\Gamma(m-l+\theta)\Gamma(\theta-\frac{\theta}{J})}}{\frac{\Gamma(\theta)\Gamma(\frac{\theta}{J}+c_n+1)\Gamma(\theta-\frac{\theta}{J}+m-c_n)}{\Gamma(\frac{\theta}{J})\Gamma(1+m+\theta)\Gamma(\theta-\frac{\theta}{J})}}
= \frac{\theta}{J}\,\frac{\Gamma(c_n+1)\,\Gamma(c_n-l+\frac{\theta}{J})}{\Gamma(c_n-l+1)\,\Gamma(\frac{\theta}{J}+c_n+1)}
= \frac{\theta/J}{\theta/J+c_n}\,\frac{(c_n-l+1)_{(l)}}{(\theta/J+c_n-l)_{(l)}},
\]
which is the expression for the posterior distribution stated in Theorem 1. The proof is completed.
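The identity (31) is easy to check numerically. The following sketch (our own, exact over the rationals) builds $|s(m,k)|$ from the standard recurrence $|s(m+1,k)| = m\,|s(m,k)| + |s(m,k-1)|$ and compares the two sides of (31).

```python
from math import prod
from fractions import Fraction

def stirling_unsigned(m):
    """|s(m, k)| for k = 0..m via |s(m+1, k)| = m|s(m, k)| + |s(m, k-1)|."""
    row = [1]                      # base case m = 0
    for n in range(m):
        row = [(row[k] * n if k < len(row) else 0) +
               (row[k - 1] if k >= 1 else 0) for k in range(len(row) + 1)]
    return row

def rising(a, m):
    """Rising factorial (a)_(m) = a (a+1) ... (a+m-1)."""
    return prod(a + i for i in range(m))

a, m = Fraction(7, 3), 9
lhs = sum(a ** k * s for k, s in enumerate(stirling_unsigned(m)))
assert lhs == rising(a, m)   # identity (31) holds exactly
```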
Appendix C Proof of Equation (24)

We start by rewriting the denominator of the posterior distribution (24) in terms of an integral with respect to the $\alpha$-stable distribution; similar arguments can be applied to the numerator of (24). We denote by $L^{a}_{n}$ the generalized Laguerre polynomial; see Chapter 8 of Gradshteyn and Ryzhik (2007). We write
\[
\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\,G_{K_{m-i+1}}(J^{-1};\alpha,\theta)
\]
[by the distribution (13)]
\[
= \sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\sum_{k=1}^{m-i+1}\frac{(\theta/\alpha)_{(k)}}{(\theta)_{(m-i+1)}\,J^{k}}\,\mathscr{C}(m-i+1,k;\alpha)
\]
[by the definition of the Gamma function (Gradshteyn and Ryzhik, 2007, Chapter 8)]
\[
= \frac{1}{\Gamma(\theta/\alpha)}\int_0^{+\infty}y^{\theta/\alpha-1}e^{-y}\Bigg(\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}\frac{(-1)^{m-c_n-i}}{(\theta)_{(m-i+1)}}\sum_{k=1}^{m-i+1}\Big(\frac{y}{J}\Big)^{k}\mathscr{C}(m-i+1,k;\alpha)\Bigg)\mathrm{d}y
\]
[by Equation 13 in Favaro et al. (2015)]
\[
= \frac{1}{\Gamma(\theta/\alpha)}\int_0^{+\infty}y^{\theta/\alpha-1}e^{-y+y/J}\int_0^{+\infty}e^{-x(y/J)^{1/\alpha}}g_\alpha(x)\Bigg(\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}\frac{(-1)^{m-c_n-i}}{(\theta)_{(m-i+1)}}\Big(x\Big(\frac{y}{J}\Big)^{1/\alpha}\Big)^{m-i+1}\Bigg)\mathrm{d}x\,\mathrm{d}y
\]
[by Equation 8.970.1 in Gradshteyn and Ryzhik (2007)]
\[
= \frac{(m-c_n)!}{(\theta)_{(m+1)}\,\Gamma(\theta/\alpha)}\int_0^{+\infty}y^{\theta/\alpha-1}e^{-y+y/J}\int_0^{+\infty}e^{-x(y/J)^{1/\alpha}}g_\alpha(x)\Big(x\Big(\frac{y}{J}\Big)^{1/\alpha}\Big)^{c_n+1}L^{\theta+c_n}_{m-c_n}\Big(x\Big(\frac{y}{J}\Big)^{1/\alpha}\Big)\mathrm{d}x\,\mathrm{d}y
\]
[by the change of variable $z = y^{1/\alpha}$]
\[
= \frac{\alpha\,(m-c_n)!}{(\theta)_{(m+1)}\,\Gamma(\theta/\alpha)}\int_0^{+\infty}z^{\theta+c_n}e^{-z^{\alpha}(1-1/J)}\int_0^{+\infty}e^{-(x/J^{1/\alpha})z}g_\alpha(x)\Big(\frac{x}{J^{1/\alpha}}\Big)^{c_n+1}L^{\theta+c_n}_{m-c_n}\Big(z\frac{x}{J^{1/\alpha}}\Big)\mathrm{d}x\,\mathrm{d}z
\]
[by the identity (Zolotarev, 1986) $\int_{(0,+\infty)}e^{-th}g_\alpha(h)\,\mathrm{d}h = e^{-t^{\alpha}}$ for $t > 0$, and Fubini's theorem]
\[
= \frac{\alpha\,(m-c_n)!}{(\theta)_{(m+1)}\,\Gamma(\theta/\alpha)}\int_0^{+\infty}\int_0^{+\infty}g_\alpha(h)\,g_\alpha(x)\Big(\frac{x}{J^{1/\alpha}}\Big)^{c_n+1}\int_0^{+\infty}z^{\theta+c_n}\exp\big\{-z\big[h(1-1/J)^{1/\alpha}+x/J^{1/\alpha}\big]\big\}\,L^{\theta+c_n}_{m-c_n}\Big(z\frac{x}{J^{1/\alpha}}\Big)\mathrm{d}z\,\mathrm{d}x\,\mathrm{d}h
\]
[by the change of variable $y = z\,x/J^{1/\alpha}$]
\[
= \frac{\alpha\,(m-c_n)!}{(\theta)_{(m+1)}\,\Gamma(\theta/\alpha)}\int_0^{+\infty}\int_0^{+\infty}g_\alpha(h)\,g_\alpha(x)\Big(\frac{J^{1/\alpha}}{x}\Big)^{\theta}\int_0^{+\infty}y^{\theta+c_n}\exp\Big\{-y\Big[\frac{h}{x}(J-1)^{1/\alpha}+1\Big]\Big\}\,L^{\theta+c_n}_{m-c_n}(y)\,\mathrm{d}y\,\mathrm{d}x\,\mathrm{d}h
\]
[by Equation 7.414.8 in Gradshteyn and Ryzhik (2007)]
\[
= \frac{\alpha\,\Gamma(\theta)}{\Gamma(\theta/\alpha)}\int_0^{+\infty}\int_0^{+\infty}g_\alpha(h)\,g_\alpha(x)\Big(\frac{J^{1/\alpha}}{x}\Big)^{\theta}\frac{\big(\frac{h}{x}(J-1)^{1/\alpha}\big)^{m-c_n}}{\big(\frac{h}{x}(J-1)^{1/\alpha}+1\big)^{\theta+m+1}}\,\mathrm{d}x\,\mathrm{d}h.
\]
That is,
\[
\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\,G_{K_{m-i+1}}(J^{-1};\alpha,\theta)
= \frac{\alpha\,\Gamma(\theta)}{\Gamma(\theta/\alpha)}\int_0^{+\infty}\int_0^{+\infty}g_\alpha(h)\,g_\alpha(x)\Big(\frac{J^{1/\alpha}}{x}\Big)^{\theta}\frac{\big(\frac{h}{x}(J-1)^{1/\alpha}\big)^{m-c_n}}{\big(\frac{h}{x}(J-1)^{1/\alpha}+1\big)^{\theta+m+1}}\,\mathrm{d}x\,\mathrm{d}h. \tag{33}
\]
Now, along the same lines, we rewrite the numerator of the posterior distribution (24) in terms of an integral with respect to the stable distribution: starting from $G_{K_{m-l-i}}(J^{-1};\alpha,\theta+\alpha)$, we apply the distribution (13), the Gamma integral (now with $\Gamma(\theta/\alpha+1)$), Equation 13 in Favaro et al. (2015), Equation 8.970.1 in Gradshteyn and Ryzhik (2007) with the Laguerre polynomial $L^{\theta+c_n+\alpha-l-1}_{m-c_n}$, Zolotarev's identity, the same changes of variable, and Equation 7.414.8 in Gradshteyn and Ryzhik (2007). This yields
\[
\sum_{i=0}^{m-c_n}\binom{m-c_n}{i}(-1)^{m-c_n-i}\,G_{K_{m-l-i}}(J^{-1};\alpha,\theta+\alpha)
= \frac{\alpha\,\Gamma(\theta+\alpha)}{\Gamma(\theta/\alpha+1)}\int_0^{+\infty}\int_0^{+\infty}g_\alpha(h)\,g_\alpha(x)\Big(\frac{J^{1/\alpha}}{x}\Big)^{\theta+\alpha}\frac{\big(\frac{h}{x}(J-1)^{1/\alpha}\big)^{m-c_n}}{\big(\frac{h}{x}(J-1)^{1/\alpha}+1\big)^{\theta+m-l+\alpha}}\,\mathrm{d}x\,\mathrm{d}h. \tag{34}
\]
The proof is completed by combining the posterior distribution (24) with the identities (33) and (34).
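Zolotarev's identity, used twice above, can be checked numerically in the one case where $g_\alpha$ has a simple closed form, namely $\alpha = 1/2$, for which $g_{1/2}(x) = x^{-3/2}e^{-1/(4x)}/(2\sqrt{\pi})$ and the Laplace transform should equal $e^{-\sqrt{t}}$. The check below is our own sketch.

```python
import numpy as np
from scipy.integrate import quad

# Positive 1/2-stable (Levy) density; its Laplace transform is exp(-sqrt(t)).
def g_half(x):
    return x ** (-1.5) * np.exp(-1.0 / (4.0 * x)) / (2.0 * np.sqrt(np.pi))

for t in (0.5, 1.0, 4.0):
    lt, _ = quad(lambda x: np.exp(-t * x) * g_half(x), 0.0, np.inf)
    print(lt, np.exp(-np.sqrt(t)))   # the two columns coincide
```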
Appendix D Proof of Equation (25)
Because of the independence assumption on $\mathcal{H}$, and by an application of Bayes theorem, we write
\[
\begin{aligned}
&\Pr[f_{X_{m+1}}=l \mid \{C_{n,h_n(X_{m+1})}\}_{n\in[N]} = \{c_n\}_{n\in[N]}]\\
&\quad= \frac{1}{\Pr[\{C_{n,h_n(X_{m+1})}\}_{n\in[N]} = \{c_n\}_{n\in[N]}]}\,\Pr[f_{X_{m+1}}=l]\prod_{n=1}^{N}\Pr[C_{n,h_n(X_{m+1})}=c_n \mid f_{X_{m+1}}=l]\\
&\quad= \frac{1}{\Pr[\{C_{n,h_n(X_{m+1})}\}_{n\in[N]} = \{c_n\}_{n\in[N]}]}\,\Pr[f_{X_{m+1}}=l]\prod_{n=1}^{N}\frac{\Pr[C_{n,h_n(X_{m+1})}=c_n,\ f_{X_{m+1}}=l]}{\Pr[f_{X_{m+1}}=l]}\\
&\quad= \frac{1}{\Pr[\{C_{n,h_n(X_{m+1})}\}_{n\in[N]} = \{c_n\}_{n\in[N]}]}\,(\Pr[f_{X_{m+1}}=l])^{1-N}\prod_{n=1}^{N}\Pr[C_{n,h_n(X_{m+1})}=c_n]\,\Pr[f_{X_{m+1}}=l \mid C_{n,h_n(X_{m+1})}=c_n]\\
&\quad= (\Pr[f_{X_{m+1}}=l])^{1-N}\prod_{n=1}^{N}\Pr[f_{X_{m+1}}=l \mid C_{n,h_n(X_{m+1})}=c_n]\\
&\quad\propto \prod_{n=1}^{N}\Pr[f_{X_{m+1}}=l \mid C_{n,h_n(X_{m+1})}=c_n],
\end{aligned}
\]
where the $n$-th term of the last expression is precisely the probability in Theorem 2. The proof is completed.
Appendix E A "direct" proof-method for s-range queries

Our "direct" proof-method for the CMS-DP can be readily extended to compute the posterior distribution (29). We outline this extension for any arbitrary $s \ge 1$, and then we present an explicit example for $s = 2$. To simplify the notation, we remove the subscript $n$ from $h_n$. That is, for the hash function $h \sim \mathcal{H}$ we are interested in computing the posterior distribution
\[
\Pr[\mathbf{f}_{X_s}=\mathbf{l}_s \mid \mathbf{C}_{h(X_s)}=\mathbf{c}]
= \frac{\Pr[\mathbf{f}_{X_s}=\mathbf{l}_s,\ \mathbf{C}_{h(X_s)}=\mathbf{c}]}{\Pr[\mathbf{C}_{h(X_s)}=\mathbf{c}]}. \tag{35}
\]
Note that for $s = 1$ the posterior distribution (35) reduces to (7). We analyze the posterior distribution (35) starting from its denominator. In particular, the denominator of (35) can be written as
\[
\Pr[\mathbf{C}_{h(X_s)}=\mathbf{c}]
= \sum_{(j_1,\dots,j_s)\in[J]^s}\Pr\Big[\sum_{i=1}^{m}\mathbf{1}\{h(X_i)=j_1\}=c_1,\ \dots,\ \sum_{i=1}^{m}\mathbf{1}\{h(X_i)=j_s\}=c_s,\ h(X_{m+1})=j_1,\dots,h(X_{m+s})=j_s\Big].
\]
To evaluate
\[
\Pr\Big[\sum_{i=1}^{m}\mathbf{1}\{h(X_i)=j_1\}=c_1,\ \dots,\ \sum_{i=1}^{m}\mathbf{1}\{h(X_i)=j_s\}=c_s,\ h(X_{m+1})=j_1,\dots,h(X_{m+s})=j_s\Big], \tag{36}
\]
we split the sum over $[J]^s$ and we organize the summands as follows. First, we introduce a variable $k$ which counts how many distinct values there are in each vector $(j_1,\dots,j_s)$, so that $k \in \{1, 2, \dots, \min\{s,J\}\}$. Second, we consider the vector $(r_1,\dots,r_k)$ of frequencies of the $k$ distinct values. Third, we consider the vector $(j^*_1,\dots,j^*_k)$ of distinct values, with $\{j^*_1,\dots,j^*_k\} \subseteq \{1,\dots,J\}$. Then, we evaluate the probability (36) in the distinguished case
\[
j_1 = \dots = j_{r_1} =: j^*_1,\quad j_{r_1+1} = \dots = j_{r_1+r_2} =: j^*_2,\quad \dots,\quad j_{r_1+\dots+r_{k-1}+1} = \dots = j_{r_1+\dots+r_k} =: j^*_k,
\]
such that the probability (36) of interest is different from zero if and only if
\[
c_1 = \dots = c_{r_1} =: c^*_1,\quad c_{r_1+1} = \dots = c_{r_1+r_2} =: c^*_2,\quad \dots,\quad c_{r_1+\dots+r_{k-1}+1} = \dots = c_{r_1+\dots+r_k} =: c^*_k.
\]
That is, we evaluate
\[
\Pr\Big[\sum_{i=1}^{m}\mathbf{1}\{h(X_i)=j^*_1\}=c^*_1,\ \dots,\ \sum_{i=1}^{m}\mathbf{1}\{h(X_i)=j^*_k\}=c^*_k,\ h(X_{m+1})=\dots=h(X_{m+r_1})=j^*_1,\ \dots,\ h(X_{m+r_1+\dots+r_{k-1}+1})=\dots=h(X_{m+r_1+\dots+r_k})=j^*_k\Big].
\]
Now, we set $B^*_r := \{x \in \mathbb{V}: h(x)=j^*_r\}$ for any $r \in \{1,\dots,k\}$, and we set $B^*_{k+1} := (\cup_{r=1}^{k}B^*_r)^{C}$. Thus, $\{B^*_1,\dots,B^*_{k+1}\}$ is a finite partition of $\mathbb{V}$. If $k = J$, then $B^*_{k+1} = \varnothing$, and in such a case we intend that $\{B^*_1,\dots,B^*_{k+1}\}$ is replaced by $\{B^*_1,\dots,B^*_k\}$. Accordingly, we can write the identity
\[
\Pr[\,\cdots\,] = \binom{m}{c^*_1,\dots,c^*_k}\int_{\Delta_k}\Big(\prod_{i=1}^{k}p_i^{c^*_i+r_i}\Big)(1-p_1-\dots-p_k)^{m-\sum_{i=1}^{k}c^*_i}\,\mu_{B^*_1,\dots,B^*_{k+1}}(\mathrm{d}p_1\cdots\mathrm{d}p_k),
\]
where $\mu_{B^*_1,\dots,B^*_{k+1}}$ is the distribution of $(P(B^*_1),\dots,P(B^*_k))$ which, by the finite-dimensional projective property of the DP, is a Dirichlet distribution on $\Delta_k$ with parameter $(\theta/J,\dots,\theta/J,(J-k)\theta/J)$. If $k < J$, the integral equals
\[
\frac{\Gamma(\theta)}{[\Gamma(\frac{\theta}{J})]^{k}\,\Gamma((J-k)\frac{\theta}{J})}\,
\frac{\big[\prod_{i=1}^{k}\Gamma(\frac{\theta}{J}+c^*_i+r_i)\big]\,\Gamma((J-k)\frac{\theta}{J}+m-\sum_{i=1}^{k}c^*_i)}{\Gamma(\theta+m+s)},
\]
and if $k = J$, it equals
\[
\frac{\Gamma(\theta)}{[\Gamma(\frac{\theta}{J})]^{k}}\,\frac{\prod_{i=1}^{k}\Gamma(\frac{\theta}{J}+c^*_i+r_i)}{\Gamma(\theta+m+s)}.
\]
Upon denoting by $I_k(c^*_1,\dots,c^*_k; r_1,\dots,r_k)$ the right expression of the integral, we conclude that
\[
\Pr[\mathbf{C}_{h(X_s)}=\mathbf{c}]
= \sum_{k=1}^{\min\{s,J\}}\frac{J!}{(J-k)!}\sum_{(\pi_1,\dots,\pi_k)\in\Pi(s,k)}\Delta(\pi_1,\dots,\pi_k;c_1,\dots,c_s)\binom{m}{c^*_1,\dots,c^*_k}\,I_k(c^*_1,\dots,c^*_k;|\pi_1|,\dots,|\pi_k|), \tag{37}
\]
where: i) $\Pi(s,k)$ denotes the set of all possible partitions of the set $\{1,\dots,s\}$ into $k$ disjoint subsets $\pi_1,\dots,\pi_k$, and $|\pi_i|$ stands for the cardinality of the subset $\pi_i$; ii) $\Delta(\pi_1,\dots,\pi_k;c_1,\dots,c_s)$ is either 0 or 1, with the proviso that it equals 1 if and only if, for all $z \in \{1,\dots,k\}$ for which $|\pi_z| \ge 2$, all the integers $c_i$ with $i \in \pi_z$ are equal; for any $i \in \{1,\dots,k\}$, $c^*_i$ represents the common integer associated with $\pi_i$. Formula (37) simplifies remarkably for small values of $s$. For instance:

i) for $s = 1$, $\Pr[C_{h(X_{m+1})}=c] = J\binom{m}{c}I_1(c;1)$;

ii) for $s = 2$,
\[
\Pr[C_{h(X_{m+1})}=c_1,\ C_{h(X_{m+2})}=c_2]
= J\,\mathbf{1}\{c_1=c_2=:c\}\binom{m}{c}I_1(c;2) + J(J-1)\binom{m}{c_1,c_2}I_2(c_1,c_2;1,1). \tag{38}
\]
We conclude by studying the numerator in (35). This expression is determined by the complete knowledge of the joint distribution of $(X_1,\dots,X_{m+s})$. As above, we can start by writing
\[
\Pr[\mathbf{f}_{X_s}=\mathbf{l}_s,\ \mathbf{C}_{h(X_s)}=\mathbf{c}]
= \sum_{k=1}^{s}\sum_{(\pi_1,\dots,\pi_k)\in\Pi(s,k)}\Delta(\pi_1,\dots,\pi_k;l_1,\dots,l_s)\binom{m}{l^*_1,\dots,l^*_k}\,
\Pr\Big[B(m;l^*_1,\dots,l^*_k;\pi_1,\dots,\pi_k)\cap\Big\{\sum_{i=1}^{m}\mathbf{1}\{h(X_i)=j_1\}=c_1,\dots,\sum_{i=1}^{m}\mathbf{1}\{h(X_i)=j_s\}=c_s\Big\}\Big],
\]
where the event $B(m;l^*_1,\dots,l^*_k;\pi_1,\dots,\pi_k)$ is characterized by the following relations among the random variables:
\[
X_1 = \dots = X_{l^*_1} = X_{m+r}\ \text{for all } r \in \pi_1,\quad
X_{l^*_1+1} = \dots = X_{l^*_1+l^*_2} = X_{m+r}\ \text{for all } r \in \pi_2,\quad \dots,
\]
\[
X_{l^*_1+\dots+l^*_{k-1}+1} = \dots = X_{l^*_1+\dots+l^*_k} = X_{m+r}\ \text{for all } r \in \pi_k,
\]
\[
X_{m+r_1} \ne X_{m+r_2}\ \text{for all } r_1 \in \pi_a,\ r_2 \in \pi_b,\ a \ne b,\qquad
\{X_{l^*_1+\dots+l^*_k+1},\dots,X_m\}\cap\{X_{m+1},\dots,X_{m+s}\}=\varnothing.
\]
The numerator of (35) can be treated as the denominator of (35), namely by exploiting the double partition structure induced by the above relations on the random variables $X_i$'s and $h(X_i)$'s. We observe that the combination of these two partition structures proves particularly cumbersome to write for general $s \ge 1$. For this reason, further manipulations of the posterior distribution (35) are deferred to the proof of Theorem 3, in which we assume that $s = 2$.
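The bookkeeping over $\Pi(s,k)$ in (37) is easy to mechanize. The following sketch (ours) enumerates the set partitions of $\{1,\dots,s\}$ via restricted growth strings and, for $s = 2$, recovers exactly the two configurations behind (38): one shared bucket ($k = 1$, forcing $c_1 = c_2$) and two distinct buckets ($k = 2$).

```python
from collections import defaultdict

def set_partitions(s):
    """All partitions of {1, ..., s}, via restricted growth strings."""
    def rec(i, rgs, mx):
        if i == s:
            blocks = defaultdict(list)
            for elem, b in enumerate(rgs, start=1):
                blocks[b].append(elem)
            yield list(blocks.values())
            return
        for b in range(mx + 2):   # join an existing block or open a new one
            yield from rec(i + 1, rgs + [b], max(mx, b))
    yield from rec(0, [], -1)

by_k = defaultdict(list)          # group partitions by their number of blocks k
for part in set_partitions(2):
    by_k[len(part)].append(part)
print(dict(by_k))
# {1: [[[1, 2]]], 2: [[[1], [2]]]}  -> the two terms of (38):
# k = 1 forces c_1 = c_2 (the two new tokens share a bucket), k = 2 does not.
```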
Appendix F Proof of Theorem 3
Following the "direct" proof-method for $s \ge 1$, we start by expressing the posterior distribution of $(f_{X_{m+1}}, f_{X_{m+2}})$ given $C_{n,h_n(X_{m+1})}$ and $C_{n,h_n(X_{m+2})}$ as a ratio of two probabilities, and then we deal with the numerator and the denominator. That is, we write
\[
\Pr[f_{X_{m+1}}=l_1,\ f_{X_{m+2}}=l_2 \mid C_{n,h_n(X_{m+1})}=c_{n,1},\ C_{n,h_n(X_{m+2})}=c_{n,2}]
= \frac{\Pr\big[f_{X_{m+1}}=l_1,\ f_{X_{m+2}}=l_2,\ \sum_{i=1}^{m}\mathbf{1}\{h_n(X_i)=h_n(X_{m+1})\}=c_{n,1},\ \sum_{i=1}^{m}\mathbf{1}\{h_n(X_i)=h_n(X_{m+2})\}=c_{n,2}\big]}{\Pr[C_{n,h_n(X_{m+1})}=c_{n,1},\ C_{n,h_n(X_{m+2})}=c_{n,2}]}. \tag{39}
\]
Observe that the denominator of the posterior distribution (39) reduces to (38). Then, by using the finite-dimensional projective property of the DP,
\[
J\,\mathbf{1}\{c_{n,1}=c_{n,2}=:c\}\binom{m}{c}I_1(c;2)
= J\,\mathbf{1}\{c_{n,1}=c_{n,2}=:c\}\binom{m}{c}\int_0^1 p^{c+2}(1-p)^{m-c}\,\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta(1-\frac1J))}\,p^{\theta/J-1}(1-p)^{\theta(1-1/J)-1}\,\mathrm{d}p
= J\,\mathbf{1}\{c_{n,1}=c_{n,2}=:c\}\binom{m}{c}\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta(1-\frac1J))}\,\frac{\Gamma(\frac{\theta}{J}+c+2)\,\Gamma(\theta(1-\frac1J)+m-c)}{\Gamma(\theta+m+2)}
\]
and
\[
J(J-1)\binom{m}{c_{n,1},c_{n,2}}I_2(c_{n,1},c_{n,2};1,1)
= J(J-1)\binom{m}{c_{n,1},c_{n,2}}\int_{\Delta_2}p_1^{c_{n,1}+1}p_2^{c_{n,2}+1}(1-p_1-p_2)^{m-c_{n,1}-c_{n,2}}\,\frac{\Gamma(\theta)}{[\Gamma(\frac{\theta}{J})]^2\Gamma(\theta(1-\frac2J))}\,p_1^{\theta/J-1}p_2^{\theta/J-1}(1-p_1-p_2)^{\theta(1-2/J)-1}\,\mathrm{d}p_1\mathrm{d}p_2
= J(J-1)\binom{m}{c_{n,1},c_{n,2}}\frac{\Gamma(\theta)}{[\Gamma(\frac{\theta}{J})]^2\Gamma(\theta(1-\frac2J))}\,\frac{\Gamma(\frac{\theta}{J}+c_{n,1}+1)\,\Gamma(\frac{\theta}{J}+c_{n,2}+1)\,\Gamma(\theta(1-\frac2J)+m-c_{n,1}-c_{n,2})}{\Gamma(\theta+m+2)}.
\]
Then,
\[
\Pr[C_{n,h_n(X_{m+1})}=c_{n,1},\ C_{n,h_n(X_{m+2})}=c_{n,2}]
= J\,\mathbf{1}\{c_{n,1}=c_{n,2}=:c\}\binom{m}{c}\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta(1-\frac1J))}\,\frac{\Gamma(\frac{\theta}{J}+c+2)\,\Gamma(\theta(1-\frac1J)+m-c)}{\Gamma(\theta+m+2)}
+ J(J-1)\binom{m}{c_{n,1},c_{n,2}}\frac{\Gamma(\theta)}{[\Gamma(\frac{\theta}{J})]^2\Gamma(\theta(1-\frac2J))}\,\frac{\Gamma(\frac{\theta}{J}+c_{n,1}+1)\,\Gamma(\frac{\theta}{J}+c_{n,2}+1)\,\Gamma(\theta(1-\frac2J)+m-c_{n,1}-c_{n,2})}{\Gamma(\theta+m+2)}. \tag{40}
\]
Now, we focus on the numerator of the posterior distribution (39), which is rewritten as follows:
\[
\Pr\Big[f_{X_{m+1}}=l_1,\ f_{X_{m+2}}=l_2,\ {\textstyle\sum_{i=1}^{m}}\mathbf{1}\{h_n(X_i)=h_n(X_{m+1})\}=c_{n,1},\ {\textstyle\sum_{i=1}^{m}}\mathbf{1}\{h_n(X_i)=h_n(X_{m+2})\}=c_{n,2}\Big]
= \Pr[\,\cdots,\ X_{m+1}=X_{m+2}] + \Pr[\,\cdots,\ X_{m+1}\ne X_{m+2}]. \tag{41}
\]
First, we consider the first term on the right-hand side of (41). In particular, we write
\[
\Pr[\,\cdots,\ X_{m+1}=X_{m+2}]
= \mathbf{1}\{l_1=l_2=:l,\ c_{n,1}=c_{n,2}=:c\}\binom{m}{l}\Pr\Big[X_1=\dots=X_l=X_{m+1}=X_{m+2},\ \{X_{l+1},\dots,X_m\}\cap\{X_{m+1}\}=\varnothing,\ {\textstyle\sum_{i=l+1}^{m}}\mathbf{1}\{h_n(X_i)=h_n(X_{m+1})\}=c-l\Big],
\]
which is determined by the distribution of $(X_1,\dots,X_{m+2})$. In view of Equation 3.5 of Sangalli (2006),
\[
\Pr[X_1\in C_1,\dots,X_{m+2}\in C_{m+2}]
= \sum_{k=1}^{m+2}\frac{\theta^{k}}{(\theta)_{(m+2)}}\sum_{(\pi_1,\dots,\pi_k)\in\Pi(m+2,k)}\prod_{i=1}^{k}(|\pi_i|-1)!\ \nu(\cap_{r\in\pi_i}C_r).
\]
We set $D(m,l) := \{X_1=\dots=X_l=X_{m+1}=X_{m+2},\ \{X_{l+1},\dots,X_m\}\cap\{X_{m+1}\}=\varnothing\}$, and we define $\nu_{\pi_1,\dots,\pi_k}$ as the probability measure on $(\mathbb{V}^{m+2},\mathscr{V}^{m+2})$ generated by the identity
\[
\nu_{\pi_1,\dots,\pi_k}(C_1\times\dots\times C_{m+2}) := \prod_{i=1}^{k}\nu(\cap_{r\in\pi_i}C_r).
\]
It is clear that such measures attach to $D(m,l)$ a probability value that is either 0 or 1. In particular, $\nu_{\pi_1,\dots,\pi_k}(D(m,l)) = 1$ if and only if one of the $\pi$'s (say $\pi_k$, these partitions being given up to the order) is exactly equal to the set $\{1,\dots,l,m+1,m+2\}$. Accordingly, we write
\[
\Pr\Big[D(m,l),\ {\textstyle\sum_{i=l+1}^{m}}\mathbf{1}\{h_n(X_i)=h_n(X_{m+1})\}=c-l\Big]
= \sum_{k=2}^{m-l+1}\frac{\theta^{k}}{(\theta)_{(m+2)}}\sum_{(\pi_1,\dots,\pi_{k-1})\in\Pi(m-l,k-1)}(l+1)!\prod_{i=1}^{k-1}(|\pi_i|-1)!\ \nu_{\pi_1,\dots,\pi_k}\Big({\textstyle\sum_{i=l+1}^{m}}\mathbf{1}\{h_n(X_i)=h_n(X_{m+1})\}=c-l\Big)
\]
\[
= \theta\,\frac{(\theta)_{(m-l)}}{(\theta)_{(m+2)}}\,(l+1)!\sum_{r=1}^{m-l}\frac{\theta^{r}}{(\theta)_{(m-l)}}\sum_{(\pi_1,\dots,\pi_r)\in\Pi(m-l,r)}\prod_{i=1}^{r}(|\pi_i|-1)!\ \sum_{j=1}^{J}\nu(\{j\})\,\nu_{\pi_1,\dots,\pi_r}\Big({\textstyle\sum_{i=l+1}^{m}}\mathbf{1}\{h_n(X_i)=j\}=c-l\Big).
\]
Hence,
\[
\Pr[\,\cdots,\ X_{m+1}=X_{m+2}]
= \mathbf{1}\{l_1=l_2=:l,\ c_{n,1}=c_{n,2}=:c\}\,\frac{m!}{(c-l)!\,(m-c)!}\,\frac{\theta\,(l+1)}{\Gamma(\theta+m+2)}\,\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta(1-\frac1J))}\,\Gamma\big(\tfrac{\theta}{J}+c-l\big)\,\Gamma\big(\theta(1-\tfrac1J)+m-c\big). \tag{42}
\]
Now, we consider the second term on the right-hand side of (41). In particular,
\[
\Pr[\,\cdots,\ X_{m+1}\ne X_{m+2}]
= \binom{m}{l_1,l_2}\Pr\Big[E(m,l_1,l_2),\ l_2\,\mathbf{1}\{h_n(X_{m+2})=h_n(X_{m+1})\}+{\textstyle\sum_{i=l_1+l_2+1}^{m}}\mathbf{1}\{h_n(X_i)=h_n(X_{m+1})\}=c_{n,1}-l_1,\ l_1\,\mathbf{1}\{h_n(X_{m+1})=h_n(X_{m+2})\}+{\textstyle\sum_{i=l_1+l_2+1}^{m}}\mathbf{1}\{h_n(X_i)=h_n(X_{m+2})\}=c_{n,2}-l_2\Big],
\]
where
\[
E(m,l_1,l_2) := \Big\{X_1=\dots=X_{l_1}=X_{m+1},\ X_{l_1+1}=\dots=X_{l_1+l_2}=X_{m+2},\ X_{m+1}\ne X_{m+2},\ \{X_{l_1+l_2+1},\dots,X_m\}\cap\{X_{m+1},X_{m+2}\}=\varnothing\Big\}.
\]
We have that $\nu_{\pi_1,\dots,\pi_k}(E(m,l_1,l_2)) = 1$ if and only if two of the $\pi$'s (say $\pi_{k-1}$ and $\pi_k$, these partitions being given up to the order) are exactly equal to the sets $\{1,\dots,l_1,m+1\}$ and $\{l_1+1,\dots,l_1+l_2,m+2\}$, respectively. Proceeding as above,
\[
\Pr[E(m,l_1,l_2),\ \cdots]
= \theta^{2}\,\frac{(\theta)_{(m-l_1-l_2)}}{(\theta)_{(m+2)}}\,l_1!\,l_2!\sum_{r=1}^{m-l_1-l_2}\frac{\theta^{r}}{(\theta)_{(m-l_1-l_2)}}\sum_{(\pi_1,\dots,\pi_r)\in\Pi(m-l_1-l_2,r)}\prod_{i=1}^{r}(|\pi_i|-1)!
\Bigg[\sum_{(j_1,j_2)\in[J]^2}\nu(\{j_1\})\,\nu(\{j_2\})\ \nu_{\pi_1,\dots,\pi_r}\Big({\textstyle\sum_{i}}\mathbf{1}\{h_n(X_i)=j_1\}=c_{n,1}-l_1-l_2\mathbf{1}\{j_1=j_2\},\ {\textstyle\sum_{i}}\mathbf{1}\{h_n(X_i)=j_2\}=c_{n,2}-l_2-l_1\mathbf{1}\{j_1=j_2\}\Big)\Bigg].
\]
The expression within brackets can be split into the sum of two terms, according to whether $j_1 = j_2$ or not. For $j_1 = j_2$ the corresponding contribution is
\[
\frac{1}{J}\,\mathbf{1}\{c_{n,1}=c_{n,2}=:c\}\binom{m-l_1-l_2}{c-l_1-l_2}\frac{\Gamma(\theta)}{\Gamma(\frac{\theta}{J})\Gamma(\theta(1-\frac1J))}\,\frac{\Gamma(\frac{\theta}{J}+c-l_1-l_2)\,\Gamma(\theta(1-\frac1J)+m-c)}{\Gamma(\theta+m-l_1-l_2)},
\]
while for $j_1 \ne j_2$, assuming $J \ge 2$, it is
\[
\frac{J-1}{J}\binom{m-l_1-l_2}{c_{n,1}-l_1,\ c_{n,2}-l_2}\frac{\Gamma(\theta)}{[\Gamma(\frac{\theta}{J})]^2\Gamma(\theta(1-\frac2J))}\,\frac{\Gamma(\frac{\theta}{J}+c_{n,1}-l_1)\,\Gamma(\frac{\theta}{J}+c_{n,2}-l_2)\,\Gamma(\theta(1-\frac2J)+m-c_{n,1}-c_{n,2})}{\Gamma(\theta+m-l_1-l_2)}.
\]
Then,
\[
\Pr[\,\cdots,\ X_{m+1}\ne X_{m+2}]
= \binom{m}{l_1,l_2}\,\theta^{2}\,\frac{(\theta)_{(m-l_1-l_2)}}{(\theta)_{(m+2)}}\,l_1!\,l_2!\ \big[\,\text{the sum of the two contributions above}\,\big]. \tag{43}
\]
Then, by combining the probability (42) and the probability (43), we write
\[
\begin{aligned}
&\Pr\Big[f_{X_{m+1}}=l_1,\ f_{X_{m+2}}=l_2,\ {\textstyle\sum_{i=1}^{m}}\mathbf{1}\{h_n(X_i)=h_n(X_{m+1})\}=c_{n,1},\ {\textstyle\sum_{i=1}^{m}}\mathbf{1}\{h_n(X_i)=h_n(X_{m+2})\}=c_{n,2}\Big]\\
&= \frac{m!}{\Gamma(\theta+m+2)}\Bigg\{\mathbf{1}\{l_1=l_2=:l,\ c_{n,1}=c_{n,2}=:c\}\,\frac{\theta\,(l+1)}{(c-l)!\,(m-c)!}\,\frac{\Gamma(\theta)\,\Gamma(\frac{\theta}{J}+c-l)\,\Gamma(\theta(1-\frac1J)+m-c)}{\Gamma(\frac{\theta}{J})\,\Gamma(\theta(1-\frac1J))}\\
&\qquad+ \mathbf{1}\{c_{n,1}=c_{n,2}=:c\}\,\frac{\theta^{2}}{J\,(c-l_1-l_2)!\,(m-c)!}\,\frac{\Gamma(\theta)\,\Gamma(\frac{\theta}{J}+c-l_1-l_2)\,\Gamma(\theta(1-\frac1J)+m-c)}{\Gamma(\frac{\theta}{J})\,\Gamma(\theta(1-\frac1J))}\\
&\qquad+ \frac{J-1}{J}\,\frac{\theta^{2}}{(c_{n,1}-l_1)!\,(c_{n,2}-l_2)!\,(m-c_{n,1}-c_{n,2})!}\,\frac{\Gamma(\theta)\,\Gamma(\frac{\theta}{J}+c_{n,1}-l_1)\,\Gamma(\frac{\theta}{J}+c_{n,2}-l_2)\,\Gamma(\theta(1-\frac2J)+m-c_{n,1}-c_{n,2})}{[\Gamma(\frac{\theta}{J})]^2\,\Gamma(\theta(1-\frac2J))}\Bigg\}. \tag{44}
\end{aligned}
\]
The proof is completed by combining the posterior distribution (39) with the probabilities (40) and (44).

Appendix G Proof of Equation (30)
Because of the independence assumption on $\mathcal{H}$, and by an application of Bayes theorem, we write
\[
\begin{aligned}
&\Pr[f_{X_{m+1}}=l_1,\ f_{X_{m+2}}=l_2 \mid \{(C_{n,h_n(X_{m+1})},C_{n,h_n(X_{m+2})})\}_{n\in[N]} = \{(c_{n,1},c_{n,2})\}_{n\in[N]}]\\
&\quad= \frac{1}{\Pr[\{(C_{n,h_n(X_{m+1})},C_{n,h_n(X_{m+2})})\}_{n\in[N]} = \{(c_{n,1},c_{n,2})\}_{n\in[N]}]}\,\Pr[f_{X_{m+1}}=l_1,\ f_{X_{m+2}}=l_2]\prod_{n=1}^{N}\Pr[(C_{n,h_n(X_{m+1})},C_{n,h_n(X_{m+2})})=(c_{n,1},c_{n,2}) \mid f_{X_{m+1}}=l_1,\ f_{X_{m+2}}=l_2]\\
&\quad= \frac{1}{\Pr[\{(C_{n,h_n(X_{m+1})},C_{n,h_n(X_{m+2})})\}_{n\in[N]} = \{(c_{n,1},c_{n,2})\}_{n\in[N]}]}\,(\Pr[f_{X_{m+1}}=l_1,\ f_{X_{m+2}}=l_2])^{1-N}\prod_{n=1}^{N}\Pr[(C_{n,h_n(X_{m+1})},C_{n,h_n(X_{m+2})})=(c_{n,1},c_{n,2})]\,\Pr[f_{X_{m+1}}=l_1,\ f_{X_{m+2}}=l_2 \mid (C_{n,h_n(X_{m+1})},C_{n,h_n(X_{m+2})})=(c_{n,1},c_{n,2})]\\
&\quad= (\Pr[f_{X_{m+1}}=l_1,\ f_{X_{m+2}}=l_2])^{1-N}\prod_{n=1}^{N}\Pr[f_{X_{m+1}}=l_1,\ f_{X_{m+2}}=l_2 \mid (C_{n,h_n(X_{m+1})},C_{n,h_n(X_{m+2})})=(c_{n,1},c_{n,2})]\\
&\quad\propto \prod_{n=1}^{N}\Pr[f_{X_{m+1}}=l_1,\ f_{X_{m+2}}=l_2 \mid (C_{n,h_n(X_{m+1})},C_{n,h_n(X_{m+2})})=(c_{n,1},c_{n,2})],
\end{aligned}
\]
where the $n$-th term of the last expression is precisely the probability in Theorem 3. The proof is completed.

Appendix H Additional experiments

We present additional experiments on the application of the CMS-PYP to synthetic and real data. First, we recall the synthetic and real data to which the CMS-PYP is applied. As regards synthetic data, we consider datasets of $m = 500000$ tokens from Zipf's distributions, for the same five values of the exponent used in the main text. As regards real data, we consider: i) the 20 Newsgroups dataset, which consists of $m = 2765300$ tokens with $K_m = 53975$ distinct tokens; ii) the Enron dataset, which consists of $m = 6412175$ tokens with $K_m = 28102$ distinct tokens. Tables 6, 7, 8 and 9 report the MAE (mean absolute error) between true frequencies and their corresponding estimates via: i) the CMS-PYP estimate $\hat f^{(\mathrm{PYP})}_v$; ii) the CMS estimate $\hat f^{(\mathrm{CMS})}_v$; iii) the CMS-DP estimate $\hat f^{(\mathrm{DP})}_v$; iv) the CMM estimate $\hat f^{(\mathrm{CMM})}_v$.

Table 6: Synthetic data: MAE for $\hat f^{(\mathrm{PYP})}_v$, $\hat f^{(\mathrm{CMM})}_v$ and $\hat f^{(\mathrm{CMS})}_v$, case $J = 320$, $N = 2$.

Table 7: Synthetic data: MAE for $\hat f^{(\mathrm{PYP})}_v$, $\hat f^{(\mathrm{CMM})}_v$ and $\hat f^{(\mathrm{CMS})}_v$, case $J = 160$, $N = 4$.

           |     20 Newsgroups      |        Enron
Bins v     |  CMS    DP      PYP    |  CMS     DP      PYP
(0,1]      | 46.4   46.39    1.16   |  12.2   12.20    0.99
(1,2]      | 16.6   16.60    1.96   |  13.8   13.80    1.99
(2,4]      | 38.4   38.40    2.93   |  61.5   61.49    3.76
(4,8]      | 59.4   59.39    6.10   |  88.4   88.39    7.50
(8,16]     | 54.3   54.29   11.65   |  23.4   23.40   11.97
(16,32]    | 17.8   17.80   21.14   |  55.1   55.09   20.78
(32,64]    | 40.8   40.79   45.85   | 128.5  128.48   43.78
(64,128]   | 26.0   25.99   88.47   | 131.1  131.08   81.38
(128,256]  | 13.6   13.59  170.27   |  50.7   50.68  171.99
Table 8: Real data ($J = 12000$ and $N = 2$): MAE for $\hat f^{(\mathrm{PYP})}_v$, $\hat f^{(\mathrm{DP})}_v$ and $\hat f^{(\mathrm{CMS})}_v$.
           |     20 Newsgroups      |        Enron
Bins v     |  CMS    DP      PYP    |  CMS     DP      PYP
(0,1]      | 53.4   53.39    1.00   |  71.0   70.98    1.00
(1,2]      | 30.5   30.49    2.00   |  47.4   47.38    2.00
(2,4]      | 32.5   32.49    3.30   |  52.5   52.49    3.90
(4,8]      | 38.7   38.69    6.70   |  53.1   53.08    6.80
(8,16]     | 25.3   25.29   12.60   |  57.0   56.98   11.90
(16,32]    | 25.0   24.99   21.90   |  90.0   89.98   20.60
(32,64]    | 39.7   39.69   43.70   | 108.4  108.37   47.80
(64,128]   | 22.1   22.09   92.19   |  55.7   55.67   88.10
(128,256]  | 25.8   25.79  207.58   |  80.8   80.76  179.30
Table 9: Real data ($J = 8000$ and $N = 4$): MAE for $\hat f^{(\mathrm{PYP})}_v$, $\hat f^{(\mathrm{DP})}_v$ and $\hat f^{(\mathrm{CMS})}_v$.