Convergence of Chao Unseen Species Estimator
Nived Rajaraman, Prafulla Chandra, Andrew Thangaraj, Ananda Theertha Suresh
Nived Rajaraman, Prafulla Chandra, Andrew Thangaraj
Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai 600036, India
Email: {ee14b040, ee16d402, andrew}@ee.iitm.ac.in

Ananda Theertha Suresh
Google Research, New York, USA
Email: theertha@google.com

January 14, 2020

Abstract
Support size estimation and the related problem of unseen species estimation have wide applications in ecology and database analysis. Perhaps the most used support size estimator is the Chao estimator. Despite its widespread use, little is known about its theoretical properties. We analyze the Chao estimator and show that its worst case mean squared error (MSE) is smaller than the MSE of the plug-in estimator by a factor of O((k/n)^2), where k is the maximum support size and n is the number of samples. Our main technical contribution is a new method to analyze rational estimators for discrete distribution properties, which may be of independent interest.

Introduction

Given independent samples from an underlying unknown distribution, we consider the problem of estimating the support size of the distribution. Estimating the support size and unseen species estimation has applications in ecological diversity [Cha84; SB84; SCL03; Cha05; Col+12], vocabulary size estimation [ET76; TE87], database attribute variation [Haa+95], password analysis [FH07], and, recently, in modern applications such as microbial diversity [Hug+01; Pas+01; Gao+07] and genome sequencing [DS13].

Formally, let P denote the unknown distribution over domain X. Upon observing N independent samples X_1, X_2, ..., X_N =: X^N from P, the goal is to estimate the support size,

S(P) := Σ_{x∈X} 1{p_x > 0}.

Let N_x(X^N) be the number of occurrences of symbol x in X^N. The simplest estimator is the plug-in or empirical estimator, which estimates S(P) by

Ŝ_pl(X^N) := Σ_{x∈X} 1{N_x(X^N) > 0}.   (1)

The plug-in estimator often performs poorly in the non-asymptotic regime, where N ≈ S(P). To overcome this, several estimators have been proposed, including the Efron–Thisted estimator [ET76] and the Chao estimator [Cha84]. In this work, we analyze the Chao estimator in terms of its worst case mean squared error (MSE).

(This work was presented in part at the 2019 IEEE International Symposium on Information Theory (ISIT 2019) [Raj+19].)
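To make (1) concrete, here is a minimal Python sketch of the plug-in estimator (the code and function names are ours, not from the paper):

```python
from collections import Counter

def plugin_support_estimate(samples):
    """Plug-in estimator S_pl: count the distinct symbols observed in the sample."""
    counts = Counter(samples)
    return sum(1 for c in counts.values() if c > 0)

# Six samples from a distribution supported on {a, b, c, d}: symbol d is never
# observed, so the plug-in estimate undershoots the true support size (4).
print(plugin_support_estimate(["a", "b", "a", "c", "a", "b"]))  # prints 3
```

This undershooting is exactly the bias that unseen species estimators such as the Chao estimator try to correct.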
In the next section, we state the problem definition and the statistical model.
In general, support size estimation is an ill-posed problem, as there might be a large set of symbols with infinitesimally small probability, which can never be detected with any finite number of samples. To overcome this, following [Ras+09; VV11; WY15], we focus on distributions where every non-zero probability is lower-bounded. Formally, we restrict ourselves to ∆_k, the set of distributions such that all non-zero symbols have probability ≥ 1/k. Since probabilities sum to one, distributions in ∆_k have support size at most k.

Support size estimation has been studied in a number of different statistical models, including multinomial [GT56], Poisson, and Bernoulli-product models [Col+12]. Following [Cha84; OSW16], we study the problem in the Poisson sampling model, where the number of observed samples N is a Poisson random variable with known mean n. Under Poisson sampling, the multiplicities of symbols N_x(X^N), x ∈ X, are independent random variables, and N_x(X^N) is Poisson with mean n p_x. The independence of multiplicities comparatively simplifies the MSE analysis. We believe similar results should hold for the other statistical models stated above.

For a distribution P and an estimator Ŝ(X^N), we measure the performance of the estimator in terms of the MSE, given by

E_n(Ŝ, P) := E_{X^N ~ P} (S(P) − Ŝ(X^N))^2,   (2)

and the worst case MSE over all distributions is E_{n,k}(Ŝ) := max_{P ∈ ∆_k} E_n(Ŝ, P).

The simple plug-in estimator only takes into account the number of seen symbols and does not try to predict the symbols that have not been observed yet. In this context, Efron–Thisted [ET76] and Chao [Cha84] observed that support size estimation is closely related to the problem of unseen species estimation, where the goal is to estimate the number of symbols that have not yet appeared and will appear in the future,

U(X^N, P) := Σ_{x∈X} 1{p_x > 0} 1{N_x = 0}.
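The Poisson sampling model is straightforward to simulate, since the multiplicities can be drawn directly as independent Poisson variables. A sketch (our own code; the Poisson sampler uses Knuth's method since the standard library has no built-in one):

```python
import math
import random

def poisson(lam, rng):
    """Sample from Poisson(lam) using Knuth's product-of-uniforms method."""
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return k
        k += 1

def poisson_sample_multiplicities(p, n, rng):
    """Under Poisson sampling with mean n, each multiplicity N_x is an
    independent Poisson(n * p_x) random variable."""
    return {x: poisson(n * px, rng) for x, px in p.items()}

rng = random.Random(0)
p = {x: 1 / 5 for x in "abcde"}   # a distribution in Delta_k with k = 5
counts = poisson_sample_multiplicities(p, n=10, rng=rng)
total = sum(counts.values())      # the total sample size N is Poisson(n)
```

Drawing the multiplicities independently like this is equivalent to first drawing N ~ Poisson(n) and then taking N i.i.d. samples from p.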
Given an estimator Û(X^N) for U(X^N, P), one can estimate the support size via

Ŝ(X^N) = Ŝ_pl(X^N) + Û(X^N).   (3)

Let the prevalence (or fingerprint) ϕ_i(X^N) denote the number of symbols with non-zero probability that appeared exactly i times. For i ≥ 1,

ϕ_i(X^N) := Σ_{x∈X} 1{N_x = i},

and, for i = 0,

ϕ_0(X^N, P) := Σ_{x∈X} 1{N_x = 0} 1{p_x > 0}.

With this notation, S(P) = ϕ_0(X^N, P) + Σ_{i≥1} ϕ_i(X^N), the plug-in estimator is Ŝ_pl = Σ_{i≥1} ϕ_i(X^N), and U(X^N, P) = ϕ_0(X^N, P). Hence, for estimators of the form (3),

S(P) − Ŝ = ϕ_0(X^N, P) − Û(X^N),

and the error in estimating the support is the same as the error in estimating the unseen symbols. Similar to (2), we define the worst case mean squared error in estimating the unseen symbols by E_{n,k}(Û) = max_{P∈∆_k} E_{X^N~P} (Û(X^N) − ϕ_0(X^N, P))^2, so that, for Ŝ = Ŝ_pl + Û, E_{n,k}(Ŝ) = E_{n,k}(Û).

Chao [Cha84] proposed the following estimator for the number of unseen symbols:

Û_c(X^N) = ϕ_1^2/(2ϕ_2),

which has a rational form and is not in the class of linear estimators. To understand the Chao estimator, first observe that ϕ_i = Σ_{x∈X} 1{N_x = i}. Since N_x is a Poisson random variable with mean n p_x,

E[ϕ_i] = Σ_{x∈X} e^{−n p_x} (n p_x)^i / i!.

By the Cauchy–Schwarz inequality,

E[ϕ_0] · E[ϕ_2] = (Σ_{x∈X} e^{−n p_x}) · (Σ_{x∈X} e^{−n p_x} (n p_x)^2/2) ≥ (Σ_{x∈X} e^{−n p_x} (n p_x)/√2)^2 = (E[ϕ_1])^2/2.   (4)

Hence, E[ϕ_1]^2/(2E[ϕ_2]) ≤ E[ϕ_0], and is thus a lower bound on the expected number of unseen symbols. Since expectations are not available, Chao [Cha84] proposed to use ϕ_1^2/(2ϕ_2) as an estimator for ϕ_0. Before we state results for the Chao estimator, we first state a folklore result on the performance of the plug-in estimator.
Lemma 1.
For the plug-in estimator Ŝ_pl defined in (1),

k^2 e^{−2n/k} + k e^{−n/k} ≥ E_{n,k}(Ŝ_pl) ≥ k^2 e^{−2n/k} + k e^{−n/k} − k e^{−2n/k}.

Proof.
For any distribution P ∈ ∆_k,

S(P) − Ŝ_pl = ϕ_0 = Σ_{x∈X} 1{p_x > 0} 1{N_x = 0}.

Hence,

E[(S(P) − Ŝ_pl)^2] = E[(Σ_x 1{N_x = 0})^2]
= (E[Σ_x 1{N_x = 0}])^2 + Var(Σ_x 1{N_x = 0})
= (E[Σ_x 1{N_x = 0}])^2 + Σ_x Var(1{N_x = 0})
= (Σ_x e^{−np_x})^2 + Σ_x e^{−np_x}(1 − e^{−np_x}),

where the sums run over the support of P; the first equality is the second-moment (bias–variance) decomposition, the second follows because the variance of a sum of independent random variables is the sum of their variances, and the third because 1{N_x = 0} is a Bernoulli random variable with parameter e^{−np_x}. The lower bound follows by substituting the uniform distribution over k elements for P, and the upper bound follows by the convexity of the function p ↦ e^{−np}.

We use N_x and ϕ_i to abbreviate N_x(X^N) and ϕ_i(X^N) for simplicity. The Chao estimator is undefined when ϕ_2 = 0. To circumvent this, we consider the closely related modified Chao estimator,

Û_mc(X^N) = ϕ_1^2/(2(ϕ_2 + 1)).

The analysis of the MSE for the Chao estimator and the modified Chao estimator is involved, as both are rational functions of the prevalences. Furthermore, the prevalences are dependent on each other. By developing new tools to analyze the expectation of ratios of functions of prevalences, we show the following.
Theorem 2.
For the modified Chao estimator,

E_{n,k}(Û_mc) ≤ k^2 e^{−2n/k}/(1 + n/(kα))^2 + ε(n, k),

where α = 0.3729... solves u = 4e^{−2}e^{−u}, and ε(n, k) is an explicit lower-order remainder term assembled from Lemmas 5–7 below.

For the non-asymptotic regime of interest, where n = Ω(k), ε(n, k) is o(k^2) and the first term dominates. Hence, for n = Ω(k), the Chao estimator has better worst case MSE than the plug-in estimator. Furthermore, when n ≥ k, the worst case MSE of the Chao estimator is lower than the worst case MSE of the plug-in estimator (Lemma 1) by a factor of order (k/n)^2, and, for n ≪ k, the worst case performance of the Chao estimator approaches that of the plug-in estimator.

We note that the best estimator for the support size and unseen species problems achieves the worst case MSE

min_Ŝ E_{n,k}(Ŝ) = k^2 · exp(−Θ(√(n log k / k) ∨ (n/k) ∨ 1)),

which is attained by the Chebyshev linear estimator [WY15], obtained from the approximation properties of Chebyshev polynomials.

An empirical comparison of the three estimators (plug-in, Chao, and Chebyshev) on synthetic data is shown in Fig. 1. The Chebyshev estimator is parameterized by two constants c_0 and c_1, which we choose as suggested in [WY15]. The distributions are chosen from ∆_k. We consider (i) the uniform distribution on k symbols, (ii) the Zipf(1) distribution with probability of the i-th symbol proportional to i^{−1}, (iii) the geometric distribution with probability of the i-th symbol proportional to α^{i−1} where α = 1 − k^{−1}, and (iv) an even mixture of two uniform distributions, with half of the symbols at one probability level and the other half at another. From Fig. 1, the convergence rate of the modified Chao estimator is seen to be higher than that of the plug-in estimator over the distributions we considered. However, with the exception of the uniform distribution, the Chebyshev estimator outperforms the modified Chao estimator. In the rest of the paper, we provide a proof of Theorem 2.
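The qualitative behavior in Fig. 1 is easy to reproduce at small scale. The sketch below (our own code; it uses fixed-n multinomial sampling instead of the paper's Poisson model, and only the uniform distribution) compares the empirical MSE of the plug-in and modified Chao estimators:

```python
import random
from collections import Counter

def prevalences(counts):
    """phi_i: number of symbols appearing exactly i times in the sample."""
    return Counter(counts.values())

def modified_chao_support(counts):
    """Plug-in count of seen symbols plus the modified Chao unseen estimate
    phi_1^2 / (2 (phi_2 + 1))."""
    phi = prevalences(counts)
    return len(counts) + phi[1] ** 2 / (2 * (phi[2] + 1))

rng = random.Random(1)
k, n, trials = 100, 150, 200
se_plugin = se_chao = 0.0
for _ in range(trials):
    counts = Counter(rng.randrange(k) for _ in range(n))  # uniform over k symbols
    se_plugin += (k - len(counts)) ** 2
    se_chao += (k - modified_chao_support(counts)) ** 2
print("plug-in MSE:", se_plugin / trials, "modified Chao MSE:", se_chao / trials)
```

On this uniform instance the modified Chao estimator's empirical MSE comes out far below the plug-in's, matching the behavior seen in Fig. 1.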
The MSE of the modified Chao estimator can be written as

E(Û_mc, P) = E[(ϕ_1^2/(2(ϕ_2 + 1)))^2] − E[ϕ_0 ϕ_1^2/(ϕ_2 + 1)] + E[ϕ_0^2].   (5)

Analyzing the above quantity is difficult as it involves rational functions of prevalences. A natural question to ask is how good the following approximations are:

E[(ϕ_1^2/(2(ϕ_2 + 1)))^2] ≈ (E[ϕ_1^2]/(2(E[ϕ_2] + 1)))^2,   E[ϕ_0 ϕ_1^2/(ϕ_2 + 1)] ≈ E[ϕ_0] · E[ϕ_1^2]/(E[ϕ_2] + 1).   (6)
Figure 1: Comparison of the plug-in, Chao, and Chebyshev estimators over various distributions.

We expect such approximations to hold when E[ϕ_2] is large. Motivated by this, we divide the proof of Theorem 2 into two cases based on E[ϕ_2]:

High collision regime: E[ϕ_2] ≥ n^a, where a is a constant that is determined later. In this case, the prevalences concentrate around their means.

Low collision regime: E[ϕ_2] < n^a. In this case, both the number of unseen elements and the estimates are small.

We first analyze the case where E[ϕ_2] is large. Instead of asking when approximation (6) holds, we generalize and ask when expectations involving such rational functions of prevalences can be approximated. Let Φ_poly be a homogeneous polynomial of degree d in the ϕ_i, let Φ_linear be a linear function of the prevalences of the form

Φ_linear = Σ_{i≥1} β_i ϕ_i,  and let σ := Σ_{i≥1} β_i/√(2πi).   (7)

Theorem 3. Let β_i ∈ [0, 1] for each i ≥ 1. Then for any non-increasing function f,

E[Φ_poly · f(Φ_linear)] ≥ E[Φ_poly] · E[f(Φ_linear + d)].   (8)

If f is concave and E[Φ_linear] ≥ dσ,

E[Φ_poly · f(Φ_linear)] ≤ E[Φ_poly] · f(E[Φ_linear] − dσ).   (9)

Proof.
A proof is given in Section 4. Note that if the function f is smooth and has a small derivative around E[Φ_linear], then Theorem 3 implies that E[Φ_poly · f(Φ_linear)] ≈ E[Φ_poly] · E[f(Φ_linear)]. In addition to (9) of Theorem 3, which only holds when f is concave, we develop one more such upper bound for the case when f is not concave. This is particularly useful for the Chao estimator, as the relevant function there is f(x) = 1/x^2, which is not concave.

Define V as the space spanned by the functions {1, (x + 1)^{−1}, ((x + 1)(x + 2))^{−1}, ...} over R_{≥0}. Functions in this space are represented as v = (v_0, v_1, ...) ≡ Σ_{r≥0} v_r · Π_{j=1}^r (x + j)^{−1}. A function f_1 is said to dominate another function f_2 over some domain D if f_1(x) ≥ f_2(x) for all x ∈ D. Let Supp(Φ_linear) be the range of the function Φ_linear.

Theorem 4.
Consider Φ_linear with β_i ∈ [0, 1] for each i ≥ 1. Consider some function f and let (f′_0, f′_1, ...) ∈ V dominate f over Supp(Φ_linear). Then, if E[Φ_linear] > dσ,

E[Φ_poly f(Φ_linear)] ≤ E[Φ_poly] · Σ_{t≥0} f′_t (E[Φ_linear] − dσ)^{−t}.

Proof.
A proof is given in Section 4. The above two theorems can be used in other scenarios where expectations of rational functions of prevalences are required, such as computing the expected KL risk of Good–Turing estimators [OS15] and modified Good–Turing estimators [Ach+13; HO19]. Using Theorems 3 and 4, we approximate E(Û_mc, P) and relate the expectation of ratios (resp. products) to the ratio (resp. product) of expectations, as required in (6). This results in the following lemma.

Lemma 5.
For the modified Chao estimator, defining σ_Chao = 1/(2√π) = 0.2820... by (7), if E[ϕ_2] > 4σ_Chao, then for any distribution P ∈ ∆_k,

E(Û_mc, P) ≤ (E[ϕ_1^2]/(2E[ϕ_2]) − E[ϕ_0])^2 + R(k, E[ϕ_2]),

where the remainder R(k, E[ϕ_2]) is an explicit sum of terms, each an absolute constant times a power of k divided by a power of E[ϕ_2] − 4σ_Chao.

Proof.
We start by upper-bounding the first term of E(Û_mc, P) in (5) using Theorem 4. Let f(x) = (1 + x)^{−2}. For x ≥ 0, since

1/(1 + x)^2 ≤ 1/((1 + x)(2 + x)) + 3/((1 + x)(2 + x)(3 + x)),

the element (0, 0, 1, 3, 0, 0, ...) ∈ V dominates f on [0, ∞). Setting Φ_poly = ϕ_1^4 (d = 4) and Φ_linear = ϕ_2 in Theorem 4, we get

E[(ϕ_1^2/(2(ϕ_2 + 1)))^2] ≤ (E[ϕ_1^4]/4)(1/(E[ϕ_2] − 4σ_Chao)^2 + 3/(E[ϕ_2] − 4σ_Chao)^3).   (10)

For 0 < a < x and integer t ≥ 1, we have (1 − a/x)^t ≥ 1 − at/x ≥ 1 − at/(x − a). Rearranging, we get

1/(x − a)^t ≤ 1/x^t + at/(x − a)^{t+1},  0 < a < x.   (11)

Using (11) in the first term of (10) with x = E[ϕ_2] and a = 4σ_Chao, using ϕ_1 ≤ k, and using Lemma 14 (applied with Φ_poly = ϕ_1^2) to replace E[ϕ_1^4] by (E[ϕ_1^2])^2 plus lower-order terms, we get

E[(ϕ_1^2/(2(ϕ_2 + 1)))^2] ≤ (E[ϕ_1^2])^2/(4E[ϕ_2]^2) + R_1,   (13)

where R_1 is an explicit sum of terms, each an absolute constant (involving σ_Chao) times a power of k divided by a power of E[ϕ_2] − 4σ_Chao.

Next, we lower bound the second term of E(Û_mc, P) in (5). In (8), setting Φ_poly = ϕ_0 ϕ_1^2 (d = 3), Φ_linear = ϕ_2 and f(x) = (1 + x)^{−1}, we get

E[ϕ_0 ϕ_1^2/(ϕ_2 + 1)] ≥ E[ϕ_0 ϕ_1^2] · E[1/(ϕ_2 + 4)] ≥ E[ϕ_0 ϕ_1^2]/(E[ϕ_2] + 4) ≥ E[ϕ_0 ϕ_1^2] (1/E[ϕ_2] − 4/E[ϕ_2]^2),   (14)

where the second inequality follows by Jensen's inequality and the third by the inequality 1/(x + a) ≥ 1/x − a/x^2 for x > 0, a ≥ 0.

To upper bound the third term of E(Û_mc, P) in (5), we use Lemma 13 (with h = 2) to get

E[ϕ_0^2] ≤ (E[ϕ_0])^2 + E[ϕ_0].   (15)

Adding (13), (14) and (15), we get

E(Û_mc, P) ≤ (E[ϕ_1^2])^2/(4E[ϕ_2]^2) − E[ϕ_0 ϕ_1^2]/E[ϕ_2] + 4E[ϕ_0 ϕ_1^2]/E[ϕ_2]^2 + (E[ϕ_0])^2 + E[ϕ_0] + R_1.   (16)
In (9) of Theorem 3, setting Φ_poly = ϕ_1^2 (d = 2), Φ_linear = ϕ_0 and f(x) = −x, we get

E[ϕ_0 ϕ_1^2] ≥ E[ϕ_1^2](E[ϕ_0] − 2σ_Chao) = E[ϕ_1^2]E[ϕ_0] − 2σ_Chao E[ϕ_1^2].   (17)

Using (17) in the negative third term of (16) and rearranging, we get

E(Û_mc, P) ≤ (E[ϕ_1^2]/(2E[ϕ_2]) − E[ϕ_0])^2 + R_1 + 2σ_Chao E[ϕ_1^2]/E[ϕ_2] + 4E[ϕ_0 ϕ_1^2]/E[ϕ_2]^2 + E[ϕ_0].   (18)

To get the statement of the lemma, the last three terms above are bounded and combined into the remainder as follows. Using ϕ_i ≤ k in the numerators of the last three terms in (18), replacing E[ϕ_2] by the smaller E[ϕ_2] − 4σ_Chao in the denominators, and observing that

E[ϕ_2] − 4σ_Chao ≤ E[ϕ_2] = (1/2) Σ_{x∈X} (np_x)^2 e^{−np_x} ≤ 2e^{−2}k   (20)

(which follows because x^2 e^{−x} ≤ 4e^{−2} for x ≥ 0), each of these terms is also an absolute constant times a power of k divided by a power of E[ϕ_2] − 4σ_Chao. Collecting all such terms together with R_1 gives the remainder R(k, E[ϕ_2]) and the statement of the lemma.

Thus, to bound the MSE of the Chao estimator, it remains to bound E[ϕ_1^2]/(2E[ϕ_2]) − E[ϕ_0].

Lemma 6.
For any P ∈ ∆_k,

−k e^{−n/k}/(1 + n/(kα)) ≤ E[ϕ_1^2]/(2E[ϕ_2]) − E[ϕ_0] ≤ k/n,   (22)

where α = 0.3729... solves u = 4e^{−2}e^{−u}. Squaring (22),

(E[ϕ_1^2]/(2E[ϕ_2]) − E[ϕ_0])^2 ≤ k^2 e^{−2n/k}/(1 + n/(kα))^2 + k^2/n^2.   (23)

Proof.
We first prove the upper bound. By Lemma 13 (with h = 2),

E[ϕ_1^2] ≤ (E[ϕ_1])^2 + E[ϕ_1].   (24)

Using (4), the first term above is upper-bounded as (E[ϕ_1])^2 ≤ 2E[ϕ_0]E[ϕ_2]. For the second term, we proceed as follows:

E[ϕ_1] = Σ_{x∈X} e^{−np_x} np_x = 2 Σ_{x∈X} e^{−np_x} ((np_x)^2/2)(1/(np_x)) ≤ 2(k/n) Σ_{x∈X} e^{−np_x} (np_x)^2/2 = 2(k/n) E[ϕ_2],   (25)

where the inequality follows by using p_x ≥ 1/k in the term 1/(np_x). Using (4) and (25) in (24), we get the upper bound of the lemma.

The lower bound is more involved, and we prove it now. By Jensen's inequality,

E[ϕ_1^2]/(2E[ϕ_2]) − E[ϕ_0] ≥ (E[ϕ_1])^2/(2E[ϕ_2]) − E[ϕ_0].

Hence, it suffices to lower bound the RHS above, or to upper bound its negative. For ease of exposition, let λ_x denote np_x for symbol x ∈ X. Recall that E[ϕ_i] = Σ_{x∈X} e^{−λ_x} λ_x^i/i!. Fixing the size of the alphabet m := |X| and letting λ = [λ_1, ..., λ_m], we define

B(λ) := E[ϕ_0] − (E[ϕ_1])^2/(2E[ϕ_2]) = Σ_{i=1}^m e^{−λ_i} − (Σ_{i=1}^m λ_i e^{−λ_i})^2 / Σ_{i=1}^m λ_i^2 e^{−λ_i},

where λ ∈ Λ := {v ∈ R^m : v_i ≥ n/k, Σ_i v_i = n}.   (26)

We relax the domain of λ to

Λ′ := {v ∈ R^m : v_i ≥ n/k} ⊇ Λ,   (27)

and consider the following optimization problem:

B* = max_{λ ∈ Λ′} B(λ).   (28)

In the rest of this proof, we will show that B* ≤ k e^{−n/k}/(1 + n/(kα)), which implies the lower bound of the lemma. Since B(λ) is continuously differentiable on Λ′, any extremum point λ* = [λ*_1, ..., λ*_m] ∈ Λ′ of B(λ) satisfies, for each i, either λ*_i = n/k or ∂B/∂λ_i |_{λ*} = 0. Differentiating B(λ) partially with respect to λ_i and simplifying, ∂B/∂λ_i factorizes into two factors that are linear in λ_i (up to positive multiples), expressed in terms of

a_{∼i} := Σ_{i′≠i} λ_{i′} e^{−λ_{i′}}  and  b_{∼i} := Σ_{i′≠i} λ_{i′}^2 e^{−λ_{i′}}.   (29–30)
Hence, if ∂B/∂λ_i = 0, then either

λ_i = b_{∼i}/a_{∼i},   (31)

or λ_i solves

(a_{∼i} − 2e^{−λ_i}) λ_i = 2a_{∼i} + b_{∼i}.   (32)

Since (a_{∼i} − 2e^{−x})x is one-to-one from [max(0, log(2/a_{∼i})), ∞) to [0, ∞), a solution for λ_i exists in (32). Also, by (25), 2E[ϕ_2]/E[ϕ_1] ≥ n/k.

Hence, any extremum point λ* = [λ*_1, ..., λ*_m] ∈ Λ′ of B(λ) necessarily has the following form: there exist disjoint sets S_0, S_1 ⊆ [m], with S_2 = [m] \ S_0 \ S_1, such that

λ*_i = n/k for i ∈ S_0,   (33)
λ*_i = λ_c for i ∈ S_1,   (34)
λ*_i = 2 + λ_c for i ∈ S_2,   (35)

where

λ_c = Σ_{i=1}^m (λ*_i)^2 e^{−λ*_i} / Σ_{i=1}^m λ*_i e^{−λ*_i}.   (36)

Letting s_0 = |S_0| and s_1 = |S_1|, (36) can be written as

λ_c = [s_0(n/k)^2 e^{−n/k} + s_1 λ_c^2 e^{−λ_c} + (m − s_0 − s_1)(2 + λ_c)^2 e^{−(2+λ_c)}] / [s_0(n/k) e^{−n/k} + s_1 λ_c e^{−λ_c} + (m − s_0 − s_1)(2 + λ_c) e^{−(2+λ_c)}].   (37)

Cross-multiplying and simplifying, we get

s_0(λ_c − n/k)(n/k) e^{−n/k} = 2(m − s_0 − s_1)(2 + λ_c) e^{−(2+λ_c)}.   (38)

For s_0 = m, we get B(λ*) = 0. For s_0 = 0, (38) forces s_1 = m, which results in B(λ*) = 0. For s_0 + s_1 = m with s_0 ≥ 1, (38) forces λ_c = n/k, which results in B(λ*) = 0. For s_0 ≥ 1 and s_1 ≥ 0 with s_0 + s_1 < m, it is easy to see that there exists a unique λ_c > n/k satisfying (38): the LHS increases linearly from 0 at λ_c = n/k, while the RHS decreases from a non-zero value. For such extremum points, B(λ*) > 0. Therefore, B* is obtained by maximizing B(λ*) over extremum points of this form, and carrying out this maximization over s_0 and s_1 yields B* ≤ k e^{−n/k}/(1 + n/(kα)) with α as in the statement, which completes the proof.
Lemma 7. For the modified Chao estimator, if E[ϕ_2] ≤ n^{3/4}, then for any distribution P ∈ ∆_k, E(Û_mc, P) is bounded by the explicit quantity obtained by substituting E[ϕ_2] ≤ n^{3/4} into the bound of Lemma 11 below; this quantity is o(k^2) for n = Ω(k).   (48)

Recall that in the low collision regime we assume E[ϕ_2] < n^{3/4}. Our strategy is to show that when E[ϕ_2] is small, the number of unseen elements as well as the estimates are small on average. The idea of negative regression between random variables plays a key role in proving Lemma 7.

Negative regression is a strong notion of negative dependence between random variables. It is closely related to negative correlation and negative association [JDP83]. We begin with its definition.

Definition 8. [DR96, Definition 21] Let X := {X_1, ..., X_m} be a set of random variables. X satisfies the negative regression condition if E[f(X_i, i ∈ I) | X_j = t_j, j ∈ J] is non-increasing in each t_j, j ∈ J, for all disjoint I, J ⊆ [m] and every (coordinate-wise) non-decreasing function f.

To make the connection with negative regression, we first introduce the classical balls and bins experiment. Consider a set of n balls and m bins. Each ball is tossed into one of the m bins as per some distribution, independently of the others (the balls need not share the same distribution over bins). There is an intuitive notion of negative dependence in this experiment: if a particular bin, say i, is revealed to hold a comparatively large number of balls, one would expect the other bins to hold fewer balls (because there are fewer balls left to go into the other bins). This notion is formalized in the following theorem.

Theorem 9. [DR96, Theorem 31] The set B = {B_1, B_2, ...}, where B_i is the number of balls in bin i, satisfies the negative regression condition.

By the negative regression condition, we can conclude that for all bins j ≠ i, E[B_j | B_i = δ] is a non-increasing function of δ.
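This conditional-expectation behavior is easy to observe empirically. The following Monte Carlo sketch (our own illustration, with uniform ball placement) estimates E[B_1 | B_0 = δ] and shows it falling as δ grows:

```python
import random
from collections import defaultdict

rng = random.Random(0)
n_balls, m_bins, trials = 20, 4, 20000

sums = defaultdict(float)   # sum of bin-1 loads, keyed by the observed bin-0 load
cnts = defaultdict(int)     # number of trials with that bin-0 load
for _ in range(trials):
    bins = [0] * m_bins
    for _ in range(n_balls):
        bins[rng.randrange(m_bins)] += 1    # each ball tossed uniformly at random
    sums[bins[0]] += bins[1]
    cnts[bins[0]] += 1

# Empirical E[B_1 | B_0 = delta], for deltas observed often enough to average
cond = {d: sums[d] / cnts[d] for d in sorted(cnts) if cnts[d] >= 200}
print(cond)
```

For uniform placement the exact value is E[B_1 | B_0 = δ] = (n_balls − δ)/(m_bins − 1), so the printed values fall by 1/3 per unit increase in δ here.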
We are now ready to bring these results into our context.

Theorem 10.
Under the Poisson sampling model, {ϕ_1, ϕ_2, ...} satisfy the negative regression condition.

Proof. The heart of the proof lies in the fact that the Poisson sampling process can be viewed as a balls and bins experiment in which the set {ϕ_i, i ∈ N_0} plays the role of B, as elaborated below. Each symbol x ∈ X is a ball. A ball x is in bin i, i = 0, 1, 2, ..., if N_x = i. Each ball x is put into bin i with probability P(N_x = i), independently of the other balls. The total number of balls in bin i is therefore Σ_{x∈X} 1{N_x = i} = ϕ_i. The vector B of Theorem 9 is therefore equivalent to {ϕ_i, i ∈ N_0}. This concludes the proof.

In order to prove Lemma 7, we first use negative regression to establish an upper bound on the MSE for any distribution P ∈ ∆_k as a function of E[ϕ_2], and show that if E[ϕ_2] is small, this bound is small too.

Lemma 11.
For the modified Chao estimator, for any distribution P ∈ ∆_k,

E(Û_mc, P) ≤ (4 + 8a)(k/n)^4 E[ϕ_2]^2 + (28a(k/n)^3 + 2(k/n)^2 + 0.5a(k/n)) E[ϕ_2] + 12a(k/n)^2,   (49)

where a = 1/(1 − e^{−2}).

Proof. From the definition of the MSE, and dropping the non-positive cross term,

E(Û_mc, P) = E[(ϕ_1^2/(2(ϕ_2 + 1)) − ϕ_0)^2] ≤ E[(ϕ_1^2/(2(ϕ_2 + 1)))^2] + E[ϕ_0^2].   (50)

In order to upper bound the MSE, we separately upper bound the two quantities on the right. By the definition of conditional expectation,

E[(ϕ_1^2/(2(ϕ_2 + 1)))^2] = (1/4) Σ_{j∈N_0} E[ϕ_1^4 | ϕ_2 = j] (j + 1)^{−2} P(ϕ_2 = j).   (51)

Using the negative regression of {ϕ_i, i = 0, 1, 2, ...}, we can conclude that for all δ ≠ 0, E[ϕ_1^4 | ϕ_2 = 0] ≥ E[ϕ_1^4 | ϕ_2 = δ]. Therefore,

E[(ϕ_1^2/(2(ϕ_2 + 1)))^2] ≤ (1/4) E[ϕ_1^4 | ϕ_2 = 0] · E[(ϕ_2 + 1)^{−2}] ≤ E[ϕ_1^4]/(4(1 − e^{−2})) · E[(ϕ_2 + 1)^{−2}],   (52)

where the last step uses Lemma 15 to bound the conditional expectation of ϕ_1^4. Using Lemma 13 (with h = 4) in the form E[ϕ_1^4] ≤ (E[ϕ_1])^4 + 7(E[ϕ_1])^3 + 6(E[ϕ_1])^2 + E[ϕ_1], bounding E[(ϕ_2 + 1)^{−2}] ≤ 2E[ϕ_2]^{−2} (from Theorem 4 with f(x) = (1 + x)^{−2} dominated by (0, 0, 2, 0, ...)) for the first three terms and E[(ϕ_2 + 1)^{−2}] ≤ 1 for the last, and then using E[ϕ_1] ≤ 2(k/n)E[ϕ_2] from (25), we get

E[(ϕ_1^2/(2(ϕ_2 + 1)))^2] ≤ 8a(k/n)^4 E[ϕ_2]^2 + 28a(k/n)^3 E[ϕ_2] + 12a(k/n)^2 + 0.5a(k/n) E[ϕ_2].   (53)

We now upper bound the second term in (50). Using Lemma 14 (h = 2, j = 0),

E[ϕ_0^2] ≤ (E[ϕ_0])^2 + E[ϕ_0] ≤ 4(k/n)^4 E[ϕ_2]^2 + 2(k/n)^2 E[ϕ_2],   (54)

where the last step follows because E[ϕ_0] ≤ 2(k/n)^2 E[ϕ_2], which is proved as follows.
E[ϕ_0] = Σ_{x∈X} e^{−np_x} = 2 Σ_{x∈X} e^{−np_x} ((np_x)^2/2)(1/(np_x)^2) ≤ 2(k/n)^2 Σ_{x∈X} e^{−np_x} (np_x)^2/2 = 2(k/n)^2 E[ϕ_2],   (55)

where the inequality uses p_x ≥ 1/k. Combining (53) and (54) proves the lemma.

Proof of Lemma 7.
Lemma 7 follows by substituting the upper bound E[ϕ_2] ≤ n^{3/4} into Lemma 11. Combining Lemmas 5, 6 and 7, the proof of Theorem 2 is complete in all cases. In the rest of the paper, we provide detailed proofs of Theorems 3 and 4.

4 Analysis of rational estimators
We use the notation N to denote the set of natural numbers {1, 2, ...} and N_0 to denote the set of whole numbers N ∪ {0}. In addition, for i ∈ N we use [i] to denote the set {1, ..., i}. The degree-d homogeneous Φ_poly can be expressed as

Φ_poly = Σ_{i^d} α_{i^d} ϕ_{i_1} ··· ϕ_{i_d} = Σ_{i^d} α_{i^d} Σ_{x^d ∈ X^d} Π_{t=1}^d 1{N_{x_t} = i_t},   (56)

where i^d = [i_1, ..., i_d] with i_j ∈ {0, 1, ...}, x^d = [x_1, ..., x_d] with x_j ∈ X, and α_{i^d} ∈ R. Recall that

Φ_linear = Σ_{i≥1} Σ_{u∈X} β_i 1{N_u = i}.

Note that both Φ_poly and Φ_linear are functions of N_x, x ∈ X. To proceed further, we make the following two definitions:

S(i^d, x^d) := Π_{t=1}^d 1{N_{x_t} = i_t},   (57)

T(i^d, x^d) := Φ_linear |_{N_{x_1} = i_1, ..., N_{x_d} = i_d}.   (58)

In words, T(i^d, x^d) denotes the evaluation of Φ_linear after setting N_{x_j} = i_j, j = 1, ..., d. Now,

Φ_poly f(Φ_linear) = Σ_{i^d} α_{i^d} Σ_{x^d} S(i^d, x^d) f(Φ_linear) = Σ_{i^d, x^d} α_{i^d} S(i^d, x^d) f(T(i^d, x^d)),   (59)

where the last equality follows because S(i^d, x^d) = 1 only when N_{x_j} = i_j, j = 1, ..., d. Note that S(i^d, x^d) and T(i^d, x^d) do not involve any common N_x terms and are independent. Taking expectations in (59) and using this independence,

E[Φ_poly f(Φ_linear)] = Σ_{i^d, x^d} α_{i^d} E[S(i^d, x^d)] E[f(T(i^d, x^d))].   (60)

This equality is the main starting point for the proofs. For given i^d and x^d, we can write T(i^d, x^d) as follows:

T(i^d, x^d) = Σ_{j=1}^d β_{i_j} + Σ_{u ∈ X \ x^d} Σ_{i≥1} β_i 1{N_u = i}.   (61)

To compare the above with Φ_linear, we rewrite Φ_linear as

Φ_linear = Σ_{j=1}^d Σ_{i≥1} β_i 1{N_{x_j} = i} + Σ_{u ∈ X \ x^d} Σ_{i≥1} β_i 1{N_u = i}.   (62)

Hence, we see that

T(i^d, x^d) = Φ_linear − Σ_{j=1}^d Σ_{i≥1} β_i 1{N_{x_j} = i} + Σ_{j=1}^d β_{i_j}.   (63)

Since β_i ≤ 1, from (63) we have T(i^d, x^d) ≤ Φ_linear + d. Since f is non-increasing, we have

E[f(T(i^d, x^d))] ≥ E[f(Φ_linear + d)].
(64)

Using the above in (60), we get the lower bound in (8). Since f is concave, by Jensen's inequality,

E[f(T(i^d, x^d))] ≤ f(E[T(i^d, x^d)]).   (65)

In (63), dropping the third (positive) term on the RHS and taking expectations, we get

E[T(i^d, x^d)] ≥ E[Φ_linear] − Σ_{j=1}^d Σ_{i≥1} β_i P(N_{x_j} = i) ≥ E[Φ_linear] − Σ_{j=1}^d Σ_{i≥1} β_i/√(2πi) = E[Φ_linear] − dσ,   (66)

where the second inequality follows because P(N_u = i) = e^{−np_u}(np_u)^i/i! ≤ i^i e^{−i}/i! ≤ 1/√(2πi) by Stirling's approximation, for u ∈ X and i ≥ 1. Using the above in (65), since f is non-increasing, we get

E[f(T(i^d, x^d))] ≤ f(E[Φ_linear] − dσ).   (67)

Using the above in (60), we get the upper bound in (9).

If f is not concave, arriving at upper bounds as in Theorem 3 is less straightforward. However, such bounds are necessary to analyze the Chao estimator, as the function (1 + ϕ_2)^{−2} appearing in (6) is not concave. An additional property of Φ_linear is required to arrive at upper bounds on the approximation error for such functions. Observe that Φ_linear = Σ_{i≥1} β_i ϕ_i can be expanded as

Φ_linear = Σ_{x∈X} Σ_i β_i 1{N_x = i} = Σ_{x∈X} Y_x,   (68)

where each Y_x is a discrete random variable that takes value β_i with probability P(N_x = i). The restriction β_i ∈ [0, 1] in Theorem 4 implies that Supp(Y_x) ⊆ [0, 1]. In the Poisson sampling model, the random variables Y_x, x ∈ X, are independent. Hence, Φ_linear is a sum of independent discrete random variables, each supported on a subset of [0, 1]. We term such random variables generalized Poisson binomial random variables. The crucial result is the following.

Lemma 12. If X = Σ_i X_i with X_i ∈ [0, 1] being independent discrete random variables, then

∫_0^u E[z^X] dz ≤ E[X]^{−1} E[u^X],  u ∈ (0, 1].   (69)

Proof.
See Section A for a proof.

Let f_r(X) = Π_{j=1}^r (X + j)^{−1}, with f_0(X) = 1. It is easy to see that

E[f_r(X)] = ∫_0^{u_0} ∫_0^{u_1} ··· ∫_0^{u_{r−1}} E[u_r^X] du_r ··· du_2 du_1 |_{u_0 = 1}.   (70)

Using Lemma 12 r times, we get

E[f_r(X)] ≤ E[X]^{−r}.   (71)

Note that T(i^d, x^d) is a generalized Poisson binomial random variable and satisfies Lemma 12.

Now, we are ready to prove the theorem. From the hypothesis of the theorem, (f′_0, f′_1, ...) ∈ V dominates f. Thus we have

E[f(T(i^d, x^d))] ≤ Σ_{t≥0} f′_t E[f_t(T(i^d, x^d))] ≤ Σ_{t≥0} f′_t E[T(i^d, x^d)]^{−t} ≤ Σ_{t≥0} f′_t (E[Φ_linear] − dσ)^{−t},   (72)

where the second inequality uses (71) and the third uses (66). Using the above in (60) concludes the proof of the theorem.

References

[Ach+13] Jayadev Acharya et al. "Optimal probability estimation with applications to prediction and classification". In:
Conference on Learning Theory. 2013, pp. 764–796.
[Cha05] Anne Chao. "Species estimation and applications". In: Encyclopedia of Statistical Sciences (2005).
[Cha84] Anne Chao. "Nonparametric estimation of the number of classes in a population". In: Scandinavian Journal of Statistics (1984), pp. 265–270.
[Col+12] Robert K. Colwell et al. "Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages". In: Journal of Plant Ecology (2012).
[DR96] Devdatt P. Dubhashi and Desh Ranjan. Balls and Bins: A Study in Negative Dependence. BRICS Report Series. BRICS, Department of Computer Science, University of Aarhus, 1996. URL: https://books.google.co.in/books?id=mxuxtgAACAAJ.
[DS13] Timothy Daley and Andrew D. Smith. "Predicting the molecular complexity of sequencing libraries". In: Nature Methods (2013).
[ET76] Bradley Efron and Ronald Thisted. "Estimating the number of unseen species: How many words did Shakespeare know?" In: Biometrika (1976).
[FH07] Dinei Florencio and Cormac Herley. "A large-scale study of web password habits". In: Proceedings of the 16th International Conference on World Wide Web. ACM, 2007, pp. 657–666.
[GT56] I. J. Good and G. H. Toulmin. "The number of new species, and the increase in population coverage, when a sample is increased". In: Biometrika (1956).
[Gao+07] Zhan Gao et al. "Molecular analysis of human forearm superficial skin bacterial biota". In: Proceedings of the National Academy of Sciences (2007).
[HO19] Yi Hao and Alon Orlitsky. "Doubly-competitive distribution estimation". In: International Conference on Machine Learning. 2019, pp. 2614–2623.
[Haa+95] Peter J. Haas et al. "Sampling-based estimation of the number of distinct values of an attribute". In: VLDB. Vol. 95. 1995, pp. 311–322.
[Hug+01] Jennifer B. Hughes et al. "Counting the uncountable: statistical approaches to estimating microbial diversity". In: Applied and Environmental Microbiology (2001).
[JDP83] Kumar Joag-Dev and Frank Proschan. "Negative association of random variables with applications". In: The Annals of Statistics (1983). ISSN: 00905364.
[Lem+11] Leandro N. Lemos et al. "Rethinking microbial diversity analysis in the high throughput sequencing era". In: Journal of Microbiological Methods (2011). ISSN: 0167-7012. DOI: https://doi.org/10.1016/j.mimet.2011.03.014.
[OS15] Alon Orlitsky and Ananda Theertha Suresh. "Competitive distribution estimation: Why is Good-Turing good". In: Advances in Neural Information Processing Systems. 2015, pp. 2134–2142.
[OSW16] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. "Optimal prediction of the number of unseen species". In: Proceedings of the National Academy of Sciences (2016).
[Pas+01] Bruce J. Paster et al. "Bacterial diversity in human subgingival plaque". In: Journal of Bacteriology (2001).
[Raj+19] Nived Rajaraman, Prafulla Chandra, Andrew Thangaraj, and Ananda Theertha Suresh. In: IEEE International Symposium on Information Theory (ISIT) (2019), pp. 46–51.
[Ras+09] Sofya Raskhodnikova et al. "Strong lower bounds for approximating distribution support size and the distinct elements problem". In: SIAM Journal on Computing (2009).
[SB84] Eric P. Smith and Gerald van Belle. "Nonparametric estimation of species richness". In: Biometrics (1984).
[SCL03] Tsung-Jen Shen, Anne Chao, and Chih-Feng Lin. "Predicting the number of new species in further taxonomic sampling". In: Ecology (2003).
[TE87] Ronald Thisted and Bradley Efron. "Did Shakespeare write a newly-discovered poem?" In: Biometrika (1987).
[VV11] Gregory Valiant and Paul Valiant. "Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs". In: Proceedings of the 43rd Annual ACM Symposium on Theory of Computing. 2011, pp. 685–694.
[VV13] Paul Valiant and Gregory Valiant. "Estimating the unseen: improved estimators for entropy and other properties". In: Advances in Neural Information Processing Systems. 2013, pp. 2157–2165.
[WY15] Yihong Wu and Pengkun Yang. "Chebyshev polynomials, moment matching, and optimal estimation of the unseen". In: preprint arXiv:1504.01227 (Apr. 2015).
A Proof of Lemma 12
For any discrete random variable $X$, the characteristic polynomial $C_X : \mathbb{R}_{>0} \to \mathbb{R}$ is defined as $C_X(z) = \mathbb{E}[z^X]$ wherever this expectation exists. To prove Lemma 12, we show that $C_X$ satisfies, for $y \in (0,1]$,
$$C_X(y) \le \mathbb{E}[X]^{-1} (DC_X)(y), \qquad (73)$$
where $D$ denotes the differentiation operator $f(t) \mapsto \frac{\mathrm{d}f}{\mathrm{d}t}(t)$. Integrating both sides from $y = 0$ to $u$ completes the proof of Lemma 12.

Consider a generalized Poisson binomial random variable $X = \sum_{i=1}^{m} X_i$, where the $X_i$ are independent and each $X_i$ is supported on some $\{d_{ij}\}_j \subseteq [0,1]$. From the definition of the characteristic polynomial, we write
$$C_X(z) = \mathbb{E}[z^X] = \mathbb{E}\big[z^{\sum_i X_i}\big] \overset{(i)}{=} \prod_{i=1}^{m} C_{X_i}(z) = \prod_{i=1}^{m} \sum_{j} P(X_i = d_{ij})\, z^{d_{ij}}, \qquad (74)$$
where $(i)$ follows from the independence of the $X_i$'s. Differentiating both sides of (74), it follows that
$$(DC_X)(z) = \sum_{i=1}^{m} \Big(\sum_{j : d_{ij} \neq 0} P(X_i = d_{ij})\, d_{ij}\, z^{d_{ij}-1}\Big) \prod_{k \neq i} \sum_{j'} P(X_k = d_{kj'})\, z^{d_{kj'}}$$
$$\overset{(i)}{\ge} \sum_{i=1}^{m} \Big(\sum_{j : d_{ij} \neq 0} P(X_i = d_{ij})\, d_{ij}\, z^{d_{ij}-1}\Big) \prod_{k} \sum_{j'} P(X_k = d_{kj'})\, z^{d_{kj'}} = \sum_{i=1}^{m} \Big(\sum_{j : d_{ij} \neq 0} P(X_i = d_{ij})\, d_{ij}\, z^{d_{ij}-1}\Big) C_X(z)$$
$$\overset{(ii)}{\ge} C_X(z) \sum_{i=1}^{m} \sum_{j} P(X_i = d_{ij})\, d_{ij} = C_X(z)\, \mathbb{E}[X],$$
where $(i)$ follows because $\sum_{j} P(X_i = d_{ij})\, z^{d_{ij}} \le 1$ for $z \in (0,1]$, so reinstating the $i$-th factor can only decrease the product, and $(ii)$ follows because $z^{d_{ij}-1} \ge 1$ for $z \in (0,1]$ and all $i$ and $j$ with $d_{ij} \in [0,1]$. Rearranging yields (73).

B Inequalities for moments of prevalences
Lemma 13.
For all $j \ge 1$ and $h \ge 1$,
$$\mathbb{E}[\varphi_j^h] \le \sum_{k=1}^{h} c_{h,k}\, \mathbb{E}[\varphi_j]^k, \quad \text{where } c_{h,1} = 1 \text{ and } c_{h,k} = \sum_{l=k-1}^{h-1} \binom{h-1}{l} c_{l,k-1} \text{ for } k \ge 2. \qquad (75)$$
Proof.
Since $\varphi_j = \sum_{x \in \mathcal{X}} \mathbb{1}\{N_x = j\}$ is a sum of independent Bernoulli random variables, it has moment generating function $M(t) = \prod_{x \in \mathcal{X}} \big(1 - P(N_x = j) + P(N_x = j)\, e^t\big)$. Let us define the functions $E(t,x) \overset{\text{def}}{=} P(N_x = j)\, e^t$ and $Z(t,S) \overset{\text{def}}{=} \prod_{x' \in S} \big(1 - P(N_{x'} = j) + P(N_{x'} = j)\, e^t\big)$. Note that $Z(t, \mathcal{X}) = M(t)$ and that $E$ satisfies the property $(DE)(t,x) = E(t,x)$, where $D$ denotes the differentiation operator $f(t) \mapsto \frac{\mathrm{d}}{\mathrm{d}t} f(t)$. Then, for all $t \ge 0$ and nonempty $S \subseteq \mathcal{X}$, differentiating $Z(t,S)$ results in
$$(DZ)(t,S) = \sum_{x \in S} P(N_x = j)\, e^t \prod_{\substack{x' \in S \\ x' \neq x}} \big(1 - P(N_{x'} = j) + P(N_{x'} = j)\, e^t\big) = \sum_{x \in S} E(t,x) \cdot Z(t, S \setminus \{x\}). \qquad (76)$$
For functions $f$ and $g$, by the general Leibniz rule,
$$D^n(fg) = \sum_{k=0}^{n} \binom{n}{k} (D^{n-k} f)(D^k g). \qquad (77)$$
Using (77) with $f = Z(t, S \setminus \{x\})$ and $g = E(t,x)$, taking $n = m-1$ and using the fact that $(DE)(t,x) = E(t,x)$,
$$D^{m-1}\big(E(t,x) \cdot Z(t, S \setminus \{x\})\big) = E(t,x) \sum_{k=0}^{m-1} \binom{m-1}{k} D^k Z(t, S \setminus \{x\}).$$
Summing both sides over $x \in S$ and using (76),
$$D^m Z(t,S) = \sum_{x \in S} E(t,x) \sum_{k=0}^{m-1} \binom{m-1}{k} D^k Z(t, S \setminus \{x\}). \qquad (78)$$
Now, observe that since $1 - P(N_x = j) + P(N_x = j)\, e^t \ge 1$ for $t \ge 0$, we have $Z(t, S_1) \le Z(t, S_2)$ whenever $S_1 \subseteq S_2$. With this as the base case, an inductive argument using (78) shows that $D^m Z(t, S_1) \le D^m Z(t, S_2)$ whenever $S_1 \subseteq S_2$. Therefore, we may upper-bound (78) by replacing each $D^k Z(t, S \setminus \{x\})$ with $D^k Z(t, S)$, resulting in
$$D^m Z(t,S) \le \sum_{x \in S} E(t,x) \sum_{k=0}^{m-1} \binom{m-1}{k} D^k Z(t,S). \qquad (79)$$
Observe from its definition that $Z(t, \mathcal{X}) = M(t)$, so that $D^k Z(0, \mathcal{X}) = \mathbb{E}[\varphi_j^k]$ and $\sum_{x \in \mathcal{X}} E(0,x) = \mathbb{E}[\varphi_j]$. Therefore, from (79) with $t = 0$, $m = h$ and $S = \mathcal{X}$,
$$\mathbb{E}[\varphi_j^h] \le \mathbb{E}[\varphi_j] \sum_{k=0}^{h-1} \binom{h-1}{k} \mathbb{E}[\varphi_j^k] = \mathbb{E}[\varphi_j] \Big[\, 1 + \sum_{l=1}^{h-1} \binom{h-1}{l} \mathbb{E}[\varphi_j^l] \,\Big]. \qquad (80)$$
For $h = 1$, (75) is trivially true. For $h = 2$, (75) is proved by (80).
Now, as an induction hypothesis, suppose that (75) holds up to and including $h-1$. Using the induction hypothesis in (80), we get
$$\mathbb{E}[\varphi_j^h] \le \mathbb{E}[\varphi_j] \Big[\, 1 + \sum_{l=1}^{h-1} \binom{h-1}{l} \sum_{k=1}^{l} c_{l,k}\, \mathbb{E}[\varphi_j]^k \,\Big] \overset{(a)}{=} \mathbb{E}[\varphi_j] + \sum_{k=1}^{h-1} \Big( \sum_{l=k}^{h-1} \binom{h-1}{l} c_{l,k} \Big) \mathbb{E}[\varphi_j]^{k+1} = \mathbb{E}[\varphi_j] + \sum_{k=2}^{h} \Big( \sum_{l=k-1}^{h-1} \binom{h-1}{l} c_{l,k-1} \Big) \mathbb{E}[\varphi_j]^{k}, \qquad (81)$$
where $(a)$ follows by interchanging the order of summation, and (81) proves the statement of the lemma for $h$. This completes the induction and the proof.

Lemma 14.
For a homogeneous degree-$2$ polynomial $\Phi_{\mathrm{poly}}$ in $\{\varphi_i : 1 \le i \le L\}$ with coefficients in $[0,1]$,
$$\mathbb{E}[\Phi_{\mathrm{poly}}^2] \le \mathbb{E}[\Phi_{\mathrm{poly}}]^2 + 6kL\, \mathbb{E}[\Phi_{\mathrm{poly}}].$$
Proof.
Let $\Phi_{\mathrm{poly}}$ be explicitly given as $\sum_{\mathbf{i} \in \mathbb{N}^2} \alpha_{\mathbf{i}}\, \varphi_{i_1} \varphi_{i_2}$, where $\mathbf{i} = (i_1, i_2)$. Then,
$$\mathbb{E}[\Phi_{\mathrm{poly}}^2] = \mathbb{E}\Big[ \sum_{\mathbf{i} \in \mathbb{N}^2} \sum_{\mathbf{j} \in \mathbb{N}^2} \alpha_{\mathbf{i}} \alpha_{\mathbf{j}} \prod_{i' \in \mathbf{i}} \varphi_{i'} \prod_{j' \in \mathbf{j}} \varphi_{j'} \Big].$$
Expanding each $\varphi_{i'}$ as $\sum_{x \in \mathcal{X}} \mathbb{1}\{N_x = i'\}$ and rearranging the summations,
$$\mathbb{E}[\Phi_{\mathrm{poly}}^2] = \mathbb{E}\Big[ \sum_{\mathbf{i} \in \mathbb{N}^2} \alpha_{\mathbf{i}} \sum_{\mathbf{x} \in \mathcal{X}^2} \prod_{k \in [2]} \mathbb{1}\{N_{x_k} = i_k\} \Big( \sum_{\mathbf{j} \in \mathbb{N}^2} \alpha_{\mathbf{j}} \prod_{j' \in \mathbf{j}} \sum_{y} \mathbb{1}\{N_y = j'\} \Big) \Big]$$
$$\le \mathbb{E}\Big[ \sum_{\mathbf{i} \in \mathbb{N}^2} \alpha_{\mathbf{i}} \sum_{\mathbf{x} \in \mathcal{X}^2} \underbrace{\prod_{k \in [2]} \mathbb{1}\{N_{x_k} = i_k\}}_{S(\mathbf{x}, \mathbf{i})}\; \underbrace{\sum_{t=0,1,2} \sum_{\substack{\mathbf{j} \in \mathbb{N}^2 \\ |\mathbf{j} \cap \mathbf{i}| = t}} \alpha_{\mathbf{j}} \prod_{j' \in \mathbf{j}} \Big( t + \sum_{y \notin \mathbf{x}} \mathbb{1}\{N_y = j'\} \Big)}_{T(\mathbf{x}, \mathbf{i})} \Big].$$
Observe that the terms $S(\mathbf{x}, \mathbf{i})$ and $T(\mathbf{x}, \mathbf{i})$ are independent, because $S(\mathbf{x}, \mathbf{i})$ depends only on $N_{x_1}$ and $N_{x_2}$, while $T(\mathbf{x}, \mathbf{i})$ depends only on $\{N_y : y \in \mathcal{X} \setminus \mathbf{x}\}$. Therefore, the expectation of their product is equal to the product of their expectations:
$$\mathbb{E}[\Phi_{\mathrm{poly}}^2] \le \sum_{\mathbf{i} \in \mathbb{N}^2} \alpha_{\mathbf{i}} \sum_{\mathbf{x} \in \mathcal{X}^2} \mathbb{E}\big[S(\mathbf{x}, \mathbf{i})\big] \cdot \mathbb{E}\big[T(\mathbf{x}, \mathbf{i})\big]. \qquad (82)$$
Let us now upper-bound $T(\mathbf{x}, \mathbf{i})$. By its definition,
$$T(\mathbf{x}, \mathbf{i}) = \sum_{t=0,1,2} \sum_{\substack{\mathbf{j} \in \mathbb{N}^2 \\ |\mathbf{j} \cap \mathbf{i}| = t}} \alpha_{\mathbf{j}} \prod_{j' \in \mathbf{j}} \Big( t + \sum_{y \notin \mathbf{x}} \mathbb{1}\{N_y = j'\} \Big) \le \sum_{t=0,1,2} \sum_{\substack{\mathbf{j} \in \mathbb{N}^2 \\ |\mathbf{j} \cap \mathbf{i}| = t}} \alpha_{\mathbf{j}} \prod_{j' \in \mathbf{j}} (t + \varphi_{j'}),$$
where the last inequality follows by upper-bounding $\sum_{y \notin \mathbf{x}} \mathbb{1}\{N_y = j'\}$ by $\varphi_{j'}$. Therefore,
$$T(\mathbf{x}, \mathbf{i}) \le \sum_{\mathbf{j} \in \mathbb{N}^2} \alpha_{\mathbf{j}} \prod_{j' \in \mathbf{j}} \varphi_{j'} + \sum_{t=1,2} \sum_{\substack{\mathbf{j} \in \mathbb{N}^2 \\ |\mathbf{j} \cap \mathbf{i}| = t}} \alpha_{\mathbf{j}} \big( t(\varphi_{j_1} + \varphi_{j_2}) + t^2 \big) = \Phi_{\mathrm{poly}} + \sum_{t=1,2} \sum_{\substack{\mathbf{j} \in \mathbb{N}^2 \\ |\mathbf{j} \cap \mathbf{i}| = t}} \alpha_{\mathbf{j}} \big( t(\varphi_{j_1} + \varphi_{j_2}) + t^2 \big)$$
$$\le \Phi_{\mathrm{poly}} + \sum_{j \in \mathbb{N}} \alpha_{i_1 j} (\varphi_{i_1} + \varphi_j) + \sum_{j \in \mathbb{N}} \alpha_{j i_2} (\varphi_j + \varphi_{i_2}) + \sum_{j \in \mathbb{N}} \alpha_{i_1 j} + \sum_{j \in \mathbb{N}} \alpha_{j i_2} + 2(\alpha_{i_1 i_2} + \alpha_{i_2 i_1})$$
$$\overset{(i)}{\le} \Phi_{\mathrm{poly}} + 2|\mathcal{X}| + 2L|\mathcal{X}| + 2L + 2 = \Phi_{\mathrm{poly}} + 2(L+1)(|\mathcal{X}|+1), \qquad (83)$$
where $(i)$ follows from the fact that $\Phi_{\mathrm{poly}}$ has coefficients in $[0,1]$, so that
$$\alpha_{ij} + \alpha_{ji} \begin{cases} \le 1, & \text{if } i, j \le L, \\ = 0 & \text{otherwise,} \end{cases}$$
and from the fact that $\varphi_{i_1}, \varphi_{i_2}, \varphi_j \le |\mathcal{X}|$. Plugging (83) into (82), we have
$$\mathbb{E}[\Phi_{\mathrm{poly}}^2] \le \mathbb{E}[\Phi_{\mathrm{poly}}]^2 + 2\, \mathbb{E}[\Phi_{\mathrm{poly}}]\, (L+1)(|\mathcal{X}|+1) \le \mathbb{E}[\Phi_{\mathrm{poly}}]^2 + 6kL\, \mathbb{E}[\Phi_{\mathrm{poly}}],$$
where the last inequality follows from the assumptions $0 < |\mathcal{X}| \le k$ and $L \ge 2$.

Finally, we show Lemma 15, which upper-bounds the conditional moments of $\varphi_j$ given $\varphi_2 = 0$ and is used in the low-collision regime.

Lemma 15.
For all $j \neq 2$ and $h \ge 1$,
$$\mathbb{E}\big[\varphi_j^h \,\big|\, \varphi_2 = 0\big] \le \frac{1}{(1 - 2e^{-2})^{\min(|\mathcal{X}|, h)}}\, \mathbb{E}[\varphi_j^h].$$
Proof.
By the definition of conditional expectation,
$$\mathbb{E}\big[\varphi_j^h \,\big|\, \varphi_2 = 0\big] = \frac{1}{P(\varphi_2 = 0)}\, \mathbb{E}\Big[ \Big( \sum_{x \in \mathcal{X}} \mathbb{1}\{N_x = j\} \Big)^{h} \prod_{x \in \mathcal{X}} \mathbb{1}\{N_x \neq 2\} \Big]. \qquad (84)$$
Without loss of generality, denote the domain $\mathcal{X}$ as $\{1, 2, \ldots, |\mathcal{X}|\}$ and write $p_i$ for $p_x$. By the independence of the $N_x$'s, $P(\varphi_2 = 0) = \prod_{x \in \mathcal{X}} P(N_x \neq 2)$. Using the multinomial expansion of the term $\big( \mathbb{1}\{N_1 = j\} + \cdots + \mathbb{1}\{N_{|\mathcal{X}|} = j\} \big)^{h}$ in (84), and noting that $\mathbb{1}\{N_i = j\}\, \mathbb{1}\{N_i \neq 2\} = \mathbb{1}\{N_i = j\}$ since $j \neq 2$,
$$\mathbb{E}\big[\varphi_j^h \,\big|\, \varphi_2 = 0\big] = \frac{1}{\prod_{x \in \mathcal{X}} P(N_x \neq 2)}\, \mathbb{E}\Big[ \sum_{\substack{(h_1, \ldots, h_{|\mathcal{X}|}) \\ \sum_i h_i = h}} \frac{h!}{h_1!\, h_2! \cdots h_{|\mathcal{X}|}!} \prod_{i : h_i \neq 0} \mathbb{1}\{N_i = j\} \prod_{i : h_i = 0} \mathbb{1}\{N_i \neq 2\} \Big].$$
Interchanging the summation and the expectation and again using the independence of the $N_x$'s,
$$\mathbb{E}\big[\varphi_j^h \,\big|\, \varphi_2 = 0\big] = \sum_{\substack{(h_1, \ldots, h_{|\mathcal{X}|}) \\ \sum_i h_i = h}} \frac{h!}{h_1! \cdots h_{|\mathcal{X}|}!} \frac{\prod_{i : h_i \neq 0} P(N_i = j)}{\prod_{i : h_i \neq 0} P(N_i \neq 2)}$$
$$\overset{(a)}{\le} \sum_{\substack{(h_1, \ldots, h_{|\mathcal{X}|}) \\ \sum_i h_i = h}} \frac{h!}{h_1! \cdots h_{|\mathcal{X}|}!} \frac{\prod_{i : h_i \neq 0} P(N_i = j)}{(1 - 2e^{-2})^{|\{i : h_i \neq 0\}|}} \overset{(b)}{\le} \frac{1}{(1 - 2e^{-2})^{\min(|\mathcal{X}|, h)}} \sum_{\substack{(h_1, \ldots, h_{|\mathcal{X}|}) \\ \sum_i h_i = h}} \frac{h!}{h_1! \cdots h_{|\mathcal{X}|}!} \prod_{i : h_i \neq 0} P(N_i = j)$$
$$= \frac{1}{(1 - 2e^{-2})^{\min(|\mathcal{X}|, h)}}\, \mathbb{E}\Big[ \Big( \sum_{x \in \mathcal{X}} \mathbb{1}\{N_x = j\} \Big)^{h} \Big] = \frac{1}{(1 - 2e^{-2})^{\min(|\mathcal{X}|, h)}}\, \mathbb{E}[\varphi_j^h], \qquad (85)$$
where $(a)$ follows because $P(N_i \neq 2) = 1 - \frac{(np_i)^2}{2} e^{-np_i} \ge 1 - 2e^{-2}$, and $(b)$ follows because $|\{i : h_i \neq 0\}| \le \min(|\mathcal{X}|, h)$, since $\sum_i h_i = h$ with each nonzero $h_i \ge 1$. This completes the proof.
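The proofs above are purely analytic, but the key inequality behind Lemma 12, $(DC_X)(z) \ge \mathbb{E}[X]\, C_X(z)$ for $z \in (0,1]$, can be sanity-checked numerically. The sketch below does so for a Poisson binomial variable (each $X_i \sim \mathrm{Bernoulli}(p_i)$, i.e. $d_{ij} \in \{0,1\}$); the function names and the probabilities are illustrative choices, not from the paper.

```python
# Numeric sanity check of (D C_X)(z) >= E[X] * C_X(z) for z in (0, 1],
# for a Poisson binomial X = X_1 + ... + X_m with X_i ~ Bernoulli(p_i).

def char_poly(ps, z):
    """C_X(z) = E[z^X] = prod_i (1 - p_i + p_i * z)."""
    out = 1.0
    for p in ps:
        out *= 1.0 - p + p * z
    return out

def char_poly_deriv(ps, z):
    """(D C_X)(z), computed by the product rule."""
    total = 0.0
    for i, p in enumerate(ps):
        term = p
        for k, q in enumerate(ps):
            if k != i:
                term *= 1.0 - q + q * z
        total += term
    return total

ps = [0.1, 0.4, 0.7, 0.25]   # arbitrary Bernoulli parameters
mean = sum(ps)               # E[X]
ok = all(char_poly_deriv(ps, z) >= mean * char_poly(ps, z) - 1e-12
         for z in [0.05, 0.2, 0.5, 0.9, 1.0])
print(ok)  # True; at z = 1 the two sides coincide since C_X(1) = 1
```

At $z = 1$ the inequality holds with equality, which is a useful check that the derivative is implemented correctly.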
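The recursion defining $c_{h,k}$ in Lemma 13 is straightforward to tabulate, and since a prevalence $\varphi_j$ is a sum of independent Bernoulli indicators, the moment bound (75) can be verified exactly for a small example by enumerating outcomes. This is an illustrative sketch; the Bernoulli probabilities below are arbitrary.

```python
from itertools import product
from math import comb

def coeffs(hmax):
    """Tabulate c_{h,k} from c_{h,1} = 1 and
    c_{h,k} = sum_{l=k-1}^{h-1} C(h-1, l) c_{l,k-1}, as in (75)."""
    c = {}
    for h in range(1, hmax + 1):
        c[(h, 1)] = 1
        for k in range(2, h + 1):
            c[(h, k)] = sum(comb(h - 1, l) * c[(l, k - 1)]
                            for l in range(k - 1, h))
    return c

def moment(qs, h):
    """E[(sum_i B_i)^h] for independent B_i ~ Bernoulli(q_i), by enumeration."""
    total = 0.0
    for bits in product([0, 1], repeat=len(qs)):
        prob = 1.0
        for b, q in zip(bits, qs):
            prob *= q if b else 1.0 - q
        total += prob * sum(bits) ** h
    return total

qs = [0.3, 0.5, 0.2, 0.6]    # arbitrary probabilities P(N_x = j)
c = coeffs(5)
mean = moment(qs, 1)         # E[phi_j]
ok = all(moment(qs, h) <= sum(c[(h, k)] * mean ** k for k in range(1, h + 1)) + 1e-9
         for h in range(1, 6))
print(ok)  # True
```

For instance, the recursion gives $c_{3,1} = 1$, $c_{3,2} = 3$, $c_{3,3} = 1$, so the $h = 3$ case of (75) reads $\mathbb{E}[\varphi_j^3] \le \mathbb{E}[\varphi_j] + 3\,\mathbb{E}[\varphi_j]^2 + \mathbb{E}[\varphi_j]^3$.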
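Lemma 14's second-moment bound carries a lot of slack, which a small Monte Carlo experiment under Poisson sampling illustrates. Everything below (the uniform source distribution, the choices $k = 6$, $L = 3$, $n = 10$, and coefficients $\alpha_{ij} = 1/2$) is an arbitrary assumption for illustration, not a setup taken from the paper.

```python
import random

random.seed(1)

def poisson(lam):
    """Sample N ~ Poisson(lam) by counting unit-rate exponential arrivals."""
    t, count = 0.0, 0
    while True:
        t += random.expovariate(1.0)
        if t > lam:
            return count
        count += 1

k, L, n, trials = 6, 3, 10, 5000
alpha = 0.5  # coefficient alpha_ij for every i, j <= L
m1 = m2 = 0.0
for _ in range(trials):
    counts = [poisson(n / k) for _ in range(k)]  # N_x ~ Poi(n p_x), uniform p
    phi = [sum(1 for c in counts if c == i) for i in range(L + 1)]
    # Phi_poly = 0.5 * (phi_1 + phi_2 + phi_3)^2, a degree-2 homogeneous
    # polynomial with coefficients in [0, 1]
    Phi = alpha * sum(phi[i] * phi[j]
                      for i in range(1, L + 1) for j in range(1, L + 1))
    m1 += Phi / trials
    m2 += Phi ** 2 / trials

print(m2 <= m1 ** 2 + 6 * k * L * m1)  # True: the bound holds with room to spare
```

In this instance $\Phi_{\mathrm{poly}} \le k^2/2 = 18$ pointwise, so the empirical second moment is at most $18$ times the empirical mean, comfortably below the $6kL = 108$ multiplier in the lemma.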
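Lemma 15 can also be checked exactly in a tiny example. Under Poisson sampling the indicators $\mathbb{1}\{N_x = j\}$ are independent Bernoullis, and conditioning on $\varphi_2 = 0$ rescales each success probability to $P(N_x = j)/P(N_x \neq 2)$, an increase by at most $1/(1 - 2e^{-2})$ since $P(N_x = 2) = \frac{(np_x)^2}{2} e^{-np_x} \le 2e^{-2}$. The rates below are arbitrary; this is an illustrative sketch, not code from the paper.

```python
from itertools import product
from math import exp, factorial

def pois(lam, j):
    """Poisson pmf: P(N = j) for N ~ Poisson(lam)."""
    return lam ** j * exp(-lam) / factorial(j)

def moment(qs, h):
    """E[(sum_x B_x)^h] for independent B_x ~ Bernoulli(q_x), by enumeration."""
    total = 0.0
    for bits in product([0, 1], repeat=len(qs)):
        prob = 1.0
        for b, q in zip(bits, qs):
            prob *= q if b else 1.0 - q
        total += prob * sum(bits) ** h
    return total

lams = [0.5, 1.0, 2.0, 3.0]  # arbitrary rates n * p_x
j = 1                        # any j != 2
qs = [pois(l, j) for l in lams]                             # P(N_x = j)
qs_cond = [pois(l, j) / (1.0 - pois(l, 2)) for l in lams]   # P(N_x = j | N_x != 2)

bound_ok = all(
    moment(qs_cond, h)
    <= moment(qs, h) / (1 - 2 * exp(-2)) ** min(len(lams), h) + 1e-12
    for h in range(1, 5)
)
print(bound_ok)  # True
```

Here `moment(qs_cond, h)` is exactly $\mathbb{E}[\varphi_j^h \mid \varphi_2 = 0]$, because conditioned on $\{N_x \neq 2 \text{ for all } x\}$ the indicators remain independent.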