Sample complexity of the distinct elements problem
SSample complexity of the distinct elements problem
Yihong Wu ∗ Pengkun Yang † January 16, 2018
Abstract
We consider the distinct elements problem, where the goal is to estimate the number ofdistinct colors in an urn containing k balls based on n samples drawn with replacements. Basedon discrete polynomial approximation and interpolation, we propose an estimator with additiveerror guarantee that achieves the optimal sample complexity within O (log log k ) factors, and infact within constant factors for most cases. The estimator can be computed in O ( n ) time foran accurate estimation. The result also applies to sampling without replacement provided thesample size is a vanishing fraction of the urn size.One of the key auxiliary results is a sharp bound on the minimum singular values of a realrectangular Vandermonde matrix, which might be of independent interest. Keywords sampling large population, nonparametric statistics, discrete polynomial approxima-tion, orthogonal polynomials, Vandermonde matrix, minimaxity
AMS 2010 subject classifications
Primary: 62G05; secondary: 62C20, 62D05, 41A05, 41A10 ∗ Department of Statistics and Data Science, Yale University, New Haven, CT, USA, email: [email protected] . † Corresponding author, Department of Electrical and Computer Engineering and the Coordinated Science Lab,University of Illinois at Urbana-Champaign, Urbana, IL, USA, email: [email protected] . a r X i v : . [ m a t h . S T ] J a n ontents Distinct Elements problem 2 (cid:96) -approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 102.3 Minimum singular values of real rectangle Vandermonde matrices . . . . . . . . . . . 122.4 Lagrange interpolating polynomials and Stirling numbers . . . . . . . . . . . . . . . 15 Distinct Elements problem
The
Distinct Elements problem [CCMN00] refers to the following question:
Given n balls randomly drawn from an urn containing k colored balls, how to estimatethe total number of distinct colors in the urn? Originating from ecology, numismatics, and linguistics, this problem is also known as the speciesproblem in the statistics literature [Lo92, BF93]. Apart from the theoretical interests, it has a widearray of applications in various fields, such as estimating the number of species in a populationof animals [FCW43, Goo53], the number of dies used to mint an ancient coinage [Est86], and thevocabulary size of an author [ET76]. In computer science, this problem frequently arises in large-scale databases, network monitoring, and data mining [RRSS09, BYJK +
02, CCMN00], where theobjective is to estimate the types of database entries or IP addresses from limited observations,since it is typically impossible to have full access to the entire database or keep track of all thenetwork traffic. The key challenge in the
Distinct Elements problem is the following: given a smallset of samples where most of the colors are not observed, how to accurately extrapolate the numberof unseens?
The fundamental limit of the
Distinct Elements problem is characterized by the sample complexity,i.e., the smallest sample size needed to estimate the number of distinct colors with a prescribedaccuracy and confidence level. A formal definition is the following:2 efinition 1.
The sample complexity n ∗ ( k, ∆) is the minimal sample size n such that there existsan integer-valued estimator ˆ C based on n balls drawn independently with replacements from theurn, such that P [ | ˆ C − C | ≥ ∆] ≤ . k balls with C different colors. The main results of this paper provide bounds and constant-factor approximations of the sam-ple complexity in various regimes summarized in Table 1, as well as computationally efficientalgorithms. Below we highlight a few important conclusions drawn from Table 1:
From linear to sublinear:
From the result for k . δ ≤ ∆ ≤ ck in Table 1, we conclude that thesample complexity is sublinear in k if and only if ∆ = k − o (1) , which also holds for samplingwithout replacement. To estimate within a constant fraction of balls ∆ = ck for any smallconstant c , the sample complexity is Θ( k log k ), which coincides with the general support sizeestimation problem [VV11a, WY15] (see Section 1.2 for a detailed comparison). However,in other regimes we can achieve better performance by exploiting the discrete nature of the Distinct Elements problem.
From linear to superlinear:
The transition from linear to superlinear sample complexity oc-curs near ∆ = √ k . Although the exact sample complexity near ∆ = √ k is not completelyresolved in the current paper, the lower bound and upper bound in Table 1 differ by a fac-tor of at most log log k . In particular, the estimator via interpolation can achieve ∆ = √ k with n = O ( k log log k ) samples, and achieving a precision of ∆ ≤ k . − o (1) requires strictlysuperlinear sample size.∆ Lower bound Upper bound Estimator ≤ k log k ) Na¨ıveΘ (cid:0) k log k ∆ (cid:1)(cid:104) , √ k (log k ) − δ (cid:105) Interpolation(Section 2.4) (cid:104) √ k (log k ) − δ , k . δ (cid:105) Ω (cid:0) k (cid:0) ∨ log k ∆ (cid:1)(cid:1) O (cid:18) k log log k ∨ log ∆2 k (cid:19) Θ (cid:16) k log k log k ∆ (cid:17) [ k . δ , ck ] (cid:96) -approximation(Section 2.2)[ ck, (0 . − δ ) k ] k exp( − (cid:112) O (log k log log k ))[RRSS09] O (cid:16) k log k (cid:17) Table 1: Summary of the sample complexity n ∗ ( k, ∆), where δ is any sufficiently small constant, c is an absolute positive constant less than 0.5 (same over the table), and the notations a ∧ b and a ∨ b stand for min { a, b } and max { a, b } , respectively. The estimators are linear with coefficientsobtained from either interpolation or (cid:96) -approximation.To establish the sample complexity, our lower bounds are obtained under zero-one loss andour upper bounds are under the (stronger) quadratic loss. Hence we also obtain the following Clearly, since ˆ C − C ∈ Z , we shall assume without loss of generality that ∆ ∈ N , with ∆ = 1 corresponding tothe exact estimation of the number of distinct elements. A more precise result from [RRSS09] is the following: for ∆ ∈ [ ck, . k − k / √ log k ], n ∗ ( k, ∆) ≥ k exp( − (cid:113) O (log k (log log k + log kk/ − ∆ ))). Distinct Elements problem:min ˆ C max k -ball urn E (cid:32) ˆ C − Ck (cid:33) = exp (cid:26) − Θ (cid:18)(cid:18) ∨ n log kk (cid:19) ∧ (cid:16) log k ∨ nk (cid:17)(cid:19)(cid:27) = Θ(1) , n ≤ k log k , exp( − Θ( n log kk )) , k log k ≤ n ≤ k, exp( − Θ(log k )) , k ≤ n ≤ k log k, exp( − Θ( nk )) , n ≥ k log k, where ˆ C denotes an estimator using n samples with replacements and C is the number of distinctcolors in a k -ball urn. Statistics literature
The
Distinct Elements problem is equivalent to estimating the number ofspecies (or classes) in a finite population, which has been extensively studied in the statistics (seesurveys [BF93, GS04]) and the numismatics literature (see survey [Est86]). Motivated by variouspractical applications, a number of statistical models have been introduced for this problem, themost popular four being (cf. [BF93, Figure 1]): • The multinomial model : n samples are drawn uniformly at random with replacement; • The hypergeometric model : n samples are drawn uniformly at random without replacement; • The Bernoulli model : each individual is observed independently with some fixed probability,and thus the total number of samples is a binomial random variable; • The Poisson model : the number of observed samples in each class is independent and Poissondistributed, and thus the total sample size is also a Poisson random variable.These models are closely related: conditioned on the sample size, the Bernoulli model coincideswith the hypergeometric one, and Poisson model coincides with the multinomial one; furthermore,hypergeometric model can simulate multinomial one and is hence more informative. The multino-mial model is adopted as the main focus of this paper and the sample complexity in Definition 1refers to the number of samples with replacement. In the undersampling regime where the samplesize is significantly smaller than the population size, all four models are approximately equivalent.See Appendix A for a rigorous justification and detailed comparisons.Under these models various estimators have been proposed such as unbiased estimators [Goo49],Bayesian estimators [Hil79], variants of Good-Turing estimators [CL92], etc. None of these method-ologies, however, have a provable worst-case guarantee. Finally, we mention a closely related prob-lem of estimating the number of connected components in a graph based on sampled inducedsubgraphs. In the special case where the underlying graph consists of disjoint cliques, the problemis exactly equivalent to the
Distinct Elements problem [Fra78].
Computer science literature
The interests in the
Distinct Elements problem also arise in thedatabase literature, where various intuitive estimators [HOT88, NS90] have been proposed undersimplifying assumptions such as uniformity, and few performance guarantees are available. Morerecent work in [CCMN00,BYKS01] obtained the optimal sample complexity under the multiplicative α is shown to be Θ( k/α ). For this task, it turns out the least favorable scenario isto distinguish an urn with unitary color from one with almost unitary color, the impossibility ofwhich implies large multiplicative error. However, the optimal estimator performs poorly comparedwith others on an urn with many distinct colors [CCMN00], the case where most estimators enjoysmall multiplicative error. In view of the limitation of multiplicative error, additive error is laterconsidered by [RRSS09,Val11]. To achieve an additive error of ck for a constant c ∈ (0 , ), the resultin [CCMN00] only implies an Ω(1 /c ) sample complexity lower bound, whereas a much strongerlower bound scales like k − O ( (cid:113) log log k log k ) obtained in [RRSS09], which is almost linear. Determiningthe optimal sample complexity under additive error is the focus of the present paper.The Distinct Elements problem can be viewed as a special case of the
Support Size problem,where the goal is to estimate the cardinality of the support of an unknown discrete distribution,whose nonzero probabilities are at least k , based on independent samples. Improving previousresults in [VV11a], the optimal sample complexity has been recently determined in [WY15] to beΘ (cid:18) k log k log k ∆ (cid:19) . (1)Samples drawn from a k -ball urn with replacement can be viewed as i.i.d. samples from a distribu-tion supported on the set { k , k , . . . , kk } . From this perspective, any support size estimator, as wellas its performance guarantee, is applicable to the Distinct Elements problem.We briefly describe and compare the strategy to construct estimators in [WY15] and the currentpaper. Both are based on the idea of polynomial approximation , a powerful tool to circumventthe nonexistence of unbiased estimators [LNS99]. The key is to approximate the function to beestimated by a polynomial, whose degree is chosen to balance the approximation error (bias) andthe estimation error (variance). The worst-case performance guarantee for the
Support Size problemin [WY15] is governed by the uniform approximation error over an interval where the probabilitiesmay reside. In contrast, in the
Distinct Elements problem, samples are generated from a distributionsupported on a discrete set of values. Uniform approximation over a discrete subset leads to smallerapproximation error and, in turn, improved sample complexity. It turns out that O ( k log k log k ∆ )samples are sufficient to achieve an additive error of ∆ that satisfies k . O (1) ≤ ∆ ≤ O ( k ), whichstrictly improves the sample complexity (1) for the Support Size problem, thanks to the discretestructure of the
Distinct Elements problem.The
Distinct Elements problem considered here is not to be confused with the formulation inthe streaming literature, where the goal is to approximate the number of distinct elements inthe observations with low space complexity, see, e.g., [FFGM07, KNW10]. There, the proposedalgorithms aim to optimize the memory consumption, but still require a full pass of every ball inthe urn. This is different from the setting in the current paper, where only random samples drawnfrom the urn are available.
The paper is organized as follows: In Section 2 we describe a unified approach to construct esti-mators via discrete polynomial approximation, whose bias is analyzed in Section 2.2 and varianceis upper bounded in Sections 2.3 and 2.4 separately. In Section 3 we obtain lower bounds on thesample complexity in Table 1 which establish the optimality of the proposed estimators. Section 4explains how sample complexity bounds summarized in Table 1 follow from various results in Sec-tions 2 and 3. Connections between the four sampling model mentioned in Section 1.2 are detailedin Appendix A. Proofs of auxiliary results are deferred to Appendix B and Appendix C.5 .4 Notations
All logarithms are with respect to the natural base. The transpose of a matrix A is denoted by A (cid:62) . Let denote the all-one column vector. Let (cid:107) · (cid:107) p denote the vector (cid:96) p -norm, for 1 ≤ p ≤∞ . Let Poi( λ ) be the Poisson distribution with mean λ , Bern( p ) be the Bernoulli distributionwith mean p , Binomial( n, p ) be the binomial distribution with n trials and success probability p ,and Hypergeometric( N, K, n ) be the hypergeometric distribution with probability mass function (cid:0) Kk (cid:1)(cid:0) N − Kn − k (cid:1) / (cid:0) Nn (cid:1) , for 0 ∨ ( n + K − N ) ≤ k ≤ n ∧ K . The n -fold product of a distribution P isdenoted by P ⊗ n . We use standard big- O notations: for any positive sequence { a n } and { b n } , a n = O ( b n ) or a n (cid:46) b n if a n ≤ cb n for some absolute constant c >
0, or equivalently, sup n a n b n < ∞ ; a n = Ω( b n ) or a n (cid:38) b n if b n = O ( a n ); a n = Θ( b n ) or a n (cid:16) b n if both a n = O ( b n ) and b n = O ( a n ); a n = o ( b n ) if lim a n /b n = 0; a n = ω ( b n ) if b n = o ( a n ). Furthermore, the subscript in o n (1) indicatesconvergence in n that is uniform in all other parameters. We use the notations a ∧ b and a ∨ b formin { a, b } and max { a, b } , respectively. For M ∈ N , let [ M ] (cid:44) { , . . . , M } . For α ∈ R and S ⊂ R ,let αS (cid:44) { αx : x ∈ S } . In this section we develop a unified framework to construct linear estimators and analyze its perfor-mance. Note that linear estimators (i.e. linear combinations of fingerprints) have been previouslyused for estimating distribution functionals [Pan04, VV11a, VV11b, WY15]. As commonly done inthe literature, we assume the
Poisson sampling model , where the sample size is a random variablePoi( n ) instead of being exactly n . Under this model, the histograms of the samples, which count thenumber of balls in each color, are independent which simplifies the analysis. Any estimator underthe Poisson sampling model can be easily modified for fixed sample size, and vice versa, thanksto the concentration of the Poisson random variable near its mean. Consequently, the samplecomplexities of these two models are close to each other, as shown in Corollary 1 in Appendix A. Recall that C denotes the number of distinct colors in a urn containing k colored balls. Let k i denotethe number of balls of the i th color in the urn. Then (cid:80) i k i = k and C = (cid:80) i { k i > } . Let X , X , . . . be independently drawn with replacement from the urn. Equivalently, the X i ’s are i.i.d. accordingto a distribution P = ( p i ) i ≥ , where p i = k i /k is the fraction of balls of the i th color. The observeddata are X , . . . , X N , where the sample size N is independent from ( X i ) i ≥ and is distributed asPoi( n ). Under the Poisson model (or any of the sampling models described in Section 1.2), the histograms { N i } are sufficient statistics for inferring any aspect of the urn configuration; here N i isthe number of balls of the i th color observed in the sample, which is independently distributed asPoi( np i ). Furthermore, the fingerprints { Φ j } j ≥ , which are the histogram of the histograms, arealso sufficient for estimating any permutation-invariant distributional property [Pan03, Val11], inparticular, the number of colors. Specifically, the j th fingerprint Φ j denotes the number of colorsthat appear exactly j times. Note that U (cid:44) Φ , the number of unseen colors, is not observed.The na¨ıve estimator, “what you see is what you get,” is simply the number of observed distinctcolors, which can be expressed in terms of fingerprints asˆ C seen = (cid:88) j ≥ Φ j , C = ˆ C seen + U . In turn, our estimator is˜ C = ˆ C seen + ˆ U , (2)which adds a linear correction term ˆ U = (cid:88) j ≥ u j Φ j , (3)where the coefficients u j ’s are to be specified. Since the fingerprints Φ , Φ , . . . are dependent (forexample, they sum up to C ), (3) serves as a linear predictor of U = Φ in terms of the observedfingerprints. Equivalently, in terms of histograms, the estimator has the following decomposableform: ˜ C = ∞ (cid:88) i =1 g ( N i ) , (4)where g : Z + → R satisfies g (0) = 0 and g ( j ) = 1 + u j for j ≥
1. In fact, any estimator that islinear in the fingerprints can be expressed of the decomposable form (4).The main idea to choose the coefficients u j is to achieve a good trade-off between the varianceand the bias. In fact, it is instructive to point out that linear estimators can easily achieve exactlyzero bias, which, however, comes at the price of high variance. To see this, note that the bias ofthe estimator (4) is E [ ˜ C ] − C = (cid:80) i ≥ ( E [ g ( N i )] − | E [ g ( N i ) − | = e − np i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − ∞ (cid:88) j =1 k ji u j ( n/k ) j j ! (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ e − n/k max a ∈ [ k ] | φ ( a ) − | , (5)and φ ( a ) (cid:44) (cid:80) j ≥ a j u j ( n/k ) j j ! is a (formal) power series with φ (0) = 0. The right-hand side of (5)can be made zero by choosing φ to be, e.g., the Lagrange interpolating polynomial that satisfies φ (0) = − φ ( i ) = 0 for i ∈ [ k ], namely, φ ( a ) = ( − k +1 k ! (cid:81) ki =1 ( a − i ); however, this strategyresults in a high-degree polynomial φ with large coefficients, which, in turn, leads to a large varianceof the estimator.To reduce the variance of our estimator, we only use the first L fingerprints in (3) by setting u j = 0 for all j > L , where L is chosen to be proportional to log k . This restricts the polynomialdegree in (5) to at most L and, while possibly incurring bias, reduces the variance. A further reasonfor only using the first few fingerprints is that higher-order fingerprints are almost uncorrelated withthe number of unseens Φ . For instance, if red balls are observed for n/ andΦ j decays exponentially (see Appendix B for a proof). Therefore for L = Θ(log k ), { Φ j } j>L offerlittle predictive power about Φ . Moreover, if a color is observed at most L times, say, N i ≤ L , thisimplies that, with high probability, k i ≤ M , where M = O ( kL/n ), thanks to the concentration ofPoisson random variables. Therefore, effectively we only need to consider those colors that appearin the urn for at most M times, i.e., k i ∈ [ M ], for which the bias is at most | E [ g ( N i ) − | ≤ e − n/k max a ∈ [ M ] | φ ( a ) − | = e − n/k max x ∈ [ M ] /M | p ( x ) − | = e − n/k (cid:107) Bw − (cid:107) ∞ , (6)where p ( x ) (cid:44) φ ( M x ) = (cid:80) Lj =1 w j x j , w = ( w , . . . , w L ) (cid:62) , and w j (cid:44) u j ( M n/k ) j j ! , B (cid:44) /M (1 /M ) · · · (1 /M ) L /M (2 /M ) · · · (2 /M ) L ... ... . . . ...1 1 · · · (7)7s a (partial) Vandermonde matrix. Lastly, since ˆ C seen ≤ C ≤ k , we define the final estimator tobe ˜ C projected to the interval [ ˆ C seen , k ]. We have the following error bound: Proposition 1.
Assume the Poisson sampling model. Let L = α log k, M = βk log kn , (8) for any β > α such that L and M are integers. Let w ∈ R L . Let ˜ C be defined in (2) with u j = w j j !( knM ) j for j ∈ [ L ] and u j = 0 otherwise. Define ˆ C (cid:44) ( ˜ C ∨ ˆ C seen ) ∧ k . Then E ( ˆ C − C ) ≤ k e − n/k (cid:107) Bw − (cid:107) ∞ + ke − n/k + k max m ∈ [ M ] E N ∼ Poi( nm/k ) [ u N ] + k − ( β − α log eβα − . (9) Proof.
Since ˆ C seen ≤ C ≤ k , ˆ C is always an improvement of ˜ C . Define the event E (cid:44) ∩ ki =1 { N i ≤ L ⇒ kp i ≤ M } , which means that whenever N i ≤ L we have p i ≤ M/k . Since β > α , applying theChernoff bound and the union bound yields P [ E c ] ≤ k − β + α log eβα , and thus E ( ˆ C − C ) ≤ E (( ˆ C − C ) E ) + k P [ E c ] ≤ E (( ˜ C − C ) E ) + k − β + α log eβα . (10)The decomposable form of ˜ C in (4) leads to( ˜ C − C ) E = (cid:88) i : k i ∈ [ M ] ( g ( N i ) − { N i ≤ L } (cid:44) E . In view of the bias analysis in (6), we have | E [ E ] | ≤ (cid:88) i : k i ∈ [ M ] e − nk i /k (cid:107) Bw − (cid:107) ∞ ≤ ke − n/k (cid:107) Bw − (cid:107) ∞ . (11)Recall that g (0) = 0 and g ( j ) = u j + 1 for j ∈ [ L ]. Since N i is independently distributed asPoi( nk i /k ), we have var [ E ] = (cid:88) i : k i ∈ [ M ] var (cid:2) ( g ( N i ) − { N i ≤ L } (cid:3) ≤ (cid:88) i : k i ∈ [ M ] E (cid:2) ( g ( N i ) − { N i ≤ L } (cid:3) = (cid:88) i : k i ∈ [ M ] (cid:16) e − nk i /k + E [ u N i ] (cid:17) ≤ ke − n/k + k max m ∈ [ M ] E N ∼ Poi( nm/k ) [ u N ] . (12)Combining the upper bound on the bias in (11) and the variance in (12) yields an upper boundon E [ E ]. Then the MSE in (9) follows from (10).Proposition 1 suggests that the coefficients of the linear estimator can be chosen by solving thefollowing linear programming (LP): min w ∈ R L (cid:107) Bw − (cid:107) ∞ (13)and showing that the solution does not have large entries. Instead of the (cid:96) ∞ -approximation prob-lem (13), whose optimal value is difficult to analyze, we solve the (cid:96) -approximation problem as arelaxation: min w ∈ R L (cid:107) Bw − (cid:107) , (14)which is an upper bound of (13), and is in fact within an O (log k ) factor since M = O ( k log k/n )and n = Ω( k/ log k ). In the remainder of this section, we consider two separate cases:8 M > L ( n (cid:46) k ): In this case, the linear system in (14) is overdetermined and the minimumis non-zero. Surprisingly, as shown in Section 2.2, the exact optimal value can be found inclosed form using discrete orthogonal polynomials. The coefficients of the solution can bebounded using the minimum singular value of the matrix B , which is analyzed in Section 2.3. • M ≤ L ( n (cid:38) k ): In this case, the linear system is underdetermined and the minimumin (14) is zero. To bound the variance, it turns out that the coefficients bound obtainedfrom the minimum singular value is not precise enough in this regime. Instead, we expressthe coefficients in terms of Lagrange interpolating polynomials and use Stirling numbers toobtain sharp variance bounds. This analysis in carried out in Section 2.4.We finish this subsection with two remarks: Remark 1 (Discrete versus continuous approximation) . The optimal estimator for the
SupportSize problem in [WY15] has the same linear form as (2); however, since the probabilities can takeany values in an interval, the coefficients are found to be the solution of the continuous polynomialapproximation problem inf p max x ∈ [ M , | p ( x ) − | = exp (cid:16) − Θ (cid:16) L √ M (cid:17)(cid:17) . (15)where the infimum is taken over all degree- L polynomials such that p (0) = 0, achieved by the(appropriately shifted and scaled) Chebyshev polynomial [Tim63]. In contrast, in Section 2.2 weshow that the discrete version of (15), which is equivalent to the LP (13), satisfiesinf p max x ∈{ M , M ,..., } | p ( x ) − | = poly ( M ) exp (cid:16) − Θ (cid:16) L M (cid:17)(cid:17) , (16)provided L < M . The difference between (15) and (16) explains why the sample complexity (1) forthe
Support Size problem has an extra log factor compared to that of the
Distinct Elements problemin Table 1. When the sample size n is large enough, interpolation is used in lieu of approximation.See Fig. 1 for an illustration. - - - - - (a) Continuous approximation ● ● ● ● ● ● ● - - - - - (b) Discrete approximation ● ● ● ● ● ● ● - - - - - (c) Interpolation Figure 1: Continuous and discrete polynomial approximations for M = 6 and degree L = 4, where(a) and (b) plot the optimal solution to (15) and (16) respectively. The interpolating polynomialin (c) requires a higher degree L = 6. Remark 2 (Time complexity) . The time complexity of the estimator (2) consists of: (a) Computinghistograms N i and fingerprints Φ j of n samples: O ( n ); (b) Computing the coefficients w by solvingthe least square problem in (6): O ( L ( M + L )); (c) Evaluating the linear combination (2): O ( n ∧ k ).As shown in Table 1, for an accurate estimation the sample complexity is n = Ω( k log k ), which implies L = O (log k ) and M = O (log k ). Therefore, the overall time complexity is O ( n + log k ) = O ( n ).9 .2 Exact solution to the (cid:96) -approximation Next we give an explicit solution to the (cid:96) -approximation problem (14). In general, the optimalsolution is given by w ∗ = ( B (cid:62) B ) − B (cid:62) and the minimum value is the Euclidean distance betweenthe all-one vector and the column span of B , which, in the case of M > L , is non-zero (since B has linearly independent columns). Taking advantage of the Vandermonde structure of the matrix B in (7), we note that (14) can be interpreted as finding the orthogonal projection of the constantfunction onto the linear space of polynomials of degree between 1 and L defined on the discreteset [ M ] /M . Using the orthogonal polynomials with respect to the counting measure, known as discrete Chebyshev (or Gram) polynomials (see [Sze75, Section 2.8] or [NUS91, Section 2.4.2]), weshow that, surprisingly, the optimal value of the (cid:96) -approximation can be found in closed form: Lemma 1.
For all L ≥ and M ≥ L + 1 , min w ∈ R L (cid:107) Bw − (cid:107) = (cid:34) (cid:0) M + L +1 L +1 (cid:1)(cid:0) ML +1 (cid:1) − (cid:35) − / = (cid:20) exp (cid:18) Θ (cid:18) L M (cid:19)(cid:19) − (cid:21) − / . (17) Proof.
Define the following inner product between functions f and g : (cid:104) f, g (cid:105) (cid:44) M (cid:88) i =1 f (cid:18) iM (cid:19) g (cid:18) iM (cid:19) (18)and the induced norm (cid:107) f (cid:107) (cid:44) (cid:112) (cid:104) f, f (cid:105) . The least square problem (17) can be equivalently formulatedas min w ∈ R L (cid:107)− w x + w x + · · · + w L x L (cid:107) . (19)This can be analyzed using the orthogonal polynomials under the inner product (18), which wedescribe next.Recall the discrete Chebyshev polynomial [Sze75, Sec. 2.8]: for x = 0 , , . . . , M − t m ( x ) (cid:44) m ! ∆ m p m ( x ) = 1 m ! m (cid:88) j =0 ( − j (cid:18) mj (cid:19) p m ( x + m − j ) , ≤ m ≤ M − , (20)where p m ( x ) (cid:44) x ( x − · · · ( x − m + 1)( x − M )( x − M − · · · ( x − M − m + 1) , (21)and ∆ m denotes the m -th order forward difference. The polynomials { t , . . . , t M − } are orthogonalwith respect to the counting measure over the discrete set { , , . . . , M − } ; in particular, we have(cf. [Sze75, Sec. 2.8.2, 2.8.3]): M − (cid:88) x =0 t m ( x ) t (cid:96) ( x ) = 0 , m (cid:54) = (cid:96), M − (cid:88) x =0 t m ( x ) = M ( M − )( M − ) · · · ( M − m )2 m + 1 (cid:44) c ( M, m ) . By appropriately shifting and scaling the set of polynomials t m , we define an orthonormal basisfor the set of polynomials of degree at most L ≤ M − φ m ( x ) = t m ( M x − (cid:112) c ( M, m ) , m = 0 , . . . , L. (22)10ince { φ m } Lm =0 constitute a basis for polynomials of degree at most L , the least square problem(19) can be equivalently formulated asmin a : (cid:80) Li =1 a i φ i (0)= − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L (cid:88) i =0 a i φ i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = min a : (cid:104) a,φ (0) (cid:105) = − (cid:107) a (cid:107) , where φ (0) (cid:44) ( φ (0) , . . . , φ L (0)), a = ( a , . . . , a L ), and (cid:104)· , ·(cid:105) denotes vector inner product. Thus,the optimal value is clearly (cid:107) φ (0) (cid:107) , achieved by a ∗ = − φ (0) (cid:107) φ (0) (cid:107) .From (21) we have p m (0) = p m (1) = · · · = p m ( m −
1) = 0. By the formula of t m in (20), weobtain t m ( −
1) = 1 m ! ( − m p m ( −
1) = ( − m m (cid:89) j =1 ( M + j ) . In view of the definition of φ m in (22), we have φ m (0) = t m ( − (cid:112) c ( M, m ) = ( − m (cid:81) mj =1 ( M + j ) (cid:113) M (cid:81) mj =1 ( M − j )2 m +1 = ( − m (cid:118)(cid:117)(cid:117)(cid:116) m + 1 M m (cid:89) j =1 M + jM − j . Therefore (cid:107) φ (0) (cid:107) = L (cid:88) m =0 m + 1 M m (cid:89) j =1 M + jM − j = (cid:0) M + L +1 L +1 (cid:1)(cid:0) ML +1 (cid:1) − , where the last equality follows from induction since (cid:0) M + L +1 L +1 (cid:1)(cid:0) ML +1 (cid:1) − (cid:0) M + LL (cid:1)(cid:0) ML (cid:1) = 2 L + 1 M L (cid:89) j =1 M + jM − j . This proves the first equality in (17).The second equality in (17) is a direct consequence of Stirling’s approximation. If M = L + 1,then (cid:0) M + L +1 L +1 (cid:1)(cid:0) ML +1 (cid:1) = (cid:18) L + 1) L + 1 (cid:19) = exp(Θ( L )) . (23)If M ≥ L + 2, denoting x = L +1 M and applying n ! = √ πn ( ne ) n (1 + Θ( n )) when n ≥
1, we have (cid:0) M + L +1 L +1 (cid:1)(cid:0) ML +1 (cid:1) = ( M + L + 1)!( M − L − M !) = ( M (1 + x ))!( M (1 − x ))!( M !) = (cid:112) πM (1 + x )( M (1+ x ) e ) M (1+ x ) (cid:112) πM (1 − x )( M (1 − x ) e ) M (1 − x ) (1 + Θ( M (1+ x ) + M (1 − x ) ))2 πM ( Me ) M (1 + Θ( M ))= (cid:112) − x exp ( M ((1 + x ) log(1 + x ) + (1 − x ) log(1 − x ))) 1 + Θ( M (1 − x ) )1 + Θ( M )= exp (cid:32) Θ( M x ) + 12 log(1 − x ) + log 1 + Θ( M (1 − x ) )1 + Θ( M ) (cid:33) , (24)where the last step follows from (1 + x ) log(1 + x ) + (1 − x ) log(1 − x ) = Θ( x ) when 0 ≤ x ≤ M x ) dominates when M ≥ L + 2. Applying (23) and (24) tothe exact solution (17) yields the desired approximation.11 .3 Minimum singular values of real rectangle Vandermonde matrices In Proposition 1 the variance of our estimator is bounded by the magnitude of coefficients u , whichis related to the polynomial coefficients w by (7). A classical result from approximation theory isthat if a polynomial is bounded over a compact interval, its coefficients are at most exponential inthe degree [Tim63, Theorem 2.9.11]: for any degree- L polynomial p ( x ) = (cid:80) Li =0 w i x i ,max ≤ i ≤ L | w i | ≤ max x ∈ [0 , | p ( x ) | exp( O ( L )) , (25)which is tight when p is the Chebyshev polynomial. This fact has been applied in statistical contextsto control the variance of estimators obtained from best polynomial approximation [CL11, WY16,WY15, JVHW15]. In contrast, for the Distinct Elements problem, the polynomial is only known tobe bounded over the discretized interval. Nevertheless, we show that the bound (25) continues tohold as long as the discretization level exceeds the degree:max ≤ i ≤ L | w i | ≤ max x ∈{ M , M ,..., } | p ( x ) | exp( O ( L )) , (26)provided that M ≥ L + 1 (see Remark 3 after Lemma 2). Clearly, (26) implies (25) by sending M → ∞ . If M ≤ L , a coefficient bound like (26) is impossible, because one can add to p anarbitrary degree- L interpolating polynomial that evaluates to zero at all M points.To bound the coefficients, note that the optimal solution of (cid:96) -approximation is w ∗ = ( B (cid:62) B ) − B (cid:62) ,and consequently (cid:107) w ∗ (cid:107) ≤ (cid:107) (cid:107) σ min ( B ) , (27)where σ min ( B ) denotes the smallest singular value of B . Let¯ B (cid:44) [ , B ] = /M (1 /M ) · · · (1 /M ) L /M (2 /M ) · · · (2 /M ) L · · · which is an M × ( L +1) Vandermonde matrix and satisfies σ min ( ¯ B ) ≤ σ min ( B ) since ¯ B has one extracolumn. The Gram matrix of ¯ B is an instance of moment matrices . A moment matrix associatedwith a probability measure µ is a Hankel matrix M given by M i,j = m i + j − , where m (cid:96) = (cid:82) x (cid:96) d µ denotes the (cid:96) th moment of µ . Then M ¯ B (cid:62) ¯ B is the moment matrix associated with the uniformdistribution over the discrete set { M , M , . . . , } , which converges to the uniform distribution overthe interval (0 , Hilbert matrix H ,with H ij = 1 i + j − L × L Hilbert matrix is O ( (1+ √ L √ L ) [Tod54]and the operator norm is Θ(1), and thus the minimum singular value is exponentially small in thedegree. Therefore we expect the discrete moment matrix M ¯ B (cid:62) ¯ B to behave similarly to the Hilbertmatrix when M is large enough. Interestingly, we show that this is indeed the case as soon as M exceeds L (otherwise the minimum singular value is zero).12 emma 2. For all M ≥ L + 1 , σ min (cid:18) ¯ B √ M (cid:19) ≥ L L (2 L + 1) (cid:18) M + LeM (cid:19) L +0 . . (28) Remark 3.
The inequality (26) follows from Lemma 2 since the coefficient vector w = ( w , . . . , w L )satisfies (cid:107) w (cid:107) ∞ ≤ (cid:107) w (cid:107) ≤ σ min ( ¯ B ) (cid:107) ¯ Bw (cid:107) ≤ √ Mσ min ( ¯ B ) (cid:107) ¯ Bw (cid:107) ∞ . Remark 4.
The extreme singular values of square Vandermonde matrices have been extensivelystudied (c.f. [Gau90,Bec00] and the references therein). For rectangular Vandermonde matrices, thefocus was mainly with nodes on the unit circle in the complex domain [CGR90, Fer99, Moi15] withapplications in signal processing. In contrast, Lemma 2 is on rectangular Vandermonde matriceswith real nodes. The result on integers nodes in [EPS01] turns out to be too crude for the purposeof this paper.
Proof.
Note that ¯ B (cid:62) ¯ B is the Gramian of monomials x = (1 , x, x , . . . , x L ) (cid:62) under the inner productdefined in (18). When M ≥ L +1, the orthonormal basis φ = ( φ , . . . , φ L ) (cid:62) under the inner product(18) are given in (22). Let φ = Lx where L ∈ R ( L +1) × ( L +1) is a lower triangular matrix and L consists of the coefficients of φ . Taking the Gramian of φ yields that I = L ( ¯ B (cid:62) ¯ B ) L (cid:62) , i.e., L − canbe obtained from the Cholesky decomposition: ¯ B (cid:62) ¯ B = ( L − )( L − ) (cid:62) . Then σ ( ¯ B ) = 1 (cid:107) L (cid:107) op ≥ (cid:107) L (cid:107) F , (29)where (cid:107)·(cid:107) op denotes the (cid:96) operator norm, which is the largest singular value of L , and (cid:107)·(cid:107) F denotesthe Frobenius norm. By definition, (cid:107) L (cid:107) F is the sum of all squared coefficients of φ , . . . , φ L . A usefulmethod to bound the sum-of-squares of the coefficients of a polynomial is by its maximal modulusover the unit circle on the complex plane. Specifically, for any polynomial p ( z ) = (cid:80) ni =0 a i z i , wehave n (cid:88) i =0 | a i | = 12 π (cid:73) | z | =1 | p ( z ) | d z ≤ sup | z | =1 | p ( z ) | . (30)Therefore σ min ( ¯ B ) ≥ (cid:107) L (cid:107) F ≥ (cid:113)(cid:80) Lm =0 sup | z | =1 | φ m ( z ) | ≥ √ L + 1 1sup ≤ m ≤ L, | z | =1 | φ m ( z ) | . (31)For a given M , the orthonormal basis φ m ( x ) in (22) is proportional to the discrete Chebyshevpolynomials t m ( M x − M →∞ M − m t m ( M x ) = P m (2 x − , where P m is the Legendre polynomial of degree m . This gives the intuition that t m ( x ) ≈ M m forreal-valued x ∈ [0 , M ]. We have the following non-asymptotic upper bound (proved in Appendix C)for t m over the complex plane: Lemma 3.
For all ≤ m ≤ M − , | t m ( z ) | ≤ m m sup ≤ ξ ≤ m ( | z + ξ | ∨ M ) m . (32) The lower bound (29), which was also obtained in [CL99, (1.13)] using Cauchy-Schwarz inequality, is tight up topolynomial terms in view of the fact that (cid:107) L (cid:107) F ≤ ( L + 1) (cid:107) L (cid:107) op . φ m in (22), for any | z | = 1 and any M ≥ L + 1, we have | φ m ( z ) | = | t m ( M z − | (cid:112) c ( M, m ) ≤ m m M m (cid:113) M ( M − )( M − ) ··· ( M − m )2 m +1 . The right-hand side is increasing with m . Therefore,sup ≤ m ≤ L, | z | =1 | φ m ( z ) | ≤ L L M L (cid:113) M ( M − )( M − ) ··· ( M − L )2 L +1 = 1 √ M L L √ L + 1 (cid:115) M L +1 (cid:0) M + L L +1 (cid:1) (2 L + 1)! . Combining (31), we obtain σ min (cid:18) ¯ B √ M (cid:19) ≥ L L (cid:112) ( L + 1)(2 L + 1) (cid:115) (cid:0) M + L L +1 (cid:1) (2 L + 1)! M L +1 ≥ L L (2 L + 1) (cid:18) M + LeM (cid:19) L +0 . , where in the last inequality we used (cid:0) nk (cid:1) ≥ ( nk ) k and n ! ≥ ( ne ) n .Using the optimal solution w ∗ to the (cid:96) -approximation problem (14) as the coefficient of thelinear estimator ˆ C , the following performance guarantee is obtained by applying Lemma 1 andLemma 2 to bound the bias and variance, respectively: Theorem 1.
Assume the Poisson sampling model. Then, E ( ˆ C − C ) ≤ k exp (cid:18) − Θ (cid:18) ∨ n log kk ∧ log k (cid:19)(cid:19) . (33) Proof. If n ≤ k log k , then the upper bound in (33) is Θ( k ), which is trivial thanks to the thresholdsthat ˆ C = ( ˜ C ∨ ˆ C seen ) ∧ k . It is hereinafter assumed that n ≥ k log k , or equivalently M ≤ βα L ; here M, L are defined in (8) and the constants α, β are to be determined later. Then, from Lemma 1, (cid:107) Bw ∗ − (cid:107) ∞ ≤ (cid:107) Bw ∗ − (cid:107) ≤ exp (cid:18) − Θ (cid:18) L M (cid:19)(cid:19) . (34)In view of (27) and Lemma 2, we have (cid:107) w ∗ (cid:107) ∞ ≤ (cid:107) w ∗ (cid:107) ≤ (cid:107) (cid:107) σ min ( B ) ≤ exp( O ( L )) . Recall the connection between u j and w j in (7). For 1 ≤ j ≤ L < β log k , we have u j = w j j !( β log k ) j ≤ w j β log k . Therefore, (cid:107) u ∗ (cid:107) ∞ ≤ (cid:107) w ∗ (cid:107) ∞ β log k ≤ exp( O ( L )) β log k . (35)Applying (34) and (35) to Proposition 1, we obtain E ( ˆ C − C ) ≤ k exp (cid:18) − nk − Θ (cid:18) n log kk (cid:19)(cid:19) + ke − n/k + k exp( O (log k ))( β log k ) + k − ( β − α log eβα − . Then the desired (33) holds as long as β is sufficiently large and α is sufficiently small.14 .4 Lagrange interpolating polynomials and Stirling numbers When we sample at least a constant faction of the urn, i.e., n = Ω( k ), we can afford to choose α and β in (8) so that L = M and B is an invertible matrix. We choose the coefficient w = B − which isequivalent to applying Lagrange interpolating polynomial and achieves exact zero bias. To controlthe variance, we can follow the approach in Section 2.3 by using the bound on minimum singularvalue of the matrix B , which implies that the coefficients are exp( O ( L )) and yields a coarse upperbound O ( k log k ∨ log ∆2 k ) on the sample complexity. As previously announced in Table 1, this bound canbe improved to O ( k log log k ∨ log ∆2 k ) by a more careful analysis of the Lagrange interpolating polynomialcoefficients expressed in terms of the Stirling numbers, which we introduce next.The Stirling numbers of the first kind are defined as the coefficients of the falling factorial ( x ) n where ( x ) n = x ( x − . . . ( x − n + 1) = n (cid:88) j =1 s ( n, j ) x j . Compared to the coefficients w expressed by the Lagrange interpolating polynomial: M (cid:88) j =1 w j x j − − (1 − xM )(2 − xM ) . . . ( M − xM ) M ! , we obtain a formula for the coefficients w in terms of the Stirling numbers: w j = ( − M +1 M j M ! s ( M + 1 , j + 1) , ≤ j ≤ M. Consequently, the coefficients of our estimator u j are given by u j = ( − M +1 j ! M ! (cid:18) kn (cid:19) j s ( M + 1 , j + 1) . (36)The precise asymptotics the Stirling number is rather complicated. In particular, the asymptoticformula of s ( n, m ) as n → ∞ for fixed m is given by [Jor47] and the uniform asymptotics over all m is obtained in [MW58] and [Tem93]. The following lemma (proved in Appendix C) is a coarsenon-asymptotic version, which suffices for the purpose of constant-factor approximations of thesample complexity. Lemma 4. | s ( n + 1 , m + 1) | = n ! (cid:18) Θ (cid:18) m (cid:16) ∨ log nm (cid:17)(cid:19)(cid:19) m (37)We construct ˆ C as in Proposition 1 using the coefficients u j in (36) to achieve zero bias. Thevariance upper bound by the coefficients u is a direct consequence of the upper bound of Stirlingnumbers in Lemma 4. Then we obtain the following mean squared error (MSE): Theorem 2 (Interpolation) . Assume the Poisson sampling model. 
If n > ηk for some sufficientlylarge constant η , then E ( ˆ C − C ) ≤ ke − Θ( nk ) + k − . − . kn log ken + k exp (cid:16) k log kn e − Θ( nk ) (cid:17) , n (cid:46) k log log k,k (cid:16) Θ (cid:0) kn (cid:1) log k log kn (cid:17) n/k , k log log k (cid:46) n (cid:46) k √ log k, , n (cid:38) k √ log k. roof. In Proposition 1, fix β = 3 . α = βkn so that L = M . Our goal is to show an upperbound of max λ ∈ nk [ M ] E N ∼ Poi( λ ) [ u N ] = max λ ∈ nk [ M ] M (cid:88) j =1 u j e − λ λ j j ! . (38)Here the coefficients u j are obtained from (36) and, in view of (37), satisfy: | u j | ≤ (cid:18) ηkn (cid:18) ∨ log Mj (cid:19)(cid:19) j , ≤ j ≤ M, (39)for some universal constant η . We consider three cases separately: Case I: n ≥ √ βk √ log k . In this case we have nk ≥ M . The maximum of each summand in (38)as a function of λ ∈ R occurs at λ = j . Since j ≤ nk , the maximum over λ ∈ nk [ M ] is attained at λ = nk . Then, max λ ∈ nk [ M ] E N ∼ Poi( λ ) [ u N ] = E N ∼ Poi( nk ) [ u N ] . (40)In view of (39) and j ≥
1, we have | u j | ≤ (Θ( k/n ) log M ) j . Then, E N ∼ Poi( nk ) [ u N ] ≤ E N ∼ Poi( nk ) (cid:32) Θ (cid:18) k log Mn (cid:19) (cid:33) N = exp (cid:32) nk (cid:32) Θ (cid:18) k log Mn (cid:19) − (cid:33)(cid:33) = e − Θ( n/k ) , as long as n (cid:38) k log log k and thus k log Mn (cid:46)
1. Therefore,max λ ∈ nk [ M ] E N ∼ Poi( λ ) [ u N ] ≤ e − Θ( n/k ) , n (cid:38) k (cid:112) log k. (41) Case II: ηk log log k ≤ n ≤ √ βk √ log k . We apply the following upper bound:max λ ∈ nk [ M ] E N ∼ Poi( λ ) [ u N ] = max λ ∈ nk [ M ] E N ∼ Poi( λ ) [ u N { N ≥ n/k } ] + max λ ∈ nk [ M ] E N ∼ Poi( λ ) [ u N { N 1, the right-hand side of (39) is decreasing with j when j ≥ M/e . It suffices to consider j ≤ M/e , when themaximum as a function of j ∈ R occurs at j ∗ ≤ M e − nηk . Since M e − nηk ≤ nk when n ≥ ηk log log k ,the maximum over nk ≤ j ≤ M is attained at j = nk . Applying (39) with j = nk to (42) yieldsmax λ ∈ nk [ M ] E N ∼ Poi( λ ) [ u N ] ≤ (cid:18) Θ (cid:18) kn (cid:19) log k log kn (cid:19) n/k + e − Θ( n/k ) . (43)16 ase III: ηk ≤ n ≤ ηk log log k . We apply the upper bound of expectation by the maximum:max λ ∈ nk [ M ] E N ∼ Poi( λ ) [ u N ] ≤ max j ∈ [ M ] u j . Since ηkn ≤ 1, the right-hand side of (39) is decreasing with j when j ≥ M/e , so it suffices to consider j ≤ M/e . Denoting x = log Mj and τ = Θ( kn ), in view of (39), we have | u j | ≤ exp( M e − x log( τ x )),which attains maximum at x ∗ satisfying e /x ∗ x ∗ = τ . Then, | u j | ≤ exp( M e − x ∗ log( τ x ∗ )) = exp( M e − x ∗ /x ∗ ) < exp( M τ e − /τ ) . where the last inequality is because of τ > x ∗ . Therefore,max λ ∈ nk [ M ] E N ∼ Poi( λ ) [ u N ] ≤ exp (cid:18) k log kn e − Θ( nk ) (cid:19) , k (cid:46) n (cid:46) k log log k. (44)Applying the upper bounds in (41), (43) and (44) to Proposition 1 concludes the proof. Remark 5. It is impossible to bridge the gap near ∆ = √ k in Table 1 using the technology ofinterpolating polynomials that aims at zero bias, since its worst-case variance is at least k when n = O ( k ). To see this, note that the variance term given by (12) is (cid:88) p i E N ∼ Poi( np i ) [ u N ] = (cid:88) p i L (cid:88) j =1 u j e − np i ( np i ) j j ! . (45)Consider the distribution Uniform[ n/j ] with j = Le − n/k = Ω(log k ), which corresponds to anurn where each of the n/j colors appears equal number of times. By the formula of coefficient u j in (36) and the characterization from Lemma 4, the j = j term in the summation of (45) is oforder nj ( kn log Mj ) j = nj j , which is already k . In this section we develop lower bounds of the sample complexity which certify the optimality ofestimators constructed in Section 2. We first give a brief overview of the lower bound in [CCMN00,Theorem 1], which gives the optimal sample complexity under the multiplicative error criterion.The lower bound argument boils down to considering two hypothesis: in the null hypothesis, the urnconsists of only one color; in the alternative, the urn contains 2∆ + 1 distinct colors, where k − k/ ∆) samples. This lower bound is optimal for estimating within a multiplicative factorof √ ∆, which, however, is too loose for additive error ∆.In contrast, instead of testing whether the urn is monochromatic, our first lower bound isgiven by testing whether the urn is maximally colorful, that is, containing k distinct colors. Thealternative contains k − 2∆ colors, and the numbers of balls of two different colors differ by at mostone. In other words, the null hypothesis is the uniform distribution on [ k ] and the alternative isclose to uniform distribution with smaller support size. The sample complexity, which is shown inTheorem 3, gives the lower bound in Table 1 for ∆ ≤ √ k .17 heorem 3. If ≤ ∆ ≤ k , then n ∗ ( k, ∆) ≥ Ω (cid:18) k − √ k (cid:19) . 
(46) If ≤ ∆ < k , then n ∗ ( k, ∆) ≥ Ω (cid:18) k arccosh (cid:18) k (cid:19)(cid:19) (cid:16) (cid:40) k log(1 + k ∆ ) , ∆ ≤ √ k, k / ∆ , ∆ ≥ √ k. (47) Proof. Consider the following two hypotheses: The null hypothesis H is an urn consisting of k distinct colors; The alternative H consists of k − 2∆ distinct colors, and each color appearseither b (cid:44) (cid:98) kk − (cid:99) or b (cid:44) (cid:100) kk − (cid:101) times. In terms of distributions, H is the uniform distribution Q = ( k , . . . , k ); H is the closest perturbation from the uniform distribution: randomly pick disjointsets of indices I, J ⊆ [ k ] with cardinality | I | = c and | J | = c , where c and c satisfy(number of colors) c + c = k − , (number of balls) c b + c b = k. Conditional on θ (cid:44) ( I, J ), the distribution P θ = ( p θ, , . . . , p θ,k ) is given by p θ = (cid:40) b /k, i ∈ I,b /k, i ∈ J. Put the uniform prior on the alternative. Denote the marginal distributions of the n samples X = ( X , . . . , X n ) under H and H by Q X and P X , respectively. Since the distinct colors in H and H are separated by 2∆, to show that the sample complexity n ∗ ( k, ∆) ≥ n , it suffices to showthat no test can distinguish H and H reliably using n samples. A further sufficient condition isa bounded χ divergence [Tsy09] χ ( P X (cid:107) Q X ) (cid:44) (cid:90) P X Q X − ≤ O (1) . The remainder of this proof is devoted to upper bounds of the χ divergence.Since P X | θ = P ⊗ nθ and Q X = Q ⊗ n , we have χ ( P X (cid:107) Q X ) + 1 = (cid:90) P X Q X = (cid:90) ( E θ P X | θ )( E θ (cid:48) P X | θ (cid:48) ) Q X = E θ,θ (cid:48) (cid:90) P X | θ P X | θ (cid:48) Q X = E θ,θ (cid:48) (cid:18)(cid:90) P θ P θ (cid:48) Q (cid:19) n , where θ (cid:48) is an independent copy of θ . By the definition of P θ and Q , (cid:90) P θ P θ (cid:48) Q = b k | I ∩ I (cid:48) | + b k | J ∩ J (cid:48) | + b b k ( | I ∩ J (cid:48) | + | J ∩ I (cid:48) | ) = 1 + (cid:88) i =1 A i , (48)where A (cid:44) b k ( | I ∩ I (cid:48) | − c k ), A (cid:44) b k ( | J ∩ J (cid:48) | − c k ), A = b b k ( | I ∩ J (cid:48) | − c c k ), and A = b b k ( | J ∩ I (cid:48) | − c c k ) are centered random variables. Applying 1 + x ≤ e x and Cauchy-Schwarzinequality, we obtain χ ( P X (cid:107) Q X ) + 1 ≤ E [ e n (cid:80) i =1 A i ] ≤ (cid:89) i =1 ( E [ e nA i ]) . (49)18onsider the first term E [ e nA ]. Note that | I ∩ I (cid:48) | ∼ Hypergeometric( k, c , c ), which is thedistribution of the sum of c samples drawn without replacement from a population of size k whichconsists of c ones and k − c zeros. By the convex stochastic dominance of the binomial over thehypergeometric distribution [Hoe63, Theorem 4], for Y ∼ Binomial( c , c k ), we have( E [ e nA ]) ≤ (cid:18) E (cid:20) exp (cid:18) nb k ( Y − c /k ) (cid:19)(cid:21)(cid:19) ≤ exp (cid:18) c k (cid:18) exp (cid:18) nb k (cid:19) − − nb k (cid:19)(cid:19) ≤ exp (cid:18) c k (cid:18) exp (cid:18) nb k (cid:19) − − nb k (cid:19)(cid:19) , (50)where the last inequality follows from the fact that x (cid:55)→ e x − − x is increasing when x > 0. Otherterms in (49) are bounded analogously and we have χ ( P X (cid:107) Q X ) + 1 ≤ exp (cid:18) c + c + 2 c c k (cid:18) exp (cid:18) nb k (cid:19) − − nb k (cid:19)(cid:19) = exp (cid:32) ( k − k (cid:32) exp (cid:32) nk (cid:24) kk − (cid:25) (cid:33) − − nk (cid:24) kk − (cid:25) (cid:33)(cid:33) . 
(51)If k − ≥ √ k , the upper bound (51) implies that n ∗ ( k, ∆) ≥ Ω( k − √ k ) since the χ -divergence isfinite with O ( k − √ k ) samples, using the inequality that e x − − x ≤ x for x ≥ 0; if k − ≤ √ k ,the lower bound is trivial since k − √ k ≤ ≤ ∆ < k/ 4, in which case | I | = c = k − , | J | = c = 2∆ and b = 1 , b = 2. When c is close to k , Hypergeometric( k, c , c ) is no longer wellapproximated by Binomial( c , c k ), and the upper bound in (50) yields a loose lower bound for thesample complexity. To fix this, note that in this case the set K (cid:44) ( I ∪ J ) c has small cardinality | K | = 2∆. The equality in (48) can be equivalently represented in terms of J, J (cid:48) and K, K (cid:48) by (cid:90) P θ P θ (cid:48) Q = 1 + | J ∩ J (cid:48) | + | K ∩ K (cid:48) | − | J ∩ K (cid:48) | − | K ∩ J (cid:48) | k . By upper bounds analogous to (49) – (51), χ ( P X (cid:107) Q X ) + 1 ≤ (cid:81) i =1 ( E [ e nB i ]) , where B (cid:44) k ( | J ∩ J (cid:48) |− (2∆) k ), B (cid:44) k ( | K ∩ K (cid:48) |− (2∆) k ), B (cid:44) − k ( | J ∩ K (cid:48) |− (2∆) k ), and B (cid:44) − k ( | K ∩ J (cid:48) |− (2∆) k ).Note that | J ∩ J (cid:48) | , | K ∩ K (cid:48) | , | J ∩ K (cid:48) | , | K ∩ J (cid:48) | are all distributed as Hypergeometric( k, , , k ). For Y ∼ Binomial(2∆ , k ), we have( E [ e nB i ]) ≤ (cid:18) E (cid:20) exp (cid:18) t (cid:18) Y − (2∆) k (cid:19)(cid:19)(cid:21)(cid:19) / ≤ exp (cid:18) (2∆) k (cid:0) e t − − t (cid:1)(cid:19) . with t = nk for i = 1 , t = − nk for i = 3 , 4. Therefore, χ ( P X (cid:107) Q X ) + 1 ≤ exp (cid:18) ∆ k (cid:16) e n/k + 2 e − n/k − (cid:17)(cid:19) = exp (cid:18) k (cosh(4 n/k ) − (cid:19) . (52)The upper bound (52) yields the sample complexity n ∗ ( k, ∆) ≥ Ω( k arccosh(1 + k )).19ow we establish another lower bound for the sample complexity of the Distinct Elements prob-lem for sampling without replacement. Since we can simulate sampling with replacement fromsamples obtained without replacement (see (53) for details), it is also a valid lower bound for n ∗ ( k, ∆) defined in Definition 1. On the other hand, as observed in [RRSS09, Lemma 3.3] (see also[Val12, Lemma 5.14]), any estimator ˆ C for the Distinct Elements problem with sampling withoutreplacement leads to an estimator for the Support Size problem with slightly worse performance:Suppose we have n i.i.d. samples drawn from a distribution P whose minimum non-zero probabilityis at least 1 /(cid:96) . Let ˆ C seen denote the number of distinct elements in these samples. Equivalently,these samples can be viewed as being generated in two steps: first, we draw k i.i.d. samples from P , whose realizations form an instance of a k -ball urn with ˆ C seen distinct colors; next, we draw n samples from this urn without replacement ( n ≤ k ), which clearly are distributed according to P ⊗ n . Suppose ˆ C seen is close to the actual support size of P . Then applying any algorithm forthe Distinct Elements problem to these n i.i.d. samples constitutes a good support size estimator.Lemma 5 formalizes this intuition. Lemma 5. Suppose an estimator ˆ C takes n samples from a k -ball urn ( n ≤ k ) without replacementand provides an estimation error of less than ∆ with probability at least − δ . Applying ˆ C with n i.i.d. samples from any distribution P with minimum non-zero mass /(cid:96) and support size S ( P ) , wehave | ˆ C − S ( P ) | ≤ with probability at least − δ − (cid:0) (cid:96) ∆ (cid:1) (cid:0) − ∆ (cid:96) (cid:1) k .Proof. Suppose that we take k i.i.d. samples from P = ( p , p , . . . 
), which form a k -ball urn con-sisting of C distinct colors. By the union bound, P [ | C − S ( P ) | ≥ ∆] ≤ (cid:88) I : | I | =∆ ,p i ≥ (cid:96) ,i ∈ I (cid:32) − (cid:88) i ∈ I p i (cid:33) k ≤ (cid:18) (cid:96) ∆ (cid:19) (cid:18) − ∆ (cid:96) (cid:19) k . Next we take n samples without replacement from this urn and apply the given estimator ˆ C . Byassumption, conditioned on any realization of the k -ball urn, | ˆ C − C | ≤ ∆ with probability at least1 − δ . Then | ˆ C − S ( P ) | ≤ 2∆ with probability at least 1 − δ − (cid:0) (cid:96) ∆ (cid:1) (cid:0) − ∆ (cid:96) (cid:1) k . Marginally, these n samples are identically distributed as n i.i.d. samples from P .Combining with the sample complexity of the Support Size problem in (1), Lemma 5 leads tothe following lower bound for the Distinct Elements problem: Theorem 4. Fix a sufficiently small constant c . For any ≤ ∆ ≤ ck , n ∗ ( k, ∆) ≥ Ω (cid:18) k log k log k ∆ (cid:19) . The same lower bound holds for sampling without replacement.Proof. By the lower bound of the support size estimation problem obtained in [WY15, Theorem 2],if n ≤ α(cid:96) log (cid:96) log (cid:96) and 2∆ ≤ c (cid:96) for some fixed constants c < and α , then for any ˆ C , there existsa distribution P with minimum non-zero mass 1 /(cid:96) such that | ˆ C − S ( P ) | ≤ 2∆ with probabilityat most 0 . 8. Applying Lemma 5 yields that, using n samples without replacement, no estimator20an provide an estimation error of ∆ with probability 0 . k -ball urn, provided (cid:0) (cid:96) ∆ (cid:1) (cid:0) − ∆ (cid:96) (cid:1) k ≤ . 1. Consequently, as long as 2∆ ≤ c (cid:96) and (cid:0) (cid:96) ∆ (cid:1) (cid:0) − ∆ (cid:96) (cid:1) k ≤ . 1, we have n ∗ ( k, ∆) ≥ α(cid:96) log (cid:96) log (cid:96) . The desired lower bound follows from choosing (cid:96) (cid:16) k log( k/ ∆) . Below we explain how the sample complexity bounds summarized in Table 1 are obtained fromvarious results in Section 2 and Section 3: • The upper bounds are obtained from the worst-case MSE in Section 2 and the Markov in-equality. In particular, the case of ∆ ≤ √ k (log k ) − δ follows from the second and the thirdupper bounds of Theorem 2; the case of √ k ≤ ∆ ≤ k . δ follows from the first upper boundof Theorem 2; the case of k − δ ≤ ∆ ≤ ck follows from Theorem 1. By monotonicity, we havethe O ( k log log k ) upper bound when √ k (log k ) − δ ≤ ∆ ≤ √ k , the O ( k log k ) upper bound when∆ ≥ ck , and the O ( k ) upper bound when k . δ ≤ ∆ ≤ k − δ . • The lower bound for ∆ ≤ √ k follows from Theorem 3; the lower bound for k . δ ≤ ∆ ≤ ck follows from Theorem 4. These further implies the Ω( k ) lower bound for √ k ≤ ∆ ≤ k . δ by monotonicity. A Connections between various sampling models As mentioned in Section 1.2, four popular sampling models have been introduced in the statisticsliterature: the multinomial model, the hypergeometric model, the Bernoulli model, and the Poissonmodel. The connections between those models are explained in details in this section, as well asrelations between the respective sample complexities. Bernoulli modelhypergeometric model multinomial modelPoisson model Binomial( k, p ) samples simulate Poi( n ) samplesFigure 2: Relations between the four sampling models. In particular, hypergeometric (resp. multi-nomial) model reduces to the Bernoulli (resp. Poisson) model when the sample size is binomial(resp. Poisson) distributed.The connections between different models are illustrated in Fig. 2. 
Under the Poisson model,the sample size is a Poisson random variable; conditioned on the sample size, the samples are i.i.d.which is identical to the multinomial model. The same relation holds as the Bernoulli model tothe hypergeometric model. Given samples ( Y , . . . , Y n ) uniformly drawn from a k -ball urn with-out replacement (hypergeometric model), we can simulate ( X , . . . , X n ) drawn with replacement21multinomial model) as follows: for each i = 1 , . . . , n , let X i = (cid:40) Y i , with probability 1 − i − k ,Y m , with probability i − k , m ∼ Uniform([ i − . (53)In view of the connections in Fig. 2, any estimator constructed for one specific model can beadapted to another. The adaptation from multinomial to hypergeometric model is provided by thesimulation in (53), and the other direction is given by Lemma 5 (without modifying the estimator).The following result provides a recipe for going between fixed and randomized sample size: Lemma 6. Let N be an N -valued random variable.(a) Given any ˆ C that uses n samples and succeeds with probability at least − δ , there exists ˆ C (cid:48) using N samples that succeeds with probability at least − δ − P [ N < n ] .(b) Given any ˜ C using N samples that succeeds with probability at least − δ , there exists ˜ C (cid:48) thatuses n samples and succeeds with probability at least − δ − P [ N > n ] .Proof. (a) Denote the samples by X , . . . , X N . Following [RRSS09, Lemma 5.3(a)], define ˆ C (cid:48) asˆ C (cid:48) = (cid:40) ˆ C ( X , . . . , X n ) , N ≥ n, , N < n. Then ˆ C (cid:48) succeeds as long as N ≥ n and ˆ C succeeds, which has probability at least 1 − δ − P [ N Recall that the fingerprints are defined by Φ j = (cid:80) i { N i = j } , where N i denotes the histogram ofsamples. Under the Poisson model, N i ind ∼ Poi( np i ). Then cov (Φ j , Φ j (cid:48) ) = − (cid:88) i P [ N i = j ] P [ N i = j (cid:48) ] , j (cid:54) = j (cid:48) , var [Φ j ] = (cid:88) i P [ N i = j ](1 − P [ N i = j ]) . The correlation coefficient between Φ and Φ j follows immediately: | ρ (Φ , Φ j ) | = (cid:88) i P [ N i = 0] P [ N i = j ] (cid:112)(cid:80) l P [ N l = 0](1 − P [ N l = 0]) (cid:80) l P [ N l = j ](1 − P [ N l = j ]) ≤ (cid:88) i P [ N i = 0] P [ N i = j ] (cid:112) P [ N i = 0](1 − P [ N i = 0]) P [ N i = j ](1 − P [ N i = j ])= (cid:88) i (cid:115) P [ N i = 0]1 − P [ N i = 0] P [ N i = j ]1 − P [ N i = j ] = (cid:88) i (cid:118)(cid:117)(cid:117)(cid:117)(cid:116) e − λ i − e − λ i e − λi λ ji j ! − e − λi λ ji j ! , (54)where λ i = np i . Note that max x> e − x x j j ! = e − j j j j ! → j → ∞ . Therefore, for any x > e − x − e − x e − x x j j ! − e − x x j j ! = 1 j ! e − x x j − e − x (1 + o j (1)) , (55)where o j (1) is uniform as j → ∞ . Taking derivative, the function x (cid:55)→ e − x x j − e − x on x > x + e x ( j − x ) − j > 0, and the maximum is attained at x = j/ o j (1). Therefore,applying j ! > ( j/e ) j , 1 j ! e − x x j − e − x ≤ (1 + o j (1))2 − j . (56)Combining (54) – (56), we conclude that | ρ (Φ , Φ j ) | ≤ k − j/ (1 + o j (1)) . Proof of auxiliary lemmas Proof of Lemma 3. For any z ∈ C , we can represent the forward difference in (20) as an integral:∆ m f ( z ) = f ( z + m ) − (cid:18) m (cid:19) f ( z + m − 1) + · · · + ( − m f ( z )= (cid:90) [0 , m f ( m ) ( z + x + · · · + x m )d x · · · d x m . Therefore, | t m ( z ) | = (cid:12)(cid:12)(cid:12)(cid:12) m ! ∆ m p m ( z ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ m ! sup ≤ ξ ≤ m | p ( m ) m ( z + ξ ) | . (57)Recall the definition of p m in (21). 
Proofs of auxiliary lemmas

Proof of Lemma 3. For any $z\in\mathbb{C}$, we can represent the forward difference in (20) as an integral:
$$
\Delta^m f(z) = f(z+m) - \binom{m}{1} f(z+m-1) + \cdots + (-1)^m f(z) = \int_{[0,1]^m} f^{(m)}(z+x_1+\cdots+x_m)\,\mathrm{d}x_1\cdots\mathrm{d}x_m.
$$
Therefore,
$$
|t_m(z)| = \Big|\frac{1}{m!}\,\Delta^m p_m(z)\Big| \le \frac{1}{m!}\sup_{0\le\xi\le m} |p_m^{(m)}(z+\xi)|. \tag{57}
$$
Recall the definition of $p_m$ in (21). Let $p_m(z) = \sum_{\ell=0}^{2m} a_\ell z^\ell$, and write $z(z-1)\cdots(z-m+1) = \sum_{i=0}^{m} b_i z^i$ and $(z-M)(z-M-1)\cdots(z-M-m+1) = \sum_{i=0}^{m} c_i z^i$. Expanding the products and collecting coefficients yields the simple upper bounds
$$
|b_i| \le 2^m (m-1)^{m-i}, \qquad |c_i| \le 2^m (M+m-1)^{m-i} \le 2^m (2M)^{m-i} \le 4^m M^{m-i},
$$
using $m-1\le M$. Since $\sum_{\ell=0}^{2m} a_\ell z^\ell = (\sum_{i=0}^{m} b_i z^i)(\sum_{j=0}^{m} c_j z^j)$, for $\ell\ge m$,
$$
|a_\ell| = \Big|\sum_{i=\ell-m}^{m} b_i c_{\ell-i}\Big| \le \sum_{i=\ell-m}^{m} 2^m (m-1)^{m-i}\,4^m M^{m-\ell+i} = 8^m M^{2m-\ell} \sum_{i=\ell-m}^{m} \Big(\frac{m-1}{M}\Big)^{m-i} \le 16^m M^{2m-\ell}.
$$
Taking the $m$-th derivative of $p_m$, we obtain
$$
|p_m^{(m)}(z)| = \Big|\sum_{j=0}^{m} a_{j+m}\,\frac{(j+m)!}{j!}\,z^j\Big| \le \sum_{j=0}^{m} |a_{j+m}|\,M^j \binom{m+j}{m}\,m!\,\Big|\frac{z}{M}\Big|^j \le 16^m M^m\,m!\,(2e)^m \sum_{j=0}^{m} \Big|\frac{z}{M}\Big|^j \le (m+1)(32e)^m\,m!\,(|z|\vee M)^m.
$$
Then the desired (32) follows from (57).
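As a quick sanity check on the forward-difference representation above: for a degree-$m$ polynomial, $f^{(m)}$ is the constant $m!\,a_m$, so $\Delta^m f(z)$ must equal that constant for every $z$. A toy verification in Python (ours, not from the paper):

```python
from math import comb

def forward_difference(f, z, m):
    """Delta^m f(z) = sum_{i=0}^m (-1)^(m-i) * C(m, i) * f(z + i)."""
    return sum((-1) ** (m - i) * comb(m, i) * f(z + i) for i in range(m + 1))

# f(z) = z^3 has f'''(z) = 3! = 6, so Delta^3 f(z) = 6 for all z,
# matching the m-fold integral of the (constant) m-th derivative.
f = lambda z: z ** 3
print([forward_difference(f, z, 3) for z in range(5)])   # -> [6, 6, 6, 6, 6]
```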
Proof of Lemma 4. The following uniform asymptotic expansions of the Stirling numbers of the first kind were obtained in [CRT00, Theorem 2]:
$$
|s(n+1,m+1)| = \begin{cases} \dfrac{n!}{m!}\,(\log n+\gamma)^m\,(1+o(1)), & 1\le m\le\sqrt{\log n},\\[4pt] \dfrac{\Gamma(n+1+R)}{\Gamma(R)\,R^{m+1}\,\sqrt{2\pi H}}\,(1+o(1)), & \sqrt{\log n}\le m\le n-n^{1/3},\\[4pt] \dbinom{n+1}{m+1}\Big(\dfrac{m+1}{2}\Big)^{n-m}(1+o(1)), & n-n^{1/3}\le m\le n, \end{cases}
$$
where $\gamma$ is Euler's constant, $R$ is the unique positive solution of $h'(x)=0$ with $h(x)\triangleq\log\frac{\Gamma(x+n+1)}{\Gamma(x+1)\,x^m}$, $H = R^2 h''(R)$, and all $o(1)$ terms are uniform in $m$. In the following we consider each range separately and prove the non-asymptotic approximation in (37).

Case I. For $1\le m\le\sqrt{\log n}$, Stirling's approximation gives
$$
\frac{n!}{m!}\,(\log n+\gamma)^m = n!\,\Big(\Theta\Big(\frac{\log n}{m}\Big)\Big)^m.
$$

Case II. For $n-n^{1/3}\le m\le n$,
$$
\binom{n+1}{m+1}\Big(\frac{m+1}{2}\Big)^{n-m} = \frac{n!}{m!}\,\Big(\Theta\Big(\frac{m}{n-m}\Big)\Big)^{n-m} = n!\,\exp\Big(m\Big(\frac{n-m}{m}\,\log\Theta\Big(\frac{m}{n-m}\Big) - \log\Theta(m)\Big)\Big) = n!\,\Big(\Theta\Big(\frac{1}{m}\Big)\Big)^m.
$$

Case III. For $\sqrt{\log n}\le m\le n-n^{1/3}$, note that $h(x) = \sum_{i=1}^{n}\log(x+i) - m\log x$, and thus
$$
H = R^2 h''(R) = m - \sum_{i=1}^{n}\Big(\frac{R}{R+i}\Big)^2 \le m.
$$
By [MW58, Lemma 4.1], $H=\omega(1)$ in this range. Hence,
$$
|s(n+1,m+1)| = \frac{\Gamma(n+1+R)}{\Gamma(R)\,R^{m+1}}\,(\Theta(1))^m = \frac{n!}{R^m}\cdot\frac{\Gamma(n+1+R)}{n!\,\Gamma(R+1)}\,(\Theta(1))^m, \tag{58}
$$
where $R$ is the solution of $x\big(\frac{1}{x+1}+\cdots+\frac{1}{x+n}\big) = m$. Bounding the sum by integrals, we have
$$
R\,\log\Big(1+\frac{n}{R+1}\Big) \le m \le R\,\log\Big(1+\frac{n}{R}\Big).
$$
If $\sqrt{\log n}\le m\le\frac{n}{e}$, then $R\asymp\frac{m}{\log(n/m)}$, and hence
$$
1 \le \frac{\Gamma(n+1+R)}{n!\,\Gamma(R+1)} \le \Big(O\Big(\frac{n+R}{R}\Big)\Big)^{R} = \exp(O(m)).
$$
In view of (58), we have $|s(n+1,m+1)| = n!\,(\Theta(1/R))^m$, which is exactly (37) when $m\le n/e$. If $\frac{n}{e}\le m\le n-n^{1/3}$, then $R\asymp\frac{n^2}{n-m}$, and
$$
\frac{1}{R^m}\cdot\frac{\Gamma(n+1+R)}{n!\,\Gamma(R+1)} = R^{-m}\,\Big(\Theta\Big(\frac{n+R}{n}\Big)\Big)^{n} = \exp\Big(-m\,\log\Theta\Big(\frac{n^2}{n-m}\Big) + n\,\log\Theta\Big(\frac{n}{n-m}\Big)\Big) = \exp\Big(-m\,\log\Theta(n) + (n-m)\,\log\Theta\Big(\frac{n}{n-m}\Big)\Big) = \exp(-m\,\log\Theta(n)).
$$
Combining with (58) yields $|s(n+1,m+1)| = n!\,(\Theta(1/n))^m$, which coincides with (37) since $n\asymp m$ in this range.
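The Case I branch can be probed numerically: the unsigned Stirling numbers of the first kind satisfy the recurrence $|s(i+1,j)| = i\,|s(i,j)| + |s(i,j-1)|$, so exact values are cheap to tabulate and compare against $\frac{n!}{m!}(\log n+\gamma)^m$. A small Python sketch (ours, for illustration):

```python
from math import factorial, log

EULER_GAMMA = 0.5772156649015329   # Euler's constant gamma

def stirling1_unsigned(n, m):
    """|s(n, m)| via |s(i+1, j)| = i*|s(i, j)| + |s(i, j-1)|, |s(0, 0)| = 1."""
    s = [[0] * (n + 1) for _ in range(n + 1)]
    s[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, i + 1):
            s[i][j] = (i - 1) * s[i - 1][j] + s[i - 1][j - 1]
    return s[n][m]

n, m = 30, 2   # small m: the regime of the first branch
exact = stirling1_unsigned(n + 1, m + 1)
approx = factorial(n) / factorial(m) * (log(n) + EULER_GAMMA) ** m
print(exact / approx)   # approaches 1 (slowly) as n grows, for fixed small m
```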
Acknowledgment

This research was supported in part by the National Science Foundation under grant agreements IIS-14-47879 and CCF-15-27105 and by an NSF CAREER award, CCF-1651588. The authors thank Greg Valiant for helpful discussions on Lemma 5.14 in his thesis [Val12]. We thank the anonymous referees for constructive comments which have helped to improve the presentation of the paper.

References

[Bec00] Bernhard Beckermann. The condition number of real Vandermonde, Krylov and positive definite Hankel matrices. Numerische Mathematik, 85(4):553–577, 2000.
[BF93] John Bunge and M. Fitzpatrick. Estimating the number of species: a review. Journal of the American Statistical Association, 88(421):364–373, 1993.
[BYJK+02] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Proceedings of the 6th International Workshop on Randomization and Approximation Techniques in Computer Science, pages 1–10. Springer-Verlag, 2002.
[BYKS01] Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Sampling algorithms: lower bounds and applications. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 266–275. ACM, 2001.
[CCMN00] Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. Towards estimation error guarantees for distinct values. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 268–279. ACM, 2000.
[CGR90] Antonio Córdova, Walter Gautschi, and Stephan Ruscheweyh. Vandermonde matrices on the circle: spectral properties and conditioning. Numerische Mathematik, 57(1):577–591, 1990.
[CL92] Anne Chao and Shen-Ming Lee. Estimating the number of classes via sample coverage. Journal of the American Statistical Association, 87(417):210–217, 1992.
[CL99] Yang Chen and Nigel Lawrence. Small eigenvalues of large Hankel matrices. Journal of Physics A: Mathematical and General, 32(42):7305, 1999.
[CL11] T. T. Cai and M. G. Low. Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. The Annals of Statistics, 39(2):1012–1041, 2011.
[CRT00] R. Chelluri, L. B. Richmond, and N. M. Temme. Asymptotic estimates for generalized Stirling numbers. Analysis - International Mathematical Journal of Analysis and its Applications, 20(1):1–14, 2000.
[EPS01] Alfredo Eisinberg, Paolo Pugliese, and Nicola Salerno. Vandermonde matrices on integer nodes: the rectangular case. Numerische Mathematik, 87(4):663–674, 2001.
[Est86] Warren W. Esty. Estimation of the size of a coinage: a survey and comparison of methods. The Numismatic Chronicle (1966-), pages 185–215, 1986.
[ET76] B. Efron and R. Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435–447, 1976.
[FCW43] Ronald Aylmer Fisher, A. Steven Corbet, and Carrington B. Williams. The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology, pages 42–58, 1943.
[Fer99] P. J. S. G. Ferreira. Super-resolution, the recovery of missing samples and Vandermonde matrices on the unit circle. In Proceedings of the Workshop on Sampling Theory and Applications, Loen, Norway, 1999.
[FFGM07] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In AofA07: Proceedings of the 2007 International Conference on Analysis of Algorithms, 2007.
[Fra78] Ove Frank. Estimation of the number of connected components in a graph by using a sampled subgraph. Scandinavian Journal of Statistics, pages 177–188, 1978.
[Gau90] Walter Gautschi. How (un)stable are Vandermonde systems? Asymptotic and Computational Analysis, 124:193–210, 1990.
[Goo49] Leo A. Goodman. On the estimation of the number of classes in a population. The Annals of Mathematical Statistics, pages 572–579, 1949.
[Goo53] Irving J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.
[GS04] Alexander Goldenshluger and Vladimir Spokoiny. On the shape-from-moments problem and recovering edges from noisy Radon data. Probability Theory and Related Fields, 128(1):123–140, 2004.
[Hil79] Bruce M. Hill. Posterior moments of the number of species in a finite population and the posterior probability of finding a new species. Journal of the American Statistical Association, 74(367):668–673, 1979.
[Hoe63] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.
[HOT88] Wen-Chi Hou, Gultekin Ozsoyoglu, and Baldeo K. Taneja. Statistical estimators for relational algebra expressions. In Proceedings of the Seventh ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 276–287. ACM, 1988.
[Jor47] Charles Jordan. Calculus of Finite Differences. Chelsea, 1947.
[JVHW15] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.
[KNW10] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 41–52. ACM, 2010.
[LNS99] Oleg Lepski, Arkady Nemirovski, and Vladimir Spokoiny. On estimation of the $L_r$ norm of a regression function. Probability Theory and Related Fields, 113(2):221–253, 1999.
[Lo92] Shaw-Hwa Lo. From the species problem to a general coverage problem via a new interpretation. The Annals of Statistics, 20(2):1094–1109, 1992.
[Moi15] Ankur Moitra. Super-resolution, extremal functions and the condition number of Vandermonde matrices. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 821–830. ACM, 2015.
[MU05] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[MW58] L. Moser and M. Wyman. Asymptotic development of the Stirling numbers of the first kind. Journal of the London Mathematical Society, 1(2):133–146, 1958.
[NS90] Jeffrey F. Naughton and S. Seshadri. On estimating the size of projections. In International Conference on Database Theory, pages 499–513. Springer, 1990.
[NUS91] Arnold F. Nikiforov, Vasilii B. Uvarov, and Sergei K. Suslov. Classical Orthogonal Polynomials of a Discrete Variable. Springer, 1991.
[Pan03] Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.
[Pan04] Liam Paninski. Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50(9):2200–2203, 2004.
[RRSS09] Sofya Raskhodnikova, Dana Ron, Amir Shpilka, and Adam Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM Journal on Computing, 39(3):813–842, 2009.
[Sze75] G. Szegő. Orthogonal Polynomials. American Mathematical Society, Providence, RI, 4th edition, 1975.
[Tem93] Nico M. Temme. Asymptotic estimates of Stirling numbers. Studies in Applied Mathematics, 89(3):233–243, 1993.
[Tim63] Aleksandr Filippovich Timan. Theory of Approximation of Functions of a Real Variable. Pergamon Press, 1963.
[Tod54] John Todd. The condition of the finite segments of the Hilbert matrix. In Contributions to the Solution of Systems of Linear Equations and the Determination of Eigenvalues, 39:109–116, 1954.
[Tsy09] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Verlag, New York, NY, 2009.
[Val11] Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing, 40(6):1927–1968, 2011.
[Val12] Gregory Valiant. Algorithmic Approaches to Statistical Questions. PhD thesis, EECS Department, University of California, Berkeley, September 2012.
[VV11a] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, pages 685–694, 2011.
[VV11b] Gregory Valiant and Paul Valiant. The power of linear estimators. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 403–412. IEEE, 2011.
[WY15] Yihong Wu and Pengkun Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. arXiv:1504.01227, 2015.
[WY16] Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.