On the High Accuracy Limitation of Adaptive Property Estimation
Yanjun Han ∗ August 28, 2020
Abstract
Recent years have witnessed the success of adaptive (or unified) approaches in estimating symmetric properties of discrete distributions, where one first obtains a distribution estimator independent of the target property, and then plugs the estimator into the target property as the final estimator. Several such approaches have been proposed and proved to be adaptively optimal, i.e. they achieve the optimal sample complexity for a large class of properties within a low accuracy, especially for a large estimation error ε ≫ n^{-1/3} where n is the sample size.

In this paper, we characterize the high accuracy limitation, or the penalty for adaptation, for all such approaches. Specifically, we show that under a mild assumption that the distribution estimator is close to the true sorted distribution in expectation, any adaptive approach cannot achieve the optimal sample complexity for every 1-Lipschitz property within accuracy ε ≪ n^{-1/3}. In particular, this result disproves a conjecture in [ADOS17] that the profile maximum likelihood (PML) plug-in approach is optimal in property estimation for all ranges of ε, and confirms a conjecture in [HS20] that their competitive analysis of the PML is tight.

∗ Yanjun Han is with the Department of Electrical Engineering, Stanford University, email: [email protected].

1 Introduction and Main Results
Given n i.i.d. samples drawn from a discrete distribution p = (p_1, · · · , p_k) of support size k, the problem of symmetric property estimation is to estimate the following quantity

F(p) = Σ_{i=1}^k f(p_i)

or its variants within a small additive error, for a given function f : [0, 1] → R. This is a fundamental problem in computer science and statistics with applications in neuroscience [RWdRvSB99], physics [VBB+12], telecommunication [PW96], biomedical engineering [PGM+01], and ecology [Cha84, CL92, BF93, CCG+12]. One line of research studies the optimal estimation of specific symmetric properties, including the Shannon entropy [JVHW15, WY16], support size [WY19], support coverage [OSW16, ZVV+16, PW19], distance to uniformity and ℓ_1 distance [VV11b, JHW18, HJW18], Rényi entropy [AOST14, AOST17], nonparametric functionals [HJWW17, HJM20], and many others. One of the main findings in these works is that plugging the empirical distribution into the property often leads to a strictly suboptimal estimator, especially when the function f has some non-smooth parts. These works also provided general recipes for the construction of minimax rate-optimal estimators, while the detailed construction crucially depends on the specific property at hand.

The other line of research aims to achieve a more ambitious goal: find an adaptive (or unified) estimator that achieves the optimal sample complexity for all (or most of) the above symmetric properties. Specifically, the learner aims to obtain a unified distribution estimator p̂ of the true distribution p independent of the property F at hand, and hopes that the plug-in estimator F(p̂) is minimax rate-optimal in estimating F(p) for a large class of properties F. As the empirical distribution, possibly the most natural choice of p̂, does not work for this purpose, and the optimal estimator even for known F is typically quite involved, this goal may sound too good to be true. However, a surprising recent development shows that there does exist such an estimator p̂, and there are even multiple such estimators. One estimator is the local moment matching (LMM) estimator in [HJW18] (and its refinement in [HS20]), which is minimax rate-optimal in estimating the true distribution p up to permutation. Moreover, plugging the LMM estimator into the entropy, power sum function, support size, and all 1-Lipschitz functionals attains the optimal sample complexity for the respective properties within any accuracy ε ≫ n^{-1/3}. Another estimator is the profile maximum likelihood (PML) estimator proposed in [OSVZ04], whose statistical performance was analyzed in [ADOS17] via a competitive analysis with amplification factor exp(3√n) of the error probability; this factor was later improved to exp(c′n^{1/3+c}) for any c > 0 in [HS20]. As for many properties F there exists a sample-optimal estimator with a sub-Gaussian error probability exp(−cnε²), the above analyses imply that the PML plug-in approach is also adaptively optimal within any accuracy parameter ε ≫ n^{-1/4} and ε ≫ n^{-1/3}, respectively.

The above adaptive estimators, albeit promising, still leave important questions open. Specifically, we notice the following discrepancy: the estimators constructed in the property-specific manner could achieve the optimal sample complexity for the entire accuracy regime ε ≫ n^{-1/2}, while both adaptive estimators above are shown to be optimal only when ε ≫ n^{-1/3}. This discrepancy indicates the possibility that the adaptive approach may not be fully optimal – what happens for the adaptive approach in the high-accuracy regime n^{-1/2} ≪ ε ≪ n^{-1/3}? Is this high-accuracy regime left uncovered simply due to an artifact of the analyses for these adaptive estimators, the possibility of another fully adaptive estimator which is currently missing, or a fundamental burden for any adaptive estimators?
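To fix ideas, the following Python sketch (ours, purely illustrative and not part of the paper's analysis) implements the plug-in paradigm for one concrete 1-Lipschitz property, the distance to uniformity F(p) = Σ_{i=1}^k |p_i − 1/k|, using the empirical distribution as the plug-in estimator:

```python
# A minimal plug-in sketch (ours): estimate the distance to uniformity
# F(p) = sum_i |p_i - 1/k| by plugging in the empirical distribution.
import numpy as np

def distance_to_uniformity(p):
    k = len(p)
    return np.sum(np.abs(p - 1.0 / k))

rng = np.random.default_rng(0)
k, n = 100, 5000
p = rng.dirichlet(np.ones(k))                    # an arbitrary true distribution
samples = rng.choice(k, size=n, p=p)             # n i.i.d. samples from p
p_emp = np.bincount(samples, minlength=k) / n    # empirical distribution

print("true  F(p)      :", distance_to_uniformity(p))
print("plug-in F(p_emp):", distance_to_uniformity(p_emp))
```

As discussed above, this naive plug-in is suboptimal near the non-smooth point of f, which is precisely what motivates the more refined adaptive estimators below.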
Specializing this question to the PML, [ADOS17] conjectured that "the PML based approach is indeed optimal for all ranges of ε", while [HS20] conjectured that ε ≫ n^{-1/3} is the best possible range for the PML to be adaptively optimal. In this paper, we show that the latter conjecture is true even for general adaptive estimation: there is a phase transition for the performance of adaptive estimators at the accuracy parameter ε ≍ n^{-1/3}, and beyond this point there is an unavoidable price that any adaptive estimator needs to pay on the sample complexity. In other words, for a reasonable family of symmetric properties, although property-specific approaches are optimal for the full accuracy range ε ≫ n^{-1/2}, any adaptive approach fails to achieve the optimal sample complexity for at least one of the properties if ε ≪ n^{-1/3}. More specifically, the main contributions of this paper are as follows:

1. We prove a tight adaptation lower bound for the class of all 1-Lipschitz properties. We show that although the sample complexity for each 1-Lipschitz property is at most O(k/(ε² log k)) for any ε ≫ n^{-1/3}, under a mild assumption, any adaptive estimator must incur a sample complexity of at least Ω(k/ε²) for every ε ≪ n^{-1/3}.

2. As a corollary, we obtain a tight competitive analysis for the PML plug-in approach. Specifically, we show that the amplification factor of the error probability in the PML competitive analysis is at least exp(Ω(n^{1/3−c})) for every c > 0, resolving the tightness conjecture of the upper bound exp(O(n^{1/3+c})) for every c > 0 in [HS20].

1.1 Notations

Throughout the paper we adopt the following notations. Let N be the set of all positive integers, and for n ∈ N, let [n] ≜ {1, 2, . . . , n}. For a finite set A, let |A| be the cardinality of A. For k ∈ N, let M_k be the set of all discrete distributions supported on [k]. For two probability measures P, Q on the same probability space, let ‖P − Q‖_TV = ∫|dP − dQ|/2, D_KL(P‖Q) = ∫ dP log(dP/dQ), and χ²(P‖Q) = ∫ (dP − dQ)²/dQ be the total variation (TV) distance, the Kullback–Leibler (KL) divergence, and the χ²-divergence between P and Q, respectively. For random variables X and Y, let I(X; Y) = ∫ dP_XY log(dP_XY/(dP_X ⊗ dP_Y)) be the mutual information. For p ∈ M_k, let P_p and E_p denote the probability and expectation taken with respect to the i.i.d. samples X_1, · · · , X_n ∼ p, respectively. We also adopt the following asymptotic notations. For non-negative sequences {a_n} and {b_n}, we write a_n ≲ b_n (or a_n = O(b_n)) to denote lim sup_{n→∞} a_n/b_n < ∞, a_n ≳ b_n (or a_n = Ω(b_n)) to denote b_n ≲ a_n, and a_n ≍ b_n (or a_n = Θ(b_n)) to denote both a_n ≲ b_n and b_n ≲ a_n. We also write a_n ≪ b_n to denote lim sup_{ε→0⁺} lim sup_{n→∞} n^ε a_n/b_n = 0, and a_n ≫ b_n to denote b_n ≪ a_n.

1.2 Main Results

To state our main adaptation lower bound, we first need to define the family of adaptive estimators as well as the family of symmetric property estimation problems. In this paper, we consider the class F_Lip of all 1-Lipschitz properties expressed as F(p) = Σ_{i=1}^k f(p_i) with a 1-Lipschitz function f : [0, 1] → R, i.e. |f(x) − f(y)| ≤ |x − y| for all x, y ∈ [0, 1]. An adaptive approach first computes a single distribution estimator p̂ = (p̂_1, · · · , p̂_k) based on the observations X^n, and then uses the plug-in estimator F(p̂) to estimate the property F(p). To measure the performance of this adaptive estimator, we consider the expected estimation error E_p|F(p̂) − F(p)| for the worst-case discrete distribution p ∈ M_k and the worst-case 1-Lipschitz property F ∈ F_Lip. In other words, in this paper we are interested in characterizing the following adaptive minimax risk:

R⋆_adaptive(n, k) ≜ inf_{p̂} sup_{F∈F_Lip} sup_{p∈M_k} E_p|F(p̂) − F(p)|. (1)

For technical reasons, we also impose the following mild assumption on the single distribution estimator p̂.
Assumption 1. For each n, k ∈ N, we assume that the distribution estimator p̂(X^n) = (p̂_1, · · · , p̂_k) satisfies (where S_k denotes the permutation group over [k])

sup_{p∈M_k} E_p[ min_{σ∈S_k} Σ_{i=1}^k |p̂_{σ(i)} − p_i| ] ≤ A(n) · √(k/n),

with A(n) ≪ n^δ for every δ > 0. We will use P to denote the class of all such estimators p̂.

Assumption 1 essentially requires that the single distribution estimator p̂ used in the adaptive approach be a reasonably good estimator of the true distribution p up to permutation, where "reasonably" means that the estimator cannot be much worse than the empirical estimator. We provide three reasons why we believe this assumption to be mild. First, it is very natural to expect or require that a good distribution estimator used in the adaptive approach be sound not only after being plugged into various properties, but also before the plug-in process in terms of the (sorted) distribution estimation. In other words, Assumption 1 could be treated as an additional requirement for any sound adaptive approach. Second, Assumption 1 holds for many natural or known estimators. For example, the empirical distribution satisfies Assumption 1 with A(n) ≡ 1, and both the LMM and PML estimators belong to P with A(n) = polylog(n) (cf. [HJW18] for the LMM, and the proof of Theorem 2 for the PML). Hence, restricting to the estimator class P still leads to non-trivial lower bounds for these known estimators. Third, a larger quantity A(n) in Assumption 1 only shrinks the accuracy regime from ε ≪ n^{-1/3} to ε ≪ (nA(n))^{-1/3}, but does not affect the claimed minimax lower bound in the new accuracy regime. In addition to these reasons, we remark that Assumption 1 is mostly a technical assumption, and we conjecture that the following Theorem 1 still holds without it.

Restricting to the estimator class P, the following theorem characterizes the tight adaptive minimax rate for 1-Lipschitz property estimation.
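As a side remark, the permutation-minimized ℓ_1 distance in Assumption 1 is easy to evaluate: by the rearrangement inequality, the minimum over σ ∈ S_k is attained by sorting both vectors. The following Python sketch (ours; the helper names are hypothetical) checks this reduction against brute force on a small example:

```python
# Sorted l1 distance (ours): by the rearrangement inequality, the quantity
# min over permutations sigma of sum_i |p_hat_{sigma(i)} - p_i| in
# Assumption 1 equals the l1 distance between the sorted vectors.
import itertools
import numpy as np

def sorted_l1(p_hat, p):
    return np.sum(np.abs(np.sort(p_hat) - np.sort(p)))

def brute_force_min_l1(p_hat, p):
    return min(np.sum(np.abs(p_hat[list(sigma)] - p))
               for sigma in itertools.permutations(range(len(p))))

rng = np.random.default_rng(1)
p, p_hat = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
assert np.isclose(sorted_l1(p_hat, p), brute_force_min_l1(p_hat, p))
print("sorted l1 distance:", sorted_l1(p_hat, p))
```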
Theorem 1. For each n, k ∈ N, it holds that

inf_{p̂∈P} sup_{F∈F_Lip} sup_{p∈M_k} E_p|F(p̂) − F(p)| ≍ √(k/(n log n)) if n^{1/3} ≪ k ≲ n log n, and ≍ √(k/n) if 1 ≪ k ≪ n^{1/3}.
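To make the phase transition in Theorem 1 concrete, the following back-of-the-envelope computation (ours, illustrative only) tabulates the two branches at a fixed sample size n: for k ≪ n^{1/3} the adaptive rate √(k/n) exceeds the property-specific benchmark √(k/(n log n)) of (2) below by a √(log n) factor.

```python
# Illustration (ours) of the two branches of Theorem 1 at a fixed n.
import numpy as np

n = 10**6
for k in [30, 100, int(n ** (1 / 3)), 10**4, 10**5]:
    benchmark = np.sqrt(k / (n * np.log(n)))   # property-specific rate, cf. (2)
    adaptive = np.sqrt(k / n) if k < n ** (1 / 3) else benchmark
    penalty = adaptive / benchmark             # = sqrt(log n) when k << n^{1/3}
    print(f"k={k:>6d}: benchmark={benchmark:.2e}, "
          f"adaptive={adaptive:.2e}, penalty={penalty:.2f}")
```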
Corollary 1. It is sufficient and necessary to have n = Θ(k/(ε² log k)) samples for the existence of an adaptive estimator in P to estimate all 1-Lipschitz properties within error ε if ε ≫ n^{-1/3}, and it is sufficient and necessary to have n = Θ(k/ε²) samples for the existence of an adaptive estimator in P to estimate all 1-Lipschitz properties within error ε if n^{-1/2} ≪ ε ≪ n^{-1/3}.

Let us appreciate the result of Theorem 1 via comparisons with other results. First, there will be no phase transition in the high-accuracy regime if we do not require an adaptive estimator. Specifically, the following result was shown in [HO19b]:

sup_{F∈F_Lip} inf_{F̂} sup_{p∈M_k} E_p|F̂ − F(p)| ≍ √(k/(n log n)), 1 ≪ k ≲ n log n. (2)

Comparing Theorem 1 and (2), we observe that simply after a swap of the infimum and supremum, the minimax risk becomes significantly different. In particular, there is a strict separation between the best achievable errors for adaptive and non-adaptive approaches, and the learner may need to pay a strict penalty on the estimation error to achieve adaptation.

Second, we also compare Theorem 1 with a similar form of the minimax risk in the problem of estimating the sorted distribution, where [HJW18] shows that

inf_{p̂} sup_{p∈M_k} E_p[ sup_{F∈F_Lip} |F(p̂) − F(p)| ] ≍ √(k/(n log n)) if n^{1/3} ≪ k ≲ n log n, and ≍ √(k/n) if 1 ≪ k ≪ n^{1/3}. (3)

As E[sup_F X_F] ≥ sup_F E[X_F], the quantity in (3) is no smaller than our adaptive minimax risk in (1), and thus implies the upper bound in Theorem 1. However, the lower bound of Theorem 1 is the most challenging part and stronger than what (3) gives. In particular, we remark that after exchanging the expectation and the supremum, the lower bound argument becomes fundamentally different, and the traditional approaches fail to give the tight adaptive lower bound; we refer to Section 2.1 for more details. Moreover, comparing the results of Theorem 1 and (3), we remark that (3) gives a tight phase transition for the problem, while the adaptive lower bound in Theorem 1 shows a tight phase transition for adaptive approaches. Technically, the former transition could be derived by studying different regimes of the problem, while the latter transition requires one to also take into account the crucial nature of the adaptive approach.

The general adaptive lower bound of Theorem 1 also gives tight and non-trivial lower bounds for some known adaptive approaches. For example, for the LMM adaptive approach in [HJW18], Theorem 1 shows that the condition ε ≫ n^{-1/3} required for its optimality in property estimation is not superfluous, but in general unavoidable. The implication for the PML adaptive approach [OSVZ04] is even more surprising; to fully describe this we need to recall some basics of the PML. Given n i.i.d. observations X_1, · · · , X_n drawn from a discrete distribution p supported on the domain [k], the profile of the observations is defined as a vector φ = (φ_0, φ_1, · · · , φ_n) with φ_i being the number of domain elements j ∈ [k] which appear exactly i times in the sample. For example, φ_0 is the number of unseen elements, and φ_1 is the number of unique elements, i.e. those appearing exactly once. Let Φ_{n,k} be the set of all possible profiles with n observations and support size k. Note that for any φ ∈ Φ_{n,k} and p ∈ M_k, we could compute the probability that the resulting profile is φ under the true distribution p, denoted by P(p, φ). The profile maximum likelihood (PML) distribution estimator is then defined as

p_PML(φ) = arg max_{p∈M_k} P(p, φ).

In other words, upon observing the profile φ, the PML estimator is the discrete distribution which maximizes the probability of observing φ.
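For concreteness, the following Python sketch (ours; exponential-time and only feasible for tiny n and k) computes the profile of a sample and brute-forces the PML objective P(p, φ) by enumerating all k^n sequences and maximizing over a coarse grid on the simplex:

```python
# Brute-force profile and PML sketch (ours): only feasible for tiny (n, k).
from collections import Counter
from itertools import product

def profile(xs, k):
    counts = Counter(xs)
    mult = Counter(counts.values())
    # phi_0 = number of unseen symbols; phi_i = symbols appearing exactly i times
    return tuple(k - len(counts) if i == 0 else mult.get(i, 0)
                 for i in range(len(xs) + 1))

def profile_probability(p, phi, n):
    # P(p, phi): total probability of all length-n sequences with profile phi
    k = len(p)
    total = 0.0
    for xs in product(range(k), repeat=n):
        if profile(xs, k) == phi:
            prob = 1.0
            for x in xs:
                prob *= p[x]
            total += prob
    return total

def pml_on_grid(phi, n, k, grid=10):
    # maximize P(p, phi) over the coarse simplex grid {p : p_i = c_i / grid}
    best_val, best_p = -1.0, None
    for c in product(range(grid + 1), repeat=k):
        if sum(c) == grid:
            p = [ci / grid for ci in c]
            val = profile_probability(p, phi, n)
            if val > best_val:
                best_val, best_p = val, p
    return best_p, best_val

n, k = 4, 3
phi = profile([0, 0, 1, 2], k)     # one symbol seen twice, two seen once
print("profile:", phi)             # -> (0, 2, 1, 0, 0)
print("approximate PML:", pml_on_grid(phi, n, k))
```

This brute force only underlines the point made next: computing the PML at scale is a genuinely hard non-convex problem.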
This estimator is interesting in several aspects. From the optimization side, the probability P(p, φ) is a highly non-convex function of p, and it is very challenging to compute exact or approximate PMLs. From the statistical side, as P(p, φ) does not admit an additive form even under i.i.d. models (unlike the traditional log-likelihood), even first-order asymptotic properties are challenging to establish for the PML. Thirteen years after its invention, a useful statistical property of the PML was established in [ADOS17] in terms of an interesting competitive analysis: for every property F and accuracy parameter ε, it holds that

sup_{p∈M_k} P_p(|F(p_PML) − F(p)| ≥ 2ε) ≤ exp(3√n) · inf_{F̂} sup_{p∈M_k} P_p(|F̂ − F(p)| ≥ ε). (4)

Specifically, (4) gives an indirect statistical analysis of the PML plug-in approach which depends on the performance of another estimator. For many properties (such as all 1-Lipschitz properties), the minimax error probability on the RHS of (4) behaves as exp(−Ω(nε²)) when n exceeds the optimal sample complexity, thus (4) shows that the PML plug-in approach is adaptively optimal for ε ≫ n^{-1/4}. The proof of (4) used only the defining property of the PML in a delicate way, and the error amplification factor exp(3√n) follows from a simple union bound and a cardinality bound on the number of profiles, |Φ_{n,k}| ≤ exp(3√n).

The paper [ADOS17] asked whether the above error amplification factor exp(3√n) could be improved in general; three years later [HS20] provided an affirmative answer. Specifically, using a chaining property of the PML distributions, [HS20] showed the following improvement:

sup_{p∈M_k} P_p(|F(p_PML) − F(p)| ≥ (2 + o(1))ε) ≤ exp(c′n^{1/3+c}) · ( inf_{F̂} sup_{p∈M_k} P_p(|F̂ − F(p)| ≥ ε) )^{1−c} (5)

for any absolute constant c > 0, where c′ > 0 depends only on c. Using (5), the accuracy range for the optimality of the PML could be improved to ε ≫ n^{-1/3} for the aforementioned properties. It was also conjectured in [HS20] that the new amplification factor in (5) is tight, but little intuition was provided.

Surprisingly, without directly analyzing the PML adaptive approach, Theorem 1 implies the tightness of the error amplification factor in (5), as summarized in our next main theorem.
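A quick back-of-the-envelope computation (ours) makes the resulting accuracy ranges explicit: with minimax error probability exp(−Ω(nε²)), the bound (4) is informative only when nε² ≫ √n, i.e. ε ≫ n^{-1/4}, while (5) is informative only when nε² ≫ n^{1/3}, i.e. ε ≫ n^{-1/3}:

```python
# Accuracy thresholds (ours) implied by the amplification factors in (4)-(5).
for n in [10**4, 10**6, 10**8]:
    eps4 = n ** -0.25       # n * eps^2 ~ sqrt(n): threshold for exp(3 sqrt(n))
    eps3 = n ** (-1 / 3)    # n * eps^2 ~ n^{1/3}: threshold for exp(n^{1/3+c})
    print(f"n={n:.0e}: (4) informative for eps >> {eps4:.1e}, "
          f"(5) informative for eps >> {eps3:.1e}")
```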
Theorem 2. For any given constants C > 0, c_1 ∈ (0, 1/3) and c_2 ∈ (0, 1), it holds that

lim inf_{n→∞} n^{−(1/3−c_1)} · sup_{F∈F_Lip} sup_{k∈N, ε>0} log [ sup_{p∈M_k} P_p(|F(p_PML) − F(p)| ≥ Cε) / ( inf_{F̂} sup_{p∈M_k} P_p(|F̂ − F(p)| ≥ ε) )^{1−c_2} ] = +∞.

After some algebra, it is clear that Theorem 2 rules out the possibility that the exponent O(n^{1/3+c}) of the amplification factor in (5) could be improved to O(n^{1/3−c}) in general. Therefore, Theorem 2 implies that the general competitive analysis of the PML in [HS20] is essentially tight, and thereby resolves the conjecture in [HS20].

We provide two additional remarks on Theorem 2. First, the validity of Theorem 2 does not rely on Assumption 1, as the PML estimator is a simple instance which satisfies Assumption 1. Second, the lower bound in Theorem 2 does not rule out the possibility that the PML adaptive approach could be fully optimal for some property. For example, it was shown in [CSS19] that the PML plug-in approach is fully optimal in estimating the support size. It remains an interesting open question to propose a tight analysis of the PML estimator for specific properties.

1.3 Related Works

Property Estimation.
There has been a rich line of research towards the optimal estimation of properties (or functionals) of high-dimensional parameters, especially in the past decade. Starting from some early work [LNS99, Pan03, Pan04, CL11, VV11a, VV11b, VV13], fully minimax rate-optimal estimators in all accuracy regimes were obtained for the Shannon entropy in [JVHW15, WY16]. These works also provided general recipes for both the estimator construction and tight minimax lower bounds. Specifically, the crux of the optimal estimator construction lies in the classification of smooth and non-smooth regimes and the usage of polynomial approximation to reduce the bias in the non-smooth regime, while the minimax lower bound relies on the duality between moment matching and best polynomial approximation. Since then, these general recipes together with their non-trivial extensions have been applied to various other properties, e.g. the Rényi entropy [AOST14, AOST17], support size [WY19], support coverage [OSW16, ZVV+16, PW19], distance to uniformity [JHW18], general 1-Lipschitz properties [HO19a, HO19b], L_1 distance [JHW18], KL divergence [HJW16, BZLV18], and nonparametric functionals [HJWW17, HJM20]. We refer to the survey [Ver19] for an overview of these results. There is also another line of recent work on estimating a population of parameters or a distribution under a Wasserstein distance, a problem closely related to property estimation, via projection-based methods without explicit polynomial approximation [KV17, TKV17, HJW18, RW19, VKVK19, VKK19, WY20, JPW20]. While the above works completely characterized the complexity of many given problems in property estimation, the complexity of adaptive estimation over a set of such problems was largely missing. For example, the Ω(√(k/(n log n))) lower bound for large k in Theorem 1 simply follows from the complexity of estimating a particular 1-Lipschitz property, but the main Ω(√(k/n)) lower bound for small k captures the crucial complexity of adaptive approaches and thus does not follow from the above set of results or tools.

Adaptive Property Estimation.
More recently the problem of adaptive, or unified, property estimation has drawn considerable research attention. As reviewed in the introduction, possibly the most well-known adaptive approach is the PML plug-in approach, with early statistical developments in [OSVZ04, OSVZ11, AGZ17]. Since [ADOS17] provided the first competitive analysis of the PML plug-in approach, there have been several follow-up papers on the statistical analysis of the PML. Some works focused on the application of the competitive analysis and the construction of an estimator achieving the minimax error probability in (4), e.g. [HO19a]. Some works focused on proper modifications of the PML to achieve better adaptation, e.g. [HO19a, CSS19]; however, these modified distribution estimators depend on the target property and are thus not fully unified. Other works aimed to improve the competitive analysis in [ADOS17]; for example, [HO20] obtained a distribution-dependent amplification factor without changing the worst-case analysis, and [HS20] improved this factor to exp(O(n^{1/3+c})) in general. However, none of the above works studied the limitation of the PML plug-in approach, even for concrete examples. Therefore, the lower bound analysis of the PML, especially the possible separation compared with the optimal estimator, was missing.

Another adaptive approach plugs in the LMM estimator proposed in [HJW18]. Different from the general competitive analysis of the PML, the performance of the LMM approach could be directly analyzed for given properties based on its moment matching performance in each local interval. Based on the LMM performance analysis in estimating the entropy, power sum function, and support size, the authors of [HJW18] commented that the LMM pays some penalty for being a unified approach. However, this comment was only an insight, and there was no lower bound to support it rigorously. The current work fills in this gap and shows that the price observed for the LMM is in fact unavoidable even for general adaptive approaches.

Adaptation Lower Bound.
We also review and compare with some known tools to establish adaptation lower bounds, mainly taken from the statistics literature. Adaptation is an important topic in statistics; for example, in nonparametric estimation one may aim to design a density estimator adapting to different smoothness parameters, and in hypothesis testing one may wish to propose an adaptive test procedure against several different alternatives. For some problems the adaptation could be achieved without paying any penalty (e.g. density estimation [Lep92, DJKP95], L_r norm estimation with non-even r [HJM20]), while for others some adaptation penalties are inevitable (e.g. linear [EL94] or quadratic [EL96] functionals of densities). The main technical tool to establish tight penalties of adaptation is the constrained risk inequality originally developed in [BL96] and generalized in [CL11, DR18]. Roughly speaking, this type of inequality asserts that if an estimator achieves a too small error at one point, it must incur a too large error at another point; therefore, adaptation may incur a penalty as it might be required to adapt to easier problems and achieve a too small error. For testing, there is also another approach to establish adaptation lower bounds, where the key is to use a mixture of different alternative distributions which could be closer to the null than any fixed alternative; see [Spo96] and also [GN16, Chapter 8] for examples.

However, we remark that our adaptive estimation problem is fundamentally different. In the above works, the target of adaptive estimation is to adapt to different (usually nested) parameter sets, such as Hölder balls with different smoothness parameters. In contrast, we consider a fixed parameter set (i.e. p ∈ M_k), but wish to adapt to different loss functions for the final estimator. Establishing adaptation lower bounds for different losses is novel to our knowledge, and the above tools are not applicable to this problem. Consequently, we aim to provide useful tools (cf. Section 2.1) for this new adaptation problem, and expect them to be a helpful addition to the literature on adaptive estimation.

2 Proof of Theorem 1

This section is devoted to the proof of Theorem 1. Note that the upper bound is achieved by the LMM estimator for k ≫ n^{1/3} and by the empirical distribution for k ≪ n^{1/3} [HJW18] (note that [HJW18] shows that both the LMM and empirical distributions belong to P), and the lower bound for k ≫ n^{1/3} follows from the minimax lower bound for estimating a specific 1-Lipschitz property, i.e. the distance to uniformity F(p) = Σ_{i=1}^k |p_i − 1/k| [JHW18].
Therefore, it remains to prove the following adaptation lower bound:

inf_{p̂∈P} sup_{F∈F_Lip} sup_{p∈M_k} E_p|F(p̂) − F(p)| ≳ √(k/n), 1 ≪ k ≪ n^{1/3}. (6)

This section is organized as follows. Section 2.1 presents a high-level overview of the idea of proving adaptation lower bounds with respect to a class of loss functions in an abstract decision-theoretic setup, and Section 2.2 introduces a generalized Fano's inequality for the adaptation lower bounds. The details to feed into these tools are worked out in Section 2.3.

2.1 Overview of the Main Idea

We consider a general decision-theoretic setup [Wal50]. Let (P_θ)_{θ∈Θ} be a general statistical model with parameter set Θ, and A be the space of all possible actions the learner could take. In other words, the learner obtains an observation X ∼ P_θ with some unknown θ, and then maps X to a (possibly random) action a(X) ∈ A. Let L : Θ × A → R_+ be any (measurable) loss function; the problem of minimax estimation is to characterize the following minimax risk:

R⋆(Θ, A, L) = inf_a sup_{θ∈Θ} E_θ[L(θ, a(X))]. (7)

Similarly, the problem of adaptive minimax estimation with respect to a class of loss functions 𝓛 is to characterize the following adaptive minimax risk:

R⋆(Θ, A, 𝓛) = inf_a sup_{θ∈Θ} sup_{L∈𝓛} E_θ[L(θ, a(X))]. (8)

To see how (8) is related to adaptive property estimation, we could set P_θ to be the distribution of n i.i.d. samples from the discrete distribution θ, with Θ = M_k. Moreover, A = M_k, L_F(θ, a) = |F(θ) − F(a)|, and 𝓛 = {L_F : F is a 1-Lipschitz property}.

There are several well-known tools to establish lower bounds on (7), where a standard and prominent tool is the reduction to hypothesis testing problems; see, e.g. [Yu97, Tsy09]. The main step is to find θ_1, · · · , θ_M ∈ Θ such that both a separation condition and an indistinguishability condition hold: the separation condition typically requires that inf_{a∈A}[L(θ_i, a) + L(θ_j, a)] ≥ ∆ for some separation parameter ∆ > 0 and all i ≠ j, and the indistinguishability condition essentially states that no learner could determine the true parameter θ_i based on her observations if the truth i ∈ [M] is chosen uniformly at random. It might then be tempting to think that one only needs to replace L(θ, a) by sup_{L∈𝓛} L(θ, a) in the above arguments to lower bound (8). However, this approach places the supremum over 𝓛 inside the expectation in (8), and thus provides a lower bound for a larger quantity like (3). An alternative way is to use the trivial inequality R⋆(Θ, A, 𝓛) ≥ sup_{L∈𝓛} R⋆(Θ, A, L) and then lower bound the latter quantity. Although this gives a valid lower bound, it is not strong enough in our problem where R⋆(Θ, A, 𝓛) ≫ sup_{L∈𝓛} R⋆(Θ, A, L) in view of Theorem 1 and (2).

The main idea to fix the above difficulty is that in addition to choosing M points θ_1, · · · , θ_M ∈ Θ corresponding to different statistical models, we also find M different loss functions L_1, · · · , L_M ∈ 𝓛 tailored to the respective models. Specifically, the indistinguishability condition is unchanged as it depends only on θ_1, · · · , θ_M, while the separation condition could be replaced by inf_{a∈A}[L_i(θ_i, a) + L_j(θ_j, a)] ≥ ∆ for all i ≠ j. Despite its simplicity, this idea gives the tight adaptation lower bound for property estimation, and can thus be viewed as an adaptive version of the hypothesis testing approach for adaptation lower bounds.
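The gap between swapping and not swapping the supremum and the expectation is a familiar phenomenon; the following two-line numerical reminder (ours) shows that E[max_F X_F] can strictly exceed max_F E[X_F], which is why a lower bound on (3) does not transfer to (1):

```python
# E[max] vs max[E] (ours): 50 independent standard Gaussian "losses".
import numpy as np

X = np.random.default_rng(2).standard_normal((10**5, 50))
print("E[max_F X_F] =", X.max(axis=1).mean())    # about 2.2
print("max_F E[X_F] =", X.mean(axis=0).max())    # about 0
```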
2.2 Generalized Fano's Inequality

There is an additional difficulty in applying the aforementioned high-level idea to our problem: the new separation condition L_i(θ_i, a) + L_j(θ_j, a) ≥ ∆ does not hold for every action a ∈ A, but instead holds for the random action a(X) with a strictly positive probability. To account for this subtlety, we propose the following version of Fano's inequality.

Lemma 1 (Generalized Fano's Inequality). In the above decision-theoretic setup, suppose that θ_1, · · · , θ_M ∈ Θ and L_1, · · · , L_M ∈ 𝓛 are chosen. Assume that there exists A_0 ⊆ A such that

inf_{a∈A_0} [L_i(θ_i, a) + L_j(θ_j, a)] ≥ ∆ > 0, ∀i, j ∈ [M], i ≠ j,

and an estimator a(X) satisfies P_{θ_i}(a(X) ∈ A_0) ≥ p_min > 0 for all i ∈ [M]. Then for this estimator we have

sup_{θ∈Θ} sup_{L∈𝓛} E_θ[L(θ, a(X))] ≥ (∆/2) · ( p_min − (I(U; X) + p_min log 2) / log M ),

where I(U; X) denotes the mutual information between U ∼ Uniform([M]) and X | U ∼ P_{θ_U}.

Note that when L_i ≡ L and p_min = 1, Lemma 1 reduces to the traditional Fano's inequality [CT06]. Hence, Lemma 1 is a generalization of Fano's inequality in the sense that it gives an adaptation lower bound with a soft separation condition. We prove Lemma 1 in the remainder of this subsection. First, as the maximum is no smaller than the average, we have

sup_{θ∈Θ} sup_{L∈𝓛} E_θ[L(θ, a(X))] ≥ (1/M) Σ_{i=1}^M E_{θ_i}[L_i(θ_i, a(X))]. (9)

For each i ∈ [M], let Q_i be the conditional distribution of a(X) with X ∼ P_{θ_i}, conditioned on the event a(X) ∈ A_0. Then by the non-negativity of each L_i and the definition of p_min,

E_{θ_i}[L_i(θ_i, a(X))] ≥ P_{θ_i}(a(X) ∈ A_0) · E_{a∼Q_i}[L_i(θ_i, a)] ≥ p_min · E_{a∼Q_i}[L_i(θ_i, a)],

and therefore (9) gives

sup_{θ∈Θ} sup_{L∈𝓛} E_θ[L(θ, a(X))] ≥ p_min · (1/M) Σ_{i=1}^M E_{a∼Q_i}[L_i(θ_i, a)]. (10)

The next few steps are similar to the proof of the traditional Fano's inequality. For each a ∈ A_0, define a test Ψ(a) = arg min_{i∈[M]} L_i(θ_i, a). Then by the separation condition, we have

L_i(θ_i, a) ≥ (L_i(θ_i, a) + L_{Ψ(a)}(θ_{Ψ(a)}, a))/2 ≥ (∆/2) · 1(Ψ(a) ≠ i), ∀i ∈ [M], a ∈ A_0,

and therefore (10) gives

sup_{θ∈Θ} sup_{L∈𝓛} E_θ[L(θ, a(X))] ≥ (∆ p_min / 2) · (1/M) Σ_{i=1}^M Q_i(Ψ(a) ≠ i) ≥ (∆ p_min / 2) · ( 1 − (I(U; Y) + log 2) / log M ), (11)

where the second inequality is due to the traditional Fano's inequality [CT06], with U ∼ Uniform([M]) and Y | U ∼ Q_U. To proceed, we introduce a few notations: let R_i be the distribution of a(X) with X ∼ P_{θ_i}, let R̄ be the distribution of a(X) with X ∼ M^{-1} Σ_{i=1}^M P_{θ_i}, and let Q̄ be the restriction of the distribution R̄ to the set A_0. Then

I(U; Y) ≤(a) E_U[D_KL(Q_U ‖ Q̄)] ≤(b) E_U[ D_KL(R_U ‖ R̄) / P_{θ_U}(a(X) ∈ A_0) ] ≤(c) (1/p_min) · E_U[D_KL(R_U ‖ R̄)] =(d) I(U; a(X)) / p_min ≤(e) I(U; X) / p_min,

where (a) is due to the variational representation of the mutual information I(U; Y) = min_{Q_Y} E_U[D_KL(P_{Y|U} ‖ Q_Y)], (b) follows from the data-processing property of the KL divergence D_KL(P ‖ Q) ≥ P(A_0) · D_KL(P_{·|A_0} ‖ Q_{·|A_0}), (c) is due to the assumption of Lemma 1, (d) is the definition of the mutual information, and (e) is the data-processing property of the mutual information. Now combining the above inequality with (11) completes the proof of Lemma 1.
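For later reference, the lower bound of Lemma 1 is a simple closed-form expression; the following helper (ours, a direct transcription of the lemma) evaluates it and confirms that p_min = 1 recovers the classical Fano bound:

```python
# The lower bound of Lemma 1 (ours): (Delta/2)*(p_min - (I + p_min*log2)/log M).
import numpy as np

def generalized_fano(Delta, p_min, mutual_info, M):
    return 0.5 * Delta * (p_min - (mutual_info + p_min * np.log(2)) / np.log(M))

# p_min = 1 recovers the classical Fano bound (Delta/2)*(1 - (I + log2)/log M):
print(generalized_fano(Delta=1.0, p_min=1.0, mutual_info=0.1, M=100))
# the soft-separation variant used in Section 2.3 takes p_min = 1/2:
print(generalized_fano(Delta=1.0, p_min=0.5, mutual_info=0.1, M=100))
```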
2.3 Proof of the Adaptation Lower Bound (6)

Recall that to formulate our adaptive property estimation problem in the general framework of (8), we identify θ ∈ Θ and a ∈ A with the distributions p, p̂ ∈ M_k, and Θ = A = M_k. Moreover, the loss function is the absolute difference in the property value, L_F(p, p̂) = |F(p) − F(p̂)|, and the family of losses is 𝓛 = {L_F : F is a 1-Lipschitz property}. In this section, we apply Lemma 1 to a suitable choice of distributions p_1, · · · , p_M ∈ M_k and 1-Lipschitz properties F_1, · · · , F_M ∈ F_Lip, and prove the target adaptation lower bound in (6).

Without loss of generality we assume that k = 2k_1 is an even integer. Consider the following distribution p_0 = (p_{0,1}, · · · , p_{0,k}) ∈ M_k serving as the "center" of all hypotheses:

p_0 = ( 1/(2k), 1/(2k) + 1/(k(k−1)), 1/(2k) + 2/(k(k−1)), · · · , 3/(2k) ).

Fix a parameter

δ ∈ (0, 1/(4k(k−1))) (12)

to be chosen later; for each u ∈ U_0 ≜ {±1}^{k_1} we also associate a distribution p_u = (p_{u,1}, · · · , p_{u,k}) ∈ M_k with

p_{u,i} = p_{0,i} + u_i δ, p_{u,k_1+i} = p_{0,k_1+i} − u_i δ, ∀i ∈ [k_1].

Clearly each p_u is a valid probability distribution, and this is known as Paninski's construction [Pan08]. By the Gilbert–Varshamov bound, there exists U ⊆ U_0 such that the minimum pairwise Hamming distance between distinct elements of U is at least k_1/5, and |U| ≥ exp(k_1/10). We choose {p_u}_{u∈U} as the parameters θ_1, · · · , θ_M in Lemma 1, with M = |U| ≥ exp(k_1/10).

For each u ∈ U, we also need to specify the associated loss, or equivalently the 1-Lipschitz property F_u ∈ F_Lip. The detailed choice of F_u is given by

F_u(p) = Σ_{i=1}^k f_u(p_i) = Σ_{i=1}^k min_{j∈[k]} |p_i − p_{u,j}|, p = (p_1, · · · , p_k), u ∈ U.

As the map x ↦ |x − x_0| is 1-Lipschitz for any x_0 ∈ R, and the pointwise minimum of 1-Lipschitz functions is still 1-Lipschitz, each F_u is a valid 1-Lipschitz property.

Finally, to apply Lemma 1, it remains to specify the subset A_0. For each i ∈ [k], let I_i be the open interval (p_{0,i} − 1/(2k(k−1)), p_{0,i} + 1/(2k(k−1))); note that I_1, · · · , I_k are disjoint intervals by the definition of p_0. Now we define A_0 as

A_0 ≜ { q = (q_1, · · · , q_k) ∈ M_k : Σ_{i=1}^k Π_{j=1}^k 1(q_j ∉ I_i) ≤ k/10 }.

In other words, the subset A_0 consists of all probability vectors which intersect with at least 9/10 of the intervals I_1, · · · , I_k.

With the above construction and definitions, we are ready to use Lemma 1 for the adaptation lower bound. Specifically, we are left with three tasks: to lower bound the separation parameter ∆, to lower bound the minimum probability p_min for all estimators p̂ ∈ P, and to upper bound the mutual information I(U; X^n).

Lower bound of ∆. First, we aim to find a lower bound of |F_u(q) − F_u(p_u)| + |F_{u′}(q) − F_{u′}(p_{u′})| for all q ∈ A_0 and u ≠ u′ ∈ U. By the construction of F_u, it is clear that F_u(p_u) = 0 for all u ∈ U, and the above quantity can be written as

|F_u(q) − F_u(p_u)| + |F_{u′}(q) − F_{u′}(p_{u′})| = Σ_{i=1}^k ( min_{j∈[k]} |q_i − p_{u,j}| + min_{j∈[k]} |q_i − p_{u′,j}| ).

One could check the following simple fact: if q_i ∈ I_{j(i)} for some j(i) ∈ [k], then

min_{j∈[k]} |q_i − p_{u,j}| + min_{j∈[k]} |q_i − p_{u′,j}| ≥ |p_{u,j(i)} − p_{u′,j(i)}| ∈ {0, 2δ}.

By the definition of q ∈ A_0, we know that the set {j(i)}_{i∈[k]} contains at least 9k/10 elements of [k]. Moreover, by the minimum distance property of U, for any u ≠ u′ ∈ U, there are at least k/5 indices j ∈ [k] such that |p_{u,j} − p_{u′,j}| = 2δ. By an inclusion–exclusion principle, there are at least 9k/10 + k/5 − k = k/10 elements in the set {j(i)}_{i∈[k]} such that |p_{u,j(i)} − p_{u′,j(i)}| = 2δ, and therefore

|F_u(q) − F_u(p_u)| + |F_{u′}(q) − F_{u′}(p_{u′})| ≥ (k/10) · 2δ = kδ/5, ∀u ≠ u′ ∈ U, q ∈ A_0.

In other words, ∆ ≥ kδ/5.

Lower bound of p_min. Next, we lower bound the probability P_{p_u}(p̂(X^n) ∈ A_0) for all p̂ ∈ P and u ∈ U. Here we need to use the definition of P in Assumption 1. Assume without loss of generality that p̂_1 ≤ · · · ≤ p̂_k, as any permutation of p̂ does not affect the validity of Assumption 1. Also, by the definition of p_u and the choice of δ in (12), the entries of each p_u are monotonically increasing as well. Consequently, choosing p = p_u in Assumption 1 gives

E_{p_u}[ Σ_{i=1}^k |p̂_i − p_{u,i}| ] ≤ A(n) · √(k/n).

On the other hand, if the event p̂ ∉ A_0 occurs, there are at least k/10 indices i ∈ [k] such that p̂_j ∉ I_i for all j ∈ [k]. Consequently, for such an index i, one has |p̂_i − p_{u,i}| ≥ 1/(2k(k−1)) − δ ≥ 1/(4k(k−1)) by the choice of δ in (12). Therefore,

Σ_{i=1}^k |p̂_i − p_{u,i}| ≥ (k/10) · 1/(4k(k−1)) · 1(p̂ ∉ A_0) ≥ 1/(40k) · 1(p̂ ∉ A_0).

Combining the above two inequalities, we conclude by Markov's inequality that

sup_{p̂∈P} max_{u∈U} P_{p_u}(p̂(X^n) ∉ A_0) ≤ 40 A(n) · √(k³/n),

which is far smaller than 1 as k ≪ n^{1/3} and A(n) ≪ n^δ for all δ > 0. Consequently, we may choose p_min = 1/2.

Upper bound of I(U; X^n). The upper bound on the mutual information could be established in a similar way as in [HJW18]. Specifically, the following chain of inequalities holds:

I(U; X^n) ≤(a) E_U[D_KL(p_U^⊗n ‖ p_0^⊗n)] =(b) n · E_U[D_KL(p_U ‖ p_0)] ≤(c) n · E_U[ Σ_{i=1}^k (p_{U,i} − p_{0,i})² / p_{0,i} ] ≤(d) 2nk²δ²,

where (a) is due to the variational representation of the mutual information I(U; X) = min_{Q_X} E_U[D_KL(P_{X|U} ‖ Q_X)] and the fact that P_{X^n|U} = p_U^⊗n, (b) follows from the chain rule of the KL divergence, (c) uses the inequality D_KL(P ‖ Q) ≤ χ²(P ‖ Q), and (d) follows from min_{i∈[k]} p_{0,i} ≥ 1/(2k) and simple algebra. Consequently, the mutual information could be upper bounded as I(U; X^n) ≤ 2nk²δ².

Combining the above analysis, Lemma 1 gives that

inf_{p̂∈P} sup_{F∈F_Lip} sup_{p∈M_k} E_p|F(p̂) − F(p)| ≥ (kδ/10) · ( 1/2 − (2nk²δ² + log 2) / (k/20) ).

Consequently, choosing δ = c/√(nk) for a small enough constant c > 0 establishes the lower bound (6) (note that the constraint on δ in (12) is also fulfilled as k ≪ n^{1/3}).
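The construction above is concrete enough to code up; the following sketch (ours; the greedy codebook is a crude stand-in for the Gilbert–Varshamov argument, and the constants are illustrative) builds p_0 and the perturbations p_u, verifies their validity and monotonicity, and evaluates the Lemma 1 ingredients ∆ and I(U; X^n) for δ = c/√(nk):

```python
# Sketch (ours) of the Section 2.3 construction with toy constants.
import numpy as np

def center(k):                        # p0_i = 1/(2k) + (i-1)/(k(k-1))
    i = np.arange(1, k + 1)
    return 1 / (2 * k) + (i - 1) / (k * (k - 1))

def perturb(p0, u, delta):            # Paninski-type perturbation p_u
    k1 = len(u)
    p = p0.copy()
    p[:k1] += u * delta
    p[k1:] -= u * delta
    return p

def greedy_codebook(k1, min_dist, rng, tries=2000):
    # greedy stand-in for the Gilbert-Varshamov codebook U
    code = []
    for _ in range(tries):
        u = rng.choice([-1, 1], size=k1)
        if all(np.sum(u != v) >= min_dist for v in code):
            code.append(u)
    return code

k, n = 10, 10**5                      # a point in the k << n^{1/3} regime
k1 = k // 2
delta = 0.1 / np.sqrt(n * k)          # delta = c / sqrt(nk) with c = 0.1
assert delta < 1 / (4 * k * (k - 1))  # the constraint (12)

p0 = center(k)
code = greedy_codebook(k1, max(1, k1 // 5), np.random.default_rng(3))
p_u = perturb(p0, code[0], delta)
assert np.isclose(p_u.sum(), 1.0) and np.all(np.diff(p_u) > 0)

print("|U| =", len(code))
print("Delta >= k*delta/5 =", k * delta / 5)
print("I(U;X^n) <= 2*n*k^2*delta^2 =", 2 * n * k**2 * delta**2)
```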
3 Proof of Theorem 2

This section is devoted to the proof of Theorem 2. The proof consists of two steps: first, we show that the PML distribution belongs to the class P in Assumption 1, and therefore the adaptation lower bound of Theorem 1 holds for the PML estimator; second, we argue by contradiction that if Theorem 2 were false, then the PML plug-in approach would also achieve the optimal minimax rate for all 1-Lipschitz properties for some k ≪ n^{1/3}, a contradiction to Theorem 1.

Step I: show that p_PML ∈ P. First, for the empirical distribution p̂, [HJW15] shows that

sup_{p∈M_k} E_p[ Σ_{i=1}^k |p̂_i − p_i| ] ≤ √(k/n).

Moreover, a single perturbation of the observations X_1, · · · , X_n changes the quantity Σ_{i=1}^k |p̂_i − p_i| by at most 2/n. Hence, by McDiarmid's inequality, we have

sup_{p∈M_k} P_p( min_{σ∈S_k} Σ_{i=1}^k |p̂_{σ(i)} − p_i| ≥ √(k/n) + ε ) ≤ exp(−nε²/2)

for every ε > 0. As for the PML distribution, the competitive analysis of [ADOS17] shows that

sup_{p∈M_k} P_p( min_{σ∈S_k} Σ_{i=1}^k |p_{PML,σ(i)} − p_i| ≥ 2ε ) ≤ |Φ_{n,k}| · sup_{p∈M_k} P_p( min_{σ∈S_k} Σ_{i=1}^k |p̂_{σ(i)} − p_i| ≥ ε ),

where |Φ_{n,k}| is the cardinality of the set of all possible profiles with length n and support size k. Since trivially |Φ_{n,k}| ≤ (n+1)^k, the above two inequalities lead to

sup_{p∈M_k} P_p( min_{σ∈S_k} Σ_{i=1}^k |p_{PML,σ(i)} − p_i| ≥ ε ) ≤ min{ 1, exp( k log(n+1) − (n/2) · max{ε/2 − √(k/n), 0}² ) }. (13)

Now integrating the RHS of (13) over ε ∈ (0, ∞) gives that p_PML ∈ P with A(n) = O(√(log n)).
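As a sanity check on the cardinality bound used in (13), note that a profile is determined by the multiset of nonzero multiplicities, i.e. a partition of n into at most k parts. The following sketch (ours) counts profiles exactly by dynamic programming and compares against the crude bound (n+1)^k:

```python
# Counting profiles (ours): |Phi_{n,k}| = #partitions of n into <= k parts.
from functools import lru_cache

@lru_cache(maxsize=None)
def partitions(n, k, max_part=None):
    # partitions of n into at most k parts, each of size <= max_part
    if max_part is None:
        max_part = n
    if n == 0:
        return 1
    if k == 0 or max_part == 0:
        return 0
    return sum(partitions(n - p, k - 1, p)
               for p in range(1, min(n, max_part) + 1))

for n, k in [(10, 3), (20, 5), (30, 10)]:
    print(f"n={n:>2d}, k={k:>2d}: |Phi|={partitions(n, k):>6d} "
          f"<= (n+1)^k = {(n + 1) ** k}")
```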
Step II: proof by contradiction. Assume by contradiction that Theorem 2 is false, i.e. there exists an absolute constant c_3 > 0 such that for some large enough n, it holds that

sup_{p∈M_k} P_p(|F(p_PML) − F(p)| ≥ Cε) ≤ exp(c_3 n^{1/3−c_1}) · ( inf_{F̂} sup_{p∈M_k} P_p(|F̂ − F(p)| ≥ ε) )^{1−c_2} (14)

for all k ∈ N, ε > 0 and F ∈ F_Lip. For any ε ≫ n^{-1/2} and k ≫ 1, it was shown in [HO19b] that the minimax error probability for any 1-Lipschitz property estimation is at most

inf_{F̂} sup_{p∈M_k} P_p(|F̂ − F(p)| ≥ ε) ≤ exp( −c_δ n^{1−δ} · max{ε − d_δ √(k/(n log n)), 0}² )

for any δ > 0, where c_δ, d_δ > 0 are constants depending only on δ. Consequently, (14) implies that

sup_{F∈F_Lip} sup_{p∈M_k} P_p(|F(p_PML) − F(p)| ≥ Cε) ≤ exp( c_3 n^{1/3−c_1} − (1−c_2) c_δ n^{1−δ} · max{ε − d_δ √(k/(n log n)), 0}² ).

Choosing δ < c_1/2, ε = 2d_δ √(k/(n log n)) and k ≍ n^{1/3−c_1/2}, the above inequality shows that there exists a constant c′ > 0 depending only on (c_1, c_2, c_3, C) such that

sup_{F∈F_Lip} sup_{p∈M_k} P_p( |F(p_PML) − F(p)| ≥ c′ √(k/(n log n)) ) ≤ exp( −c′ n^{1/3−c_1} ).

Hence, using that E|X| ≤ t + ‖X‖_∞ · P(|X| ≥ t) for any t > 0, together with k ≍ n^{1/3−c_1/2} and n tending to infinity (possibly along some subsequence), we arrive at

sup_{F∈F_Lip} sup_{p∈M_k} E_p|F(p_PML) − F(p)| ≲ √(k/(n log n)),

a contradiction to Theorem 1 as p_PML ∈ P and k ≪ n^{1/3}. Therefore, the inequality (14) does not hold, and the proof of Theorem 2 is complete.
4 Conclusion

In this paper we showed that there is a high-accuracy limitation for general adaptive approaches to property estimation, which in turn implies tight lower bounds for known adaptive approaches such as the PML and LMM. A number of directions could be of interest. First, we believe that Assumption 1 is an artifact of our proof and unnecessary for Theorem 1 to hold, and a better choice of the loss functions in Lemma 1 could remove this assumption. Second, the adaptation lower bound for the PML does not rule out the possibility that the PML could be fully optimal for certain properties. However, to show this, one needs to go beyond the competitive analysis of the PML and seek additional properties of the estimator. Third, our current lower bound for the PML only shows the existence of a property requiring ε ≫ n^{-1/3} for the PML to be optimal, and it would be interesting to construct such a property explicitly.

Acknowledgement
Yanjun Han is grateful to Kirankumar Shiragur for helpful discussions.
References

[ADOS17] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning, pages 11–21, 2017.

[AGZ17] Dragi Anevski, Richard D. Gill, and Stefan Zohren. Estimating a probability mass function with unknown labels. The Annals of Statistics, 45(6):2708–2735, 2017.

[AOST14] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity of estimating Rényi entropy. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, 2014.

[AOST17] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. Estimating Rényi entropy of discrete distributions. IEEE Transactions on Information Theory, 63(1):38–56, 2017.

[BF93] John Bunge and Michael Fitzpatrick. Estimating the number of species: a review. Journal of the American Statistical Association, 88(421):364–373, 1993.

[BL96] Lawrence D. Brown and Mark G. Low. A constrained risk inequality with applications to nonparametric functional estimation. The Annals of Statistics, 24(6):2524–2535, 1996.

[BZLV18] Yuheng Bu, Shaofeng Zou, Yingbin Liang, and Venugopal V. Veeravalli. Estimation of KL divergence: optimal minimax rate. IEEE Transactions on Information Theory, 64(4):2648–2674, 2018.

[CCG+12] Robert K. Colwell, Anne Chao, Nicholas J. Gotelli, Shang-Yi Lin, Chang Xuan Mao, Robin L. Chazdon, and John T. Longino. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of Plant Ecology, 5(1):3–21, 2012.

[Cha84] Anne Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11:265–270, 1984.

[CL92] Anne Chao and Shen-Ming Lee. Estimating the number of classes via sample coverage. Journal of the American Statistical Association, 87(417):210–217, 1992.

[CL11] T. Tony Cai and Mark G. Low. Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. The Annals of Statistics, 39(2):1012–1041, 2011.

[CSS19] Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. A general framework for symmetric property estimation. In Advances in Neural Information Processing Systems, pages 12426–12436, 2019.

[CT06] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, New York, second edition, 2006.

[DJKP95] David L. Donoho, Iain M. Johnstone, Gérard Kerkyacharian, and Dominique Picard. Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society: Series B (Methodological), 57(2):301–337, 1995.

[DR18] John C. Duchi and Feng Ruan. A constrained risk inequality for general losses. arXiv preprint arXiv:1804.08116, 2018.

[EL94] Sam Efromovich and Mark G. Low. Adaptive estimates of linear functionals. Probability Theory and Related Fields, 98(2):261–275, 1994.

[EL96] Sam Efromovich and Mark Low. On optimal adaptive estimation of a quadratic functional. The Annals of Statistics, 24(3):1106–1125, 1996.

[GN16] Evarist Giné and Richard Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models, volume 40. Cambridge University Press, 2016.

[HJM20] Yanjun Han, Jiantao Jiao, and Rajarshi Mukherjee. On estimation of L_r-norms in Gaussian white noise models. Probability Theory and Related Fields, 177(3):1243–1294, 2020.

[HJW15] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of discrete distributions under ℓ_1 loss. IEEE Transactions on Information Theory, 61(11):6343–6354, 2015.

[HJW16] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax rate-optimal estimation of divergences between discrete distributions. arXiv preprint arXiv:1605.09124, 2016.

[HJW18] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. In Conference On Learning Theory, pages 3189–3221, 2018.

[HJWW17] Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu. Optimal rates of entropy estimation over Lipschitz balls. To appear in The Annals of Statistics, 2017.

[HO19a] Yi Hao and Alon Orlitsky. The broad optimality of profile maximum likelihood. In Advances in Neural Information Processing Systems, pages 10989–11001, 2019.

[HO19b] Yi Hao and Alon Orlitsky. Unified sample-optimal property estimation in near-linear time. In Advances in Neural Information Processing Systems, pages 11104–11114, 2019.

[HO20] Yi Hao and Alon Orlitsky. Profile entropy: A fundamental measure for the learnability and compressibility of discrete distributions. arXiv preprint arXiv:2002.11665, 2020.

[HS20] Yanjun Han and Kirankumar Shiragur. On the competitive analysis and high accuracy optimality of profile maximum likelihood. arXiv preprint arXiv:2004.03166, 2020.

[JHW18] Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Minimax estimation of the L_1 distance. IEEE Transactions on Information Theory, 64(10):6672–6706, 2018.

[JPW20] Soham Jana, Yury Polyanskiy, and Yihong Wu. Extrapolating the profile of a finite population. arXiv preprint arXiv:2005.10561, 2020.

[JVHW15] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.

[KV17] Weihao Kong and Gregory Valiant. Spectrum estimation from samples. The Annals of Statistics, 45(5):2218–2247, 2017.

[Lep92] O. V. Lepskii. Asymptotically minimax adaptive estimation. I: Upper bounds. Optimally adaptive estimates. Theory of Probability & Its Applications, 36(4):682–697, 1992.

[LNS99] Oleg Lepski, Arkady Nemirovski, and Vladimir Spokoiny. On estimation of the L_r norm of a regression function. Probability Theory and Related Fields, 113(2):221–253, 1999.

[OSVZ04] Alon Orlitsky, Narayana P. Santhanam, Krishnamurthy Viswanathan, and Junan Zhang. On modeling profiles instead of values. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 426–435. AUAI Press, 2004.

[OSVZ11] Alon Orlitsky, Narayana P. Santhanam, Krishnamurthy Viswanathan, and Junan Zhang. On estimating the probability multiset. Draft manuscript, June 2011.

[OSW16] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.

[Pan03] Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.

[Pan04] Liam Paninski. Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50(9):2200–2203, 2004.

[Pan08] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.

[PGM+01] A. Porta, S. Guzzetti, N. Montano, R. Furlan, M. Pagani, A. Malliani, and S. Cerutti. Entropy, entropy rate, and pattern classification as tools to typify complexity in short heart period variability series. IEEE Transactions on Biomedical Engineering, 48(11):1282–1291, 2001.

[PW96] Nina T. Plotkin and Abraham J. Wyner. An Entropy Estimator Algorithm and Telecommunications Applications, pages 351–363. Springer Netherlands, Dordrecht, 1996.

[PW19] Yury Polyanskiy and Yihong Wu. Dualizing Le Cam's method, with applications to estimating the unseens. arXiv preprint arXiv:1902.05616, 2019.

[RW19] Philippe Rigollet and Jonathan Weed. Uncoupled isotonic regression via minimum Wasserstein deconvolution. Information and Inference: A Journal of the IMA, 8(4):691–717, 2019.

[RWdRvSB99] Fred Rieke, David Warland, Rob de Ruyter van Steveninck, and William Bialek. Spikes: Exploring the Neural Code. MIT Press, Cambridge, MA, USA, 1999.

[Spo96] Vladimir Spokoiny. Adaptive and spatially adaptive testing of a nonparametric hypothesis. 1996.

[TKV17] Kevin Tian, Weihao Kong, and Gregory Valiant. Learning populations of parameters. In Advances in Neural Information Processing Systems, pages 5778–5787, 2017.

[Tsy09] A. Tsybakov. Introduction to Nonparametric Estimation. Springer-Verlag, 2009.

[VBB+12] Martin Vinck, Francesco P. Battaglia, Vladimir B. Balakirsky, A. J. Han Vinck, and Cyriel M. A. Pennartz. Estimation of the entropy based on its polynomial representation. Physical Review E, 85:051139, 2012.

[Ver19] Sergio Verdú. Empirical estimation of information measures: A literature guide. Entropy, 21(8):720, 2019.

[VKK19] Ramya Korlakai Vinayak, Weihao Kong, and Sham M. Kakade. Optimal estimation of change in a population of parameters. arXiv preprint arXiv:1911.12568, 2019.

[VKVK19] Ramya Korlakai Vinayak, Weihao Kong, Gregory Valiant, and Sham M. Kakade. Maximum likelihood estimation for learning populations of parameters. arXiv preprint arXiv:1902.04553, 2019.

[VV11a] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log n-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, pages 685–694. ACM, 2011.

[VV11b] Gregory Valiant and Paul Valiant. The power of linear estimators. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 403–412. IEEE, 2011.

[VV13] Paul Valiant and Gregory Valiant. Estimating the unseen: improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pages 2157–2165, 2013.

[Wal50] Abraham Wald. Statistical Decision Functions. Wiley, 1950.

[WY16] Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.

[WY19] Yihong Wu and Pengkun Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, 47(2):857–883, 2019.

[WY20] Yihong Wu and Pengkun Yang. Optimal estimation of Gaussian mixtures via denoised method of moments. The Annals of Statistics, 48(4):1981–2007, 2020.

[Yu97] Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer, 1997.

[ZVV+16] James Zou, Gregory Valiant, Paul Valiant, Konrad Karczewski, Siu On Chan, Kaitlin Samocha, Monkol Lek, Shamil Sunyaev, Mark Daly, and Daniel G. MacArthur. Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications, 7:13293, 2016.