[PDF] A Statistical Perspective on Coreset Density Estimation

Abstract

Coresets have emerged as a powerful tool to summarize data by selecting a small subset of the original observations while retaining most of its information. This approach has led to significant computational speedups but the performance of statistical procedures run on coresets is largely unexplored. In this work, we develop a statistical framework to study coresets and focus on the canonical task of nonparameteric density estimation. Our contributions are twofold. First, we establish the minimax rate of estimation achievable by coreset-based estimators. Second, we show that the practical coreset kernel density estimators are near-minimax optimal over a large class of Hölder-smooth densities.

Full PDF

aa r X i v : . [ m a t h . S T ] D ec A Statistical Perspective on CoresetDensity Estimation

Paxton Turner, Jingbo Liu, and Philippe Rigollet ∗ Massachusetts Institute of Technology and University of Illinois at Urbana-ChampaignAbstract.

Coresets have emerged as a powerful tool to summarize data by se-lecting a small subset of the original observations while retaining most of itsinformation. This approach has led to signiﬁcant computational speedups butthe performance of statistical procedures run on coresets is largely unexplored.In this work, we develop a statistical framework to study coresets and focuson the canonical task of nonparameteric density estimation. Our contributionsare twofold. First, we establish the minimax rate of estimation achievable bycoreset-based estimators. Second, we show that the practical coreset kernel den-sity estimators are near-minimax optimal over a large class of H¨older-smoothdensities.

AMS 2000 subject classiﬁcations:

Primary 62G07; secondary 68Q32.

Key words and phrases: data summarization, kernel density estimator,Carath´eodory’s theorem, minimax risk, compression.

1. INTRODUCTION

The ever-growing size of datasets that are routinely collected has led practitioners across manyﬁelds to contemplate eﬀective data summarization techniques that aim at reducing the size of thedata while preserving the information that it contains. While there are many ways to achievethis goal, including standard data compression algorithms, they often prevent direct manipu-lation of data for learning purposes.

Coresets have emerged as a ﬂexible and eﬃcient set oftechniques that permit direct data manipulation. Coresets are well-studied in machine learn-ing [Har-Peled and Kushal, 2007, Feldman et al., 2013, Bachem et al., 2017, 2018, Karnin and Liberty,2019], statistics [Feldman et al., 2011, Zheng and Phillips, 2017, Munteanu et al., 2018, Huggins et al.,2016, Phillips and Tai, 2018a,b], and computational geometry [Agarwal et al., 2005, Clarkson, 2010,Frahling and Sohler, 2005, G¨artner and Jaggi, 2009, Claici et al., 2020].Given a dataset D = { X , . . . , X n } ⊂ R d and task (density estimation, logistic regression, etc.)a coreset C is given by C = { X i : i ∈ S } for some subset S of { , . . . , n } of size | S | ≪ n . A goodcoreset should suﬃce to perform the task at hand with the same accuracy as with the whole dataset D .In this work we study the canonical task of density estimation. Given i.i.d random variables X , . . . , X n ∼ P f that admit a common density f with respect to the Lebesgue measure over R d ,the goal of density estimation is to estimate f . It is well known that the minimax rate of estimation ∗ P.R. was supported by NSF awards IIS-1838071, DMS-1712596, DMS-1740751, and DMS-2022448. TURNER ET AL. over the L -H¨older smooth densities P H ( β, L ) of order β is given by(1) inf ˆ f sup f ∈P H ( β,L ) E f k ˆ f − f k = Θ β,d,L ( n − β β + d ) , where the inﬁmum is taken over all estimators based on the dataset D . Moreover the minimax rateabove is achieved by a kernel density estimator(2) ˆ f n ( x ) := 1 nh d n X j =1 K (cid:18) X i − xh (cid:19) for suitable choices of kernel K : R d → R and bandwidth h > We formally deﬁne a coreset as follows. Throughout this work m = o ( n ) denotes the cardinalityof the coreset. Let S = S ( y | x ) denote a conditional probability measure on (cid:0) [ n ] m (cid:1) , where x ∈ R d × n .In information theoretic language, S is a channel from R d × n to subsets of cardinality m . We refer tothe channel S as a coreset scheme because it designates a data-driven method of choosing a subsetof data points. In what follows, we abuse notation and let S = S ( x ) denote an instantiation of asample from the measure S ( y | x ) for x ∈ R d × n . A coreset X S is then deﬁned to be the projectionof the dataset X = ( X , . . . , X n ) onto the subset indicated by S ( X ): X S := { X i } i ∈ S ( X ) .The ﬁrst family of estimators that we investigate is quite general and allows the statistician toselect a coreset and then employ an estimator that only manipulates data points in the coresetto estimate an unknown density. To study coresets, it is convenient to make the dependence ofestimators on observations more explicit than in the traditional literature. More speciﬁcally, adensity estimator ˆ f based on n observations X , . . . , X n ∈ R d is a function ˆ f : R d × n → L ( R d )denoted by ˆ f [ X , . . . , X n ]( · ). Similarly, a coreset-based estimator ˆ f S is constructed from a coresetscheme S of size m and an estimator (measurable function) ˆ f : R d × m → L ( R d ) on m observations.We enforce the additional restriction on ˆ f that for all y , . . . , y m ∈ R d and for all bijections π :[ m ] → [ m ], it holds that ˆ f [ y , . . . , y m ]( · ) = ˆ f [ y π (1) , . . . , y π ( m ) ]( · ). Given S and ˆ f as above, we deﬁnethe coreset-based estimator ˆ f S : R d × n → L ( R d ) to be the function ˆ f S [ X ]( · ) := ˆ f [ X S ]( · ) : R d → R .We evaluate the performance of coreset-based estimators in Section 2 by characterizing their rateof estimation over H¨older classes. The symmetry restriction on ˆ f prevents the user from exploiting information about the orderingof data points to their advantage: the only information that can be used by the estimator ˆ f iscontained in the unordered collection of distinct vectors given by the coreset X S .As evident from the the results in Section 2, the information-theoretically optimal coreset es-timator does not resemble coreset estimators employed in practice. To remedy this limitation, we Our notion of coreset-based estimators bares conceptual similarity to various notions of compression schemes asstudied in the literature, e.g. Littlestone and Warmuth [1986], Ashtiani et al. [2020], Hanneke et al. [2019].ORESETS also study weighted coreset kernel density estimators (KDEs) in Section 3. Here the statisticianselects a kernel k , bandwidth parameter h , and a coreset X S of cardinality m as deﬁned above andthen employs the estimator ˆ f S ( y ) = X j ∈ S λ j h − d k (cid:18) X j − yh (cid:19) , where the weights { λ j } j ∈ S are nonnegative, sum to one and are allowed to depend on the fulldataset.In the case of uniform weights where λ j = m for all j ∈ S , coreset KDEs are well-studied [see e.g.Bach et al., 2012, Harvey and Samadi, 2014, Phillips and Tai, 2018a,b, Karnin and Liberty, 2019].Interestingly, our results show that allowing ﬂexibility in the weights gives a deﬁnitive advantagefor the task of density estimation. By Theorems 2 and 5, the uniformly weighted coreset KDEsrequire a much larger coreset than that of weighted coreset KDEs to attain the minimax rate ofestimation over univariate Lipschitz densities. We reserve the notation k·k for the L norm and |·| p for the ℓ p -norm. The constants c, c β,d , c L , etc. vary from line to line and the subscripts indicate parameter dependences.Fix an integer d ≥

1. For any multi-index s = ( s , . . . , s d ) ∈ Z d ≥ and x = ( x , . . . , x d ) ∈ R d ,deﬁne s ! = s ! · · · s d !, x s = x s · · · x s d d and let D s denote the diﬀerential operator deﬁned by D s = ∂ | s | ∂x s · · · ∂x s d d . Fix a positive real number β, and let ⌊ β ⌋ denote the maximal integer strictly less than β . Given amulti-index s , the notation | s | signiﬁes the coordinate-wise application of |·| to s .Given L > H ( β, L ) denote the space of H¨older functions f : R d → R that are supportedon the cube [ − / , / d , are ⌊ β ⌋ times diﬀerentiable, and satisfy | D s f ( x ) − D s f ( y ) | ≤ L | x − y | β −⌊ β ⌋ , for all x, y ∈ R d and for all multi-indices s such that | s | = ⌊ β ⌋ .Let P H ( β, L ) denote the set of probability density functions contained in H ( β, L ). For f ∈P H ( β, L ), let P f (resp. E f ) denote the probability distribution (resp. expectation) associated to f .For d ≥ γ ∈ Z ≥ , we also deﬁne the Sobolev functions S ( γ, L ′ ) that consist of all f : R d → R that are γ times diﬀerentiable and satisfy k D α f k ≤ L ′ for all multi-indices α such that | α | = γ .

2. CORESET-BASED ESTIMATORS

In this section we study the performance of coreset-based estimators. Recall that coreset-basedestimators are estimators that only depend on the data points in the coreset.Deﬁne the minimax risk for coreset-based estimators ψ n,m ( β, L ) over P H ( β, L ) to be(3) ψ n,m ( β, L ) = inf ˆ f, | S | = m sup f ∈P H ( β,L ) E f k ˆ f S − f k , TURNER ET AL. where the inﬁmum above is over all choices of coreset scheme S of cardinality m and all estimatorsˆ f : R d × m → L ( R d ).Our main result on coreset-based estimators characterizes their minimax risk. Theorem . Fix β, L > and an integer d ≥ . Assume that m = o ( n ) . Then the minimaxrisk of coreset-based estimators satisﬁes inf ˆ f, | S | = m sup f ∈P H ( β,L ) E f k ˆ f S − f k = Θ β,d,L ( n − β β + d + ( m log n ) − βd ) . The above theorem readily yields a characterization of the minimal size m ∗ ( β, d ) that a coresetcan have while still enjoying the minimax optimal rate n − β β + d from (1). More speciﬁcally, let m ∗ = m ∗ ( n ) be such that(i) if m ( n ) is a sequence such that m = o ( m ∗ ), then lim inf n →∞ n β β + d ψ n,m ( β ) = ∞ , and(ii) if m = Ω( m ∗ ) then lim sup n →∞ ψ n,m ( β ) n β β + d ≤ C β,d,L for some constant C β,d,L > m ∗ = Θ β,d,L ( n d β + d / log n ).Theorem 1 illustrates two diﬀerent curses of dimensionality: the ﬁrst stems from the originalestimation problem, and the second stems from the compression problem. As d → ∞ , it holds that m ∗ ∼ n/ log n , and in this regime there is essentially no compression, as the implicit constant inTheorem 1 grows rapidly with d . Our proof of the lower bound in Theorem 1 ﬁrst uses a standard reduction from estimationto multiple hypothesis testing problem over a ﬁnite function class. While Fano’s inequality is theworkhorse of our second step, note that the lower bound must hold only for coreset estimators andnot any estimator as in standard minimax lower bounds. This additional diﬃculty is overcome bya careful handling of the information structure generated by coreset scheme channels rather thanusing oﬀ-the-shelf results for minimax lower bounds. The full details of the lower bound are in theAppendix.The estimator achieving the rate in Theorem (1) relies on an encoding procedure. It is constructedby building a dictionary between the subsets in (cid:0) [ n ] m (cid:1) and an ε -net on the space of H¨older functions.The key idea is that, for ω (1) = m ≤ n/

2, the amount of subsets of size m is extremely large, so for m large enough, there is enough information to encode a nearby-neighbor in L ( R d ) to the kerneldensity estimator on the entire dataset. Fix ε = c ∗ ( m log n ) − βd for c ∗ to be determined and let N ε denote an ε -net of P H ( β, L ) withrespect to the L ([ − , ] d ) norm. It follows from the classical Kolmogorov-Tikhomorov bound [see,e.g., Theorem XIV of Tikhomirov, 1993] that there exists a constant C KT ( β, d, L ) > N ε with log |N ε | ≤ C KT ( β, d, L ) ε − d/β . In particular, there exists f ∈ N ε such that k ˆ f n − f k L ([ − / , / d ) ≤ ε where ˆ f n is the minimax optimal kernel density estimator deﬁned in (2).We now develop our encoding procedure for f . To that end, ﬁx an integer K ≥ m such that (cid:0) Km (cid:1) ≥ |N ε | and let φ : (cid:0) [ K ] m (cid:1) → N ε be any surjective map. Our procedure only looks at the ﬁrstcoordinates of the sample X = { X , . . . , X n } . Denote these coordinates by x = { x , . . . , x n } and In fact, even for the classical estimation problem (1), this constant scales as d d [see McDonald, 2017, Theorem3].ORESETS note that these n numbers are almost surely distinct. Let A denote a parameter to be determined,and deﬁne the intervals B ik = [( i − K − A + ( k − A, ( i − K − A + ( k − A + K − A ] . For i = 1 , . . . , K , deﬁne B i = /A [ k =1 B ik . The next lemma, whose proof is in the Appendix, ensures that with high probability every bin B i contains the ﬁrst coordinate x i of at least one data point. Lemma . Let K − = c (log n ) /n for c > a suﬃciently large absolute constant, and let A = A β,L,K denote a suﬃciently small constant. Then for all f ∈ P H ( β, L ) and X , . . . , X n iid ∼ P f ,the event that for every j = 1 , . . . , K there exists some x i in bin B j holds with probability at least − O ( n − ) . In the high-probability event E that every bin B i contains the ﬁrst coordinate of some data point,choose a unique representative x ◦ j ∈ x such that x ◦ j ∈ B j and pick any T f ∈ φ − ( f ). Then deﬁne S = { i : x i = x ◦ j , j ∈ T f } . If there exists a bin with no observation, then let X S consist of two datapoints lying in the same bin and m − f S ≡ f S is indeed a coreset-based estimator. The function ˆ f such that ˆ f S = ˆ f [ X S ] looksat the m data points in the coreset, and if their ﬁrst coordinates lie in distinct bins, then X S isdecoded as above to output the corresponding element f of the net N ε . Otherwise, ˆ f ≡ m ≤ cn d/ (2 β + d ) for c asuﬃciently small absolute constant. For c ∗ = c ∗ β,d,L suﬃciently large, by Stirling’s formula and ourchoice of K it holds that log (cid:18) Km (cid:19) ≥ C KT ( β, d, L ) (cid:18) ε (cid:19) dβ ≥ log |N ε | . Hence, the surjection φ and our encoding estimator ˆ f S are well-deﬁned.Next we have E f k ˆ f S − f k = E f (cid:2) k f − f k E (cid:3) + E f (cid:2) k − f k E c (cid:3) . We control the ﬁrst term as follows using (1) and the fact that k f − ˆ f n k ≤ ε on E : E f (cid:2) k f − f k E (cid:3) ≤ E f k ˆ f n − f k + E f k f − ˆ f n k ≤ c β,d,L (cid:0) n − β β + d + ( m log n ) − βd (cid:1) . By the Cauchy-Schwarz inequality, E f (cid:2) k − f k E c (cid:3) ≤ (cid:0) E f k f k P ( E c ) (cid:1) / ≤ c β,d,L n − . Put together, the previous three displays yield the upper bound of Theorem 1.

TURNER ET AL.

3. CORESET KERNEL DENSITY ESTIMATORS

In this section, we consider the family of weighted kernel density estimators built on coresetsand study its rate of estimation over the H¨older densities. In this framework, the statistician ﬁrstcomputes a minimax estimator ˆ f using the entire dataset and then approximates ˆ f with a weightedkernel density estimator over the coreset. Here we allow the weights to be a measurable functionover the entire dataset rather than just the coreset.As is typical in density estimation, we consider kernels k : R d → R of the form k ( x ) = Q di =1 κ ( x i )where κ is an even function and R κ ( x ) d x = 1. Given bandwidth parameter h , we deﬁne k h ( x ) = h − d k ( xh ). Given a KDE with uniform weights and bandwidth h deﬁned byˆ f ( y ) = 1 n n X j =1 k h ( X j − y ) , on a sample X , . . . , X n , we deﬁne a coreset KDE ˆ g S as follows in terms of a cutoﬀ frequency T > A = { ω ∈ π Z d : | ω | ∞ ≤ T } . Consider the complex vectors ( e i h X j ,ω i ) ω ∈ A . By Carath´eodory’stheorem [Carath´eodory, 1907], there exists a subset S ⊂ [ n ] of cardinality at most 2(1 + Tπ ) d + 1and nonnegative weights { λ j } j ∈ S with P j ∈ S λ j = 1 such that(4) 1 n n X j =1 ( e i h X j ,ω i ) ω ∈ A = X j ∈ S λ j ( e i h X j ,ω i ) ω ∈ A . Then ˆ g S ( y ) is deﬁned to be ˆ g S ( y ) = X j ∈ S λ j k h ( X j − y ) . For a convex polyhedron P with vertices v , . . . , v n ∈ R D , the proof of Carath´eodory’s theo-rem is constructive and yields a polynomial-time algorithm in n and D to ﬁnd a convex com-bination of D + 1 vertices that represents a given point in P [Carath´eodory, 1907] [see alsoHiriart-Urruty and Lemar´echal, 2004, Theorem 1.3.6]. For completeness, we describe below thisalgorithm applied to our problem. Note that, more generally, for a large class of convex bodies,Carath´eodory’s theorem may be implemented eﬃciently using standard tools from convex opti-mization [Gr¨otschel et al., 2012, Chapter 6].Set D = 2 | A | ≤ Tπ ) d . For j = 1 , . . . , n , let v j = ( Re e i h X j ,ω i , Im e i h X j ,ω i ) ω ∈ A ∈ R D . Let M denote the matrix with columns ( v , T , . . . , ( v n , T ∈ R D +1 , and let ∆ n − ⊂ R n denotethe standard simplex. Assume without loss of generality that n ≥ D + 2. Next,1. Find a nonzero vector w ∈ ker( M )2. Find α > λ := n

1I + αw lies on the boundary of ∆ n − ORESETS Observe that

M λ = ( n P v i , T , and since λ ∈ ∂ ∆ n − the average is now represented using aconvex combination of at most n − v , . . . , v n . As long as at least D + 2 verticesremain, we can continue reducing the number of vertices used to represent n P v j by applying steps1 and 2. Thus after at most n − D − λ ∈ ∆ D that satisﬁes P λ j v j = n P v i ,as desired. Proposition 1 is key to our results and speciﬁes conditions on the kernel guaranteeing that theCarath´eodory method yields an accurate estimator.

Proposition . Let k ( x ) = Q di =1 κ ( x i ) denote a kernel with κ ∈ S ( γ, L ′ ) such that | κ ( x ) | ≤ c β,d | x | − ν for some ν ≥ β + d , and the KDE ˆ f ( y ) = 1 n n X i =1 k h ( X i − y ) with bandwidth h = n − β + d satisﬁes (5) sup f ∈P H ( β,L ) E k f − ˆ f k ≤ c β,d,L n − β β + d . Then the Carath´eodory coreset estimator ˆ g S ( y ) constructed from ˆ f with T = c d,γ,L ′ n d/ β + γγ (2 β + d ) satisﬁes sup f ∈P H ( β,L ) E k ˆ g S − f k ≤ c β,d,L n − β β + d . There exists a kernel k s ∈ C ∞ that satisﬁes the conditions above for all β and γ . We sketchthe details here and postpone the full argument to the Proof of Theorem 2 in the Appendix. Let ψ : [ − , → [0 ,

1] denote a cutoﬀ function that has the following properties: ψ ∈ C ∞ , ψ (cid:12)(cid:12) [ − , ≡ ψ is compactly supported on [ − , κ S ( x ) = F [ ψ ]( x ), and let k s ( x ) = Q di =1 κ S ( x i )denote the resulting kernel. Observe that for all β >

0, the kernel k s satisﬁesess sup ω =0 − F [ k s ]( ω ) | ω | α ≤ , ∀ α (cid:22) β. Using standard results from [Tsybakov, 2009], this implies that the resulting KDE ˆ f s satisﬁes(5). Since ψ = F − [ k s ] ∈ C ∞ , the Riemann–Lebesgue lemma guarantees that | κ s ( x ) | ≤ c β,d | x | ν issatisﬁed for ν = ⌈ β + d ⌉ . Since ψ is compactly supported, an application of Parseval’s identity yields κ s ∈ S ( γ, c γ ). Applying Proposition 1 to k s , we conclude that for the task of density estimation,weighted KDEs built on coresets are nearly as powerful as the coreset-based estimators studied inSection 2. Theorem . Let ε > . The Carath´eodory coreset estimator ˆ g S ( y ) built using the kernel k s andsetting T = c d,β,ε n εd + β + d satisﬁes sup f ∈P H ( β,L ) E f k ˆ g S − f k ≤ c β,d,L n − β β + d . The corresponding coreset has cardinality m = c d,β,ε n d β + d + ε . TURNER ET AL.

Theorem 2 shows that the Carath´eodory coreset estimator achieves the minimax rate of esti-mation with near-optimal coreset size. In fact, a small modiﬁcation yields a near-optimal rate ofconvergence for any coreset size as in Theorem 1.

Corollary . Let ε > and m ≤ c β,d,ε n d β + d + ε . The Carath´eodory coreset estimator ˆ g S ( y ) built using the kernel k s , setting h = m − d + εβ and T = c d m /d , satisﬁes sup f ∈P H ( β,L ) E k ˆ g S − f k ≤ c β,d,ε,L (cid:16) m − βd + ε + n − β β + d + ε (cid:17) , and the corresponding coreset has cardinality m . Next we apply Proposition 1 to the popular Gaussian kernel φ ( x ) = (2 π ) − d/ exp( − | x | ). Thiskernel has rapid decay in the real domain and Fourier space, and is thus amenable to our techniques.Moreover, k φ is a kernel of order ℓ = 1, [Tsybakov, 2009, Deﬁnition 1.3 and Theorem 1.2] and sothe standard KDE ˆ f φ on the full dataset attains the minimax rate of estimation c d,L n / (2+ d ) overthe Lipschitz densities P H (1 , L ). Theorem . Let ε > . The Carath´eodory coreset estimator ˆ g φ ( y ) built using the kernel φ andsetting T = c d,ε n d + εd satisﬁes sup f ∈P H (1 ,L ) E k ˆ g φ − f k ≤ c d,L n − d . The corresponding coreset has cardinality m = c d,ε n d d + ε . In addition, we have a nearly matching lower bound to Theorem 2 for coreset KDEs. In fact,our lower bound applies to a generalization of coreset KDEs where the vector of weights ( λ j ) j isnot constrained to be in the simplex but can range within a hypercube of width that may growpolynomially with n : max j ∈ S | λ j | ≤ n B . Theorem . Let

A, B ≥ . Let k denote a kernel with k k k ≤ n . Let ˆ g S denote a weightedcoreset KDE with bandwidth h ≥ n − A built from k with weights { λ j } j ∈ S satisfying max j ∈ S | λ j | ≤ n B . Then sup f ∈P H ( β,L ) E f k ˆ g S − f k ≥ c β,d,L h ( A + B ) − βd ( m log n ) − βd + n − β β + d i . This result is essentially a consequence of the lower bound in Theorem 1 because, in an appropri-ate sense, coreset KDEs with bounded weights are well-approximated by coreset-based estimators.Hence, in the case of bounded weights, allowing these weights to be measurable functions of theentire dataset rather than just the coreset, as would be required in Section 2, does not make asigniﬁcant diﬀerence for the purpose of estimation. The full details of Theorem 4 are postponed tothe Appendix.

ORESETS Here we sketch the proof of Proposition 1, our main tool in constructing eﬀective coreset KDEs.Full details of the argument may be found in the Appendix.Let k ( x ) = Q di =1 κ ( x i ) denote a kernel, and suppose that ˆ f ( y ) = n P ni =1 k h ( X i − y ) is a goodestimator for an unknown density f in that k f − ˆ f k ≤ ε := c β,d n − β β + d on setting h = n − / (2 β + d ) . Our goal is to ﬁnd a subset S ⊂ [ n ] and weights { λ j } j ∈ S such that1 n n X i =1 k h ( X i − y ) ≈ X j ∈ S λ j k h ( X j − y ) . Suppose for simplicity that κ is compactly supported on [ − / , / κ ∈ S ( γ, L ′ ), and we can further show that k ∈ S ( γ, c d,L ′ ) and k h ∈ S ( γ, c d,L ′ h − d/ − γ ).Let ¯ F denote the Fourier transform on the interval [ − , k h , we have(6) k k h ( x ) − X | ω | ∞

4. LOWER BOUNDS FOR CORESET KDES WITH UNIFORM WEIGHTS

In this section we study the performance of univariate uniformly weighted coreset KDEsˆ f unif S ( y ) = 1 m X i ∈ S k h ( X i − y ) , where X S is the coreset and | S | = m . The next results demonstrate that for a large class of kernels,there is signiﬁcant gap between the rate of estimation achieved by ˆ f unif S ( y ) and that of coreset KDEswith general weights. First we focus on the particular case of estimating Lipschitz densities, theclass P H (1 , L ). For this class, the minimax rate of estimation (over all estimators) is n − / , andthis can be achieved by a weighted coreset KDE of cardinality c ε n / ε by Theorem 2, for all ε > TURNER ET AL.

Theorem . Let k denote a nonnegative kernel satisfying k ( t ) = O ( | t | − ( k +1) ) , and F [ k ]( ω ) = O ( | ω | − ℓ ) for some ℓ > , k > . Suppose that < α < / . If m ≤ n − ( α (1 − ℓ )+ ℓ )log n , then (7) inf h,S : | S |≤ m sup f ∈P H (1 ,L ) E k ˆ f unif S − f k = Ω k (cid:16) n − + α log n (cid:17) . The inﬁmum above is over all possible choices of bandwidth h and all coreset schemes S of cardinalityat most m . By this result, if k has lighter than quadratic tails and fast Fourier decay, the error in (7) is apolynomial factor larger than the minimax rate n − / when m ≪ n / . Hence, our result covers awide variety of kernels typically used for density estimation and shows that the uniformly weightedcoreset KDE performs much worse than the encoding estimator or the Carath´eodory method. Inaddition, for very smooth univariate kernels with rapid decay, we have the following lower boundthat applies for all β > Theorem . Fix β > and a nonnegative kernel k on R satisfying the following fast decayand smoothness conditions: lim s → + ∞ s log 1 R | t | >s k ( t ) dt > , (8) lim ω →∞ | ω | log 1 |F [ k ]( ω ) | > , (9) where we recall that F [ k ] denotes the Fourier transform. Let ˆ f unif S be the uniformly weighted coresetKDE. Then there exists L β > such that for L ≥ L β and any m and h > , we have inf h,S : | S |≤ m sup f ∈P H ( β,L ) E k ˆ f unif S − f k = Ω β,k (cid:18) m − β β log β + 12 m (cid:19) . Therefore attaining the minimax rate with ˆ f unif S requires m ≥ n β +12 β +1 for such kernels. Next, notethat the Gaussian kernel satisﬁes the hypotheses of Theorem 5 and 6. As we show in Theorem 7,results of [Phillips and Tai, 2018b] imply that our lower bounds are tight up to logarithmic factors:there exists a uniformly weighted Gaussian coreset KDE of size m = ˜ O ( n / ) that attains theminimax rate n − / for estimating univariate Lipschitz densities ( β = 1). In general, we expect alower bound m = Ω( n β + d β + d ) to hold for uniformly weighted coreset KDEs attaining the minimaxrate. The proofs of Theorems 5 and 6 can be found in the Appendix. ORESETS

5. COMPARISON TO OTHER METHODS

Three methods for constructing coreset kernel density estimators that have previously beenexplored include random sampling [Joshi et al., 2011, Lopez-Paz et al., 2015], the Frank–Wolfealgorithm [Bach et al., 2012, Harvey and Samadi, 2014, Phillips and Tai, 2018a], and discrepancy-based approaches [Phillips and Tai, 2018b, Karnin and Liberty, 2019]. These procedures all resultin a uniformly weighted coreset KDE. To compare these results with ours on the problem of densityestimation, for each method under consideration we raise the question: How large does m , the sizeof the coreset, need to be to guarantee that(10) sup f ∈P H ( β,L ) E f k ˆ g S − f k = O β,d,L (cid:16) n − β β + d (cid:17) ?Here ˆ g S is the resulting coreset KDE and the right-hand-side is the minimax rate over all estimatorson the full dataset X , . . . , X n .Uniform random sampling of a subset of cardinality m yields an i.i.d dataset, so the rate obtainedis at least m − β/ (2 β + d ) . Hence, we must take m = Ω( n ) to achieve the minimax rate.The Frank–Wolfe algorithm is a greedy method that iteratively constructs a sparse approxi-mation to a given element in a convex set [Frank et al., 1956, Bubeck, 2015]. Thus Frank–Wolfemay be applied directly in the RKHS corresponding to a positive-semideﬁnite kernel as shown inPhillips and Tai [2018b] to approximate the KDE on the full dataset. However, due to the shrinkingbandwidth in our problem, this approach also requires m = Ω( n ) to guarantee the bound in (10).Another strategy is to approximately solve the linear equation (4) using the Frank–Wolfe algorithm.Unfortunately, a direct implementation again uses m = Ω( n ) data points.A more eﬀective strategy utilizes discrepancy theory [Phillips, 2013, Phillips and Tai, 2018b,Karnin and Liberty, 2019] [see Matouˇsek, 1999, Chazelle, 2000, for a comprehensive exposition ofdiscrepancy theory]. By the well-known halving algorithm [see e.g. Chazelle and Matouˇsek, 1996,Phillips and Tai, 2018b] if for all N ≤ n , the kernel discrepancy disc k = sup x ,...,x N min σ ∈{− , +1 } n T σ =0 k N X i =1 σ i k ( x i − y ) k ∞ is at most D , then there exists a coreset X S of size ˜ O D ( ε − ) such that(11) k n n X i =1 k ( X i − y ) − m X j ∈ S k ( X i − y ) k ∞ = ˜ O D ( ε ) . The idea of the halving algorithm is to maintain a set of datapoints C ℓ at each iteration andthen set C ℓ +1 to be the set of vectors that receive sign +1 upon minimizing k P x ∈C ℓ σ x k ( x − y ) k ∞ .Starting with the original dataset and repeating this procedure O (log nm ) times yields the desiredcoreset X S satisfying (11).Phillips and Tai [2018b, Theorem 4] use a state-of-the-art algorithm from Bansal et al. [2018]called the Gram–Schmidt walk to give strong bounds on the kernel discrepancy of bounded andLipschitz kernels k : R d × R d → R that are positive deﬁnite and decay rapidly away from thediagonal. With a careful handling of the Lipschitz constant and error in their argument when thebandwidth is set to be h = n − / (2 β + d ) , their techniques yield the following result applied to thekernel k s . For completeness we give details of the argument in the Appendix. TURNER ET AL.

Theorem . Let k s denote the kernel from Section 3.2. The algorithm of Phillips and Tai[2018b] yields in polynomial time a subset S with | S | = m = ˜ O ( n β + d β + d ) such that the uniformlyweighted coreset KDE ˆ g S satisﬁes sup f ∈P H ( β,L ) E k f − ˆ g S k ≤ c β,d,L n − β β + d . This result also applies to more general kernels, for example, the Gaussian kernel when β = 1.We suspect that this is the best result achievable by discrepancy-based methods. In particularfor nonnegative univariate kernels with fast decay in the real and Fourier domains, such as theGaussian kernel, Theorem 5 implies that this rate is optimal for estimating Lipschitz densities withuniformly weighted coreset KDEs.In contrast, the Carath´eodory coreset KDE as in Theorem 2 only needs cardinality m = O ε ( n d β + d + ε )to be a minimax estimator. By Theorem 4, this result is nearly optimal for coreset KDEs withbounded kernels and weights. And as with the other three methods described, our constructionis computationally eﬃcient. Hence allowing more general weights results in more powerful coresetKDEs for the problem of density estimation. APPENDIX A: PROOFS FROM SECTION 2A.1 Proof of Lemma 1

Here we prove Lemma 1, restated below for convenience.

Lemma.

Let K − = c (log n ) /n for c > a suﬃciently large absolute constant, and let A = A β,L,K denote a suﬃciently small constant. Then for all f ∈ P H ( β, L ) and X , . . . , X n iid ∼ P f , theevent that for every j = 1 , . . . , K there exists some x i in bin B j holds with probability at least − O ( n − ) . Proof.

Note that f ( x ) ∈ P H ( β, L ) as a univariate density because f ( x ) ∈ P H ( β, L ). Hence, f satisﬁes | f ( x ) − f ( y ) | ≤ L | x − y | α for some absolute constants L > α ∈ (0 , B ik = B jk + s for s ≤ A , then | P ( B ik ) − P ( B jk ) | ≤ Z B ik | f ( x ) − f ( x + s ) | d x ≤ LK − A α . (12)Thus for all i, j , | P ( B i ) − P ( B j ) | ≤ /A X k =1 | P ( B ik ) − P ( B jk ) | ≤ LK − A α . (13)It follows that for all i = 1 , . . . , K ,(14) lim A → P ( B i ) = K − . Let E denote the event that every bin B i contains at least one observation x k . By the unionbound, P ( E c ) ≤ X j =1 P ( X / ∈ B j ) n ≤ K max j (1 − P ( B j )) n . ORESETS By (14), choosing A small enough ensures that P [ B j ] ≥ (1 / K − for all j . In fact, by (12) onemay take A = ( K − L ) /α . Hence, setting K − = c (log n ) /n for c suﬃciently large, we have P ( E c ) = O ( n − ) . A.2 Proof of the lower bound in Theorem 1

In this section, X = X , . . . , X n ∈ R d denotes the sample. It is convenient to consider a moregeneral family of decorated coreset-based estimators . A decorated coreset consists of a coreset X S along with a data-dependent binary string σ of length R . A decorated coreset-based estimator isthen given by ˆ f [ X S , σ ], where ˆ f : R d × m × { , } R → L ([ − / , / d ) is a measurable function. Aswith coreset-based estimators, we require that ˆ f [ x , . . . , x m , σ ] is invariant under permutation of thevectors x , . . . , x m ∈ R d . We slightly abuse notation and refer to the channel S : X → Y S = ( X S , σ )as a decorated coreset scheme and ˆ f S as the decorated coreset-based estimator. The next propositionimplies the lower bound in Theorem 1 on setting R = 0, in which case a decorated coreset-basedestimator is just a coreset-based estimator. This more general framework allows us to prove Theorem1 on lower bounds for weighted coreset KDEs. Proposition . Let ˆ f S denote a decorated coreset-based estimator with decorated coreset scheme S such that σ ∈ { , } R . Then sup f ∈P H ( β,L ) E f k ˆ f S − f k ≥ c β,d,L (cid:16) ( m log n + R ) − βd + n − β β + d (cid:17) . A.2.1 Choice of function class

Fix h ∈ (0 ,

1) such that 1 /h d is integral to be chosen later. Let z , . . . , z /h d label the points in { h · d + h Z d } ∩ [ − / , / d , where 1I d denotes the all-ones vector of R d . We consider a class offunctions of the form f ω ( x ) = 1 + P /h d j =1 ω j g j ( x ) indexed by ω ∈ { , } /h d . Here, g j ( x ) is deﬁnedto be g j ( x ) = h β φ (cid:18) x − z j h (cid:19) where φ : R d → R is L -H¨older smooth of order β , has k φ k ∞ = 1, and has R φ ( x ) d x = 0.Informally, f ω puts a bump on the uniform distribution with amplitude h β over z j if and only if ω i = 1. Using a standard argument [Tsybakov, 2009, Chapter 2] we can construct a packing V of { , } /h d which results G = { f ω : ω ∈ V} of the function class { f ω : ω ∈ { , } /h d } such that(i) k f − g k ≥ c β,d,L h β for all f, g ∈ G , f = g and,(ii) G is large in the sense that M := |G| ≥ c β,d,L /h d . A.2.2 Minimax lower bound

Using standard reductions from estimation to testing, we obtain thatinf ˆ f, | S | = m,σ ∈{ , } R sup f ∈P H ( β,L ) E f k ˆ f S − f k ≥ inf ˆ f, | S | = m,σ ∈{ , } R max f ∈G E f k ˆ f S − f k ≥ c β,d,L h β · inf ψ S M X ω ∈V P f ω [ ψ S ( X ) = ω ] . (15) TURNER ET AL. where the inﬁmum in the last line is over all tests ψ S : R d × n → [ M ] of the form ψ S ( X ) = ψ ( Y S )for a decorated coreset scheme S and a measurable function ψ : R d × m × { , } R → [ M ].Let V denote a random variable that is distributed uniformly over V and observe that1 M X ω ∈V P f ω [ ψ S ( X ) = ω ] = P [ ψ S ( X ) = V ]where P denotes the joint distribution of ( X, V ) characterized by the conditional distribution X | V = ω which is assumed to have density f ω for all ω ∈ V .Next, by Fano’s inequality [Cover and Thomas, 2006, Theorem 2.10.1] and the chain rule, wehave(16) P [ ψ S ( X ) = V ] ≥ − I ( V ; ψ S ( X )) + 1log M , where I ( V ; ψ S ( X )) denotes the mutual information between V and ψ S ( X ) and we used the factthat the entropy of V is log M . Therefore, it remains to control I ( V ; ψ S ( X )). To that end, notethat it follows from the data processing inequality that I ( V ; ψ S ( X )) ≤ I ( V ; ( X S , σ )) = I ( V ; Y S ) = KL ( P V,Y S k P V ⊗ P Y S ) , where P V,Y S , P V and P Y S denote the distributions of ( V, Y S ), V and Y S respectively and observe that P Y S is the mixture distribution given by P Y S ( A, t ) = M − P ω ∈V P f ω ( X S ∈ A, σ = t ) for A ⊂ R d × m and t ∈ { , } R . Denote by f ω,Y S the mixed density of P f ω ( X S ∈ · , σ = · ), where the continuouscomponent is with respect to the Lebesgue measure on [ − / , / d × m . Denote by ¯ f Y S the mixeddensity of the uniform mixture of these:¯ f Y S := 1 M X ω ∈V f ω,Y S . By a standard information-theoretic inequality, for all measures Q it holds that(17) KL ( P V,Y S k P V ⊗ P Y S ) = 1 M X ω KL ( P Y S | ω k P Y S ) ≤ M X ω KL ( P Y S | ω k Q ) . In fact, we have equality precisely when Q = P Y S , and (17) follows immediately from the nonneg-ativity of the KL-divergence. Setting Q = Unif [ − , ] d ⊗ Unif { , } R , for all ω we have KL ( P Y S | ω , Q ) = X t ∈{ , } R Z [ − , ] d f ω,Y S ( x, t ) log f ω,Y S ( x, t )2 − R d x ≤ X t ∈{ , } R Z [ − , ] d f ω,Y S ( x, t ) log f ω,Y S ( x, t ) d x + R. (18)Our next goal is to bound the ﬁrst term on the right-hand-side above. Lemma . For any ω ∈ V , we have X t ∈{ , } R Z [ − , ] d f ω,Y S ( x, t ) log f ω,Y S ( x, t ) d x ≤ m log n. ORESETS Proof.

Let P X S denote the distribution of the (undecorated) coreset X S , and note that thedensity of this distribution is given by f ω,X S ( x ) := P t ∈{ , } R f ω,Y S ( x, t ). Then because the logarithmis increasing, X t ∈{ , } R Z [ − , ] d f ω,Y S ( x, t ) log f ω,Y S ( x, t ) d x ≤ X t ∈{ , } R Z [ − , ] d f ω,Y S ( x, t ) log f ω,X S ( x ) d x = Z [ − , ] d f ω,X S ( x ) log f ω,X S ( x ) d x. By the union bound, P X S ( · ) ≤ X s ∈ ( [ n ] m ) P X s ( · ) = (cid:18) nm (cid:19) P X [ m ] ( · ) . It follows readily that f ω,X S ( · ) ≤ (cid:0) nm (cid:1) f ω,X [ m ] ( · ) . Next, let Z ∈ [ − / , / d × m be a random variablewith density f ω,X S and note that Z f ω,X S log f ω,X S = E log f ω,X S ( Z ) ≤ log (cid:18) nm (cid:19) + E log f ω,X [ m ] ( Z ) ≤ m log (cid:0) enm (cid:1) + m log 2 , where in the last inequality, we use the fact that f ω,X [ m ] = f mω ≤ m . The lemma follows.Since log M ≥ c β,d,L h − d , it follows from (16)–(18) and Lemma 2 that P [ ψ S ( X ) = V ] ≥ − m log n + R + 1log M ≥ . h = c β,d,L ( m log n + R ) − /d . Plugging this value back into (15) yieldsinf ˆ f, | S | = m sup f ∈P H ( β,L ) E f k ˆ f S − f k ≥ c β,d,L ( m log n + R ) − β/d . Moreover, it follows from standard minimax theory [see e.g. Tsybakov, 2009, Chapter 2] thatinf ˆ f, | S | = m sup f ∈P H ( β,L ) E f k ˆ f S − f k ≥ c β,d,L n − β β + d . Combined together, the above two displays give the lower bound of Proposition 2.

APPENDIX B: PROOFS FROM SECTION 3B.1 Proof of Proposition 1

We restate the result below.

Proposition.

Let k ( x ) = Q di =1 κ ( x i ) denote a kernel with κ ∈ S ( γ, L ′ ) such that | κ ( x ) | ≤ c β,d | x | − ν for some ν ≥ β + d , and the KDE ˆ f ( y ) = 1 n n X i =1 k h ( X i − y ) TURNER ET AL. with bandwidth h = n − β + d satisﬁes sup f ∈P H ( β,L ) E k f − ˆ f k ≤ c β,d,L n − β β + d . Then the Carath´eodory coreset estimator ˆ g S ( y ) constructed from ˆ f with T = c d,γ,L ′ n d/ β + γγ (2 β + d ) satisﬁes sup f ∈P H ( β,L ) E k ˆ g S − f k ≤ c β,d,L n − β β + d . Let ϕ : R d → [0 ,

1] denote a cutoﬀ function that has the following properties: ϕ ∈ C ∞ , ϕ (cid:12)(cid:12) [ − , d ≡

1, and ϕ is compactly supported on [ − , d . Lemma . Let ˜ k h ( x ) = k h ( x ) ϕ ( x ) where | κ ( x ) | ≤ c β,d | x | − ν . Then k ˜ k h − k h k ≤ c β,d h − d + ν . Proof. k ˜ k h − k h k = k (1 − ϕ ) k h k ≤ k (1 − [ − , d ) k h k = h − d/ k (1 − [ − h , h ] d ) k k ≤ dh − d/ k | x |≥ h k k ≤ c β,d h − d/ sZ | x |≥ h κ ( x ) d x ≤ c β,d h − d + ν . The triangle inequality and the previous lemma yield the next result.

Lemma . Let k denote a kernel such that | κ ( x ) | ≤ c β,d | x | − ν . Recall the deﬁnition of ˜ k h fromLemma 3. Let X , . . . , X m ∈ R d , and let ˆ g S ( y ) = X j ∈ S λ j k h ( X j − y ) denote where λ j ≥ and T λ = 1 . Let ˜ g S ( y ) = X j ∈ S λ j ˜ k h ( X j − y ) . Then k ˆ g S − ˜ g S k ≤ c β,d h − ν + d . ORESETS Next we show that ˜ k h is well approximated by its Fourier expansion on [ − , d . Since ˜ k h is asmooth periodic function on [ − , d , it is expressed in L as a Fourier series on π Z d . Thus webound the tail of this expansion. In what follows, α ∈ Z d ≥ is a multi-index and¯ F [ f ]( ω ) = 14 d Z f ( x ) e i h x,ω i d x denotes the (rescaled) Fourier transform on [ − , d , where ω ∈ π Z d . Lemma . Suppose that the kernel k ∈ S ( β, L ′ ) . Let A = { ω ∈ π Z d : | ω | ≤ T } , and deﬁne ˜ k Th ( y ) = X ω ∈ A ¯ F [˜ k h ]( ω ) e i h y,ω i . Then k (˜ k h − ˜ k Th )1I [ − , d k ≤ c γ,d,L ′ T − γ h − d/ − γ Proof.

Observe that for ω / ∈ A , it holds that X | α | = γ γ ! α ! | ω | α = ( | ω | + · · · + | ω d | ) γ ≥ T γ . Therefore, k ¯ F [˜ k h ]( ω )1I ω / ∈ A k ℓ ≤ T − γ k X | α | = γ γ ! α ! | ω | α ¯ F [˜ k h ]( ω )1I ω / ∈ A k ℓ ≤ T − γ X | α | = γ γ ! α ! k ω α ¯ F [˜ k h ]( ω ) k ℓ = c d T − γ X | α | = γ γ ! α ! k ∂ α ∂x α ˜ k h ( x ) k , (19)where in the last line we used Parseval’s identity. For any multi-index α with | α | = γ , k ∂ α ∂x α ˜ k h ( x ) k = k X η (cid:22) α ∂ η ∂x η k h ( x ) ∂ α − η ∂x α − η ϕ ( x ) k ≤ h − d − γ X η (cid:22) α c d,γ k ∂ η ∂x η k ( x ) k , (20)where we used that the derivatives of ϕ are bounded. Next by Parseval’s identity, k ∂ η ∂x η k ( x ) k = d Y i =1 k ω η i i F [ κ ]( ω i ) k . (21)For 0 ≤ a ≤ γ , we have Z | ω a F [ κ ]( ω ) | d ω ≤ k k k + Z | ω |≥ | ω γ F [ κ ]( ω ) | d ω ≤ k k k + L ′ . (22) TURNER ET AL.

By (19)–(22), k ¯ F [˜ k h ]( ω )1I ω / ∈ A k ℓ ≤ c d,γ,L ′ T − γ h − d − γ , as desired.Applying the previous lemma and linearity of the Fourier transform, we have the next corollarythat gives an expansion for a general KDE on the smaller domain [ − , ] d . Corollary . Let ˜ g S denote the KDE built from ˜ k h from Lemma 4 where X , . . . , X m ∈ [ − , ] d and moreover κ ∈ S ( β, L ′ ) . Let A = { ω ∈ π Z d : | ω | ≤ T } , and deﬁne ˜ g ST ( y ) = X ω ∈ A ¯ F [ ˜ g S ]( ω ) e i h y,ω i . Then k ( ˜ g S − ˜ g ST )1I [ − , ] d k ≤ c d,γ,L ′ T − γ h − d/ − γ L. Now we have all the ingredients needed to prove Proposition 1.

Proof of Proposition 1 .

Let ˜ f ( y ) = 1 n n X j =1 ˜ k h ( X j − y ) , and ˜ g S ( y ) = X j ∈ S λ j ˜ k h ( X j − y ) . Also consider their expansions ˜ f T and ˜ g T as deﬁned in Lemma 5. Observe that, by construction ofthe Carath´eodory coreset, ˜ f T ( y ) = ˜ g T ( y ) ∀ y ∈ [ − ,

12 ] d . In what follows, k·k is computed on [ − , ] d . By the triangle inequality, k ˆ g S − ˆ f k ≤ k ˆ g S − ˜ g k + k ˜ g − ˜ g T k + k ˜ g T − ˜ f T k + k ˜ f T − ˜ f k + k ˜ f − ˆ f k ≤ c β,d h − d + ν + c d,γ,L ′ T − γ h − d/ − γ + 0+ c d,γ,L ′ T − γ h − d/ − γ + c β,d h − d + ν (23)On the right-hand-side of the ﬁrst line, the ﬁrst and last terms are bounded via Lemma 4. Thesecond and fourth terms are bounded via Lemma 5, and the third term is 0 by Carath´eodory. Byour choice of T and the decay properties of k , we have k ˆ g S − ˆ f k ≤ c β,d,L h β ≤ c β,d,L n − β/ (2 β + d ) . The conclusion follows by the hypothesis on k , the previous display, and the triangle inequality. ORESETS B.2 Proof of Theorem 2

We restate Theorem 2 here for convenience.

Theorem.

Let ε > . The Carath´eodory coreset estimator ˆ g S ( y ) built using the kernel k s andsetting T = c d,β,ε n εd + β + d satisﬁes sup f ∈P H ( β,L ) E f k ˆ g S − f k ≤ c β,d,L n − β β + d . The corresponding coreset has cardinality m = c d,β,ε n d β + d + ε . Proof.

Our goal is to apply Proposition 1 to k s . First we show that the standard KDE builtfrom k s attains the minimax rate on P H ( β, L ). The Fourier conditioness sup ω =0 − F [ k s ]( ω ) | ω | α ≤ , ∀ α (cid:22) β, implies that k s is a kernel of order β [Tsybakov, 2009, Deﬁnition 1.3]. Since F [ k s ](0) = 1 = R k s ( x ) d x , it remains to show that the ‘moments’ of order at most β of k s vanish. In fact all of themoments vanish. We have, expanding the exponential and using the multinomial formula, ψ ( ω ) = F − [ k s ]( ω )= Z k s ( x ) e − i h x,ω i d x = ∞ X t =0 Z k s ( x ) ( − i h x, ω i ) t t ! d x = ∞ X t =0 X | α | = t − i t α ! w α (cid:26)Z k s ( x ) x α d x (cid:27) . Since ψ ( ω ) ≡ R k s ( x ) x α d x = 0.Thus k s is a kernel of order β for all β ∈ Z ≥ , and the standard KDE on all of the datasetwith bandwidth h = n − / (2 β + d ) attains the rate of estimation n − β/ (2 β + d ) over P H ( β, L ) [see e.g.Tsybakov, 2009, Theorem 1.2].Next, | κ s ( x ) | ≤ c β,d | x | ν for ν = ⌈ β + d ⌉ . This is because x ν κ s ( x ) = x ν F [ ψ ]( x ) = F (cid:20) d ν d x ν ψ (cid:21) ( x ) ≤ k d ν d x ν ψ k ≤ c β,d . Moreover for all γ ∈ Z > , κ s ∈ S ( γ, c γ ). By Parseval’s identity, k d γ d x γ κ s k = kF [ d γ d x γ κ s ] k = k ω γ ψ ( ω ) k ≤ c γ because ψ has compact support [see e.g. Katznelson, 2004, Chapter VI].All of the hypotheses of Proposition 1 are satisﬁed, so we apply the result with γ = d ( d + β ) ε (2 β + d )to derive Theorem 2. TURNER ET AL.

B.3 Proof of Corollary 1

Corollary.

Let ε > and m ≤ c β,d,ε n d β + d + ε . The Carath´eodory coreset estimator ˆ g S ( y ) builtusing the kernel k s , setting h = m − d + εβ and T = c d m /d , satisﬁes sup f ∈P H ( β,L ) E k ˆ g S − f k ≤ c β,d,ε,L (cid:16) m − βd + ε + n − β β + d + ε (cid:17) , and the corresponding coreset has cardinality m . Proof.

Recall from the proof of Theorem 2 that k s is a kernel of all orders. By a standardbias-variance trade-oﬀ [see e.g. Tsybakov, 2009, Section 1.2], it holds that for the KDE ˆ f withbandwidth h (on the entire dataset)(24) E f k ˆ f − f k ≤ c β,d,L (cid:18) h β + 1 √ nh d (cid:19) . Moreover, from (23) applied to k s , setting T = m /d , we get(25) k ˆ g S − ˆ f k ≤ c β,d h β + c d,γ m − d/γ h − d/ − γ . Choosing γ = ( β + d βdε − , h = m − d + εβ (assuming without loss of generality that ε > γ > m yield the conclusion of Corollary 1. B.4 Proof of Theorem 4

For convenience, we restate Theorem 4 here.

Theorem.

Let

Let λ = λ , . . . , λ m and let ˜ λ = ˜ λ , . . . , ˜ λ m . Observe that k X j ∈ S λ j k h ( X j − y ) − X j ∈ S ˜ λ j k h ( X j − y ) k ≤ X j ∈ S (cid:12)(cid:12)(cid:12) λ j − ˜ λ j (cid:12)(cid:12)(cid:12) k k h ( X j − y ) k ≤ (cid:12)(cid:12)(cid:12) λ − ˜ λ (cid:12)(cid:12)(cid:12) ∞ n h − d/ . (26)Using this we develop a decorated coreset-based estimator ˆ f S (see Section A.2) that approximatesˆ g S well. Set δ = c β,d,L n − h d/ for c β,d,L suﬃciently small and to be chosen later. Order the pointsof the coreset X S according to their ﬁrst coordinate. This gives rise to an ordering (cid:22) so that X ′ (cid:22) X ′ (cid:22) · · · (cid:22) X ′ m ORESETS denote the elements of X S . Let λ ∈ R m denote the correspondingly reordered collection of weightsso that ˆ g S ( y ) = m X j =1 λ j k h ( X ′ j − y ) . Construct a δ -net N δ with respect to the sup-norm |·| ∞ on the set { ν ∈ R m : | ν | ∞ ≤ n B } .Observe that(27) log |N δ | = log( n B δ − ) m = c β,d,L ( B + A ) m log n Deﬁne R to be the smallest integer larger than the right-hand-side above. Then we can construct asurjection φ : { , } R → N δ . Note that φ is constructed before observing any data: it simply labelsthe elements of the δ -net N δ by strings of length R .Given ˆ g S ( y ) = P j ∈ S λ j k h ( X j − y ), deﬁne ˆ f S as follows:1. Let ˜ λ ∈ R m denote the closest element in N δ to λ ∈ R m .2. Choose σ ∈ { , } R such that φ ( σ ) = ˜ λ .3. Deﬁne the decorated coreset Y S = ( X S , σ ).4. Order the points of X S by their ﬁrst coordinate. Pair the i -th element of ˜ λ with the i -thelement X ′ i of X S , and deﬁne ˆ f S ( y ) = m X j =1 ˜ λ j k h ( X ′ j − y )We see that ˆ f S is a decorated-coreset based estimator because in step 4 this estimator is con-structed only by looking at the coreset X S and the bit string σ . Moreover, by (26) and the settingof δ ,(28) k ˆ f S − ˆ g S k ≤ c β,d,L n − . By Proposition 2 and our choice of R ,sup f ∈P H ( β,L ) E f k ˆ f S − f k ≥ c β,d,L (cid:16) ( A + B ) − βd ( m log n ) − βd + n − β β + d (cid:17) . Applying the triangle inequality and (28) yields Theorem 4.

APPENDIX C: PROOFS FROM SECTION 4

Notation:

Given a set of points X = x , . . . , x m ∈ [ − / , /

2] (not necessarily a sample), we letˆ f X ( y ) = 1 m m X i =1 k h ( X i − y )denote the uniformly weighted KDE on X . TURNER ET AL.

C.1 Proof of Theorem 5

Theorem.

Let k denote a nonnegative kernel satisfying k ( t ) = O ( | t | − ( k +1) ) , and F [ k ]( ω ) = O ( | ω | − ℓ ) for some ℓ > , k > . Suppose that < α < / . If m ≤ n − ( α (1 − ℓ )+ ℓ )log n , then inf h,S : | S |≤ m sup f ∈P H (1 ,L ) E k ˆ f unif S − f k = Ω k (cid:16) n − + α log n (cid:17) . The inﬁmum above is over all possible choices of bandwidth h and all coreset schemes S of cardinalityat most m . The proof of Theorem 5 follows directly from Propositions 3 and 4, which are presented inSections C.1.1 and C.1.2, respectively.

C.1.1 Small bandwidth

First we show that uniformly weighted coreset KDEs on m points poorly approximate densitiesthat are very close to 0 everywhere. Lemma . Let ˆ f X denote a uniformly weighted coreset KDE built from an even kernel k : R → R with bandwidth h on m points X = x , . . . , x m ∈ R . Suppose that quantiles ≤ q ≤ q satisfy Z q − q k ( t )d t ≥ . , and (29) Z q − q k ( t )d t ≥ − γ. (30) Let U denote an interval [0 , u ] where (31) u ≥ q h, and suppose that f : U → R satisﬁes (32) 1100 q mh ≤ f ( x ) ≤ · q mh for all x ∈ U .Then inf X : | X | = m k ( ˆ f X − f )1I U k ≥ u q mh − γ. Proof.

Let N denote the number of x i ∈ X such that [ x i − q h, x i + q h ] ⊂ [0 , u ]. The ar-gument proceeds in two cases. With foresight, we set α = 1 / (44 q ). Also let C = 1 / (100 q ) and C = 45 / (4400 q ). ORESETS Case 1: N ≥ αuh . Then by (29) and the nonnegativity of k , k ˆ f X U k ≥ . Nm ≥ . αumh . By (32), k f k ≤ C umh . Hence, k ( ˆ f X − f )1I U k ≥ umh (0 . α − C ) = C umh = 454400 · uq mh . Thus Lemma 6 holds in Case 1 where N ≥ αu/h . Case 2: N ≤ αuh . Let V = [2 hq , u − hq ] \ [ j ∈ T [ x j − q h, x j + q h ]where T is the set of indices j so that [ x j − q h, x j + q h ] ⊂ U . Observe that if j / ∈ T , then by (30), Z V h k (cid:18) x j − th (cid:19) d t ≤ γ. If j ∈ T , then by (29), Z V h k (cid:18) x j − th (cid:19) d t ≤ . . Thus, k ˆ f X V k ≤ . Nm + γ ≤ α . umh + γ. By the union bound, observe that the Lebesgue measure of V is at least u − hq − N hq ≥ u − N hq ≥ u ( 12 − αq ) . Next, by (32), k f V k ≥ C umh ( 12 − αq ) . Therefore,(33) k ( ˆ f X − f )1I U k ≥ umh ( C (1 / − αq ) − . α ) − γ = u q mh − γ. Proposition . Let

L > . Let < δ < / denote an absolute constant. Let ˆ f X denote auniformly weighted coreset KDE with bandwidth h built from a kernel k on X = x , . . . , x m . Supposethat k ( t ) ≤ ∆ | t | − ( k +1) for some absolute constants ∆ > , k ≥ . If h ≤ n − / δ , then for m ≤ n / − δ log n it holds that (34) sup f ∈P H (1 ,L ) inf X : | X | = m k ˆ f X − f k = Ω n − / δ log n ! . TURNER ET AL.

Proof.

Let f ( t ) = λ (cid:16) e − /t t ∈ [ − / , e − / (1 − t ) t ∈ [0 , / (cid:17) , where λ is a normalizing constant so that R f = 1. Observe that f ∈ P H (1 , L ). Our ﬁrst goal is toshow that k ˆ f X − f k = Ω (cid:18) mh log ( mh ) (cid:19) holds for all τ /h ≤ m ≤ h − and for all h ≤ n − / δ , where τ is an absolute constant to bedetermined.We apply Lemma 6 to the density f . Let q be deﬁned as in Lemma 6, and set C = 1 / (100 q )and C = 45 / (4400 q ). Set τ = 10 C /λ . Let U = [ t , t ] := (cid:20) λmh/C ) , λmh/C ) (cid:21) . The function f | U satisﬁes the bounds (32) from Lemma 6. Observe that the length of U is u := t − t = Ω( 1log ( mh ) ) . We set the parameter γ in Lemma 6 to be γ = 1800 q mh log ( mh ) . By the decay assumption on k , we may set q := (cid:18) kγ (cid:19) /k . Therefore, u − q h = Ω( 1log ( mh ) ) − h (cid:18) kγ (cid:19) /k (35) = Ω( 1log ( mh ) ) − O ( h ( mh log ( mh )) /k )(36) = Ω( 1log ( h − ) ) − O ( h − /k log ( h − )) > n suﬃciently large, because we assume τ /h ≤ m ≤ h − , h ≤ n − / δ , and k >

1. Hence,condition (31) is satisﬁed for m, h in the speciﬁed range, so we apply Cauchy–Schwarz and Lemma6 to conclude that for all τ /h ≤ m ≤ h − and h ≤ n − / δ ,(38) k ˆ f X − f k ≥ k ˆ f X − f k = Ω (cid:18) mh log ( mh ) (cid:19) = Ω (cid:18) mh log ( h − ) (cid:19) . Suppose ﬁrst that log (1 /h ) ≥ n / − δ . Then clearly the right-hand side of (38) is Ω(1) for m ≤ n .Otherwise, we have for all h ≤ n − / δ that if m is in the range τh ≤ m ≤ min n / − δ log nh log (1 /h ) , h − ! =: N h , ORESETS then (38) implies(39) k ˆ f X − f k = Ω n − / δ log n ! . Moreover, a uniformly weighted coreset KDE on m = O (1 /h ) points can be expressed as a uniformlyweighted coreset KDE on Ω(1 /h ) points by setting some of the x i ’s to be duplicates. Hence (39)holds for all 1 ≤ m ≤ N h . Since N h is a decreasing function of h , it follows that (39) holds for all m ≤ n / − δ / log n and h ≤ n − / δ , as desired. C.1.2 Large bandwidth

Lemma . Let ε = ε ( n ) > , and let ˆ f X denote the uniformly weighted coreset KDE on X with bandwidth h . Suppose that φ : R → R is an odd C ∞ function supported on [ − / , / . Let f ( t ) : [ − / , / → R ≥ denote the density f ( t ) = 1211 (1 − t ) + εφ ( t ) cos (cid:18) tε (cid:19) . Then (40) k ˆ f X − f k ≥ ε (cid:0) k φ k − (cid:12)(cid:12) F [ φ ](2 ε − ) (cid:12)(cid:12)(cid:1) − k φ k sup | ω |≥ hε − / |F [ k ]( ω ) | − ε Z | ω |≥ ε − / |F [ φ ]( ω ) | d ω. Proof.

Let g ( t ) = (12 / − t ) and ψ ( t ) = εφ ( t ) cos( t/ε ). Observe that k ˆ f X − f k ≥ k g − f k − h ˆ f X , g − f i + 2 h g, ψ ( t ) i = k g − f k − h ˆ f X , g − f i (41)because g ( t ) ψ ( t ) is an odd function. Next, using cos ( θ ) = (1 / θ ) + 1), k g − f k = ε Z / − / cos ( t/ε ) φ ( t )d t ≥ ε k φ k − ε (cid:12)(cid:12) F [ φ ](2 ε − ) (cid:12)(cid:12) . (42)By the triangle inequality and Parseval’s formula, (cid:12)(cid:12)(cid:12) h ˆ f X , g − f i (cid:12)(cid:12)(cid:12) ε ≤ (cid:16) Z | ω |≤ hε − / | {z } =: A + Z | ω |≥ hε − / | {z } =: B (cid:17)(cid:12)(cid:12)(cid:12) F [ k ] (cid:18) − hε − ω (cid:19) h F [ φ ] (cid:16) − ωh (cid:17) (cid:12)(cid:12)(cid:12) d ω. Moreover, A ≤ ε k φ k · sup | ω |≥ hε − / |F [ k ]( ω ) | , (43) B ≤ k k k · Z | ω | >ε − / |F [ φ ]( ω ) | d ω. (44)Then (40) follows from k k k = 1 and equations (41), (42), (43), and (44). TURNER ET AL.

Proposition . Let ε = n − / γ for some absolute constant γ > . Let ˆ f X denote a uniformlyweighted coreset KDE with bandwidth h built from a kernel k on X = x , . . . , x m . Suppose that |F [ k ]( ω ) | ≤ | ω | − ℓ . If h ≥ cε − /ℓ = cn ( − / γ )(1 − /ℓ ) for c suﬃciently large, then for all m it holdsthat (45) sup f ∈P H ( β,L ) inf X : | X | = m k ˆ f X − f k = Ω( ε ) = Ω (cid:16) n − / γ (cid:17) Proof.

The proof is a direct application of Lemma 7. Let f ( t ) = g ( t ) + εφ ( t ) cos( t/ε ), wherewe set φ ( t ) = − e x ( x +1 / x ∈ [ − / , e − x ( x − / x ∈ [0 , / . Observe that φ is odd and φ ∈ C ∞ . Thus, φ ∈ C ∞ , so by the Riemann–Lebesgue lemma [seee.g. Katznelson, 2004, Chapter VI], F [ φ ]( ε − ) ≤ ε . Using a similar argument and noting that F [ φ ]( ω ) = ω − F [ φ ′′ ]( ω ) ≤ ω − , we obtain Z | ω |≥ ε − |F [ φ ]( ω ) | d ω ≤ ε . Also k φ k ≥ c ′ for a small absolute constant, and k φ k ≤ k , and h ≥ c ′ ε − /ℓ imply that k ˆ f X − f k ≥ c ε − (cid:16) εh (cid:17) ℓ − ε = Ω( ε ) . Since f ∈ P H (1 , L ), the statement of the lemma follows. C.2 Proof of Theorem 6

Theorem.

Fix β > and a nonnegative kernel k on R satisfying the following fast decay andsmoothness conditions: lim s → + ∞ s log 1 R | t | >s k ( t ) dt > , (46) lim ω →∞ | ω | log 1 |F [ k ]( ω ) | > , (47) where we recall that F [ k ] denotes the Fourier transform. Let ˆ f unif S be the uniformly weighted coresetKDE. Then there exists L β > such that for L ≥ L β and any m and h > , we have inf h,S : | S |≤ m sup f ∈P H ( β,L ) E k ˆ f unif S − f k = Ω β,k (cid:18) m − β β log β + 12 m (cid:19) . Proof.

We follow a similar strategy to the proof of Theorem 5 by handling the cases of smalland large bandwidth separately.Let q = q ( k ) > R | t | >q k ( t ) dt ≤ .

1. By the assumption inthe theorem, there exists a > Z | t | >s k ( t ) dt ≤ a exp( − as ) , ∀ s ≥ . ORESETS Note that we can set L (1) β large such that for any δ ∈ [0 , f ∈ P H ( β, L (1) β ) such that f ( x ) = δ for x ∈ [0 , / m and h , we haveinf S : | S |≤ m sup f ∈P H ( β,L (1) β ) E k ˆ f unif S − f k ≥ . (cid:18) ∧ q mh (cid:19) ( h ≤ . a log (cid:0) mq . a ∨ a (cid:1) ∧ ) . (48)Let f be an arbitrary function in f ∈ P H ( β, L (1) β ) such that f ( x ) = 1 ∧ q mh , ∀ x ∈ [0 , / . Let T be the set of i ∈ S for which x i ∈ [ q h, / − q h ]. Case 1: | T | ≥ m (cid:16) ∧ q mh (cid:17) . Since k ≥

0, we have k ˆ f X [0 , / k ≥ . | T | m ≥ . (cid:18) ∧ q mh (cid:19) . On the other hand, k f [0 , / k ≤ (cid:18) ∧ q mh (cid:19) , therefore, k ( ˆ f X − f )1 [0 , / k ≥ . (cid:18) ∧ q mh (cid:19) . Case 2: | T | < m (cid:16) ∧ q mh (cid:17) . Deﬁne γ := 0 . (cid:18) ∧ q mh (cid:19) and q := 0 . h . Note that to verify (48) we only need to consider the event of h ≤ . a log ( mq . a ∨ a ) ∧

1, in which case Z | t | >q k ( t ) dt ≤ a exp( − aq ) ≤ a · (cid:18) . amq ∧ . a (cid:19) ≤ a · (cid:18) . aq mh ∧ . a (cid:19) = 0 . ∧ q mh )= γ. TURNER ET AL.

Moreover since γ ≤ . q ≥ q . Now deﬁne V := [2 hq , / − hq ] \ [ j ∈ T [ x j − q h, x j − q h ] . Then for j / ∈ T , we have Z V h k (cid:18) x j − th (cid:19) dt ≤ γ while for j ∈ T we have Z V h k (cid:18) x j − th (cid:19) dt ≤ . . Thus, k ˆ f X V k ≤ . | T | m + γ ≤ . (cid:18) ∧ q mh (cid:19) . On the other hand, by the union bound we see that the Lebesgue measure of V is at least12 − q h − q h | T | ≥ . − q h − . ≥ . q h = 0 .

02. Then k f V k ≥ . (cid:18) ∧ q mh (cid:19) and hence k ( ˆ f X − f )1 [0 , / k ≥ k ( ˆ f X − f )1 V k ≥ . (cid:18) ∧ q mh (cid:19) . This concludes the proof of (48).The second step is to show that for given m and h , we haveinf S : | S |≤ m sup f ∈P H ( β,L ) E k ˆ f unif S − f k ≥ (cid:18) b ( h ∧ m (cid:19) β − bm (49)suﬃciently large m and L to be determined later, and 0 < b < ∞ is such that F [ k ]( ω ) ≤ b exp( − bω ) , ∀ ω ∈ R whose existence is guaranteed by the assumption of the theorem. Let φ be a smooth, even, nonneg-ative function supported on [ − / , /

2] satisfying R [ − / , / φ = 1. Deﬁne f ǫ ( t ) := φ ( t ) (cid:18) c ǫ + ǫ β sin tǫ (cid:19) ORESETS where c ǫ > R [ − / , / f ǫ = 1. Then lim ǫ → c ǫ = 1, and in particular f ǫ ≥ ǫ < ǫ ( φ, β ) for some ǫ ( φ, β ). Moreover we can ﬁnd L (2) β < ∞ such that f ǫ ∈ P H ( β, L (2) β ) for all ǫ < ǫ ( φ, β ). Now k f ǫ − ˆ f X k ≥ |F [ f ǫ ](1 /ǫ ) − F [ ˆ f X ](1 /ǫ ) |≥ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)Z [ − / , / f ǫ ( t ) e − it/ǫ dt (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − (cid:12)(cid:12)(cid:12)(cid:12) F [ k ]( hǫ ) (cid:12)(cid:12)(cid:12)(cid:12) ≥ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)Z [ − / , / f ǫ ( t ) sin tǫ dt (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − (cid:12)(cid:12)(cid:12)(cid:12) F [ k ]( hǫ ) (cid:12)(cid:12)(cid:12)(cid:12) = ǫ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)Z [ − / , / φ ( t ) sin tǫ dt (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − (cid:12)(cid:12)(cid:12)(cid:12) F [ k ]( hǫ ) (cid:12)(cid:12)(cid:12)(cid:12) (50)where (50) used the fact that φ is even. Since lim ǫ → R [ − / , / φ ( t ) sin tǫ dt = , there exists ǫ ′ ( φ )such that Z [ − / , / φ ( t ) sin tǫ dt ≥ ǫ ≤ ǫ ′ ( φ ). Now deﬁne ǫ ′′ ( h, m ) = b ( h ∧ m . There exists m ( φ, β, b ) < ∞ such that sup h> ǫ ′′ ( h, m ) < ǫ ( φ, β ) ∧ ǫ ′ ( φ ) whenever m ≥ m ( φ, β, b ).With the choice of ǫ = ǫ ′′ ( h, m ), we can continue lower bounding (50) as (for m ≥ m ( φ, β, b )):14 (cid:18) b ( h ∧ m (cid:19) β − bm . Finally, we collect the results for step 1 and step 2. First observe that the main term in the riskin step 1 can be simpliﬁed as (cid:18) ∧ q mh (cid:19) ( h ≤ . a log (cid:0) mq . a ∨ a (cid:1) ∧ ) = 1100 q mh ∧ {A} (51)where A denotes the event in the left side of (51).Thus up to multiplicative constant depending on k , β , we can lower bound the risk by takingthe max of the risks in the two steps: (cid:18) mh ∧ {A} (cid:19) ∨ (cid:18) b ( h ∧ m (cid:19) β − bm ! (52)whenever L ≥ L β := L (1) β ∨ L (2) β . We can use the distributive law to open up the parentheses in(52). By checking the h > m − β and h ≤ m − β cases respectively, it is easy to verify that1 mh ∨ (cid:18) b ( h ∧ m (cid:19) β − bm ! = Ω m − ββ +1 log β m ! . TURNER ET AL.

Next, if A is true, we evidently have1 {A} ∨ (cid:18) b ( h ∧ m (cid:19) β − bm ! = 1 = Ω m − ββ +1 log β m ! . If A is not true, then h > . a log ( mq . a ∨ a ) ∧

1, and we have1 {A} ∨ (cid:18) b ( h ∧ m (cid:19) β − bm ! = (cid:18) b ( h ∧ m (cid:19) β − bm ! = Ω (cid:16) log − β m (cid:17) = Ω m − ββ +1 log β m ! . In either case the risk with respect to L is Ω (cid:18) m − ββ +1 log β m (cid:19) . It remains to convert this to a lower boundin L .We consider two cases. First note that by the fast decay condition on the Fourier transform, k ∈ C . Let B = B k denote a constant such that(53) sup x ∈ [ − / , / (cid:12)(cid:12) k ′ ( x ) (cid:12)(cid:12) ≤ B. Set ∆ = B / ∨ k (0) ∨ Case 1: h ≤ ∆.Let U = {| y | ≥ + c β, ∆ ,a log m } , and let U c = R \ U . If h ≤ ∆, then because X i ∈ [ − / , / k , k ˆ f X ( y )1I U k ≤ m − for c β, ∆ ,a suﬃciently large. Thus by Cauchy–Schwarz, k ( ˆ f X − f )1I U c k ≥ c ′ β, ∆ ,a (log m ) − / k ( ˆ f X − f )1I U c k = c ′ β, ∆ ,a (log m ) − / (cid:16) k ( ˆ f X − f ) k − k ( ˆ f X − f )1I U k (cid:17) ≥ c ′ β, ∆ ,a (log m ) − / c β,k m − ββ +1 log β m ! − m − ! = Ω m − ββ +1 log β + m ! Case 2: h ≥ ∆In this case, k ( X i − y ) is nearly constant for all i . By (53) and Taylor’s theorem, (cid:12)(cid:12)(cid:12)(cid:12) k (0) − k (cid:18) X i − yh (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ B for all y ∈ [ − / , /

2] and for all i . Hence, for all y ∈ [ − / , / h ≥ ∆,ˆ f X ( y ) = 1 mh m X i =1 k (cid:18) X i − yh (cid:19) ≤ h ( k (0) + 2 B ) ≤ . ORESETS For L β large enough, we see that for the function f ∈ P H ( β, L β ) with f | [0 , ] ≡ k ˆ f X − f k ≥ k ( ˆ f X − f )1I [0 , ] k = Ω(1) . APPENDIX D: PROOFS FROM SECTION 5D.1 Proof of Theorem 7

The result is restated below.

Theorem.

Let k s denote the kernel from Section 3.2. The algorithm of Phillips and Tai [2018b]yields in polynomial time a subset S with | S | = m = ˜ O ( n β + d β + d ) such that the uniformly weightedcoreset KDE ˆ g S satisﬁes sup f ∈P H ( β,L ) E k f − ˆ g S k ≤ c β,d,L n − β β + d . Proof.

Here we adapt the results in Section 2 of Phillips and Tai [2018b] to our setting wherethe bandwidth h = n − / (2 β + d ) is shrinking. Using their notation, we deﬁne K s ( x, y ) = k s (cid:0) x − yh (cid:1) and study the kernel discrepancy of the kernel K s . First we verify the assumptions on the kernel(bounded inﬂuence, Lipschitz, and positive semideﬁniteness) needed to apply their results.First, the kernel K s is bounded inﬂuence [see Phillips and Tai, 2018b, Section 2] with constant c K = 2 and δ = n − , which means that | K s ( x, y ) | ≤ n if | x − y | ∞ ≥ n . This follows from the fast decay of κ s .Note that if x and y diﬀer on a single coordinate i , then | k s ( x ) − k s ( y ) | ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) c ( x i − y i ) Y j = i κ s ( x j ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ c | x i − y i | because | κ s ( x ) | ≤ k ψ k for all x and the function κ s is c -Lipschitz for some constant c . Hence bythe triangle and Cauchy–Schwarz inequalities, the function k s is Lipschitz: | k s ( x ) − k s ( y ) | ≤ dc k | x − y | ≤ d / c κ | x − y | . Therefore the kernel K s ( x, y ) is Lipschitz [see Phillips and Tai, 2018b] with constant C K = d / c κ h − . Moreover, the kernel K s is positive semideﬁnite because the Fourier transform of κ s isnonnegative.Given the shrinking bandwidth h = n − / (2 β + d ) , we slightly modify the lattice used in Phillips and Tai[2018b, Lemma 1]. Deﬁne the lattice L = { ( i δ, i δ, . . . , i d δ ) | i j ∈ Z } , where δ = 1 c κ d nh − . TURNER ET AL.

The calculation at the top of page 6 of Phillips and Tai [2018b, Lemma 1] yields disc ( X, χ, y ) := (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 χ ( X i ) K s ( X i , y ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 χ ( X i ) K s ( X i , y ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 1where y is the closest point to y in the lattice L , and χ assigns either +1 or − X = X , . . . , X n . Moreover, with the bounded inﬂuence of K s , ifmin i | y − X i | ∞ ≥ n , then disc ( X, χ, y ) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 χ ( X i ) K s ( X i , y ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ . On deﬁning L X = L ∩ { y : min i | y − X i | ∞ ≤ n } , we see that max y ∈ R d disc ( X, χ, y ) ≤ max y ∈L X disc ( X, χ, y ) + 1for all signings χ : X → {− , +1 } . This is precisely the conclusion of Phillips and Tai [2018b,Lemma 1].This established, the positive deﬁniteness and bounded diagonal entries of K s and Phillips and Tai[2018b, Lemmas 2 and 3] imply that disc K s = O ( p d log n ) . Given ε >

0, the halving algorithm can be applied to K s as in Phillips and Tai [2018b, Corollary5] to yield a coreset X S of size m = O ( ε − p d log ε − ) such that k n n X j =1 K s ( X j , y ) − m X j ∈ S K s ( X j , y ) k ∞ ≤ ε. Rescaling by h − d , we have k ˆ f − ˆ f unif S k ∞ = k n n X j =1 k s ( X j , y ) − m X j ∈ S k s ( X j , y ) k ∞ ≤ εh − d . Recall from Section B.2 that ˆ f attains the minimax rate of estimation on P H ( β, L ). Thus setting ε = h d n − β/ (2 β + d ) we get a coreset of size ˜ O d ( n β + d β + d ) that attains the minimax rate c β,d,L n − β/ (2 β + d ) ,as desired. Moreover, by the results of Phillips and Tai [2018b], this coreset can be constructed inpolynomial time. Acknowledgments

We thank Cole Franks for helpful discussions regarding algorithmic aspectsof Carath´eodory’s theorem.

ORESETS REFERENCES

Pankaj K Agarwal, Sariel Har-Peled, and Kasturi R Varadarajan. Geometric approximation viacoresets.

Combinatorial and computational geometry , 52:1–30, 2005.Hassan Ashtiani, Shai Ben-David, Nicholas JA Harvey, Christopher Liaw, Abbas Mehrabian, andYaniv Plan. Near-optimal sample complexity bounds for robust learning of gaussian mixturesvia compression schemes.

Journal of the ACM (JACM) , 67(6):1–42, 2020.Francis R Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence betweenherding and conditional gradient algorithms. In

ICML , 2012.Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machinelearning. arXiv preprint arXiv:1703.06476 , 2017.Olivier Bachem, Mario Lucic, and Andreas Krause. Scalable k-means clustering via lightweightcoresets. In

Proceedings of the 24th ACM SIGKDD International Conference on KnowledgeDiscovery & Data Mining , pages 1119–1127, 2018.Nikhil Bansal, Daniel Dadush, Shashwat Garg, and Shachar Lovett. The gram-schmidt walk: acure for the banaszczyk blues. In

Proceedings of the 50th Annual ACM SIGACT Symposium onTheory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018 , pages 587–597,2018. . URL https://doi.org/10.1145/3188745.3188850 .S´ebastien Bubeck.

Convex optimization: algorithms and complexity . Now Publishers Inc., 2015.C. Carath´eodory. ¨Uber den Variabilit¨atsbereich der Koeﬃzienten von Potenzreihen, die gegebeneWerte nicht annehmen, March 1907. URL https://doi.org/10.1007/bf01449883 .Chazelle and Matouˇsek. On linear-time deterministic algorithms for optimization problems in ﬁxeddimension.

Journal of Algorithms , 21(3):579–597, 1996.B. Chazelle.

The Discrepancy Method: Randomness and Complexity . Cambridge University Press,Cambridge, 2000.Sebastian Claici, Aude Genevay, and Justin Solomon. Wasserstein measure coresets. arXiv preprintarXiv:1805.07412 , 2020.Kenneth L Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe algorithm.

ACMTransactions on Algorithms (TALG) , 6(4):1–30, 2010.Thomas M. Cover and Joy A. Thomas.

Elements of information theory . Wiley-Interscience [JohnWiley & Sons], Hoboken, NJ, second edition, 2006.Dan Feldman, Matthew Faulkner, and Andreas Krause. Scalable training of mixture models viacoresets. In

Advances in neural information processing systems , pages 2142–2150, 2011.Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In

Proceedings of the twenty-fourthannual ACM-SIAM symposium on Discrete algorithms , pages 1434–1453. SIAM, 2013. TURNER ET AL.

Gereon Frahling and Christian Sohler. Coresets in dynamic geometric data streams. In

Proceedingsof the Thirty-Seventh Annual ACM Symposium on Theory of Computing , STOC ’05, pages 209–217, New York, NY, USA, 2005. Association for Computing Machinery. ISBN 1581139608. .URL https://doi.org/10.1145/1060590.1060622 .Marguerite Frank, Philip Wolfe, et al. An algorithm for quadratic programming.

Naval researchlogistics quarterly , 3(1-2):95–110, 1956.Bernd G¨artner and Martin Jaggi. Coresets for polytope distance. In

Proceedings of the twenty-ﬁfthannual symposium on Computational geometry , pages 33–42, 2009.Martin Gr¨otschel, L´aszl´o Lov´asz, and Alexander Schrijver.

Geometric algorithms and combinatorialoptimization , volume 2. Springer Science & Business Media, 2012.Steve Hanneke, Aryeh Kontorovich, and Menachem Sadigurschi. Sample compression for real-valuedlearners. In

Algorithmic Learning Theory , pages 466–488, 2019.Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering.

Discrete& Computational Geometry , 37(1):3–19, 2007.Nick Harvey and Samira Samadi. Near-optimal herding. In

Conference on Learning Theory , pages1165–1182, 2014.Jean-Baptiste Hiriart-Urruty and Claude Lemar´echal.

Fundamentals of convex analysis . SpringerScience & Business Media, 2004.Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable bayesian logisticregression. In

Advances in Neural Information Processing Systems , pages 4080–4088, 2016.Sarang Joshi, Raj Varma Kommaraji, Jeﬀ M. Phillips, and Suresh Venkatasubramanian. Com-paring distributions and shapes using the kernel distance. In

Proceedings of the Twenty-Seventh Annual Symposium on Computational Geometry , SoCG ’11, page 47–56, New York,NY, USA, 2011. Association for Computing Machinery. ISBN 9781450306829. . URL https://doi.org/10.1145/1998196.1998204 .Zohar Karnin and Edo Liberty. Discrepancy, coresets, and sketches in machine learning. In AlinaBeygelzimer and Daniel Hsu, editors,

Proceedings of the Thirty-Second Conference on LearningTheory , volume 99 of

Proceedings of Machine Learning Research , pages 1975–1993, Phoenix,USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/karnin19a.html .Yitzhak Katznelson.

An Introduction to Harmonic Analysis . Cambridge Mathematical Library.Cambridge University Press, 3 edition, 2004. .Nick Littlestone and Manfred Warmuth. Relating data compression and learnability.

Unpublishedmanuscript , 1986.David Lopez-Paz, Krikamol Muandet, Bernhard Sch¨olkopf, and Iliya Tolstikhin. Towards alearning theory of cause-eﬀect inference. In Francis Bach and David Blei, editors,

Proceed-ings of the 32nd International Conference on Machine Learning , volume 37 of

Proceedings ofMachine Learning Research , pages 1452–1461, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/lopez-paz15.html . ORESETS J. Matouˇsek.

Geometric Discrepancy: an Illustrated Guide . Springer, New York, 1999.Daniel McDonald. Minimax Density Estimation for Growing Dimension. In AartiSingh and Jerry Zhu, editors,

Proceedings of the 20th International Conference on Ar-tiﬁcial Intelligence and Statistics , volume 54 of

Proceedings of Machine Learning Re-search , pages 194–203, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR. URL http://proceedings.mlr.press/v54/mcdonald17a.html .Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, and David P. Woodruﬀ. On coresetsfor logistic regression. In

Proceedings of the 32nd International Conference on Neural InformationProcessing Systems , NIPS’18, pages 6562–6571, Red Hook, NY, USA, 2018. Curran AssociatesInc.Jeﬀ M Phillips. ε -samples for kernels. In Proceedings of the twenty-fourth annual ACM-SIAMsymposium on Discrete algorithms , pages 1622–1632. SIAM, 2013.Jeﬀ M Phillips and Wai Ming Tai. Improved coresets for kernel density estimates. In

Proceedingsof the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms , pages 2718–2727.SIAM, 2018a.Jeﬀ M. Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. In , pages 66:1–66:13, 2018b. . URL https://doi.org/10.4230/LIPIcs.SoCG.2018.66 .V. M. Tikhomirov. ε -Entropy and ε -Capacity of Sets In Functional Spaces , pages 86–170. Springer Netherlands, Dordrecht, 1993. ISBN 978-94-017-2973-4. . URL https://doi.org/10.1007/978-94-017-2973-4_7 .Alexandre B. Tsybakov. Introduction to Nonparametric Estimation . Springer series in statistics.Springer, 2009. ISBN 978-0-387-79051-0. . URL https://doi.org/10.1007/b13794 .Yan Zheng and Jeﬀ M Phillips. Coresets for kernel regression. In

Proceedings of the 23rd ACMSIGKDD International Conference on Knowledge Discovery and Data Mining , pages 645–654,2017.

Paxton TurnerDepartment of MathematicsMassachusetts Institute of Technology77 Massachusetts Avenue,Cambridge, MA 02139-4307, USA( [email protected] ) Jingbo LiuDepartment of StatisticsUniversity of Illinois at Urbana-Champaign725 S. Wright St.,Champaign, IL 61820, USA( [email protected] )Philippe RigolletDepartment of MathematicsMassachusetts Institute of Technology77 Massachusetts Avenue,Cambridge, MA 02139-4307, USA( [email protected]@math.mit.edu