On approximations via convolution-defined mixture models
Hien D. Nguyen and Geoffrey J. McLachlan

March 1, 2018 (draft)
Abstract
An often-cited fact regarding mixing or mixture distributions is that their density functions are able to approximate the density function of any unknown distribution to arbitrary degrees of accuracy, provided that the mixing or mixture distribution is sufficiently complex. This fact is often not made concrete. We investigate and review theorems that provide approximation bounds for mixing distributions. Connections between the approximation bounds of mixing distributions and estimation bounds for the maximum likelihood estimator of finite mixtures of location-scale distributions are reviewed.
Index Terms
Mixing distributions; finite mixture models; convolutions; Kullback-Leibler divergence; maximum likelihood estimators.
I. INTRODUCTION
Mixing distributions and finite mixture models are important classes of probability models that have found use in many areas of application, such as artificial intelligence, machine learning, pattern recognition, statistics, and beyond. Mixing distributions provide probability models with probability density functions (PDFs) of the form
$$f(x) = \int_{\Theta} f(x; \theta)\, d\Pi(\theta),$$
where $f(\cdot\,; \theta)$ is a PDF (with respect to a random variable $X \in \mathbb{X}$) that depends on some parameter $\theta \in \Theta \subseteq \mathbb{R}^{d}$ with distribution function (DF) $\Pi$. Here, $\mathbb{X} \subseteq \mathbb{R}^{p}$ is the support of the PDFs $f$ and $f(\cdot\,; \theta)$, where $\mathbb{X}$ is not functionally dependent on $\theta$. Notice that the mixing distributions contain the finite mixture models, obtained by setting the DF to $\Pi(\theta) = \sum_{i=1}^{n} \pi_{i} \delta(\theta - \theta_{i})$ for $n \in \mathbb{N}$, where $\delta$ is the Dirac delta function (cf. [1]), $\pi_{i} \ge 0$, $\sum_{i=1}^{n} \pi_{i} = 1$, and $\theta_{i} \in \Theta$ (for each $i \in [n] = \{1, \dots, n\}$), and where $\mathbb{N}$ denotes the natural numbers (zero exclusive); see [2, Sec. 1.2.2] and [3, Sec. 1.12] for descriptions and references regarding mixing distributions.

*Hien Nguyen is with the Department of Mathematics and Statistics, La Trobe University, Bundoora, Melbourne, Australia 3086. Geoffrey McLachlan is with the School of Mathematics and Physics, The University of Queensland, St. Lucia, Brisbane, Australia 4075. Corresponding author: Hien Nguyen (Email: [email protected]).

The appeal of finite mixture models largely comes from their flexibility of representation. The folk theorem regarding mixture models generally states that a mixture model can approximate any distribution to any required level of accuracy, provided that the number of mixture components is sufficiently large. Example statements of the folk theorem include: "provided the number of component densities is not bounded above, certain forms of mixture can be used to provide arbitrarily close approximations to a given probability distribution" [4, p. 50], "any continuous distribution can be approximated arbitrarily well by a finite mixture of normal densities with common variance (or covariance matrix in the multivariate case)" [3, p. 176], "there is an obvious sense in which the mixture of normals approach, given enough components, can approximate any multivariate density" [5, p. 5], "the [mixture] model forms can fit any distribution and significantly increase model fit" [6, p. 173], and "a mixture model can approximate almost any distribution" [7, p. 500]. From these examples, we see that statements regarding the flexibility of mixture models are generally left technically vague and unclear.

Let $d_{\mathrm{TV}}^{\mathbb{X}}(f, g) = \frac{1}{2}\|f - g\|_{\mathbb{X},1}$ be the total-variation distance, where $\|f\|_{\mathbb{X},q} = \left[\int_{\mathbb{X}} |f(x)|^{q}\, dx\right]^{1/q}$, for $q \in [1, \infty)$, is the $L_q$-norm over the support $\mathbb{X}$. Here, $f(x)$ and $g(x)$ are functions over the support $\mathbb{X} \subseteq \mathbb{R}^{p}$. We also let $\|f\|_{\mathbb{X},\infty} = \sup_{x \in \mathbb{X}} |f(x)|$. Further, define the location-scale family of PDFs over $\mathbb{R}$ as
$$\mathcal{F} = \left\{ f : \int_{\mathbb{R}} \sigma^{-1} f\left(\frac{x - \mu}{\sigma}\right) dx = 1,\ \text{for all}\ \mu \in \mathbb{R}\ \text{and}\ \sigma \in (0, \infty) \right\}.$$
Define $x_i$ to be the $i$th element of $x$, for $i \in [p]$. Historical technical justifications for the folk theorem include [8] and [9]. Within the more recent literature, it is difficult to obtain clear technical statements of such results. Among the only references that we could find is the exposition of [10, Sec. 33.1], and in particular, the following theorem statement.

Theorem 1 (DasGupta 2008, Thm. 33.1). Let $f$ be a probability density function over $\mathbb{R}^{p}$, for $p \in \mathbb{N}$.
If $\mathcal{F}_g$ is the class of mixtures of $g \in \mathcal{F}$:
$$\mathcal{F}_g = \left\{ f^* : f^*(x) = \int_0^\infty \int_{\mathbb{R}^p} \sigma^{-p} \prod_{i=1}^p g\left(\frac{x_i - \mu_i}{\sigma}\right) d\Pi_\mu(\mu)\, d\Pi_\sigma(\sigma) \right\},$$
then, given any $\epsilon > 0$, there exists an $f^* \in \mathcal{F}_g$ such that $d_{\mathrm{TV}}^{\mathbb{R}^p}(f, f^*) < \epsilon$, where $\Pi_\mu(\mu)$ and $\Pi_\sigma(\sigma)$ are DFs over $\mathbb{R}^p$ and $(0, \infty)$, respectively.

Upon inspection, Theorem 1 states that the class of marginally-independent location-scale mixing distributions has PDFs that can approximate any other PDF arbitrarily well, with respect to the total-variation distance. Unfortunately, the proof of Theorem 1 is not provided in [10]. Remarks regarding the theorem relegate the proof to an unspecified location in [11], which makes it difficult to investigate the structure and nature of the theorem. The lack of transparency in the cited text has led us to investigate the nature of the presented theorem. The outcome of our investigation is the collection of the various proofs and technical results that are reviewed in this article. The contents of our review are as follows.

Firstly, we investigate the proofs from [11] and consider alternative versions of Theorem 1 that provide more insight into the structure of the result. For example, we present an alternative to Theorem 1 in which only a mixture over the location parameter is required. That is, no integration over the scale parameter element is needed, as in $\mathcal{F}_g$. Furthermore, we state a uniform approximation alternative to Theorem 1 that is applicable to the approximation of target PDFs over compact sets. Rates of convergence are also obtainable if we make a Lipschitz assumption on the target PDF.
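To make the statement of Theorem 1 concrete, the following Python sketch (our illustration; the triangular target, the grid of component locations, and the fixed scale are arbitrary choices, not constructions taken from [10]) builds a mixture with a discrete $\Pi_\mu$ and a degenerate $\Pi_\sigma$, and estimates the total-variation distance to the target by a Riemann sum:

```python
import math

def triangle_pdf(x):
    """Target density: f(x) = 1 - |x| on [-1, 1], zero elsewhere."""
    return max(0.0, 1.0 - abs(x))

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, locations, weights, sigma):
    """A location mixture of normals with a fixed scale: discrete Pi_mu, degenerate Pi_sigma."""
    return sum(w * normal_pdf(x, mu, sigma) for w, mu in zip(weights, locations))

def tv_distance(f, g, lo=-4.0, hi=4.0, step=0.002):
    """d_TV(f, g) = (1/2) * L1 distance, estimated by a Riemann sum."""
    n = int((hi - lo) / step)
    return 0.5 * sum(abs(f(lo + i * step) - g(lo + i * step)) * step for i in range(n))

def approximate(n, sigma):
    """Place n components on a grid over [-1, 1], weighted by the target density."""
    locations = [-1.0 + 2.0 * (i + 0.5) / n for i in range(n)]
    weights = [triangle_pdf(mu) for mu in locations]
    total = sum(weights)
    weights = [w / total for w in weights]
    return lambda x: mixture_pdf(x, locations, weights, sigma)

coarse = tv_distance(triangle_pdf, approximate(5, 0.4))   # few components, large scale
fine = tv_distance(triangle_pdf, approximate(40, 0.05))   # many components, small scale
```

Refining the mixing measure drives the total-variation error toward zero, which is the qualitative content of the theorem.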
In addition to the presentation of Theorem 1 and its variants, we also review the relationship between the mixing distribution results and the approximation bounds of [12]. Via the approximation and estimation bounding results for the maximum likelihood estimator (MLE) from [13] and [14], we further present results for bounding Kullback-Leibler errors [KL; [15]] of the MLE for finite mixtures of location-scale PDFs.

The article proceeds as follows. In Section II, we discuss Theorem 1 and its variants. The relationships between the mixing distribution approximation results and the results of [12], [13], and [14] are then presented in Sections III, IV, and V, respectively.

II. MIXING DISTRIBUTION APPROXIMATION THEOREMS
Let $L_q(\mathbb{X})$ be the space of functions mapping to $\mathbb{R}$, with support $\mathbb{X}$, that have the property $\|f\|_{\mathbb{X},q} < \infty$. Further, we define the convolution between $f \in L_q(\mathbb{X})$ and $g \in L_r(\mathbb{X})$ as
$$(f * g)(x) = \int_{\mathbb{X}} f(y)\, g(x - y)\, dy, \qquad (1)$$
where (1) exists and is measurable in specific cases, due to results such as the following from [16, Sec. 9.3].

Theorem 2 (Makarov and Podkorytov, 2013, Sec. 9.3.1-2). Let $f \in L_q(\mathbb{R}^p)$ and $g \in L_r(\mathbb{R}^p)$, for $q, r \in [1, \infty]$. We have the following results: (i) if $q = 1$, then $f * g$ exists and $\|f * g\|_{\mathbb{R}^p, r} \le \|f\|_{\mathbb{R}^p, 1} \|g\|_{\mathbb{R}^p, r}$; (ii) if $1/q + 1/r = 1$, then $f * g$ exists and $\|f * g\|_{\mathbb{R}^p, \infty} \le \|f\|_{\mathbb{R}^p, q} \|g\|_{\mathbb{R}^p, r}$.

Remark 3. When $q = r = 1$, not only do we have inequality (i) of Theorem 2, but also
$$\int_{\mathbb{R}^p} (f * g)(x)\, dx = \int_{\mathbb{R}^p} f(y) \int_{\mathbb{R}^p} g(x - y)\, dx\, dy = \int_{\mathbb{R}^p} f(x)\, dx \int_{\mathbb{R}^p} g(x)\, dx < \infty.$$
Moreover, this implies that if $f$ and $g$ are PDFs over $\mathbb{R}^p$, then $f * g$ is also a PDF over $\mathbb{R}^p$.

Let a function $\alpha_k \in L_1(\mathbb{R}^p)$, indexed by $k \in \mathbb{R}_+$, be called an approximate identity in $\mathbb{R}^p$ if there exists a $k^* \in [0, \infty]$ such that (i) $\alpha_k \ge 0$, (ii) $\int_{\mathbb{R}^p} \alpha_k(x)\, dx = 1$, and (iii) $\int_{\|x\| > \delta} \alpha_k(x)\, dx \to 0$ as $k \to k^*$, for every $\delta > 0$ [cf. [16, Sec. 7.6.1]]. Here, $\|x\|_q = (\sum_{i=1}^p |x_i|^q)^{1/q}$ is the $l_q$-vector norm. The following result of [11] provides a useful generative method for constructing approximate identities.

Lemma 4 (Cheney and Light, 2000, Ch. 20, Thm. 4). Let $\alpha \in L_1(\mathbb{R}^p)$ and let $k \in \mathbb{N}$. If $\alpha \in \bar{\mathcal{F}}$, where
$$\bar{\mathcal{F}} = \left\{ f \in L_1(\mathbb{R}^p) : f(x) = \prod_{i=1}^p g(x_i),\ \int_{\mathbb{R}} g(x)\, dx = 1,\ \text{and}\ g(x) \ge 0\ \text{for all}\ x \in \mathbb{R} \right\},$$
then the dilations $\alpha_k(x) = k^p \alpha(kx)$ form an approximate identity, with $k^* = \infty$.

We may call $\bar{\mathcal{F}}$ the class of marginally-independent scaled density functions. With the ability to construct approximate identities, Theorem 5, below, from [16], provides a powerful means of constructing approximations to any function over $\mathbb{R}^p$.
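Lemma 4 can be probed numerically. The Python sketch below (our illustration, with $p = 1$ and the standard normal as the generating density; the grid widths are arbitrary) checks that the dilation $\alpha_k(x) = k\,\alpha(kx)$ retains unit mass and concentrates its mass inside any fixed $\delta$-ball as $k$ grows, that is, properties (ii) and (iii) of an approximate identity:

```python
import math

def alpha(x):
    """Generating density: standard normal, a member of the product class with p = 1."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def alpha_k(x, k):
    """Dilation alpha_k(x) = k^p * alpha(k x), here with p = 1."""
    return k * alpha(k * x)

def riemann(fun, lo, hi, n=20000):
    """Midpoint Riemann sum of fun over [lo, hi]."""
    h = (hi - lo) / n
    return sum(fun(lo + (i + 0.5) * h) for i in range(n)) * h

mass_k2 = riemann(lambda x: alpha_k(x, 2.0), -10.0, 10.0)        # property (ii)
delta = 0.5
tail_k2 = 2.0 * riemann(lambda x: alpha_k(x, 2.0), delta, 10.0)   # mass outside the delta-ball
tail_k20 = 2.0 * riemann(lambda x: alpha_k(x, 20.0), delta, 10.0)
```

The tail mass at $k = 20$ is many orders of magnitude below that at $k = 2$, consistent with $k^* = \infty$.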
The corollary to Theorem 5, below, provides a statistical interpretation.
Theorem 5 (Makarov and Podkorytov, 2013, Sec. 9.3.3). Let $\alpha_k$ be an approximate identity in $\mathbb{R}^p$ for some $k^* \in [0, \infty]$. If $f \in L_q(\mathbb{R}^p)$ for $q \in [1, \infty)$, then $\|f * \alpha_k - f\|_{\mathbb{R}^p, q} \to 0$ as $k \to k^*$.

Corollary 6.
Let $f$ be a PDF in $L_q(\mathbb{R}^p)$, for $q \in [1, \infty)$. If $g \in \bar{\mathcal{F}}$, then for any $\epsilon > 0$, there exists a PDF $f^*$ in
$$\mathcal{F}_g = \left\{ f^* : f^*(x) = \int_{\mathbb{R}^p} k^p g(kx - km) f(m)\, dm,\ k \in \mathbb{N} \right\}$$
such that $\|f - f^*\|_{\mathbb{R}^p, q} < \epsilon$.

Proof: From Lemma 4 and Theorem 5, for any $g \in \bar{\mathcal{F}}$ and $f \in L_q(\mathbb{R}^p)$, we have $\|f * [k^p g(k \times \cdot)] - f\|_{\mathbb{R}^p, q} \to 0$, where the convolution is $f * [k^p g(k \times \cdot)] = \int_{\mathbb{R}^p} k^p g(kx - km) f(m)\, dm$. By the definition of convergence, for every $\epsilon > 0$ there exists some $K$ such that for all $k > K$, $\|f * [k^p g(k \times \cdot)] - f\|_{\mathbb{R}^p, q} < \epsilon$. Putting the convolutions $f * [k^p g(k \times \cdot)]$, for all $k \in \mathbb{N}$, into $\mathcal{F}_g$ provides the desired convergence result. Lastly, $f^*$ is a PDF via Remark 3.

Remark 7. Corollary 6 improves upon Theorem 1 in several ways. Firstly, the total-variation bound is replaced by the stronger $L_q$-norm result. Secondly, mixing only occurs over the mean parameter element $m$, via the PDF $d\Pi_m(m)/dm = f(m)$, and not over the scaling parameter element $k$, which can be taken as a constant value. That is, we only require that the class $\mathcal{F}_g$ consist of mixing distributions over the location parameter element of $g$, where $\Pi_m$ is the DF determined by the density being approximated and the scale parameter element is fixed at some $k \in \mathbb{N}$. Lastly, we note that Theorem 1 can be obtained as the $q = 1$ case of Corollary 6 by setting $\sigma = 1/k$ and $\mu = m/k$.

Notice that Theorem 5 cannot be used to provide $L_\infty$-norm approximation results. Let $C(\mathbb{X})$ be the class of continuous functions over the set $\mathbb{X}$. If one assumes that the target PDF $f$ is bounded and belongs to $C(\mathbb{R}^p)$, then a uniform approximation alternative to Theorem 5 is possible for compact subsets of $\mathbb{R}^p$.

Theorem 8 (Cheney and Light, 2000, Ch. 20, Thm. 2). Let $\alpha_k$ be an approximate identity in $\mathbb{R}^p$ for some $k^* \in [0, \infty]$.
If $f$ is a bounded function in $C(\mathbb{R}^p)$, then $\|f * \alpha_k - f\|_{K, \infty} \to 0$ as $k \to k^*$, for all compact $K \subset \mathbb{R}^p$.

We note in passing that Theorem 8 can be used to prove density results for finite mixture models, such as that of [10, Thm. 33.2]. For further details, see [11, Thm. 5] and [17]. Let $\mathrm{Lip}_a(\mathbb{X})$ be the class of Lipschitz functions $f$ satisfying $|f(x) - f(y)| \le C \|x - y\|_\infty^a$, for some $a, C \in [0, \infty)$ and all $x, y \in \mathbb{X}$. If one assumes that the target PDF is in $\mathrm{Lip}_a(\mathbb{X})$ for some $a \in (0, 1]$, then the following approximation rate result is available.

Theorem 9 (Cheney and Light, 2000, Ch. 21, Thm. 1). Let $\alpha_k$ be an approximate identity in $\mathbb{R}^p$ for some $k^* \in [0, \infty]$, with the additional property that $\int_{\mathbb{R}^p} \|x\|^a \alpha(x)\, dx < \infty$ for some $a \in (0, 1]$. If $f \in \mathrm{Lip}_a(\mathbb{R}^p)$, then there exists a constant $A > 0$ such that $\|f * \alpha_k - f\|_{\mathbb{R}^p, \infty} \le A/k^a$ for $k \in \mathbb{N}$.

Example 10.
Let $\alpha \in \bar{\mathcal{F}}$ be generated by taking the marginal location-scale density $g = \phi$, where $\phi$ is the standard normal PDF. The condition $\int_{\mathbb{R}^p} \|x\|^a \alpha(x)\, dx < \infty$ is satisfied for $a = 1$, since the multivariate normal distribution has all of its polynomial moments; see, for example, [18].
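Continuing Example 10, both the convergence of Theorem 5 and the rate of Theorem 9 can be observed numerically. In the Python sketch below (our illustration, with $p = 1$), the target is a triangular density, which is Lipschitz with $a = 1$, and the approximate identity is the normal dilation $\alpha_k(y) = k\phi(ky)$; the sup-norm error should then scale like $A/k$, so quadrupling $k$ should divide it by roughly four. All grid and quadrature sizes are arbitrary choices:

```python
import math

def triangle_pdf(x):
    """Target density f(x) = 1 - |x| on [-1, 1]: Lipschitz with constant 1."""
    return max(0.0, 1.0 - abs(x))

def phi(x):
    """Standard normal PDF, the marginal density of Example 10."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def smoothed(x, k, n=400, span=6.0):
    """(f * alpha_k)(x) with alpha_k(y) = k phi(k y), by midpoint quadrature over y."""
    h = 2.0 * span / (k * n)
    lo = -span / k
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        total += k * phi(k * y) * triangle_pdf(x - y)
    return total * h

def errors(k, lo=-2.0, hi=2.0, m=400):
    """Return (L1 error, sup error) of f * alpha_k against f on a grid."""
    h = (hi - lo) / m
    l1, sup = 0.0, 0.0
    for i in range(m + 1):
        x = lo + i * h
        diff = abs(smoothed(x, k) - triangle_pdf(x))
        l1 += diff * h
        sup = max(sup, diff)
    return l1, sup

l1_4, sup_4 = errors(4.0)
l1_16, sup_16 = errors(16.0)
```

The sup-norm error tracks the $1/k$ rate of Theorem 9, while the $L_1$ error falls even faster here, since the target is piecewise linear.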
Corollary 11.
Let $f \in \mathrm{Lip}_1(\mathbb{R}^p)$ be a PDF. If
$$f^*(x) = \int_{\mathbb{R}^p} k^p \prod_{i=1}^p \phi(kx_i - km_i)\, f(m)\, dm,$$
then $\|f - f^*\|_{\mathbb{R}^p, \infty} \le A/k$ for $k \in \mathbb{N}$ and some constant $A > 0$.

Thus, the mixing distribution generated via marginally-independent normal PDFs converges uniformly, for target PDFs $f \in \mathrm{Lip}_1(\mathbb{R}^p)$, at a rate of $1/k$.

III. BOUNDING OF KULLBACK-LEIBLER DIVERGENCE VIA RESULTS FROM ZEEVI AND MEIR (1997)

Let $K \subset \mathbb{R}^p$ be a compact subset and let
$$\mathcal{F}_{K,\beta} = \left\{ f : \int_K f(x)\, dx = 1\ \text{and}\ f(x) \ge \beta > 0,\ \text{for all}\ x \in K \right\}$$
be the class of lower-bounded target PDFs over $K$. In [12], the approximation errors for finite mixtures of marginally-independent PDFs are studied in the context of approximating functions in $\mathcal{F}_{K,\beta}$.

Remark 12. The use of finite mixtures of marginally-independent PDFs is implicit in [12], as they report on approximation via product kernels of radial basis functions. Taking products of kernels is equivalent to taking products over marginally-independent densities to yield a joint density. Univariate radial basis functions that are positive and integrate to unity are symmetric PDFs in one dimension. Thus, products of univariate radial basis functions that generate PDFs correspond to a subclass of $\bar{\mathcal{F}}$; see [19] regarding radial basis functions.

Let the KL divergence between two PDFs $f, g \in L_1(\mathbb{X})$ be defined as
$$d_{\mathrm{KL}}^{\mathbb{X}}(f, g) = \int_{\mathbb{X}} f(x) \log\left[\frac{f(x)}{g(x)}\right] dx.$$
The KL divergence between $f$ and $g$ is difficult to work with, as it is not a distance function: it is asymmetric and it does not obey the triangle inequality. As such, bounding the KL divergence by a distance function provides a useful means of manipulation and control. The following useful result is obtained in [12].

Lemma 13 (Zeevi and Meir, 1997, Lemma 3). If $f, g \in \mathcal{F}_{K,\beta}$, then $d_{\mathrm{KL}}^{K}(f, g) \le \beta^{-1} \|f - g\|_{K,2}^{2}$.
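As a numerical sanity check of Lemma 13 (the Python sketch, the particular pair of densities on $K = [0, 1]$, and the lower bound $\beta = 1/2$ are our own illustrative assumptions):

```python
import math

def kl_divergence(f, g, lo, hi, n=10000):
    """d_KL over K = [lo, hi], by a midpoint Riemann sum."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += f(x) * math.log(f(x) / g(x)) * h
    return total

def l2_distance_sq(f, g, lo, hi, n=10000):
    """Squared L2 distance over K = [lo, hi]."""
    h = (hi - lo) / n
    return sum((f(lo + (i + 0.5) * h) - g(lo + (i + 0.5) * h)) ** 2 for i in range(n)) * h

# Two densities on K = [0, 1], both bounded below by beta = 1/2.
def f(x):
    return 1.0        # uniform density on [0, 1]

def g(x):
    return 0.5 + x    # integrates to one on [0, 1], with minimum value 1/2

beta = 0.5
kl = kl_divergence(f, g, 0.0, 1.0)
bound = l2_distance_sq(f, g, 0.0, 1.0) / beta
```

Here $d_{\mathrm{KL}}^{K}(f, g) \approx 0.045$, while the bound evaluates to $1/6$, so the inequality holds with room to spare.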
Let $\bar{\mathcal{F}}_g = \{ k^p g(kx - km) : m \in [\underline{m}, \overline{m}]^p\ \text{and}\ k \in \mathbb{N} \}$, and for any $g \in \bar{\mathcal{F}}$, define the $n$-component bounded finite mixtures of $g$ as the class
$$\mathcal{F}_{g,n} = \left\{ f : f(x) = \sum_{i=1}^n \pi_i k^p g(kx - km_i),\ m_i \in [\underline{m}, \overline{m}]^p,\ k \in \mathbb{N},\ \pi_i \ge 0,\ \text{and}\ \sum_{i=1}^n \pi_i = 1 \right\},$$
where $i \in [n]$ and $-\infty < \underline{m} < \overline{m} < \infty$.

For an arbitrary family of functions $\mathcal{F}$, define the $n$-point convex hull of $\mathcal{F}$ to be
$$\mathrm{Conv}_n(\mathcal{F}) = \left\{ \sum_{i=1}^n \pi_i f_i : f_i \in \mathcal{F},\ \pi_i \ge 0,\ \text{and}\ \sum_{i=1}^n \pi_i = 1 \right\},$$
and refer simply to $\mathrm{Conv}_\infty(\mathcal{F}) = \mathrm{Conv}(\mathcal{F})$ as the convex hull. Observe that $\mathcal{F}_{g,n} = \mathrm{Conv}_n(\bar{\mathcal{F}}_g)$. By Corollary 1 of [12], we have the fact that
$$\overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g) = \left\{ f : f(x) = \int_{\mathbb{R}^p} k^p g(kx - km)\, d\Pi_m,\ \text{and}\ \Pi_m \in \mathcal{M}_m \right\}$$
is the closure of $\mathrm{Conv}(\bar{\mathcal{F}}_g)$, where $\mathcal{M}_m$ is the set of all probability measures over $m$. Here, we generically denote the closure of $\mathrm{Conv}(\mathcal{F})$ by $\overline{\mathrm{Conv}}(\mathcal{F})$. The following result from [20] relates the closure of convex hulls to the $L_2$-norm.

Lemma 14 (Barron, 1993, Lemma 1). If $\bar{f}$ is in $\overline{\mathrm{Conv}}(\mathcal{F})$, where $\mathcal{F}$ is a Hilbert space of functions over support $\mathbb{X}$ such that $\|f\|_{\mathbb{X},2} \le B$ for each $f \in \mathcal{F}$, then for every $n \in \mathbb{N}$ and every $C > B^2 - \|\bar{f}\|_{\mathbb{X},2}^{2}$, there exists an $f_n \in \mathrm{Conv}_n(\mathcal{F})$ such that $\|\bar{f} - f_n\|_{\mathbb{X},2}^{2} \le C/n$.

Thus, from Lemma 14, we know that if $\bar{f} \in \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)$, then there exists an $n$-component finite mixture of the density $g$, $f_n \in \mathcal{F}_{g,n}$, such that $\|\bar{f} - f_n\|_{K,2}^{2} \le C/n$, where $K$ is the compact support of both densities and $C > 0$ is a constant that depends on the class $\bar{\mathcal{F}}_g$, which we know to be bounded on $K$. From Corollary 6, we know that if $f \in \mathcal{F}_{K,\beta} \cap L_2(K)$, then for every $\epsilon > 0$, there exists an $\bar{f} \in \mathcal{F}_g$ such that $\|\bar{f} - f\|_{K,2} < \epsilon$. Since $\mathcal{F}_g \subset \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)$, we can set $\bar{f} = f^*$.
An application of the triangle inequality yields the following result from [12].

Theorem 15 (Zeevi and Meir, 1997, Eqn. 27). If $f \in \mathcal{F}_{K,\beta} \cap L_2(K)$, then for any $\epsilon > 0$ and $g \in \bar{\mathcal{F}}$, there exists an $f_n \in \mathcal{F}_{g,n}$ such that $d_{\mathrm{KL}}^{K}(f, f_n) \le \epsilon/\beta + C/(n\beta)$, for some $C > 0$ and $n \in \mathbb{N}$.

Proof: By the triangle inequality, $\|f_n - f\|_{K,2} \le \|f_n - \bar{f}\|_{K,2} + \|\bar{f} - f\|_{K,2} \le \sqrt{C/n} + \epsilon$. Upon squaring, the $\epsilon^2$ and cross terms may be absorbed into $\epsilon$, since $\epsilon > 0$ is arbitrary. We then apply Lemma 13 to obtain the desired result.
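The $C/n$ behaviour promised by Lemma 14 can be illustrated by Monte Carlo (the Python sketch below is our own and is not a construction from [20] or [12]). We take $\bar{f}$ to be a triangular density convolved with a normal dilation at a fixed scale, so that $\Pi_m$ has density $f$, and compare $\bar{f}$ against $n$-point equal-weight mixtures whose locations are drawn from $f$; the averaged squared $L_2$ error should then shrink roughly in proportion to $1/n$:

```python
import math
import random

def triangle_pdf(x):
    return max(0.0, 1.0 - abs(x))

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

K_SCALE = 4.0                                   # fixed scale parameter k
GRID = [-2.5 + 0.02 * i for i in range(251)]    # grid covering the effective support

def component(x, m):
    """One element of the component class: k * phi(k x - k m)."""
    return K_SCALE * phi(K_SCALE * (x - m))

def fbar(x, n_quad=400):
    """fbar(x) = integral of component(x, m) f(m) dm, by midpoint quadrature over m."""
    h = 2.0 / n_quad
    return sum(
        component(x, -1.0 + (j + 0.5) * h) * triangle_pdf(-1.0 + (j + 0.5) * h)
        for j in range(n_quad)
    ) * h

FBAR = [fbar(x) for x in GRID]

def sq_l2_error(n, rng):
    """||fbar - f_n||^2 for an n-point equal-weight mixture with locations drawn from f."""
    locs = [rng.uniform(0.0, 1.0) + rng.uniform(0.0, 1.0) - 1.0 for _ in range(n)]
    err = 0.0
    for x, fb in zip(GRID, FBAR):
        fn = sum(component(x, m) for m in locs) / n
        err += (fn - fb) ** 2 * 0.02
    return err

rng = random.Random(0)
avg_10 = sum(sq_l2_error(10, rng) for _ in range(20)) / 20.0
avg_160 = sum(sq_l2_error(160, rng) for _ in range(20)) / 20.0
```

With the seed above, growing $n$ from 10 to 160 shrinks the averaged squared error by roughly an order of magnitude, consistent with the $C/n$ bound.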
Remark 16. The application of Corollary 6 requires the convolution of a compactly supported function with a function over $\mathbb{R}^p$. In general, the convolution of two functions on different supports produces a function whose support is itself a function of the original supports. That is, if $f$ is supported on $\mathrm{supp}(f)$ and $g$ is supported on $\mathrm{supp}(g)$, then the support of $f * g$ is a subset of the closure of the set $\{x + y : x \in \mathrm{supp}(f),\ y \in \mathrm{supp}(g)\}$. In order to mitigate any problems relating to the algebra of supports, we can allow any compactly supported PDF $f$ to take values outside of its support $K$ by simply setting $f(x) = 0$ if $x \notin K$, and thus implicitly work only with functions over $\mathbb{R}^p$.

Remark 17. We note that [12] utilized a slightly different version of Corollary 6, which makes use of the alternative approximate identity $\alpha_k(x) = k^{-p} g(x/k)$ with $k^* = 0$. Here, $g$ is taken to be a product kernel of radial basis functions.

An approach for quantifying the error of the quasi-maximum likelihood estimator (quasi-MLE) for finite mixture models, with respect to the Hellinger divergence, is then developed in [12] via the theory of [21]. We will instead pursue the bounding of KL errors for the MLE via the directions of [13] and [14].

IV. MAXIMUM LIKELIHOOD ESTIMATION BOUNDS VIA RESULTS FROM LI AND BARRON (1999)

As alternatives to Lemma 14 and Theorem 15, we can interpret the following results from [13] for finite mixtures of location-scale PDFs over compact supports $K$.
Theorem 18 (Li and Barron, 1999, Thm. 1). If $g \in \bar{\mathcal{F}}$ and $\bar{f} \in \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)$, then there exists an $f_n \in \mathcal{F}_{g,n}$ such that $d_{\mathrm{KL}}^{K}(\bar{f}, f_n) \le C\gamma/n$, where
$$C = \int_K \frac{\int \left[k^p g(kx - km)\right]^2 d\Pi_m}{\int k^p g(kx - km)\, d\Pi_m}\, dx,$$
with DF $\Pi_m$ over $\mathbb{R}^p$ corresponding to $\bar{f}$, and $\gamma = 4\left[\log(3\sqrt{e}) + A\right]$, where
$$A = \sup_{m_1, m_2, x} \log \frac{k^p g(kx - km_1)}{k^p g(kx - km_2)}.$$

Remark 19. Although it is not explicitly mentioned in [13], a condition for the application of Theorem 18 is that $g$ must be such that $A < \infty$ over $K$. This was alluded to in [14]. This assumption is implicitly made in the sequel.

Theorem 20 (Li and Barron, 1999, Thm. 2). For every $\bar{f} \in \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)$ (with corresponding DF $\Pi_m$), if $f \in \mathcal{F}_{K,\beta}$ and $g \in \bar{\mathcal{F}}$, then there exists an $f_n \in \mathcal{F}_{g,n}$ such that $d_{\mathrm{KL}}^{K}(f, f_n) \le d_{\mathrm{KL}}^{K}(f, \bar{f}) + C\gamma/n$, where $\gamma$ is as defined in Theorem 18, and
$$C = \int_K \frac{\int \left[k^p g(kx - km)\right]^2 d\Pi_m}{\left(\int k^p g(kx - km)\, d\Pi_m\right)^2}\, f(x)\, dx.$$

By Corollary 6 and Lemma 13, for every $\epsilon > 0$, there exists an $\bar{f} \in \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)$ such that $d_{\mathrm{KL}}^{K}(f, \bar{f}) < \epsilon/\beta$. We thus have the following outcome.

Corollary 21. If $f \in \mathcal{F}_{K,\beta}$ and $g \in \bar{\mathcal{F}}$, then for any $\epsilon > 0$, there exists an $f_n \in \mathcal{F}_{g,n}$ such that $d_{\mathrm{KL}}^{K}(f, f_n) \le \epsilon/\beta + C\gamma/n$, where $\gamma$ and $C$ are as defined in Theorems 18 and 20.

Corollary 21 implies that we can approximate a compactly supported PDF to arbitrary degrees of accuracy, using finite mixtures of location-scale PDFs with an increasingly large number of components $n$. Thus far, the results have focused on functional approximation. We now present a KL error bounding result for the MLE.

Let $X_1, \dots, X_N$ be an independent and identically distributed (IID) random sample generated from a distribution with density $f \in \mathcal{F}_{K,\beta}$.
Define the log-likelihood function of an $n$-component mixture of location-scale PDFs $g \in \bar{\mathcal{F}}$ as
$$\ell_{g,n,N}(\theta) = \sum_{j=1}^N \log\left[\sum_{i=1}^n \pi_i k^p g(kX_j - km_i)\right],$$
where $\theta$ contains $\pi_i$, $k$, and $m_i$, for $i \in [n]$. The MLE can then be defined as
$$\hat{f}_{g,n,N}(x) = \sum_{i=1}^n \hat{\pi}_i k^p g(kx - k\hat{m}_i),$$
where
$$\hat{\theta}_{n,N} \in \left\{ \hat{\theta} : \ell_{g,n,N}(\hat{\theta}) = \sup_{\theta} \ell_{g,n,N}(\theta),\ \text{satisfying the restrictions of}\ \mathcal{F}_{g,n} \right\},$$
and the corresponding estimators of $\pi_i$ and $m_i$ (i.e., $\hat{\pi}_i$ and $\hat{m}_i$) are the elements of $\hat{\theta}_{n,N}$. For $B > 0$, if $K$ is a compact set and the Lipschitz condition
$$\sup_{x \in K} \left| \log\left[k^p g(kx - km_1)\right] - \log\left[k^p g(kx - km_2)\right] \right| \le B \|m_1 - m_2\| \qquad (2)$$
holds, then the bound of Theorem 22, below, on the expected KL divergence for $\hat{f}_{g,n,N}$, can be adapted from [13].
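The estimation problem above can be made concrete with a small simulation. The Python sketch below (our illustration, with $p = 1$, $g = \phi$, and the scale fixed at $1/k$) evaluates $\ell_{g,n,N}$ and searches for the MLE with a few EM updates; EM is our choice of optimizer here and is not prescribed by the text. Since an EM step can never decrease the likelihood, the log-likelihood after the updates should be no smaller than at the start:

```python
import math
import random

def gauss(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(data, weights, locations, k):
    """l_{g,n,N}(theta) for g the standard normal, with the scale fixed at 1/k."""
    sigma = 1.0 / k
    return sum(
        math.log(sum(w * gauss(x, m, sigma) for w, m in zip(weights, locations)))
        for x in data
    )

def em_step(data, weights, locations, k):
    """One EM update of the weights and locations, holding k fixed."""
    sigma = 1.0 / k
    resp = []
    for x in data:
        parts = [w * gauss(x, m, sigma) for w, m in zip(weights, locations)]
        s = sum(parts)
        resp.append([p / s for p in parts])
    new_w, new_m = [], []
    for i in range(len(weights)):
        ri = sum(r[i] for r in resp)
        new_w.append(ri / len(data))
        new_m.append(sum(r[i] * x for r, x in zip(resp, data)) / ri)
    return new_w, new_m

# Simulated sample from a two-component location mixture with sigma = 1/k = 0.5.
random.seed(1)
data = [random.gauss(-1.0, 0.5) for _ in range(100)] + \
       [random.gauss(1.5, 0.5) for _ in range(100)]
k = 2.0
weights, locations = [0.5, 0.5], [-0.5, 0.5]
ll_start = log_likelihood(data, weights, locations, k)
for _ in range(20):
    weights, locations = em_step(data, weights, locations, k)
ll_end = log_likelihood(data, weights, locations, k)
```

After the updates, the fitted locations sit near the generating values, and the log-likelihood has increased.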
Theorem 22 (Li and Barron, 1999, Thm. 3). Let $g \in \bar{\mathcal{F}}$ and suppose that $X_1, \dots, X_N$ is an IID random sample from a distribution with density $f \in \mathcal{F}_{K,\beta}$. For every $\epsilon > 0$, if (2) is satisfied and $A = \overline{m} - \underline{m}$, then, under the restrictions of $\bar{\mathcal{F}}_g$, there exists a finite $C^* > 0$ such that
$$\mathbb{E}_f\left[d_{\mathrm{KL}}^{K}\left(f, \hat{f}_{g,n,N}\right)\right] \le \frac{\epsilon}{\beta} + \frac{\gamma C^*}{n} + \frac{\gamma n p}{N} \log(NABe),$$
where $\gamma$ is as defined in Theorem 18.

Proof: The original theorem provides the inequality
$$\mathbb{E}_f\left[d_{\mathrm{KL}}^{K}\left(f, \hat{f}_{g,n,N}\right)\right] \le d_{\mathrm{KL}}^{K}\left(f, \bar{f}^*\right) + \frac{\gamma C^*}{n} + \frac{\gamma n p}{N} \log(NABe),$$
where $\bar{f}^*$ is the argument that achieves $\inf_{\bar{f} \in \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)} d_{\mathrm{KL}}^{K}(f, \bar{f})$. By definition, $d_{\mathrm{KL}}^{K}(f, \bar{f}^*) \le d_{\mathrm{KL}}^{K}(f, \bar{f})$, and there exists an $\bar{f} \in \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)$ such that $d_{\mathrm{KL}}^{K}(f, \bar{f}) < \epsilon/\beta$. Thus, select any $\bar{f}$ satisfying $d_{\mathrm{KL}}^{K}(f, \bar{f}) < \epsilon/\beta$ and we have the desired result.

Remark 23. Since $\epsilon$ can be made as small as we would like, the expected KL divergence between $f$ and the MLE $\hat{f}_{g,n,N}$ can be made arbitrarily small by choosing an increasing sequence of $n$ that grows more slowly than $N/\log N$. For example, one can take $n = O(\log N)$. Via some calculus, we obtain the optimal convergence rate by setting $n = O(\sqrt{N/\log N})$.

V. CONCENTRATION INEQUALITIES VIA RESULTS FROM RAKHLIN ET AL. (2005)

We now proceed to utilize the theory of [14] to provide a concentration inequality for the MLE of finite mixtures of location-varying PDFs. Let $N(\Delta, \mathcal{F}, d)$ denote the $\Delta$-covering number of the class $\mathcal{F}$, with respect to the distance $d$. That is, $N(\Delta, \mathcal{F}, d)$ is the minimum number of $\Delta$-balls needed to cover $\mathcal{F}$, where a $\Delta$-ball around $f$ (with centre not necessarily in $\mathcal{F}$) is defined as $\{g : d(f, g) < \Delta\}$; see, for example, [22, Sec. 2.2.2]. Further, define $d_n$ as the empirical distance.
That is, for functions $f$ and $g$, and realizations $x_1, \dots, x_N$ of the random variables $X_1, \dots, X_N$, we have $d_n^2(f, g) = N^{-1} \sum_{i=1}^N \left[f(x_i) - g(x_i)\right]^2$. The following theorem can be adapted from [14, Thm. 2.1].

Theorem 24 (Rakhlin et al., 2005, Thm. 2.1). Let $g \in \bar{\mathcal{F}}$ and suppose that $X_1, \dots, X_N$ is an IID random sample from a distribution with PDF $f \in \mathcal{F}_{K,\beta}$ such that $f(x) < \bar{\beta}$ for all $x \in K$. If $\hat{f}_{g,n,N}$ is the MLE for an $n$-component finite mixture of $g$ (under the restrictions of $\bar{\mathcal{F}}_g$), then for any $\epsilon > 0$,
$$\mathbb{E}_f\left[d_{\mathrm{KL}}^{K}\left(f, \hat{f}_{g,n,N}\right)\right] \le \frac{\epsilon}{\beta} + \frac{8\bar{\beta}}{n\beta}\left(\frac{\bar{\beta}}{\beta}\right) + \frac{1}{\sqrt{N}}\left[\frac{\bar{\beta}C}{\beta}\, \mathbb{E}_f\left(\int_0^{\bar{\beta}} \log^{1/2} N\left(\Delta, \bar{\mathcal{F}}_g, d_n\right) d\Delta\right) + \frac{8\bar{\beta}}{\beta}\right] + \sqrt{\frac{t}{N}}\left(\frac{4\sqrt{2}\,\bar{\beta}}{\beta}\right),$$
for some universal constant $C$, with probability at least $1 - \exp(-t)$.
Proof:
The original statement of [14, Thm. 2.1] has $d_{\mathrm{KL}}^{K}(f, \bar{f}^*)$ in place of $\epsilon/\beta$. Thus, we obtain the desired result via the same technique as that used in Theorem 22.

Remark 25. To make it directly comparable to Theorem 22, one can integrate out the probability statement of Theorem 24 to obtain the inequality in expectation
$$\mathbb{E}_f\left[d_{\mathrm{KL}}^{K}\left(f, \hat{f}_{g,n,N}\right)\right] \le \frac{\epsilon}{\beta} + \frac{8\bar{\beta}}{n\beta}\left(\frac{\bar{\beta}}{\beta}\right) + \frac{1}{\sqrt{N}}\left[\frac{\bar{\beta}C}{\beta}\, \mathbb{E}_f\left(\int_0^{\bar{\beta}} \log^{1/2} N\left(\Delta, \bar{\mathcal{F}}_g, d_n\right) d\Delta\right)\right] + \frac{1}{\sqrt{N}}\left(\frac{8\bar{\beta}}{\beta} + \frac{4\sqrt{2}\,\bar{\beta}}{\beta}\right).$$
See the proof of [14, Thm. 2.1] for details. The following corollary specializes the result of Theorem 24 to conform with the conclusion of Theorem 22.

Corollary 26 (Rakhlin et al., 2005, Cor. 2.2). Let $g \in \bar{\mathcal{F}}$ and suppose that $X_1, \dots, X_N$ is an IID random sample from a distribution with density $f \in \mathcal{F}_{K,\beta}$ such that $f(x) < \bar{\beta}$ for all $x \in K$. For every $\epsilon > 0$, if (2) is satisfied and $A = \overline{m} - \underline{m}$, then, under the restrictions of $\bar{\mathcal{F}}_g$,
$$\mathbb{E}_f\left[d_{\mathrm{KL}}^{K}\left(f, \hat{f}_{g,n,N}\right)\right] \le \frac{\epsilon}{\beta} + \frac{C_1}{n} + \frac{C_2}{\sqrt{N}},$$
where $C_1$ and $C_2$ are constants that depend on $\beta$, $\bar{\beta}$, $A$, $B$, $C$, and $p$. Here, $C$ is the same universal constant as in Theorem 24.

Remark 27. Corollary 26 directly improves upon the result of Theorem 22 by allowing $n$ and $N$ to increase independently of one another, while still achieving an arbitrarily small bound on the expected KL divergence of the MLE for finite mixtures of location-scale PDFs, under the same hypotheses. The corollary implies that the optimal choice for the number of components is $n = O(\sqrt{N})$.

REFERENCES

[1] R. F. Hoskins, Delta Functions: Introduction to Generalized Functions. Oxford: Woodhead, 2009.
[2] B. G. Lindsay, "Mixture models: theory, geometry and applications," in NSF-CBMS Regional Conference Series in Probability and Statistics, 1995.
[3] G. J. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.
[4] D. M. Titterington, A. F. M. Smith, and U. E. Makov, Statistical Analysis of Finite Mixture Distributions. New York: Wiley, 1985.
[5] P. E. Rossi, Bayesian Non- and Semiparametric Methods and Applications. Princeton: Princeton University Press, 2014.
[6] J. L. Walker and M. Ben-Akiva, "Advances in discrete choice: mixture models," in A Handbook of Transport Economics. Edward Elgar, 2011, pp. 160-187.
[7] G. Yona, Introduction to Computational Proteomics. Boca Raton: CRC Press, 2011.
[8] J. T.-H. Lo, "Finite-dimensional sensor orbits and optimal nonlinear filtering," IEEE Transactions on Information Theory, vol. IT-18, pp. 583-588, 1972.
[9] T. S. Ferguson, "Bayesian density estimation by mixtures of normal distributions," in Recent Advances in Statistics: Papers in Honour of Herman Chernoff on His Sixtieth Birthday. New York: Academic Press, 1983, pp. 287-302.
[10] A. DasGupta, Asymptotic Theory of Statistics and Probability. New York: Springer, 2008.
[11] W. Cheney and W. Light, A Course in Approximation Theory. Pacific Grove: Brooks/Cole, 2000.
[12] A. J. Zeevi and R. Meir, "Density estimation through convex combinations of densities: approximation and estimation bounds," Neural Computation, vol. 10, pp. 99-109, 1997.
[13] J. Q. Li and A. R. Barron, "Mixture density estimation," in Advances in Neural Information Processing Systems, S. A. Solla, T. K. Leen, and K. R. Mueller, Eds., vol. 12. Cambridge: MIT Press, 1999.
[14] A. Rakhlin, D. Panchenko, and S. Mukherjee, "Risk bounds for mixture density estimation," ESAIM: Probability and Statistics, vol. 9, pp. 220-229, 2005.
[15] S. Kullback and R. A. Leibler, "On information and sufficiency," Annals of Mathematical Statistics, vol. 22, pp. 79-86, 1951.
[16] B. Makarov and A. Podkorytov, Real Analysis: Measures, Integrals and Applications. New York: Springer, 2013.
[17] W. A. Light, "Techniques for generating approximations via convolution kernels," Numerical Algorithms, vol. 5, pp. 247-261, 1993.
[18] R. Willink, "Normal moments and Hermite polynomials," Statistics and Probability Letters, vol. 73, pp. 271-275, 2005.
[19] M. D. Buhmann, Radial Basis Functions: Theory and Implementations. Cambridge: Cambridge University Press, 2003.
[20] A. R. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Transactions on Information Theory, vol. IT-39, pp. 930-945, 1993.
[21] H. White, "Maximum likelihood estimation of misspecified models," Econometrica, vol. 50, pp. 1-25, 1982.
[22] M. R. Kosorok, Introduction to Empirical Processes and Semiparametric Inference. New York: Springer, 2008.