On approximations via convolution-defined mixture models
Hien D. Nguyen and Geoffrey J. McLachlan

March 1, 2018 (draft)
Abstract
An often-cited fact regarding mixing or mixture distributions is that their density functions are able to approximate the density function of any unknown distribution to arbitrary degrees of accuracy, provided that the mixing or mixture distribution is sufficiently complex. This fact is often not made concrete. We investigate and review theorems that provide approximation bounds for mixing distributions. Connections between the approximation bounds of mixing distributions and estimation bounds for the maximum likelihood estimator of finite mixtures of location-scale distributions are reviewed.
Index Terms
Mixing distributions; finite mixture models; convolutions; Kullback-Leibler divergence; maximum likelihood estimators.
I. INTRODUCTION
Mixing distributions and finite mixture models are important classes of probability models that have found use in many areas of application, such as artificial intelligence, machine learning, pattern recognition, statistics, and beyond. Mixing distributions provide probability models with probability density functions (PDFs) of the form
$$f(x) = \int_{\Theta} f(x; \theta)\, d\Pi(\theta),$$
where $f(\cdot\,; \theta)$ is a PDF (with respect to a random variable $X \in \mathbb{X}$) that depends on some parameter $\theta \in \Theta \subseteq \mathbb{R}^{d}$ with distribution function (DF) $\Pi$. Here, $\mathbb{X} \subseteq \mathbb{R}^{p}$ is the support of the PDFs $f$ and $f(\cdot\,; \theta)$, where $\mathbb{X}$ is not functionally dependent on $\theta$. Notice that the mixing distributions contain the finite mixture models, obtained by setting the DF to $\Pi(\theta) = \sum_{i=1}^{n} \pi_{i} \delta(\theta - \theta_{i})$ for $n \in \mathbb{N}$, where $\delta$ is the Dirac delta function (cf. [1]), $\pi_{i} \ge 0$, $\sum_{i=1}^{n} \pi_{i} = 1$, and $\theta_{i} \in \Theta$ (for each $i \in [n] = \{1, \dots, n\}$), and where $\mathbb{N}$ denotes the natural numbers (zero exclusive); see [2, Sec. 1.2.2] and [3, Sec. 1.12] for descriptions and references regarding mixing distributions.

*Hien Nguyen is with the Department of Mathematics and Statistics, La Trobe University, Bundoora, Melbourne, Australia 3086. Geoffrey McLachlan is with the School of Mathematics and Physics, The University of Queensland, St. Lucia, Brisbane, Australia 4075. Corresponding author: Hien Nguyen (Email: [email protected]).

The appeal of finite mixture models largely comes from their flexibility of representation. The folk theorem regarding mixture models generally states that a mixture model can approximate any distribution to any required level of accuracy, provided that the number of mixture components is sufficiently large. Example statements of the folk theorem include: "provided the number of component densities is not bounded above, certain forms of mixture can be used to provide arbitrarily close approximations to a given probability distribution" [4, p. 50], "any continuous distribution can be approximated arbitrarily well by a finite mixture of normal densities with common variance (or covariance matrix in the multivariate case)" [3, p. 176], "there is an obvious sense in which the mixture of normals approach, given enough components, can approximate any multivariate density" [5, p. 5], "the [mixture] model forms can fit any distribution and significantly increase model fit" [6, p. 173], and "a mixture model can approximate almost any distribution" [7, p. 500]. From these examples, we see that statements regarding the flexibility of mixture models are generally left technically vague and unclear.

Let $d_{\mathrm{TV}}^{\mathbb{X}}(f, g) = \frac{1}{2}\|f - g\|_{\mathbb{X},1}$ be the total-variation distance, where $\|f\|_{\mathbb{X},q} = \left[\int_{\mathbb{X}} |f(x)|^{q}\, dx\right]^{1/q}$, for $q \in [1, \infty)$, is the $L_q$-norm over the support $\mathbb{X}$. Here, $f(x)$ and $g(x)$ are functions over the support $\mathbb{X} \subseteq \mathbb{R}^{p}$. We also let $\|f\|_{\mathbb{X},\infty} = \sup_{x \in \mathbb{X}} |f(x)|$. Further, define the location-scale family of PDFs over $\mathbb{R}$ as
$$\mathcal{F} = \left\{ f : \int_{\mathbb{R}} \sigma^{-1} f\left(\frac{x - \mu}{\sigma}\right) dx = 1,\ \text{for all}\ \mu \in \mathbb{R}\ \text{and}\ \sigma \in (0, \infty) \right\}.$$
Define $x_i$ to be the $i$th element of $x$, for $i \in [p]$. Historical technical justifications for the folk theorem include [8] and [9]. Within the more recent literature, it is difficult to obtain clear technical statements of such results. Among the only references that we could find is the exposition of [10, Sec. 33.1], and in particular, the following theorem statement.

Theorem 1 (DasGupta 2008, Thm. 33.1). Let $f$ be a probability density function over $\mathbb{R}^{p}$, for $p \in \mathbb{N}$.
If $\mathcal{F}_g$ is the class of mixtures of $g \in \mathcal{F}$:
$$\mathcal{F}_g = \left\{ f^* : f^*(x) = \int_0^\infty \int_{\mathbb{R}^p} \sigma^{-p} \prod_{i=1}^p g\left(\frac{x_i - \mu_i}{\sigma}\right) d\Pi_\mu(\mu)\, d\Pi_\sigma(\sigma) \right\},$$
then, given any $\epsilon > 0$, there exists an $f^* \in \mathcal{F}_g$ such that $d_{\mathrm{TV}}^{\mathbb{R}^p}(f, f^*) < \epsilon$, where $\Pi_\mu(\mu)$ and $\Pi_\sigma(\sigma)$ are DFs over $\mathbb{R}^p$ and $(0, \infty)$, respectively.

Upon inspection, Theorem 1 states that the class of marginally-independent location-scale mixing distributions has PDFs that can approximate any other PDF arbitrarily well, with respect to the total-variation distance. Unfortunately, the proof of Theorem 1 is not provided in [10]. Remarks regarding the theorem relegate the proof to an unspecified location in [11], which makes it difficult to investigate the structure and nature of the theorem. The lack of transparency in the cited text has led us to investigate the nature of the presented theorem. The outcome of our investigation is the collection of the various proofs and technical results that are reviewed in this article. The contents of our review are as follows.

Firstly, we investigate the proofs from [11] and consider alternative versions of Theorem 1 that provide more insight into the structure of the result. For example, we present an alternative to Theorem 1 in which only a mixture over the location parameter is required. That is, no integration over the scale parameter element is needed, as in $\mathcal{F}_g$. Furthermore, we state a uniform approximation alternative to Theorem 1 that is applicable to the approximation of target PDFs over compact sets. Rates of convergence are also obtainable if we make a Lipschitz assumption on the target PDF.
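To make the statement of Theorem 1 concrete, the following Python sketch (our illustration; the triangular target, the grid of component locations, and the fixed scale are arbitrary choices, not constructions taken from [10]) builds a mixture with a discrete $\Pi_\mu$ and a degenerate $\Pi_\sigma$, and estimates the total-variation distance to the target by a Riemann sum:

```python
import math

def triangle_pdf(x):
    """Target density: f(x) = 1 - |x| on [-1, 1], zero elsewhere."""
    return max(0.0, 1.0 - abs(x))

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, locations, weights, sigma):
    """A location mixture of normals with a fixed scale: discrete Pi_mu, degenerate Pi_sigma."""
    return sum(w * normal_pdf(x, mu, sigma) for w, mu in zip(weights, locations))

def tv_distance(f, g, lo=-4.0, hi=4.0, step=0.002):
    """d_TV(f, g) = (1/2) * L1 distance, estimated by a Riemann sum."""
    n = int((hi - lo) / step)
    return 0.5 * sum(abs(f(lo + i * step) - g(lo + i * step)) * step for i in range(n))

def approximate(n, sigma):
    """Place n components on a grid over [-1, 1], weighted by the target density."""
    locations = [-1.0 + 2.0 * (i + 0.5) / n for i in range(n)]
    weights = [triangle_pdf(mu) for mu in locations]
    total = sum(weights)
    weights = [w / total for w in weights]
    return lambda x: mixture_pdf(x, locations, weights, sigma)

coarse = tv_distance(triangle_pdf, approximate(5, 0.4))   # few components, large scale
fine = tv_distance(triangle_pdf, approximate(40, 0.05))   # many components, small scale
```

Refining the mixing measure drives the total-variation error toward zero, which is the qualitative content of the theorem.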
In addition to the presentation of Theorem 1 and its variants, we also review the relationship between the mixing distribution results and the approximation bounds of [12]. Via the approximation and estimation bounding results for the maximum likelihood estimator (MLE) from [13] and [14], we further present results for bounding Kullback-Leibler errors [KL; [15]] of the MLE for finite mixtures of location-scale PDFs.

The article proceeds as follows. In Section II, we discuss Theorem 1 and its variants. The relationships between the mixing distribution approximation results and the results of [12], [13], and [14] are then presented in Sections III, IV, and V, respectively.

II. MIXING DISTRIBUTION APPROXIMATION THEOREMS
Let $L_q(\mathbb{X})$ be the space of functions mapping to $\mathbb{R}$, with support $\mathbb{X}$, that have the property $\|f\|_{\mathbb{X},q} < \infty$. Further, we define the convolution between $f \in L_q(\mathbb{X})$ and $g \in L_r(\mathbb{X})$ as
$$(f * g)(x) = \int_{\mathbb{X}} f(y)\, g(x - y)\, dy, \qquad (1)$$
where (1) exists and is measurable in specific cases, due to results such as the following from [16, Sec. 9.3].

Theorem 2 (Makarov and Podkorytov, 2013, Sec. 9.3.1-2). Let $f \in L_q(\mathbb{R}^p)$ and $g \in L_r(\mathbb{R}^p)$, for $q, r \in [1, \infty]$. We have the following results: (i) if $q = 1$, then $f * g$ exists and $\|f * g\|_{\mathbb{R}^p, r} \le \|f\|_{\mathbb{R}^p, 1} \|g\|_{\mathbb{R}^p, r}$; (ii) if $1/q + 1/r = 1$, then $f * g$ exists and $\|f * g\|_{\mathbb{R}^p, \infty} \le \|f\|_{\mathbb{R}^p, q} \|g\|_{\mathbb{R}^p, r}$.

Remark 3. When $q = r = 1$, not only do we have inequality (i) of Theorem 2, but also
$$\int_{\mathbb{R}^p} (f * g)(x)\, dx = \int_{\mathbb{R}^p} f(y) \int_{\mathbb{R}^p} g(x - y)\, dx\, dy = \int_{\mathbb{R}^p} f(x)\, dx \int_{\mathbb{R}^p} g(x)\, dx < \infty.$$
Moreover, this implies that if $f$ and $g$ are PDFs over $\mathbb{R}^p$, then $f * g$ is also a PDF over $\mathbb{R}^p$.

Let a function $\alpha_k \in L_1(\mathbb{R}^p)$, indexed by $k \in \mathbb{R}_+$, be called an approximate identity in $\mathbb{R}^p$ if there exists a $k^* \in [0, \infty]$ such that (i) $\alpha_k \ge 0$, (ii) $\int_{\mathbb{R}^p} \alpha_k(x)\, dx = 1$, and (iii) $\int_{\|x\| > \delta} \alpha_k(x)\, dx \to 0$ as $k \to k^*$, for every $\delta > 0$ [cf. [16, Sec. 7.6.1]]. Here, $\|x\|_q = (\sum_{i=1}^p |x_i|^q)^{1/q}$ is the $l_q$-vector norm. The following result of [11] provides a useful generative method for constructing approximate identities.

Lemma 4 (Cheney and Light, 2000, Ch. 20, Thm. 4). Let $\alpha \in L_1(\mathbb{R}^p)$ and let $k \in \mathbb{N}$. If $\alpha \in \bar{\mathcal{F}}$, where
$$\bar{\mathcal{F}} = \left\{ f \in L_1(\mathbb{R}^p) : f(x) = \prod_{i=1}^p g(x_i),\ \int_{\mathbb{R}} g(x)\, dx = 1,\ \text{and}\ g(x) \ge 0\ \text{for all}\ x \in \mathbb{R} \right\},$$
then the dilations $\alpha_k(x) = k^p \alpha(kx)$ form an approximate identity, with $k^* = \infty$.

We may call $\bar{\mathcal{F}}$ the class of marginally-independent scaled density functions. With the ability to construct approximate identities, Theorem 5, below, from [16], provides a powerful means of constructing approximations to any function over $\mathbb{R}^p$.
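Lemma 4 can be probed numerically. The Python sketch below (our illustration, with $p = 1$ and the standard normal as the generating density; the grid widths are arbitrary) checks that the dilation $\alpha_k(x) = k\,\alpha(kx)$ retains unit mass and concentrates its mass inside any fixed $\delta$-ball as $k$ grows, that is, properties (ii) and (iii) of an approximate identity:

```python
import math

def alpha(x):
    """Generating density: standard normal, a member of the product class with p = 1."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def alpha_k(x, k):
    """Dilation alpha_k(x) = k^p * alpha(k x), here with p = 1."""
    return k * alpha(k * x)

def riemann(fun, lo, hi, n=20000):
    """Midpoint Riemann sum of fun over [lo, hi]."""
    h = (hi - lo) / n
    return sum(fun(lo + (i + 0.5) * h) for i in range(n)) * h

mass_k2 = riemann(lambda x: alpha_k(x, 2.0), -10.0, 10.0)        # property (ii)
delta = 0.5
tail_k2 = 2.0 * riemann(lambda x: alpha_k(x, 2.0), delta, 10.0)   # mass outside the delta-ball
tail_k20 = 2.0 * riemann(lambda x: alpha_k(x, 20.0), delta, 10.0)
```

The tail mass at $k = 20$ is many orders of magnitude below that at $k = 2$, consistent with $k^* = \infty$.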
The corollary to Theorem 5, below, provides a statistical interpretation.
Theorem 5 (Makarov and Podkorytov, 2013, Sec. 9.3.3). Let $\alpha_k$ be an approximate identity in $\mathbb{R}^p$ for some $k^* \in [0, \infty]$. If $f \in L_q(\mathbb{R}^p)$ for $q \in [1, \infty)$, then $\|f * \alpha_k - f\|_{\mathbb{R}^p, q} \to 0$ as $k \to k^*$.

Corollary 6.
Let $f$ be a PDF in $L_q(\mathbb{R}^p)$, for $q \in [1, \infty)$. If $g \in \bar{\mathcal{F}}$, then for any $\epsilon > 0$, there exists a PDF $f^*$ in
$$\mathcal{F}_g = \left\{ f^* : f^*(x) = \int_{\mathbb{R}^p} k^p g(kx - km) f(m)\, dm,\ k \in \mathbb{N} \right\}$$
such that $\|f - f^*\|_{\mathbb{R}^p, q} < \epsilon$.

Proof: From Lemma 4 and Theorem 5, for any $g \in \bar{\mathcal{F}}$ and $f \in L_q(\mathbb{R}^p)$, we have $\|f * [k^p g(k \times \cdot)] - f\|_{\mathbb{R}^p, q} \to 0$, where the convolution is $f * [k^p g(k \times \cdot)] = \int_{\mathbb{R}^p} k^p g(kx - km) f(m)\, dm$. By the definition of convergence, for every $\epsilon > 0$ there exists some $K$ such that for all $k > K$, $\|f * [k^p g(k \times \cdot)] - f\|_{\mathbb{R}^p, q} < \epsilon$. Putting the convolutions $f * [k^p g(k \times \cdot)]$, for all $k \in \mathbb{N}$, into $\mathcal{F}_g$ provides the desired convergence result. Lastly, $f^*$ is a PDF via Remark 3.

Remark 7. Corollary 6 improves upon Theorem 1 in several ways. Firstly, the total-variation bound is replaced by the stronger $L_q$-norm result. Secondly, mixing only occurs over the mean parameter element $m$, via the PDF $d\Pi_m(m)/dm = f(m)$, and not over the scaling parameter element $k$, which can be taken as a constant value. That is, we only require that the class $\mathcal{F}_g$ consist of mixing distributions over the location parameter element of $g$, where $\Pi_m$ is the DF determined by the density being approximated and the scale parameter element is fixed at some $k \in \mathbb{N}$. Lastly, we note that Theorem 1 can be obtained as the $q = 1$ case of Corollary 6 by setting $\sigma = 1/k$ and $\mu = m/k$.

Notice that Theorem 5 cannot be used to provide $L_\infty$-norm approximation results. Let $C(\mathbb{X})$ be the class of continuous functions over the set $\mathbb{X}$. If one assumes that the target PDF $f$ is bounded and belongs to $C(\mathbb{R}^p)$, then a uniform approximation alternative to Theorem 5 is possible for compact subsets of $\mathbb{R}^p$.

Theorem 8 (Cheney and Light, 2000, Ch. 20, Thm. 2). Let $\alpha_k$ be an approximate identity in $\mathbb{R}^p$ for some $k^* \in [0, \infty]$.
If $f$ is a bounded function in $C(\mathbb{R}^p)$, then $\|f * \alpha_k - f\|_{K, \infty} \to 0$ as $k \to k^*$, for all compact $K \subset \mathbb{R}^p$.

We note in passing that Theorem 8 can be used to prove density results for finite mixture models, such as that of [10, Thm. 33.2]. For further details, see [11, Thm. 5] and [17]. Let $\mathrm{Lip}_a(\mathbb{X})$ be the class of Lipschitz functions $f$ satisfying $|f(x) - f(y)| \le C \|x - y\|_\infty^a$, for some $a, C \in [0, \infty)$ and all $x, y \in \mathbb{X}$. If one assumes that the target PDF is in $\mathrm{Lip}_a(\mathbb{X})$ for some $a \in (0, 1]$, then the following approximation rate result is available.

Theorem 9 (Cheney and Light, 2000, Ch. 21, Thm. 1). Let $\alpha_k$ be an approximate identity in $\mathbb{R}^p$ for some $k^* \in [0, \infty]$, with the additional property that $\int_{\mathbb{R}^p} \|x\|^a \alpha(x)\, dx < \infty$ for some $a \in (0, 1]$. If $f \in \mathrm{Lip}_a(\mathbb{R}^p)$, then there exists a constant $A > 0$ such that $\|f * \alpha_k - f\|_{\mathbb{R}^p, \infty} \le A/k^a$ for $k \in \mathbb{N}$.

Example 10.
Let $\alpha \in \bar{\mathcal{F}}$ be generated by taking the marginal location-scale density $g = \phi$, where $\phi$ is the standard normal PDF. The condition $\int_{\mathbb{R}^p} \|x\|^a \alpha(x)\, dx < \infty$ is satisfied for $a = 1$, since the multivariate normal distribution has all of its polynomial moments; see, for example, [18].
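Continuing Example 10, both the convergence of Theorem 5 and the rate of Theorem 9 can be observed numerically. In the Python sketch below (our illustration, with $p = 1$), the target is a triangular density, which is Lipschitz with $a = 1$, and the approximate identity is the normal dilation $\alpha_k(y) = k\phi(ky)$; the sup-norm error should then scale like $A/k$, so quadrupling $k$ should divide it by roughly four. All grid and quadrature sizes are arbitrary choices:

```python
import math

def triangle_pdf(x):
    """Target density f(x) = 1 - |x| on [-1, 1]: Lipschitz with constant 1."""
    return max(0.0, 1.0 - abs(x))

def phi(x):
    """Standard normal PDF, the marginal density of Example 10."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def smoothed(x, k, n=400, span=6.0):
    """(f * alpha_k)(x) with alpha_k(y) = k phi(k y), by midpoint quadrature over y."""
    h = 2.0 * span / (k * n)
    lo = -span / k
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        total += k * phi(k * y) * triangle_pdf(x - y)
    return total * h

def errors(k, lo=-2.0, hi=2.0, m=400):
    """Return (L1 error, sup error) of f * alpha_k against f on a grid."""
    h = (hi - lo) / m
    l1, sup = 0.0, 0.0
    for i in range(m + 1):
        x = lo + i * h
        diff = abs(smoothed(x, k) - triangle_pdf(x))
        l1 += diff * h
        sup = max(sup, diff)
    return l1, sup

l1_4, sup_4 = errors(4.0)
l1_16, sup_16 = errors(16.0)
```

The sup-norm error tracks the $1/k$ rate of Theorem 9, while the $L_1$ error falls even faster here, since the target is piecewise linear.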
Corollary 11.
Let $f \in \mathrm{Lip}_1(\mathbb{R}^p)$ be a PDF. If
$$f^*(x) = \int_{\mathbb{R}^p} k^p \prod_{i=1}^p \phi(kx_i - km_i)\, f(m)\, dm,$$
then $\|f - f^*\|_{\mathbb{R}^p, \infty} \le A/k$ for $k \in \mathbb{N}$ and some constant $A > 0$.

Thus, the mixing distribution generated via marginally-independent normal PDFs converges uniformly, for target PDFs $f \in \mathrm{Lip}_1(\mathbb{R}^p)$, at a rate of $1/k$.

III. BOUNDING OF KULLBACK-LEIBLER DIVERGENCE VIA RESULTS FROM ZEEVI AND MEIR (1997)

Let $K \subset \mathbb{R}^p$ be a compact subset and let
$$\mathcal{F}_{K,\beta} = \left\{ f : \int_K f(x)\, dx = 1\ \text{and}\ f(x) \ge \beta > 0,\ \text{for all}\ x \in K \right\}$$
be the class of lower-bounded target PDFs over $K$. In [12], the approximation errors for finite mixtures of marginally-independent PDFs are studied in the context of approximating functions in $\mathcal{F}_{K,\beta}$.

Remark 12. The use of finite mixtures of marginally-independent PDFs is implicit in [12], as they report on approximation via product kernels of radial basis functions. Taking products of kernels is equivalent to taking products over marginally-independent densities to yield a joint density. Univariate radial basis functions that are positive and integrate to unity are symmetric PDFs in one dimension. Thus, products of univariate radial basis functions that generate PDFs correspond to a subclass of $\bar{\mathcal{F}}$; see [19] regarding radial basis functions.

Let the KL divergence between two PDFs $f, g \in L_1(\mathbb{X})$ be defined as
$$d_{\mathrm{KL}}^{\mathbb{X}}(f, g) = \int_{\mathbb{X}} f(x) \log\left[\frac{f(x)}{g(x)}\right] dx.$$
The KL divergence between $f$ and $g$ is difficult to work with, as it is not a distance function: it is asymmetric and it does not obey the triangle inequality. As such, bounding the KL divergence by a distance function provides a useful means of manipulation and control. The following useful result is obtained in [12].

Lemma 13 (Zeevi and Meir, 1997, Lemma 3). If $f, g \in \mathcal{F}_{K,\beta}$, then $d_{\mathrm{KL}}^{K}(f, g) \le \beta^{-1} \|f - g\|_{K,2}^{2}$.
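As a numerical sanity check of Lemma 13 (the Python sketch, the particular pair of densities on $K = [0, 1]$, and the lower bound $\beta = 1/2$ are our own illustrative assumptions):

```python
import math

def kl_divergence(f, g, lo, hi, n=10000):
    """d_KL over K = [lo, hi], by a midpoint Riemann sum."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += f(x) * math.log(f(x) / g(x)) * h
    return total

def l2_distance_sq(f, g, lo, hi, n=10000):
    """Squared L2 distance over K = [lo, hi]."""
    h = (hi - lo) / n
    return sum((f(lo + (i + 0.5) * h) - g(lo + (i + 0.5) * h)) ** 2 for i in range(n)) * h

# Two densities on K = [0, 1], both bounded below by beta = 1/2.
def f(x):
    return 1.0        # uniform density on [0, 1]

def g(x):
    return 0.5 + x    # integrates to one on [0, 1], with minimum value 1/2

beta = 0.5
kl = kl_divergence(f, g, 0.0, 1.0)
bound = l2_distance_sq(f, g, 0.0, 1.0) / beta
```

Here $d_{\mathrm{KL}}^{K}(f, g) \approx 0.045$, while the bound evaluates to $1/6$, so the inequality holds with room to spare.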
Let $\bar{\mathcal{F}}_g = \{ k^p g(kx - km) : m \in [\underline{m}, \overline{m}]^p\ \text{and}\ k \in \mathbb{N} \}$, and for any $g \in \bar{\mathcal{F}}$, define the $n$-component bounded finite mixtures of $g$ as the class
$$\mathcal{F}_{g,n} = \left\{ f : f(x) = \sum_{i=1}^n \pi_i k^p g(kx - km_i),\ m_i \in [\underline{m}, \overline{m}]^p,\ k \in \mathbb{N},\ \pi_i \ge 0,\ \text{and}\ \sum_{i=1}^n \pi_i = 1 \right\},$$
where $i \in [n]$ and $-\infty < \underline{m} < \overline{m} < \infty$.

For an arbitrary family of functions $\mathcal{F}$, define the $n$-point convex hull of $\mathcal{F}$ to be
$$\mathrm{Conv}_n(\mathcal{F}) = \left\{ \sum_{i=1}^n \pi_i f_i : f_i \in \mathcal{F},\ \pi_i \ge 0,\ \text{and}\ \sum_{i=1}^n \pi_i = 1 \right\},$$
and refer simply to $\mathrm{Conv}_\infty(\mathcal{F}) = \mathrm{Conv}(\mathcal{F})$ as the convex hull. Observe that $\mathcal{F}_{g,n} = \mathrm{Conv}_n(\bar{\mathcal{F}}_g)$. By Corollary 1 of [12], we have the fact that
$$\overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g) = \left\{ f : f(x) = \int_{\mathbb{R}^p} k^p g(kx - km)\, d\Pi_m,\ \text{and}\ \Pi_m \in \mathcal{M}_m \right\}$$
is the closure of $\mathrm{Conv}(\bar{\mathcal{F}}_g)$, where $\mathcal{M}_m$ is the set of all probability measures over $m$. Here, we generically denote the closure of $\mathrm{Conv}(\mathcal{F})$ by $\overline{\mathrm{Conv}}(\mathcal{F})$. The following result from [20] relates the closure of convex hulls to the $L_2$-norm.

Lemma 14 (Barron, 1993, Lemma 1). If $\bar{f}$ is in $\overline{\mathrm{Conv}}(\mathcal{F})$, where $\mathcal{F}$ is a Hilbert space of functions over support $\mathbb{X}$ such that $\|f\|_{\mathbb{X},2} \le B$ for each $f \in \mathcal{F}$, then for every $n \in \mathbb{N}$ and every $C > B^2 - \|\bar{f}\|_{\mathbb{X},2}^{2}$, there exists an $f_n \in \mathrm{Conv}_n(\mathcal{F})$ such that $\|\bar{f} - f_n\|_{\mathbb{X},2}^{2} \le C/n$.

Thus, from Lemma 14, we know that if $\bar{f} \in \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)$, then there exists an $n$-component finite mixture of the density $g$, $f_n \in \mathcal{F}_{g,n}$, such that $\|\bar{f} - f_n\|_{K,2}^{2} \le C/n$, where $K$ is the compact support of both densities and $C > 0$ is a constant that depends on the class $\bar{\mathcal{F}}_g$, which we know to be bounded on $K$. From Corollary 6, we know that if $f \in \mathcal{F}_{K,\beta} \cap L_2(K)$, then for every $\epsilon > 0$, there exists an $\bar{f} \in \mathcal{F}_g$ such that $\|\bar{f} - f\|_{K,2} < \epsilon$. Since $\mathcal{F}_g \subset \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)$, we can set $\bar{f} = f^*$.
An application of the triangle inequality yields the following result from [12].

Theorem 15 (Zeevi and Meir, 1997, Eqn. 27). If $f \in \mathcal{F}_{K,\beta} \cap L_2(K)$, then for any $\epsilon > 0$ and $g \in \bar{\mathcal{F}}$, there exists an $f_n \in \mathcal{F}_{g,n}$ such that $d_{\mathrm{KL}}^{K}(f, f_n) \le \epsilon/\beta + C/(n\beta)$, for some $C > 0$ and $n \in \mathbb{N}$.

Proof: By the triangle inequality, $\|f_n - f\|_{K,2} \le \|f_n - \bar{f}\|_{K,2} + \|\bar{f} - f\|_{K,2} \le \sqrt{C/n} + \epsilon$. Upon squaring, the $\epsilon^2$ and cross terms may be absorbed into $\epsilon$, since $\epsilon > 0$ is arbitrary. We then apply Lemma 13 to obtain the desired result.
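The $C/n$ behaviour promised by Lemma 14 can be illustrated by Monte Carlo (the Python sketch below is our own and is not a construction from [20] or [12]). We take $\bar{f}$ to be a triangular density convolved with a normal dilation at a fixed scale, so that $\Pi_m$ has density $f$, and compare $\bar{f}$ against $n$-point equal-weight mixtures whose locations are drawn from $f$; the averaged squared $L_2$ error should then shrink roughly in proportion to $1/n$:

```python
import math
import random

def triangle_pdf(x):
    return max(0.0, 1.0 - abs(x))

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

K_SCALE = 4.0                                   # fixed scale parameter k
GRID = [-2.5 + 0.02 * i for i in range(251)]    # grid covering the effective support

def component(x, m):
    """One element of the component class: k * phi(k x - k m)."""
    return K_SCALE * phi(K_SCALE * (x - m))

def fbar(x, n_quad=400):
    """fbar(x) = integral of component(x, m) f(m) dm, by midpoint quadrature over m."""
    h = 2.0 / n_quad
    return sum(
        component(x, -1.0 + (j + 0.5) * h) * triangle_pdf(-1.0 + (j + 0.5) * h)
        for j in range(n_quad)
    ) * h

FBAR = [fbar(x) for x in GRID]

def sq_l2_error(n, rng):
    """||fbar - f_n||^2 for an n-point equal-weight mixture with locations drawn from f."""
    locs = [rng.uniform(0.0, 1.0) + rng.uniform(0.0, 1.0) - 1.0 for _ in range(n)]
    err = 0.0
    for x, fb in zip(GRID, FBAR):
        fn = sum(component(x, m) for m in locs) / n
        err += (fn - fb) ** 2 * 0.02
    return err

rng = random.Random(0)
avg_10 = sum(sq_l2_error(10, rng) for _ in range(20)) / 20.0
avg_160 = sum(sq_l2_error(160, rng) for _ in range(20)) / 20.0
```

With the seed above, growing $n$ from 10 to 160 shrinks the averaged squared error by roughly an order of magnitude, consistent with the $C/n$ bound.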
Remark 16. The application of Corollary 6 requires the convolution of a compactly supported function with a function over $\mathbb{R}^p$. In general, the convolution of two functions on different supports produces a function whose support is itself a function of the original supports. That is, if $f$ is supported on $\mathrm{supp}(f)$ and $g$ is supported on $\mathrm{supp}(g)$, then the support of $f * g$ is a subset of the closure of the set $\{x + y : x \in \mathrm{supp}(f),\ y \in \mathrm{supp}(g)\}$. In order to mitigate any problems relating to the algebra of supports, we can allow any compactly supported PDF $f$ to take values outside of its support $K$ by simply setting $f(x) = 0$ if $x \notin K$, and thus implicitly work only with functions over $\mathbb{R}^p$.

Remark 17. We note that [12] utilized a slightly different version of Corollary 6, which makes use of the alternative approximate identity $\alpha_k(x) = k^{-p} g(x/k)$ with $k^* = 0$. Here, $g$ is taken to be a product kernel of radial basis functions.

An approach for quantifying the error of the quasi-maximum likelihood estimator (quasi-MLE) for finite mixture models, with respect to the Hellinger divergence, is then developed in [12] via the theory of [21]. We will instead pursue the bounding of KL errors for the MLE via the directions of [13] and [14].

IV. MAXIMUM LIKELIHOOD ESTIMATION BOUNDS VIA RESULTS FROM LI AND BARRON (1999)

As alternatives to Lemma 14 and Theorem 15, we can interpret the following results from [13] for finite mixtures of location-scale PDFs over compact supports $K$.
Theorem 18 (Li and Barron, 1999, Thm. 1). If $g \in \bar{\mathcal{F}}$ and $\bar{f} \in \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)$, then there exists an $f_n \in \mathcal{F}_{g,n}$ such that $d_{\mathrm{KL}}^{K}(\bar{f}, f_n) \le C\gamma/n$, where
$$C = \int_K \frac{\int \left[k^p g(kx - km)\right]^2 d\Pi_m}{\int k^p g(kx - km)\, d\Pi_m}\, dx,$$
with DF $\Pi_m$ over $\mathbb{R}^p$ corresponding to $\bar{f}$, and $\gamma = 4\left[\log(3\sqrt{e}) + A\right]$, where
$$A = \sup_{m_1, m_2, x} \log \frac{k^p g(kx - km_1)}{k^p g(kx - km_2)}.$$

Remark 19. Although it is not explicitly mentioned in [13], a condition for the application of Theorem 18 is that $g$ must be such that $A < \infty$ over $K$. This was alluded to in [14]. This assumption is implicitly made in the sequel.

Theorem 20 (Li and Barron, 1999, Thm. 2). For every $\bar{f} \in \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)$ (with corresponding DF $\Pi_m$), if $f \in \mathcal{F}_{K,\beta}$ and $g \in \bar{\mathcal{F}}$, then there exists an $f_n \in \mathcal{F}_{g,n}$ such that $d_{\mathrm{KL}}^{K}(f, f_n) \le d_{\mathrm{KL}}^{K}(f, \bar{f}) + C\gamma/n$, where $\gamma$ is as defined in Theorem 18, and
$$C = \int_K \frac{\int \left[k^p g(kx - km)\right]^2 d\Pi_m}{\left(\int k^p g(kx - km)\, d\Pi_m\right)^2}\, f(x)\, dx.$$

By Corollary 6 and Lemma 13, for every $\epsilon > 0$, there exists an $\bar{f} \in \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)$ such that $d_{\mathrm{KL}}^{K}(f, \bar{f}) < \epsilon/\beta$. We thus have the following outcome.

Corollary 21. If $f \in \mathcal{F}_{K,\beta}$ and $g \in \bar{\mathcal{F}}$, then for any $\epsilon > 0$, there exists an $f_n \in \mathcal{F}_{g,n}$ such that $d_{\mathrm{KL}}^{K}(f, f_n) \le \epsilon/\beta + C\gamma/n$, where $\gamma$ and $C$ are as defined in Theorems 18 and 20.

Corollary 21 implies that we can approximate a compactly supported PDF to arbitrary degrees of accuracy, using finite mixtures of location-scale PDFs with an increasingly large number of components $n$. Thus far, the results have focused on functional approximation. We now present a KL error bounding result for the MLE.

Let $X_1, \dots, X_N$ be an independent and identically distributed (IID) random sample generated from a distribution with density $f \in \mathcal{F}_{K,\beta}$.
Define the log-likelihood function of an $n$-component mixture of location-scale PDFs $g \in \bar{\mathcal{F}}$ as
$$\ell_{g,n,N}(\theta) = \sum_{j=1}^N \log\left[\sum_{i=1}^n \pi_i k^p g(kX_j - km_i)\right],$$
where $\theta$ contains $\pi_i$, $k$, and $m_i$, for $i \in [n]$. The MLE can then be defined as
$$\hat{f}_{g,n,N}(x) = \sum_{i=1}^n \hat{\pi}_i k^p g(kx - k\hat{m}_i),$$
where
$$\hat{\theta}_{n,N} \in \left\{ \hat{\theta} : \ell_{g,n,N}(\hat{\theta}) = \sup_{\theta} \ell_{g,n,N}(\theta),\ \text{satisfying the restrictions of}\ \mathcal{F}_{g,n} \right\},$$
and the corresponding estimators of $\pi_i$ and $m_i$ (i.e., $\hat{\pi}_i$ and $\hat{m}_i$) are the elements of $\hat{\theta}_{n,N}$. For $B > 0$, if $K$ is a compact set and the Lipschitz condition
$$\sup_{x \in K} \left| \log\left[k^p g(kx - km_1)\right] - \log\left[k^p g(kx - km_2)\right] \right| \le B \|m_1 - m_2\| \qquad (2)$$
holds, then the bound of Theorem 22, below, on the expected KL divergence for $\hat{f}_{g,n,N}$, can be adapted from [13].
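The estimation problem above can be made concrete with a small simulation. The Python sketch below (our illustration, with $p = 1$, $g = \phi$, and the scale fixed at $1/k$) evaluates $\ell_{g,n,N}$ and searches for the MLE with a few EM updates; EM is our choice of optimizer here and is not prescribed by the text. Since an EM step can never decrease the likelihood, the log-likelihood after the updates should be no smaller than at the start:

```python
import math
import random

def gauss(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(data, weights, locations, k):
    """l_{g,n,N}(theta) for g the standard normal, with the scale fixed at 1/k."""
    sigma = 1.0 / k
    return sum(
        math.log(sum(w * gauss(x, m, sigma) for w, m in zip(weights, locations)))
        for x in data
    )

def em_step(data, weights, locations, k):
    """One EM update of the weights and locations, holding k fixed."""
    sigma = 1.0 / k
    resp = []
    for x in data:
        parts = [w * gauss(x, m, sigma) for w, m in zip(weights, locations)]
        s = sum(parts)
        resp.append([p / s for p in parts])
    new_w, new_m = [], []
    for i in range(len(weights)):
        ri = sum(r[i] for r in resp)
        new_w.append(ri / len(data))
        new_m.append(sum(r[i] * x for r, x in zip(resp, data)) / ri)
    return new_w, new_m

# Simulated sample from a two-component location mixture with sigma = 1/k = 0.5.
random.seed(1)
data = [random.gauss(-1.0, 0.5) for _ in range(100)] + \
       [random.gauss(1.5, 0.5) for _ in range(100)]
k = 2.0
weights, locations = [0.5, 0.5], [-0.5, 0.5]
ll_start = log_likelihood(data, weights, locations, k)
for _ in range(20):
    weights, locations = em_step(data, weights, locations, k)
ll_end = log_likelihood(data, weights, locations, k)
```

After the updates, the fitted locations sit near the generating values, and the log-likelihood has increased.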
Theorem 22 (Li and Barron, 1999, Thm. 3). Let $g \in \bar{\mathcal{F}}$ and suppose that $X_1, \dots, X_N$ is an IID random sample from a distribution with density $f \in \mathcal{F}_{K,\beta}$. For every $\epsilon > 0$, if (2) is satisfied and $A = \overline{m} - \underline{m}$, then, under the restrictions of $\bar{\mathcal{F}}_g$, there exists a finite $C^* > 0$ such that
$$\mathbb{E}_f\left[d_{\mathrm{KL}}^{K}\left(f, \hat{f}_{g,n,N}\right)\right] \le \frac{\epsilon}{\beta} + \frac{\gamma C^*}{n} + \frac{\gamma n p}{N} \log(NABe),$$
where $\gamma$ is as defined in Theorem 18.

Proof: The original theorem provides the inequality
$$\mathbb{E}_f\left[d_{\mathrm{KL}}^{K}\left(f, \hat{f}_{g,n,N}\right)\right] \le d_{\mathrm{KL}}^{K}\left(f, \bar{f}^*\right) + \frac{\gamma C^*}{n} + \frac{\gamma n p}{N} \log(NABe),$$
where $\bar{f}^*$ is the argument that achieves $\inf_{\bar{f} \in \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)} d_{\mathrm{KL}}^{K}(f, \bar{f})$. By definition, $d_{\mathrm{KL}}^{K}(f, \bar{f}^*) \le d_{\mathrm{KL}}^{K}(f, \bar{f})$, and there exists an $\bar{f} \in \overline{\mathrm{Conv}}(\bar{\mathcal{F}}_g)$ such that $d_{\mathrm{KL}}^{K}(f, \bar{f}) < \epsilon/\beta$. Thus, select any $\bar{f}$ satisfying $d_{\mathrm{KL}}^{K}(f, \bar{f}) < \epsilon/\beta$ and we have the desired result.

Remark 23. Since $\epsilon$ can be made as small as we would like, the expected KL divergence between $f$ and the MLE $\hat{f}_{g,n,N}$ can be made arbitrarily small by choosing an increasing sequence of $n$ that grows more slowly than $N/\log N$. For example, one can take $n = O(\log N)$. Via some calculus, we obtain the optimal convergence rate by setting $n = O(\sqrt{N/\log N})$.

V. CONCENTRATION INEQUALITIES VIA RESULTS FROM RAKHLIN ET AL. (2005)

We now proceed to utilize the theory of [14] to provide a concentration inequality for the MLE of finite mixtures of location-varying PDFs. Let $N(\Delta, \mathcal{F}, d)$ denote the $\Delta$-covering number of the class $\mathcal{F}$, with respect to the distance $d$. That is, $N(\Delta, \mathcal{F}, d)$ is the minimum number of $\Delta$-balls needed to cover $\mathcal{F}$, where a $\Delta$-ball around $f$ (with centre not necessarily in $\mathcal{F}$) is defined as $\{g : d(f, g) < \Delta\}$; see, for example, [22, Sec. 2.2.2]. Further, define $d_n$ as the empirical distance.
That is, for functions $f$ and $g$, and realizations $x_1, \dots, x_N$ of the random variables $X_1, \dots, X_N$, we have $d_n^2(f, g) = N^{-1} \sum_{i=1}^N \left[f(x_i) - g(x_i)\right]^2$. The following theorem can be adapted from [14, Thm. 2.1].

Theorem 24 (Rakhlin et al., 2005, Thm. 2.1). Let $g \in \bar{\mathcal{F}}$ and suppose that $X_1, \dots, X_N$ is an IID random sample from a distribution with PDF $f \in \mathcal{F}_{K,\beta}$ such that $f(x) < \bar{\beta}$ for all $x \in K$. If $\hat{f}_{g,n,N}$ is the MLE for an $n$-component finite mixture of $g$ (under the restrictions of $\bar{\mathcal{F}}_g$), then for any $\epsilon > 0$,
$$\mathbb{E}_f\left[d_{\mathrm{KL}}^{K}\left(f, \hat{f}_{g,n,N}\right)\right] \le \frac{\epsilon}{\beta} + \frac{8\bar{\beta}}{n\beta}\left(\frac{\bar{\beta}}{\beta}\right) + \frac{1}{\sqrt{N}}\left[\frac{\bar{\beta}C}{\beta}\, \mathbb{E}_f\left(\int_0^{\bar{\beta}} \log^{1/2} N\left(\Delta, \bar{\mathcal{F}}_g, d_n\right) d\Delta\right) + \frac{8\bar{\beta}}{\beta}\right] + \sqrt{\frac{t}{N}}\left(\frac{4\sqrt{2}\,\bar{\beta}}{\beta}\right),$$
for some universal constant $C$, with probability at least $1 - \exp(-t)$.
Proof:
The original statement of [14, Thm. 2.1] has $d_{\mathrm{KL}}^{K}(f, \bar{f}^*)$ in place of $\epsilon/\beta$. Thus, we obtain the desired result via the same technique as that used in Theorem 22.

Remark 25. To make it directly comparable to Theorem 22, one can integrate out the probability statement of Theorem 24 to obtain the inequality in expectation
$$\mathbb{E}_f\left[d_{\mathrm{KL}}^{K}\left(f, \hat{f}_{g,n,N}\right)\right] \le \frac{\epsilon}{\beta} + \frac{8\bar{\beta}}{n\beta}\left(\frac{\bar{\beta}}{\beta}\right) + \frac{1}{\sqrt{N}}\left[\frac{\bar{\beta}C}{\beta}\, \mathbb{E}_f\left(\int_0^{\bar{\beta}} \log^{1/2} N\left(\Delta, \bar{\mathcal{F}}_g, d_n\right) d\Delta\right)\right] + \frac{1}{\sqrt{N}}\left(\frac{8\bar{\beta}}{\beta} + \frac{4\sqrt{2}\,\bar{\beta}}{\beta}\right).$$
See the proof of [14, Thm. 2.1] for details. The following corollary specializes the result of Theorem 24 to conform with the conclusion of Theorem 22.

Corollary 26 (Rakhlin et al., 2005, Cor. 2.2). Let $g \in \bar{\mathcal{F}}$ and suppose that $X_1, \dots, X_N$ is an IID random sample from a distribution with density $f \in \mathcal{F}_{K,\beta}$ such that $f(x) < \bar{\beta}$ for all $x \in K$. For every $\epsilon > 0$, if (2) is satisfied and $A = \overline{m} - \underline{m}$, then, under the restrictions of $\bar{\mathcal{F}}_g$,
$$\mathbb{E}_f\left[d_{\mathrm{KL}}^{K}\left(f, \hat{f}_{g,n,N}\right)\right] \le \frac{\epsilon}{\beta} + \frac{C_1}{n} + \frac{C_2}{\sqrt{N}},$$
where $C_1$ and $C_2$ are constants that depend on $\beta$, $\bar{\beta}$, $A$, $B$, $C$, and $p$. Here, $C$ is the same universal constant as in Theorem 24.

Remark 27. Corollary 26 directly improves upon the result of Theorem 22 by allowing $n$ and $N$ to increase independently of one another, while still achieving an arbitrarily small bound on the expected KL divergence of the MLE for finite mixtures of location-scale PDFs, under the same hypotheses. The corollary implies that the optimal choice for the number of components is $n = O(\sqrt{N})$.

REFERENCES

[1] R. F. Hoskins, Delta Functions: Introduction to Generalized Functions. Oxford: Woodhead, 2009.
[2] B. G. Lindsay, "Mixture models: theory, geometry and applications," in NSF-CBMS Regional Conference Series in Probability and Statistics, 1995.
[3] G. J. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.
[4] D. M. Titterington, A. F. M. Smith, and U. E. Makov, Statistical Analysis of Finite Mixture Distributions. New York: Wiley, 1985.
[5] P. E. Rossi, Bayesian Non- and Semiparametric Methods and Applications. Princeton: Princeton University Press, 2014.
[6] J. L. Walker and M. Ben-Akiva, "Advances in discrete choice: mixture models," in A Handbook of Transport Economics. Edward Elgar, 2011, pp. 160-187.
[7] G. Yona, Introduction to Computational Proteomics. Boca Raton: CRC Press, 2011.
[8] J. T.-H. Lo, "Finite-dimensional sensor orbits and optimal nonlinear filtering," IEEE Transactions on Information Theory, vol. IT-18, pp. 583-588, 1972.
[9] T. S. Ferguson, "Bayesian density estimation by mixtures of normal distributions," in Recent Advances in Statistics: Papers in Honour of Herman Chernoff on His Sixtieth Birthday. New York: Academic Press, 1983, pp. 287-302.
[10] A. DasGupta, Asymptotic Theory of Statistics and Probability. New York: Springer, 2008.
[11] W. Cheney and W. Light, A Course in Approximation Theory. Pacific Grove: Brooks/Cole, 2000.
[12] A. J. Zeevi and R. Meir, "Density estimation through convex combinations of densities: approximation and estimation bounds," Neural Computation, vol. 10, pp. 99-109, 1997.
[13] J. Q. Li and A. R. Barron, "Mixture density estimation," in Advances in Neural Information Processing Systems, S. A. Solla, T. K. Leen, and K. R. Mueller, Eds., vol. 12. Cambridge: MIT Press, 1999.
[14] A. Rakhlin, D. Panchenko, and S. Mukherjee, "Risk bounds for mixture density estimation," ESAIM: Probability and Statistics, vol. 9, pp. 220-229, 2005.
[15] S. Kullback and R. A. Leibler, "On information and sufficiency," Annals of Mathematical Statistics, vol. 22, pp. 79-86, 1951.
[16] B. Makarov and A. Podkorytov, Real Analysis: Measures, Integrals and Applications. New York: Springer, 2013.
[17] W. A. Light, "Techniques for generating approximations via convolution kernels," Numerical Algorithms, vol. 5, pp. 247-261, 1993.
[18] R. Willink, "Normal moments and Hermite polynomials," Statistics and Probability Letters, vol. 73, pp. 271-275, 2005.
[19] M. D. Buhmann, Radial Basis Functions: Theory and Implementations. Cambridge: Cambridge University Press, 2003.
[20] A. R. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Transactions on Information Theory, vol. IT-39, pp. 930-945, 1993.
[21] H. White, "Maximum likelihood estimation of misspecified models," Econometrica, vol. 50, pp. 1-25, 1982.
[22] M. R. Kosorok, Introduction to Empirical Processes and Semiparametric Inference. New York: Springer, 2008.