Projected likelihood contrasts for testing homogeneity in finite mixture models with nuisance parameters
IMS Collections
Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen
Vol. 1 (2008) 272–281
© Institute of Mathematical Statistics, 2008
Debapriya Sengupta∗ and Rahul Mazumder
Indian Statistical Institute
Abstract:
This paper develops a test for homogeneity in finite mixture models where the mixing proportions are known a priori (taken to be 0.5) and a common nuisance parameter is present. Statistical tests based on the notion of Projected Likelihood Contrasts (PLC) are considered. The PLC is a slight modification of the usual likelihood ratio statistic, or Wilks' Λ, and is similar in spirit to Rao's score test. Theoretical investigations have been carried out to understand the large sample statistical properties of these tests. Simulation studies have been carried out to understand the behavior of the null distribution of the PLC statistic in the case of Gaussian mixtures with unknown means (common variance as nuisance parameter) and unknown variances (common mean as nuisance parameter). The results are in conformity with the theoretical results obtained. Power functions of these tests have been evaluated based on simulations from Gaussian mixtures.
∗ Supported in part by Grant No. 12(30)/04-IRSD, DIT, Govt. of India.
Bayesian and Interdisciplinary Research Unit, Indian Statistical Institute, 203 B. T. Road, Kolkata 700108, India.
C. V. Raman Hall, 205 B. T. Road, Kolkata 700108, India.
AMS 2000 subject classifications: Primary 62G08, 60G35; secondary 60J55.
Keywords and phrases: Gaussian mixture models, projected likelihood contrast.

1. Introduction

Finite mixture models are often used to understand whether the data come from a heterogeneous or a homogeneous population. In particular, consider the case of a mixture of two populations with the mixing proportions known (Goffinet et al. [7]). We are interested in knowing whether the data are sampled from a proper mixture of two distributions or from a single distribution.

Specifically, consider a mixture family $g$ with generating population densities given by $\mathcal{M} = \{ f(\cdot \mid \theta, \eta) : \theta \in \Theta,\ \eta \in \mathcal{E} \}$, where $\theta$ is the main parameter of interest and $\eta$ is the common nuisance parameter. We assume that the mixing proportion is known a priori to be 0.5. The mixture model then becomes

$$ g(z \mid \theta_1, \theta_2, \eta) = 0.5\, f(z \mid \theta_1, \eta) + 0.5\, f(z \mid \theta_2, \eta). \tag{1.1} $$

The null hypothesis of homogeneity is $\theta_1 = \theta_2$.

In several practical examples (for example, arising in speech analysis and nonparametric regression methodology) detection of the location of a discontinuity in the local mean or the local variance (or local amplitude) is of interest (Figure 1). The theoretical results developed in this paper can be used in such problems. Figure 1 demonstrates several scenarios of signals being scanned through a running window of specified bandwidth. When the center of the window is placed at a point of discontinuity, the raw signal values ($y$-axis) will have a distribution which can be adequately modeled by (1.1). This basic idea has been explored by Hall and Titterington [8] in the context of edge and peak preserving smoothers.

Fig 1. Left column shows time plots of data with solid vertical lines marking the windows considered. The top two panels show a simulated noisy signal (with additive Gaussian noise) whose mean function has a jump discontinuity. The bottom panels show a portion of a digitized speech waveform. The right column shows, for the frames indicated in the left column, three fitted densities of the $y$-values: a nonparametric kernel smoothed density (solid line), a single component Gaussian fit (dashed line) and a mixture of two Gaussians fit with equal mixing weights (curve indicated by +).

A brief list of references dealing with the study of mixture distributions and properties of the likelihood ratio test (LRT) is provided below. Extensive discussions of the background of finite mixture models can be found in Titterington et al. [13], McLachlan and Basford [11] and Lindsay [10]. The asymptotic distribution of the LRT in mixture models has been studied in Bickel and Chernoff [1], Chernoff and Lander [5], Ghosh and Sen [6], and Lemdani and Pons [9]. Different modifications of the LRT in mixture models are proposed and studied by Chen et al. [4] and Self and Liang [12].

In this paper we introduce the concept of Projected Likelihood Contrasts (PLC), a modified version of the LRT or the Wilks' Λ (Wilks [14]) statistic, which we motivate as follows. Consider i.i.d. observations $Z_1, Z_2, \ldots, Z_N$ generated by some element of the class of densities $g$ given by (1.1). The log-likelihood under the full mixture model is given by

$$ L_N(\theta_1, \theta_2, \eta) = \sum_{i=1}^{N} \log g(Z_i \mid \theta_1, \theta_2, \eta), \tag{1.2} $$

where $g$ is defined through (1.1).
Under the null hypothesis the log-likelihood reduces to the usual log-likelihood under $\mathcal{M}$, namely,

$$ L_N(\theta, \theta, \eta) = \sum_{i=1}^{N} \log f(Z_i \mid \theta, \eta). \tag{1.3} $$

Define $(\hat\theta, \hat\eta)$ as the maximum likelihood estimators of $(\theta, \eta)$ under (1.3). The idea behind the PLC statistic is to plug the nuisance parameter estimated under the null into (1.2) and to maximize over the remaining parameters $\theta_1$ and $\theta_2$. The PLC statistic is then defined as

$$ \Lambda_N = 2 \Big( \max_{\theta_1, \theta_2} L_N(\theta_1, \theta_2, \hat\eta) - L_N(\hat\theta, \hat\theta, \hat\eta) \Big). \tag{1.4} $$

The term projected likelihood is used here to distinguish the procedure from profile likelihood. We call it projected likelihood because the profile of the nuisance parameter is obtained after projecting the full likelihood onto $f(\cdot \mid \hat\theta, \eta) \in \mathcal{M}$. That way we first obtain a projected profile of $\eta$ and then maximize it, so that its estimate coincides with the maximum likelihood estimate (MLE) under the null hypothesis. Note that this procedure is, in spirit, very similar to Rao's score test.

The paper is organized as follows. In Section 2, the large sample properties of the PLC statistic are discussed. In Section 3, some simulation studies are provided. The proof of the main theorem of Section 2 is provided in the Appendix.
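As an illustration of the recipe in (1.2)–(1.4), the following Python sketch computes the PLC for one concrete family: an equal-weight mixture of two Gaussians with a common mean $\eta$ (the nuisance parameter) and component standard deviations $\theta_1, \theta_2$, the scale-mixture case mentioned in the abstract. The function names (`mixture_loglik`, `em_scales`, `plc_scale`), the starting values and the use of EM iterations as the optimizer are assumptions made for this sketch and are not taken from the paper. EM may stop short of the global maximum, so the value returned is best read as a lower bound on $\Lambda_N$; it is never negative because $\theta_1 = \theta_2 = \hat\theta$ is always among the candidates. The data are assumed not to be all equal.

```python
import math
import random

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)

def normal_logpdf(z, mean, sd):
    """Log density of N(mean, sd^2) evaluated at z."""
    u = (z - mean) / sd
    return -0.5 * u * u - math.log(sd) - LOG_SQRT_2PI

def mixture_loglik(data, sd1, sd2, mean):
    """L_N of (1.2) for 0.5*N(mean, sd1^2) + 0.5*N(mean, sd2^2)."""
    total = 0.0
    for z in data:
        a = normal_logpdf(z, mean, sd1)
        b = normal_logpdf(z, mean, sd2)
        m = max(a, b)  # log-sum-exp for numerical stability
        total += m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))
    return total

def em_scales(data, sd1, sd2, mean, n_iter=200):
    """EM updates of (sd1, sd2); weights stay at 0.5 and the mean stays fixed."""
    floor = 1e-8  # keeps a collapsing component away from sd = 0
    for _ in range(n_iter):
        w1 = s1 = w2 = s2 = 0.0
        for z in data:
            a = normal_logpdf(z, mean, sd1)
            b = normal_logpdf(z, mean, sd2)
            m = max(a, b)
            ea, eb = math.exp(a - m), math.exp(b - m)
            r = ea / (ea + eb)  # posterior weight of component 1
            d2 = (z - mean) ** 2
            w1 += r
            s1 += r * d2
            w2 += 1.0 - r
            s2 += (1.0 - r) * d2
        if w1 <= 0.0 or w2 <= 0.0:
            break
        sd1 = max(math.sqrt(s1 / w1), floor)
        sd2 = max(math.sqrt(s2 / w2), floor)
    return sd1, sd2

def plc_scale(data, ratios=(1.5, 3.0)):
    """PLC (1.4) with the common mean plugged in at its null MLE."""
    n = len(data)
    mean_hat = sum(data) / n  # eta-hat under the null
    sd_hat = math.sqrt(sum((z - mean_hat) ** 2 for z in data) / n)  # theta-hat
    null_ll = mixture_loglik(data, sd_hat, sd_hat, mean_hat)
    best = null_ll
    for k in ratios:  # a few hypothetical starting points for EM
        sd1, sd2 = em_scales(data, sd_hat / k, sd_hat * k, mean_hat)
        best = max(best, mixture_loglik(data, sd1, sd2, mean_hat))
    return 2.0 * (best - null_ll)

rng = random.Random(0)
homogeneous = [rng.gauss(0.0, 1.0) for _ in range(200)]
heterogeneous = [rng.gauss(0.0, 1.0 if rng.random() < 0.5 else 4.0) for _ in range(200)]
print(plc_scale(homogeneous), plc_scale(heterogeneous))
```

Holding the mean at its null estimate inside the maximization is what separates this from an ordinary likelihood ratio, in which the nuisance parameter would be re-estimated under the alternative.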
2. Large sample approximation of PLC statistic
For the purpose of theoretical investigation we shall simplify the model further, assuming that the densities in the class are all one-dimensional. Denote the null hypothesis by

$$ H^* : Z_1, Z_2, \ldots, Z_N \ \text{are iid}\ \mathcal{M}. \tag{2.1} $$

For notational convenience we adopt the convention that the symbol $D^r_x$ indicates the $r$-th partial derivative with respect to $x$, treated as a generic argument of a function. Define the following estimated scores

$$ \hat\xi_r(j) = \frac{D^r_\theta f(Z_j \mid \hat\theta_N, \hat\eta_N)}{f(Z_j \mid \hat\theta_N, \hat\eta_N)}, \tag{2.2} $$

for $1 \le j \le N$ and $r \ge 1$. Analogously define the true scores $\xi_r(j) = D^r_\theta f(Z_j \mid \theta, \eta) / f(Z_j \mid \theta, \eta)$ at the true parameter values under $H^*$. One can verify that $E_{H^*} \xi_r(1) = 0$ for every $r \ge 1$. We assume that the required moments of the $\xi_r$'s exist. Define the following mixed partial derivatives of the full likelihood $L_N$:

$$ C^N_{ij} = (D_{\theta_1} + D_{\theta_2})^i (D_{\theta_1} - D_{\theta_2})^j L_N(\hat\theta_N, \hat\theta_N), \tag{2.3} $$

where $i, j$ are nonnegative integers. Moreover, let $\bar C^N_{ij} = N^{-1} C^N_{ij}$. Although the quantities defined in (2.3) look rather opaque, they can be expressed as linear combinations of $D^l_{\theta_1} D^m_{\theta_2} L_N(\theta_1, \theta_2)$ using the binomial expansion. One can establish with some effort that

$$ D^i_{\theta_1} D^j_{\theta_2} \log g(z \mid \hat\theta_N, \hat\theta_N, \hat\eta_N) = {\sum_{\Omega}}^{*}\, a(\Omega) \prod_{r=1}^{i+j} \hat\xi_r^{\,\omega_r}(z), $$

where $\sum^*$ runs over all nonnegative integral partitions $\Omega = (\omega_1, \omega_2, \ldots, \omega_{i+j})$ satisfying $\sum_r \omega_r = i + j$. The coefficients $a(\Omega)$ are complicated combinatorial quantities but can be computed recursively. It can be verified that $C^N_{ij} = 0$ if $j$ is odd. We provide simplified expressions for some of the lower order $C^N_{ij}$ which are necessary for later calculations:

$$ \bar C^N_{20} = \frac{1}{N} \sum_{j=1}^{N} \big( \hat\xi_2(j) - \hat\xi_1(j)^2 \big) \ \ (\xrightarrow{P} -\mathcal{I}), \qquad \bar C^N_{02} = \frac{1}{N} \sum_{j=1}^{N} \hat\xi_2(j), $$

$$ \bar C^N_{12} = \frac{1}{N} \sum_{j=1}^{N} \big( \hat\xi_3(j) - \hat\xi_1(j) \hat\xi_2(j) \big), \qquad \bar C^N_{04} = \frac{1}{N} \sum_{j=1}^{N} \hat\psi(j), \tag{2.4} $$

where $\hat\psi(j) = \hat\xi(j) + \hat\xi(j)\hat\xi(j) - \xi(j) + 3\hat\xi(j)\hat\xi(j)$, and $\mathcal{I}$ is the Fisher information of $\theta$ under $H^*$. Finally, let $\bar C_{ij}$ denote the asymptotic expected values of $\bar C^N_{ij}$ under $H^*$, which can be easily derived using Lemma 2.1(i). The distributional properties of $\bar C^N_{ij}$ can be derived using classical properties of $M$-estimators. We state the following lemma for the sake of completeness. The proof can be found in Bickel and Doksum [2].

Lemma 2.1.
Let $Z_1, Z_2, \ldots, Z_N$ be independent and identically distributed random variables with density $f(z \mid \theta)$ satisfying the usual regularity conditions, with score function $S(z, \theta)$ and Fisher information matrix $\mathcal{I} = \mathrm{Cov}_\theta(S(Z_1, \theta))$.

(i) Let $\psi(z, \theta)$ be a real valued, continuously differentiable (in $\theta$) kernel with $E_\theta \psi^2(Z_1, \theta) < \infty$ for every $\theta$. Further let $\hat\theta_N$ denote the MLE of $\theta$. Then

$$ \frac{1}{N} \sum_{i=1}^{N} \psi(Z_i, \hat\theta_N) \xrightarrow{P} E_\theta \psi(Z_1, \theta). $$

(ii) If, in addition, $\psi$ satisfies $E_\theta \psi(Z_1, \theta) = 0$ for every $\theta$, then

$$ \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \psi(Z_i, \hat\theta_N) \Longrightarrow N(0, V), \tag{2.5} $$

where $V = E_\theta \psi^2 - C' \mathcal{I}^{-1} C$ and $C = \mathrm{Cov}_\theta(\psi(Z_1, \theta), S(Z_1, \theta))$.

Finally, we proceed to the main asymptotic representation theorem for the PLC statistic. It turns out that even in the Gaussian case the standard $\chi^2$-approximation does not hold; in fact the Gaussian case is more paradoxical than one would expect. As a result one has to go to a higher order expansion to get an idea of the limiting behavior of the statistic. The crucial issue is whether $E_{H^*} \xi_1(1) \xi_2(1) = 0$ or not. This is a measure of some type of spurious non-degeneracy in the model due to skewness, and its asymptotic effect needs to be corrected for. Two cases are considered in the simulation section. In the first case we consider a Gaussian mixture with different means but a common unknown variance, and in the second case a Gaussian scale mixture with a common unknown mean. In both cases we find $E_{H^*} \xi_1(1) \xi_2(1) = 0$. The first case is covered by Theorem 2.2(i) below while the second case is covered by Theorem 2.2(ii). We state the theorem keeping these two special cases in mind. The proof of the theorem is provided in the Appendix.

Theorem 2.2.
Assume that $E_{H^*} \xi_1(1) \xi_2(1) = 0$ and $\bar C_{04} < 0$. Then under $H^*$,

(i) if $\bar C^N_{02} = 0$, then $\Lambda_N \xrightarrow{P} 0$;

(ii) if $\sqrt{N}\, \bar C^N_{02} \Longrightarrow N(0, \sigma^2)$ for some $\sigma^2 > 0$, then

$$ \Lambda_N \Longrightarrow c\, [\max(0, Z)]^2, \tag{2.6} $$

for a suitable $c > 0$ and a standard normal variate $Z$.
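The limit law in part (ii), $c\,[\max(0, Z)]^2$, puts mass $1/2$ at zero and is otherwise a scaled $\chi^2_1$, so its mean, upper percentage points and tail probabilities follow directly from the standard normal distribution. The Python sketch below works these out for a given $c$; the function names are hypothetical and chosen for this illustration, and the value $c = 0.69070$ used in the example is the one reported in Section 3.

```python
import math
from statistics import NormalDist

def plc_limit_summary(c, alpha=0.05):
    """Summaries of the law of c * max(0, Z)**2 with Z ~ N(0, 1); needs alpha < 0.5."""
    z_upper = NormalDist().inv_cdf(1.0 - alpha)  # upper-alpha point of Z
    return {
        "mass_at_zero": 0.5,                     # P(Z <= 0)
        "mean": 0.5 * c,                         # E[max(0, Z)^2] = E[Z^2] / 2
        "critical_value": c * z_upper * z_upper, # P(c * max(0, Z)^2 > x) = alpha
    }

def plc_limit_pvalue(stat, c):
    """P(c * max(0, Z)**2 >= stat) under the limit law."""
    if stat <= 0.0:
        return 1.0
    return 1.0 - NormalDist().cdf(math.sqrt(stat / c))

summary = plc_limit_summary(0.69070)
print(summary)
```

With $c = 0.69070$ this gives a mean of about 0.345 and a 5% point of about 1.87, in line with the "Theor." columns of Table 2.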
3. Simulation studies in the case of Gaussian mixtures
In this section we provide results pertaining to the sampling distributions of the PLC statistic under the null in the case of Gaussian mixtures [7]. Studies have been carried out for two different cases: unknown variances with the common mean as the nuisance parameter, and unknown means with the common variance as the nuisance parameter. The simulation results are in conformity with the theoretical results derived. The power function of the PLC test statistic for each of the above two set-ups has been studied for different values of the alternative. Simulation studies have been carried out for different sample sizes.
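The two kinds of simulation summaries described above, percentiles of a null distribution and rejection rates under an alternative, can be sketched generically in Python as follows. Everything here is illustrative: `empirical_percentiles`, `rejection_rate` and `mixture_sampler` are hypothetical helper names, the percentile rule is a plain order-statistic rule, and `toy_stat` is a stand-in based on sample kurtosis, not the PLC, used only so that the example runs on its own. Any function mapping a sample to a number can be passed as `stat`.

```python
import math
import random

def empirical_percentiles(stat, sampler, n_rep, probs, seed=0):
    """Order-statistic percentiles of stat(sample) over n_rep simulated samples."""
    rng = random.Random(seed)
    values = sorted(stat(sampler(rng)) for _ in range(n_rep))
    out = []
    for p in probs:
        k = min(n_rep, max(1, math.ceil(p * n_rep)))  # rank of the order statistic
        out.append(values[k - 1])
    return out

def rejection_rate(stat, sampler, critical_value, n_rep, seed=1):
    """Fraction of simulated samples with stat(sample) > critical_value."""
    rng = random.Random(seed)
    return sum(stat(sampler(rng)) > critical_value for _ in range(n_rep)) / n_rep

def toy_stat(sample):
    """Stand-in statistic (NOT the PLC): N * max(0, excess kurtosis)^2 / 24."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((z - mean) ** 2 for z in sample) / n
    m4 = sum((z - mean) ** 4 for z in sample) / n
    excess = m4 / (m2 * m2) - 3.0
    return n * max(0.0, excess) ** 2 / 24.0

def mixture_sampler(n, sd1=1.0, sd2=1.0):
    """Sampler for n draws from 0.5*N(0, sd1^2) + 0.5*N(0, sd2^2)."""
    def draw(rng):
        return [rng.gauss(0.0, sd1 if rng.random() < 0.5 else sd2) for _ in range(n)]
    return draw

# Critical value from the simulated null, then the rejection rate at one alternative.
crit = empirical_percentiles(toy_stat, mixture_sampler(100), n_rep=500, probs=[0.95])[0]
power = rejection_rate(toy_stat, mixture_sampler(100, 1.0, 3.0), crit, n_rep=500)
print(crit, power)
```

Calibrating the critical value from the simulated null rather than from a limit law is a design choice suited to small $N$, where the sampling distribution may still be far from its asymptotic form.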
Consider first the example of Gaussian mixture models in which the main parameters of interest are the unknown means and the common variance is the nuisance parameter. The generating model is given by

$$ f(z \mid \theta, \eta) = \eta^{-1} \phi((z - \theta)/\eta), \tag{3.1} $$

where $\phi$ is the standard normal probability density function ($\theta \in \Re$, $\eta > 0$). In this case $\hat\eta^2 = N^{-1} \sum_{i=1}^{N} (Z_i - \bar Z)^2$, where $\bar Z = N^{-1} \sum_{i=1}^{N} Z_i$. The corresponding PLC is denoted by $\Lambda^m_N$. Simulation studies for the null distribution of $\Lambda^m_N$ have been performed for sample sizes $N = 50$, 100 and 200. Percentiles of the sampling distribution are displayed in Table 1, which shows how the percentiles $p$ (5, 50 and 95) of the null distribution of $\Lambda^m_N$ decrease with increasing sample size $N$. The difference between percentile values (say, between the 95th and the 5th percentiles) decreases with increasing sample size as well. The tabulated values support the theoretical results obtained in Theorem 2.2.

In the second example, also pertaining to Gaussian mixture models, the main parameters of interest are the unknown variances and the common mean is the nuisance parameter:

$$ f(z \mid \theta, \eta) = \theta^{-1} \phi((z - \eta)/\theta) \tag{3.2} $$

for $\theta > 0$, $\eta \in \Re$. Here $\hat\eta = \bar Z$. The corresponding PLC statistic is denoted by $\Lambda^s_N$.

Table 1
Percentiles of the null distribution of the PLC, corresponding to a Gaussian mixture with unknown means and common variance as the nuisance parameter

         Percentiles
  N      5        50       95
  50     0.008    0.011    0.014
  100    0.004    0.005    0.006
  200    0.002    0.002    0.003
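A short worked computation, using the expression $\bar C^N_{02} = N^{-1} \sum_j \hat\xi_2(j)$ from (2.4), indicates why the two examples fall under parts (i) and (ii) of Theorem 2.2 respectively. The standardized residuals $\hat u_j$ and the sample kurtosis $\hat\kappa$ are notation introduced here for the illustration and do not appear in the original; in the second case $\theta$ is taken to be the standard deviation, as in (3.2).

```latex
% Mean case (3.1): f(z | \theta, \eta) = \eta^{-1} \phi((z - \theta)/\eta),
% with \hat\theta = \bar Z and \hat\eta^2 = N^{-1} \sum_j (Z_j - \bar Z)^2.
\xi_2(j) = \frac{D_\theta^2 f(Z_j \mid \theta, \eta)}{f(Z_j \mid \theta, \eta)}
         = \frac{(Z_j - \theta)^2 - \eta^2}{\eta^4},
\qquad
\bar C^N_{02} = \frac{1}{N} \sum_{j=1}^{N} \hat\xi_2(j)
             = \frac{\hat\eta^2 - \hat\eta^2}{\hat\eta^4} = 0 .

% Scale case (3.2): f(z | \theta, \eta) = \theta^{-1} \phi((z - \eta)/\theta),
% with u_j = (Z_j - \eta)/\theta, \hat\eta = \bar Z,
% \hat\theta^2 = N^{-1} \sum_j (Z_j - \bar Z)^2 and \hat u_j = (Z_j - \bar Z)/\hat\theta.
\xi_1(j) = \frac{u_j^2 - 1}{\theta},
\qquad
\xi_2(j) = D_\theta \xi_1(j) + \xi_1(j)^2 = \frac{u_j^4 - 5 u_j^2 + 2}{\theta^2},

% Since N^{-1} \sum_j \hat u_j^2 = 1:
\bar C^N_{02} = \frac{1}{\hat\theta^2}
                \Big( \frac{1}{N} \sum_{j=1}^{N} \hat u_j^4 - 5 + 2 \Big)
             = \frac{\hat\kappa - 3}{\hat\theta^2},
\qquad
\hat\kappa = \frac{1}{N} \sum_{j=1}^{N} \hat u_j^4 .
```

In the mean case $\bar C^N_{02}$ vanishes identically, which is the situation of part (i). In the scale case $\sqrt{N}\, \bar C^N_{02}$ is $\sqrt{N}$ times the sample excess kurtosis divided by $\hat\theta^2$, which is asymptotically normal with mean zero under Gaussian sampling (the standard limit being $\sqrt{N}(\hat\kappa - 3) \Rightarrow N(0, 24)$), the situation of part (ii).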
The constant $c$ in the limiting distribution (2.6) can be computed, but the computations are quite cumbersome. Hence $c$ has been evaluated from the sampling distribution of $\Lambda^s_N$ under the null. The sampling distribution is based on 5000 simulations of data size 2000. The value of $c$ thus obtained is 0.69070.

The asymptotic null distribution of $\Lambda^s_N$ is a mixture of a degenerate mass at 0 and a $c\,\chi^2_1$ distribution (for suitable $c > 0$). The sampling distribution of $\Lambda^s_N$, obtained from 5000 simulations of sample size 2000, is found to be a mixture of outcomes which are exactly zero and another strictly positive absolutely continuous distribution. We have observed that this absolutely continuous distribution (as obtained from simulations) is very close to $c\,\chi^2_1$ (where $c = 0.69070$). Simulation studies for the null distribution of $\Lambda^s_N$ have been performed and tabulated (see Table 2) for different sample sizes $N$, based on 1000 simulations of data size $N$, where $N = 50, 100, 200$.

The expected value of the sampling distribution shows a negative bias. The degree to which it approximates the mean of the large sample distribution of the PLC improves with increasing sample size. The proportion of zeros in the sampling distribution decreases with $N$ toward the theoretical value 0.5. For the 95th percentile as well, the agreement between the sampling distribution and the theoretical distribution improves with increasing sample size.

Fig 2. Dotted line shows the kernel density estimate of $c\,(\max\{0, N(0,1)\})^2$ ($c = 0.69070$), the theoretical asymptotic null distribution of the PLC under $N(0,1)$. Note that by invariance the results do not depend on the choice of the mean and variance. The solid line is the kernel density estimate of the sampling distribution of the PLC with the zeros left out, under the null, corresponding to a Gaussian mixture of the same set-up. This sampling distribution is based on 5000 simulations of sample size 2000.

Table 2
Summary statistics of the null distribution of the PLC, corresponding to a Gaussian mixture with unknown variances and common mean as the nuisance parameter

         Expectation         % of zeros         5% signif. point
  N      Theor.*   Est.      Theor.   Est.      Theor.*   Est.
  50     0.345     0.156     50       70.1      1.86      0.935
  100    0.345     0.256     50       61.5      1.86      1.608
  200    0.345     0.328     50       57.5      1.86      1.817

*The sampling distribution based on 5000 simulations of sample size 2000 has been used as a proxy for the theoretical asymptotic null distribution.

Fig 3. Power functions of the PLC test statistics at level $\alpha = 0.05$. In both panels the solid, dotted and dashed lines correspond to the sample sizes 200, 100 and 50, respectively. For $\Lambda^m_N$ (left panel) the power function has been evaluated over an interval of values of the parameter $|\theta_1 - \theta_2|$ with left endpoint 0. For $\Lambda^s_N$ (right panel) the power function has been evaluated over an interval of values of the parameter $\sqrt{\max\{\theta_1, \theta_2\} / \min\{\theta_1, \theta_2\}}$ with left endpoint 1.

Power functions of the test statistic $\Lambda^m_N$ at level $\alpha = 0.05$ have been evaluated for different values of the alternative, indexed by $|\theta_1 - \theta_2|$ with values starting at 0, for three sample sizes $N = 50, 100, 200$ (Figure 3). The power is found to increase with increasing sample size.

Power functions of the test statistic $\Lambda^s_N$ at level $\alpha = 0.05$ have been evaluated for different values of the alternative, indexed by $\sqrt{\max\{\theta_1, \theta_2\} / \min\{\theta_1, \theta_2\}}$ with values starting at 1, for the same three sample sizes $N = 50, 100, 200$ (Figure 3). The power is again found to increase with increasing sample size.

Appendix: Proof of Theorem 2.2
First, it follows from Chen et al. [4] that both the MLEs $\hat\theta_1$ and $\hat\theta_2$ are $N^{1/4}$-consistent under (1.1). For both cases of the theorem we re-parametrize the problem with $\theta_1 = \hat\theta_N + N^{-1/2} s + N^{-1/4} \tau$ and $\theta_2 = \hat\theta_N + N^{-1/2} s - N^{-1/4} \tau$ and study its behavior near $(\hat\theta_N, \hat\theta_N)$ in the neighborhoods $|s| \le \log N$ and $|\tau| \le \log N$ respectively. In what follows we do not verify orders of remainder terms explicitly. Several technical steps need to be verified in the process of deriving the result. We refer to Bickel and Doksum [2], Ghosh and Sen [6] and Bose and Sengupta [3] for the type of regularity assumptions and machinery needed for uniform approximations in such a context. Also, note that under the above parametrization the likelihood becomes an even function of $\tau$. Therefore we work with $\tau \ge 0$. The information matrix $\mathcal{I}(\theta_1, \theta_2, \eta)$ has rank 2 if $\theta_1 = \theta_2$ and 3 otherwise (this can be verified by straightforward differentiation). Next define

$$ H(s, \tau) = L_N(\hat\theta_N + s + \tau,\ \hat\theta_N + s - \tau). $$

It can be readily verified from (1.2) and (2.3) that

$$ \frac{\partial^{i+j}}{\partial s^i\, \partial \tau^j} H(0, 0) = C^N_{ij}, \tag{A.1} $$

for $i, j \ge 0$.

The strategy of the proof is the following. Since the expansion is regular in the within-model displacement $s$, we fix $\tau \ge 0$ and maximize over $s$ in the first step. Then we examine the behavior of the maximum value obtained in the first step across $\tau$ to derive the final result. Because of our general regularity conditions all the following calculations will be valid uniformly in probability over the compact set $|s| \le \log N$ and $0 \le \tau \le \log N$. In what follows $\gamma > 0$ denotes a generic constant, and we use the fact that $N^{-a} (\log N)^b \to 0$ as $N \to \infty$ for any $a, b > 0$. A Taylor expansion in $s$ gives

$$ H(N^{-1/2} s, \tau) = H(0, \tau) + s\, [N^{-1/2} H_{10}(0, \tau)] + \tfrac{1}{2} s^2\, [N^{-1} H_{20}(0, \tau)] + o_P(N^{-\gamma}), \tag{A.2} $$

where the $H_{ij}$'s denote the respective partial derivatives of $H$. Also, it can be checked that

$$ H_{20}(0, \tau) = -N \mathcal{I}\, (1 + o_P(N^{-\gamma})). $$

Therefore, in large samples, for fixed $0 \le \tau \le N^{-1/4} \log N$, the maximum value of $H(N^{-1/2} s, \tau)$ over the compact set $|s| \le \log N$ cannot exceed its unrestricted global maximum, which is of the order of $[N^{-1/2} H_{10}(0, \tau)]^2 / [N^{-1} H_{20}(0, \tau)]$. By a direct Taylor series of order 4 we find

$$ H_{10}(0, N^{-1/4} \tau) = (2!)^{-1} [\sqrt{N}\, \bar C^N_{12}]\, \tau^2 + (4!)^{-1} [\bar C^N_{14}]\, \tau^4 + o_P(N^{-\gamma}). $$

The facts required for the above simplification are: (i) $H_{10}(0, 0) = 0$ by the maximum likelihood equation, (ii) $H_{1j}(0, 0) = 0$ for $j$ odd (since $H$ is an even function of $\tau$) and (iii) the assumption of the theorem that $E_{H^*} \xi_1(1) \xi_2(1) = 0$. It can be checked that the last assertion implies $\sqrt{N}\, \bar C^N_{12} = O_P(1)$, in view of (2.4) and Lemma 2.1.

Therefore, by virtue of the assumptions of the theorem, the profile global maximum of $H(\cdot, \tau)$ becomes negligible in probability over the range of interest. Thus we have

$$ \max_{|s| \le \log N} H(N^{-1/2} s, \tau) = H(0, \tau) + o_P(N^{-\gamma}), \tag{A.3} $$

uniformly over $0 \le \tau \le N^{-1/4} \log N$. Finally,

$$ H(0, N^{-1/4} \tau) = H(0, 0) + (2!)^{-1} [\sqrt{N}\, \bar C^N_{02}]\, \tau^2 + (4!)^{-1} [\bar C^N_{04}]\, \tau^4 + o_P(N^{-\gamma}). $$

Therefore we have

$$ \Lambda_N \approx 2 \max_{|s| \le \log N,\ 0 \le \tau \le \log N} \big[ H(N^{-1/2} s, N^{-1/4} \tau) - H(0, 0) \big] \approx \max_{0 \le \tau \le \log N} \Big\{ [\sqrt{N}\, \bar C^N_{02}]\, \tau^2 + \tfrac{1}{12} [\bar C^N_{04}]\, \tau^4 + o_P(N^{-\gamma}) \Big\}. \tag{A.4} $$

Now we consider case (i) of the theorem, where $\bar C^N_{02} = 0$. Then (A.4) reduces to

$$ \Lambda_N = \max_{0 \le \tau < \log N} \Big\{ \tfrac{1}{12} [\bar C^N_{04}]\, \tau^4 + o_P(N^{-\gamma}) \Big\}. $$

Since $\bar C_{04} < 0$, $P\{ \bar C^N_{04} < -\delta \} \to 1$ for some $\delta > 0$. By choosing $\tau > 12^{1/4} \delta^{-1/4} N^{-\gamma/4}$ one can show that the value of the objective function (being maximized) becomes negative. Hence it can be verified that $\Lambda_N \xrightarrow{P} 0$.

For case (ii), we maximize the expression in (A.4) with respect to $\tau$ (noting that the dominant expression is a quadratic in $\tau^2$ and $\bar C^N_{04} \xrightarrow{P} \bar C_{04}\ (< 0)$) to obtain

$$ \Lambda_N \approx \max_{0 \le \tau \le \log N} \Big\{ [\sqrt{N}\, \bar C^N_{02}]\, \tau^2 + \tfrac{1}{12} [\bar C^N_{04}]\, \tau^4 \Big\} \approx - \frac{3\, [\max(0, \sqrt{N}\, \bar C^N_{02})]^2}{\bar C_{04}}, $$

with an error in approximation of the order of $o_P(N^{-\gamma})$ as before. Hence the second part of the theorem follows from the assumptions.

References

[1]
Bickel, P. J. and Chernoff, H. (1993). Asymptotic distribution of the likelihood ratio statistic in a prototypical non regular problem. In Statistics and Probability: A Raghu Raj Bahadur Festschrift (J. K. Ghosh, S. K. Mitra, K. R. Parthasarathy and B. Prakasa Rao, eds.) 83–96. Wiley, New York.
[2] Bickel, P. J. and Doksum, K. A. (2001). Mathematical Statistics. Basic Ideas and Selected Topics. 1, 2nd ed. Prentice Hall, NJ. MR0443141
[3] Bose, A. and Sengupta, D. (2003). Strong consistency of minimum contrast estimators with applications. Sankhyā
[4] Chen, H., Chen, J. and Kalbfleisch, J. D. (2001). A modified likelihood ratio test for homogeneity in finite mixture models. J. R. Stat. Soc. Ser. B Stat. Methodol.
[5] Chernoff, H. and Lander, E. (1995). Asymptotic distribution of the likelihood ratio test that a mixture of two binomials is a single binomial. J. Statist. Plann. Inference
[6] Ghosh, J. K. and Sen, P. K. (1985). On the asymptotic performance of the log likelihood ratio statistic for the mixture model and related results. In Proc. of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer II (Berkeley, Calif., 1983)
[7] Goffinet, B., Loisel, P. and Laurent, B. (1992). Testing in normal mixture models when the proportions are known. Biometrika
[8] Hall, P. and Titterington, D. M. (1992). Edge-preserving and peak-preserving smoothing. Technometrics
[9] Lemdani, M. and Pons, O. (1999). Likelihood ratio tests in contamination models. Bernoulli
[10] Lindsay, B. G. (1995). Mixture Models: Theory, Geometry and Applications. Institute of Mathematical Statistics, Hayward, CA.
[11] McLachlan, G. J. and Basford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. Dekker, New York. MR0926484
[12] Self, S. G. and Liang, K.-Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Amer. Statist. Assoc.
[13] Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, Chichester. MR0838090
[14] Wilks, S. S. (1938). The large sample distribution of the likelihood ratio for testing composite hypothesis. Ann. Math. Statist. 9