Handling missing data in model-based clustering
Alessio Serafini, Thomas Brendan Murphy, Luca Scrucca
Department of Economics, Università degli Studi di Perugia, Italy
School of Mathematics and Statistics, University College Dublin, Ireland
June 5, 2020
Abstract
Gaussian mixture models (GMMs) are a powerful tool for clustering, classification and density estimation when clustering structures are embedded in the data. The presence of missing values can largely impact the GMM estimation process, so handling missing data turns out to be a crucial point in clustering, classification and density estimation. Several techniques have been developed to impute the missing values before model estimation. Among these, multiple imputation is a simple and useful general approach for handling missing data. In this paper we propose two different methods to fit Gaussian mixtures in the presence of missing data. Both methods use a variant of the Monte Carlo Expectation-Maximisation (MCEM) algorithm for data augmentation. Thus, multiple imputations are performed during the E-step, followed by the standard M-step for a given eigen-decomposed component-covariance matrix. We show that the proposed methods outperform the multiple imputation approach, both in terms of cluster identification and density estimation.
1 Introduction

In many real applications, observations are subject to missing entries during the data collection process, so handling these missing values is a crucial point in model estimation. This aspect is also fundamental in pattern recognition, where missing values can be informative, and an inaccurate treatment can lead to serious errors in classification and density estimation.

Before focusing on the methods available for dealing with missing data, it is necessary to specify the nature of the missing mechanism (Rubin, 1976). This describes how the probability of a missing value is related to the data and, although in general it is unknown and unobservable, it is necessary to make some assumptions about the underlying missing mechanism. The validity of the methods used in practical applications often depends on whether or not these assumptions are fulfilled. Three different missing mechanisms are common in the literature: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). For the technical details and definitions we refer to Little and Rubin (2002). We note, however, that under the MCAR and MAR assumptions the missing mechanism is considered to be ignorable, i.e. model parameters are not affected by the distribution of the missing data.

Different approaches for dealing with missing values can be found in the literature. In the listwise/casewise deletion approach, observations with missing entries are removed, and models are estimated using only the full set of observed data (Little and Rubin, 2002, Chap. 3). This approach is simple and fast to implement, but the estimates may be biased due to the loss of information caused by the removal of part of the data. Another popular approach is to impute the missing observations, that is, to somehow fill in the missing data with a single value or a set of values. Single imputation methods include mean imputation, which fills the missing values with the mean or the conditional mean (Wilks, 1932); regression imputation (Buck, 1960), which replaces missing values with predicted scores from a regression model, and its stochastic version (Little and Rubin, 2002, Sec. 4.3); and hot-deck imputation (Ford, 1983), which is a collection of techniques that impute the missing values with scores from similar observations. Multiple imputation refers to a set of procedures that fill the missing values with different plausible values, thus generating different complete datasets. Each complete dataset is then analysed, and the results are combined to reflect the uncertainty due to the presence of missing values (Rubin, 1987; Schafer, 1997). Finally, model-based procedures assume a specific distribution for the observed data and draw inference on parameters based on the likelihood. A popular method for maximum likelihood estimation in missing-data problems is the Expectation-Maximisation (EM) algorithm (Dempster et al., 1977).

Gaussian Mixture Models (GMMs) are a powerful tool for density estimation, clustering, and classification (Fraley and Raftery, 2002; McLachlan and Peel, 2000). Parsimonious parametrisation of covariance matrices allows for a flexible class of models, each with its distinctive characteristics (Banfield and Raftery, 1993; Celeux and Govaert, 1995). However, the resulting log-likelihood is difficult to maximise directly, even numerically (see McLachlan and Peel, 2000, Sect. 2.8.1), so GMMs are usually fitted by reformulating the mixture problem as an incomplete-data problem within the EM framework.
Since the missing observations may contain important information about the clustering structure of the data, Ghahramani and Jordan (1995) extended the EM algorithm for GMMs with missing values. They derived a closed-form solution for the M-step, but only in the unconstrained case, that is, for full within-cluster covariance matrices. However, no solutions are available for the parsimonious parametrisations of covariance matrices mentioned above.

In this paper we extend the work of Ghahramani and Jordan (1995) by considering all the parsimonious covariance structures proposed in Banfield and Raftery (1993) and Celeux and Govaert (1995). In our proposal, we use a modified E-step based on augmented data, followed by maximisation of the complete-data log-likelihood using the standard M-step for GMMs. To this goal, we have developed two model-based approaches to handle missing data in GMMs. Assuming that the data are distributed as multivariate Gaussian within each cluster, Monte Carlo sampling allows us to approximate the E-step using an augmented dataset with filled missing values. The complete-data log-likelihood of the augmented dataset can then be maximised using standard formulas for GMMs. The main advantage of this approach is that only the E-step is modified, whereas the M-step for each parsimonious parametrisation of the covariance matrices is not affected.

The paper is organised as follows. Section 2 contains a brief description of Gaussian mixtures and the standard EM algorithm. Section 3 introduces two approaches for handling missing data in Gaussian mixture models, with details on the corresponding Monte Carlo EM algorithms presented in Section 4. In Section 5 the performance of the proposed algorithms is evaluated on both simulated and real datasets in terms of both clustering and density estimation. The final section provides some concluding remarks.

2 Gaussian mixture models

Gaussian Mixture Models (GMMs) are a popular parametric model for cluster analysis and density estimation. This class of models has been widely used for different problems in several fields for its flexibility and interpretability. Extensive reviews of mixture models can be found in McLachlan and Peel (2000), and in Fraley and Raftery (2002) for the applications of GMMs to cluster analysis, discriminant analysis, and density estimation.

2.1 Model specification

Let $x = (x_1, x_2, \ldots, x_N)$ be a random sample of $N$ observations with $x_i \in \mathbb{R}^d$. The observed data are assumed to arise from a Gaussian mixture distribution if the density can be written as a convex combination of multivariate Gaussian distributions, such that:
$$ f(x_i; \theta) = \sum_{g=1}^{G} \pi_g \, \phi(x_i; \mu_g, \Sigma_g), $$
where $G$ is the number of mixture components, $\phi(x_i; \mu_g, \Sigma_g)$ is the underlying density function of the $g$th component, a multivariate Gaussian distribution with mean $\mu_g$ and covariance matrix $\Sigma_g$, and $(\pi_1, \pi_2, \ldots, \pi_G)$ are the mixing weights, such that $\pi_g > 0$ and $\sum_{g=1}^{G} \pi_g = 1$. The set of unknown parameters of the mixture model is $\theta = \{\pi_1, \ldots, \pi_{G-1}, \mu_1, \ldots, \mu_G, \Sigma_1, \ldots, \Sigma_G\}$.

The Gaussian assumption implies ellipsoidal clusters, each centred at the mean vector $\mu_g$ and with covariance matrix $\Sigma_g$. The latter controls the geometrical features of the ellipsoids, such as the orientation, the volume, and the shape. Parsimonious parameterisation of covariance matrices for GMMs can be achieved through the eigen-decomposition $\Sigma_g = \lambda_g D_g A_g D_g^\top$, where $\lambda_g$ is a scalar controlling the volume, $A_g$ is a diagonal matrix controlling the shape, and $D_g$ is an orthogonal matrix controlling the orientation of the ellipsoids. This parameterisation generates a broad class of models with specific geometric properties (Banfield and Raftery, 1993; Celeux and Govaert, 1995).

Maximum likelihood estimation of the unknown parameters of a mixture model can in principle be obtained by maximising the log-likelihood:
$$ \ell(\theta) = \sum_{i=1}^{N} \log \sum_{g=1}^{G} \pi_g \, \phi(x_i; \mu_g, \Sigma_g). $$
However, the objective function above is hard to maximise directly, even numerically (see McLachlan and Peel, 2000, Sect. 2.8.1). Consequently, the standard algorithm employed for parameter estimation in mixture models is the Expectation-Maximisation (EM) algorithm, which is briefly reviewed in the following subsection.

2.2 The EM algorithm

Dempster et al. (1977) introduced an iterative procedure to estimate the parameters of a mixture model by maximising the log-likelihood while alternating two different steps. The Expectation step (E-step) computes the expected value of the complete-data log-likelihood, and the Maximisation step (M-step) maximises this expected value with respect to the parameters of the mixture model. The algorithm guarantees a non-decreasing log-likelihood and, under fairly general conditions, it converges to at least a local maximum (McLachlan and Krishnan, 2008).

Suppose there exists an unobservable process $z = (z_1, \ldots, z_N)$, where each $z_i = (z_{i1}, \ldots, z_{iG})^\top$ is a latent variable specifying the component label, i.e. $z_{ig} = 1$ if the $i$th observation belongs to component $g$ and 0 otherwise. Let $f(x, z \mid \theta)$ be the joint density of the complete-data vector, formed by the observed variable $x$ and the latent variable $z$. Then, the complete-data log-likelihood can be written as
$$ \ell_c(\theta) = \sum_{i=1}^{N} \log \{ f(z_i; \theta) \, f(x_i \mid z_i; \theta) \}. \quad (1) $$
The E-step computes the expected value of the complete-data log-likelihood in (1) with respect to the latent variable and using the current fit for $\theta$, the so-called Q-function:
$$ Q(\theta \mid \theta^{(t)}) = \mathbb{E}[\ell_c(\theta)] = \mathbb{E}\left[ \sum_{i=1}^{N} \log f(z_i; \theta) \, f(x_i \mid z_i; \theta) \right]. \quad (2) $$
The M-step updates the estimate of $\theta$ by maximising the Q-function in (2), such that:
$$ \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}). $$
In the GMM case, the complete-data log-likelihood is
$$ \ell_c(\theta) = \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \{ \log \pi_g + \log \phi(x_i; \mu_g, \Sigma_g) \}. \quad (3) $$
At iteration $t$ of the EM algorithm, the Q-function is
$$ Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{N} \sum_{g=1}^{G} \mathbb{E}\left[ z_{ig} \{ \log \pi_g^{(t)} + \log \phi(x_i; \mu_g^{(t)}, \Sigma_g^{(t)}) \} \right], $$
so the E-step computes the expected value of $z_{ig}$ as
$$ \hat{z}_{ig}^{(t)} = \mathbb{E}[z_{ig} \mid x_i] = \frac{\pi_g^{(t-1)} \, \phi(x_i; \mu_g^{(t-1)}, \Sigma_g^{(t-1)})}{\sum_{k=1}^{G} \pi_k^{(t-1)} \, \phi(x_i; \mu_k^{(t-1)}, \Sigma_k^{(t-1)})}. $$
The parameters of the mixture components are updated in the M-step by maximising
$$ Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{N} \sum_{g=1}^{G} \hat{z}_{ig}^{(t)} \{ \log \pi_g + \log \phi(x_i; \mu_g, \Sigma_g) \} $$
with respect to the parameters $(\pi_g, \mu_g, \Sigma_g)$ for $g = 1, \ldots, G$. The procedure is repeated until convergence, and the estimated parameters $(\hat{\pi}_g, \hat{\mu}_g, \hat{\Sigma}_g)$ for $g = 1, \ldots, G$ are returned.
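To make the alternation concrete, the following minimal R sketch runs this E/M cycle with the estep() and mstep() functions of the mclust package (the package used later in Section 5). The simulated two-component data, the choice of the "VVV" model, and the stopping tolerance are illustrative choices of ours, not taken from the paper.

```r
library(mclust)
set.seed(1)

## Illustrative two-component data in R^2
X <- rbind(MASS::mvrnorm(150, c(-2, 0), diag(2)),
           MASS::mvrnorm(150, c( 2, 0), diag(2)))

## Initialise z from a random hard partition
z <- unmap(sample(1:2, nrow(X), replace = TRUE))

loglik <- -Inf
repeat {
  ## M-step: update (pi_g, mu_g, Sigma_g) for a given covariance model
  M <- mstep(data = X, modelName = "VVV", z = z)
  ## E-step: recompute posterior membership probabilities and log-likelihood
  E <- estep(data = X, modelName = "VVV", parameters = M$parameters)
  z <- E$z
  if (E$loglik - loglik < 1e-6) break   # stop when the log-likelihood stalls
  loglik <- E$loglik
}
```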
Celeux and Govaert (1995) provide details of the M-step for different covariance parameterisations, some of which have closed-form solutions, while others require numerical procedures.

3 Handling missing data in Gaussian mixture models

Using the previous notation, a single observation $x_i$ is partitioned as $x_i = [x_i^{(o)}, x_i^{(m)}]^\top$, where $x_i^{(o)}$ is the observed part and $x_i^{(m)}$ is the missing part of the data vector. Note that, because the missing pattern of each observation could be different, the subscript $i$ should be included on the superscripts $(o)$ and $(m)$, but for ease of presentation and reading we avoid that notation. If the vector $x_i$ is assumed to be Gaussian, that is $[x_i^{(o)}, x_i^{(m)}]^\top \sim N(\mu, \Sigma)$, with
$$ \mu = \begin{bmatrix} \mu_o \\ \mu_m \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_{oo} & \Sigma_{om} \\ \Sigma_{mo} & \Sigma_{mm} \end{bmatrix}, $$
where $\Sigma_{mo} = \Sigma_{om}^\top$, then, by the conditional property of the Gaussian distribution, we may write the conditional distribution of the missing part given the observed part of the data as
$$ x_i^{(m)} \mid x_i^{(o)} \sim N(\mu^{(m)}, \Sigma^{(m)}), $$
with
$$ \mu^{(m)} = \mu_m + \Sigma_{mo} \Sigma_{oo}^{-1} (x_i^{(o)} - \mu_o) \quad \text{and} \quad \Sigma^{(m)} = \Sigma_{mm} - \Sigma_{mo} \Sigma_{oo}^{-1} \Sigma_{om}. $$
Thus, the conditional distribution of the missing part given the observed part follows a multivariate Gaussian distribution, with $\mathbb{E}[x^{(m)} \mid x^{(o)}] = \mu^{(m)}$ and $\mathbb{V}[x^{(m)} \mid x^{(o)}] = \Sigma^{(m)}$.

In the GMM context, the underlying density function of each component is assumed to be a multivariate Gaussian distribution, that is $x_i \mid (z_{ig} = 1) \sim N(\mu_g, \Sigma_g)$. In the presence of missing values, we may write:
$$ \begin{bmatrix} x_i^{(o)} \\ x_i^{(m)} \end{bmatrix} \Bigm| \, (z_{ig} = 1) \sim N\left( \begin{bmatrix} \mu_{o,g} \\ \mu_{m,g} \end{bmatrix}, \begin{bmatrix} \Sigma_{oo,g} & \Sigma_{om,g} \\ \Sigma_{mo,g} & \Sigma_{mm,g} \end{bmatrix} \right), $$
so that $x_i^{(o)} \mid (z_{ig} = 1) \sim N(\mu_g^{(o)}, \Sigma_g^{(o)})$ with $\mu_g^{(o)} = \mu_{o,g}$ and $\Sigma_g^{(o)} = \Sigma_{oo,g}$, and
$$ x_i^{(m)} \mid (x_i^{(o)}, z_{ig} = 1) \sim N(\mu_g^{(m)}, \Sigma_g^{(m)}), $$
with
$$ \mu_g^{(m)} = \mu_{m,g} + \Sigma_{mo,g} \Sigma_{oo,g}^{-1} (x_i^{(o)} - \mu_{o,g}) \quad \text{and} \quad \Sigma_g^{(m)} = \Sigma_{mm,g} - \Sigma_{mo,g} \Sigma_{oo,g}^{-1} \Sigma_{om,g}. $$
Furthermore, by the conditional property of the Gaussian distribution, the joint distribution of the observed part and the missing part given the group membership can be factorised as the product of two Gaussian densities:
$$ \phi_g(x_i^{(o)}, x_i^{(m)}; \theta_g) = \phi(x_i^{(o)}; \mu_g^{(o)}, \Sigma_g^{(o)}) \, \phi(x_i^{(m)} \mid x_i^{(o)}; \mu_g^{(m)}, \Sigma_g^{(m)}). $$

Ghahramani and Jordan (1995) proposed an extension of the standard EM algorithm to deal with missing values (see also Hunt and Jorgensen, 2003). The complete-data log-likelihood in this case can be written as
$$ \ell_c(\theta) = \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \left\{ \log \pi_g + \log \phi(x_i^{(o)}; \mu_g^{(o)}, \Sigma_g^{(o)}) + \log \phi(x_i^{(m)} \mid x_i^{(o)}; \mu_g^{(m)}, \Sigma_g^{(m)}) \right\}. $$
Thus, the conditional expectation in the E-step takes the following form:
$$ Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{N} \sum_{g=1}^{G} \mathbb{E}\left[ z_{ig} \left\{ \log \pi_g + \log \phi(x_i^{(o)}; \mu_g^{(o)}, \Sigma_g^{(o)}) + \log \phi(x_i^{(m)} \mid x_i^{(o)}; \mu_g^{(m)}, \Sigma_g^{(m)}) \right\} \Bigm| x_i^{(o)} \right]. \quad (4) $$
In this case there are two sources of missingness, one related to the unknown classification, and one connected with the missing part of the data. Therefore, additional expectations must be introduced to solve the E-step.
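The conditional mean and covariance above translate directly into code. Below is a minimal R sketch; the helper name cond_normal() is ours and not part of mclust.

```r
## Conditional distribution of the missing coordinates given the observed ones
## for a single Gaussian component: x^(m) | x^(o) ~ N(mu_cond, Sigma_cond).
## `obs` is a logical vector flagging the observed coordinates of x.
cond_normal <- function(x, mu, Sigma, obs) {
  m <- !obs
  Soo_inv <- solve(Sigma[obs, obs, drop = FALSE])
  mu_cond <- mu[m] + Sigma[m, obs, drop = FALSE] %*% Soo_inv %*% (x[obs] - mu[obs])
  Sigma_cond <- Sigma[m, m, drop = FALSE] -
    Sigma[m, obs, drop = FALSE] %*% Soo_inv %*% Sigma[obs, m, drop = FALSE]
  list(mean = drop(mu_cond), var = Sigma_cond)
}

## Example: second coordinate missing
cond_normal(x = c(1, NA), mu = c(0, 0),
            Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2), obs = c(TRUE, FALSE))
```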
By solving these expectations, it is possible to obtain a closed-form expression for the M-step under the assumption of unconstrained within-component covariance matrices (Ghahramani and Jordan, 1995; Hunt and Jorgensen, 2003; Eirola et al., 2014). However, the flexible parameterisations of the covariance matrices described in Section 2.1 have not been taken into account. For this reason, we aim at proposing methods to solve the Q-function in (4) for the general family of parsimonious GMMs.

In this paper we propose two versions of a Missing Monte Carlo EM (MMCEM) approach. Both versions are based on the idea of using a Monte Carlo EM algorithm (MCEM; Wei and Tanner, 1990) to approximate the expected values required in the E-step.

Starting from equation (4), the first idea is to directly use MC approximations to solve the additional expectations generated by the presence of missing values. The expected complete-data log-likelihood can be approximated as follows:
$$ Q(\theta \mid \theta^{(t)}) \approx \frac{1}{S} \sum_{s=1}^{S} \sum_{i=1}^{N} \sum_{g=1}^{G} \hat{z}_{ig,s} \left\{ \log \pi_g + \log \phi(x_i^{(o)}; \mu_g^{(o)}, \Sigma_g^{(o)}) + \log \phi(\hat{x}_{i,s}^{(m)} \mid x_i^{(o)}; \mu_g^{(m)}, \Sigma_g^{(m)}) \right\}, \quad (5) $$
where $S$ is the MC sample size, $\hat{z}_{ig,s}$ is the simulated indicator variable, and $\hat{x}_{i,s}^{(m)}$ is the imputed value. Comparing equation (3) and equation (5), it is clear that the latter is the complete-data log-likelihood of a GMM for the augmented dataset of dimension $(NS \times d)$. Drawing $S$ indicator variables $\hat{z}_{ig,s}$ for each observation $i$ from the conditional distribution of $z_{ig} \mid x_i^{(o)}$, the missing part is imputed $S$ times using the conditional distribution of $x_i^{(m)} \mid (x_i^{(o)}, \hat{z}_{ig,s})$, i.e. conditional on the simulated group membership $\hat{z}_{ig,s}$ and on the observed part $x_i^{(o)}$, to obtain $\hat{x}_{i,s}^{(m)}$. We refer to this method as MMCEM1.

The second approach employs the MC approximation in a different way, together with the law of total expectation. For a general density function $h(\cdot)$, the E-step of the MMCEM1 method described above is approximated as
$$ \mathbb{E}[z_{ig} h(x_i)] \approx \frac{1}{S} \sum_{s=1}^{S} z_{ig,s} \, h(x_{i,s}). \quad (6) $$
Since equation (6) can be rewritten as
$$ \mathbb{E}[z_{ig} h(x_i)] = \mathbb{E}\left[ z_{ig} \, \mathbb{E}[h(x_i) \mid z_{ig}] \right], $$
the inner expected value can be computed using MC approximations, and the outer expected value can be written in closed form, i.e.
$$ \mathbb{E}[z_{ig} h(x_i)] \approx \mathbb{E}\left[ z_{ig} \frac{1}{S} \sum_{s=1}^{S} [h(x_{i,s}) \mid z_{ig}] \right] = \hat{z}_{ig} \frac{1}{S} \sum_{s=1}^{S} [h(x_{i,s}) \mid z_{ig}]. \quad (7) $$
Using the approach in equation (7), the expected complete-data log-likelihood can be approximated as follows:
$$ Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{N} \sum_{g=1}^{G} \mathbb{E}\left[ z_{ig} \left\{ \log \pi_g + \log \phi(x_i^{(o)}, x_i^{(m)}; \mu_g, \Sigma_g) \right\} \Bigm| x_i^{(o)} \right] $$
$$ \approx \sum_{i=1}^{N} \sum_{g=1}^{G} \mathbb{E}\left[ z_{ig} \frac{1}{S} \sum_{s=1}^{S} \left\{ \log \pi_g + \log \phi(x_i^{(o)}, \hat{x}_{i,sg}^{(m)}; \mu_g, \Sigma_g) \right\} \right] $$
$$ = \frac{1}{S} \sum_{s=1}^{S} \sum_{i=1}^{N} \sum_{g=1}^{G} \hat{z}_{ig} \left\{ \log \pi_g + \log \phi(x_i^{(o)}, \hat{x}_{i,sg}^{(m)}; \mu_g, \Sigma_g) \right\}, \quad (8) $$
where $\hat{z}_{ig}$ is the expected group membership indicator variable for observation $i$, computed using the conditional distribution of $z_{ig} \mid x_i^{(o)}$, and $\hat{x}_{i,sg}^{(m)}$ is the missing part of observation $i$, which is imputed $S$ times for each group $g = 1, \ldots, G$. Also in this case, equation (8) is the complete-data log-likelihood of the augmented dataset. We refer to this method as MMCEM2.
We refer tothis method as MMCEM2.Note that, since the standard M-step is unchanged, both approaches introduced inthis Section allow to estimate the parameters for all the parsimonious GMMs obtainedby adopting the different parameterisations of the covariance matrices. Next sectionprovides further details concerning the proposed algorithms. The approximations discussed in the previous Section yield two different methods forcomputing the E-step; see equation (5) and equation (8). This section describes furthercomputational details concerning the proposed extensions of the EM algorithm forGMMs with missing values. 8 .1 MMCEM1 algorithm
4.1.1 Initialisation

Initialisation is an important step in any iterative algorithm, especially when the function to maximise is multimodal, as in the case of GMMs. Proportions, means, and covariance matrices of the mixture components must be provided to start the EM algorithm. They can be estimated from the fully observed dataset obtained after removing the observations with missing entries. The classification matrix $z$ is then initialised using the observed part of each observation:
$$ \hat{z}_{ig} = \mathbb{E}[z_{ig} \mid x_i^{(o)}] = \frac{\pi_g \, \phi(x_i^{(o)}; \mu_g^{(o)}, \Sigma_g^{(o)})}{\sum_{k=1}^{G} \pi_k \, \phi(x_i^{(o)}; \mu_k^{(o)}, \Sigma_k^{(o)})}. $$

4.1.2 E-step

The imputation of the missing values is performed directly in the E-step using the following steps:

1. At iteration $t + 1$, draw $(\hat{z}_{i,1}, \ldots, \hat{z}_{i,S})$ for each $i = 1, \ldots, N$ from a Multinomial distribution with parameter
$$ \hat{z}_{ig,s} = \frac{\pi_g^{(t)} \, \phi(x_i^{(o)}; \mu_g^{(o)(t)}, \Sigma_g^{(o)(t)})}{\sum_{k=1}^{G} \pi_k^{(t)} \, \phi(x_i^{(o)}; \mu_k^{(o)(t)}, \Sigma_k^{(o)(t)})}, $$
where $\hat{z}_{ig,s}$ is the $s$th draw from the conditional probability that observation $i$ belongs to cluster $g$, given the observed part of the data and the previous estimates of the parameters. For observations with missing values, only the observed part is considered.

2. After simulating the classification matrix, the missing values are imputed. Each missing value is simulated from $\phi(x_i^{(m)} \mid x_i^{(o)}, \hat{z}_{ig,s}; \mu_g, \Sigma_g)$ for $s = 1, 2, \ldots, S$.
3. The new augmented dataset $\hat{x}$ of dimension $(NS \times d)$ is given by the union of the $S$ imputed datasets $\hat{x}_s$ of dimension $(N \times d)$, and the new classification matrix $\hat{z}$ of dimension $(NS \times G)$ is given by the union of the $S$ classification matrices $\hat{z}_s$ of dimension $(N \times G)$ drawn at step 1. For instance, a generic observation $i$ containing at least one missing value is represented in the augmented dataset as
$$ \begin{bmatrix} x_i^{(o)} & \hat{x}_{i,1}^{(m)} \\ x_i^{(o)} & \hat{x}_{i,2}^{(m)} \\ \vdots & \vdots \\ x_i^{(o)} & \hat{x}_{i,S}^{(m)} \end{bmatrix}, \qquad \begin{bmatrix} \hat{z}_{i1,1}/S & \hat{z}_{i2,1}/S & \cdots & \hat{z}_{iG,1}/S \\ \hat{z}_{i1,2}/S & \hat{z}_{i2,2}/S & \cdots & \hat{z}_{iG,2}/S \\ \vdots & \vdots & \ddots & \vdots \\ \hat{z}_{i1,S}/S & \hat{z}_{i2,S}/S & \cdots & \hat{z}_{iG,S}/S \end{bmatrix}. $$
The new dataset $\hat{x}$ and classification matrix $\hat{z}$ are then used in the standard M-step; a sketch of one full iteration is given below.

4.1.3 M-step

The parameters, and the classification matrix obtained according to the MAP principle for assigning the observations to a given cluster, are computed using the standard M-step for GMMs applied to the augmented data $\hat{x}$ $(NS \times d)$ and the classification matrix $\hat{z}$ $(NS \times G)$.

4.1.4 Iterations and convergence

Since standard convergence criteria for the EM algorithm cannot be used in this case because of the MC error, iterations and convergence checks are performed as described below:

1. EM steps are run for $T$ iterations, and only the parameters associated with the largest value of the log-likelihood are stored.
2. In the following EM steps, the parameters are updated when the log-likelihood improves compared to the previous step.
3. The algorithm is stopped if the log-likelihood does not increase for $K$ successive iterations.

Two tuning parameters must be set in the above algorithm. If the initial number of "warm-up" iterations $T$ is large, the algorithm potentially explores a large part of the parameter space at the cost of increased computing time, and vice versa when $T$ is small. The second parameter $K$ specifies the number of "stalled iterations" allowed, so we can control how stringent the adopted criterion for declaring convergence is. A sensible choice of these tuning parameters is needed to balance the efficiency and accuracy of the MCEM.
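Putting the three E-step steps together with the unchanged M-step, the following R sketch performs one MMCEM1 iteration under the unconstrained "VVV" model. It assumes the cond_normal() helper from Section 3 and the MASS and mvtnorm packages; it is a sketch of the scheme described above, not the authors' implementation.

```r
library(mclust)

## One MMCEM1 iteration: build the augmented dataset in the E-step,
## then call the unchanged mstep(). `par` holds the current parameters
## in mclust's format for the "VVV" model; X is (N x d) with NAs.
mmcem1_step <- function(X, par, S, modelName = "VVV") {
  N <- nrow(X); G <- length(par$pro)
  Xaug <- NULL; Zaug <- NULL
  for (i in 1:N) {
    obs <- !is.na(X[i, ])
    ## conditional probability of each component given the observed part only
    num <- sapply(1:G, function(g)
      par$pro[g] * mvtnorm::dmvnorm(X[i, obs], par$mean[obs, g],
                                    as.matrix(par$variance$sigma[obs, obs, g])))
    zhat <- num / sum(num)
    for (s in 1:S) {
      g  <- sample(1:G, 1, prob = zhat)     # step 1: draw the component label
      xi <- X[i, ]
      if (any(!obs)) {                      # step 2: impute x^(m) | x^(o), z
        cn <- cond_normal(xi, par$mean[, g], par$variance$sigma[, , g], obs)
        xi[!obs] <- MASS::mvrnorm(1, cn$mean, cn$var)
      }
      ## step 3: stack the S draws; the 1/S factor in (5) rescales the
      ## Q-function but not its maximiser, so plain indicators suffice here
      Xaug <- rbind(Xaug, xi)
      Zaug <- rbind(Zaug, replace(numeric(G), g, 1))
    }
  }
  mstep(data = Xaug, modelName = modelName, z = Zaug)  # standard M-step
}
```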
4.2 MMCEM2 algorithm

MMCEM1 and MMCEM2 differ only in the E-step, so both methods share the same initialisation, iterations, convergence criterion, and M-step (see, respectively, Sections 4.1.1, 4.1.4, and 4.1.3). In the following we provide details only for the E-step.

As in the MMCEM1 algorithm, the imputation is performed during the E-step. The algorithm used to build the augmented dataset is the following:

1. For each observation, simulate $S$ values $\hat{x}_{ig,s}$ from $x_i^{(m)} \mid (x_i^{(o)}, z_{ig} = 1) \sim N(\mu_g^{(m)(t)}, \Sigma_g^{(m)(t)})$, for $g = 1, \ldots, G$.

2. Construct an augmented dataset where each observation is replicated $G \times S$ times, pairing the imputations obtained under component $g$ with weight $\hat{z}_{ig}/S$ in column $g$ of the classification matrix and 0 elsewhere:
$$ \begin{bmatrix} x_i^{(o)} & \hat{x}_{i,11}^{(m)} \\ \vdots & \vdots \\ x_i^{(o)} & \hat{x}_{i,1S}^{(m)} \\ x_i^{(o)} & \hat{x}_{i,21}^{(m)} \\ \vdots & \vdots \\ x_i^{(o)} & \hat{x}_{i,2S}^{(m)} \\ \vdots & \vdots \\ x_i^{(o)} & \hat{x}_{i,G1}^{(m)} \\ \vdots & \vdots \\ x_i^{(o)} & \hat{x}_{i,GS}^{(m)} \end{bmatrix}, \qquad \begin{bmatrix} \hat{z}_{i1}/S & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ \hat{z}_{i1}/S & 0 & \cdots & 0 \\ 0 & \hat{z}_{i2}/S & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & \hat{z}_{i2}/S & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{z}_{iG}/S \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \hat{z}_{iG}/S \end{bmatrix}. $$
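The corresponding E-step can be sketched in R as follows, reusing cond_normal() and the same mclust-style parameter format as in the MMCEM1 sketch; again this is an illustration, not the authors' code.

```r
## MMCEM2 E-step (a sketch): impute the missing part S times under *each*
## component g and weight the G*S replicates by zhat_ig / S.
mmcem2_estep <- function(X, par, S) {
  N <- nrow(X); G <- length(par$pro)
  Xaug <- NULL; Zaug <- NULL
  for (i in 1:N) {
    obs <- !is.na(X[i, ])
    num <- sapply(1:G, function(g)
      par$pro[g] * mvtnorm::dmvnorm(X[i, obs], par$mean[obs, g],
                                    as.matrix(par$variance$sigma[obs, obs, g])))
    zhat <- num / sum(num)                  # E[z_ig | x^(o)] in closed form
    for (g in 1:G) for (s in 1:S) {
      xi <- X[i, ]
      if (any(!obs)) {
        cn <- cond_normal(xi, par$mean[, g], par$variance$sigma[, , g], obs)
        xi[!obs] <- MASS::mvrnorm(1, cn$mean, cn$var)
      }
      Xaug <- rbind(Xaug, xi)
      Zaug <- rbind(Zaug, replace(numeric(G), g, zhat[g] / S))
    }
  }
  list(x = Xaug, z = Zaug)
}
```

Because each original observation is spread over $G \times S$ augmented rows, the columns of the returned z matrix sum to the expected cluster sizes rather than matching the number of augmented rows; after calling mstep() on this output the mixing proportions should therefore be renormalised to sum to one (means and covariances, being weighted averages, are unaffected).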
5 Empirical results

In this section, the proposed algorithms are evaluated on both simulated and real datasets to assess their performance in terms of clustering and density estimation. The software package mclust, freely available on CRAN (https://cran.r-project.org/web/packages/mclust/index.html) for the R language (R Core Team, 2019), provides fitting of finite mixtures of Gaussian distributions through the EM algorithm (Scrucca et al., 2016). In particular, the function mstep() can be used to perform the maximisation step for each of the fourteen models in the parsimonious GMM family generated by the eigen-decomposition of the covariance matrices discussed in Section 2.1. This function, together with our code implementing the two versions of the E-step, has been used to build the algorithms described in Section 4.

The methods included in the comparison are:

1. EM algorithm with MC approximations of the E-step as presented in Section 4.1 (MMCEM1).
2. EM algorithm with MC approximations of the E-step as presented in Section 4.2 (MMCEM2).
3. Multiple imputation (Schafer, 1997), where $N_{imp}$ different imputed datasets are generated, and for each of these the standard EM algorithm is applied to estimate the density and the final clustering (GMMMI). The imputeData() function in the mclust package is used to impute the missing values, and the Mclust() function is used to estimate GMMs on the imputed datasets. A label switching strategy is also implemented, using the majority vote to assign the observations to the different clusters. The number of multiple imputations is set at 50.
4. Gaussian mixture on the original dataset before introducing the missing values (GMM). The Mclust() function from the mclust package is used to estimate the parameters of the Gaussian mixture.
These estimates represent the benchmark to which the different methods should tend in the presence of missing values; the closer the values are to these estimates, the better a method can recover the missing information.

The MC sample size is one of the most important tuning parameters of our proposed algorithms. This parameter must balance the precision of the method and the computational efficiency. Large MC sample sizes imply a higher probability of convergence to the true value, and therefore greater precision. In contrast, small MC sample sizes improve the speed of the algorithm to the detriment of the accuracy of the estimates. In our experiments we set the MC sample size to $S = 10$.
This relatively small value provides a conservative assessment of the precision and efficiency of the MMCEM1 and MMCEM2 algorithms. Moreover, the "warm-up" parameter is set at $T = 10$, and the "stalled iterations" parameter is set at $K = 1$. Since the largest improvements of the log-likelihood are likely to happen in the initial iterations of the EM algorithm, we tried to replicate conditions similar to the standard EM algorithm. Clearly, larger MC sample sizes and larger values of the tuning parameters could only improve model fitting, at the cost of increased computing effort.
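One way to read the stopping rule of Section 4.1.4 with these tuning parameters as code is sketched below; one_mmcem_iteration() is a hypothetical placeholder for a full E+M cycle of either algorithm, and whether the chain continues from the best or the latest fit is our interpretation.

```r
## Sketch of the stopping rule with "warm-up" length T and stall limit K.
## one_mmcem_iteration() is a placeholder assumed to return the updated
## parameters together with their log-likelihood.
run_mmcem <- function(X, par0, S, warmup = 10, stall = 1) {
  best <- list(par = par0, loglik = -Inf)
  for (t in 1:warmup) {                 # keep only the best fit seen so far
    fit <- one_mmcem_iteration(X, best$par, S)
    if (fit$loglik > best$loglik) best <- fit
  }
  stalled <- 0
  while (stalled < stall) {             # stop after K non-improving iterations
    fit <- one_mmcem_iteration(X, best$par, S)
    if (fit$loglik > best$loglik) { best <- fit; stalled <- 0 }
    else stalled <- stalled + 1
  }
  best$par
}
```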
5.1 Simulated data

Simulated datasets are generated from eight different scenarios using mixtures of Gaussian distributions with $d = 2$ variables and $G = 3$ components. Details for each scenario are the following:

(a) Three well-separated clusters from a Gaussian mixture with mean vectors $\mu_1 = [4, 4]^\top$, $\mu_2$, and $\mu_3 = [-4, 4]^\top$, and a common covariance matrix $\Sigma$. This corresponds to model EEE in the mclust nomenclature. The mixing probabilities are all equal to 1/3.

(b) Three well-separated clusters from a strongly unbalanced mixture of Gaussians, with the prior probabilities $\pi$ set to strongly unequal values and the remaining parameters set as in the previous scenario.

(c) A three-group case with two overlapping clusters, having mean vectors $\mu_1 = [2, 3.5]^\top$, $\mu_2 = [-2, 3]^\top$, and $\mu_3 = [0, 2]^\top$, and a common covariance matrix $\Sigma$. This corresponds to model EEE in the mclust nomenclature. The mixing probabilities are all equal to 1/3.

(d) A three-group case with two overlapping and strongly unbalanced clusters, with the prior probabilities $\pi$ set to strongly unequal values and the remaining parameters set as in the previous scenario.

(e) Three well-separated clusters with mean vectors $\mu_1 = [4, 4]^\top$, $\mu_2$, and $\mu_3 = [-4, 4]^\top$, and unconstrained covariance matrices $\Sigma_1$, $\Sigma_2$, and $\Sigma_3$. This corresponds to model VVV in the mclust nomenclature. The mixing probabilities are all equal to 1/3.

(f) Three well-separated clusters from a strongly unbalanced mixture of Gaussians, with the prior probabilities $\pi$ set to strongly unequal values and the remaining parameters set as in the previous scenario.

(g) A three-group case with two overlapping clusters, having mean vectors $\mu_1$, $\mu_2 = [-3, 0]^\top$, and $\mu_3 = [-4, 4]^\top$, and unconstrained covariance matrices $\Sigma_1$, $\Sigma_2$, and $\Sigma_3$. This corresponds to model VVV in the mclust nomenclature. The mixing probabilities are all equal to 1/3.

(h) A three-group case with two overlapping and strongly unbalanced clusters, with the prior probabilities $\pi$ set to strongly unequal values and the remaining parameters set as in the previous scenario.

Scenarios (a)-(b)-(e)-(f) are relatively simpler cases compared to scenarios (c)-(d)-(g)-(h). In the latter cases, because the clusters have substantial overlap, detecting the exact number of components can be particularly difficult. The unbalanced cases in scenarios (b)-(d)-(f)-(h) are used to assess the effect of missing values when some clusters have small sizes. The same number of observations $N$ is used for all the above scenarios. A sketch of the data-generating step for scenario (a) is given below.

Figure 1: Some simulated datasets from scenarios having EEE configurations: (a) balanced clusters; (b) unbalanced clusters; (c) balanced overlapping clusters; (d) unbalanced overlapping clusters.

Figure 2: Some simulated datasets from scenarios having VVV configurations: (e) balanced clusters; (f) unbalanced clusters; (g) balanced overlapping clusters; (h) unbalanced overlapping clusters.
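As an illustration, a dataset in the spirit of scenario (a) can be generated as follows; the value of $N$, the entries of the common covariance matrix, and the mean of the second component are illustrative choices of ours.

```r
library(MASS)
set.seed(123)

## Simulate one dataset in the spirit of scenario (a): three balanced,
## well-separated clusters sharing a common covariance matrix (model EEE).
N     <- 300
pro   <- c(1/3, 1/3, 1/3)
mu    <- list(c(4, 4), c(0, -4), c(-4, 4))   # mu2 chosen for illustration
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2)     # common covariance (EEE)

cl <- sample(1:3, N, replace = TRUE, prob = pro)
X  <- t(sapply(cl, function(g) mvrnorm(1, mu[[g]], Sigma)))
```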
A missing mechanism is applied to each dataset with the missing percentage set to about 50%, i.e. approximately half of the observations have at least one missing value. The details differ depending on whether the MCAR or the MAR mechanism is used.

For the MCAR mechanism, incomplete observations are randomly selected, and for each selected observation the variable to contain the missing value is also selected at random. These two steps guarantee that the mechanism is MCAR, because the missing values are a random sample of all data values.

To generate data under the MAR mechanism, the following process is adopted. Let $M$ be an $(N \times d)$ matrix of indicator variables, such that $m_{ij} = 0$ if $x_{ij}$ is missing, and 1 otherwise. From the definition in Schafer (1997), the missing values are MAR if:
$$ P(M \mid x^{(o)}, x^{(m)}; \theta) = P(M \mid x^{(o)}; \theta). \quad (9) $$
However, to replicate a MAR mechanism it is necessary to specify the probabilities in (9). Since we are generating two-dimensional datasets, without loss of generality we may assume that the first variable is completely observed, while the second variable contains all the missing values. The probability of a missing value on the second variable for the $i$th observation can then be modelled using an inverse logit transformation:
$$ P(m_{i2} = 0 \mid x^{(o)}; \theta) = \frac{\exp(\beta x_{i1})}{1 + \exp(\beta x_{i1})}, $$
for all $i = 1, \ldots, N$, where the slope $\beta$ controls how strongly the probability of missingness depends on the fully observed first variable. A sketch of both generation schemes is given below.
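A minimal sketch of the two missing-data mechanisms follows, applied to the matrix X from the scenario sketch above; the MCAR proportion and the MAR slope beta are illustrative values.

```r
set.seed(123)

## MCAR: pick rows at random, then blank one randomly chosen variable in each.
make_mcar <- function(X, prop = 0.5) {
  rows <- sample(nrow(X), floor(prop * nrow(X)))
  for (i in rows) X[i, sample(ncol(X), 1)] <- NA
  X
}

## MAR: the first variable is fully observed; the probability that the
## second variable is missing depends on x_i1 through an inverse logit.
make_mar <- function(X, beta = 1) {
  p_miss <- plogis(beta * X[, 1])       # exp(b x) / (1 + exp(b x))
  miss <- runif(nrow(X)) < p_miss
  X[miss, 2] <- NA
  X
}

Xmcar <- make_mcar(X)
Xmar  <- make_mar(X)
```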
To evaluate the accuracy of the density estimates, the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) is used. Let $f(x)$ and $g(x)$ be two density functions; then the KL divergence is defined as follows:
$$ D(f \,\|\, g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx, \quad (10) $$
where $D(f \,\|\, g) \geq 0$, with equality if and only if $f = g$. Here $f(x)$ is taken to be the true density of the simulated data, whereas $g(x)$ is the density estimated using one of the methods under comparison. Thus, for a good density estimation method, the KL divergence should be as close to zero as possible. Since GMMs do not have a closed-form expression for (10), an MC approximation must be used (Hershey and Olsen, 2007).

To compare the estimated classification with the true clusters, the Adjusted Rand Index (ARI; Hubert and Arabie, 1985) is used. This measures the agreement between two partitions, with expected value 0 in the case of random partitions, and a value equal to 1 in the case of perfect agreement. The true partition is thus compared with the classification provided by the GMM methods under comparison in the presence of missing values.

Models are estimated first by fixing the number of components and the parameterisation of the component covariances used to generate the data, and then by selecting the number of mixture components by the Bayesian Information Criterion (BIC; Schwarz, 1978). A last configuration is considered in which both the number of components and the parsimonious decomposition of the component-covariance matrix are selected by BIC. The last two situations allow us to investigate the influence of missing values on GMM parameter estimation when the number of groups, the component-covariance matrix, or both are not known a priori.
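The MC approximation of (10) amounts to sampling from the true mixture and averaging log-density differences. A minimal sketch with mclust's sim() and dens() functions follows; par_true is assumed to be a parameter list in mclust's format (e.g. the $parameters slot of a fitted Mclust object).

```r
library(mclust)
set.seed(1)

## MC approximation of the KL divergence in (10) between two Gaussian
## mixtures: draw from the true density f and average log f(x) - log g(x).
kl_mc <- function(par_true, model_true, fit, M = 1e4) {
  Xs <- sim(modelName = model_true, parameters = par_true, n = M)[, -1]
  logf <- dens(data = Xs, modelName = model_true,
               parameters = par_true, logarithm = TRUE)
  logg <- dens(data = Xs, modelName = fit$modelName,
               parameters = fit$parameters, logarithm = TRUE)
  mean(logf - logg)   # close to zero when the estimated density matches f
}
```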
Figures 3–6 show the results of the simulation study using box-plots for each method in each scenario.

In general, the proposed methods clearly outperform the multiple imputation approach in all scenarios in terms of density estimation accuracy. To this goal, MMCEM1 and MMCEM2 are essentially equivalent and close to the estimates obtained using the complete dataset (GMM). The same also applies to cluster identification, with few exceptions where the three methods are roughly comparable. Multiple imputation (GMMMI) appears to be no worse than MMCEM1 and MMCEM2 only when clusters are unbalanced and overlapping.

When the number of groups is selected by BIC, the proposed methods again perform better than the multiple imputation approach, both in terms of clustering accuracy and density estimation. In particular, by freeing the number of clusters, the MMCEM methods outperform multiple imputation in terms of classification also in the overlapping case with unbalanced clusters. In addition, as expected, GMM selects three groups, whereas MMCEM1 and MMCEM2 select three components in the majority of cases. Conversely, the multiple imputation approach (GMMMI) tends to select more components than the other methods in the well-separated cases, and fewer components in the overlapping cases (see Figures 7–8).

If the covariance matrix is constrained to be equal among the components, the multiple imputation approach tends to select more clusters than the original number of groups. This may be due to the imputation step generating points that fill the gaps between the clusters; as a consequence, imposing the ellipsoids to be equal increases the number of mixture components required.

When both the number of components and the covariance model are unknown and selected by the BIC criterion, our MMCEM methods tend to outperform multiple imputation, with values close to the estimates provided by the GMM on the original data. This suggests that our methods are able to recover the original structure of the data.

Another element arises from the simulations. The MMCEM2 algorithm appears to perform better than the MMCEM1 algorithm, with smaller standard errors, indicating a more precise method. Such behaviour can also be noted in the distribution of the number of estimated components in Figures 7–8: in most cases MMCEM2 selects the correct number of clusters, whereas MMCEM1 shows much more variability in selecting the number of components.

Figure 3: Box-plots of ARI values from the simulation study for the 3-cluster cases with balanced mixture proportions under different missing mechanisms (larger values are better).
Figure 4: Box-plots of KL values from the simulation study for the 3-cluster case with balanced mixture proportions under different missing mechanisms (smaller values are better).
Figure 5: Box-plots of ARI values from the simulation study for the 3-cluster case with unbalanced mixture proportions under different missing mechanisms (larger values are better).
Figure 6: Box-plots of KL values from the simulation study for the 3-cluster case with unbalanced mixture proportions under different missing mechanisms (smaller values are better).
Figure 7: Distribution of the number of mixture components selected by BIC under the MCAR missing mechanism.
Figure 8: Distribution of the number of mixture components selected by BIC under the MAR missing mechanism.

5.2 Diabetes data

Reaven and Miller (1979) provided data from a diabetes study conducted at the Stanford Clinical Research Center. Blood chemistry measurements were recorded on 145 non-obese adult subjects, namely the area under the plasma glucose curve, the area under the plasma insulin curve, and the steady state plasma glucose level. After further analysis, the patients were classified into three groups: a group of patients suffering from overt diabetes (Overt), a group affected by chemical diabetes (Chemical), and a last group of patients without diabetes (Normal). The dataset is available in the R package mclust.

Missing values are introduced according to the MCAR mechanism under two different missing data patterns. In the first pattern, a single missing value is randomly assigned to a given percentage of the sample observations. Setting this percentage at approximately 30% and 50% yields, respectively, 43 and 72 observations with a single missing value out of 145 observations. In the second pattern, the percentage of missing values refers to the overall number of measurements. Hence, setting the percentage at approximately 30% and 50%, the total numbers of missing values randomly entered into the data matrix are 130 and 217, respectively, out of 435 total measurements.

In this real data example, we aim at comparing the clustering performance of the GMM fitted on the original data, as in the simulation studies, against the proposed methods, i.e. MMCEM1 and MMCEM2, the multiple imputation approach, and the GMM fitted on the data obtained after removing the observations with at least one missing value (GMMD). The performance of these methods is evaluated using the ARI. Furthermore, the BIC criterion is employed to select both the number of clusters and the parsimonious within-component covariance matrices.

Figure 9 shows the results for the Diabetes data after applying the missing data procedures outlined above 1,000 times. In the first missing data scenario, where missing values are randomly assigned to observations so that each row of the data matrix has at most one missing value, the methods perform in a similar way, and they are pretty close to the GMM estimated on the full dataset (see left panel of Figure 9). In the second scenario, where the percentage of missing values is distributed over the entire dataset, the proposed MMCEM methods appear to outperform the methods based on listwise deletion or multiple imputation (see right panel of Figure 9).

Figure 9: Box-plots of ARI values for the Diabetes data with different MCAR mechanism scenarios and missing percentages of 30% and 50% (larger values are better). Both the number of mixture components and the component-covariance model are selected by BIC.
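A minimal sketch of the first missing-data pattern and two of the simpler comparison methods (GMMD, and a single-imputation baseline via imputeData()) is given below; the seed is arbitrary, and the full GMMMI procedure would repeat the imputation 50 times and combine the partitions by majority vote.

```r
library(mclust)
data(diabetes)
set.seed(123)

X     <- as.matrix(diabetes[, c("glucose", "insulin", "sspg")])
truth <- diabetes$class

## First missing-data pattern: ~30% of the rows lose one randomly chosen value
Xmiss <- X
rows  <- sample(nrow(X), floor(0.3 * nrow(X)))
for (i in rows) Xmiss[i, sample(ncol(X), 1)] <- NA

## GMMD benchmark: listwise deletion, then standard model-based clustering
## with both G and the covariance model selected by BIC
cc   <- complete.cases(Xmiss)
fitD <- Mclust(Xmiss[cc, ])
adjustedRandIndex(truth[cc], fitD$classification)

## Single-imputation baseline via mclust's imputeData()
fitI <- Mclust(imputeData(Xmiss, seed = 1))
adjustedRandIndex(truth, fitI$classification)
```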
6 Conclusions

In this paper we proposed two different algorithms to estimate GMMs in the presence of missing values by exploiting the well-known EM algorithm. Both algorithms employ Monte Carlo methods during the E-step to build augmented datasets via stochastic imputation of the missing values. These are then used in the standard M-step, thus allowing parameter estimates to be obtained for any parsimonious component-covariance matrix structure available for Gaussian mixtures.
In general, the proposed methods seem to outperform the multiple imputation approach, both in terms of clustering and density estimation. The MMCEM1 and MMCEM2 algorithms are basically equivalent when the number of mixture components and the covariance structure are known. When these are unknown and, consequently, are selected by BIC, the MMCEM2 procedure provides more accurate clustering partitions and density estimates.

On the other hand, the proposed algorithms are highly demanding in terms of computational cost. For high-dimensional datasets the procedures could need a large number of iterations during the model estimation phase, because the data augmentation in the imputation step strongly depends on the number of observations and on the MC sample size. For this reason, more efficient methods could be investigated. In addition, since the proposed methods are based on the MAR and MCAR assumptions, another future development might consider the MNAR missing mechanism.
References
Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803–821.

Buck, S. F. (1960). A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society: Series B (Methodological), 22(2):302–306.

Celeux, G. and Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28(5):781–793.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39:1–38.

Eirola, E., Lendasse, A., Vandewalle, V., and Biernacki, C. (2014). Mixture of Gaussians for distance estimation with missing data. Neurocomputing, 131:32–42.

Ford, B. L. (1983). An overview of hot-deck procedures. In Incomplete Data in Sample Surveys, pages 185–207. Academic Press, New York.

Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611–631.

Ghahramani, Z. and Jordan, M. I. (1995). Learning from incomplete data. Technical report, AI Lab Memo No. 1509, CBCL Paper No. 108, MIT AI Lab.

Hershey, J. R. and Olsen, P. A. (2007). Approximating the Kullback-Leibler divergence between Gaussian mixture models. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 4, pages IV-317. IEEE.

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.

Hunt, L. and Jorgensen, M. (2003). Mixture model clustering for mixed data with missing information. Computational Statistics & Data Analysis, 41(3):429–440.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Little, R. J. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. John Wiley & Sons, 2nd edition.

McLachlan, G. and Krishnan, T. (2008). The EM Algorithm and Extensions. John Wiley & Sons, Hoboken, New Jersey, 2nd edition.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. John Wiley & Sons, New York.

R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Reaven, G. and Miller, R. (1979). An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia, 16(1):17–24.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63:581–592.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York.

Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC, London.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.

Scrucca, L., Fop, M., Murphy, T. B., and Raftery, A. E. (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1):289–317.

Wei, G. C. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704.

Wilks, S. S. (1932). Moments and distributions of estimates of population parameters from fragmentary samples. The Annals of Mathematical Statistics, 3(3):163–195.