Initializing EM algorithm for univariate Gaussian, multi-component, heteroscedastic mixture models by dynamic programming partitions
Andrzej Polanski, Michal Marczyk, Monika Pietrowska, Piotr Widlak, Joanna Polanska
International Journal of Computational Methods
© World Scientific Publishing Company
INITIALIZING EM ALGORITHM FOR UNIVARIATE GAUSSIAN, MULTI-COMPONENT, HETEROSCEDASTIC MIXTURE MODELS BY DYNAMIC PROGRAMMING PARTITIONS
ANDRZEJ POLANSKI
Institute of Informatics, Silesian University of Technology, 44-100 Gliwice, Poland
[email protected]
MICHAL MARCZYK
Data Mining Group, Institute of Automatic Control, Silesian University of Technology, 44-100 Gliwice, Poland
[email protected]
MONIKA PIETROWSKA
Maria Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Branch in Gliwice, 44-101 Gliwice, Poland
[email protected]
PIOTR WIDLAK
Maria Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Branch in Gliwice, 44-101 Gliwice, Poland
[email protected]
JOANNA POLANSKA
Data Mining Group, Institute of Automatic Control, Silesian University of Technology, 44-100 Gliwice, Poland
[email protected]
Received (Day Month Year)
Revised (Day Month Year)

Setting initial values of parameters of mixture distributions estimated by using the EM recursive algorithm is very important to the overall quality of estimation. None of the existing methods is suitable for mixtures with a large number of components. We present a methodology for estimating initial values of parameters of univariate, heteroscedastic Gaussian mixtures, on the basis of a dynamic programming algorithm for partitioning the range of observations into bins. We evaluate variants of the dynamic programming method corresponding to different scoring functions for partitioning. For simulated and real datasets we demonstrate the superior efficiency of the proposed method compared to existing techniques.
Keywords: Gaussian mixtures; EM algorithm; dynamic programming; mass spectra.
1. Introduction
Due to the importance of the Gaussian distribution, a significant part of the research on improving the performance of the expectation maximization (EM) recursive algorithm [McLachlan and Peel (2000)] for estimation of parameters of mixture distributions is focused on mixtures of Gaussian components. A problem of crucial importance is the choice of initial values for mixture parameters. A good choice of a starting point for the EM iterations can reduce the probability of erroneous estimation of parameters and/or speed up convergence of EM iterations. Approaches to initializing EM iterations have been extensively discussed and studied [McLachlan and Peel (2000); Karlis and Xekalaki (2003); Biernacki et al. (2003); Biernacki (2004); Maitra (2009); O'Hagan et al. (2012); Melnykov and Melnykov (2012)]. A simple approach is random initialization, where initial values of parameters and component weights (mixing proportions) are generated from some assumed probability distributions [McLachlan and Peel (2000)]. Another simple idea is to use data quantiles to estimate initial means and variances of components. A group of approaches apply some kind of clustering procedure (hierarchical clustering or k-means clustering) to the data set to compute initial parameters for EM iterations [Biernacki et al. (2003); Maitra (2009); O'Hagan et al. (2012)]. Other ideas for computing initial values include using sample moments of data [Karlis and Xekalaki (2003)], using singular value decompositions for multivariate mixtures [Melnykov and Melnykov (2012)], and using properties of EM trajectories and data moments to limit the search space for initial parameters of multivariate Gaussian mixtures [Biernacki (2004)]. Available software packages for mixture modeling [McLachlan and Peel (1999); Biernacki et al. (2006); Fraley and Raftery (1999); Richardson and Green (1997)] offer different possibilities for setting initial conditions for EM iterations.

A challenge for the existing methodologies is decomposition of univariate, heteroscedastic Gaussian mixtures with a large number of components. It can be verified in practical computations that as the number of components in the mixture grows, application of the EM algorithm with the published methods for setting initial conditions leads to mixture models of progressively worse quality. Yet problems of mixture decompositions of univariate models with large numbers of components are encountered in many applications. Some examples are given hereafter. Frequency (amplitude or power) spectra of different time domain signals may contain numerous components; e.g., in vibration diagnostics or speech recognition applications, frequency spectra of measurement signals can be analyzed as mixtures of tens of Gaussian functions [Fraiha Machado et al. (2013)]. In nuclear magnetic resonance (NMR) spectroscopy, the free induction decay (FID) signal contains components corresponding to molecules present in the sample, associated with different frequencies; the frequency spectrum of such a signal contains tens of components. Moreover, spectral signatures of some metabolites can exhibit complex shapes, which may require including even more components for their modeling [Chylla (2012)]. In some applications concerning modeling, interpolation and forecasting of time signals, mixtures of Gaussian functions including tens of components are applied [Eirola and Lendasse (2013)]. Time of flight mass spectrometry, a high-throughput experimental technique in molecular biology, provides measurements of peptide, protein or lipid compositions of biological samples. Mass spectral signals can be modeled by mixture decompositions, where the numbers of components can reach several hundreds [Pietrowska et al. (2011); Polanski et al. (2015)].

The aim of this paper is to develop and evaluate a method for estimating initial values of parameters for EM iterations, for univariate, multi-component, heteroscedastic Gaussian mixtures, based on the dynamic programming partitioning algorithm. Partitioning the data points into bins by dynamic programming determines initial values of weights, means and variances of Gaussian components for the EM algorithm. The idea of partitioning an interval into bins by using dynamic programming, with the aim of obtaining a solution to some fitting or estimation problem, was formulated by Bellman [Bellman (1961)] and then studied and used by other authors in many contexts [e.g., Hébrail et al. (2010)]. The advantage of dynamic programming over partitioning (clustering) heuristics such as hierarchical clustering or greedy algorithms is that the dynamic programming method obtains the globally optimal solution for a given quality function. To the authors' best knowledge, dynamic programming partitions have not previously been used for initialization of EM iterations.

We present the dynamic programming algorithm for computing initial values of mixture parameters for EM iterations. We also study the problem of the choice of the scoring function. We compare several possible scoring functions and, on the basis of computational experiments, we propose the one best suited to the problem of computing initial values of mixture parameters. For a number of artificially generated univariate data sets with multiple Gaussian, heteroscedastic components, we compare other methodologies of initialization of EM iterations with the dynamic programming method and we show its advantage. In the areas of application of mixture decompositions listed above, measured signals are often numbers of counts or intensities. In the context of mixture decompositions of such datasets we describe the version of EM iterations appropriate for binned data and we develop an approach for computing initial conditions by dynamic programming partition. We apply the dynamic programming method for initializing EM iterations to a real dataset coming from a protein mass spectrometry experiment and we again demonstrate the efficiency of the dynamic programming method for initialization of the EM iterations.
2. Gaussian mixture modeling
Gaussian mixture modeling of a dataset of univariate scalar observations, x = [x_1, x_2, ..., x_N], involves estimating the vector of mixture parameters

$$\mathbf{p} = [\alpha_1, \ldots, \alpha_K, \mu_1, \ldots, \mu_K, \sigma_1, \ldots, \sigma_K]$$

(component weights, means and standard deviations) of the mixture probability density function (pdf)

$$f_{mix}(x, \mathbf{p}) = \sum_{k=1}^{K} \alpha_k f(x, \mu_k, \sigma_k), \qquad (1)$$

where f(x, µ_k, σ_k) is the Gaussian pdf and $\sum_{k=1}^{K} \alpha_k = 1$, such that the log-likelihood function

$$L(\mathbf{x}, \mathbf{p}) = \sum_{n=1}^{N} \log f_{mix}(x_n, \mathbf{p}) \qquad (2)$$

is maximized [McLachlan and Peel (2000)].

Analyses of datasets where measurements are numbers of counts or intensities involve formulations of mixture model decompositions for binned data [McLachlan and Peel (2000)], where x is a vector of equally spaced centers of bins and observations are given by a vector y = [y_1, y_2, ..., y_N] of counts y_n, n = 1, ..., N, generated by a multinomial distribution with probabilities p_n defined by areas of bins,

$$p_n = \sum_{k=1}^{K} \alpha_k \left[ \Phi\!\left(x_n + \tfrac{\delta}{2}, \mu_k, \sigma_k\right) - \Phi\!\left(x_n - \tfrac{\delta}{2}, \mu_k, \sigma_k\right) \right]. \qquad (3)$$

In the above, Φ(x, µ_k, σ_k) is the cumulative probability distribution function of the Gaussian distribution and δ is the bin width. We assume that bins are "dense", which allows for approximating the area of a bin by the product of its width and the probability density function at the bin center, and consequently for using the log-likelihood function defined by the following formula

$$L(\mathbf{x}, \mathbf{p}) = \sum_{n=1}^{N} y_n \log f_{mix}(x_n, \mathbf{p}). \qquad (4)$$
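For concreteness, the model (1) and the likelihoods (2) and (4) can be written down in a few lines. The sketch below is illustrative only, assuming numpy/scipy; the helper names mixture_pdf and loglik are ours, not part of any package.

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, alpha, mu, sigma):
    """Mixture pdf f_mix of Eq. (1) evaluated at the points in x."""
    x = np.asarray(x, dtype=float)[:, None]          # shape (N, 1)
    return np.sum(alpha * norm.pdf(x, mu, sigma), axis=1)

def loglik(x, alpha, mu, sigma, y=None):
    """Log-likelihood (2); with bin counts y it becomes the binned version (4)."""
    f = mixture_pdf(x, alpha, mu, sigma)
    return np.sum(np.log(f)) if y is None else np.sum(y * np.log(f))
```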
EM iterations

Maximization of the log-likelihood functions (2) and (4) is done by using EM iterations, i.e., successive updates of the parameter vector $\mathbf{p}^{old} \rightarrow \mathbf{p}^{new}$, where

$$\mathbf{p}^{old} = [\alpha_1^{old}, \ldots, \alpha_K^{old}, \mu_1^{old}, \ldots, \mu_K^{old}, \sigma_1^{old}, \ldots, \sigma_K^{old}]$$

and

$$\mathbf{p}^{new} = [\alpha_1^{new}, \ldots, \alpha_K^{new}, \mu_1^{new}, \ldots, \mu_K^{new}, \sigma_1^{new}, \ldots, \sigma_K^{new}].$$

Standard formulae for EM iterations are given, e.g., in [McLachlan and Peel (2000); Bilmes (1998)]. The version of EM iterations appropriate for binned data with dense bins is similar to standard EM iterations. It involves defining conditional distributions
of hidden variables χ_n (corresponding to unknown assignments of observations to components),

$$P[\chi_n = k \mid x_n] = \frac{\alpha_k^{old} f(x_n, \mu_k^{old}, \sigma_k^{old})}{\sum_{\kappa=1}^{K} \alpha_\kappa^{old} f(x_n, \mu_\kappa^{old}, \sigma_\kappa^{old})}, \qquad (5)$$

and updates of parameter estimates given by

$$\alpha_k^{new} = \frac{\sum_{n=1}^{N} y_n P[\chi_n = k \mid x_n]}{\sum_{n=1}^{N} y_n}, \qquad (6)$$

$$\mu_k^{new} = \frac{\sum_{n=1}^{N} y_n x_n P[\chi_n = k \mid x_n]}{\sum_{n=1}^{N} y_n P[\chi_n = k \mid x_n]}, \qquad (7)$$

$$(\sigma_k^{new})^2 = \frac{\sum_{n=1}^{N} y_n (x_n - \mu_k^{new})^2 P[\chi_n = k \mid x_n]}{\sum_{n=1}^{N} y_n P[\chi_n = k \mid x_n]}, \qquad (8)$$

k = 1, 2, ..., K.
Preventing divergence of EM iterations

Some assumptions should be made concerning execution of the EM iterations. In the case of unequal variances of components of the Gaussian mixture, considered here, the log-likelihood (2) or (4) is unbounded [Kiefer and Wolfowitz (1956)]. Unboundedness may result in divergence of EM iterations in practical computations and creates the need for approaches preventing divergence [Yao (2010); Ingrassia (2004)]. Here we prevent divergence of EM iterations by simple constraint conditions. Namely, we do not allow standard deviations of Gaussian components and mixing proportions to fall below given threshold values σ_min and α_min, i.e., we augment the update equations for standard deviations and component weights by the additional constraints

$$\sigma_k^{new} \leftarrow \max(\sigma_k^{new}, \sigma_{min}) \qquad (9)$$

and

$$\alpha_k^{new} \leftarrow \max(\alpha_k^{new}, \alpha_{min}). \qquad (10)$$

The above constraints are sufficient to prevent divergence of EM iterations. The constraint values assumed in the EM iterations are σ_min = 10^{−}, α_min = 10^{−} for the simulated datasets and σ_min = 1, α_min = 10^{−} for the proteomic dataset.
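A minimal sketch of one such update for binned data, combining (5)-(8) with the safeguards (9)-(10), might look as follows. The function name em_step_binned and its interface are our own illustration; the thresholds σ_min and α_min are supplied by the caller.

```python
import numpy as np
from scipy.stats import norm

def em_step_binned(x, y, alpha, mu, sigma, sigma_min, alpha_min):
    """One EM update (5)-(8) for binned data, followed by constraints (9)-(10)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # E-step: responsibilities P[chi_n = k | x_n] of Eq. (5), shape (N, K)
    dens = alpha * norm.pdf(x[:, None], mu, sigma)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: count-weighted updates (6)-(8)
    w = y[:, None] * resp                        # y_n * P[chi_n = k | x_n]
    s = w.sum(axis=0)                            # shared denominator of (7)-(8)
    alpha_new = s / y.sum()                                                # (6)
    mu_new = (w * x[:, None]).sum(axis=0) / s                              # (7)
    sigma_new = np.sqrt((w * (x[:, None] - mu_new) ** 2).sum(axis=0) / s)  # (8)
    # constraints (9)-(10) preventing divergence
    return np.maximum(alpha_new, alpha_min), mu_new, np.maximum(sigma_new, sigma_min)
```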
3. Problem formulation
The problem studied in this paper concerns determining initial values of mixture parameters for EM iterations, µ_k^ini, σ_k^ini, α_k^ini, k = 1, 2, ..., K, for the best quality of the mixture parameter estimates. All methods for setting initial conditions for EM iterations studied and compared here rely on partitions of the data range performed according to some criterion. We assume that observations x_1, x_2, ..., x_N are sorted in ascending order,

$$x_1 < x_2 < \ldots < x_N. \qquad (11)$$

Partitions are defined by blocks B_1, B_2, ..., B_K; the k-th block contains samples i, i+1, ..., j,

$$B_k = \{i, i+1, \ldots, j\}.$$

Partitions defined by blocks are used for computing initial values of parameters. Initial means are computed as

$$\mu_k^{ini} = \frac{1}{j-i+1} \sum_{n=i}^{j} x_n, \qquad (12)$$

initial values of standard deviations are computed as

$$\sigma_k^{ini} = \sqrt{\frac{1}{j-i+1} \sum_{n=i}^{j} \left(x_n - \mu_k^{ini}\right)^2}, \qquad (13)$$

and initial values of mixing proportions are computed as

$$\alpha_k^{ini} = \frac{|B_k|}{N} = \frac{j-i+1}{N}. \qquad (14)$$

In the above expression |B_k| denotes the number of measurements in the block B_k.

For the case of binned data, the appropriate expressions for initial values of parameters, implied by partitions given by blocks B_k = {i, i+1, ..., j}, are as follows. Initial values of means are computed as

$$\mu_k^{ini} = \sum_{n=i}^{j} x_n w_n, \qquad (15)$$

initial values of standard deviations are computed as

$$\sigma_k^{ini} = \sqrt{\sum_{n=i}^{j} w_n \left(x_n - \mu_k^{ini}\right)^2}, \qquad (16)$$

and initial values of component weights are computed as

$$\alpha_k^{ini} = \frac{\sum_{\nu=i}^{j} y_\nu}{\sum_{n=1}^{N} y_n}. \qquad (17)$$

In expressions (15)-(17), by w_n we denote

$$w_n = \frac{y_n}{\sum_{\nu=i}^{j} y_\nu}. \qquad (18)$$
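The binned formulas (15)-(18) translate directly into code. The sketch below is illustrative (initial_params_binned is a hypothetical helper); passing a vector of ones for y recovers the raw-data formulas (12)-(14).

```python
import numpy as np

def initial_params_binned(x, y, blocks):
    """Initial weights, means and standard deviations (15)-(17) implied by a
    partition of the bin centers x (counts y) into blocks [(i, j), ...],
    where i and j are inclusive indexes into the sorted data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    alpha, mu, sigma = [], [], []
    for i, j in blocks:
        xb, yb = x[i:j + 1], y[i:j + 1]
        w = yb / yb.sum()                                 # weights w_n, Eq. (18)
        m = np.sum(w * xb)                                # Eq. (15)
        mu.append(m)
        sigma.append(np.sqrt(np.sum(w * (xb - m) ** 2)))  # Eq. (16)
        alpha.append(yb.sum() / y.sum())                  # Eq. (17)
    return np.array(alpha), np.array(mu), np.array(sigma)
```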
4. Initializing EM iterations by using the dynamic programming partitioning algorithm
In this section we describe the algorithm for computing partitions by using the dynamic programming method. Partitioning of the observations involves defining K blocks

$$B_1, B_2, \ldots, B_K, \qquad (19)$$

where (as already stated) each of the blocks is a range of successive indexes of observations,

$$B_k = \{i, i+1, \ldots, j\}. \qquad (20)$$

For each of the blocks (more precisely, for the data in the block) we compute a scoring function, denoted either by Q(B_k) or by Q(x_i, x_{i+1}, ..., x_j),

$$Q(B_k) = Q(x_i, x_{i+1}, \ldots, x_j). \qquad (21)$$

The problem of optimal partitioning involves defining blocks (19) such that the cumulative scoring index Q(B_1, B_2, ..., B_K) is minimized,

$$Q(B_1, B_2, \ldots, B_K) = \sum_{k=1}^{K} Q(B_k) \rightarrow \min. \qquad (22)$$

The solution of the optimal partitioning problem (22) by dynamic programming [Bellman (1961)] is obtained by iterative application of the following Bellman equation

$$Q^{opt}_{1 \ldots j}(k+1) = \min_{i=1, \ldots, j-1} \left[ Q^{opt}_{1 \ldots i-1}(k) + Q(x_i, x_{i+1}, \ldots, x_j) \right], \qquad (23)$$

where $Q^{opt}_{1 \ldots i}(k)$ denotes the optimal cumulative partial score of partitioning the range 1...i into k blocks.
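The recursion (23) admits a direct tabular implementation. The sketch below is our illustration of the scheme, not the authors' Matlab code: it fills a table of optimal partial scores and backtracks the block boundaries. Here score(i, j) stands for any scoring function of the kind defined in the next subsection, evaluated on x[i..j] (inclusive, zero-based indexes).

```python
import numpy as np

def dp_partition(x, K, score):
    """Optimal partition of the sorted data x into K contiguous blocks by the
    Bellman recursion (23); score(i, j) evaluates Q on x[i..j] inclusive."""
    N = len(x)
    # Qopt[k, j] = best cumulative score of splitting x[0..j] into k+1 blocks
    Qopt = np.full((K, N), np.inf)
    back = np.zeros((K, N), dtype=int)        # start index of the last block
    for j in range(N):
        Qopt[0, j] = score(0, j)
    for k in range(1, K):
        for j in range(k, N):
            for i in range(k, j + 1):         # last block is x[i..j]
                c = Qopt[k - 1, i - 1] + score(i, j)
                if c < Qopt[k, j]:
                    Qopt[k, j], back[k, j] = c, i
    # backtrack the block boundaries
    blocks, j = [], N - 1
    for k in range(K - 1, 0, -1):
        i = back[k, j]
        blocks.append((i, j))
        j = i - 1
    blocks.append((0, j))
    return blocks[::-1]
```

With a generic score callback the loops evaluate O(KN²) block scores, which is the source of the quadratic complexity discussed in the conclusions; for the variance-based scores, the per-block cost can be made constant with prefix sums.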
Scoring functions

The scoring function Q(B_k) used in the dynamic programming algorithm of data partitioning (22)-(23) should be designed in such a way that it allows for detection of dense subgroups in the data. A scoring function often used in the literature is the weighted sum of squares of within-block deviations of data points from the block mean,

$$Q(B_k) = \sum_{n=i}^{j} \gamma_k \left[ x_n - \frac{1}{j-i+1} \sum_{\nu=i}^{j} x_\nu \right]^2. \qquad (24)$$

Often the weights γ_k are assumed to be normalizing factors for the numbers of elements in the block, which leads to the scoring function we define as Q_1(B_k),

$$Q_1(B_k) = \frac{1}{j-i+1} \sum_{n=i}^{j} \left[ x_n - \frac{1}{j-i+1} \sum_{\nu=i}^{j} x_\nu \right]^2. \qquad (25)$$
The scoring function Q_1(B_k) is the within-block variance of the data. Intuitively, Q_1(B_k) is a reasonable measure of concentration of data points within the defined clusters for detecting clusters in the data.

Other definitions of scoring functions are also possible, and it is an interesting question whether the use of other scoring functions can improve data partitions, cluster detection and, consequently, estimation of the mixture parameters. Therefore, below we define other scoring indexes, obtained by modifications of Q_1(B_k) (25). The second scoring function Q_2(B_k) is the within-block standard deviation,

$$Q_2(B_k) = \sqrt{Q_1(B_k)}. \qquad (26)$$

We also use a third scoring function Q_3(B_k), defined as the ratio of the within-block standard deviation and the block sample range,

$$Q_3(B_k) = \frac{\sqrt{Q_1(B_k)}}{x_j - x_i}. \qquad (27)$$

Intuitively, the scoring function Q_3(B_k) takes smaller values for blocks where data points are concentrated close to the center and larger values otherwise. The property of the scoring function Q_3(B_k) is that it is dimensionless and, in the limit of a large number of data points, depends only on the shape of the probability distribution function of the data.

Finally, we also introduce a fourth scoring function, Q_4(B_k, ∆), which is a modification of the scoring function Q_3(B_k) such that some preference is given to wider blocks in comparison to narrower blocks. The idea of giving preference to wider blocks is motivated by the fact that very narrow blocks detected by Q_3(B_k) may correspond to random variations of the data rather than to true dense subgroups related to the structure of the Gaussian components. In order to give some preference to wider blocks we modify Q_3(B_k) by introducing an additional parameter ∆, which leads to the scoring function Q_4(B_k, ∆),

$$Q_4(B_k, \Delta) = \frac{\Delta + \sqrt{Q_1(B_k)}}{x_j - x_i}. \qquad (28)$$

Adding the positive constant ∆ in the numerator of the above expression limits the possibility of "shrinking" the numerator of Q_4(B_k, ∆) to values very close to 0, which can happen when narrow random "peaks" occur in the data.
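Under the same conventions as the partitioning sketch above, the four scoring functions (25)-(28) can be written as follows. The guard giving zero-range blocks an infinite Q3/Q4 score is our addition, so that the minimization avoids degenerate blocks.

```python
import numpy as np

def q1(x, i, j):
    """Within-block variance, Eq. (25)."""
    return np.var(x[i:j + 1])

def q2(x, i, j):
    """Within-block standard deviation, Eq. (26)."""
    return np.sqrt(q1(x, i, j))

def q3(x, i, j):
    """Standard deviation divided by the block sample range, Eq. (27)."""
    rng = x[j] - x[i]
    return np.sqrt(q1(x, i, j)) / rng if rng > 0 else np.inf

def q4(x, i, j, delta=0.1):
    """Range-normalized score with preference for wider blocks, Eq. (28)."""
    rng = x[j] - x[i]
    return (delta + np.sqrt(q1(x, i, j))) / rng if rng > 0 else np.inf
```

A DP-Q4 partition is then obtained as, e.g., dp_partition(x, K, lambda i, j: q4(x, i, j, delta=0.1)).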
Scoring functions for binned data

For the case of application of the dynamic programming method to binned data, the appropriate modification of the expression for the scoring function Q_1 is

$$Q_1(B_k) = \sum_{n=i}^{j} w_n \left( x_n - \sum_{\nu=i}^{j} x_\nu w_\nu \right)^2, \qquad (29)$$

where w_n is defined as in (18). Formulas for the scoring functions Q_2, Q_3 and Q_4 for binned data are given by (26), (27) and (28), with (25) replaced by (29).
Properties of different scoring functions

Different scoring indexes may lead to different partitions of the data. Therefore, in the computational experiments reported further in this paper we apply and compare all scoring indexes Q_1(B_k), Q_2(B_k), Q_3(B_k) and Q_4(B_k, ∆). Some preliminary observations (later systematically verified) are summarized below. The index Q_1(B_k) has a tendency to over-penalize wide components, which can result in splitting some of the wider components into two and in merging narrow components into one. The index Q_2(B_k) shows high sensitivity in the case where there is little overlap between components; however, when the overlap between components increases, it has a tendency to randomly merge components. The index Q_3(B_k) shows advantages following from its dimensionless construction and often leads to correct partitions; however, it is sensitive to noise in the data. Finally, the modified index Q_4(B_k, ∆) improves the performance of Q_3(B_k) by robustification against noise in the data.
5. Reference methods of setting initial conditions for EM iterations
We compare dynamic programming partitions to several reference methods for setting initial conditions for EM iterations. These methods have already been studied in the literature [Biernacki et al. (2003); Maitra (2009); Fraley and Raftery (1999)] and were proven to be useful approaches for estimating mixture decompositions of datasets. The first reference method of generating initial mixture parameters is the method of equal quantiles, used, e.g., as the default option in the software package Mclust [Fraley and Raftery (1999)]. Here bins are defined by equal quantiles of the dataset. Two other reference methods are hierarchical clustering algorithms, e.g., [Hastie et al. (2009)], where clusters of samples are created by successive merging operations based on distances between samples and/or distances between clusters of samples. We apply two versions of hierarchical clustering, with average and complete linkage [Hastie et al. (2009)]. Initial values of component means, standard deviations and weights are defined by the blocks obtained by application of the hierarchical clustering algorithms.
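As an illustration of the first reference method, one simple reading of the equal-quantiles rule is to cut the sorted sample into K blocks of (nearly) equal counts; the helper below is our own sketch, not the Mclust implementation.

```python
import numpy as np

def equal_quantile_blocks(x_sorted, K):
    """Partition N sorted observations into K blocks of (nearly) equal counts,
    i.e., blocks delimited by equal quantiles of the empirical distribution."""
    N = len(x_sorted)
    edges = np.linspace(0, N, K + 1).round().astype(int)
    return [(edges[k], edges[k + 1] - 1) for k in range(K)]
```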
6. Results
We have conducted several computational experiments for systematic comparison of the methods of setting initial values of parameters for EM iterations by the dynamic programming algorithm and the three reference methods. We use the following abbreviations for the algorithms of setting initial conditions: E-Q - equal quantiles algorithm; H-clu-c - hierarchical clustering algorithm with complete linkage; H-clu-a - hierarchical clustering algorithm with average linkage; DP-Q1, DP-Q2, DP-Q3, DP-Q4(∆) - dynamic programming algorithm with scoring function Q1, Q2, Q3, Q4(∆), respectively. In the subsections below we first define performance criteria for comparing results of applying different algorithms. Then we describe two groups of computational experiments, on artificially created data and on proteomic mass spectral data, and we report results of comparisons of different methods for setting initial conditions for EM iterations.
Performance criteria
Performance criteria for evaluating results of parameter estimation algorithms are based on values of the differences between true and estimated parameters, or on the values of the log-likelihood functions obtained in EM iterations, averaged over repeated experiments. Since our quality criteria are meant to allow reasonable comparisons of different initialization algorithms across datasets corresponding to different mixtures, we need to introduce additional scalings and orderings of the values of differences or log-likelihoods, as described below.
Difference between values of true and estimated parameters.
The first approach to evaluating results of mixture parameter estimation algorithms is using a direct criterion given by a scaled difference between estimated and true values. We use a scaled absolute difference between true and estimated locations of components, averaged over all components. Scaling is aimed at making the distribution of errors invariant with respect to component widths and component weights. The criterion is defined as follows

$$D = \frac{1}{K} \sum_{i=1}^{K} \frac{|\mu_i^{true} - \mu_i^{est}|}{\sigma_i^{true}} \sqrt{N \alpha_i^{true}}. \qquad (30)$$

In the above expression, µ_i^true, σ_i^true and α_i^true are true parameters of the analyzed mixture distribution, K is the number of mixture components and N is the sample size. By µ_i^est we understand the value of the estimated mixture component mean closest to µ_i^true. Due to skewness of the distribution of D, we use log(D) as the final measure of the quality of parameter estimation. The value of the quality criterion log(D) averaged over mixture datasets is denoted as

$$Avg[\log(D)] \qquad (31)$$

and is further used in reporting results of computational experiments.
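A direct transcription of the criterion (30), as reconstructed above, is given below; matching each true mean to the closest estimated mean follows the definition of µ_i^est. The helper name log_d is our own.

```python
import numpy as np

def log_d(mu_true, sigma_true, alpha_true, mu_est, N):
    """Criterion log(D), Eq. (30): for each true mean the closest estimated
    mean is taken as its estimate."""
    mu_true = np.asarray(mu_true, dtype=float)
    mu_est = np.asarray(mu_est, dtype=float)
    closest = mu_est[np.argmin(np.abs(mu_est[None, :] - mu_true[:, None]), axis=1)]
    d = np.mean(np.abs(mu_true - closest) / np.asarray(sigma_true)
                * np.sqrt(N * np.asarray(alpha_true)))
    return np.log(d)
```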
Log-likelihoods.

The direct criterion defined in the previous subsection can be used only in the case where the true compositions of the analyzed mixtures are known. Since we study data both with known and with unknown compositions of mixtures, we also use a second approach to evaluating results of mixture parameter estimation algorithms, based on values of the log-likelihood functions. Values of the log-likelihoods obtained in the EM iterations can be used for scoring performance of different algorithms in both cases of known or unknown values of true mixture parameters.
Initializing EM by dynamic programming partitions In the case of analysis of a single dataset on can use the obtained values oflog likelihoods to order performances of different algorithms. Higher values of loglikelihoods imply better quality of the solution obtained by using an algorithm.There are exceptions from the rule “higher likelihood → better estimate of mix-ture parameters” caused by the possible existence of the spurious local maximizers[McLachlan and Peel (2000)], but their influence on obscuring results of evaluationsof performances of different algorithms is strongly reduced by the used constraintson values of standard deviations of mixture components.Often we are not interested in using values of the log likelihood functions forthe analysis of one dataset but rather for comparisons involving many datasets.In that case, typically the differences between log-likelihoods obtained for differ-ent mixtures (different datasets), are much larger than differences of log-likelihoodsresulting from applying different initialization algorithms for the same data-set.Therefore orderings or scalings are used in order to compensate for this. In thisstudy we use the criterion applied previously in the literature [Karlis and Xekalaki(2003); Biernacki et al. (2003)] defined by the percentage (probability) of attaining“maximum” likelihood by a method of initialization of EM iterations. By “maxi-mum” likelihood we understand the maximal value over all applied algorithms. Wealso assume that “a method no m attained maximum likelihood”, means that thedifference between “maximum” likelihood and the m -th likelihood is lower then 5%of the range of all log likelihoods. The value of this criterion, estimated by averagingover repeated computational experiments, is denoted by Avg ( P ) . (32) Simulated datasets
Simulated datasets

The first computational experiment involves analyses of artificially created datasets, which are 10-component Gaussian mixtures with known values of means, standard deviations and weights of Gaussian components. In the simulated datasets we control the overlap between Gaussian components, due to its strong influence on the results of the fit. Several measures of the degree of overlap between two Gaussian components have been proposed in the literature, e.g., [Sun and Wang (2011)], using different approaches; a simple method is based on distances between Gaussian distributions (Mahalanobis, Bhattacharyya). Here we define a simple parameter ov for measuring the degree of overlap between neighboring components,

$$ov = \exp\left( - \frac{|\mu_i - \mu_{i+1}|}{\sqrt{\sigma_i^2 + \sigma_{i+1}^2}} \right). \qquad (33)$$

The parameter ov assumes a value equal or close to zero for disjoint components and larger values for components with stronger overlap. The maximal possible value assumed by the overlap parameter is ov = 1, which occurs in the case where µ_i = µ_{i+1}. The definition of ov in (33) can be interpreted as an adaptation/simplification of the idea of the Bhattacharyya distance. The construction (33) simplifies the overlap definition based on the Bhattacharyya distance in the sense that components with equal means, which show the maximal possible overlap ov = 1, could still be distinguished by the Bhattacharyya distance through differences between variances. Despite this simplification, the overlap measure (33) is useful for our analyses, due to the fact that we are not considering mixtures with components of similar means and different variances (claw-like mixtures) [McLachlan and Peel (2000)].

True parameters of each of the Gaussian mixtures are drawn randomly in each stochastic simulation experiment. Draws of parameters of the Gaussian mixtures are designed such that overlaps between neighboring components are constant over one dataset. One stochastic simulation experiment includes three steps (a sketch of this generation procedure is given below):

(1) Draw randomly values of variances and weights of Gaussian components, σ_1^2, ..., σ_10^2 and α_1, ..., α_10.
(2) Define µ_1 = 0 and compute values of means of Gaussian components µ_2, ..., µ_10 such that the value of the overlapping coefficient (33) between successive components has a given constant value ov.
(3) Generate 1000 independent samples of the 10-component Gaussian mixture with parameters α_1, ..., α_10, µ_1, ..., µ_10 and σ_1, ..., σ_10.

Differences between stochastic simulation experiments involve different methods of drawing the weights α_1, ..., α_10 and standard deviations σ_1, ..., σ_10 of Gaussian components in the mixtures, and different values of the overlapping coefficient ov. Four groups of datasets are generated. Each includes 5 series of experiments corresponding, respectively, to the following 5 values of the overlapping coefficient: ov = 0.05, 0.1, 0.15, 0.2, 0.25. Each series corresponds to one value of the overlapping coefficient ov and includes 500 datasets, generated according to steps (1)-(3). Different groups use different scenarios for generating weights and variances of Gaussian components, described below.
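Steps (2) and (3) are easy to sketch once (33) is inverted for the distance between neighboring means, |µ_{i+1} − µ_i| = −ln(ov)·√(σ_i² + σ_{i+1}²). The helpers below are our illustration of the procedure; the draws of weights and standard deviations in step (1) follow the group scenarios described next.

```python
import numpy as np

def place_means(sigma, ov):
    """Step (2): means with mu_1 = 0 and constant neighbor overlap (33)."""
    mu = np.zeros(len(sigma))
    for i in range(len(sigma) - 1):
        mu[i + 1] = mu[i] - np.log(ov) * np.hypot(sigma[i], sigma[i + 1])
    return mu

def sample_mixture(alpha, mu, sigma, n=1000, rng=None):
    """Step (3): n observations from the mixture; draw a component label for
    each observation, then a Gaussian value from that component."""
    rng = np.random.default_rng() if rng is None else rng
    comp = rng.choice(len(alpha), size=n, p=alpha)
    return np.sort(rng.normal(mu[comp], sigma[comp]))   # sorted, as in (11)
```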
Group 1: Equal mixing proportions. Low variability of standard deviations.

In this group equal values of mixing proportions are assumed, α_1 = α_2 = ... = α_10 = 0.1, and values of component standard deviations are generated randomly from the uniform distribution U(0.5, 1) in each dataset. Values in parentheses give the range of possible changes of standard deviations. So this scenario allows only for low (two-fold) variability of standard deviations of Gaussian components.
Group 2: Equal mixing proportions. High variability of standard deviations.

In this group again equal values of mixing proportions are assumed, α_1 = α_2 = ... = α_10 = 0.1, and values of component standard deviations are generated randomly from a uniform distribution U(0.…, …) in each dataset, with a range allowing for high variability of standard deviations of Gaussian components.

Group 3: Different mixing proportions. Low variability of standard deviations.

In this group different values of mixing proportions are assumed, α_1 = …, α_2 = …, ..., α_10 = …, and values of component standard deviations are generated randomly from the uniform distribution U(0.…, …) in each dataset.

Group 4: Different mixing proportions. High variability of standard deviations.

In this group different values of mixing proportions are assumed, α_1 = …, α_2 = …, ..., α_10 = …, and values of component standard deviations are generated randomly from the uniform distribution U(0.…, …) in each dataset.
Comparisons of performances of different algorithms.

In the computational experiments, parameters of the Gaussian mixtures were estimated by using EM iterations started with each of the seven above-described algorithms of setting initial conditions: E-Q, H-clu-c, H-clu-a, DP-Q1, DP-Q2, DP-Q3, DP-Q4(∆). Performances of the algorithms are evaluated by using the quality indexes Avg[log(D)] (31) and Avg(P) (32). Results, obtained by averaging over 500 datasets in each of the data generation scenarios, are presented in figure 1. This figure includes 5 subplots arranged in columns and rows. Subplots in the two upper rows depict results of applying the seven algorithms of setting initial conditions in terms of Avg[log(D)]; each of these subplots corresponds to one group of datasets (experiments). In the top-left subplot, corresponding to group 1 of experiments with equal weights of components and low variability of standard deviations of components, all initialization methods show high and similar performances. In the subplot (second row, left column) corresponding to group 3 of experiments, with low variability of standard deviations and different component weights, the method which shows significantly lower performance is the equal quantiles method E-Q. The other clustering methods (H-clu-c, H-clu-a, DP-Q1, DP-Q2, DP-Q3, DP-Q4(∆)) again show quite similar performances; however, the differences are bigger here than in the previous plot. In the subplots in the right column, corresponding to groups 2 and 4 of stochastic simulation experiments, both with high variability of standard deviations of Gaussian components, we can observe stronger diversity between results of different initialization algorithms. The performance of the equal quantiles algorithm E-Q is high in group 2 (equal component weights); however, in group 4 (different component weights) E-Q is the algorithm with the worst performance.

In all groups, algorithms based on the dynamic programming method, DP-Q1, DP-Q2, DP-Q3 and DP-Q4, either lead to similar values of Avg[log(D)] or outperform the reference methods. The cases where performance of the hierarchical clustering method with average linkage can be slightly higher than that of the dynamic programming methods are groups 1 and 3 (low variability of standard deviations). For groups 2 and 4 (high variability of standard deviations) there is a strong advantage of the dynamic programming algorithms over the reference methods.
Fig. 1. Comparisons of results of applying the algorithms E-Q, H-clu-c, H-clu-a, DP-Q1, DP-Q2, DP-Q3, DP-Q4(∆) for estimating mixture parameters for simulated datasets. (Explanations in the text.)
High performance of the equal quantiles algorithm E-Q in groups 1 and 2 can be considered a kind of artifact. It does not follow from a high capability of the algorithm to locate positions of components, but rather from the fact that the true values of component weights coincide with the equal quantiles assumed in the algorithm.

We can also make observations concerning comparisons between variants of the dynamic programming algorithm based on different scoring functions. Performance of the algorithm DP-Q1, based on the scoring function (25) given by a sample variance, is high in groups 1 and 3, where the variability of component standard deviations is low. However, in groups 2 and 4, where the variability of component standard deviations strongly increases, the algorithm DP-Q1 exhibits worse performance compared to the other variants of the dynamic programming method. This is consistent with the tendency of the dynamic programming partition with the scoring function Q1 to incorrectly merge narrow components. Performance of the algorithm DP-Q2 is high for low values of the overlap coefficient ov, but strongly decreases with the increase of ov. Performance of the algorithm DP-Q4(∆), ∆ = 0.1, remains high across the groups. The bottom subplot of figure 1 (correlation plot) presents a scatterplot of values of Avg[log(D)] (31) versus values of the probability index Avg(P) (32). In the scatterplot a strong negative correlation between values of Avg[log(D)] and Avg(P) is seen, which confirms the potential of the index Avg(P) to serve as a reasonable estimate and comparison tool for the performance of different algorithms.
Protein spectra dataset

A proteomic mass spectrum contains information about the mass-to-charge (m/z) values of registered ions, denoted by x_n, versus their abundances, i.e., numbers of counts from the ion detector, denoted by y_n; n denotes the index of the data point. In real experiments the dataset consists of more than one spectrum (sample). To each point x_n along the m/z axis correspond counts y_n^s, where s denotes the index of the sample.

The second computational experiment in our study was the analysis of a proteomic dataset, which included 52 low resolution proteomic mass spectra of human blood plasma [Pietrowska et al. (2011)]. This dataset was obtained in a clinical study where blood samples were collected in a group of 22 head and neck cancer patients and in a group of 30 healthy donors. Raw spectra consisted of approximately 45 000 m/z values, covering the range of 2 000 to 12 000 Da. Spectral signals were binned with the resolution 1 Da and baselines were removed with the use of a version of the algorithm described in [Sauve and Speed (2004)]. For further analysis, only spectral fragments ranging from 2 000 to 4 120 Da have been selected. The choice of the range 2 000 to 4 120 Da was motivated by the fact that this fragment of the m/z scale contains several protein and peptide species interesting as potential candidates for biomarkers.
Comparison of performances of different algorithms

Computational experiments on the proteomic dataset involve modeling spectral signals as mixtures of Gaussian components, estimating model parameters by using the EM algorithm, and comparing the qualities of models obtained by using different algorithms of setting initial conditions for EM iterations. The quality criterion for comparing different initialization methods was Avg(P), where averaging is done over all spectral signals in the dataset. Clearly, the criterion Avg[log(D)] cannot be used here due to the lack of knowledge of the true parameters of the mixture models.

For spectral signals with high variability of standard deviations, strong, different overlaps between components and a large number of components, the differences between performances of different algorithms of initialization of the EM iterations are magnified compared to the simulated datasets. Application of the algorithms E-Q, H-clu-c, H-clu-a, DP-Q1, DP-Q2, DP-Q3, DP-Q4 to the proteomic dataset leads to large differences between values of the log-likelihood functions of the models. Since the initialization methods E-Q, H-clu-c, H-clu-a and DP-Q2 exhibit significantly lower performance than DP-Q1, DP-Q3 and DP-Q4, we have confined the set of compared algorithms to those of the highest quality: DP-Q1, DP-Q3 and DP-Q4. We have decomposed each of the spectral signals into a Gaussian mixture. The EM algorithm was initialized with the use of the following five algorithms: DP-Q1, DP-Q3, DP-Q4(∆ = 1), DP-Q4(∆ = 5), DP-Q4(∆ = 10). Decompositions were computed with numbers of components K ranging from 50 to 150.

Fig. 2. Upper panel: Comparison of performances of the algorithms of setting initial conditions DP-Q1, DP-Q3, DP-Q4(∆ = 1), DP-Q4(∆ = 5), DP-Q4(∆ = 10) for the proteomic spectra dataset based on the Avg(P) index. Middle panel: Spectral signal y_n = y_n^1. Lower panel: A fragment of the spectrum y_n = y_n^1 (black) versus its mixture model (green).

In figure 2, in the upper panel, we present the plot of values of the index Avg(P) versus the number of Gaussian components K. One can observe that, on average, DP-Q4 shows higher performance than DP-Q1 and DP-Q3. High performance of the algorithm DP-Q4 was observed for a quite wide range of the parameter ∆: ∆ = 1, ∆ = 5, ∆ = 10.

The middle panel of figure 2 shows a plot of the spectral signal y_n = y_n^1 (after the preprocessing operations of baseline correction and binning) corresponding to sample no. 1, within the range 2 000 to 4 120 Da. In the lower panel, the plot of a fragment of the spectrum y_n = y_n^1 (black), within the range 2 500 to 3 500 Da, is drawn versus its mixture model (green) obtained with the use of the algorithm DP-Q4(∆ = 10). The number of components (K = 90) was estimated with the use of the Bayesian information criterion [Schwarz (1978)]. One can observe a good fit of the computed mixture model.
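For completeness, a minimal sketch of the standard form of the Bayesian information criterion [Schwarz (1978)] for this model follows; the parameter count 3K − 1 (K means, K standard deviations, K − 1 independent weights) is the usual convention for a univariate K-component heteroscedastic mixture and is our assumption here.

```python
import numpy as np

def bic(loglik, K, n):
    """Standard BIC for a K-component univariate Gaussian mixture with 3K - 1
    free parameters; n is the number of observations (for binned data, the
    total count sum(y)). The model minimizing BIC over K is selected."""
    return -2.0 * loglik + (3 * K - 1) * np.log(n)
```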
7. Discussion
Despite the simplicity of construction of the EM algorithm, the research on its performance and possible improvements is extensive and includes several steps of the algorithm: initialization, stopping conditions, preventing divergence, execution [McLachlan and Peel (2000)]. Modifications in each of the above steps interact with one another in the sense of influencing the overall performance of the algorithm. In this paper we have focused on initialization of EM for certain types of mixture distributions. We have also mentioned some solutions for stopping and preventing divergence. The last of the above listed issues, concerning modifications of the execution of EM steps for improving performance, is also worth discussing due to its relation to the topic of our study.

Several papers introduce modifications of the E and M steps of the EM algorithm designed in such a way that searching through the parameter space becomes more intensive, which can help in avoiding local maxima of the likelihood function and make the recursive process more robust against the choice of initial conditions, e.g., [Zhou and Lange (2010)]. A group of approaches to enhancing performance of EM iterations involves multiple runs (threads) and repeats of EM steps combined with criteria for selecting between threads [Karlis and Xekalaki (2003); Biernacki et al. (2003); O'Hagan et al. (2012)]. The simplest version of short-runs initiation [Karlis and Xekalaki (2003); Biernacki et al. (2003)] involves generating multiple initial values by random methods and starting EM iterations for each of them; then only one iteration process, namely the one which attained the highest value of the likelihood function, is continued. A recently developed implementation, "burn-in EM" [O'Hagan et al. (2012)], involves continuation of recursions of multiple threads combined with gradual elimination of the worse ones on the basis of the value of the likelihood function. Several ideas for improving performance of EM iterations are related to modifications of the log-likelihood function corresponding to the Gaussian mixture model; one example of such an approach is the profile likelihood method in [Yao (2010)]. Introducing constraints and/or modifications of the form of the likelihood function both prevents divergence of iterations and leads to improvement of performance of the corresponding variant of the EM algorithm.

Each of the above discussed approaches can be treated as competitive to our algorithm in the sense that it can lead to improvement of estimation of parameters of mixture distributions. We did not present comparisons of our method to the above approaches. However, according to our experience, for the type of data analyzed in this paper (univariate, heteroscedastic, multi-component), precise initialization is more important than the possible improvements following from modifications of the execution of EM iterations. We should also mention that improvements of EM initialization can be combined with improvements in EM execution to yield even better quality of mixture parameter estimation.
8. Conclusions
The first conclusion of our study is that the initialization methods based on dynamic programming, DP-Q1, DP-Q2, DP-Q3, DP-Q4(∆), show an advantage over the reference methods E-Q, H-clu-c, H-clu-a. We have compared initialization algorithms for a variety of mixture data, with the overlap between neighboring components controlled by the parameter (33) and with different values of variances and component weights (groups 1-4). This allowed for characterizing the dependence of the performances of the algorithms on the parameters of the mixture distributions. The advantage of the dynamic programming initialization methods over the reference methods is highest for heteroscedastic mixture distributions with different mixing proportions.

The second conclusion is that the performance of the dynamic programming partitioning algorithm used for initialization of EM iterations depends on the scoring function used in the algorithm. We have studied several variants of the dynamic programming partition algorithm defined by different scoring functions (25)-(28). The conclusion coming from these analyses, drawn on the basis of both the Avg[log(D)] and Avg(P) criteria, is that for the type of datasets with different mixing proportions and high variability of standard deviations of components, the most efficient EM initialization method is the dynamic programming algorithm with the scoring function Q4.

We have also applied the dynamic programming partition algorithms DP-Q1, DP-Q3 and DP-Q4 for initialization of EM iterations for estimating mixture model parameters for the proteomic dataset including 52 low resolution mass spectral signals. Comparisons of values of the Avg(P) performance criterion lead to the conclusion that, again, the method of the highest efficiency is the dynamic programming partition with the scoring function Q4.

The dynamic programming method with the scoring function Q4 needs adjustment of the parameter ∆. However, computing decompositions for several values of ∆ (1, 5, 10) leads to the conclusion that the algorithm shows high performance for a quite broad range of values of ∆, so adjusting the value of ∆ can be done efficiently and robustly.

The dynamic programming algorithm applied to the partitioning problem has a quadratic computational complexity with respect to the number of elements of the vector of observations x. Despite the computational load, the advantage of using the dynamic programming method for initialization of the EM algorithm is the quality of the obtained mixture model. In the majority of applications of mixture models the quality of the model is more important than the computational complexity of the algorithm used for the computations.

Acknowledgments
This paper was partially financially supported by scientific projects from the Polish National Center for Science (NCN) and the Polish National Center for Research and Development (NCBiR). JP was supported by NCN Harmonia grant DEC-2013/08/M/ST6/00924, AP was supported by NCN Opus grant UMO-2011/01/B/ST6/06868, and MM was supported by NCBiR grant POIG.02.03.01-24-099/13. Computations were performed with the use of the infrastructure provided by the NCBiR POIG.02.03.01-24-099/13 grant: GeCONiI - Upper Silesian Center for Computational Science and Engineering.
Supplementary materials
Supplementary materials are Matlab scripts and functions for performing comparisons of the partitioning algorithms E-Q, H-clu-c, H-clu-a, DP-Q4 for the data described as Group 4 in section 6.2. Demo computations are started by launching the Matlab script partitions-em-demo. One stochastic simulation experiment is performed (including the three steps 1-3 listed in section 6.2). Results of computations are shown by plots of partitions and data histograms versus estimated probability density functions. Values of the errors Avg[log(D)] and likelihoods are also reported. By modifications of the Matlab code, other computational scenarios for simulated data can also be easily realized.

References
Bellman, R. (1961). On the approximation of curves by line segments using dynamic programming. Commun. ACM, 4(6):284.
Biernacki, C. (2004). Initializing EM using the properties of its trajectories in Gaussian mixtures. Statistics and Computing, 14(3):267–279.
Biernacki, C., Celeux, G., and Govaert, G. (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41(3-4):561–575. Recent Developments in Mixture Model.
Biernacki, C., Celeux, G., Govaert, G., and Langrognet, F. (2006). Model-based cluster and discriminant analysis with the MIXMOD software. Computational Statistics & Data Analysis, 51(2):587–600.
Bilmes, J. (1998). A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, University of California Berkeley.
Chylla, R. (2012). Metabolite analysis of biological mixtures using adaptable-shape modeling of an online NMR spectral database. Journal of Computer Science & Systems Biology, 5(1):51.
Eirola, E. and Lendasse, A. (2013). Gaussian mixture models for time series modelling, forecasting, and interpolation. In Tucker, A., Hoppner, F., Siebes, A., and Swift, S., editors, Advances in Intelligent Data Analysis XII, volume 8207 of Lecture Notes in Computer Science, pages 162–173. Springer Berlin Heidelberg.
Fraiha Machado, A., Bonafonte, A., and Queiroz, M. (2013). Parametric decomposition of the spectral envelope. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 571–574.
Fraley, C. and Raftery, A. (1999). Mclust: Software for model-based cluster analysis. Journal of Classification, 16(2):297–306.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, Berlin.
Hébrail, G., Hugueney, B., Lechevallier, Y., and Rossi, F. (2010). Exploratory analysis of functional data via clustering and optimal segmentation. Neurocomputing, 73(7-9):1125–1141.
Ingrassia, S. (2004). A likelihood-based constrained algorithm for multivariate normal mixture models. Statistical Methods and Applications, 13(2).
Karlis, D. and Xekalaki, E. (2003). Choosing initial values for the EM algorithm for finite mixtures. Computational Statistics & Data Analysis, 41(3-4):577–590. Recent Developments in Mixture Model.
Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist., 27(4):887–906.
Maitra, R. (2009). Initializing partition-optimization algorithms. IEEE/ACM Trans Comput Biol Bioinform, 6(1):144–157.
McLachlan, G. and Peel, D. (1999). The EMMIX algorithm for the fitting of normal and t-components. Journal of Statistical Software, 4(2):1–14.
McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley, New York.
Melnykov, V. and Melnykov, I. (2012). Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Computational Statistics & Data Analysis, 56(6):1381–1395.
O'Hagan, A., Murphy, T., and Gormley, I. (2012). Computational aspects of fitting mixture models via the expectation-maximization algorithm. Computational Statistics & Data Analysis, 56(12):3843–3864.
Pietrowska, M., Polanska, J., Walaszczyk, A., Wygoda, A., Rutkowski, T., Skladowski, K., Marczak, L., Stobiecki, M., Marczyk, M., Polanski, A., and Widlak, P. (2011). Association between plasma proteome profiles analysed by mass spectrometry, a lymphocyte-based DNA-break repair assay and radiotherapy-induced acute mucosal reaction in head and neck cancer patients. Int J Radiat Biol, 87(7):711–719.
Polanski, A., Marczyk, M., Pietrowska, M., Widlak, P., and Polanska, J. (2015). Signal partitioning algorithm for highly efficient Gaussian mixture modeling in mass spectrometry. PLOS ONE, 10(7):e0134256.
Richardson, S. and Green, P. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4):731–792.
Sauve, A. and Speed, T. (2004). Normalization, baseline correction and alignment of high-throughput mass spectrometry data. In Proceedings of the Genomic Signal Processing and Statistics.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6(2):461–464.
Sun, H. and Wang, S. (2011). Measuring the component overlapping in the Gaussian mixture model. Data Mining and Knowledge Discovery, 23(3):479–502.
Yao, W. (2010). A profile likelihood method for normal mixture with unequal variance. Journal of Statistical Planning and Inference, 140(7):2089–2098.
Zhou, H. and Lange, K. L. (2010). On the bumpy road to the dominant mode.