A Partial EM Algorithm for Clustering White Breads
Ryan P. Browne*, Paul D. McNicholas*, Christopher J. Findlay†

*Department of Mathematics & Statistics, University of Guelph, Guelph, Ontario, N1G 2W1, Canada. E-mail: {rbrowne, paul.mcnicholas}@uoguelph.ca.
†Compusense Inc., Guelph, Ontario, N1G 4S2, Canada. E-mail: cfi[email protected]

Abstract
The design of new products for consumer markets has undergone a major transformation over the last 50 years. Traditionally, inventors would create a new product that they thought might address a perceived need of consumers. Such products tended to be developed to meet the inventors' own perception and not necessarily that of consumers. The social consequence of a top-down approach to product development has been a large failure rate in new product introduction. By surveying potential customers, a refined target is created that guides developers and reduces the failure rate. Today, however, the proliferation of products and the emergence of consumer choice has resulted in the identification of segments within the market. Understanding your target market typically involves conducting a product category assessment, where 12 to 30 commercial products are tested with consumers to create a preference map. Every consumer gets to test every product in a complete-block design; however, many classes of products do not lend themselves to such approaches because only a few samples can be evaluated before 'fatigue' sets in. We consider an analysis of incomplete balanced-incomplete-block data on 12 different types of white bread. A latent Gaussian mixture model is used for this analysis, with a partial expectation-maximization (PEM) algorithm developed for parameter estimation. This PEM algorithm circumvents the need for a traditional E-step by performing a partial E-step that reduces the Kullback-Leibler divergence between the distribution of the missing data and the conditional distribution of the missing data given the observed data. The results of the white bread analysis are discussed and some mathematical details are given in an appendix.
Keywords:
Balanced-incomplete-block; mixture models; progressive EM; sensometrics; white bread.
Consumer-driven development of new products and the improvement of existing products have become recognized as best practice in industry. Food researchers have become increasingly dependent on understanding consumer wants and desires to effectively design food products (Jaeger et al., 2003). To understand consumer behaviour, preference maps are built by assessing consumer liking of an appropriate range of commercial products within a category. From these liking data, a model may be built that describes the ideal product for the test population. However, most product categories will have more than a single ideal product, with two or more liking clusters revealed. Hedonic taste testing is the most common practice used to measure consumer liking within a target population (Lawless and Heymann, 2010). In a complete-block design, every consumer gets to taste every product, but many product categories do not facilitate this sampling plan. When tasting wine, for example, a consumer can only evaluate three or four samples before 'fatigue' sets in, compromising the quality of the data collected. The fact that consumers tend to behave like experts puts into doubt the value of data obtained over multiple tasting sessions (Findlay, 2008). Therefore, a balanced-incomplete-block (BIB) design is used for high-fatigue products. The resulting data are sparse and tend to be heterogeneous; therefore, we must identify sub-populations in an incomplete-data setting.
In this paper, we consider an analysis of 12 white breads. The descriptive analysis of the breads provides a measure of the range of sensory properties found within the product category. It also gives us information regarding changes that may improve the sensory liking of a product for a specific consumer segment. Consumer research is conducted to measure liking on a nine-point hedonic scale anchored at dislike extremely (1) and like extremely (9), with the midpoint (5) indicating neither like nor dislike. By clustering consumers on similarity of liking profiles across the products, it is possible to determine the sensory attributes that contribute to like/dislike within each cluster. The calibrated descriptive analysis was performed by a trained panel using well-defined product attributes that have been rated for intensity on a scale from 0 to 100. The attributes are generated to provide a complete sensory description of the breads. Some attributes are found at low intensity, but are important in differentiating products. There are also attributes that are defined by a major attribute or group of attributes. For example, sourdough breads would score high in sourness and sour aroma and flavour. The products selected for this study encompass the commercial sliced white bread category. The range goes from the extremely popular sandwich breads that are fine-celled, spongy, and bland to a ciabatta-style Italian hearth bread. The breads differ in crust colour and roughness, texture of the crumb, and flavour, but all fall within the realm of sliced white bread. A total of 369 consumers evaluated six breads within a 12-present-6 BIB design.

One straightforward way to tackle such an incomplete-data problem is to impute the missing data prior to the analysis. However, this approach is not generally desirable for clustering problems because the imputed values will be partly based on data from other sub-populations. Herein, we develop a clustering approach for these data based on a finite Gaussian mixture model. A random variable $X$ follows a $G$-component finite Gaussian mixture model if its density can be written
$$f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g \, \phi(x \mid \mu_g, \Sigma_g), \qquad (1)$$
where $\pi_g > 0$, with $\sum_{g=1}^{G}\pi_g = 1$, are mixing proportions, $\phi(x \mid \mu_g, \Sigma_g)$ is the multivariate Gaussian density with mean $\mu_g$ and covariance matrix $\Sigma_g$, and $\vartheta = (\pi_1, \ldots, \pi_G, \mu_1, \ldots, \mu_G, \Sigma_1, \ldots, \Sigma_G)$. Finite mixture models have been used for clustering for at least fifty years (Wolfe, 1963) and such applications are commonly referred to as 'model-based clustering' (cf. Fraley and Raftery, 2002). One problem with applications of Gaussian mixture models is the number of free covariance parameters: $Gp(p+1)/2$.
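For concreteness, the mixture density in Equation (1) can be evaluated directly; the following is a minimal sketch (not part of the original analysis) with arbitrary illustrative parameter values.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, pis, mus, Sigmas):
    """Evaluate the finite Gaussian mixture density of Equation (1) at x."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

# Two-component illustration in p = 2 dimensions (arbitrary values).
pis = [0.6, 0.4]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
print(mixture_density(np.array([1.0, 1.0]), pis, mus, Sigmas))
```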
To overcome this large number of free parameters, many authors have considered imposing constraints on decomposed component covariance matrices (e.g., Celeux and Govaert, 1995) and others have considered underlying latent factor models (e.g., Ghahramani and Hinton, 1997). We consider an underlying latent factor model herein (cf. Section 3) and, because so many of the data are missing, we assume common component covariance. The result is a parsimonious Gaussian mixture model.

The expectation-maximization (EM) algorithm (Dempster et al., 1977) is the standard approach to parameter estimation for model-based clustering (cf. McLachlan and Basford, 1988). However, in our incomplete-block data we obtain only 6 liking scores from 12 products for each consumer; therefore, an EM algorithm would require $\binom{12}{6} = 924$ different $6 \times 6$ matrix inversions at each iteration. The partial EM (PEM) algorithm developed herein circumvents this computational burden.

The EM Algorithm for Missing Data Problems
The EM algorithm is an iterative procedure for finding maximum likelihood estimates when data are incomplete. Therefore, EM algorithms are naturally suited for missing data problems. The EM algorithm consists of alternating between E- and M-steps until a convergence criterion is satisfied. In the E-step, the expected value of the complete-data (i.e., the observed plus missing data) log-likelihood is computed, and in the M-step, this quantity is maximized with respect to the parameters. Formally, the EM algorithm is a special case of an MM algorithm of the minorization-maximization variety (Hunter and Lange, 2000, 2004).

Suppose we observe $p$-dimensional $y_1, \ldots, y_n$ such that each $y_i$ can be decomposed into an observed component, $x_i$, of dimension $m$, a missing component, $z_i$, of dimension $l = p - m$, and
$$Y_i = \begin{pmatrix} X_i \\ Z_i \end{pmatrix} \sim N\left( \mu = \begin{pmatrix} \mu_x \\ \mu_z \end{pmatrix}, \; \Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xz} \\ \Sigma_{zx} & \Sigma_{zz} \end{pmatrix} \right).$$
Then, the conditional distribution of the missing data given the observed data is
$$Z_i \mid X_i = x_i \sim N\left( \mu_{z.x_i} := \mu_z + \Sigma_{zx}\Sigma_{xx}^{-1}(x_i - \mu_x), \; \Sigma_{z.x} := \Sigma_{zz} - \Sigma_{zx}\Sigma_{xx}^{-1}\Sigma_{xz} \right).$$
Now, define
$$\hat{z}_i := \mu_{z.x_i} = \mu_z + \Sigma_{zx}\Sigma_{xx}^{-1}(x_i - \mu_x), \qquad (2)$$
$$\hat{Y}_i := \begin{pmatrix} 0_{m \times m} & 0_{m \times l} \\ 0_{l \times m} & \hat{Z}_i \end{pmatrix} = \begin{pmatrix} 0_{m \times m} & 0_{m \times l} \\ 0_{l \times m} & \Sigma_{z.x} \end{pmatrix} = \begin{pmatrix} 0_{m \times m} & 0_{m \times l} \\ 0_{l \times m} & \Sigma_{zz} - \Sigma_{zx}\Sigma_{xx}^{-1}\Sigma_{xz} \end{pmatrix}, \qquad (3)$$
where $0_{l \times m}$ is an $l \times m$ matrix of zeros and $\hat{y}_i = (x_i, \hat{z}_i)$. Using this notation, the EM updates for the mean and covariance can be written as
$$\hat{\mu}^{(t+1)} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} \hat{y}_i \qquad \text{and} \qquad \hat{\Sigma}^{(t+1)} = S = \frac{1}{n}\sum_{i=1}^{n} \left[ (\hat{y}_i - \bar{y})(\hat{y}_i - \bar{y})' + \hat{Y}_i \right], \qquad (4)$$
respectively. Because our white bread data have many missing observations, 'standard' E-steps are very computationally expensive. For example, each observation requires inversion of a $6 \times 6$ matrix, giving up to $\binom{12}{6} = 924$ different matrix inversions at each iteration.

Next, consider approximate E-steps instead of full E-steps. All of these procedures will work with the inverse of $\Sigma$ and its principal sub-matrices and vectors. We will assume that $\Sigma^{-1}$ is known; this quantity is typically readily available because it is necessary to calculate the log-likelihood at each EM iteration. We denote
$$\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xz} \\ \Sigma_{zx} & \Sigma_{zz} \end{pmatrix} \qquad \text{and} \qquad \Sigma^{-1} = \Xi = \begin{pmatrix} \Xi_{xx} & \Xi_{xz} \\ \Xi_{zx} & \Xi_{zz} \end{pmatrix}$$
because there exists a relationship between $\Xi_{zz}$ and $\Sigma_{z.x}$. This relationship exists because $\Sigma_{z.x}$ is the Schur complement of $\Sigma_{xx}$ in $\Sigma$, which has the property that $\Sigma_{z.x} = \Xi_{zz}^{-1}$ and, equivalently, $\Sigma_{z.x}^{-1} = \Xi_{zz}$. In addition, there exists a relationship between the regression coefficients and $\Xi$, which can be derived through block inversion of $\Sigma$: $\Sigma_{zx}\Sigma_{xx}^{-1} = -\Xi_{zz}^{-1}\Xi_{zx}$. These relations can be useful if the dimension of $z_i$ is smaller than the dimension of $x_i$. For example, if there is a single missing observation then using $\Sigma$ requires a $(p-1) \times (p-1)$ matrix inversion, whereas if $\Xi$ is known we only require the inversion of a $1 \times 1$ matrix.

For a mixture with $G > 1$ components, the expected value of the component indicator variable for observation $i$ in component $g$ is
$$w_{ig} = \frac{\pi_g \, \phi\left(x_i \mid \mu_{g,x}, \Sigma_{g,xx}\right)}{\sum_{k=1}^{G} \pi_k \, \phi\left(x_i \mid \mu_{k,x}, \Sigma_{k,xx}\right)}, \qquad (5)$$
and the M-step updates are $\hat{\pi}_g^{(t+1)} = n_g/n = (1/n)\sum_{i=1}^{n} w_{ig}$, $\hat{\mu}_g^{(t+1)} = \bar{y}_g = (1/n_g)\sum_{i=1}^{n} w_{ig}\hat{y}_i$, and
$$\hat{\Sigma}_g^{(t+1)} = S_g = \frac{1}{n_g}\sum_{i=1}^{n} \left[ w_{ig}(\hat{y}_i - \bar{y}_g)(\hat{y}_i - \bar{y}_g)' + \hat{Y}_i \right], \qquad (6)$$
where $n_g = \sum_{i=1}^{n} w_{ig}$.
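The conditional quantities in Equations (2) and (3), and the identities $\Sigma_{z.x} = \Xi_{zz}^{-1}$ and $\Sigma_{zx}\Sigma_{xx}^{-1} = -\Xi_{zz}^{-1}\Xi_{zx}$ given above, can be checked numerically. The sketch below uses an arbitrary covariance matrix and is not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 4, 2                      # p-dimensional y, first m coordinates observed
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)  # arbitrary positive-definite covariance
mu = rng.normal(size=p)
x = rng.normal(size=m)           # observed part x_i

Sxx, Sxz = Sigma[:m, :m], Sigma[:m, m:]
Szx, Szz = Sigma[m:, :m], Sigma[m:, m:]

# Conditional mean (Equation (2)) and the nonzero block of Equation (3).
z_hat = mu[m:] + Szx @ np.linalg.solve(Sxx, x - mu[:m])
Z_hat = Szz - Szx @ np.linalg.solve(Sxx, Sxz)

# The same quantities from Xi = Sigma^{-1}, avoiding the inversion of Sigma_xx.
Xi = np.linalg.inv(Sigma)
Xi_zz, Xi_zx = Xi[m:, m:], Xi[m:, :m]
Z_hat_alt = np.linalg.inv(Xi_zz)            # Sigma_{z.x} = Xi_zz^{-1}
coef_alt = -np.linalg.solve(Xi_zz, Xi_zx)   # Sigma_zx Sigma_xx^{-1} = -Xi_zz^{-1} Xi_zx
z_hat_alt = mu[m:] + coef_alt @ (x - mu[:m])

print(np.allclose(Z_hat, Z_hat_alt), np.allclose(z_hat, z_hat_alt))  # True True
```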
As already mentioned (Section 1), it is common practice to introduce parsimony via constrained component covariance matrices. When dealing with sparse data, such as the white bread data, estimating the covariance matrix can be especially difficult. Therefore, we use a variant of the mixture of factor analyzers model (Ghahramani and Hinton, 1997; McLachlan and Peel, 2000) in which we constrain the component factor loading matrices to be equal across groups (cf. McNicholas and Murphy, 2008). The factor analysis model (Spearman, 1904; Bartlett, 1953) assumes that a $p$-dimensional random vector $X_i$ can be modelled using a $q$-dimensional vector of latent factors $U_i$, where $q \ll p$. The model can be written $X_i = \mu + \Lambda U_i + \epsilon_i$, where $\Lambda$ is a $p \times q$ matrix of factor weights, the latent variables $U_i \sim N(0, I_q)$, and $\epsilon_i \sim N(0, \Psi)$, where $\Psi$ is a $p \times p$ diagonal matrix. Therefore, the marginal distribution of $X_i$ is $N(\mu, \Lambda\Lambda' + \Psi)$. It follows that the density for the mixture of factor analyzers model is that of Equation (1) with $\Sigma_g = \Lambda_g\Lambda_g' + \Psi_g$. The model we use for the analysis of the bread data assumes equal factor loading matrices across components, i.e., $\Sigma_g = \Lambda\Lambda' + \Psi_g$, and so the EM algorithm updates are given by
$$\mathrm{vec}\left(\hat{\Lambda}^{\text{(new)}}\right) = \left[\sum_{g=1}^{G} n_g \hat{\Psi}_g^{-1} \otimes \hat{\Theta}_g\right]^{-1} \mathrm{vec}\left(\sum_{g=1}^{G} n_g \hat{\Psi}_g^{-1} S_g \hat{\beta}_g'\right), \qquad (7)$$
$$\hat{\Psi}_g^{\text{(new)}} = \mathrm{diag}\left\{ S_g - 2\hat{\Lambda}^{\text{(new)}}\hat{\beta}_g S_g + \hat{\Lambda}^{\text{(new)}}\hat{\Theta}_g\left(\hat{\Lambda}^{\text{(new)}}\right)' \right\}, \qquad (8)$$
where, following McNicholas and Murphy (2008), $\hat{\beta}_g = \hat{\Lambda}'(\hat{\Lambda}\hat{\Lambda}' + \hat{\Psi}_g)^{-1}$ and $\hat{\Theta}_g = I_q - \hat{\beta}_g\hat{\Lambda} + \hat{\beta}_g S_g \hat{\beta}_g'$ are computed from the current parameter estimates.

We follow Neal and Hinton (1998) and store the sufficient statistics $(\hat{z}_1, \ldots, \hat{z}_n)$ and $(\hat{Z}_1, \ldots, \hat{Z}_n)$. However, instead of doing a complete update of a partial set of the sufficient statistics as suggested by Neal and Hinton (1998), we perform a partial E-step at each iteration. These partial E-steps can be shown to reduce the Kullback-Leibler (KL) divergence at every step, which ensures that the monotonicity of the EM algorithm is preserved. From Neal and Hinton (1998), the EM algorithm can be viewed as maximizing the function
$$\sum_{i=1}^{n} F(\tilde{P}_i, \theta) = -\sum_{i=1}^{n} D(\tilde{P}_i \,\|\, P_{i,\theta}) + \sum_{i=1}^{n} l(x_i \mid \theta),$$
where $D(\tilde{P}_i \,\|\, P_{i,\theta})$ is the KL divergence between the distribution $\tilde{P}_i$ of the missing data and the conditional distribution of the missing data given the observed data, $P_{i,\theta} = P(Z_i \mid x_i, \theta)$. A 'standard' E-step sets $\tilde{P}_i = P(Z_i \mid x_i, \theta_t)$, for all $i$, at each iteration $t$. Neal and Hinton (1998) suggest a partial or sparse E-step, where a subset of the $\tilde{P}_i$ is updated to $P(Z_i \mid x_i, \theta_t)$ at each EM iteration. The algorithm we describe in the next two sections partially updates each $\tilde{P}_i$ towards $P(Z_i \mid x_i, \theta)$ such that the KL divergence is reduced but not minimized. For the multivariate Gaussian distribution and a particular $i$, the EM algorithm can be viewed as maximizing
$$F(N_z, x_i, \theta) = -D_{\mathrm{KL}}(N_z \,\|\, N_{z.x_i}) + l(x_i \mid \theta)$$
with respect to $N_z$, the distribution of the latent or missing variables, and the parameter set $\theta$.

The KL divergence between the missing data distribution with mean $\hat{z}_i$ and variance $\hat{Z}_i$ and the conditional distribution of the missing data given the observed data $x_i$ is
$$D_{\mathrm{KL}}(N_z \,\|\, N_{z.x_i}) = \frac{1}{2}\left[ \mathrm{tr}\left\{\Sigma_{z.x}^{-1}\hat{Z}_i\right\} + (\hat{z}_i - \mu_{z.x})'\Sigma_{z.x}^{-1}(\hat{z}_i - \mu_{z.x}) - l - \ln\frac{|\hat{Z}_i|}{|\Sigma_{z.x}|}\right]. \qquad (9)$$
From Equation (9), we can see that if we set $\hat{z}_i$ to the conditional mean and $\hat{Z}_i$ to the conditional covariance matrix, then the KL divergence is minimized. However, for our data this involves inverting $\binom{12}{6} = 924$ different $6 \times 6$ matrices at each iteration.
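As an illustration of Equation (9), the KL divergence between $N(\hat{z}_i, \hat{Z}_i)$ and $N(\mu_{z.x}, \Sigma_{z.x})$ can be computed directly; it is zero exactly when $\hat{z}_i$ and $\hat{Z}_i$ equal the conditional mean and covariance. The sketch below uses arbitrary illustrative values and is not from the paper.

```python
import numpy as np

def kl_gaussian(z_hat, Z_hat, mu_cond, S_cond):
    """KL divergence of Equation (9): D_KL(N(z_hat, Z_hat) || N(mu_cond, S_cond))."""
    l = len(mu_cond)
    S_inv = np.linalg.inv(S_cond)
    diff = z_hat - mu_cond
    return 0.5 * (np.trace(S_inv @ Z_hat) + diff @ S_inv @ diff - l
                  - np.log(np.linalg.det(Z_hat) / np.linalg.det(S_cond)))

# Arbitrary conditional distribution (illustrative values only).
mu_cond = np.array([1.0, -0.5])
S_cond = np.array([[1.0, 0.3], [0.3, 2.0]])

print(kl_gaussian(mu_cond, S_cond, mu_cond, S_cond))              # 0.0 at the minimizer
print(kl_gaussian(mu_cond + 0.5, 2.0 * S_cond, mu_cond, S_cond))  # > 0 otherwise
```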
Hereafter, the following notation will be used. Let $\Sigma_j$ be the principal sub-matrix of $\Sigma$ obtained by deleting column $j$ and row $j$. Let $\sigma_j$ and $\xi_j$ be the $j$th diagonal elements of $\Sigma$ and $\Xi$, respectively. Let $\boldsymbol{\sigma}_j$ and $\boldsymbol{\xi}_j$ be the $j$th rows of $\Sigma$ and $\Xi$, respectively, with the $j$th element deleted. For example, if $j = 1$, then
$$\Sigma = \begin{pmatrix} \sigma_1 & \boldsymbol{\sigma}_1' \\ \boldsymbol{\sigma}_1 & \Sigma_1 \end{pmatrix} \qquad \text{and} \qquad \Sigma^{-1} = \Xi = \begin{pmatrix} \xi_1 & \boldsymbol{\xi}_1' \\ \boldsymbol{\xi}_1 & \Xi_1 \end{pmatrix}. \qquad (10)$$

Updating $\hat{z}_i$

To minimize the KL divergence with respect to $\hat{z}_i$, we set it to $\mu_{z.x}$, which is given in (2). However, $\mu_{z.x}$ depends on a matrix inversion that depends on $i$. We now develop an updating equation that reduces the number of matrix inversions. The KL divergence depends on $\hat{z}_i$ through one term in Equation (9) and
$$\operatorname*{arg\,min}_{\hat{z}_i} D_{\mathrm{KL}}(N_z \,\|\, N_{z.x_i}) = \operatorname*{arg\,min}_{\hat{z}_i} (\hat{y}_i - \mu)'\Sigma^{-1}(\hat{y}_i - \mu)$$
because the last term in
$$(\hat{y}_i - \mu)'\Sigma^{-1}(\hat{y}_i - \mu) = (\hat{z}_i - \mu_{z.x})'\Sigma_{z.x}^{-1}(\hat{z}_i - \mu_{z.x}) + (x_i - \mu_x)'\Sigma_{xx}^{-1}(x_i - \mu_x) \qquad (11)$$
is fixed and, moreover, does not depend on $\hat{z}_i$ (cf. Appendix A.1).

We now consider a conditional minimization, by each element in $\hat{y}_i = (\hat{y}_{i1}, \ldots, \hat{y}_{ip})$, of $(\hat{y}_i - \mu)'\Sigma^{-1}(\hat{y}_i - \mu)$, which is given by
$$\hat{y}_{ij}^{(t+1)} = \begin{cases} \mu_j + \boldsymbol{\sigma}_j'\Sigma_j^{-1}\left(\hat{y}_{i,-j}^{(t)} - \mu_{-j}\right) & \text{if } j \text{ corresponds to an element of } \hat{z}_i, \\ y_{ij} & \text{if } j \text{ corresponds to an element of } x_i, \end{cases}$$
where $\mu_j$ is the $j$th element of $\mu$, $\mu_{-j}$ is $\mu$ with its $j$th element deleted, and $\hat{y}_{i,-j}^{(t)}$ is $\hat{y}_i^{(t)}$ with its $j$th element deleted. This procedure requires inversion of $\Sigma_j$, the $j$th principal sub-matrix, i.e., a $(p-1) \times (p-1)$ matrix inversion for each $j$. However, if $\Sigma^{-1} = \Xi$ is known, the updates can be simplified to
$$\hat{y}_{ij}^{(t+1)} = \begin{cases} \mu_j - \dfrac{1}{\xi_j}\,\boldsymbol{\xi}_j'\left(\hat{y}_{i,-j}^{(t)} - \mu_{-j}\right) & \text{if } j \text{ corresponds to an element of } \hat{z}_i, \\ y_{ij} & \text{if } j \text{ corresponds to an element of } x_i. \end{cases}$$
This set of updates requires the inversion of a $1 \times 1$ matrix only. If we arrange the $\hat{y}_i$ into an $n \times p$ matrix $\hat{A} = (\hat{y}_1', \ldots, \hat{y}_n')'$, then our partial E-steps update $\hat{A}$ by column instead of 'standard' E-steps that update $\hat{A}$ by row.

Updating $\hat{Z}_i$

Now, minimizing the KL divergence with respect to $\hat{Z}_i$, we have
$$\operatorname*{arg\,min}_{\hat{Z}_i} D_{\mathrm{KL}}(N_z \,\|\, N_{z.x_i}) = \operatorname*{arg\,min}_{\hat{Z}_i}\left[ \mathrm{tr}\left\{\Sigma_{z.x}^{-1}\hat{Z}_i\right\} - \ln\frac{|\hat{Z}_i|}{|\Sigma_{z.x}|}\right].$$
The trace term and the negative log-determinant are convex over the positive definite matrices (Magnus and Neudecker, 1998). Now, consider the function
$$\gamma(\hat{Z}_i) = \mathrm{tr}\left[\left(\Sigma - \hat{Y}_i\right)\Sigma^{-1}\left(\Sigma - \hat{Y}_i\right)\right], \qquad (12)$$
which is also convex, and these functions have the property
$$\operatorname*{arg\,min}_{\hat{Z}_i} D_{\mathrm{KL}}(N_z \,\|\, N_{z.x_i}) = \operatorname*{arg\,min}_{\hat{Z}_i} \mathrm{tr}\left[\left(\Sigma - \hat{Y}_i\right)\Sigma^{-1}\left(\Sigma - \hat{Y}_i\right)\right]. \qquad (13)$$
Both objective functions are minimized by the Schur complement (cf. Appendix A.2 for the function $\gamma$). In addition, if one objective function is reduced then so is the other because both functions are convex and have the same (global) minimum. Therefore, if the function $\gamma$ is reduced at every iteration, then the KL divergence is reduced at every iteration and so the algorithm has the monotonicity property.

Conditional updates for the function $\gamma$ are derived in Appendix A.2. We update $\hat{Y}_i$ by column and row, while holding the associated principal sub-matrix fixed. Therefore, if we denote the $j$th row of $\hat{Y}_i$ by $\hat{Y}_{i,j}$ and the $(p-1) \times p$ matrix obtained by removing the $j$th row by $\hat{Y}_{i,-j}$, then the updates can be written as
$$\hat{Y}_{i,j}^{(t+1)} = \begin{cases} \Sigma_{j\cdot} - \left(\boldsymbol{\sigma}_j - \hat{Y}_{i,-j,j}^{(t)}\right)'\Sigma_j^{-1}\left(\Sigma_{-j} - \hat{Y}_{i,-j}^{(t)}\right) & \text{if } j \text{ is associated with } \hat{z}_i, \\ 0_{1 \times p} & \text{if } j \text{ is associated with } x_i, \end{cases}$$
where $\Sigma_{j\cdot}$ denotes the $j$th row of $\Sigma$, $\Sigma_{-j}$ denotes $\Sigma$ with its $j$th row removed, and $\hat{Y}_{i,-j,j}^{(t)}$ denotes the $j$th column of $\hat{Y}_{i,-j}^{(t)}$; we then set the $j$th column of $\hat{Y}_i^{(t+1)}$ equal to the $j$th row.

We can avoid the matrix inversion of the principal sub-matrix $\Sigma_j$ by again exploiting the properties of the inverse $\Xi$. Specifically, provided $\boldsymbol{\xi}_j'\boldsymbol{\sigma}_j \neq 1$,
$$\Sigma_j^{-1} = \left[I_{p-1} + \frac{1}{1 - \boldsymbol{\xi}_j'\boldsymbol{\sigma}_j}\,\boldsymbol{\xi}_j\boldsymbol{\sigma}_j'\right]\Xi_j,$$
where $I_{p-1}$ is the $(p-1) \times (p-1)$ identity matrix.
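A minimal sketch (not the authors' implementation) of the column-wise partial update for $\hat{z}_i$ described above, using only the elements of $\Xi = \Sigma^{-1}$; with repeated sweeps it converges to the conditional mean $\mu_{z.x}$. All values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)     # arbitrary positive-definite covariance
Xi = np.linalg.inv(Sigma)
mu = rng.normal(size=p)

obs = np.array([0, 2])              # indices of observed coordinates (x_i)
mis = np.array([1, 3, 4])           # indices of missing coordinates (z_i)
y_hat = mu.copy()                   # start missing entries at their marginal means
y_hat[obs] = rng.normal(size=obs.size)

for sweep in range(200):            # sweeps of the partial E-step
    for j in mis:
        rest = np.delete(np.arange(p), j)
        # y_hat[j] <- mu_j - (1/xi_j) * xi_j' (y_hat[-j] - mu[-j])
        y_hat[j] = mu[j] - Xi[j, rest] @ (y_hat[rest] - mu[rest]) / Xi[j, j]

# Exact conditional mean for comparison (Equation (2)).
cond = mu[mis] + Sigma[np.ix_(mis, obs)] @ np.linalg.solve(
    Sigma[np.ix_(obs, obs)], y_hat[obs] - mu[obs])
print(np.allclose(y_hat[mis], cond))  # True (to numerical tolerance)
```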
Evaluating Weights and the Likelihood Function
The likelihood depends only on the observed data:
$$f(x_i) = \frac{1}{(2\pi)^{m/2}|\Sigma_{xx}|^{1/2}}\exp\left\{-\frac{1}{2}(x_i - \mu_x)'\Sigma_{xx}^{-1}(x_i - \mu_x)\right\}. \qquad (14)$$
However, from Equation (11) we have
$$(\hat{y}_i - \mu)'\Sigma^{-1}(\hat{y}_i - \mu) = (\hat{z}_i - \mu_{z.x})'\Sigma_{z.x}^{-1}(\hat{z}_i - \mu_{z.x}) + (x_i - \mu_x)'\Sigma_{xx}^{-1}(x_i - \mu_x). \qquad (15)$$
Therefore,
$$(x_i - \mu_x)'\Sigma_{xx}^{-1}(x_i - \mu_x) \leq (\hat{y}_i - \mu)'\Sigma^{-1}(\hat{y}_i - \mu) = (\hat{y}_i - \mu)'\Xi(\hat{y}_i - \mu) \qquad (16)$$
and we have equality when $\hat{z}_i = \mu_{z.x}$. Our algorithm for $\hat{z}_i$ will converge to $\mu_{z.x}$, so if we use this approximation for the likelihood and weights calculations, they will also converge to the true quantities.

To calculate $|\Sigma_{xx}|$, we use the relationship between Schur complements and the determinant, and the relationship between the inverse matrix and the Schur complement:
$$\ln|\Sigma| = \ln|\Sigma_{xx}| + \ln|\Sigma_{z.x}| = \ln|\Sigma_{xx}| - \ln|\Xi_{zz}|.$$
Alternatively, we could use our current estimate of $\Sigma_{z.x_i}$, namely $\hat{Z}_i$. However, note that if the dimension of the missing data is larger than the dimension of the observed data for observation $i$, then it will be better to calculate $|\Sigma_{xx}|$ directly.
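A sketch (with arbitrary values, not the authors' code) of evaluating the observed-data log-density of Equation (14) using the identity $\ln|\Sigma_{xx}| = \ln|\Sigma| + \ln|\Xi_{zz}|$ together with the quadratic form in (16), so that only $\Xi = \Sigma^{-1}$ and its sub-block $\Xi_{zz}$ are needed once $\hat{z}_i$ has converged to $\mu_{z.x}$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, m = 5, 3                          # m observed, p - m missing coordinates
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)      # arbitrary positive-definite covariance
Xi = np.linalg.inv(Sigma)
mu = rng.normal(size=p)
x = rng.normal(size=m)               # observed part x_i (first m coordinates)

# Direct evaluation of the observed-data log-density (Equation (14)).
Sxx = Sigma[:m, :m]
q_direct = (x - mu[:m]) @ np.linalg.solve(Sxx, x - mu[:m])
logdet_direct = np.linalg.slogdet(Sxx)[1]

# Evaluation via Xi: log|Sigma_xx| = log|Sigma| + log|Xi_zz|, and the quadratic
# form (y_hat - mu)' Xi (y_hat - mu) with z_hat at the conditional mean.
Xi_zz = Xi[m:, m:]
logdet_via_Xi = np.linalg.slogdet(Sigma)[1] + np.linalg.slogdet(Xi_zz)[1]
z_hat = mu[m:] + Sigma[m:, :m] @ np.linalg.solve(Sxx, x - mu[:m])
y_hat = np.concatenate([x, z_hat])
q_via_Xi = (y_hat - mu) @ Xi @ (y_hat - mu)

logf_direct = -0.5 * (m * np.log(2 * np.pi) + logdet_direct + q_direct)
logf_via_Xi = -0.5 * (m * np.log(2 * np.pi) + logdet_via_Xi + q_via_Xi)
print(np.allclose(logf_direct, logf_via_Xi))  # True
```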
The Bayesian information criterion (BIC; Schwarz, 1978) is used to select the number of components $G$ and the number of latent factors $q$. For a model with parameters $\theta$, $\mathrm{BIC} = 2l(x, \hat{\theta}) - m\log n$, where $l(x, \hat{\theta})$ is the maximized log-likelihood, $\hat{\theta}$ is the maximum likelihood estimate of $\theta$, $m$ is the number of free parameters in the model, and $n$ is the number of observations. The use of the BIC in mixture model selection was originally justified (Dasgupta and Raftery, 1998) by an approximation to Bayes factors (Kass and Raftery, 1995). The effectiveness of the BIC for choosing the number of factors in a factor analysis model has been established by Lopes and West (2004).

A total of $n = 369$ consumers tasted six out of 12 white breads in a BIB design. Taste was evaluated on the hedonic scale, so values in $\{1, 2, \ldots, 9\}$ are assigned to each tasted bread. For illustration, the first few rows of the data are shown in Table 1, where the bread brands are denoted A, B, $\ldots$, L. We fitted our mixture of factor analyzers model, with common factors, to these data using the PEM algorithm introduced herein. These models were fitted over a range of values of $G$ and for $q = 1, \ldots, 3$, using multiple restarts.
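The model selection just described can be organized as a simple loop. The sketch below is not from the paper; it assumes a hypothetical fitting routine `fit_pem(data, G, q, n_restarts)` that returns the maximized log-likelihood and the number of free parameters for a given model.

```python
import numpy as np

def bic(loglik, n_params, n_obs):
    """BIC = 2 l(x, theta_hat) - m log n, as defined above (larger is better here)."""
    return 2.0 * loglik - n_params * np.log(n_obs)

def select_model(data, n_obs, G_values, q_values, fit_pem, n_restarts=10):
    """Fit every (G, q) combination and keep the one with the largest BIC.

    `fit_pem` is a hypothetical user-supplied routine returning (loglik, n_params)."""
    best = None
    for G in G_values:
        for q in q_values:
            loglik, n_params = fit_pem(data, G, q, n_restarts)
            value = bic(loglik, n_params, n_obs)
            if best is None or value > best[0]:
                best = (value, G, q)
    return best  # (BIC, G, q) of the selected model
```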
The results (Table 2) show that the BIC selected a model with $G = 3$ components and $q = 2$ factors. Note that we also ran standard EM algorithms on these data and can confirm that they converged to the same results as our PEM algorithms. A plot of the two latent factors (Figure 1) shows the three components in the latent space. Because the classifications are based on maximum a posteriori (MAP) probabilities, it is straightforward to provide the client with probabilities rather than hard group memberships; this might be particularly desirable for consumers near the cluster boundaries.

Table 1: The first six rows of the white bread data; each consumer evaluates six of the twelve breads (A, B, ..., L) using the hedonic scale, and the six recorded scores for each consumer are shown.

Consumer 1: 9 8 6 9 4 8
Consumer 2: 3 8 7 8 7 8
Consumer 3: 8 6 7 6 9 7
Consumer 4: 5 4 6 4 3 6
Consumer 5: 7 7 8 7 6 8
Consumer 6: 8 3 4 8 7 7

Table 2: BIC values from our analysis of the white bread data, for the values of $G$ and $q$ considered.
Figure 1: Plot of the two latent factors for the selected model, coloured by component (left), and a plot of the average liking scores for each of the breads separated by component (right).

In the same figure, there is also a plot of the mean liking scores for each bread for each of the three components. The red and green components seem to represent higher and lower scorers, respectively, with consumers within the black component exhibiting more variability in liking.

Some interesting points emerge from inspection of the results. Notably, bread J emerges as polarizing: it is strongly liked in the red and black groups and disliked in the green group. Interestingly, bread J is the only ciabatta-style bread in the study and so it makes sense that its sensory properties will result in a relatively extreme liking response. This liking contrast is useful in differentiating groups of consumers because the objective of this research is to understand the sensory-based choice behaviour of consumers to define an optimum product for each liking cluster. Bread I is also interesting, in that it is the one bread for which consumers in all three groups seem to converge to similar liking scores. Bread I is the sweetest, most flavourful bread in the study; it is also firm, dense, moist, and chewy. This is an unusual combination of characteristics and one would expect it to stand out. The fact that it stood out by not differentiating consumers in this study is itself interesting in the process of trying to understand the sensory-based choice behaviour of consumers.
The analysis of the bread data was repeated using a standard EM algorithm for parameter estimation. The results were the same, as we would expect. Figure 2 illustrates the progression of the EM and PEM algorithms with $G = 1$ and $G = 2$ components, respectively, and $q = 2$ latent factors. As expected, both algorithms converge to the same solution in an almost identical fashion.
Figure 2: Plot of the log-likelihood for $G = 1$ and $G = 2$ for the PEM (red line) and EM (black line) algorithms.

We developed an approach for clustering incomplete BIB data from consumer tasting of 12 different commercial white breads. Our clustering approach is based on a parsimonious mixture of factor analyzers model, where the factor loading matrices are constrained to be equal across groups. The problem of missing data is handled along with parameter estimation within a partial EM algorithm framework. Rather than simple imputation, this PEM algorithm effectively imputes missing data at each iteration based on current component membership probabilities; this is a natural approach, as missing values are filled in based on complete values in observations that are in some sense similar (i.e., in the same component). Our PEM algorithm is much more computationally efficient than the standard EM algorithm for this, and any such, missing data problem. The PEM algorithm is shown to retain the monotonicity property and, thus, retains the same convergence properties as the EM algorithm. Three benefits are achieved through this approach: the quality of data that are collected prior to fatigue is improved; the method of substituting missing data reflects the sensory preferences of each consumer, which permits robust cluster assignment; and the collection of incomplete-block data reduces the cost, time, and materials required for this type of study.

We introduce a new variation of the EM algorithm called the PEM algorithm. The many varieties of the EM algorithm mainly focus on the M-step: the expectation-conditional maximization (ECM) algorithm (Meng and Rubin, 1993), the ECM either (ECME) algorithm (Liu and Rubin, 1994), the alternating ECM (AECM) algorithm, and others. Few examine different ways of partially updating the E-step. Neal and Hinton (1998) give several possible methods to update the missing sufficient statistics. All of the methods suggest something along the lines of fully updating a partial set of the missing sufficient statistics. When the E-step is intractable, Wei and Tanner (1990) suggest approximating the E-step by simulating $m$ observations from the conditional distribution of the missing data given the observed data. This version of the EM algorithm is called Monte Carlo EM (MCEM). Prior to MCEM, Celeux and Diebolt (1985) suggested using stochastic EM (SEM), which is the same as MCEM with $m = 1$. Other variations on approximating the E-step have been introduced, such as MCEM using rejection sampling, importance sampling, and Markov chain Monte Carlo. The PEM algorithm presented here is similar to ECM in that we have 'conditional' E-steps instead of conditional M-steps. These conditional E-steps are computationally cheaper than a complete or full E-step.

Acknowledgements
This work was supported by a grant-in-aid from Compusense Inc. and by a Collaborative Research and Development grant from the Natural Sciences and Engineering Research Council of Canada.
A Some Mathematical Details
A.1 Schur Complement Relation in Quadratic Form
Suppose we have a positive-definite symmetric matrix $S$ and a vector $y$ with decompositions
$$y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \qquad \text{and} \qquad S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}.$$
Then
$$y'S^{-1}y = y_1'S_{11}^{-1}y_1 + \left(y_2 - S_{21}S_{11}^{-1}y_1\right)'S_{22.1}^{-1}\left(y_2 - S_{21}S_{11}^{-1}y_1\right),$$
where $S_{22.1} = S_{22} - S_{21}S_{11}^{-1}S_{12}$.
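The identity above is easy to verify numerically; the sketch below uses an arbitrary positive-definite matrix and vector (not from the paper).

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 6, 2                          # y_1 has length k, y_2 has length p - k
A = rng.normal(size=(p, p))
S = A @ A.T + p * np.eye(p)          # arbitrary positive-definite matrix
y = rng.normal(size=p)

S11, S12 = S[:k, :k], S[:k, k:]
S21, S22 = S[k:, :k], S[k:, k:]
y1, y2 = y[:k], y[k:]

S22_1 = S22 - S21 @ np.linalg.solve(S11, S12)   # Schur complement S_{22.1}
resid = y2 - S21 @ np.linalg.solve(S11, y1)

lhs = y @ np.linalg.solve(S, y)
rhs = y1 @ np.linalg.solve(S11, y1) + resid @ np.linalg.solve(S22_1, resid)
print(np.allclose(lhs, rhs))  # True
```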
A.2 A Matrix Minimization Problem

Suppose we have a positive-definite symmetric matrix $S$ with decomposition
$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}.$$
We then have the following property for the function $h$:
$$h(\Theta_1) = \mathrm{tr}\left\{(S_{11} - \Theta_1,\; S_{12})\, S^{-1}\, (S_{11} - \Theta_1,\; S_{12})'\right\} \geq \mathrm{tr}\left\{S_{12}S_{22}^{-1}S_{12}'\right\}. \qquad (17)$$
Equality holds when $\Theta_1 = S_{11} - S_{12}S_{22}^{-1}S_{21}$; therefore, $h(\Theta_1)$ is minimized by the Schur complement $\Theta_1 = S_{11} - S_{12}S_{22}^{-1}S_{21}$. Now, if we define
$$\Theta = \begin{pmatrix} \Theta_1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad (18)$$
then
$$\gamma(\Theta_1) = \mathrm{tr}\left\{(S - \Theta)S^{-1}(S - \Theta)\right\} = h(\Theta_1) + \mathrm{tr}\left\{S_{22}\right\}. \qquad (19)$$
Because the right-hand term does not depend on $\Theta_1$, $\gamma(\Theta_1)$ has the same minimum as $h(\Theta_1)$, i.e., the Schur complement $\Theta_1 = S_{11} - S_{12}S_{22}^{-1}S_{21}$. Therefore, a minimization algorithm based on the function $h$ is equivalent to minimizing $\gamma$. We minimize $h$ using a conditional minimization algorithm (by column/row) based on Equation (17).
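The minimization property above can also be checked numerically: with $\Theta_1$ set to the Schur complement $S_{11} - S_{12}S_{22}^{-1}S_{21}$, the function $h$ attains the stated lower bound, and perturbing $\Theta_1$ increases it. The sketch below uses arbitrary values and is not from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
p, k = 5, 2                          # block 1 is k x k
A = rng.normal(size=(p, p))
S = A @ A.T + p * np.eye(p)          # arbitrary positive-definite matrix
S11, S12 = S[:k, :k], S[:k, k:]
S21, S22 = S[k:, :k], S[k:, k:]
S_inv = np.linalg.inv(S)

def h(Theta1):
    """h(Theta_1) = tr{(S11 - Theta_1, S12) S^{-1} (S11 - Theta_1, S12)'} from (17)."""
    R = np.hstack([S11 - Theta1, S12])
    return np.trace(R @ S_inv @ R.T)

schur = S11 - S12 @ np.linalg.solve(S22, S21)        # Schur complement of S22 in S
bound = np.trace(S12 @ np.linalg.solve(S22, S12.T))  # lower bound in (17)

print(np.allclose(h(schur), bound))                  # True: the bound is attained
print(h(schur + 0.1 * np.eye(k)) > h(schur))         # True: other choices are larger
```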
References

Bartlett, M. (1953). Factor analysis in psychology as a statistician sees it. In Uppsala Symposium on Psychological Factor Analysis, Number 3 in Nordisk Psykologi's Monograph Series, Uppsala, Sweden, pp. 23-34. Uppsala: Almquist and Wiksell.

Celeux, G. and J. Diebolt (1985). The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly 2, 73-82.

Celeux, G. and G. Govaert (1995). Gaussian parsimonious clustering models. Pattern Recognition 28, 781-793.

Dasgupta, A. and A. E. Raftery (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association 93, 294-302.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39, 1-38.

Findlay, C. J. (2008). Consumer segmentation of BIB liking data of 12 cabernet sauvignon wines: A case study. Presented at the 9th Sensometrics Meeting, July 20-23, St. Catharines, Canada.

Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97, 611-631.

Ghahramani, Z. and G. E. Hinton (1997). The EM algorithm for factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.

Hunter, D. L. and K. Lange (2000). Rejoinder to discussion of "Optimization transfer using surrogate objective functions". Journal of Computational and Graphical Statistics 9, 52-59.

Hunter, D. L. and K. Lange (2004). A tutorial on MM algorithms. The American Statistician 58(1), 30-37.

Jaeger, S. R., K. L. Rossiter, W. V. Wismer, and F. R. Harker (2003). Consumer-driven product development in the kiwifruit industry. Food Quality and Preference 14(3), 187-198.

Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90, 773-795.

Lawless, H. T. and H. Heymann (2010). Sensory Evaluation of Food: Principles and Practices. New York: Springer.

Liu, C. and D. B. Rubin (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 81, 19-39.

Lopes, H. F. and M. West (2004). Bayesian model assessment in factor analysis. Statistica Sinica 14, 41-67.

Magnus, J. R. and H. Neudecker (1998). Matrix Differential Calculus with Applications in Statistics and Econometrics. New York: John Wiley & Sons.

McLachlan, G. J. and K. E. Basford (1988). Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker Inc.

McLachlan, G. J. and D. Peel (2000). Mixtures of factor analyzers. In Proceedings of the Seventh International Conference on Machine Learning, San Francisco, pp. 599-606. Morgan Kaufmann.

McNicholas, P. D. and T. B. Murphy (2008). Parsimonious Gaussian mixture models. Statistics and Computing 18, 285-296.

Meng, X. L. and D. B. Rubin (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80, 267-278.

Neal, R. M. and G. E. Hinton (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in Graphical Models, pp. 335-368. Dordrecht: Kluwer Academic Publishers.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461-464.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology 15(1), 72-101.

Wei, G. and M. A. Tanner (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms.