How many data clusters are in the Galaxy data set? Bayesian cluster analysis in action
Bettina Grün, Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter
All authors: WU Vienna University of Economics and Business.

In model-based clustering, the Galaxy data set is often used as a benchmark data set to study the performance of different modeling approaches. Aitkin (2001) compares maximum likelihood and Bayesian analyses of the Galaxy data set and expresses reservations about the Bayesian approach because the prior assumptions imposed remain rather obscure while playing a major role in the results obtained and the conclusions drawn. The aim of this paper is to address Aitkin's concerns about the Bayesian approach by shedding light on how the specified priors impact the number of estimated clusters. We perform a sensitivity analysis of different prior specifications for the mixture of finite mixtures model, i.e., the mixture model in which a prior on the number of components is included. We use an extensive set of different prior specifications in a full factorial design and assess their impact on the estimated number of clusters for the Galaxy data set. The results highlight the interaction effects of the prior specifications and provide insights into which prior specifications are recommended to obtain a sparse clustering solution. A clear understanding of the impact of the prior specifications removes restraints preventing the use of Bayesian methods due to the complexity of selecting suitable priors. In addition, the regularizing properties of the priors may be intentionally exploited to obtain a clustering solution that meets the prior expectations and needs of the application.
Keywords: Bayes, cluster analysis, Galaxy data set, mixture model, prior specification.
Introduction

This paper investigates the impact of different prior specifications on the results obtained in Bayesian cluster analysis based on mixture models. Mixture models may be used either to approximate arbitrary densities in a semi-parametric way or, in a model-based clustering context, to identify groups in the data. We will focus on the latter application, where each component is assumed to potentially represent a data cluster and the cluster distribution is not approximated by several mixture components.

Hennig and Liao (2013) claim that "there are no unique 'true' or 'best' clusters in a data set" but that the prototypical shape of a cluster needs to be specified before this question can be answered. For clustering methods using mixture models, the prototypical shape of a cluster is in general specified by selecting the component-specific distributions. For the fitted mixture model, a one-to-one relationship between components and clusters is then assumed. For example, in the case of multivariate metric data one can specify isotropic Gaussian distributions as component distributions, where the variance is comparable across components, or Gaussian distributions with arbitrary variance-covariance matrices, which are allowed to vary considerably across components (see, for example, Fraley and Raftery, 2002).

The Bayesian framework provides a principled approach to specifying the prototypical shape of the clusters. By specifying priors on the model parameters, both the mean prototypical shape and the variability around this prototypical shape are included, i.e., what the shape is on average as well as how much the component distributions vary from one another across components.
In this sense the Bayesian approach provides more flexibility than other clustering methods to incorporate the prototypical shape of a cluster into the analysis and hence to arrive at a suitable cluster solution for the specific analysis undertaken. In addition, the Bayesian framework allows one to specify a prior on the component weights, thus influencing whether the clusters are a priori assumed to be rather balanced in size or to include both very small and very large clusters. By contrast, k-means clustering, for example, assumes that the clusters have an isotropic shape with similar cluster size and volume (see, for example, Grün, 2019).

However, the additional flexibility provided by the Bayesian approach might also be perceived as overwhelming, in particular if the influence of different prior specifications on the results obtained remains rather opaque. Aitkin (2001) compares maximum likelihood and Bayesian analyses of mixture models and expresses reservations about the Bayesian approach because the prior assumptions imposed remain rather obscure while playing a major role in the results obtained and the conclusions drawn. Sufficient insight into the influence of prior specifications on the clustering results is therefore crucial in order to leverage the advantages of the Bayesian approach, where the priors may be used to regularize the problem and to guide the analysis towards the clustering results of interest.

In the following we consider the mixture of finite mixtures model (MFM), a name coined by Miller and Harrison (2018) following Richardson and Green (1997). The MFM is a hierarchical finite mixture model in which a prior on the number of components K is included. We focus on the MFM because the Bayesian analysis of the MFM results in a posterior distribution of the number of data clusters K+ as well as a posterior distribution of the partitions C.
These are both core components of a Bayesian cluster analysis, addressing the questions of how many data clusters there are in the data set and how the observations should be grouped into these data clusters.

Note that in our analyses of the MFM, we make a crucial distinction between K, the number of components of the mixture distribution, and K+, the number of filled components to which observations are actually assigned given the partition which groups the observations. Only a filled component corresponds to a data cluster. This implies that, when estimating the number of clusters in the data, the posterior of K+ is of interest, rather than the posterior of K. We will thus not only investigate the prior on K, but also explicitly inspect the prior on K+, which is induced by the prior on K and the prior on the mixture weights. In the analysis of the results, focus is given to the posterior of K+ (rather than K), determining in particular the mode of this distribution and its entropy.

We illustrate the impact of different prior specifications using an MFM of univariate Gaussian distributions for the (in-)famous Galaxy data set, originally introduced to the statistical literature by Roeder (1990). Several results obtained for this data set using either maximum likelihood estimation or Bayesian analysis methods were compared and discussed in Aitkin (2001). Aitkin (2001) concluded that the maximum likelihood analysis, while having complications of its own, is rather straightforward to implement and well understood. By contrast, Aitkin (2001) formulated a call for action with respect to the Bayesian analysis, asking for a careful analysis of the role of the priors. This paper aims at responding to this call for action.

Model specification

In our specification of the MFM model with Gaussian component distributions, the following data generation process is assumed for a univariate data set of size n given by y = (y_1, ..., y_n) (see also Richardson and Green, 1997).
One assumes that the number of components K of the mixture model is sampled from the prior p(K). Given K, the component weights η = (η_1, ..., η_K) are sampled from a symmetric K-dimensional Dirichlet distribution with parameter γ_K. For each observation i, the component assignment S_i is drawn from a multinomial distribution with parameter η.

Regarding the Gaussian component distributions, the component means μ_k and the component variances σ_k², k = 1, ..., K, are independently drawn from the same prior distributions to ensure exchangeability. The component means μ_k are drawn from a normal distribution with mean b_0 and variance B_0, while the component precisions σ_k^{-2}, i.e., the inverse variances, are assumed to follow a Gamma distribution with parameters c_0 and C_0 (and expectation c_0/C_0). Note that for the component distribution not the conjugate prior for the normal distribution with unknown mean and variance is used, but rather the independence prior. If the conjugate prior had been used, the component-specific variances would influence the prior variability of the component means. This would imply that components with less variability also have their means closer to the prior mean b_0. This prior implication does in general not seem appealing in the mixture context, and hence the independence prior is used. For a further detailed discussion of the priors for the component distributions see Frühwirth-Schnatter (2006, Chapter 6).

Summarizing, this specification results in the following Bayesian hierarchical MFM model:

    K ∼ p(K),
    η | K ∼ D_K(γ_K),
    S_i | η ∼ M(η),                        i = 1, ..., n,        (2.1)
    μ_k | b_0, B_0 ∼ N(b_0, B_0),          k = 1, ..., K,
    σ_k^{-2} | c_0, C_0 ∼ G(c_0, C_0),     k = 1, ..., K,
    y_i | μ, σ², S_i = k ∼ N(μ_k, σ_k²),   i = 1, ..., n,

where μ = (μ_k)_{k=1,...,K} and σ² = (σ_k²)_{k=1,...,K}.

Additionally, hyperpriors might be specified.
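To make the data generation process of (2.1) concrete, the following sketch draws one data set from the hierarchical model. The specific choices made here (a uniform prior on K up to K_max = 30, γ_K = 1, and the values for b_0, B_0, c_0, C_0) are illustrative assumptions, not the settings analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_mfm(n, b0=20.0, B0=100.0, c0=2.0, C0=1.0, gamma_K=1.0, K_max=30):
    """Draw one data set from the hierarchical MFM model (2.1)."""
    K = rng.integers(1, K_max + 1)                      # K ~ U(1, K_max)
    eta = rng.dirichlet(np.full(K, gamma_K))            # eta | K ~ D_K(gamma_K)
    S = rng.choice(K, size=n, p=eta)                    # S_i | eta ~ M(eta)
    mu = rng.normal(b0, np.sqrt(B0), size=K)            # mu_k ~ N(b0, B0)
    prec = rng.gamma(shape=c0, scale=1.0 / C0, size=K)  # sigma_k^{-2} ~ G(c0, C0)
    sigma = 1.0 / np.sqrt(prec)
    y = rng.normal(mu[S], sigma[S])                     # y_i | S_i = k ~ N(mu_k, sigma_k^2)
    return y, S, K

y, S, K = simulate_mfm(n=82)
K_plus = np.unique(S).size  # number of *filled* components
```

Note that the distinction emphasized in the text is already visible here: K components are drawn, but only the K+ filled components, those that actually receive observations, correspond to data clusters.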
For example, Richardson and Green (1997) suggest specifying a hyperprior on C_0, and Malsiner-Walli et al. (2016) add an additional layer for the prior on the component means which corresponds to a shrinkage prior allowing for variable selection. However, in the following we do not consider adding further hyperpriors, in order to better assess the influence of different specifications of these priors and their parameters on the clustering results. Thus, in this paper we focus on the specification of the following priors and parameters:

• the prior p(K) on the number of components K,
• the value γ_K used for the Dirichlet prior,
• the prior parameters b_0 and B_0 for the component means,
• the prior parameters c_0 and C_0 for the component variances.

The Galaxy data set

The Galaxy data set was originally published in astronomy by Postman et al. (1986) and consists of univariate measurements representing velocities of galaxies moving away from our galaxy. In this original publication 83 observations are listed. Roeder (1990) introduced the data set to the statistics literature, but omitted the smallest observation, such that subsequently only 82 observations were considered in the statistics literature. Unfortunately, Roeder (1990) also introduced a typo, i.e., one observation has a different value than in Table 1 of Postman et al. (1986). A further influential statistics publication using the Galaxy data set was Richardson and Green (1997), who also considered only the 82 observations, but corrected the typo and scaled the units by 1000.

The data set was used in statistics by a number of authors to demonstrate density estimation methods and to investigate mixture model methods. They used either the version presented by Roeder (1990) or the one by Richardson and Green (1997).
A number of textbooks on applied statistics also use the data set to demonstrate different statistical methods (see, e.g., Lunn et al., 2012; Hothorn and Everitt, 2014).

In the following we use the Galaxy data set as given in Richardson and Green (1997). A histogram of the data set is shown in Figure 3. This version of the data set was also used by Aitkin (2001) when comparing maximum likelihood and Bayesian analysis methods for estimating mixture models, focusing in particular on the question of the number of data clusters in the data set. Within the maximum likelihood framework, Aitkin (2001) considered mixtures of univariate Gaussian distributions with equal as well as unequal variances. The mixture models were fitted using the EM algorithm (Dempster et al., 1977), and for each class of component distributions the number of components was selected based on the results of a bootstrap likelihood ratio test (Aitkin et al., 1981; McLachlan, 1987). This maximum likelihood analysis may easily be replicated using the R package mclust (Scrucca et al., 2016). Based on the maximum likelihood results, Aitkin (2001) concludes that "there is convincing evidence of three equal variance components, or four unequal variance components, but no convincing evidence of more than these numbers, in the velocity data" (p. 296).

In addition, Aitkin (2001) reviews the Bayesian analyses of the Galaxy data set presented in Escobar and West (1995), Carlin and Chib (1995), Phillips and Smith (1996), Roeder and Wasserman (1997) and Richardson and Green (1997). Table 3 in Aitkin (2001), according to its caption, summarizes the posterior distributions of K. However, for the Dirichlet process mixture fitted by Escobar and West (1995), it is in fact the posterior distribution of K+ that is given. The Bayesian approaches compared differ considerably with respect to the prior on K and the prior on the component-specific variances, and they lead to rather diverse results.
Aitkin (2001) concludes that some analyses result in overwhelming posterior evidence for three groups, while other posterior distributions are either relatively diffuse over 4–9 with a mode around 6–7 or concentrated on the range 7–9. Overall, the cluster solutions for the Galaxy data set are interpreted as either being sparse, with up to four clusters, or containing many, i.e., more than four, clusters.

Prior specifications

In this section, we discuss possible specifications and previous suggestions in the literature for each of the prior distributions and their parameters, taking in particular those into account that were considered in the Bayesian analyses reviewed in Aitkin (2001). This informs the choice of specific prior settings to be used for a systematic sensitivity analysis of the influence of different prior specifications on the clustering results for the Galaxy data set. We also discuss our expectations regarding the effect of these prior specifications on the cluster solutions obtained, focusing in particular on the estimated number of data clusters.

Prior on K

Frühwirth-Schnatter et al. (2020) provide an overview of previously used priors on K, including the uniform distribution (Richardson and Green, 1997), the truncated Poisson distribution (Phillips and Smith, 1996; Nobile, 2004) and the shifted geometric distribution (Miller and Harrison, 2018). They also propose the shifted beta-negative-binomial (BNB) distribution as a suitable alternative which generalizes the Poisson and the geometric distribution. Based on this overview, we consider the following priors on K:

• the uniform distribution U(1, 30) with prior mean E[K] = 15.5 and prior variance V[K] = 74.9,
• the truncated Poisson distribution trPois(3) with prior mean E[K] = 3.2 and prior variance V[K] = 2.7,
• the shifted geometric distribution K − 1 ∼ Geom(0.1) with prior mean E[K] = 10 and prior variance V[K] = 90 (Miller and Harrison, 2018),
• the shifted BNB distribution K − 1 ∼ BNB(1, 4, 3) with prior mean E[K] = 2 and prior variance V[K] = 4 (Frühwirth-Schnatter et al., 2020).

These priors essentially cover all Bayesian MFM analyses reviewed and compared by Aitkin (2001). The only exceptions are Carlin and Chib (1995), who perform model selection to decide between a 3- and a 4-component solution, and Roeder and Wasserman (1997), who use a uniform distribution with support [1, K_max] for a fixed upper bound. The priors considered differ considerably in the prior means and variances they induce. The shifted BNB(1, 4, 3) has the smallest prior mean; the truncated Poisson distribution has the smallest prior variance, with only a slightly higher prior mean. We expect the two prior distributions trPois(3) and shifted BNB(1, 4, 3) to favor sparse solutions, as they put little prior mass on large values of K; e.g., for trPois(3) the probability of K > 10 is less than 0.001.

The Dirichlet parameter γ_K for the component weights

All Bayesian MFM analyses considered in Aitkin (2001) are based on an MFM with γ_K ≡ 1. However, as will be demonstrated in Section 4.3, the Dirichlet parameter γ_K crucially affects the prior on K+, since it determines how closely the prior on K+ follows the prior on K. A more detailed discussion of the specification of γ_K for the MFM is given in Frühwirth-Schnatter et al. (2020).

Frühwirth-Schnatter et al. (2020) suggest using an arbitrary sequence for the Dirichlet parameter γ_K which might depend on the number of components K. They distinguish two special cases: the static MFM, where γ_K ≡ γ, and the dynamic MFM, where γ_K = α/K. McCullagh and Yang (2008) already discussed these two special cases, indicating that they are structurally different. While previous applications of the MFM focused on the static case, the Dirichlet process mixture model is included in the dynamic case.

In the following we consider the static as well as the dynamic MFM, using γ ∈ {0.01, 1, 10} in the static case and α ∈ {0.01, 1, 10} in the dynamic case. Thus, in addition to the popular choice γ ≡ 1, we also consider a much smaller and a much larger value for γ and α. The much smaller values are expected to induce a sparse cluster solution with only very few data clusters and thus also to achieve a certain independence from the specification of the prior on K. The much larger values are expected to induce cluster solutions with rather equally sized data clusters and a stronger link between the number of data clusters and the number of components, which implies a larger influence of the prior on K in this setting. We expect the dynamic MFM to lead to sparser solutions than the static MFM.

Induced prior on K+

As the posterior of K+, the number of filled components, is the aim of the analysis, it is illuminating to study the prior on K+. The prior on K+ is implicitly induced by the specification of the prior on K and the prior parameter γ_K. Frühwirth-Schnatter et al. (2020) and Greve et al. (2020) present formulas to derive this implicit prior in a computationally efficient way. We also investigate the prior on K+ induced by the prior specifications on K and γ_K considered for the Galaxy data set, to further gauge our prior expectations of the influence of these prior specifications on the cluster solutions obtained.

Using n = 82, the sample size of the Galaxy data set, the priors on K (in blue) and on K+ (in red) are visualized by bar plots in Figure 1 for the static MFM and in Figure 2 for the dynamic MFM. The different priors on K are shown in the columns, and the values γ ∈ {0.01, 1, 10} and α ∈ {0.01, 1, 10} in the rows. The priors on K are ordered by their prior second moment of K, i.e., the squared prior mean of K plus the prior variance. In total there are 12 combinations of specifications of (K, γ_K), inducing different priors on the number of data clusters K+. Comparing Figure 1 with Figure 2 indicates that, in general, the dynamic MFM leads to priors on K+ which induce stochastically smaller values.

Figure 1: The prior probabilities of K (in blue) and K+ (in red) for the static MFM for different priors on K and values of γ, with n = 82.

Figure 2: The prior probabilities of K (in blue) and K+ (in red) for the dynamic MFM for different priors on K and values of α, with n = 82.

Figure 1 clearly indicates that only γ = 0.01 leads to a sparse prior on the number of data clusters K+ and that the impact of the prior on K increases with increasing γ. For γ = 10, the two priors p(K) and p(K+) are essentially the same. For the dynamic case shown in Figure 2, the prior on the number of data clusters K+ induces a very sparse solution for α = 0.01 regardless of the prior on K. For α = 1, the prior on K+ is sparser than the prior on K, but the induced prior clearly varies considerably depending on the selected prior on K. For α = 10, a close link between the priors on K and K+ is discernible if the prior on K puts essentially all mass on small values of K, while considerable differences between these two priors are still visible for the shifted geometric prior and the uniform prior on K, which assign substantial mass to larger values of K.

Summarizing, if a sparse cluster solution is aimed at, a sparse prior on K+ should be specified. This can be achieved by specifying a sparse prior on K and/or small values for γ or α. In contrast, a flat prior on K (e.g., U(1, 30)) combined with large values of γ or α will a priori support large values of K+ (i.e., larger than 4).

Prior parameters b_0 and B_0 for the component means

Richardson and Green (1997) proposed using empirical Bayes estimates for b_0 and B_0, corresponding to the midpoint of the observed data range for b_0 and the squared length R² of the observed data range R for B_0. This choice makes the prior invariant to the scaling of the data, i.e., invariant to the units of the data used or a standardization of the data. Richardson and Green (1997) argue that this is a sensible weakly informative prior which does not constrain the component means and does not encourage mixtures with close component means.

Figure 3: The prior distributions for the component means μ_k ∼ N(b_0, B_0) with b_0 equal to the data midpoint and B_0 ∈ {6.3, 20, 100, 630}, together with a histogram of the Galaxy data set.
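Instead of the closed-form expressions of Frühwirth-Schnatter et al. (2020), the induced prior on K+ can also be explored by brute-force simulation: draw K, then the weights, then the n = 82 assignments, and count the filled components. The sketch below is an illustration only (it uses the U(1, 30) prior on K and the static MFM), not the computation used for the figures.

```python
import numpy as np

rng = np.random.default_rng(42)

def prior_K_plus(n=82, gamma=1.0, K_max=30, draws=2000):
    """Monte Carlo draws from the prior of K+ induced by K ~ U(1, K_max)
    and symmetric Dirichlet(gamma) weights (static MFM)."""
    k_plus = np.empty(draws, dtype=int)
    for d in range(draws):
        K = rng.integers(1, K_max + 1)
        g = rng.gamma(gamma, size=K)   # Dirichlet via normalized Gamma draws
        if g.sum() == 0.0:             # guard against underflow for tiny gamma:
            g[rng.integers(K)] = 1.0   # the limit is a degenerate weight vector
        eta = g / g.sum()
        S = rng.choice(K, size=n, p=eta)
        k_plus[d] = np.unique(S).size
    return k_plus

sparse = prior_K_plus(gamma=0.01)   # small gamma: K+ concentrates on very few clusters
diffuse = prior_K_plus(gamma=10.0)  # large gamma: K+ closely tracks the prior on K
```

With γ = 0.01 almost all prior mass of K+ falls on one or two filled components, while with γ = 10 nearly all K components are filled, mirroring the pattern in Figure 1 that a small Dirichlet parameter decouples K+ from the prior on K.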
They perform a sensitivity analysis for this prior by considering values from R²/10² to R² for B_0, indicating for the Acidity data set that the number of components is inverse U-shaped in B_0, first increasing with increasing values of B_0 and then decreasing again.

In the following we also use the midpoint of the data for b_0. For B_0 we vary the values to assess the impact on the number of data clusters, considering B_0 ∈ {6.3, 20, 100, 630}. The extreme values correspond to the limits R²/10² and R² considered by Richardson and Green (1997), 20 corresponds to the empirical variance of the data, and Phillips and Smith (1996) used 100 in their analysis.

Figure 3 visualizes the prior distributions for the component means considered, together with a histogram of the Galaxy data set. B_0 = R² = 630 induces a weakly informative prior, as suggested by Richardson and Green (1997), with approximately the same prior density values assigned to all observed data values. B_0 = R²/100 = 6.3 induces an informative prior concentrated around b_0. This smallest value of B_0 seems problematic, as hardly any weight is assigned to values above 30, where, however, the histogram suggests that the center of a small data cluster is located. We consider this rather extreme range of B_0 values also to assess whether the inverse U-shape of the estimated number of data clusters is observed for the Galaxy data set.

Prior parameters c_0 and C_0 for the component variances

Richardson and Green (1997) propose using σ_k^{-2} ∼ G(c_0, C_0) with a hierarchical prior on C_0, but also assess differences in results between a fixed and a random C_0. As we are interested in assessing the impact of different prior specifications, we only consider fixed values for C_0. Following Escobar and West (1995), Phillips and Smith (1996) and Richardson and Green (1997), we use c_0 = 2. We consider C_0 ∈ {0.5, 1, 5, 12.5}, where C_0 = 1 corresponds to the value used in Escobar and West (1995) and C_0 = 12.5 approximately to the prior expectation of C_0 in Richardson and Green (1997).

Figure 4: The prior distributions of 4σ_k induced by the prior on the component precisions σ_k^{-2} ∼ G(c_0, C_0) with c_0 = 2 and C_0 ∈ {0.5, 1, 5, 12.5}, together with a histogram of the Galaxy data set.

Figure 4 visualizes the prior distributions for the component variances considered, together with a histogram of the Galaxy data set. The priors induced for 4σ_k are visualized. These values correspond to the length of the 95% prediction interval of a single component and might thus be seen as representing the volume considered for the components, reflecting the prototypical shape imposed for the clusters. Clearly, C_0 = 0.5 and C_0 = 1 induce prior standard deviations which allow components to capture the extreme observations in data clusters of their own, whereas C_0 = 12.5 does not. Smaller C_0 values induce a more fine-grained density approximation, whereas larger C_0 values lead to a coarser density approximation; hence we expect the number of estimated data clusters to decrease with increasing C_0.
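The induced distribution of 4σ_k can be checked by simulation. The sketch below (an illustration under the stated choice c_0 = 2) draws precisions from G(c_0, C_0) and compares the implied component spreads for the smallest and largest values of C_0 considered.

```python
import numpy as np

rng = np.random.default_rng(0)

def four_sigma_draws(C0, c0=2.0, draws=200_000):
    """Draw sigma_k^{-2} ~ G(c0, rate=C0) and return 4*sigma_k, i.e., the
    approximate length of a component's 95% prediction interval."""
    prec = rng.gamma(shape=c0, scale=1.0 / C0, size=draws)  # rate C0 = scale 1/C0
    return 4.0 / np.sqrt(prec)

small = four_sigma_draws(C0=0.5)   # tight components: small prior variances
large = four_sigma_draws(C0=12.5)  # wide components: coarser density approximation
```

Since σ_k scales with the square root of C_0, the median of 4σ_k grows by a factor of √(12.5/0.5) = 5 between the two settings, matching the intuition that large C_0 inflates the component variances and hence merges data clusters.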
MCMC sampling

In order to sample the complete parameter vector, which consists of K and, conditional on K, of η = (η_k)_{k=1,...,K}, μ = (μ_k)_{k=1,...,K} and σ² = (σ_k²)_{k=1,...,K}, from the posterior distribution, a transdimensional sampler is required which is able to sample parameter vectors of varying dimension. We use the telescoping sampler proposed by Frühwirth-Schnatter et al. (2020). This MCMC sampling scheme includes a sampling step in which K is explicitly sampled as an unknown parameter, but otherwise requires only sampling steps used for finite mixtures.

The posterior inference uses data augmentation and also samples the component assignments S = (S_i)_{i=1,...,n}. These component assignments induce partitions, so that the sampling scheme also directly allows one to obtain the posterior distribution of the partitions C = {C_1, ..., C_{K+}} of the data and the induced number of data clusters K+, with C_k being the index set of the observations assigned to the k-th group of the partition C. To illustrate the connection between the component assignments and the partitions, assume that K = 3 and, for instance, S = (2, 1, 1, 2, 1, 1, 2, 1, 1, 1) for n = 10 observations. Then K+ = 2, since no observations are assigned to the third component, and the induced partition is given by C = {C_1, C_2} with C_1 = {2, 3, 5, 6, 8, 9, 10} and C_2 = {1, 4, 7}.

Following Frühwirth-Schnatter et al. (2020), the sampling steps consist of:

1. Update the partition C by sampling S from p(S | η, μ, σ², y), given by
   P(S_i = k | η, μ, σ², y_i) ∝ η_k f_N(y_i | μ_k, σ_k²).

2. Conditional on C, update the parameters of the non-empty components for k = 1, ..., K+:
   (a) Draw the component-specific precisions from the posterior
       σ_k^{-2} | μ_k, C, y ∼ G(c_k, C_k), with
       c_k = c_0 + N_k/2,  C_k = C_0 + (1/2) Σ_{i ∈ C_k} (y_i − μ_k)²,
       where N_k is the number of observations assigned to C_k, the k-th group of the partition C.
   (b) Draw the component-specific means from the posterior
       μ_k | σ_k^{-2}, C, y ∼ N(b_k, B_k), with
       b_k = B_k (B_0^{-1} b_0 + σ_k^{-2} N_k ȳ_k),  B_k = (B_0^{-1} + N_k σ_k^{-2})^{-1},
       where ȳ_k is the sample mean of the observations assigned to C_k.

3. Conditional on C, draw a new value of K using p(K | C) ∝ p(C | K) p(K).

4. Add K − K+ empty components with component-specific parameters drawn from the priors:
   μ_k ∼ N(b_0, B_0),  σ_k^{-2} ∼ G(c_0, C_0),  for k = K+ + 1, ..., K.

5. Conditional on N = (N_1, ..., N_{K+}, 0_{K−K+}), with 0_{K−K+} being a (K − K+)-vector of zeros, draw a new value of η:
   η | N ∼ D_K(γ̃), with γ̃ = (γ̃_k)_{k=1,...,K} and γ̃_k = γ_K + N_k.

The updates for c_k indicate that 2c_0 might be interpreted as a prior sample size and C_0/c_0 as the variance assumed for these prior observations. The choice of c_0 = 2 thus corresponds to adding 4 observations a priori to each component with a variance of C_0/2. If C_0 is increased, C_k is increased as well, leading to the sampling of inflated σ_k² values. This in turn induces more overlap across the component densities and thus potentially leads to a sparser cluster solution with a smaller number of estimated clusters.

The updates for b_k indicate that b_k results as a weighted mean of the prior value b_0 and the mean of the observations currently assigned to the cluster. According to the formula for B_k, the influence of B_0 decreases for data clusters containing many observations, as the second summand increases with N_k. There is also an interaction with the estimate of the component-specific variance: larger variances allow the component-specific means to vary more in the posterior updates. For the largest values of B_0 considered, we expect the prior influence to be negligible and the posterior updates to be influenced only by the data points currently assigned to the cluster.

Step 3 is influenced by the choice of the prior on K and γ_K. More details on this step are given in Frühwirth-Schnatter et al. (2020). The new K is sampled from a distribution with support K ≥ K+. This distribution is the more spread out, the more mass the prior on K puts on larger values of K and the smaller γ_K is. In addition, the distribution depends on K+ and the cluster sizes (N_1, ..., N_{K+}). This step allows for the birth and death of empty components.

In Step 4, the parameters of the component-specific distributions of the new empty components are drawn from the priors. "Unattractive" empty components result in particular when B_0 is large and C_0 is small. In this case the sampled μ_k can be located far away from the data, and the probability that observations are assigned to such an empty component in the following Step 1 is extremely small. Thus, the "attractiveness" of the empty components influences whether new empty components are filled and hence whether the number of filled components increases.

Step 5 is influenced by the choice of γ_K.
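The conjugate updates in Steps 2(a) and 2(b) are simple enough to sketch directly. The following is an illustrative implementation of one sweep over the filled components, with variable names mirroring the text; it is a sketch only, not the telescoping sampler itself, and for compactness it conditions on the cluster average in place of the current value of μ_k.

```python
import numpy as np

rng = np.random.default_rng(7)

def update_components(y, S, b0, B0, c0, C0):
    """One sweep of Steps 2(a)-(b): draw sigma_k^{-2} and mu_k for each
    filled component, conditional on the current partition."""
    mu, sigma2 = {}, {}
    for k in np.unique(S):
        y_k = y[S == k]
        N_k = y_k.size
        m = y_k.mean()  # stand-in for the current mu_k in this sketch
        # Step 2(a): precision from G(c_k, C_k)
        c_k = c0 + 0.5 * N_k
        C_k = C0 + 0.5 * np.sum((y_k - m) ** 2)
        prec = rng.gamma(shape=c_k, scale=1.0 / C_k)
        # Step 2(b): mean from N(b_k, B_k), given the freshly drawn precision
        B_k = 1.0 / (1.0 / B0 + N_k * prec)
        b_k = B_k * (b0 / B0 + prec * N_k * m)
        mu[k] = rng.normal(b_k, np.sqrt(B_k))
        sigma2[k] = 1.0 / prec
    return mu, sigma2

# toy data (an assumption for illustration): two well-separated groups
y = np.concatenate([rng.normal(10, 1, 50), rng.normal(30, 1, 50)])
S = np.repeat([0, 1], 50)
mu, sigma2 = update_components(y, S, b0=20.0, B0=100.0, c0=2.0, C0=1.0)
```

With N_k = 50 observations per cluster, the shrinkage towards b_0 = 20 is negligible and the drawn means land close to the respective cluster averages, illustrating the weighted-mean interpretation of b_k discussed above.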
In particular, for empty components the value of the Dirichlet parameter depends only on the prior value γ_K, which influences the weights η_k drawn for these components and hence the probability that such an empty component has observations assigned to it in Step 1. The smaller γ_K, the smaller the sampled η_k and thus the smaller the probability that an observation is assigned to this component in Step 1. Furthermore, the prior sample size of the Dirichlet prior is equal to Kγ_K. Thus, in the dynamic MFM the prior sample size is constant over mixtures with different numbers of components, whereas for the static MFM the prior sample size increases linearly with the number of components.

Assessing the impact of different prior specifications
After discussing in detail how the prior specifications might affect the posterior of the number of data clusters, the following analysis investigates whether these theoretical considerations can be empirically verified. The MFM model is fitted to the Galaxy data set with 384 different prior settings, using four different specifications of the prior on K, either the static or the dynamic MFM, three different values of the Dirichlet parameter, and four different values each for B_0 and C_0 in a full factorial design.

For each prior setting, posterior inference is based on 200,000 iterations after 10,000 burn-in iterations, with every fourth draw being recorded (i.e., a thinning of 4). Initially, 10 components are filled. The MCMC algorithm is initialized by specifying values for the component weights and the component-specific parameters. Equal component weights are specified, and all component-specific variances σ_k², k = 1, ..., 10, are set equal to C_0/2. The component-specific means μ_k are set equal to the centroids obtained when applying the k-means algorithm with 10 clusters to the data set. The MCMC iterations start with Step 1 by assigning observations to the 10 components according to the a-posteriori probabilities.

Partitions are label-invariant. Hence the number of data clusters, or filled components, is also a label-invariant quantity, and it is not necessary to resolve the label switching problem (Redner and Walker, 1984) for the following analysis of the results.

The analysis of the results focuses on the impact of the prior specifications on the posterior p(K+ | y) of the number of data clusters. The mode of p(K+ | y) is used as a point estimator. In addition, the entropy of the posterior of K+ is determined to indicate how informative this posterior is for a point estimate of K+. The entropy of a discrete random variable X with possible outcomes x_1, ..., x_I is given by −Σ_{i=1}^{I} P(X = x_i) log(P(X = x_i)). Thus, a high entropy value for the posterior of K+ indicates rather equal posterior probabilities for the different values of K+, while a low entropy value results if the posterior is concentrated on a few values.

The marginal impact of each of the prior specifications on the estimated number of data clusters K+, based on the posterior mode, is assessed by averaging the results across all other prior settings. Table 1 shows that, on average, the estimated number of data clusters K+ (a) is higher for the static than for the dynamic MFM, (b) increases with increasing values of the Dirichlet parameter, and (c) is lowest for the truncated Poisson prior, followed by the BNB(1, 4, 3) prior and then, after a substantial gap, by the Geom(0.1) and finally the uniform U(1, 30) prior. For the priors on the component-specific parameters, a non-monotonic influence is indicated for B_0.

    MFM       K̂+      γ/α    K̂+      p(K)          K̂+      B_0    K̂+      C_0    K̂+
    static    5.89    0.01   2.98    trPois(3)     3.99    6.3    5.39    0.5    6.93
    dynamic   4.70    1      5.56    BNB(1, 4, 3)  4.35    20     6.69    1      6.21
                      10     7.33    Geom(0.1)     6.00    100    5.20    5      4.53
                                     U(1, 30)      6.82    630    3.90    12.5   3.50

Table 1: Average number of data clusters K̂+ estimated based on the mode, marginally for each of the different prior specifications.

While the average number of estimated data clusters K+ is highest for B_0 = 20, comparable results are obtained for B_0 = 6.3 and B_0 = 100, whereas a substantially lower average number of data clusters is estimated for B_0 = 630. The influence of C_0 on the average number of estimated data clusters is monotonic, and the number substantially decreases for increasing values of C_0. The marginal effects observed in Table 1 are in line with our prior expectations based on theoretical considerations and previous results.

Figure 5 visualizes the results obtained for the 384 different settings in more detail. This allows one not only to assess marginal effects, but also to gain insights into the interactions between the prior specifications. For each prior setting, the number of data clusters K+ estimated based on the posterior mode is indicated by a dot, with the value shown on the y-axis. The results are split into six panels, where the top panels contain the results for the static MFM and the bottom panels those for the dynamic MFM. The columns represent the different values selected for the Dirichlet parameter, α for the dynamic MFM and γ for the static MFM, with values 0.01, 1 and 10 (from left to right). Within each panel the results are grouped on the x-axis by the prior p(K). The priors p(K) are ordered by their prior mean of K. Colors and point characters are used to indicate the different settings for the component-specific parameters. Small values of B_0 are shown in red and large values of B_0 in blue; the highly saturated colors indicate the extreme values of B_0, and lighter colors are used for the middle values. Filled shapes represent the large values of C_0, whereas empty shapes are used for the small values. Focusing on the dynamic MFM with α = 0.
01 (in the bottom left panel), one can clearly see that for nearly all settings the number of data clusters K+ is estimated to be equal to 3. Only in some cases an even smaller number of data clusters K+ = 2 is estimated. This only occurs for settings where B is small and C is large. This suggests that in this panel, where the dynamic MFM with a sparsity inducing parameter α is fitted, a sparse clustering solution is obtained regardless of the prior on K and also quite unaffected by the specification of the component-specific parameters.

The results for the static MFM with γ = 0.01 are shown above this panel (in the top left panel). Clearly the sparsity inducing prior used for K+ leads to most of the estimated numbers of data clusters being equal to 3. Only for very few settings, a lower or a higher number of data clusters than 3 (i.e., 2, 4, or 5) is estimated. Again a lower number of data clusters is only observed in the case where B is small and C is large. The higher number of data clusters is only observed for small values of C and middle values of B.

Figure 5: Galaxy data set. Estimated number of data clusters K+ for different prior specifications. In the rows, the results for the static and dynamic MFM are reported, in the columns for γ/α ∈ {0.01, 1, 10}, respectively.

The results for α = 0.
01 for the dynamic MFM and γ = 0.01 for the static MFM indicate that the prior on K is not very influential, as regardless of the choice of the prior on K a sparsity inducing prior for K+ is imposed where a rather large gap between K and K+ is a-priori likely to occur. Also the results are quite insensitive to the selection of the parameters for the component-specific distributions. This implies that if a sparse clustering solution is desired, one clearly needs to fix the Dirichlet parameter to have a small value. The results are rather insensitive to the specification of the other priors. If the cluster analysis aims at answering the question what is the minimum number of data clusters necessary to approximate the data distribution reasonably well, such a sparsity inducing prior is warranted. In this case the question how many data clusters are in the Galaxy data set would also be rather unambiguously answered by three.

Increasing α and γ to 1 indicates that the influence of the other prior specifications on the estimated number of data clusters increases (middle panels). The dynamic MFM tends to estimate a smaller number of data clusters than the static MFM. The difference to the static MFM becomes more pronounced if the prior on K puts more mass on the tails. For the dynamic MFM, all estimated numbers of data clusters are at most 7, with higher numbers being more likely for the uniform and the geometric prior, followed by the BNB prior and the truncated Poisson prior. Under the static MFM extremely large values are obtained for the uniform and the geometric prior, with estimates as large as 20. These large values are obtained if small values are used in the prior specification for B and C.

For the dynamic MFM, a higher number of data clusters K+ is estimated for α = 10 compared to α = 1, while for the static MFM, rather similar results are obtained for γ = 1 and γ = 10 (panels on the right).
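The label-invariant point estimate used throughout this analysis, the mode of p(K+ | y) computed from the number of filled components per MCMC draw, can be sketched as follows. This is a minimal, hypothetical illustration with toy partition draws; the function names are our own and this is not the sampler used for the actual analysis.

```python
from collections import Counter

def n_data_clusters(partition):
    """Number of filled components (data clusters) in one MCMC draw.

    Label-invariant: relabeling the components leaves the number of
    distinct labels, and hence K+, unchanged.
    """
    return len(set(partition))

def posterior_mode_k_plus(partition_draws):
    """Tabulate K+ across MCMC draws and return the mode of p(K+ | y)."""
    counts = Counter(n_data_clusters(p) for p in partition_draws)
    return max(counts, key=counts.get)

# Toy partition draws for 5 observations (hypothetical data):
draws = [
    [0, 0, 1, 1, 2],  # K+ = 3
    [0, 0, 0, 1, 1],  # K+ = 2
    [2, 2, 1, 1, 0],  # K+ = 3 (a relabeled version of the first draw)
    [0, 1, 1, 2, 2],  # K+ = 3
]
print(posterior_mode_k_plus(draws))  # → 3
```

Because only the number of distinct labels enters, label switching across draws is irrelevant for this summary, exactly as argued above.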
For the uniform and geometric prior on K the estimated number of data clusters varies most, regardless of whether a static or dynamic MFM is fitted. The prior on K is not particularly sparsity inducing and thus the prior on the component-specific parameters influences which approximation of the data density is selected. Small values for B induce the most extreme values for the estimated number of data clusters, with large values of C leading to small numbers and small values of C encouraging large numbers of data clusters.

Figure 6 also visualizes the results obtained for the 384 settings in detail. Instead of the estimated number of data clusters K+, the entropy of the posterior of K+ is shown. If the entropy is 0, then all mass is assigned to a single value (which then also corresponds to the mode shown in Figure 5). For a fixed support, the uniform distribution has the maximum entropy; for U(1, 30) this is log(30) ≈ 3.4. The smallest entropy values are observed for the dynamic MFM with α = 0.01, with slightly larger values for the static MFM with γ = 0.01. For the dynamic MFM, the entropy increases for increasing α. For the static MFM, the entropy values also increase from γ = 0.01 to γ = 1, but are rather comparable for γ = 1 and γ = 10.

Regarding the prior on K, smaller entropy values are observed for the truncated Poisson prior compared to the other priors, which have rather comparable entropy values. This indicates that the smaller prior variance of the prior on K has a substantial impact on the entropy.

Figure 6: Galaxy data set. Entropy of the posterior of K+ for different prior specifications. In the rows, the results for the static and dynamic MFM are reported, in the columns for γ/α ∈ {0.01, 1, 10}, respectively.

Regarding B, a general pattern of the red points being above the blue points is discernible. This implies that the posterior of K+ is particularly spread out for small values of B, i.e., where the component-specific mean values are shrunken towards the midpoint. We conjecture that in this setting posterior mass is also assigned to small values of K+ as, due to the shrinkage, there is also posterior support for solutions with few data clusters. In the Galaxy data set, for example, the observations with large values, which seem to form a small data cluster of their own, might be merged with observations from the middle bulk of the observations due to shrinkage, inducing a large component-specific variance and thus a coarse density approximation.

Regarding C, the general pattern is that the filled shapes are below the empty shapes, indicating that the entropy increases with decreasing values of C. This implies that aiming at a fine-grained approximation, using a rather small volume as prototypical shape for the clusters, leads to the mass being more spread out.
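The entropy summary shown in Figure 6 can be computed directly from the MCMC draws of K+. A minimal sketch, assuming the natural logarithm (under which the maximum entropy for U(1, 30) is log(30) ≈ 3.4); the function names are our own:

```python
import math
from collections import Counter

def entropy(probs):
    """Entropy sum_i -p_i * log(p_i) of a discrete distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0)

def entropy_of_k_plus(k_plus_draws):
    """Entropy of the posterior of K+, estimated from MCMC draws of K+."""
    counts = Counter(k_plus_draws)
    n = len(k_plus_draws)
    return entropy([c / n for c in counts.values()])

# A posterior concentrated on a single value has entropy 0:
print(entropy_of_k_plus([3, 3, 3, 3]))   # → 0.0
# The uniform distribution attains the maximum entropy for a fixed
# support; for U(1, 30) this is log(30):
print(round(entropy([1 / 30] * 30), 2))  # → 3.4
```

Low entropy thus signals that the posterior of K+ concentrates on a few values and yields an informative point estimate, while values near log(30) indicate a posterior spread over many values of K+.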
In particular, if the aim is semi-parametric density estimation and a small volume is imposed, no clear estimate for a single K+ value is expected, but rather a combination of mixtures with different values of K+ is desired to obtain a good approximation.

In this paper, we respond to the call for action made by Aitkin (2001) regarding the need to provide more insights into the influence of different prior specifications when fitting Bayesian mixture models. Based on recent developments in the context of MFMs, we use the model specification of an MFM, considering the static as well as the dynamic case. The Galaxy data set is used to illustrate the prior impact on the estimated number of data clusters K+ using the mode as well as on the entropy of the posterior of K+. Results confirm the marginal effects postulated, but interesting interaction effects are also discerned.

Aiming at a sparse clustering solution, using a dynamic MFM with α = 0.01 gives stable results regardless of the other prior choices (prior on K and the component-specific parameters). Such an estimate could be interpreted as the "minimum number of data clusters" present in the data, where the sparsity inducing property of the (K, γ_K) specification leads to insensitivity regarding misspecification of the component-specific parameters.

Such a setting leads to an unambiguous estimate of three data clusters based on the mode for the Galaxy data set, with the posterior distribution also being rather concentrated on very few values. This is in line with the conclusion drawn in Aitkin (2001) for the maximum likelihood framework using equal variance components in the mixture model.

We suggest using the dynamic MFM with a small α value and reasonable component-specific distributions in a Bayesian model-based clustering application where a minimum number of data clusters is to be identified. For the component-specific distributions, shrinking the prior mean is not recommended, whereas for the component-specific variances using reasonable values is important to guard against too fine-grained or too coarse approximations. In the univariate case the visualization of the induced volume (see Figure 4) is useful to determine a suitable value for C. A generalization of such a visual tool to the multivariate case or other component-specific distributions would be of interest. Further analysis is also required to gain insights into the prior impact on Bayesian cluster analysis results for larger data sets, containing more observations, but also more variables, and with other component-specific distributions.

Acknowledgements. The authors gratefully acknowledge support from the Austrian Science Fund (FWF): P28740, and through the WU Projects grant scheme: IA-27001574.
References
Aitkin M (2001) Likelihood and Bayesian analysis of mixtures. Statistical Modelling 1(4):287–304, DOI 10.1177/1471082x0100100404

Aitkin M, Anderson D, Hinde J (1981) Statistical modelling of data on teaching styles. Journal of the Royal Statistical Society A 144(4):419–461, DOI 10.2307/2981826

Carlin BP, Chib S (1995) Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B 57:473–484, DOI 10.1111/j.2517-6161.1995.tb02042.x

Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39(1):1–38, DOI 10.1111/j.2517-6161.1977.tb01600.x

Escobar MD, West M (1995) Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90(430):577–588, DOI 10.1080/01621459.1995.10476550

Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association 97(458):611–631, DOI 10.1198/016214502760047131

Frühwirth-Schnatter S (2006) Finite Mixture and Markov Switching Models. Springer, New York

Frühwirth-Schnatter S, Malsiner-Walli G, Grün B (2020) Generalized mixtures of finite mixtures and telescoping sampling, URL https://arxiv.org/abs/2005.09918, arXiv:2005.09918 [stat.ME]

Greve J, Grün B, Malsiner-Walli G, Frühwirth-Schnatter S (2020) Spying on the prior of the number of data clusters and the partition distribution in Bayesian cluster analysis, URL https://arxiv.org/abs/2012.12337