Mixture modeling on related samples by ψ-stick breaking and kernel perturbation
Jacopo Soriano* (Google Inc.)    Li Ma† (Duke University)

October 16, 2018
Abstract
There has been great interest recently in applying nonparametric kernel mixtures in a hierarchical manner to model multiple related data samples jointly. In such settings several data features are commonly present: (i) the related samples often share some, if not all, of the mixture components but with differing weights; (ii) only some, not all, of the mixture components vary across the samples; and (iii) often the shared mixture components across samples are not aligned perfectly in terms of their location and spread, but rather display small misalignments, either due to systematic cross-sample difference or, more often, due to uncontrolled, extraneous causes. Properly incorporating these features in mixture modeling will enhance the efficiency of inference, whereas ignoring them not only reduces efficiency but can jeopardize the validity of the inference due to issues such as confounding. We introduce two techniques for incorporating these features in modeling related data samples using kernel mixtures. The first technique, called ψ-stick breaking, is a joint generative process for the mixing weights through the breaking of both a stick shared by all the samples, for the components that do not vary in size across samples, and an idiosyncratic stick for each sample, for those components that do vary in size. The second technique is to imbue random perturbation into the kernels, thereby accounting for cross-sample misalignment. These techniques can be used either separately or together, in both parametric and nonparametric kernel mixtures. We derive efficient Bayesian inference recipes based on MCMC sampling for models featuring these techniques, and illustrate their use on both simulated data and a real flow cytometry data set in prediction/estimation, cross-sample calibration, and testing multi-sample differences.
Keywords: Bayesian nonparametrics, Dirichlet process mixtures, stick breaking processes, Bayesian hierarchical models, flow cytometry, multi-sample comparison.

* Part of the research was completed while JS was a PhD student at Duke University.
† LM's research is partly supported by NSF grant DMS-1612889 and a Google Faculty Research Award.

1 Introduction
Kernel mixtures are a powerful tool for modeling a variety of data sets, especially in the presence of a natural clustering structure (Escobar and West, 1995; MacEachern and Müller, 1998). A good portion of the rapidly expanding literature on Bayesian nonparametrics is aimed at building effective mixture models. A recent focus of the literature is on how to jointly model, in a hierarchical manner, data samples that are similar or otherwise related, the main objective being effective borrowing of strength across samples, thereby substantially enhancing inference on the underlying data generative mechanisms as well as prediction. This is particularly important for complex data sets, for which each individual sample may only contain very limited information regarding the underlying probability distribution. Among many notable efforts in this direction, Lopes et al. (2003) proposed a hierarchical model for multiple finite mixtures. Müller et al. (2004) proposed a nonparametric extension of Lopes et al. (2003)'s model by replacing finite mixtures with Dirichlet process (DP) mixtures. In a different vein, Cron et al. (2013) proposed to use the hierarchical DP, or HDP (Teh et al., 2006), as the mixing distribution to characterize variation across multiple mixture distributions. Rodríguez et al. (2008) proposed the nested DP (NDP) mixture, which is an infinite mixture of DP mixtures that induces an additional level of clustering among the multiple mixture distributions themselves (to be distinguished from the clustering within each mixture distribution).

While applicable to a variety of mixture modeling contexts, our work was motivated by our attempt to apply existing hierarchical mixture models to the analysis of data collected from flow cytometry experiments. Flow cytometry is a laser-based technology that measures biomarkers on a large number of cells, so each cell is an observation from a distribution in R^p, where p is the number of biomarkers measured. The cell population typically comes from a blood sample in immunological studies, and it consists of cells of various subtypes (e.g., T cells, B cells, etc.), with each subtype forming a "cluster" in the sample space. Because each cell subtype has a specific function in the immune system, inference on the abundance of the various subtypes across blood samples of a patient under different stimulating conditions, for instance, is of interest. Mixture models are natural tools for characterizing such data, as the data is indeed a mixture of various cell types (Chan et al., 2008), and because a typical flow cytometry study will involve multiple samples collected under different conditions, the need for joint modeling to achieve effective borrowing of strength also naturally arises (Cron et al., 2013).

During the analysis of flow cytometry experiments using mixtures, we encountered a number of important challenges that we believe are present in numerous (if not most) other applications involving mixture modeling of related samples (not only with location-scale kernels but beyond). Below we summarize the three main data features/challenges that motivate the current work:

I. Samples often share clusters but with differing weights.
Related samples tend to share some (even most) of their clusters, and these common clusters vary across related samples in their weights. In flow cytometry, for instance, data samples often share a vast majority of the cell subtypes, and the most common type of variation across samples is the difference in the relative sizes of the subtypes.

II. Only some, not all, clusters vary. Often, only a fraction, not all, of the clusters vary across samples. In flow cytometry, not all cell subtypes are affected by the experimental conditions of interest. Very often only one or two cell types are affected and thus vary across the samples while the rest do not.

III. Misalignment across samples in shared clusters. Even the same cluster shared among samples is often not perfectly aligned across samples, either due to actual systematic differences across the samples, or, very often, due to the presence of extraneous, uncontrolled additional sources of variation, i.e., some "random" effect. This is easily seen in mixtures of location-scale families, where the location and spread of some shared clusters differ to various extents across samples. Such misalignment is ubiquitous in flow cytometry data, with numerous potential causes. For example, even tiny differences in the chemical concentrations applied in the experimental protocol across experiments can cause noticeable "perturbations" in the cell subtypes.

As far as we know, none of the existing hierarchical approaches satisfactorily addresses all of these issues in a single coherent framework. Table 1 provides a summary of these data features and the extent to which some of the state-of-the-art methods (along with the method we propose herein) address each of them.
                                           Shared clusters        Only a subset        Misalignment
                                           with varying weights   of clusters differ   in kernels
Lopes et al. (2003); Müller et al. (2004)  Not allowed            Allowed              Not allowed
Teh et al. (2006); Cron et al. (2013)      Allowed                Not allowed          Not allowed
Rodríguez et al. (2008)                    Not allowed            Not allowed          Not allowed
This work                                  Allowed                Allowed              Allowed
Table 1: Comparison of hierarchical mixture models in terms of how they cope with the three common data features/challenges in modeling multiple related data samples.

Specifically, the existing approaches exploit some aspects of these features but do not fully take them into account. By introducing a cluster-specific hierarchical relationship among the samples, Lopes et al. (2003) and Müller et al. (2004) allow some clusters to be shared among the samples. However, their models require that the kernel parameters and the mixture weight for each cluster be either both shared across samples or both different, without the option to decouple these two different types of variation. In particular, no clusters are allowed to have only one type of variation, e.g., in the mixing weights, under these models. In the context of flow cytometry, for instance, this would mean that cell subtypes cannot change across samples in abundance alone while keeping their location and spread, clearly an unrealistic assumption. On the other hand, by using the hierarchical DP (Teh et al., 2006) as the mixing distribution, Cron et al. (2013) do allow variation to exist in the weights alone, but enforce the constraint that all clusters must vary across samples, excluding the common situation in applications such as flow cytometry that only some clusters (e.g., subtypes) vary while others remain unchanged across conditions. Finally, under the nested DP mixture (Rodríguez et al., 2008), the clusters in one sample must either be completely identical to those in another sample, if the two samples fall into the same model-level cluster, or all be completely different, in both weights and kernel parameters, if they belong to different model-level clusters.

New hierarchical modeling techniques are needed to address these limitations. To meet this need, we design two new modeling devices that can be embedded into a single hierarchical mixture modeling framework: the first for the mixing weights and the other for the kernel parameters. For the weights, we introduce a new stick breaking process that induces shared weights on some clusters (those that do not change in abundance) through breaking a "shared" stick across all samples, while inducing different weights on the other clusters through breaking an "idiosyncratic" stick for each sample. This technique allows us to address challenges I and II. For the mixture kernels, we utilize a hierarchical kernel to induce local perturbations in the kernel parameters across samples, which mimics the effect on the kernels of uncontrolled confounding. By decoupling the hierarchical relationship among the mixing weights from that among the kernel parameters, our approach offers the needed additional flexibility and thus achieves substantially higher efficiency in modeling related mixtures, as will be demonstrated through numerical examples.

The rest of the paper is organized as follows. We start in Section 2.1 with a brief review of the relevant background regarding nonparametric mixture modeling and stick breaking, and then in Section 2.2 introduce the two techniques in turn. In Section 2.3 we provide a recipe for posterior inference based on Markov chain Monte Carlo (MCMC) sampling. In Section 3 we compare our method to current methods through simulation studies that cover prediction/estimation, cross-sample calibration, and testing multi-sample differences, and finally use it to analyze two flow cytometry data sets.
2 Method

2.1 Background: mixture models and stick breaking

While our techniques can be embedded into mixture models with various weight generating mechanisms and kernel families, we shall introduce and illustrate them in the context of DP mixtures of Gaussians, which is the most widely adopted nonparametric mixture model. Suppose n observations y = (y_1, y_2, ..., y_n) are from a mixture model:

    y_i iid∼ F, i = 1, ..., n, and f(·) = ∑_{k∈K} π_k g(·|λ_k),

where f denotes the probability density function of F, g(·|λ) is a kernel distribution parametrized by λ, π_k is the associated (mixture) weight, and K is the countable (possibly infinite) index set of the mixture components (or clusters). Location-scale families are commonly adopted as the kernel distribution, in which case λ_k specifies the location and spread of the kth cluster. By definition the weights satisfy π_k ≥ 0 and ∑_k π_k = 1. An alternative and computationally attractive formulation utilizes a latent cluster membership label Z_i ∈ K for each observation, such that

    y_i | Z_i = k ∼ g(·|λ_k) and Pr(Z_i = k) = π_k

for i = 1, 2, ..., n and k ∈ K.

Bayesian inference under mixture models can proceed after specifying prior distributions on the weights and the kernel parameters {(π_k, λ_k) : k ∈ K} (Marin et al., 2005). A flexible and convenient choice of prior for the mixing weights is a generative procedure called the stick breaking process (SBP) (Sethuraman, 1994; Ishwaran and James, 2001). The general scheme of the SBP starts with drawing a sequence of independent random variables v_1, v_2, ... supported on (0, 1), and the weight of the kth cluster is given as

    π_k = v_k ∏_{l=1}^{k−1} (1 − v_l).

A popular two-parameter specification is the Poisson-Dirichlet process (Kingman, 1975; Pitman and Yor, 1997), corresponding to v_i ∼ Beta(1 − γ, α + iγ) for some parameters α and γ. In particular, when γ = 0, this boils down to the weight generative mechanism of a Dirichlet process (Ferguson, 1973; Sethuraman, 1994), which we shall refer to as the SBP(α) process.

By adopting the SBP(α) prior on the weights, along with a prior H on the kernel parameters, we obtain a Dirichlet process mixture (DPM) model:

    π = (π_k : k ∈ K) ∼ SBP(α) and λ_k iid∼ H, k ∈ K.

The most commonly adopted kernel distributions are location-scale families such as the (multivariate) Gaussian family, i.e., g(·|λ_k) = N(·|μ_k, Σ_k). In this case, H is often chosen to be the corresponding conjugate prior, such as a normal-inverse-Wishart (NIW) prior on (μ_k, Σ_k).

2.2 Two techniques for hierarchically modeling related samples

Now assume J samples of observations y_j = (y_{1,j}, ..., y_{n_j,j}) for j = 1, ..., J have been collected, and the observations in each sample are modeled by a mixture:

    y_{i,j} ind∼ F_j, i = 1, ..., n_j and j = 1, ..., J,
    f_j(·) = ∑_{k∈K} π_{j,k} g(·|λ_{j,k}), j = 1, ..., J,

where f_j is the probability density function of F_j, and λ_{j,k} represents the kernel parameter for the kth cluster in the jth sample. To characterize potential relationships across the samples, let us assume that the kth component under each sample represents the same cluster (e.g., cell subtype). Note that this does not exclude the possibility of having novel clusters that appear in only one or some of the samples, in which case the weight π_{j,k} = 0 if cluster k is absent in the jth sample. Again we let K be the collection of all cluster indices over all the samples. Let Z_{i,j} ∈ K be a latent variable indicating the cluster to which the data point y_{i,j} belongs.
Then the model can be equivalently written as

    [y_{i,j} | Z_{i,j} = k, μ_{j,k}, Σ_k] ind∼ N(y_{i,j} | μ_{j,k}, Σ_k) and Pr(Z_{i,j} = k) = π_{j,k} for k ∈ K.

We next introduce techniques for prior choices on the weights and on the kernel parameters by extending the stick breaking prior and the kernel respectively, which will address the three data features and challenges described in the Introduction.

ψ-stick breaking for weights. We consider a generative stick breaking procedure called "ψ-stick breaking" (for reasons to be explained below), which breaks J sticks of unit length, one for each sample, in a dependent manner to generate the mixing weights {π_{j,k} : k = 1, 2, ...} for j = 1, 2, ..., J. We start by observing that each cluster falls into one of two categories K_0 and K_1, that is, K = K_0 ∪ K_1 with K_0 ∩ K_1 = ∅: those in K_0 have weights that do not vary across the J samples (e.g., cell types whose abundance is constant across experimental conditions), i.e., π_{j,k} = π_{j′,k} for j, j′ = 1, ..., J for k ∈ K_0, whereas those in K_1 have varying weights across samples.

The generative process proceeds in two steps and is illustrated in Figure 1. In the first step, we break the J sticks at exactly the same spot into two pieces of length ρ and 1 − ρ respectively, where ρ ∈ (0, 1) is drawn as a Beta random variable.

Figure 1: Illustration of the ψ-stick breaking procedure with the s-stick (left) and the i-sticks (right).
Then in the second step, we use the J pieces of length ρ to generate the weights for the components in K_0, and the J pieces of length 1 − ρ for the subtypes in K_1. Hence the parameter ρ is interpreted as the overall proportion of the clusters with constant weights across samples.

Specifically, one can imagine that we tie the J sticks of length ρ together and break them using a single SBP as if they were a single stick, always at the same locations. For this reason, we shall refer to the common stick formed by tying the J sticks of length ρ as the "shared" stick, or the s-stick. Let {w_{0,k} : k ∈ K_0} with ∑_{k∈K_0} w_{0,k} = 1 be the randomly generated relative sizes of the components in K_0 in terms of the proportions of the s-stick. So the absolute size of each cluster that does not change across samples is given by π_{j,k} = ρ w_{0,k} for all j = 1, 2, ..., J and k ∈ K_0.

On the other hand, we break the J sticks of length 1 − ρ independently, using separate independent SBPs, each generating the weights for one of the J samples, corresponding to the sizes of clusters that vary across samples. For this reason, we shall refer to the J sticks of length 1 − ρ as the "idiosyncratic" sticks, or the i-sticks. We let {w_{j,k} : k ∈ K_1} for j = 1, 2, ..., J, with ∑_{k∈K_1} w_{j,k} = 1, be the randomly generated lengths of the components as proportions of the corresponding i-stick. So for the kth cluster in K_1, its weight in the jth sample is given by π_{j,k} = (1 − ρ) w_{j,k}.

Using SBP(α) processes for breaking each of the s- and i-sticks, we arrive at a joint generative model for the weights in all of the J samples, which we call "shared/idiosyncratic" (si, or ψ) stick breaking. Specifically, with a Beta prior on the length of the shared stick, we arrive at the following hierarchical model for the weights:

    π_{j,k} = ρ w_{0,k} for j = 1, ..., J and k ∈ K_0;
    π_{j,k} = (1 − ρ) w_{j,k} for j = 1, ..., J and k ∈ K_1;    (1)
    ρ ∼ Beta(a_ρ, b_ρ),
    (w_{0,k} : k ∈ K_0) ∼ SBP(α),
    (w_{j,k} : k ∈ K_1) iid∼ SBP(α), j = 1, ..., J.

See Figure 1 for a visualization of the hierarchical prior on the mixture weights. The hyperparameter α specifies the size of the clusters as well as the number of clusters (in K_0 and K_1 respectively), with a smaller α corresponding to a small number of large clusters and a larger α corresponding to a large number of small clusters. We infer α in a hierarchical Bayesian paradigm by placing a Gamma hyperprior on it: α ∼ Gamma(τ_{α,1}, τ_{α,2}).
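To make the two-step generative process concrete, the following minimal R sketch simulates ψ-stick weights for J samples, replacing the infinite stick breaking with a truncated SBP(α); the truncation levels and all parameter values are illustrative choices for this sketch, not part of the model specification.

  # Truncated stick breaking: SBP(alpha) cut off at K components,
  # with the leftover stick mass assigned to the last component.
  sbp_weights <- function(K, alpha) {
    v <- rbeta(K - 1, 1, alpha)          # stick-breaking proportions v_k
    c(v, 1) * cumprod(c(1, 1 - v))       # w_k = v_k * prod_{l<k} (1 - v_l)
  }

  # psi-stick breaking: one s-stick broken once for all samples,
  # one i-stick per sample broken independently.
  psi_stick <- function(J, K0, K1, alpha, a_rho = 2, b_rho = 2) {
    rho <- rbeta(1, a_rho, b_rho)        # length of the shared stick
    w0  <- sbp_weights(K0, alpha)        # shared relative sizes, k in K0
    w1  <- t(replicate(J, sbp_weights(K1, alpha)))  # row j: i-stick of sample j
    cbind(matrix(rho * w0, J, K0, byrow = TRUE), (1 - rho) * w1)
  }

  set.seed(1)
  pi_jk <- psi_stick(J = 3, K0 = 10, K1 = 10, alpha = 1)
  rowSums(pi_jk)   # each sample's weights sum to 1

Each row of the returned matrix is one sample's weight vector over the K_0 + K_1 clusters; the first K_0 columns are identical across rows, reflecting the shared stick.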
Local kernel perturbation. We utilize a hierarchical setup to incorporate local perturbation in the kernel parameters, thereby adjusting for the misalignment and allowing more effective borrowing of information across the samples on each cluster. Specifically, we model the kernel parameters {λ_{j,k}} as follows:

    λ_{0,k} iid∼ H_0(· | φ_0) for k ∈ K,
    λ_{j,k} ind∼ H(· | λ_{0,k}, ε) for j = 1, 2, ..., J,

where λ_{0,k} represents the cross-sample "centroid" kernel parameters for the kth cluster, with a hyperprior H_0 specified by hyperparameter φ_0. Given λ_{0,k}, the sample-specific kernel parameters for the kth cluster, λ_{j,k}, are drawn from H with additional hyperparameter ε, which specifies the dispersion of cluster k among the samples around the "centroid".

The above specification enforces that each cluster k will have misalignment. More generally, in some problems misalignment may exist in only a subset of the clusters. To allow for such cases, we again appeal to a "spike-and-slab" setup by introducing an additional Bernoulli latent indicator S_k for each cluster, such that S_k = 1 if there is misalignment in cluster k and S_k = 0 otherwise. That is,

    λ_{j,k} ind∼ δ_{λ_{0,k}} if S_k = 0,  and  λ_{j,k} ind∼ H(· | λ_{0,k}, ε) if S_k = 1,  with  S_k iid∼ Bernoulli(ϕ),

where δ_· represents a point mass.

Putting the pieces together in the context of Gaussian kernels, we arrive at the following spike-and-slab version of the locally perturbed kernel model:

    Σ_k^{−1} iid∼ Wishart(Ψ_1, ν_1),
    [μ_{j,k} | μ_{0,k}, Σ_k, S_k] ind∼ δ_{μ_{0,k}} 1{S_k = 0} + Normal(μ_{0,k}, ε Σ_k) 1{S_k = 1},
    [μ_{0,k} | Σ_k] ind∼ Normal(m_1, Σ_k / k_0),
    S_k iid∼ Bernoulli(ϕ).

This model is illustrated in Figure 2. The hyperparameter ε specifies the total amount of local variation between the means of each group μ_{j,k} and the grand mean μ_{0,k}, and ϕ specifies the proportion of clusters that have misalignment. The hyperparameters m_1, Ψ_1, k_0, ε, and ϕ all characterize "global" features of the data that pertain to all of the clusters and samples. We can reliably infer them by pooling information through hierarchical Bayes. In particular, in our numerical examples we adopt the following hyperpriors: ε ∼ Uniform(a_ε, b_ε), m_1 ∼ Normal(m_2, S_2), Ψ_1 ∼ Inverse-Wishart(Ψ_2, ν_2), k_0 ∼ Gamma(τ_1/2, τ_2/2), and ϕ ∼ Beta(a_ϕ, b_ϕ).

Figure 2: A locally perturbed Gaussian kernel with a spike-and-slab setup. When S_k = 0, all kernels for the kth cluster are identical across samples. When S_k = 1, the kernels are centered around a common mean but are not identical.
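To illustrate the locally perturbed Gaussian kernel, the sketch below draws the sample-specific means μ_{j,k} of a single cluster under the spike-and-slab setup; the dimension and the values of ε and ϕ are arbitrary demo choices, and the function name is ours rather than part of the released software.

  # Spike-and-slab perturbation of one cluster's kernel means:
  # S_k = 0 -> all samples share the centroid mu_{0,k};
  # S_k = 1 -> mu_{j,k} ~ Normal(mu_{0,k}, eps * Sigma_k).
  library(MASS)  # for mvrnorm

  perturb_means <- function(J, mu0k, Sigma_k, eps, phi) {
    S_k <- rbinom(1, 1, phi)                         # misalignment indicator
    mu  <- if (S_k == 0) {
      matrix(mu0k, J, length(mu0k), byrow = TRUE)    # identical across samples
    } else {
      mvrnorm(J, mu = mu0k, Sigma = eps * Sigma_k)   # locally perturbed
    }
    list(S = S_k, mu = mu)   # row j of mu = mean of this cluster in sample j
  }

  set.seed(2)
  perturb_means(J = 3, mu0k = c(0, 0), Sigma_k = diag(2), eps = 0.1, phi = 0.5)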
2.3 Posterior inference

Posterior inference can be carried out through Markov chain Monte Carlo (MCMC). One option is to use Müller et al. (2004)'s standard Pólya urn scheme. A benefit of this sampling scheme is that all the random weights are integrated out. However, it can be computationally inefficient for large datasets such as those arising in flow cytometry experiments. Alternatively, one can approximate the nonparametric model with a finite model and use a blocked Gibbs sampler (Ishwaran and James, 2001), which is more efficient in terms of mixing and computational speed, and hence is what we recommend.

To this end, two different finite approximation strategies are commonly adopted for DPMs and other stick breaking mixtures: (i) truncating the stick breaking at some maximum number of components and (ii) using a finite-dimensional symmetric Dirichlet distribution. These two approximations might look very different at first, but the main difference between the two is in the induced stochastic ordering of the weights, which is irrelevant in mixture models. In fact, as Kurihara et al. (2007) point out, one can apply a size-biased permutation to the order of the weights of a finite symmetric Dirichlet distribution and obtain a distribution which is practically identical to the truncated SBP. However, the two strategies are not computationally equivalent for mixture models. The weights under the symmetric finite-Dirichlet approximation are exchangeable, which results in substantially improved mixing over truncating the SBP. Therefore we opt for the symmetric finite-Dirichlet approximation in our implementation. This approximation has been studied and used by many authors in a variety of contexts; see Neal (2000), Green and Richardson (2001), and Ishwaran and Zarepour (2002), among others. Specifically, under this approximation, the infinite sequences of mixture weights in Eq. (1) are replaced by:

    (w_{0,k} : k ∈ K_0) ∼ Dirichlet(α/K_0, α/K_0, ..., α/K_0),
    (w_{j,k} : k ∈ K_1) iid∼ Dirichlet(α/K_1, α/K_1, ..., α/K_1), for j = 1, ..., J,

where K_0 and K_1 represent the numbers of mixture components that are shared and differential across the groups, respectively. In the nonparametric case, both K_0 and K_1 are infinite, while in the finite approximation we need to choose K_0 and K_1. A simple choice is to set K_0 = K_1 = K for some large K which represents an upper bound on the a priori expected number of mixture components.

With this specification, next we give the details of the MCMC sampler for the joint posterior in terms of the full conditionals:

1. Latent assignments for i = 1, ..., n_j and j = 1, ..., J:

    Pr(Z_{i,j} = k | ...) ∝ π_{j,k} Normal(y_{i,j} | μ_{j,k}, Σ_k), k ∈ K.
2. Mixture weights (a code sketch of this update appears after the list):

    [w_{0,1}, ..., w_{0,K_0} | ...] ∼ Dirichlet(n_{0,1} + α/K_0, ..., n_{0,K_0} + α/K_0),
    [w_{j,1}, ..., w_{j,K_1} | ...] ind∼ Dirichlet(n_{j,1} + α/K_1, ..., n_{j,K_1} + α/K_1),

where n_{0,k} = |{Z_{i,j} = k : i = 1, ..., n_j and j = 1, ..., J}| for k ∈ K_0, and n_{j,k} = |{Z_{i,j} = k : i = 1, ..., n_j}| for j = 1, ..., J and k ∈ K_1.

3. Latent perturbation state variables for k ∈ K:

    Pr(S_k = 1 | ...) = (1 + (1 − ϕ)/ϕ · BF_k)^{−1},

where

    BF_k = (|Ψ^{(0)}_{1,k}| / |Ψ^{(1)}_{1,k}|)^{(ν_1 + ∑_j n_{j,k})/2} ∏_j (ε n_{j,k} + 1)^{p/2},
    Ψ^{(1)}_{1,k} = {Ψ_1^{−1} + ∑_j [SS_{j,k} + (ε + 1/n_{j,k})^{−1} (Ȳ_{j,k} − μ_{0,k})(Ȳ_{j,k} − μ_{0,k})′]}^{−1},
    Ψ^{(0)}_{1,k} = {Ψ_1^{−1} + SS_k + ∑_j n_{j,k} (Ȳ_k − μ_{0,k})(Ȳ_k − μ_{0,k})′}^{−1},

for Ȳ_{j,k} = ∑_{i: Z_{i,j}=k} Y_{i,j} / n_{j,k}, Ȳ_k = (∑_{i,j: Z_{i,j}=k} Y_{i,j}) / (∑_j n_{j,k}), SS_{j,k} = ∑_{i: Z_{i,j}=k} (Y_{i,j} − Ȳ_{j,k})(Y_{i,j} − Ȳ_{j,k})′, and SS_k = ∑_{i,j: Z_{i,j}=k} (Y_{i,j} − Ȳ_k)(Y_{i,j} − Ȳ_k)′.

4. Precision matrices for k ∈ K:

    [Σ_k^{−1} | ...] ∼ Wishart(Ψ^{(S_k)}_{1,k}, ν_1 + ∑_j n_{j,k}).

5. Grand means for k ∈ K:

    [μ_{0,k} | ...] ∼ Normal(m^{(S_k)}_{1,k}, Σ_k / (∑_j (ε S_k + 1/n_{j,k})^{−1} + k_0)),

where m^{(S_k)}_{1,k} denotes the corresponding conditional posterior mean.

6. Group means for j = 1, ..., J and k ∈ K:

    [μ_{j,k} | S_k = 0, ...] ∼ δ_{μ_{0,k}},
    [μ_{j,k} | S_k = 1, ...] ∼ Normal((n_{j,k} Ȳ_{j,k} + μ_{0,k}/ε) / (n_{j,k} + 1/ε), Σ_k / (n_{j,k} + 1/ε)).

7. A Metropolis step to explore different modes of the posterior distribution by swapping an index from K_0 with an index from K_1. The proposal distribution is defined as follows. An initial index k′ is drawn proportionally to √n_k for k ∈ K, where n_k = |{(i, j) : Z_{i,j} = k}|, and a second index k′′ is drawn uniformly from K_1 if k′ ∈ K_0 and uniformly from K_0 if k′ ∈ K_1. Since the proposal is symmetric, the swap is accepted with probability

    min( E_{w,ρ}(∏_{j,k} π_{j,k}^{n_{j,k}} | Z_new) / E_{w,ρ}(∏_{j,k} π_{j,k}^{n_{j,k}} | Z), 1 ),

where Z and Z_new represent the vectors of the latent assignments before and after the swap. Since the mixture components are exchangeable within K_0 and K_1, the acceptance probability depends only on the swapped indices. Similar strategies to improve the exploration of the sample space have been proposed by Porteous et al. (2012) and Papaspiliopoulos and Roberts (2008).

8. The Dirichlet pseudo-count parameter α is updated using a Metropolis-Hastings step with the proposal α* | α ∼ Gamma(α · a, a), where a is a tuning parameter calibrated during the burn-in.

9. Mean shrinkage parameter:

    [k_0 | ...] ∼ Gamma((τ_1 + p · K)/2, (τ_2 + ∑_k (μ_{0,k} − m_1)′ Σ_k^{−1} (μ_{0,k} − m_1))/2).

10. Scale matrix hyperparameter:

    [Ψ_1^{−1} | ...] ∼ Wishart((Ψ_2 + ∑_k Σ_k^{−1})^{−1}, K · ν_1 + ν_2).

11. Centroid mean parameter:

    [m_1 | ...] ∼ Normal(V m, V), where m = S_2^{−1} m_2 + k_0 ∑_k Σ_k^{−1} μ_{0,k} and V = (S_2^{−1} + k_0 ∑_k Σ_k^{−1})^{−1}.

12. The perturbation parameter ε is updated using a Metropolis step with the proposal ε* ∼ Uniform(a_ε, b_ε).

13. The proportion of clusters with kernel misalignment:

    [ϕ | ...] ∼ Beta(a_ϕ + s_1, b_ϕ + s_0), where s_i = |{S_k = i : k = 1, ..., K}|.

14. The "length" of the shared stick:

    [ρ | ...] ∼ Beta(a_ρ + n_0, b_ρ + ∑_j n_j), where n_0 = ∑_{k∈K_0} n_{0,k} and n_j = ∑_{k∈K_1} n_{j,k}.
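As an illustration of the blocked Gibbs steps, here is a minimal R sketch of the conjugate update in step 2, drawing new s-stick and i-stick weights from their Dirichlet full conditionals given the current count summaries; the helper rdirichlet and the toy counts are assumptions of this sketch rather than code from the released sampler.

  # Dirichlet sampler via normalized Gammas.
  rdirichlet <- function(a) { g <- rgamma(length(a), a); g / sum(g) }

  # One Gibbs update of the weights under the finite symmetric
  # Dirichlet approximation (step 2 above).
  update_weights <- function(n0, n1, alpha) {
    # n0: vector of counts n_{0,k}, k in K0, pooled over all samples
    # n1: J x K1 matrix of counts n_{j,k}, k in K1
    K0 <- length(n0); K1 <- ncol(n1)
    w0 <- rdirichlet(n0 + alpha / K0)
    w1 <- t(apply(n1, 1, function(nj) rdirichlet(nj + alpha / K1)))
    list(w0 = w0, w1 = w1)
  }

  set.seed(3)
  n0 <- c(40, 25, 5)                     # toy counts for 3 shared clusters
  n1 <- matrix(rpois(6, 10), nrow = 2)   # toy counts, J = 2, K1 = 3
  update_weights(n0, n1, alpha = 1)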
3 Numerical examples

In this section we provide four numerical examples. In the first example, data are simulated under different mixture distributions, and we compare the goodness-of-fit of our method with that of competing approaches. In the second example we illustrate through a simulated dataset how our model can be used to remove small distributional shifts across related mixture distributions. In the third example we compare the performance of our model to that of other competing methods in testing and identifying differences across distributions. In the fourth example we analyze two real flow cytometry datasets. In all of the examples, we shall refer to our Dirichlet process mixtures of Gaussians with ψ-stick breaking and kernel perturbation as CREMID, as it models Closely RElated MIxture Distributions.

3.1 Example 1: Estimation and predictive performance

In this first example, we investigate how CREMID helps achieve more effective borrowing of information across samples, thereby enhancing predictive performance. To this end, we consider four simulation scenarios, representative of a vast variety of real applications. We use the sum of L_1 distances of the estimated univariate predictive densities from the true densities as the measure of goodness of fit. (Note that we used this metric instead of the more natural log predictive score, or the L_1 distance of the multivariate predictive density from the true density, because at the time of writing, the available software for the competitor HDPM provides the marginal predictive densities but not the other two metrics.)

We consider the following multi-sample scenarios in R^4. In each scenario, there are three data samples (j = 1, 2, 3) and the sample size for each is 100. Below we outline the four different scenarios; some of the parameters are omitted here but provided in the Appendix, and a simulation sketch for the first scenario is given at the end of this subsection.

1. Local shift:

    y_{i,j} | μ, Σ, π ∼ π_1 N(y_{i,j} | μ_1 + δ_j, Σ_1) + ∑_{k=2}^{4} π_k N(y_{i,j} | μ_k, Σ_k),

where δ_j = (j/2, 0, 0, 0) and μ_k ∼ U(0, 10) for k = 1, ..., 4.

2. Global shift:

    y_{i,j} | μ, Σ, π ∼ ∑_{k=1}^{4} π_k N(y_{i,j} | μ_k + δ_j, Σ_k),

where μ_k ∼ U(0, 10) for k = 1, ..., 4.

3. Local weight change:

    y_{i,j} | μ, Σ, π ∼ (π_1 − c(j − 1)) N(y_{i,j} | μ_1, Σ_1) + (π_2 + c(j − 1)) N(y_{i,j} | μ_2, Σ_2) + ∑_{k=3}^{4} π_k N(y_{i,j} | μ_k, Σ_k),    (2)

where c > 0 is a fixed increment, π is a fixed weight vector with π_4 = 0.1, and μ_k ∼ U(0, 10) for k = 1, ..., 4.

4. Global weight change:

    y_{i,j} | μ, Σ, π ∼ ∑_{k=1}^{4} π_{j,k} N(y_{i,j} | μ_k, Σ_k), with π_j ∝ exp(m_j) and m_j ∼ N(0, S),

where μ_k ∼ U(0, 10) for k = 1, ..., 4.

We compare CREMID with HDPM (Müller et al., 2004), using the R package DPpackage (Jara et al., 2011) for fitting HDPM. In addition, we also compare these two methods to independent finite mixtures of Gaussians fitted to each of the three samples using Mclust (Fraley and Raftery, 2002), available in the R package mclust.
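As referenced above, here is a sketch of a generator for the local shift scenario; the weight vector and the common covariance below are illustrative stand-ins for the exact values reported in the Appendix.

  # Local shift scenario: component 1 is shifted by delta_j in its first
  # coordinate; components 2-4 are common across the three samples.
  library(MASS)

  sim_local_shift <- function(n = 100, J = 3, p = 4, K = 4,
                              pi_w = c(0.35, 0.25, 0.25, 0.15)) {  # illustrative
    mu    <- matrix(runif(K * p, 0, 10), K, p)   # mu_k ~ U(0, 10) coordinate-wise
    Sigma <- diag(p)                             # illustrative common covariance
    lapply(seq_len(J), function(j) {
      delta_j <- c(j / 2, rep(0, p - 1))         # the local shift
      z       <- sample.int(K, n, replace = TRUE, prob = pi_w)
      shift   <- outer(z == 1, delta_j)          # applied only to cluster 1
      mu[z, ] + shift + mvrnorm(n, rep(0, p), Sigma)
    })
  }

  set.seed(4)
  y <- sim_local_shift()   # list of J samples, each an n x p matrix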
Figure 3: Box-plots of the sum of L_1 distances of the estimated univariate predictive densities from the true densities for the three methods, under the four scenarios (Local Shift, Global Shift, Local Weight Change, Global Weight Change).

In Figure 3 we show the sum of L_1 distances of the estimated univariate predictive densities from the true densities for the three methods. CREMID is the most accurate method in the two location shift scenarios as well as in the local weight change scenario. In the global weight change scenario, both our method and HDPM underperform Mclust. Because the samples are different in all cluster weights, we pay a price for assuming that some cluster weights are shared.

3.2 Example 2: Correcting for cross-sample misalignment

A common problem in studies involving data collected from multiple labs or centers is the misalignment of the same clusters across samples due to external confounders, which is what motivated our hierarchical locally perturbed kernel construction. In flow cytometry, for example, misalignment across cell subpopulations can be substantial. An important preprocessing step is cross-sample calibration, that is, estimating and correcting for the misalignment across samples and thereby producing "standardized" data sets for follow-up studies. (This is akin to the registration problem in functional data analysis.) To this end, we note that for each observation y_{i,j}, if Z_{i,j} = k, that is, the observation belongs to cluster k, then we can compute a corrected value by adjusting for the shift in the cluster center across the samples:

    ỹ_{i,j} = μ_{0,k} + (y_{i,j} − μ_{j,k}) = y_{i,j} − Δ_{j,k}, where Δ_{j,k} = μ_{j,k} − μ_{0,k}

is the displacement of cluster k in sample j relative to the centroid. Because Z_{i,j} is unobserved, we can appeal to Bayesian model averaging (BMA) by computing the posterior mean of ỹ_{i,j}:

    E(ỹ_{i,j} | y) = y_{i,j} − E(Δ_{j,Z_{i,j}} | y) ≈ y_{i,j} − (1/B) ∑_{b=1}^{B} Δ^{(b)}_{j,Z^{(b)}_{i,j}},

where Δ^{(b)}_{j,Z^{(b)}_{i,j}} = μ^{(b)}_{j,Z^{(b)}_{i,j}} − μ^{(b)}_{0,Z^{(b)}_{i,j}} is the bth posterior draw of the displacement. (A code sketch of this correction follows the data description below.)

Let us consider a numerical example based on mixtures of normals in R^4 to illustrate how one can remove cross-sample misalignment. The data are generated as

    y_{i,j} ∼ w_1 N(μ_{j,1}, I) + w_2 N(μ_2, I) + w_3 N(μ_3, s I) + w_4 N(μ_{j,4}, s I), i = 1, ..., n, j = 1, 2, 3,

where the w_k are fixed mixing weights, s is a fixed scalar, μ_{j,1} = (1, −j, ∗, ∗), μ_2 = (8, ∗, ∗, ∗), μ_3 = (1, ∗, ∗, 1), and μ_{j,4} = (6 + j, j, ∗, ∗).
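The following R sketch implements the posterior-mean correction above; the layout of the posterior draws (a label matrix and a displacement array) is an assumption of this sketch rather than the output format of any particular sampler.

  # Calibrate sample j by subtracting the Bayesian model average of the
  # displacement Delta_{j, Z_{i,j}} over B retained posterior draws.
  calibrate_sample <- function(y_j, Z_draws, Delta_draws) {
    # y_j:         n x p data matrix for sample j
    # Z_draws:     n x B matrix with Z_draws[i, b] = Z_{i,j}^{(b)}
    # Delta_draws: K x p x B array with
    #              Delta_draws[k, , b] = mu_{j,k}^{(b)} - mu_{0,k}^{(b)}
    n   <- nrow(y_j); B <- ncol(Z_draws)
    adj <- matrix(0, n, ncol(y_j))
    for (b in seq_len(B)) {
      adj <- adj + Delta_draws[Z_draws[, b], , b]  # displacement of each point's cluster
    }
    y_j - adj / B                                  # y_tilde = y - E(Delta | y)
  }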
1) and µ j, = (6 + j, j, , ll llllllll llll lll lllll llll lll ll l l lll lll lll llllll lllll ll l ll ll lll ll lll l ll lll llll ll ll l ll ll ll ll ll l llll ll lll llll ll lll lll l ll ll lllll ll ll l lll ll l ll lll ll lll lll ll ll ll ll lllll llll ll llll lll ll ll lll llll ll l l llll ll lll lll l lllllll lll l ll l lllllll ll lll ll ll l llll lllll ll ll llllllll llll lll ll llll ll ll ll l lllllll ll lllll lll ll llll ll lll lll llll l ll llll ll llll ll llll l lll lll lll llll ll lll lll llll l l ll ll llll l ll lll l lll l l lll ll lll ll ll lll ll llll llll l l ll lll l l ll ll ll ll ll ll llll lll llll ll llll l lll l l llll l ll lll l ll lll l ll l lllll ll llll l lll ll l ll lll l lllll l l lll l lllll lll ll l ll lllll l llll lll lll lll ll llll ll lll llll ll ll l ll llllll ll ll ll l llll l llllll ll llll ll ll ll l llll l lll llll l l ll lllllll l ll ll l ll ll l lll ll l ll ll llll ll ll ll ll l ll ll lll ll lll lll lll llllll l l ll l ll lll lll lllll l lll ll lll ll llll l ll lllll ll llll lll ll ll ll lllll l lll lllll lll lll llllll l ll l lllll l ll l ll lll ll l ll l lll l ll lll l ll llll l lllllll ll ll lll llll l l ll lll lll ll lllllllllllll l ll lllll ll lllll ll ll ll l l ll ll lll l ll l ll l l llll l lll l lll ll l ll lll lllllll lll ll llll l ll ll lll llllll lll ll ll lll l ll ll lll ll l ll ll l ll ll ll l lllll lllll ll ll l llll lll l l lll llll lll llll ll sample 1 ll l l lll l ll ll ll llll lllll ll l lll l l ll l ll ll l lll ll ll llll ll ll l lll l lll ll ll lll ll l lll l lll l llll ll llll lll l l ll ll ll ll lll llll l llll ll lll llll llll ll l ll ll lllll ll l ll ll ll ll ll ll l llllll l lll ll ll lllll l lll llllll llll lll llll ll ll ll lll lll lll ll llll ll llll l llll llllll lll l ll ll lllllll l lll ll lllll l lll llllllllll llll lll l ll l llll ll lll ll lll ll l ll l l lll l lllll ll lllllll llll l lll lll l lll lll llll lll lllll ll lll ll ll lll lll lll l ll lll l l lll l lll l ll l lll lll ll ll ll ll l llll ll llllll lllll lll l llllll lllll ll lll ll lll ll ll l lll llllll ll ll ll lllll ll l lll ll l ll lll l l ll lllll ll ll lll lll lll lllll lll ll lll lllll l lll l ll llllll l ll lll llll lll ll lll lll ll ll llllll ll l ll ll lll ll lll llllllllll ll lll l l lllll lll ll ll ll ll lll ll l llllll l ll lll lll l llll l l lll l ll l lll ll ll ll llll llllll lll l ll ll l ll l lll l lllllll lllll l lll ll l ll llll lll ll l ll llll l lll ll lll ll llll l llll lll llll l lll l lll lll lll l ll ll ll lllll llll l l ll lll ll ll l lll l ll ll llll lll l lll ll lll l lll llll ll l ll lll ll ll ll l ll llll lll llll lll l ll lll lll ll ll ll ll llllll lll lll l ll llll ll l l llll l llll lll lll ll l ll ll ll ll ll lll l lll l lll ll ll ll ll lll l lll llll ll lll ll llll l l lll ll llllll lll lll llll ll ll llllll sample 2 ll ll ll l lll ll lll ll l lll llll llll ll l lll l lll llllll l ll l lll lll llll ll l ll llllll lllll l l ll l ll lll ll l ll lll l ll l llll llll ll llll ll lll l ll ll ll lll l ll l llll l lll lll ll llllll l lll ll lll lll llll lll lllllll l lll llll lll lll l lll llll l ll l llll l lll l lll l l lllll lll lll ll lll lll l lll llll ll lll l ll ll ll l lll ll ll ll l ll llll ll lll l lllllll llllll lll l ll lll lll llllll llll lll llllll l lll ll ll ll lll lll l ll l lllll ll ll llll l l lll l lllll lll l ll lllll l llllll lll l lll llll lllll l lllllll ll ll lll l ll lllll lll ll ll l ll lllll llll ll llll lll ll llll lll lllll llllll ll ll l lllllll llll llll ll lllll llllllll l l llll ll lll ll l lll l 
ll l lll lll llll l lll llll lll ll ll ll ll lll llll lll ll ll l ll ll ll lll llll lllll llll ll lll l ll ll ll lll lll lll ll lll llll lllll lll lll lllll l ll l ll ll l lll l llll lllll lll ll l lll lll llll llll l llll l lll l lll llll ll ll ll lllll lllll lll llllllll lllll ll lllll ll ll ll llll lll l lllll lllll ll l lll lll llllllll lllllll l ll ll ll l lll l llll lllll ll llll ll ll l l lll ll l ll lll ll llll lll l llll l l lll ll ll ll ll ll l ll llll lll lll llll lllll lll lll lll ll l l lll lll llll ll lllll lll lllllll l lll lll l llll ll ll ll ll ll llll l ll l ll lll ll ll ll lll l lll lll ll lll l ll lll ll l l lll llll lll llll l sample 3 ll llllllll llll lll lllll llll lll ll l l lll lll lll llllll lllll ll l ll ll lll ll lll l ll lll llll ll ll l ll ll ll ll ll l llll ll lll llll ll lll lll l ll ll lllll ll ll l lll ll l ll lll ll lll lll ll ll ll ll lllll llll ll llll lll ll ll lll llll ll l l llll ll lll lll l lllllll lll l ll l lllllll ll lll ll ll l llll lllll ll ll llllllll llll lll ll llll ll ll ll l lllllll ll lllll lll ll llll ll lll lll llll l ll llll ll llll ll llll l lll lll lll llll ll lll lll llll l l ll ll llll l ll lll l lll l l lll ll lll ll ll lll ll llll llll l l ll lll l l ll ll ll ll ll ll llll lll llll ll llll l lll l l llll l ll lll l ll lll l ll l lllll ll llll l lll ll l ll lll l lllll l l lll l lllll lll ll l ll lllll l llll lll lll lll ll llll ll lll llll ll ll l ll llllll ll llll l llll l llllll ll llll ll ll ll l llll l lll llll l l ll lllllll l ll ll l ll ll l lll ll l ll ll llll ll ll ll ll l ll ll lll ll lll lll lll llllll l l ll l ll lll lll lllll l lll ll lll ll llll l ll lllll ll llll lll ll ll ll lllll l lll lllll lll lll llllll l lll lllll l ll l ll lll ll l ll l lll l ll lll l ll llll l lllllll ll lllll llll l l ll lll lll ll lllllllllllll l ll lllll ll lllll ll ll ll l l ll ll lll l ll l ll l lllll l lll l lll ll lll lll lllllll lll ll llll l ll ll lll llllll lll ll ll lll l ll ll lll ll l ll ll l ll ll ll l lllll lllll ll ll l llll lll l l lll llll lll llll ll calibrated sample 1 ll l l lll l ll ll ll llll lllll ll l lll l l ll l ll ll l lll ll ll llll ll ll l lll l lll ll ll lll ll l lll l lll l llll ll llll lll l l ll ll ll ll lll llll l llll ll lll llll llll ll l ll ll lllll ll l ll ll ll ll ll ll l llllll l lll ll ll lllll l lll llllll llll lll llll ll ll ll lll lll lll ll llll ll lllll llll llllll lll l ll ll lllllll l lll ll lllll l lll llllllllll llll lll l ll l llll ll lll ll lll ll l ll l l lll l lllll ll lllllll llll l lll lll l lll lll llll lll lllll ll lll ll ll lll lll lll l ll lll l l lll l lll l ll l lll lll ll ll ll ll l llll ll llllll lllll lll l llllll lllll ll lll ll lll ll ll l lll llllll ll ll ll lllll ll l lll ll l ll lll l l ll lllll ll ll lll lll lll lllll lll ll lll lllll l lll l ll llllll l ll lll llll lll ll lll lll ll ll llllll ll l ll ll lll ll lll llllllllll ll lll l l lllll lll ll ll ll ll lll ll l llllll l ll lll lll l llll l l lll l ll l lll ll ll ll llll llllll lll l ll ll l ll l lll l lllllll lllll l lll ll l ll llll lll ll l ll llll l lll ll lll ll llll l llll lll llll l lll l lll lll lll l ll ll ll lllll llll l l ll lll ll ll l lll l ll ll llll lll l lll ll lll l lll llll ll l ll lll ll ll ll l ll llll lll llll lll l ll lll lll ll ll ll ll llllll lll lll l ll llll ll l l llll l llll lll lll ll l ll ll ll ll ll lll l lll l lll ll ll ll ll lll l lll llll ll lll ll llll l l lll ll llllll lll lll llll ll ll llllll calibrated sample 2 ll ll ll l lll ll lll ll l lll llll llll ll l lll l lll llllll l 
ll l lll lll llll ll l ll llllll lllll ll ll l ll lll ll l ll lll l ll l llll llll ll llll ll lll l ll ll ll lll l ll l llll l lll lll ll llllll l lll ll lll lll llll lll lllllll l lll llll lll lll l lll llll l ll l llll l lll l lll l l lllll lll lll ll lll lll l lll llll ll lll l ll ll ll l lll ll ll ll l ll llll ll lll l lllllll llllll lll l ll lll lll llllll llll lll llllll l lll ll ll ll lll lll l ll l lllll ll ll llll l l lll l lllll lll l ll lllll l llllll lll l lll llll lllll l lllllll ll ll lll l ll lllll lll ll ll l ll lllll llll ll llll lll ll llll lll lllll llllll ll ll l lll llll llll llll ll lllll llllllll l l llll ll lll ll l lll l ll l lll lll llll l lll llll lll ll ll ll ll lll llll lll ll ll l ll ll ll lll llll lllll llll ll lll l ll ll ll lll lll lll ll lll llll lllll lll lll lllll l ll l ll ll l lll l llll lllll lll ll l lll lll llll llll l llll l lll l lll llll ll ll ll lllll lllll lll llllllll lllll ll lllll ll ll ll llll lll l lllll lllll ll l lll lll ll llllll lllllll l ll ll ll l lll l llll lllll ll llll ll ll l l lll ll l ll lll ll llll lll l llll l l lll ll ll ll ll ll l llllll lll lll llll lllll lll lll lll ll l l lll lll llll ll lllll lll lllllll l lll lll l llll ll ll ll ll ll llll l ll l ll lll ll ll ll lll l lll lll lll ll l ll lll ll l l lll llll lll llll l calibrated sample 3 Figure 4: The three plots in the first row show the data from Example 2 projected alongthe first two dimensions for each of the three samples. In the second row the three plotsshow the calibrated data, i.e., after removing the estimated kernel perturbations.
3.3 Example 3: Testing differences across distributions

We consider the same multi-sample scenarios in R^4 used in Example 1. For each dataset we define a corresponding null data set by permuting the labels of the three samples. In Figure 5 we compare the ROC curves of our method and HDPM for testing the hypothesis that the three distributions are identical. Our method is substantially more powerful than HDPM in all four scenarios.

In these simulations, for our method we use E(ρ(1 − ϕ) | y) as the test statistic. This quantity goes to zero when there are differences in the mixture weights or in the mixture kernels across samples, and it goes to one when the distributions are identical across samples. One can adopt different test statistics under our method depending on the inference objective. For instance, if one is interested in testing just the presence of differences in weights, then a suitable test statistic is E(ρ | y).

We compare our method only to HDPM, since Mclust does not provide a way to test for differences across samples. In HDPM, each F_j is defined as a mixture of two components: F_j = ε H_0 + (1 − ε) H_j for j = 1, ..., J. The distribution H_0 represents the common part, and H_j represents the idiosyncratic part. The hyperparameter ε controlling the "degree of similarity" across the F_j's has a beta hyperprior. We use E(ε | y) as the test statistic for HDPM.
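Computing these test statistics from the MCMC output is straightforward; in this sketch, rho_draws and phi_draws are assumed to be vectors of retained posterior samples of ρ and ϕ.

  # E(rho * (1 - phi) | y): near 1 when the J distributions are identical,
  # near 0 under weight or kernel differences.
  test_stat_full    <- function(rho_draws, phi_draws) mean(rho_draws * (1 - phi_draws))

  # E(rho | y): sensitive to weight differences only.
  test_stat_weights <- function(rho_draws) mean(rho_draws)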
Figure 5: ROC curves for the two methods in Example 3.3: HDPM (Müller et al., 2004) in black solid, our method in red dashed. The four panels correspond to the Local Shift, Global Shift, Local Weight Change, and Global Weight Change scenarios, with FPR on the horizontal axis and TPR on the vertical axis.

3.4 Application: flow cytometry
In flow cytometry experiments, biomarkers are measured on a large number of blood cells. Different cell subtypes, i.e., groups of cells sharing similar biomarker levels, have distinct functions in the human immune system. Identifying variations in the abundance of subtypes across multiple samples is an important immunological question. Additionally, the location of a given subtype can shift slightly across samples due to both experimental variability and other uncontrolled "random effects".

We analyze two datasets, each containing three samples of 5,000 blood cells, with six biomarkers measured on each cell.
Control study. The blood from a given patient was split into three samples, and each sample went through a separate experimental procedure to generate the data. Since the three samples are essentially biologically identical, one expects no variations in the abundance of the different subtypes and no large location shifts of the cell types. Small perturbations of the cell types are likely due to additional variations in the experimental procedures.

In Figure 6 we plot the posterior distributions of ρ and ε for this data set under our proposed model. The parameter ρ reflects the total mass assigned to mixture components whose mixture weights are identical across groups. In this dataset, a posteriori this parameter concentrates around one, indicating that there is no evidence of a difference in the mixture weights across the three replicates. The parameter ε controls the expected amount of shift in the location of each kernel across samples. Its posterior does not concentrate around zero, indicating the presence of small misalignments among the replicate samples due to uncontrolled sources of variation. It is the decoupling of these two sources of variation that allows us to correctly infer the absence of variations in the mixture weights across the distributions of the three samples.

Figure 6: Histograms of the posterior distributions of ρ and ε for the flow cytometry control study.

Different stimulation conditions. In another data set, three blood samples from an individual underwent different stimulation treatments. One sample was left unstimulated, while the two remaining samples were stimulated with CEF and CMV pp65, respectively. The samples underwent separate experimental procedures in data generation. In Figure 7 we plot the posterior distributions of ρ and ε. The parameter ρ concentrates around 0.6, indicating that there are differences in some of the mixture weights across the three samples. The parameter ε concentrates around 0.2, either due to effects of the experimental conditions on the locations of the kernels, which is also a systematic cross-sample difference, or due to substantial additional variations in the experimental procedures in comparison to the control study.

Figure 7: Histograms of the posterior distributions of ρ and ε.

To judge goodness-of-fit, we also compare the predictive performance of our model with that of Mclust, evaluated by the log predictive likelihood of a "test" sample. We randomly select 1,000 data points from the whole data set as a "test" sample, while using 5,000 observations as the "training" sample. We had hoped to compare our method to other methods such as Müller et al. (2004), but at the time of writing, the existing software in R (the HDPMdensity function in
DPpackage) crashes for these data sets, most probably due to the large sample sizes, and it does not output predictive scores.

                                     Method
Data set                             CREMID       MClust
Control study                        -15456.34    -16310.93
Different stimulation conditions     -14649.47    -15408.23

Table 2: Log predictive score comparison for CREMID versus MClust. Larger values (or smaller absolute values for negative scores) indicate better fit to the data.

4 Conclusion

In this work we have introduced two useful techniques for modeling related data sets using mixture models: the shared/idiosyncratic stick breaking and the locally perturbed kernel. When used together, they incorporate three common data features observed in real applications: (i) samples often share the same clusters with different weights; (ii) only some clusters vary across samples; (iii) clusters can be misaligned due to extraneous causes. We have derived a Bayesian inference recipe through MCMC sampling and carried out extensive numerical studies to illustrate the gain in inferential efficiency in estimation, prediction, and hypothesis testing.

Finally, we note that while the two techniques are introduced and demonstrated in the context of mixtures of location-scale families, they are generally applicable to modeling related mixtures of other forms of kernels as well, such as mixtures of generalized linear models and mixtures of factor models. The computational details will vary but the general ideas remain the same.

Software
R code for the proposed MCMC sampler and code for the numerical examples are available at https://github.com/jacsor/cremid/ and https://github.com/jacsor/MPG-examples/, respectively.
Acknowledgment
The authors are very grateful to Cliburn Chan for helpful discussions. The flow cytometry data set was provided by EQAPOL (HHSN272201000045C), an NIH/NIAID/DAIDS-sponsored international resource that supports the development, implementation, and oversight of quality assurance programs (Sanchez PMC4138253).
Appendix
Numerical Examples
1. Local and global shift scenarios:

    Σ_1(i, i) = 1.∗ for i = 1, ..., 4, and Σ_1(i, j) = 0.∗ for i ≠ j and i, j = 1, ..., 4;
    Σ_2(i, i) = 2.∗ for i = 1, ..., 4, and Σ_2(i, j) = 1.∗ for i ≠ j and i, j = 1, ..., 4;
    Σ_3(i, i) = 0.∗ for i = 1, ..., 4, and Σ_3(i, j) = −0.∗ for i ≠ j and i, j = 1, ..., 4;
    Σ_4(i, i) = 0.∗ for i = 1, ..., 4, and Σ_4(i, j) = 0.∗ for i ≠ j and i, j = 1, ..., 4;
    π = (0.∗, 0.∗, 0.∗, 0.∗).
2. Local weight change scenario, with Σ_k for k = 1, ..., 4:

    Σ_1 = diag(1, ∗, ∗, ∗), Σ_2 = diag(2, ∗, ∗, ∗), Σ_3 = diag(0.∗, 0.∗, 0.∗, 0.∗), and Σ_k = diag(0.∗, 0.∗, 0.∗, 0.1) for k = 4, ....

References
Chan, C., F. Feng, J. Ottinger, D. Foster, M. West, and T. B. Kepler (2008). Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry Part A 73(8), 693-701.

Cron, A., C. Gouttefangeas, J. Frelinger, L. Lin, S. K. Singh, C. M. Britten, M. J. Welters, S. H. van der Burg, M. West, and C. Chan (2013). Hierarchical modeling for rare event detection and cell subset alignment across flow cytometry samples. PLoS Computational Biology 9(7), e1003130.

Escobar, M. D. and M. West (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90(430), 577-588.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics 1(2), 209-230.

Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458), 611-631.

Green, P. J. and S. Richardson (2001). Modelling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics 28(2), 355-375.

Ishwaran, H. and L. F. James (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96(453), 161-173.

Ishwaran, H. and M. Zarepour (2002). Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics 30(2), 269-283.

Jara, A., T. E. Hanson, F. A. Quintana, P. Müller, and G. L. Rosner (2011). DPpackage: Bayesian semi- and nonparametric modeling in R. Journal of Statistical Software 40(5), 1-30.

Kingman, J. F. C. (1975). Random discrete distributions. Journal of the Royal Statistical Society, Series B (Methodological) 37(1), 1-22.

Kurihara, K., M. Welling, and Y. W. Teh (2007). Collapsed variational Dirichlet process mixture models. In IJCAI, Volume 7, pp. 2796-2801.

Lopes, H. F., P. Müller, and G. L. Rosner (2003). Bayesian meta-analysis for longitudinal data models using multivariate mixture priors. Biometrics 59(1), 66-75.

MacEachern, S. N. and P. Müller (1998). Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics 7(2), 223-238.

Marin, J.-M., K. Mengersen, and C. P. Robert (2005). Bayesian modelling and inference on mixtures of distributions. In D. Dey and C. Rao (Eds.), Bayesian Thinking: Modeling and Computation, Volume 25 of Handbook of Statistics, pp. 459-507. Elsevier.

Müller, P., F. Quintana, and G. Rosner (2004). A method for combining inference across related nonparametric Bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(3), 735-749.

Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9(2), 249-265.

Papaspiliopoulos, O. and G. O. Roberts (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika 95(1), 169-186.

Pitman, J. and M. Yor (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability 25(2), 855-900.

Porteous, I., A. T. Ihler, P. Smyth, and M. Welling (2012). Gibbs sampling for (coupled) infinite mixture models in the stick breaking representation. arXiv preprint arXiv:1206.6845.

Rodríguez, A., D. B. Dunson, and A. E. Gelfand (2008). The nested Dirichlet process. Journal of the American Statistical Association 103(483), 1131-1154.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4, 639-650.

Teh, Y. W., M. I. Jordan, M. J. Beal, and D. M. Blei (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association 101(476), 1566-1581.