Latent nested nonparametric priors

Federico Camerlenghi (Department of Economics, Management and Statistics, University of Milano–Bicocca), David B. Dunson (Department of Statistical Science, Duke University), Antonio Lijoi and Igor Prünster (Department of Decision Sciences and BIDSA, Bocconi University), Abel Rodríguez (Department of Applied Mathematics and Statistics, University of California at Santa Cruz)

A. Lijoi and I. Prünster are supported by the European Research Council (ERC) through StG "N-BNP" 306406.

Abstract

Discrete random structures are important tools in Bayesian nonparametrics, and the resulting models have proven effective in density estimation, clustering, topic modeling and prediction, among others. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, which is inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and then normalising to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes, and develop a Markov chain Monte Carlo sampler for Bayesian inference. A test for distributional homogeneity across groups is obtained as a by-product. The results and their inferential implications are showcased on synthetic and real data.
Keywords:
Bayesian nonparametrics; Completely random measures; Dependent nonparametric priors; Heterogeneity; Mixture models; Nested processes.
1 Introduction

Data that are generated from different (though related) studies, populations or experiments are typically characterised by some degree of heterogeneity. A number of Bayesian nonparametric models have been proposed to accommodate such data structures, but analytic complexity has limited understanding of the implied dependence structure across samples. The spectrum of possible dependence ranges from homogeneity, corresponding to full exchangeability, to complete heterogeneity, corresponding to unconditional independence. It is clearly desirable to construct a prior that can cover this full spectrum, leading to a posterior that can appropriately adapt to the true dependence structure in the available data.

This problem has been partly addressed in several papers. In Lijoi et al. (2014) a class of random probability measures is defined in such a way that proximity to full exchangeability or independence is expressed in terms of a $[0,1]$-valued random variable. In the same spirit, a model decomposable into idiosyncratic and common components is devised in Müller et al. (2004). Alternatively, approaches based on Pólya tree priors are developed in Ma & Wong (2011), Holmes et al. (2015) and Filippi & Holmes (2017), while a multi-resolution scanning method is proposed in Soriano & Ma (2017). In Bhattacharya & Dunson (2012) Dirichlet process mixtures are used to test homogeneity across groups of observations on a manifold.

A popular class of dependent nonparametric priors that fits this framework is the nested Dirichlet process of Rodríguez et al. (2008), which aims at clustering the probability distributions associated to $d$ populations. For $d = 2$ it is defined by

\[ (X_{i,1}, X_{j,2}) \mid (\tilde p_1, \tilde p_2) \stackrel{ind}{\sim} \tilde p_1 \times \tilde p_2, \qquad (\tilde p_1, \tilde p_2) \mid \tilde q \sim \tilde q \times \tilde q, \qquad \tilde q = \sum_{i \ge 1} \omega_i\, \delta_{G_i}, \tag{1} \]

where the random elements $X_\ell = (X_{i,\ell})_{i \ge 1}$, for $\ell = 1, 2$, take values in a space $\mathbb{X}$, the sequences $(\omega_i)_{i \ge 1}$ and $(G_i)_{i \ge 1}$ are independent, with $\sum_{i \ge 1} \omega_i = 1$ almost surely, and the $G_i$'s are i.i.d. random probability measures on $\mathbb{X}$ such that

\[ G_i = \sum_{t \ge 1} w_{t,i}\, \delta_{\theta_{t,i}}, \qquad \theta_{t,i} \stackrel{iid}{\sim} P, \tag{2} \]

for some non-atomic probability measure $P$ on $\mathbb{X}$. In Rodríguez et al. (2008) it is assumed that $\tilde q$ and the $G_i$'s are realizations of Dirichlet processes, while in Rodríguez & Dunson (2014) it is assumed they are from a generalised Dirichlet process introduced in Hjort (2000). Due to the discreteness of $\tilde q$, one has $\tilde p_1 = \tilde p_2$ with positive probability, allowing for clustering at the level of the populations' distributions and implying $X_1 \stackrel{d}{=} X_2$ in such cases.

The nested Dirichlet process has been widely used in a rich variety of applications, but it has an unappealing characteristic that provides motivation for this article. In particular, if $X_1$ and $X_2$ share at least one value, then the posterior distribution of $(\tilde p_1, \tilde p_2)$ degenerates on $\{\tilde p_1 = \tilde p_2\}$, forcing homogeneity across the two samples. This occurs also in nested Dirichlet process mixture models in which the $X_{i,\ell}$ are latent, and is not specific to the Dirichlet process but is a consequence of nesting discrete random probabilities.

To overcome this major limitation, we propose a more flexible class of latent nested processes, which preserve heterogeneity a posteriori, even when distinct values are shared by different samples. Latent nested processes define $\tilde p_1$ and $\tilde p_2$ in (1) as resulting from normalisation of an additive random measure model with common and idiosyncratic components, the latter with nested structure. Latent nested processes are shown to have appealing distributional properties. In particular, nesting corresponds, in terms of the induced partitions, to a convex combination of full exchangeability and unconditional independence, the two extreme cases. This leads naturally to methodology for testing equality of distributions.

2 Nested processes
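Before generalising, it helps to fix the baseline construction. A draw from the nested Dirichlet process in (1)-(2) can be sketched by truncated stick-breaking; the truncation level, seeds and function names below are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, T):
    """T truncated stick-breaking weights of a Dirichlet process."""
    v = rng.beta(1.0, alpha, size=T)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return w / w.sum()  # renormalise to absorb the truncation error

def draw_nested_dp(alpha_q=1.0, alpha_g=1.0, T=50):
    """One draw of (p1, p2) from a truncated nested DP, as in (1)-(2)."""
    omega = stick_breaking(alpha_q, T)                       # outer weights
    Gs = [(stick_breaking(alpha_g, T), rng.normal(0, 1, T))  # inner G_i's
          for _ in range(T)]
    i1, i2 = rng.choice(T, size=2, p=omega)                  # (p1, p2) | q ~ q x q
    return Gs[i1], Gs[i2], bool(i1 == i2)

p1, p2, tie = draw_nested_dp()
```

Note that $\tilde p_1 = \tilde p_2$ occurs exactly when the same outer atom is selected twice, which for the Dirichlet specification happens with probability $1/(\alpha_q + 1)$ on average, consistent with the value $\pi_1 = 1/(c+1)$ discussed below.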
We first propose a class of nested processes that generalise nested Dirichlet processes by replacing the Dirichlet process components with a more flexible class of random measures. The idea is to define $\tilde q$ in (1) in terms of normalised completely random measures on the space $\mathbb{P}$ of probability measures on $\mathbb{X}$. Let $\tilde\mu$ be an almost surely finite completely random measure without fixed points of discontinuity, i.e. $\tilde\mu = \sum_{i \ge 1} J_i\, \delta_{G_i}$, where the $G_i$ are i.i.d. random probability measures on $\mathbb{X}$ with some fixed distribution $Q$ on $\mathbb{P}$. The corresponding Lévy measure on $\mathbb{R}^+ \times \mathbb{P}$ is assumed to factorise as

\[ \nu(\mathrm{d}s, \mathrm{d}p) = c\,\rho(s)\,\mathrm{d}s\,Q(\mathrm{d}p), \tag{3} \]

where $\rho$ is some non-negative function such that $\int_0^\infty \min\{1, s\}\,\rho(s)\,\mathrm{d}s < \infty$ and $c > 0$. Since such a $\nu$ characterises $\tilde\mu$ through its Lévy–Khintchine representation

\[ \mathbb{E}\Big[ e^{-\lambda \tilde\mu(A)} \Big] = \exp\Big[ -c\,Q(A) \int_0^\infty \big(1 - e^{-\lambda s}\big)\,\rho(s)\,\mathrm{d}s \Big] =: e^{-c\,Q(A)\,\psi(\lambda)} \tag{4} \]

for any measurable $A \subset \mathbb{P}$, we use the notation $\tilde\mu \sim \mathrm{CRM}[\nu; \mathbb{P}]$. The function $\psi$ in (4) is also referred to as the Laplace exponent of $\tilde\mu$. For a more extensive treatment of completely random measures, see Kingman (1993). If one additionally assumes that $\int_0^\infty \rho(s)\,\mathrm{d}s = \infty$, then $\tilde\mu(\mathbb{P}) > 0$ almost surely, and one can define $\tilde q$ in (1) as

\[ \tilde q \stackrel{d}{=} \frac{\tilde\mu}{\tilde\mu(\mathbb{P})}. \tag{5} \]

This is known as a normalised random measure with independent increments, introduced in Regazzini et al. (2003), and is denoted as $\tilde q \sim \mathrm{NRMI}[\nu; \mathbb{P}]$. The baseline measure $Q$ of $\tilde\mu$ in (3) is, in turn, the probability distribution of $\tilde q_0 \sim \mathrm{NRMI}[\nu_0; \mathbb{X}]$, with $\tilde q_0 = \tilde\mu_0 / \tilde\mu_0(\mathbb{X})$ and $\tilde\mu_0$ having Lévy measure

\[ \nu_0(\mathrm{d}s, \mathrm{d}x) = c_0\,\rho_0(s)\,\mathrm{d}s\,Q_0(\mathrm{d}x) \tag{6} \]

for some non-negative function $\rho_0$ such that $\int_0^\infty \min\{1, s\}\,\rho_0(s)\,\mathrm{d}s < \infty$ and $\int_0^\infty \rho_0(s)\,\mathrm{d}s = \infty$. Moreover, $Q_0$ is a non-atomic probability measure on $\mathbb{X}$ and $\psi_0$ is the Laplace exponent of $\tilde\mu_0$. The resulting general class of nested processes is such that $(\tilde p_1, \tilde p_2) \mid \tilde q \sim \tilde q \times \tilde q$, and is indicated by $(\tilde p_1, \tilde p_2) \sim \mathrm{NP}(\nu, \nu_0)$. The nested Dirichlet process of Rodríguez et al. (2008) is recovered by specifying $\tilde\mu$ and $\tilde\mu_0$ to be gamma processes, namely $\rho(s) = \rho_0(s) = s^{-1} e^{-s}$, so that both $\tilde q$ and $\tilde q_0$ are Dirichlet processes.

2.2 Clustering properties of nested processes

A key property of nested processes is their ability to cluster both population distributions and data from each population. In this subsection, we present results on: (i) the prior probability that $\tilde p_1 = \tilde p_2$ and the resulting impact on ties at the observations' level; (ii) expressions for mixed moments as convex combinations of the fully exchangeable and unconditionally independent special cases; and (iii) a similar convexity result for the partially exchangeable partition probability function.
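As a quick numerical sanity check of (4), not taken from the paper and assuming scipy is available: for the gamma specification $\rho(s) = s^{-1}e^{-s}$ (with $c = 1$), the Laplace exponent should reduce to $\psi(\lambda) = \log(1+\lambda)$.

```python
import numpy as np
from scipy.integrate import quad

def psi(lam, rho):
    """Laplace exponent psi(lambda) = int_0^inf (1 - e^{-lambda s}) rho(s) ds."""
    val, _ = quad(lambda s: (1.0 - np.exp(-lam * s)) * rho(s), 0.0, np.inf)
    return val

rho_gamma = lambda s: np.exp(-s) / s   # gamma CRM intensity, c = 1

for lam in (0.5, 1.0, 3.0):
    # for the gamma process, psi(lambda) = log(1 + lambda)
    assert abs(psi(lam, rho_gamma) - np.log1p(lam)) < 1e-6
```

This is precisely the specification under which both $\tilde q$ and $\tilde q_0$ are Dirichlet processes.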
The probability distribution of an exchangeable partition depends only on the numbers of objects in each group; the exchangeable partition probability function is the probability of observing a particular partition as a function of the group counts. Partial exchangeability is exchangeability within samples; the partially exchangeable partition probability function depends only on the numbers of objects in each group that are idiosyncratic to a sample or shared. Simple forms for the partially exchangeable partition probability function not only provide key insights into the clustering properties but also greatly facilitate computation.

Before stating result (i), define

\[ \tau_q(u) = \int_0^\infty s^q e^{-us}\,\rho(s)\,\mathrm{d}s, \qquad \tau_q^{(0)}(u) = \int_0^\infty s^q e^{-us}\,\rho_0(s)\,\mathrm{d}s, \]

for any $u > 0$, and agree that $\tau_0(u) \equiv \tau_0^{(0)}(u) \equiv 1$.

Proposition 1. If $(\tilde p_1, \tilde p_2) \sim \mathrm{NP}(\nu, \nu_0)$, with $c$ and $c_0$ as in (3) and (6), then

\[ \pi_1 := \mathbb{P}(\tilde p_1 = \tilde p_2) = c \int_0^\infty u\, e^{-c\,\psi(u)}\, \tau_2(u)\,\mathrm{d}u \tag{7} \]

and the probability that any two observations from the two samples coincide equals

\[ \mathbb{P}(X_{j,1} = X_{k,2}) = \pi_1\, c_0 \int_0^\infty u\, e^{-c_0\,\psi_0(u)}\, \tau_2^{(0)}(u)\,\mathrm{d}u > 0. \tag{8} \]

This result shows that the probability of $\tilde p_1$ and $\tilde p_2$ coinciding is positive, as desired, but also that this implies a positive probability of ties at the observations' level. Moreover, (7) depends only on $\nu$ and not on $\nu_0$, since the latter acts on the $\mathbb{X}$ space. In contrast, the probability that any two observations $X_{j,1}$ and $X_{k,2}$ from the two samples coincide, given in (8), depends also on $\nu_0$. If $(\tilde p_1, \tilde p_2)$ is a nested Dirichlet process, which corresponds to $\rho(s) = \rho_0(s) = e^{-s}/s$, one obtains $\pi_1 = 1/(c+1)$ and $\mathbb{P}(X_{1,1} = X_{1,2}) = \pi_1/(c_0+1)$.

The following proposition [our result (ii)] provides a representation of mixed moments as a convex combination of full exchangeability and unconditional independence between samples.

Proposition 2. If $(\tilde p_1, \tilde p_2) \sim \mathrm{NP}(\nu, \nu_0)$ and $\pi_1 = \mathbb{P}(\tilde p_1 = \tilde p_2)$ is as in (7), then

\[ \mathbb{E}\Big[ \int_{\mathbb{P}^2} f_1(p_1)\, f_2(p_2)\, \tilde q(\mathrm{d}p_1)\, \tilde q(\mathrm{d}p_2) \Big] = \pi_1 \int_{\mathbb{P}} f_1(p)\, f_2(p)\, Q(\mathrm{d}p) + (1 - \pi_1) \int_{\mathbb{P}} f_1(p)\, Q(\mathrm{d}p) \int_{\mathbb{P}} f_2(p)\, Q(\mathrm{d}p) \tag{9} \]

for all measurable functions $f_1, f_2 : \mathbb{P} \to \mathbb{R}^+$.

This convexity property is a key feature of nested processes. The component with weight $1 - \pi_1$ in (9) accounts for heterogeneity among data from different populations, and it is important to retain this component a posteriori in (1) as well. Proposition 2 is instrumental to obtain our main result (iii), characterizing the partially exchangeable random partition induced by $X^{(n_1)} = (X_{1,1}, \ldots, X_{n_1,1})$ and $X^{(n_2)} = (X_{1,2}, \ldots, X_{n_2,2})$ in (1). To fix ideas, consider a partition of the $n_i$ data of sample $X_i^{(n_i)}$ into $k_i$ sample-specific groups and $k_0$ groups shared with sample $X_j^{(n_j)}$ ($j \ne i$), with corresponding frequencies $n_i = (n_{1,i}, \ldots, n_{k_i,i})$ and $q_i = (q_{1,i}, \ldots, q_{k_0,i})$. For example, $X_1^{(6)} = (2, -1, 5, 5, 0.5, 0.5)$ and $X_2^{(5)} = (5, -2, 0.5, 0.5, 0.5)$ yield a partition of $n_1 + n_2 = 11$ objects into $5$ groups, of which $k_1 = 2$, $k_2 = 1$ and $k_0 = 2$, with $n_1 = (1,1)$, $n_2 = (1)$, $q_1 = (2,2)$ and $q_2 = (1,3)$.

Let us start by analyzing the two extreme cases. For the fully exchangeable case (in the sense of exchangeability holding true across both samples), one obtains the exchangeable partition probability function

\[ \Phi_k^{(N)}(n_1, n_2, q_1 + q_2) = \frac{c_0^k}{\Gamma(N)} \int_0^\infty u^{N-1}\, e^{-c_0\,\psi_0(u)} \prod_{j=1}^{k_1} \tau^{(0)}_{n_{j,1}}(u) \prod_{i=1}^{k_2} \tau^{(0)}_{n_{i,2}}(u) \prod_{r=1}^{k_0} \tau^{(0)}_{q_{r,1}+q_{r,2}}(u)\,\mathrm{d}u, \tag{10} \]

having set $N = n_1 + n_2$, $k = k_1 + k_2 + k_0$, and $|a| = \sum_{i=1}^p a_i$ for any vector $a = (a_1, \ldots, a_p) \in \mathbb{R}^p$ with $p \ge 2$. The marginal exchangeable partition probability function for the individual sample $\ell = 1, 2$ is

\[ \Phi_{k_\ell + k_0}^{(n_\ell)}(n_\ell, q_\ell) = \frac{c_0^{k_0 + k_\ell}}{\Gamma(n_\ell)} \int_0^\infty u^{n_\ell - 1}\, e^{-c_0\,\psi_0(u)} \prod_{j=1}^{k_\ell} \tau^{(0)}_{n_{j,\ell}}(u) \prod_{r=1}^{k_0} \tau^{(0)}_{q_{r,\ell}}(u)\,\mathrm{d}u. \tag{11} \]

Both (10) and (11) hold true with the constraints $\sum_{j=1}^{k_\ell} n_{j,\ell} + \sum_{r=1}^{k_0} q_{r,\ell} = n_\ell$ and $1 \le k_\ell + k_0 \le n_\ell$, for each $\ell = 1, 2$. Finally, by the convention $\tau_0^{(0)} \equiv 1$, if one of the frequencies is zero the corresponding factor disappears and the partition probability function reduces to one with one block fewer. Both (10) and (11) solely depend on the Lévy intensity of the completely random measure and can be made explicit for specific choices. We are now ready to state our main result (iii).

Theorem 1.
The random partition induced by the samples $X^{(n_1)}$ and $X^{(n_2)}$, drawn from $(\tilde p_1, \tilde p_2) \sim \mathrm{NP}(\nu, \nu_0)$ according to (1), is characterised by the partially exchangeable partition probability function

\[ \Pi_k^{(N)}(n_1, n_2, q_1, q_2) = \pi_1\, \Phi_k^{(N)}(n_1, n_2, q_1 + q_2) + (1 - \pi_1)\, \Phi_{k_1 + k_0}^{(|n_1| + |q_1|)}(n_1, q_1)\, \Phi_{k_2 + k_0}^{(|n_2| + |q_2|)}(n_2, q_2)\, \mathbb{1}_{\{0\}}(k_0). \tag{12} \]

The two independent exchangeable partition probability functions in the second summand on the right-hand side of (12) are crucial for accounting for the heterogeneity across samples. However, the result shows that one shared value, i.e. $k_0 \ge 1$, forces the random partition to degenerate to the fully exchangeable case in (10). Hence, a single tie forces the two samples to be homogeneous, representing a serious limitation of all nested processes, including the nested Dirichlet process special case. This result shows that degeneracy is a consequence of combining simple discrete random probabilities with nesting. In the following section, we develop a generalisation that is able to preserve heterogeneity in the presence of ties between the samples.
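As a numerical illustration of Proposition 1 (a sketch, not from the paper, assuming scipy): in the nested Dirichlet process case, $\psi(u) = \log(1+u)$ and $\tau_2(u) = (1+u)^{-2}$ follow from $\rho(s) = e^{-s}/s$, and (7) should return $\pi_1 = 1/(c+1)$.

```python
import numpy as np
from scipy.integrate import quad

def pi1_nested_dp(c):
    """pi_1 = c * int_0^inf u exp(-c psi(u)) tau_2(u) du for the gamma CRM,
    i.e. the integrand u (1+u)^{-c} (1+u)^{-2}."""
    integrand = lambda u: u * (1.0 + u) ** (-(c + 2.0))
    val, _ = quad(integrand, 0.0, np.inf)
    return c * val

for c in (0.5, 1.0, 2.0, 10.0):
    assert abs(pi1_nested_dp(c) - 1.0 / (c + 1.0)) < 1e-6
```

The closed form follows by the change of variable $t = 1+u$, which gives $c\,[1/c - 1/(c+1)] = 1/(c+1)$.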
3 Latent nested processes

To address the degeneracy of the partially exchangeable partition probability function in (12), we look for a model that, while still able to cluster random probabilities, can also take into account heterogeneity of the data in the presence of ties between $X^{(n_1)}$ and $X^{(n_2)}$. The issue is relevant also in mixture models where $\tilde p_1$ and $\tilde p_2$ are used to model partially exchangeable latent variables such as, e.g., vectors of means and variances in normal mixture models. To see this, consider a simple density estimation problem, where two-sample data of sizes $n_1 = n_2 = 100$ are generated from

\[ X_{i,1} \stackrel{iid}{\sim} \tfrac12\, \mathrm{N}(0, 1) + \tfrac12\, \mathrm{N}(5, 1), \qquad X_{j,2} \stackrel{iid}{\sim} \tfrac12\, \mathrm{N}(5, 1) + \tfrac12\, \mathrm{N}(10, 1). \]

The two mixtures share one component while each also has an idiosyncratic one. If a nested process prior is used, as soon as the shared component is detected the two marginal distributions are considered identical, as the whole dependence structure boils down to exchangeability across the two samples.

Figure 1: Nested σ-stable mixture models: estimated densities (blue) and true densities (red), for $X^{(n_1)}$ in Panel (a) and for $X^{(n_2)}$ in Panel (b).

This critical issue can be tackled by a novel class of latent nested processes. Specifically, we introduce a model where the nesting structure is placed at the level of the underlying completely random measures, which leads to greater flexibility while preserving tractability. In order to define the new process, let $\mathbb{M}$ be the space of boundedly finite measures on $\mathbb{X}$ and $Q$ the probability measure on $\mathbb{M}$ induced by $\tilde\mu_0 \sim \mathrm{CRM}[\nu_0; \mathbb{X}]$, where $\nu_0$ is as in (6). Hence, for any measurable subset $A$ of $\mathbb{X}$,

\[ \mathbb{E}\Big[ e^{-\lambda \tilde\mu_0(A)} \Big] = \int_{\mathbb{M}} e^{-\lambda\, m(A)}\, Q(\mathrm{d}m) = \exp\Big\{ -c_0\, Q_0(A) \int_0^\infty \big(1 - e^{-\lambda s}\big)\, \rho_0(s)\,\mathrm{d}s \Big\}. \]

Definition 1.
Let $\tilde q \sim \mathrm{NRMI}[\nu; \mathbb{M}]$, with $\nu(\mathrm{d}s, \mathrm{d}m) = c\,\rho(s)\,\mathrm{d}s\,Q(\mathrm{d}m)$. The random probability measures $(\tilde p_1, \tilde p_2)$ are a latent nested process if

\[ \tilde p_\ell = \frac{\mu_\ell + \mu_S}{\mu_\ell(\mathbb{X}) + \mu_S(\mathbb{X})}, \qquad \ell = 1, 2, \tag{13} \]

where $(\mu_1, \mu_2, \mu_S) \mid \tilde q \sim \tilde q^2 \times \tilde q_S$ and $\tilde q_S$ is the law of a $\mathrm{CRM}[\nu^*; \mathbb{X}]$, with $\nu^* = \gamma\,\nu_0$ for some $\gamma > 0$. Henceforth, we will use the notation $(\tilde p_1, \tilde p_2) \sim \mathrm{LNP}(\gamma, \nu, \nu_0)$.

Furthermore, since

\[ \tilde p_i = w_i\, \frac{\mu_i}{\mu_i(\mathbb{X})} + (1 - w_i)\, \frac{\mu_S}{\mu_S(\mathbb{X})}, \qquad \text{where } w_i = \frac{\mu_i(\mathbb{X})}{\mu_S(\mathbb{X}) + \mu_i(\mathbb{X})}, \tag{14} \]

each $\tilde p_i$ is a mixture of two components: an idiosyncratic component $\mu_i / \mu_i(\mathbb{X})$ and a shared component $\mu_S / \mu_S(\mathbb{X})$. Here $\mu_S$ preserves heterogeneity across samples even when shared values are present. The parameter $\gamma$ in the intensity $\nu^*$ tunes the effect of such a shared completely random measure, and one recovers model (1) as $\gamma \to 0$. A generalisation of the results given in Propositions 1 and 2 to nested completely random measures is provided in the following proposition, whose proof is omitted.
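Conditionally on $\mu_1 \ne \mu_2$, a draw of $(\tilde p_1, \tilde p_2)$ in the latent nested Dirichlet case can be sketched directly from (14): for gamma CRMs the normalised components are Dirichlet processes, and $w_i$ is $\mathrm{Beta}(c_0, \gamma c_0)$ distributed independently of them, since $\mu_i(\mathbb{X}) \sim \mathrm{Gamma}(c_0, 1)$ and $\mu_S(\mathbb{X}) \sim \mathrm{Gamma}(\gamma c_0, 1)$. Truncation level, seed and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_draw(alpha, T):
    """Truncated stick-breaking draw of a Dirichlet process with base N(0,1)."""
    v = rng.beta(1.0, alpha, size=T)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return w / w.sum(), rng.normal(0.0, 1.0, T)

def latent_nested_dp(c0=1.0, gamma=0.5, T=50):
    """(p1, p2) as in (14), conditionally on mu_1 != mu_2."""
    w_S, a_S = dp_draw(gamma * c0, T)          # shared component mu_S / mu_S(X)
    out = []
    for _ in range(2):
        w_I, a_I = dp_draw(c0, T)              # idiosyncratic mu_i / mu_i(X)
        w_i = rng.beta(c0, gamma * c0)         # = mu_i(X) / (mu_i(X) + mu_S(X))
        out.append((np.concatenate((w_i * w_I, (1 - w_i) * w_S)),
                    np.concatenate((a_I, a_S))))
    return out

p1, p2 = latent_nested_dp()
```

Both $\tilde p_1$ and $\tilde p_2$ place mass on the atoms of $\mu_S$, so ties across samples can occur without forcing $\tilde p_1 = \tilde p_2$; the full model would additionally set $\mu_1 = \mu_2$ with probability $\pi_1^*$.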
Proposition 3. If $(\mu_1, \mu_2) \mid \tilde q \stackrel{iid}{\sim} \tilde q$, where $\tilde q \sim \mathrm{NRMI}[\nu; \mathbb{M}]$ as in Definition 1, then

\[ \pi_1^* = \mathbb{P}(\mu_1 = \mu_2) = c \int_0^\infty u\, e^{-c\,\psi(u)}\, \tau_2(u)\,\mathrm{d}u \tag{15} \]

and

\[ \mathbb{E}\Big[ \int_{\mathbb{M}^2} f_1(m_1)\, f_2(m_2)\, \tilde q(\mathrm{d}m_1)\, \tilde q(\mathrm{d}m_2) \Big] = \pi_1^* \int_{\mathbb{M}} f_1(m)\, f_2(m)\, Q(\mathrm{d}m) + (1 - \pi_1^*) \prod_{\ell=1}^2 \int_{\mathbb{M}} f_\ell(m)\, Q(\mathrm{d}m) \tag{16} \]

for all measurable functions $f_1, f_2 : \mathbb{M} \to \mathbb{R}^+$.

Proposition 4. If $(\tilde p_1, \tilde p_2) \sim \mathrm{LNP}(\gamma, \nu, \nu_0)$, then $\mathbb{P}(\tilde p_1 = \tilde p_2) = \mathbb{P}(\mu_1 = \mu_2)$.

Proposition 4, combined with the decomposition $\{\tilde p_1 = \tilde p_2\} = \{\mu_1 = \mu_2\} \cup (\{\tilde p_1 = \tilde p_2\} \cap \{\mu_1 \ne \mu_2\})$ and the inclusion $\{\mu_1 = \mu_2\} \subset \{\tilde p_1 = \tilde p_2\}$, entails $\mathbb{P}[\{\tilde p_1 = \tilde p_2\} \cap \{\mu_1 \ne \mu_2\}] = 0$, so the events $\{\tilde p_1 = \tilde p_2\}$ and $\{\mu_1 = \mu_2\}$ coincide almost surely. As a consequence, the posterior distribution of $\mathbb{1}\{\mu_1 = \mu_2\}$ can be readily employed to test equality between the distributions of the two samples. Further details are given in Section 5.

For analytic purposes, it is convenient to introduce an augmented version of the latent nested process, which includes latent indicator variables. In particular, $(X_{i,1}, X_{j,2}) \mid (\tilde p_1, \tilde p_2) \sim \tilde p_1 \times \tilde p_2$, with $(\tilde p_1, \tilde p_2) \sim \mathrm{LNP}(\gamma, \nu, \nu_0)$, if and only if

\[ (X_{i,1}, X_{j,2}) \mid (\zeta_{i,1}, \zeta_{j,2}, \mu_1, \mu_2, \mu_S) \stackrel{ind}{\sim} p_{\zeta_{i,1},1} \times p_{\zeta_{j,2},2}, \]
\[ (\zeta_{i,1}, \zeta_{j,2}) \mid (\mu_1, \mu_2, \mu_S) \sim \mathrm{Bern}(w_1) \times \mathrm{Bern}(w_2), \tag{17} \]
\[ (\mu_1, \mu_2, \mu_S) \mid (\tilde q, \tilde q_S) \sim \tilde q^2 \times \tilde q_S, \]

where $p_{1,\ell} = \mu_\ell/\mu_\ell(\mathbb{X})$ and $p_{0,\ell} = p_S = \mu_S/\mu_S(\mathbb{X})$. The latent variable $\zeta_{i,\ell}$ thus indicates which random probability measure, between the idiosyncratic $p_\ell$ and the shared $p_S$, generates each $X_{i,\ell}$, for $i = 1, \ldots, n_\ell$.

Theorem 2.
The random partition induced by the samples $X^{(n_1)}$ and $X^{(n_2)}$, drawn from $(\tilde p_1, \tilde p_2) \sim \mathrm{LNP}(\gamma, \nu, \nu_0)$ as in (17), is characterised by the partially exchangeable partition probability function

\[ \Pi_k^{(N)}(n_1, n_2, q_1, q_2) = \pi_1^*\, \frac{c_0^k\, (1+\gamma)^k}{\Gamma(N)} \int_0^\infty s^{N-1}\, e^{-(1+\gamma)\, c_0\, \psi_0(s)} \prod_{\ell=1}^{2} \prod_{j=1}^{k_\ell} \tau^{(0)}_{n_{j,\ell}}(s) \prod_{j=1}^{k_0} \tau^{(0)}_{q_{j,1}+q_{j,2}}(s)\,\mathrm{d}s + (1 - \pi_1^*) \sum_{(*)} I(n_1, n_2, q_1 + q_2, \zeta^*), \tag{18} \]

where

\[ I(n_1, n_2, q_1 + q_2, \zeta^*) = \frac{c_0^k\, \gamma^{k - \bar k_1 - \bar k_2}}{\Gamma(n_1)\,\Gamma(n_2)} \int_0^\infty\!\!\int_0^\infty u^{n_1-1}\, v^{n_2-1}\, e^{-\gamma c_0 \psi_0(u+v) - c_0[\psi_0(u) + \psi_0(v)]} \prod_{j=1}^{k_1} \tau^{(0)}_{n_{j,1}}\big(u + (1 - \zeta^*_{j,1})\, v\big) \prod_{j=1}^{k_2} \tau^{(0)}_{n_{j,2}}\big((1 - \zeta^*_{j,2})\, u + v\big) \prod_{j=1}^{k_0} \tau^{(0)}_{q_{j,1}+q_{j,2}}(u+v)\,\mathrm{d}u\,\mathrm{d}v, \]

with $\bar k_\ell = \sum_{j=1}^{k_\ell} \zeta^*_{j,\ell}$, and the sum in the second summand on the right-hand side of (18) runs over all possible labels $\zeta^* \in \{0,1\}^{k_1 + k_2}$.

The partially exchangeable partition probability function (18) is a convex linear combination of an exchangeable partition probability function corresponding to full exchangeability across samples and one corresponding to unconditional independence. Heterogeneity across samples is thus preserved even in the presence of shared values. The above result is stated in full generality and hence may seem somewhat complex. However, as the following examples show, explicit expressions are obtained when considering stable or gamma random measures, and as $\gamma \to 0$ one recovers the nested process partially exchangeable partition probability function (12).

Example 1.
Based on Theorem 2 we can derive an explicit expression for the partition structure of latent nested $\sigma$-stable processes. Suppose $\rho(s) = \sigma s^{-1-\sigma}/\Gamma(1-\sigma)$ and $\rho_0(s) = \sigma_0 s^{-1-\sigma_0}/\Gamma(1-\sigma_0)$, for some $\sigma$ and $\sigma_0$ in $(0,1)$. In such a situation it is easy to see that $\pi_1^* = 1 - \sigma$, $\tau_q^{(0)}(u) = \sigma_0\, (1-\sigma_0)_{q-1}\, u^{\sigma_0 - q}$ and $\psi_0(u) = u^{\sigma_0}$. Moreover, let $c = c_0 = 1$, since the total mass of a stable process is redundant under normalization. If we further set

\[ J_{\sigma_0,\gamma}(H_1, H_2; H_3) := \int_0^1 \frac{w^{H_1 - 1}\, (1-w)^{H_2 - 1}}{[\gamma + w^{\sigma_0} + (1-w)^{\sigma_0}]^{H_3}}\,\mathrm{d}w, \]

for any positive $H_1$, $H_2$ and $H_3$, and

\[ \xi_a(n_1, n_2, q_1 + q_2) := \prod_{\ell=1}^{2} \prod_{j=1}^{k_\ell} (1-a)_{n_{j,\ell}-1} \prod_{j=1}^{k_0} (1-a)_{q_{j,1}+q_{j,2}-1}, \]

for any $a \in [0,1)$, then the partially exchangeable partition probability function in (18) may be rewritten as

\[ \Pi_k^{(N)}(n_1, n_2, q_1, q_2) = \sigma_0^{k-1}\, \Gamma(k)\, \xi_{\sigma_0}(n_1, n_2, q_1 + q_2) \Big\{ \frac{1-\sigma}{\Gamma(N)} + \frac{\sigma}{\Gamma(n_1)\,\Gamma(n_2)} \sum_{(*)} \gamma^{k - \bar k_1 - \bar k_2}\, J_{\sigma_0,\gamma}\big(n_1 - \bar n_1 + \bar k_1 \sigma_0,\ n_2 - \bar n_2 + \bar k_2 \sigma_0;\ k\big) \Big\}, \]

where $\bar n_\ell = \sum_{j=1}^{k_\ell} \zeta^*_{j,\ell}\, n_{j,\ell}$ and $\bar k_\ell = \sum_{j=1}^{k_\ell} \zeta^*_{j,\ell}$. The sum with respect to $\zeta^*$ can be evaluated, and it turns out that

\[ \Pi_k^{(N)}(n_1, n_2, q_1, q_2) = \frac{\sigma_0^{k-1}\, \Gamma(k)}{\Gamma(N)}\, \xi_{\sigma_0}(n_1, n_2, q_1 + q_2) \Big[ 1 - \sigma + \sigma\, \gamma^{k_0}\, \frac{B(k_1\sigma_0 + |q_1|,\, k_2\sigma_0 + |q_2|)}{B(n_1, n_2)} \int_0^1 \frac{\prod_{j=1}^{k_1} \big(1 + \gamma\, w^{n_{j,1} - \sigma_0}\big) \prod_{i=1}^{k_2} \big(1 + \gamma\, (1-w)^{n_{i,2} - \sigma_0}\big)}{\big[\gamma + w^{\sigma_0} + (1-w)^{\sigma_0}\big]^{k}}\, \mathrm{Beta}\big(\mathrm{d}w;\, k_1\sigma_0 + |q_1|,\, k_2\sigma_0 + |q_2|\big) \Big], \]

where $\mathrm{Beta}(\,\cdot\,; a, b)$ stands for the beta distribution with parameters $a$ and $b$, while $B(p, q)$ is the beta function with parameters $p$ and $q$. As is well known, $\sigma_0^{k-1}\, \Gamma(k)\, \xi_{\sigma_0}(n_1, n_2, q_1 + q_2)/\Gamma(N)$ is the exchangeable partition probability function of a normalised $\sigma_0$-stable process. Details on the above derivation, as well as on the following example, can be found in the Appendix.

Example 2.
Let $\rho(s) = \rho_0(s) = e^{-s}/s$. Recall that $\tau_q^{(0)}(u) = \Gamma(q)/(u+1)^q$ and $\psi_0(u) = \log(1+u)$; furthermore, $\pi_1^* = 1/(1+c)$ by standard calculations. From Theorem 2 we obtain the partition structure of the latent nested Dirichlet process,

\[ \Pi_k^{(N)}(n_1, n_2, q_1, q_2) = \xi_0(n_1, n_2, q_1 + q_2)\, c_0^k \Big\{ \frac{(1+\gamma)^k}{(1+c)\, (c_0(1+\gamma))_N} + \frac{c}{1+c} \sum_{(*)} \gamma^{k - \bar k_1 - \bar k_2}\, \frac{(\alpha)_{n_2}}{(\beta)_{n_2}}\, F\big(c_0 + \bar n_2, \alpha, n_2;\ \alpha + n_2, \beta + n_2;\ 1\big) \Big\}, \]

where $\alpha = (\gamma + 1)c_0 + n_1 - \bar n_1$, $\beta = c_0(1+\gamma)$ and $F$ is the generalised hypergeometric function. In the same spirit as in the previous example, the first element of the linear convex combination above, $c_0^k (1+\gamma)^k\, \xi_0(n_1, n_2, q_1 + q_2)/(c_0(1+\gamma))_N$, is nothing but the Ewens sampling formula, i.e. the exchangeable partition probability function associated to the Dirichlet process whose base measure has total mass $c_0(1+\gamma)$.

4 Markov chain Monte Carlo algorithm

We develop a class of Markov chain Monte Carlo algorithms for posterior computation in latent nested process models, relying on the partially exchangeable partition probability functions in Theorem 2, as marginal samplers of this kind tended to be more effective. The sampler is presented in the context of density estimation, where

\[ (X_{i,1}, X_{j,2}) \mid (\theta^{(n_1)}, \theta^{(n_2)}) \stackrel{ind}{\sim} h(\,\cdot\,; \theta_{i,1}) \times h(\,\cdot\,; \theta_{j,2}) \]

and the vectors $\theta^{(n_\ell)} = (\theta_{1,\ell}, \ldots, \theta_{n_\ell,\ell})$, for $\ell = 1, 2$, with $\theta_{i,\ell}$ taking values in $\Theta \subset \mathbb{R}^b$, are partially exchangeable and governed by a pair $(\tilde p_1, \tilde p_2)$ as in (17). The discreteness of $\tilde p_1$ and $\tilde p_2$ entails ties among the latent variables $\theta^{(n_1)}$ and $\theta^{(n_2)}$ that give rise to $k = k_1 + k_2 + k_0$ distinct clusters, identified by:

• the $k_1$ distinct values specific to $\theta^{(n_1)}$, i.e. not shared with $\theta^{(n_2)}$, denoted as $\theta^*_1 := (\theta^*_{1,1}, \ldots, \theta^*_{k_1,1})$, with corresponding frequencies $n_1$ and labels $\zeta^*_1$;

• the $k_2$ distinct values specific to $\theta^{(n_2)}$, i.e. not shared with $\theta^{(n_1)}$, denoted as $\theta^*_2 := (\theta^*_{1,2}, \ldots, \theta^*_{k_2,2})$, with corresponding frequencies $n_2$ and labels $\zeta^*_2$;

• the $k_0$ distinct values shared by $\theta^{(n_1)}$ and $\theta^{(n_2)}$, denoted as $\theta^*_0 := (\theta^*_{1,0}, \ldots, \theta^*_{k_0,0})$, with $q_\ell$ being their frequencies in $\theta^{(n_\ell)}$ and shared labels $\zeta^*_0$.

As a straightforward consequence of Theorem 2, one can determine the joint distribution of the data $X$, the corresponding latent variables $\theta$ and labels $\zeta$ as

\[ f(x \mid \theta)\; \Pi_k^{(N)}(n_1, n_2, q_1, q_2) \prod_{\ell=0}^{2} \prod_{j=1}^{k_\ell} Q_0(\mathrm{d}\theta^*_{j,\ell}), \tag{19} \]

where $\Pi_k^{(N)}$ is as in (18) and, for $C_{j,\ell} := \{i : \theta_{i,\ell} = \theta^*_{j,\ell}\}$ and $C_{r,\ell,0} := \{i : \theta_{i,\ell} = \theta^*_{r,0}\}$,

\[ f(x \mid \theta) = \prod_{\ell=1}^{2} \Big[ \prod_{j=1}^{k_\ell} \prod_{i \in C_{j,\ell}} h(x_{i,\ell}; \theta^*_{j,\ell}) \prod_{r=1}^{k_0} \prod_{i \in C_{r,\ell,0}} h(x_{i,\ell}; \theta^*_{r,0}) \Big]. \]

We now specialise (19) to the case of latent nested $\sigma$-stable processes described in Example 1. The Gibbs sampler is described just for sampling $\theta^{(n_1)}$, since the structure is replicated for $\theta^{(n_2)}$. To simplify the notation, $v^{-j}$ denotes the random variable $v$ after the removal of $\theta_{j,1}$. Moreover, with $T = (X, \theta, \zeta, \sigma, \sigma_0, \phi)$, we let $T \setminus \theta_{j,1}$ stand for $T$ after deleting $\theta_{j,1}$, $I = \mathbb{1}\{\tilde p_1 = \tilde p_2\}$, and $Q^*_j(\mathrm{d}\theta) = h(x_{j,1}; \theta)\, Q_0(\mathrm{d}\theta) / \int_\Theta h(x_{j,1}; \theta)\, Q_0(\mathrm{d}\theta)$. Here $\phi$ denotes a vector of hyperparameters entering the definition of the base measure $Q_0$.
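The Gibbs updates below repeatedly require the integral $J_{\sigma_0,\gamma}(H_1, H_2; H_3)$ of Example 1. A sketch of its evaluation (assuming scipy; all numerical arguments are illustrative), by quadrature together with the plain Monte Carlo device proposed at the end of this section:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta as beta_fn

def J_quad(sigma0, gamma, h1, h2, h3):
    """J(h1,h2;h3) = int_0^1 w^{h1-1}(1-w)^{h2-1} / [gamma + w^s0 + (1-w)^s0]^h3 dw."""
    f = lambda w: (w ** (h1 - 1.0) * (1.0 - w) ** (h2 - 1.0)
                   / (gamma + w ** sigma0 + (1.0 - w) ** sigma0) ** h3)
    val, _ = quad(f, 0.0, 1.0)
    return val

def J_mc(sigma0, gamma, h1, h2, h3, L=200_000, seed=2):
    """Monte Carlo version: J = B(h1,h2) E[(gamma + W^s0 + (1-W)^s0)^{-h3}],
    with W ~ Beta(h1,h2); it stays stable when h1, h2 < 1, where the
    quadrature integrand is unbounded."""
    W = np.random.default_rng(seed).beta(h1, h2, size=L)
    return beta_fn(h1, h2) * np.mean((gamma + W ** sigma0 + (1.0 - W) ** sigma0) ** (-h3))

assert abs(J_quad(0.5, 1.0, 2.0, 3.0, 4.0) - J_mc(0.5, 1.0, 2.0, 3.0, 4.0)) < 1e-3
```

The Monte Carlo form is the reason the sampler remains practical when both Beta parameters fall below one.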
The updating structure of the Gibbs sampler is as follows.

(1) Sample $\theta_{j,1}$ from its full conditional distribution, which is a mixture of the base-measure component and of point masses at the current distinct values whose labels are compatible with $\zeta_{j,1}$:

\[ \mathbb{P}(\theta_{j,1} \in \mathrm{d}\theta \mid T \setminus \theta_{j,1}, I = 1) = w_0\, Q^*_j(\mathrm{d}\theta) + \sum_{\{i :\, \zeta^{*,-j}_{i,1} = \zeta_{j,1}\}} w_{i,1}\, \delta_{\theta^{*,-j}_{i,1}}(\mathrm{d}\theta) + \sum_{\{i :\, \zeta^{*,-j}_{i,2} = \zeta_{j,1}\}} w_{i,2}\, \delta_{\theta^{*,-j}_{i,2}}(\mathrm{d}\theta) + \sum_{\{i :\, \zeta^{*,-j}_{i,0} = \zeta_{j,1}\}} w_{i,0}\, \delta_{\theta^{*,-j}_{i,0}}(\mathrm{d}\theta), \]

\[ \mathbb{P}(\theta_{j,1} \in \mathrm{d}\theta \mid T \setminus \theta_{j,1}, I = 0) = w_0\, Q^*_j(\mathrm{d}\theta) + \sum_{\{i :\, \zeta^{*,-j}_{i,1} = \zeta_{j,1}\}} w_{i,1}\, \delta_{\theta^{*,-j}_{i,1}}(\mathrm{d}\theta) + \mathbb{1}_{\{0\}}(\zeta_{j,1}) \Big[ \sum_{\{i :\, \zeta^{*,-j}_{i,2} = 0\}} w_{i,2}\, \delta_{\theta^{*,-j}_{i,2}}(\mathrm{d}\theta) + \sum_{r=1}^{k_0} w_{r,0}\, \delta_{\theta^{*,-j}_{r,0}}(\mathrm{d}\theta) \Big], \]

where, under $I = 1$,

\[ w_0 \propto \gamma^{1-\zeta_{j,1}}\, \frac{\sigma_0\, k^{-j}}{1+\gamma} \int_\Theta h(x_{j,1}; \theta)\, Q_0(\mathrm{d}\theta), \qquad w_{i,\ell} \propto \big(n^{-j}_{i,\ell} - \sigma_0\big)\, h\big(x_{j,1}; \theta^{*,-j}_{i,\ell}\big), \quad \ell = 1, 2, \]
\[ w_{i,0} \propto \big(q^{-j}_{i,1} + q^{-j}_{i,2} - \sigma_0\big)\, h\big(x_{j,1}; \theta^{*,-j}_{i,0}\big), \]

while, under $I = 0$, with $a_1 = n_1 - (\bar n^{-j}_1 + \zeta_{j,1}) + \bar k^{-j}_1 \sigma_0$ and $a_2 = n_2 - \bar n_2 + \bar k_2 \sigma_0$, one further has

\[ w_0 \propto \gamma^{1-\zeta_{j,1}}\, \sigma_0\, k^{-j}\, J_{\sigma_0}\big(a_1 + \zeta_{j,1}\sigma_0,\, a_2;\, k^{-j} + 1\big) \int_\Theta h(x_{j,1}; \theta)\, Q_0(\mathrm{d}\theta), \]
\[ w_{i,\ell} \propto J_{\sigma_0}\big(a_1, a_2; k^{-j}\big)\, \big(n^{-j}_{i,\ell} - \sigma_0\big)\, h\big(x_{j,1}; \theta^{*,-j}_{i,\ell}\big), \quad \ell = 1, 2, \qquad w_{i,0} \propto J_{\sigma_0}\big(a_1, a_2; k^{-j}\big)\, \big(q^{-j}_{i,1} + q^{-j}_{i,2} - \sigma_0\big)\, h\big(x_{j,1}; \theta^{*,-j}_{i,0}\big). \]

(2) Sample $\zeta^*_{j,1}$ from

\[ \mathbb{P}(\zeta^*_{j,1} = x \mid T \setminus \zeta^*_{j,1}, I = 1) = \frac{\gamma^{1-x}}{1+\gamma}, \qquad \mathbb{P}(\zeta^*_{j,1} = x \mid T \setminus \zeta^*_{j,1}, I = 0) \propto \gamma^{k - k_x - \bar k_2}\, J_{\sigma_0}\big(n_1 - n_x + k_x \sigma_0,\; n_2 - \bar n_2 + \bar k_2 \sigma_0;\; k\big), \]

where $x \in \{0, 1\}$, $k_x := x + |\zeta^{*,-j}_1|$ and $n_x := n_{j,1}\, x + |\zeta^{*,-j}_1 \odot n^{-j}_1|$, with $a \odot b$ denoting the component-wise product of two vectors $a$ and $b$. Moreover, it should be stressed that, conditional on $I = 0$, the labels $\zeta^*_{r,0}$ are degenerate at $x = 0$ for $r = 1, \ldots, k_0$.

(3) Update $I$ from

\[ \mathbb{P}(I = 1 \mid T) = 1 - \mathbb{P}(I = 0 \mid T) = \frac{(1-\sigma)\, B(n_1, n_2)}{(1-\sigma)\, B(n_1, n_2) + \sigma\, J_{\sigma_0}(\bar a_1, \bar a_2; k)\, (1+\gamma)^k}, \]

with $\bar a_1 = n_1 - \bar n_1 + \bar k_1 \sigma_0$ and $\bar a_2 = n_2 - \bar n_2 + \bar k_2 \sigma_0$. This sampling distribution holds true whenever $\theta^{(n_1)}$ and $\theta^{(n_2)}$ do not share any value $\theta^*_{j,0}$ with label $\zeta^*_{j,0} = 1$; if this situation occurs, then $\mathbb{P}(I = 1 \mid T) = 1$.

(4) Sample $\sigma$ and $\sigma_0$ from

\[ f(\sigma_0 \mid T \setminus \sigma_0, I) \propto J^{1-I}_{\sigma_0}(\bar a_1, \bar a_2; k)\, \sigma_0^{k-1}\, \kappa_0(\sigma_0) \prod_{\ell=1}^{2} \prod_{j=1}^{k_\ell} (1-\sigma_0)_{n_{j,\ell}-1} \prod_{r=1}^{k_0} (1-\sigma_0)_{q_{r,1}+q_{r,2}-1}, \]
\[ f(\sigma \mid T \setminus \sigma, I) \propto \kappa(\sigma)\, \big[ (1-\sigma)\, \mathbb{1}_{\{1\}}(I) + \sigma\, \mathbb{1}_{\{0\}}(I) \big], \]

where $\kappa_0$ and $\kappa$ are the priors for $\sigma_0$ and $\sigma$, respectively.

(5) Update $\gamma$ from

\[ f(\gamma \mid T \setminus \gamma, I) \propto \gamma^{k - \bar k_1 - \bar k_2}\, g(\gamma) \Big[ \frac{1-\sigma}{(1+\gamma)^k}\, \mathbb{1}_{\{1\}}(I) + \sigma\, J_{\sigma_0}(\bar a_1, \bar a_2; k)\, \mathbb{1}_{\{0\}}(I) \Big], \]

where $g$ is the prior distribution for $\gamma$.

Finally, the updating of the hyperparameters depends on the specification of $Q_0$ that is adopted; it is displayed in the next section, under the assumption that $Q_0$ is a normal/inverse-gamma distribution.

The evaluation of the integral $J_{\sigma_0}(h_1, h_2; h_3)$ is essential for the implementation of the Markov chain Monte Carlo procedure. This can be accomplished through numerical methods based on quadrature. However, computational issues arise when $h_1$ and $h_2$ are both less than 1, since the integrand defining $J_{\sigma_0}$ is then no longer bounded, although still integrable. For this reason we propose a plain Monte Carlo approximation of $J_{\sigma_0}$, based on observing that

\[ J_{\sigma_0}(h_1, h_2; h_3) = B(h_1, h_2)\, \mathbb{E}\big\{ [\gamma + W^{\sigma_0} + (1-W)^{\sigma_0}]^{-h_3} \big\}, \qquad W \sim \mathrm{Beta}(h_1, h_2). \]

Then, generating an i.i.d. sample $\{W_i\}_{i=1}^{L}$ of length $L$, with $W_i \stackrel{d}{=} W$, we get the approximation

\[ J_{\sigma_0}(h_1, h_2; h_3) \approx \frac{B(h_1, h_2)}{L} \sum_{i=1}^{L} \big[\gamma + W_i^{\sigma_0} + (1-W_i)^{\sigma_0}\big]^{-h_3}. \]

5 Illustrations
The algorithm introduced in Section 4 is employed here to estimate dependent random densities. Before implementation, we first need to complete the specification of our latent nested model (13). Let $\Theta = \mathbb{R} \times \mathbb{R}^+$ and let $h(\,\cdot\,; (M, V))$ be Gaussian with mean $M$ and variance $V$. Moreover, as customary, $Q_0$ is assumed to be a normal/inverse-gamma distribution,

\[ Q_0(\mathrm{d}M, \mathrm{d}V) = Q_{0,1}(\mathrm{d}V)\, Q_{0,2}(\mathrm{d}M \mid V), \]

with $Q_{0,1}$ an inverse-gamma distribution with parameters $(s_0, S_0)$ and $Q_{0,2}$ a Gaussian with mean $m$ and variance $\tau V$. Furthermore, the hyperpriors are

\[ \tau^{-1} \sim \mathrm{Gam}(w/2, W/2), \qquad m \sim \mathrm{N}(a, A), \]

for some real parameters $w > 0$, $W > 0$, $A > 0$ and $a \in \mathbb{R}$. In the simulation studies, $(w, W)$ is fixed and $a$ is set equal to the pooled sample mean $(n_1 \bar X + n_2 \bar Y)/(n_1 + n_2)$. The parameters $\tau$ and $m$ are updated on the basis of their full conditional distributions, which are easily derived and correspond to

\[ \mathcal{L}(\tau \mid T \setminus \tau, I) = \mathrm{IG}\Big( \frac{w + k}{2},\ \frac12\Big[ W + \sum_{i=0}^{2} \sum_{j=1}^{k_i} \frac{(M^*_{i,j} - m)^2}{V^*_{i,j}} \Big] \Big), \qquad \mathcal{L}(m \mid T \setminus m, I) = \mathrm{N}\Big( \frac{R}{D},\ \frac{1}{D} \Big), \]

where

\[ R = \frac{a}{A} + \sum_{i=0}^{2} \sum_{j=1}^{k_i} \frac{M^*_{i,j}}{\tau V^*_{i,j}}, \qquad D = \frac{1}{A} + \sum_{i=0}^{2} \sum_{j=1}^{k_i} \frac{1}{\tau V^*_{i,j}}. \]

The model specification is completed by choosing uniform prior distributions for $\sigma$ and $\sigma_0$. In order to overcome the possibly slow mixing of the Pólya urn sampler, we include the acceleration step of MacEachern (1994) and West et al. (1994), which consists in resampling the distinct values $(\theta^*_{i,j})_{j=1}^{k_i}$, for $i = 0, 1, 2$, at the end of every iteration. The numerical outcomes displayed in the sequel are based on 50,000 iterations after 50,000 burn-in sweeps.

Throughout, we assume the data $X^{(n_1)}$ and $X^{(n_2)}$ to be independently generated by two densities $f_1$ and $f_2$. These will be estimated jointly through the MCMC procedure, and the borrowing-of-strength phenomenon should then allow improved performance. An interesting by-product of our analysis is the possibility to examine the clustering structure of each distribution, namely the number of components of each mixture. Since the expression of the pEPPF (18) consists of two terms, in order to carry out posterior inference we have introduced the random variable $I = \mathbb{1}\{\mu_1 = \mu_2\}$. This random variable allows us to test whether the two samples come from the same distribution or not, since $I = \mathbb{1}\{\tilde p_1 = \tilde p_2\}$ almost surely (see also Proposition 4). Indeed, if interest lies in testing $H_0 : \tilde p_1 = \tilde p_2$ versus $H_1 : \tilde p_1 \ne \tilde p_2$, it is straightforward, based on the Markov chain Monte Carlo output, to compute an approximation of the Bayes factor

\[ \mathrm{BF} = \frac{\mathbb{P}(\tilde p_1 = \tilde p_2 \mid X)}{\mathbb{P}(\tilde p_1 \ne \tilde p_2 \mid X)}\, \frac{\mathbb{P}(\tilde p_1 \ne \tilde p_2)}{\mathbb{P}(\tilde p_1 = \tilde p_2)} = \frac{\mathbb{P}(I = 1 \mid X)}{\mathbb{P}(I = 0 \mid X)}\, \frac{\mathbb{P}(I = 0)}{\mathbb{P}(I = 1)}, \]

leading to acceptance of the null hypothesis if BF is sufficiently large. In the following we first consider simulated datasets generated from normal mixtures and then analyse the popular Iris dataset.

5.1 Synthetic data

We consider three simulated scenarios, in which $X^{(n)}_1$ and $X^{(n)}_2$ are independent and identically distributed draws from densities that are both two-component mixtures of normals; the prior parameters $(s_0, S_0)$ and the common sample size $n_1 = n_2 = n$ are kept fixed across scenarios. In Scenario I, $X^{(n)}_1$ and $X^{(n)}_2$ are drawn from the same density. The posterior distributions of the number of mixture components, respectively denoted by $K_1$ and $K_2$ for the two samples, and of the number of shared components, denoted by $K_0$, are reported in Table 1; the maximum a posteriori estimates are highlighted in bold. The model is able to detect the correct number of components for each distribution, as well as the correct number of components shared across the two mixtures. The density estimates, not reported here, are close to the true data-generating densities. The Bayes factor for testing equality between the distributions of $X^{(n)}_1$ and $X^{(n)}_2$, approximated through the Markov chain Monte Carlo output, supports the null hypothesis of distributional homogeneity.

In Scenario II, $X^{(n)}_1$ and $X^{(n)}_2$ are generated from two different two-component normal mixtures.

Table 1: Posterior distributions of the number of components in the first sample ($K_1$), in the second sample ($K_2$) and shared by the two samples ($K_0$), corresponding to the three scenarios. The posterior probabilities corresponding to the MAP estimates are displayed in bold.

Both densities have two components but only one in common, i.e. the normal distribution with mean 5, and the weight assigned to the shared component differs in the two cases. The density estimates are displayed in Figure 2. The spike corresponding to the common component (concentrated around 5) is estimated more accurately than the idiosyncratic components (around 0 and 10, respectively) of the two samples, nicely showcasing the borrowing of information across samples. Moreover, the posterior distributions of the number of components are reported in Table 1: the model correctly detects that each mixture has two components, with one of them shared, and the corresponding distributions are highly concentrated around the correct values. Finally, the Bayes factor BF for testing equality between the two distributions equals 0.00022, and the null hypothesis of distributional homogeneity is rejected.

Scenario III consists in generating the data from mixtures with the same components but differing in their weights.
Specifically, $X^{(n_1)}$ and $X^{(n_2)}$ are drawn from normal mixtures sharing the same components but with different mixture weights. The posterior distribution of the number of components is again reported in Table 1 and again the correct number is identified, although in this case the distributions exhibit a higher variability.

Figure 2: Estimated densities (blue) and true densities (red) for X in Panel (a) and Y in Panel (b).

The Bayes factor to test equality between the two distributions is 0.54, providing weak evidence in favor of the alternative hypothesis that the distributions differ.

Finally, we examine the well-known Iris dataset, which contains several measurements concerning three different species of Iris flower: setosa, versicolor and virginica. More specifically, we focus on the petal width of those species. The sample $X$ has size $n_1 = 90$, containing 50 observations of setosa and 40 of versicolor. The second sample $Y$ is of size $n_2 = 60$, with 10 observations of versicolor and 50 of virginica. Since the data are scattered across the whole observed range, we need to allow for large variances, and this is obtained through the specification of the hyperparameters $(s, S)$. The model neatly identifies that the two densities have two components each and that one of them is shared, as showcased by the posterior probabilities reported in Table 2. The Bayes factor to test equality of the two distributions is again approximated through the Markov chain Monte Carlo output.

Table 2: Posterior distributions of the number of components in the first sample, in the second sample and shared by the two samples, for the Iris data. The posterior probabilities corresponding to the MAP estimates are displayed in bold.

Figure 3: Estimated densities for X (red) and Y (blue).

We have also monitored the convergence of the algorithm that has been implemented. Though we provide details only for the Iris dataset here, we have conducted similar analyses for each of the illustrations with synthetic datasets in Section 5.1; notably, all the examples with simulated data displayed even better performance than the results reported below for the Iris data. Figure 4 depicts the partial autocorrelation functions for the sampled parameters $\sigma_1$ and $\sigma_2$. The partial autocorrelation function exhibits an exponential decay and, after the first lag, almost negligible peaks.

Figure 4: Plots of the partial autocorrelation functions for the parameters $\sigma_1$ (a) and $\sigma_2$ (b).

We have additionally monitored the two estimated densities near the peaks, which identify the mixtures' components. More precisely, Figure 5(a) displays the trace plots of the density referring to the first sample at the points 3 and 13, whereas Figure 5(b) shows the trace plots of the estimated density function of the second sample at the points 13 and 21.

Figure 5: (a): trace plots of the density referring to $X^{(n_1)}$ at the points 3 and 13; (b): trace plots of the density referring to $X^{(n_2)}$ at the points 13 and 21.

We have introduced and investigated a novel class of nonparametric priors featuring a latent nested structure. Our proposal allows flexible modeling of heterogeneous data and deals with problems of testing distributional homogeneity in two-sample problems. Even if our treatment has been confined to the case $d = 2$, we stress that the results may be formally extended to the case $d > 2$: nesting the structure $(\tilde p_1, \ldots, \tilde p_d)$ leads to considering all possible partitions of the $d$ random probability measures. While the model and framework studied here have been shown to be effective, both from a theoretical and a practical point of view, in the case $d = 2$, a more computationally oriented approach would be desirable for larger $d$. There are two possible paths. The first, along the lines of the original proposal of the nested Dirichlet process in Rodríguez et al. (2008), consists in using tractable stick-breaking representations of the underlying random probabilities, whenever available, to devise an efficient algorithm. The second, which requires an additional significant analytical step, consists in deriving a posterior characterization of $(\tilde p_1, \ldots, \tilde p_d)$ that allows sampling of the trajectories of latent nested processes and building up algorithms for which marginalization is not needed. Both will be the object of our future research.

Appendix

Proof of Proposition 1
Since $(\tilde p_1, \tilde p_2) \sim \mathrm{NP}(\nu_0, \nu)$, one has

$$\pi_1 = E \int_{P_X} \tilde q(\{p\})\, \tilde q(dp) = E \int_{P_X} \frac{\tilde\mu(\{p\})\,\tilde\mu(dp)}{\tilde\mu^2(P_X)} = \int_0^\infty u \int_{P_X} E\big[\, e^{-u\tilde\mu(P_X)}\, \tilde\mu(\{p\})\, \tilde\mu(dp) \,\big]\, du. \qquad (20)$$

In order to get the result, we extend and adapt the techniques used in James et al. (2006). Indeed, it can be seen that

$$E\big[\, e^{-u\tilde\mu(P_X)}\, \tilde\mu(\{p\})\,\tilde\mu(dp) \,\big] = e^{-c\psi(u)} \big[\, c\, Q(dp)\, \tau_2(u) + c^2\, Q(\{p\})\, Q(dp)\, \tau_1^2(u) \,\big]. \qquad (21)$$

Recall that $Q$ is the probability distribution of the NRMI $\tilde q_0 = \sum_{j\ge 1} \omega_j \delta_{\tilde\theta_j}$, with $\sum_{j\ge 1}\omega_j = 1$ and $\tilde\theta_j \overset{iid}{\sim} Q_0$. This means that $Q$ is concentrated on the set of discrete probability measures on $X$. If $p = \sum_{j\ge 1} w_j \delta_{\theta_j} \in P_X$ is fixed, we set $W_{\mathbf{j},n} := \{\omega_{j_1} = w_1, \ldots, \omega_{j_n} = w_n\}$ and $\Theta_{\mathbf{j},n} := \{\tilde\theta_{j_1} = \theta_1, \ldots, \tilde\theta_{j_n} = \theta_n\}$, where $\mathbf{j} = (j_1, \ldots, j_n)$ is a vector of positive integers. Then

$$Q(\{p\}) = P\big( \tilde q_0 = p \big) \le P\Big[ \bigcup\nolimits_{(*)} \big( W_{\mathbf{j},n} \cap \Theta_{\mathbf{j},n} \big) \Big] \qquad (22)$$

where the above union is taken over the set of all vectors $\mathbf{j} = (j_1, \ldots, j_n) \in \mathbb{N}^n$ such that $j_1 \ne \cdots \ne j_n$. The upper bound in (22) is clearly equal to 0. This, combined with (21), yields

$$\int_{P_X} E\big[\, e^{-u\tilde\mu(P_X)}\, \tilde\mu(\{p\})\,\tilde\mu(dp) \,\big] = c\, e^{-c\psi(u)}\, \tau_2(u) \int_{P_X} Q(dp) = c\, e^{-c\psi(u)}\, \tau_2(u)$$

and the proof is completed.

Proof of Proposition 2
Let $f_1 = \mathbb{1}_A$ and $f_2 = \mathbb{1}_B$, for some measurable subsets $A$ and $B$ of $P_X$. One has

$$E \int_{P_X^2} f_1(p_1)\, f_2(p_2)\, \tilde q(dp_1)\, \tilde q(dp_2) = E\big[ \tilde q(A)\, \tilde q(B) \big] = E\big[ \tilde q^2(A\cap B) \big] + E\big[ \tilde q(A\cap B)\, \tilde q(B\cap A^c) \big] + E\big[ \tilde q(A\cap B^c)\, \tilde q(B) \big].$$
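In the Dirichlet special case, where $\tilde q$ is a Dirichlet process with total mass $c$ and base $Q$, the mixed moment above reduces to the classical identity $E[\tilde q(A)\tilde q(B)] = \big(Q(A\cap B) + c\, Q(A)Q(B)\big)/(1+c)$, i.e. the form of (9) with $\pi_1 = 1/(1+c)$. A rough Monte Carlo check via truncated stick-breaking, with a uniform base measure as a purely illustrative choice (names and truncation levels are ours, not the paper's):

```python
import random

def dp_draw(c, trunc=250):
    """One truncated stick-breaking realisation of q ~ DP(c, Uniform(0,1)),
    returned as (weights, atoms); the leftover stick mass goes on the last
    atom so that the weights sum exactly to one."""
    w, rem = [], 1.0
    for _ in range(trunc - 1):
        b = random.betavariate(1.0, c)
        w.append(rem * b)
        rem *= 1.0 - b
    w.append(rem)
    return w, [random.random() for _ in range(trunc)]

def mass(w, atoms, lo, hi):
    """q(A) for the interval A = [lo, hi)."""
    return sum(wi for wi, a in zip(w, atoms) if lo <= a < hi)

random.seed(7)
c, reps, acc = 1.0, 4000, 0.0
for _ in range(reps):
    w, a = dp_draw(c)
    acc += mass(w, a, 0.0, 0.5) * mass(w, a, 0.3, 0.8)
estimate = acc / reps
# Theory: pi1*Q(A ∩ B) + (1 - pi1)*Q(A)*Q(B), with pi1 = 1/(1+c) = 0.5,
# Q(A ∩ B) = 0.2 and Q(A) = Q(B) = 0.5:
theory = 0.5 * 0.2 + 0.5 * 0.25  # = 0.225
```

The Monte Carlo average lands within simulation error of the theoretical value, consistently with (9).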
It can now be easily seen that

$$E\big[ \tilde q^2(A\cap B) \big] = E\, \frac{\tilde\mu^2(A\cap B)}{\tilde\mu^2(P_X)} = \int_0^\infty u\, e^{-c\psi(u)} \big[\, c\, Q(A\cap B)\, \tau_2(u) + c^2\, Q^2(A\cap B)\, \tau_1^2(u) \,\big]\, du = \pi_1\, Q(A\cap B) + (1-\pi_1)\, Q^2(A\cap B),$$

where we have used the identity $\int_0^\infty c^2 u\, e^{-c\psi(u)}\, \tau_1^2(u)\, du = 1 - \pi_1$, which follows from evaluating the same expression with $A\cap B = P_X$ and recalling that $\tilde q(P_X) = 1$. On the other hand, if $A\cap B = \varnothing$, we get

$$E\big[ \tilde q(A)\, \tilde q(B) \big] = Q(A)\, Q(B) \int_0^\infty c^2 u\, e^{-c\psi(u)}\, \tau_1^2(u)\, du = (1-\pi_1)\, Q(A)\, Q(B).$$

To sum up, one finds that

$$E\big[ \tilde q(A)\, \tilde q(B) \big] = \pi_1\, Q(A\cap B) + (1-\pi_1) \big[\, Q^2(A\cap B) + Q(A\cap B)\, Q(B\cap A^c) + Q(A\cap B^c)\, Q(B) \,\big] = \pi_1\, Q(A\cap B) + (1-\pi_1)\, Q(A)\, Q(B),$$

which boils down to (9). It is now easy to prove that (9) is true when $f_1$ and $f_2$ are simple functions and, then, for all positive and measurable functions, by relying on the monotone convergence theorem.

Proof of Theorem 1
The partition probability function $\Pi_k^{(N)}(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1, \mathbf{q}_2)$ equals

$$\int_{X^k} E\Big[ \prod_{j=1}^{k_1} \tilde p_1^{\,n_{j,1}}(dx^*_{j,1}) \prod_{j=1}^{k_2} \tilde p_2^{\,n_{j,2}}(dx^*_{j,2}) \prod_{j=1}^{k_0} \tilde p_1^{\,q_{j,1}}(dz^*_j)\, \tilde p_2^{\,q_{j,2}}(dz^*_j) \Big] \qquad (23)$$

obtained by marginalizing with respect to $(\tilde p_1, \tilde p_2)$. Due to the conditional independence of $\tilde p_1$ and $\tilde p_2$, given $\tilde q$, the integrand in (23) can be rewritten as $E \prod_{\ell=1}^2 h_\ell(d\mathbf{x}^*_\ell, d\mathbf{z}^*; \tilde q)$ where, for each $\ell = 1, 2$,

$$h_\ell(d\mathbf{x}^*_\ell, d\mathbf{z}^*; \tilde q) = E\Big[ \prod_{j=1}^{k_\ell} \tilde p_\ell^{\,n_{j,\ell}}(dx^*_{j,\ell}) \prod_{j=1}^{k_0} \tilde p_\ell^{\,q_{j,\ell}}(dz^*_j) \,\Big|\, \tilde q \Big] = \int_{P_X} \prod_{j=1}^{k_\ell} p_\ell^{\,n_{j,\ell}}(dx^*_{j,\ell}) \prod_{j=1}^{k_0} p_\ell^{\,q_{j,\ell}}(dz^*_j)\, \tilde q(dp_\ell).$$
A simple application of the Fubini–Tonelli theorem, then, yields

$$\Pi_k^{(N)}(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1, \mathbf{q}_2) = \int_{X^k} E \int_{P_X^2} f_1(p_1)\, f_2(p_2)\, \tilde q(dp_1)\, \tilde q(dp_2) \qquad (24)$$

where, for each $\ell = 1, 2$, we have set $f_\ell(p_\ell) := \prod_{j=1}^{k_\ell} p_\ell^{\,n_{j,\ell}}(dx^*_{j,\ell}) \prod_{j=1}^{k_0} p_\ell^{\,q_{j,\ell}}(dz^*_j)$ and agree that $\prod_{j=1}^{0} a_j \equiv 1$. In view of Proposition 2, the integrand in (24) boils down to

$$E \int_{P_X^2} f_1(p_1)\, f_2(p_2)\, \tilde q(dp_1)\, \tilde q(dp_2) = \pi_1 \int_{P_X} f_1(p)\, f_2(p)\, Q(dp) + (1-\pi_1) \prod_{\ell=1}^{2} \int_{P_X} f_\ell(p)\, Q(dp) = \pi_1\, E\big[ f_1(\tilde q_0)\, f_2(\tilde q_0) \big] + (1-\pi_1)\, E\big[ f_1(\tilde q_0) \big]\, E\big[ f_2(\tilde q_0) \big].$$

In order to complete the proof it is now enough to note that, due to the non-atomicity of $Q_0$,

$$E\big[ f_1(\tilde q_0)\, f_2(\tilde q_0) \big] = E\Big[ \prod_{j=1}^{k_1} \tilde q_0^{\,n_{j,1}}(dx^*_{j,1}) \prod_{j=1}^{k_2} \tilde q_0^{\,n_{j,2}}(dx^*_{j,2}) \prod_{j=1}^{k_0} \tilde q_0^{\,q_{j,1}+q_{j,2}}(dz^*_j) \Big]$$

is absolutely continuous with respect to $Q_0^k$ on $X^k$ and

$$\frac{d\, E\big[ f_1(\tilde q_0)\, f_2(\tilde q_0) \big]}{d\, Q_0^k}(\mathbf{x}^*_1, \mathbf{x}^*_2, \mathbf{z}^*) = \Phi_k^{(N)}(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2)$$

for any vector $(\mathbf{x}^*_1, \mathbf{x}^*_2, \mathbf{z}^*)$ whose $k$ components are all distinct, and is zero otherwise. As for the second summand above, from Proposition 3 in James et al. (2009) one deduces that

$$E\big[ f_1(\tilde q_0) \big]\, E\big[ f_2(\tilde q_0) \big] = \prod_{j=1}^{k_1} Q_0(dx^*_{j,1}) \prod_{j=1}^{k_2} Q_0(dx^*_{j,2}) \prod_{j=1}^{k_0} Q_0^2(dz^*_j)\; \Phi_{k_1+k_0}^{(|\mathbf{n}_1|+|\mathbf{q}_1|)}(\mathbf{n}_1, \mathbf{q}_1)\, \Phi_{k_2+k_0}^{(|\mathbf{n}_2|+|\mathbf{q}_2|)}(\mathbf{n}_2, \mathbf{q}_2).$$

Then it is apparent that $E[f_1(\tilde q_0)]\, E[f_2(\tilde q_0)] \ll Q_0^k$ and, still by virtue of the non-atomicity of $Q_0$, one has

$$\frac{d\big( E[f_1(\tilde q_0)]\, E[f_2(\tilde q_0)] \big)}{d\, Q_0^k}(\mathbf{x}^*_1, \mathbf{x}^*_2, \mathbf{z}^*) = \Phi_{k_1+k_0}^{(|\mathbf{n}_1|+|\mathbf{q}_1|)}(\mathbf{n}_1, \mathbf{q}_1)\, \Phi_{k_2+k_0}^{(|\mathbf{n}_2|+|\mathbf{q}_2|)}(\mathbf{n}_2, \mathbf{q}_2)\, \mathbb{1}_{\{0\}}(k_0)$$

for any vector $(\mathbf{x}^*_1, \mathbf{x}^*_2, \mathbf{z}^*) \in X^k$ whose components are all distinct, and is zero otherwise. Note that if it were $k_0 \ge 1$, then some of the infinitesimal factors $Q_0(dz^*_j)$ would appear squared, hence would not cancel, and the above density would be exactly equal to zero.

Proof of Proposition 4

Since $\tilde q \sim \mathrm{NRMI}[\nu; M_X]$, one has $\tilde q = \sum_{j\ge 1} \tilde\omega_j \delta_{\tilde\eta_j}$, with $P(\sum_{j\ge 1}\tilde\omega_j = 1) = 1$ and $\tilde\eta_j \overset{iid}{\sim} Q$. Furthermore, each $\tilde\eta_j$ is, in turn, a CRM $\tilde\eta_j = \sum_{k\ge 1} \tilde\omega^{(j)}_k \delta_{X^{(j)}_k}$, where $P(\sum_{k\ge 1}\tilde\omega^{(j)}_k < \infty) = 1$ and $X^{(j)}_k \overset{iid}{\sim} Q_0$, for any $j = 1, 2, \ldots$. An analogous representation holds true also for $\mu_S$, i.e. $\mu_S = \sum_{k\ge 1} \tilde\omega^{(0)}_k \delta_{X^{(0)}_k}$, with the same conditions as above. From the assumptions one deduces that the sequences $(X^{(j)}_k)_{k\ge 1}$ and $(\tilde\omega^{(j)}_k)_{k\ge 1}$ are independent also across different values of $j$, and Definition 1 entails, with probability 1,

$$P\big[ (\mu_1, \mu_2, \mu_S) \in A_1 \times A_2 \times A_3 \,\big|\, \tilde q \big] = \tilde q(A_1)\, \tilde q(A_2)\, \tilde q(A_3),$$

which implies

$$P(\tilde p_1 = \tilde p_2) = E\Big[ P\Big( \frac{\mu_1 + \mu_S}{\mu_1(X) + \mu_S(X)} = \frac{\mu_2 + \mu_S}{\mu_2(X) + \mu_S(X)} \,\Big|\, \tilde q \Big) \Big] = E\Big[ \sum_{i \ne j} \tilde\omega_j\, \tilde\omega_i\, P\Big( \frac{\tilde\eta_i + \mu_S}{\tilde\eta_i(X) + \mu_S(X)} = \frac{\tilde\eta_j + \mu_S}{\tilde\eta_j(X) + \mu_S(X)} \,\Big|\, \tilde q \Big) \Big] + E \sum_{i\ge 1} \tilde\omega_i^2.$$

For the second summand above one trivially has $E \sum_{i\ge 1} \tilde\omega_i^2 = P(\mu_1 = \mu_2)$. As for the first summand, a simple application of the Fubini–Tonelli theorem and the fact that $\tilde\omega_i\, \tilde\omega_j \le 1$, for any $i$ and $j$, yield the following upper bound:

$$E\Big[ \sum_{i\ne j} \tilde\omega_j\, \tilde\omega_i\, P(\,\cdot \mid \tilde q\,) \Big] = \sum_{i\ne j} E\Big[ \tilde\omega_j\, \tilde\omega_i\, P(\,\cdot \mid \tilde q\,) \Big] \le \sum_{i\ne j} P\Big( \frac{\tilde\eta_i + \mu_S}{\tilde\eta_i(X) + \mu_S(X)} = \frac{\tilde\eta_j + \mu_S}{\tilde\eta_j(X) + \mu_S(X)} \Big).$$

For fixed $i \ne j$ and $n \ge 1$, consider the $n$-tuple of atoms $(X^{(i)}_1, \ldots, X^{(i)}_n)$ referring to $\tilde\eta_i$ and correspondingly define the sets

$$\Theta^{(j)}_{\boldsymbol\ell} := \big\{ \omega \in \Omega : X^{(i)}_1(\omega) = X^{(j)}_{\ell_1}(\omega), \ldots, X^{(i)}_n(\omega) = X^{(j)}_{\ell_n}(\omega) \big\}, \qquad \Theta^{(0)}_{\boldsymbol\ell} := \big\{ \omega \in \Omega : X^{(i)}_1(\omega) = X^{(0)}_{\ell_1}(\omega), \ldots, X^{(i)}_n(\omega) = X^{(0)}_{\ell_n}(\omega) \big\}$$

for any $\boldsymbol\ell = (\ell_1, \ldots, \ell_n) \in \mathbb{N}^n$. It is then apparent that

$$P\Big( \frac{\tilde\eta_i + \mu_S}{\tilde\eta_i(X) + \mu_S(X)} = \frac{\tilde\eta_j + \mu_S}{\tilde\eta_j(X) + \mu_S(X)} \Big) \le P\Big[ \bigcup_{\boldsymbol\ell \in \mathbb{N}^n,\; \ell_h \ne \ell_{h'}} \big( \Theta^{(j)}_{\boldsymbol\ell} \cup \Theta^{(0)}_{\boldsymbol\ell} \big) \Big]$$

and this upper bound is equal to 0, because each of the events $\Theta^{(j)}_{\boldsymbol\ell}$ and $\Theta^{(0)}_{\boldsymbol\ell}$ in the above countable union has probability 0, in view of the non-atomicity of $Q_0$ and of independence.

Proof of Theorem 2
Consider the partition induced by the samples $X^{(n_1)}$ and $X^{(n_2)}$ into $k = k_1 + k_2 + k_0$ groups with frequencies $\mathbf{n}_\ell = (n_{1,\ell}, \ldots, n_{k_\ell,\ell})$, for $\ell = 1, 2$, and $\bar{\mathbf{q}} = (q_{1,1} + q_{1,2}, \ldots, q_{k_0,1} + q_{k_0,2})$. Recalling that $\tilde p_\ell = (\mu_\ell + \mu_S)/(\mu_\ell(X) + \mu_S(X))$, for $\ell = 1, 2$, the conditional likelihood is

$$\prod_{\ell=1}^{2} \prod_{j=1}^{k_\ell} \mu_\ell^{\,\zeta^*_{j,\ell} n_{j,\ell}}(dx^*_{j,\ell})\, \mu_S^{\,(1-\zeta^*_{j,\ell})\, n_{j,\ell}}(dx^*_{j,\ell}) \prod_{r=1}^{k_0} \mu_S^{\,(1-\zeta^*_{r,0})(q_{r,1}+q_{r,2})}(dz^*_r) \prod_{\ell=1}^{2} \mu_\ell^{\,\zeta^*_{r,0}\, q_{r,\ell}}(dz^*_r)$$

where we take $\{x^*_{j,\ell} : j = 1, \ldots, k_\ell\}$, for $\ell = 1, 2$, and $\{z^*_r : r = 1, \ldots, k_0\}$ as the $k_1 + k_2 + k_0$ distinct values in $X$, and the binary labels $\zeta^* \in \{0,1\}$ indicate whether a distinct value is an atom of the idiosyncratic components ($\zeta^* = 1$) or of the shared component $\mu_S$ ($\zeta^* = 0$). If we now let

$$f_0(\mu_S, u, v) := e^{-(u+v)\,\mu_S(X)} \prod_{r=1}^{k_0} \mu_S^{\,(1-\zeta^*_{r,0})(q_{r,1}+q_{r,2})}(dz^*_r) \prod_{\ell=1}^{2} \prod_{j=1}^{k_\ell} \mu_S^{\,(1-\zeta^*_{j,\ell})\, n_{j,\ell}}(dx^*_{j,\ell})$$

$$f_1(\mu_1, u, v) := e^{-u\,\mu_1(X)} \prod_{j=1}^{k_1} \mu_1^{\,\zeta^*_{j,1}\, n_{j,1}}(dx^*_{j,1}) \prod_{r=1}^{k_0} \mu_1^{\,\zeta^*_{r,0}\, q_{r,1}}(dz^*_r)$$

$$f_2(\mu_2, u, v) := e^{-v\,\mu_2(X)} \prod_{j=1}^{k_2} \mu_2^{\,\zeta^*_{j,2}\, n_{j,2}}(dx^*_{j,2}) \prod_{r=1}^{k_0} \mu_2^{\,\zeta^*_{r,0}\, q_{r,2}}(dz^*_r),$$

for $(\mu_S, \mu_1, \mu_2)$, the joint distribution of the random partition and of the corresponding labels $\boldsymbol\zeta^{**} = (\boldsymbol\zeta^*_0, \boldsymbol\zeta^*_1, \boldsymbol\zeta^*_2)$ is

$$\Pi_k^{(N)}(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1, \mathbf{q}_2; \boldsymbol\zeta^{**}) = \frac{1}{\Gamma(n_1)\,\Gamma(n_2)} \int_0^\infty\!\!\int_0^\infty u^{n_1-1}\, v^{n_2-1}\, E\Big( \prod_{i=0}^{2} f_i(\mu_i, u, v) \Big)\, du\, dv, \qquad (25)$$

where, for simplicity, we have set $\mu_0 = \mu_S$. Now, for any $(u, v) \in \mathbb{R}_+^2$, Proposition 3 implies

$$E \prod_{i=0}^{2} f_i(\mu_i, u, v) = E\, E\Big( \prod_{i=0}^{2} f_i(\mu_i, u, v) \,\Big|\, \tilde q \Big) = E \prod_{i=0}^{2} E\big( f_i(\mu_i, u, v) \,\big|\, \tilde q \big) = \big[ E f_0(\mu_S, u, v) \big]\; E \int_{M_X^2} f_1(m_1, u, v)\, f_2(m_2, u, v)\, \tilde q(dm_1)\, \tilde q(dm_2)$$
$$= \big[ E f_0(\mu_S, u, v) \big] \Big\{ \pi_1^*\, E\Big[ \prod_{i=1}^{2} f_i(\tilde\mu, u, v) \Big] + (1 - \pi_1^*) \prod_{i=1}^{2} E\big[ f_i(\tilde\mu, u, v) \big] \Big\} \qquad (26)$$

with $\tilde\mu$ denoting a random measure with distribution $Q$. Using the properties that characterise $\mu_S$, it is easy to show that $E f_0(\mu_S, u, v) \ll Q_0^{k - \bar k}$, where $\bar k = \sum_{j=1}^{k_1} \zeta^*_{j,1} + \sum_{j=1}^{k_2} \zeta^*_{j,2} + \sum_{r=1}^{k_0} \zeta^*_{r,0}$. Moreover,

$$\frac{d\big[ E f_0(\mu_S, u, v) \big]}{d\, Q_0^{k-\bar k}}(\mathbf{x}) = e^{-\gamma c\, \psi(u+v)}\, \gamma^{k-\bar k}\, c^{k-\bar k} \prod_{\ell=1}^{2} \prod_{j:\, \zeta^*_{j,\ell}=0} \tau_{n_{j,\ell}}(u+v) \prod_{r:\, \zeta^*_{r,0}=0} \tau_{q_{r,1}+q_{r,2}}(u+v) \qquad (27)$$

for any $\mathbf{x} \in X^{k-\bar k}$ with all distinct components, and it is zero otherwise.
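In the proof of Theorem 2, the sum over the latent labels collapses via the binomial theorem, since the integral depends on a label vector only through its number of ones: $\sum_{\bar k = 0}^{k} \binom{k}{\bar k}\, \gamma^{k - \bar k} = (1+\gamma)^k$. A one-line numerical confirmation of this collapse (variable names are ours):

```python
from math import comb

def label_sum(k, gamma):
    """Sum of gamma^(k - |zeta|) over all binary label vectors zeta in
    {0,1}^k, grouped by kbar = |zeta|:  sum_kbar C(k, kbar) gamma^(k - kbar)."""
    return sum(comb(k, kbar) * gamma ** (k - kbar) for kbar in range(k + 1))

print(label_sum(4, 0.5))  # 5.0625, which equals (1 + 0.5)**4
```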
If one notes that $\prod_{i=1}^{2} E f_i(\tilde\mu, u, v)$ vanishes whenever at least one of the $\zeta^*_{r,0}$'s is non-zero, the other terms in (26) can be similarly handled and, after having marginalised with respect to $(\boldsymbol\zeta^*_0, \boldsymbol\zeta^*_1, \boldsymbol\zeta^*_2)$, one has

$$\Pi_k^{(N)}(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1, \mathbf{q}_2) = \pi_1^* \sum_{(**)} I(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2; \boldsymbol\zeta^{**}) + (1 - \pi_1^*) \sum_{(*)} I(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2; \boldsymbol\zeta^{*})$$

where the first sum runs over all vectors $\boldsymbol\zeta^{**} = (\boldsymbol\zeta^*_0, \boldsymbol\zeta^*_1, \boldsymbol\zeta^*_2) \in \{0,1\}^k$ and the second sum is over all vectors $\boldsymbol\zeta^* = (\boldsymbol\zeta^*_1, \boldsymbol\zeta^*_2) \in \{0,1\}^{k-k_0}$. Moreover,

$$I(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2; \boldsymbol\zeta^{**}) = \frac{c^k\, \gamma^{k - \bar k}}{\Gamma(n_1)\, \Gamma(n_2)} \int_0^\infty\!\!\int_0^\infty u^{n_1-1}\, v^{n_2-1}\, e^{-(1+\gamma)\, c\, \psi(u+v)} \prod_{j=1}^{k_1} \tau_{n_{j,1}}(u+v) \prod_{j=1}^{k_2} \tau_{n_{j,2}}(u+v) \prod_{j=1}^{k_0} \tau_{q_{j,1}+q_{j,2}}(u+v)\, du\, dv.$$

One may further note that $I$ depends on $\boldsymbol\zeta^{**}$ only through $\bar k = |\boldsymbol\zeta^{**}|$, so that

$$\sum_{(**)} I(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2; \boldsymbol\zeta^{**}) = \sum_{\bar k = 0}^{k} \sum_{\{\boldsymbol\zeta^{**}:\, |\boldsymbol\zeta^{**}| = \bar k\}} \frac{c^k\, \gamma^{k-\bar k}}{\Gamma(n_1)\,\Gamma(n_2)} \int\!\!\int \cdots\, du\, dv = \sum_{\bar k = 0}^{k} \binom{k}{\bar k}\, \frac{c^k\, \gamma^{k-\bar k}}{\Gamma(n_1)\,\Gamma(n_2)} \int\!\!\int \cdots\, du\, dv$$
$$= \frac{c^k\, (1+\gamma)^k}{\Gamma(n_1)\,\Gamma(n_2)} \int_0^\infty\!\!\int_0^\infty u^{n_1-1}\, v^{n_2-1}\, e^{-(1+\gamma)\, c\, \psi(u+v)} \prod_{j=1}^{k_1} \tau_{n_{j,1}}(u+v) \prod_{j=1}^{k_2} \tau_{n_{j,2}}(u+v) \prod_{j=1}^{k_0} \tau_{q_{j,1}+q_{j,2}}(u+v)\, du\, dv$$

and a simple change of variable yields (18).

Details on Examples 1 and 2
As for the latent nested $\sigma$-stable process, the first term in the expression of the pEPPF (18) turns out to be the EPPF of a normalized $\sigma$-stable process multiplied by $\pi_1^* = 1 - \sigma$, namely

$$(1-\sigma)\; \frac{\sigma^{k-1}\, \Gamma(k)}{\Gamma(N)} \prod_{\ell=1}^{2} \prod_{j=1}^{k_\ell} (1-\sigma)_{n_{j,\ell}-1} \prod_{j=1}^{k_0} (1-\sigma)_{q_{j,1}+q_{j,2}-1}.$$

As for the second summand in (18), the term $I(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2; \boldsymbol\zeta^*)$ equals

$$\frac{\sigma^k\, \gamma^{k-\bar k}}{\Gamma(n_1)\,\Gamma(n_2)}\, \xi_\sigma(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2) \int_0^\infty\!\!\int_0^\infty u^{n_1-1}\, v^{n_2-1}\, \exp\{-\gamma (u+v)^\sigma - u^\sigma - v^\sigma\}\; (u+v)^{(k - \bar k_1 - \bar k_2)\sigma - (N - \bar n_1 - \bar n_2)}\; u^{\bar k_1 \sigma - \bar n_1}\; v^{\bar k_2 \sigma - \bar n_2}\, du\, dv,$$

where $\bar k_\ell = \sum_j \zeta^*_{j,\ell}$ and $\bar n_\ell = \sum_j \zeta^*_{j,\ell}\, n_{j,\ell}$ denote, respectively, the number and the total frequency of the distinct values in sample $\ell$ labelled as idiosyncratic. The change of variables $s = u + v$ and $w = u/(u+v)$, then, yields

$$I(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2; \boldsymbol\zeta^*) = \frac{\sigma^{k-1}\, \Gamma(k)\, \gamma^{k-\bar k}}{\Gamma(n_1)\,\Gamma(n_2)}\, \xi_\sigma(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2) \int_0^1 \frac{w^{n_1 - \bar n_1 + \bar k_1 \sigma - 1}\, (1-w)^{n_2 - \bar n_2 + \bar k_2 \sigma - 1}}{\big[ \gamma + w^\sigma + (1-w)^\sigma \big]^{k}}\, dw$$

and the obtained expression for $\Pi_k^{(N)}$ follows.

As far as the latent nested Dirichlet process is concerned, the first term in (18) coincides with the EPPF of a Dirichlet process with total mass $c(1+\gamma)$ multiplied by $\pi_1^* = (c+1)^{-1}$, i.e.

$$\frac{1}{1+c} \cdot \frac{[c(1+\gamma)]^k}{(c(1+\gamma))_N} \prod_{\ell=1}^{2} \prod_{j=1}^{k_\ell} \Gamma(n_{j,\ell}) \prod_{j=1}^{k_0} \Gamma(q_{j,1}+q_{j,2}).$$

On the other hand, it can be seen that $I(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2; \boldsymbol\zeta^*)$ equals

$$\frac{c^k\, \gamma^{k-\bar k}}{\Gamma(n_1)\,\Gamma(n_2)} \prod_{\ell=1}^{2} \prod_{j=1}^{k_\ell} \Gamma(n_{j,\ell}) \prod_{j=1}^{k_0} \Gamma(q_{j,1}+q_{j,2}) \int_0^\infty\!\!\int_0^\infty \frac{u^{n_1-1}\, v^{n_2-1}}{(1+u+v)^{\gamma c + N - \bar n_1 - \bar n_2}\, (1+u)^{\bar n_1 + c}\, (1+v)^{\bar n_2 + c}}\, du\, dv.$$

If ${}_pF_q(\alpha_1, \ldots, \alpha_p; \beta_1, \ldots, \beta_q; z)$ denotes the generalised hypergeometric series, which is defined as

$${}_pF_q(\alpha_1, \ldots, \alpha_p; \beta_1, \ldots, \beta_q; z) := \sum_{k=0}^{\infty} \frac{(\alpha_1)_k \cdots (\alpha_p)_k}{(\beta_1)_k \cdots (\beta_q)_k}\, \frac{z^k}{k!},$$

identity 3.197.1 in Gradshteyn & Ryzhik (2007), applied to the inner integral, leads to rewrite $I(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2; \boldsymbol\zeta^*)$ as

$$\frac{c^k\, \gamma^{k-\bar k}\, \Gamma((1+\gamma)c + n_1 - \bar n_1)}{\Gamma(n_1)\, \Gamma((1+\gamma)c + N - \bar n_1)}\, \xi(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2) \int_0^\infty \frac{u^{n_1-1}}{(1+u)^{c + \bar n_1}}\; {}_2F_1\big( \gamma c + N - \bar n_1 - \bar n_2,\; c(1+\gamma) + n_1 - \bar n_1;\; c(1+\gamma) + N - \bar n_1;\; -u \big)\, du,$$

where $\xi(\mathbf{n}_1, \mathbf{n}_2, \mathbf{q}_1 + \mathbf{q}_2)$ denotes the product $\prod_{\ell}\prod_j \Gamma(n_{j,\ell}) \prod_j \Gamma(q_{j,1}+q_{j,2})$ appearing above. In view of the formula ${}_2F_1(\alpha, \beta; \delta; z) = (1-z)^{-\alpha}\, {}_2F_1(\alpha, \delta - \beta; \delta; z/(z-1))$ and of the change of variable $t = u/(1+u)$, the integral above may be expressed as

$$\int_0^1 t^{n_1-1}\, (1-t)^{c(1+\gamma) + n_2 - \bar n_2 - 1}\; {}_2F_1\big( \gamma c + N - \bar n_1 - \bar n_2,\; n_2;\; c(1+\gamma) + N - \bar n_1;\; t \big)\, dt$$

and, finally, identity 7.512.5 in Gradshteyn & Ryzhik (2007) yields the displayed closed form of $\Pi_k^{(N)}$.

References
Bhattacharya, A. & Dunson, D. (2012), 'Nonparametric Bayes classification and hypothesis testing on manifolds', J. Multivariate Anal., 1–19.

Filippi, S. & Holmes, C. C. (2017), 'A Bayesian nonparametric approach for quantifying dependence between random variables', Bayesian Analysis (4), 919–938.

Gradshteyn, I. S. & Ryzhik, I. M. (2007), Table of Integrals, Series, and Products, 7th edn, Academic Press.

Hjort, N. L. (2000), Bayesian analysis for a generalized Dirichlet process prior, Technical report, University of Oslo.

Holmes, C., Caron, F., Griffin, J. E. & Stephens, D. A. (2015), 'Two-sample Bayesian nonparametric hypothesis testing', Bayesian Analysis (2), 297–320.

James, L. F., Lijoi, A. & Prünster, I. (2006), 'Conjugacy as a distinctive feature of the Dirichlet process', Scandinavian Journal of Statistics (1), 105–120.

James, L. F., Lijoi, A. & Prünster, I. (2009), 'Posterior analysis for normalized random measures with independent increments', Scandinavian Journal of Statistics (1), 76–97.

Kingman, J. F. C. (1993), Poisson Processes, Oxford University Press.

Lijoi, A., Nipoti, B. & Prünster, I. (2014), 'Bayesian inference with dependent normalized completely random measures', Bernoulli (3), 1260–1291.

Ma, L. & Wong, W. H. (2011), 'Coupling optional Pólya trees and the two sample problem', J. Amer. Statist. Assoc. (496), 1553–1565.

MacEachern, S. N. (1994), 'Estimating normal means with a conjugate style Dirichlet process prior', Comm. Statist. Simulation Comput. (3), 727–741.

Müller, P., Quintana, F. & Rosner, G. (2004), 'A method for combining inference across related nonparametric Bayesian models', J. R. Stat. Soc. Ser. B Stat. Methodol. (3), 735–749.

Regazzini, E., Lijoi, A. & Prünster, I. (2003), 'Distributional results for means of normalized random measures with independent increments', Ann. Statist., 560–585.

Rodríguez, A. & Dunson, D. B. (2014), 'Functional clustering in nested designs: modeling variability in reproductive epidemiology studies', Ann. Appl. Stat. (3), 1416–1442.

Rodríguez, A., Dunson, D. B. & Gelfand, A. E. (2008), 'The nested Dirichlet process', J. Amer. Statist. Assoc. (483), 1131–1144.

Soriano, J. & Ma, L. (2017), 'Probabilistic multi-resolution scanning for two-sample differences', J. R. Stat. Soc. Ser. B Stat. Methodol. (2), 547–572.

West, M., Müller, P. & Escobar, M. D. (1994), Hierarchical priors and mixture models, with application in regression and density estimation, in 'Aspects of Uncertainty: A Tribute to D. V. Lindley', Wiley.