Data augmentation for models based on rejection sampling
Vinayak Rao∗†, Lizhen Lin‡, David Dunson§

August 4, 2015
Abstract
We present a data augmentation scheme to perform Markov chain Monte Carlo inference for models where data generation involves a rejection sampling algorithm. Our idea, which seems to be missing in the literature, is a simple scheme to instantiate the rejected proposals preceding each data point. The resulting joint probability over observed and rejected variables can be much simpler than the marginal distribution over the observed variables, which often involves intractable integrals. We consider three problems, the first being the modeling of flow-cytometry measurements subject to truncation. The second is a Bayesian analysis of the matrix Langevin distribution on the Stiefel manifold, and the third, Bayesian inference for a nonparametric Gaussian process density model. The latter two are instances of problems where Markov chain Monte Carlo inference is doubly intractable. Our experiments demonstrate superior performance over state-of-the-art sampling algorithms for such problems.
Keywords: Bayesian inference; Density estimation; Doubly intractable; Gaussian process; Matrix Langevin; Markov chain Monte Carlo; Rejection sampling; Stiefel manifold; Truncation.
∗ [email protected]
† Department of Statistics, Purdue University, USA
‡ Department of Statistics and Data Science, University of Texas at Austin, USA
§ Department of Statistical Science, Duke University, USA

Rejection sampling allows sampling from a probability density p(x) by constructing an upper bound to p(x), and accepting or rejecting samples from a density proportional to the bounding envelope. The envelope is usually much simpler than p(x), with the number of rejections determined by how closely it matches the true density.

In typical applications, the probability density of interest is indexed by a parameter θ, and we write it as p(x | θ). A Bayesian analysis places a prior on θ, and given observations from the likelihood p(x | θ), studies the posterior over θ. An intractable likelihood, often with a normalization constant depending on θ, precludes straightforward Markov chain Monte Carlo inference over θ: calculating a Metropolis-Hastings acceptance ratio involves evaluating the ratio of two such likelihoods, and is itself intractable. This class of problems is called 'doubly intractable' (Murray et al., 2006), and existing approaches require the ability to draw exact samples from p(x | θ), or to obtain positive unbiased estimates of p(x | θ).

We describe a different approach that is applicable when p(x | θ) has an associated rejection sampling algorithm. Our idea is to instantiate the rejected proposals preceding each observation, resulting in an augmented state-space on which we run a Markov chain. Including the rejected proposals can eliminate any intractable terms, and allow the application of standard techniques (Adams et al., 2009).
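To make the idea concrete, here is a minimal sketch of a rejection sampler that also returns the rejected proposals preceding each accepted draw; the target, proposal and envelope constant in the example are placeholders for f(x, θ), q(· | θ) and M, not quantities from the paper:

```python
import random

def rejection_sample_with_rejects(f, q_sample, q_density, M, rng):
    """Draw one sample from p(x) proportional to f(x) by rejection sampling,
    returning both the accepted draw and the rejected proposals preceding it.

    Assumes the envelope condition q_density(y) >= f(y) / M for all y."""
    rejected = []
    while True:
        y = q_sample(rng)
        # Accept y with probability f(y) / (M * q(y)).
        if rng.random() < f(y) / (M * q_density(y)):
            return y, rejected
        rejected.append(y)
```

The rejected proposals usually get thrown away; here they are kept, which is exactly the augmentation the paper builds on.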
Importantly, we show that conditioned on the observations, it is straightforward to independently sample the number and values of the rejected proposals: this just requires running the rejection sampler to generate as many acceptances as there are observations, with all rejected proposals kept. The ability to produce a conditionally independent draw of these variables is important when posterior updates of some parameters are intractable, while others are simple. In such a situation, we introduce the rejected variables only when we need to carry out the intractable updates, after which we discard them and carry out the simpler updates.

A particular application of our algorithm is parameter inference for probability distributions truncated to sets like the positive orthant, the simplex, or the unit sphere. Such distributions correspond to sampling proposals from the untruncated distribution and rejecting those outside the domain of interest. We consider an application from flow cytometry where this representation is the actual data collection process. Truncated distributions also arise in diverse applications like measured time-to-infection (Goethals et al., 2009), where times larger than a year are truncated, mortality data (Alai et al., 2013), annuity valuation for truncated lifetimes (Alai et al., 2013), and stock price changes (Aban et al., 2006). One approach for such problems was proposed in Liechty et al. (2009), though their algorithm samples from an approximation to the posterior distribution of interest. Our algorithm provides a simple and general way to apply the machinery of Bayesian inference to such problems.

Consider a probability density p(x | θ) = f(x, θ)/Z(θ) on some space X, with the parameter θ taking values in Θ. We assume that the normalization constant Z(θ) is difficult to evaluate, so that naïvely sampling from p(x | θ) is not easy.
We also assume there exists a second, simpler density q(x | θ) with q(x | θ) ≥ f(x, θ)/M for all x, and for some positive M.

Rejection sampling generates samples distributed as p(· | θ) by first proposing samples from q(· | θ). A draw y from q(· | θ) is accepted with probability f(y, θ)/{M q(y | θ)}. Let there be r rejected proposals preceding an accepted sample x, and denote them by Y = {y_1, ..., y_r}, where r itself is a random variable. Write |Y| = r, so that the joint probability is

$$p(\mathcal{Y}, x) = \left[\,\prod_{i=1}^{|\mathcal{Y}|} q(y_i \mid \theta)\left\{1 - \frac{f(y_i, \theta)}{M q(y_i \mid \theta)}\right\}\right] q(x \mid \theta)\,\frac{f(x, \theta)}{M q(x \mid \theta)} = \frac{f(x, \theta)}{M}\prod_{i=1}^{|\mathcal{Y}|}\left\{q(y_i \mid \theta) - \frac{f(y_i, \theta)}{M}\right\}. \quad (1)$$

It is well known that this procedure recovers samples from p(x | θ), so that the expression above has the correct marginal distribution over x (Robert and Casella, 2005). Later, we will need to sample the rejected variables Y given an observation x drawn from p(· | θ). Simulating from p(Y | x, θ) involves the two steps in Algorithm 1, which relies on Proposition 1 about p(Y | x, θ), proved in the appendix.

Algorithm 1: Algorithm to sample from p(Y | x, θ)
Input: A sample x, and the parameter value θ
Output: The set of rejected proposals Y preceding x
1. Draw samples y_i independently from q(· | θ) until a point x̂ is accepted.
2. Discard x̂, and treat the preceding rejected proposals as Y.

Proposition 1.
The set of rejected samples Y preceding an accepted sample x is independent of x: p(Y | θ, x) = p(Y | θ). We can thus assign to x the set Ŷ of rejected proposals preceding another sample, x̂.

Given observations X = {x_1, ..., x_n} and a prior p(θ), Bayesian inference typically uses Markov chain Monte Carlo to sample from the intractable posterior p(θ | X). Split θ as (θ_1, θ_2) so that the normalization constant factors as Z(θ) = Z_1(θ_1) Z_2(θ_2), with the first term Z_1 simple to evaluate, and Z_2 intractable. Updating θ_1 with θ_2 fixed is easy, and there are situations where we can place a conjugate prior on θ_1. Inference over θ_2 is a doubly-intractable problem.

We assume that p(x | θ) has an associated rejection sampling algorithm with proposal density q(x | θ) ≥ f(x, θ)/M. For the i-th observation x_i, write the preceding set of rejected samples as Y_i = {y_{i1}, ..., y_{i|Y_i|}}. The joint density of all samples, both rejected and accepted, is then

$$p(x_1, \mathcal{Y}_1, \ldots, x_n, \mathcal{Y}_n) = \prod_{i=1}^{n} \frac{f(x_i, \theta)}{M} \prod_{j=1}^{|\mathcal{Y}_i|} \left\{ q(y_{ij} \mid \theta) - \frac{f(y_{ij}, \theta)}{M} \right\}.$$

This expression involves no intractable normalization constants in θ. To introduce the rejected proposals Y_i, we simply follow Algorithm 1: draw proposals from q(· | θ) until we have n acceptances, with the i-th batch of rejected proposals forming the set Y_i.

The ability to produce conditionally independent draws of Y is important when, for instance, there exists a conjugate prior p(θ_1) on θ_1 for the likelihood p(x | θ_1, θ_2). Introducing the rejected proposals Y_i breaks this conjugacy, and the resulting complications in updating θ_1 can slow down mixing, especially when θ_1 is high dimensional. A much cleaner solution is to sample θ_1 from its conditional posterior p(θ_1 | X, θ_2), introducing the auxiliary variables only when needed to update θ_2. After updating θ_2, they can then be discarded. Algorithm 2 describes this.

Algorithm 2:
An iteration of the Markov chain for posterior inference over θ = (θ_1, θ_2)
Input: The observations X, and the current parameter values (θ_1, θ_2)
Output: New parameter values (θ̃_1, θ̃_2)
1. Run Algorithm 1 |X| times, keeping all the rejected proposals Y = ∪_{i=1}^{|X|} Y_i.
2. Update θ_2 to θ̃_2 with a Markov kernel having p(θ_2 | X, Y, θ_1) as stationary distribution.
3. Discard the rejected proposals Y.
4. Sample a new value θ̃_1 of θ_1 from its posterior p(θ_1 | X, θ̃_2).

One of the simplest and most widely applicable Markov chain Monte Carlo algorithms for doubly-intractable distributions is the exchange sampler of Murray et al. (2006). Simplifying an earlier idea by Møller et al. (2006), this algorithm effectively amounts to the following: given the current parameter θ_curr, propose a new parameter θ_new according to some proposal distribution. Additionally, generate a dataset of n 'pseudo-observations' {x̂_i} from p(x | θ_new). The exchange algorithm then proposes swapping the parameters associated with the two datasets. Murray et al. (2006) show that all intractable terms cancel out in the resulting acceptance probability, and that the resulting Markov chain has the correct stationary distribution.

While the exchange algorithm is applicable whenever one can sample from the likelihood p(x | θ), it does not exploit the mechanism used to produce these samples. When the latter is a rejection sampling algorithm, each pseudo-observation is preceded by a sequence of rejected proposals. These are all discarded, and only the accepted proposals are used to evaluate the new parameter θ_new. By contrast, our algorithm explicitly instantiates these rejected proposals, so that they can be used to make good proposals. In our experiments, we use a Hamiltonian Monte Carlo sampler on the augmented space and exploit gradient information to make non-local moves with high probability of acceptance.
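An iteration of this augmentation scheme can be wired up generically; in the sketch below, `draw_rejects`, `update_theta2` and `sample_theta1` are hypothetical placeholders for the problem-specific pieces of Algorithm 2, not names from the paper:

```python
import random

def algorithm2_iteration(X, theta1, theta2,
                         draw_rejects, update_theta2, sample_theta1, rng):
    """One iteration of the data-augmentation Markov chain (Algorithm 2)."""
    # Step 1: instantiate rejected proposals, one batch per observation.
    # By Proposition 1 these do not depend on the observed values, so we
    # simply run the rejection sampler once per observation.
    Y = [draw_rejects(theta1, theta2, rng) for _ in X]
    # Step 2: update theta2 with a kernel targeting p(theta2 | X, Y, theta1);
    # on the augmented space this density has no intractable terms.
    theta2 = update_theta2(X, Y, theta1, theta2, rng)
    # Step 3: discard Y, and draw theta1 from its conditional posterior
    # p(theta1 | X, theta2), e.g. via a conjugate update.
    theta1 = sample_theta1(X, theta2, rng)
    return theta1, theta2

# Trivial placeholder wiring, just to show the control flow:
rng = random.Random(1)
theta1, theta2 = algorithm2_iteration(
    [1.0, 2.0], 0.0, 0.0,
    lambda t1, t2, r: [r.random()],          # dummy reject draw
    lambda X, Y, t1, t2, r: t2 + 1.0,        # dummy theta2 kernel
    lambda X, t2, r: 2.0 * t2,               # dummy theta1 conditional
    rng)
```

Discarding Y before step 3 is what preserves any conjugacy available for θ_1.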
For reasonable acceptance probabilities under the exchange sampler, one must make local updates to θ, or resort to complicated annealing schemes.

Another framework for doubly intractable distributions is the pseudo-marginal approach of Andrieu and Roberts (2009). The idea here is that even if we cannot exactly evaluate the acceptance probability, it is sufficient to use a positive, unbiased estimate: this will still result in a Markov chain with the correct stationary distribution. In our case, instead of requiring an unbiased estimate, we require an upper bound M to the normalization constant Z(θ). Additionally, like the exchange sampler, the pseudo-marginal method provides a mechanism to evaluate a proposed parameter θ_new; how to make good proposals is less obvious. Other related papers are Beskos and Roberts (2005) and Walker (2011), the latter requiring a bound on the target density of interest.

Most closely related to our ideas is a sampler from Adams et al. (2009); see also Section 7. Their problem also involved inference on the parameters governing the output of a rejection sampling algorithm. Like us, they proceeded by augmenting the state space to include the rejected proposals Y, and like us, given these auxiliary variables, they used Hamiltonian Monte Carlo to efficiently update parameters. However, rather than generating independent realizations of Y when needed, Adams et al. (2009) outlined a set of Markov transition operators to perturb the current configuration of Y, while maintaining the correct stationary distribution. With prespecified probabilities, they proposed adding a new variable to Y, deleting a variable from Y, and perturbing the value of an existing element of Y. These local updates to Y can slow down Markov chain mixing, require the user to specify a number of parameters, and also involve calculating Metropolis-Hastings acceptance probabilities for each local step.
Furthermore, the Markov nature of their updates requires them to maintain the rejected proposals at all times, complicating inferences over other parameters. Our algorithm is much simpler and cleaner.

Write the Markov transition density of our chain as k(θ̂ | θ), and the m-fold transition density as k^m(θ̂ | θ). For simplicity, we suppress that these depend on X. The Markov chain is uniformly ergodic if constants C and ρ < 1 exist such that for all m and θ,

$$\int_\Theta \left| p(\hat{\theta} \mid X) - k^m(\hat{\theta} \mid \theta) \right| d\hat{\theta} \le C \rho^m.$$

The term on the left is twice the total variation distance between the desired posterior and the state of the Markov chain initialized at θ after m iterations. Small values of ρ imply faster mixing. The following minorization condition is sufficient for uniform ergodicity (Jones and Hobert, 2001): there exists a probability density h(θ̂) and a δ > 0 such that for all θ, θ̂ ∈ Θ,

$$k(\hat{\theta} \mid \theta) \ge \delta\, h(\hat{\theta}). \quad (2)$$

When this holds, the mixing rate satisfies ρ ≤ 1 − δ, so that a large δ implies rapid mixing.

Our Markov transition density first introduces the rejected proposals Y, and then conditionally updates θ. The set Y_i preceding the i-th observation takes values in the union space U ≡ ∪_{r=0}^∞ X^r. The output of the rejection sampler, including the i-th observation, lies in the product space U × X with density given by equation (1), so that any (Y, x) ∈ (U × X) has probability

$$p(\mathcal{Y}, x \mid \theta) = \frac{f(x, \theta)}{M}\, \lambda(dx) \prod_{i=1}^{|\mathcal{Y}|} \left\{ q(y_i \mid \theta) - \frac{f(y_i, \theta)}{M} \right\} \lambda(dy_i).$$

Here, λ is the measure with respect to which the densities f and q are defined, and it is easy to see that the above quantity integrates to 1. From Bayes' rule, the conditional density over Y is

$$p(\mathcal{Y} \mid x, \theta) = \frac{Z(\theta)}{M} \prod_{i=1}^{|\mathcal{Y}|} \left\{ q(y_i \mid \theta) - \frac{f(y_i, \theta)}{M} \right\} \lambda(dy_i).$$

The fact that the right-hand side does not depend on x is an alternate proof of Proposition 1.
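As a sanity check on equation (1), on a small discrete space one can sum the augmented joint over all reject sequences and recover the target f(x)/Z: the sum over length-r sequences collapses to {1 − Z/M}^r. A toy numerical check (all numbers illustrative, not from the paper):

```python
# Toy discrete space with unnormalized target f and uniform proposal q.
f = {0: 0.5, 1: 1.0, 2: 0.25}
q = {0: 1.0 / 3, 1: 1.0 / 3, 2: 1.0 / 3}
M = 3.5
assert all(q[y] >= f[y] / M for y in f)      # envelope condition

Z = sum(f.values())                          # normalization constant, 1.75
# Summing prod_i {q(y_i) - f(y_i)/M} over all length-r reject sequences
# gives s**r, where s is the total per-proposal rejection probability.
s = sum(q[y] - f[y] / M for y in q)          # equals 1 - Z/M
for x in f:
    marginal = sum((f[x] / M) * s ** r for r in range(200))
    assert abs(marginal - f[x] / Z) < 1e-9   # recovers p(x) = f(x)/Z
```

The geometric series M/Z appearing here is also the expected number of proposals per acceptance.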
This density characterizes the data augmentation step of our sampling algorithm. In practice, we need as many draws from this density as there are observations.

The next step involves updating θ given (Y, X, θ), and depends on the problem at hand. We simplify matters by assuming we can sample from p(θ | Y, X) independently of the old θ: this is the classical data augmentation algorithm. We also assume that the functions f(·, θ) and q(· | θ) are uniformly bounded from above and below by finite, positive quantities (B_f, b_f) and (B_q, b_q) respectively, and that ∫_X λ(dx) < ∞. It follows that there exist positive numbers r and R given by the minima of 1 − f(x, θ)/{M Z(θ)} and Z(θ)/M respectively. We can now state our result.

Theorem 2.
Assume that ∫_X λ(dx) < ∞ and that positive bounds b_f, B_f, b_q, B_q exist, with r and R as defined earlier. Further assume we can sample from the conditional p(θ | Y, X). Then our data augmentation algorithm is uniformly ergodic, with mixing rate ρ upper bounded by

$$\rho = 1 - \left\{ \frac{b_f}{B_f (\beta + R^{-1})} \right\}^n,$$

where β = b_q r / B_q and n is the number of observations.

Despite our assumptions, our theorem has a number of useful implications. The ratio b_f/B_f is a measure of how 'flat' the function f is, and the closer it is to one, the more efficient rejection sampling for f can be. From our result, the smaller this ratio, the larger the bound on ρ, suggesting slower mixing. This is intuitive: more rejected proposals Y will increase coupling between successive θ's in the Markov chain. On the other hand, a small b_q/B_q suggests a proposal distribution tailored to f, and our result shows this implies faster mixing. The numbers r and 1/R are measures of mismatch between the target and proposal density, with small values giving better mixing. Finally, more observations n result in slower mixing. Again, from the construction of our chain this makes sense. We suspect this last property holds for most exact samplers for doubly-intractable distributions, though we are unaware of any such result.

Even without assuming we can sample from p(θ | Y, X), our ability to sample Y independently means that the marginal chain over θ is still Markovian. By contrast, existing approaches (Adams et al. (2009); Walker (2011)) only produce dependent updates in the complicated auxiliary space: they sample from p(Ŷ | θ, Y, X) by making local updates to Y. Consequently, these chains are Markovian only in the complicated augmented space, and the marginal processes over θ have long-term dependencies.
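To get a feel for the bound, it can be transcribed and evaluated for illustrative constants; the formula below is our reading of Theorem 2's rate, ρ = 1 − {b_f/(B_f(β + 1/R))}^n with β = b_q r/B_q, and all numbers are made up for illustration:

```python
def mixing_bound(n, b_f, B_f, b_q, B_q, r, R):
    """Upper bound on the mixing rate, as read from Theorem 2:
    rho = 1 - {b_f / (B_f * (beta + 1/R))}**n, with beta = b_q * r / B_q."""
    beta = b_q * r / B_q
    return 1.0 - (b_f / (B_f * (beta + 1.0 / R))) ** n

# The bound degrades geometrically in the number of observations n, and
# improves as f flattens (b_f/B_f -> 1) or the proposal tightens (small r).
rhos = [mixing_bound(n, b_f=0.5, B_f=1.0, b_q=0.8, B_q=1.2, r=0.5, R=2.0)
        for n in (1, 5, 10)]
```

With these constants the per-observation factor is 0.6, so the bound rises toward one as n grows, matching the discussion above.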
Besides affecting mixing, this can also complicate analysis.

In the following sections, we apply our sampling algorithm to three problems: the first involving a Bayesian analysis of flow-cytometry data, the second, Bayesian inference for the matrix Langevin distribution, and the last, the Gaussian process density sampler of Adams et al. (2009).

Figure 1: Scatterplots of the first two dimensions (CD4 and CD8b) for the control (left) and positive (right) groups. Contours represent log posterior-mean densities under a Dirichlet process mixture.
We apply our algorithm to a dataset of flow cytometry measurements from patients subjected to bone-marrow transplant (Brinkman et al., 2007). This graft-versus-host disease dataset consists of 6,809 control and 9,083 positive observations, depending on whether donor immune cells attack host cells. Each observation consists of four biomarker measurements truncated to lie between 0 and 1, though more complicated truncation rules are often used according to operator judgement (Lee and Scott, 2012). We normalize and plot the first two dimensions, corresponding to the markers CD4 and CD8b, in Figure 1. Truncation complicates the clustering of observations into homogeneous groups, an important step in the flow-cytometry pipeline called gating. Consequently, Lee and Scott (2012) propose an EM algorithm for a truncated mixture of Gaussians, which must be adapted if different mixture components or truncation rules are used.

We model the untruncated distribution for each group as a Dirichlet process mixture of Gaussians (Lo, 1984), with points outside the four-dimensional unit hypercube discarded to form the normalized dataset. The Dirichlet process mixture model is a flexible nonparametric prior over densities, parametrized by a concentration parameter α and a base probability measure. We set α = 1, and for the base measure, which gives the distribution over cluster parameters, we use a normal-inverse-Wishart distribution. Given the rejected variables, we can use standard techniques to update a representation of the Dirichlet process. We follow the blocked sampler of Ishwaran and James (2001) based on the stick-breaking representation of the Dirichlet process, using a finite truncation of the number of clusters. This corresponds to updating θ_2, step 2 in Algorithm 2.
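For such a truncated model, redrawing the rejected proposals (step 1 of Algorithm 2) just means sampling the current untruncated mixture and keeping the out-of-domain draws. A one-dimensional sketch, with an illustrative two-component mixture standing in for the four-dimensional model above:

```python
import random

def draw_with_rejects(n, sample_untruncated, in_domain, rng):
    """Draw n observations from the truncated model by sampling the
    untruncated mixture; points falling outside the domain are exactly
    the augmented (rejected) variables of Algorithm 1."""
    accepted, rejected = [], []
    while len(accepted) < n:
        x = sample_untruncated(rng)
        (accepted if in_domain(x) else rejected).append(x)
    return accepted, rejected

# Illustrative 1-d mixture truncated to [0, 1] (weights/means are made up):
rng = random.Random(3)
mix = lambda r: r.gauss(0.2, 0.3) if r.random() < 0.5 else r.gauss(0.9, 0.2)
inside = lambda x: 0.0 <= x <= 1.0
obs, rej = draw_with_rejects(500, mix, inside, rng)
```

Given `obs` and `rej` together, the mixture parameters can be updated exactly as for untruncated data, with no intractable normalization terms.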
Having done this, we discard the old rejected samples, and produce a new set by drawing from the resulting finite mixture of Gaussians, corresponding to step 1 in Algorithm 2.

Figure 1 shows the log mean posterior densities for the first two dimensions from 10,000 iterations. While the control group has three clear modes, these are much less pronounced in the positive group. Directly modeling the observations with a Gaussian mixture model obscured this by forcing modes away from the edges. One can use components with bounded support in the mixture model, such as a Dirichlet process mixture of Betas; however, these do not reflect the underlying data generation process, and are unsuitable when different groups have different truncation levels. By contrast, it is easy to extend our modeling ideas to allow groups to share components (Teh et al., 2006), thereby allowing better identification of disease predictors.

Our sampler took less than two minutes to run 1,000 iterations, not much longer than a typical Dirichlet process sampler for datasets of this size. The average number of augmented points was 3,960 and 4,608 for the two groups. We study our sampler more systematically in the next section, but this application demonstrates the flexibility and simplicity of our main idea.

The Stiefel manifold V_{p,d} is the space of all d × p orthonormal matrices, that is, d × p matrices X such that X^T X = I_p, where I_p is the p × p identity matrix. When p = 1, this is the (d−1)-hypersphere S^{d−1}, and when p = d, it is the space of all orthonormal matrices O(d). Probability distributions on the Stiefel manifold play an important role in statistics, signal processing and machine learning, with applications ranging from studies of orientations of orbits of comets and asteroids, to principal components analysis, to the estimation of rotation matrices.
The simplest such distribution is the matrix Langevin distribution, an exponential-family distribution whose density with respect to the invariant Haar volume measure (Edelman et al., 1998) is p_ML(X | F) = etr(F^T X)/Z(F). Here etr is the exponential-trace, etr(A) = exp{trace(A)}, and F is a d × p matrix. The normalization constant Z(F) = ₀F₁(d/2; F^T F/4) is a hypergeometric function with matrix argument, evaluated at F^T F (Chikuse, 2003). Let F = G κ H^T be the singular value decomposition of F, where G and H are d × p and p × p orthonormal matrices, and κ is a positive diagonal matrix. We parametrize p_ML by (G, κ, H), and one can think of G and H as orientations, with κ controlling the concentration in directions determined by these orientations. Large values of κ imply concentration along the associated directions, while setting κ to zero gives the uniform distribution on the Stiefel manifold. It can be shown (Khatri and Mardia, 1977) that ₀F₁(d/2; F^T F/4) = ₀F₁(d/2; κ^T κ/4), so that the normalization constant depends only on κ. We write it as Z(κ).

In our Bayesian analysis, we place independent priors on κ, G and H. The latter two lie on the Stiefel manifolds V_{p,d} and V_{p,p}, and we place matrix Langevin priors p_ML(· | F_G) and p_ML(· | F_H) on these: we will see below that these are conditionally conjugate. We place independent Gamma(a, b) priors on the diagonal elements of κ. However, the difficulty of evaluating the normalization constant Z(κ) makes posterior inference over κ doubly intractable. Thus, in a 2006 University of Iowa PhD thesis, Camano-Garcia keeps κ constant, while Hoff (2009a) uses a first-order Taylor expansion of the intractable term to run an approximate sampling algorithm. Below, we show how fully Bayesian inference can be carried out over this quantity as well.

We first describe a rejection sampling algorithm from Hoff (2009b) to sample from p_ML. For simplicity, assume H is the identity matrix.
In the general case, we simply rotate the resulting draw by H: if X ∼ p_ML(· | F), then XH^T ∼ p_ML(· | FH^T). At a high level, the algorithm sequentially proposes vectors from the matrix Langevin on the unit sphere: this is also called the von Mises-Fisher distribution, and is easy to simulate (Wood, 1994). The mean of the r-th vector is column r of G, G_{[:r]}, projected onto the nullspace of the earlier vectors, N_r. This sampled vector is then projected back onto N_r and normalized, and the process is repeated p times. Call the resulting distribution p_seq; for more details, see Algorithm 3 and Hoff (2009b).

Algorithm 3: Proposal p_seq(· | G, κ) for the matrix Langevin distribution (Hoff, 2009b)
Input: Parameters G, κ; write G_{[:i]} for column i of G, and κ_i for element (i, i) of κ
Output: An output X ∈ V_{p,d}; write X_{[:i]} for column i of X
1. Sample X_{[:1]} ∼ p_ML(· | κ_1 G_{[:1]}).
2. For r ∈ {2, ..., p}:
 (a) Construct N_r, an orthogonal basis for the nullspace of {X_{[:1]}, ..., X_{[:r−1]}}.
 (b) Sample z ∼ p_ML(· | κ_r N_r^T G_{[:r]}), and
 (c) Set X_{[:r]} = z^T N_r / ‖z^T N_r‖.

Letting I_k(·) denote the modified Bessel function of the first kind, p_seq is a density on the Stiefel manifold with

$$p_{\mathrm{seq}}(X \mid G, \kappa) = \left\{ \prod_{r=1}^{p} \frac{(\|\kappa_r N_r^T G_{[:r]}\|/2)^{(d-r-1)/2}}{\Gamma\!\left(\frac{d-r+1}{2}\right) I_{(d-r-1)/2}(\|\kappa_r N_r^T G_{[:r]}\|)} \right\} \operatorname{etr}(\kappa G^T X). \quad (3)$$

Write D(X, κ, G) for the reciprocal of the term in braces. Since I_k(x)/x^k is an increasing function of x, and ‖N_r^T G_{[:r]}‖ ≤ ‖G_{[:r]}‖ = 1, we have the following bound D(κ) for D(X, κ, G):

$$D(X, \kappa, G) \le \prod_{r=1}^{p} \frac{\Gamma\!\left(\frac{d-r+1}{2}\right) I_{(d-r-1)/2}(\kappa_r)}{(\kappa_r/2)^{(d-r-1)/2}} = D(\kappa).$$

This implies etr(κ G^T X) ≤ D(κ) p_seq(X | G, κ), allowing the following rejection sampler: draw a sample X from p_seq(·), and accept with probability D(X, κ, G)/D(κ). The accepted proposals come from p_ML(· | G, κ); for samples from p_ML(· | G, κ, H), post-multiply these by H^T.

Given a set of n observations {X_i}, and writing S = Σ_{i=1}^n X_i, we have

$$p(G, \kappa, H \mid \{X_i\}) \propto \operatorname{etr}(H \kappa G^T S)\, p(H)\, p(G)\, p(\kappa) / Z(\kappa)^n.$$

At a high level, our approach is a Gibbs sampler that sequentially updates
H, G and κ. The pair of matrices (H, G) corresponds to the tractable θ_1 in Algorithm 2, while κ corresponds to θ_2. Updating the first two is straightforward, while the third requires our augmentation scheme.
1. Updating G and H: With a matrix Langevin prior on H, the posterior is

$$p(H \mid \{X_i\}, \kappa, G) \propto \operatorname{etr}\{(S^T G \kappa + F_H)^T H\}.$$

The intractable term Z(κ)^n does not involve H, allowing us to ignore it. Redefining S as SH, the posterior over G is also a matrix Langevin:

$$p(G \mid \{X_i\}, \kappa, H) \propto \operatorname{etr}\{(S \kappa + F_G)^T G\}.$$
2. Updating κ: Here, we exploit the rejection sampling scheme of the previous section, and instantiate the rejected proposals using Algorithm 1. From Section 6.1, the joint probability is

$$p(\{X_i, \mathcal{Y}_i\} \mid G, \kappa) = \frac{\operatorname{etr}\left\{ \kappa G^T \left( S + \sum_{i=1}^{n} \sum_{j=1}^{|\mathcal{Y}_i|} Y_{ij} \right) \right\}}{D(\kappa)^{\,n + |\mathcal{Y}|}} \prod_{i=1}^{n} \prod_{j=1}^{|\mathcal{Y}_i|} \frac{D(\kappa) - D(Y_{ij}, G, \kappa)}{D(Y_{ij}, G, \kappa)}, \quad (4)$$

where |Y| = Σ_i |Y_i|. All terms in this expression can be evaluated easily, allowing a simple Metropolis-Hastings algorithm in this augmented space. In fact, we can go further, calculating gradients to run a Hamiltonian Monte Carlo algorithm (Neal, 2010) that makes significantly more efficient proposals than a random-walk sampling algorithm. In particular, let N = n + Σ_{i=1}^n |Y_i|, and redefine S = Σ_{i=1}^n (X_i + Σ_{j=1}^{|Y_i|} Y_{ij}). The log joint probability L ≡ log p({X_i, Y_i}) is

$$L = \operatorname{trace}(\kappa G^T S) + \sum_{i=1}^{n} \sum_{j=1}^{|\mathcal{Y}_i|} \left[ \log\{ D(\kappa) - D(Y_{ij}, \kappa) \} - \log D(Y_{ij}, \kappa) \right] - N \log D(\kappa).$$

Writing D(Y, κ) = C ∏_{r=1}^p I_{(d−r−1)/2}(‖κ_r N_r^T G_{[:r]}‖)/‖κ_r N_r^T G_{[:r]}‖^{(d−r−1)/2} as C D̃(Y, κ), where C collects the constant factors, Appendix B shows that

$$\frac{dL}{d\kappa_k} = G_{[:k]}^T S_{[:k]} + \sum_{i=1}^{n} \sum_{j=1}^{|\mathcal{Y}_i|} \frac{\dfrac{I_{(d-k+1)/2}(\kappa_k)}{I_{(d-k-1)/2}(\kappa_k)} - \|N_k^T G_{[:k]}\| \dfrac{I_{(d-k+1)/2}(\kappa_k \|N_k^T G_{[:k]}\|)}{I_{(d-k-1)/2}(\kappa_k \|N_k^T G_{[:k]}\|)}}{1 - \tilde{D}(Y_{ij}, \kappa)/\tilde{D}(\kappa)} - N \frac{I_{(d-k+1)/2}(\kappa_k)}{I_{(d-k-1)/2}(\kappa_k)}.$$

We use this gradient to construct a Hamiltonian Monte Carlo sampler (Neal, 2010) for κ. Here, it suffices to note that a proposal involves taking L 'leapfrog' steps of size ε along the gradient, and accepting the resulting state with probability proportional to the product of equation (4) and a simple Gaussian 'momentum' term.
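The leapfrog proposal itself is standard; the sketch below shows the mechanics on a generic one-dimensional target, with `log_prob` and `grad_log_prob` standing in for the augmented log-joint L and its gradient (this is ordinary Hamiltonian Monte Carlo, not matrix-Langevin-specific code):

```python
import math
import random

def hmc_step(x, log_prob, grad_log_prob, eps, n_steps, rng):
    """One Hamiltonian Monte Carlo transition: n_steps leapfrog steps of
    size eps with unit-mass Gaussian momentum, then accept or reject."""
    p = rng.gauss(0.0, 1.0)                          # fresh momentum
    x_new = x
    p_new = p + 0.5 * eps * grad_log_prob(x)         # initial half step
    for step in range(n_steps):
        x_new = x_new + eps * p_new                  # full position step
        scale = eps if step < n_steps - 1 else 0.5 * eps
        p_new = p_new + scale * grad_log_prob(x_new) # momentum step
    h_old = -log_prob(x) + 0.5 * p * p               # Hamiltonian before
    h_new = -log_prob(x_new) + 0.5 * p_new * p_new   # ... and after
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return x_new
    return x
```

Because the leapfrog integrator nearly conserves the Hamiltonian, long trajectories (large `n_steps`) give distant proposals that are still accepted with high probability, at the cost of more gradient evaluations.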
The acceptance probability depends on how well the ε-discretization approximates the continuous dynamics of the system, and choosing a small ε and a large L can give global moves with high acceptance probability. A large L, however, comes at the cost of a large number of gradient evaluations. We study this trade-off in Section 6.4.

The vectorcardiogram is a loop traced by the cardiac vector during a cycle of the heart beat. The two directions of orientation of this loop in three dimensions form a point on the Stiefel manifold. The dataset of Downs et al. (1971) includes such recordings, and is displayed in the left subplot of Figure 2. We represent each observation with a pair of orthonormal vectors, with the set of cyan lines to the right forming the first component. This empirical distribution possesses a single mode, so that the matrix Langevin distribution appears a suitable model.

Figure 2: (Left) Vectorcardiogram dataset with inferences. Bold lines are maximum likelihood estimates of G, and solid circles contain posterior mass. Dashed circles are predictive probability regions. (Right) Posterior over κ_1 and κ_2; circles are maximum likelihood estimates.

We place weak, independent exponential priors on the diagonal elements of the scale parameter κ, and a uniform prior on the location parameter G. We restrict H to be the identity matrix. Inferences were carried out using the Hamiltonian sampler to produce 10,000 samples, with a burn-in period of 1,000. For the leapfrog dynamics, we fix the step size and the number of steps, and we fix the 'mass parameter' to the identity matrix, as is typical. We implemented all algorithms in R, building on code from the rstiefel package of Peter Hoff. All simulations were run on an Intel Core 2 Duo 3 GHz CPU. For comparison, we include the maximum likelihood estimates of κ and G.
For κ_1 and κ_2, we plot these maximum likelihood estimates in the right half of Figure 2 as the red circles. The bold straight lines in Figure 2 (left) show the maximum likelihood estimates of the components of G, with the small circles corresponding to Bayesian credible regions estimated from the Monte Carlo output. The dashed circles correspond to predictive probability regions for the Bayesian model. For these, we generated points on the Stiefel manifold for each posterior sample, with parameters specified by that sample; the dashed circles contain a fixed fraction of these points across all samples. Figure 2 (right) shows the posterior over κ_1 and κ_2.

To quantify sampler efficiency, we estimate the effective sample size produced per unit time. This corrects for correlation between successive Markov chain samples by estimating the number of independent samples produced; for this we used the rcoda package of Plummer et al. (2006). The left plot in Figure 3 considers two Metropolis-Hastings samplers, the exchange sampler and our latent variable sampler, on the vectorcardiogram dataset. Both samplers perform a random walk in κ-space, with the steps drawn from a normal distribution whose variance increases along the horizontal axis. The vertical axis shows the median effective sample size per second for the components of κ.

Figure 3: Effective samples per second for (left) random walk and (right) Hamiltonian samplers. From bottom to top at abscissa 0.1: (left) Metropolis-Hastings data-augmentation sampler and exchange sampler, and (right) 1/10/5/3 leapfrog steps of the Hamiltonian sampler.

The figure shows that both samplers' performance peaks at intermediate proposal variances, with the exchange sampler performing slightly better. However, the real advantage of our sampler is that introducing the latent variables results in a joint distribution without any intractable terms, allowing the use of more sophisticated sampling algorithms. The plot to the right studies the Hamiltonian Monte Carlo sampler described at the end of Section 3.1. Here we vary the size of the leapfrog steps along the horizontal axis, with the different curves corresponding to different numbers of leapfrog steps. We see that this performs an order of magnitude better than either of the previous algorithms, with performance peaking at fairly typical values of the step size and number of steps for this algorithm. This shows the advantage of exploiting gradient information in exploring the parameter space.

In this section, we consider an approximate sampler based on an asymptotic approximation to Z(κ) = ₀F₁(d/2; κ^T κ/4), valid for large values of (κ_1, ..., κ_p) (Khatri and Mardia, 1977):

$$Z(\kappa) \simeq \left\{ 2^{\{-p(p+5)+2pd\}/4}\, \pi^{p/2} \right\} \operatorname{etr}(\kappa) \prod_{j=1}^{p} \Gamma\!\left(\frac{d-j+1}{2}\right) \left[\left\{\prod_{j=2}^{p}\prod_{i=1}^{j-1}(\kappa_i + \kappa_j)\right\} \prod_{i=1}^{p} \kappa_i^{\,d-p}\right]^{-1/2}.$$

We use this approximation in the acceptance probability of a Metropolis-Hastings algorithm; it can similarly be used to construct a Hamiltonian sampler. For a more complicated but accurate approximation, see Kume et al. (2013). In general, however, using such approximate schemes involves the ratio of two approximations, and can have very unpredictable performance.

On the vectorcardiogram dataset, the approximate sampler is about forty times faster than the exact samplers. For larger datasets, this difference will be even greater, and the real question is how accurate the approximation is. Our exact sampler allows us to study this: we consider the Stiefel manifold V_{3,d}, with the three diagonal elements of κ held fixed.
With this setting of κ, and a random G, we generate datasets of observations with d taking a range of increasing values. In each case, we estimate the posterior mean of κ by running the exchange sampler, and treat this as the truth. We compare this with the posterior means returned by our Hamiltonian sampler, as well as the approximate sampler. Figure 4 shows these results, with the three subplots corresponding
Figure 4: Errors in the posterior mean. Solid/dashed lines are the Hamiltonian/approximate sampler.

to the three components of κ, and the ambient dimensionality d increasing along the horizontal axis. As expected, the two exact samplers agree, and the Hamiltonian sampler has almost no 'error'. The approximate sampler is more complicated. For intermediate values of d, its estimated posterior mean is close to that of the exact samplers; smaller values lead to an approximate posterior mean that underestimates the actual posterior mean, while in higher dimensions the opposite occurs. Recalling that κ controls the concentration of the matrix Langevin distribution about its mode, this implies that in high dimensions the approximate sampler underestimates uncertainty in the distribution of future observations.

Our next application is the Gaussian process density sampler of Adams et al. (2009), a nonparametric prior for probability densities induced by a logistic transformation of a random function drawn from a Gaussian process. Letting σ(·) denote the logistic function, the random density is

g(x) ∝ g_0(x) σ{f(x)}, f ∼ GP,

with g_0(·) a parametric base density and GP denoting a Gaussian process. The inequality g_0(x) σ{f(x)} ≤ g_0(x) allows a rejection sampling algorithm that makes proposals from g_0(·). At a proposed location x*, we sample the function value f(x*) conditioned on all previous evaluations, and accept the proposal with probability σ{f(x*)}. Such a scheme involves no approximation error, and only requires evaluating the random function at a finite set of points. Algorithm 4 describes the steps involved in generating n observations.

Given observations X = {x_1, ..., x_n}, we are interested in p(g | X), the posterior over the underlying density.
Since g is determined by the modulating function f, we focus on p(f | X). While this quantity is doubly intractable, after augmenting the state space to include the proposals Y from the rejection sampling algorithm, p(f | X, Y) has density with respect to the Gaussian process prior given by

\[
\prod_{i=1}^{n} \sigma\{f(x_i)\} \prod_{i=1}^{|\mathcal{Y}|} [\,1 - \sigma\{f(y_i)\}\,];
\]

see also Adams et al. (2009).

Algorithm 4: Generate n new samples from the Gaussian process density sampler
Input: A base probability density g_0(·),
previous accepted and rejected proposals X̃ and Ỹ,
Gaussian process evaluations f_X̃ and f_Ỹ at these locations.
Output: n new samples X, with the associated rejected proposals Y,
Gaussian process evaluations f_X and f_Y at these locations.
repeat
Sample a proposal y from g_0(·).
Sample f_y, the Gaussian process evaluated at y, conditioned on f_X, f_Y, f_X̃ and f_Ỹ.
with probability σ(f_y): accept y and add it to X; add f_y to f_X.
else: reject y and add it to Y; add f_y to f_Y.
until n samples are accepted.

In words, the posterior over f evaluated at X ∪ Y is just the posterior from a Gaussian process classification problem with a logistic link function, with the accepted and rejected proposals corresponding to the two classes. A number of Markov chain Monte Carlo methods, such as Hamiltonian Monte Carlo or elliptical slice sampling (Murray et al., 2010), are applicable in this situation. Given f on X ∪ Y, it can be evaluated anywhere else by conditionally sampling from a multivariate normal.

Sampling the rejected proposals Y given X and f is straightforward by Algorithm 1: run the rejection sampler until n accepts, and treat the rejected proposals generated along the way as Y. In practice, we do not have access to the entire function f, only its values at X and Y_old, the locations of the previous thinned variables. However, just as under the generative mechanism, we can retrospectively evaluate the function f where needed. After proposing from g_0(·), we sample the value of the function at this location conditioned on all previous evaluations, and use this value to decide whether to accept or reject. We outline the inference algorithm in Algorithm 5, noting that it is much simpler than that proposed in Adams et al. (2009). We also refer to that paper for limitations of the exchange sampler in this problem.
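As an illustration, the generative process of Algorithm 4 can be sketched in a few lines of Python. This is our own minimal sketch, not the authors' implementation: it assumes a one-dimensional observation space, a squared-exponential kernel with unit variance and length-scale, a standard normal base density g_0, and no previously instantiated points; all function names are ours.

```python
import numpy as np

def sq_exp_kernel(a, b, var=1.0, ls=1.0):
    """Squared-exponential covariance between two lists of 1-d points."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return var * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gp_density_sample(n, base_rvs, rng, var=1.0, ls=1.0, jitter=1e-8):
    """Generate n samples by rejection, lazily instantiating the Gaussian
    process f at each proposed location (cf. Algorithm 4)."""
    pts, fvals = [], []            # every location where f has been evaluated
    accepted, rejected = [], []
    while len(accepted) < n:
        y = base_rvs(rng)          # propose from the base density g0
        if pts:                    # conditional law of f(y) given past values
            K = sq_exp_kernel(pts, pts, var, ls) + jitter * np.eye(len(pts))
            k = sq_exp_kernel([y], pts, var, ls)[0]
            w = np.linalg.solve(K, k)
            mu = float(w @ np.asarray(fvals))
            s2 = max(var - float(w @ k), jitter)
        else:
            mu, s2 = 0.0, var
        fy = mu + np.sqrt(s2) * rng.standard_normal()
        pts.append(y); fvals.append(fy)
        if rng.random() < sigmoid(fy):   # accept with probability σ{f(y)}
            accepted.append(y)
        else:
            rejected.append(y)
    return accepted, rejected

# Example: five draws with a standard normal base density g0.
rng = np.random.default_rng(0)
X, Y = gp_density_sample(5, lambda r: r.normal(), rng)
```

Each proposal requires one conditional Gaussian draw given all earlier function evaluations, so the naive solve above grows in cost with the number of instantiated points; an incremental Cholesky update would avoid the repeated factorizations.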
Voice changes are a symptom and measure of the onset of Parkinson's disease, and one attribute is voice shimmer, a measure of variation in amplitude. We consider a dataset of such measurements for subjects with and without the disease (Little et al., 2007). We normalized the measurements to a common range, and used the model of Adams et al. (2009) as a prior on the underlying probability densities. We set g_0(·) to a normal N(µ, σ²), with a normal-inverse-gamma prior on (µ, σ²). The Gaussian process had a squared-exponential kernel. For each group, we ran a Matlab implementation of our data augmentation algorithm to produce 2,000 posterior samples after a burn-in period.

Figure 5 (left) shows the resulting posterior over densities, corresponding to θ in Algorithm 2.

Algorithm 5: A Markov chain iteration for inference in the Gaussian process density sampler
Input: Observations X with corresponding function evaluations f̃_X,
current rejected proposals Ỹ with corresponding function evaluations f̃_Ỹ.
Output: New rejected proposals Y,
new Gaussian process evaluations f_X and f_Y at X and Y,
new hyperparameters.
Run Algorithm 4 to produce |X| accepted samples, with X, Ỹ, f̃_X and f̃_Ỹ as inputs.
Replace Ỹ and f̃_Ỹ with the values returned by the previous step; call these Y and f̂_Y.
Update f̃_X and f̂_Y using, for example, hybrid Monte Carlo, to get f_X and f_Y.
Update the Gaussian process and base-distribution hyperparameters.

Figure 5: (Left) Posterior density for the positive and control groups. (Right) Posterior over the Gaussian process function for the positive group. Both plots show the median with posterior credible intervals.
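The function update in Algorithm 5 conditions the Gaussian process values at the accepted and rejected locations on the logistic 'classification' likelihood. Besides hybrid Monte Carlo, elliptical slice sampling (Murray et al., 2010) is convenient here because the prior is Gaussian. The following is our own minimal sketch of one such update, with names and the flat-vector setup of our choosing: `f` holds the GP values, the first `n_acc` entries corresponding to accepted proposals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(f, n_acc):
    """Log-likelihood of GP values f: the first n_acc entries are accepted
    proposals (label 1), the remaining entries rejected proposals (label 0)."""
    p = sigmoid(f)
    return np.sum(np.log(p[:n_acc])) + np.sum(np.log1p(-p[n_acc:]))

def elliptical_slice(f, chol_K, n_acc, rng):
    """One elliptical slice sampling update of f under a GP prior with
    covariance K = chol_K @ chol_K.T and the likelihood above."""
    nu = chol_K @ rng.standard_normal(len(f))         # auxiliary prior draw
    log_y = log_lik(f, n_acc) + np.log(rng.random())  # slice level
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new, n_acc) > log_y:
            return f_new
        # shrink the angle bracket towards the current state and retry
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)
```

Because every proposal lies on the prior ellipse through the current state, this update needs no step-size tuning, in contrast to the leapfrog parameters of the Hamiltonian scheme studied earlier.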
Figure 6: (Left) Kernel density estimate of the locations of rejected proposals and (right) histogram of the number of rejected proposals, for the positive group.

The control group is fairly Gaussian, while the disease group is skewed to the right. The right plot of Figure 5 focuses on the deviation from normality by plotting the posterior over the latent function f. We see that in the right tail it is larger than its prior mean of zero, implying larger probability than under a Gaussian. Figure 6 studies the distribution of the rejected proposals Y. The left plot shows the distribution of their locations: most of these occurred near the origin, where the disease density reverts to Gaussian or even sub-Gaussian behaviour, with the intensity function taking small values. The right plot is a histogram of the number of rejected proposals per iteration. Since inference over the latent function involves evaluating it at the locations of the accepted as well as the rejected proposals, this number determines the size of the largest covariance matrix we had to deal with, which remained modest.

Using the same setup as Section 6.4, a naïve Matlab implementation took a matter of minutes to run 2,500 iterations for each of the disease and control datasets. One can imagine computations becoming unwieldy for a large number of observations, or when there is a large mismatch between the true density and the base measure g_0(·). In such situations, one might have to choose the Gaussian process covariance kernel more carefully, use one of many sparse approximation techniques, or use other nonparametric priors such as splines instead. In all these cases, we can use our algorithm to recover the rejected proposals Y, and given these, posterior inference over f can be carried out using standard techniques.

We described a simple approach to carry out Markov chain Monte Carlo inference when data generation involves a rejection sampling algorithm.
Our algorithm is simple and efficient, and allows us to exploit ideas like Hamiltonian Monte Carlo for more effective inference. While our algorithm is exact, it also provides a framework for faster, approximate algorithms. For instance, the number of rejected proposals preceding any observation is a priori unbounded; one can bound the computational cost of an iteration by limiting the maximum number of rejected proposals. Similarly, one might try sharing rejected proposals across observations. We leave the study of the approximate Markov chain algorithms resulting from such 'user impatience' to future research. Also left open is a more careful analysis of Markov mixing rates for the applications we considered. There are a number of potential applications that we have not described here: particularly relevant are rejection samplers for stochastic differential equations (Beskos and Roberts, 2005; Bladt and Sørensen, 2014).
This work was supported by the National Institute of Environmental Health Sciences of the National Institutes of Health.
Proofs
Proof (of Proposition 1). Rejection sampling first proposes from q(x), and then accepts with probability f(x)/{M q(x)}. Conceptually, one can first decide whether to accept or reject, and then conditionally sample the location. The marginal acceptance probability is Z(θ)/M, the area under f(·, θ) divided by that under M q(· | θ). An accepted sample x is distributed according to the target distribution f(x, θ)/Z(θ), while a rejected sample is distributed according to {M q(x | θ) − f(x, θ)}/{M − Z(θ)}. This two-component mixture is just the proposal q(x). While this scheme loses the computational benefits that motivate the original algorithm, it shows that the location of an accepted sample is independent of the past, and consequently, that the number and locations of the rejected samples preceding an accepted sample are independent of the location of that sample. As a result, one can pair with any accepted sample the rejected samples preceding any other accepted sample.

Proof (of Theorem 2). It follows easily from Bayes' rule that for an observation X,

\[
p(\theta \mid X, \mathcal{Y}) \ge p(\theta \mid X)\, \frac{b_f}{B_f} \left(\frac{b_q}{r B_q}\right)^{|\mathcal{Y}|}.
\]

Let the number of observations |X| be n.
Then,

\begin{align*}
k(\hat\theta \mid \theta) &= \int_{\mathcal{U}^n} p(\hat\theta \mid \mathcal{Y}, X)\, p(\mathcal{Y} \mid \theta, X)\, \mathrm{d}\mathcal{Y} \\
&\ge \left(\frac{b_f}{B_f}\right)^{n} p(\hat\theta \mid X) \prod_{i=1}^{n} \int_{\mathcal{U}} \beta^{|\mathcal{Y}_i|}\, p(\mathcal{Y}_i \mid \theta, X)\, \mathrm{d}\mathcal{Y}_i \\
&= \left(\frac{b_f}{B_f}\right)^{n} p(\hat\theta \mid X) \prod_{i=1}^{n} \int_{\mathcal{U}} \beta^{|\mathcal{Y}_i|}\, \frac{Z(\theta)}{M} \prod_{j=1}^{|\mathcal{Y}_i|} \left\{ q(y_i^j \mid \theta) - \frac{f(y_i^j, \theta)}{M} \right\} \lambda(\mathrm{d}y_i^j) \\
&= \left(\frac{b_f Z(\theta)}{B_f M}\right)^{n} p(\hat\theta \mid X) \prod_{i=1}^{n} \sum_{|\mathcal{Y}_i| = 0}^{\infty} \beta^{|\mathcal{Y}_i|} \prod_{j=1}^{|\mathcal{Y}_i|} \left( 1 - \frac{Z(\theta)}{M} \right) \\
&= \left(\frac{b_f Z(\theta)}{B_f M}\right)^{n} p(\hat\theta \mid X)\, \frac{1}{\tilde\delta^{\,n}} = \delta_\theta^{\,n}\, p(\hat\theta \mid X),
\end{align*}

where \(\tilde\delta = 1 - \beta\{1 - Z(\theta)/M\}\), so that

\[
\frac{1}{\delta_\theta} = \frac{B_f}{b_f} \left[ \frac{M}{Z(\theta)} - \beta \left\{ \frac{M}{Z(\theta)} - 1 \right\} \right] = \frac{B_f}{b_f} \left\{ \frac{M}{Z(\theta)} (1 - \beta) + \beta \right\}.
\]

Since M/Z(θ) ≤ R, this gives

\[
k(\hat\theta \mid \theta) \ge \delta\, p(\hat\theta \mid X), \qquad \delta = \left[ \frac{b_f}{B_f \{\beta + R(1 - \beta)\}} \right]^{n}. \tag{5}
\]

Thus k(θ̂ | θ) satisfies equation (2), with δ as above and h(θ̂) the posterior p(θ̂ | X).

Gradient information
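The gradient calculations in this section rest on the standard Bessel identity d/dx { I_m(x)/x^m } = I_{m+1}(x)/x^m. As a quick numerical sanity check (our own sketch, using scipy's modified Bessel function `iv`; the values of m and x below are arbitrary):

```python
import numpy as np
from scipy.special import iv

def ratio(m, x):
    """I_m(x) / x^m, the quantity differentiated in the gradient derivation."""
    return iv(m, x) / x ** m

# Central finite difference of d/dx { I_m(x)/x^m }, versus I_{m+1}(x)/x^m.
m, x, h = 1.5, 2.0, 1e-6
fd = (ratio(m, x + h) - ratio(m, x - h)) / (2.0 * h)
exact = iv(m + 1, x) / x ** m
```

The two quantities agree to finite-difference accuracy, which is a useful check when implementing the derivatives of D̃ below.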
For n pairs {X_i, Y_i}, with N = n + Σ_{i=1}^{n} |Y_i| and S = Σ_{i=1}^{n} (X_i + Σ_{j=1}^{|Y_i|} Y_i^j), we have

\[
\mathcal{L} = \log P(\{X_i, \mathcal{Y}_i\}) = \operatorname{trace}(G^{T} \kappa S) + \sum_{i=1}^{n} \sum_{j=1}^{|\mathcal{Y}_i|} \left[ \log\{ D(\kappa) - D(Y_i^j, \kappa) \} - \log\{ D(Y_i^j, \kappa) \} \right] - N \log\{ D(\kappa) \}.
\]

Write

\[
D(Y, \kappa) = C \prod_{r=1}^{p} \frac{ I_{(d-r-1)/2}(\| \kappa_r N_r^{T} G_r \|) }{ \| \kappa_r N_r^{T} G_r \|^{(d-r-1)/2} } \quad \text{as} \quad C \tilde{D}(Y, \kappa).
\]

Since \(\frac{\mathrm{d}}{\mathrm{d}x} \left\{ \frac{I_m(x)}{x^m} \right\} = \frac{I_{m+1}(x)}{x^m}\),

\[
\frac{\mathrm{d}\tilde{D}(Y, \kappa)}{\mathrm{d}\kappa_j} = N_j^{T} G_j\, \tilde{D}(Y, \kappa)\, \frac{I_{(d-j+1)/2}}{I_{(d-j-1)/2}}(\kappa_j N_j^{T} G_j)
\quad \text{and} \quad
\frac{\mathrm{d}\tilde{D}(\kappa)}{\mathrm{d}\kappa_j} = \tilde{D}(\kappa)\, \frac{I_{(d-j+1)/2}}{I_{(d-j-1)/2}}(\kappa_j),
\]

where \(\frac{I_a}{I_b}(x)\) abbreviates \(I_a(x)/I_b(x)\). Then, letting primes denote derivatives with respect to \(\kappa_k\),

\begin{align*}
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\kappa_k} &= G_{[\cdot,k]}^{T} S_{[\cdot,k]} + \sum_{i=1}^{n} \sum_{j=1}^{|\mathcal{Y}_i|} \left\{ \frac{ \tilde{D}'(\kappa) - \tilde{D}'(Y_i^j, \kappa) }{ \tilde{D}(\kappa) - \tilde{D}(Y_i^j, \kappa) } - \frac{ \tilde{D}'(Y_i^j, \kappa) }{ \tilde{D}(Y_i^j, \kappa) } \right\} - N \frac{ \tilde{D}'(\kappa) }{ \tilde{D}(\kappa) } \\
&= G_{[\cdot,k]}^{T} S_{[\cdot,k]} + \sum_{i=1}^{n} \sum_{j=1}^{|\mathcal{Y}_i|} \frac{ \frac{I_{(d-k+1)/2}}{I_{(d-k-1)/2}}(\kappa_k) - N_k^{T} G_k \frac{I_{(d-k+1)/2}}{I_{(d-k-1)/2}}(\kappa_k N_k^{T} G_k) }{ 1 - \tilde{D}(Y_i^j, \kappa) / \tilde{D}(\kappa) } - N \frac{I_{(d-k+1)/2}}{I_{(d-k-1)/2}}(\kappa_k).
\end{align*}

References
Aban, I. B., Meerschaert, M. M., and Panorska, A. K. (2006). Parameter estimation for the truncated Pareto distribution. J. Amer. Statist. Assoc., 101:270–277.

Adams, R. P., Murray, I., and MacKay, D. J. C. (2009). The Gaussian process density sampler. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Adv. Neural Inf. Process. Syst. 21, pages 9–16. MIT Press.

Alai, D. H., Landsman, Z., and Sherris, M. (2013). Lifetime dependence modelling using a truncated multivariate Gamma distribution. Insurance Math. Econom., 52(3):542–549.

Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist., 37(2):697–725.

Beskos, A. and Roberts, G. (2005). Exact simulation of diffusions. Ann. App. Prob., 15(4):2422–2444.

Bladt, M. and Sørensen, M. (2014). Simple simulation of diffusion bridges with application to likelihood inference for diffusions. Bernoulli, 20(2):645–675.

Brinkman, R. R., Gasparetto, M., Lee, S.-J. J., Ribickas, A. J., Perkins, J., Janssen, W., Smiley, R., and Smith, C. (2007). High-content flow cytometry and temporal data analysis for defining a cellular signature of graft-versus-host disease. Biol. Blood Marrow Trans., 13(6):691–700.

Chikuse, Y. (2003). Statistics on Special Manifolds. Springer, New York.

Downs, T., Liebman, J., and Mackay, W. (1971). Statistical methods for vectorcardiogram orientations. In Vectorcardiography 2: Proc. XIth Intn. Symp. Vectorcardiography (I. Hoffman, R. I. Hamby and E. Glassman, Eds.), pages 216–222. North-Holland, Amsterdam.

Edelman, A., Arias, T., and Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20(2):303–353.

Goethals, K., Ampe, B., Berkvens, D., Laevens, H., Janssen, P., and Duchateau, L. (2009). Modeling interval-censored, clustered cow udder quarter infection times through the shared gamma frailty model. J. Agr. Biol. Envir. Statist., 14(1):1–14.

Hoff, P. D. (2009a). A hierarchical eigenmodel for pooled covariance estimation. J. R. Statist. Soc. B, 71(5):971–992.

Hoff, P. D. (2009b). Simulation of the matrix Bingham-von Mises-Fisher distribution, with applications to multivariate and relational data. J. Comp. Graph. Stat., 18(2):438–456.

Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc., 96(453):161–173.

Jones, G. L. and Hobert, J. P. (2001). Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statist. Sci., 16(4):312–334.

Khatri, C. G. and Mardia, K. V. (1977). The von Mises-Fisher matrix distribution in orientation statistics. J. R. Statist. Soc. B, 39(1).

Kume, A., Preston, S. P., and Wood, A. T. A. (2013). Saddlepoint approximations for the normalizing constant of Fisher-Bingham distributions on products of spheres and Stiefel manifolds. Biometrika, 100(4):971–984.

Lee, G. and Scott, C. (2012). EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Comput. Statist. Data Anal., 56(9):2816–2829.

Liechty, M. W., Liechty, J. C., and Müller, P. (2009). The shadow prior. J. Comp. Graph. Statist., 18(2):368–383.

Little, M. A., McSharry, P., Roberts, S., Costello, D., and Moroz, I. (2007). Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. Biomed. Eng. Online, pages 6–23.

Lo, A. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. Ann. Statist., 12(1):351–357.

Møller, J., Pettitt, A. N., Reeves, R., and Berthelsen, K. K. (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika, 93(2):451–458.

Murray, I., Adams, R. P., and MacKay, D. J. (2010). Elliptical slice sampling. J. Mach. Learn. Res. W&CP, 9.

Murray, I., Ghahramani, Z., and MacKay, D. J. C. (2006). MCMC for doubly-intractable distributions. In Proc. 22nd Conf. Uncert. Artif. Intell., pages 359–366. AUAI Press.

Neal, R. M. (2010). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162.

Plummer, M., Best, N., Cowles, K., and Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11.

Robert, C. P. and Casella, G. (2005). Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag, New York.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. J. Amer. Statist. Assoc., 101(476):1566–1581.

Walker, S. G. (2011). Posterior sampling when the normalizing constant is unknown. Commun. Statist. Simulat., 40(5):784–792.

Wood, A. T. (1994). Simulation of the von Mises Fisher distribution.