A Latent Slice Sampling Algorithm
Yanxin Li∗
Department of Statistics and Data Sciences, University of Texas at Austin
and Stephen G. Walker
Department of Mathematics, University of Texas at Austin
∗For correspondence, contact: [email protected]
Abstract
In this paper we introduce a new sampling algorithm which has the potential to be adopted as a universal replacement to the Metropolis–Hastings algorithm. It is related to the slice sampler, and motivated by an algorithm applicable to discrete probability distributions which obviates the need for a proposal distribution, in that it has no accept/reject component. This paper looks at the continuous counterpart. A latent variable, combined with a slice sampler and a shrinkage procedure applied to uniform density functions, creates a highly efficient sampler which can generate random variables from very high dimensional distributions as a single block.
Keywords:
High dimensional density; Markov chain Monte Carlo; Shrinkage procedure; Uniform random variables.
The original, and still one of the most popular, sampling methods is the Metropolis–Hastings algorithm (Metropolis et al, 1953; Hastings, 1970). It generates a Markov sample which has the target density as the stationary density. One well known drawback of the algorithm is that the sampler can get stuck if the proposal density is not well set; in this case the Markov chain can linger at a particular value before, if ever, moving. Hence, if the density of interest can be sampled directly, via a rejection algorithm, or a Gibbs sampler, if appropriate, then these would be the methods of choice.

The question discussed in this paper is whether we can always avoid a Metropolis–Hastings algorithm without compromising efficiency. When the sample space is discrete, say Ω = {0, 1, 2, . . .}, the sampler presented in Walker (2014) is one such possibility. One of the key ideas behind the Metropolis–Hastings algorithm is the transition density p(y | x), defined for all x, y ∈ Ω, satisfying

\[ p(y \mid x)\,\pi(x) = p(x \mid y)\,\pi(y), \qquad (1) \]

where π is the target density. The Metropolis–Hastings algorithm has transition density

\[ p(y \mid x) = \alpha(x, y)\, q(y \mid x) + (1 - r(x))\, \mathbf{1}(y = x), \]

where q(y | x) is a proposal density, to be chosen,

\[ \alpha(x, y) = \min\left\{ 1, \frac{\pi(y)\, q(x \mid y)}{\pi(x)\, q(y \mid x)} \right\}, \]

and r(x) = ∫ α(x, y) q(y | x) dy. It is easily seen that this p(· | ·) satisfies equation (1). An alternative p(· | ·) satisfying equation (1), proposed in Walker (2014), is given by

\[ p(y \mid x) = \frac{\pi(y)}{k} \sum_{l = \max(y,\, x)}^{\min(y + k - 1,\; x + k - 1)} \frac{1}{\sum_{z = \max(0,\, l - k + 1)}^{l} \pi(z)}, \qquad (2) \]

where |y − x| < k, and k is to be chosen. The choice of k is, however, easy to set: as large as possible while the computations required to sample p(y | x) remain feasible in time. Note that with this transition density there is no possibility for the sampler to get stuck, and neither is there an accept/reject component. Note also that π only needs to be known up to a normalizing constant, an essential property of any sampler, since in many applications the target density is only specified up to an unknown normalizing constant. Finally, note that (2) is easy to sample. A multivariate version of (2) is easy to establish and has been applied to a certain class of optimization problems in Ekin et al. (2020).

The aim in the present paper is to find a continuous counterpart to (2). In fact a suitable transition density is not difficult to write down as a direct analog of (2);

\[ p(y \mid x) = \frac{\pi(y)}{k} \int_{\max(y,\, x)}^{\min(y + k,\; x + k)} \frac{dl}{\int_{l - k}^{l} \pi(z)\, dz}, \qquad (3) \]

where here we have Ω = (−∞, ∞) and |y − x| ≤ k. Just as (2) can be seen as a Gibbs sampler, so can (3). To see this, consider the joint density function

\[ p(y, l) = \frac{\pi(y)\, \mathbf{1}(y < l < y + k)}{k}, \qquad (4) \]

so clearly π(y) is the required marginal density. Then (3) is given by p(y | x) = ∫ p(y | l) p(l | x) dl, where p(l | x) is uniform on the interval (x, x + k). Further, (3) also satisfies equation (1). The only outstanding question is how to sample (3), which is the main focus of the paper. Indeed, we show that sampling (3) can be done efficiently using only uniform random variables, requiring only the implementation of an adaptive rejection algorithm, which works extremely fast.
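To make the mechanics of (2) concrete, the following is a minimal Python sketch of one draw from this transition density via its Gibbs interpretation: sample the latent l uniformly from {x, . . . , x + k − 1}, then sample y from π restricted to {max(0, l − k + 1), . . . , l}. It assumes a finite support with π supplied as a vector of unnormalized probabilities; the function name and this representation are our own illustrative choices (the paper's experiments were coded in R).

```python
import numpy as np

def discrete_transition(x, pi, k, rng):
    """One draw y ~ p(y | x) from (2); pi is a vector of unnormalized
    probabilities on the finite support {0, 1, ..., len(pi) - 1}."""
    # Latent step: l is uniform on {x, ..., x + k - 1}.
    l = x + rng.integers(k)
    # Restrict pi to the window {max(0, l - k + 1), ..., l} and renormalize.
    lo = max(0, l - k + 1)
    hi = min(l, len(pi) - 1)  # guard against running past the finite support
    w = pi[lo:hi + 1] / pi[lo:hi + 1].sum()
    return lo + rng.choice(hi - lo + 1, p=w)
```

Since the sampled window always contains the current x, the chain can never get stuck, and no accept/reject decision is made.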
In section 2 the aim is to show how to sample from p(y | x) given by (3), but with necessary extensions involving making k random. This also requires a further latent variable; specifically a "slice" variable, similar in spirit to Besag and Green (1993), Damien et al (1999) and Neal (2003). Slice sampling, as it has become known, is a popular approach to sampling complex densities, usually within a Gibbs sampling framework. In fact slice samplers have good convergence properties: Roberts and Rosenthal (1999) show that slice samplers are nearly always geometrically ergodic, while Mira and Tierney (2002) provide sufficient conditions for a slice sampler to be uniformly ergodic. Recent uses of Neal's approach include the elliptical slice sampler, see Murray et al (2010), the generalized elliptical slice sampler, see Nishihara et al (2014), and factor slice sampling, see Tibbits et al (2014). Once the slice variable has been incorporated within (3), it is then possible to compare the new sampler with Neal's slice sampler. Indeed, as it stands with k fixed, it is precisely a version of Neal's algorithm. Both we and Neal extend from this fixed k, but in different directions: Neal adopts the reversible framework, while we adopt a random k approach and use the framework established by the joint density underlying (3). This allows us to maintain a Gibbs sampling framework while avoiding a tricky detailed balance constraint. We make a direct comparison with Neal's slice sampler in section 3. Numerous illustrations are presented in section 4, and section 5 concludes with a brief description and a full layout of the algorithm for an arbitrary multivariate distribution.

We first describe the algorithm in one dimension and later detail the extension to multiple dimensions. To develop the joint density (4), we make it more flexible by allowing k to be a random variable, which we will now refer to as s, and assign s a density p(s), to be chosen, and allow l to be in the interval (y − s/2, y + s/2):

\[ p(y, s, l) = \frac{\pi(y)\, p(s)\, \mathbf{1}(y - s/2 < l < y + s/2)}{s}. \qquad (5) \]

A key aspect of the innovation in the sampler is on display here; we have introduced a y term outside of the π(y) term without altering the correct marginal. So the marginal density of y is π(y) and the marginal density of s is p(s). A Gibbs sampler based directly on (5) would be difficult to implement, as it is not possible to sample from π(y); or rather, it is assumed not to be possible to do so. In such cases, a slice sampler can be utilized. Introducing a slice variable w, the joint density then becomes

\[ p(y, w, s, l) = \frac{\mathbf{1}(\pi(y) > w)\, p(s)\, \mathbf{1}(y - s/2 < l < y + s/2)}{s}. \qquad (6) \]

While this is more than used by Neal (2003), the extra component, i.e. p(s) 1(y − s/2 < l < y + s/2)/s, is effectively providing the stochastic search engine for the set of y for which π(y) > w. Such a procedure was also required by Neal (2003), who used a search strategy while needing also to maintain a detailed balance criterion. We, on the other hand, are free from such constraints. For us, the search is greatly simplified, yet just as effective, by incorporating the search component into the joint density. This means we do not have to implement the stepping out or doubling procedure which is a part of Neal's algorithm.

We implement a Gibbs sampler based on (6). Now p(w, l | y, s) is easy to sample, being two conditionally independent uniform random variables. Further,

\[ p(s \mid y, w, l) \propto \frac{p(s)}{s}\, \mathbf{1}(s > 2|l - y|). \qquad (7) \]
This conditional density is also straightforward to sample; throughout we take p(s) ∝ s e^{−λs} for some λ > 0, typically small in order to provide a large variance. With this choice the s in the denominator of (7) cancels, so s is a shifted exponential: s = 2|l − y| plus an exponential variable with rate λ. Finally,

\[ p(y \mid w, s, l) \propto \mathbf{1}(\pi(y) > w)\, \mathbf{1}(l - s/2 < y < l + s/2). \]

We sample this using an adaptive rejection sampler; it is also a shrinkage procedure as described in Neal (2003). Before describing the adaptive rejection sampler we present a simple illustration of the key aspects of the one step algorithm, starting with the current value y_0.

Figure 1: Illustration of latent slice sampler
An illustration is provided in Fig. 1. The current values of y_0, w and l are indicated. The illustration for this case gives a value of s for which the relevant values of l − s/2 and l + s/2 are shown. A proposal y is sampled uniformly from (l − s/2, l + s/2)
and is accepted if π(y) > w, as shown in the graph. Rejected y give information about the location of the set on which π(y) > w, and this can be used to improve the proposal via the shrinkage procedure. To generalize the setting, we consider the adaptive rejection sampling of

\[ p(y) \propto \mathbf{1}(y \in C)\, \mathbf{1}(a < y < b), \]

where C ∩ (a, b) ≠ ∅ and the current value y_0 ∈ C ∩ (a, b). Let a_1 = a and b_1 = b; at iteration m, starting at m = 1:

1. Sample y* uniformly from (a_m, b_m).
2. While y* ∉ C: if y* < y_0 then a_{m+1} ← max{a_m, y*}, else b_{m+1} ← min{b_m, y*}; and m → m + 1.
3. Repeat steps 1 and 2 until y* ∈ C; then y = y*.

This works for reasons outlined in Neal (2003); see also the discussion by Walker in Neal's paper. The basic idea is that the sampling strategy resulting in y = y*, conditional on y_0, and writing this density as p(y | y_0), satisfies detailed balance with respect to p(y); i.e.

\[ p(y \mid y_0)\, p(y_0) = p(y_0 \mid y)\, p(y). \]

The obvious point here is that, as p(y) is uniform, one only needs to establish that p(y | y_0) = p(y_0 | y), which is straightforward to understand.
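As a concrete illustration of these three steps, here is a minimal Python sketch of the shrinkage procedure, assuming the set C is available only through an indicator function (called in_C below, our own notation):

```python
import numpy as np

def shrink_sample(in_C, a, b, y0, rng):
    """Shrinkage sampling of p(y) proportional to 1(y in C) 1(a < y < b);
    y0 is the current value, assumed to lie in C intersect (a, b)."""
    while True:
        y_star = rng.uniform(a, b)
        if in_C(y_star):
            return y_star
        # A rejected proposal shrinks the bracket; y0 always stays inside.
        if y_star < y0:
            a = max(a, y_star)
        else:
            b = min(b, y_star)
```

Each rejection strictly shrinks (a, b) around y_0, so the loop terminates quickly in practice.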
Example 1. To see how efficient this sampling strategy is, we take the target for y as a mixture of two normal densities with variances 1 and means −10 and +10, and with equal weights. That is,

\[ \pi(y) = \tfrac{1}{2}\, \mathrm{N}(y \mid -10, 1) + \tfrac{1}{2}\, \mathrm{N}(y \mid 10, 1). \]
Figure 2: Samples from latent slice algorithm from mixture of two normals
We take p(s) to be a gamma distribution with shape parameter 2 and scale parameter 100, i.e., p(s) ∝ s exp(−0.01 s), and generate 2,000 samples from the algorithm. The plot of the sampled y is given in Fig. 2. As can be seen, the mixing and accuracy of the samples are excellent. It should be noted that there are very few, if any, alternative algorithms using Markov chains which could achieve this.

From the univariate case there is an easy way to set up a multivariate latent slice sampler when y is a d-dimensional variable. The relevant joint density is now

\[ p(y, w, s, l) = \mathbf{1}(\pi(y) > w)\, p(s) \prod_{j=1}^{d} \frac{\mathbf{1}(l_j - s_j/2 < y_j < l_j + s_j/2)}{s_j}. \]

So w remains a one dimensional variable, but the other two, s and l, are both d-dimensional. The sampling strategy using a Gibbs sampler is an obvious extension of the one dimensional case. The conditional for y is given by

\[ p(y \mid w, s, l) \propto \mathbf{1}(\pi(y) > w) \prod_{j=1}^{d} \mathbf{1}(l_j - s_j/2 < y_j < l_j + s_j/2). \]

This can also be sampled using the shrinkage procedure: writing a_j = l_j − s_j/2, b_j = l_j + s_j/2, y^0 = (y^0_1, . . . , y^0_d) for the current y, and C = {y : π(y) > w}, we sample a proposal y* = (y*_1, . . . , y*_d) from ∏_{j=1}^{d} U(a_j, b_j) and accept y = y* if y* ∈ C. Otherwise, for all j = 1, . . . , d: if y*_j < y^0_j then a_j ← max{a_j, y*_j}, else b_j ← min{b_j, y*_j}.
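The following Python sketch implements one such Gibbs sweep, combining the conditionals above with the choice p(s_j) ∝ s_j e^{−λ s_j} used throughout, under which s_j given the rest is a shifted exponential. The function and argument names are our own, and the target is passed on the log scale for numerical stability; the paper's own experiments were coded in R.

```python
import numpy as np

def latent_slice_step(y, s, log_pi, lam, rng):
    """One Gibbs sweep of the multivariate latent slice sampler.
    y, s  : current d-vectors (state and scales);
    log_pi: log target density, up to an additive constant;
    lam   : rate in p(s_j) proportional to s_j * exp(-lam * s_j)."""
    d = len(y)
    # Slice variable w ~ U(0, pi(y)), kept on the log scale.
    log_w = log_pi(y) + np.log(rng.uniform())
    # l_j ~ U(y_j - s_j/2, y_j + s_j/2).
    l = y + s * (rng.uniform(size=d) - 0.5)
    # p(s_j | ...) proportional to exp(-lam*s_j) 1(s_j > 2|l_j - y_j|).
    s = 2.0 * np.abs(l - y) + rng.exponential(1.0 / lam, size=d)
    # Shrinkage sampling of y on the box (l - s/2, l + s/2) cut by {pi > w}.
    a, b = l - s / 2.0, l + s / 2.0
    while True:
        y_star = rng.uniform(a, b)
        if log_pi(y_star) > log_w:
            return y_star, s
        lower = y_star < y
        a = np.where(lower, np.maximum(a, y_star), a)
        b = np.where(lower, b, np.minimum(b, y_star))
```

Since s_j > 2|l_j − y_j|, the current y always lies inside the box, so the shrinkage loop is guaranteed to terminate.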
Example 2. As an illustration we take π(y) to be a bivariate normal density with a very high correlation; i.e., we take a mean of (0, 0) and a covariance matrix with unit variances and correlation ρ = 0.95.
It is known that slice sampling algorithms can perform poorly when the variables are highly correlated; indeed, as stated in Tibbits et al (2014), "It is particularly difficult to create an efficient sampler when there is strong dependence among the variables". We take p(s) to be independent gamma distributions with shape equal to 2 and scale equal to 10. The bivariate plot and contour of the (y_1, y_2) from the output of the sampling algorithm are presented in Fig. 3. As can be seen, this has worked extremely well.

Figure 3: Samples from latent slice algorithm from bivariate normal
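For instance, Example 2 can be reproduced along the following lines with the latent_slice_step sketch above; a gamma with shape 2 and scale 10 corresponds to λ = 0.1, while the run length and seed are arbitrary choices here.

```python
rng = np.random.default_rng(1)
rho = 0.95
prec = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))  # precision matrix
log_pi = lambda y: -0.5 * y @ prec @ y  # bivariate normal, up to a constant

y, s = np.zeros(2), np.ones(2)
draws = np.empty((5000, 2))
for t in range(5000):
    y, s = latent_slice_step(y, s, log_pi, lam=0.1, rng=rng)
    draws[t] = y  # scatter/contour these to reproduce Fig. 3
```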
Example 3. Here we consider a d = 50 dimensional example with the target density

\[ \pi(y) \propto \exp\left\{ -\tfrac{1}{2} \sum_{j=1}^{d} y_j^2 \right\}. \]

The code was written in R and 5,000 samples of y were collected; the time for execution was two seconds. We take the same p(s) as that of Example 2. The samples of y_1 are presented as a histogram in Fig. 4, along with the standard normal density function for comparison.

Figure 4: Samples of y_1 from latent slice algorithm with 50 dimensional multivariate normal target density

The algorithm of Neal (2003) is concerned with the sampling of p(y | w) ∝ 1(π(y) > w), which is uniform; let S = {y : π(y) > w}. The aim is to find an interval I = (L, R) which contains the whole, or a part, of S, and to sample a proposal y* uniformly from I, accepting it as y if y* ∈ S. Now the interval I will be constructed stochastically from the current value x, and hence, as we are dealing with uniform densities, it is required that

\[ p(y \mid x, w) = p(x \mid y, w). \]

Effectively, this boils down to the probability of getting I from x being the same as the probability of getting I from y. Neal (2003) has two key ideas for constructing I, and we will focus on the "stepping out" procedure. The idea here is to select a positive value k and an integer m ≥ 1, and set

\[ L = x - k(1 - U) \quad \text{and} \quad R = x + kU, \]

where U is a uniform random variable from (0, 1). With m = 1 this approach would coincide exactly with our own on choosing p(s) to be a point mass at s = k. This can be seen by noting that our algorithm selects l uniformly from the interval (x − k/2, x + k/2), i.e. l = x − k/2 + Uk, and then takes y* uniformly from (l − k/2, l + k/2), which can be written as (x − k(1 − U), x + kU).
To move on from this rather inflexible strategy: whereas with our algorithm we take k = s as a random variable, Neal accounts for the rigidity of k by allowing the interval to broaden out, extending L → L − k and R → R + k until π(L) < w and π(R) < w, respectively, or until J = 0 and K = 0, respectively, where J is a random number in {0, . . . , m − 1}
and K = m − 1 − J, with J and K going down by 1 every time the respective extension is made. The exact details are presented in Fig. 3 of Neal's paper, where a proof is provided that this stochastic construction of I does indeed satisfy detailed balance.

An alternative idea described in Neal (2003) is the "doubling" procedure, described in Fig. 4 of his paper. The starting point is as with the stepping out procedure, but now the interval doubles in size whenever it is allowed to grow. In short, the additional latent variables l and s we introduce at the outset obviate the need for a doubling or stepping out procedure. So while we are able to treat k = s as random within our framework, and hence deal with any issue arising as a consequence of it being fixed, it has recently been pointed out that some problems are sensitive to the choice of k within Neal's slice sampler; see Karamanis and Beutler (2020).

We compared the latent slice sampler with the slice sampling algorithm using the illustrations in section 8 of Neal's paper. The target is a funnel-like distribution of ten real-valued variables v and x_1 to x_9. The marginal distribution of v is Gaussian with mean zero and standard deviation 3. Conditional on a given value of v, the variables x_1 to x_9 are independent, with the conditional distribution for each being Gaussian with mean zero and variance e^v. This can be formulated as v ∼ N(v | 0, 9) with [x_i | v] ∼ N(x_i | 0, e^v) for i = 1, . . . , 9.
The joint distribution is therefore given by

\[ p(v, x_1, \ldots, x_9) = \mathrm{N}(v \mid 0, 9) \prod_{i=1}^{9} \mathrm{N}(x_i \mid 0, e^v). \]

Such a distribution is typical of priors for components of Bayesian hierarchical models; x_1 to x_9 might, for example, be random effects for nine subjects, with v being the log of the variance of these random effects. If the data are largely uninformative, the problem of sampling from the posterior will be similar to that of sampling from the prior. From the above framework, we know the correct marginal distribution for v, which is the focus of the illustration, and we can sample each of x_1 to x_9 given the value of v.

In Neal's paper, the single-variable slice sampling method is used to sample from a multivariate distribution by sampling repeatedly for each variable in turn; each update uses the step-out and shrinkage procedures. Fig. 5 compares the results of sampling from the funnel distribution using latent slice sampling and single-variable slice sampling. The upper plot shows 2,000 iterations of a run, obtained by subsampling 4,000,000 samples with a spacing of m = 200 to reduce the autocorrelation of successive samples; if every 200th iteration is used and the rest thrown away, this produces another reversible Markov chain, and the spacing of m = 200 yields better estimates of the true posterior by smoothing out the autocorrelation. We use a gamma distribution with shape 2 and scale 5 to randomize the "slice", i.e. p(s) ∝ s e^{−s/5}, so that the sampler is able to explore the distribution efficiently. The lower plot of Fig. 5 shows the results of sampling from the funnel distribution using single-variable slice sampling; to avoid the high autocorrelation, the same spacing of m = 200 is used to "thin" the simulations, and the resulting 2,000 updates are shown in the scatterplot.

Figure 5: Sampling the funnel distribution using latent slice sampling (dark dots) and single-variable slice sampling (blue dots)

Both the latent slice sampler and single-variable slice sampling perform fairly well, with small and large values of v sampled quite well, compared with single-variable Metropolis updates and multivariate Metropolis updates, as discussed in Neal's paper. However, the slice sampling method incurs a much greater cost in wasted computation: the average time for 10,000 iterations is at least three times that of the latent slice sampling algorithm. The simplicity of the latent slice sampler makes it favorable for sampling distributions without selecting a proposal distribution, and by using stochastic search we accelerate convergence to the stationary distribution.
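For reference, the funnel target is easy to encode and can be handed directly to the latent_slice_step sketch from section 2; a gamma with shape 2 and scale 5 corresponds to λ = 0.2, and the initialization below is an arbitrary choice of ours.

```python
def log_funnel(y):
    """Neal's funnel, up to a constant: y[0] = v, y[1:] = x_1, ..., x_9."""
    v, x = y[0], y[1:]
    return -v**2 / 18.0 - 4.5 * v - 0.5 * np.exp(-v) * (x @ x)

rng = np.random.default_rng(2)
y, s = np.zeros(10), np.full(10, 5.0)
for _ in range(10000):
    y, s = latent_slice_step(y, s, log_funnel, lam=0.2, rng=rng)
```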
In this section we present a number of illustrations. We start with two examples on discrete spaces, which involve the allocation variables in a mixture of Dirichlet process model and the number of components in a mixture model. We then consider some continuous examples, including a model in which Neal's slice sampler has been used, elliptical sampling, and then a state space model and a variable selection model where the vectors of unknowns are typically sampled componentwise using a Gibbs sampler. In these latter two examples we use the multivariate latent slice sampler to sample the entire vector as a single block.

4.1 Mixture of Dirichlet process model

Here we consider the well-known and widely used mixture of Dirichlet process (MDP) model, introduced in Lo (1984). The MDP model with Gaussian kernel is given by

\[ f(x) = \int \mathrm{N}(x \mid \mu, \sigma^2)\, dP(\theta), \]

where θ = (µ, σ²), with µ the mean and σ² the variance of the normal kernel. Let DP(α, P_0) denote a Dirichlet process prior (Ferguson, 1973) with scale parameter α > 0 and mean P_0, so that E(P) = P_0 and Var(P(A)) = P_0(A)(1 − P_0(A))/(1 + α) for all appropriate sets A. The model has been one of the most popular in Bayesian nonparametrics, and for a recent review of estimation techniques, see Hjort et al (2010).

A number of recent ideas maintain P as part of a Markov chain sampling approach and find ways to obviate the need for sampling an infinite dimensional object; the more traditional approaches marginalize it out of the model. Here we adopt the former approach, and the key idea is to introduce the latent indicator variables (d_i) which tell us which component each observation came from. Following Sethuraman (1994) we can write

\[ P = \sum_{j=1}^{\infty} w_j\, \delta_{\theta_j}, \]

where w_1 = v_1 and w_j = v_j ∏_{l<j} (1 − v_l) for j ≥ 2, with the (v_j) i.i.d. Beta(1, α) and the (θ_j) i.i.d. from P_0.

A histogram of the 400 data points with the density estimators (blue: using our new transition density approach; red: using slice sampling as described in Kalli et al (2011)) based on 5,000 samples of x_{n+1}, and with the true density (black), is provided in Fig. 6. The density estimators were obtained by applying the R density routine to the predictive samples. The advantage of using our new transition density is that we do not need any truncation of the distribution of the (d_i). After picking an appropriate k, there are no other parameters to be tuned and the algorithm itself is straightforward. It avoids the accept/reject component of a Metropolis–Hastings algorithm, the errors introduced by truncating the correct density of the (d_i), and the introduction of further latent variables to sample the (d_i).

Here we consider another mixture model, but now we use a version which works with a random number of components. For illustrative purposes we select a case where the components are fully specified: exponential densities with the integers as parameters. A more complete version of the model, with unknown normal components, was considered in Richardson and Green (1997). The model, given M, the number of components, is

\[ f(x \mid w_M, M) = \sum_{j=1}^{M} w_{jM}\, j e^{-jx}, \]

with M ∈ {1, 2, . . .}, where w_M = (w_{1M}, . . . , w_{MM}) are the weights, which sum to 1. Using the indicator variables (d_i), as in the previous section, though now given M their values will be from a finite set, the complete likelihood function is given by

\[ l(w, M \mid x, d) = \pi(w \mid M)\, \pi(M) \prod_{i=1}^{n} w_{d_i}\, d_i\, e^{-d_i x_i}. \]

Here w represents all weights for all possible M; i.e. w = (w_1, w_2, . . .). We adopt the framework of Godsill (2001) and provide a prior for each w_j given M, in the form

\[ \pi(w \mid M) = \pi(w_1 \mid w_2) \cdots \pi(w_{M-1} \mid w_M)\, \pi(w_M \mid M)\, \pi(w_{M+1} \mid w_M)\, \pi(w_{M+2} \mid w_{M+1}) \cdots. \]

The prior for M is π(M) = λ^{M−1} e^{−λ}/(M − 1)! for M = 1, 2, . . .; that is, a Poisson distribution shifted to {1, 2, . . .}. The prior for w_M given M is Dirichlet with common parameter α. To complete the prior setting, we need to specify π(w_{j+1} | w_j) and π(w_{j−1} | w_j), the latter avoiding j = 1. The former is obtained by selecting a weight from (w_{jl})_{l=1:j} and splitting it into two, u w_{jl} and (1 − u) w_{jl}, with u uniform on (0, 1). Hence, π(w_{j+1} | w_j) = 1/j.
Likewise, for π(w_{j−1} | w_j) we take w_{jj} and combine it with some w_{jl}, l ≠ j. Hence, π(w_{j−1} | w_j) = 1/(j − 1). It is then straightforward to sample w_M given M and the (d_i) given M; however, sampling M is the difficulty. The benchmark procedure here is a reversible jump Markov chain, see Green (1995), where a detailed balance condition is required. While this may not be difficult for the present mixture of known exponential components, it is far from trivial for unknown normal components; see Richardson and Green (1997). Now

\[ \pi(M \mid \cdots) \propto \pi(M)\, \pi(w_M \mid M) \prod_{i=1}^{n} \sum_{j=1}^{M} w_{jM}\, j e^{-j x_i} \times \prod_{j=1}^{M-1} \pi(w_j \mid w_{j+1}) \prod_{j=M+1}^{\infty} \pi(w_j \mid w_{j-1}). \]

It is important to note that, for any M′ ≠ M,

\[ \frac{\pi(M' \mid \cdots)}{\pi(M \mid \cdots)} = \frac{\pi(M')\, \pi(w_{M'} \mid M') \prod_{i=1}^{n} \sum_{j=1}^{M'} w_{jM'}\, j e^{-j x_i}}{\pi(M)\, \pi(w_M \mid M) \prod_{i=1}^{n} \sum_{j=1}^{M} w_{jM}\, j e^{-j x_i}}; \]

i.e., because of how we set the conditional priors for the weights, they cancel from the ratio of posteriors for different numbers of components. So we now sample the new M′ given the current M, with a chosen k, via the transition density

\[ P(M' \mid M) = \frac{\pi(M')}{k} \sum_{l = \max(M',\, M)}^{\min(M' + k - 1,\; M + k - 1)} \frac{1}{\sum_{z = \max(1,\, l - k + 1)}^{l} \pi(z)}, \qquad |M' - M| < k, \]

where π(·) here denotes the unnormalized posterior π(· | ···). A reversible jump Markov chain from a current M would typically propose a move to either M − 1 or M + 1, and up front sample a set of weights which the chain would move to if the proposed move were accepted. As with all such algorithms, it has an accept/reject component which, to reiterate, our transition density does not have.

For the demonstration we generated 400 i.i.d. data points from a single exponential density with parameter 3, i.e. f(x) = 3e^{−3x}. Fig. 7 shows the histogram of the data along with the predictive density estimate using the new transition density, shown alongside the estimate from the reversible jump algorithm. Both are clearly working well.

Figure 7: Predictive densities using new transition density and reversible jump algorithms, with histogram of data
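As an illustrative, and necessarily hypothetical, sketch, this M-update can reuse the discrete_transition function from the introduction once the terms in the ratio above are wrapped in a function; log_post_M and the truncation level M_max below are our own devices, not part of the paper's algorithm.

```python
def sample_M(M, log_post_M, k, M_max, rng):
    """One draw M' ~ P(M' | M); log_post_M(m) must return, up to a constant,
    log[ pi(m) pi(w_m | m) prod_i sum_j w_jm * j * exp(-j x_i) ]."""
    logp = np.array([log_post_M(m) for m in range(1, M_max + 1)])
    pi = np.exp(logp - logp.max())  # stabilized, unnormalized posterior
    # Shift to the 0-based support used by discrete_transition.
    return 1 + discrete_transition(M - 1, pi, k, rng)
```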
In this example we consider a continuous space, and here we label the algorithm with the new transition density as the latent slice sampler. Specifically, we compare with elliptical slice sampling, which is used in a number of models which have a multivariate Gaussian distribution as the prior; see Murray et al (2010). The objective is to sample from a posterior distribution over latent variables that is proportional to the product of a multivariate Gaussian prior and a likelihood function that ties the latent variables to the observed data.

Suppose f is the vector of latent variables that we wish to sample, having a zero-mean multivariate Gaussian prior with covariance matrix Σ; i.e. f ∼ N(0, Σ) and, for completeness, the density function is given by

\[ \mathrm{N}(f \mid 0, \Sigma) \equiv |2\pi\Sigma|^{-1/2} \exp\left( -\tfrac{1}{2} f^{T} \Sigma^{-1} f \right). \]

The data are assumed to have likelihood function L(f) = p(data | f), so that the target posterior distribution is

\[ \pi^{*}(f) \propto \mathrm{N}(f \mid 0, \Sigma)\, L(f). \]

Given a current state f, a new state can be proposed via

\[ f' = \sqrt{1 - \epsilon^2}\, f + \epsilon\, \nu \quad \text{with} \quad \nu \sim \mathrm{N}(0, \Sigma), \]

where ε ∈ [−1, 1] is a step-size parameter, and the proposal is accepted or rejected using a Metropolis–Hastings step. However, the choice of ε is crucial, and a more flexible approach would allow for a richer class of proposals. An alternative and more natural parameterization for the proposal is

\[ f' = \nu \sin\theta + f \cos\theta, \]

defining a full ellipse as θ ranges over [0, 2π]. The original strategy in Murray et al (2010) is to take θ as random and sample θ using Neal's slice sampler; here we replace this part with our latent slice sampler. The resulting algorithm is as follows.

Input: current state f, log-likelihood function log L. Output: new state f′.
1. Sample ν ∼ N(0, Σ).
2. Sample u ∼ U(0, 1) and set log w ← log L(f) + log u.
3. Sample θ using the latent slice sampler.
4. Set f′ ← f cos θ + ν sin θ; if log L(f′) > log w, accept f′, else go to step 3.

For our illustration we consider a Gaussian data model; i.e. y_i = f(x_i) + ε_i with ε_i ∼ N(0, σ²), assuming σ is known. The Gaussian process prior has covariance matrix with elements

\[ \Sigma_{i,j} = \tau \exp\left( -\frac{(x_i - x_j)^2}{\psi} \right), \]

where ψ is the "lengthscale" parameter and τ the "signal variance". So we take

\[ L(f) \propto \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i))^2 \right\}. \]

In the experiment, we generate a sequence of n = 100 evenly spaced values (x_n) over the interval [0, 1] as the input data and take the true function to be f(x_i) = sin(4πx_i) + sin(7πx_i), adding Gaussian noise with known standard deviation σ to obtain the observations (y_n). For the covariance matrix of the Gaussian process prior, we fix the lengthscale ψ and take the signal variance τ = 1. Fig. 8 shows the estimated function using our own latent slice sampler and also Neal's slice sampler. Both, obviously, perform well with fast convergence, indicating that the latent slice sampler can be applied to the vast range of Gaussian based models that currently use Gibbs, Metropolis–Hastings, or slice sampling.

Figure 8: Comparison of estimated latent function from latent slice sampling (blue) and elliptical slice sampling (red dash). The black solid curve is the observed latent function.

In this subsection we sample a 500 dimensional space: the unknown states of a state space, also known as hidden Markov, model. We consider

\[ [y_i \mid x_i] \sim \mathrm{Poisson}(\theta \exp(x_i)) \quad \text{and} \quad x_i = \rho\, x_{i-1} + \sigma z_i \]

for i = 1, . . . , n, with n = 500, x_0 = 0, and the (z_i) independent standard normal. To generate the data set we take σ = 1 and θ = 1, with ρ fixed. The joint density of x = (x_1, . . . , x_n) given θ is

\[ \pi(x \mid \theta) \propto \exp\left\{ \sum_{i=1}^{n} \left[ x_i y_i - \theta e^{x_i} - \frac{(x_i - \rho x_{i-1})^2}{2\sigma^2} \right] \right\}; \]

for simplicity we assume ρ and σ to be known, without any loss to the illustration about to be presented. Typically, π(x | θ) is sampled component by component, i.e. by sampling p(x_i | x_{−i}, θ) for i = 1, . . . , n within a Gibbs sampling framework. In some special cases, conditionally normal dynamic linear models, it can be sampled as a block by backward sampling. The most common approaches nowadays are based on particle filters; see Andrieu et al. (2010).

Using the multivariate latent slice sampling algorithm we sample the entire vector of states in one block. We assume only θ is unknown; with a gamma prior with shape and rate parameters both equal to 0.1, the conditional density of θ is gamma with shape parameter 0.1 + Σ_{i=1:n} y_i and rate parameter 0.1 + Σ_{i=1:n} e^{x_i}. The chain was run for 2,000 iterations and the time taken was 20 seconds. A plot of the posterior θ samples is presented in Fig. 9; the mean value is 0.97.

Figure 9: Posterior density of θ for state space model
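A sketch of this block update, under the same conventions as the earlier code and with our own function names, alternates the latent slice sweep on x with the conjugate gamma draw for θ:

```python
def log_x_density(x, y_obs, theta, rho, sigma):
    """log pi(x | theta), up to a constant; x_0 = 0 enters via the lag vector."""
    x_lag = np.concatenate(([0.0], x[:-1]))
    return np.sum(x * y_obs - theta * np.exp(x)
                  - 0.5 * ((x - rho * x_lag) / sigma) ** 2)

def gibbs_iteration(x, s, theta, y_obs, rho, rng, lam=0.1):
    """Block-update all n states, then draw theta | x from its gamma
    conditional (shape 0.1 + sum y_i, rate 0.1 + sum exp(x_i))."""
    x, s = latent_slice_step(
        x, s, lambda z: log_x_density(z, y_obs, theta, rho, 1.0), lam, rng)
    theta = rng.gamma(0.1 + y_obs.sum(), 1.0 / (0.1 + np.exp(x).sum()))
    return x, s, theta
```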
In this subsection we consider a popular approach to variable selection within the Bayesian framework; namely, the spike and slab prior (George and McCulloch, 1993). The model is given by

\[ Y = X\beta + \epsilon, \qquad \epsilon \sim \mathrm{N}(0, \sigma^2 I_n), \]

where Y ∈ R^n is a vector of responses, X = [X_1, . . . , X_p] ∈ R^{n×p} is a regression matrix of p predictors, β = (β_1, . . . , β_p)^T ∈ R^p is a vector of unknown regression coefficients, and ε ∈ R^n is the noise vector of independent normal random variables with σ² as their unknown common variance. The spike and slab prior for β is given by

\[ \pi(\beta) \propto \prod_{j=1}^{p} \left[ \sigma_0^{-1} \exp\left(-\beta_j^2/(2\sigma_0^2)\right) + \sigma_1^{-1} \exp\left(-\beta_j^2/(2\sigma_1^2)\right) \right], \]

where σ_0 ≈ 0 yields the spike and σ_1 ≈ ∞ yields the slab. Markov chain Monte Carlo methods for this model require the Gibbs sampling of β_j conditional on β_{−j}, i.e. the vector of β without the β_j; see, for example, Narisetty and He (2014). Here we use the latent slice sampler to sample β as one block.

We assume σ = 1 is known and generate data for n = 100 with p = 90. We take β_1 = 1, β_2 = 5 and the remaining β_j = 0. All the elements of the design matrix X are generated as independent standard normal random variables. We take σ_0 small and σ_1 = 10; writing down the posterior for β is quite straightforward, and it is in particular easy to compute for any given value of β. We ran the latent slice sampler for 10,000 iterations, taking a few seconds to complete the task.

For illustration we present the posterior samples for β_1, β_2 and β_3, the true values being 1, 5 and 0, respectively. As is visible from Fig. 10, the samples are accumulating at the correct locations.

Figure 10: Posterior samples of β_1, β_2 and β_3 from spike and slab model
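Under the same conventions, a sketch of the unnormalized log posterior for β that can be handed to latent_slice_step is below; since the exact value of σ_0 is not legible in the source, the spike scale here is an illustrative placeholder.

```python
def log_post_beta(beta, X, Y, s0=0.1, s1=10.0):
    """Unnormalized log posterior of beta: Gaussian likelihood (sigma = 1)
    plus the two-component spike-and-slab prior, coordinatewise."""
    resid = Y - X @ beta
    log_prior = np.log(np.exp(-0.5 * (beta / s0) ** 2) / s0
                       + np.exp(-0.5 * (beta / s1) ** 2) / s1)
    return -0.5 * resid @ resid + log_prior.sum()
```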
In this paper we have presented a generic sampling algorithm which has the ability to sample efficiently, and at great speed, from very high dimensional distribution functions. The key is the latent model combined with the shrinkage procedure based on uniform distributions and an automatic reversible condition. Given the simplicity of the algorithm, we present it here in the general d-dimensional case, with target density π(y) and y = (y_1, . . . , y_d). Let λ = 0.1, for example; we describe a single loop with current values y = (y_1, . . . , y_d) and s = (s_1, . . . , s_d).

1. Sample w ∼ U(0, π(y)) and, for j = 1, . . . , d, sample l_j ∼ U(y_j − s_j/2, y_j + s_j/2) and sample s_j from the density proportional to exp(−λ s_j) 1(s_j > 2|l_j − y_j|).
2. Set a_j = l_j − s_j/2 and b_j = l_j + s_j/2.
3. For j = 1, . . . , d, sample y*_j ∼ U(a_j, b_j). If π(y*) > w, accept y = y*; else, for j = 1, . . . , d, if y*_j < y_j then a_j ← max{a_j, y*_j}, else b_j ← min{b_j, y*_j}.
4. Repeat step 3 until π(y*) > w and set y = y*.

As we have demonstrated, such an algorithm can work with a nonlinear state space model of dimension 500 and return output in a short time. Future work will consider sampling on constrained spaces, such as uniform sampling on polytopes, and from truncated distributions, such as the multivariate normal (Robert, 1995; Damien and Walker, 2001).

References

Andrieu, C., Doucet, A. and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B, 269–342.
Besag, J. and Green, P. J. (1993). Spatial statistics and Bayesian computation. Journal of the Royal Statistical Society, Series B, 25–37.
Damien, P., Wakefield, J. C. and Walker, S. G. (1999). Gibbs sampling for Bayesian nonconjugate and hierarchical models using auxiliary variables. Journal of the Royal Statistical Society, Series B, 331–344.
Damien, P. and Walker, S. G. (2001). Sampling truncated normal, beta and gamma densities. Journal of Computational and Graphical Statistics, 206–215.
Ekin, T., Walker, S. G. and Damien, P. (2020). Augmented simulation methods for discrete stochastic optimization with recourse. To appear in Annals of Operations Research.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 209–230.
George, E. I. and McCulloch, R. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 881–889.
Godsill, S. J. (2001). On the relationship between MCMC methods for model uncertainty. Journal of Computational and Graphical Statistics, 230–248.
Green, P. J. (1995). Reversible jump MCMC computation and Bayesian model determination. Biometrika, 711–732.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 97–109.
Hjort, N. L., Holmes, C., Mueller, P. and Walker, S. G. (2010). Bayesian Nonparametrics. Cambridge University Press.
Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 161–173.
Kalli, M., Griffin, J. E. and Walker, S. G. (2011). Slice sampling mixture models. Statistics and Computing, 93–105.
Karamanis, M. and Beutler, F. (2020). Ensemble slice sampling. arXiv:2002.06212v1.
Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. Annals of Statistics, 351–357.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 1087–1092.
Mira, A. and Tierney, L. (2002). Efficiency and convergence properties of slice samplers. Scandinavian Journal of Statistics, 1–12.
Murray, I., Adams, R. P. and MacKay, D. J. C. (2010). Elliptical slice sampling. Journal of Machine Learning Research, 541–548.
Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. Annals of Statistics, 789–817.
Neal, R. M. (2003). Slice sampling. Annals of Statistics, 705–767.
Nishihara, R., Murray, I. and Adams, R. P. (2014). Parallel MCMC with generalized elliptical slice sampling. Journal of Machine Learning Research, 2087–2112.
Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B, 731–792.
Robert, C. P. (1995). Simulation of truncated normal variables. Statistics and Computing, 121–125.
Roberts, G. O. and Rosenthal, J. S. (1999). Convergence of slice sampler Markov chains. Journal of the Royal Statistical Society, Series B, 643–660.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 639–650.
Tibbits, M. M., Haran, M. and Liechty, J. C. (2011). Parallel multivariate slice sampling. Statistics and Computing, 415–430.
Tibbits, M. M., Groendyke, C., Haran, M. and Liechty, J. C. (2014). Factor slice sampling. Journal of Computational and Graphical Statistics, 543–563.
Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics, 45–54.
Walker, S. G. (2014). Sampling un-normalized probabilities: An alternative to the Metropolis–Hastings algorithm. SIAM Journal on Scientific Computing, 36.