A Piecewise Deterministic Markov Process via (r,θ) swaps in hyperspherical coordinates
Alexander Terenin, Daniel Thorngren
Department of Mathematics, Imperial College London, London, UK; [email protected]
Department of Physics, University of California, Santa Cruz, CA, USA; [email protected]
July 3, 2018, submitted to Proceedings
Abstract:
Recently, a class of stochastic processes known as piecewise deterministic Markov processes has been used to define continuous-time Markov chain Monte Carlo algorithms with a number of attractive properties, including compatibility with stochastic gradients like those typically found in optimization and variational inference, and high efficiency on certain big data problems. Not many processes in this class that are capable of targeting arbitrary invariant distributions are currently known, and within one subclass all previously known processes utilize linear transition functions. In this work, we derive a process whose transition function is nonlinear through solving its Fokker-Planck equation in hyperspherical coordinates. We explore its behavior on Gaussian targets, as well as a Bayesian logistic regression model with synthetic data. We discuss implications to both the theory of piecewise deterministic Markov processes, and to Bayesian statisticians as well as physicists seeking to use them for simulation-based computation.
Keywords:
Bayesian Statistics, Big Data, Continuous-time MCMC, Markov Chain Monte Carlo, Piecewise Deterministic Markov Process
1. Introduction
The Bayesian statistical paradigm possesses many desirable properties, including the ability to quantify uncertainty about a set of estimated parameters. However, using it entails the computation of posterior probability distributions – this tends to be expensive, because these are inherently complicated, and depend on the data used to define them. This is especially challenging in modern application areas such as natural language processing and analysis of internet data, which involve large data sets. Creating algorithms that scale well with data size is a current area of research.

Very recently, a class of algorithms called piecewise deterministic Markov processes (PDMPs) [1] has been proposed with some surprising properties making them attractive for this task. In particular, PDMPs – like stochastic gradient descent (SGD) [2] and stochastic variational inference (SVI) [3] – can be used for simulation-based inference under an exchangeable model without performing any full-data computations. Unlike SGD or SVI, however, PDMPs target the correct posterior distribution π and do not entail the use of point estimates or distributional approximations. Furthermore, subject to a one-off calculation, their computational cost can be O(1) with respect to data size [4,5].

These advantages have led to increased interest in studying PDMPs – particularly since, at present, only a small number of PDMPs invariant with respect to arbitrary target distributions are known. These include the bouncy particle sampler [6,7], zig-zag [5], scalable Langevin exact [4], and a few other variants whose comparative behavior is not yet well-understood. In particular, all known PDMPs with constant deterministic dynamics also utilize linear transition functions. In this work, we define a PDMP whose transition function is nonlinear and develop methods for computing with it. Our contribution is purely theoretical. The process is derived in Section 3. We present empirical evaluation in Sections 4 and 5.
We discuss the implications of these results in Section 6.
2. Piecewise Deterministic Markov Processes
Piecewise deterministic Markov processes are a class of stochastic processes first introduced by Davis [8] and described in the Markov chain Monte Carlo (MCMC) context by Fearnhead et al. [1]. All such algorithms evolve in part deterministically, and in part according to a Markov jump process. These are fully described by three components.

(1) Deterministic Dynamics: a function Φ that determines the process' behavior between jumps, typically specified through a system of differential equations.
(2) Switching Rate: a function λ that specifies the intensity of the jumps at each state.
(3) Transition Distribution: a probability measure Q that specifies what states the process jumps to.

We refer the reader to Fearnhead et al. [1] for a detailed introduction. One appealing property of PDMPs is that they can be simulated exactly – meaning with no discretization error – through the use of techniques such as Poisson thinning. This is because the switches evolve according to a nonhomogeneous Poisson process, which can be simulated by proposing switches from a Poisson process with greater intensity, and accepting or rejecting the proposals. Other techniques, such as Poisson inversion, can also be considered [1]. In between switches, the algorithm's behavior is deterministic, and can be calculated exactly provided Φ is tractable.

Another appealing property is that most PDMPs do not require the evaluation of the target distribution π(x) directly, and only depend on it through functions of ∇ln π(x). This term is amenable to unbiased estimation: ∇ln π(x) can be expressed as an expectation with respect to some auxiliary variable, which can then be replaced with an unbiased estimate – all without violating stationarity with respect to π. Indeed, Bouchard-Côté et al. [7], Pollock et al. [4], Bierkens et al. [5], and Vanetti et al. [9] use PDMPs with stochastic gradients to obtain state-of-the-art performance on certain big data problems. Moreover, Bierkens et al. [5] and Pollock et al. [4] have shown that, through introducing a control variate that can be computed using a one-off O(N) calculation, the computational cost of certain such algorithms can be O(1). This makes PDMPs appealing in a big data setting.
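As an illustration of how such proposals are generated, the sketch below (with a hypothetical bounded intensity, not one arising from any particular PDMP) simulates the first switch time of a nonhomogeneous Poisson process by thinning:

```python
import numpy as np

def first_event_by_thinning(rate, rate_bound, rng, t_max=1e6):
    """First event time of a nonhomogeneous Poisson process with
    intensity rate(t) <= rate_bound, simulated by Poisson thinning."""
    t = 0.0
    while t < t_max:
        # Propose the next event from a homogeneous process with the larger rate.
        t += rng.exponential(1.0 / rate_bound)
        # Accept the proposal with probability rate(t) / rate_bound.
        if rng.uniform() < rate(t) / rate_bound:
            return t
    return t_max

rng = np.random.default_rng(0)
# Hypothetical intensity 2 + sin(t), bounded above by 3.
t_switch = first_event_by_thinning(lambda t: 2.0 + np.sin(t), 3.0, rng)
```

When the true intensity equals the bound, every proposal is accepted and the switch time is simply exponential; in a PDMP the bound only needs to hold along the current deterministic trajectory.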
3. Piecewise Linear Markov Processes
We now proceed to describe the class of PDMPs studied in this work, which includes the Bouncy Particle Sampler and Pure Reflection process as special cases.
Definition 1.
Consider a PDMP in the sense of Fearnhead et al. [1]. Let $\pi(v, x) = \pi(v)\pi(x)$ with $\pi(v)$ standard multivariate Gaussian and $\pi(x)$ the target distribution of interest. Let $Q$ be the Dirac measure centered at $(x, F_x(v))$ for some function $F_x$, called the transition function. Define the following.

1. Deterministic Dynamics:
$$\frac{dx}{dt} = v, \qquad \frac{dv}{dt} = 0. \tag{1}$$

2. Switching Rate:
$$\lambda(x, v) = \max\{0, -v \cdot \nabla \ln \pi(x)\}. \tag{2}$$

3. Transition Function:
$$F_x(v) \text{ s.t. } F_x^{-1}(v) \text{ exists and } \pi(x, v) \text{ is stationary.} \tag{3}$$

Call a PDMP satisfying these conditions a Piecewise Linear Markov Process.

Both the Pure Reflection process of Fearnhead et al. [1] and the Bouncy Particle Sampler of Bouchard-Côté et al. [7] are examples within this subclass of PDMPs, with
$$F_x(v) = -v \qquad \text{and} \qquad F_x(v) = v - 2\,\frac{v \cdot \nabla \ln \pi(x)}{\|\nabla \ln \pi(x)\|^2}\,\nabla \ln \pi(x) \tag{4}$$
respectively. In both cases, $F_x$ changes only the process' velocity $v$, and is linear – the latter expression can be viewed as a reflection with respect to a hyperplane normal to $\nabla \ln \pi(x)$. In this work, we begin by asking the following question: does there exist a process within this class for which $F_x$ is nonlinear? For such a process to exist, invariance must hold, which means that $\pi(x, v)$ must be a zero of the Fokker-Planck Equation. We proceed to derive this equation for processes given by Definition 1, which differ slightly from those considered by Fearnhead et al. [1].

Lemma 2.
The Fokker-Planck Equation for a Piecewise Linear Markov Process is given by
$$\lambda(x, v)\,\pi(v) - \lambda[x, F_x^{-1}(v)]\,\pi[F_x^{-1}(v)]\left|\frac{\partial F_x^{-1}(v)}{\partial v}\right| = -v \cdot [\nabla \ln \pi(x)]\,\pi(v) \tag{5}$$
where $\lambda(x, v) = \max\{0, -v \cdot \nabla \ln \pi(x)\}$.

Proof.
To simplify notation, we first consider general PDMPs – within this lemma let $z = (v, x)$, $F(z) = (x, F_x(v))$, let $\Phi$ be the deterministic dynamics, and let $Q$ be a Dirac measure centered at $F(z)$. We derive the Fokker-Planck Equation from the infinitesimal generator
$$\mathcal{A} f(z) = \Phi(z) \cdot \nabla f(z) + \lambda(z) \int_\Omega f(z')\, dQ(z') - \lambda(z) f(z) \tag{6}$$
given by Davis [8], by finding its formal adjoint $\mathcal{A}^*$ satisfying
$$\int_\Omega \pi(z)\, \mathcal{A} f(z)\, dz = \int_\Omega f(z)\, \mathcal{A}^* \pi(z)\, dz. \tag{7}$$
We proceed as follows. First, note that by linearity we may find the formal adjoint of $\mathcal{A}$ component-wise. It is shown in Fearnhead et al. [1] that the components
$$\Phi(z) \cdot \nabla f(z) \qquad \text{and} \qquad -\lambda(z) f(z) \tag{8}$$
map to
$$-\sum_{i=1}^p \frac{\partial}{\partial z_i}\left[\Phi_i(z)\,\pi(z)\right] \qquad \text{and} \qquad -\lambda(z)\,\pi(z) \tag{9}$$
respectively. For the remaining component, we can write
$$\int_\Omega \pi(z)\lambda(z) \int_\Omega f(z')\, dQ(z')\, dz = \int_\Omega \pi(z)\lambda(z) f[F(z)]\, dz = \int_\Omega f(\tilde z)\, \pi[F^{-1}(\tilde z)]\, \lambda[F^{-1}(\tilde z)] \left|\frac{\partial F^{-1}(\tilde z)}{\partial \tilde z}\right| d\tilde z \tag{10}$$
where the change of variables is justified because $F_x$ is assumed invertible everywhere, except possibly a set of measure zero, in which case we may divide $\Omega$ accordingly and invert $F_x$ piecewise. Note the presence of a Jacobian term that does not explicitly appear in the derivation of Fearnhead et al. [1] because they consider a slightly different case. Thus we have
$$\mathcal{A}^* \pi(z) = -\sum_{i=1}^p \frac{\partial}{\partial z_i}\left[\Phi_i(z)\,\pi(z)\right] + \pi[F^{-1}(z)]\,\lambda[F^{-1}(z)] \left|\frac{\partial F^{-1}(z)}{\partial z}\right| - \lambda(z)\,\pi(z). \tag{11}$$
For $\pi(z)$ to be invariant, we must have $\mathcal{A}^* \pi(z) = 0$. Consider now our case, where $F(x, v) = (x, F_x(v))$, and $\pi(z) = \pi(v)\pi(x)$. The expression then simplifies to the desired result.
Both the Pure Reflection Process of Fearnhead et al. [1] and the Bouncy Particle Sampler of Bouchard-Côté et al. [7] are processes within Definition 1. Observe that they both solve the Fokker-Planck Equation by letting
$$\|F_x^{-1}(v)\| = \|v\|, \qquad F_x^{-1}(v) \cdot \nabla \ln \pi(x) = -v \cdot \nabla \ln \pi(x), \qquad \left|\frac{\partial F_x^{-1}(v)}{\partial v}\right| = 1. \tag{12}$$
Both of these solutions are magnitude-preserving. This motivates us to further ask: are there solutions that are not magnitude-preserving? We now proceed to find such a solution.
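The magnitude-preserving properties above are straightforward to verify numerically for the bouncy particle sampler. The following sketch (our own minimal check, using arbitrary random vectors) implements the reflection and can be used to confirm that it preserves ‖v‖, negates v · ∇ln π(x), and is an involution:

```python
import numpy as np

def bps_reflect(v, grad_log_pi):
    """Bouncy particle sampler transition: reflect the velocity v with
    respect to the hyperplane normal to grad log pi at the current x."""
    g = grad_log_pi
    return v - 2.0 * (v @ g) / (g @ g) * g

rng = np.random.default_rng(1)
v = rng.standard_normal(5)
g = rng.standard_normal(5)  # stand-in for grad log pi(x)
w = bps_reflect(v, g)
```

Because the reflection is an involution, F_x = F_x^{-1} for this transition function.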
Theorem 3.
Let $r = \|v\|$ and $\theta$ be the angle between $\nabla \ln \pi(x)$ and $v$ along the hyperplane spanned by both vectors. Consider a transition function $F_x$ which maps $v$ to another vector on that hyperplane, which is fully determined by the coordinates $r', \theta'$. Suppose that $r'$ is only a function of $\theta$ and $\theta'$ is only a function of $r$. Then, letting $k$ be a positive constant and $p$ be the dimension of $v$, we have that for every $r$, if we take $\theta'$ to be the solution of the differential equation
$$\frac{d\theta'(r)}{dr} = \frac{k\, r^p \exp\{-r^2/2\}}{\cos[\theta'(r)]\,\sin^{p-2}[\theta'(r)]} \tag{13}$$
subject to the boundary conditions $\theta'(0) = \pi$ and $\theta'(\infty) = \pi/2$ which fully determine $k$, and if we take $r'(\theta)$ to be the above solution's inverse, the resulting PDMP is $\pi$-invariant.

Proof.
By Lemma 2, the Fokker-Planck Equation is
$$\max\{0, -v \cdot \nabla \ln \pi(x)\}\,\pi(v) - \max\{0, -F_x^{-1}(v) \cdot \nabla \ln \pi(x)\}\,\pi[F_x^{-1}(v)] \left|\frac{\partial F_x^{-1}(v)}{\partial v}\right| = -v \cdot \nabla \ln \pi(x)\,\pi(v) \tag{14}$$
which for $v \cdot \nabla \ln \pi(x) > 0$ requires $F_x^{-1}(v) \cdot \nabla \ln \pi(x) < 0$, which we henceforth assume. For $v \cdot \nabla \ln \pi(x) < 0$, the equation is satisfied automatically provided that $F_x^{-1}(v) \cdot \nabla \ln \pi(x) > 0$. Substituting in $\pi(v) \propto \exp\{-\|v\|^2/2\}$, we can write
$$-F_x^{-1}(v) \cdot \nabla \ln \pi(x)\,\exp\left\{-\frac{\|F_x^{-1}(v)\|^2}{2}\right\} \left|\frac{\partial F_x^{-1}(v)}{\partial v}\right| = v \cdot \nabla \ln \pi(x)\,\exp\left\{-\frac{\|v\|^2}{2}\right\}. \tag{15}$$
Now, transform to hyperspherical coordinates, by letting
$$v_1 = r\cos(\theta), \qquad v_2 = r\sin(\theta)\cos(\phi_1), \qquad \ldots, \qquad v_{p-1} = r\sin(\theta)\left[\prod_{i=1}^{p-3}\sin(\phi_i)\right]\cos(\phi_{p-2}), \qquad v_p = r\sin(\theta)\prod_{i=1}^{p-2}\sin(\phi_i) \tag{16}$$
where $\theta$ is the angle on the hyperplane spanned by $v$ and $\nabla \ln \pi(x)$, and $\phi_i$ are angles on an arbitrary set of hyperplanes orthogonal to $v$ and $\nabla \ln \pi(x)$. For this transformation, we have the identities
$$\|v\| = r, \qquad \frac{v \cdot \nabla \ln \pi(x)}{\|v\|\,\|\nabla \ln \pi(x)\|} = \cos(\theta) \tag{17}$$
and letting $r', \theta', \phi'$ be the coordinates under $F_x^{-1}$, i.e. functions of $r, \theta, \phi$, the Fokker-Planck equation becomes
$$-\cos(\theta')\, r' \exp\left\{-\frac{r'^2}{2}\right\} \left|\frac{\partial F_x^{-1}(v)}{\partial v}\right| = \cos(\theta)\, r \exp\left\{-\frac{r^2}{2}\right\}. \tag{18}$$
We can decompose the Jacobian into
$$\left|\frac{\partial F_x^{-1}(v)}{\partial v}\right| = \left|\frac{\partial F_x^{-1}(v)}{\partial (r', \theta', \phi')}\right| \left|\frac{\partial (r', \theta', \phi')}{\partial (r, \theta, \phi)}\right| \left|\frac{\partial (r, \theta, \phi)}{\partial v}\right| \tag{19}$$
which, since the Jacobian for hyperspherical coordinates is
$$\left|\frac{\partial (r, \theta, \phi)}{\partial v}\right| = \left[r^{p-1}\sin^{p-2}(\theta)\prod_{i=1}^{p-2}\sin^{p-2-i}(\phi_i)\right]^{-1} \tag{20}$$
yields the system
$$-\cos(\theta')\sin^{p-2}(\theta')\left[\prod_{i=1}^{p-2}\sin^{p-2-i}(\phi'_i)\right] r'^p \exp\left\{-\frac{r'(r,\theta)^2}{2}\right\} \left|\frac{\partial (r', \theta', \phi')}{\partial (r, \theta, \phi)}\right| = \cos(\theta)\sin^{p-2}(\theta)\left[\prod_{i=1}^{p-2}\sin^{p-2-i}(\phi_i)\right] r^p \exp\left\{-\frac{r^2}{2}\right\}. \tag{21}$$
Now, suppose that we are interested in solutions where $\phi' = \phi$, $\theta'$ is only a function of $r$ and $r'$ is only a function of $\theta$. The Jacobian is just
$$\left|\frac{\partial (r', \theta', \phi')}{\partial (r, \theta, \phi)}\right| = \left|\det\begin{pmatrix} \partial r'/\partial r & \partial r'/\partial \theta & \\ \partial \theta'/\partial r & \partial \theta'/\partial \theta & \\ & & I_{p-2} \end{pmatrix}\right| = \left|\frac{\partial r'}{\partial \theta}\,\frac{\partial \theta'}{\partial r}\right|. \tag{22}$$
Under these assumptions, the Fokker-Planck Equation becomes
$$-\cos[\theta'(r)]\sin^{p-2}[\theta'(r)]\, r'(\theta)^p \exp\left\{-\frac{r'(\theta)^2}{2}\right\} \left|\frac{\partial r'}{\partial \theta}\,\frac{\partial \theta'}{\partial r}\right| = \cos(\theta)\sin^{p-2}(\theta)\, r^p \exp\left\{-\frac{r^2}{2}\right\} \tag{23}$$
which we can multiply on both sides by an arbitrary constant $k$, then factorize into the system
$$-\cos[\theta'(r)]\sin^{p-2}[\theta'(r)] \left|\frac{d\theta'(r)}{dr}\right| = k\, r^p \exp\left\{-\frac{r^2}{2}\right\}, \qquad \frac{1}{k}\, r'(\theta)^p \exp\left\{-\frac{r'(\theta)^2}{2}\right\} \left|\frac{dr'(\theta)}{d\theta}\right| = \cos(\theta)\sin^{p-2}(\theta) \tag{24}$$
and rewrite as
$$\frac{d\theta'(r)}{dr} = \frac{k\, r^p \exp\{-r^2/2\}}{\cos[\theta'(r)]\sin^{p-2}[\theta'(r)]}, \qquad \frac{dr'(\theta)}{d\theta} = \frac{k\cos(\theta)\sin^{p-2}(\theta)}{r'(\theta)^p \exp\{-r'(\theta)^2/2\}}. \tag{25}$$
Notice that these differential equations are reciprocals of one another: therefore, subject to identical initial conditions, $\theta'(r)$ and $r'(\theta)$ are inverse functions. We have thus shown that $F_x = F_x^{-1}$, and therefore need not consider the inversion. We impose the boundary conditions
$$\theta'(0) = \pi, \qquad \theta'(\infty) = \pi/2 \tag{26}$$
under which the above differential equations can be solved analytically. Since these solutions are monotonic, the result follows.

Though the above differential equations can be solved analytically, computation using them is intractable because they are not numerically stable due to the presence of large powers. Indeed, for moderate $p$, to satisfy the boundary conditions the constant $k$ needs to be taken closer to zero than the smallest positive number available in double precision arithmetic. As a result, we cannot proceed directly. We instead consider the differential equation's asymptotic form for large $p$ – this introduces some approximation error that vanishes in high dimension.

Proposition 4.
For large $p$, we have
$$\theta'(r) \approx \frac{\pi}{2} + \sqrt{-\frac{2}{p-1}\ln \Phi\left[\sqrt{2}\,(r - \sqrt{p})\right]} \tag{27}$$
where $\Phi$ is the CDF of a unit Gaussian, in the sense that $\theta'(r)$ is the solution of a differential equation whose right-hand side is the pointwise limit as $p \to \infty$ of the equation in Theorem 3.

Proof.
It is a standard result that
$$\lim_{p \to \infty}\left|\frac{\sqrt{\pi}\, 2^{(1-p)/2}}{\Gamma[(p+1)/2]}\, r^p \exp\left\{-\frac{r^2}{2}\right\} - \exp\left\{-(r - \sqrt{p})^2\right\}\right| = 0, \qquad r \in \mathbb{R}^+, \tag{28}$$
as the former is, up to normalization, the density of a $\chi$ distribution, and that
$$\lim_{p \to \infty}\left|\cos(\theta)\sin^p(\theta) + \left(\theta - \frac{\pi}{2}\right)\exp\left\{-\frac{(\theta - \pi/2)^2\, p}{2}\right\}\right| = 0, \qquad \theta \in \left[\frac{\pi}{2}, \pi\right]. \tag{29}$$
Therefore, the limiting form for our differential equation is
$$\frac{d\theta'(r)}{dr} = \frac{k' \exp\left\{-(r - \sqrt{p})^2\right\}}{\left(\theta'(r) - \frac{\pi}{2}\right)\exp\left\{-\frac{(\theta'(r) - \pi/2)^2 (p-1)}{2}\right\}} \tag{30}$$
for some constant $k'$, which has analytic solution
$$\theta'(r) = \frac{\pi}{2} + \sqrt{c_2 - \frac{2}{p-1}\ln\left[\pm 1 + c_1\,\mathrm{erf}(r - \sqrt{p})\right]} \tag{31}$$
for arbitrary constants $c_1, c_2$. We must choose $c_1, c_2$ such that $F_x$ is invertible, and $r$ is positive. If we set $\theta'(0) = \pi$, $\theta'(\infty) = \pi/2$, we obtain that $\pm 1$ should be taken to be $+1$ and
$$c_1 = \frac{\exp\left\{\frac{\pi^2(p-1)}{8}\right\} - 1}{\exp\left\{\frac{\pi^2(p-1)}{8}\right\} + 1} \approx 1, \qquad c_2 = \frac{2\ln(1 + c_1)}{p-1} \approx \frac{2\ln 2}{p-1} \tag{32}$$
which yields the desired result.

For $c_1$ and $c_2$ as above, the solution is strictly monotone and positive everywhere, except possibly on an interval near the origin. From a practical perspective, this is not a concern, as the probability of landing in those states is exceedingly small and never occurred in our simulations. The inverse function $r'(\theta)$ is obtained numerically, which can easily be done as $\theta'(r)$ is one-dimensional. This completes our derivation.
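As a sketch of the resulting computation – assuming the asymptotic form θ′(r) = π/2 + [−(2/(p−1)) ln Φ(√2(r−√p))]^{1/2} and inverting it by one-dimensional bisection – the swap can be evaluated as follows; the clipping constant guarding against floating-point underflow in the Gaussian tail is our own implementation choice:

```python
import math

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def theta_prime(r, p):
    """Asymptotic angle swap theta'(r) for dimension p (large-p form)."""
    phi = std_normal_cdf(math.sqrt(2.0) * (r - math.sqrt(p)))
    phi = min(max(phi, 1e-300), 1.0)  # guard against underflow to 0
    return math.pi / 2.0 + math.sqrt(-2.0 / (p - 1) * math.log(phi))

def r_prime(theta, p, lo=0.0, hi=None, iters=200):
    """Invert theta'(r) by bisection: theta'(r) is decreasing in r."""
    if hi is None:
        hi = math.sqrt(p) + 40.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if theta_prime(mid, p) > theta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p = 100
theta_new = theta_prime(9.0, p)       # angle after the switch
r_recovered = r_prime(theta_new, p)   # numerical inverse recovers r = 9
```

Since θ′(r) is a scalar function, the bisection is cheap, and a lookup table over r could equally be precomputed once per dimension.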
4. Example: Independent Gaussian Target
To understand the algorithm's behavior, we implemented it for a standard multivariate Gaussian target and compared it against the bouncy particle sampler. We examined three targets with dimension p = 10, 100, 1,000. All were implemented using Poisson thinning with identical velocity-dependent switching rate bound 5‖v‖, which was never exceeded outside of burn-in. Velocity was resampled according to a homogeneous Poisson process with intensity 0.2. Each algorithm was given a fixed computational budget consisting of 100,000 gradient evaluations, and started from initial values of (10, …, 10) selected to be away from the target mode. This implementation avoids using analytic properties of Gaussians, to better mimic real-world scenarios.

Figure 1. Trace plots for the first coordinate of a multivariate Gaussian of dimension p ∈ {10, 100, 1000}. Hyperspherical refers to the process derived in Section 3, BPS refers to the bouncy particle sampler.

Trace plots of the resulting chains can be seen in Figure 1. It can be seen that for p = 10, both algorithms produce similar output. Behavior for p = 100 and p = 1,000 is discussed further in Section 6.
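For a standard Gaussian target, the bouncy particle sampler can also be simulated without thinning, since the rate max{0, v · x(t)} is piecewise linear along the deterministic dynamics and can be integrated in closed form (Poisson inversion). The following sketch is our own minimal implementation, initialized at the target for simplicity rather than at the displaced starting values used in the experiment:

```python
import numpy as np

def bps_standard_gaussian(p=2, total_time=5000.0, refresh_rate=1.0, seed=0):
    """Bouncy particle sampler for N(0, I): the event rate max(0, v.x) is
    integrated in closed form along x(t) = x + v t (Poisson inversion)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(p)   # initialize at the target for simplicity
    v = rng.standard_normal(p)
    t = 0.0
    m1 = np.zeros(p)             # time integral of x
    m2 = np.zeros(p)             # time integral of x**2
    while t < total_time:
        a, b = v @ x, v @ v
        e = rng.exponential()
        # Solve int_0^tau max(0, a + b s) ds = e for the bounce time tau.
        if a >= 0.0:
            tau_bounce = (-a + np.sqrt(a * a + 2.0 * b * e)) / b
        else:
            tau_bounce = -a / b + np.sqrt(2.0 * e / b)
        tau_refresh = rng.exponential(1.0 / refresh_rate)
        tau = min(tau_bounce, tau_refresh, total_time - t)
        # Accumulate exact path integrals of x and x**2 over the segment.
        m1 += x * tau + 0.5 * v * tau**2
        m2 += x**2 * tau + x * v * tau**2 + v**2 * tau**3 / 3.0
        x = x + tau * v
        t += tau
        if tau == tau_refresh:
            v = rng.standard_normal(p)      # velocity refreshment
        elif tau == tau_bounce:
            g = -x                          # gradient of log pi for N(0, I)
            v = v - 2.0 * (v @ g) / (g @ g) * g
    return m1 / total_time, m2 / total_time

mean_est, second_moment_est = bps_standard_gaussian()
```

The time averages approximate E[x] = 0 and E[x²] = 1; without occasional velocity refreshment the sampler would not be ergodic on an isotropic Gaussian.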
5. Example: Bayesian Logistic Regression
To examine the performance on a Bayesian model with known correct answer, we implemented the algorithm for a Bayesian logistic regression problem with synthetic data. Data was generated by taking
$$x_i \overset{iid}{\sim} N_p(0, I), \qquad \beta = (-1,\, 1.6,\, 5,\, 0, \ldots, 0)^T, \qquad y \sim \mathrm{Ber}\left[\Psi(X\beta)\right] \tag{33}$$
where $\Psi$ is the logistic function. We used the logistic regression model
$$y \mid \beta \sim \mathrm{Ber}\left[\Psi(X\beta)\right], \qquad \beta \sim N_p(0, 10^{-1} I). \tag{34}$$
We selected N = 1,000,000, together with a velocity-dependent switching rate bound proportional to ‖v‖, which was selected to be sufficiently large to ensure it was not exceeded more than 1% of the time. Velocity was resampled according to a homogeneous Poisson process with intensity 10.

Computation was performed as follows. First, a point estimate β̂ of the posterior mode was obtained using stochastic gradient descent, consisting of 100,000 steps, each with a batch size of 10, using a total of N data points. Then, the data was used to precompute ∇ln π(β̂), which was then used to implement the control variate of Bierkens et al. [5] and Pollock et al. [4]. Finally, sampling was performed, using 100,000 stochastic gradient evaluations, each with a batch size of 10, with the control variate used to reduce variance.

Results can be seen in Figure 2. Given the extremely limited nature of our computational budget – a total of 3N evaluations of (x_i, y_i) pairs – both algorithms obtained reasonable posterior samples, performing similarly. We find this remarkable: standard MCMC methods such as Gibbs sampling [10] and Hamiltonian Monte Carlo [11] would not generally produce useful output under such constraints.
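The control-variate construction described above can be sketched as follows; the function names and batching scheme are our own illustration, with the full-data log-likelihood gradient precomputed once at a mode estimate β̂ so that each subsequent stochastic gradient touches only a minibatch while remaining unbiased:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_loglik(beta, X, y):
    """Gradient of the logistic regression log-likelihood, summed over rows."""
    return X.T @ (y - sigmoid(X @ beta))

def cv_grad_estimate(beta, beta_hat, grad_hat, X, y, batch_size, rng):
    """Control-variate estimator: grad_hat plus the rescaled minibatch sum of
    per-datum gradients at beta minus per-datum gradients at beta_hat."""
    N = X.shape[0]
    idx = rng.integers(0, N, size=batch_size)
    correction = grad_loglik(beta, X[idx], y[idx]) - grad_loglik(beta_hat, X[idx], y[idx])
    return grad_hat + (N / batch_size) * correction

# Small synthetic example (dimensions chosen for illustration only).
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3))
y = (rng.uniform(size=200) < sigmoid(X @ np.array([-1.0, 1.6, 0.5]))).astype(float)
beta_hat = np.zeros(3)                  # stand-in for an SGD mode estimate
grad_hat = grad_loglik(beta_hat, X, y)  # one-off O(N) precomputation
estimate = cv_grad_estimate(beta_hat + 0.01, beta_hat, grad_hat, X, y, 10, rng)
```

At β = β̂ the correction term vanishes and the estimator returns the precomputed gradient exactly; for β near β̂ its variance is far smaller than that of a plain subsampled gradient.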
Figure 2. Trace plots for the first coordinate of the logistic regression target distribution. Hyperspherical refers to the process derived in Section 3, BPS refers to the bouncy particle sampler.
6. Discussion
The PDMP constructed in Section 3 performs slightly better than the bouncy particle sampler for Gaussian targets of moderate dimension. This is because its transition function is nonlinear and non-magnitude-preserving – this helps the process avoid getting stuck in high-dimensional orbits by making it easier to move perpendicular to the contours of the target distribution. Unfortunately, the overall improvement is rather limited – non-magnitude-preserving transitions appear to us to be necessary but not sufficient for efficiency in high dimension.

Our results in Section 5 replicate the behavior of other PDMPs on big data problems explored in detail by Bouchard-Côté et al. [7], Bierkens et al. [5], Pollock et al. [4], and Vanetti et al. [9]. It is clear that these algorithms can achieve state-of-the-art performance in this setting through the use of subsampling and precomputed control variates. For logistic regression, this technique is attractive because the posterior mode is easily obtained using classical techniques.

One difficulty with PDMPs well-illustrated by our work can be seen in the trace plots under the Gaussian target with p = 1,000: the processes move rapidly through the state space, and yet still converge slowly due to moving primarily in directions orthogonal to those needed to ensure good mixing. Our use of hyperspherical coordinates makes the above easy to visualize: a non-reversible process can move rapidly in the θ and φ dimensions while moving arbitrarily slowly in the r dimension. Thus, non-reversible MCMC methods require additional care to diagnose convergence and ensure posterior estimates are reliable.

Further research in PDMPs is needed to understand their behavior on high-dimensional targets. Our use of hyperspherical coordinates to derive a PDMP with a nonlinear transition function may yield improvement for certain targets of moderate dimension. Many PDMPs resemble Hamiltonian Monte Carlo [11], so it may be possible to connect current work with existing theory in that area. We hope that with additional ideas substantially larger improvements are possible.

Acknowledgments
We are grateful to Georgi Dinolov, David Draper, Mark Girolami, David Parks, Daniele Venturi, and Yuanran Zhu for their thoughts. Membership on this list does not imply agreement with the ideas presented nor responsibility for errors that may inadvertently be present.
References
1. Fearnhead, P.; Bierkens, J.; Pollock, M.; Roberts, G.O. Piecewise Deterministic Markov Processes for Continuous-Time Monte Carlo. arXiv:1611.07873.
2. Robbins, H.; Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407.
3. Hoffman, M.D.; Blei, D.M.; Wang, C.; Paisley, J. Stochastic Variational Inference. Journal of Machine Learning Research, 1303–1347.
4. Pollock, M.; Fearnhead, P.; Johansen, A.M.; Roberts, G.O. The Scalable Langevin Exact Algorithm: Bayesian Inference for Big Data. arXiv:1609.03436.
5. Bierkens, J.; Fearnhead, P.; Roberts, G. The Zig-Zag Process and Super-Efficient Sampling for Bayesian Analysis of Big Data. arXiv:1607.03188.
6. Peters, E.A.; de With, G. Rejection-free Monte Carlo sampling for general potentials. Physical Review E, 026703.
7. Bouchard-Côté, A.; Vollmer, S.J.; Doucet, A. The Bouncy Particle Sampler: A Non-Reversible Rejection-Free Markov Chain Monte Carlo Method. arXiv:1510.02451.
8. Davis, M.H. Piecewise-deterministic Markov processes: A general class of non-diffusion stochastic models. Journal of the Royal Statistical Society, Series B (Statistical Methodology), pp. 353–388.
9. Vanetti, P.; Bouchard-Côté, A.; Deligiannidis, G.; Doucet, A. Piecewise Deterministic Markov Chain Monte Carlo. arXiv:1707.05296.
10. Casella, G.; George, E.I. Explaining the Gibbs sampler.