Adaptive Schemes for Piecewise Deterministic Monte Carlo Algorithms
ANDREA BERTAZZI AND JORIS BIERKENS
Delft Institute of Applied Mathematics, TU Delft, the Netherlands
Abstract.
The Bouncy Particle sampler (BPS) and the Zig-Zag sampler (ZZS) are continuous time, non-reversible Monte Carlo methods based on piecewise deterministic Markov processes. Experiments show that the speed of convergence of these samplers can be affected by the shape of the target distribution, as for instance in the case of anisotropic targets. We propose an adaptive scheme that iteratively learns all or part of the covariance matrix of the target and takes advantage of the obtained information to modify the underlying process with the aim of increasing the speed of convergence. Moreover, we define an adaptive scheme that automatically tunes the refreshment rate of the BPS or ZZS. We prove ergodicity and a law of large numbers for all the proposed adaptive algorithms. Finally, we show the benefits of the adaptive samplers with several numerical simulations.

1. Introduction
Piecewise Deterministic Markov processes (PDMPs) have recently been used for Monte Carlo sampling as a continuous time alternative to Markov chain Monte Carlo (MCMC) methods. The Bouncy Particle sampler (BPS) and the Zig-Zag sampler (ZZS), introduced in [12] and [7] respectively, are primary examples of this new class of methods, with notable predecessors [27, 25, 26]. Both samplers are based on continuous time dynamics defined by piecewise linear, deterministic trajectories and changes in the velocity of the process at random times. The simplicity of these dynamics is such that the underlying processes can be simulated exactly in many relevant scenarios. A key aspect of these samplers is the non-reversibility of the underlying Markov process. This property has been observed to result in a lower asymptotic variance of Monte Carlo estimates for moments of the target density [2, 6]. Moreover, these algorithms can be naturally modified to achieve asymptotically exact subsampling. In the Bayesian setting, this means that Piecewise Deterministic Monte Carlo (PDMC) algorithms need to access only a small portion of the data set at every iteration without introducing a bias. The reader is referred to [19, 33] for an in depth description of the general methodology of PDMC algorithms.

Several papers have studied the convergence properties of the BPS and the ZZS in recent years. It was first observed in [12] that the BPS can fail to be ergodic unless a refreshment of the velocity vector is performed at random times. Exponential ergodicity of the BPS was proved in [16, 17] for different distributions from which refreshments of the velocity vector can be drawn. Similarly, in [11] the ZZS was shown to converge exponentially to its invariant distribution under reasonable assumptions and without the need of velocity refreshments. Exponential convergence in the L² sense was established for both samplers in [1] using the hypocoercivity framework, and recently, using Poincaré inequalities in space-time, in [23]. A study of the scaling limits was conducted in [9], giving also a criterion to choose the refreshment rate of the BPS. The asymptotic variance of these processes has been studied in e.g. [2, 6]. It is also possible to design PDMC algorithms with non-linear trajectories, see e.g. [33, 8].

Although PDMC sampling methods offer some important benefits as mentioned above, computation remains expensive, which requires us to investigate possible performance improvements. In particular, a strong performance degradation is observed when the target distribution π is anisotropic. Figure 1 illustrates this phenomenon in the case of Gaussian targets as a function of the correlation between all components. The performance drop occurs due to a combination of decreasing accuracy of the estimates and increasing computational complexity of the algorithms, which is implied by the growing number of velocity change events. Our idea to improve this issue is to let the process learn (part of) the covariance matrix Σ_π and take advantage of it to enhance the mixing properties. The covariance estimate is used to linearly transform the target in such a way that it becomes more isotropic, i.e. with identity covariance matrix. The standard samplers are then run targeting the transformed version of π, and the obtained sample is finally re-transformed to be approximately from π. The procedure is applied iteratively, and once a new estimate of Σ_π is computed, it is used to define the linear transformation of π.
The estimate will eventually be close to the true covariance matrix and the process then targets an isotropic version of π. This scheme can also be interpreted as an application of a linear transformation directly to the standard ZZS and BPS.

Furthermore, we address the problem of automatically choosing the rate of velocity refreshments. In [9] the authors consider a BPS with a specific target and derive an optimal limiting value for the ratio of the number of refreshments over the total number of events.
Figure 1. Root mean squared error (RMSE) for the radius statistic (left) and number of events (right) for 50-dimensional Gaussian targets for several values of the correlation between all components. The continuous time horizon is T = 10.

… measure, although it comes with a larger asymptotic variance. For an analysis of these two results we refer to [34] and [6] respectively.

Both the schemes we discussed take advantage of what the process has learned up until the current time to tune a parameter or to improve the performance. This idea is at the core of adaptive Markov chain Monte Carlo algorithms. For an introduction to this area we refer to [3, 31], while standard results on convergence of these methods can be found among others in [5, 20, 21, 30]. It is well known that adaptive MCMC algorithms can lose the right invariant measure if not applied with care (see for instance [30] and the examples therein). Therefore we study in depth the convergence properties of the proposed algorithms. To fit into the existing adaptive MCMC literature we let the adaptation happen at fixed points in time. The main challenge consists of establishing a simultaneous geometric drift condition for a family of BPS's (see Lemma 19) and a simultaneous small set condition for a family of ZZ processes (see Lemma 15). The former result is obtained taking advantage of the Lyapunov function found in [17], while the latter is proved by extending the one-dimensional case considered in Lemma 25. The ergodicity and a law of large numbers for the proposed adaptive PDMC algorithms are then established in Theorem 12 and Theorem 14.

In Section 2 we introduce the adaptive schemes, while in Section 3 the theoretical aspects of the algorithms are studied. The proofs of the two main theorems can be found in Section 4. The adaptive BPS and ZZS are tested empirically on various Gaussian targets and on a Bayesian logistic regression problem with correlated data in Section 5. Appendix A gives details on the implementation of adaptive PDMC (with and without subsampling), as well as an alternative adaptive scheme for the refreshment rate. Appendix B contains the proofs of other theorems stated in Section 3, together with more technical results.
2. The adaptive schemes
We are interested in building adaptive strategies to make the ZZS and the BPS choose the refreshment rate themselves and/or converge faster to the target density. We begin with an introduction of the standard versions of both samplers, followed by a characterisation of the preconditioned processes in Section 2.2 and a discussion on the choice of the transformation matrix in Section 2.3. Finally, the adaptive algorithms are defined in Section 2.4.
2.1. The standard ZZS and BPS.
Let the target density π be defined on X ⊂ R^d as

π(ξ) = (1/Z) exp(−U(ξ)),

where U(ξ) is called the potential or energy function, and ξ ∈ X. Let us now define the standard ZZS with invariant measure π. Throughout the paper the position and velocity vectors at time t of the standard processes, both ZZS and BPS, are denoted respectively as Ξ(t) and Θ(t). The Zig-Zag (ZZ) process is a Markov process (Ξ(t), Θ(t))_{t≥0} with state space E = R^d × {−1, +1}^d that follows a linear trajectory in the position space with velocity Θ until one of its d inhomogeneous Poisson clocks rings. When the i-th clock rings the velocity of the i-th coordinate switches sign. In mathematical terms, this means that the new velocity vector is F_i Θ, where F_i : {−1, +1}^d → {−1, +1}^d is the flip operator defined as

(F_i θ)_j = θ_j if j ≠ i,  and  (F_i θ)_i = −θ_i.

The rates of the Poisson clocks are defined for i = 1, …, d as

λ_i(ξ, θ) = (θ_i ∂_i U(ξ))_+ + γ_i(ξ, θ),

where (θ_i ∂_i U(ξ))_+ = max(0, θ_i ∂_i U(ξ)) and γ_i : E → R_+ is called the excess switching rate and is such that γ_i(ξ, θ) = γ_i(ξ, F_i θ). It was shown in [7, Theorem 2.2] that this choice of switching rates ensures that µ = π × Unif({−1, +1}^d) is the stationary distribution of the process. Moreover, the ZZ process is characterised by its infinitesimal generator

(1)  L f(ξ, θ) = ⟨θ, ∇f(ξ, θ)⟩ + Σ_{i=1}^d λ_i(ξ, θ) (f(ξ, F_i θ) − f(ξ, θ)),

where the function f should be in the domain of the generator D(L). The linear trajectories are represented by the first term, while the second term represents the event of a velocity flip.

Similarly, we denote the BPS as (Ξ(t), Θ(t))_{t≥0}, but in this case the state space is E = R^d × R^d and so Θ(t) ∈ R^d. In contrast to the ZZS, the BPS has two Poisson clocks. The first one depends on the gradient of the energy function and has inhomogeneous rate λ(ξ, θ) = ⟨θ, ∇U(ξ)⟩_+ = max(0, ⟨θ, ∇U(ξ)⟩). At event times the particle is reflected on the level curve of the potential U following an elastic bounce, thus preserving the norm of the velocity vector. After a bounce the new velocity vector is given by

R(ξ)θ = θ − 2 (⟨∇U(ξ), θ⟩ / ‖∇U(ξ)‖²) ∇U(ξ).

It was observed in [12] that the BPS needs refreshments of the velocity vector in order to be ergodic. This brings us to the second Poisson clock, which has rate λ_r : E → R_+. This is referred to as the refreshment rate and should thus be strictly positive. When this clock rings, the velocity vector is refreshed by sampling from a distribution ψ. Possible choices are the Gaussian distribution ψ = N(0, I_d), or ψ = Unif(S^{d−1}), where S^{d−1} is the surface of the unit hypersphere. In the analysis that follows we focus on the former distribution. The infinitesimal generator of the BPS is defined for any f ∈ D(L) as

L f(ξ, θ) = ⟨θ, ∇f(ξ, θ)⟩ + λ(ξ, θ) (f(ξ, R(ξ)θ) − f(ξ, θ)) + λ_r(ξ, θ) ∫ (f(ξ, θ′) − f(ξ, θ)) ψ(dθ′).

The invariant measure of the BPS defined by the infinitesimal generator above is µ = π × ψ, as shown in [12, Proposition 1].
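To make the dynamics above concrete, the following is a minimal, self-contained sketch (ours, not part of the original text) of a standard Zig-Zag sampler for a standard Gaussian target, for which U(x) = ‖x‖²/2 and ∂_i U(x) = x_i; in this case the event times can be simulated exactly by inverting the integrated switching rates, and no excess rate is used. The function name and interface are hypothetical.

```python
import numpy as np

def zigzag_gaussian(x0, T, rng):
    """Minimal Zig-Zag sampler for a standard Gaussian target U(x) = ||x||^2 / 2.

    Along the linear dynamics x_i(s) = x_i + theta_i * s the i-th switching rate is
    (theta_i * x_i + s)_+ (gamma = 0), so each clock can be drawn exactly by
    inverting its integrated rate.
    """
    d = len(x0)
    x = np.array(x0, dtype=float)
    theta = rng.choice([-1.0, 1.0], size=d)
    t, skeleton = 0.0, [(0.0, x.copy(), theta.copy())]
    while t < T:
        a = theta * x                                  # a_i = theta_i * partial_i U(x)
        E = rng.exponential(size=d)                    # one unit exponential per clock
        taus = -a + np.sqrt(np.maximum(a, 0.0) ** 2 + 2.0 * E)   # exact inversion
        i = int(np.argmin(taus))                       # first ringing clock
        tau = taus[i]
        x = x + theta * tau                            # deterministic drift to the event
        theta[i] = -theta[i]                           # flip the i-th velocity component
        t += tau
        skeleton.append((t, x.copy(), theta.copy()))
    return skeleton

rng = np.random.default_rng(0)
path = zigzag_gaussian(np.zeros(2), T=100.0, rng=rng)
```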
2.2. Applying a linear transformation to the ZZS and BPS.
In this section we suppose a matrix M ∈ R^{d×d} is given. We then wish to define a transformation scheme encoded by M, which we should think of as being such that, for a suitable choice of M, it gives a "more isotropic" version of the target, and to analyse its effects on the PDMC samplers.
Figure 2. Transformation scheme: the standard sampler (Ξ_t, Θ_t) targets ˜π_M(ξ) with rates ˜λ_M(ξ, θ); the transformed process (X_t, Θ_t), with X_t = M Ξ_t (equivalently ξ = M⁻¹x), targets π(x) with rates λ(x, θ).

The transformation scheme encoded by M consists of a linear transformation of the state space, which defines a new target distribution ˜π_M given by

˜π_M(ξ) := (1/˜Z_M) exp(−˜U_M(ξ)),  with ˜U_M(ξ) = U(Mξ) and ˜Z_M = Z/|det M|.

The idea is to apply the transformation to the target distribution π and simulate the standard PDMC sampler (Ξ_t, Θ_t)_{t≥0} with the resulting target ˜π_M. Then the last thing to do is to transform the obtained sample, which is approximately from ˜π_M, by applying the inverse transformation. For this reason it is important that the matrix M is invertible, so that we can go from one state space to the other. This procedure is illustrated in Figure 2. Another possibility is to simulate directly the process (X_t, Θ_t)_{t≥0} that results from the scheme in Figure 2. The dynamics of such a process are studied in the remainder of this section.

Let us first focus on the case in which (Ξ_t, Θ_t)_{t≥0} is a standard ZZS with excess switching rate γ. In this case the switching rates are ˜λ_{M,i}(ξ, θ) = (θ_i ∂_i ˜U_M(ξ))_+ + γ_i(ξ, θ) for i = 1, …, d. Note that, unless stated otherwise, we will always use a tilde to indicate quantities related to the standard PDMC samplers with transformed target. In the next proposition we find the generator of the preconditioned ZZS. For a characterisation of the domain of the extended generator we refer to [15, Theorem 26.14].
Proposition 1. Let M ∈ R^{d×d} be an invertible matrix. Let (Ξ(t), Θ(t))_{t≥0} be a standard ZZS with target ˜π_M and excess switching rates γ : E → R_+^d. The process (X(t), Θ(t))_{t≥0} = (MΞ(t), Θ(t))_{t≥0} has extended generator (L^M, D(L^M)), where for any h ∈ D(L^M)

(2)  L^M h(x, θ) = ⟨Mθ, ∇_x h(x, θ)⟩ + Σ_{i=1}^d λ_i^M(x, θ) (h(x, F_i θ) − h(x, θ)),

in which for i = 1, …, d

(3)  λ_i^M(x, θ) = ˜λ_{M,i}(M⁻¹x, θ) = (θ_i ⟨M_i, ∇U(x)⟩)_+ + γ_i(M⁻¹x, θ),

where M_i denotes the i-th column of M.

Proof. All the proofs of this section are postponed to Appendix B.1. □
Proposition 1 shows that the transformed process is again a PDMP with linear trajectories between jumps. The transformation affects the velocity of the process, which is now v = Mθ, and the switching intensities, which are as defined in (3). In particular, the available velocities for a fixed transformation M are in the following set:

V := {v : v = Mθ, θ ∈ {−1, +1}^d}.

When a switch of the i-th velocity of the underlying standard ZZ process happens, the velocity of the transformed process changes according to the operator F̄_i v = M F_i θ = M F_i (M⁻¹ v) for i = 1, …, d. Therefore, all components of the velocity are possibly affected by any single event. If M is diagonal the behaviour is more similar to that of the standard ZZ process, and in particular we have that F̄_i ≡ F_i. In the proposition below we check that (X(t), Θ(t))_{t≥0} targets the correct density function.
Proposition 2. Consider the same setting as in Proposition 1. Then, for any invertible M ∈ R^{d×d}, the modified ZZ process (X(t), Θ(t))_{t≥0} has invariant distribution µ = π × Unif({−1, +1}^d).

Let us now apply the same transformation scheme shown in Figure 2 to the BPS. In this case the switching rate of the standard BPS with target ˜π_M is ˜λ_M(ξ, θ) = ⟨θ, ∇˜U_M(ξ)⟩_+, while reflections on the level curves of ˜U_M are obtained by applying the operator ˜R_M. The following result, analogous to Proposition 1, studies the transformed process.
Proposition 3. Let M ∈ R^{d×d} be an invertible matrix. Let (Ξ(t), Θ(t))_{t≥0} be a standard BPS with target ˜π_M and refreshment rate λ_r : E → R_+. The process (X(t), Θ(t))_{t≥0} = (MΞ(t), Θ(t))_{t≥0} has extended generator (L^M, D(L^M)), where for any h ∈ D(L^M)

(4)  L^M h(x, θ) = ⟨Mθ, ∇_x h(x, θ)⟩ + λ^M(x, θ) (h(x, R^M(x)θ) − h(x, θ)) + λ_r(M⁻¹x, θ) ∫ (h(x, θ′) − h(x, θ)) ψ(dθ′),

where we defined

(5)  λ^M(x, θ) = ˜λ_M(M⁻¹x, θ) = ⟨Mθ, ∇U(x)⟩_+

and

(6)  R^M(x)θ = ˜R_M(M⁻¹x)θ = θ − 2 (⟨M^T ∇U(x), θ⟩ / ‖M^T ∇U(x)‖²) M^T ∇U(x).

Once again the true velocity of the process is v = Mθ. When a velocity refreshment takes place the new θ is sampled from ψ = N(0_d, I_d), while the new velocity v is v = Mθ ∼ N(0_d, MM^T). Observe also that the reflection rate is λ^M(x, θ) = ⟨v, ∇U(x)⟩_+ and thus preserves the same structure as in the standard BPS. It follows that the complexity of the simulation of event times remains unchanged. Finally, consider the reflection operator in (6). This corresponds to a reflection in the direction opposite to the gradient in the transformed space, i.e. ∇_ξ ˜U_M(ξ) = M^T ∇U(x). After the bounce the process moves with velocity

v′ = M (R^M(x)θ) = v − 2 (⟨∇U(x), v⟩ / ‖M^T ∇U(x)‖²) M M^T ∇U(x).

This implies that

⟨v′, ∇U(x)⟩ = ⟨M (R^M(x)θ), ∇U(x)⟩ = −⟨Mθ, ∇U(x)⟩ = −⟨v, ∇U(x)⟩.
Proposition 4. Consider the same setting as in Proposition 3. Then, for any invertible M ∈ R^{d×d}, the transformed BPS (X(t), Θ(t))_{t≥0} has invariant distribution µ = π × ψ.
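As a concrete illustration of Propositions 1–4 (again a hedged sketch, not from the original text), one can run the standard Zig-Zag sampler of the earlier snippet on the transformed target ˜π_M and map the output back through M. For a Gaussian target π = N(0, Σ) and M = Σ^{1/2}, the transformed potential is ˜U_M(ξ) = ‖ξ‖²/2, so the standard-Gaussian sampler applies verbatim; the equicorrelated covariance below mirrors the setting of Figure 1, with the dimension and correlation chosen only for illustration.

```python
import numpy as np

# Equicorrelated Gaussian target, similar in spirit to the Figure 1 experiments.
d, rho = 50, 0.9
Sigma = (1.0 - rho) * np.eye(d) + rho * np.ones((d, d))

# Transformation scheme: with M = Sigma^{1/2}, the transformed target pi_M is a
# standard Gaussian, so we can reuse zigzag_gaussian from the previous sketch.
w, Q = np.linalg.eigh(Sigma)                 # Sigma = Q diag(w) Q^T
M = Q @ np.diag(np.sqrt(w)) @ Q.T

rng = np.random.default_rng(1)
skeleton = zigzag_gaussian(np.zeros(d), T=1000.0, rng=rng)   # (Xi_t, Theta_t)

# Map the skeleton back: X_t = M Xi_t is approximately distributed according to pi,
# and the corresponding velocity is v = M theta. Note that skeleton points are event
# times only; estimates should use the continuous trajectory or a fixed time grid.
positions = np.array([M @ xi for _, xi, _ in skeleton])
```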
2.3. Choosing the transformation matrix.
As explained above, we wish to transform the target to mitigate its anisotropies. To this end, some alternative choices of the transformation matrix M are the following (a brief numerical sketch is given at the end of this subsection):

a) M = √(Cov_π(X)): this transformation is such that the target ˜π_M has identity covariance matrix. The downside of this choice is the additional O(d³) computations that are introduced by the calculation of the square root of the covariance;

b) M is a rotation matrix (det M = 1) such that the transformed density has a certain angle. Although an interesting case, it is not clear whether there is an optimal angle that speeds up the convergence;

c) M is the diagonal matrix with M_ii = √(Var_π(X_i)) for 1 ≤ i ≤ d. This choice introduces O(d) computations due to the square root of the variances, which is a negligible additional computational burden. However, correlations in the target are not picked up and only a rescaling of the axes is performed. The main advantage of this choice is that the scenario in which some components are explored quickly and others slowly is avoided.

Both the first and the third option can potentially change the expected number of switching events, and this could be an inconvenience in certain settings. It is not difficult to modify these transformations in such a way that the expected switching rate is enforced to be close to that of the original standard PDMC algorithm. For example, consider the transformed BPS with generator as in Proposition 3. Then in stationarity we have

E_µ[(⟨MΘ, ∇U(X)⟩)_+] = (1/√(2π)) E_π‖M^T ∇U(X)‖ ≤ (1/√(2π)) ‖M‖ E_π‖∇U(X)‖.

Since the standard case corresponds to M = I_d, we can normalise any M by dividing it by its Euclidean norm. Then the upper bound is the same for all such choices of M and the expected switching intensity will be close to the standard case. This does not make a difference from a computational point of view, as it just amounts to a reparametrisation of the time parameter, but it prevents unpredictable behaviour of the algorithm.

Naturally the options above are not available in practice as the covariance matrix is unknown. It is the goal of the next section to propose an adaptive scheme that overcomes this issue.
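The following short sketch (ours, not from the original text) shows how options a) and c) can be formed from an estimated covariance matrix, together with the normalisation discussed above; the function name is hypothetical, and the use of the spectral norm is an assumption, since the text only specifies a Euclidean norm of M.

```python
import numpy as np

def transform_from_cov(Sigma_hat, kind="full", normalise=True):
    """Build a candidate preconditioner M from an estimated covariance (a sketch).

    kind="full": M = Sigma_hat^{1/2} via a symmetric eigendecomposition (option a).
    kind="diag": M = diag of the estimated standard deviations (option c).
    If normalise is True, M is divided by its spectral norm so that the expected
    switching intensity stays comparable to the standard sampler (M = I_d).
    """
    if kind == "diag":
        M = np.diag(np.sqrt(np.diag(Sigma_hat)))
    else:
        w, Q = np.linalg.eigh(Sigma_hat)            # Sigma_hat = Q diag(w) Q^T
        M = Q @ np.diag(np.sqrt(np.maximum(w, 0.0))) @ Q.T
    if normalise:
        M = M / np.linalg.norm(M, 2)
    return M
```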
2.4. Adaptive PDMC algorithms.
In the previous sections we defined the transformation scheme and we discussed the effect it has on the underlying process, together with different choices of the preconditioning matrix. We now describe how this idea can be applied in practice by designing an adaptive PDMC algorithm. In addition to the adaptive preconditioner, we incorporate an adaptation of the refreshment rate, which makes its choice automatic.

Let us then define a family of Markov semigroups by P := {(P_Γ^t)_{t≥0} : Γ ∈ Y}, in which Γ is the adaptation parameter, Y is a compact space, and (P_Γ^t)_{t≥0} is the semigroup of a modified ZZS or BPS. The modification is given by the adaptation parameter, which is Γ = (M, λ_r) for the BPS and Γ = (M, γ) for the ZZS. Thus Y is a suitable compact space of preconditioners and refreshment rates/excess switching rates. Naturally, it is also possible to choose Γ = M or Γ = λ_r only. Now that we have defined a family of Markov processes, we define a rule that establishes how to choose a P_Γ ∈ P at every iteration. Let us begin by introducing a discretisation step ∆t, which defines a discretisation of the time variable.
Algorithm 1 Adaptive PDMC sampler
Input: family of kernels P = {P_Γ : Γ ∈ Y}
Input: initial condition (x, θ) ∈ E, Γ_0 ∈ Y, set B, ∆t, {p_n}_{n≥0}, number of steps N
Output: sequence of discrete samples {X_n, Θ_n}_{n=0}^N
  Initialise n = 0, (X_0, Θ_0) = (x, θ), Q_0 = Γ_0
  while n ≤ N do
    (X_{n+1}, Θ_{n+1}) ∼ P_{Γ_n}^{∆t}((X_n, Θ_n), ·)
    Q_{n+1} = update(Q_n, (X_{n+1}, Θ_{n+1}))
    if (X_{n+1}, Θ_{n+1}) ∈ B then
      With probability p_n, set Γ_{n+1} = Q_{n+1}
      With probability 1 − p_n, set Γ_{n+1} = Γ_n
    else
      Set Γ_{n+1} = Γ_n
    end if
    n = n + 1
  end while

At each time step n ∈ N, which corresponds to continuous time t = n∆t, the adaptive scheme can update the parameter Γ_n based on the new information available, that is the new state of the process (X_n, Θ_n). This defines a random sequence {Γ_n}_{n≥0}. Once Γ_n is computed, the next state of the process is given by (X_{n+1}, Θ_{n+1}) ∼ P_{Γ_n}^{∆t}((X_n, Θ_n), ·). Then one updates the parameter, obtains the next state of the process, and so on.

These are the core ideas, written in pseudo-code form in Algorithm 1. A few issues remain to be clarified. A first question is how to simulate the PDMP semigroup of either the ZZS or the BPS. Details on how the processes can be simulated in the case of a target with dominated Hessian can be found in Appendix A.1. In a big data setting (large number of observations, moderate dimensionality of the problem) it can be beneficial to take advantage of subsampling techniques, which can be implemented with PDMC algorithms. In Appendix A.2 details can be found on how to make use of subsampling in the context of the adaptive schemes discussed here. For further information on the general implementation of the ZZS and BPS we refer to [7, 12]. A second aspect of Algorithm 1 we focus on is the introduction of the set B, and thus of the auxiliary sequence of random variables (Q_n)_{n≥0}. The idea is to update the adaptation parameter Γ_n only if (X_n, Θ_n) ∈ B. This is useful from a theoretical point of view as it ensures that the process remains bounded in probability. Note that the set B is defined by the user and can be chosen large. The auxiliary variable Q_n is updated even if the process is outside of B and then, as soon as the process enters B, or if it was already in B, we let Γ_n = Q_n. A third characteristic of Algorithm 1 is the sequence {p_n}_{n≥0}. This is a sequence for which p_n ∈ [0, 1] for all n ∈ N and p_n → 0 as n → ∞.
The meaning is that at time step n we update the parameter Γ_n with probability p_n (assuming (X_n, Θ_n) ∈ B), and with the remaining probability 1 − p_n we set Γ_n = Γ_{n−1}. This choice is helpful when proving ergodicity of the adaptive scheme, as it enforces that the quantity of adaptation diminishes and eventually vanishes.

The function update(Q_n, (X_{n+1}, Θ_{n+1})) outputs the updated parameter given the new observation (X_{n+1}, Θ_{n+1}). The estimation of the covariance matrix can be done sequentially, or online, by applying

(7)  µ̂_{n+1} = µ̂_n + r_{n+1}(X_{n+1} − µ̂_n),
     Σ̂_{n+1} = Σ̂_n + r_{n+1}((X_{n+1} − µ̂_n)(X_{n+1} − µ̂_n)^T − Σ̂_n).
Here {r_n}_{n≥1} is a positive, decreasing sequence such that r_n → 0 as n → ∞. In our simulations we choose r_n = 1/n. The same principle can be used if one is not interested in estimating the full covariance matrix, but only the diagonal, or more generally only a subset of it. More advanced estimation techniques can be employed to preserve any existing conditional independence structure in the target, as discussed in [35]. The refreshment rate of the BPS is assumed to be constant and is updated iteratively as follows. At time step n, n_refl reflections have taken place and thus we estimate the average reflection rate as λ_refl(n) = n_refl/(n∆t). Therefore, denoting by ρ* the optimal fraction of refreshments among all events given by the optimality criterion in [9], we set

(8)  λ_r^n / (λ_r^n + λ_refl(n)) = ρ*  ⟺  λ_r^n = (ρ*/(1 − ρ*)) λ_refl(n).

An alternative adaptive strategy for the refreshment rate can be found in Appendix A.3. The scheme above can also be applied to the excess switching rate of the ZZS. Although the analysis in [6] suggests that the best choice in terms of asymptotic variance is γ ≡ 0, adding some diffusivity could speed up the convergence to the target measure. In practice one can set the desired ratio of velocity switches over total events and proceed as above (a small sketch of this update step is given at the end of this subsection).

Finally, we remark that in practice it is not reasonable to update the parameters at every iteration. The main motivation is the computational cost of such an operation. In the most general case, the task of learning all components of Σ takes O(d²) operations, while the computation of its square root, which is needed to obtain the transformation matrix M, is an O(d³) operation. Therefore it is rather inconvenient to perform this at every time step. To avoid this issue it is sufficient to define the adaptive scheme such that adaptations happen every n_adap time steps, where n_adap is a user-defined integer. A possible choice is for instance n_adap = 1000. This modification is beneficial also because it allows the process to explore the target distribution before updating the parameters. Similarly, it is reasonable to update the refreshment rate based on the previous n_adap time steps, as in the long term this allows it to stabilise around the desired ratio. The covariance matrix can be updated as in (7) by simply processing the entire batch of n_adap data points one at a time.
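The following is a hedged Python sketch (ours, not from the paper) of one possible update step combining the online recursion (7) with the reflection-based tuning of the refreshment rate; target_ratio stands in for the optimal fraction of refreshments taken from [9], the helper transform_from_cov is the sketch from Section 2.3, and all names are hypothetical.

```python
import numpy as np

class AdaptationState:
    """Running quantities used by the update step of Algorithm 1 (a sketch)."""
    def __init__(self, d):
        self.n = 0                     # number of processed states
        self.mean = np.zeros(d)        # online mean estimate, eq. (7)
        self.cov = np.eye(d)           # online covariance estimate, eq. (7)
        self.n_refl = 0                # reflections observed so far (BPS)

def update(state, x_new, n_new_reflections, dt, target_ratio):
    """One adaptation step: eq. (7) for (mean, cov) plus the refreshment-rate rule (8).

    target_ratio is the desired fraction of refreshments among all events; the
    optimal value derived in [9] should be substituted here by the user.
    """
    state.n += 1
    r = 1.0 / state.n                                  # r_n = 1/n, as in the text
    delta = x_new - state.mean
    state.mean += r * delta                            # mean update of eq. (7)
    state.cov += r * (np.outer(delta, delta) - state.cov)   # covariance update of eq. (7)
    state.n_refl += n_new_reflections
    lam_refl = state.n_refl / (state.n * dt)           # average reflection rate
    lam_r = target_ratio / (1.0 - target_ratio) * lam_refl  # rule (8)
    M = transform_from_cov(state.cov, kind="full")     # preconditioner, Section 2.3
    return M, lam_r
```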
3. Theoretical results

In the context of adaptive MCMC algorithms, convergence to the target density is usually proved with simultaneous drift conditions and small set conditions. In Section 3.1 we introduce the notation and the main existing theorems we make use of, and we extend these results to more general conditions in Theorem 9. In Section 3.2 we state Theorem 10, which shows ergodicity for an adaptive MCMC algorithm based on a continuous time process. In this result, the assumptions are formulated directly in continuous time. Finally, Theorems 12 and 14 in Section 3.3 show that the adaptive ZZS and the adaptive BPS discussed in Section 2.4 are ergodic and satisfy a weak law of large numbers under reasonable growth conditions on the target.
3.1. Theory of adaptive MCMC.
We denote the parameter that specifies the kernel as Γ ∈ Y. At time step n a Y-valued random variable Γ_n determines which transition kernel will be used to move to the next step. From here on each Markov transition kernel P_Γ is assumed to define a Markov chain that has µ as stationary measure, and moreover is aperiodic and irreducible. An adaptive MCMC algorithm is then said to be ergodic if

(9)  lim_{n→∞} ‖P(Z_n ∈ · | z_0, Γ_0) − µ(·)‖_TV = 0  for all z_0 ∈ E, Γ_0 ∈ Y,

where ‖·‖_TV is the total variation distance, i.e. ‖µ − ν‖_TV = sup_{A⊂E} |µ(A) − ν(A)|. A crucial quantity turns out to be the ε-convergence time function M_ε : E × Y → N, defined as

M_ε(z, Γ) = inf{n ≥ 1 : ‖P_Γ^n(z, ·) − µ(·)‖_TV ≤ ε}.

The next theorem, proved in [30], is arguably the most important result for establishing ergodicity of adaptive MCMC methods.
Theorem 5 (Theorem 2 in [30]). Consider an adaptive MCMC algorithm on a state space E with adaptation parameter in a space Y. Let µ be stationary for P_Γ for each Γ ∈ Y. The adaptive algorithm is ergodic if the two following conditions hold:

(a) (Containment condition) For all z_0 ∈ E, Γ_0 ∈ Y, ε > 0 the sequence {M_ε(Z_n, Γ_n)}_{n=0}^∞ is bounded in probability given z_0, Γ_0;

(b) (Diminishing adaptation) The following limit holds in probability:

(10)  lim_{n→∞} (sup_{z∈E} ‖P_{Γ_{n+1}}(z, ·) − P_{Γ_n}(z, ·)‖_TV) = 0.

The boundedness of {M_ε(Z_n, Γ_n)}_{n=0}^∞ can be rephrased as: for all z_0 ∈ E, Γ_0 ∈ Y, δ > 0 there exists N ∈ N such that P(M_ε(Z_n, Γ_n) ≤ N | z_0, Γ_0) ≥ 1 − δ for all n ∈ N.

We are then interested in sufficient conditions that imply containment. A first case is the following, which was studied in [5].

Assumption 6 ([5]). The family {P_Γ : Γ ∈ Y} is simultaneously geometrically ergodic (SGE), i.e. there are C ∈ B(E), some integer n ≥ 1, a function V : E → [1, ∞), δ > 0, 0 < λ < 1, and b < ∞, such that sup_{z∈C} V(z) < ∞, µ(V) < ∞, and

(a) C is a uniform (ν_Γ, δ, n)-small set, i.e. for each Γ there exists a probability measure ν_Γ(·) on C such that P_Γ^n(z, ·) ≥ δ ν_Γ(·) for all z ∈ C;

(b) (simultaneous geometric drift condition) P_Γ V ≤ λV + b 1_C for all Γ ∈ Y.

Then [5, Theorem 3] establishes that an SGE family satisfies the containment condition. In Section 3.3 we use this result to show that containment holds for the adaptive ZZS when the class of preconditioners is restricted to diagonal matrices.

In practice it is often hard to show that the family of Markov kernels is SGE, as it is not trivial to find a Lyapunov function that satisfies the simultaneous geometric drift condition. In [14] the authors introduced a way around this problem, although in a different context, and in [13] this was applied to adaptive MCMC. The fundamental idea is that it is possible to weaken the simultaneous drift condition by allowing adaptations only at time steps n at which the process Z_n is inside of a compact set B. This means that, defining an auxiliary random process {Q_n}_{n≥0} that contains the current adaptation parameter independently of the position of Z_n, Γ_n is updated as

(11)  Γ_{n+1} = Q_{n+1} if Z_{n+1} ∈ B,  and  Γ_{n+1} = Γ_n if Z_{n+1} ∉ B.
This modification avoids unbounded detours of the process by sticking to the same ergodic kernel once the process exits a fixed compact set. The compact set can be chosen arbitrarily large, and therefore in most applications the process will not exit from it. With this in mind we introduce the following sets of assumptions, which we show in Theorem 9 to be sufficient to enforce the containment condition.
Assumption 7. Let {P_Γ : Γ ∈ Y} be a family of discrete time Markov chains with state space E. There are C ∈ B(E), an integer n ≥ 1, a class of functions {V_Γ : E → [1, ∞) : Γ ∈ Y}, δ > 0, 0 < λ < 1, and b < ∞, such that sup_{z∈C, Γ∈Y} V_Γ(z) < ∞, µ(V_Γ) < ∞, and

(a) C is a uniform (ν_Γ, δ, n)-small set, i.e. for each Γ ∈ Y there exists a probability measure ν_Γ(·) on C such that P_Γ^n(z, ·) ≥ δ ν_Γ(·) for all z ∈ C;

(b) for each Γ ∈ Y and z ∈ E, P_Γ V_Γ(z) ≤ λ V_Γ(z) + b 1_C(z);

(c) the adaptation parameter is allowed to be updated only if the process is inside of a compact set B, as defined in (11).
Assumption 8. Let {P_Γ : Γ ∈ Y} be a family of discrete time Markov chains with state space E. There exist α, λ ∈ (0, 1), constants C_1 > 0 and C_2 > 2C_1, and a class of functions {V_Γ : E → [1, ∞) : Γ ∈ Y} with µ(V_Γ) < +∞, such that

(a) for each Γ ∈ Y, for all x, y ∈ E such that V_Γ(x) + V_Γ(y) ≤ C_2 it holds that ‖P_Γ(x, ·) − P_Γ(y, ·)‖_TV ≤ 1 − α;

(b) for each Γ ∈ Y and for any z ∈ E, P_Γ V_Γ(z) ≤ λ V_Γ(z) + C_1(1 − λ);

(c) the adaptation parameter is allowed to be updated only if the process is inside of a compact set B, as defined in (11).
Theorem 9. Consider a family of discrete time Markov transition kernels {P_Γ : Γ ∈ Y}. Assume that all kernels P_Γ are aperiodic, irreducible, and have stationary measure µ. Suppose the adaptive algorithm satisfies the diminishing adaptation condition, i.e. assumption (b) in Theorem 5, and let either Assumption 7 or Assumption 8 hold. Then the containment condition holds and the adaptive MCMC algorithm is ergodic. The proof can be found in Appendix B.2.
Remark.
A weak law of large numbers (WLLN) for bounded and measurable functions follows immediately from containment and diminishing adaptation by Theorem 3.4 in [28]. Therefore, under the conditions of Theorem 9 a WLLN holds.
3.2. Convergence properties of adaptive MCMC algorithms based on continuous time Markov processes.
It could be the case, as it is in the present work, that one is interested in defining an adaptive scheme based on a family of continuous time Markov processes. In this case a grid for the time variable needs to be introduced in order to indicate the times at which the adaptation occurs. In fact the adaptive chain only sees the process at times m∆t, where ∆t > 0 and m ∈ N. Although the resulting chain is in discrete time, it is in most cases easier to work directly with the continuous time process. The following result, which is analogous to Theorem 9, is helpful in this sense.
Theorem 10. Consider a family of Markov processes with generators {L_Γ : Γ ∈ Y}, each being irreducible, aperiodic, and having µ as invariant measure. Consider a grid for the time variable with step ∆t. Consider an adaptive scheme that at times m∆t, with m ∈ N, chooses a process from the aforementioned family. Furthermore, suppose that the adaptive scheme satisfies the diminishing adaptation condition (10) for P := P^{∆t}. Finally, assume one of the following two sets of conditions holds:

(1) There exist a set C ∈ B(E), t_0 > 0, a class of functions {V_Γ : E → [1, ∞) : Γ ∈ Y}, δ > 0, A_1, A_2 > 0, such that sup_{z∈C, Γ∈Y} V_Γ(z) < ∞, π(V_Γ) < ∞, and
(a) for each Γ ∈ Y there exists a probability measure ν_Γ such that P_Γ^{t_0}(z, ·) ≥ δ ν_Γ(·) for all z ∈ C;
(b) for each Γ ∈ Y and z ∈ E it holds that L_Γ V_Γ(z) ≤ −A_1 V_Γ(z) + A_2 1_C(z);
(c) it holds that ∆t = m t_0 for some m ∈ N.

(2) There exist A_1, A_2 > 0, C_0 > 2A_2/A_1, and a class of functions {V_Γ : E → [1, ∞) : Γ ∈ Y} with π(V_Γ) < +∞, such that
(a) for each Γ ∈ Y, for all x, y ∈ E such that V_Γ(x) + V_Γ(y) ≤ C_0, there exist α, t_0 > 0 such that ‖P_Γ^{t_0}(x, ·) − P_Γ^{t_0}(y, ·)‖_TV ≤ 1 − α;
(b) for each Γ ∈ Y and for any z ∈ E, L_Γ V_Γ(z) ≤ −A_1 V_Γ(z) + A_2;
(c) it holds that ∆t = t_0.

If the adaptation parameter is allowed to be updated only if the process is inside of a compact set B, as defined in (11), then the adaptive algorithm satisfies the containment condition and is thus ergodic. The proof of this theorem can be found in Appendix B.3.
Remark.
The restrictions on the step size can be milder than as stated in Theorem 10. For instance, if the minorisation condition (1a) of Theorem 10 holds for all t ≥ t_0, then one is free to choose any step size ∆t > 0. Furthermore, in both cases the assumption that the parameter can be updated only if the process is inside of a compact set at the adaptation time can be dropped when a simultaneous geometric drift condition holds (Assumption 6(b)).
3.3. Convergence properties of adaptive PDMC algorithms.
Relying on Theorem 10, in this section we turn our attention to the ergodicity of adaptive PDMC algorithms. The proofs of the two theorems are postponed to Section 4. First, let us consider the adaptive ZZS. We assume the following conditions on the potential.
Assumption 11 (Growth Condition 3 in [11]). U ∈ C³(R^d) and

lim_{‖x‖→∞} max(1, ‖∇²U(x)‖) / ‖∇U(x)‖ = 0,   lim_{‖x‖→∞} ‖∇U(x)‖ / U(x) = 0.

Let us now state the ergodicity result for the adaptive ZZS.
Theorem 12.
Let M be a compact set of positive definite matrices and let Λ be a set of excess switching rates γ : E → R_+^d for which there are 0 < γ_min ≤ γ_max < ∞ such that for all γ ∈ Λ

γ_min ≤ γ(x, θ) ≤ γ_max  for all (x, θ) ∈ E.

Let P = {P_{M,γ} : M ∈ M, γ ∈ Λ} be a family of preconditioned Zig-Zag processes with switching rates as defined by Equation (2). Suppose Assumption 11 holds and assume either of the following conditions holds:

a) M = {M ∈ R^{d×d} : M_ii ∈ [a, b], M_jk = 0 for all j ≠ k} with b > a > 0;
b) M has no additional restrictions but adaptations are allowed only inside of a compact set B,

and let ∆t be the discretisation step. Then the containment condition holds. Moreover, if the adaptive strategy is as described in Section 2.4 and is such that p_n → 0 as n → ∞, then the diminishing adaptation condition holds and thus for µ = π × Unif(Θ):

(12)  lim_{n→∞} ‖P((X_n, Θ_n) ∈ · | x_0, θ_0, γ_0) − µ(·)‖_TV = 0  for all (x_0, θ_0) ∈ E, γ_0 ∈ Λ.

Finally, for any bounded and measurable f : R^d → R a weak law of large numbers holds, i.e.

(13)  (1/N) Σ_{n=1}^N f(X_n) → π(f)  in probability.
Remark. The time discretisation step ∆t can be chosen freely and is not subject to constraints. Moreover, we remark that under condition (a) on M the adaptive algorithm is SGE, i.e. satisfies Assumption 6, while under condition (b) it satisfies assumption (1) in Theorem 10.

Below we introduce a set of assumptions that is used to show ergodicity of the adaptive BPS. Here we limit our attention to the case of ψ = N(0, I_d).

Assumption 13 (Assumptions A1, A2, and A7 in [17]). Let U : R^d → [0, ∞) satisfy
(a) U ∈ C²(R^d), and x ↦ ‖∇U(x)‖ is integrable w.r.t. π;
(b) ∫_{R^d} e^{−U(x)/2} dx < +∞ and lim_{‖x‖→∞} U(x) = +∞;
(c) there exists ζ ∈ (0,
1) such thatlim inf k x k→∞ k∇ U ( x ) k U − ζ ( x ) > , lim sup k x k→∞ k∇ U ( x ) k U − ζ ( x ) < ∞ , and lim sup k x k→∞ k∇ U ( x ) k U − ζ ( x ) < ∞ . Theorem 14.
Let M be a compact set of positive definite matrices and Λ_r = [λ_min, λ_max] for 0 < λ_min ≤ λ_max < ∞. Let P = {P_{M,λ_r} : M ∈ M, λ_r ∈ Λ_r} be a family of preconditioned BPS's as defined in Equation (4), where λ_r is the refreshment rate. Suppose Assumption 13 holds and let ∆t be the discretisation step. Assume that ψ = N(0, I_d). If adaptations are allowed only inside of a compact set as explained in (11), then the containment condition holds. Furthermore, for the strategy discussed in Section 2.4 the diminishing adaptation condition holds as long as p_n → 0 as n → ∞. Thus the ABPS is ergodic in the sense of Equation (12) and satisfies a WLLN of the form (13) for any bounded and measurable f : R^d → R.

4. Proofs of the main theorems
4.1. Proof of Theorem 12.
In order to prove the theorem we show that, for suitable families of preconditioners, either Assumption 6 holds for the family of discretised ZZ processes, or condition (1) in Theorem 10 holds for the family of continuous time ZZ processes. In the next two sections we show auxiliary results, while in Section 4.1.3 we assemble them to show the theorem.
4.1.1. Minorisation condition for the ZZS.
The following lemma shows that a simultaneous small set condition holds for the family of ZZ processes. The proof relies on the one-dimensional case considered in Lemma 25.
Lemma 15.
Let U ∈ C¹(R^d). Consider the family of d-dimensional Zig-Zag processes with generators {L_{M,γ} : M ∈ M, γ ∈ Λ}, in which M is a compact set of positive definite matrices and Λ is a set of switching rates γ : E → R_+^d. Assume that there are γ_min, γ_max such that for all γ ∈ Λ

0 < γ_min ≤ γ(x, θ) ≤ γ_max < ∞  for all (x, θ) ∈ E.

Then for any set of the form C = D × V, where D ⊂ R^d is a compact set and V ⊆ Θ, there exists t_0 > 0 such that for any t ≥ t_0 there are δ > 0 and probability measures {ν_M}_{M∈M} on E such that

P_{M,γ}^t((x, θ), ·) ≥ δ ν_M(·)  for all (x, θ) ∈ C.

In particular, t_0 and δ depend neither on M nor on γ.

Proof. We first show that all compact sets are uniformly small for the family of standard Zig-Zag processes with targets in {˜π_M(ξ) ∝ π(Mξ) : M ∈ M}. Let C be a d-dimensional rectangle of the form [−R, +R]^d × V for some R > 0 and V ⊆ Θ. Let M ∈ M and γ ∈ Λ and denote as (Ξ
M,γ ( t ) , Θ M,γ ( t )) t ≥ the ZZ process that targets˜ π M ( ξ ) and has switching rate ˜ γ M ( ξ, θ ) = γ ( M ξ, θ ). Let ˜ P M,γ be the corresponding semi-group. Then observe that for any rectangle B = B × · · · × B d , with B i = [ R i, , R i, ] forsome R i, > R i, the following holds(14)˜ P tM,γ (( ξ, θ ) , B ) = P ( ξ,θ ) ((Ξ M,γ ( t ) , Θ M,γ ( t )) ∈ B )= d Y i =1 P ( ξ,θ ) (cid:0) (Ξ M,γ ( t ) , Θ M,γ ( t )) i ∈ B i | (Ξ M,γ ( t ) , Θ M,γ ( t )) j ∈ B j for j > i (cid:1) . Then observe that, independently of the values at any time s < t of the other componentsof the process, the i -th switching rate ˜ λ M,i ( ξ, θ ) = ( θ i ∂ i U ( M ξ )) + + γ i ( M ξ, θ ) satisfies thebounds(15) 0 < γ min ≤ ˜ λ M,i ( ξ, θ ) ≤ λ max < ∞ , where we have defined λ max := max i =1 ,...,d max M ∈M max { ξ ∈ ˜ C t ,θ ∈ Θ } (cid:8) ( θ i ∂ i U ( M ξ )) + (cid:9) + γ max with ˜ C t = [ − R − t, R + t ] d is the set of reachable points in time t . Here λ max is well definedbecause U ∈ C . In particular neither of the bounds in (15) depend on the specific ˜ π M and thus hold for any M ∈ M . Therefore, we can apply the same reasoning of Lemma 25with t = 2 R + ε with ε > DAPTIVE SCHEMES FOR PDMC ALGORITHMS 15 for any t ≥ t there exists δ > P tM,γ (( ξ, θ ) , B ) ≥ d Y i =1 δ ν ( B i ) = δ d ν d ( B ) = δ d ν d ( B ) for all ( ξ, θ ) ∈ C, where ν d = Leb d ( C ) × Unif(Θ) is the d -dimensional equivalent of ν that was defined inthe proof of Lemma 25. Most importantly, δ depends only on the bounds on the switchingrates, which are uniform for all γ ∈ Λ. Therefore Equation (16) holds for the same δ and ν d for all M ∈ M , γ ∈ Λ, and any rectangle B ⊂ C . Since the set of rectangles is a π -system and generates the Borel σ -algebra, by the monotone class theorem (Theorem 6.2in [22]) it follows that the lower bound of Equation (16) holds for any Borel set B . Sinceany compact set can be included in a large enough hypercube, for all sets C = D × V ,with D compact and V ⊆ Θ, there are ν d and t > t ≥ t there exists δ d > C is uniformly ( t, δ d , ν d )-small for the family of standard ZZ processesdefined above.Now we wish to translate this result to the family of linearly transformed Zig-Zagprocesses with excess switching rates in Λ and target π . Denote as ( X M,γ ( t ) , Θ M,γ ( t )) t ≥ the process with generator L M,γ and observe that for any t > A = A x × V , for a Borel set A x , the event { ( X M,γ ( t ) , Θ M,γ ( t )) ∈ A } is equivalent to the event { (Ξ M,γ ( t ) , Θ M,γ ( t )) ∈ ˜ A M } for ˜ A M = { ( ξ, θ ) ∈ E : M ξ ∈ A x , θ ∈ V } . Assume C ∈ E is acompact set and let ( x, θ ) ∈ C . Then define C M = { ( ξ, θ ) = ( M − x, θ ) : ( x, θ ) ∈ C, M ∈M} which is itself a compact set and depends on M , but not on the specific M ∈ M .Then, for all t ≥ t , with t large enough, it holds that P tM,γ (( x, θ ) , A ) = P ( x,θ ) (( X M,γ ( t ) , Θ M,γ ( t )) ∈ A )= P ( ξ,θ ) ((Ξ M,γ ( t ) , Θ M,γ ( t )) ∈ ˜ A M ) ≥ δν d ( ˜ A M ) for all ( ξ, θ ) ∈ C M = δν M ( A ) for all ( x, θ ) ∈ C with ν M ( A ) = ν d ( ˜ A M ) and δ is chosen such that C M is a uniform ( t, δ, ν )-small set forthe family of standard ZZ processes with targets ˜ π M and switching rates ˜ γ M . Thereforethere is no dependence neither on M nor on γ in t and δ and the proof is concluded. (cid:3) Drift conditions for the Zig-Zag process.
If we restrict our attention to the class of diagonal matrices with positive, bounded entries, then the Lyapunov function in [11, Lemma 11] also satisfies a simultaneous drift condition. This is shown in the following lemma.
Lemma 16.
Let Assumption 11 hold. Consider the family of linearly transformed Zig-Zag processes with generators {L_{M,γ} : M ∈ M, γ ∈ Λ}, where

(17)  M = {M ∈ R^{d×d} : M_ii ∈ [V_min^i, V_max^i], M_jk = 0 for all j ≠ k},

with V_max ≥ V_max^i ≥ V_min^i ≥ V_min > 0 for each i = 1, …, d, and where Λ is a set of excess switching rates γ : E → R_+^d such that for all γ ∈ Λ it holds that

(18)  γ_i(x, θ) ≤ γ_max  for all (x, θ) ∈ E, i = 1, …, d.
Let δ > and α > be such that < ( δγ max ) /V min < α < and define φ ( s ) = sign( s ) ln (1 + δ | s | ) . Then the function (19) V ( x, θ ) = exp αU ( x ) + d X i =1 φ ( θ i ∂ i U ( x )) ! is a simultaneous Lyapunov function for the family of ZZ processes, that is there exist A > , A > , a compact set C ⊂ E such that L M,γ V ( x, θ ) ≤ − A V ( x, θ ) + A C ( x, θ ) for all ( x, θ ) ∈ E, M ∈ M , γ ∈ Λ , where A , A , C do not depend neither on M nor on γ (but depend on M and Λ ).Proof. We follow the proof of Lemma 11 in [11]. Let M ∈ M and γ ∈ Λ. Applying thegenerator to V , which was defined in Equation (19), we obtain (cid:18) L M,γ VV (cid:19) ( x, θ ) = d X i =1 M ii θ i α∂ i U ( x ) + d X j =1 θ j ∂ ij U ( x ) φ ′ ( θ j ∂ j U ( x )) ! + d X j =1 (cid:0) M ii ( θ i ∂ i U ( x )) + + γ i ( x, θ ) (cid:1) ·· (exp ( φ ( − θ i ∂ i U ( x )) − φ ( θ i ∂ i U ( x ))) − s = θ i ∂ i U ( x ) ≥
0, then M ii αs + ( M ii s + γ i )(exp ( φ ( − s ) − φ ( s )) −
1) = M ii (cid:18) αs + (cid:18) s + γ i M ii (cid:19) (cid:18)
11 + δs − (cid:19)(cid:19) = M ii (cid:18) ( α − s + s − γ i /M ii δs (cid:19) ≤ − M ii ((1 − α ) | s | + 1 /δ ) . In case s = θ i ∂ i U ( x ) < M ii αs + ( M ii s + γ i )(exp ( φ ( − s ) − φ ( s )) −
1) = M ii (cid:18) αs + γ i M ii (1 + δ | s | − (cid:19) ≤ − M ii (cid:18) α − γ max M ii δ (cid:19) | s | . Note that by assumption ( α − γ max δ/M ii ) ≥ ( α − γ max δ/V min ) >
0. For the remainingterm the best we can do is to derive the following bound X i,j M ii θ i θ j ∂ ij U ( x ) φ ′ ( θ j ∂ j U ( x )) ≤ X i,j M ii φ ′ ( θ j ∂ j U ( x )) | ∂ ij U ( x ) |≤ V max δ X i,j | ∂ ij U ( x ) | , where we used that 0 ≤ φ ′ ( s ) ≤ δ/
2. Finally we obtain for any M ∈ M (cid:18) L M,γ VV (cid:19) ( x, θ ) ≤ − min (cid:18) − α, α − γ max δV min (cid:19) V min d X i =1 | ∂ i U ( x ) | (20) DAPTIVE SCHEMES FOR PDMC ALGORITHMS 17 + V max dδ + δ V max X i,j | ∂ ij U ( x ) | , which is independent of the specific M and γ and can be made arbitrarily small outsideof a sufficiently large compact set C by our assumptions on U . (cid:3) If we wish to consider a more general class of positive-definite matrices, then we haveto settle for the following, weaker result.
Lemma 17.
Let Assumption 11 hold. Consider a family of linearly transformed Zig-Zagprocesses with generators {L M,γ : M ∈ M , γ ∈ Λ } , where M ⊂ R d × d is a compact spaceof positive definite matrices and Λ is a space of excess switching rates γ : E → R + suchthat (18) is satisfied for some γ max . Let δ > and α > be such that < δγ max < α < .Define for each M ∈ M the function (21) V M ( x, θ ) = exp αU ( x ) + d X i =1 φ ( θ i h M i , ∇ U ( x ) i ) ! , where M i denotes the i -th column of M and φ : R → R was defined in Lemma 16. Thenthere are A > , A > , and a compact set C ⊂ E such that for all M ∈ M thefollowing simultaneous drift condition holds: L M,γ V M ( x, θ ) ≤ − A V M ( x, θ ) + A C ( x, θ ) for all ( x, θ ) ∈ E. In particular A , A , C do not depend neither on M nor on γ (but depend on M and Λ ).Proof. Consider the change of variables ξ = M − x . Denote as ˜ L M,γ the generator of aZZ process with transformed stationary measure ˜ π M and transformed excess switchingrate ˜ γ M ( ξ, θ ) = γ ( M ξ, θ ). Then transforming the function in (21) we obtain a Lyapunovfunction for the standard ZZ process with transformed target:˜ V M ( ξ, θ ) = exp α ˜ U M ( ξ ) + d X i =1 φ ( θ i ∂ i ˜ U M ( ξ )) ! , where ˜ U M ( ξ ) = U ( M ξ ). Since M is a compact space of positive definite linear transfor-mations, Assumption 11 is satisfied by each ˜ U M ( · ). Then, by the proof of Lemma 11 in[11], for any constant A > B M := { ( ξ, θ ) ∈ E : θ ∈ Θ , ξ ∈ B (0 , ˜ R M ) } such that˜ L M,γ ˜ V M ( ξ, θ ) ≤ − A ˜ V M ( ξ, θ ) for all ( ξ, θ ) / ∈ ˜ B M . In particular ˜ B M does not depend on γ , but only on γ max . Observe that by the proof ofLemma 11 in [11] it follows that ˜ R M depends continuously on M . Indeed one can showthat ( ˜ L M,γ ˜ V M ( ξ, θ )) / ˜ V M ( ξ, θ ) is smaller or equal than a sum of terms which depend con-tinuously on the components of ∇ ˜ U M and ∇ ˜ U M . Continuity follows from the assumptionthat U ∈ C and because M does not appear in other ways. Proposition 1 then impliesthat L M,γ V M ( x, θ ) = ˜ L M,γ ˜ V M ( ξ, θ ). Thus for each M ∈ ML M,γ V M ( x, θ ) ≤ − A V M ( x, θ ) for all ( x, θ ) / ∈ B M , DAPTIVE SCHEMES FOR PDMC ALGORITHMS 18 where B M := { ( x, θ ) ∈ E : ( M − x, θ ) ∈ ˜ B M } and A does not depend on M . Finally,take C to be any ball that contains all sets B M . Then C is bounded by continuity of ˜ R M in M and compactness of M . By continuity of L M,γ V M ( x, θ ) in x, θ, M it follows that L M,γ V M ( x, θ ) ≤ − A V M ( x, θ ) + A C ( x, θ ) , in which A = max { M ∈M , γ ∈ Λ , ( x,θ ) ∈ C } ( L M,γ V M ( x, θ ) + A V M ( x, θ )). In particular the max-imum over γ ∈ Λ does not cause problem as the γ ’s are uniformly bounded. Hence A isindependent of the specific M and γ . (cid:3) Finalising the proof of Theorem 12.
Let us first consider the case of diagonal precon-ditioners. Let ∆ t > C = D × V , with D compact and V ⊆ Θ, there exist δ > n := inf { n ∈ N : n ≥ t ∆ t } , ν M ( · ) such that C is a uniform ( ν M , n , δ )-small set for the family of discretised pro-cesses. Observe that no conditions on ∆ t are required. Moreover, a simultaneous driftcondition holds by Lemma 16 combined with Lemma 26 for any ∆ t . The condition µ ( V ) < ∞ is satisfied by definition of the Lyapunov function V , and in fact it also holdsthat sup ( x,θ ) ∈ C V ( x, θ ) < ∞ by continuity of V in x . Therefore all the conditions of As-sumption 6 are verified, which means the family is simultaneously geometrically ergodicand by Theorem 3 in [5] the containment condition is satisfied.In the case of a non-diagonal transformation matrix, parts (a)-(c) of condition (1) inTheorem 10 are verified by Lemmas 15 and 17. It also holds that sup { ( x,θ ) ∈ C,M ∈M} V M ( x, θ ) < ∞ for (small) sets C = D × V , with D compact and V ⊆ Θ, because of continuity ofeach V M in x and M , together with the fact that D and M are compact spaces (seeLemma 17 for the definition of { V M } M ∈M ). Moreover µ ( V M ) = ˜ µ M ( ˜ V M ) < ∞ , where˜ µ M = ˜ π M ⊗ Unif(Θ) and ˜ V M is a Lyapunov function for a standard ZZ process withinvariant measure ˜ µ M . Theorem 10 ensures that the containment condition holds truewith no restriction on ∆ t .Proposition 20 implies that, under the assumption that p n → n → ∞ , the adaptivestrategy satisfies diminishing adaptation. Therefore ergodicity follows from diminishingadaptation and containment.4.2. Proof of Theorem 14.
In the next two sections we show respectively that condition(2) in Theorem 10 is verified for the family of preconditioned BPS and/or of BPS withrefreshment rates in a compact set. In Section 4.2.3 we use these auxiliary results to showthe theorem.4.2.1.
Simultaneous coupling inequality.
The next lemma states that a simultaneous cou-pling inequality is satisfied for the BPS with adaptive preconditioner and/or adaptiverefreshment rate. The proof is based on the proof of Lemma 12 in [17], which shows acoupling inequality result for the standard BPS.
Lemma 18.
Let condition (a) in Assumption 13 hold for the energy function U . Considerthe family of BP processes { P tM,λ r : M ∈ M , λ r ∈ Λ r } , where M is a compact space ofnon-singular preconditioning matrices and Λ r = [ λ min r , λ max r ] is the set of refreshmentrates, for some < λ min r ≤ λ max r < ∞ . Then for any compact set K ⊂ { ( x, θ ) ∈ R d × R d : DAPTIVE SCHEMES FOR PDMC ALGORITHMS 19 k x k + k θ k ≤ R } , with R ≥ , there exists α > such that for all ( x, θ ) , (˜ x, ˜ θ ) ∈ K , forall t > , and for all M ∈ M and λ r ∈ Λ r k P tM,λ r (( x, θ ) , · ) − P tM,λ r ((˜ x, ˜ θ ) , · ) k TV ≤ − α ) . In particular α is independent of M and λ r .Proof. Let M ∈ M and λ r ∈ Λ r . First note that for ( ξ, θ ) = ( M − x, θ ) and ( ˜ ξ, ˜ θ ) =( M − ˜ x, ˜ θ ) we have the equality(22) k P tM,λ r (( x, θ ) , · ) − P tM,λ r ((˜ x, ˜ θ ) , · ) k TV = k ˜ P tM,λ r (( ξ, θ ) , · ) − ˜ P tM,λ r (( ˜ ξ, ˜ θ ) , · ) k TV , where ˜ P tM,λ r is the transition kernel of a standard BPS with energy function ˜ U M ( ξ ) = U ( M ξ ) as defined in Figure 2, and refreshment rate λ r .Our strategy is thus to apply Lemma 12 in [17] to each semigroup ˜ P tM,λ r and then takeadvantage of Equation (22). Observe that it is possible to apply the lemma because part(a) of Assumption 13 implies that for all M ∈ M Z k∇ ˜ U M ( ξ ) k ˜ π M ( dξ ) = Z k∇ U ( x ) k π ( dx ) < ∞ , for ˜ π M ( ξ ) = exp( − ˜ U M ( ξ )) / ˜ Z M , with ˜ U M ( ξ ) = U ( M ξ ) and ˜ Z M = Z/ | det( M ) | . Hence theintegrability condition holds for each ˜ P tM,λ r with respect to the corresponding target ˜ π M .Applying the lemma and Equation (22) it follows that for any compact set K ⊂ { ( x, θ ) ∈ R d × R d : k x k + k θ k ≤ R } , with R ≥
0, for each M ∈ M and λ r ∈ Λ r there exists α = α ( λ r , M, K ) > x, θ ), (˜ x, ˜ θ ) ∈ K , for all t > k P tM,λ r (( x, θ ) , · ) − P tM,λ r ((˜ x, ˜ θ ) , · ) k TV ≤ − α ( λ r , M, K )) . Observe that there is a dependence on M due to the fact that the set of initial conditions K is transformed to K M := { ( ξ, θ ) ∈ R d × R d : k M − ξ k + k θ k ≤ R } in Equation(22). This translates to a dependence on M in the coefficient α , as a consequence of thedependence of α on set K M . This inconvenience can be avoided if we choose α such thatthe coupling inequality is satisfied for all initial conditions ( ξ, θ ) ∈ K M , where R M is suchthat K M := { ( ξ, θ ) ∈ R d × R d : k ξ k + k θ k ≤ R M } satisfies K M ⊂ K M for all M ∈ M .Thus for all M ∈ M it holds that if ( x, θ ) ∈ K , then ( ξ, θ ) = ( M − x, θ ) ∈ K M .The last thing to do is showing that there exists a constant α ∗ > M and λ r such that α ( λ r , M, K ) ≥ α ∗ for all M ∈ M and λ r ∈ Λ r . This is indeed the casefor the following reasons: • The dependence on M appears then in the following term, which in the statementof Lemma 12 in [17] appears as a factor in α ( λ r , M, K ): g ( r ) = P E ≤ r ˜ N sup { ξ : k ξ k≤ (1+ E /λ r ) R M +( r/λ r ) ˜ N } k∇ ˜ U M ( ξ ) k ! . Here ˜ N = N + (1 + E /λ r ) R M for some N >
0, and E , E are independent expo-nential random variables with parameter 1. Because g ( r ) is a factor in α ( λ r , M, K ),we wish to bound it from below, and hence we should bound the supremum of DAPTIVE SCHEMES FOR PDMC ALGORITHMS 20 k∇ ˜ U M ( ξ ) k from below. This can be done as follows:sup { ξ : k ξ k≤ (1+ E /λ r ) R M +( r/λ r ) ˜ N } k∇ ˜ U M ( ξ ) k = sup { x : k M − x k≤ (1+ E /λ r ) R M +( r/λ r ) ˜ N } k M T ∇ U ( x ) k≥ (cid:18) min M ∈M k M − T k (cid:19) sup { x : k x k≤ min M ∈M ( k M k )((1+ E /λ r ) R M +( r/λ r ) ˜ N ) } k∇ U ( x ) k , where we used that ˜ U M ( ξ ) = M T ∇ U ( x ), that k M − x k ≥ k x k / k M k , and thecompactness of M . This is sufficient to eliminate any dependence on M in α . • Depending on the specific factors in the statement of the lemma, the switchingrate λ r can be conveniently bounded either above by λ max r or below by λ min r inorder to bound α ( λ r , M, K ) from below. This can be done in every term in which λ r appears and it follows that the dependence on it can be easily eliminated.We have thus shown that we can choose α ∗ > M ∈ M and all λ r ∈ Λ r k P tM,λ r (( x, θ ) , · ) − P tM,λ r ((˜ x, ˜ θ ) , · ) k TV ≤ − α ∗ ) . (cid:3) Drift condition for the BPS.
The second condition we need is uniformity of theconstants in the drift condition for the family of preconditioned BPS and/or for thefamily of BPS with different refreshment rate. To this end, we go through the proof ofLemma 7 from [17] to show that this is indeed the case.
Lemma 19.
Consider a family of BP processes with generators {L M,λ r : M ∈ M , λ r ∈ Λ r } , where M is a compact space of non-singular matrices that act as preconditioners, λ r is the refreshment rate and Λ r = [ λ min , λ max ] for some < λ min ≤ λ max < ∞ . LetAssumption 13 hold and let ψ = N (0 , d ) . Then there are A , A > and a class offunctions { V M,λ r : M ∈ M , λ r ∈ Λ r } such that for each M ∈ M and λ r ∈ Λ r it holdsthat L M,λ r V M,λ r ( x, θ ) ≤ − A V M,λ r ( x, θ ) + A for all ( x, θ ) ∈ R d × R d , where in particular A , A do not depend on M .Proof. We follow the same underlying idea that was used in the proof of Lemma 17. Itis in fact sufficient that there exist A , A >
0, both independent of M and λ r , such thatfor all M and λ r there exists a function ˜ V M,λ r ( ξ, θ ) such that(24) ˜ L M,λ r ˜ V M,λ r ( ξ, θ ) ≤ − A ˜ V M,λ r ( ξ, θ ) + A for all ( ξ, θ ) ∈ R d × R d , where ˜ L M,λ r is the generator of a standard BPS with target ˜ U M ( ξ ) = U ( M ξ ) and re-freshment rate λ r . Indeed, we can then define V M,λ r ( x, θ ) = ˜ V M,λ r ( M − x, θ ) to obtain byProposition 3(25) L M,λ r V M,λ r ( x, θ ) = ˜ L M,λ r ˜ V M,λ r ( ξ, θ ) ≤ − A V M,λ r ( x, θ ) + A for all ( x, θ ) ∈ R d × R d . In order to show the inequality in (24) we rely on [17, Lemma7]. In particular we begin by showing that the constants c , c , c , c > R >
DAPTIVE SCHEMES FOR PDMC ALGORITHMS 21 in [17, Assumption A8] can be chosen independently of the specific M for the class ofstandard BP samplers with targets ˜ π M . This implies that A , A can then be chosen tobe the same for every M ∈ M . The dependence on λ r is dealt with as a second step.For ease of notation from now on we denote the corresponding energy function as U M ( ξ ).Let us restrict our attention to the case ψ = N (0 , d ) and thus, using the notation in[17], we take ℓ ( ξ ) = 1, H ( k θ k ) = η k θ k for some η ∈ (0 ,
1) small enough such that R R d exp ( η k θ k ) ψ ( dθ ) < ∞ , and ¯ U M ( ξ ) = U ζM ( ξ ) for ζ ∈ (0 ,
1) chosen as in part (c) ofAssumption 13. Observe that η, ζ are uniform in M .We start by considering c , which must be such that k∇ ξ ¯ U M ( ξ ) k ≥ c for all pointsoutside of a ball of radius R . In particular we wish to show that there exist c , R thatsatisfy the property above for all M ∈ M . Observing that for x = M ξ it holds that ∇ ξ ¯ U M ( ξ ) = ζ U ζ − M ( ξ ) ∇ ξ U M ( ξ ) and ∇ ξ U M ( ξ ) = M T ∇ x U ( x ), hence we have that for any M ∈ M k∇ ξ ¯ U M ( ξ ) k = ζ k∇ ξ U M ( ξ ) k U − ζM ( ξ ) ≥ ζ k M − T k k∇ x U ( x ) k U − ζ ( x ) ≥ C k∇ x U ( x ) k U − ζ ( x ) , (26)where C = ζ min M ∈M (1 / k M − T k ) >
0. Assumption 13(c) implies that there exist ˜ c , ˜ R > k∇ x U ( x ) k /U − ζ ( x ) ≥ ˜ c for any x such that k x k ≥ ˜ R , and because M is acompact space, we can choose c = C ˜ c and R = ˜ R , which are then independent of M .The constant c must be such that ℓ ( ξ ) ≤ c , and because for a Gaussian ψ , we maytake as written above ℓ ( ξ ) = 1. Therefore c = 1 for each M ∈ M .Then, c must be such that ( k∇ ξ U M ( ξ ) k / k∇ ξ ¯ U M ( ξ ) k ) ≥ c for all ξ such that ξ > R for some R >
0. Indeed for x = M ξ we have k∇ ξ U M ( ξ ) kk∇ ξ ¯ U M ( ξ ) k = k∇ ξ U M ( ξ ) k ζ U ζ − M ( ξ ) k∇ ξ U M ( ξ ) k = 1 ζ U − ζM ( ξ ) = 1 ζ U − ζ ( x ) . By part (b) of Assumption 13 we have that lim k x k→∞ U ( x ) = + ∞ and since ζ ∈ (0 , c and R large enough such that outside of a ball of radius R we have( k∇ ξ U M ( ξ ) k / k∇ ξ ¯ U M ( ξ ) k ) ≥ c for any M ∈ M .It is left to show that c can be chosen uniform. For A x,M := { θ ∈ R d : η k θ k ≤ U M ( M − x ) } , c must be such that(27) k∇ ¯ U M ( ξ ) k sup θ ∈ A x,M k θ k ! ≤ c for any ξ such that k ξ k > R for some R >
0. Observing that for θ ∈ A x,M it holds that sup θ ∈ A x,M k θ k ≤ η U ζM ( ξ ), weobtain k∇ ¯ U M ( ξ ) k sup θ ∈ A x,M k θ k ! ≤ (cid:16) ζ (1 − ζ ) U ζ − M ( ξ ) k∇ ξ U M ( ξ )( ∇ ξ U M ( ξ )) T k + ζ U ζ − M ( ξ ) k∇ ξ U M ( ξ ) k (cid:17) η U ζM ( ξ ) ≤ ζ (1 − ζ ) η k∇ ξ U M ( ξ ) k U − ζM ( ξ ) ! + 3 ζη k∇ ξ U M ( ξ ) k U − ζM ( ξ ) DAPTIVE SCHEMES FOR PDMC ALGORITHMS 22 ≤ ζ (1 − ζ ) η k M T k (cid:18) k∇ x U ( x ) k U − ζ ( x ) (cid:19) + 3 k M kk M T k ζη k∇ x U ( x ) k U − ζ ( x ) ≤ ζ (1 − ζ ) η D (cid:18) k∇ x U ( x ) k U − ζ ( x ) (cid:19) + 3 D ζη k∇ x U ( x ) k U − ζ ( x ) . (28)In the second to last inequality we used that ∇ ξ U M ( ξ ) = M T ∇ x U ( x ) M with x = M ξ , andin the last inequality we defined D = max M ∈M {k M k ∨ k M T k} . Part (c) of Assumption 13implies that there exist ˜ c , ˜ R such that k∇ x U ( x ) k U − ζ ( x ) ≤ ˜ c , k∇ x U ( x ) k U − ζ ( x ) ≤ ˜ c for all x such that k x k ≥ ˜ R . As a consequence, because M is a compact space and by (28), there exist c , R in-dependent of M such that (27) holds for all M ∈ M . It is now sufficient to take R = max { R , R , R } , and we have shown that c , c , c , c , R can be picked uniformly forall standard BP samplers with energy function U M . In particular there is no dependenceon λ r in all these constants.It is important to observe that part (c) of Assumption 13 is satisfied by all BP samplerswith targets in { ˜ π M } M ∈M because we have bounds on k M k and k M − k . Moreover, parts(a) and (b) of Assumption 13 are trivially verified because of the transformation schemein Figure 2 that defines ˜ U M . It is then possible to apply [17, Lemma 7] to each processto obtain the drift condition (24). Therefore for each M ∈ M and λ r ∈ Λ r we have theLyapunov function(29) ˜ V M,λ r ( ξ, θ ) = exp (cid:16) κ λ r U ζM ( ξ ) (cid:17) ϕ λ r (cid:18) rc h θ, ∇ ξ ¯ U M ( ξ ) i (cid:19) + exp (cid:0) η k θ k (cid:1) , where η ∈ (0 ,
1) and r depend only the distribution at refreshments ψ , κ λ r ∈ (0 ,
1) dependson the c i ’s, and ϕ λ r is a positive, non decreasing, continuously differentiable function asdefined in [17]. Notice that, as a consequence of the calculations above, κ λ r and ϕ λ r donot depend on the preconditioner. Applying our usual transformation scheme we obtainthe functions(30) V M,λ r ( x, θ ) = exp (cid:0) κ λ r U ζ ( x ) (cid:1) ϕ λ r (cid:18) rc h θ, M T ∇ x ¯ U ( x ) i (cid:19) + exp (cid:0) η k θ k (cid:1) . This means that for each ˜ V M as defined in (29), the same proof of [17, Lemma 7] can befollowed and thus ˜ V M satisfies(31) ˜ L M,λ r ˜ V M,λ r ( ξ, θ ) ≤ − A ( λ r ) ˜ V M,λ r ( ξ, θ ) + A ( λ r ) for all ( ξ, θ ) ∈ R d × R d , In particular, A , A both depend on λ r , on c , c , c , c , R , and on other constants thatdepend only on ψ and thus are independent of M . Therefore A , A can be chosenuniformly in M ∈ M .The last step is now to eliminate the dependence on λ r in A , A . Observe that A ( λ r )is defined in the proof of [17, Lemma 7] as a minimum of several constants. It is enoughfor our needs to notice that the constants have a continuous dependence in λ r and that weare considering λ r ∈ [ λ min , λ max ]. On the other hand, A ( λ r ) can be chosen independentlyof λ r by continuity of V M,λ r in λ r . We can therefore choose A , A independently both of M and λ r , thus the wanted simultaneous drift condition follows by (25). (cid:3) DAPTIVE SCHEMES FOR PDMC ALGORITHMS 23
Finalising the proof of Theorem 14.
Let ∆ t > V M,λ r ( x, θ )+ V M,λ r ( x, θ ) ≤ C are compact by definition of the class of Lyapunov functions { V M,λ r : M ∈ M , λ r ∈ Λ r } . We are in particular free to choose the constant C as large as we wish. Note alsothat the coupling inequality holds for all t >
0, hence there are no constraints on thechoice of ∆ t . Moreover µ ( V M,λ r ) = ˜ µ M ( ˜ V M,λ R ) < ∞ for all M ∈ M and λ r ∈ Λ r . Here˜ V M,λ R is the Lyapunov function of a standard BPS with refreshment rate λ r and target˜ µ M = ˜ π M × Ψ. The containment condition is thus verified as all conditions in part (2) ofTheorem 10 hold. Proposition 20 implies the diminishing adaptation condition, and thusergodicity follows.4.3.
Proving the diminishing adaptation condition.
A key part of Theorem 5 is condition (b), i.e. the diminishing adaptation condition. For the adaptive scheme described in Section 2.4 the condition can easily be shown to hold, as the adaptation happens with diminishing probability.
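To make this mechanism explicit, the sketch below (our illustration, not code from the paper's repository) updates the adaptation parameters only with a probability p_n that vanishes as n grows; the `update` callback standing in for the covariance and refreshment-rate estimators is hypothetical.

```julia
# Minimal sketch of adaptation with diminishing probability: the parameters (M, λr)
# are re-estimated only with probability p_n → 0, so the transition kernels used at
# steps n and n+1 coincide with probability at least 1 - p_{n+1}.
function maybe_adapt(M, λr, trajectory, n; update = (M, λr, tr) -> (M, λr))
    p_n = 1 / (1 + log(n))            # any positive sequence decreasing to zero
    return rand() < p_n ? update(M, λr, trajectory) : (M, λr)
end
```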
Proposition 20.
Consider the adaptive schemes in Section 2. In particular, assume that the sequence { p_n } of probabilities of updating the adaptation parameters is such that p_n → 0 as n → ∞. Then the diminishing adaptation condition holds for any t ≥ 0.

Proof. Consider for example the adaptive BPS. Observe that M_{n+1} = M_n and λ_r^{n+1} = λ_r^n with probability 1 − p_{n+1}, and thus
‖ P^{∆t}_{M_{n+1}, λ_r^{n+1}}((x, θ), ·) − P^{∆t}_{M_n, λ_r^n}((x, θ), ·) ‖_TV ≤ p_{n+1} → 0 as n → ∞.
Thus the diminishing adaptation condition holds. The same reasoning works for the adaptive ZZS. □

5. Numerical experiments
In this section we test the empirical performance of the adaptive schemes we defined in Section 2.4. All experiments are implemented in Julia and the corresponding codes can be found at https://github.com/andreabertazzi/Adaptive_PDMC_samplers . Let us state some settings that hold for all experiments below. The time horizon is set to T = 10 for all processes. This is large enough for the adaptive PDMC samplers to learn and take advantage of the covariance structure. When considered fixed, the refreshment rate of the BPS is taken to be λ_r = 1. The discretisation step is chosen to be ∆t = 0. , and the time between adaptations is t_adap = 2000 continuous time units. The probability of adapting decays as O(1/ log log n). The performance measures we consider are the effective sample size per second (ESS/sec) for the mean and for the radius statistic t(x) = Σ_{i=1}^d x_i^2. These are computed in continuous time by estimating the asymptotic variance and the variance of the Monte Carlo estimate on the continuous time trajectories of the processes as discussed in [7]. Finally, in all settings we repeat the same task 20 times and report all the results in box-plots. The legend of all plots is as follows: Adaptive BPS (full,adap) denotes the BPS that learns the entire covariance matrix (full) and has an adaptive refreshment rate (adap). Alternatives for the preconditioner are diag and off, while the refreshment rate can alternatively be fixed. The Zig-Zag samplers have similar notation.
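As an illustration of this legend, the sketch below shows one possible way to form the preconditioner from an empirical covariance estimate of the stored positions; the function is ours and only mirrors the naming full / diag / off, it is not taken from the repository linked above.

```julia
using LinearAlgebra, Statistics

# Sketch: build the preconditioner M from an empirical covariance estimate.
#   :full — square root of the whole estimated covariance matrix,
#   :diag — only the componentwise standard deviations,
#   :off  — preconditioning disabled (identity matrix).
function preconditioner(positions::AbstractMatrix, variant::Symbol)
    C = cov(positions; dims = 2)          # columns of `positions` are stored positions
    d = size(C, 1)
    variant == :full && return Matrix(sqrt(Symmetric(C)))
    variant == :diag && return Matrix(Diagonal(sqrt.(diag(C))))
    return Matrix{Float64}(I, d, d)        # :off
end
```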
5.1. Multidimensional Gaussian target. In this section we focus on two different kinds of multivariate Gaussian target distributions. The first one, denoted by MG1, has unitary variances and correlation ρ between all components. Denoting the covariance matrix by Σ, this means that Σ_ii = 1 for each i = 1, . . . , d and Σ_ij = ρ for all i ≠ j. We study how the adaptive PDMC algorithms compare to their non-adaptive counterparts for different values of ρ and different dimensionalities. In this setting we focus on adaptive algorithms that estimate the full covariance matrix. The second Gaussian target we consider has variances 0. , . . . , 15, repeated depending on the dimension, together with a milder correlation between components. This setting is denoted as MG2 and is useful to compare all kinds of adaptive algorithms we introduced.

5.1.1. MG1 target.
In the first experiment we consider a 50-dimensional
MG1 targetfor different values of ρ . In Figure 3 the average ESS/sec and the ESS/sec for the radiusstatistic are shown. As expected, the performance of the ZZS is degrading as the corre-lation increases. This behaviour is for the most part caused by the very large number ofevents that have to be simulated for very narrow targets. The adaptive ZZS successfullyimproves over this inconvenience and is stable with respect to the increasing correlation.The standard BPS with fixed refreshment rate shows a decaying average ESS/sec, whilethe ESS/sec for the radius statistic appears to increase as ρ grows up until ρ = 0 . λ r = 1 ismore suited for the estimation of the radius in case of a more concentrated target ratherthan for a standard Gaussian. A similar behaviour is shown by the adaptive BPS’s. Over-all we notice a marked improvement for the BPS with adaptive preconditioner and fixed λ r . Choosing to adapt only the refreshment turns out to be a detrimental decision whenthe target is correlated. Indeed the optimality criterion derived [9] assumes a standardGaussian target.In Figure 4 we study how the adaptive schemes compare to the standard ones for an MG1 target with correlation ρ = 0 . λ r is chosen at first due to the high number of reflections, thus slowing downthe estimation of the covariance matrix. It is worth pointing out that the performanceof the BPS with adaptive preconditioner and refreshment improves compared to the BPSas the dimension increases. This is according to the theoretical results in [9], which areindeed obtained in the high dimensional limit. Therefore we expect that for a large d itis reasonable to apply both the transformation scheme and the tuning of λ r .5.1.2. MG2 target.
Let us now consider a 50-dimensional MG2 target with a mild correlation set to ρ = 0.3. Figure 5 shows the results for several adaptive PDMC samplers. The adaptive algorithms that learn the entire covariance matrix show the largest gain in terms of ESS/sec. For the BPS the choice of learning only the variance of each component of the target seems interesting, also in view of larger dimensions. As in the previous section, we observe that the adaptation of the refreshment rate seems to have a bad effect for anisotropic targets.
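For concreteness, the MG1 covariance used in Section 5.1 can be generated as in the sketch below (our illustration); the MG2 covariance is built analogously from its list of variances and a milder correlation, whose exact values we do not reproduce here.

```julia
# Sketch: covariance of the MG1 target — unit variances, common correlation ρ
# between every pair of distinct components.
mg1_covariance(d::Int, ρ::Real) = [i == j ? 1.0 : float(ρ) for i in 1:d, j in 1:d]

# Example: Σ = mg1_covariance(50, 0.5). Its eigenvalues are 1 + (d-1)ρ and 1 - ρ,
# so the target becomes increasingly anisotropic as ρ approaches 1.
```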
Figure 3. Results as a function of the correlation for MG1 targets. (a) Comparison between the ZZS and the ZZS with adaptive preconditioner. (b) Comparison between the BPS and various alternative adaptive BPS's (off,adap; full,adap; full,fixed). Both panels report the average ESS/sec and the ESS/sec for the radius statistic against the correlation.

Figure 4. MG1 target with ρ = 0. : average ESS/sec and ESS/sec for the radius statistic as a function of the dimension, for ZigZag, BPS, Adaptive ZigZag (full), Adaptive BPS (full,adap) and Adaptive BPS (full,fixed).

5.2. Logistic regression with correlated data.
The next numerical experiment we consider is a Bayesian logistic regression task. In this setting, for j = 1, . . . , n_obs a binary output value y_j ∈ {0, 1} has distribution
P(Y_j = 1 | β) = 1 / (1 + exp(−β^T x_j)),
where {x_j}_{j=1}^{n_obs} are known covariates and β ∈ R^d is an unknown parameter. We take a flat prior and thus obtain the posterior
π(β | {y_j}_{j=1}^{n_obs}) ∝ ∏_{j=1}^{n_obs} [exp(−β^T x_j)]^{1−y_j} / (1 + exp(−β^T x_j)).
We force correlation between some components of the parameter by taking, for j = 1, . . . , n_obs and i = 1, . . . , d, (x_j)_i = 1 + ε N_{ji}, where N_{ji} ∼ N(0, 1) and ε = 0.1.
The results of the experiment are reported in Figure 6, in which the samplers are tested with targets as above with d = 2, . . . , 16 and n_obs = 1000. The adaptation of the refreshment rate follows the alternative scheme discussed in Appendix A.3. This scheme seems more stable, as the refreshment rate is updated gradually and cannot jump immediately to very large or small values. Although the dimensionality is small and the correlation is limited to a subset of the coordinates, we observe that the adaptive schemes outperform their standard counterparts.

6. Discussion
In this paper we proposed adaptive schemes to overcome two of the current issues withthe BPS and ZZS. We have shown that the refreshment rate and the excess switchingrate can be tuned on the fly as long as the updates or the probabilities of updatingdecrease to 0. With this approach the user does not have to worry about tuning therefreshment rate. A current limitation is that more theory or experiments are needed to
determine a criterion that works well with anisotropic targets. In addition to this, we have proposed a way to make the PDMC samplers learn and take advantage of the covariance structure of the target. The numerical experiments we conducted suggest that this can lead to a remarkable performance improvement when there are strong anisotropicities. In addition, the adaptive schemes discussed in this paper can be applied to the Boomerang sampler [8]. This would result in elliptical dynamics that are adapted to resemble the (unimodal) target at hand. Naturally, the adaptive algorithms should be run with a time horizon that is large enough to benefit from the adaptation. Two other important settings are the discretisation step and the time between two adaptations. Based on our experience with the experiments, we suggest ∆t = O(10−) and t_adap = O(10). For concentrated targets, as for instance posteriors when there is a very large number of data points, it is suggested to choose both values small. We remark that the adaptive PDMC algorithms with subsampling are applicable in the setting of tall data, that is when the data-set is made of a large number of observations, but with a moderate dimensionality. Empirical experiments in this setting are a very interesting area of future research.

A question that one could naturally ask is how applicable this transformation scheme is in case of a multimodal target. The answer depends on the specific target at hand, but one can design a target as a mixture of Gaussian distributions for which applying the transformation scheme would not speed up the convergence of the sampler. However, when the target is multimodal it is possible for instance to use the adaptive PDMC samplers together with the framework proposed in [28]. In this framework, the adaptive PDMC samplers would be beneficial since the regions around each mode would be explored more efficiently by taking advantage of the covariance structure of the specific mode.

Figure 5. Comparison of several samplers in the context of Section 5.1.2. (a) Results of the MG2 experiment for the ZZS (ZigZag, Adaptive ZigZag (diag), Adaptive ZigZag (full)). (b) Results of the MG2 experiment for the BPS (BPS, Adaptive BPS (diag,fixed), (full,fixed), (full,adap), (diag,adap), (off,adap)). Both panels report the ESS/sec against the variance.

Figure 6. Logistic regression task of Section 5.2: average ESS per second and squared radius ESS per second as a function of the dimension, for ZigZag, BPS, Adaptive ZigZag (full), Adaptive BPS (full,adap), Adaptive BPS (full,fixed) and Adaptive BPS (off,adap).
Appendix A. Implementation of adaptive PDMC algorithms
The main issue in the exact simulation of PDMPs lies in the simulation of the switching times. It is generally proposed to use Poisson thinning to overcome this difficulty. The idea is to find upper bounds for the switching rates that are more tractable, then simulate a Poisson process with said rate and finally adjust with an acceptance-rejection step. Applying a preconditioning matrix to PDMPs results in a modification of the switching rates, and thus the bounds proposed in [7, 12] do not directly apply to our proposed algorithms. In the following two sections we give two important examples to motivate that it is possible to find bounds with similar ideas to those used for standard PDMC algorithms. In Appendix A.3 we discuss a different adaptation strategy for the refreshment rate.
A.1. Dominated Hessian of the negative log-likelihood.
Let us consider the case inwhich the Hessian of the negative log-likelihood is dominated by a positive definite matrix.Denoting the Hessian by H U ( x ) = ( ∂ i ∂ j U ( x )) di,j =1 , this means that for any x ∈ R d it holdsthat h H U ( x ) u, u i ≤ h Qu, u i for all u ∈ R d . Then we say that H U ( x ) (cid:22) Q . Assuming Q issymmetric, we have for all u, v ∈ R d that h u, H U ( x ) v i ≤ k u k k Qv k .Let us consider the switching rates of a ZZ process with preconditioner M , i.e. λ Mi ( x, θ ) =( θ i h M i , ∇ U ( x ) i ) + , where M i is the i -th column of M . Remember that the deterministictrajectory of this process is x ( t ) = x + M θt , where ( x, θ ) is the initial condition. We have ∂ i U ( x ( t )) = ∂ i U ( x ) + Z t d X j =1 ∂ j ∂ i U ( x ( s ))( M θ ) j ds = ∂ i U ( x ) + Z t h H U ( x ( s )) e i , M θ i ds, where e i , is the i -th vector of the canonical basis, i.e. with zeros in all components exceptfor a 1 in the i -th component. Taking a i = θ i h M i , ∇ U ( x ) i and b i = √ d k M k k QM i k itfollows that λ Mi ( x ( t ) , θ ) = θ i h M i , ∇ U ( x ) i + θ i d X j =1 M ji Z t h H U ( x ( s )) e j , M θ i ds ! + ≤ (cid:18) θ i h M i , ∇ U ( x ) i + Z t h H U ( x ( s )) M i , M θ i ds (cid:19) + ≤ ( a i + t k M θ k k QM i k ) + ≤ ( a i + b i t ) + . It is possible to have a tighter bound taking b i = k QM i k k M θ k , but this entails havingto update b i after every switching time, which would make the overall simulation moreexpensive. Note that in particular for a diagonal M we have that k M θ k = ( P di =1 M ii ) / and we can take a i = θ i M ii ∂ i U ( x ) and b i = M ii k Q i k ( P di =1 M ii ) / . It is worth observingthat k Q k can be computed once before running the algorithm and then stored. Terms k M k and k M i k can in principle be estimated by taking a maximum over the class ofmatrices M . However, having loose computational bounds leads to a lower percentage ofaccepted proposed events and thus to an increased computational burden.For BPS with preconditioner M we showed that λ M ( x, θ ) = ( h M θ, ∇ U ( x ) i ) + . Withcomputations similar to the ones above we obtain λ M ( x ( t ) , θ ) = (cid:18) h M θ, ∇ U ( x ) i + Z t h M θ, H U ( x ( s )) M θ i ds (cid:19) + ≤ ( h M θ, ∇ U ( x ) i + t h M θ, QM θ i ) + ≤ ( a + bt ) + DAPTIVE SCHEMES FOR PDMC ALGORITHMS 30 with a = h M θ, ∇ U ( x ) i and b = h M θ, QM θ i . Note that M θ is the velocity of the processand needs to be computed in any case to determine the trajectories.A.2.
Subsampling techniques.
PDMC algorithms are particularly interesting becausethey allow for exact subsampling. This means that it is possible to modify PDMC samplerssuch that the correct invariant measure is maintained but without going through the entiredata-set at each iteration. This was illustrated in [7, 12] for ZZS and BPS. Subsamplingtechniques are particularly helpful when the number of data points n is much larger thanthe dimensionality d of the posterior density function. This is an interesting scenariofor the adaptive algorithms that are here introduced, as for a small d the additionalcomputations necessary to learn and take the square root of (part of) the covariance matrixare not overpowering. In this section we explain how adaptive PDMC with subsamplingcan be implemented.Suppose the partial derivatives of the posterior distribution can be written in the form(32) ∂ i U ( x ) = 1 n n X j =1 E ji ( x ) , where E ji : R d → R d are continuous mappings. This holds for example when the targetsatisfies U ( x ) = 1 n d X j =1 U j ( x ) . This is the case for instance when the target is a posterior density of a model with iidobservations. In such cases one can choose U j ( x ) = − log( π ( x )) − n log ( l ( y j | x )), where π is a prior of the parameters and l is the likelihood associated to observation y j . Thenthe idea is to define a collection of switching rates along a trajectory. m ji ( t ) = ( θ i h M i , E j ( x + M θt )) + , where E j ( x ) = ( E j ( x ) , . . . , E jd ( x )) T . If we can find uniform bounds H i ( t ) such that m ji ( t ) ≤ H i ( t ) for all j , then the subsampling procedure is successful. One can simulateIPPs with rates H i ( t ) and then accept the proposed switch of component i with probabil-ity m Ji ( t ) /H i ( t ) where J ∼ Unif { , . . . , n } . The next proposition shows that stationarityof π is preserved for the ZZS with subsampling also introducing a preconditioner. Proposition 21.
The Zig-Zag sampler with subsampling and preconditioner M has invariant distribution µ = π × Unif(Θ). Moreover it coincides with a Zig-Zag process with switching rates
λ^M_i(x, θ) = (1/n) Σ_{j=1}^n ( θ_i ⟨ M_i, E^j(x) ⟩ )_+ for (x, θ) ∈ E, i = 1, . . . , d.
Proof.
The statements can be derived following the proof of Theorem 4.1 in [7]; the details are thus omitted. □
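As an illustration of the acceptance step described above, the sketch below proposes an event for component i from the bounding rate H_i(t) = (a_i + b_i t)^+ and accepts it using a single, uniformly chosen data index; `E` stands in for the gradient estimator E^j evaluated along the trajectory and, like the other names, is an assumption of this sketch rather than the paper's code.

```julia
using LinearAlgebra

# Thinning/acceptance step for component i of a preconditioned Zig-Zag sampler with
# subsampling: H_i(t) = (a_i + b_i t)^+ bounds m_i^j(t) for every data index j, and the
# proposed switch is accepted with probability m_i^J(t) / H_i(t), J ~ Unif{1,…,n}.
function accept_switch(i, t, a, b, θ, M, E, n)
    H = max(a[i] + b[i] * t, 0.0)
    H > 0 || return false
    J = rand(1:n)                                   # single uniformly chosen observation
    m = max(θ[i] * dot(M[:, i], E(J, t)), 0.0)      # m_i^J(t) = (θ_i ⟨M_i, E^J(x + Mθt)⟩)_+
    return rand() < m / H
end
```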
A very interesting situation is that of a negative log-likelihood with Lipschitz partialderivatives, i.e. | ∂ i U ( x ) − ∂ i U ( y ) |≤ C i k x − y k p . In this case we can take a reference point DAPTIVE SCHEMES FOR PDMC ALGORITHMS 31 x ∗ and take E j ( x ) = ∇ U ( x ∗ ) + ∇ U j ( x ) − ∇ U j ( x ∗ ) . This allows us to obtain the following nice uniform bound on the switching rates m ji ( t ) = ( θ i h M i , ∇ U ( x ∗ ) + ∇ U j ( x + M θt ) − ∇ U j ( x ∗ ) i ) + ≤ (cid:0) θ i h M i , ∇ U ( x ∗ ) i + |h M i , ∇ U j ( x + M θt ) − ∇ U j ( x ∗ ) i| (cid:1) + ≤ θ i h M i , ∇ U ( x ∗ ) i + d X l =1 C l | M li |k x + M θt − x ∗ k p ! + ≤ ( θ i h M i , ∇ U ( x ∗ ) i + h| M i | , C ik x − x ∗ k p + t h| M i | , C ik M θ k p ) + . Therefore we can simulate d IPPs with rates H i ( t ) = ( a i + b i t ) + , where a i = θ i h M i , ∇ U ( x ∗ ) i + h| M i | , C ik x − x ∗ k p and b i = d /p h| M i | , C ik M k p .The same estimator for the gradient of the negative log-density can be used for apreconditioned BPS. With computations analogous to above we find m ji ( t ) ≤ ( h M θ, ∇ U ( x ∗ ) i + h| M θ | , C ik x − x ∗ k p + t h| M θ | , C ik M θ k p ) + ≤ H i ( t ) . Although establishing ergodicity of these adaptive algorithms would be an interestingresearch question, we leave it for future research due to the lack of theoretical results fortheir standard counterparts.A.3.
A different adaptation strategy for the refreshment rate.
The adaptive scheme for the refreshment rate described in Section 2.4 is a natural implementation of the results in [9]. However, it can be unstable when combined with an adaptive preconditioner. Here we define an alternative adaptive strategy and we prove that the diminishing adaptation condition from Theorem 9 is satisfied. Assume that the refreshment rate of the BPS is constant in the position and velocity spaces. We update it iteratively as follows:

(33)    λ_r^{n+1} = λ_r^n + q_{n+1}   if n_refresh(n+1) / n_events(n+1) < λ*,
        λ_r^{n+1} = λ_r^n − q_{n+1}   if n_refresh(n+1) / n_events(n+1) > λ*,

in which n_refresh(n) and n_events(n) are respectively the number of refreshments and the number of events up to time n, q_n is a positive, decreasing sequence such that q_n → 0, and λ* = 0. is the target fraction of refreshments. Moreover, at every adaptation time n there is a probability of adapting equal to p_n, such that p_n → 0. In the remainder of this section we show that for the adaptive scheme described in (33) the diminishing adaptation condition holds even if the refreshment rate is updated with probability 1.
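A sketch of the update rule (33) is given below; the target fraction λ* and the step sequence q_n are left as inputs, since their numerical values (the paper takes λ* from [9]) are not reproduced here.

```julia
# Update rule (33): nudge the refreshment rate towards the target fraction λ* of
# refreshments among all simulated events, by a step q_next that decreases to zero.
function update_refreshment_rate(λr, λstar, n_refresh, n_events, q_next)
    frac = n_refresh / n_events          # observed fraction of refreshment events
    frac < λstar && return λr + q_next   # too few refreshments: increase λr
    frac > λstar && return λr - q_next   # too many refreshments: decrease λr
    return λr
end
```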
Lemma 22.
Let π be a probability distribution. Denote as P^t_γ the semigroup of a ZZ process with excess switching rate γ : E → R^d_+. Let γ_1, γ_2 be two bounded excess switching rates. Then for any t ≥ 0,

(34)    sup_{(x,θ) ∈ E} ‖ δ_{(x,θ)} P^t_{γ_1} − δ_{(x,θ)} P^t_{γ_2} ‖_TV → 0    as ‖ γ_1 − γ_2 ‖_∞ → 0.

Similarly, denote now by P^t_λ the semigroup of a BPS with refreshment rate λ : E → R_+. Let λ_1, λ_2 be two bounded refreshment rates. Then for any t ≥ 0,

(35)    sup_{(x,θ) ∈ E} ‖ δ_{(x,θ)} P^t_{λ_1} − δ_{(x,θ)} P^t_{λ_2} ‖_TV → 0    as ‖ λ_1 − λ_2 ‖_∞ → 0.

Proof.
Consider first the case of the BPS. Consider the generators of two BPS’s withrefreshment rates λ and λ and denote them respectively by L λ and L λ . Now observethat L λ = L λ + B , where B is a bounded operator such that Bf ( x, θ ) = ( λ ( x, θ ) − λ ( x, θ )) Z R d ( f ( x, θ ′ ) − f ( x, θ )) ψ ( dθ ′ ) . In other words, L λ is a perturbed version of L λ , with bounded perturbation B . Since L λ and L λ generate strongly continuous semigroups ( P tλ ) t ≥ , ( P tλ ) t ≥ , we can apply thereasoning in the proof of Corollary 1.11 in Chapter 3 of [18] to obtain that, for any t > k P tλ − P tλ k → B →
0, which is equivalent tosup f ∈C ( E ) , | f |≤ sup ( x,θ ) ∈ E | P tλ f ( x, θ ) − P tλ f ( x, θ ) | → k λ − λ k ∞ → , where C ( E ) is the space of continuous functions on E that vanish at infinity. Invertingthe order of the two suprema and applying the Riesz-Markov representation theorem (seee.g. [4]) we obtain the result in (35).Now consider the case of two ZZS with two different bounded excess switching rates γ : E → R d + and γ : E → R d + . The perturbation is now given by˜ Bf ( x, θ ) = d X i =1 (( γ ( x, θ )) i − ( γ ( x, θ )) i )( f ( x, F i θ ) − f ( x, θ )) . Then the result (34) follows by the same arguments as above. (cid:3)
Proposition 23.
Consider the adaptive PDMC algorithms defined in Section 2.4, withthe following modification. Let the adaptation of the refreshment rate be performed withprobability at adaptation times, but in such a way that sup ( x,θ ) ∈ E | λ n +1 r ( x, θ ) − λ nr ( x, θ ) | ≤ δ n +1 , where ( δ n ) n ≥ is such that δ n → as n → ∞ , and similarly for the refreshment rateof the ZZS. Then for any ∆ t > n →∞ sup ( x,θ ) ∈ E k P ∆ tM n +1 , λ n +1 r (( x, θ ) , · ) − P ∆ tM n , λ nr (( x, θ ) , · ) k TV ! = 0 in probability , and similarly for the ZZS.Proof. By the triangle inequality for the total variation distance we obtain k δ ( x,θ ) P ∆ tM n +1 , λ n +1 r − δ ( x,θ ) P ∆ tM n , λ nr k TV ≤ k P ∆ tM n +1 , λ n +1 r (( x, θ ) , · ) − P ∆ tM n +1 , λ nr (( x, θ ) , · ) k TV + k P ∆ tM n +1 , λ nr (( x, θ ) , · ) − P ∆ tM n , λ nr (( x, θ ) , · ) k TV . (36)The inequality above allows us to deal with the two adaptations separately. The secondterm in the right hand side of (36) is handled by Proposition 20.Now focus on the first term in the right hand side of (36), for which we want to applyLemma 22. It is sufficient to observe that supremum norm of the perturbation goes tozero in probability when n → ∞ . Indeed the sequence of refreshment rates is Cauchy DAPTIVE SCHEMES FOR PDMC ALGORITHMS 33 in probability, as for all n we have k λ n +1 r − λ nr k ∞ ≤ δ n +1 . Convergence in probabilityfollows from the fact that ( δ n ) n ≥ is a convergent sequence and therefore the diminishingadaptation condition holds. (cid:3) Appendix B. Omitted proofs and technical results
B.1.
Proofs of Section 2.
Proof of Proposition 1.
The extended generator ( ˜ L M , D ( ˜ L M )) of the standard ZZ processis by [15, Definition 14.15] such that for all ˜ f ∈ D ( ˜ L M )(37) M ˜ ft = ˜ f (Ξ( t ) , Θ( t )) − ˜ f ( ξ, θ ) − Z t ˜ L M ˜ f (Ξ( s ) , Θ( s )) ds is a local martingale. For any ˜ f ∈ D ( ˜ L M ) we define a function h ( x, θ ) = ˜ f ( ξ, θ ) with x = M ξ , for any M ∈ M . Observe that h ( X ( t ) , Θ( t )) = ˜ f (Ξ( t ) , Θ( t )) since X ( t ) = M Ξ( t ).The generator of the transformed process then satisfies Condition (37) provided that L M h ( x, θ ) = ˜ L M ˜ f ( ξ, θ ) for all ( x, θ ) ∈ E. Therefore ˜ L M ˜ f ( ξ, θ ) = h θ, ∇ ξ h ( M ξ, θ ) i + d X i =1 ˜ λ M,i ( ξ, θ )( h ( M ξ, F i θ ) − h ( M ξ, θ ))= h M θ, ∇ x h ( x, θ ) i + d X i =1 ˜ λ M,i ( M − x, θ )( h ( x, F i θ ) − h ( x, θ ))= L M h ( x, θ ) . (cid:3) Proof of Proposition 2.
Following the approach in the proof of Theorem 2.2 in [7], we wantto check that for any M ∈ M , and any h ∈ D ( L M ), it holds that R E L M h ( x, v ) dµ = 0.By a change of variable and using the same reasoning as in the proof of Proposition 1, forany Z L M h ( x, θ ) dµ ( x, θ ) = 1 Z X θ ∈{− , +1 } d Z L M h ( x, θ ) exp( − U ( x )) dx = 1˜ Z X θ ∈{− , +1 } d Z ˜ L M ˜ f ( ξ, θ ) exp( − ˜ U M ( ξ )) dξ = Z ˜ L M ˜ f ( ξ, θ ) d ˜ µ M ( ξ, θ ) = 0 , where ˜ µ M = ˜ π M ⊗ Unif( {− , +1 } d ) the last equality was obtained by the invariance of˜ µ M for the standard Zig-Zag process. (cid:3) DAPTIVE SCHEMES FOR PDMC ALGORITHMS 34
Proof of Proposition 3.
Following the same approach as in the proof of Proposition 1, forany ˜ f ∈ D ( ˜ L M ) we define a function h ( x, θ ) = ˜ f ( ξ, θ ) with x = M ξ . The generator of thetransformed process then satisfies Condition (37) provided that L M h ( x, θ ) = ˜ L M ˜ f ( ξ, θ ) for all ( x, θ ) ∈ E. In this case˜ L M ˜ f ( ξ, θ ) = h θ, ∇ ξ h ( M ξ, θ ) i + ˜ λ M ( ξ, θ )( h ( M ξ, ˜ R ( ξ ) θ ) − h ( M ξ, θ ))+ λ r ( ξ, θ ) Z ( h ( M ξ, θ ′ ) − h ( M ξ, θ )) ψ ( dθ ′ )= h M θ, ∇ x h ( x, θ ) i + ˜ λ M ( M − x, θ )( h ( x, ˜ R ( M − x ) θ ) − h ( x, θ ))+ λ r ( M − x, θ ) Z ( h ( x, θ ′ ) − h ( x, θ )) ψ ( dθ ′ )= L M h ( x, θ ) . (cid:3) Proof of Proposition 4.
As for Proposition 2, it suffices to show that for all M ∈ M R E L M h ( x, v ) dµ = 0. By a change of variable we obtain Z L M h ( x, θ ) dµ ( x, θ ) = 1 Z Z X Z V L M h ( x, θ ) exp( − U ( x )) ψ ( θ ) dθdx = 1˜ Z Z X Z V ˜ L M ˜ f ( ξ, θ ) exp( − ˜ U M ( ξ )) ψ ( θ ) dθdξ = Z ˜ L M ˜ f ( ξ, θ ) d ˜ µ ( ξ, θ ) = 0 , where ˜ µ M = ˜ π M ⊗ ψ and the last equality follows by the invariance of ˜ µ M for the BPS asshown in [12]. (cid:3) B.2.
Proof of Theorem 9.
Let us initially separate the proof under Assumption 7 andnext under Assumption 8. First we show that in both cases the containment conditionholds if the sequence { V Γ n ( Z n ) } n ≥ is bounded in probability.B.2.1. The case of Assumption 7.
Let γ ∈ Y and let C be the uniform ( ν γ , δ, n )-smallset as defined in Assumption 7. As described in [29, 32] we define the following couplingbetween two chains with kernel P γ . The two chains evolve independently each with kernel P γ when they are outside of C . When both chains are inside C , then with probability δ we let the two chains couple after n steps by drawing X n + n = Y n + n from ν γ , or withprobability (1 − δ ) we independently draw X n + n ∼ ( P n γ ( X n , · ) − δν γ ( · )) / (1 − δ ) and Y n + n ∼ ( P n γ ( Y n , · ) − δν γ ( · )) / (1 − δ ). Then by taking h γ ( x, y ) = ( V γ ( x ) + V γ ( y )) / E γ (cid:0) h γ ( X , Y ) (cid:12)(cid:12) X = x, Y = y (cid:1) ≤ λh γ ( x, y ) for all ( x, y ) / ∈ C × C. Define now A ( γ ) := sup ( x,y ) ∈ C × C E γ (cid:0) h γ ( X n , Y n ) (cid:12)(cid:12) X = x, Y = y (cid:1) . Then by [32, Theorem5] we have for each γ ∈ Y the following bound on the total variation distance || P nγ ( x, · ) − µ ( · ) || T V ≤ (1 − δ ) j √ nn k + λ n −√ nn +1 ( A ( γ )) √ n − ( V γ ( z ) + π ( V γ )) / . DAPTIVE SCHEMES FOR PDMC ALGORITHMS 35
The dependence on γ in A ( γ ) can be eliminated by noting that A ( γ ) ≤ sup γ ∈Y ( A ( γ )) ≤ λ n sup { γ ∈Y , x ∈ C } V γ ( x ) + bn =: B, which is independent of γ . In particular by our assumptions we have that B < ∞ and π ( V γ ) < ∞ for each γ .B.2.2. The case of Assumption 8.
Define for each V γ and any measurable function ϕ : E → R and any β ≥ k ϕ k β,V γ = sup z ∈ E | ϕ ( z ) | βV γ ( z ) . For measures µ , µ on E such that, for each γ ∈ Y , µ ( V γ ) < ∞ , µ ( V γ ) < ∞ , considerthe V γ weighted norm(39) ρ β ( µ , µ ) = sup { ϕ : k ϕ k β,Vγ ≤ } ( µ ( ϕ ) − µ ( ϕ )) . Note that ρ β depends on γ . Now for each γ we can apply [17, Theorem 24] and thereforethere exist β ∗ > κ ∈ (0 ,
1) such that ρ β ∗ ( µ P γ , µ P γ ) ≤ κρ β ∗ ( µ , µ ) , where in particular κ and β ∗ depend only on α, λ, C as defined in Assumption 8. Therefore κ and β ∗ do not depend on γ by uniformity of said constants. The norm (38) is such thatif k ϕ k ,V γ ≤
1, then k ϕ k β,V γ ≤ β >
0. Choosing µ = δ z , µ = π , for any z ∈ E we obtain k P nγ ( z, · ) − π ( · ) k TV ≤ ρ β ∗ ( δ z P nγ , π ) ≤ κ n ρ β ∗ ( δ z , π )= κ n sup { ϕ : k ϕ k β ∗ ,Vγ ≤ } ( ϕ ( z ) − π ( ϕ )) ≤ κ n (2 + β ∗ V γ ( z ) + β ∗ π ( V γ ))Here we used (39) in the third equality, and twice (38) in the last inequality. Moreoverby assumption π ( V γ ) < + ∞ for all γ ∈ Y .B.2.3. Boundedness in probability of the sequence of Lyapunov functions.
In both casesthe containment condition (assumption (a) in Theorem 5) is thus satisfied if the process { V Γ n ( Z n ) } n ≥ is bounded in probability. This is indeed implied by our assumption thatΓ n is updated only if Z n ∈ B , where B is a compact set, as suggested in [14, 13, 28]. Forthe sake of completeness we report the main steps.By [30, Lemma 3] it is enough to show that sup n ∈ N E ( V Γ n ( Z n )) < ∞ . By our assump-tions and letting D := sup { z ∈ B, γ ∈Y} V γ ( z ) < ∞ E ( V Γ n +1 ( Z n +1 ) | Z n = z, Γ n = γ ) == E ( V Γ n +1 Z n +1 ( ( Z n +1 / ∈ B ) + ( Z n +1 ∈ B )) | Z n = z, Γ n = γ ) ≤ E ( V Γ n ( Z n +1 ) ( Z n +1 / ∈ B ) | Z n = z, Γ n = γ ) + D ≤ λV γ ( z ) + b + D DAPTIVE SCHEMES FOR PDMC ALGORITHMS 36
Taking expectations on both sides one obtains E ( V Γ n +1 ( Z n +1 )) ≤ λ E ( V Γ n ( Z n )) + b + D. This shows that the sequence is contracting since λ ∈ (0 ,
1) and this is enough by [30,Lemma 2] to show our boundedness in probability of the sequence.B.3.
Proof of Theorem 10.
To prove Theorem 10 it is sufficient to verify that thediscretised process with time step ∆ t satisfies the assumptions of Theorem 9.Let us first consider the assumptions in alternative (1) of Theorem 10. The uniformsmall set condition of the discretised process (Assumption 7(a)) follows by choosing n = t ∆ t . The geometric drift condition (Assumption 7(b)) is a consequence of Lemma 26.Now consider the second set of assumptions, that is alternative (2) in the theorem. Inthis case we wish to show that Assumption 8 holds for the discretised process. Part (a) ofAssumption 8 is immediately verified by the fact that ∆ t = t . For part (b), first observethat from Lemma 26 by our assumptions we have P ∆ tγ V γ ( z ) ≤ e − A ∆ t V γ ( z ) + A A (1 − e − A ∆ t ) . In the notation of Assumption 8(b) we have λ = e − A ∆ t and C = A A . In particular wehave 2 C < C since we assumed that C > A /A .The thesis now follows in both cases by Theorem 9.B.4. Other technical results.
B.4.1.
Small set condition for a family of one-dimensional ZZ processes.
In this sectionwe show a simultaneous minorisation condition for the ZZ process in the one-dimensionalcase. This result is essential to generalise the result to the multidimensional case (seeLemma 15 and its proof).
Assumption 24 (Assumption 3 in [10]) . Let U ∈ C ( R ). Define the switching rates ofa 1-d ZZ process as λ ( x, θ ) = ( θU ′ ( x )) + , where x ∈ R and θ ∈ {− , +1 } . There exists x > x ≥ x λ ( x, +1) > sup x ≥ x λ ( x, − , inf x ≥− x λ ( x, − > sup x ≤− x λ ( x, +1) . Remark.
Assumption 24 requires that there exists a point x ∈ R such that the processhas a strictly positive probability of changing direction if it is outside of and movingoutwards of the set [ − x , + x ]. The lemma below states that a small set condition followsfrom this assumption. Lemma 25.
Consider the family of 1-dimensional Zig-Zag processes with generators {L m : m ∈ M} where M = [ V min , V max ] for some < V min ≤ V max < ∞ . Thus for f ∈ D ( L m ) and m ∈ [ V min , V max ] L m f ( x, θ ) = mθf ′ ( x, θ ) + ( m ( θU ′ ( x )) + + γ ( x ))( f ( x, − θ ) − f ( x, θ )) . Assume either of the two conditions: • γ ( x ) = 0 for all x ∈ R , but Assumption 24 holds; DAPTIVE SCHEMES FOR PDMC ALGORITHMS 37 • there exists γ min > such that γ ( x ) ≥ γ min for any x ∈ R , and γ ( · ) is bounded oncompact sets.Then for any set of the form C = D × V , where D ⊂ R is a compact set and V ⊆ {− , +1 } ,there exists t > such that for any t ≥ t there are δ > , and a probability measure ν on E such that P tm (( x, θ ) , · ) ≥ δν ( · ) for all ( x, θ ) ∈ C and all m ∈ M . Moreover, consider {L m,γ : m ∈ M , γ ∈ Λ } , where Λ is a family of switching rates γ : R → R + such that for all γ ∈ Λ it holds that < γ min ≤ γ ( x ) ≤ γ max < ∞ for all x ∈ R . Then for any compact set C as above, there exists t > such that for any t ≥ t there are δ > , and a probability measure ν on E that satisfy P tm,γ (( x, θ ) , · ) ≥ δν ( · ) for all ( x, θ ) ∈ C, γ ∈ Λ , m ∈ M . In particular t depends on V min , and δ depends on V min , V max , γ min , γ max , but neither de-pends on m or γ .Proof. Let
R > C = [ − R, R ] × V , where V ⊆ {− , +1 } . Consider the family {L m : m ∈ M} . For m ∈ [ V min , V max ], the process L m moves with velocity mθ , so either+ m > − m <
0. Consider first the case in which Assumption 24 is satisfied for some x >
0. Note that as a consequence it is satisfied for any L m . We consider the case x > R because it is the most general. Indeed one is always free to take ˜ x > R > x and applythe reasoning below. Alternatively the case x ≤ R follows by the same method of proof.For B = A × V ⊂ E let ν ( B ) = (Leb( A ∩ [ − R, R ]) / Leb([ − R, R ])) × ( δ ( θ ) + δ − ( θ ))be a probability measure on E . Define T = T ( ε ) := (4 x + 2 R ) /V min + ε where ε >
0. This choice of the time horizon allows the process to reach the region with strictlypositive probability of a velocity flip for any initial position ( x, θ ) with x ∈ [ − R, R ] and θ ∈ {− , +1 } . The addition of ε > C is a uniform ( ν, δ, T )-small set. Since the proof applies for all ε > δ ’s, one is free to choose t = T ( ε ) freely. Let us considerthe two cases below. First case: let the initial condition be ( x, +1) with x ∈ [ − R, R ] and we want to be attime T in B = A × {− } with A ∈ B ([ − R, + R ]) any Borel set. We can then use the R − Rx T ( ε ) (a) First case. R − Rx − x T ( ε ) (b) Second case.
Figure 7.
Illustration of the proof of Lemma 25.
DAPTIVE SCHEMES FOR PDMC ALGORITHMS 38 following inequality(40) P ( x, +1) (( X ( T ) , Θ( T )) ∈ B ) ≥ P ( x, +1) ( X ( T ) ∈ A, E ) , where E is the event that exactly one velocity switch takes place. We are then in thecase of Figure 7(a). Observe that by the choice of T the process has enough time to travelthe longest path, i.e. from ( − R, +1) to ( − R, −
1) with the smallest allowed velocity V min .In order to compute the r.h.s. in (40) one can compute P ( x, +1) ( X ( T ) ≤ y, E ) and thendifferentiate with respect to y . To do this we assume only one velocity switch and imposethat X ( T ) ≤ y , resulting in the condition X ( T ) = x + mt − m ( T − t ) ≤ y, where t is the time at which the velocity switch takes place. By rearranging we obtainthe condition t ≤ y − x m + T =: ¯ t ( y ). Observe that for any m ∈ [ V min , V max ] the process hasenough time to reach the region where it is assumed there is a strictly positive probabilityof a velocity flip. Indeed using that y ∈ [ − R, R ] we find¯ t ( y ) ≥ x − xm + y + x m + 2 x + 2 R V min + ε ≥ x − xm + x V min + ε > x − xm . Therefore P ( x, +1) ( X ( T ) ≤ y, E ) = Z ¯ t ( y )0 λ ( x + ms,
1) exp (cid:18) − Z s λ ( x + mu, du (cid:19) exp (cid:18) − Z T − s λ ( x + ms − mu, − du (cid:19) ds. After differentiation one obtains P ( x, +1) ( X ( T ) ∈ A, E ) == Z A m λ ( x + m ¯ t ( y ) ,
1) exp − Z ¯ t ( y )0 λ ( x + mu, du ! exp − Z T − ¯ t ( y )0 λ ( x + m ¯ t ( y ) − mu, − du ! dy ≥ V min Z A λ ( x + m ¯ t ( y ) ,
1) exp( − ¯ t ( y ) λ max − ( T − ¯ t ( y )) λ max ) dy = exp( − λ max T )2 V min Z A λ ( x + m ¯ t ( y ) , dy ≥ λ min R exp( − λ max T ) V min (cid:18) · R Z A dy (cid:19) = 2 λ min R exp( − λ max T ) V min ν ( B ) . DAPTIVE SCHEMES FOR PDMC ALGORITHMS 39 where(41) λ max := max { x ∈ ˜ C T ,θ ∈{− , +1 }} ( V max ( θU ′ ( x )) + + γ ( x )) < ∞ ,λ min := min (cid:26) inf x ≥ x ( V min U ′ ( x )) + , inf x ≤− x ( V min U ′ ( x )) + (cid:27) > . where ˜ C T := [ − R − V max T, R + V max T ] is the set of points that can be reached in timeT by a process starting in x ∈ [ − R, R ] with velocity V max (and thus for any value m ∈ [ V min , V max ]). Thus λ max is the maximum switching rate achieved within time T . Second case: again we consider an initial condition with positive velocity, but now alsoa positive velocity at time T , i.e. we consider a set B = A × { +1 } . Taking advantage ofthe same idea as before we use the bound P ( x, +1) (( X ( T ) , Θ( T )) ∈ B ) ≥ P ( x, +1) ( X ( T ) ∈ A, E ) , where E is the event that exactly two switches take place. This case corresponds toFigure 7(b). Let t and t be the times of first and second switch respectively. Then X ( T ) ≤ y when X ( T ) = x + mt − m ( t − t ) + m ( T − t ) ≤ y, which can be rearranged as t ≥ t + x − y m + T =: ¯ t ( y ). We can obtain a bound for t byimposing that ¯ t ( y ) < T . The resulting condition is t < T / − ( x − y ) / (2 m ) =: ¯ t ( y ).We observe that by definition of T it follows that for any x, y ∈ [ − R, R ]¯ t ( y ) − x − xm ≥ x V min − x − xm + ε > , which means that there is enough time for the process to reach cyan region in Figure 7(b)and have the first velocity flip. The same holds true for the second velocity flip. Nowcompute the distribution function as P ( x, +1) ( X ( T ) ≤ y, E ) = Z ¯ t ( y ) s =0 Z T − su =˜ t ( y ) λ ( x + ms,
1) exp (cid:18) − Z s λ ( x + ml, dl (cid:19) λ ( x + ms − mu, −
1) exp (cid:18) − Z u λ ( x + ms − ml, − dl (cid:19) exp (cid:18) − Z T − u − s λ ( x + ms − mu + ml, dl (cid:19) ds du, where ˜ t ( y ) := ¯ t ( y ) − t = x − y m + T . Then by differentiating we obtain P ( x, +1) ( X ( T ) ∈ A, E ) == Z A Z ¯ t ( y )0 m λ ( x + ms,
1) exp (cid:18) − Z sl =0 λ ( x + ml, dl (cid:19) λ ( x + ws − w ˜ t ( y ) , −
1) exp − Z ˜ t ( y ) l =0 λ ( x + ms − ml, − dl ! exp − Z T − ˜ t ( y ) − sl =0 λ ( x + ms − m ˜ t ( y ) + ml, dl ! ds dy DAPTIVE SCHEMES FOR PDMC ALGORITHMS 40 ≥ exp ( − λ max T )2 V min Z A Z ¯ t ( y )0 λ ( x + ws, λ ( x + ws − w ˜ t ( y ) , − ds dy. Since the integrand is non-negative we can lower bound this quantity by restricting thedomain of integration corresponding to the s variable. A sensible choice is ( x − x ) /m ≤ s ≤ − ( x + x ) /m + ˜ t ( y ), as this would imply that both switches take place in the cyanregion in Figure 7(b). Indeed for s ∈ [( x − x ) /m, − ( x + x ) /m + ˜ t ( y )] we have that x + ms ≥ x and x + ms − m ˜ t ( y ) ≤ − x . Observe that − x + xm + ˜ t ( y ) = T x − y m − x + x m ≤ T x − y m − x − ym ≤ ¯ t ( y ) , where we used that − x ≤ y since y ∈ [ − R, R ]. Combining this with the fact that¯ t ( y ) > x − xm , we have shown that [( x − x ) /m, − ( x + x ) /m + ˜ t ( y )] ⊂ [0 , ¯ t ( y )]. Thereforeon this interval we can bound below the switching rates as follows P ( x, +1) ( X ( T ) ∈ A, E ) ≥≥ exp ( − λ max T )2 V min Z A Z − x xm +˜ t ( y ) x − xm λ ( x + ms, λ ( x + ms − w ˜ t ( y ) , − dsdy ≥ exp ( − λ max T )2 V min λ Z A Z − x xm +˜ t ( y ) x − xm dsdy ≥ exp ( − λ max T )2 V min λ Z A ε dy = R exp ( − λ max T ) V min λ ε ν ( B ) . By symmetry the same bounds hold also when the process has initial velocity is − ε > C = [ − R, + R ] × {− , +1 } is a uniform ( ν, δ, T )-small set with(42) δ = min (cid:26) λ min R exp( − λ max T ) V min , ε λ R exp ( − λ max T ) V min (cid:27) . Notice that there is an ε dependence also in λ min and λ max , but there is no dependence onthe specific value of m . To conclude the argument we observe that it is always possibleto incorporate any compact set in a set of the same form as C and therefore all compactsets are uniformly small for the family of Zig-Zag processes.When the switching rate is strictly positive but Assumption 24 does not hold, the samereasoning can be applied by taking x = 0, λ min = γ min >
0, and T = RV min + ε for some ε >
0. The switching rates can be upper-bounded because γ is bounded on compact sets.Thus the only difference is that in this case the process can switch at any time and doesnot need to escape a compact set to do so. One can again obtain the uniform small setcondition with δ defined as in (42), and ν as above.Finally, consider the family {L m,γ : m ∈ M , γ ∈ Λ } . Choosing λ max = max { x ∈ ˜ C T ,θ ∈{− , +1 }} ( V max ( θU ′ ( x )) + ) + γ max DAPTIVE SCHEMES FOR PDMC ALGORITHMS 41 and applying the same reasoning as in the case of a strictly positive refreshment showsthe statement. (cid:3)
B.4.2.
Simultaneous drift conditions, from discrete time to continuous time.
The lemmabelow is needed to go from simultaneous, continuous time drift conditions to simultaneous,discrete time ones.
Lemma 26.
Consider a family of Markov processes with state space E and generators {L γ : γ ∈ Y } . Assume the following conditions are verified:a) there exist c > , c > , a set K ⊂ E all independent of γ , and a class offunctions { V γ : E → [1 , + ∞ ) : γ ∈ Y } , such that for each γ ∈ Y the followingdrift condition holds (43) L γ V γ ( z ) ≤ − c V γ ( z ) + c K ( z ) for all z ∈ E. b) the family of processes is such that for K ⊂ E as in the previous point, a constant t > , and all z ∈ E , there exists a set ˜ K ( t ) ⊂ E which depends on t such that onthe event z / ∈ ˜ K ( t ) it holds that Z ( t ) / ∈ K almost surely for all γ ∈ Y .Then for any t > there are λ ∈ (0 , and κ > , both independent of γ , such that foreach γ ∈ Y P tγ V γ ( z ) ≤ λV γ ( z ) + κ ˜ K ( z ) for all z ∈ E. In particular, if conditions (43) hold with V ≡ V γ for each γ , then P tγ V ( z ) ≤ λV ( z ) + κ ˜ K ( z ) for all z ∈ E. Proof.
Let t > γ ∈ Y . As a first step, we apply Dynkin’s formula to f ( z, t ) =exp ( c t ) V γ ( z ) as in the proof of Theorem 6.1 in [24] to obtain e c t P tγ V γ ( z ) = V γ ( z ) + Z t P sγ (cid:18) ∂∂s + L γ (cid:19) ( e c s V γ ( z )) ds. By the product rule and the drift condition we can write the integrand on the right handside as P sγ (cid:18) ∂∂s + L γ (cid:19) ( e c s V γ ( z )) = e c s P sγ L γ V γ ( z ) + c e c s P sγ V γ ( z ) ≤ e c s P sγ ( − c V γ ( z ) + c K ( z )) + c e c s P sγ V γ ( z )= c e c s P sγ K ( z ) . Therefore by linearity of the integral P tγ V γ ( z ) ≤ e − c t V γ ( z ) + c Z t e c ( s − t ) P sγ K ( z ) ds ≤ e − c t V γ ( z ) + c ˜ K ( z ) Z t e c ( s − t ) ds = e − c t V γ ( z ) + c c (1 − e − c t ) ˜ K ( z ) . In the second inequality we took advantage of condition (b) to conclude that P sγ K ( z ) = P z ( Z ( s ) ∈ K ) ≤ ˜ K ( z ). It is then sufficient to take λ = e − c t , κ = c c (1 − e − c t ) toconclude the proof. Observe that λ and κ do not depend on γ (but depend on t ). (cid:3) DAPTIVE SCHEMES FOR PDMC ALGORITHMS 42
Remark.
Condition (b) in Lemma 26 is for instance satisfied by a family of preconditionedZig-Zag processes, if the preconditioner is taken from a compact set of positive definitematrices. Indeed this claim follows from the fact that all processes travel with almostsurely bounded velocity. Therefore, it is possible to choose ˜ K ( t ) by adding a suitablebuffer zone around the set K , such that the set K is not reachable in time t startingoutside ˜ K ( t ). Remark.
Instead of drift conditions of the form (43), consider the following case L γ V γ ( z ) ≤ − c V γ ( z ) + c for all z ∈ E. Without condition (b) in Lemma 26 we can still conclude by the same line of reasoningthat there are λ ∈ (0 , κ >
0, both independent of γ , such that for each γ ∈ Y P tγ V γ ( z ) ≤ λV γ ( z ) + κ for all z ∈ E. Acknowledgements
This work is part of the research programme ‘Zigzagging through computational bar-riers’ with project number 016.Vidi.189.043, which is financed by the Dutch ResearchCouncil (NWO). We acknowledge helpful discussions with Gareth Roberts.
References [1]
Andrieu, C., Durmus, A., Nüsken, N., and Roussel, J.
Hypocoercivity of piecewise determin-istic markov process-monte carlo. arXiv 1808.08592, 2019.[2]
Andrieu, C., and Livingstone, S.
Peskun-tierney ordering for markov chain and process montecarlo: beyond the reversible scenario. arXiv 1906.06197, 2019.[3]
Andrieu, C., and Thoms, J.
A tutorial on adaptive mcmc.
Statistics and Computing 18 (12 2008).[4]
Arveson, W.
Notes on measure and integration in locally compact spaces, 1996.[5]
Bai, Y., Roberts, G. O., and Rosenthal, J. S.
On the containment condition for adaptivemarkov chain monte carlo algorithms.
Advances and Applications in Statistics 21 , 1 (2011), 1–54.[6]
Bierkens, J., and Duncan, A.
Limit theorems for the zig-zag process.
Advances in AppliedProbability 49 , 3 (2017), 791–825.[7]
Bierkens, J., Fearnhead, P., and Roberts, G.
The zig-zag process and super-efficient samplingfor bayesian analysis of big data.
Annals of Statistics 47 (2019).[8]
Bierkens, J., Grazzi, S., Kamatani, K., and Roberts, G.
The boomerang sampler. In
Pro-ceedings of the 37th International Conference on Machine Learning (2020), H. D. III and A. Singh,Eds., vol. 119 of
Proceedings of Machine Learning Research , PMLR, pp. 908–918.[9]
Bierkens, J., Kamatani, K., and Roberts, G. O.
High-dimensional scaling limits of piecewisedeterministic sampling algorithms. arXiv 1807.11358, 2018.[10]
Bierkens, J., and Roberts, G.
A piecewise deterministic scaling limit of lifted metropo-lis–hastings in the curie–weiss model.
Ann. Appl. Probab. 27 , 2 (04 2017), 846–882.[11]
Bierkens, J., Roberts, G., and Zitt, P.-A.
Ergodicity of the zigzag process.
The Annals ofApplied Probability 29 (2019).[12]
Bouchard-Côté, A., Vollmer, S. J., and Doucet, A.
The bouncy particle sampler: A non-reversible rejection-free markov chain monte carlo method.
Journal of the American Statistical As-sociation 113 , 522 (2018), 855–867.[13]
Chimisov, C., Latuszynski, K., and Roberts, G.
Adapting the gibbs sampler. arXiv 1801.09299,2018.[14]
Craiu, R. V., Gray, L., Latuszyński, K., Madras, N., Roberts, G. O., and Rosenthal, J. S.
Stability of adversarial markov chains, with an application to adaptive mcmc algorithms.
Ann.Appl. Probab. 25 , 6 (12 2015), 3592–3623.
[15]
Davis, M.
Markov Models & Optimization . Chapman & Hall/CRC Monographs on Statistics &Applied Probability. Taylor & Francis, 1993.[16]
Deligiannidis, G., Bouchard-Côté, A., and Doucet, A.
Exponential ergodicity of the bouncyparticle sampler.
Ann. Statist. 47 , 3 (06 2019), 1268–1287.[17]
Durmus, A., Guillin, A., and Monmarché, P.
Geometric ergodicity of the bouncy particlesampler.
Ann. Appl. Probab. 30 , 5 (10 2020), 2069–2098.[18]
Engel, K.-J., and Nagel, R.
One-parameter semigroups for linear evolution equations.
SemigroupForum 63 (06 2001), 278–280.[19]
Fearnhead, P., Bierkens, J., Pollock, M., and Roberts, G. O.
Piecewise deterministicmarkov processes for continuous-time monte carlo.
Statist. Sci. 33 , 3 (08 2018), 386–412.[20]
Fort, G., Moulines, E., and Priouret, P.
Convergence of adaptive and interacting markovchain monte carlo algorithms.
Ann. Statist. 39 , 6 (12 2011), 3262–3289.[21]
Fort, G., Moulines, E., Priouret, P., and Vandekerkhove, P.
A central limit theorem foradaptive and interacting markov chains.
Bernoulli 20 (07 2011).[22]
Jacod, J., and Protter, P.
Probability Essentials , 2nd ed. Springer Berlin Heidelberg, 2004.[23]
Lu, J., and Wang, L.
On explicit L²-convergence rate estimate for piecewise deterministic Markov processes. arXiv 2007.14927, Jul 2020. [24] Meyn, S. P., and Tweedie, R. L.
Stability of markovian processes iii: Foster–lyapunov criteriafor continuous-time processes.
Advances in Applied Probability 25 , 3 (1993), 518–548.[25]
Michel, M., Kapfer, S. C., and Krauth, W.
Generalized event-chain Monte Carlo: Con-structing rejection-free global-balance algorithms from infinitesimal steps.
The Journal of ChemicalPhysics 140 , 5 (2014).[26]
Monmarché, P.
Piecewise deterministic simulated annealing.
ALEA 13 , 1 (2016), 357–398.[27]
Peters, E. A. J. F., and De With, G.
Rejection-free Monte Carlo sampling for general potentials.
Physical Review E - Statistical, Nonlinear, and Soft Matter Physics 85 , 2 (2012), 1–5.[28]
Pompe, E., Holmes, C., and Latuszyński, K.
A framework for adaptive mcmc targeting mul-timodal distributions.
Ann. Statist. 48 , 5 (10 2020), 2930–2952.[29]
Roberts, G. O., and Rosenthal, J. S.
General state space markov chains and mcmc algorithms.
Probab. Surveys 1 (2004), 20–71.[30]
Roberts, G. O., and Rosenthal, J. S.
Coupling and ergodicity of adaptive markov chain montecarlo algorithms.
Journal of Applied Probability 44 , 2 (2007), 458–475.[31]
Roberts, G. O., and Rosenthal, J. S.
Examples of adaptive mcmc.
Journal of Computationaland Graphical Statistics 18 , 2 (2009), 349–367.[32]
Rosenthal, J. S.
Minorization conditions and convergence rates for markov chain monte carlo.
Journal of the American Statistical Association 90 , 430 (1995), 558–566.[33]
Vanetti, P., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A.
Piecewise-deterministic markov chain monte carlo. arXiv 1707.05296, 2017.[34]
Vialaret, M., and Maire, F.
On the convergence time of some non-reversible markov chainmonte carlo methods.
Methodology and Computing in Applied Probability (2020).[35]
Wallin, J., and Bolin, D.
Efficient adaptive mcmc through precision estimation.