HMC, an Algorithm in Data Mining: the Functional Analysis Approach
SOUMYADIP GHOSH, YINGDONG LU, TOMASZ NOWICKI
Abstract.
The main purpose of this paper is to facilitate the communication between the Analytic, Probabilistic and Algorithmic communities. We present a proof of convergence of the Hamiltonian (Hybrid) Monte Carlo algorithm from the point of view of Dynamical Systems, where the evolving objects are densities of probability distributions and the tools are derived from Functional Analysis.

1. Introduction
Functional Analysis for all Functioning Algorithms.
We observed that all too often not only do we not speak a common language, but we also do not see the reason to communicate and to look at the problems through different eyes. We tried, on the example of an HMC algorithm, to gently (relatively speaking) build the bridges and make the "other" methods clearer and more comprehensible. As the paper is addressed to people not necessarily fluent in Functional Analysis, we made the proofs quite extended in some places and sometimes a bit hand-waving. It is difficult to strike a balance between clarity and rigor.
Algorithms as a source of inspiration.
Recent development and usage of Machine Learning (ML), Data Mining (DM) and Artificial Intelligence (AI) resulted in a vast variety of new or refurbished algorithms to deal with large data sets, be they collected or streamed. Such new methods are usually tested on some data sets; however, they are not very often thoroughly vetted by theoretical means. Algorithms tend to rely on discrete models, but the nature of data and approximation approaches suggests rather a continuous point of view. In our opinion one should go even further: the right objects of investigation of algorithms should not be the continuous parameters but rather very general features of data, such as distributions.

We perceive the algorithms as iterative transformations of points in some underlying domains. The leading idea is to move from relatively simple objects, such as finite sets or points in finite dimensional Euclidean spaces, with quite complicated transformations, to simple transformations in richer spaces, such as distributions in functional spaces.
Hamiltonian Monte Carlo.
The Hybrid (or Hamiltonian) Monte Carlo (HMC) algorithm is a method to obtain random samples from a (target) probability distribution $f/\int_Q f$ on a space $Q$, whose density is known only up to a factor; that is to say, $f$ is known, but $\int_Q f$ is not, or at least is very difficult to calculate. It is an algorithm known for a while [1], of the Metropolis–Hastings type, used to estimate integrals. There are known proofs of convergence [2]. Our goal is to provide a clear and understandable reason why the HMC algorithm converges to the right limit for densities in the space $L^2(Q)$. We refer to our papers [3, 4, 5] for other approaches to HMC: probabilistic, algorithmic and analytic in $L^q$. Here we concentrate on the convergence in $L^2$.

HMC performs by iterating the following steps. Given an initial distribution $h$ (sample points) in a given space $Q$, double (the dimension of) the space by considering $Q \times P$, with $P \sim Q$. Then spread each point $q \in Q$ to a point $(q,p) \in Q \times P$ by sampling $p \in P$ from a distribution of choice $g$, where $g > 0$ on $P$ and $\int_P g = 1$. Then move each point $(q,p)$ to a new point $(Q,P) = \mathcal{H}(q,p)$, where the transformation $\mathcal{H}: Q \times P \to Q \times P$ satisfies some special invariance properties with respect to $f$, $g$ and the underlying base measure on $Q \times P$. Finally, project $(Q,P)$ onto $Q$, providing a new sample of points $Q$ in $Q$ with a new distribution $\hat h$, which shall be used as the initial sample (or distribution) for the next step. With the right choice of $\mathcal{H}$ the iteration of this procedure will result in an approximate sample from the target distribution.

Moving to functional spaces.
The success of the algorithm lies in the appropriate choice of the transformation $\mathcal{H}$. The Hamiltonian part of the algorithm's name is due to the Hamiltonian motion $\mathcal{H}$. It turns out that if the target distribution has a density proportional to a given function $f$ and the distribution of choice is $g$, then $\mathcal{H}$ is the Hamiltonian motion generated by the Hamiltonian energy $H(q,p) = -\log(f(q) \cdot g(p))$. That is, if $(Q,P) = \mathcal{H}(q,p)$ is the solution of the time evolution $\dot Q = \partial H/\partial P$, $\dot P = -\partial H/\partial Q$ after time $t$ with initial point $(q,p)$, then $\mathcal{H}$ has the needed invariance properties for HMC to converge to a distribution proportional to $f$. Effectively it means that one can obtain the normalizing constant $\int_Q f$, or any expected value of a function $\phi$ with respect to the distribution proportional to $f$: $\int \phi \cdot f / \int f$.

In terms of the densities of the involved distributions one can present HMC as follows. Given some initial distribution $h(q)$ on $Q$, one produces a joint distribution $h \cdot g$ by $(h \cdot g)(q,p) = h(q) \cdot g(p)$ on $Q \times P$, then moves the points $(q,p) \mapsto (Q,P) = \mathcal{H}(q,p)$, producing another distribution $(h \cdot g) \circ \mathcal{H}(q,p) = h(Q) \cdot g(P)$ on $Q \times P$, and finally projects the last one onto $Q$ by calculating the marginal $\int_P (h \cdot g) \circ \mathcal{H}(q,p)\, dp$, which is the result of the action of the algorithm in one step. In short,

(1.1) $\mathcal{T}(h)(q) = \int_P (h \cdot g) \circ \mathcal{H}(q,p)\, dp,$

and from a rather complicated algorithm we receive a relatively simple, linear operator on some space of integrable functions. The convergence of the algorithm corresponds to the convergence of the sequences of iterates of $\mathcal{T}$.

Remark 1.1.
(1)
The distribution $g$ may depend on the point $q$, $g(p|q)$, as long as for (almost) all $q$ it satisfies the required conditions. (2) It is clear that the Hamiltonian motion $\mathcal{H}$ does not depend on the constant normalizing factor in front of $f$. (3) The motion $\mathcal{H}$ in practical implementations is performed by the leap-frog algorithm, which displays the needed invariance properties. We shall not deal with it in this paper. (4) An example of a situation where the target distribution is known only up to the normalizing constant is the Bayesian update. In order to establish the distribution of (random) parameters $\theta$ influencing the outcome $D$ of the observations, when we know all the probabilities $P(D|\theta)$, one uses the knowledge of the outcome $D$ to improve the estimate: given the estimate distribution $\pi(\theta)$ before the experiment, we calculate $\hat\pi(\theta) = \pi(\theta|D) = \pi(\theta, D)/\pi(D) = P(D|\theta) \cdot \pi(\theta) / \sum_{\theta'} P(D|\theta') \cdot \pi(\theta')$ and take the distribution $\hat\pi$ as the new estimate. However, the sum (integral) in the denominator may be not that easy to calculate. This leads to $\hat\pi(\theta) \sim P(D|\theta)\pi(\theta)$ without the normalizing factor.

2. Results
Convergence under invariance and coverage properties.
Assume that the motion $\mathcal{H}: Q \times P \to Q \times P$ (measurable spaces with measures $dq$ and $dp$) satisfies the following invariance and coverage properties:
Given $0 \le f: Q \to \mathbb{R}$ with $\int_Q f < \infty$, and $0 \le g: P \to \mathbb{R}$ with $\int_P g = 1$:

(2.1) $(f \cdot g) \circ \mathcal{H} = f \cdot g,$

(2.2) $\iint_{Q \times P} A \circ \mathcal{H} = \iint_{Q \times P} A$ for any integrable $A$,

(2.3) $Q(q, P) = Q$ for (almost) every $q$.

The space $Q$ may be restricted to equal the support of $f$.

Define $\mathcal{T} h = \int_P (h \cdot g) \circ \mathcal{H}$ as in (1.1) and $\mathcal{T}^{n+1} = \mathcal{T}^n \circ \mathcal{T}$. The adjoint operator $\mathcal{T}^\dagger$ is given by the same formula with $\mathcal{H}^{-1}$ in place of $\mathcal{H}$ and is described below in Section 4. A self-adjoint operator satisfies $\mathcal{T}^\dagger = \mathcal{T}$, see (4.1).

Let $L^2$ denote the space of square-integrable functions $h: Q \to \mathbb{R}$ such that $\|h\|^2 = \int_Q |h|^2/f < \infty$ and the support of $h$ is included in the support of $f$ (i.e. $Q$).

Theorem 2.1.
Under the above invariance and coverage conditions, and when the operator $\mathcal{T}$ is self-adjoint, for every $h \in L^2(Q)$ the sequence $\mathcal{T}^n h$ converges strongly in $L^2$ to $f \cdot \int h / \int f$. The direction of $f$ is the unique direction of fixed points. Except for the eigenvalue 1, which has multiplicity 1, all the spectrum is contained in the interior of the unit disc.

Remark 2.2.
(1)
The Hamiltonian motion satisfies the two invariance assumptions (first integrals), as both the Hamiltonian (the energy function, equal here to $-\log(f(q) \cdot g(p))$) and the Lebesgue measure are invariant under such a motion. (2) The coverage property $Q(q,P) = Q$ can be weakened to a statement of eventual coverage, not necessarily in one step. Some type of irreducibility must be assumed, to avoid completely disjoint domains of the motion and hence an obvious non-existence of a (unique) limit. (3) The support condition takes care of some initialization problems with the division by 0. This can be formally avoided by working in the space of likelihoods, see below. (4)
The self-adjointness condition is not very restrictive, as one can always use the composition $\mathcal{T}^\dagger \circ \mathcal{T}$, which is self-adjoint and satisfies all the needed properties. Also, in the case of any even auxiliary distribution $g(p) = g(-p)$ on $\mathbb{R}^d$, such as the standard Gaussian, the operator is always self-adjoint, see Lemma 4.2.

Exponential convergence under uniformly strong logarithmic concavity.
We say that $h: \mathbb{R}^d \to \mathbb{R}$ is uniformly strongly logarithmically concave if for almost every $Q$ the (symmetric) Hessian $-\partial^2 \log(h(Q))/\partial Q^2$ has its spectrum contained in some finite, positive interval $\{z : 0 < \lambda \le z \le \Lambda < \infty\}$ independent of $Q$. Gaussian auxiliary distributions are obviously uniformly strongly logarithmically concave. Speaking informally, one may say that such a density lies between two Gaussians.

For the next Theorem we need a stronger coverage condition. We say that the motion $\mathcal{H}: (q,p) \mapsto (Q,P)$ is fully invertible if, given fixed values of (almost) any two of the four variables $(q,p,Q,P)$, the other two are connected by a differentiable bijection. In particular, given $q$, the map $p \mapsto Q_q(p) = Q$ has a smoothly invertible derivative $\partial Q/\partial p$, and the same holds for $q \mapsto P_p(q) = P$.

Theorem 2.3.
Suppose that the target distribution $f: \mathbb{R}^d \to \mathbb{R}$ and the auxiliary distribution $g: \mathbb{R}^d \to \mathbb{R}$ are both uniformly strictly logarithmically concave. Let $\mathcal{H}_t$ be the Hamiltonian motion defined by the Hamiltonian $H(Q,P) = -\log(f(Q) \cdot g(P))$, and let $\mathcal{T}_t$ be the operator defined by $\mathcal{H}_t$; assume it is self-adjoint. Then for $t > 0$ small enough the iterations of the operator $\mathcal{T}_t$ converge geometrically to the map $h \mapsto \int h \cdot f / \int f$:
$$\exists\, (0 < \rho < 1)\ \forall (h \in L^2(Q))\ \forall (n) \qquad \Big\|\mathcal{T}_t^n h - f \frac{\int h}{\int f}\Big\| \le \rho^n \|h\|.$$
One can prove that the uniformly strong logarithmic concavity assumption is needed only outside an arbitrary bounded region, see [4].
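The transition analyzed in these theorems can be sketched in code. The following is our own minimal illustration, not the paper's construction: the exact motion $\mathcal{H}_t$ is replaced by the leap-frog integrator mentioned in Remark 1.1(3), $g$ is the standard Gaussian, and all names and step counts (`leapfrog`, `hmc_step`, `n_steps`) are hypothetical choices.

```python
import numpy as np

def leapfrog(q, p, grad_U, t=1.0, n_steps=10):
    """Approximate the Hamiltonian flow H_t for H(q,p) = U(q) + |p|^2/2
    with the standard leap-frog (Stormer-Verlet) integrator."""
    eps = t / n_steps
    p = p - 0.5 * eps * grad_U(q)      # half step in momentum
    for _ in range(n_steps - 1):
        q = q + eps * p                # full step in position
        p = p - eps * grad_U(q)        # full step in momentum
    q = q + eps * p                    # last position step
    p = p - 0.5 * eps * grad_U(q)      # closing half step in momentum
    return q, p

def hmc_step(q, grad_U, rng, t=1.0, n_steps=10):
    """One iteration of the idealized transition described in the paper:
    spread q to (q,p) with p sampled from g (standard Gaussian), move by
    (an approximation of) H_t, project back onto Q by dropping momentum."""
    p = rng.standard_normal(np.shape(q))
    Q, _ = leapfrog(q, p, grad_U, t, n_steps)
    return Q
```

Iterating `hmc_step` on a cloud of points plays the role of iterating $\mathcal{T}$ on their density.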
The proofs are not very hard.
For Theorem 2.1 we observe that $\mathcal{T}$ is in fact (3.4) an averaging map; thus, by the convexity of $x \mapsto x^2$, the norm of $h$ decreases (3.6) under $\mathcal{T}$, sharply unless (by the coverage assumption) $h = \alpha f$. The space $L^2$ is reflexive, thus bounded sequences have weak accumulation points. Using self-adjointness, defined in Section 4, we prove that each accumulation point must be of the form $\alpha f$, proving weak convergence. Meanwhile the proof of the convergence of the norms provides (Proposition 5.4) strong convergence. The spectral properties follow from Remark 3.3.

The proof of Theorem 2.3 relies on the representation of the operator $\mathcal{T}$ as a kernel operator (6.4) and on the proof in Section 6 (somewhat lengthy in calculation but not too deep) that the $L^2$ norm of that kernel (6.5) is finite; hence the kernel is compact and the operator $\mathcal{T}$ has a spectral gap, which provides the geometric (or exponential) rate of convergence.

3. The operator $\mathcal{T}$ in $L^2$

The Hilbert space $L^2$. For $h \in L^2$ we have the standard norm $\|h\|^2 = \int_Q \left(\frac{h}{f}\right)^2 f$ and a scalar product $\langle \cdot, \cdot \rangle : L^2 \times L^2 \to \mathbb{R}$, $\langle a, b \rangle = \int_Q \frac{a \cdot b}{f}$.

We shall assume $h \ge 0$ and call $\tilde h = h/f$ a likelihood (up to an irrelevant normalizing constant $\int f$) of $h$ with respect to $f$. The space $\tilde L^2 = \{\tilde h : \int_Q |\tilde h|^2 f < \infty\}$ of likelihoods $\tilde h = h/f$ is isometric to $L^2$.

Lemma 3.1.

(3.1) $a, b \in L^2 \Rightarrow \langle a, b \rangle \le \|a\| \cdot \|b\|$;

(3.2) $f \in L^2$ with $\|f\|^2 = \int_Q f$;

(3.3) $h \in L^2 \Rightarrow \langle h, f \rangle = \int_Q h$.

Proof. (3.1) is the Hölder inequality for the likelihoods $a/f$ and $b/f$ in the space $\tilde L^2$. (3.2) and (3.3) follow directly from the definitions. □

The operator $\mathcal{T}$.

Lemma 3.2 (Properties of $\mathcal{T}$). For $0 \le h \in L^2$:

(3.4) $\mathcal{T} h = f \cdot \int_P \frac{h}{f} \circ \mathcal{H} \cdot g$;

(3.5) $\int_Q \mathcal{T} h = \int_Q h$;

(3.6) $\|\mathcal{T} h\| \le \|h\|$;

(3.7) the equality in (3.6) occurs iff $h = \alpha \cdot f$ a.e.,

where $\alpha = \alpha(h) = \int h / \int f$. Equation (3.4) gives an explicit formula for $\tilde{\mathcal{T}}(\tilde h)$ acting in $\tilde L^2$.

Proof.
(3.4): Using the invariance properties we have $\int_P \frac{h}{f} \circ \mathcal{H} \cdot (f \cdot g) \circ \mathcal{H} = \int_P \frac{h}{f} \circ \mathcal{H} \cdot (f \cdot g)$, and $f$ does not depend on $p \in P$.

(3.5): $\int_Q \int_P (h \cdot g) \circ \mathcal{H} = \iint_{Q \times P} (h \cdot g) = \left(\int_Q h\right)\left(\int_P g\right)$.

(3.6): $\|\mathcal{T} h\|^2 = \int_Q \left(\int_P \frac{h}{f} \circ \mathcal{H} \cdot g\right)^2 f \le \int_Q \int_P \left(\frac{h}{f} \circ \mathcal{H}\right)^2 g\, f = \iint_{Q \times P} \left(\frac{h}{f}\right)^2 \circ \mathcal{H} \cdot (g \cdot f) \circ \mathcal{H} = \iint_{Q \times P} \left(\frac{h}{f}\right)^2 g \cdot f$, the last one being equal to $\left(\int_Q \left(\frac{h}{f}\right)^2 f\right) \cdot \left(\int_P g\right) = \|h\|^2$.

For a given $q$ the equality occurs only if $(h/f)(\mathcal{H}(q,p))$ is constant for ($g$-)almost all $p$, but the coverage assumption assures that the constant $(h/f) \circ \mathcal{H}(q,P)$ is the same for $f$-almost all $q \in Q$: $(h/f)(Q) = \int_Q h / \int_Q f$, the value following from (3.5). Note that (3.6) hides the formula for the variance with respect to the probability $g$, the random variable being the transported likelihood. □

Remark 3.3.
The operator $\mathcal{T}$ is an averaging operator of the (transported) likelihood $h/f$ with respect to the probability $g$. The scalar product is monotone: $0 \le a \le b$, $0 \le c \le d$ implies $\langle a, c \rangle \le \langle b, d \rangle$, and $\mathcal{T}$ is positive; in particular, if $a \le b$ then $\mathcal{T} a \le \mathcal{T} b$. The function $f$ provides the eigendirection of fixed points, and by (3.6) $\mathcal{T}$ has its spectrum in the unit disk, with 1 being the unique eigenvalue on the unit circle, of multiplicity 1. For any $h \in L^2$ one has the unique decomposition $h = \alpha f + (h - \alpha f)$, where $\alpha f$ is in the direction of the fixed points and $h - \alpha f \in \mathcal{N} = \{a \in L^2 : \int a = 0\}$ lies in an invariant subspace. It is not a priori clear under what conditions the eigenvalue 1 is isolated in the spectrum, in other words whether the contraction $\|\mathcal{T} h\| < \|h\|$ is uniform on $\mathcal{N}$, which would imply $\mathcal{T}^n \mathcal{N} \to \{0\}$ (point-wise) with exponential speed.

4. The adjoint operator $\mathcal{T}^\dagger$

As $\mathcal{H}$ is invertible, the inverse map $\mathcal{H}^{-1}$ is well defined and enjoys the same invariance properties as $\mathcal{H}$. It turns out that the operator $\mathcal{T}^\dagger$ defined by $\mathcal{H}^{-1}$:

(4.1) $\mathcal{T}^\dagger h = \int_P (h \cdot g) \circ \mathcal{H}^{-1}$

is adjoint to $\mathcal{T}$ with respect to the duality functional $\langle \cdot, \cdot \rangle$, namely

Lemma 4.1.
For $h, k \in L^2$:

(4.2) $\langle \mathcal{T} h, k \rangle = \langle h, \mathcal{T}^\dagger k \rangle.$

Proof.
Using (3.4) and invariance:
$$\langle \mathcal{T} h, k \rangle = \int_Q \left(\int_P \frac{h}{f} \circ \mathcal{H} \cdot g\right) \cdot k = \iint_{Q \times P} \frac{h}{f} \cdot (g \cdot k) \circ \mathcal{H}^{-1} = \iint_{Q \times P} \frac{h}{f} \cdot \left(\frac{k}{f}\right) \circ \mathcal{H}^{-1} \cdot (g \cdot f) \circ \mathcal{H}^{-1} = \iint_{Q \times P} \frac{h}{f} \cdot \left(\frac{k}{f} \circ \mathcal{H}^{-1}\right) \cdot (g \cdot f) = \int_Q h \left(\int_P \frac{k}{f} \circ \mathcal{H}^{-1} \cdot g\right) = \langle h, \mathcal{T}^\dagger k \rangle. \qquad \square$$

A sufficient condition for self-adjointness $\mathcal{T} = \mathcal{T}^\dagger$. Let $\tau$ be a measure preserving involution $\tau: P \to P$, $\tau \circ \tau = \mathrm{id}$. We can extend it to $\tau: Q \times P \to Q \times P$ by $\tau(q,p) = (q, \tau(p))$. Assume that $g$ is invariant with respect to $\tau$: $g \circ \tau = g$.

Lemma 4.2. If $\tau \circ \mathcal{H}^{-1} \circ \tau = \mathcal{H}$ and $g$ is invariant with respect to $\tau$, then $\mathcal{T}^\dagger = \mathcal{T}$.

As an example take $Q = P = \mathbb{R}$ and $\tau$ to be the symmetry (reflection) of the space $P$ with respect to 0, $\tau(p) = -p$. An even $g(p) = g(-p)$ is invariant with respect to $\tau$. The involution $\tau(p) = -p$ is applicable to the most common choice of $g$: a centralized Gaussian distribution. In the particular case of $f$ also a Gaussian, the Hamiltonian movement $\mathcal{H}$ is a rotation and $\mathcal{H}^{-1}$ the opposite rotation. The spreading by $g$ is symmetric, and whatever mass is transported from $(q,p)$ to $\mathcal{H}(q,p) = (Q,P)$ by $\mathcal{H}$, the same mass will be transported from $(q,-p)$ to $\mathcal{H}^{-1}(q,p) = (\bar Q, \bar P) = (Q, -P)$ by $\mathcal{H}^{-1}$. The projection onto $Q$ will produce the same mass transported by both maps $\mathcal{T}$ and $\mathcal{T}^\dagger$. Clearly this extends to non-standard Gaussians.

Proof.
Measure invariance means that $\int_P a \circ \tau = \int_P a$. Let $(\bar Q, \bar P) = \mathcal{H}^{-1}(q,p)$; then $\tau \circ \mathcal{H}^{-1}(q,p) = \tau(\bar Q, \bar P) = (\bar Q, \tau(\bar P))$ and
$$\mathcal{T} h = \int_P (h \cdot g) \circ \mathcal{H} = \int_P (h \cdot g) \circ \tau \circ \mathcal{H}^{-1} \circ \tau = \int_P (h \cdot g) \circ \tau \circ \mathcal{H}^{-1} = \int_P h(\bar Q) \cdot g(\tau(\bar P)) = \int_P h(\bar Q)\, g(\bar P) = \int_P (h \circ \mathcal{H}^{-1}) \cdot (g \circ \mathcal{H}^{-1}) = \mathcal{T}^\dagger h. \qquad \square$$
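In the Gaussian example just described the exact motion is the rotation $\mathcal{H}_t(q,p) = (q\cos t + p\sin t,\ -q\sin t + p\cos t)$, and both Lemma 3.2 and Lemma 4.2 can be checked by quadrature. The sketch below is our own illustration, assuming $f$ and $g$ are standard Gaussians; the grid, test functions and tolerances are hypothetical choices.

```python
import numpy as np

# Quadrature grid; tails beyond |x| = 9 are negligible for the densities used.
x = np.linspace(-9.0, 9.0, 1201)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)       # target density f
gauss = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def apply_T(fun, t=0.7):
    """T h(q) = integral over p of h(Q) g(P) dp, with (Q,P) = H_t(q,p)."""
    Q = x[:, None] * np.cos(t) + x[None, :] * np.sin(t)
    P = -x[:, None] * np.sin(t) + x[None, :] * np.cos(t)
    return (fun(Q) * gauss(P)).sum(axis=1) * dx

def inner(a, b):
    """<a,b> = integral of a*b/f, the scalar product of Section 3."""
    return (a * b / f).sum() * dx

h_fun = lambda q: np.exp(-(q - 1.0)**2 / 2)      # hypothetical test densities
k_fun = lambda q: np.exp(-(q + 0.5)**2 / 2) * (1 + 0.3 * q**2)
h, k = h_fun(x), k_fun(x)
Th, Tk = apply_T(h_fun), apply_T(k_fun)

mass_h, mass_Th = h.sum() * dx, Th.sum() * dx    # (3.5): mass is conserved
norm2_h, norm2_Th = inner(h, h), inner(Th, Th)   # (3.6): the norm decreases
lhs, rhs = inner(Th, k), inner(h, Tk)            # (4.2): self-adjointness
```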
In what follows we shall assume that $\mathcal{T} = \mathcal{T}^\dagger$. If this is not the case we can use the algorithm with $\mathcal{S} = \mathcal{T}^\dagger \circ \mathcal{T}$, for which $\mathcal{S}^\dagger = \mathcal{S}$.

5. Limits of the sequences $\mathcal{T}^n$ of a self-adjoint operator

From $\|\mathcal{T} h\| < \|h\|$ by induction we obtain $\|\mathcal{T}^n h\| < \|h\|$, unless $h = \alpha f$, when equality holds. For $h \in L^2$ let

(5.1) $V(h) = \inf_n \|\mathcal{T}^n h\| = \lim_n \|\mathcal{T}^n h\|.$

We see that for any $M$, $V(h) = V(\mathcal{T}^M(h))$. As we are interested in the limit of the sequence $\mathcal{T}^n h$, for a given $h$ we can assume that for an arbitrary $\epsilon > 0$, $\|h\| < V + \epsilon$, taking a high iterate $\mathcal{T}^M h$ instead of $h$ if needed.

By a corollary to the Alaoglu Theorem, bounded sets in the reflexive $L^2$ are weakly (the same as weakly*) compact. Any infinite sequence $\mathcal{T}^n h$ has a weakly converging subsequence $\mathcal{T}^{m_n} h \rightharpoonup h_\infty$, meaning $\langle \mathcal{T}^{m_n} h, b \rangle \to \langle h_\infty, b \rangle$ for every $b \in L^2$.

Lemma 5.1.
If the operator $\mathcal{T}$ is self-adjoint then for any weak accumulation point $h_\infty$ of the sequence $\mathcal{T}^n h$ we have $\liminf \|\mathcal{T}^n h\| \le \|h_\infty\| \le \limsup \|\mathcal{T}^n h\|$.

Proof.
With $\mathcal{T}^{m_n} h \rightharpoonup h_\infty$, let $N$ be such that the sequence $(m_n - N)$ contains infinitely many even indices $2m$. Then for $h' = \mathcal{T}^N h$ we have $\mathcal{T}^{2m} h' \rightharpoonup h_\infty$ as well, and:
$$\|\mathcal{T}^m h'\|^2 = \langle \mathcal{T}^m h', \mathcal{T}^m h' \rangle = \langle \mathcal{T}^{2m} h', h' \rangle \to \langle h_\infty, h' \rangle \le \|h_\infty\| \cdot \|h'\|.$$
For an $\epsilon > 0$ take $N$ sufficiently large, such that $\|\mathcal{T}^m h'\| = \|\mathcal{T}^{m+N} h\| \ge \liminf \|\mathcal{T}^n h\| - \epsilon$ and $\|h'\| = \|\mathcal{T}^N h\| \le \limsup_N \|\mathcal{T}^N h\| + \epsilon$. The arbitrary choice of $\epsilon > 0$ concludes the proof. □

Remark 5.2.
For any bounded operator $\mathcal{T}$ on $L^2$ and any weak accumulation point $\mathcal{T}^{m_n} h \rightharpoonup h_\infty$ we have $\|h_\infty\| \le \limsup \|\mathcal{T}^{m_n} h\| \le \limsup \|\mathcal{T}^n h\|$.

Proof.
This is standard: for an $\epsilon > 0$ take $N$ such that for $m_n > N$ we have $\|\mathcal{T}^{m_n} h\| \le \limsup \|\mathcal{T}^n h\| + \epsilon$. Then $\|h_\infty\|^2 = \langle h_\infty, h_\infty \rangle \leftarrow \langle \mathcal{T}^{m_n} h, h_\infty \rangle \le \|\mathcal{T}^{m_n} h\|\, \|h_\infty\|$. □

Corollary 5.3.
If for a self-adjoint operator $\mathcal{T}$ the sequence of norms $\|\mathcal{T}^n h\|$ converges, then every weak accumulation point $\mathcal{T}^{m_n} h \rightharpoonup h_\infty$ of the sequence $\mathcal{T}^n h$ has the norm $\|h_\infty\| = \lim \|\mathcal{T}^n h\|$ and is a strong limit of the same subsequence: $\|\mathcal{T}^{m_n} h - h_\infty\| \to 0$.

Proof.
The value of the norm $\|h_\infty\|$ follows from the previous Lemma and Remark. The strong convergence is standard again: due to the strong convexity of the ball in $L^2$ (or a direct manipulation of $\|\mathcal{T}^{m_n} h - h_\infty\|^2$), a weakly convergent sequence with convergence of the norms to the norm of the limit converges also in the strong sense. □

Proposition 5.4.
Assume $\mathcal{T} = \mathcal{T}^\dagger$. Then $\mathcal{T}^n h$ converges strongly: $\|\mathcal{T}^n h - \frac{\int h}{\int f} f\| \to 0$.

Proof.
By Lemma 3.2 (3.6) the sequence $\|\mathcal{T}^n h\|$ converges to $V(h)$ from (5.1). By Corollary 5.3 every weakly converging subsequence of $\mathcal{T}^n h$ converges strongly, and the norm of the limit is equal to $V(h)$. In particular if $\mathcal{T}^{m_n} h \rightharpoonup h_\infty$, then $\mathcal{T}^{m_n + 1} h \rightharpoonup \mathcal{T}(h_\infty)$ and $\|h_\infty\| = V(h) = \|\mathcal{T} h_\infty\|$. By Lemma 3.2 (3.7) we thus have $h_\infty = \alpha f = \mathcal{T} h_\infty$, with $\alpha = \int h / \int f$ by (3.3). That means that every weakly converging subsequence of $\mathcal{T}^n h$ converges to the same limit $\alpha f$. But any subsequence of $\mathcal{T}^n h$ contains a weakly convergent subsequence, hence $\mathcal{T}^n h$ converges weakly, and also strongly, to $\alpha f$. □

This concludes the proof of Theorem 2.1.
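The strong convergence of Proposition 5.4 can also be watched numerically. The following sketch (ours; the grid, the linear interpolation and the initial density are hypothetical choices) discretizes $\mathcal{T}$ in the Gaussian case, where $\mathcal{H}_t$ is a rotation, and iterates it:

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 801)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)       # target density
t = 0.7
Q = x[:, None] * np.cos(t) + x[None, :] * np.sin(t)    # Q(q,p) on the grid
P = -x[:, None] * np.sin(t) + x[None, :] * np.cos(t)   # P(q,p) on the grid
gP = np.exp(-P**2 / 2) / np.sqrt(2 * np.pi)

def apply_T(h):
    """One application of T: interpolate h at Q(q,p), weight by g(P), integrate over p."""
    hQ = np.interp(Q.ravel(), x, h, left=0.0, right=0.0).reshape(Q.shape)
    return (hQ * gP).sum(axis=1) * dx

h = np.exp(-(x - 2.0)**2)                        # arbitrary initial density
alpha = h.sum() / f.sum()                        # alpha = (int h) / (int f) on the grid
norm = lambda a: np.sqrt((a**2 / f).sum() * dx)  # the L^2(1/f) norm of Section 3

errs = [norm(h - alpha * f)]
for _ in range(20):
    h = apply_T(h)
    errs.append(norm(h - alpha * f))
```

The error $\|\mathcal{T}^n h - \alpha f\|$ decays roughly geometrically, anticipating the spectral gap discussed in Section 6.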
6. The operator $\mathcal{T}$ as a kernel operator

The kernel $K(q,Q)$. In this section we shall assume a stronger version of the covering property (2.3) of the map $\mathcal{H}: Q \times P \to Q \times P$, $(Q,P) = \mathcal{H}(q,p)$, on finite dimensional spaces $Q$ and $P$ (manifolds modeled on $\mathbb{R}^d$; usually $P$ will be a co-tangent space to $Q$):

(6.1) $Q_q: P \to Q$ defined by $Q_q(p) = Q(q,p)$ is a bijection for almost every $q$;

(6.2) $\dfrac{\partial Q(q,p)}{\partial p}$ exists and is invertible for a.e. $q$.

By assumption the function $Q_q^{-1}: Q \to P$, $p = Q_q^{-1}(Q)$, is well defined, and so is $P_q: Q \to P$, $P_q(Q) = P(q,p) = P(q, Q_q^{-1}(Q))$. Define the Jacobian (determinant) $D_q$ of the partial derivative (6.2) by

(6.3) $D_q(Q) = \left|\det\left(\frac{\partial Q_q(p)}{\partial p}\right)^{-1}\right| \in \mathbb{R}, \quad \text{where } Q = Q_q(p).$

The movement $(Q,P) = \mathcal{H}(q,p)$ must be sufficiently smooth in order for $D_q$ to behave. In order to simplify our reasoning we shall assume that, similarly as in (6.1) and (6.2), the map $P_p: Q \to P$, defined by $P_p(q) = P(q,p)$, is a bijection with invertible derivative, whose Jacobian (similarly as in (6.3)) is $D_p(P) = |\det(\partial P_p(q)/\partial q)^{-1}| = |\det \partial q/\partial P|$.

Remark 6.1.
In such a situation (one can assume that) the measure spaces $(Q, dQ)$ and $(P, dp)$ are isomorphic in the measurable sense. Then the Jacobian $D_q$ can be treated as the Radon–Nikodym derivative of the transport of the measure $dp$ on the fiber $\{q\} \times P$, performed by $\mathcal{H}$ and the projection, in effect by $Q_q$, to the measure $D_q \cdot dQ$ on $Q$. Similarly $D_p$ transfers the measure $dq$ on $Q$ to the measure $D_p(P)\,dP$ on $P$.

With the change of variables $Q = Q_q(p)$, $p = Q_q^{-1}(Q)$, $P = P_q(Q)$, $dp = D_q(Q)\,dQ$, we get $\mathcal{T}h(q) = \int_P h(Q)\,g(P)\,dp = \int_Q h(Q) \cdot g(P_q(Q))\,D_q(Q)\,dQ$. This shows that the operator $\mathcal{T}$ is a kernel operator, namely $\mathcal{T}h(q) = \langle h, K_q \rangle$, where $K_q(Q) = K(q,Q)$ is defined by $K(q,Q) = f(Q) \cdot g(P_q(Q)) \cdot D_q(Q)$, as then

(6.4) $\mathcal{T}h(q) = \int_P h(Q(q,p))\,g(P(q,p))\,dp = \int_Q h(Q)\,\frac{g(P_q(Q))\,f(Q)}{f(Q)}\,D_q(Q)\,dQ = \int_Q \frac{h(Q) \cdot K(q,Q)}{f(Q)}\,dQ = \langle h, K_q \rangle.$

When a kernel operator has a finite $L^2$ norm it is compact, and then its spectrum is discrete except for the unique possible accumulation point at 0. We know from Remark 3.3 that the spectrum of $\mathcal{T}$ has all eigenvalues inside the unit disk except for the eigenvalue 1, which has multiplicity 1. The compactness of $\mathcal{T}$ would provide the spectral gap and the exponential convergence in norm of each sequence $\mathcal{T}^n h$ to its fixed point limit $\alpha f$, with the rate of convergence given by the second largest eigenvalue, in this case strictly smaller than 1.

The norm of $A(q,Q) \in L^2(Q \times Q)$ is given by
$$\|A\|^2 = \iint_{Q \times Q} \frac{A^2(q,Q)}{f(q) \cdot f(Q)}\,dQ\,dq.$$
For a.e. $q$ the map $Q \mapsto A_q(Q) = A(q,Q)$ should belong to $L^2(Q)$ (with respect to $Q$, thus $f(Q)$ in the denominator), and then the map defined by the norms, $q \mapsto \|A_q\|$, should belong to $L^2(Q)$ (with respect to $q$, thus $f(q)$ in the denominator); in both cases we use the norm defined in the first line of Section 3. One might be more comfortable working with an analogous kernel expression in the space $\tilde L^2$.

Lemma 6.2 (The norm of the kernel $K$). The $L^2$ norm of the kernel $K$ (6.4) can be expressed as:
$$\|K\|^2 = \iint_{P \times P} g(p)\,g(P)\,D_q(Q)\,D_p(P)\,dP\,dp,$$
where $Q$ and $q$ are well defined functions of $p$ and $P$: $q = P_p^{-1}(P)$ and $Q = Q_q(p)$.

Proof. Using invariance (with $p = Q_q^{-1}(Q)$, $P = P_q(Q)$) we have:
$$\frac{K^2(q,Q)}{f(Q)\,f(q)} = \frac{f^2(Q)\,g^2(P)\,D_q^2(Q)}{f(Q)\,f(q)} = \frac{f(Q)\,g(P)}{f(q)\,g(p)}\;g(p)\,g(P)\,D_q^2(Q) = g(p)\,g(P)\,D_q^2(Q).$$
In the integral expression for the norm we change both variables $q$ and $Q$ to the variables $p$ and $P$, using the Jacobians $D_q(Q)\,dQ = dp$ and $dq = D_p(P)\,dP$:

(6.5) $\|K\|^2 = \iint_{Q \times Q} \frac{K^2(q,Q)}{f(Q)\,f(q)}\,dQ\,dq = \iint_{Q \times Q} g(p)\,g(P)\,D_q^2(Q)\,dQ\,dq$

(6.6) $\qquad = \iint_{Q \times P} g(p)\,g(P)\,D_q(Q)\,dp\,dq = \iint_{P \times P} g(p)\,g(P)\,D_q(Q)\,D_p(P)\,dP\,dp. \qquad \square$

Corollary 6.3.
If the product of determinants $D_q(Q) \cdot D_p(P)$ is uniformly bounded from above, then the $L^2$ norm of the kernel $K$ is finite and the operator $\mathcal{T}$ is compact.

Hamiltonian movement.
In the following we shall prove that $\|K(q,Q)\|$ is finite in the special case of a Hamiltonian movement with $Q = P = \mathbb{R}^d$. The distribution of choice $g(p)$ is usually the standard Gaussian (mean 0 and covariance equal to the identity matrix, that is, $-\log g(p) = \langle p, p \rangle / 2$ up to an additive constant). If $U(Q) = -\log(f(Q))$, then $U''(Q) = \partial^2(-\log f(Q))/\partial Q^2$ is a symmetric matrix by the assumed continuous differentiability.

We assume that the target distribution $f$ is uniformly strictly logarithmically concave, that is, $U''$ is a (strictly) positive operator, i.e. it is bounded away (in either the norm or the spectrum sense) from 0 and $\infty$ uniformly in $q \in Q$. Then $U''(Q)$ can be bounded away from 0 and $\infty$ by two symmetric, strictly positive bounded operators constant with respect to $Q$, and the distribution $f$ can then be estimated both from above and below by two Gaussians (with some positive finite multiplicative constants). Similar statements hold for $g$ and $V(P) = -\log(g(P))$.

Consider the spaces $L^2_Q$, $L^2_P$ and $L^2_{Q \times P}$ of functions from $Q$ (positions), $P$ (momenta) and $Q \times P$ (configurations) to $\mathbb{R}$, with the appropriate (integral) norms. Given $0 \le f \in L^2_Q$ and $0 \le g \in L^2_P$ we can define the potential energy $U: Q \to \mathbb{R}$, $U(q) = -\log(f(q))$, and the kinetic energy $V: P \to \mathbb{R}$, $V(p) = -\log(g(p))$. We see that if $f(q)$ and $g(p)$ represent the densities of probability distributions of two independent variables, then $f(q) \cdot g(p) = \exp(-(U(q) + V(p)))$ represents the density of their (independent) joint distribution. We shall use the name Hamiltonian for the total energy $H = U + V$.

Remark 6.4.
In fact one may consider a more general case, when $g = g(q,p)$ (understood as the conditional $g(p|q)$) and thus $V = V(p|q)$. However, to simplify the calculations we shall deal only with the case of $g$ independent of $q$.

The Hamiltonian energy provides the following (Hamiltonian) dynamics $(q,p) \mapsto (Q,P)$, where $(Q,P) = (Q_t(q,p), P_t(q,p))$ is the position after time $t$ of the point starting at $(q,p)$, ruled by the system of equations:

(6.7) $\dot Q = \frac{dQ}{dt} = \frac{\partial H(Q,P)}{\partial P}, \qquad \dot P = \frac{dP}{dt} = -\frac{\partial H(Q,P)}{\partial Q}.$
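The system (6.7) is easy to integrate numerically. The sketch below is our own check, for a hypothetical quartic potential $U(Q) = Q^4/4 + Q^2/2$ with Gaussian $V(P) = P^2/2$, of the two conservation properties behind (2.1) and (2.2): the energy $H$ is constant along trajectories, and the flow preserves phase-space volume (the Jacobian determinant of $(q,p) \mapsto (Q,P)$ equals 1).

```python
import numpy as np

def hamiltonian_flow(q, p, dU, dV, t=2.0, n=2000):
    """Integrate dQ/dt = V'(P), dP/dt = -U'(Q) from (q,p) for time t
    with the classical 4th-order Runge-Kutta scheme."""
    dt = t / n
    z = np.array([q, p], dtype=float)
    F = lambda z: np.array([dV(z[1]), -dU(z[0])])  # the Hamiltonian vector field
    for _ in range(n):
        k1 = F(z)
        k2 = F(z + dt / 2 * k1)
        k3 = F(z + dt / 2 * k2)
        k4 = F(z + dt * k3)
        z = z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return z[0], z[1]

# Hypothetical quartic target: U(Q) = Q^4/4 + Q^2/2; standard Gaussian V.
dU = lambda Q: Q**3 + Q
dV = lambda P: P
H = lambda Q, P: Q**4 / 4 + Q**2 / 2 + P**2 / 2    # total energy -log(f g)

Q, P = hamiltonian_flow(1.0, 0.5, dU, dV)

# Liouville: finite-difference Jacobian of (q,p) -> (Q,P); det should be 1.
eps = 1e-5
Qq, Pq = hamiltonian_flow(1.0 + eps, 0.5, dU, dV)
Qp, Pp = hamiltonian_flow(1.0, 0.5 + eps, dU, dV)
jac = np.array([[(Qq - Q) / eps, (Qp - Q) / eps],
                [(Pq - P) / eps, (Pp - P) / eps]])
```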
The dot derivative is the derivative with respect to time, $\dot A = \partial A/\partial t$. We see that the normalizing constants of $g$ and $f$ are irrelevant to the motion. Formally the solutions can be written as

(6.8) $Q(t) = q + \int_0^t \frac{\partial H}{\partial P}(Q(s), P(s))\,ds = q + \int_0^t V'(P(s))\,ds, \qquad P(t) = p - \int_0^t \frac{\partial H}{\partial Q}(Q(s), P(s))\,ds = p - \int_0^t U'(Q(s))\,ds,$

assuming all functions are sufficiently regular.

The movement $\mathcal{H}$ is defined by $(q,p) \mapsto \mathcal{H}_t(q,p) = (Q_t, P_t)$, and the map $\mathcal{T}_t$ is defined by (1.1) using $\mathcal{H}_t$. For a given fixed time $t$ we shall skip the subscript $t$. For a function $W \in L^2_{Q \times P}$ we denote $W_t(q,p) = (W \circ \mathcal{H}_t)(q,p) = W(Q,P)$.

The value of the Hamiltonian $H$ does not change along the trajectories, and the Hamiltonian motion conserves the Lebesgue measure:

(6.9) $H \circ \mathcal{H} = H$, or after taking the exponent, $(f \cdot g) \circ \mathcal{H} = f \cdot g$;

(6.10) $\iint_{Q \times P} W_t\,d(q\,p) = \iint_{Q \times P} W\,d(q\,p)$ for any $W \in L^2_{Q \times P}$,

which corresponds to the properties (2.1) and (2.2). We define $\mathcal{T}$ as in (1.1); to stress the dependence of $\mathcal{T}$ on the choice of $t$ we use the notation $\mathcal{T}_t$. Given $t$ and $n$, the iterate of the map $\mathcal{T}_t^n$ is in general different from the map $\mathcal{T}_{tn}$ with time $nt$.

The subset of (target) functions $f$ of interest usually has some additional properties: $U, V \to +\infty$ as $|q|, |p| \to \infty$, fast enough so that the functions $f$, $g$ are bounded, integrable and vanish at infinity (or at the boundaries of the support), meaning that $H$ escapes to infinity when $(q,p)$ approaches these boundaries. This assures that the level sets of the Hamiltonian, and therefore the trajectories, are bounded (and closed). Often, additionally, the derivatives of $U$ and $V$ (or $f$ and $g$) are zero at a unique point (no stationary points except this one).

Differential equation solutions for the gradient of the Hamiltonian flow.
Assume that all the functions involved have sufficient smoothness, so that the derivatives exist and their order can be exchanged. In order to estimate the norm of the kernel we need good control of the Jacobians $D_q(Q)$ and $D_p(P)$, in other words of the partial derivatives of the time evolution of the configuration with respect to the initial configuration, $\partial(Q,P)/\partial(q,p)$. By the assumption that $H(Q,P) = U(Q) + V(P)$ (i.e. the spreading $g$ does not depend on the position $q$), its mixed second derivatives vanish: $\partial^2 H/\partial Q \partial P = 0$.

Lemma 6.5 (Evolution of the dependence on the initial configuration). Under the assumption that $H(Q,P) = U(Q) + V(P)$, the derivatives of the motion $(Q,P)$ with respect to the starting configuration $(q,p)$ satisfy the following time evolution equation:

(6.11) $\frac{\partial}{\partial t} \begin{pmatrix} \frac{\partial Q}{\partial q} & \frac{\partial Q}{\partial p} \\ \frac{\partial P}{\partial q} & \frac{\partial P}{\partial p} \end{pmatrix} = \begin{pmatrix} 0 & V'' \\ -U'' & 0 \end{pmatrix} \cdot \begin{pmatrix} \frac{\partial Q}{\partial q} & \frac{\partial Q}{\partial p} \\ \frac{\partial P}{\partial q} & \frac{\partial P}{\partial p} \end{pmatrix}; \qquad \begin{pmatrix} \frac{\partial Q}{\partial q} & \frac{\partial Q}{\partial p} \\ \frac{\partial P}{\partial q} & \frac{\partial P}{\partial p} \end{pmatrix}_{t=0} = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix},$

where $U'' = U''(Q) = \partial^2 H/\partial Q^2$ and $V'' = V''(P) = \partial^2 H/\partial P^2$.

Proof. The initial condition $\partial(Q,P)/\partial(q,p)|_{t=0} = I$ can be calculated from (6.8); assuming sufficient continuity one can change the order of differentiation and integration. Then, as $t \to 0$:
$$\frac{\partial}{\partial q} \frac{\partial H}{\partial P} = \frac{\partial^2 H}{\partial Q \partial P} \cdot \frac{\partial Q}{\partial q} + \frac{\partial^2 H}{\partial P^2} \cdot \frac{\partial P}{\partial q} = 0 \cdot \frac{\partial Q}{\partial q} + V'' \cdot \frac{\partial P}{\partial q} = V'' \frac{\partial P}{\partial q}.$$
Similarly the $\partial^2 H/\partial q \partial Q$ term gives $U'' \cdot \partial Q/\partial q$. It is clear that the calculation holds after exchanging every $q$ by $p$. Now we differentiate the Hamiltonian equations (6.7) with respect to the initial configuration $(q,p)$ and change the order of differentiation, for example
$$\frac{\partial}{\partial t}\left(\frac{\partial Q}{\partial q}\right) = \frac{\partial}{\partial q}\left(\frac{\partial Q}{\partial t}\right) = \frac{\partial}{\partial q} \frac{\partial H}{\partial P} = V'' \frac{\partial P}{\partial q},$$
and again similarly $\partial^2 P/\partial t \partial q = -U'' \cdot \partial Q/\partial q$. The calculation holds when exchanging $q$ for $p$. □

Before we proceed with the proof we recall that for $U, V$ symmetric, positive definite operators on $L^2$, their symmetric positive definite square roots are uniquely defined. For example, for $V < I$ (that is, $I - V$ positive definite, which can be achieved by a normalization trick) $\sqrt{V} = I - R$, where $R$ is the limit of the (strongly converging) sequence $R_{n+1} = (I - (V - R_n^2))/2$, $R_0 = 0$. Also $VU$ and $UV$ are positive definite (but not necessarily symmetric, when non-commuting), as for example $VU = \sqrt{U}^{-1}(\sqrt{U}\,V\sqrt{U})\sqrt{U}$ is similar, via the symmetric operator $\sqrt{U}$, to the symmetric positive definite $\sqrt{U}\,V\sqrt{U}$. Using this we can define $\sqrt{VU} = \sqrt{U}^{-1}\sqrt{\sqrt{U}\,V\sqrt{U}}\,\sqrt{U}$, and similarly $\sqrt{UV}$.

Below, the functions are defined by their power series: $\exp(x) = \sum_{n=0}^\infty x^n/n!$, $\sin(x) = \sum_{n=0}^\infty (-1)^n x^{2n+1}/(2n+1)!$, $\mathrm{sinc}(x) = x^{-1}\sin(x) = \sum_{n=0}^\infty (-1)^n x^{2n}/(2n+1)!$, which is well defined even when $x^{-1}$ is not, and $\cos(x) = \sum_{n=0}^\infty (-1)^n x^{2n}/(2n)!$.

Lemma 6.6 (Exponential function of a matrix). Let $V, U$ be symmetric, positive definite linear operators on $L^2$. If for $t \in \mathbb{R}$
$$C = \begin{pmatrix} 0 & tV \\ -tU & 0 \end{pmatrix},$$
then for $A = \sqrt{VU}$ and $B = \sqrt{UV}$ we have
$$\exp(C) = \sum_{n=0}^\infty \frac{C^n}{n!} = \begin{pmatrix} \cos(tA) & tV\,\mathrm{sinc}(tB) \\ -tU\,\mathrm{sinc}(tA) & \cos(tB) \end{pmatrix}.$$

Proof.
From a direct calculation we have the following powers of $C$:
$$C^{2n} = (-1)^n \begin{pmatrix} (VU)^n t^{2n} & 0 \\ 0 & (UV)^n t^{2n} \end{pmatrix} = (-1)^n \begin{pmatrix} (At)^{2n} & 0 \\ 0 & (Bt)^{2n} \end{pmatrix};$$
$$C^{2n+1} = (-1)^n \begin{pmatrix} 0 & V(UV)^n t^{2n+1} \\ -U(VU)^n t^{2n+1} & 0 \end{pmatrix} = (-1)^n \begin{pmatrix} 0 & tV\,(Bt)^{-1}(Bt)^{2n+1} \\ -tU\,(At)^{-1}(At)^{2n+1} & 0 \end{pmatrix},$$
so that:
$$\exp(C) = \sum_{n=0}^\infty (-1)^n \begin{pmatrix} \frac{(tA)^{2n}}{(2n)!} & tV\,\frac{(tB)^{-1}(tB)^{2n+1}}{(2n+1)!} \\ -tU\,\frac{(tA)^{-1}(tA)^{2n+1}}{(2n+1)!} & \frac{(tB)^{2n}}{(2n)!} \end{pmatrix} = \begin{pmatrix} \cos(tA) & tV\,\mathrm{sinc}(tB) \\ -tU\,\mathrm{sinc}(tA) & \cos(tB) \end{pmatrix}.$$
We need to use the additional $U$ and $V$ to compensate for the odd powers on the off-diagonal. □
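Lemma 6.6 can be sanity-checked numerically in the scalar case $U = u$, $V = v$, where $A = B = \sqrt{uv}$. The sketch below is ours (the values of $u$, $v$, $t$ are arbitrary); it compares a truncated power series for $\exp(C)$ with the closed block form:

```python
import numpy as np

def expm_series(C, terms=60):
    """exp(C) by the truncated power series: sum over n of C^n / n!."""
    out = np.eye(C.shape[0])
    term = np.eye(C.shape[0])
    for n in range(1, terms):
        term = term @ C / n          # C^n / n! built incrementally
        out = out + term
    return out

def closed_form(u, v, t):
    """The block formula of Lemma 6.6, specialized to scalars U = u, V = v,
    for which A = B = sqrt(u v)."""
    a = np.sqrt(u * v)
    sinc = np.sin(t * a) / (t * a)
    return np.array([[np.cos(t * a), t * v * sinc],
                     [-t * u * sinc, np.cos(t * a)]])

u, v, t = 2.0, 0.5, 0.8              # arbitrary positive scalars
C = np.array([[0.0, t * v], [-t * u, 0.0]])
```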
In the following Proposition let $H(Q,P) = U(Q) + V(P)$, and

(6.12) $\mathcal{U}(t) = \frac{1}{t}\int_0^t \frac{\partial^2 H(Q(s),P(s))}{\partial Q^2}\,ds = \frac{1}{t}\int_0^t U''(Q(s))\,ds,$

(6.13) $\mathcal{V}(t) = \frac{1}{t}\int_0^t \frac{\partial^2 H(Q(s),P(s))}{\partial P^2}\,ds = \frac{1}{t}\int_0^t V''(P(s))\,ds.$

Proposition 6.7 (Solution of the evolution of the dependence on initial conditions). Assume that the target $f$ and auxiliary $g$ distributions are both strictly log-concave. Then $U''(s)$ and $V''(s)$ are symmetric positive definite, and so are $\mathcal{U}(t)$ and $\mathcal{V}(t)$. Let $A = A(t) = \sqrt{\mathcal{V}(t)\mathcal{U}(t)}$ and $B = B(t) = \sqrt{\mathcal{U}(t)\mathcal{V}(t)}$. Then the solution of the evolution equation (6.11) in Lemma 6.5 is given by:
$$\begin{pmatrix} \frac{\partial Q}{\partial q} & \frac{\partial Q}{\partial p} \\ \frac{\partial P}{\partial q} & \frac{\partial P}{\partial p} \end{pmatrix}(t) = \begin{pmatrix} \cos(tA) & t\mathcal{V}(t)\,\mathrm{sinc}(tB) \\ -t\mathcal{U}(t)\,\mathrm{sinc}(tA) & \cos(tB) \end{pmatrix}.$$

Proof.
The solution to a linear differential equation $\dot X = A(t)X$ is equal to $X(t) = \exp\left(\int_{s=0}^t A(s)\,ds\right) \cdot X(0)$, and we use Lemma 6.6. □
Proof.
It is enough to prove that the Jacobians $D_q(Q)$ and $D_p(P)$ are uniformly bounded away from 0 and $\infty$; then we can use Corollary 6.3, which says that the kernel $K$ defined in (6.4) is bounded in $L^2$, which makes the operator $\mathcal{T}$ compact, which yields the spectral gap. Then, as the (maximal) eigenvalue 1 has multiplicity 1, the spectrum $\sigma$ of the operator $\mathcal{T}|_{\mathcal{N}}$ on the closed hyperplane $\mathcal{N} = \{h : \int h = 0\}$, orthogonal to the eigendirection of fixed points $\{f \cdot \mathbb{R}\}$, is contained inside the open unit disk, $\sigma(\mathcal{T}|_{\mathcal{N}}) \subset \{\mu \in \mathbb{C} : |\mu| < 1\}$ (in fact in the interval $[0,1)$, as $\mathcal{T}$ is positive and in our case self-adjoint). The spectrum has only 0 as a possible accumulation point. We get $\sup|\sigma(\mathcal{T}|_{\mathcal{N}})| < 1$; this radius is in fact the second largest eigenvalue, which secures the geometric convergence to 0 on $\mathcal{N}$, and to $f \cdot (\int h/\int f)$ in $L^2$.

We deal with finite dimensional $Q$ and $P$, both modeled on $\mathbb{R}^d$. The operators $\partial Q/\partial p$ and $\partial P/\partial q$ are both symmetric $d \times d$ matrices with determinant equal to the product of their real eigenvalues. For a (positive symmetric) matrix $M$ and a (positive) function $\phi$ defined by a power series, the (real positive) eigenvalues of the (positive symmetric) matrix $\phi(M)$ are the images under $\phi$ of the (real positive) eigenvalues of $M$.

By Proposition 6.7, $\partial Q/\partial p = t\mathcal{V}\,\mathrm{sinc}(t\sqrt{\mathcal{U}\mathcal{V}})$, where $\mathcal{U} = \mathcal{U}(t)$ and $\mathcal{V} = \mathcal{V}(t)$ were defined in (6.12) and (6.13). By the assumption of uniform strict concavity the spectra $\sigma(U'')$ and $\sigma(V'')$ are uniformly bounded away from 0 and infinity by $0 < \lambda = \inf(\sigma(U''), \sigma(V'')) \le \sup(\sigma(U''), \sigma(V'')) = \Lambda < \infty$; so are the spectra of the running averages $\mathcal{U}$ and $\mathcal{V}$, and also the spectra of $A = \sqrt{\mathcal{V}\mathcal{U}}$ and $B = \sqrt{\mathcal{U}\mathcal{V}}$ as in Proposition 6.7. Consequently, for small $t$, such that $0 < t\Lambda < \pi/2$, the spectrum $\sigma(\partial Q/\partial p)$ is bounded from above by $t\Lambda$ and from below by $t\lambda\,\mathrm{sinc}(t\Lambda)$; similarly $\sigma(\partial P/\partial q)$ is bounded from below by $-t\Lambda$ and from above by $-t\lambda\,\mathrm{sinc}(t\Lambda)$. All these bounds are uniformly away from 0 and $\infty$. Finally, the Jacobians entering the product $D_q(Q) \cdot D_p(P)$, defined by (6.3), have uniform bounds away from 0 and $\infty$: by $(t\lambda\,\mathrm{sinc}(t\Lambda))^{-d}$ from above and by $(t\Lambda)^{-d}$ from below. □

References

[1] S. Duane, A. D. Kennedy, B. J. Pendleton, D. Roweth, Hybrid Monte Carlo, Physics Letters B 195 (2) (1987), 216–222.

[2] S. Livingstone, M. Betancourt, S. Byrne, M. Girolami, On the geometric ergodicity of Hamiltonian Monte Carlo, Bernoulli 25 (2019), no. 4A, 3109–3138.

[3] S. Ghosh, Y. Lu, T. Nowicki, HMC, an example of Functional Analysis applied to Algorithms in Data Mining: the convergence in L^p, https://arxiv.org/abs/2101.08688

[4] S. Ghosh, Y. Lu, T. Nowicki, HMC, an Algorithm in Data Mining: the Probabilistic approach, in preparation.

[5] S. Ghosh, Y. Lu, T. Nowicki, On Convergence of Hamiltonian Monte Carlo with Asymmetrical Momentum Distributions, preprint.
IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598, US