HMC, an Algorithm in Data Mining: the Functional Analysis Approach
SOUMYADIP GHOSH, YINGDONG LU, TOMASZ NOWICKI
Abstract.
The main purpose of this paper is to facilitate the communication between the Analytic, Probabilistic and Algorithmic communities. We present a proof of convergence of the Hamiltonian (Hybrid) Monte Carlo algorithm from the point of view of Dynamical Systems, where the evolving objects are densities of probability distributions and the tools are derived from Functional Analysis.

1. Introduction
Functional Analysis for all Functioning Algorithms.
We observed that all too often not only do we not speak a common language, but we also do not see the reason to communicate and to look at the problems through different eyes. We tried, on the example of an HMC algorithm, to gently (relatively speaking) build the bridges and make the "other" methods clearer and more comprehensible. As the paper is addressed to people not necessarily fluent in Functional Analysis, we made the proofs quite extended in some places and sometimes a bit hand-waving. It is difficult to strike a balance between clarity and rigor.
Algorithms as a source of inspiration.
Recent development and usage of Machine Learning (ML), Data Mining (DM) and Artificial Intelligence (AI) resulted in a vast variety of new or refurbished algorithms to deal with large data sets, be they collected or streamed. Such new methods are usually tested on some data sets; however, they are not very often thoroughly vetted by theoretical means. Algorithms tend to rely on discrete models, but the nature of data and approximation approaches suggests rather a continuous point of view. In our opinion one should go even further: the right objects of investigation of algorithms should not be the continuous parameters but rather very general features of data, such as distributions.

We perceive the algorithms as iterative transformations of points in some underlying domains. The leading idea is to move from relatively simple objects, such as finite sets or points in finite dimensional Euclidean spaces, with quite complicated transformations, to simple transformations in richer spaces, such as distributions in functional spaces.
Hamiltonian Monte Carlo.
The Hybrid (or Hamiltonian) Monte Carlo (HMC) algorithm is a method to obtain random samples from a (target) probability distribution $f/\int_Q f$ on a space $Q$, whose density is known only up to a factor; that is to say, $f$ is known, but $\int_Q f$ is not, or at least is very difficult to calculate. It is an algorithm known for a while [1], of the Metropolis–Hastings type, used to estimate integrals. There are known proofs of convergence [2]. Our goal is to provide a clear and understandable reason why the HMC algorithm converges to the right limit for densities in the space $L^2(Q)$. We refer to our papers [3, 4, 5] for other approaches to HMC: probabilistic, algorithmic and analytic in $L^q$. Here we concentrate on the convergence in $L^2$.

HMC performs by iterating the following steps. Given an initial distribution $h$ (sample points) in a given space $Q$, double (the dimension of) the space by considering $Q \times P$, with $P \sim Q$. Then spread each point $q \in Q$ to a point $(q,p) \in Q \times P$ by sampling $p \in P$ from a distribution of choice $g$, where $g > 0$ on $P$ and $\int_P g = 1$. Then move each point $(q,p)$ to a new point $(Q,P) = \mathcal{H}(q,p)$, where the transformation $\mathcal{H}: Q \times P \to Q \times P$ satisfies some special invariance properties with respect to $f$, $g$ and the underlying base measure on $Q \times P$. Finally, project $(Q,P)$ onto $Q$, providing a new sample of points $Q$ in $Q$ with a new distribution $\hat h$, which shall be used as the initial sample (or distribution) for the next step. With the right choice of $\mathcal{H}$ the iteration of this procedure will result in an approximate sample from the target distribution.

Moving to functional spaces.
The success of the algorithm lies in the appropriate choice of the transformation $\mathcal{H}$. The Hamiltonian part of the algorithm's name is due to the Hamiltonian motion $\mathcal{H}$. It turns out that if the target distribution has a density proportional to a given function $f$ and the distribution of choice is $g$, then $\mathcal{H}$ is the Hamiltonian motion generated by the Hamiltonian energy $H(q,p) = -\log(f(q) \cdot g(p))$. That is, if $(Q,P) = \mathcal{H}(q,p)$ is the solution of the time evolution $\dot Q = \partial H/\partial P$, $\dot P = -\partial H/\partial Q$ after time $t$ with initial point $(q,p)$, then $\mathcal{H}$ has the needed invariance properties for HMC to converge to a distribution proportional to $f$. Effectively it means that one can obtain the normalizing constant $\int_Q f$, or any expected value of a function $\phi$ with respect to the distribution proportional to $f$: $\int \phi \cdot f / \int f$.

In terms of the densities of the involved distributions one can present HMC as follows. Given some initial distribution $h(q)$ on $Q$, one produces a joint distribution $h \cdot g$ by $(h \cdot g)(q,p) = h(q) \cdot g(p)$ on $Q \times P$, then moves the points $(q,p) \mapsto (Q,P) = \mathcal{H}(q,p)$, producing another distribution $(h \cdot g) \circ \mathcal{H}(q,p) = h(Q) \cdot g(P)$ on $Q \times P$, and finally projects the last one onto $Q$ by calculating the marginal $\int_P (h \cdot g) \circ \mathcal{H}(q,p)\, dp$, which is the result of the action of the algorithm in one step. In short,

(1.1) $\mathcal{T}(h)(q) = \int_P (h \cdot g) \circ \mathcal{H}(q,p)\, dp,$

and from a rather complicated algorithm we receive a relatively simple, linear operator on some space of integrable functions. The convergence of the algorithm corresponds to the convergence of the sequences of iterates of $\mathcal{T}$.

Remark 1.1.
(1)
The distribution $g$ may depend on the point $q$, $g(p|q)$, as long as for (almost) all $q$ it satisfies the required conditions. (2) It is clear that the Hamiltonian motion $\mathcal{H}$ does not depend on the constant normalizing factor in front of $f$. (3) The motion $\mathcal{H}$ in practical implementations is performed by the leap-frog algorithm, which displays the needed invariance properties. We shall not deal with it in this paper. (4) An example of a situation where the target distribution is known only up to the normalizing constant is the Bayesian update. In order to establish the distribution of (random) parameters $\theta$ influencing the outcome $D$ of the observations, when we know all the probabilities $P(D|\theta)$, one uses the knowledge of the outcome $D$ to improve the estimate: given the estimate distribution $\pi(\theta)$ before the experiment, we calculate $\hat\pi(\theta) = \pi(\theta|D) = \pi(\theta, D)/\pi(D) = P(D|\theta) \cdot \pi(\theta) / \sum_{\theta'} P(D|\theta') \cdot \pi(\theta')$ and take the distribution $\hat\pi$ as the new estimate. However, the sum (integral) in the denominator may be not that easy to calculate. This leads to $\hat\pi(\theta) \sim P(D|\theta)\pi(\theta)$ without the normalizing factor.

2. Results
Convergence under invariance and coverage properties.
Assume that the motion $\mathcal{H}: Q \times P \to Q \times P$ (measurable spaces with measures $dq$ and $dp$) satisfies the following invariance and coverage properties:
Given $0 \le f: Q \to \mathbb{R}$ with $\int_Q f < \infty$, and $0 \le g: P \to \mathbb{R}$ with $\int_P g = 1$:

(2.1) $(f \cdot g) \circ \mathcal{H} = f \cdot g,$

(2.2) $\iint_{Q \times P} A \circ \mathcal{H} = \iint_{Q \times P} A$ for any integrable $A$,

(2.3) $Q(q, P) = Q$ for (almost) every $q$.

The space $Q$ may be restricted to equal the support of $f$.

Define $\mathcal{T} h = \int_P (h \cdot g) \circ \mathcal{H}$ as in (1.1) and $\mathcal{T}^{n+1} = \mathcal{T}^n \circ \mathcal{T}$. The adjoint operator $\mathcal{T}^\dagger$ is given by the same formula with $\mathcal{H}^{-1}$ in place of $\mathcal{H}$ and is described below in Section 4. A self-adjoint operator satisfies $\mathcal{T}^\dagger = \mathcal{T}$, see (4.1).

Let $L^2$ denote the space of square-integrable functions $h: Q \to \mathbb{R}$ such that $\|h\|^2 = \int_Q |h|^2/f < \infty$ and the support of $h$ is included in the support of $f$ (i.e. $Q$).

Theorem 2.1.
Under the above invariance and coverage conditions, and when the operator $\mathcal{T}$ is self-adjoint, for every $h \in L^2(Q)$ the sequence $\mathcal{T}^n h$ converges strongly in $L^2$ to $f \cdot \int h / \int f$. The direction of $f$ is the unique direction of fixed points. Except for the eigenvalue 1, which has multiplicity 1, all the spectrum is contained in the interior of the unit disc.

Remark 2.2.
(1)
The Hamiltonian motion satisfies the two invariance assumptions (first integrals), as both the Hamiltonian (the energy function, equal here to $-\log(f(q) \cdot g(p))$) and the Lebesgue measure are invariant under such a motion. (2) The coverage property $Q(q,P) = Q$ can be weakened to a statement of eventual coverage, not necessarily in one step. Some type of irreducibility must be assumed, to avoid completely disjoint domains of the motion and hence an obvious non-existence of a (unique) limit. (3) The support condition takes care of some initialization problems with the division by 0. This can be formally avoided by working in the space of likelihoods, see below. (4)
The self-adjointness condition is not very restrictive, as one can always use the composition $\mathcal{T}^\dagger \circ \mathcal{T}$, which is self-adjoint and satisfies all the needed properties. Also, in the case of any even auxiliary distribution $g(p) = g(-p)$ on $\mathbb{R}^d$, such as the standard Gaussian, the operator is always self-adjoint, see Lemma 4.2.

Exponential convergence under uniformly strong logarithmic concavity.
We say that $h: \mathbb{R}^d \to \mathbb{R}$ is uniformly strongly logarithmically concave if for almost every $Q$ the (symmetric) Hessian $-\partial^2 \log(h(Q))/\partial Q^2$ has its spectrum contained in some finite, positive interval $\{z : 0 < \lambda \le z \le \Lambda < \infty\}$ independent of $Q$. Gaussian auxiliary distributions are obviously uniformly strongly logarithmically concave. Speaking informally, one may say that such a density lies between two Gaussians.

For the next Theorem we need a stronger coverage condition. We say that the motion $\mathcal{H}: (q,p) \mapsto (Q,P)$ is fully invertible if, given fixed values of (almost) any two of the four variables $(q,p,Q,P)$, the other two are connected by a differentiable bijection. In particular, given $q$, the map $p \mapsto Q_q(p) = Q$ has a smoothly invertible derivative $\partial Q/\partial p$, and the same holds for $q \mapsto P_p(q) = P$.

Theorem 2.3.
Suppose that the target distribution $f: \mathbb{R}^d \to \mathbb{R}$ and the auxiliary distribution $g: \mathbb{R}^d \to \mathbb{R}$ are both uniformly strictly logarithmically concave. Let $\mathcal{H}_t$ be the Hamiltonian motion defined by the Hamiltonian $H(Q,P) = -\log(f(Q) \cdot g(P))$, and let $\mathcal{T}_t$ be the operator defined by $\mathcal{H}_t$; assume it is self-adjoint. Then for $t > 0$ small enough the iterations of the operator $\mathcal{T}_t$ converge geometrically to the map $h \mapsto \int h \cdot f / \int f$:
$$\exists\, (0 < \rho < 1)\ \forall (h \in L^2(Q))\ \forall (n) \qquad \Big\|\mathcal{T}_t^n h - f \frac{\int h}{\int f}\Big\| \le \rho^n \|h\|.$$
One can prove that the uniformly strong logarithmic concavity assumption is needed only outside an arbitrary bounded region, see [4].
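The transition analyzed in these theorems can be sketched in code. The following is our own minimal illustration, not the paper's construction: the exact motion $\mathcal{H}_t$ is replaced by the leap-frog integrator mentioned in Remark 1.1(3), $g$ is the standard Gaussian, and all names and step counts (`leapfrog`, `hmc_step`, `n_steps`) are hypothetical choices.

```python
import numpy as np

def leapfrog(q, p, grad_U, t=1.0, n_steps=10):
    """Approximate the Hamiltonian flow H_t for H(q,p) = U(q) + |p|^2/2
    with the standard leap-frog (Stormer-Verlet) integrator."""
    eps = t / n_steps
    p = p - 0.5 * eps * grad_U(q)      # half step in momentum
    for _ in range(n_steps - 1):
        q = q + eps * p                # full step in position
        p = p - eps * grad_U(q)        # full step in momentum
    q = q + eps * p                    # last position step
    p = p - 0.5 * eps * grad_U(q)      # closing half step in momentum
    return q, p

def hmc_step(q, grad_U, rng, t=1.0, n_steps=10):
    """One iteration of the idealized transition described in the paper:
    spread q to (q,p) with p sampled from g (standard Gaussian), move by
    (an approximation of) H_t, project back onto Q by dropping momentum."""
    p = rng.standard_normal(np.shape(q))
    Q, _ = leapfrog(q, p, grad_U, t, n_steps)
    return Q
```

Iterating `hmc_step` on a cloud of points plays the role of iterating $\mathcal{T}$ on their density.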
The proofs are not very hard.
For Theorem 2.1 we observe that $\mathcal{T}$ is in fact (3.4) an averaging map; thus, by the convexity of $x \mapsto x^2$, the norm of $h$ decreases (3.6) under $\mathcal{T}$, sharply unless (by the coverage assumption) $h = \alpha f$. The space $L^2$ is reflexive, thus bounded sequences have weak accumulation points. Using self-adjointness, defined in Section 4, we prove that each accumulation point must be of the form $\alpha f$, proving weak convergence. Meanwhile the proof of the convergence of the norms provides (Proposition 5.4) strong convergence. The spectral properties follow from Remark 3.3.

The proof of Theorem 2.3 relies on the representation of the operator $\mathcal{T}$ as a kernel operator (6.4) and on the proof in Section 6 (somewhat lengthy in calculation but not too deep) that the $L^2$ norm of that kernel (6.5) is finite; hence the kernel is compact and the operator $\mathcal{T}$ has a spectral gap, which provides the geometric (or exponential) rate of convergence.

3. The operator $\mathcal{T}$ in $L^2$

The Hilbert space $L^2$. For $h \in L^2$ we have the standard norm $\|h\|^2 = \int_Q \left(\frac{h}{f}\right)^2 f$ and a scalar product $\langle \cdot, \cdot \rangle : L^2 \times L^2 \to \mathbb{R}$, $\langle a, b \rangle = \int_Q \frac{a \cdot b}{f}$.

We shall assume $h \ge 0$ and call $\tilde h = h/f$ a likelihood (up to an irrelevant normalizing constant $\int f$) of $h$ with respect to $f$. The space $\tilde L^2 = \{\tilde h : \int_Q |\tilde h|^2 f < \infty\}$ of likelihoods $\tilde h = h/f$ is isometric to $L^2$.

Lemma 3.1.

(3.1) $a, b \in L^2 \Rightarrow \langle a, b \rangle \le \|a\| \cdot \|b\|$;

(3.2) $f \in L^2$ with $\|f\|^2 = \int_Q f$;

(3.3) $h \in L^2 \Rightarrow \langle h, f \rangle = \int_Q h$.

Proof. (3.1) is the Hölder inequality for the likelihoods $a/f$ and $b/f$ in the space $\tilde L^2$. (3.2) and (3.3) follow directly from the definitions. □

The operator $\mathcal{T}$.

Lemma 3.2 (Properties of $\mathcal{T}$). For $0 \le h \in L^2$:

(3.4) $\mathcal{T} h = f \cdot \int_P \frac{h}{f} \circ \mathcal{H} \cdot g$;

(3.5) $\int_Q \mathcal{T} h = \int_Q h$;

(3.6) $\|\mathcal{T} h\| \le \|h\|$;

(3.7) the equality in (3.6) occurs iff $h = \alpha \cdot f$ a.e.,

where $\alpha = \alpha(h) = \int h / \int f$. Equation (3.4) gives an explicit formula for $\tilde{\mathcal{T}}(\tilde h)$ acting in $\tilde L^2$.

Proof.
(3.4): Using the invariance properties we have $\int_P \frac{h}{f} \circ \mathcal{H} \cdot (f \cdot g) \circ \mathcal{H} = \int_P \frac{h}{f} \circ \mathcal{H} \cdot (f \cdot g)$, and $f$ does not depend on $p \in P$.

(3.5): $\int_Q \int_P (h \cdot g) \circ \mathcal{H} = \iint_{Q \times P} (h \cdot g) = \left(\int_Q h\right)\left(\int_P g\right)$.

(3.6): $\|\mathcal{T} h\|^2 = \int_Q \left(\int_P \frac{h}{f} \circ \mathcal{H} \cdot g\right)^2 f \le \int_Q \int_P \left(\frac{h}{f} \circ \mathcal{H}\right)^2 g\, f = \iint_{Q \times P} \left(\frac{h}{f}\right)^2 \circ \mathcal{H} \cdot (g \cdot f) \circ \mathcal{H} = \iint_{Q \times P} \left(\frac{h}{f}\right)^2 g \cdot f$, the last one being equal to $\left(\int_Q \left(\frac{h}{f}\right)^2 f\right) \cdot \left(\int_P g\right) = \|h\|^2$.

For a given $q$ the equality occurs only if $(h/f)(\mathcal{H}(q,p))$ is constant for ($g$-)almost all $p$, but the coverage assumption assures that the constant $(h/f) \circ \mathcal{H}(q,P)$ is the same for $f$-almost all $q \in Q$: $(h/f)(Q) = \int_Q h / \int_Q f$, the value following from (3.5). Note that (3.6) hides the formula for the variance with respect to the probability $g$, the random variable being the transported likelihood. □

Remark 3.3.
The operator $\mathcal{T}$ is an averaging operator of the (transported) likelihood $h/f$ with respect to the probability $g$. The scalar product is monotone: $0 \le a \le b$, $0 \le c \le d$ implies $\langle a, c \rangle \le \langle b, d \rangle$, and $\mathcal{T}$ is positive; in particular, if $a \le b$ then $\mathcal{T} a \le \mathcal{T} b$. The function $f$ provides the eigendirection of fixed points, and by (3.6) $\mathcal{T}$ has its spectrum in the unit disk, with 1 being the unique eigenvalue on the unit circle, of multiplicity 1. For any $h \in L^2$ one has the unique decomposition $h = \alpha f + (h - \alpha f)$, where $\alpha f$ is in the direction of the fixed points and $h - \alpha f \in \mathcal{N} = \{a \in L^2 : \int a = 0\}$ lies in an invariant subspace. It is not a priori clear under what conditions the eigenvalue 1 is isolated in the spectrum, in other words whether the contraction $\|\mathcal{T} h\| < \|h\|$ is uniform on $\mathcal{N}$, which would imply $\mathcal{T}^n \mathcal{N} \to \{0\}$ (point-wise) with exponential speed.

4. The adjoint operator $\mathcal{T}^\dagger$

As $\mathcal{H}$ is invertible, the inverse map $\mathcal{H}^{-1}$ is well defined and enjoys the same invariance properties as $\mathcal{H}$. It turns out that the operator $\mathcal{T}^\dagger$ defined by $\mathcal{H}^{-1}$:

(4.1) $\mathcal{T}^\dagger h = \int_P (h \cdot g) \circ \mathcal{H}^{-1}$

is adjoint to $\mathcal{T}$ with respect to the duality functional $\langle \cdot, \cdot \rangle$, namely

Lemma 4.1.
For $h, k \in L^2$:

(4.2) $\langle \mathcal{T} h, k \rangle = \langle h, \mathcal{T}^\dagger k \rangle.$

Proof.
Using (3.4) and invariance:
$$\langle \mathcal{T} h, k \rangle = \int_Q \left(\int_P \frac{h}{f} \circ \mathcal{H} \cdot g\right) \cdot k = \iint_{Q \times P} \frac{h}{f} \cdot (g \cdot k) \circ \mathcal{H}^{-1} = \iint_{Q \times P} \frac{h}{f} \cdot \left(\frac{k}{f}\right) \circ \mathcal{H}^{-1} \cdot (g \cdot f) \circ \mathcal{H}^{-1} = \iint_{Q \times P} \frac{h}{f} \cdot \left(\frac{k}{f} \circ \mathcal{H}^{-1}\right) \cdot (g \cdot f) = \int_Q h \left(\int_P \frac{k}{f} \circ \mathcal{H}^{-1} \cdot g\right) = \langle h, \mathcal{T}^\dagger k \rangle. \qquad \square$$

A sufficient condition for self-adjointness $\mathcal{T} = \mathcal{T}^\dagger$. Let $\tau$ be a measure preserving involution $\tau: P \to P$, $\tau \circ \tau = \mathrm{id}$. We can extend it to $\tau: Q \times P \to Q \times P$ by $\tau(q,p) = (q, \tau(p))$. Assume that $g$ is invariant with respect to $\tau$: $g \circ \tau = g$.

Lemma 4.2. If $\tau \circ \mathcal{H}^{-1} \circ \tau = \mathcal{H}$ and $g$ is invariant with respect to $\tau$, then $\mathcal{T}^\dagger = \mathcal{T}$.

As an example take $Q = P = \mathbb{R}$ and $\tau$ to be the symmetry (reflection) of the space $P$ with respect to 0, $\tau(p) = -p$. An even $g(p) = g(-p)$ is invariant with respect to $\tau$. The involution $\tau(p) = -p$ is applicable to the most common choice of $g$: a centralized Gaussian distribution. In the particular case of $f$ also a Gaussian, the Hamiltonian movement $\mathcal{H}$ is a rotation and $\mathcal{H}^{-1}$ the opposite rotation. The spreading by $g$ is symmetric, and whatever mass is transported from $(q,p)$ to $\mathcal{H}(q,p) = (Q,P)$ by $\mathcal{H}$, the same mass will be transported from $(q,-p)$ to $\mathcal{H}^{-1}(q,p) = (\bar Q, \bar P) = (Q, -P)$ by $\mathcal{H}^{-1}$. The projection onto $Q$ will produce the same mass transported by both maps $\mathcal{T}$ and $\mathcal{T}^\dagger$. Clearly this extends to non-standard Gaussians.

Proof.
Measure invariance means that $\int_P a \circ \tau = \int_P a$. Let $(\bar Q, \bar P) = \mathcal{H}^{-1}(q,p)$; then $\tau \circ \mathcal{H}^{-1}(q,p) = \tau(\bar Q, \bar P) = (\bar Q, \tau(\bar P))$ and
$$\mathcal{T} h = \int_P (h \cdot g) \circ \mathcal{H} = \int_P (h \cdot g) \circ \tau \circ \mathcal{H}^{-1} \circ \tau = \int_P (h \cdot g) \circ \tau \circ \mathcal{H}^{-1} = \int_P h(\bar Q) \cdot g(\tau(\bar P)) = \int_P h(\bar Q)\, g(\bar P) = \int_P (h \circ \mathcal{H}^{-1}) \cdot (g \circ \mathcal{H}^{-1}) = \mathcal{T}^\dagger h. \qquad \square$$
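In the Gaussian example just described the exact motion is the rotation $\mathcal{H}_t(q,p) = (q\cos t + p\sin t,\ -q\sin t + p\cos t)$, and both Lemma 3.2 and Lemma 4.2 can be checked by quadrature. The sketch below is our own illustration, assuming $f$ and $g$ are standard Gaussians; the grid, test functions and tolerances are hypothetical choices.

```python
import numpy as np

# Quadrature grid; tails beyond |x| = 9 are negligible for the densities used.
x = np.linspace(-9.0, 9.0, 1201)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)       # target density f
gauss = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def apply_T(fun, t=0.7):
    """T h(q) = integral over p of h(Q) g(P) dp, with (Q,P) = H_t(q,p)."""
    Q = x[:, None] * np.cos(t) + x[None, :] * np.sin(t)
    P = -x[:, None] * np.sin(t) + x[None, :] * np.cos(t)
    return (fun(Q) * gauss(P)).sum(axis=1) * dx

def inner(a, b):
    """<a,b> = integral of a*b/f, the scalar product of Section 3."""
    return (a * b / f).sum() * dx

h_fun = lambda q: np.exp(-(q - 1.0)**2 / 2)      # hypothetical test densities
k_fun = lambda q: np.exp(-(q + 0.5)**2 / 2) * (1 + 0.3 * q**2)
h, k = h_fun(x), k_fun(x)
Th, Tk = apply_T(h_fun), apply_T(k_fun)

mass_h, mass_Th = h.sum() * dx, Th.sum() * dx    # (3.5): mass is conserved
norm2_h, norm2_Th = inner(h, h), inner(Th, Th)   # (3.6): the norm decreases
lhs, rhs = inner(Th, k), inner(h, Tk)            # (4.2): self-adjointness
```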
In what follows we shall assume that $\mathcal{T} = \mathcal{T}^\dagger$. If this is not the case we can use the algorithm with $\mathcal{S} = \mathcal{T}^\dagger \circ \mathcal{T}$, for which $\mathcal{S}^\dagger = \mathcal{S}$.

5. Limits of the sequences $\mathcal{T}^n$ of a self-adjoint operator

From $\|\mathcal{T} h\| < \|h\|$ by induction we obtain $\|\mathcal{T}^n h\| < \|h\|$, unless $h = \alpha f$, when equality holds. For $h \in L^2$ let

(5.1) $V(h) = \inf_n \|\mathcal{T}^n h\| = \lim_n \|\mathcal{T}^n h\|.$

We see that for any $M$, $V(h) = V(\mathcal{T}^M(h))$. As we are interested in the limit of the sequence $\mathcal{T}^n h$, for a given $h$ we can assume that for an arbitrary $\epsilon > 0$, $\|h\| < V + \epsilon$, taking a high iterate $\mathcal{T}^M h$ instead of $h$ if needed.

By a corollary to the Alaoglu Theorem, bounded sets in the reflexive $L^2$ are weakly (the same as weakly*) compact. Any infinite sequence $\mathcal{T}^n h$ has a weakly converging subsequence $\mathcal{T}^{m_n} h \rightharpoonup h_\infty$, meaning $\langle \mathcal{T}^{m_n} h, b \rangle \to \langle h_\infty, b \rangle$ for every $b \in L^2$.

Lemma 5.1.
If the operator $\mathcal{T}$ is self-adjoint then for any weak accumulation point $h_\infty$ of the sequence $\mathcal{T}^n h$ we have $\liminf \|\mathcal{T}^n h\| \le \|h_\infty\| \le \limsup \|\mathcal{T}^n h\|$.

Proof.
With $\mathcal{T}^{m_n} h \rightharpoonup h_\infty$, let $N$ be such that the sequence $(m_n - N)$ contains infinitely many even indices $2m$. Then for $h' = \mathcal{T}^N h$ we have $\mathcal{T}^{2m} h' \rightharpoonup h_\infty$ as well, and:
$$\|\mathcal{T}^m h'\|^2 = \langle \mathcal{T}^m h', \mathcal{T}^m h' \rangle = \langle \mathcal{T}^{2m} h', h' \rangle \to \langle h_\infty, h' \rangle \le \|h_\infty\| \cdot \|h'\|.$$
For an $\epsilon > 0$ take $N$ sufficiently large, such that $\|\mathcal{T}^m h'\| = \|\mathcal{T}^{m+N} h\| \ge \liminf \|\mathcal{T}^n h\| - \epsilon$ and $\|h'\| = \|\mathcal{T}^N h\| \le \limsup_N \|\mathcal{T}^N h\| + \epsilon$. The arbitrary choice of $\epsilon > 0$ concludes the proof. □

Remark 5.2.
For any bounded operator $\mathcal{T}$ on $L^2$ and any weak accumulation point $\mathcal{T}^{m_n} h \rightharpoonup h_\infty$ we have $\|h_\infty\| \le \limsup \|\mathcal{T}^{m_n} h\| \le \limsup \|\mathcal{T}^n h\|$.

Proof.
This is standard: for an $\epsilon > 0$ take $N$ such that for $m_n > N$ we have $\|\mathcal{T}^{m_n} h\| \le \limsup \|\mathcal{T}^n h\| + \epsilon$. Then $\|h_\infty\|^2 = \langle h_\infty, h_\infty \rangle \leftarrow \langle \mathcal{T}^{m_n} h, h_\infty \rangle \le \|\mathcal{T}^{m_n} h\|\, \|h_\infty\|$. □

Corollary 5.3.
If for a self-adjoint operator $\mathcal{T}$ the sequence of norms $\|\mathcal{T}^n h\|$ converges, then every weak accumulation point $\mathcal{T}^{m_n} h \rightharpoonup h_\infty$ of the sequence $\mathcal{T}^n h$ has the norm $\|h_\infty\| = \lim \|\mathcal{T}^n h\|$ and is a strong limit of the same subsequence: $\|\mathcal{T}^{m_n} h - h_\infty\| \to 0$.

Proof.
The value of the norm $\|h_\infty\|$ follows from the previous Lemma and Remark. The strong convergence is standard again: due to the strong convexity of the ball in $L^2$ (or a direct manipulation of $\|\mathcal{T}^{m_n} h - h_\infty\|^2$), a weakly convergent sequence with convergence of the norms to the norm of the limit converges also in the strong sense. □

Proposition 5.4.
Assume $\mathcal{T} = \mathcal{T}^\dagger$. Then $\mathcal{T}^n h$ converges strongly: $\|\mathcal{T}^n h - \frac{\int h}{\int f} f\| \to 0$.

Proof.
By Lemma 3.2 (3.6) the sequence $\|\mathcal{T}^n h\|$ converges to $V(h)$ from (5.1). By Corollary 5.3 every weakly converging subsequence of $\mathcal{T}^n h$ converges strongly, and the norm of the limit is equal to $V(h)$. In particular if $\mathcal{T}^{m_n} h \rightharpoonup h_\infty$, then $\mathcal{T}^{m_n + 1} h \rightharpoonup \mathcal{T}(h_\infty)$ and $\|h_\infty\| = V(h) = \|\mathcal{T} h_\infty\|$. By Lemma 3.2 (3.7) we thus have $h_\infty = \alpha f = \mathcal{T} h_\infty$, with $\alpha = \int h / \int f$ by (3.3). That means that every weakly converging subsequence of $\mathcal{T}^n h$ converges to the same limit $\alpha f$. But any subsequence of $\mathcal{T}^n h$ contains a weakly convergent subsequence, hence $\mathcal{T}^n h$ converges weakly, and also strongly, to $\alpha f$. □

This concludes the proof of Theorem 2.1.
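The strong convergence of Proposition 5.4 can also be watched numerically. The following sketch (ours; the grid, the linear interpolation and the initial density are hypothetical choices) discretizes $\mathcal{T}$ in the Gaussian case, where $\mathcal{H}_t$ is a rotation, and iterates it:

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 801)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)       # target density
t = 0.7
Q = x[:, None] * np.cos(t) + x[None, :] * np.sin(t)    # Q(q,p) on the grid
P = -x[:, None] * np.sin(t) + x[None, :] * np.cos(t)   # P(q,p) on the grid
gP = np.exp(-P**2 / 2) / np.sqrt(2 * np.pi)

def apply_T(h):
    """One application of T: interpolate h at Q(q,p), weight by g(P), integrate over p."""
    hQ = np.interp(Q.ravel(), x, h, left=0.0, right=0.0).reshape(Q.shape)
    return (hQ * gP).sum(axis=1) * dx

h = np.exp(-(x - 2.0)**2)                        # arbitrary initial density
alpha = h.sum() / f.sum()                        # alpha = (int h) / (int f) on the grid
norm = lambda a: np.sqrt((a**2 / f).sum() * dx)  # the L^2(1/f) norm of Section 3

errs = [norm(h - alpha * f)]
for _ in range(20):
    h = apply_T(h)
    errs.append(norm(h - alpha * f))
```

The error $\|\mathcal{T}^n h - \alpha f\|$ decays roughly geometrically, anticipating the spectral gap discussed in Section 6.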
6. The operator $\mathcal{T}$ as a kernel operator

The kernel $K(q,Q)$. In this section we shall assume a stronger version of the covering property (2.3) of the map $\mathcal{H}: Q \times P \to Q \times P$, $(Q,P) = \mathcal{H}(q,p)$, on finite dimensional spaces $Q$ and $P$ (manifolds modeled on $\mathbb{R}^d$; usually $P$ will be a co-tangent space to $Q$):

(6.1) $Q_q: P \to Q$ defined by $Q_q(p) = Q(q,p)$ is a bijection for almost every $q$;

(6.2) $\dfrac{\partial Q(q,p)}{\partial p}$ exists and is invertible for a.e. $q$.

By assumption the function $Q_q^{-1}: Q \to P$, $p = Q_q^{-1}(Q)$, is well defined, and so is $P_q: Q \to P$, $P_q(Q) = P(q,p) = P(q, Q_q^{-1}(Q))$. Define the Jacobian (determinant) $D_q$ of the partial derivative (6.2) by

(6.3) $D_q(Q) = \left|\det\left(\frac{\partial Q_q(p)}{\partial p}\right)^{-1}\right| \in \mathbb{R}, \quad \text{where } Q = Q_q(p).$

The movement $(Q,P) = \mathcal{H}(q,p)$ must be sufficiently smooth in order for $D_q$ to behave. In order to simplify our reasoning we shall assume that, similarly as in (6.1) and (6.2), the map $P_p: Q \to P$, defined by $P_p(q) = P(q,p)$, is a bijection with invertible derivative, whose Jacobian (similarly as in (6.3)) is $D_p(P) = |\det(\partial P_p(q)/\partial q)^{-1}| = |\det \partial q/\partial P|$.

Remark 6.1.
In such a situation (one can assume that) the measure spaces $(Q, dQ)$ and $(P, dp)$ are isomorphic in the measurable sense. Then the Jacobian $D_q$ can be treated as the Radon–Nikodym derivative of the transport of the measure $dp$ on the fiber $\{q\} \times P$, performed by $\mathcal{H}$ and the projection, in effect by $Q_q$, to the measure $D_q \cdot dQ$ on $Q$. Similarly $D_p$ transfers the measure $dq$ on $Q$ to the measure $D_p(P)\,dP$ on $P$.

With the change of variables $Q = Q_q(p)$, $p = Q_q^{-1}(Q)$, $P = P_q(Q)$, $dp = D_q(Q)\,dQ$, we get $\mathcal{T}h(q) = \int_P h(Q)\,g(P)\,dp = \int_Q h(Q) \cdot g(P_q(Q))\,D_q(Q)\,dQ$. This shows that the operator $\mathcal{T}$ is a kernel operator, namely $\mathcal{T}h(q) = \langle h, K_q \rangle$, where $K_q(Q) = K(q,Q)$ is defined by $K(q,Q) = f(Q) \cdot g(P_q(Q)) \cdot D_q(Q)$, as then

(6.4) $\mathcal{T}h(q) = \int_P h(Q(q,p))\,g(P(q,p))\,dp = \int_Q h(Q)\,\frac{g(P_q(Q))\,f(Q)}{f(Q)}\,D_q(Q)\,dQ = \int_Q \frac{h(Q) \cdot K(q,Q)}{f(Q)}\,dQ = \langle h, K_q \rangle.$

When a kernel operator has a finite $L^2$ norm it is compact, and then its spectrum is discrete except for the unique possible accumulation point at 0. We know from Remark 3.3 that the spectrum of $\mathcal{T}$ has all eigenvalues inside the unit disk except for the eigenvalue 1, which has multiplicity 1. The compactness of $\mathcal{T}$ would provide the spectral gap and the exponential convergence in norm of each sequence $\mathcal{T}^n h$ to its fixed point limit $\alpha f$, with the rate of convergence given by the second largest eigenvalue, in this case strictly smaller than 1.

The norm of $A(q,Q) \in L^2(Q \times Q)$ is given by
$$\|A\|^2 = \iint_{Q \times Q} \frac{A^2(q,Q)}{f(q) \cdot f(Q)}\,dQ\,dq.$$
For a.e. $q$ the map $Q \mapsto A_q(Q) = A(q,Q)$ should belong to $L^2(Q)$ (with respect to $Q$, thus $f(Q)$ in the denominator), and then the map defined by the norms, $q \mapsto \|A_q\|$, should belong to $L^2(Q)$ (with respect to $q$, thus $f(q)$ in the denominator); in both cases we use the norm defined in the first line of Section 3. One might be more comfortable working with an analogous kernel expression in the space $\tilde L^2$.

Lemma 6.2 (The norm of the kernel $K$). The $L^2$ norm of the kernel $K$ (6.4) can be expressed as:
$$\|K\|^2 = \iint_{P \times P} g(p)\,g(P)\,D_q(Q)\,D_p(P)\,dP\,dp,$$
where $Q$ and $q$ are well defined functions of $p$ and $P$: $q = P_p^{-1}(P)$ and $Q = Q_q(p)$.

Proof. Using invariance (with $p = Q_q^{-1}(Q)$, $P = P_q(Q)$) we have:
$$\frac{K^2(q,Q)}{f(Q)\,f(q)} = \frac{f^2(Q)\,g^2(P)\,D_q^2(Q)}{f(Q)\,f(q)} = \frac{f(Q)\,g(P)}{f(q)\,g(p)}\;g(p)\,g(P)\,D_q^2(Q) = g(p)\,g(P)\,D_q^2(Q).$$
In the integral expression for the norm we change both variables $q$ and $Q$ to the variables $p$ and $P$, using the Jacobians $D_q(Q)\,dQ = dp$ and $dq = D_p(P)\,dP$:

(6.5) $\|K\|^2 = \iint_{Q \times Q} \frac{K^2(q,Q)}{f(Q)\,f(q)}\,dQ\,dq = \iint_{Q \times Q} g(p)\,g(P)\,D_q^2(Q)\,dQ\,dq$

(6.6) $\qquad = \iint_{Q \times P} g(p)\,g(P)\,D_q(Q)\,dp\,dq = \iint_{P \times P} g(p)\,g(P)\,D_q(Q)\,D_p(P)\,dP\,dp. \qquad \square$

Corollary 6.3.
If the product of determinants $D_q(Q) \cdot D_p(P)$ is uniformly bounded from above, then the $L^2$ norm of the kernel $K$ is finite and the operator $\mathcal{T}$ is compact.

Hamiltonian movement.
In the following we shall prove that $\|K(q,Q)\|$ is finite in the special case of a Hamiltonian movement with $Q = P = \mathbb{R}^d$. The distribution of choice $g(p)$ is usually the standard Gaussian (mean 0 and covariance equal to the identity matrix, that is, $-\log g(p) = \langle p, p \rangle / 2$ up to an additive constant). If $U(Q) = -\log(f(Q))$, then $U''(Q) = \partial^2(-\log f(Q))/\partial Q^2$ is a symmetric matrix by the assumed continuous differentiability.

We assume that the target distribution $f$ is uniformly strictly logarithmically concave, that is, $U''$ is a (strictly) positive operator, i.e. it is bounded away (in either the norm or the spectrum sense) from 0 and $\infty$ uniformly in $q \in Q$. Then $U''(Q)$ can be bounded away from 0 and $\infty$ by two symmetric, strictly positive bounded operators constant with respect to $Q$, and the distribution $f$ can then be estimated both from above and below by two Gaussians (with some positive finite multiplicative constants). Similar statements hold for $g$ and $V(P) = -\log(g(P))$.

Consider the spaces $L^2_Q$, $L^2_P$ and $L^2_{Q \times P}$ of functions from $Q$ (positions), $P$ (momenta) and $Q \times P$ (configurations) to $\mathbb{R}$, with the appropriate (integral) norms. Given $0 \le f \in L^2_Q$ and $0 \le g \in L^2_P$ we can define the potential energy $U: Q \to \mathbb{R}$, $U(q) = -\log(f(q))$, and the kinetic energy $V: P \to \mathbb{R}$, $V(p) = -\log(g(p))$. We see that if $f(q)$ and $g(p)$ represent the densities of probability distributions of two independent variables, then $f(q) \cdot g(p) = \exp(-(U(q) + V(p)))$ represents the density of their (independent) joint distribution. We shall use the name Hamiltonian for the total energy $H = U + V$.

Remark 6.4.
In fact one may consider a more general case, when $g = g(q,p)$ (understood as the conditional $g(p|q)$) and thus $V = V(p|q)$. However, to simplify the calculations we shall deal only with the case of $g$ independent of $q$.

The Hamiltonian energy provides the following (Hamiltonian) dynamics $(q,p) \mapsto (Q,P)$, where $(Q,P) = (Q_t(q,p), P_t(q,p))$ is the position after time $t$ of the point starting at $(q,p)$, ruled by the system of equations:

(6.7) $\dot Q = \frac{dQ}{dt} = \frac{\partial H(Q,P)}{\partial P}, \qquad \dot P = \frac{dP}{dt} = -\frac{\partial H(Q,P)}{\partial Q}.$
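The system (6.7) is easy to integrate numerically. The sketch below is our own check, for a hypothetical quartic potential $U(Q) = Q^4/4 + Q^2/2$ with Gaussian $V(P) = P^2/2$, of the two conservation properties behind (2.1) and (2.2): the energy $H$ is constant along trajectories, and the flow preserves phase-space volume (the Jacobian determinant of $(q,p) \mapsto (Q,P)$ equals 1).

```python
import numpy as np

def hamiltonian_flow(q, p, dU, dV, t=2.0, n=2000):
    """Integrate dQ/dt = V'(P), dP/dt = -U'(Q) from (q,p) for time t
    with the classical 4th-order Runge-Kutta scheme."""
    dt = t / n
    z = np.array([q, p], dtype=float)
    F = lambda z: np.array([dV(z[1]), -dU(z[0])])  # the Hamiltonian vector field
    for _ in range(n):
        k1 = F(z)
        k2 = F(z + dt / 2 * k1)
        k3 = F(z + dt / 2 * k2)
        k4 = F(z + dt * k3)
        z = z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return z[0], z[1]

# Hypothetical quartic target: U(Q) = Q^4/4 + Q^2/2; standard Gaussian V.
dU = lambda Q: Q**3 + Q
dV = lambda P: P
H = lambda Q, P: Q**4 / 4 + Q**2 / 2 + P**2 / 2    # total energy -log(f g)

Q, P = hamiltonian_flow(1.0, 0.5, dU, dV)

# Liouville: finite-difference Jacobian of (q,p) -> (Q,P); det should be 1.
eps = 1e-5
Qq, Pq = hamiltonian_flow(1.0 + eps, 0.5, dU, dV)
Qp, Pp = hamiltonian_flow(1.0, 0.5 + eps, dU, dV)
jac = np.array([[(Qq - Q) / eps, (Qp - Q) / eps],
                [(Pq - P) / eps, (Pp - P) / eps]])
```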
The dot derivative is the derivative with respect to time, $\dot A = \partial A/\partial t$. We see that the normalizing constants of $g$ and $f$ are irrelevant to the motion. Formally the solutions can be written as

(6.8) $Q(t) = q + \int_0^t \frac{\partial H}{\partial P}(Q(s), P(s))\,ds = q + \int_0^t V'(P(s))\,ds, \qquad P(t) = p - \int_0^t \frac{\partial H}{\partial Q}(Q(s), P(s))\,ds = p - \int_0^t U'(Q(s))\,ds,$

assuming all functions are sufficiently regular.

The movement $\mathcal{H}$ is defined by $(q,p) \mapsto \mathcal{H}_t(q,p) = (Q_t, P_t)$, and the map $\mathcal{T}_t$ is defined by (1.1) using $\mathcal{H}_t$. For a given fixed time $t$ we shall skip the subscript $t$. For a function $W \in L^2_{Q \times P}$ we denote $W_t(q,p) = (W \circ \mathcal{H}_t)(q,p) = W(Q,P)$.

The value of the Hamiltonian $H$ does not change along the trajectories, and the Hamiltonian motion conserves the Lebesgue measure:

(6.9) $H \circ \mathcal{H} = H$, or after taking the exponent, $(f \cdot g) \circ \mathcal{H} = f \cdot g$;

(6.10) $\iint_{Q \times P} W_t\,d(q\,p) = \iint_{Q \times P} W\,d(q\,p)$ for any $W \in L^2_{Q \times P}$,

which corresponds to the properties (2.1) and (2.2). We define $\mathcal{T}$ as in (1.1); to stress the dependence of $\mathcal{T}$ on the choice of $t$ we use the notation $\mathcal{T}_t$. Given $t$ and $n$, the iterate of the map $\mathcal{T}_t^n$ is in general different from the map $\mathcal{T}_{tn}$ with time $nt$.

The subset of (target) functions $f$ of interest usually has some additional properties: $U, V \to +\infty$ as $|q|, |p| \to \infty$, fast enough so that the functions $f$, $g$ are bounded, integrable and vanish at infinity (or at the boundaries of the support), meaning that $H$ escapes to infinity when $(q,p)$ approaches these boundaries. This assures that the level sets of the Hamiltonian, and therefore the trajectories, are bounded (and closed). Often, additionally, the derivatives of $U$ and $V$ (or $f$ and $g$) are zero at a unique point (no stationary points except this one).

Differential equation solutions for the gradient of the Hamiltonian flow.
Assume that all the functions involved have sufficient smoothness, so that the derivatives exist and their order can be exchanged. In order to estimate the norm of the kernel we need good control of the Jacobians $D_q(Q)$ and $D_p(P)$, in other words of the partial derivatives of the time evolution of the configuration with respect to the initial configuration, $\partial(Q,P)/\partial(q,p)$. By the assumption that $H(Q,P) = U(Q) + V(P)$ (i.e. the spreading $g$ does not depend on the position $q$), its mixed second derivatives vanish: $\partial^2 H/\partial Q \partial P = 0$.

Lemma 6.5 (Evolution of the dependence on the initial configuration). Under the assumption that $H(Q,P) = U(Q) + V(P)$, the derivatives of the motion $(Q,P)$ with respect to the starting configuration $(q,p)$ satisfy the following time evolution equation:

(6.11) $\frac{\partial}{\partial t} \begin{pmatrix} \frac{\partial Q}{\partial q} & \frac{\partial Q}{\partial p} \\ \frac{\partial P}{\partial q} & \frac{\partial P}{\partial p} \end{pmatrix} = \begin{pmatrix} 0 & V'' \\ -U'' & 0 \end{pmatrix} \cdot \begin{pmatrix} \frac{\partial Q}{\partial q} & \frac{\partial Q}{\partial p} \\ \frac{\partial P}{\partial q} & \frac{\partial P}{\partial p} \end{pmatrix}; \qquad \begin{pmatrix} \frac{\partial Q}{\partial q} & \frac{\partial Q}{\partial p} \\ \frac{\partial P}{\partial q} & \frac{\partial P}{\partial p} \end{pmatrix}_{t=0} = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix},$

where $U'' = U''(Q) = \partial^2 H/\partial Q^2$ and $V'' = V''(P) = \partial^2 H/\partial P^2$.

Proof. The initial condition $\partial(Q,P)/\partial(q,p)|_{t=0} = I$ can be calculated from (6.8); assuming sufficient continuity one can change the order of differentiation and integration. Then, as $t \to 0$:
$$\frac{\partial}{\partial q} \frac{\partial H}{\partial P} = \frac{\partial^2 H}{\partial Q \partial P} \cdot \frac{\partial Q}{\partial q} + \frac{\partial^2 H}{\partial P^2} \cdot \frac{\partial P}{\partial q} = 0 \cdot \frac{\partial Q}{\partial q} + V'' \cdot \frac{\partial P}{\partial q} = V'' \frac{\partial P}{\partial q}.$$
Similarly the $\partial^2 H/\partial q \partial Q$ term gives $U'' \cdot \partial Q/\partial q$. It is clear that the calculation holds after exchanging every $q$ by $p$. Now we differentiate the Hamiltonian equations (6.7) with respect to the initial configuration $(q,p)$ and change the order of differentiation, for example
$$\frac{\partial}{\partial t}\left(\frac{\partial Q}{\partial q}\right) = \frac{\partial}{\partial q}\left(\frac{\partial Q}{\partial t}\right) = \frac{\partial}{\partial q} \frac{\partial H}{\partial P} = V'' \frac{\partial P}{\partial q},$$
and again similarly $\partial^2 P/\partial t \partial q = -U'' \cdot \partial Q/\partial q$. The calculation holds when exchanging $q$ for $p$. □

Before we proceed with the proof we recall that for $U, V$ symmetric, positive definite operators on $L^2$, their symmetric positive definite square roots are uniquely defined. For example, for $V < I$ (that is, $I - V$ positive definite, which can be achieved by a normalization trick) $\sqrt{V} = I - R$, where $R$ is the limit of the (strongly converging) sequence $R_{n+1} = (I - (V - R_n^2))/2$, $R_0 = 0$. Also $VU$ and $UV$ are positive definite (but not necessarily symmetric, when non-commuting), as for example $VU = \sqrt{U}^{-1}(\sqrt{U}\,V\sqrt{U})\sqrt{U}$ is similar, via the symmetric operator $\sqrt{U}$, to the symmetric positive definite $\sqrt{U}\,V\sqrt{U}$. Using this we can define $\sqrt{VU} = \sqrt{U}^{-1}\sqrt{\sqrt{U}\,V\sqrt{U}}\,\sqrt{U}$, and similarly $\sqrt{UV}$.

Below, the functions are defined by their power series: $\exp(x) = \sum_{n=0}^\infty x^n/n!$, $\sin(x) = \sum_{n=0}^\infty (-1)^n x^{2n+1}/(2n+1)!$, $\mathrm{sinc}(x) = x^{-1}\sin(x) = \sum_{n=0}^\infty (-1)^n x^{2n}/(2n+1)!$, which is well defined even when $x^{-1}$ is not, and $\cos(x) = \sum_{n=0}^\infty (-1)^n x^{2n}/(2n)!$.

Lemma 6.6 (Exponential function of a matrix). Let $V, U$ be symmetric, positive definite linear operators on $L^2$. If for $t \in \mathbb{R}$
$$C = \begin{pmatrix} 0 & tV \\ -tU & 0 \end{pmatrix},$$
then for $A = \sqrt{VU}$ and $B = \sqrt{UV}$ we have
$$\exp(C) = \sum_{n=0}^\infty \frac{C^n}{n!} = \begin{pmatrix} \cos(tA) & tV\,\mathrm{sinc}(tB) \\ -tU\,\mathrm{sinc}(tA) & \cos(tB) \end{pmatrix}.$$

Proof.
From a direct calculation we have the following powers of $C$:
$$C^{2n} = (-1)^n \begin{pmatrix} (VU)^n t^{2n} & 0 \\ 0 & (UV)^n t^{2n} \end{pmatrix} = (-1)^n \begin{pmatrix} (At)^{2n} & 0 \\ 0 & (Bt)^{2n} \end{pmatrix};$$
$$C^{2n+1} = (-1)^n \begin{pmatrix} 0 & V(UV)^n t^{2n+1} \\ -U(VU)^n t^{2n+1} & 0 \end{pmatrix} = (-1)^n \begin{pmatrix} 0 & tV\,(Bt)^{-1}(Bt)^{2n+1} \\ -tU\,(At)^{-1}(At)^{2n+1} & 0 \end{pmatrix},$$
so that:
$$\exp(C) = \sum_{n=0}^\infty (-1)^n \begin{pmatrix} \frac{(tA)^{2n}}{(2n)!} & tV\,\frac{(tB)^{-1}(tB)^{2n+1}}{(2n+1)!} \\ -tU\,\frac{(tA)^{-1}(tA)^{2n+1}}{(2n+1)!} & \frac{(tB)^{2n}}{(2n)!} \end{pmatrix} = \begin{pmatrix} \cos(tA) & tV\,\mathrm{sinc}(tB) \\ -tU\,\mathrm{sinc}(tA) & \cos(tB) \end{pmatrix}.$$
We need to use the additional $U$ and $V$ to compensate for the odd powers on the off-diagonal. □
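Lemma 6.6 can be sanity-checked numerically in the scalar case $U = u$, $V = v$, where $A = B = \sqrt{uv}$. The sketch below is ours (the values of $u$, $v$, $t$ are arbitrary); it compares a truncated power series for $\exp(C)$ with the closed block form:

```python
import numpy as np

def expm_series(C, terms=60):
    """exp(C) by the truncated power series: sum over n of C^n / n!."""
    out = np.eye(C.shape[0])
    term = np.eye(C.shape[0])
    for n in range(1, terms):
        term = term @ C / n          # C^n / n! built incrementally
        out = out + term
    return out

def closed_form(u, v, t):
    """The block formula of Lemma 6.6, specialized to scalars U = u, V = v,
    for which A = B = sqrt(u v)."""
    a = np.sqrt(u * v)
    sinc = np.sin(t * a) / (t * a)
    return np.array([[np.cos(t * a), t * v * sinc],
                     [-t * u * sinc, np.cos(t * a)]])

u, v, t = 2.0, 0.5, 0.8              # arbitrary positive scalars
C = np.array([[0.0, t * v], [-t * u, 0.0]])
```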
In the following Proposition let $H(Q,P) = U(Q) + V(P)$, and

(6.12) $\mathcal{U}(t) = \frac{1}{t}\int_0^t \frac{\partial^2 H(Q(s),P(s))}{\partial Q^2}\,ds = \frac{1}{t}\int_0^t U''(Q(s))\,ds,$

(6.13) $\mathcal{V}(t) = \frac{1}{t}\int_0^t \frac{\partial^2 H(Q(s),P(s))}{\partial P^2}\,ds = \frac{1}{t}\int_0^t V''(P(s))\,ds.$

Proposition 6.7 (Solution of the evolution of the dependence on initial conditions). Assume that the target $f$ and auxiliary $g$ distributions are both strictly log-concave. Then $U''(s)$ and $V''(s)$ are symmetric positive definite, and so are $\mathcal{U}(t)$ and $\mathcal{V}(t)$. Let $A = A(t) = \sqrt{\mathcal{V}(t)\mathcal{U}(t)}$ and $B = B(t) = \sqrt{\mathcal{U}(t)\mathcal{V}(t)}$. Then the solution of the evolution equation (6.11) in Lemma 6.5 is given by:
$$\begin{pmatrix} \frac{\partial Q}{\partial q} & \frac{\partial Q}{\partial p} \\ \frac{\partial P}{\partial q} & \frac{\partial P}{\partial p} \end{pmatrix}(t) = \begin{pmatrix} \cos(tA) & t\mathcal{V}(t)\,\mathrm{sinc}(tB) \\ -t\mathcal{U}(t)\,\mathrm{sinc}(tA) & \cos(tB) \end{pmatrix}.$$

Proof.
The solution to a linear differential equation $\dot X = A(t)X$ is equal to $X(t) = \exp\left(\int_{s=0}^t A(s)\,ds\right) \cdot X(0)$, and we use Lemma 6.6. □
Proof.
It is enough to prove that the Jacobians $D_q(Q)$ and $D_p(P)$ are uniformly bounded away from 0 and $\infty$; then we can use Corollary 6.3, which says that the kernel $K$ defined in (6.4) is bounded in $L^2$, which makes the operator $\mathcal{T}$ compact, which yields the spectral gap. Then, as the (maximal) eigenvalue 1 has multiplicity 1, the spectrum $\sigma$ of the operator $\mathcal{T}|_{\mathcal{N}}$ on the closed hyperplane $\mathcal{N} = \{h : \int h = 0\}$, orthogonal to the eigendirection of fixed points $\{f \cdot \mathbb{R}\}$, is contained inside the open unit disk, $\sigma(\mathcal{T}|_{\mathcal{N}}) \subset \{\mu \in \mathbb{C} : |\mu| < 1\}$ (in fact in the interval $[0,1)$, as $\mathcal{T}$ is positive and in our case self-adjoint). The spectrum has only 0 as a possible accumulation point. We get $\sup|\sigma(\mathcal{T}|_{\mathcal{N}})| < 1$; this radius is in fact the second largest eigenvalue, which secures the geometric convergence to 0 on $\mathcal{N}$, and to $f \cdot (\int h/\int f)$ in $L^2$.

We deal with finite dimensional $Q$ and $P$, both modeled on $\mathbb{R}^d$. The operators $\partial Q/\partial p$ and $\partial P/\partial q$ are both symmetric $d \times d$ matrices with determinant equal to the product of their real eigenvalues. For a (positive symmetric) matrix $M$ and a (positive) function $\phi$ defined by a power series, the (real positive) eigenvalues of the (positive symmetric) matrix $\phi(M)$ are the images under $\phi$ of the (real positive) eigenvalues of $M$.

By Proposition 6.7, $\partial Q/\partial p = t\mathcal{V}\,\mathrm{sinc}(t\sqrt{\mathcal{U}\mathcal{V}})$, where $\mathcal{U} = \mathcal{U}(t)$ and $\mathcal{V} = \mathcal{V}(t)$ were defined in (6.12) and (6.13). By the assumption of uniform strict concavity the spectra $\sigma(U'')$ and $\sigma(V'')$ are uniformly bounded away from 0 and infinity by $0 < \lambda = \inf(\sigma(U''), \sigma(V'')) \le \sup(\sigma(U''), \sigma(V'')) = \Lambda < \infty$; so are the spectra of the running averages $\mathcal{U}$ and $\mathcal{V}$, and also the spectra of $A = \sqrt{\mathcal{V}\mathcal{U}}$ and $B = \sqrt{\mathcal{U}\mathcal{V}}$ as in Proposition 6.7. Consequently, for small $t$, such that $0 < t\Lambda < \pi/2$, the spectrum $\sigma(\partial Q/\partial p)$ is bounded from above by $t\Lambda$ and from below by $t\lambda\,\mathrm{sinc}(t\Lambda)$; similarly $\sigma(\partial P/\partial q)$ is bounded from below by $-t\Lambda$ and from above by $-t\lambda\,\mathrm{sinc}(t\Lambda)$. All these bounds are uniformly away from 0 and $\infty$. Finally, the Jacobians entering the product $D_q(Q) \cdot D_p(P)$, defined by (6.3), have uniform bounds away from 0 and $\infty$: by $(t\lambda\,\mathrm{sinc}(t\Lambda))^{-d}$ from above and by $(t\Lambda)^{-d}$ from below. □

References

[1] S. Duane, A. D. Kennedy, B. J. Pendleton, D. Roweth, Hybrid Monte Carlo, Physics Letters B 195 (2) (1987), 216–222.

[2] S. Livingstone, M. Betancourt, S. Byrne, M. Girolami, On the geometric ergodicity of Hamiltonian Monte Carlo, Bernoulli 25 (2019), no. 4A, 3109–3138.

[3] S. Ghosh, Y. Lu, T. Nowicki, HMC, an example of Functional Analysis applied to Algorithms in Data Mining: the convergence in L^p, https://arxiv.org/abs/2101.08688

[4] S. Ghosh, Y. Lu, T. Nowicki, HMC, an Algorithm in Data Mining: the Probabilistic approach, in preparation.

[5] S. Ghosh, Y. Lu, T. Nowicki, On Convergence of Hamiltonian Monte Carlo with Asymmetrical Momentum Distributions, preprint.
IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598, US