HMC, an example of Functional Analysis applied to Algorithms in Data Mining. The convergence in L^p
arXiv [math.CA]
Soumyadip Ghosh, Yingdong Lu, Tomasz Nowicki
Abstract.
We present a proof of convergence of the Hamiltonian Monte Carlo algorithm in terms of Functional Analysis. We represent the algorithm as an operator on the density functions, and prove the convergence of iterations of this operator in L^p, for 1 < p < ∞, and strong convergence for 2 ≤ p < ∞.

1. Introduction
Functional Analysis for all Functioning Algorithms.
We dare not make an acronym of that. Recent development and usage of Machine Learning (ML), Data Mining (DM) and Artificial Intelligence (AI) resulted in a vast variety of new or refurbished algorithms to deal with large data sets, be it collected or streamed. Such new methods are usually tested on some data sets; however, not very often are they thoroughly vetted by theoretical means. Algorithms tend to rely on discrete models, but the nature of data and approximation approaches suggests rather a continuous point of view. In our opinion one should go even further: the right objects of investigation of algorithms are not the continuous parameters but rather very general features of data, such as their probability distributions. We perceive the algorithms as iterative transformations of points in some underlying domains. The leading idea is to move from relatively simple objects, such as finite sets or points in finite dimensional Euclidean spaces, with quite complicated transformations, to simple transformations in richer spaces, such as functional spaces of probability distributions.
Hamiltonian Monte Carlo.
Or Hybrid Monte Carlo (HMC) is a method to obtain random samples from a (target) probability distribution f/∫_Q f on the space Q whose density is known only up to a factor; that is to say, f is known, but ∫_Q f is not, or at least is very difficult to calculate. It is an algorithm, known for a while [1], of the Metropolis–Hastings type, used to estimate integrals. There are known proofs of convergence [2]. Our goal is to provide a clear and understandable reason why the HMC algorithm converges to the right limit for densities in the spaces L^q(Q). We refer to our papers [3, 4, 5] for three approaches to the investigation of HMC: analytic (in L^2), probabilistic and algorithmic. Here we concentrate on an extension of the analytic one, the convergence in L^q.

HMC performs by iterating the following steps. Given an initial distribution h (sample points) in a given space Q, double (the dimension of) the space by considering Q × P, with P ∼ Q. Then spread each point q ∈ Q to a point (q, p) ∈ Q × P by sampling p ∈ P from a distribution of choice g, where g > 0 on P and ∫_P g = 1. Then move each point (q, p) to a new point (Q, P) = H(q, p), where the transformation H: Q × P → Q × P satisfies some special invariance properties. Finally, project (Q, P) onto Q, providing a new sample of points Q in Q with a new distribution ĥ which shall be used as the initial sample (or probability distribution) for the next step. With the right choice of H the iteration of this procedure will result in an approximate sample from the target distribution. (T. N. is the corresponding author.)

Moving to functional spaces.
Thus the success of the algorithm lies in the appropriate choice of the transformation H. The Hamiltonian part of the algorithm's name is due to the Hamiltonian motion H. It turns out that if the target distribution has a density proportional to a given function f and the distribution of choice is g, then H is a Hamiltonian motion generated by the Hamiltonian energy H(q, p) = −log(f(q) · g(p)). That is, if (Q, P) = H(q, p) is the solution of the time evolution Q̇ = ∂H/∂P, Ṗ = −∂H/∂Q after time t with initial point (q, p), then H has the needed invariance properties for HMC to converge to a distribution proportional to f. Effectively it means that one can obtain a normalizing constant ∫_Q f or any expected value of a function φ with respect to the distribution proportional to f: ∫(φ · (f/∫f)).

In terms of the densities of the involved distributions one can present HMC as follows. Given some initial (probability) distribution h(q) on Q, one produces a joint distribution (h · g)(q, p) = h(q) · g(p) on Q × P, then moves the points (q, p) ↦ (Q, P) = H(q, p), producing another (probability) distribution (h · g) ∘ H(q, p) = h(Q) · g(P) in Q × P, and finally projects the last one onto Q by calculating the marginal ∫_P (h · g) ∘ H(q, p) dp, which is the result of the action of the algorithm in one step. In short,

(1.1)  T(h)(q) = ∫_P (h · g) ∘ H(q, p) dp.

From a rather complicated algorithm we receive a relatively simple, linear operator on some space of integrable functions. The convergence of the algorithm corresponds to the convergence of sequences of iterates of T.

A few remarks.
• The distribution g may depend on the point q, g(p | q), as long as for (almost) all q it satisfies the required conditions.
• It is clear that the Hamiltonian motion H does not depend on the constant factor in front of f.
• The motion H in practical implementations is performed by the leap-frog algorithm, which displays the needed invariance properties. We shall not deal with it in this paper.
• An example of the situation where the target distribution is known only up to the normalizing constant is the Bayesian update. In order to establish the distribution of (random) parameters θ influencing the outcome D of the observations, when we know all the probabilities P(D | θ), one uses the knowledge of the outcome D to improve the estimate: given the estimate π(θ) before the experiment, we calculate π̂(θ) = π(θ | D) = π(θ, D)/π(D) = P(D | θ) · π(θ) / Σ_{θ′} P(D | θ′) · π(θ′) and take π̂ as a new estimate. However, the sum (integral) in the denominator may be not that easy to calculate. This yields π̂(θ) ∼ P(D | θ) · π(θ) without the normalizing factor.

2. Results
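Before turning to the results, the iteration described in the introduction can be sketched in code. The sketch below is ours, not the paper's: it targets a one-dimensional standard Gaussian f (pretending its normalizing constant is unknown), takes the distribution of choice g to be a standard Gaussian, and replaces the exact motion H by the leap-frog integrator mentioned in the remarks above, with a Metropolis correction for the discretization error. The names grad_U, leapfrog and hmc_step are our own.

```python
import math
import random

def grad_U(q):
    # U(q) = -log f(q) = q^2 / 2 for the standard Gaussian target
    return q

def leapfrog(q, p, step=0.1, n_steps=10):
    # approximate the Hamiltonian motion generated by
    # H(q, p) = -log(f(q) * g(p)) = U(q) + p^2 / 2
    p = p - 0.5 * step * grad_U(q)
    for _ in range(n_steps - 1):
        q = q + step * p
        p = p - step * grad_U(q)
    q = q + step * p
    p = p - 0.5 * step * grad_U(q)
    return q, p

def hmc_step(q):
    p = random.gauss(0.0, 1.0)          # spread: sample p from g
    Q, P = leapfrog(q, p)               # move: (Q, P) approximates H(q, p)
    # Metropolis correction for the leap-frog discretization error
    dH = 0.5 * (Q * Q + P * P) - 0.5 * (q * q + p * p)
    return Q if random.random() < math.exp(-dH) else q   # project onto Q

random.seed(0)
q, samples = 3.0, []
for i in range(20000):
    q = hmc_step(q)
    if i >= 1000:                       # discard a short burn-in
        samples.append(q)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)                        # both close to the N(0, 1) values 0 and 1
```

With the exact Hamiltonian motion the energy is conserved and the correction step always accepts; leap-frog conserves the energy only approximately, hence the accept/reject step.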
Assume that the motion H: Q × P → Q × P (measurable spaces with measures dq and dp) satisfies the following invariance and coverage properties:
• (f · g) ∘ H = f · g;
• ∫∫_{Q×P} A ∘ H = ∫∫_{Q×P} A for any integrable A;
• Q(q, P) = Q for (almost) every q;
• Q is the support of f.
For T h = ∫_P (h · g) ∘ H let T^{n+1} = T^n ∘ T. The adjoint operator T† is given by the same formula (1.1) with H^{−1} in place of H and is described below in Section 4; a self-adjoint operator satisfies T† = T.
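For a concrete instance of these assumptions (our example, not taken from the text): with f and g standard Gaussian densities on Q = P = R, the Hamiltonian motion generated by H(q, p) = (q² + p²)/2 after time t is the rotation of the (q, p) plane by angle t, and the first two properties can be verified directly.

```python
import math
import random

def flow(q, p, t=1.0):
    # exact Hamiltonian motion H for H(q, p) = (q^2 + p^2)/2:
    # a rotation of the (q, p) plane by angle t
    return (q * math.cos(t) + p * math.sin(t),
            -q * math.sin(t) + p * math.cos(t))

def fg(q, p):
    # the density f(q) * g(p), up to a normalizing constant
    return math.exp(-0.5 * (q * q + p * p))

random.seed(1)
for _ in range(100):
    q, p = random.gauss(0, 1), random.gauss(0, 1)
    Q, P = flow(q, p)
    # (f * g) o H = f * g: the motion preserves the level sets of the energy
    assert abs(fg(Q, P) - fg(q, p)) < 1e-12

# The Lebesgue measure is preserved as well: the Jacobian of a rotation
# has determinant cos^2 t + sin^2 t = 1.
t = 1.0
det = math.cos(t) ** 2 + math.sin(t) ** 2
assert abs(det - 1.0) < 1e-15
print("invariance checks passed")
```

The coverage property also holds here: for fixed q, the first coordinate q cos t + p sin t sweeps all of R as p runs over R, provided sin t ≠ 0.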
Theorem 2.1.
Under the above invariance and coverage conditions, and when the operator T is self-adjoint, for every h ∈ L^q(Q), 1 < q < ∞, the sequence T^n h, h ≥ 0, converges weakly to f · ∫h/∫f, and for 2 ≤ q < ∞ it converges also strongly.

Here L^q denotes the space of integrable functions h: Q → R such that ||h||_q^q = ∫_Q |h|^q / f^{q−1} < ∞ and the support of h is included in the support of f, which we may assume is equal to Q.

Remark.
• The Hamiltonian motion satisfies the first two (invariance) assumptions, as both the Hamiltonian (the energy function, here equal to −log(f(q) · g(p))) and the Lebesgue measure are invariant under such a motion.
• The coverage property Q(q, P) = Q can be weakened to a statement of eventual coverage, not necessarily in one step. Some type of irreducibility must be assumed to avoid completely disjoint domains of the motion, and hence an obvious non-existence of a (unique) limit.
• The support condition takes care of some initialization problems with the division by 0. This can be formally avoided by working in the space of likelihoods, see below.

The proof is not very hard. We observe that T is in fact (3.5) an averaging map; thus by the convexity of x ↦ x^q, q ≥ 1, the norm of h decreases (3.7) under T, sharply unless (by the coverage assumption) h = α f. The spaces L^q for 1 < q < ∞ are reflexive, hence bounded sequences have weak accumulation points. Using self-adjointness and the convexity of x ↦ x^{q−1}, for q ≥ 2 we show that every weak accumulation point equals α f, proving (Corollary 5.2) weak convergence for q ≥ 2. Meanwhile the proof of the convergence of the norms provides (Corollary 5.3) strong convergence. Weak convergence for 1 < q < 2 follows by duality (Proposition 5.4).

3. The operator T in L^q

The spaces L^q. We shall be working in the reflexive spaces L^q(Q) dual to L^p(Q), where q, p > 1 and q + p = q · p. We remark that 1/p = 1 − 1/q = (q − 1)/q and q = (q − 1) · p. In such spaces, for h ∈ L^q we have the standard

norm ||h||_q^q = ∫_Q (h/f)^q · f,

bilinear form ⟨·, ·⟩: L^q × L^p → R, ⟨a, b⟩ = ∫_Q a · b / f,

and conjugacy *: L^q → L^p, h* = h · (|h|/f)^{q−2}.

We shall assume h ≥ 0. We call h/f a likelihood (up to an irrelevant normalizing constant ∫f) of h with respect to f. The space L̃^q = {h̃ : ∫_Q |h̃|^q f < ∞} of likelihoods h̃ = h/f is isometric to L^q.

Lemma 3.1.

(3.1) a ∈ L^q, b ∈ L^p ⇒ ⟨a, b⟩ ≤ ||a||_q · ||b||_p;
(3.2) h ∈ L^q ⇒ h* ∈ L^p, ||h||_q^q = ⟨h, h*⟩ = ||h*||_p^p, (h*)* = h;
(3.3) f ∈ L^q, ||f||_q^q = ∫_Q f, f* = f;
(3.4) ⟨h, f⟩ = ∫_Q h for h ∈ L^q; ⟨f, h*⟩ = ∫_Q h* for h* ∈ L^p.
Proof. (3.1) is the Hölder inequality for the likelihoods a/f and b/f in the spaces L̃^q and L̃^p. (3.2) follows for h ≥ 0 by a direct calculation, using q(p − 1) = p. (3.3) and (3.4) follow directly from the definitions. □

Operator T.

Lemma 3.2 (Properties of T). For 0 ≤ h ∈ L^q:

(3.5) T h = f · ∫_P (h/f) ∘ H · g;
(3.6) ∫_Q T h = ∫_Q h;
(3.7) ||T h||_q ≤ ||h||_q.

The equality in (3.7) occurs iff h = α · f (f-a.e.), where α = α(h) = ∫h / ∫f.

Proof. (3.5): Using the invariance properties we have ∫_P (h/f) ∘ H · (f · g) ∘ H = ∫_P (h/f) ∘ H · (f · g), and f does not depend on p ∈ P.

(3.6): ∫_Q ∫_P (h · g) ∘ H = ∫∫_{Q×P} (h · g) = (∫_Q h) · (∫_P g).

(3.7): ||T h||_q^q = ∫_Q (∫_P (h/f) ∘ H · g)^q · f ≤ ∫_Q ∫_P ((h/f) ∘ H)^q · g · f = ∫∫_{Q×P} ((h/f)^q ∘ H) · ((g · f) ∘ H) = ∫∫_{Q×P} (h/f)^q · g · f, the last being equal to (∫_Q (h/f)^q f) · (∫_P g) = ||h||_q^q. For a given q the equality occurs only if (h/f)(H(q, p)) is constant for (g-)almost all p, but by the coverage assumption this means that h/f is constant on (f-)almost all of Q. The value of the constant follows from (3.6). □

The operator T is an averaging operator of the (transported) likelihood h/f with respect to the probability g. The bilinear functional is monotone: 0 ≤ a ≤ b, 0 ≤ c ≤ d implies ⟨a, c⟩ ≤ ⟨b, d⟩, and T is positive; in particular, if a ≤ b then T a ≤ T b. The function f provides the eigendirection of fixed points, and by (3.7) T has its spectrum in the unit disk, with 1 being the unique eigenvalue on the unit circle, of multiplicity 1. For any h ∈ L^q one has the unique decomposition h = α f + (h − α f), where α f lies in the direction of the fixed points and h − α f ∈ N = {a ∈ L^q : ∫ a = 0} lies in an invariant subspace. It is not a priori clear under what conditions 1 is isolated in the spectrum, in other words whether the contraction ||T h|| < ||h|| is uniform on N, which would imply T^n N → {0} (point-wise) with exponential speed.

4. The adjoint operator T†

As H is invertible, the inverse map H^{−1} is well defined and it enjoys the same invariance properties as H. It turns out that the operator T† defined by H^{−1},

T† h = ∫_P (h · g) ∘ H^{−1},

is adjoint to T with respect to the duality functional ⟨·, ·⟩, namely:

Lemma 4.1.
For h ∈ L^q and k ∈ L^p:

(4.1)  ⟨T h, k⟩ = ⟨h, T† k⟩.

Proof.
Using (3.5) and invariance,

⟨T h, k⟩ = ∫_Q (∫_P (h/f) ∘ H · g) · k = ∫∫_{Q×P} (h/f) · (g · k) ∘ H^{−1} = ∫∫_{Q×P} (h/f) · ((k/f) ∘ H^{−1}) · ((g · f) ∘ H^{−1}) = ∫∫_{Q×P} (h/f) · ((k/f) ∘ H^{−1}) · (g · f) = ∫_Q h · (∫_P (k/f) ∘ H^{−1} · g) = ⟨h, T† k⟩. □
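Lemma 4.1 can be illustrated numerically. The sketch below is a hypothetical discretization, not part of the paper: it takes f = g to be the standard normal density on R, H the rotation of the (q, p) plane by angle t (so H^{−1} is the rotation by −t), applies T through formula (3.5) by quadrature on a grid, and compares ⟨T h, k⟩ with ⟨h, T† k⟩ for sample functions h and k of our choosing.

```python
import math

L, N, t = 8.0, 400, 1.0
xs = [-L + 2 * L * i / N for i in range(N + 1)]
dx = 2 * L / N

def phi(x):
    # standard normal density, playing the role of both f and g
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def apply_T(h, angle):
    # (T h)(q) = f(q) * sum_p (h/f)(Q(q, p)) g(p) dp, formula (3.5), with
    # Q(q, p) = q cos(angle) + p sin(angle) the first coordinate of the motion
    out = []
    for q in xs:
        acc = 0.0
        for p in xs:
            Q = q * math.cos(angle) + p * math.sin(angle)
            acc += h(Q) / phi(Q) * phi(p) * dx
        out.append(phi(q) * acc)
    return out

def pair(a, b):
    # the duality functional <a, b> = sum a(q) b(q) / f(q) dq
    return sum(x * y / phi(q) * dx for x, y, q in zip(a, b, xs))

h = lambda x: phi(x) * (1.0 + 0.5 * math.sin(x))     # some 0 <= h in L^q
k = lambda x: phi(x) * (1.0 + 0.3 * math.cos(x))     # some k in L^p
lhs = pair(apply_T(h, t), [k(x) for x in xs])        # <T h, k>, H = rotation by t
rhs = pair([h(x) for x in xs], apply_T(k, -t))       # <h, T-dagger k>, via H^{-1}
print(abs(lhs - rhs))                                # agree up to quadrature error
```

Since g here is even, Lemma 4.3 below also applies: on the symmetric grid apply_T(k, −t) coincides with apply_T(k, t), i.e. this T is in fact self-adjoint.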
Lemma 4.2.
For 2 ≤ q < ∞ and h ∈ L^q,

(4.2)  (T^n h)* ≤ T^n (h*).

For 1 < q ≤ 2 the opposite inequality holds. The equality happens when q = 2 or when h is aligned with f.

Proof. It is enough to prove the case n = 1. The comparison takes place in L^p. The Lemma follows from the convexity of x ↦ x^{q−1}, the positivity of the linear operator T and its averaging property. Again the inequality is sharp unless h = α f. By induction and the monotonicity of T it follows that (T^n h)* ≤ T^n (h*). □

Case of the self-adjoint operator, when T = T†. Let σ be a measure preserving involution σ: P → P, σ ∘ σ = id. We can extend it to σ: Q × P → Q × P by σ(q, p) = (q, σ(p)). Assume that g is invariant with respect to σ: g ∘ σ = g.

Lemma 4.3. If σ ∘ H^{−1} ∘ σ = H and g is invariant with respect to σ, then T† = T.

As an example take Q = P = R and σ to be the symmetry (reflection) of the space P with respect to 0, σ(p) = −p. An even g(p) = g(−p) is invariant with respect to σ. The rotation H around any point on the Q axis and H^{−1}, the opposite rotation, satisfy the condition of the Lemma. The involution σ(p) = −p is applicable in the case of the most common choice of g: a centralized Gaussian distribution.

Proof.
Measure invariance means that ∫_P a ∘ σ = ∫_P a. Let (Q̄, P̄) = H^{−1}(q, p); then σ ∘ H^{−1}(q, p) = σ(Q̄, P̄) = (Q̄, σ(P̄)), and T h = ∫_P (h · g) ∘ H = ∫_P (h · g) ∘ σ ∘ H^{−1} ∘ σ = ∫_P (h · g) ∘ σ ∘ H^{−1}, which applied to (q, p) yields ∫_P (h · g) ∘ σ(Q̄, P̄) = ∫_P h(Q̄) · g(σ(P̄)) = ∫_P h(Q̄) · g(P̄) = (T† h)(q). □

In what follows we shall assume that

(4.3)  T = T†.

If not, we can use S = T† ∘ T, for which S† = S.

5. Limits of the sequences T^n of a self-adjoint operator

From ||T h||_q ≤ ||h||_q by induction we obtain ||T^n h|| ≤ ||h||, with strict inequality unless h = α f, when equality holds. For h ∈ L^q let

(5.1)  V_q(h) = inf_n ||T^n h||_q^q = lim_{n→∞} ||T^n h||_q^q.

By definition V_q(h) = V_q(T^n(h)). As we are interested in the limit of the sequence T^n h, for a given h we can assume that for an arbitrary ε > 0 we have ||h||_q^q < V_q(h) + ε, taking a high iterate T^M h instead of h if needed.

By a corollary to the Alaoglu Theorem, bounded sets in the reflexive L^q are weakly (the same as weakly*) compact. Let h_∞ denote any weak accumulation point (limit of a subsequence) of T^n h, say T^{m_n} h ⇀ h_∞.

Proposition 5.1.
Assume T = T†. Let h_∞ be a weak limit of a subsequence T^{m_n}(h), 0 ≤ h ∈ L^q, q ≥ 2. Then ||h_∞||_q^q = V_q(h).

Proof. As V = V_q(h) = V_q(T^M(h)), for any ε > 0 we can choose M among the indices m_n large enough so that, replacing h by T^M(h), we have V ≤ ||h||_q^q ≤ V + ε. We may assume that (m_n − M) contains infinitely many even numbers; otherwise we take M + 1 instead of M. Simplifying the notation, we have T^{2m} h ⇀ h_∞ on a subsequence of m's, and so ⟨T^{2m} h, b⟩ → ⟨h_∞, b⟩ for every b ∈ L^p. By (4.2), self-adjointness, and q/p = q − 1,

V ≤ ||T^m(h)||_q^q = ⟨T^m(h), (T^m h)*⟩ ≤ ⟨T^m(h), T^m(h*)⟩ = ⟨T^{2m}(h), h*⟩ → ⟨h_∞, h*⟩ ≤ ||h_∞||_q · ||h*||_p = ||h_∞||_q · (||h||_q^q)^{1/p} = ||h_∞||_q · (||h||_q)^{q−1} ≤ ||h_∞||_q · (V + ε)^{(q−1)/q},

thus by the arbitrary choice of ε > 0 we get ||h_∞||_q^q ≥ V. The opposite direction is standard; we use (3.1) and (3.2):

||h_∞||_q^q = ⟨h_∞, (h_∞)*⟩ ← ⟨T^{2m}(h), (h_∞)*⟩ ≤ ||T^{2m}(h)||_q · ||(h_∞)*||_p ≤ ||h||_q · ||h_∞||_q^{q/p} ≤ (V + ε)^{1/q} · ||h_∞||_q^{q−1}. □

Corollary 5.2.
For q ≥ 2 every weakly convergent subsequence of T^n(h), h ∈ L^q, has a limit h_∞ of norm ||h_∞||_q^q = V_q(h). In consequence, T^n(h) ⇀ α f.

Proof. If T^{m_n}(h) ⇀ h_∞ then T^{m_n + 1}(h) ⇀ T(h_∞) (use the operator T†). As they have the same norm, by Lemma 3.2, (3.7), they are equal: h_∞ = T h_∞ = α f. Therefore every weakly convergent subsequence converges to the same limit, and as every subsequence has a weakly convergent subsequence, the whole sequence converges. □

Corollary 5.3.
For q ≥ 2, for each h the sequence T^n(h) converges strongly to α f, where α = ∫_Q h / ∫_Q f.

Proof. Due to the strong convexity of the ball in L^q, a weakly convergent sequence whose norms converge to the norm of the limit converges strongly. □

Proposition 5.4.
For any 1 < q < ∞ and h ∈ L^q the sequence T^n(h) converges weakly to α(h) f, where α(h) = ∫_Q h / ∫_Q f. In particular ⟨T^n h, a⟩ → ∫h · ∫a / ∫f.

Proof. The case q ≥ 2 was covered in Corollary 5.2. Let 1 < q ≤ 2; then p ≥ 2, and for a ∈ L^p we have T^n a ⇀ α(a) f, where α(a) = ∫a / ∫f. Let h ∈ L^q with q ≤ 2 and a ∈ L^p. We have ⟨T^n h, a⟩ = ⟨h, T^n a⟩ → ⟨h, α(a) f⟩ = α(a) ⟨h, f⟩ = α(a) ∫h = (∫h · ∫a) / ∫f = ⟨α(h) f, a⟩, which means T^n h ⇀ α(h) f. □

References

[1] S. Duane, A. D. Kennedy, B. J. Pendleton, D. Roweth, Hybrid Monte Carlo, Physics Letters B 195 (2) (3 September 1987), 216–222.
[2] S. Livingstone, M. Betancourt, S. Byrne, M. Girolami, On the geometric ergodicity of Hamiltonian Monte Carlo, Bernoulli 25 (2019), no. 4A, 3109–3138.
[3] S. Ghosh, Y. Lu, T. Nowicki, HMC, an Algorithm in Data Mining, the Functional Analysis approach, in preparation.
[4] S. Ghosh, Y. Lu, T. Nowicki, HMC, an Algorithm in Data Mining, the Probabilistic approach, in preparation.
[5] S. Ghosh, Y. Lu, T. Nowicki, HMC, an Algorithm in Data Mining, the Algorithmic approach, in preparation.
IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598, US
Email address: