A Condition Number for Hamiltonian Monte Carlo

Ian Langmore*, Michael Dikovsky, Scott Geraedts, Peter Norgaard, and Rob von Behren

Google LLC, 1600 Amphitheatre Parkway, Mountain View, CA 94030. *e-mail: [email protected]

Abstract:
Hamiltonian Monte Carlo is a popular sampling technique for smooth target densities. The scale lengths of the target have long been known to influence integration error and sampling efficiency. However, quantitative measures intrinsic to the target have been lacking. In this paper, we restrict attention to the multivariate Gaussian and the leapfrog integrator, and obtain a condition number corresponding to sampling efficiency. This number, based on the spectral and Schatten norms, quantifies the number of leapfrog steps needed to efficiently sample. We demonstrate its utility by using this condition number to analyze HMC preconditioning techniques. We also find the condition number of large inverse Wishart matrices, from which we derive burn-in heuristics.
Keywords and phrases: Hamiltonian Monte Carlo, Markov Chain Monte Carlo, Condition Number, Preconditioning, Random Matrices.
1. Introduction
Hamiltonian Monte Carlo (HMC) is a technique for sampling random variables $X \in \mathbb{R}^N$ possessing smooth densities $p(x)$. A core step is the numerical integration of Hamilton's equations for time $T$, in $\ell$ discrete steps, each of size $h$. Tools are available to adjust $h$ and $\ell$ so as to maximize sampling efficiency for particular problems [20, 10, 2]. Lacking has been a measure of difficulty intrinsic to the density $p(x)$, rather than to sub-optimal choices of $h$ and $\ell$.

It has long been recognized that disparate covariance scales in $X$ tend to make sampling difficult [6]. This motivates techniques to "flatten" $X$ through transformations. These transformations can be as basic as scaling components of $X$ by their standard deviation, or as complex as application of a diffeomorphism built with polynomials or a neural network [25, 16]. Despite some success, there is limited understanding as to exactly how much better or worse different covariance scales are.

Our main contribution is to show that, in the multivariate Gaussian case, one particular condition number governs the number of leapfrog integration steps needed to efficiently sample in every direction. This number, $\kappa$ (see (3.6)), differs from the common spectral condition number (ratio of largest to smallest singular values) since it takes into account all eigenvalues of the covariance matrix. This is needed, since all eigenvalues contribute to integration error.

Using $\kappa$ we are able to analyze and develop preconditioning techniques. We find the law of $\kappa$ when preconditioning with the sample covariance. The law turns out to be the condition number of the inverse Wishart ensemble. An asymptotic expression for this law is then derived, leading to a set of preconditioning heuristics. We next show that the popular component-wise standardization can be better or worse than preconditioning with a diagonal transformation trained via variational inference. Each of these in turn can be better or worse than doing nothing at all. Just as importantly, insight is given into what sort of spectra are antithetical to efficient HMC. These "bad" spectra have only a few large eigenvalues, and many small ones. We demonstrate the virtue of a low rank update preconditioner for this situation.

We limit analysis to Gaussian targets, despite the fact that sampling from them does not even require HMC. Our hope is that, by providing explicit formulas and precise analysis, these results will be helpful in more general situations. For example, the original motivation for this work was to determine whether reverse or forward KL preconditioning is strictly superior (the answer is "no"). Additionally, a pleasant consequence of a condition number formulation is the connection to random matrix theory. We hope to expand on this line of thinking in future work.

Our result relating step size to integration error (theorem 3.1) is similar to the analysis of [5, 7], which derives scaling laws for (non-Gaussian) densities with repeated components. Our restriction to Gaussian densities allows for an explicit relationship between the scales of the random variable $X$ and the necessary step size/number of integration steps, and removes the repeated-components requirement. A condition number is used for selecting preconditioners in [4]. Their number shows some promise for non-Gaussian systems, but does not have a precise relation to sampling efficiency (our $\kappa$ does).

In section 2 we briefly review the HMC method. Section 3 goes over our main results surrounding our condition number $\kappa$. Section 4 derives an asymptotic result on $\kappa$ for the inverse Wishart ensemble. Section 5 demonstrates using $\kappa$ to derive and analyze different preconditioning techniques. Proofs of the main results are in section 6.
2. The Hamiltonian Monte Carlo Method
Here we quickly review the basics of HMC for purposes of establishing notation. A comprehensive introduction can be found in [24].

The Hamiltonian Monte Carlo (HMC) method was introduced in 1987 as "Hybrid Monte Carlo" for use in lattice field theory simulations [12]. Since then, it has been recognized as an efficient alternative to random walk Metropolis, well suited for higher dimensional problems. Implementations are available for a variety of languages [10].

HMC defines a way to sample from smooth densities $p(x)$ for $X \in \mathbb{R}^N$ by augmenting state space with a momentum $\xi \in \mathbb{R}^N$, and defining the joint density $p(x, \xi) = \exp\{-H(x, \xi)\}$, where

$$H(x, \xi) := -\log p(x) + \frac{\|\xi\|^2}{2},$$

and $\|\xi\|$ is the Euclidean norm. Alternative norms may be used, although these are less popular in practice [14]. Moreover, a fixed norm generated through the inner product $\langle LL^T \xi, \xi\rangle$ is shown in [24] to be equivalent to the linear preconditioning $X \mapsto LX$ (which we do consider here).

In the physics setting, the Hamiltonian $H$ is total energy, with $-\log p(x)$ playing the role of potential energy and $\|\xi\|^2/2$ that of kinetic energy. HMC generates a new sample from the current one $(x_j, \xi_j)$ as follows.

1. Draw $\tilde\xi \sim N(0, I_N)$.
2. Let $(x(t), \xi(t))$ be the time $t$ solution to the ODE $\dot x = \xi$, $\dot\xi = \nabla\log p(x)$, with initial condition $(x_j, \tilde\xi)$.
3. Set $(x_{j+1}, \xi_{j+1}) = (x(T), -\xi(T))$, for integration time $T$.

In practice, the ODE must be solved numerically over $\ell$ steps with step-size $h$. Denote this solution by $\Psi_\ell$. The integration error means we can no longer just accept the move in step 3, which is replaced by a Metropolis correction:

$(x_{j+1}, \xi_{j+1}) = \Psi_\ell$, with probability $a(x_j, \xi_j \to \Psi_\ell)$, and
$(x_{j+1}, \xi_{j+1}) = (x_j, \xi_j)$, with probability $1 - a(x_j, \xi_j \to \Psi_\ell)$,

for acceptance probability

$$a(x_j, \xi_j \to \Psi_\ell) := \min\left(1, \exp\left\{H(x_j, \xi_j) - H(\Psi_\ell)\right\}\right). \quad (2.1)$$

Since Hamilton's equations of motion preserve the Hamiltonian, if numerical integration were perfect, $H(x_j, \xi_j) = H(\Psi_\ell)$ and every step would be accepted. In practice, finite step size leads to some rejections and wasted effort.

The numerical integration is usually done with $\ell$ steps of the Störmer-Verlet or leapfrog integrator, each step progressing $(x, \xi)$ to $(x_h, \xi_h)$ via

1. Set $\xi_{h/2} = \xi + \frac{h}{2}\nabla\log p(x)$
2. Set $x_h = x + h\,\xi_{h/2}$
3. Set $\xi_h = \xi_{h/2} + \frac{h}{2}\nabla\log p(x_h)$

Figure 1 shows that integration errors remain small, even if $h \approx \sigma$. Just as importantly, trajectories do not diverge, but instead follow paths of a modified Hamiltonian, due to the fact that the leapfrog integrator is symplectic [24, 21].

The number of leapfrog steps $\ell$ is often chosen to be a fixed (but highly influential) constant. To avoid unlucky (or difficult to analyze) circumstances, we use a random integration time $T$, then set $\ell = \lceil T/h\rceil$. This randomness is often introduced to ensure ergodicity. See section 3.2 of [24] as well as [22]. Inspection of our proofs shows (see e.g. (6.7)) that without this regularizing effect the spectrum could conspire to make the leading term vanish. Additionally, with a fixed integration length and dense enough spectrum, near resonances can occur, whereby samples nearly repeat the same trajectory.
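To make the notation concrete, here is a minimal sketch of one HMC transition using the leapfrog scheme above. It is illustrative only (not the TensorFlow Probability implementation used in our experiments); the function names and the plain-NumPy setting are our own, and $x$ is assumed to be a 1-D array.

```python
import numpy as np

def leapfrog(x, xi, grad_log_p, h, n_steps):
    """Integrate dx/dt = xi, dxi/dt = grad log p(x) with the leapfrog scheme."""
    xi = xi + 0.5 * h * grad_log_p(x)        # initial half step in momentum
    for _ in range(n_steps - 1):
        x = x + h * xi                       # full step in position
        xi = xi + h * grad_log_p(x)          # full step in momentum
    x = x + h * xi
    xi = xi + 0.5 * h * grad_log_p(x)        # final half step in momentum
    return x, xi

def hmc_step(x, log_p, grad_log_p, h, n_steps, rng):
    """One HMC transition: resample momentum, integrate, Metropolis-correct (2.1)."""
    xi = rng.standard_normal(x.shape)
    H0 = -log_p(x) + 0.5 * (xi @ xi)
    x_new, xi_new = leapfrog(x, xi, grad_log_p, h, n_steps)
    H1 = -log_p(x_new) + 0.5 * (xi_new @ xi_new)
    if rng.uniform() < min(1.0, np.exp(H0 - H1)):
        return x_new, True
    return x, False
```

The negation of the momentum in step 3 of the exact algorithm is omitted here, since the momentum is discarded and freshly resampled at the next transition.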
Fig 1. Leapfrog Integration Error. Left: trajectories $(x(t), \xi(t))$ integrated with different step sizes $h$; the smallest is nearly perfect, and even with larger $h$, deviation of the trajectories from the perfect line is barely perceptible. Right: values of the Hamiltonian, $H(x(t), \xi(t))$, along the trajectories. For larger $h$, the error is greater, but does not diverge.

3. $\kappa$ and Computational Effort in HMC

Important results relating computational effort and step size have been established by previous work. Being a second order integrator, leapfrog results in error in the Hamiltonian bounded at each step by $O(h^3)$; thus, over $\ell = T/h$ steps, the error is bounded by $O(Th^2)$ [24, 21]. In expectation the situation is better: [5, 7] show that, asymptotically, the integration error is Normal with scale $O(h^2)$, with $T$ contributing only to higher order terms.

In this section, we establish a relationship between the covariance spectrum and the number of leapfrog steps needed to effectively sample. This is rigorously analyzed as dimension $N \to \infty$. Before submersing into the world of limits, consider fixed dimension and covariance spectrum $\sigma_1^2 \ge \sigma_2^2 \ge \cdots \ge \sigma_N^2 > 0$, and suppose we choose step size $h$ and number of leapfrog steps $\ell$ to achieve the following

Desiderata 3.1. (i) The average acceptance probability $\mathbb{E}\{a(X, \xi \to \Psi_\ell)\} = \bar a$, for desired $\bar a \in (0, 1)$. (ii) The number of leapfrog steps is $\ell = \lceil \sigma_1 T/h\rceil$, for integration time $T \sim \pi$, where $\pi$ is some probability density.

To motivate (i), consider results in [5, 7], where computational cost is shown to be (asymptotically in dimension) optimal when the average acceptance probability approaches a limit (approximately 0.651). In addition to being asymptotically optimal, tuning $h$ to achieve desired $\bar a$ is often convenient [2, 10]. Condition (ii) states that each trajectory travels a distance $h\ell$ that scales with the largest scale length, $\sigma_1$. This prevents HMC from reverting to a random walk in the direction corresponding to $\sigma_1$.

Fig 2. Spectra and $\kappa$. Spectra were generated using $f$ from (3.4), and $\kappa$ was computed. Left: holding the maximal eigenvalue ($\sigma_1^2$) at one, smaller tails of the spectrum increase $\kappa$. Right: holding $\sigma_1/\sigma_N$ constant, the worst case is to have one large eigenvalue, and many small ones.

Condition (ii) does imply $\ell$ could be quite large, but this turns out not to be an issue. The reason is that the dominant error term depends on integration length only through an average of $\sin^2(\cdot)$, which is close to being constant (see the proof of theorem 3.1 in section 6.2). Therefore, the acceptance rate depends strongly on $h$ but only weakly on $\ell$. Thus, the user can often adjust $\ell$ after setting $h$, so that $h\ell \propto \sigma_1$ as desired.

Define

$$\kappa := \left(\sum_{n=1}^N \left(\frac{\sigma_1}{\sigma_n}\right)^4\right)^{1/4}, \qquad \nu := \left(\sum_{n=1}^N \left(\frac{1}{\sigma_n}\right)^4\right)^{1/4}. \quad (3.1)$$

Corollary 3.2 shows that if $\sigma_n$ does not decay too fast, one may meet desiderata 3.1 (i) using a step size

$$\bar h \approx \frac{2\cdot 8^{1/4}}{\nu}\sqrt{\Phi^{-1}\left(1 - \frac{\bar a}{2}\right)}, \quad (3.2)$$

where $\Phi$ is the normal cumulative distribution function. The number of leapfrog steps is then proportional to $\sigma_1/\bar h \propto \sigma_1\nu = \kappa$. Thus $\kappa$ is a measure of work in a tuned HMC setup.

Since $\kappa$ involves a ratio of eigenvalues, it is the shape (as opposed to overall scale) of the spectrum that determines the conditioning. As with the spectral condition number, $\sigma_1/\sigma_N$, $\kappa$ is minimal when the spectrum is flat, i.e., $\sigma_1 = \cdots = \sigma_N$. Unlike the spectral condition number, $\kappa$ is worst when there are many small eigenvalues and at least one large one (see figure 2).

3.1. Estimation of $\kappa$ and demonstration of main results

A straightforward estimate of $\kappa$ may be obtained by plugging the sample covariance into (3.1). We did this, and it led to a very inaccurate estimate (see figure 3, "Sample $\kappa$"). The reason is that accurate estimation of a covariance matrix requires many samples (see e.g. the discussion of the Wishart case in section 5.1). We obtained a more accurate estimate by re-writing (3.2) as

$$\kappa := \sigma_1\nu \approx 2\cdot 8^{1/4}\,\frac{\sigma_1}{\bar h}\sqrt{\Phi^{-1}\left(1 - \frac{\bar a}{2}\right)}. \quad (3.3)$$

Thus, $\kappa$ can be estimated by drawing samples with step size $h$, observing the acceptance probability $\hat a$, estimating $\hat\sigma_1 \approx \sigma_1$ from the sample covariance, then plugging into (3.3) (figure 3, "Inferred $\kappa$"). In experiments, where we know $\sigma_1$ ahead of time, this relation can be used to check theorem 3.1 (figure 3, "Inferred $\kappa$ (known $\sigma_1$)").

To generate random spectra for these numerical tests, we use the set valued function

$$f(\mathcal{Y}; m, M, c, \beta) := \left\{\frac{g(y) - \min_{y\in\mathcal{Y}}\{g(y)\}}{\max_{y\in\mathcal{Y}}\{g(y)\} - \min_{y\in\mathcal{Y}}\{g(y)\}}\cdot(M - m) + m \;:\; y \in \mathcal{Y}\right\}, \qquad g(y) := \frac{1}{1 + |y/c|^\beta}. \quad (3.4)$$

We repeatedly drew $\{\sigma_n^2\} \sim f(\mathcal{Y}; m, M, c, \beta)$, for random sets $\mathcal{Y}$ of several sizes $N$ (from 32 up) with elements drawn uniformly, minval $m = 1$, and several choices of maxval $M$, cutoff $c$, and power $\beta$. This means the $\sigma_n^2$ are values of the function $g(y)$ re-scaled to the interval $[m, M]$. See e.g. figure 2 for examples. Since the performance of HMC for Gaussian targets is invariant under isometries (see section 4 of [24]), it suffices to use these spectra in a multivariate normal with covariance $C = \mathrm{Diag}(\sigma_1^2, \ldots, \sigma_N^2)$. For each spectrum, we adjust the step size $h$ until the acceptance probability is close to either 0.8 or 0.95. HMC (implemented in TensorFlow Probability [20]) is then used to draw $S$ samples, at several oversampling ratios $S/N$. A total of 4790 random spectra were generated.
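The quantities in (3.1), and the acceptance-based estimator (3.3), are straightforward to compute. The sketch below is a minimal illustration; the helper names are ours, and the constant factor follows (3.3) as reconstructed above.

```python
import numpy as np
from scipy.stats import norm

def kappa_nu(sigma):
    """kappa and nu from (3.1), given scale lengths sigma (sqrt of covariance eigenvalues)."""
    sigma = np.asarray(sigma, dtype=float)
    nu = np.sum(sigma ** -4.0) ** 0.25
    return sigma.max() * nu, nu

def kappa_from_acceptance(sigma1_hat, h, a_hat):
    """Estimator (3.3): only sigma_1 and the observed acceptance rate at step size h are needed."""
    return 2.0 * 8.0 ** 0.25 * (sigma1_hat / h) * np.sqrt(norm.ppf(1.0 - a_hat / 2.0))

# Example: a flat spectrum attains the minimal value kappa = N**(1/4).
sigma = np.ones(64)
print(kappa_nu(sigma)[0], 64 ** 0.25)
```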
3.2. $\kappa$ is a condition number on scale matrices

Condition numbers provide worst-case bounds on solutions to linear systems. For HMC, $\kappa$ also provides a worst-case result of sorts: the work needed to sample from the most difficult direction, corresponding to $\sigma_1$. Our usage of $\kappa$ is close to the stiffness ratio of a linear ODE, which is just the spectral condition number. The stiffness ratio can determine the number of time steps needed for convergence to steady state.

Fig 3. Estimation of $\kappa$ and validation of main result. Random spectra were generated using $f(\mathcal{Y}; m, M, \beta, c)$ from (3.4), and the relationship between estimated and actual $\kappa$ is plotted. The coefficient of determination $R^2$ of estimated vs. actual is shown, as is $R^2$ of the best fit line. Top Left: estimating $\kappa$ directly from the sample covariance matrix resulted in over-estimation. Top Right: using (3.3) with estimated $\hat\sigma_1$ gives a fairly accurate estimate of $\kappa$; it tends to over-estimate, due to over-estimation of $\sigma_1$. Bottom Left: when $\sigma_1$ is known, $R^2 \approx 1$, which validates theorem 3.1 in the finite $N$ case. Bottom Right: $R^2$ vs. $N$ when $\sigma_1$ is known; $R^2$ appears set to converge to 1, validating theorem 3.1.

Recall the vector norm $\|\cdot\|$ and the induced matrix norm (the spectral norm):

$$\|x\| := \left(\sum_{n=1}^N x_n^2\right)^{1/2}, \qquad \|A\| := \sup_x \frac{\|Ax\|}{\|x\|}.$$

A condition number quantifies worst case sensitivity of solutions to $Ax = b$ with respect to perturbations. For example, consider the perturbed system $A(x + \delta x) = b + \delta b$ for nonsingular $A$. Elementary steps show that

$$\frac{\|\delta b\|}{\|b\|} \le \|A\|\,\|A^{-1}\|\,\frac{\|\delta x\|}{\|x\|}.$$

Hence $A \mapsto \|A\|\|A^{-1}\|$ is a condition number.

A class of matrix norms of interest to us are the Schatten norms [18]. For $r \in [1, \infty]$, the $r$th Schatten norm of a matrix $A$, $\|A\|_{S^r}$, is the vector $r$-norm applied to the singular values of $A$. For example, with $\{\sigma_1, \ldots, \sigma_N\}$ the $N$ singular values of $A \in \mathbb{R}^{N\times N}$,

$$\|A\|_{S^r} := \left(\sum_{n=1}^N \sigma_n^r\right)^{1/r}.$$

Since $\|A\| = \max_n\{\sigma_n\}$, we have $\|A\| \le \|A\|_{S^r}$. Therefore

$$\frac{\|\delta b\|}{\|b\|} \le \|A\|\,\|A^{-1}\|_{S^4}\,\frac{\|\delta x\|}{\|x\|}, \quad (3.5)$$

and $A \mapsto \|A\|\|A^{-1}\|_{S^4}$ is a condition number w.r.t. $\|\cdot\|$. As compared with $\|A\|\|A^{-1}\|$, it is inferior since it provides a looser bound. The interest for us is that it is equal to $\kappa$. Indeed, suppose the covariance matrix is $AA^T$. Then its eigenvalues are $\sigma_1^2 \ge \cdots \ge \sigma_N^2 > 0$, where $\sigma_n$ are the singular values of $A$, and

$$\kappa := \left(\sum_{n=1}^N \left(\frac{\sigma_1}{\sigma_n}\right)^4\right)^{1/4} = \|A\|\,\|A^{-1}\|_{S^4} = \sqrt{\|AA^T\|\,\|(AA^T)^{-1}\|_{S^2}}. \quad (3.6)$$

The first equality shows $\kappa$ is a condition number on the scale matrix $A$. The second writes $\kappa$ in terms of the covariance matrix $AA^T$, which is often more convenient.

3.3. Properties of $\kappa$

By definition, the set of covariance matrices for multivariate normals is the set of symmetric positive definite (SPD) matrices. Viewing $\kappa$ as a function of covariance, we have

$$\kappa(C) := \sqrt{\|C\|\,\|C^{-1}\|_{S^2}}. \quad (3.7)$$

Lemma 3.1. $\kappa: SPD \to (0, \infty)$ satisfies
(i) $\kappa(C) = \lim_{k\to\infty}\sqrt[2k]{\mathrm{Trace}\{C^k\}}\cdot\sqrt[4]{\mathrm{Trace}\{C^{-2}\}}$
(ii) Suppose $A, B$ are non-singular; then $\kappa(ABB^TA^T) = \kappa(B^TA^TAB)$
(iii) Suppose $U$ is orthogonal and $C$ is SPD; then $\kappa(UCU^T) = \kappa(C)$

Proof. If $C$ is SPD, its eigenvalues are its singular values $\sigma_1 \ge \cdots \ge \sigma_N > 0$, and $\mathrm{Trace}\{C^k\} = \sum_{n=1}^N\sigma_n^k$. Taking the limit, we have (i). To show (ii), use (i) along with the cyclic permutation property of the trace. (iii) follows from (ii) with $C = BB^T$.
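Lemma 3.1 (iii) is easy to sanity-check numerically. The following sketch (helper names ours) evaluates (3.7) from the eigenvalues and confirms invariance under a random rotation.

```python
import numpy as np

def kappa_spd(C):
    """kappa(C) = sqrt(||C|| * ||C^{-1}||_{S^2}) from (3.7), for SPD C."""
    w = np.linalg.eigvalsh(C)                      # eigenvalues = singular values
    return np.sqrt(w.max() * np.sqrt(np.sum(w ** -2.0)))

rng = np.random.default_rng(0)
N = 16
A = rng.standard_normal((N, N))
C = A @ A.T + N * np.eye(N)                        # a safely SPD matrix
U, _ = np.linalg.qr(rng.standard_normal((N, N)))   # random orthogonal U

# Lemma 3.1 (iii): kappa is invariant under orthogonal conjugation.
print(np.isclose(kappa_spd(C), kappa_spd(U @ C @ U.T)))
```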
To study convergence, we must establish a way of taking dimension to infinity. We draw inspiration from the discretization of a continuous linear operator. Here, we expect the $N$ point discretization to have singular values $(\sigma_{N1}, \ldots, \sigma_{NN})$ that are close to singular values of the continuous operator [15]. This arises e.g. in linear inverse problems [19]. Another example is the discretization of a linear filter in signal processing. By contrast, [5, 7, 24] consider a fixed set of (possibly non-Gaussian, correlated) random variables, then let $p(x)$ be the law of $N$ i.i.d. groups of these fixed variables. This simplification allows them to ignore problems associated with vanishing eigenvalues, but would be unnatural in our setting.

Here we use a random integration time $T_N := \sigma_{N1}T$, where $T \sim \pi$. Let

$$\hat\pi(\omega) := \int e^{-i\omega t}\pi(t)\,\mathrm{d}t.$$

The bound $|\hat\pi| \le \hat\pi(0) = 1$ is trivial. In addition, we impose the regularity condition

$$|\hat\pi(\omega)| \le C_\pi < 1, \quad\text{for all } |\omega| \ge 2. \quad (3.8)$$

This condition is satisfied e.g. if $\pi$ is a uniform density on any interval. It ensures each integral appearing in (3.9) is uniformly (in $n$) bounded below, and allows derivation of (3.10).

For $\alpha > 0$ and $N \in \mathbb{N}$, define the step sizes

$$h_N := \left(\frac{1}{\alpha}\sum_{n=1}^N (2\sigma_{Nn})^{-4}\int\sin^2\left(\frac{t}{\sigma_{Nn}}\right)\frac{\pi(t/\sigma_{N1})}{\sigma_{N1}}\,\mathrm{d}t\right)^{-1/4}, \qquad \bar h_N := \left(\frac{1}{2\alpha}\sum_{n=1}^N (2\sigma_{Nn})^{-4}\right)^{-1/4}. \quad (3.9)$$

Note that $\bar h_N$ is, up to a constant, the inverse of $\nu$ from (3.1). One can show (see section 6.1) that

$$(1 + C_\pi)^{-1/4}\,\bar h_N \le h_N \le (1 - C_\pi)^{-1/4}\,\bar h_N, \quad (3.10)$$

so the step sizes differ by at most a constant, and may sometimes be asymptotically equal (corollary 3.1). Moreover, the proof of theorem 3.1 shows the chain is stable as soon as $h_N < 2\sigma_{NN}$. Then, since (see (6.2) in section 6.1) $h_N/\sigma_{NN} \to 0$, the chain is stable for large enough $N$.

We assume the spectra do not decay too rapidly, in the sense that

$$\lim_{N\to\infty}\left(\sum_{n=1}^N \sigma_{Nn}^{-8}\right)^{1/2}\left(\sum_{n=1}^N \sigma_{Nn}^{-4}\right)^{-1} = 0. \quad (3.11)$$

One can check that (3.11) holds for any polynomial decrease $\sigma_{Nn} \sim n^{-k}$, but not for exponential $\sigma_{Nn} \sim e^{-n}$. Note also that (3.11) provides uniform control over the spectra; e.g., it implies $h_N/\sigma_{Nn} \to 0$ uniformly in $n$ (see (6.2)). This allows convergence despite $\{\sigma_{Nn}\}$ otherwise being unrelated at different $N$.

Our (standard) choice of momentum term, $\|\xi\|^2/2$, means leapfrog integration is invariant under isometry (see section 4 of [24]). Applying a rotation aligning the axes with eigenvectors of the covariance matrix, the Hamiltonian is diagonalized, and integration error may be studied one component at a time. The integration error in component $n$, after $\ell$ leapfrog steps, is

$$\delta_{N,n} := H(\Psi_\ell^{N,n}) - H(\Psi_0^{N,n}).$$

The total integration error is then

$$\Delta_N := \sum_{n=1}^N \delta_{N,n}.$$

Theorem 3.1. Given step size $h_N$ from (3.9), integration time $T_N = \sigma_{N1}T$ with $T \sim \pi$ satisfying (3.8), and sequences of spectra satisfying (3.11), we have convergence in distribution of the HMC (leapfrog) integration error,

$$\Delta_N \to N\left(\frac{\alpha}{2}, \alpha\right),$$

for chains in equilibrium.

Theorem 3.1 is not surprising. Indeed, since $\Delta_N$ is a sum of $N$ independent random variables, one expects a central limit theorem to hold, provided the scaling given by $h_N$ is correct and many terms contribute to the sum (as opposed to it being dominated by a few terms). See e.g. the CLT for triangular arrays in [13]. The meat of the proof is establishing this scaling (see section 6.2). Once that is done, assumption (3.11) ensures many terms contribute to the sum.

The inclusion of the integral $\int\sin^2(\cdot)\,\mathrm{d}t$ in $h_N$ is ugly. Unfortunately, it is necessary to handle the case where the spectrum contains significant terms close to $\sigma_{N1}$, for which the averaging of $\sin^2(\cdot)$ does not happen. One simple case where it does is
Corollary 3.1. Assume there exists $\delta, C > 0$ such that $|\hat\pi(\omega)| \le C|\omega|^{-\delta}$. Suppose further $\sigma_{NK}/\sigma_{N1} < r_K$, where $r_K$ is a sequence tending to $0$ as $K \to \infty$. Then, as $N \to \infty$, $\bar h_N/h_N \to 1$.

Corollary 3.2 shows the free parameter $\alpha$ may be chosen to achieve a desired acceptance rate $\bar a \in (0, 1)$.

Corollary 3.2. Given the hypothesis of theorem 3.1, choose (with $\Phi$ the normal distribution function)

$$\alpha := 4\left(\Phi^{-1}\left(1 - \frac{\bar a}{2}\right)\right)^2,$$

for use in $h_N$. We then have

$$\lim_{N\to\infty}\mathbb{E}\left\{a_N(x_j, \xi_j \to \Psi_\ell^N)\right\} = \bar a.$$

Given the hypothesis of corollary 3.1, the same result holds with $\bar h_N$ in place of $h_N$.

Proof. Designate the distributional limit of $\Delta_N$ by $\Delta_\infty \sim N(\alpha/2, \alpha)$. As in the proof of theorem 3.6 in [5], the boundedness of $u \mapsto 1 \wedge e^u$ implies

$$\mathbb{E}\left\{a_N(x_j, \xi_j \to \Psi_\ell^N)\right\} \to \mathbb{E}\left\{1 \wedge e^{-\Delta_\infty}\right\}.$$

This expectation can be found analytically, and is $2\Phi(-\sqrt{\alpha}/2) = \bar a$.
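In practice, corollary 3.2 reduces to a one-line choice of $\alpha$. The sketch below (illustrative, with an arbitrary target acceptance rate) also verifies by Monte Carlo that $\mathbb{E}\{1 \wedge e^{-\Delta_\infty}\} = 2\Phi(-\sqrt{\alpha}/2) = \bar a$ when $\Delta_\infty \sim N(\alpha/2, \alpha)$.

```python
import numpy as np
from scipy.stats import norm

def alpha_for_acceptance(a_bar):
    """Free parameter in (3.9) that targets average acceptance a_bar (corollary 3.2)."""
    return 4.0 * norm.ppf(1.0 - a_bar / 2.0) ** 2

a_bar = 0.8
alpha = alpha_for_acceptance(a_bar)

rng = np.random.default_rng(0)
delta = rng.normal(alpha / 2.0, np.sqrt(alpha), size=1_000_000)
print(np.minimum(1.0, np.exp(-delta)).mean())   # Monte Carlo, approximately 0.8
print(2.0 * norm.cdf(-np.sqrt(alpha) / 2.0))    # analytic, exactly 0.8
```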
4. $\kappa$ for Large Inverse Wishart Matrices

The $Wishart(N, S)$ ensemble is that of $N \times N$ random matrices $(1/S)\sum_{s=1}^S X_s X_s^T$, where the $X_s \sim N(0, I_N)$ are independent. $C \sim InverseWishart(N, S)$ if $C^{-1} \sim Wishart(N, S)$. Wishart matrices arise naturally as the $S$-sample covariance matrix of $N$-variate random normals. Inverse Wishart matrices can result from preconditioning (lemma 5.1).

If $N \to \infty$ with the oversampling ratio $S/N \to \omega \in (1, \infty)$, then the smallest and largest eigenvalues of a Wishart matrix approach $a := (1 - \omega^{-1/2})^2$ and $b := (1 + \omega^{-1/2})^2$ almost surely [28]. The limiting spectral density is given by the Marčenko-Pastur law [23],

$$f(x) := \begin{cases}\frac{\omega}{2\pi x}\sqrt{(b - x)(x - a)}, & a \le x \le b,\\ 0, & \text{otherwise}.\end{cases} \quad (4.1)$$

Proposition 4.1 uses this to give the asymptotic value of $\kappa$ in the inverse Wishart case. Figure 4 shows good agreement between this expression and sampled $\kappa$.

Proposition 4.1. If $C \sim InverseWishart(N, S)$, and $N \to \infty$ with $S/N \to \omega \in (1, \infty)$, then

$$\frac{\kappa(C)}{N^{1/4}} \to \frac{(1 + \omega^{-1})^{1/4}}{1 - \omega^{-1/2}}$$

almost surely.

Proof. Let $\lambda_1 \ge \cdots \ge \lambda_N > 0$ be the eigenvalues of $C^{-1} \sim Wishart(N, S)$. Then,

$$\frac{\kappa(C)^4}{N} = \frac{1}{N}\sum_{n=1}^N\left(\frac{\lambda_n}{\lambda_N}\right)^2 = \frac{\lambda_N^{-2}}{N}\sum_{n=1}^N\lambda_n^2.$$

Now, as shown in [28] (main result) and [9] (remark 5), we have almost sure convergence,

$$\lambda_N \to (1 - \omega^{-1/2})^2 \quad\text{and}\quad \frac{1}{N}\sum_{n=1}^N\lambda_n^2 \to \mathbb{E}_f\{X^2\}.$$

Therefore, almost surely,

$$\frac{\kappa(C)^4}{N} \to (1 - \omega^{-1/2})^{-4}\,\mathbb{E}_f\{X^2\}.$$

The result then follows from

$$\mathbb{E}_f\{X^2\} = \int_a^b \frac{\omega}{2\pi x}\sqrt{(b - x)(x - a)}\,x^2\,\mathrm{d}x = 1 + \omega^{-1},$$

which is a straightforward computation involving integrals of the Beta function. Convergence of the normalized trace to the moment is a result of remark 5 in [9], which gives a.s. convergence of the empirical spectral density function. Convergence of this type holds for a broad class of matrices [3], [8].
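Proposition 4.1 can be checked directly by sampling an inverse Wishart matrix. A minimal sketch, with arbitrary $N$ and $S$; kappa_spd implements (3.7).

```python
import numpy as np

def kappa_spd(C):
    """kappa(C) from (3.7)."""
    w = np.linalg.eigvalsh(C)
    return np.sqrt(w.max() * np.sqrt(np.sum(w ** -2.0)))

def kappa_asymptotic(N, S):
    """Large-N formula from proposition 4.1, with omega = S/N."""
    om = S / N
    return N ** 0.25 * (1.0 + 1.0 / om) ** 0.25 / (1.0 - om ** -0.5)

rng = np.random.default_rng(1)
N, S = 64, 640                                  # oversampling ratio omega = 10
X = rng.standard_normal((S, N))
W = (X.T @ X) / S                               # Wishart(N, S)
C = np.linalg.inv(W)                            # InverseWishart(N, S)
print(kappa_spd(C), kappa_asymptotic(N, S))     # should roughly agree at this N
```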
5. Preconditioning HMC
The results of section 3 show a clear sampling advantage for problems where the spectrum is as close to "flat" as possible. Here we consider techniques where a diffeomorphism $F$ transforms the random variable $X$ into $Z := F^{-1}(X)$, with the hope that $Z$ is easier to sample from. This is not a new idea. Linear transformations have been considered as far back as [24]. Trainable nonlinear mappings seem to have been introduced by [25], where variational inference is used to find $F$. It has since been developed further in [26], which considers mappings based on low-fidelity approximations to the posterior, and [16], where the preconditioner is a neural network.

Fig 4. Density and asymptotic $\kappa(C)$, $C \sim InverseWishart(N, S)$. Top Left: density plots when dimension $N = 64$. Large $S/N$ means $\kappa$ is closer to $N^{1/4}$, and for large enough $S/N$, $\kappa$ concentrates near its mean, so point estimates will be useful. Top Right: estimate vs. samples of $\kappa$ when dimension $N = 64$ and $S$ varies. Here "asymptotic" means the large $N$ formula implied by proposition 4.1. Agreement is very good, except for small $S/N$. Bottom Left: when $N = 8$, the densities overlap quite a bit, so point estimates will be misleading. Bottom Right: when $N = 8$, samples of $\kappa$ vary significantly from the asymptotic $\kappa$.

To formalize this procedure, let us start with $X \sim p_X(x)$ and a diffeomorphism $F$, which transforms $X \mapsto Z = F^{-1}(X)$. Equivalently, the density $p_X$ is transformed by the pushforward

$$p_Z(z) = F_\sharp^{-1}p_X(z) := p_X(F(z))\,|\det(DF(z))|. \quad (5.1)$$

Above, $DF$ is the matrix of partial derivatives, $(DF)_{ij} = \partial F_j/\partial z_i$. Using HMC, we sample from the density $p_Z$, producing $Z_1, \ldots, Z_K$. Transforming back, $X_k := F(Z_k)$, and we have samples from $p_X$ as desired.

In the Gaussian case, $p_X \sim N(0, AA^T)$, the linear preconditioner induced by a matrix $F$ transforms the covariance and $\kappa$ as follows:

$$AA^T \mapsto (F^{-1}A)(F^{-1}A)^T = F^{-1}(AA^T)F^{-T}, \qquad \kappa(AA^T) \mapsto \|F^{-1}A\|\,\|(F^{-1}A)^{-1}\|_{S^4}. \quad (5.2)$$

The preconditioning techniques we will consider have an upfront cost: either in obtaining some number of preliminary samples (which should be thrown away), or in solving an optimization problem. It's fair to ask whether the subsequent speed-up is worth it. That depends on how many final samples are needed. Here we present two cases where the ratio of samples to dimension, or oversampling ratio, $\omega := S/N$, is required to be 10 or more.

First, the estimation of component means and variances is done with summations of the form $(1/S)\sum_{s=1}^S Y_s$, resulting in relative error on each component of $O(1/\sqrt{S})$. Thus, relative error of size $\varepsilon$ requires $S \sim O(\varepsilon^{-2})$. This is independent of dimension $N$, but for moderate $N$ and relative error of a few percent, an oversampling ratio $\omega := S/N \approx 16$ is required.

Second, consider the error in reconstructing the covariance spectrum from the sample covariance matrix $\hat C := (1/S)\sum_{s=1}^S X_s X_s^T$. If $X_s \sim$ i.i.d. $N(0, I_N)$, we have $\hat C \sim Wishart(N, S)$, and for the bulk of the spectrum to be within $\varepsilon$ of the correct values (which are all one), we must have $(1 + \omega^{-1/2})^2 < 1 + \varepsilon$, which implies roughly $\omega > 4/\varepsilon^2$ (see section 4).

Here we consider a choice to be made by a practitioner who has gathered $S \ge N$ burn-in samples $(X_1, \ldots, X_S)$. They could continue gathering samples until a goal of $S_f > S$ "final" samples is obtained. Alternatively, they could form the sample covariance $\hat C := (1/S)\sum_{s=1}^S X_s X_s^T$, precondition with its Cholesky factor $\hat L$, throw away the first $S$ samples, then gather $S_f$ final samples.

Abusing notation, we let $\kappa$ be the initial condition number, and $\kappa(S)$ be the condition number after preconditioning with $S$ samples. Assuming the sampling rate is proportional to $1/\kappa$, the total time $\tau$ to obtain $S_f$ samples is

$$\tau \propto S\kappa + S_f\,\kappa(S),$$

which has extremal points $\mathrm{d}\tau/\mathrm{d}S = 0$ at $S^*$ whenever

$$\frac{\mathrm{d}\kappa}{\mathrm{d}S}(S^*) = -\frac{\kappa}{S_f}. \quad (5.3)$$

This suggests continuing to draw burn-in samples while $\mathrm{d}\kappa/\mathrm{d}S \le -\kappa/S_f$; once this fails, updates may stop and $S_f$ samples can be drawn. The speedup is

$$\frac{S_f\,\kappa}{S\kappa + S_f\,\kappa(S)}. \quad (5.4)$$

Assuming the burn-in samples are i.i.d., we will obtain an approximate expression for the value of $S/N$ at which (5.3) is met. First, however, we must establish
Lemma 5.1.
Suppose $(X_1, \ldots, X_S)$ are i.i.d., and HMC sampling of $X \sim N(0, C)$ is preconditioned with the $S$-sample Cholesky factor $\hat L$. Then the preconditioned $\kappa$ follows the law of $\kappa(C')$, for $C' \sim InverseWishart(N, S)$.

Proof. The preconditioned covariance is $\hat L^{-1}C\hat L^{-T}$. Writing $C = LL^T$, lemma 3.1 gives $\kappa(\hat L^{-1}C\hat L^{-T}) = \kappa(L^T\hat C^{-1}L)$. Since

$$L^T\hat C^{-1}L = \left(\frac{1}{S}\sum_{s=1}^S (L^{-1}X_s)(L^{-1}X_s)^T\right)^{-1},$$

and $L^{-1}X_s \sim N(0, I_N)$, we see that $L^T\hat C^{-1}L \sim InverseWishart(N, S)$, which completes the proof.

To check (5.3) we will use lemma 5.1 and proposition 4.1. That is, we start with the approximation

$$\kappa(S) \approx g_N(S) := N^{1/4}\,\frac{\left(1 + \frac{N}{S}\right)^{1/4}}{1 - \sqrt{\frac{N}{S}}}. \quad (5.5)$$

Then, (5.3) is approximately satisfied when $\kappa'(S) \approx g_N'(S) = -\kappa/S_f$, which is equivalent to finding $S/N$ such that

$$U\left(\frac{S}{N}\right) = \frac{\kappa}{N^{1/4}}\cdot\frac{N}{S_f}, \quad\text{where}\quad U(\omega) := -\frac{\mathrm{d}}{\mathrm{d}\omega}\left[\frac{(1 + \omega^{-1})^{1/4}}{1 - \omega^{-1/2}}\right] = \frac{(1 + \omega^{-1})^{-3/4}}{4\omega^2(1 - \omega^{-1/2})} + \frac{(1 + \omega^{-1})^{1/4}}{2\omega^{3/2}(1 - \omega^{-1/2})^2}. \quad (5.6)$$

This expresses an optimality condition on the burn-in oversampling ratio $S/N$ in terms of the final oversampling ratio $S_f/N$ and the rescaled initial condition number $\kappa/N^{1/4}$.

These steps are put together in figure 5. For example, suppose $N = 50$, $\kappa/N^{1/4} = 10$, and uncertainty in the sample covariance needs to be less than 25%. Then the "Marcenko-Pastur Density" plot shows $S_f/N \approx 40$ is required, and the "Optimal Burn-In Size $S$" plot then gives the corresponding optimal burn-in ratio $S/N$.
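Putting (5.4)-(5.6) together, the optimal burn-in ratio can be found with a scalar root solve. The sketch below uses illustrative values of $N$, $\kappa$, and $S_f$; the function names are ours.

```python
import numpy as np
from scipy.optimize import brentq

def g(N, S):
    """Approximate kappa(S) after preconditioning with S samples, from (5.5)."""
    om = S / N
    return N ** 0.25 * (1.0 + 1.0 / om) ** 0.25 / (1.0 - om ** -0.5)

def U(om):
    """Optimality function from (5.6)."""
    return ((1.0 + 1.0 / om) ** -0.75 / (4.0 * om ** 2 * (1.0 - om ** -0.5))
            + (1.0 + 1.0 / om) ** 0.25 / (2.0 * om ** 1.5 * (1.0 - om ** -0.5) ** 2))

N, Sf = 50, 2000
kappa0 = 10 * N ** 0.25                          # rescaled condition number of 10
target = (kappa0 / N ** 0.25) * (N / Sf)
om_star = brentq(lambda om: U(om) - target, 1.0 + 1e-6, 1e6)  # U blows up at 1+
S_star = om_star * N                             # optimal burn-in size
speedup = Sf * kappa0 / (S_star * kappa0 + Sf * g(N, S_star))  # (5.4)
print(om_star, speedup)
```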
Fig 5. Sample Covariance Preconditioning: charts to help decide how many samples $S$ to use for burn-in. Top Left: approximation of $\kappa(S)$, from proposition 4.1. As always, $\kappa$ is (approximately) inversely proportional to the sampling rate for HMC; we conclude larger burn-in size $S$ leads to a faster sampling rate. Top Right: limiting spectral density for the Wishart ensemble, (4.1), for three different $S/N$ values. The spectrum has a fair bit of spread, even with 40x oversampling. Bottom Left: the optimal burn-in oversampling ratio $S/N$, as a function of the final desired oversampling ratio $S_f/N$, for different values of $\kappa/N^{1/4}$. This is the graph $\{((\kappa/N^{1/4})/U(\omega), \omega) : \omega \in (1, \infty)\}$, with $U(\omega)$ defined in (5.6). Bottom Right: speedup, as defined by (5.4), with $\kappa(S)$ approximated by (5.5).

Fig 6. Sample Covariance Preconditioning: Theory vs. Numerics. Random spectra were generated using $f(\mathcal{Y}; m, M, \beta, c)$ from (3.4), and the relationship between theoretical and actual $\kappa$ is plotted. The coefficient of determination $R^2$ of estimated vs. actual is shown, as is $R^2$ of the best fit line. Left: $\kappa$ after preconditioning with $S$ i.i.d. samples in dimension $N$ (various $S$ and $N$) is close to $\kappa$ obtained via samples from $InverseWishart(N, S)$, as lemma 5.1 dictates. The scatter plots of course don't match perfectly, since samples are random. Center: the function $g_N(S)$ (with $S$ equal to the mean effective sample size) from (5.5), which approximates $\kappa$, is compared with the actual $\kappa$ obtained by preconditioning with $S$ effective samples from a NUTS chain. Agreement is good. Right: same as Center, but using an HMC chain without NUTS. Agreement is not so good.
To enact these steps in practice, one needs both an estimate of $\kappa$ and $S$ i.i.d. samples. The estimate of $\kappa$ can be obtained using the steps in section 3.1. However, i.i.d. samples are presumably impossible to come by, else we would just use them. As a supplement, we suggest gathering $S$ HMC samples, then using the effective sample size (ESS) in place of $S$ in (5.6) [27]. Our experiments found ESS to be a good substitute, but only if the original samples were obtained using the No-U-Turn Sampler (NUTS) [17]. See figure 6. This is unfortunate, since the nice estimate of $\kappa$ (as per section 3.1) falls apart with NUTS, due to its non-trivial acceptance criteria. Our suggested remedy is to obtain a small number of additional non-NUTS samples, and use these to estimate $\kappa$. This is possible, since estimation of $\kappa$ only requires an estimate of the acceptance probability.

In variational inference, parameters $\theta$ are tuned to minimize a loss function involving a parameterized distribution $q(\cdot\,; \theta)$ and the target $p$. For example, the reverse KL divergence is

$$\mathrm{KL}[q\,\|\,p] = \int \log\left[\frac{q(x; \theta)}{p(x)}\right]q(x; \theta)\,\mathrm{d}x. \quad (5.7)$$

We will always choose $q$ to be a pushforward (see (5.1)) of the standard normal. That is, $q = F_\sharp\phi$, for $\phi \sim N(0, I_N)$. It follows that

$$\mathrm{KL}[q\,\|\,p] = \mathrm{KL}[F_\sharp\phi\,\|\,p] = \mathrm{KL}[\phi\,\|\,F_\sharp^{-1}p], \quad (5.8)$$

which leads to the approximation

$$\mathrm{KL}[q\,\|\,p] \approx \frac{1}{K}\sum_{k=1}^K \log\left[\frac{\phi(z_k)}{p(F(z_k))\,|\det DF(z_k)|}\right] = \frac{1}{K}\sum_{k=1}^K \log\left[\frac{q(F(z_k))}{p(F(z_k))}\right], \qquad z_k \sim \phi. \quad (5.9)$$

If $F$ is smooth and $\theta \mapsto \mathrm{KL}[q\,\|\,p]$ is convex, (5.7) may be minimized by stochastic gradient descent using (5.9). This is an example of sample path optimization [1]. Since the summands are log probabilities, (5.9) is usually found to be stable.

The "reverse" moniker is attached to (5.7) to differentiate it from the forward KL divergence,

$$\mathrm{KL}[p\,\|\,q] = \int \log\left[\frac{p(x)}{q(x; \theta)}\right]p(x)\,\mathrm{d}x. \quad (5.10)$$

Since presumably $p(x)$ is not easy to sample from, a stable "log space" formula analogous to (5.9) cannot be used to approximate (5.10).

Equation (5.8) shows that, if our minimization results in small KL divergence, then $F_\sharp^{-1}p$ is close to the well-conditioned unit Gaussian, in the sense of KL divergence. Unfortunately, this does not imply $F_\sharp^{-1}p$ has well-conditioned covariance. Indeed, if $p$ is Gaussian, then, with $\lambda_n$ the eigenvalues of the preconditioned covariance $WW^T$, one can check that, up to additive and multiplicative constants,

$$\mathrm{KL}[p\,\|\,q] \propto \|W\|_F^2 - \log|\det(WW^T)| = \sum_{n=1}^N\left(\lambda_n - \log\lambda_n\right),$$
$$\mathrm{KL}[q\,\|\,p] \propto \|W^{-1}\|_F^2 - \log|\det((WW^T)^{-1})| = \sum_{n=1}^N\left(\lambda_n^{-1} - \log\lambda_n^{-1}\right). \quad (5.11)$$

Clearly, minimizing forward or reverse KL is different than minimizing $\kappa$.

Here we consider preconditioning a Gaussian $N(0, C)$ with a diagonal matrix $D$. The covariance and $\kappa$ transform as in (5.2). Minimizing reverse KL over the set of diagonal matrices (see (5.11)) gives us

$$D_{ii}^2 = 1/(C^{-1})_{ii}, \quad (5.12)$$

while minimizing forward KL gives us

$$D_{ii}^2 = C_{ii}, \quad (5.13)$$

which is just a diagonal matrix whose entries are the component-wise variances. In this case, preconditioning is equivalent to re-scaling the axes so that $X$ has unit standard deviation.

As diagonal preconditioners, both forward and reverse KL exhibit a scale invariance, the proof of which follows directly from (5.12) and (5.13).
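A quick numerical comparison of (5.12), (5.13), and doing nothing, on an arbitrary random covariance (illustrative only; neither the target nor the dimension is from our experiments):

```python
import numpy as np

def kappa_spd(C):
    w = np.linalg.eigvalsh(C)
    return np.sqrt(w.max() * np.sqrt(np.sum(w ** -2.0)))

def precondition(C, d2):
    """Diagonal preconditioning: C -> D^{-1} C D^{-1} with D = diag(sqrt(d2))."""
    Dinv = np.diag(1.0 / np.sqrt(d2))
    return Dinv @ C @ Dinv

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 100))
C = A @ A.T / 100                             # a random SPD covariance

d2_fwd = np.diag(C)                           # forward KL, (5.13)
d2_rev = 1.0 / np.diag(np.linalg.inv(C))      # reverse KL, (5.12)

print("do nothing:", kappa_spd(C))
print("forward KL:", kappa_spd(precondition(C, d2_fwd)))
print("reverse KL:", kappa_spd(precondition(C, d2_rev)))
```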
Fig 7. Forward and reverse KL, preconditioner and preconditioned. The target is $p(x) \sim N(0, C_n)$, with $C_n$ from (5.14) (a single block, fixed $\rho$). The variational distribution is $q(z) \sim N(0, D_n^2)$, where $D_n$ is diagonal. Both forward and reverse KL were minimized to find $D_n$. Left: the one standard deviation iso-line of $q(z)$ is plotted. Both are circular due to symmetry, but forward KL chooses a larger sphere (of radius 1). Right: iso-lines after preconditioning by $D_n$ (which yields $\sim N(0, D_n^{-1}C_nD_n^{-1})$). Reverse KL results in a much larger preconditioned spectrum.

Lemma 5.2.
Suppose forward/reverse KL preconditioning of $X \sim N(0, C)$ results in preconditioned covariance $\Omega$. Then, for any positive diagonal matrix $\tilde D$, forward/reverse KL diagonal preconditioning of $X \sim N(0, \tilde DC\tilde D)$ also leads to preconditioned covariance $\Omega$.

Consider the following choices:
(i) Do not precondition, and sample directly from the target $p(x)$.
(ii) Precondition with $D^{-1}$, where $D$ is obtained by minimizing $\mathrm{KL}[q\,\|\,p]$ (i.e., reverse KL), for $q(x) \sim N(\mu, D^2)$.
(iii) Precondition with $D^{-1}$, where $D$ is obtained by minimizing $\mathrm{KL}[p\,\|\,q]$ (i.e., forward KL), for $q(x) \sim N(\mu, D^2)$.

In what follows, we will show realistic scenarios where each method is better than the other two. Before proceeding, we point out some practical considerations. Since forward KL is often unstable and cannot be minimized directly, (iii) is done by estimation of the component-wise standard deviation. If this must be done by sampling, then it somewhat defeats the purpose of preconditioning. Regarding (ii), setting up a variational problem is not too hard once the target $p(x)$ is built, and software packages exist to make this easier [11]. This does, however, incur a one time development cost that may be too great for the problem at hand.

A simple covariance comprised of 2x2 blocks provides a demonstration of cases where forward KL preconditions better than reverse KL, which, depending on the blocks, performs better or worse than doing nothing. We also see that neither forward nor reverse KL is optimal. For $\rho_n \in (0, 1)$, with $\rho_1 \ge \rho_n$ for all $n$, let

$$C = C_1 \oplus C_2 \oplus \cdots \oplus C_N, \qquad C_n := \begin{pmatrix}1 & \rho_n\\ \rho_n & 1\end{pmatrix}, \quad (5.14)$$

which has eigenvalues $\{1 \pm \rho_n : n = 1, \ldots, N\}$. By symmetry, the optimal preconditioner $D$, for forward or reverse KL, will be partitioned as

$$D = D_1 \oplus \cdots \oplus D_N, \qquad D_n := \begin{pmatrix}d_n & 0\\ 0 & d_n\end{pmatrix}.$$

This leads to the preconditioned covariance

$$D^{-1}CD^{-1} = d_1^{-2}C_1 \oplus \cdots \oplus d_N^{-2}C_N,$$

with spectrum

$$\Lambda := \left\{\frac{1 + \rho_1}{d_1^2}, \frac{1 - \rho_1}{d_1^2}, \cdots, \frac{1 + \rho_N}{d_N^2}, \frac{1 - \rho_N}{d_N^2}\right\}.$$

Thus, each pair of eigenvalues, $\{1 \pm \rho_n\}$, is moved up and down together by the preconditioner. As seen below and in figure 8, the optimal preconditioner moves the larger of the two from every block to the same level. Reverse KL to some extent does the opposite, moving the smaller of each pair to a similar level. This is expected since, as illustrated in figure 7, the scale of the reverse KL variational solution is mostly determined by the smallest scale of the target.

In the forward KL, reverse KL, and optimal choices of $d_n$ that follow, $(1 + \rho_1)/d_1^2 \ge (1 \pm \rho_n)/d_n^2$, for $n = 2, \ldots, N$. Therefore, all three choices will have

$$\kappa(\Lambda)^4 = \sum_{n=1}^N\left(\frac{d_n}{d_1}\right)^4\left[\left(\frac{1 + \rho_1}{1 + \rho_n}\right)^2 + \left(\frac{1 + \rho_1}{1 - \rho_n}\right)^2\right]. \quad (5.15)$$

Referring to (5.12) and (5.13), the minimizing $d_n^2$ for forward KL will be identically 1, and for reverse KL will be $1 - \rho_n^2$. Thus

$$\kappa(\Lambda_{\mathrm{rev}})^4 = \sum_{n=1}^N\left(\frac{1 - \rho_n^2}{1 - \rho_1^2}\right)^2\left[\left(\frac{1 + \rho_1}{1 + \rho_n}\right)^2 + \left(\frac{1 + \rho_1}{1 - \rho_n}\right)^2\right],$$
$$\kappa(\Lambda_{\mathrm{fwd}})^4 = \sum_{n=1}^N\left[\left(\frac{1 + \rho_1}{1 + \rho_n}\right)^2 + \left(\frac{1 + \rho_1}{1 - \rho_n}\right)^2\right].$$
Fig 8. Preconditioned spectra in the case of 5 correlated 2x2 blocks. Preconditioner choices (OPTIMAL vs. REV KL) are plotted against the target, which is the same as FWD KL preconditioning. The spectrum comes in pairs of eigenvalues, which are scaled together by the preconditioners. Left: the optimal preconditioner pushes the larger member of each pair to 1. Right:
reverse KL (almost) pushes the smaller of each pair to 1.

Since $(1 - \rho_n^2)/(1 - \rho_1^2) \ge 1$, it follows that $\kappa(\Lambda_{\mathrm{fwd}}) \le \kappa(\Lambda_{\mathrm{rev}})$. Minimizing $\kappa$ over all $d_n$ we find $d_n^2 \propto 1 + \rho_n$, so that

$$\kappa(\Lambda_{OPT})^4 = \sum_{n=1}^N\left[1 + \left(\frac{1 + \rho_n}{1 - \rho_n}\right)^2\right].$$

Since reverse KL is a practical method, it is disappointing to see that doing nothing (forward KL had $D = I$) performs better. In practice, however, the situation is often closer to

$$C_n := \gamma_n^2\begin{pmatrix}1 & \rho_n\\ \rho_n & 1\end{pmatrix},$$

for $\gamma_1 \ge \cdots \ge \gamma_N > 0$. The results for forward and reverse KL will be the same as if $\gamma_n \equiv 1$, while doing nothing now leads to (with $\beta := \max_n\{\gamma_n^2(1 + \rho_n)\}$)

$$\kappa^4 = \sum_{n=1}^N\left[\left(\frac{\beta}{\gamma_n^2(1 + \rho_n)}\right)^2 + \left(\frac{\beta}{\gamma_n^2(1 - \rho_n)}\right)^2\right] \ge \sum_{n=1}^N\left(\frac{\gamma_1}{\gamma_n}\right)^4\left[\left(\frac{1 + \rho_1}{1 + \rho_n}\right)^2 + \left(\frac{1 + \rho_1}{1 - \rho_n}\right)^2\right].$$

So if $\gamma_1 \gg \gamma_n$ is large enough, preconditioning with reverse KL does improve upon doing nothing. Forward KL would still be superior.

Here we compare preconditioner options (forward or reverse KL, or "do nothing") applied to 100x100 random matrices of different types. Each option is best some fraction of the time. The Wishart random matrices are constructed by (i) letting $A \in \mathbb{R}^{100\times 100}$ be composed of i.i.d. unit normal entries, then (ii) setting $C := AA^T$. The inverse Wishart matrices are constructed by inverting a Wishart matrix. The rotated scale matrices are constructed by (i) making the scale matrix $\Lambda := \mathrm{Diag}(\sigma_1, \ldots, \sigma_{100})$, where $\{\sigma_1, \ldots, \sigma_{100}\} = f(\{1, \ldots, 100\}; m = 1, M = 5, \beta = 4, c = \text{cutoff})$ from (3.4), then (ii) rotating with a random orthogonal matrix generated with scipy.stats.ortho_group, i.e. $C := U\Lambda U^T$ [29]. Three varieties of rotated scale matrix were constructed by choosing cutoff = 0.05, 0.1, 0.2. The percentage of the time each option achieved the lowest $\kappa$ is shown in table 1. The Wishart/inverse Wishart matrices have a fairly regular structure, and forward KL is the "winner" more and more often as $N$ increases. The results for rotated scale matrices varied considerably as the dimension or the parameters $(M, \beta)$ were changed. We warn the reader not to draw broad conclusions or trends from table 1.
Table 1. For Wishart, inverse Wishart, and rotated scale (RS) random 100x100 matrices, the percentage of the time each preconditioning method had the lowest $\kappa$ is tabulated. The magnitude of the differences was small, about 10% at most.

              Wishart   InvWishart   RS (5%)   RS (10%)   RS (20%)
  Do Nothing     3%         0%          0%       20%        100%
  Fwd KL        91%       100%         88%        1%          0%
  Rev KL         6%         0%         12%       78%          0%

As demonstrated in figure 2, having a few large eigenvalues and many small ones is especially bad for $\kappa$. To mitigate these situations, we consider a low-rank update to a diagonal preconditioner. Specifically, we choose the variational distribution $q \sim N(0, FF^T)$, where $F = D + UU^T$, $D \in \mathbb{R}^{N\times N}$ is diagonal, and $U \in \mathbb{R}^{N\times K}$. Both $D$ and $U$ were trained to minimize $\mathrm{KL}[q\,\|\,p]$, where $p \sim N(0, LL^T)$ and $L$ is circulant. This provides some correlation that cannot be matched with a diagonal preconditioner. The spectrum of $L$ was chosen using (3.4) so that it was a low pass filter with some cutoff. As expected, when the rank of $U$ was larger than the cutoff, preconditioning worked. When the rank of $U$ was less than the cutoff, large eigenvalues remained and $\kappa$ was barely reduced by preconditioning. See figure 9.

A word of caution: we also found that when the circulant matrix $L$ had very small eigenvalues, the corresponding extreme correlations in $X \sim p(x)$ were too much for the optimization procedure to handle, and instabilities arose.
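In the experiments, $D$ and $U$ are trained by minimizing $\mathrm{KL}[q\,\|\,p]$. As a simpler illustration of why the rank of $U$ matters, the sketch below skips the training and builds $F = D + UU^T$ directly from the top-$K$ eigenpairs of a known target covariance (a shortcut, not the optimization procedure used for figure 9): $\kappa$ collapses once $K$ reaches the number of large eigenvalues.

```python
import numpy as np

def kappa_spd(C):
    w = np.linalg.eigvalsh(C)
    return np.sqrt(w.max() * np.sqrt(np.sum(w ** -2.0)))

rng = np.random.default_rng(3)
N, n_big = 100, 10
# Target covariance with n_big large eigenvalues and many small ones.
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
spec = np.concatenate([np.full(n_big, 100.0), np.full(N - n_big, 1.0)])
Sigma = (Q * spec) @ Q.T

w, V = np.linalg.eigh(Sigma)                         # ascending eigenvalues
for K in [0, 5, 10, 20]:
    top, s = V[:, N - K:], np.sqrt(w[N - K:])        # top-K eigenpairs
    # F = D + U U^T with D = I_N; the update matches sqrt(Sigma) on the top-K space.
    F = np.eye(N) + top @ np.diag(s - 1.0) @ top.T
    Finv = np.linalg.inv(F)
    print(K, kappa_spd(Finv @ Sigma @ Finv.T))       # large until K >= n_big
```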
Fig 9. Preconditioning with a low-rank update. Normalized covariance spectra of the target $p \sim N(0, LL^T)$, the variational model $q \sim N(0, D + UU^T)$, and the preconditioned target. Left: $LL^T$ had 10 large eigenvalues, and $U$ was a rank 20 update. The trained preconditioner $D + UU^T$ had about 10 large eigenvalues, which were used to reduce the largest eigenvalues in the preconditioned matrix. Right: here $LL^T$ had about 40 large eigenvalues, and the rank 20 update could not reduce the size of them all.
6. Proof of Convergence Results
In this section we prove theorem 3.1 and corollary 3.1. Denote by $\pi_N$ the density of $T_N := \sigma_{N1}T$ (recall $T \sim \pi$). That is,

$$\pi_N(t) := \pi\left(\frac{t}{\sigma_{N1}}\right)\cdot\frac{1}{\sigma_{N1}}.$$

6.1. Step size estimates

First, we show (3.10). To that end, the distinction between $h_N$ and $\bar h_N$ is the replacement of $\int\sin^2(\cdot)\,\mathrm{d}t$ by its average, $1/2$. Let

$$R_{Nn} := \frac{1}{2} - \int\sin^2\left(\frac{t}{\sigma_{Nn}}\right)\pi_N(t)\,\mathrm{d}t = \frac{1}{4}\left[\hat\pi\left(\frac{2\sigma_{N1}}{\sigma_{Nn}}\right) + \hat\pi\left(-\frac{2\sigma_{N1}}{\sigma_{Nn}}\right)\right],$$

which, due to (3.8), satisfies

$$|R_{Nn}| \le \frac{C_\pi}{2} < \frac{1}{2}. \quad (6.1)$$

This means

$$\left|\bar h_N^{-4} - h_N^{-4}\right| \le C_\pi\,\bar h_N^{-4},$$

and therefore

$$(1 + C_\pi)^{-1/4}\,\bar h_N \le h_N \le (1 - C_\pi)^{-1/4}\,\bar h_N,$$

which is exactly (3.10).

Second, we show the uniform limit

$$\lim_{N\to\infty}\frac{h_N}{\sigma_{Nn}} \le \lim_{N\to\infty}\frac{h_N}{\sigma_{NN}} = 0. \quad (6.2)$$

To that end, since $\sigma_{NN} \le \sigma_{Nn}$ and $\sigma_{NN}^{-4} \le \left(\sum_n\sigma_{Nn}^{-8}\right)^{1/2}$, we have

$$\left(\frac{h_N}{\sigma_{Nn}}\right)^4 \le \left(\frac{h_N}{\sigma_{NN}}\right)^4 = h_N^4\,\sigma_{NN}^{-4} \le h_N^4\left(\sum_{n=1}^N\sigma_{Nn}^{-8}\right)^{1/2}.$$

Due to (3.10), we may replace $h_N$ by $\bar h_N$ and suffer only a constant depending on $\alpha$ and $C_\pi$. From the definition of $\bar h_N$, we may replace $\bar h_N^4$ with $\left(\sum_{n=1}^N\sigma_{Nn}^{-4}\right)^{-1}$ and suffer only a constant depending on $\alpha$. We therefore have (with $C'$ depending only on $\alpha$ and $C_\pi$)

$$h_N^4\left(\sum_{n=1}^N\sigma_{Nn}^{-8}\right)^{1/2} \le C'\left(\sum_{n=1}^N\sigma_{Nn}^{-8}\right)^{1/2}\left(\sum_{n=1}^N\sigma_{Nn}^{-4}\right)^{-1},$$

which tends to zero due to our assumption (3.11), and so too then does $h_N/\sigma_{Nn}$.

6.2. Proof of theorem 3.1
Proof. Since, in equilibrium, the performance of HMC with momentum term $\|\xi\|^2/2$ is invariant under isometries, we may assume the covariance is $\mathrm{Diag}(\sigma_{N1}^2, \ldots, \sigma_{NN}^2)$. Leapfrog integration will act independently on each component. Moreover, in equilibrium, position and momentum samples from each component are independent.

We first consider the case of one component with variance $\sigma^2$, and hide the dependence on $N$. Each leapfrog step is an iteration of the matrix

$$U_h := \begin{pmatrix}1 - \frac{h^2}{2\sigma^2} & h\\[2pt] -\frac{h}{\sigma^2}\left(1 - \frac{h^2}{4\sigma^2}\right) & 1 - \frac{h^2}{2\sigma^2}\end{pmatrix},$$

which has eigenvalues and eigenvectors

$$\lambda_\pm := 1 - \frac{h^2}{2\sigma^2} \pm i\,\frac{h}{\sigma}\sqrt{1 - \frac{h^2}{4\sigma^2}}, \qquad v_\pm := \left(1,\; \pm\frac{i}{\sigma}\sqrt{1 - \frac{h^2}{4\sigma^2}}\right).$$

If $h/(2\sigma) < 1$, the eigenvalues have modulus 1 and the iteration is stable. Then, by diagonalizing, one can show

$$U_h^\ell = \begin{pmatrix}\cos(\ell\theta) & \gamma^{-1}\sin(\ell\theta)\\ -\gamma\sin(\ell\theta) & \cos(\ell\theta)\end{pmatrix}, \quad\text{where}\quad \gamma := \sqrt{\frac{1}{\sigma^2} - \frac{h^2}{4\sigma^4}}, \qquad \theta := \cos^{-1}\left(1 - \frac{h^2}{2\sigma^2}\right).$$

To compute the Hamiltonian after $\ell$ steps, we apply $U_h^\ell$ to the starting point $(x_0, \xi_0)$, then plug into $H(x, \xi) = x^2/(2\sigma^2) + \xi^2/2$:

$$H(U_h^\ell(x_0, \xi_0)) = \cos^2(\ell\theta)\left(\frac{x_0^2}{2\sigma^2} + \frac{\xi_0^2}{2}\right) + \sin^2(\ell\theta)\left(\gamma^2\sigma^2\,\frac{x_0^2}{2\sigma^2} + \frac{1}{\gamma^2\sigma^2}\,\frac{\xi_0^2}{2}\right) + \cos(\ell\theta)\sin(\ell\theta)\left(\frac{1}{\gamma\sigma} - \gamma\sigma\right)\frac{x_0\xi_0}{\sigma}. \quad (6.3)$$

Define

$$\chi := \left(\frac{h}{2\sigma}\right)^4\cdot\frac{1}{1 - \left(\frac{h}{2\sigma}\right)^2},$$

then, using the relations

$$\gamma^2\sigma^2 + \frac{1}{\gamma^2\sigma^2} = 2 + \chi, \qquad \frac{1}{\gamma\sigma} - \gamma\sigma = \sqrt{\chi},$$

and the fact that the initial Hamiltonian is $x_0^2/(2\sigma^2) + \xi_0^2/2$, we find

$$\delta_\ell := \frac{\sin^2(\ell\theta)}{2}\left(\frac{h}{2\sigma}\right)^2\left(\xi_0^2 - \frac{x_0^2}{\sigma^2}\right) + \frac{\sin^2(\ell\theta)}{2}\,\chi\,\xi_0^2 + \cos(\ell\theta)\sin(\ell\theta)\sqrt{\chi}\,\frac{x_0\xi_0}{\sigma}. \quad (6.4)$$

This has the bound

$$|\delta_\ell| \le \frac{1}{2}\left(\frac{h}{2\sigma}\right)^2\left(\xi_0^2 + \frac{x_0^2}{\sigma^2}\right) + \frac{\chi}{2}\,\xi_0^2 + \sqrt{\chi}\,\frac{|x_0\xi_0|}{\sigma}. \quad (6.5)$$

To compute moments, use the fact that, in equilibrium, $x_0 \sim N(0, \sigma^2)$ and $\xi_0 \sim N(0, 1)$ are independent:

$$\mathbb{E}\{\delta_\ell\} = \frac{\sin^2(\ell\theta)}{2}\left(\frac{h}{2\sigma}\right)^4 + R_1(\delta_\ell)\left(\frac{h}{2\sigma}\right)^6, \qquad \mathrm{Var}\{\delta_\ell\} = 2\,\mathbb{E}\{\delta_\ell\} + R_2(\delta_\ell)\left(\frac{h}{2\sigma}\right)^6,$$

where there exists a constant $C < \infty$, uniform in $(\sigma, \ell, \theta)$ (so long as $h < \sigma$), such that $|R_j| \le C$.

Re-introducing dependence on $N$ and $n$, and setting $\ell = T/h_N$ for random integration time $T \sim \pi_N$ (we assume $\ell$ is an integer; if not, minor adjustments are needed), we have

$$\mathbb{E}\{\Delta_N\} = \frac{1}{2}\sum_{n=1}^N\left(\frac{h_N}{2\sigma_{Nn}}\right)^4\int\sin^2\left(\frac{t}{h_N}\cos^{-1}\left(1 - \frac{h_N^2}{2\sigma_{Nn}^2}\right)\right)\pi_N(t)\,\mathrm{d}t + \sum_{n=1}^N\left(\frac{h_N}{2\sigma_{Nn}}\right)^6\int R_1(\delta_{N,n}^\ell)\,\pi_N(t)\,\mathrm{d}t. \quad (6.6)$$

The second term is bounded in absolute value by a constant times

$$\left(\frac{h_N}{\sigma_{NN}}\right)^2\,h_N^4\sum_{n=1}^N\sigma_{Nn}^{-4},$$

which tends to zero due to (6.2) and (3.10). As for the first term, upon solving $1 - \varepsilon = \cos(y)$ we have the relation $\cos^{-1}(1 - \varepsilon) = \sqrt{2\varepsilon} + O(\varepsilon^{3/2})$. This means

$$\sin^2\left(\frac{t}{h_N}\cos^{-1}\left(1 - \frac{h_N^2}{2\sigma_{Nn}^2}\right)\right) = \sin^2\left(\frac{t}{h_N}\left[\frac{h_N}{\sigma_{Nn}} + O\left(\left(\frac{h_N}{\sigma_{Nn}}\right)^3\right)\right]\right) = \sin^2\left(\frac{t}{\sigma_{Nn}}\right) + t\cdot O\left(\frac{h_N^2}{\sigma_{Nn}^3}\right),$$

where the $O(\cdot)$ term is bounded (uniformly in $N$ and $n$) by a constant times its argument. Since $\int t\,\pi_N(t)\,\mathrm{d}t = \sigma_{N1}\,\mathbb{E}\{T\}$, we have

$$\lim_{N\to\infty}\mathbb{E}\{\Delta_N\} = \lim_{N\to\infty}\frac{1}{2}\sum_{n=1}^N\left(\frac{h_N}{2\sigma_{Nn}}\right)^4\left\{\int\sin^2\left(\frac{t}{\sigma_{Nn}}\right)\pi_N(t)\,\mathrm{d}t + O\left(\frac{\sigma_{N1}h_N^2}{\sigma_{Nn}^3}\right)\right\} = \frac{\alpha}{2} + \lim_{N\to\infty}\sum_{n=1}^N\left(\frac{h_N}{2\sigma_{Nn}}\right)^4 O\left(\frac{\sigma_{N1}h_N^2}{\sigma_{Nn}^3}\right). \quad (6.7)$$

The first term is the desired limit. The second term tends to zero upon replacement of $h_N^4$ by a constant times $\left(\sum_n\sigma_{Nn}^{-4}\right)^{-1}$ (as in section 6.1) and the use of assumption (3.11).

The steps for the variance are similar, and yield

$$\lim_{N\to\infty}\mathrm{Var}\{\Delta_N\} = \alpha.$$

The normal limit follows after verifying the Lindeberg condition [13]: for all $\varepsilon > 0$,

$$\lim_{N\to\infty}\sum_{n=1}^N\mathrm{Var}\left\{\delta_{Nn}^\ell\,;\;\left|\delta_{Nn}^\ell - \mathbb{E}\{\delta_{Nn}^\ell\}\right| > \varepsilon\right\} = 0. \quad (6.8)$$

One can check that $\delta_{Nn}^\ell - \mathbb{E}\{\delta_{Nn}^\ell\}$ is bounded by a term similar to (6.5), which tends uniformly to zero, so (6.8) follows.
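The single-component computations above can be verified numerically: iterate $U_h$ on equilibrium initial conditions and compare the empirical mean energy error with the exact $\mathbb{E}\{\delta_\ell\} = \frac{1}{2}\sin^2(\ell\theta)\,\chi$ implied by (6.4). The parameter values in this sketch are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, h, n_steps, n_chains = 2.0, 1.0, 3, 4_000_000

# One leapfrog step for H = x^2/(2 sigma^2) + xi^2/2, as the matrix U_h.
U = np.array([[1 - h**2 / (2 * sigma**2), h],
              [-(h / sigma**2) * (1 - h**2 / (4 * sigma**2)),
               1 - h**2 / (2 * sigma**2)]])
Ul = np.linalg.matrix_power(U, n_steps)

x0 = sigma * rng.standard_normal(n_chains)       # equilibrium positions
xi0 = rng.standard_normal(n_chains)              # equilibrium momenta
x1, xi1 = Ul @ np.vstack([x0, xi0])
delta = (x1**2 - x0**2) / (2 * sigma**2) + (xi1**2 - xi0**2) / 2

theta = np.arccos(1 - h**2 / (2 * sigma**2))
chi = (h / (2 * sigma))**4 / (1 - (h / (2 * sigma))**2)
expected = 0.5 * np.sin(n_steps * theta)**2 * chi
print(delta.mean(), expected)                    # agree to Monte Carlo error
print(delta.var(), 2 * expected)                 # Var ~ 2 * mean, as claimed
```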
Proof of corollary 3.1. The simpler step size amounts to replacing the integrals $\int\sin^2(t/\sigma_{Nn})\,\pi_N(t)\,\mathrm{d}t$ in (6.7) with $1/2$. This is implied by sufficient decay of the remainder $R_{Nn}$ in (6.1). Indeed, $|\hat\pi(\omega)| \le C|\omega|^{-\delta}$ implies

$$|R_{Nn}| \le C\left(\frac{\sigma_{Nn}}{\sigma_{N1}}\right)^\delta.$$

This means (with $\lesssim$ denoting $\le$ up to a constant depending only on $\alpha$ and $C_\pi$)

$$\left|\frac{\bar h_N^4}{h_N^4} - 1\right| = \bar h_N^4\left|\bar h_N^{-4} - h_N^{-4}\right| \lesssim \bar h_N^4\sum_{n=1}^N\sigma_{Nn}^{-4}\left(\frac{\sigma_{Nn}}{\sigma_{N1}}\right)^\delta.$$

For any given $K$,

$$\bar h_N^4\sum_{n=1}^N\sigma_{Nn}^{-4}\left(\frac{\sigma_{Nn}}{\sigma_{N1}}\right)^\delta = \bar h_N^4\sum_{n=1}^K\sigma_{Nn}^{-4}\left(\frac{\sigma_{Nn}}{\sigma_{N1}}\right)^\delta + \bar h_N^4\sum_{n=K+1}^N\sigma_{Nn}^{-4}\left(\frac{\sigma_{Nn}}{\sigma_{N1}}\right)^\delta \le \bar h_N^4\sum_{n=1}^K\sigma_{Nn}^{-4} + \left(\frac{\sigma_{NK}}{\sigma_{N1}}\right)^\delta\,\bar h_N^4\sum_{n=K+1}^N\sigma_{Nn}^{-4} \lesssim K\,\frac{\bar h_N^4}{\sigma_{NK}^4} + \left(\frac{\sigma_{NK}}{\sigma_{N1}}\right)^\delta.$$

As $N \to \infty$, the first term tends to zero due to (6.2). The second is bounded by $r_K^\delta$ by our hypothesis. Therefore,

$$\lim_{N\to\infty}\left|\frac{\bar h_N^4}{h_N^4} - 1\right| \lesssim r_K^\delta.$$

Since $K$ was arbitrary and $r_K \to 0$, we have shown $\bar h_N/h_N \to 1$.

Acknowledgements
The authors would like to acknowledge the value of many conversations with members of the TensorFlow Probability team, in particular (in alphabetical order!) Josh Dillon, Matt Hoffman, Pavel Sountsov, and Srinivas Vasudevan.

References
[1] Amaran, S., Sahinidis, N. V., Sharda, B. and Bury, S. J. (2016). Simulation optimization: a review of algorithms and applications. Ann. Oper. Res.
[2] Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Stat. Comput.
[3] Bai, Z. D. (1999). Methodologies in spectral analysis of large dimensional random matrices, a review. Statistica Sinica.
[4] Bales, B., Pourzanjani, A., Vehtari, A. and Petzold, L. (2019). Selecting the Metric in Hamiltonian Monte Carlo. arXiv preprint.
[5] Beskos, A., Pillai, N., Roberts, G., Sanz-Serna, J.-M. and Stuart, A. (2013). Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli.
[6] Betancourt, M. (2017). A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv preprint.
[7] Betancourt, M. J., Byrne, S. and Girolami, M. (2014). Optimizing The Integrator Step Size for Hamiltonian Monte Carlo. arXiv preprint.
[8] Bose, A., Gangopadhyay, S. and Sen, A. (2010). Limiting spectral distribution of XX' matrices. Ann. Inst. Henri Poincaré Probab. Stat.
[9] Bose, A. and Sen, A. (2008). Another look at the moment method for large dimensional random matrices. Electron. J. Probab.
[10] Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. and Riddell, A. (2017). Stan: A Probabilistic Programming Language.
[11] Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M. and Saurous, R. A. (2017). TensorFlow Distributions.
[12] Duane, S., Kennedy, A. D., Pendleton, B. J. and Roweth, D. (1987). Hybrid Monte Carlo.
[13] Durrett, R. (2010). Probability: Theory and Examples. Cambridge University Press.
[14] Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
[15] Hansen, P. C. (1988). Computation of the singular value expansion. Computing.
[16] Hoffman, M., Sountsov, P., Dillon, J. V., Langmore, I., Tran, D. and Vasudevan, S. (2019). NeuTra-lizing Bad Geometry in Hamiltonian Monte Carlo Using Neural Transport. arXiv preprint.
[17] Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research.
[18] Horn, R. A. and Johnson, C. R. (1990). Matrix Analysis. Cambridge University Press.
[19] Kaipio, J. and Somersalo, E. (2006). Statistical and Computational Inverse Problems. Springer Science & Business Media.
[20] Lao, J., Suter, C., Langmore, I., Chimisov, C., Saxena, A., Sountsov, P., Moore, D., Saurous, R. A., Hoffman, M. D. and Dillon, J. V. (2020). Modern Markov Chain Monte Carlo Tools Built for Modern Hardware.
[21] Leimkuhler, B. and Reich, S. (2005). Simulating Hamiltonian Dynamics. Cambridge University Press.
[22] Mackenzie, P. B. (1989). An improved hybrid Monte Carlo method.
[23] Marčenko, V. A. and Pastur, L. A. (1967). Distribution of Eigenvalues for some Sets of Random Matrices.
[24] Neal, R. M. (2012). MCMC using Hamiltonian dynamics. arXiv preprint.
[25] Parno, M. and Marzouk, Y. (2018). Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification.
[26] Peherstorfer, B. and Marzouk, Y. (2019). A transport-based multifidelity preconditioner for Markov chain Monte Carlo. Adv. Comput. Math.
[27] Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods.
[28] Silverstein, J. W. (1985). The Smallest Eigenvalue of a Large Dimensional Wishart Matrix. Ann. Probab.