Incomplete Reparameterizations and Equivalent Metrics
Michael Betancourt
Abstract.
Reparameterizing a probabilistic system is common advice for improving the performance of a statistical algorithm like Markov chain Monte Carlo, even though in theory such reparameterizations should leave the system, and the performance of any algorithm, invariant. In this paper I show how the reparameterizations common in practice are only incomplete reparameterizations, which result in different interactions between a target probabilistic system and a given algorithm. I then consider how these changing interactions manifest in the context of Markov chain Monte Carlo algorithms defined on Riemannian manifolds. In particular I show how any incomplete reparameterization is equivalent to modifying the metric geometry directly.
Michael Betancourt is the principal research scientist at Symplectomorphic, LLC (e-mail: [email protected]).
In practice probabilistic systems are implemented in a specific parameterization of the target space, with the target probability distribution specified by its representative probability density function. Reparameterizing the target space modifies this probability density function but not the probability distribution that it represents. In other words any probabilistic computation should yield equivalent results no matter the parameterization used.

That said, reparameterizations are known to alter the performance of algorithms that implement those probabilistic computations, suggesting that the interaction between the target probability distribution and the algorithm is not invariant. These interactions, and how they relate to a given parameterization, are often opaque and difficult to understand. Developing explicit criteria to identify which parameterization yields the highest performance for a given circumstance is particularly challenging.

The situation improves when the target space is a Riemannian manifold and the algorithm in question exploits that Riemannian structure, as is common for Markov chain Monte Carlo methods. Here we can construct and then analyze a comprehensive geometry that encompasses both the target probabilistic system and the algorithmic system. In particular a geometric analysis reveals that the reparameterizations employed in practice are only incomplete reparameterizations, modifying the target geometry but not the algorithmic geometry and hence changing the relationship between the two.

In this paper I formalize the effect of incomplete reparameterizations for Markov chain Monte Carlo algorithms defined on Riemannian manifolds and construct an implicit criterion for the optimal reparameterization for a given target distribution. I begin by reviewing the basics of Riemannian geometry, and Markov transitions that exploit Riemannian geometry, before demonstrating the duality between reparameterizations of the target space and equivalent metric geometries and introducing a heuristic criterion to identify optimal reparameterizations. Finally I apply these results to latent Gaussian models and their common centered and non-centered parameterizations.
1. RIEMANNIAN MANIFOLDS
For the rest of this paper I will assume familiarity with the basics of differential geometry. Part I of Baez and Muniain (1994) provides an accessible introduction, with Lee (2013) giving a more thorough reference for the concepts and notation that I will use here.

Let our target distribution be defined on a $D$-dimensional smooth manifold, $Q$, with local coordinate functions denoted $\{ q^1(q), \ldots, q^D(q) \}$.

The tangent space at each point, $T_q Q$, is a vector space over the real numbers whose elements can be associated with equivalence classes of one-dimensional curves sharing the same velocity at that point (Figure 1). Local coordinate functions induce a basis within each tangent space given by the velocities of each coordinate function, which we denote by the partial derivatives,
\[ \left\{ \frac{\partial}{\partial q^1}, \ldots, \frac{\partial}{\partial q^D} \right\} \equiv \{ \partial_1, \ldots, \partial_D \}. \]

Fig 1. Each tangent space $T_q Q$ is a $D$-dimensional vector space associated with a point $q \in Q$. If we embed the manifold $Q$ in a higher-dimensional space then we can interpret the tangent space as a plane fixed to $q$ and tangent to the manifold at that point of connection.

All of the tangent spaces in a manifold stitch together to define a $2D$-dimensional manifold with a canonical projection back down to the base space, $\varpi : TQ \to Q$, called the tangent bundle. Vector fields are sections of this bundle, specifying a vector within each tangent space, $v : Q \to TQ$. The space of all vector fields on $Q$ is denoted $\Gamma(Q)$.

Fig 2. The tangent bundle $TQ$ is given by weaving together the tangent spaces attached at each point in the base manifold, $Q$. Here the base manifold is one-dimensional and each tangent space can be represented with a one-dimensional line. Each point in the tangent bundle is identified by a point in the base space, $q \in Q$, and a point in the corresponding tangent space, $v \in T_q Q$. The tangent bundle is equipped with a natural projection operator, $\varpi : TQ \to Q$, that maps each point in the tangent space back to the associated point in the base space.

Similarly the cotangent space at each point, $T^*_q Q$, is a vector space over the real numbers whose elements can be associated with equivalence classes of real-valued functions with the same first-order differential behavior. These covectors are also dual to vectors of the tangent space, with each covector mapping a vector to a real number and vice versa. Within a local chart the coordinate functions define a basis for the cotangent space given by the gradients of the coordinate functions, $\{ \mathrm{d} q^1, \ldots, \mathrm{d} q^D \}$.

As with the tangent spaces, all of the cotangent spaces can be woven together to give the cotangent bundle, $\varpi^* : T^*Q \to Q$. Covector fields, or one-forms, are sections of this bundle, specifying a covector within each cotangent space, $\alpha : Q \to T^*Q$.

Without any additional structure a manifold $Q$ isn't particularly rigid; there is little structure within each tangent space, let alone between tangent spaces. In order to rigidify the manifold, and elevate it to a Riemannian manifold, we need to equip it with additional structure. In particular we need to specify a Riemannian metric and a linear connection, which allow us to compare vectors within a single tangent space as well as vectors in different tangent spaces. Their structure also gives rise to geodesics and the ability to flow through the manifold.

For a thorough introduction to Riemannian manifolds see Lee (2018). In the next few sections I will review the basic concepts that we will need to construct Markov transitions on Riemannian manifolds.
1.1 Riemannian Metrics

A Riemannian metric is a positive-definite map taking two vector fields to the real numbers,
\[ g : \Gamma(Q) \times \Gamma(Q) \to \mathbb{R} \quad (v, u) \mapsto g(v, u), \]
such that $g(u, v) = g(v, u)$ for all $u, v \in \Gamma(Q)$, with $g(u, u) \geq 0$ and $g(u, u) = 0$ only if $u = 0$.

Within each tangent space the metric induces an inner product,
\[ g_q : T_q Q \times T_q Q \to \mathbb{R} \quad (v, u) \mapsto g_q(v, u), \]
which allows us to orient vectors relative to each other. In particular the length of a vector is defined by
\[ ||v|| = \sqrt{g_q(v, v)} \]
while the angle between two vectors is defined by
\[ \cos \theta = \frac{g_q(u, v)}{||u|| \, ||v||}. \]
If $g_q(u, v) = 0$ then the vectors are said to be perpendicular or orthogonal. These concepts allow us to define, for example, orthonormal bases within a tangent space such that each basis vector has unit length and is orthogonal to every other basis vector (Figure 3).

Fig 3. A metric adds rigidity to each tangent space, defining amongst other things orthogonal bases of vectors that coordinate the vector spaces.

One powerful feature of a metric is its ability to transform vector fields into covector fields and vice versa. In particular, given a vector field $v$ we can define a corresponding covector field $v^\flat$ as the covector field satisfying
\[ v^\flat(u) = g(u, v) \]
for any vector field $u$. The inverse of this map takes any covector field $\alpha$ into a corresponding vector field, $\alpha^\sharp$. This inverse transformation can then be used to define an inverse metric over covector fields as
\[ g^{-1}(\alpha, \beta) = g(\alpha^\sharp, \beta^\sharp). \]
By construction any metric is isomorphic to an element of the symmetric tensor product $T^*Q \otimes T^*Q$. We can use this equivalence to represent a given metric in a local coordinate basis with the $D^2$ component functions
\[ g(q) = g_{ij}(q) \, \mathrm{d} q^i \otimes \mathrm{d} q^j. \]
The inverse metric is similarly locally represented by the component functions
\[ g^{-1}(q) = g^{ij}(q) \, \partial_i \otimes \partial_j, \]
where $g^{ij}(q)$ is the matrix inverse satisfying $g^{ij}(q) \cdot g_{jk}(q) = \delta^i_k$ for all $q \in Q$.

If a manifold admits an atlas such that the metric components equal the identity matrix, $g_{ij} = \delta_{ij}$, in every chart then the manifold is said to be Euclidean. Unfortunately this definition is not how the term "Euclidean" is used colloquially in statistics. There "Euclidean" denotes any algorithm with constant metric component functions and "Riemannian" is used to denote a more general algorithm that exploits position-dependent metric component functions. This distinction between component functions, however, is a property of the atlas being used, in particular the parameterizations within the local charts, and not the inherent structure of the manifold. What makes a manifold Euclidean is not that its local metric component functions are constant but rather that they can be made constant with some choice of local parameterizations. For example any algorithm defined on the real numbers is geometrically a Euclidean algorithm no matter the parameterization used.
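To make the coordinate representation concrete, the following minimal NumPy sketch (an illustration under simplifying assumptions, not part of the original development) treats the metric at a single point as a symmetric positive-definite matrix and implements the induced inner product, lengths, angles, and the flat and sharp maps; the particular component values are arbitrary.

\begin{verbatim}
import numpy as np

# Metric components g_ij at a fixed point q, as a symmetric
# positive-definite matrix (an arbitrary illustrative choice).
g = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def inner(g, v, u):
    # Inner product g_q(v, u) = g_ij v^i u^j.
    return v @ g @ u

def norm(g, v):
    # Length of a tangent vector, ||v|| = sqrt(g_q(v, v)).
    return np.sqrt(inner(g, v, v))

def angle(g, v, u):
    # Angle from cos(theta) = g_q(v, u) / (||v|| ||u||).
    return np.arccos(inner(g, v, u) / (norm(g, v) * norm(g, u)))

def flat(g, v):
    # Lower an index: the covector with components g_ij v^j.
    return g @ v

def sharp(g, alpha):
    # Raise an index with the inverse metric: g^ij alpha_j.
    return np.linalg.solve(g, alpha)

v, u = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(norm(g, v), angle(g, v, u))
print(np.allclose(sharp(g, flat(g, v)), v))  # sharp inverts flat
\end{verbatim}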
1.2 Linear Connections

A metric defines concepts like orientation and length which allow us to compare vectors within each tangent space, but a manifold equipped with only a metric is still not rigid enough for us to compare vectors that live in different tangent spaces. To make the manifold fully rigid we need to introduce a connection between these vector spaces.

We start differentially and ask whether or not we can compute directional derivatives of vector fields, in other words how vector fields change along a given direction, just as we can compute directional derivatives of functions. Unlike directional derivatives of functions, however, there is no unique directional derivative of vector fields. Instead we have to impose one.

A linear connection defines a directional derivative of vector fields through a covariant derivative that maps two vector fields into a third,
\[ \nabla : \Gamma(Q) \times \Gamma(Q) \to \Gamma(Q) \quad (v, u) \mapsto \nabla_v u. \]
The first input defines directions at each point in the manifold along which changes will be probed, the second input defines the vector field being probed, and the output defines the vector-valued changes in each tangent space.

In order for such a map to qualify as a derivative, however, it has to satisfy the usual properties of derivations. It must, for example, be linear with respect to multiplying the probing directions by functions,
\[ \nabla_{f_1 \cdot v + f_2 \cdot u} \, w = f_1 \nabla_v w + f_2 \nabla_u w, \]
for any two real-valued functions $f_1 : Q \to \mathbb{R}$ and $f_2 : Q \to \mathbb{R}$. Moreover it must be linear with respect to multiplying the probed vector field by constants,
\[ \nabla_v (a \cdot u + b \cdot w) = a \nabla_v u + b \nabla_v w, \]
for any two real-valued constants $a, b \in \mathbb{R}$. Finally it must satisfy the Leibniz rule with respect to multiplying the probed vector field by functions,
\[ \nabla_v (f \cdot u) = f \nabla_v u + v(f) \cdot u. \]
Within a local coordinate basis the action of the covariant derivative becomes
\[ \nabla_v u = \left( v^i \frac{\partial u^k}{\partial q^i} + \Gamma^k_{ij} v^i u^j \right) \partial_k. \]
In other words the linear connection is completely specified with the $D^3$ component functions denoted $\Gamma^k_{ij}(q)$. These Christoffel coefficients are not the components of a tensor but rather transform in a much more complex way, as they encode second-order differential information. Only in careful combinations, such as in the above equation, do the non-tensorial components cancel to leave a well-defined geometric object.
There are an infinite number of connections on a given manifold, but there is a unique connection that is naturally compatible with a given Riemannian metric. In a local coordinate basis the Christoffel coefficients for such a Riemannian or Levi-Civita connection are given by the functions
\[ \Gamma^k_{ij}(q) = \frac{1}{2} g^{kl}(q) \left( \frac{\partial g_{jl}}{\partial q^i}(q) + \frac{\partial g_{il}}{\partial q^j}(q) - \frac{\partial g_{ij}}{\partial q^l}(q) \right). \]
When discussing Riemannian manifolds a natural connection is often assumed to complement a given metric and fully rigidify the manifold. Here, too, we will assume the choice of a Levi-Civita connection.
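As a concrete check of this formula, here is a minimal NumPy sketch (mine, not from the paper) that computes the Levi-Civita Christoffel coefficients of a user-supplied metric by central finite differences; the conformal metric used at the end is an arbitrary illustrative choice.

\begin{verbatim}
import numpy as np

def christoffel(metric, q, eps=1e-6):
    # Gamma^k_ij = (1/2) g^kl (d_i g_jl + d_j g_il - d_l g_ij),
    # with the metric derivatives estimated by central differences.
    D = len(q)
    g_inv = np.linalg.inv(metric(q))
    dg = np.empty((D, D, D))       # dg[l, i, j] = d g_ij / d q^l
    for l in range(D):
        dq = np.zeros(D)
        dq[l] = eps
        dg[l] = (metric(q + dq) - metric(q - dq)) / (2 * eps)
    Gamma = np.empty((D, D, D))
    for k in range(D):
        for i in range(D):
            for j in range(D):
                Gamma[k, i, j] = 0.5 * sum(
                    g_inv[k, l] * (dg[i, j, l] + dg[j, i, l] - dg[l, i, j])
                    for l in range(D))
    return Gamma

# Illustrative conformally flat metric g_ij(q) = exp(2 q^0) delta_ij.
metric = lambda q: np.exp(2 * q[0]) * np.eye(2)
print(christoffel(metric, np.array([0.3, -0.1])))
\end{verbatim}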
1.3 Geodesics

The differential structure imposed by a connection immediately relates neighboring tangent spaces. It can also relate distant tangent spaces when we traverse special curves through the base manifold, $Q$.

Recall that a curve is a smooth map from an interval of the real numbers into our manifold,
\[ c : I \to Q \quad t \mapsto c(t). \]
In particular the points on a curve and their corresponding tangent spaces define a subset of the tangent bundle, with restricted vector fields defined as sections of this subset. When restricted to this subset the covariant derivative will define how restricted vector fields change along the curve.

One restricted vector field inherent to any curve is the velocity vector field,
\[ \dot{c}(t) \in T_{c(t)} Q, \quad \forall t \in I. \]
When this restricted vector field is placed into both arguments of the covariant derivative, $\nabla_{\dot{c}(t)} \dot{c}(t)$, the output defines how velocities change along the curve. In other words it defines the acceleration along the curve with respect to the chosen connection.

Curves with vanishing acceleration everywhere,
\[ \nabla_{\dot{c}(t)} \dot{c}(t) = 0, \quad \forall t \in I, \]
generalize the concept of a straight line to arbitrary smooth manifolds and are denoted geodesics. Geodesics have a variety of useful properties, but in the context of this paper one of the most useful is that they define a local flow on the base manifold. Each point $q \in Q$ and vector $v \in T_q Q$ intersects with only one geodesic, defining an unambiguous way to move through $Q$, at least within a local neighborhood where the geodesics are well-defined. In other words once we pick a point and a direction we have a deterministic way to slide through the manifold (Figure 4). The flow of the entire manifold along these geodesics is also known as the exponential map,
\[ \phi^{\exp} : TQ \times \mathbb{R} \to Q \quad (q, v, t) \mapsto \phi^{\exp}_{t, v}(q). \]

Fig 4. Each point in the base manifold, $q \in Q$, and vector in the corresponding tangent space, $v \in T_q Q$, in other words each point in the tangent bundle, identifies a unique geodesic curve through $Q$. Following this curve for a given time defines an exponential map that transports the initial point through $Q$.

Geodesics, however, carry not only points along the manifold but also vectors from one tangent space to another. Consider an initial point $q \in Q$, an initial direction $v \in T_q Q$, and the corresponding geodesic curve with $c(t = 0) = q$. For any vector $u \in T_q Q$ there is a unique vector field restricted to the geodesic satisfying $u(c(0)) = u$ and $\nabla_{\dot{c}(t)} u = 0$ along the entire geodesic. This restricted vector field defines the parallel transport of $u$ along the geodesic (Figure 5). Overloading notation a bit I will also refer to this parallel transport as an exponential map,
\[ \phi^{\exp} : Q \times T_q Q \times T_q Q \times \mathbb{R} \to \Gamma(Q) \quad (q, v, u, t) \mapsto \phi^{\exp}_{t, v}(u). \]
By this definition the velocity vectors of a geodesic curve are all parallel transported into each other.

Parallel transport allows us to formalize the intuition for how a linear connection actually connects the tangent spaces in our manifold. The first input of the covariant derivative defines directions, and corresponding geodesics, along which we probe the given vector field. The output of the covariant derivative is given by the change in that probed vector field after being parallel transported for an infinitesimal amount of time (Figure 6),
\[ \nabla_v u = \lim_{\epsilon \to 0} \frac{ (\phi^{\exp}_{\epsilon, v})^{-1} u_\epsilon - u }{\epsilon}. \]
Fig 5. A linear connection defines a transport of vectors in the tangent space of any point along a geodesic to vectors in the tangent space of any other point along that geodesic. Here we transport a vector $u \in T_q Q$ to the vector $u_t \in T_{q_t} Q$, where $q_t$ is the exponential map of $(q, v) \in TQ$ for time $t$.

Fig 6. The covariant derivative can be interpreted as the difference between a vector and its parallel transport for an infinitesimal amount of time. Here the initial point $q \in Q$ and initial vector $v \in T_q Q$ define a geodesic curve and the exponential map $q_\epsilon$ along that curve. The vector $u \in T_q Q$ is parallel transported along the geodesic to the vector $u_\epsilon \in T_{q_\epsilon} Q$; the covariant derivative is the scaled difference between $u$ and the pullback of that parallel transport.

Combining the geodesic flow and this parallel transport we see that the covariant derivative defines a flow along the entire tangent bundle (Figure 7). An initial point $(q, v) \in TQ$ defines a starting location and direction, which then identifies a unique geodesic path through $Q$. At each point on that path we also have the velocity vectors of the geodesic, which are the parallel transports of $v$. Overloading notation once again I will refer to this tangent flow as an exponential map,
\[ \phi^{\exp} : TQ \times \mathbb{R} \to TQ \quad (q, v, t) \mapsto \phi^{\exp}_t(q, v). \]

Fig 7. Any point $q \in Q$ and vector $v \in T_q Q$ defines a geodesic which then transports both $q$ and $v$ along the curve. Together these transports define a flow along the entire tangent bundle.
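The exponential map rarely has a closed form, but it can be approximated by integrating the geodesic equation directly. The following sketch (an illustration, repeating the finite-difference christoffel helper from the previous sketch and reusing the same illustrative metric) transports a point and its velocity with a simple discretization of $\ddot{q}^k = -\Gamma^k_{ij} \dot{q}^i \dot{q}^j$.

\begin{verbatim}
import numpy as np

def christoffel(metric, q, eps=1e-6):
    # Identical to the previous sketch, repeated for self-containment.
    D = len(q)
    g_inv = np.linalg.inv(metric(q))
    dg = np.empty((D, D, D))
    for l in range(D):
        dq = np.zeros(D)
        dq[l] = eps
        dg[l] = (metric(q + dq) - metric(q - dq)) / (2 * eps)
    Gamma = np.empty((D, D, D))
    for k in range(D):
        for i in range(D):
            for j in range(D):
                Gamma[k, i, j] = 0.5 * sum(
                    g_inv[k, l] * (dg[i, j, l] + dg[j, i, l] - dg[l, i, j])
                    for l in range(D))
    return Gamma

def exp_map(metric, q, v, t=1.0, steps=1000):
    # Approximate phi^exp_t(q, v) by discretizing the geodesic
    # equation d2 q^k / dt2 = -Gamma^k_ij dq^i/dt dq^j/dt.
    h = t / steps
    q, v = q.astype(float).copy(), v.astype(float).copy()
    for _ in range(steps):
        Gamma = christoffel(metric, q)
        a = -np.einsum('kij,i,j->k', Gamma, v, v)
        q = q + h * v + 0.5 * h**2 * a   # move the point ...
        v = v + h * a                    # ... and update its velocity
    return q, v

metric = lambda q: np.exp(2 * q[0]) * np.eye(2)
print(exp_map(metric, np.array([0.0, 0.0]), np.array([1.0, 0.5])))
\end{verbatim}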
2. RIEMANNIAN MARKOV TRANSITIONS
Markov chain Monte Carlo (Robert and Casella, 1999; Brooks et al., 2011) explores a target probability distribution, $\pi(\mathrm{d} q)$, defined on $Q$ by sampling from a Markov transition conditioned on a given state, $\tau(\mathrm{d} q \mid q')$. If the Markov transition preserves the target distribution in expectation,
\[ \pi(\mathrm{d} q) = \int \pi(\mathrm{d} q') \, \tau(\mathrm{d} q \mid q'), \]
then repeated transitions generate a sequence of states that converges towards, and eventually disperses across, the typical set of the target distribution. The states in this Markov chain then define Markov chain Monte Carlo estimators
\[ \hat{f}_N = \frac{1}{N} \sum_{n=1}^{N} f(q_n) \]
that asymptotically converge to the true target expectation values,
\[ \lim_{N \to \infty} \hat{f}_N = \int \pi(\mathrm{d} q) \, f(q), \]
under typical regularity conditions. The practical utility of a given Markov transition is determined by how quickly it explores the target distribution and, consequently, how quickly the Markov chain Monte Carlo estimators converge to the true expectation values.

A powerful method for constructing Markov transitions is sampling over a family of deterministic transformations. In particular, if $\phi_t$ is a family of continuous, bijective maps, $\phi_t : Q \to Q$, parameterized by $t \in T$, $\gamma$ is a probability distribution over $T$, and $\mathbb{I}_A(q)$ is the indicator function for the set $A \subset Q$, then
\[ \tau(\mathrm{d} q \mid q') = \int \gamma(\mathrm{d} t) \, \mathbb{I}_{\mathrm{d} q}(\phi_t(q')) \]
defines a Markov transition on $Q$ (Diaconis and Freedman, 1999). If the transformations each preserve the target distribution,
\[ ((\phi_t)_* \pi)(\mathrm{d} q) = \pi(\mathrm{d} q), \]
then this Markov transition will also preserve the target distribution and generate the desired Markov chains; when the individual transformations do not preserve the target distribution, straightforward correction schemes can be applied to each move to ensure the desired invariance. The freedom to choose a family of transformations and a probability distribution over that family allows one to engineer particularly effective Markov transitions, especially when those choices are informed by the structure of the target distribution itself.

Because Markov transitions condition on an initial state they can exploit the local structure of the target distribution within the neighborhood of that state to inform efficient transformations. In particular, if $Q$ is a Riemannian manifold then the local metric structure can be used to construct both families of deterministic transformations and distributions over those families, defining potentially effective Markov transitions. In this section we'll see how the local metric structure of a Riemannian manifold can be used to construct the ingredients of a Markov transition, and review examples of that construction that realize familiar algorithms.
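As a toy illustration of these estimators (my own, with an assumed target and transition rather than anything from the paper), the following sketch runs a Markov chain whose Gaussian autoregressive transition exactly preserves a standard Gaussian target and evaluates the Markov chain Monte Carlo estimator of a second moment.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

# The autoregressive transition q' = rho q + sqrt(1 - rho^2) eps
# preserves the standard Gaussian target (illustrative choice).
rho, N = 0.9, 100_000
q = 0.0
samples = np.empty(N)
for n in range(N):
    q = rho * q + np.sqrt(1 - rho**2) * rng.normal()
    samples[n] = q

# Markov chain Monte Carlo estimator of E[f] with f(q) = q^2;
# the exact expectation value is 1.
print(np.mean(samples**2))
\end{verbatim}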
By exploiting the structure of the tangent and cotangent bundles associated with a manifold we can construct natural transformations that carry us around the space, providing the basis for Markov transitions.

As we saw in Section 1.3, equipping a smooth manifold with a Riemannian metric and its corresponding Levi-Civita connection endows the space with natural geodesics that allow us to transport points and vectors along those curves. These actions define a flow along the tangent bundle, $TQ$, which we referred to as an exponential map,
\[ \phi^{\exp} : TQ \times \mathbb{R} \to TQ \quad (q, v, t) \mapsto \phi^{\exp}_t(q, v). \]
Given an initial point we can identify a particular transformation by choosing a direction, which identifies a unique geodesic path, and a time, which informs how long to move along that path. Dropping the final velocity vector then projects this tangent flow to a flow across the base manifold, $Q$. In other words the choice of vector and integration time parameterize a family of deterministic transformations on $Q$.

Because these transformations are not informed by the target distribution they will not, in general, preserve it. Instead the tangent flow provides proposals that can be corrected to achieve the desired invariance.

Unlike the tangent bundle, the cotangent bundle, $T^*Q$, is naturally equipped with a unique symplectic structure, $\omega$, and symplectic measure, $\Omega$, that allows us to construct flows without the extra structure introduced by a Riemannian metric (Betancourt et al., 2016). Instead of complementing the manifold with a metric, we instead complement the cotangent bundle and its natural symplectic structure with some function
\[ H : T^*Q \to \mathbb{R}, \]
denoted a Hamiltonian. The choice of a Hamiltonian function immediately defines a Hamiltonian flow over the cotangent bundle,
\[ \phi^H : T^*Q \times \mathbb{R} \to T^*Q \quad (q, p, t) \mapsto \phi^H_t(q, p). \]
Hamiltonian flows have the added benefit of inherently preserving the canonical distribution, a probability distribution over the cotangent bundle given by
\[ \pi(\mathrm{d} q, \mathrm{d} p) = e^{-H(q, p)} \, \Omega(\mathrm{d} q, \mathrm{d} p). \]
If the Hamiltonian is chosen such that this canonical distribution projects to our target distribution, then the projection of the Hamiltonian flow will preserve the target distribution, $\pi(\mathrm{d} q)$. We can guarantee the desired invariance by introducing a conditional distribution over the cotangent fibers, $\pi(\mathrm{d} p \mid q)$, defining the lifted joint distribution,
\[ \pi(\mathrm{d} q, \mathrm{d} p) = \pi(\mathrm{d} p \mid q) \, \pi(\mathrm{d} q), \]
and then taking the Hamiltonian to be the corresponding Radon-Nikodym derivative with respect to the symplectic measure,
\[ H = -\log \frac{\mathrm{d} \pi(\mathrm{d} q, \mathrm{d} p)}{\mathrm{d} \Omega(\mathrm{d} q, \mathrm{d} p)} = -\log \pi(p \mid q) - \log \pi(q). \]
Similar to the geodesic-informed tangent flow, this cotangent flow defines a family of transformations from any initial point $q \in Q$ parameterized by the choice of cotangent vector, $p \in T^*_q Q$, which defines a unique Hamiltonian trajectory, and an integration time, $t$, which defines how long to move along that trajectory. Projecting this flow back to the base manifold defines the family of transformations from which we can construct a valid Markov transition.

Although these trajectories are not immediately dependent on a Riemannian metric, they do depend on the choice of Hamiltonian, which itself depends on the choice of some conditional probability distribution $\pi(\mathrm{d} p \mid q)$. As we will see in the next section, building such a conditional probability distribution is greatly facilitated by exploiting the Riemannian metric structure. Consequently in practice these cotangent flows are implicitly informed by the choice of metric.

Both the tangent and cotangent flows introduced above were parameterized by an integration time as well as an initial vector or covector.
In order to incorporate these families of transformations into a Markov transition we need to impose probability distributions over these parameters.

Selecting a distribution over the real-valued integration times is straightforward, although it is not immediately obvious how to select an optimal distribution. The choice of distribution over vectors and covectors, however, is complicated by the abstract geometry of the manifolds involved. Fortunately the rigidity imposed by a Riemannian metric drastically simplifies this problem.

Because the distribution of vectors and covectors can vary with the initial point we really want to define conditional probability distributions over each of the tangent and cotangent spaces (Betancourt et al., 2016). Conveniently a Riemannian metric provides all of the ingredients we need. Within a given tangent space, for example, the metric defines the quadratic form $g_q(v, v)$ and the metric determinant, $|g(q)|$. These are sufficient to construct any elliptical family of probability density functions of the form
\[ \pi(v; q, \phi) = \xi(g_q(v, v), \phi) + \zeta(|g(q)|, \phi), \]
for appropriate choices of the real-valued functions $\xi$ and $\zeta$. Similarly in the cotangent spaces we can use the inverse metric to build elliptical probability density functions of the form
\[ \pi(p; q, \phi) = \iota(g^{-1}_q(p, p), \phi) + \kappa(|g(q)|, \phi). \]
Because the quadratic forms and metric determinants vary smoothly with the base point $q$, these probability density functions fuse together into well-defined conditional probability density functions, $\pi(v \mid q; \phi)$ and $\pi(p \mid q; \phi)$.

Despite their relatively simple form elliptical families span a wide range of distributions, providing useful flexibility when constructing Markov transitions. For example elliptical families include not only the Gaussian family but also more heavy-tailed families like the Laplace and Cauchy families of probability density functions.

Fig 8. Randomly sampling a vector $v \in T_q Q$ and integration time, $t$, defines a transformation that takes an initial point $q$ to another point in the manifold. Repeating this process defines a Markov chain over $Q$ whose stationary distribution will depend on the choice of probability distributions over the initial vectors and integration times.

Metric-informed tangent flows and metric-informed tangent conditional distributions provide the components of a full Markov transition. We start at an initial point, $q$, and begin the transition by sampling an initial vector from the corresponding distribution over the tangent space,
\[ v \sim \pi(\mathrm{d} v \mid q), \]
before sampling an integration time from some distribution that might be informed by this initial configuration,
\[ t \sim \pi(t \mid q, v). \]
The initial point and vector define a unique geodesic along which we integrate for time $t$, generating a random move to a new point in the tangent bundle which we can project back down to the base manifold. The same construction holds for the cotangent bundle, using instead Hamiltonian trajectories and a metric-informed cotangent conditional distribution.

With some additional modifications to correct the transitions and preserve a desired target distribution, this general geometric procedure recovers quite a few well-known algorithms. Here I demonstrate three: random walk Metropolis-Hastings, Langevin Monte Carlo, and Hamiltonian Monte Carlo.
Repeatedly sampling a random direction and then following the corresponding geodesic for some finite time generates a second-order Markov process on the base manifold (Figure 8). In the limit where the integration time vanishes this process converges to a random walk that diffuses across the manifold (Hsu, 2002).

This diffusive behavior explores the manifold but it will not, in general, preserve a specified target distribution. To achieve that behavior we need to correct the random walk, rejecting moves that stray too far from the typical set of the joint distribution on the tangent bundle.

Fig 9. (a) If a proposal strays too far from neighborhoods of high target probability then a Metropolis correction is likely to reject that proposal and return to the initial point. (b) A proposal staying closer to high probability neighborhoods, however, will be accepted and ensure exploration that preserves the target distribution.
A standard approach to such corrections is to consider the moves as proposals which are then accepted or rejected according to a Metropolis-Hastings correction. The deterministic geodesic moves, however, require a small correction to serve as valid Metropolis proposals (Tierney, 1998). In order to admit a well-defined correction we have to compose each move with a reflection operator that flips the sign of the tangent vector after each flow,
\[ R : TQ \to TQ \quad (q, v) \mapsto (q, -v). \]
This negation turns the flow into an involution which returns to the initial state on the tangent bundle when the proposal is applied twice.

After sampling an initial velocity and time, flowing along the corresponding geodesic for that time, and then negating the final velocity we have a valid proposal that can be accepted with probability
\[ \mathbb{P}[\text{accept}] = \min(1, r(q, v)), \]
otherwise returning to the initial state ready for another transition (Figure 9). Here $r$ is the Radon-Nikodym derivative between the joint distribution on the tangent bundle and its pullback under the tangent flow,
\[ r(q, v) = \frac{\mathrm{d} (\phi^{\exp}_t)^* \pi}{\mathrm{d} \pi}(q, v). \]
Because of the careful dependence of the acceptance probability on the joint distribution, the complete transition will always preserve the joint distribution. The marginal chain over the base manifold will then preserve the target distribution.

In local coordinates on the tangent bundle the Radon-Nikodym derivative becomes a ratio of joint probability density functions,
\[ r(q, v) = \frac{\pi(q') \, \pi(-v' \mid q')}{\pi(q) \, \pi(v \mid q)}, \]
where
\[ (q', v') = \phi^{\exp}_t(q, v). \]
The ratio of target probability density functions is known as the Metropolis ratio, while the ratio of tangent conditional probability density functions is known as the Hastings ratio.

In the global coordinates of a Euclidean manifold this construction reduces to the usual random walk Metropolis algorithm. If we specify the tangent distribution with a multivariate Gaussian probability density function then the process of sampling a tangent vector and flowing exactly yields a sample from a multivariate Gaussian on the base manifold whose covariance matrix is given by the components of the metric scaled by the integration time,
\[ \Sigma_{ij}(q) = t \cdot g_{ij}(q), \]
as the algorithm is typically presented.

For infinitesimally small integration times the geodesic random walk without any Metropolis correction defines a Brownian motion over the base manifold. Taking small, but finite, integration times then provides a discrete approximation to that Brownian motion. Introducing the Metropolis correction guides the discretized random walk towards the neighborhoods of high target probability, albeit relatively inefficiently in most contemporary problems.
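The Euclidean special case admits a compact implementation. The sketch below (an illustration with an assumed two-dimensional Gaussian target; the metric and integration time are arbitrary) draws the metric-scaled Gaussian proposal and applies the Metropolis correction, with the Hastings ratio canceling for this symmetric proposal.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

def log_target(q):
    # Toy correlated Gaussian target (illustrative).
    prec = np.array([[2.0, -1.0], [-1.0, 2.0]])
    return -0.5 * q @ prec @ q

def rwm_step(q, chol_cov):
    # Gaussian proposal with covariance Sigma = t * g; the proposal is
    # symmetric so only the Metropolis ratio pi(q') / pi(q) remains.
    proposal = q + chol_cov @ rng.normal(size=q.size)
    if np.log(rng.uniform()) < log_target(proposal) - log_target(q):
        return proposal
    return q

t, g = 0.5, np.eye(2)   # integration time and constant metric components
chol_cov = np.linalg.cholesky(t * g)
q = np.zeros(2)
chain = np.empty((5000, 2))
for n in range(5000):
    q = rwm_step(q, chol_cov)
    chain[n] = q
print(chain.mean(axis=0))
\end{verbatim}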
A similar procedure applies to the cotangent bundle. Sampling a covector from a cotangent conditional distribution and then applying the Hamiltonian flow for some time defines a second-order stochastic process across the cotangent bundle. Unlike the geodesic random walk, however, the Hamiltonian-informed process manifestly preserves the joint distribution on the cotangent bundle, and hence the marginal process preserves the target distribution on the base manifold. For infinitesimally small integration times this process defines an Ornstein-Uhlenbeck process over the base manifold, a drifting diffusion that exactly targets the given joint distribution.

Unfortunately this invariance isn't robust enough to manifest exactly in practical applications, where we have to approximate the Hamiltonian flow with the discrete trajectories of a symplectic integrator, $\Phi^H_{\epsilon, L}$. Here $\epsilon$ denotes the step size of the integration and $L$ the number of steps. Approximating the infinitesimal action of the Hamiltonian flow with one step of a symplectic integrator, $\Phi^H_{\epsilon, 1}$, gives Langevin Monte Carlo, or sometimes unadjusted Langevin Monte Carlo (Xifara et al., 2014).

Although symplectic integrators are exceptionally accurate they are not perfect, and the numerical errors they introduce will bias the discrete transitions away from the target distribution. In order to preserve the invariance of the target distribution, especially in higher dimensions, we need to apply a Metropolis correction just as we did for the geodesics. As in that case we first turn the discrete update into an involution with the composition of a reflection operator,
\[ R : T^*Q \to T^*Q \quad (q, p) \mapsto (q, -p), \]
before applying a Metropolis correction that accepts the updated state with probability
\[ \mathbb{P}[\text{accept}] = \min(1, r(q, p)), \]
where $r$ is the Radon-Nikodym derivative
\[ r(q, p) = \frac{\mathrm{d} (\Phi^H_{\epsilon, 1})^* \pi}{\mathrm{d} \pi}(q, p). \]
Combining the discrete Langevin dynamics with a Metropolis correction defines adjusted Langevin Monte Carlo, or Metropolis-adjusted Langevin Monte Carlo, or typically just MALA.

Because the Metropolis correction compensates only for the errors introduced by the symplectic integrator, and not for the target-agnostic geodesic proposals of random walk Metropolis-Hastings, Metropolis-adjusted Langevin methods perform much better than their random walk equivalents. Still, their overall performance is limited by the diffusive nature of the transitions.
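A minimal sketch of the Metropolis-adjusted update, assuming a Euclidean metric with identity components and a toy Gaussian target (both my assumptions): one leapfrog step approximates the Hamiltonian flow, the momentum flip makes the update an involution, and the acceptance probability reduces to the change in the Hamiltonian.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)

def log_target(q):
    return -0.5 * np.sum(q**2)   # toy standard Gaussian target

def grad_log_target(q):
    return -q

def mala_step(q, eps):
    # H(q, p) = 0.5 p.p - log pi(q); one leapfrog step, then a
    # Metropolis correction with acceptance min(1, exp(H0 - H1)).
    p = rng.normal(size=q.size)
    H0 = 0.5 * p @ p - log_target(q)
    p_half = p + 0.5 * eps * grad_log_target(q)
    q_new = q + eps * p_half
    p_new = p_half + 0.5 * eps * grad_log_target(q_new)
    H1 = 0.5 * p_new @ p_new - log_target(q_new)
    return q_new if np.log(rng.uniform()) < H0 - H1 else q

q = np.zeros(2)
for _ in range(1000):
    q = mala_step(q, eps=0.5)
print(q)
\end{verbatim}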
To fully exploit the Hamiltonian flow we need to follow it for much longer than infinitesimal times, taking advantage of the coherent trajectories to rapidly explore the target distribution. In practice this is accomplished by first sampling a covector from the cotangent conditional distribution and then applying a symplectic integrator for multiple steps, $\Phi^H_{\epsilon, L}$, to simulate the Hamiltonian flow for time $t = \epsilon \cdot L$. This defines the family of Hamiltonian Monte Carlo methods (Betancourt et al., 2016).

Once we have a longer discrete trajectory we still have to correct for the small but non-negligible numerical errors. The first Hamiltonian Monte Carlo methods considered only the final state in the trajectory, applying a Metropolis correction to that state as with random walk Metropolis and Metropolis-adjusted Langevin methods. Modern implementations, however, take advantage of the entire trajectory by going beyond Metropolis corrections. For a thorough discussion see Betancourt (2018a).
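Extending the previous sketch from one leapfrog step to $L$ steps gives a minimal Hamiltonian Monte Carlo transition with the original final-state Metropolis correction; the step size, trajectory length, and Gaussian target are again illustrative assumptions, and modern trajectory-wide corrections are omitted.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)

def hmc_step(q, log_target, grad_log_target, eps, L):
    # Simulate the Hamiltonian flow for time t = eps * L with a
    # leapfrog integrator, then Metropolis-correct the final state.
    p = rng.normal(size=q.size)
    H0 = 0.5 * p @ p - log_target(q)
    q_new, p_new = q.copy(), p.copy()
    for _ in range(L):
        p_new = p_new + 0.5 * eps * grad_log_target(q_new)
        q_new = q_new + eps * p_new
        p_new = p_new + 0.5 * eps * grad_log_target(q_new)
    H1 = 0.5 * p_new @ p_new - log_target(q_new)
    return q_new if np.log(rng.uniform()) < H0 - H1 else q

log_target = lambda q: -0.5 * np.sum(q**2)
grad_log_target = lambda q: -q
q = np.zeros(2)
for _ in range(1000):
    q = hmc_step(q, log_target, grad_log_target, eps=0.2, L=10)
print(q)
\end{verbatim}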
3. REPARAMETERIZATIONS AND EQUIVALENT METRICS
One of the benefits of the purely geometric construction that we have discussed so far is that it explicitly guides implementations. Once local coordinates have been chosen, the probability distributions, namely the target distribution and the tangent or cotangent conditional distributions, can be specified with local probability density functions. Likewise any geometric objects, namely the metric, can be specified with local component functions. In this way everything is manifestly compatible with the chosen coordinates and with everything else. Under a reparameterization we simply begin with a new coordinate system and start the process anew.

That progression from geometric to coordinate, however, is not how algorithms are typically implemented in practice. Instead practitioners often begin with a default local representation of the algorithmic system and are responsible for working out how that local representation transforms under reparameterizations. These transformations are challenging to determine and prone to error; consequently practitioners routinely neglect to transform the algorithmic structure entirely, resulting in incomplete transformations and an entirely different Riemannian geometry.

In this section I first review the proper way local representations transform under a reparameterization of the local coordinates before considering the incomplete reparameterizations typical of practice. In particular I explicitly derive the modified geometries that result from these incomplete reparameterizations. Finally I use this relationship between incomplete reparameterizations and modified geometries to motivate an optimality criterion for reparameterizations in the context of a given target distribution.
3.1 Complete Reparameterizations

There are two equivalent perspectives on the reparameterization of a manifold: the passive and the active (Baez and Muniain, 1994). In the passive perspective a reparameterization fixes the manifold but transforms each chart, while in the active perspective a reparameterization transforms the manifold while fixing the charts. Although these two perspectives are equivalent, the latter is closer to how reparameterizations are implemented in practice and consequently I will focus on that perspective.

More formally, in the active perspective a reparameterization is a diffeomorphism from the base manifold into itself,
\[ \psi : Q \to Q \quad q \mapsto q' = \psi(q), \]
that pushes each chart forward. These chart maps are locally linear transformations represented by Jacobian matrices,
\[ J^i_{\ j}(q) = \frac{\partial \psi^i}{\partial q^j}(q). \]
In other words even a non-linear reparameterization acts like a linear transformation within each local neighborhood. For example, the coordinate functions in the new charts are given by
\[ (q')^i(q') = J^i_{\ j}(\psi^{-1}(q')) \cdot q^j(\psi^{-1}(q')). \]
Similarly the probability density function representation of a probability distribution within a given chart transforms by acquiring a factor of the inverse determinant of the Jacobian matrix,
\[ \pi(q') = \pi(\psi^{-1}(q')) \, \left| J(\psi^{-1}(q')) \right|^{-1}. \]

Fig 10. A reparameterization $\psi : Q \to Q$ transforms not only points in the manifold but also geometric objects defined in the tangent and cotangent spaces. Locally the action of a reparameterization behaves like a rotation given by the Jacobian matrix of the reparameterizing map.

Importantly a reparameterization of the base manifold also affects the local structure of the tangent and cotangent spaces (Figure 10). Tangent vectors push forward along the transformation, the vector $v \in T_q Q$ mapping into a vector $v' \in T_{\psi(q)} Q$. Local bases of the tangent spaces transform as
\[ (\partial')_i = (J^{-1})^j_{\ i}(\psi^{-1}(q')) \cdot \partial_j, \]
which immediately implies that the components of a vector in that basis transform to the components
\[ (v')^i = J^i_{\ j}(\psi^{-1}(q')) \cdot v^j. \]
Likewise probability density functions over a tangent space acquire the same inverse Jacobian determinant as the probability density functions in the local charts. This implies that conditional probability density functions over the tangent bundle pick up two factors of the inverse Jacobian determinant,
\[ \pi(q', v') = \pi\big( \psi^{-1}(q'), (J^{-1})^i_{\ j}(\psi^{-1}(q')) \, (v')^j \big) \cdot \left| J(\psi^{-1}(q')) \right|^{-1} \cdot \left| J(\psi^{-1}(q')) \right|^{-1}. \]
At the same time we can take a more comprehensive perspective and note that any reparameterization over the base manifold $Q$ induces a reparameterization of the entire tangent bundle at once. The Jacobian matrix of this bundle reparameterization is block diagonal with both blocks equal to the Jacobian matrix of the base reparameterization,
\[ J_{TQ} = \begin{pmatrix} J & 0 \\ 0 & J \end{pmatrix}, \]
from which one can readily reproduce all of the previous results. For example,
\[ |J_{TQ}| = \left| \begin{pmatrix} J & 0 \\ 0 & J \end{pmatrix} \right| = |J|^2. \]
Objects in the cotangent spaces naturally pull back along the reparameterization and hence transform in the opposite way to tangent vectors. A local basis of a cotangent space transforms as
\[ (\mathrm{d} q')^i = J^i_{\ j}(\psi^{-1}(q')) \cdot \mathrm{d} q^j, \]
and the components of a covector in that basis transform as
\[ (p')_i = (J^{-1})^j_{\ i}(\psi^{-1}(q')) \cdot p_j. \]
Probability density functions over a cotangent space behave opposite to probability density functions over a local chart or a tangent space; they acquire a factor of the Jacobian determinant without inversion. Critically this implies that probability density functions over the cotangent bundle pick up no Jacobian factors under a reparameterization,
\[ \pi(q', p') = \pi\big( \psi^{-1}(q'), J^j_{\ i}(\psi^{-1}(q')) \, (p')_j \big) \cdot \left| J(\psi^{-1}(q')) \right|^{-1} \cdot \left| J(\psi^{-1}(q')) \right| = \pi\big( \psi^{-1}(q'), J^j_{\ i}(\psi^{-1}(q')) \, (p')_j \big). \]
This hints at the natural probabilistic structure of the cotangent bundle and some of the inherent advantages of algorithms like Langevin Monte Carlo and Hamiltonian Monte Carlo defined there.

Just as a reparameterization of the base manifold induces a reparameterization of the tangent bundle, it also induces a reparameterization of the entire cotangent bundle. Here the Jacobian matrix of the bundle reparameterization is block diagonal, but the lower block now equals the inverse Jacobian matrix of the base reparameterization,
\[ J_{T^*Q} = \begin{pmatrix} J & 0 \\ 0 & J^{-1} \end{pmatrix}. \]
This perspective makes it particularly clear that the Jacobian determinant of the induced reparameterization is exactly one,
\[ |J_{T^*Q}| = \left| \begin{pmatrix} J & 0 \\ 0 & J^{-1} \end{pmatrix} \right| = |J| \cdot |J|^{-1} = 1. \]
From the transformation properties of vectors and covectors we can work out how general tensors transform. In particular we can work out how the component functions of a Riemannian metric transform when we reparameterize the base space. In this case we get two inverse Jacobians, one for each component,
\[ (g')_{lm}(q') = (J^{-1})^i_{\ l}(\psi^{-1}(q')) \cdot (J^{-1})^j_{\ m}(\psi^{-1}(q')) \cdot g_{ij}(\psi^{-1}(q')). \]
As we'd expect from a geometric invariant, the quadratic form defining an elliptical probability density function on the tangent spaces doesn't change under the reparameterization:
\begin{align*}
(g')_{q'}(v', v')
&= (g')_{lm}(q') \, (v')^l \, (v')^m \\
&= (J^{-1})^i_{\ l}(\psi^{-1}(q')) \cdot (J^{-1})^j_{\ m}(\psi^{-1}(q')) \cdot g_{ij}(q) \cdot J^l_{\ r}(\psi^{-1}(q')) \cdot v^r \cdot J^m_{\ s}(\psi^{-1}(q')) \cdot v^s \\
&= \left[ (J^{-1})^i_{\ l}(\psi^{-1}(q')) \cdot J^l_{\ r}(\psi^{-1}(q')) \right] \cdot \left[ (J^{-1})^j_{\ m}(\psi^{-1}(q')) \cdot J^m_{\ s}(\psi^{-1}(q')) \right] \cdot g_{ij}(q) \cdot v^r \cdot v^s \\
&= \delta^i_{\ r} \cdot \delta^j_{\ s} \cdot g_{ij}(q) \cdot v^r \cdot v^s \\
&= g_{ij}(q) \cdot v^i \cdot v^j \\
&= g_q(v, v).
\end{align*}
The transformation properties of the elliptical probability density functions instead depend entirely on their metric determinant terms.

Provided that we reparameterize not just the base space but also the tangent and cotangent spaces, any algorithm based on exact flows will be invariant to reparameterizations; the entire Markov chains they generate will map forward from one parameterization to another without any changes to their dynamics. Algorithms that depend on discrete approximations to these flows will not be exactly invariant, as the approximation error and any correction scheme will in general depend on the local parameterization, but the resulting dynamics will be similar.

Reparameterization, however, is often recommended to improve performance precisely because it is supposed to change the dynamics of the Markov chain. This contradiction is resolved when we realize that the reparameterizations employed in practice are not the complete reparameterizations of a proper geometric system but rather incomplete reparameterizations that transform the initial system into something else entirely.
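These transformation rules are easy to verify numerically. The sketch below (an illustration; the nonlinear map and metric components are arbitrary choices) pushes a vector forward, pulls the metric components back through the inverse Jacobian, and confirms that the quadratic form is unchanged.

\begin{verbatim}
import numpy as np

def jacobian(psi, q, eps=1e-6):
    # Central finite-difference Jacobian J^i_j = d psi^i / d q^j.
    D = len(q)
    J = np.empty((D, D))
    for j in range(D):
        dq = np.zeros(D)
        dq[j] = eps
        J[:, j] = (psi(q + dq) - psi(q - dq)) / (2 * eps)
    return J

psi = lambda q: np.array([np.exp(q[0]), q[1] + q[0]**2])  # illustrative map
g = np.array([[2.0, 0.5], [0.5, 1.0]])   # metric components at q
q = np.array([0.2, -0.4])
v = np.array([1.0, 2.0])

J = jacobian(psi, q)
v_new = J @ v                            # vector components push forward
J_inv = np.linalg.inv(J)
g_new = J_inv.T @ g @ J_inv              # metric components transform

# The quadratic form g_q(v, v) is a geometric invariant.
print(np.isclose(v @ g @ v, v_new @ g_new @ v_new))
\end{verbatim}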
3.2 Incomplete Reparameterizations

In practice any geometric algorithm is implemented with coordinates, components, and probability densities. Typically, however, only the target probability density is exposed to the user. Conditional probability density functions on the tangent or cotangent bundle, or components of the metric that define those conditional densities, are either set to default values not exposed to the user or exposed but significantly limited in flexibility. For example, the default configuration of Stan (Stan Development Team, 2019) forces a metric with constant, diagonal components.

Consequently a user cannot reparameterize the entire geometric system on which these algorithms are based. They can only reparameterize the target probability density function while the tangent and cotangent structures remain fixed. These incomplete reparameterizations result in a different geometry and hence a different algorithm that may interact better or worse with the target distribution.

Fig 11. An incomplete reparameterization forces the metric geometry to be defined using component functions in the reparameterized charts, not the initial charts. Transforming back to the initial parameterization we see that this is equivalent to defining a different metric on the original space than we might have anticipated. In other words an incomplete reparameterization transforms the base manifold while holding the tangent and cotangent spaces fixed, twisting the tangent and cotangent bundles. If we release our hold on these spaces the bundles snap back, revealing the equivalent metric geometries.

While an incomplete reparameterization is not a proper geometric transformation, its effect does admit a convenient geometric interpretation. We begin with a metric specified by the local component functions $g_{ij}(q)$. Reparameterizing the base manifold, $q \mapsto q' = \psi(q)$, but fixing the components of the metric results in a new metric specified by the same components but in the new coordinate system, $g_{lm}(q')$. To compare these metrics we have to completely invert the reparameterization, pulling the new metric back into the original coordinate system,
\[ \bar{g}_{ij}(q) = J^l_{\ i}(\psi^{-1}(q')) \cdot J^m_{\ j}(\psi^{-1}(q')) \cdot g_{lm}(\psi^{-1}(q')) = J^l_{\ i}(q) \cdot J^m_{\ j}(q) \cdot g_{lm}(q). \]
In words, an incomplete reparameterization is equivalent to running the algorithm in the original coordinate system but with the transformed metric (Figure 11),
\[ \bar{g}_{ij}(q) = J^l_{\ i}(q) \cdot J^m_{\ j}(q) \cdot g_{lm}(q). \]
Consequently there is a one-to-one equivalence between incomplete reparameterizations and the choice of metric, and hence the configuration of a Riemannian Markov transition. Under an incomplete reparameterization the geodesics and the elliptical conditional probability distributions on the tangent and cotangent spaces all follow from this new, equivalent metric. These new configurations will induce new dynamics with respect to the target distribution, resulting in modified performance that may or may not be beneficial.
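The equivalent metric itself is just a matrix congruence. A one-dimensional sketch (illustrative; the logarithmic map stands in for any reparameterization): keeping unit metric components while working in $q' = \log q$ is, back in the original chart, the position-dependent metric $\bar{g}(q) = 1/q^2$, which tightens the effective scale near zero.

\begin{verbatim}
import numpy as np

def equivalent_metric(J, g_fixed):
    # gbar_ij(q) = J^l_i(q) J^m_j(q) g_lm for an incomplete
    # reparameterization with fixed components g_lm.
    return J.T @ g_fixed @ J

g_fixed = np.array([[1.0]])             # constant unit components
for q in (0.1, 1.0, 10.0):
    J = np.array([[1.0 / q]])           # Jacobian of q' = log(q)
    print(q, equivalent_metric(J, g_fixed))   # gbar = 1 / q^2
\end{verbatim}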
One advantage of this equivalence is that by applying an incomplete reparameterization we can effectively implement an algorithm with spatially-varying metric components using only an algorithm configured with constant metric components, at least if we can find the right reparameterization. Riemannian Markov transitions in a coordinate system admitting constant metric components are significantly easier to implement robustly, and indeed are often the only option in popular software packages. For example, we can use explicit symplectic integrators to implement Langevin and Hamiltonian Monte Carlo instead of the more expensive, and more fragile, implicit symplectic integrators. By finding an appropriate reparameterization we can reproduce the geometry of a more sophisticated metric without modifying the software itself.

Algorithms exploiting metrics with spatially-varying components in the default coordinate system are often denoted "Riemannian" algorithms in the statistics literature, with those using constant components denoted "Euclidean" algorithms. Using this terminology an incomplete reparameterization allows one to effectively run a Riemannian algorithm using only a Euclidean implementation. Keep in mind, however, that this terminology is technically incorrect, as discussed at the end of Section 1.1.
3.3 Optimal Reparameterizations

An immediate advantage of this identification between an incomplete reparameterization and its equivalent Riemannian geometry is that it allows us to determine reparameterizations that optimize performance with respect to a given target distribution by first determining the optimal Riemannian geometry.

Within a small neighborhood we can approximate the log target probability density function with a Taylor expansion, although that approximation has to be made with care. Firstly the target probability density function is not an invariant function amenable to approximation. We can construct an appropriate function, however, by using the determinant of the metric to correct for the non-invariant behaviors,
\[ \rho(q) = \log \left( \pi(q) \cdot |g(q)|^{-\frac{1}{2}} \right) = \log \pi(q) - \frac{1}{2} \log |g(q)|. \]
We can then construct a local Taylor expansion of this invariant function around a point $q_0$,
\[ \rho(q) = \rho(q_0) + \frac{\partial \rho}{\partial q^i}(q_0) \cdot (q - q_0)^i + \frac{1}{2} \frac{\partial^2 \rho}{\partial q^i \partial q^j}(q_0) \cdot (q - q_0)^i \cdot (q - q_0)^j + \ldots. \]
Finally, if the local chart is in a basin where the gradient and all of the higher-order terms are negligible compared to the constant and quadratic terms then we can approximate the function as
\[ \rho(q) \approx \mathrm{const} + \frac{1}{2} \frac{\partial^2 \rho}{\partial q^i \partial q^j}(q_0) \cdot (q - q_0)^i \cdot (q - q_0)^j. \]
If $\rho$ is concave in that small neighborhood, so that its Hessian is negative definite, then this approximation defines a Gaussian probability density function. In other words in sufficiently small charts where $\rho(q)$ is concave we can approximate the target probability density function with a multivariate Gaussian probability density function defined by the precision matrix
\[ (\Sigma^{-1})_{ij} = -\frac{\partial^2 \rho}{\partial q^i \partial q^j}(q_0). \]
This approximation significantly simplifies the analysis of geometric algorithms that utilize elliptical probability density functions over the tangent or cotangent spaces. Rotating the entire tangent bundle within the small neighborhood, for example, exchanges covariance between the target approximation and the covariance of the tangent probability density function defined by the metric. At the same time rotating the cotangent bundle exchanges precision between the target approximation and the cotangent probability density function defined by the inverse metric. Consequently we can completely decorrelate the local approximation to the target probability density function by choosing a metric that compensates for the local behavior of the Hessian of $\rho$. In each neighborhood this would be accomplished with a metric specified by the components
\[ g_{ij}(q) = -\frac{\partial^2 \rho}{\partial q^i \partial q^j}(q). \]
Unfortunately this equality is valid only within a single chart and hence does not define an optimization criterion that is consistent across the entire base manifold. The main problem is that the Hessian does not transform like a metric but rather like a jet, in particular a one-dimensional, rank-two covelocity (Betancourt, 2018b). We can use the Riemannian structure on the manifold, however, to correct the Hessian into a geometric object that we can compare to the metric.

The covariant Hessian uses the linear connection to compensate for the non-tensorial behavior of the Hessian,
\[ \nabla^2 f(q) = \left( \frac{\partial^2 f}{\partial q^i \partial q^j}(q) - \Gamma^k_{ij}(q) \frac{\partial f}{\partial q^k}(q) \right) \mathrm{d} q^i \otimes \mathrm{d} q^j, \]
consistently across all charts. Local comparisons between the covariant Hessian and the metric are then self-consistent across the entire base manifold. This allows us to construct a proper criterion for metric optimality at each point as
\[ g_{ij}(q) = -\nabla^2_{ij} \rho(q) = -\nabla^2_{ij} \left( \log \pi - \frac{1}{2} \log |g| \right)(q), \]
or, using the fact that the covariant Hessian of any function of the metric vanishes,
\[ g_{ij}(q) = -\nabla^2_{ij} \log \pi(q) = -\frac{\partial^2 \log \pi}{\partial q^i \partial q^j}(q) + \Gamma^k_{ij}(q) \frac{\partial \log \pi}{\partial q^k}(q). \]
Likewise the local deviation from optimality can be quantified by the difference
\[ \Delta(q) = g(q) + \nabla^2 \log \pi(q). \]
We can summarize this deviation with any matrix scalar, for example the determinant, $|\Delta(q)|$.

In practice we can achieve $\Delta(q) = 0$ with the proper choice of metric components, but we can also achieve it with an appropriate incomplete reparameterization and its equivalent metric,
\[ \bar{g}_{ij}(q) = J^l_{\ i}(q) \cdot J^m_{\ j}(q) \cdot g_{lm}(q). \]
Consequently substituting the equivalent metric $\bar{g}_{ij}(q)$ into the geometric optimality criterion immediately defines an optimality criterion for reparameterizations,
\[ \bar{\Delta}(q) = \bar{g}(q) + \bar{\nabla}^2 \log \pi(q). \]
Initial excitement is quickly tempered once we inspect the criterion a bit more carefully. The criterion defines a system of coupled, nonlinear differential equations for the elements of the Jacobian matrix which define the optimal reparameterization. This then sets up a system of partial differential equations for the optimal reparameterization itself. In other words, we will not be solving for optimal reparameterizations in general systems any time soon!

The criterion does, however, allow us to analyze specific reparameterizations. Given a specific reparameterization we can verify optimality by constructing the equivalent metric, its corresponding connection, and then computing the scalar deviation function, $|\bar{\Delta}(q)|$. If the determinant doesn't vanish then we can analyze the components of the deviation tensor for insight about the limitations of the chosen reparameterization and potential improvements. Ultimately this criterion provides the theoretical foundation upon which we can begin formal studies of reparameterizations in earnest.
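In the simplest possible setting the criterion can be checked directly. For a Gaussian target and constant metric components (both assumptions of this sketch) the Christoffel terms vanish and the covariant Hessian of $\log \pi$ is just $-\Lambda$, so $\Delta(q) = g - \Lambda$ and the deviation determinant vanishes exactly when the metric matches the precision matrix.

\begin{verbatim}
import numpy as np

# Gaussian target: log pi(q) = const - 0.5 q^T Lambda q, so with a
# constant metric the covariant Hessian of log pi is -Lambda.
Lambda = np.array([[4.0, 1.0], [1.0, 2.0]])  # illustrative precision

def deviation(g, Lambda):
    # Delta(q) = g(q) + covariant Hessian of log pi = g - Lambda here.
    return g - Lambda

print(np.linalg.det(deviation(np.eye(2), Lambda)))  # suboptimal metric
print(np.linalg.det(deviation(Lambda, Lambda)))     # optimal: zero
\end{verbatim}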
4. OPTIMAL REPARAMETERIZATION OF LATENT GAUSSIAN MODELS
To demonstrate the utility of the geometric analysis of incomplete reparameterizationslet’s consider the popular reparameterization that arises when transforming from the cen-tered parameterization to the non-centered parameterization of a latent Gaussian model
EOMETRY OF INCOMPLETE REPARAMETERIZATIONS (Papaspiliopoulos, Roberts and Sk¨old, 2007). This reparameterization is known to drasti-cally improve the empirical performance of geometric algorithms (Betancourt and Girolami,2015) and we can use our new geometric analysis to provide a more formal motivation forits benefits.A latent Gaussian model captures the behavior of an unobserved exchangeable popula-tion of individual parameters, θ = { θ , . . . , θ N } , that follow a Gaussian distribution with location µ and scale τ . There are two naturalparameterizations of the individual parameters, and hence two natural parameterizationsof the latent Gaussian model. Both parameterizations span the entire manifold, so we canlimit our consideration to the entire space instead of a single local chart. The parameters θ and { µ, τ } define the centered parameterization of a latent Gaussianmodel where the model is specified by the probability density function π ( θ , µ, τ ) = π ( θ | µ, τ ) · π ( µ, τ )= N Y n =1 N ( θ n | µ, τ ) · π ( µ, τ ) . Complementing the latent Gaussian model with an observational model for data gener-ated from each individual θ n yields the joint model π ( y , θ , µ, τ, φ ) = N Y n =1 π ( y n | θ n , φ ) · N Y n =1 N ( θ n | µ, τ ) · π ( µ, τ )= N Y n =1 π ( y n | θ n , φ ) N ( θ n | µ, τ ) · π ( µ, τ ) . If the individual likelihood functions are only weakly informative then this joint modelis dominated by the latent Gaussian probability density function which frustrates accuratecomputation. The problem is that the interaction between the individual parameters andthe population scale manifests with a funnel geometry. For large τ the individual θ n areonly weakly coupled to the population mean, but for small τ the θ n collapse into a narrowconcentration around µ (Figure 12). This rapidly varying curvature frustrates Markovtransitions that cannot dynamically adapt.On the other hand, as the observational model becomes more informative the individuallikelihood functions concentrate around just those model configurations that are consistentthe observed data. Eventually this suppresses the pathological neck of the funnel geometry.Consequently with enough data the posterior probability density function will have littlecontribution from the pathological geometry of the latent Gaussian model, and it will bemuch easier to fit with most Markov transitions. BETANCOURT θ n l og τ Fig 12 . In a centered parameterization the probability density function for a latent Gaussian model man-ifests a funnel geometry, where the density concentrates into a narrow volume around µ for small τ butdisperses for large τ . Here and in subsequent figures µ is fixed at zero. In order to quantify the entire proba-bility distribution a Markov transition must be able to explore both regions reasonably quickly which is mucheasier said than done. The non-centered parameterization takes advantage of the fact that we can decouple anyGaussian probability density function, N ( θ | µ, τ ), into a standardized Gaussian probabilitydensity function, N (˜ θ | ,
$1)$, and the deterministic transformation $\theta = \mu + \tau \cdot \tilde{\theta}$. Using $\tilde{\theta} = \{\tilde{\theta}_1, \ldots, \tilde{\theta}_N\}$ as parameters, the latent Gaussian model can be specified by a product of independent probability density functions,
\[
\pi(\tilde{\theta}, \mu, \tau)
= \pi(\tilde{\theta} \mid \mu, \tau) \cdot \pi(\mu, \tau)
= \pi(\tilde{\theta}) \cdot \pi(\mu, \tau)
= \prod_{n=1}^{N} \mathcal{N}(\tilde{\theta}_n \mid 0, 1) \cdot \pi(\mu, \tau).
\]
When incorporating individual observational models, however, the non-centered $\tilde{\theta}_n$ must be coupled to the population parameters in order to recreate each $\theta_n$,
\[
\pi(y, \tilde{\theta}, \mu, \tau, \phi)
= \prod_{n=1}^{N} \pi(y_n \mid \theta_n(\tilde{\theta}_n, \mu, \tau), \phi) \cdot \mathcal{N}(\tilde{\theta}_n \mid 0, 1) \cdot \pi(\mu, \tau)
= \prod_{n=1}^{N} \pi(y_n \mid \mu + \tau \cdot \tilde{\theta}_n, \phi) \, \mathcal{N}(\tilde{\theta}_n \mid 0, 1) \cdot \pi(\mu, \tau).
\]

Fig 13. (Axes: $\theta_n$ versus $\log \tau$; both panels show the centered parameterization with constant metric components.) A funnel geometry frustrates Hamiltonian Monte Carlo in numerous ways. (a) Typical Hamiltonian trajectories span only a limited range of $\tau$ values and hence only slowly explore the entire distribution. (b) Trajectories that do penetrate deeper into the funnel are difficult to numerically integrate, usually resulting in unstable numerical trajectories.

For weakly informative likelihood functions the posterior probability density function is dominated by the latent Gaussian probability density function, which now is free of the pathological funnel geometry. On the other hand, as the likelihood functions concentrate they strongly constrain the latent parameters, but only through the functions $\mu + \tau \cdot \tilde{\theta}_n$. This constraint, however, induces its own funnel geometry! In other words the non-centered parameterization yields a better geometry for weakly informative data and a worse geometry for strongly informative data, inverse to the behavior of the centered parameterization.
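To make the two parameterizations concrete, here is a minimal sketch, not from the paper, of the corresponding log densities for a single latent parameter with no observations. The function names are hypothetical, and unit Gaussian priors on $\mu$ and $\lambda = \log \tau$ are assumed for concreteness, anticipating the prior choice made in the analysis below.

```python
# A minimal sketch, not from the paper, of the centered and non-centered
# log densities for a single latent parameter.  Unit Gaussian priors on
# mu and lambda = log(tau) are assumed here for concreteness.
import jax.numpy as jnp
from jax.scipy.stats import norm

def log_pi_centered(theta, mu, lam):
    # Centered: theta couples to (mu, tau) directly, producing the funnel.
    tau = jnp.exp(lam)
    return (norm.logpdf(theta, loc=mu, scale=tau)
            + norm.logpdf(mu) + norm.logpdf(lam))

def log_pi_noncentered(theta_tilde, mu, lam):
    # Non-centered: theta_tilde is a priori independent of (mu, tau);
    # theta is recovered deterministically as mu + exp(lam) * theta_tilde.
    return (norm.logpdf(theta_tilde)
            + norm.logpdf(mu) + norm.logpdf(lam))
```

The contrast between the two functions makes the geometric trade-off visible: the centered density varies rapidly in `theta` as `lam` decreases, while the non-centered density factorizes until an observational model reintroduces the coupling through `mu + exp(lam) * theta_tilde`.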
Riemannian algorithms that utilize constant metric components are not able to adapt to the rapidly varying curvature of the funnel and will consequently suffer when trying to explore posterior density functions corresponding to weakly-informed likelihoods in the centered parameterization or strongly-informed likelihoods in the non-centered parameterization. For example, in Hamiltonian Monte Carlo this results in exact trajectories that tend to be restricted to narrow neighborhoods of $\tau$ (Figure 13a). Moreover, when the trajectories are lucky enough to venture deeper into the funnel their numerical integration becomes unstable (Figure 13b).

One option around this pathology is to generalize the algorithms by allowing the metric components to vary and capture the Hessian structure of the posterior density function. Although this results in exact and numerical trajectories that are much better behaved (Figure 14), the general algorithms are significantly more challenging to implement.
Fig 14. (Centered parameterization with optimal dynamic metric components; axes: $\theta_n$ versus $\log \tau$.) Using dynamic metric components that capture the second-order structure of the funnel density function itself results in Hamiltonian trajectories that span the entire funnel and explore much more efficiently. The integration of these trajectories is more stable but also more difficult to implement in practice.
In Hamiltonian Monte Carlo, for example, this requires an implicit midpoint symplectic integrator, which needs a fixed point equation to be solved at each integration step.

As we learned in Section 3.2, however, we can achieve the same behavior by applying a particular incomplete reparameterization (Figure 15). While we can't work out the ideal reparameterization analytically, we can investigate how well mapping between the canonical centered and non-centered parameterizations performs.

Fig 15. (Axes: $\theta_n$ versus $\log \tau$; panels: (a) centered parameterization, (b) non-centered full reparameterization, (c) non-centered incomplete reparameterization.) Hamiltonian trajectories from a centered parameterization of a latent Gaussian model with only weakly-informative data are frustrated by the funnel geometry in the posterior density function. (a) Typical trajectories explore only limited neighborhoods. (b) Applying a full reparameterization to a non-centered parameterization results in the same geometric system and hence the same Hamiltonian dynamics. (c) Applying an incomplete reparameterization to the non-centered parameterization, however, modifies the geometry, resulting in Hamiltonian dynamics better suited to explore the funnel.

Consider having no observations, so that the posterior distribution reduces to the latent Gaussian model. In this case empirical experience informs us that a centered parameterization will perform poorly, and we can achieve much better performance by transforming to a non-centered parameterization with the map
\[
\mu = \mu, \qquad \tau = \tau, \qquad \tilde{\theta} = \frac{\theta - \mu}{\tau}.
\]
Because of the exchangeability of the $\theta_n$ we can analyze the efficacy of this reparameterization using any number of components. To further simplify the analysis let's consider only a single individual parameter, $\theta$.

If we fix the metric components to constants while applying this map to the parameters, exactly what effective metric do we induce? To avoid any complications due to the positivity constraint on the population scale let's first reparameterize from $\tau$ to $\lambda = \log \tau$. The non-centering transformation then becomes
\[
\tilde{\theta} = \frac{\theta - \mu}{e^{\lambda}}
\]
with the Jacobian matrix
\[
J = \frac{\partial(\mu, \lambda, \tilde{\theta})}{\partial(\mu, \lambda, \theta)}
= \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
-e^{-\lambda} & -e^{-\lambda}\,(\theta - \mu) & e^{-\lambda}
\end{pmatrix}
\]
and determinant $|J| = e^{-\lambda}$.

If we assume that the initial metric components are equal to the identity matrix, with ones along the diagonal and zeroes elsewhere, $g = \mathbb{I}$, then the equivalent metric components are given by the matrix
\[
g'(\theta, \mu, \lambda) = J^{T}(\theta, \mu, \lambda) \cdot g \cdot J(\theta, \mu, \lambda)
= \begin{pmatrix}
1 + e^{-2\lambda} & e^{-2\lambda}\,(\theta - \mu) & -e^{-2\lambda} \\
e^{-2\lambda}\,(\theta - \mu) & 1 + e^{-2\lambda}\,(\theta - \mu)^2 & -e^{-2\lambda}\,(\theta - \mu) \\
-e^{-2\lambda} & -e^{-2\lambda}\,(\theta - \mu) & e^{-2\lambda}
\end{pmatrix}.
\]
To consider optimality we need an explicit target density function. For the latent Gaussian model that means specifying prior density functions for $\mu$ and $\tau$. Here let's consider unit Gaussian probability density functions for both $\mu$ and $\lambda$, or equivalently a log Gaussian probability density function for $\tau$. The log joint target probability density function is then
\[
\log \pi(\theta, \mu, \lambda)
= -\frac{1}{2} \left( \frac{\theta - \mu}{e^{\lambda}} \right)^{2} - \lambda - \frac{\mu^2}{2} - \frac{\lambda^2}{2} + \text{const}.
\]
We can now compute the covariant derivative of this log probability density function with respect to our induced metric analytically; here I use Headrick (2015) to compute the covariant derivative symbolically. In this case we get an exact cancelation,
\[
\Delta(q) = g'(q) - \nabla\nabla \log \pi(q) = 0.
\]
The non-centering transformation is exactly the optimal incomplete reparameterization! Running a Riemannian algorithm with a unit metric and a non-centered parameterization of the target distribution is equivalent to running an algorithm whose metric captures the local second-order differential structure of the latent Gaussian model but in a centered parameterization.

If the prior densities on $\mu$ and $\lambda$ have non-unit scales then we can maintain optimality by matching those scales in the diagonal elements of the initial metric components. In particular, an adaptive algorithm that sets the diagonal elements of the initial metric components to the variance of each parameter function will be able to sustain optimality for arbitrary prior scales.

Although we cannot in general extend this analytic analysis to nontrivial observational models, we can use the geometric perspective to provide qualitative information about the influence of those models. For example, this optimality criterion considers only the first-order and second-order partial derivatives of the target probability density function, which means that the influence of a nontrivial observational model is captured within the first-order and second-order behavior of the likelihood functions. Contrast this to the Fisher information matrix, which captures the same information but only in expectation with respect to possible observations.

At the same time the geometric analysis is useful for motivating even further questions. For example, the common non-centering reparameterization is geometrically optimal only for the log Gaussian prior density on $\lambda = \log \tau$. This prior choice, however, suppresses the limit $\tau \to 0$.
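As a numerical sanity check on the equivalent-metric algebra above, here is a small sketch, not from the paper, that builds $J^{T} J$ directly from the non-centering map with automatic differentiation and compares it against the closed-form matrix derived above. The function names `noncenter`, `equivalent_metric`, and `analytic_metric` are my own.

```python
# A small numerical check, not from the paper, of the equivalent metric
# g' = J^T J induced by the non-centering map
#   (mu, lambda, theta) -> (mu, lambda, (theta - mu) * exp(-lambda)),
# with the initial metric fixed to the identity.  Names here are my own.
import jax
import jax.numpy as jnp

def noncenter(q):
    mu, lam, theta = q
    return jnp.array([mu, lam, (theta - mu) * jnp.exp(-lam)])

def equivalent_metric(q):
    J = jax.jacfwd(noncenter)(q)   # rows index outputs, columns inputs
    return J.T @ J                 # J^T g J with g = identity

def analytic_metric(q):
    mu, lam, theta = q
    e = jnp.exp(-2.0 * lam)        # e^{-2 lambda}
    d = theta - mu
    return jnp.array([[1.0 + e,  e * d,           -e],
                      [e * d,    1.0 + e * d**2,  -e * d],
                      [-e,       -e * d,           e]])

q = jnp.array([0.3, -0.7, 1.2])    # an arbitrary point (mu, lambda, theta)
print(jnp.allclose(equivalent_metric(q), analytic_metric(q)))  # True
```

The same pattern, composing the transformation with `jax.jacfwd` and forming $J^{T} g J$, extends immediately to the full $N$-parameter model, where a symbolic computation would become unwieldy.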
5. CONCLUSION
Placing inherently geometric algorithms like random walk Metropolis-Hastings, Langevin Monte Carlo, and Hamiltonian Monte Carlo into a proper geometric framework enables a wide range of theoretical analyses. In particular we can use the Riemannian structure of these algorithms to quantify the effect of incomplete reparameterizations. We can even motivate incomplete reparameterizations that optimize the local geometry for all of these algorithms at the same time.

Here we demonstrated this analysis on a particularly simple latent Gaussian model where we could analytically prove the geometric optimality induced by non-centering the natural parameterization, at least in the case of non-influential data. Although the analytic results don't immediately generalize to more complex systems, the qualitative insights stretch beyond the confines of that simple system. They suggest important questions and connections that may ultimately lead to important insights in more general circumstances. The analysis also suggests empirical studies, such as correlating the optimality criterion $|\Delta(q)|$ with effective sample size per iteration or other quantifications of Markov chain Monte Carlo performance.

Insights about geometric algorithms, like those considered in this paper, will continue to be most efficiently mined by using geometric analyses that directly perceive their fundamental structures.
6. ACKNOWLEDGEMENTS
I thank Dan Simpson for critical discussions about reparameterizations and geometry as well as Luiz Carvalho and Charles Margossian for helpful comments on this manuscript.
REFERENCES
Baez, J. C. and Muniain, J. P. (1994). Gauge Fields, Knots, and Gravity. World Scientific, Singapore.
Betancourt, M. (2018a). A Conceptual Introduction to Hamiltonian Monte Carlo.
Betancourt, M. (2018b). A Geometric Theory of Higher-Order Automatic Differentiation.
Betancourt, M. and Girolami, M. (2015). Hamiltonian Monte Carlo for Hierarchical Models. In Current Trends in Bayesian Methodology with Applications (S. K. Upadhyay, U. Singh, D. K. Dey and A. Loganathan, eds.). Chapman & Hall/CRC Press.
Betancourt, M., Byrne, S., Livingstone, S. and Girolami, M. (2016). The Geometric Foundations of Hamiltonian Monte Carlo. Bernoulli.
Brooks, S., Gelman, A., Jones, G. L. and Meng, X.-L., eds. (2011). Handbook of Markov Chain Monte Carlo. CRC Press, New York.
Diaconis, P. and Freedman, D. (1999). Iterated Random Functions. SIAM Review.
Headrick, M. (2015). diffgeo.m. http://people.brandeis.edu/~headrick/Mathematica/diffgeo.m
Hsu, E. P. (2002). Stochastic Analysis on Manifolds. Graduate Studies in Mathematics. American Mathematical Society, Providence, RI.
Lee, J. M. (2013). Introduction to Smooth Manifolds. Springer.
Lee, J. M. (2018). Introduction to Riemannian Manifolds. Graduate Texts in Mathematics. Springer, Cham.
Papaspiliopoulos, O., Roberts, G. O. and Sköld, M. (2007). A General Framework for the Parametrization of Hierarchical Models. Statistical Science.
Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer, New York.
Stan Development Team (2019). Stan: A C++ Library for Probability and Sampling, Version 2.20.0. http://mc-stan.org/
Tierney, L. (1998). A Note on Metropolis-Hastings Kernels for General State Spaces. The Annals of Applied Probability.
Xifara, T., Sherlock, C., Livingstone, S., Byrne, S. and Girolami, M. (2014). Langevin diffusions and the Metropolis-adjusted Langevin algorithm. Statistics & Probability Letters 91.