Incomplete Reparameterizations and Equivalent Metrics
Michael Betancourt
Abstract.
Reparameterizing a probabilistic system is common advice for improving the performance of a statistical algorithm like Markov chain Monte Carlo, even though in theory such reparameterizations should leave the system, and the performance of any algorithm, invariant. In this paper I show how the reparameterizations common in practice are only incomplete reparameterizations, which result in different interactions between a target probabilistic system and a given algorithm. I then consider how these changing interactions manifest in the context of Markov chain Monte Carlo algorithms defined on Riemannian manifolds. In particular I show how any incomplete reparameterization is equivalent to modifying the metric geometry directly.
Michael Betancourt is the principal research scientist at Symplectomorphic, LLC (e-mail: [email protected]).
In practice probabilistic systems are implemented in a specific parameterization of the target space, with the target probability distribution specified by its representative probability density function. Reparameterizing the target space modifies this probability density function but not the probability distribution that it represents. In other words any probabilistic computation should yield equivalent results no matter the parameterization used.

That said, reparameterizations are known to alter the performance of algorithms that implement those probabilistic computations, suggesting that the interaction between the target probability distribution and the algorithm is not invariant. These interactions, and how they relate to a given parameterization, are often opaque and difficult to understand. Developing explicit criteria to identify which parameterization yields the highest performance for a given circumstance is particularly challenging.

The situation improves when the target space is a Riemannian manifold and the algorithm in question exploits that Riemannian structure, as is common for Markov chain Monte Carlo methods. Here we can construct and then analyze a comprehensive geometry that encompasses both the target probabilistic system and the algorithmic system. In particular a geometric analysis reveals that the reparameterizations employed in practice are only incomplete reparameterizations, modifying the target geometry but not the algorithmic geometry and hence changing the relationship between the two.

In this paper I formalize the effect of incomplete reparameterizations for Markov chain Monte Carlo algorithms defined on Riemannian manifolds and construct an implicit criterion for the optimal reparameterization for a given target distribution. I begin by reviewing the basics of Riemannian geometry, and Markov transitions that exploit Riemannian geometry, before demonstrating the duality between reparameterizations of the target space and equivalent metric geometries and introducing a heuristic criterion to identify optimal reparameterizations. Finally I apply these results to latent Gaussian models and their common centered and non-centered parameterizations.
1. RIEMANNIAN MANIFOLDS
For the rest of this paper I will assume familiarity with the basics of differential geometry. Part I of Baez and Muniain (1994) provides an accessible introduction, with Lee (2013) giving a more thorough reference for the concepts and notation that I will use here.

Let our target distribution be defined on a $D$-dimensional smooth manifold, $Q$, with local coordinate functions denoted $\{ q^1(q), \ldots, q^D(q) \}$.

The tangent space at each point, $T_q Q$, is a vector space over the real numbers whose elements can be associated with equivalence classes of one-dimensional curves sharing the same velocity at that point (Figure 1). Local coordinate functions induce a basis within each tangent space given by the velocities of each coordinate function, which we denote by the partial derivatives,
\[ \left\{ \frac{\partial}{\partial q^1}, \ldots, \frac{\partial}{\partial q^D} \right\} \equiv \{ \partial_1, \ldots, \partial_D \}. \]

Fig 1. Each tangent space $T_q Q$ is a $D$-dimensional vector space associated with a point $q \in Q$. If we embed the manifold $Q$ in a higher-dimensional space then we can interpret the tangent space as a plane fixed to $q$ and tangent to the manifold at that point of connection.

All of the tangent spaces in a manifold stitch together to define a $2D$-dimensional manifold with a canonical projection back down to the base space, $\varpi : TQ \to Q$, called the tangent bundle. Vector fields are sections of this bundle, specifying a vector within each tangent space, $v : Q \to TQ$. The space of all vector fields on $Q$ is denoted $\Gamma(Q)$.

Fig 2. The tangent bundle $TQ$ is given by weaving together the tangent spaces attached at each point in the base manifold, $Q$. Here the base manifold is one-dimensional and each tangent space can be represented with a one-dimensional line. Each point in the tangent bundle is identified by a point in the base space, $q \in Q$, and a point in the corresponding tangent space, $v \in T_q Q$. The tangent bundle is equipped with a natural projection operator, $\varpi : TQ \to Q$, that maps each point in the tangent space back to the associated point in the base space.

Similarly the cotangent space at each point, $T^*_q Q$, is a vector space over the real numbers whose elements can be associated with equivalence classes of real-valued functions with the same first-order differential behavior. These covectors are also dual to vectors of the tangent space, with each covector mapping a vector to a real number and vice versa. Within a local chart the coordinate functions define a basis for the cotangent space given by the gradients of the coordinate functions, $\{ \mathrm{d} q^1, \ldots, \mathrm{d} q^D \}$.

As with the tangent spaces, all of the cotangent spaces can be woven together to give the cotangent bundle, $\varpi^* : T^*Q \to Q$. Covector fields, or one-forms, are sections of this bundle, specifying a covector within each cotangent space, $\alpha : Q \to T^*Q$.

Without any additional structure a manifold $Q$ isn't particularly rigid; there is little structure within each tangent space, let alone between tangent spaces. In order to rigidify the manifold, and elevate it to a Riemannian manifold, we need to equip it with additional structure. In particular we need to specify a Riemannian metric and a linear connection, which allow us to compare vectors within a single tangent space as well as vectors in different tangent spaces. Their structure also gives rise to geodesics and the ability to flow through the manifold.

For a thorough introduction to Riemannian manifolds see Lee (2018). In the next few sections I will review the basic concepts that we will need to construct Markov transitions on Riemannian manifolds.
1.1 Riemannian Metrics

A Riemannian metric is a positive-definite map taking two vector fields to the real numbers,
\[ g : \Gamma(Q) \times \Gamma(Q) \to \mathbb{R} \quad (v, u) \mapsto g(v, u), \]
such that $g(u, v) = g(v, u)$ for all $u, v \in \Gamma(Q)$, with $g(u, u) \geq 0$ and $g(u, u) = 0$ only if $u = 0$.

Within each tangent space the metric induces an inner product,
\[ g_q : T_q Q \times T_q Q \to \mathbb{R} \quad (v, u) \mapsto g_q(v, u), \]
which allows us to orient vectors relative to each other. In particular the length of a vector is defined by
\[ ||v|| = \sqrt{g_q(v, v)} \]
while the angle between two vectors is defined by
\[ \cos \theta = \frac{g_q(u, v)}{||u|| \, ||v||}. \]
If $g_q(u, v) = 0$ then the vectors are said to be perpendicular or orthogonal. These concepts allow us to define, for example, orthonormal bases within a tangent space such that each basis vector has unit length and is orthogonal to every other basis vector (Figure 3).

Fig 3. A metric adds rigidity to each tangent space, defining amongst other things orthogonal bases of vectors that coordinate the vector spaces.

One powerful feature of a metric is its ability to transform vector fields into covector fields and vice versa. In particular, given a vector field $v$ we can define a corresponding covector field $v^\flat$ as the covector field satisfying
\[ v^\flat(u) = g(u, v) \]
for any vector field $u$. The inverse of this map takes any covector field $\alpha$ into a corresponding vector field, $\alpha^\sharp$. This inverse transformation can then be used to define an inverse metric over covector fields as
\[ g^{-1}(\alpha, \beta) = g(\alpha^\sharp, \beta^\sharp). \]
By construction any metric is isomorphic to an element of the symmetric tensor product $T^*Q \otimes T^*Q$. We can use this equivalence to represent a given metric in a local coordinate basis with the $D^2$ component functions
\[ g(q) = g_{ij}(q) \, \mathrm{d} q^i \otimes \mathrm{d} q^j. \]
The inverse metric is similarly locally represented by the component functions
\[ g^{-1}(q) = g^{ij}(q) \, \partial_i \otimes \partial_j, \]
where $g^{ij}(q)$ is the matrix inverse satisfying $g^{ij}(q) \cdot g_{jk}(q) = \delta^i_k$ for all $q \in Q$.

If a manifold admits an atlas such that the metric components equal the identity matrix, $g_{ij} = \delta_{ij}$, in every chart then the manifold is said to be Euclidean. Unfortunately this definition is not how the term "Euclidean" is used colloquially in statistics. There "Euclidean" denotes any algorithm with constant metric component functions and "Riemannian" is used to denote a more general algorithm that exploits position-dependent metric component functions. This distinction between component functions, however, is a property of the atlas being used, in particular the parameterizations within the local charts, and not the inherent structure of the manifold. What makes a manifold Euclidean is not that its local metric component functions are constant but rather that they can be made constant with some choice of local parameterizations. For example any algorithm defined on the real numbers is geometrically a Euclidean algorithm no matter the parameterization used.
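To make the coordinate representation concrete, the following minimal NumPy sketch (an illustration under simplifying assumptions, not part of the original development) treats the metric at a single point as a symmetric positive-definite matrix and implements the induced inner product, lengths, angles, and the flat and sharp maps; the particular component values are arbitrary.

\begin{verbatim}
import numpy as np

# Metric components g_ij at a fixed point q, as a symmetric
# positive-definite matrix (an arbitrary illustrative choice).
g = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def inner(g, v, u):
    # Inner product g_q(v, u) = g_ij v^i u^j.
    return v @ g @ u

def norm(g, v):
    # Length of a tangent vector, ||v|| = sqrt(g_q(v, v)).
    return np.sqrt(inner(g, v, v))

def angle(g, v, u):
    # Angle from cos(theta) = g_q(v, u) / (||v|| ||u||).
    return np.arccos(inner(g, v, u) / (norm(g, v) * norm(g, u)))

def flat(g, v):
    # Lower an index: the covector with components g_ij v^j.
    return g @ v

def sharp(g, alpha):
    # Raise an index with the inverse metric: g^ij alpha_j.
    return np.linalg.solve(g, alpha)

v, u = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(norm(g, v), angle(g, v, u))
print(np.allclose(sharp(g, flat(g, v)), v))  # sharp inverts flat
\end{verbatim}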
1.2 Linear Connections

A metric defines concepts like orientation and length which allow us to compare vectors within each tangent space, but a manifold equipped with only a metric is still not rigid enough for us to compare vectors that live in different tangent spaces. To make the manifold fully rigid we need to introduce a connection between these vector spaces.

We start differentially and ask whether or not we can compute directional derivatives of vector fields, in other words how vector fields change along a given direction, just as we can compute directional derivatives of functions. Unlike directional derivatives of functions, however, there is no unique directional derivative of vector fields. Instead we have to impose one.

A linear connection defines a directional derivative of vector fields through a covariant derivative that maps two vector fields into a third,
\[ \nabla : \Gamma(Q) \times \Gamma(Q) \to \Gamma(Q) \quad (v, u) \mapsto \nabla_v u. \]
The first input defines directions at each point in the manifold along which changes will be probed, the second input defines the vector field being probed, and the output defines the vector-valued changes in each tangent space.

In order for such a map to qualify as a derivative, however, it has to satisfy the usual properties of derivations. It must, for example, be linear with respect to multiplying the probing directions by functions,
\[ \nabla_{f_1 \cdot v + f_2 \cdot u} \, w = f_1 \nabla_v w + f_2 \nabla_u w, \]
for any two real-valued functions $f_1 : Q \to \mathbb{R}$ and $f_2 : Q \to \mathbb{R}$. Moreover it must be linear with respect to multiplying the probed vector field by constants,
\[ \nabla_v (a \cdot u + b \cdot w) = a \nabla_v u + b \nabla_v w, \]
for any two real-valued constants $a, b \in \mathbb{R}$. Finally it must satisfy the Leibniz rule with respect to multiplying the probed vector field by functions,
\[ \nabla_v (f \cdot u) = f \nabla_v u + v(f) \cdot u. \]
Within a local coordinate basis the action of the covariant derivative becomes
\[ \nabla_v u = \left( v^i \frac{\partial u^k}{\partial q^i} + \Gamma^k_{ij} v^i u^j \right) \partial_k. \]
In other words the linear connection is completely specified with the $D^3$ component functions denoted $\Gamma^k_{ij}(q)$. These Christoffel coefficients are not the components of a tensor but rather transform in a much more complex way, as they encode second-order differential information. Only in careful combinations, such as in the above equation, do the non-tensorial components cancel to leave a well-defined geometric object.
There are an infinite number of connections on a given manifold, but there is a unique connection that is naturally compatible with a given Riemannian metric. In a local coordinate basis the Christoffel coefficients for such a Riemannian or Levi-Civita connection are given by the functions
\[ \Gamma^k_{ij}(q) = \frac{1}{2} g^{kl}(q) \left( \frac{\partial g_{jl}}{\partial q^i}(q) + \frac{\partial g_{il}}{\partial q^j}(q) - \frac{\partial g_{ij}}{\partial q^l}(q) \right). \]
When discussing Riemannian manifolds a natural connection is often assumed to complement a given metric and fully rigidify the manifold. Here, too, we will assume the choice of a Levi-Civita connection.
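As a concrete check of this formula, here is a minimal NumPy sketch (mine, not from the paper) that computes the Levi-Civita Christoffel coefficients of a user-supplied metric by central finite differences; the conformal metric used at the end is an arbitrary illustrative choice.

\begin{verbatim}
import numpy as np

def christoffel(metric, q, eps=1e-6):
    # Gamma^k_ij = (1/2) g^kl (d_i g_jl + d_j g_il - d_l g_ij),
    # with the metric derivatives estimated by central differences.
    D = len(q)
    g_inv = np.linalg.inv(metric(q))
    dg = np.empty((D, D, D))       # dg[l, i, j] = d g_ij / d q^l
    for l in range(D):
        dq = np.zeros(D)
        dq[l] = eps
        dg[l] = (metric(q + dq) - metric(q - dq)) / (2 * eps)
    Gamma = np.empty((D, D, D))
    for k in range(D):
        for i in range(D):
            for j in range(D):
                Gamma[k, i, j] = 0.5 * sum(
                    g_inv[k, l] * (dg[i, j, l] + dg[j, i, l] - dg[l, i, j])
                    for l in range(D))
    return Gamma

# Illustrative conformally flat metric g_ij(q) = exp(2 q^0) delta_ij.
metric = lambda q: np.exp(2 * q[0]) * np.eye(2)
print(christoffel(metric, np.array([0.3, -0.1])))
\end{verbatim}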
1.3 Geodesics

The differential structure imposed by a connection immediately relates neighboring tangent spaces. It can also relate distant tangent spaces when we traverse special curves through the base manifold, $Q$.

Recall that a curve is a smooth map from an interval of the real numbers into our manifold,
\[ c : I \to Q \quad t \mapsto c(t). \]
In particular the points on a curve and their corresponding tangent spaces define a subset of the tangent bundle, with restricted vector fields defined as sections of this subset. When restricted to this subset the covariant derivative will define how restricted vector fields change along the curve.

One restricted vector field inherent to any curve is the velocity vector field,
\[ \dot{c}(t) \in T_{c(t)} Q, \quad \forall t \in I. \]
When this restricted vector field is placed into both arguments of the covariant derivative, $\nabla_{\dot{c}(t)} \dot{c}(t)$, the output defines how velocities change along the curve. In other words it defines the acceleration along the curve with respect to the chosen connection.

Curves with vanishing acceleration everywhere,
\[ \nabla_{\dot{c}(t)} \dot{c}(t) = 0, \quad \forall t \in I, \]
generalize the concept of a straight line to arbitrary smooth manifolds and are denoted geodesics. Geodesics have a variety of useful properties, but in the context of this paper one of the most useful is that they define a local flow on the base manifold. Each point $q \in Q$ and vector $v \in T_q Q$ intersects with only one geodesic, defining an unambiguous way to move through $Q$, at least within a local neighborhood where the geodesics are well-defined. In other words once we pick a point and a direction we have a deterministic way to slide through the manifold (Figure 4). The flow of the entire manifold along these geodesics is also known as the exponential map,
\[ \phi^{\exp} : TQ \times \mathbb{R} \to Q \quad (q, v, t) \mapsto \phi^{\exp}_{t, v}(q). \]

Fig 4. Each point in the base manifold, $q \in Q$, and vector in the corresponding tangent space, $v \in T_q Q$, in other words each point in the tangent bundle, identifies a unique geodesic curve through $Q$. Following this curve for a given time defines an exponential map that transports the initial point through $Q$.

Geodesics, however, carry not only points along the manifold but also vectors from one tangent space to another. Consider an initial point $q \in Q$, an initial direction $v \in T_q Q$, and the corresponding geodesic curve with $c(t = 0) = q$. For any vector $u \in T_q Q$ there is a unique vector field restricted to the geodesic satisfying $u(c(0)) = u$ and $\nabla_{\dot{c}(t)} u = 0$ along the entire geodesic. This restricted vector field defines the parallel transport of $u$ along the geodesic (Figure 5). Overloading notation a bit I will also refer to this parallel transport as an exponential map,
\[ \phi^{\exp} : Q \times T_q Q \times T_q Q \times \mathbb{R} \to \Gamma(Q) \quad (q, v, u, t) \mapsto \phi^{\exp}_{t, v}(u). \]
By this definition the velocity vectors of a geodesic curve are all parallel transported into each other.

Parallel transport allows us to formalize the intuition for how a linear connection actually connects the tangent spaces in our manifold. The first input of the covariant derivative defines directions, and corresponding geodesics, along which we probe the given vector field. The output of the covariant derivative is given by the change in that probed vector field after being parallel transported for an infinitesimal amount of time (Figure 6),
\[ \nabla_v u = \lim_{\epsilon \to 0} \frac{ (\phi^{\exp}_{\epsilon, v})^{-1} u_\epsilon - u }{\epsilon}. \]
Fig 5. A linear connection defines a transport of vectors in the tangent space of any point along a geodesic to vectors in the tangent space of any other point along that geodesic. Here we transport a vector $u \in T_q Q$ to the vector $u_t \in T_{q_t} Q$, where $q_t$ is the exponential map of $(q, v) \in TQ$ for time $t$.

Fig 6. The covariant derivative can be interpreted as the difference between a vector and its parallel transport for an infinitesimal amount of time. Here the initial point $q \in Q$ and initial vector $v \in T_q Q$ define a geodesic curve and the exponential map $q_\epsilon$ along that curve. The vector $u \in T_q Q$ is parallel transported along the geodesic to the vector $u_\epsilon \in T_{q_\epsilon} Q$; the covariant derivative is the scaled difference between $u$ and the pullback of that parallel transport.

Combining the geodesic flow and this parallel transport we see that the covariant derivative defines a flow along the entire tangent bundle (Figure 7). An initial point $(q, v) \in TQ$ defines a starting location and direction, which then identifies a unique geodesic path through $Q$. At each point on that path we also have the velocity vectors of the geodesic, which are the parallel transports of $v$. Overloading notation once again I will refer to this tangent flow as an exponential map,
\[ \phi^{\exp} : TQ \times \mathbb{R} \to TQ \quad (q, v, t) \mapsto \phi^{\exp}_t(q, v). \]

Fig 7. Any point $q \in Q$ and vector $v \in T_q Q$ defines a geodesic which then transports both $q$ and $v$ along the curve. Together these transports define a flow along the entire tangent bundle.
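The exponential map rarely has a closed form, but it can be approximated by integrating the geodesic equation directly. The following sketch (an illustration, repeating the finite-difference christoffel helper from the previous sketch and reusing the same illustrative metric) transports a point and its velocity with a simple discretization of $\ddot{q}^k = -\Gamma^k_{ij} \dot{q}^i \dot{q}^j$.

\begin{verbatim}
import numpy as np

def christoffel(metric, q, eps=1e-6):
    # Identical to the previous sketch, repeated for self-containment.
    D = len(q)
    g_inv = np.linalg.inv(metric(q))
    dg = np.empty((D, D, D))
    for l in range(D):
        dq = np.zeros(D)
        dq[l] = eps
        dg[l] = (metric(q + dq) - metric(q - dq)) / (2 * eps)
    Gamma = np.empty((D, D, D))
    for k in range(D):
        for i in range(D):
            for j in range(D):
                Gamma[k, i, j] = 0.5 * sum(
                    g_inv[k, l] * (dg[i, j, l] + dg[j, i, l] - dg[l, i, j])
                    for l in range(D))
    return Gamma

def exp_map(metric, q, v, t=1.0, steps=1000):
    # Approximate phi^exp_t(q, v) by discretizing the geodesic
    # equation d2 q^k / dt2 = -Gamma^k_ij dq^i/dt dq^j/dt.
    h = t / steps
    q, v = q.astype(float).copy(), v.astype(float).copy()
    for _ in range(steps):
        Gamma = christoffel(metric, q)
        a = -np.einsum('kij,i,j->k', Gamma, v, v)
        q = q + h * v + 0.5 * h**2 * a   # move the point ...
        v = v + h * a                    # ... and update its velocity
    return q, v

metric = lambda q: np.exp(2 * q[0]) * np.eye(2)
print(exp_map(metric, np.array([0.0, 0.0]), np.array([1.0, 0.5])))
\end{verbatim}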
2. RIEMANNIAN MARKOV TRANSITIONS
Markov chain Monte Carlo (Robert and Casella, 1999; Brooks et al., 2011) explores a target probability distribution, $\pi(\mathrm{d} q)$, defined on $Q$ by sampling from a Markov transition conditioned on a given state, $\tau(\mathrm{d} q \mid q')$. If the Markov transition preserves the target distribution in expectation,
\[ \pi(\mathrm{d} q) = \int \pi(\mathrm{d} q') \, \tau(\mathrm{d} q \mid q'), \]
then repeated transitions generate a sequence of states that converges towards, and eventually disperses across, the typical set of the target distribution. The states in this Markov chain then define Markov chain Monte Carlo estimators
\[ \hat{f}_N = \frac{1}{N} \sum_{n=1}^{N} f(q_n) \]
that asymptotically converge to the true target expectation values,
\[ \lim_{N \to \infty} \hat{f}_N = \int \pi(\mathrm{d} q) \, f(q), \]
under typical regularity conditions. The practical utility of a given Markov transition is determined by how quickly it explores the target distribution and, consequently, how quickly the Markov chain Monte Carlo estimators converge to the true expectation values.

A powerful method for constructing Markov transitions is sampling over a family of deterministic transformations. In particular, if $\phi_t$ is a family of continuous, bijective maps, $\phi_t : Q \to Q$, parameterized by $t \in T$, $\gamma$ is a probability distribution over $T$, and $\mathbb{I}_A(q)$ is the indicator function for the set $A \subset Q$, then
\[ \tau(\mathrm{d} q \mid q') = \int \gamma(\mathrm{d} t) \, \mathbb{I}_{\mathrm{d} q}(\phi_t(q')) \]
defines a Markov transition on $Q$ (Diaconis and Freedman, 1999). If the transformations each preserve the target distribution,
\[ ((\phi_t)_* \pi)(\mathrm{d} q) = \pi(\mathrm{d} q), \]
then this Markov transition will also preserve the target distribution and generate the desired Markov chains; when the individual transformations do not preserve the target distribution, straightforward correction schemes can be applied to each move to ensure the desired invariance. The freedom to choose a family of transformations and a probability distribution over that family allows one to engineer particularly effective Markov transitions, especially when those choices are informed by the structure of the target distribution itself.

Because Markov transitions condition on an initial state they can exploit the local structure of the target distribution within the neighborhood of that state to inform efficient transformations. In particular, if $Q$ is a Riemannian manifold then the local metric structure can be used to construct both families of deterministic transformations and distributions over those families, defining potentially effective Markov transitions. In this section we'll see how the local metric structure of a Riemannian manifold can be used to construct the ingredients of a Markov transition, and review examples of that construction that realize familiar algorithms.
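As a toy illustration of these estimators (my own, with an assumed target and transition rather than anything from the paper), the following sketch runs a Markov chain whose Gaussian autoregressive transition exactly preserves a standard Gaussian target and evaluates the Markov chain Monte Carlo estimator of a second moment.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

# The autoregressive transition q' = rho q + sqrt(1 - rho^2) eps
# preserves the standard Gaussian target (illustrative choice).
rho, N = 0.9, 100_000
q = 0.0
samples = np.empty(N)
for n in range(N):
    q = rho * q + np.sqrt(1 - rho**2) * rng.normal()
    samples[n] = q

# Markov chain Monte Carlo estimator of E[f] with f(q) = q^2;
# the exact expectation value is 1.
print(np.mean(samples**2))
\end{verbatim}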
By exploiting the structure of the tangent and cotangent bundles associated with a manifold we can construct natural transformations that carry us around the space, providing the basis for Markov transitions.

As we saw in Section 1.3, equipping a smooth manifold with a Riemannian metric and its corresponding Levi-Civita connection endows the space with natural geodesics that allow us to transport points and vectors along those curves. These actions define a flow along the tangent bundle, $TQ$, which we referred to as an exponential map,
\[ \phi^{\exp} : TQ \times \mathbb{R} \to TQ \quad (q, v, t) \mapsto \phi^{\exp}_t(q, v). \]
Given an initial point we can identify a particular transformation by choosing a direction, which identifies a unique geodesic path, and a time, which informs how long to move along that path. Dropping the final velocity vector then projects this tangent flow to a flow across the base manifold, $Q$. In other words the choice of vector and integration time parameterize a family of deterministic transformations on $Q$.

Because these transformations are not informed by the target distribution they will not, in general, preserve it. Instead the tangent flow provides proposals that can be corrected to achieve the desired invariance.

Unlike the tangent bundle, the cotangent bundle, $T^*Q$, is naturally equipped with a unique symplectic structure, $\omega$, and symplectic measure, $\Omega$, that allows us to construct flows without the extra structure introduced by a Riemannian metric (Betancourt et al., 2016). Instead of complementing the manifold with a metric, we instead complement the cotangent bundle and its natural symplectic structure with some function
\[ H : T^*Q \to \mathbb{R}, \]
denoted a Hamiltonian. The choice of a Hamiltonian function immediately defines a Hamiltonian flow over the cotangent bundle,
\[ \phi^H : T^*Q \times \mathbb{R} \to T^*Q \quad (q, p, t) \mapsto \phi^H_t(q, p). \]
Hamiltonian flows have the added benefit of inherently preserving the canonical distribution, a probability distribution over the cotangent bundle given by
\[ \pi(\mathrm{d} q, \mathrm{d} p) = e^{-H(q, p)} \, \Omega(\mathrm{d} q, \mathrm{d} p). \]
If the Hamiltonian is chosen such that this canonical distribution projects to our target distribution, then the projection of the Hamiltonian flow will preserve the target distribution, $\pi(\mathrm{d} q)$. We can guarantee the desired invariance by introducing a conditional distribution over the cotangent fibers, $\pi(\mathrm{d} p \mid q)$, defining the lifted joint distribution,
\[ \pi(\mathrm{d} q, \mathrm{d} p) = \pi(\mathrm{d} p \mid q) \, \pi(\mathrm{d} q), \]
and then taking the Hamiltonian to be the corresponding Radon-Nikodym derivative with respect to the symplectic measure,
\[ H = -\log \frac{\mathrm{d} \pi(\mathrm{d} q, \mathrm{d} p)}{\mathrm{d} \Omega(\mathrm{d} q, \mathrm{d} p)} = -\log \pi(p \mid q) - \log \pi(q). \]
Similar to the geodesic-informed tangent flow, this cotangent flow defines a family of transformations from any initial point $q \in Q$ parameterized by the choice of cotangent vector, $p \in T^*_q Q$, which defines a unique Hamiltonian trajectory, and an integration time, $t$, which defines how long to move along that trajectory. Projecting this flow back to the base manifold defines the family of transformations from which we can construct a valid Markov transition.

Although these trajectories are not immediately dependent on a Riemannian metric, they do depend on the choice of Hamiltonian, which itself depends on the choice of some conditional probability distribution $\pi(\mathrm{d} p \mid q)$. As we will see in the next section, building such a conditional probability distribution is greatly facilitated by exploiting the Riemannian metric structure. Consequently in practice these cotangent flows are implicitly informed by the choice of metric.

Both the tangent and cotangent flows introduced above were parameterized by an integration time as well as an initial vector or covector.
In order to incorporate these families of transformations into a Markov transition we need to impose probability distributions over these parameters.

Selecting a distribution over the real-valued integration times is straightforward, although it is not immediately obvious how to select an optimal distribution. The choice of distribution over vectors and covectors, however, is complicated by the abstract geometry of the manifolds involved. Fortunately the rigidity imposed by a Riemannian metric drastically simplifies this problem.

Because the distribution of vectors and covectors can vary with the initial point we really want to define conditional probability distributions over each of the tangent and cotangent spaces (Betancourt et al., 2016). Conveniently a Riemannian metric provides all of the ingredients we need. Within a given tangent space, for example, the metric defines the quadratic form $g_q(v, v)$ and the metric determinant, $|g(q)|$. These are sufficient to construct any elliptical family of probability density functions of the form
\[ \pi(v; q, \phi) = \xi(g_q(v, v), \phi) + \zeta(|g(q)|, \phi), \]
for appropriate choices of the real-valued functions $\xi$ and $\zeta$. Similarly in the cotangent spaces we can use the inverse metric to build elliptical probability density functions of the form
\[ \pi(p; q, \phi) = \iota(g^{-1}_q(p, p), \phi) + \kappa(|g(q)|, \phi). \]
Because the quadratic forms and metric determinants vary smoothly with the base point $q$, these probability density functions fuse together into well-defined conditional probability density functions, $\pi(v \mid q; \phi)$ and $\pi(p \mid q; \phi)$.

Despite their relatively simple form elliptical families span a wide range of distributions, providing useful flexibility when constructing Markov transitions. For example elliptical families include not only the Gaussian family but also more heavy-tailed families like the Laplace and Cauchy families of probability density functions.

Fig 8. Randomly sampling a vector $v \in T_q Q$ and integration time, $t$, defines a transformation that takes an initial point $q$ to another point in the manifold. Repeating this process defines a Markov chain over $Q$ whose stationary distribution will depend on the choice of probability distributions over the initial vectors and integration times.

Metric-informed tangent flows and metric-informed tangent conditional distributions provide the components of a full Markov transition. We start at an initial point, $q$, and begin the transition by sampling an initial vector from the corresponding distribution over the tangent space,
\[ v \sim \pi(\mathrm{d} v \mid q), \]
before sampling an integration time from some distribution that might be informed by this initial configuration,
\[ t \sim \pi(t \mid q, v). \]
The initial point and vector define a unique geodesic along which we integrate for time $t$, generating a random move to a new point in the tangent bundle which we can project back down to the base manifold. The same construction holds for the cotangent bundle, using instead Hamiltonian trajectories and a metric-informed cotangent conditional distribution.

With some additional modifications to correct the transitions and preserve a desired target distribution, this general geometric procedure recovers quite a few well-known algorithms. Here I demonstrate three: random walk Metropolis-Hastings, Langevin Monte Carlo, and Hamiltonian Monte Carlo.
Repeatedly sampling a random direction and then following the corresponding geodesic for some finite time generates a second-order Markov process on the base manifold (Figure 8). In the limit where the integration time vanishes this process converges to a random walk that diffuses across the manifold (Hsu, 2002).

This diffusive behavior explores the manifold but it will not, in general, preserve a specified target distribution. To achieve that behavior we need to correct the random walk, rejecting moves that stray too far from the typical set of the joint distribution on the tangent bundle.

Fig 9. (a) If a proposal strays too far from neighborhoods of high target probability then a Metropolis correction is likely to reject that proposal and return to the initial point. (b) A proposal staying closer to high probability neighborhoods, however, will be accepted and ensure exploration that preserves the target distribution.
A standard approach to such corrections is to consider the moves as proposals which are then accepted or rejected according to a Metropolis-Hastings correction. The deterministic geodesic moves, however, require a small correction to serve as valid Metropolis proposals (Tierney, 1998). In order to admit a well-defined correction we have to compose each move with a reflection operator that flips the sign of the tangent vector after each flow,
\[ R : TQ \to TQ \quad (q, v) \mapsto (q, -v). \]
This negation turns the flow into an involution which returns to the initial state on the tangent bundle when the proposal is applied twice.

After sampling an initial velocity and time, flowing along the corresponding geodesic for that time, and then negating the final velocity we have a valid proposal that can be accepted with probability
\[ \mathbb{P}[\text{accept}] = \min(1, r(q, v)), \]
otherwise returning to the initial state ready for another transition (Figure 9). Here $r$ is the Radon-Nikodym derivative between the joint distribution on the tangent bundle and its pullback under the tangent flow,
\[ r(q, v) = \frac{\mathrm{d} (\phi^{\exp}_t)^* \pi}{\mathrm{d} \pi}(q, v). \]
Because of the careful dependence of the acceptance probability on the joint distribution, the complete transition will always preserve the joint distribution. The marginal chain over the base manifold will then preserve the target distribution.

In local coordinates on the tangent bundle the Radon-Nikodym derivative becomes a ratio of joint probability density functions,
\[ r(q, v) = \frac{\pi(q') \, \pi(-v' \mid q')}{\pi(q) \, \pi(v \mid q)}, \]
where
\[ (q', v') = \phi^{\exp}_t(q, v). \]
The ratio of target probability density functions is known as the Metropolis ratio, while the ratio of tangent conditional probability density functions is known as the Hastings ratio.

In the global coordinates of a Euclidean manifold this construction reduces to the usual random walk Metropolis algorithm. If we specify the tangent distribution with a multivariate Gaussian probability density function then the process of sampling a tangent vector and flowing exactly yields a sample from a multivariate Gaussian on the base manifold whose covariance matrix is given by the components of the metric scaled by the integration time,
\[ \Sigma_{ij}(q) = t \cdot g_{ij}(q), \]
as the algorithm is typically presented.

For infinitesimally small integration times the geodesic random walk without any Metropolis correction defines a Brownian motion over the base manifold. Taking small, but finite, integration times then provides a discrete approximation to that Brownian motion. Introducing the Metropolis correction guides the discretized random walk towards the neighborhoods of high target probability, albeit relatively inefficiently in most contemporary problems.
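The Euclidean special case admits a compact implementation. The sketch below (an illustration with an assumed two-dimensional Gaussian target; the metric and integration time are arbitrary) draws the metric-scaled Gaussian proposal and applies the Metropolis correction, with the Hastings ratio canceling for this symmetric proposal.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

def log_target(q):
    # Toy correlated Gaussian target (illustrative).
    prec = np.array([[2.0, -1.0], [-1.0, 2.0]])
    return -0.5 * q @ prec @ q

def rwm_step(q, chol_cov):
    # Gaussian proposal with covariance Sigma = t * g; the proposal is
    # symmetric so only the Metropolis ratio pi(q') / pi(q) remains.
    proposal = q + chol_cov @ rng.normal(size=q.size)
    if np.log(rng.uniform()) < log_target(proposal) - log_target(q):
        return proposal
    return q

t, g = 0.5, np.eye(2)   # integration time and constant metric components
chol_cov = np.linalg.cholesky(t * g)
q = np.zeros(2)
chain = np.empty((5000, 2))
for n in range(5000):
    q = rwm_step(q, chol_cov)
    chain[n] = q
print(chain.mean(axis=0))
\end{verbatim}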
A similar procedure applies to the cotangent bundle. Sampling a covector from a cotangent conditional distribution and then applying the Hamiltonian flow for some time defines a second-order stochastic process across the cotangent bundle. Unlike the geodesic random walk, however, the Hamiltonian-informed process manifestly preserves the joint distribution on the cotangent bundle, and hence the marginal process preserves the target distribution on the base manifold. For infinitesimally small integration times this process defines an Ornstein-Uhlenbeck process over the base manifold, a drifting diffusion that exactly targets the given joint distribution.

Unfortunately this invariance isn't robust enough to manifest exactly in practical applications, where we have to approximate the Hamiltonian flow with the discrete trajectories of a symplectic integrator, $\Phi^H_{\epsilon, L}$. Here $\epsilon$ denotes the step size of the integration and $L$ the number of steps. Approximating the infinitesimal action of the Hamiltonian flow with one step of a symplectic integrator, $\Phi^H_{\epsilon, 1}$, gives Langevin Monte Carlo, or sometimes unadjusted Langevin Monte Carlo (Xifara et al., 2014).

Although symplectic integrators are exceptionally accurate they are not perfect, and the numerical errors they introduce will bias the discrete transitions away from the target distribution. In order to preserve the invariance of the target distribution, especially in higher dimensions, we need to apply a Metropolis correction just as we did for the geodesics. As in that case we first turn the discrete update into an involution with the composition of a reflection operator,
\[ R : T^*Q \to T^*Q \quad (q, p) \mapsto (q, -p), \]
before applying a Metropolis correction that accepts the updated state with probability
\[ \mathbb{P}[\text{accept}] = \min(1, r(q, p)), \]
where $r$ is the Radon-Nikodym derivative
\[ r(q, p) = \frac{\mathrm{d} (\Phi^H_{\epsilon, 1})^* \pi}{\mathrm{d} \pi}(q, p). \]
Combining the discrete Langevin dynamics with a Metropolis correction defines adjusted Langevin Monte Carlo, or Metropolis-adjusted Langevin Monte Carlo, or typically just MALA.

Because the Metropolis correction compensates only for the errors introduced by the symplectic integrator, and not for the target-agnostic geodesic proposals of random walk Metropolis-Hastings, Metropolis-adjusted Langevin methods perform much better than their random walk equivalents. Still, their overall performance is limited by the diffusive nature of the transitions.
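A minimal sketch of the Metropolis-adjusted update, assuming a Euclidean metric with identity components and a toy Gaussian target (both my assumptions): one leapfrog step approximates the Hamiltonian flow, the momentum flip makes the update an involution, and the acceptance probability reduces to the change in the Hamiltonian.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)

def log_target(q):
    return -0.5 * np.sum(q**2)   # toy standard Gaussian target

def grad_log_target(q):
    return -q

def mala_step(q, eps):
    # H(q, p) = 0.5 p.p - log pi(q); one leapfrog step, then a
    # Metropolis correction with acceptance min(1, exp(H0 - H1)).
    p = rng.normal(size=q.size)
    H0 = 0.5 * p @ p - log_target(q)
    p_half = p + 0.5 * eps * grad_log_target(q)
    q_new = q + eps * p_half
    p_new = p_half + 0.5 * eps * grad_log_target(q_new)
    H1 = 0.5 * p_new @ p_new - log_target(q_new)
    return q_new if np.log(rng.uniform()) < H0 - H1 else q

q = np.zeros(2)
for _ in range(1000):
    q = mala_step(q, eps=0.5)
print(q)
\end{verbatim}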
To fully exploit the Hamiltonian flow we need to follow it for much longer than infinitesimal times, taking advantage of the coherent trajectories to rapidly explore the target distribution. In practice this is accomplished by first sampling a covector from the cotangent conditional distribution and then applying a symplectic integrator for multiple steps, $\Phi^H_{\epsilon, L}$, to simulate the Hamiltonian flow for time $t = \epsilon \cdot L$. This defines the family of Hamiltonian Monte Carlo methods (Betancourt et al., 2016).

Once we have a longer discrete trajectory we still have to correct for the small but non-negligible numerical errors. The first Hamiltonian Monte Carlo methods considered only the final state in the trajectory, applying a Metropolis correction to that state as with random walk Metropolis and Metropolis-adjusted Langevin methods. Modern implementations, however, take advantage of the entire trajectory by going beyond Metropolis corrections. For a thorough discussion see Betancourt (2018a).
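Extending the previous sketch from one leapfrog step to $L$ steps gives a minimal Hamiltonian Monte Carlo transition with the original final-state Metropolis correction; the step size, trajectory length, and Gaussian target are again illustrative assumptions, and modern trajectory-wide corrections are omitted.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)

def hmc_step(q, log_target, grad_log_target, eps, L):
    # Simulate the Hamiltonian flow for time t = eps * L with a
    # leapfrog integrator, then Metropolis-correct the final state.
    p = rng.normal(size=q.size)
    H0 = 0.5 * p @ p - log_target(q)
    q_new, p_new = q.copy(), p.copy()
    for _ in range(L):
        p_new = p_new + 0.5 * eps * grad_log_target(q_new)
        q_new = q_new + eps * p_new
        p_new = p_new + 0.5 * eps * grad_log_target(q_new)
    H1 = 0.5 * p_new @ p_new - log_target(q_new)
    return q_new if np.log(rng.uniform()) < H0 - H1 else q

log_target = lambda q: -0.5 * np.sum(q**2)
grad_log_target = lambda q: -q
q = np.zeros(2)
for _ in range(1000):
    q = hmc_step(q, log_target, grad_log_target, eps=0.2, L=10)
print(q)
\end{verbatim}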
3. REPARAMETERIZATIONS AND EQUIVALENT METRICS
One of the benefits of the purely geometric construction that we have discussed so far is that it explicitly guides implementations. Once local coordinates have been chosen, the probability distributions, namely the target distribution and the tangent or cotangent conditional distributions, can be specified with local probability density functions. Likewise any geometric objects, namely the metric, can be specified with local component functions. In this way everything is manifestly compatible with the chosen coordinates and with everything else. Under a reparameterization we simply begin with a new coordinate system and start the process anew.

That progression from geometric to coordinate, however, is not how algorithms are typically implemented in practice. Instead practitioners often begin with a default local representation of the algorithmic system and are responsible for working out how that local representation transforms under reparameterizations. These transformations are challenging to determine and prone to error; consequently practitioners routinely neglect to transform the algorithmic structure entirely, resulting in incomplete transformations and an entirely different Riemannian geometry.

In this section I first review the proper way local representations transform under a reparameterization of the local coordinates before considering the incomplete reparameterizations typical of practice. In particular I explicitly derive the modified geometries that result from these incomplete reparameterizations. Finally I use this relationship between incomplete reparameterizations and modified geometries to motivate an optimality criterion for reparameterizations in the context of a given target distribution.
3.1 Complete Reparameterizations

There are two equivalent perspectives on the reparameterization of a manifold: the passive and the active (Baez and Muniain, 1994). In the passive perspective a reparameterization fixes the manifold but transforms each chart, while in the active perspective a reparameterization transforms the manifold while fixing the charts. Although these two perspectives are equivalent, the latter is closer to how reparameterizations are implemented in practice and consequently I will focus on that perspective.

More formally, in the active perspective a reparameterization is a diffeomorphism from the base manifold into itself,
\[ \psi : Q \to Q \quad q \mapsto q' = \psi(q), \]
that pushes each chart forward. These chart maps are locally linear transformations represented by Jacobian matrices,
\[ J^i_{\ j}(q) = \frac{\partial \psi^i}{\partial q^j}(q). \]
In other words even a non-linear reparameterization acts like a linear transformation within each local neighborhood. For example, the coordinate functions in the new charts are given by
\[ (q')^i(q') = J^i_{\ j}(\psi^{-1}(q')) \cdot q^j(\psi^{-1}(q')). \]
Similarly the probability density function representation of a probability distribution within a given chart transforms by acquiring a factor of the inverse determinant of the Jacobian matrix,
\[ \pi(q') = \pi(\psi^{-1}(q')) \, \left| J(\psi^{-1}(q')) \right|^{-1}. \]

Fig 10. A reparameterization $\psi : Q \to Q$ transforms not only points in the manifold but also geometric objects defined in the tangent and cotangent spaces. Locally the action of a reparameterization behaves like a rotation given by the Jacobian matrix of the reparameterizing map.

Importantly a reparameterization of the base manifold also affects the local structure of the tangent and cotangent spaces (Figure 10). Tangent vectors push forward along the transformation, the vector $v \in T_q Q$ mapping into a vector $v' \in T_{\psi(q)} Q$. Local bases of the tangent spaces transform as
\[ (\partial')_i = (J^{-1})^j_{\ i}(\psi^{-1}(q')) \cdot \partial_j, \]
which immediately implies that the components of a vector in that basis transform to the components
\[ (v')^i = J^i_{\ j}(\psi^{-1}(q')) \cdot v^j. \]
Likewise probability density functions over a tangent space acquire the same inverse Jacobian determinant as the probability density functions in the local charts. This implies that conditional probability density functions over the tangent bundle pick up two factors of the inverse Jacobian determinant,
\[ \pi(q', v') = \pi\big( \psi^{-1}(q'), (J^{-1})^i_{\ j}(\psi^{-1}(q')) \, (v')^j \big) \cdot \left| J(\psi^{-1}(q')) \right|^{-1} \cdot \left| J(\psi^{-1}(q')) \right|^{-1}. \]
At the same time we can take a more comprehensive perspective and note that any reparameterization over the base manifold $Q$ induces a reparameterization of the entire tangent bundle at once. The Jacobian matrix of this bundle reparameterization is block diagonal with both blocks equal to the Jacobian matrix of the base reparameterization,
\[ J_{TQ} = \begin{pmatrix} J & 0 \\ 0 & J \end{pmatrix}, \]
from which one can readily reproduce all of the previous results. For example,
\[ |J_{TQ}| = \left| \begin{pmatrix} J & 0 \\ 0 & J \end{pmatrix} \right| = |J|^2. \]
Objects in the cotangent spaces naturally pull back along the reparameterization and hence transform in the opposite way to tangent vectors. A local basis of a cotangent space transforms as
\[ (\mathrm{d} q')^i = J^i_{\ j}(\psi^{-1}(q')) \cdot \mathrm{d} q^j, \]
and the components of a covector in that basis transform as
\[ (p')_i = (J^{-1})^j_{\ i}(\psi^{-1}(q')) \cdot p_j. \]
Probability density functions over a cotangent space behave opposite to probability density functions over a local chart or a tangent space; they acquire a factor of the Jacobian determinant without inversion. Critically this implies that probability density functions over the cotangent bundle pick up no Jacobian factors under a reparameterization,
\[ \pi(q', p') = \pi\big( \psi^{-1}(q'), J^j_{\ i}(\psi^{-1}(q')) \, (p')_j \big) \cdot \left| J(\psi^{-1}(q')) \right|^{-1} \cdot \left| J(\psi^{-1}(q')) \right| = \pi\big( \psi^{-1}(q'), J^j_{\ i}(\psi^{-1}(q')) \, (p')_j \big). \]
This hints at the natural probabilistic structure of the cotangent bundle and some of the inherent advantages of algorithms like Langevin Monte Carlo and Hamiltonian Monte Carlo defined there.

Just as a reparameterization of the base manifold induces a reparameterization of the tangent bundle, it also induces a reparameterization of the entire cotangent bundle. Here the Jacobian matrix of the bundle reparameterization is block diagonal, but the lower block now equals the inverse Jacobian matrix of the base reparameterization,
\[ J_{T^*Q} = \begin{pmatrix} J & 0 \\ 0 & J^{-1} \end{pmatrix}. \]
This perspective makes it particularly clear that the Jacobian determinant of the induced reparameterization is exactly one,
\[ |J_{T^*Q}| = \left| \begin{pmatrix} J & 0 \\ 0 & J^{-1} \end{pmatrix} \right| = |J| \cdot |J|^{-1} = 1. \]
From the transformation properties of vectors and covectors we can work out how general tensors transform. In particular we can work out how the component functions of a Riemannian metric transform when we reparameterize the base space. In this case we get two inverse Jacobians, one for each component,
\[ (g')_{lm}(q') = (J^{-1})^i_{\ l}(\psi^{-1}(q')) \cdot (J^{-1})^j_{\ m}(\psi^{-1}(q')) \cdot g_{ij}(\psi^{-1}(q')). \]
As we'd expect from a geometric invariant, the quadratic form defining an elliptical probability density function on the tangent spaces doesn't change under the reparameterization:
\begin{align*}
(g')_{q'}(v', v')
&= (g')_{lm}(q') \, (v')^l \, (v')^m \\
&= (J^{-1})^i_{\ l}(\psi^{-1}(q')) \cdot (J^{-1})^j_{\ m}(\psi^{-1}(q')) \cdot g_{ij}(q) \cdot J^l_{\ r}(\psi^{-1}(q')) \cdot v^r \cdot J^m_{\ s}(\psi^{-1}(q')) \cdot v^s \\
&= \left[ (J^{-1})^i_{\ l}(\psi^{-1}(q')) \cdot J^l_{\ r}(\psi^{-1}(q')) \right] \cdot \left[ (J^{-1})^j_{\ m}(\psi^{-1}(q')) \cdot J^m_{\ s}(\psi^{-1}(q')) \right] \cdot g_{ij}(q) \cdot v^r \cdot v^s \\
&= \delta^i_{\ r} \cdot \delta^j_{\ s} \cdot g_{ij}(q) \cdot v^r \cdot v^s \\
&= g_{ij}(q) \cdot v^i \cdot v^j \\
&= g_q(v, v).
\end{align*}
The transformation properties of the elliptical probability density functions instead depend entirely on their metric determinant terms.

Provided that we reparameterize not just the base space but also the tangent and cotangent spaces, any algorithm based on exact flows will be invariant to reparameterizations; the entire Markov chains they generate will map forward from one parameterization to another without any changes to their dynamics. Algorithms that depend on discrete approximations to these flows will not be exactly invariant, as the approximation error and any correction scheme will in general depend on the local parameterization, but the resulting dynamics will be similar.

Reparameterization, however, is often recommended to improve performance precisely because it is supposed to change the dynamics of the Markov chain. This contradiction is resolved when we realize that the reparameterizations employed in practice are not the complete reparameterizations of a proper geometric system but rather incomplete reparameterizations that transform the initial system into something else entirely.
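These transformation rules are easy to verify numerically. The sketch below (an illustration; the nonlinear map and metric components are arbitrary choices) pushes a vector forward, pulls the metric components back through the inverse Jacobian, and confirms that the quadratic form is unchanged.

\begin{verbatim}
import numpy as np

def jacobian(psi, q, eps=1e-6):
    # Central finite-difference Jacobian J^i_j = d psi^i / d q^j.
    D = len(q)
    J = np.empty((D, D))
    for j in range(D):
        dq = np.zeros(D)
        dq[j] = eps
        J[:, j] = (psi(q + dq) - psi(q - dq)) / (2 * eps)
    return J

psi = lambda q: np.array([np.exp(q[0]), q[1] + q[0]**2])  # illustrative map
g = np.array([[2.0, 0.5], [0.5, 1.0]])   # metric components at q
q = np.array([0.2, -0.4])
v = np.array([1.0, 2.0])

J = jacobian(psi, q)
v_new = J @ v                            # vector components push forward
J_inv = np.linalg.inv(J)
g_new = J_inv.T @ g @ J_inv              # metric components transform

# The quadratic form g_q(v, v) is a geometric invariant.
print(np.isclose(v @ g @ v, v_new @ g_new @ v_new))
\end{verbatim}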
3.2 Incomplete Reparameterizations

In practice any geometric algorithm is implemented with coordinates, components, and probability densities. Typically, however, only the target probability density is exposed to the user. Conditional probability density functions on the tangent or cotangent bundle, or components of the metric that define those conditional densities, are either set to default values not exposed to the user or exposed but significantly limited in flexibility. For example, the default configuration of Stan (Stan Development Team, 2019) forces a metric with constant, diagonal components.

Consequently a user cannot reparameterize the entire geometric system on which these algorithms are based. They can only reparameterize the target probability density function while the tangent and cotangent structures remain fixed. These incomplete reparameterizations result in a different geometry and hence a different algorithm that may interact better or worse with the target distribution.

Fig 11. An incomplete reparameterization forces the metric geometry to be defined using component functions in the reparameterized charts, not the initial charts. Transforming back to the initial parameterization we see that this is equivalent to defining a different metric on the original space than we might have anticipated. In other words an incomplete reparameterization transforms the base manifold while holding the tangent and cotangent spaces fixed, twisting the tangent and cotangent bundles. If we release our hold on these spaces the bundles snap back, revealing the equivalent metric geometries.

While an incomplete reparameterization is not a proper geometric transformation, its effect does admit a convenient geometric interpretation. We begin with a metric specified by the local component functions $g_{ij}(q)$. Reparameterizing the base manifold, $q \mapsto q' = \psi(q)$, but fixing the components of the metric results in a new metric specified by the same components but in the new coordinate system, $g_{lm}(q')$. To compare these metrics we have to completely invert the reparameterization, pulling the new metric back into the original coordinate system,
\[ \bar{g}_{ij}(q) = J^l_{\ i}(\psi^{-1}(q')) \cdot J^m_{\ j}(\psi^{-1}(q')) \cdot g_{lm}(\psi^{-1}(q')) = J^l_{\ i}(q) \cdot J^m_{\ j}(q) \cdot g_{lm}(q). \]
In words, an incomplete reparameterization is equivalent to running the algorithm in the original coordinate system but with the transformed metric (Figure 11),
\[ \bar{g}_{ij}(q) = J^l_{\ i}(q) \cdot J^m_{\ j}(q) \cdot g_{lm}(q). \]
Consequently there is a one-to-one equivalence between incomplete reparameterizations and the choice of metric, and hence the configuration of a Riemannian Markov transition. Under an incomplete reparameterization the geodesics and the elliptical conditional probability distributions on the tangent and cotangent spaces all follow from this new, equivalent metric. These new configurations will induce new dynamics with respect to the target distribution, resulting in modified performance that may or may not be beneficial.
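The equivalent metric itself is just a matrix congruence. A one-dimensional sketch (illustrative; the logarithmic map stands in for any reparameterization): keeping unit metric components while working in $q' = \log q$ is, back in the original chart, the position-dependent metric $\bar{g}(q) = 1/q^2$, which tightens the effective scale near zero.

\begin{verbatim}
import numpy as np

def equivalent_metric(J, g_fixed):
    # gbar_ij(q) = J^l_i(q) J^m_j(q) g_lm for an incomplete
    # reparameterization with fixed components g_lm.
    return J.T @ g_fixed @ J

g_fixed = np.array([[1.0]])             # constant unit components
for q in (0.1, 1.0, 10.0):
    J = np.array([[1.0 / q]])           # Jacobian of q' = log(q)
    print(q, equivalent_metric(J, g_fixed))   # gbar = 1 / q^2
\end{verbatim}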
One advantage of this equivalence is that by applying an incomplete reparameterization we can effectively implement an algorithm with spatially-varying metric components using only an algorithm configured with constant metric components, at least if we can find the right reparameterization. Riemannian Markov transitions in a coordinate system admitting constant metric components are significantly easier to implement robustly, and indeed are often the only option in popular software packages. For example, we can use explicit symplectic integrators to implement Langevin and Hamiltonian Monte Carlo instead of the more expensive, and more fragile, implicit symplectic integrators. By finding an appropriate reparameterization we can reproduce the geometry of a more sophisticated metric without modifying the software itself.

Algorithms exploiting metrics with spatially-varying components in the default coordinate system are often denoted "Riemannian" algorithms in the statistics literature, with those using constant components denoted "Euclidean" algorithms. Using this terminology an incomplete reparameterization allows one to effectively run a Riemannian algorithm using only a Euclidean implementation. Keep in mind, however, that this terminology is technically incorrect, as discussed at the end of Section 1.1.
3.3 Optimal Reparameterizations

An immediate advantage of this identification between an incomplete reparameterization and its equivalent Riemannian geometry is that it allows us to determine reparameterizations that optimize performance with respect to a given target distribution by first determining the optimal Riemannian geometry.

Within a small neighborhood we can approximate the log target probability density function with a Taylor expansion, although that approximation has to be made with care. Firstly the target probability density function is not an invariant function amenable to approximation. We can construct an appropriate function, however, by using the determinant of the metric to correct for the non-invariant behaviors,
\[ \rho(q) = \log \left( \pi(q) \cdot |g(q)|^{-\frac{1}{2}} \right) = \log \pi(q) - \frac{1}{2} \log |g(q)|. \]
We can then construct a local Taylor expansion of this invariant function around a point $q_0$,
\[ \rho(q) = \rho(q_0) + \frac{\partial \rho}{\partial q^i}(q_0) \cdot (q - q_0)^i + \frac{1}{2} \frac{\partial^2 \rho}{\partial q^i \partial q^j}(q_0) \cdot (q - q_0)^i \cdot (q - q_0)^j + \ldots. \]
Finally, if the local chart is in a basin where the gradient and all of the higher-order terms are negligible compared to the constant and quadratic terms then we can approximate the function as
\[ \rho(q) \approx \mathrm{const} + \frac{1}{2} \frac{\partial^2 \rho}{\partial q^i \partial q^j}(q_0) \cdot (q - q_0)^i \cdot (q - q_0)^j. \]
If $\rho$ is concave in that small neighborhood, so that its Hessian is negative definite, then this approximation defines a Gaussian probability density function. In other words in sufficiently small charts where $\rho(q)$ is concave we can approximate the target probability density function with a multivariate Gaussian probability density function defined by the precision matrix
\[ (\Sigma^{-1})_{ij} = -\frac{\partial^2 \rho}{\partial q^i \partial q^j}(q_0). \]
This approximation significantly simplifies the analysis of geometric algorithms that utilize elliptical probability density functions over the tangent or cotangent spaces. Rotating the entire tangent bundle within the small neighborhood, for example, exchanges covariance between the target approximation and the covariance of the tangent probability density function defined by the metric. At the same time rotating the cotangent bundle exchanges precision between the target approximation and the cotangent probability density function defined by the inverse metric. Consequently we can completely decorrelate the local approximation to the target probability density function by choosing a metric that compensates for the local behavior of the Hessian of $\rho$. In each neighborhood this would be accomplished with a metric specified by the components
\[ g_{ij}(q) = -\frac{\partial^2 \rho}{\partial q^i \partial q^j}(q). \]
Unfortunately this equality is valid only within a single chart and hence does not define an optimization criterion that is consistent across the entire base manifold. The main problem is that the Hessian does not transform like a metric but rather like a jet, in particular a one-dimensional, rank-two covelocity (Betancourt, 2018b). We can use the Riemannian structure on the manifold, however, to correct the Hessian into a geometric object that we can compare to the metric.

The covariant Hessian uses the linear connection to compensate for the non-tensorial behavior of the Hessian,
\[ \nabla^2 f(q) = \left( \frac{\partial^2 f}{\partial q^i \partial q^j}(q) - \Gamma^k_{ij}(q) \frac{\partial f}{\partial q^k}(q) \right) \mathrm{d} q^i \otimes \mathrm{d} q^j, \]
consistently across all charts. Local comparisons between the covariant Hessian and the metric are then self-consistent across the entire base manifold. This allows us to construct a proper criterion for metric optimality at each point as
\[ g_{ij}(q) = -\nabla^2_{ij} \rho(q) = -\nabla^2_{ij} \left( \log \pi - \frac{1}{2} \log |g| \right)(q), \]
or, using the fact that the covariant Hessian of any function of the metric vanishes,
\[ g_{ij}(q) = -\nabla^2_{ij} \log \pi(q) = -\frac{\partial^2 \log \pi}{\partial q^i \partial q^j}(q) + \Gamma^k_{ij}(q) \frac{\partial \log \pi}{\partial q^k}(q). \]
Likewise the local deviation from optimality can be quantified by the difference
\[ \Delta(q) = g(q) + \nabla^2 \log \pi(q). \]
We can summarize this deviation with any matrix scalar, for example the determinant, $|\Delta(q)|$.

In practice we can achieve $\Delta(q) = 0$ with the proper choice of metric components, but we can also achieve it with an appropriate incomplete reparameterization and its equivalent metric,
\[ \bar{g}_{ij}(q) = J^l_{\ i}(q) \cdot J^m_{\ j}(q) \cdot g_{lm}(q). \]
Consequently substituting the equivalent metric $\bar{g}_{ij}(q)$ into the geometric optimality criterion immediately defines an optimality criterion for reparameterizations,
\[ \bar{\Delta}(q) = \bar{g}(q) + \bar{\nabla}^2 \log \pi(q). \]
Initial excitement is quickly tempered once we inspect the criterion a bit more carefully. The criterion defines a system of coupled, nonlinear differential equations for the elements of the Jacobian matrix which define the optimal reparameterization. This then sets up a system of partial differential equations for the optimal reparameterization itself. In other words, we will not be solving for optimal reparameterizations in general systems any time soon!

The criterion does, however, allow us to analyze specific reparameterizations. Given a specific reparameterization we can verify optimality by constructing the equivalent metric, its corresponding connection, and then computing the scalar deviation function, $|\bar{\Delta}(q)|$. If the determinant doesn't vanish then we can analyze the components of the deviation tensor for insight about the limitations of the chosen reparameterization and potential improvements. Ultimately this criterion provides the theoretical foundation upon which we can begin formal studies of reparameterizations in earnest.
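In the simplest possible setting the criterion can be checked directly. For a Gaussian target and constant metric components (both assumptions of this sketch) the Christoffel terms vanish and the covariant Hessian of $\log \pi$ is just $-\Lambda$, so $\Delta(q) = g - \Lambda$ and the deviation determinant vanishes exactly when the metric matches the precision matrix.

\begin{verbatim}
import numpy as np

# Gaussian target: log pi(q) = const - 0.5 q^T Lambda q, so with a
# constant metric the covariant Hessian of log pi is -Lambda.
Lambda = np.array([[4.0, 1.0], [1.0, 2.0]])  # illustrative precision

def deviation(g, Lambda):
    # Delta(q) = g(q) + covariant Hessian of log pi = g - Lambda here.
    return g - Lambda

print(np.linalg.det(deviation(np.eye(2), Lambda)))  # suboptimal metric
print(np.linalg.det(deviation(Lambda, Lambda)))     # optimal: zero
\end{verbatim}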
4. OPTIMAL REPARAMETERIZATION OF LATENT GAUSSIAN MODELS
To demonstrate the utility of the geometric analysis of incomplete reparameterizationslet’s consider the popular reparameterization that arises when transforming from the cen-tered parameterization to the non-centered parameterization of a latent Gaussian model
EOMETRY OF INCOMPLETE REPARAMETERIZATIONS (Papaspiliopoulos, Roberts and Sk¨old, 2007). This reparameterization is known to drasti-cally improve the empirical performance of geometric algorithms (Betancourt and Girolami,2015) and we can use our new geometric analysis to provide a more formal motivation forits benefits.A latent Gaussian model captures the behavior of an unobserved exchangeable popula-tion of individual parameters, θ = { θ , . . . , θ N } , that follow a Gaussian distribution with location µ and scale τ . There are two naturalparameterizations of the individual parameters, and hence two natural parameterizationsof the latent Gaussian model. Both parameterizations span the entire manifold, so we canlimit our consideration to the entire space instead of a single local chart. The parameters θ and { µ, τ } define the centered parameterization of a latent Gaussianmodel where the model is specified by the probability density function π ( θ , µ, τ ) = π ( θ | µ, τ ) · π ( µ, τ )= N Y n =1 N ( θ n | µ, τ ) · π ( µ, τ ) . Complementing the latent Gaussian model with an observational model for data gener-ated from each individual θ n yields the joint model π ( y , θ , µ, τ, φ ) = N Y n =1 π ( y n | θ n , φ ) · N Y n =1 N ( θ n | µ, τ ) · π ( µ, τ )= N Y n =1 π ( y n | θ n , φ ) N ( θ n | µ, τ ) · π ( µ, τ ) . If the individual likelihood functions are only weakly informative then this joint modelis dominated by the latent Gaussian probability density function which frustrates accuratecomputation. The problem is that the interaction between the individual parameters andthe population scale manifests with a funnel geometry. For large τ the individual θ n areonly weakly coupled to the population mean, but for small τ the θ n collapse into a narrowconcentration around µ (Figure 12). This rapidly varying curvature frustrates Markovtransitions that cannot dynamically adapt.On the other hand, as the observational model becomes more informative the individuallikelihood functions concentrate around just those model configurations that are consistentthe observed data. Eventually this suppresses the pathological neck of the funnel geometry.Consequently with enough data the posterior probability density function will have littlecontribution from the pathological geometry of the latent Gaussian model, and it will bemuch easier to fit with most Markov transitions. BETANCOURT θ n l og τ Fig 12 . In a centered parameterization the probability density function for a latent Gaussian model man-ifests a funnel geometry, where the density concentrates into a narrow volume around µ for small τ butdisperses for large τ . Here and in subsequent figures µ is fixed at zero. In order to quantify the entire proba-bility distribution a Markov transition must be able to explore both regions reasonably quickly which is mucheasier said than done. The non-centered parameterization takes advantage of the fact that we can decouple anyGaussian probability density function, N ( θ | µ, τ ), into a standardized Gaussian probabilitydensity function, N (˜ θ | ,
$1)$, and the deterministic transformation $\theta = \mu + \tau \cdot \tilde{\theta}$. Using $\tilde{\theta} = \{\tilde{\theta}_1, \ldots, \tilde{\theta}_N\}$ as parameters, the latent Gaussian model can be specified by a product of independent probability density functions,
\[
\pi(\tilde{\theta}, \mu, \tau)
= \pi(\tilde{\theta} \mid \mu, \tau) \cdot \pi(\mu, \tau)
= \pi(\tilde{\theta}) \cdot \pi(\mu, \tau)
= \prod_{n=1}^{N} \mathcal{N}(\tilde{\theta}_n \mid 0, 1) \cdot \pi(\mu, \tau).
\]
When incorporating individual observational models, however, the non-centered $\tilde{\theta}_n$ must be coupled to the population parameters in order to recreate each $\theta_n$,
\[
\pi(y, \tilde{\theta}, \mu, \tau, \phi)
= \prod_{n=1}^{N} \pi(y_n \mid \theta_n(\tilde{\theta}_n, \mu, \tau), \phi) \cdot \mathcal{N}(\tilde{\theta}_n \mid 0, 1) \cdot \pi(\mu, \tau)
= \prod_{n=1}^{N} \pi(y_n \mid \mu + \tau \cdot \tilde{\theta}_n, \phi) \, \mathcal{N}(\tilde{\theta}_n \mid 0, 1) \cdot \pi(\mu, \tau).
\]

Fig 13. (Axes: $\theta_n$ versus $\log \tau$; both panels show the centered parameterization with constant metric components.) A funnel geometry frustrates Hamiltonian Monte Carlo in numerous ways. (a) Typical Hamiltonian trajectories span only a limited range of $\tau$ values and hence only slowly explore the entire distribution. (b) Trajectories that do penetrate deeper into the funnel are difficult to numerically integrate, usually resulting in unstable numerical trajectories.

For weakly informative likelihood functions the posterior probability density function is dominated by the latent Gaussian probability density function, which now is free of the pathological funnel geometry. On the other hand, as the likelihood functions concentrate they strongly constrain the latent parameters, but only through the functions $\mu + \tau \cdot \tilde{\theta}_n$. This constraint, however, induces its own funnel geometry! In other words the non-centered parameterization yields a better geometry for weakly informative data and a worse geometry for strongly informative data, inverse to the behavior of the centered parameterization.
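To make the two parameterizations concrete, here is a minimal sketch, not from the paper, of the corresponding log densities for a single latent parameter with no observations. The function names are hypothetical, and unit Gaussian priors on $\mu$ and $\lambda = \log \tau$ are assumed for concreteness, anticipating the prior choice made in the analysis below.

```python
# A minimal sketch, not from the paper, of the centered and non-centered
# log densities for a single latent parameter.  Unit Gaussian priors on
# mu and lambda = log(tau) are assumed here for concreteness.
import jax.numpy as jnp
from jax.scipy.stats import norm

def log_pi_centered(theta, mu, lam):
    # Centered: theta couples to (mu, tau) directly, producing the funnel.
    tau = jnp.exp(lam)
    return (norm.logpdf(theta, loc=mu, scale=tau)
            + norm.logpdf(mu) + norm.logpdf(lam))

def log_pi_noncentered(theta_tilde, mu, lam):
    # Non-centered: theta_tilde is a priori independent of (mu, tau);
    # theta is recovered deterministically as mu + exp(lam) * theta_tilde.
    return (norm.logpdf(theta_tilde)
            + norm.logpdf(mu) + norm.logpdf(lam))
```

The contrast between the two functions makes the geometric trade-off visible: the centered density varies rapidly in `theta` as `lam` decreases, while the non-centered density factorizes until an observational model reintroduces the coupling through `mu + exp(lam) * theta_tilde`.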
Riemannian algorithms that utilize constant metric components are not able to adapt to the rapidly varying curvature of the funnel and will consequently suffer when trying to explore posterior density functions corresponding to weakly-informed likelihoods in the centered parameterization or strongly-informed likelihoods in the non-centered parameterization. For example, in Hamiltonian Monte Carlo this results in exact trajectories that tend to be restricted to narrow neighborhoods of $\tau$ (Figure 13a). Moreover, when the trajectories are lucky enough to venture deeper into the funnel their numerical integration becomes unstable (Figure 13b).

One option around this pathology is to generalize the algorithms by allowing the metric components to vary and capture the Hessian structure of the posterior density function. Although this results in exact and numerical trajectories that are much better behaved (Figure 14), the general algorithms are significantly more challenging to implement.
Fig 14. (Centered parameterization with optimal dynamic metric components; axes: $\theta_n$ versus $\log \tau$.) Using dynamic metric components that capture the second-order structure of the funnel density function itself results in Hamiltonian trajectories that span the entire funnel and explore much more efficiently. The integration of these trajectories is more stable but also more difficult to implement in practice.
In Hamiltonian Monte Carlo, for example, this requires an implicit midpoint symplectic integrator, which needs a fixed point equation to be solved at each integration step.

As we learned in Section 3.2, however, we can achieve the same behavior by applying a particular incomplete reparameterization (Figure 15). While we can't work out the ideal reparameterization analytically, we can investigate how well mapping between the canonical centered and non-centered parameterizations performs.

Fig 15. (Axes: $\theta_n$ versus $\log \tau$; panels: (a) centered parameterization, (b) non-centered full reparameterization, (c) non-centered incomplete reparameterization.) Hamiltonian trajectories from a centered parameterization of a latent Gaussian model with only weakly-informative data are frustrated by the funnel geometry in the posterior density function. (a) Typical trajectories explore only limited neighborhoods. (b) Applying a full reparameterization to a non-centered parameterization results in the same geometric system and hence the same Hamiltonian dynamics. (c) Applying an incomplete reparameterization to the non-centered parameterization, however, modifies the geometry, resulting in Hamiltonian dynamics better suited to explore the funnel.

Consider having no observations, so that the posterior distribution reduces to the latent Gaussian model. In this case empirical experience informs us that a centered parameterization will perform poorly, and we can achieve much better performance by transforming to a non-centered parameterization with the map
\[
\mu = \mu, \qquad \tau = \tau, \qquad \tilde{\theta} = \frac{\theta - \mu}{\tau}.
\]
Because of the exchangeability of the $\theta_n$ we can analyze the efficacy of this reparameterization using any number of components. To further simplify the analysis let's consider only a single individual parameter, $\theta$.

If we fix the metric components to constants while applying this map to the parameters, exactly what effective metric do we induce? To avoid any complications due to the positivity constraint on the population scale let's first reparameterize from $\tau$ to $\lambda = \log \tau$. The non-centering transformation then becomes
\[
\tilde{\theta} = \frac{\theta - \mu}{e^{\lambda}}
\]
with the Jacobian matrix
\[
J = \frac{\partial(\mu, \lambda, \tilde{\theta})}{\partial(\mu, \lambda, \theta)}
= \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
-e^{-\lambda} & -e^{-\lambda}\,(\theta - \mu) & e^{-\lambda}
\end{pmatrix}
\]
and determinant $|J| = e^{-\lambda}$.

If we assume that the initial metric components are equal to the identity matrix, with ones along the diagonal and zeroes elsewhere, $g = \mathbb{I}$, then the equivalent metric components are given by the matrix
\[
g'(\theta, \mu, \lambda) = J^{T}(\theta, \mu, \lambda) \cdot g \cdot J(\theta, \mu, \lambda)
= \begin{pmatrix}
1 + e^{-2\lambda} & e^{-2\lambda}\,(\theta - \mu) & -e^{-2\lambda} \\
e^{-2\lambda}\,(\theta - \mu) & 1 + e^{-2\lambda}\,(\theta - \mu)^2 & -e^{-2\lambda}\,(\theta - \mu) \\
-e^{-2\lambda} & -e^{-2\lambda}\,(\theta - \mu) & e^{-2\lambda}
\end{pmatrix}.
\]
To consider optimality we need an explicit target density function. For the latent Gaussian model that means specifying prior density functions for $\mu$ and $\tau$. Here let's consider unit Gaussian probability density functions for both $\mu$ and $\lambda$, or equivalently a log Gaussian probability density function for $\tau$. The log joint target probability density function is then
\[
\log \pi(\theta, \mu, \lambda)
= -\frac{1}{2} \left( \frac{\theta - \mu}{e^{\lambda}} \right)^{2} - \lambda - \frac{\mu^2}{2} - \frac{\lambda^2}{2} + \text{const}.
\]
We can now compute the covariant derivative of this log probability density function with respect to our induced metric analytically; here I use Headrick (2015) to compute the covariant derivative symbolically. In this case we get an exact cancelation,
\[
\Delta(q) = g'(q) - \nabla\nabla \log \pi(q) = 0.
\]
The non-centering transformation is exactly the optimal incomplete reparameterization! Running a Riemannian algorithm with a unit metric and a non-centered parameterization of the target distribution is equivalent to running an algorithm whose metric captures the local second-order differential structure of the latent Gaussian model but in a centered parameterization.

If the prior densities on $\mu$ and $\lambda$ have non-unit scales then we can maintain optimality by matching those scales in the diagonal elements of the initial metric components. In particular, an adaptive algorithm that sets the diagonal elements of the initial metric components to the variance of each parameter function will be able to sustain optimality for arbitrary prior scales.

Although we cannot in general extend this analytic analysis to nontrivial observational models, we can use the geometric perspective to provide qualitative information about the influence of those models. For example, this optimality criterion considers only the first-order and second-order partial derivatives of the target probability density function, which means that the influence of a nontrivial observational model is captured within the first-order and second-order behavior of the likelihood functions. Contrast this to the Fisher information matrix, which captures the same information but only in expectation with respect to possible observations.

At the same time the geometric analysis is useful for motivating even further questions. For example, the common non-centering reparameterization is geometrically optimal only for the log Gaussian prior density on $\lambda = \log \tau$. This prior choice, however, suppresses the limit $\tau \to 0$.
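As a numerical sanity check on the equivalent-metric algebra above, here is a small sketch, not from the paper, that builds $J^{T} J$ directly from the non-centering map with automatic differentiation and compares it against the closed-form matrix derived above. The function names `noncenter`, `equivalent_metric`, and `analytic_metric` are my own.

```python
# A small numerical check, not from the paper, of the equivalent metric
# g' = J^T J induced by the non-centering map
#   (mu, lambda, theta) -> (mu, lambda, (theta - mu) * exp(-lambda)),
# with the initial metric fixed to the identity.  Names here are my own.
import jax
import jax.numpy as jnp

def noncenter(q):
    mu, lam, theta = q
    return jnp.array([mu, lam, (theta - mu) * jnp.exp(-lam)])

def equivalent_metric(q):
    J = jax.jacfwd(noncenter)(q)   # rows index outputs, columns inputs
    return J.T @ J                 # J^T g J with g = identity

def analytic_metric(q):
    mu, lam, theta = q
    e = jnp.exp(-2.0 * lam)        # e^{-2 lambda}
    d = theta - mu
    return jnp.array([[1.0 + e,  e * d,           -e],
                      [e * d,    1.0 + e * d**2,  -e * d],
                      [-e,       -e * d,           e]])

q = jnp.array([0.3, -0.7, 1.2])    # an arbitrary point (mu, lambda, theta)
print(jnp.allclose(equivalent_metric(q), analytic_metric(q)))  # True
```

The same pattern, composing the transformation with `jax.jacfwd` and forming $J^{T} g J$, extends immediately to the full $N$-parameter model, where a symbolic computation would become unwieldy.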
5. CONCLUSION
Placing inherently geometric algorithms like random walk Metropolis-Hastings, Langevin Monte Carlo, and Hamiltonian Monte Carlo into a proper geometric framework enables a wide range of theoretical analyses. In particular we can use the Riemannian structure of these algorithms to quantify the effect of incomplete reparameterizations. We can even motivate incomplete reparameterizations that optimize the local geometry for all of these algorithms at the same time.

Here we demonstrated this analysis on a particularly simple latent Gaussian model where we could analytically prove the geometric optimality induced by non-centering the natural parameterization, at least in the case of non-influential data. Although the analytic results don't immediately generalize to more complex systems, the qualitative insights stretch beyond the confines of that simple system. They suggest important questions and connections that may ultimately lead to important insights in more general circumstances. The analysis also suggests empirical studies, such as correlating the optimality criterion $|\Delta(q)|$ with effective sample size per iteration or other quantifications of Markov chain Monte Carlo performance.

Insights about geometric algorithms, like those considered in this paper, will continue to be most efficiently mined by using geometric analyses that directly perceive their fundamental structures.
6. ACKNOWLEDGEMENTS
I thank Dan Simpson for critical discussions about reparameterizations and geometry as well as Luiz Carvalho and Charles Margossian for helpful comments on this manuscript.
REFERENCES
Baez, J. C. and Muniain, J. P. (1994). Gauge Fields, Knots, and Gravity. World Scientific, Singapore.
Betancourt, M. (2018a). A Conceptual Introduction to Hamiltonian Monte Carlo.
Betancourt, M. (2018b). A Geometric Theory of Higher-Order Automatic Differentiation.
Betancourt, M. and Girolami, M. (2015). Hamiltonian Monte Carlo for Hierarchical Models. In Current Trends in Bayesian Methodology with Applications (S. K. Upadhyay, U. Singh, D. K. Dey and A. Loganathan, eds.). Chapman & Hall/CRC Press.
Betancourt, M., Byrne, S., Livingstone, S. and Girolami, M. (2016). The Geometric Foundations of Hamiltonian Monte Carlo. Bernoulli.
Brooks, S., Gelman, A., Jones, G. L. and Meng, X.-L., eds. (2011). Handbook of Markov Chain Monte Carlo. CRC Press, New York.
Diaconis, P. and Freedman, D. (1999). Iterated Random Functions. SIAM Review.
Headrick, M. (2015). diffgeo.m. http://people.brandeis.edu/~headrick/Mathematica/diffgeo.m
Hsu, E. P. (2002). Stochastic Analysis on Manifolds. Graduate Studies in Mathematics. American Mathematical Society, Providence, RI.
Lee, J. M. (2013). Introduction to Smooth Manifolds. Springer.
Lee, J. M. (2018). Introduction to Riemannian Manifolds. Graduate Texts in Mathematics. Springer, Cham.
Papaspiliopoulos, O., Roberts, G. O. and Sköld, M. (2007). A General Framework for the Parametrization of Hierarchical Models. Statistical Science.
Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer, New York.
Stan Development Team (2019). Stan: A C++ Library for Probability and Sampling, Version 2.20.0. http://mc-stan.org/
Tierney, L. (1998). A Note on Metropolis-Hastings Kernels for General State Spaces. The Annals of Applied Probability.
Xifara, T., Sherlock, C., Livingstone, S., Byrne, S. and Girolami, M. (2014). Langevin diffusions and the Metropolis-adjusted Langevin algorithm. Statistics & Probability Letters 91.