Non-reversible Markov chain Monte Carlo for sampling of districting maps
Gregory Herschlag, Jonathan C. Mattingly, Matthias Sachs, Evan Wyse
Department of Mathematics, Duke University, Durham, NC 27708
Department of Statistical Science, Duke University, Durham, NC 27708
The Statistical and Applied Mathematical Sciences Institute (SAMSI), Durham, NC 27709
Abstract.
Evaluating the degree of partisan districting (gerrymandering) in a statistical framework typically requires an ensemble of districting plans drawn from a prescribed probability distribution that adheres to realistic and non-partisan criteria. In this article we introduce novel non-reversible Markov chain Monte Carlo (MCMC) methods for the sampling of such districting plans which have improved mixing properties in comparison to previously used (reversible) MCMC algorithms. In doing so we extend the current framework for construction of non-reversible Markov chains on discrete sampling spaces by considering a generalization of skew detailed balance. We provide a detailed description of the proposed algorithms and evaluate their performance in numerical experiments.

1. Introduction
The use of computer generated alternative redistricting plans to benchmark particular redistricting maps has gained legal and scientific traction in recent years. The generation of such an ensemble of maps has been used to identify and quantify the extent of partisan and racial gerrymandering by answering the question "What would one expect to have happened if no partisan or racial information had been used?" These methods produce a baseline informed by the geo-political geography of the state which does not assume proportional representation or rely on unrealistic symmetry assumptions. This baseline can then be used to evaluate a particular redistricting plan of interest.

In [18, 11, 16, 20], an ensemble of maps is generated by sampling from a probability distribution constructed on the space of possible redistricting plans using only non-partisan considerations. In this thread of work, the sampling was performed via Markov chain Monte Carlo (MCMC) using a standard Metropolis-Hastings algorithm based on a single node flip proposal chain. Other ensemble methods have used generative techniques based on optimization, genetic algorithms, or Markov chains without a clearly describable stationary measure. Examples of the latter include the generation of samples using simulated annealing [2, 13, 12] and Markov chains based on merge-split operations [6] (see also [5] for an extension of the latter work which allows one to generate samples from a prescribed target measure).

Many of the above samplers utilize the Metropolis-Hastings algorithm, so the underlying generating Markov chain is reversible. The reversible methods, by definition, have a Markov kernel satisfying a detailed balance condition with respect to the corresponding stationary measure. Heuristically, this implies that the Markov chain has a diffusive nature.
Other samplers which are non-reversible typically sample from an unknown distribution.

In recent years, MCMC methods based on non-reversible Markov chains (i.e., Markov chains whose Markov kernel fails to satisfy a detailed balance condition) have attracted increased attention because of their favorable convergence and mixing properties; without claim to completeness of the work listed, we refer the reader to [22, 25, 28, 26, 14] and to [3, 19, 4, 10, 9, 15, 8] for examples of non-reversible MCMC methods for sampling on continuous spaces and discrete spaces, respectively. For many of these methods, improved mixing over their reversible counterparts is folklore among practitioners; however, there is a growing body of theoretical work that supports these claims [7, 10].

In the setup of a continuous sampling space, non-reversible Markov chains naturally arise through time discretization of stochastically perturbed versions or modifications of Newton's equations of motion (see e.g., [1, 23]). In these cases, reversibility of the dynamics is broken due to the presence of inertia modeled by the momenta associated with each degree of freedom. This is consistent with the physical intuition that the resulting ballistic-like motion tends to exhibit better mixing properties than the purely diffusive dynamics of a reversible Markov chain. For example, the existence of momentum is typically cited as the strength of Langevin sampling over simple Brownian dynamics.

For sampling in discrete spaces, a common approach for designing non-reversible MCMC methods is what is sometimes referred to as "lifting" [29, 19]. Here, a reversible MCMC method is modified by replicating the state space through the introduction of a dichotomous auxiliary variable taking values in {−1, +1}, along with a simple directed subgraph of the Markov state graph induced by the original reversible Markov chain.
Depending on the value of the auxiliary variable, transition probabilities along the assigned directions of the simple subgraph are then either increased or decreased. As such, the auxiliary variable has a similar effect as the momentum variable in the continuous setup. For example, in the case where the Markov state graph induced by the original reversible MCMC method is a circular graph, a simple way of implementing a lifting approach is by increasing clockwise transition probabilities for a positive value of the auxiliary variable and increasing counter-clockwise transition probabilities for a negative value of the auxiliary variable [7].

The sampling efficiency of the Markov chain obtained by lifting depends strongly on the choice of the simple directed subgraph. While in the above mentioned example a "good" choice can easily be identified, constructing a suitable subgraph for Markov chains whose associated Markov state graph has a more complex topology can be difficult.

In this article, we introduce non-reversible MCMC methods for the sampling of redistricting maps. Creating a collection of redistricting maps, via sampling of a specified measure, is an important step in many methods currently used to evaluate redistricting and to detect and explain gerrymandering. In this note, we introduce a heuristic for implementing an efficient lifting approach which is based on a notion of flowing the center of mass of districts along a defined vector field; the center of mass arises from an embedding of the districting graph in R². We also introduce a novel construction for non-reversible MCMC dynamics as a generalization of the standard lifting approach which allows the incorporation of multiple momenta variables. This allows us to construct non-reversible MCMC schemes for our application which make use of the structure of the induced district-level graph.
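The lifted walk on a circular graph described above can be sketched in a few lines. The example below is our own illustration, not code from the paper: the target is simply the uniform distribution on a cycle of n states, so every directed move is accepted, and the auxiliary variable theta is flipped with a small probability to keep the chain ergodic:

```python
import random

def lifted_cycle_step(x, theta, n, flip_prob=0.1):
    """One step of a lifted walk on a cycle of n states targeting the
    uniform distribution: move one step in the direction theta, except
    that with probability flip_prob the direction is reversed instead."""
    if random.random() < flip_prob:
        return x, -theta
    return (x + theta) % n, theta

def run(n=10, steps=100_000, seed=0):
    """Empirical occupation frequencies of the position component."""
    random.seed(seed)
    counts = [0] * n
    x, theta = 0, 1
    for _ in range(steps):
        x, theta = lifted_cycle_step(x, theta, n)
        counts[x] += 1
    return [c / steps for c in counts]
```

Because the walk travels ballistically in the direction of theta rather than diffusing, it traverses the cycle in a number of steps that grows linearly, rather than quadratically, in the cycle length.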
Finally, we combine these methods with a tempering scheme which minimizes rejection rates in the non-reversible Markov chain and thereby increases sampling efficiency.

The remainder of this article is organized as follows. In Section 3.1, we review the formal definition of non-reversibility of Markov chains on discrete sampling spaces. In Section 3.2, we review the basic construction of non-reversible MCMC methods via a skew detailed balance condition. In Section 4, we describe a novel construction of non-reversible MCMC schemes which allows for multiple momenta corresponding to different proposal chains. In Section 5, we describe the implementation of our approach for the application under consideration; in Section 7, we test our ideas numerically.

2. Exposition of the main algorithms
Before we rigorously develop the underlying mathematical framework, we start by informally describing the two main sampling algorithms proposed in this article and demonstrating how they would be applied to sampling redistricting plans for the North Carolina Congressional Delegation. We construct our non-reversible sampling methods as modifications of a variant of the single node flip algorithm (see Section 5.3), where random redistricting maps are sequentially generated by changing in each iteration the color (district allocation) of a single precinct located at the border of the current redistricting map. We introduce non-reversibility by directing transitions along what we informally refer to as a flow. Depending on the value of a momentum variable, only transitions in the positive or negative direction are permitted, resulting in macroscopic kinetic-like movements along/against the flow.

The intuition behind the first proposed method ("Center-of-mass flow", see Section 6.1 and Fig. 1) is that fast mixing on a macroscopic level is obtained if districts tend to collectively follow the flow of a suitable, well-stirring vector field in R², in which the district graph is embedded. For example, under an appropriate choice of the vector field, the resulting collective rotational movements of districts in the course of a simulation produce more efficient mixing than more diffusive sampling algorithms (see Fig. 1). Technically, we implement this idea by aligning the movement of the centers of mass of districts with the vector field. For a positive/negative momentum value, only transitions for which the midpoints of the centers of mass of the modified districts move in the positive/negative direction of the vector field are permitted. The second proposed method ("(Pair-wise) District-to-district flow", see Section 6.2 and Fig. 2) utilizes an extended framework which allows the incorporation of multiple momenta, each associated with a different flow.

Figure 1. Center-of-mass flow introduced in Section 6.1. Changes in district boundaries must, on average, move the centers of mass of the districts either with or against the drawn vector field. Drawing the 13-district N.C. Congressional map is used as an example.

Figure 2. District-to-district flow introduced in Section 6.2. Depending on the individual velocity values associated with each vector, the corresponding boundaries between adjacent districts may only move either in the direction of, or in the direction opposite to, the displayed vector.

The idea of the second method is to associate a momentum variable with each district pair. Depending on the value of the respective momentum, only transitions that flow districts in a direction aligned with the orientation of the respective momentum arrow are permitted. For example, consider the redistricting plan depicted in Fig. 2. If the value of the momentum variable associated with the orange and light blue districts is positive, then among the transitions which modify both of these two districts, only transitions that add a precinct from the light blue district to the orange district are permitted.
3. Reversible and Non-Reversible Markov Chains
3.1. Detailed Balance.
Consider a Markov chain on a countable state space X with a transition kernel P. The Markov kernel P is said to be reversible if there exists a probability distribution π on X so that the pair (P, π) satisfies detailed balance. That is,

    π(x) P(x, x′) = π(x′) P(x′, x),  ∀ x, x′ ∈ X.  (1)

(We restrict to a countable state space for simplicity. There are no inherent obstructions to generalizing to a general Polish space; see Remark 4.5.)
Markov transition kernels P which fail to satisfy the detailed balance condition for any measure π are referred to as non-reversible. Since π(x)P(x, x′) is the probability flux in equilibrium flowing from state x to x′, detailed balance can be restated as: the equilibrium flux from x to x′ is the same as that from x′ to x.

The detailed balance condition is a sufficient, but not a necessary, condition for the transition kernel P to preserve the measure π. By definition, invariance of the measure π only requires that πP = π, which is just a compact notation for

    Σ_{x ∈ X} π(x) P(x, x′) = π(x′),  ∀ x′ ∈ X.

It can be rewritten as

    Σ_{x′ ∈ X\{x}} π(x′) P(x′, x) = Σ_{x′ ∈ X\{x}} π(x) P(x, x′),  ∀ x ∈ X,  (2)

and, as such, states that for any state x ∈ X the total probability flux into the state x (the left-hand side of (2)) is equal to the total probability flux out of the state x (the right-hand side of (2)). This condition is commonly referred to as a global balance condition and is satisfied by any Markov kernel which preserves π.

3.2. Skew detailed balance.
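The distinction between detailed balance (1) and global balance (2) is easy to probe numerically. The following sketch (an illustration of ours, not from the paper) checks both conditions for a kernel given as a row-stochastic matrix; the example kernel circulates probability around a triangle, so it preserves the uniform measure while violating detailed balance:

```python
def satisfies_detailed_balance(P, pi, tol=1e-12):
    """Check pi(x) P(x, x') == pi(x') P(x', x) for all pairs, i.e. (1)."""
    n = len(P)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
               for i in range(n) for j in range(n))

def preserves_measure(P, pi, tol=1e-9):
    """Check the global balance condition pi P == pi, i.e. (2)."""
    n = len(P)
    return all(abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j]) <= tol
               for j in range(n))

# A "rotational" kernel on three states: doubly stochastic, hence it
# preserves the uniform measure, but the clockwise and counter-clockwise
# fluxes differ, so detailed balance fails.
P = [[0.0, 0.7, 0.3],
     [0.3, 0.0, 0.7],
     [0.7, 0.3, 0.0]]
pi = [1 / 3, 1 / 3, 1 / 3]
```

Here `P` is non-reversible in exactly the sense of the text: global balance holds while condition (1) fails for every off-diagonal pair.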
A common way of constructing non-reversible Markov chains with a prescribed invariant measure π is by enforcing global balance through an involutive transform. This structure is called skew detailed balance and ensures that detailed balance holds up to some π-invariant involutive transformation. More precisely, let S: X → X be a π-invariant involutive transformation, so that S = S⁻¹ and π(S(A)) = π(S⁻¹(A)) = π(A) for all A ⊂ X. Then the Markov kernel P satisfies skew detailed balance if

    π(x) P(x, x′) = π(x′) P(S(x′), S(x)),  ∀ x, x′ ∈ X.  (3)

It is easy to verify that skew detailed balance implies global balance (see e.g. [24], or the proof of Theorem 4.1 in Appendix B), and thus invariance of π under P.

Due to its local nature, skew detailed balance with respect to π can be easily enforced by an accept-reject step. More precisely, let Q denote a "proposal" Markov kernel on X, and denote by (x_k)_{k ∈ N} the Markov chain generated by the following generalization of the Metropolis-Hastings algorithm:

(1) x′ ∼ Q(x_k, ·),
(2) with probability r(x_k, x′) set x_{k+1} = x′; otherwise x_{k+1} = S(x_k),

where

    r(x, x′) := min( 1, [π(x′) Q(S(x′), S(x))] / [π(x) Q(x, x′)] ).

(The detailed balance condition is equivalent to the Kolmogorov definition of reversibility, which requires that the probability of following any sequence of states is the same as that of following the sequence in reverse order; this justifies the name reversible. Here, and in the remainder of this article, we denote by N the set of non-negative integers.)
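As a minimal sketch (ours, with hypothetical helper names), the two-step scheme above translates directly into code; `Q_sample` draws from Q(x, ·), `Q_prob` evaluates Q(x, x′), and `S` is the involution:

```python
import random

def skew_mh_step(x, pi, Q_sample, Q_prob, S):
    """One step of the skew Metropolis-Hastings scheme of Section 3.2.

    pi       : dict mapping states to (unnormalized) target weights
    Q_sample : function x -> proposed state x'
    Q_prob   : function (x, x') -> proposal probability Q(x, x')
    S        : pi-invariant involution on the state space
    """
    xp = Q_sample(x)
    num = pi[xp] * Q_prob(S(xp), S(x))
    den = pi[x] * Q_prob(x, xp)
    r = min(1.0, num / den) if den > 0 else 1.0
    if random.random() < r:
        return xp          # accept the proposal
    return S(x)            # reject: apply the involution instead
```

On a lifted cycle with a uniform target, deterministic directed proposals, and S flipping the direction of travel, the ratio r is identically one, so the chain moves ballistically around the cycle.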
Provided that the acceptance probability r(x, x′) is well defined for all pairs (x, x′), the transition kernel of the generated Markov chain takes the form

    P(x, x′) = r(x, x′) Q(x, x′) + (1 − r(x, x′)) 𝟙_{{S(x)}}(x′),

which can indeed be verified to satisfy the skew detailed balance condition (3) (see e.g., [24] or the proof of Theorem 4.2 in Appendix B).

4. A General Non-Reversible Process Construction
In this section we first introduce a generalization of the standard skewed balance condition, termed the mixed skewed balance condition, and show that this condition is sufficient for the corresponding Markov kernel to preserve a prescribed probability measure. We then provide a generalization of the Metropolis-Hastings algorithm, the Mixed Skew Metropolis-Hastings algorithm (MSMH), which utilizes the mixed skewed balance condition.

4.1. The mixed skewed balance condition.
In the following, let {S_i}_{i=1}^n be a collection of π-invariant involutions, {P_i}_{i=1}^n a collection of Markov kernels on X, and

    ω: X → Δ^{n−1},  ω(x) = (ω_1(x), ..., ω_n(x)),

a weight vector taking values in the nth standard simplex Δ^{n−1} := { y ∈ R^n : y_i ≥ 0, Σ_{i=1}^n y_i = 1 }. Since at each point x the weights are non-negative and sum to one, we can build a new kernel P out of the collection of Markov kernels {P_i}_{i=1}^n by setting P = ω · P, which is written more explicitly as

    P(x, ·) = Σ_{i=1}^n ω_i(x) P_i(x, ·),  ∀ x ∈ X.

From this we see clearly that P is an x-dependent mixture of the kernels {P_i}. With this comes the interpretation that a draw from P can be realized by first picking an index i according to the weights and then drawing the next state according to P_i.

We say that the Markov kernel P satisfies mixed skewed balance with respect to π if, for all x, x′ ∈ X and for all i ∈ {1, ..., n},

    ω_i(x) π(x) P_i(x, x′) = ω_i(x′) π(x′) P_i(S_i(x′), S_i(x)).  (4)

As discussed further in Remark 4.3, equation (4) means that the i-th kernel P_i satisfies the skew detailed balance condition for an invariant measure proportional to π(x) ω_i(x). Yet, as the following results show, by mixing these kernels according to the weights ω_i(x), one obtains a Markov kernel which has π as its invariant measure. Typically one has that π(x) = π(S_i(x)) and ω_i(x) = ω_i(S_i(x)); hence, (4) can again be understood as a probability flux balancing condition: the flux from x to x′ is equal to the flux from S_i(x′) to S_i(x), given that we chose the i-th kernel P_i according to the weight ω_i.

Theorem 4.1.
If the Markov kernel P defined by the collection {(ω_i, P_i, S_i)}_{i=1}^n satisfies mixed skewed balance with respect to π, then πP = π, i.e., the Markov kernel P preserves the probability measure π.

For a proof of this theorem, see Appendix B.
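Drawing from the mixture kernel P = Σ_i ω_i(x) P_i(x, ·) is exactly the two-stage procedure described above: pick a component, then move. A minimal sketch of ours (`omega` and the component kernels are placeholder callables, not objects from the paper):

```python
import random

def mixture_step(x, omega, kernels):
    """One draw from P(x, .) = sum_i omega_i(x) P_i(x, .):
    first pick a component index i with probability omega_i(x),
    then sample the next state from the chosen kernel P_i."""
    weights = omega(x)
    i = random.choices(range(len(kernels)), weights=weights)[0]
    return kernels[i](x)
```

Note that the weights are re-evaluated at the current state x on every step; this x-dependence of the mixture is what distinguishes the construction from a plain mixture of π-invariant kernels.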
4.2. The Mixed Skew Metropolis-Hastings algorithm.
Consider a collection of n Markov "proposal kernels" Q_i, i = 1, ..., n, defined on subsets X_i ⊂ X which form a cover of the whole domain, i.e., ∪_{i=1}^n X_i = X. Moreover, let S_i: X_i → X_i, i = 1, ..., n, be a collection of π-invariant involutions.

In what follows we describe how the collections of proposal kernels and involutions, together with a suitable state dependent weight vector ω: X → Δ^{n−1}, can be used to generate a Markov chain which preserves the target measure π. The mixed skewed balance condition provides the appropriate framework for "patching" these proposal kernels together to obtain a Markov chain which samples from the target measure π.

Algorithmically, this can be implemented in a two-step algorithm (see Algorithm 1). In the first step of this algorithm, a proposal x′ is generated from the current state x of the Markov chain as

    x′ ∼ Q_i(x, ·),  where i ∼ ω(x).

The mixed skew detailed balance condition is then enforced through an accept-reject step, where the proposal is accepted with probability

    min( 1, [π(x′) ω_i(S_i(x′)) Q_i(S_i(x′), S_i(x))] / [π(x) ω_i(x) Q_i(x, x′)] ),

in which case the subsequent state of the Markov chain is set to be x′; otherwise the proposal is rejected, in which case the next state of the Markov chain is set to be the i-th involutive transformation of the current state, that is, S_i(x).

In order for these two steps to be well-defined, and for the resulting transition kernel to indeed preserve the target measure π, we require the weight vector ω to satisfy

(C1) ω_i(x) > 0 ⟺ x ∈ X_i, for i = 1, ..., n;
(C2) ω_i(x) > 0 and Q_i(x, x′) > 0 imply ω_i(x′) > 0 and Q_i(S_i(x′), S_i(x)) > 0;
(C3) ω_i(S_i(·)) = ω_i(·) for i = 1, ..., n, i.e., the i-th entry of the weight vector is invariant under the i-th involutive transformation.

Condition (C1) ensures that the effective proposal kernel Q, Q(x, ·) = Σ_{i=1}^n ω_i(x) Q_i(x, ·) for all x ∈ X, is well defined, and (C2) ensures that the Metropolis ratio r_i(x, x′) is well defined. Invariance of the i-th weight under the i-th involution, as stated in (C3), ensures that the mixed skew detailed balance condition holds for the generated Markov chain. In summary, we have

Theorem 4.2.
Let X_i ⊂ X, i = 1, ..., n, be a cover of X, and let Q_i, i = 1, ..., n, be Markov kernels defined on X_i, respectively. Moreover, let S_i, i = 1, ..., n, be a collection of π-invariant involutions on X, and let ω: X → Δ^{n−1} be an x-dependent weight vector satisfying the conditions (C1) to (C3). Then the MSMH Markov chain generated by Algorithm 1 possesses π as an invariant measure.

For a proof of this theorem, see Appendix B.
Algorithm 1: Mixed Skew Metropolis-Hastings (MSMH)
  input: x
  sample partition i ∼ ω(x);
  generate proposal x′ ∼ Q_i(x, ·);
  r_i(x, x′) ← [π(x′) ω_i(S_i(x′)) Q_i(S_i(x′), S_i(x))] / [π(x) ω_i(x) Q_i(x, x′)];
  sample u ∼ U([0, 1]);
  if u < r_i(x, x′) then x ← x′ else x ← S_i(x);
  return x

Algorithm 2: Mixed Skew Metropolis-Hastings on extended state space
  input: ξ, θ⃗
  sample partition i ∼ ω̃(ξ);
  sample proposal (ξ′, θ⃗′) ∼ Q_i((ξ, θ⃗), (·, ·));
  r_i((ξ, θ⃗), (ξ′, θ⃗′)) ← [ω̃_i(ξ′) π̃(ξ′) Q_i((ξ′, R_i(θ⃗′)), (ξ, R_i(θ⃗)))] / [ω̃_i(ξ) π̃(ξ) Q_i((ξ, θ⃗), (ξ′, θ⃗′))];
  sample u ∼ U([0, 1]);
  if u < r_i((ξ, θ⃗), (ξ′, θ⃗′)) then (ξ, θ⃗) ← (ξ′, θ⃗′) else θ_i ← −θ_i;
  return ξ, θ⃗

The Mixed Skew Metropolis-Hastings algorithm in generic form (Algorithm 1) and as obtained via augmenting the sampling space (Algorithm 2).
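A direct transcription of Algorithm 1 into code might look as follows. This is a sketch of ours; the representation of π, ω, the Q_i and the S_i as Python callables is an assumption, and the un-truncated ratio r_i is compared against a uniform draw exactly as in the listing:

```python
import random

def msmh_step(x, pi, omega, Q_sample, Q_prob, S):
    """One iteration of Mixed Skew Metropolis-Hastings (Algorithm 1).

    pi       : dict of (unnormalized) target weights pi(x)
    omega    : function x -> list (omega_1(x), ..., omega_n(x))
    Q_sample : list of samplers, Q_sample[i](x) draws from Q_i(x, .)
    Q_prob   : list of evaluators, Q_prob[i](x, xp) = Q_i(x, xp)
    S        : list of involutions S[i]
    """
    w = omega(x)
    i = random.choices(range(len(w)), weights=w)[0]   # sample partition i ~ omega(x)
    xp = Q_sample[i](x)                               # proposal x' ~ Q_i(x, .)
    num = pi[xp] * omega(S[i](xp))[i] * Q_prob[i](S[i](xp), S[i](x))
    den = pi[x] * w[i] * Q_prob[i](x, xp)
    r = num / den                                     # Metropolis ratio r_i(x, x')
    if random.random() < r:
        return xp                                     # accept the proposal
    return S[i](x)                                    # reject: apply i-th involution
```

With a single flow (n = 1) this reduces to the skew Metropolis-Hastings step of Section 3.2.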
Remark 4.3. The transition kernel P of the Markov chain generated by Algorithm 1 takes the explicit form P(x, ·) = Σ_{i=1}^n ω_i(x) P_i(x, ·) with

    P_i(x, x′) = min(1, r_i(x, x′)) Q_i(x, x′) + (1 − min(1, r_i(x, x′))) 𝟙_{{S_i(x)}}(x′),  i = 1, ..., n.

If the entries of the weight vector ω are constant in x, then the weight entries in the expression of the respective Metropolis-Hastings ratios r_i(x, x′), i = 1, ..., n, cancel, so that each P_i is π-invariant, and P is simply a mixture of π-invariant Markov kernels. In contrast, the weights ω will not be constant in our examples; hence, the Markov kernels P_i will not generally preserve the target measure π. Instead, these kernels can be shown to preserve the probability measures π_i(x) ∝ ω_i(x) π(x), i = 1, ..., n, respectively.

4.3. Implementation on the state space graph of a Markov process.
While Algorithm 1 is very general, we have not yet specified how the involutions {S_i}_{i=1}^n and the proposal kernels {Q_i}_{i=1}^n may be chosen, or provided any intuition for why the algorithm might be an improvement over the classical Metropolis-Hastings algorithm.

In what follows we provide a general construction which takes a proposal Markov transition kernel Q̃ and target measure π̃ on a discrete state space Ξ and builds a collection of proposals {Q_i}_{i=1}^n and involutions {S_i}_{i=1}^n on an extended state space X so that Algorithm 1 can be used. This construction will make more precise the idea that the skew Metropolis-Hastings algorithm adds "momentum" to the standard Metropolis-Hastings algorithm. Our main conditions on the proposal kernel Q̃ and the measure π̃ are summarized in the following assumption.

Assumption 1.
Let the proposal kernel Q̃ and the target measure π̃ be such that
(i) π̃(ξ) > 0 for all ξ ∈ Ξ;
(ii) Q̃(ξ, ξ′) ≠ 0 ⟺ Q̃(ξ′, ξ) ≠ 0;
(iii) the Markov chain generated by Q̃ is irreducible.

The first condition is mild, as states ξ with π̃(ξ) = 0 can simply be removed from Ξ. The symmetry condition (ii) ensures that Q̃ is equivalent (in the sense that the corresponding transition probabilities have identical support) to some reversible kernel with invariant measure π̃. Condition (iii) is necessary to ensure that the constructed non-reversible Markov chain is uniquely ergodic (see Section 4.4).

The proposal kernel induces a graph structure on Ξ. States ξ, ξ′ ∈ Ξ are said to be adjacent if Q̃(ξ, ξ′) > 0. We refer to the corresponding adjacency graph G = (V, E), with vertices V = Ξ and edges E = { (ξ, ξ′) ∈ Ξ × Ξ | Q̃(ξ, ξ′) > 0 }, as the state graph of the Markov chain; see Fig. 3a for an illustration. In view of this graph structure, condition (ii) ensures that the graph G is symmetric (or undirected) in the sense that if (u, v) ∈ E then so is (v, u) ∈ E, and condition (iii) ensures that G is connected.

The general idea of our construction is to build a non-reversible Markov chain by introducing non-reversible flows, typically shaped like "vortices," on the state graph, each of which is associated with an involutive transformation. Concretely, we begin by specifying these flows by a collection of oriented subgraphs G_i^+ = (V_i, E_i^+), i = 1, ..., n. We require that each G_i^+ has no isolated vertices and that the symmetric completions G_i = (V_i, E_i), where E_i = E_i^+ ∪ E_i^− with E_i^− = { (u, v) | (v, u) ∈ E_i^+ }, form a cover of G in the sense that V = ∪_{i=1}^n V_i and E = ∪_{i=1}^n E_i; see Fig. 3b.

For each oriented subgraph G_i^+, we introduce an auxiliary variable θ_i which takes values in {−1, +1}, and we denote the vector of all such auxiliary variables by θ⃗ = (θ_1, ..., θ_n) ∈ {−1, +1}^n. In accordance with our notation of Section 4, we denote the extended state space by X = Ξ × {−1, +1}^n and use the shorthand notation x = (ξ, θ⃗) ∈ X for elements of that space.

Given this collection of directed subgraphs, our general recipe to build a collection of associated proposals {Q_i}_{i=1}^n, involutions {S_i}_{i=1}^n and weights {ω̃_i}_{i=1}^n on the extended space X is as follows. Let

    N_i(ξ) := { ξ′ | (ξ, ξ′) ∈ E_i }

denote the neighborhood of ξ in the graph G_i, i.e., the states which are reachable from ξ under the proposal kernel Q̃.
Let

    N_i^+(ξ) = { ξ′ ∈ Ξ : (ξ, ξ′) ∈ E_i^+ },  N_i^−(ξ) = { ξ′ ∈ Ξ : (ξ, ξ′) ∈ E_i^− },

denote the partition of the neighborhood N_i(ξ) into the set of states which can be reached in one step from the state ξ following the direction of the positive flow E_i^+ and the negative flow E_i^−, respectively; see Fig. 3b. We use this partition to build for each subgraph G_i a proposal kernel Q_i on X_i := V_i × {−1, +1}^n ⊂ X, which for a positive value θ_i = +1 proposes new states in the direction of the positive flow E_i^+, and for a negative value θ_i = −1 proposes new states in the direction of the negative flow E_i^−. That is,

    Q_i((ξ, θ⃗), (·, θ⃗)) ∝ 𝟙_{N_i^{θ_i}(ξ)}(·) Q̃(ξ, ·),

or, more precisely,

    Q_i((ξ, θ⃗), (ξ′, θ⃗′)) = Q̃(ξ, ξ′) / Q̃(ξ, N_i^{θ_i}(ξ))  if ξ′ ∈ N_i^{θ_i}(ξ) and θ⃗′ = θ⃗,
                          = 1                                if N_i^{θ_i}(ξ) = ∅ and S_i((ξ′, θ⃗′)) = (ξ, θ⃗),
                          = 0                                otherwise,  (5)

where in both of the above expressions we used the shorthand notation

    N_i^θ(ξ) = N_i^+(ξ) if θ = +1,  N_i^−(ξ) if θ = −1.

As a natural choice for the involutive map S_i: X → X we consider the map which flips the sign of the i-th component of the vector θ⃗, i.e.,

    S_i((ξ, θ⃗)) = (ξ, R_i(θ⃗)),  with R_i(θ⃗) = θ⃗ − 2(θ⃗ · e_i) e_i,  (6)

where e_i denotes the i-th canonical basis vector of R^n. For an illustration, see Fig. 3.
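For a concrete feel for the objects N_i^±(ξ), the split of a state's neighbors induced by an oriented edge set can be computed directly. A small sketch of ours, with edge sets represented as Python sets of ordered pairs:

```python
def neighborhoods(xi, E_plus):
    """Split the neighbors of state xi induced by the oriented edge set
    E_i^+ into the positive neighborhood N_i^+(xi) (edges leaving xi in
    E_i^+) and the negative neighborhood N_i^-(xi) (edges entering xi,
    i.e. edges leaving xi in the reversed set E_i^-)."""
    n_plus = {v for (u, v) in E_plus if u == xi}
    n_minus = {u for (u, v) in E_plus if v == xi}
    return n_plus, n_minus
```

On the directed 3-cycle 0 → 1 → 2 → 0, for instance, state 0 has positive neighborhood {1} and negative neighborhood {2}.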
Throughout the remainder of this article we assume that the weight vector ω is purely a function of ξ, i.e., ω((ξ, θ⃗)) = ω̃(ξ) for some ω̃: Ξ → Δ^{n−1}, so that (C3) is trivially satisfied. With the Q_i's and S_i's as defined in (5) and (6), respectively, it can be verified that

(C1′) ω̃_i(ξ) > 0 ⟺ ξ ∈ V_i

is sufficient for the remaining conditions (C1) and (C2) to be satisfied, provided that the symmetry condition (ii) of Assumption 1 holds. We say that the i-th proposal is active in state ξ if ξ ∈ V_i, or, equivalently, if ω̃_i(ξ) > 0, and we denote by

    A(ξ) = { i ∈ {1, ..., n} | ω̃_i(ξ) > 0 }

the index set of proposals which are active in ξ.

Remark 4.4. Note that by our assumptions on the graphs G_i^+, they have non-empty edge sets E_i^+. This implies that for every i ∈ {1, ..., n} there exists at least one ξ ∈ Ξ with i ∈ A(ξ). Hence there is always at least one state at which the i-th momentum can be flipped.

We consider

    ω̃_i(ξ) ∝ Q̃(ξ, N_i^+(ξ)) + Q̃(ξ, N_i^−(ξ))  (7)

as a generic choice for the weight vector, which can be easily verified to satisfy condition (C1′). Lastly, we extend the definition of the target measure π̃ from Ξ to X as

    π((ξ, θ⃗)) = (1/2^n) π̃(ξ),  ξ ∈ Ξ, θ⃗ ∈ {−1, +1}^n,

so that π is the product measure of π̃ and the uniform measure on {−1, +1}^n. In particular,

    Σ_{θ⃗ ∈ {−1,+1}^n} π((ξ, θ⃗)) = π̃(ξ),

i.e., the marginal measure of π in ξ coincides with the target measure π̃ on Ξ. Moreover, with the augmented measure being of product form and uniform in the θ⃗-component, it follows that the S_i's are π-invariant, i.e., π(S_i((ξ, θ⃗))) = π((ξ, θ⃗)) for all i ∈ {1, ..., n}, ξ ∈ Ξ and θ⃗ ∈ {−1, +1}^n, since S_i only changes the θ⃗-component.

Figure 3. Panel (a) shows the undirected state graph G = (V, E) induced by a reversible proposal kernel Q̃ on a state space comprised of 9 states. Panel (b) shows an exemplary assignment of three flows, E_1^+ (blue), E_2^+ (red), and E_3^+ (green). Panel (c) indicates the positive and negative neighborhoods of a state ξ: depending on the signs of the momentum components θ⃗, only the corresponding positive or negative neighborhoods of the active subgraphs can be reached from ξ in one step, and some of these neighborhoods may be empty.

In conclusion, the collection of proposals, weights and involutions (Q_i, S_i, ω_i), i ∈ {1, ..., n}, and the augmented measure π satisfy by construction the hypotheses of Theorem 4.2. Hence, with these choices, the MSMH algorithm (see Algorithm 2) produces a Markov chain which preserves the measure π on X, and thus also the marginal measure π̃ on Ξ.

Remark 4.5. We have chosen to assume Ξ to be countable. Extending to the case where Ξ is a general separable measure space is straightforward when the proposal kernel from state ξ, Q̃(ξ, dξ′), is absolutely continuous with respect to a common (σ-finite) Radon measure λ(dξ′) for all ξ ∈ Ξ. It then follows that the measure π̃ also needs to be absolutely continuous with respect to λ. One can then write the detailed balance condition as π̃(dξ) Q̃(ξ, dξ′) = π̃(dξ′) Q̃(ξ′, dξ) as measures on the product space Ξ × Ξ.
If we denote by q(ξ, ξ′) and p(ξ) the densities of Q̃(ξ, dξ′) and π̃(dξ) with respect to λ, then all that follows makes sense and applies in this more general setting with Q̃ and π̃ replaced, respectively, by q and p. For example, the conditions of Assumption 1 become p(ξ) > 0 rather than π̃(ξ) > 0, and q(ξ, ξ′) > 0 rather than Q̃(ξ, ξ′) > 0. The ratios of Q̃'s in (5) become ratios of q's evaluated at the same points. The existence of a version (indistinguishable up to null sets) of these ratios, and of the acceptance ratios r_i from Algorithms 1 and 2, which has the right measurability properties is guaranteed by the arguments of Proposition 1 in [27]. The fact that the constructed graph has countably many vertices is unimportant, as the definitions do not really use the graph structure; all of the definitions make sense once the above modifications have been made.

4.4. Ergodic properties.
In the remainder of this section, let
$$P\big((\xi,\vec{\theta}), \cdot\big) = \sum_{i=1}^{n} \widetilde{\omega}_i(\xi)\, P_i\big((\xi,\vec{\theta}), \cdot\big)$$
denote the Markov kernel generated by Algorithm 2 with the generic weight function $\widetilde{\omega}$ as specified in (7). If, in addition to Assumption 1, the following

Assumption 2.
For any extended state $(\xi, \vec{\theta}) \in \Xi \times \{-1,1\}^n$ and any active index $i \in \mathcal{A}(\xi)$, there are $m \in \mathbb{N}$ and $\xi' \in \Xi$ so that $P_i^m\big((\xi, \vec{\theta}), (\xi', R_i(\vec{\theta}))\big) > 0$.

is satisfied, then we can conclude unique ergodicity of the generated Markov chain as detailed in the following theorem.

Theorem 4.6.
Let Assumption 1 and Assumption 2 be satisfied. The non-reversible Markov chain $(x_k)_{k \in \mathbb{N}}$ generated by $P$ and initial extended state $x_0 = \tilde{x}$ is uniquely ergodic in the sense that
$$\lim_{N \to \infty} \frac{1}{N} \sum_{k=0}^{N-1} \mathbb{1}_{\{x'\}}(x_k) = \pi(x') \qquad (8)$$
almost surely for any value of the initial state $\tilde{x} \in \mathcal{X}$ and all $x' \in \mathcal{X}$. In particular, for any observable $\varphi : \Xi \to \mathbb{R}$, we have
$$\lim_{N \to \infty} \frac{1}{N} \sum_{k=0}^{N-1} \varphi(x_k) = \sum_{x \in \mathcal{X}} \varphi(x)\, \pi(x) \qquad (9)$$
almost surely for all initial states $\tilde{x} \in \mathcal{X}$.

Proof. A proof of this theorem can be found in Appendix B. □
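As a quick numerical sanity check of the ergodic averages in (8), the following sketch (purely illustrative; the three-state chain and all names are ours, not part of the paper's construction) simulates a small non-reversible chain, a biased cycle whose stationary measure is uniform by symmetry, and compares the empirical occupation frequencies with the stationary probabilities:

```python
import random

def occupation_frequencies(P, x0, n_steps, rng):
    """Empirical occupation frequencies (the ergodic averages in (8)) of a
    finite-state chain; P maps each state to a list of (next_state, prob)."""
    counts = {s: 0 for s in P}
    x = x0
    for _ in range(n_steps):
        counts[x] += 1
        u, acc = rng.random(), 0.0
        for y, p in P[x]:
            acc += p
            if u < acc:
                x = y
                break
    return {s: c / n_steps for s, c in counts.items()}

# Non-reversible chain cycling 0 -> 1 -> 2 -> 0 with occasional holds;
# by symmetry its stationary measure is uniform on {0, 1, 2}.
P = {0: [(1, 0.9), (0, 0.1)],
     1: [(2, 0.9), (1, 0.1)],
     2: [(0, 0.9), (2, 0.1)]}
freq = occupation_frequencies(P, 0, 100_000, random.Random(0))
```

The time averages approach the uniform stationary probabilities at the usual $O(N^{-1/2})$ Monte Carlo rate, even though the chain is strongly non-reversible.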
By the construction described in Section 4.3, Assumption 2 implies that from every state $\xi \in V_i$ we can reach a vertex $\xi'$ for which the probability of flipping the $i$-th velocity component is positive. This can be accomplished either by following a directed path along edges in $E_i^+$ if $\theta_i = +1$, or by following a directed path along edges in $E_i^-$ if $\theta_i = -1$. Provided that each of the Markov kernels $\{P_i\}_{i=1}^n$ possesses an invariant measure (which Assumption 1 guarantees; see Remark 4.3), we will see below that the only obstruction to the existence of such a reachable vertex $\xi'$ is that $\xi$ is a vertex of a cycle in the associated directed graph. To make this precise, we need a few concepts.

A circuit in a directed graph $G = (V, E)$ is a sequence of vertices $(v_0, \dots, v_m)$ such that (i) each pair $(v_i, v_{i+1}) \in E$ for $i = 0, \dots, m-1$, and (ii) $v_0 = v_m$. Note that we allow edges $(v_i, v_{i+1})$ to be repeated. We say that a circuit is escapable if there is at least one edge $(v, v') \in E$ with $v$ being a vertex in the circuit and $v'$ not. Conversely, we say that a circuit is non-escapable if no such edge exists. We say that a non-escapable circuit $(v_0, \dots, v_m)$ is maximal if the vertices of the circuit, namely $\{v_k \mid k = 0, \dots, m\}$, are not a proper subset of the vertices of another non-escapable circuit. Equipped with this terminology, we can state the following Assumption 3 and Theorem 4.7.

Assumption 3.
For every value of $\vec{\theta} \in \{-1,1\}^n$ and $i \in \{1, \dots, n\}$, every maximal non-escapable circuit in $G_i^{\theta_i}$ contains at least one state $\xi'$ for which $P_i\big((\xi', \vec{\theta}), (\xi', R_i(\vec{\theta}))\big) > 0$. Here we use the short-hand notation
$$G_i^{\theta_i} = \begin{cases} G_i^+, & \theta_i = +1, \\ G_i^-, & \theta_i = -1. \end{cases}$$

Theorem 4.7.
Provided that Assumption 1 holds, Assumption 2 and Assumption 3 are equivalent. In short: (Assumption 1 + Assumption 3) $\iff$ (Assumption 1 + Assumption 2).

Proof. We prove this theorem in Appendix B. □
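In graph terms, the vertex set of a maximal non-escapable circuit is precisely a sink strongly connected component (an SCC with no outgoing edges) that contains at least one cycle, so Assumption 3 can in principle be checked on a small state graph by enumerating sink SCCs. The sketch below (a plain Kosaraju SCC computation with a sink test; the dict-of-lists graph encoding and all names are our own illustration, not from the paper) does exactly that:

```python
def strongly_connected_components(adj):
    """Kosaraju's algorithm; adj maps each node to a list of successors.
    Returns (list of components, node -> component-index map)."""
    order, seen = [], set()
    for root in adj:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:  # iterative DFS recording finish order
            node, it = stack[-1]
            advanced = False
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj[v])))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()
    radj = {u: [] for u in adj}
    for u in adj:
        for v in adj[u]:
            radj[v].append(u)
    comp, comps = {}, []
    for u in reversed(order):  # sweep the reversed graph
        if u in comp:
            continue
        c, stack = [], [u]
        comp[u] = len(comps)
        while stack:
            x = stack.pop()
            c.append(x)
            for y in radj[x]:
                if y not in comp:
                    comp[y] = len(comps)
                    stack.append(y)
        comps.append(c)
    return comps, comp

def maximal_non_escapable_circuits(adj):
    """Vertex sets of maximal non-escapable circuits: sink SCCs (no edge
    leaves the component) that actually contain a cycle."""
    comps, comp = strongly_connected_components(adj)
    sinks = []
    for i, c in enumerate(comps):
        if all(comp[v] == i for u in c for v in adj[u]):
            if len(c) > 1 or c[0] in adj[c[0]]:
                sinks.append(sorted(c))
    return sinks
```

For instance, in the graph `{0: [1], 1: [2], 2: [0, 3], 3: [4], 4: [3]}` the circuit through 0, 1, 2 is escapable (via the edge into 3), while {3, 4} is the unique maximal non-escapable circuit.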
In practice, Assumption 2 and even Assumption 3 may often be difficult to check. In order to guarantee ergodicity in the sense of Theorem 4.6, one may instead consider a "lazy" version of the algorithm, where in each step
(1) with some small probability $\varepsilon > 0$, a component index is uniformly sampled from $\{1, \dots, n\}$ and the sign of the corresponding velocity component is flipped, while the state $\xi$ remains unchanged;
(2) with probability $1 - \varepsilon$, the steps in Algorithm 2 are executed.
Since the Markov kernel of the chain generated by this modification is a mixture of two kernels which each preserve $\pi$, it directly follows that this Markov chain indeed preserves $\pi$. Moreover, provided that $\widetilde{Q}$ is irreducible, it follows by arguments similar to those in the proof of Theorem 4.6 that the resulting non-reversible Markov chain is also irreducible, which is sufficient for the Markov chain to be uniquely ergodic in the sense specified in Theorem 4.6. In addition, we may consider an even "lazier" modification of the algorithm where, in each step and in addition to the above modification, we leave the complete extended state $(\xi, \vec{\theta})$ unchanged with non-zero probability. It is easy to see that under this modification the resulting Markov chain will not only preserve the target measure and be irreducible, but will also be guaranteed to be aperiodic, so that the chain converges in law to the target measure, i.e.,
$$\lim_{N \to \infty} P^N\big((\xi, \vec{\theta}), (\xi', \vec{\theta}')\big) = \pi(\xi', \vec{\theta}')$$
for all $(\xi, \vec{\theta}) \in \Xi \times \{-1,1\}^n$ and all $(\xi', \vec{\theta}') \in \Xi \times \{-1,1\}^n$.

In what follows we show how the framework described in this section can be applied to devise non-reversible MCMC methods for the sampling of redistricting plans. For this purpose, we describe in the following Section 5.1 the sampling space $\Xi$ of that application, the set of redistricting plans, and the target measure $\widetilde{\pi}$ defined on that space. We then introduce in Section 5.3 a reversible Markov kernel $\widetilde{Q}$ on the set of redistricting plans, which we construct as a tempered version of the single node flip algorithm of [12, 16].
In Section 6 we describe and discuss three methods to partition and direct the induced state space graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$.

5. Sampling redistricting plans
5.1. The space and distribution of redistricting plans.
Throughout this article we consider the problem of sampling redistricting plans as equivalent to sampling the space of $n_D$-partitions of a graph $G_p = (V_p, E_p)$, which is a discrete representation of the administrative region (e.g., a state, or a collection of counties) for which redistricting plans are drawn. The vertices $V_p$ of the graph correspond to geographic regions of a certain administrative level (e.g., voter tabulation districts (VTDs), precincts, census blocks), and edges $E_p$ are placed between vertices that are either rook, queen, or legally adjacent, depending on the situation; see Fig. 4a. In order to keep language simple, we present our approach in the setup where the vertex set $V_p$ represents the precincts of a state, and we refer to $G_p$ as the precinct graph.

We represent a single districting plan made up of $n_D$ districts as a function $\xi : V_p \to \{1, \dots, n_D\}$. In other words, districting plans are $n_D$-colorings of the graph $G_p$. Informally, $\xi(v) = i$ means that the precinct associated with vertex $v$ is in the $i$-th district. Given a districting plan $\xi$, we denote by
$$D_i(\xi) = \{v \in V_p \mid \xi(v) = i\}, \qquad E_i(\xi) = \{(u,v) \in E_p \mid \xi(v) = \xi(u) = i\},$$
the set of precincts assigned to the $i$-th district, and the set of edges between vertices corresponding to precincts within the $i$-th district, respectively.

On the set of all districting plans $\{\xi \mid \xi : V_p \to \{1, \dots, n_D\}\}$, we define a score function $J$ which measures how well a redistricting plan complies with a set of prescribed criteria such as compactness of districts, equal partition of the population, or preservation of municipalities. This score function is used to define a probability measure $\widetilde{\pi}$ on the set of districting plans via the relation
$$\widetilde{\pi}(\xi) \propto e^{-J(\xi)}. \qquad (10)$$
The lower the score $J(\xi)$ of a redistricting plan, the better it complies with the criteria and the higher the probability assigned to it by the Gibbs measure $\widetilde{\pi}(\xi)$.
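In code, a districting plan $\xi$ can be represented simply as a map from vertices to district labels; $D_i(\xi)$ and $E_i(\xi)$ are then one-line filters. A minimal sketch (the tiny 2x2 "precinct graph" is a made-up example, not data from the paper):

```python
def district_nodes(plan, i):
    """D_i(xi): the set of vertices (precincts) assigned to district i."""
    return {v for v, d in plan.items() if d == i}

def district_edges(plan, edges, i):
    """E_i(xi): edges of the precinct graph internal to district i."""
    return {(u, v) for (u, v) in edges if plan[u] == plan[v] == i}

# Toy 2x2 lattice of precincts 0..3 with rook adjacency.
edges = {(0, 1), (0, 2), (1, 3), (2, 3)}
plan = {0: 1, 1: 1, 2: 2, 3: 2}  # xi: two districts, split in half
```

The same dictionary encoding is convenient for all the flip-based operations that follow, since a proposal only ever rewrites a single entry.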
In particular, if a redistricting plan $\xi$ is non-compliant, then $J(\xi) = \infty$, and thus $\widetilde{\pi}(\xi) = 0$.

Rook adjacency means that the geographical boundary between two regions has non-zero length; queen adjacency means that the boundaries touch, but may do so only at a point. At times two regions may not be geographically adjacent but may be considered adjacent for legal purposes; for example, an island may still be considered adjacent to regions on a mainland for the purposes of making districts.
Figure 4. (A) State with associated precinct graph; the embedding $\phi$ used for displaying the precinct graph places nodes at the areal centroid of the associated precinct. (B) Coloring of the precinct graph which corresponds to a partition of the state into three districts. Superimposed is the associated district graph.

In the remainder of this article we denote by
$$\Xi := \{\xi : V_p \to \{1, \dots, n_D\} \mid J(\xi) < \infty\}$$
the support of $\widetilde{\pi}$, to which in the following we will also refer as the set of all possible maps or possible redistricting plans.

5.2. The score function.
The score function $J(\xi)$ which determines the measure $\widetilde{\pi}$ typically relies on additional information associated with the vertices and edges of the precinct graph $G_p$, such as population, land area, and border length. This additional data is used to evaluate the districts on desired redistricting criteria, such as equal population and compactness.

We denote by $\mathrm{pop}(v)$ and $\mathrm{area}(v)$ the population and area, respectively, of the geographical region corresponding to vertex $v \in V_p$. Similarly, for $e = (u,v) \in E_p$, we denote by $\mathrm{boundary}(e)$ the length of the boundary shared between the precincts $u$ and $v$. Certain vertices may not share all of their boundary with adjacent vertices; for example, a vertex may lie on the boundary of the map. In this case, we denote the unshared boundary of a vertex by $\mathrm{boundary}(v)$ (which is zero for interior nodes).

On the set of all possible maps, the score function $J$ may be constructed as a positive linear combination of sub-functions, each of which evaluates a different property of the redistricting map (see, e.g., [12]). For example, in some of the numerical examples of Section 7, we let
$$J(\xi) = w_{\mathrm{pop}} J_{\mathrm{pop}}(\xi) + w_c J_c(\xi), \qquad (11)$$
where $J_{\mathrm{pop}}(\xi)$ is a measure of the population deviation and $J_c(\xi)$ is a measure of how compact the districts are. Typically $J_{\mathrm{pop}}(\xi)$ takes the form of a hard constraint or a sum of squared deviations of each district's population from a target population; similarly, $J_c(\xi)$ is typically a sum of district isoperimetric ratios or some measure of the overall perimeter.

5.3. The tempered single node flip proposal and algorithm.
To utilize the ideas of Section 4.3 in order to sample from $\widetilde{\pi}$ on the space of possible redistricting plans, we must first establish a proposal kernel $\widetilde{Q}$ on the domain of possible redistricting plans $\Xi$ which satisfies Assumption 1.

For redistricting problems, one of the most widely used methods is what is commonly known as the single node flip algorithm. This algorithm has been shown to mix well on smaller problems [11, 17, 16], but as the size of the districting plan grows and the criteria for redistricting become more complex, these moving-boundary MCMC algorithms still converge in theory, yet their mixing times may render their use computationally infeasible [21].

In this article we use the proposal kernel $\widetilde{Q}$ of a tempered version of the single node flip algorithm as the basis for constructing our non-reversible MCMC methods. As in the classical version of the algorithm, the proposal kernel of this variant changes ("flips") the color of exactly one vertex on the boundary of a district to the color of a neighboring district.

More specifically, for an ordered pair $(u,v) \in V_p \times V_p$ of two distinct precincts we define the flip operator
$$F_{(u,v)} : \Xi \to \Xi, \qquad F_{(u,v)}(\xi)(w) = \begin{cases} \xi(w), & w \neq v, \\ \xi(u), & w = v, \end{cases}$$
which flips the label of vertex $v$ to the label of vertex $u$, where as above $\Xi$ is the domain of all possible maps. For a given districting map $\xi \in \Xi$, we denote by
$$C(\xi) := \big\{(u,v) \in E_p : \xi(u) \neq \xi(v),\ F_{(u,v)}(\xi) \in \Xi\big\}$$
the set of all conflicting edges, i.e., edges which connect precincts with different labels (precincts which are assigned to different districts), and for which application of the corresponding flip operator results in a valid redistricting plan.
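The flip operator and the conflicted-edge set translate directly into code. In this sketch the validity test is left as a user-supplied predicate (standing in for the condition $J(\xi) < \infty$); the example graph is a toy 2x2 lattice and all names are our own illustration:

```python
def flip(plan, u, v):
    """F_(u,v): reassign vertex v to the district of vertex u."""
    new = dict(plan)
    new[v] = plan[u]
    return new

def conflicted_edges(plan, edges, is_valid):
    """C(xi): ordered pairs (u, v) with differing labels whose flip yields
    a valid plan (validity delegated to the predicate is_valid)."""
    out = set()
    for (u, v) in edges:
        for a, b in ((u, v), (v, u)):
            if plan[a] != plan[b] and is_valid(flip(plan, a, b)):
                out.add((a, b))
    return out

plan = {0: 1, 1: 1, 2: 2, 3: 2}
edges = {(0, 1), (0, 2), (1, 3), (2, 3)}
nonempty = lambda p: len(set(p.values())) == 2  # both districts survive
```

Note that each undirected precinct edge contributes up to two ordered conflicted pairs, one per flip direction.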
Moreover, we denote by
$$\mathcal{N}(\xi) := \big\{F_{(u,v)}(\xi) \mid (u,v) \in C(\xi)\big\}$$
the set of all possible redistricting plans which can be obtained from the districting plan $\xi$ by application of the flip operator along a conflicting edge; these are the vertices of the neighborhood of $\xi$. With this notation at hand, we define the proposal distribution $\widetilde{Q}(\xi, \cdot)$ to be a tempered version of the target measure $\widetilde{\pi}$ constrained to the set $\mathcal{N}(\xi)$, i.e.,
$$\widetilde{Q}(\xi, \cdot) \propto \mathbb{1}_{\mathcal{N}(\xi)}(\cdot)\, e^{-\beta J(\cdot)}, \qquad \xi \in \Xi,$$
or, more explicitly,
$$\widetilde{Q}(\xi, \xi') = \begin{cases} Z_\beta(\xi)^{-1}\, e^{-\beta J(\xi')}, & \xi' \in \mathcal{N}(\xi), \\ 0, & \text{otherwise,} \end{cases} \qquad (12)$$

Figure 5.
Support of the single node flip proposal; the color(s) of the circular markers within precincts indicate that the probability of proposing the districting plan which is obtained from the current districting plan by changing the color of the respective precinct to (one of) the color(s) of the marker is positive.

where $\beta > 0$ and
$$Z_\beta(\xi) := \sum_{\xi' \in \mathcal{N}(\xi)} e^{-\beta J(\xi')}$$
is the corresponding partition function.

The tempered proposal kernel $\widetilde{Q}$ can be combined with a Metropolis acceptance-rejection criterion to obtain a reversible Metropolis-Hastings algorithm (see Algorithm 5) which possesses $\widetilde{\pi}$ as an invariant measure.

6. Application to graph partitions and redistricting
In this section we present two different approaches for applying the framework of Section 4.3 to build non-reversible Markov chains for the sampling of redistricting maps out of the tempered single node flip algorithm described in Section 5.3. We describe these two approaches in Section 6.1 and Section 6.2, respectively. In Section 6.3 we discuss computational aspects of the algorithms and the role of tempering, i.e., the choice of the temperature parameter $\beta$.

6.1. Precinct flows via planar embeddings.
In this approach we construct a single flow via a planar embedding of the (planar) precinct graph, in such a way that state transitions in the positive flow direction are aligned with a prescribed vector field in $\mathbb{R}^2$.

Consider an embedding $\phi : V_p \to \mathbb{R}^2$ of the precinct graph in $\mathbb{R}^2$ and a volume-preserving vector field $v$ in $\mathbb{R}^2$. More specifically, we consider the embedding which maps every vertex $v$ to the center of mass (assuming constant density) of the associated precinct, and the field defined by concentric circles oriented in the counter-clockwise direction. That is,
$$\phi(v) = \frac{1}{\mathrm{area}(v)} \int_{A_v} x \, dx,$$
where $A_v \subset \mathbb{R}^2$ is the precinct represented by $v$ in some suitable geographical map representation of the state, and
$$v(x,y) = \big(-r \sin(\alpha),\ r \cos(\alpha)\big),$$
where $(r, \alpha)$ denotes the polar representation of $(x,y) \in \mathbb{R}^2$.

For a given embedding $\phi$ and vector field $v$, we need to make precise what it means for a transition to be aligned with the vector field $v$. For this purpose we define a function $\mathrm{orientation}_{\phi,v} : \mathcal{E} \to \{-1, +1\}$, which maps every edge of the state graph $\mathcal{E}$ of the Markov chain to the set $\{-1, +1\}$, and set
$$\mathcal{E}^+ = \{(\xi, \xi') \mid \mathrm{orientation}_{\phi,v}(\xi, \xi') > 0\}.$$
The question now is how to construct such a function $\mathrm{orientation}_{\phi,v}$. To accomplish this, we positively align the movement of centroids with the vector field. Let
$$c(D_i(\xi)) = \frac{1}{\mathrm{area}(D_i(\xi))} \sum_{v \in D_i(\xi)} \mathrm{area}(v)\, \phi(v) \qquad (13)$$
denote the centroid of the $i$-th district.
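The centroid (13), the counter-clockwise vortex field, and the alignment test that is formalized in (14) are each a few lines of code. A sketch (all function names and the dictionary encodings are our own illustration; centroids are passed in as district-to-point maps):

```python
def district_centroid(nodes, area, phi):
    """c(D_i(xi)) as in (13): area-weighted average of embedded vertices."""
    total = sum(area[v] for v in nodes)
    x = sum(area[v] * phi[v][0] for v in nodes) / total
    y = sum(area[v] * phi[v][1] for v in nodes) / total
    return (x, y)

def vortex(p):
    """Counter-clockwise circular field: (-r sin a, r cos a) = (-y, x)."""
    x, y = p
    return (-y, x)

def orientation(c_old, c_new, i, j, field):
    """Sign of the summed inner products between the field (evaluated at
    the midpoint of each centroid displacement) and the displacement
    itself, for the two modified districts i and j; cf. (14)."""
    s = 0.0
    for k in (i, j):
        (x0, y0), (x1, y1) = c_old[k], c_new[k]
        fx, fy = field(((x0 + x1) / 2.0, (y0 + y1) / 2.0))
        s += fx * (x1 - x0) + fy * (y1 - y0)
    return 1 if s > 0 else -1
```

For example, a transition whose district centroids both move counter-clockwise around the origin gets orientation +1, and the reverse transition gets orientation -1, as required for a skew-symmetric edge direction.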
We orient edges $(\xi, \xi')$ such that transitions from $\xi$ to $\xi'$ move the centroids of the involved districts $i, j$ (these are the districts for which a precinct is either removed or added in the course of the transition) in alignment with the direction of the vector field $v$, i.e.,
$$\mathrm{orientation}_{\phi,v}(\xi, \xi') = \mathrm{sign} \sum_{k \in \{i,j\}} v\!\left(\frac{c(D_k(\xi')) + c(D_k(\xi))}{2}\right) \cdot \big[c(D_k(\xi')) - c(D_k(\xi))\big], \qquad (14)$$
where $i, j$ are the indices of the two districts which are modified in the transition from $\xi$ to $\xi'$. See Algorithm 3 for an algorithmic implementation and Fig. 6 for a graphical illustration of this method.

6.2. District to district flows.
We associate $n = n_D(n_D - 1)/2$ momenta $\theta_e \in \{-1, 1\}$, $e \in E_d$, with the ordered district pairs $E_d = \{(i,j) \in \{1, \dots, n_D\}^2 : i < j\}$. If the momentum $\theta_e$ associated with an adjacent district pair $e = (i,j)$ is positive, we may only propose state changes in which a boundary precinct of district $j$ is reassigned to district $i$. If, on the other hand, $\theta_e = -1$, we may only propose state changes in which a boundary precinct of district $i$ is reassigned to district $j$; see Fig. 7a.

In the view of Section 4.3, the construction of the algorithm is as follows. The velocity vector is of the form $\vec{\theta} = (\theta_{e_1}, \dots, \theta_{e_n}) \in \{-1,1\}^{E_d}$. For any pair of districts $e = (i,j) \in E_d$,

Algorithm 3:
Center-of-mass flow
input: ξ, θ
N^θ(ξ) ← {ξ' ∈ N(ξ) : orientation_{φ,v}(ξ, ξ') = θ};
if N^θ(ξ) = ∅ then
    θ ← −θ
else
    sample ξ' ∼ 1_{N^θ(ξ)}(·) e^{−βJ(·)};  sample u ∼ U([0, 1]);
    N^{−θ}(ξ') ← {ξ'' ∈ N(ξ') : orientation_{φ,v}(ξ', ξ'') = −θ};
    if N^{−θ}(ξ') = ∅ or u < [e^{J(ξ)−J(ξ')} Z_β^θ(ξ)] / [e^{βJ(ξ)−βJ(ξ')} Z_β^{−θ}(ξ')] then
        ξ ← ξ'
    else
        θ ← −θ
return ξ, θ

In the edge-aligned version of the algorithm, orientation_{φ,v}(ξ, ξ') has the form specified in (14).
Figure 6. (A) Geometrical illustration of the orientation-assigning function $\mathrm{orientation}_{\phi,v}$ in the center-of-mass flow method: in a transition where the color of the hatched precinct changes from red to blue, the centroids of the corresponding districts (black dots) change as indicated by the blue arrows. The red arrows correspond to evaluations of the vector field (black arrows) at the respective midpoints of the blue arrows. The orientation of the transition is computed as the sign of the sum of the inner products of the two arrow/vector pairs of each district. (B) Support of the center-of-mass-flow proposal distribution for positive velocity $\theta = 1$; the same convention is used for displaying the support as in Fig. 5.

the corresponding positive flow $\mathcal{E}_e^+$ of transitions between possible districtings, in which a precinct is removed from district $j$ and added to district $i$, is
$$\mathcal{E}_e^+ = \big\{(\xi, F_{(u,v)}(\xi)) \mid \xi \in \Xi,\ (u,v) \in C_e^+(\xi)\big\},$$
where
$$C_e^+(\xi) = \big\{(u,v) \in C(\xi) \mid u \in D_i(\xi),\ v \in D_j(\xi)\big\}$$
denotes the set of directed conflicted edges which connect a precinct of district $i$ with a precinct of district $j$. Similarly, we have
$$\mathcal{E}_e^- = \big\{(\xi, F_{(u,v)}(\xi)) \mid \xi \in \Xi,\ (u,v) \in C_e^-(\xi)\big\}, \qquad C_e^-(\xi) = \big\{(u,v) \mid (v,u) \in C_e^+(\xi)\big\}.$$
The $e$-th vicinity of the state $\xi$ in the positive and negative direction can be explicitly written as
$$\mathcal{N}_e^+(\xi) = \big\{F_{(u,v)}(\xi) \mid (u,v) \in C_e^+(\xi)\big\}, \qquad \mathcal{N}_e^-(\xi) = \big\{F_{(u,v)}(\xi) \mid (v,u) \in C_e^+(\xi)\big\},$$
respectively, and we can write the $e$-th proposal kernel on the extended space $\Xi \times \{-1,1\}^{E_d}$ as
$$Q_e\big((\xi, \vec{\theta}), (\xi', \vec{\theta}')\big) = \begin{cases} e^{-\beta J(\xi')} / Z_e^{\theta_e}(\xi), & \text{if } \xi' \in \mathcal{N}_e^{\theta_e}(\xi) \text{ and } \vec{\theta}' = \vec{\theta}, \\ 1, & \text{if } \mathcal{N}_e^{\theta_e}(\xi) = \emptyset,\ \xi' = \xi \text{ and } \vec{\theta}' = F_e(\vec{\theta}), \\ 0, & \text{otherwise,} \end{cases}$$
where $Z_e^{\theta_e}(\xi) = \sum_{\xi' \in \mathcal{N}_e^{\theta_e}(\xi)} e^{-\beta J(\xi')}$.
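The oriented neighborhoods $\mathcal{N}_e^{\pm}$ and the normalization $Z_e^{\theta_e}$ can be sketched as follows (the plan encoding, validity predicate and names are illustrative assumptions; an empty neighborhood is the case that triggers a velocity flip, as in the second branch of the kernel above):

```python
import math
import random

def pair_neighborhood(plan, edges, e, theta, is_valid):
    """N_e^theta(xi): plans obtained by reassigning a boundary precinct of
    district j to district i (theta = +1), or of i to j (theta = -1)."""
    i, j = e if theta == +1 else (e[1], e[0])
    out = []
    for (u, v) in edges:
        for a, b in ((u, v), (v, u)):
            if plan[a] == i and plan[b] == j:
                cand = dict(plan)
                cand[b] = i
                if is_valid(cand):
                    out.append(cand)
    return out

def tempered_proposal(plan, edges, e, theta, J, beta, is_valid, rng):
    """Sample xi' with weight exp(-beta J(xi')) from N_e^theta(xi); return
    (None, 0.0) when the neighborhood is empty (velocity flip case)."""
    nbrs = pair_neighborhood(plan, edges, e, theta, is_valid)
    if not nbrs:
        return None, 0.0
    w = [math.exp(-beta * J(c)) for c in nbrs]
    Z = sum(w)  # Z_e^theta(xi)
    r, acc = rng.random() * Z, 0.0
    for c, wi in zip(nbrs, w):
        acc += wi
        if r < acc:
            return c, Z
    return nbrs[-1], Z
```

With a constant score and $\beta = 1$ this reduces to uniform sampling over the oriented neighborhood, matching the discussion of the $\beta \to 0$ limit in Section 6.3.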
The generic choice (7) for the weight vector $\widetilde{\omega}(\xi) = (\widetilde{\omega}_{e_1}(\xi), \dots, \widetilde{\omega}_{e_n}(\xi))$ results in weights of the form
$$\widetilde{\omega}_e(\xi) = \frac{Z_e(\xi)}{Z(\xi)}, \qquad \text{where } Z_e(\xi) = Z_e^{\theta_e}(\xi) + Z_e^{-\theta_e}(\xi), \qquad (15)$$
for $e \in E_d$, so that $Z(\xi) = \sum_{\tilde{e} \in E_d} Z_{\tilde{e}}(\xi)$. With this choice of proposal kernel and weights, the Metropolis-Hastings ratio becomes
$$r_e\big((\xi, \vec{\theta}), (\xi', \vec{\theta}')\big) = \frac{e^{J(\xi) - J(\xi')}\, Z_e^{\theta_e}(\xi)}{e^{\beta J(\xi) - \beta J(\xi')}\, Z_e^{-\theta_e}(\xi')}.$$
We provide an explicit implementation as Algorithm 4.

6.2.1.
Associated district graph.
If the number of districts is three or larger, then for certain redistricting plans some pairs of districts may not share a border. That means that for a given redistricting plan $\xi \in \Xi$ the set of adjacent district pairs
$$E_d(\xi) = \big\{(i,j) \mid C_e^+(\xi) \cup C_e^-(\xi) \neq \emptyset\big\}$$
may be a proper subset of $E_d$. We refer to the state-dependent graph $G_d(\xi) = (V_d, E_d(\xi))$, where $V_d = \{1, \dots, n_D\}$ is the index set of the districts, as the district graph of $\xi$; see Fig. 4b. If $e = (i,j) \in E_d(\xi)$, we refer to $Q_e$ as an active kernel. The generic choice of weights as in (15) ensures that only active kernels are selected in the proposal step.
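The state-dependent edge set $E_d(\xi)$ is cheap to compute with a single pass over the precinct edges. A sketch (the dict-based plan encoding is an illustrative assumption):

```python
def district_adjacency(plan, edges):
    """E_d(xi): district pairs (i, j), i < j, joined by at least one
    conflicted edge of the precinct graph; the edge set of the
    state-dependent district graph G_d(xi)."""
    adj = set()
    for (u, v) in edges:
        a, b = plan[u], plan[v]
        if a != b:
            adj.add((min(a, b), max(a, b)))
    return adj
```

Only the kernels $Q_e$ with $e$ in this set are active, so in an implementation the weight vector (15) need only be evaluated over `district_adjacency(plan, edges)`.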
Figure 7. (A) Support of the district-to-district flow proposal distribution with $(\theta_{(1,2)}, \theta_{(1,3)}, \theta_{(2,3)}) = (1, 1, 1)$; the momenta of the district pairs $(1,2)$, $(1,3)$, and $(2,3)$ are displayed as red, blue, and green colored directed edges of the district graph, respectively. (B) Support of a generalized version (see Remark 6.2) of the district-to-district flow where a momentum variable $\theta_{(1,2,3)}$ is assigned to the red colored counter-clockwise oriented cycle on the district graph. The support is shown for positive velocity $\theta_{(1,2,3)} = 1$. For both figures the same convention is used for displaying the support as in Fig. 5.

Remark 6.1. The total number of district pairs $|E_d| = n_D(n_D-1)/2$ grows quadratically in $n_D$, and thus keeping track of all entries in $\vec{\theta}$ may be memory intensive if $n_D$ is large. Moreover, the stochastic process $(\xi_n)_{n \in \mathbb{N}}$ typically converges to the target measure $\widetilde{\pi}$ (when accounting for internal symmetries/label permutations) before all proposal kernels have become active. For this reason one may not want to keep track of all velocities over the whole simulation time. Instead one may resample the velocity $\theta_e$ from the uniform measure on $\{-1, 1\}$ whenever the kernel $Q_e$ becomes active after a transition which results in the previously non-adjacent district pair $e = (i,j)$ sharing a border.

Remark 6.2. Algorithm 4 can be viewed as a special case of a much larger class of non-reversible algorithms which use the structure of the state-dependent district graph $G_d(\xi) = (V_d, E_d(\xi))$. Instead of assigning velocities to edges as in Algorithm 4, one may assign velocities to any type of subgraph of $G_d(\xi)$, and associate proposals which mimic flows across the respective districts. For example, one may associate velocities with cycles of 3 districts (see Fig. 7b), and choose the proposal $Q_{(i,j,k)}$ so that it mimics a district flow in clockwise/anti-clockwise direction depending on the value of the velocity $\theta_{(i,j,k)}$, e.g.,
$$Q_{(i,j,k)}\big((\xi, \vec{\theta}), \cdot\big) \propto \mathbb{1}_{\mathcal{N}^{\theta}_{(i,j)}(\xi)}(\cdot)\, \widetilde{Q}_{(i,j)}(\xi, \cdot) + \mathbb{1}_{\mathcal{N}^{\theta}_{(j,k)}(\xi)}(\cdot)\, \widetilde{Q}_{(j,k)}(\xi, \cdot) + \mathbb{1}_{\mathcal{N}^{\theta}_{(k,i)}(\xi)}(\cdot)\, \widetilde{Q}_{(k,i)}(\xi, \cdot),$$
with the terms on the right-hand side as defined in the pair-wise district graph version of the algorithm.

6.3. Computational aspects and choice of the temperature parameter.
The computational efficiency of an MCMC scheme depends both on the cost per generated sample and on the mixing rate of the Markov chain itself. It is intuitively clear that very high rejection rates negatively affect the mixing speed of a Markov chain. This is in particular true for non-reversible Markov chains constructed as presented here, since every rejection event results in the reversal of a momentum variable, so that high rejection rates prevent kinetic-like movement of the chain. Therefore, it may be worthwhile to trade higher computational cost per proposal for reduced rejection rates. In what follows we discuss this tradeoff for two important special cases of the tempered proposal kernel of Algorithm 4 (the same observations follow as a special case for Algorithm 5).

• If $\beta = 1$, then the proposal distribution $\widetilde{Q}(\xi, \cdot)$ is the target measure constrained to $\mathcal{N}(\xi)$, and the Metropolis ratio simplifies to
$$r_e(\xi, \xi') = \frac{Z_e^{\theta_e}(\xi)}{Z_e^{-\theta_e}(\xi')} \times \frac{Z_e(\xi')}{Z_e(\xi)} \times \frac{Z(\xi)}{Z(\xi')}. \qquad (16)$$
In this case the acceptance probability of the proposal does not depend on the energy difference $\Delta J = J(\xi') - J(\xi)$, but is merely a function of various partition functions of the current state and the proposed state.

• In the limit $\beta \to 0$, the proposal distribution $\widetilde{Q}(\xi, \cdot)$ becomes the uniform distribution on $\mathcal{N}(\xi)$, and we thus recover for $\beta = 0$ a version of the single node flip algorithm where the proposal is sampled uniformly from the set $\mathcal{N}(\xi)$. In this case the acceptance probability becomes
$$r_e(\xi, \xi') = e^{-\Delta J} \times \frac{|\mathcal{N}_e^{\theta_e}(\xi)|}{|\mathcal{N}_e^{-\theta_e}(\xi')|} \times \frac{|\mathcal{N}_e(\xi')|}{|\mathcal{N}_e(\xi)|} \times \frac{|\mathcal{N}(\xi)|}{|\mathcal{N}(\xi')|}.$$

In terms of computational cost it is important to note that in the case $\beta > 0$, generating a proposal requires evaluating the score function for all states in the neighborhood of the current state $\xi$, as well as for all states in the neighborhood of the proposal $\xi'$. Depending on the form of the score function $J$, the computational cost of this operation may vary. In particular, if $J$ is of the form (11), computing $J(\xi')$ for a neighboring state of $\xi$ may only require re-evaluating the two terms in the sums of the sub-score functions which relate to the two districts modified in the transition from $\xi$ to $\xi'$. In contrast, the operations involved in generating a proposal and evaluating the Metropolis-Hastings ratio in the case $\beta = 0$ only require identifying the various neighborhoods of the current state $\xi$ and the proposal $\xi'$.

The higher computational cost of using a tempered proposal with $\beta = 1$ may be offset by potentially drastically reduced rejection rates. For large redistricting maps, the terms $Z_e(\xi')/Z_e(\xi)$ and $Z(\xi)/Z(\xi')$ in the Metropolis-Hastings ratio (16) can be expected to be close to 1 with high probability. Similarly, the first factor may only be significantly smaller than 1 if extending district $i$ into district $j$ is on average energetically unfavorable in comparison to extending district $j$ into district $i$, a property which may give rise to cyclic alignment of district-to-district velocities; see Section 7.

In contrast, the Metropolis-Hastings ratio, and thus the acceptance probability, in the case $\beta = 0$ does directly depend on the energy difference between the current state $\xi$ and the proposal $\xi'$. In situations where the variation of score values of states in the neighborhood of a given state is large, this may result in drastically increased rejection rates in comparison to the tempered proposal.

Algorithm 4: Pair-wise district-to-district flow
input: ξ, θ⃗
sample e ∼ ω̃(ξ) = [Z_ẽ(ξ)/Z(ξ)]_{ẽ ∈ E_d};
if Z_e^{θ_e}(ξ) = 0 then
    θ_e ← −θ_e
else
    sample ξ' ∼ 1_{N_e^{θ_e}(ξ)}(·) e^{−βJ(·)};  sample u ∼ U([0, 1]);
    if Z_e^{−θ_e}(ξ') = 0 or u < [e^{J(ξ)−J(ξ')} Z_e^{θ_e}(ξ)] / [e^{βJ(ξ)−βJ(ξ')} Z_e^{−θ_e}(ξ')] then
        ξ ← ξ';  θ_ẽ ∼ U({−1, 1}) for any edge ẽ ∈ E_d(ξ') \ E_d(ξ);
    else
        θ_e ← −θ_e;
return ξ, θ⃗

7. Numerical experiments
To test our ideas numerically, we consider an example first presented in Section 4 of [21] of a square lattice split into two districts. As noted in that work, this districting problem is equivalent to considering a loop-free random walk that partitions the region. We consider the score function
$$J(\xi) = J_{\mathrm{pop}}(\xi) + J_C(\xi), \qquad (17)$$
where the population score
$$J_{\mathrm{pop}}(\xi) = \begin{cases} 0, & \mathrm{pop}_{\min} \le |D_1(\xi)| \le \mathrm{pop}_{\max}, \\ \infty, & \text{otherwise}, \end{cases}$$
is a hard constraint ensuring that the number of nodes in each district lies between $\mathrm{pop}_{\min}$ and $\mathrm{pop}_{\max}$. The compactness score $J_C(\xi) = |C(\xi)|$ corresponds to the length of the boundary, i.e., the number of conflicted edges.

We sample the space of districting plans that are simply connected on a $10 \times 10$ square lattice with parameter values $\mathrm{pop}_{\min} = 45$, $\mathrm{pop}_{\max} = 55$ (i.e., we allow up to 10% deviation from half of the lattice points).

We note that the score function $J(\xi)$ is minimized when the district boundary lies either perfectly horizontally or vertically, and grows as the district boundary length grows. Intuitively we may think of four meta-stable states, one with a given district in the north, south, east or west, with an energetic barrier between these meta-stable states. Indeed, as demonstrated in [21], the single node flip algorithm may mix extremely slowly. Using $J_C(\xi) = \log(10)\,|\mathrm{cut}(\xi)|$ on a $40 \times 40$ lattice, the authors found that the system did not change meta-stable states even after nearly 3 billion steps when using the single node flip algorithm without tempering.

We evaluate the performance of the single node flip algorithm, the non-reversible district-to-district flow algorithm, and the non-reversible center-of-mass flow algorithm. For the single node flip algorithm we consider versions with and without tempering. For the tempered algorithms we set $\beta = 0.5$. For the center-of-mass algorithm, we place a circular vortex field with its origin at the center of the redistricting graph; in Cartesian coordinates, the field is defined as $(\cos(\alpha), -\sin(\alpha))$, where $\alpha$ is the angle between the vector and the positive horizontal axis.

For each method we simulate 10 independent chains of $10^7$ samples. All chains are initialized in the same meta-stable state, with one district in the north and one in the south, respectively.

We first evaluate the early mixing properties of the methods on the basis of the first 25000 steps of a single chain for each method. For each vertex of the precinct graph, we compute the fraction $f$ of steps that vertex is assigned to district 1. By symmetry, the expected value of $f$ with respect to the target measure $\widetilde{\pi}$ is $1/2$; we therefore display the signed deviation
$$|f - 1/2|\; \mathrm{sgn}(f - 1/2), \qquad (18)$$
where $f$ is the fraction of time a node spends in a chosen district and $\mathbb{E}[f] = 1/2$. We summarize this field variable in the early part of the chains in Figure 8.

In the single node flip algorithm, we find that the plan is predominantly stuck in the north/south orientation (see Figure 8). Tempering helps to alleviate this issue; however, this method appears to favor diagonal cuts (or, more likely, oscillates between a vertical and a horizontal orientation). The district-to-district flow is biased toward a north/south districting plan, although apparently less so than the first two methods. The center-of-mass flow is nearly unbiased in its orientation. The largest deviation of $f$ from the expected value at any vertex is found to be ….

Figure 8.
We examine the fraction of time each node in a $10 \times 10$ square lattice remains in a given district for the four methods after 25000 steps (top row) and $10^7$ steps (bottom row). The methods are single node flip (left-most), single node flip with tempered proposals (second from the left), center-of-mass flow (second from the right) and district-to-district flow (right-most).

We also examine the chains over all 10 million steps in the second row of Figure 8, and find that for all methods and all vertices $f$ does not deviate further than 0.02 from the expected value.

To further probe the mixing rates of the chains, we examine how frequently the chains transition between the meta-stable states. We define a meta-stable state as a state where the boundary between the two districts is within three nodes of a horizontal or vertical cut on an opposite side of the lattice. We begin by examining the frequency with which each chain visits each of the meta-stable states after $10^7$ steps and look at the variance of these frequencies over the 10 chains (see Figure 9). For the single node flip algorithm, the range of frequencies for the meta-stable states varies between 3.2% and 4.5%; when adding tempering, the range of frequencies across chains lies between 3.6% and 4.7%; for the district-to-district flow, the range lies between 2.8% and 2.8%. In contrast, the center-of-mass flow frequencies range between 1.4% and 2.8%, meaning that the variance across runs is significantly lower than for the other methods.

We expect that the reason for the faster mixing of the center-of-mass flow is a higher rate of transitions across meta-stable states. We examine the frequency of transitions between meta-stable states in Figure 10 after 10 million steps on a single chain. We find that all of the methods are symmetric in terms of how often they transition between meta-stable states.
We find that the center of mass flow has significantly more transitions than all other methods – this method transitions roughly 7,500 times in 10 million proposals, which is more than twice the number of transitions of the other methods. In contrast, the single node flip tempered and district to district flows transition with a nearly identical frequency (roughly 3,600 times per 10 million proposals), and the single node flip method transitions even less (roughly 2,200 times per 10 million proposals).

Figure 9. We display the median frequency spent in the meta-stable states after 10 million proposals over 10 chains for each method. We also display the deviation of these frequencies across the chains with standard box plots. The half colored squares correspond to a given district being in the north, east, south and west (where shaded). Dashed lines show the long time expected limit (i.e., equal probability of being in any of the meta-stable states).

We conclude by examining how the districting plan decorrelates along the trajectory of each sampling method. To examine this correlation, we consider the precinct assignment matrix $\phi : \Xi \to \mathbb{R}^{n \times d}$ such that $\phi(\xi)_{ij} = 1$ if precinct $i$ (of $n$ precincts) is assigned to district $j$ (of $d$ districts), and $\phi(\xi)_{ij} = 0$ otherwise. To compare two precinct assignments, we note that

(19) $\{\phi(\xi)\phi(\xi')^T\}_{ii} = \delta_{\xi(i)\,\xi'(i)}$,

which is to say that the $i$-th diagonal entry of the product is one when precinct $i$ is assigned to the same district in both plans and zero otherwise. This implies that

(20) $\frac{1}{n}\,\mathrm{tr}\big(\phi(\xi)\phi(\xi)^T\big) = 1$.

Furthermore, by symmetry, we have that

(21) $\mathbb{E}\big(\phi(\xi)\big)_{ij} = 1/d$.
Figure 10. We display the number of transitions between meta-stable states in the first 10 million proposals on a single chain for each of the four methods. The half colored squares correspond to a given district being in the north, east, south and west. All sub-figure bar charts have the same scale in the vertical axis with a range of zero to 1,100 transitions.
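The precinct-assignment identities above, and the Hamming-distance relation invoked below, are easy to verify numerically. In this sketch the two plans are random labelings standing in for sampled plans, and the helper `phi` is our own.

```python
import numpy as np

n, d = 100, 2  # precincts, districts
rng = np.random.default_rng(1)
xi  = rng.integers(0, d, size=n)   # hypothetical plan: district of each precinct
xi2 = rng.integers(0, d, size=n)   # a second hypothetical plan

def phi(xi, d):
    """One-hot precinct-assignment matrix: phi[i, j] = 1 iff precinct i is in district j."""
    m = np.zeros((len(xi), d))
    m[np.arange(len(xi)), xi] = 1.0
    return m

p, p2 = phi(xi, d), phi(xi2, d)

# (20): tr(phi phi^T) / n = 1, since each precinct sits in exactly one district
assert np.isclose(np.trace(p @ p.T) / n, 1.0)

# (23): tr(phi(xi) phi(xi')^T) = n - HammingDistance(xi, xi')
hamming = int((xi != xi2).sum())
assert np.isclose(np.trace(p @ p2.T), n - hamming)

# centred similarity of a plan with itself, in the spirit of G(t); it equals 1
Ephi = np.full((n, d), 1.0 / d)    # (21): E(phi)_{ij} = 1/d under exchangeable labels
G = d / (n * (d - 1)) * np.trace((p - Ephi) @ (p - Ephi).T)
assert np.isclose(G, 1.0)
```

The last check mirrors the normalization of the similarity statistic: the centred trace of a plan against itself is $n(d-1)/d$, so the prefactor $d/(n(d-1))$ rescales it to one.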
We use the above facts to develop a measurement of precinct similarity given by

(22) $G(t) = \frac{d}{n(d-1)}\, \mathbb{E}\Big[\mathrm{tr}\Big(\big(\phi(\xi_0) - \mathbb{E}(\phi)\big)\big(\phi(\xi_t) - \mathbb{E}(\phi)\big)^T\Big)\Big]$,

where the outer expectation is taken with respect to the law of the process assuming $\xi_0 \sim \tilde{\pi}$. We remark that $G(0) = 1$ and that $G(t) \to 0$ as $t \to \infty$, since $\xi_t$ becomes independent of $\xi_0$ in the limit $t \to \infty$. We also remark that this correlation is closely related to the evolution of the Hamming distance between two plans. The Hamming distance, $d(\xi, \xi')$, counts the number of precincts that are assigned to different districts across two plans. Thus

(23) $\mathrm{tr}\big(\phi(\xi)\phi(\xi')^T\big) = n - d(\xi, \xi')$.

We estimate $G(t)$ by taking 100,000 boot-strapped samples from each chain of 10 million samples. From each sample, we gather statistics as a function of progressive steps from the sample and then average across all chains. We plot $G(t)$ as a function of the steps from the initial samples in Figure 11. We find that all four methods decorrelate within 50,000 steps. We find that the center of mass flow decorrelates significantly faster than the three other methods. The district to district and single node flip tempered algorithms decorrelate at nearly the same rate, which makes sense due to the fact that a two district state does not admit cycles which can create longer time flows without rejection. The single node flip algorithm decorrelates far slower than all other methods.

Figure 11. We show the correlation of precinct assignments, $G(t)$, as a function of the number of steps taken for each of the four methods. The results are obtained by bootstrap sampling over the 10 different runs on the first $10^7 - 5\times 10^4$ steps for each method and then gathering statistics for $G(t)$ starting with each of the samples.

Discussion
To the best of our knowledge this article is the first work to propose using non-reversible MCMC schemes in the context of sampling redistricting plans. We explore various natural choices for introducing non-reversibility in this application (i.e., via flows induced by a vector field, and district-to-district flows). We provide the necessary (and novel) mathematical framework for implementing these approaches in the form of the Mixed Skew Metropolis-Hastings algorithm, which relies on a generalization of the skew detailed balance condition involving multiple momenta and a mixture of proposal kernels. The resulting Markov chain, on a state space extended to include unit momenta, satisfies mixed skew detailed balance. In this setting, we derive conditions which ensure ergodicity of the generated Markov chain with respect to the target measure.

This framework can be used to derive many instances of non-reversible MCMC algorithms, both in general and for the sampling of redistricting plans. We focus in this article on two specific simple implementations of the above mentioned approaches, namely the center-of-mass flow (Algorithm 3) and the pair-wise district-to-district flow (Algorithm 4). Even with these simple implementations, we find in numerical experiments that the proposed methods mix significantly better than comparable reversible methods. We expect that further improvement can be achieved by more principled implementations of the approaches suggested here. This may open many interesting directions of research. This pertains in particular to the choice of the vector field $v$, the embedding $\phi$, and the orientation function. Similarly, we expect that a more refined assignment of cycles on the district graph, as alluded to in Remark 6.2, may lead to significant improvements in performance in redistricting problems with a larger number of districts.

Acknowledgement
Matthias Sachs acknowledges support from grant DMS-1638521 from SAMSI. Jonathan Mattingly acknowledges the partial support of the NSF grant DMS-1613337. Jonathan Mattingly and Gregory Herschlag acknowledge partial support from the Duke Mathematics Department, the Rhodes Information Initiative at Duke, and the Duke Dean's and Provost's offices. All authors are thankful for the support of SAMSI and the Duke NSF-TRIPODS grant (NSF-CCF-1934964), which supported the very stimulating activity around the mathematics of redistricting during the time of this work. We also acknowledge the PRUV and Data+ undergraduate research programs, under which our initial work in quantifying gerrymandering was developed starting in 2013. Jonathan Mattingly is thankful for very educational and inspiring conversations with Manon Michel while at CIRM in Marseille in September 2018, which started him on this line of inquiry. We would also like to thank Andrea Agazzi, Jianfeng Lu, and the Triangle Quantifying Gerrymandering Meet-up for useful and stimulating discussions.
References

[1] E. Akhmatskaya, N. Bou-Rabee, and S. Reich. A comparison of generalized hybrid Monte Carlo methods with and without momentum flip. Journal of Computational Physics, 228(6):2256–2265, 2009.
[2] S. Bangia, C. V. Graves, G. Herschlag, H. S. Kang, J. Luo, J. C. Mattingly, and R. Ravier. Redistricting: Drawing the line. arXiv preprint arXiv:1704.03360, 2017.
[3] J. Bierkens, P. Fearnhead, G. Roberts, et al. The zig-zag process and super-efficient sampling for Bayesian analysis of big data. The Annals of Statistics, 47(3):1288–1320, 2019.
[4] A. Bouchard-Côté, S. J. Vollmer, and A. Doucet. The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association, 113(522):855–867, 2018.
[5] D. Carter, G. Herschlag, Z. Hunter, and J. Mattingly. A merge-split proposal for reversible Monte Carlo Markov chain sampling of redistricting plans. arXiv preprint arXiv:1911.01503, 2019.
[6] D. DeFord, M. Duchin, and J. Solomon. Recombination: A family of Markov chains for redistricting. arXiv preprint arXiv:1911.05725, 2019.
[7] P. Diaconis, S. Holmes, and R. M. Neal. Analysis of a nonreversible Markov chain sampler. Annals of Applied Probability, pages 726–752, 2000.
[8] P. Dobson, I. Fursov, G. Lord, and M. Ottobre. Reversible and non-reversible Markov chain Monte Carlo algorithms for reservoir simulation problems. Computational Geosciences, pages 1–13, 2020.
[9] A. Duncan, G. Pavliotis, and K. Zygalakis. Nonreversible Langevin samplers: Splitting schemes, analysis and implementation. arXiv preprint arXiv:1701.04247, 2017.
[10] A. B. Duncan, T. Lelièvre, and G. Pavliotis. Variance reduction using nonreversible Langevin samplers. Journal of Statistical Physics, 163(3):457–491, 2016.
[11] B. Fifield, M. Higgins, K. Imai, and A. Tarr. A new automated redistricting simulator using Markov chain Monte Carlo. Working Paper, Princeton University, Princeton, NJ, 2015.
[12] G. Herschlag, H. S. Kang, J. Luo, C. V. Graves, S. Bangia, R. Ravier, and J. C. Mattingly. Quantifying gerrymandering in North Carolina. arXiv preprint arXiv:1801.03783, 2018.
[13] G. Herschlag, R. Ravier, and J. C. Mattingly. Evaluating partisan gerrymandering in Wisconsin. arXiv preprint arXiv:1709.01596, 2017.
[14] K. Hukushima and Y. Sakai. An irreversible Markov-chain Monte Carlo method with skew detailed balance conditions. In Journal of Physics: Conference Series, volume 473, page 012012. IOP Publishing, 2013.
[15] Y.-A. Ma, E. B. Fox, T. Chen, and L. Wu. Irreversible samplers from jump and continuous Markov processes. Statistics and Computing, 29(1):177–202, 2019.
[16] J. C. Mattingly. Expert report for Common Cause v. Lewis. 2019.
[17] J. C. Mattingly. Rebuttal of defendant's expert reports for Common Cause v. Lewis. 2019.
[18] J. C. Mattingly and C. Vaughn. Redistricting and the will of the people, 2014.
[19] M. Michel. Irreversible Markov chains by the factorized Metropolis filter: Algorithms and applications in particle systems and spin models. PhD thesis, École Normale Supérieure, 2016.
[20] L. Najt, D. DeFord, and J. Solomon. Complexity and geometry of sampling connected graph partitions, 2019.
[21] L. Najt, D. DeFord, and J. Solomon. Complexity and geometry of sampling connected graph partitions. Preprint, arXiv:1908.08881, 2019.
[22] R. M. Neal. Improving asymptotic variance of MCMC estimators: Non-reversible chains are better. arXiv preprint math/0407281, 2004.
[23] M. Ottobre, N. S. Pillai, F. J. Pinski, A. M. Stuart, et al. A function space HMC algorithm with second order Langevin diffusion limit. Bernoulli, 22(1):60–106, 2016.
[24] G. Stoltz, M. Rousset, et al. Free Energy Computations: A Mathematical Perspective. World Scientific, 2010.
[25] Y. Sun, J. Schmidhuber, and F. J. Gomez. Improving the asymptotic performance of Markov chain Monte-Carlo by inserting vortices. In Advances in Neural Information Processing Systems, pages 2235–2243, 2010.
[26] H. Suwa and S. Todo. General construction of irreversible kernel in Markov chain Monte Carlo. arXiv preprint arXiv:1207.0258, 2012.
[27] L. Tierney. A note on Metropolis-Hastings kernels for general state spaces. The Annals of Applied Probability, 8(1):1–9, Feb 1998.
[28] K. S. Turitsyn, M. Chertkov, and M. Vucelja. Irreversible Monte Carlo algorithms for efficient sampling. Physica D: Nonlinear Phenomena, 240(4–5):410–414, 2011.
[29] M. Vucelja. Lifting – a nonreversible Markov chain Monte Carlo algorithm. American Journal of Physics, 84(12):958–968, 2016.
Appendix A. Algorithms
A.1. Tempered single node-flip algorithm.
Combining the tempered single node-flip proposal of Section 5.3 with a Metropolis-Hastings accept-reject step results in the following algorithm.

Algorithm 5: Tempered single node-flip algorithm.
  input: $\xi$
  generate proposal $\xi' \sim \widetilde{Q}(\xi, \cdot) \propto \mathbb{1}_{N(\xi)}(\cdot)\, e^{-\beta J(\cdot)}$;
  $r(\xi, \xi') = \dfrac{e^{-J(\xi') + \beta J(\xi')}\, Z(\xi)}{e^{-J(\xi) + \beta J(\xi)}\, Z(\xi')}$;
  sample $u \sim \mathcal{U}([0, 1])$;
  if $u < r(\xi, \xi')$ then $\xi \leftarrow \xi'$
  return $\xi$

Note that:
• For $\beta = 1$, the proposal distribution $\widetilde{Q}(\xi, \cdot)$ is the target measure constrained to $N(\xi)$, and the Metropolis ratio simplifies to $r(\xi, \xi') = Z(\xi)/Z(\xi')$. In this case the acceptance probability of the proposal is not explicitly dependent on the energy difference $\Delta J = J(\xi') - J(\xi)$, but is merely a function of the partition functions of the current state and the proposed state.
• For $\beta = 0$, a version of the single node flip algorithm where the proposal is sampled uniformly from the set $N(\xi)$ is recovered, and the Metropolis-Hastings ratio becomes $r(\xi, \xi') = \dfrac{e^{-J(\xi')}\, |N(\xi')|^{-1}}{e^{-J(\xi)}\, |N(\xi)|^{-1}}$. The acceptance probability in this case tends to be lower in comparison to the tempered proposal with $\beta = 1$.

Furthermore, if the underlying tempered proposal generates an irreducible Markov chain, then the tempered MCMC algorithm is also irreducible and hence has the desired target measure as its unique invariant measure. This is summarized in the following proposition.

Proposition A.1.
If the topology of the graph $G_p = (V_p, E_p)$ is such that the Markov chain generated by the single-node-flip proposal as defined in (12) is irreducible, then the probability measure $\tilde{\pi}$ is the unique invariant measure of the Markov chain $(\xi_n)_{n \in \mathbb{N}}$ generated by the MCMC algorithm given in Algorithm 5.

Appendix B. Proofs
B.1. Proof of Theorem 4.1.
We now return to the proof of Theorem 4.1, which guarantees that the mixed skew detailed balance condition ensures that the desired target measure $\pi$ is an invariant measure for the Markov chain $P$. The proof has the same structure as the analogous proof for the original skew detailed balance condition [7].

Proof of Theorem 4.1.
\begin{align*}
(\pi P)(x) &= \sum_{x' \in \mathcal{X}} \pi(x')\, P(x', x) = \sum_{x' \in \mathcal{X}} \sum_{i=1}^{n} \pi(x')\,\omega_i(x')\, P_i(x', x) \\
&= \sum_{x' \in \mathcal{X}} \sum_{i=1}^{n} \omega_i(x)\,\pi(x)\, P_i\big(S_i(x), S_i(x')\big) \\
&= \pi(x) \sum_{i=1}^{n} \omega_i(x) \Big( \sum_{x' \in \mathcal{X}} P_i\big(S_i(x), S_i(x')\big) \Big) \\
&= \pi(x) \Big( \sum_{i=1}^{n} \omega_i(x) \Big) = \pi(x),
\end{align*}
where the third equality follows from the mixed skew detailed balance condition (4), and the fifth and sixth equalities follow from the fact that the sum within each pair of parentheses equals one: in the first case because $P_i$ is a Markov transition kernel and $S_i(\mathcal{X}) = \mathcal{X}$, and in the second case because the weights $\omega_i(x)$ sum to one. □

B.2.
Proof of Theorem 4.2.
We now give the proof that the Mixed Skew Metropolis-Hastings (MSMH) algorithm given in Algorithm 1 satisfies the mixed skew detailed balance condition given in (4), and hence has the desired target measure $\pi$ as an invariant measure.

Proof of Theorem 4.2.
The form of the algorithm directly implies that the transition matrices/kernels $P_i$, $i = 1, \dots, n$, of the generated Markov chain are of the form $P_i(x, x') = g_{i,1}(x, x') + g_{i,2}(x, x')$, where

(24) $g_{i,1}(x,x') = Q_i(x,x')\,\min\big(1, r_i(x,x')\big)$,
     $g_{i,2}(x,x') = \mathbb{1}_{\{S_i(x)\}}(x')\,\Big[1 - \sum_{\tilde{x} \in \mathcal{X}} Q_i(x,\tilde{x})\,\min\big(1, r_i(x,\tilde{x})\big)\Big]$;

thus a generalized skew detailed balance condition is satisfied for $P = \omega \cdot \widetilde{P}$ if

(25) $g_{i,j}(x,x')\,\pi(x)\,\widetilde{\omega}_i(x) = g_{i,j}\big(S_i(x'), S_i(x)\big)\,\pi(x')\,\widetilde{\omega}_i(x')$

for $j = 1, 2$, all $x, x' \in \mathcal{X}$, and all $i = 1, \dots, n$. We will now verify this condition for each $j$ separately.

Case $j = 1$: A simple computation shows that $r_i(x,x') = 1 / r_i(S_i(x'), S_i(x))$. Thus $r_i(x,x') \in (0,1) \iff r_i(S_i(x'), S_i(x)) > 1$, and $r_i(x,x') = r_i(S_i(x'), S_i(x)) \iff r_i(x,x') = 1$. Therefore, if $r_i(x,x') \le 1$, we find
\begin{align*}
g_{i,1}(x,x')\,\pi(x) &= \min\big(1, r_i(x,x')\big)\, Q_i(x,x')\,\pi(x) \\
&= \frac{Q_i(S_i(x'), S_i(x))\,\pi(x')}{Q_i(x,x')\,\pi(x)}\, Q_i(x,x')\,\pi(x) \\
&= g_{i,1}\big(S_i(x'), S_i(x)\big)\,\pi(x'),
\end{align*}
and, similarly, since $r_i(x,x') > 1 \iff r_i(S_i(x'), S_i(x)) \in (0,1)$,
\begin{align*}
g_{i,1}(x,x')\,\pi(x) &= Q_i(x,x')\,\min\big(1, r_i(x,x')\big)\,\pi(x) = Q_i(x,x')\,\pi(x) \\
&= \frac{Q_i(x,x')\,\pi(x)}{Q_i(S_i(x'), S_i(x))\,\pi(x')}\, Q_i\big(S_i(x'), S_i(x)\big)\,\pi(x') \\
&= \min\big(1, r_i(S_i(x'), S_i(x))\big)\, Q_i\big(S_i(x'), S_i(x)\big)\,\pi(x') \\
&= g_{i,1}\big(S_i(x'), S_i(x)\big)\,\pi(x').
\end{align*}

Case $j = 2$: For $S_i(x) \neq x'$, it follows from the definition of $g_{i,2}(x,x')$ in (24) that $g_{i,2}(x,x') = g_{i,2}(S_i(x'), S_i(x)) = 0$, and (25) is thus trivially satisfied. If $S_i(x) = x'$, it follows that $g_{i,2}(x,x') = g_{i,2}(S_i(x), S_i(x')) = g_{i,2}(S_i(x'), S_i(x))$, which, by virtue of the fact that $\pi$ is invariant under $S_i$, also implies (25). □

B.3.
Proof of Theorem 4.6.
We now turn to the proof of the unique ergodicity and convergence of averages result given in Theorem 4.6. Since $\pi$ is invariant under $P$, that is, $\pi P = \pi$, and $\pi(\xi) > 0$ for all $\xi \in \Xi$, it is sufficient to show that the Markov chain generated by $P$ is irreducible, meaning that we can reach any extended state $(\xi', \vec{\theta}')$ from any other extended state $(\xi, \vec{\theta})$ within a finite number of steps, i.e.,

(26) $\forall\, (\xi, \vec{\theta}), (\xi', \vec{\theta}') \in \Xi \times \{-1, +1\}^n,\ \exists\, m \in \mathbb{N} \text{ s.t. } P^m\big((\xi, \vec{\theta}), (\xi', \vec{\theta}')\big) > 0.$

We will prove this statement by combining the following two intermediate facts.
(1) Show that we can find a path with non-zero probability between any state $(\xi, \vec{\theta}) \in \Xi \times \{-1, +1\}^n$ and the set $\{\xi'\} \times \{-1, +1\}^n$ sitting above an arbitrary point $\xi' \in \Xi$ (Lemma B.1 below). Such a path will be used to move about the state space.
(2) Show that for any $(\xi, \vec{\theta}) \in \Xi \times \{-1, +1\}^n$, if $i \in \mathcal{A}(\xi)$, we can find a path with non-zero probability beginning at $(\xi, \vec{\theta})$ and ending at $(\xi, R_i(\vec{\theta}))$. These paths will be used to flip the $i$-th momentum as needed, without changing the underlying state $\xi$ (Lemma B.2 below).

The basic idea of the proof of Theorem 4.6 is that the first of the above facts allows one to move around the state space from $\xi$ to $\xi'$, but without control of what happens to the momentum variable $\vec{\theta}$. The second fact then allows one to modify the resulting $\vec{\theta}''$ to the desired $\vec{\theta}'$. We will see that, at its core, the first fact follows from the irreducibility of the proposal kernel $\widetilde{Q}$, which was an input to our algorithm.
The second fact will follow again from this base irreducibility, along with Assumption 2 (or, in light of Theorem 4.7, Assumption 3), which ensures there is a place along the path at which to flip the signs in the momentum $\vec{\theta}''$ until it agrees with $\vec{\theta}'$. This is the basic arc of the proof of Theorem 4.6, though each of the above statements will need a little refinement, and a few additional technical elements will need to be added to complete the argument. A graphic representation of the sketch of the proof is given in Fig. 12.

We begin by stating the two lemmas which correspond to the two statements above. The proofs of these lemmas are postponed to the end of the section.

Lemma B.1.
For all extended states $(\xi, \vec{\theta}) \in \Xi \times \{-1, +1\}^n$ and any state $\xi' \in \Xi$, there exists $m \in \mathbb{N}$ so that $P^m\big((\xi, \vec{\theta}), \{\xi'\} \times \{-1, +1\}^n\big) > 0$. Additionally, $m$ can be chosen so that there is a path $\big((\xi_0, \vec{\theta}_0), \dots, (\xi_m, \vec{\theta}_m)\big)$ of positive probability starting at $(\xi, \vec{\theta})$ and ending in the set $\{\xi'\} \times \{-1, +1\}^n$, satisfying the following property: for every $i \in \{1, \dots, n\}$ there exists a $k \in \{0, \dots, m\}$ (depending on $i$) with $i \in \mathcal{A}(\xi_k)$.

Lemma B.2.
For any extended state $(\xi, \vec{\theta}) \in \Xi \times \{-1, +1\}^n$ and $i \in \mathcal{A}(\xi)$, there is $m \in \mathbb{N}$ so that $P^m\big((\xi, \vec{\theta}), (\xi, R_i(\vec{\theta}))\big) > 0$.

The construction of the MSMH chain $P$ from the proposal chain $\widetilde{Q}$ in Section 4.3, and in particular the structure of $N_i^+$ and $N_i^-$, imply that

(27) $\bigcup_{i \in \mathcal{A}(\xi)} \Big[ \operatorname{supp} P_i\big((\xi, \vec{\theta}), (\cdot, \vec{\theta})\big) \cup \operatorname{supp} P_i\big((\xi, R_i(\vec{\theta})), (\cdot, R_i(\vec{\theta}))\big) \Big] = \operatorname{supp} \widetilde{Q}(\xi, \cdot).$

The one central obstacle in deducing Lemma B.1 from this fact is that a particular coordinate momentum of $\vec{\theta}$ might not be aligned in the right direction to allow us to propose, and then follow, a given transition which is possible under $\widetilde{Q}$. Lemma B.2 allows us to flip a particular coordinate momentum $\theta_i$ provided $i$ is active. One remaining concern is that we might be forced to constantly correct the momenta which are changed as a side-effect of previous moves. The following lemma is central to ruling out this scenario.

Lemma B.3.
Let Assumption 1 hold and consider the transition kernel $P\big((\xi, \vec{\theta}), \cdot\big) = \sum_{i=1}^{n} \widetilde{\omega}_i(\xi)\, P_i\big((\xi, \vec{\theta}), \cdot\big)$ of the Markov chain generated by Algorithm 2 with generic weights as specified in (7). For any two states $\xi, \xi' \in \Xi$ and an $i \in \{1, \dots, n\}$ such that $i \in \mathcal{A}(\xi)$ and $i \notin \mathcal{A}(\xi')$, the following conclusion holds:

$\forall\, \vec{\theta} \in \{-1, +1\}^n, \quad P_i\big((\xi, \vec{\theta}), (\xi', \vec{\theta})\big) = 0.$
Figure 12. A sketch of the proof of Theorem 4.6, showing that there is a path with non-zero probability between any two points $(\xi, \vec{\theta})$ and $(\xi', \vec{\theta}')$ in the extended state space.

Equipped now with Lemma B.1, Lemma B.2, and Lemma B.3, as well as (27), we are able to give the proof of Theorem 4.6. As already mentioned, the basic outline is given in Fig. 12. The proofs of these lemmas are postponed to the end of this section.
Proof of Theorem 4.6.
As already mentioned, Lemma B.1 ensures that for any extended state $(\xi, \vec{\theta})$ and state $\xi' \in \Xi$, there is a sequence of states $(\xi_0, \vec{\theta}_0), \dots, (\xi_{m''}, \vec{\theta}_{m''})$ with $(\xi_0, \vec{\theta}_0) = (\xi, \vec{\theta})$, $\xi_{m''} = \xi'$, and $P\big((\xi_k, \vec{\theta}_k), (\xi_{k+1}, \vec{\theta}_{k+1})\big) > 0$ for $k = 0, \dots, m'' - 1$. This is the path from $\xi_0$ to $\xi_{m''}$ which runs horizontally across the center of the diagram (orange arrows). In order to show (26), we need to show that this sequence can be modified/extended to a sequence whose final velocity coincides with the velocity vector $\vec{\theta}'$ specified in (26).

First notice that if $\vec{\theta}'' := \vec{\theta}_{m''}$ and $\vec{\theta}'$ only differ in components which are active in $\xi'$, then the existence of such a modified sequence directly follows from Lemma B.2. One simply adds loops (green arrows in Fig. 12) starting from $\xi'$ and the current $\vec{\theta}$, and returning with a particular component of the momentum's sign flipped to agree with $\vec{\theta}'$. In Fig. 12, these excursions correspond to extensions of the initial path, which ended at $\xi_{m''}$, which snake downward and then back to the left from $(\xi', \vec{\theta}'')$.

If $\vec{\theta}''$ and $\vec{\theta}'$ differ in a component $i$ which is activated along the path $(\xi_0, \dots, \xi_{m''})$, then we can again invoke Lemma B.2 to insert a loop in the middle of the original sequence to flip the $i$-th component of the momentum (blue arrows in Fig. 12). We consider the state $\xi_{\tau_i}$ in the sequence $(\xi_0, \dots, \xi_{m''})$ at which the component $i$ is active for the last time, i.e., $\tau_i := \max\{ j \in \{0, \dots, m''\} \mid i \in \mathcal{A}(\xi_j) \}$. After this state we insert a loop $(\xi^\star_1, \vec{\theta}^\star_1), \dots, (\xi^\star_{m_i}, \vec{\theta}^\star_{m_i})$ into the sequence which, according to Lemma B.2, flips the sign of the $i$-th velocity component, i.e., $\xi^\star_{m_i} = \xi_{\tau_i}$, $\vec{\theta}^\star_{m_i} = R_i(\vec{\theta}_{\tau_i})$. This produces a modified sequence of states

$(\xi_0, \vec{\theta}_0), \dots, (\xi_{\tau_i}, \vec{\theta}_{\tau_i}), (\xi^\star_1, \vec{\theta}^\star_1), \dots, (\xi^\star_{m_i}, \vec{\theta}^\star_{m_i}), (\xi_{\tau_i+1}, R_i(\vec{\theta}_{\tau_i+1})), \dots, (\xi_{m''}, R_i(\vec{\theta}_{m''})),$

which is attained with positive probability by the Markov chain (see Fig. 12). Observe that the transition probabilities between the remaining states $(\xi_{\tau_i+1}, R_i(\vec{\theta}_{\tau_i+1})), \dots, (\xi_{m''}, R_i(\vec{\theta}_{m''}))$ are not affected by the sign change in the $i$-th velocity component, since the $i$-th component is, by the choice of $\tau_i$, inactive for all these states. All that remains is to show that the transition from the extended state $(\xi^\star_{m_i}, \vec{\theta}^\star_{m_i}) = (\xi_{\tau_i}, R_i(\vec{\theta}_{\tau_i}))$ to $(\xi_{\tau_i+1}, R_i(\vec{\theta}_{\tau_i+1}))$ does not use the kernel $P_i$, and hence cannot be affected by the sign change in the $i$-th velocity component between the start and the end of the loop. Lemma B.3 ensures that $P_i\big((\xi_{\tau_i}, R_i(\vec{\theta}_{\tau_i})), (\xi_{\tau_i+1}, R_i(\vec{\theta}_{\tau_i+1}))\big) = 0$, because we know that $i \notin \mathcal{A}(\xi_{\tau_i+1})$ from the definition of $\tau_i$.

By repeating the described procedure for all remaining mismatched components which are activated somewhere along the original sequence of states $(\xi_0, \dots, \xi_{m''})$, we obtain a sequence for which the final velocity coincides in all these components with $\vec{\theta}'$. Since Lemma B.1 allows us to choose the sequence $(\xi_0, \dots, \xi_{m''})$ connecting $(\xi, \vec{\theta})$ with $(\xi', \vec{\theta}')$ such that every component $i \in \{1, \dots, n\}$ is activated at least once along the way, this concludes the proof. □

We now return to the proofs of Lemma B.1, Lemma B.2, and Lemma B.3, which we will give in reverse order.
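The irreducibility statement (26) can be checked mechanically on a small example. The sketch below builds a toy lifted chain, a walker on a cycle whose momentum flips with small probability; this toy kernel is our own stand-in, not one of the chains of this paper.

```python
import numpy as np

# Toy lifted chain on a cycle of K sites: the extended state is (xi, theta)
# with theta in {+1, -1}; the walker drifts one step in direction theta with
# probability 1 - eps and flips its momentum with probability eps.
K, eps = 6, 0.1
N = 2 * K  # number of extended states

def idx(xi, th):
    return 2 * xi + (0 if th == +1 else 1)

P = np.zeros((N, N))
for xi in range(K):
    for th in (+1, -1):
        P[idx(xi, th), idx((xi + th) % K, th)] = 1 - eps  # drift with the momentum
        P[idx(xi, th), idx(xi, -th)] = eps                # momentum flip

assert np.allclose(P.sum(axis=1), 1.0)  # P is a stochastic matrix

# (26): every extended state reaches every other within at most N steps;
# (I + A)^N has a positive entry wherever a path of length <= N exists
A = (P > 0).astype(np.int64)
reach = np.linalg.matrix_power(np.eye(N, dtype=np.int64) + A, N)
assert (reach > 0).all()
```

With `eps = 0` the flip edges disappear, the two momentum copies of the cycle decouple, and the same reachability check fails, which is exactly the degeneracy the momentum-flip lemmas rule out.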
Proof of Lemma B.3.
Since mixed skew detailed balance holds for $P$, we have in particular

(28) $P_i\big((\xi, \vec{\theta}), (\xi', \vec{\theta})\big)\, \tilde{\pi}(\xi)\, \widetilde{\omega}_i(\xi) = P_i\big((\xi', R_i(\vec{\theta})), (\xi, R_i(\vec{\theta}))\big)\, \tilde{\pi}(\xi')\, \widetilde{\omega}_i(\xi').$

By the hypotheses of the lemma we have $\widetilde{\omega}_i(\xi') = 0$, since $i \notin \mathcal{A}(\xi')$. Thus, the right hand side of (28) is certainly 0. Similarly, we have $\widetilde{\omega}_i(\xi) > 0$ and $\tilde{\pi}(\xi) > 0$, so we must have $P_i\big((\xi, \vec{\theta}), (\xi', \vec{\theta})\big) = 0$ in order for the equality given in (28) to hold. □

Proof of Lemma B.2.
First notice that, since each kernel $P_i$ satisfies a modified skew detailed balance condition as detailed in (4.1), we have that $P_i\big((\xi, \vec{\theta}), (\xi', \vec{\theta})\big) > 0 \iff P_i\big((\xi', R_i(\vec{\theta})), (\xi, R_i(\vec{\theta}))\big) > 0$. Thus it follows inductively that any sequence of states $(\xi_0, \vec{\theta}), (\xi_1, \vec{\theta}), \dots, (\xi_m, \vec{\theta})$ which can be observed with positive probability when evolving according to $P_i$ can also, with positive probability, be "walked back" in reverse order as $(\xi_m, R_i(\vec{\theta})), \dots, (\xi_1, R_i(\vec{\theta})), (\xi_0, R_i(\vec{\theta}))$ after flipping the sign of the $i$-th momentum. Since $i \in \mathcal{A}(\xi)$, Assumption 2 guarantees that there is some path with positive $P_i$ probability from $(\xi, \vec{\theta})$ to a point at which $\vec{\theta}$ can be flipped to $R_i(\vec{\theta})$. Without loss of generality, we can assume that $\vec{\theta}$ does not change until this last step. By "walking back" along this path, we arrive at $(\xi, R_i(\vec{\theta}))$ as desired. □

Proof of Lemma B.1.
From Assumption 1, which states that $\widetilde{Q}$ is irreducible on $\Xi$, we know that there exists an integer $m$ and a path $\xi = \xi_0, \dots, \xi_m = \xi'$ with positive $\widetilde{Q}$ probability. By the construction of the $P$ chain from $\widetilde{Q}$, we know that for any pair $\xi_i$ and $\xi_{i+1}$ along this path there exists some kernel $P_k$ and some choice of the $k$-th momentum $\epsilon \in \{-1, +1\}$ so that $P_k\big((\xi_i, \vec{\theta}^*), (\xi_{i+1}, \vec{\theta}^*)\big) > 0$ with $\theta^*_k = \epsilon$. If by chance we arrive at $\xi_i$ with the wrong sign in the $k$-th momentum, we can insert one of the loops constructed in Lemma B.2 to change the sign of the $k$-th momentum. By following this procedure inductively for each step in the original path, we obtain the path needed to prove the first part of Lemma B.1.

Finally, we need to show that we can construct the path connecting $(\xi, \vec{\theta})$ to $(\xi', \vec{\theta}')$ such that for each index $i \in \{1, \dots, n\}$ there is at least one state $(\xi_k, \vec{\theta}_k)$ along it with $i \in \mathcal{A}(\xi_k)$. First notice that Remark 4.4 guarantees the existence of states $\xi^*_i \in \Xi$ with $i \in \mathcal{A}(\xi^*_i)$ for each $i \in \{1, \dots, n\}$. The last result is obtained by prepending to the path constructed above a path which visits each of the $\xi^*_i$, $i \in \{1, \dots, n\}$, before heading on to $(\xi', \vec{\theta}')$. □

B.4.
Proof of Theorem 4.7.
We close out this last appendix of the paper by proving Theorem 4.7, which shows that, under Assumption 1, Assumption 3 and Assumption 2 are equivalent.
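The property at the heart of this equivalence, that every state reachable without flipping $\theta_i$ can itself reach some state where the flip has positive probability, can be phrased as a reachability check on the directed graph $G_i^+$. The edge list and flip set below are hypothetical stand-ins, chosen only to make the check concrete.

```python
from collections import deque

# Toy directed graph of moves available while theta_i = +1, plus the set of
# states at which the momentum flip theta_i -> -theta_i has positive
# probability. Both are made-up examples, not taken from the paper.
edges = {0: [1], 1: [2], 2: [0, 3], 3: [3]}  # adjacency of G_i^+
flip_states = {3}                            # states where the flip is possible

def reachable(start):
    """Breadth-first search: all vertices reachable from `start` in G_i^+."""
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for w in edges.get(v, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

# From every state one can reach a flip state, so the i-th momentum can
# always be reversed: no "non-escapable circuit" traps the chain.
assert all(reachable(v) & flip_states for v in edges)
```

Deleting the edge `2 -> 3` would leave the cycle `0 -> 1 -> 2 -> 0` with no route to a flip state, the kind of non-escapable circuit that Assumption 3 excludes.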
Proof of Theorem 4.7.
Consider the Markov chain obtained by starting in $(\xi, \vec{\theta})$ and evolving according to the kernel $P_i$. Moreover, assume without loss of generality $\theta_i = +1$, and denote by

$V_i^+(\xi) = \bigcup_{m \in \mathbb{N}} \operatorname{supp} P_i^m\big((\xi, \vec{\theta}), (\cdot, \vec{\theta})\big)$

the set of states which can be reached from $\xi$ in a finite number of steps when evolving according to the kernel $P_i$ without changing the sign of the velocity $\theta_i = +1$. This set of reachable vertices $V_i^+(\xi)$ induces a subgraph of $G_i^+$ which we denote by $G_i^+(\xi) = (V_i^+(\xi), E_i^+(\xi))$, where $E_i^+(\xi) := E_i^+ \cap \big(V_i^+(\xi) \times V_i^+(\xi)\big)$.

Let Assumption 3 be violated, that is, for a certain $i \in \{1, \dots, n\}$ the corresponding graph $G_i^+$ contains a non-escapable circuit. If $\xi$ is contained in this circuit, then $G_i^+(\xi)$ coincides with this closed circuit and, in particular, $P_i\big((\xi', \vec{\theta}), (\xi', R_i(\vec{\theta}))\big) = 0$ for every $\xi' \in V_i^+(\xi)$. Thus, it follows immediately that Assumption 2 is also violated. This shows that Assumption 2 implies Assumption 3.

In order to show that Assumption 3 together with Assumption 1 implies Assumption 2, it suffices to show that if Assumption 3 and Assumption 1 hold, then there is always at least one state/node, say $\tilde{\xi}$, in the set of reachable vertices $V_i^+(\xi)$ for which there is a positive probability of flipping the velocity component $\theta_i$ (in the sense that $P_i\big((\tilde{\xi}, \vec{\theta}), (\tilde{\xi}, R_i(\vec{\theta}))\big) > 0$). We treat the two cases where $G_i^+(\xi)$ either does or does not contain a non-escapable circuit separately.

If the graph $G_i^+(\xi)$ contains a non-escapable circuit, then the existence of such a state $\tilde{\xi}$ is guaranteed by Assumption 3.

If the graph $G_i^+(\xi)$ does not contain a non-escapable circuit, the existence of such a state $\tilde{\xi}$ can be easily shown using the fact that the probability measure $\pi_i(\xi', \vec{\theta}') \propto \widetilde{\omega}_i(\xi')\, \tilde{\pi}(\xi')$ is invariant under $P_i$ (see Remark 4.3): there is at least one state $\xi^\star \in V_i^+(\xi)$ which is not part of a circuit (otherwise $G_i^+(\xi)$ would be a non-escapable circuit). If $P_i\big((\xi', \vec{\theta}), (\xi', R_i(\vec{\theta}))\big) = 0$ for all $\xi' \in V_i^+(\xi) \supset V_i^+(\xi^\star)$, then this state is transient, which is in direct contradiction to $\pi_i(\xi^\star, \vec{\theta}) > 0$. Consequently, there must be at least one state in $V_i^+(\xi)$ for which the probability of flipping the $i$-th velocity component is positive. □