Automating Involutive MCMC using Probabilistic and Differentiable Programming
Marco Cusumano-Towner, Alexander K. Lew, Vikash K. Mansinghka
Massachusetts Institute of Technology
Abstract
Involutive MCMC is a unifying mathematical construction for MCMC kernels that generalizes many classic and state-of-the-art MCMC algorithms, from reversible jump MCMC to kernels based on deep neural networks. But as with MCMC samplers more generally, implementing involutive MCMC kernels is often tedious and error-prone, especially when sampling on complex state spaces. This paper describes a technique for automating the implementation of involutive MCMC kernels given (i) a pair of probabilistic programs defining the target distribution and an auxiliary distribution respectively and (ii) a differentiable program that transforms the execution traces of these probabilistic programs. The technique, which is implemented as part of the Gen probabilistic programming system, also automatically detects user errors in the specification of involutive MCMC kernels and exploits sparsity in the kernels for improved efficiency. The paper shows example Gen code for a split-merge reversible jump move in an infinite Gaussian mixture model and a state-dependent mixture of proposals on a combinatorial space of covariance functions for a Gaussian process.
Markov chain Monte Carlo (MCMC) algorithms are powerful tools for approximate sampling from probability distributions and are central to modern Bayesian statistics, probabilistic machine learning, statistical physics, and numerous application areas where probabilistic modeling and inference are used. But designing and deriving efficient MCMC algorithms is mathematically involved, and implementing MCMC kernels is tedious and notoriously error-prone. These challenges are especially pronounced when sampling from probability distributions on complex state spaces that combine symbolic, numeric, and structural uncertainty, such as those arising in computational biology (Huelsenbeck et al., 2004), robotics and scene understanding (Geiger et al., 2011), and models of human cognition (Tenenbaum et al., 2011).

Involutive MCMC is a mathematical construction for MCMC kernels that gives a simplifying and unifying perspective on a number of previously disparate classes of kernels, including reversible jump MCMC (Green, 1995), which is the dominant mathematical framework for MCMC on complex state spaces. Involutive MCMC constructs an MCMC kernel from three components: (i) the unnormalized target density, (ii) a sampler and density for an auxiliary probability distribution, and (iii) an involution on an extended state space. While this construction is mathematically clarifying, correctly implementing involutive MCMC kernels on complex state spaces remains challenging due to tedious density and Jacobian computations and the need for careful reasoning about the state space.

This paper formulates involutive MCMC on general state spaces and shows how to automate the implementation of an involutive MCMC kernel from three declarative programs that define the target probability distribution, auxiliary probability distribution, and the involution, respectively.
The probability distributions are defined as probabilistic programs, and the involution is defined as a differentiable program that transforms the execution traces of the probabilistic programs. (An involution is a bijection that is its own inverse; that is, f such that f(f(z)) = z.) We use probabilistic programming techniques and automatic differentiation to automatically compute the acceptance probability. We also show how to automatically detect mathematical errors in the specification of an involutive MCMC kernel, and how to improve efficiency by automatically exploiting the sparsity structure in the involution. We implemented the approach within the Gen probabilistic programming system (Cusumano-Towner et al., 2019). The paper shows examples of involutive MCMC kernels implemented in Gen for (i) a split-merge reversible jump move in an infinite mixture model, and (ii) a state-dependent mixture of Metropolis-Hastings proposals on an infinite combinatorial space of covariance functions for a Gaussian process. We also provide a lightweight PyTorch implementation of the basic approach at https://github.com/probcomp/autoimcmc.

The contributions of this paper include:

1. A measure-theoretic formulation of involutive MCMC on general state spaces.

2. A mathematical formulation of state spaces consisting of arbitrary key-value stores (i.e. dictionaries) that formalizes the space of execution traces of Gen probabilistic programs, and a formulation of involutive MCMC on these spaces.

3. A differentiable programming language for defining transformations between spaces of traces of probabilistic programs.

4. An algorithm that automates the implementation of an involutive MCMC kernel given two probabilistic programs and a differentiable program encoding the involution, using automatic differentiation and tracing of probabilistic programs.

5.
An extension to the algorithm that exploits sparsity in the involution to reduce the number of operations needed to apply an involutive MCMC kernel from O(N) to O(1) in some cases, where N is the dimensionality of the latent space.

6. An algorithm that dynamically detects errors in the specification of an involutive MCMC kernel.

The involutive MCMC construction was previously implemented (Cusumano-Towner, 2018) by the authors as part of the inference library of the Gen probabilistic programming system (Cusumano-Towner et al., 2019). The construction was motivated in part by a desire for a simple interface that automated the implementation of reversible jump MCMC samplers (Green, 1995), state-dependent mixtures of proposals on complex state spaces, and data-driven neural proposals. Gen's involutive MCMC construction has since been used by a number of researchers to design and implement MCMC algorithms in diverse domains, including computational biology (Merrell and Gitter, 2020) and artificial intelligence (Zhi-Xuan et al., 2020).

Neklyudov et al. (2020) independently identified the involutive MCMC construction as a unifying framework for MCMC algorithms, and showed how more than a dozen classic and recent MCMC algorithms can be cast within this framework. Neklyudov et al. (2020) also identified design principles for developing new MCMC algorithms using the involutive MCMC construction, and showed that the framework aids in the derivation of novel efficient MCMC algorithms.

The involutive MCMC construction encompasses many existing classes of MCMC kernels, some of which explicitly make use of bijective or involutive deterministic maps. In particular, the reversible jump framework (Green, 1995; Hastie and Green, 2012) employs a family of continuously differentiable bijections between the parameter spaces of different models.
Tierney (1998) described a family of deterministic proposals based on a deterministic involution that is equivalent to involutive MCMC but without the auxiliary probability distribution. More recently, Spanbauer et al. (2020) defined a class of deep generative models based on differentiable involutions and trained these models to serve as efficient proposal distributions on continuous state spaces; the resulting algorithm is an instance of the construction presented in this paper.

In recent decades, a number of probabilistic programming systems have automated the implementations of probabilistic inference algorithms (Gilks et al., 1994; Milch et al., 2005; Pfeffer, 2007; Goodman et al., 2008; Gehr et al., 2016; Carpenter et al., 2017). Most of these systems support generic built-in inference algorithms, with user customization limited to tweaking algorithm parameters. Some systems allow for user-defined variational families or proposal distributions (Ritchie et al., 2016; Bingham et al., 2019). Programmable inference (Mansinghka et al., 2018) proposes that inference algorithms be programmed by users using new high-level inference abstractions. The Gen probabilistic programming system (Cusumano-Towner et al., 2019) exposes an API that supports high-level user implementations of an open-ended set of inference algorithms, and abstracts away low-level implementation details of inference algorithms.

One other probabilistic programming system besides Gen supports custom reversible jump samplers: Roberts et al. (2019) present a system embedded in Haskell that automatically generates the implementation of some reversible jump MCMC kernels from a high-level specification. Narayanan and Shan (2020) give a technique that automatically computes Metropolis-Hastings acceptance probabilities in some settings; however, these approaches do not handle many kernels that can be handled by our technique, including the example in Figure 2.
Figure 1: Example of reversible jump MCMC (Green, 1995) implemented using involutive MCMC in Gen. The example implements a 'split-merge move' in an infinite Gaussian mixture model (Richardson and Green, 1997) using three Gen programs: (1) a probabilistic program p encoding the generative model (shown in b), (2) a probabilistic program q encoding an auxiliary probability distribution (shown in c), and (3) a differentiable program f that encodes an involution on the space of pairs of traces of p and q (shown in d). Gen's involutive MCMC operator (shown in e) automatically computes the acceptance probability.

(a) (schematic) A pair of split-merge transitions in a Gaussian mixture model. The split transition splits the orange component into purple and blue components.

(b) An infinite Gaussian mixture model p, expressed in a Gen probabilistic modeling language:

    @gen function p(n::Int)
        k ~ poisson_plus_one(1)
        means = [({(:mu, j)} ~ normal(0, 10)) for j in 1:k]
        vars = [({(:var, j)} ~ inv_gamma(1, 10)) for j in 1:k]
        weights ~ dirichlet([2.0 for j in 1:k])
        for i in 1:n
            {(:x, i)} ~ mixture_of_normals(weights, means, vars)
        end
    end

(c) Auxiliary proposal distribution q for the split-merge reversible jump MCMC move, expressed in a Gen probabilistic modeling language:

    @gen function q(trace)
        k = trace[:k]
        ...
    end

(d) Involution f for the split-merge reversible jump MCMC move, expressed in a Gen differentiable programming language:

    @transform f (model_in, aux_in) to (model_out, aux_out) begin
        k = @read(model_in[:k], :discrete)
        split = (k == 1) ? true : @read(aux_in[:split], :discrete)
        if split
            cluster_to_split = @read(aux_in[:cluster_to_split], :discrete)
            @write(model_out[:k], k+1, :discrete)
            @copy(aux_in[:cluster_to_split], aux_out[:cluster_to_merge])
            @write(aux_out[:split], false, :discrete)
        else
            cluster_to_merge = @read(aux_in[:cluster_to_merge], :discrete)
            @write(model_out[:k], k-1, :discrete)
            @copy(aux_in[:cluster_to_merge], aux_out[:cluster_to_split])
            if (k > 2)
                @write(aux_out[:split], true, :discrete)
            end
        end
        if split
            u1 = @read(aux_in[:u1], :continuous)
            u2 = @read(aux_in[:u2], :continuous)
            u3 = @read(aux_in[:u3], :continuous)
            weights = @read(model_in[:weights], :continuous)
            mu = @read(model_in[(:mu, cluster_to_split)], :continuous)
            var = @read(model_in[(:var, cluster_to_split)], :continuous)
            new_weights = split_weights(weights, cluster_to_split, u1, k)
            w1 = new_weights[cluster_to_split]; w2 = new_weights[k+1]
            (mu1, mu2, var1, var2) = split_params(mu, var, u2, u3, w1, w2)
            @write(model_out[:weights], new_weights, :continuous)
            @write(model_out[(:mu, cluster_to_split)], mu1, :continuous)
            @write(model_out[(:mu, k+1)], mu2, :continuous)
            @write(model_out[(:var, cluster_to_split)], var1, :continuous)
            @write(model_out[(:var, k+1)], var2, :continuous)
        else
            mu1 = @read(model_in[(:mu, cluster_to_merge)], :continuous)
            mu2 = @read(model_in[(:mu, k)], :continuous)
            var1 = @read(model_in[(:var, cluster_to_merge)], :continuous)
            var2 = @read(model_in[(:var, k)], :continuous)
            weights = @read(model_in[:weights], :continuous)
            w1 = weights[cluster_to_merge]; w2 = weights[k]
            (new_weights, u1) = merge_weights(weights, cluster_to_merge, k)
            w = new_weights[cluster_to_merge]
            (mu, var, u2, u3) = merge_params(mu1, mu2, var1, var2, w1, w2, w)
            @write(model_out[:weights], new_weights, :continuous)
            @write(model_out[(:mu, cluster_to_merge)], mu, :continuous)
            @write(model_out[(:var, cluster_to_merge)], var, :continuous)
            @write(aux_out[:u1], u1, :continuous)
            @write(aux_out[:u2], u2, :continuous)
            @write(aux_out[:u3], u3, :continuous)
        end
    end

(e) Julia code that applies the split-merge reversible jump MCMC move defined by q and f 100 times to a trace of the model p, using Gen's involutive MCMC operator:

    (observations, n) = get_observations()
    trace, = Gen.generate(p, (n,), observations)
    for iter in 1:100
        trace = Gen.involutive_mcmc(trace, q, (), f)
    end

(f) (schematic) The involution f being applied twice in succession to a model trace and auxiliary proposal trace. The first application merges cluster 2 with the last cluster (cluster 3); the second application splits the resulting cluster, and returns the original pair of traces.

Figure 2: A mixture kernel implemented using involutive MCMC in Gen, applied to infer the covariance function of a Gaussian process. The prior on covariance functions is based on a probabilistic context-free grammar. Each component kernel in the mixture replaces a subtree of the covariance function parse tree with a new subtree. The mixture kernel chooses a random subtree to replace via a random walk on the parse tree. The mixture kernel is composed from three Gen programs: (1) a probabilistic program p encoding the generative model (shown in a), (2) a probabilistic program q encoding an auxiliary probability distribution (shown in d), and (3) a differentiable program f that encodes an involution (shown in e).

(a) A Gaussian process generative model p that uses a PCFG-based prior on a combinatorial space of covariance functions, expressed in a Gen probabilistic modeling language:

    @gen function p(x_values::Vector)
        cov_function ~ cov_function_prior()
        cov_matrix = compute_cov_matrix(cov_function, x_values)
        n = length(x_values)
        y_values ~ mvnormal(zeros(n), cov_matrix .+ 0.01*I(n))
    end

    @gen function cov_function_prior()
        node_type ~ categorical(production_rule_probs)
        if node_type == CONSTANT
            param ~ uniform(0, 1)
            return ConstantNode(param)
        elseif node_type == LINEAR
            param ~ uniform(0, 1)
            return LinearNode(param)
        elseif node_type == SQUARED_EXP
            length_scale ~ uniform(0, 1)
            return SquaredExponentialNode(length_scale)
        elseif node_type == PERIODIC
            ...
        elseif node_type == PLUS
            left_node ~ cov_function_prior()
            right_node ~ cov_function_prior()
            return PlusNode(left_node, right_node)
        elseif node_type == TIMES
            left_node ~ cov_function_prior()
            right_node ~ cov_function_prior()
            return TimesNode(left_node, right_node)
        end
    end

(b) (schematic) An execution trace of the generative model p from (a). Each random choice has a unique address in a hierarchical address space that is based on the tree of function calls.

(c) (plot) The covariance function represented by the trace in (b), and several samples from the resulting Gaussian process.

(d) Auxiliary proposal distribution q, which proposes a new subtree in the covariance function, expressed in a Gen probabilistic modeling language:

    @gen function q(trace)
        prev_cov_function = trace[:cov_function]
        path ~ walk_tree(prev_cov_function, (:cov_function,))
        new_subtree ~ cov_function_prior()
        return path
    end

    @gen function walk_tree(node::Node, path)
        if isa(node, LeafNode)
            done ~ bernoulli(1)
            return path
        elseif ({:done} ~ bernoulli(0.5))
            return path
        elseif ({:recurse_left} ~ bernoulli(0.5))
            path = (path..., :left_node)
            return ({:left} ~ walk_tree(node.left, path))
        else
            path = (path..., :right_node)
            return ({:right} ~ walk_tree(node.right, path))
        end
    end

A variant of walk_tree in which the stopping and recursion probabilities depend on the sizes of the subtrees:

    @gen function walk_tree(node::Node, path)
        if isa(node, LeafNode)
            done ~ bernoulli(1)
            return path
        end
        n1 = size(node.left)
        n2 = size(node.right)
        if ({:done} ~ bernoulli(1 / (1 + n1 + n2)))
            return path
        elseif ({:recurse_left} ~ bernoulli(n1 / (n1 + n2)))
            path = (path..., :left_node)
            return ({:left} ~ walk_tree(node.left, path))
        else
            path = (path..., :right_node)
            return ({:right} ~ walk_tree(node.right, path))
        end
    end

(e) An involution f, which swaps the previous and newly proposed subtrees:

    @transform f (model_in, aux_in) to (model_out, aux_out) begin
        path = @read(aux_in[:path], :discrete)
        subtree_address = foldr(=>, path)
        @copy(model_in[subtree_address], aux_out[:new_subtree])
        @copy(aux_in[:new_subtree], model_out[subtree_address])
        @copy(aux_in[:path], aux_out[:path])
    end

(f) (schematic) The involution f applied twice. The first application replaces a subtree with one node with a subtree with 5 nodes. The second application reverts the change.

Involutive MCMC is a general framework for constructing MCMC kernels that are stationary for a target probability distribution p. Informally, the algorithm works as follows: starting at some state x, we first sample an auxiliary variable y ∼ q_x from a state-dependent auxiliary distribution.
We then apply an involution f to the pair (x, y) to obtain (x′, y′). Finally, we compute an acceptance probability α and either accept x′ as the new state, or reject it and repeat the previous state x. Different choices of q and f recover many algorithms from the literature (Neklyudov et al., 2020).

In this section, we present involutive MCMC for models p and auxiliary kernels q defined over general state spaces. By emphasizing general state spaces, we intend to clarify a potential point of confusion regarding the involutive MCMC algorithm: as presented by Neklyudov et al. (2020), the acceptance probability α depends on the Jacobian of the involution f, but it is not immediately clear how to define this Jacobian when f may operate on samples from arbitrary measurable spaces, rather than on vectors in R^n. Our reformulation of the algorithm below is general enough to handle arbitrary model and auxiliary distributions, and precise enough to enable automation via probabilistic and differentiable programming: the rest of this paper uses it to develop a technique for deriving efficient implementations of involutive MCMC algorithms automatically, given only declarative specifications of p, q, and f.

Let (X, Σ_P, µ_P) and (Y, Σ_Q, µ_Q) denote two general measure spaces with σ-finite µ_P and µ_Q. Involutive MCMC (Algorithm 1) implements a transition kernel that is invariant for a model distribution given by p : X → [0, ∞), a probability density over X with respect to µ_P.
Each iteration, the algorithm first samples auxiliary variables y ∈ Y from an auxiliary distribution q_x based on the model's current state x: for each x ∈ X such that p(x) > 0, q_x : Y → [0, ∞) is a probability density with respect to µ_Q.

The resulting pair (x, y) of the current model state and the newly sampled auxiliary state will be an element of the joint space Z := {(x, y) ∈ X × Y | p(x) q_x(y) > 0}. We can equip Z with the σ-algebra Σ := {A ∩ Z | A ∈ Σ_P ⊗ Σ_Q} (assuming Z is a µ_P × µ_Q-measurable set), and a reference measure µ(A) := (µ_P × µ_Q)(A).

Let f : Z → Z denote an involution (f⁻¹ = f) such that the pushforward of µ under f, denoted µ ∘ f⁻¹, is absolutely continuous with respect to µ, with Radon-Nikodym derivative d(µ ∘ f⁻¹)/dµ : Z → [0, ∞). Involutive MCMC runs f on (x, y) to obtain (x′, y′), then computes an acceptance probability α. With probability α, the new state x′ is returned; otherwise, the previous state x is repeated.

Algorithm 1
Involutive MCMC

    procedure involutive-mcmc(p, q, f, x)
        y ∼ q_x(·)                                  ▷ sample auxiliary state
        (x′, y′) ← f(x, y)                          ▷ apply involution
        α ← [p(x′) q_{x′}(y′) / (p(x) q_x(y))] · (d(µ ∘ f⁻¹)/dµ)(x, y)
        r ∼ Uniform(0, 1)
        if r ≤ α then return x′ else return x
    end procedure

Theorem 3.1 (Involutive MCMC is stationary). Involutive MCMC defines a probability kernel k on X that is stationary with respect to the model probability distribution. That is, ∫_X k_x(B) p(x) µ_P(dx) = ∫_B p(x) µ_P(dx) for all B ∈ Σ_P.

Proof. The proof is presented in stages in the appendix (see Section A.2, Section A.3, and Section A.4).
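As a concrete instance of Algorithm 1, the following minimal Python sketch (not the paper's Gen implementation; all helper names are ours) implements one involutive MCMC step for densities on R. With q_x = Normal(x, 1) and the swap involution f(x, y) = (y, x), whose Jacobian determinant has absolute value 1, the kernel reduces to random-walk Metropolis-Hastings.

```python
import math
import random

def involutive_mcmc_step(log_p, log_q, sample_q, f, log_abs_det_jac, x):
    """One step of Algorithm 1 for a scalar state (illustrative sketch)."""
    y = sample_q(x)                      # y ~ q_x(.)
    x_new, y_new = f(x, y)               # apply the involution
    log_alpha = (log_p(x_new) + log_q(x_new, y_new)
                 - log_p(x) - log_q(x, y)
                 + log_abs_det_jac(x, y))
    if math.log(random.random()) <= log_alpha:   # accept with prob min(1, alpha)
        return x_new
    return x

# Random-walk Metropolis-Hastings as involutive MCMC.
log_p = lambda x: -0.5 * x * x                   # unnormalized N(0, 1) target
log_q = lambda x, y: -0.5 * (y - x) ** 2         # unnormalized Normal(x, 1) density
sample_q = lambda x: random.gauss(x, 1.0)
f = lambda x, y: (y, x)                          # swap involution: f(f(x, y)) == (x, y)
jac = lambda x, y: 0.0                           # |det J| = 1 for the swap

random.seed(0)
x, xs = 3.0, []
for _ in range(20000):
    x = involutive_mcmc_step(log_p, log_q, sample_q, f, jac, x)
    xs.append(x)
mean = sum(xs) / len(xs)   # close to 0 for the N(0, 1) target
```

Because q_x is symmetric, the auxiliary terms cancel and the acceptance ratio collapses to p(x′)/p(x), the usual random-walk Metropolis-Hastings rule.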
While maximally general, the measure-theoretic formulation of involutive MCMC in Algorithm 1 is not amenable to an automated implementation, because it does not indicate how to compute the Radon-Nikodym derivative that is required for the acceptance probability, and it is unclear how to specify the probability measures involved.

While restricting the state space to vectors of real numbers would address these issues, we seek a representation that remains flexible enough to represent complex hybrid state spaces with numeric, symbolic, and structural uncertainty. Therefore, we use state spaces consisting of finite dictionaries that map (possibly random) keys to (possibly random) values. Dictionaries include vectors as a special case (a vector x ∈ R^n can be represented as a dictionary mapping the keys 1, ..., n to the values x_1, ..., x_n), but are more flexible: different keys can hold values of different types (e.g. integers, strings), and we can also consider distributions in which the set of keys is itself random, which is useful for model selection problems and structure uncertainty more generally.

This section describes probability distributions on dictionaries, and gives a constructive definition of the involutive MCMC acceptance probability in this setting in terms of a Jacobian. Section 4 will then show how probability distributions on dictionaries can be specified with probabilistic programs, and how probabilistic programming techniques can automatically compute probability densities on spaces of dictionaries.

The space of finite dictionaries.
We fix a countably infinite set 𝒦 of possible keys, such that each key k is either called discrete (k ∈ I) or continuous (k ∈ J), where 𝒦 = I ∪ J. Let V_k denote the set of possible values for key k, where V_k is a countable set for each discrete key, and where V_k = R^{d_k} for each continuous key, for some d_k. Given a set of keys K, let V_K = ×_{k ∈ K} V_k denote the set of assignments of values to each key. Then the set of all finite dictionaries is D := ∪_{K ⊂ 𝒦, |K| < ∞} {(K, x) | x ∈ V_K}. That is, a dictionary specifies a finite set of keys K ⊂ 𝒦 at which it has values, and an assignment x of values x_k for each.

Relationship to the representation of Green (1995).
Green (1995) uses a state space that is the countable union of 'models', where each model is typically a vector of real-valued parameters. Dictionaries have substantially more structure: instead of monolithic 'models', dictionaries use a more elaborate discrete state that includes the set of keys and the assignment to the discrete keys. Also, because continuous keys play the role of real-valued parameters, it is possible to express that a given real-valued parameter is shared between models. The additional structure of dictionaries enables the automation techniques in Section 4.
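As an illustration, the dictionary state space can be mocked up with plain Python dicts (an illustrative sketch, not Gen's trace data structure): keys may be structured values such as tuples, different keys may hold values of different types, and the key set itself can depend on a random choice.

```python
# A 'trace' is a plain dict whose keys mix discrete and continuous addresses.
# The key set is random: the number of ("mu", j) entries depends on the value at "k".
trace_small = {"k": 1, ("mu", 1): -0.4, ("var", 1): 2.1}
trace_large = {"k": 2, ("mu", 1): -0.4, ("var", 1): 2.1,
               ("mu", 2): 5.0, ("var", 2): 0.7}

def vec_to_dict(v):
    # A vector in R^n is the special case where the keys are 1..n.
    return {i + 1: x for i, x in enumerate(v)}

def dict_to_vec(d):
    return [d[i] for i in sorted(d)]

def merge_disjoint(m, n):
    # Merging two dictionaries with disjoint key sets.
    assert not (m.keys() & n.keys()), "key sets must be disjoint"
    return {**m, **n}
```

Two traces with different values at the discrete key "k" occupy different key sets, which is exactly the trans-dimensional structure that reversible jump moves must navigate.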
A measure space of finite dictionaries.
We associate a measure µ_k on V_k with each key k: the counting measure for each discrete key and the Lebesgue measure on R^{d_k} for each continuous key. For each finite set of keys K, we make V_K a measure space using the standard product σ-algebra Σ_K = ⊗_{k ∈ K} Σ_k and the product measure µ_K = ×_{k ∈ K} µ_k. We equip D with the σ-algebra Σ_D := {∪_{K ⊂ 𝒦, |K| < ∞} {(K, x) | x ∈ B_K} | B_K ∈ Σ_K for each finite K ⊂ 𝒦} to obtain a measurable space of dictionaries. A reference measure µ_D on this space can be constructed using the product measures µ_K: we set µ_D(B) := Σ_{K ⊂ 𝒦, |K| < ∞} µ_K({x | (K, x) ∈ B}).

Notation for dictionaries.
Given a dictionary m = (K, x), we write K_m for K and m[k] for the value x_k associated with a key k ∈ K. We also denote specific dictionaries using the notation {k_1 ↦ v_1, k_2 ↦ v_2, ...}. For example, the dictionary (K, x) with K = {1, "foo"}, x_1 = 0.123, and x_"foo" = 5 is denoted {1 ↦ 0.123, "foo" ↦ 5}.

Probability distributions on finite dictionaries.
When V_k is discrete for all keys k, a probability distribution on dictionaries is defined by a probability mass function p : D → [0, 1] that assigns a probability p(m) to each dictionary m ∈ D such that Σ_{m ∈ D} p(m) = 1. (It is possible to assign a general measure space to each key, but this is not necessary for our purposes.) More generally, a probability distribution on dictionaries is defined by a probability density p : D → [0, ∞) such that ∫ p(m) µ_D(dm) = 1. The probability mass is distributed among the finite sets of keys K ⊆ 𝒦:

    1 = Σ_{K ⊆ 𝒦, |K| < ∞} Σ_{x_I} ∫_{R^{d_K}} p((K, (x_I, x_J))) dx_J        (1)

where d_K := Σ_{k ∈ K ∩ J} d_k is the total continuous dimension for keys K, and where x_I is an assignment to the discrete choices in K and x_J is an assignment to the continuous choices.

We now give an example to build intuition. Consider a generative model of univariate data points y_1, ..., y_n from a Gaussian mixture with an unknown number of components k, each with unknown mean m_i and variance s_i. If we place a Gaussian prior on m_i, an inverse gamma prior on s_i, and a Poisson prior on k, the resulting density on dictionaries d is:

    p(d) = p_poisson(3)(d[k]) · Π_{i=1}^{d[k]} p_normal(0,10)(d[m_i]) · Π_{i=1}^{d[k]} p_invgamma(1,10)(d[s_i]) · Π_{i=1}^{n} (1/d[k]) Σ_{j=1}^{d[k]} p_normal(d[m_j], d[s_j])(d[y_i])

when K_d = {k, y_1, ..., y_n} ∪ {m_i | 1 ≤ i ≤ d[k]} ∪ {s_i | 1 ≤ i ≤ d[k]}, and 0 otherwise. In this case, we have V_k = N with the counting measure for µ_k, and for all other keys k, V_k = R with the Lebesgue measure for µ_k. For each j ∈ {1, 2, ...}, the probability mass assigned to the key set {k, y_1, ..., y_n} ∪ {m_i | 1 ≤ i ≤ j} ∪ {s_i | 1 ≤ i ≤ j} is p_poisson(3)(j) (via Equation (1)).

Conditional distributions via disintegration.
Consider a probability density p on the space D of dictionaries. We say a key k ∈ 𝒦 almost always appears if ∫ [k ∈ K_m] p(m) µ_D(dm) = 1. Suppose B is a set of keys that almost always appear for p, and that b = (B, b) is a dictionary with keys B. Furthermore, let (K_m, m) ⊕ (K_n, n) := (K_m ∪ K_n, (m, n)) be the merge of two dictionaries m and n defined on disjoint key sets K_m and K_n. Then we can define the conditional density

    p(d | b) := [K_d ∩ B = ∅] p(d ⊕ b) / ∫_{{m | K_m ∩ B = ∅}} p(m ⊕ b) µ_D(dm)

when the denominator is finite. If each key k ∈ B is discrete (i.e. µ_k is the counting measure), then this definition corresponds to the ordinary notion of conditioning on an event (namely, the event that a sample from p agrees with the dictionary b on all keys in B). When this is not the case, it corresponds to a more general measure-theoretic notion called disintegration (Chang and Pollard, 1997).

Consider the infinite univariate mixture model, and the conditional density given observed data (B, b) := {y_1 ↦ ȳ_1, ..., y_n ↦ ȳ_n}. The conditional density p(d | b) is nonzero only if K_d = {k} ∪ {m_i | 1 ≤ i ≤ k} ∪ {s_i | 1 ≤ i ≤ k} for some k (d does not contain y-values), and the denominator in the definition of p(d | b) simplifies to the familiar sum of marginal likelihoods over all k, where each marginal likelihood is a Riemann integral over R^{2k}.

Suppose that the model distribution and auxiliary distributions are probability distributions on dictionaries, with densities p and q_x. Then X and Y are both sets of dictionaries, and the joint space Z is a set of pairs (x, y) of dictionaries with keys from 𝒦_P and 𝒦_Q respectively, so that Z = X × Y ⊆ D_P × D_Q, where D_P is the set of dictionaries on keys taken from 𝒦_P and similarly for D_Q.
To simplify the notation, and without loss of generality, we will assume that 𝒦_P and 𝒦_Q are disjoint (if they are not, they can be made so by adding a different prefix to the keys of each set), and we define Z := {x ⊕ y : p(x) q_x(y) > 0} ⊆ D, where D is the set of dictionaries on keys from 𝒦_P ∪ 𝒦_Q (recall that x ⊕ y denotes the dictionary resulting from merging dictionaries x and y with disjoint keys).

Suppose there is a countable partition of Z into {Z_e : e ∈ E} such that if (K_1, x_1) and (K_2, x_2) are two dictionaries in the same component Z_e, then K_1 = K_2 and they agree on all discrete values: x_{1,k} = x_{2,k} for all k ∈ K_1 ∩ I. Then each set Z_e is isomorphic to a Euclidean space of assignments to the continuous keys in the two dictionaries. Suppose there is an involution g : E → E between elements of the partition, and a family of continuously differentiable bijections h_e : Z_e → Z_{g(e)} indexed by e ∈ E, with h_e = h_{g(e)}⁻¹. Let e(z) ∈ E denote which element of the partition a dictionary z ∈ Z belongs to. Then f : Z → Z given by f(z) := h_{e(z)}(z) is an involution:

    f(f(z)) = h_{e(f(z))}(h_{e(z)}(z)) = h_{g(e(z))}(h_{e(z)}(z)) = z.

Let |Jh_e|(z) denote the absolute value of the determinant of the Jacobian of h_e, evaluated at z. Then the acceptance probability in Algorithm 1 simplifies to:

    [p(x′) q_{x′}(y′) / (p(x) q_x(y))] · |Jh_e|(z)        (2)

One example of a valid partition of Z is given by the equivalence classes of the following equivalence relation: z_1 ∼ z_2 ⇔ (K_{z_1} = K_{z_2}) ∧ (z_1[k] = z_2[k] for all k ∈ K_{z_1} ∩ I) (dictionaries are equivalent if they contain the same keys and they agree on the values of all discrete keys). See Section A.1 of the appendix for details.

Involutive MCMC is a general framework that can be used to develop diverse MCMC algorithms for models over arbitrary state spaces.
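To build intuition for the bijections h_e and the Jacobian factor |Jh_e|(z) in Equation (2), consider a toy one-dimensional split move (a deliberately simplified analogue of the split_weights/merge_weights helpers in Figure 1; the function names here are ours). One weight w and a degree of freedom u map to two weights (wu, w(1-u)); the inverse recovers (w, u); and the Jacobian determinant of the split is -w, so |det J| = w enters the acceptance ratio. A finite-difference check confirms this.

```python
import math

def toy_split(w, u):
    # h_e for a toy split move: one weight and one degree of freedom become two weights
    return (w * u, w * (1.0 - u))

def toy_merge(w1, w2):
    # the inverse bijection: two weights become one weight plus the recovered u
    w = w1 + w2
    return (w, w1 / w)

def numerical_log_abs_det_jac(h, args, eps=1e-6):
    # finite-difference estimate of log |det J_h| at args (2x2 Jacobian here)
    n = len(args)
    base = h(*args)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pert = list(args)
        pert[j] += eps
        out = h(*pert)
        for i in range(n):
            J[i][j] = (out[i] - base[i]) / eps
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    return math.log(abs(det))

w, u = 0.6, 0.25
w1, w2 = toy_split(w, u)
```

Round-tripping through toy_merge recovers (w, u), and the numerical log |det J| of the split matches log w, the factor that would multiply the density ratio in the acceptance probability.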
We wish to automate the implementation details for involutive MCMC algorithms, given only a specification of the model p, the auxiliary distribution q, and the involution f. To do so, we require a representation for the distributions p and q that is flexible enough to represent the full variety of models and auxiliary distributions of interest to practitioners. The representation must support density evaluation and sampling. It is also desirable that the representation be structured: the more information available to us (e.g., about the decomposition of a distribution's state space into individual univariate and multivariate random variables, or about conditional independence relationships in a model), the easier it will be for the implementation to exploit this structure automatically by using more efficient data structures and low-level manipulations.

Probabilistic programs are flexible and structured representations for probability distributions. Unlike densities (but like Bayesian networks), probabilistic programs can be efficiently sampled and contain explicitly represented information about some conditional independence relationships in a model. But unlike Bayesian networks, they do not assume a fixed number of random variables, state space dimension, or dependency structure.

At the most basic level, a probabilistic program is a program that makes random choices. Any such program induces a probability distribution over its possible execution traces, records of each random choice it makes. If each random choice is associated with a unique address from the set of dictionary keys 𝒦, then these traces can be viewed as finite dictionaries, mapping the address of each random choice to its value. The distribution induced by a probabilistic program over its execution traces can thus be understood as a measure on the space (D, Σ_D, µ_D) introduced in the previous section.
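The trace-as-dictionary view can be sketched in a few lines of Python (a hypothetical handler interface of our own, not Gen's): running a program under a handler that records each named random choice yields exactly a finite dictionary, and the key set itself varies with the discrete choices made.

```python
import random

class TraceHandler:
    """Records each named random choice into a dictionary (an execution trace)."""
    def __init__(self, rng):
        self.rng = rng
        self.trace = {}

    def sample(self, address, dist, *args):
        value = dist(self.rng, *args)
        assert address not in self.trace, "addresses must be unique"
        self.trace[address] = value
        return value

# primitive distributions as (rng, params) -> value
def bernoulli(rng, p):
    return rng.random() < p

def gauss(rng, mu, sigma):
    return rng.gauss(mu, sigma)

def model(handler):
    # the number of components is itself a random choice, so the key set is random
    k = 2 if handler.sample("extra", bernoulli, 0.5) else 1
    for j in range(1, k + 1):
        handler.sample(("mu", j), gauss, 0.0, 10.0)
    return k

rng = random.Random(42)
h = TraceHandler(rng)
k = model(h)
# h.trace now maps each address to its sampled value,
# e.g. {"extra": False, ("mu", 1): ...}
```

The same program, re-executed under a different handler that reads values out of an existing dictionary instead of sampling, yields the density computations used below.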
Furthermore, densities of trace distributions with respect to $\mu_D$ are typically easy to compute.

In this section, we introduce a probabilistic programming language from the Gen probabilistic programming system (Cusumano-Towner et al., 2019), and present a technique for automating the implementation of involutive MCMC algorithms when the model $p$ and auxiliary distribution $q_x$ are both represented

Algorithm 2
Automated Involutive MCMC
Require: probabilistic programs P and Q; differentiable program F; initial state x; observations b

procedure auto-involutive-mcmc(P, Q, F, x, b)
    ▷ Sample y ∼ q_x(·) and compute its log-density via probabilistic program Q
    (y, log q_x(y)) ← trace-and-score(Q_x)
    ▷ Compute (x′ ⊕ y′) = f(x ⊕ y) and the log Radon–Nikodym derivative via differentiable program F
    (x′ ⊕ y′, log D) ← run-involution(F, x ⊕ y, b)
    ▷ Compute log-densities via probabilistic programs P and Q
    log p̃(x ⊕ b) ← score(P, x ⊕ b)
    log p̃(x′ ⊕ b) ← score(P, x′ ⊕ b)
    log q_{x′}(y′) ← score(Q_{x′}, y′)
    ▷ Compute the acceptance probability and accept or reject
    α ← min{1, exp(log p̃(x′ ⊕ b) − log p̃(x ⊕ b) + log q_{x′}(y′) − log q_x(y) + log D)}
    with probability α return x′ else return x end
end procedure

procedure trace-and-score(P)
    s ← 0; x ← {}
    Execute P, but with "k ∼ distribution" ≡ (
        1. sample v ∼ distribution
        2. set x[k] ← v
        3. set s ← s + logpdf(distribution, v)
        4. evaluate to v)
    return (x, s)
end procedure

procedure score(P, x)
    s ← 0; K ← ∅
    Execute P, but with "k ∼ distribution" ≡ (
        1. set v ← x[k]
        2. set K ← K ∪ {k}
        3. set s ← s + logpdf(distribution, v)
        4. evaluate to v)
    if K ≠ K_x then s ← −∞ end
    return s
end procedure

procedure run-involution(F, z, b)
    ▷ Run the involution, and keep track of copied, read, and written continuous keys
    z′ ← {}; R ← ∅; W ← ∅; C ← ∅   ▷ Initialize empty output dictionary and key sets
    Execute F, but with
        "@write(k, v)" ≡ (if k ∈ I then set z′[k] ← v else set z′[k] ← v and set W ← W ∪ {k})
        "@read(k)" ≡ (if k ∈ I then evaluate to z[k] else set R ← R ∪ {k} and evaluate to z[k])
        "@copy(k₁, k₂)" ≡ (set z′[k₂] ← z[k₁] and set C ← C ∪ {k₁})
        (and reading from the observations b for any k ∈ K_b)
    ▷ Use automatic differentiation of F to compute the Jacobian, skipping copied addresses
    for i in 1 to |W| do
        k_i ← W_i   ▷ Pick the i-th key in W; the order does not matter
        J[:, i] ← ∇_{z_{R∖C}} (h_{e(z)}(z)[k_i])   ▷ Gradient of z′[k_i] = h_{e(z)}(z)[k_i] w.r.t. non-copied continuous inputs (z[k] for k ∈ R ∖ C)
    end for
    return (z′, log(|det(J)|))
end procedure

as probabilistic programs in this language. But the technique is not limited to the Gen system—we also provide the implementation of a minimal probabilistic programming language in PyTorch that supports the automation technique.

Our probabilistic programming language augments the syntax of Julia (Bezanson et al., 2017) with a single new construct, the '{address} ∼ distribution' expression, for making a named random choice. An execution trace of a Gen probabilistic program is a dictionary that can be sampled by running the program according to Julia's usual semantics, and upon encountering an expression of the form '{address} ∼ distribution', (i) evaluating the address expression ('address') to obtain an address $k \in K$; (ii) evaluating the distribution expression ('distribution') to obtain a probability distribution over the measurable space $V_k$; (iii) sampling a value $x_k$ from this distribution and adding the mapping $\{k \mapsto x_k\}$ to the execution trace; and (iv) returning the sampled value $x_k$ to the program, to continue execution. When execution terminates, the execution trace has accumulated a mapping for each random choice encountered during the program's execution.

For example, consider the probabilistic program below, which defines a Gaussian mixture model with an unknown number of components:

@gen function p(n::Int)
    k ∼ poisson_plus_one(1)
    means = [({(:mu, j)} ∼ normal(0, 10)) for j in 1:k]
    vars = [({(:var, j)} ∼ inv_gamma(1, 10)) for j in 1:k]
    weights ∼ dirichlet([2.0 for j in 1:k])
    for i in 1:n
        {(:x, i)} ∼ mixture_of_normals(weights, means, vars)
    end
end

The program p accepts as input an integer n, a number of data points. Each n defines a distinct distribution over dictionaries. The first line of the program samples a number of mixture components from a Poisson prior using the address :k. (This line could also be written k = {:k} ∼ poisson_plus_one(1): the address is the symbol :k, and the result of the choice is assigned to a Julia variable called k. Because this is a common pattern, Gen provides the syntactic sugar x ∼ d as shorthand for x = {:x} ∼ d.) The program then samples k means and k variances from Gaussian and inverse gamma priors, respectively. Each of these 2k random choices has its own address, determined by the address expression preceding it. For example, the mean for the fourth mixture component (if k ≥ 4) has address (:mu, 4).
The mixture weights are then sampled at the address :weights from a Dirichlet distribution, and n data points are sampled at addresses (:x, 1), ..., (:x, n). The random choice at address :k is discrete, with $V_{:k} = \mathbb{Z}_{\geq 1}$, and the other random choices are continuous. Overall, the program defines a distribution over traces with the following density over traces $(K, x)$ with respect to $\mu_D$:

$$p(n)((K, x)) = p_{\mathrm{poisson}(1)}(x_{:k} - 1) \cdot \prod_{j=1}^{x_{:k}} p_{\mathrm{normal}(0,10)}(x_{(:mu,j)}) \cdot \prod_{j=1}^{x_{:k}} p_{\mathrm{invgamma}(1,10)}(x_{(:var,j)}) \cdot p_{\mathrm{dirichlet}(2.0,\ldots,2.0)}(x_{:weights}) \cdot \prod_{i=1}^{n} \sum_{j=1}^{x_{:k}} x_{:weights}[j] \cdot p_{\mathrm{normal}(x_{(:mu,j)},\, x_{(:var,j)})}(x_{(:x,i)})$$

when $K$ contains exactly the addresses :k, :weights, (:x, i) for $i = 1, \ldots, n$, and (:mu, j) and (:var, j) for $j = 1, \ldots, x_{:k}$; otherwise, the density is 0.

Automatic Sampling and Density Computation for Probabilistic Programs.
Sampling traces from, and computing densities of dictionaries under, the distribution on traces induced by a probabilistic program is straightforward, using a standard technique in probabilistic programming. We illustrate this technique in the trace-and-score and score subroutines in Algorithm 2. The trace-and-score subroutine samples a trace by running a probabilistic program, recording the value of every encountered choice into a dictionary, which it returns once execution has terminated. It also returns the log-density of the trace, calculated by accumulating the densities of the individual choices it encounters. (Note that this procedure is only valid if the program halts with probability 1. Otherwise, it could loop infinitely, and even if it terminates, the density will be incorrect.)

The density of an arbitrary dictionary $(K, x)$ under any probabilistic program's distribution on traces can be computed using score. The idea is to run the probabilistic program, and whenever a random choice '{address} ∼ distribution' is encountered, to look up the value $x_k$ in the dictionary, compute its density under the primitive distribution, multiply this density into a running total, and return control to the probabilistic program as if the sampling instruction had executed and returned $x_k$. At the end, the running total can be returned as the trace's density. If at any point an address $k \notin K$ is encountered, or if not all addresses in $K$ have been visited at the end of execution, the algorithm returns 0 as the density.

We can use two probabilistic programs P and Q to specify the model density $p$ and the auxiliary densities $q_x$ that appear in involutive MCMC (Algorithm 1).
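This standard technique can be sketched as a small Python interpreter (our toy version, not Gen's implementation): the program receives a `choose` callback, and trace-and-score and score reinterpret each named random choice differently.

```python
import math
import random

# Sketch (ours) of trace-and-score and score: the same program is executed
# under two different interpretations of "k ~ distribution".

def normal_logpdf(v, mu, sigma):
    return -0.5 * ((v - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def bernoulli_logpmf(v, p):
    return math.log(p if v else 1.0 - p)

def logpdf(dist, v):
    return bernoulli_logpmf(v, dist[1]) if dist[0] == "bernoulli" \
        else normal_logpdf(v, dist[1], dist[2])

def program(choose):
    # toy probabilistic program whose address set depends on a choice
    z = choose("z", ("bernoulli", 0.3))
    mu = 0.0 if z else 5.0
    choose("x", ("normal", mu, 1.0))

def trace_and_score(p):
    trace, total = {}, [0.0]
    def choose(k, dist):
        v = (random.random() < dist[1]) if dist[0] == "bernoulli" \
            else random.gauss(dist[1], dist[2])
        trace[k] = v
        total[0] += logpdf(dist, v)
        return v
    p(choose)
    return trace, total[0]

def score(p, trace):
    visited, total = set(), [0.0]
    class OffSupport(Exception):
        pass
    def choose(k, dist):
        if k not in trace:
            raise OffSupport()     # address not in the dictionary: density 0
        visited.add(k)
        total[0] += logpdf(dist, trace[k])
        return trace[k]
    try:
        p(choose)
    except OffSupport:
        return float("-inf")
    # density is 0 unless exactly the trace's addresses were visited
    return total[0] if visited == set(trace) else float("-inf")

trace, s = trace_and_score(program)
assert abs(score(program, trace) - s) < 1e-12
assert score(program, {"z": True}) == float("-inf")   # address "x" missing
```

The two bookkeeping rules at the end of score implement the "density 0" cases in the text: an address that is visited but absent from the dictionary, and a dictionary address that is never visited.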
In this case, the program Q accepts a trace $x$ of P as input; we denote the probabilistic program Q applied to input $x$ by $Q_x$.²

Typically, we cannot use a probabilistic program P to represent the target distribution directly: probabilistic programs implement simulators for a distribution, but target densities are typically not tractable to simulate (hence the need for MCMC). Instead, we may wish to sample from a target $p$ that arises from conditioning a probabilistic program P's distribution over traces on observations of the values at some addresses. Let $\tilde{p}$ denote P's distribution over traces, and let $\mathbf{b} = (B, b)$ be a dictionary of observations, as described in Section 3.2. Then, as shown in that section, the target distribution is $p(x) = \tilde{p}(x \mid b) = \tilde{p}(x \oplus b)/L(b)$, where $L(b) = \int_{\{m \,:\, K_m \cap B = \emptyset\}} \tilde{p}(m \oplus b)\, \mu_D(\mathrm{d}m)$ is the marginal likelihood of $b$, which does not depend on $x$. Then $\tilde{p}(x \oplus b) = L(b)\, p(x)$. If we use $\tilde{p}(x \oplus b)$ in place of $p(x)$ to compute the ratio of densities in Algorithm 1, we will wind up with the same output, because $L(b)$ cancels in the numerator and denominator:

$$\frac{p(x')\, q_{x'}(y')}{p(x)\, q_x(y)} = \frac{\tilde{p}(x' \oplus b)\, q_{x'}(y')}{\tilde{p}(x \oplus b)\, q_x(y)} = \frac{\tilde{p}(x' \mid b)\, q_{x'}(y')}{\tilde{p}(x \mid b)\, q_x(y)} \qquad (3)$$

Thus, this ratio can be computed term-by-term using score on the probabilistic programs P and Q (Algorithm 2): to compute $q_{x'}(y')$, we run the algorithm directly on the program $Q_{x'}$, and to compute $p(x)$ and $p(x')$, we actually merge the dictionary $x$ with the observations $b$ and compute $L(b)\, p(x) = \tilde{p}(x \oplus b)$ instead, by running the algorithm on P with trace $x \oplus b$.

²The programs must satisfy a mild technical requirement for our formalism to go through; see Section A.5.
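The cancellation of $L(b)$ in Equation (3) can be checked numerically on a toy conjugate model (our example, not from the paper; the normal-normal model and its posterior are standard facts used only for illustration): scoring with the unnormalized joint $\tilde{p}(x \oplus b)$ gives the same acceptance ratio as the conditional density.

```python
import math

# Numeric sketch (ours) of the L(b) cancellation in Equation (3).

def normal_logpdf(v, mu, sigma):
    return -0.5 * ((v - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_joint(x, b):
    # log of the unnormalized joint p~(x ⊕ b) for: x ~ N(0, 1), b ~ N(x, 1)
    return normal_logpdf(x, 0.0, 1.0) + normal_logpdf(b, x, 1.0)

def log_posterior(x, b):
    # exact conditional for this conjugate model: x | b ~ N(b/2, 1/sqrt(2))
    return normal_logpdf(x, b / 2.0, 1.0 / math.sqrt(2.0))

b, x, x_new = 1.0, 0.3, 0.9
log_ratio_joint = log_joint(x_new, b) - log_joint(x, b)
log_ratio_cond = log_posterior(x_new, b) - log_posterior(x, b)
assert abs(log_ratio_joint - log_ratio_cond) < 1e-9   # log L(b) cancels
```

In general the conditional density is unavailable in closed form, which is why Algorithm 2 only ever scores the joint $\tilde{p}(x \oplus b)$.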
The density $q_x(y)$ can be computed while $y$ is being sampled, using trace-and-score. Section 5 describes a more efficient approach that exploits sparsity in the involution and cancellations in the acceptance ratio for improved efficiency, but requires a more sophisticated probabilistic programming runtime system. Figure 1b and Figure 1c show examples of probabilistic programs P and Q, respectively, for a split-merge reversible jump move.

Section 4.1 showed that if the densities $p$ and $q$ are specified using probabilistic programs P and Q, then the density ratio in the acceptance probability for involutive MCMC on dictionaries (Equation (2)) can be automated using probabilistic programming techniques. This section shows that if the involution $f$ is specified using a differentiable program F that transforms the execution traces of probabilistic programs, then the Jacobian factor in Equation (2) can also be automated, using automatic differentiation. The procedure auto-involutive-mcmc in Algorithm 2 combines these two ideas and automates involutive MCMC given the programs P, Q, and F.

Recall that for involutive MCMC on a state space of dictionaries, we define $Z := \{x \oplus y : p(x)\, q_x(y) > 0\} \subseteq D$, where $x \oplus y$ is the dictionary resulting from merging dictionaries $x$ and $y$ with disjoint keys. The involution is a function $f : Z \to Z$.

We now introduce a simple differentiable programming language for specifying involutions $f$. The language needs syntax for reading the value at an address in $x \oplus y \in Z$ and for writing to an address in $x' \oplus y' \in Z$. To read the value $(x \oplus y)[a]$ at address $a$ in the input model trace, we use the @read keyword, as in value = @read(model_in[:u], continuous), and to write a value to an address in the output model trace we use the @write keyword, as in @write(model_out[:v], value, continuous). To copy the value from address :u in $x$ to address :v in $x'$ directly, we use:

@copy(model_in[:u], model_out[:v])

Of course, it is also possible to copy from $x$ to $y'$, from $y$ to $x'$, and from $y$ to $y'$. As we will see in Section 5, it is preferable to use @copy when possible instead of reading and then writing, as this can make the acceptance probability calculation more efficient.

Constructing an Involution
Consider the following generative model, which posits that a vector of univariate data is either generated from a single normal distribution or a mixture of two normal distributions. The model is expressed as a probabilistic program:

@gen function p()
    k ∼ uniform_discrete(1, 2)
    means = [{(:mu, j)} ∼ normal(0, 10) for j in 1:k]
    weights = ones(k)/k
    vars = ones(k)
    for i in 1:100
        {(:x, i)} ∼ mixture_of_normals(weights, means, vars)
    end
end

This is a simplified version of the infinite Gaussian mixture model in Figure 1b. Each key of the form (:x, i) will be observed (in the dictionary $b$), so the latent part of the trace (the dictionary $x$) contains the other keys: :k and (:mu, 1) and (when k = 2) (:mu, 2). We construct an involutive MCMC move that switches k from 2 to 1 or vice versa. The discrete part of the involution is straightforward:

k = @read(model_in["k"], discrete)
if k == 1
    @write(model_out["k"], 2, discrete)
else
    @write(model_out["k"], 1, discrete)
end

Writing the involution code for the continuous choices is more complex. There is no one-to-one correspondence between the space of dictionaries with keys {:k, (:mu, 1)} and the space of dictionaries $x$ with keys {:k, (:mu, 1), (:mu, 2)}. Therefore, we need to extend the state space using the auxiliary distribution, defined with the following probabilistic program:

@gen function q(model_trace)
    if model_trace["k"] == 1
        u ∼ beta(2, 2)
    end
end

Our involution is on the space of combined dictionaries $x \oplus y$ where $y$ are traces of q. Note that for each $(x \oplus y)$ where $x[k] = 1$, $y$ has a key u, and for each $(x \oplus y)$ where $x[k] = 2$, $y$ is empty. We extend the involution using a pair of bijections between $(\mu_1, \mu_2)$ and $(\mu, u)$ that show how two cluster means should be transformed into one cluster mean (merge), and vice versa (split):

$$(\mu_1, \mu_2) \mapsto \left(\tfrac{\mu_1 + \mu_2}{2},\ \mu_2 - \tfrac{\mu_1 + \mu_2}{2}\right) \quad \text{[Merge]} \qquad (\mu, u) \mapsto (\mu - u,\ \mu + u) \quad \text{[Split]} \qquad (4)$$

The value of $x[k]$ determines which of these functions is executed. The full involution program F is then:

@transform f (model_in, aux_in) to (model_out, aux_out) begin
    k = @read(model_in["k"], discrete)
    if k == 1
        @write(model_out["k"], 2, discrete)
        mu = @read(model_in[(:mu, 1)], continuous)
        u = @read(aux_in[:u], continuous)
        @write(model_out[(:mu, 1)], mu - u, continuous)
        @write(model_out[(:mu, 2)], mu + u, continuous)
    else
        @write(model_out["k"], 1, discrete)
        mu1 = @read(model_in[(:mu, 1)], continuous)
        mu2 = @read(model_in[(:mu, 2)], continuous)
        mean = (mu1 + mu2) / 2
        @write(model_out[(:mu, 1)], mean, continuous)
        @write(aux_out[:u], mu2 - mean, continuous)
    end
end

The two continuous bijections (implemented in the blocks of code in the two branches) are inverses of one another. This is a common pattern in involution programs—an involution on the discrete parts of the traces (in this case, just $x[k]$) determines via control flow which continuous code blocks get executed, such that the end-to-end program defines an involution. Note that the observations $b$ (in this case, the x-coordinates) are included in traces of p, which take the form $x \oplus b$ where $x$ is the latent part and $b$ is the observed part. The involution program F is allowed to read from the observations in the model trace using the same syntax as used to read from the latent part of the trace $x$. (The example given here does not utilize this feature.)

Recall that the involution $f$ must decompose into (i) an involution $g$ on elements $e \in E$ of a partition of the state space, and (ii) a pair of continuously differentiable bijections $h_e$ and $h_{g(e)} = h_e^{-1}$ between each pair of corresponding elements of the partition. Each function $h_e$ maps the values at continuous addresses in the input trace ($z[k]$ for $k \in K_z \cap J$) to the values at continuous addresses in the output trace ($z'[k]$ for $k \in K_{z'} \cap J$). In the example above, the partition $E$ is given by the equivalence classes of the relation $(x_1, y_1) \sim (x_2, y_2) \iff x_1[k] = x_2[k]$ (there are two equivalence classes), $g$ maps the $k = 1$ class to the $k = 2$ class and vice versa, and the continuous bijections $h_e$ and $h_{g(e)}$ are the functions in Equation (4).

We compute the Jacobian using automatic differentiation.
(The Gen implementation uses forward-mode AD, whereas our PyTorch implementation uses reverse-mode AD.) The procedure run-involution in Algorithm 2 shows the implementation of the interpreter for the language that uses reverse-mode AD. The interpreter executes F using regular Julia or Python semantics, but intercepts calls to @write, @read, and @copy statements, and in addition to performing the desired operation, records the sets of continuous addresses that are read, written, and copied. After F has finished executing, AD uses the recorded addresses to compute the Jacobian $|Jh_e|(z)$; in the case of reverse-mode AD, this is accomplished by iterating over the output continuous addresses and backpropagating from each one to all of the input continuous addresses, computing the Jacobian column-by-column. Keys with an $n$-dimensional Lebesgue measure as their reference measure, corresponding to vector-valued random choices, are unpacked into $n$ separate columns in the Jacobian; this detail is elided in Algorithm 2. The discrete and continuous labels are also omitted from the syntax in run-involution to simplify notation ($a \in I$ indicates a discrete choice, and otherwise a choice is continuous).

Suppose $N$ and $N'$ are the total numbers of random choices in the input traces $x, y$ and output traces $x', y'$, respectively (note that the observations $b$ are excluded—$x$ and $x'$ are the latent parts of the model's traces only). Let $M$ be the number of continuous random choices among these (which must be the same in the input traces and output traces). Then the number of operations used in Algorithm 2 grows as $O(N + N') + O(M^3)$. The linear term is due to sampling $y'$ and computing the four log-densities required for the acceptance probability.
The cubic term is due to computing the Jacobian determinant, which is also required for the acceptance probability (Equation 2).

It is possible to reduce the number of operations performed in an automated involutive MCMC kernel by exploiting special structure in the involution $f$. In some cases, this structure can lead to $O(1)$ operations per kernel application (i.e., constant in $N$, $N'$, and $M$). This section describes techniques for exploiting involution structure within an automated involutive MCMC implementation. These techniques are used in the Gen implementation, and one of the techniques is used in our minimal PyTorch implementation.

The naive implementation of Algorithm 2 computes the Jacobian term by first computing the $M$-by-$M$ Jacobian matrix $Jh_e(z)$ via automatic differentiation, and then computing the absolute value of its determinant. However, we observe that in many applications of involutive MCMC, the values at continuous choices in the input traces are directly copied into the output traces (either at the same key or at a different key). These copy operations result in columns of the Jacobian matrix that have a single 1 entry, with the remaining entries equal to 0. For example, for the function $(u, v, x, y) \mapsto (u, u - v, y, x) = (u', v', x', y')$, the Jacobian is (with columns corresponding to $u'$, $v'$, $x'$, and $y'$, and rows corresponding to $u$, $v$, $x$, and $y$):

$$\begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

Using the cofactor expansion of the determinant, we observe that for any 'copy' column in an $M$-by-$M$ Jacobian matrix (a column with a single 1 and all other entries 0), the absolute value of the determinant is equal to that of the $(M-1)$-by-$(M-1)$ sub-matrix with the corresponding column and row omitted (even if that would remove other nonzero entries from the matrix). By applying this rule recursively, we can instead compute the determinant of a much smaller matrix; for the example above with $M = 4$, the absolute value of the determinant simplifies to the absolute value of a single entry ($|-1|$). Indeed, if some input key is copied to some output key, then we can entirely avoid computing its row (and the corresponding column) of the Jacobian. Therefore, the number of operations (which is dominated by the determinant) reduces from $M^3$ to $(M - |C|)^3$, where $|C|$ is the number of input keys that were copied to some output key. The @copy statement in our differentiable programming language makes the set of input keys that were copied explicitly available to the interpreter, which makes automating this optimization straightforward, as shown in run-involution in Algorithm 2.

Many involutive MCMC kernels modify only a constant number of keys that does not depend on the sizes $N$ and $N'$ of the input and output traces or the total number of continuous keys $M$; the other keys are copied over unchanged. For example, consider the Jacobian for a split-merge reversible jump move (Richardson and Green, 1997), which is implemented in Figure 1: there, the move splits cluster 2 into two clusters (cluster 2 and cluster 4), and black squares in the figure indicate nonzero entries of the Jacobian. In this case, the size $N$ of the latent part of the trace grows linearly in the number of clusters, but the Jacobian determinant can be calculated from only the 6-by-6 submatrix of the Jacobian that involves the parameters of the one cluster being split (or the two clusters being merged). Therefore, the acceptance ratio computation reduces from $O(N + N') + O(M^3)$ to $O(N + N')$.
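The copy-column reduction can be demonstrated directly on the worked example above; this Python sketch (ours) removes the copy rows and columns and checks that the absolute determinant is unchanged.

```python
# Sketch (ours) of the copy-column optimization: deleting the row and
# column of each "copy" entry leaves the absolute determinant unchanged,
# so only the genuinely transformed coordinates enter the computation.

def det(m):
    # cofactor expansion along the first row; fine for small matrices
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([r[:j] + r[j + 1:] for r in m[1:]])
               for j in range(len(m)))

# Jacobian of (u, v, x, y) -> (u, u - v, y, x); columns are the outputs
# u', v', x', y', rows are the inputs u, v, x, y.
J = [[1,  1, 0, 0],
     [0, -1, 0, 0],
     [0,  0, 0, 1],
     [0,  0, 1, 0]]
copy_cols = {0, 2, 3}    # u' copies u, x' copies y, y' copies x
copy_rows = {0, 2, 3}    # the rows holding those single 1 entries

keep = [i for i in range(4) if i not in copy_rows]
J_small = [[J[r][c] for c in range(4) if c not in copy_cols] for r in keep]
assert abs(det(J)) == abs(det(J_small)) == 1   # both reduce to |-1|
```

Here three of the four columns are copies, so the 4-by-4 determinant collapses to the single entry for $v'$, exactly the $(M - |C|)^3$ saving described above.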
While a reasonable hand-coded implementation of this algorithm would likely perform this sort of optimization as well, Algorithm 2 automates it.

Consider input traces $(x, y)$ and output traces $(x', y')$ with $f(x, y) = (x', y')$ for the involution $f$ encoded by program F (again, $x$ and $x'$ include only the latent part of the model trace and not the observed part $b$). Let $N_x := |K_x|$, $N_y := |K_y|$, $N'_x := |K_{x'}|$, and $N'_y := |K_{y'}|$, and let $N := N_x + N_y$ and $N' := N'_x + N'_y$. Suppose that $N_x + N'_x \gg N_y + N'_y$, which often occurs when the number of latent variables is large, but the involutive MCMC move only updates a portion of the latent variables. Algorithm 2 runs F, which explicitly writes the value for each element of $(x', y')$, requiring $N' = N'_x + N'_y \approx N'_x$ 'write' or 'copy' operations. The algorithm also uses approximately $N + N'$ operations to evaluate the log-densities, because it accumulates the log-density of each random choice in $x$ and $x'$.

We now show how to modify the program F and its interpreter so that both computing $z' = (x', y')$ from $z = (x, y)$ and computing the log-density ratio $\log(\tilde{p}(x' \oplus b)/\tilde{p}(x \oplus b))$ can take $O(1)$ operations (that is, constant in $N + N'$) when the involution $f$ has sparse structure and the model density $p$ has conditional independencies.

Note that $x$ and $x'$ may both have values for certain keys, and $x$ may contain keys not present in $x'$, and vice versa. Often, an involution does not modify the values of many keys in the trace—this is the case in the split-merge reversible jump move described earlier. Suppose that $A \subseteq K_x \cap K_{x'}$ is the set of keys $k$ in both $x$ and $x'$ for which $x'[k] = x[k]$.
Let $\Delta := K_{x'} \setminus A$ be the set of keys in $x'$ that are either (i) not in $x$, or (ii) in $x$ but with their value changed. Then the previous trace $x$, a dictionary $\delta$ with $K_\delta = \Delta$ and $\delta[k] = x'[k]$, and the density function $p$ suffice to uniquely determine the trace $x'$, provided the density satisfies the following condition, which is satisfied by densities defined by probabilistic programs:³ $p(x) > 0$, $p(x') > 0$, and $K_x \neq K_{x'}$ together imply that there exists $k \in K_x \cap K_{x'}$ with $x[k] \neq x'[k]$. Let $x|_A$ denote the restriction of a dictionary $x$ to only the keys in $A$. Then naive-trace-update (Algorithm 3) computes the new trace $x'$ and the log-density ratio $\log(p(x')/p(x))$ of a probabilistic program from a previous trace $x$ and a dictionary $\delta$ (and observations $b$), provided there exists $x' = \delta \oplus x|_A$ for some $A$ where $p(x') > 0$.

naive-trace-update is a naive implementation of a general trace-update operation that is part of Gen's API (Cusumano-Towner et al., 2019). While the naive implementation requires $O(N + N')$ operations, it is possible for more sophisticated implementations of trace-update to run in approximately $O(|K_\delta|)$ operations for certain probabilistic programs P, by exploiting conditional independence structure in the program that leads to cancellations in the density ratio $\tilde{p}(x' \oplus b)/\tilde{p}(x \oplus b)$. The details of these more efficient implementations are outside the scope of this paper, but they are implemented as part of the Gen system.

Given an implementation of trace-update (with the same input and output signature as naive-trace-update), we can optimize Algorithm 2 by modifying the requirement for the involution program F: instead of explicitly specifying a value for all keys in

³Intuitively, two traces of a probabilistic program cannot have different sets of addresses unless they share an address and disagree on its value.
Algorithm 3
Naive Trace Update
Require: probabilistic program P; previous trace x; observations b; dictionary δ such that there exists x′ = (δ ⊕ x|_A) for some A where p(x′) > 0

procedure naive-trace-update(P, x, b, δ)
    assert K_x ∩ K_b = ∅
    x′ ← {}; s′ ← 0
    Execute P, but with "k ∼ distribution" ≡ (
        1. if k ∈ K_δ then v ← δ[k] else v ← (x ⊕ b)[k]
        2. if k ∉ K_b then set x′[k] ← v
        3. set s′ ← s′ + logpdf(distribution, v)
        4. evaluate to v)
    s ← score(P, x ⊕ b)
    return (x′, s′ − s)
end procedure

$x'$, it need only explicitly specify the values in $\delta$. (We call the other values in $x'$ implicit copies—they could be explicitly copied within F, but this would result in unnecessary code and possibly unnecessary computation.) Therefore, the involution program can run in $O(|K_\delta| + N_y + N'_y)$ operations. We remove the separate calls to score for the log-densities $\log \tilde{p}(x \oplus b)$ and $\log \tilde{p}(x' \oplus b)$ and replace them with a single call to trace-update on P that takes in $x$ and $\delta$ and returns $x'$ and $\log(\tilde{p}(x' \oplus b)/\tilde{p}(x \oplus b))$. When $|K_\delta|$, $N_y$, and $N'_y$ are all constant in the size $N_x$ of the latent part of the model trace ($x$), the resulting algorithm uses a number of operations that does not grow with $N + N'$.

The automated technique described in the previous sections allows users to implement involutive MCMC algorithms by writing probabilistic programs for $p$ and $q$ and a differentiable program for the involution $f$, avoiding the need to derive and implement the accept/reject formula by hand. This eliminates certain classes of errors that could otherwise be difficult to root out in hand-coded implementations of MCMC algorithms. However, it is of course still possible to implement $p$, $q$, or $f$ incorrectly, introducing bugs that may invalidate the correctness proofs for the involutive MCMC algorithm.

Fortunately, the factorization of diverse algorithms into a common template, involving an explicitly represented model, auxiliary distribution, and involution, enables simple and automated debugging checks that can catch a wide variety of such errors dynamically. These checks are not complete, but anecdotally, we have found that they often alert users to subtle bugs in the design or implementation of an algorithm. The checks are summarized in Algorithm 4.

Algorithm 4 Dynamic Check for Bugs

procedure involutive-mcmc-check(P, Q, F)
    ▷ Generate a randomized test case
    ((x ⊕ b), _) ∼ trace-and-score(P)
    (y, _) ∼ trace-and-score(Q_x)
    ▷ Run the involution and perform the dimension check
    x′ ⊕ y′ ← run-involution(F, x ⊕ y, b)
    ▷ Support check
    log p̃(x′ ⊕ b) ← score(P, x′ ⊕ b)
    log q_{x′}(y′) ← score(Q_{x′}, y′)
    assert (log p̃(x′ ⊕ b) > −∞) ∧ (log q_{x′}(y′) > −∞)
    ▷ Involution check
    x̃ ⊕ ỹ ← run-involution(F, x′ ⊕ y′, b)
    assert (x̃ = x) ∧ (ỹ = y)
end procedure

The check first randomly samples a model trace $x$ and simulated observations $b$ from the model prior $p$, and an auxiliary trace $y$ from $q_x$. It then runs three tests on this simulated test case:

1. Support check.
The support check runs the involution F on $(x, y)$ and checks that the resulting pair of traces, $(x', y')$, lies within the set $Z$ of positive-density elements for $\pi$.

2. Dimension check.
The dimension check runs the involution F on $(x, y)$, records the sets of addresses $R$ and $W$ at which F performs continuous reads and writes, respectively, and checks that the sizes of these sets match. (This check is part of computing the determinant of the Jacobian in run-involution.)

3. Involution check.
The involution check runs F twice to compute $f(f(x, y))$ and checks that the result is equal to $(x, y)$. If $f$ is an involution on $Z$, then this check will always succeed; if there is a set $\tilde{Z}$ of positive $\pi$-measure on which $f(f(z)) \neq z$ (i.e., $f$ is not an involution), then this check has positive probability of failing.

These checks each catch qualitatively different bugs in user programs. We now give several examples.

Incorrectly transformed continuous variables.
Many bugs in the design or implementation of deterministic transformations of continuous variables are naturally detected by the involution check. Consider, for example, the Hamiltonian Monte Carlo algorithm (Duane et al., 1987), which, as Neklyudov et al. (2020) observe, is an instance of the involutive MCMC framework. When applied to HMC, the dynamic bug check involves: (a) sampling a model state $x$ from the prior; (b) sampling a momentum $y$; (c) running the leapfrog integrator forward to get a new state $x'$ and new momentum $-y'$; (d) running the leapfrog integrator backward from state $x'$ with momentum $y'$; and (e) checking that this results in the state-momentum pair $(x, -y)$. If the momentum is not properly negated, or if the leapfrog integrator is incorrectly implemented, this check can fail.

As another, simpler example of this type of error, consider the split_params and merge_params functions invoked in Figure 1, as part of a reversible jump MCMC kernel for inferring a mixture of Gaussians with an unknown number of components. Considering only the $\mu$ parameter, suppose the split move samples an auxiliary variable $u$ and computes $\mu_1 = \mu + u$, $\mu_2 = \mu - u$ as the means of the two new clusters. If the merge move joins two clusters and assigns $\mu = \sqrt{\mu_1 \mu_2}$, there is a mismatch: a split move cannot be reversed by a corresponding merge, because in general $\sqrt{(\mu + u)(\mu - u)} \neq \mu$. This error could be fixed either by changing the split to compute $\mu_1 = \mu u$, $\mu_2 = \mu / u$, or by changing the merge to compute $\mu = (\mu_1 + \mu_2)/2$.

Discrete logic errors.
Another common class of errors is for the discrete logic of an involution to be flawed. Consider the following incorrect implementation of a birth-death move for the mixture model in Figure 1, which either adds a new mixture component or selects an existing one at random to delete (we name the traces tr1, tr2, tr3, and tr4 for reasons of space):

@gen function q(trace)
    current_k = trace[:k]
    is_birth ∼ bernoulli(current_k == 1 ? 1.0 : 0.5)
    if is_birth
        new_mu ∼ normal(0, 10)
        new_var ∼ inv_gamma(1, 10)
    else
        deletion_idx ∼ uniform_discrete(1, current_k)
    end
end

@transform h (tr1, tr2) to (tr3, tr4) begin
    is_birth = @read(tr2[:is_birth], :discrete)
    @write(tr4[:is_birth], !is_birth, :discrete)
    k = @read(tr1[:k], :discrete)
    weights = @read(tr1[:weights], :continuous)
    if is_birth
        @write(tr3[:k], k+1, :discrete)
        new_mu = @read(tr2[:new_mu], :continuous)
        new_var = @read(tr2[:new_var], :continuous)
        @write(tr3[(:mu, k+1)], new_mu, :continuous)
        @write(tr3[(:var, k+1)], new_var, :continuous)
        new_weights = add_weight(weights)
        @write(tr4[:deletion_idx], k+1, :discrete)
    else
        idx = @read(tr2[:deletion_idx], :discrete)
        @copy(tr1[(:mu, idx)], tr4[:new_mu])
        @copy(tr1[(:var, idx)], tr4[:new_var])
        for i in (idx+1):k
            @copy(tr1[(:mu, i)], tr3[(:mu, i-1)])
            @copy(tr1[(:var, i)], tr3[(:var, i-1)])
        end
        @write(tr3[:k], k-1, :discrete)
        new_weights = delete_weight(weights, idx)
    end
    @write(tr3[:weights], new_weights, :continuous)
end

The flaw in this implementation is that although the death move can delete any of the k mixture components, the birth move can only add a new component at the end (index k+1), so the move is not reversible. The involution check will discover that the deletion of a component with index i < k is not reversed by a corresponding birth move, and will thus raise an error.

Other miscellaneous errors.
When implementing distributions as probabilistic programs, it is also possible for users to make more mundane errors, such as spelling the name of a random choice inconsistently, or characterizing random choices using the wrong type tags (:continuous and :discrete). Such errors can be difficult to detect statically, because the addresses at which a probabilistic program makes random choices, and the distributions of those choices, may change from sample to sample. (Previous work has explored static analyses based on types (Lew et al., 2019) and abstract interpretation (Lee et al., 2019), but these each work on limited subsets of the programs that Gen's full modeling language permits, and it is often precisely these more complex programs that require the flexibility of the involutive MCMC framework in the first place.) Our dynamic support and dimension checks can help to detect bugs like these. For example, if the involution f writes to a misspelled address, the support check will determine that the resulting trace's density is 0.

Dynamic checks during inference.
These dynamic assertions can also be run during inference, at each application of the transition kernel. This can be useful for catching bugs that only occur in regions of the state space with low prior mass (but perhaps high posterior mass). When enabled, we run the dynamic checks after each application of the kernel, and when they fail, we write to a debugging log and reject the proposed new state. As it turns out, the kernel induced by this procedure is still stationary for p:

Lemma 6.1.
Let p and q_x be model and auxiliary densities as above, but suppose f : D×D → D×D may not be an involution on Z. Trace-based involutive MCMC with dynamic checks enabled, rejecting whenever such a check fails, still yields a kernel that is stationary for p.

Proof. Let R = {x ∈ Z | f(f(x)) = x and π(f(x)) > 0}, and let h(x) := [x ∈ R] f(x) + [x ∉ R] x. Then h is an involution on Z, and trace-based involutive MCMC with p, q_x, and h yields a stationary kernel. But this kernel is the same one induced by using f with dynamic checks. For x on which the dynamic checks succeed, h is equivalent to f. For x on which the dynamic checks fail, h is equivalent to the identity; thus, accepting a move produced by h is equivalent to rejecting.

Reversible jump MCMC (Green, 1995; Hastie and Green, 2012) is a special case of involutive MCMC, and the implementation of reversible jump MCMC kernels can be automated using the probabilistic and differentiable programming languages presented in this paper. We now review reversible jump MCMC, then show how it can be automated using the techniques presented earlier, and give an example.
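The leapfrog round-trip check described above is easy to realize concretely. The sketch below is Python rather than Gen's Julia, with a standard normal target and an arbitrary step size and step count chosen for illustration; it applies the HMC involution twice and asserts that the starting state-momentum pair is recovered, and that a buggy kernel that omits the momentum negation fails the check:

```python
def leapfrog(x, y, grad_u, eps=0.1, steps=10):
    """Leapfrog integration of Hamiltonian dynamics for potential U."""
    y = y - 0.5 * eps * grad_u(x)
    for step in range(steps):
        x = x + eps * y
        if step < steps - 1:
            y = y - eps * grad_u(x)
    y = y - 0.5 * eps * grad_u(x)
    return x, y

def hmc_involution(x, y, grad_u):
    """One HMC proposal viewed as an involution: integrate, then negate momentum."""
    x_new, y_new = leapfrog(x, y, grad_u)
    return x_new, -y_new

grad_u = lambda x: x  # U(x) = x^2 / 2, i.e., a standard normal target
x0, y0 = 1.3, -0.4

# Dynamic involution check: applying the kernel twice recovers (x0, y0),
# because the leapfrog integrator is time-reversible.
x2, y2 = hmc_involution(*hmc_involution(x0, y0, grad_u), grad_u)
assert abs(x2 - x0) < 1e-9 and abs(y2 - y0) < 1e-9

# A buggy kernel that forgets to negate the momentum fails the check:
# two forward integrations just move further along the trajectory.
xb, yb = leapfrog(*leapfrog(x0, y0, grad_u), grad_u)
assert abs(xb - x0) > 1e-3 or abs(yb - y0) > 1e-3
```

The same check-and-reject discipline of Lemma 6.1 applies: were the negation bug present, every application of the kernel would trip the round-trip assertion and the move would be rejected.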
Review of reversible jump MCMC.
The reversible jump MCMC framework involves a set of 'models' h ∈ H and a prior distribution on models p(h). For each model, there is a latent continuous parameter vector θ_h ∈ R^{n(h)}, where n(h) is the dimension of model h, and a likelihood function L_{D,h}(θ_h) for each h given data D. The latent state x is a pair (h, θ_h) of a model and continuous parameters. There is a set of move types M. Each move type m ∈ M is associated with an unordered pair of models (h_1, h_2) and a dimensionality d(m) such that d(m) ≥ n(h_1) and d(m) ≥ n(h_2) (zero, one, or more than one move types may be associated with a given pair of models). For each latent state x = (h, θ_h), there is a probability distribution q_x(m) on move types such that q_x(m) > 0 only if h is one of the models for move type m. For each move type m ∈ M between h_1 and h_2, there is a pair of continuously differentiable bijections g_{m,h_1→h_2} : R^{d(m)} → R^{d(m)} and g_{m,h_2→h_1} := g_{m,h_1→h_2}^{-1}, and a pair of proposal densities q_{m,h_1→h_2}(u_{h_1→h_2}) and q_{m,h_2→h_1}(u_{h_2→h_1}), where u_{h_1→h_2} ∈ R^{d(m)−n(h_1)} and u_{h_2→h_1} ∈ R^{d(m)−n(h_2)}. A proposal is made from state x = (h, θ_h) by (i) sampling a move type m ∼ q_x(·), (ii) sampling a continuous variable u ∼ q_{m,h→h′}(·) for the pair (h, h′) associated with m, and (iii) computing (θ′_{h′}, u′) := g_{m,h→h′}(θ_h, u) and proposing the new state x′ = (h′, θ′_{h′}).

Encoding reversible jump in involutive MCMC.
To encode reversible jump MCMC in our framework, we write a probabilistic program P that encodes the space of models H, the prior distribution on models p(h), the per-model priors p_h(θ_h), and the per-model likelihoods L_{D,h}(θ_h). The set of all models h is encoded in the set of all pairs (K_P, d_P), where K_P represents a possible trace structure (i.e., a control-flow path through P) and d_P is an assignment to the discrete random choices made by P. The per-model continuous parameters θ_h are encoded via the continuous random choices c_P. The auxiliary probabilistic program Q encodes both the probability distribution on move types, using discrete random choices and possibly stochastic control flow ((K_Q, d_Q)), and the per-move-type probability densities on u, using continuous random choices c_Q. The involution f factors into (i) an involution f_1 on pairs i = ((K_P, d_P), (K_Q, d_Q)) that defines the association between the move types M and the model pairs (h_1, h_2); and (ii) a family of bijections f_{2,i} on the space of pairs (c_P, c_Q) of continuous random choices for both programs, for fixed values of the discrete random choices and a fixed trace structure.

Example: Split-merge reversible jump.
Figure 1 shows a split-merge reversible jump kernel for an infinite Gaussian mixture model (Richardson and Green, 1997), implemented using the probabilistic and differentiable programming languages described in this paper. Figure 1b shows the infinite Gaussian mixture model, specified as a probabilistic program p. The program takes the number of data points as input, samples the number of clusters from a Poisson distribution, then samples cluster parameters and mixture proportions, and finally samples the data points from the resulting finite mixture. Figure 1c shows the auxiliary probabilistic program q for the split-merge kernel. This program takes a trace of the model program as input and randomly decides whether to split a cluster, increasing the number of clusters by one, or to merge two clusters, decreasing the number of clusters by one. Then, the program randomly picks which cluster to split, or which clusters to merge. This kernel always merges the last cluster with a random other cluster; for ergodicity, the move can be composed with a simple move (with acceptance probability 1) that swaps a random cluster with the last cluster. If a split is chosen, then the program also samples the three degrees of freedom necessary to generate the new parameters for the clusters in an invertible manner. Figure 1d shows a differentiable program specifying the involution for the split-merge kernel, and Figure 1f shows graphically how this involution acts on pairs of traces. The yellow section (1) defines an involution f_1 on the discrete random choices, which specifies that (i) the split choice should be flipped (so that split moves are always mapped to merge moves and vice versa), (ii) the number of clusters should be increased by one for a split move and decreased by one for a merge move, and (iii) which merged cluster corresponds to which split clusters.
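To make the role of the continuous part of such a kernel concrete, the following sketch (Python rather than Gen, simplified to a single mean parameter instead of the three degrees of freedom used in Figure 1d; the function names are hypothetical) shows a moment-matching split, its inverse merge, the Jacobian factor that enters the acceptance probability, and the mismatched merge discussed earlier failing the round-trip check:

```python
import math

def split_mean(mu, u):
    """Split a cluster mean into two, moment-matching: (mu, u) -> (mu1, mu2)."""
    return mu + u, mu - u

def merge_mean(mu1, mu2):
    """Inverse of split_mean: (mu1, mu2) -> (mu, u)."""
    return (mu1 + mu2) / 2.0, (mu1 - mu2) / 2.0

mu, u = 2.0, 0.7
mu1, mu2 = split_mean(mu, u)

# Round-trip check: the merge inverts the split.
mu_back, u_back = merge_mean(mu1, mu2)
assert abs(mu_back - mu) < 1e-12 and abs(u_back - u) < 1e-12

# The Jacobian of (mu, u) -> (mu + u, mu - u) is [[1, 1], [1, -1]]; its
# absolute determinant, 2, is the factor the split acceptance ratio picks up
# (the merge direction contributes 1/2).
abs_det = abs(1 * (-1) - 1 * 1)
assert abs_det == 2

# The buggy merge discussed earlier, mu = sqrt(mu1 * mu2), does not invert
# the split: sqrt((mu + u) * (mu - u)) != mu in general.
assert abs(math.sqrt(mu1 * mu2) - mu) > 1e-3
```

In the automated system, the Jacobian factor is computed by automatic differentiation of the involution program, and the round-trip assertion is exactly the dynamic involution check.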
The green section (2) specifies the continuous bijections that govern the transformation of continuous random choices during split moves, and the purple section (3) specifies the inverses of these bijections, which govern the transformation of continuous choices during merge moves.

Figure 2 shows automated involutive MCMC being applied to fully Bayesian inference over the covariance function of a Gaussian process, where the prior on covariance functions (Figure 2a) is based on a probabilistic context-free grammar. The inference algorithm is based on an involutive MCMC kernel that uses a state-dependent mixture of proposals. A variant of this inference algorithm was previously studied (Schaechtle et al., 2016; Saad et al., 2019) based on a model of Grosse et al. (2012).
Hierarchical address spaces.
This example uses probabilistic programs that invoke other probabilistic programs, sometimes recursively. For example, the model probabilistic program p invokes the probabilistic program cov_function_prior, which is itself recursive. Similarly, the auxiliary probabilistic program q invokes walk_tree (which is recursive) as well as cov_function_prior. Consider the syntax used to recursively invoke walk_tree within walk_tree:

    ({:left} ∼ walk_tree(node.left, path))

This expression resembles a random choice expression. However, instead of associating the return value of the function walk_tree with the address :left, the address :left is associated with the entire trace of random choices made within walk_tree. That is, :left is the namespace for the addresses of all random choices made within the invocation. Further invocations by the callee themselves result in nested namespaces. This process results in a hierarchical address space for random choices, as shown in Figure 2b. This does not modify the mathematical formalism: each random choice made during the execution of a probabilistic program still has a unique address, but the address has multiple components that localize it within the hierarchy. For example, the choice :done in a recursive call to walk_tree might have the address

    k = (:left => :right => :done)

('=>' is the Gen syntax for constructing hierarchical addresses).

A complex state-dependent distribution.
At each iteration of the MCMC algorithm, the auxiliary probabilistic program q (Figure 2d) first picks a random node in the parse tree of the covariance function, by doing a stochastic walk of the existing parse tree that terminates at the chosen node.

    prev_cov_function = trace[:cov_function]
    path ∼ walk_tree(prev_cov_function, ..)

The code that walks the tree uses the following recursion, which results in a probability distribution that assigns exponentially lower probability to nodes that are deeper in the tree.

    if ({:done} ∼ bernoulli(0.5))
        return path
    elseif ({:recurse_left} ∼ bernoulli(0.5))
        path = (path..., :left_node)
        return ({:left} ∼ walk_tree(node.left, path))
    else
        path = (path..., :right_node)
        return ({:right} ∼ walk_tree(node.right, path))
    end

The resulting distributions on selected nodes for two possible input trees are shown below:
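The node-selection distribution induced by this recursion can also be computed directly for small trees. The sketch below is Python; the tuple-based tree representation, the covariance-kernel labels, and the convention that the walk always terminates upon reaching a leaf are assumptions made for illustration. It confirms that deeper nodes receive exponentially lower probability:

```python
def walk_probs(tree, p_reach=1.0, path=()):
    """Probability that the stochastic walk terminates at each node.
    Internal nodes stop the walk with probability 0.5; leaves always stop it.
    Trees are either leaf labels or (label, left, right) tuples."""
    if not isinstance(tree, tuple):        # leaf: the walk must stop here
        return {path: p_reach}
    probs = {path: 0.5 * p_reach}          # stop at this internal node
    p_child = 0.5 * p_reach * 0.5          # continue, then pick a side
    probs.update(walk_probs(tree[1], p_child, path + (':left_node',)))
    probs.update(walk_probs(tree[2], p_child, path + (':right_node',)))
    return probs

tree = ('+', 'SE', ('*', 'PER', 'LIN'))    # a small covariance parse tree
probs = walk_probs(tree)
assert abs(sum(probs.values()) - 1.0) < 1e-12          # a valid distribution
assert probs[()] == 0.5                                # the root
assert probs[(':left_node',)] == 0.25                  # a depth-1 leaf
assert probs[(':right_node', ':left_node')] == 0.0625  # a depth-2 leaf
```

Halving the probability at each level of depth is exactly the "exponentially lower probability for deeper nodes" behavior described above.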
The first part of the involution (Figure 2e) copies the random choices made during this walk from the input auxiliary trace to the output auxiliary trace:

    @copy(aux_in[:path], aux_out[:path])
Note that here, @copy is being used to copy the entire set of random choices under the namespace :path in aux_in to the namespace :path in aux_out. Because the mixture distribution is specified using a probabilistic program, it is straightforward to modify the program q to define a different mixture distribution. The code below specifies a mixture distribution that is uniform over all nodes in the tree.

    n1 = size(node.left); n2 = size(node.right)
    if ({:done} ∼ bernoulli(1 / (1 + n1 + n2)))
        return path
    elseif ({:recurse_left} ∼ bernoulli(n1 / (n1 + n2)))
        path = (path..., :left_node)
        return ({:left} ∼ walk_tree(node.left, path))
    else
        path = (path..., :right_node)
        return ({:right} ∼ walk_tree(node.right, path))
    end

The resulting distributions, for two possible input trees, are:
Note that the probability of choosing a given subtree to propose to is itself changed when the subtree changes. Therefore, the mixture probabilities do not in general cancel in the acceptance probability calculation, and must be accounted for. For the original mixture distribution, the ratio of mixture probabilities is either 1, 0.5, or 2, depending on whether the previous and new subtrees are leaf or internal nodes. For this alternative mixture distribution, the ratio of mixture probabilities is the ratio of the sizes of the two trees.

A general pattern for state-dependent mixtures of proposals in Metropolis-Hastings
The other parts of the auxiliary probabilistic program and the involution program specify a proposal distribution for the subtree of the parse tree that is rooted at the chosen node. In particular, the rest of the auxiliary probabilistic program q proposes a new subtree by sampling from the same process used to recursively define the prior distribution:

    new_subtree ∼ cov_function_prior()

The involution program swaps the old subtree with the newly proposed subtree:

    @copy(model_in[subtree_address], aux_out[:new_subtree])
    @copy(aux_in[:new_subtree], model_out[subtree_address])
This is an instance of a more general pattern for implementing state-dependent mixtures of proposals:

1. The auxiliary probabilistic program Q samples from a distribution over the different sets of random choices that will be proposed to (in this case, each set is a different subtree of the parse tree).

2. The auxiliary probabilistic program Q then samples new values for those random choices (in this case, a new subtree).

3. The involution program F swaps the previous values of those random choices with their new values, by swapping data between the model trace and the auxiliary trace.

4. The involution program F copies the random choices that determined which subset of random choices to propose to from the input auxiliary trace to the output auxiliary trace.

Involutive MCMC is a unifying construction (Neklyudov et al., 2020) for MCMC algorithms that encompasses both classic approaches to constructing kernels, like reversible jump MCMC (Green, 1995), and recently introduced classes of MCMC kernels based on neural networks (Spanbauer et al., 2020). The approach to automating involutive MCMC kernels presented in this paper therefore makes a number of classic MCMC techniques easier to use, broadens their accessibility, and may aid in the development of novel MCMC techniques as well. The implementation of our approach in the Gen probabilistic programming system has already been used by researchers in computational biology (Merrell and Gitter, 2020) and artificial intelligence (Zhi-Xuan et al., 2020) to prototype and develop new reversible-jump MCMC algorithms.

The technique for automating involutive MCMC presented in this paper can be generalized to the setting of sequential Monte Carlo samplers (Del Moral et al., 2006).
Instead of one model probabilistic program, one auxiliary probabilistic program, and one involution program, there are two model probabilistic programs, two auxiliary probabilistic programs, and a pair of bijective differentiable programs that transform traces of one model into traces of the other. This more general construct, which builds on earlier work on sequential Monte Carlo and probabilistic programs (Cusumano-Towner et al., 2018), has already been implemented as part of the Gen probabilistic programming system.

Improving the performance of automated involutive MCMC, and of flexible probabilistic programming systems like Gen more generally, is an important area for future work. The approach described in this paper is largely dynamic and is not performance-competitive with optimized hand-coded implementations in performance-oriented languages like C. While our approach is already valuable for use cases where the best performance is not necessary, or where the expertise or time needed for an optimized hand-coded implementation is not available, more research into compilers and automatic code generation of custom inference algorithms from high-level user specifications would broaden the applicability of systems like Gen.
Acknowledgements
This research was supported in part by the US Department of Defense through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program, the DARPA SD2 program (contract FA8750-17-C-0239), the DARPA Machine Common Sense (MCS) program, the DARPA Synergistic Discovery and Design (SD2) program, support from the Intel Corporation, and a philanthropic gift from the Aphorism Foundation. The authors would also like to thank Feras Saad and Cameron Freer for helpful discussions.
References

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98.

Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., and Goodman, N. D. (2019). Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973–978.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1).

Chang, J. T. and Pollard, D. (1997). Conditioning as disintegration. Statistica Neerlandica, 51(3):287–317.

Cusumano-Towner, M. (2018). Inference library of the Gen probabilistic programming system. https://github.com/probcomp/Gen.jl/blob/b9d72b/src/inference/mh.jl. Accessed: 2018-12-27.

Cusumano-Towner, M., Bichsel, B., Gehr, T., Vechev, M., and Mansinghka, V. K. (2018). Incremental inference for probabilistic programs. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, pages 571–585. ACM.

Cusumano-Towner, M. F., Saad, F. A., Lew, A. K., and Mansinghka, V. K. (2019). Gen: A general-purpose probabilistic programming system with programmable inference. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 221–236. ACM.

Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.

Gehr, T., Misailovic, S., and Vechev, M. (2016). PSI: Exact symbolic inference for probabilistic programs. In International Conference on Computer Aided Verification, pages 62–83. Springer.

Geiger, A., Lauer, M., and Urtasun, R. (2011). A generative model for 3D urban scene understanding from movable platforms. In CVPR 2011, pages 1945–1952. IEEE.

Gilks, W. R., Thomas, A., and Spiegelhalter, D. J. (1994). A language and program for complex Bayesian modelling. Journal of the Royal Statistical Society: Series D (The Statistician), 43(1):169–177.

Goodman, N., Mansinghka, V., Roy, D. M., Bonawitz, K., and Tenenbaum, J. B. (2008). Church: a language for generative models. In Proceedings of the 24th Annual Conference on Uncertainty in Artificial Intelligence, UAI 2008, pages 220–229. AUAI Press.

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732.

Grosse, R. B., Salakhutdinov, R., Freeman, W. T., and Tenenbaum, J. B. (2012). Exploiting compositionality to explore a large space of model structures. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, UAI 2012, pages 306–315. AUAI Press.

Hastie, D. I. and Green, P. J. (2012). Model choice using reversible jump Markov chain Monte Carlo. Statistica Neerlandica, 66(3):309–338.

Huelsenbeck, J. P., Larget, B., and Alfaro, M. E. (2004). Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Molecular Biology and Evolution, 21(6):1123–1133.

Lee, W., Yu, H., Rival, X., and Yang, H. (2019). Towards verified stochastic variational inference for probabilistic programs. Proceedings of the ACM on Programming Languages, 4(POPL):1–33.

Lew, A. K., Cusumano-Towner, M. F., Sherman, B., Carbin, M., and Mansinghka, V. K. (2019). Trace types and denotational semantics for sound programmable inference in probabilistic languages. Proceedings of the ACM on Programming Languages, 4(POPL):1–32.

Mansinghka, V. K., Schaechtle, U., Handa, S., Radul, A., Chen, Y., and Rinard, M. (2018). Probabilistic programming with programmable inference. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 603–616.

Merrell, D. and Gitter, A. (2020). Inferring signaling pathways with probabilistic programming. Proceedings of the Nineteenth European Conference on Computational Biology.

Milch, B., Marthi, B., Russell, S., Sontag, D., Ong, D. L., and Kolobov, A. (2005). BLOG: Probabilistic models with unknown objects. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, IJCAI 2005, pages 1352–1359. Morgan Kaufmann Publishers Inc.

Narayanan, P. and Shan, C.-c. (2020). Symbolic disintegration with a variety of base measures. ACM Transactions on Programming Languages and Systems (TOPLAS), 42(2):1–60.

Neklyudov, K., Welling, M., Egorov, E., and Vetrov, D. (2020). Involutive MCMC: A unifying framework. arXiv preprint arXiv:2006.16653.

Pfeffer, A. (2007). The design and implementation of IBAL: A general-purpose probabilistic language. Introduction to Statistical Relational Learning, page 399.

Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4):731–792.

Ritchie, D., Horsfall, P., and Goodman, N. D. (2016). Deep amortized inference for probabilistic programs. arXiv preprint arXiv:1610.05735.

Roberts, D. A., Gallagher, M., and Taimre, T. (2019). Reversible jump probabilistic programming. In Chaudhuri, K. and Sugiyama, M., editors, Proceedings of Machine Learning Research, volume 89, pages 634–643. PMLR.

Saad, F. A., Cusumano-Towner, M. F., Schaechtle, U., Rinard, M. C., and Mansinghka, V. K. (2019). Bayesian synthesis of probabilistic programs for automatic data modeling. Proceedings of the ACM on Programming Languages, 3(POPL):1–32.

Schaechtle, U., Saad, F., Radul, A., and Mansinghka, V. (2016). Time series structure discovery via probabilistic program synthesis. arXiv preprint arXiv:1611.07051.

Spanbauer, S., Freer, C., and Mansinghka, V. (2020). Deep involutive generative models for neural MCMC. arXiv preprint arXiv:2006.15167.

Tenenbaum, J. B., Kemp, C., Griffiths, T. L., and Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285.

Tierney, L. (1998). A note on Metropolis-Hastings kernels for general state spaces. Annals of Applied Probability, pages 1–9.

Zhi-Xuan, T., Mann, J. L., Silver, T., Tenenbaum, J. B., and Mansinghka, V. K. (2020). Online Bayesian goal inference for boundedly-rational planning agents. arXiv preprint arXiv:2006.07532.

A APPENDIX
A.1 Derivation of the pushforward Radon-Nikodym derivative for a special case
Implementing Algorithm 1 requires computing the Radon-Nikodym derivative d(µ ∘ f^{-1})/dµ. This section derives that function for the special case in which the involution f can be factored into an involution f_1 on a countable set I and a family of bijections f_{2,i} on R^{n_i} for some n_i for each i ∈ I. Suppose Z = {(i, x) : i ∈ I, x ∈ R^{n_i}}. Suppose f_1 is an involution on I with n_i = n_{f_1(i)}, and f_2 is a family of continuously differentiable bijections indexed by i ∈ I, such that f_{2,i} : R^{n_i} → R^{n_i}. Also suppose that f_{2,i} = f_{2,f_1(i)}^{-1} for all i ∈ I. That is,

f_{2,i}(f_{2,f_1(i)}(x)) = x  for all x ∈ R^{n_i} and all i ∈ I  (5)

Then f : Z → Z : (i, x) ↦ (f_1(i), f_{2,i}(x)) is an involution, because:

f(f(i, x)) = f(f_1(i), f_{2,i}(x)) = (f_1(f_1(i)), f_{2,f_1(i)}(f_{2,i}(x))) = (i, x)  (6)

Let Σ_n and µ_n denote the Lebesgue σ-algebra and Lebesgue measure on R^n, respectively. Let Σ ⊆ P(Z) be the σ-algebra of sets of the form ∪_{i∈I} {(i, x) : x ∈ K_i} for some K_i ∈ Σ_{n_i} for each i ∈ I.
Let µ denote the measure on the measurable space (Z, Σ) given by:

µ(∪_{i∈I} {(i, x) : x ∈ K_i}) := ∑_{i∈I} µ_{n_i}(K_i)  (7)

We wish to show that the Radon-Nikodym derivative of the pushforward of µ by f with respect to µ, evaluated at (i, x), is the absolute value of the Jacobian (determinant) of the function f_{2,i} evaluated at x, which is denoted (Jf_{2,i})(x):

(d(µ ∘ f^{-1})/dµ)(i, x) = |(Jf_{2,i})(x)|  (8)

Consider (µ ∘ f^{-1})(K) for K ∈ Σ:

(µ ∘ f^{-1})(∪_{i∈I} {(i, x) : x ∈ K_i}) = µ(f^{-1}(∪_{i∈I} {(i, x) : x ∈ K_i}))  (9)
= µ(∪_{i∈I} f^{-1}({(i, x) : x ∈ K_i}))  (10)
= µ(∪_{i∈I} {(f_1(i), x) : x ∈ f_{2,i}(K_i)})  (11)
= ∑_{i∈I} µ_{n_i}(f_{2,i}(K_i))  (12)

It suffices to show that for all K ∈ Σ:

∫_K |(Jf_{2,i})(x)| µ(dz) = ∑_{i∈I} µ_{n_i}(f_{2,i}(K_i))  (13)

Expanding the left-hand side:

∫_K |(Jf_{2,i})(x)| µ(dz) = ∑_{i∈I} ∫_{K_i} |(Jf_{2,i})(x)| µ_{n_i}(dx) = ∑_{i∈I} µ_{n_i}(f_{2,i}(K_i))  (14)

where the final step uses Bogachev Theorem 3.7.1 with g := 1 and F := f_{2,i}.

A.2 Proof of detailed balance for involution
The involutive MCMC kernel is composed of two parts: an extension of the state space, and an involution on the extended state space. First, we show detailed balance for the deterministic involution move applied to the extended state space. Tierney (1998) gives a class of MCMC kernels based on involutions that satisfy detailed balance. We now reproduce the result in our notation:
Lemma A.1 (Detailed balance for involution move (Tierney, 1998)). Let (Z, Σ, π) denote a measure space. Suppose f is a one-to-one function from Z onto Z such that f^{-1} = f. Consider the probability kernel k defined by k_z(A) := [f(z) ∈ A] α(z, f(z)) + [z ∈ A](1 − α(z, f(z))) (where α(z, f(z)) gives the probability of accepting a proposed transition from z to f(z)). Let ν(dz) := π(dz) + (π ∘ f^{-1})(dz). Let h(z) be a density for π with respect to ν. Let K := {z ∈ Z : h(z) > 0 and h(f(z)) > 0}. Then k satisfies detailed balance with respect to π if and only if:

1. α(z, f(z)) = 0 for π-almost all z ∉ K; and

2. α(z, f(z)) h(z)/h(f(z)) = α(f(z), z) for ν-almost all z ∈ K.

Now we apply Lemma A.1 to our setting, where π is σ-finite, there exists a σ-finite reference measure µ for the measurable space (Z, Σ) such that π is mutually absolutely continuous with respect to µ, and the pushforward of µ by f, denoted µ ∘ f^{-1}, is absolutely continuous with respect to µ. In our setting, α is defined as:

α(z, f(z)) := min{1, [(dπ/dµ)(f(z)) / (dπ/dµ)(z)] · (d(µ ∘ f^{-1})/dµ)(z)}  (15)

This definition of α satisfies:

α(z, f(z)) · [(dπ/dµ)(z) / (dπ/dµ)(f(z))] · ((d(µ ∘ f^{-1})/dµ)(z))^{-1} = α(f(z), z)  (16)

Therefore, to apply Lemma A.1, it suffices to show that π has a density with respect to ν (denoted h(z)) such that:

h(z)/h(f(z)) = [(dπ/dµ)(z) / (dπ/dµ)(f(z))] · ((d(µ ∘ f^{-1})/dµ)(z))^{-1}  (17)

Since π and π ∘ f^{-1} are both absolutely continuous with respect to µ, ν is also absolutely continuous with respect to µ, and has density:

(dν/dµ)(z) = (dπ/dµ)(z) + (d(π ∘ f^{-1})/dµ)(z)  (18)

π is absolutely continuous with respect to ν, and therefore:

(dπ/dµ)(z) = (dπ/dν)(z) · (dν/dµ)(z)  (19)

Because (dπ/dµ)(z) > 0 for all z ∈ Z, also (dν/dµ)(z) > 0 for all z ∈ Z.
Therefore,

h(z) := (dπ/dν)(z) = (dπ/dµ)(z) / (dν/dµ)(z)  (20)

Therefore:

h(z)/h(f(z)) = [(dπ/dµ)(z) / (dπ/dµ)(f(z))] · [(dν/dµ)(f(z)) / (dν/dµ)(z)]  (21)

It suffices to show that:

(dν/dµ)(f(z)) / (dν/dµ)(z) = ((d(µ ∘ f^{-1})/dµ)(z))^{-1}

First, we prove a lemma:
Lemma A.2. If (Z, Σ) is a measurable space and f : Z → Z is a measurable function that is an involution, ν and µ are σ-finite measures such that ν is absolutely continuous with respect to µ, and the pushforward measures ν ∘ f^{-1} and µ ∘ f^{-1} are both σ-finite, then ν ∘ f^{-1} is absolutely continuous with respect to µ ∘ f^{-1} and

(d(ν ∘ f^{-1})/d(µ ∘ f^{-1}))(z) = (dν/dµ)(f(z))  (22)

Proof.
First, ν ∘ f^{-1} is absolutely continuous with respect to µ ∘ f^{-1}, because (µ ∘ f^{-1})(A) = 0 implies µ(f^{-1}(A)) = 0, which implies ν(f^{-1}(A)) = 0, which implies (ν ∘ f^{-1})(A) = 0. To show that z ↦ (dν/dµ)(f(z)) is the Radon-Nikodym derivative d(ν ∘ f^{-1})/d(µ ∘ f^{-1}), it suffices to show that for all A ∈ Σ:

∫_A (dν/dµ)(f(z)) (µ ∘ f^{-1})(dz) = (ν ∘ f^{-1})(A) := ν(f^{-1}(A))  (23)

Applying Theorem 3.6.1 in Bogachev with Y := A, y := z, x := z′, X := f^{-1}(A), and g(y) := (dν/dµ)(f(y)):

∫_A (dν/dµ)(f(z)) (µ ∘ f^{-1})(dz) = ∫_Y g(y) (µ ∘ f^{-1})(dy)  (24)
= ∫_X g(f(x)) µ(dx)  [Bogachev Theorem 3.6.1]  (25)
= ∫_{f^{-1}(A)} (dν/dµ)(f(f(z′))) µ(dz′)  (26)
= ∫_{f^{-1}(A)} (dν/dµ)(z′) µ(dz′)  (27)
= ∫_{f^{-1}(A)} ν(dz′) = ν(f^{-1}(A))  (28)

Now, note that ν and ν ∘ f^{-1} are the same measure:

ν(A) = π(A) + (π ∘ f^{-1})(A) = π(A) + π(f^{-1}(A))  (29)
(ν ∘ f^{-1})(A) = π(f^{-1}(A)) + π(f^{-1}(f^{-1}(A))) = π(f^{-1}(A)) + π(A)  (30)

Therefore, (d(ν ∘ f^{-1})/dν)(z) = 1 for all z. Expanding d(ν ∘ f^{-1})/dν using the chain rule:

1 = (d(ν ∘ f^{-1})/dν)(z)  (31)
= (d(ν ∘ f^{-1})/d(µ ∘ f^{-1}))(z) · (d(µ ∘ f^{-1})/dµ)(z) · (dµ/dν)(z)  (32)
= (dν/dµ)(f(z)) · (d(µ ∘ f^{-1})/dµ)(z) · (dµ/dν)(z)  [Lemma A.2]  (33)
= (dν/dµ)(f(z)) · (d(µ ∘ f^{-1})/dµ)(z) · ((dν/dµ)(z))^{-1}  (34)

Therefore:

((d(µ ∘ f^{-1})/dµ)(z))^{-1} = (dν/dµ)(f(z)) / (dν/dµ)(z)  (35)
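A toy numerical instantiation of the factored special case of Section A.1 may help fix ideas. In the Python sketch below, the two-component index set and the affine bijections are assumptions chosen for illustration: f_1 swaps the two components and f_{2,1}(x) = 2x + 1, so Eq. (8) predicts the constant density |(Jf_{2,1})(x)| = 2, which matches the pushforward mass computed directly:

```python
# Assumed toy example: I = {1, 2}; f1 swaps the components;
# f_{2,1}(x) = 2x + 1 and f_{2,2} = f_{2,1}^{-1}.
f1 = {1: 2, 2: 1}
f2 = {1: (lambda x: 2.0 * x + 1.0), 2: (lambda x: (x - 1.0) / 2.0)}

def f(i, x):
    """The factored map f(i, x) = (f1(i), f_{2,i}(x))."""
    return f1[i], f2[i](x)

# f is an involution (Eq. 6): f(f(i, x)) = (i, x).
for x in (-1.5, 0.0, 3.25):
    assert f(*f(1, x)) == (1, x)
    assert f(*f(2, x)) == (2, x)

# Take K = {(1, x) : 0 <= x <= 1}. Then f^{-1}(K) = {(2, x) : x in [1, 3]}
# (since f_{2,2}(x) lies in [0, 1] iff x lies in [1, 3]), whose Lebesgue
# measure is 2.
pushforward_mass = 3.0 - 1.0
# Eq. (8) predicts density |J f_{2,1}(x)| = 2 on K, and integrating that
# density over K recovers the same mass (Eqs. 13-14).
density_integral = 2.0 * (1.0 - 0.0)
assert pushforward_mass == density_integral
```

In the automated system, the analogue of the constant 2 is computed by automatic differentiation of the trace transform, with the sum over the countable index set handled implicitly by the discrete random choices.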
Detailed balance of the involution kernel with respect to the measure induced by π implies:

∫_B π(z) k′_z(A) µ(dz) = ∫_A π(z) k′_z(B) µ(dz)  for all A, B ∈ Σ  (36)

Stationarity with respect to π follows by substituting Z for B:

∫_Z π(z) k′_z(A) µ(dz) = ∫_A π(z) k′_z(Z) µ(dz) = ∫_A π(z) µ(dz)  for all A ∈ Σ  (37)
A.4 Proof of stationarity for end-to-end kernel
We are given that the involution kernel is stationary with respect to (the measure induced by) π(x, u) := p(x) q_x(u):

∫_Z k′_z(A) π(z) µ(dz) = ∫_A π(z) µ(dz)  for all A ∈ Σ  (39)

The end-to-end kernel is defined for all x ∈ X such that p(x) > 0 by:

k_x(A) := ∫_U k′_{(x,u)}(A × U) q_x(u) µ_Q(du)  for all A ∈ Σ_P  (40)

Stationarity of the end-to-end kernel with respect to the measure induced by p is:

∫_X k_x(A) p(x) µ_P(dx) = ∫_A p(x) µ_P(dx)  for all A ∈ Σ_P  (41)

Expanding:

∫_X k_x(A) p(x) µ_P(dx) = ∫_X (∫_U k′_{(x,u)}(A × U) q_x(u) µ_Q(du)) p(x) µ_P(dx)
= ∫_{X×U} k′_{(x,u)}(A × U) q_x(u) p(x) (µ_P × µ_Q)(dz)
= ∫_Z k′_{(x,u)}({(x′, u′) ∈ Z : x′ ∈ A}) π(z) µ(dz)
= ∫_{{(x,u) ∈ Z : x ∈ A}} π(z) µ(dz)  [stationarity of k′ with respect to π]
= ∫_{{(x,u) ∈ Z : x ∈ A}} q_x(u) p(x) µ(dz)
= ∫_A (∫_U q_x(u) µ_Q(du)) p(x) µ_P(dx)
= ∫_A p(x) µ_P(dx)

A.5 A Sufficient Condition for Involutive MCMC with Dictionaries
Our formulation of involutive MCMC requires the following technical condition to hold: Z := {(x, y) ∈ X × Y : π(x, y) > 0} is µ_P × µ_Q-measurable. We now give a sufficient condition for this to hold when X and Y are spaces of dictionaries. Let D ⊆ K denote the subset of addresses that are discrete (i.e., where V_k is a countable set and µ_k is the counting measure). For x ∈ ×_{k∈A} V_k, let x = (d, c), where d ∈ ×_{k∈A∩D} V_k and c ∈ ×_{k∈A\D} V_k, so that d is the discrete part of x and c is the non-discrete part.

Lemma A.3.