Automating Involutive MCMC using Probabilistic and Differentiable Programming
Marco Cusumano-Towner, Alexander K. Lew, Vikash K. Mansinghka
Massachusetts Institute of Technology
Abstract
Involutive MCMC is a unifying mathematical construction for MCMC kernels that generalizes many classic and state-of-the-art MCMC algorithms, from reversible jump MCMC to kernels based on deep neural networks. But as with MCMC samplers more generally, implementing involutive MCMC kernels is often tedious and error-prone, especially when sampling on complex state spaces. This paper describes a technique for automating the implementation of involutive MCMC kernels given (i) a pair of probabilistic programs defining the target distribution and an auxiliary distribution respectively and (ii) a differentiable program that transforms the execution traces of these probabilistic programs. The technique, which is implemented as part of the Gen probabilistic programming system, also automatically detects user errors in the specification of involutive MCMC kernels and exploits sparsity in the kernels for improved efficiency. The paper shows example Gen code for a split-merge reversible jump move in an infinite Gaussian mixture model and a state-dependent mixture of proposals on a combinatorial space of covariance functions for a Gaussian process.
Markov chain Monte Carlo (MCMC) algorithms are powerful tools for approximate sampling from probability distributions and are central to modern Bayesian statistics, probabilistic machine learning, statistical physics, and numerous application areas where probabilistic modeling and inference are used. But designing and deriving efficient MCMC algorithms is mathematically involved, and implementing MCMC kernels is tedious and notoriously error-prone. These challenges are especially pronounced when sampling from probability distributions on complex state spaces that combine symbolic, numeric, and structural uncertainty, such as those arising in computational biology (Huelsenbeck et al., 2004), robotics and scene understanding (Geiger et al., 2011), and models of human cognition (Tenenbaum et al., 2011).

Involutive MCMC is a mathematical construction for MCMC kernels that gives a simplifying and unifying perspective on a number of previously disparate classes of kernels, including reversible jump MCMC (Green, 1995), which is the dominant mathematical framework for MCMC on complex state spaces. Involutive MCMC constructs an MCMC kernel from three components: (i) the unnormalized target density, (ii) a sampler and density for an auxiliary probability distribution, and (iii) an involution on an extended state space. While this construction is mathematically clarifying, correctly implementing involutive MCMC kernels on complex state spaces remains challenging due to tedious density and Jacobian computations and the need for careful reasoning about the state space.

This paper formulates involutive MCMC on general state spaces and shows how to automate the implementation of an involutive MCMC kernel from three declarative programs that define the target probability distribution, auxiliary probability distribution, and the involution, respectively.
The probability distributions are defined as probabilistic programs, and the involution is defined as a differentiable program that transforms the execution traces of the probabilistic programs. (An involution is a bijection that is its own inverse; that is, f such that f(f(z)) = z.) We use probabilistic programming techniques and automatic differentiation to automatically compute the acceptance probability. We also show how to automatically detect mathematical errors in the specification of an involutive MCMC kernel, and how to improve efficiency by automatically exploiting the sparsity structure in the involution. We implemented the approach within the Gen probabilistic programming system (Cusumano-Towner et al., 2019). The paper shows examples of involutive MCMC kernels implemented in Gen for (i) a split-merge reversible jump move in an infinite mixture model, and (ii) a state-dependent mixture of Metropolis-Hastings proposals on an infinite combinatorial space of covariance functions for a Gaussian process. We also provide a lightweight PyTorch implementation of the basic approach at https://github.com/probcomp/autoimcmc.

The contributions of this paper include:

1. A measure-theoretic formulation of involutive MCMC on general state spaces.

2. A mathematical formulation of state spaces consisting of arbitrary key-value stores (i.e. dictionaries) that formalizes the space of execution traces of Gen probabilistic programs, and a formulation of involutive MCMC on these spaces.

3. A differentiable programming language for defining transformations between spaces of traces of probabilistic programs.

4. An algorithm that automates the implementation of an involutive MCMC kernel given two probabilistic programs and a differentiable program encoding the involution, using automatic differentiation and tracing of probabilistic programs.

5.
An extension to the algorithm that exploits sparsity in the involution to reduce the number of operations needed to apply an involutive MCMC kernel from O(N) to O(1) in some cases, where N is the dimensionality of the latent space.

6. An algorithm that dynamically detects errors in the specification of an involutive MCMC kernel.

The involutive MCMC construction was previously implemented (Cusumano-Towner, 2018) by the authors as part of the inference library of the Gen probabilistic programming system (Cusumano-Towner et al., 2019). The construction was motivated in part by a desire for a simple interface that automated the implementation of reversible jump MCMC samplers (Green, 1995), state-dependent mixtures of proposals on complex state spaces, and data-driven neural proposals. Gen's involutive MCMC construction has since been used by a number of researchers to design and implement MCMC algorithms in diverse domains, including computational biology (Merrell and Gitter, 2020) and artificial intelligence (Zhi-Xuan et al., 2020).

Neklyudov et al. (2020) independently identified the involutive MCMC construction as a unifying framework for MCMC algorithms, and showed how more than a dozen classic and recent MCMC algorithms can be cast within this framework. Neklyudov et al. (2020) also identified design principles for developing new MCMC algorithms using the involutive MCMC construction, and showed that the framework aids in the derivation of novel efficient MCMC algorithms.

The involutive MCMC construction encompasses many existing classes of MCMC kernels, some of which explicitly make use of bijective or involutive deterministic maps. In particular, the reversible jump framework (Green, 1995; Hastie and Green, 2012) employs a family of continuously differentiable bijections between the parameter spaces of different models.
Tierney (1998) described a family of deterministic proposals based on a deterministic involution that is equivalent to involutive MCMC but without the auxiliary probability distribution. More recently, Spanbauer et al. (2020) defined a class of deep generative models based on differentiable involutions and trained these models to serve as efficient proposal distributions on continuous state spaces; the resulting algorithm is an instance of the construction presented in this paper.

In recent decades, a number of probabilistic programming systems have automated the implementations of probabilistic inference algorithms (Gilks et al., 1994; Milch et al., 2005; Pfeffer, 2007; Goodman et al., 2008; Gehr et al., 2016; Carpenter et al., 2017). Most of these systems support generic built-in inference algorithms, with user customization limited to tweaking algorithm parameters. Some systems allow for user-defined variational families or proposal distributions (Ritchie et al., 2016; Bingham et al., 2019). Programmable inference (Mansinghka et al., 2018) proposes that inference algorithms be programmed by users using new high-level inference abstractions. The Gen probabilistic programming system (Cusumano-Towner et al., 2019) exposes an API that supports high-level user implementations of an open-ended set of inference algorithms, and abstracts away low-level implementation details of inference algorithms.

One other probabilistic programming system besides Gen supports custom reversible jump samplers: Roberts et al. (2019) present a system embedded in Haskell that automatically generates the implementation of some reversible jump MCMC kernels from a high-level specification. Narayanan and Shan (2020) give a technique that automatically computes Metropolis-Hastings acceptance probabilities in some settings; however, these approaches do not handle many kernels that can be handled by our technique, including the example in Figure 2.
Figure 1: Example of reversible jump MCMC (Green, 1995) implemented using involutive MCMC in Gen. The example implements a 'split-merge move' in an infinite Gaussian mixture model (Richardson and Green, 1997) using three Gen programs: (1) a probabilistic program p encoding the generative model (shown in b), (2) a probabilistic program q encoding an auxiliary probability distribution (shown in c), and (3) a differentiable program f that encodes an involution on the space of pairs of traces of p and q (shown in d). Gen's involutive MCMC operator (shown in e) automatically computes the acceptance probability.

(a) (schematic) A pair of split-merge transitions in a Gaussian mixture model. The split transition splits the orange component into purple and blue components.

(b) An infinite Gaussian mixture model p, expressed in a Gen probabilistic modeling language:

    @gen function p(n::Int)
        k ~ poisson_plus_one(1)
        means = [({(:mu, j)} ~ normal(0, 10)) for j in 1:k]
        vars = [({(:var, j)} ~ inv_gamma(1, 10)) for j in 1:k]
        weights ~ dirichlet([2.0 for j in 1:k])
        for i in 1:n
            {(:x, i)} ~ mixture_of_normals(weights, means, vars)
        end
    end

(c) Auxiliary proposal distribution q for the split-merge reversible jump MCMC move, expressed in a Gen probabilistic modeling language:

    @gen function q(trace)
        k = trace[:k]
        ...
    end

(d) Involution f for the split-merge reversible jump MCMC move, expressed in a Gen differentiable programming language:

    @transform f (model_in, aux_in) to (model_out, aux_out) begin
        k = @read(model_in[:k], :discrete)
        split = (k == 1) ? true : @read(aux_in[:split], :discrete)
        if split
            cluster_to_split = @read(aux_in[:cluster_to_split], :discrete)
            @write(model_out[:k], k+1, :discrete)
            @copy(aux_in[:cluster_to_split], aux_out[:cluster_to_merge])
            @write(aux_out[:split], false, :discrete)
        else
            cluster_to_merge = @read(aux_in[:cluster_to_merge], :discrete)
            @write(model_out[:k], k-1, :discrete)
            @copy(aux_in[:cluster_to_merge], aux_out[:cluster_to_split])
            if (k > 2)
                @write(aux_out[:split], true, :discrete)
            end
        end
        if split
            u1 = @read(aux_in[:u1], :continuous)
            u2 = @read(aux_in[:u2], :continuous)
            u3 = @read(aux_in[:u3], :continuous)
            weights = @read(model_in[:weights], :continuous)
            mu = @read(model_in[(:mu, cluster_to_split)], :continuous)
            var = @read(model_in[(:var, cluster_to_split)], :continuous)
            new_weights = split_weights(weights, cluster_to_split, u1, k)
            w1 = new_weights[cluster_to_split]; w2 = new_weights[k+1]
            (mu1, mu2, var1, var2) = split_params(mu, var, u2, u3, w1, w2)
            @write(model_out[:weights], new_weights, :continuous)
            @write(model_out[(:mu, cluster_to_split)], mu1, :continuous)
            @write(model_out[(:mu, k+1)], mu2, :continuous)
            @write(model_out[(:var, cluster_to_split)], var1, :continuous)
            @write(model_out[(:var, k+1)], var2, :continuous)
        else
            mu1 = @read(model_in[(:mu, cluster_to_merge)], :continuous)
            mu2 = @read(model_in[(:mu, k)], :continuous)
            var1 = @read(model_in[(:var, cluster_to_merge)], :continuous)
            var2 = @read(model_in[(:var, k)], :continuous)
            weights = @read(model_in[:weights], :continuous)
            w1 = weights[cluster_to_merge]; w2 = weights[k]
            (new_weights, u1) = merge_weights(weights, cluster_to_merge, k)
            w = new_weights[cluster_to_merge]
            (mu, var, u2, u3) = merge_params(mu1, mu2, var1, var2, w1, w2, w)
            @write(model_out[:weights], new_weights, :continuous)
            @write(model_out[(:mu, cluster_to_merge)], mu, :continuous)
            @write(model_out[(:var, cluster_to_merge)], var, :continuous)
            @write(aux_out[:u1], u1, :continuous)
            @write(aux_out[:u2], u2, :continuous)
            @write(aux_out[:u3], u3, :continuous)
        end
    end

(e) Julia code that applies the split-merge reversible jump MCMC move defined by q and f 100 times to a trace of the model p, using Gen's involutive MCMC operator:

    (observations, n) = get_observations()
    trace, = Gen.generate(p, (n,), observations)
    for iter in 1:100
        trace = Gen.involutive_mcmc(trace, q, (), f)
    end

(f) (schematic) The involution f being applied twice in succession to a model trace and auxiliary proposal trace. The first application merges cluster 2 with the last cluster (cluster 3); the second application splits the resulting cluster, and returns the original pair of traces.

Figure 2: A mixture kernel implemented using involutive MCMC in Gen, applied to infer the covariance function of a Gaussian process. The prior on covariance functions is based on a probabilistic context-free grammar. Each component kernel in the mixture replaces a subtree of the covariance function parse tree with a new subtree. The mixture kernel chooses a random subtree to replace via a random walk on the parse tree. The mixture kernel is composed from three Gen programs: (1) a probabilistic program p encoding the generative model (shown in a), (2) a probabilistic program q encoding an auxiliary probability distribution (shown in d), and (3) a differentiable program f that encodes an involution (shown in e).

(a) A Gaussian process generative model p that uses a PCFG-based prior on a combinatorial space of covariance functions, expressed in a Gen probabilistic modeling language:

    @gen function p(x_values::Vector)
        cov_function ~ cov_function_prior()
        cov_matrix = compute_cov_matrix(cov_function, x_values)
        n = length(x_values)
        y_values ~ mvnormal(zeros(n), cov_matrix .+ 0.01*I(n))
    end

    @gen function cov_function_prior()
        node_type ~ categorical(production_rule_probs)
        if node_type == CONSTANT
            param ~ uniform(0, 1)
            return ConstantNode(param)
        elseif node_type == LINEAR
            param ~ uniform(0, 1)
            return LinearNode(param)
        elseif node_type == SQUARED_EXP
            length_scale ~ uniform(0, 1)
            return SquaredExponentialNode(length_scale)
        elseif node_type == PERIODIC
            ...
        elseif node_type == PLUS
            left_node ~ cov_function_prior()
            right_node ~ cov_function_prior()
            return PlusNode(left_node, right_node)
        elseif node_type == TIMES
            left_node ~ cov_function_prior()
            right_node ~ cov_function_prior()
            return TimesNode(left_node, right_node)
        end
    end

(b) (schematic) An execution trace of the generative model p from (a). Each random choice has a unique address in a hierarchical address space that is based on the tree of function calls.

(c) (plot) The covariance function represented by the trace in (b), and several samples from the resulting Gaussian process.

(d) Auxiliary proposal distribution q, which proposes a new subtree in the covariance function, expressed in a Gen probabilistic modeling language:

    @gen function q(trace)
        prev_cov_function = trace[:cov_function]
        path ~ walk_tree(prev_cov_function, (:cov_function,))
        new_subtree ~ cov_function_prior()
        return path
    end

    @gen function walk_tree(node::Node, path)
        if isa(node, LeafNode)
            done ~ bernoulli(1)
            return path
        elseif ({:done} ~ bernoulli(0.5))
            return path
        elseif ({:recurse_left} ~ bernoulli(0.5))
            path = (path..., :left_node)
            return ({:left} ~ walk_tree(node.left, path))
        else
            path = (path..., :right_node)
            return ({:right} ~ walk_tree(node.right, path))
        end
    end

A variant of walk_tree in which the stopping and recursion probabilities depend on the sizes of the subtrees:

    @gen function walk_tree(node::Node, path)
        if isa(node, LeafNode)
            done ~ bernoulli(1)
            return path
        end
        n1 = size(node.left)
        n2 = size(node.right)
        if ({:done} ~ bernoulli(1 / (1 + n1 + n2)))
            return path
        elseif ({:recurse_left} ~ bernoulli(n1 / (n1 + n2)))
            path = (path..., :left_node)
            return ({:left} ~ walk_tree(node.left, path))
        else
            path = (path..., :right_node)
            return ({:right} ~ walk_tree(node.right, path))
        end
    end

(e) An involution f, which swaps the previous and newly proposed subtrees:

    @transform f (model_in, aux_in) to (model_out, aux_out) begin
        path = @read(aux_in[:path], :discrete)
        subtree_address = foldr(=>, path)
        @copy(model_in[subtree_address], aux_out[:new_subtree])
        @copy(aux_in[:new_subtree], model_out[subtree_address])
        @copy(aux_in[:path], aux_out[:path])
    end

(f) (schematic) The involution f applied twice. The first application replaces a subtree with one node with a subtree with 5 nodes. The second application reverts the change.

Involutive MCMC is a general framework for constructing MCMC kernels that are stationary for a target probability distribution p. Informally, the algorithm works as follows: starting at some state x, we first sample an auxiliary variable y ∼ q_x from a state-dependent auxiliary distribution.
We then apply an involution f to the pair (x, y) to obtain (x′, y′). Finally, we compute an acceptance probability α and either accept x′ as the new state, or reject it and repeat the previous state x. Different choices of q and f recover many algorithms from the literature (Neklyudov et al., 2020).

In this section, we present involutive MCMC for models p and auxiliary kernels q defined over general state spaces. By emphasizing general state spaces, we intend to clarify a potential point of confusion regarding the involutive MCMC algorithm: as presented by Neklyudov et al. (2020), the acceptance probability α depends on the Jacobian of the involution f, but it is not immediately clear how to define this Jacobian when f may operate on samples from arbitrary measurable spaces, rather than on vectors in R^n. Our reformulation of the algorithm below is general enough to handle arbitrary model and auxiliary distributions, and precise enough to enable automation via probabilistic and differentiable programming: the rest of this paper uses it to develop a technique for deriving efficient implementations of involutive MCMC algorithms automatically, given only declarative specifications of p, q, and f.

Let (X, Σ_P, µ_P) and (Y, Σ_Q, µ_Q) denote two general measure spaces with σ-finite µ_P and µ_Q. Involutive MCMC (Algorithm 1) implements a transition kernel that is invariant for a model distribution given by p : X → [0, ∞), a probability density over X with respect to µ_P.
Each iteration, the algorithm first samples auxiliary variables y ∈ Y from an auxiliary distribution q_x based on the model's current state x: for each x ∈ X such that p(x) > 0, q_x : Y → [0, ∞) is a probability density with respect to µ_Q.

The resulting pair (x, y) of the current model state and the newly sampled auxiliary state will be an element of the joint space Z := {(x, y) ∈ X × Y | p(x) q_x(y) > 0}. We can equip Z with the σ-algebra Σ := {A ∩ Z | A ∈ Σ_P ⊗ Σ_Q} (assuming Z is a µ_P × µ_Q-measurable set), and a reference measure µ(A) := (µ_P × µ_Q)(A).

Let f : Z → Z denote an involution (f⁻¹ = f) such that the pushforward of µ under f, denoted µ ∘ f⁻¹, is absolutely continuous with respect to µ, with Radon-Nikodym derivative d(µ ∘ f⁻¹)/dµ : Z → [0, ∞). Involutive MCMC runs f on (x, y) to obtain (x′, y′), then computes an acceptance probability α. With probability α, the new state x′ is returned; otherwise, the previous state x is repeated.

Algorithm 1
Involutive MCMC

    procedure involutive-mcmc(p, q, f, x)
        y ∼ q_x(·)                                  ▷ sample auxiliary state
        (x′, y′) ← f(x, y)                          ▷ apply involution
        α ← [p(x′) q_{x′}(y′) / (p(x) q_x(y))] · (d(µ ∘ f⁻¹)/dµ)(x, y)
        r ∼ Uniform(0, 1)
        if r ≤ α then return x′ else return x
    end procedure

Theorem 3.1 (Involutive MCMC is stationary). Involutive MCMC defines a probability kernel k on X that is stationary with respect to the model probability distribution. That is, ∫_X k_x(B) p(x) µ_P(dx) = ∫_B p(x) µ_P(dx) for all B ∈ Σ_P.

Proof. The proof is presented in stages in the appendix (see Section A.2, Section A.3, and Section A.4).
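As a concrete instance of Algorithm 1, the following minimal Python sketch (not the paper's Gen implementation; all helper names are ours) implements one involutive MCMC step for densities on R. With q_x = Normal(x, 1) and the swap involution f(x, y) = (y, x), whose Jacobian determinant has absolute value 1, the kernel reduces to random-walk Metropolis-Hastings.

```python
import math
import random

def involutive_mcmc_step(log_p, log_q, sample_q, f, log_abs_det_jac, x):
    """One step of Algorithm 1 for a scalar state (illustrative sketch)."""
    y = sample_q(x)                      # y ~ q_x(.)
    x_new, y_new = f(x, y)               # apply the involution
    log_alpha = (log_p(x_new) + log_q(x_new, y_new)
                 - log_p(x) - log_q(x, y)
                 + log_abs_det_jac(x, y))
    if math.log(random.random()) <= log_alpha:   # accept with prob min(1, alpha)
        return x_new
    return x

# Random-walk Metropolis-Hastings as involutive MCMC.
log_p = lambda x: -0.5 * x * x                   # unnormalized N(0, 1) target
log_q = lambda x, y: -0.5 * (y - x) ** 2         # unnormalized Normal(x, 1) density
sample_q = lambda x: random.gauss(x, 1.0)
f = lambda x, y: (y, x)                          # swap involution: f(f(x, y)) == (x, y)
jac = lambda x, y: 0.0                           # |det J| = 1 for the swap

random.seed(0)
x, xs = 3.0, []
for _ in range(20000):
    x = involutive_mcmc_step(log_p, log_q, sample_q, f, jac, x)
    xs.append(x)
mean = sum(xs) / len(xs)   # close to 0 for the N(0, 1) target
```

Because q_x is symmetric, the auxiliary terms cancel and the acceptance ratio collapses to p(x′)/p(x), the usual random-walk Metropolis-Hastings rule.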
While maximally general, the measure-theoretic formulation of involutive MCMC in Algorithm 1 is not amenable to an automated implementation, because it does not indicate how to compute the Radon-Nikodym derivative that is required for the acceptance probability, and it is unclear how to specify the probability measures involved.

While restricting the state space to vectors of real numbers would address these issues, we seek a representation that remains flexible enough to represent complex hybrid state spaces with numeric, symbolic, and structural uncertainty. Therefore, we use state spaces consisting of finite dictionaries that map (possibly random) keys to (possibly random) values. Dictionaries include vectors as a special case (a vector x ∈ R^n can be represented as a dictionary mapping the keys 1, ..., n to the values x_1, ..., x_n), but are more flexible: different keys can hold values of different types (e.g. integers, strings), and we can also consider distributions in which the set of keys is itself random, which is useful for model selection problems and structure uncertainty more generally.

This section describes probability distributions on dictionaries, and gives a constructive definition of the involutive MCMC acceptance probability in this setting in terms of a Jacobian. Section 4 will then show how probability distributions on dictionaries can be specified with probabilistic programs, and how probabilistic programming techniques can automatically compute probability densities on spaces of dictionaries.

The space of finite dictionaries.
We fix a countably infinite set 𝒦 of possible keys, such that each key k is either called discrete (k ∈ I) or continuous (k ∈ J), where 𝒦 = I ∪ J. Let V_k denote the set of possible values for key k, where V_k is a countable set for each discrete key, and where V_k = R^{d_k} for each continuous key, for some d_k. Given a set of keys K, let V_K = ×_{k ∈ K} V_k denote the set of assignments of values to each key. Then the set of all finite dictionaries is D := ∪_{K ⊂ 𝒦, |K| < ∞} {(K, x) | x ∈ V_K}. That is, a dictionary specifies a finite set of keys K ⊂ 𝒦 at which it has values, and an assignment x of values x_k for each.

Relationship to the representation of Green (1995).
Green (1995) uses a state space that is the countable union of 'models', where each model is typically a vector of real-valued parameters. Dictionaries have substantially more structure: instead of monolithic 'models', dictionaries use a more elaborate discrete state that includes the set of keys and the assignment to the discrete keys. Also, because continuous keys play the role of real-valued parameters, it is possible to express that a given real-valued parameter is shared between models. The additional structure of dictionaries enables the automation techniques in Section 4.
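As an illustration, the dictionary state space can be mocked up with plain Python dicts (an illustrative sketch, not Gen's trace data structure): keys may be structured values such as tuples, different keys may hold values of different types, and the key set itself can depend on a random choice.

```python
# A 'trace' is a plain dict whose keys mix discrete and continuous addresses.
# The key set is random: the number of ("mu", j) entries depends on the value at "k".
trace_small = {"k": 1, ("mu", 1): -0.4, ("var", 1): 2.1}
trace_large = {"k": 2, ("mu", 1): -0.4, ("var", 1): 2.1,
               ("mu", 2): 5.0, ("var", 2): 0.7}

def vec_to_dict(v):
    # A vector in R^n is the special case where the keys are 1..n.
    return {i + 1: x for i, x in enumerate(v)}

def dict_to_vec(d):
    return [d[i] for i in sorted(d)]

def merge_disjoint(m, n):
    # Merging two dictionaries with disjoint key sets.
    assert not (m.keys() & n.keys()), "key sets must be disjoint"
    return {**m, **n}
```

Two traces with different values at the discrete key "k" occupy different key sets, which is exactly the trans-dimensional structure that reversible jump moves must navigate.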
A measure space of finite dictionaries.
We associate a measure µ_k on V_k with each key k: the counting measure for each discrete key and the Lebesgue measure on R^{d_k} for each continuous key. For each finite set of keys K, we make V_K a measure space using the standard product σ-algebra Σ_K = ⊗_{k ∈ K} Σ_k and the product measure µ_K = ×_{k ∈ K} µ_k. We equip D with the σ-algebra Σ_D := {∪_{K ⊂ 𝒦, |K| < ∞} {(K, x) | x ∈ B_K} | B_K ∈ Σ_K for each finite K ⊂ 𝒦} to obtain a measurable space of dictionaries. A reference measure µ_D on this space can be constructed using the product measures µ_K: we set µ_D(B) := Σ_{K ⊂ 𝒦, |K| < ∞} µ_K({x | (K, x) ∈ B}).

Notation for dictionaries.
Given a dictionary m = (K, x), we write K_m for K and m[k] for the value x_k associated with a key k ∈ K. We also denote specific dictionaries using the notation {k_1 ↦ v_1, k_2 ↦ v_2, ...}. For example, the dictionary (K, x) with K = {1, "foo"}, x_1 = 0.123, and x_"foo" = 5 is denoted {1 ↦ 0.123, "foo" ↦ 5}.

Probability distributions on finite dictionaries.
When V_k is discrete for all keys k, a probability distribution on dictionaries is defined by a probability mass function p : D → [0, 1] that assigns a probability p(m) to each dictionary m ∈ D such that Σ_{m ∈ D} p(m) = 1. (It is possible to assign a general measure space to each key, but this is not necessary for our purposes.) More generally, a probability distribution on dictionaries is defined by a probability density p : D → [0, ∞) such that ∫ p(m) µ_D(dm) = 1. The probability mass is distributed among the finite sets of keys K ⊆ 𝒦:

    1 = Σ_{K ⊆ 𝒦, |K| < ∞} Σ_{x_I} ∫_{R^{d_K}} p((K, (x_I, x_J))) dx_J        (1)

where d_K := Σ_{k ∈ K ∩ J} d_k is the total continuous dimension for keys K, and where x_I is an assignment to the discrete choices in K and x_J is an assignment to the continuous choices.

We now give an example to build intuition. Consider a generative model of univariate data points y_1, ..., y_n from a Gaussian mixture with an unknown number of components k, each with unknown mean m_i and variance s_i. If we place a Gaussian prior on m_i, an inverse gamma prior on s_i, and a Poisson prior on k, the resulting density on dictionaries d is:

    p(d) = p_poisson(3)(d[k]) · Π_{i=1}^{d[k]} p_normal(0,10)(d[m_i]) · Π_{i=1}^{d[k]} p_invgamma(1,10)(d[s_i]) · Π_{i=1}^{n} (1/d[k]) Σ_{j=1}^{d[k]} p_normal(d[m_j], d[s_j])(d[y_i])

when K_d = {k, y_1, ..., y_n} ∪ {m_i | 1 ≤ i ≤ d[k]} ∪ {s_i | 1 ≤ i ≤ d[k]}, and 0 otherwise. In this case, we have V_k = N with the counting measure for µ_k, and for all other keys k, V_k = R with the Lebesgue measure for µ_k. For each j ∈ {1, 2, ...}, the probability mass assigned to the key set {k, y_1, ..., y_n} ∪ {m_i | 1 ≤ i ≤ j} ∪ {s_i | 1 ≤ i ≤ j} is p_poisson(3)(j) (via Equation (1)).

Conditional distributions via disintegration.
Consider a probability density p on the space D of dictionaries. We say a key k ∈ 𝒦 almost always appears if ∫ [k ∈ K_m] p(m) µ_D(dm) = 1. Suppose B is a set of keys that almost always appear for p, and that b = (B, b) is a dictionary with keys B. Furthermore, let (K_m, m) ⊕ (K_n, n) := (K_m ∪ K_n, (m, n)) be the merge of two dictionaries m and n defined on disjoint key sets K_m and K_n. Then we can define the conditional density

    p(d | b) := [K_d ∩ B = ∅] p(d ⊕ b) / ∫_{{m | K_m ∩ B = ∅}} p(m ⊕ b) µ_D(dm)

when the denominator is finite. If each key k ∈ B is discrete (i.e. µ_k is the counting measure), then this definition corresponds to the ordinary notion of conditioning on an event (namely, the event that a sample from p agrees with the dictionary b on all keys in B). When this is not the case, it corresponds to a more general measure-theoretic notion called disintegration (Chang and Pollard, 1997).

Consider the infinite univariate mixture model, and the conditional density given observed data (B, b) := {y_1 ↦ ȳ_1, ..., y_n ↦ ȳ_n}. The conditional density p(d | b) is nonzero only if K_d = {k} ∪ {m_i | 1 ≤ i ≤ k} ∪ {s_i | 1 ≤ i ≤ k} for some k (d does not contain y-values), and the denominator in the definition of p(d | b) simplifies to the familiar sum of marginal likelihoods over all k, where each marginal likelihood is a Riemann integral over R^{2k}.

Suppose that the model distribution and auxiliary distributions are probability distributions on dictionaries, with densities p and q_x. Then X and Y are both sets of dictionaries, and the joint space Z is a set of pairs (x, y) of dictionaries with keys from 𝒦_P and 𝒦_Q respectively, so that Z = X × Y ⊆ D_P × D_Q, where D_P is the set of dictionaries on keys taken from 𝒦_P and similarly for D_Q.
To simplify the notation, and without loss of generality, we will assume that 𝒦_P and 𝒦_Q are disjoint (if they are not, they can be made so by adding a different prefix to the keys of each set), and we define Z := {x ⊕ y : p(x) q_x(y) > 0} ⊆ D, where D is the set of dictionaries on keys from 𝒦_P ∪ 𝒦_Q (recall that x ⊕ y denotes the dictionary resulting from merging dictionaries x and y with disjoint keys).

Suppose there is a countable partition of Z into {Z_e : e ∈ E} such that if (K_1, x_1) and (K_2, x_2) are two dictionaries in the same component Z_e, then K_1 = K_2 and they agree on all discrete values: x_{1,k} = x_{2,k} for all k ∈ K_1 ∩ I. Then each set Z_e is isomorphic to a Euclidean space of assignments to the continuous keys in the two dictionaries. Suppose there is an involution g : E → E between elements of the partition, and a family of continuously differentiable bijections h_e : Z_e → Z_{g(e)} indexed by e ∈ E, with h_e = h_{g(e)}⁻¹. Let e(z) ∈ E denote which element of the partition a dictionary z ∈ Z belongs to. Then f : Z → Z given by f(z) := h_{e(z)}(z) is an involution:

    f(f(z)) = h_{e(f(z))}(h_{e(z)}(z)) = h_{g(e(z))}(h_{e(z)}(z)) = z.

Let |Jh_e|(z) denote the absolute value of the determinant of the Jacobian of h_e, evaluated at z. Then the acceptance probability in Algorithm 1 simplifies to:

    [p(x′) q_{x′}(y′) / (p(x) q_x(y))] · |Jh_e|(z)        (2)

One example of a valid partition of Z is given by the equivalence classes of the following equivalence relation: z_1 ∼ z_2 ⇔ (K_{z_1} = K_{z_2}) ∧ (z_1[k] = z_2[k] for all k ∈ K_{z_1} ∩ I) (dictionaries are equivalent if they contain the same keys and they agree on the values of all discrete keys). See Section A.1 of the appendix for details.

Involutive MCMC is a general framework that can be used to develop diverse MCMC algorithms for models over arbitrary state spaces.
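To build intuition for the bijections h_e and the Jacobian factor |Jh_e|(z) in Equation (2), consider a toy one-dimensional split move (a deliberately simplified analogue of the split_weights/merge_weights helpers in Figure 1; the function names here are ours). One weight w and a degree of freedom u map to two weights (wu, w(1-u)); the inverse recovers (w, u); and the Jacobian determinant of the split is -w, so |det J| = w enters the acceptance ratio. A finite-difference check confirms this.

```python
import math

def toy_split(w, u):
    # h_e for a toy split move: one weight and one degree of freedom become two weights
    return (w * u, w * (1.0 - u))

def toy_merge(w1, w2):
    # the inverse bijection: two weights become one weight plus the recovered u
    w = w1 + w2
    return (w, w1 / w)

def numerical_log_abs_det_jac(h, args, eps=1e-6):
    # finite-difference estimate of log |det J_h| at args (2x2 Jacobian here)
    n = len(args)
    base = h(*args)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pert = list(args)
        pert[j] += eps
        out = h(*pert)
        for i in range(n):
            J[i][j] = (out[i] - base[i]) / eps
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    return math.log(abs(det))

w, u = 0.6, 0.25
w1, w2 = toy_split(w, u)
```

Round-tripping through toy_merge recovers (w, u), and the numerical log |det J| of the split matches log w, the factor that would multiply the density ratio in the acceptance probability.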
We wish to automate the implementation details for involutive MCMC algorithms, given only a specification of the model p, the auxiliary distribution q, and the involution f. To do so, we require a representation for the distributions p and q that is flexible enough to represent the full variety of models and auxiliary distributions of interest to practitioners. The representation must support density evaluation and sampling. It is also desirable that the representation be structured: the more information available to us (e.g., about the decomposition of a distribution's state space into individual univariate and multivariate random variables, or about conditional independence relationships in a model), the easier it will be for the implementation to exploit this structure automatically by using more efficient data structures and low-level manipulations.

Probabilistic programs are flexible and structured representations for probability distributions. Unlike densities (but like Bayesian networks), probabilistic programs can be efficiently sampled and contain explicitly represented information about some conditional independence relationships in a model. But unlike Bayesian networks, they do not assume a fixed number of random variables, state space dimension, or dependency structure.

At the most basic level, a probabilistic program is a program that makes random choices. Any such program induces a probability distribution over its possible execution traces, records of each random choice it makes. If each random choice is associated with a unique address from the set of dictionary keys 𝒦, then these traces can be viewed as finite dictionaries, mapping the address of each random choice to its value. The distribution induced by a probabilistic program over its execution traces can thus be understood as a measure on the space (D, Σ_D, µ_D) introduced in the previous section.
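The trace-as-dictionary view can be sketched in a few lines of Python (a hypothetical handler interface of our own, not Gen's): running a program under a handler that records each named random choice yields exactly a finite dictionary, and the key set itself varies with the discrete choices made.

```python
import random

class TraceHandler:
    """Records each named random choice into a dictionary (an execution trace)."""
    def __init__(self, rng):
        self.rng = rng
        self.trace = {}

    def sample(self, address, dist, *args):
        value = dist(self.rng, *args)
        assert address not in self.trace, "addresses must be unique"
        self.trace[address] = value
        return value

# primitive distributions as (rng, params) -> value
def bernoulli(rng, p):
    return rng.random() < p

def gauss(rng, mu, sigma):
    return rng.gauss(mu, sigma)

def model(handler):
    # the number of components is itself a random choice, so the key set is random
    k = 2 if handler.sample("extra", bernoulli, 0.5) else 1
    for j in range(1, k + 1):
        handler.sample(("mu", j), gauss, 0.0, 10.0)
    return k

rng = random.Random(42)
h = TraceHandler(rng)
k = model(h)
# h.trace now maps each address to its sampled value,
# e.g. {"extra": False, ("mu", 1): ...}
```

The same program, re-executed under a different handler that reads values out of an existing dictionary instead of sampling, yields the density computations used below.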
Furthermore, densities of trace distributions with respect to $\mu_D$ are typically easy to compute.

In this section, we introduce a probabilistic programming language from the Gen probabilistic programming system (Cusumano-Towner et al., 2019), and present a technique for automating the implementation of involutive MCMC algorithms when the model $p$ and auxiliary distribution $q_x$ are both represented

Algorithm 2
Automated Involutive MCMC
Require: probabilistic programs P and Q; differentiable program F; initial state x; observations b

procedure auto-involutive-mcmc(P, Q, F, x, b)
    ▷ Sample y ∼ q_x(·) and compute its log-density via probabilistic program Q
    (y, log q_x(y)) ← trace-and-score(Q_x)
    ▷ Compute (x′ ⊕ y′) = f(x ⊕ y) and the log Radon–Nikodym derivative via differentiable program F
    (x′ ⊕ y′, log D) ← run-involution(F, x ⊕ y, b)
    ▷ Compute log-densities via probabilistic programs P and Q
    log p̃(x ⊕ b) ← score(P, x ⊕ b)
    log p̃(x′ ⊕ b) ← score(P, x′ ⊕ b)
    log q_{x′}(y′) ← score(Q_{x′}, y′)
    ▷ Compute the acceptance probability and accept or reject
    α ← min{1, exp(log p̃(x′ ⊕ b) − log p̃(x ⊕ b) + log q_{x′}(y′) − log q_x(y) + log D)}
    with probability α return x′ else return x end
end procedure

procedure trace-and-score(P)
    s ← 0; x ← {}
    Execute P, but with "k ∼ distribution" ≡ (
        1. sample v ∼ distribution
        2. set x[k] ← v
        3. set s ← s + logpdf(distribution, v)
        4. evaluate to v)
    return (x, s)
end procedure

procedure score(P, x)
    s ← 0; K ← ∅
    Execute P, but with "k ∼ distribution" ≡ (
        1. set v ← x[k]
        2. set K ← K ∪ {k}
        3. set s ← s + logpdf(distribution, v)
        4. evaluate to v)
    if K ≠ K_x then s ← −∞ end
    return s
end procedure

procedure run-involution(F, z, b)
    ▷ Run the involution, and keep track of copied, read, and written continuous keys
    z′ ← {}; R ← ∅; W ← ∅; C ← ∅   ▷ Initialize empty output dictionary and key sets
    Execute F, but with
        "@write(k, v)" ≡ (if k ∈ I then set z′[k] ← v else set z′[k] ← v and set W ← W ∪ {k})
        "@read(k)" ≡ (if k ∈ I then evaluate to z[k] else set R ← R ∪ {k} and evaluate to z[k])
        "@copy(k₁, k₂)" ≡ (set z′[k₂] ← z[k₁] and set C ← C ∪ {k₁})
        (and reading from the observations b for any k ∈ K_b)
    ▷ Use automatic differentiation of F to compute the Jacobian, skipping copied addresses
    for i in 1 to |W| do
        k_i ← W_i   ▷ Pick the i-th key in W; the order does not matter
        J[:, i] ← ∇_{z_{R∖C}} (h_{e(z)}(z)[k_i])   ▷ Gradient of z′[k_i] = h_{e(z)}(z)[k_i] w.r.t. non-copied continuous inputs (z[k] for k ∈ R ∖ C)
    end for
    return (z′, log(|det(J)|))
end procedure

as probabilistic programs in this language. But the technique is not limited to the Gen system—we also provide the implementation of a minimal probabilistic programming language in PyTorch that supports the automation technique.

Our probabilistic programming language augments the syntax of Julia (Bezanson et al., 2017) with a single new construct, the '{address} ∼ distribution' expression, for making a named random choice. An execution trace of a Gen probabilistic program is a dictionary that can be sampled by running the program according to Julia's usual semantics, and upon encountering an expression of the form '{address} ∼ distribution', (i) evaluating the address expression ('address') to obtain an address $k \in K$; (ii) evaluating the distribution expression ('distribution') to obtain a probability distribution over the measurable space $V_k$; (iii) sampling a value $x_k$ from this distribution and adding the mapping $\{k \mapsto x_k\}$ to the execution trace; and (iv) returning the sampled value $x_k$ to the program, to continue execution. When execution terminates, the execution trace has accumulated a mapping for each random choice encountered during the program's execution.

For example, consider the probabilistic program below, which defines a Gaussian mixture model with an unknown number of components:

@gen function p(n::Int)
    k ∼ poisson_plus_one(1)
    means = [({(:mu, j)} ∼ normal(0, 10)) for j in 1:k]
    vars = [({(:var, j)} ∼ inv_gamma(1, 10)) for j in 1:k]
    weights ∼ dirichlet([2.0 for j in 1:k])
    for i in 1:n
        {(:x, i)} ∼ mixture_of_normals(weights, means, vars)
    end
end

The program p accepts as input an integer n, a number of data points. Each n defines a distinct distribution over dictionaries. The first line of the program samples a number of mixture components from a Poisson prior using the address :k. (This line could also be written k = {:k} ∼ poisson_plus_one(1): the address is the symbol :k, and the result of the choice is assigned to a Julia variable called k. Because this is a common pattern, Gen provides the syntactic sugar x ∼ d as shorthand for x = {:x} ∼ d.) The program then samples k means and k variances from Gaussian and inverse gamma priors, respectively. Each of these 2k random choices has its own address, determined by the address expression preceding it. For example, the mean for the fourth mixture component (if k ≥ 4) has address (:mu, 4).
The mixture weights are then sampled at the address :weights from a Dirichlet distribution, and n data points are sampled at addresses (:x, 1), ..., (:x, n). The random choice at address :k is discrete, with $V_{:k} = \mathbb{Z}_{\geq 1}$, and the other random choices are continuous. Overall, the program defines a distribution over traces with the following density over traces $(K, x)$ with respect to $\mu_D$:

$$p(n)((K, x)) = p_{\mathrm{poisson}(1)}(x_{:k} - 1) \cdot \prod_{j=1}^{x_{:k}} p_{\mathrm{normal}(0,10)}(x_{(:mu,j)}) \cdot \prod_{j=1}^{x_{:k}} p_{\mathrm{invgamma}(1,10)}(x_{(:var,j)}) \cdot p_{\mathrm{dirichlet}(2.0,\ldots,2.0)}(x_{:weights}) \cdot \prod_{i=1}^{n} \sum_{j=1}^{x_{:k}} x_{:weights}[j] \cdot p_{\mathrm{normal}(x_{(:mu,j)},\, x_{(:var,j)})}(x_{(:x,i)})$$

when $K$ contains exactly the addresses :k, :weights, (:x, i) for $i = 1, \ldots, n$, and (:mu, j) and (:var, j) for $j = 1, \ldots, x_{:k}$; otherwise, the density is 0.

Automatic Sampling and Density Computation for Probabilistic Programs.
Sampling traces from, and computing densities of dictionaries under, the distribution on traces induced by a probabilistic program is straightforward, using a standard technique in probabilistic programming. We illustrate this technique in the trace-and-score and score subroutines in Algorithm 2. The trace-and-score subroutine samples a trace by running a probabilistic program, recording the value of every encountered choice into a dictionary, which it returns once execution has terminated. It also returns the log-density of the trace, calculated by accumulating the densities of the individual choices it encounters. (Note that this procedure is only valid if the program halts with probability 1. Otherwise, it could loop infinitely, and even if it terminates, the density will be incorrect.)

The density of an arbitrary dictionary $(K, x)$ under any probabilistic program's distribution on traces can be computed using score. The idea is to run the probabilistic program, and whenever a random choice '{address} ∼ distribution' is encountered, to look up the value $x_k$ in the dictionary, compute its density under the primitive distribution, multiply this density into a running total, and return control to the probabilistic program as if the sampling instruction had executed and returned $x_k$. At the end, the running total can be returned as the trace's density. If at any point an address $k \notin K$ is encountered, or if not all addresses in $K$ have been visited at the end of execution, the algorithm returns 0 as the density.

We can use two probabilistic programs P and Q to specify the model density $p$ and the auxiliary densities $q_x$ that appear in involutive MCMC (Algorithm 1).
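This standard technique can be sketched as a small Python interpreter (our toy version, not Gen's implementation): the program receives a `choose` callback, and trace-and-score and score reinterpret each named random choice differently.

```python
import math
import random

# Sketch (ours) of trace-and-score and score: the same program is executed
# under two different interpretations of "k ~ distribution".

def normal_logpdf(v, mu, sigma):
    return -0.5 * ((v - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def bernoulli_logpmf(v, p):
    return math.log(p if v else 1.0 - p)

def logpdf(dist, v):
    return bernoulli_logpmf(v, dist[1]) if dist[0] == "bernoulli" \
        else normal_logpdf(v, dist[1], dist[2])

def program(choose):
    # toy probabilistic program whose address set depends on a choice
    z = choose("z", ("bernoulli", 0.3))
    mu = 0.0 if z else 5.0
    choose("x", ("normal", mu, 1.0))

def trace_and_score(p):
    trace, total = {}, [0.0]
    def choose(k, dist):
        v = (random.random() < dist[1]) if dist[0] == "bernoulli" \
            else random.gauss(dist[1], dist[2])
        trace[k] = v
        total[0] += logpdf(dist, v)
        return v
    p(choose)
    return trace, total[0]

def score(p, trace):
    visited, total = set(), [0.0]
    class OffSupport(Exception):
        pass
    def choose(k, dist):
        if k not in trace:
            raise OffSupport()     # address not in the dictionary: density 0
        visited.add(k)
        total[0] += logpdf(dist, trace[k])
        return trace[k]
    try:
        p(choose)
    except OffSupport:
        return float("-inf")
    # density is 0 unless exactly the trace's addresses were visited
    return total[0] if visited == set(trace) else float("-inf")

trace, s = trace_and_score(program)
assert abs(score(program, trace) - s) < 1e-12
assert score(program, {"z": True}) == float("-inf")   # address "x" missing
```

The two bookkeeping rules at the end of score implement the "density 0" cases in the text: an address that is visited but absent from the dictionary, and a dictionary address that is never visited.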
In this case, the program Q accepts a trace $x$ of P as input; we denote the probabilistic program Q applied to input $x$ by $Q_x$.²

Typically, we cannot use a probabilistic program P to represent the target distribution directly: probabilistic programs implement simulators for a distribution, but target densities are typically not tractable to simulate (hence the need for MCMC). Instead, we may wish to sample from a target $p$ that arises from conditioning a probabilistic program P's distribution over traces on observations of the values at some addresses. Let $\tilde{p}$ denote P's distribution over traces, and let $\mathbf{b} = (B, b)$ be a dictionary of observations, as described in Section 3.2. Then, as shown in that section, the target distribution is $p(x) = \tilde{p}(x \mid b) = \tilde{p}(x \oplus b)/L(b)$, where $L(b) = \int_{\{m \,:\, K_m \cap B = \emptyset\}} \tilde{p}(m \oplus b)\, \mu_D(\mathrm{d}m)$ is the marginal likelihood of $b$, which does not depend on $x$. Then $\tilde{p}(x \oplus b) = L(b)\, p(x)$. If we use $\tilde{p}(x \oplus b)$ in place of $p(x)$ to compute the ratio of densities in Algorithm 1, we will wind up with the same output, because $L(b)$ cancels in the numerator and denominator:

$$\frac{p(x')\, q_{x'}(y')}{p(x)\, q_x(y)} = \frac{\tilde{p}(x' \oplus b)\, q_{x'}(y')}{\tilde{p}(x \oplus b)\, q_x(y)} = \frac{\tilde{p}(x' \mid b)\, q_{x'}(y')}{\tilde{p}(x \mid b)\, q_x(y)} \qquad (3)$$

Thus, this ratio can be computed term-by-term using score on the probabilistic programs P and Q (Algorithm 2): to compute $q_{x'}(y')$, we run the algorithm directly on the program $Q_{x'}$, and to compute $p(x)$ and $p(x')$, we actually merge the dictionary $x$ with the observations $b$ and compute $L(b)\, p(x) = \tilde{p}(x \oplus b)$ instead, by running the algorithm on P with trace $x \oplus b$.

²The programs must satisfy a mild technical requirement for our formalism to go through; see Section A.5.
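The cancellation of $L(b)$ in Equation (3) can be checked numerically on a toy conjugate model (our example, not from the paper; the normal-normal model and its posterior are standard facts used only for illustration): scoring with the unnormalized joint $\tilde{p}(x \oplus b)$ gives the same acceptance ratio as the conditional density.

```python
import math

# Numeric sketch (ours) of the L(b) cancellation in Equation (3).

def normal_logpdf(v, mu, sigma):
    return -0.5 * ((v - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_joint(x, b):
    # log of the unnormalized joint p~(x ⊕ b) for: x ~ N(0, 1), b ~ N(x, 1)
    return normal_logpdf(x, 0.0, 1.0) + normal_logpdf(b, x, 1.0)

def log_posterior(x, b):
    # exact conditional for this conjugate model: x | b ~ N(b/2, 1/sqrt(2))
    return normal_logpdf(x, b / 2.0, 1.0 / math.sqrt(2.0))

b, x, x_new = 1.0, 0.3, 0.9
log_ratio_joint = log_joint(x_new, b) - log_joint(x, b)
log_ratio_cond = log_posterior(x_new, b) - log_posterior(x, b)
assert abs(log_ratio_joint - log_ratio_cond) < 1e-9   # log L(b) cancels
```

In general the conditional density is unavailable in closed form, which is why Algorithm 2 only ever scores the joint $\tilde{p}(x \oplus b)$.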
The density $q_x(y)$ can be computed while $y$ is being sampled, using trace-and-score. Section 5 describes a more efficient approach that exploits sparsity in the involution and cancellations in the acceptance ratio for improved efficiency, but requires a more sophisticated probabilistic programming runtime system. Figure 1b and Figure 1c show examples of probabilistic programs P and Q, respectively, for a split-merge reversible jump move.

Section 4.1 showed that if the densities $p$ and $q$ are specified using probabilistic programs P and Q, then the density ratio in the acceptance probability for involutive MCMC on dictionaries (Equation (2)) can be automated using probabilistic programming techniques. This section shows that if the involution $f$ is specified using a differentiable program F that transforms the execution traces of probabilistic programs, then the Jacobian factor in Equation (2) can also be automated, using automatic differentiation. The procedure auto-involutive-mcmc in Algorithm 2 combines these two ideas and automates involutive MCMC given the programs P, Q, and F.

Recall that for involutive MCMC on a state space of dictionaries, we define $Z := \{x \oplus y : p(x)\, q_x(y) > 0\} \subseteq D$, where $x \oplus y$ is the dictionary resulting from merging dictionaries $x$ and $y$ with disjoint keys. The involution is a function $f : Z \to Z$.

We now introduce a simple differentiable programming language for specifying involutions $f$. The language needs syntax for reading the value at an address in $x \oplus y \in Z$ and for writing to an address in $x' \oplus y' \in Z$. To read the value $(x \oplus y)[a]$ at address $a$ in the input model trace, we use the @read keyword, as in value = @read(model_in[:u], continuous), and to write a value to an address in the output model trace we use the @write keyword, as in @write(model_out[:v], value, continuous). To copy the value from address :u in $x$ to address :v in $x'$ directly, we use:

@copy(model_in[:u], model_out[:v])

Of course, it is also possible to copy from $x$ to $y'$, from $y$ to $x'$, and from $y$ to $y'$. As we will see in Section 5, it is preferable to use @copy when possible instead of reading and then writing, as this can make the acceptance probability calculation more efficient.

Constructing an Involution
Consider the following generative model, which posits that a vector of univariate data is either generated from a single normal distribution or a mixture of two normal distributions. The model is expressed as a probabilistic program:

@gen function p()
    k ∼ uniform_discrete(1, 2)
    means = [{(:mu, j)} ∼ normal(0, 10) for j in 1:k]
    weights = ones(k)/k
    vars = ones(k)
    for i in 1:100
        {(:x, i)} ∼ mixture_of_normals(weights, means, vars)
    end
end

This is a simplified version of the infinite Gaussian mixture model in Figure 1b. Each key of the form (:x, i) will be observed (in the dictionary $b$), so the latent part of the trace (the dictionary $x$) contains the other keys: :k and (:mu, 1) and (when k = 2) (:mu, 2). We construct an involutive MCMC move that switches k from 2 to 1 or vice versa. The discrete part of the involution is straightforward:

k = @read(model_in["k"], discrete)
if k == 1
    @write(model_out["k"], 2, discrete)
else
    @write(model_out["k"], 1, discrete)
end

Writing the involution code for the continuous choices is more complex. There is no one-to-one correspondence between the space of dictionaries with keys {:k, (:mu, 1)} and the space of dictionaries $x$ with keys {:k, (:mu, 1), (:mu, 2)}. Therefore, we need to extend the state space using the auxiliary distribution, defined with the following probabilistic program:

@gen function q(model_trace)
    if model_trace["k"] == 1
        u ∼ beta(2, 2)
    end
end

Our involution is on the space of combined dictionaries $x \oplus y$ where $y$ are traces of q. Note that for each $(x \oplus y)$ where $x[k] = 1$, $y$ has a key u, and for each $(x \oplus y)$ where $x[k] = 2$, $y$ is empty. We extend the involution using a pair of bijections between $(\mu_1, \mu_2)$ and $(\mu, u)$ that show how two cluster means should be transformed into one cluster mean (merge), and vice versa (split):

$$(\mu_1, \mu_2) \mapsto \left(\tfrac{\mu_1 + \mu_2}{2},\ \mu_2 - \tfrac{\mu_1 + \mu_2}{2}\right) \quad \text{[Merge]} \qquad (\mu, u) \mapsto (\mu - u,\ \mu + u) \quad \text{[Split]} \qquad (4)$$

The value of $x[k]$ determines which of these functions is executed. The full involution program F is then:

@transform f (model_in, aux_in) to (model_out, aux_out) begin
    k = @read(model_in["k"], discrete)
    if k == 1
        @write(model_out["k"], 2, discrete)
        mu = @read(model_in[(:mu, 1)], continuous)
        u = @read(aux_in[:u], continuous)
        @write(model_out[(:mu, 1)], mu - u, continuous)
        @write(model_out[(:mu, 2)], mu + u, continuous)
    else
        @write(model_out["k"], 1, discrete)
        mu1 = @read(model_in[(:mu, 1)], continuous)
        mu2 = @read(model_in[(:mu, 2)], continuous)
        mean = (mu1 + mu2) / 2
        @write(model_out[(:mu, 1)], mean, continuous)
        @write(aux_out[:u], mu2 - mean, continuous)
    end
end

The two continuous bijections (implemented in the blocks of code in the two branches) are inverses of one another. This is a common pattern in involution programs—an involution on the discrete parts of the traces (in this case, just $x[k]$) determines via control flow which continuous code blocks get executed, such that the end-to-end program defines an involution. Note that the observations $b$ (in this case, the x-coordinates) are included in traces of p, which take the form $x \oplus b$ where $x$ is the latent part and $b$ is the observed part. The involution program F is allowed to read from the observations in the model trace using the same syntax as used to read from the latent part of the trace $x$. (The example given here does not utilize this feature.)

Recall that the involution $f$ must decompose into (i) an involution $g$ on elements $e \in E$ of a partition of the state space, and (ii) a pair of continuously differentiable bijections $h_e$ and $h_{g(e)} = h_e^{-1}$ between each pair of corresponding elements of the partition. Each function $h_e$ maps the values at continuous addresses in the input trace ($z[k]$ for $k \in K_z \cap J$) to the values at continuous addresses in the output trace ($z'[k]$ for $k \in K_{z'} \cap J$). In the example above, the partition $E$ is given by the equivalence classes of the relation $(x_1, y_1) \sim (x_2, y_2) \iff x_1[k] = x_2[k]$ (there are two equivalence classes), $g$ maps the $k = 1$ class to the $k = 2$ class and vice versa, and the continuous bijections $h_e$ and $h_{g(e)}$ are the functions in Equation (4).

We compute the Jacobian using automatic differentiation.
(The Gen implementation uses forward-mode AD, whereas our PyTorch implementation uses reverse-mode AD.) The procedure run-involution in Algorithm 2 shows the implementation of the interpreter for the language that uses reverse-mode AD. The interpreter executes F using regular Julia or Python semantics, but intercepts calls to @write, @read, and @copy statements, and in addition to performing the desired operation, records the sets of continuous addresses that are read, written, and copied. After F has finished executing, AD uses the recorded addresses to compute the Jacobian $|Jh_e|(z)$; in the case of reverse-mode AD, this is accomplished by iterating over the output continuous addresses and backpropagating from each one to all of the input continuous addresses, computing the Jacobian column-by-column. Keys with an $n$-dimensional Lebesgue measure as their reference measure, corresponding to vector-valued random choices, are unpacked into $n$ separate columns in the Jacobian; this detail is elided in Algorithm 2. The discrete and continuous labels are also omitted from the syntax in run-involution to simplify notation ($a \in I$ indicates a discrete choice, and otherwise a choice is continuous).

Suppose $N$ and $N'$ are the total numbers of random choices in the input traces $x, y$ and output traces $x', y'$, respectively (note that the observations $b$ are excluded—$x$ and $x'$ are the latent parts of the model's traces only). Let $M$ be the number of continuous random choices among these (which must be the same in the input traces and output traces). Then the number of operations used in Algorithm 2 grows as $O(N + N') + O(M^3)$. The linear term is due to sampling $y'$ and computing the four log-densities required for the acceptance probability.
The cubic term is due to computing the Jacobian determinant, which is also required for the acceptance probability (Equation 2).

It is possible to reduce the number of operations performed in an automated involutive MCMC kernel by exploiting special structure in the involution $f$. In some cases, this structure can lead to $O(1)$ operations per kernel application (i.e., constant in $N$, $N'$, and $M$). This section describes techniques for exploiting involution structure within an automated involutive MCMC implementation. These techniques are used in the Gen implementation, and one of the techniques is used in our minimal PyTorch implementation.

The naive implementation of Algorithm 2 computes the Jacobian term by first computing the $M$-by-$M$ Jacobian matrix $Jh_e(z)$ via automatic differentiation, and then computing the absolute value of its determinant. However, we observe that in many applications of involutive MCMC, the values at continuous choices in the input traces are directly copied into the output traces (either at the same key or at a different key). These copy operations result in columns of the Jacobian matrix that have a single 1 entry, with the remaining entries equal to 0. For example, for the function $(u, v, x, y) \mapsto (u, u - v, y, x) = (u', v', x', y')$, the Jacobian is (with columns corresponding to $u'$, $v'$, $x'$, and $y'$, and rows corresponding to $u$, $v$, $x$, and $y$):

$$\begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

Using the cofactor expansion of the determinant, we observe that for any 'copy' column in an $M$-by-$M$ Jacobian matrix (a column with a single 1 and all other entries 0), the absolute value of the determinant is equal to that of the $(M-1)$-by-$(M-1)$ sub-matrix with the corresponding column and row omitted (even if that would remove other nonzero entries from the matrix). By applying this rule recursively, we can instead compute the determinant of a much smaller matrix; for the example above with $M = 4$, the absolute value of the determinant simplifies to the absolute value of a single entry ($|-1|$). Indeed, if some input key is copied to some output key, then we can entirely avoid computing its row (and the corresponding column) of the Jacobian. Therefore, the number of operations (which is dominated by the determinant) reduces from $M^3$ to $(M - |C|)^3$, where $|C|$ is the number of input keys that were copied to some output key. The @copy statement in our differentiable programming language makes the set of input keys that were copied explicitly available to the interpreter, which makes automating this optimization straightforward, as shown in run-involution in Algorithm 2.

Many involutive MCMC kernels modify only a constant number of keys that does not depend on the sizes $N$ and $N'$ of the input and output traces or the total number of continuous keys $M$; the other keys are copied over unchanged. For example, consider the Jacobian for a split-merge reversible jump move (Richardson and Green, 1997), which is implemented in Figure 1: there, the move splits cluster 2 into two clusters (cluster 2 and cluster 4), and black squares in the figure indicate nonzero entries of the Jacobian. In this case, the size $N$ of the latent part of the trace grows linearly in the number of clusters, but the Jacobian determinant can be calculated from only the 6-by-6 submatrix of the Jacobian that involves the parameters of the one cluster being split (or the two clusters being merged). Therefore, the acceptance ratio computation reduces from $O(N + N') + O(M^3)$ to $O(N + N')$.
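The copy-column reduction can be demonstrated directly on the worked example above; this Python sketch (ours) removes the copy rows and columns and checks that the absolute determinant is unchanged.

```python
# Sketch (ours) of the copy-column optimization: deleting the row and
# column of each "copy" entry leaves the absolute determinant unchanged,
# so only the genuinely transformed coordinates enter the computation.

def det(m):
    # cofactor expansion along the first row; fine for small matrices
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([r[:j] + r[j + 1:] for r in m[1:]])
               for j in range(len(m)))

# Jacobian of (u, v, x, y) -> (u, u - v, y, x); columns are the outputs
# u', v', x', y', rows are the inputs u, v, x, y.
J = [[1,  1, 0, 0],
     [0, -1, 0, 0],
     [0,  0, 0, 1],
     [0,  0, 1, 0]]
copy_cols = {0, 2, 3}    # u' copies u, x' copies y, y' copies x
copy_rows = {0, 2, 3}    # the rows holding those single 1 entries

keep = [i for i in range(4) if i not in copy_rows]
J_small = [[J[r][c] for c in range(4) if c not in copy_cols] for r in keep]
assert abs(det(J)) == abs(det(J_small)) == 1   # both reduce to |-1|
```

Here three of the four columns are copies, so the 4-by-4 determinant collapses to the single entry for $v'$, exactly the $(M - |C|)^3$ saving described above.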
While a reasonable hand-coded implementation of this algorithm would likely perform this sort of optimization as well, Algorithm 2 automates it.

Consider input traces $(x, y)$ and output traces $(x', y')$ with $f(x, y) = (x', y')$ for the involution $f$ encoded by program F (again, $x$ and $x'$ include only the latent part of the model trace and not the observed part $b$). Let $N_x := |K_x|$, $N_y := |K_y|$, $N'_x := |K_{x'}|$, and $N'_y := |K_{y'}|$, and let $N := N_x + N_y$ and $N' := N'_x + N'_y$. Suppose that $N_x + N'_x \gg N_y + N'_y$, which often occurs when the number of latent variables is large, but the involutive MCMC move only updates a portion of the latent variables. Algorithm 2 runs F, which explicitly writes the value for each element of $(x', y')$, requiring $N' = N'_x + N'_y \approx N'_x$ 'write' or 'copy' operations. The algorithm also uses approximately $N + N'$ operations to evaluate the log-densities, because it accumulates the log-density of each random choice in $x$ and $x'$.

We now show how to modify the program F and its interpreter so that both computing $z' = (x', y')$ from $z = (x, y)$ and computing the log-density ratio $\log(\tilde{p}(x' \oplus b)/\tilde{p}(x \oplus b))$ can take $O(1)$ operations (that is, constant in $N + N'$) when the involution $f$ has sparse structure and the model density $p$ has conditional independencies.

Note that $x$ and $x'$ may both have values for certain keys, and $x$ may contain keys not present in $x'$, and vice versa. Often, an involution does not modify the values of many keys in the trace—this is the case in the split-merge reversible jump move described earlier. Suppose that $A \subseteq K_x \cap K_{x'}$ is the set of keys $k$ in both $x$ and $x'$ for which $x'[k] = x[k]$.
Let $\Delta := K_{x'} \setminus A$ be the set of keys in $x'$ that are either (i) not in $x$, or (ii) in $x$ but with their value changed. Then the previous trace $x$, a dictionary $\delta$ with $K_\delta = \Delta$ and $\delta[k] = x'[k]$, and the density function $p$ suffice to uniquely determine the trace $x'$, provided the density satisfies the following condition, which is satisfied by densities defined by probabilistic programs:³ $p(x) > 0$, $p(x') > 0$, and $K_x \neq K_{x'}$ together imply that there exists $k \in K_x \cap K_{x'}$ with $x[k] \neq x'[k]$. Let $x|_A$ denote the restriction of a dictionary $x$ to only the keys in $A$. Then naive-trace-update (Algorithm 3) computes the new trace $x'$ and the log-density ratio $\log(p(x')/p(x))$ of a probabilistic program from a previous trace $x$ and a dictionary $\delta$ (and observations $b$), provided there exists $x' = \delta \oplus x|_A$ for some $A$ where $p(x') > 0$.

naive-trace-update is a naive implementation of a general trace-update operation that is part of Gen's API (Cusumano-Towner et al., 2019). While the naive implementation requires $O(N + N')$ operations, it is possible for more sophisticated implementations of trace-update to run in approximately $O(|K_\delta|)$ operations for certain probabilistic programs P, by exploiting conditional independence structure in the program that leads to cancellations in the density ratio $\tilde{p}(x' \oplus b)/\tilde{p}(x \oplus b)$. The details of these more efficient implementations are outside the scope of this paper, but they are implemented as part of the Gen system.

Given an implementation of trace-update (with the same input and output signature as naive-trace-update), we can optimize Algorithm 2 by modifying the requirement for the involution program F: instead of explicitly specifying a value for all keys in

³Intuitively, two traces of a probabilistic program cannot have different sets of addresses unless they share an address and disagree on its value.
Algorithm 3
Naive Trace Update
Require: probabilistic program P; previous trace x; observations b; dictionary δ such that there exists x′ = (δ ⊕ x|_A) for some A where p(x′) > 0

procedure naive-trace-update(P, x, b, δ)
    assert K_x ∩ K_b = ∅
    x′ ← {}; s′ ← 0
    Execute P, but with "k ∼ distribution" ≡ (
        1. if k ∈ K_δ then v ← δ[k] else v ← (x ⊕ b)[k]
        2. if k ∉ K_b then set x′[k] ← v
        3. set s′ ← s′ + logpdf(distribution, v)
        4. evaluate to v)
    s ← score(P, x ⊕ b)
    return (x′, s′ − s)
end procedure

$x'$, it need only explicitly specify the values in $\delta$. (We call the other values in $x'$ implicit copies—they could be explicitly copied within F, but this would result in unnecessary code and possibly unnecessary computation.) Therefore, the involution program can run in $O(|K_\delta| + N_y + N'_y)$ operations. We remove the separate calls to score for the log-densities $\log \tilde{p}(x \oplus b)$ and $\log \tilde{p}(x' \oplus b)$ and replace them with a single call to trace-update on P that takes in $x$ and $\delta$ and returns $x'$ and $\log(\tilde{p}(x' \oplus b)/\tilde{p}(x \oplus b))$. When $|K_\delta|$, $N_y$, and $N'_y$ are all constant in the size $N_x$ of the latent part of the model trace ($x$), the resulting algorithm uses a number of operations that does not grow with $N + N'$.

The automated technique described in the previous sections allows users to implement involutive MCMC algorithms by writing probabilistic programs for $p$ and $q$ and a differentiable program for the involution $f$, avoiding the need to derive and implement the accept/reject formula by hand. This eliminates certain classes of errors that could otherwise be difficult to root out in hand-coded implementations of MCMC algorithms. However, it is of course still possible to implement $p$, $q$, or $f$ incorrectly, introducing bugs that may invalidate the correctness proofs for the involutive MCMC algorithm.

Fortunately, the factorization of diverse algorithms into a common template, involving an explicitly represented model, auxiliary distribution, and involution, enables simple and automated debugging checks that can catch a wide variety of such errors dynamically. These checks are not complete, but anecdotally, we have found that they often alert users to subtle bugs in the design or implementation of an algorithm. The checks are summarized in Algorithm 4.

Algorithm 4 Dynamic Check for Bugs

procedure involutive-mcmc-check(P, Q, F)
    ▷ Generate a randomized test case
    ((x ⊕ b), _) ∼ trace-and-score(P)
    (y, _) ∼ trace-and-score(Q_x)
    ▷ Run the involution and perform the dimension check
    x′ ⊕ y′ ← run-involution(F, x ⊕ y, b)
    ▷ Support check
    log p̃(x′ ⊕ b) ← score(P, x′ ⊕ b)
    log q_{x′}(y′) ← score(Q_{x′}, y′)
    assert (log p̃(x′ ⊕ b) > −∞) ∧ (log q_{x′}(y′) > −∞)
    ▷ Involution check
    x̃ ⊕ ỹ ← run-involution(F, x′ ⊕ y′, b)
    assert (x̃ = x) ∧ (ỹ = y)
end procedure

The check first randomly samples a model trace $x$ and simulated observations $b$ from the model prior $p$, and an auxiliary trace $y$ from $q_x$. It then runs three tests on this simulated test case:

1. Support check.
The support check runs the involution F on $(x, y)$ and checks that the resulting pair of traces, $(x', y')$, lies within the set $Z$ of positive-density elements for $\pi$.

2. Dimension check.
The dimension check runs the involution F on $(x, y)$, records the sets of addresses $R$ and $W$ at which F performs continuous reads and writes, respectively, and checks that the sizes of these sets match. (This check is part of computing the determinant of the Jacobian in run-involution.)

3. Involution check.
The involution check runs F twice to compute $f(f(x, y))$ and checks that the result is equal to $(x, y)$. If $f$ is an involution on $Z$, then this check will always succeed; if there is a set $\tilde{Z}$ of positive $\pi$-measure on which $f(f(z)) \neq z$ (i.e., $f$ is not an involution), then this check has positive probability of failing.

These checks each catch qualitatively different bugs in user programs. We now give several examples.

Incorrectly transformed continuous variables.
Many bugs in the design or implementation of deterministic transformations of continuous variables are naturally detected by the involution check. Consider, for example, the Hamiltonian Monte Carlo algorithm (Duane et al., 1987), which, as Neklyudov et al. (2020) observe, is an instance of the involutive MCMC framework. When applied to HMC, the dynamic bug check involves: (a) sampling a model state $x$ from the prior; (b) sampling a momentum $y$; (c) running the leapfrog integrator forward to get a new state $x'$ and new momentum $-y'$; (d) running the leapfrog integrator backward from state $x'$ with momentum $y'$; and (e) checking that this results in the state-momentum pair $(x, -y)$. If the momentum is not properly negated, or if the leapfrog integrator is incorrectly implemented, this check can fail.

As another, simpler example of this type of error, consider the split_params and merge_params functions invoked in Figure 1, as part of a reversible jump MCMC kernel for inferring a mixture of Gaussians with an unknown number of components. Considering only the $\mu$ parameter, suppose the split move samples an auxiliary variable $u$ and computes $\mu_1 = \mu + u$, $\mu_2 = \mu - u$ as the means of the two new clusters. If the merge move joins two clusters and assigns $\mu = \sqrt{\mu_1 \mu_2}$, there is a mismatch: a split move cannot be reversed by a corresponding merge, because in general $\sqrt{(\mu + u)(\mu - u)} \neq \mu$. This error could be fixed either by changing the split to compute $\mu_1 = \mu u$, $\mu_2 = \mu / u$, or by changing the merge to compute $\mu = (\mu_1 + \mu_2)/2$.

Discrete logic errors.
Another common class of errors is for the discrete logic of an involution to be flawed. Consider the following incorrect implementation of a birth-death move for the mixture model in Figure 1, which either adds a new mixture component or selects an existing one at random to delete (we name the traces tr1, tr2, tr3, and tr4 for reasons of space):

@gen function q(trace)
    current_k = trace[:k]
    is_birth ∼ bernoulli(current_k == 1 ? 1.0 : 0.5)
    if is_birth
        new_mu ∼ normal(0, 10)
        new_var ∼ inv_gamma(1, 10)
    else
        deletion_idx ∼ uniform_discrete(1, current_k)
    end
end

@transform h (tr1, tr2) to (tr3, tr4) begin
    is_birth = @read(tr2[:is_birth], :discrete)
    @write(tr4[:is_birth], !is_birth, :discrete)
    k = @read(tr1[:k], :discrete)
    weights = @read(tr1[:weights], :continuous)
    if is_birth
        @write(tr3[:k], k+1, :discrete)
        new_mu = @read(tr2[:new_mu], :continuous)
        new_var = @read(tr2[:new_var], :continuous)
        @write(tr3[(:mu, k+1)], new_mu, :continuous)
        @write(tr3[(:var, k+1)], new_var, :continuous)
        new_weights = add_weight(weights)
        @write(tr4[:deletion_idx], k+1, :discrete)
    else
        idx = @read(tr2[:deletion_idx], :discrete)
        @copy(tr1[(:mu, idx)], tr4[:new_mu])
        @copy(tr1[(:var, idx)], tr4[:new_var])
        for i in (idx+1):k
            @copy(tr1[(:mu, i)], tr3[(:mu, i-1)])
            @copy(tr1[(:var, i)], tr3[(:var, i-1)])
        end
        @write(tr3[:k], k-1, :discrete)
        new_weights = delete_weight(weights, idx)
    end
    @write(tr3[:weights], new_weights, :continuous)
end

The flaw in this implementation is that although the death move can delete any of the k mixture components, the birth move can only add a new component at the end (index k+1), so the move is not reversible. The involution check will discover that the deletion of a component with index i < k is not reversed by a corresponding birth move, and will thus raise an error.

Other miscellaneous errors.
When implementing distributions as probabilistic programs, it is also possible for users to make more mundane errors, such as spelling the name of a random choice inconsistently, or characterizing random choices using the wrong type tags (:continuous and :discrete). Such errors can be difficult to detect statically, because the addresses at which a probabilistic program makes random choices, and the distributions of those choices, may change from sample to sample. (Previous work has explored static analyses based on types (Lew et al., 2019) and abstract interpretation (Lee et al., 2019), but these each work on limited subsets of the programs that Gen's full modeling language permits, and it is often precisely these more complex programs that require the flexibility of the involutive MCMC framework in the first place.) Our dynamic support and dimension checks can help to detect bugs like these. For example, if the involution f writes to a misspelled address, the support check will determine that the resulting trace's density is 0.

Dynamic checks during inference.
These dynamic assertions can also be run during inference, at each application of the transition kernel. This can be useful for catching bugs that only occur in regions of the state space with low prior mass (but perhaps high posterior mass). When enabled, we run the dynamic checks after each application of the kernel, and when they fail, we write to a debugging log and reject the proposed new state. As it turns out, the kernel induced by this procedure is still stationary for p:

Lemma 6.1.
Let p and q_x be model and auxiliary densities as above, but suppose f : D×D → D×D may not be an involution on Z. Trace-based involutive MCMC with dynamic checks enabled, rejecting whenever such a check fails, still yields a kernel that is stationary for p.

Proof. Let R = {x ∈ Z | f(f(x)) = x and π(f(x)) > 0}, and let h(x) := [x ∈ R] f(x) + [x ∉ R] x. Then h is an involution on Z, and trace-based involutive MCMC with p, q_x, and h yields a stationary kernel. But this kernel is the same one induced by using f with dynamic checks. For x on which the dynamic checks succeed, h is equivalent to f. For x on which the dynamic checks fail, h is equivalent to the identity; thus, accepting a move produced by h is equivalent to rejecting.

Reversible jump MCMC (Green, 1995; Hastie and Green, 2012) is a special case of involutive MCMC, and the implementation of reversible jump MCMC kernels can be automated using the probabilistic and differentiable programming languages presented in this paper. We now review reversible jump MCMC, then show how it can be automated using the techniques presented earlier, and give an example.
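The leapfrog round-trip check described above is easy to realize concretely. The sketch below is Python rather than Gen's Julia, with a standard normal target and an arbitrary step size and step count chosen for illustration; it applies the HMC involution twice and asserts that the starting state-momentum pair is recovered, and that a buggy kernel that omits the momentum negation fails the check:

```python
def leapfrog(x, y, grad_u, eps=0.1, steps=10):
    """Leapfrog integration of Hamiltonian dynamics for potential U."""
    y = y - 0.5 * eps * grad_u(x)
    for step in range(steps):
        x = x + eps * y
        if step < steps - 1:
            y = y - eps * grad_u(x)
    y = y - 0.5 * eps * grad_u(x)
    return x, y

def hmc_involution(x, y, grad_u):
    """One HMC proposal viewed as an involution: integrate, then negate momentum."""
    x_new, y_new = leapfrog(x, y, grad_u)
    return x_new, -y_new

grad_u = lambda x: x  # U(x) = x^2 / 2, i.e., a standard normal target
x0, y0 = 1.3, -0.4

# Dynamic involution check: applying the kernel twice recovers (x0, y0),
# because the leapfrog integrator is time-reversible.
x2, y2 = hmc_involution(*hmc_involution(x0, y0, grad_u), grad_u)
assert abs(x2 - x0) < 1e-9 and abs(y2 - y0) < 1e-9

# A buggy kernel that forgets to negate the momentum fails the check:
# two forward integrations just move further along the trajectory.
xb, yb = leapfrog(*leapfrog(x0, y0, grad_u), grad_u)
assert abs(xb - x0) > 1e-3 or abs(yb - y0) > 1e-3
```

The same check-and-reject discipline of Lemma 6.1 applies: were the negation bug present, every application of the kernel would trip the round-trip assertion and the move would be rejected.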
Review of reversible jump MCMC.
The reversible jump MCMC framework involves a set of 'models' h ∈ H and a prior distribution on models p(h). For each model, there is a latent continuous parameter vector θ_h ∈ R^{n(h)}, where n(h) is the dimension of model h, and a likelihood function L_{D,h}(θ_h) for each h given data D. The latent state x is a pair (h, θ_h) of a model and continuous parameters. There is a set of move types M. Each move type m ∈ M is associated with an unordered pair of models (h_1, h_2) and a dimensionality d(m) such that d(m) ≥ n(h_1) and d(m) ≥ n(h_2) (zero, one, or more than one move types may be associated with a given pair of models). For each latent state x = (h, θ_h), there is a probability distribution q_x(m) on move types such that q_x(m) > 0 only if h is one of the models for move type m. For each move type m ∈ M between h_1 and h_2, there is a pair of continuously differentiable bijections g_{m,h_1→h_2} : R^{d(m)} → R^{d(m)} and g_{m,h_2→h_1} := g_{m,h_1→h_2}^{-1}, and a pair of proposal densities q_{m,h_1→h_2}(u_{h_1→h_2}) and q_{m,h_2→h_1}(u_{h_2→h_1}), where u_{h_1→h_2} ∈ R^{d(m)−n(h_1)} and u_{h_2→h_1} ∈ R^{d(m)−n(h_2)}. A proposal is made from state x = (h, θ_h) by (i) sampling a move type m ∼ q_x(·), (ii) sampling a continuous variable u ∼ q_{m,h→h′}(·) for the pair (h, h′) associated with m, and (iii) computing (θ′_{h′}, u′) := g_{m,h→h′}(θ_h, u) and proposing the new state x′ = (h′, θ′_{h′}).

Encoding reversible jump in involutive MCMC.
To encode reversible jump MCMC in our framework, we write a probabilistic program P that encodes the space of models H, the prior distribution on models p(h), the per-model priors p_h(θ_h), and the per-model likelihoods L_{D,h}(θ_h). The set of all models h is encoded in the set of all pairs (K_P, d_P), where K_P represents a possible trace structure (i.e., a control-flow path through P) and d_P is an assignment to the discrete random choices made by P. The per-model continuous parameters θ_h are encoded via the continuous random choices c_P. The auxiliary probabilistic program Q encodes both the probability distribution on move types, using discrete random choices and possibly stochastic control flow ((K_Q, d_Q)), and the per-move-type probability densities on u, using continuous random choices c_Q. The involution f factors into (i) an involution f_1 on pairs i = ((K_P, d_P), (K_Q, d_Q)) that defines the association between the move types M and the model pairs (h_1, h_2); and (ii) a family of bijections f_{2,i} on the space of pairs (c_P, c_Q) of continuous random choices for both programs, for fixed values of the discrete random choices and a fixed trace structure.

Example: Split-merge reversible jump.
Figure 1 shows a split-merge reversible jump kernel for an infinite Gaussian mixture model (Richardson and Green, 1997), implemented using the probabilistic and differentiable programming languages described in this paper. Figure 1b shows the infinite Gaussian mixture model, specified as a probabilistic program p. The program takes the number of data points as input, samples the number of clusters from a Poisson distribution, then samples cluster parameters and mixture proportions, and finally samples the data points from the resulting finite mixture. Figure 1c shows the auxiliary probabilistic program q for the split-merge kernel. This program takes a trace of the model program as input and randomly decides whether to split a cluster, increasing the number of clusters by one, or to merge two clusters, decreasing the number of clusters by one. Then, the program randomly picks which cluster to split, or which clusters to merge. This kernel always merges the last cluster with a random other cluster; for ergodicity, the move can be composed with a simple move (with acceptance probability 1) that swaps a random cluster with the last cluster. If a split is chosen, then the program also samples the three degrees of freedom necessary to generate the new parameters for the clusters in an invertible manner. Figure 1d shows a differentiable program specifying the involution for the split-merge kernel, and Figure 1f shows graphically how this involution acts on pairs of traces. The yellow section (1) defines an involution f_1 on the discrete random choices, which specifies that (i) the split choice should be flipped (so that split moves are always mapped to merge moves and vice versa), (ii) the number of clusters should be increased by one for a split move and decreased by one for a merge move, and (iii) which merged cluster corresponds to which split clusters.
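To make the role of the continuous part of such a kernel concrete, the following sketch (Python rather than Gen, simplified to a single mean parameter instead of the three degrees of freedom used in Figure 1d; the function names are hypothetical) shows a moment-matching split, its inverse merge, the Jacobian factor that enters the acceptance probability, and the mismatched merge discussed earlier failing the round-trip check:

```python
import math

def split_mean(mu, u):
    """Split a cluster mean into two, moment-matching: (mu, u) -> (mu1, mu2)."""
    return mu + u, mu - u

def merge_mean(mu1, mu2):
    """Inverse of split_mean: (mu1, mu2) -> (mu, u)."""
    return (mu1 + mu2) / 2.0, (mu1 - mu2) / 2.0

mu, u = 2.0, 0.7
mu1, mu2 = split_mean(mu, u)

# Round-trip check: the merge inverts the split.
mu_back, u_back = merge_mean(mu1, mu2)
assert abs(mu_back - mu) < 1e-12 and abs(u_back - u) < 1e-12

# The Jacobian of (mu, u) -> (mu + u, mu - u) is [[1, 1], [1, -1]]; its
# absolute determinant, 2, is the factor the split acceptance ratio picks up
# (the merge direction contributes 1/2).
abs_det = abs(1 * (-1) - 1 * 1)
assert abs_det == 2

# The buggy merge discussed earlier, mu = sqrt(mu1 * mu2), does not invert
# the split: sqrt((mu + u) * (mu - u)) != mu in general.
assert abs(math.sqrt(mu1 * mu2) - mu) > 1e-3
```

In the automated system, the Jacobian factor is computed by automatic differentiation of the involution program, and the round-trip assertion is exactly the dynamic involution check.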
The green section (2) specifies the continuous bijections that govern the transformation of continuous random choices during split moves, and the purple section (3) specifies the inverses of these bijections, which govern the transformation of continuous choices during merge moves.

Figure 2 shows automated involutive MCMC being applied to fully Bayesian inference over the covariance function of a Gaussian process, where the prior on covariance functions (Figure 2a) is based on a probabilistic context-free grammar. The inference algorithm is based on an involutive MCMC kernel that uses a state-dependent mixture of proposals. A variant of this inference algorithm was previously studied (Schaechtle et al., 2016; Saad et al., 2019) based on a model of Grosse et al. (2012).
Hierarchical address spaces.
This example uses probabilistic programs that invoke other probabilistic programs, sometimes recursively. For example, the model probabilistic program p invokes the probabilistic program cov_function_prior, which is itself recursive. Similarly, the auxiliary probabilistic program q invokes walk_tree (which is recursive) as well as cov_function_prior. Consider the syntax used to recursively invoke walk_tree within walk_tree:

    ({:left} ∼ walk_tree(node.left, path))

This expression resembles a random choice expression. However, instead of associating the return value of the function walk_tree with the address :left, the address :left is associated with the entire trace of random choices made within walk_tree. That is, :left is the namespace for the addresses of all random choices made within the invocation. Further invocations by the callee themselves result in nested namespaces. This process results in a hierarchical address space for random choices, as shown in Figure 2b. This does not modify the mathematical formalism: each random choice made during the execution of a probabilistic program still has a unique address, but the address has multiple components that localize it within the hierarchy. For example, the choice :done in a recursive call to walk_tree might have the address

    k = (:left => :right => :done)

('=>' is the Gen syntax for constructing hierarchical addresses).

A complex state-dependent distribution.
At each iteration of the MCMC algorithm, the auxiliary probabilistic program q (Figure 2d) first picks a random node in the parse tree of the covariance function, by doing a stochastic walk of the existing parse tree that terminates at the chosen node.

    prev_cov_function = trace[:cov_function]
    path ∼ walk_tree(prev_cov_function, ..)

The code that walks the tree uses the following recursion, which results in a probability distribution that assigns exponentially lower probability to nodes that are deeper in the tree.

    if ({:done} ∼ bernoulli(0.5))
        return path
    elseif ({:recurse_left} ∼ bernoulli(0.5))
        path = (path..., :left_node)
        return ({:left} ∼ walk_tree(node.left, path))
    else
        path = (path..., :right_node)
        return ({:right} ∼ walk_tree(node.right, path))
    end

The resulting distributions on selected nodes for two possible input trees are shown below:
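The node-selection distribution induced by this recursion can also be computed directly for small trees. The sketch below is Python; the tuple-based tree representation, the covariance-kernel labels, and the convention that the walk always terminates upon reaching a leaf are assumptions made for illustration. It confirms that deeper nodes receive exponentially lower probability:

```python
def walk_probs(tree, p_reach=1.0, path=()):
    """Probability that the stochastic walk terminates at each node.
    Internal nodes stop the walk with probability 0.5; leaves always stop it.
    Trees are either leaf labels or (label, left, right) tuples."""
    if not isinstance(tree, tuple):        # leaf: the walk must stop here
        return {path: p_reach}
    probs = {path: 0.5 * p_reach}          # stop at this internal node
    p_child = 0.5 * p_reach * 0.5          # continue, then pick a side
    probs.update(walk_probs(tree[1], p_child, path + (':left_node',)))
    probs.update(walk_probs(tree[2], p_child, path + (':right_node',)))
    return probs

tree = ('+', 'SE', ('*', 'PER', 'LIN'))    # a small covariance parse tree
probs = walk_probs(tree)
assert abs(sum(probs.values()) - 1.0) < 1e-12          # a valid distribution
assert probs[()] == 0.5                                # the root
assert probs[(':left_node',)] == 0.25                  # a depth-1 leaf
assert probs[(':right_node', ':left_node')] == 0.0625  # a depth-2 leaf
```

Halving the probability at each level of depth is exactly the "exponentially lower probability for deeper nodes" behavior described above.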
The first part of the involution (Figure 2e) copies the random choices made during this walk from the input auxiliary trace to the output auxiliary trace:

    @copy(aux_in[:path], aux_out[:path])
Note that here, @copy is being used to copy the entire set of random choices under the namespace :path in aux_in to the namespace :path in aux_out. Because the mixture distribution is specified using a probabilistic program, it is straightforward to modify the program q to define a different mixture distribution. The code below specifies a mixture distribution that is uniform over all nodes in the tree.

    n1 = size(node.left); n2 = size(node.right)
    if ({:done} ∼ bernoulli(1 / (1 + n1 + n2)))
        return path
    elseif ({:recurse_left} ∼ bernoulli(n1 / (n1 + n2)))
        path = (path..., :left_node)
        return ({:left} ∼ walk_tree(node.left, path))
    else
        path = (path..., :right_node)
        return ({:right} ∼ walk_tree(node.right, path))
    end

The resulting distributions, for two possible input trees, are:
Note that the probability of choosing a given subtree to propose to is itself changed when the subtree changes. Therefore, the mixture probabilities do not in general cancel in the acceptance probability calculation, and must be accounted for. For the original mixture distribution, the ratio of mixture probabilities is either 1, 0.5, or 2, depending on whether the previous and new subtrees are leaf or internal nodes. For this alternative mixture distribution, the ratio of mixture probabilities is the ratio of the sizes of the two trees.

A general pattern for state-dependent mixtures of proposals in Metropolis-Hastings
The other parts of the auxiliary probabilistic program and the involution program specify a proposal distribution for the subtree of the parse tree that is rooted at the chosen node. In particular, the rest of the auxiliary probabilistic program q proposes a new subtree by sampling from the same process used to recursively define the prior distribution:

    new_subtree ∼ cov_function_prior()

The involution program swaps the old subtree with the newly proposed subtree:

    @copy(model_in[subtree_address], aux_out[:new_subtree])
    @copy(aux_in[:new_subtree], model_out[subtree_address])
This is an instance of a more general pattern for implementing state-dependent mixtures of proposals:

1. The auxiliary probabilistic program Q samples from a distribution over the different sets of random choices that will be proposed to (in this case, each set is a different subtree of the parse tree).

2. The auxiliary probabilistic program Q then samples new values for those random choices (in this case, a new subtree).

3. The involution program F swaps the previous values of those random choices with their new values, by swapping data between the model trace and the auxiliary trace.

4. The involution program F copies the random choices that determined which subset of random choices to propose to from the input auxiliary trace to the output auxiliary trace.

Involutive MCMC is a unifying construction (Neklyudov et al., 2020) for MCMC algorithms that encompasses both classic approaches to constructing kernels, like reversible jump MCMC (Green, 1995), and recently introduced classes of MCMC kernels based on neural networks (Spanbauer et al., 2020). The approach to automating involutive MCMC kernels presented in this paper therefore makes a number of classic MCMC techniques easier to use, broadens their accessibility, and may aid in the development of novel MCMC techniques as well. The implementation of our approach in the Gen probabilistic programming system has already been used by researchers in computational biology (Merrell and Gitter, 2020) and artificial intelligence (Zhi-Xuan et al., 2020) to prototype and develop new reversible-jump MCMC algorithms.

The technique for automating involutive MCMC presented in this paper can be generalized to the setting of sequential Monte Carlo samplers (Del Moral et al., 2006).
Instead of one model probabilistic program, one auxiliary probabilistic program, and one involution program, there are two model probabilistic programs, two auxiliary probabilistic programs, and a pair of bijective differentiable programs that transform traces of one model into traces of the other. This more general construct, which builds on earlier work on sequential Monte Carlo and probabilistic programs (Cusumano-Towner et al., 2018), has already been implemented as part of the Gen probabilistic programming system.

Improving the performance of automated involutive MCMC, and of flexible probabilistic programming systems like Gen more generally, is an important area for future work. The approach described in this paper is largely dynamic and is not performance-competitive with optimized hand-coded implementations in performance-oriented languages like C. While our approach is already valuable for use cases where the best performance is not necessary, or where the expertise or time needed for an optimized hand-coded implementation is not available, more research into compilers and automatic code generation of custom inference algorithms from high-level user specifications would broaden the applicability of systems like Gen.
Acknowledgements
This research was supported in part by the US Department of Defense through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program, the DARPA SD2 program (contract FA8750-17-C-0239), the DARPA Machine Common Sense (MCS) program, the DARPA Synergistic Discovery and Design (SD2) program, support from the Intel Corporation, and a philanthropic gift from the Aphorism Foundation. The authors would also like to thank Feras Saad and Cameron Freer for helpful discussions.
References

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98.

Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., and Goodman, N. D. (2019). Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973–978.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1).

Chang, J. T. and Pollard, D. (1997). Conditioning as disintegration. Statistica Neerlandica, 51(3):287–317.

Cusumano-Towner, M. (2018). Inference library of the Gen probabilistic programming system. https://github.com/probcomp/Gen.jl/blob/b9d72b/src/inference/mh.jl. Accessed: 2018-12-27.

Cusumano-Towner, M., Bichsel, B., Gehr, T., Vechev, M., and Mansinghka, V. K. (2018). Incremental inference for probabilistic programs. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, pages 571–585. ACM.

Cusumano-Towner, M. F., Saad, F. A., Lew, A. K., and Mansinghka, V. K. (2019). Gen: A general-purpose probabilistic programming system with programmable inference. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 221–236. ACM.

Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.

Gehr, T., Misailovic, S., and Vechev, M. (2016). PSI: Exact symbolic inference for probabilistic programs. In International Conference on Computer Aided Verification, pages 62–83. Springer.

Geiger, A., Lauer, M., and Urtasun, R. (2011). A generative model for 3D urban scene understanding from movable platforms. In CVPR 2011, pages 1945–1952. IEEE.

Gilks, W. R., Thomas, A., and Spiegelhalter, D. J. (1994). A language and program for complex Bayesian modelling. Journal of the Royal Statistical Society: Series D (The Statistician), 43(1):169–177.

Goodman, N., Mansinghka, V., Roy, D. M., Bonawitz, K., and Tenenbaum, J. B. (2008). Church: a language for generative models. In Proceedings of the 24th Annual Conference on Uncertainty in Artificial Intelligence, UAI 2008, pages 220–229. AUAI Press.

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732.

Grosse, R. B., Salakhutdinov, R., Freeman, W. T., and Tenenbaum, J. B. (2012). Exploiting compositionality to explore a large space of model structures. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, UAI 2012, pages 306–315. AUAI Press.

Hastie, D. I. and Green, P. J. (2012). Model choice using reversible jump Markov chain Monte Carlo. Statistica Neerlandica, 66(3):309–338.

Huelsenbeck, J. P., Larget, B., and Alfaro, M. E. (2004). Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Molecular Biology and Evolution, 21(6):1123–1133.

Lee, W., Yu, H., Rival, X., and Yang, H. (2019). Towards verified stochastic variational inference for probabilistic programs. Proceedings of the ACM on Programming Languages, 4(POPL):1–33.

Lew, A. K., Cusumano-Towner, M. F., Sherman, B., Carbin, M., and Mansinghka, V. K. (2019). Trace types and denotational semantics for sound programmable inference in probabilistic languages. Proceedings of the ACM on Programming Languages, 4(POPL):1–32.

Mansinghka, V. K., Schaechtle, U., Handa, S., Radul, A., Chen, Y., and Rinard, M. (2018). Probabilistic programming with programmable inference. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 603–616.

Merrell, D. and Gitter, A. (2020). Inferring signaling pathways with probabilistic programming. Proceedings of the Nineteenth European Conference on Computational Biology.

Milch, B., Marthi, B., Russell, S., Sontag, D., Ong, D. L., and Kolobov, A. (2005). BLOG: Probabilistic models with unknown objects. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, IJCAI 2005, pages 1352–1359. Morgan Kaufmann Publishers Inc.

Narayanan, P. and Shan, C.-c. (2020). Symbolic disintegration with a variety of base measures. ACM Transactions on Programming Languages and Systems (TOPLAS), 42(2):1–60.

Neklyudov, K., Welling, M., Egorov, E., and Vetrov, D. (2020). Involutive MCMC: A unifying framework. arXiv preprint arXiv:2006.16653.

Pfeffer, A. (2007). The design and implementation of IBAL: A general-purpose probabilistic language. Introduction to Statistical Relational Learning, page 399.

Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4):731–792.

Ritchie, D., Horsfall, P., and Goodman, N. D. (2016). Deep amortized inference for probabilistic programs. arXiv preprint arXiv:1610.05735.

Roberts, D. A., Gallagher, M., and Taimre, T. (2019). Reversible jump probabilistic programming. In Chaudhuri, K. and Sugiyama, M., editors, Proceedings of Machine Learning Research, volume 89, pages 634–643. PMLR.

Saad, F. A., Cusumano-Towner, M. F., Schaechtle, U., Rinard, M. C., and Mansinghka, V. K. (2019). Bayesian synthesis of probabilistic programs for automatic data modeling. Proceedings of the ACM on Programming Languages, 3(POPL):1–32.

Schaechtle, U., Saad, F., Radul, A., and Mansinghka, V. (2016). Time series structure discovery via probabilistic program synthesis. arXiv preprint arXiv:1611.07051.

Spanbauer, S., Freer, C., and Mansinghka, V. (2020). Deep involutive generative models for neural MCMC. arXiv preprint arXiv:2006.15167.

Tenenbaum, J. B., Kemp, C., Griffiths, T. L., and Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285.

Tierney, L. (1998). A note on Metropolis-Hastings kernels for general state spaces. Annals of Applied Probability, pages 1–9.

Zhi-Xuan, T., Mann, J. L., Silver, T., Tenenbaum, J. B., and Mansinghka, V. K. (2020). Online Bayesian goal inference for boundedly-rational planning agents. arXiv preprint arXiv:2006.07532.

A APPENDIX
A.1 Derivation of the pushforward Radon-Nikodym derivative for a special case
Implementing Algorithm 1 requires computing the Radon-Nikodym derivative d(µ ∘ f^{-1})/dµ. This section derives that function for the special case in which the involution f can be factored into an involution f_1 on a countable set I and a family of bijections f_{2,i} on R^{n_i} for some n_i for each i ∈ I. Suppose Z = {(i, x) : i ∈ I, x ∈ R^{n_i}}. Suppose f_1 is an involution on I with n_i = n_{f_1(i)}, and f_2 is a family of continuously differentiable bijections indexed by i ∈ I, such that f_{2,i} : R^{n_i} → R^{n_i}. Also suppose that f_{2,i} = f_{2,f_1(i)}^{-1} for all i ∈ I. That is,

f_{2,i}(f_{2,f_1(i)}(x)) = x  for all x ∈ R^{n_i} and all i ∈ I  (5)

Then f : Z → Z : (i, x) ↦ (f_1(i), f_{2,i}(x)) is an involution, because:

f(f(i, x)) = f(f_1(i), f_{2,i}(x)) = (f_1(f_1(i)), f_{2,f_1(i)}(f_{2,i}(x))) = (i, x)  (6)

Let Σ_n and µ_n denote the Lebesgue σ-algebra and Lebesgue measure on R^n, respectively. Let Σ ⊆ P(Z) be the σ-algebra of sets of the form ∪_{i∈I} {(i, x) : x ∈ K_i} for some K_i ∈ Σ_{n_i} for each i ∈ I.
Let µ denote the measure on the measurable space (Z, Σ) given by:

µ(∪_{i∈I} {(i, x) : x ∈ K_i}) := ∑_{i∈I} µ_{n_i}(K_i)  (7)

We wish to show that the Radon-Nikodym derivative of the pushforward of µ by f with respect to µ, evaluated at (i, x), is the absolute value of the Jacobian (determinant) of the function f_{2,i} evaluated at x, which is denoted (Jf_{2,i})(x):

(d(µ ∘ f^{-1})/dµ)(i, x) = |(Jf_{2,i})(x)|  (8)

Consider (µ ∘ f^{-1})(K) for K ∈ Σ:

(µ ∘ f^{-1})(∪_{i∈I} {(i, x) : x ∈ K_i}) = µ(f^{-1}(∪_{i∈I} {(i, x) : x ∈ K_i}))  (9)
= µ(∪_{i∈I} f^{-1}({(i, x) : x ∈ K_i}))  (10)
= µ(∪_{i∈I} {(f_1(i), x) : x ∈ f_{2,i}(K_i)})  (11)
= ∑_{i∈I} µ_{n_i}(f_{2,i}(K_i))  (12)

It suffices to show that for all K ∈ Σ:

∫_K |(Jf_{2,i})(x)| µ(dz) = ∑_{i∈I} µ_{n_i}(f_{2,i}(K_i))  (13)

Expanding the left-hand side:

∫_K |(Jf_{2,i})(x)| µ(dz) = ∑_{i∈I} ∫_{K_i} |(Jf_{2,i})(x)| µ_{n_i}(dx) = ∑_{i∈I} µ_{n_i}(f_{2,i}(K_i))  (14)

where the final step uses Bogachev Theorem 3.7.1 with g := 1 and F := f_{2,i}.

A.2 Proof of detailed balance for involution
The involutive MCMC kernel is composed of two parts: an extension of the state space, and an involution on the extended state space. First, we show detailed balance for the deterministic involution move applied to the extended state space. Tierney (1998) gives a class of MCMC kernels based on involutions that satisfy detailed balance. We now reproduce the result in our notation:
Lemma A.1 (Detailed balance for involution move (Tierney, 1998)). Let (Z, Σ, π) denote a measure space. Suppose f is a one-to-one function from Z onto Z such that f^{-1} = f. Consider the probability kernel k defined by k_z(A) := [f(z) ∈ A] α(z, f(z)) + [z ∈ A](1 − α(z, f(z))) (where α(z, f(z)) gives the probability of accepting a proposed transition from z to f(z)). Let ν(dz) := π(dz) + (π ∘ f^{-1})(dz). Let h(z) be a density for π with respect to ν. Let K := {z ∈ Z : h(z) > 0 and h(f(z)) > 0}. Then k satisfies detailed balance with respect to π if and only if:

1. α(z, f(z)) = 0 for π-almost all z ∉ K; and

2. α(z, f(z)) h(z)/h(f(z)) = α(f(z), z) for ν-almost all z ∈ K.

Now we apply Lemma A.1 to our setting, where π is σ-finite, there exists a σ-finite reference measure µ for the measurable space (Z, Σ) such that π is mutually absolutely continuous with respect to µ, and the pushforward of µ by f, denoted µ ∘ f^{-1}, is absolutely continuous with respect to µ. In our setting, α is defined as:

α(z, f(z)) := min{1, [(dπ/dµ)(f(z)) / (dπ/dµ)(z)] · (d(µ ∘ f^{-1})/dµ)(z)}  (15)

This definition of α satisfies:

α(z, f(z)) · [(dπ/dµ)(z) / (dπ/dµ)(f(z))] · ((d(µ ∘ f^{-1})/dµ)(z))^{-1} = α(f(z), z)  (16)

Therefore, to apply Lemma A.1, it suffices to show that π has a density with respect to ν (denoted h(z)) such that:

h(z)/h(f(z)) = [(dπ/dµ)(z) / (dπ/dµ)(f(z))] · ((d(µ ∘ f^{-1})/dµ)(z))^{-1}  (17)

Since π and π ∘ f^{-1} are both absolutely continuous with respect to µ, ν is also absolutely continuous with respect to µ, and has density:

(dν/dµ)(z) = (dπ/dµ)(z) + (d(π ∘ f^{-1})/dµ)(z)  (18)

π is absolutely continuous with respect to ν, and therefore:

(dπ/dµ)(z) = (dπ/dν)(z) · (dν/dµ)(z)  (19)

Because (dπ/dµ)(z) > 0 for all z ∈ Z, also (dν/dµ)(z) > 0 for all z ∈ Z.
Therefore,

h(z) := (dπ/dν)(z) = (dπ/dµ)(z) / (dν/dµ)(z)  (20)

Therefore:

h(z)/h(f(z)) = [(dπ/dµ)(z) / (dπ/dµ)(f(z))] · [(dν/dµ)(f(z)) / (dν/dµ)(z)]  (21)

It suffices to show that:

(dν/dµ)(f(z)) / (dν/dµ)(z) = ((d(µ ∘ f^{-1})/dµ)(z))^{-1}

First, we prove a lemma:
Lemma A.2. If (Z, Σ) is a measurable space and f : Z → Z is a measurable function that is an involution, ν and µ are σ-finite measures such that ν is absolutely continuous with respect to µ, and the pushforward measures ν ∘ f^{-1} and µ ∘ f^{-1} are both σ-finite, then ν ∘ f^{-1} is absolutely continuous with respect to µ ∘ f^{-1} and

(d(ν ∘ f^{-1})/d(µ ∘ f^{-1}))(z) = (dν/dµ)(f(z))  (22)

Proof.
First, ν ∘ f^{-1} is absolutely continuous with respect to µ ∘ f^{-1}, because (µ ∘ f^{-1})(A) = 0 implies µ(f^{-1}(A)) = 0, which implies ν(f^{-1}(A)) = 0, which implies (ν ∘ f^{-1})(A) = 0. To show that z ↦ (dν/dµ)(f(z)) is the Radon-Nikodym derivative d(ν ∘ f^{-1})/d(µ ∘ f^{-1}), it suffices to show that for all A ∈ Σ:

∫_A (dν/dµ)(f(z)) (µ ∘ f^{-1})(dz) = (ν ∘ f^{-1})(A) := ν(f^{-1}(A))  (23)

Applying Theorem 3.6.1 in Bogachev with Y := A, y := z, x := z′, X := f^{-1}(A), and g(y) := (dν/dµ)(f(y)):

∫_A (dν/dµ)(f(z)) (µ ∘ f^{-1})(dz) = ∫_Y g(y) (µ ∘ f^{-1})(dy)  (24)
= ∫_X g(f(x)) µ(dx)  [Bogachev Theorem 3.6.1]  (25)
= ∫_{f^{-1}(A)} (dν/dµ)(f(f(z′))) µ(dz′)  (26)
= ∫_{f^{-1}(A)} (dν/dµ)(z′) µ(dz′)  (27)
= ∫_{f^{-1}(A)} ν(dz′) = ν(f^{-1}(A))  (28)

Now, note that ν and ν ∘ f^{-1} are the same measure:

ν(A) = π(A) + (π ∘ f^{-1})(A) = π(A) + π(f^{-1}(A))  (29)
(ν ∘ f^{-1})(A) = π(f^{-1}(A)) + π(f^{-1}(f^{-1}(A))) = π(f^{-1}(A)) + π(A)  (30)

Therefore, (d(ν ∘ f^{-1})/dν)(z) = 1 for all z. Expanding d(ν ∘ f^{-1})/dν using the chain rule:

1 = (d(ν ∘ f^{-1})/dν)(z)  (31)
= (d(ν ∘ f^{-1})/d(µ ∘ f^{-1}))(z) · (d(µ ∘ f^{-1})/dµ)(z) · (dµ/dν)(z)  (32)
= (dν/dµ)(f(z)) · (d(µ ∘ f^{-1})/dµ)(z) · (dµ/dν)(z)  [Lemma A.2]  (33)
= (dν/dµ)(f(z)) · (d(µ ∘ f^{-1})/dµ)(z) · ((dν/dµ)(z))^{-1}  (34)

Therefore:

((d(µ ∘ f^{-1})/dµ)(z))^{-1} = (dν/dµ)(f(z)) / (dν/dµ)(z)  (35)
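A toy numerical instantiation of the factored special case of Section A.1 may help fix ideas. In the Python sketch below, the two-component index set and the affine bijections are assumptions chosen for illustration: f_1 swaps the two components and f_{2,1}(x) = 2x + 1, so Eq. (8) predicts the constant density |(Jf_{2,1})(x)| = 2, which matches the pushforward mass computed directly:

```python
# Assumed toy example: I = {1, 2}; f1 swaps the components;
# f_{2,1}(x) = 2x + 1 and f_{2,2} = f_{2,1}^{-1}.
f1 = {1: 2, 2: 1}
f2 = {1: (lambda x: 2.0 * x + 1.0), 2: (lambda x: (x - 1.0) / 2.0)}

def f(i, x):
    """The factored map f(i, x) = (f1(i), f_{2,i}(x))."""
    return f1[i], f2[i](x)

# f is an involution (Eq. 6): f(f(i, x)) = (i, x).
for x in (-1.5, 0.0, 3.25):
    assert f(*f(1, x)) == (1, x)
    assert f(*f(2, x)) == (2, x)

# Take K = {(1, x) : 0 <= x <= 1}. Then f^{-1}(K) = {(2, x) : x in [1, 3]}
# (since f_{2,2}(x) lies in [0, 1] iff x lies in [1, 3]), whose Lebesgue
# measure is 2.
pushforward_mass = 3.0 - 1.0
# Eq. (8) predicts density |J f_{2,1}(x)| = 2 on K, and integrating that
# density over K recovers the same mass (Eqs. 13-14).
density_integral = 2.0 * (1.0 - 0.0)
assert pushforward_mass == density_integral
```

In the automated system, the analogue of the constant 2 is computed by automatic differentiation of the trace transform, with the sum over the countable index set handled implicitly by the discrete random choices.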
Detailed balance of the involution kernel with respect to the measure induced by π implies:

∫_B π(z) k′_z(A) µ(dz) = ∫_A π(z) k′_z(B) µ(dz)  for all A, B ∈ Σ  (36)

Stationarity with respect to π follows by substituting Z for B:

∫_Z π(z) k′_z(A) µ(dz) = ∫_A π(z) k′_z(Z) µ(dz) = ∫_A π(z) µ(dz)  for all A ∈ Σ  (37)
A.4 Proof of stationarity for end-to-end kernel
We are given that the involution kernel is stationary with respect to (the measure induced by) π(x, u) := p(x) q_x(u):

∫_Z k′_z(A) π(z) µ(dz) = ∫_A π(z) µ(dz)  for all A ∈ Σ  (39)

The end-to-end kernel is defined for all x ∈ X such that p(x) > 0 by:

k_x(A) := ∫_U k′_{(x,u)}(A × U) q_x(u) µ_Q(du)  for all A ∈ Σ_P  (40)

Stationarity of the end-to-end kernel with respect to the measure induced by p is:

∫_X k_x(A) p(x) µ_P(dx) = ∫_A p(x) µ_P(dx)  for all A ∈ Σ_P  (41)

Expanding:

∫_X k_x(A) p(x) µ_P(dx) = ∫_X (∫_U k′_{(x,u)}(A × U) q_x(u) µ_Q(du)) p(x) µ_P(dx)
= ∫_{X×U} k′_{(x,u)}(A × U) q_x(u) p(x) (µ_P × µ_Q)(dz)
= ∫_Z k′_{(x,u)}({(x′, u′) ∈ Z : x′ ∈ A}) π(z) µ(dz)
= ∫_{{(x,u) ∈ Z : x ∈ A}} π(z) µ(dz)  [stationarity of k′ with respect to π]
= ∫_{{(x,u) ∈ Z : x ∈ A}} q_x(u) p(x) µ(dz)
= ∫_A (∫_U q_x(u) µ_Q(du)) p(x) µ_P(dx)
= ∫_A p(x) µ_P(dx)

A.5 A Sufficient Condition for Involutive MCMC with Dictionaries
Our formulation of involutive MCMC requires the following technical condition to hold: Z := {(x, y) ∈ X × Y : π(x, y) > 0} is µ_P × µ_Q-measurable. We now give a sufficient condition for this to hold when X and Y are spaces of dictionaries. Let D ⊆ K denote the subset of addresses that are discrete (i.e., where V_k is a countable set and µ_k is the counting measure). For x ∈ ×_{k∈A} V_k, let x = (d, c), where d ∈ ×_{k∈A∩D} V_k and c ∈ ×_{k∈A\D} V_k, so that d is the discrete part of x and c is the non-discrete part.

Lemma A.3.