C3: Lightweight Incrementalized MCMC for Probabilistic Programs using Continuations and Callsite Caching
Daniel Ritchie    Andreas Stuhlmüller    Noah D. Goodman
Stanford University
Abstract
Lightweight, source-to-source transformation approaches to implementing MCMC for probabilistic programming languages are popular for their simplicity, support of existing deterministic code, and ability to execute on existing fast runtimes [1]. However, they are also slow, requiring a complete re-execution of the program on every Metropolis-Hastings proposal. We present a new extension to the lightweight approach, C3, which enables efficient, incrementalized re-execution of MH proposals. C3 is based on two core ideas: transforming probabilistic programs into continuation passing style (CPS), and caching the results of function calls. We show that on several common models, C3 reduces proposal runtime by 20-100x, in some cases reducing runtime complexity from linear in model size to constant. We also demonstrate nearly an order of magnitude speedup on a complex inverse procedural modeling application.
1 Introduction

Probabilistic programming languages (PPLs) are a powerful, general-purpose tool for developing probabilistic models. A PPL is a programming language augmented with random sampling statements; programs written in a PPL correspond to generative priors. Performing inference on such programs amounts to reasoning about the space of execution traces which satisfy some condition on the program output. Many different PPL systems have been proposed, such as BLOG [2], Figaro [3], Church [4], Venture [5], Anglican [6], and Stan [7].

There are many possible implementations of PPL inference. One popular choice is the 'Lightweight MH' framework [1]. Lightweight MH uses a source-to-source transformation to turn a probabilistic program into a deterministic one, where random choices are uniquely identified by their structural position in the program execution trace. Random choice values are then stored in a database indexed by these structural 'addresses.' To perform a Metropolis-Hastings proposal, Lightweight MH changes the value of a random choice and re-executes the program, looking up the values of other random choices in the database to reuse them when possible. Lightweight MH is simple to implement and allows PPLs to be built atop existing deterministic languages. Users can thus leverage existing libraries and fast compilers/runtimes for these 'host' languages. For example, Stochastic Matlab can access Matlab's rich matrix and image manipulation routines [1], WebPPL runs on Google's highly-optimized V8 Javascript engine [8], and Quicksand's host language compiles to fast machine code using LLVM [9].

Unfortunately, Lightweight MH is also inefficient: when an MH proposal changes a random choice, the entire program re-executes to propagate this change. This is rarely necessary: for many models, most proposals affect only a small subset of the program execution trace. To update the trace, re-execution is needed only where values can change.
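The addressing scheme just described can be illustrated with a minimal sketch (ours, not WebPPL's actual implementation; `sampleBernoulli` and the address strings are hypothetical stand-ins): random choice values live in a database keyed by structural address, and a full re-execution reuses any stored value whose address it revisits.

```javascript
// Sketch (ours) of Lightweight MH's random choice database.
// Each random choice is keyed by a structural 'address' encoding its
// position in the execution trace; re-execution reuses stored values.
var db = {};  // address -> stored value

// A stand-in for a sample statement (not WebPPL's actual API).
function sampleBernoulli(address, p, rng) {
  if (address in db) return db[address];  // reuse on re-execution
  var val = rng() < p;
  db[address] = val;
  return val;
}

// A toy program with two random choices at distinct addresses.
function runProgram(rng) {
  var a = sampleBernoulli('top/1', 0.5, rng);
  var b = sampleBernoulli('top/2', 0.5, rng);
  return [a, b];
}

var run1 = runProgram(Math.random);   // first run populates the database
db['top/1'] = !db['top/1'];           // an MH proposal flips one choice
var run2 = runProgram(Math.random);   // full re-execution with reuse
console.log(run2[1] === run1[1]);     // true: unchanged choice was reused
```

Note that even though only one choice changed, `runProgram` executes in full; the rest of this paper is about avoiding exactly that cost.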
// Hidden Markov Model
var hmm = function(n, obs) {
  if (n === 0) return true;
  else {
    var prev = hmm(n-1, obs);
    var state = transition(prev);
    observation(state, obs[n]);
    return state;
  }
};

Figure 1: (Left) A simple HMM program in the WebPPL language. (Right) Illustrating the re-execution behavior of different MH implementations in response to a proposal to the random choice c_i shaded in red. Lightweight MH re-executes the entire hmm program, invoking (orange bar) and then unwinding (blue bar) the full chain of recursive calls. Callsite caching allows re-execution to skip all recursive calls under hmm(i-1, obs). With continuations, re-execution only has to unwind from the continuation of choice c_i. Combining callsite caching and continuations allows re-execution to terminate upon returning from hmm(i+1, obs), since its return value does not change.

Under Lightweight MH, random choice values are preserved and reused when possible, limiting the effect of a proposal to a subset of the changed variable's Markov blanket (sometimes a much smaller subset, due to context-specific independence [10]). Custom PPL interpreters can leverage this property to incrementalize proposal re-execution [5], but implementing such interpreters is complicated, and using them makes it difficult or impossible to leverage libraries and fast runtimes for existing deterministic languages.

In this paper, we present a new implementation technique for MH proposals on probabilistic programs that gives the best of both worlds: incrementalized proposal execution using a lightweight, source-to-source transformation framework. Our method, C3, is based on two core ideas:

1. Continuations: Converting the program into continuation-passing style to allow program re-execution to begin anywhere.
2. Callsite caching: Caching function calls to avoid re-execution when function inputs or outputs have not changed.

We first describe how to implement C3 in any functional PPL with first-class functions; our implementation is integrated into the open-source WebPPL probabilistic programming language [8]. We then compare C3 to Lightweight MH, showing that it gives orders of magnitude speedups on common models such as HMMs, topic models, Gaussian mixtures, and hierarchical linear regression. In some cases, C3 reduces runtimes from linear in model size to constant. We also demonstrate that C3 is nearly an order of magnitude faster on a complex inverse procedural modeling example from computer graphics.
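The first core idea, continuation-passing style, can be previewed with a minimal sketch (ours, not C3's actual transform output): once every function takes an explicit continuation instead of returning, "the rest of the program" at any point is an ordinary value that can be stored and invoked later.

```javascript
// Minimal CPS sketch (ours): every function takes an explicit
// continuation k instead of returning a value.
var saved;   // a continuation captured mid-computation
var result;

function sumTo(n, k) {
  if (n === 3) saved = k;          // remember "the rest of the program" here
  if (n === 0) k(0);
  else sumTo(n - 1, function(rest) { k(n + rest); });
}

sumTo(5, function(v) { result = v; });
var firstResult = result;
console.log(firstResult);  // 15

// Invoking the saved continuation resumes execution from that point,
// re-running only the computation above n = 3:
saved(100);                // pretend the subcomputation below returned 100
console.log(result);       // 100 + 4 + 5 = 109
```

This ability to resume from a stored continuation is what lets proposal re-execution begin at a changed random choice rather than at the top of the program.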
To illustrate our approach, we use a simple example: a binary state Hidden Markov Model program written in WebPPL (Figure 1 Left). This program recursively samples latent states (inside the transition function), conditioning on the observations in the obs list (inside the observation function). When invoked, hmm(N, obs) generates a linear chain of latent and observed random variables (Figure 1 Right).

Consider how Lightweight MH performs a proposal on this program. It first runs the program once to initialize the database of random choices. It then selects a choice c_i uniformly at random from this database (the red circle in Figure 1 Right) and changes its value. This change necessitates a constant-time update to the score of c_{i+1}. However, Lightweight MH re-executes the entire program, invoking a chain of recursive calls to hmm (the orange bar in Figure 1 Right) and then unwinding those calls (the blue bar). This process requires N such call visits for an HMM with N states.

// Initial HMM code
var hmm = function(n, obs) {
  if (n === 0) return true;
  else {
    var prev = hmm(n-1, obs);
    var state = transition(prev);
    observation(state, obs[n]);
    return state;
  }
};

// After caching transform
var hmm = function(n, obs) {
  if (n === 0) return true;
  else {
    var prev = cache(hmm, n-1, obs);
    var state = cache(transition, prev);
    cache(observation, state, obs[n]);
    return state;
  }
};

// After function tagging transform
var hmm = tag(function(n, obs) {
  if (n === 0) return true;
  else {
    var prev = cache(hmm, n-1, obs);
    var state = cache(transition, prev);
    cache(observation, state, obs[n]);
    return state;
  }
}, '1', [hmm, transition, observation]);

Figure 2: Source code transformations used by C3. (Left) Original HMM code. (Middle) Code after applying the caching transform, wrapping all callsites with the cache intrinsic. (Right) Code after applying the function tagging transform, where all functions are annotated with a lexically-unique ID and the values of their free variables. An example CPS-transformed program can be found in the ancillary materials.

One strategy for speeding up re-execution is to cache function calls and reuse their results if they are invoked again with unchanged inputs. We call this scheme, which is a generalization of Lightweight MH's random choice reuse policy, callsite caching. With this strategy, the recursive re-execution of hmm must still traverse all ancestors of choice c_i but can stop at hmm(i, obs): it can reuse the result of hmm(i-1, obs), since the inputs have not changed. As shown in Figure 1 Right, using callsite caching can result in less re-execution, but it still requires ~N hmm call visits on average.

Now suppose we instead convert the program into continuation passing style. CPS re-organizes a program to make all data and control flow explicit: instead of returning, functions invoke a 'continuation' function which represents the remaining computation to be performed [11]. For our HMM example, by storing the continuation at c_i, computation can resume from the point where this random choice is made, which corresponds to unwinding the stack from hmm(i, obs) up to hmm(N, obs). Looking at the 'Continuations' row of Figure 1, this is a significant improvement over Lightweight MH and is also better than callsite caching. However, it still requires ~N call visits.

Our main insight is that we can achieve the desired runtime by combining callsite caching with continuations; we call the resulting system C3. With C3, re-execution can not only jump directly to choice c_i by invoking its continuation, but it can actually terminate almost immediately: the cache also contains the return values of all function calls, and since the return value of hmm(i+1, obs) has not changed, all subsequent computation will not change either.
C3 unwinds only two recursive hmm calls, giving the desired constant-time update. Thus C3 is more than the sum of its parts: by combining caching with CPS, it enables incrementalization benefits that neither component can deliver independently.

In the sections that follow, we describe how to implement C3 in a functional PPL. Specifically, we describe how to transform the program source at compile time (Section 3) to make requisite data available to the runtime caching mechanism (Section 4).

Lightweight MH transforms the source code of probabilistic programs to compute random choice addresses; the transformed code can then be executed on existing runtimes for the host deterministic language. C3 fits into this framework by adding three additional source transformations: caching, function tagging, and a standard continuation passing style transform for functional languages.
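To make the address-computation transform concrete, here is our own hypothetical sketch of its effect (the `a + '/c1'` address format, `flip`, and `trace` are illustrative stand-ins, not WebPPL's actual output): every function gains an address argument, and every callsite extends it with a lexically-unique callsite ID, so each random choice receives a structural address at runtime.

```javascript
// Hypothetical sketch (ours) of address computation: addresses are
// threaded through calls and extended at each callsite.
var trace = [];

// Deterministic stand-in for a random choice; records its address.
var flip = function(a, p) { trace.push(a); return true; };

var g = function(a, x) { return flip(a + '/c3', 0.5); };
var f = function(a, x) {
  return flip(a + '/c1', 0.5) && g(a + '/c2', x);
};

f('', 0);
console.log(trace);  // ['/c1', '/c2/c3']
```

The choice inside `g` gets a different address depending on which callsite invoked `g`, which is exactly what lets the database distinguish structurally distinct random choices.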
Caching
This transform wraps every function callsite with a call to an intrinsic cache function (Figure 2 Middle). This function performs run-time callsite cache lookups, as described in Section 4.
Function tagging
This transform analyzes the body of each function and tags the function with both a lexically-unique ID as well as the values of its free variables (Figure 2 Right). In Section 4, we describe how C3 uses this information to decide whether a function call must be re-executed.

The final source transformation pipeline is: caching → function tagging → address computation → CPS. Standard compiler optimizations such as inlining, constant folding, and common subexpression elimination can then be applied. In fact, the host language compiler often already performs such optimizations, which is an additional benefit of the lightweight transformational approach.

// Arguments added by compiler:
//   a: current address
//   k: current continuation
function cache(a, k, fn, args) {
  // Global function call stack
  var currNode = nodeStack.top();
  var node = find(a, currNode.children);
  if (node === null) {
    node = FunctionNode(a);
    // Insert maintains execution order
    insert(node, currNode.children, currNode.nextChildIndex);
  }
  execute(node, k, fn, args);
}

// rc: a random choice node
function propagate(rc) {
  // Restore call stack up to rc.parent
  restore(nodeStack, rc.parent);
  // Changes to rc may make siblings unreachable
  markUnreachable(rc.parent.children, rc.index);
  // Continue executing
  rc.parent.nextChildIndex = rc.index + 1;
  rc.k(rc.val);
}

function execute(node, k, fn, args) {
  node.reachable = true;
  node.k = k;
  node.index = node.parent.nextChildIndex;
  // Check for input changes
  if (!fnEquiv(node.fn, fn) || !equal(node.args, args)) {
    node.fn = fn;
    node.args = args;
    // Mark all children as initially unreachable
    markUnreachable(node.children, 0);
    // Call fn with special continuation
    node.nextChildIndex = 0;
    nodeStack.push(node);
    node.entered = true;
    fn(args, function(retval) {
      node = nodeStack.pop();
      // Remove unreachable children
      removeUnreachables(node.children);
      // Terminate early on proposals where retval does not change
      var rveq = equal(retval, node.retval);
      if (!node.entered && rveq) kexit();
      else {
        node.entered = false;
        // retval change may make siblings unreachable
        if (!rveq) markUnreachable(node.parent.children, node.index);
        // Continue executing
        node.retval = retval;
        node.parent.nextChildIndex++;
        k(node.retval);
      }
    });
  } else {
    node.parent.nextChildIndex++;
    k(node.retval);
  }
}

Figure 3: The main subroutines governing C3's callsite cache. Function calls are wrapped with cache, which retrieves (or creates) a cache node for a given address a. It calls execute, which examines the function call's inputs for changes and runs the call if needed. Finally, MH proposals use propagate to resume re-execution of the program from a particular random choice node which has been changed.

When performing an MH proposal, callsite caching aims to avoid re-executing functions and to enable early termination from them as often as possible. In this section, we describe how C3 efficiently implements both of these types of computational 'short-circuiting' for probabilistic functional programs. Figure 3 provides high-level code for the main subroutines which govern the caching system.
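As a runnable illustration of the reachability bookkeeping these subroutines rely on, here is a simplified sketch (ours; `CacheNode` and the example addresses are hypothetical): children are kept in execution order, tentatively marked unreachable when their parent resumes, and dropped if the parent exits without revisiting them.

```javascript
// Simplified sketch (ours) of stale-child removal in a tree cache.
function CacheNode(address) {
  this.address = address;
  this.children = [];        // callee nodes, in execution order
  this.reachable = true;
}

function markUnreachable(children, i) {
  // Tentatively mark children from index i onward as unreachable.
  for (var j = i; j < children.length; j++) children[j].reachable = false;
}

function removeUnreachables(node) {
  // Drop children still unmarked when the parent finishes executing.
  node.children = node.children.filter(function(c) { return c.reachable; });
}

// Example: a parent resumes execution but only revisits its first callee.
var parent = new CacheNode('f');
parent.children = [new CacheNode('f/1'), new CacheNode('f/2')];
markUnreachable(parent.children, 0);   // parent begins re-execution
parent.children[0].reachable = true;   // first callee is executed again
removeUnreachables(parent);            // second callee was stale
console.log(parent.children.length);   // 1
```

This is the mechanism that keeps the cache from accumulating entries for calls that a control-flow change has made unreachable.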
4.1 Cache Structure

We first require an efficient cache structure to minimize the overhead introduced by performing a cache access on every function call. C3 uses a tree-structured cache: it stores one node for each function call. A node's children correspond to the function's callees. Random choices are stored as leaf nodes. C3 also maintains a stack of nodes which tracks the program's call stack (nodeStack in Figure 3). During cache lookups, the desired node, if it exists, must be a child of the node on the top of this stack. Exploiting this property accelerates lookups, which would otherwise proceed from the cache root. Altogether, this structure provides expected constant time lookups, additions, and deletions.

In addition, by storing a node's children in execution order, C3 can efficiently determine when child nodes have become 'stale' (i.e. unreachable) due to control flow changes and should be removed. A child node is marked unreachable when its parent begins or resumes execution (the markUnreachable calls in execute and propagate in Figure 3) and marked reachable when it is executed (the start of execute). Any children left marked unreachable when the parent exits are removed from the cache (the removeUnreachables call in execute).

4.2 Short-Circuit On Function Entry
As described in Section 3, every function call is wrapped in a call to cache, which retrieves (or creates) a cache node for the current address. C3 then evaluates whether the node's associated function call must be re-evaluated or if its previous return value can be re-used (the execute function). Reuse is possible when the following two criteria are satisfied:

1. The function's arguments are equivalent to those from the previous execution.
2. The function itself is equivalent to that from the previous execution.

The first criterion can be verified with conservative equality testing; C3 uses shallow value equality testing, though deeper equality tests could result in more reuse for structured argument types. Deep equality testing is more expensive, though this can be mitigated using data structure techniques such as hash consing [12] or compiler optimizations such as global value numbering [13].

The second criterion is necessary because C3 operates on languages with first-class functions, so the identity of the callee at a given callsite is a runtime variable. Checking whether the two functions are exactly equal (i.e. refer to the same closure) is too conservative, however. Instead, C3 leverages the information provided by the function tagging transform from Section 3: two functions are equivalent if they have the same lexical ID (i.e. came from the same source location) and if the values of their free variables are equal. C3 applies this check recursively to any function-valued free variables, and it also memoizes the result, as program execution traces often feature many applications of the same function. This scheme is especially critical to obtain reuse in programs that feature anonymous functions, as those manifest as different closures for each program execution.
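The equivalence check enabled by the tagging transform can be sketched as follows (our own simplified version; memoization is omitted, and `tag`, `fnEquiv`, and `make` are illustrative names):

```javascript
// Sketch (ours) of function equivalence via lexical IDs and free
// variable values, as provided by the tagging transform.
function tag(fn, id, freeVars) {
  fn.id = id;
  fn.freeVars = freeVars;
  return fn;
}

function fnEquiv(f1, f2) {
  if (f1 === f2) return true;
  if (f1.id !== f2.id) return false;   // different source locations
  for (var name in f1.freeVars) {
    var v1 = f1.freeVars[name], v2 = f2.freeVars[name];
    var eq = (typeof v1 === 'function' && typeof v2 === 'function')
        ? fnEquiv(v1, v2)   // recurse on function-valued free variables
        : v1 === v2;
    if (!eq) return false;
  }
  return true;
}

// Anonymous closures created at the same source location:
function make(y) { return tag(function(x) { return x + y; }, '1', {y: y}); }
console.log(fnEquiv(make(3), make(3)));  // true: same ID, same free vars
console.log(fnEquiv(make(3), make(4)));  // false: free variable differs
```

Note how the two calls to `make(3)` produce distinct closure objects that strict equality would reject, while the tag-based check correctly identifies them as equivalent.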
4.3 Short-Circuit On Function Exit

When C3 re-executes the program after changing a random choice (using the propagate function), control may eventually return to a function call whose return value has not changed. In this case, since all subsequent computation will have the same result, C3 can terminate execution early by invoking the exit continuation kexit. During function exit, C3's execute function detects if control is returning from a proposal by checking if the call is exiting without having first been entered (the !node.entered check in Figure 3). This condition signals that the current re-execution originated at some descendant of the exiting call, i.e. a random choice node.

Early termination is complicated by inference queries whose size depends on model size: for example, the sequence of latent states in an HMM. In lightweight PPL implementations, inference typically computes the marginal distribution on program return values. Thus, a naïve HMM implementation would construct and return a list of latent states. However, this implementation makes early termination impossible, as the list must be recursively reconstructed after a change to any of its elements.

For these scenarios, C3 offers a solution in the form of a global query table to which the program can write values of interest:

// Using the query table to infer
// the sequence of latent states.
var hmm = function(n, obs) {
  if (n === 0) return true;
  else {
    var prev = hmm(n-1, obs);
    var state = transition(prev);
    query.add(n, state);
    observation(state, obs[n]);
    return state;
  }
};
hmm(100, observed_data);
return query;

Critically, query has a write-only interface: since the program cannot read from query, a write to it cannot introduce side effects in subsequent computation, and thus the semantics of early termination are preserved. Programs that use query can then simply return it to infer the marginal distribution over its contents.
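The write-only table itself is a small mechanism; a minimal sketch (ours; the `extract` accessor for the inference engine is our own assumption, not part of C3's documented interface) looks like this:

```javascript
// Sketch (ours) of a write-only query table: the program may add
// values of interest but never read them back, so writes cannot
// change subsequent control flow and early termination stays sound.
function QueryTable() {
  var table = {};
  return {
    add: function(key, val) { table[key] = val; },  // the program's interface
    extract: function() { return table; }  // used only by the inference engine
  };
}

var query = QueryTable();
query.add(1, 'A');
query.add(2, 'B');
console.log(JSON.stringify(query.extract()));  // {"1":"A","2":"B"}
```

Hiding the table behind a closure is one way to enforce the write-only discipline: nothing the program can call returns previously written values.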
C3 takes care to ensure that the amount of work it performs in response to a proposal is only proportional to the amount of the program execution trace affected by that proposal. First, it maintains references to all random choices in a hash table, which provides expected constant time additions, deletions, and random element lookups. This table allows C3 to perform uniform random proposal choice in constant time, rather than the linear time cost of scanning through the entire cache.

Second, proposals may be rejected, which necessitates copying the cache in case its prior state must be restored on rejection. C3 avoids copying the entire cache using a copy-on-write scheme with similar principles to transactional memory [14]: modifications to a cache node's properties are staged and only committed if the proposal is accepted. Thus, C3 only copies as much of the cache as is actually visited during proposal re-execution.

Finally, it is not always optimal to cache every callsite: caching introduces overhead, and some function calls almost always change on each invocation. C3 detects such callsites and stops caching them in a heuristic process we call adaptive caching. A callsite is un-cached if, after at least N proposals, execution has reached it M times without resulting in either short-circuit-on-entry or short-circuit-on-exit. We use N = 10, M = 50 for the results presented in this paper. A small, constant overhead remains for un-cached callsites, as calling them still triggers a table lookup to determine their caching status. Future work could explore efficiently re-compiling the program to remove cache calls around such callsites.

Figure 4: Comparing the performance of C3 with other MH implementations. (Top) Performing 10000 MH iterations on an HMM program. (Bottom) Performing 1000 MH iterations on an LDA program. (Left) Wall clock time elapsed, in seconds. (Right) Sampling throughput, in proposals per second. 95% confidence bounds are shown in a lighter shade. Only C3 exhibits constant asymptotic complexity for the HMM; other implementations take linear time, exhibiting decreasing throughput.

We now investigate the runtime performance characteristics of C3. We compare C3 to Lightweight MH, as well as to systems that use only callsite caching and only continuations. This allows us to investigate the incremental benefit provided by each of C3's components. The source code for all models used in this section is available in the ancillary materials, and our implementation of C3 itself is available as part of the WebPPL probabilistic programming language [8].
All timing data was collected on an Intel Core i7-3840QM machine with 16GB RAM running OSX 10.10.2.

We first evaluate these systems on two standard generative models: a discrete-time Hidden Markov Model and a Latent Dirichlet Allocation model. We use synthetic data, since we are interested purely in the computational efficiency of different implementations of the same statistical inference algorithm. The HMM program uses 10 discrete latent states and 10 discrete observable states and returns the sequence of latent states. We condition it on a random sequence of observations, of increasing length from 10 to 100, and run each system for 10000 MH iterations, collecting a sample every 10 iterations. The LDA program uses 10 topics, a vocabulary of 100 words, and 20 words per document. It returns the distribution over words for each topic. We condition it on a set of random documents, increasing in size from 5 to 50, and run each system for 1000 MH iterations.

Figure 4 shows the results of this experiment; all quantities are averaged over 20 runs. We show wall clock time in seconds (left) and throughput in proposals per second (right). For the HMM, C3's runtime is constant regardless of model size, whereas Lightweight MH and CPS Only exhibit the expected linear runtime (both approximately proportional to N). As discussed in Section 2, Caching Only has the same complexity as Lightweight MH but is a constant factor slower due to caching overhead. For the LDA model, Lightweight MH and CPS Only both exhibit asymptotic complexity comparable with their performance on the HMM. However, Caching Only performs significantly better. The LDA program is structured with nested loops; caching allows re-execution to skip entire inner loops for many proposals. Caching Only must still re-execute all ancestors of a changed random choice, though, so it is slower than C3, which jumps directly to the change point. C3 does not achieve exactly constant runtime for LDA because a small percentage of its proposals affect hierarchical variables, requiring more re-execution. This is a characteristic of hierarchical models in general; in this specific case, conjugacy could be leveraged to integrate out higher-level variables.

We also evaluate these systems on an inverse procedural modeling program. Procedural models are programs that generate random 3D models from the same family. Inverse procedural modeling infers executions of such a program that resemble a target output shape [15]. We use a simple grammar-like program for tree skeletons presented in prior work, conditioning its output to be volumetrically similar to a target shape [16]. We run each system for 2000 MH iterations.

Figure 5: Comparing C3 and Lightweight MH on an inverse procedural modeling program. (Left) Desired tree shape. (Middle) Example output from inference over a tree program given the desired shape. (Right) Performance characteristics of different MH implementations. C3 delivers nearly an order of magnitude speedup.

Figure 5 shows the results of this experiment. C3 achieves the best performance, delivering nearly an order of magnitude speedup over Lightweight MH. Using caching only does not help in this example, since re-executing the program from its beginning reconstructs all of the recursive procedural modeling function's structured inputs, whose equality is not captured by our cache's shallow equality tests.

[Figure: speedup of C3 over Lightweight MH vs. normalized model size, for the HMM, LDA, GMM, and HLR models.]

Finally, the figure above shows the results of a wider evaluation: for four models, we plot the speedup obtained by C3 over Lightweight MH (in relative throughput) as model size increases. The four models are: the HMM and LDA models from Figure 4, a one-dimensional finite Gaussian mixture model (GMM), and a hierarchical linear regression model (HLR) [17]. The 1-10 normalized Model Size parameter maps to a natural scale parameter for each of the four models; details are available in the ancillary materials. While C3 offers only small benefits over Lightweight MH for small models, it achieves dramatic speedups of 20-100x for large models.
Related Work
The ideas behind C3 have connections to other areas of active research. First, incrementalizing MCMC proposals for PPLs falls under the umbrella of incremental computation [18]. Much of the active work in this field seeks to build general-purpose languages and compilers to incrementalize any program [19]. However, there are also systems such as ours which seek simpler solutions to domain-specific incrementalization problems. In particular, C3's callsite caching mechanism was inspired in part by recent work in computer graphics on hierarchical render caches [20].

The Venture PPL features an algorithm to incrementally update a probabilistic execution trace in response to a random choice change [5]. Implemented as part of a custom interpreter, this method walks the trace starting from the changed node, identifying nodes which must be updated or removed, and determining when re-evaluation can stop. C3 performs a similar computation but uses continuations to traverse the execution trace rather than maintaining a complete interpreter state.

The Shred system also incrementalizes MH updates for PPLs [17]. Shred traces a program to remove its control flow and then uses data-flow analysis to produce incremental update procedures for each random choice. This process produces very fast proposal code, but it requires significant implementation cost, and its re-compilation overhead grows very large for programs with high control-flow variability, such as PCFGs. C3's caching scheme is a dynamic analog to Shred's static slicing which does not have compilation overhead but may not be as fast for models with fixed control flow.

The Swift compiler for the BLOG language is another recent system supporting incrementalized MCMC updates [21]. Unlike the above systems, BLOG/Swift uses a possible-world semantics for probabilistic programs, representing program state as a graphical model whose structure changes over time.
Swift tracks the Markov blanket of this model, computing incremental updates to it as model structure changes, allowing it to make efficient MCMC proposals. C3 does not explicitly compute Markov blankets, but its short-circuiting facilities limit re-execution to the subset of a changed variable's Markov blanket that is affected by the change.
This paper presented C3, a lightweight, source-to-source compilation system for incrementalizing MCMC updates in probabilistic programs. We have described how C3's two main components, continuations and callsite caching, allow it both to avoid re-executing function calls and to terminate re-execution early. Our experimental results show that C3 can provide orders-of-magnitude speedups over previous lightweight inference systems on typical generative models. It even enables constant-time updates in some cases where previous systems required linear time. We also demonstrate that C3 improves performance by nearly 10x on a complex, compute-heavy inverse procedural modeling problem. Our implementation of C3 is freely available as part of the open-source WebPPL probabilistic programming language.

Careful optimization of computational efficiency, such as the work presented in this paper, is necessary for PPLs to move out of the domain of research and into production machine learning and AI systems. Along these lines, there are several directions for future work. First, static analysis might allow C3 to determine at compile time dependencies between random choices and subsequent function calls, obviating the need for some input equality checks and reducing caching overhead. Second, C3's CPS transform is overcomplete: it transforms the entire program, but C3 only needs continuations at random choice points. Detecting and fusing blocks of purely deterministic code before applying the CPS transform could improve performance. Finally, while the results presented in this paper focus on single-site Metropolis-Hastings, C3's core incrementalization scheme also applies to other sampling algorithms, such as Gibbs samplers or particle filter rejuvenation kernels [22].

An incomplete, undocumented version of C3's callsite caching mechanism also appears in the original MIT-Church implementation of the Church probabilistic programming language [4].
References

[1] David Wingate, Andreas Stuhlmüller, and Noah D. Goodman. Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation. In AISTATS 2011.
[2] Brian Milch, Bhaskara Marthi, Stuart J. Russell, David Sontag, Daniel L. Ong, and Andrey Kolobov. BLOG: Probabilistic Models with Unknown Objects. In IJCAI 2005.
[3] A. Pfeffer. Figaro: An object-oriented probabilistic programming language. Technical report, Charles River Analytics, 2009.
[4] Noah D. Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. Church: a language for generative models. In UAI 2008.
[5] Vikash K. Mansinghka, Daniel Selsam, and Yura N. Perov. Venture: a higher-order probabilistic programming platform with programmable inference. CoRR, 2014.
[6] F. Wood, J. W. van de Meent, and V. Mansinghka. A New Approach to Probabilistic Programming Inference. In AISTATS 2014.
[7] Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, Version 2.5.0, 2014.
[8] Noah D. Goodman and Andreas Stuhlmüller. The Design and Implementation of Probabilistic Programming Languages. http://dippl.org, 2014. Accessed: 2015-5-18.
[9] Daniel Ritchie. Quicksand: A Lightweight Embedding of Probabilistic Programming for Procedural Modeling and Design. In The 3rd NIPS Workshop on Probabilistic Programming, 2014.
[10] Craig Boutilier, Nir Friedman, Moises Goldszmidt, and Daphne Koller. Context-specific Independence in Bayesian Networks. In UAI 1996.
[11] Andrew W. Appel. Compiling with Continuations. Cambridge University Press, New York, NY, USA, 2007.
[12] E. Goto. Monocopy and associative algorithms in an extended lisp. Technical report, 1974.
[13] B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Global Value Numbers and Redundant Computations. In POPL 1988.
[14] Maurice Herlihy and J. Eliot B. Moss. Transactional Memory: Architectural Support for Lock-free Data Structures. In ISCA 1993.
[15] Jerry O. Talton, Yu Lou, Steve Lesser, Jared Duke, Radomír Měch, and Vladlen Koltun. Metropolis Procedural Modeling. ACM Trans. Graph., 30(2), 2011.
[16] Daniel Ritchie, Ben Mildenhall, Noah D. Goodman, and Pat Hanrahan. Controlling Procedural Modeling Programs with Stochastically-Ordered Sequential Monte Carlo. In SIGGRAPH 2015.
[17] Lingfeng Yang, Pat Hanrahan, and Noah D. Goodman. Generating Efficient MCMC Kernels from Probabilistic Programs. In AISTATS 2014.
[18] G. Ramalingam and Thomas Reps. A Categorized Bibliography on Incremental Computation. In POPL 1993.
[19] Yan Chen, Joshua Dunfield, and Umut A. Acar. Type-Directed Automatic Incrementalization. In PLDI 2012.
[20] Michael Wörister, Harald Steinlechner, Stefan Maierhofer, and Robert F. Tobler. Lazy Incremental Computation for Efficient Scene Graph Rendering. In HPG 2013.
[21] Lei Li, Yi Wu, and Stuart J. Russell. SWIFT: Compiled Inference for Probabilistic Programs. Technical report, EECS Department, University of California, Berkeley, 2015.
[22] Walter R. Gilks and Carlo Berzuini. Following a moving target—Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society: Series B, 2001.