Paradoxes of Probabilistic Programming
And How to Condition on Events of Measure Zero with Infinitesimal Probabilities
JULES JACOBS,
Radboud University and Delft University of Technology, The Netherlands

Probabilistic programming languages allow programmers to write down conditional probability distributions that represent statistical and machine learning models as programs that use observe statements. These programs are run by accumulating likelihood at each observe statement, and using the likelihood to steer random choices and weigh results with inference algorithms such as importance sampling or MCMC. We argue that naive likelihood accumulation does not give desirable semantics and leads to paradoxes when an observe statement is used to condition on a measure-zero event, particularly when the observe statement is executed conditionally on random data. We show that the paradoxes disappear if we explicitly model measure-zero events as a limit of positive measure events, and that we can execute these types of probabilistic programs by accumulating infinitesimal probabilities rather than probability densities. Our extension improves probabilistic programming languages as an executable notation for probability distributions by making it more well-behaved and more expressive, by allowing the programmer to be explicit about which limit is intended when conditioning on an event of measure zero.

CCS Concepts: • Mathematics of computing → Probability and statistics; •
Theory of computation → Probabilistic computation.

Additional Key Words and Phrases: probabilistic programming
ACM Reference Format:
Jules Jacobs. 2021. Paradoxes of Probabilistic Programming: And How to Condition on Events of Measure Zero with Infinitesimal Probabilities. Proc. ACM Program. Lang. 5, POPL, Article 58 (January 2021), 26 pages. https://doi.org/10.1145/3434339
Probabilistic programming languages such as Stan [Carpenter et al. 2017], Church [Goodman et al. 2008], and Anglican [Wood et al. 2014] allow programmers to express probabilistic models in statistics and machine learning in a structured way, and run these models with generic inference algorithms such as importance sampling, Metropolis–Hastings, SMC, and HMC. At its core, a probabilistic programming language is a notation for probability distributions that looks much like normal programming with calls to random number generators, but with an additional observe construct.

There are two views on probabilistic programming. The pragmatist says that probabilistic programs are a convenient way to write down a likelihood function, and the purist says that probabilistic programs are a notation for structured probabilistic models. The pragmatist interprets an observe statement as "soft conditioning", or imperatively multiplying the likelihood function by some factor. The purist interprets an observe statement as true probabilistic conditioning in the sense of conditional distributions. The pragmatist may also want to write a probabilistic program to compute the likelihood function of a conditional distribution, but the pragmatist is not surprised
that there are nonsensical probabilistic programs that do not express any sensible statistical model. After all, if one writes down an arbitrary likelihood function then it will probably not correspond to a sensible, structured, non-trivial statistical model. The pragmatist blames the programmer for writing nonsensical programs, just as it would have been the fault of the programmer if they had written down the same likelihood function manually. The purist, on the other hand, insists that any probabilistic program corresponds to a structured statistical model, and that each observe statement in a probabilistic program has a probabilistic interpretation whose composition results in the statistical model. We will show that the current state is not satisfactory for the purist, and we will show how to make probabilistic programming languages satisfactory in this respect.

The difficulties with conditioning in probabilistic programs can be traced back to a foundational issue in probability theory. When the event E being conditioned on has nonzero probability, the conditional distribution P(A | E) is defined as:

    P(A | E) = P(A ∩ E) / P(E)

However, this formula for conditional probability is undefined when P(E) = 0, since then also P(A ∩ E) = 0 and the fraction P(A | E) = 0/0 is undefined. In probabilistic programming we often wish to condition on events E with probability 0, such as "x = 2.0", where x is a continuous random variable. There are several methods to condition on measure-zero events. For continuous distributions that have probability density functions, we can replace the probabilities in the above formula with probability densities, which are (usually) nonzero even if P(E) is zero. For more complicated situations, we can use the Radon–Nikodym derivative or disintegration [Ackermann et al. 2017; Chang and Pollard 1997; Dahlqvist and Kozen 2020; Shan and Ramsey 2017].

A general method for conditioning on measure-zero events is to define a sequence of events E_ε parameterized by a number ε > 0 such that E_ε in some sense converges to E in the limit ε → 0, but P(E_ε) > 0 for all ε > 0. We then define the conditional distribution to be the limit of P(A | E_ε):

    P(A | E) = lim_{ε→0} P(A ∩ E_ε) / P(E_ε)

In the book
Probability Theory: The Logic of Science [Jaynes 2003],
E. T. Jaynes explains that conditioning on measure-zero events is inherently ambiguous, because it depends not just on E but also on the limiting operation E_ε we choose:

    Yet although the sequences {A_ε} and {B_ε} tend to the same limit "y = 0", the conditional densities [P(x | A_ε) and P(x | B_ε)] tend to different limits. As we see from this, merely to specify "y = 0" without any qualifications is ambiguous; it tells us to pass to a measure-zero limit, but does not tell us which of any number of limits is intended. [...] Whenever we have a probability density on one space and we wish to generate from it one on a subspace of measure zero, the only safe procedure is to pass to an explicitly defined limit by a process like [A_ε and B_ε]. In general, the final result will and must depend on which limiting operation was specified. This is extremely counter-intuitive at first hearing; yet it becomes obvious when the reason for it is understood.

The other methods implicitly make the choice of E_ε for us. Conditioning on events of measure zero using those methods can lead to paradoxes such as the Borel–Kolmogorov paradox, even in the simplest case when probability density functions exist. Paradoxes occur because seemingly unimportant restatements of the problem, such as using a different parameterization for the variables, can affect the choice of E_ε that those methods make, and thus change the value of the limit. Consider the following probabilistic program:

    h = rand(Normal(1.7, 0.5))
    if rand(Bernoulli(0.5))
        observe(Normal(h, 0.1), 2.0)
    end
We first sample a value (say, a person's height) from a prior normally distributed around 1.7 meters and then with probability 0.5 we observe a measurement normally distributed around the height to be 2.0. We ran this program in Anglican with importance sampling, and obtained the following expectation values for h: 1.812, 1.814, 1.823, 1.813, 1.806 (10000 samples each). Suppose that we had measured the height in centimeters instead of meters:

    h = rand(Normal(170, 50))
    if rand(Bernoulli(0.5))
        observe(Normal(h, 10), 200)
    end

We might naively expect this program to produce roughly the same output as the previous program, but multiplied by a factor of 100 to account for the conversion of meters to centimeters. Instead, we get 170.1, 170.4, 171.5, 170.2, 169.4. This behavior happens because even though the units of the program appear to be correct, the calculations that importance sampling does to estimate the expectation value involve arithmetic with inconsistent units (in this case, adding a quantity with units m⁻¹ to a quantity with neutral units). The issue is not particular to Anglican or importance sampling, but due to the interaction of stochastic branching with the way the likelihood is calculated with probability densities; other algorithms [Paige et al. 2014; Tolpin et al. 2015] have the same behavior. In fact, formal semantics based on likelihood accumulation, such as the commutative semantics [Staton 2017] and the semantics based on Quasi-Borel spaces [Heunen et al. 2017], also perform arithmetic with inconsistent units for this example. Lexical likelihood weighting [Wu et al. 2018] does give the right answer for this example (many thanks to Alex Lew for pointing this out), but still exhibits unit anomalies for other examples described in Section 3.

Unit errors in a programming language's implementation or semantics may seem like a very serious issue, but we do not believe that this is a show-stopper in practice, because practitioners can always take the pragmatist view and avoid writing such programs. Although we consider this to be an important foundational issue, it does not invalidate existing work on probabilistic programming. It is known that conditionals can be problematic. Some inference algorithms, like SMC, will make assumptions that exclude observe inside conditionals. For example, [van de Meent et al. 2018] mentions the following when describing SMC:

    Each breakpoint needs to occur at an expression that is evaluated in every execution of a program. In particular, this means that breakpoints should not be associated with expressions inside branches of if expressions. [...] An alternative design, which is often used in practice, is to simply break at every observe and assert that each sample has halted at the same point at run time.

If the design is used where breakpoints happen at every observe, then the assertion that breakpoints should not be associated with expressions inside branches of if expressions will disallow using SMC with programs that have observes inside conditionals. Languages such as Stan, that do not have or do not allow stochastic branching, also do not suffer from the preceding example. In Section 3 we will show that the problem is not limited to conditionals; there are programs that do not have conditionals but nevertheless have paradoxical behavior.
Furthermore, we show that the standard method of likelihood accumulation for implementing probabilistic programming languages can sometimes obtain an answer that disagrees with the purist's exact value for P(A | E) even if P(E) is nonzero, due to a confusion between probabilities and probability densities.

We identify three types of paradoxes that affect probabilistic programming languages that allow dynamically conditioning on events of measure zero. These paradoxes are based on the idea that it should not matter which parameter scale we use for variables. It shouldn't matter whether we use meters or centimeters to measure height, but it also shouldn't matter whether we use energy density or decibels to measure sound intensity. The change from centimeters to meters involves a linear parameter transformation 1 cm = 0.01 m, whereas the change from energy density to decibels involves a nonlinear parameter transformation decibels = 10·log₁₀(energy density). We give several example programs that show that the output of a probabilistic program can depend on the parameter scale used when we condition on events of measure zero.

Following Jaynes' advice, we extend the language with notation for explicitly choosing which limit E_ε we mean in an observe statement. We give an implementation of likelihood accumulation using infinitesimal probabilities instead of probability densities, and show that this does not suffer from the three types of paradoxes. Infinitesimal probabilities give meaning to conditioning on measure-zero events in terms of a limit of events of strictly positive measure. Since events of strictly positive measure are unproblematic, paradoxes can no longer occur.

Furthermore, we add explicit language support for parameter transformations. This is only soundly possible due to the introduction of infinitesimal probabilities. We show that introducing a parameter transformation in an observe statement does not change the behavior of the probabilistic program. That is, we show that in our language, observe(D,I) has the same behavior as observe(D',I') where D',I' is D,I in a different parameter scale.

Our contributions are the following.

• We identify a problem with existing probabilistic programming languages, in which likelihood accumulation with probability densities can result in three different types of paradoxes when conditioning on a measure-zero event. The three paradoxes violate the principle that the output of a program should not depend on the parameter scale used (Section 3).

• We analyze the event that probabilistic programs with observe statements condition on, taking the paradox-free discrete case as a guide, in order to determine what observe ought to mean in the continuous case (Section 2).
• We propose a change to probabilistic programming languages to avoid the paradoxes of the continuous measure-zero case, by changing the observe construct to condition on measure-zero events E as an explicit limit ε → 0 of E_ε (Sections 4 and 5), and
  – a method for computing the limit by accumulating infinitesimal probabilities instead of probability densities, which we use to implement the adjusted observe construct,
  – a theorem that shows that infinitesimal probabilities correctly compute the limit of E_ε, ensuring that programs that use observe on measure-zero events are paradox free,
  – a translation from the existing observe construct to our new observe construct, which gives the same output if the original program was non-paradoxical,
  – language support for parameter transformations, which we use to show that the meaning of programs in our language is stable under parameter transformations,
  – an implementation of our language as an embedded DSL in Julia [Jacobs 2020] (Section 6).
Different probabilistic programming languages have different variants of the observe statement. Perhaps its simplest variant, observe(b), takes a boolean b and conditions on that boolean being true. For instance, if we throw two dice and want to condition on the sum of the dice being 8, we can use this probabilistic program, in pseudocode:

    function twoDice()
        x = rand(DiscreteUniform(1,6))
        y = rand(DiscreteUniform(1,6))
        observe(x + y == 8)
        return x
    end

The program twoDice represents the conditional distribution P(x | x + y = 8) where x and y are uniformly distributed numbers from 1 to 6. We wrap the program in a function and use the return value to specify the variable x whose distribution we are interested in. Anglican has a defquery construct analogous to the function definition that we use here.

Probabilistic programming languages allow us to sample from the distribution specified by the probabilistic program and compute expectation values. The simplest method to implement observe is rejection sampling [Goodman et al. 2008; von Neumann 1951]: we start a trial by running the program from the beginning, drawing random samples with rand, and upon encountering observe(x + y == 8) we test the condition; if the condition is not satisfied we reject the current trial and restart the program from the beginning, hoping for better luck next time. If all observes in a trial are satisfied, then we reach the return statement and obtain a sample for x. We estimate expectation values by averaging multiple samples.

What makes probabilistic programs such an expressive notation for probability distributions is that we have access to the full power of a programming language, such as its control flow and higher order functions [Heunen et al. 2017]. The following example generates two random dice throws x and y, and a random boolean b, and uses an observe statement to condition on the sum of the dice throws being 8 if b = true, with control flow:

    x = rand(DiscreteUniform(1,6))
    y = rand(DiscreteUniform(1,6))
    b = rand(Bernoulli(0.5))
    if b
        observe(x + y == 8)
    end
    return x

This code expresses the conditional probability distribution P(x | E) where x, y, b are distributed according to the given distributions, and E is the event (b = true ∧ x + y = 8) ∨ (b = false). That is, a trial is successful if x + y = 8 or if b = false.

In general, a probabilistic program conditions on the event that the tests of all observe statements that are executed succeed. A bit more formally, we have an underlying probability space Ω and we think of an element ω ∈ Ω as the "random seed" that determines the outcome of all rand calls (it is sufficient to take Ω = R; a real number contains an infinite amount of information, sufficient to determine the outcome of an arbitrary number of rand calls, even if those calls are sampling from continuous distributions). The execution trace of the program is completely determined by the choice ω ∈ Ω. For some subset E ⊂ Ω, the tests of all the observe calls that are executed in the trace succeed. This is the event E that a probabilistic program conditions on. Rejection sampling gives an intuitive semantics for the observe statement:

    For a boolean b, the statement observe(b) means that we continue with the current trial only if b = true. If b = false we reject the current trial.

Unfortunately, rejection sampling can be highly inefficient when used to run a probabilistic program.
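To make the rejection sampling semantics concrete, a minimal sampler for twoDice might look as follows in Julia (a sketch; it assumes the Distributions.jl package, and the function names are ours):

    using Distributions, Statistics

    # One trial of the twoDice model: reject (return nothing) if the
    # observed condition fails, otherwise return a sample for x.
    function two_dice_trial()
        x = rand(DiscreteUniform(1, 6))
        y = rand(DiscreteUniform(1, 6))
        x + y == 8 || return nothing   # observe(x + y == 8): reject on failure
        return x
    end

    # Keep retrying until a trial survives all observe statements.
    function rejection_sample()
        while true
            s = two_dice_trial()
            s === nothing || return s
        end
    end

    # Estimate E[x | x + y = 8] by averaging accepted samples (≈ 4.0 by symmetry).
    mean(rejection_sample() for _ in 1:10_000)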
If we use 1000-sided dice instead of 6-sided dice, the probability that the sum x + y is a particular fixed value is very small, so most trials will be rejected and it may take a long time to obtain a successful sample. Probabilistic programming languages therefore have a construct observe(D,x) that means observe(rand(D) == x), but can be handled by more efficient methods such as importance sampling or Markov chain Monte Carlo (MCMC). The previous example can be written using this type of observe as follows:

    x = rand(DiscreteUniform(1,6))
    b = rand(Bernoulli(0.5))
    if b
        observe(DiscreteUniform(1,6), 8 - x)
    end
    return x

This relies on the fact that x + y == 8 is equivalent to y == 8 - x. The intuitive semantics of observe(D,x) is as follows:

    For discrete distributions D, the statement observe(D,x) means that we sample from D and only continue with the current trial if the sampled value is equal to x.

This variant of observe can be implemented more efficiently than rejection sampling. We keep track of the weight of the current trial that represents the probability that the trial is still active (i.e. the probability that it was not yet rejected). An observe(D,x) statement will multiply the weight of the current trial by the probability P(D,x) that a sample from D is equal to x:

    For discrete distributions D, the statement observe(D,x) gets executed as weight *= P(D,x), where P(D,x) is the probability of x in D.

The output of a trial of a probabilistic program is now a weighted sample: a pair of a random value x and a weight. Weighted samples can be used to compute expectation values as weighted averages (this is called importance sampling). (More advanced MCMC methods can use the weight to make intelligent choices for what to return from rand calls, whereas importance sampling uses a random number generator for rand calls. We focus on importance sampling because it is the simplest method beyond rejection sampling.)
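Concretely, the weighted execution of the dice example above might look as follows (a sketch assuming Distributions.jl, where pdf on a discrete distribution is its probability mass function, so weight *= pdf(D, x) implements weight *= P(D,x)):

    using Distributions

    # One weighted trial: observe multiplies the weight instead of rejecting.
    function dice_trial()
        weight = 1.0
        x = rand(DiscreteUniform(1, 6))
        b = rand(Bernoulli(0.5))
        if b
            weight *= pdf(DiscreteUniform(1, 6), 8 - x)  # observe(DiscreteUniform(1,6), 8 - x)
        end
        return x, weight
    end

    # Importance sampling: estimate E[x | E] as a weighted average over trials.
    trials = [dice_trial() for _ in 1:10_000]
    estimate = sum(w * x for (x, w) in trials) / sum(w for (x, w) in trials)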
Estimating an expectation value using importance sampling will usually converge faster than rejection sampling, because importance sampling's observe will deterministically weigh the trial by the probability P(D,x) rather than randomly rejecting the trial with probability 1 − P(D,x).
If P(D,x) = 0.01 then rejection sampling would reject 99% of trials, which is obviously very inefficient. It is important to note that multiplying weight *= P(D,x) is the optimized implementation of observe, and we may still semantically think of it as rejecting the trial if sample(D) != x.

If the distribution D is a continuous distribution, then the probability that a sample from D is equal to any particular value x becomes zero, so rejection sampling will reject 100% of trials; it becomes infinitely inefficient. This is not surprising, because on the probability theory side, the event E that we are now conditioning on has measure zero. Importance sampling, on the other hand, continues to work in some cases, provided we replace probabilities with probability densities:

    For continuous distributions D, the statement observe(D,x) gets executed as weight *= pdf(D,x), where pdf(D,x) is the probability density of x in D.

For instance, if we want to compute E[x | x + y = 8] where x and y are distributed according to Normal(2,3) distributions, conditioned on their sum being 8, we can use the following probabilistic program:

    x = rand(Normal(2,3))
    observe(Normal(2,3), 8 - x)
    return x

This allows us to draw (weighted) samples from the distribution P(x | x + y = 8) where x, y are distributed according to Normal(2,3). Unfortunately, as we shall see in the next section, unlike the discrete case, we do not in general have a probabilistic interpretation for observe(D,x) on continuous distributions D when control flow is involved, and we can get paradoxical behavior even if control flow is not involved.

We identify three types of paradoxes. The first two involve control flow, where we either execute observe on different variables in different control flow paths, or an altogether different number of observes in different control flow paths. The third paradox is a variant of the Borel–Kolmogorov paradox and involves non-linear parameter transformations.
Consider the following probabilistic program:

    h = rand(Normal(1.7, 0.5))
    w = rand(Normal(70, 30))
    if rand(Bernoulli(0.5))
        observe(Normal(h, 0.1), 2.0)
    else
        observe(Normal(w, 5), 90)
    end
    bmi = w / h^2
We sample a person's height h and weight w from a prior, and then we observe a measurement of the height or weight depending on the outcome of a coin flip. Finally, we calculate the BMI, and want to compute its average. If h′ is the measurement sampled from Normal(h, 0.1) and w′ is the measurement sampled from Normal(w, 5) and b is the boolean sampled from Bernoulli(0.5), then the event that this program conditions on is (b = true ∧ h′ = 2.0) ∨ (b = false ∧ w′ = 90). This event has measure zero.

Just like the program in the introduction, this program exhibits surprising behavior when we change h from meters to centimeters: even after adjusting the formula bmi = w/(0.01·h)² to account for the change of units, the estimated expectation value for bmi still changes. Why does this happen?

The call to observe(D,x) is implemented as multiplying the weight by the probability density of x in D. Importance sampling runs the program many times, and calculates the estimate for bmi as a weighted average. Thus the program above effectively gets translated as follows by the implementation:

    weight = 1
    h = rand(Normal(1.7, 0.5))
    w = rand(Normal(70, 30))
    if rand(Bernoulli(0.5))
        weight *= pdf(Normal(h, 0.1), 2.0)
    else
        weight *= pdf(Normal(w, 5), 90)
    end
    bmi = w / h^2

where pdf(Normal(μ, σ), x) is the probability density function of the normal distribution:

    pdf(Normal(μ, σ), x) = 1/(σ√(2π)) · e^(−(1/2)((x−μ)/σ)²)

Importance sampling runs this program N times, obtaining a sequence (bmi_k, weight_k) for k ∈ {1, ..., N}. It estimates E[bmi] with a weighted average:

    E[bmi] ≈ (Σ_{k=1}^N weight_k · bmi_k) / (Σ_{k=1}^N weight_k)

The problem that causes this estimate to change if we change the units of h is that the formula adds quantities with inconsistent units: some weight_k have unit m⁻¹ (inverse length) and some have unit kg⁻¹ (inverse mass).

It might be surprising that the weights have units at all, but consider that if we have a probability distribution D over values of unit U, then the probability density function pdf(D, x) has units U⁻¹. The formula for pdf(Normal(μ, σ), x) shows this in the factor of 1/σ in front of the (unitless) exponential, which has a unit because σ has a unit.

The call pdf(Normal(h, 0.1), 2.0) has units m⁻¹ and the call pdf(Normal(w, 5), 90) has units kg⁻¹, and thus the variable weight has units m⁻¹ or kg⁻¹ depending on the outcome of the coin flip. The weighted average estimate for E[bmi] adds weights of different runs together, which means that it adds values of unit m⁻¹ to values of unit kg⁻¹. This manifests itself in the estimate changing depending on whether we use m or cm: computations that do arithmetic with inconsistent units may give different results depending on the units used. This calls into question whether this estimate is meaningful, since the estimate depends on whether we measure a value in m or cm, or in kg or g, which arguably should not matter at all.

The reader might now object that conditionally executed observe statements are always wrong, and probabilistic programs that use them should be rejected as erroneous. However, in the discrete case there are no unit errors, because in that case the weight gets multiplied by a probability rather than a probability density, and probabilities are unitless.
Furthermore, in the preceding section we have seen that conditionally executed observe statements have a rejection sampling interpretation in the discrete case. This gives the programs a probabilistic meaning in terms of conditional distributions, even if the discrete observe statements are inside conditionals. The event E that is being conditioned on involves the boolean conditions of the control flow. Ideally we would therefore not want to blame the programmer for using conditionals, but change the implementation of observe on continuous variables so that the program is meaningful in the same way that the analogous program on discrete variables is meaningful.

Let us analyze the program from the introduction:

    h = rand(Normal(1.7, 0.5))
    if rand(Bernoulli(0.5))
        observe(Normal(h, 0.1), 2.0)
    end
    return h
This program exhibits unit anomalies for the same reason: some of the weight_k have units m⁻¹ and some have no units, and adding those leads to the surprising behavior. Rather than taking this behavior as a given, let us analyze what this program ought to do, if we reason by analogy to the discrete case.

This program has the same structure as the dice program from Section 2, the difference being that we now use a normal distribution instead of a discrete uniform distribution. By analogy to that discrete case, the event that is being conditioned on is (b = true ∧ h′ = 2.0) ∨ (b = false), where h′ is the measurement from Normal(h, 0.1).

Surprisingly, this event does not have measure zero! The event (b = true ∧ h′ = 2.0) has measure zero, but the event b = false has measure 1/2, so the entire event has measure 1/2. We can therefore unambiguously apply the definition of conditional probability P(A | E) = P(A ∩ E)/P(E). Our probability space is Ω = R × R × bool, corresponding to h ∼ Normal(1.7, 0.5), h′ ∼ Normal(h, 0.1), b ∼ Bernoulli(0.5), and A ⊆ Ω and E = {(h, h′, b) | (b = true ∧ h′ = 2.0) ∨ (b = false)} ⊆ Ω. The posterior P(A | E) = P(A ∩ E)/P(E) = 2·P(A ∩ E) = 2·P(A ∩ {(h, h′, b) | b = false}), so the marginal posterior for h is simply Normal(1.7, 0.5). That is, the whole if statement with the observe ought to have no effect.

We can understand this intuitively in terms of rejection sampling: if the sampled boolean b = true, then the observe statement will reject the current trial with probability 1, because the probability of sampling exactly 2.0 from a normal distribution is zero. Hence if b = true then the trial will almost surely get rejected, whereas if b = false the trial will not get rejected. The trials where b = true ∧ h′ = 2.0 are negligibly rare, so even though the expectation of h is affected in those trials, they do not contribute to the final expectation value; only trials with b = false do.
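This analysis can be checked empirically (a sketch assuming Distributions.jl): replace the measure-zero observation by an acceptance interval of shrinking width w around 2.0 and rejection-sample E[h].

    using Distributions, Statistics

    function sample_h(w)
        while true
            h = rand(Normal(1.7, 0.5))
            if rand(Bernoulli(0.5))
                m = rand(Normal(h, 0.1))           # the observed measurement h′
                abs(m - 2.0) <= w / 2 || continue  # reject unless h′ lands in the interval
            end
            return h
        end
    end

    for w in (1.0, 0.1, 0.01)
        println(w, " => ", mean(sample_h(w) for _ in 1:10_000))
    end

As w shrinks, the b = true trials are almost always rejected, and the estimate approaches the prior mean 1.7.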
As an aside: if we added an extra unconditional observe(Normal(h, 0.1), 1.9) to the program, then the whole event will have measure zero, but nevertheless, trials with b = false will dominate over trials with b = true, relatively speaking. In general, the control flow path with the least number of continuous observes dominates. If there are multiple control flow paths with a minimal number of observes, but also control flow paths with a larger number of observes, we may have a paradox of mixed type 1 & 2.

This reasoning would imply that the if statement and the observe statement are irrelevant; the program ought to be equivalent to return rand(Normal(1.7, 0.5)). If this still seems strange, consider the following discrete analogue:

    h = rand(Binomial(10000, 0.5))
    if rand(Bernoulli(0.5))
        observe(Binomial(10000, 0.9), h)
    end
    return h

That is, we first sample h between 0 and 10000 according to a binomial distribution, and then with probability 0.5 we observe that h is equal to a number sampled from another binomial distribution that gives a number between 0 and 10000. Since that binomial distribution is highly biased toward numbers close to 9000, we might expect the average value of h to lie significantly higher than 5000. This is not the case. The rejection sampling interpretation tells us that most of the trials where the coin flipped true will be rejected, because the sample from Binomial(10000, 0.9) is almost never equal to h. Thus, although those samples have an average significantly above 5000, almost all of the successful trials will be trials where the coin flipped false, and thus the expected value of h will lie very close to 5000.

Since we know that rejection sampling agrees with importance sampling in expectation, importance sampling will also compute an estimate for the expectation value of h that lies very close to 5000. The further we increase the number 10000, the stronger this effect becomes, because the probability that the second sample is equal to h further decreases. In the continuous case this probability becomes 0, so the successful samples will almost surely be from trials where the coin flipped to false. Therefore the average value of h in the continuous case should indeed be 1.7, unaffected by the if statement and the observe.

Another way to express this point is that in the discrete case importance sampling, rejection sampling, and the exact value given by the conditional expectation are all in agreement, even if conditionals are involved. On the other hand, in the continuous case, importance sampling with probability densities gives a different answer than rejection sampling and the exact value given by the conditional expectation E[h | E] (the latter two are equal to each other; both 1.7).

The reader may insist that the semantics of the program is defined to be weight accumulation with probability densities, that is, the semantics of the program is defined to correspond to:

    weight = 1
    h = rand(Normal(1.7, 0.5))
    if rand(Bernoulli(0.5))
        weight *= pdf(Normal(h, 0.1), 2.0)
    end
    return h
We can only appeal to external principles to argue against this, such as unit consistency, the analogy with the discrete case, the probabilistic interpretation of observe, and the rejection sampling interpretation of observe, but the reader may choose to lay those principles aside and take this implementation of observe as the semantics of observe. We do hope to eventually convince this reader that a different implementation of observe that does abide by these principles could be interesting. Although our semantics will differ from the standard one, it will agree with lexicographic likelihood weighting [Wu et al. 2018] for this example, which does not exhibit this particular paradox.
Consider the problem of conditioning on x = y given x ∼ Normal(10, 5) and y ∼ Normal(15, 5), and computing the expectation E[exp(x)]. Written as a probabilistic program:

    x = rand(Normal(10,5))
    observe(Normal(15,5), x)
    return exp(x)

In a physical situation, x, y might be values measured in decibels and exp(x), exp(y) may be (relative) energy density. We could change parameters to a = exp(x) and b = exp(y). Then a ∼ LogNormal(10, 5) and b ∼ LogNormal(15, 5). Since the event x = y is the same as exp(x) = exp(y), we might naively expect the program to be equivalent to:

    a = rand(LogNormal(10,5))
    observe(LogNormal(15,5), a)
    return a

This is not the case. The two programs give different expectation values. (The discrepancy comes from the Jacobian of the transformation: pdf(LogNormal(15,5), a) = pdf(Normal(15,5), log a)/a, so the second program weighs each trial by an extra factor 1/exp(x) relative to the first.) Compared to type 1 & 2 paradoxes, this type 3 paradox shows that the subtlety is not restricted to programs that have control flow or to distributions that are not continuous; the normal and lognormal distributions are perfectly smooth.

This paradox is closely related to the Borel–Kolmogorov paradox. Another variant of the original Borel–Kolmogorov paradox is directly expressible in Hakaru [Shan and Ramsey 2017], but not in Anglican or Stan. Hakaru allows the programmer to condition on a measure-zero condition f(x, y) = 0 such as x + y − 8 = 0 directly, without having to manually invert the relationship to y = 8 − x, and performs symbolic manipulation to do exact Bayesian inference. Hakaru allows a single such observe at the very end of a program, which allows it to sidestep the previous paradoxes related to control flow. The semantics of the single observe is defined by disintegration, which means that the semantics of a Hakaru program depends on the form of f. That is, if we take another function g with the same solution set g(x, y) = 0 as f, the output may change. The programmer can use this mechanism to control which event they want to condition on. Our version of the paradox shows that the subtlety of conditioning on measure-zero events is not restricted to programs that use that type of disintegration.

Unit anomalies cannot occur with discrete distributions, because in the discrete case we only deal with probabilities and not with probability densities. Recall that for discrete probability distributions D, an observe statement observe(D,x) gets executed as weight *= P(D,x) where P(D,x) is the probability of x in the distribution D. Probabilities have no units, so the weight variable stays unitless and the weighted average is always unit correct if the probabilistic program is unit correct, even if observe statements get executed conditionally. Furthermore, in the discrete case we have a probabilistic and rejection sampling interpretation of observe, and we may view weight accumulation as an optimization to compute the same expectation values as rejection sampling, but more efficiently. We wish to extend these good properties to the continuous case.

The reason that the discrete case causes no trouble is not due to D being discrete per se. The reason it causes no trouble is that P(D,x) is a probability rather than a probability density. In the continuous case the probability that rand(D) == x is zero, and that's why it was necessary to use probability densities. However, even in the continuous case, the probability that a sample from D lies in some interval is generally nonzero. We shall therefore change the observe statement to observe(D,I) where I is an interval, which conditions on the event rand(D) ∈ I.
In the discrete case we can allow I to be a singleton set, but in the continuous case we insist that I is an interval of nonzero width. We have the following rejection sampling interpretation for observe(D,I):

    For continuous or discrete distributions D, the statement observe(D,I) means that we sample from D and only continue with the current trial if the sampled value lies in I.

And the following operational semantics for observe(D,I):

    For continuous or discrete distributions D, the statement observe(D,I) gets executed as weight *= P(D,I), where P(D,I) is the probability that a value sampled from D lies in I.

Let I = [a, b] = {x ∈ R : a ≤ x ≤ b}. We can calculate P(rand(D) ∈ [a, b]) = cdf(D, b) − cdf(D, a) using the cumulative distribution function cdf(D, x). This probability allows us to update the weight of the trial. For instance, a call observe(Normal(2.0,0.1), [a,b]) can be executed as weight *= normalcdf(2.0,0.1,b) - normalcdf(2.0,0.1,a), where normalcdf(μ, σ, x) is the cumulative distribution function for the normal distribution.

Notice how this change from probability densities to probabilities prevents unit anomalies: if we change the variables a, b from meters to centimeters, then we must write observe(Normal(200,10), [a,b]), which gets executed as weight *= normalcdf(200,10,b) - normalcdf(200,10,a). We introduced a factor 100 to convert μ and σ from meters to centimeters. This conversion ensures that the result of the program remains unchanged, because normalcdf(rμ, rσ, rx) = normalcdf(μ, σ, x) for all r > 0. Hence the computed weight will be exactly the same whether we work with meters or centimeters. On the other hand, for the probability density function it is not the case that normalpdf(rμ, rσ, rx) = normalpdf(μ, σ, x). It is precisely this lack of invariance that causes unit anomalies with probability densities.

We can approximate the old observe(D,x) behavior with observe(D,I) by choosing I = [x − w/2, x + w/2] to be a very small interval of width w around x (taking w to be a small number, such as w = 0.0001). This has two important advantages over observe(D,x):

(1) We no longer get unit anomalies or other paradoxes; if we change the units of x, we must also change the units of w, which keeps the weight the same.

(2) Unlike for observe(D,x), we have an unambiguous probabilistic and rejection sampling interpretation of observe(D,I) for intervals of nonzero width, because the event being conditioned on has nonzero measure.

However, the number w = 0.0001 is rather arbitrary. We would like to let w → 0 and recover the functionality of observe(D,x) to condition on an exact value. With sufficiently small w we can get arbitrarily close, but we can never recover its behavior exactly.

We therefore parameterize probabilistic programs by a dimensionless parameter eps. The BMI example then becomes:

    function bmi_example(eps)
        h = rand(Normal(170, 50))
        w = rand(Normal(70, 30))
        if rand(Bernoulli(0.5))
            observe(Normal(200, 10), (h, A*eps))
        else
            observe(Normal(90, 5), (w, B*eps))
        end
        return w / h^2
    end

Since eps is dimensionless, we cannot simply use eps as the width of the intervals: because h is in cm, the width of the interval around h has to be in cm, and the width of the interval around w has to be in kg.
We are forced to introduce a constant A with units cm and a constant B with units kg that multiply eps in the widths of the intervals in the observes.

We could now run importance sampling on bmi_example(eps) for n=10000 trials for eps=0.1, eps=0.01, eps=0.001, and so on, to see what value it converges to. If we run each of these independently, then the rand calls will give different results, so there will be different randomness in each of these, and it may be difficult to see the convergence. In order to address this, we can run the program with different values of eps but with the same random seed for the random number generator. This will make the outcomes of the rand calls the same regardless of the value of eps. In fact, for a given random seed, the result of running importance sampling for a given number of trials will be a deterministic function f(seed, eps) of the random seed and eps.

If we assume that the program uses eps = ε only in the widths of the intervals, and not in the rest of the program, then for a fixed seed, the function f(seed, ε) will be a function of ε of a specific form, because importance sampling computes

    f(seed, ε) = (Σ_{k=1}^N weight_k(ε) · value_k) / (Σ_{k=1}^N weight_k(ε))

In this fraction, the weight_k are a function of ε, but the value_k are independent of ε if ε only occurs inside the widths of intervals. Since the weight gets multiplied by P(D, I) on each observe(D,I), the weight_k(ε) is of a very specific form:

    weight_k(ε) = C · P(D₁, (x₁, w₁ε)) ⋯ P(Dₙ, (xₙ, wₙε))

where the constant C collects the probabilities accumulated from observes that did not involve ε, and the remaining factors are the probabilities from observes that did. Since P(D, (x, wε)) = cdf(D, x + wε/2) − cdf(D, x − wε/2), we could, in principle, determine the precise function weight_k(ε) and hence f(seed, ε) for any given seed. We could then, in principle, compute the exact limit of this function as ε → 0 with a computer algebra system. This is, of course, impractical. The next section shows that we can compute the limit efficiently by doing arithmetic with infinitesimal numbers.
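Such a fixed-seed experiment might look as follows (a sketch assuming Distributions.jl; here the unit-carrying constants A and B are both set to 1, and the interval (x, w) is taken to be [x − w/2, x + w/2]):

    using Distributions, Random

    function bmi_estimate(seed, eps; n = 10_000, A = 1.0, B = 1.0)
        rng = MersenneTwister(seed)   # fixed seed: same rand outcomes for every eps
        num = den = 0.0
        for _ in 1:n
            h = rand(rng, Normal(170, 50))
            w = rand(rng, Normal(70, 30))
            if rand(rng, Bernoulli(0.5))
                weight = cdf(Normal(200, 10), h + A*eps/2) - cdf(Normal(200, 10), h - A*eps/2)
            else
                weight = cdf(Normal(90, 5), w + B*eps/2) - cdf(Normal(90, 5), w - B*eps/2)
            end
            num += weight * (w / h^2)
            den += weight
        end
        return num / den   # this is the deterministic function f(seed, eps)
    end

    [bmi_estimate(42, eps) for eps in (0.1, 0.01, 0.001)]  # watch the convergence as eps → 0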
In order to recover the behavior of the old observe(D,x) using observe(D,I) with an interval I = [x − w/2, x + w/2], we want to take the limit w → 0, to make I an infinitesimally small interval around x. We accomplish this using symbolic infinitesimal numbers of the form rε^n, where r ∈ R and n ∈ Z. We allow n < 0, so that rε^n can also represent "infinitely large" numbers as well as "infinitesimally small" numbers. We will not make use of this possibility, but it makes the definitions and proofs more general and more uniform. (In the philosophy literature there has been work on using non-standard analysis and other number systems to handle probability 0 events; see [Pedersen 2014] and [Hofweber 2014] and references therein.)

Definition 5.1. An infinitesimal number is a pair (r, n) ∈ R × Z, which we write as rε^n. The infinitesimals of the form rε^0 correspond to the real numbers.
Definition 5.2. Addition, subtraction, multiplication, and division on these infinitesimal numbers are defined as follows:

    rε^n ± sε^k = (r ± s)ε^n   if n = k
                  rε^n         if n < k
                  ±sε^k        if n > k

    (rε^n) · (sε^k) = (r·s)ε^(n+k)

    (rε^n) / (sε^k) = (r/s)ε^(n−k)   if s ≠ 0
                      undefined      if s = 0

Like ordinary division, division of infinitesimals is a partial function, which is undefined if the denominator is exactly zero. These rules may be intuitively understood by thinking of ε as a very small number; e.g. if n < k then ε^k will be negligible compared to ε^n, which is why we define rε^n + sε^k = rε^n in that case, and keep only the lowest order term. (These infinitesimal numbers may be viewed as the leading terms of Laurent series. This bears some resemblance to the dual numbers used in automatic differentiation, which represent the constant and linear term of the Taylor series. In our case we only have the first nonzero term of the Laurent series, but the order of the term is allowed to vary. The exponent n of ε will play the same role as the number of densities d in lexicographic likelihood weighting [Wu et al. 2018].)

We represent intervals [x − w/2, x + w/2] as midpoint-width pairs (x, w), where w may be an infinitesimal number.

Definition 5.3. If D is a continuous distribution, we compute the probability P(D, (x, w)) that X ∼ D lies in the interval (x, w) as:

    P(D, (x, w)) = cdf(D, x + r/2) − cdf(D, x − r/2)   if w = rε^0 is not infinitesimal      (1)
                   pdf(D, x) · rε^n                    if w = rε^n is infinitesimal (n > 0)

where cdf(D, x) and pdf(D, x) are the cumulative distribution and probability density functions, respectively. Note that the two cases agree in the sense that if w is very small, then

    cdf(D, x + w/2) − cdf(D, x − w/2) ≈ (d/dx) cdf(D, x) · w = pdf(D, x) · w
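Definitions 5.1–5.3 can be implemented directly; the following Julia sketch (the type and function names are ours, and cdf/pdf are assumed from Distributions.jl) represents rε^n as a pair (r, n):

    using Distributions

    struct Infinitesimal   # r * ε^n
        r::Float64
        n::Int
    end

    # Addition keeps only the lowest-order term (Definition 5.2).
    function Base.:+(a::Infinitesimal, b::Infinitesimal)
        a.n == b.n && return Infinitesimal(a.r + b.r, a.n)
        return a.n < b.n ? a : b
    end
    Base.:-(a::Infinitesimal) = Infinitesimal(-a.r, a.n)
    Base.:-(a::Infinitesimal, b::Infinitesimal) = a + (-b)
    Base.:*(a::Infinitesimal, b::Infinitesimal) = Infinitesimal(a.r * b.r, a.n + b.n)
    function Base.:/(a::Infinitesimal, b::Infinitesimal)
        b.r == 0 && error("division by exactly zero")  # the undefined case
        return Infinitesimal(a.r / b.r, a.n - b.n)
    end

    # P(D, (x, w)) for an interval [x - w/2, x + w/2] whose width w may be
    # infinitesimal (Definition 5.3; the infinitesimal case assumes n > 0).
    function probability(D, x::Float64, w::Infinitesimal)
        w.n == 0 && return Infinitesimal(cdf(D, x + w.r/2) - cdf(D, x - w.r/2), 0)
        return Infinitesimal(pdf(D, x), 0) * w
    end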
Definition 5.4. We say that f(x) is a "probability expression" in the variable x if f(x) is defined using the operations +, −, ·, /, constants, and P(D, (s, rx)), where r, s ∈ R are constants and D is a probability distribution with differentiable cdf.

We can view f as a function from reals to reals (on the domain on which it is defined, that is, excluding points where division by zero happens), or as a function from infinitesimals to infinitesimals by re-interpreting the operations in infinitesimal arithmetic. The value of f(ε) on the symbolic infinitesimal ε tells us something about the limiting behavior of f(x) near zero:

Theorem 5.5. If f(x) is a probability expression, and if evaluation of f(ε) is not undefined, and f(ε) = rε^n, then lim_{x→0} f(x)/x^n = r.

Note that the theorem only tells us that lim_{x→0} f(x)/x^n = r if f(ε) evaluates to rε^n with infinitesimal arithmetic. If evaluating f(ε) results in division by zero, then the theorem does not give any information. In fact, the converse of the theorem does not hold: it may be that lim_{x→0} f(x)/x^n = r but evaluating f(ε) results in division by zero.

Proof. By induction on the structure of the expression. We know that evaluation of f(ε) did not result in division by zero, and f(ε) = rε^n. We need to show that lim_{x→0} f(x)/x^n = r.

• If f(x) is a constant r, then we have f(ε) = rε^0, and indeed lim_{x→0} f(x)/x^0 = lim_{x→0} f(x) = r.

• If f(x) = P(D, (s, rx)): now f(ε) = pdf(D, s)·rε, and

    pdf(D, s)·r = r·(d/dx)[cdf(D, x)]_{x=s}
                = r·lim_{x→0} (cdf(D, s + x/2) − cdf(D, s − x/2))/x
                = lim_{x′→0} (cdf(D, s + rx′/2) − cdf(D, s − rx′/2))/x′
                = lim_{x′→0} P(D, (s, rx′))/x′

• If f(x) = g(x) + h(x): since evaluation of f(ε) did not result in division by zero, neither did evaluation of the subexpressions g(ε) and h(ε), so g(ε) = r₁ε^{n₁} and h(ε) = r₂ε^{n₂} for some r₁, r₂, n₁, n₂. Therefore, by the induction hypothesis we have lim_{x→0} g(x)/x^{n₁} = r₁ and lim_{x→0} h(x)/x^{n₂} = r₂.

  • Case n₁ = n₂: now f(ε) = (r₁ + r₂)ε^{n₁}, and we have

        lim_{x→0} f(x)/x^{n₁} = lim_{x→0} (g(x) + h(x))/x^{n₁} = lim_{x→0} g(x)/x^{n₁} + lim_{x→0} h(x)/x^{n₁} = r₁ + r₂

  • Case n₁ < n₂: now f(ε) = r₁ε^{n₁}, and since lim_{x→0} h(x)/x^{n₂} = r₂ we have

        0 = 0·r₂ = (lim_{x→0} x^{n₂−n₁}) · (lim_{x→0} h(x)/x^{n₂}) = lim_{x→0} x^{n₂−n₁}·h(x)/x^{n₂} = lim_{x→0} h(x)/x^{n₁}

    Therefore lim_{x→0} f(x)/x^{n₁} = lim_{x→0} g(x)/x^{n₁} + lim_{x→0} h(x)/x^{n₁} = r₁ + 0 = r₁.

  • Case n₁ > n₂: analogous to the previous case.

• If f(x) = g(x) − h(x): analogous to the case for addition.

• If f(x) = g(x)·h(x): since evaluation of f(ε) did not result in division by zero, neither did evaluation of the subexpressions g(ε) and h(ε), so g(ε) = r₁ε^{n₁} and h(ε) = r₂ε^{n₂} for some r₁, r₂, n₁, n₂. Therefore, by the induction hypothesis we have lim_{x→0} g(x)/x^{n₁} = r₁ and lim_{x→0} h(x)/x^{n₂} = r₂. Then

    lim_{x→0} f(x)/x^{n₁+n₂} = lim_{x→0} (g(x)/x^{n₁})·(h(x)/x^{n₂}) = (lim_{x→0} g(x)/x^{n₁})·(lim_{x→0} h(x)/x^{n₂}) = r₁·r₂

• If f(x) = g(x)/h(x): since evaluation of f(ε) did not result in division by zero, neither did evaluation of the subexpressions g(ε) and h(ε), so g(ε) = r₁ε^{n₁} and h(ε) = r₂ε^{n₂} for some r₁, r₂, n₁, n₂. Therefore, by the induction hypothesis we have lim_{x→0} g(x)/x^{n₁} = r₁ and lim_{x→0} h(x)/x^{n₂} = r₂.
By the assumption that no division by exactly zero occurred in the evaluation of f(ε), we have r₂ ≠ 0. Then

    lim_{x→0} f(x)/x^{n₁−n₂} = lim_{x→0} (g(x)/x^{n₁})/(h(x)/x^{n₂}) = (lim_{x→0} g(x)/x^{n₁})/(lim_{x→0} h(x)/x^{n₂}) = r₁/r₂

This finishes the proof. □

Some subtleties of limits and infinitesimals.
In order to think about infinitesimals one must first choose a function f(x) of which one wishes to learn something about the limit as x → 0. Thinking about infinitesimal arithmetic independent of such a function leads to confusion. Furthermore, the result of evaluating f(ε) depends not just on f(x) as a function on real numbers, but also on the arithmetic expression used for computing f. Consider the functions f, g:

    f(x) = 0·x + 2·x²
    g(x) = 2·x²

As functions on real numbers, f = g, but nevertheless, with infinitesimal arithmetic their results differ:

    f(ε) = 0·ε
    g(ε) = 2·ε²

Applying the theorem to these results gives the following limits for f and g:

    lim_{x→0} f(x)/x = 0        lim_{x→0} g(x)/x² = 2

Both of these limits are correct, but this example shows that which limit the theorem says something about may depend on how the function is computed. The limit for g gives more information than the limit for f; the limit for f is conservative and doesn't tell us as much as the limit for g does. Fortunately, this won't be a problem for our use case: we intend to apply the theorem to the weighted average of importance sampling, where the probabilities may be infinitesimal numbers. In this case the power of ε of the numerator and denominator are always the same, so the final result will always have power ε^0, and the theorem will then tell us about the limit lim_{x→0} f(x)/x^0 = lim_{x→0} f(x).

Another subtlety is that the converse of the theorem does not hold. It is possible that lim_{x→0} f(x)/x^n = r, but evaluation of f(ε) with infinitesimal arithmetic results in division by exactly zero. An example is f(x) = x²/((x + x²) − x). We have lim_{x→0} f(x) = 1, but when evaluating f(ε) = ε²/((ε + ε²) − ε), division by zero occurs, because we have the evaluation sequence:

    ε²/((ε + ε²) − ε)  →  ε²/(ε − ε)  →  ε²/(0·ε)  →  undefined

If we used full Laurent series a_k·ε^k + a_{k+1}·ε^{k+1} + ... as our representation for infinitesimal numbers, then we would potentially be able to compute more limits, even some of those where exact cancellation happens in a denominator. Keeping only the first term is sufficient for our purposes, and more efficient, because our infinitesimal numbers are pairs (r, n) of a real (or floating point) number r and an integer n, whereas Laurent series are infinite sequences of real numbers (a_k, a_{k+1}, ...).

The lemmas about computing limits have the form "For all a, b ∈ R, if lim_{x→0} f(x) = a, and lim_{x→0} g(x) = b, and b ≠ 0, then lim_{x→0} f(x)/g(x) = (lim_{x→0} f(x))/(lim_{x→0} g(x))". It is not true in general that lim_{x→0} f(x)/g(x) = (lim_{x→0} f(x))/(lim_{x→0} g(x)). It is possible that the limit on the left hand side exists even when the limits on the right hand side fail to exist, or when the right hand side is 0/0. Therefore, in order to apply these theorems about limits, we must know that the right hand side is not undefined, prior to applying such a lemma. In the proof above, the existence of the limits follows from the induction hypothesis, and that the denominator is nonzero follows from the assumption that division by zero does not occur. This is why we must assume that no division by exactly zero occurs in the evaluation of f(ε) with infinitesimal arithmetic, and it is also why the converse of the theorem does not hold.
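Both subtleties can be reproduced with the Infinitesimal sketch from above:

    eps1 = Infinitesimal(1.0, 1)          # the symbolic ε
    zero_eps = Infinitesimal(0.0, 1)      # 0·ε

    f = zero_eps + Infinitesimal(2.0, 2)  # 0·ε + 2·ε² evaluates to 0·ε
    g = Infinitesimal(2.0, 2)             # 2·ε²: same real function, more information

    # ε²/((ε + ε²) − ε): the denominator evaluates to 0·ε, so division fails
    # even though the limit of the underlying real function exists.
    try
        Infinitesimal(1.0, 2) / ((eps1 + Infinitesimal(1.0, 2)) - eps1)
    catch err
        println(err)   # division by exactly zero: the theorem gives no information
    end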
The proposed observe construct allows finite width intervals observe(D,(a,w)), where w is an expression that returns a number, as well as infinitesimal width intervals, as in observe(D,(a,w*eps)), where w is some expression that returns a number and eps is the symbolic infinitesimal ε. It is possible to allow higher powers of eps to occur directly in the source program, and it is possible to allow eps to occur in other places than in widths of intervals, but for conceptual simplicity we shall assume it doesn't, and that observe is always of one of those two forms. That is, we will assume that eps is only used in order to translate exact conditioning observe(D,x) to observe(D,(x,w*eps)). We translate the example from the introduction as follows:

    h = rand(Normal(170, 50))
    if rand(Bernoulli(0.5))
        observe(Normal(200, 10), (h,w*eps))
    end

where the pair (h,w*eps) represents an interval of width w*eps centered around h, in order to condition on the observation being "exactly h".

Let us now investigate the meaning of this program according to the rejection sampling interpretation of observe. Assuming the coin flip results in true, we reject the trial if the sample from Normal(200, 10) does not fall in the interval [h − wε/2, h + wε/2]. If the coin flip results in false, we always accept the trial. If we let ε → 0 then the probability of rejecting the trial goes to 1 if the coin flips to true, so almost all successful trials will be those where the coin flipped to false. Therefore the expected value of h converges to 170 as ε → 0, and the expected value of running this program should be 170.

We translate the BMI example as follows:

    h = rand(Normal(170, 50))
    w = rand(Normal(70, 30))
    if rand(Bernoulli(0.5))
        observe(Normal(200, 10), (h, A*eps))
    else
        observe(Normal(90, 5), (w, B*eps))
    end
    bmi = w / h^2
where A and B are constants with units cm and kg, respectively. The units force us to introduce these constants: since (h, A*eps) represents an interval centered at h (in cm), the width A*eps must also be a quantity in cm. If we change the units of h or w, we also need to change the units of A or B. If we change the units of h and A from centimeters to meters, the numerical values of h and A both get multiplied by 0.01. This additional factor for A*eps, which cannot be provided in the original non-interval type of observe(D,x) statement, is what will make this program behave consistently under change of units.

Both branches of the if statement contain observes with intervals of infinitesimal width, so with rejection sampling both branches will be rejected with probability 1, regardless of the outcome of the coin flip. We must therefore interpret the example with eps tending to 0, but not being exactly 0. If we chose A to be 1 meter and B to be 1 kg, and change B to be 1000 kg, then the observe in the else branch is 1000x more likely to succeed compared to before, because the width of the interval goes from ε to 1000ε. If we made this change then most of the successful trials would be trials where the coin flipped to false. Thus even in the infinitesimal case, the relative sizes of the intervals matter a great deal. The relative sizes of the intervals are an essential part of the probabilistic program, and omitting them will inevitably lead to unit anomalies, because changing units also requires resizing the intervals by a corresponding amount (by 1000× in case we change w from kg to g). If we do not resize the intervals, that changes the relative rejection rates of the branches, or the relative weights of the trials, and thus the estimated expectation value E[bmi]. As Jaynes notes, conditioning on measure-zero events is ambiguous; even though in the limit the intervals (w,1*eps) and (w,1000*eps) both tend to the singleton set {w}, relative to the interval (h,A*eps) it matters which of these limits is intended, and the final result will and must depend on which limit was specified.

We translate the third example as follows:

    x = rand(Normal(10,5))
    observe(Normal(15,5), (x,eps))
    return exp(x)

After a parameter transformation from x to exp(x) we get the following program:

    exp_x = rand(LogNormal(10,5))
    observe(LogNormal(15,5), (exp_x,exp_x*eps))
    return exp_x

Note that the width of the interval is now exp_x*eps and not simply eps. In general, if we apply a differentiable function f to an interval of width ε around x, we obtain an interval of width f′(x)ε around f(x). If we take the exponential of an interval of small width ε around x, we get an interval of width exp(x)ε around exp(x), not an interval of width ε around exp(x). Both of these programs should give the same estimate for the expectation value of exp(x), so infinitesimal width intervals allow us to correctly express non-linear parameter transformations without running into Borel–Kolmogorov-type paradoxes.
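The underlying identity is the usual change-of-variables formula, pdf(LogNormal(μ,σ), e^x)·e^x = pdf(Normal(μ,σ), x): the lognormal density times the Jacobian exp(x) recovers the normal density, so the infinitesimal weights pdf(LogNormal(15,5), exp_x)·exp_x·ε and pdf(Normal(15,5), x)·ε agree. A quick numerical check (assuming Distributions.jl):

    using Distributions

    x = 12.3   # an arbitrary test point
    pdf(LogNormal(15, 5), exp(x)) * exp(x) ≈ pdf(Normal(15, 5), x)  # true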
It is debatable whether conditioning on small but finite width intervals is preferable to conditioning on measure zero events. Real measurement devices do not measure values to infinite precision. If a measurement device displays 45.88, we might take that to mean an observation in the interval [45.875, 45.885]. The measurement may in addition measure the true value x plus some Normal(0,sigma) distributed noise rather than the true value x. In this case it might be appropriate to use observe(Normal(x,sigma), (45.88, 0.01)). The finite precision of the device and its noisy measurement are in principle two independent causes of uncertainty. The rejection sampling interpretation of this program is that we first sample a value from Normal(x,sigma) and then continue with the current trial if this lies in the interval [45.875, 45.885], which matches the two sources of uncertainty. An argument for using infinitesimal width intervals is that observe on a finite interval requires the evaluation of the distribution's CDF, which is usually more complicated and expensive to compute than the distribution's PDF.

The term "soft conditioning" is sometimes used for observe(D,x) statements, particularly when the distribution D is the normal distribution. This term can be interpreted as an alternative to the rejection sampling interpretation in several ways:

(1) Rather than conditioning on x being exactly y, we instead condition on x being "roughly" y.
(2) The statement observe(D,x) means that we continue with the current trial with probability pdf(D,x) and reject it otherwise.

We argue that neither of these interpretations is completely satisfactory. For (1) it is unclear what the precise probabilistic meaning of conditioning on x being "roughly" y is. One possible precise meaning of that statement is that we reject the trial if the difference |x − y| is too large, and continue otherwise, but this is not what a statement such as observe(Normal(y,0.01), x) does. Rather, it weighs trials where x is close to y higher, and smoothly decreases the weight as the distance between x and y gets larger. It may seem that (2) makes this idea precise, but unfortunately pdf(D,x) is not a probability but a probability density, and can even have units or be larger than 1. Furthermore, the statement "continue with the current trial with probability pdf(D,x)" seems to have nothing to do with the distribution D as a probability distribution, and instead suggests that the statistical model is a biased coin flip rather than drawing a sample from D. Indeed, under our rejection sampling interpretation, if one wants a program whose statistical model is about coin flips, one can use the program observe(Bernoulli(f(x)), true). That program does mean "flip a biased coin with heads probability f(x) and continue with the current trial if the coin landed heads". This makes sense for any function f(x) provided the function gives us a probability in the range [0, 1]. If that function has a roughly bump-like shape around y, then this will indeed in some sense condition on x being roughly y. The function C exp(−(x − A)²/B²), similar to the PDF of the normal distribution, does have a bump-like shape around A, so it is possible to use that function for f, if one makes sure that B and C are such that it is unitless and everywhere less than 1 (note that this normalization is not the same as the normalization that makes its integral equal to 1).
We therefore suggest to stick with the rejection sampling interpretation of observe statements, and suggest that a statistician who wants to do "soft conditioning" in the senses (1) and (2) writes their probabilistic program using observe(Bernoulli(f(x)), true), where f is a function of the desired soft shape, rather than observe(D,x) where the PDF of D has that shape.
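For instance, in the style of the paper's examples, such an explicitly soft-conditioned program could look as follows (the bump function f and its constants are our own illustrative choices):

function soft_example()
    # A unitless bump around 15, everywhere <= 1, so f(x) is a genuine
    # probability for the biased coin flip.
    f(x) = exp(-((x - 15) / 0.5)^2)
    x = rand(Normal(10, 5))
    observe(Bernoulli(f(x)), true)  # "x is roughly 15", as a coin flip
    return x
end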
To do importance sampling for programs with infinitesimal width intervals we need to change almost nothing. We execute a call observe(D,I) as weight *= P(D,I), where P(D,I) has been defined in (1). Since P(D,I) returns an infinitesimal number if the width of I is infinitesimal, the computed weight variable will now contain a symbolic infinitesimal number rε^n (where n is allowed to be 0), rather than a real number. It will accumulate the product of some number of ordinary probabilities (for observe on discrete distributions or continuous distributions with an interval of finite width) and a number of infinitesimal probabilities (for observe on continuous distributions with intervals of infinitesimal width).

We now simply evaluate the estimate for E[V] using the usual weighted average formula, with infinitesimal arithmetic:

  E[V] ≈ (∑_{k=1}^{N} weight_k · V_k) / (∑_{k=1}^{N} weight_k)    (2)

In the denominator we are adding numbers of the form weight_k = w_k ε^{n_k}. Only the numbers with the minimum value n_k = n_min matter; the others are infinitesimally small compared to those, and do not get taken into account due to the definition of (+) on infinitesimal numbers. The same holds for the numerator: the values V_k associated with weights that are infinitesimally smaller do not get taken into account (an optimized implementation could reject a trial as soon as weight becomes infinitesimally smaller than the current sum of accumulated weights, since those trials will never contribute to the estimate of E[V]). Therefore the fraction has the form

  E[V] ≈ (A ε^{n_min}) / (B ε^{n_min}) = (A/B) ε^{n_min − n_min} = (A/B) ε^0 = A/B

that is, the infinitesimal factors cancel out in the estimate for E[V], and we obtain a non-infinitesimal result.

We shall now suppose that the symbolic infinitesimal eps only occurs in the width of intervals in observe(D,(x,r*eps)) calls, and not, for instance, in the return value of the probabilistic program. In this case, the estimate (2) of E[V] satisfies the conditions of Theorem 5.5. The calculated estimate may be viewed as a probability expression f(ε) of ε (Definition 5.4), and since f(ε) = (A/B)ε^0, the theorem implies that lim_{x→0} f(x) = A/B. Therefore the estimate calculated by importance sampling with infinitesimal arithmetic indeed agrees with taking the limit ε → 0. Figure 1 shows three example probabilistic programs that are parameterized by the interval width. The blue lines show several runs of the probabilistic program as a function of the interval width, and the orange line shows the result when taking the width to be ε. Taking the width to be exactly 0 results in division by zero in the weighted average, but taking it to be ε correctly computes the limit: the blue lines converge to the orange lines as the width goes to 0.

We may take a program written using observe(D,x) with exact conditioning on points, and convert it to our language by replacing such calls with observe(D,(x,w*eps)) where w is some constant to make the units correct. For programs that exhibit a paradox of type 1 by executing a different number of observes depending on the outcome of calls to rand, the computed expectation values will change. However, for programs that always execute the same number of observe calls, regardless of the outcome of rand calls, the computed expectation values will not be affected by this translation.
function example1(width)
    h = rand(Normal(1.70, 0.2))
    w = rand(Normal(70, 30))
    if rand(Bernoulli(0.5))
        observe(Normal(2.0,0.1), Interval(h,10*width))
    else
        observe(Normal(90,5), Interval(w,width))
    end
    return w / h^2
end

function example2(width)
    h = rand(Normal(1.7,0.5))
    if rand(Bernoulli(0.5))
        observe(Normal(2.0,0.1), Interval(h,width))
    end
    return h
end

function example3(width)
    x = rand(Normal(10,5))
    observe(Normal(15,5), Interval(x,width))
    return x
end
Fig. 1. Three example programs evaluated with finite width intervals with width going to zero (blue curves) and with infinitesimal width (orange curves). The finite width result correctly converges to the infinitesimal result in the limit w → 0.

To see this, note that a call to observe(D,x) will multiply weight *= pdf(D,x), whereas observe(D,(x,w*eps)) will multiply weight *= pdf(D,x)*w*eps. Thus if the observe calls are the same in all trials, the only difference is that weight will contain an extra factor of wε in all trials. The net result is that both the numerator and denominator in the weighted average get multiplied by the factor wε, which has no effect. Thus this translation is conservative with respect to the old semantics, in the sense that it does not change the result that already well-behaved probabilistic programs compute.
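As a concrete illustration of this cancellation (with made-up numbers): suppose three trials return V = (1, 2, 3) with accumulated weights 0.2ε, 0.4ε, and 0.4ε. Then

  E[V] ≈ (0.2ε·1 + 0.4ε·2 + 0.4ε·3) / (0.2ε + 0.4ε + 0.4ε) = 2.2ε / 1.0ε = 2.2,

and multiplying every weight by a common factor wε leaves this unchanged, while a trial whose weight carries a higher power of ε (say 0.5ε²) drops out of both sums entirely by the definition of (+).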
The three paradoxes we identified all have to do with parameter transformations. We explicitly add parameter transformations as a language feature. A parameter transformation T allows us to transform a probability distribution D to T(D), such that sampling from T(D) is the same as sampling from D and then applying the function T to the result. In order to ensure that the distribution T(D) has a probability density function we require T to be continuously differentiable. We can also use a parameter transformation to transform an interval from I to T(I) = {T(x) : x ∈ I}, which contains all the numbers T(x) for x ∈ I. In order to ensure that the transformed interval is again an interval, we require that T is monotone, that is, whenever a < b we also have T(a) < T(b). In this case, T's action on an interval [a, b] is simple: T([a, b]) = [T(a), T(b)].

Definition 5.6. A parameter transformation T : ℝ_A → ℝ_B is a continuously differentiable function with T′(x) > 0 for all x ∈ ℝ_A, where ℝ_A ⊆ ℝ and ℝ_B ⊆ ℝ are intervals representing its domain and range.

A strictly monotone function has an inverse on its range, so parameter transformations have an inverse T⁻¹, and (T⁻¹)′(y) = T′(T⁻¹(y))⁻¹ > 0, so the inverse of a parameter transformation is again a parameter transformation.
Example 5.7. The function T₁(x) = exp(x) is a parameter transformation T₁ : (−∞, ∞) → (0, ∞). The function T₂(x) = 100x is a parameter transformation T₂ : (−∞, ∞) → (−∞, ∞). The transformation T₁ can be used to convert decibels to energy density, and T₂ can be used to convert meters to centimeters.

Probability distributions need to support 3 operations: random sampling with rand(D), computing the CDF with cdf(D,x), and computing the PDF with pdf(D,x). We define these operations for the transformed distribution T(D).
Definition 5.8. Given a continuous probability distribution D and a parameter transformation T, we define the operations:

  rand(T(D)) = T(rand(D))
  cdf(T(D), x) = cdf(D, T⁻¹(x))
  pdf(T(D), x) = pdf(D, T⁻¹(x)) · (T⁻¹)′(x)

This definition matches how probability distributions transform in probability theory. Our implementation represents a parameter transformation T as the 4-tuple of functions (T, T′, T⁻¹, (T⁻¹)′), so that we have access to the inverse and derivative.
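A minimal sketch of how such a transformed distribution could be represented (our own illustration; the type names Transform and Transformed are hypothetical, not the artifact's actual identifiers):

using Distributions

# A parameter transformation as the 4-tuple (T, T', T⁻¹, (T⁻¹)').
struct Transform
    f::Function     # T
    df::Function    # T'
    inv::Function   # T⁻¹
    dinv::Function  # (T⁻¹)'
end

struct Transformed
    T::Transform
    D::Distribution
end

# The three operations of Definition 5.8:
Base.rand(TD::Transformed) = TD.T.f(rand(TD.D))
Distributions.cdf(TD::Transformed, x) = cdf(TD.D, TD.T.inv(x))
Distributions.pdf(TD::Transformed, x) = pdf(TD.D, TD.T.inv(x)) * TD.T.dinv(x)

# Example: with T = exp, Transformed(texp, Normal(10, 5)) behaves
# like LogNormal(10, 5).
texp = Transform(exp, exp, log, x -> 1 / x)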
Definition 5.9. Given an interval (a, w) with midpoint a ∈ ℝ and width w ∈ ℝ, we let l = T(a − w/2) and r = T(a + w/2) and define:

  T((a, w)) = ((l + r)/2, r − l)
This performs a parameter transformation on an interval represented as a midpoint-width pair. If the width is infinitesimal, we need a different rule.
Definition 5.10.
Given an interval (a, w) with midpoint a ∈ ℝ and infinitesimal width w, we define:

  T((a, w)) = (T(a), T′(a) · w)

This performs a parameter transformation on an infinitesimal width interval, which gets transformed to an interval whose width is larger by a factor T′(a). The key lemma about parameter transformations is that they do not affect the value of the (possibly infinitesimal) probability of a (possibly infinitesimal) interval.

Lemma 5.11. Let T be a parameter transformation, D a distribution, and I an interval. Then

  P(T(D), T(I)) = P(D, I)

where P is the probability function defined at (1).

Proof. We distinguish non-infinitesimal intervals from infinitesimal intervals.

• If I = (a, w) is non-infinitesimal, then by definition (1):

  P(D, (a, w)) = cdf(D, a + w/2) − cdf(D, a − w/2)

For T((a, w)) we have, where l = T(a − w/2) and r = T(a + w/2):

  T((a, w)) = ((l + r)/2, r − l)

and by (1):

  P(T(D), T((a, w))) = cdf(T(D), (l + r)/2 + (r − l)/2) − cdf(T(D), (l + r)/2 − (r − l)/2)
    = cdf(T(D), r) − cdf(T(D), l)
    = cdf(D, T⁻¹(r)) − cdf(D, T⁻¹(l))
    = cdf(D, T⁻¹(T(a + w/2))) − cdf(D, T⁻¹(T(a − w/2)))
    = cdf(D, a + w/2) − cdf(D, a − w/2)

• If I = (a, rε^n) is infinitesimal (n > 0), then by definition (1):

  P(D, (a, rε^n)) = pdf(D, a) · rε^n

For T((a, rε^n)) we have:

  T((a, rε^n)) = (T(a), T′(a) · rε^n)

and by (1):

  P(T(D), T((a, rε^n))) = pdf(T(D), T(a)) · T′(a) · rε^n
    = pdf(D, T⁻¹(T(a))) · (T⁻¹)′(T(a)) · T′(a) · rε^n
    = pdf(D, a) · rε^n  □

This lemma implies that the effect of observe(T(D),T(I)) is the same as observe(D,I), since observe(D,I) does weight *= P(D,I). This property of observe ensures the absence of parameter transformation paradoxes, not only in the three examples we gave, but in general: it does not matter which parameter scale we use; the weight accumulated remains the same.
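Continuing the sketch above, the two interval transformation rules (Definitions 5.9 and 5.10) could be rendered as follows (again our own illustrative code; Ivl mimics the implementation's Interval(mid,width) type described below):

# An interval as a midpoint-width pair; width is a Float64 or an
# infinitesimal number.
struct Ivl{W}
    mid::Float64
    width::W
end

# Finite width: transform both endpoints and re-center (Definition 5.9).
function transform(T::Transform, I::Ivl{Float64})
    l = T.f(I.mid - I.width / 2)
    r = T.f(I.mid + I.width / 2)
    return Ivl((l + r) / 2, r - l)
end

# Infinitesimal width: scale the width by T'(mid) (Definition 5.10).
transform(T::Transform, I::Ivl) = Ivl(T.f(I.mid), T.df(I.mid) * I.width)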
We have implemented the constructs described in the preceding sections as a simple embedded DSL in the Julia programming language, with the following interface:

• Infinitesimal numbers rε^n constructed by Infinitesimal(r,n), with predefined eps = Infinitesimal(1.0,1), and overloaded infinitesimal arithmetic operations +,-,*,/ according to Definition 5.2.
• Probability distributions D with random sampling rand(D) and cdf(D,x) and pdf(D,x). These distributions are provided by Julia's Distributions package, which supports beta, normal, Cauchy, chi-square, Bernoulli, binomial, and many other continuous and discrete distributions.
• Intervals constructed by Interval(mid,width), where width may be infinitesimal, and an operation P(D,I) to compute the (possibly infinitesimal) probability that a sample from D lies in the interval I. If I is infinitesimal, then this uses the PDF, and if I has finite width, then this uses the CDF, according to Definition 5.3.
• Parameter transformations T represented as 4-tuples (T, T′, T⁻¹, (T⁻¹)′), with operations T(D) and T(I) to transform probability distributions and intervals, according to Definitions 5.8, 5.9, and 5.10.
• The main operations of the probabilistic programming DSL are the following:
  – rand(D), where D is a distribution provided by Julia's Distributions package.
  – observe(D,I), where D is a continuous distribution and I is an interval, or D is a discrete distribution and I is an element, implemented as weight *= P(D,I).
  – importance(trials,program), which does importance sampling, where trials is the number of trials to run, and program is a probabilistic program written as a Julia function that uses rand and observe, and returns the value whose expectation we wish to estimate. Importance sampling is implemented as described in Section 5.3.

The example in the introduction can be written as follows:

function example1_m()
    h = rand(Normal(1.7,0.5))
    if rand(Bernoulli(0.5))
        observe(Normal(2.0,0.1), Interval(h,eps))
    end
    return h
end
estimate = importance(1000000, example1_m)

This program will produce an estimate very close to 1.7. If we change the units to centimeters, we will get an estimate very close to 170, as expected:

function example1_cm()
    h = rand(Normal(170,50))
    if rand(Bernoulli(0.5))
        observe(Normal(200,10), Interval(h,100*eps))
    end
    return h
end
estimate = importance(1000000, example1_cm)

The artifact contains the other examples from the paper and further examples to illustrate the use of the DSL [Jacobs 2020].
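To give a flavor of the first bullet above, here is a minimal sketch of such infinitesimal arithmetic (our own illustration of the behavior described in Section 5.3; the artifact's actual code may differ, and Inf64 is a hypothetical name chosen to avoid clashing with the artifact's type):

# Numbers of the form r*ε^n; addition is dominated by the smallest
# power of ε, as described in Section 5.3.
struct Inf64
    r::Float64
    n::Int
end

function Base.:+(a::Inf64, b::Inf64)
    a.n < b.n && return a          # b is infinitesimally smaller: drop it
    a.n > b.n && return b          # a is infinitesimally smaller: drop it
    return Inf64(a.r + b.r, a.n)   # same power of ε: add coefficients
end
Base.:*(a::Inf64, b::Inf64) = Inf64(a.r * b.r, a.n + b.n)
Base.:/(a::Inf64, b::Inf64) = Inf64(a.r / b.r, a.n - b.n)

# e.g. Inf64(0.2,1) + Inf64(0.5,0) gives Inf64(0.5,0), and
# Inf64(2.2,1) / Inf64(1.0,1) gives Inf64(2.2,0), i.e. the real 2.2.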
We have seen that naive likelihood accumulation results in unit anomalies when observe statements with continuous distributions are executed conditionally on random data, and we have shown that the culprit is the use of probability densities. From an analysis of what observe statements mean in the discrete case, we motivated a switch to interval-based observe statements, which have a probabilistic and rejection sampling interpretation. To recover the behavior of measure-zero observe statements we introduced intervals with infinitesimal width. This results in the accumulation of infinitesimal probabilities rather than probability densities, which solves the unit anomalies and paradoxes even when conditioning on events of measure zero. Infinitesimal probabilities also enabled us to implement parameter transformations that do not change the behavior of the program. We implemented this form of probabilistic programming as an embedded DSL in Julia.

This improves the state of the art in two ways:

(1) It fixes unit and parameter transformation paradoxes, which result in surprising and in some cases arguably incorrect behavior in existing probabilistic programming languages when continuous observe statements are executed conditionally on random data, or when nonlinear parameter transformations are performed.
(2) It gives the observe statement a probabilistic and rejection sampling interpretation, with measure zero conditioning as a limiting case when the observation interval is of infinitesimal width.

We hope that this will have a positive impact on the development of the formal semantic foundations of probabilistic programming languages, potentially reducing the problem of conditioning to events of positive measure. On the implementation side, we hope to generalize more powerful inference algorithms such as Metropolis-Hastings and SMC to work with infinitesimal probabilities.
ACKNOWLEDGMENTS
I thank Sriram Sankaranarayanan and the anonymous reviewers for their outstanding feedback. I amgrateful to Arjen Rouvoet, Paolo Giarrusso, Ike Mulder, Dongho Lee, Ahmad Salim Al-Sibahi, SamStaton, Christian Weilbach, Alex Lew, and Robbert Krebbers for help, inspiration, and discussions.
REFERENCES
Nathanael L. Ackermann, Cameron E. Freer, and Daniel M. Roy. 2017. On computability and disintegration. Mathematical Structures in Computer Science 27, 8 (2017), 1287–1314. https://doi.org/10.1017/S0960129516000098
Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A Probabilistic Programming Language. Journal of Statistical Software, Articles 76, 1 (2017), 1–32. https://doi.org/10.18637/jss.v076.i01
Joseph Chang and David Pollard. 1997. Conditioning as disintegration. Statistica Neerlandica 51, 3 (1997), 287–317. https://doi.org/10.1111/1467-9574.00056
Fredrik Dahlqvist and Dexter Kozen. 2020. Semantics of higher-order probabilistic programs with conditioning. In POPL (PACMPL). https://doi.org/10.1145/3371125
Noah Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. 2008. Church: a language for generative models. In UAI. 220–229. https://doi.org/10.5555/2969033.2969207
Chris Heunen, Ohad Kammar, Sam Staton, and Hongseok Yang. 2017. A convenient category for higher-order probability theory. In LICS. 1–12. https://doi.org/10.1109/LICS.2017.8005137
Thomas Hofweber. 2014. Infinitesimal Chances. Philosophers' Imprint (2014).
Jules Jacobs. 2020. Paradoxes of Probabilistic Programming: Artifact. https://doi.org/10.5281/zenodo.4075076
Edwin Thompson Jaynes. 2003. Probability theory: The logic of science. Cambridge University Press, Cambridge.
Brooks Paige, Frank Wood, Arnaud Doucet, and Yee Whye Teh. 2014. Asynchronous Anytime Sequential Monte Carlo. In Advances in Neural Information Processing Systems 27. Curran Associates, Inc., 3410–3418.
Arthur Paul Pedersen. 2014. Comparative Expectations. Studia Logica (2014).
Chung-Chieh Shan and Norman Ramsey. 2017. Exact Bayesian inference by symbolic disintegration. In POPL (PACMPL). 130–144. https://doi.org/10.1145/3009837.3009852
Sam Staton. 2017. Commutative Semantics for Probabilistic Programming. In Proceedings of the 26th European Symposium on Programming Languages and Systems - Volume 10201. Springer-Verlag, Berlin, Heidelberg, 855–879. https://doi.org/10.1007/978-3-662-54434-1_32
David Tolpin, Jan-Willem van de Meent, Brooks Paige, and Frank Wood. 2015. Output-Sensitive Adaptive Metropolis-Hastings for Probabilistic Programs. In Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science, Vol. 9285. 311–326. https://doi.org/10.1007/978-3-319-23525-7_19
Jan-Willem van de Meent, Brooks Paige, Hongseok Yang, and Frank Wood. 2018. An Introduction to Probabilistic Programming. arXiv:1809.10756
John von Neumann. 1951. Various Techniques Used in Connection with Random Digits. In Monte Carlo Method. National Bureau of Standards Applied Mathematics Series, Vol. 12. US Government Printing Office, Washington, DC, Chapter 13, 36–38.
Frank Wood, Jan-Willem van de Meent, and Vikash Mansinghka. 2014. A New Approach to Probabilistic Programming Inference. In AISTATS 2014 (JMLR Proceedings). JMLR.org, 1024–1032. http://jmlr.org/proceedings/papers/v33/wood14.html
Yi Wu, Siddharth Srivastava, Nicholas Hay, Simon Du, and Stuart Russell. 2018. Discrete-Continuous Mixtures in Probabilistic Programming: Generalized Semantics and Inference Algorithms. In Advances in Neural Information Processing Systems 31.