Compositional Semantics for Probabilistic Programs with Exact Conditioning
CCompositional Semantics for Probabilistic Programswith Exact Conditioning
Dario Stein
University of Oxford, UK
Sam Staton
University of Oxford, UK
Abstract —We define a probabilistic programming language forGaussian random variables with a first-class exact conditioningconstruct. We give operational, denotational and equationalsemantics for this language, establishing convenient propertieslike exchangeability of conditions. Conditioning on equality ofcontinuous random variables is nontrivial, as the exact observa-tion may have probability zero; this is
Borel’s paradox . Usingcategorical formulations of conditional probability, we showthat the good properties of our language are not particular toGaussians, but can be derived from universal properties, thusgeneralizing to wider settings. We define the Cond construction,which internalizes conditioning as a morphism, providing generalcompositional semantics for probabilistic programming withexact conditioning.
I. I
NTRODUCTION
Probabilistic programming is the paradigm of specifyingcomplex statistical models as programs, and performing infer-ence on them. There are two ways of expressing dependenceon observed data, thus learning from them: soft constraints and exact conditioning . Languages like Stan [8] or WebPPL[16] use a scoring construct for soft constraints, re-weightingprogram traces by observed likelihoods. Other frameworks likeHakaru [28] or Infer.NET [26] allow exact conditioning ondata. In this paper we provide two semantic analyses of exactconditioning in a simple Gaussian language: a denotationalsemantics, and a equational axiomatic semantics, which weprove to coincide for closed programs. Our denotational se-mantics is based on a new and general construction on Markovcategories, which, we argue, serve as a good framework forexact conditioning in probabilistic programs.
A. Case study: Reasoning about a Gaussian ProgrammingLanguage with Exact Conditions
Exact conditioning decouples the generative model fromthe data observations. Consider the following example forGaussian process regression (a.k.a. kriging ): The prior ys isa -dimensional multivariate normal (Gaussian) vector; weperform inference by fixing four observed datapoints via exactconditioning (=:=) . ys = gp sample (n=100 , kernel =rbf) for (i,c) in observations :ys[i] =:= c The same program is difficult to express compositionallywithout exact conditioning. Fig. 1: GP prior and posterior with 4 exact observationsNo style of probabilistic modelling is immune to fallaciesand paradoxes. Exact conditioning is indeed sensitive in thisregard in general (§VII-A), and so it is important to show thatwhere it is used, it is consistent in a compositional way. Thatis the contribution of this paper.The kriging example (Fig. 1) uses a smooth kernel,as is common, but to discuss the situation further weconsider the following concrete variation with a Gaussianrandom walk. Suppose that the observation points are at (0 , , , , , . ys [0] = normal (0 ,1) for i = 1 to 100:ys[i] = ys[i −
1] + normal (0 ,1) for j = 0 to 5:ys [20 * j] =:= c[j] To illustrate the power of compositional reasoning, we notethat exact conditioning here is first-class, and as we will show,it is consistent to reorder programs as long as the dataflowis respected (Prop. IV.10). So this random walk program isequivalent to: ys [0] = normal (0 ,1)ys [0] =:= c[0] for i = 1 to 100:ys[i] = ys[i −
1] + normal (0 ,1) if i % 20 == 0: ys[i] =:= c[i % 20] We can now use a substitution law and initialization principleto simplify the program: ys [0] = c[0] for i = 1 to 100: if i % 20 == 0:ys[i] = c[i % 20](ys[i] − ys[i − normal (0 ,1) else : ys[i] = ys[i −
1] + normal (0 ,1) a r X i v : . [ c s . P L ] J a n he constraints are now all ‘soft’, in that they relate an expres-sion with a distribution, and so this last program could be runwith a Monte Carlo simulation in Stan or WebPPL. Indeed, thesoft-conditioning primitive observe can be defined in termsof exact conditioning as observe (D,x) ≡ ( let y = sample (D) in x =:= y) Our language is by no means restricted to kriging. For exam-ple, we can use similar techniques to implement and verify asimple K´alm´an filter (VII-F).In Section II we provide an operational semantics for thislanguage, in which there are two key commands: drawingfrom a standard normal distribution ( normal() ) and exactconditioning (=:=) . The operational semantics is defined interms of configurations ( t, ψ ) where t is a program and ψ is a state, which here is a Gaussian distribution. Each call to normal() introduces a new dimension into the state ψ , andconditioning (=:=) alters the state ψ , using a canonical form ofconditioning for Gaussian distributions (§II-A).For the program in Figure 1, the operational semanticswill first build up the prior distribution shown on the leftin Figure 1, and then the second part of the program willcondition to yield a distribution as shown on the right. But forthe other programs above, the conditioning will be interleavedin the building of the model.In stateful programming languages, composition of pro-grams is often complicated and local transformations aredifficult to reason about. But, as we now explain, we willshow that for the Gaussian language, compositionality andlocal reasoning are straightforward. For example, as we havealready illustrated: • Program lines can be reordered as long as dataflow isrespected. That is, the following commutativity equation[38] remains valid for programs with conditioning let x = u inlet y = v in t ≡ let y = v inlet x = u in t (1)where x not free in v and y not free in u . • We have a substitutivity property: if t =:= u appears in aprogram then all other occurrences of t can be replacedby u . ( t =:= u ); v [ t / x ] ≡ ( t =:= u ); v [ u / x ] (2) • As a special base case, if we condition a normal variableon a constant, then the variable takes this value: let x = normal() in ( x =:= 0); t ≡ t [ / x ] (3) B. Denotational semantics, the Cond construction, andMarkov categories
In Section V we show that this compositional reasoningis valid by using a denotational semantics. For a Gaussianlanguage with out conditioning, we can easily interpret termsas noisy affine functions, x (cid:55)→ Ax + c + N (Σ) . The exact conditioning requires a new construction for building a se-mantic model. In fact this construction is not at all specific toGaussian probability and works generally.For this general construction, we start from a class ofsymmetric monoidal categories called Markov categories [11].These can be understood as the categorical counterpart ofa broad class of probabilistic programming languages with-out conditioning (§III-B). For example, Gaussian probabilityforms a Markov category, but there are many other examples,including finite probability (e.g. coin tosses) and full Borelprobability.Our conditioning construction starts from a Markov cate-gory C , regarded as a probabilistic programming languagewithout conditioning. We build a new symmetric monoidalcategory Cond ( C ) which is conservative over C but whichcontains a conditioning construct. This construction buildson an analysis of conditional probabilities from the Markovcategory literature, which captures conditioning purely interms of categorical structure: there is no explicit Radon-Nikod´ym theorem, limits, reference measures or in fact anymeasure theory at all. The good properties of the Gaussianlanguage generalize to this abstract setting, as they follow fromuniversal properties alone.The category Cond ( C ) has the same objects as C , but amorphism is reminiscent of the decomposition of the programin Fig 1: a pair of a purely probabilistic morphism togetherwith an observation. These morphisms compose by composingthe generative parts and accumulating the observations (for agraphical representation, see Figure 2). The morphisms areconsidered up-to a natural contextual equivalence. We provesome general properties about Cond ( C ) :1) Proposition IV.11: Cond ( C ) is consistent, in that nodistinct unconditional distributions from C are equatedin Cond ( C ) .2) Proposition IV.10: Cond ( C ) allows programs to be re-ordered according to their dataflow graph, i.e. it satisfiesthe interchange law of monoidal categories.Returning to the specific case study of Gaussian probability,we show that we have a canonical interpretation of theGaussian language in Cond ( Gauss ) , which is fully abstract(Prop V.3). In consequence, the principles of reordering andconsistency hold for the contextual equivalence induced by theoperational semantics. C. Equational axioms
Our second semantic analysis (§VI) has a more syntacticand concrete flavor. We leave the generality of Markov cate-gories and focus again on the Gaussian language. We presentan equational theory for programs and use this to give normalforms for programs.Our equational theory is surprisingly simple. The first two2quations are let x = normal() in () = ()let x = normal() , . . . , x n = normal() in U(cid:126)x = let x = normal() , . . . , x n = normal() in (cid:126)x The first equation is sometimes called discarding. In thesecond equation, U must be an orthogonal matrix, and we areusing shorthand for multiplying a vector by a matrix. Thesetwo equations are enough to fully axiomatize the fragment ofthe Gaussian language without conditioning (Prop. VI.1).(In Section VI we use a concise notation, writing the firstaxiom as νx. r[] = r[] . One instance of the second axiom witha permutation matrix for U is νx.νy. r[ x, y ] = νy.νx. r[ x, y ] ,reminiscent of name generation in the π -calculus [25] or ν -calculus [30].)The remaining axioms focus on conditioning. There arecommutativity axioms for reordering parts of programs, aswell as the two substitutivity axioms considered above, (2),(3). Finally there are two axioms for eliminating a conditionthat is tautologous ( a =:= a ) or impossible (0 =:= 1) .Together, these axioms are consistent, which we can deduceby showing them to hold in the Cond model. To moreoverillustrate the strength of the axioms, we show two normalform theorems by merely using the axioms. Here normal n () describes the n -dimensional standard normal distribution. • Proposition VI.6: any closed program is either derivablyimpossible (0 =:= 1) or derivably equal to a condition-freeprogram of the form A ∗ normal n () + (cid:126)c . • Theorem VI.8: any program of unit type (with no returnvalue) is either derivably impossible (0 =:= 1) or derivablyequal to a soft constraint, i.e. a program of the form A ∗ (cid:126)x =:= B ∗ normal n () + (cid:126)c . We also give a uniquenesscriterion on A , B and (cid:126)c . D. Summary • We present a minimalist language with exact conditioningfor Gaussian probability, with the purpose of studying theabstract properties of conditioning. Despite its simplicity,the language can express Gaussian processes or K´alm´anfilters. • In order to make the denotational semantics composi-tional, we introduce the Cond construction, which ex-tends a Markov category C to a category Cond ( C ) in which conditioning internalizes as a morphism. TheGaussian language is recovered as the internal languageof Cond ( Gauss ) . • We give three semantics for the language – operational(§II), denotational (§V) and axiomatic (§VI). We showthat the denotational semantics is fully abstract (Propo-sition V.3) and that the axiomatic semantics is strongenough to derive normal forms (Theorem VI.8). Thisjustifies properties like commutativity and substitutivityfor the language. Thus probabilistic programming withexact conditioning can serve as a practical foundation forcompositional statistical modelling. II. A L
ANGUAGE FOR G AUSSIAN P ROBABILITY
We introduce a typed language (§II-B), similar to the onediscussed in Section I-A, and provide an operational semantics(§II-C).
A. Recap of Gaussian Probability
We briefly recall
Gaussian probability , by which we meanthe treatment of multivariate Gaussian distributions and affine-linear maps (e.g. [24]). A (multivariate) Gaussian distribution is the law of a random vector X ∈ R n of the form X = AZ + µ where A ∈ R n × m , µ ∈ R n and the random vector Z hascomponents Z , . . . , Z m ∼ N (0 , which are independentand standard normally distributed with density function ϕ ( x ) = 1 √ π e − x The distribution of X is fully characterized by its mean µ andthe positive semidefinite covariance matrix Σ . Conversely, forany µ and positive semidefinite matrix Σ there is a uniqueGaussian distribution of that mean and covariance denoted N ( µ, Σ) . The vector X takes values precisely in the affinesubspace S = µ + col(Σ) where col(Σ) denotes the columnspace of Σ . We call S the support of the distribution.This defines a small convenient fragment of probabilitytheory: Affine transformations of Gaussians remain Gaussian.Furthermore, conditional distributions of Gaussians are againGaussian. This is known as self-conjugacy. If X ∼ N ( µ, Σ) with X = (cid:18) X X (cid:19) , µ = (cid:18) µ µ (cid:19) , Σ = (cid:18) Σ Σ Σ Σ (cid:19) then the conditional distribution X | ( X = a ) of X condi-tional on X = a is N ( µ (cid:48) , Σ (cid:48) ) where µ (cid:48) = µ + Σ Σ +22 ( a − µ ) , Σ (cid:48) = Σ − Σ Σ +22 Σ (4)and Σ +22 denotes the Moore-Penrose pseudoinverse. Example II.1. If X, Y ∼ N (0 , are independent and Z = X − Y , then ( X, Y ) | ( Z = 0) ∼ N (cid:18)(cid:18) (cid:19) , (cid:18) . . . . (cid:19)(cid:19) The posterior distribution is equivalent to the model X ∼ N (0 , . , Y = X B. Types and terms of the Gaussian language
We now describe a language for Gaussian probability andconditioning. The core language resembles first-order OCamlwith a construct normal() to sample from a standard Gaussian,and conditioning denoted as (=:=) . Types τ are generated froma basic type R denoting real or random variable , pair typesand unit type I . τ ::= R | I | τ ∗ τ e ::= x | e + e | α · e | β | ( e, e ) | () | let x = e in e | let ( x, y ) = e in e | normal() | e =:= e where α, β range over real numbers.Typing judgements are Γ , x : τ, Γ (cid:48) (cid:96) x : τ Γ (cid:96) () : I Γ (cid:96) s : σ Γ (cid:96) t : τ Γ (cid:96) ( s, t ) : σ ∗ τ Γ (cid:96) s : R Γ (cid:96) t : RΓ (cid:96) s + t : R Γ (cid:96) t : RΓ (cid:96) α · t : R Γ (cid:96) β : RΓ (cid:96) normal() : R Γ (cid:96) s : R Γ (cid:96) t : RΓ (cid:96) ( s =:= t ) : IΓ (cid:96) s : σ Γ , x : σ (cid:96) t : τ Γ (cid:96) let x = s in t : τ Γ (cid:96) s : σ ∗ σ (cid:48) Γ , x : σ, y : σ (cid:48) (cid:96) t : τ Γ (cid:96) let ( x, y ) = s in t : τ We define standard syntactic sugar for sequencing s ; t , iden-tifying the type R n = R ∗ (R ∗ . . . ) with vectors and definingmatrix-vector multiplication A · (cid:126)x . For σ ∈ R and e : R , wedefine normal( x, σ ) ≡ x + σ · normal() . More generally,for a covariance matrix Σ , we write normal( (cid:126)x, Σ) = (cid:126)x + A · (normal() , . . . , normal()) where A is any matrix such that Σ = AA T . We can identify any context and type with R n forsuitable n . C. Operational semantics
Our operational semantics is call-by-value. Calling normal() allocates a latent random variable, and a priordistribution over all latent variables is maintained. Calling (=:=) updates this prior by symbolic inference according tothe formula (4).
Values v and redexes ρ are defined as v ::= x | ( v, v ) | v + v | α · v | β | () ρ ::= normal() | v =:= v | let x = v in e | let ( x, y ) = v in e A reduction context C with hole [ − ] is of the form C ::= [ − ] | C + e | v + C | r · C | C =:= e | v =:= C | let x = C in e | let ( x, y ) = C in e Every term is either a value or decomposes uniquely as C [ ρ ] .We define a reduction relation for terms. During the execution,we will allocate latent variables z i which we assume distinctfrom all other variables in the program. A configuration iseither a pair ( e, ψ ) where z , . . . , z r (cid:96) e and ψ is a Gaussiandistribution on R r , or a failure configuration ⊥ . We first definereduction on redexes1) For normal() , we add an independent latent variable tothe prior (normal() , ψ ) (cid:66) ( z r+1 , ψ ⊗ N (0 ,
2) To define conditioning, note that every value z , . . . , z r (cid:96) v : R defines an affine function R r → R .In order to reduce ( v =:= w, ψ ) , we consider the jointdistribution X ∼ ψ, Z = v ( X ) − w ( X ) . If lies inthe support of Z , we denote by ψ | v = w the outcome ofconditioning X on Z = 0 as in (4), and reduce ( v =:= w, ψ ) (cid:66) (() , ψ | v = w ) Otherwise ( v =:= w, ψ ) (cid:66) ⊥ , indicating that the inferenceproblem has no solution.3) Let bindings are standard (let x = v in e, ψ ) (cid:66) ( e [ v/x ] , ψ )(let ( x, y ) = ( v, w ) in e, ψ ) (cid:66) ( e [ v/x, w/y ] , ψ )
4) Lastly, under reduction contexts, if ( ρ, ψ ) (cid:66) ( e, ψ (cid:48) ) we define ( C [ ρ ] , ψ ) (cid:66) ( C [ e ] , ψ (cid:48) ) . If ( ρ, ψ ) (cid:66) ⊥ then ( C [ e ] , ψ ) (cid:66) ⊥ . Proposition II.2.
Every closed program (cid:96) e : R n , togetherwith the empty prior ‘ ! ’, deterministically reduces to either aconfiguration ( v, ψ ) or ⊥ . We consider the observable result of this execution eitherfailure, or the pushforward distribution v ∗ ψ on R n , as thisdistribution could be sampled from empirically. Example II.3.
The program let ( x, y ) = (normal() , normal()) in x =:= y ; x + y reduces to (( z , z ) , ψ ) where ψ = N (cid:18)(cid:18) (cid:19) , (cid:18) . . . . (cid:19)(cid:19) The observable outcome of the run is the pushforward distri-bution (1 1) ∗ ψ = N (0 , on R .One goal of this paper is to study properties of this languagecompositionally, and abstractly, without relying on any specificproperties of Gaussians. The crucial notion to investigate iscontextual equivalence. Definition II.4.
We say Γ (cid:96) e , e : τ are contextuallyequivalent , written e ≈ e , if for all closed contexts K [ − ] and i, j ∈ { , }
1) when ( K [ e i ] , !) (cid:66) ∗ ( v i , ψ i ) then ( K [ e j ] , !) (cid:66) ∗ ( v j , ψ j ) and ( v i ) ∗ ψ i = ( v j ) ∗ ψ j
2) when ( K [ e i ]) (cid:66) ∗ ⊥ then ( K [ e j ] , !) (cid:66) ∗ ⊥ We study contextual equivalence by developing a denota-tional semantics for the Gaussian language (§V), and provingit fully abstract (Prop. V.3). We furthermore show that thesesemantics can be axiomatized completely by a set of programequations (§VI).We also note nothing conceptually limits our languageto only Gaussians. We are running with this example forconcreteness, but any family of distributions which can besampled and conditioned can be used. So we will take care toestablish properties of the semantics in a general setting.4II. C
ATEGORICAL F OUNDATIONS OF C ONDITIONING
We will now generalize away from Gaussian probability,recovering its convenient structure in the general categoricalframework of Markov categories (§III-A). We argue that this isa categorical counterpart of probabilistic programming withoutconditioning (§III-B).
Definition III.1 ([11, § 6]) . The symmetric monoidal category
Gauss has objects n ∈ N , which represent the affine space R n , and m ⊗ n = m + n . Morphisms m → n are tuples ( A, b, Σ) where A ∈ R n × m , b ∈ R n and Σ ∈ R n × n is a posi-tive semidefinite matrix. The tuple represents a stochastic map f : R m → R n that is affine-linear, perturbed with multivariateGaussian noise of covariance Σ , informally written f ( x ) = Ax + b + N (Σ) Such morphisms compose sequentially and in parallel in theexpected way, with noise accumulating independently ( A, b, Σ) ◦ ( C, d,
Ξ) = (
AC, Ad + b, A Ξ A T + Σ)( A, b, Σ) ⊗ ( C, d,
Ξ) = (cid:18)(cid:18) A C (cid:19) , (cid:18) bd (cid:19) , (cid:18) Σ 00 Ξ (cid:19)(cid:19)
Gauss furthermore has ability to introduce correlations anddiscard values by means of the affine maps copy : R n → R n + n , x (cid:55)→ ( x, x ) and del : R n → R , x (cid:55)→ () . This gives Gauss the structure of a categorical model of probability,namely a Markov category.
A. Conditioning in Markov and CD categoriesMarkov and
CD categories are a formalism that is increas-ingly widely used (e.g. [12], [13]). We review their graphicallanguage, and theory of conditioning.
Definition III.2 ([9]) . A copy-delete (CD) category is asymmetric monoidal category ( C , ⊗ , I ) in which every object X is equipped with the structure of a commutative comonoid copy X : X → X ⊗ X , del X : X → I which is compatiblewith the monoidal structure.In CD categories, morphisms f : X → Y need not be discardable , i.e. satisfy del Y ◦ f = del X . If they are, we obtaina Markov category. Definition III.3 ([11]) . A Markov category is a CD categoryin which every morphism is discardable, i.e. del is natural.Equivalently, the unit I is terminal.Beyond Gauss , further examples of Markov categories arethe category
FinStoch of finite sets and stochastic matrices,and the category
BorelStoch of Markov kernels betweenstandard Borel spaces. CD categories generalize unnormalized measure kernels.The interchange law of ⊗ encodes exchangeability (Fubini’stheorem) while the discardability condition signifies that prob-ability measures are normalized to total mass . We introducethe following terminology: States µ : I → X are also called distributions , and if f : A → X ⊗ Y , we denote its marginals by f X : A → X, f Y : A → Y . Copying and discarding allows us to write tupling (cid:104) f, g (cid:105) and projection π X , howevernote that the monoidal structure is only semicartesian, i.e. f (cid:54) = (cid:104) f X , f Y (cid:105) in general. We use string diagram notationfor symmetric monoidal categories, and denote the comonoidstructure as == del X copy X Definition III.4 ([11, 10.1]) . A morphism f : X → Y iscalled deterministic if it commutes with copying, that is copy Y ◦ f = ( f ⊗ f ) ◦ copy X In a Markov category, the wide subcategory C det of determin-istic maps is cartesian, i.e. ⊗ is a product.A morphism ( A, b, Σ) in Gauss is deterministic iff
Σ = 0 .The deterministic subcategory A = Gauss det consists of thespaces R n and affine maps x (cid:55)→ Ax + b between them.We recall the theory of conditioning for Markov categories. Definition III.5 ([11, 11.1,11.5]) . A conditional distribution for ψ : I → X ⊗ Y is a morphism ψ | X : X → Y such that ψ = ψψ | X X Y X Y (5)A (parameterized) conditional for f : A → X ⊗ Y is amorphism f | X : X ⊗ A → Y such that X f | X = f fAX Y YA Parameterized conditionals can be specialized to conditionaldistributions in the following way
Proposition III.6. If f : A → X ⊗ Y has conditional f | X : X ⊗ A → Y and a : I → X is a deterministic state, then f | X (id X ⊗ a ) is a conditional distribution for f a . All of our examples
FinStoch , BorelStoch and
Gauss have conditionals [5], [11]. For
Gauss , this captures the self-conjugacy of Gaussians [19]. An explicit formula generalizing(4) is given in [11], but we shall only require the existence ofconditionals and work with their universal property.5 efinition III.7 ([11, 13.1]) . Let µ : I → X be a distribution.Parallel morphisms f, g : X → Y are called µ -almost surelyequal , written f = µ g , if (cid:104) id X , f (cid:105) µ = (cid:104) id X , g (cid:105) µ .Conditional distributions for a given distribution µ : I → X ⊗ Y are generally not unique. However, it follows fromdefinition that they are µ X -almost surely equal. In order touniquely evaluate conditionals at a point, we need to descendfrom the global universal property to individual inputs. Thisis achieved by the absolute continuity relation. Definition III.8 ([12, 2.8]) . Let µ, ν : I → X be twodistributions. We write µ (cid:28) ν if for all f, g : X → Y , f = ν g implies f = µ g . Lemma III.9. If f, g : X → Y are µ -almost surely equal and x : I → X satisfies x (cid:28) µ then f x = gx . Proposition III.10.
For a distribution µ = N ( b, Σ) : 0 → m in Gauss , let S = b + col(Σ) be its support as in §II-A. Then • If f, g : m → n are morphisms, then f = µ g iff f x = gx for all x ∈ S , seen as deterministic states x : 0 → m . • If ν : 0 → m then µ (cid:28) ν iff the support of µ is containedin the support of ν • In particular for x : 0 → m deterministic, x (cid:28) µ iff x ∈ S . There is a general notion of support in Markov categoriesdefined in [11] which agrees with S , but we will formulateour results in terms of the more flexible notion (cid:28) . Proof.
See appendix, where we also characterize (cid:28) for
FinStoch and
BorelStoch .We give an example of how to use the categorical condi-tioning machinery in practice.
Example III.11.
The statistical model from Example II.1 X ∼ N (0 , Y ∼ N (0 , Z = X − Y corresponds to the distribution µ : 0 → with covariancematrix Σ = A conditional with respect to Z is µ | Z ( z ) = (cid:18) . . (cid:19) z + N (cid:18) . . . . (cid:19) which can be verified by calculating (5). We wish to conditionon Z = 0 . The marginal µ Z = N (2) is supported on all of R , hence (cid:28) µ Z and by Lemma III.9 the composite µ | Z (0) = N (cid:18) . . . . (cid:19) is uniquely defined and represents the posterior distributionover ( X, Y ) . B. Internal language of Markov categories
There is a strong correspondence between first-order proba-bilistic programming languages and the categorical models ofprobability, via their internal languages. The internal languageof a CD category C has types τ ::= X | I | τ ∗ τ where X ranges over objects of C . Any type τ can beregarded as an object [[ τ ]] of C , via [[ X ]] = X , [[ I ]] = I , and [[ τ ∗ τ ]] = [[ τ ]] ⊗ [[ τ ]] . The terms of the internal languageare like the language of Section II, built from let x = t in u ,free variables and pairing, but instead of Gaussian-specificconstructs like normal() , + , and =:= , we have terms for anymorphisms in C : Γ (cid:96) t : τ . . . Γ (cid:96) t n : τ n Γ (cid:96) f ( t . . . t n ) : τ (cid:48) ( f : [[ τ ]] ⊗ . . . ⊗ [[ τ n ]] → [[ τ (cid:48) ]] in C ) Taking C = Gauss we recover the conditioning-free fragmentof the language of Section II (III.12), but the syntax makessense for any CD or Markov category. A core result of thiswork is that the full language can be recovered as well for aCD category C = Cond ( Gauss ) (§IV).A typing context Γ = ( x : τ . . . x n : τ n ) is interpreted as [[Γ]] = [[ τ ]] ⊗ · · · ⊗ [[ τ n ]] . A term in context Γ (cid:96) t : τ isinterpreted as a morphism [[Γ]] → [[ τ ]] , defined by inductionon the structure of typing derivations. This is similar tothe interpretation of a dual linear λ -calculus in a monoidalcategory [4, §3.1,§4], although because every type supportscopying and discarding we do not need to distinguish betweenlinear and non-linear variables. For example, [[let x = t in u ]] = [[Γ]] copy −−−→ [[Γ]] ⊗ [[Γ]] [[ t ]] ⊗ id −−−−→ [[ A ]] ⊗ [[Γ]] [[ u ]] −−→ [[ B ]][[Γ , x : τ, Γ (cid:48) (cid:96) x : τ ]] = [[Γ]] ⊗ [[ τ ]] ⊗ [[Γ (cid:48) ]] del ⊗ id [[ τ ]] ⊗ del −−−−−−−−−→ [[ τ ]] The interpretation always satisfies the following identity, as-sociativity and commutativity equations: [[let y = (let x = t in u ) in v ]] = [[let x = t in let y = u in v ]][[let x = t in x ]] = [[ t ]] [[let x = x in u ]] = [[ u ]] (6) [[let x = t in let y = u in v ]] = [[let y = u in let x = t in v ]] where x not free in u and y not free in t . There are alsostandard equations for tensors [39, §3.1], which always hold.We can always substitute terms for free variables: if we have Γ , x : A (cid:96) t : B and Γ (cid:96) u : A then Γ (cid:96) t [ u / x ] : B . In any CDcategory we have [[let x = t in u ]] = [[ u [ t / x ]]] if x occurs exactly once in u .In a Markov category, moreover, every term is discardable: [[let x = t in u ]] = [[ u [ t / x ]]] if x occurs at most once in u .(It is common to also define a term to be copyable if a versionof the substitution condition holds when x occurs at least once (e.g. [14], [22]), but we will not need that in what follows.)6 xample III.12. The fragment of the Gaussian languagewithout conditioning ( =:= ) is a subset of the internal languageof the category
Gauss . That is to say, there is a canonicaldenotational semantics of the Gaussian language where weinterpret types and contexts as objects of
Gauss , e.g. [[R]] = 1 and [[( x : R , y : R ⊗ R)]] = 3 . Terms Γ (cid:96) t : A are interpretedas stochastic maps Ax + b + N (Σ) . This is all automaticonce we recognize that addition (+) : 2 → , scaling α · ( − ) : 1 → , constants β : 0 → and sampling N (1) : 0 → are morphisms in Gauss . Example III.13.
In Section IV, we will show that the fullGaussian language with conditioning ( =:= ) is the internal lan-guage of a CD category. The fact that commutativity (6) holdsis non-trivial. It cannot reasonably be the internal languageof a Markov category, because conditions (=:=) cannot bediscardable. For example there is no non-trivial morphism (=:=) : 2 → in Gauss .IV. C
OND – C
OMPOSITIONAL C ONDITIONING
Let C be a Markov category with conditionals (§III-A). Forsimplicity of notation, we assume C to be strictly monoidal .We construct a new category Cond ( C ) by adding to thiscategory the ability to condition on fixed observations. By observation we mean a deterministic state o : I → X ,and we seek to add for each of those a conditioning effect (:= o ) : X → I .Our constructions proceed in two stages. We first (§IV-A)form a category Obs ( C ) on the same objects as C where (:= o ) is added purely formally. A morphism X (cid:32) Y in Obs ( C ) represents an intensional open program of the form x : X (cid:96) let ( y, k ) : Y ⊗ K = f ( x ) in ( k := o ); y (7)We think of K as an additional hidden output wire, to whichwe attach the observation o . Such programs compose theobvious way, by aggregating observations (see Fig. 2).In the second stage (§IV-B) – this is the core of the paper –we relate such open programs to the conditionals present in C ,that is we quotient by contextual equivalence. The resultingquotient is called Cond ( C ) . Under sufficient assumptions,this will have the good properties of a CD category. A. Step 1 (Obs): Adding conditioning
Definition IV.1.
The following data define a symmetric pre-monoidal category
Obs ( C ) : • the object part of Obs ( C ) is the same as C • morphisms X (cid:32) Y are tuples ( K, f, o ) where K ∈ ob( C ) , f ∈ C ( X, Y ⊗ K ) and o ∈ C det ( I, K ) • The identity on X is Id X = ( I, id X , !) where ! = id I . • Composition is defined by ( K (cid:48) , f (cid:48) , o (cid:48) ) • ( K, f, o ) = ( K (cid:48) ⊗ K, ( f (cid:48) ⊗ id K ) f, o (cid:48) ⊗ o ) . • if ( K, f, o ) : X (cid:32) Y and ( K (cid:48) , f (cid:48) , o (cid:48) ) : X (cid:48) (cid:32) Y (cid:48) , theirtensor product is defined as ( K (cid:48) ⊗ K, (id Y (cid:48) ⊗ swap K (cid:48) ,Y ⊗ id K )( f (cid:48) ⊗ f ) , o (cid:48) ⊗ o ) • There is an identity-on-objects functor J : C → Obs ( C ) that sends f : X → Y to ( I, f, !) : X (cid:32) Y . This functoris strict premonoidal and its image central • Obs ( C ) inherits symmetry and comonoid structureA premonoidal category (due to [31]) is like a monoidalcategory where the interchange law need not hold. This isthe case because Obs ( C ) does not yet identify observationsarriving in different order. This will be remedied automaticallylater when passing to the quotient Cond ( C ) .Composition and tensor can be depicted graphically as inFigure 2, where dashed wires indicate condition wires K andtheir attached observations o . For an observation o : I → K ,the conditioning effect (:= o ) : K (cid:32) I is given by ( I, id K , o ) . ff (cid:48) Z o (cid:48) oY X f (cid:48) X (cid:48) fXo (cid:48) oY (cid:48) Y Fig. 2: Composition and tensoring of morphisms in
Obs
B. Step 2 (Cond): Equivalence of open programs
We now quotient
Obs -morphisms, tying them to the con-ditionals which can be computed in C . We know how tocompute conditionals for closed programs. Given a state ( K, ψ, o ) : I (cid:32) m , we follow the procedure of ExampleIII.11: If o (cid:54)(cid:28) ψ K , the observation does not lie in the supportof the model and conditioning fails. If not, we form theconditional ψ | K in C and obtain a well-defined posterior µ | K ◦ o .This notion defines an equivalence relation on states I (cid:32) n in Cond ( C ) . We will then extend this notion to a congruenceon arbitrary morphisms X (cid:32) Y by a general categoricalconstruction. Definition IV.2.
Given two states I (cid:32) X we define ( K, ψ, o ) ∼ ( K (cid:48) , ψ (cid:48) , o (cid:48) ) if either1) o (cid:28) ψ K and o (cid:48) (cid:28) ψ (cid:48) K (cid:48) and ψ | K ( o ) = ψ (cid:48) | K (cid:48) ( o (cid:48) ) .2) o (cid:54)(cid:28) ψ K and o (cid:48) (cid:54)(cid:28) ψ (cid:48) K (cid:48) That is, both conditioning problems either fail, or both succeedwith equal posterior.Figure 3 formulates Example III.11 in
Obs ( Gauss ) : Definition IV.3.
Let X be a symmetric premonoidal category.An equivalence relation ∼ on states X ( I, − ) is called func-torial if ψ ∼ ψ (cid:48) implies f ψ ∼ f ψ (cid:48) . We can extend such arelation to a congruence ≈ on all morphisms X → Y via f ≈ g ⇔ ∀ A, ψ : I → A ⊗ X, (id A ⊗ f ) ψ ∼ (id A ⊗ g ) ψ. ∼N (1) N (1) N (0 . Fig. 3: Example III.11 describes related states (cid:32) The quotient category X / ≈ is symmetric premonoidal.We show now that under good assumptions, the quotient byconditioning IV.2 on X = Obs ( C ) is functorial, and inducesa quotient category Cond ( C ) . The technical condition is thatsupports interact well with dataflow Definition IV.4.
A Markov category C has precise supports ifthe following are equivalent for all deterministic x : I → X , y : I → Y , and arbitrary f : X → Y and µ : I → X .1) x ⊗ y (cid:28) (cid:104) id X , f (cid:105) µ x (cid:28) µ and y (cid:28) f x Proposition IV.5.
Gauss , FinStoch and
BorelStoch haveprecise supports.Proof.
See appendix.
Theorem IV.6.
Let C be a Markov category that has condi-tionals and precise supports. Then ∼ is a functorial equiva-lence relation on Obs ( C ) .Proof. Let ( K, ψ, o ) ∼ ( K (cid:48) , ψ (cid:48) , o (cid:48) ) : I (cid:32) X and ( H, f, v ) : X (cid:32) Y be any morphism. We need to show that ( H ⊗ K, ( f ⊗ id K ) ψ, v ⊗ o ) ∼ ( H ⊗ K (cid:48) , ( f ⊗ id K (cid:48) ) ψ (cid:48) , v ⊗ o (cid:48) ) (8)If ( K, f, o ) fails, i.e. o (cid:54)(cid:28) ψ K , then by marginalization anycomposite must fail. But then the RHS fails too.Now assume that ( K, ψ, o ) succeeds and ψ | K o = ψ (cid:48) | K o (cid:48) .We show that the success conditions on both sides are equiv-alent. That is because the following are equivalent1) v ⊗ o (cid:28) ( f H ⊗ id K ) ψ o (cid:28) ψ K and v (cid:28) f H ψ | K o This is exactly the ‘precise supports’ axiom, applied to µ = ψ K and g = f H ◦ ψ | K . Because 2) agrees on both sides of(8), so does 1). We are left with the case that (8) succeeds,and need to show that [( f ⊗ id K ) ψ ] | H ⊗ K ( v ⊗ o ) = [( f ⊗ id K (cid:48) ) ψ (cid:48) ] | H ⊗ K (cid:48) ( v ⊗ o (cid:48) ) . We use a variant of the argument from [11, 11.11] that doubleconditionals can be replaced by iterated conditionals. Considerthe parameterized conditional β = ( f ◦ ψ | K ) | H : H ⊗ K → Y then string diagram manipulation shows that β has the univer-sal property β = [( f ⊗ id K ) ψ ] | H ⊗ K . By specialization III.6,it also has the property β (id H ⊗ o ) = ( f ◦ ψ | K o ) | H . Hence [( f ⊗ id K ) ψ ] | H ⊗ K ( v ⊗ o ) = β (id H ⊗ o ) ◦ v = ( f ◦ ψ | K o ) | H ◦ v = ( f ◦ ψ (cid:48) | K (cid:48) o (cid:48) ) | H ◦ v = [( f ⊗ id K (cid:48) ) ψ (cid:48) ] | H ⊗ K (cid:48) ( v ⊗ o ) We can spell out the equivalence ≈ as follows: Proposition IV.7.
We have ( K, f, o ) ≈ ( K (cid:48) , f (cid:48) , o (cid:48) ) : X (cid:32) Y if for all ψ : I → A ⊗ X , either o (cid:28) f K ψ X and o (cid:48) (cid:28) f (cid:48) K (cid:48) ψ (cid:48) X and [(id A ⊗ f ) ψ ] | K ( o ) =[(id A ⊗ f (cid:48) ) ψ (cid:48) ] | K (cid:48) ( o (cid:48) ) o (cid:54)(cid:28) f K ψ X and o (cid:48) (cid:54)(cid:28) f (cid:48) K (cid:48) ψ (cid:48) X The universal property of the conditional in question is ψ f = ψ fA Y KA Y K We can show that isomorphic conditions are equivalentunder the relation ≈ . Proposition IV.8 (Isomorphic conditions) . Let ( K, f, o ) : X (cid:32) Y and α : K ∼ = K (cid:48) be an isomorphism. Then ( K, f, o ) ≈ ( K (cid:48) , (id Y ⊗ α ) f, αo ) . In programming terms ( k := o ) ≈ ( αk := αo ) .Proof. Let ψ : I → A ⊗ X . We first notice that o (cid:28) ψ K ifand only if αo (cid:28) αψ K , so the success conditions coincide. Itis now straightforward to check the universal property (id A ⊗ f ) ψ | K = (id A ⊗ ((id X ⊗ α ) f )) ψ | K (cid:48) ◦ α. This requires the fact that isomorphisms are deterministic,which holds in every Markov category with conditionals [11,11.28]. The proof works more generally if α is deterministicand split monic.We can now give the Cond construction: Definition IV.9.
Let C be a Markov category that has con-ditionals and precise supports. We define Cond ( C ) as thequotient Cond ( C ) = Obs ( C ) / ≈ This quotient is a CD category, and the functor J : C → Cond ( C ) preserves CD structure.8 roof. We have checked functoriality of ∼ in IV.6, so by IV.3,the quotient is symmetric premonoidal. It remains to show thatthe interchange laws holds, i.e. observations can be reordered.This follows from IV.8 because swap morphisms are iso. Proposition IV.10.
By virtue of being a well-defined CDcategory, the program equations (6) hold in the internallanguage of
Cond ( C ) . In particular, conditioning satisfiescommutativity.C. Laws for Conditioning We derive some properties of
Cond ( C ) . We firstly noticethat J is faithful for common Markov categories. Proposition IV.11.
For f, g : m → n , J ( f ) ≈ J ( g ) iff ∀ ψ : I → a ⊗ m, (id a ⊗ f ) ψ = (id a ⊗ g ) ψ In particular, J is faithful for Gauss , FinStoch and
BorelStoch .Proof.
The proof is straightforward. This condition is strongerthan equality on points: It implies that f, g are almost surelyequal with respect to all distributions.
Proposition IV.12 (Closed terms) . There is a unique state ⊥ X : I (cid:32) X in Cond ( C ) that always fails, given byany ( K, ψ, o ) with o (cid:54)(cid:28) ψ K . Any other state is equal to aconditioning-free posterior, namely ( K, ψ, o ) ≈ J ( ψ | K ◦ o ) . Proposition IV.13 (Enforcing conditions) . We have ( X, copy X , o ) ≈ ( X, o ⊗ id X , o ) This means conditions actually hold after we condition onthem. In programming notation x (cid:96) ( x := o ); x ≈ ( x := o ); o Proof.
Let ψ : I → A ⊗ X ; the success condition reads o (cid:28) ψ X both cases. Now let o (cid:28) ψ X . We verify the properties [(id A ⊗ copy X ) ψ ] | X = (cid:104) ψ | X , id X (cid:105) [(id A ⊗ o ⊗ id X ) ψ ] | X = ψ | X ⊗ o and obtain (cid:104) ψ | X , id X (cid:105) o = ψ | X ( o ) ⊗ o = ( ψ | X ⊗ o )( o ) fromdeterminism of o .V. D ENOTATIONAL SEMANTICS
We apply
Cond (§IV) to give denotational semantics toour Gaussian language (§II), which we show to be fullyabstract (Prop. V.3). One convenient feature is that we can usesubtraction in
Gauss to condition on arbitrary expressions byobserving a vanishing difference:
Definition V.1.
The Gaussian language embeds into the inter-nal language of
Cond ( Gauss ) , where x =:= y is translatedas ( x − y ):= 0 . A term (cid:126)x : R m (cid:96) e : R n denotes a morphism [[ e ]] : m (cid:32) n . Proposition V.2 (Correctness) . If ( e, ψ ) (cid:66) ( e (cid:48) , ψ (cid:48) ) then [[ e ]] ψ = [[ e (cid:48) ]] ψ (cid:48) . If ( e, ψ ) (cid:66) ⊥ then [[ e ]] = ⊥ . Proof. We can faithfully interpret ψ as a state in both Gauss and
Cond ( Gauss ) . If x (cid:96) e and ( e, ψ ) (cid:66) ( e (cid:48) , ψ (cid:48) ) then e (cid:48) haspotentially allocated some fresh latent variables x (cid:48) . We showthat let x = ψ in ( x, [[ e ]]) = let ( x, x (cid:48) ) = ψ (cid:48) in ( x, [[ e (cid:48) ]]) . (9)This notion is stable under reduction contexts.Let C be a reduction context. Then let x = ψ in ( x, [[ C [ e ]]]( x ))= let x = ψ in let y = [[ e ]]( x ) in ( x, [[ C ]]( x, y ))= let ( x, x (cid:48) ) = ψ (cid:48) in let y = [[ e (cid:48) ]]( x, x (cid:48) ) in ( x, [[ C ]]( x, y ))= let ( x, x (cid:48) ) = ψ (cid:48) in ( x, [[ C [ e (cid:48) ]]]) Now for the redexes1) The rules for let follow from the general axioms of valuesubstitution in the internal language2) For normal() we have (normal() , ψ ) (cid:66) ( x (cid:48) , ψ ⊗N (0 , and verify let x = ψ in ( x, [[normal()]])= ψ ⊗ N (0 , x, x (cid:48) ) = ψ ⊗ N (0 ,
1) in ( x, [[ x (cid:48) ]])
3) For conditioning, we have ( v =:= w, ψ ) (cid:66) (() , ψ | v = w ) .We need to show let x = ψ in ( x, [[ v =:= w ]]) = let x = ψ | v = w in ( x, ()) Let h = v − w , then we need to the following morphismsare equivalent in Cond ( Gauss ) : ψ | h =0 ≈ ψ h Applying IV.12 to the left-hand side requires us tocompute the conditional (cid:104) id , h (cid:105) ψ | ◦ , which is exactlyhow ψ | h =0 is defined. Proposition V.3 (Full abstraction) . [[ e ]] = [[ e ]] if and only if e ≈ e (where ≈ is contextual equivalence, Def. II.4).Proof. For ⇒ , let K [ − ] be a closed context. Because [[ − ]] iscompositional, we obtain [[ K [ e ]]] = [[ K [ e ]]] . If both succeed,we have reductions ( K [ e i ] , !) (cid:66) ∗ ( v i , ψ i ) and by correctness v ψ = [[ K [ e ]]] = [[ K [ e ]]] = v ψ as desired. If [[ K [ e ]]] =[[ K [ e ]]] = ⊥ then both ( K [ e i ] , !) (cid:66) ∗ ⊥ .For ⇐ , we note that Cond quotients by contextual equiv-alence, but all Gaussian contexts are definable in the lan-guage.9I. E
QUATIONAL THEORY
We now give an explicit presentation of the equality betweenprograms in the Gaussian language (§VI-A). We demonstratethe strength of the axioms by using them to characterizenormal forms for various fragments of the language (§VI-B).Besides an axiomatization of program equality, this can alsobe regarded in other equivalent ways, such as a presentationof a PROP by generators and relations, or as a presentationof a strong monad by algebraic effects, or as a presentationof a Freyd category. But we approach from the programminglanguage perspective.
A. Presentation
We use the following fragment of the language from §II.The reader may find it helpful to think of this as a normal formfor the language modulo associativity of ‘let’. This fragmenthas the following modifications: only variables of type R areallowed in the typing context Γ ; we have an explicit commandfor failure ( ⊥ ); we separate the typing judgement in two:judgements for expressions of affine algebra (cid:96) a and for generalcomputational expressions (cid:96) c ; we have an explicit coercion‘ return ’ between them for clarity. Γ , x : R , Γ (cid:48) (cid:96) a x : R Γ (cid:96) a s : R Γ (cid:96) a t : RΓ (cid:96) a s + t : R Γ (cid:96) a t : RΓ (cid:96) a α · t : RΓ (cid:96) a β : R Γ , x : R (cid:96) t : R n Γ (cid:96) c let x = normal() in t : R n Γ (cid:96) a s : R Γ (cid:96) a t : R Γ (cid:96) c u : R n Γ (cid:96) c ( s =:= t ); u : R n Γ (cid:96) a t : R . . . Γ (cid:96) a t n : RΓ (cid:96) c return( t , . . . , t n ) : R n Γ (cid:96) c ⊥ : R n There is no general sequencing construct, but we cancombine expressions using the following substitution construc-tions, whose well-typedness is derivable. Γ , x : R , Γ (cid:48) (cid:96) c t : R n Γ , Γ (cid:48) (cid:96) a s : RΓ , Γ (cid:48) (cid:96) c t [ s/x ] : R n Γ (cid:96) c t : R m Γ , x , . . . , x m : R (cid:96) c x , . . . , x n .u : R n Γ (cid:96) c t [ u/ return] : R n In the second form we replace the return statement of anexpression with another expression, capturing variables appro-priately. The precise definition of this hereditary substitutionis standard in logical frameworks (e.g. [2], [37]), for example: (cid:0) let x = normal() in return( x + 3) (cid:1) [ a.a = : = a ) / return ]= let x = normal() in ( x + 3) =:= 4; return( x + 3) For brevity we now write νx.t for let x = normal() in t , r for return and drop ‘ ; ’ when unambiguous. We axiomatizeequality by closing the following axioms under the two formsof substitution and also congruence. Now the syntax has theappearance of a second order algebraic theory, similar to thefamiliar presentations of λ -calculus or predicate logic. The theory is parameterized over an underlying theory ofvalues, which is affine algebra. The type R has the structureof a pointed vector space , which obeys the usual axioms ofvector spaces plus constant symbols ( β ) β ∈ R subject to α · β = αβ, α + β = α + β Terms modulo equations are affine functions. The categorytheorist will recognize the category A = Gauss det as theLawvere theory of pointed vector spaces.The following axioms characterize the conditioning-freefragment of the language, that is, Gaussian probability (cid:96) c νx. r[] ≡ r[] : R (DISC) (cid:96) c ν(cid:126)x. r[ U(cid:126)x ] ≡ ν(cid:126)x. r[ (cid:126)x ] : R n if U orthogonal (ORTH)The following are commutativity axioms for conditioning a, b, c, d (cid:96) c ( a =:= b )( c =:= d )r[] ≡ ( c =:= d )( a =:= b )r[] : R (C1) a, b (cid:96) c ( a =:= b ); νx. r[ x ] ≡ νx. ( a =:= b )r[ x ] : R (C2) a, b (cid:96) c ( a =:= b ) ⊥ ≡ ⊥ : R n (C3)while following encode specific properties of (=:=) a (cid:96) c ( a =:= a )r[] ≡ r[] : R (TAUT) (cid:96) c (0 =:= 1)r[] ≡ ⊥ : R (FAIL) a, b (cid:96) c ( a =:= b )r[ a ] ≡ ( a =:= b )r[ b ] : R (SUBS) (cid:96) c νx. 
( x =:= c )r[ x ] ≡ r[ c ] : R (INIT)Lastly, we add the special congruence scheme Γ (cid:96) c ( s =:= t )r[] ≡ ( s (cid:48) =:= t (cid:48) )r[] : R (CONG)whenever ( s = t ) and ( s (cid:48) = t (cid:48) ) are interderivable equationsover Γ in the theory of pointed vector spaces.Axioms (DISC) and (ORTH) completely axiomatize thefragment of the language without conditioning (Prop. VI.1).Axioms (C1)-(C3) describe dataflow – all the operationsdistribute over each other. The reader should focus on theremaining five axioms (TAUT)-(CONG), which are specificto conditioning. It is intended that they are straightforwardand intuitive. B. Normal forms
Proposition VI.1.
Axioms (DISC) - (ORTH) are complete for Gauss . That is, conditioning-free terms (cid:126)x : R n (cid:96) u, v : R n denote the same morphism in Gauss if and only if (cid:126)x (cid:96) u ≡ v is derivable from the axioms.Proof. The axioms are clearly validated in
Gauss ; probabilityis discardable and independent standard normal Gaussians areinvariant under orthogonal transformations. Note that ν com-mutes with itself because permutation matrices are orthogonal.It is curious that these laws completely characterize Gaus-sians: Any term normalizes to the form ν(cid:126)z. r[ A(cid:126)x + B(cid:126)z + (cid:126)c ] ,denoting the map ( A, (cid:126)c, BB T ) in Gauss . Consider some otherterm ν (cid:126)w.ϕ [ A (cid:48) (cid:126)x + B (cid:48) (cid:126)w + (cid:126)c (cid:48) ] that has the same denotation. By(DISC), we can without loss of generality assume that (cid:126)z and10 w have the same dimension. The condition ( A, c, BB T ) =( A (cid:48) , c (cid:48) , B (cid:48) ( B (cid:48) ) T ) implies A = A (cid:48) , (cid:126)c = (cid:126)c (cid:48) . By VII.6 there is anorthogonal matrix U such that B (cid:48) = BU . So the two termsare equated under (ORTH). Example VI.2. νx.νy. r[ x + y ] ≡ νy. r[ √ · y ] Proof.
Let s = 1 / √ , then the matrix U = (cid:18) s s − s s (cid:19) isorthogonal. Thus νx.νy. r[ x + y ] ≡ νx.νy. r[( sx + sy ) + ( − sx + sy )] ≡ νx.νy. r[ √ y ] ≡ νy. r[ √ y ] where we apply (ORTH), affine algebra and (DISC).We proceed to showing the consistency of the axioms forconditioning. Proposition VI.3.
Axioms (DISC) - (CONG) are valid in Cond ( Gauss ) Proof.
Sketch. The commutation properties are straightfor-ward from string diagram manipulation.(SUBS)Write a = b + ( a − b ) ; by IV.13, once we condition a − b := 0 , we have a = b .(INIT) By IV.12, noting that c (cid:28) N (0 , (FAIL) By IV.12, because (cid:54)(cid:28) (CONG)This follows from IV.8, because over A , equivalentscalar equations are nonzero multiples of each other.Still, this is very surprising axiom scheme, which issubstantially generalized in VI.5.For the remainder of this section, we will show how to usethe theory to derive normal forms for conditioning programs. Proposition VI.4.
Elementary row operations are valid onsystems of conditions. In particular, if S is an invertible matrixthen ( A(cid:126)x =:= (cid:126)b )r[] ≡ ( SA(cid:126)x =:=
S(cid:126)b )r[]
Proof.
Reordering and scaling of equations is (C1), (CONG).For summation, i.e. ( s =:= t )( u =:= v )r[] ≡ ( s =:= t )( u + s =:= v + t )r[] instantiate (SUBS) with ( u + x = : = v + t )r[] / r[ x ] . Now use the factthat applying any invertible matrix on the left can be decom-posed into elementary row operations. Corollary VI.5. If A(cid:126)x = (cid:126)c and B(cid:126)x = (cid:126)d are linear systems ofequations with the same solution space, then ( A(cid:126)x =:= (cid:126)c )r[] ≡ ( B(cid:126)x =:= (cid:126)d )r[] is derivable.
This generalizes (CONG) to systems of conditions.
Proof.
If the systems are consistent, then they are isomorphicby VII.4 and we use the previous proposition. If they areinconsistent, we can derive (0 =:= 1) and use (FAIL),(C3) toequate them to ⊥ .We give a normal form for closed terms. Theorem VI.6.
Any closed term can be brought into the form ν(cid:126)z. r[ A(cid:126)z + (cid:126)c ] or ⊥ . The matrix AA T is uniquely determined. This is the algebraic analogue of IV.12.
Proof.
By commutativity, we bring the term into the form ν(cid:126)z. ( A(cid:126)z =:= (cid:126)b )r[
D(cid:126)z + (cid:126)d ] By VII.2, we can find invertible matrices
S, T such that
SAT − = (cid:18) I r
00 0 (cid:19) and T is orthogonal. Using the orthogonal coordinate change (cid:126)w = T (cid:126)z and VI.5, the equations take the form ν (cid:126)w. ( SAT − (cid:126)w =:= S(cid:126)b )r[ DT − (cid:126)w + (cid:126)d ] This simplifies to ν (cid:126)w. ( (cid:126)w r =:= (cid:126)c r )(0 =:= (cid:126)c r +1: n )r[ DT − (cid:126)w + (cid:126)d ] where (cid:126)c = S(cid:126)b . We can process the first block of conditionswith (INIT). The conditions (0 =:= c i ) can either be discardedby (TAUT) if c i = 0 for all i = r + 1 , . . . , n , or fail by (FAIL)otherwise. We arrive at a conditioning-free term. Example VI.7. νx.νy. ( x =:= y )r[ x, y ] ≡ νx. r[ sx, sx ] where s = 1 / √ . Proof.
We use again the unitary matrix U from Example VI.2 νx.νy. ( x =:= y ); r[ x, y ] ≡ νx.νy. ( sx + sy =:= − sx + sy );r[ sx + sy, − sx + sy ] ≡ νx.νy. ( x =:= 0)r[ sx + sy, − sx + sy ] ≡ νy. r[ sy, sy ] where we apply (ORTH), affine algebra and (INIT).Lastly, we give a normal form for conditioning effects. Theorem VI.8 (Normal forms) . Every term (cid:126)x : R n (cid:96) u : R can either be brought into the form ⊥ or ν(cid:126)z.A(cid:126)x =:= B(cid:126)z + (cid:126)c (10) where A ∈ R r × n is in reduced echelon form with no zerorows. The values of A , (cid:126)c and BB T are uniquely determined.Proof. Through the commutativity axioms, we can bring u into the form ν(cid:126)z.A(cid:126)x =:= B(cid:126)z + (cid:126)c for general A . Let S be an invertible matrix that turns A intoreduced row echelon form, and apply it to the condition via11I.4. The zero columns don’t involve (cid:126)x , so we use VI.6 toevaluate the condition involving (cid:126)z separately. We either obtain ⊥ or the desired form (10)For uniqueness, we consider the term’s denotation ( A(cid:126)x =:= η ) : n (cid:32) in Cond ( Gauss ) , where η = N ( (cid:126)c, BB T ) .We must show that A and η can be reconstructed from theobservational behavior of the denotation. The proof given inthe appendix VII.9.VII. C ONTEXT , RELATED WORK , AND OUTLOOK
A. Symbolic disintegration and paradoxes
Our line of work can be regarded as a synthetic andaxiomatic counterpart of the symbolic disintegration of Shanand Ramsey [34]. (See also [15], [27], [29], [40].) Thatwork provides in particular verified program transforma-tions to convert an arbitrary probabilistic program of type R ⊗ τ to an equivalent one that is of the form let x =lebesgue() in let y = M in ( x, y ) . So now exact conditioning x =:= o can be carried out by substituting o for x in M . Weemphasize the similarity with the definition of conditionalsin Markov categories, as well as the role that coordinatetransformations play in both our work (§VI) and [34]. Onelanguage novelty in our work is that exact conditioning is afirst-class construct, as opposed to a whole-program transfor-mation, in our language, which makes the consistency of exactconditioning more apparent.Consistency is a fundamental concern for exact condition-ing. Borel’s paradox is an example of an inconsistency thatarises if one is careless with exact conditioning [21, Ch. 15],[20, §3.3]: It arises when naively substituting equivalent equa-tions within (=:=) . For example, the equation x − y = 0 isequivalent to x/y = 1 over the (nonzero) real numbers. Yet,in an extension of our language with division, the followingprograms are not contextually equivalent: x = normal (0,1)y = normal (0,1)x − y =:= 0 (cid:54)≡ x = normal (0,1)y = normal (0,1)x/y =:= 1 On the left, the resulting variable x has distribution N (0 , . while on the right, x can be shown to have density | x | e − x [32], [34]. In our work, Borel’s paradox finds a type-theoreticresolution: Conditioning is presented abstractly as an algebraiceffect, so the expressions ( s =:= t ) : I and ( s == t ) : bool havea different formal status and can no longer be confused. Theyare related explicitly through axioms like (SUBS), and speciallaws for simplifying conditions are given in (CONG), VI.5. ByIV.8, we can always substitute conditions which are formallyisomorphic, but x − y =:= 0 and x/y =:= 1 are not isomorphicconditions in this sense. For the special case of Gaussianprobability, we proved that equivalent affine equations areautomatically isomorphic, making it very easy to avoid Borel’sparadox in this restricted setting (Prop. VI.5). To include thenon-example above, our language needs a nonlinear operationlike ( / ) . If beyond that we introduced equality testing to thelanguage, difference between equations and conditions wouldbecome even more apparent. The equation x − y = 0 is obviously equivalent to the equation ( x == y ) = true , but the condition ( x == y ) =:= true would cause the whole programto fail, since measure-theoretically, ( x == y ) is the same as false .This also suggests a tradeoff between expressivity of thelanguage and well-behavedness of conditioning. On this sub-ject, Shan and Ramsey [34] wrote: The [measure-theoretic] definition of disintegrationallows latitude that our disintegrator does not take:When we disintegrate ξ = Λ ⊗ κ , the output κ isunique only almost everywhere — κx may returnan arbitrary measure at, for example, any finiteset of x ’s. But our disintegrator never invents anarbitrary measure at any point. The mathematicaldefinition of disintegration is therefore a bit tooloose to describe what our disintegrator actuallydoes. How to describe our disintegrator by a tighterclass of “well-behaved disintegrations” is a questionfor future research. 
In particular, the notion ofcontinuous disintegrations [1] is too tight, becausedepending on the input term, our disintegrator doesnot always return a continuous disintegration, evenif one exists. In this paper we have tackled this research problem: a notionof “well-behaved disintegrations” is given by a Markov cate-gory with precise supports. The most comprehensive category
BorelStoch admits conditioning only on events of positiveprobability (VII.1). The smaller category
Gauss features abetter notion of support and an interesting theory of con-ditioning. Studying Markov categories of different degreesof specialization helps navigating the tradeoff. Once in thesynthetic setting of a Markov category C with precise supports,the program transformations of [34] are all valid in Cond ( C ) ,and the Markov conditioning property (Def. III.5) exactlymatches the correctness criterion for symbolic disintegration. B. Other directions
Once a foundation is in algebraic or categorical form, itis easy to make connections to and draw inspiration from avariety of other work.The
Obs construction (§IV.1) that we considered here isreminiscent of the lens construction [10] and the Oles con-struction [17]. These have recently been applied to probabilitytheory [35] and quantum theory [18]. The details and intuitionsare different, but a deeper connection or generalization maybe profitable in the future.Algebraic presentations of probability theories andconjugate-prior relationships have been explored in [36].Furthermore, the concept of exact conditioning is reminiscentof unification in Prolog-style logic programming. Ourpresentation in Section VI is partly inspired by the algebraicpresentation of predicate logic in [37], which has a similarsignature and axioms. One technical difference is that in logicprogramming, ∃ a. r[ a ] ≡ ∃ a. ∃ b. ( a =:= b )r[ a ] holds whereashere we have νa. r[ a ] ≡ νa.νb. ( a =:= b )r[(1 / √ a ] , so thingsare more quantitative here. By collapsing Gaussians to their12upports (forgetting mean and covariance), we do in factobtain a model of unification.Logic programming is also closely related to relationalprogramming, and we note that our presentation is reminiscentof presentations of categories of linear relations [3], [6], [7].On the semantic side, we recall that presheaf categories havebeen used as a foundation for logic programming [23]. Ouraxiomatization can be regarded as the presentation of a monadon the category [ A op , Set ] , via [37], where A is the categoryof finite dimensional affine spaces discussed in §VI. Probabilistic logic programming [33] supports both logicvariables as well as random variables within a commonformalism. We have not considered logic variables in thisarticle, but a challenge for future work is to bring the ideasof exact conditioning closer to the ideas of unification, bothpractically and in terms of the semantics. We wonder if it ispossible to think of ∃ as an idealized “flat” prior.A CKNOWLEDGMENT
ACKNOWLEDGMENT

It has been helpful to discuss this work with many people, including Tobias Fritz, Tomáš Gonda, Mathieu Huot, Ohad Kammar and Paolo Perrone. Research supported by a Royal Society University Research Fellowship and the ERC BLAST grant.

REFERENCES
[1] N. L. Ackerman, C. E. Freer, and D. M. Roy, "On computability and disintegration," Math. Struct. Comput. Sci., vol. 27, no. 8, 2016.
[2] R. Adams, "Lambda free logical frameworks," Ann. Pure Appl. Logic, 2009, to appear.
[3] J. C. Baez and J. Erbele, "Categories in control," Theory Appl. Categ., vol. 30, pp. 836–881, 2015.
[4] A. Barber, "Dual intuitionistic linear logic," University of Edinburgh, Tech. Rep., 1996.
[5] V. I. Bogachev and I. I. Malofeev, "Kantorovich problems and conditional measures depending on a parameter," Journal of Mathematical Analysis and Applications, 2020.
[6] F. Bonchi, R. Piedeleu, P. Sobocinski, and F. Zanasi, "Graphical affine algebra," in Proc. LICS 2019, 2019.
[7] F. Bonchi, P. Sobocinski, and F. Zanasi, "The calculus of signal flow diagrams I: linear relations on streams," Inform. Comput., vol. 252, 2017.
[8] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell, "Stan: A probabilistic programming language," Journal of Statistical Software, vol. 76, no. 1, 2017.
[9] K. Cho and B. Jacobs, "Disintegration and Bayesian inversion via string diagrams," Math. Struct. Comput. Sci., vol. 29, pp. 938–971, 2019.
[10] B. Clarke, D. Elkins, J. Gibbons, F. Loregian, B. Milewski, E. Pillmore, and M. Roman, "Profunctor optics, a categorical update," 2020, arXiv:2001.07488.
[11] T. Fritz, "A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics," Adv. Math., vol. 370, 2020.
[12] T. Fritz, T. Gonda, P. Perrone, and E. F. Rischel, "Representable Markov categories and comparison of statistical experiments in categorical probability," 2020. [Online]. Available: https://arxiv.org/abs/2010.07416
[13] T. Fritz and E. F. Rischel, "Infinite products and zero-one laws in categorical probability," Compositionality, vol. 2, Aug. 2020. [Online]. Available: https://doi.org/10.32408/compositionality-2-3
[14] C. Führmann, "Varieties of effects," in Proc. FOSSACS 2002, 2002.
[15] T. Gehr, S. Misailovic, and M. Vechev, "PSI: Exact symbolic inference for probabilistic programs," in Proc. CAV 2016, 2016.
[16] N. D. Goodman and A. Stuhlmüller, "The Design and Implementation of Probabilistic Programming Languages," http://dippl.org, 2014, accessed: 2020-10-15.
[17] C. Hermida and R. D. Tennent, "Monoidal indeterminates and categories of possible worlds," Theoret. Comput. Sci., vol. 430, 2012.
[18] M. Huot and S. Staton, "Universal properties in quantum theory," in Proc. QPL 2018, 2018.
[19] B. Jacobs, "A channel-based perspective on conjugate priors," Mathematical Structures in Computer Science, vol. 30, no. 1, pp. 44–61, 2020.
[20] J. Jacobs, "Paradoxes of probabilistic programming," in Proc. POPL 2021, 2021.
[21] E. T. Jaynes, Probability Theory: The Logic of Science. CUP, 2003.
[22] O. Kammar and G. D. Plotkin, "Algebraic foundations for effect-dependent optimisations," in Proc. POPL 2012, 2012.
[23] Y. Kinoshita and J. Power, "A fibrational semantics for logic programs," in Proc. ELP 1996, 1996.
[24] S. Lauritzen and F. Jensen, "Stable local computation with conditional Gaussian distributions," Statistics and Computing, vol. 11, 1999.
[25] R. Milner, J. Parrow, and D. Walker, "A calculus of mobile processes, I," Inform. Comput., vol. 100, no. 1, 1992.
[26] T. Minka, J. Winn, J. Guiver, Y. Zaykov, D. Fabian, and J. Bronskill, "Infer.NET 0.3," 2018, Microsoft Research Cambridge. [Online]. Available: http://dotnet.github.io/infer
[27] L. Murray, D. Lundén, J. Kudlicka, D. Broman, and T. Schön, "Delayed sampling and automatic Rao-Blackwellization of probabilistic programs," in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018, pp. 1037–1046.
[28] P. Narayanan, J. Carette, W. Romano, C. Shan, and R. Zinkov, "Probabilistic inference by program transformation in Hakaru (system description)," in Proc. FLOPS 2016, 2016, pp. 62–79.
[29] P. Narayanan and C.-c. Shan, "Applications of a disintegration transformation," in Workshop on Program Transformations for Machine Learning, 2019.
[30] A. Pitts and I. Stark, "Observable properties of higher order functions that dynamically create local names, or: What's new?" in Proc. MFCS 1993, 1993.
[31] J. Power and E. Robinson, "Premonoidal categories and notions of computation," Math. Struct. Comput. Sci., vol. 7, pp. 453–468, 1997.
[32] M. A. Proschan and B. Presnell, "Expect the unexpected from conditional expectation," The American Statistician, vol. 52, no. 3, 1998.
[33] L. D. Raedt and A. Kimmig, "Probabilistic (logic) programming concepts," Mach. Learn., vol. 100, 2015.
[34] C.-c. Shan and N. Ramsey, "Exact Bayesian inference by symbolic disintegration," in Proc. POPL 2017, 2017, pp. 130–144.
[35] T. S. C. Smithe, "Bayesian updates compose optically," 2020. [Online]. Available: https://arxiv.org/abs/2006.01631
[36] S. Staton, D. Stein, H. Yang, N. L. Ackerman, C. E. Freer, and D. M. Roy, "The Beta-Bernoulli process and algebraic effects," 2018.
[37] S. Staton, "An algebraic presentation of predicate logic," in Proc. FOSSACS 2013, 2013, pp. 401–417.
[38] ——, "Commutative semantics for probabilistic programming," in Proc. ESOP 2017, 2017.
[39] S. Staton and P. B. Levy, "Universal properties of impure programming languages," in Proc. POPL 2013, 2013.
[40] R. Walia, P. Narayanan, J. Carette, S. Tobin-Hochstadt, and C.-c. Shan, "From high-level inference algorithms to efficient code," in Proc. ICFP 2019, 2019.

APPENDIX
C. Markov categories
Here, we spell out some details for the notions of ≪ and precise supports.
Proof of Proposition III.10.
Gauss faithfully embeds into BorelStoch, that is, any Gaussian morphism m → n can be seen as a measurable map ℝ^m → G(ℝ^n), where G denotes the Giry monad. By [11, 13.3], we have f =_μ g iff f(x) = g(x) in G(ℝ^n) for μ-almost all x. If μ is a Gaussian distribution with support S, then μ is equivalent to the Lebesgue measure on S. Because f, g are continuous as maps into G(ℝ^n), f(x) = g(x) for almost all x ∈ S implies f = g everywhere on S.

For the second part, let S_μ, S_ν denote the supports of μ and ν, and let x ∈ S_μ \ S_ν. Then we can find two affine functions f, g which agree on S_ν but f(x) ≠ g(x). Then f =_ν g but not f =_μ g, hence μ ≪ ν does not hold.

Proposition VII.1. In FinStoch, we have x ≪ μ iff μ(x) > 0. In BorelStoch, we have x ≪ μ iff μ({x}) > 0.

This gives the correct intuition of support for FinStoch. For BorelStoch, this is an overly rigid notion of support which may contradict our intuition. For example, the standard normal distribution N(0,1) has support ℝ in Gauss, but ∅ in BorelStoch.

Proof. The arguments follow from [11, 13.2, 13.3]. For BorelStoch, let μ({x}) = 0 and consider the measurable functions f = 1_{{x}} and g = 0. Then f =_μ g, yet f(x) ≠ g(x), showing that x ≪ μ does not hold.
For Gauss, this follows from the characterization of ≪ in III.10. Let μ have support S and f(x) = Ax + N(b, Σ). Let T be the support of N(b, Σ). The support of ⟨id, f⟩μ is the image space {(x, Ax + c) : x ∈ S, c ∈ T}. Hence (x, y) ≪ ⟨id, f⟩μ iff x ≪ μ and y ≪ f(x).

For FinStoch, an outcome (x, y) has positive probability under ⟨id, f⟩μ iff x has positive probability under μ, and y has positive probability under f(−|x).

For BorelStoch, the measure ψ = ⟨id, f⟩μ is given by

ψ(A × B) = ∫_{x ∈ A} f(B|x) μ(dx).

Hence ψ({(x, y)}) = f({y}|x) μ({x}), which is positive exactly if μ({x}) > 0 and f({y}|x) > 0.

A note on the definition of 'precise supports': The expression ⟨id, f⟩μ is an analogue of the graph of f. We wonder about its single-valuedness. If x ⊗ y ≪ ⟨id, f⟩μ then we always have x ≪ μ and y ≪ fμ. We ask that y doesn't lie in the pushforward of any old sample of μ, but precisely of f(x). This is certainly a natural property to demand, but also very specifically tailored towards the application in IV.6. We expect 'precise supports' to arise as an instance of some more encompassing axiom.

D. Linear algebra
The following facts from linear algebra are useful to recall and are used throughout.
Proposition VII.2.
Let A ∈ ℝ^{m×n}. Then there are invertible matrices S, T such that

S A T⁻¹ = ( I_r  0 )
          ( 0    0 )

where r = rank(A). Furthermore, T can be taken to be orthogonal.

Proof. Take a singular-value decomposition (SVD) A = U D Vᵀ, let T = Vᵀ and create S from Uᵀ by rescaling the appropriate rows (equivalently, the columns of U).
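The construction in this proof is easy to make concrete. The following sketch (our own; NumPy assumed, and the example matrix is arbitrary) reads S and T off an SVD and checks the block form.

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [0.0, 1.0, 1.0]])       # rank 2: row 2 = 2 * row 1
U, d, Vt = np.linalg.svd(A)
r = int(np.sum(d > 1e-12))            # numerical rank

scale = np.ones(len(d))
scale[:r] = 1.0 / d[:r]               # divide away the nonzero singular values
S = np.diag(scale) @ U.T              # invertible (product of invertibles)
T = Vt                                # orthogonal, so T^{-1} = Vt.T

N = S @ A @ np.linalg.inv(T)          # = diag(scale) @ D
print(np.round(N, 10))                # [[1,0,0],[0,1,0],[0,0,0]]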
Proposition VII.3 (Row equivalence). Two matrices A, B ∈ ℝ^{m×n} are called row equivalent if the following equivalent conditions hold:
  – for all x ∈ ℝ^n, Ax = 0 ⇔ Bx = 0;
  – A and B have the same row space;
  – there is an invertible matrix S such that A = SB.
Unique representatives of row equivalence classes are matrices in reduced row echelon form.

Corollary VII.4. Let A, B ∈ ℝ^{m×n} and let Ax = c and Bx = d be consistent systems of linear equations that have the same solution space. Then there is an invertible matrix S such that B = SA and d = Sc.
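A quick illustration of the proposition and corollary (a sketch of ours; it uses SymPy, which is an assumption of this example only): row-equivalent matrices share their reduced row echelon form, and for a matrix of full row rank the change of basis S can be recovered with a pseudoinverse.

import sympy as sp

A = sp.Matrix([[1, 2, 0], [0, 1, 1]])
S = sp.Matrix([[2, 1], [1, 1]])           # invertible: det = 1
B = S * A                                 # row equivalent to A

assert A.rref()[0] == B.rref()[0]         # same canonical representative
assert sp.simplify(B * A.pinv() - S) == sp.zeros(2, 2)  # recover S = B A^+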
Proposition VII.5 (Column equivalence). For matrices A, B ∈ ℝ^{m×n}, the following are equivalent:
  – A and B have the same column space;
  – there is an invertible matrix T such that A = BT.

Proposition VII.6. For matrices A, B ∈ ℝ^{m×n}, the following are equivalent:
  – A Aᵀ = B Bᵀ;
  – there is an orthogonal matrix U such that A = BU.

Proof. This is a known fact, but we sketch a proof for lack of reference. In the construction of the SVD A = U D Vᵀ, we can choose U and D depending on A Aᵀ alone. It follows that the same matrices work for B, giving SVDs A = U D Vᵀ and B = U D Wᵀ. Then A = B(W Vᵀ) as claimed.
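Proposition VII.6 can be checked numerically in the invertible case (our sketch; NumPy assumed): if B arises from A by an orthogonal change on the right, then A Aᵀ = B Bᵀ, and the orthogonal witness is recovered as U = B⁻¹A.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))                   # invertible almost surely
W, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix
B = A @ W.T                                   # then B B^T = A A^T

assert np.allclose(A @ A.T, B @ B.T)
U = np.linalg.solve(B, A)                     # U = B^{-1} A  (here U = W)
assert np.allclose(U @ U.T, np.eye(3))        # orthogonal
assert np.allclose(A, B @ U)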
E. Normal forms

We present a proof of the uniqueness of normal forms for conditioning morphisms. Some preliminary facts:
Proposition VII.7.
Let X ∼ N(μ_X, Σ_X) and Y ∼ N(μ_Y, Σ_Y) be independent. Then X | (X = Y) has distribution N(μ̄, Σ̄) given by

μ̄ = μ_X + Σ_X (Σ_X + Σ_Y)⁺ (μ_Y − μ_X)
Σ̄ = Σ_X − Σ_X (Σ_X + Σ_Y)⁺ Σ_X

In programming terms, this is written

let x = N(μ_X, Σ_X) in (x =:= N(μ_Y, Σ_Y)); return(x)

and corresponds to the observe statement from the introduction

x = normal(μ_X, Σ_X); observe(normal(μ_Y, Σ_Y), x)
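Spelled out in code, the formulas read as follows (a sketch of ours; NumPy's pinv stands in for the pseudoinverse (−)⁺). The one-dimensional sanity check recovers the posterior N(0, 1/2) that appears in the examples above.

import numpy as np

def condition(mu_x, cov_x, mu_y, cov_y):
    """Posterior of X given the exact condition X =:= Y (Prop. VII.7)."""
    gain = cov_x @ np.linalg.pinv(cov_x + cov_y)
    mu_post = mu_x + gain @ (mu_y - mu_x)
    cov_post = cov_x - gain @ cov_x
    return mu_post, cov_post

# conditioning N(0,1) against N(0,1) gives N(0, 1/2)
mu, cov = condition(np.zeros(1), np.eye(1), np.zeros(1), np.eye(1))
assert np.allclose(mu, 0.0) and np.allclose(cov, 0.5)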
Corollary VII.8. No 1-dimensional observe statement leaves the prior N(0,1) unchanged.

Proof. Conditioning decreases variance: if we observe from N(μ, σ²), the variance of the posterior is 1 − (1 + σ²)⁻¹ < 1.

Proposition VII.9. Consider a morphism κ : n → 0 in Cond(Gauss) given by

κ(x) = (Ax =:= η)    (11)

where A ∈ ℝ^{r×n} is in reduced row echelon form with no zero rows, and η ∈ Gauss(0, r). Then the matrix A and the distribution η are uniquely determined.

Proof. We will probe κ by applying the condition (11) to different priors ψ ∈ Gauss(0, n), giving either a result ψ′ ∈ Gauss(0, n) or ⊥.

Let S ⊆ ℝ^r be the support of η and W = {x ∈ ℝ^n : Ax ∈ S}. We can recover W from observational behavior, because for deterministic priors ψ = x, we have ψ′ ≠ ⊥ iff x ∈ W. We have κ = ⊥ iff W = ∅. Assume now that W is nonempty.

Next, we can identify the nullspace K of A by considering subspaces along which no conditioning update happens. Call an affine subspace V ⊆ ℝ^n vacuous if for all ψ ≪ V we have ψ′ = ψ. Any such V must be contained in W. We claim that every maximal vacuous subspace is of the form K + x₀ where x₀ ∈ W.

Every space of the form K + x₀ is clearly vacuous: if ψ ≪ K + x₀ then the condition (11) becomes constant as Ax₀ =:= η. Because by assumption Ax₀ ∈ S, this condition is vacuous and can be discarded without effect.

Let V be any vacuous subspace and x₀ ∈ V. We show V ⊆ x₀ + K: Assume there is another x₁ ∈ V such that x₁ − x₀ ∉ K, and consider the 1-dimensional prior t ∼ N(0,1), x = x₀ + t(x₁ − x₀). Let d = A(x₁ − x₀) ≠ 0 and find an invertible matrix T such that Td = (1, 0, …, 0)ᵀ. The condition becomes

(t, 0, …, 0)ᵀ =:= Tη − TAx₀.

All but the first equation do not involve t. By commutativity, they can be computed independently, resulting in an updated right-hand side and a 1-dimensional condition t =:= η′ with η′ either a Gaussian or ⊥. By VII.8, such a condition cannot leave the prior N(0,1) unchanged, contradicting the vacuity of V.

Having reconstructed K, the matrix A in reduced row echelon form is determined uniquely by its nullspace. Group the coordinates x₁, …, x_n into exactly r pivot coordinates x_p and n − r free coordinates x_u. Setting x_u = 0 in (11) results in the simplified condition x_p =:= η. It remains to show that we can recover the observing distribution η from observational behavior. Intuitively, if we put a flat enough prior on x_p, the posterior will resemble η arbitrarily closely: Let η = N(b, Σ) and consider the prior x_p ∼ N(0, λI) for λ → ∞. The matrix (I + λ⁻¹Σ) is invertible for all large enough λ. By the formulas of VII.7, the mean of the posterior is

μ̄ = (I + λ⁻¹Σ)⁻¹ b → b    (λ → ∞).

For the covariance, we truncate the Neumann series (I + λ⁻¹Σ)⁻¹ = I − λ⁻¹Σ + o(λ⁻¹) to obtain

Σ̄ = λI − λ(I + λ⁻¹Σ)⁻¹ = Σ + λ·o(λ⁻¹) → Σ    (λ → ∞).
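The limit at the end of this proof is easy to watch numerically (our sketch; it assumes the condition() function from the previous example): conditioning the flat prior N(0, λI) against η = N(b, Σ) gives posteriors whose deviation from (b, Σ) shrinks like 1/λ, as the Neumann series argument predicts.

import numpy as np
# assumes condition() as defined in the sketch after Prop. VII.7

b = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
for lam in [1e2, 1e4, 1e6]:
    mu, cov = condition(np.zeros(2), lam * np.eye(2), b, Sigma)
    print(lam, np.max(np.abs(mu - b)), np.max(np.abs(cov - Sigma)))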
F. Implementation

It is straightforward to implement the operational semantics of §II in a language like Python. We have done this, and we illustrate with further simple programs and results, in addition to the examples in Section I-A.

Listing 1: Gaussian regression (Fig. 4)

xs = [1.0, 2.0, 2.25, 5.0, 10.0]
ys = [...]                       # five negative observations
a = Gauss.N(0, 10)
b = Gauss.N(0, 10)
f = lambda x: a * x + b
for (x, y) in zip(xs, ys):
    Gauss.condition(f(x), y + Gauss.N(0, 0.1))

Fig. 4: Gaussian regularized regression (ridge regression), plotting 100 samples, the mean (red) and ±σ coordinatewise (blue)

Listing 2: 1-dimensional Kálmán filter (Fig. 5)

xs = [1.0, 3.4, 2.7, 3.2, 5.8,
      14.0, 18.0, 11.7, 19.5, 19.2]
x = [0] * len(xs)
v = [0] * len(xs)
x[0] = xs[0] + Gauss.N(0, 1)
v[0] = 1.0 + Gauss.N(0, 10)
for i in range(1, len(xs)):
    x[i] = x[i-1] + v[i-1]
    v[i] = v[i-1] + Gauss.N(0, 0.75)
    Gauss.condition(x[i] + Gauss.N(0, 1), xs[i])
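For completeness, here is a minimal sketch of the kind of Gauss module the listings assume. This is our own reconstruction, not the authors' code: the global state holds the mean and covariance of all latent generators, program variables are affine forms over them, and Gauss.condition performs exact conditioning of an affine expression by the one-dimensional case of the update in Prop. VII.7.

import numpy as np

class Gauss:
    """Global state: latent generators z ~ N(mean, cov);
    program variables are affine forms  coeff . z + const."""
    mean = np.zeros(0)
    cov = np.zeros((0, 0))

    class RV:
        def __init__(self, coeff, const=0.0):
            self.coeff = np.asarray(coeff, dtype=float)
            self.const = float(const)
        def _pad(self):
            # extend the coefficient vector to the current number of latents
            a = np.zeros(len(Gauss.mean))
            a[:len(self.coeff)] = self.coeff
            return a
        def __add__(self, other):
            if isinstance(other, Gauss.RV):
                return Gauss.RV(self._pad() + other._pad(),
                                self.const + other.const)
            return Gauss.RV(self.coeff, self.const + other)
        __radd__ = __add__
        def __sub__(self, other):
            return self + (-1.0) * other
        def __rmul__(self, s):
            return Gauss.RV(s * self.coeff, s * self.const)
        __mul__ = __rmul__
        def E(self):
            # posterior mean of this expression under the current state
            return float(self._pad() @ Gauss.mean + self.const)

    @staticmethod
    def N(mu, var):
        # allocate a fresh latent generator, independent of the others
        n = len(Gauss.mean)
        Gauss.mean = np.append(Gauss.mean, float(mu))
        cov = np.zeros((n + 1, n + 1))
        cov[:n, :n] = Gauss.cov
        cov[n, n] = float(var)
        Gauss.cov = cov
        return Gauss.RV(np.eye(n + 1)[n])

    @staticmethod
    def condition(lhs, rhs):
        # exact conditioning  lhs =:= rhs,  i.e.  e := lhs - rhs =:= 0
        e = lhs - rhs
        a, d = e._pad(), e.E()
        s = float(a @ Gauss.cov @ a)      # prior variance of e
        if s < 1e-12:                     # deterministic condition
            assert abs(d) < 1e-9, "failure: conditioning on a null event"
            return
        k = Gauss.cov @ a / s             # Prop. VII.7, 1-dimensional case
        Gauss.mean = Gauss.mean - d * k
        Gauss.cov = Gauss.cov - np.outer(k, a @ Gauss.cov)

Running Listing 2 after this definition, x[i].E() returns the filtered position estimates; all conditioning happens in-place on the global (mean, cov) state, mirroring the operational semantics of §II.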