Compositional Semantics for Probabilistic Programs with Exact Conditioning
CCompositional Semantics for Probabilistic Programswith Exact Conditioning
Dario Stein
University of Oxford, UK
Sam Staton
University of Oxford, UK
Abstract —We define a probabilistic programming language forGaussian random variables with a first-class exact conditioningconstruct. We give operational, denotational and equationalsemantics for this language, establishing convenient propertieslike exchangeability of conditions. Conditioning on equality ofcontinuous random variables is nontrivial, as the exact observa-tion may have probability zero; this is
Borel’s paradox . Usingcategorical formulations of conditional probability, we showthat the good properties of our language are not particular toGaussians, but can be derived from universal properties, thusgeneralizing to wider settings. We define the Cond construction,which internalizes conditioning as a morphism, providing generalcompositional semantics for probabilistic programming withexact conditioning.
I. I
NTRODUCTION
Probabilistic programming is the paradigm of specifyingcomplex statistical models as programs, and performing infer-ence on them. There are two ways of expressing dependenceon observed data, thus learning from them: soft constraints and exact conditioning . Languages like Stan [8] or WebPPL[16] use a scoring construct for soft constraints, re-weightingprogram traces by observed likelihoods. Other frameworks likeHakaru [28] or Infer.NET [26] allow exact conditioning ondata. In this paper we provide two semantic analyses of exactconditioning in a simple Gaussian language: a denotationalsemantics, and a equational axiomatic semantics, which weprove to coincide for closed programs. Our denotational se-mantics is based on a new and general construction on Markovcategories, which, we argue, serve as a good framework forexact conditioning in probabilistic programs.
A. Case study: Reasoning about a Gaussian ProgrammingLanguage with Exact Conditions
Exact conditioning decouples the generative model fromthe data observations. Consider the following example forGaussian process regression (a.k.a. kriging ): The prior ys isa -dimensional multivariate normal (Gaussian) vector; weperform inference by fixing four observed datapoints via exactconditioning (=:=) . ys = gp sample (n=100 , kernel =rbf) for (i,c) in observations :ys[i] =:= c The same program is difficult to express compositionallywithout exact conditioning. Fig. 1: GP prior and posterior with 4 exact observationsNo style of probabilistic modelling is immune to fallaciesand paradoxes. Exact conditioning is indeed sensitive in thisregard in general (§VII-A), and so it is important to show thatwhere it is used, it is consistent in a compositional way. Thatis the contribution of this paper.The kriging example (Fig. 1) uses a smooth kernel,as is common, but to discuss the situation further weconsider the following concrete variation with a Gaussianrandom walk. Suppose that the observation points are at (0 , , , , , . ys [0] = normal (0 ,1) for i = 1 to 100:ys[i] = ys[i −
1] + normal (0 ,1) for j = 0 to 5:ys [20 * j] =:= c[j] To illustrate the power of compositional reasoning, we notethat exact conditioning here is first-class, and as we will show,it is consistent to reorder programs as long as the dataflowis respected (Prop. IV.10). So this random walk program isequivalent to: ys [0] = normal (0 ,1)ys [0] =:= c[0] for i = 1 to 100:ys[i] = ys[i −
1] + normal (0 ,1) if i % 20 == 0: ys[i] =:= c[i % 20] We can now use a substitution law and initialization principleto simplify the program: ys [0] = c[0] for i = 1 to 100: if i % 20 == 0:ys[i] = c[i % 20](ys[i] − ys[i − normal (0 ,1) else : ys[i] = ys[i −
1] + normal (0 ,1) a r X i v : . [ c s . P L ] J a n he constraints are now all ‘soft’, in that they relate an expres-sion with a distribution, and so this last program could be runwith a Monte Carlo simulation in Stan or WebPPL. Indeed, thesoft-conditioning primitive observe can be defined in termsof exact conditioning as observe (D,x) ≡ ( let y = sample (D) in x =:= y) Our language is by no means restricted to kriging. For exam-ple, we can use similar techniques to implement and verify asimple K´alm´an filter (VII-F).In Section II we provide an operational semantics for thislanguage, in which there are two key commands: drawingfrom a standard normal distribution ( normal() ) and exactconditioning (=:=) . The operational semantics is defined interms of configurations ( t, ψ ) where t is a program and ψ is a state, which here is a Gaussian distribution. Each call to normal() introduces a new dimension into the state ψ , andconditioning (=:=) alters the state ψ , using a canonical form ofconditioning for Gaussian distributions (§II-A).For the program in Figure 1, the operational semanticswill first build up the prior distribution shown on the leftin Figure 1, and then the second part of the program willcondition to yield a distribution as shown on the right. But forthe other programs above, the conditioning will be interleavedin the building of the model.In stateful programming languages, composition of pro-grams is often complicated and local transformations aredifficult to reason about. But, as we now explain, we willshow that for the Gaussian language, compositionality andlocal reasoning are straightforward. For example, as we havealready illustrated: • Program lines can be reordered as long as dataflow isrespected. That is, the following commutativity equation[38] remains valid for programs with conditioning let x = u inlet y = v in t ≡ let y = v inlet x = u in t (1)where x not free in v and y not free in u . • We have a substitutivity property: if t =:= u appears in aprogram then all other occurrences of t can be replacedby u . ( t =:= u ); v [ t / x ] ≡ ( t =:= u ); v [ u / x ] (2) • As a special base case, if we condition a normal variableon a constant, then the variable takes this value: let x = normal() in ( x =:= 0); t ≡ t [ / x ] (3) B. Denotational semantics, the Cond construction, andMarkov categories
In Section V we show that this compositional reasoningis valid by using a denotational semantics. For a Gaussianlanguage with out conditioning, we can easily interpret termsas noisy affine functions, x (cid:55)→ Ax + c + N (Σ) . The exact conditioning requires a new construction for building a se-mantic model. In fact this construction is not at all specific toGaussian probability and works generally.For this general construction, we start from a class ofsymmetric monoidal categories called Markov categories [11].These can be understood as the categorical counterpart ofa broad class of probabilistic programming languages with-out conditioning (§III-B). For example, Gaussian probabilityforms a Markov category, but there are many other examples,including finite probability (e.g. coin tosses) and full Borelprobability.Our conditioning construction starts from a Markov cate-gory C , regarded as a probabilistic programming languagewithout conditioning. We build a new symmetric monoidalcategory Cond ( C ) which is conservative over C but whichcontains a conditioning construct. This construction buildson an analysis of conditional probabilities from the Markovcategory literature, which captures conditioning purely interms of categorical structure: there is no explicit Radon-Nikod´ym theorem, limits, reference measures or in fact anymeasure theory at all. The good properties of the Gaussianlanguage generalize to this abstract setting, as they follow fromuniversal properties alone.The category Cond ( C ) has the same objects as C , but amorphism is reminiscent of the decomposition of the programin Fig 1: a pair of a purely probabilistic morphism togetherwith an observation. These morphisms compose by composingthe generative parts and accumulating the observations (for agraphical representation, see Figure 2). The morphisms areconsidered up-to a natural contextual equivalence. We provesome general properties about Cond ( C ) :1) Proposition IV.11: Cond ( C ) is consistent, in that nodistinct unconditional distributions from C are equatedin Cond ( C ) .2) Proposition IV.10: Cond ( C ) allows programs to be re-ordered according to their dataflow graph, i.e. it satisfiesthe interchange law of monoidal categories.Returning to the specific case study of Gaussian probability,we show that we have a canonical interpretation of theGaussian language in Cond ( Gauss ) , which is fully abstract(Prop V.3). In consequence, the principles of reordering andconsistency hold for the contextual equivalence induced by theoperational semantics. C. Equational axioms
Our second semantic analysis (§VI) has a more syntacticand concrete flavor. We leave the generality of Markov cate-gories and focus again on the Gaussian language. We presentan equational theory for programs and use this to give normalforms for programs.Our equational theory is surprisingly simple. The first two2quations are let x = normal() in () = ()let x = normal() , . . . , x n = normal() in U(cid:126)x = let x = normal() , . . . , x n = normal() in (cid:126)x The first equation is sometimes called discarding. In thesecond equation, U must be an orthogonal matrix, and we areusing shorthand for multiplying a vector by a matrix. Thesetwo equations are enough to fully axiomatize the fragment ofthe Gaussian language without conditioning (Prop. VI.1).(In Section VI we use a concise notation, writing the firstaxiom as νx. r[] = r[] . One instance of the second axiom witha permutation matrix for U is νx.νy. r[ x, y ] = νy.νx. r[ x, y ] ,reminiscent of name generation in the π -calculus [25] or ν -calculus [30].)The remaining axioms focus on conditioning. There arecommutativity axioms for reordering parts of programs, aswell as the two substitutivity axioms considered above, (2),(3). Finally there are two axioms for eliminating a conditionthat is tautologous ( a =:= a ) or impossible (0 =:= 1) .Together, these axioms are consistent, which we can deduceby showing them to hold in the Cond model. To moreoverillustrate the strength of the axioms, we show two normalform theorems by merely using the axioms. Here normal n () describes the n -dimensional standard normal distribution. • Proposition VI.6: any closed program is either derivablyimpossible (0 =:= 1) or derivably equal to a condition-freeprogram of the form A ∗ normal n () + (cid:126)c . • Theorem VI.8: any program of unit type (with no returnvalue) is either derivably impossible (0 =:= 1) or derivablyequal to a soft constraint, i.e. a program of the form A ∗ (cid:126)x =:= B ∗ normal n () + (cid:126)c . We also give a uniquenesscriterion on A , B and (cid:126)c . D. Summary • We present a minimalist language with exact conditioningfor Gaussian probability, with the purpose of studying theabstract properties of conditioning. Despite its simplicity,the language can express Gaussian processes or K´alm´anfilters. • In order to make the denotational semantics composi-tional, we introduce the Cond construction, which ex-tends a Markov category C to a category Cond ( C ) in which conditioning internalizes as a morphism. TheGaussian language is recovered as the internal languageof Cond ( Gauss ) . • We give three semantics for the language – operational(§II), denotational (§V) and axiomatic (§VI). We showthat the denotational semantics is fully abstract (Propo-sition V.3) and that the axiomatic semantics is strongenough to derive normal forms (Theorem VI.8). Thisjustifies properties like commutativity and substitutivityfor the language. Thus probabilistic programming withexact conditioning can serve as a practical foundation forcompositional statistical modelling. II. A L
ANGUAGE FOR G AUSSIAN P ROBABILITY
We introduce a typed language (§II-B), similar to the onediscussed in Section I-A, and provide an operational semantics(§II-C).
A. Recap of Gaussian Probability
We briefly recall
Gaussian probability , by which we meanthe treatment of multivariate Gaussian distributions and affine-linear maps (e.g. [24]). A (multivariate) Gaussian distribution is the law of a random vector X ∈ R n of the form X = AZ + µ where A ∈ R n × m , µ ∈ R n and the random vector Z hascomponents Z , . . . , Z m ∼ N (0 , which are independentand standard normally distributed with density function ϕ ( x ) = 1 √ π e − x The distribution of X is fully characterized by its mean µ andthe positive semidefinite covariance matrix Σ . Conversely, forany µ and positive semidefinite matrix Σ there is a uniqueGaussian distribution of that mean and covariance denoted N ( µ, Σ) . The vector X takes values precisely in the affinesubspace S = µ + col(Σ) where col(Σ) denotes the columnspace of Σ . We call S the support of the distribution.This defines a small convenient fragment of probabilitytheory: Affine transformations of Gaussians remain Gaussian.Furthermore, conditional distributions of Gaussians are againGaussian. This is known as self-conjugacy. If X ∼ N ( µ, Σ) with X = (cid:18) X X (cid:19) , µ = (cid:18) µ µ (cid:19) , Σ = (cid:18) Σ Σ Σ Σ (cid:19) then the conditional distribution X | ( X = a ) of X condi-tional on X = a is N ( µ (cid:48) , Σ (cid:48) ) where µ (cid:48) = µ + Σ Σ +22 ( a − µ ) , Σ (cid:48) = Σ − Σ Σ +22 Σ (4)and Σ +22 denotes the Moore-Penrose pseudoinverse. Example II.1. If X, Y ∼ N (0 , are independent and Z = X − Y , then ( X, Y ) | ( Z = 0) ∼ N (cid:18)(cid:18) (cid:19) , (cid:18) . . . . (cid:19)(cid:19) The posterior distribution is equivalent to the model X ∼ N (0 , . , Y = X B. Types and terms of the Gaussian language
We now describe a language for Gaussian probability andconditioning. The core language resembles first-order OCamlwith a construct normal() to sample from a standard Gaussian,and conditioning denoted as (=:=) . Types τ are generated froma basic type R denoting real or random variable , pair typesand unit type I . τ ::= R | I | τ ∗ τ e ::= x | e + e | α · e | β | ( e, e ) | () | let x = e in e | let ( x, y ) = e in e | normal() | e =:= e where α, β range over real numbers.Typing judgements are Γ , x : τ, Γ (cid:48) (cid:96) x : τ Γ (cid:96) () : I Γ (cid:96) s : σ Γ (cid:96) t : τ Γ (cid:96) ( s, t ) : σ ∗ τ Γ (cid:96) s : R Γ (cid:96) t : RΓ (cid:96) s + t : R Γ (cid:96) t : RΓ (cid:96) α · t : R Γ (cid:96) β : RΓ (cid:96) normal() : R Γ (cid:96) s : R Γ (cid:96) t : RΓ (cid:96) ( s =:= t ) : IΓ (cid:96) s : σ Γ , x : σ (cid:96) t : τ Γ (cid:96) let x = s in t : τ Γ (cid:96) s : σ ∗ σ (cid:48) Γ , x : σ, y : σ (cid:48) (cid:96) t : τ Γ (cid:96) let ( x, y ) = s in t : τ We define standard syntactic sugar for sequencing s ; t , iden-tifying the type R n = R ∗ (R ∗ . . . ) with vectors and definingmatrix-vector multiplication A · (cid:126)x . For σ ∈ R and e : R , wedefine normal( x, σ ) ≡ x + σ · normal() . More generally,for a covariance matrix Σ , we write normal( (cid:126)x, Σ) = (cid:126)x + A · (normal() , . . . , normal()) where A is any matrix such that Σ = AA T . We can identify any context and type with R n forsuitable n . C. Operational semantics
Our operational semantics is call-by-value. Calling normal() allocates a latent random variable, and a priordistribution over all latent variables is maintained. Calling (=:=) updates this prior by symbolic inference according tothe formula (4).
Values v and redexes ρ are defined as v ::= x | ( v, v ) | v + v | α · v | β | () ρ ::= normal() | v =:= v | let x = v in e | let ( x, y ) = v in e A reduction context C with hole [ − ] is of the form C ::= [ − ] | C + e | v + C | r · C | C =:= e | v =:= C | let x = C in e | let ( x, y ) = C in e Every term is either a value or decomposes uniquely as C [ ρ ] .We define a reduction relation for terms. During the execution,we will allocate latent variables z i which we assume distinctfrom all other variables in the program. A configuration iseither a pair ( e, ψ ) where z , . . . , z r (cid:96) e and ψ is a Gaussiandistribution on R r , or a failure configuration ⊥ . We first definereduction on redexes1) For normal() , we add an independent latent variable tothe prior (normal() , ψ ) (cid:66) ( z r+1 , ψ ⊗ N (0 ,
2) To define conditioning, note that every value z , . . . , z r (cid:96) v : R defines an affine function R r → R .In order to reduce ( v =:= w, ψ ) , we consider the jointdistribution X ∼ ψ, Z = v ( X ) − w ( X ) . If lies inthe support of Z , we denote by ψ | v = w the outcome ofconditioning X on Z = 0 as in (4), and reduce ( v =:= w, ψ ) (cid:66) (() , ψ | v = w ) Otherwise ( v =:= w, ψ ) (cid:66) ⊥ , indicating that the inferenceproblem has no solution.3) Let bindings are standard (let x = v in e, ψ ) (cid:66) ( e [ v/x ] , ψ )(let ( x, y ) = ( v, w ) in e, ψ ) (cid:66) ( e [ v/x, w/y ] , ψ )
4) Lastly, under reduction contexts, if ( ρ, ψ ) (cid:66) ( e, ψ (cid:48) ) we define ( C [ ρ ] , ψ ) (cid:66) ( C [ e ] , ψ (cid:48) ) . If ( ρ, ψ ) (cid:66) ⊥ then ( C [ e ] , ψ ) (cid:66) ⊥ . Proposition II.2.
Every closed program (cid:96) e : R n , togetherwith the empty prior ‘ ! ’, deterministically reduces to either aconfiguration ( v, ψ ) or ⊥ . We consider the observable result of this execution eitherfailure, or the pushforward distribution v ∗ ψ on R n , as thisdistribution could be sampled from empirically. Example II.3.
The program let ( x, y ) = (normal() , normal()) in x =:= y ; x + y reduces to (( z , z ) , ψ ) where ψ = N (cid:18)(cid:18) (cid:19) , (cid:18) . . . . (cid:19)(cid:19) The observable outcome of the run is the pushforward distri-bution (1 1) ∗ ψ = N (0 , on R .One goal of this paper is to study properties of this languagecompositionally, and abstractly, without relying on any specificproperties of Gaussians. The crucial notion to investigate iscontextual equivalence. Definition II.4.
We say Γ (cid:96) e , e : τ are contextuallyequivalent , written e ≈ e , if for all closed contexts K [ − ] and i, j ∈ { , }
1) when ( K [ e i ] , !) (cid:66) ∗ ( v i , ψ i ) then ( K [ e j ] , !) (cid:66) ∗ ( v j , ψ j ) and ( v i ) ∗ ψ i = ( v j ) ∗ ψ j
2) when ( K [ e i ]) (cid:66) ∗ ⊥ then ( K [ e j ] , !) (cid:66) ∗ ⊥ We study contextual equivalence by developing a denota-tional semantics for the Gaussian language (§V), and provingit fully abstract (Prop. V.3). We furthermore show that thesesemantics can be axiomatized completely by a set of programequations (§VI).We also note nothing conceptually limits our languageto only Gaussians. We are running with this example forconcreteness, but any family of distributions which can besampled and conditioned can be used. So we will take care toestablish properties of the semantics in a general setting.4II. C
ATEGORICAL F OUNDATIONS OF C ONDITIONING
We will now generalize away from Gaussian probability,recovering its convenient structure in the general categoricalframework of Markov categories (§III-A). We argue that this isa categorical counterpart of probabilistic programming withoutconditioning (§III-B).
Definition III.1 ([11, § 6]) . The symmetric monoidal category
Gauss has objects n ∈ N , which represent the affine space R n , and m ⊗ n = m + n . Morphisms m → n are tuples ( A, b, Σ) where A ∈ R n × m , b ∈ R n and Σ ∈ R n × n is a posi-tive semidefinite matrix. The tuple represents a stochastic map f : R m → R n that is affine-linear, perturbed with multivariateGaussian noise of covariance Σ , informally written f ( x ) = Ax + b + N (Σ) Such morphisms compose sequentially and in parallel in theexpected way, with noise accumulating independently ( A, b, Σ) ◦ ( C, d,
Ξ) = (
AC, Ad + b, A Ξ A T + Σ)( A, b, Σ) ⊗ ( C, d,
Ξ) = (cid:18)(cid:18) A C (cid:19) , (cid:18) bd (cid:19) , (cid:18) Σ 00 Ξ (cid:19)(cid:19)
Gauss furthermore has ability to introduce correlations anddiscard values by means of the affine maps copy : R n → R n + n , x (cid:55)→ ( x, x ) and del : R n → R , x (cid:55)→ () . This gives Gauss the structure of a categorical model of probability,namely a Markov category.
A. Conditioning in Markov and CD categoriesMarkov and
CD categories are a formalism that is increas-ingly widely used (e.g. [12], [13]). We review their graphicallanguage, and theory of conditioning.
Definition III.2 ([9]) . A copy-delete (CD) category is asymmetric monoidal category ( C , ⊗ , I ) in which every object X is equipped with the structure of a commutative comonoid copy X : X → X ⊗ X , del X : X → I which is compatiblewith the monoidal structure.In CD categories, morphisms f : X → Y need not be discardable , i.e. satisfy del Y ◦ f = del X . If they are, we obtaina Markov category. Definition III.3 ([11]) . A Markov category is a CD categoryin which every morphism is discardable, i.e. del is natural.Equivalently, the unit I is terminal.Beyond Gauss , further examples of Markov categories arethe category
FinStoch of finite sets and stochastic matrices,and the category
BorelStoch of Markov kernels betweenstandard Borel spaces. CD categories generalize unnormalized measure kernels.The interchange law of ⊗ encodes exchangeability (Fubini’stheorem) while the discardability condition signifies that prob-ability measures are normalized to total mass . We introducethe following terminology: States µ : I → X are also called distributions , and if f : A → X ⊗ Y , we denote its marginals by f X : A → X, f Y : A → Y . Copying and discarding allows us to write tupling (cid:104) f, g (cid:105) and projection π X , howevernote that the monoidal structure is only semicartesian, i.e. f (cid:54) = (cid:104) f X , f Y (cid:105) in general. We use string diagram notationfor symmetric monoidal categories, and denote the comonoidstructure as == del X copy X Definition III.4 ([11, 10.1]) . A morphism f : X → Y iscalled deterministic if it commutes with copying, that is copy Y ◦ f = ( f ⊗ f ) ◦ copy X In a Markov category, the wide subcategory C det of determin-istic maps is cartesian, i.e. ⊗ is a product.A morphism ( A, b, Σ) in Gauss is deterministic iff
Σ = 0 .The deterministic subcategory A = Gauss det consists of thespaces R n and affine maps x (cid:55)→ Ax + b between them.We recall the theory of conditioning for Markov categories. Definition III.5 ([11, 11.1,11.5]) . A conditional distribution for ψ : I → X ⊗ Y is a morphism ψ | X : X → Y such that ψ = ψψ | X X Y X Y (5)A (parameterized) conditional for f : A → X ⊗ Y is amorphism f | X : X ⊗ A → Y such that X f | X = f fAX Y YA Parameterized conditionals can be specialized to conditionaldistributions in the following way
Proposition III.6. If f : A → X ⊗ Y has conditional f | X : X ⊗ A → Y and a : I → X is a deterministic state, then f | X (id X ⊗ a ) is a conditional distribution for f a . All of our examples
FinStoch , BorelStoch and
Gauss have conditionals [5], [11]. For
Gauss , this captures the self-conjugacy of Gaussians [19]. An explicit formula generalizing(4) is given in [11], but we shall only require the existence ofconditionals and work with their universal property.5 efinition III.7 ([11, 13.1]) . Let µ : I → X be a distribution.Parallel morphisms f, g : X → Y are called µ -almost surelyequal , written f = µ g , if (cid:104) id X , f (cid:105) µ = (cid:104) id X , g (cid:105) µ .Conditional distributions for a given distribution µ : I → X ⊗ Y are generally not unique. However, it follows fromdefinition that they are µ X -almost surely equal. In order touniquely evaluate conditionals at a point, we need to descendfrom the global universal property to individual inputs. Thisis achieved by the absolute continuity relation. Definition III.8 ([12, 2.8]) . Let µ, ν : I → X be twodistributions. We write µ (cid:28) ν if for all f, g : X → Y , f = ν g implies f = µ g . Lemma III.9. If f, g : X → Y are µ -almost surely equal and x : I → X satisfies x (cid:28) µ then f x = gx . Proposition III.10.
For a distribution µ = N ( b, Σ) : 0 → m in Gauss , let S = b + col(Σ) be its support as in §II-A. Then • If f, g : m → n are morphisms, then f = µ g iff f x = gx for all x ∈ S , seen as deterministic states x : 0 → m . • If ν : 0 → m then µ (cid:28) ν iff the support of µ is containedin the support of ν • In particular for x : 0 → m deterministic, x (cid:28) µ iff x ∈ S . There is a general notion of support in Markov categoriesdefined in [11] which agrees with S , but we will formulateour results in terms of the more flexible notion (cid:28) . Proof.
See appendix, where we also characterize (cid:28) for
FinStoch and
BorelStoch .We give an example of how to use the categorical condi-tioning machinery in practice.
Example III.11.
The statistical model from Example II.1 X ∼ N (0 , Y ∼ N (0 , Z = X − Y corresponds to the distribution µ : 0 → with covariancematrix Σ = A conditional with respect to Z is µ | Z ( z ) = (cid:18) . . (cid:19) z + N (cid:18) . . . . (cid:19) which can be verified by calculating (5). We wish to conditionon Z = 0 . The marginal µ Z = N (2) is supported on all of R , hence (cid:28) µ Z and by Lemma III.9 the composite µ | Z (0) = N (cid:18) . . . . (cid:19) is uniquely defined and represents the posterior distributionover ( X, Y ) . B. Internal language of Markov categories
There is a strong correspondence between first-order proba-bilistic programming languages and the categorical models ofprobability, via their internal languages. The internal languageof a CD category C has types τ ::= X | I | τ ∗ τ where X ranges over objects of C . Any type τ can beregarded as an object [[ τ ]] of C , via [[ X ]] = X , [[ I ]] = I , and [[ τ ∗ τ ]] = [[ τ ]] ⊗ [[ τ ]] . The terms of the internal languageare like the language of Section II, built from let x = t in u ,free variables and pairing, but instead of Gaussian-specificconstructs like normal() , + , and =:= , we have terms for anymorphisms in C : Γ (cid:96) t : τ . . . Γ (cid:96) t n : τ n Γ (cid:96) f ( t . . . t n ) : τ (cid:48) ( f : [[ τ ]] ⊗ . . . ⊗ [[ τ n ]] → [[ τ (cid:48) ]] in C ) Taking C = Gauss we recover the conditioning-free fragmentof the language of Section II (III.12), but the syntax makessense for any CD or Markov category. A core result of thiswork is that the full language can be recovered as well for aCD category C = Cond ( Gauss ) (§IV).A typing context Γ = ( x : τ . . . x n : τ n ) is interpreted as [[Γ]] = [[ τ ]] ⊗ · · · ⊗ [[ τ n ]] . A term in context Γ (cid:96) t : τ isinterpreted as a morphism [[Γ]] → [[ τ ]] , defined by inductionon the structure of typing derivations. This is similar tothe interpretation of a dual linear λ -calculus in a monoidalcategory [4, §3.1,§4], although because every type supportscopying and discarding we do not need to distinguish betweenlinear and non-linear variables. For example, [[let x = t in u ]] = [[Γ]] copy −−−→ [[Γ]] ⊗ [[Γ]] [[ t ]] ⊗ id −−−−→ [[ A ]] ⊗ [[Γ]] [[ u ]] −−→ [[ B ]][[Γ , x : τ, Γ (cid:48) (cid:96) x : τ ]] = [[Γ]] ⊗ [[ τ ]] ⊗ [[Γ (cid:48) ]] del ⊗ id [[ τ ]] ⊗ del −−−−−−−−−→ [[ τ ]] The interpretation always satisfies the following identity, as-sociativity and commutativity equations: [[let y = (let x = t in u ) in v ]] = [[let x = t in let y = u in v ]][[let x = t in x ]] = [[ t ]] [[let x = x in u ]] = [[ u ]] (6) [[let x = t in let y = u in v ]] = [[let y = u in let x = t in v ]] where x not free in u and y not free in t . There are alsostandard equations for tensors [39, §3.1], which always hold.We can always substitute terms for free variables: if we have Γ , x : A (cid:96) t : B and Γ (cid:96) u : A then Γ (cid:96) t [ u / x ] : B . In any CDcategory we have [[let x = t in u ]] = [[ u [ t / x ]]] if x occurs exactly once in u .In a Markov category, moreover, every term is discardable: [[let x = t in u ]] = [[ u [ t / x ]]] if x occurs at most once in u .(It is common to also define a term to be copyable if a versionof the substitution condition holds when x occurs at least once (e.g. [14], [22]), but we will not need that in what follows.)6 xample III.12. The fragment of the Gaussian languagewithout conditioning ( =:= ) is a subset of the internal languageof the category
Gauss . That is to say, there is a canonicaldenotational semantics of the Gaussian language where weinterpret types and contexts as objects of
Gauss , e.g. [[R]] = 1 and [[( x : R , y : R ⊗ R)]] = 3 . Terms Γ (cid:96) t : A are interpretedas stochastic maps Ax + b + N (Σ) . This is all automaticonce we recognize that addition (+) : 2 → , scaling α · ( − ) : 1 → , constants β : 0 → and sampling N (1) : 0 → are morphisms in Gauss . Example III.13.
In Section IV, we will show that the fullGaussian language with conditioning ( =:= ) is the internal lan-guage of a CD category. The fact that commutativity (6) holdsis non-trivial. It cannot reasonably be the internal languageof a Markov category, because conditions (=:=) cannot bediscardable. For example there is no non-trivial morphism (=:=) : 2 → in Gauss .IV. C
OND – C
OMPOSITIONAL C ONDITIONING
Let C be a Markov category with conditionals (§III-A). Forsimplicity of notation, we assume C to be strictly monoidal .We construct a new category Cond ( C ) by adding to thiscategory the ability to condition on fixed observations. By observation we mean a deterministic state o : I → X ,and we seek to add for each of those a conditioning effect (:= o ) : X → I .Our constructions proceed in two stages. We first (§IV-A)form a category Obs ( C ) on the same objects as C where (:= o ) is added purely formally. A morphism X (cid:32) Y in Obs ( C ) represents an intensional open program of the form x : X (cid:96) let ( y, k ) : Y ⊗ K = f ( x ) in ( k := o ); y (7)We think of K as an additional hidden output wire, to whichwe attach the observation o . Such programs compose theobvious way, by aggregating observations (see Fig. 2).In the second stage (§IV-B) – this is the core of the paper –we relate such open programs to the conditionals present in C ,that is we quotient by contextual equivalence. The resultingquotient is called Cond ( C ) . Under sufficient assumptions,this will have the good properties of a CD category. A. Step 1 (Obs): Adding conditioning
Definition IV.1.
The following data define a symmetric pre-monoidal category
Obs ( C ) : • the object part of Obs ( C ) is the same as C • morphisms X (cid:32) Y are tuples ( K, f, o ) where K ∈ ob( C ) , f ∈ C ( X, Y ⊗ K ) and o ∈ C det ( I, K ) • The identity on X is Id X = ( I, id X , !) where ! = id I . • Composition is defined by ( K (cid:48) , f (cid:48) , o (cid:48) ) • ( K, f, o ) = ( K (cid:48) ⊗ K, ( f (cid:48) ⊗ id K ) f, o (cid:48) ⊗ o ) . • if ( K, f, o ) : X (cid:32) Y and ( K (cid:48) , f (cid:48) , o (cid:48) ) : X (cid:48) (cid:32) Y (cid:48) , theirtensor product is defined as ( K (cid:48) ⊗ K, (id Y (cid:48) ⊗ swap K (cid:48) ,Y ⊗ id K )( f (cid:48) ⊗ f ) , o (cid:48) ⊗ o ) • There is an identity-on-objects functor J : C → Obs ( C ) that sends f : X → Y to ( I, f, !) : X (cid:32) Y . This functoris strict premonoidal and its image central • Obs ( C ) inherits symmetry and comonoid structureA premonoidal category (due to [31]) is like a monoidalcategory where the interchange law need not hold. This isthe case because Obs ( C ) does not yet identify observationsarriving in different order. This will be remedied automaticallylater when passing to the quotient Cond ( C ) .Composition and tensor can be depicted graphically as inFigure 2, where dashed wires indicate condition wires K andtheir attached observations o . For an observation o : I → K ,the conditioning effect (:= o ) : K (cid:32) I is given by ( I, id K , o ) . ff (cid:48) Z o (cid:48) oY X f (cid:48) X (cid:48) fXo (cid:48) oY (cid:48) Y Fig. 2: Composition and tensoring of morphisms in
Obs
B. Step 2 (Cond): Equivalence of open programs
We now quotient
Obs -morphisms, tying them to the con-ditionals which can be computed in C . We know how tocompute conditionals for closed programs. Given a state ( K, ψ, o ) : I (cid:32) m , we follow the procedure of ExampleIII.11: If o (cid:54)(cid:28) ψ K , the observation does not lie in the supportof the model and conditioning fails. If not, we form theconditional ψ | K in C and obtain a well-defined posterior µ | K ◦ o .This notion defines an equivalence relation on states I (cid:32) n in Cond ( C ) . We will then extend this notion to a congruenceon arbitrary morphisms X (cid:32) Y by a general categoricalconstruction. Definition IV.2.
Given two states I (cid:32) X we define ( K, ψ, o ) ∼ ( K (cid:48) , ψ (cid:48) , o (cid:48) ) if either1) o (cid:28) ψ K and o (cid:48) (cid:28) ψ (cid:48) K (cid:48) and ψ | K ( o ) = ψ (cid:48) | K (cid:48) ( o (cid:48) ) .2) o (cid:54)(cid:28) ψ K and o (cid:48) (cid:54)(cid:28) ψ (cid:48) K (cid:48) That is, both conditioning problems either fail, or both succeedwith equal posterior.Figure 3 formulates Example III.11 in
Obs ( Gauss ) : Definition IV.3.
Let X be a symmetric premonoidal category.An equivalence relation ∼ on states X ( I, − ) is called func-torial if ψ ∼ ψ (cid:48) implies f ψ ∼ f ψ (cid:48) . We can extend such arelation to a congruence ≈ on all morphisms X → Y via f ≈ g ⇔ ∀ A, ψ : I → A ⊗ X, (id A ⊗ f ) ψ ∼ (id A ⊗ g ) ψ. ∼N (1) N (1) N (0 . Fig. 3: Example III.11 describes related states (cid:32) The quotient category X / ≈ is symmetric premonoidal.We show now that under good assumptions, the quotient byconditioning IV.2 on X = Obs ( C ) is functorial, and inducesa quotient category Cond ( C ) . The technical condition is thatsupports interact well with dataflow Definition IV.4.
A Markov category C has precise supports ifthe following are equivalent for all deterministic x : I → X , y : I → Y , and arbitrary f : X → Y and µ : I → X .1) x ⊗ y (cid:28) (cid:104) id X , f (cid:105) µ x (cid:28) µ and y (cid:28) f x Proposition IV.5.
Gauss , FinStoch and
BorelStoch haveprecise supports.Proof.
See appendix.
Theorem IV.6.
Let C be a Markov category that has condi-tionals and precise supports. Then ∼ is a functorial equiva-lence relation on Obs ( C ) .Proof. Let ( K, ψ, o ) ∼ ( K (cid:48) , ψ (cid:48) , o (cid:48) ) : I (cid:32) X and ( H, f, v ) : X (cid:32) Y be any morphism. We need to show that ( H ⊗ K, ( f ⊗ id K ) ψ, v ⊗ o ) ∼ ( H ⊗ K (cid:48) , ( f ⊗ id K (cid:48) ) ψ (cid:48) , v ⊗ o (cid:48) ) (8)If ( K, f, o ) fails, i.e. o (cid:54)(cid:28) ψ K , then by marginalization anycomposite must fail. But then the RHS fails too.Now assume that ( K, ψ, o ) succeeds and ψ | K o = ψ (cid:48) | K o (cid:48) .We show that the success conditions on both sides are equiv-alent. That is because the following are equivalent1) v ⊗ o (cid:28) ( f H ⊗ id K ) ψ o (cid:28) ψ K and v (cid:28) f H ψ | K o This is exactly the ‘precise supports’ axiom, applied to µ = ψ K and g = f H ◦ ψ | K . Because 2) agrees on both sides of(8), so does 1). We are left with the case that (8) succeeds,and need to show that [( f ⊗ id K ) ψ ] | H ⊗ K ( v ⊗ o ) = [( f ⊗ id K (cid:48) ) ψ (cid:48) ] | H ⊗ K (cid:48) ( v ⊗ o (cid:48) ) . We use a variant of the argument from [11, 11.11] that doubleconditionals can be replaced by iterated conditionals. Considerthe parameterized conditional β = ( f ◦ ψ | K ) | H : H ⊗ K → Y then string diagram manipulation shows that β has the univer-sal property β = [( f ⊗ id K ) ψ ] | H ⊗ K . By specialization III.6,it also has the property β (id H ⊗ o ) = ( f ◦ ψ | K o ) | H . Hence [( f ⊗ id K ) ψ ] | H ⊗ K ( v ⊗ o ) = β (id H ⊗ o ) ◦ v = ( f ◦ ψ | K o ) | H ◦ v = ( f ◦ ψ (cid:48) | K (cid:48) o (cid:48) ) | H ◦ v = [( f ⊗ id K (cid:48) ) ψ (cid:48) ] | H ⊗ K (cid:48) ( v ⊗ o ) We can spell out the equivalence ≈ as follows: Proposition IV.7.
We have ( K, f, o ) ≈ ( K (cid:48) , f (cid:48) , o (cid:48) ) : X (cid:32) Y if for all ψ : I → A ⊗ X , either o (cid:28) f K ψ X and o (cid:48) (cid:28) f (cid:48) K (cid:48) ψ (cid:48) X and [(id A ⊗ f ) ψ ] | K ( o ) =[(id A ⊗ f (cid:48) ) ψ (cid:48) ] | K (cid:48) ( o (cid:48) ) o (cid:54)(cid:28) f K ψ X and o (cid:48) (cid:54)(cid:28) f (cid:48) K (cid:48) ψ (cid:48) X The universal property of the conditional in question is ψ f = ψ fA Y KA Y K We can show that isomorphic conditions are equivalentunder the relation ≈ . Proposition IV.8 (Isomorphic conditions) . Let ( K, f, o ) : X (cid:32) Y and α : K ∼ = K (cid:48) be an isomorphism. Then ( K, f, o ) ≈ ( K (cid:48) , (id Y ⊗ α ) f, αo ) . In programming terms ( k := o ) ≈ ( αk := αo ) .Proof. Let ψ : I → A ⊗ X . We first notice that o (cid:28) ψ K ifand only if αo (cid:28) αψ K , so the success conditions coincide. Itis now straightforward to check the universal property (id A ⊗ f ) ψ | K = (id A ⊗ ((id X ⊗ α ) f )) ψ | K (cid:48) ◦ α. This requires the fact that isomorphisms are deterministic,which holds in every Markov category with conditionals [11,11.28]. The proof works more generally if α is deterministicand split monic.We can now give the Cond construction: Definition IV.9.
Let C be a Markov category that has con-ditionals and precise supports. We define Cond ( C ) as thequotient Cond ( C ) = Obs ( C ) / ≈ This quotient is a CD category, and the functor J : C → Cond ( C ) preserves CD structure.8 roof. We have checked functoriality of ∼ in IV.6, so by IV.3,the quotient is symmetric premonoidal. It remains to show thatthe interchange laws holds, i.e. observations can be reordered.This follows from IV.8 because swap morphisms are iso. Proposition IV.10.
By virtue of being a well-defined CDcategory, the program equations (6) hold in the internallanguage of
Cond ( C ) . In particular, conditioning satisfiescommutativity.C. Laws for Conditioning We derive some properties of
Cond ( C ) . We firstly noticethat J is faithful for common Markov categories. Proposition IV.11.
For f, g : m → n , J ( f ) ≈ J ( g ) iff ∀ ψ : I → a ⊗ m, (id a ⊗ f ) ψ = (id a ⊗ g ) ψ In particular, J is faithful for Gauss , FinStoch and
BorelStoch .Proof.
The proof is straightforward. This condition is strongerthan equality on points: It implies that f, g are almost surelyequal with respect to all distributions.
Proposition IV.12 (Closed terms) . There is a unique state ⊥ X : I (cid:32) X in Cond ( C ) that always fails, given byany ( K, ψ, o ) with o (cid:54)(cid:28) ψ K . Any other state is equal to aconditioning-free posterior, namely ( K, ψ, o ) ≈ J ( ψ | K ◦ o ) . Proposition IV.13 (Enforcing conditions) . We have ( X, copy X , o ) ≈ ( X, o ⊗ id X , o ) This means conditions actually hold after we condition onthem. In programming notation x (cid:96) ( x := o ); x ≈ ( x := o ); o Proof.
Let ψ : I → A ⊗ X ; the success condition reads o (cid:28) ψ X both cases. Now let o (cid:28) ψ X . We verify the properties [(id A ⊗ copy X ) ψ ] | X = (cid:104) ψ | X , id X (cid:105) [(id A ⊗ o ⊗ id X ) ψ ] | X = ψ | X ⊗ o and obtain (cid:104) ψ | X , id X (cid:105) o = ψ | X ( o ) ⊗ o = ( ψ | X ⊗ o )( o ) fromdeterminism of o .V. D ENOTATIONAL SEMANTICS
We apply
Cond (§IV) to give denotational semantics toour Gaussian language (§II), which we show to be fullyabstract (Prop. V.3). One convenient feature is that we can usesubtraction in
Gauss to condition on arbitrary expressions byobserving a vanishing difference:
Definition V.1.
The Gaussian language embeds into the inter-nal language of
Cond ( Gauss ) , where x =:= y is translatedas ( x − y ):= 0 . A term (cid:126)x : R m (cid:96) e : R n denotes a morphism [[ e ]] : m (cid:32) n . Proposition V.2 (Correctness) . If ( e, ψ ) (cid:66) ( e (cid:48) , ψ (cid:48) ) then [[ e ]] ψ = [[ e (cid:48) ]] ψ (cid:48) . If ( e, ψ ) (cid:66) ⊥ then [[ e ]] = ⊥ . Proof. We can faithfully interpret ψ as a state in both Gauss and
Cond ( Gauss ) . If x (cid:96) e and ( e, ψ ) (cid:66) ( e (cid:48) , ψ (cid:48) ) then e (cid:48) haspotentially allocated some fresh latent variables x (cid:48) . We showthat let x = ψ in ( x, [[ e ]]) = let ( x, x (cid:48) ) = ψ (cid:48) in ( x, [[ e (cid:48) ]]) . (9)This notion is stable under reduction contexts.Let C be a reduction context. Then let x = ψ in ( x, [[ C [ e ]]]( x ))= let x = ψ in let y = [[ e ]]( x ) in ( x, [[ C ]]( x, y ))= let ( x, x (cid:48) ) = ψ (cid:48) in let y = [[ e (cid:48) ]]( x, x (cid:48) ) in ( x, [[ C ]]( x, y ))= let ( x, x (cid:48) ) = ψ (cid:48) in ( x, [[ C [ e (cid:48) ]]]) Now for the redexes1) The rules for let follow from the general axioms of valuesubstitution in the internal language2) For normal() we have (normal() , ψ ) (cid:66) ( x (cid:48) , ψ ⊗N (0 , and verify let x = ψ in ( x, [[normal()]])= ψ ⊗ N (0 , x, x (cid:48) ) = ψ ⊗ N (0 ,
1) in ( x, [[ x (cid:48) ]])
3) For conditioning, we have ( v =:= w, ψ ) (cid:66) (() , ψ | v = w ) .We need to show let x = ψ in ( x, [[ v =:= w ]]) = let x = ψ | v = w in ( x, ()) Let h = v − w , then we need to the following morphismsare equivalent in Cond ( Gauss ) : ψ | h =0 ≈ ψ h Applying IV.12 to the left-hand side requires us tocompute the conditional (cid:104) id , h (cid:105) ψ | ◦ , which is exactlyhow ψ | h =0 is defined. Proposition V.3 (Full abstraction) . [[ e ]] = [[ e ]] if and only if e ≈ e (where ≈ is contextual equivalence, Def. II.4).Proof. For ⇒ , let K [ − ] be a closed context. Because [[ − ]] iscompositional, we obtain [[ K [ e ]]] = [[ K [ e ]]] . If both succeed,we have reductions ( K [ e i ] , !) (cid:66) ∗ ( v i , ψ i ) and by correctness v ψ = [[ K [ e ]]] = [[ K [ e ]]] = v ψ as desired. If [[ K [ e ]]] =[[ K [ e ]]] = ⊥ then both ( K [ e i ] , !) (cid:66) ∗ ⊥ .For ⇐ , we note that Cond quotients by contextual equiv-alence, but all Gaussian contexts are definable in the lan-guage.9I. E
QUATIONAL THEORY
We now give an explicit presentation of the equality betweenprograms in the Gaussian language (§VI-A). We demonstratethe strength of the axioms by using them to characterizenormal forms for various fragments of the language (§VI-B).Besides an axiomatization of program equality, this can alsobe regarded in other equivalent ways, such as a presentationof a PROP by generators and relations, or as a presentationof a strong monad by algebraic effects, or as a presentationof a Freyd category. But we approach from the programminglanguage perspective.
A. Presentation
We use the following fragment of the language from §II.The reader may find it helpful to think of this as a normal formfor the language modulo associativity of ‘let’. This fragmenthas the following modifications: only variables of type R areallowed in the typing context Γ ; we have an explicit commandfor failure ( ⊥ ); we separate the typing judgement in two:judgements for expressions of affine algebra (cid:96) a and for generalcomputational expressions (cid:96) c ; we have an explicit coercion‘ return ’ between them for clarity. Γ , x : R , Γ (cid:48) (cid:96) a x : R Γ (cid:96) a s : R Γ (cid:96) a t : RΓ (cid:96) a s + t : R Γ (cid:96) a t : RΓ (cid:96) a α · t : RΓ (cid:96) a β : R Γ , x : R (cid:96) t : R n Γ (cid:96) c let x = normal() in t : R n Γ (cid:96) a s : R Γ (cid:96) a t : R Γ (cid:96) c u : R n Γ (cid:96) c ( s =:= t ); u : R n Γ (cid:96) a t : R . . . Γ (cid:96) a t n : RΓ (cid:96) c return( t , . . . , t n ) : R n Γ (cid:96) c ⊥ : R n There is no general sequencing construct, but we cancombine expressions using the following substitution construc-tions, whose well-typedness is derivable. Γ , x : R , Γ (cid:48) (cid:96) c t : R n Γ , Γ (cid:48) (cid:96) a s : RΓ , Γ (cid:48) (cid:96) c t [ s/x ] : R n Γ (cid:96) c t : R m Γ , x , . . . , x m : R (cid:96) c x , . . . , x n .u : R n Γ (cid:96) c t [ u/ return] : R n In the second form we replace the return statement of anexpression with another expression, capturing variables appro-priately. The precise definition of this hereditary substitutionis standard in logical frameworks (e.g. [2], [37]), for example: (cid:0) let x = normal() in return( x + 3) (cid:1) [ a.a = : = a ) / return ]= let x = normal() in ( x + 3) =:= 4; return( x + 3) For brevity we now write νx.t for let x = normal() in t , r for return and drop ‘ ; ’ when unambiguous. We axiomatizeequality by closing the following axioms under the two formsof substitution and also congruence. Now the syntax has theappearance of a second order algebraic theory, similar to thefamiliar presentations of λ -calculus or predicate logic. The theory is parameterized over an underlying theory ofvalues, which is affine algebra. The type R has the structureof a pointed vector space , which obeys the usual axioms ofvector spaces plus constant symbols ( β ) β ∈ R subject to α · β = αβ, α + β = α + β Terms modulo equations are affine functions. The categorytheorist will recognize the category A = Gauss det as theLawvere theory of pointed vector spaces.The following axioms characterize the conditioning-freefragment of the language, that is, Gaussian probability (cid:96) c νx. r[] ≡ r[] : R (DISC) (cid:96) c ν(cid:126)x. r[ U(cid:126)x ] ≡ ν(cid:126)x. r[ (cid:126)x ] : R n if U orthogonal (ORTH)The following are commutativity axioms for conditioning a, b, c, d (cid:96) c ( a =:= b )( c =:= d )r[] ≡ ( c =:= d )( a =:= b )r[] : R (C1) a, b (cid:96) c ( a =:= b ); νx. r[ x ] ≡ νx. ( a =:= b )r[ x ] : R (C2) a, b (cid:96) c ( a =:= b ) ⊥ ≡ ⊥ : R n (C3)while following encode specific properties of (=:=) a (cid:96) c ( a =:= a )r[] ≡ r[] : R (TAUT) (cid:96) c (0 =:= 1)r[] ≡ ⊥ : R (FAIL) a, b (cid:96) c ( a =:= b )r[ a ] ≡ ( a =:= b )r[ b ] : R (SUBS) (cid:96) c νx. 
( x =:= c )r[ x ] ≡ r[ c ] : R (INIT)Lastly, we add the special congruence scheme Γ (cid:96) c ( s =:= t )r[] ≡ ( s (cid:48) =:= t (cid:48) )r[] : R (CONG)whenever ( s = t ) and ( s (cid:48) = t (cid:48) ) are interderivable equationsover Γ in the theory of pointed vector spaces.Axioms (DISC) and (ORTH) completely axiomatize thefragment of the language without conditioning (Prop. VI.1).Axioms (C1)-(C3) describe dataflow – all the operationsdistribute over each other. The reader should focus on theremaining five axioms (TAUT)-(CONG), which are specificto conditioning. It is intended that they are straightforwardand intuitive. B. Normal forms
Proposition VI.1.
Axioms (DISC) - (ORTH) are complete for Gauss . That is, conditioning-free terms (cid:126)x : R n (cid:96) u, v : R n denote the same morphism in Gauss if and only if (cid:126)x (cid:96) u ≡ v is derivable from the axioms.Proof. The axioms are clearly validated in
Gauss ; probabilityis discardable and independent standard normal Gaussians areinvariant under orthogonal transformations. Note that ν com-mutes with itself because permutation matrices are orthogonal.It is curious that these laws completely characterize Gaus-sians: Any term normalizes to the form ν(cid:126)z. r[ A(cid:126)x + B(cid:126)z + (cid:126)c ] ,denoting the map ( A, (cid:126)c, BB T ) in Gauss . Consider some otherterm ν (cid:126)w.ϕ [ A (cid:48) (cid:126)x + B (cid:48) (cid:126)w + (cid:126)c (cid:48) ] that has the same denotation. By(DISC), we can without loss of generality assume that (cid:126)z and10 w have the same dimension. The condition ( A, c, BB T ) =( A (cid:48) , c (cid:48) , B (cid:48) ( B (cid:48) ) T ) implies A = A (cid:48) , (cid:126)c = (cid:126)c (cid:48) . By VII.6 there is anorthogonal matrix U such that B (cid:48) = BU . So the two termsare equated under (ORTH). Example VI.2. νx.νy. r[ x + y ] ≡ νy. r[ √ · y ] Proof.
Let s = 1 / √ , then the matrix U = (cid:18) s s − s s (cid:19) isorthogonal. Thus νx.νy. r[ x + y ] ≡ νx.νy. r[( sx + sy ) + ( − sx + sy )] ≡ νx.νy. r[ √ y ] ≡ νy. r[ √ y ] where we apply (ORTH), affine algebra and (DISC).We proceed to showing the consistency of the axioms forconditioning. Proposition VI.3.
Axioms (DISC) - (CONG) are valid in Cond ( Gauss ) Proof.
Sketch. The commutation properties are straightfor-ward from string diagram manipulation.(SUBS)Write a = b + ( a − b ) ; by IV.13, once we condition a − b := 0 , we have a = b .(INIT) By IV.12, noting that c (cid:28) N (0 , (FAIL) By IV.12, because (cid:54)(cid:28) (CONG)This follows from IV.8, because over A , equivalentscalar equations are nonzero multiples of each other.Still, this is very surprising axiom scheme, which issubstantially generalized in VI.5.For the remainder of this section, we will show how to usethe theory to derive normal forms for conditioning programs. Proposition VI.4.
Elementary row operations are valid onsystems of conditions. In particular, if S is an invertible matrixthen ( A(cid:126)x =:= (cid:126)b )r[] ≡ ( SA(cid:126)x =:=
S(cid:126)b )r[]
Proof.
Reordering and scaling of equations is (C1), (CONG).For summation, i.e. ( s =:= t )( u =:= v )r[] ≡ ( s =:= t )( u + s =:= v + t )r[] instantiate (SUBS) with ( u + x = : = v + t )r[] / r[ x ] . Now use the factthat applying any invertible matrix on the left can be decom-posed into elementary row operations. Corollary VI.5. If A(cid:126)x = (cid:126)c and B(cid:126)x = (cid:126)d are linear systems ofequations with the same solution space, then ( A(cid:126)x =:= (cid:126)c )r[] ≡ ( B(cid:126)x =:= (cid:126)d )r[] is derivable.
This generalizes (CONG) to systems of conditions.
Proof.
If the systems are consistent, then they are isomorphicby VII.4 and we use the previous proposition. If they areinconsistent, we can derive (0 =:= 1) and use (FAIL),(C3) toequate them to ⊥ .We give a normal form for closed terms. Theorem VI.6.
Any closed term can be brought into the form ν(cid:126)z. r[ A(cid:126)z + (cid:126)c ] or ⊥ . The matrix AA T is uniquely determined. This is the algebraic analogue of IV.12.
Proof.
By commutativity, we bring the term into the form ν(cid:126)z. ( A(cid:126)z =:= (cid:126)b )r[
D(cid:126)z + (cid:126)d ] By VII.2, we can find invertible matrices
S, T such that
SAT − = (cid:18) I r
00 0 (cid:19) and T is orthogonal. Using the orthogonal coordinate change (cid:126)w = T (cid:126)z and VI.5, the equations take the form ν (cid:126)w. ( SAT − (cid:126)w =:= S(cid:126)b )r[ DT − (cid:126)w + (cid:126)d ] This simplifies to ν (cid:126)w. ( (cid:126)w r =:= (cid:126)c r )(0 =:= (cid:126)c r +1: n )r[ DT − (cid:126)w + (cid:126)d ] where (cid:126)c = S(cid:126)b . We can process the first block of conditionswith (INIT). The conditions (0 =:= c i ) can either be discardedby (TAUT) if c i = 0 for all i = r + 1 , . . . , n , or fail by (FAIL)otherwise. We arrive at a conditioning-free term. Example VI.7. νx.νy. ( x =:= y )r[ x, y ] ≡ νx. r[ sx, sx ] where s = 1 / √ . Proof.
We use again the unitary matrix U from Example VI.2 νx.νy. ( x =:= y ); r[ x, y ] ≡ νx.νy. ( sx + sy =:= − sx + sy );r[ sx + sy, − sx + sy ] ≡ νx.νy. ( x =:= 0)r[ sx + sy, − sx + sy ] ≡ νy. r[ sy, sy ] where we apply (ORTH), affine algebra and (INIT).Lastly, we give a normal form for conditioning effects. Theorem VI.8 (Normal forms) . Every term (cid:126)x : R n (cid:96) u : R can either be brought into the form ⊥ or ν(cid:126)z.A(cid:126)x =:= B(cid:126)z + (cid:126)c (10) where A ∈ R r × n is in reduced echelon form with no zerorows. The values of A , (cid:126)c and BB T are uniquely determined.Proof. Through the commutativity axioms, we can bring u into the form ν(cid:126)z.A(cid:126)x =:= B(cid:126)z + (cid:126)c for general A . Let S be an invertible matrix that turns A intoreduced row echelon form, and apply it to the condition via11I.4. The zero columns don’t involve (cid:126)x , so we use VI.6 toevaluate the condition involving (cid:126)z separately. We either obtain ⊥ or the desired form (10)For uniqueness, we consider the term’s denotation ( A(cid:126)x =:= η ) : n (cid:32) in Cond ( Gauss ) , where η = N ( (cid:126)c, BB T ) .We must show that A and η can be reconstructed from theobservational behavior of the denotation. The proof given inthe appendix VII.9.VII. C ONTEXT , RELATED WORK , AND OUTLOOK
A. Symbolic disintegration and paradoxes
Our line of work can be regarded as a synthetic andaxiomatic counterpart of the symbolic disintegration of Shanand Ramsey [34]. (See also [15], [27], [29], [40].) Thatwork provides in particular verified program transforma-tions to convert an arbitrary probabilistic program of type R ⊗ τ to an equivalent one that is of the form let x =lebesgue() in let y = M in ( x, y ) . So now exact conditioning x =:= o can be carried out by substituting o for x in M . Weemphasize the similarity with the definition of conditionalsin Markov categories, as well as the role that coordinatetransformations play in both our work (§VI) and [34]. Onelanguage novelty in our work is that exact conditioning is afirst-class construct, as opposed to a whole-program transfor-mation, in our language, which makes the consistency of exactconditioning more apparent.Consistency is a fundamental concern for exact condition-ing. Borel’s paradox is an example of an inconsistency thatarises if one is careless with exact conditioning [21, Ch. 15],[20, §3.3]: It arises when naively substituting equivalent equa-tions within (=:=) . For example, the equation x − y = 0 isequivalent to x/y = 1 over the (nonzero) real numbers. Yet,in an extension of our language with division, the followingprograms are not contextually equivalent: x = normal (0,1)y = normal (0,1)x − y =:= 0 (cid:54)≡ x = normal (0,1)y = normal (0,1)x/y =:= 1 On the left, the resulting variable x has distribution N (0 , . while on the right, x can be shown to have density | x | e − x [32], [34]. In our work, Borel’s paradox finds a type-theoreticresolution: Conditioning is presented abstractly as an algebraiceffect, so the expressions ( s =:= t ) : I and ( s == t ) : bool havea different formal status and can no longer be confused. Theyare related explicitly through axioms like (SUBS), and speciallaws for simplifying conditions are given in (CONG), VI.5. ByIV.8, we can always substitute conditions which are formallyisomorphic, but x − y =:= 0 and x/y =:= 1 are not isomorphicconditions in this sense. For the special case of Gaussianprobability, we proved that equivalent affine equations areautomatically isomorphic, making it very easy to avoid Borel’sparadox in this restricted setting (Prop. VI.5). To include thenon-example above, our language needs a nonlinear operationlike ( / ) . If beyond that we introduced equality testing to thelanguage, difference between equations and conditions wouldbecome even more apparent. The equation x − y = 0 is obviously equivalent to the equation ( x == y ) = true , but the condition ( x == y ) =:= true would cause the whole programto fail, since measure-theoretically, ( x == y ) is the same as false .This also suggests a tradeoff between expressivity of thelanguage and well-behavedness of conditioning. On this sub-ject, Shan and Ramsey [34] wrote: The [measure-theoretic] definition of disintegrationallows latitude that our disintegrator does not take:When we disintegrate ξ = Λ ⊗ κ , the output κ isunique only almost everywhere — κx may returnan arbitrary measure at, for example, any finiteset of x ’s. But our disintegrator never invents anarbitrary measure at any point. The mathematicaldefinition of disintegration is therefore a bit tooloose to describe what our disintegrator actuallydoes. How to describe our disintegrator by a tighterclass of “well-behaved disintegrations” is a questionfor future research. 
In particular, the notion ofcontinuous disintegrations [1] is too tight, becausedepending on the input term, our disintegrator doesnot always return a continuous disintegration, evenif one exists. In this paper we have tackled this research problem: a notionof “well-behaved disintegrations” is given by a Markov cate-gory with precise supports. The most comprehensive category
BorelStoch admits conditioning only on events of positiveprobability (VII.1). The smaller category
Gauss features abetter notion of support and an interesting theory of con-ditioning. Studying Markov categories of different degreesof specialization helps navigating the tradeoff. Once in thesynthetic setting of a Markov category C with precise supports,the program transformations of [34] are all valid in Cond ( C ) ,and the Markov conditioning property (Def. III.5) exactlymatches the correctness criterion for symbolic disintegration. B. Other directions
Once a foundation is in algebraic or categorical form, itis easy to make connections to and draw inspiration from avariety of other work.The
Obs construction (§IV.1) that we considered here isreminiscent of the lens construction [10] and the Oles con-struction [17]. These have recently been applied to probabilitytheory [35] and quantum theory [18]. The details and intuitionsare different, but a deeper connection or generalization maybe profitable in the future.Algebraic presentations of probability theories andconjugate-prior relationships have been explored in [36].Furthermore, the concept of exact conditioning is reminiscentof unification in Prolog-style logic programming. Ourpresentation in Section VI is partly inspired by the algebraicpresentation of predicate logic in [37], which has a similarsignature and axioms. One technical difference is that in logicprogramming, ∃ a. r[ a ] ≡ ∃ a. ∃ b. ( a =:= b )r[ a ] holds whereashere we have νa. r[ a ] ≡ νa.νb. ( a =:= b )r[(1 / √ a ] , so thingsare more quantitative here. By collapsing Gaussians to their12upports (forgetting mean and covariance), we do in factobtain a model of unification.Logic programming is also closely related to relationalprogramming, and we note that our presentation is reminiscentof presentations of categories of linear relations [3], [6], [7].On the semantic side, we recall that presheaf categories havebeen used as a foundation for logic programming [23]. Ouraxiomatization can be regarded as the presentation of a monadon the category [ A op , Set ] , via [37], where A is the categoryof finite dimensional affine spaces discussed in §VI. Probabilistic logic programming [33] supports both logicvariables as well as random variables within a commonformalism. We have not considered logic variables in thisarticle, but a challenge for future work is to bring the ideasof exact conditioning closer to the ideas of unification, bothpractically and in terms of the semantics. We wonder if it ispossible to think of ∃ as an idealized “flat” prior.A CKNOWLEDGMENT
ACKNOWLEDGMENT

It has been helpful to discuss this work with many people, including Tobias Fritz, Tomáš Gonda, Mathieu Huot, Ohad Kammar and Paolo Perrone. Research supported by a Royal Society University Research Fellowship and the ERC BLAST grant.

REFERENCES
[1] N. L. Ackerman, C. E. Freer, and D. M. Roy, "On computability and disintegration," Math. Struct. Comput. Sci., vol. 27, no. 8, 2016.
[2] R. Adams, "Lambda free logical frameworks," Ann. Pure Appl. Logic, 2009, to appear.
[3] J. C. Baez and J. Erbele, "Categories in control," Theory Appl. Categ., vol. 30, pp. 836–881, 2015.
[4] A. Barber, "Dual intuitionistic linear logic," University of Edinburgh, Tech. Rep., 1996.
[5] V. I. Bogachev and I. I. Malofeev, "Kantorovich problems and conditional measures depending on a parameter," Journal of Mathematical Analysis and Applications, 2020.
[6] F. Bonchi, R. Piedeleu, P. Sobocinski, and F. Zanasi, "Graphical affine algebra," in Proc. LICS 2019, 2019.
[7] F. Bonchi, P. Sobocinski, and F. Zanasi, "The calculus of signal flow diagrams I: linear relations on streams," Inform. Comput., vol. 252, 2017.
[8] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell, "Stan: A probabilistic programming language," Journal of Statistical Software, vol. 76, no. 1, 2017.
[9] K. Cho and B. Jacobs, "Disintegration and Bayesian inversion via string diagrams," Math. Struct. Comput. Sci., vol. 29, pp. 938–971, 2019.
[10] B. Clarke, D. Elkins, J. Gibbons, F. Loregian, B. Milewski, E. Pillmore, and M. Roman, "Profunctor optics, a categorical update," 2020, arXiv:2001.07488.
[11] T. Fritz, "A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics," Adv. Math., vol. 370, 2020.
[12] T. Fritz, T. Gonda, P. Perrone, and E. F. Rischel, "Representable Markov categories and comparison of statistical experiments in categorical probability," 2020. [Online]. Available: https://arxiv.org/abs/2010.07416
[13] T. Fritz and E. F. Rischel, "Infinite products and zero-one laws in categorical probability," Compositionality, vol. 2, Aug. 2020. [Online]. Available: https://doi.org/10.32408/compositionality-2-3
[14] C. Führmann, "Varieties of effects," in Proc. FOSSACS 2002, 2002.
[15] T. Gehr, S. Misailovic, and M. Vechev, "PSI: Exact symbolic inference for probabilistic programs," in Proc. CAV 2016, 2016.
[16] N. D. Goodman and A. Stuhlmüller, "The Design and Implementation of Probabilistic Programming Languages," http://dippl.org, 2014, accessed: 2020-10-15.
[17] C. Hermida and R. D. Tennent, "Monoidal indeterminates and categories of possible worlds," Theoret. Comput. Sci., vol. 430, 2012.
[18] M. Huot and S. Staton, "Universal properties in quantum theory," in Proc. QPL 2018, 2018.
[19] B. Jacobs, "A channel-based perspective on conjugate priors," Mathematical Structures in Computer Science, vol. 30, no. 1, pp. 44–61, 2020.
[20] J. Jacobs, "Paradoxes of probabilistic programming," in Proc. POPL 2021, 2021.
[21] E. T. Jaynes, Probability Theory: The Logic of Science. CUP, 2003.
[22] O. Kammar and G. D. Plotkin, "Algebraic foundations for effect-dependent optimisations," in Proc. POPL 2012, 2012.
[23] Y. Kinoshita and J. Power, "A fibrational semantics for logic programs," in Proc. ELP 1996, 1996.
[24] S. Lauritzen and F. Jensen, "Stable local computation with conditional Gaussian distributions," Statistics and Computing, vol. 11, 1999.
[25] R. Milner, J. Parrow, and D. Walker, "A calculus of mobile processes, I," Inform. Comput., vol. 100, no. 1, 1992.
[26] T. Minka, J. Winn, J. Guiver, Y. Zaykov, D. Fabian, and J. Bronskill, "Infer.NET 0.3," 2018, Microsoft Research Cambridge. [Online]. Available: http://dotnet.github.io/infer
[27] L. Murray, D. Lundén, J. Kudlicka, D. Broman, and T. Schön, "Delayed sampling and automatic Rao-Blackwellization of probabilistic programs," in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018, pp. 1037–1046.
[28] P. Narayanan, J. Carette, W. Romano, C. Shan, and R. Zinkov, "Probabilistic inference by program transformation in Hakaru (system description)," in Proc. FLOPS 2016, 2016, pp. 62–79.
[29] P. Narayanan and C.-c. Shan, "Applications of a disintegration transformation," in Workshop on Program Transformations for Machine Learning, 2019.
[30] A. Pitts and I. Stark, "Observable properties of higher order functions that dynamically create local names, or: What's new?" in Proc. MFCS 1993, 1993.
[31] J. Power and E. Robinson, "Premonoidal categories and notions of computation," Math. Struct. Comput. Sci., vol. 7, pp. 453–468, 1997.
[32] M. A. Proschan and B. Presnell, "Expect the unexpected from conditional expectation," The American Statistician, vol. 52, no. 3, 1998.
[33] L. D. Raedt and A. Kimmig, "Probabilistic (logic) programming concepts," Mach. Learn., vol. 100, 2015.
[34] C.-c. Shan and N. Ramsey, "Exact Bayesian inference by symbolic disintegration," in Proc. POPL 2017, 2017, pp. 130–144.
[35] T. S. C. Smithe, "Bayesian updates compose optically," 2020. [Online]. Available: https://arxiv.org/abs/2006.01631
[36] S. Staton, D. Stein, H. Yang, N. L. Ackerman, C. E. Freer, and D. M. Roy, "The Beta-Bernoulli process and algebraic effects," 2018.
[37] S. Staton, "An algebraic presentation of predicate logic," in Proc. FOSSACS 2013, 2013, pp. 401–417.
[38] ——, "Commutative semantics for probabilistic programming," in Proc. ESOP 2017, 2017.
[39] S. Staton and P. B. Levy, "Universal properties of impure programming languages," in Proc. POPL 2013, 2013.
[40] R. Walia, P. Narayanan, J. Carette, S. Tobin-Hochstadt, and C.-c. Shan, "From high-level inference algorithms to efficient code," in Proc. ICFP 2019, 2019.

APPENDIX
C. Markov categories
Here, we spell out some details for the notions of ≪ and precise supports.
Proof of Proposition III.10.
Gauss faithfully embeds into BorelStoch, that is, any Gaussian morphism m → n can be seen as a measurable map ℝ^m → G(ℝ^n), where G denotes the Giry monad. By [11, 13.3], we have f =_μ g iff f(x) = g(x) in G(ℝ^n) for μ-almost all x. If μ is a Gaussian distribution with support S, then μ is equivalent to the Lebesgue measure on S. Because f, g are continuous as maps into G(ℝ^n), f(x) = g(x) for almost all x ∈ S implies f = g everywhere on S.

For the second part, let S_μ, S_ν denote the supports of μ and ν, and let x ∈ S_μ \ S_ν. Then we can find two affine functions f, g which agree on S_ν but f(x) ≠ g(x). Then f =_ν g but not f =_μ g, hence μ ≪ ν does not hold.

Proposition VII.1. In FinStoch, we have x ≪ μ iff μ(x) > 0. In BorelStoch, we have x ≪ μ iff μ({x}) > 0.

This gives the correct intuition of support for FinStoch. For BorelStoch, this is an overly rigid notion of support which may contradict our intuition. For example, the standard normal distribution N(0,1) has support ℝ in Gauss, but ∅ in BorelStoch.

Proof. The arguments follow from [11, 13.2, 13.3]. For BorelStoch, let μ({x}) = 0 and consider the measurable functions f = 1_{{x}} and g = 0. Then f =_μ g, yet f(x) ≠ g(x), showing that x ≪ μ does not hold.
For Gauss, this follows from the characterization of ≪ in III.10. Let μ have support S and f(x) = Ax + N(b, Σ). Let T be the support of N(b, Σ). The support of ⟨id, f⟩μ is the image space {(x, Ax + c) : x ∈ S, c ∈ T}. Hence (x, y) ≪ ⟨id, f⟩μ iff x ≪ μ and y ≪ f(x).

For FinStoch, an outcome (x, y) has positive probability under ⟨id, f⟩μ iff x has positive probability under μ, and y has positive probability under f(−|x).

For BorelStoch, the measure ψ = ⟨id, f⟩μ is given by

ψ(A × B) = ∫_{x ∈ A} f(B|x) μ(dx).

Hence ψ({(x, y)}) = f({y}|x) μ({x}), which is positive exactly if μ({x}) > 0 and f({y}|x) > 0.

A note on the definition of 'precise supports': The expression ⟨id, f⟩μ is an analogue of the graph of f. We wonder about its single-valuedness. If x ⊗ y ≪ ⟨id, f⟩μ then we always have x ≪ μ and y ≪ fμ. We ask that y doesn't lie in the pushforward of any old sample of μ, but precisely of f(x). This is certainly a natural property to demand, but also very specifically tailored towards the application in IV.6. We expect 'precise supports' to arise as an instance of some more encompassing axiom.

D. Linear algebra
The following facts from linear algebra are useful to recall and are used throughout.
Proposition VII.2.
Let A ∈ ℝ^{m×n}. Then there are invertible matrices S, T such that

S A T⁻¹ = ( I_r  0 )
          ( 0    0 )

where r = rank(A). Furthermore, T can be taken to be orthogonal.

Proof. Take a singular-value decomposition (SVD) A = U D Vᵀ, let T = Vᵀ and create S from Uᵀ by rescaling the appropriate rows (equivalently, the columns of U).
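The construction in this proof is easy to make concrete. The following sketch (our own; NumPy assumed, and the example matrix is arbitrary) reads S and T off an SVD and checks the block form.

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [0.0, 1.0, 1.0]])       # rank 2: row 2 = 2 * row 1
U, d, Vt = np.linalg.svd(A)
r = int(np.sum(d > 1e-12))            # numerical rank

scale = np.ones(len(d))
scale[:r] = 1.0 / d[:r]               # divide away the nonzero singular values
S = np.diag(scale) @ U.T              # invertible (product of invertibles)
T = Vt                                # orthogonal, so T^{-1} = Vt.T

N = S @ A @ np.linalg.inv(T)          # = diag(scale) @ D
print(np.round(N, 10))                # [[1,0,0],[0,1,0],[0,0,0]]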
Proposition VII.3 (Row equivalence). Two matrices A, B ∈ ℝ^{m×n} are called row equivalent if the following equivalent conditions hold:
  – for all x ∈ ℝ^n, Ax = 0 ⇔ Bx = 0;
  – A and B have the same row space;
  – there is an invertible matrix S such that A = SB.
Unique representatives of row equivalence classes are matrices in reduced row echelon form.

Corollary VII.4. Let A, B ∈ ℝ^{m×n} and let Ax = c and Bx = d be consistent systems of linear equations that have the same solution space. Then there is an invertible matrix S such that B = SA and d = Sc.
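A quick illustration of the proposition and corollary (a sketch of ours; it uses SymPy, which is an assumption of this example only): row-equivalent matrices share their reduced row echelon form, and for a matrix of full row rank the change of basis S can be recovered with a pseudoinverse.

import sympy as sp

A = sp.Matrix([[1, 2, 0], [0, 1, 1]])
S = sp.Matrix([[2, 1], [1, 1]])           # invertible: det = 1
B = S * A                                 # row equivalent to A

assert A.rref()[0] == B.rref()[0]         # same canonical representative
assert sp.simplify(B * A.pinv() - S) == sp.zeros(2, 2)  # recover S = B A^+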
Proposition VII.5 (Column equivalence). For matrices A, B ∈ ℝ^{m×n}, the following are equivalent:
  – A and B have the same column space;
  – there is an invertible matrix T such that A = BT.

Proposition VII.6. For matrices A, B ∈ ℝ^{m×n}, the following are equivalent:
  – A Aᵀ = B Bᵀ;
  – there is an orthogonal matrix U such that A = BU.

Proof. This is a known fact, but we sketch a proof for lack of reference. In the construction of the SVD A = U D Vᵀ, we can choose U and D depending on A Aᵀ alone. It follows that the same matrices work for B, giving SVDs A = U D Vᵀ and B = U D Wᵀ. Then A = B(W Vᵀ) as claimed.
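Proposition VII.6 can be checked numerically in the invertible case (our sketch; NumPy assumed): if B arises from A by an orthogonal change on the right, then A Aᵀ = B Bᵀ, and the orthogonal witness is recovered as U = B⁻¹A.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))                   # invertible almost surely
W, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix
B = A @ W.T                                   # then B B^T = A A^T

assert np.allclose(A @ A.T, B @ B.T)
U = np.linalg.solve(B, A)                     # U = B^{-1} A  (here U = W)
assert np.allclose(U @ U.T, np.eye(3))        # orthogonal
assert np.allclose(A, B @ U)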
E. Normal forms

We present a proof of the uniqueness of normal forms for conditioning morphisms. Some preliminary facts:
Proposition VII.7.
Let X ∼ N(μ_X, Σ_X) and Y ∼ N(μ_Y, Σ_Y) be independent. Then X | (X = Y) has distribution N(μ̄, Σ̄) given by

μ̄ = μ_X + Σ_X (Σ_X + Σ_Y)⁺ (μ_Y − μ_X)
Σ̄ = Σ_X − Σ_X (Σ_X + Σ_Y)⁺ Σ_X

In programming terms, this is written

let x = N(μ_X, Σ_X) in (x =:= N(μ_Y, Σ_Y)); return(x)

and corresponds to the observe statement from the introduction

x = normal(μ_X, Σ_X); observe(normal(μ_Y, Σ_Y), x)
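Spelled out in code, the formulas read as follows (a sketch of ours; NumPy's pinv stands in for the pseudoinverse (−)⁺). The one-dimensional sanity check recovers the posterior N(0, 1/2) that appears in the examples above.

import numpy as np

def condition(mu_x, cov_x, mu_y, cov_y):
    """Posterior of X given the exact condition X =:= Y (Prop. VII.7)."""
    gain = cov_x @ np.linalg.pinv(cov_x + cov_y)
    mu_post = mu_x + gain @ (mu_y - mu_x)
    cov_post = cov_x - gain @ cov_x
    return mu_post, cov_post

# conditioning N(0,1) against N(0,1) gives N(0, 1/2)
mu, cov = condition(np.zeros(1), np.eye(1), np.zeros(1), np.eye(1))
assert np.allclose(mu, 0.0) and np.allclose(cov, 0.5)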
Corollary VII.8. No 1-dimensional observe statement leaves the prior N(0,1) unchanged.

Proof. Conditioning decreases variance: if we observe from N(μ, σ²), the variance of the posterior is 1 − (1 + σ²)⁻¹ < 1.

Proposition VII.9. Consider a morphism κ : n → 0 in Cond(Gauss) given by

κ(x) = (Ax =:= η)    (11)

where A ∈ ℝ^{r×n} is in reduced row echelon form with no zero rows, and η ∈ Gauss(0, r). Then the matrix A and the distribution η are uniquely determined.

Proof. We will probe κ by applying the condition (11) to different priors ψ ∈ Gauss(0, n), giving either a result ψ′ ∈ Gauss(0, n) or ⊥.

Let S ⊆ ℝ^r be the support of η and W = {x ∈ ℝ^n : Ax ∈ S}. We can recover W from observational behavior, because for deterministic priors ψ = x, we have ψ′ ≠ ⊥ iff x ∈ W. We have κ = ⊥ iff W = ∅. Assume now that W is nonempty.

Next, we can identify the nullspace K of A by considering subspaces along which no conditioning update happens. Call an affine subspace V ⊆ ℝ^n vacuous if for all ψ ≪ V we have ψ′ = ψ. Any such V must be contained in W. We claim that every maximal vacuous subspace is of the form K + x₀ where x₀ ∈ W.

Every space of the form K + x₀ is clearly vacuous: if ψ ≪ K + x₀ then the condition (11) becomes constant as Ax₀ =:= η. Because by assumption Ax₀ ∈ S, this condition is vacuous and can be discarded without effect.

Let V be any vacuous subspace and x₀ ∈ V. We show V ⊆ x₀ + K: Assume there is another x₁ ∈ V such that x₁ − x₀ ∉ K, and consider the 1-dimensional prior t ∼ N(0,1), x = x₀ + t(x₁ − x₀). Let d = A(x₁ − x₀) ≠ 0 and find an invertible matrix T such that Td = (1, 0, …, 0)ᵀ. The condition becomes

(t, 0, …, 0)ᵀ =:= Tη − TAx₀.

All but the first equation do not involve t. By commutativity, they can be computed independently, resulting in an updated right-hand side and a 1-dimensional condition t =:= η′ with η′ either a Gaussian or ⊥. By VII.8, such a condition cannot leave the prior N(0,1) unchanged, contradicting the vacuity of V.

Having reconstructed K, the matrix A in reduced row echelon form is determined uniquely by its nullspace. Group the coordinates x₁, …, x_n into exactly r pivot coordinates x_p and n − r free coordinates x_u. Setting x_u = 0 in (11) results in the simplified condition x_p =:= η. It remains to show that we can recover the observing distribution η from observational behavior. Intuitively, if we put a flat enough prior on x_p, the posterior will resemble η arbitrarily closely: Let η = N(b, Σ) and consider the prior x_p ∼ N(0, λI) for λ → ∞. The matrix (I + λ⁻¹Σ) is invertible for all large enough λ. By the formulas of VII.7, the mean of the posterior is

μ̄ = (I + λ⁻¹Σ)⁻¹ b → b    (λ → ∞).

For the covariance, we truncate the Neumann series (I + λ⁻¹Σ)⁻¹ = I − λ⁻¹Σ + o(λ⁻¹) to obtain

Σ̄ = λI − λ(I + λ⁻¹Σ)⁻¹ = Σ + λ·o(λ⁻¹) → Σ    (λ → ∞).
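The limit at the end of this proof is easy to watch numerically (our sketch; it assumes the condition() function from the previous example): conditioning the flat prior N(0, λI) against η = N(b, Σ) gives posteriors whose deviation from (b, Σ) shrinks like 1/λ, as the Neumann series argument predicts.

import numpy as np
# assumes condition() as defined in the sketch after Prop. VII.7

b = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
for lam in [1e2, 1e4, 1e6]:
    mu, cov = condition(np.zeros(2), lam * np.eye(2), b, Sigma)
    print(lam, np.max(np.abs(mu - b)), np.max(np.abs(cov - Sigma)))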
F. Implementation

It is straightforward to implement the operational semantics of §II in a language like Python. We have done this, and we illustrate with further simple programs and results, in addition to the examples in Section I-A.

Listing 1: Gaussian regression (Fig. 4)

xs = [1.0, 2.0, 2.25, 5.0, 10.0]
ys = [...]                       # five negative observations
a = Gauss.N(0, 10)
b = Gauss.N(0, 10)
f = lambda x: a * x + b
for (x, y) in zip(xs, ys):
    Gauss.condition(f(x), y + Gauss.N(0, 0.1))

Fig. 4: Gaussian regularized regression (ridge regression), plotting 100 samples, the mean (red) and ±σ coordinatewise (blue)

Listing 2: 1-dimensional Kálmán filter (Fig. 5)

xs = [1.0, 3.4, 2.7, 3.2, 5.8,
      14.0, 18.0, 11.7, 19.5, 19.2]
x = [0] * len(xs)
v = [0] * len(xs)
x[0] = xs[0] + Gauss.N(0, 1)
v[0] = 1.0 + Gauss.N(0, 10)
for i in range(1, len(xs)):
    x[i] = x[i-1] + v[i-1]
    v[i] = v[i-1] + Gauss.N(0, 0.75)
    Gauss.condition(x[i] + Gauss.N(0, 1), xs[i])
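For completeness, here is a minimal sketch of the kind of Gauss module the listings assume. This is our own reconstruction, not the authors' code: the global state holds the mean and covariance of all latent generators, program variables are affine forms over them, and Gauss.condition performs exact conditioning of an affine expression by the one-dimensional case of the update in Prop. VII.7.

import numpy as np

class Gauss:
    """Global state: latent generators z ~ N(mean, cov);
    program variables are affine forms  coeff . z + const."""
    mean = np.zeros(0)
    cov = np.zeros((0, 0))

    class RV:
        def __init__(self, coeff, const=0.0):
            self.coeff = np.asarray(coeff, dtype=float)
            self.const = float(const)
        def _pad(self):
            # extend the coefficient vector to the current number of latents
            a = np.zeros(len(Gauss.mean))
            a[:len(self.coeff)] = self.coeff
            return a
        def __add__(self, other):
            if isinstance(other, Gauss.RV):
                return Gauss.RV(self._pad() + other._pad(),
                                self.const + other.const)
            return Gauss.RV(self.coeff, self.const + other)
        __radd__ = __add__
        def __sub__(self, other):
            return self + (-1.0) * other
        def __rmul__(self, s):
            return Gauss.RV(s * self.coeff, s * self.const)
        __mul__ = __rmul__
        def E(self):
            # posterior mean of this expression under the current state
            return float(self._pad() @ Gauss.mean + self.const)

    @staticmethod
    def N(mu, var):
        # allocate a fresh latent generator, independent of the others
        n = len(Gauss.mean)
        Gauss.mean = np.append(Gauss.mean, float(mu))
        cov = np.zeros((n + 1, n + 1))
        cov[:n, :n] = Gauss.cov
        cov[n, n] = float(var)
        Gauss.cov = cov
        return Gauss.RV(np.eye(n + 1)[n])

    @staticmethod
    def condition(lhs, rhs):
        # exact conditioning  lhs =:= rhs,  i.e.  e := lhs - rhs =:= 0
        e = lhs - rhs
        a, d = e._pad(), e.E()
        s = float(a @ Gauss.cov @ a)      # prior variance of e
        if s < 1e-12:                     # deterministic condition
            assert abs(d) < 1e-9, "failure: conditioning on a null event"
            return
        k = Gauss.cov @ a / s             # Prop. VII.7, 1-dimensional case
        Gauss.mean = Gauss.mean - d * k
        Gauss.cov = Gauss.cov - np.outer(k, a @ Gauss.cov)

Running Listing 2 after this definition, x[i].E() returns the filtered position estimates; all conditioning happens in-place on the global (mean, cov) state, mirroring the operational semantics of §II.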