Supermartingales, Ranking Functions and Probabilistic Lambda Calculus
Andrew Kenyon-Roberts Luke Ong
Abstract—We introduce a method for proving almost sure termination in the context of lambda calculus with continuous random sampling and explicit recursion, based on ranking supermartingales. This result is extended in three ways.
Antitone ranking functions have weaker restrictions on how fast they must decrease, and are applicable to a wider range of programs.
Sparse ranking functions take values only at a subset of the program's reachable states, so they are simpler to define and more flexible.
Ranking functions with respect to alternative reduction strategies give yet more flexibility, and significantly increase the applicability of the ranking supermartingale approach to proving almost sure termination, thanks to a novel (restricted) confluence result which is of independent interest. The notion of antitone ranking function was inspired by similar work by McIver, Morgan, Kaminski and Katoen in the setting of a first-order imperative language, but adapted to a higher-order functional language. The sparse ranking function and confluent semantics extensions are unique to the higher-order setting. Our methods can be used to prove almost sure termination of programs that are beyond the reach of methods in the literature, including higher-order and non-affine recursion.
I. INTRODUCTION
Probabilistic (or randomised) programs have long been recognised as essential to the efficient solution of many algorithmic problems [1]. More recently, in probabilistic programming [2–4], probabilistic programs are used to express generative models whose posterior probability can be computed by general purpose inference engines. Sampling from continuous distributions is an essential construct of probabilistic programming languages, such as Church [5], Stan [6], Anglican [7], Gen [8], Pyro [9], Edward [10] and Turing [11]. Another key feature is higher order; in fact some of the most influential probabilistic programming languages are functional.

In this paper we study a central property of probabilistic programs: termination. When a probabilistic program implements a solution to an algorithmic problem, we naturally require the computation to terminate with probability 1, in which case the program is called almost surely terminating (AST). Indeed, it is standard for designers and implementors of probabilistic programming systems to regard non-AST programs as defining invalid models, and hence inadmissible [5, 12]. (Yet none of these systems provides any support for the development or verification of AST programs.) Moreover, various theorems about probabilistic programs rely on the assumption that the program terminates almost surely (see e.g. [5]). A recent result [13] proves that AST programs have density functions that are differentiable almost everywhere. This is significant for Bayesian machine learning, because almost everywhere differentiability is a precondition for the correctness of some of the most scalable inference algorithms, such as Hamiltonian Monte Carlo [14, 15] and reparameterised gradient variational inference [16].

The AST problem is not just important, but also difficult: deciding AST of first-order imperative programs with discrete probabilities is Π⁰₂-complete [17].
In recent unpublished work, the AST problem for higher-order programs with continuous distributions (and suitable primitive functions) is shown to have the same complexity. Though the problem of proving AST of imperative probabilistic programs is much studied [17–23], to our knowledge the problem of proving AST of higher-order programs with continuous distributions is new.

Contributions:
A possible approach to prove AST is to find some variant on the program state, called a ranking function, that decreases on average sufficiently quickly that it must at some point reach 0, at which point the program terminates. In other words, the program's behaviour is used to define an associated (ranking) supermartingale. Proof rules based on relating the program state to supermartingales already exist for first-order imperative programs [21, 24, 25]. This paper's contribution is to extend this method to a higher-order setting.

The language PPCF used in this paper is simply typed, with random continuous sampling, and an explicit recursion primitive, Y. We define a ranking function on a term M to be a non-negative measurable function on the reachable terms from M whose expected value decreases or is unchanged as the term is reduced. Because the type system already constrains the terms enough to force termination in the absence of the recursion construct [26, 27], it is only the Y-reduction steps that must be counted, making defining ranking functions somewhat easier. Using supermartingales and Doob's (iterated) Optional Sampling Theorem, we show the soundness of rankability (Thm. III.10): if a term has a ranking function, it is AST.

To define a ranking function, one would typically try to organise the set of reachable terms (of the program being analysed) into a manageable number of syntactic cases. Unfortunately (for any fixed reduction strategy) there are programs whose reachable terms span such a large number of cases that it would be extremely difficult to analyse by hand, if at all possible. How can the construction of ranking functions be made possible or easier? Our first answer is the notion of sparse ranking function. Most of the individual execution steps of a typical program are trivial and easy to mentally skip over.
Sparse ranking functions can be defined only for those points in the execution of a program which are semantically important, while all of the other intermediate steps can be ignored. Yet, they are no less efficacious for proving AST (Thm. IV.5): every sparse ranking function is a restriction of a ranking function. This also makes the ranking function method of proving AST more compatible with syntactic sugar, because the intermediate reduction steps implicit in the simplified notation can be ignored.

A ranking function (or sparse ranking function) provides a bound on the expected number of Y-reduction steps before the program terminates, therefore ranking functions cannot be constructed for terms whose expected number of Y-reduction steps is infinite, such as the simplest implementation of the 1D unbiased random walk. This restriction can be removed by generalising ranking functions to antitone ranking functions, which rather than having to decrease by a constant amount for each Y-reduction step, may decrease by a variable amount, depending on the value of the ranking function. Thanks to this feature, the antitone ranking function method is capable of handling programs which terminate arbitrarily slowly. We show that this method is also sound for proving AST (Thm. V.6); moreover it also enjoys a sparse function theorem (Thm. V.8).

Basic (non-random) lambda calculus has the very useful Church-Rosser property, which implies (among other things) that even if execution of a program starts in a different order, it will still reach the same normal form eventually (assuming it does reach a normal form). Probabilistic lambda calculus does not have this property, because random choices may be duplicated, and evaluating the same subterm multiple times can yield different results. However, with a restricted set of reduction strategies, Church-Rosserness may be regained.
We introduce a novel addressing scheme for the possible random choices in a program's execution, which ensures that the same random choices are taken at corresponding positions in alternative reduction sequences, so that the same eventual result can be reached. This is then used to prove yet another extension to the ranking function theorem: (Thm. VI.15) ranking functions may be defined with respect to alternative reduction strategies (which in some cases may lead to a considerably simpler execution and ranking function), and (Cor. VI.12) rankability in this sense still implies almost sure termination. The confluent trace semantics is of wider interest, and has other possible applications as well, for example in Bayesian inference algorithms.

By combining these methods we can prove AST of a variety of PPCF programs that are beyond the reach of methods in the literature, including non-affine recursion (Ex. V.12) and recursions that define higher-order functions (Ex. V.14). Non-affine recursive programs are recursive programs that can, during the evaluation of the recursive body, make multiple recursive calls (of a first-order function) from distinct call sites.
Outline:
We present the syntax and trace semantics of PPCF in Sec. II. In Sec. III, we show that ranking functions on terms induce supermartingales, which form the basis of a sound method for proving AST. We introduce sparse ranking functions in Sec. IV and antitone ranking functions in Sec. V, and illustrate how they can be used to prove AST via examples. In Sec. VI, we present a confluent trace semantics and demonstrate its usefulness. We discuss further applications of the confluent semantics in Sec. VII; and conclude with comments on related work and further directions in Sec. VIII.
Additional materials:
Further details of some of the examples and all missing proofs can be found in the appendices.

II. PROBABILISTIC PCF
A. Syntax of PPCF
The language PPCF is a call-by-value (CBV) version of PCF with sampling of real numbers from the closed interval [0, 1] [28–30]. Types and terms are defined as follows, where r is a real number, x is a variable, f : Rⁿ → R is any measurable function, and Γ is an environment:

types A, B ::= R | A → B
values V ::= λx.M | r
terms M, N ::= V | x | M M | f(M₁, …, Mₙ) | Y M | if(M < 0, N₁, N₂) | sample

The typing rules are standard (see Fig. 1). Terms are identified up to α-equivalence, as usual. The set of all terms is denoted Λ, and the set of closed terms is denoted Λ⁰.

To define the CBV reduction relation, let evaluation contexts be of the form:

E ::= [·] | E M | V E | f(r₁, …, r_{k−1}, E, M_{k+1}, …, Mₙ) | Y E | if(E < 0, M₁, M₂)

then a term reduces if it is formed by substituting a redex in a context, i.e.

E[(λx.M) V] → E[M[V/x]]
E[f(r₁, …, rₙ)] → E[r]   where r = f(r₁, …, rₙ)
E[Y λx.M] → E[λz.M[(Y λx.M)/x] z]   where z is not free in M
E[if(r < 0, M₁, M₂)] → E[M₁]   where r < 0
E[if(r < 0, M₁, M₂)] → E[M₂]   where r ≥ 0
E[sample] → E[r]   where r ∈ [0, 1].

We write →* for the reflexive, transitive closure of →. Every closed term either is a value or reduces to another term.

B. Trace semantics
This version of the reduction relation allows sample to reduce to any number in [0, 1]. To more precisely specify the probabilities, an additional argument is needed to determine the outcome of random samples. Let I := [0, 1] ⊂ R, and define S := I^ℕ, with the Borel σ-algebra Σ_S. Equivalently, a basis of measurable sets is ∏_{i=0}^∞ X_i where the X_i are all Borel and all but finitely many are I, and the probability measure μ_S is given by μ_S(∏_{i=0}^∞ X_i) := ∏_{i=0}^∞ Leb(X_i), writing Leb for the Lebesgue measure. The maps π_h : S → I (projecting to the first element) and π_t : S → S (popping the first element) are then measurable. Following [31], we call the probability space (S, Σ_S, μ_S) the entropy space, and elements of S traces.

Define a skeleton to be a term but, instead of having real constants r, it has a placeholder X, so that each term M has a skeleton Sk(M), and each skeleton S can be converted to a term S[r] given a vector r of n real numbers to substitute in, where n is the number of occurrences of X in S. Following [29, 32, 33], the σ-algebra on Λ is defined by identifying Λ with ⋃_{m≥0} (Sk_m × R^m), where Sk_m is the set of skeletons containing m occurrences of X. Thus identified, we give Λ the countable disjoint union topology of the product topology of the discrete topology on Sk_m and the standard topology on R^m, and take the corresponding Borel σ-algebra. Note that the connected components of Λ have the form {S} × R^m, with S ranging over Sk_m, and m over ℕ.

The one-step reduction of the trace-based (or sampling-style) operational semantics [31, 32, 34] is given by the function red : Λ⁰ × S → Λ⁰ × S where

red(M, s) :=
  (E[N], s)             if M = E[R], R → N and R ≠ sample
  (E[π_h(s)], π_t(s))   if M = E[sample]
  (M, s)                if M is a value

The result after n steps is then simply redⁿ(M, s) = red(⋯ red(M, s) ⋯) (n applications of red), and the limit red^∞ can then be defined as a partial function as lim_{n→∞} redⁿ(M, s) whenever that sequence becomes constant by reaching a value. A term M terminates for a sample sequence s if the limit red^∞(M, s) is defined.

The reduction function is measurable, and the set of values is measurable, therefore the set of s such that M terminates at s within n steps is measurable for any n, therefore {s | M terminates at s} is measurable [32, 35]. We say that a term M is almost surely terminating (AST) just if μ_S({s | M terminates at s}) = 1.

For example, the term (Y λf n. if(sample − 0.5 < 0, n, f(n + 1))) 0, which generates a geometric distribution, terminates on the set S \ [0.5, 1]^ℕ, which has measure 1, therefore it terminates almost surely, whereas if(sample − 0.5 < 0, 0, (Y λx.x) 0), which terminates on the set π_h^{−1}[[0, 0.5)], has probability 0.5 of failing to terminate.

Remark II.1. Our definition of AST is equivalent to that given in [30] (although the program semantics is stated in a slightly different way), except for the presence of a score construct for soft conditioning. The score construct is irrelevant to termination except that it fails if its argument is negative, thus allowing computations to fail after finitely many steps. If such a construct were added, it would merely require an additional side condition to our main theorems: that no failing term is reachable (or, in the case of the results based on the confluent semantics, that no term where the reduction strategy is undefined is reachable via the reduction strategy unless it is a value).

III. SUPERMARTINGALES
One approach to proving that a term terminates almost surely is to find some variant that is bounded below and, on average, decreases sufficiently quickly that it must eventually reach 0, similarly to the approach taken by [21] and others for imperative programs. These variants are defined as functions from reachable terms (i.e. possible states of the program's execution) to real numbers. Specifically, let the set of reachable terms from a given closed starting term M be Rch(M) := {N ∈ Λ⁰ | M →* N}, with the σ-algebra induced as a subset of Λ.

Definition III.1. A ranking function on M ∈ Λ⁰ is a measurable function f : Rch(M) → R such that f(N) ≥ 0 for all N, and
(i) f(E[Y λx.N]) ≥ 1 + f(E[λz.N[(Y λx.N)/x] z]) where z is not free in N,
(ii) f(E[sample]) ≥ ∫_I f(E[x]) Leb(dx),
(iii) f(E[R]) ≥ f(E[R′]) for any other redex R with R → R′.

We say that the ranking function f is strict if there exists ε > 0 such that for all E and R → R′, f(E[R′]) ≤ f(E[R]) − ε. Any closed term for which a ranking (respectively, strict ranking) function exists is called rankable (respectively, strictly rankable). For example, (Y λx.x) 0 is not rankable. (The fact that the ranking function can be 0 in some cases before it reaches a value is necessary to get Thm. IV.2 to work neatly.)

It will be demonstrated later that for any rankable term M, if (Mₙ)_{n≥0} is the reduction sequence starting from M (considered as a stochastic process), then (f(Mₙ))_{n≥0} is a supermartingale, and M terminates almost surely; but first, some preliminaries about supermartingales (see e.g. [36] for details). Fix a probability space (Ω, F, P) and a filtration (Fₙ)_{n≥0} (i.e. each Fₙ ⊆ F is a σ-algebra, and Fₙ ⊆ F_{n+1} for all n). Let T be a r.v. that takes values in ℕ ∪ {∞}. We call T a stopping time adapted to (Fₙ)_{n≥0} just if {T = n} ∈ Fₙ, for all n.

Definition III.2. (i) A sequence of r.v.s (Yₙ)_{n≥0} adapted to a filtration (Fₙ)_{n≥0} is a supermartingale if for all n ≥ 0, Yₙ is integrable (i.e. E[|Yₙ|] < ∞), and E[Y_{n+1} | Fₙ] ≤ Yₙ a.s. (i.e. for all A ∈ Fₙ, ∫_A P(dω) Y_{n+1}(ω) ≤ ∫_A P(dω) Yₙ(ω)).
(ii) Let ε > 0. Given a stopping time T and a supermartingale (Yₙ)_{n≥0}, both adapted to filtration (Fₙ)_{n≥0}, we say that (Yₙ)_{n≥0} is an ε-ranking supermartingale w.r.t. T if for all n, Yₙ ≥ 0 and E[Y_{n+1} | Fₙ] ≤ Yₙ − ε · 1_{T>n}. A ranking supermartingale w.r.t. T is an ε-ranking supermartingale w.r.t. T for some ε > 0.

Remark III.3. Our notion of ranking supermartingale is a slight generalisation of the original definition [21, 24], which does not involve an arbitrary stopping time. Intuitively Yₙ gives the rank of the program after n steps of computation, and T is the time at which it reaches a value (which may be infinite if it fails to terminate). In an ε-ranking supermartingale, each computation step causes a strict decrease in rank of at least ε, provided the term in question is not a value.

Lemma III.4. Let (Yₙ)_{n≥0} be an ε-ranking supermartingale w.r.t. the stopping time T. Then T < ∞ a.s., and E[T] ≤ E[Y₀]/ε.

Remark III.5. Lem. III.4 is a slight extension of [21, Lemma 5.5], which asserts the same result but for the specific stopping time T : ω ↦ min{n | Yₙ(ω) = 0}.

Let T and T′ be stopping times adapted to (Fₙ)_{n≥0}. Recall the σ-algebra (consisting of measurable subsets "prior to T") F_T := {A ∈ F | ∀i ≥ 0. A ∩ {T ≤ i} ∈ F_i}; and if T ≤ T′, then F_T ⊆ F_{T′}. The following is an iterated version of Doob's well-known Optional Sampling Theorem (see e.g. [21, 37]).

Theorem III.6 (Optional Sampling). Let (Xₙ)_{n≥0} be a supermartingale, and (Tₙ)_{n≥0} a sequence of increasing stopping times, all adapted to filtration (Fₙ)_{n≥0}; then (X_{Tₙ})_{n≥0} is a supermartingale adapted to (F_{Tₙ})_{n≥0} if one of the following conditions holds:
(i) each Tₙ is bounded, i.e. Tₙ < cₙ where cₙ is a constant;
(ii) (Xₙ)_{n≥0} is uniformly integrable.

A. Ranking functions and supermartingales

Henceforth, fix the probability space (S, Σ_S, μ_S), and a closed PPCF term M. For n ≥ 0, define the random variables

Mₙ(s) := π₁(redⁿ(M, s))
T_M(s) := min{n | Mₙ(s) is a value}

and the filtration (Fₙ)_{n≥0} where Fₙ := σ(M₀, ⋯, Mₙ). Thus T_M is the runtime of M (and M is AST iff μ_S(T_M < ∞) = 1). Our first result is the following theorem.

Theorem III.7 (Deriving supermartingales). If a closed PPCF term M is rankable (respectively, strictly rankable) by f then (f(Mₙ))_{n≥0} is a supermartingale (respectively, ranking supermartingale w.r.t. stopping time T_M) adapted to (Fₙ)_{n≥0}. (For A ∈ F, we write 1_A for the indicator function of A, i.e., the random variable defined by 1_A(ω) := 1 if ω ∈ A, and 0 otherwise.)

The ranking supermartingale condition is satisfied essentially by a case analysis on the type of redex (Lem. A.1).
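The bound in Lem. III.4 can be illustrated with a small Monte Carlo sanity check (ours, not part of the formal development): a process started at 10 that loses 2 with probability 1/2 at each step, until it hits 0, is a 1-ranking supermartingale w.r.t. its hitting time T, so the lemma gives E[T] ≤ E[Y₀]/1 = 10.

```python
import random

def run_once(rng):
    # Y_0 = 10; each step the value decreases by 2 with probability 1/2,
    # else stays put, until it hits 0.  The expected one-step decrease
    # before T is 1, so (Y_n) is a 1-ranking supermartingale w.r.t. the
    # hitting time T of 0.
    y, t = 10, 0
    while y > 0:
        t += 1
        if rng.random() < 0.5:
            y -= 2
    return t

rng = random.Random(0)
trials = [run_once(rng) for _ in range(20000)]
mean_t = sum(trials) / len(trials)
# Lemma III.4 predicts E[T] <= E[Y_0] / eps = 10 / 1 = 10
# (here the bound is in fact attained).
print(round(mean_t, 2))
```

For this particular process the supermartingale inequality is tight, so the empirical mean of T sits close to the bound 10.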
B. Soundness of rankability
In the rest of this section we show that if a PPCF term is rankable, then it is AST. Let f be a ranking function on M ∈ Λ⁰. Define random variables on the probability space (S, Σ_S, μ_S):

T₋₁(s) := −1
T_{n+1}(s) := min{k | k > Tₙ(s), Mₖ(s) a value or of form E[Y λx.N]}
Yₙ(s) := f(M_{Tₙ(s)}(s))
T^Y_M(s) := min{n | M_{Tₙ(s)}(s) is a value}      (1)

We first state a useful property about the r.v.s T₀, T₁, T₂, ….

Lemma III.8. (Tₙ)_{n≥0} is an increasing sequence of stopping times adapted to (Fₙ)_{n≥0}, and each Tᵢ is bounded.

The random variable T^Y_M, which we call the Y-runtime of M, can equivalently be defined as the number of Y-reduction steps in the reduction sequence of M. Note that, as the type system ensures that the reduction relation excluding Y-reduction is strongly normalising, only finitely many reductions can occur in a row without one of them being a Y-reduction, therefore T^Y_M < ∞ a.s. iff M is AST. Moreover:

Lemma III.9. T^Y_M is a stopping time adapted to (F_{Tₙ})_{n≥0}.

Theorem III.10 (Soundness of rankability). (i) If a closed PPCF term M is rankable, then M is AST (equivalently, T^Y_M < ∞ almost surely), and M is Y-positively almost surely terminating (Y-PAST), i.e. E[T^Y_M] < ∞.
(ii) If a closed PPCF term M is strictly rankable, then E[T_M] < ∞, i.e. M is positively a.s. terminating (PAST).

Proof. Let f be a ranking function on M. For (i), since (Tₙ)_{n≥0} is an increasing sequence of stopping times, each adapted to (Fₙ)_{n≥0} and bounded (Lem. III.8), and (f(Mₙ))_{n≥0} is a supermartingale also adapted to (Fₙ)_{n≥0} (Thm. III.7), it follows from the Optional Sampling Thm. III.6 that (Yₙ)_{n≥0} is a supermartingale adapted to (F_{Tₙ})_{n≥0}. Notice the stopping time T^Y_M is also adapted to (F_{Tₙ})_{n≥0} (Lem. III.9); and we have that (Yₙ)_{n≥0} is a 1-ranking supermartingale. Therefore, by Lem. III.4, T^Y_M < ∞ a.s. and E[T^Y_M] < ∞. Statement (ii) follows at once from Thm. III.7 and Lem. III.4.

Thus the method of (strict) ranking functions is sound for proving (positive) a.s. termination of PPCF programs. It is in fact also complete in a sense: if E[T^Y_N] < ∞ for all N ∈ Rch(M) then M is rankable (Thm. IV.2).

IV. CONSTRUCTING RANKING FUNCTIONS
Although rankability implies almost sure termination, the converse does not hold in general. For example, if(−sample < 0, 0, (Y λx.x) 0) terminates in 3 steps with probability 1, but is not rankable because (Y λx.x) 0 is reachable, although that has probability 0. Not only is this counterexample AST, it is PAST.

A ranking function can be constructed under the stronger assumption that, for every N reachable from M, the expected number of Y-reduction steps from N to a value is finite. In particular, the expected number of Y-reduction steps from each reachable term is a ranking function. However, a finite number of expected Y-reduction steps does not necessarily imply a finite number of expected total reduction steps.

Example IV.1. The term M = Ξ (λx.x + 1) where

Ξ := Y λf n. if(sample − 0.5 < 0, n 1, f(λx.n (n x)))

terminates with only 2 Y-reductions on average, i.e., E[T^Y_M] = 2, but applies the increment function 2ⁿ times with probability 2^{−n−1} for n ≥ 0, so the expected total number of reduction steps diverges, i.e., E[T_M] = ∞.

Theorem IV.2. Given a closed term M, the function f : Rch(M) → R given by

f(N) := E[number of Y-reduction steps from N to a value],

if it exists, is the least of all possible ranking functions of M.

A. Sparse ranking functions

Even in the case of reasonably simple terms, explicitly constructing a ranking function would be a lot of work, and Thm. IV.2 makes even stronger assumptions than almost sure termination, so it is not useful for proving it.
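The gap between the Y-runtime and the full runtime in Ex. IV.1 can be observed empirically. The sketch below is our own encoding of the example's branching structure (a fair coin for the sample − 0.5 < 0 test; the accumulated function applies the increment 2^k times after k failed coin flips):

```python
import random

def simulate(rng):
    # Each Y-unfolding flips a fair coin: with probability 1/2 stop and
    # apply the accumulated 2^k-fold increment; otherwise double the
    # accumulated function and recurse.
    y_steps, k = 0, 0
    while True:
        y_steps += 1
        if rng.random() < 0.5:
            return y_steps, 2 ** k   # (Y-reductions, increments performed)
        k += 1

rng = random.Random(1)
runs = [simulate(rng) for _ in range(50000)]
mean_y = sum(y for y, _ in runs) / len(runs)
mean_work = sum(w for _, w in runs) / len(runs)
print(round(mean_y, 2))   # concentrates near E[T^Y_M] = 2
print(mean_work)          # does not stabilise: the true mean is infinite
```

The empirical mean of the Y-step count settles at 2, while the empirical mean of the total work fluctuates wildly from seed to seed, as expected of a quantity with infinite expectation.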
Example IV.3 (Geometric distribution). Let

Θ := λf n. if(sample − 0.5 < 0, n, f(n + 1)).

The term (Y Θ) 0 generates a geometric distribution. Despite its simplicity, its Rch contains all the terms in Fig. 2, for each i ∈ ℕ, r ∈ I. The meanings of the substitutions (to be applied to the contractum) and conditions that label some of the reductions in the diagram should be self-explanatory. Even in this simple case, defining a ranking function explicitly is awkward because of the number of cases, although in most cases, because the value need only be greater than or equal to that of the next term in sequence, it suffices to take the ranking function as having the same value as the next term, so that overall it takes only 3 distinct values. (We will explain why later.)

The definition of rankability is also inconvenient for syntactic sugar. It could be useful, for example, to define M ⊕_p N := if(sample − p < 0, M, N), where M ⊕_p N reduces to M or N, depending on the first value of s ∈ S, with probability p resp. (1 − p). Technically though, it reduces first to if(r − p < 0, M, N) for all r ∈ I, so those terms all need values of the ranking function too. In both of these cases, there are only some values of the ranking function that are semantically important.

Definition IV.4. Define a sparse ranking function on a closed term M to be a partial function f : Rch(M) ⇀ R such that (i) f(N) ≥ 0 for all N ∈ dom(f); (ii) M ∈ dom(f); (iii) for any N ∈ dom(f), evaluation of N will eventually reach some O which is either a value or in dom(f), and f(N) ≥ E[f(O) + the number of Y-reduction steps from N to O] (where f(O) is taken to be 0 if O is a value outside of dom(f)).

A sparse ranking function that is total is just a ranking function. Providing a sparse ranking function is essentially part way between providing a ranking function and directly proving almost sure termination.

Theorem IV.5 (Sparse function). Every sparse ranking function is a restriction of a ranking function.
As a corollary, any term which admits a sparse ranking function terminates almost surely.
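As a concrete illustration of Def. IV.4, the sparse-ranking conditions for the geometric term of Ex. IV.3 reduce to two finite expectation checks. The three-state abstraction and the candidate values 2, 1, 0 below are one possible assignment (our choice, not prescribed by the paper's definitions):

```python
# States of (Y Theta) 0 tracked by the candidate sparse ranking function:
#   A = (Y Theta) n                 candidate value f(A) = 2
#   B = n (+)_(1/2) (Y Theta)(n+1)  candidate value f(B) = 1
#   V = the value n                 candidate value f(V) = 0
f = {"A": 2.0, "B": 1.0, "V": 0.0}

# From A, one Y-reduction step reaches B with probability 1, so the
# sparse condition reads f(A) >= 1 + f(B).
lhs_A, rhs_A = f["A"], 1.0 + f["B"]
# From B, zero Y-steps reach V or A, each with probability 1/2, so the
# condition reads f(B) >= 0.5*f(V) + 0.5*f(A).
lhs_B, rhs_B = f["B"], 0.5 * f["V"] + 0.5 * f["A"]

ok = lhs_A >= rhs_A and lhs_B >= rhs_B
print(ok)
```

Both inequalities hold with equality, matching the observation that the least ranking function of Thm. IV.2 (the expected number of remaining Y-steps) takes exactly these values.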
B. Examples
Let M ⊕_p N = if(sample − p < 0, M, N), for p ∈ (0, 1); then there are the pseudo-reduction relations

E[M ⊕_p N] → E[M]
E[M ⊕_p N] → E[N]
red(E[M ⊕_p N], s) = (E[M], π_t(s)) if π_h(s) < p, and (E[N], π_t(s)) if π_h(s) ≥ p.

A sparse ranking function could be defined with respect to this shortcut reduction simply by replacing → and red in the definition of a sparse ranking function by a version that goes straight from N ⊕_p O to N or O. Such a pseudo-sparse ranking function would then be a partial function from a subset of Rch(M), so it could also be considered as a partial function from all of Rch(M), and it would in fact also be an actual sparse ranking function. It is therefore possible to prove rankability directly using the shortcut reductions.

A similar procedure would work for other forms of syntactic sugar. If a closed term N eventually reduces to one of a set of other terms {N_i | i ∈ I} with certain probabilities, a sparse ranking function defined w.r.t. a reduction sequence that skips straight from N to N_i is also a valid sparse ranking function for the original reduction function, and therefore its existence implies almost sure termination. There is a caveat, however, that Y-reduction steps skipped over in the shortcut still need to be counted for the expected number of Y-reduction steps.

With this abbreviation, the geometric distribution example from earlier can be written as (Y λf n. n ⊕_{1/2} f(n + 1)) 0. It is then easy to see that the following is a sparse ranking function:

(Y λf n. n ⊕_{1/2} f(n + 1)) N ↦ 2
i ⊕_{1/2} (Y λf n. n ⊕_{1/2} f(n + 1)) (i + 1) ↦ 1
i ↦ 0.

In fact, even the partial function (Y Θ) N ↦ 2 alone is a sparse ranking function for this term.

V. ANTITONE RANKING FUNCTIONS
Example V.1 (Random walk). (i) 1D biased (towards 0) random walk:
M₁ = (Y λf n. if(n = 0, 0, f(n − 1) ⊕_{2/3} f(n + 1))) 10
(ii) 1D unbiased random walk:
M₂ = (Y λf n. if(n = 0, 0, f(n − 1) ⊕_{1/2} f(n + 1))) 10
(iii) 1D biased (away from 0) random walk:
M₃ = (Y λf n. if(n = 0, 0, f(n − 1) ⊕_{1/2} f(n + 2))) 10

The term M₁ is rankable, terminating in 31 Y-steps on average, and M₃ only has a ((√5 − 1)/2)^10 chance of terminating, but in between, M₂ is AST but is not Y-PAST, therefore it is not rankable and Thm. III.10 is insufficient to prove its termination. Thus we seek a generalised notion of ranking function so that M₂ becomes rankable, and then prove it sound, i.e. rankable in this generalised sense implies AST.

Definition V.2.
The definition of antitone ranking function f for M ∈ Λ⁰ is the same as that of ranking function except that in the case of a Y-redex, we require the existence of an antitone (meaning: r < r′ implies ε(r) ≥ ε(r′)) function ε : R_{≥0} → R_{>0} such that the ranking function f : Rch(M) → R_{≥0} satisfies

f(E[R′]) ≤ f(E[R]) − ε(f(E[R]))

where R → R′ is the Y-redex rule. Any closed term for which an antitone ranking function exists is called antitone rankable. Note that antitone ranking functions are actually a generalisation of ranking functions, even though the way we reference them may suggest that they are a type of ranking function.

Definition V.3. Given a probability space (Ω, F, P), and a supermartingale (Yₙ)_{n≥0} and a stopping time T adapted to filtration (Fₙ)_{n≥0}, we say that (Yₙ)_{n≥0} is an antitone strict supermartingale w.r.t. T if for all n ≥ 0, we have Yₙ ≥ 0, and there exists an antitone function ε : R_{≥0} → R_{>0} satisfying E[Y_{n+1} | Fₙ] ≤ Yₙ − ε(Yₙ) · 1_{T>n}.

Theorem V.4. Let (Yₙ)_{n≥0} be an antitone strict supermartingale w.r.t. stopping time T. Then T < ∞ a.s.

It is essential in this theorem that ε be defined for all of R_{≥0}, not just the values that (Yₙ)_{n≥0} actually takes (or at least, if (Yₙ)_{n≥0} is uniformly bounded, ε must be positive at the supremum as well as at the actual values).

Example V.5. Take the probability space (Ω, F, P) where Ω is the closed interval of reals [−1, 1], F the Borel σ-algebra, and P the corresponding Lebesgue probability measure. Let ω ∈ [−1, 1]. Define random variables T and (Y_k)_{k∈ℕ}:

T(ω) := min{n ∈ ℕ | ω > 2^{−n}} if ω ∈ (0, 1], and ∞ otherwise
Y_k(ω) := 4 − 2^{−k} if ω ∈ [−1, 2^{−k}] (equivalently k < T(ω)), and 0 otherwise.

Plainly, T is a stopping time, and (Yₙ)_{n≥0} is a supermartingale, adapted to the filtration (Fₙ)_{n≥0} where Fₙ = F for all n. In this case, (Yₙ)_{n≥0} either tends to 4 as n → ∞, or drops to 0 at some point, with an exponentially decreasing probability. With respect to the antitone function ε(x) = 4 − x and stopping time T, we have that (Yₙ)_{n≥0} is an antitone strict supermartingale, except that ε(4) = 0, and T = ∞ with probability 1/2, even though 4 is larger than any value (Yₙ)_{n≥0} can actually take, and ε(Yₙ) never actually reaches 0.

Theorem V.6 (Antitone ranking function soundness). If a closed PPCF term M is antitone rankable, then T^Y_M < ∞ a.s. (equivalently, M is AST).

Proof. As in Thm. III.10, take the probability space (S, Σ_S, μ_S), and define the same r.v.s Tₙ, Mₙ, Yₙ, T^Y_M. Thanks to Thm. III.7, (Yₙ)_{n≥0} is a supermartingale; and because M is now assumed to be antitone rankable, it is an antitone strict supermartingale w.r.t. stopping time T^Y_M. Thus, by Thm. V.4, T^Y_M < ∞ a.s.

As before, constructing antitone ranking functions completely is not necessary, and there is a corresponding notion of an antitone sparse ranking function.

Definition V.7. Define an antitone sparse ranking function on a closed term M to be a partial function f : Rch(M) ⇀ R such that for some antitone function ε : R_{≥0} → R_{>0}: (i) f(N) ≥ 0 for all N ∈ dom(f); (ii) M ∈ dom(f); (iii) for any N ∈ dom(f), evaluation of N will eventually reach some O which is either a value or in dom(f), and f(N) ≥ E[f(O) + ε(f(O)) × the number of Y-reduction steps from N to O] (where f(O) is taken to be 0 if O is a value outside of dom(f)).

Theorem V.8 (Sparse function). Every antitone sparse ranking function is a restriction of an antitone ranking function.
As a corollary, any term which admits an antitone sparse ranking function terminates almost surely.
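The antitone condition of Def. V.7 can be checked numerically for the unbiased random walk treated in Ex. V.9 below, with g(x) = ln(x + 1) + 1. The particular antitone function used here, ε(x) = 1/(2(e^{x−1} + 1)²) (chosen so that ε(g(m)) = 1/(2(m + 2)²)), is one choice that works; the harness itself is ours:

```python
import math

g = lambda x: math.log(x + 1) + 1
eps = lambda x: 1 / (2 * (math.exp(x - 1) + 1) ** 2)   # antitone, positive

# One Y-step takes Theta(n) to Theta(n-1) or Theta(n+1), each w.p. 1/2,
# so the antitone sparse condition (Def. V.7) for n >= 1 reads:
#   g(n) >= 0.5*(g(n-1) + eps(g(n-1))) + 0.5*(g(n+1) + eps(g(n+1)))
def condition(n):
    lhs = g(n)
    rhs = 0.5 * (g(n - 1) + eps(g(n - 1))) + 0.5 * (g(n + 1) + eps(g(n + 1)))
    return lhs >= rhs

# Base case: Theta(0) reduces to the value 0 with one Y-step, requiring
# g(0) >= 0 + eps(0).
all_ok = all(condition(n) for n in range(1, 10001)) and g(0) >= eps(0)
print(all_ok)
```

The check passes for all tested n, matching the analytic argument: the one-step decrease of g is about 1/(2(n+1)²), while the required decrease ½ε(g(n−1)) + ½ε(g(n+1)) is strictly smaller.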
Ex. V.1 revisited:
Programs M₁ and M₂ are AST. For M₁, the sparse ranking function

(Y λf n. if(n = 0, 0, f(n − 1) ⊕_{2/3} f(n + 1))) x ↦ x + 1

suffices to prove its termination (and could equivalently be considered an antitone sparse ranking function with the constant antitone function ε(x) = 1).

Example V.9 (Unbiased random walk). For M₂, define the function g : R_{≥0} → R_{≥0} by g(x) = ln(x + 1) + 1. Using the shorthand Θ = Y λf n. if(n = 0, 0, f(n − 1) ⊕_{1/2} f(n + 1)), we can define an antitone sparse ranking function f : Rch(M₂) ⇀ R_{≥0} by f : Θ n ↦ g(n) and 0 ↦ 0, for n ∈ ℕ. For n ≥ 1, Θ n reduces in several steps to either Θ (n − 1) or Θ (n + 1), each with probability 1/2, with one Y-reduction. Now

g(n) − (g(n − 1) + g(n + 1))/2 = ln((n + 1)/√(n(n + 2))) = ½ ln((n + 1)²/(n(n + 2))) = ½ ln(1 + 1/(n(n + 2))) > 1/(2(n(n + 2) + 1)) = 1/(2(n + 1)²) > ½ · 1/(2(n + 1)²) + ½ · 1/(2(n + 3)²).

Moreover, Θ 0 reduces to 0 with one Y-reduction, with a reduction in g of 1. Therefore, by setting ε(x) = 1/(2(e^{x−1} + 1)²), the condition on how much g must decrease is met. It is also defined at M₂ = Θ 10, and is non-negative, therefore it is an antitone sparse ranking function, and M₂ is AST.

All of the following examples also have antitone ranking functions, provided in Appendix C.

Example V.10 (Continuous random walk). In PPCF we can construct a function whose argument changes by a random amount at each recursive call:
Θ 10 where

Θ := Y λf x. if(x ≤ 0, 0, f(x − sample))

For a more complex (not Y-PAST) example, consider the following continuous random walk: Ξ 10 where

Ξ := Y λf x. if(x ≤ 0, 0, f(x − sample + 1/2))

Example V.11 (Fair-in-the-limit random walk). [25, §5.3]

(Y λf x. if(x ≤ 0, 0, f(x − 1) ⊕_{(x+1)/(2x+1)} f(x + 1)))

Example V.12 (Non-affine recursion).

(Y λf x. x ⊕_{1/2} f(f(x + 1))) 1

Example V.13 (More complex non-affine recursion).

(Y λf x. (λe. X[e]) sample) 0    where    X[e] = if(e ≤ p − x⁻¹, x, f(x + 1) ⊕_e f(x + 1))

Example V.14 (Higher-order recursion). Consider the higher-order function Ξ : (ℝ → ℝ → ℝ) → ℝ → ℝ → ℝ recursively defined by

Ξ := Y λϕ f^{ℝ→ℝ→ℝ} s^ℝ n^ℝ. if(n ≤ 0, s, f n (ϕ f s (n − 1)) ⊕_p f n (ϕ f s (n + 1)))

For any F : ℝ → ℝ → ℝ such that F n m terminates with no Y-reductions for all m and n, Ξ F x y is antitone rankable.

Technically this is not quite correct, because of the distinction between the numeral n + 1 and the term n + 1, but this ranking function can still be made to work by appealing to Thm. VI.15, as can the others in this section with similar issues. Inspired by examples in [38, 39].
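The numeric side-conditions of the two sparse ranking functions above can be machine-checked. The bias 2/3 and ranking function 3x + 1 for M₁, and the form ϵ(x) = 1/(2(e^{x−1} + 1)²) for Ex. V.9, are as reconstructed in this section; the check is an illustration, not part of the formal development:

```python
import math
from fractions import Fraction

# Ex. V.1: f(Theta_1 x) = 3x + 1, one Y-step per round, step down w.p. 2/3 (assumed bias)
f = lambda n: 3 * n + 1
for n in range(1, 500):
    expected_next = Fraction(2, 3) * f(n - 1) + Fraction(1, 3) * f(n + 1)
    assert f(n) >= 1 + expected_next        # sparse-ranking condition; holds with equality

# Ex. V.9: g must drop, in expectation, by at least the average of eps at the successors
g = lambda x: math.log(x + 1) + 1
eps = lambda x: 1 / (2 * (math.exp(x - 1) + 1) ** 2)
for n in range(1, 2000):
    drop = g(n) - (g(n - 1) + g(n + 1)) / 2      # expected decrease of g at Theta n
    need = (eps(g(n - 1)) + eps(g(n + 1))) / 2   # required decrease for one Y-step
    assert drop > need
```

Note that e^{g(n)−1} = n + 1, so ϵ(g(n ∓ 1)) is exactly 1/(2(n + 1)²) and 1/(2(n + 3)²), matching the inequality chain in Ex. V.9.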
Completeness:
It seems very likely that this method is complete in the case that every reachable term is AST (Conj. VIII.1), but we have been unable to prove this. It is certainly at least capable of proving the termination of terms which terminate arbitrarily slowly (in the sense that there is no limitation similar to that of Theorem III.10, which can only prove termination of Y-PAST terms).

The following theorem does not prove completeness, but is suggestive in that direction:

Theorem V.15.
For any stopping time T which is almost surely finite, if (F_n)_n is the coarsest filtration to which T is adapted, then there is a supermartingale (Y_n)_n adapted to (F_n)_n and an antitone function ϵ such that (Y_n)_n is an antitone ranking supermartingale with respect to T and ϵ.

VI. CONFLUENT TRACE SEMANTICS
When proving almost sure termination in this way, it is necessary to consider all of the terms a given term may reduce to, and to organise them into a manageable number of cases to assign ranking function values. Sometimes, however, the reduction order the programmer has in mind may not be strictly the call-by-value order defined so far; or the number of cases needed to define a sparse ranking function is excessive because the reduction does not proceed in a neat and orderly manner. In such cases (e.g. Ex. VI.17), it may be much more convenient to consider ranking functions with respect to alternative reduction strategies constructed to match the individual term in question, thereby reducing the number of cases needed and simplifying the reachable terms. It will be shown in Thm. VI.15 that ranking functions with respect to alternative reduction strategies are sound for proving AST in the usual sense (i.e. with respect to the standard CBV reduction strategy). To obtain this result, we prove a relation between the results of using different reduction strategies, by first constructing a confluent trace semantics that contains the standard semantics as a special case.

Non-probabilistic lambda calculi generally have the Church-Rosser property: if a term A reduces to both B₁ and B₂, there is some C to which both B₁ and B₂ reduce, so the reduction order mostly doesn't matter. In the probabilistic case, this may not be true, because β-reduction can duplicate samples, so the outputs of the copies of the sample may be identical or independent, depending on whether the sample is taken before or after the β-reduction. There are, however, some restricted variations on the reduction order that do not have this problem.

Even with this restriction, a trace semantics in the style of the one already defined would not be entirely Church-Rosser: for example, red(sample − sample, (1, 0, ...)) would be either 1 or −1 depending on the order of evaluation of the samples, as that determines which sample from the pre-selected sequence is used for each one. To fix this, rather than pre-selecting samples according to the order in which they will be drawn, we select them according to the position in the term where they will be used.

A position is a finite sequence of steps into a term, defined inductively as

α ::= · | λ; α | @₁; α | @₂; α | f_i; α | Y; α | if₁; α | if₂; α | if₃; α.

The subterm of M at α, denoted M|α, is defined by

M|· = M
(λx.M)|λ; α = M|α
(M₁ M₂)|@_i; α = M_i|α for i = 1, 2
f(M₁, …, M_n)|f_i; α = M_i|α for i ≤ n
(Y M)|Y; α = M|α
if(M₁ < 0, M₂, M₃)|if_i; α = M_i|α for i = 1, 2, 3,

so that every subterm is located at a unique position, but not every position corresponds to a subterm (e.g. (x y)|λ is undefined). A position α such that M|α does exist is said to occur in M. Substitution of N at position α in M, written M[N/α], is defined similarly. For example, let M = λx y. y (if(x < 0, y (f(x)), 0)) and α = λ; λ; @₂; if₂; @₂; then M[sample/α] = λx y. y (if(x < 0, y sample, 0)).

Two subterms N₁ and N₂ of a term M, corresponding to positions α₁ and α₂, can overlap in a few different ways. If α₁ is a prefix of α₂ (written α₁ ≤ α₂), then N₂ is also a subterm of N₁. If neither α₁ ≤ α₂ nor α₁ ≥ α₂, the positions are said to be disjoint. The notion of disjointness is mostly relevant in that if α₁ and α₂ are disjoint, performing a substitution at α₁ will leave the subterm at α₂ unaffected. With this notation, a more general reduction relation → can be defined.

Definition VI.1.
The binary relation → is defined by the following rules, each conditional on a redex occurring at position α in the term M:

if M|α = (λx.N) V, then M → M[N[V/x]/α]
if M|α = f(r₁, …, r_n), then M → M[r/α], where r is the numeral denoting the value of f(r₁, …, r_n)
if M|α = Y λx.N, then M → M[(λz. N[(Y λx.N)/x] z)/α], where z is not free in N
if M|α = if(r < 0, N₁, N₂), then M → M[N₁/α], where r < 0
if M|α = if(r < 0, N₁, N₂), then M → M[N₂/α], where r ≥ 0
if M|α = sample and λ does not occur after @₂ or Y in α, then M → M[r/α], where r ∈ [0, 1].

In each of these cases, M|α is the redex, and the reduction takes place at α.

Labelling the pre-chosen samples by positions in the term would also not work, because in some cases a sample will be duplicated before being reduced: for example, in (λx. x x)(λy. sample), both of the sample redexes that eventually occur originate at @₂; λ. It is therefore necessary to consider possible positions that may occur in other terms reachable from the original term. Even this is itself inadequate, because some of the positions in different reachable terms need to be considered the same, and the number of reachable terms is in general uncountable, which leads to measure-theoretic issues.

We are thus led to consider the reduction relation on skeletons (and positions in a skeleton), which can be extended from the definitions on terms in the obvious way, with if(X < 0, A, B) reducing nondeterministically to both A and B, sample reducing to X, and X considered as (the skeletal equivalent of) a value, so that (λx.A) X reduces to A[X/x]. For example, we have (λx. if(x < 0, x, X)) sample → (λx. if(x < 0, x, X)) X → if(X < 0, X, X) → X.

Given a closed term M, let L₀(M) be the set of pairs, the first element of which is a →-reduction sequence of skeletons starting at Sk(M), and the second of which is a position in the final skeleton of the reduction sequence.
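The positions machinery above is straightforward to prototype. The following sketch (our own encoding, with terms as tagged tuples and binders left nameless, much as in skeletons) reproduces the substitution example given with the position grammar:

```python
def subterm(term, pos):
    """M|alpha: follow a position (a tuple of (tag, index) steps) into a term."""
    for tag, i in pos:
        if tag == "lam" and term[0] == "lam":
            term = term[1]
        elif tag == "app" and term[0] == "app" and i in (1, 2):
            term = term[i]
        elif tag == "f" and term[0] == "f":
            term = term[2][i - 1]            # i-th argument of a primitive function
        elif tag == "Y" and term[0] == "Y":
            term = term[1]
        elif tag == "if" and term[0] == "if" and i in (1, 2, 3):
            term = term[i]
        else:
            raise KeyError("position does not occur in term")
    return term

def substitute(term, pos, new):
    """M[new/alpha]: replace the subterm of `term` at `pos` by `new`."""
    if not pos:
        return new
    (tag, i), rest = pos[0], pos[1:]
    if tag in ("lam", "Y"):
        return (term[0], substitute(term[1], rest, new))
    if tag in ("app", "if"):
        parts = list(term)
        parts[i] = substitute(term[i], rest, new)
        return tuple(parts)
    if tag == "f":
        args = list(term[2])
        args[i - 1] = substitute(args[i - 1], rest, new)
        return ("f", term[1], tuple(args))
    raise KeyError("bad step")

# M = \x y. y (if(x < 0, y (f(x)), 0)), alpha = lam; lam; @2; if2; @2
x, y = ("var", "x"), ("var", "y")
M = ("lam", ("lam", ("app", y, ("if", x, ("app", y, ("f", "f", (x,))), ("num", 0)))))
alpha = (("lam", 0), ("lam", 0), ("app", 2), ("if", 2), ("app", 2))
assert subterm(M, alpha) == ("f", "f", (x,))
M2 = substitute(M, alpha, ("sample",))       # M[sample/alpha] = \x y. y (if(x<0, y sample, 0))
assert subterm(M2, alpha) == ("sample",)
```

Disjointness is visible in this encoding as two positions neither of which is a tuple-prefix of the other; substituting at one leaves the subterm at the other untouched.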
As with the traces from I^ℕ used to pre-select samples in the standard trace semantics, modified traces, which are elements of I^{L₀(M)} (with one more caveat introduced later), will be used to pre-select a sample from I for each element of L₀(M), which will then be used if a sample reduction is ever performed at that position.

A (skeletal) reduction sequence is assumed to contain the information on the locations of all of the redexes as well as the actual sequence of skeletons that occurs. For example, (λx.x)((λx.x) X) could reduce to (λx.x) X with the redex at either · or @₂, and these give different reduction sequences.

Example VI.2 (Labelling samples by terms). Consider the terms

A[x] = if( if(x > 0, I, I)(λy. sample) 0 − 0.5 > 0, 0, Ω )
B = if( sample − 0.5 > 0, 0, Ω ).

If terms rather than skeletons were used to label samples, the set of modified traces where A[sample] terminates would be

⋃_{r ∈ [0,1]} { s | s(A[sample], if₁; −₁; @₁; @₁; if₁) = r, s(A[sample] → A[r] →* B, if₁; −₁) > 0.5 }.

This is a rather unwieldy expression, but the crucial part is that r occurs twice in the conditions on s: once as the value a sample must take, and once in the location of a sample. This set is unmeasurable, therefore the termination probability would not even be well-defined. Labelling samples by skeletons instead, this problem does not occur, because there are only countably many skeletons, and at each step in a reduction sequence, only finitely many could have occurred yet. Although skeletal reduction sequences omit the information on what the results of sampling were, they still contain all the necessary information on how many, and which, reductions took place.

For this particular term, Sk(A[r]) does not depend on the value of r, therefore the set where it terminates becomes simply the following, which is measurable:

{ s | s(Sk(A[sample]) → Sk(A[0]) →* Sk(B), if₁; −₁) > 0.5 }.

Example VI.3.
Consider the term M = Y (λf x. if(sample − 0.5 < 0, f x, x)) 0, which reduces after a few steps to N = if(sample − 0.5 < 0, M, 0). If we label samples by just skeletons and positions, and the pre-selected sample for (Sk(N), if₁; −₁) is less than 0.5, then N reduces back to M, then to N again, and the same sample is used the next time, so it is an infinite loop; whereas if samples are labelled by reduction sequences, the samples for M →* N are independent from the samples for M →* N → M →* N, and so on.

Reduction sequences of skeletons will often be discussed as though they were just skeletons, identifying them with their final skeletons. With this abuse of notation, a reduction sequence N (actually N₀ →* N_n = N) may be said to reduce to a reduction sequence O, where the reduction sequence implicitly associated with the final skeleton O is N₀ →* N_n → O.

This is still not quite sufficient, because sometimes the same samples must be used at corresponding positions in different reduction sequences.

Example VI.4.
The term M = sample + sample has the reachable skeletons N₁ = X + sample, N₂ = sample + X, O = X + X and X, with reductions M → N₁ → O → X and M → N₂ → O → X. In the reduction M → N₁, the sample labelled (M, +₁) is used, and in the reduction N₂ → O, the sample labelled (M → N₂, +₁) is used. Each of these samples becomes the value of the first numeral in O in their respective reduction sequences, therefore in order for Church-Rosserness to be attained, they must be the same. Which elements of L₀(M) must match can be described by the relation ∼*:

Definition VI.5.
The relation ∼ is defined as the union of the minimal symmetric relations ∼_p (“p” for parent-child) and ∼_c (“c” for cousin) satisfying:

(i) If N reduces to O with the redex at position α, and β is a position in N disjoint from α, then (N, β) ∼_p (O, β).
(ii) If N β-reduces to O at position α, β is a position in N|α; @₁; λ, and N|α; @₁; λ; β is not the variable involved in the reduction, then (N, α; @₁; λ; β) ∼_p (O, α; β).
(iii) If N if-reduces to O at position α, with the first resp. second branch being taken, and α; if_i; β occurs in N (where i = 2 resp. 3), then (N, α; if_i; β) ∼_p (O, α; β).
(iv) If N, O₁ and O₂ match any of the following cases:
a) N contains redexes at disjoint positions α₁ and α₂, O₁ is N reduced first at α₁ then at α₂, and O₂ is N reduced first at α₂ then at α₁.
b) N|α₁ = if(r < 0, N₁, N₂), where r < 0 (or, respectively, r ≥ 0), (N₂ resp. N₁)|β is a redex, O₁ is N reduced at α₁, and O₂ is N reduced first at α₁; (if₃ resp. if₂); β then at α₁.
c) N|α₁ = if(r < 0, N₁, N₂), where r < 0 (or, respectively, r ≥ 0), (N₁ resp. N₂)|β is a redex, O₁ is N reduced first at α₁ then at α₁; β, and O₂ is N reduced first at α₁; (if₂ resp. if₃); β then at α₁.
d) N|α = (λx.A) B, there is a redex in A at position β, O₁ is N reduced first at α then at α; β, and O₂ is N reduced first at α; @₁; λ; β then at α.
e) N|α = (λx.A) B, B|β is a redex, (γ_i)_i is a list of all the positions in A where A|γ_i = x, ordered from left to right, O₁ is N reduced first at α; @₂; β then at α, and O₂ is N reduced first at α then at α; γ_i; β for each i in order.
f) N|α = Y (λx.A), A reduced at β is A′, (γ_i)_i is a list of all the positions where A′|γ_i = x, ordered from left to right, O₁ is N reduced first at α; Y; λ; β then at α, and O₂ is N reduced first at α, then at α; λ; @₁; γ_i; Y; λ; β for each i (in order) such that γ_i is left of β, then at α; λ; @₁; β, then at α; λ; @₁; γ_i; Y; λ; β for the remaining values of i
(in which case O₁ and O₂ are equal as skeletons, but with different reduction sequences), O′₁ and O′₂ are the results of applying some reduction sequence to each of O₁ and O₂ (the same reductions in each case, which is always possible because they are equal skeletons), and δ is a position in O′₁ (or, equivalently, O′₂), then (O′₁, δ) ∼_c (O′₂, δ).

The ∼_c-rules are illustrated in Fig. 3.

Example VI.6.
In Ex. VI.4, (M, +₁) ∼_p (M → N₂, +₁) by case (i) of ∼_p (because the reduction M → N₂ occurs at +₂, which is disjoint from +₁), and similarly, (M, +₂) ∼_p (M → N₁, +₂).

If we extend the example to three samples, ∼_c becomes necessary as well. Let M_sss = sample + sample + sample (taking the three-way addition to be a single primitive function), M_Xss = X + sample + sample, and so on. There are then reduction sequences M_sss → M_Xss → M_XXs → M_XXX → X and M_sss → M_sXs → M_XXs → M_XXX → X. For the first two reductions, these reduction sequences take the same samples by ∼_p, case (i), as in Ex. VI.4. The next reduction uses the samples labelled by (M_sss → M_Xss → M_XXs, +₃) and (M_sss → M_sXs → M_XXs, +₃), which are related by ∼_c, case (a), therefore when these reduction sequences reach M_XXX, they still contain all the same numbers, as desired.

The reflexive transitive closure ∼* of this relation is used to define the set of potential positions L(M) = L₀(M)/∼*, and each equivalence class can be considered as the same position as it may occur across multiple reachable skeletons. If (N, α) ∼* (O, β), then N|α and O|β both have the same shape (i.e. they are either both the placeholder X, both variables, both applications, both samples, etc.), therefore it is well-defined to talk of the set of potential positions where there is a sample, L_s(M). Formally, L_s(M) := { [(N, α)]_{∼*} ∈ L(M) | N|α = sample }. The new sample space is then defined as I^{L_s(M)}, with the Borel σ-algebra and the product measure. Since I^{L_s(M)} is a countable product, the measure space is well-defined [37, Cor. 2.7.3].

Before defining the new version of the reduction relation red, the following lemma is necessary for it to be well-defined.

Lemma VI.7.
The relation ∼ is defined on L₀(M) with reference to a particular starting term M, so different versions, ∼_M and ∼_N, can be defined starting at different terms. If M → N, then ∼*_N is equal to the restriction of ∼*_M to L₀(N).

At each reduction step M → N, the sample space must be restricted from I^{L_s(M)} to I^{L_s(N)}. The injection L₀(N) → L₀(M) is trivial to define by prefixing each path with Sk(M) → Sk(N), and using Lem. VI.7, this induces a corresponding injection on the quotient, L(N) → L(M). The corresponding map L_s(N) → L_s(M) is then denoted i(M → N).

Definition VI.8.
Unlike in the purely call-by-value case, the version of the reduction relation that takes samples into account is still a general relation rather than a function, so it is denoted “⇒” instead of “red”, and it relates ⨄_{M ∈ Λ} I^{L_s(M)} to itself. We write an element of ⨄_{M ∈ Λ} I^{L_s(M)} as (M′, s), where M′ ∈ Λ is a term and s ∈ I^{L_s(M′)}. Then (M, s) ⇒ (N, s ∘ i(M → N)) if M → N at α and either the redex is not sample, or M|α = sample and N = M[s(Sk(M), α)/α].

This reduction relation now has all of the properties required of it. In particular, it can be considered an extension of the standard trace semantics (as will be seen later in Thm. VI.10), and also:
Lemma VI.9.
The relation ⇒ is Church-Rosser.

The reduction relation ⇒ is nondeterministic, so it admits multiple possible reduction strategies. A reduction strategy starting from a closed term M is a measurable partial function f from Rch(M) to positions, such that for any reachable term N where f is defined, f(N) is the position of a redex in N, and if f(N) is not defined, N is a value. Using a reduction strategy f, a subset of ⇒ that is not nondeterministic, ⇒_f, can be defined by: (N, s) ⇒_f (N′, s′) just if (N, s) ⇒ (N′, s′) and N reduces to N′ with the redex at f(N).

The usual call-by-value semantics can be implemented as one of these reduction strategies, given by (with V a value, T a term that is not a value, and M a general term):

cbv(T M) = @₁; cbv(T)
cbv(V T) = @₂; cbv(T)
cbv(f(V₁, …, V_{k−1}, T, M_{k+1}, …, M_n)) = f_k; cbv(T)
cbv(Y T) = Y; cbv(T)
cbv(if(T < 0, M₁, M₂)) = if₁; cbv(T)
cbv(V) is undefined
cbv(T) = · otherwise

(this last case covers redexes at the root position).

A closed term M terminates with a given reduction strategy f and samples s if there is some natural number n such that (M, s) ⇒_f^n (N, s′), where f gives no reduction at N. The term terminates almost surely with respect to f if it terminates with f for almost all s.

With these definitions, it is now possible to relate the confluent trace semantics back to the standard trace semantics.

Theorem VI.10.
A closed term M is AST with respect to cbv iff it is AST.

Theorem VI.11. If M terminates with some reduction strategy f and trace s, then it terminates with cbv and s.

Corollary VI.12 (Reduction strategy independence). If M is AST with respect to any reduction strategy, then it is AST.

Proof. Suppose M is AST w.r.t. f. Let X be the set of samples with which it terminates with this reduction strategy. By Thm. VI.11, M also terminates with cbv and every element of X; X has measure 1 by assumption, therefore M is AST with respect to cbv, therefore by Thm. VI.10 it is AST.

All of the theorems on the termination of rankable terms therefore extend to other reduction strategies too. The proofs of Thm. III.10, IV.5, V.6 and V.8 are all sufficiently generic with respect to what the reduction relation actually is that they can be directly applied to other reduction strategies. They only require that the number of reductions that can occur without any of them being a Y-reduction is bounded for any starting term (which is true, because Thm. A.2 applies equally to any reduction strategy). For a reduction strategy r and a term M, just substitute “N|_{r(N)} is a Y-redex” for “N is of the form E[Y λx.O]”, and “r(N) is undefined” for “N is a value”.

The domain of definition of the ranking functions also needs to be changed from the reachable terms Rch(M) to the reachable terms with respect to the reduction strategy r, Rch_r(M) := { N | ∃n, (N_i)_{0≤i≤n} : N_0 = M, N_n = N, N_i → N_{i+1} at r(N_i) }. More explicitly, the modified forms of the theorems are:
Definition VI.13. An antitone ranking function on M with respect to a reduction strategy r is a measurable function f : Rch_r(M) → ℝ such that f(N) ≥ 0 for all N, and there exists an antitone function ϵ : ℝ_{≥0} → ℝ_{>0} such that
• f(N) ≥ ϵ(f(N)) + f(N′) if N|_{r(N)} is a Y-redex and N reduces to N′ at r(N),
• f(N) ≥ ∫_I f(N[x/r(N)]) Leb(dx) if N|_{r(N)} = sample,
• f(N) ≥ f(N′) if r(N) is any other redex, where N → N′ at r(N).
Any closed term for which an antitone ranking function with respect to a reduction strategy r exists is called antitone rankable with respect to r.

Definition VI.14. An antitone sparse ranking function on M with respect to a reduction strategy r is a partial function f : Rch_r(M) ⇀ ℝ such that there exists some antitone function ϵ : ℝ_{≥0} → ℝ_{>0} such that (i) f(N) ≥ 0 for all N ∈ dom(f); (ii) M ∈ dom(f); (iii) for any N ∈ dom(f), evaluation of N at the positions specified by r will eventually reach some O such that either r(O) is undefined or f(O) is defined, and f(N) ≥ E[ f(O) + ϵ(f(O)) × (the number of Y-reduction steps from N to O) ] (where f(O) is taken to be 0 if O is a value outside of dom(f)).

If the antitone function is instead just the constant function 1, then f is called a (sparse) ranking function on M with respect to r, and the term is called rankable with respect to r.

Theorem VI.15 (Confluent ranking). If a closed PPCF term M has a ranking function, sparse ranking function, antitone ranking function or antitone sparse ranking function w.r.t. a reduction strategy r, then M is AST w.r.t. r, and AST.

When constructing an (antitone) sparse ranking function with respect to an alternative reduction strategy, it is intuitively simplest to define the sparse ranking function and the reduction strategy together.
For each key reachable term where the sparse ranking function is defined, simply decide which subterm should be reduced next, and how far, before the next key reachable term. The only restrictions to bear in mind when evaluating subterms are that a sample may not be evaluated inside a λ that is inside a Y or on the right of an application, and that a β-redex may not be reduced until its argument is a value. This can be seen applied to Ex. VI.17 at the end of Sec. D.

If, as seems likely, Thm. V.6 is complete for terms from which every reachable term is AST, then by the following theorem, Thm. VI.15 cannot actually prove the termination of any terms not already provable by Thm. V.6 and Thm. V.8.

Theorem VI.16.
For any closed term M and reduction strategy r on M, if every term in Rch_r(M) is AST, then every term in Rch(M) is AST.

However, there are some terms which terminate more simply and directly via some alternative reduction strategy. For these examples, a custom reduction strategy is specified for each term under consideration.
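The benefit of a custom strategy is often just a cheaper cost profile. As a toy illustration of how the Y-step count of the sum-of-powers program of Ex. VI.17 below differs between CBV and a strategy that first simplifies the inner power function, here is our own simplified cost model (one Y-step per recursive unfolding):

```python
def steps_theta(n):
    # Theta[x] applied to n unfolds n + 1 times (n, n - 1, ..., 0)
    return n + 1

def steps_cbv(n):
    # CBV: Xi unfolds n + 1 times, and each level k >= 1 re-runs Theta[k] n from scratch
    return (n + 1) + n * steps_theta(n)

def steps_alt(n):
    # alternative strategy: simplify Theta[x] n once, as an open term, then run Xi
    return steps_theta(n) + (n + 1)

assert steps_cbv(10) == 121 and steps_alt(10) == 22
assert steps_cbv(100) == 10201      # quadratic in n ...
assert steps_alt(100) == 202        # ... versus linear in n
```

Assuming the sampler draws n with tail P(n ≥ k) = k^{−2} (as in the reconstruction of Ex. VI.17), E[n] is finite but E[n²] is not, which is why the expected Y-step count is finite under the alternative strategy but infinite under CBV.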
Example VI.17 (Sum of powers). The following term picks a random natural number n, then computes Σ_{k=0}^{n} k^n.

M = (λn. P[n]) ⌊sample^{−1/2}⌋
P[n] = (λp. Ξ[p] n)(λx. Θ[x] n)
Ξ[p] = Y λf n. if(n = 0, 0, p n + f(n − 1))
Θ[x] = Y λf n. if(n = 0, 1, x × f(n − 1)).

With the CBV reduction strategy, the recursion in Θ[x] (which computes the power x^n) is executed separately for every term in the sum in Ξ. Not only is this much more complicated than simplifying Θ[x] n first, it requires considerably more Y-reduction steps — so many that M is Y-PAST with respect to a reduction strategy that simplifies Θ[x] n first (call it “r”), but is not Y-PAST with respect to the standard reduction strategy; hence M is rankable with respect to r, but is not rankable. It is still antitone rankable, but this is much more complicated. Details of the ranking function construction with and without the confluent semantics are given in Sec. D.

Even some of the examples given earlier implicitly use slightly non-standard reduction strategies to make the definitions of their ranking functions slightly simpler. For example, the antitone sparse ranking function in Ex. V.9 was defined at Θ n (where Θ = Y λf n. if(n = 0, 0, f(n − 1) ⊕_{1/2} f(n + 1))), but Θ 10 actually reduces to Θ (10 − 1) or Θ (10 + 1), and (taking the second of these), it then reduces to (λz. (λn. if(n = 0, 0, Θ (n − 1) ⊕_{1/2} Θ (n + 1))) z)(10 + 1), then to (λz. (λn. if(n = 0, 0, Θ (n − 1) ⊕_{1/2} Θ (n + 1))) z) 11, bypassing Θ 11 entirely. The antitone sparse ranking function in the form given can be justified anyway, by considering it to be using a reduction strategy which is the same as cbv except that, in Θ (n ± 1), it evaluates the redex at @₂ first.

Ex. V.14 also makes use of an alternative reduction strategy. In this case, it is to allow the antitone sparse ranking function to skip over all of those terms where the unknown function F is part way through evaluation.

VII.
APPLICATIONS
Although the definition of potential positions in a term in Sec. VI was intended simply to define a variant of trace semantics that has a restricted version of the Church-Rosser property, it is likely to be applicable to inference algorithms (for probabilistic programs) as well.

In lightweight Metropolis-Hastings (LMH) or single-site MH (originally suggested by [40]; see also [3]), the program is executed, with a random value being chosen each time a sample redex is evaluated. The trace of samples used is recorded; then, for the next step, it is modified slightly before the program is run again. This time, when sample redexes are evaluated, they may be taken from the pre-determined trace instead of chosen randomly. The change to the trace may be accepted or rejected, depending on the probability weight associated with each execution of the program (based on a conditioning language construct), and after this process is repeated sufficiently many times, the distribution of results of the program tends towards the true posterior distribution.

It is essential to the efficiency of this algorithm that, after modifying the trace, the weight of the resultant program execution is likely to be similar to the weight of the previous execution. This is satisfied by the values of the samples being similar to their original values, which (for a sufficiently well-behaved program) will tend to make the entire execution proceed similarly. In the simplest version of this algorithm, the samples in the trace are used one at a time, in the order they are encountered. However, if as a result of some random choice, more or fewer samples are taken during some stage of the program, subsequent samples in the trace will be displaced, and end up being used for different purposes than previously, thereby decreasing the correlation between the program runs.

It is possible to reduce this problem by labelling the samples in the trace by something other than the order in which they are used.
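The trace-reuse idea described above can be sketched in a few lines. The model, the addresses and the proposal below are our own illustration (and the accept/reject step and weights are omitted entirely); the point is only that address-keyed traces survive a change of control path:

```python
import random

def model(trace, rng):
    """Run a toy generative model, reusing traced values where addresses match."""
    def draw(addr):
        if addr not in trace:          # fresh draw only at an unseen address
            trace[addr] = rng.random()
        return trace[addr]
    if draw("coin") < 0.5:
        return draw("left")
    return draw("right1") + draw("right2")

rng = random.Random(1)
trace = {}
v = model(trace, rng)
assert set(trace) == {"coin", "left"}      # seed 1: the first draw is < 0.5
assert model(dict(trace), rng) == v        # replaying a full trace is deterministic

# A single-site proposal changes one address; draws at unrelated addresses persist.
proposal = dict(trace)
proposal["coin"] = 0.99                    # force the other branch
model(proposal, rng)
assert proposal["left"] == trace["left"]           # reused verbatim
assert {"right1", "right2"} <= set(proposal)       # new branch gets fresh labelled draws
```

Had the draws been labelled by order of use instead, flipping the branch would displace the remaining entries of the trace — exactly the correlation loss described above. Potential positions would play the role of the string addresses here.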
If the labels of samples in the trace correspond closely to the roles that the samples play in the program execution, this will increase the efficiency of the inference algorithm. In [40], for example, something rather like a stack trace is used, although this addressing scheme turns out to have various problems [41, 42]. Potential positions could also be used for a similar purpose, although in practice, a more compact representation than an entire skeletal reduction sequence would be necessary. Selecting an appropriate addressing scheme is also important for variational inference [43].

The ∼ relation as it has been defined is not exactly what would be necessary for inference (for example, ∼_c only becomes relevant when there are multiple different reduction strategies being considered), but something much like it is still likely to be useful.

Another possible application of the confluent trace semantics would be as a justification, within a trace-style semantics, of some forms of equational reasoning. By introducing reduction steps that go backwards, to terms not reachable from the original term, a term may in some cases be simplified, or equivalences between terms may be proved, by appealing to confluence results such as Lem. VI.9 and Thm. VI.11.

VIII. RELATED WORK AND FURTHER DIRECTIONS
Most methods for proving AST of programs in the literature do not support continuous distributions [17–20, 44, 45]. The few that do [21–23] are for first-order imperative programs. A fortiori, our results specialise to a method for proving termination of non-random lambda calculus. Thus restricted, our approach is closest in spirit to Jones' work on size-change termination [46, 47]. Dal Lago and Grellois [44] extend this to programs with discrete distributions, but their method is limited to affine recursion of first-order functions. Breuvart and Dal Lago [48] develop systems of intersection types from which the termination probability of higher-order programs (with discrete distributions) can be inferred from (infinitely many) type derivations.

Kobayashi et al. [45] show that the termination probability of order-n probabilistic recursion schemes (n-PHORS) can be obtained as the least fixpoint of suitable order-(n−1) fixpoint equations, which are solvable by Kleene fixpoint iteration. By contrast, our approach works for programs with continuous distributions. Note that n-PHORS is strictly less expressive than order-n call-by-name PPCF: the PPCF terms in Ex. V.10 and V.14 are not definable as PHORS.

The main result of McIver et al. [25] is similar to our Thm. V.6, but in an imperative language. While we require that the antitone ranking function decreases for every Y-reduction step, they require that it decreases for every iteration of a certain while loop (the ranking function in that case being defined in the context of a particular loop), which is similar to Y but limited to tail recursion. The difference in the exact progress condition is insignificant: a ranking function satisfying either progress condition can easily be converted to satisfy the other.

The confluent semantics, and the results based on it, are new in the setting of a functional language, because an imperative language does not have any equivalent of reduction at other redexes, or a similar nondeterministic structure.
The sparse ranking function construction is also more useful in a functional language (although an equivalent could be given in an imperative language), because in an imperative language, where the order of execution is more rigid, the iterations of the while loop provide a natural set of checkpoints at which it is reasonable to give values of the ranking function.

Further directions:
We have not yet been able to prove Conj. VIII.1, which would imply that Thm. V.6 is almost complete.
Conjecture VIII.1 (Completeness). If every term reachable from a certain PPCF term is AST, then that term is antitone rankable.
The definition (Def. VI.1) of redexes, and of the positions at which it is acceptable to reduce, is sufficiently restrictive to guarantee Church-Rosserness, but it is also a little more restrictive in some cases than is necessary for this purpose. For example, if the argument of a function is not yet a value, but its reduction to a value would be deterministic, or the function is affine, then applying the function before reducing its argument would not cause any problematic duplication of random samples. Similarly, if sample occurs at position α; @₂; β; λ; γ, but the function at α; @₁ is affine, evaluating the sample may in some cases also work fine. A more complete characterisation of which redexes can be reduced without breaking Church-Rosserness could be interesting.

Although this is a broadly applicable method of proving almost sure termination of probabilistic functional programs, and Thm. V.8 and VI.15 make it more convenient to use, some method of automating the construction and checking of (antitone) ranking functions, even partially, could make it considerably more useful in practice, especially in cases where almost sure termination is merely a side-condition for some other theorem or algorithm to be applicable.

Conclusions:
We have presented the first application of martingales to probabilistic lambda calculus, and the first version of trace semantics that is capable of satisfying the (restricted) Church-Rosser property. Though the construction of ranking functions is inherently difficult, using sparse ranking functions, antitone ranking functions and ranking functions w.r.t. an alternative reduction strategy, we have shown that a great variety of programs can be proved to be AST, including some that are beyond the reach of methods in the literature.
REFERENCES

[1] M. O. Rabin, "Probabilistic algorithms," pp. 21–39, 1976.
[2] A. D. Gordon, T. A. Henzinger, A. V. Nori, and S. K. Rajamani, "Probabilistic programming," in Future of Software Engineering, FOSE 2014, Hyderabad, India, May 31 – June 7, 2014, 2014, pp. 167–181.
[3] T. Rainforth, "Automating inference, learning, and design using probabilistic programming," Ph.D. dissertation, University of Oxford, 2017.
[4] J.-W. van de Meent, B. Paige, H. Yang, and F. Wood, "An introduction to probabilistic programming," arXiv:1809.10756 [cs, stat], Sep. 2018.
[5] N. D. Goodman, V. K. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum, "Church: a language for generative models," in UAI 2008, Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, Helsinki, Finland, July 9–12, 2008, 2008, pp. 220–229.
[6] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell, "Stan: A probabilistic programming language," Journal of Statistical Software, vol. 76, no. 1, 2017.
[7] D. Tolpin, J. van de Meent, and F. D. Wood, "Probabilistic programming in Anglican," in Machine Learning and Knowledge Discovery in Databases – European Conference, ECML PKDD 2015, Porto, Portugal, September 7–11, 2015, Proceedings, Part III, 2015, pp. 308–311.
[8] M. F. Cusumano-Towner, F. A. Saad, A. K. Lew, and V. K. Mansinghka, "Gen: A general-purpose probabilistic programming system with programmable inference," in Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI 2019. New York, NY, USA: Association for Computing Machinery, Jun. 2019, pp. 221–236.
[9] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman, "Pyro: Deep universal probabilistic programming," Journal of Machine Learning Research, vol. 20, no. 28, pp. 1–6, 2019.
[10] D. Tran, A. Kucukelbir, A. B. Dieng, M. Rudolph, D. Liang, and D. M. Blei, "Edward: A library for probabilistic modeling, inference, and criticism," arXiv preprint arXiv:1610.09787, 2016.
[11] H. Ge, K. Xu, and Z. Ghahramani, "Turing: A language for flexible probabilistic inference," in International Conference on Artificial Intelligence and Statistics. PMLR, Mar. 2018, pp. 1682–1690.
[12] T. Rainforth, "Automating inference, learning, and design using probabilistic programming," Ph.D. dissertation, 2017.
[13] C. Mak, C.-H. L. Ong, H. Paquet, and D. Wagner, "Densities of almost surely terminating probabilistic programs are differentiable almost everywhere," CoRR, vol. abs/2004.03924, 2020. [Online]. Available: https://arxiv.org/abs/2004.03924
[14] Y. Zhou, B. J. Gram-Hansen, T. Kohn, T. Rainforth, H. Yang, and F. Wood, "LF-PPL: A low-level first order probabilistic programming language for non-differentiable models," in The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16–18 April 2019, Naha, Okinawa, Japan, 2019, pp. 148–157.
[15] A. Nishimura, D. B. Dunson, and J. Lu, "Discontinuous Hamiltonian Monte Carlo for discrete parameters and discontinuous likelihoods," Biometrika, vol. 107, no. 2, pp. 365–380, 2020.
[16] W. Lee, H. Yu, and H. Yang, "Reparameterization gradient for non-differentiable models," in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, 2018, pp. 5558–5568.
[17] B. L. Kaminski and J. Katoen, "On the hardness of almost-sure termination," in Mathematical Foundations of Computer Science 2015 – 40th International Symposium, MFCS 2015, Milan, Italy, August 24–28, 2015, Proceedings, Part I, 2015, pp. 307–318.
[18] B. L. Kaminski, J. Katoen, C. Matheja, and F. Olmedo, "Weakest precondition reasoning for expected runtimes of randomized algorithms," J. ACM, vol. 65, no. 5, pp. 30:1–30:68, 2018.
[19] F. Olmedo, B. L. Kaminski, J. Katoen, and C. Matheja, "Reasoning about recursive probabilistic programs," in Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science, LICS '16, New York, NY, USA, July 5–8, 2016, 2016, pp. 672–681.
[20] A. McIver and C. Morgan, Abstraction, Refinement and Proof for Probabilistic Systems, ser. Monographs in Computer Science. Springer, 2005.
[21] L. M. F. Fioriti and H. Hermanns, "Probabilistic termination: Soundness, completeness, and compositionality," in Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2015, Mumbai, India, January 15–17, 2015, 2015, pp. 489–501.
[22] J. Chen and F. He, "Proving almost sure termination by omega-regular decomposition," in Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15–20, 2020, 2020, pp. 869–882.
[23] K. Chatterjee, H. Fu, P. Novotný, and R. Hasheminezhad, "Algorithmic analysis of qualitative and quantitative termination problems for affine probabilistic programs," ACM Trans. Program. Lang. Syst., vol. 40, no. 2, pp. 7:1–7:45, 2018.
[24] A. Chakarov and S. Sankaranarayanan, "Probabilistic program analysis with martingales," in Computer Aided Verification – 25th International Conference, CAV 2013, Saint Petersburg, Russia, July 13–19, 2013, Proceedings, 2013, pp. 511–526.
[25] A. McIver, C. Morgan, B. L. Kaminski, and J. Katoen, "A new proof rule for almost sure termination," Proc. ACM Program. Lang., vol. 2, no. POPL, pp. 33:1–33:28, 2018.
[26] W. W. Tait, "Intensional interpretations of functionals of finite type I," J. Symbolic Logic, vol. 32, no. 2, pp. 198–212, Jun. 1967. [Online]. Available: https://projecteuclid.org:443/euclid.jsl/1183735831
[27] H. Barendregt, W. Dekkers, and R. Statman, Lambda Calculus with Types, ser. Perspectives in Logic. CUP, 2010.
[28] T. Ehrhard, M. Pagani, and C. Tasson, "Full abstraction for probabilistic PCF," Journal of the ACM, vol. 65, no. 4, pp. 1–44, Apr. 2018. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3208081.3164540
[29] ——, "Measurable cones and stable, measurable functions: a model for probabilistic higher-order programming," Proc. ACM Program. Lang., vol. 2, no. POPL, pp. 59:1–59:28, 2018.
[30] C. Mak, C.-H. L. Ong, H. Paquet, and D. Wagner, "Densities of almost surely terminating probabilistic programs are differentiable almost everywhere," in ESOP 2021, 2021, to appear. https://arxiv.org/abs/2004.03924
[31] R. Culpepper and A. Cobb, "Contextual equivalence for probabilistic programs with continuous random variables and scoring," in Programming Languages and Systems – 26th European Symposium on Programming, ESOP 2017, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2017, Uppsala, Sweden, April 22–29, 2017, Proceedings, ser. Lecture Notes in Computer Science, H. Yang, Ed., vol. 10201. Springer, 2017, pp. 368–392. [Online]. Available: https://doi.org/10.1007/978-3-662-54434-1_14
[32] J. Borgström, U. Dal Lago, A. D. Gordon, and M. Szymczak, "A lambda-calculus foundation for universal probabilistic programming," in Proceedings of the 21st ACM SIGPLAN International Conference on Functional Programming, ICFP 2016, Nara, Japan, September 18–22, 2016, 2016, pp. 33–46.
[33] S. Staton, H. Yang, F. D. Wood, C. Heunen, and O. Kammar, "Semantics for probabilistic programming: higher-order functions, continuous distributions, and soft constraints," in Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science, LICS '16, New York, NY, USA, July 5–8, 2016, M. Grohe, E. Koskinen, and N. Shankar, Eds. ACM, 2016, pp. 525–534. [Online]. Available: https://doi.org/10.1145/2933575.2935313
[34] D. Kozen, "Semantics of probabilistic programs," J. Comput. Syst. Sci., vol. 22, no. 3, pp. 328–350, 1981.
[35] C. Mak, C.-H. L. Ong, and H. Paquet, "Almost-surely terminating probabilistic programs are differentiable almost everywhere," 2020, extended abstract, submitted to PROBPROG 2020.
[36] D. Williams, Probability with Martingales, ser. Cambridge Mathematical Textbooks. Cambridge University Press, 1999.
[37] R. B. Ash and C. Doléans-Dade, Probability and Measure Theory. Harcourt Academic Press, 2000.
[38] T. Cathcart Burn, C.-H. L. Ong, and S. J. Ramsay, "Higher-order constrained horn clauses for verification," Proc. ACM Program. Lang., vol. 2, no. POPL, pp. 11:1–11:28, 2018. [Online]. Available: https://doi.org/10.1145/3158099
[39] C.-H. L. Ong and D. Wagner, "HoCHC: A refutationally complete and semantically invariant system of higher-order logic modulo theories," in LICS 2019. IEEE, 2019, pp. 1–14. [Online]. Available: https://doi.org/10.1109/LICS.2019.8785784
[40] D. Wingate, A. Stuhlmüller, and N. Goodman, "Lightweight implementations of probabilistic programming languages via transformational compilation," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 770–778.
[41] O. Kiselyov, "Problems of the lightweight implementation of probabilistic programming," in Proceedings of Workshop on Probabilistic Programming Semantics, 2016.
[42] C.-k. Hur, A. V. Nori, S. K. Rajamani, and S. Samuel, "A provably correct sampler for probabilistic programs," in FSTTCS, 2015, pp. 1–14.
[43] T. B. Paige, "Automatic inference for higher-order probabilistic programs," Ph.D. dissertation, University of Oxford, 2016.
[44] U. Dal Lago and C. Grellois, "Probabilistic termination by monadic affine sized typing," ACM Trans. Program. Lang. Syst., vol. 41, no. 2, pp. 10:1–10:65, 2019.
[45] N. Kobayashi, U. Dal Lago, and C. Grellois, "On the termination problem for probabilistic higher-order recursive programs," in LICS 2019, 2019, pp. 1–14.
[46] N. D. Jones and N. Bohr, "Call-by-value termination in the untyped lambda-calculus," Log. Methods Comput. Sci., vol. 4, no. 1, 2008. [Online]. Available: https://doi.org/10.2168/LMCS-4(1:3)2008
[47] D. Sereni and N. D. Jones, "Termination analysis of higher-order functional programs," in Programming Languages and Systems, Third Asian Symposium, APLAS 2005, Tsukuba, Japan, November 2–5, 2005, Proceedings, ser. Lecture Notes in Computer Science, K. Yi, Ed., vol. 3780. Springer, 2005, pp. 281–297. [Online]. Available: https://doi.org/10.1007/11575467_19
[48] F. Breuvart and U. Dal Lago, "On intersection types and probabilistic lambda calculi," in Proceedings of the 20th International Symposium on Principles and Practice of Declarative Programming, PPDP 2018, Frankfurt am Main, Germany, September 03–05, 2018, 2018, pp. 8:1–8:13.
[49] M. Huang, H. Fu, and K. Chatterjee, "New approaches for almost-sure termination of probabilistic programs," in Programming Languages and Systems – 16th Asian Symposium, APLAS 2018, Wellington, New Zealand, December 2–6, 2018, Proceedings, 2018, pp. 181–201.
[50] K. Chatterjee, P. Novotný, and D. Zikelic, "Stochastic invariants for probabilistic termination," in Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, Paris, France, January 18–20, 2017, 2017, pp. 145–160.
[51] S. Agrawal, K. Chatterjee, and P. Novotný, "Lexicographic ranking supermartingales: an efficient approach to termination of probabilistic programs," Proc. ACM Program. Lang., vol. 2, no. POPL, pp. 34:1–34:32, 2018.
[52] K. Chatterjee, H. Fu, and A. K. Goharshady, "Termination analysis of probabilistic programs through Positivstellensatz's," in Computer Aided Verification – 28th International Conference, CAV 2016, Toronto, ON, Canada, July 17–23, 2016, Proceedings, Part I, 2016, pp. 3–22.
[53] M. Huang, H. Fu, K. Chatterjee, and A. K. Goharshady, "Modular verification for almost sure termination of probabilistic programs," Proc. ACM Program. Lang., vol. 3, no. OOPSLA, pp. 129:1–129:29, 2019.
[54] V. K. Mansinghka, D. Selsam, and Y. N. Perov, "Venture: a higher-order probabilistic programming platform with programmable inference," CoRR, vol. abs/1404.0099, 2014. [Online]. Available: http://arxiv.org/abs/1404.0099
[55] C.-H. L. Ong, "On model-checking trees generated by higher-order recursion schemes," in LICS 2006. IEEE Computer Society, 2006, pp. 81–90. [Online]. Available: https://doi.org/10.1109/LICS.2006.38
[56] ——, "Higher-order model checking: An overview," in LICS 2015. IEEE Computer Society, 2015, pp. 1–15. [Online]. Available: https://doi.org/10.1109/LICS.2015.9
[57] R. Statman, "On the lambda Y calculus," in LICS 2002. IEEE Computer Society, 2002, pp. 159–166. [Online]. Available: https://doi.org/10.1109/LICS.2002.1029825
[58] K. Etessami and M. Yannakakis, "Recursive Markov chains, stochastic grammars, and monotone systems of nonlinear equations," J. ACM, vol. 56, no. 1, pp. 1:1–1:66, 2009. [Online]. Available: https://doi.org/10.1145/1462153.1462154
[59] H. Barendregt, The Lambda Calculus, Its Syntax and Semantics, 2nd ed. North-Holland, 1984.
[60] M. Takahashi, "Parallel reductions in lambda-calculus," Information and Computation, vol. 118, pp. 120–127, 1995.

APPENDIX A
SUPPLEMENTARY MATERIALS FOR SEC. III

Lemma III.4.
Let (Y_n)_{n≥0} be an ε-ranking supermartingale w.r.t. the stopping time T. Then T < ∞ a.s., and E[T] ≤ E[Y_0]/ε.

Proof. We first prove (by induction on n): for all n ≥ 0, E[Y_n] ≤ E[Y_0] − ε · (Σ_{i=0}^{n−1} P[T > i]). It follows that Σ_{i=0}^{∞} P[T > i] converges; hence we have lim_{i→∞} P[T > i] = 0 and so P[T < ∞] = 1. It then remains to observe: if T < ∞ a.s., then E[T] = Σ_{i=0}^{∞} P[T > i].

Theorem III.7 (Deriving supermartingales). If a closed PPCF term M is rankable (respectively, strictly rankable) by f, then (f(M_n))_{n≥0} is a supermartingale (respectively, a ranking supermartingale w.r.t. the stopping time T_M) adapted to (F_n)_{n≥0}.

We shall obtain the theorem as a corollary of a technical lemma (Lem. A.1). We say that a given PPCF term is: type 1 if it has the shape E[Y λx.N]; type 2 if it has the shape E[sample]; type 3 if it has the shape E[R] where R is any other redex; and type 4 if it is a value. Henceforth fix an n ≥ 0, and define T_i := { s | M_n(s) is type i }. It is straightforward to see that each T_i ∈ F_n, and that {T_1, T_2, T_3, T_4} is a partition of S. Hence it suffices to prove the following lemma.

Lemma A.1 (Technical). For all i ∈ {1, 2, 3, 4} and A ∈ F_n,

∫_A μ_S(ds) f(M_{n+1}) [s ∈ T_i] ≤ ∫_A μ_S(ds) f(M_n) [s ∈ T_i].

Hence E[f(M_{n+1}) | F_n] ≤ f(M_n) a.s.

First, some notation. Given an ω-sequence (e.g. s ∈ S) and m ≥ 0, we write s_{≤m} ∈ I^m for the prefix of s of length m. For n ≥ 0, define #_n(s) := |{ k < n | ∃E. M_k(s) = E[sample] }|, so that π_1(red^n(M, s)) = π_t(···(π_t(s))···), with π_t applied #_n(s) times. The F_n-measurability of M_n (and hence of #_n) follows from [32]. Take s ∈ A ∈ F_n with #_n(s) = l. For any s′ ∈ S, if s_{≤l} = s′_{≤l} then s′ ∈ A. It follows that {s_{≤l}} · I^ω ⊆ A.

Proof.
We show the non-trivial case of i = 2. First we express

f(M_{n+1}) [s ∈ T_2] = Σ_{i∈I} f(E_i[σ_i(s)][ρ(s)]) [s ∈ U_i]   (2)

where (the Iverson bracket) [P] := 1 if the statement P holds, and 0 otherwise, and:
• I is a countable indexing set;
• E_i[·][sample] ∈ Sk_{j_i}, and σ_i : S → R^{j_i}, and ρ(s) := π_h(π_1(red^n(M, s))) = π_h(π_t^{l_i}(s)) ∈ R;
• {U_i}_{i∈I} is a partition of T_2, where each U_i is determined by a skeletal environment E_i and a number (of draws) l_i ≤ n, so that U_i is the set of traces where, after n reduction steps, l_i samples have been used and the term has reached E_i[r][sample] for some r ∈ R^{j_i}. (We use the (measurable) function σ_i : S → R^{j_i} to skolemise the existentially quantified r.) Equivalently, U_i := #_n^{−1}[l_i] ∩ M_n^{−1}[{ E_i[r][sample] | r ∈ R^{j_i} }] ∈ F_n.

[Figure 1. Typing rules of SPCF.]

Observe that if s ∈ U_i then {s_{≤l_i}} · I^ω ⊆ U_i; in fact (U_i)_{≤l_i} · I^ω = U_i. This means that for any measurable g : S → R_{≥0}, if g(s) only depends on the prefix of s of length (l_i + 1), then, writing ĝ : I^{l_i+1} → R_{≥0} where g(s) = ĝ(s_{≤l_i+1}), we have, for any A ∈ F_n:

∫_{A∩U_i} μ_S(ds) g(s) = ∫_{(A∩U_i)_{≤l_i+1}} Leb^{l_i+1}(dt) ĝ(t).   (3)

Take s ∈ U_i, and set l = l_i. Plainly σ_i(s) depends on s_{≤l}, and ρ(s) depends on s_{≤l+1}. Take u ∈ (U_i)_{≤l}. It then follows from the definition of ranking function that

∫_I Leb(dr) f(E_i[σ̂_i(u)][r]) ≤ f(E_i[σ̂_i(u)][sample]).
Take A ∈ F_n; integrating both sides, we get

∫_{(A∩U_i)_{≤l}} Leb^l(du) ∫_I Leb(dr) f(E_i[σ̂_i(u)][r]) ≤ ∫_{(A∩U_i)_{≤l}} Leb^l(du) f(E_i[σ̂_i(u)][sample]).

Since Leb^{l+1} is the (unique) product measure satisfying Leb^{l+1}(V × B) = Leb^l(V) · Leb(B), and (U_i)_{≤l_i} · I^ω = U_i, we have

∫_{(A∩U_i)_{≤l+1}} Leb^{l+1}(du′) f(E_i[σ̂_i(u′)][ρ̂(u′)]) ≤ ∫_{(A∩U_i)_{≤l}} Leb^l(du′) f(E_i[σ̂_i(u′)][sample])

and so, by (3),

∫_{A∩U_i} μ_S(ds) f(E_i[σ_i(s)][ρ(s)]) ≤ ∫_{A∩U_i} μ_S(ds) f(E_i[σ_i(s)][sample]).   (4)

Now, integrating both sides of (2), we have

∫_A μ_S(ds) f(M_{n+1}) [s ∈ T_2]
 = ∫_A μ_S(ds) Σ_{i∈I} f(E_i[σ_i(s)][ρ(s)]) [s ∈ U_i]
 = Σ_{i∈I} ∫_{A∩U_i} μ_S(ds) f(E_i[σ_i(s)][ρ(s)])
 ≤ Σ_{i∈I} ∫_{A∩U_i} μ_S(ds) f(E_i[σ_i(s)][sample])   (by (4))
 = ∫_A μ_S(ds) Σ_{i∈I} f(E_i[σ_i(s)][sample]) [s ∈ U_i]
 = ∫_A μ_S(ds) f(M_n) [s ∈ T_2].

As an immediate corollary of Lem. A.1, each f(M_n) is integrable. This concludes the proof of Thm. III.7.

Lemma III.8. (T_n)_{n≥0} is an increasing sequence of stopping times adapted to (F_n)_{n≥0}, and each T_i is bounded.

We first show that T_0 is bounded, i.e. T_0 ≤ n_0 for some n_0 ∈ ω. A first proof idea is to construct a reduction argument to the strong normalisation of the simply-typed lambda calculus. There is a classical transform of conditionals into the pure lambda calculus. However, each evaluation of sample (almost surely) returns a different number, which complicates such a transform.

Given a closed PPCF term M, we transform it to a term ⌜M⌝ of the nondeterministic simply-typed lambda calculus as follows. (We may assume that the nondeterministic calculus is generated from a base type ι with a ι-type constant symbol r, and a function symbol ⊥ : A for each function type A.)

⌜sample⌝ := r
⌜y⌝ := y
⌜λy.N⌝ := λy.⌜N⌝
⌜M N⌝ := ⌜M⌝ ⌜N⌝
for n ≥ 0, ⌜f M_1 ··· M_n⌝ := (λz_1 ··· z_n. r) ⌜M_1⌝ ··· ⌜M_n⌝
⌜if(B, M_1, M_2)⌝ := (λz. (⌜M_1⌝ + ⌜M_2⌝)) ⌜B⌝
⌜Y N⌝ := (λy. ⊥) ⌜N⌝

In the above, the variables z, z_1, ..., z_n are assumed to be fresh; f ranges over primitive functions and numerals. The idea is that ⌜M⌝ captures the initial, non-recursive operational behaviour of M, so that ⌜M⌝ simulates the reduction of M until the latter reaches a value, or a term of the form E[Y λx.N]. It is straightforward to see that for every trace s ∈ S, there is a reduction sequence of ⌜M⌝ that simulates (an initial subsequence of) the reduction of M under s. Finally, thanks to the following

Theorem A.2 (de Groote). The simply-typed nondeterministic lambda calculus is strongly normalising.

we have:
(i) Every ⌜M⌝-reduction terminates on reaching r, or a term of the shape E[⊥ N_1 ··· N_n].
(ii) Further, there is a finite bound (say l) on the length of such ⌜M⌝-reduction sequences; and l bounds the stopping time T_0.

Lemma III.9. T^Y_M is a stopping time adapted to (F_{T_n})_{n≥0}.

Proof. To see {T^Y_M = n} ∈ F_{T_n} for a given n, we need to show that {T^Y_M = n} ∩ {T_n ≤ i} ∈ F_i, for all i ∈ ω. Let V be the set of values in Λ (which is measurable), and let W be the set of non-values. Then it suffices to observe that {T^Y_M = n} ∩ {T_n ≤ i} = ⋃_{l=n}^{i} (M_l^{−1}[V] ∩ M_{l−1}^{−1}[W] ∩ {T_n = l}), where each of M_l^{−1}[V], M_{l−1}^{−1}[W], and {T_n = l} is in F_i.

APPENDIX B
SUPPLEMENTARY MATERIALS FOR SEC. IV

Theorem IV.2.
Given a closed term M, the function f : Rch(M) → R given by f(N) := E[number of Y-reduction steps from N to a value], if it exists, is the least of all possible ranking functions of M.

Proof. Let f be the candidate least ranking function defined above, and suppose g is another ranking function such that f(N) > g(N) for some N ∈ Rch(M). The restrictions of f and g to Rch(N) have the same properties assumed of f and g, so assume w.l.o.g. that N = M. The difference g − f is then a supermartingale (with the same setup as in Thm. III.7); therefore E[g(M_n)] ≤ E[f(M_n)] + g(M) − f(M), for all n. Now E[f(M_n)] = Σ_{k=n}^{∞} P[M_k = E[Y N] for some E, N] → 0 as n → ∞; therefore, as g(M) − f(M) < 0, eventually E[g(M_n)] < 0, which is impossible. It follows that g ≥ f as required.

In order for f to be the least ranking function of M, it also has to actually be a ranking function itself. Each of the conditions on a ranking function is easily verified from the definition of f.

[Figure 2. The reachable terms of Y Θ 0.]

Theorem IV.5 (Sparse function). Every sparse ranking function is a restriction of a ranking function.

Proof.
Take a closed term M and a sparse ranking function f on M. Define f̂ : Rch(M) ⇀ R by f̂(N) := f(N) whenever f(N) is defined, and f̂(V) := 0 for values V not in the domain of f. Define (next(N, s), _) := red^n(N, s) for the least n ≥ 0 such that red^n(N, s) is in the domain of f̂, and g(N, s) := |{ m < n | red^m(N, s) is of the form (E[Y λx.N′], s′) }|. The function next is well-defined (i.e. n is finite) for all N ∈ Rch(M), by induction on the path from M to N, using the third condition on sparse ranking functions. Define f̄(N) := ∫_S (f̂(next(N, s)) + g(N, s)) μ_S(ds). The (total) function f̄ agrees with f on f's domain, and it is a ranking function on M (in fact, the least ranking function of which f is a restriction, by the same argument as Thm. IV.2).

APPENDIX C
SUPPLEMENTARY MATERIALS FOR SEC. V

Theorem V.4.
Let (Y_n)_{n≥0} be an antitone strict supermartingale w.r.t. the stopping time T. Then T < ∞ a.s.

Proof. First, as (Y_n) is a supermartingale, E[Y_n] ≤ E[Y_0]. Therefore

E[Y_n | T > n]
  = { rearranging terms }
(P[T > n] E[Y_n | T > n] + P[T ≤ n] E[Y_n | T ≤ n] − P[T ≤ n] E[Y_n | T ≤ n]) / P[T > n]
  = { definition of conditional expectation }
(E[Y_n] − P[T ≤ n] E[Y_n | T ≤ n]) / P[T > n]
  ≤ { Y_n ≥ 0 always }
E[Y_n] / P[T > n]
  ≤ { (Y_n)_n is a supermartingale }
E[Y_0] / P[T > n].

Claim: for all 0 < x ≤ 1, P[T > B_x] ≤ x, where B_x = ⌈ (E[Y_0] + 1) / (x · ε(E[Y_0] x^{−1})) ⌉ and ε : R_{≥0} → R_{>0} is the antitone function.

As the convex hull of ε (the greatest convex function less than or equal to it) satisfies all the conditions assumed of ε, in addition to being convex, assume wlog that ε is convex. Assume for a contradiction that P[T > B_x] > x. Then take n ≤ B_x. We have

E[Y_n − Y_{n+1}]
  = { F_n ⊆ F_{n+1}, def. & linearity of cond. expectation }
E[Y_n − E[Y_{n+1} | F_n]]
  ≥ { antitone strict assumption }
E[ε(Y_n) · 1_{T>n}]
  = { definition of expectation conditioned on an event }
P[T > n] E[ε(Y_n) | T > n]
  ≥ { Jensen's inequality }
P[T > n] ε(E[Y_n | T > n])
  ≥ { proved earlier }
P[T > n] ε(E[Y_0] / P[T > n])
  > { assumption, P[T > n] ≥ P[T > B_x] > x }
x · ε(E[Y_0] x^{−1}).

Therefore, by a telescoping sum,

E[Y_{B_x}] = E[Y_{B_x} − Y_0 + Y_0] ≤ E[Y_0] − B_x · x · ε(E[Y_0] x^{−1}) ≤ −1 < 0,

which is a contradiction; therefore the claim must be true, therefore P[T > n] → 0 as n → ∞, therefore T < ∞ a.s.

Theorem V.8 (Sparse function). Every antitone sparse ranking function is a restriction of an antitone ranking function.

Proof.
Take a closed term M and an antitone sparse ranking function f on M, with a corresponding antitone function ε. Assume wlog that ε is convex (if it is not, we can just take its convex hull instead). As in Theorem IV.5, define f̂ : Rch(M) ⇀ R by f̂(N) := f(N) whenever f(N) is defined, and f̂(V) := 0 for values V not in the domain of f. Define (next(N, s), _) := red^n(N, s) for the least n ≥ 0 such that red^n(N, s) is in the domain of f̂, and g(N, s) := |{ m < n | red^m(N, s) is of the form (E[Y λx.N′], s′) }|. The function next is well-defined (i.e. n is finite) for all N ∈ Rch(M), by induction on the path from M to N, using the third condition on antitone partial ranking functions. Define f̄(N) := ∫_S (f̂(next(N, s)) + ε(f̂(next(N, s))) · g(N, s)) μ_S(ds). For any term N where f is defined, f̄(N) = f(N), and the value that f̄ would have at N if f were not defined at N is ≤ f̄(N), by the third condition on antitone partial ranking functions. In order to show that f̄ is an antitone ranking function, it therefore suffices to show that the value that f̄ would have had at each term, if f were not defined at that term, is at least the expectation of f̄ after one reduction step (plus ε(f̄(N)) if the reduction step is a Y-reduction). For any term N which is not of the form E[R] for some Y-redex R, this is trivial. If R is a Y-redex, then ε(f̄(N)) ≤ ∫_S ε(f̂(next(N, s))) μ_S(ds) by the convexity of ε, because n is bounded. Therefore the (total) function f̄, which agrees with f on f's domain, is an antitone ranking function on M (with the same function ε, if it is convex).

Ex. V.10. Let Θ := Y λf x. if(x ≤ 0, 0, f(x − sample)). We can construct a sparse ranking function f for Θ 10 as follows:

Θ l ↦ 2l + 2,  if(l ≤ 0, 0, Θ(l − sample)) ↦ 2l + 1,  0 ↦ 0.
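Since the recursion of Ex. V.10 is first-order, its behaviour can also be checked empirically. The following sketch is not from the paper: it is a plain Python rendering of Θ's reduction, with a seeded uniform generator standing in for sample, estimating the expected number of Y-reduction steps from Θ 10.

```python
import random

def theta_calls(l, rng):
    # Θ := Y λf x. if(x ≤ 0, 0, f(x − sample)):
    # each loop iteration corresponds to one Y-reduction step.
    calls = 0
    while l > 0:
        l -= rng.random()  # sample drawn uniformly from (0, 1)
        calls += 1
    return calls

rng = random.Random(0)
trials = [theta_calls(10.0, rng) for _ in range(10_000)]
print(sum(trials) / len(trials))  # hovers around 21
```

The empirical mean sits near 21, a finite value consistent with the sparse ranking function certifying that Θ 10 is AST (indeed positively AST).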
For a more complex (not Y-PAST) example, consider the following "continuous random walk" Ξ 10, where
Ξ := Y λf x. if(x ≤ 0, 0, f(x − sample + 1/2)).

Let g(x) := 2 + ln(x + 1), and let ε be the function specified by

ε(g(x − 1/2)) := g(x) − ∫_{x−1/2}^{x+1/2} g(y) dy.

The limit of this as x → ∞ (so that g(x + 1/2) → ∞) is 0, and (d/dx) ε(g(x + 1/2)) < 0, and g is monotonically increasing; therefore ε is antitone and bounded below by 0. We define an antitone sparse ranking function by:

Ξ l ↦ g(l),  0 ↦ 0.

The value of g after one Y-reduction step is at least g(l − 1/2), therefore the expectation of ε after one Y-reduction step is at most ε(g(l − 1/2)) = g(l) − ∫_{l−1/2}^{l+1/2} g(y) dy. Thus
• in case l > 1/2, g decreases by the required amount;
• in case l ≤ 1/2, the reduct is a value, and the decrease g(Ξ l) ≥ 1 exceeds the required ε(g(l − 1/2)) as well.
Hence this is a valid antitone sparse ranking function and the term is AST.

Ex. V.11. Let

Ξ := Y λf x. if(x ≤ 0, 0, f(x − 1) ⊕_{x/(2x+1)} f(x + 1)).

To construct an antitone ranking function, we solve the recurrence relation: z_0 = 0 and, for n ≥ 1, z_n > (n/(2n+1)) z_{n−1} + ((n+1)/(2n+1)) z_{n+1}. For n > 3, z_n = ln(n − 2) works: the expected decrease is (1/(2n+1)) (n ln(1 + 1/(n−3)) − (n+1) ln(1 + 1/(n−2))), and using the fact that x/(x+1) ≤ ln(1 + x) ≤ x for x > 0, this is bounded below by a quantity which (again for n > 3) is positive and antitone. For n = 0, 1, 2, 3, we take ε to be 9/40 (the same as its value at 3), then choose z_1, z_2, z_3 as the corresponding combinations of logarithms (z_3 = 2 ln 2 − ln 3 − 9/40, and so on). Some of those values are negative, but the function is still bounded below, so by adding a constant offset it can be corrected, and the term is AST.

Example C.1 (Escaping spline). [25, §5.4]

Ξ := Y λf x. 0 ⊕_{1/(x+1)} f(x + 1).

In this case, the fact that the ranking function must decrease at each Y-step (in an antitone sparse ranking function) by the expected value of ε not at the current term, but at the next term where the ranking function is defined, is a little harder to deal with, because the term can step all the way to a value in one step; simply adding a small offset therefore does not suffice to compensate. Consider the candidate ranking function Ξ n ↦ n + 1.
For each n, Ξ n reduces to either 0 (with probability 1/(n+1)) or Ξ(n+1); the expected value of the ranking function after the step is therefore n(n+2)/(n+1) = n + 1 − 1/(n+1), while the required decrease is (ε(0) + n ε(n+2))/(n+1); therefore eventually the expected decrease is not enough, whatever the value of ε(0) is.

If instead we take the ranking function to be defined after the Y-reduction but before the sample-reduction as well, this can be resolved. Letting Θ[n] := 0 ⊕_{1/n} Ξ n:

Θ[n] ↦ n,  Ξ 10 ↦ 12.

For each n, Θ[n] reduces to either 0 or Θ[n+1], with a Y-reduction only in the latter case, and the condition that this is an antitone sparse ranking function is that n ≥ ((n−1)/n)(n+1) + ((n−1)/n) ε(n+1), and 1 ≥ ε(11). These are satisfied by setting ε(x) = min(1, 1/x); therefore this term is AST.

Non-affine recursion:
Many of the recent advances in the development of AST verification methods [19, 21, 22, 24, 25, 49–53] are concerned with loop-based programs. We can view such loops as tail-recursive programs that are, in particular, affine recursive, i.e., in each evaluation (or run) of the body of the recursion, recursive calls are made from at most one call site [44, §4.1]. (Note that whether a program is affine recursive cannot be checked by just counting textual occurrences of variables.) Termination analysis of non-affine recursive probabilistic programs does not seem to have received much attention. Methods such as those presented in [44] are explicitly restricted to affine programs, and are unsound otherwise. By contrast, many probabilistic programming languages allow for richer recursive structures [5, 7, 54].

Ex. V.12. Let
Ξ := Y λf x. x ⊕_{2/3} f(f(x + 1)).
Ξ 1 would be Ξ n i (cid:55)→ n (for n ≥ ), because Ξ n i reduces to either Ξ n +1 i + 1 or Ξ n − i + 1 , with probabilities / and / . The value of i does not actually matter for the progress of this recursion. Itis basically another variant of the biased random walk, exceptthat the relevant variable is the number of copies of Ξ , insteadof a real number in the term.Ex. V.13 Let Ξ = Y λf x. ( λe.X [ e ]) sample X [ e ] = if ( e ≤ p − x − , x + 1 , f ( x + 1) ⊕ e f ( x + 1) ) Not only is this program non-affine recursive, note also theuse of a random sample, e , as a first-class value, and thesubsequent use as a probability in the binary choice ⊕ e . Sucha computation cannot be modelled via discrete distributions.We can use the ranking function method (coupled with thesolution of linear recurrence relations) to show that provided p ≥ −√ , the program Ξ 3 is AST. Consider the edge case,that p = −√ exactly.The term Ξ n x (for n > , x − < p ) reduces to either Ξ n − x + 1 , Ξ n x + 1 or Ξ n +1 x + 1 , with probabilities p − x − , (1 − p + x − ) p − x − and (1 − p + x − )(1 − p − x − ) respectively. Let a (Ξ n x ) = n + 2 x − . This is a supermartin-gale (because the decrease in x − as x increases is enough18o offset the average increase in n ), but it does not satisfythe antitone-strict progress condition. It does however have abounded-below variance, so the usual method of using ln a instead of just a works. 
Calculating the exact amount by which ln a decreases is not necessary, because it can be bounded as follows. The value of a changes by at least some fixed δ > 0 with probability at least some fixed q > 0 (assuming x is large enough), therefore ln(a) decreases in expectation by at least q(δ/a − (ln(a + δ) − ln(a))): using the fact that the linear approximation to ln at a, namely ln(a) + (x − a)/a, at least does not increase in expectation, and then adding the deviation of ln from this linear approximation (and using the fact that the linear approximation is everywhere an overestimate), we obtain a sufficiently strong bound on the decrease of ln(a) for it to be an antitone sparse ranking function. (It can obviously be extended to Ξ 1 too, which reduces to Ξ^n or a value after a bounded number of steps, but that complicates the analysis a little.)

Higher-order recursion:
There is an obvious source of deterministic (i.e. non-probabilistic) higher-order functions defined by recursion, viz., higher-order recursion schemes (HORS) (see e.g. [55, 56]). (Incidentally, the (sparse) ranking function method is just as applicable to the termination analysis of deterministic
PPCF programs.) Recently, [45] extended HORS to probabilistic higher-order recursion schemes (PHORS), which are HORS augmented with probabilistic (binary) branching ⊕_p. As HORS are in essence the λY-calculus [57] (i.e. pure simply-typed lambda calculus with recursion, generated from a finite base type), PHORS are definable in PPCF: (order-n) PHORS are encodable as (order-n) (call-by-name) PPCF, but the former are strictly less expressive (because the underlying recursion schemes are not Turing complete). A relevant result here is that the AST problem is decidable for order-1 PHORS (by reduction to the PSPACE-hard solvability of finite systems of polynomial equations with real coefficients; see [58]), but undecidable for order-2 PHORS.

Ex. V.14. Consider the higher-order function Ξ : (R → R → R) → R → R → R recursively defined by

Ξ := Y λϕ f^{R→R→R} s^R n^R. if(n ≤ 0, s, f n (ϕ f s (n − 1)) ⊕_p f n (ϕ f s (n + 1)))

Let F : R → R → R be a function such that F n m terminates with no Y-reductions for all m and n. Then Ξ F x y has the antitone sparse ranking function
F n₁ (F n₂ (… (Ξ F x n) …)) ↦ g(n), where g is as in Ex. V.9, using the same antitone function too. The inequalities required for this to be an antitone partial supermartingale are satisfied by just the same reasoning as in Ex. V.9, therefore this term too is antitone rankable.

Again, this is not actually the correct reduction order, in that F should be applied to its argument n_i and reduced to a value before its second argument is expanded, but that would complicate the presentation, and this version can be justified by Thm. VI.15 instead. The reduction strategy implied here should be clear enough (assuming that, where it is not specified, it just matches cbv), but to be more precise, let r(F n₁ (… (F n_k X) …)) = @^k; cbv(X), where X is neither a value nor of the form F n Z for some non-value Z. All terms match this pattern for exactly one value of k ≥ 0, except for values. Within X, there is necessarily a redex at cbv(X). The only condition needed for @^k; cbv(X) to also be the position of a redex in the whole term is that, if the redex is a sample, it is not inside a λ that is inside a Y or on the right of an application; this holds because cbv never selects a redex inside a λ. After n reaches 0, the term is F n₁ (… (F n_k x) …), and the innermost F is evaluated completely in cbv order, therefore (by assumption) it terminates. Because there are no Y-reductions in the evaluation of F, it cannot reach any other term of the form F n′ X′, therefore k never changes until this whole subexpression reaches a number, at which point k decreases by 1 and the next F is evaluated.

Theorem V.15.
For any stopping time T which is almost surely finite, if (F_n)_n is the coarsest filtration to which T is adapted, then there is a supermartingale (Y_n)_n adapted to (F_n)_n and an antitone function ε such that (Y_n)_n is an antitone ranking supermartingale with respect to T and ε.

Proof. If T is bounded by b a.s. (for b a constant), the statement is trivially true by taking ε to be constantly 1 and Y_n = b − n. Otherwise, let (t_n)_{n≥0} be defined recursively such that t₀ = 0, t_{n+1} > t_n and P[T > t_{n+1}] ≤ P[T > t_n]/4. We will then define a supermartingale (Y_n)_{n≥0} and a nonrandom sequence (y_n)_{n≥0} such that Y_n = 0 iff T < n, Y_n = y_n iff T ≥ n, and y_n ≥ 2^k for n ≥ t_k.

The antitone function ε is defined piecewise and recursively in such a way as to force all the necessary constraints to hold: for x ∈ [0, 2), ε(x) = 1; for x ∈ [2^k, 2^{k+1}) (k ≥ 1), ε(x) = min(ε(2^{k−1}), 2^k/(t_{k+1} − t_k)). This ensures that
• ε is weakly decreasing;
• ε is strictly positive;
• for x ≥ 2^k, ε(x) ≤ 2^k/(t_{k+1} − t_k).

The sequence (y_n)_{n≥0} is then defined recursively by y₀ = 2 and y_{n+1} = (y_n − ε(y_n)) · P[T > n]/P[T > n+1].

The fact that y_{t_k} ≥ 2^{k+1} is proven by induction on k. For the base case, y_{t₀} = 2 ≥ 2, as required. For the inductive case k + 1, we first do another induction to prove that y_n ≥ 2^k for all t_k ≤ n ≤ t_{k+1}. The base case of this inner induction follows from the outer induction hypothesis a fortiori. Take the greatest k such that n > t_k. By the induction hypothesis, y_{t_k} ≥ 2^{k+1}.
y_n = y_{t_k} · P[T > t_k]/P[T > n] − Σ_{m=t_k}^{n−1} ε(y_m) · P[T > m+1]/P[T > n]
 ≥ (P[T > t_k]/P[T > n]) · (y_{t_k} − Σ_{m=t_k}^{n−1} ε(y_m))
 ≥ (P[T > t_k]/P[T > n]) · (y_{t_k} − Σ_{m=t_k}^{n−1} 2^k/(t_{k+1} − t_k))
 = (P[T > t_k]/P[T > n]) · (y_{t_k} − 2^k(n − t_k)/(t_{k+1} − t_k))
 ≥ (P[T > t_k]/P[T > n]) · (y_{t_k} − 2^k)
 ≥ (P[T > t_k]/P[T > n]) · (2^{k+1} − 2^k)
 = (P[T > t_k]/P[T > n]) · 2^k
 ≥ 2^k

as required. Substituting n = t_{k+1} and using the fact that P[T > t_k]/P[T > t_{k+1}] ≥ 4, the same reasoning gives y_{t_{k+1}} ≥ 2^{k+2}, as required.

Although the stronger induction hypothesis was necessary for the induction to work, the only reason it is needed is to prove that y_n > 0 for all n, which implies that Y_n ≥ 0. The other condition for (Y_n) to be an antitone-strict supermartingale is

E[Y_{n+1} | F_n] = E[y_{n+1} · 1{T ≥ n+1} | F_n]
 = y_{n+1} · E[1{T ≥ n+1} | F_n]
 = y_{n+1} · 1{T ≥ n} · P[T > n+1]/P[T > n]
 = (y_n − ε(y_n)) · 1{T ≥ n}
 = Y_n − ε(y_n) · 1{T ≥ n}

therefore (Y_n)_n is an antitone strict supermartingale with respect to (T, ε). The assumption that (F_n)_{n≥0} is the coarsest filtration to which T is adapted is used in the step that evaluates E[1{T ≥ n+1} | F_n]. This condition is the main reason that this proof does not extend directly to a completeness result for antitone ranking functions.

APPENDIX D
SUPPLEMENTARY MATERIALS FOR SEC. VI

The rules defining ∼c can be more intuitively understood as diagrams. In the following diagrams, a circle represents a term (or skeleton), and the leaves on the trees are subterms at the positions indicated by the labels in the other nodes. The labels on the arrows between trees are the positions of the reductions, and dashed arrows represent some number of reductions in a row.
If a reduction sequence includes one branch of one of these cases at some point, then the positions in the reduction sequence obtained by substituting in the other branch are all related by ∼c to the corresponding positions in the original reduction sequence.

Although the six cases in the definition of ∼c are defined separately, there are some commonalities worth noting. Each case (except a, which is symmetrical) has a right branch and a left branch (the branches of case b are unfortunately displayed the wrong way around in the diagram). The right branch has reduction steps at two positions, β then α, where β > α (i.e. β is inside α), and in the left branch the order of these reductions is swapped, so that the reductions are at α and then β₁, …, β_n. The position α of one of the reductions is unchanged, and the other reduction may have multiple (in cases e and f) or 0 (in cases b and e) images in the left branch. The images (β_i) are all still inside (greater than or equal to) α, and although they are not generally equal to β, the subskeletons at these positions have a similar shape to the subskeleton (initially) at position β, so that the reductions at these positions are the same type of reduction (e.g. both β-reductions or both if-reductions). Case a is also similar, except that the relevant positions are all disjoint rather than contained in one another, and either branch could be considered the right branch.

Figure 3. Illustration of the ∼c rules. [The case-by-case diagrams (a–f) are not reproduced here.]

Lemma D.1. If (N₁, α₁) ∼* (N₂, α₂), there are descendants N₁′, N₂′ of N₁ resp. N₂, and a position α′ in both of them, such that (N₁, α₁) ∼p* (N₁′, α′) ∼c* (N₂′, α′) ∼p* (N₂, α₂).

Proof. The ∼p relation can be split as ∼p = ∼↓ ∪ ∼↑, where (A, α) ∼↓ (B, β) if A → B and (A, α) ∼p (B, β), and similarly (A, α) ∼↑ (B, β) if B → A and (A, α) ∼p (B, β). At each stage of this proof, it will be assumed that the ∼c steps and the ∼p steps in the sequence from (N₁, α₁) to (N₂, α₂) have been rearranged so that there is never a ∼c immediately before a ∼↓, and never a ∼c immediately after a ∼↑. This rearrangement is always possible: if there is some subsequence (A, α) ∼c (B, α) ∼↓ (C, β), then there is an alternative path (A, α) ∼↓ (B′, β) ∼c (C, β), where the reduction and the ∼↓ from A to B′ are the same as those from B to C, and the ∼c step is the same but with the reduction sequences O₁ →* O₁′ and O₂ →* O₂′ in the definition of ∼c extended by the reduction B → C. With these rearrangements assumed, it suffices to prove that the ∼p steps can be rearranged so that all of the ∼↓ steps come before all of the ∼↑ steps (possibly introducing some ∼c steps in the process).

The rearrangement to put all of the ∼↓s before all of the ∼↑s proceeds by induction on the number of ∼↓s that occur after the first occurrence of a ∼↑. If this number is 0, all of the ∼p steps are already in the correct order, and we are done.
Otherwise, take the subsequence from the first ∼↑ to the first ∼↓ after it. This subsequence must be rearranged into some ∼↓s followed by some ∼↑s; once that is done, the number of ∼↓s after the first ∼↑ in the overall sequence will have decreased by 1.

Let this subsequence be A ∼↑ⁿ B ∼↓ C. Suppose for induction that A ∼↑ᵏ (D, δ) ∼↓* (E, ε) ∼↑* C for some D, δ, E, ε and 0 ≤ k ≤ n, where in the reduction sequence D →* E, for any pair of reductions at positions which are not disjoint, the reduction at the innermost (greater) position occurs first. Reduction sequences with this property will be called "parallel", as this is analogous to the parallel reduction introduced by Tait and Martin-Löf in their proofs of the Church-Rosser theorem [59, 60]. This induction is in reverse, with k decreasing from n to 0. If k = n, simply set (D, δ) = B and (E, ε) = C. If k = 0, then A is related to C by a sequence of ∼↓s then ∼↑s, as desired. In the intermediate steps, k must be decreased by one, so a subsequence consisting of one ∼↑ followed by a parallel sequence of ∼↓s must be replaced by a parallel sequence of ∼↓s followed by some ∼↑s.

For any individual ∼↑-then-∼↓ pair, let (A, α) ∼↑ (B, β) ∼↓ (C, γ). The skeleton B reduces to both A and C. If these are the same reduction, then (A, α) = (
C, γ) and this subsequence may be removed. If they are different, then the way they overlap corresponds to one of the cases in the definition of ∼c: case a if the reduction positions are disjoint, or one of the cases b–f if one of the reduction positions is inside the other. In any case, there is a reduction sequence from each of A and C to some common skeleton D (O₁ and O₂ in the definition of ∼c). For the same reason that (B, β) ∼↓ (C, γ), there is some δ such that (A, α) ∼↓* (D, δ); and similarly, for the same reason that (B, β) ∼↓ (A, α), for the same δ, (C, γ) ∼↓* (D, δ). There are a lot of cases to check for this statement, but all of them are rather simple, and similar to each other. There is therefore an alternative ∼* sequence from (A, α) to (C, γ), of the form (A, α) ∼↓* (D₁, δ) ∼c (D₂, δ) ∼↑* (C, γ), where D₁ and D₂ are the alternative reduction sequences leading to the same skeleton D. There are only multiple ∼↓ steps in the result if the position of the reduction B → C is inside the position of the reduction B → A; and similarly, there are only multiple ∼↑ steps in the result if the position of the reduction B → A is inside the position of the reduction B → C.
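The simplest instance of this joinability is case a: reductions at disjoint positions commute, so either order reaches the same skeleton D. A toy sketch (a stand-in term representation with index paths as positions, not the actual PPCF skeletons):

```python
# Toy illustration of case a of ~c: reductions at disjoint positions commute.
# Terms are nested tuples; a position is a path of child indices.
def replace_at(term, pos, new):
    if not pos:
        return new
    i, rest = pos[0], pos[1:]
    return tuple(replace_at(t, rest, new) if j == i else t for j, t in enumerate(term))

B = (("redex1",), ("redex2",))                      # two redexes at disjoint positions
reduce_left  = lambda t: replace_at(t, (0,), "v1")  # "reduce" at position (0,)
reduce_right = lambda t: replace_at(t, (1,), "v2")  # "reduce" at position (1,)

# Both orders join at the same common skeleton D:
assert reduce_right(reduce_left(B)) == reduce_left(reduce_right(B)) == ("v1", "v2")
```

The non-disjoint cases b–f are where the real work lies, because there one reduction can duplicate, erase, or relocate the other's redex.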
If there are multiple ∼↓ steps, they are not necessarily parallel, but in the only case where they are not (case f of ∼c), the subsequence of ∼↓ steps with the reductions at α; λ; @; β then α; λ; @; γ_i; Y; λ; β for each i with γ_i ≥ β (in the notation of the definition of ∼c) can be replaced by ∼↓s with the reductions at α; λ; @; γ′_i; Y; λ; β then α; λ; @; β, then another ∼c, where (γ′_i) is the list of positions in A (before the reduction at β to A′, unlike (γ_i)) where A|γ′_i = x and γ′_i > β.

The effect of this rearrangement is (ignoring the ∼cs) to swap the order of the ∼↑ and the ∼↓, except that if the position of one of these reductions is inside the other, that inner ∼p may be duplicated. If the positions are disjoint, both of them are unchanged (corresponding to case a of ∼c); if one is inside the other, the outer position is unchanged and the other resultant positions, although not equal to the original inner position, are still inside the outer position. Taking the subsequence consisting of one ∼↑ followed by a parallel sequence of ∼↓s, this rearrangement can be applied repeatedly to move the ∼↑ further along the sequence. If at some point a ∼↑ and a ∼↓ match (by relating the same (skeleton, position) pairs in opposite directions), they may be removed and the process may stop early. The ∼↑ may be duplicated if it passes a ∼↓ whose position is outside of the ∼↑'s position, but by the assumption that the sequence of ∼↓s is initially parallel, no position inside of this occurs later in the sequence, therefore all of the resultant ∼↑s pass the remaining ∼↓s without changing or duplicating them.
The ∼↑s may be further duplicated, but the number of steps left in this process is bounded by the product, over all of the remaining ∼↓s, of the number of occurrences of the variable relevant to the reduction, because for each ∼↓, that is more than the number of duplicates it could produce for any ∼↑ that passes it. It is also possible (earlier) for some of the ∼↓s to be duplicated, but the only case in which this process would not terminate (that both the ∼↓s and the ∼↑s continue being duplicated forever, so that the number of switches left to do never decreases) is prevented by the parallelness condition, as all of the duplications of ∼↓s occur before all of the duplications of ∼↑s.

It still remains to be shown that the sequence of ∼↓s left at the end of this can be made parallel. The changes that may have occurred since the previous version of this sequence (which was already known to be parallel) are that some of the ∼↓s may have been duplicated by passing the ∼↑, and, if that ∼↑ matches one of the ∼↓s, that ∼↓ is removed. The position of the ∼↑'s reduction is the same for all of these duplications, because it is not changed until later in the sequence, when the ∼↓s with positions outside of it may occur. The resultant positions after a duplication are in the same position relative to each other, and therefore the order of the remaining ∼↓s is automatically parallel, except that (letting the position of the ∼↑'s reduction be α) if there are ∼↓s with reductions at positions α; @; λ; β and α; @; γ, or α; Y; λ; β and α; Y; λ; γ, some of the resultant ∼↓s' positions may be inside each other where they originally were not. As in the case where a Y-reduction and a reduction at a position inside it are swapped, though, the order can be fixed by swapping some of the ∼↓ steps with each other.
In the first case, the reduction at position α; @; λ; β ends up at α; β, and the reduction at position α; @; γ ends up at the positions α; δ_i; γ, where (δ_i) is the list of positions of the variable relevant to the ∼↑'s reduction inside its lambda. The positions of the variable in the lambda may be different before the reduction at β, but the positions of the redexes corresponding to the original redex at α; @; γ are changed similarly. Rather than reducing at α; β and then at α; δ_i; γ, it is therefore possible to reach the same point by reducing at α; δ′_i; γ and then at α; β, where (δ′_i) is the list of positions of the relevant variable within the lambda before the reduction at α; @; λ; β (and again, a ∼c of some sort must also be introduced). The other case, where the ∼↑'s reduction is a Y-reduction, is similar. Each reduction which is duplicated has one image at α; λ; @ followed by its original relative position, and one for each occurrence of the variable. Those corresponding to variable occurrences may need to be moved before those at λ; @.

In summary, to reach the desired order of ∼p steps, each ∼↓ may have to be moved past some ∼↑ steps, and it may be duplicated in the process, but the resultant sequence of ∼↓s can all be put in parallel order, and it has been shown that moving a ∼↑ past a sequence of ∼↓s is always possible if they are in parallel order; therefore the overall process terminates and reaches a point where all of the ∼↓s precede all of the ∼↑s, with all of the ∼cs in between them.
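The shape of this termination argument (pushing each ∼↑ rightwards past the ∼↓s, at the cost of bounded duplication) can be illustrated by a toy string rewriting, with 'd' standing for a ∼↓ step and 'u' for a ∼↑ step; the duplication factor 2 in the rule below is an illustrative assumption, not the bound from the proof:

```python
# Toy abstraction (illustrative only): swapping an out-of-order adjacent pair
# "ud" may duplicate the ~down step; model this as the rewrite "ud" -> "ddu".
# Despite the duplication, the process terminates with all d's before all u's.
def normalise(seq):
    while "ud" in seq:
        i = seq.index("ud")                     # leftmost out-of-order pair
        seq = seq[:i] + "ddu" + seq[i + 2:]     # swap, duplicating the 'd'
    return seq

out = normalise("uud")
assert out.count("u") == 2                                  # no 'u' is created or destroyed
assert out == "d" * out.count("d") + "u" * out.count("u")   # all d's precede all u's
```

Termination holds because each swap strictly decreases the distance of the moved 'u' from the end of the string, and the duplicated 'd's appear to its left; the analogue in the proof is the parallelness condition.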
All of these rearrangements leave the end points of the sequence unchanged, therefore this is an alternative sequence of ∼ steps between (N₁, α₁) and (N₂, α₂), of the form (N₁, α₁) ∼↓* (N₁′, α′) ∼c* (N₂′, α′) ∼↑* (N₂, α₂).

If (X, α) ∼c (Y, α) then the final skeletons in X and Y are equal, and (X, β) ∼c (Y, β) for any β that occurs in this skeleton; therefore ∼c can be considered to apply to reduction sequences even without positions: X ∼c Y iff (X, ·) ∼c (Y, ·). The structure of ∼c can be seen more clearly by choosing a canonical example from each equivalence class. For this, the member of the class that is in call-by-value order is used. A reduction sequence is in call-by-value order if for every pair of reductions in the sequence at positions α and β, where the reduction at α happens first, either α ≤ β, or α and β are disjoint with α to the left of β (i.e. at the first element of the sequences α, β where they differ, the elements are the left and right components of an @, f_i and f_j for i < j, or if_i and if_j for i < j, in that order). Note that although the reductions that occur in the sequence are in call-by-value order, it may not be the actual call-by-value reduction sequence starting from that point, because some redexes may remain unreduced even when other reductions that should happen afterwards occur. It is only required that the reductions that do occur do so in the correct order.

Lemma D.2.
For any reduction sequence X, there is a unique reduction sequence X_cbv that is related to X by ∼c* and is in call-by-value order; moreover, if X ∼c* Y, then X_cbv = Y_cbv.

Proof. Define X_n recursively so that X₀ = X and, if X_n is not in CBV order, take the last pair of reductions in X_n which are not in CBV order. They are adjacent in the sequence, as the CBV order is a total order. Let the positions of these reductions be α and β respectively. If they are disjoint, they can be considered as one of the branches of ∼c case a. Otherwise β < α, and the sub-skeleton at position β at the appropriate point in the reduction sequence is a redex with non-trivial subterms, so it is of one of the forms if(X > 0, A, B), (λx.A) B or Y λx.A; therefore the reductions at α and β form the right branch (O₂) of one of the cases b, c, d, e or f of ∼c. In either case, define X_{n+1} to be equal to X_n except taking the other (left) branch, so that the order of the reductions at α and β is switched (although the equivalent(s) of the reduction at α may not actually be at that position).

This sequence of swaps rearranges the reductions in X rather like insertion sort (but in some cases duplicating or deleting the new element), and eventually reaches some X_n that is in CBV order. This X_n is defined to be X_cbv.

If Y ∼c X ∼c* X_cbv, and the reductions involved in Y ∼c X are not the last pair of reductions in Y that are not in CBV order, then the order of the ∼cs can be rearranged so that the sequence Y ∼c* X_cbv is the canonical sequence given above, and therefore Y_cbv = X_cbv. To be more precise about this rearrangement, consider the sequence of ∼c steps from Y to X_cbv. The first step takes some subsequence of the reduction sequence Y and swaps the order of those reductions. If X = X_cbv, then X = Y = Y_cbv already. Otherwise, consider the cases according to whether the image of this subsequence in X occurs earlier than the next pair of reductions to be swapped (i.e.
the last pair of reductions in X that are not in CBV order). If not, then either the step Y ∼c X changes this subsequence into CBV order, or out of it. If into it, then it was the last out-of-order pair in Y, therefore the whole sequence Y ∼c* X_cbv is already in the canonical order and Y_cbv = X_cbv. If it switches the pair to the wrong order, then the result of that switch is the last out-of-order pair in X, therefore Y = X₁, and the sequence X₁ ∼c* X_cbv is just a suffix of the sequence X ∼c* X_cbv, therefore (X₁)_cbv = X_cbv and again Y_cbv = X_cbv.

The remaining cases are those where the subsequence of reduction steps involved in Y ∼c X occurs before the last out-of-order pair in X, but possibly overlapping it. If there is no overlap, then the ∼c steps do not interfere with each other at all, and can simply be performed in the other order. This gives a different sequence of ∼c steps Y ∼c Y₁ ∼c X₁ ∼c* X_cbv. As Y_cbv = (Y₁)_cbv, we can proceed by induction in this case, and use the fact that (Y₁)_cbv = (X₁)_cbv = X_cbv to achieve the result.

In the other case, there is some overlap between the regions involved in Y ∼c X and X ∼c X₁, but the last reduction involved in X ∼c X₁ is later than the last reduction involved in Y ∼c X. There are only 2 reductions in X involved in X ∼c X₁, therefore the region involved in Y ∼c X ends on the first of the reductions involved in X ∼c X₁. Now consider the specific sequence of positions of the reductions in X involved in either of these steps. Let the last of these positions be α, the first γ, and the second-last, third-last and so on β₁, β₂, … respectively, and let the positions of the redexes in Y involved in Y ∼c X be β, then (γ_i), in order. Then X ∼c X₁ swaps β and α, and α comes before β in CBV order (i.e.
α < β or α is left of β), and the step Y ∼c X either proceeds forwards, towards CBV order, swapping β below γ to form γ then (β_i), or proceeds backwards, taking Y even farther from CBV order, swapping the (γ_i) above β to form γ then β. In each of the cases below, it is required to show that the canonical sequence (X_i)_i eventually reaches some term which is the same as one in (Y_j)_j, and that from that point on they match, so that they reach the same end result: X_cbv = Y_cbv.
• Y ∼c X in the backwards direction (so that Y is closer to CBV order than X is): by swapping the roles of X and Y, and of β and γ, this is equivalent to the forwards case. The assumption that α comes before β in CBV order (so that β, α is the pair to be swapped in X ∼c X₁) maps to the assumption that α comes before γ (so that the result does not just follow trivially from Y being equal to X₁), and vice versa. Then one of the cases below establishes that X_i = Y_j for some i and j, therefore, after swapping back, Y_i = X_j.
• Y ∼c X forwards, and α comes after γ in CBV order: in this case, the γ and α reduction steps are already in the correct order in Y, therefore β and γ are already the last out-of-order pair of reductions in Y, Y₁ = X, and Y ∼c X ∼c* X_cbv is already the canonical sequence from Y, therefore Y_cbv = X_cbv.
• Y ∼c X forwards, and α is disjoint from γ and β: in this case, the sequence (X_i)_i starts by swapping α and β by case a; then, because γ is disjoint from α and comes after it in CBV order, γ is right of α, so either all the remaining β_i s are ≥ γ (if β and γ are not disjoint) or there are no remaining β_i s (if β and γ are disjoint, in which case they swap by case a and there is only β). The sequence (X_i)_i therefore proceeds to swap all the remaining β_i s, and then γ, in order, below α, resulting in some X_i whose relevant sub-reduction-sequence is α, γ, β_k, …, β.
The other sequence (Y_j)_j starts by swapping γ then β below α, and then swapping γ and β. Because α is disjoint from γ, the reduction there does not affect the sub-skeleton at position γ, therefore this last ∼c step proceeds identically to how it did in Y ∼c X, resulting in γ, β_k, …, β_i. Overall, this results in some Y_j having the relevant sub-reduction-sequence α, γ, β_k, …, β, therefore Y_j = X_i for the aforementioned value of i.
• Y ∼c X forwards, α ≤ γ, and α (and therefore also γ) is disjoint from β: in this case, (X_i)_i proceeds by swapping β below α by case a, and then swapping γ below α by one of the other cases; (Y_j)_j proceeds by swapping γ and α, and then swapping β below α and then below all of the images of γ (which are ≥ α and therefore left of β). In both cases, the subterm at α when α and γ are swapped is unaffected by the reduction at β, therefore the swap produces the same result in both cases, and the overall sequence in both cases is α, then the images of γ, then β.
• Both γ and β are > α but disjoint from each other, and either the skeleton at position α is of the form if(X > 0, A, B), or it is of the form AB with γ and β both inside the same immediate subterm (both inside A, or both inside B): in this case, (X_i)_i proceeds by swapping β then γ below α, in each case possibly forming 0 or multiple images. The only way the numbers of images of these two positions can differ is if the subskeleton at α is an if, and there are 0 of one of them and 1 of the other, in which case they are trivially in the correct order already. Otherwise, all of the images of β and γ are disjoint from one another, therefore none of the positions, or their relative order, change when they are swapped, and the next part of the sequence (X_i)_i is just insertion sort running on the images of β and γ, with ∼c case a swaps, until they are in the correct order.
The sequence (Y_j)_j starts by swapping γ then β below α, and then, as before, sorting the images into the correct order by case a swaps. The images originally formed of γ and β are the same in both cases, therefore the final order is the same, so X_i = Y_j for some i and j.
• Y ∼c X forwards, where γ is inside the function subterm and β is inside the argument subterm of the application at α: let x be the variable involved in the β-reduction at α, so that the sub-skeleton at α is (λx.A) B for some A, B, where γ = α; @; λ; γ′ with A → A′ at γ′, and β = α; @; β′ with B → B′ at β′. In this case, (X_i)_i starts by swapping β below α by case e, producing one image of β for each x in A′, then swapping γ below α by case d, producing a single image α; γ′. For each of the instances of x in A left of γ′, the reduction at α; γ′ is then swapped below the corresponding image of β. The sequence (Y_j)_j starts by swapping γ below α, then β below α, but in this case the images of β produced may be different: here the swap happens earlier in the reduction sequence than γ, so that the body of the lambda is still A rather than A′, and the reduction at γ may rearrange or change the number of instances of x. Next, each of the images of β to the right of α; γ′ is swapped with it by case a. At this point, the next few reductions before α; γ′ will in general be images of β at positions inside α; γ′. The subskeleton at position α; γ′ before these reductions is (A|γ′)[B/x]; by the images of β this reduces to (A|γ′)[B′/x], and then it reduces at its root position (α; γ′ in the overall skeleton) to (A′|γ′)[B′/x]. For each position of an x in (A|γ′), there are 0 or more corresponding positions of x in (A′|γ′), depending on what type of reduction A → A′ is.
Because B and B′ cannot contain any instances of the variable involved in the reduction A → A′ (if it is a β-reduction or Y-reduction), each instance of B in (A|γ′)[B/x] has the same set of images in (A′|γ′)[B/x] as the corresponding x in (A|γ′). When those images of β that overlap with the image of γ are swapped below it, each of them has its own images, one at each position of B in (A′|γ′)[B/x] corresponding to its original copy of B in (A|γ′)[B/x]. After all of them are swapped below α; γ′ (and rearranged among themselves by case a), there is therefore one reduction at each copy of B in (A′|γ′)[B/x], i.e. one at position β′ relative to each x in (A′|γ′). They are therefore the same as the images of β in X that overlap with α; γ′, because those are also one reduction at position β′ relative to each x in (A′|γ′). Both (X_i)_i and (Y_j)_j therefore eventually reach a point where the relevant sub-reduction-sequence is α, the images of β to the left of α; γ′, α; γ′ itself, the aforementioned images of β that overlap with α; γ′, and then all the images of β to the right of α; γ′.
• Y ∼c X forwards, with γ and β disjoint and both > α, and the reduction at α a Y-reduction: this is similar to the previous case, but more complicated, so some definitions are needed to explain properly what is going on.
Let the term at α before any of the reductions be Y λx.A; let γ = α; Y; λ; γ′ and β = α; Y; λ; β′; let A|β′ = B → B′ with the reduction at the root position ·, and A|γ′ = C → C′ with the reduction at ·; let the positions in A where x occurs that are left of γ′, between γ′ and β′ (but still disjoint from both), and right of β′ be, respectively, (δ^l_i), (δ^m_i) and (δ^r_i); and let the positions where x occurs in B, C, B′ and C′ be, respectively, (δ^B_i), (δ^C_i), (δ^{B′}_i) and (δ^{C′}_i) (so that all the positions in A where x occurs, in left-to-right order, are (δ^l_i)_i, (δ^C_i)_i, (δ^m_i)_i, (δ^B_i)_i, (δ^r_i)_i).

The sequence (X_i)_i starts by swapping β then γ below α. With the shorthand γ′(ε) = α; λ; @; ε; Y; λ; γ′ and γ′ = α; λ; @; γ′ (and similarly for β′), the relevant portion of X is then α, (γ′(δ^l_i))_i, γ′, (γ′(δ^{C′}_i))_i, (γ′(δ^m_i))_i, (γ′(δ^B_i))_i, (γ′(δ^r_i))_i, (β′(δ^l_i))_i, (β′(δ^{C′}_i))_i, (β′(δ^m_i))_i, β′, (β′(δ^{B′}_i))_i, (β′(δ^r_i))_i. Next in (X_k)_k, for each i, γ′(δ^r_i) is swapped past all the images of β until it is immediately before β′(δ^r_i), by case a; then, for each i, γ′(δ^B_i) is swapped past all the images of β until β′, by case a, and then swapped with β′ by one of the other cases, depending on what type of reduction B → B′ is and on the relative positions. Even if that is case d or f, all of the images of γ′(δ^B_i) are disjoint from each other (and from the positions of the other reductions after β′), because the redex, C, does not contain any instances of the variable involved in the reduction B → B′.
Expanding the definitions, this is swapping α;λ;@;δ^B_i;Y;λ;γ′ with α;λ;@;β′ at a point in the reduction sequence where the subskeleton at α;λ;@;β′ is B with a mixture of Y λx.A and Y λx.A[C′/γ′] substituted for its occurrences of x, and one of these occurrences of x is at δ^B_i. For each δ^B_j, there are some corresponding δ^{B′}_k, where the images of the x at δ^B_j end up after the reduction B → B′. These cover all the δ^{B′}_k, and uniquely, and for each δ^B_i, the images of ζ;δ^B_i;θ after swapping it with ζ;β′ are precisely (ζ;δ^{B′}_k;θ) for those same values of k; therefore, after swapping all of the (γ′(δ^B_i))_i down past β′, their images are (γ′(δ^{B′}_i))_i. After swapping with β′, each of these images then swaps by case a to take its place among the (β′(δ^{B′}_i))_i. Next up in (Xᵢ)ᵢ, the (γ′(δ^m_i))_i s and the (γ′(δ^{C′}_i))_i s swap down past some images of β they are disjoint from to take their place before their matching image of β, then γ′ swaps past the (β′(δ^l_i))_i, stopping immediately before the first β′(δ^{C′}_i), then the (γ′(δ^l_i))_i and the (β′(δ^l_i))_i mix. The end result of the rearrangement of (this subsequence of) the reduction sequence is therefore α, (γ′(δ^l_i), β′(δ^l_i))_i, γ′, (γ′(δ^{C′}_i), β′(δ^{C′}_i))_i, (γ′(δ^m_i), β′(δ^m_i))_i, β′, (γ′(δ^{B′}_i), β′(δ^{B′}_i))_i, (γ′(δ^r_i), β′(δ^r_i))_i. The other sequence, (Yⱼ)ⱼ, proceeds similarly, but with the images of γ starting out after the images of β.
It swaps all the images of β that are right of γ to the appropriate places by case a, then swaps each of the (β′(δ^C_i))_i first to, and then past, γ′, resulting in (β′(δ^{C′}_i))_i for the same reason as with swapping those images of γ that overlapped with β′ past it in the (Xᵢ)ᵢ case above, then finally swaps the (β′(δ^l_i))_i s and the (γ′(δ^l_i))_i s into the correct order. As required, this produces the same result as in (Xᵢ)ᵢ.
• Y ∼c X forwards, and α < γ < β: Let γ′ be such that γ = α;ifᵢ;γ′, α;@;λ;γ′, α;@;γ′ or α;Y;λ;γ′, so that γ′ is the freely varying later part of γ as in the definition of ∼c, whichever case applies to swapping γ and α. Similarly, let β′ be the freely varying part of β relative to γ. Let the images of γ after swapping it with α (where this swap occurs later in the reduction sequence than β) then be (α;δᵢ;γ′)ᵢ, and the images of β after swapping it with γ be (γ;εᵢ;β′)ᵢ. Depending on the skeleton at α, and γ's position within it, (δᵢ)ᵢ may be empty (case b), a singleton containing · (cases c and d), all the positions of the relevant variable in the lambda in the skeleton at α (case e), or λ;@ and λ;@;ζ;Y;λ for each position ζ of the relevant variable (case f), but in all cases (except a, which is excluded because none of α, γ and β are disjoint) the set of images has this general structure. Furthermore, the positions (δᵢ)ᵢ are determined only by the general position of γ within α (the part excluded from γ′), and the positions of the relevant variable within the skeleton at α.
They do not depend on α or γ, so that if both initial positions were moved somewhere else the relative positions of the images of γ would be unaffected, and similarly if some position within γ were swapped with α, the relative positions of its images would be the same (except to the extent that, in case f, the positions of the relevant variable are taken after the inner reduction takes place, so that some of those may still change). Because there are so many of them and they don't actually affect the multiset of positions where reductions take place, we will be ignoring swaps by case a in the proof of this case, and just assuming that all the final positions end up in the correct order. The final set of reduction positions that we will be proving both (Xᵢ)ᵢ and (Yⱼ)ⱼ reach is α, (α;δᵢ;γ′, (α;δᵢ;γ′;εⱼ;β′)ⱼ)ᵢ, where as usual the sequences are expanded out to the full list. For the sequence (Xᵢ)ᵢ, first, the images of β in X are (γ;εᵢ;β′)ᵢ. For each of these in turn, it is swapped with α. By the fact mentioned above that sub-positions are mapped to the same set of images except in case f, the images of γ;εⱼ;β′ after this swap are (α;δ^j_i;γ′;εⱼ;β′)ᵢ, where (δ^j_i)ᵢ is the same as (δᵢ)ᵢ except that, in the case where this is not the first image of β swapped below α and the reduction at α is a Y-reduction (and therefore swaps with it proceed by case f), the positions of the relevant variable are taken after the reductions at γ and γ;εₖ;β′ for k ≤ j (this implies that for the first of these, δ^j_i = δᵢ). All the images α;δ^j_i;γ′;εⱼ;β′ where δ^j_i > λ;@ are then swapped with α;λ;@;γ′;ε_{j+1};β′ (which is α;δ^{j+1}_i;γ′;ε_{j+1};β′ for some i).
As in the case where α;Y < γ, β, with γ and β disjoint, the effects of these swaps may duplicate or delete the reductions in such a way that all the remaining images of γ;εⱼ;β′ are (α;δ^{j+1}_i;γ′;εⱼ;β′)ᵢ. This is repeated with δ^{j+2}_i and so on until these images are all in the correct order, at which point they are (α;δᵢ;γ′;εⱼ;β′)ᵢ. After all of these are done, γ is swapped with α, yielding (α;δ^−_i;γ′)ᵢ. As with the images of β, the subset of these which are inside α;λ;@;γ′;εⱼ;β′ for each j in turn are swapped with it, until they are all in the correct order and the images left of γ are (α;δᵢ;γ′)ᵢ. At this point, Xᵢ has the desired value. The sequence (Yⱼ)ⱼ proceeds similarly. First γ swaps with α, yielding α and (α;δᵢ;γ′)ᵢ, then β swaps with α, yielding one image of β for each δᵢ, except that, in case f, those images at positions inside α;λ;@;γ′ may differ because they are earlier in the reduction sequence than any of the images of γ. Each of the images of β in turn is moved to the corresponding image of γ, except that those overlapping with α;λ;@;γ′ may be duplicated or deleted on the way.
The images of β after just this process (which does not actually correspond to any term in (Yⱼ)ⱼ, but it is sufficiently independent that the order doesn't matter that much) are (α;δᵢ;γ′;β′′)ᵢ (where γ;β′′ = β). The images of these for each i, after swapping with α;δᵢ;γ′, are (α;δᵢ;γ′;εⱼ;β′)ⱼ, because the reductions produced by a swap are unaffected by its overall location, and the skeleton at α;δᵢ;γ′ at the relevant point in the sequence is equal to the skeleton at γ initially; therefore this is equivalent to immediately swapping β and γ as in Y ∼c X (except that in the case where the reduction at α is a Y-reduction, the skeleton at α;λ;@;γ′ is not actually equal to the skeleton initially at γ, because something was substituted in for the variable bound at α;Y, but this doesn't affect the variable involved in the reduction at γ, so this doesn't actually matter). Therefore also in this case, eventually the set of reductions in the relevant portion of Yⱼ is α, (α;δᵢ;γ′, (α;δᵢ;γ′;εⱼ;β′)ⱼ)ᵢ, therefore it is equal to some Xᵢ. In summary, if the swap in Y ∼c X doesn't overlap with the swap X ∼c X₁, the sequence of swaps can be rearranged until it does, and in any other case, there is some Yⱼ that is equal to some Xᵢ, therefore these sequences reach the same end point, and if Y ∼c X, then Y_cbv = X_cbv; by chaining these together, if Y ∼*c X, then Y_cbv = X_cbv.
c): If Y is in CBV order and Y ∼*c X, then Y_cbv = X_cbv, but also Y_cbv = Y because the sequence (Yᵢ)ᵢ terminates immediately; therefore X_cbv is the unique reduction sequence in call-by-value order that is related to X by ∼*c. Lemma VI.7.
The relation ∼ is defined on L(M) with reference to a particular starting term M, so different versions, ∼_M and ∼_N, can be defined starting at different terms. If M → N, then ∼*_N is equal to the restriction of ∼*_M to L(N).
Proof. ∼*_N is trivially a subset of ∼*_M because ∼_N is a subset of ∼_M. In the other direction, suppose (N₁, α₁) ∼*_M (N₂, α₂), where both N₁ and N₂ are descendants of N. By Lem. D.1, take (N₁, α₁) ∼*_{p,M} (N′₁, α′₁) ∼*_{c,M} (N′₂, α′₂) ∼*_{p,M} (N₂, α₂). The ∼*_p steps remain within Rch(N), and ∼_p does not depend on the history of the reduction sequences, therefore (N₁, α₁) ∼*_{p,N} (N′₁, α′₁) ∼*_{c,M} (N′₂, α′₂) ∼*_{p,N} (N₂, α₂). Let X be the call-by-value reduction sequence related to both N′₁ and N′₂ by ∼_{c,M} given by Lem. D.2. As M → N is the first reduction in the sequence N′₁, it is the last to be affected by the ∼_{c,M} sequence N′₁ ∼*_{c,M} X given by Lem. D.2, therefore it can be split into N′₁ ∼*_{c,M} Y₁ ∼*_{c,M} X, where Y₁ is in CBV order except possibly for its first reduction, which is still M → N. Let the position of the reduction M → N be β. The rearrangement of the reduction sequence Y₁ ∼*_{c,M} X consists of moving the reduction at β down past the other reductions in Y₁, possibly duplicating or deleting it in the process, but not affecting the positions or the correct order for any of the other reductions.
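The way a reduction is duplicated when it is moved past a β-step — one image per occurrence of the bound variable — can be made concrete. The following is a toy sketch (terms as nested tuples and positions as index paths; this encoding is ours, not the paper's formalism):

```python
# Terms: ("var", name), ("const", c), ("lam", x, body), ("app", fun, arg).
# A position is a tuple of child indices into these nested tuples.

def occurrences(term, x, pos=()):
    """Positions of free occurrences of variable x in term."""
    if term[0] == "var":
        return [pos] if term[1] == x else []
    if term[0] == "lam":
        return [] if term[1] == x else occurrences(term[2], x, pos + (2,))
    if term[0] == "app":
        return occurrences(term[1], x, pos + (1,)) + occurrences(term[2], x, pos + (2,))
    return []

def beta_images(redex, theta):
    """Images of a position inside the argument of a beta-redex (lam x. U) Z,
    at offset theta within Z: after the reduction, one copy appears at each
    occurrence of x in the body U."""
    _, (_, x, body), _ = redex
    return [zeta + theta for zeta in occurrences(body, x)]

# (lam x. (+ x) x) Z: the argument has two images, one per occurrence of x.
U = ("app", ("app", ("const", "+"), ("var", "x")), ("var", "x"))
redex = ("app", ("lam", "x", U), ("const", "Z"))
assert occurrences(U, "x") == [(1, 2), (2,)]
assert len(beta_images(redex, ())) == 2
```

If the body contains no occurrence of x, the list of images is empty — the moved reduction is deleted, matching the "possibly duplicating or deleting" behaviour described above.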
The reductions derived from M → N can be identified as follows: a position in some (reduction sequence of) term(s) is derived from another if it is related by the reflexive transitive closure of ↝, where (V, γ) ↝ (W, δ) just if V → W and one of the following cases holds:
• (V, γ) ∼p (W, δ)
• γ = δ, V → W at ε, and ε > γ
• V|ε = (λx.U)Z, U|ζ = x, γ = ε;@;θ and δ = ε;ζ;θ
• V|ε = Y(λx.U), γ = ε;Y;λ;ζ and δ = ε;λ;@;ζ
• V|ε = Y(λx.U), U|ζ = x, γ = ε;θ and δ = ε;λ;@;ζ;θ
Crucially, the reductions in X can be partitioned into two sets: those at positions derived from (M, β), and those at positions equal to the reductions in Y₁ (in the same order as they occur in Y₁). Using the same construction for N′₂ shows that the positions of the reductions in Y₂ are also the positions of the reductions in X other than those derived from (M, β), therefore Y₁ = Y₂, therefore N′₁ ∼*_{c,M} Y₁ ∼*_{c,M} N′₂, and every reduction sequence in this sequence starts with M → N, therefore N′₁ ∼*_{c,N} N′₂. Combining this with the ∼*_{p,N} s at the beginning and end then yields the desired result that (N₁, α₁) ∼*_N (N₂, α₂), therefore the restriction of ∼*_M to ∼_N's domain (Rch(N)) is a subset of ∼*_N, therefore the two versions of ∼ match on this domain, as desired. Lemma VI.9.
The relation ⇒ is Church-Rosser.
Proof. Suppose that (M, s) ⇒* (M₁, s₁) and also (M, s) ⇒* (M₂, s₂); then it is required to prove that there is some (M′, s′) such that both (M₁, s₁) ⇒* (M′, s′) and (M₂, s₂) ⇒* (M′, s′). First consider the special case where (M, s) ⇒ (M₁, s₁) and (M, s) ⇒ (M₂, s₂), with only a single step in each case. Let the positions of the redexes in M → M₁ and M → M₂ be α₁ and α₂ respectively.
First consider the case that α₁ and α₂ are disjoint. Let M₁ = M[X₁/α₁] and similarly for X₂, then let M′ = M[X₁/α₁][X₂/α₂]. As the positions are disjoint, the substitutions commute and both M₁ and M₂ reduce (with →) to M′. Let s′ = s ∘ i(M → M₁) ∘ i(M₁ → M′). The injection i(M → M₁) ∘ i(M₁ → M′) consists of prepending M → M₁ → M′ to each reduction sequence, but by case a of ∼c, this is equivalent to prepending M → M₂ → M′, which is i(M → M₂) ∘ i(M₂ → M′). In the case that the reduction M₁ → M′ isn't a sample-reduction, this is enough to establish that (M₁, s₁) ⇒ (M′, s′) (and similarly if the redex of M₂ → M′ isn't sample, (M₂, s₂) ⇒ (M′, s′)). If it is sample though, in order for it to be the case that (M₁, s₁) ⇒ (M′, s′), it is additionally necessary that X₂, the result of the reduction at α₂, be s₁(M₁, α₂), which follows from the fact that (M, α₂) ∼ (M₁, α₂) by case 1 of ∼p. The case that the reduction M₂ → M′ is a sample-reduction is similar. The case that α₁ = α₂ is trivial, because there is at most one possible ⇒ reduction at any given position, therefore (M₁, s₁) = (M₂, s₂) already. The remaining case is that α₁ < α₂ or α₁ > α₂. Assume without loss of generality that α₁ < α₂.
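The disjoint case is the usual diamond: substituting the two reducts in either order yields the same join. A minimal sketch (terms as nested tuples, positions as index paths; the encoding is illustrative, not the paper's):

```python
# Terms as nested tuples, e.g. ("app", f, a); children live at indices 1 and 2.
# A position is a path of child indices; disjoint = neither is a prefix of the other.

def put(term, pos, new):
    """Return term with the subterm at pos replaced by new (term[new/pos])."""
    if not pos:
        return new
    i = pos[0]
    return term[:i] + (put(term[i], pos[1:], new),) + term[i + 1:]

def disjoint(p, q):
    return p[:len(q)] != q and q[:len(p)] != p

M = ("app", ("app", ("var", "f"), ("const", 1)),
            ("app", ("var", "g"), ("const", 2)))
a1, a2 = (1,), (2,)                    # two disjoint redex positions
X1, X2 = ("const", 10), ("const", 20)  # their respective reducts
assert disjoint(a1, a2)

# Substituting in either order yields the same join M':
left = put(put(M, a1, X1), a2, X2)
right = put(put(M, a2, X2), a1, X1)
assert left == right == ("app", ("const", 10), ("const", 20))
```

When the positions are nested rather than disjoint, `put` in the two orders would not commute — which is exactly why the nested case of the proof needs the case analysis on ∼c instead.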
For each possible case of what type of redex M|α₁ is, and α₂'s position within it, there is a corresponding case of ∼c, and similarly to the case where α₁ and α₂ are disjoint, the term O₁ = O₂ from the definition of ∼c is a suitable value of M′. The term M|α₁ can't be sample, because it has strict subterms, but the case that M|α₂ = sample is still somewhat more complicated. α₂ can't be within α₁;@ or α₁;Y;λ (cases e and f of ∼c), because if α₂ > α₁;@ then M|α₁;@ must be a value, therefore α₂ > α₁;@;λ, and those are not valid positions to reduce a sample. The case that α₂ is within the branch of an if statement that is deleted by the reduction at α₁ (case b) doesn't actually present a problem, because there is no reduction corresponding to M → M₂ in the other branch, which leaves cases c and d: that α₂ is in the if(·, ·, ·) branch that isn't deleted, and that α₂ < α₁;@;λ. The values that the samples take in these cases match because (M, α₂) ∼p (M₁, α′₂), for the appropriate α′₂, by cases 3 and 2 of ∼p respectively.
d): In the case that (M, s) ⇒* (M₁, s₁) or (M, s) ⇒* (M₂, s₂) by multiple steps, it is possible to repeatedly replace a pair of the form A ⇐ B ⇒ C by A ⇒* D ⇐* C by the construction above, but it is not immediate that this process terminates, because each ⇒ or ⇐ may be replaced by multiple, so the sequence of ⇒s and ⇐s from (M₁, s₁) to (M₂, s₂) may get longer at some steps. However, termination can be proved by noting that the structure of this process is identical to the process of swapping ∼↑ and ∼↓ steps in Lem. D.1. To be more precise, in this case there is a sequence of ⇐ and ⇒ steps, each of which has an associated reduction position and initial and final terms, and if a ⇐ immediately precedes a ⇒, they may be swapped to produce some number of ⇒s, followed by some number of ⇐s. In the case of Lem.
D.1, there is a sequence of ∼↑ and ∼↓ steps, each of which has an associated reduction position and initial and final skeletons, and if a ∼↑ immediately precedes a ∼↓, they may be swapped to produce some number of ∼↓s followed by some number of ∼↑s. In both cases, the number and reduction positions of the resultant steps are determined by the case of ∼c that matches the way the initial reduction positions overlap, and the initial skeleton (or the skeleton of the initial term). The same argument that the process in Lem. D.1 terminates is therefore applicable here too. At every stage, the sequence of reductions that results from one of the initial ⇒s is a parallel sequence, and swapping a parallel sequence of ⇒s with a ⇐ always terminates, therefore each ⇒ in turn can be moved past all of the ⇐s, and the process as a whole will terminate in a state where all of the ⇒s precede all of the ⇐s, i.e. a pair of reduction sequences (M₁, s₁) ⇒* (M′, s′) ⇐* (M₂, s₂).
Lemma D.3. If A is some descendant of M and (A, γ) ∼* (M, δ), then (A, γ) ∼*_p (M, δ), with the length of the reduction sequences decreasing by one each step from A to M.
Proof. This is a simple induction on ∼*. In the base case, (A, γ) = (
M, δ), therefore (A, γ) ∼*_p (M, δ) trivially. Otherwise, suppose that (M, δ) ∼*_p (B, ε) ∼ (A, γ). Either (B, ε) ∼p (A, γ) or (B, ε) ∼c (A, γ). In the first case, either B → A, in which case the result follows directly, or A → B, in which case the fact that each (reduction sequence of) skeleton(s) has only one parent implies that (M, δ) ∼*_p (A, γ) directly, as a subsequence of the path to (B, ε). In the ∼c case, consider the definition of ∼c. Either O′₁ = A and O′₂ = B or vice-versa. As M →* N →* O₁ →* B and (M, δ) ∼*_p (B, ε), and ∼p only relates positions in a term and its parent, there are some positions ζ, θ such that (M, δ) ∼*_p (N, ζ) ∼*_p (O₁, θ) ∼*_p (B, ε). It follows that (O₂, θ) ∼*_p (A, γ) by following the same path, therefore it suffices to provide the only missing portion of the path from M to A, i.e. to prove that (N, ζ) ∼*_p (O₂, θ) given (N, ζ) ∼*_p (O₁, θ) (or vice-versa). If ζ is disjoint from all the positions of reduction from N to O₁ and O₂ (and consequently ζ = θ), this follows from case 1 of ∼p. Otherwise, this can be proved by taking cases from the definition of ∼c. This is rather long, but all of the cases are similar. The general idea is that the reductions from N to O₁ correspond to the reductions from N to O₂, so that if a position is related by ∼p across the reduction in one branch, it is related in the other branch for the same reason. Case d, where B = O′₂ rather than O′₁, is given here in more detail as an illustrative example:
Let I be N reduced at α, and J be N reduced at α;@;λ;β, so that N → I → O₁ and N → J → O₂. All of the reduction positions are ≥ α, and ζ is not disjoint from all of them, therefore ζ is not disjoint from α. Let ι be the position such that (N, ζ) ∼p (I, ι) ∼p (O₁, θ). The fact that (N, ζ) ∼p (I, ι) implies that ζ > α;@;λ.
Let ζ = α;@;λ;κ and θ = α;κ′. If κ is disjoint from β, then (N, ζ) ∼p (J, ζ) ∼p (O₂, α;κ′) = (O₂, θ) by cases 1 and 2 of ∼p. In the other case, that κ isn't disjoint from β, we have κ > β, because none of the positions ≤ α;@;λ;β in N are related to any position in I by ∼p. As (I, α;κ) ∼p (O₁, α;κ′) (with the redex at α;β), for exactly the same reason (N, α;@;λ;κ) ∼p (J, α;@;λ;κ′) (with the redex at α;@;λ;β). Because (N, α;@;λ;κ) ∼p (J, α;@;λ;κ′), N|α;@;λ;κ = J|α;@;λ;κ′, and N|α;@;λ;κ ≠ the variable of N|α;@, therefore J|α;@;λ;κ′ is also not the variable, therefore (J, α;@;λ;κ′) ∼p (O₂, α;κ′) = (O₂, θ) by case 2 of ∼p. Combining these results, (N, ζ) ∼*_p (O₂, θ) as desired.
Lemma D.4. If M → N, with the redex at position α, then no position in any term reachable from N is related by ∼* to (M, α).
Proof. Suppose on the contrary that (M, α) ∼* (A, β), where N →* A; then by Lem. D.3, (M, α) ∼*_p (A, β). M ≠ A, therefore (M, α) ∼p some position in N, but in all the cases of the definition of ∼p, no position in the child term is related to the position of the redex.
Essentially what Lem. D.4 demonstrates is that the samples taken during any reduction sequence are independent of each other. This is made more precise in the following lemmas. Lemma D.5.
For any skeletons M → N, with the redex at position α, and any measurable set of samples S ⊆ I^{L_s(N)}, µ(S) = µ({s ∈ I^{L_s(M)} | s ∘ i(M → N) ∈ S}); furthermore, if M|α = sample, then for any S ⊆ I × I^{L_s(N)}, µ(S) = µ({s ∈ I^{L_s(M)} | (s(M, α), s ∘ i(M → N)) ∈ S}).
Proof. If the result holds for all sets S of the form {s ∈ I^{L_s(N)} | ∀j : s(j) ∈ x_j}, where (x_j)_{j∈J} is a family of measurable subsets of I indexed by some finite set J ⊆ L_s(N) of potential positions, it also holds for all other S by taking limits and disjoint unions.
Take such an (x_j)_{j∈J}, and define K = i(M → N)[J]. Because i(M → N) is injective, it defines a bijection between J and K. We can then calculate the measure µ({s ∈ I^{L_s(M)} | s ∘ i(M → N) ∈ S}) = µ({s ∈ I^{L_s(M)} | ∀k : s(k) ∈ x_{i(M → N)⁻¹(k)}}) = ∏_{k∈K} µ_I(x_{i(M → N)⁻¹(k)}) = ∏_{j∈J} µ_I(x_j) = µ(S).
In the case that M|α = sample, we can similarly consider only those sets S of the form x_sample × {s ∈ I^{L_s(N)} | ∀j : s(j) ∈ x_j}. The position α in M is not related to any position in L_s(N), therefore (M, α) ∉ K. Again, we can calculate the measure µ({s ∈ I^{L_s(M)} | (s(M, α), s ∘ i(M → N)) ∈ S}) = µ({s ∈ I^{L_s(M)} | s(M, α) ∈ x_sample, ∀k : s(k) ∈ x_{i(M → N)⁻¹(k)}}) = µ_I(x_sample) ∏_{k∈K} µ_I(x_{i(M → N)⁻¹(k)}) = µ_I(x_sample) ∏_{j∈J} µ_I(x_j) = µ(S). Lemma D.6.
For any initial term M, reduction strategy f on M, natural number n, skeleton N with k holes, measurable set T ⊆ ℝᵏ and measurable set S of samples in I^{L_s(N)},
µ({s ∈ I^{L_s(M)} | ∃r ∈ T, s′ ∈ S : (M, s) ⇒ⁿ_f (N[r], s′)}) = µ({s ∈ I^{L_s(M)} | ∃r ∈ T, s′ ∈ I^{L_s(N)} : (M, s) ⇒ⁿ_f (N[r], s′)}) · µ(S).
Proof.
Suppose, to begin with, that n = 0. Either ∃r ∈ T : N[r] = M, in which case both sides of the equation are µ(S), or there is no such r, in which case both sides of the equation are 0.
For n > 0, suppose for induction that the lemma is true for n − 1, for all N, T and S. If ∃r ∈ T, s′ ∈ S : (M, s) ⇒ⁿ_f (N[r], s′), the r and s′ are necessarily unique, therefore we may assume that T is the product of k measurable subsets of ℝ, (T_j)_{j≤k} […]
Lemma. For any closed term M and reduction strategy r on M, if every term in Rch_r(M) is AST, then every term in Rch(M) is AST.
Proof. Suppose that (M, s) ⇒*_cbv (F, s′) for some trace s and term F which is not AST. Let P be the finite set of potential positions in M such that the corresponding sample is used in the reductions (M, s) ⇒*_cbv (F, s′). As F is not AST, the set T₀ = {t ∈ I^{L_s(M)∖P} | M fails to terminate with (t, s|_P) and cbv} has non-zero measure.
Take some order p₁, p₂, … on the elements of P, and define Tₙ ⊆ I^{L_s(M)∖P} recursively as follows. If the sample at pₙ is eventually used in the sequence (M, (t, s|_P)) ⇒*_r ⋯ for some positive-measure subset of the traces t in Tₙ₋₁, then let Tₙ be the subset of Tₙ₋₁ where this sample is used within m steps, for the minimal m such that this is of positive measure. Otherwise, let Tₙ be the subset of Tₙ₋₁ where the sample at pₙ is never used. Let T be the final such Tₙ. Also, let Pᵢ be the set of pₙ such that the second case was taken in the definition of Tₙ, i.e. the sample at pₙ is never used in the reduction sequences starting at (M, (t, s|_P)) with strategy r, and let k be the maximum value of m. Let T′ = T × ∏_{p∈Pᵢ} I × ∏_{p∈P∖Pᵢ} {s(p)}. Although T′ itself may have a measure of 0 as a subset of I^{L_s(M)}, it still has a natural measure-space structure of its own.
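The cylinder-set computation underlying Lem. D.5 (and the measure bookkeeping here) is elementary but worth seeing concretely. A sketch, with positions as strings and rectangles as dicts from positions to intervals (the names are illustrative, not the paper's):

```python
from math import prod

# A cylinder set over named sample positions: {s | s(j) in (a_j, b_j) for j in J}.
# Under the product of uniform measures on I = (0, 1), its measure is the
# product of the interval lengths.
def measure(rect):
    return prod(b - a for (a, b) in rect.values())

# Precomposing with an injection i : positions(N) -> positions(M) pulls a
# rectangle over L_s(N) back to one over L_s(M); unconstrained positions
# contribute a factor mu_I(I) = 1, so the measure is unchanged.
def pullback(rect, inj):
    return {inj[j]: iv for j, iv in rect.items()}

S = {"a": (0.1, 0.5), "b": (0.0, 0.25)}   # rectangle over positions of N
inj = {"a": "x", "b": "z"}                 # injective renaming into positions of M
assert measure(pullback(S, inj)) == measure(S)

# Adding an unconstrained position multiplies by mu_I(I) = 1:
assert measure({**S, "c": (0.0, 1.0)}) == measure(S)
```

This is exactly why the injection i(M → N) being injective suffices in Lem. D.5: it relabels the constrained coordinates bijectively and leaves the rest uniform.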
T′ also has the properties that M does not terminate with cbv and any trace in T′ (therefore M also does not terminate with r and any trace in T′, by Thm. VI.11), and that for any trace t ∈ T′, if (M, t) ⇒ᵏ_r (M′, t′), none of the potential positions in M′ correspond to any of the elements of P∖Pᵢ. There are finitely many skeletal reduction sequences k steps long starting at M and following r, each one corresponding to some measurable subset of I^{L_s(M)}, so pick one such that its intersection with T′ is a non-null subset of T′, and let this subset be Q.
Pick some p ∈ Q, and let (M, p) ⇒ᵏ_r (N, p′). The injection i(M →ᵏ N) is the same as it would be for any other element of Q, and its image is disjoint from P∖Pᵢ. The inverse image of Q under composition with i(M →ᵏ N) is therefore a non-null subset of I^{L_s(N)}, and N fails to terminate with r and every trace in this set, therefore N is not AST with respect to r, therefore N is not AST; but N ∈ Rch_r(M), therefore not every element of Rch_r(M) is AST. Taking the contrapositive of this, we have the desired result.
Ex. VI.17: Recall that
M = (λn.P[n]) ⌊sample^{−1/2} − 1⌋
P[n] = (λp. Ξ[p] n)(λx. Θ[x] n)
Ξ[p] = Y λf n. if(n = 0, 0, p n + f(n − 1))
Θ[x] = Y λf n. if(n = 0, 1, x × f(n − 1)).
Let r be the reduction strategy which reduces M in the order
M →* P[n] →* (λp. Ξ[p] n)(λx. x × ⋯ × x × 1 (n factors of x)) → Ξ[λx. x × ⋯ × x × 1] n →* nⁿ + ((n − 1)ⁿ + ⋯ + 0ⁿ) →* ∑_{k=0}^n kⁿ
(this is a little underspecified, but the details are not important). A suitable sparse ranking function for it is
M ↦ π²/3
P[n] ↦ 2n + 2
(λp. Ξ[p] n)(λx. x × ⋯ × x (n − k factors) × (Θ[x] k)) ↦ n + k + 2
Ξ[λx.
x × ⋯ × x (n factors)] n ↦ n + 1
nⁿ + ⋯ + (n − k + 1)ⁿ + Ξ[λx. x × ⋯] k ↦ k + 1
∑_{k=0}^n kⁿ ↦ 0.
The first step is justified because the probability of reaching P[n] for any specific n ≥ 0 is (n + 1)⁻² − (n + 2)⁻² = (2n + 3)/((n + 1)²(n + 2)²), therefore the expected next value of the ranking function is ∑_{n=0}^∞ (2n + 2)(2n + 3)/((n + 1)²(n + 2)²) = π²/3. The other steps are deterministic, and involve 1 or 0 Y-reductions each. Given that all of the reduction steps after the first are deterministic, and the number of Y-reduction steps is not that hard to count, defining the sparse ranking function's values only at M and P[n] would also have been reasonable, although in a similar term which had more random samples throughout, that would not have been so simple.
Providing a ranking function of some sort for this term in a similar level of detail would have been rather more complicated using only the standard reduction strategy. The recursion in Θ would have to be evaluated separately for every time it was used, so there would have been more (and more complex) terms to consider in the reduction sequence. Also, the number of Y-reductions from P[n] would have been n² + 2n + 1 instead of 2n + 2, therefore the expected number of Y-reductions starting from M would be infinite, i.e. M is not Y-PAST, therefore it is not even rankable. It would be possible to define an antitone sparse ranking function for it, but it is much more difficult to construct: it must assign values not only to M, P[n] and ∑_{k=0}^n kⁿ, but also to the intermediate terms Ξ[λx. Θ[x] n] n, nⁿ + ⋯ + kⁿ + Ξ[λx. Θ[x] n](k − 1), nⁿ + ⋯ + (k + 1)ⁿ + Θ[k] n + Ξ[λx. Θ[x] n](k − 1), and nⁿ + ⋯ + (k + 1)ⁿ + k × ⋯ × k × Θ[k](k − 1) + Ξ[λx. Θ[x] n](k − 1).
It wouldn't even be possible in this case to give a sparser version of the ranking function, defined only at M and P[n], because the amount that an antitone sparse ranking function must decrease at each Y-reduction step depends on the value of the ranking function at the next term where it is defined; therefore it is necessary to have these intermediate steps in order for the ranking function to be able to change sufficiently slowly.
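The arithmetic behind Ex. VI.17 can be checked numerically. The sketch below is ours, under the reconstruction that P[n] has rank 2n + 2 and is reached with probability (n + 1)⁻² − (n + 2)⁻²; it confirms the expected next value π²/3, the computed value ∑_{k=0}^n kⁿ, and the two Y-reduction counts (n + 1)² (standard order) versus 2n + 2 (strategy r):

```python
import math

# Expected next value of the sparse ranking function after the first step:
# sum over n of rank(P[n]) * P(reaching P[n]), with rank 2n+2 and probability
# (n+1)^-2 - (n+2)^-2 (as reconstructed in the example).
expected = sum((2 * n + 2) * ((n + 1) ** -2 - (n + 2) ** -2) for n in range(10 ** 6))
assert abs(expected - math.pi ** 2 / 3) < 1e-5

def run_cbv(n):
    """Standard order: Theta is re-unfolded at every call; count unfoldings."""
    unfolds = 0
    def theta(x, m):
        nonlocal unfolds
        unfolds += 1                    # one Y-unfolding per recursive call
        return 1 if m == 0 else x * theta(x, m - 1)
    def xi(p, m):
        nonlocal unfolds
        unfolds += 1
        return 0 if m == 0 else p(m) + xi(p, m - 1)
    return xi(lambda k: theta(k, n), n), unfolds

def run_r(n):
    """Strategy r: unroll Theta[x] n once, under the lambda, then reuse it."""
    unfolds = n + 1                     # the single unrolling of Theta
    def q(x):                           # the unrolled body x * ... * x * 1
        acc = 1
        for _ in range(n):
            acc *= x
        return acc
    def xi(m):
        nonlocal unfolds
        unfolds += 1
        return 0 if m == 0 else q(m) + xi(m - 1)
    return xi(n), unfolds

for n in range(1, 8):
    v1, u1 = run_cbv(n)
    v2, u2 = run_r(n)
    assert v1 == v2 == sum(k ** n for k in range(n + 1))
    assert u1 == (n + 1) ** 2 and u2 == 2 * n + 2
```

The quadratic-versus-linear unfolding count is the whole point of the alternative strategy: reducing Θ once under the λ before the application makes the expected Y-reduction count finite.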