CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in Coq
Koundinya Vajjha, Avraham Shinnar, Vasily Pestun, Barry Trager, Nathan Fulton
KOUNDINYA VAJJHA∗, University of Pittsburgh
AVRAHAM SHINNAR, IBM Research
VASILY PESTUN, IBM Research and IHES
BARRY TRAGER, IBM Research
NATHAN FULTON, IBM Research
Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward. The desire to use reinforcement learning in safety-critical settings has inspired a recent line of work on formally constrained reinforcement learning; however, these methods place the implementation of the learning algorithm in their Trusted Computing Base. The crucial correctness property of these implementations is a guarantee that the learning algorithm converges to an optimal policy. This paper begins the work of closing this gap by developing a Coq formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes. The central results are a formalization of Bellman's optimality principle and its proof, which uses a contraction property of the Bellman optimality operator to establish that a sequence converges in the infinite horizon limit. The CertRL development exemplifies how the Giry monad and mechanized metric coinduction streamline optimality proofs for reinforcement learning algorithms. The CertRL library provides a general framework for proving properties about Markov decision processes and reinforcement learning algorithms, paving the way for further work on the formalization of reinforcement learning algorithms.

Additional Key Words and Phrases: Formal Verification, Policy Iteration, Value Iteration, Reinforcement Learning, Coinduction
Reinforcement learning (RL) algorithms solve sequential decision-making problems in which the goal is to choose actions that maximize a quantitative utility function [Bel54, How60, Put94, SB98]. Recent high-profile applications of reinforcement learning include beating the world's best players at Go [SHM+]. The output of a reinforcement learning algorithm is a policy that specifies which action(s) should be taken in a given state. The primary correctness property for reinforcement learning algorithms is convergence: in the limit, a reinforcement learning algorithm should converge to a policy that optimizes for the expected future-discounted value of the reward signal.

This paper contributes CertRL, a formal proof of convergence for value and policy iteration. Value and policy iteration are canonical model-based reinforcement learning algorithms. They are often taught as the first reinforcement learning methods in machine learning courses because the algorithms are relatively simple but their convergence proofs contain the main ingredients of a typical convergence argument for a reinforcement learning algorithm.

There is a cornucopia of presentations of these iterative algorithms and an equally diverse variety of proof techniques for establishing convergence. Many presentations state but do not prove the fact that the optimal policy of an infinite-horizon Markov decision process with γ-discounted reward is a stationary policy; i.e., the optimal decision in a given state does not depend on the time step at which the state is encountered.

∗The work described in this paper was performed while Koundinya Vajjha was an intern at IBM.
Authors' addresses: Koundinya Vajjha, [email protected], University of Pittsburgh; Avraham Shinnar, [email protected], IBM Research; Vasily Pestun, [email protected], IBM Research and IHES; Barry Trager, [email protected], IBM Research; Nathan Fulton, [email protected], IBM Research.
Following this convention, this paper contributes the first formal proof that policy and value iteration converge in the limit to the optimal policy in the space of stationary policies for infinite-horizon Markov decision processes. In addition to establishing convergence results for the classical iterative algorithms under classical infinitary and stationarity assumptions, we also formalize an optimality result about n-step iterations of value iteration without a stationarity assumption. The former formalization matches the standard theoretical treatment, while the latter is closer to real-world implementations.

In all cases, the convergence argument for policy/value iteration proceeds by proving that a contractive mapping converges to a fixed point and that this fixed point is an optimum. This is typical of convergence proofs for reinforcement learning algorithms. CertRL is intentionally designed for ongoing reinforcement learning formalization efforts.

Formalizing the convergence proof directly would require complicated and tedious ϵ-hacking as well as long proofs involving large matrices. CertRL obviates these challenges using a combination of the Giry monad [Gir82, Jac18] and a proof technique called
metric coinduction [Koz07].

Metric coinduction was first identified by Kozen and Ruozzi as a way to streamline and simplify proofs of theorems about streams and stochastic processes [KR09]. Our convergence proofs use a specialized version of metric coinduction called contraction coinduction [FHM18] to reason about order statements concerning fixed points of contractive maps. Identifying a coinduction hypothesis allows us to automatically infer that a given (closed) property holds in the limit whenever it holds ab initio. The coinduction hypothesis guarantees that this property is a limiting invariant. This is significant because the low-level ϵ–δ arguments – typically needed to show that a given property holds of the limit – are now neatly subsumed by a single proof rule, allowing reasoning at a higher level of abstraction.

The finitary Giry monad is a monad structure on the space of all finitely supported probability mass functions on a set. Function composition in the Kleisli category of this monad recovers the Chapman-Kolmogorov formula [Per19, Jac18]. Using this fact, our formalization recasts iteration of a stochastic matrix in a Markov decision process as iterated Kleisli composites of the Giry monad, starting at an initial state. Again, this makes the presentation cleaner since we identify and reason about the basic operations of bind and ret, thus bypassing the need to define matrices and matrix multiplication and substantially simplifying convergence proofs.

This paper shows how these two basic building blocks – the finitary Giry monad and metric coinduction – provide a compelling foundation for formalizing reinforcement learning theory. CertRL develops the basic concepts in reinforcement learning theory and demonstrates the usefulness of this library by proving several results about value and policy iteration.
CertRL contains a proof of Bellman's optimality principle, an inductive relation on the optimal value and policy over the horizon length of the Markov decision process. The development also contains proofs of convergence for value iteration and policy iteration over infinite time horizons, two canonical reinforcement learning algorithms [Bel54, How60, Put94].

In practice, reinforcement learning algorithms almost always run in finite time by either fixing a run time cutoff (e.g., a number of training steps) or by stopping iteration after the value/policy changes become smaller than a fixed threshold. Therefore, our development also formalizes a proof that n-step value iteration satisfies a finite time analogue of our convergence results.

To summarize, the CertRL library contains:
(1) a formalization of Markov decision processes and their long-term values in terms of the finitary Giry monad,
(2) a formalization of optimal value functions and the Bellman operator,
(3) a formal proof of convergence for value iteration and a formalization of the policy improvement theorem in the case of stationary policies, and
(4) a formal proof that the optimal value function for finitary sequences satisfies the finite time analogue of the Bellman equation.

Throughout the text which follows, hyperlinks to theorems, definitions and lemmas which have formal equivalents in the Coq [Tea] development are indicated by a ✿.

We provide a brief introduction to value/policy iteration and to the mathematical structures upon which our formalization is built: contractive metric spaces, metric coinduction, the Giry monad and Kleisli composition.
This section gently introduces the basics of reinforcement learning with complete information about the stochastic reward and transition functions. In this simplified situation the focus of the algorithm is on optimal exploitation of reward. This framework is also known as the stochastic optimal control problem.

We give an informal definition of Markov decision processes, trajectories, long-term values, and dynamic programming algorithms for solving Markov decision processes. Many of these concepts will be stated later in a more formal type-theoretic style; here, we focus on providing an intuitive introduction to the field.

The basic mathematical object in reinforcement learning theory is the Markov decision process. A Markov decision process is a 4-tuple (S, A, R, T) where S is a set of states, A is a set of actions, R : S × A × S → ℝ is a reward function, and T is a transition relation on states and actions mapping each (s, a, s′) ∈ S × A × S to the probability that taking action a in state s results in a transition to s′. Markov decision processes are so-called because they characterize a sequential decision-making process (each action is a decision) in which the transition structure on states and actions depends only on the current state.

Example 1 (CeRtL the Turtle ✿). Consider a simple grid world environment in which a turtle can move in cardinal directions throughout a 2D grid. The turtle receives +1 point for collecting stars, -10 for visiting red squares, and +2 for arriving at the green square. The turtle chooses which direction to move, but with some fixed probability will instead move in the opposite direction. For example, if the turtle takes action left then it will usually go left but will occasionally slip and go right instead.
The game ends when the turtle arrives at the green square.

This environment is formulated as a Markov decision process as follows:
• The set of states S are the coordinates (x, y) of the boxes in the grid ✿.
• The set of actions A are {up, down, left, right} ✿.
• The reward function ✿ assigns +1 to transitions that collect a star, −10 to transitions into a red square, +2 to transitions into the green square, and 0 to all other transitions.
• The transition probabilities are as described above ✿; e.g., T((x, y), up, ·) places most of its probability mass on the square above (x, y) and the remaining mass on the square below it.

(We recommend MacOS users view this document in Adobe, Firefox, or Chrome, as Preview and Safari parse the URLs linked to by ✿'s incorrectly.)

Fig. 1. An example grid-world environment ✿.

This environment is implemented in CertRL. We first define a matrix whose indices are states (x, y) and whose entries are colors {red, green, star, empty}. We then define a reward function that maps from matrix entries to a reward depending on the color of the turtle's current state. We also define a transition function that comports with the description given above. At last, we prove that this combination of states, actions, transitions and rewards inhabits our MDP type. Therefore, all of the theorems developed in this paper apply directly to our Coq implementation of the CertRL Turtle environment.
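To make the slippery dynamics concrete, here is a minimal Python sketch of such a grid world. The grid dimensions and the slip probability `SLIP` are illustrative placeholders, not the values fixed in the formal development.

```python
# Illustrative encoding of a slippery grid-world MDP.
# SLIP and the grid size are placeholder values, not taken from the paper.
SLIP = 0.25
GRID_W, GRID_H = 4, 3

STATES = [(x, y) for x in range(GRID_W) for y in range(GRID_H)]
ACTIONS = ["up", "down", "left", "right"]
OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}
MOVE = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Deterministic move, clipped at the grid boundary."""
    dx, dy = MOVE[action]
    x, y = state
    return (min(max(x + dx, 0), GRID_W - 1), min(max(y + dy, 0), GRID_H - 1))

def transition(state, action):
    """T(s, a): a finitely supported distribution over next states.
    With probability 1 - SLIP the move succeeds; with probability SLIP
    the turtle slips and moves in the opposite direction."""
    dist = {}
    intended, slipped = step(state, action), step(state, OPPOSITE[action])
    dist[intended] = dist.get(intended, 0.0) + (1 - SLIP)
    dist[slipped] = dist.get(slipped, 0.0) + SLIP
    return dist
```

Each `transition(s, a)` is a probability mass function with finite support, exactly the kind of object the finitary Giry monad (introduced below) manipulates.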
The goal of reinforcement learning is to find a policy (π : S → A) specifying which action the algorithm should take in each state. This policy should maximize the amount of reward obtained by the agent. A policy is stationary if it is not a function of time; i.e., if the optimal action in some state s ∈ S is always the same and, in particular, independent of the specific time step at which s is encountered.

Reinforcement learning agents optimize for a discounted sum of rewards – placing more emphasis on reward obtained today and less emphasis on reward obtained tomorrow. A constant discount factor from the open unit interval, typically denoted by γ, quantitatively discounts future rewards and serves as a crucial hyper-parameter to reinforcement learning algorithms.

Value iteration, invented by Bellman [Bel54], is a dynamic programming algorithm that finds optimal policies for reinforcement learning problems by iterating a contractive mapping. Value iteration is defined in terms of a value function V_π : S → ℝ, where V_π(s) is the expected value of state s when following policy π from s.

Data:
Markov decision process (S, A, T, R); initial value function V_0; threshold θ > 0; discount factor 0 ≤ γ < 1.
Result: V*, the value function for an optimal policy.
for n from 0 to ∞ do
    for each s ∈ S do
        V_{n+1}[s] = max_a Σ_{s′} T(s, a, s′)(R(s, a, s′) + γ V_n[s′])
    end
    if ∀s, |V_{n+1}[s] − V_n[s]| < θ then
        return V_{n+1}
    end
end
Algorithm 1: Pseudocode for Value Iteration.

The optimal policy π* is then obtained by
π*(s) = argmax_{a ∈ A} Σ_{s′} T(s, a, s′)(R(s, a, s′) + γ V_{n+1}[s′]).

Policy iteration follows a similar iteration scheme, but with a policy estimation function Q_π : S × A → ℝ, where Q_π(s, a) estimates the value of taking action a in state s and then following the policy π. In Section 3.4 we will demonstrate a formalized proof that V_n is the optimal value function of an MDP process of length n; this algorithm implements the dynamic programming principle.

Our formalization uses metric coinduction to establish convergence properties for infinite sequences. This section recalls the Banach fixed point theorem and explains how this theorem gives rise to a useful proof technique.

A metric space (X, d) is a set X equipped with a function d : X × X → ℝ satisfying certain axioms that ensure d behaves like a measurement of the distance between points in X. A metric space is complete if the limit of every Cauchy sequence of elements in X is also in X.

Let (X, d) denote a complete metric space with metric d. Subsets of X are modeled by terms of the function type ϕ : X → Prop. Another interpretation is that ϕ denotes all those terms of X which satisfy a particular property. These subsets are also called Ensembles in the Coq standard library.

A Lipschitz map ✿ is a mapping that is Lipschitz continuous; i.e., a mapping F from (X, d_X) into (Y, d_Y) for which there is some K ≥ 0 such that for all x_1, x_2 ∈ X,
d_Y(F(x_1), F(x_2)) ≤ K d_X(x_1, x_2).
The constant K is called a Lipschitz constant. A map F : X → X is called a contractive map ✿, or simply a contraction, if there exists a constant 0 ≤ γ < 1 such that
d(F(u), F(v)) ≤ γ d(u, v)  for all u, v ∈ X.
Contractive maps are Lipschitz maps with Lipschitz constant γ < 1.

Theorem 2 (Banach fixed point theorem). If (X, d) is a nonempty complete metric space and F : X → X is a contraction, then F has a unique fixed point; i.e., there exists a point x* ∈ X such that F(x*) = x*. This fixed point is
x* = lim_{n→∞} F^(n)(x_0)
where F^(n) stands for the n-th iterate of the function F and x_0 is an arbitrary point in X.

The Banach fixed point theorem generalizes to subsets of X.

Theorem 3 (Banach fixed point theorem on subsets ✿). Let (X, d) be a complete metric space and ϕ a closed nonempty subset of X. Let F : X → X be a contraction and assume that F preserves ϕ; in other words, ϕ(u) → ϕ(F(u)). Then F has a unique fixed point in ϕ; i.e., a point x* ∈ X such that ϕ(x*) and F(x*) = x*. The fixed point of F is given by x* = lim_{n→∞} F^(n)(x_0), where F^(n) stands for the n-th iterate of the function F.

Both the Banach fixed point theorem and the more general theorem on subsets were previously formalized in Coq by Boldo et al. [BCF+], where X is either a CompleteSpace or a
CompleteNormedModule.

The fixed point of F in Theorem 3 is unique, but it depends on an initial point x_0 ∈ X, which F then iterates on. Uniqueness of the fixed point implies that different choices of the initial point still give the same fixed point ✿.

To emphasize how this theorem is used in our formalization, we restate it as an inductive proof rule: from the premises that ϕ is closed, that ∃x_0, ϕ(x_0), and that ϕ(u) → ϕ(F(u)), conclude ϕ(fix F x_0) ✿ (1).

This proof rule states that in order to prove some closed ϕ is a property of a fixed point of F, it suffices to establish the standard inductive assumptions: that ϕ holds for some initial x_0, and that if ϕ holds at u then it also holds after a single application of F to u. In this form, the Banach fixed point theorem is called metric coinduction. The rule (1) is coinductive because it is equivalent to the assertion that a certain coalgebra is final in a category of coalgebras. (Details are given in Section 2.3 of Kozen and Ruozzi [KR09].)
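The construction of the fixed point as the limit of iterates is directly computable. The following Python sketch iterates an assumed example contraction F(x) = x/2 + 1 on ℝ (Lipschitz constant 1/2, unique fixed point 2) and illustrates that the limit does not depend on the starting point:

```python
def banach_fixed_point(F, x0, tol=1e-12, max_iter=100_000):
    """Iterate a contraction F from x0 until successive iterates are
    within tol of each other; returns the approximate fixed point."""
    x = x0
    for _ in range(max_iter):
        fx = F(x)
        if abs(fx - x) < tol:
            return fx
        x = fx
    raise RuntimeError("did not converge; is F really a contraction?")

# Example contraction with constant 1/2; its unique fixed point is 2.
F = lambda x: 0.5 * x + 1.0
```

The error shrinks by the contraction factor at every step, which is exactly the quantitative content of the Banach fixed point theorem.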
Definition 4 (Ordered Metric Space). A metric space X is called an ordered metric space if the underlying set X is partially ordered and the sets {z ∈ X | z ≤ y} and {z ∈ X | y ≤ z} are closed sets in the metric topology for every y ∈ X.

For ordered metric spaces, metric coinduction specializes to [FHM18, Theorem 1], which we restate as Theorem 5 below.

Theorem 5 (Contraction coinduction).
Let X be a non-empty, complete ordered metric space. If F : X → X is a contraction and is order-preserving, then:
• ∀x, F(x) ≤ x ⇒ x* ≤ x ✿, and
• ∀x, x ≤ F(x) ⇒ x ≤ x* ✿,
where x* is the fixed point of F.

We will use the above result to reason about Markov decision processes. However, doing so requires first setting up an ordered metric space on the function space A → ℝ where A is a finite set ✿.

Let A be a finite set ✿. We endow the function space A → ℝ with the natural structure of a vector space and with the L∞ norm ✿:
∥f∥∞ = max_{a ∈ A} |f(a)|   (2)

Our development establishes several important properties about this function space. The norm (2) is well-defined because A is finite, and it furthermore induces a metric that makes A → ℝ a metric space ✿. With this metric, the space of functions A → ℝ is also complete ✿. From ℝ this space inherits a pointwise order; viz., for functions f, g : A → ℝ,
f ≤ g ⟺ ∀a ∈ A, f(a) ≤ g(a) ✿
f ≥ g ⟺ ∀a ∈ A, f(a) ≥ g(a) ✿

We also prove that the sets {f | f ≤ g} ✿ and {f | f ≥ g} ✿ are closed in the norm topology. Our formalization of the proof of closedness for these sets relies on classical reasoning. Additionally, we rely on functional extensionality to reason about equality between functions.

We now have an ordered metric space structure on the function space A → ℝ when A is finite. Constructing a contraction on this space will allow an application of Theorem 5. Once we set up a theory of Markov decision processes we will have natural examples of such a function space and contractions on it. Before doing so, we first introduce the Giry monad.

A monad structure on the category of all measurable spaces was first described by Lawvere in [Law62] and was explicitly defined by Giry in [Gir82]. This monad has since been called the Giry monad.
While the construction is very general (applying to arbitrary measures on a space), for our purposes it suffices to consider finitely supported probability measures.

The Giry monad for finitely supported probability measures is called the finitary Giry monad, although it sometimes also goes by the more descriptive names distribution monad and convex combination monad.

On a set A, let P(A) denote the set of all finitely-supported probability measures on A ✿. An element of P(A) is a list of elements of A together with probabilities.

The Giry monad is defined in terms of two basic operations associated to this space:
ret : A → P(A) ✿
a ↦ λx : A, δ_a(x)
where δ_a(x) = 1 if a = x and 0 otherwise. The other basic operation is
bind : P(A) → (A → P(B)) → P(B) ✿
bind p f = λb : B, Σ_{a ∈ A} f(a)(b) · p(a)

In both cases the resulting output is a probability measure. The above definition is well-defined because we only consider finitely-supported probability measures. A more general case is obtained by replacing sums with integrals.

The definitions of bind and ret satisfy the following properties:
bind (ret x) f = f x ✿   (3)
bind p (λx, δ_x) = p ✿   (4)
bind (bind p f) g = bind p (λx, bind (f x) g) ✿   (5)

These monad laws establish that the triple (P, bind, ret) forms a monad.

The Giry monad has been extensively studied and used by various authors because it has several attractive qualities that simplify (especially formal) proofs. First, the Giry monad naturally admits a denotational monadic semantics for certain probabilistic programs [RP02, JP89, ŚGG15, APM09]. Second, it is useful for rigorously formalizing certain informal arguments in probability theory by providing a means to perform ad hoc notation overloading [TTV19]. Third, it can simplify certain constructions such as that of the product measure [EHN15].
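The finitary operations translate almost verbatim into code. The following Python sketch represents distributions as dictionaries mapping elements to probabilities; the two-state transition kernel `T` is an assumed example, and the Kleisli composition defined here anticipates the Chapman-Kolmogorov discussion in the next subsection:

```python
def ret(a):
    """ret : A -> P(A); the point-mass (Dirac) distribution at a."""
    return {a: 1.0}

def bind(p, f):
    """bind : P(A) -> (A -> P(B)) -> P(B);
    (bind p f)(b) = sum over a of f(a)(b) * p(a)."""
    out = {}
    for a, pa in p.items():
        for b, pb in f(a).items():
            out[b] = out.get(b, 0.0) + pa * pb
    return out

def kleisli(f, g):
    """Kleisli composition (f >=> g)(x) = bind (f x) g."""
    return lambda x: bind(f(x), g)

# Example Kleisli arrow: an assumed two-state transition kernel.
T = lambda s: {0: 0.9, 1: 0.1} if s == 0 else {0: 0.2, 1: 0.8}
```

On this example, `bind(ret(0), T) == T(0)` witnesses the left identity law (3), and `kleisli(T, T)(0)` computes the two-step transition probabilities by summing over the intermediate state.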
CertRL uses the Giry monad as a substitute for the stochastic matrix associated to a Markov decision process. This is possible because the Kleisli composition of the Giry monad recovers the Chapman-Kolmogorov formula [Per18, Per19]. (Kleisli composition is the "fish" operator >=> in Haskell parlance.)
Reasoning about probabilistic processes requires composing probabilities. The Chapman-Kolmogorov formula is a classical result in the theory of Markovian processes which states that the probability of transitioning from one state to another in two steps can be obtained by summing over the probability of visiting each intermediate state. This application of the Chapman-Kolmogorov formula plays a fundamental role in the study of Markovian processes, but requires formalizing and reasoning about matrix operations.

Kleisli composition provides an alternative and more elegant mechanism for reasoning about compositions of probabilistic choices. This section defines and provides an intuition for Kleisli composition.

Think of P(A) as the random elements of A ([Per18, page 15]). In this paradigm, the maps A → P(B) are simply the maps with a random outcome. When P is a monad, such maps are called Kleisli arrows of P.

In terms of reinforcement learning, a map f : A → P(B) is a rule which takes a state a : A and gives the probability of transitioning to a state b : B. Suppose now that we have another such rule g : B → P(C). Kleisli composition puts f and g together to give a map (f >=> g) : A → P(C). It is defined as:
f >=> g := λx : A, bind (f x) g   (6)
= λx : A, (λc : C, Σ_{b : B} g(b)(c) · f(x)(b))   (7)
= λ(x : A)(c : C), Σ_{b : B} f(x)(b) · g(b)(c)   (8)

The motivation for (6)–(8) is intuitive. In order to start at x : A and end up at c : C by following the rules f and g, one must first pass through an intermediate state b : B in the codomain of f and the domain of g. The probability of the path through any particular b : B is f(x)(b) · g(b)(c).
So, to obtain the total probability of transitioning from x to c, simply sum over all intermediate states b : B. This is exactly (8). We thus recover the classical Chapman-Kolmogorov formula, but as a Kleisli composition in the Giry monad. This obviates the need for reasoning about operators on linear vector spaces, thereby substantially simplifying the formalization effort.

CertRL contains a formalization of Markov decision processes, a definition of the Kleisli composition specialized to Markov decision processes, a definition of the long-term value of a Markov decision process, a definition of the Bellman operator, and a formalization of the operator's main properties.

Building on top of its library of results about Markov decision processes,
CertRL contains proofs of our main results:
(1) the (infinite) sequence of value functions obtained by value iteration converges in the limit to a global optimum assuming stationary policies,
(2) the (infinite) sequence of policies obtained by policy iteration converges in the limit to a global optimum among stationary policies, and
(3) the optimal value function for a Markov decision process of length n is computed inductively by application of the Bellman operator (Section 3.4).

We refer to [Put94] for a detailed presentation of the theory of Markov decision processes. Our formalization considers the theory of infinite-horizon discounted Markov decision processes with deterministic stationary policies. We now elaborate on the above definitions and set up relevant notation. Our presentation will be type-theoretic in nature, to reflect the formal development. The exposition (and
CertRL formalization) closely follows the work of Frank Feys, Helle Hvid Hansen, and Lawrence Moss [FHM18].
Definition 6 (Markov Decision Process ✿). A Markov decision process consists of the following data:
• A nonempty finite type S called the set of states. We assume that S has decidable equality.
• For each state s : S, a nonempty finite type A(s) called the type of actions available at state s. This is modelled as a dependent type.
• A stochastic transition structure T : Π_{s : S} (A(s) → P(S)).
• A reward function v : Π_{s : S} (A(s) → S → ℝ), where v(s, a, s′) is the reward obtained on transition from state s to state s′ under action a.

From these definitions it follows that the rewards are bounded in absolute value: since the state and action spaces are finite, there exists a constant D such that
∀ (s s′ : S), (a : A(s)), |v(s, a, s′)| ≤ D ✿   (9)

In the above definition, P(S) denotes the space of all probability measures on S. The CertRL definition of Markov decision processes assumes that S has decidable equality because its definition of finite ✿ is not strong enough to guarantee that finite sets have decidable equality.

Definition 7 (Decision Rule / Policy). Given a Markov decision process with state space S and action space Π_{s : S} A(s),
• A function π : Π_{s : S} A(s) is called a deterministic agent policy, or decision rule ✿. The decision rule is deterministic since it returns an action, as opposed to returning a probability distribution on actions, in which case it would be called stochastic.
• A stationary policy is an infinite sequence of decision rules: (π, π, π, ...) ✿. Stationarity implies that the same decision rule applies at each step.
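Definition 6's data can be mirrored as a plain Python record; the following sketch uses assumed toy values and shows the dependent action sets A(s), the transition structure, and the bound D of (9):

```python
# Illustrative encoding of Definition 6 (all numeric values are assumed).
S = ["s0", "s1"]
A = {"s0": ["a", "b"], "s1": ["a"]}          # dependent action sets A(s)
T = {("s0", "a"): {"s0": 1.0},
     ("s0", "b"): {"s1": 1.0},
     ("s1", "a"): {"s0": 0.3, "s1": 0.7}}    # T : forall s, A(s) -> P(S)
v = {("s0", "a", "s0"): 0.0,                 # reward v(s, a, s') on the
     ("s0", "b", "s1"): 1.0,                 # transitions with positive
     ("s1", "a", "s0"): 2.0,                 # probability
     ("s1", "a", "s1"): -1.0}

# A decision rule picks one available action per state (Definition 7).
pi = {"s0": "b", "s1": "a"}

# Rewards are bounded in absolute value (9) since everything is finite.
D = max(abs(r) for r in v.values())
```

Because state and action spaces are finite, the maximum defining D always exists; this is the computational shadow of inequality (9).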
This policy π induces a stochastic dynamic process on S evolving in discrete time steps k ∈ ℤ≥0. In this section we consider only stationary policies, and therefore use the terms policy and decision rule interchangeably.

Note that for a fixed decision rule π, we get a Kleisli arrow T_π : S → P(S) defined as T_π(s) = T(s)(π(s)). Conventionally, T_π is represented as a row-stochastic matrix (T_π)_{s s′} that acts on probability co-vectors from the right, so that the row of T_π corresponding to state s encodes the probability distribution over states s′ after a transition from the state s.

Let p_k ∈ P(S) for k ∈ ℤ≥0 denote the probability distribution on S evolving under the policy stochastic map T_π after k transition steps, so that p_0 is the initial probability distribution on S (the initial distribution is usually taken to be ret s for a state s). These are related by
p_k = p_0 T_π^k   (10)

In general (if p_0 = ret s_0) the number p_k(s) gives the probability that, starting out at s_0, one ends up at s after k stages. So, for example, if k =
1, we recover the stochastic transition structure at the end of the first step ✿.

Instead of representing T_π^k as an iterated product of a stochastic matrix in our formalization, we recognize that (10) states that p_k is the k-fold iterated Kleisli composite of T_π applied to the initial distribution p_0 ✿:
p_k = p_0 (T_π >=> ⋯ >=> T_π)  [k times]   (11)

Thus, we bypass the need to define matrices and matrix multiplication entirely in the formalization.

Since the transition from one state to another by an action is governed by a probability distribution T, there is a notion of expected reward with respect to that distribution.

Definition 8 (Expected immediate reward). For a Markov decision process,
• The expected immediate reward obtained in the transition under action a from state s is the function v̄ : S → A → ℝ computed by averaging the reward function over the stochastic transition map to a new state s′:
v̄(s, a) := Σ_{s′ ∈ S} v(s, a, s′) T(s, a)(s′)   (12)
• The expected immediate reward under a decision rule π, denoted v̄_π : S → ℝ, is defined to be:
v̄_π(s) := v̄(s, π(s)) ✿   (13)
That is, we replace the action argument in (12) by the action prescribed by the decision rule π.
• The expected reward at time step k of a Markov decision process starting at initial state s, following policy π, is defined as the expected value of the reward with respect to the k-th Kleisli iterate of T_π starting at state s:
r_k^π(s) := E_{T_π^k(s)}[v̄_π] = Σ_{s′ ∈ S} v̄_π(s′) T_π^k(s)(s′) ✿

The long-term value of a Markov decision process under a policy π is defined as follows:

Definition 9 (Long-Term Value).
Let γ ∈ ℝ, 0 ≤ γ < 1 be a discount factor, and π = (π, π, ...) be a stationary policy. Then V_π : S → ℝ is given by
V_π(s) = Σ_{k=0}^∞ γ^k r_k^π(s) ✿   (14)

The rewards being bounded in absolute value implies that the long-term value function V_π is well-defined for every initial state ✿.

It can be shown by manipulating the series in (14) that the long-term value satisfies the Bellman equation:
V_π(s) = v̄(s, π(s)) + γ Σ_{s′ ∈ S} V_π(s′) T_π(s)(s′) ✿   (15)
= v̄_π(s) + γ E_{T_π(s)} V_π   (16)

Definition 10.
Given a Markov decision process, we define the
Bellman operator as
B_π : (S → ℝ) → (S → ℝ)   (17)
W ↦ λs, v̄_π(s) + γ E_{T_π(s)} W   (18)

Theorem 11 (Properties of the Bellman Operator ✿). The Bellman operator satisfies the following properties:
• As is evident from (15), the long-term value V_π is a fixed point of the operator B_π ✿.
• The operator B_π (called the Bellman operator) is a contraction in the norm (2) ✿.
• The operator B_π is a monotone operator; that is,
∀s, W_1(s) ≤ W_2(s) ⇒ ∀s, B_π(W_1)(s) ≤ B_π(W_2)(s)

The Banach fixed point theorem now says that V_π is the unique fixed point of this operator.

Let V_{π,n} : S → ℝ be the n-th iterate of the Bellman operator B_π. It can be computed by the recursion relation
V_{π,0}(s) = 0
V_{π,n+1}(s) = v̄_π(s) + γ E_{T_π(s)} V_{π,n}   for n ∈ ℤ≥0   (19)
where s is an arbitrary initial state. The first term of V_{π,n+1} for the process of length n+1 is the expected immediate reward, and the remaining term is the discounted total reward obtained in the subsequent process of length n (the discounted future reward). The n-th iterate is also seen to be equal to the n-th partial sum of the series (14) ✿.

The sequence of iterates {V_{π,n}}, n = 0, 1, 2, ..., is convergent and its limit equals V_π, by the Banach fixed point theorem:
V_π = lim_{n→∞} V_{π,n} ✿   (20)

In Section 3.4 we will provide a general formalized proof, applicable as well to non-stationary policies, of the fact that V_n is the value of an MDP process of length n.

In the previous subsection we defined the long-term value function V_π and showed that it is the fixed point of the Bellman operator.
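These properties can be checked numerically on a toy two-state example. The Python sketch below uses illustrative placeholder rewards, transitions, and discount factor, iterates B_π to its fixed point as in (20), and exposes the contraction property:

```python
GAMMA = 0.9  # illustrative discount factor

# Toy data for a fixed policy pi: expected immediate rewards vbar_pi(s)
# and transition rows T_pi(s) (assumed numbers, not from the paper).
VBAR = {"s0": 1.0, "s1": 0.0}
T_PI = {"s0": {"s0": 0.5, "s1": 0.5}, "s1": {"s0": 0.1, "s1": 0.9}}

def bellman(W):
    """B_pi(W)(s) = vbar_pi(s) + gamma * E_{T_pi(s)} W, as in (18)."""
    return {s: VBAR[s] + GAMMA * sum(p * W[t] for t, p in T_PI[s].items())
            for s in VBAR}

def long_term_value(tol=1e-12):
    """V_pi as the limit (20) of the iterates V_{pi,n}, starting from 0."""
    V = {s: 0.0 for s in VBAR}
    for _ in range(100_000):
        V2 = bellman(V)
        if max(abs(V2[s] - V[s]) for s in V) < tol:
            return V2
        V = V2
    raise RuntimeError("did not converge")
```

Iterating from the zero function reproduces the partial sums of (14), and the returned V satisfies the Bellman equation (15) up to the tolerance.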
V_π is also the pointwise limit of the iterates V_{π,n}, the expected value function of all length-n realizations of the Markov decision process which follows a fixed stationary policy π.

We note that the value function V_π induces a partial order on the space of all decision rules, with σ ≤ τ if and only if V_σ ≤ V_τ ✿. The space of all decision rules is finite because the state and action spaces are finite ✿. The above facts imply the existence of a decision rule (stationary policy) which maximizes the long-term reward. We call this stationary policy the optimal policy and its long-term value the optimal value function:
V*(s) = max_π {V_π(s)} ✿   (21)

The aim of reinforcement learning, as we remarked in the introduction, is to have tractable algorithms to find the optimal policy and the optimal value function corresponding to the optimal policy. Bellman's value iteration algorithm is such an algorithm, which is known to converge asymptotically to the optimal value function. In this section we describe this algorithm and formally prove this convergence property.
Definition 12.
Given a Markov decision process we define the
Bellman optimality operator as:

  \hat{B} : (S \to \mathbb{R}) \to (S \to \mathbb{R}), \qquad W \mapsto \lambda s.\, \max_{a \in A(s)} \left( \bar{v}(s,a) + \gamma \, \mathbb{E}_{T(s,a)}[W] \right)

Theorem 13.
The Bellman optimality operator B̂ satisfies the following properties:
• The operator B̂ is a contraction in the norm (2).
• The operator B̂ is a monotone operator. That is, ∀s, W₁(s) ≤ W₂(s) ⇒ ∀s, B̂(W₁)(s) ≤ B̂(W₂)(s).

Now we move on to proving the most important property of B̂: the optimal value function V* is a fixed point of B̂. By Theorem 13 and the Banach fixed point theorem, we know that the fixed point of B̂ exists. Let us denote it V̂. Then we have:

Theorem 14 (Lemma 1 of [FHM18]). For every decision rule σ, we have V^σ ≤ V̂.

Proof. Fix a policy σ. We note that for every f : S → ℝ, we have B^σ(f) ≤ B̂(f). In particular, applying this to f = V^σ and using Theorem 11, we get that V^σ = B^σ(V^σ) ≤ B̂(V^σ). Now by contraction coinduction (Theorem 5 with F = B̂, along with Theorem 13) we get that V^σ ≤ V̂. □

Theorem 14 immediately implies that V* ≤ V̂. To go the other way, we introduce the following policy, called the greedy decision rule:

  \sigma^*(s) := \operatorname{argmax}_{a \in A(s)} \left( \bar{v}(s,a) + \gamma \, \mathbb{E}_{T(s,a)}[\hat{V}] \right)        (22)

We now have the following theorem:

Theorem 15 (Proposition 1 of [FHM18]). The greedy policy is the policy whose long-term value is the fixed point of B̂: V^{σ*} = V̂.

Proof. We observe that B^{σ*}(V̂) = V̂. Thus, V̂ ≤ B^{σ*}(V̂). Note that V^{σ*} is the fixed point of B^{σ*} by Theorem 11. Now applying contraction coinduction with F = B^{σ*}, we get V̂ ≤ V^{σ*}. From Theorem 14 we get that V^{σ*} ≤ V̂. □

Theorem 15 implies that V* ≥ V̂, and so we conclude that V* = V̂. Thus, the fixed point of the Bellman optimality operator B̂ exists and is equal to the optimal value function.

Stated fully, value iteration proceeds by:
(1) Initialize a value function V₀ : S → ℝ.
(2) Define V_{n+1} = B̂ V_n for n ≥ 0.

At each stage, the following policy is computed:

  \pi_n(s) \in \operatorname{argmax}_{a \in A(s)} \left( \bar{v}(s,a) + \gamma \, \mathbb{E}_{T(s,a)}[V_n] \right)

By the Banach fixed point theorem, the sequence {V_n} converges to the optimal value function V*. In practice, one repeats this iteration as many times as needed, until the change between successive iterates falls below a fixed threshold.

In Section 3.4 we explain and provide a formalized proof of the dynamic programming principle: the value function V_n is equal to the optimal value function of the MDP process of length n under the optimal, possibly non-stationary, sequence of policies of length n.

The convergence of value iteration is asymptotic, which means the iteration is continued until a fixed threshold is reached. Policy iteration is a similar iterative algorithm that benefits from a more definite stopping condition. Define the Q function to be:

  Q^\pi(s,a) := \bar{v}(s,a) + \gamma \, \mathbb{E}_{T(s,a)}[V^\pi]

The policy iteration algorithm proceeds in the following steps:
(1) Initialize the policy to π₀.
(2) Policy evaluation: For n ≥
0, given π_n, compute V^{π_n}.
(3) Policy improvement: From V^{π_n}, compute the greedy policy:

  \pi_{n+1}(s) \in \operatorname{argmax}_{a \in A(s)} \left[ Q^{\pi_n}(s,a) \right]

(4) Check whether V^{π_n} = V^{π_{n+1}}. If yes, stop.
(5) If not, repeat (2) and (3).

This algorithm depends on the following results for correctness. We follow the presentation from [FHM18].

Definition 16 (Improved policy). A policy τ is called an improvement of a policy σ if for all s ∈ S it holds that

  \tau(s) = \operatorname{argmax}_{a \in A(s)} \left[ Q^\sigma(s,a) \right]

So, step (3) of the policy iteration algorithm simply constructs an improved policy from the previous policy at each stage.

Theorem 17 (Policy Improvement Theorem).
Let σ and τ be two policies.
• If B^τ V^σ ≥ B^σ V^σ then V^τ ≥ V^σ.
• If B^τ V^σ ≤ B^σ V^σ then V^τ ≤ V^σ.

Using the above theorem, we have:

Theorem 18 (Policy Improvement Improves Values). If σ and τ are two policies with τ an improvement of σ, then V^τ ≥ V^σ.

Proof. By Theorem 17, it is enough to show B^τ V^σ ≥ B^σ V^σ. We have that τ is an improvement of σ:

  \tau(s) = \operatorname{argmax}_{a \in A(s)} [ Q^\sigma(s,a) ]        (23)
  = \operatorname{argmax}_{a \in A(s)} \left[ \bar{v}(s,a) + \gamma \, \mathbb{E}_{T(s,a)}[V^\sigma] \right]        (24)

Note that

  B^\tau V^\sigma = \bar{v}(s,\tau(s)) + \gamma \, \mathbb{E}_{T(s,\tau(s))}[V^\sigma] = \max_{a \in A(s)} \left[ \bar{v}(s,a) + \gamma \, \mathbb{E}_{T(s,a)}[V^\sigma] \right] \quad \text{by (24)}
  \ge \bar{v}(s,\sigma(s)) + \gamma \, \mathbb{E}_{T(s,\sigma(s))}[V^\sigma] = B^\sigma V^\sigma  □

In other words, since π_{n+1} is an improvement of π_n by construction, the above theorem implies that V^{π_n} ≤ V^{π_{n+1}}. This means that π_n ≤ π_{n+1}. Thus, the policy constructed at each stage of the policy iteration algorithm is an improvement of the policy from the previous stage. Since the set of policies is finite, this sequence of policies must at some point stabilize, so the algorithm is guaranteed to terminate.

In Section 3.4 we will provide a formalization of the statement that π_n is in fact the optimal policy to follow, for an MDP process of any finite length, at the timestep when n steps remain until the end of the process.

All results up to this subsection were stated in terms of the convergence of infinite sequences of states and actions. Stating convergence results in terms of limits of infinite sequences is not uncommon in texts on reinforcement learning; however, in practice, reinforcement learning algorithms are always run for some finite number of steps. In this section we consider decision processes of finite length and do not impose the assumption that the optimal policy is stationary.

Let V^π̄ denote the value function of a Markov decision process for a finite sequence of policies π̄ = π₀ :: π₁ :: π₂ :: … :: π_{n−1} of length n = len(π̄).
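The policy iteration loop described above can be run concretely. The sketch below uses an invented two-state, two-action MDP (it is an illustration, not CertRL's Coq code), and performs policy evaluation approximately by iterating B^π rather than solving the linear system exactly.

```python
# Illustrative sketch of policy iteration on an invented MDP
# (not the paper's Coq formalization).  Two states, two actions.

gamma = 0.9
S, A = range(2), range(2)

# reward[s][a] and T[s][a][t] are made-up numbers for illustration.
reward = [[0.0, 1.0],
          [2.0, 0.0]]
T = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.8, 0.2], [0.3, 0.7]]]

def evaluate(pi, iters=2000):
    """Policy evaluation: iterate B^pi to (approximately) V^pi."""
    V = [0.0] * len(S)
    for _ in range(iters):
        V = [reward[s][pi[s]] + gamma * sum(T[s][pi[s]][t] * V[t] for t in S)
             for s in S]
    return V

def improve(V):
    """Greedy improvement: pi'(s) in argmax_a Q(s, a)."""
    def Q(s, a):
        return reward[s][a] + gamma * sum(T[s][a][t] * V[t] for t in S)
    return [max(A, key=lambda a: Q(s, a)) for s in S]

pi = [0, 0]
while True:
    V = evaluate(pi)
    pi_next = improve(V)
    if pi_next == pi:   # V^{pi_{n+1}} = V^{pi_n}: stop.
        break
    pi = pi_next        # By the improvement theorem, values never decrease.
```

Because there are finitely many policies and each improvement step is monotone, the loop terminates, matching the termination argument in the text.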
Denote by p the probability distribution over the initial state at the start of the process. We define the probability measure at step k in terms of Kleisli iterates of the decision rules π_i for i in 0 … (k−1):

  p \, T^{\bar{\pi}[0:k]} := \left( p \, T^{\pi_0} \cdots T^{\pi_{k-1}} \right)        (25)

Definition 19 (Expected value function of an MDP of length n = len(π̄) over the initial probability distribution p).

  \langle p \mid V^{\bar{\pi}} \rangle = \sum_{k=0}^{n-1} \gamma^k \, \langle p \, T^{\bar{\pi}[0:k]} \mid \bar{v}^{\pi_k} \rangle        (26)

Definition 19 implies the recursion relation

  \langle p \mid V^{\pi_0 :: tail} \rangle = \langle p \mid \bar{v}^{\pi_0} + \gamma \, T^{\pi_0} V^{tail} \rangle, \qquad n \in \mathbb{Z}_{> 0}        (27)

where π̄ = π₀ :: tail.

Let V̂_{*,n} be the optimal value function of the Markov decision process of length n over the space of all policy sequences of length n:

  \hat{V}_{*,n} := \sup_{\bar{\pi} \mid len(\bar{\pi}) = n} V^{\bar{\pi}}        (28)

Let V̂_{π₀::*,n+1} be the optimal value function of the Markov decision process of length n+1 whose first decision rule is π₀. Using the relation (27) and the fact that

  \sup_{\pi_0 :: tail} V^{\pi_0 :: tail} = \sup_{\pi_0} \sup_{tail} V^{\pi_0 :: tail}        (29)

we find

  \langle p \mid \hat{V}_{*,n+1} \rangle = \sup_{\pi_0 \in \prod_{s \in S} A(s)} \langle p \mid \bar{v}^{\pi_0} + \gamma \, T^{\pi_0} \hat{V}_{*,n} \rangle, \qquad n \in \mathbb{Z}_{\ge 0}        (30)

with the initial term of the sequence V̂_{*,0} = 0. The optimal value function V̂_{*,n+1} of a Markov decision process of length n+1 therefore relates to the optimal value function of the same Markov decision process of length n by the inductive relation

  \hat{V}_{*,n+1} = \hat{B} \, \hat{V}_{*,n}        (31)

where B̂ is the Bellman optimality operator (Definition 12). The iterative computation of the sequence of optimal value functions {V̂_{*,n}}_{n ∈ ℤ_{≥0}} of Markov decision processes of length n = 0, 1, 2, … from the recursion V̂_{*,n+1} = B̂ V̂_{*,n} is the same algorithm as value iteration.

CertRL contributes a formal library for reasoning about Markov decision processes. We demonstrate the effectiveness of this library's building blocks by proving the two most canonical results from reinforcement learning theory.
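The inductive relation (31) can also be checked by brute force on a tiny example: the n-th iterate of the Bellman optimality operator coincides with the optimal value over all (possibly non-stationary) length-n policy sequences. The MDP below is invented for illustration; this is a numerical sanity check, not the paper's formal proof.

```python
# Illustrative check (invented toy MDP, not the paper's Coq code) of the
# dynamic programming principle: the n-th value iteration iterate equals
# the optimum over all non-stationary policy sequences of length n.

from itertools import product

gamma = 0.9
S, A = range(2), range(2)
reward = [[0.0, 1.0], [2.0, 0.0]]
T = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.8, 0.2], [0.3, 0.7]]]

def bellman_opt(W):
    """The Bellman optimality operator."""
    return [max(reward[s][a] + gamma * sum(T[s][a][t] * W[t] for t in S)
                for a in A)
            for s in S]

def finite_value(seq, s):
    """Value of following the decision-rule sequence `seq` from state s."""
    if not seq:
        return 0.0
    a = seq[0][s]
    return reward[s][a] + gamma * sum(T[s][a][t] * finite_value(seq[1:], t)
                                      for t in S)

n = 3
rules = list(product(A, repeat=len(S)))      # all 4 decision rules
best = [max(finite_value(seq, s) for seq in product(rules, repeat=n))
        for s in S]                          # brute-force optimum, per state

V = [0.0, 0.0]
for _ in range(n):
    V = bellman_opt(V)                       # n-th iterate from V_{*,0} = 0

assert all(abs(V[s] - best[s]) < 1e-9 for s in S)
```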
In this section we reflect on the structure of CertRL's formalization, substantiating our claim that CertRL serves as a convenient foundation for a continuing line of work on the formalization of reinforcement learning theory.
Most texts on Markov decision processes (for example, [Put94, Section 2.1.6]) start out with a probability space on the space of all possible realizations of the Markov decision process. The long-term value for an infinite-horizon Markov decision process is then defined as the expected value over all possible realizations:

  V^\pi(s) = \mathbb{E}_{(x_1, x_2, \ldots)} \left[ \sum_{k=0}^{\infty} \gamma^k \, v(x_k, \pi(x_k)) \,\middle|\, x_0 = s; \pi \right]        (32)

where each x_k is drawn from the distribution T(x_{k−1}, π(x_{k−1})). This definition is hard to work with because, as [Put94] notes, it ignores the dynamics of the problem. Fortunately, it is also unnecessary, since statements about the totality of all realizations are rarely made.

In our setup, following [FHM18], we only consider the probability space over the finite set of states of the Markov decision process. By identifying the basic operation of Kleisli composition, we generate more realizations (and their expected rewards) on the fly as and when needed.

Implementations of reinforcement learning algorithms often compute the long-term value using matrix operators for efficiency reasons. The observation that clean theoretical tools do not necessarily entail efficient implementations is not new; both Puterman [Put94] and Hölzl [Höl17] make similar remarks. Fortunately, the design of our library provides a clean interface for future work on formalizing efficiency improvements. Extending CertRL with correctness theorems for algorithms that use matrix operations requires nothing more than a proof that the relevant matrix operations satisfy the definition of Kleisli composition.
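As a concrete, if informal, picture of this style, Kleisli composition for the finite discrete Giry monad can be sketched in a few lines. The kernel and helper names below are invented for illustration and are not CertRL's definitions.

```python
# Illustrative sketch of Kleisli composition for the finite, discrete
# Giry monad: distributions are dicts, kernels are functions from a
# state to a distribution, and composition pushes mass forward.

def bind(p, kernel):
    """Kleisli extension: push distribution p through a kernel."""
    out = {}
    for s, ps in p.items():
        for t, pt in kernel(s).items():
            out[t] = out.get(t, 0.0) + ps * pt
    return out

def expect(p, f):
    """Expected value of f under the finite distribution p."""
    return sum(ps * f(s) for s, ps in p.items())

# An invented two-state kernel T_pi and initial distribution p.
T_pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}
p = {0: 1.0}

# The step-k measure of (25): k successive Kleisli steps from p.
pk = p
for _ in range(3):
    pk = bind(pk, lambda s: T_pi[s])

assert abs(sum(pk.values()) - 1.0) < 1e-12   # still a distribution

# Expectations against the step-k measure, as in Definition 19,
# here with an invented per-state reward [1, 0].
ev = expect(pk, lambda s: [1.0, 0.0][s])
assert abs(ev - pk[0]) < 1e-12
```

The point of the sketch is that no measure over whole trajectories is ever built; longer realizations arise only by composing kernels on demand.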
Comparing Theorem 14 and Theorem 15 with the equivalent results from Puterman [Put94, Theorem 6.2.2] demonstrates that CertRL avoids reasoning about low-level ε–δ details through strategic use of coinduction.

The usefulness of contraction coinduction is reflected in the formalization, sometimes resulting in Coq proofs whose length is almost the same as that of the English text. Both qualitative and quantitative evidence substantiate this claim.

Quantitatively, the raw size of CertRL's proofs that use metric coinduction is roughly similar to the raw size of the corresponding English proofs. We compare the two using a metric known as the intrinsic De Bruijn factor. [Wie00] defines the De Bruijn factor as the ratio of the length of a Coq proof to that of its English counterpart. The De Bruijn factor is intrinsic if it is the ratio of compressed file sizes; in our case we used the gzip utility to compress the Coq and English proof texts. In most systems, the De Bruijn factor is reported to be approximately 4. Table 2 reports the intrinsic De Bruijn factors of proofs in CertRL that use metric coinduction, all of which are less than 1.

Comparing proofs based upon the size of their compressed contents is inherently limited. For this reason, we also qualitatively compare the Coq proof of Theorem 15 to an English proof of the same. We present this comparison in Table 1.

Theorem 15 (Proposition 1 of [FHM18]). The greedy policy is the policy whose long-term value is the fixed point of B̂: V^{σ*} = V̂.

Proof.
(1) V^{σ*} ≤ V̂ follows by Theorem 14.
(2) It remains to show V̂ ≤ V^{σ*}. Note that V^{σ*} is the fixed point of B^{σ*} by Theorem 11.
(3) We can now apply contraction coinduction with F = B^{σ*}.
(4) The hypotheses are satisfied since, by Theorem 11, B^{σ*} is a contraction and a monotone operator.
(5) The only hypothesis left to show is V̂ ≤ B^{σ*} V̂.
(6) But in fact, we have B^{σ*}(V̂) = V̂ by the definition of σ*. □

(a) English proof adapted from [FHM18].

Lemma exists_fixpt_policy : forall init,
  let V' := fixpt (bellman_max_op) in
  let pi' := greedy init in
  ltv gamma pi' = V' init.
Proof.
  intros init V' pi'; eapply Rfct_le_antisym; split.
  - eapply ltv_Rfct_le_fixpt.
  - rewrite (ltv_bellman_op_fixpt _ init).
    apply contraction_coinduction_Rfct_ge'.
    + apply is_contraction_bellman_op.
    + apply bellman_op_monotone_ge.
    + unfold V', pi'.
      now rewrite greedy_argmax_is_max.
Qed.

(b) Coq proof.

Table 1. Comparison of English and Coq proofs of Theorem 15.

The two proofs are roughly equivalent in length and, crucially, also make essentially the same argument at the same level of abstraction. Note that what we compare is not exactly the proof from Feys et al. [FHM18, Proposition 1], but is as close as possible to a restatement of their Proposition 1 and Lemma 1. The full proof from [FHM18], with Lemma 1 inlined, reads as follows:
Proposition 1: The greedy policy is optimal. That is, LTV_{σ*} = V*.
(1) Observe that Ψ_{σ*}(V*) ≥ V* (in fact, equality holds).
(2) By contraction coinduction, V* ≤ LTV_{σ*}.
(3) Lemma 1: For all policies σ, LTV_σ ≤ V*.
(4) A straightforward calculation and monotonicity argument shows that for all f ∈ B(S, ℝ), Ψ_σ(f) ≤ Ψ*(f).
(5) In particular, LTV_σ = Ψ_σ(LTV_σ) ≤ Ψ*(LTV_σ).
(6) By contraction coinduction we conclude that LTV_σ ≤ V*.

             English   Coq    De Bruijn factor
Theorem 14   409B      253B   0.61
Theorem 15   392B      319B   0.81
Theorem 17   491B      381B   0.77

Table 2. Intrinsic De Bruijn factors of theorems whose proofs use contraction coinduction.
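An intrinsic De Bruijn factor of the kind reported in Table 2 can be computed with a few lines of Python. The two proof strings below are illustrative stand-ins, not the actual texts behind the table's figures.

```python
# Sketch: the intrinsic De Bruijn factor is the ratio of the
# gzip-compressed sizes of the Coq and English proof texts.
# The strings below are stand-ins invented for illustration.

import gzip

english = b"""Fix a policy sigma. For every f we have B_sigma(f) <= B_hat(f).
Applying this to f = V_sigma gives V_sigma <= B_hat(V_sigma); contraction
coinduction with F = B_hat then yields V_sigma <= V_hat."""

coq = b"""Proof. intros init V' pi'; eapply Rfct_le_antisym; split.
- eapply ltv_Rfct_le_fixpt.
- rewrite (ltv_bellman_op_fixpt _ init).
  apply contraction_coinduction_Rfct_ge'. Qed."""

def intrinsic_de_bruijn(coq_text, english_text):
    """Ratio of compressed sizes (formal over informal)."""
    return len(gzip.compress(coq_text)) / len(gzip.compress(english_text))

factor = intrinsic_de_bruijn(coq, english)
```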
Table 1 compares two coinductive proofs – one in English and the other in Coq. Another important comparison is between a Coq coinductive proof and an English non-coinductive proof. The Coq proof of the policy improvement theorem provides one such point of comparison. The theorem states that a particular closed property (the set {x | x ≤ y}) holds of the fixed point of a particular contractive map (the Bellman operator). The most common argument – presented in the most common textbook on reinforcement learning – proves this theorem by expanding the infinite sum in multiple steps [SB98, Section 4.2]. The coinductive viewpoint substantially simplifies this argument. To our knowledge,
CertRL is the first formal proof of convergence for value iteration or policy iteration. Related work falls into three categories: (1) libraries that CertRL builds upon, (2) formalizations of results from probability and machine learning, and (3) work at the intersection of formal verification and reinforcement learning.
Dependencies.
CertRL builds on the Coquelicot [BLM14] library for real analysis. Our main results are statements about fixed points of contractive maps in complete normed modules.
CertRL therefore builds on the formal development of the Lax-Milgram theorem and, in particular, Boldo et al.'s formal proof of the Banach fixed point theorem [BCF+17]. CertRL also makes extensive use of some utilities from Q*cert [AHM+17]. CertRL contains an implementation of the Giry monad. Our use of the monad for reasoning about probabilistic processes, as well as the design of our library, is highly motivated by the design of the Polaris library [TH19]. Building on these other foundations,
CertRL demonstrates how existing formalization work enables the formalization of key results in reinforcement learning theory.
Related Formalizations.
There is a growing body of work on the formalization of machine learning theory [TTV19, TTV+20, Höl17, SLD17, BS19, BBK19].

Johannes Hölzl's Isabelle/HOL development of Markov processes is most closely related to our own work [Höl17]. Hölzl builds on the probability theory libraries of Isabelle/HOL to develop continuous-time Markov chains. Many of Hölzl's basic design choices are similar to ours; for example, he also uses the Giry monad to place a monadic structure on probability spaces, and he also utilizes coinductive methods.
CertRL focuses instead on the formalization of convergence proofs for the dynamic programming algorithms that solve Markov decision processes. In the future, we plan to extend our formalization to include convergence proofs for model-free methods, in which a fixed Markov decision process is not known a priori.

The CertiGrad formalization by Selsam et al. contains a Lean proof that the gradients sampled by a stochastic computation graph are unbiased estimators of the true mathematical function [SLD17]. This result, together with our development of a library for proving convergence of reinforcement learning algorithms, provides a path toward a formal proof of correctness for deep reinforcement learning.
Formal Methods for RL.
The likelihood that reinforcement learning algorithms will be deployed in safety-critical settings during the coming decades motivates a growing body of work on formal methods for safe reinforcement learning. The basic approach – variously called formally constrained reinforcement learning [HAK18], shielding [ABE+18], or verifiably safe reinforcement learning [HFM+20] – uses temporal or dynamic logics to specify constraints on the behavior of RL algorithms.

Global convergence is a fundamental theoretical property of classical reinforcement learning algorithms, and in practice at least local convergence is an important property for any useful reinforcement learning algorithm. However, the formal proofs underlying these methods typically establish the correctness of a safety constraint but do not formalize any convergence properties. In future work, we plan to establish an end-to-end proof that constrained reinforcement learning safely converges by combining our current development with the safe RL approach of Fulton et al. [FP18] and the VeriPhy pipeline of Bohrer et al. [BTM+18].

Reinforcement learning algorithms are an important class of machine learning algorithms that are now being deployed in safety-critical settings. Ensuring the correctness of these algorithms is societally important, but proving properties about stochastic processes presents several challenges. In this paper we show how a combination of metric coinduction and the Giry monad provides a convenient setting for formalizing convergence proofs for reinforcement learning algorithms.
REFERENCES [ABE +
18] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcementlearning via shielding. In Sheila A. McIlraith and Kilian Q. Weinberger, editors,
Proceedings of the Thirty-Second AAAI Conferenceon Artificial Intelligence (AAAI 2018) . AAAI Press, 2018.[AHM +
17] Joshua S. Auerbach, Martin Hirzel, Louis Mandel, Avraham Shinnar, and Jérôme Siméon. Q*cert: A platform for implementingand verifying query compilers. In Semih Salihoglu, Wenchao Zhou, Rada Chirkova, Jun Yang, and Dan Suciu, editors,
Proceedingsof the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017 ,pages 1703–1706. ACM, 2017.[APM09] Philippe Audebaud and Christine Paulin-Mohring. Proofs of randomized algorithms in Coq.
Science of Computer Programming ,74(8):568–589, 2009.[BBK19] Alexander Bentkamp, Jasmin Christian Blanchette, and Dietrich Klakow. A formal proof of the expressiveness of deep learning.
J. Autom. Reason. , 63(2):347–368, 2019.[BCF +
17] Sylvie Boldo, François Clément, Florian Faissole, Vincent Martin, and Micaela Mayero. A Coq formal proof of the Lax–Milgram theorem. In Proceedings of the 6th ACM SIGPLAN Conference on Certified Programs and Proofs (CPP 2017), Paris, France, January 2017.[Bel54] Richard Bellman. The theory of dynamic programming.
Bull. Amer. Math. Soc. , 60(6):503–515, 11 1954.[BLM14] Sylvie Boldo, Catherine Lelay, and Guillaume Melquiond. Coquelicot: A user-friendly library of real analysis for Coq.
Mathematicsin Computer Science , 9, 03 2014.[BS19] Alexander Bagnall and Gordon Stewart. Certifying the true error: Machine learning in Coq with verified generalization guarantees.In
The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019 , pages 2662–2669. AAAI Press, 2019.[BTM +
18] Brandon Bohrer, Yong Kiam Tan, Stefan Mitsch, Magnus O. Myreen, and André Platzer. VeriPhy: Verified controller executablesfrom verified cyber-physical system models. In Dan Grossman, editor,
Proceedings of the 39th ACM SIGPLAN Conference onProgramming Language Design and Implementation (PLDI 2018) , pages 617–630. ACM, 2018.[EHN15] Manuel Eberl, Johannes Hölzl, and Tobias Nipkow. A verified compiler for probability density functions. In Jan Vitek, editor,
ESOP 2015 , volume 9032 of
LNCS , pages 80–104. Springer, 2015.[FHM18] Frank MV Feys, Helle Hvid Hansen, and Lawrence S Moss. Long-term values in Markov decision processes, (co)algebraically. In
International Workshop on Coalgebraic Methods in Computer Science , pages 78–99. Springer, 2018.[FP18] Nathan Fulton and André Platzer. Safe reinforcement learning via formal methods: Toward safe control through proof andlearning. In Sheila McIlraith and Kilian Weinberger, editors,
Proceedings of the Thirty-Second AAAI Conference on ArtificialIntelligence (AAAI 2018) , pages 6485–6492. AAAI Press, 2018.[GHLL16] Shixiang Gu, Ethan Holly, Timothy P. Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation.
CoRR ,abs/1610.00633, 2016.[Gir82] Michèle Giry. A categorical approach to probability theory. In B. Banaschewski, editor,
Categorical Aspects of Topology and Analysis, pages 68–85, Berlin, Heidelberg, 1982. Springer Berlin Heidelberg.[Höl17] Johannes Hölzl. Markov processes in Isabelle/HOL. In
Proceedings of the 6th ACM SIGPLAN Conference on Certified Programs andProofs , CPP 2017, page 100–111, New York, NY, USA, 2017. Association for Computing Machinery.[HAK18] Mohammadhosein Hasanbeig, Alessandro Abate, and Daniel Kroening. Logically-correct reinforcement learning.
CoRR ,abs/1801.08099, 2018.[HFM +
20] Nathan Hunt, N. Fulton, Sara Magliacane, N. Hoàng, Subhro Das, and Armando Solar-Lezama. Verifiably safe exploration forend-to-end reinforcement learning.
ArXiv , abs/2007.01223, 2020.[Höl17] Johannes Hölzl. Markov chains and Markov decision processes in Isabelle/HOL.
Journal of Automated Reasoning , 2017.[How60] R.A. Howard.
Dynamic Programming and Markov Processes . Technology Press of Massachusetts Institute of Technology, 1960.[Jac18] Bart Jacobs. From probability monads to commutative effectuses.
Journal of Logical and Algebraic Methods in Programming ,94:200 – 237, 2018.[JP89] C. Jones and Gordon D. Plotkin. A probabilistic powerdomain of evaluations. In
Proceedings of the Fourth Annual Symposium onLogic in Computer Science (LICS ’89), Pacific Grove, California, USA, June 5-8, 1989 , pages 186–195. IEEE Computer Society, 1989.[Koz07] Dexter Kozen. Coinductive proof principles for stochastic processes.
CoRR , abs/0711.0194, 2007.[KR09] Dexter Kozen and Nicholas Ruozzi. Applications of metric coinduction.
Log. Methods Comput. Sci. , 5(3), 2009.[Law62] F William Lawvere. The category of probabilistic mappings. preprint , 1962.[Ope18] OpenAI. OpenAI five. https://blog.openai.com/openai-five/, 2018.[Per18] Paolo Perrone.
Categorical Probability and Stochastic Dominance in Metric Spaces . PhD thesis, University of Leipzig, 2018.[Per19] Paolo Perrone. Notes on category theory with examples from basic mathematics, 2019.[Put94] Martin L. Puterman.
Markov Decision Processes: Discrete Stochastic Dynamic Programming . John Wiley and Sons, Inc., USA, 1stedition, 1994.[RP02] Norman Ramsey and Avi Pfeffer. Stochastic lambda calculus and monads of probability distributions. In John Launchbury andJohn C. Mitchell, editors,
Conference Record of POPL 2002: The 29th SIGPLAN-SIGACT Symposium on Principles of ProgrammingLanguages, Portland, OR, USA, January 16-18, 2002 , pages 154–165. ACM, 2002.[SB98] Richard S. Sutton and Andrew G. Barto.
Reinforcement Learning: An Introduction . MIT Press, Cambridge, MA, 1998.[SEJ +
20] Andrew Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, AlexanderNelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David Jones, DavidSilver, Koray Kavukcuoglu, and Demis Hassabis. Improved protein structure prediction using potentials from deep learning.
Nature , 577:1–5, 01 2020.[ŚGG15] Adam Ścibior, Zoubin Ghahramani, and Andrew D. Gordon. Practical probabilistic programming with monads. In Ben Lippmeier,editor,
Proceedings of the 8th ACM SIGPLAN Symposium on Haskell, Haskell 2015, Vancouver, BC, Canada, September 3-4, 2015 ,pages 165–176. ACM, 2015.[SHM +
16] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, IoannisAntonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, IlyaSutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game ofGo with deep neural networks and tree search.
Nature , 529(7587):484–489, January 2016.[SLD17] Daniel Selsam, Percy Liang, and David L. Dill. Developing bug-free machine learning systems with formal mathematics. In DoinaPrecup and Yee Whye Teh, editors,
Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW,Australia, 6-11 August 2017 , volume 70 of
Proceedings of Machine Learning Research , pages 3047–3056. PMLR, 2017.[Tea] The Coq Development Team.
The Coq Proof Assistant Reference Manual .[TH19] Joseph Tassarotti and Robert Harper. A separation logic for concurrent randomized programs.
Proceedings of the ACM onProgramming Languages , 3(POPL):1–30, 2019.[TTV19] Joseph Tassarotti, Jean-Baptiste Tristan, and Koundinya Vajjha. A formal proof of PAC learnability for decision stumps.
CoRR ,abs/1911.00385, 2019.[TTV +
20] Jean-Baptiste Tristan, Joseph Tassarotti, Koundinya Vajjha, Michael L. Wick, and Anindya Banerjee. Verification of ML systemsvia reparameterization.
CoRR , abs/2007.06776, 2020.[Wie00] Freek Wiedijk. The De Bruijn factor.