Empirical Dynamic Programming
William B. Haskell
EE & CS Departments, University of Southern California [email protected]
Rahul Jain*
EE & ISE Departments, University of Southern California [email protected]
Dileep Kalathil
EE Department, University of Southern California [email protected]
We propose empirical dynamic programming algorithms for Markov decision processes (MDPs). In these algorithms, the exact expectation in the Bellman operator in classical value iteration is replaced by an empirical estimate to get 'empirical value iteration' (EVI). Policy evaluation and policy improvement in classical policy iteration are also replaced by simulation to get 'empirical policy iteration' (EPI). Thus, these empirical dynamic programming algorithms involve iteration of a random operator, the empirical Bellman operator. We introduce notions of probabilistic fixed points for such random monotone operators. We develop a stochastic dominance framework for convergence analysis of such operators. We then use this to give sample complexity bounds for both EVI and EPI. We then provide various variations and extensions to asynchronous empirical dynamic programming, the minimax empirical dynamic program, and show how this can also be used to solve the dynamic newsvendor problem. Preliminary experimental results suggest a faster rate of convergence than stochastic approximation algorithms.
Key words : dynamic programming; empirical methods; simulation; random operators; probabilistic fixed points.
MSC2000 subject classification : 49L20, 90C39, 37M05, 62C12, 47B80, 37H99.
OR/MS subject classification : TBD.
History : Submitted: November 26, 2013
1. Introduction
Markov decision processes (MDPs) are natural models for decision making in a stochastic dynamic setting for a wide variety of applications. The 'principle of optimality' introduced by Richard Bellman in the 1950s has proved to be one of the most important ideas in stochastic control and optimization theory. It leads to dynamic programming algorithms for solving sequential stochastic optimization problems. And yet, it is well known that it suffers from a "curse of dimensionality" [5, 32, 19], and does not scale computationally with state and action space size. In fact, the dynamic programming algorithm is known to be PSPACE-hard [33].

This realization led to the development of a wide variety of 'approximate dynamic programming' methods, beginning with the early work of Bellman himself [6]. These ideas evolved independently in different fields, including the work of Werbos [44], Kushner and Clark [26] in control theory, Minsky [29], Barto et al. [4] and others in computer science, and Whitt in operations research [45, 46]. The key idea was an approximation of value functions using basis function approximation [6], state aggregation [7], and subsequently function approximation via neural networks [3]. The difficulty was universality of the methods: different classes of problems require different approximations. Thus, alternative model-free methods were introduced by Watkins and Dayan [43], where a Q-learning algorithm was proposed as an approximation of the value iteration procedure. It was soon noticed that this is essentially a stochastic approximation scheme introduced in the 1950s by Robbins and Monro [36] and further developed by Kiefer and Wolfowitz [23] and Kushner and Clark [26]. This led to many subsequent generalizations, including the temporal difference methods [9] and actor-critic methods [24, 25]. These are summarized in [9, 40, 34]. One shortcoming in this theory is that most of these algorithms require a recurrence property to hold and, in practice, often work only for finite state and action spaces. Furthermore, while many techniques for establishing convergence have been developed [11], including the o.d.e. method [28, 12], establishing the rate of convergence has been quite difficult [22]. Thus, despite considerable progress, these methods are not universal, sample complexity bounds are not known, and so other directions need to be explored.

A natural thing to consider is simulation-based methods. In fact, engineers and computer scientists often do dynamic programming via Monte-Carlo simulations. This technique affords considerable reduction in computation, but at the expense of uncertainty about convergence. In this paper, we analyze such 'empirical dynamic programming' algorithms. The idea behind the algorithms is quite simple and natural: in the DP algorithms, replace the expectation in the Bellman operator with a sample-average approximation. The idea is widely used in the stochastic programming literature, but mostly for single-stage problems. In our case, replacing the expectation with an empirical expectation operator makes the classical Bellman operator a random operator. In the DP algorithm, we must find the fixed point of the Bellman operator.
In the empirical DP algorithms, we must find a probabilistic fixed point of the random operator. In this paper, we first introduce two notions of probabilistic fixed points that we call 'strong' and 'weak'. We then show that asymptotically these concepts converge to the deterministic fixed point of the classical Bellman operator. The key technical idea of this paper is a novel stochastic dominance argument that is used to establish probabilistic convergence of a random operator, and in particular, of our empirical algorithms. Stochastic dominance, the notion of an order on the space of random variables, is a well-developed tool (see [30, 38] for a comprehensive study).

In this paper, we develop a theory of empirical dynamic programming (EDP) for Markov decision processes (MDPs). Specifically, we make the following contributions. First, we propose both empirical value iteration and policy iteration algorithms and show that these converge. Each is an empirical variant of the classical algorithms. In empirical value iteration (EVI), the expectation in the Bellman operator is replaced with a sample-average or empirical approximation. In empirical policy iteration (EPI), both policy evaluation and policy improvement are done via simulation, i.e., by replacing the exact expectation with the simulation-derived empirical expectation. We note that the EDP class of algorithms is not a stochastic approximation scheme. Thus, we do not need a recurrence property as is commonly needed by stochastic approximation-based methods, and the EDP algorithms are relevant for a larger class of problems (in fact, for any problem for which exact dynamic programming can be done). We provide convergence and sample complexity bounds for both EVI and EPI. But we note that EDP algorithms are essentially "off-line" algorithms, just as classical DP is. Moreover, we also inherit some of the problems of classical DP, such as scalability issues with large state spaces. These can be overcome in the same way as one does for classical DP, i.e., via state aggregation and function approximation.

Second, since the empirical Bellman operator is a random monotone operator, it does not have a deterministic fixed point. Thus, we introduce new mathematical notions of probabilistic fixed points. These concepts are pertinent when we are approximating a deterministic operator with an improving sequence of random operators. Under fairly mild assumptions, we show that our two probabilistic fixed point concepts converge to the deterministic fixed point of the classical monotone operator.

Third, since scant mathematical methods exist for convergence analysis of random operators, we develop a new technique based on stochastic dominance for convergence analysis of iteration of random operators. This technique allows for finite sample complexity bounds. We use this idea to prove convergence of the empirical Bellman operator by constructing a dominating Markov chain. We note that there is an extant theory of monotone random operators developed in the context of random dynamical systems [15], but those techniques for convergence analysis of random operators are not relevant to our context. Our stochastic dominance argument can be applied to more general random monotone operators than just the empirical Bellman operator.

We also give a number of extensions of the EDP algorithms. We show that EVI can be performed asynchronously, making a parallel implementation possible.
Second, we show that a saddle point equilibrium of a zero-sum stochastic game can be computed approximately by using the minimax Bellman operator. Third, we also show how the EDP algorithm and our convergence techniques can be used even with continuous state and action spaces by solving the dynamic newsvendor problem.

Related Literature
A key question is how the empirical dynamic programming method differs from other methods for simulation-based optimization of MDPs, on which there is substantial literature. We note that most of these are stochastic approximation algorithms, also called reinforcement learning in computer science. Within this class, there are Q-learning algorithms, actor-critic algorithms, and approximate policy iteration algorithms. Q-learning was introduced by Watkins and Dayan, but its convergence as a stochastic approximation scheme was established by Bertsekas and Tsitsiklis [9]. Q-learning for the average cost case was developed in [1], and a risk-sensitive version was developed in [10]. The convergence rate of Q-learning was established in [18], and similar sample complexity bounds were given in [22]. Actor-critic algorithms, as two time-scale stochastic approximations, were developed in [24]. But the most closely related work is optimistic policy iteration [42], wherein simulated trajectories are used for policy evaluation while policy improvement is done exactly. That algorithm is a stochastic approximation scheme and its almost sure convergence follows. This is true of all stochastic approximation schemes, but they do require some kind of recurrence property to hold. In contrast, EDP is not a stochastic approximation scheme, hence it does not need such assumptions. However, we can only guarantee its convergence in probability.

A class of simulation-based optimization algorithms for MDPs that is not based on stochastic approximations is the adaptive sampling methods developed by Fu, Marcus and co-authors [13, 14]. These are based on the pursuit automata learning algorithms [41, 35, 31] and combine multi-armed bandit learning ideas with Monte-Carlo simulation to adaptively sample state-action pairs to approximate the value function of a finite-horizon MDP.

Another closely related work is [16], which introduces simulation-based policy iteration (for average cost MDPs); it basically shows that almost sure convergence of such an algorithm can fail. Another related work is [17], wherein a simulation-based value iteration is proposed for a finite horizon problem. Convergence in probability is established if the simulation functions corresponding to the MDP are Lipschitz continuous. Another closely related paper is [2], which considers value iteration with error. We note that our focus is on infinite horizon discounted MDPs. Moreover, we do not require any Lipschitz continuity condition. We show that EDP algorithms will converge (probabilistically) whenever the classical DP algorithms will converge (which is almost always).

A survey on approximate policy iteration is provided in [8]. Approximate dynamic programming (ADP) methods are surveyed in [34]; in fact, many Monte-Carlo-based dynamic programming algorithms are introduced therein (but without convergence proofs). Simulation-based uniform estimation of value functions was studied in [20, 21]. This gave PAC-learning-type sample complexity bounds for MDPs, and this can be combined with policy improvement along the lines of optimistic policy iteration.

This paper is organized as follows. In Section 2, we discuss preliminaries and briefly talk about classical value and policy iteration. Section 3 presents empirical value and policy iteration. Section 4 introduces the notion of random operators and relevant notions of probabilistic fixed points.
In this section, we also develop a stochastic dominance argument for convergence analysis of iteration of random operators when they satisfy certain assumptions. In Section 5, we show that the empirical Bellman operator satisfies the above assumptions, and present sample complexity and convergence rate estimates for the EDP algorithms. Section 6 provides various extensions, including asynchronous EDP, minimax EDP and EDP for the dynamic newsvendor problem. Basic numerical experiments are reported in Section 7.
2. Preliminaries
We first introduce a typical representation for a discrete time MDP as the 5-tuple $(\mathcal{S}, \mathcal{A}, \{\mathcal{A}(s) : s \in \mathcal{S}\}, Q, c)$. Both the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are finite. Let $\mathcal{P}(\mathcal{S})$ denote the space of probability measures over $\mathcal{S}$, and define $\mathcal{P}(\mathcal{A})$ similarly. For each state $s \in \mathcal{S}$, the set $\mathcal{A}(s) \subset \mathcal{A}$ is the set of feasible actions. The entire set of feasible state-action pairs is $\mathbb{K} \triangleq \{(s,a) \in \mathcal{S} \times \mathcal{A} : a \in \mathcal{A}(s)\}$. The transition law $Q$ governs the system evolution: $Q(\cdot \mid s,a) \in \mathcal{P}(\mathcal{S})$ for all $(s,a) \in \mathbb{K}$, i.e., $Q(j \mid s,a)$ for $j \in \mathcal{S}$ is the probability of visiting the state $j$ next given the current state-action pair $(s,a)$. Finally, $c : \mathbb{K} \to \mathbb{R}$ is a cost function that depends on state-action pairs.

Let $\Pi$ denote the class of stationary deterministic Markov policies, i.e., mappings $\pi : \mathcal{S} \to \mathcal{A}$ which depend on history only through the current state. We only consider such policies since it is well known that there is an optimal policy in this class. For a given state $s \in \mathcal{S}$, $\pi(s) \in \mathcal{A}(s)$ is the action chosen in state $s$ under the policy $\pi$. We assume that $\Pi$ only contains feasible policies that respect the constraints $\mathbb{K}$. The state and action at time $t$ are denoted $s_t$ and $a_t$, respectively. Any policy $\pi \in \Pi$ and initial state $s \in \mathcal{S}$ determine a probability measure $P^\pi_s$ and a stochastic process $\{(s_t, a_t), t \geq 0\}$ defined on the canonical measurable space of trajectories of state-action pairs. The expectation operator with respect to $P^\pi_s$ is denoted $E^\pi_s[\cdot]$.

We will focus on infinite horizon discounted cost MDPs with discount factor $\alpha \in (0,1)$. For an initial state $s \in \mathcal{S}$, the expected discounted cost for policy $\pi \in \Pi$ is denoted by
$$v^\pi(s) = E^\pi_s\left[\sum_{t=0}^{\infty} \alpha^t c(s_t, a_t)\right].$$
The optimal cost starting from state $s$ is denoted by
$$v^*(s) \triangleq \inf_{\pi \in \Pi} E^\pi_s\left[\sum_{t \geq 0} \alpha^t c(s_t, a_t)\right],$$
and $v^* \in \mathbb{R}^{|\mathcal{S}|}$ denotes the corresponding optimal value function in its entirety.

Value iteration
The Bellman operator $T : \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$ is defined as
$$[Tv](s) \triangleq \min_{a \in \mathcal{A}(s)} \left\{ c(s,a) + \alpha\, E[v(\tilde{s}) \mid s, a] \right\}, \quad \forall s \in \mathcal{S},$$
for any $v \in \mathbb{R}^{|\mathcal{S}|}$, where $\tilde{s}$ is the random next state visited, and
$$E[v(\tilde{s}) \mid s,a] = \sum_{j \in \mathcal{S}} v(j)\, Q(j \mid s,a)$$
is the explicit computation of the expected cost-to-go conditioned on the state-action pair $(s,a) \in \mathbb{K}$. Value iteration amounts to iteration of the Bellman operator: starting from some $v^0 \in \mathbb{R}^{|\mathcal{S}|}$, we produce a sequence $\{v^k\}_{k \geq 0} \subset \mathbb{R}^{|\mathcal{S}|}$ with $v^{k+1} = T v^k = T^{k+1} v^0$ for all $k \geq 0$. This is the well-known value iteration algorithm for dynamic programming.

We next state the Banach fixed point theorem, which is used to prove that value iteration converges. Let $U$ be a Banach space with norm $\|\cdot\|_U$. We call an operator $G : U \to U$ a contraction mapping when there exists a constant $\kappa \in [0,1)$ such that
$$\|G v_1 - G v_2\|_U \leq \kappa \|v_1 - v_2\|_U, \quad \forall v_1, v_2 \in U.$$

Theorem 2.1. (Banach fixed point theorem) Let $U$ be a Banach space with norm $\|\cdot\|_U$, and let $G : U \to U$ be a contraction mapping with constant $\kappa \in [0,1)$. Then,
(i) there exists a unique $v^* \in U$ such that $G v^* = v^*$;
(ii) for arbitrary $v^0 \in U$, the sequence $v^k = G v^{k-1} = G^k v^0$ converges in norm to $v^*$ as $k \to \infty$;
(iii) $\|v^{k+1} - v^*\|_U \leq \kappa \|v^k - v^*\|_U$ for all $k \geq 0$.

For the rest of the paper, let $\mathcal{C}$ denote the space of contraction mappings from $\mathbb{R}^{|\mathcal{S}|}$ to $\mathbb{R}^{|\mathcal{S}|}$. It is well known that the Bellman operator $T \in \mathcal{C}$ with constant $\kappa = \alpha$, and hence has a unique fixed point $v^*$. It is known that value iteration converges to $v^*$ as $k \to \infty$. In fact, $v^*$ is the optimal value function.
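For concreteness, the following is a minimal Python sketch of classical value iteration for a finite MDP. The transition array `P`, cost array `c`, and the tolerance-based stopping rule are illustrative assumptions and not part of the paper's formal development.

```python
import numpy as np

def value_iteration(P, c, alpha, tol=1e-8, max_iter=10_000):
    """Classical value iteration: v <- Tv until the sup-norm change is small.

    P: array of shape (S, A, S), P[s, a, j] = Q(j | s, a)
    c: array of shape (S, A), one-stage costs (infeasible actions can be set to +inf)
    alpha: discount factor in (0, 1)
    """
    S, A, _ = P.shape
    v = np.zeros(S)
    for _ in range(max_iter):
        # [Tv](s) = min_a { c(s,a) + alpha * sum_j Q(j|s,a) v(j) }
        q = c + alpha * (P @ v)              # shape (S, A): cost plus expected cost-to-go
        v_new = q.min(axis=1)
        if np.max(np.abs(v_new - v)) < tol:  # contraction => geometric convergence
            return v_new
        v = v_new
    return v
```

Because $T$ is an $\alpha$-contraction, the sup-norm error shrinks geometrically, which is what the stopping test above exploits.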
Policy iteration

Policy iteration is another well-known dynamic programming algorithm for solving MDPs. For a fixed policy $\pi \in \Pi$, define $T_\pi : \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$ as
$$[T_\pi v](s) = c(s, \pi(s)) + \alpha\, E[v(\tilde{s}) \mid s, \pi(s)].$$
The first step is a policy evaluation step: compute $v^\pi$ by solving $T_\pi v^\pi = v^\pi$ for $v^\pi$. Let $c_\pi \in \mathbb{R}^{|\mathcal{S}|}$ be the vector of one period costs corresponding to a policy $\pi$, $c_\pi(s) = c(s, \pi(s))$, and let $Q_\pi$ be the transition kernel corresponding to the policy $\pi$. Then, writing $T_\pi v^\pi = v^\pi$, we have the linear system
$$c_\pi + \alpha Q_\pi v^\pi = v^\pi. \quad \text{(Policy Evaluation)}$$
The second step is a policy improvement step: given a value function $v \in \mathbb{R}^{|\mathcal{S}|}$, find an 'improved' policy $\pi \in \Pi$ with respect to $v$ such that
$$T_\pi v = T v. \quad \text{(Policy Update)}$$
Thus, policy iteration produces a sequence of policies $\{\pi_k\}_{k \geq 0}$ and value functions $\{v^k\}_{k \geq 0}$ as follows. At iteration $k \geq 0$, we solve the linear system $T_{\pi_k} v^{\pi_k} = v^{\pi_k}$ for $v^{\pi_k}$, and then we choose a new policy $\pi_{k+1}$ satisfying $T_{\pi_{k+1}} v^{\pi_k} = T v^{\pi_k}$, which is greedy with respect to $v^{\pi_k}$. We have a linear convergence rate for policy iteration as well. Let $v \in \mathbb{R}^{|\mathcal{S}|}$ be any value function, solve $T_\pi v = T v$ for $\pi$, and then compute $v^\pi$. Then, we know [9, Lemma 6.2] that
$$\|v^\pi - v^*\| \leq \alpha \|v - v^*\|,$$
from which convergence of policy iteration follows. Unless otherwise specified, the norm $\|\cdot\|$ we use in this paper is the sup norm.
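A corresponding minimal sketch of exact policy iteration, in Python, with the same illustrative `P` and `c` arrays as in the value iteration sketch; the direct linear solve in the evaluation step is one of several ways to carry out that step.

```python
import numpy as np

def policy_iteration(P, c, alpha, max_iter=1_000):
    """Exact policy iteration: alternate policy evaluation and greedy improvement."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                 # initial policy
    for _ in range(max_iter):
        # Policy evaluation: solve v = c_pi + alpha * Q_pi v  (a linear system)
        Q_pi = P[np.arange(S), pi]              # shape (S, S)
        c_pi = c[np.arange(S), pi]              # shape (S,)
        v = np.linalg.solve(np.eye(S) - alpha * Q_pi, c_pi)
        # Policy improvement: pi'(s) in argmin_a { c(s,a) + alpha * E[v | s, a] }
        pi_new = (c + alpha * (P @ v)).argmin(axis=1)
        if np.array_equal(pi_new, pi):          # greedy policy stable => optimal
            return pi, v
        pi = pi_new
    return pi, v
```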
We use the following helpful fact in the paper. The proof is given in Appendix A.1.

Remark 2.1. Let $X$ be a given set, and let $f_1 : X \to \mathbb{R}$ and $f_2 : X \to \mathbb{R}$ be two real-valued functions on $X$. Then,
(i) $|\inf_{x \in X} f_1(x) - \inf_{x \in X} f_2(x)| \leq \sup_{x \in X} |f_1(x) - f_2(x)|$, and
(ii) $|\sup_{x \in X} f_1(x) - \sup_{x \in X} f_2(x)| \leq \sup_{x \in X} |f_1(x) - f_2(x)|$.

3. Empirical Algorithms for Dynamic Programming

We now present empirical variants of dynamic programming algorithms. Our focus will be on value and policy iteration. As the reader will see, the idea is simple and natural. In subsequent sections we will introduce the new notions and techniques needed to prove their convergence.
We introduce empirical value iteration (EVI) first. The Bellman operator $T$ requires exact evaluation of the expectation
$$E[v(\tilde{s}) \mid s,a] = \sum_{j \in \mathcal{S}} Q(j \mid s,a)\, v(j).$$
We will simulate and replace this exact expectation with an empirical estimate in each iteration. Thus, we need a simulation model for the MDP. Let $\psi : \mathcal{S} \times \mathcal{A} \times [0,1] \to \mathcal{S}$ be a simulation model for the state evolution of the MDP, i.e., $\psi$ yields the next state given the current state, the action taken, and an i.i.d. random variable. Without loss of generality, we can assume that $\xi$ is a uniform random variable on $[0,1]$ and $(s,a) \in \mathbb{K}$. With this convention, the Bellman operator can be written as
$$[Tv](s) \triangleq \min_{a \in \mathcal{A}(s)} \left\{ c(s,a) + \alpha\, E[v(\psi(s,a,\xi))] \right\}, \quad \forall s \in \mathcal{S}.$$
Now, we replace the expectation $E[v(\psi(s,a,\xi))]$ with its sample average approximation by simulating $\xi$. Given $n$ i.i.d. samples of a uniform random variable, denoted $\{\xi_i\}_{i=1}^n$, the empirical estimate of $E[v(\psi(s,a,\xi))]$ is $\frac{1}{n}\sum_{i=1}^n v(\psi(s,a,\xi_i))$. We note that the samples are regenerated at each iteration.
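As a concrete instance of the simulation model, one can always realize $\psi$ by inverse-transform sampling from the transition law $Q$. The small construction below is an illustrative sketch under that assumption, not a prescription from the paper; `make_psi` and the array `P` are hypothetical names.

```python
import numpy as np

def make_psi(P):
    """Build a simulation model psi(s, a, xi) from a transition array P[s, a, j] = Q(j | s, a).

    Inverse-transform sampling: the next state is the smallest j whose cumulative
    probability exceeds the uniform sample xi.
    """
    cum = np.cumsum(P, axis=2)                  # cumulative distribution over next states
    n_states = P.shape[2]
    def psi(s, a, xi):
        j = int(np.searchsorted(cum[s, a], xi, side="right"))
        return min(j, n_states - 1)             # guard against floating-point round-off
    return psi

# usage sketch: psi maps uniform noise on [0, 1] to the next state
# rng = np.random.default_rng(0); psi = make_psi(P); next_state = psi(s, a, rng.uniform())
```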
Thus, the EVI algorithm can be summarized as follows.

Algorithm 1 Empirical Value Iteration
Input: $\hat{v}^0 \in \mathbb{R}^{|\mathcal{S}|}$, sample size $n \geq 1$.
1. Set $k = 0$.
2. Sample $n$ uniformly distributed random variables $\{\xi_i\}_{i=1}^n$ from $[0,1]$, and compute
$$\hat{v}^{k+1}(s) = \min_{a \in \mathcal{A}(s)} \left\{ c(s,a) + \frac{\alpha}{n} \sum_{i=1}^n \hat{v}^k(\psi(s,a,\xi_i)) \right\}, \quad \forall s \in \mathcal{S}.$$
3. Increment $k := k + 1$ and return to step 2.

In each iteration, we regenerate samples and use this empirical estimate to approximate $T$.
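The following is a minimal Python sketch of Algorithm 1, assuming a simulation model `psi(s, a, xi)` that maps a uniform noise sample to the next state, as in the text; the vectorization and the fixed iteration count are illustrative choices.

```python
import numpy as np

def empirical_value_iteration(psi, c, alpha, n, num_iters, seed=0):
    """EVI: replace E[v(psi(s,a,xi))] with an n-sample average, regenerated each iteration.

    psi: function (s, a, xi) -> next state, the simulation model of the MDP
    c:   array of shape (S, A), one-stage costs
    """
    rng = np.random.default_rng(seed)
    S, A = c.shape
    v = np.zeros(S)
    for _ in range(num_iters):
        xi = rng.uniform(size=n)                      # fresh noise samples each iteration
        v_new = np.empty(S)
        for s in range(S):
            q = np.empty(A)
            for a in range(A):
                # empirical estimate of E[v(psi(s, a, xi))]
                emp_mean = np.mean([v[psi(s, a, x)] for x in xi])
                q[a] = c[s, a] + alpha * emp_mean
            v_new[s] = q.min()
        v = v_new                                     # one application of the empirical Bellman operator
    return v
```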
Now we give the sample complexity of the EVI algorithm. The proof is given in Section 5.

Theorem 3.1. Given $\epsilon \in (0,1)$ and $\delta \in (0,1)$, fix $\epsilon_g = \epsilon/\eta^*$ and select $\delta_1, \delta_2 > 0$ such that $\delta_1 + 2\delta_2 \leq \delta$, where $\eta^* = \lceil 2/(1-\alpha) \rceil$. Select an $n$ such that
$$n \geq n(\epsilon, \delta) = \frac{2(\kappa^*)^2}{\epsilon_g^2} \log \frac{2|\mathbb{K}|}{\delta_1},$$
where $\kappa^* = \max_{(s,a) \in \mathbb{K}} c(s,a)/(1-\alpha)$, and select a $k$ such that
$$k \geq k(\epsilon, \delta) = \log\left(\frac{1}{\delta_2\, \mu_{n,\min}}\right),$$
where $\mu_{n,\min} = \min_\eta \mu_n(\eta)$ and $\mu_n(\eta)$ is given by Lemma 4.3. Then
$$P\left\{ \|\hat{v}^k_n - v^*\| \geq \epsilon \right\} \leq \delta.$$

Remark 3.1. This result says that, if we take $n \geq n(\epsilon, \delta)$ samples in each iteration of the EVI algorithm and perform $k > k(\epsilon, \delta)$ iterations, then the EVI iterate $\hat{v}^k_n$ is $\epsilon$-close to the optimal value function $v^*$ with probability greater than $1 - \delta$. We note that the sample complexity is $O\left(\frac{1}{\epsilon^2}, \log \frac{1}{\delta}, \log |\mathcal{S}|, \log |\mathcal{A}|\right)$.

The basic idea in the analysis is to frame EVI as iteration of a random operator $\widehat{T}_n$, which we call the empirical Bellman operator. We define $\widehat{T}_n$ as
$$\left[\widehat{T}_n(\omega)\, v\right](s) \triangleq \min_{a \in \mathcal{A}(s)} \left\{ c(s,a) + \frac{\alpha}{n} \sum_{i=1}^n v(\psi(s,a,\xi_i)) \right\}, \quad \forall s \in \mathcal{S}. \quad (3.1)$$
This is a random operator because it depends on the random noise samples $\{\xi_i\}_{i=1}^n$. The definition and the analysis of this operator are done rigorously in Section 5.

We now define EPI along the same lines by replacing exact policy improvement and evaluation with empirical estimates. For a fixed policy $\pi \in \Pi$, we can estimate $v^\pi(s)$ via simulation. Given a sequence of noise $\omega = (\xi_i)_{i \geq 0}$, we have $s_{t+1} = \psi(s_t, \pi(s_t), \xi_t)$ for all $t \geq 0$. For $\gamma > 0$, choose a finite horizon $T$ such that
$$\max_{(s,a) \in \mathbb{K}} |c(s,a)| \sum_{t=T+1}^{\infty} \alpha^t < \gamma.$$
We use the time horizon $T$ to truncate simulation, since we must stop simulation after finite time. Let
$$[\hat{v}^\pi(s)](\omega) = \sum_{t=0}^{T} \alpha^t c(s_t(\omega), \pi(s_t(\omega)))$$
be the realization of $\sum_{t=0}^{T} \alpha^t c(s_t, a_t)$ on the sample path $\omega$.

The next algorithm requires two input parameters, $n$ and $q$, which determine sample sizes. Parameter $n$ is the sample size for policy improvement and parameter $q$ is the sample size for policy evaluation. We discuss the choices of these parameters in detail later. In the following algorithm, the notation $s_t(\omega_i)$ is understood as the state at time $t$ in the simulated trajectory $\omega_i$.
Algorithm 2 Empirical Policy Iteration
Input: $\pi_0 \in \Pi$, $\epsilon > 0$, sample sizes $q, n \geq 1$.
1. Set $k = 0$.
2. For each $s \in \mathcal{S}$, draw $\omega_1, \ldots, \omega_q \in \Omega$ and compute
$$\hat{v}^{\pi_k}(s) = \frac{1}{q} \sum_{i=1}^{q} \sum_{t=0}^{T} \alpha^t c(s_t(\omega_i), \pi_k(s_t(\omega_i))).$$
3. Draw $\xi_1, \ldots, \xi_n \in [0,1]$ and choose $\pi_{k+1}$ to satisfy
$$\pi_{k+1}(s) \in \arg\min_{a \in \mathcal{A}(s)} \left\{ c(s,a) + \frac{\alpha}{n} \sum_{i=1}^{n} \hat{v}^{\pi_k}(\psi(s,a,\xi_i)) \right\}, \quad \forall s \in \mathcal{S}.$$
4. Increase $k := k + 1$ and return to step 2.
5. Stop when $\|\hat{v}^{\pi_{k+1}} - \hat{v}^{\pi_k}\| \leq \epsilon$.

Step 2 replaces the solution of the system $v = c_\pi + \alpha Q_\pi v$ (policy evaluation). Step 3 replaces the computation of $T_\pi v = T v$ (policy improvement).
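A sketch of Algorithm 2 in Python, under the same assumed simulation model `psi`; trajectory simulation, the truncation horizon `T`, and the step structure follow the description above, while the loop organization is an illustrative choice.

```python
import numpy as np

def empirical_policy_iteration(psi, c, alpha, q, n, T, num_iters, seed=0):
    """EPI: policy evaluation by q truncated simulation runs, improvement by n noise samples."""
    rng = np.random.default_rng(seed)
    S, A = c.shape
    pi = np.zeros(S, dtype=int)
    for _ in range(num_iters):
        # Step 2: empirical policy evaluation, q trajectories of length T+1 from each state
        v_hat = np.zeros(S)
        for s in range(S):
            total = 0.0
            for _ in range(q):
                st, disc = s, 1.0
                for _t in range(T + 1):
                    total += disc * c[st, pi[st]]
                    st = psi(st, pi[st], rng.uniform())
                    disc *= alpha
            v_hat[s] = total / q
        # Step 3: empirical policy improvement using n fresh noise samples
        xi = rng.uniform(size=n)
        pi_new = np.empty(S, dtype=int)
        for s in range(S):
            q_vals = [c[s, a] + alpha * np.mean([v_hat[psi(s, a, x)] for x in xi])
                      for a in range(A)]
            pi_new[s] = int(np.argmin(q_vals))
        pi = pi_new
    return pi, v_hat
```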
We now give the sample complexity result for EPI. The proof is given in Section 5.

Theorem 3.2. Given $\epsilon \in (0,1)$ and $\delta \in (0,1)$, select $\delta_1, \delta_2 > 0$ such that $\delta_1 + 2\delta_2 < \delta$. Also select $\delta_3, \delta_4 > 0$ such that $\delta_3 + \delta_4 < \delta_1$. Then, select $\epsilon_1, \epsilon_2 > 0$ such that $\epsilon_g = \frac{\epsilon_2 + 2\alpha\epsilon_1}{1 - \alpha}$, where $\epsilon_g = \epsilon/\eta^*$ and $\eta^* = \lceil 2/(1-\alpha) \rceil$. Then, select $q$ and $n$ such that
$$q \geq q(\epsilon, \delta) = \frac{2(\kappa^*(T+1))^2}{(\epsilon_1 - \gamma)^2} \log \frac{2|\mathcal{S}|}{\delta_3}, \qquad n \geq n(\epsilon, \delta) = \frac{2(\kappa^*)^2}{(\epsilon_2/\alpha)^2} \log \frac{2|\mathbb{K}|}{\delta_4},$$
where $\kappa^* = \max_{(s,a) \in \mathbb{K}} c(s,a)/(1-\alpha)$, and select a $k$ such that
$$k \geq k(\epsilon, \delta) = \log\left(\frac{1}{\delta_2\, \mu_{n,q,\min}}\right),$$
where $\mu_{n,q,\min} = \min_\eta \mu_{n,q}(\eta)$ and $\mu_{n,q}(\eta)$ is given by equation (5.6). Then,
$$P\left\{\|v^{\pi_k} - v^*\| \geq \epsilon\right\} \leq \delta.$$

Remark 3.2. This result says that, if we perform $q \geq q(\epsilon, \delta)$ simulation runs for each empirical policy evaluation, use $n \geq n(\epsilon, \delta)$ samples for each empirical policy update, and perform $k > k(\epsilon, \delta)$ iterations, then the true value $v^{\pi_k}$ of the policy $\pi_k$ will be $\epsilon$-close to the optimal value function $v^*$ with probability greater than $1 - \delta$. We note that $q$ is $O\left(\frac{1}{\epsilon^2}, \log\frac{1}{\delta}, \log|\mathcal{S}|\right)$ and $n$ is $O\left(\frac{1}{\epsilon^2}, \log\frac{1}{\delta}, \log|\mathcal{S}|, \log|\mathcal{A}|\right)$.
4. Iteration of Random Operators
The empirical Bellman operator $\widehat{T}_n$ we defined in equation (3.1) is a random operator. When it operates on a vector, it yields a random vector. When $\widehat{T}_n$ is iterated, it produces a stochastic process, and we are interested in the possible convergence of this stochastic process. The underlying assumption is that the random operator $\widehat{T}_n$ is an 'approximation' of a deterministic operator $T$ such that $\widehat{T}_n$ converges to $T$ (in a sense we will shortly make precise) as $n$ increases. For example, the empirical Bellman operator approximates the classical Bellman operator. We make this intuition mathematically rigorous in this section. The discussion in this section is not specific to the Bellman operator, but applies whenever a deterministic operator $T$ is being approximated by an improving sequence of random operators $\{\widehat{T}_n\}_{n \geq 1}$.

In this subsection we formalize the definition of a random operator, denoted by $\widehat{T}_n$. Since $\widehat{T}_n$ is a random operator, we need an appropriate probability space upon which to define it. So, we define the sample space $\Omega = [0,1]^\infty$, the $\sigma$-algebra $\mathcal{F} = \mathcal{B}^\infty$ where $\mathcal{B}$ is the inherited Borel $\sigma$-algebra on $[0,1]$, and the probability measure $P$ on $\Omega$ formed by an infinite sequence of uniform random variables. The primitive uncertainties on $\Omega$ are infinite sequences of uniform noise $\omega = (\xi_i)_{i \geq 1}$, where each $\xi_i$ is an independent uniform random variable on $[0,1]$. We take $(\Omega, \mathcal{F}, P)$ as the appropriate probability space on which to define iteration of the random operators $\{\widehat{T}_n\}_{n \geq 1}$.

Next we define a composition of random operators, $\widehat{T}_n^k$, on the probability space $(\Omega^\infty, \mathcal{F}^\infty, P)$, for all $k \geq 1$ and $n \geq 1$:
$$\widehat{T}_n^k(\omega)\, v = \widehat{T}_n(\omega_{k-1})\, \widehat{T}_n(\omega_{k-2}) \cdots \widehat{T}_n(\omega_0)\, v.$$
Note that $\omega \in \Omega^\infty$ is an infinite sequence $(\omega_j)_{j \geq 0}$, where each $\omega_j = (\xi_{j,i})_{i \geq 1}$. Then we can define the iteration of $\widehat{T}_n$ with an initial seed $\hat{v}^0_n \in \mathbb{R}^{|\mathcal{S}|}$ (we use the hat notation to emphasize that the iterates are random variables generated by the empirical operator) as
$$\hat{v}^{k+1}_n = \widehat{T}_n \hat{v}^k_n = \widehat{T}_n^{k+1} \hat{v}^0_n. \quad (4.1)$$
Notice that we only iterate over $k$ for fixed $n$. The sample size $n$ is constant in every stochastic process $\{\hat{v}^k_n\}_{k \geq 0}$, where $\hat{v}^k_n = \widehat{T}_n^k \hat{v}^0_n$ for all $k \geq 1$. For a fixed $\hat{v}^0_n$, we can view each $\hat{v}^k_n$ as a measurable mapping from $\Omega^\infty$ to $\mathbb{R}^{|\mathcal{S}|}$ via the mapping $\hat{v}^k_n(\omega) = \widehat{T}_n^k(\omega)\, \hat{v}^0_n$.

The relationship between the fixed points of the deterministic operator $T$ and the probabilistic fixed points of the random operators $\{\widehat{T}_n\}_{n \geq 1}$ depends on how $\{\widehat{T}_n\}_{n \geq 1}$ approximates $T$. Motivated by the relationship between the classical and the empirical Bellman operator, we will make the following assumption.

Assumption 4.1. $P\left(\lim_{n \to \infty} \|\widehat{T}_n v - T v\| \geq \epsilon\right) = 0$ for all $\epsilon > 0$ and all $v \in \mathbb{R}^{|\mathcal{S}|}$. Also, $T$ has a (possibly non-unique) fixed point $v^*$ such that $T v^* = v^*$.

Assumption 4.1 is equivalent to $\lim_{n \to \infty} \widehat{T}_n(\omega)\, v = T v$ for $P$-almost all $\omega \in \Omega$. Here, we benefit from defining all of the random operators $\{\widehat{T}_n\}_{n \geq 1}$ together on the sample space $\Omega = [0,1]^\infty$, so that the above convergence statement makes sense.

Strong probabilistic fixed point:
We now introduce a natural probabilistic fixed point notion for $\{\widehat{T}_n\}_{n \geq 1}$, in analogy to the definition of a fixed point, $\|T v^* - v^*\| = 0$, of a deterministic operator.

Definition 4.1. A vector $\hat{v} \in \mathbb{R}^{|\mathcal{S}|}$ is a strong probabilistic fixed point for the sequence $\{\widehat{T}_n\}_{n \geq 1}$ if
$$\lim_{n \to \infty} P\left(\|\widehat{T}_n \hat{v} - \hat{v}\| > \epsilon\right) = 0, \quad \forall \epsilon > 0.$$
We note that the above notion is defined for a sequence of random operators, rather than for a single random operator.
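To make this definition concrete, one can estimate the probability in Definition 4.1 by Monte Carlo for a small MDP: apply the empirical Bellman operator (as sketched in Section 3) repeatedly to a candidate vector and count how often the deviation exceeds $\epsilon$ as $n$ grows. The helper names below (`empirical_bellman`, `psi`, `v_hat`) are illustrative assumptions, not objects defined in the paper.

```python
import numpy as np

def empirical_bellman(v, psi, c, alpha, n, rng):
    """One draw of the empirical Bellman operator applied to v, using n fresh uniform samples."""
    S, A = c.shape
    xi = rng.uniform(size=n)
    out = np.empty(S)
    for s in range(S):
        out[s] = min(c[s, a] + alpha * np.mean([v[psi(s, a, x)] for x in xi])
                     for a in range(A))
    return out

def deviation_probability(v_hat, psi, c, alpha, n, eps, trials=200, seed=0):
    """Monte Carlo estimate of P(|| \\hat T_n v_hat - v_hat || > eps)."""
    rng = np.random.default_rng(seed)
    exceed = 0
    for _ in range(trials):
        Tv = empirical_bellman(v_hat, psi, c, alpha, n, rng)
        exceed += np.max(np.abs(Tv - v_hat)) > eps
    return exceed / trials

# if v_hat is (close to) the fixed point v*, this estimate should decay toward 0 as n grows
```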
Remark 4.1.
We can give a slightly more general notion of a probabilistic fixed point, which we call an $(\epsilon, \delta)$-strong probabilistic fixed point. For a fixed $(\epsilon, \delta)$, we say that a vector $\hat{v} \in \mathbb{R}^{|\mathcal{S}|}$ is an $(\epsilon, \delta)$-strong probabilistic fixed point if there exists an $n(\epsilon, \delta)$ such that for all $n \geq n(\epsilon, \delta)$ we get $P\left(\|\widehat{T}_n \hat{v} - \hat{v}\| > \epsilon\right) < \delta$. Note that all strong probabilistic fixed points satisfy this condition for arbitrary $(\epsilon, \delta)$ and hence are $(\epsilon, \delta)$-strong probabilistic fixed points. However, the converse need not be true. In many cases we may be looking for an $\epsilon$-optimal 'solution' with a $1 - \delta$ 'probabilistic guarantee' where $(\epsilon, \delta)$ is fixed a priori. In fact, this would provide an approximation to the strong probabilistic fixed point of the sequence of operators.

Weak probabilistic fixed point:
It is well known that iteration of a deterministic contraction operator converges to its fixed point. It is unclear whether a similar property would hold for random operators, and whether they would converge to the strong probabilistic fixed point of the sequence $\{\widehat{T}_n\}_{n \geq 1}$ in any way. Thus, we define an apparently weaker notion of a probabilistic fixed point that explicitly considers iteration.

Definition 4.2.
A vector $\hat{v} \in \mathbb{R}^{|\mathcal{S}|}$ is a weak probabilistic fixed point for $\{\widehat{T}_n\}_{n \geq 1}$ if
$$\lim_{n \to \infty} \limsup_{k \to \infty} P\left(\|\widehat{T}_n^k v - \hat{v}\| > \epsilon\right) = 0, \quad \forall \epsilon > 0,\ \forall v \in \mathbb{R}^{|\mathcal{S}|}.$$
We use $\limsup_{k \to \infty} P\left(\|\widehat{T}_n^k v - \hat{v}\| > \epsilon\right)$ instead of $\lim_{k \to \infty} P\left(\|\widehat{T}_n^k v - \hat{v}\| > \epsilon\right)$ because the latter limit may not exist for any fixed $n \geq 1$.

Remark 4.2.
Similar to the definition that we gave in Remark 4.1, we can define an $(\epsilon, \delta)$-weak probabilistic fixed point. For a fixed $(\epsilon, \delta)$, we say that a vector $\hat{v} \in \mathbb{R}^{|\mathcal{S}|}$ is an $(\epsilon, \delta)$-weak probabilistic fixed point if there exists an $n(\epsilon, \delta)$ such that for all $n \geq n(\epsilon, \delta)$ we get $\limsup_{k \to \infty} P\left(\|\widehat{T}_n^k v - \hat{v}\| > \epsilon\right) < \delta$. As before, all weak probabilistic fixed points are indeed $(\epsilon, \delta)$-weak probabilistic fixed points, but the converse need not be true.

At this point the connection between strong/weak probabilistic fixed points of the random operator $\widehat{T}_n$ and the classical fixed point of the deterministic operator $T$ is not clear. Also, it is not clear whether the random sequence $\{\hat{v}^k_n\}_{k \geq 0}$ converges to either of these two fixed points. In the following subsections we address these issues.

A stochastic process on $\mathbb{N}$: In this subsection, we construct a new stochastic process on $\mathbb{N}$ that will be useful in our analysis. We first start with a simple lemma.

Lemma 4.1.
The stochastic process $\{\hat{v}^k_n\}_{k \geq 0}$ is a Markov chain on $\mathbb{R}^{|\mathcal{S}|}$.

Proof: This follows from the fact that each iteration of $\widehat{T}_n$ is independent and identically distributed. Thus, the next iterate $\hat{v}^{k+1}_n$ depends on the history only through the current iterate $\hat{v}^k_n$. □

Even though $\{\hat{v}^k_n\}_{k \geq 0}$ is a Markov chain, its analysis is complicated by two factors. First, $\{\hat{v}^k_n\}_{k \geq 0}$ is a Markov chain on the continuous state space $\mathbb{R}^{|\mathcal{S}|}$, which introduces technical difficulties in general when compared to a discrete state space. Second, the transition probabilities of $\{\hat{v}^k_n\}_{k \geq 0}$ are too complicated to compute explicitly.

Since we are approximating $T$ by $\widehat{T}_n$ and want to compute $v^*$, we should track the progress of $\{\hat{v}^k_n\}_{k \geq 0}$ to the fixed point $v^*$ of $T$. Equivalently, we are interested in the real-valued stochastic process $\{\|\hat{v}^k_n - v^*\|\}_{k \geq 0}$. If $\|\hat{v}^k_n - v^*\|$ approaches zero then $\hat{v}^k_n$ approaches $v^*$, and vice versa. The state space of the stochastic process $\{\|\hat{v}^k_n - v^*\|\}_{k \geq 0}$ is $\mathbb{R}$, which is simpler than the state space $\mathbb{R}^{|\mathcal{S}|}$ of $\{\hat{v}^k_n\}_{k \geq 0}$, but which is still continuous. Moreover, $\{\|\hat{v}^k_n - v^*\|\}_{k \geq 0}$ is a non-Markovian process in general. In fact, it would be easier to study a related stochastic process on a discrete, and ideally a finite, state space. In this subsection we show how this can be done. We make a boundedness assumption next.

Assumption 4.2.
There exists a $\kappa^* < \infty$ such that $\|\hat{v}^k_n\| \leq \kappa^*$ almost surely for all $k \geq 0$ and $n \geq 1$. Also, $\|v^*\| \leq \kappa^*$.

Under this assumption we can restrict the stochastic process $\{\|\hat{v}^k_n - v^*\|\}_{k \geq 0}$ to the compact state space $B_{\kappa^*}(0) = \{v \in \mathbb{R}^{|\mathcal{S}|} : \|v\| \leq \kappa^*\}$. We will adopt the convention that any element $v$ outside of $B_{\kappa^*}(0)$ will be mapped to its projection $\kappa^* v / \|v\|$ onto $B_{\kappa^*}(0)$ by any realization of $\widehat{T}_n$.

Choose a granularity $\epsilon_g > 0$. We divide $\mathbb{R}_+$ into intervals of length $\epsilon_g$ starting at zero, and we will note which interval is occupied by $\|\hat{v}^k_n - v^*\|$ at each $k \geq 0$. We will define a new stochastic process $\{X^k_n\}_{k \geq 0}$ on $(\Omega^\infty, \mathcal{F}^\infty, P)$ with state space $\mathbb{N}$. The idea is that $\{X^k_n\}_{k \geq 0}$ will report which interval of $[0, \kappa^*]$ is occupied by $\{\|\hat{v}^k_n - v^*\|\}_{k \geq 0}$. Define $X^k_n : \Omega^\infty \to \mathbb{N}$ via the rule
$$X^k_n(\omega) = \begin{cases} 0, & \text{if } \|\hat{v}^k_n(\omega) - v^*\| = 0, \\ \eta \geq 1, & \text{if } (\eta - 1)\,\epsilon_g < \|\hat{v}^k_n(\omega) - v^*\| \leq \eta\, \epsilon_g, \end{cases} \quad (4.2)$$
for all $k \geq 0$. More compactly, $X^k_n(\omega) = \lceil \|\hat{v}^k_n(\omega) - v^*\| / \epsilon_g \rceil$, where $\lceil \chi \rceil$ denotes the smallest integer greater than or equal to $\chi \in \mathbb{R}$. Thus the stochastic process $\{X^k_n\}_{k \geq 0}$ is a report on how close the stochastic process $\{\|\hat{v}^k_n - v^*\|\}_{k \geq 0}$ is to zero, and in turn how close the Markov chain $\{\hat{v}^k_n\}_{k \geq 0}$ is to the true fixed point $v^*$ of $T$. Define the constant
$$N^* \triangleq \left\lceil \frac{\kappa^*}{\epsilon_g} \right\rceil.$$
$N^*$ is the smallest number of intervals of length $\epsilon_g$ needed to cover the interval $[0, \kappa^*]$. By construction, the stochastic process $\{X^k_n\}_{k \geq 0}$ is restricted to the finite state space $\{\eta \in \mathbb{N} : 0 \leq \eta \leq N^*\}$. The process $\{X^k_n\}_{k \geq 0}$ need not be a Markov chain. However, it is easier to work with than either $\{\hat{v}^k_n\}_{k \geq 0}$ or $\{\|\hat{v}^k_n - v^*\|\}_{k \geq 0}$ because it has a discrete state space. It is also easy to relate $\{X^k_n\}_{k \geq 0}$ back to $\{\|\hat{v}^k_n - v^*\|\}_{k \geq 0}$.

Recall that $X \geq_{as} Y$ denotes almost sure inequality between two random variables $X$ and $Y$ defined on the same probability space. The stochastic processes $\{X^k_n\}_{k \geq 0}$ and $\{\|\hat{v}^k_n - v^*\|/\epsilon_g\}_{k \geq 0}$ are defined on the same probability space, so the next lemma follows by construction of $\{X^k_n\}_{k \geq 0}$.

Lemma 4.2.
For all $k \geq 0$, $X^k_n \geq_{as} \|\hat{v}^k_n - v^*\| / \epsilon_g$.

To proceed, we will make the following assumptions about the deterministic operator $T$ and the random operator $\widehat{T}_n$.

Assumption 4.3. $\|T v - v^*\| \leq \alpha \|v - v^*\|$ for all $v \in \mathbb{R}^{|\mathcal{S}|}$.

Assumption 4.4.
There is a sequence $\{p_n\}_{n \geq 1}$ such that $P\left(\|T v - \widehat{T}_n v\| < \epsilon\right) > p_n(\epsilon)$ and $p_n(\epsilon) \uparrow 1$ as $n \to \infty$, for all $v \in B_{\kappa^*}(0)$ and all $\epsilon > 0$.

We now discuss the convergence rate of $\{X^k_n\}_{k \geq 0}$. Let $X^k_n = \eta$. On the event $F = \{\|T \hat{v}^k_n - \widehat{T}_n \hat{v}^k_n\| < \epsilon_g\}$, we have
$$\|\hat{v}^{k+1}_n - v^*\| \leq \|\widehat{T}_n \hat{v}^k_n - T \hat{v}^k_n\| + \|T \hat{v}^k_n - v^*\| \leq (\alpha \eta + 1)\, \epsilon_g,$$
where we used Assumption 4.3 and the definition of $X^k_n$. Now using Assumption 4.4 we can summarize:
$$\text{If } X^k_n = \eta, \text{ then } X^{k+1}_n \leq \lceil \alpha \eta + 1 \rceil \text{ with probability at least } p_n(\epsilon_g). \quad (4.3)$$

We conclude this subsection with a comment about the state space of the stochastic process $\{X^k_n\}_{k \geq 0}$. If we start with $X^k_n = \eta$ and if $\lceil \alpha \eta + 1 \rceil < \eta$, then we must have improvement in the proximity of $\hat{v}^{k+1}_n$ to $v^*$. We define a new constant
$$\eta^* = \min\{\eta \in \mathbb{N} : \lceil \alpha \eta + 1 \rceil < \eta\} = \left\lceil \frac{2}{1 - \alpha} \right\rceil.$$
If $\eta$ is too small, then $\lceil \alpha \eta + 1 \rceil$ may be equal to $\eta$ and no improvement in the proximity of $\hat{v}^k_n$ to $v^*$ can be detected by $\{X^k_n\}_{k \geq 0}$. For any $\eta \geq \eta^*$, $\lceil \alpha \eta + 1 \rceil < \eta$ and strict improvement must hold. So, for the stochastic process $\{X^k_n\}_{k \geq 0}$, we can restrict our attention to the state space $\mathcal{X} := \{\eta^*, \eta^* + 1, \ldots, N^* - 1, N^*\}$.

If we could understand the behavior of the stochastic process $\{X^k_n\}_{k \geq 0}$, then we could make statements about the convergence of $\{\|\hat{v}^k_n - v^*\|\}_{k \geq 0}$ and $\{\hat{v}^k_n\}_{k \geq 0}$. Although simpler than $\{\|\hat{v}^k_n - v^*\|\}_{k \geq 0}$ and $\{\hat{v}^k_n\}_{k \geq 0}$, the stochastic process $\{X^k_n\}_{k \geq 0}$ is still too complicated to work with analytically. We overcome this difficulty with a family of dominating Markov chains. We now present our dominance argument. Several technical details are expanded upon in the appendix.

We will denote our family of "dominating" Markov chains (MC) by $\{Y^k_n\}_{k \geq 0}$. We will construct these Markov chains to be tractable and to help us analyze $\{X^k_n\}_{k \geq 0}$. Notice that the family $\{Y^k_n\}_{k \geq 0}$ has explicit dependence on $n \geq 1$. We do not necessarily construct $\{Y^k_n\}_{k \geq 0}$ on the probability space $(\Omega^\infty, \mathcal{F}^\infty, P)$. Rather, we view $\{Y^k_n\}_{k \geq 0}$ as being defined on $(\mathbb{N}^\infty, \mathcal{N})$, the canonical measurable space of trajectories on $\mathbb{N}$, so $Y^k_n : \mathbb{N}^\infty \to \mathbb{N}$. We will use $Q$ to denote the probability measure of $\{Y^k_n\}_{k \geq 0}$ on $(\mathbb{N}^\infty, \mathcal{N})$. Since $\{Y^k_n\}_{k \geq 0}$ will be a Markov chain by construction, the probability measure $Q$ will be completely determined by an initial distribution on $\mathbb{N}$ and a transition kernel. We denote the transition kernel of $\{Y^k_n\}_{k \geq 0}$ as $Q_n$.

Our specific choice for $\{Y^k_n\}_{k \geq 0}$ is motivated by analytical expediency, though the reader will see that many other choices are possible. We now construct the process $\{Y^k_n\}_{k \geq 0}$ explicitly, and then compute its steady state probabilities and its mixing time. We will define the stochastic process $\{Y^k_n\}_{k \geq 0}$ on the finite state space $\mathcal{X}$, based on our observations about the boundedness of $\{\|\hat{v}^k_n - v^*\|\}_{k \geq 0}$ and $\{X^k_n\}_{k \geq 0}$. Now, for a fixed $n$ and $p_n(\epsilon_g)$ (we drop the argument $\epsilon_g$ in the following for notational convenience) as assumed in Assumption 4.4, we construct the dominating Markov chain $\{Y^k_n\}_{k \geq 0}$ as
$$Y^{k+1}_n = \begin{cases} \max\{Y^k_n - 1, \eta^*\}, & \text{w.p. } p_n, \\ N^*, & \text{w.p. } 1 - p_n. \end{cases} \quad (4.4)$$
The first value $Y^{k+1}_n = \max\{Y^k_n - 1, \eta^*\}$ corresponds to the case where the approximation error satisfies $\|T \hat{v}^k_n - \widehat{T}_n \hat{v}^k_n\| < \epsilon_g$, and the second value $Y^{k+1}_n = N^*$ corresponds to all other cases (giving us an extremely conservative bound in the sequel). This construction also ensures that $Y^k_n \in \mathcal{X}$ for all $k$, $Q$-almost surely. Informally, $\{Y^k_n\}_{k \geq 0}$ will either move one unit closer to zero until it reaches $\eta^*$, or it will move (as far away from zero as possible) to $N^*$.

We now summarize some key properties of $\{Y^k_n\}_{k \geq 0}$.

Proposition 4.1.
For $\{Y^k_n\}_{k \geq 0}$ as defined above,
(i) it is a Markov chain;
(ii) the steady state distribution of $\{Y^k_n\}_{k \geq 0}$, and the limit $Y_n =_d \lim_{k \to \infty} Y^k_n$, exist;
(iii) $Q\{Y^k_n > \eta\} \to Q\{Y_n > \eta\}$ as $k \to \infty$ for all $\eta \in \mathbb{N}$.

Proof: Parts (i)-(iii) follow by construction of $\{Y^k_n\}_{k \geq 0}$ and the fact that this family consists of irreducible Markov chains on a finite state space. □

We now describe a stochastic dominance relationship between the two stochastic processes $\{X^k_n\}_{k \geq 0}$ and $\{Y^k_n\}_{k \geq 0}$. The notion of stochastic dominance (in the usual sense) will be central to our development.

Definition 4.3.
Let $X$ and $Y$ be two real-valued random variables. Then $Y$ stochastically dominates $X$, written $X \leq_{st} Y$, when $E[f(X)] \leq E[f(Y)]$ for all increasing functions $f : \mathbb{R} \to \mathbb{R}$.

The condition $X \leq_{st} Y$ is known to be equivalent to $E[\mathbf{1}\{X \geq \theta\}] \leq E[\mathbf{1}\{Y \geq \theta\}]$, or $\Pr\{X \geq \theta\} \leq \Pr\{Y \geq \theta\}$, for all $\theta$ in the support of $Y$. Notice that the relation $X \leq_{st} Y$ makes no mention of the respective probability spaces on which $X$ and $Y$ are defined; these spaces may be the same or different (in our case they are different).

Let $\{\mathcal{F}^k\}_{k \geq 0}$ be the filtration on $(\Omega^\infty, \mathcal{F}^\infty, P)$ corresponding to the evolution of information about $\{X^k_n\}_{k \geq 0}$. Let $[X^{k+1}_n \mid \mathcal{F}^k]$ denote the conditional distribution of $X^{k+1}_n$ given the information $\mathcal{F}^k$. The following theorem compares the marginal distributions of $\{X^k_n\}_{k \geq 0}$ and $\{Y^k_n\}_{k \geq 0}$ at all times $k \geq 0$ when the two processes start from the same state.

Theorem 4.1. If $X^0_n = Y^0_n$, then $X^k_n \leq_{st} Y^k_n$ for all $k \geq 0$.

The next corollary relates $\{\|\hat{v}^k_n - v^*\|\}_{k \geq 0}$, $\{X^k_n\}_{k \geq 0}$, and $\{Y^k_n\}_{k \geq 0}$ in a probabilistic sense, and summarizes our stochastic dominance argument.

Corollary 4.1.
For any fixed $n \geq 1$, we have
(i) $P\{\|\hat{v}^k_n - v^*\| > \eta\, \epsilon_g\} \leq P\{X^k_n > \eta\} \leq Q\{Y^k_n > \eta\}$ for all $\eta \in \mathbb{N}$ and all $k \geq 0$;
(ii) $\limsup_{k \to \infty} P\{X^k_n > \eta\} \leq Q\{Y_n > \eta\}$ for all $\eta \in \mathbb{N}$;
(iii) $\limsup_{k \to \infty} P\{\|\hat{v}^k_n - v^*\| > \eta\, \epsilon_g\} \leq Q\{Y_n > \eta\}$ for all $\eta \in \mathbb{N}$.

Proof: (i) The first inequality is true by construction of $X^k_n$. Then $P\{X^k_n > \eta\} \leq Q\{Y^k_n > \eta\}$ for all $k \geq 0$ and $\eta \in \mathbb{N}$ by Theorem 4.1.
(ii) Since $Q\{Y^k_n > \eta\}$ converges (by Proposition 4.1), the result follows by taking the limit in part (i).
(iii) This again follows by taking the limit in part (i) and using Proposition 4.1. □

We now compute the steady state distribution of the Markov chain $\{Y^k_n\}_{k \geq 0}$. Let $\mu_n$ denote the steady state distribution of $Y_n =_d \lim_{k \to \infty} Y^k_n$ (whose existence is guaranteed by Proposition 4.1), where $\mu_n(i) = Q\{Y_n = i\}$ for all $i \in \mathcal{X}$. The next lemma follows from standard techniques (see [37] for example). The proof is given in Appendix A.3.

Lemma 4.3.
For any fixed $n \geq 1$,
$$\mu_n(\eta^*) = p_n^{N^* - \eta^*}, \qquad \mu_n(N^*) = 1 - p_n, \qquad \mu_n(i) = (1 - p_n)\, p_n^{N^* - i}, \quad \forall i = \eta^* + 1, \ldots, N^* - 1.$$
Note that an explicit expression for $p_n$ in the case of the empirical Bellman operator is given in equation (5.1).
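As a sanity check on Lemma 4.3, the short sketch below builds the transition matrix of the dominating chain (4.4) for given $p_n$, $\eta^*$, $N^*$ and compares its numerically computed stationary distribution with the closed form above; the particular parameter values are arbitrary.

```python
import numpy as np

def dominating_chain_stationary(p, eta_star, N_star):
    """Stationary distribution of the chain (4.4) on the states {eta*, ..., N*}."""
    states = np.arange(eta_star, N_star + 1)
    m = len(states)
    P = np.zeros((m, m))
    for idx, y in enumerate(states):
        down = max(y - 1, eta_star)            # move one unit closer, w.p. p
        P[idx, down - eta_star] += p
        P[idx, m - 1] += 1 - p                 # jump to N*, w.p. 1 - p
    # solve mu P = mu with sum(mu) = 1
    A = np.vstack([P.T - np.eye(m), np.ones(m)])
    b = np.append(np.zeros(m), 1.0)
    mu = np.linalg.lstsq(A, b, rcond=None)[0]
    return states, mu

states, mu = dominating_chain_stationary(p=0.9, eta_star=4, N_star=10)
# closed form: mu(eta*) = p^(N*-eta*), mu(i) = (1-p) p^(N*-i), mu(N*) = 1-p
closed = [0.9 ** 6] + [(1 - 0.9) * 0.9 ** (10 - i) for i in range(5, 10)] + [1 - 0.9]
print(np.allclose(mu, closed))   # True for this construction of the chain
```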
We now give results on the convergence of the stochastic process $\{\hat{v}^k_n\}_{k \geq 0}$, which could equivalently be written $\{\widehat{T}_n^k \hat{v}^0_n\}_{k \geq 0}$. We also elaborate on the connections between our different notions of fixed points. Throughout this section, $v^*$ denotes the fixed point of the deterministic operator $T$ as defined in Assumption 4.1.

Theorem 4.2. Suppose the random operator $\widehat{T}_n$ satisfies Assumptions 4.1-4.4. Then for any $\epsilon > 0$,
$$\lim_{n \to \infty} \limsup_{k \to \infty} P\left(\|\hat{v}^k_n - v^*\| > \epsilon\right) = 0,$$
i.e., $v^*$ is a weak probabilistic fixed point of $\{\widehat{T}_n\}_{n \geq 1}$.

Proof: Choose the granularity $\epsilon_g = \epsilon/\eta^*$. From Corollary 4.1 and Lemma 4.3,
$$\limsup_{k \to \infty} P\left\{\|\hat{v}^k_n - v^*\| > \eta^* \epsilon_g\right\} \leq Q\{Y_n > \eta^*\} = 1 - \mu_n(\eta^*) = 1 - p_n^{N^* - \eta^*}.$$
Now by Assumption 4.4, $p_n \uparrow 1$ as $n \to \infty$, and the claim follows. □

Now we show that a strong probabilistic fixed point and the deterministic fixed point $v^*$ coincide under Assumption 4.1.

Proposition 4.2.
Suppose Assumption 4.1 holds. Then,
(i) $v^*$ is a strong probabilistic fixed point of the sequence $\{\widehat{T}_n\}_{n \geq 1}$;
(ii) if $\hat{v}$ is a strong probabilistic fixed point of the sequence $\{\widehat{T}_n\}_{n \geq 1}$, then $\hat{v}$ is a fixed point of $T$.

Thus, the set of fixed points of $T$ and the set of strong probabilistic fixed points of $\{\widehat{T}_n\}_{n \geq 1}$ coincide. This suggests that a "probabilistic" fixed point would be an "approximate" fixed point of the deterministic operator $T$. We now explore the connection between weak probabilistic fixed points and classical fixed points.

Proposition 4.3.
Suppose the random operator $\widehat{T}_n$ satisfies Assumptions 4.1-4.4. Then,
(i) $v^*$ is a weak probabilistic fixed point of the sequence $\{\widehat{T}_n\}_{n \geq 1}$;
(ii) if $\hat{v}$ is a weak probabilistic fixed point of the sequence $\{\widehat{T}_n\}_{n \geq 1}$, then $\hat{v}$ is a fixed point of $T$.

The proof is given in Appendix A.5. It is clear that here we need more assumptions to analyze the asymptotic behavior of the iterates of the random operator $\widehat{T}_n$ and establish the connection to the fixed point of the deterministic operator. We summarize the above discussion in the following theorem.

Theorem 4.3.
Suppose the random operator $\widehat{T}_n$ satisfies Assumptions 4.1-4.4. Then the following three statements are equivalent:
(i) $v$ is a fixed point of $T$;
(ii) $v$ is a strong probabilistic fixed point of $\{\widehat{T}_n\}_{n \geq 1}$;
(iii) $v$ is a weak probabilistic fixed point of $\{\widehat{T}_n\}_{n \geq 1}$.

This is quite remarkable: we see not only that the two notions of a probabilistic fixed point of a sequence of random operators coincide, but in fact that they coincide with the fixed point of the related classical operator. Actually, it would have been disappointing if this were not the case. The above result now suggests that iterating a random operator a finite number $k$ of times, for a fixed $n$, yields an approximation to the classical fixed point with high probability. Thus, the notions of the $(\epsilon, \delta)$-strong and weak probabilistic fixed points coincide asymptotically; however, we note that non-asymptotically they need not be the same.
5. Sample Complexity for EDP
In this section we present the proofs of the sample complexity results for empirical value iteration (EVI) and empirical policy iteration (EPI) (Theorem 3.1 and Theorem 3.2 in Section 3).
Recall the definition of the empirical Bellman operator in equation (3.1). Here we give a mathematical basis for that definition, which will help us analyze the convergence behaviour of EVI (since EVI can be framed as an iteration of this operator). The empirical Bellman operator is a random operator, because it maps random samples to operators. Recall from Section 4.1 that we define the random operator on the sample space $\Omega = [0,1]^\infty$, where the primitive uncertainties on $\Omega$ are infinite sequences of uniform noise $\omega = (\xi_i)_{i \geq 1}$ and each $\xi_i$ is an independent uniform random variable on $[0,1]$. Defining $\widehat{T}_n$, for a fixed $n \geq 1$, on this common sample space makes convergence statements with respect to $n$ easier to make.

Classical value iteration is performed by iterating the Bellman operator $T$. Our EVI algorithm is performed by choosing $n$ and then iterating the random operator $\widehat{T}_n$. So we follow the notation introduced in Section 4.1: the $k$-th iterate of EVI is given by $\hat{v}^k_n = \widehat{T}_n^k \hat{v}^0_n$, where $\hat{v}^0_n \in \mathbb{R}^{|\mathcal{S}|}$ is an initial seed for EVI. We first show that the empirical Bellman operator satisfies Assumptions 4.1-4.4. Then the analysis follows the results of Section 4.

Proposition 5.1.
The Bellman operator $T$ and the empirical Bellman operator $\widehat{T}_n$ (defined in equation (3.1)) satisfy Assumptions 4.1-4.4.

We can give an explicit expression for $p_n(\epsilon)$ of Assumption 4.4; for the proof, refer to the proof of Proposition 5.1:
$$P\left\{\|\widehat{T}_n v - T v\| < \epsilon\right\} > p_n(\epsilon) := 1 - 2|\mathbb{K}|\, e^{-(\epsilon/\alpha)^2 n / (2(\kappa^*)^2)}. \quad (5.1)$$
We can also give an explicit expression for $\kappa^*$ of Assumption 4.2 as
$$\kappa^* \triangleq \frac{\max_{(s,a) \in \mathbb{K}} |c(s,a)|}{1 - \alpha}. \quad (5.2)$$
For the proof, refer again to the proof of Proposition 5.1. Here we use the results of Section 4 for analyzing the convergence of EVI. We first give an asymptotic result.
Proposition 5.2.
For any $\delta_1 \in (0,1)$, select $n$ such that
$$n \geq \frac{2(\kappa^*)^2}{(\epsilon_g/\alpha)^2} \log \frac{2|\mathbb{K}|}{\delta_1}.$$
Then,
$$\limsup_{k \to \infty} P\left\{\|\hat{v}^k_n - v^*\| > \eta^* \epsilon_g\right\} \leq 1 - \mu_n(\eta^*) \leq \delta_1.$$

Proof: From Corollary 4.1, $\limsup_{k \to \infty} P\left\{\|\hat{v}^k_n - v^*\| > \eta^* \epsilon_g\right\} \leq Q\{Y_n > \eta^*\} = 1 - \mu_n(\eta^*)$. For $1 - \mu_n(\eta^*)$ to be less than $\delta_1$, we compute $n$ using Lemma 4.3 from
$$1 - \delta_1 \leq \mu_n(\eta^*) = p_n^{N^* - \eta^*} \leq p_n = 1 - 2|\mathbb{K}|\, e^{-(\epsilon_g/\alpha)^2 n / (2(\kappa^*)^2)}.$$
Thus, we get the desired result. □
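For illustration, a short calculation of the constants appearing in Proposition 5.2 and Theorem 3.1 for an arbitrary example configuration; the numbers chosen here are illustrative and not from the paper.

```python
import numpy as np

# illustrative problem parameters (not from the paper)
alpha, c_max = 0.9, 1.0          # discount factor and max one-stage cost
S_size, A_size = 100, 10         # |S| and |A|, so |K| <= |S| * |A|
eps, delta1 = 0.1, 0.05

kappa = c_max / (1 - alpha)                   # kappa* = max |c| / (1 - alpha)
eta_star = int(np.ceil(2 / (1 - alpha)))      # eta* = ceil(2 / (1 - alpha))
eps_g = eps / eta_star                        # granularity
N_star = int(np.ceil(kappa / eps_g))          # intervals needed to cover [0, kappa*]
K_size = S_size * A_size

# sample size per Proposition 5.2: n >= 2 (kappa*)^2 / (eps_g / alpha)^2 * log(2|K| / delta1)
n = int(np.ceil(2 * kappa**2 / (eps_g / alpha) ** 2 * np.log(2 * K_size / delta1)))
print(eta_star, N_star, n)   # here: eta* = 20, N* = 2000, and n in the tens of millions
```

The conservativeness of the bound (n in the tens of millions for this toy configuration) reflects the worst-case nature of the dominating chain construction.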
We cannot iterate $\widehat{T}_n$ forever, so we need a guideline for a finite choice of $k$. This question can be answered in terms of mixing times. The total variation distance between two probability measures $\mu$ and $\nu$ on $\mathcal{S}$ is
$$\|\mu - \nu\|_{TV} = \max_{S \subset \mathcal{S}} |\mu(S) - \nu(S)| = \frac{1}{2} \sum_{s \in \mathcal{S}} |\mu(s) - \nu(s)|.$$
Let $Q^k_n$ be the marginal distribution of $Y^k_n$ on $\mathbb{N}$ at stage $k$, and let
$$d(k) = \|Q^k_n - \mu_n\|_{TV}$$
be the total variation distance between $Q^k_n$ and the steady state distribution $\mu_n$. For $\delta > 0$, we define
$$t_{mix}(\delta) = \min\{k : d(k) \leq \delta\}$$
to be the minimum length of time needed for the marginal distribution of $Y^k_n$ to be within $\delta$ of the steady state distribution in total variation norm. We now bound $t_{mix}(\delta)$.

Lemma 5.1.
For any $\delta > 0$,
$$t_{mix}(\delta) \leq \log\left(\frac{1}{\delta\, \mu_{n,\min}}\right),$$
where $\mu_{n,\min} := \min_\eta \mu_n(\eta)$.

Proof: Let $Q_n$ be the transition matrix of the Markov chain $\{Y^k_n\}_{k \geq 0}$. Also let $\lambda_\star = \max\{|\lambda| : \lambda \text{ is an eigenvalue of } Q_n,\ \lambda \neq 1\}$. By [27, Theorem 12.3],
$$t_{mix}(\delta) \leq \log\left(\frac{1}{\delta\, \mu_{n,\min}}\right) \frac{1}{1 - \lambda_\star},$$
but $\lambda_\star = 0$ by the Lemma given in Appendix A.7. □

We now use the above bound on the mixing time to get a non-asymptotic bound for EVI.
Proposition 5.3.
For any fixed $n \geq 1$,
$$P\left\{\|\hat{v}^k_n - v^*\| > \eta^* \epsilon_g\right\} \leq \delta_2 + \left(1 - \mu_n(\eta^*)\right) \quad \text{for } k \geq \log\left(\frac{1}{\delta_2\, \mu_{n,\min}}\right).$$

Proof: For $k \geq \log\left(\frac{1}{\delta_2\, \mu_{n,\min}}\right) \geq t_{mix}(\delta_2)$,
$$d(k) = \frac{1}{2} \sum_{i=\eta^*}^{N^*} |Q(Y^k_n = i) - \mu_n(i)| \leq \delta_2.$$
Then, $|Q(Y^k_n = \eta^*) - \mu_n(\eta^*)| \leq d(k) \leq \delta_2$. So,
$$P\left\{\|\hat{v}^k_n - v^*\| > \eta^* \epsilon_g\right\} \leq Q(Y^k_n > \eta^*) = 1 - Q(Y^k_n = \eta^*) \leq \delta_2 + (1 - \mu_n(\eta^*)). \ \square$$

We now combine Propositions 5.2 and 5.3 to prove Theorem 3.1.
Proof of Theorem 3.1:
Proof:
Let $\epsilon_g = \epsilon/\eta^*$, and let $\delta_1, \delta_2$ be positive with $\delta_1 + 2\delta_2 \leq \delta$. By Proposition 5.2, for $n \geq n(\epsilon, \delta)$ we have
$$\limsup_{k \to \infty} P\left\{\|\hat{v}^k_n - v^*\| > \epsilon\right\} = \limsup_{k \to \infty} P\left\{\|\hat{v}^k_n - v^*\| > \eta^* \epsilon_g\right\} \leq 1 - \mu_n(\eta^*) \leq \delta_1.$$
Now, for $k \geq k(\epsilon, \delta)$, by Proposition 5.3,
$$P\left\{\|\hat{v}^k_n - v^*\| > \eta^* \epsilon_g\right\} = P\left\{\|\hat{v}^k_n - v^*\| > \epsilon\right\} \leq \delta_2 + (1 - \mu_n(\eta^*)).$$
Combining both, we get $P\left\{\|\hat{v}^k_n - v^*\| > \epsilon\right\} \leq \delta_1 + 2\delta_2 \leq \delta$. □

We now consider empirical policy iteration. EPI is different from EVI, and seemingly more difficult to analyze, because it does not correspond to iteration of a random operator. Furthermore, it has two simulation components, empirical policy evaluation and empirical policy update. However, we show that the convergence analysis proceeds in a manner similar to that of EVI.

We first give a sample complexity result for policy evaluation. For a policy $\pi$, let $v^\pi$ be the actual value of the policy and let $\hat{v}^\pi_q$ be its empirical evaluation. Then,

Proposition 5.4.
For any $\pi \in \Pi$, $\epsilon \in (\gamma, 1)$ and any $\delta > 0$,
$$P\left\{\|\hat{v}^\pi_q - v^\pi\| \geq \epsilon\right\} \leq \delta \quad \text{for } q \geq \frac{2(\kappa^*(T+1))^2}{(\epsilon - \gamma)^2} \log \frac{2|\mathcal{S}|}{\delta},$$
where $\hat{v}^\pi_q$ is the evaluation of $v^\pi$ obtained by averaging $q$ simulation runs.

Proof: Let $v^{\pi,T} := E\left[\sum_{t=0}^{T} \alpha^t c(s_t(\omega), \pi(s_t(\omega)))\right]$. Then,
$$|\hat{v}^\pi_q(s) - v^\pi(s)| \leq |\hat{v}^\pi_q(s) - v^{\pi,T}| + |v^{\pi,T} - v^\pi(s)| \leq \left|\frac{1}{q} \sum_{i=1}^{q} \sum_{t=0}^{T} \alpha^t c(s_t(\omega_i), \pi(s_t(\omega_i))) - v^{\pi,T}\right| + \gamma \leq \sum_{t=0}^{T} \left|\frac{1}{q} \sum_{i=1}^{q} \Big( c(s_t(\omega_i), \pi(s_t(\omega_i))) - E[c(s_t(\omega), \pi(s_t(\omega)))] \Big)\right| + \gamma.$$
Then, with $\tilde{\epsilon} = (\epsilon - \gamma)/(T+1)$,
$$P\left(|\hat{v}^\pi_q(s) - v^\pi(s)| \geq \epsilon\right) \leq P\left(\left|\frac{1}{q} \sum_{i=1}^{q} \Big( c(s_t(\omega_i), \pi(s_t(\omega_i))) - E[c(s_t(\omega_i), \pi(s_t(\omega_i)))] \Big)\right| \geq \tilde{\epsilon}\right) \leq 2 e^{-q \tilde{\epsilon}^2 / (2(\kappa^*)^2)}.$$
By applying the union bound, we get $P\left(\|\hat{v}^\pi_q - v^\pi\| \geq \epsilon\right) \leq 2|\mathcal{S}|\, e^{-q \tilde{\epsilon}^2 / (2(\kappa^*)^2)}$. For $q \geq \frac{2(\kappa^*(T+1))^2}{(\epsilon - \gamma)^2} \log \frac{2|\mathcal{S}|}{\delta}$, the above probability is less than $\delta$. □

We define
$$P\left(\|\hat{v}^\pi_q - v^\pi\| < \epsilon\right) > r_q(\epsilon) := 1 - 2|\mathcal{S}|\, e^{-q \tilde{\epsilon}^2 / (2(\kappa^*)^2)}, \quad \text{with } \tilde{\epsilon} = (\epsilon - \gamma)/(T+1). \quad (5.3)$$
We say that empirical policy evaluation is $\epsilon$-accurate if $\|\hat{v}^\pi_q - v^\pi\| < \epsilon$. Then, by the above proposition, empirical policy evaluation is $\epsilon$-accurate with probability greater than $r_q(\epsilon)$.

The accuracy of the empirical policy update compared to the actual policy update depends on the empirical Bellman operator $\widehat{T}_n$. We say that the empirical policy update is $\epsilon$-accurate if $\|\widehat{T}_n v - T v\| < \epsilon$. Then, by the definition of $p_n$ in equation (5.1), our empirical policy update is $\epsilon$-accurate with probability greater than $p_n(\epsilon)$.

We now give an important technical lemma. The proof is essentially a probabilistic modification of Lemmas 6.1 and 6.2 in [9] and is omitted.

Lemma 5.2.
Let { π k } k ≥ be the sequence of policies from the EPI algorithm. For a fixed k ,assume that P (cid:0) (cid:107) v π k − ˆ v π k q (cid:107) < (cid:15) (cid:1) ≥ (1 − δ ) and P (cid:16) (cid:107) T ˆ v π k q − (cid:98) T n ˆ v π k q (cid:107) < (cid:15) (cid:17) ≥ (1 − δ ) . Then, (cid:107) v π k +1 − v ∗ (cid:107) ≤ α (cid:107) v π k − v ∗ (cid:107) + (cid:15) + 2 α(cid:15) (1 − α ) with probability at least (1 − δ )(1 − δ ) . We now proceed as in the analysis of EVI given in the previous subsection. Here we track thesequence {(cid:107) v π k − v ∗ (cid:107)} k ≥ . Note that this being a proof technique, the fact that the value (cid:107) v π k − v ∗ (cid:107) is not observable does not affect our algorithm or its convergence behavior. We define X kn, q = (cid:100)(cid:107) ˆ v π k − v ∗ (cid:107) /(cid:15) g (cid:101) where the granularity (cid:15) g is fixed according to the problem parameters as (cid:15) g = (cid:15) +2 α(cid:15) (1 − α ) . Then byLemma 5.2,if X kn, q = η, then X k +1 n, q ≤ (cid:100) α η + 1 (cid:101) with a probability at least p n,q = r q ( (cid:15) ) p n ( (cid:15) ) . (5.4)17his is equivalent to the result for EVI given in equation (4.3). Hence the analysis is the samefrom here onwards. However, for completeness, we explicitly give the dominating Markov chainand its steady state distribution.For p n,q given in display (5.4), we construct the dominating Markov chain (cid:8) Y kn, q (cid:9) k ≥ as Y k +1 n, q = (cid:40) max (cid:8) Y kn, q − , η ∗ (cid:9) , w.p. p n, q ,N ∗ , w.p. 1 − p n, q , (5.5)which exists on the state space X . The family (cid:8) Y kn, q (cid:9) k ≥ is identical to (cid:8) Y kn (cid:9) k ≥ except that itstransition probabilities depend on n and q rather than just n . Let µ n,q denote the steady statedistribution of the Markov chain (cid:8) Y kn,q (cid:9) k ≥ . Then by Lemma 4.3, µ n,q ( η ∗ ) = p N ∗ − η ∗ − n,q , µ n,q ( N ∗ ) = 1 − p n,q p n,q , µ n,q ( i ) = (1 − p n,q ) p ( N ∗ − i − ) n,q , ∀ i = η ∗ + 1 , . . . , N ∗ − . (5.6) Proof of Theorem 3.2:
Proof:
First observe that by the given choice of n and q, we have r_q ≥ 1 − δ₁ and p_n ≥ 1 − δ₂. Hence 1 − p_{n,q} ≤ δ₁ + δ₂ − δ₁δ₂ < δ. Now, by Corollary 4.1,
limsup_{k→∞} P{‖v̂^{π_k} − v*‖ > ε} = limsup_{k→∞} P{‖v̂^{π_k} − v*‖ > η* ε_g} ≤ Q{Y_{n,q} > η*} = 1 − µ_{n,q}(η*).
For 1 − µ_{n,q}(η*) to be less than δ, we need 1 − δ ≤ µ_{n,q}(η*) = p_{n,q}^{N*−η*}, which holds for the choice of n and q verified above. Thus we get
limsup_{k→∞} P{‖v̂^{π_k} − v*‖ > ε} ≤ δ,
similar to the result of Proposition 5.2. Selecting the number of iterations k based on the mixing time is the same as in Proposition 5.3. Combining both, as in the proof of Theorem 3.1, we get the desired result. □
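As a purely illustrative check, the following sketch builds the transition matrix of the dominating chain in (5.5) on the states {η*, ..., N*} and computes its stationary distribution numerically, so it can be compared against the closed form in (5.6); the function and its arguments are our own naming, not the paper's.

import numpy as np

def dominating_chain_stationary(p, eta_star, N_star):
    """Stationary distribution of the dominating chain: from any state y, move
    to max(y - 1, eta_star) w.p. p and jump to N_star w.p. 1 - p."""
    states = list(range(eta_star, N_star + 1))
    m = len(states)
    P = np.zeros((m, m))
    for idx, y in enumerate(states):
        P[idx, states.index(max(y - 1, eta_star))] += p     # one-step improvement
        P[idx, m - 1] += 1.0 - p                            # reset to N_star
    w, vecs = np.linalg.eig(P.T)                            # left eigenvector for eigenvalue 1
    mu = np.real(vecs[:, np.argmin(np.abs(w - 1.0))])
    mu /= mu.sum()
    return dict(zip(states, mu))

# Compare, e.g., with mu(eta*) = p**(N*-eta*), mu(N*) = 1-p,
# and mu(i) = (1-p)*p**(N*-i) for the intermediate states:
print(dominating_chain_stationary(p=0.9, eta_star=2, N_star=6))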
6. Variations and Extensions
We now consider some variations and extensions of EVI.
The EVI algorithm described above is synchronous, meaning that the value estimates of all states are updated simultaneously. Here we instead require each state to be visited at least once to complete a full update cycle, and we modify the earlier argument to account for the possibly random time between full update cycles.

Classical asynchronous value iteration with exact updates has already been studied. Let (x_k)_{k≥0} be any infinite sequence of states in S. This sequence may be deterministic or stochastic, and it may even depend online on the value function updates. For any x ∈ S, we define the asynchronous Bellman operator T_x : R^{|S|} → R^{|S|} via
[T_x v](s) = min_{a∈A(s)} {c(s,a) + α E[v(ψ(s,a,ξ))]} if s = x,  and  [T_x v](s) = v(s) if s ≠ x.
The operator T_x only updates the estimate of the value function for state x, and leaves the estimates for all other states exactly as they are. Given an initial seed v⁰ ∈ R^{|S|}, asynchronous value iteration produces the sequence {v^k}_{k≥1} defined by v^k = T_{x_{k−1}} T_{x_{k−2}} ⋯ T_{x_0} v⁰ for k ≥ 1. The following properties of T_x are immediate.

Lemma 6.1. For any x ∈ S:
(i) T_x is monotonic;
(ii) T_x[v + η 1] = T_x v + α η e_x, where e_x ∈ R^{|S|} is the unit vector corresponding to x ∈ S.

Proof is given in Appendix A.8.

The next lemma is used to show that classical asynchronous VI converges. Essentially, a cycle of updates that visits every state at least once is a contraction.
Lemma 6.2.
Let (x_k)_{k=1}^{K} be any finite sequence of states such that every state in S appears at least once. Then the operator T̃ = T_{x_K} T_{x_{K−1}} ⋯ T_{x_1} is a contraction with constant α.

It is known that asynchronous VI converges when each state is visited infinitely often. To continue, define K_0 = 0 and, in general,
K_{m+1} := inf{k : k ≥ K_m, (x_i)_{i=K_m+1}^{k} includes every state in S}.
Time K_1 is the first time that every state in S has been visited at least once by the sequence (x_k)_{k≥0}. Time K_2 is the first time after K_1 that every state has been visited at least once again, and so on. The times {K_m}_{m≥0} depend entirely on (x_k)_{k≥0}; a small sketch of how they can be computed from a visit sequence is given below.
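The following illustrative snippet computes these update-cycle times from a given visit sequence; the function name and the zero-based indexing convention are our own choices for this sketch, not the paper's.

def update_cycle_times(x_seq, states):
    """Return the times at which the visit sequence x_seq has covered every
    state at least once since the end of the previous full update cycle."""
    times, seen = [], set()
    for k, x in enumerate(x_seq):
        seen.add(x)
        if seen == set(states):        # every state visited in the current cycle
            times.append(k)
            seen = set()               # start the next cycle
    return times

# Example: a round-robin visit sequence over states {0, 1, 2}
print(update_cycle_times([0, 1, 2, 0, 1, 2, 0], [0, 1, 2]))  # -> [2, 5]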
For any m ≥ 0, if we define T̃ = T_{x_{K_{m+1}}} T_{x_{K_{m+1}−1}} ⋯ T_{x_{K_m+2}} T_{x_{K_m+1}}, then we know
‖T̃ v − v*‖ ≤ α ‖v − v*‖
by the preceding lemma. It is known that asynchronous VI converges under suitable conditions on (x_k)_{k≥0}.

Theorem 6.1 ([7]). Suppose each state in S is included infinitely often in the sequence (x_k)_{k≥0}. Then v^k → v*.

Next we describe an empirical version of classical asynchronous value iteration. Again, we replace the exact computation of the expectation with an empirical estimate, and we regenerate the sample at each iteration.
Algorithm 3
Asynchronous empirical value iteration
Input: v̂⁰ ∈ R^{|S|}, sample size n ≥ 1, a sequence (x_k)_{k≥0}.
Set counter k = 0.
1. Sample n uniformly distributed random variables {ξ_i}_{i=1}^{n}, and compute
v̂^{k+1}(s) = min_{a∈A(s)} {c(s,a) + (α/n) Σ_{i=1}^{n} v̂^k(ψ(s,a,ξ_i))} if s = x_k,  and  v̂^{k+1}(s) = v̂^k(s) if s ≠ x_k.
2. Increment k := k + 1 and return to step 1.

Step 1 of this algorithm replaces the exact computation v^{k+1} = T_{x_k} v^k with an empirical variant. Using our earlier notation, we let T̂_{x,n} be the random operator that only updates the value function for state x, using an empirical estimate with sample size n ≥ 1:
[T̂_{x,n}(ω) v](s) = min_{a∈A(s)} {c(s,a) + (α/n) Σ_{i=1}^{n} v(ψ(s,a,ξ_i))} if s = x,  and  [T̂_{x,n}(ω) v](s) = v(s) if s ≠ x.
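A minimal sketch of Algorithm 3, written against a generic simulation model; the callables psi, cost, actions and schedule (the visit sequence) are assumptions made only for this illustration.

import random

def async_empirical_vi(states, actions, cost, psi, alpha, n, schedule, steps):
    """Asynchronous EVI: at step k only state x_k = schedule(k) is refreshed,
    using a fresh batch of n uniform samples pushed through psi."""
    v = {s: 0.0 for s in states}                      # initial seed
    for k in range(steps):
        x = schedule(k)                               # state updated at step k
        xi = [random.random() for _ in range(n)]      # regenerate the sample
        v[x] = min(
            cost(x, a) + alpha / n * sum(v[psi(x, a, u)] for u in xi)
            for a in actions(x)
        )                                             # all other states unchanged
    return v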
We use {v̂^k_n}_{k≥0} to denote the sequence of asynchronous EVI iterates,
v̂^{k+1}_n = T̂_{x_k,n} T̂_{x_{k−1},n} ⋯ T̂_{x_0,n} v̂⁰_n,  or, more compactly,  v̂^{k+1}_n = T̂_{x_k,n} v̂^k_n for all k ≥ 0.
We track the update times {K_m}_{m≥0} as well, since the accuracy of an overall update cycle depends on the accuracy of T̂_{x,n} as well as on the length of the interval {K_m + 1, K_m + 2, ..., K_{m+1}}. In asynchronous EVI we focus on {v̂^{K_m}_n}_{m≥0} rather than on {v̂^k_n}_{k≥0}; that is, we check the progress of the algorithm at the end of complete update cycles.

In the simplest update scheme, we order the states and update them in that same order throughout the algorithm. The sequence (x_k)_{k≥0} is deterministic in this case, and the intervals {K_m + 1, ..., K_{m+1}} all have the same length |S|. Consider T̃ = T_{x_{K_1}} T_{x_{K_1−1}} ⋯ T_{x_1} T_{x_0}: the operator T̂_{x_0,n} introduces ε error into component x_0, the operator T̂_{x_1,n} introduces ε error into component x_1, and so on. To ensure that T̂ = T̂_{x_{K_1},n} T̂_{x_{K_1−1},n} ⋯ T̂_{x_1,n} T̂_{x_0,n} is close to T̃, we require each T̂_{x_k,n} to be close to T_{x_k} for k = 0, 1, ..., K_1.

The following noise-driven perspective helps with the error analysis. In general, we can view asynchronous empirical value iteration as v′ = T_x v + ε, where ε = T̂_{x,n} v − T_x v is the noise in the evaluation of T_x (it has at most one nonzero component). Starting with v⁰, define the sequence {v^k}_{k≥0} by exact asynchronous value iteration, v^{k+1} = T_{x_k} v^k for all k ≥ 0. Also set ṽ⁰ := v⁰ and define
ṽ^{k+1} = T_{x_k} ṽ^k + ε^k for all k ≥ 0,
where ε^k ∈ R^{|S|} is the noise in the evaluation of T_{x_k} applied to ṽ^k. In the following proposition, we compare the sequences of value functions {v^k}_{k≥0} and {ṽ^k}_{k≥0} under conditions on the noise {ε^k}_{k≥0}.

Proposition 6.1.
Suppose −η 1 ≤ ε^j ≤ η 1 for all j = 0, 1, ..., k, where η ≥ 0 and 1 ∈ R^{|S|} is the vector of all ones, i.e., the error is uniformly bounded for j = 0, 1, ..., k. Then, for all j = 0, 1, ..., k:
v^j − (Σ_{i=0}^{j} α^i) η 1 ≤ ṽ^j ≤ v^j + (Σ_{i=0}^{j} α^i) η 1.

Proof is given in Appendix A.9.

We can now use the preceding proposition to obtain conditions for ‖T̃ v − T̂ v‖ < ε (for our deterministic update sequence). Starting with the update for state x_0, we can choose n to ensure ‖T_{x_0} v − T̂_{x_0,n} v‖ < ε/|S|, since
P{‖T̂_{x,n} v − T_x v‖ ≥ ε/|S|} ≤ P{max_{a∈A(x)} |(1/n) Σ_{i=1}^{n} v(ψ(x,a,ξ_i)) − E[v(ψ(x,a,ξ))]| ≥ ε/(α|S|)} ≤ 2|A| e^{−n (ε/(α|S|))²/(2κ*²)}
for all v ∈ R^{|S|} (the bound does not depend on x). We are only updating one state, so we are concerned with the approximation of at most |A| terms c(s,a) + α E[v(ψ(s,a,ξ))] rather than |K|. At the next update we want ‖T̂_{x_1,n} v̂¹_n − T_{x_1} v̂¹_n‖ < ε/|S|, and we get the same error bound as above. Based on this reasoning, assume ‖T_{x_k} v̂^k_n − T̂_{x_k,n} v̂^k_n‖ < ε/|S| for all k = 0, 1, ..., K_1. In this case we in fact get the stronger error guarantee
‖T̃ v − T̂ v‖ < (ε/|S|) Σ_{i=0}^{|S|−1} α^i < ε
from Proposition 6.1. The complexity estimates are multiplicative, so the probability that ‖T_{x_k} v̂^k_n − T̂_{x_k,n} v̂^k_n‖ < ε/|S| holds for all k = 0, 1, ..., K_1 is at least
p_n = 1 − 2|S||A| e^{−n (ε/(α|S|))²/(2κ*²)}.
To understand this result, remember that |S| iterations of asynchronous EVI amount to at most |S||A| empirical estimates of c(s,a) + α E[v(ψ(s,a,ξ))]; we require all of these estimates to be within ε/|S|.

We can take the above value of p_n and apply our earlier stochastic dominance argument to {‖v̂^{K_m}_n − v*‖}_{m≥0} without further modification. This technique extends to any deterministic sequence (x_k)_{k≥0} for which the lengths |K_{m+1} − K_m| of the full update cycles are uniformly bounded over m ≥ 0.

Now we consider a two-player zero-sum Markov game and show how an empirical minimax value iteration algorithm can be used to compute an approximate Markov perfect equilibrium. Let the Markov game be described by the 7-tuple
(S, A, {A(s) : s ∈ S}, B, {B(s) : s ∈ S}, Q, c).
The action space B of player 2 is finite, and B(s) is the set of feasible actions for player 2 in state s. We let K = {(s,a,b) : s ∈ S, a ∈ A(s), b ∈ B(s)} be the set of feasible state-action pairs. The transition law Q governs the system evolution: Q(·|s,a,b) ∈ P(S) for all (s,a,b) ∈ K is the probability distribution of the next state given (s,a,b). Finally, c : K → R is the cost function (of player 1) in state s under actions a and b. Player 1 wants to minimize this quantity, and player 2 wants to maximize it.

The operator T : R^{|S|} → R^{|S|} is defined as
[T v](s) := min_{a∈A(s)} max_{b∈B(s)} {c(s,a,b) + α E[v(s̃)|s,a,b]},  ∀ s ∈ S, v ∈ R^{|S|},
where s̃ is the random next state visited and
E[v(s̃)|s,a,b] = Σ_{j∈S} v(j) Q(j|s,a,b)
is the expected cost-to-go for player 1. We call T the Shapley operator in honor of Shapley, who first introduced it [39].
We can use T to compute the optimal value function of the game, which is given by
v*(s) = min_{a∈A(s)} max_{b∈B(s)} {c(s,a,b) + α E[v*(s̃)|s,a,b]},  ∀ s ∈ S,
and is the optimal value function for player 1. It is well known that the Shapley operator is a contraction mapping.

Lemma 6.3.
The Shapley operator T is a contraction.

Proof is given in Appendix A.10 for completeness.

To compute v*, we can iterate T. Pick any initial seed v⁰ ∈ R^{|S|}, take v¹ = T v⁰, v² = T v¹, and in general v^{k+1} = T v^k for all k ≥ 0. It is known [39] that this procedure converges to the optimal value function; we refer to it as classical minimax value iteration. Now, using the simulation model ψ : S × A × B × [0,1] → S, the Shapley operator can be written as
[T v](s) := min_{a∈A(s)} max_{b∈B(s)} {c(s,a,b) + α E[v(ψ(s,a,b,ξ))]},  ∀ s ∈ S,
where ξ is a uniform random variable on [0,1]. As before, we replace E[v(ψ(s,a,b,ξ))] with an empirical estimate: given a sample of n uniform random variables {ξ_i}_{i=1}^{n}, the empirical estimate of E[v(ψ(s,a,b,ξ))] is (1/n) Σ_{i=1}^{n} v(ψ(s,a,b,ξ_i)). Our algorithm is summarized next.

Algorithm 4
Empirical value iteration for minimax
Input: v⁰ ∈ R^{|S|}, sample size n ≥ 1.
Set counter k = 0.
1. Sample n uniformly distributed random variables {ξ_i}_{i=1}^{n}, and compute
v^{k+1}(s) = min_{a∈A(s)} max_{b∈B(s)} {c(s,a,b) + (α/n) Σ_{i=1}^{n} v^k(ψ(s,a,b,ξ_i))},  ∀ s ∈ S.
2. Increment k := k + 1 and return to step 1.

In each iteration, we take a fresh set of samples and use the resulting empirical estimate to approximate T. Since T is a contraction with known modulus α, we can apply exactly the same development as for empirical value iteration.
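A minimal sketch of Algorithm 4 under the same generic simulation-model interface as before; psi, cost, and the feasible-action maps A and B are placeholders for this illustration only.

import random

def minimax_empirical_vi(states, A, B, cost, psi, alpha, n, iterations):
    """Empirical minimax value iteration: one synchronous sweep of the
    empirical Shapley (min-max) operator per iteration, with a fresh
    batch of n uniform samples each time."""
    v = {s: 0.0 for s in states}                      # initial seed
    for _ in range(iterations):
        xi = [random.random() for _ in range(n)]      # regenerate the sample
        v = {
            s: min(
                max(
                    cost(s, a, b)
                    + alpha / n * sum(v[psi(s, a, b, u)] for u in xi)
                    for b in B(s)
                )
                for a in A(s)
            )
            for s in states
        }
    return v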
We now show via the newsvendor problem that the empirical dynamic programming method can sometimes work remarkably well even for continuous state and action spaces. This, of course, exploits the linear structure of the newsvendor problem.

Let D be a continuous random variable representing the stationary demand distribution, and let {D_k}_{k≥0} be an independent and identically distributed collection of random variables with the same distribution as D, where D_k is the demand in period k. The unit order cost is c, the unit holding cost is h, and the unit backorder cost is b. We let x_k be the inventory level at the beginning of period k, and we let q_k ≥ 0 be the order quantity in period k.

For technical convenience, we only allow stock levels in the compact set X = [x_min, x_max] ⊂ R. This assumption is not too restrictive, since a firm would not want a large number of backorders and any real warehouse has finite capacity. Notice that since we restrict to X, no order quantity will ever exceed q_max = x_max − x_min. Define the continuous function ψ : R → X via
ψ(x) = x_max if x > x_max,  x_min if x < x_min,  and x otherwise.
The function ψ accounts for the state space truncation. The system dynamic is then
x_{k+1} = ψ(x_k + q_k − D_k),  ∀ k ≥ 0.
We want to solve
inf_{π∈Π} E^π_ν [Σ_{k=0}^{∞} α^k (c q_k + max{h x_k, −b x_k})],   (6.1)
subject to the preceding system dynamic. We know that there is an optimal stationary policy for this problem which depends only on the current inventory level. The optimal cost-to-go function v* : R → R for this problem satisfies
v*(x) = inf_{q≥0} {c q + max{h x, −b x} + α E[v*(ψ(x + q − D))]},  ∀ x ∈ R.
We will compute v* by iterating an appropriate Bellman operator.

Classical value iteration for Problem (6.1) consists of iterating an operator on C(X), the space of continuous functions f : X → R. We equip C(X) with the norm ‖f‖_{C(X)} = sup_{x∈X} |f(x)|; under this norm, C(X) is a Banach space. The Bellman operator T : C(X) → C(X) for the newsvendor problem is given by
[T v](x) = inf_{q≥0} {c q + max{h x, −b x} + α E[v(ψ(x + q − D))]},  ∀ x ∈ X.
Value iteration for the newsvendor can then be written succinctly as v^{k+1} = T v^k for all k ≥ 0. We confirm that T is a contraction with respect to ‖·‖_{C(X)} in the next lemma, and thus the Banach fixed point theorem applies.

Lemma 6.4. (i) T is a contraction on C(X) with constant α. (ii) Let {v^k}_{k≥0} be the sequence produced by value iteration; then lim_{k→∞} ‖v^k − v*‖_{C(X)} = 0.

Proof: (i) Choose u, v ∈ C(X), and use Fact 2.1 to compute
‖Tu − Tv‖_{C(X)} = sup_{x∈X} |[Tu](x) − [Tv](x)|
≤ sup_{x∈X, q∈[0,q_max]} α |E[u(ψ(x+q−D))] − E[v(ψ(x+q−D))]|
≤ α sup_{x∈X, q∈[0,q_max]} E[|u(ψ(x+q−D)) − v(ψ(x+q−D))|]
≤ α ‖u − v‖_{C(X)}.
(ii) Since C(X) is a Banach space and T is a contraction by part (i), the Banach fixed point theorem applies. □

We take the initial seed
v⁰(x) = max{h x, −b x},  ∀ x ∈ X,
chosen to represent the terminal cost in state x when there are no further ordering decisions. Then value iteration yields
v^{k+1}(x) = inf_{q≥0} {c q + max{h x, −b x} + α E[v^k(ψ(x + q − D))]},  ∀ x ∈ X.
We note some key properties of these value functions.
Lemma 6.5.
Let {v^k}_{k≥0} be the sequence produced by value iteration; then v^k is Lipschitz continuous with constant max{|h|, |b|} Σ_{i=0}^{k} α^i for all k ≥ 0.

Proof: First observe that v⁰ is Lipschitz continuous with constant max{|h|, |b|}. For v¹, choose x and x′ and compute
|v¹(x) − v¹(x′)| ≤ sup_{q≥0} {max{h x, −b x} − max{h x′, −b x′} + α (E[v⁰(ψ(x+q−D))] − E[v⁰(ψ(x′+q−D))])}
≤ max{|h|, |b|} |x − x′| + α max{|h|, |b|} E[|ψ(x+q−D) − ψ(x′+q−D)|]
≤ max{|h|, |b|} |x − x′| + α max{|h|, |b|} |x − x′|,
where we use the fact that the Lipschitz constant of ψ is one. The inductive step is similar. □

From Lemma 6.5, we also conclude that the Lipschitz constant of any iterate v^k is bounded above by
L* := max{|h|, |b|} Σ_{i=0}^{∞} α^i = max{|h|, |b|}/(1 − α).
We acknowledge the dependence of Lemma 6.5 on the specific choice of the initial seed v⁰.

We can do empirical value iteration with the same initial seed v̂⁰_n = v⁰ as above. Now, for k ≥ 0,
v̂^{k+1}_n(x) = inf_{q≥0} {c q + max{h x, −b x} + (α/n) Σ_{i=1}^{n} v̂^k_n(ψ(x + q − D_i))},  ∀ x ∈ R,
where {D_1, ..., D_n} is an i.i.d. sample from the demand distribution. It is possible to perform these value function updates exactly for finite k, based on [17]. Note that the initial seed is piecewise linear with finitely many breakpoints. Because the demand sample is finite in each iteration, each iteration takes a piecewise linear function as input and produces a piecewise linear function as output (both with finitely many breakpoints). Lemma 6.5 applies without modification to {v̂^k_n}_{k≥0}: all of these functions are Lipschitz continuous with constants bounded above by L*.

As earlier, we define the empirical Bellman operator T̂_n : Ω → C(X) as
[T̂_n(ω) v](x) = inf_{q≥0} {c q + max{h x, −b x} + (α/n) Σ_{i=1}^{n} v(ψ(x + q − D_i))},  ∀ x ∈ R.
With the empirical Bellman operator, we write the iterates of EVI as v̂^{k+1}_n = T̂^k_n v⁰.

We can again apply the stochastic dominance techniques we have developed to the convergence analysis of the stochastic process {‖v̂^k_n − v*‖_{C(X)}}_{k≥0}. Similarly to equation (5.2), we get the upper bound
‖v⁰‖_{C(X)} ≤ κ* := (c q_max + max{h x_max, −b x_min})/(1 − α),
so that ‖v̂^k_n − v*‖_{C(X)} ≤ ‖v̂^k_n‖_{C(X)} + ‖v*‖_{C(X)} ≤ 2κ*.
We can thus restrict ‖v̂^k_n − v*‖_{C(X)} to the state space [0, 2κ*]. For a fixed granularity ε_g > 0, we can define {X^k_n}_{k≥0} and {Y^k_n}_{k≥0} as in Section 4. Our upper bound on the probability follows.

Proposition 6.2.
For any n ≥ 1 and ε > 0,
P{‖T̂_n v − T v‖_{C(X)} ≥ ε} ≤ P{α sup_{x∈X, q∈[0,q_max]} |E[v(ψ(x+q−D))] − (1/n) Σ_{i=1}^{n} v(ψ(x+q−D_i))| ≥ ε} ≤ 2 ⌈9(L*)² q_max²/ε²⌉ e^{−n ε²/(18 ‖v‖²_{C(X)})},
for all v ∈ C(X) with Lipschitz constant bounded by L*.

Proof: By Fact 2.1, we know that
‖T̂_n v − T v‖_{C(X)} ≤ α sup_{x∈X, q∈[0,q_max]} |E[v(ψ(x+q−D))] − (1/n) Σ_{i=1}^{n} v(ψ(x+q−D_i))|.
Let {(x_j, q_j)}_{j=1}^{J} be an ε/(3L*)-net for X × [0, q_max]. We can choose J to be the smallest integer greater than or equal to
[(x_max − x_min)/(ε/(3L*))] × [q_max/(ε/(3L*))] = 9(L*)² q_max²/ε²,
recalling q_max = x_max − x_min. If we have
|E[v(ψ(x_j + q_j − D))] − (1/n) Σ_{i=1}^{n} v(ψ(x_j + q_j − D_i))| ≤ ε/3 for all j = 1, ..., J,
then
|E[v(ψ(x + q − D))] − (1/n) Σ_{i=1}^{n} v(ψ(x + q − D_i))| ≤ ε
for all (x, q) ∈ X × [0, q_max], by Lipschitz continuity and the construction of {(x_j, q_j)}_{j=1}^{J}. Then, by Hoeffding's inequality and the union bound, we get
P{α sup_{j=1,...,J} |E[v(ψ(x_j + q_j − D))] − (1/n) Σ_{i=1}^{n} v(ψ(x_j + q_j − D_i))| ≥ ε/3} ≤ 2 J e^{−n ε²/(18 ‖v‖²_{C(X)})}. □

As before, we use the preceding complexity estimate to determine p_n for the family {Y^k_n}_{k≥0}. The remainder of our stochastic dominance argument is exactly the same.
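The exact piecewise-linear updates referenced above are not reproduced here; instead, the following simplified sketch discretizes the inventory and order grids and applies the empirical Bellman operator for the newsvendor, which is enough to experiment with EVI numerically. The discretization, the function name, and the example demand distribution are all our own assumptions.

import numpy as np

def newsvendor_evi(c, h, b, alpha, x_min, x_max, demand_sampler,
                   n=100, grid=200, iterations=50):
    """Grid-based EVI for the newsvendor: each sweep draws a fresh demand
    sample of size n and minimizes over a discretized order quantity."""
    xs = np.linspace(x_min, x_max, grid)               # inventory grid
    qs = np.linspace(0.0, x_max - x_min, grid)         # order-quantity grid
    v = np.maximum(h * xs, -b * xs)                    # initial seed v^0
    for _ in range(iterations):
        D = demand_sampler(n)                          # fresh i.i.d. demand sample
        v_new = np.empty_like(v)
        for j, x in enumerate(xs):
            nxt = np.clip(x + qs[:, None] - D[None, :], x_min, x_max)   # psi(x+q-D)
            cont = np.interp(nxt, xs, v).mean(axis=1)                   # empirical expectation
            v_new[j] = np.min(c * qs + max(h * x, -b * x) + alpha * cont)
        v = v_new
    return xs, v

# Illustrative usage with an (assumed) exponential demand distribution:
# xs, v = newsvendor_evi(c=1.0, h=0.5, b=2.0, alpha=0.9, x_min=-10.0, x_max=10.0,
#                        demand_sampler=lambda n: np.random.exponential(2.0, n))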
7. Numerical Experiments
We now provide a numerical comparison of EDP methods with other methods for approximate dynamic programming via simulation. Figure 1 shows the relative error ‖v̂^k_n − v*‖ of the Actor-Critic algorithm, Q-learning, Optimistic Policy Iteration (OPI), Empirical Value Iteration (EVI) and Empirical Policy Iteration (EPI). It also shows the relative error of exact Value Iteration (VI). The problem considered was a generic MDP with 10 states and 5 actions under the infinite-horizon discounted cost criterion. From the figure, we see that EVI and EPI significantly outperform the Actor-Critic algorithm (which converges very slowly) and Q-learning.
[Figure 1 about here: relative error (y-axis) versus iteration (x-axis) for Value Iteration (VI), Policy Iteration (PI), Empirical VI, Empirical PI, Optimistic PI, Q-learning, and Actor-Critic.]

Figure 1.
Numerical performance comparison of empirical value and policy iteration with actor-critic, Q-learning and optimistic policy iteration. Number of samples taken: n = 10, q = 10 (see the EVI and EPI algorithms in Section 3 for the definitions of n and q).

Optimistic policy iteration performs better than EVI, since policy iteration-based algorithms are known to converge faster, but EPI outperforms OPI as well. The experiments were performed on a generic laptop with an Intel Core i7 processor and 4GB RAM, running 64-bit Windows 7, in the Matlab R2009b environment.

These preliminary numerical results suggest that EDP methods outperform other ADP methods numerically and hold good promise. More definitive conclusions about their numerical performance require further work. We would also like to point out that EDP methods would be very easily parallelizable, and hence they could potentially be useful for a wider variety of problem settings.
8. Conclusions
In this paper, we have introduced a new class of algorithms for approximate dynamic programming. The idea is actually not novel, and is quite simple and natural: just replace the expectation in the Bellman operator with an empirical estimate (or a sample average approximation, as it is often called). The difficulty, however, is that this makes the Bellman operator a random operator. This makes its convergence analysis very challenging, since (infinite-horizon) dynamic programming theory is based on looking at the fixed points of the Bellman operator. However, the extant notions of 'probabilistic' fixed points for random operators are not relevant, since they are akin to classical fixed points of the deterministic monotone operators obtained when ω is fixed. We introduce two notions of probabilistic fixed points, strong and weak. Furthermore, we show that these asymptotically coincide with the classical fixed point of the related deterministic operator. This is reassuring, as it suggests that approximations to our probabilistic fixed points (obtained by finitely many iterations of the empirical Bellman operator) are also approximations to the classical fixed point of the Bellman operator.

In developing this theory, we also developed a mathematical framework based on stochastic dominance for the convergence analysis of random operators. While our immediate goal was the analysis of iteration of the empirical Bellman operator in empirical dynamic programming, the framework is likely of broader use, possibly after further development.

We have then shown that many variations and extensions of classical dynamic programming work for empirical dynamic programming as well. In particular, empirical dynamic programming can be done asynchronously, just as classical DP can. Moreover, a zero-sum stochastic game can be solved by a minimax empirical dynamic program. We also apply the EDP method to the dynamic newsvendor problem, which has continuous state and action spaces, demonstrating the potential of EDP to solve problems over more general state and action spaces.

We have done some preliminary experimental performance analysis of EVI and EPI, and compared them to similar methods. Our numerical simulations suggest that EDP algorithms converge faster than stochastic approximation-based actor-critic, Q-learning and optimistic policy iteration algorithms. However, these results are only suggestive; we do not claim a definitive performance improvement in practice over other algorithms. That would require an extensive and careful numerical investigation of all such algorithms.

We do note that EDP methods, unlike stochastic approximation methods, do not require any recurrence property to hold; in that sense, they are more universal. On the other hand, EDP algorithms would inherit some of the 'curse of dimensionality' problems associated with exact dynamic programming. Overcoming that challenge requires additional ideas and is potentially a fruitful direction for future research. Other directions of research include extending the EDP algorithms to the infinite-horizon average cost case and to the partially observed case. We will take up these issues in the future.

Acknowledgements
The authors would like to thank Ugur Akyol (USC) for running the numerical experiments for this research. The authors would also like to thank Suvrajeet Sen (USC) and Vivek Borkar (IIT Bombay) for initial feedback on this work.

References

[1] Abounadi, J., D. Bertsekas, V. S. Borkar. 2001. Learning algorithms for Markov decision processes with average cost.
SIAM J. Control Optim. (3) 681–698.
[2] Almudevar, A. 2008. Approximate fixed point iteration with an application to infinite horizon Markov decision processes. SIAM J. Control Optim. (5) 2303–2347.
[3] Anthony, M., P. Bartlett. 2009. Neural Network Learning: Theoretical Foundations. Cambridge University Press.
[4] Barto, A., S. Sutton, P. Brouwer. 1981. Associative search network: A reinforcement learning associative memory. Biol. Cybernet. (3) 201–211.
[5] Bellman, R. 1957. Dynamic Programming. Princeton University Press.
[6] Bellman, R., S. Dreyfus. 1959. Functional approximations and dynamic programming. Mathematical Tables and Other Aids to Computation (68) 247–251.
[7] Bertsekas, D. 2004. Dynamic Programming and Optimal Control, vols. 1 and 2. Athena Scientific, Belmont.
[8] Bertsekas, D. 2011. Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. (3) 310–335.
[9] Bertsekas, D., J. Tsitsiklis. 1996. Neuro-Dynamic Programming. Athena Scientific.
[10] Borkar, V. S. 2002. Q-learning for risk-sensitive control. Math. Oper. Res. (2) 294–311.
[11] Borkar, V. S. 2008. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, Cambridge.
[12] Borkar, V. S., S. P. Meyn. 2000. The o.d.e. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. (2) 447–469.
[13] Chang, H. S., M. Fu, J. Hu, S. I. Marcus. 2006. Simulation-Based Algorithms for Markov Decision Processes. Springer, Berlin.
[14] Chang, H. S., M. Fu, J. Hu, S. I. Marcus. 2007. A survey of some simulation-based algorithms for Markov decision processes. Commun. Inf. Syst. (1) 59–92.
[15] Chueshov, I. 2002. Monotone Random Systems: Theory and Applications, vol. 1779. Springer.
[16] Cooper, W., S. Henderson, M. Lewis. 2003. Convergence of simulation-based policy iteration. Probab. Engrg. Inform. Sci. (02) 213–234.
[17] Cooper, W., B. Rangarajan. 2012. Performance guarantees for empirical Markov decision processes with applications to multiperiod inventory models. Oper. Res. (5) 1267–1281.
[18] Even-Dar, E., Y. Mansour. 2004. Learning rates for Q-learning. J. Mach. Learn. Res.
[19] Dynamic Probabilistic Systems, Vol. 2: Semi-Markov and Decision Processes. John Wiley and Sons.
[20] Jain, R., P. Varaiya. 2006. Simulation-based uniform value function estimates of Markov decision processes. SIAM J. Control Optim. (5) 1633–1656.
[21] Jain, R., P. Varaiya. 2010. Simulation-based optimization of Markov decision processes: An empirical process theory approach. Automatica (8) 1297–1304.
[22] Kakade, S. M. 2003. On the sample complexity of reinforcement learning. Ph.D. thesis, University of London.
[23] Kiefer, J., J. Wolfowitz. 1952. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics (3) 462–466.
[24] Konda, V. R., V. S. Borkar. 1999. Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization (1) 94–123.
[25] Konda, V. R., J. Tsitsiklis. 2004. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab.
[26] Kushner, H. J., D. S. Clark. 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer.
[27] Levin, D., Y. Peres, E. L. Wilmer. 2009. Markov Chains and Mixing Times. AMS.
[28] Ljung, L. 1977. Analysis of recursive stochastic algorithms. IEEE Trans. Automat. Control (4) 551–575.
[29] Minsky, M. 1961. Steps toward artificial intelligence. Proceedings of the IRE (1) 8–30.
[30] Müller, A., D. Stoyan. 2002. Comparison Methods for Stochastic Models and Risks, vol. 389. Wiley.
[31] Narendra, K. S., M. A. L. Thathachar. 2012. Learning Automata: An Introduction. Dover Publications.
[32] Nemhauser, George L. 1966. Introduction to Dynamic Programming. Wiley.
[33] Papadimitriou, C. H., J. Tsitsiklis. 1987. The complexity of Markov decision processes. Math. Oper. Res. (3) 441–450.
[34] Powell, W. B. 2007. Approximate Dynamic Programming: Solving the Curses of Dimensionality, vol. 703. John Wiley & Sons.
[35] Rajaraman, K., P. S. Sastry. 1996. Finite time analysis of the pursuit algorithm for learning automata. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics (4) 590–598.
[36] Robbins, H., S. Monro. 1951. A stochastic approximation method. The Annals of Mathematical Statistics.
[37] Stochastic Processes. John Wiley & Sons.
[38] Shaked, M., J. G. Shanthikumar. 2007. Stochastic Orders. Springer.
[39] Shapley, Lloyd S. 1953. Stochastic games. Proceedings Nat. Acad. of Sciences USA (10) 1095–1100.
[40] Sutton, R. S., A. G. Barto. 1998. Reinforcement Learning: An Introduction. Cambridge Univ Press.
[41] Thathachar, M. A. L., P. S. Sastry. 1985. A class of rapidly convergent algorithms for learning automata. IEEE Tran. Sy. Man. Cyb.
[42] J. Mach. Learn. Res.
[43] Watkins, C. J. C. H., P. Dayan. 1992. Q-learning. Machine Learning (3-4) 279–292.
[44] Werbos, P. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis.
[45] Whitt, W. 1978. Approximations of dynamic programs, I. Math. Oper. Res. (3) 231–243.
[46] Whitt, W. 1979. Approximations of dynamic programs, II. Math. Oper. Res. (2) 179–185.

Appendix A. Proofs of Various Lemmas, Propositions and Theorems

A.1. Proof of Fact 2.1

Proof:
To verify part (i), note
inf_{x∈X} f₁(x) = inf_{x∈X} {f₂(x) + f₁(x) − f₂(x)}
≤ inf_{x∈X} {f₂(x) + |f₁(x) − f₂(x)|}
≤ inf_{x∈X} {f₂(x) + sup_{y∈X} |f₁(y) − f₂(y)|}
≤ inf_{x∈X} f₂(x) + sup_{y∈X} |f₁(y) − f₂(y)|,
giving
inf_{x∈X} f₁(x) − inf_{x∈X} f₂(x) ≤ sup_{x∈X} |f₁(x) − f₂(x)|.
By the same reasoning,
inf_{x∈X} f₂(x) − inf_{x∈X} f₁(x) ≤ sup_{x∈X} |f₁(x) − f₂(x)|,
and the preceding two inequalities yield the desired result. Part (ii) follows similarly. □

A.2. Proof of Theorem 4.1
We first prove the following lemmas.
Lemma A.1. [Y^{k+1}_n | Y^k_n = θ] is stochastically increasing in θ for all k ≥ 0, i.e., [Y^{k+1}_n | Y^k_n = θ] ≤_{st} [Y^{k+1}_n | Y^k_n = θ′] for all θ ≤ θ′.

Proof: We see that Pr{Y^{k+1}_n ≥ η | Y^k_n = θ} is increasing in θ by construction of {Y^k_n}_{k≥0}. If θ > η, then Pr{Y^{k+1}_n ≥ η | Y^k_n = θ} = 1, since Y^{k+1}_n ≥ θ − 1 ≥ η. If θ ≤ η, then Pr{Y^{k+1}_n ≥ η | Y^k_n = θ} = 1 − p_n, since the only way Y^{k+1}_n can be at least η is if Y^{k+1}_n = N*. □

Lemma A.2. [X^{k+1}_n | X^k_n = θ, F^k] ≤_{st} [Y^{k+1}_n | Y^k_n = θ] for all θ, all histories F^k and all k ≥ 0.

Proof: This follows from the construction of {Y^k_n}_{k≥0}. For any history F^k,
P{X^{k+1}_n ≥ θ − 1 | X^k_n = θ, F^k} ≤ Q{Y^{k+1}_n ≥ θ − 1 | Y^k_n = θ} = 1.
Now,
P{X^{k+1}_n = N* | X^k_n = θ, F^k} ≤ P{X^{k+1}_n > θ − 1 | X^k_n = θ, F^k} = 1 − P(X^{k+1}_n ≤ θ − 1 | X^k_n = θ, F^k) ≤ 1 − p_n;
the last inequality follows because p_n is the worst-case probability of a one-step improvement in the Markov chain {X^k_n}_{k≥0}. □

Proof of Theorem 4.1:

Proof: Trivially, X⁰_n ≤_{st} Y⁰_n since X⁰_n = Y⁰_n. Next, we see that X¹_n ≤_{st} Y¹_n by the previous lemma. We prove the general case by induction. Suppose X^k_n ≤_{st} Y^k_n for some k ≥ 1, and for this proof define the random variable
Y(θ) = max{θ − 1, η*} with probability p_n,  and  Y(θ) = N* with probability 1 − p_n,
as a function of θ. We see that Y^{k+1}_n has the same distribution as [Y(Θ) | Θ = Y^k_n] by definition. Since Y(θ) is stochastically increasing in θ, we see that [Y(Θ) | Θ = Y^k_n] ≥_{st} [Y(Θ) | Θ = X^k_n] by [38, Theorem 1.A.6] and the induction hypothesis. Now, [Y(Θ) | Θ = X^k_n] ≥_{st} [X^{k+1}_n | X^k_n, F^k] by [38, Theorem 1.A.3(d)] for all histories F^k. It follows that Y^{k+1}_n ≥_{st} X^{k+1}_n by transitivity. □

A.3. Proof of Lemma 4.3
Proof:
The stationary probabilities {µ_n(i)}_{i=η*}^{N*} satisfy the equations
µ_n(η*) = p_n µ_n(η*) + p_n µ_n(η* + 1),
µ_n(i) = p_n µ_n(i + 1),  ∀ i = η* + 1, ..., N* − 1,
µ_n(N*) = (1 − p_n) Σ_{i=η*}^{N*} µ_n(i),
Σ_{i=η*}^{N*} µ_n(i) = 1.
We then see that
µ_n(i) = p_n^{N*−i} µ_n(N*),  ∀ i = η* + 1, ..., N* − 1,
and
µ_n(η*) = [p_n/(1 − p_n)] µ_n(η* + 1) = [p_n^{N*−η*}/(1 − p_n)] µ_n(N*).
We can solve for µ_n(N*) using Σ_{i=η*}^{N*} µ_n(i) = 1:
1 = Σ_{i=η*}^{N*} µ_n(i)
= [p_n^{N*−η*}/(1 − p_n)] µ_n(N*) + Σ_{i=η*+1}^{N*} p_n^{N*−i} µ_n(N*)
= [p_n^{N*−η*}/(1 − p_n) + (1 − p_n^{N*−η*})/(1 − p_n)] µ_n(N*)
= [1/(1 − p_n)] µ_n(N*),
since Σ_{i=η*+1}^{N*} p_n^{N*−i} = Σ_{i=0}^{N*−η*−1} p_n^i = (1 − p_n^{N*−η*})/(1 − p_n).
We conclude µ_n(N*) = 1 − p_n, and thus
µ_n(i) = (1 − p_n) p_n^{N*−i},  ∀ i = η* + 1, ..., N* − 1,  and  µ_n(η*) = p_n^{N*−η*}. □

A.4. Proof of Proposition 4.2
Proof: (i) First observe that lim_{n→∞} T̂_n(ω) v* = T v* by Assumption 4.1. It follows that T̂_n(ω) v* converges to v* = T v* as n → ∞, P-almost surely. Almost sure convergence implies convergence in probability.
(ii) Let v̂ be a strong probabilistic fixed point. Then,
P(‖T v̂ − v̂‖ ≥ ε) ≤ P(‖T v̂ − T̂_n v̂‖ ≥ ε/2) + P(‖T̂_n v̂ − v̂‖ ≥ ε/2).
Both terms on the right-hand side vanish as n → ∞, so for sufficiently large n we get P(‖T v̂ − v̂‖ ≥ ε) < 1. Since the event on the left-hand side is deterministic, we get ‖T v̂ − v̂‖ = 0. Hence, v̂ = v*. □

A.5. Proof of Proposition 4.3
Proof: (i) This statement is proved in Theorem 4.2.
(ii) Fix the initial seed v⁰ ∈ R^{|S|}. For a contradiction, suppose v̂ is not a fixed point of T, so that ‖v* − v̂‖ = ε′ > 0 (v* is unique). Now,
‖v̂ − v*‖ = ε′ ≤ ‖T̂^k_n v⁰ − v̂‖ + ‖T̂^k_n v⁰ − v*‖
for any n and k, by the triangle inequality. For clarity, this inequality holds in the almost sure sense:
P(ε′ ≤ ‖T̂^k_n v⁰ − v̂‖ + ‖T̂^k_n v⁰ − v*‖) = 1 for all n and k.
We already know that
lim_{n→∞} limsup_{k→∞} P(‖T̂^k_n v⁰ − v*‖ > ε′/3) = 0
by Theorem 4.2, and
lim_{n→∞} limsup_{k→∞} P(‖T̂^k_n v⁰ − v̂‖ > ε′/3) = 0
by assumption. Now,
P(max{‖T̂^k_n v⁰ − v̂‖, ‖T̂^k_n v⁰ − v*‖} > ε′/3) ≤ P(‖T̂^k_n v⁰ − v*‖ > ε′/3) + P(‖T̂^k_n v⁰ − v̂‖ > ε′/3),
so
lim_{n→∞} limsup_{k→∞} P(max{‖T̂^k_n v⁰ − v̂‖, ‖T̂^k_n v⁰ − v*‖} > ε′/3) = 0.
However, ε′ ≤ ‖T̂^k_n v⁰ − v̂‖ + ‖T̂^k_n v⁰ − v*‖ almost surely, so at least one of ‖T̂^k_n v⁰ − v̂‖ or ‖T̂^k_n v⁰ − v*‖ must be greater than ε′/3 for all n and k, a contradiction. □

A.6. Proof of Proposition 5.1
Proof: (i) Assumption 4.1: Certainly,
‖T̂_n(ω) v − T v‖ ≤ α max_{(s,a)∈K} |(1/n) Σ_{i=1}^{n} v(ψ(s,a,ξ_i)) − E[v(ψ(s,a,ξ))]|
using Fact 2.1. We know that for any fixed (s,a) ∈ K,
|(1/n) Σ_{i=1}^{n} v(ψ(s,a,ξ_i)) − E[v(ψ(s,a,ξ))]| → 0
as n → ∞ by the strong law of large numbers (the random variable v(ψ(s,a,ξ)) has finite expectation because it is essentially bounded). Recall that K is finite to see that the right-hand side of the above inequality converges to zero as n → ∞.
(ii) Assumption 4.2: We define the constant κ* := max_{(s,a)∈K} |c(s,a)|/(1 − α). Then it can easily be verified that the value of any policy π satisfies ‖v^π‖ ≤ κ*. Then ‖v*‖ ≤ κ*, and without loss of generality we can restrict v̂^k_n to the set B_{κ*}(0).
(iii) Assumption 4.3: This is the well-known contraction property of the Bellman operator.
(iv) Assumption 4.4: Using Fact 2.1, for any fixed s ∈ S,
|T̂_n v(s) − T v(s)| ≤ max_{a∈A(s)} |(α/n) Σ_{i=1}^{n} v(ψ(s,a,ξ_i)) − α E[v(ψ(s,a,ξ))]|
and hence,
P{‖T̂_n v − T v‖ ≥ ε} ≤ P{max_{(s,a)∈K} |(α/n) Σ_{i=1}^{n} v(ψ(s,a,ξ_i)) − α E[v(ψ(s,a,ξ))]| ≥ ε}.
For any fixed (s,a) ∈ K,
P{|(α/n) Σ_{i=1}^{n} v(ψ(s,a,ξ_i)) − α E[v(ψ(s,a,ξ))]| ≥ ε} ≤ 2 e^{−2n(ε/α)²/(v_max − v_min)²} ≤ 2 e^{−2n(ε/α)²/(2‖v‖)²} ≤ 2 e^{−n(ε/α)²/(2κ*²)}
by Hoeffding's inequality. Then, using the union bound, we have
P{‖T̂_n v − T v‖ ≥ ε} ≤ 2|K| e^{−n(ε/α)²/(2κ*²)}.
Taking complements of the above event, we get the desired result. □

A.7. Lemma A.3

Lemma A.3.
For any fixed n ≥ 1, the eigenvalues of the transition matrix Q_n of the Markov chain {Y^k_n}_{k≥0} are 0 (with algebraic multiplicity N* − η*) and 1.

Proof: In general, the transition matrix Q_n ∈ R^{(N*−η*+1)×(N*−η*+1)} of {Y^k_n}_{k≥0} has the form
Q_n =
[ p_n   0    0   ⋯   0    1−p_n ]
[ p_n   0    0   ⋯   0    1−p_n ]
[ 0    p_n   0   ⋯   0    1−p_n ]
[ ⋮              ⋱          ⋮   ]
[ 0     0    ⋯  p_n   0   1−p_n ]
[ 0     0    ⋯   0   p_n  1−p_n ].
To compute the eigenvalues of Q_n, we want to solve Q_n x = λ x for some x ≠ 0 and λ ∈ R. For x = (x_1, x_2, ..., x_{N*−η*+1}) ∈ R^{N*−η*+1},
Q_n x = ( p_n x_1 + (1−p_n) x_{N*−η*+1},
          p_n x_1 + (1−p_n) x_{N*−η*+1},
          p_n x_2 + (1−p_n) x_{N*−η*+1},
          ...,
          p_n x_{N*−η*−1} + (1−p_n) x_{N*−η*+1},
          p_n x_{N*−η*} + (1−p_n) x_{N*−η*+1} ).
Now, suppose λ ≠ 0 and Q_n x = λ x for some x ≠ 0. By the explicit computation of Q_n x above,
[Q_n x]_1 = p_n x_1 + (1−p_n) x_{N*−η*+1} = λ x_1  and  [Q_n x]_2 = p_n x_1 + (1−p_n) x_{N*−η*+1} = λ x_2,
so it must be that x_1 = x_2. However, then
[Q_n x]_3 = p_n x_2 + (1−p_n) x_{N*−η*+1} = p_n x_1 + (1−p_n) x_{N*−η*+1} = [Q_n x]_2,
and thus x_2 = x_3. Continuing this reasoning inductively shows that x_1 = x_2 = ⋯ = x_{N*−η*+1} for any eigenvector x of Q_n with λ ≠ 0. Thus, it must be true that λ = 1. □

A.8. Proof of Lemma 6.1
Proof: (i) Suppose v ≤ v′. It is immediate that [T_x v](s) = v(s) ≤ v′(s) = [T_x v′](s) for all s ≠ x. For state s = x,
c(s,a) + α E[v(s′)|s,a] ≤ c(s,a) + α E[v′(s′)|s,a]
for all (s,a) ∈ K, so
min_{a∈A(s)} {c(s,a) + α E[v(s′)|s,a]} ≤ min_{a∈A(s)} {c(s,a) + α E[v′(s′)|s,a]},
and thus [T_x v](s) ≤ [T_x v′](s).
(ii) We see that
min_{a∈A(s)} {c(s,a) + α E[v(s′) + η | s,a]} = min_{a∈A(s)} {c(s,a) + α E[v(s′)|s,a]} + α η
for state x, and all other states are unchanged. □

A.9. Proof of Proposition 6.1

Proof:
Starting with v⁰,
T_{x_0} v⁰ − η 1 ≤ T_{x_0} v⁰ + ε⁰ ≤ T_{x_0} v⁰ + η 1,
which gives
T_{x_0} v⁰ − η 1 ≤ ṽ¹ ≤ T_{x_0} v⁰ + η 1,
and v¹ − η 1 ≤ ṽ¹ ≤ v¹ + η 1. By monotonicity of T_{x_1},
T_{x_1}[v¹ − η 1] ≤ T_{x_1} ṽ¹ ≤ T_{x_1}[v¹ + η 1],
and by our assumption on the noise,
T_{x_1}[v¹ − η 1] − η 1 ≤ ṽ² ≤ T_{x_1}[v¹ + η 1] + η 1.
Now T_{x_1}[v¹ − η 1] = T_{x_1} v¹ − α η e_{x_1} ≥ T_{x_1} v¹ − α η 1, and similarly T_{x_1}[v¹ + η 1] ≤ T_{x_1} v¹ + α η 1, thus
v² − η 1 − α η 1 ≤ ṽ² ≤ v² + η 1 + α η 1.
Similarly,
v³ − α(η + α η) 1 − η 1 ≤ ṽ³ ≤ v³ + α(η + α η) 1 + η 1,
and the general case follows by induction. □

A.10. Proof of Lemma 6.3
Proof:
Using Fact 2.1 twice, compute
|[T v](s) − [T v′](s)| = |min_{a∈A(s)} max_{b∈B(s)} {c(s,a,b) + α E[v(s̃)|s,a,b]} − min_{a∈A(s)} max_{b∈B(s)} {c(s,a,b) + α E[v′(s̃)|s,a,b]}|
≤ max_{a∈A(s)} |max_{b∈B(s)} {c(s,a,b) + α E[v(s̃)|s,a,b]} − max_{b∈B(s)} {c(s,a,b) + α E[v′(s̃)|s,a,b]}|
≤ α max_{a∈A(s)} max_{b∈B(s)} |E[v(s̃) − v′(s̃)|s,a,b]|
≤ α max_{a∈A(s)} max_{b∈B(s)} E[|v(s̃) − v′(s̃)| | s,a,b]
≤ α ‖v − v′‖. □