Fixed Points of the Set-Based Bellman Operator
Sarah H.Q. Li ∗, Assalé Adjé ∗∗, Pierre-Loïc Garoche ∗∗∗, Behçet Açıkmeşe ∗
∗ William E. Boeing Department of Aeronautics and Astronautics, University of Washington, Seattle, USA (e-mail: [email protected], [email protected])
∗∗ LAMPS, Université de Perpignan Via Domitia, Perpignan, France (e-mail: [email protected])
∗∗∗ ONERA – The French Aerospace Lab, Univ. of Toulouse, France (e-mail: [email protected])
Abstract:
Motivated by uncertain parameters encountered in Markov decision processes (MDPs), we study the effect of parameter uncertainty on Bellman operator-based methods. Specifically, we consider a family of MDPs where the cost parameters are from a given compact set. We then define a Bellman operator acting on an input set of value functions to produce a new set of value functions as the output under all possible variations in the cost parameters. Finally, we prove the existence of a fixed point of this set-based Bellman operator by showing that it is a contractive operator on a complete metric space.
Keywords:
Markov decision process, stochastic control, game theory

1. INTRODUCTION

The Markov decision process (MDP) is a widely used mathematical framework for control design in stochastic environments, e.g. density control of a swarm of agents (Açıkmeşe and Bayard, 2012; Demir et al., 2015). It is also a fundamental framework for reinforcement learning, robotic motion planning, and stochastic games (Filar and Vrieze, 2012; Li et al., 2019). An MDP can be solved for different objectives, including minimum average cost, minimum discounted cost, and reachability, among others (Puterman, 2014). Given an objective, solving an MDP is equivalent to computing the optimal policy and the optimal value function of a decision maker over the state space. Among different algorithms for computing the optimal policy, most are based on the Bellman equation, which characterizes the optimal value function as its fixed point.

In applications, it is common to encounter MDPs with uncertainties. When modeling an environment as a stochastic process, sampling techniques are often used to determine process parameters such as costs or transition probabilities; such models are inherently uncertain. In stochastic games, the cost and probability parameters change with respect to another decision maker's strategy. While existing works focus on specific perturbations in MDPs (Bielecki and Filar, 1991; Altman and Gaitsgory, 1993; Abbad and Filar, 1992), these results do not generalize to the analysis of the overall behaviour of the MDP under all possible cost parameters in a compact set.

Additionally, how uncertainty in MDP cost parameters affects the outcome of value iteration type methods is not well studied. Dynamic programming on bounded MDPs is studied in Givan et al. (2000), but specifically for interval sets; convergence over general compact sets is not considered. While computation of the fixed points of the Bellman operator is the topic of numerous studies (Delage and Mannor, 2010), most focus on the convergence analysis of value iteration and its stopping criteria (Ashok et al., 2017; Eisentraut et al., 2019). However, they do not consider the relationship between bounds on the optimal value function and the uncertainty in cost. Similarly motivated, Haddad and Monmege (2018) analyze entry-wise uncertain transition kernels by using graph-based MDP transformations. While we also derive bounds on an MDP's value functions due to uncertain parameters, we differ in our approach: our set-based framework allows for direct extraction of the value iteration trajectories with respect to the set of cost parameters. This differentiates our work from Haddad and Monmege (2018), whose graphical abstraction of the MDP allows for derivation of bounds but not extraction of value function trajectories.
Contributions:
We characterize the solutions of a family of MDPs at once, represented as sets of MDPs. More specifically, we: (i) develop a characterization of MDPs with uncertain cost parameters; (ii) propose a set-based Bellman operator over non-empty compact sets; (iii) establish the contractivity of this set-based Bellman operator and the existence of a unique compact fixed point set.

2. REVIEW OF MDPS AND BELLMAN OPERATOR
Notation: Sets of $N$ elements are given by $[N] = \{0, \ldots, N-1\}$. We denote the set of matrices of $i$ rows and $j$ columns with real (non-negative) valued entries as $\mathbb{R}^{i \times j}$ ($\mathbb{R}^{i \times j}_+$). Elements of sets and matrices are denoted by capital letters, $X$, while sets are denoted by script letters, $\mathcal{X}$. The ones column vector is denoted by $\mathbf{1}_N = [1, \ldots, 1]^T \in \mathbb{R}^{N \times 1}$.

We consider a discounted infinite-horizon MDP defined by $([S], [A], P, C, \gamma)$ for a decision maker, where

(1) $[S]$ denotes the finite set of states.
(2) $[A]$ denotes the finite set of actions. Without loss of generality, assume that every action is admissible from each state $s \in [S]$.
(3) $P \in \mathbb{R}^{S \times SA}$ denotes the transition kernel. Each component $P_{s',sa}$ is the probability of arriving in state $s'$ after taking state-action pair $(s,a)$. The matrix $P$ is column stochastic and element-wise non-negative, i.e.
$$\sum_{s' \in [S]} P_{s',sa} = 1 \;\; \forall (s,a) \in [S] \times [A], \qquad P_{s',sa} \ge 0 \;\; \forall s', s \in [S],\, a \in [A].$$
(4) $C \in \mathbb{R}^{S \times A}$ denotes the cost of each state-action pair $(s,a)$.
(5) $\gamma \in (0,1)$ denotes the discount factor.

At each time step $t$, the decision maker chooses an action $a$ based on its current state $s$. The state-action pair $(s,a)$ induces a probability distribution $P_{(\cdot),sa} \in \mathbb{R}^S$, where $P_{s',sa}$ is the probability that the decision maker arrives at $s'$ at time step $t+1$. The state-action pair $(s,a)$ also induces a cost $C_{sa}$ that must be paid by the decision maker.

At each time step, the decision maker chooses a policy that dictates the action chosen at each state $s$. We denote a policy as a function $\pi : [S] \times [A] \to \mathbb{R}_+$, where $\pi(s,a)$ denotes the probability that action $a$ is chosen at state $s$. We denote the set of all feasible policies of an MDP by $\Pi$. In our context, it suffices to consider only deterministic, stationary policies, i.e. $\pi(s,a)$ is a time-invariant function that returns $1$ for exactly one action, and $0$ for all other possible actions.

We denote the policy matrix induced by a policy $\pi$ as $M_\pi \in \mathbb{R}^{S \times SA}$, where
$$(M_\pi)_{s',sa} = \begin{cases} \pi(s,a) & s' = s, \\ 0 & s' \ne s. \end{cases}$$
Every stationary policy induces a stationary
Markov chain (El Chamie et al., 2018), given by
$P M_\pi^T$. Each stationary policy also induces a stationary cost, given by
$$C(\pi) = \sum_{i \in [S]} e_i e_i^T M_\pi (\mathbf{1}_S \otimes I_A) C^T e_i, \quad C(\pi) \in \mathbb{R}^S, \qquad (1)$$
where $e_i \in \mathbb{R}^S$ is the unit vector pointing in the $i$-th coordinate.

For an MDP $([S], [A], P, C, \gamma)$, we are interested in minimizing the discounted infinite-horizon expected cost, defined with respect to a policy $\pi$ as
$$V^\star_s = \min_{\pi \in \Pi} \mathbb{E}^\pi_s \Big\{ \sum_{t=0}^{\infty} \gamma^t C_{s_t a_t} \Big\}, \quad \forall s \in [S], \qquad (2)$$
where $\gamma \in (0,1)$ is the discount factor of future cost, $s_t$ and $a_t$ are the state and action taken at time step $t$, and $s$ is the state that the decision maker starts from at $t = 0$.
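To make the notation above concrete, the following sketch (ours, not from the paper) builds a small two-state, two-action MDP and assembles the policy matrix $M_\pi$ and the componentwise form of the stationary cost (1) for a deterministic policy; the example numbers and the discount factor are arbitrary choices for illustration.

```python
import numpy as np

S, A = 2, 2        # two states, two actions
gamma = 0.5        # arbitrary discount factor in (0, 1)

# Transition kernel P in R^{S x SA}; column (s, a) has flat index s * A + a
# and holds the distribution over next states s'.
P = np.array([[0.8, 0.3, 0.5, 0.1],
              [0.2, 0.7, 0.5, 0.9]])
assert np.allclose(P.sum(axis=0), 1.0)          # column stochastic

C = np.array([[1.0, 4.0],                       # C[s, a]: cost of pair (s, a)
              [2.0, 0.5]])

def policy_matrix(pi):
    """M_pi in R^{S x SA} with (M_pi)[s', (s, a)] = pi(s, a) if s' == s, else 0."""
    M = np.zeros((S, S * A))
    for s in range(S):
        M[s, s * A:(s + 1) * A] = pi[s]
    return M

pi = np.array([[1.0, 0.0],     # deterministic policy: action 0 in state 0,
               [0.0, 1.0]])    # action 1 in state 1
M_pi = policy_matrix(pi)

C_pi = (pi * C).sum(axis=1)    # componentwise form of the stationary cost (1)
chain = P @ M_pi.T             # stationary Markov chain P M_pi^T
assert np.allclose(chain.sum(axis=0), 1.0)      # also column stochastic
```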
The minimum expected cost $V^\star_s$ is called the optimal value function. The policy $\pi^\star$ that achieves the optimal value function is called an optimal policy. In general, $V^\star$ is unique while $\pi^\star$ is not. It is well known that the set of optimal policies always includes at least one deterministic stationary policy (Puterman, 2014, Thm 6.2.11), i.e. one for which $\pi(s,a)$ returns $1$ for exactly one action at each state $s$, and $0$ for all other possible actions.

Determining the optimal value function of a given MDP is equivalent to solving for the fixed point of the associated Bellman operator, for which a myriad of techniques exists (Puterman, 2014). We introduce the Bellman operator here and relate its fixed point to the corresponding MDP problem.

Definition 1. (Standard Bellman Operator). For a discounted infinite-horizon MDP $([S], [A], P, C, \gamma)$, its associated Bellman operator $f_C : \mathbb{R}^S \to \mathbb{R}^S$ is given component-wise by
$$\big(f_C(V)\big)_s := \min_{a \in [A]} C_{sa} + \gamma \sum_{s' \in [S]} P_{s',sa} V_{s'}, \quad \forall s \in [S].$$

The fixed point of the Bellman operator is a value function $V \in \mathbb{R}^S$ that is invariant under the operator.

Definition 2. (Fixed Point). Let $F : \mathcal{X} \to \mathcal{X}$ be an operator on the metric space $\mathcal{X}$. $V^\star \in \mathcal{X}$ is a fixed point of $F$ if it satisfies
$$V^\star = F(V^\star). \qquad (3)$$

In order to show that the Bellman operator has a unique fixed point, we consider the following operator property.

Definition 3. (Contraction Operator). Let $(\mathcal{X}, d)$ be a complete metric space. An operator $F : \mathcal{X} \to \mathcal{X}$ is a contraction operator if there exists $\gamma \in (0,1)$ such that
$$d(F(V), F(V')) \le \gamma\, d(V, V'), \quad \forall V, V' \in \mathcal{X}.$$

The Bellman operator is known to be a contraction operator on the complete metric space $(\mathbb{R}^S, \|\cdot\|_\infty)$. From the Banach fixed point theorem (Puterman, 2014), it has a unique fixed point. Because the optimal value function $V^\star$ is given by the unique fixed point of the associated Bellman operator, we use the terms optimal value function and fixed point of $f_C$ interchangeably.

In addition to obtaining $V^\star$, MDPs are also solved to determine the optimal policy, $\pi^\star$. We note that because every feasible policy $\pi$ induces a Markov chain, $\pi$ also induces a unique stationary value function $V$ which satisfies
$$V = C(\pi) + \gamma M_\pi P^T V. \qquad (4)$$
Given a feasible policy $\pi$, we can equivalently solve for the stationary value function $V$ as $V = (I - \gamma M_\pi P^T)^{-1} C(\pi)$. From this perspective, the optimal value function is the minimum vector among the finite set of stationary value functions generated by the set of all policies $\Pi$.

From the optimal value function $V^\star$, we can also derive a deterministic optimal policy from the Bellman operator as
$$\pi^\star(s,a) = \begin{cases} 1 & a = \operatorname{argmin}_{\bar a \in [A]}\, C_{s \bar a} + \gamma \sum_{s' \in [S]} P_{s',s \bar a} V^\star_{s'}, \\ 0 & \text{otherwise}, \end{cases} \quad \forall s \in [S]. \qquad (5)$$
While an optimal policy does not need to be deterministic and stationary, the optimal policy $\pi^\star$ derived from (5) will always be deterministic.

Among different algorithms to determine the fixed point of the Bellman operator, value iteration (VI) is a commonly used and simple technique in which the Bellman operator is iteratively applied until the optimal value is reached, i.e. starting from any value function $V^0 \in \mathbb{R}^S$ and $k = 1, \ldots$, we apply
$$V^{k+1}_s = \min_{a \in [A]} C_{sa} + \gamma \sum_{s' \in [S]} P_{s',sa} V^k_{s'}, \quad \forall s \in [S]. \qquad (6)$$
The iteration scheme given by (6) converges to the fixed point of the corresponding discounted infinite-horizon MDP. The stopping criterion of VI can be considered an over-approximation of the optimal value function.

Lemma 4. (Puterman, 2014, Thm 6.3.1). For any initial value function $V^0 \in \mathbb{R}^S$, let $\{V^k\}_{k \in \mathbb{N}}$ be the value function trajectory from (6). Whenever there exists $\epsilon > 0$ such that $\|V^{k+1} - V^k\|_\infty < \epsilon(1-\gamma)/(2\gamma)$, then $V^{k+1}$ is within $\epsilon/2$ of the fixed point $V^\star$, i.e. $\|V^{k+1} - V^\star\|_\infty < \epsilon/2$.

Lemma 4 connects the sequence $\{V^k\}_{k \in \mathbb{N}}$'s relative convergence to its absolute convergence towards $V^\star$ by showing that the former implies the latter. In general, stopping criteria differ for different MDP objectives (see Haddad and Monmege (2018) for recent results on stopping criteria for reachability).
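As an illustration of (6) and the stopping rule of Lemma 4, the sketch below runs value iteration on the toy MDP defined earlier and extracts the greedy deterministic policy of (5); the tolerance value is our arbitrary choice.

```python
def bellman(V, C, P, gamma):
    """Standard Bellman operator f_C of Definition 1, applied componentwise."""
    # Q[s, a] = C[s, a] + gamma * sum_{s'} P[s', s*A + a] * V[s']
    Q = C + gamma * (V @ P).reshape(C.shape)
    return Q.min(axis=1)

def value_iteration(C, P, gamma, eps=1e-6):
    """Iterate (6) until ||V^{k+1} - V^k||_inf < eps * (1 - gamma) / (2 * gamma);
    by Lemma 4, the returned vector is then within eps/2 of the fixed point V*."""
    V = np.zeros(C.shape[0])
    while True:
        V_next = bellman(V, C, P, gamma)
        if np.max(np.abs(V_next - V)) < eps * (1 - gamma) / (2 * gamma):
            return V_next
        V = V_next

V_star = value_iteration(C, P, gamma)
Q_star = C + gamma * (V_star @ P).reshape(C.shape)
pi_star = Q_star.argmin(axis=1)     # deterministic optimal policy, as in (5)
```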
3. SET-BASED BELLMAN OPERATOR

The standard Bellman operator with respect to a fixed cost parameter $C$ is well studied. Motivated by a family of MDPs corresponding to a compact set of cost parameters $\mathcal{C} \subseteq \mathbb{R}^{S \times A}$, with all other problem data remaining identical, we lift the Bellman operator to operate on sets rather than individual vectors in $\mathbb{R}^S$. For the set-based operator, we analyze its set-based domain and prove that it is a contraction operator. We also prove the existence of a unique fixed point set $\mathcal{V}^\star$ of the set-based Bellman operator and relate its properties to the fixed point of the standard Bellman operator.

We define a new metric space $(H(\mathbb{R}^S), d_H)$ based on the Banach space $(\mathbb{R}^S, \|\cdot\|_\infty)$ to serve as our set-based operator domain (Rudin et al., 1964), where $H(\mathbb{R}^S)$ is the collection of non-empty compact subsets of $\mathbb{R}^S$ equipped with the partial order: for $\mathcal{V}, \mathcal{V}' \in H(\mathbb{R}^S)$, $\mathcal{V} \preceq \mathcal{V}'$ if $\mathcal{V} \subseteq \mathcal{V}'$, i.e. if $\mathcal{V}$ is a subset of $\mathcal{V}'$. The metric $d_H$ is the Hausdorff distance (Henrikson, 1999), defined as
$$d_H(\mathcal{V}, \mathcal{V}') = \max\Big\{ \sup_{V \in \mathcal{V}} \inf_{V' \in \mathcal{V}'} \|V - V'\|_\infty,\; \sup_{V' \in \mathcal{V}'} \inf_{V \in \mathcal{V}} \|V - V'\|_\infty \Big\}. \qquad (7)$$
Since $(\mathbb{R}^S, \|\cdot\|_\infty)$ is a complete metric space, $H(\mathbb{R}^S)$ is a complete metric space with respect to $d_H$.

Lemma 5. (Henrikson, 1999, Thm 3.3). If $X$ is a complete metric space, then its induced Hausdorff metric space $(H(X), d_H)$ is a complete metric space.

On the metric space $H(\mathbb{R}^S)$, we define a set-based Bellman operator.

Definition 6. (Set-based Bellman Operator). For a family of MDP problems $([S], [A], P, \mathcal{C}, \gamma)$, where $\mathcal{C} \subseteq \mathbb{R}^{S \times A}$ is a compact set, its associated set-based Bellman operator is given by
$$F_{\mathcal{C}}(\mathcal{V}) = \mathrm{cl} \bigcup_{(C,V) \in \mathcal{C} \times \mathcal{V}} f_C(V), \quad \forall \mathcal{V} \in H(\mathbb{R}^S),$$
where $\mathrm{cl}$ is the closure operator.

As we take the union of uncountably many bounded sets, the resulting set may not be bounded, and therefore it is not immediately obvious that $F_{\mathcal{C}}(\mathcal{V})$ maps into the metric space $H(\mathbb{R}^S)$. We show this is true in Proposition 7.
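The Hausdorff distance (7) and the operator of Definition 6 can be explored numerically on finite samples. The sketch below is our simplification: it stands in for $F_{\mathcal{C}}$ by applying $f_C$ over finitely many sampled cost matrices and value functions, whereas the true operator ranges over the full compact sets and takes a closure.

```python
def hausdorff(U, W):
    """Hausdorff distance (7) between finite sets of vectors (rows), sup-norm."""
    D = np.max(np.abs(U[:, None, :] - W[None, :, :]), axis=2)  # pairwise distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def set_bellman(Vs, Cs, P, gamma):
    """Sampled stand-in for F_C(V): the image of f_C over all (C, V) pairs."""
    return np.array([bellman(V, Ck, P, gamma) for Ck in Cs for V in Vs])
```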
Proposition 7. If $\mathcal{C}$ is compact, then $F_{\mathcal{C}}(\mathcal{V}) \in H(\mathbb{R}^S)$ for all $\mathcal{V} \in H(\mathbb{R}^S)$.

Proof. For a non-empty set $\mathcal{A}$ of some finite-dimensional real vector space, define its diameter as $\mathrm{diam}(\mathcal{A}) = \sup_{x,y \in \mathcal{A}} \|x - y\|_\infty$. The diameter of any compact set in a metric space is bounded.

We take any non-empty compact set $\mathcal{V} \in H(\mathbb{R}^S)$. As $F_{\mathcal{C}}(\mathcal{V}) \subset \mathbb{R}^S$, it suffices to prove that $F_{\mathcal{C}}(\mathcal{V})$ is closed and bounded. Closedness is guaranteed by the closure operator. A subset of a metric space is bounded iff its closure is bounded. Hence, to prove boundedness, it suffices to prove that $\mathrm{diam}\big( \cup_{(C,V) \in \mathcal{C} \times \mathcal{V}} f_C(V) \big) < +\infty$. Any two cost-value function pairs $(C, V), (C', V') \in \mathcal{C} \times \mathcal{V}$ must satisfy
$$f_C(V) - f_{C'}(V') = \big( f_C(V) - f_{C'}(V) \big) + \big( f_{C'}(V) - f_{C'}(V') \big),$$
where the norm of the second term, $\|f_{C'}(V) - f_{C'}(V')\|_\infty$, is upper bounded by $\|V - V'\|_\infty$ due to the contraction property of $f_{C'}$. To bound the first term, we note that for any two vectors $a, b \in \mathbb{R}^S$, $\|a - b\|_\infty = \max\{\max(a - b), \max(b - a)\}$. Letting $\pi$ be the optimal policy of $f_C(V)$, and letting $\nu(\pi)$ and $\nu'(\pi)$ denote the stationary costs (1) under $C$ and $C'$ respectively,
$$\max(f_{C'}(V) - f_C(V)) \le \max\big( \nu'(\pi) + \gamma M_\pi P^T V - \nu(\pi) - \gamma M_\pi P^T V \big) \le \max(\nu'(\pi) - \nu(\pi)) \le \sum_{i \in [S]} \|e_i^T\|_\infty \|M_\pi\|_\infty \|\mathbf{1}_S \otimes I_A\|_\infty \|(C' - C)^T\|_\infty \|e_i\|_\infty.$$
Since $\|\mathbf{1}_S \otimes I_A\|_\infty = \|e_i\|_\infty = \|e_i^T\|_\infty = \|M_\pi\|_\infty = 1$ for any $\pi \in \Pi$, $\max(f_{C'}(V) - f_C(V)) \le S\, \mathrm{diam}(\mathcal{C}^T)$. The bound $\max(f_C(V) - f_{C'}(V)) \le S\, \mathrm{diam}(\mathcal{C}^T)$ is derived similarly. Therefore, $\|f_C(V) - f_{C'}(V')\|_\infty \le S\, \mathrm{diam}(\mathcal{C}^T) + \mathrm{diam}(\mathcal{V})$. Since this holds for all $(C,V), (C',V') \in \mathcal{C} \times \mathcal{V}$,
$$\mathrm{diam}\Big( \cup_{(C,V) \in \mathcal{C} \times \mathcal{V}} f_C(V) \Big) \le S\, \mathrm{diam}(\mathcal{C}^T) + \mathrm{diam}(\mathcal{V}) < +\infty,$$
as $\mathcal{C}$ and $\mathcal{V}$ are bounded. ✷

Proposition 7 shows that $F_{\mathcal{C}}$ is an operator from $H(\mathbb{R}^S)$ to $H(\mathbb{R}^S)$. Having established its range space, we can draw many parallels between $F_{\mathcal{C}}$ and $f_C$. Similar to the existence of a unique fixed point $V^\star$ of $f_C$, we consider whether a fixed point set of $F_{\mathcal{C}}$, which satisfies $F_{\mathcal{C}}(\mathcal{V}^\star) = \mathcal{V}^\star$, exists, and whether it is unique. To take the comparison further, since $V^\star$ is the optimal value function for an MDP problem defined by $([S], [A], P, C, \gamma)$, how does $\mathcal{V}^\star$ relate to the family of optimal solutions that corresponds to the MDP family $([S], [A], P, \mathcal{C}, \gamma)$?

To prove the unique existence of $\mathcal{V}^\star$, we utilize the Banach fixed point theorem (Puterman, 2014), which states that a unique fixed point must exist for every contraction operator on a complete metric space. First, we show that $F_{\mathcal{C}}$ is a contraction, as defined in Definition 3, on the complete metric space $(H(\mathbb{R}^S), d_H)$.
Proposition 8. For any $\mathcal{V} \in H(\mathbb{R}^S)$ and compact (closed and bounded) $\mathcal{C} \subset \mathbb{R}^{S \times A}$, $F_{\mathcal{C}}$ is a contraction operator under the Hausdorff distance.

Proof.
Consider $\mathcal{V}, \bar{\mathcal{V}} \in H(\mathbb{R}^S)$. To see that $F_{\mathcal{C}}$ is a contraction, we need to show
$$\sup_{V \in F_{\mathcal{C}}(\mathcal{V})} \inf_{\bar V \in F_{\mathcal{C}}(\bar{\mathcal{V}})} \|V - \bar V\|_\infty \le \gamma\, d_H(\mathcal{V}, \bar{\mathcal{V}}), \qquad (8)$$
$$\sup_{V \in F_{\mathcal{C}}(\bar{\mathcal{V}})} \inf_{\bar V \in F_{\mathcal{C}}(\mathcal{V})} \|V - \bar V\|_\infty \le \gamma\, d_H(\mathcal{V}, \bar{\mathcal{V}}). \qquad (9)$$
First, we note that taking the sup (inf) of a continuous function over a set $\mathcal{A}$ is equivalent to taking the sup (inf) over the closure of $\mathcal{A}$. Let $G_{\mathcal{C}}(\mathcal{V}) = \cup_{(C,V) \in \mathcal{C} \times \mathcal{V}} f_C(V)$, so that $\mathrm{cl}\, G_{\mathcal{C}}(\mathcal{V}) = F_{\mathcal{C}}(\mathcal{V})$. Then, due to continuity of norms (Rudin et al., 1964, Thm 4.16),
$$\sup_{V \in F_{\mathcal{C}}(\mathcal{V})} \inf_{\bar V \in F_{\mathcal{C}}(\bar{\mathcal{V}})} \|V - \bar V\|_\infty = \sup_{V \in G_{\mathcal{C}}(\mathcal{V})} \inf_{\bar V \in G_{\mathcal{C}}(\bar{\mathcal{V}})} \|V - \bar V\|_\infty.$$
Therefore, it suffices to prove
$$\sup_{f_C(V) \in G_{\mathcal{C}}(\mathcal{V})} \inf_{f_{\bar C}(\bar V) \in G_{\mathcal{C}}(\bar{\mathcal{V}})} \|f_C(V) - f_{\bar C}(\bar V)\|_\infty \le \gamma\, d_H(\mathcal{V}, \bar{\mathcal{V}}),$$
$$\sup_{f_{\bar C}(\bar V) \in G_{\mathcal{C}}(\bar{\mathcal{V}})} \inf_{f_C(V) \in G_{\mathcal{C}}(\mathcal{V})} \|f_C(V) - f_{\bar C}(\bar V)\|_\infty \le \gamma\, d_H(\mathcal{V}, \bar{\mathcal{V}}).$$
For any $V \in \mathcal{V}$, $C \in \mathcal{C}$,
$$\inf_{(\bar C, \bar V) \in \mathcal{C} \times \bar{\mathcal{V}}} \|f_C(V) - f_{\bar C}(\bar V)\|_\infty \qquad (10a)$$
$$= \inf_{(\bar C, \bar V) \in \mathcal{C} \times \bar{\mathcal{V}}} \|C(\pi) + \gamma M_\pi P^T V - (\bar C(\bar\pi) + \gamma M_{\bar\pi} P^T \bar V)\|_\infty \qquad (10b)$$
$$\le \inf_{\bar V \in \bar{\mathcal{V}}} \|C(\bar\pi) + \gamma M_{\bar\pi} P^T V - (C(\bar\pi) + \gamma M_{\bar\pi} P^T \bar V)\|_\infty \qquad (10c)$$
$$\le \inf_{\bar V \in \bar{\mathcal{V}}} \|\gamma M_{\bar\pi} P^T (V - \bar V)\|_\infty \le \gamma \inf_{\bar V \in \bar{\mathcal{V}}} \|V - \bar V\|_\infty, \qquad (10d)$$
where in (10b), $\pi$ corresponds to the optimal policy for the MDP $([S],[A],P,C,\gamma)$ and $\bar\pi$ corresponds to the optimal policy for the MDP $([S],[A],P,\bar C,\gamma)$. In (10c), we replaced $M_\pi$ by $M_{\bar\pi}$ by noting that $\pi$ is optimal, so $\bar\pi$ must result in a larger value function (similar to the proof of Proposition 7), and we upper bounded the infimum over $\mathcal{C}$ by the choice $\bar C = C \in \mathcal{C}$. In (10d), we used the fact that $\|M_{\bar\pi} P^T\|_\infty \le 1$. Taking the sup over $G_{\mathcal{C}}(\mathcal{V})$ and $G_{\mathcal{C}}(\bar{\mathcal{V}})$,
$$\sup_{V \in G_{\mathcal{C}}(\mathcal{V})} \inf_{\bar V \in G_{\mathcal{C}}(\bar{\mathcal{V}})} \|V - \bar V\|_\infty \le \gamma \sup_{V \in \mathcal{V}} \inf_{\bar V \in \bar{\mathcal{V}}} \|V - \bar V\|_\infty,$$
$$\sup_{\bar V \in G_{\mathcal{C}}(\bar{\mathcal{V}})} \inf_{V \in G_{\mathcal{C}}(\mathcal{V})} \|V - \bar V\|_\infty \le \gamma \sup_{\bar V \in \bar{\mathcal{V}}} \inf_{V \in \mathcal{V}} \|V - \bar V\|_\infty.$$
Therefore, $d_H(F_{\mathcal{C}}(\mathcal{V}), F_{\mathcal{C}}(\bar{\mathcal{V}})) \le \gamma\, d_H(\mathcal{V}, \bar{\mathcal{V}})$. Since $\gamma \in (0,1)$, $F_{\mathcal{C}}$ is a contraction operator on $H(\mathbb{R}^S)$. ✷

The contraction property of $F_{\mathcal{C}}$ implies that repeated application of the operator to any $\mathcal{V} \in H(\mathbb{R}^S)$ produces sets that are closer and closer, in the Hausdorff distance, to a fixed point set. It is then natural to ask whether there is a unique set to which all sequences $F^k_{\mathcal{C}}(\mathcal{V})$ converge.
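Under the same sampling caveat as before, the contraction can be checked numerically: iterating the sampled operator from two different initial sets, their Hausdorff distance should shrink by at least a factor of $\gamma$ per step. The uncertainty radius and set sizes below are arbitrary choices.

```python
rng = np.random.default_rng(0)

# A finite sample of the compact cost set C: perturbations of the nominal C.
Cs = [C + 0.5 * rng.uniform(-1.0, 1.0, size=C.shape) for _ in range(3)]

Vs_a = rng.normal(size=(4, S))     # two arbitrary finite "sets" in H(R^S)
Vs_b = rng.normal(size=(4, S))
for k in range(4):
    print(k, hausdorff(Vs_a, Vs_b))            # decays at rate at most gamma
    Vs_a = set_bellman(Vs_a, Cs, P, gamma)
    Vs_b = set_bellman(Vs_b, Cs, P, gamma)
```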
Theorem 9. There exists a unique fixed point set $\mathcal{V}^\star$ of the set-based Bellman operator $F_{\mathcal{C}}$ defined in Definition 6, such that $F_{\mathcal{C}}(\mathcal{V}^\star) = \mathcal{V}^\star$, and $\mathcal{V}^\star$ is a closed and bounded subset of $\mathbb{R}^S$. Furthermore, for any iteration starting from an arbitrary $\mathcal{V}^0 \in H(\mathbb{R}^S)$ with $\mathcal{V}^{k+1} = F_{\mathcal{C}}(\mathcal{V}^k)$, the sequence converges in the Hausdorff sense, i.e. $\lim_{k \to \infty} d_H(F_{\mathcal{C}}(\mathcal{V}^k), \mathcal{V}^\star) = 0$.

Proof.
As shown in Proposition 8, $F_{\mathcal{C}}$ is a contraction operator. From the Banach fixed point theorem (Puterman, 2014, Thm 6.2.3), there exists a unique fixed point $\mathcal{V}^\star$, and any arbitrary $\mathcal{V}^0 \in H(\mathbb{R}^S)$ generates a sequence $\{F_{\mathcal{C}}(\mathcal{V}^k)\}_{k \in \mathbb{N}}$ that converges to the fixed point. ✷

The fixed point $V^\star$ of the Bellman operator $f_C$ on the metric space $\mathbb{R}^S$ corresponds to the optimal value function of the MDP associated with cost parameter $C$. Because there is no direct association of an MDP problem with the set-based Bellman operator $F_{\mathcal{C}}$, we cannot claim the same for $\mathcal{V}^\star$. However, $\mathcal{V}^\star$ does have many interesting properties on $H(\mathbb{R}^S)$, in parallel to the operator $f_C$ on $\mathbb{R}^S$, especially in terms of the value iteration method (6). Suppose that instead of a fixed cost parameter, at each iteration $k$ a cost $C^k$ is randomly chosen from the compact set of costs, $C^k \in \mathcal{C}$; it is then interesting to ask whether $\mathcal{V}^\star$ contains all the limit points of $\lim_k f_{C^k}(V^k)$. Indeed, we can infer from Theorem 9 that the sequence $\{V^k\}$ converges to $\mathcal{V}^\star$ under the Hausdorff metric. Furthermore, even when $V^k$ itself does not converge, it must converge to the set $\mathcal{V}^\star$ under the Hausdorff metric, i.e. $\lim_{k \to \infty} \inf_{V \in \mathcal{V}^\star} \|V^k - V\|_\infty = 0$.
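The sketch below (again on the sampled stand-in, with our arbitrary iteration counts) illustrates this convergence: a value iteration trajectory driven by randomly drawn costs $C^k \in \mathcal{C}$ ends up close to the set obtained by iterating the set-based operator, even though the trajectory itself need not converge.

```python
# Value iteration (6) with a cost drawn at random from the sampled set each step.
V = np.zeros(S)
for k in range(200):
    V = bellman(V, Cs[rng.integers(len(Cs))], P, gamma)

# Approximate V* by iterating the sampled set-based operator from a singleton.
Vs = np.zeros((1, S))
for _ in range(8):
    Vs = set_bellman(Vs, Cs, P, gamma)

# inf_{W in V*} ||V - W||_inf: small, up to sampling and truncation error.
print(np.min(np.max(np.abs(Vs - V), axis=1)))
```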
4. CONCLUSION

We summarize our results on the set-based Bellman operator: for a compact cost function set $\mathcal{C}$, the iterates of $F_{\mathcal{C}}$ converge to a unique compact set $\mathcal{V}^\star$, which contains all the fixed points of $f_C$ for fixed $C \in \mathcal{C}$. Furthermore, $\mathcal{V}^\star$ also contains the limit points of $f_{C^k}(V^k)$ for any $\{C^k\}_{k \in \mathbb{N}} \subseteq \mathcal{C}$, $V^0 \in \mathbb{R}^S$, given that $\lim_{k \to \infty} V^k$ converges. Even if the limit does not exist, $V^k$ must asymptotically converge to $\mathcal{V}^\star$ in the Hausdorff sense. Future work includes extending the uncertainty analysis to consider uncertainty in the transition kernel, to fully capture learning in a general stochastic game.

REFERENCES

Abbad, M. and Filar, J.A. (1992). Perturbation and stability theory for Markov control problems. IEEE Trans. Autom. Control.
Açıkmeşe, B. and Bayard, D.S. (2012). A Markov chain approach to probabilistic swarm guidance. In Amer. Control Conf., 6300–6307. IEEE.
Altman, E. and Gaitsgory, V.A. (1993). Stability and singular perturbations in constrained Markov decision problems. IEEE Trans. Autom. Control, 38(6), 971–975.
Ashok, P., Chatterjee, K., Daca, P., Křetínský, J., and Meggendorfer, T. (2017). Value iteration for long-run average reward in Markov decision processes. In Int. Conf. Comput. Aided Verification, 201–221. Springer.
Bielecki, T.R. and Filar, J.A. (1991). Singularly perturbed Markov control problem: Limiting average cost. Ann. Op. Res., 28(1), 153–168.
Delage, E. and Mannor, S. (2010). Percentile optimization for Markov decision processes with parameter uncertainty. Op. Res., 58(1), 203–213.
Demir, N., Eren, U., and Açıkmeşe, B. (2015). Decentralized probabilistic density control of autonomous swarms with safety. Auton. Robots, 39(4), 537–554.
Eisentraut, J., Křetínský, J., and Rotar, A. (2019). Stopping criteria for value and strategy iteration on concurrent stochastic reachability games. arXiv preprint arXiv:1909.08348.
El Chamie, M., Yu, Y., Açıkmeşe, B., and Ono, M. (2018). Controlled Markov processes with safety state constraints. IEEE Trans. Autom. Control, 64(3), 1003–1018.
Filar, J. and Vrieze, K. (2012). Competitive Markov Decision Processes. Springer Science & Business Media.
Givan, R., Leach, S., and Dean, T. (2000). Bounded-parameter Markov decision processes. Artif. Intell., 122(1-2), 71–109.
Haddad, S. and Monmege, B. (2018). Interval iteration algorithm for MDPs and IMDPs. Theor. Comput. Sci., 735, 111–131.
Henrikson, J. (1999). Completeness and total boundedness of the Hausdorff metric. MIT Undergraduate J. Math.
Li, S.H.Q., Yu, Y., Calderone, D., Ratliff, L., and Açıkmeşe, B. (2019). Tolling for constraint satisfaction in Markov decision process congestion games. In Amer. Control Conf., 1238–1243. IEEE.
Puterman, M.L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
Rudin, W. et al. (1964). Principles of Mathematical Analysis. McGraw-Hill, New York.