Fixed Points of the Set-Based Bellman Operator
Sarah H.Q. Li ∗, Assalé Adjé ∗∗, Pierre-Loïc Garoche ∗∗∗, Behçet Açıkmeşe ∗
∗ William E. Boeing Department of Aeronautics and Astronautics, University of Washington, Seattle, USA (e-mail: [email protected], [email protected])
∗∗ LAMPS, Université de Perpignan Via Domitia, Perpignan, France (e-mail: [email protected])
∗∗∗ ONERA – The French Aerospace Lab, Univ. of Toulouse, France (e-mail: [email protected])
Abstract:
Motivated by uncertain parameters encountered in Markov decision processes (MDPs), we study the effect of parameter uncertainty on Bellman operator-based methods. Specifically, we consider a family of MDPs where the cost parameters are from a given compact set. We then define a Bellman operator acting on an input set of value functions to produce a new set of value functions as the output under all possible variations in the cost parameters. Finally, we prove the existence of a fixed point of this set-based Bellman operator by showing that it is a contractive operator on a complete metric space.
Keywords:
Markov decision process, stochastic control, game theory

1. INTRODUCTION

The Markov decision process (MDP) is a widely used mathematical framework for control design in stochastic environments, e.g. density control of a swarm of agents (Açıkmeşe and Bayard, 2012; Demir et al., 2015). It is also a fundamental framework for reinforcement learning, robotic motion planning, and stochastic games (Filar and Vrieze, 2012; Li et al., 2019). An MDP can be solved for different objectives, including minimum average cost, minimum discounted cost, and reachability, among others (Puterman, 2014). Given an objective, solving an MDP is equivalent to computing the optimal policy and the optimal value function of a decision maker over the state space. Among different algorithms for computing the optimal policy, most are based on the Bellman equation, which characterizes the optimal value function as its fixed point.

In applications, it is common to encounter MDPs with uncertainties. When modeling an environment as a stochastic process, sampling techniques are often used to determine process parameters such as costs or transition probabilities; such models are inherently uncertain. In stochastic games, the cost and probability parameters change with respect to another decision maker's strategy. While existing works focus on specific perturbations in MDPs (Bielecki and Filar, 1991; Altman and Gaitsgory, 1993; Abbad and Filar, 1992), these results do not generalize to the analysis of the overall behaviour of the MDP under all possible cost parameters in a compact set.

Additionally, how uncertainty in MDP cost parameters affects the outcome of value iteration type methods is not well studied. Dynamic programming on bounded MDPs is studied in Givan et al. (2000), but specifically for interval sets; convergence over general compact sets is not considered. While computation of the fixed points of the Bellman operator is the topic of numerous studies (Delage and Mannor, 2010), most focus on the convergence analysis of value iteration and its stopping criteria (Ashok et al., 2017; Eisentraut et al., 2019). However, they do not consider the relationship between bounds on the optimal value function and the uncertainty in cost. Similarly motivated, Haddad and Monmege (2018) analyze entry-wise uncertain transition kernels by using graph-based MDP transformations. While we also derive bounds on an MDP's value functions due to uncertain parameters, we differ in our approach: our set-based framework allows for direct extraction of the value iteration trajectories with respect to the set of cost parameters. This differentiates our work from Haddad and Monmege (2018), whose graphical abstraction of the MDP allows for derivation of bounds but not extraction of value function trajectories.
Contributions:
We characterize the solutions of a family of MDPs at once, represented as sets of MDPs. More specifically, we: (i) develop a characterization of MDPs with uncertain cost parameters; (ii) propose a set-based Bellman operator over non-empty compact sets; (iii) establish the contractivity of this set-based Bellman operator and the existence of a unique compact fixed point set.

2. REVIEW OF MDPS AND BELLMAN OPERATOR
Notation: Sets of $N$ elements are given by $[N] = \{0, \ldots, N-1\}$. We denote the set of matrices of $i$ rows and $j$ columns with real (non-negative) valued entries as $\mathbb{R}^{i \times j}$ ($\mathbb{R}^{i \times j}_+$). Elements of sets and matrices are denoted by capital letters, $X$, while sets are denoted by script letters, $\mathcal{X}$. The ones column vector is denoted by $\mathbf{1}_N = [1, \ldots, 1]^T \in \mathbb{R}^{N \times 1}$.

We consider a discounted infinite-horizon MDP defined by $([S], [A], P, C, \gamma)$ for a decision maker, where

(1) $[S]$ denotes the finite set of states.
(2) $[A]$ denotes the finite set of actions. Without loss of generality, assume that every action is admissible from each state $s \in [S]$.
(3) $P \in \mathbb{R}^{S \times SA}$ denotes the transition kernel. Each component $P_{s',sa}$ is the probability of arriving in state $s'$ after taking state-action pair $(s,a)$. The matrix $P$ is column stochastic and element-wise non-negative, i.e.
$$\sum_{s' \in [S]} P_{s',sa} = 1 \;\; \forall (s,a) \in [S] \times [A], \qquad P_{s',sa} \ge 0 \;\; \forall s', s \in [S],\, a \in [A].$$
(4) $C \in \mathbb{R}^{S \times A}$ denotes the cost of each state-action pair $(s,a)$.
(5) $\gamma \in (0,1)$ denotes the discount factor.

At each time step $t$, the decision maker chooses an action $a$ based on its current state $s$. The state-action pair $(s,a)$ induces a probability distribution $P_{(\cdot),sa} \in \mathbb{R}^S$, where $P_{s',sa}$ is the probability that the decision maker arrives at $s'$ at time step $t+1$. The state-action pair $(s,a)$ also induces a cost $C_{sa}$ that must be paid by the decision maker.

At each time step, the decision maker chooses a policy that dictates the action chosen at each state $s$. We denote a policy as a function $\pi : [S] \times [A] \to \mathbb{R}_+$, where $\pi(s,a)$ denotes the probability that action $a$ is chosen at state $s$. We denote the set of all feasible policies of an MDP by $\Pi$. In our context, it suffices to consider only deterministic, stationary policies, i.e. $\pi(s,a)$ is a time-invariant function that returns $1$ for exactly one action, and $0$ for all other possible actions.

We denote the policy matrix induced by a policy $\pi$ as $M_\pi \in \mathbb{R}^{S \times SA}$, where
$$(M_\pi)_{s',sa} = \begin{cases} \pi(s,a) & s' = s, \\ 0 & s' \ne s. \end{cases}$$
Every stationary policy induces a stationary
Markov chain (El Chamie et al., 2018), given by
$P M_\pi^T$. Each stationary policy also induces a stationary cost, given by
$$C(\pi) = \sum_{i \in [S]} e_i e_i^T M_\pi (\mathbf{1}_S \otimes I_A) C^T e_i, \quad C(\pi) \in \mathbb{R}^S, \qquad (1)$$
where $e_i \in \mathbb{R}^S$ is the unit vector pointing in the $i$-th coordinate.

For an MDP $([S], [A], P, C, \gamma)$, we are interested in minimizing the discounted infinite-horizon expected cost, defined with respect to a policy $\pi$ as
$$V^\star_s = \min_{\pi \in \Pi} \mathbb{E}^\pi_s \Big\{ \sum_{t=0}^{\infty} \gamma^t C_{s_t a_t} \Big\}, \quad \forall s \in [S], \qquad (2)$$
where $\gamma \in (0,1)$ is the discount factor of future cost, $s_t$ and $a_t$ are the state and action taken at time step $t$, and $s$ is the state that the decision maker starts from at $t = 0$.
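To make the notation above concrete, the following sketch (ours, not from the paper) builds a small two-state, two-action MDP and assembles the policy matrix $M_\pi$ and the componentwise form of the stationary cost (1) for a deterministic policy; the example numbers and the discount factor are arbitrary choices for illustration.

```python
import numpy as np

S, A = 2, 2        # two states, two actions
gamma = 0.5        # arbitrary discount factor in (0, 1)

# Transition kernel P in R^{S x SA}; column (s, a) has flat index s * A + a
# and holds the distribution over next states s'.
P = np.array([[0.8, 0.3, 0.5, 0.1],
              [0.2, 0.7, 0.5, 0.9]])
assert np.allclose(P.sum(axis=0), 1.0)          # column stochastic

C = np.array([[1.0, 4.0],                       # C[s, a]: cost of pair (s, a)
              [2.0, 0.5]])

def policy_matrix(pi):
    """M_pi in R^{S x SA} with (M_pi)[s', (s, a)] = pi(s, a) if s' == s, else 0."""
    M = np.zeros((S, S * A))
    for s in range(S):
        M[s, s * A:(s + 1) * A] = pi[s]
    return M

pi = np.array([[1.0, 0.0],     # deterministic policy: action 0 in state 0,
               [0.0, 1.0]])    # action 1 in state 1
M_pi = policy_matrix(pi)

C_pi = (pi * C).sum(axis=1)    # componentwise form of the stationary cost (1)
chain = P @ M_pi.T             # stationary Markov chain P M_pi^T
assert np.allclose(chain.sum(axis=0), 1.0)      # also column stochastic
```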
The minimum expected cost $V^\star_s$ is called the optimal value function. The policy $\pi^\star$ that achieves the optimal value function is called an optimal policy. In general, $V^\star$ is unique while $\pi^\star$ is not. It is well known that the set of optimal policies always includes at least one deterministic stationary policy (Puterman, 2014, Thm 6.2.11), i.e. one for which $\pi(s,a)$ returns $1$ for exactly one action at each state $s$, and $0$ for all other possible actions.

Determining the optimal value function of a given MDP is equivalent to solving for the fixed point of the associated Bellman operator, for which a myriad of techniques exists (Puterman, 2014). We introduce the Bellman operator here and relate its fixed point to the corresponding MDP problem.

Definition 1. (Standard Bellman Operator). For a discounted infinite-horizon MDP $([S], [A], P, C, \gamma)$, its associated Bellman operator $f_C : \mathbb{R}^S \to \mathbb{R}^S$ is given component-wise by
$$\big(f_C(V)\big)_s := \min_{a \in [A]} C_{sa} + \gamma \sum_{s' \in [S]} P_{s',sa} V_{s'}, \quad \forall s \in [S].$$

The fixed point of the Bellman operator is a value function $V \in \mathbb{R}^S$ that is invariant under the operator.

Definition 2. (Fixed Point). Let $F : \mathcal{X} \to \mathcal{X}$ be an operator on the metric space $\mathcal{X}$. $V^\star \in \mathcal{X}$ is a fixed point of $F$ if it satisfies
$$V^\star = F(V^\star). \qquad (3)$$

In order to show that the Bellman operator has a unique fixed point, we consider the following operator property.

Definition 3. (Contraction Operator). Let $(\mathcal{X}, d)$ be a complete metric space. An operator $F : \mathcal{X} \to \mathcal{X}$ is a contraction operator if there exists $\gamma \in (0,1)$ such that
$$d(F(V), F(V')) \le \gamma\, d(V, V'), \quad \forall V, V' \in \mathcal{X}.$$

The Bellman operator is known to be a contraction operator on the complete metric space $(\mathbb{R}^S, \|\cdot\|_\infty)$. From the Banach fixed point theorem (Puterman, 2014), it has a unique fixed point. Because the optimal value function $V^\star$ is given by the unique fixed point of the associated Bellman operator, we use the terms optimal value function and fixed point of $f_C$ interchangeably.

In addition to obtaining $V^\star$, MDPs are also solved to determine the optimal policy, $\pi^\star$. We note that because every feasible policy $\pi$ induces a Markov chain, $\pi$ also induces a unique stationary value function $V$ which satisfies
$$V = C(\pi) + \gamma M_\pi P^T V. \qquad (4)$$
Given a feasible policy $\pi$, we can equivalently solve for the stationary value function $V$ as $V = (I - \gamma M_\pi P^T)^{-1} C(\pi)$. From this perspective, the optimal value function is the minimum vector among the finite set of stationary value functions generated by the set of all policies $\Pi$.

From the optimal value function $V^\star$, we can also derive a deterministic optimal policy from the Bellman operator as
$$\pi^\star(s,a) = \begin{cases} 1 & a = \operatorname{argmin}_{\bar a \in [A]}\, C_{s \bar a} + \gamma \sum_{s' \in [S]} P_{s',s \bar a} V^\star_{s'}, \\ 0 & \text{otherwise}, \end{cases} \quad \forall s \in [S]. \qquad (5)$$
While an optimal policy does not need to be deterministic and stationary, the optimal policy $\pi^\star$ derived from (5) will always be deterministic.

Among different algorithms to determine the fixed point of the Bellman operator, value iteration (VI) is a commonly used and simple technique in which the Bellman operator is iteratively applied until the optimal value is reached, i.e. starting from any value function $V^0 \in \mathbb{R}^S$ and $k = 1, \ldots$, we apply
$$V^{k+1}_s = \min_{a \in [A]} C_{sa} + \gamma \sum_{s' \in [S]} P_{s',sa} V^k_{s'}, \quad \forall s \in [S]. \qquad (6)$$
The iteration scheme given by (6) converges to the fixed point of the corresponding discounted infinite-horizon MDP. The stopping criterion of VI can be considered an over-approximation of the optimal value function.

Lemma 4. (Puterman, 2014, Thm 6.3.1). For any initial value function $V^0 \in \mathbb{R}^S$, let $\{V^k\}_{k \in \mathbb{N}}$ be the value function trajectory from (6). Whenever there exists $\epsilon > 0$ such that $\|V^{k+1} - V^k\|_\infty < \epsilon(1-\gamma)/(2\gamma)$, then $V^{k+1}$ is within $\epsilon/2$ of the fixed point $V^\star$, i.e. $\|V^{k+1} - V^\star\|_\infty < \epsilon/2$.

Lemma 4 connects the sequence $\{V^k\}_{k \in \mathbb{N}}$'s relative convergence to its absolute convergence towards $V^\star$ by showing that the former implies the latter. In general, stopping criteria differ for different MDP objectives (see Haddad and Monmege (2018) for recent results on stopping criteria for reachability).
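As an illustration of (6) and the stopping rule of Lemma 4, the sketch below runs value iteration on the toy MDP defined earlier and extracts the greedy deterministic policy of (5); the tolerance value is our arbitrary choice.

```python
def bellman(V, C, P, gamma):
    """Standard Bellman operator f_C of Definition 1, applied componentwise."""
    # Q[s, a] = C[s, a] + gamma * sum_{s'} P[s', s*A + a] * V[s']
    Q = C + gamma * (V @ P).reshape(C.shape)
    return Q.min(axis=1)

def value_iteration(C, P, gamma, eps=1e-6):
    """Iterate (6) until ||V^{k+1} - V^k||_inf < eps * (1 - gamma) / (2 * gamma);
    by Lemma 4, the returned vector is then within eps/2 of the fixed point V*."""
    V = np.zeros(C.shape[0])
    while True:
        V_next = bellman(V, C, P, gamma)
        if np.max(np.abs(V_next - V)) < eps * (1 - gamma) / (2 * gamma):
            return V_next
        V = V_next

V_star = value_iteration(C, P, gamma)
Q_star = C + gamma * (V_star @ P).reshape(C.shape)
pi_star = Q_star.argmin(axis=1)     # deterministic optimal policy, as in (5)
```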
3. SET-BASED BELLMAN OPERATOR

The standard Bellman operator with respect to a fixed cost parameter $C$ is well studied. Motivated by a family of MDPs corresponding to a compact set of cost parameters $\mathcal{C} \subseteq \mathbb{R}^{S \times A}$, with all other problem data remaining identical, we lift the Bellman operator to operate on sets rather than individual vectors in $\mathbb{R}^S$. For the set-based operator, we analyze its set-based domain and prove that it is a contraction operator. We also prove the existence of a unique fixed point set $\mathcal{V}^\star$ of the set-based Bellman operator and relate its properties to the fixed point of the standard Bellman operator.

We define a new metric space $(H(\mathbb{R}^S), d_H)$ based on the Banach space $(\mathbb{R}^S, \|\cdot\|_\infty)$ to serve as our set-based operator domain (Rudin et al., 1964), where $H(\mathbb{R}^S)$ is the collection of non-empty compact subsets of $\mathbb{R}^S$ equipped with the partial order: for $\mathcal{V}, \mathcal{V}' \in H(\mathbb{R}^S)$, $\mathcal{V} \preceq \mathcal{V}'$ if $\mathcal{V} \subseteq \mathcal{V}'$, i.e. if $\mathcal{V}$ is a subset of $\mathcal{V}'$. The metric $d_H$ is the Hausdorff distance (Henrikson, 1999), defined as
$$d_H(\mathcal{V}, \mathcal{V}') = \max\Big\{ \sup_{V \in \mathcal{V}} \inf_{V' \in \mathcal{V}'} \|V - V'\|_\infty,\; \sup_{V' \in \mathcal{V}'} \inf_{V \in \mathcal{V}} \|V - V'\|_\infty \Big\}. \qquad (7)$$
Since $(\mathbb{R}^S, \|\cdot\|_\infty)$ is a complete metric space, $H(\mathbb{R}^S)$ is a complete metric space with respect to $d_H$.

Lemma 5. (Henrikson, 1999, Thm 3.3). If $X$ is a complete metric space, then its induced Hausdorff metric space $(H(X), d_H)$ is a complete metric space.

On the metric space $H(\mathbb{R}^S)$, we define a set-based Bellman operator.

Definition 6. (Set-based Bellman Operator). For a family of MDP problems $([S], [A], P, \mathcal{C}, \gamma)$, where $\mathcal{C} \subseteq \mathbb{R}^{S \times A}$ is a compact set, its associated set-based Bellman operator is given by
$$F_{\mathcal{C}}(\mathcal{V}) = \mathrm{cl} \bigcup_{(C,V) \in \mathcal{C} \times \mathcal{V}} f_C(V), \quad \forall \mathcal{V} \in H(\mathbb{R}^S),$$
where $\mathrm{cl}$ is the closure operator.

As we take the union of uncountably many bounded sets, the resulting set may not be bounded, and therefore it is not immediately obvious that $F_{\mathcal{C}}(\mathcal{V})$ maps into the metric space $H(\mathbb{R}^S)$. We show this is true in Proposition 7.
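The Hausdorff distance (7) and the operator of Definition 6 can be explored numerically on finite samples. The sketch below is our simplification: it stands in for $F_{\mathcal{C}}$ by applying $f_C$ over finitely many sampled cost matrices and value functions, whereas the true operator ranges over the full compact sets and takes a closure.

```python
def hausdorff(U, W):
    """Hausdorff distance (7) between finite sets of vectors (rows), sup-norm."""
    D = np.max(np.abs(U[:, None, :] - W[None, :, :]), axis=2)  # pairwise distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def set_bellman(Vs, Cs, P, gamma):
    """Sampled stand-in for F_C(V): the image of f_C over all (C, V) pairs."""
    return np.array([bellman(V, Ck, P, gamma) for Ck in Cs for V in Vs])
```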
Proposition 7. If $\mathcal{C}$ is compact, then $F_{\mathcal{C}}(\mathcal{V}) \in H(\mathbb{R}^S)$ for all $\mathcal{V} \in H(\mathbb{R}^S)$.

Proof. For a non-empty set $\mathcal{A}$ of some finite-dimensional real vector space, define its diameter as $\mathrm{diam}(\mathcal{A}) = \sup_{x,y \in \mathcal{A}} \|x - y\|_\infty$. The diameter of any compact set in a metric space is bounded.

We take any non-empty compact set $\mathcal{V} \in H(\mathbb{R}^S)$. As $F_{\mathcal{C}}(\mathcal{V}) \subset \mathbb{R}^S$, it suffices to prove that $F_{\mathcal{C}}(\mathcal{V})$ is closed and bounded. Closedness is guaranteed by the closure operator. A subset of a metric space is bounded iff its closure is bounded. Hence, to prove boundedness, it suffices to prove that $\mathrm{diam}\big( \cup_{(C,V) \in \mathcal{C} \times \mathcal{V}} f_C(V) \big) < +\infty$. Any two cost-value function pairs $(C, V), (C', V') \in \mathcal{C} \times \mathcal{V}$ must satisfy
$$f_C(V) - f_{C'}(V') = \big( f_C(V) - f_{C'}(V) \big) + \big( f_{C'}(V) - f_{C'}(V') \big),$$
where the norm of the second term, $\|f_{C'}(V) - f_{C'}(V')\|_\infty$, is upper bounded by $\|V - V'\|_\infty$ due to the contraction property of $f_{C'}$. To bound the first term, we note that for any two vectors $a, b \in \mathbb{R}^S$, $\|a - b\|_\infty = \max\{\max(a - b), \max(b - a)\}$. Letting $\pi$ be the optimal policy of $f_C(V)$, and letting $\nu(\pi)$ and $\nu'(\pi)$ denote the stationary costs (1) under $C$ and $C'$ respectively,
$$\max(f_{C'}(V) - f_C(V)) \le \max\big( \nu'(\pi) + \gamma M_\pi P^T V - \nu(\pi) - \gamma M_\pi P^T V \big) \le \max(\nu'(\pi) - \nu(\pi)) \le \sum_{i \in [S]} \|e_i^T\|_\infty \|M_\pi\|_\infty \|\mathbf{1}_S \otimes I_A\|_\infty \|(C' - C)^T\|_\infty \|e_i\|_\infty.$$
Since $\|\mathbf{1}_S \otimes I_A\|_\infty = \|e_i\|_\infty = \|e_i^T\|_\infty = \|M_\pi\|_\infty = 1$ for any $\pi \in \Pi$, $\max(f_{C'}(V) - f_C(V)) \le S\, \mathrm{diam}(\mathcal{C}^T)$. The bound $\max(f_C(V) - f_{C'}(V)) \le S\, \mathrm{diam}(\mathcal{C}^T)$ is derived similarly. Therefore, $\|f_C(V) - f_{C'}(V')\|_\infty \le S\, \mathrm{diam}(\mathcal{C}^T) + \mathrm{diam}(\mathcal{V})$. Since this holds for all $(C,V), (C',V') \in \mathcal{C} \times \mathcal{V}$,
$$\mathrm{diam}\Big( \cup_{(C,V) \in \mathcal{C} \times \mathcal{V}} f_C(V) \Big) \le S\, \mathrm{diam}(\mathcal{C}^T) + \mathrm{diam}(\mathcal{V}) < +\infty,$$
as $\mathcal{C}$ and $\mathcal{V}$ are bounded. ✷

Proposition 7 shows that $F_{\mathcal{C}}$ is an operator from $H(\mathbb{R}^S)$ to $H(\mathbb{R}^S)$. Having established its range space, we can draw many parallels between $F_{\mathcal{C}}$ and $f_C$. Similar to the existence of a unique fixed point $V^\star$ of $f_C$, we consider whether a fixed point set of $F_{\mathcal{C}}$, which satisfies $F_{\mathcal{C}}(\mathcal{V}^\star) = \mathcal{V}^\star$, exists, and whether it is unique. To take the comparison further, since $V^\star$ is the optimal value function for an MDP problem defined by $([S], [A], P, C, \gamma)$, how does $\mathcal{V}^\star$ relate to the family of optimal solutions that corresponds to the MDP family $([S], [A], P, \mathcal{C}, \gamma)$?

To prove the unique existence of $\mathcal{V}^\star$, we utilize the Banach fixed point theorem (Puterman, 2014), which states that a unique fixed point must exist for every contraction operator on a complete metric space. First, we show that $F_{\mathcal{C}}$ is a contraction, as defined in Definition 3, on the complete metric space $(H(\mathbb{R}^S), d_H)$.
Proposition 8. For any $\mathcal{V} \in H(\mathbb{R}^S)$ and compact (closed and bounded) $\mathcal{C} \subset \mathbb{R}^{S \times A}$, $F_{\mathcal{C}}$ is a contraction operator under the Hausdorff distance.

Proof.
Consider $\mathcal{V}, \bar{\mathcal{V}} \in H(\mathbb{R}^S)$. To see that $F_{\mathcal{C}}$ is a contraction, we need to show
$$\sup_{V \in F_{\mathcal{C}}(\mathcal{V})} \inf_{\bar V \in F_{\mathcal{C}}(\bar{\mathcal{V}})} \|V - \bar V\|_\infty \le \gamma\, d_H(\mathcal{V}, \bar{\mathcal{V}}), \qquad (8)$$
$$\sup_{V \in F_{\mathcal{C}}(\bar{\mathcal{V}})} \inf_{\bar V \in F_{\mathcal{C}}(\mathcal{V})} \|V - \bar V\|_\infty \le \gamma\, d_H(\mathcal{V}, \bar{\mathcal{V}}). \qquad (9)$$
First, we note that taking the sup (inf) of a continuous function over a set $\mathcal{A}$ is equivalent to taking the sup (inf) over the closure of $\mathcal{A}$. Let $G_{\mathcal{C}}(\mathcal{V}) = \cup_{(C,V) \in \mathcal{C} \times \mathcal{V}} f_C(V)$, so that $\mathrm{cl}\, G_{\mathcal{C}}(\mathcal{V}) = F_{\mathcal{C}}(\mathcal{V})$. Then, due to continuity of norms (Rudin et al., 1964, Thm 4.16),
$$\sup_{V \in F_{\mathcal{C}}(\mathcal{V})} \inf_{\bar V \in F_{\mathcal{C}}(\bar{\mathcal{V}})} \|V - \bar V\|_\infty = \sup_{V \in G_{\mathcal{C}}(\mathcal{V})} \inf_{\bar V \in G_{\mathcal{C}}(\bar{\mathcal{V}})} \|V - \bar V\|_\infty.$$
Therefore, it suffices to prove
$$\sup_{f_C(V) \in G_{\mathcal{C}}(\mathcal{V})} \inf_{f_{\bar C}(\bar V) \in G_{\mathcal{C}}(\bar{\mathcal{V}})} \|f_C(V) - f_{\bar C}(\bar V)\|_\infty \le \gamma\, d_H(\mathcal{V}, \bar{\mathcal{V}}),$$
$$\sup_{f_{\bar C}(\bar V) \in G_{\mathcal{C}}(\bar{\mathcal{V}})} \inf_{f_C(V) \in G_{\mathcal{C}}(\mathcal{V})} \|f_C(V) - f_{\bar C}(\bar V)\|_\infty \le \gamma\, d_H(\mathcal{V}, \bar{\mathcal{V}}).$$
For any $V \in \mathcal{V}$, $C \in \mathcal{C}$,
$$\inf_{(\bar C, \bar V) \in \mathcal{C} \times \bar{\mathcal{V}}} \|f_C(V) - f_{\bar C}(\bar V)\|_\infty \qquad (10a)$$
$$= \inf_{(\bar C, \bar V) \in \mathcal{C} \times \bar{\mathcal{V}}} \|C(\pi) + \gamma M_\pi P^T V - (\bar C(\bar\pi) + \gamma M_{\bar\pi} P^T \bar V)\|_\infty \qquad (10b)$$
$$\le \inf_{\bar V \in \bar{\mathcal{V}}} \|C(\bar\pi) + \gamma M_{\bar\pi} P^T V - (C(\bar\pi) + \gamma M_{\bar\pi} P^T \bar V)\|_\infty \qquad (10c)$$
$$\le \inf_{\bar V \in \bar{\mathcal{V}}} \|\gamma M_{\bar\pi} P^T (V - \bar V)\|_\infty \le \gamma \inf_{\bar V \in \bar{\mathcal{V}}} \|V - \bar V\|_\infty, \qquad (10d)$$
where in (10b), $\pi$ corresponds to the optimal policy for the MDP $([S],[A],P,C,\gamma)$ and $\bar\pi$ corresponds to the optimal policy for the MDP $([S],[A],P,\bar C,\gamma)$. In (10c), we replaced $M_\pi$ by $M_{\bar\pi}$ by noting that $\pi$ is optimal, so $\bar\pi$ must result in a larger value function (similar to the proof of Proposition 7), and we upper bounded the infimum over $\mathcal{C}$ by the choice $\bar C = C \in \mathcal{C}$. In (10d), we used the fact that $\|M_{\bar\pi} P^T\|_\infty \le 1$. Taking the sup over $G_{\mathcal{C}}(\mathcal{V})$ and $G_{\mathcal{C}}(\bar{\mathcal{V}})$,
$$\sup_{V \in G_{\mathcal{C}}(\mathcal{V})} \inf_{\bar V \in G_{\mathcal{C}}(\bar{\mathcal{V}})} \|V - \bar V\|_\infty \le \gamma \sup_{V \in \mathcal{V}} \inf_{\bar V \in \bar{\mathcal{V}}} \|V - \bar V\|_\infty,$$
$$\sup_{\bar V \in G_{\mathcal{C}}(\bar{\mathcal{V}})} \inf_{V \in G_{\mathcal{C}}(\mathcal{V})} \|V - \bar V\|_\infty \le \gamma \sup_{\bar V \in \bar{\mathcal{V}}} \inf_{V \in \mathcal{V}} \|V - \bar V\|_\infty.$$
Therefore, $d_H(F_{\mathcal{C}}(\mathcal{V}), F_{\mathcal{C}}(\bar{\mathcal{V}})) \le \gamma\, d_H(\mathcal{V}, \bar{\mathcal{V}})$. Since $\gamma \in (0,1)$, $F_{\mathcal{C}}$ is a contraction operator on $H(\mathbb{R}^S)$. ✷

The contraction property of $F_{\mathcal{C}}$ implies that repeated application of the operator to any $\mathcal{V} \in H(\mathbb{R}^S)$ produces sets that are closer and closer, in the Hausdorff distance, to a fixed point set. It is then natural to ask whether there is a unique set to which all sequences $F^k_{\mathcal{C}}(\mathcal{V})$ converge.
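Under the same sampling caveat as before, the contraction can be checked numerically: iterating the sampled operator from two different initial sets, their Hausdorff distance should shrink by at least a factor of $\gamma$ per step. The uncertainty radius and set sizes below are arbitrary choices.

```python
rng = np.random.default_rng(0)

# A finite sample of the compact cost set C: perturbations of the nominal C.
Cs = [C + 0.5 * rng.uniform(-1.0, 1.0, size=C.shape) for _ in range(3)]

Vs_a = rng.normal(size=(4, S))     # two arbitrary finite "sets" in H(R^S)
Vs_b = rng.normal(size=(4, S))
for k in range(4):
    print(k, hausdorff(Vs_a, Vs_b))            # decays at rate at most gamma
    Vs_a = set_bellman(Vs_a, Cs, P, gamma)
    Vs_b = set_bellman(Vs_b, Cs, P, gamma)
```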
Theorem 9. There exists a unique fixed point set $\mathcal{V}^\star$ of the set-based Bellman operator $F_{\mathcal{C}}$ defined in Definition 6, such that $F_{\mathcal{C}}(\mathcal{V}^\star) = \mathcal{V}^\star$, and $\mathcal{V}^\star$ is a closed and bounded subset of $\mathbb{R}^S$. Furthermore, for any iteration starting from an arbitrary $\mathcal{V}^0 \in H(\mathbb{R}^S)$ with $\mathcal{V}^{k+1} = F_{\mathcal{C}}(\mathcal{V}^k)$, the sequence converges in the Hausdorff sense, i.e. $\lim_{k \to \infty} d_H(F_{\mathcal{C}}(\mathcal{V}^k), \mathcal{V}^\star) = 0$.

Proof.
As shown in Proposition 8, $F_{\mathcal{C}}$ is a contraction operator. From the Banach fixed point theorem (Puterman, 2014, Thm 6.2.3), there exists a unique fixed point $\mathcal{V}^\star$, and any arbitrary $\mathcal{V}^0 \in H(\mathbb{R}^S)$ generates a sequence $\{F_{\mathcal{C}}(\mathcal{V}^k)\}_{k \in \mathbb{N}}$ that converges to the fixed point. ✷

The fixed point $V^\star$ of the Bellman operator $f_C$ on the metric space $\mathbb{R}^S$ corresponds to the optimal value function of the MDP associated with cost parameter $C$. Because there is no direct association of an MDP problem with the set-based Bellman operator $F_{\mathcal{C}}$, we cannot claim the same for $\mathcal{V}^\star$. However, $\mathcal{V}^\star$ does have many interesting properties on $H(\mathbb{R}^S)$, in parallel to the operator $f_C$ on $\mathbb{R}^S$, especially in terms of the value iteration method (6). Suppose that instead of a fixed cost parameter, at each iteration $k$ a cost $C^k$ is randomly chosen from the compact set of costs, $C^k \in \mathcal{C}$; it is then interesting to ask whether $\mathcal{V}^\star$ contains all the limit points of $\lim_k f_{C^k}(V^k)$. Indeed, we can infer from Theorem 9 that the sequence $\{V^k\}$ converges to $\mathcal{V}^\star$ under the Hausdorff metric. Furthermore, even when $V^k$ itself does not converge, it must converge to the set $\mathcal{V}^\star$ under the Hausdorff metric, i.e. $\lim_{k \to \infty} \inf_{V \in \mathcal{V}^\star} \|V^k - V\|_\infty = 0$.
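The sketch below (again on the sampled stand-in, with our arbitrary iteration counts) illustrates this convergence: a value iteration trajectory driven by randomly drawn costs $C^k \in \mathcal{C}$ ends up close to the set obtained by iterating the set-based operator, even though the trajectory itself need not converge.

```python
# Value iteration (6) with a cost drawn at random from the sampled set each step.
V = np.zeros(S)
for k in range(200):
    V = bellman(V, Cs[rng.integers(len(Cs))], P, gamma)

# Approximate V* by iterating the sampled set-based operator from a singleton.
Vs = np.zeros((1, S))
for _ in range(8):
    Vs = set_bellman(Vs, Cs, P, gamma)

# inf_{W in V*} ||V - W||_inf: small, up to sampling and truncation error.
print(np.min(np.max(np.abs(Vs - V), axis=1)))
```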
4. CONCLUSION

We summarize our results on the set-based Bellman operator: for a compact cost function set $\mathcal{C}$, the iterates of $F_{\mathcal{C}}$ converge to a unique compact set $\mathcal{V}^\star$, which contains all the fixed points of $f_C$ for fixed $C \in \mathcal{C}$. Furthermore, $\mathcal{V}^\star$ also contains the limit points of $f_{C^k}(V^k)$ for any $\{C^k\}_{k \in \mathbb{N}} \subseteq \mathcal{C}$, $V^0 \in \mathbb{R}^S$, given that $\lim_{k \to \infty} V^k$ converges. Even if the limit does not exist, $V^k$ must asymptotically converge to $\mathcal{V}^\star$ in the Hausdorff sense. Future work includes extending the uncertainty analysis to consider uncertainty in the transition kernel, to fully capture learning in a general stochastic game.

REFERENCES

Abbad, M. and Filar, J.A. (1992). Perturbation and stability theory for Markov control problems. IEEE Trans. Autom. Control.
Açıkmeşe, B. and Bayard, D.S. (2012). A Markov chain approach to probabilistic swarm guidance. In Amer. Control Conf., 6300–6307. IEEE.
Altman, E. and Gaitsgory, V.A. (1993). Stability and singular perturbations in constrained Markov decision problems. IEEE Trans. Autom. Control, 38(6), 971–975.
Ashok, P., Chatterjee, K., Daca, P., Křetínský, J., and Meggendorfer, T. (2017). Value iteration for long-run average reward in Markov decision processes. In Int. Conf. Comput. Aided Verification, 201–221. Springer.
Bielecki, T.R. and Filar, J.A. (1991). Singularly perturbed Markov control problem: Limiting average cost. Ann. Op. Res., 28(1), 153–168.
Delage, E. and Mannor, S. (2010). Percentile optimization for Markov decision processes with parameter uncertainty. Op. Res., 58(1), 203–213.
Demir, N., Eren, U., and Açıkmeşe, B. (2015). Decentralized probabilistic density control of autonomous swarms with safety. Auton. Robots, 39(4), 537–554.
Eisentraut, J., Křetínský, J., and Rotar, A. (2019). Stopping criteria for value and strategy iteration on concurrent stochastic reachability games. arXiv preprint arXiv:1909.08348.
El Chamie, M., Yu, Y., Açıkmeşe, B., and Ono, M. (2018). Controlled Markov processes with safety state constraints. IEEE Trans. Autom. Control, 64(3), 1003–1018.
Filar, J. and Vrieze, K. (2012). Competitive Markov Decision Processes. Springer Science & Business Media.
Givan, R., Leach, S., and Dean, T. (2000). Bounded-parameter Markov decision processes. Artif. Intell., 122(1-2), 71–109.
Haddad, S. and Monmege, B. (2018). Interval iteration algorithm for MDPs and IMDPs. Theor. Comput. Sci., 735, 111–131.
Henrikson, J. (1999). Completeness and total boundedness of the Hausdorff metric. MIT Undergraduate J. Math.
Li, S.H.Q., Yu, Y., Calderone, D., Ratliff, L., and Açıkmeşe, B. (2019). Tolling for constraint satisfaction in Markov decision process congestion games. In Amer. Control Conf., 1238–1243. IEEE.
Puterman, M.L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
Rudin, W. et al. (1964). Principles of Mathematical Analysis. McGraw-Hill, New York.