Infinite Horizon Average Cost Dynamic Programming Subject to Total Variation Distance Ambiguity
Ioannis Tzortzis, Charalambos D. Charalambous, Themistoklis Charalambous
Ioannis Tzortzis and Charalambos D. Charalambous: Department of Electrical and Computer Engineering, University of Cyprus (UCY), Nicosia, Cyprus ([email protected]).
Themistoklis Charalambous: Department of Signals and Systems, Chalmers University of Technology, Gothenburg, Sweden ([email protected]).

Abstract.
We analyze the infinite horizon minimax average cost Markov Control Model (MCM) for a class of controlled process conditional distributions which belong to a ball, with respect to the total variation distance metric, centered at a known nominal controlled conditional distribution with radius R ∈ [0, 2], in which the minimization is over the control strategies and the maximization is over the conditional distributions. Upon performing the maximization, a dynamic programming equation is obtained which includes, in addition to the standard terms, the oscillator semi-norm of the cost-to-go. First, the dynamic programming equation is analyzed for finite state and control spaces. We show that if the nominal controlled process distribution is irreducible, then for every stationary Markov control policy the maximizing conditional distribution of the controlled process is also irreducible for R ∈ [0, R_max]. Second, the generalized dynamic programming is analyzed for Borel spaces. We derive necessary and sufficient conditions for any control strategy to be optimal. Through our analysis, new dynamic programming equations and new policy iteration algorithms are derived. The main feature of the new policy iteration algorithms (which are applied for finite alphabet spaces) is that the policy evaluation and policy improvement steps are performed using the maximizing conditional distribution, which is obtained via a water-filling solution. Finally, the application of the new dynamic programming equations and the corresponding policy iteration algorithms is shown via illustrative examples.

Key words.
Stochastic Control, Markov Control Models, Minimax, Dynamic Programming, Average Cost, Infinite Horizon, Total Variation Distance, Policy Iteration
AMS subject classifications.
1. Introduction.
The infinite horizon average cost per unit-time discrete-time Markov Control Model (MCM) with deterministic strategies is analysed in an anthology of papers [1-4, 18]. In such MCMs, the corresponding cost-to-go and the dynamic programming recursions depend on the conditional distribution of the underlying controlled process [5]. This means that any ambiguity in the controlled process conditional distribution will affect the optimality and robustness of the optimal decision strategies.

In this paper, we investigate the effects of any ambiguity in the controlled process conditional distribution on the cost-to-go and dynamic programming. We model the ambiguity in the controlled conditional distributions by a ball with respect to the total variation distance metric, centered at a known nominal controlled conditional distribution with radius R ∈ [0, 2]. Then, we re-formulate the infinite horizon average cost MCM using minimax optimization techniques, in which the control strategy seeks to minimize the pay-off while the conditional distribution, from the class of the total variation distance ball, seeks to maximize it.

We begin our analysis by first considering MCMs defined on finite state and control spaces. By employing certain results from [7], we obtain the characterization of the maximizing conditional distribution and the corresponding dynamic programming equation. The main feature of the maximizing conditional distribution is its characterization via a water-filling solution, which is similar in spirit to extremum problems encountered in information theory, such as channel capacity and lossy data compression [8]. This leads to a dynamic programming equation which includes in its right-hand side, in addition to the standard terms, the oscillator semi-norm of the cost-to-go or value function. We show that if the nominal controlled process distribution is irreducible, then for every stationary Markov control policy the maximizing conditional distribution of the controlled process is also irreducible, and optimal control strategies exist, for R ∈ [0, R_max]. Moreover, for this range of R, we derive a new policy iteration algorithm.

Subsequently, we consider general Borel spaces, we invoke a pair of dynamic programming equations (called generalized), and we derive necessary and sufficient conditions of optimality based on the concept of canonical triplets [9, 12, 16, 17]. This treatment characterizes optimal strategies for any ball of radius R ∈ [0, 2]. The main feature of the corresponding policy iteration algorithm (which is applied for finite alphabet spaces) is that the policy evaluation and policy improvement steps are performed using the maximizing conditional distribution.

The remainder of the paper is organized as follows. In Section 1.1, we introduce the classical infinite horizon dynamic programming equation of the MCM with an average cost per unit-time optimality criterion, and we briefly discuss the main results derived in the paper.
In Section 2, we give some preliminary results concerning the maximization of a linear functional subject to total variation distance. In Section 3, we study the infinite horizon average cost Markov decision problem for finite state and control spaces, and we derive a new dynamic programming recursion and the corresponding policy iteration algorithm. In Section 4, we consider general Borel spaces, and we investigate the infinite horizon average cost Markov decision problem using the generalized dynamic programming equations. We also introduce a generalized policy iteration algorithm when the state and control spaces are of finite cardinality. In Section 5, we present two examples which illustrate the implications of the new dynamic programming recursions on the corresponding policy iteration algorithms.

1.1. Problem Formulation and Main Results. In this section, we describe the main results obtained in the paper with respect to the existing literature. Since we treat finite alphabet spaces and Borel spaces, the formulation below is introduced for Borel spaces.
An infinite horizon MCM with deterministic strategies is a five-tuple

(1.1)  ( X, U, {U(x) : x ∈ X}, {Q(dz|x,u) : (x,u) ∈ X × U}, f )

consisting of the following.
a) State Space. A complete separable metric space (called a Polish space) X, which models the state space of the controlled random process {x_k ∈ X : k ∈ N}, N ≜ {0, 1, . . .}.
b) Control or Action Space. A Polish space U, which models the control or action set of the control random process {u_k ∈ U : k ∈ N}.
c) Feasible Controls or Actions. A family {U(x) : x ∈ X} of non-empty measurable subsets U(x) of U, where U(x) denotes the set of feasible controls or actions when the controlled process is in state x ∈ X, and the feasible state-action pairs are measurable subsets of X × U, defined by

(1.2)  K ≜ { (x,u) : x ∈ X, u ∈ U(x) }.

d) Controlled Process Distribution. A conditional distribution or stochastic kernel Q(dz|x,u) on X given (x,u) ∈ K ⊆ X × U, which corresponds to the controlled process transition probability distribution.
e) One-Stage Cost. A measurable function f : K → [0, ∞], called the one-stage cost, such that f(x, ·) does not take the value +∞ for each x ∈ X.

To ensure the existence of measurable controls we make the following assumption.

ASSUMPTION 1.1 ([12]). K contains the graph of a measurable function from X to U; that is, there is a measurable function ϕ : X → U such that ϕ(x) ∈ U(x) for all x ∈ X. The set of all such functions, denoted by F, are called selectors of the multifunction x → U(x).

We equip the spaces X and U with the natural σ-algebras B(X) and B(U), respectively. For any measurable spaces (X, B(X)), (U, B(U)), we denote the set of stochastic kernels on (X, B(X)) conditioned on (U, B(U)) by Q(X|U), and we denote the set of probability distributions on (X, B(X)) by M(X). Next, we give the definition of deterministic stationary Markov control policies.

DEFINITION 1.2.
A deterministic stationary Markov control policy is a measurable function (selector) g : X → U such that g(x_t) ∈ U(x_t), ∀x_t ∈ X, t = 0, 1, . . . . The set of such deterministic stationary Markov policies is denoted by G_SM, and the set of all deterministic control policies (i.e., non-stationary, non-Markov) is denoted by G.

Define the n-stage expected cost, for a fixed x_0 = x, by

(1.3)  J_n^o(g, x) ≜ E_x^g { Σ_{k=0}^{n-1} f(x_k, u_k) },

where E_x^g{·} indicates the dependence of the expectation operation on the policy g ∈ G and x_0 = x. Then, the average cost per unit-time when policy g ∈ G is used, given x_0 = x, is defined by

(1.4)  J^o(g, x) ≜ lim sup_{n→∞} (1/n) J_n^o(g, x).

The Markov Control Problem (MCP) is to find a control policy g* ∈ G such that

(1.5)  J^o(g*, x) ≜ inf_{g∈G} J^o(g, x) = J^{o,*}(x), ∀x ∈ X.

For finite cardinality spaces (X, U) it is known [12, 14, 16, 20] that if f is bounded and for all stationary Markov control policies g ∈ G_SM the transition probability matrix Q(z|x,u) is irreducible (that is, all stationary policies have at most one recurrent class; some authors use the term multichain instead), then there exist a solution V^o : X → R and a constant (independent of x ∈ X) J^{o,*} ∈ R such that (J^{o,*}, V^o(x)) is the solution of the dynamic programming equation of the infinite-horizon MCP (1.5),

(1.6)  J^{o,*} + V^o(x) = inf_{u∈U(x)} { f(x,u) + Σ_{z∈X} Q(z|x,u) V^o(z) },

from which the existence of an optimal policy g* ∈ G_SM is obtained. However, if the irreducibility condition is not satisfied (i.e., there is more than one recurrent class), then the dynamic programming equation (1.6) may not be sufficient to give the optimal policy and the minimum cost [12, 16]. In this case, (1.6) is replaced by the following pair:

(1.7a)  J^{o,*}(x) = inf_{u∈U(x)} { Σ_{z∈X} Q(z|x,u) J^{o,*}(z) }
(1.7b)  J^{o,*}(x) + V^o(x) = inf_{u∈U(x)} { f(x,u) + Σ_{z∈X} Q(z|x,u) V^o(z) }.
We refer to (1.7a) as the first general dynamic programming equation and to (1.7b) as the second general dynamic programming equation. Note that the pair of generalized dynamic programming equations (1.7a)-(1.7b) solves the MCP (1.5) without imposing irreducibility of the conditional distribution of the controlled process [12, 16]. Similar results are also known for Borel spaces, by replacing the summations in (1.6) and (1.7) by integrals with respect to the conditional distribution, while the existence of optimal policies is characterized via canonical triplets [12].

Since the MCP (1.5) and the dynamic programming equation (1.6) are functionals of the conditional distribution of the controlled process, the optimal strategies g ∈ G are obtained based on the assumption of having an accurate knowledge of the conditional distribution Q(dz|x,u). Hence, any ambiguity or mismatch of Q(dz|x,u) from the true conditional distribution will affect the optimality of the strategies. Motivated by this implication, in this paper we consider the problem discussed next.

Recall the total variation distance between two probability measures, ‖·‖_TV : M(X) × M(X) → [0, ∞], defined by

‖α − β‖_TV ≜ sup_{P∈P(X)} Σ_{F_i∈P} |α(F_i) − β(F_i)|,  α, β ∈ M(X),

where P(X) denotes the collection of all finite partitions of X.

In this paper, we will derive the analogues of (1.6) and (1.7a)-(1.7b) for the class of conditional distributions of the controlled process Q(dz|x,u), (x,u) ∈ K, which are stationary and belong to a ball, with respect to the total variation distance metric, centered at a nominal controlled process distribution Q^o(dz|x,u), (x,u) ∈ K, having radius R ∈ [0, 2] (specifically, {Q(dz|x,u) : ‖Q(·|x,u) − Q^o(·|x,u)‖_TV ≤ R}). The precise definition is the following.

DEFINITION 1.3.
For each g ∈ G_SM, the nominal controlled process {x_t^g : t = 0, 1, . . .} has a stationary conditional distribution defined by

Prob(x_t ∈ A | x_{t−1}, u_{t−1}) ≜ Q^o(A | x_{t−1}, u_{t−1}), ∀A ∈ B(X), t = 0, 1, . . . ,

where Q^o(·|·,·) ∈ Q(X|K). Given the nominal controlled process and R ∈ [0, 2], the true controlled process conditional distributions are stationary and belong to the total variation distance ball defined by

(1.8)  B_R(Q^o)(x,u) ≜ { Q(·|x,u) ∈ M(X) : ‖Q(·|x,u) − Q^o(·|x,u)‖_TV ≤ R }, (x,u) ∈ K.

Next, we consider the analogue of (1.5). Define the n-stage expected cost by

(1.9)  J_n(g, Q, x) ≜ E_x^g { Σ_{k=0}^{n-1} f(x_k, u_k) }

and the corresponding maximizing n-stage expected cost by

(1.10)  J_n(g, x) ≜ sup_{Q(·|x,u)∈B_R(Q^o)(x,u)} E_x^g { Σ_{k=0}^{n-1} f(x_k, u_k) }.

Then, the average cost per unit-time when policy g ∈ G is used, given x_0 = x, is defined by

(1.11)  J(g, x) ≜ lim sup_{n→∞} (1/n) J_n(g, x).

The minimax MCP subject to the ambiguity defined by (1.8) is to choose a control policy g* ∈ G such that

(1.12)  J(g*, x) ≜ inf_{g∈G} J(g, x) = J*(x), ∀x ∈ X.

A conditional distribution Q* that satisfies (1.11) (see also (1.10)) is called a maximizing conditional distribution, a policy g* that satisfies (1.12) is called an average cost optimal policy, and the corresponding J*(·) is the minimum cost or value function of the minimax MCP. Next, we introduce an assumption for the minimax MCP defined by (1.12).

ASSUMPTION 1.4.
(a) The map f : X × U → R is bounded, continuous and non-negative.
(b) The set U(x) is compact for all x ∈ X.
(c) The map Q^o(A|·,·) is continuous on K for every Borel set A.

Note that it is possible to relax Assumption 1.4; for example, f(x,·) can be replaced by a lower semi-continuous function on U(x) for every x ∈ X, which is non-negative (see [12] for several relaxations). We derive the following results.

Dynamic Programming Equations for Finite Alphabet Spaces.
In Section 3, we assume that (X, U) are of finite cardinality and we show that if, for all stationary Markov control policies g ∈ G_SM and for a given total variation parameter R ∈ [0, 2], the maximizing transition probability matrix Q*(g) is irreducible, then the dynamic programming equation corresponding to the minimax MCP (1.12) is given by

(1.13)  J* + V(x) = min_{u∈U} { f(x,u) + Σ_{z∈X} Q^o(z|x,u) V(z) + (R/2)( sup_{z∈X} V(z) − inf_{z∈X} V(z) ) }.

The new term entering the right-hand side of (1.13) is the oscillator semi-norm of the future pay-off.
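Equation (1.13) differs from the classical Bellman recursion only through the oscillation term, so it is straightforward to evaluate numerically. The following is a minimal sketch for finite spaces: the kernel Qo, the cost f, the radius R = 0.4 and the relative value-iteration loop are all hypothetical illustrations chosen for this note (not the authors' implementation), included only to show how the extra term enters.

```python
import numpy as np

def robust_bellman(V, f, Qo, R):
    """One application of the right-hand side of (1.13): for each state,
    minimize over u the nominal expectation plus (R/2)(max V - min V)."""
    osc = V.max() - V.min()                       # oscillator semi-norm of V
    values = f + np.einsum('uxz,z->xu', Qo, V) + 0.5 * R * osc
    return values.min(axis=1), values.argmin(axis=1)

# hypothetical two-state, two-control model; each row Qo[u, x, :] sums to one
f = np.array([[1.0, 2.0], [0.5, 0.0]])
Qo = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.5, 0.5], [0.7, 0.3]]])
V = np.zeros(2)
for _ in range(200):                              # relative value iteration
    V, g = robust_bellman(V, f, Qo, R=0.4)
    V -= V.min()                                  # keep the iterates bounded
print(V, g)
```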
Generalized Dynamic Programming Equations for Borel Spaces.
In Section 4, we assume that (X, U) are Borel spaces, and we utilize the concept of canonical triplets to establish existence of optimal strategies via the following generalized dynamic programming equations:

(1.14a)  J*(x) = inf_{u∈U(x)} { ∫_X Q^o(dz|x,u) J*(z) + (R/2)( sup_{z∈X} J*(z) − inf_{z∈X} J*(z) ) }
(1.14b)  J*(x) + V(x) = inf_{u∈U(x)} { f(x,u) + ∫_X Q^o(dz|x,u) V(z) + (R/2)( sup_{z∈X} V(z) − inf_{z∈X} V(z) ) }.

Since Borel spaces include finite alphabet spaces, if the irreducibility condition is violated, then the existence of optimal strategies is characterized by the finite alphabet version of (1.14a)-(1.14b). In addition, we obtain the following.
1. We characterize the maximizing conditional distribution corresponding to the supremum in (1.11).
2. We derive new policy iteration algorithms (applied for finite alphabet spaces), in which the policy evaluation and policy improvement steps are performed using the maximizing conditional distribution obtained under the total variation distance ambiguity constraint.

Finally, in Section 5 we present illustrative examples based on (1.13) and (1.14).
2. Maximization over Total Variation Distance Ambiguity.
In this section, we recall certain results from [7] concerning the characterization of the extremum problem of maximizing a linear functional subject to total variation distance ambiguity. We use these results to derive the new dynamic programming equations.

Let (X, d_X) denote a complete separable metric space (a Polish space), and (X, B(X)) the corresponding measurable space, in which B(X) is the σ-algebra generated by the open sets in X. Define the spaces

BC(X) ≜ { bounded continuous functions ℓ : X → R : ‖ℓ‖ ≜ sup_{x∈X} |ℓ(x)| < ∞ },
BC^+(X) ≜ { ℓ ∈ BC(X) : ℓ ≥ 0 }.

For ℓ ∈ BC^+(X) and µ ∈ M(X) fixed, we have

(2.1)  L(ν*) ≜ sup_{‖ν−µ‖_TV ≤ R} ∫_X ℓ(x) ν(dx) = (R/2){ sup_{x∈X} ℓ(x) − inf_{x∈X} ℓ(x) } + ∫_X ℓ(x) µ(dx),

where R ∈ [0, 2], ν* satisfies the constraint ‖ξ*‖_TV ≜ ‖ν* − µ‖_TV = R, it is normalized, ν*(X) = 1, and ν*(A) ∈ [0, 1] for any A ∈ B(X). If X is a compact set then, since ℓ(·) ∈ BC^+(X), both the supremum and the infimum are attained and they are finite. Define the support sets

X^0 ≜ { x ∈ X : ℓ(x) = sup{ℓ(x) : x ∈ X} ≡ ℓ_max },
X_0 ≜ { x ∈ X : ℓ(x) = inf{ℓ(x) : x ∈ X} ≡ ℓ_min }

(we adopt the standard conventions: the infimum (supremum) of an empty set is +∞ (−∞)), and let X̄^0, X̄_0 denote their closures (the closure of a set consists of all its points plus its limit points). Then, the pay-off L(ν*) can be written as

(2.2)  L(ν*) = ∫_{X̄^0} ℓ_max ν*(dx) + ∫_{X̄_0} ℓ_min ν*(dx) + ∫_{X \ X̄^0 ∪ X̄_0} ℓ(x) µ(dx),

and the optimal distribution ν* ∈ M(X), which satisfies the total variation constraint, is given by

(2.3a)  ∫_{X̄^0} ν*(dx) = µ(X̄^0) + R/2 ∈ [0, 1]
(2.3b)  ∫_{X̄_0} ν*(dx) = µ(X̄_0) − R/2 ∈ [0, 1]
(2.3c)  ν*(A) = µ(A), ∀A ⊆ X \ X̄^0 ∪ X̄_0.

Note that, if X̄^0 is empty then ν*(X̄^0) = R/2, and if X̄_0 is empty then ν*(X̄_0) = 0. Next, we elaborate on the form of the maximizing measure for finite and countable alphabet spaces, and on its water-filling behavior, since we use these results to analyze the infinite horizon MCP with finite state and control spaces.

2.1. Finite Alphabet Spaces. Let X be a non-empty denumerable set endowed with the discrete topology. If the cardinality of X, denoted by |X|, is finite, then we can identify any x ∈ X with a unit vector in R^{|X|}. Define the set of probability vectors on X by

(2.4)  P(X) ≜ { p = (p_1, . . . , p_{|X|}) : p(x) ≥ 0, x = 1, . . . , |X|, Σ_{x∈X} p(x) = 1 }.

That is, P(X) is the set of all |X|-dimensional probability vectors, {ν(x) : x ∈ X} ∈ P(X), {µ(x) : x ∈ X} ∈ P(X). Also, let ℓ ≜ {ℓ(x) : x ∈ X} ∈ R^{|X|}_+ (i.e., the set of non-negative vectors of dimension |X|). Then, (2.1) may be written as

(2.5)  L(ν*) = max_{ν∈B_R(µ)} Σ_{x∈X} ℓ(x) ν(x),

where

(2.6)  B_R(µ) ≜ { ν ∈ P(X) : ‖ν − µ‖_TV ≜ Σ_{x∈X} |ν(x) − µ(x)| ≤ R }.

By defining ξ(x) ≜ ν(x) − µ(x), x = 1, . . . , |X|, we have Σ_{x∈X} ξ(x) = 0, and ‖ξ‖_TV = ξ^+(X) + ξ^−(X) denotes the total variation of ξ, where ξ^+ = max{ξ, 0} and ξ^− = max{−ξ, 0} stand for the positive and negative parts of ξ, respectively. Therefore,

Σ_{x∈X} ξ(x) = Σ_{x∈X} ξ^+(x) − Σ_{x∈X} ξ^−(x),  ‖ξ‖_TV = Σ_{x∈X} |ξ(x)| = Σ_{x∈X} ξ^+(x) + Σ_{x∈X} ξ^−(x),

and hence Σ_{x∈X} ξ^+(x) ≡ α/2 ≡ Σ_{x∈X} ξ^−(x). In addition, since

(2.7)  Σ_{x∈X} ℓ(x) ξ(x) = Σ_{x∈X} ℓ(x) ξ^+(x) − Σ_{x∈X} ℓ(x) ξ^−(x),

(2.5) can be reformulated as

(2.8)  max_{ν∈B_R(µ)} Σ_{x∈X} ℓ(x) ν(x) −→ Σ_{x∈X} ℓ(x) µ(x) + max_{ξ∈B̃_R(µ)} Σ_{x∈X} ℓ(x) ξ(x),

where ξ ∈ B̃_R(µ) is described by the constraints

(2.9)  α ≜ Σ_{x∈X} |ξ(x)| ≤ R,  Σ_{x∈X} ξ(x) = 0,  0 ≤ ξ(x) + µ(x) ≤ 1, ∀x ∈ X.

The solution of (2.8) is obtained by first identifying the partition of X into the disjoint sets (X^0, X_0, X \ X^0 ∪ X_0), and then by finding upper and lower bounds on the probabilities of these sets which are achievable [7]. Towards this end, define the maximum and minimum values of {ℓ(x) : x ∈ X} by

ℓ_max ≜ max_{x∈X} ℓ(x),  ℓ_min ≜ min_{x∈X} ℓ(x),

and their corresponding support sets by

X^0 ≜ { x ∈ X : ℓ(x) = ℓ_max },  X_0 ≜ { x ∈ X : ℓ(x) = ℓ_min }.

For the remaining sequence, {ℓ(x) : x ∈ X \ X^0 ∪ X_0}, and for 1 ≤ r ≤ |X \ X^0 ∪ X_0|, define recursively the set of indices for which the sequence achieves its (k+1)th smallest value by

(2.10)  X_k ≜ { x ∈ X : ℓ(x) = min{ ℓ(α) : α ∈ X \ X^0 ∪ (∪_{j=1}^{k} X_{j−1}) } }, k ∈ {1, 2, . . . , r},

until all the elements of X are exhausted. Further, define the corresponding values of the sequence on the sets X_k by

(2.11)  ℓ(X_k) ≜ min_{x ∈ X \ X^0 ∪ (∪_{j=1}^{k} X_{j−1})} ℓ(x), k ∈ {1, 2, . . . , r},

where r is the number of X_k sets, which is at most |X \ X^0 ∪ X_0|. From [7] we have the following. The maximum pay-off subject to the total variation constraint is given by

(2.12)  L(ν*) = ℓ_max ν*(X^0) + ℓ_min ν*(X_0) + Σ_{k=1}^{r} ℓ(X_k) ν*(X_k).

Moreover, the optimal probabilities are given by the following equations (water-filling solution):

(2.13a)  ν*(X^0) = µ(X^0) + α/2
(2.13b)  ν*(X_0) = ( µ(X_0) − α/2 )^+
(2.13c)  ν*(X_k) = ( µ(X_k) − ( α/2 − Σ_{j=1}^{k} µ(X_{j−1}) )^+ )^+
(2.13d)  α = min( R, 2(1 − µ(X^0)) ),

where R ∈ [0, 2], k ∈ {1, 2, . . . , r}, and r is the number of X_k sets, which is at most |X \ X^0 ∪ X_0|. The above discussion also holds for countable alphabet spaces (X, U). Next, we apply the above results to the minimax MCP defined by (1.12).
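The water-filling construction (2.13) is easy to implement. Below is a minimal sketch, assuming a finite alphabet and, for concreteness, that the added mass α/2 is placed on a single maximizer of ℓ (the set version (2.13a) only constrains the total mass on X^0); the vectors mu, ell and the radius R are hypothetical. The last line checks the resulting pay-off against (2.1), which applies here since α = R and all removed mass comes from X_0.

```python
import numpy as np

def maximizing_distribution(ell, mu, R):
    """Water-filling solution (2.13): tilt mu toward the maximizers of ell."""
    ell = np.asarray(ell, float)
    nu = np.asarray(mu, float).copy()
    top = ell == ell.max()
    alpha = min(R, 2.0 * (1.0 - nu[top].sum()))   # (2.13d)
    nu[np.argmax(ell)] += alpha / 2.0             # (2.13a): add alpha/2 on X^0
    budget = alpha / 2.0                          # removed by (2.13b)-(2.13c)
    for i in np.argsort(ell):                     # X_0 first, then X_1, X_2, ...
        if top[i]:
            continue
        take = min(budget, nu[i])
        nu[i] -= take
        budget -= take
    return nu

mu = np.array([0.1, 0.4, 0.3, 0.2])
ell = np.array([3.0, 1.0, 2.0, 5.0])
R = 0.3
nu = maximizing_distribution(ell, mu, R)
print(nu, np.abs(nu - mu).sum() <= R + 1e-12)     # stays inside the TV ball
print(ell @ nu, ell @ mu + (R / 2) * (ell.max() - ell.min()))  # matches (2.1)
```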
3. Minimax Stochastic Control for Finite State and Control Spaces.
In this section, we investigate the infinite horizon minimax MCP defined by (1.12) for finite state and control spaces. By employing the results of Section 2, we derive the dynamic programming equation (1.13) and we introduce the corresponding policy iteration algorithm.

Consider the problem of minimizing the finite horizon version of (1.11), defined by

(3.1)  J_n^*(x) = inf_{u∈U(x)} sup_{Q(·|x,u)∈B_R(Q^o)(x,u)} E_x^g { Σ_{k=0}^{n-1} f(x_k, u_k) }.

Let V : X → R denote the value function corresponding to (3.1). Then V satisfies the dynamic programming equation [6, 19]

(3.2a)  V_n(x) = 0, ∀x ∈ X
(3.2b)  V_j(x) = inf_{u∈U(x)} sup_{Q(·|x,u)∈B_R(Q^o)(x,u)} { f(x,u) + Σ_{z∈X} V_{j+1}(z) Q(z|x,u) }, j = 0, 1, . . . , n−1, x ∈ X.

By applying (2.1) with ℓ(·) = V_{j+1}(·) and µ(·) = Q^o(·|x,u), (3.2b) is equivalent to the dynamic programming equation

(3.3)  V_j(x) = inf_{u∈U(x)} { f(x,u) + Σ_{z∈X} V_{j+1}(z) Q^o(z|x,u) + (R/2)( sup_{z∈X} V_{j+1}(z) − inf_{z∈X} V_{j+1}(z) ) }.

Moreover, by applying (2.13) with ν*(·) = Q*(·|x,u), where Q*(·|x,u), (x,u) ∈ K, is the maximizing conditional distribution, and µ(·) = Q^o(·|x,u), (x,u) ∈ K, (3.2b) is equivalent to

(3.4)  V_j(x) = inf_{u∈U(x)} { f(x,u) + Σ_{z∈X} V_{j+1}(z) Q*(z|x,u) }.

Define V̄_j(x) = V_{n−j}(x). Then, from (3.2b), V̄_j(·) satisfies the equation

(3.5)  V̄_j(x) = inf_{u∈U(x)} sup_{Q(·|x,u)∈B_R(Q^o)(x,u)} { f(x,u) + Σ_{z∈X} V̄_{j−1}(z) Q(z|x,u) }, j = 1, 2, . . . , n.

We rewrite (3.5) as follows:

(3.6)  V̄_j(x) + (1/j) V̄_j(x) = inf_{u∈U(x)} sup_{Q(·|x,u)∈B_R(Q^o)(x,u)} { f(x,u) + Σ_{z∈X} Q(z|x,u) ( V̄_{j−1}(z) + (1/j) V̄_j(x) ) }.

Next, we introduce the following standard assumption [14].

ASSUMPTION 3.1. There exists a pair (V(·), J*), V : X → R and J* ∈ R, such that

(3.7)  lim_{j→∞} ( V̄_j(x) − jJ* ) = V(x), ∀x ∈ X.

Under Assumption 3.1,

(3.8)  lim_{j→∞} (1/j) V̄_j(x) = J*, ∀x ∈ X,

and the limit does not depend on x ∈ X. In addition, by taking the supremum with respect to x ∈ X on both sides of (3.7), by virtue of the finite cardinality of X we can exchange the limit and the supremum to obtain

(3.9)  lim_{j→∞} sup_{x∈X} ( V̄_j(x) − jJ* ) = sup_{x∈X} lim_{j→∞} ( V̄_j(x) − jJ* ) = sup_{x∈X} V(x).
By Assumption 3.1 and by (3.8) we have the following identities:

J* + V(x) = lim_{j→∞} ( (1/j) V̄_j(x) + ( V̄_j(x) − jJ* ) )
(a) = lim_{j→∞} inf_{u∈U(x)} sup_{Q(·|x,u)∈B_R(Q^o)(x,u)} { f(x,u) + Σ_{z∈X} Q(z|x,u) ( V̄_{j−1}(z) + (1/j) V̄_j(x) ) − jJ* }
(b) = lim_{j→∞} inf_{u∈U(x)} { f(x,u) − jJ* + Σ_{z∈X} Q^o(z|x,u) ( V̄_{j−1}(z) + (1/j) V̄_j(x) ) + (R/2)( sup_{z∈X} ( V̄_{j−1}(z) + (1/j) V̄_j(x) ) − inf_{z∈X} ( V̄_{j−1}(z) + (1/j) V̄_j(x) ) ) }
(c) = lim_{j→∞} inf_{u∈U(x)} { f(x,u) + Σ_{z∈X} Q^o(z|x,u) ( V̄_{j−1}(z) − (j−1)J* + (1/j) V̄_j(x) − J* ) + (R/2)( sup_{z∈X} ( V̄_{j−1}(z) − jJ* ) − inf_{z∈X} ( V̄_{j−1}(z) − jJ* ) ) },

where
(a) is obtained by using (3.6);
(b) is obtained by using the equivalent formulation (3.3);
(c) is obtained by adding and subtracting the appropriate multiples of J*.

Since U and X are of finite cardinality, we can interchange the limit with the minimization and maximization operations, to arrive at the following dynamic programming equation:

(3.10)  J* + V(x) = min_{u∈U(x)} { f(x,u) + Σ_{z∈X} Q^o(z|x,u) V(z) + (R/2)( sup_{z∈X} V(z) − inf_{z∈X} V(z) ) }.

Clearly, by (2.1), dynamic programming equation (3.10) is equivalently expressed as

(3.11)  J* + V(x) = min_{u∈U(x)} max_{Q(·|x,u)∈B_R(Q^o)(x,u)} { f(x,u) + Σ_{z∈X} Q(z|x,u) V(z) }.

Next, we state the first main theorem of this section.

THEOREM 3.2. Suppose X and U are of finite cardinality and Assumption 3.1 holds. If there exists a solution (V, J*) to the dynamic programming equation (3.10), and g* is a stationary policy such that g*(x) attains the minimum in the right-hand side of (3.10) for every x, then g* is an optimal policy and J* is the minimum average cost.

Proof. Let g ∈ G be any policy and u ∈ U(x). Since (V, J*) satisfies the dynamic programming equation (3.10), which is equivalent to (3.11), and by the definition of g*,

(3.12)  f(x,u) + Σ_{z∈X} Q^o(z|x,u) V(z) + (R/2)( max_{z∈X} V(z) − min_{z∈X} V(z) )
= max_{Q(·|x,u)∈B_R(Q^o)(x,u)} { f(x,u) + Σ_{z∈X} Q(z|x,u) V(z) }
≥ max_{Q(·|x,g*(x))∈B_R(Q^o)(x,g*(x))} { f(x,g*(x)) + Σ_{z∈X} Q(z|x,g*(x)) V(z) } = J* + V(x).

Denoting the maximizing distribution Q(·|x,u) in (3.12) by Q*(·|x,u) and the corresponding expectation by E^{g,Q*}, and taking expectations on both sides of (3.12), we have

(3.13)  E^{g,Q*}( f(x_j, u_j) ) ≥ J* + E^{g,Q*}( V(x_j) ) − E^{g,Q*}( Σ_{z∈X} Q*(z|x_j,u_j) V(z) ) = J* + E^{g,Q*}( V(x_j) ) − E^{g,Q*}( V(x_{j+1}) ).

Then, from (1.11), we have that for all g ∈ G,

J(g, x) ≥ lim inf_{j→∞} ( (1/j) Σ_{k=0}^{j−1} E^{g,Q*}( f(x_k, u_k) ) )
(a) ≥ lim inf_{j→∞} ( J* + (1/j)( E^{g,Q*}( V(x_0) ) − E^{g,Q*}( V(x_j) ) ) )
(b) = J*,

where
(a) is obtained by using (3.13);
(b) is obtained because the last term vanishes as j → ∞.

Thus, J* ≤ inf_{g∈G} J(g, x). However, when g is replaced by g*, equality holds throughout, and as a result g* is optimal; that is, J* = J*(x) = inf_{g∈G} J(g, x), g* ∈ G is an average cost optimal policy, and J* is the value.

3.1. Dynamic Programming under Irreducibility. Dynamic programming equation (3.10), and hence Theorem 3.2, are valid under Assumption 3.1. Here, we characterize the solution of the infinite horizon minimax average cost MCM under the standard irreducibility condition on the nominal transition probabilities of the controlled process. First, we introduce some notation.

Identify the state space X with X = {x_1, x_2, . . . , x_{|X|}}, consisting of |X| elements. Then, any function V : X → R may be represented by a vector in R^{|X|}:

V = ( V(x_1) · · · V(x_{|X|}) )^T ∈ R^{|X|}.

Any stationary control policy g ∈ G_SM, g : X → U, may also be identified with a g ∈ R^{|X|}. For any g, let Q(g) ∈ R^{|X|×|X|} be defined by Q(g)_{ij} = P(x_{t+1} = x_i | x_t = x_j, u_t = g(x_j)) and

f(g) = ( f(x_1, g(x_1)) · · · f(x_{|X|}, g(x_{|X|})) )^T ∈ R^{|X|}.

Let q ∈ R^{|X|} be defined by q(x_i) ≜ P({x_0 = x_i}), i = 1, . . . , |X|, and e ≜ (1, · · · , 1)^T ∈ R^{|X|}. The maximization of the expected n-stage cost, for a fixed q(x) ∈ R^{|X|} (the notation J_n(g, q) means that q(x) is fixed instead of x_0 = x), is given by

(3.14)  J_n(g, q) ≜ J_n(g, x) q^T(x) = max_{Q(·|x,u)∈B_R(Q^o)(x,u)} E^g { Σ_{k=0}^{n−1} f(x_k, u_k) } = max_{Q(·|x,u)∈B_R(Q^o)(x,u)} q^T { Σ_{k=0}^{n−1} Q(g)^k } f(g).
With Q*(·|x,u) denoting the maximizing conditional distribution, (3.14) is equivalent to

max_{Q(·|x,u)∈B_R(Q^o)(x,u)} q^T { Σ_{k=0}^{n−1} Q(g)^k } f(g) = q^T { Σ_{k=0}^{n−1} Q*(g)^k } f(g).

Hence, the maximizing average cost per unit-time is given by

J(g, q) = lim sup_{n→∞} max_{Q(·|x,u)∈B_R(Q^o)(x,u)} (1/n) E^g { Σ_{k=0}^{n−1} f(x_k, u_k) } = lim sup_{n→∞} (1/n) q^T { Σ_{k=0}^{n−1} Q*(g)^k } f(g).

Since q ∈ R^{|X|} and f(g) ∈ R^{|X|} are independent of n, we only need to investigate the conditions under which the limit

lim_{n→∞} (1/n) Σ_{k=0}^{n−1} Q*(g)^k

exists. The next lemma follows directly from [14, Lemma 5.4].

LEMMA 3.3. If Q* ∈ R^{|X|×|X|}_+ is a stochastic matrix, then the Cesaro limit

(3.15)  lim_{n→∞} (1/n) Σ_{k=0}^{n−1} (Q*)^k = Q̄*

always exists. The matrix Q̄* ∈ R^{|X|×|X|}_+ is a stochastic matrix and it is a solution of the equation

(3.16)  Q* Q̄* = Q̄*.

In view of Lemma 3.3, the maximizing average cost per unit-time of a stationary Markov control policy is given by

(3.17)  J(g, q) = q^T Q̄*(g) f(g),

where Q*(g) and Q̄*(g) are related by (3.16). We recall the following definition of a reducible stochastic matrix from [14, page 44].

DEFINITION 3.4. A stochastic matrix P ∈ R^{|X|×|X|}_+ is said to be reducible if by row and column permutations it can be placed into the block upper-triangular form

P = [ P_1  P_2 ; 0  P_3 ],

where P_1, P_3 are square matrices. A stochastic matrix which is not reducible is said to be irreducible.
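Definition 3.4 can be checked mechanically: a stochastic matrix is irreducible exactly when the directed graph with an edge i → j whenever P[i, j] > 0 is strongly connected. A small sketch (not from the paper) using boolean reachability:

```python
import numpy as np

def is_irreducible(P):
    """P is irreducible iff every state can reach and be reached by every
    other state, i.e., the transitive closure of the support graph is full."""
    reach = (np.asarray(P) > 0) | np.eye(len(P), dtype=bool)
    for _ in range(len(P)):                       # closure in at most n steps
        reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)
    return bool(reach.all())

# hypothetical chains: the second has an absorbing state, hence is reducible
print(is_irreducible(np.array([[0.5, 0.5], [0.3, 0.7]])))   # True
print(is_irreducible(np.array([[0.5, 0.5], [0.0, 1.0]])))   # False
```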
Next, we recall the following lemma from [14, Lemma 5.7].

LEMMA 3.5. Let Q* ∈ R^{|X|×|X|}_+ be an irreducible stochastic matrix. Then, there exists a unique vector q such that

Q* q = q,  e^T q = 1,  q(x_i) > 0 for all x_i ∈ X.

Moreover, the matrix Q̄* associated with Q* in (3.16) has all rows equal to q.

Note that (3.17) depends on the probability distribution q of the initial state. However, if Q* is assumed to be an irreducible stochastic matrix, then by Lemma 3.5,

(3.18)  J(g, q) = q^T Q̄*(g) f(g) = q(g)^T f(g) ≡ J(g),

where q(g) is the unique invariant probability distribution, that is, Q*(g) q(g) = q(g), and the average cost per unit-time J(g, q) ≡ J(g) is independent of the initial distribution. Hence, for the remainder of this section, we will assume that for every stationary Markov control policy g ∈ G_SM the stochastic matrix Q*(g) is irreducible. The next proposition summarizes the above results.

PROPOSITION 3.6 ([20]). Let g ∈ G_SM be a stationary Markov control policy, g : X → U, and assume that Q*(g) ∈ R^{|X|×|X|}_+ is irreducible. Then the following hold.
(a) There exists a unique q(g) ∈ R^{|X|}_+ such that

(3.19)  Q*(g) q(g) = q(g),  e^T q(g) = 1.

(b) The average cost per unit-time associated with the control policy g ∈ G_SM is

(3.20)  J(g) = q(g)^T f(g).

(c) There exists a V(g) ∈ R^{|X|} such that

(3.21)  J(g) e + V(g) = f(g) + Q*(g) V(g).

Proof. Parts (a) and (b) follow from Lemma 3.5 and the discussion above it. For part (c) see [20].

LEMMA 3.7.
Assume the following hold.
1. For any stationary control policy g ∈ G_SM, Q*(g) ∈ R^{|X|×|X|}_+ is irreducible.
2. There exists a g* ∈ G_SM such that J* = inf_{g∈G_SM} J(g).

Then there exists a pair (V(g*,·), J*), V(g*,·) : X → R and J* ∈ R, which is a solution to the dynamic programming equation

J* + V(g*, x) = min_{u∈U} { f(x,u) + Σ_{z∈X} Q*(z|x,u) V(g*, z) }.

Proof. By Proposition 3.6(c), there exist a V(g*,·) : X → R and J* such that, for all x ∈ X,

(3.22)  J* + V(g*, x) = f(x, g*(x)) + Σ_{z∈X} Q*(z|x, g*(x)) V(g*, z).

Then, for all x ∈ X,

J* + V(g*, x) ≥ min_{u∈U} { f(x,u) + Σ_{z∈X} Q*(z|x,u) V(g*, z) }.

Define ḡ : X → U by

ḡ(x) = argmin_{u∈U} { f(x,u) + Σ_{z∈X} Q*(z|x,u) V(g*, z) }.

Suppose that for some x_0 ∈ X strict inequality holds in (3.22); then

(3.23)  J* + V(g*, x_0) > min_{u∈U} { f(x_0, u) + Σ_{z∈X} Q*(z|x_0, u) V(g*, z) }.

Multiplying (3.23) by q(ḡ)(x_0) > 0 and summing over x_0 ∈ X yields

J* + Σ_{x_0∈X} q(ḡ)(x_0) V(g*, x_0)
> min_{u∈U} { Σ_{x_0∈X} q(ḡ)(x_0) f(x_0, u) + Σ_{x_0∈X} q(ḡ)(x_0) Σ_{z∈X} Q*(z|x_0, u) V(g*, z) }
= Σ_{x_0∈X} q(ḡ)(x_0) f(x_0, ḡ(x_0)) + Σ_{x_0∈X} q(ḡ)(x_0) Σ_{z∈X} Q*(z|x_0, ḡ(x_0)) V(g*, z)
= J(ḡ) + Σ_{z∈X} q(ḡ)(z) V(g*, z), by Proposition 3.6(a),

which gives J* > J(ḡ), contradicting assumption 2. Hence, equality holds in (3.22) for every x ∈ X.

Next, we state the second main theorem of this section.

THEOREM
3.8. Assume that for all stationary Markov control policies g ∈ G_SM, and for a given total variation parameter R ∈ [0, R_max] ⊂ [0, 2], the maximizing transition matrix Q*(g) is irreducible. Then the following hold.
(a) There exists a solution (V, J*), V : X → R, J* ∈ R, to the dynamic programming equation

(3.24)  J* + V(x) = min_{u∈U} { f(x,u) + Σ_{z∈X} Q*(z|x,u) V(z) },

or, equivalently, to the dynamic programming equation

(3.25)  J* + V(x) = min_{u∈U} { f(x,u) + Σ_{z∈X} Q^o(z|x,u) V(z) + (R/2)( max_{z∈X} V(z) − min_{z∈X} V(z) ) },

where max_{z∈X} V(z) denotes the component-wise maximum, and similarly for the minimum. The maximizing conditional distribution Q*(·|x,u), (x,u) ∈ K, is given by (2.13), where ν*(·), µ(·) and ℓ(·) are replaced by Q*(·|x,u), Q^o(·|x,u) and V(·), respectively; i.e.,

(3.26a)  Q*(X^+|x,u) = Q^o(X^+|x,u) + α/2
(3.26b)  Q*(X^−|x,u) = ( Q^o(X^−|x,u) − α/2 )^+
(3.26c)  Q*(X_k|x,u) = ( Q^o(X_k|x,u) − ( α/2 − Σ_{j=1}^{k} Q^o(X_{j−1}|x,u) )^+ )^+
(3.26d)  α = min( R, 2(1 − Q^o(X^+|x,u)) ),

where (with X_0 ≜ X^−)

(3.27a)  X^+ ≜ { x ∈ X : V(x) = max{V(x) : x ∈ X} }
(3.27b)  X^− ≜ { x ∈ X : V(x) = min{V(x) : x ∈ X} }
(3.27c)  X_k ≜ { x ∈ X : V(x) = min{ V(α) : α ∈ X \ X^+ ∪ (∪_{j=1}^{k} X_{j−1}) } },

and k = 1, 2, . . . , r (see Section 2.1).
(b) If g*(x) attains the minimum in (3.24), or equivalently in (3.25), for every x, then g* is an average cost optimal policy.
(c) The minimum average cost is J*.

Proof. Theorem 3.8 is obtained by combining Theorem 3.2 and Lemma 3.7 and by applying the results of Section 2.

The main observation is that in specific applications one may employ either dynamic programming equation (3.24) or (3.25).

3.1.1. Policy Iteration Algorithm. In this section, we provide a modified version of the classical policy iteration algorithm for average cost dynamic programming [14, 20]. From part (a) of Theorem 3.8, the policy evaluation and policy improvement steps of a policy iteration algorithm must be performed using the maximizing conditional distribution obtained under the total variation distance ambiguity constraint. Moreover, one needs to guarantee that, for the given total variation parameter R, the corresponding maximizing matrix Q* is irreducible; otherwise, Algorithm 3.9 may not be sufficient to give the optimal policy and the minimum cost. In general, R ∈ [0, R_max] ⊆ [0, 2], and R_max is strictly less than 2. This generality will be discussed in Section 4 for general Borel spaces.

ALGORITHM 3.9 (Policy iteration).
1. Let m = 0 and select an arbitrary stationary Markov control policy g^0 : X → U.
2. (Policy Evaluation) Solve the equation

(3.28)  J_{Q^o}(g_m) e + V_{Q^o}(g_m) = f(g_m) + Q^o(g_m) V_{Q^o}(g_m)

for J_{Q^o}(g_m) ∈ R and V_{Q^o}(g_m) ∈ R^{|X|}. Identify the support sets of (3.28) using (3.27), and construct the matrix Q*(g_m) using (3.26). Solve the equation

(3.29)  J_{Q*}(g_m) e + V_{Q*}(g_m) = f(g_m) + Q*(g_m) V_{Q*}(g_m)

for J_{Q*}(g_m) ∈ R and V_{Q*}(g_m) ∈ R^{|X|}.
3. (Policy Improvement) Let

(3.30)  g_{m+1} = argmin_{g∈R^{|X|}} { f(g) + Q*(g) V_{Q*}(g_m) }.
4. If g_{m+1} = g_m, let g* = g_m; else let m = m + 1 and return to step 2.

In Section 5.1, we illustrate through an example how the policy iteration algorithm for infinite horizon average cost dynamic programming is implemented.
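For finite alphabets, Algorithm 3.9 can be sketched in a few dozen lines. The following is a hypothetical, self-contained illustration: the water-filling routine mirrors (3.26), the evaluation step solves (3.28)/(3.29) by pinning the last component of V to zero (values are unique up to an additive constant), and the numerical model at the bottom is invented for the demonstration. It assumes the maximizing matrices stay irreducible for the chosen R, as Theorem 3.8 requires, and includes no convergence safeguard beyond an iteration cap.

```python
import numpy as np

def maximizing_distribution(ell, mu, R):
    """Row-wise water filling (3.26): tilt mu toward the maximizers of ell."""
    ell = np.asarray(ell, float)
    nu = np.asarray(mu, float).copy()
    top = ell == ell.max()
    alpha = min(R, 2.0 * (1.0 - nu[top].sum()))
    nu[np.argmax(ell)] += alpha / 2.0
    budget = alpha / 2.0
    for i in np.argsort(ell):
        if top[i]:
            continue
        take = min(budget, nu[i])
        nu[i] -= take
        budget -= take
    return nu

def evaluate(Q, cost):
    """Solve J e + V = cost + Q V with the normalization V[-1] = 0."""
    n = len(cost)
    A = np.zeros((n, n))
    A[:, 0] = 1.0                                  # coefficient of J
    A[:, 1:] = (np.eye(n) - Q)[:, :n - 1]          # coefficients of V[0..n-2]
    sol = np.linalg.solve(A, np.asarray(cost, float))
    return sol[0], np.append(sol[1:], 0.0)

def policy_iteration(f, Qo, R, max_iter=100):
    n, m = f.shape
    g = np.zeros(n, dtype=int)                     # step 1: arbitrary policy
    for _ in range(max_iter):
        idx = np.arange(n)
        Qg, fg = Qo[g, idx], f[idx, g]
        _, V = evaluate(Qg, fg)                    # step 2, equation (3.28)
        Qs = np.array([maximizing_distribution(V, Qg[x], R) for x in range(n)])
        J, Vs = evaluate(Qs, fg)                   # step 2, equation (3.29)
        vals = np.array([[f[x, u] + maximizing_distribution(Vs, Qo[u, x], R) @ Vs
                          for u in range(m)] for x in range(n)])
        g_new = vals.argmin(axis=1)                # step 3, equation (3.30)
        if np.array_equal(g_new, g):               # step 4
            return g, J, Vs
        g = g_new
    return g, J, Vs

# hypothetical irreducible nominal model: 3 states, 2 controls
f = np.array([[2.0, 0.5], [1.0, 3.0], [3.0, 0.5]])
Qo = np.array([[[1/9, 5/9, 3/9], [4/9, 2/9, 3/9], [2/9, 3/9, 4/9]],
               [[3/9, 3/9, 3/9], [1/9, 4/9, 4/9], [5/9, 2/9, 2/9]]])
print(policy_iteration(f, Qo, R=0.2))
```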
Part (a) of Theorem 3.8 indicates that for a stationary Markov control policy g ∈ G_SM, and for an irreducible stochastic matrix Q*, there exists a solution to the dynamic programming equation (3.24). Moreover, the maximizing stochastic matrix Q*, which is given by (3.26), is calculated based on the support sets (3.27), the nominal stochastic matrix Q^o, and the value of the total variation parameter R ∈ [0, R_max]. Hence, in order to apply the policy iteration algorithm for average cost dynamic programming, one needs to know in advance that, for a given total variation parameter R ∈ [0, 2] and an irreducible nominal stochastic matrix Q^o, the maximizing stochastic matrix Q* is also irreducible. Otherwise, the policy iteration algorithm may not be sufficient to give the optimal policy and the minimum cost. In particular, as we show next, if the irreducibility condition is not satisfied, then the policy iteration algorithm need not have a unique solution.

As an example (inspired by [16]), consider the stochastic control system shown in Fig. 3.1, with state space X = {1, 2, 3} and control set U = {u_1, u_2}.

FIG. 3.1. Nominal stochastic control system.

Let the nominal transition probabilities under controls u_1 and u_2 be given by

(3.31)  Q^o(u_1) = (1/9) [ 0 5 4 ; 0 9 0 ; 0 0 9 ],  Q^o(u_2) = (1/9) [ · · · ; · · · ; · · · ].

The cost function under each state and action is given by

f(1, u_1) = 2, f(2, u_1) = 1, f(3, u_1) = 3, f(1, u_2) = 0.·, f(2, u_2) = 3, f(3, u_2) = 0.· .

Clearly, from (3.31), the nominal transition probability matrix of this control system is reducible under both controls, since under u_1 and u_2 the system contains more than one communication class (states i and j belong to the same communication class if and only if each of these states can reach and be reached by the other). Using the policy iteration Algorithm 3.9 with initial policy g^0(1) = g^0(2) = g^0(3) = u_1, the optimality equation (3.28) for this system may be written as

J_{Q^o} e + V_{Q^o}(g^0) = ( 2  1  3 )^T + (1/9) [ 0 5 4 ; 0 9 0 ; 0 0 9 ] V_{Q^o}(g^0),

and hence

J_{Q^o} + V_{Q^o}(g^0, 1) = 2 + (5/9) V_{Q^o}(g^0, 2) + (4/9) V_{Q^o}(g^0, 3)
J_{Q^o} + V_{Q^o}(g^0, 2) = 1 + V_{Q^o}(g^0, 2)  ⟹  J_{Q^o} = 1
J_{Q^o} + V_{Q^o}(g^0, 3) = 3 + V_{Q^o}(g^0, 3)  ⟹  J_{Q^o} = 3.

The two recurrent classes thus produce two different values of the average cost, so no constant J_{Q^o} solves (3.28), and the policy iteration algorithm does not return a unique solution.

Note also that even if Q^o is an irreducible stochastic matrix, as the value of the total variation parameter R increases, the maximizing stochastic matrix Q*(R) eventually will be transformed into a reducible stochastic matrix. Hence, our proposed method for solving the minimax stochastic control problem with average cost is valid only for a specific range of values of the total variation parameter, R ∈ [0, R_max] ⊆ [0, 2]. In particular, if Q^o is an irreducible stochastic matrix then, for any given partition of the state space, there exists an R_max ∈ [0, 2] for which we distinguish the following two cases:
(a) for 0 ≤ R < R_max, Q* is an irreducible stochastic matrix; Theorem 3.8 is valid and the policy iteration algorithm gives the optimal policy and the minimum cost;
(b) for R ≥ R_max, Q* is a reducible stochastic matrix; Theorem 3.8 is not valid and the policy iteration algorithm need not have a solution.

REMARK 3.10.
Consider R ≥ R_max. Then an extended solution, through a reduced-dimensional state space, may be obtained as follows. Due to the water-filling behavior of the maximizing conditional distribution (3.26), the columns of Q* which correspond to states belonging to X \ X^+ become all-zero columns as the total variation parameter R increases. Whenever an all-zero column appears, one can remove the corresponding state of that column, and hence Q* will be transformed back into an irreducible stochastic matrix of reduced order.
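A sketch of this reduction, assuming a maximizing matrix Qstar whose rows were built via (3.26); the numbers are hypothetical. Dropping a state whose column is all zero removes no probability mass, so the remaining rows stay stochastic.

```python
import numpy as np

def reduce_state_space(Qstar, states):
    """Remove states that receive no incoming mass (all-zero columns)."""
    keep = ~np.all(Qstar == 0.0, axis=0)
    return Qstar[np.ix_(keep, keep)], [s for s, k in zip(states, keep) if k]

Qstar = np.array([[0.7, 0.3, 0.0],
                  [0.6, 0.4, 0.0],
                  [0.5, 0.5, 0.0]])       # state 3 has lost all incoming mass
Q_red, kept = reduce_state_space(Qstar, [1, 2, 3])
print(kept)      # [1, 2]
print(Q_red)     # rows remain stochastic on the reduced state space
```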
4. Minimax Stochastic Control for Borel Spaces.
In this section, we derive the general dynamic programming equations for Borel spaces (X, U), which solve the MCP for all values of R ∈ [0, 2]. In addition, we derive a generalized policy iteration algorithm corresponding to the generalized dynamic programming equations when the state and control spaces are of finite cardinality. Throughout this section it is again assumed that Assumption 1.4 holds.

The characterization of optimal policies for the minimax MCP defined by (1.12) will be based on the concept of a canonical triplet, adapted to the current formulation (see [12]). Consider the MCM (1.1), where (X, U) are Borel spaces, and let h : X → R be a bounded, continuous and non-negative function. Denote the expected n-stage cost, with a terminal cost h, policy g, and x_0 = x, by J_0(g, Q, x, h) = h(x), and, for n ≥ 1, by

(4.1)  J_n(g, Q, x, h) = E_x^g { Σ_{k=0}^{n−1} f(x_k, u_k) + h(x_n) } = J_n(g, Q, x) + E_x^g { h(x_n) },

with J_n(g, Q, x) = J_n(g, Q, x, 0). The corresponding maximizing expected n-stage cost is given by

(4.2)  J_n(g, x, h) = sup_{Q(·|x,u)∈B_R(Q^o)(x,u)} E_x^g { Σ_{k=0}^{n−1} f(x_k, u_k) + h(x_n) } = E_x^{g,Q*} { Σ_{k=0}^{n−1} f(x_k, u_k) + h(x_n) } = J_n(g, x) + E_x^{g,Q*} { h(x_n) },

with J_n(g, x) = J_n(g, x, 0), where Q*(·|x,u) is the maximizing distribution. Then,

(4.3)  J_n^*(x, h) = inf_{g∈G} J_n(g, x, h);  J_n^*(x) = inf_{g∈G} J_n(g, x, h), if h(·) = 0.
Throughout this section it is assumed that there exist a policy g ∈ G and an initial state x ∈ X such that J(g, x) < ∞ (see (1.11)). The definition of a canonical triplet is introduced next, following [12, 21] with a slight variation, to account for the extra terms which enter the dynamic programming equation.

DEFINITION 4.1. Let ρ and h be real-valued, bounded, continuous, non-negative, measurable functions on X and ϕ ∈ F a given selector. Then (ρ, h, ϕ) is said to be a canonical triplet if

(4.4)  J_n(g^∞, x, h) = J_n^*(x, h) = nρ(x) + h(x), ∀x ∈ X, n = 0, 1, . . . .

A selector ϕ ∈ F (of a stationary policy g^∞ ∈ G_SM) is called canonical if it is an element of some canonical triplet.

Note that, with the appropriate choice of h as the terminal cost, the policy g^∞ is optimal for the n-stage problem for all n = 0, 1, . . . . The following theorem characterizes the canonical triplets for the minimax problem, with respect to the new dynamic programming equation.

THEOREM 4.2.
Suppose the supremum and infimum of h(·) and ρ(·) over X are attained. Then (ρ, h, ϕ) is a canonical triplet if and only if, for every x ∈ X, the following hold:

(a)  ρ(x) = inf_{u∈U(x)} { ∫_X ρ(z) Q^o(dz|x,u) + (R/2)( sup_{z∈X} ρ(z) − inf_{z∈X} ρ(z) ) }
(b)  ρ(x) + h(x) = inf_{u∈U(x)} { f(x,u) + ∫_X h(z) Q^o(dz|x,u) + (R/2)( sup_{z∈X} h(z) − inf_{z∈X} h(z) ) }
(c)  ϕ(x) ∈ U(x) attains the minimum in both (a) and (b), that is,

(4.5)  ρ(x) = ∫_X ρ(z) Q^o(dz|x,ϕ) + (R/2)( sup_{z∈X} ρ(z) − inf_{z∈X} ρ(z) )
(4.6)  ρ(x) + h(x) = f(x,ϕ) + ∫_X h(z) Q^o(dz|x,ϕ) + (R/2)( sup_{z∈X} h(z) − inf_{z∈X} h(z) );

or, equivalently, (ρ, h, ϕ) is a canonical triplet if and only if, for every x ∈ X, the following hold:

(a')  ρ(x) = inf_{u∈U(x)} sup_{Q(·|x,u)∈B_R(Q^o)(x,u)} ∫_X ρ(z) Q(dz|x,u)
(b')  ρ(x) + h(x) = inf_{u∈U(x)} sup_{Q(·|x,u)∈B_R(Q^o)(x,u)} { f(x,u) + ∫_X h(z) Q(dz|x,u) }
(c')  ϕ(x) ∈ U(x) attains the minimum in (a') and (b'), that is,

(4.7)  ρ(x) = sup_{Q(·|x,ϕ)∈B_R(Q^o)(x,ϕ)} ∫_X ρ(z) Q(dz|x,ϕ)
(4.8)  ρ(x) + h(x) = sup_{Q(·|x,ϕ)∈B_R(Q^o)(x,ϕ)} { f(x,ϕ) + ∫_X h(z) Q(dz|x,ϕ) }.

Note that, if (ρ, h, ϕ) is a canonical triplet, then so is (ρ, h + N, ϕ) for any constant N. Next we proceed with the proof of Theorem 4.2.

Proof. (Necessity). Suppose that (ρ, h, ϕ) is a canonical triplet, i.e., (4.4) holds for all x ∈ X and n ≥ 0. From the analogue of the dynamic programming equation (3.3) for Borel spaces, we have

(4.9)  V_j(x) = inf_{u∈U(x)} { f(x,u) + ∫_X V_{j+1}(z) Q^o(dz|x,u) + (R/2)( sup_{z∈X} V_{j+1}(z) − inf_{z∈X} V_{j+1}(z) ) }.

Define V̄_j(x) = V_{n−j}(x), j = 0, 1, . . . , n. Then (4.9) may be written in the "forward" form

(4.10)  V̄_{j+1}(x) = inf_{u∈U(x)} { f(x,u) + ∫_X V̄_j(z) Q^o(dz|x,u) + (R/2)( sup_{z∈X} V̄_j(z) − inf_{z∈X} V̄_j(z) ) }.

Substituting (4.10) into (4.2)-(4.3), we have

(4.11)  J_{n+1}^*(x, h) = inf_{u∈U(x)} { f(x,u) + ∫_X J_n^*(z, h) Q^o(dz|x,u) + (R/2)( sup_{z∈X} J_n^*(z, h) − inf_{z∈X} J_n^*(z, h) ) }.

Thus, from (4.4) we have

(4.12)  (n+1)ρ(x) + h(x) = inf_{u∈U(x)} { f(x,u) + ∫_X ( nρ(z) + h(z) ) Q^o(dz|x,u) + (R/2)( sup_{z∈X} ( nρ(z) + h(z) ) − inf_{z∈X} ( nρ(z) + h(z) ) ) }.

Evaluating (4.12) at n = 0 we obtain (b). Furthermore, since ρ(·), h(·) and f(·,·) are bounded, multiplying both sides of (4.12) by 1/n and letting n → ∞ yields (a). Finally, for any deterministic stationary policy g^∞ ∈ G_SM, we have

(4.13)  J_{n+1}(g^∞, x, h) = f(x,ϕ) + ∫_X J_n(g^∞, z, h) Q^o(dz|x,ϕ) + (R/2)( sup_{z∈X} J_n(g^∞, z, h) − inf_{z∈X} J_n(g^∞, z, h) ), x ∈ X.

Thus, if ϕ ∈ F satisfies (4.4), then by (4.11)-(4.13) we have

(n+1)ρ(x) + h(x) = f(x,ϕ) + ∫_X ( nρ(z) + h(z) ) Q^o(dz|x,ϕ) + (R/2)( sup_{z∈X} ( nρ(z) + h(z) ) − inf_{z∈X} ( nρ(z) + h(z) ) ),

which, as before, gives (4.5) and (4.6).

(Sufficiency). Conversely, suppose (ρ, h, ϕ) satisfies (a)-(c). We proceed by induction: equation (4.4) is trivially satisfied when n = 0. Suppose it is true for some n ≥ 0. Then the following is obtained:

J_{n+1}^*(x, h) = inf_{u∈U(x)} { f(x,u) + ∫_X ( nρ(z) + h(z) ) Q^o(dz|x,u) + (R/2)( sup_{z∈X} ( nρ(z) + h(z) ) − inf_{z∈X} ( nρ(z) + h(z) ) ) }
= inf_{u∈U(x)} { f(x,u) + ∫_X ( nρ(z) + h(z) ) Q*(dz|x,u) }
≥ inf_{u∈U(x)} { f(x,u) + ∫_X h(z) Q*(dz|x,u) } + n inf_{u∈U(x)} { ∫_X ρ(z) Q*(dz|x,u) }
= inf_{u∈U(x)} { f(x,u) + ∫_X h(z) Q^o(dz|x,u) + (R/2)( sup_{z∈X} h(z) − inf_{z∈X} h(z) ) } + n inf_{u∈U(x)} { ∫_X ρ(z) Q^o(dz|x,u) + (R/2)( sup_{z∈X} ρ(z) − inf_{z∈X} ρ(z) ) }
= (n+1)ρ(x) + h(x).

On the other hand,

J_{n+1}^*(x, h) ≤ J_{n+1}(g^∞, x, h)
= f(x,ϕ) + ∫_X ( nρ(z) + h(z) ) Q^o(dz|x,ϕ) + (R/2)( sup_{z∈X} ( nρ(z) + h(z) ) − inf_{z∈X} ( nρ(z) + h(z) ) )
= f(x,ϕ) + ∫_X ( nρ(z) + h(z) ) Q*(dz|x,ϕ)
= f(x,ϕ) + ∫_X h(z) Q*(dz|x,ϕ) + n ∫_X ρ(z) Q*(dz|x,ϕ)
= f(x,ϕ) + ∫_X h(z) Q^o(dz|x,ϕ) + (R/2)( sup_{z∈X} h(z) − inf_{z∈X} h(z) ) + n { ∫_X ρ(z) Q^o(dz|x,ϕ) + (R/2)( sup_{z∈X} ρ(z) − inf_{z∈X} ρ(z) ) }
= (n+1)ρ(x) + h(x),

where the second and third equalities follow by applying (2.1). This implies

J_{n+1}^*(x, h) = J_{n+1}(g^∞, x, h) = (n+1)ρ(x) + h(x).
REMARK 4.3. We note that in Definition 4.1 the condition that (ρ, h) are bounded, continuous and non-negative can be relaxed to continuous and non-negative. In this case, if (4.4) holds, i.e., (ρ, h, ϕ) is a canonical triplet, then (a)-(c) hold.

Due to the fact that the average cost as an optimality criterion is underselective, i.e., it has limitations in distinguishing between optimal policies with different costs, we next introduce a more selective criterion. For other underselective and overselective optimality criteria see [10, 11].

DEFINITION 4.4. A policy g† is said to be
(a) [9] strong average cost optimal if

(4.14)  J(g†, x) ≤ lim inf_{n→∞} (1/n) J_n(g, x), ∀g ∈ G, x ∈ X;

(b) [11] F-strong average cost optimal if

(4.15)  lim_{n→∞} (1/n)( J_n(g†, x) − J_n^*(x) ) = 0, ∀x ∈ X,

where J_n^*(x) = inf_{g∈G} J_n(g, x).

Based on Definition 4.4, we next derive stronger results.
THEOREM 4.5 ([12]). Suppose the cost function f satisfies Assumption 1.4, and let (ρ, h, ϕ) be a canonical triplet (with h not necessarily bounded).
(a) If for every g ∈ G and x ∈ X

(4.16)  lim_{n→∞} E_x^{g,Q*} { h(x_n)/n } = 0,

then g^∞ is an average cost optimal policy and ρ is the average cost value function,

(4.17)  J*(x) = ρ(x) = J(g^∞, x) = lim_{n→∞} (1/n) J_n(g^∞, x), ∀x.

(b) If for every x ∈ X

(4.18)  lim_{n→∞} sup_{g∈G} E_x^{g,Q*} { h(x_n)/n } = 0,

then g^∞ is strong average cost optimal and F-strong average cost optimal, and

(4.19)  J*(x) = lim_{n→∞} (1/n) J_n^*(x).

Proof. (a) From (4.2)-(4.3) and the last equality in (4.4),

nρ(x) + h(x) = J_n^*(x, h) ≤ J_n(g, x) + E_x^{g,Q*} { h(x_n) }, ∀g ∈ G, x ∈ X.

Hence, multiplying by 1/n and taking the lim sup as n → ∞, by virtue of (4.16), we have ρ(x) ≤ J(g, x) for all g, x, which implies

(4.20)  ρ(x) ≤ J*(x), ∀x.

Furthermore, from (4.4) again,

(4.21)  J_n(g^∞, x, h) = J_n(g^∞, x) + E_x^{g^∞,Q*} { h(x_n) } = nρ(x) + h(x).

Finally, multiplying both sides of (4.21) by 1/n and then taking both the lim sup and the lim inf as n → ∞, we obtain the last two equalities in (4.17), which in turn, together with (4.20), yield the first one, since J*(x) ≤ J(g^∞, x).

(b) The first equality in (4.4) gives

(4.22)  J_n^*(x, h) = J_n(g^∞, x) + E_x^{g^∞,Q*} { h(x_n) }.

On the other hand, by (4.2)-(4.3),

J_n^*(x, h) = inf_{g∈G} ( J_n(g, x) + E_x^{g,Q*} { h(x_n) } ) ≤ J_n^*(x) + sup_{g∈G} E_x^{g,Q*} { h(x_n) }.

Thus,

(4.23)  0 ≤ J_n(g^∞, x) − J_n^*(x) ≤ sup_{g∈G} E_x^{g,Q*} { h(x_n) } − E_x^{g^∞,Q*} { h(x_n) }.

Hence, if h satisfies (4.18), then g^∞ is F-strong average cost optimal. Finally, to prove that g^∞ is strong average cost optimal, we use (4.22) again to obtain

J_n(g^∞, x) + E_x^{g^∞,Q*} { h(x_n) } ≤ J_n(g, x) + E_x^{g,Q*} { h(x_n) }, ∀g, x, n,

so that from (4.18),

(4.24)  lim inf_{n→∞} (1/n) J_n(g^∞, x) ≤ lim inf_{n→∞} (1/n) J_n(g, x).

Since the left-hand side equals J(g^∞, x) (see (4.17)), it follows that g^∞ is indeed strong average cost optimal, and the proof is complete.

Note that, in the case in which ρ(·) is constant, that is, ρ does not vary with x, the first optimality equation of Theorem 4.2 is redundant and hence (a)-(c) reduce to

(4.25)  ρ* + h(x) = inf_{u∈U(x)} { f(x,u) + ∫_X h(z) Q^o(dz|x,u) + (R/2)( sup_{z∈X} h(z) − inf_{z∈X} h(z) ) }
(4.26)  ρ* + h(x) = f(x,ϕ) + ∫_X h(z) Q^o(dz|x,ϕ) + (R/2)( sup_{z∈X} h(z) − inf_{z∈X} h(z) ).

Next, we use equations (a')-(c') of Theorem 4.2 to develop a general policy iteration algorithm for average cost dynamic programming.

4.1. General Policy Iteration Algorithm. In this section, we provide a policy iteration algorithm to obtain average cost optimal policies, in which the policy evaluation and policy improvement steps are evaluated using the maximizing conditional distribution given by (3.26). The proposed algorithm is considerably more complex compared to Algorithm 3.9. Nevertheless, it solves the MCP for the whole range of values of the total variation parameter, R ∈ [0, 2], and without imposing the irreducibility condition, as in Section 3.1.1.

ALGORITHM 4.6 (General policy iteration).
1) Let m = 0 and select an arbitrary stationary Markov control policy g^0 : X → U.
2) (Policy Evaluation) Solve the equations

(4.27)  J_{Q^o}(g_m) = Q^o(g_m) J_{Q^o}(g_m)
(4.28)  J_{Q^o}(g_m) + h_{Q^o}(g_m) = f(g_m) + Q^o(g_m) h_{Q^o}(g_m)

for J_{Q^o}(g_m) and h_{Q^o}(g_m). Identify the support sets of (4.28) using (3.27) (where h replaces V), and construct the matrix Q*(g_m) using (3.26). Solve the equations

(4.29)  J_{Q*}(g_m) = Q*(g_m) J_{Q*}(g_m)
(4.30)  J_{Q*}(g_m) + h_{Q*}(g_m) = f(g_m) + Q*(g_m) h_{Q*}(g_m)

for J_{Q*}(g_m) and h_{Q*}(g_m).
3) (Policy Improvement)
a) Let

(4.31)  g_{m+1} = argmin_{g∈R^{|X|}} { Q*(g) J_{Q*}(g_m) }.

If g_{m+1} = g_m, go to step 3b); otherwise let m = m + 1 and return to step 2.
b) Let

(4.32)  g_{m+1} = argmin_{g∈R^{|X|}} { f(g) + Q*(g) h_{Q*}(g_m) }.
4) If $g^{m+1} = g^m$, let $g^* = g^m$; else, let $m = m + 1$ and return to step 2.
For MCMs with finite state and action spaces, the proposed general policy iteration algorithm converges in a finite number of iterations. However, for MCMs on Borel spaces the proposed policy iteration algorithm might not converge, or it might converge to a suboptimal value, and hence one must introduce additional assumptions (see, e.g., [13, 15]). In Section 5.2, we illustrate through an example how Algorithm 4.6 is applied.
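To make the steps of Algorithm 4.6 concrete for finite alphabets, the following is a minimal Python/NumPy sketch; it is ours, not the authors' implementation. The helper names `solve_pair` and `general_policy_iteration`, and the argument `q_star`, are assumptions: `solve_pair` solves the evaluation pairs (4.27)-(4.28) and (4.29)-(4.30) by least squares with a single normalization on $h$ (when the nominal matrix is reducible, the free additive constants chosen in the paper, e.g. the $\alpha$, $\beta$ of Section 5.2, may differ from the least-squares representative), and `q_star` stands in for the water-filling construction (3.26)-(3.27). Controls are assumed to be labeled $0, \dots, |\mathcal{U}|-1$, so that `Qo[u]` and `fc[u]` collect the nominal matrix and cost vector under control $u$.

```python
import numpy as np

def solve_pair(Q, f, norm_state=0):
    """Least-squares solution of the multichain evaluation pair
    J = Q J and J + h = f + Q h, with the normalization h[norm_state] = 0.
    If the equations leave extra additive constants free (reducible Q),
    lstsq returns one particular member of the solution family."""
    n = Q.shape[0]
    I = np.eye(n)
    A = np.zeros((2 * n + 1, 2 * n))
    b = np.zeros(2 * n + 1)
    A[:n, :n] = I - Q                    # (I - Q) J = 0
    A[n:2 * n, :n] = I                   # J + (I - Q) h = f
    A[n:2 * n, n:] = I - Q
    b[n:2 * n] = f
    A[-1, n + norm_state] = 1.0          # pin the additive constant of h
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:n], x[n:]

def general_policy_iteration(Qo, fc, q_star, max_iter=50):
    """Sketch of Algorithm 4.6. Qo[u], fc[u]: nominal matrix and cost
    vector under control u; q_star(Q, h): water-filling construction
    of Q* from the support sets of h (stand-in for (3.26)-(3.27))."""
    n = Qo[0].shape[0]
    g = np.zeros(n, dtype=int)                       # step 1
    for _ in range(max_iter):
        Qg = np.array([Qo[g[x]][x] for x in range(n)])
        fg = np.array([fc[g[x]][x] for x in range(n)])
        _, h = solve_pair(Qg, fg)                    # step 2: (4.27)-(4.28)
        Qs = [q_star(Q, h) for Q in Qo]              # build Q*(u) for each u
        Qsg = np.array([Qs[g[x]][x] for x in range(n)])
        J, hs = solve_pair(Qsg, fg)                  # step 2: (4.29)-(4.30)
        g_J = np.argmin([Q @ J for Q in Qs], axis=0)       # step 3a: (4.31)
        if not np.array_equal(g_J, g):
            g = g_J
            continue                                 # return to step 2
        g_h = np.argmin([fc[u] + Qs[u] @ hs for u in range(len(Qo))],
                        axis=0)                      # step 3b: (4.32)
        if np.array_equal(g_h, g):                   # step 4: converged
            return g, J, hs
        g = g_h
    return g, J, hs
```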
5. Examples.
In this section we illustrate the new dynamic programming equations and the corresponding policy iteration algorithms through examples. In particular, in Section 5.1 we present an application of the infinite horizon minimax problem for average cost by employing policy iteration Algorithm 3.9, and in Section 5.2 we present an application of the infinite horizon minimax problem for average cost by employing policy iteration Algorithm 4.6. The essential difference between the two examples is that the MDP of the latter is described by a transition probability graph which is reducible.
5.1. Example using policy iteration Algorithm 3.9.
Here, we illustrate an application of the infinite horizon minimax problem for average cost by considering the stochastic control system shown in Fig. 3.1, with state space $\mathcal{X} = \{1, 2, 3\}$ and control set $\mathcal{U} = \{u_1, u_2\}$. Assume that the nominal transition probabilities under controls $u_1$ and $u_2$ are given by
\[
Q^o(u_1) = \frac{1}{9}\begin{bmatrix} \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \end{bmatrix}, \qquad Q^o(u_2) = \frac{1}{9}\begin{bmatrix} \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \end{bmatrix}, \tag{5.1}
\]
the total variation distance radius is $R = 6/9$, and the cost function under each state and action is
\[
f(1, u_1) = 2, \quad f(2, u_1) = 1, \quad f(3, u_1) = 3, \quad f(1, u_2) = 0.5, \quad f(2, u_2) = 3, \quad f(3, u_2) = 0.
\]
To obtain an optimal stationary policy of the infinite horizon minimax problem for average cost, policy iteration Algorithm 3.9 is applied.
A. Let $m = 0$.
1) Select the initial policy as follows: $g^0(1) = u_1$, $g^0(2) = u_1$, $g^0(3) = u_1$.
2) Solve the equation $J_{Q^o}(g^0)e + V_{Q^o}(g^0) = f(g^0) + Q^o(g^0)V_{Q^o}(g^0)$ for $J_{Q^o}(g^0) \in \mathbb{R}$ and $V_{Q^o}(g^0) \in \mathbb{R}^3$, where $Q^o(g^0)$ is formed from the rows of (5.1). Since $V_{Q^o}(g^0)$ is uniquely determined up to an additive constant, let $V_{Q^o}(g^0, 3) = 0$. The solution is
\[
\big(V_{Q^o}(g^0, 1),\ V_{Q^o}(g^0, 2),\ V_{Q^o}(g^0, 3)\big) = (\,\cdot\,,\ \cdot\,,\ 0), \qquad J_{Q^o}(g^0) = 1.\ldots
\]
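For the evaluation step above, the equation $Je + V = f + QV$ with the normalization $V(3) = 0$ is a square linear system whenever the transition matrix is irreducible, as it is throughout this example. The following sketch is ours, not the authors' code, and the helper name `solve_average_cost` is an assumption:

```python
import numpy as np

def solve_average_cost(Q, f, norm_state=2):
    """Solve J e + V = f + Q V with the normalization V[norm_state] = 0,
    valid when Q is irreducible (unichain), so that J is a scalar.
    Unknowns: (J, V) with the normalized component of V eliminated."""
    n = Q.shape[0]
    A = np.hstack([np.ones((n, 1)), np.eye(n) - Q])   # J e + (I - Q) V = f
    A = np.delete(A, 1 + norm_state, axis=1)          # enforce V[norm] = 0
    sol = np.linalg.solve(A, f)
    J, V = sol[0], np.insert(sol[1:], norm_state, 0.0)
    return J, V

# Usage (hypothetical names): J, V = solve_average_cost(Q_of_g0, f_of_g0)
```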
Note that $V_{Q^o} \triangleq \{V_{Q^o}(1), V_{Q^o}(2), V_{Q^o}(3)\}$, $|\mathcal{X}| = 3$, and hence
\begin{align*}
\mathcal{X}^+ &\triangleq \big\{x \in \mathcal{X} : V_{Q^o}(x) = \max\{V_{Q^o}(x) : x \in \mathcal{X}\}\big\} = \{x \in \mathcal{X} : V_{Q^o}(x) = V_{Q^o}(2)\} = \{2\}, \\
\mathcal{X}^- &\triangleq \big\{x \in \mathcal{X} : V_{Q^o}(x) = \min\{V_{Q^o}(x) : x \in \mathcal{X}\}\big\} = \{x \in \mathcal{X} : V_{Q^o}(x) = V_{Q^o}(3)\} = \{3\}, \\
\mathcal{X}^0 &\triangleq \big\{x \in \mathcal{X} : V_{Q^o}(x) = \min\{V_{Q^o}(\alpha) : \alpha \in \mathcal{X} \setminus \mathcal{X}^+ \cup \mathcal{X}^-\}\big\} = \{x \in \mathcal{X} : V_{Q^o}(x) = V_{Q^o}(1)\} = \{1\}.
\end{align*}
Once the partition has been identified, (3.26) is applied to obtain (5.2) and (5.3): for each $x \in \mathcal{X}$ and $u \in \mathcal{U}$,
\[
q^*(2|x,u) = \min\big(1,\, q^o(2|x,u) + \tfrac{R}{2}\big), \quad
q^*(3|x,u) = \big(q^o(3|x,u) - \tfrac{R}{2}\big)^+, \quad
q^*(1|x,u) = \Big(q^o(1|x,u) - \big(\tfrac{R}{2} - q^o(3|x,u)\big)^+\Big)^+,
\]
which gives the matrices $Q^*(u_1)$ in (5.2) and $Q^*(u_2)$ in (5.3), both with entries in multiples of $1/9$. The transition probability graph of $Q^*$ under controls $u_1$ and $u_2$ is depicted in Fig. 5.1 ((a) matrix $Q^*$ under control $u_1$; (b) matrix $Q^*$ under control $u_2$). Note that, since every state can reach every other state, the matrix $Q^*(u)$ remains irreducible under both controls. Next, we proceed to solve the equation $J_{Q^*}(g^0)e + V_{Q^*}(g^0) = f(g^0) + Q^*(g^0)V_{Q^*}(g^0)$ for $J_{Q^*}(g^0) \in \mathbb{R}$ and $V_{Q^*}(g^0) \in \mathbb{R}^3$. Since $V_{Q^*}(g^0)$ is uniquely determined up to an additive constant, let $V_{Q^*}(g^0, 3) = 0$. The solution is
\[
\big(V_{Q^*}(g^0, 1),\ V_{Q^*}(g^0, 2),\ V_{Q^*}(g^0, 3)\big) = (\,\cdot\,,\ \cdot\,,\ 0), \qquad J_{Q^*}(g^0) = 2.\ldots
\]
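The water-filling expressions of (5.2)-(5.3) can be sketched in a few lines. This is our illustrative code, not the authors', and it hard-codes the case, as in these examples, where each of the support sets $\mathcal{X}^+$, $\mathcal{X}^-$, $\mathcal{X}^0$ is a singleton. For instance, applied to the nominal row $(0, 0, 1)$ of the example of Section 5.2 with $R = 14/9$, it returns $(0, 7/9, 2/9)$, the third row of (5.7):

```python
import numpy as np

def q_star_row(q, R, i_plus, i_minus, i_zero):
    """Water-filling maximizer over the total variation ball of radius R
    around the nominal row q, for singleton support sets
    X+ = {i_plus}, X- = {i_minus}, X0 = {i_zero}: probability mass R/2
    is shifted towards the state with the largest cost-to-go, taken
    first from the state with the smallest one."""
    out = np.zeros_like(q, dtype=float)
    out[i_plus] = min(1.0, q[i_plus] + R / 2)            # add up to R/2
    out[i_minus] = max(0.0, q[i_minus] - R / 2)          # remove up to R/2
    out[i_zero] = max(0.0, q[i_zero] - max(0.0, R / 2 - q[i_minus]))
    return out

# Third row of Q^o(u_1) in Section 5.2, with X+ = {2}, X- = {3}, X0 = {1}
# (0-based indices 1, 2, 0):
print(q_star_row(np.array([0.0, 0.0, 1.0]), 14 / 9, 1, 2, 0))
# [0.         0.77777778 0.22222222]  i.e. (0, 7/9, 2/9)
```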
3) Let $g^1 = \arg\min_{g \in \mathbb{R}^{|\mathcal{X}|}}\{f(g) + Q^*(g)V_{Q^*}(g^0)\}$. Then
\[
g^1(1) = \arg\min\Big\{f(1, u_1) + \sum_{z \in \mathcal{X}} q^*(z|1, u_1)\, V_{Q^*}(g^0, z),\ \ f(1, u_2) + \sum_{z \in \mathcal{X}} q^*(z|1, u_2)\, V_{Q^*}(g^0, z)\Big\},
\]
and following a similar procedure for the remaining states we obtain $g^1(2)$ and $g^1(3)$. Since $g^1 \neq g^0$, let $m = 1$ and return to step 2.
B. Let $m = 1$.
2) Solve the equation $J_{Q^o}(g^1)e + V_{Q^o}(g^1) = f(g^1) + Q^o(g^1)V_{Q^o}(g^1)$, with $V_{Q^o}(g^1, 3) = 0$, for $J_{Q^o}(g^1) \in \mathbb{R}$ and $V_{Q^o}(g^1) \in \mathbb{R}^3$. The solution is
\[
\big(V_{Q^o}(g^1, 1),\ V_{Q^o}(g^1, 2),\ V_{Q^o}(g^1, 3)\big) = (\,\cdot\,,\ \cdot\,,\ 0), \qquad J_{Q^o}(g^1) = 0.\ldots
\]
Therefore, $\mathcal{X}^+ = \{2\}$, $\mathcal{X}^- = \{3\}$ and $\mathcal{X}^0 = \{1\}$. Since the partition is the same as for $m = 0$, $Q^*(u_1)$ and $Q^*(u_2)$ are given by (5.2) and (5.3), respectively. Solve the equation $J_{Q^*}(g^1)e + V_{Q^*}(g^1) = f(g^1) + Q^*(g^1)V_{Q^*}(g^1)$, with $V_{Q^*}(g^1, 3) = 0$, for $J_{Q^*}(g^1) \in \mathbb{R}$ and $V_{Q^*}(g^1) \in \mathbb{R}^3$. The solution is
\[
\big(V_{Q^*}(g^1, 1),\ V_{Q^*}(g^1, 2),\ V_{Q^*}(g^1, 3)\big) = (\,\cdot\,,\ \cdot\,,\ 0), \qquad J_{Q^*}(g^1) = 0.\ldots
\]
3) Let $g^2 = \arg\min_{g \in \mathbb{R}^{|\mathcal{X}|}}\{f(g) + Q^*(g)V_{Q^*}(g^1)\}$. Carrying out the same minimization as in step 3 of iteration $m = 0$ returns the same controls as before, so that $g^2(1) = g^1(1)$, $g^2(2) = g^1(2)$ and $g^2(3) = g^1(3)$.
4) Since $g^2 = g^1$, $g^* = g^1$ is an optimal control policy with $J_{Q^*} = 0.\ldots$, $V_{Q^*}(1) = 0.\ldots$, $V_{Q^*}(2) = 1.\ldots$ and $V_{Q^*}(3) = 0$.
5.2. Example using general policy iteration Algorithm 4.6.
In this example, we illustrate an application of the infinite horizon minimax problem for average cost by considering the stochastic control system shown in Fig. 3.1, with $\mathcal{X} = \{1, 2, 3\}$ and control set $\mathcal{U} = \{u_1, u_2\}$. The essential difference between this example and the previous one is that here the stochastic control system under consideration is described by a transition probability graph which is reducible, and hence general policy iteration Algorithm 4.6 is applied. Assume that the nominal transition probabilities under controls $u_1$ and $u_2$ are given by
\[
Q^o(u_1) = \frac{1}{9}\begin{bmatrix} 0 & 5 & 4 \\ 0 & 9 & 0 \\ 0 & 0 & 9 \end{bmatrix}, \qquad Q^o(u_2) = \frac{1}{9}\begin{bmatrix} 2 & 7 & 0 \\ \cdot & \cdot & \cdot \\ 8 & 0 & 1 \end{bmatrix}, \tag{5.4}
\]
the total variation distance radius is $R = 14/9$, and the cost function under each state and action is
\[
f(1, u_1) = 2, \quad f(2, u_1) = 1, \quad f(3, u_1) = 3, \quad f(1, u_2) = 0.5, \quad f(2, u_2) = 3, \quad f(3, u_2) = 0.
\]
A. Let $m = 0$.
1) Select the initial policy as follows: $g^0(1) = u_1$, $g^0(2) = u_1$, $g^0(3) = u_1$.
2) (Policy Evaluation) Solve the equation $J_{Q^o}(g^0) = Q^o(g^0)J_{Q^o}(g^0)$. The optimality equations (4.27) are
\begin{align}
J_{Q^o}(g^0, 1) &= \tfrac{5}{9} J_{Q^o}(g^0, 2) + \tfrac{4}{9} J_{Q^o}(g^0, 3), \tag{5.5a} \\
J_{Q^o}(g^0, 2) &= J_{Q^o}(g^0, 2), \tag{5.5b} \\
J_{Q^o}(g^0, 3) &= J_{Q^o}(g^0, 3). \tag{5.5c}
\end{align}
Next, solve the equation $J_{Q^o}(g^0) + h_{Q^o}(g^0) = f(g^0) + Q^o(g^0)h_{Q^o}(g^0)$ for $J_{Q^o}(g^0) \in \mathbb{R}^3$ and $h_{Q^o}(g^0) \in \mathbb{R}^3$. The optimality equations (4.28) are given by
\begin{align}
J_{Q^o}(g^0, 1) + h_{Q^o}(g^0, 1) &= 2 + \tfrac{5}{9} h_{Q^o}(g^0, 2) + \tfrac{4}{9} h_{Q^o}(g^0, 3), \tag{5.6a} \\
J_{Q^o}(g^0, 2) + h_{Q^o}(g^0, 2) &= 1 + h_{Q^o}(g^0, 2), \tag{5.6b} \\
J_{Q^o}(g^0, 3) + h_{Q^o}(g^0, 3) &= 3 + h_{Q^o}(g^0, 3). \tag{5.6c}
\end{align}
The solution of (5.5) and (5.6) is
\[
h_{Q^o}(g^0, 1) = \tfrac{1}{9} + \tfrac{5}{9}\alpha + \tfrac{4}{9}\beta, \quad h_{Q^o}(g^0, 2) = \alpha, \quad h_{Q^o}(g^0, 3) = \beta,
\]
\[
J_{Q^o}(g^0, 1) = \tfrac{17}{9}, \quad J_{Q^o}(g^0, 2) = 1, \quad J_{Q^o}(g^0, 3) = 3.
\]
Setting $\alpha = 1$ and $\beta = 0$ (arbitrary constants) yields
\[
h_{Q^o}(g^0, 1) = 0.67, \quad h_{Q^o}(g^0, 2) = 1, \quad h_{Q^o}(g^0, 3) = 0.
\]
Note that $h_{Q^o} = \{h_{Q^o}(1), h_{Q^o}(2), h_{Q^o}(3)\}$, and hence the support sets based on the values of $h_{Q^o}$ are $\mathcal{X}^+ = \{2\}$, $\mathcal{X}^- = \{3\}$ and $\mathcal{X}^0 = \{1\}$. Once the partition has been identified, (3.26) is applied (the same entrywise water-filling expressions as in (5.2)-(5.3)) to obtain
\[
Q^*(u_1) = \frac{1}{9}\begin{bmatrix} 0 & 9 & 0 \\ 0 & 9 & 0 \\ 0 & 7 & 2 \end{bmatrix}, \tag{5.7}
\]
\[
Q^*(u_2) = \frac{1}{9}\begin{bmatrix} 0 & 9 & 0 \\ \cdot & \cdot & \cdot \\ 2 & 7 & 0 \end{bmatrix}. \tag{5.8}
\]
Next, solve the equation $J_{Q^*}(g^0) = Q^*(g^0)J_{Q^*}(g^0)$. The optimality equations (4.29) are
\begin{align}
J_{Q^*}(g^0, 1) &= J_{Q^*}(g^0, 2), \tag{5.9a} \\
J_{Q^*}(g^0, 2) &= J_{Q^*}(g^0, 2), \tag{5.9b} \\
J_{Q^*}(g^0, 3) &= \tfrac{7}{9} J_{Q^*}(g^0, 2) + \tfrac{2}{9} J_{Q^*}(g^0, 3), \tag{5.9c}
\end{align}
and hence $J_{Q^*}(g^0, 1) = J_{Q^*}(g^0, 2) = J_{Q^*}(g^0, 3)$. Next, solve the equation $J_{Q^*}(g^0) + h_{Q^*}(g^0) = f(g^0) + Q^*(g^0)h_{Q^*}(g^0)$ for $J_{Q^*}(g^0) \in \mathbb{R}^3$ and $h_{Q^*}(g^0) \in \mathbb{R}^3$. The optimality equations (4.30) are given by
\begin{align}
J_{Q^*}(g^0, 1) + h_{Q^*}(g^0, 1) &= 2 + h_{Q^*}(g^0, 2), \tag{5.10a} \\
J_{Q^*}(g^0, 2) + h_{Q^*}(g^0, 2) &= 1 + h_{Q^*}(g^0, 2), \tag{5.10b} \\
J_{Q^*}(g^0, 3) + \tfrac{7}{9} h_{Q^*}(g^0, 3) &= 3 + \tfrac{7}{9} h_{Q^*}(g^0, 2). \tag{5.10c}
\end{align}
The solution of (5.9) and (5.10) is
\[
h_{Q^*}(g^0, 1) = 1 + \alpha, \quad h_{Q^*}(g^0, 2) = \alpha, \quad h_{Q^*}(g^0, 3) = \tfrac{18}{7} + \alpha,
\]
\[
J_{Q^*}(g^0, 1) = 1, \quad J_{Q^*}(g^0, 2) = 1, \quad J_{Q^*}(g^0, 3) = 1.
\]
Setting $\alpha = 1$ (arbitrary constant) yields
\[
h_{Q^*}(g^0, 1) = 2, \quad h_{Q^*}(g^0, 2) = 1, \quad h_{Q^*}(g^0, 3) = 3.57.
\]
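The pair (4.29)-(4.30) for a fixed policy can be solved numerically as one stacked linear system. Below is a minimal NumPy sketch, not the authors' code, that reproduces the $m = 0$ values above from the data $Q^*(g^0)$ (the rows of (5.7)) and $f(g^0) = (2, 1, 3)$; the normalization $h_{Q^*}(g^0, 2) = 1$ plays the role of the arbitrary constant $\alpha = 1$:

```python
import numpy as np

# Rows of Q*(g^0) for g^0 = (u1, u1, u1), i.e., the matrix (5.7).
Q = np.array([[0, 9, 0], [0, 9, 0], [0, 7, 2]]) / 9
f = np.array([2.0, 1.0, 3.0])        # f(x, u1) for x = 1, 2, 3
n = len(f)
I = np.eye(n)

# Stack (I - Q) J = 0, J + (I - Q) h = f, and h[1] = 1 (alpha = 1).
A = np.zeros((2 * n + 1, 2 * n))
b = np.zeros(2 * n + 1)
A[:n, :n] = I - Q
A[n:2 * n, :n] = I
A[n:2 * n, n:] = I - Q
b[n:2 * n] = f
A[-1, n + 1] = 1.0
b[-1] = 1.0

x, *_ = np.linalg.lstsq(A, b, rcond=None)
J, h = x[:n], x[n:]
print(J)   # [1. 1. 1.]
print(h)   # [2.    1.    3.571...]  i.e. (2, 1, 18/7 + 1)
```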
3) a) Since $J_{Q^*}(g^0, 1) = J_{Q^*}(g^0, 2) = J_{Q^*}(g^0, 3)$, clearly $g^1 = g^0$ and we proceed to step 3b).
b) Let $g^1 = \arg\min_{g \in \mathbb{R}^{|\mathcal{X}|}}\{f(g) + Q^*(g)h_{Q^*}(g^0)\}$; the resulting control policy is $g^1(1) = u_2$, $g^1(2) = u_1$ and $g^1(3) = u_2$. Since $g^1 \neq g^0$, let $m = 1$ and return to step 2.
B. Let $m = 1$.
2) (Policy Evaluation) Solve the equation $J_{Q^o}(g^1) = Q^o(g^1)J_{Q^o}(g^1)$. The optimality equations (4.27) are
\begin{align}
J_{Q^o}(g^1, 1) &= \tfrac{2}{9} J_{Q^o}(g^1, 1) + \tfrac{7}{9} J_{Q^o}(g^1, 2), \tag{5.11a} \\
J_{Q^o}(g^1, 2) &= J_{Q^o}(g^1, 2), \tag{5.11b} \\
J_{Q^o}(g^1, 3) &= \tfrac{8}{9} J_{Q^o}(g^1, 1) + \tfrac{1}{9} J_{Q^o}(g^1, 3), \tag{5.11c}
\end{align}
and hence $J_{Q^o}(g^1, 1) = J_{Q^o}(g^1, 2) = J_{Q^o}(g^1, 3)$. Next, solve the equation $J_{Q^o}(g^1) + h_{Q^o}(g^1) = f(g^1) + Q^o(g^1)h_{Q^o}(g^1)$ for $J_{Q^o}(g^1) \in \mathbb{R}^3$ and $h_{Q^o}(g^1) \in \mathbb{R}^3$. The optimality equations (4.28) are given by
\begin{align}
J_{Q^o}(g^1, 1) + \tfrac{7}{9} h_{Q^o}(g^1, 1) &= 0.5 + \tfrac{7}{9} h_{Q^o}(g^1, 2), \tag{5.12a} \\
J_{Q^o}(g^1, 2) + h_{Q^o}(g^1, 2) &= 1 + h_{Q^o}(g^1, 2), \tag{5.12b} \\
J_{Q^o}(g^1, 3) + \tfrac{8}{9} h_{Q^o}(g^1, 3) &= \tfrac{8}{9} h_{Q^o}(g^1, 1). \tag{5.12c}
\end{align}
The solution of (5.11) and (5.12) is
\[
h_{Q^o}(g^1, 1) = \alpha + \tfrac{9}{8}, \quad h_{Q^o}(g^1, 2) = \alpha + \tfrac{99}{56}, \quad h_{Q^o}(g^1, 3) = \alpha,
\]
\[
J_{Q^o}(g^1, 1) = 1, \quad J_{Q^o}(g^1, 2) = 1, \quad J_{Q^o}(g^1, 3) = 1.
\]
Setting $\alpha = 1$ (arbitrary constant) yields
\[
h_{Q^o}(g^1, 1) = 2.13, \quad h_{Q^o}(g^1, 2) = 2.77, \quad h_{Q^o}(g^1, 3) = 1.
\]
Hence, we proceed with the identification of the support sets, which are $\mathcal{X}^+ = \{2\}$, $\mathcal{X}^- = \{3\}$ and $\mathcal{X}^0 = \{1\}$. Since the partition is the same as for $m = 0$, $Q^*(u_1)$ and $Q^*(u_2)$ are equal to (5.7) and (5.8), respectively.
Next, solve the equation $J_{Q^*}(g^1) = Q^*(g^1)J_{Q^*}(g^1)$. The optimality equations (4.29) are
\begin{align}
J_{Q^*}(g^1, 1) &= J_{Q^*}(g^1, 2), \tag{5.13a} \\
J_{Q^*}(g^1, 2) &= J_{Q^*}(g^1, 2), \tag{5.13b} \\
J_{Q^*}(g^1, 3) &= \tfrac{2}{9} J_{Q^*}(g^1, 1) + \tfrac{7}{9} J_{Q^*}(g^1, 2), \tag{5.13c}
\end{align}
and hence $J_{Q^*}(g^1, 1) = J_{Q^*}(g^1, 2) = J_{Q^*}(g^1, 3)$. Next, solve the equation $J_{Q^*}(g^1) + h_{Q^*}(g^1) = f(g^1) + Q^*(g^1)h_{Q^*}(g^1)$ for $J_{Q^*}(g^1) \in \mathbb{R}^3$ and $h_{Q^*}(g^1) \in \mathbb{R}^3$. The optimality equations (4.30) are given by
\begin{align}
J_{Q^*}(g^1, 1) + h_{Q^*}(g^1, 1) &= 0.5 + h_{Q^*}(g^1, 2), \tag{5.14a} \\
J_{Q^*}(g^1, 2) + h_{Q^*}(g^1, 2) &= 1 + h_{Q^*}(g^1, 2), \tag{5.14b} \\
J_{Q^*}(g^1, 3) + h_{Q^*}(g^1, 3) &= \tfrac{2}{9} h_{Q^*}(g^1, 1) + \tfrac{7}{9} h_{Q^*}(g^1, 2). \tag{5.14c}
\end{align}
The solution of (5.13) and (5.14) is
\[
h_{Q^*}(g^1, 1) = \alpha + \tfrac{11}{18}, \quad h_{Q^*}(g^1, 2) = \alpha + \tfrac{10}{9}, \quad h_{Q^*}(g^1, 3) = \alpha,
\]
\[
J_{Q^*}(g^1, 1) = 1, \quad J_{Q^*}(g^1, 2) = 1, \quad J_{Q^*}(g^1, 3) = 1.
\]
Setting $\alpha = 1$ yields
\[
h_{Q^*}(g^1, 1) = 1.61, \quad h_{Q^*}(g^1, 2) = 2.11, \quad h_{Q^*}(g^1, 3) = 1.
\]
3) a) Since $J_{Q^*}(g^1, 1) = J_{Q^*}(g^1, 2) = J_{Q^*}(g^1, 3)$, clearly $g^2 = g^1$ and we proceed to step 3b).
b) Let $g^2 = \arg\min_{g \in \mathbb{R}^{|\mathcal{X}|}}\{f(g) + Q^*(g)h_{Q^*}(g^1)\}$; the resulting control policy is $g^2(1) = u_2$, $g^2(2) = u_1$ and $g^2(3) = u_2$.
4) Because $g^2 = g^1$, $g^* = g^1$ is an optimal control policy with
\[
J_{Q^*}(1) = J_{Q^*}(2) = J_{Q^*}(3) = 1, \quad h_{Q^*}(1) = 1.61, \quad h_{Q^*}(2) = 2.11, \quad h_{Q^*}(3) = 1.
\]
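As a sanity check on the optimal average cost, note that under $g^* = (u_2, u_1, u_2)$ the maximizing matrix $Q^*(g^*)$, with rows taken from (5.7)-(5.8), absorbs all probability mass into state 2, whose one-step cost is $f(2, u_1) = 1$. The short NumPy sketch below (ours, not the authors') confirms that the stationary distribution concentrates on state 2 and that the average cost $\pi^\top f = 1$ matches $J_{Q^*}$:

```python
import numpy as np

# Rows of Q*(g*) for g* = (u2, u1, u2): row 1 of (5.8), row 2 of (5.7),
# row 3 of (5.8).
Q = np.array([[0, 9, 0], [0, 9, 0], [2, 7, 0]]) / 9
f = np.array([0.5, 1.0, 0.0])        # f(x, g*(x)) for x = 1, 2, 3

# Stationary distribution: solve pi (Q - I) = 0 with sum(pi) = 1.
A = np.vstack([(Q - np.eye(3)).T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)       # [0. 1. 0.] -- state 2 is absorbing under Q*(g*)
print(pi @ f)   # 1.0, the optimal average cost J_Q*
```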
6. Conclusions.
In this paper, we examined the optimality of minimax MDPs via dynamic programming on an infinite horizon, when the ambiguity class is described by the total variation distance between the conditional distribution of the true controlled process and the conditional distribution of a nominal controlled process. As optimality criterion, we considered the average pay-off per unit time. Under the assumption that for every stationary Markov control policy the maximizing stochastic matrix is irreducible, we derived a new dynamic programming equation and a new policy iteration algorithm. However, due to the water-filling behavior of the maximizing conditional distribution, it turns out that this method of solution is limited to a specific range of values of the total variation distance parameter. To circumvent this limitation, we considered general Borel spaces and derived a generalized dynamic programming equation, by introducing a pair of dynamic programming equations, and, consequently, a new policy iteration algorithm, which solves the minimax MDP for all $R \in [0, 2]$. Finally, the application of the proposed policy iteration algorithms is shown via illustrative examples.

REFERENCES
[1] A. Arapostathis, V. S. Borkar, E. Fernandez-Gaucherand, M. K. Ghosh, and S. I. Marcus, Discrete-time controlled Markov processes with average cost criterion: a survey, SIAM J. Control Optim., 31 (1993), pp. 282-344.
[2] V. S. Borkar, On minimum cost per unit time control of Markov chains, SIAM J. Control Optim., 22 (1984), pp. 965-978.
[3] V. S. Borkar, Control of Markov chains with long-run average cost criterion, Stochastic Differential Systems, Stochastic Control Theory and Applications, (1988), pp. 57-77.
[4] V. S. Borkar, Control of Markov chains with long-run average cost criterion: the dynamic programming equations, SIAM J. Control Optim., 27 (1989), pp. 642-657.
[5] P. E. Caines, Linear Stochastic Systems, John Wiley & Sons, Inc., New York, 1988.
[6] C. D. Charalambous, I. Tzortzis, and T. Charalambous, Dynamic programming with total variational distance uncertainty, in 51st IEEE Conference on Decision and Control, Maui, Hawaii, Dec. 10-13, 2012.
[7] C. D. Charalambous, I. Tzortzis, S. Loyka, and T. Charalambous, Extremum problems with total variation distance and their applications, IEEE Trans. Autom. Control, 59 (2014), pp. 2353-2368.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley and Sons, Inc., 1991.
[9] E. B. Dynkin and A. A. Yushkevich, Controlled Markov Processes, Springer-Verlag, New York, 1979.
[10] J. Flynn, Conditions for the equivalence of optimality criteria in dynamic programming, Ann. Statist., 4 (1976), pp. 936-953.
[11] J. Flynn, On optimality criteria for dynamic programs with long finite horizons, J. Math. Anal. Appl., 76 (1980), pp. 202-208.
[12] O. Hernandez-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria, vol. 1 of Applications of Mathematics: Stochastic Modelling and Applied Probability, Springer-Verlag, 1996.
[13] O. Hernandez-Lerma and J. B. Lasserre, Policy iteration for average cost Markov control processes on Borel spaces, Acta Applicandae Mathematica, 47 (1997), pp. 125-154.
[14] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control, Prentice Hall, 1986.
[15] S. P. Meyn, The policy improvement algorithm for Markov decision processes with general state space, IEEE Trans. Autom. Control, 42 (1997), pp. 1663-1680.
[16] M. L. Puterman, Markov Decision Processes, Wiley, New York, 1994.
[17] M. Schal, On the second optimality equation for semi-Markov decision models, Math. Oper. Res., 17 (1992), pp. 470-486.
[18] L. I. Sennott, Another set of conditions for average optimality in Markov control processes, Systems and Control Letters, 24 (1995), pp. 147-151.
[19] I. Tzortzis, C. D. Charalambous, and T. Charalambous, Dynamic programming subject to total variation distance ambiguity, ArXiv e-prints, (2014).
[20] J. H. van Schuppen, Mathematical control and system theory of discrete-time stochastic systems, preprint, 2014.
[21] A. A. Yushkevich.