A variational formula for risk-sensitive reward
V. ANANTHARAM and V. S. BORKAR

ABSTRACT:
We derive a variational formula for the optimal growth rate of reward in the infinite horizon risk-sensitive control problem for discrete time Markov decision processes with compact metric state and action spaces, extending a formula of Donsker and Varadhan for the Perron-Frobenius eigenvalue of a positive operator. This leads to a concave maximization formulation of the problem of determining this optimal growth rate.
Key words: risk-sensitive control; Perron-Frobenius eigenvalue; positive operators; variational formula

EECS Department, University of California, Berkeley, CA 94720, USA. Research supported in part by the ARO MURI grant W911NF-08-1-0233, Tools for the Analysis and Design of Complex Multi-Scale Networks, the NSF grants CNS-0910702 and ECCS-1343398, and the NSF Science & Technology Center grant CCF-0939370, Science of Information. A part of this work was done while this author was visiting IIT Bombay.

Department of Elec. Engg., IIT Bombay, Powai, Mumbai 400076, India. Work supported in part by a J. C. Bose Fellowship and grant 11IRCCSG014 from IIT Bombay. A part of this work was done while this author was visiting the University of California, Berkeley.

1 Introduction
Infinite time horizon risk-sensitive control seeks to maximize the asymptotic growth rate for mean multiplicative reward in the standard Markov decision theory setting. The optimal reward multiplier per step turns out to be the Perron-Frobenius eigenvalue of a positive 1-homogeneous nonlinear operator. The existence of this Perron-Frobenius eigenvalue and an associated eigenfunction is ensured by the nonlinear Krein-Rutman theorem of [37, Theorem 3.1.1 and Proposition 3.1.5] under suitable conditions (see also [36], [33], [32], [12], [3]). Our aim here is to build on this nonlinear Krein-Rutman theorem to provide a variational formula for the optimal growth rate of reward in the spirit of the Donsker-Varadhan formula for the Perron-Frobenius eigenvalue of a nonnegative matrix [15, section 3.1.2], [18], [22].

Risk-sensitive control has traditionally been studied in the framework of cost minimization, see e.g. [16], [26], [27] for recent work on general state space models and [20], [24] for its discrete state space precursors. Work on risk-sensitive reward maximization has been relatively uncommon, see e.g. [28]. Unlike in the case of the classical discounted or ergodic costs, the two risk-sensitive control problems are not trivially equivalent by treating cost as a negative reward. In fact, risk-sensitive reward maximization is the natural set-up in portfolio optimization, see e.g. [13]. Nevertheless, it has been commonplace to replace it by risk-sensitive cost minimization so as to exploit the vastly more abundant machinery available for the latter problem, see, e.g., equation (18) of [6]. Interestingly, our approach is tailored for the risk-sensitive reward maximization problem.

The paper is organized as follows. This section presents the basic notation and control-theoretic framework. In section 2 we develop the role of the nonlinear Krein-Rutman theorem in giving an expression for the optimal reward multiplier per stage. In section 3 this is parlayed into a variational expression for the optimal growth rate of reward. Theorem 4 in section 3 is the main result of this paper. Alternative variational formulations derived from the primary one are discussed in section 4; each of these provides a different kind of insight into how to think about the optimal growth rate of reward. Some examples are worked out in section 5 to illustrate the nature of the results. We close the paper with some concluding remarks in section 6.

We turn next to introducing our notation and the control-theoretic framework. For a compact metric space $\mathcal{X}$, $\mathcal{M}(\mathcal{X})$ and $\mathcal{P}(\mathcal{X})$ will denote respectively the space of finite (signed) Borel measures on $\mathcal{X}$ and the space of probability measures on $\mathcal{X}$, both with the topology of weak convergence [9]. $C(\mathcal{X})$ will denote the Banach space of continuous maps $\mathcal{X} \mapsto \mathbb{R}$ with the supremum norm, denoted by $\|\cdot\|$. Thus $\mathcal{M}(\mathcal{X})$ is the dual Banach space of $C(\mathcal{X})$, with the weak-* topology [39]. Let $\mathcal{S}$ be a prescribed compact metric space called the state space and $U$ another compact metric space, called the action space. We shall consider an $\mathcal{S}$-valued controlled Markov process $(X_n,\, n \ge 0)$ and a $U$-valued control process $(Z_n,\, n \ge 0)$, defined as follows.
Consider a complete probability space $(\Omega, \mathcal{F}, P)$ where $\Omega := (\mathcal{S} \times U)^{\infty}$, and $\mathcal{F}$ is its product Borel $\sigma$-field. For $\omega = [(\omega_0, \omega'_0), (\omega_1, \omega'_1), (\omega_2, \omega'_2), \cdots] \in \Omega$ with $\omega_i \in \mathcal{S}$ and $\omega'_i \in U$ $\forall i$, define 'canonical' random variables $X_i = \omega_i$, $Z_i = \omega'_i$, $i \ge 0$. The probability measure $P$ on $(\Omega, \mathcal{F})$ is then the law of $((X_n, Z_n),\, n \ge 0)$ defined as follows. The law of $X_0$ is prescribed and the law of $((X_n, Z_n),\, n \ge 0)$ is constructed inductively. For this purpose, define two increasing families of sub-$\sigma$-fields of $\mathcal{F}$: $\mathcal{F}^-_n := \sigma(X_m,\, m \le n;\ Z_m,\, m < n)$ and $\mathcal{F}_n := \sigma(X_m,\, m \le n;\ Z_m,\, m \le n)$ for $n \ge 0$. First define the conditional law of $Z_0$ given $\mathcal{F}^-_0$ as $\phi_0(du\,|\,X_0)$, where
$$\phi_0(du\,|\,x) : \mathcal{S} \mapsto \mathcal{P}(U)$$
is a prescribed kernel, i.e. $\phi_0(du\,|\,x)$ is a probability distribution in $\mathcal{P}(U)$ for all $x$ and $\phi_0(A\,|\,x)$ is Borel measurable in $x$ for all Borel subsets $A \subset U$. Let $P_n$ denote the law of $((X_0, Z_0), (X_1, Z_1), \cdots, (X_n, Z_n))$, defined as a probability measure on $(\Omega, \mathcal{F}_n)$, starting with $n = 0$. Define the law of $X_{n+1}$ given $\mathcal{F}_n$ as $p(dy\,|\,X_n, Z_n)$ where
$$p(dy\,|\,x, u) : \mathcal{S} \times U \mapsto \mathcal{P}(\mathcal{S})$$
is a prescribed kernel, i.e. $p(dy\,|\,x,u)$ is a probability distribution in $\mathcal{P}(\mathcal{S})$ for all $(x,u) \in \mathcal{S} \times U$ and $p(A\,|\,x,u)$ is Borel measurable in $(x,u)$ for all Borel subsets $A \subset \mathcal{S}$. Define the conditional law of $Z_{n+1}$ given $\mathcal{F}^-_{n+1}$ as
$$\phi_{n+1}(du\,|\,(X_0, Z_0), \cdots, (X_n, Z_n), X_{n+1})$$
where
$$\phi_{n+1}(du\,|\,(x_0, u_0), \cdots, (x_n, u_n), x_{n+1}) : (\mathcal{S} \times U)^{n+1} \times \mathcal{S} \mapsto \mathcal{P}(U)$$
is a prescribed kernel for each $n$. These together define $P_{n+1}$. By the Ionescu-Tulcea theorem (p. 101, [38]), we define a unique $P$ on $(\Omega, \mathcal{F})$. By construction, for all Borel $A \subset \mathcal{S}$,
$$P(X_{n+1} \in A \,|\, \mathcal{F}_n) = P(X_{n+1} \in A \,|\, X_n, Z_n) = p(A\,|\,X_n, Z_n). \qquad (1)$$

The $(Z_n,\, n \ge 0)$ constructed above will be referred to as admissible controls. We shall also consider two special classes of admissible controls: stationary Markov controls of the form
$$Z_n = v(X_n)\ \ \forall n,$$
for some measurable $v : \mathcal{S} \mapsto U$, and randomized stationary Markov controls satisfying
$$P(Z_n \in A\,|\,\mathcal{F}^-_n) = P(Z_n \in A\,|\,X_n) = \varphi(A\,|\,X_n)\ \ \forall n,\ \forall \text{ Borel } A \subset U,$$
for some kernel $\varphi(du\,|\,x) : \mathcal{S} \mapsto \mathcal{P}(U)$. By a standard abuse of terminology, we identify these with the maps $v(\cdot)$, $\varphi(\cdot\,|\,\cdot)$ resp. The sets thereof will be denoted by SM and RM respectively. We view SM as a subset of RM by identifying $v(\cdot)$ with $\delta_{v(\cdot)}$, the Dirac measure at $v(\cdot)$.

The infinite horizon risk-sensitive reward we seek to characterize is
$$\lambda := \sup_{x \in \mathcal{S}} \sup \liminf_{N \uparrow \infty} \frac{1}{N} \log E\left[ e^{\sum_{m=0}^{N-1} r(X_m, Z_m, X_{m+1})} \,\Big|\, X_0 = x \right], \qquad (2)$$
where the second supremum is over all admissible controls. Here $r(x,u,y)$ is an extended-real-valued function on $\mathcal{S} \times U \times \mathcal{S}$, called the 'per stage reward' on transitioning from $x$ to $y$ under action $u$. It should be noted that we will allow $e^{r(x,u,y)} = 0$ for some $(x,u,y)$, so $r(x,u,y)$ should be thought of as being allowed to take the extended real value $-\infty$.
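To fix ideas, note that for a finite-state, finite-action instance under a fixed stationary Markov control $v$, the conditional expectation appearing in (2) can be computed exactly by matrix powers, since the multiplicative reward factorizes along transitions. The following minimal Python sketch does this for a hypothetical two-state, two-action model (all numbers are made up for illustration) and shows the per-step growth rate settling to the logarithm of a Perron-Frobenius eigenvalue, the phenomenon that section 2 formalizes via the operator defined in (3) below.

```python
import numpy as np

# Hypothetical finite instance: p[x, u, y] is the controlled kernel,
# r[x, u, y] the per-stage reward, v[x] a fixed stationary Markov control.
p = np.array([[[0.6, 0.4], [0.2, 0.8]],
              [[0.5, 0.5], [0.9, 0.1]]])
r = np.array([[[1.0, 0.2], [0.0, 0.5]],
              [[0.3, -0.5], [1.2, 0.0]]])
v = np.array([0, 1])                       # action used in each state

# Under v, E[ exp(sum_{m<N} r(X_m, Z_m, X_{m+1})) | X_0 = x ] is the x-th
# entry of A^N 1, where A(x, y) = p(y | x, v(x)) * e^{r(x, v(x), y)}.
A = np.array([[p[x, v[x], y] * np.exp(r[x, v[x], y]) for y in range(2)]
              for x in range(2)])

ones = np.ones(2)
for N in (10, 50, 200):
    w = np.linalg.matrix_power(A, N) @ ones
    print(N, np.log(w) / N)                # growth rate per state

# As N grows, these values approach log of the Perron eigenvalue of A.
print(np.log(np.max(np.linalg.eigvals(A).real)))
```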
Throughout the paper, we make the following assumptions about $r(x,u,y)$ and $p(dy\,|\,x,u)$. We will occasionally explicitly recall these assumptions to remind the reader of this.

(A0): $e^{r(x,u,y)} \in C(\mathcal{S} \times U \times \mathcal{S})$.

(A1): The maps $(x,u) \mapsto \int f(y)\, p(dy\,|\,x,u)$, $f \in C(\mathcal{S})$ with $\|f\| \le 1$, are equicontinuous. This is true, e.g., if $\mathcal{S}$ is a compact metric space, $U$ is a compact metric space, and $p(dy\,|\,x,u) = \psi(y\,|\,x,u)\,\varphi(dy)$ with $\varphi \in \mathcal{P}(\mathcal{S})$ having full support and $\psi(y\,|\,\cdot,\cdot)$, $y \in \mathcal{S}$, equicontinuous.

We shall denote by $e^{r_M}$ the least upper bound for $e^{r(\cdot,\cdot,\cdot)}$, which is finite by virtue of assumption (A0).

Towards the end of the next section we will build up to the main variational formula by first considering the case where we have additional restrictions captured by the following assumptions.

(A0+): Condition (A0) holds and we have $e^{r(x,u,y)} > 0$ for all $(x,u,y)$.

(A1+): Condition (A1) holds and $p(dy\,|\,x,u)$ has full support for all $x,u$. For instance, if $\mathcal{S}$ is a compact metric space, $U$ is a compact metric space, and $p(dy\,|\,x,u) = \psi(y\,|\,x,u)\,\varphi(dy)$ as above with $\psi(y\,|\,\cdot,\cdot)$, $y \in \mathcal{S}$, equicontinuous, then $\psi(\cdot\,|\,x,u) > 0$ on $\mathcal{S}$ will ensure that this assumption holds.

We shall denote by $e^{r_m} > 0$ the greatest lower bound for $e^{r(\cdot,\cdot,\cdot)}$ when (A0+) holds.

If $p(dx)$ and $q(dx)$ are finite nonnegative Borel measures on a compact metric space $\mathcal{X}$, we write $D(p(dx)\,\|\,q(dx))$ for the relative entropy of $p(dx)$ with respect to $q(dx)$, defined by
$$D(p(dx)\,\|\,q(dx)) := \begin{cases} \int p(dx)\,\log l(x) & \text{if we can write } p(dx) = l(x)\, q(dx), \\ \infty & \text{otherwise.} \end{cases}$$
See e.g. [41] for some of the basic properties of relative entropy.

Let assumptions (A0) and (A1) be in force. Define the operator $T : C(\mathcal{S}) \mapsto C(\mathcal{S})$ by
$$Tf(x) := \sup_{\phi \in \mathcal{P}(U)} \int\!\!\int p(dy\,|\,x,u)\, \phi(du)\, e^{r(x,u,y)} f(y). \qquad (3)$$
For fixed $x \in \mathcal{S}$ on the left hand side of (3), the supremum on the right hand side is of a continuous affine function over a compact set of probability measures. Hence, it is a maximum attained at a Dirac measure. For each fixed $f \in C(\mathcal{S})$, a standard measurable selection theorem [5, Lemma 1, p. 182] allows us to choose the family of maximizers, parametrized by $x \in \mathcal{S}$, as a measurable function $v : \mathcal{S} \mapsto U$. To see that $T$ is a map $C(\mathcal{S}) \mapsto C(\mathcal{S})$, note that for $f \in C(\mathcal{S})$ with $\|f\| \le R$,
$$
\begin{aligned}
|Tf(x) - Tf(x')| &= \Big| \sup_{\phi \in \mathcal{P}(U)} \int\!\!\int p(dy\,|\,x,u)\,\phi(du)\, e^{r(x,u,y)} f(y) - \sup_{\phi \in \mathcal{P}(U)} \int\!\!\int p(dy\,|\,x',u)\,\phi(du)\, e^{r(x',u,y)} f(y) \Big| \\
&= \Big| \sup_{u} \int p(dy\,|\,x,u)\, e^{r(x,u,y)} f(y) - \sup_{u} \int p(dy\,|\,x',u)\, e^{r(x',u,y)} f(y) \Big| \\
&\le e^{r_M} \sup_{u} \sup_{f : \|f\| \le R} \Big| \int p(dy\,|\,x,u)\, f(y) - \int p(dy\,|\,x',u)\, f(y) \Big| + R \max_{u,y} \left| e^{r(x,u,y)} - e^{r(x',u,y)} \right|.
\end{aligned}
$$
As $x \to x'$, the first term on the right tends to zero by (A1) and the second term on the right tends to zero by uniform continuity of $e^{r}$, being a continuous function defined on a compact set, by (A0). In fact, this shows that $Tf$, $\|f\| \le R$, are equicontinuous and bounded. Also, from the definition of $T$, it is straightforward to check that
$$\|Tf - Tg\| \le e^{r_M}\, \|f - g\|,$$
which establishes $T$ as a continuous (in fact, Lipschitz) map $C(\mathcal{S}) \mapsto C(\mathcal{S})$. Likewise, define, for $f \in C(\mathcal{S})$,
$$T^{(n)} f(x) := \sup E\left[ e^{\sum_{m=0}^{n-1} r(X_m, Z_m, X_{m+1})} f(X_n) \,\Big|\, X_0 = x \right],$$
where the supremum is over all admissible control processes. Then $T^{(1)} = T$, by virtue of the measurable selection theorem alluded to after (3). We use the convention that $T^{(0)}$ is the identity map.

Lemma 1. $(T^{(n)},\, n \ge 0)$ is a semigroup of operators on $C(\mathcal{S})$. ✷

Proof of Lemma 1: Note that we need to verify that $T^{(n)}$ for $n \ge 2$ maps $C(\mathcal{S})$ to $C(\mathcal{S})$ as part of the stated claim.
This follows as a corollary of the proof,which establishes that T ( n ) is the n -fold concatenation of T with itself. Theproof follows by a standard dynamic programming argument. Specifically,we first have T ( n ) f ( x )= sup E h e P n − m =0 r ( X m ,Z m ,X m +1 ) f ( X n ) | X = x i ≤ sup E h e r ( X ,Z ,X ) sup E h e P n − m =1 r ( X m ,Z m ,X m +1 ) f ( X n ) | X , Z , X i | X = x i = sup E (cid:2) e r ( X ,Z ,X ) T ( n − f ( X ) | X = x (cid:3) , (4)where the inner supremum in the second line is over the control sequence fromtime 1 onwards, conditioned on X = x , Z = z , X = x (say). Secondly,let ǫ >
0. By [10, Lemma 1, p. 55], conditioned on ( X , Z , X ), there existsan admissible state-control sequence ( X ′ m , Z ′ m ) , m ≥
1, with X ′ = X suchthat E h e P n − m =1 r ( X ′ m ,Z ′ m ,X ′ m +1 ) f ( X ′ n ) | X ′ i ≥ sup E h e P n − m =1 r ( X m ,Z m ,X m +1 ) f ( X n ) | X i − ǫ, a.s.Let X ′ = X = x, Z ′ := argmax (cid:0)R p ( dy | x, · ) e r ( x, · ,y ) T ( n − f (cid:1) . Then( X ′ , Z ′ ) , ( X ′ , Z ′ ) , · · · , ( X ′ n , Z ′ n ))is an admissible state-control sequence and T ( n ) f ( x ) ≥ E h e P n − m =0 r ( X ′ m ,Z ′ m ,X ′ m +1 ) f ( X ′ n ) | X ′ = x i ≥ E h e r ( X ,Z ,X ) sup E h e P n − m =1 r ( X m ,Z m ,X m +1 ) f ( X n ) | X ii − e r M ǫ = E (cid:2) e r ( X ,Z ,X ) T ( n − f ( X ) | X = x (cid:3) − e r M ǫ = T (1) (cid:0) T ( n − f (cid:1) ( x ) − e r M ǫ (5)7ombining (4), (5) and using the fact that ǫ > T ( n ) = T (1) ◦ T ( n − . A similar argument shows that T ( n ) f = T ( n − ◦ T f . ✷ The semigroup ( T ( n ) , n ≥ C + ( S ) := { f ∈ C ( S ) : f ( x ) ≥ } denote the set of nonnegativefunctions in C ( S ). Then C + ( S ) is a cone , i.e. it is closed under additionand scalar multiplication by nonnegative real numbers, and we have C + ( S ) ∩ ( − C + ( S )) = { θ } where θ denotes the constant function that is identicallyzero. Thus C + ( S ) defines a partial order on C ( S ), denoted ≥ , given by f ≥ g if f − g ∈ C + ( S ). We write f > g (equivalently, g < f ) if f ≥ g, f = g , andwe write f >> g if f − g is a strictly positive function in C ( S ) or equivalentlyif f − g ∈ int( C + ( S )), where int( C + ( S )) denotes the interior of C + ( S ). The dual cone of C + ( S ) is the cone in the dual Banach space M ( S ) given by { µ ∈ M ( S ) : R f dµ ≥ ∀ f ∈ C + ( S ) } . This is the set of finite nonnegativemeasures on S , which we denote by M + ( S ). For more on cones in Banachspaces, see [2].Let us now make the additional assumption (A0+) and (A1+) . Onecan then verify the following additional properties of T ( n ) for each n ≥ T ( n ) is strictly increasing, i.e., f < g implies T ( n ) f < T ( n ) g . In view ofthe fact established above that ( T ( n ) , n ≥
0) is a semigroup, it sufficesto prove this claim for n = 1. We know that there is a measurablefunction v : S 7→ U such that T f ( x ) = Z p ( dy | x, v ( x )) e r ( x,v ( x ) ,y ) f ( y ) . Then
T g ( x ) − T f ( x ) ≥ Z p ( dy | x, v ( x )) e r ( x,v ( x ) ,y ) g ( y ) − Z p ( dy | x, v ( x )) e r ( x,v ( x ) ,y ) f ( y ) ≥ e nr m Z p ( dy | x, v ( x ))( g ( y ) − f ( y )) > , f < g, f = g and support( p ( dy | x, u )) = S ∀ x, u .2. T ( n ) is strongly positive, i.e., f ∈ C + ( S ) , f = θ = ⇒ T ( n ) f ∈ int( C + ( S )).This follows from the fact that for any u ∈ U , T ( n ) f ( x ) ≥ e nr m Z p ( dy | x, u ) f ( y ) > , where we use the fact that support( p ( dy | x, u )) = S .3. T ( n ) is positively one-homogeneous, i.e., for c > T ( n ) ( cf ) = cT ( n ) f .(This holds under the weaker assumptions (A0) and (A1) .)4. For M > e − nr m and ˘ f ∈ C ( S ) defined by ˘ f ( · ) ≡ M T ( n ) ˘ f > ˘ f .5. T ( n ) is compact. (This holds under the weaker assumptions (A0) and (A1) .) It suffices to verify this for n = 1, the general case being thena consequence of the semigroup property. By (A1) , the family x F f ( x, u ) := R f ( y ) e r ( x,u,y ) p ( dy | x, u ) , u ∈ U, k f k ≤ R , is equicontinuousand bounded in C ( S )-norm by e r M R . Hence it is relatively compact in C ( S ) by the Arzela-Ascoli theorem. Let δ ∈ [0 , w δ ( · ) denote itscommon modulus of continuity relative to a compatible metric κ on S .Then T : C ( S ) C ( S ) satisfies k T f k ≤ e r M R for k f k ≤ R, f ∈ C ( S ),and, sup x,y ∈S ,κ ( x,y ) <δ k T f ( x ) − T f ( y ) k≤ sup x,y ∈S ,κ ( x,y ) <δ k sup u F f ( x, u ) − sup u F f ( y, u ) k≤ sup x,y ∈S ,κ ( x,y ) <δ sup u k F f ( x, u ) − F f ( y, u ) k≤ w δ ( F f ) δ ↓ → f : k f k ≤ R . Thus T f, k f k ≤ R , ie equicontinuous.By Arzela-Ascoli theorem, it is relatively compact, implying that T : C ( S ) C ( S ) is a compact operator.The preceding considerations allow us to state the following theorem.9 heorem 1. Under the assumptions (A0+) and (A1+) , there exists aunique ρ > (the Perron-Frobenius eigenvalue) and a ψ ∈ int ( C + ( S )) suchthat T ψ = ρψ , i.e., ρψ ( x ) = sup φ ∈P ( U ) Z Z p ( dy | x, u ) φ ( du ) e r ( x,u,y ) ψ ( y ) , (6) with ρ given by ρ = inf f ∈ int( C + ( S )) sup µ ∈M + ( S ) R T f dµ R f dµ = sup f ∈ int( C + ( S )) inf µ ∈M + ( S ) R T f dµ R f dµ . (7) ✷ Equation (7) is an abstract version of the celebrated Collatz-Wielandtformula for the Perron-Frobenius eigenvalue of irreducible nonnegative ma-trices, see e.g. [34].Before proceeding to the proof of Theorem 1, it is appropriate to make afew remarks. A great deal is known about analogs of the Perron-Frobeniustheorem for increasing positively one-homogeneous maps on finite dimen-sional vector spaces, see the recent book [30]. When the map is on an orderedBanach space (and one talks about a Krein-Rutman theorem rather than aPerron-Frobenius theorem, in view of the seminal work in [29]), we rely onTheorem 3.1.1, Proposition 3.1.5, and Lemma 3.1.7 of [37], as seen in theproof below (see also [36], [33]). These results in [37] are themselves statedin a much broader context than the special case of the Banach space C ( S )and the order structure defined by the cone C + ( S ), with S a compact metricspace, which suffices for our purposes. The recent papers [32] and [12] claimeven stronger nonlinear Krein-Rutman theorems. However, it has been rec-ognized in [3] that some of the claims in these papers are wrong. The proofof the Theorem 1 given below does not rely in any way on [32], [12], or [3]. Proof of Theorem 1 : We define k T ( n ) k + := sup {k T ( n ) f k : f ∈ C + ( S ) , k f k ≤ } , n ≥ . T ( n ) , n ≥
0) is a positive semigroup, it is straightforward to check that k T ( k + l ) k + ≤ k T ( k ) k + k T ( l ) k + for all k, l ≥
0, and so r ( T ) := lim n →∞ k T ( n ) k n + exists. By the fourth of the properties of the semigroup ( T ( n ) , n ≥
0) shownabove, we have r ( T ) >
0. It will turn out that the ρ promised in the statementof Theorem 1 is just r ( T ).Strong positivity of T , which was shown above, verifies assumption A4in [37, pg. 47], and the facts that T is compact (as established above), one-homogeneous, and order preserving are respectively the conditions A1, A2,and A3 in [37, pg.47]. Thus [37, Proposition 3.1.5.] provides the additionalrequirement in the statement of [37, Theorem 3.1.1] that T have an eigen-value, and [37, Theorem 3.1.1] states that with ρ taken to be r ( T ) thereexists a ψ ∈ int( C + ( S )) such that (6) holds.It remains to establish (7), where we now know that ρ = r ( T ). We have ρ ≥ inf f ∈ int( C + ( S )) sup µ ∈M + ( S ) R T f dµ R f dµ , which comes from substituting ψ as a choice for f on the right hand side.Similarly, we have ρ ≤ sup f ∈ int( C + ( S )) inf µ ∈M + ( S ) R T f dµ R f dµ . Thus it suffices to establishinf f ∈ int( C + ( S )) sup µ ∈M + ( S ) R T f dµ R f dµ ≥ ρ ≥ sup f ∈ int( C + ( S )) inf µ ∈M + ( S ) R T f dµ R f dµ . (8)Given f ∈ int( C + ( S )), we have T f ≤ sup µ ∈M + ( S ) R T f dµ R f dµ ! f . From [37, Lemma 3.1.7 (ii)], we have r ( T ) ≤ sup µ ∈M + ( S ) R T fdµ R fdµ . Since thisholds for all f ∈ int( C + ( S )), this establishes the first inequality in (8). The11roof of the second inequality in (8) is similar, based on [37, Lemma 3.1.7(iii)]. This concludes the proof of Theorem 1. ✷ Next we show that log ρ is in fact the optimal growth rate of the risk-sensitive reward. For a development of the analogous result in the case ofcontrolled diffusion processes, see [4]. As argued earlier, in connection withthe right hand side of (3), for each x ∈ S , the supremum on the right handside of (6) is the expectation of a continuous affine function on a compactset of probability measures, and is therefore a maximum attained at a Diracmeasure. A standard measurable selection theorem [5, Lemma 1, p. 182]then allows us to identify the family of maximizers, parametrized by x ∈ S ,with an element of SM , which we denote by v ∗ ( · ). Letting ( X ∗ n , n ≥ v ∗ ( · ) and ( Z ∗ n = v ∗ ( X ∗ n ) , n ≥
0) the corresponding control sequence, we then have ρψ ( x ) = E (cid:2) e r ( x,v ∗ ( x ) ,X ∗ ) ψ ( X ∗ ) (cid:3) , and, more generally, by iterating, we have, for all x ∈ S , ρ n ψ ( x ) = E h e P n − m =0 r ( X ∗ m ,Z ∗ m ,X ∗ m +1 ) ψ ( X ∗ n ) | X ∗ = x i . Since ψ ( x ) ∈ int( C + ( S )), we have 0 < c < ψ ( · ) < C < ∞ for someconstants c, C when ψ is chosen with, say, k ψ k = 1. Thus, for all x ∈ S , cC E h e P n − m =0 r ( X ∗ m ,Z ∗ m ,X ∗ m +1 ) | X ∗ = x i ≤ ρ n ≤ Cc E h e P n − m =0 r ( X ∗ m ,Z ∗ m ,X ∗ m +1 ) | X ∗ = x i . Hence log ρ = lim n ↑∞ n log E h e P n − m =0 r ( X ∗ m ,Z ∗ m ,X ∗ m +1 ) | X ∗ = x i . For any other admissible state-control sequence (( X n , Z n ) , n ≥ ρψ ( x ) ≤ E (cid:2) e r ( x,Z ,X ) ψ ( X ) | X = x (cid:3) . Iterating, ρ n ψ ( x ) ≤ E h e P n − m =0 r ( X m ,Z m ,X m +1 ) ψ ( X n ) | X = x i . ρ ≤ lim inf n ↑∞ n log E h e P n − m =0 r ( X m ,Z m ,X m +1 ) | X = x i . We have proved:
Theorem 2.
Under the assumptions (A0+) and (A1+), we have, for all $x \in \mathcal{S}$,
$$\log \rho = \sup\, \liminf_{n \uparrow \infty} \frac{1}{n} \log E\left[ e^{\sum_{m=0}^{n-1} r(X_m, Z_m, X_{m+1})} \,\Big|\, X_0 = x \right],$$
where the supremum on the right is over all admissible controls and $\rho$ on the left is given as in Theorem 1. Furthermore, this supremum is a maximum attained at some $v^*(\cdot) \in$ SM. ✷

An immediate consequence is the following.
Corollary 1.
Under the assumptions (A0+) and (A1+) we have λ = log ρ , where λ is the optimal growth rate of reward, as defined in (2), and ρ is asdefined in Theorem 1. ✷ By (7), we have ρ = inf f>> sup µ ∈M + ( S ): R fdµ =1 Z µ ( dx ) sup u Z p ( dy | x, u ) e r ( x,u,y ) f ( y )= inf f>> sup ν ∈P ( S ) Z ν ( dx ) (cid:18) sup u R p ( dy | x, u ) e r ( x,u,y ) f ( y ) f ( x ) (cid:19) = inf f>> sup x (cid:18) sup u R p ( dy | x, u ) e r ( x,u,y ) f ( y ) f ( x ) (cid:19) = inf f>> sup x sup u Z p ( dy | x, u ) e r ( x,u,y )+log f ( y ) − log f ( x ) = inf f>> sup γ ∈P ( S× U ) Z Z γ ( dx, du ) Z p ( dy | x, u ) e r ( x,u,y )+log f ( y ) − log f ( x ) . η ( dx, du, dy ) = η ( dx ) η ( du | x ) η ( dy | x, u )= ˜ η ( dx, du ) η ( dy | x, u ) . Let G := { η ( dx, du, dy ) : η is invariant under the transition kernel Z U η ( dy | x, u ) η ( du | x ) } , i.e. η ∈ G iff Z ˜ η ( dx, du ) η ( dy | x, u ) = η ( dy ) . Recall that D ( ·||· ) is convex and lower semi-continuous in both arguments[41]. Then 14og ρ = inf f>> sup γ log Z Z Z γ ( dx, du ) p ( dy | x, u ) e r ( x,u,y )+log f ( y ) − log f ( x ) = inf g ∈ C ( S ) sup γ log Z Z Z γ ( dx, du ) p ( dy | x, u ) e r ( x,u,y )+ g ( y ) − g ( x ) = inf g ∈ C ( S ) sup γ sup η Z Z Z η ( dx, du, dy ) (cid:16) r ( x, u, y ) + g ( y ) − g ( x ) (cid:17) − D ( η ( dx, du, dy ) || γ ( dx, du ) p ( dy | x, u ))(by the Gibbs variational formula (Prop. 1.4.2(a), pp. 33-34, [17])= sup γ sup η inf g ∈ C ( S ) Z Z Z η ( dx, du, dy ) (cid:16) r ( x, u, y ) + g ( y ) − g ( x ) (cid:17) − D ( η ( dx, du, dy ) || γ ( dx, du ) p ( dy | x, u )) · · · · · · (by the min-max theorem [19])= sup γ sup η inf g ∈ C ( S ) (cid:16) Z Z Z η ( dx, du, dy ) (cid:16) r ( x, u, y ) + g ( y ) − g ( x ) (cid:17) − (cid:16) D (˜ η ( dx, du ) || γ ( dx, du )) + Z Z ˜ η ( dx, du ) D ( η ( dy | x, u ) || p ( dy | x, u ))) (cid:17) = sup η inf g ∈ C ( S ) (cid:16) Z Z Z η ( dx, du, dy ) (cid:16) r ( x, u, y ) + g ( y ) − g ( x ) (cid:17) − Z Z ˜ η ( dx, du ) D ( η ( dy | x, u ) || p ( dy | x, u )) · · · · · · (by setting γ = ˜ η )= sup η ∈G h inf g ∈ C ( S ) (cid:16) Z Z Z η ( dx, du, dy ) (cid:16) r ( x, u, y ) + g ( y ) − g ( x ) (cid:17) − Z Z ˜ η ( dx, du ) D ( η ( dy | x, u ) || p ( dy | x, u )) i · · · · · · (because h · · · i = −∞ ∀ η / ∈ G )= sup η ∈G (cid:16) Z Z Z η ( dx, du, dy ) r ( x, u, y ) − Z Z ˜ η ( dx, du ) D ( η ( dy | x, u ) || p ( dy | x, u )) (cid:17) · · · · · · (because η ∈ G = ⇒ Z η ( dx, du, dy )( g ( y ) − g ( x )) = 0)15hus we have: Theorem 3.
Under the assumptions (A0+) and (A1+), the optimal growth rate of reward $\lambda$, as defined in (2), has the variational characterization
$$\lambda = \log \rho = \sup_{\eta \in G} \left( \int\!\!\int\!\!\int \eta(dx, du, dy)\, r(x,u,y) - \int\!\!\int \tilde{\eta}(dx, du)\, D\big(\eta(dy\,|\,x,u)\,\big\|\,p(dy\,|\,x,u)\big) \right), \qquad (9)$$
where $\rho$ is defined as in Theorem 1. ✷

The following result, which uses a limiting argument to strengthen Theorem 3, is the main result of this paper.
Theorem 4.
Under the assumptions (A0) and (A1), the optimal growth rate of reward $\lambda$, as defined in (2), has the variational characterization
$$\lambda = \sup_{\eta \in G} \left( \int\!\!\int\!\!\int \eta(dx, du, dy)\, r(x,u,y) - \int\!\!\int \tilde{\eta}(dx, du)\, D\big(\eta(dy\,|\,x,u)\,\big\|\,p(dy\,|\,x,u)\big) \right). \qquad (10)$$
✷

Before proving Theorem 4, let us first consider the uncontrolled case. We can fit this into our framework by taking $U$ to be a set with one point, so that $p(dy\,|\,x,u) = \tilde{p}(dy\,|\,x)$ for all $u \in U$, for some kernel $\tilde{p}(dy\,|\,x)$, and $r(x,u,y) = \tilde{r}(x,y)$ for all $u \in U$, for some $\tilde{r}(\cdot,\cdot)$. Theorem 4 then specializes to the statement that the growth rate of the reward, under the respective specializations of conditions (A0) and (A1), is given by
$$\lambda = \sup_{\alpha \in \tilde{G}} \left( \int\!\!\int \alpha(dx, dy)\, \tilde{r}(x,y) - \int \alpha(dx)\, D\big(\alpha(dy\,|\,x)\,\big\|\,\tilde{p}(dy\,|\,x)\big) \right),$$
where $\alpha(dx,dy) = \alpha(dx)\,\alpha(dy\,|\,x)$ and
$$\tilde{G} := \left\{ \alpha(dx,dy) = \alpha(dx)\,\alpha(dy\,|\,x) : \int \alpha(dx)\,\alpha(dy\,|\,x) = \alpha(dy) \right\}.$$

Proof of Theorem 4: Let $\gamma(dy)$ be an arbitrary probability distribution on $\mathcal{S}$ with full support, and, for all $\epsilon > 0$, define the kernel
$$p_\epsilon(dy\,|\,x,u) := \frac{1}{a(x,u) + \epsilon} \left( e^{r(x,u,y)}\, p(dy\,|\,x,u) + \epsilon\, \gamma(dy) \right),$$
and the reward
$$r_\epsilon(x,u,y) := \log(a(x,u) + \epsilon),$$
where
$$a(x,u) := \int e^{r(x,u,y)}\, p(dy\,|\,x,u).$$
Since this kernel and reward satisfy the conditions (A0+) and (A1+), we have from Theorem 3 that the optimal growth rate of reward for the risk-sensitive reward maximization problem for this kernel and reward, call it $\lambda_\epsilon$, is given by
$$\lambda_\epsilon = \sup_{\eta \in G} \left( \int\!\!\int\!\!\int \eta(dx, du, dy)\, r_\epsilon(x,u,y) - \int\!\!\int \tilde{\eta}(dx, du)\, D\big(\eta(dy\,|\,x,u)\,\big\|\,p_\epsilon(dy\,|\,x,u)\big) \right). \qquad (11)$$
From the formulation of the risk-sensitive objective we see that $\lambda_\epsilon$ is nondecreasing in $\epsilon$, and that $\lambda_\epsilon \ge \lambda$ for all $\epsilon >$
0, where λ is defined as in(2). This can be seen by writing the expression for the n -step multiplicativereward, i.e. E ǫ h e P N − m =0 r ǫ ( X m ,Z m ,X m +1 ) | X = x i , as a multiple integral, which reveals that this quantity is monotonically non-decreasing in ǫ for any initial condition x ∈ S and any admissible controlstrategy. Thus lim ǫ → λ ǫ exists and satisfieslim ǫ → λ ǫ ≥ λ . (12)To prove (10), we will first prove thatlim ǫ → λ ǫ ≤ sup η ∈G (cid:16) Z Z Z η ( dx, du, dy ) r ( x, u, y ) − Z Z ˜ η ( dx, du ) D ( η ( dy | x, u ) || p ( dy | x, u )) (cid:17) , (13)17nd then prove that λ ≥ sup η ∈G (cid:16) Z Z Z η ( dx, du, dy ) r ( x, u, y ) − Z Z ˜ η ( dx, du ) D ( η ( dy | x, u ) || p ( dy | x, u )) (cid:17) . (14)Together with (12), these two claims establish (10).For fixed η ∈ G , let Ψ ǫ ( η ) denote the expression inside the outer bracketson the right hand side of (11). Then one hasΨ ǫ ( η ) = − Z Z ˜ η ( dx, du ) D ( η ( dy | x, u ) || e r ( x,u,y ) p ( dy | x, u ) + ǫγ ( dy )) . (15)Similarly, for fixed η ∈ G , let Ψ ( η ) denote the expression inside the outerbrackets on the right hand side of (10). We haveΨ ( η ) = − Z Z ˜ η ( dx, du ) D ( η ( dy | x, u ) || e r ( x,u,y ) p ( dy | x, u )) . (16)In fact, (15) reveals that for each η ∈ G we have Ψ ǫ ( η ) is nondecreasingin ǫ , and together with (16), reveals that for all ǫ > η ∈ G , wehave Ψ ǫ ( η ) ≥ Ψ ( η ). Thus we may conclude that for each η ∈ G the limitlim ǫ → Ψ ǫ ( η ) exists, and that this limit satisfies lim ǫ → Ψ ǫ ( η ) ≥ Ψ ( η ).Now, for all ǫ > δ > η δǫ ∈ G suchthat Ψ ǫ ( η δǫ ) > λ ǫ − δ . Since G is compact, there is a decreasing sequence( ǫ m , m ≥
1) with lim m →∞ ǫ m = 0, such that the sequence ( η δǫ m , m ≥
1) has alimit in P ( S × U × S ), call it η δ . Further, since G is closed, we have η δ ∈ G .By the lower semicontinuity of D ( ·k· ) as a function of ( · , · ) [41] we havesup η ∈G Ψ ( η ) ≥ Ψ ( η δ ) ≥ lim m →∞ Ψ ǫ m ( η δǫ m ) ≥ lim m →∞ λ ǫ m − δ = lim ǫ → λ ǫ − δ . Since δ > η ∈G Ψ ( η ) (i.e. the right hand side of (14))equals −∞ then there is nothing to prove, so we may assume that this isnot the case. Given η ∈ G for which Ψ ( η ) = −∞ , consider implementingthe stationary Markov strategy defined by the kernel η ( du | x ). The expectedmultiplicative reward after n steps when implementing this strategy, condi-tioned on starting with the initial distribution η ( dx ), is Z · · Z η ( dx ) n − Y m =0 η ( du m | x m ) p ( dx m +1 | x m , u m ) e r ( x m ,u m ,x m +1 ) . η ( dy | x, u ) is absolutely continuous with respect to p ( dy | x, u ) for almostall ( x, u ), this equals Z ·· Z η ( dx ) n − Y m =0 η ( du m | x m ) η ( dx m +1 | x m , u m ) e r ( x m ,u m ,x m +1 ) e − log η dxm +1 | xm,um ) p ( dxm +1 | xm,um ) . Let { X ′ n } denote a controlled Markov chain with controlled transition kernel η ( dy | x, u ), initial law η , and controlled by η ( du | x ) ∈ RM . Then λ ≥ lim n →∞ n log (cid:16) Z · · Z η ( dx ) n − Y m =0 η ( du m | x m ) p ( dx m +1 | x m , u m ) e r ( x m ,u m ,x m +1 ) (cid:17) ≥ lim n →∞ n log (cid:16) Z · · Z η ( dx ) n − Y m =0 η ( du m | x m ) η ( dx m +1 | x m , u m ) × e r ( x m ,u m ,x m +1 ) − log η dxm +1 | xm,um ) p ( dxm +1 | xm,um ) (cid:17) = lim n →∞ n log (cid:16) E (cid:20) e P n − m =0 ( r ( X ′ m ,Z ′ m ,X ′ m +1 ) − log dη ·| X ′ m,Z ′ m ) dp ( ·| X ′ m,Z ′ m ) ( X ′ m )) (cid:21) (cid:17) ≥ lim n →∞ n E " n − X m =0 ( r ( X ′ m , Z ′ m , X ′ m +1 ) − log dη ( ·| X ′ m , Z ′ m ) dp ( ·| X ′ m , Z ′ m ) ( X ′ m )) (by Jensen’s inequality)= Ψ ( η ) (because η ∈ G ) . It follows that λ , as defined in (2), satisfies (14), which concludes theproof of Theorem 4. ✷ .
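To see the uncontrolled specialization of Theorem 4 at work numerically, the following minimal Python sketch (the two-state kernel and reward are hypothetical, chosen only for illustration) computes $\log \rho$ as the logarithm of the Perron-Frobenius eigenvalue of the matrix $Q(x,y) = \tilde{p}(y\,|\,x)\, e^{\tilde{r}(x,y)}$ and checks that the exponentially tilted kernel $\eta(y\,|\,x) = \tilde{p}(y\,|\,x)\, e^{\tilde{r}(x,y)}\, \psi(y)/(\rho\, \psi(x))$, paired with its stationary distribution, attains the supremum in (10).

```python
import numpy as np

# Hypothetical two-state uncontrolled example: p[x, y] is the kernel
# p~(y | x), r[x, y] the per-stage reward r~(x, y).
p = np.array([[0.5, 0.5],
              [0.3, 0.7]])
r = np.array([[1.0, 0.0],
              [0.5, -1.0]])

# Twisted kernel Q(x, y) = p~(y|x) e^{r~(x,y)}; rho is its Perron eigenvalue.
Q = p * np.exp(r)
eigvals, eigvecs = np.linalg.eig(Q)
k = np.argmax(eigvals.real)
rho = eigvals.real[k]
psi = np.abs(eigvecs[:, k].real)                 # Perron eigenfunction, positive

# Candidate maximizer in (10): the exponentially tilted chain.
eta = Q * psi[None, :] / (rho * psi[:, None])    # rows sum to 1
w, v = np.linalg.eig(eta.T)
alpha = np.abs(v[:, np.argmax(w.real)].real)
alpha /= alpha.sum()                             # stationary law of eta

# Variational objective: int r d(alpha x eta) - int alpha(dx) D(eta(.|x) || p(.|x)).
objective = np.sum(alpha[:, None] * eta * (r - np.log(eta / p)))
print(np.log(rho), objective)                    # the two numbers agree
```

The agreement is exact here because, for the tilted kernel, $\tilde{r}(x,y) - \log\big(\eta(y\,|\,x)/\tilde{p}(y\,|\,x)\big) = \log\rho + \log\psi(x) - \log\psi(y)$, and the $\log\psi$ terms telescope under the stationary law.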
1. Assume (A0), (A1) . Fix ϕ ∈ RM , and consider { ( X n , Z n ) , n ≥ } governed by the randomized stationary Markov strategy ϕ as an uncon-trolled S × U -valued Markov chain. To be precise, let ¯ S denote S × U ,let ¯ U := { ¯ u } be a one point set, and define ¯ p ( d ¯ y | ¯ x, ¯ u ) : ¯ S × ¯ U
7→ P ( ¯ S )by ¯ p ( d ¯ y | ¯ x, ¯ u ) := p ( dy | x, u ) ϕ ( du ′ | y ) , x := ( x, u ) and ¯ y := ( y, u ′ ). Also, let¯ r (¯ x, ¯ u, ¯ y ) := r ( x, u, y ) . It is straightforward to check that the assumptions (A0), (A1) hold forthe ¯ S -valued chain with trivial control space ¯ U and with the transitionkernel and one step reward as above.Given τ ( dx, du, dy, du ′ ) = τ ( dx ) τ ( du | x ) τ ( dy | x, u ) τ ( du ′ | x, u, y ), write˜ τ ( dx, du ) for τ ( dx ) τ ( du | x ) and ˆ τ ( dy, du ′ | x, u ) for τ ( dy | x, u ) τ ( du ′ | x, u, y ).Let G + := { τ ( dx, du, dy, du ′ ) : Z Z ˜ τ ( dx, du )ˆ τ ( dy, du ′ | x, u ) = ˜ τ ( dy, du ′ ) } . Further, given τ ( dx, du, dy, du ′ ), we define τ ′ ( dx, du, dy, du ′ ) by setting τ ′ := τ , τ ′ := τ , τ ′ := τ , τ ′ ( du ′ | x, u, y ) := τ ( du ′ | y ) , with the corresponding definitions for ˜ τ ′ , ˆ τ ′ . We claim that τ ′ ∈ G + .To see this, first observe that R R ˜ τ ( dx, du )ˆ τ ( dy, du ′ | x, u ) = ˜ τ ( dy, du ′ )when integrated over u ′ gives R R ˜ τ ( dx, du ) τ ( dy | x, u ) = τ ( dy ). Thismeans Z Z ˜ τ ′ ( dx, du )ˆ τ ′ ( dy, du ′ | x, u ) = Z Z ˜ τ ( dx, du ) τ ( dy | x, u ) τ ( du ′ | y )= τ ( dy ) τ ( du ′ | y )= ˜ τ ( dy, du ′ ) = ˜ τ ′ ( dy, du ′ ) , which establishes the claim.Let λ ϕ denote the asymptotic growth rate of reward under the fixedrandomized stationary Markov strategy ϕ . Then by applying Theorem4 to the ¯ S -valued chain with trivial control space ¯ U defined above, wehave λ ϕ = sup τ ∈G + (cid:16) Z Z Z τ ( dx, du, dy, U ) r ( x, u, y ) − Z Z ˜ τ ( dx, du ) D (ˆ τ ( dy, du ′ | x, u ) || p ( dy | x, u ) ϕ ( du ′ | y )) (cid:17) . (17)20hen we havesup ϕ λ ϕ = sup ϕ ∈ RM sup τ ∈G + (cid:16) Z Z Z τ ( dx, du, dy, U ) r ( x, u, y ) − Z Z ˜ τ ( dx, du ) D ( τ ( dy | x, u ) τ ( du ′ | x, u, y ) || p ( dy | x, u ) ϕ ( du ′ | y )) (cid:17) ( a ) = sup τ ∈G + (cid:16) Z Z Z τ ′ ( dx, du, dy, U ) r ( x, u, y ) − Z Z ˜ τ ′ ( dx, du ) D ( τ ′ ( dy | x, u ) || p ( dy | x, u )) (cid:17) ( b ) = sup η ∈G (cid:16) Z Z Z η ( dx, du, dy ) r ( x, u, y ) − Z Z ˜ η ( dx, du ) D ( η ( dy | x, u ) || p ( dy | x, u )) (cid:17) = λ . Here, to justify step (a), notice that for every τ ∈ G + , we have shownthat τ ′ ∈ G + . Therefore we have both Z Z Z τ ′ ( dx, du, dy, U ) r ( x, u, y ) = Z Z Z τ ( dx, du, dy, U ) r ( x, u, y )and Z Z ˜ τ ′ ( dx, du ) D ( τ ′ ( dy | x, u ) || p ( dy | x, u ))= Z Z ˜ τ ( dx, du ) D ( τ ( dy | x, u ) || p ( dy | x, u )) . The choice of ϕ ( du ′ | y ) = τ ( du ′ | y ) (which also equals τ ′ ( du ′ | x, u, y ))would make the expression Z Z Z ˜ τ ′ ( dx, du ) τ ′ ( dy | x, u ) D ( τ ′ ( du ′ | x, u, y ) || ϕ ( du ′ | y ))equal to zero, whereas the expression Z Z Z ˜ τ ( dx, du ) τ ( dy | x, u ) D ( τ ( du ′ | x, u, y ) || ϕ ( du ′ | y ))21s nonnegative. To justify step (b) note that for every τ ∈ G + , wehave τ ( dx ) τ ( du | x ) τ ( dy | x, u ) ∈ G , and conversely for every η ∈ G we get τ ∈ G + by defining τ ( dx, du, dy, du ′ ) := η ( dx, du, dy ) η ( du ′ | y ).Furthermore, this τ satisfies τ ′ = τ .The upshot is that we have proved λ = sup ϕ ∈ RM λ φ . (18)Under (A0+), (A1+) , this supremum is in fact a maximum by virtueof Theorem 2.2. Since D ( ·||· ) is convex and lower semi-continuous in its arguments asnoted earlier, (10) is a concave maximization problem on the convex set G := { η ( dx ) ϕ ( du | x ) µ ( dy | x, u ) : η is invariant under the transitionkernel x Z U ϕ ( du | x ) µ ( dy | x, u ) } . It is worthwhile to compare this formulation with the classical dynamicprogramming approach. 
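Before turning to that comparison, here is a minimal numerical sketch of the concave maximization in (10) for a finite instance; the two-state, two-action kernel and reward below are hypothetical, and cvxpy with an exponential-cone solver such as SCS is assumed to be available. The entropic penalty is written as $\sum_y \eta(x,u,y)\,\log\big(\eta(x,u,y)/(\tilde{\eta}(x,u)\,p(y\,|\,x,u))\big)$, which is jointly convex in $\eta$ because the second argument is affine in $\eta$, so the whole problem is a disciplined convex program. The optimal value is compared against $\log\rho$ obtained by power iteration on the operator $T$ of (3).

```python
import numpy as np
import cvxpy as cp

# Hypothetical instance: p[x, u, y] and r[x, u, y] on 2 states, 2 actions.
nS, nA = 2, 2
p = np.array([[[0.6, 0.4], [0.2, 0.8]],
              [[0.5, 0.5], [0.9, 0.1]]])
r = np.array([[[1.0, 0.2], [0.0, 0.5]],
              [[0.3, -0.5], [1.2, 0.0]]])
P = p.reshape(nS * nA, nS)                 # rows indexed by (x, u)
R = r.reshape(nS * nA, nS)

eta = cp.Variable((nS * nA, nS), nonneg=True)        # eta(x, u, y)
eta_xu = cp.sum(eta, axis=1, keepdims=True)          # marginal eta~(x, u)
# sum_y eta log(eta / (eta~ p)) = eta~(x,u) D(eta(.|x,u) || p(.|x,u))
penalty = cp.sum(cp.rel_entr(eta, cp.multiply(eta_xu @ np.ones((1, nS)), P)))
objective = cp.sum(cp.multiply(eta, R)) - penalty

agg = np.kron(np.eye(nS), np.ones((1, nA)))          # (x,u)-marginal -> x-marginal
constraints = [cp.sum(eta) == 1,
               cp.sum(eta, axis=0) == agg @ cp.sum(eta, axis=1)]   # invariance
prob = cp.Problem(cp.Maximize(objective), constraints)
prob.solve(solver=cp.SCS)

# Compare with log(rho) via power iteration on Tf(x) = max_u sum_y p e^r f.
f = np.ones(nS)
for _ in range(5000):
    Tf = np.max(np.einsum('xuy,xuy,y->xu', p, np.exp(r), f), axis=1)
    rho, f = Tf.max(), Tf / Tf.max()
print(prob.value, np.log(rho))             # should agree to solver accuracy
```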
Recall that the dynamic programming equa-tion (6) is the nonlinear eigenvalue problem ρV ( x ) = sup ϕ (cid:18)Z Z p ( dy | x, u ) ϕ ( dy | u ) e r ( x,u,y ) V ( y ) (cid:19) . Consider the standard ‘log transformation’ ζ ( x ) := log V . Thenlog ρ + ζ ( x ) = sup ϕ log (cid:18)Z Z p ( dy | x, u ) ϕ ( du | x ) e r ( x,u,y )+ ζ ( y ) (cid:19) . We treat x as a fixed parameter on the right hand side. By the Gibbsvariational principle, we havelog ρ + ζ ( x )= sup ϕ sup µ ( · , ·| x ) ∈P ( U ×S ) (cid:16) Z µ ( du, dy | x )( r ( x, u, y ) + ζ ( y )) − D ( µ ( du, dy | x ) || p ( dy | x, u ) ϕ ( du | x )) (cid:17) . (19) See [11, section 11.2.3, p. 358] for the proof of convexity r ( x, u, y ) − D ( µ ( du, dy | x ) || p ( dy | x, u ) ϕ ( du | x )) , where µ specifies an additional control variable the choice of which is infact the distribution of the next state and control, whereas the originalrandomized control ϕ affects only the payoff. This is a team problem asopposed to a control problem because while both controls have the sameobjective, viz., to maximize a common reward, they are implemented ina non-cooperative manner. This is reminiscent of, e.g., [24], which con-siders the cost minimization formulation in which a similar procedureleads to a zero sum ergodic game. There does not, however, appear tobe any corresponding development earlier for the reward maximizationproblem with a positive reward. While this is completely analogous tothe game situation, we have obtained it without an explicit minoriza-tion condition as in [16], or the ‘condition B’ of [26]. We have insteadconditions (A0) and (A1) which are relatively mild, and compactnessof state space, which is not. We are working towards relaxing the latter.An important point to note here is that we have an equivalent prob-lem of maximizing a concave upper semi-continuous function over theconvex set G . This is in contrast with the ergodic team problem ofmaximizing the same function over the nonconvex set G := { η ( dx ) ϕ ( du | x ) µ ( dy | x ) : η is invariant under the transitionkernel x µ ′ ( dy | x ) } , i.e., where the controls ϕ, µ ′ are chosen by the two team members non-cooperatively. The latter is what one obtains from the team formulationvia log transformation.3. It is also worth noting that the entropic penalty implicit in our varia-tional formula also arises in different contexts [8], [23], [40].23 Examples
Let G be a directed graph on a finite vertex set S of size d , with edge setdenoted by E G . Let M G denote the incidence matrix of the graph, namely the d × d nonnegative matrix M G = [ m ( x, y )], with m ( x, y ) = 1 if ( x, y ) ∈ E G , and m ( x, y ) = 0 otherwise. Assume that each vertex has at least one out-goingedge. For n ≥ x ∈ S , let N n ( x ) denote the number of directed pathsof length n starting at x . Then the growth rate of the number of directedpaths in the graph, namely max x ∈S lim n →∞ n log N n ( x )exists and equals log ρ ( M G ), where ρ ( M G ) is the Perron-Frobenius eigenvalueof M G .It is also known that this common limit can be written assup G − compatible (Π ,π ) − X x,y π ( x ) π ( y | x ) log π ( y | x ) . (20)Here Π ranges over d × d transition probability matrices that are G -compatiblefor the directed graph G , i.e. such that π ( y | x ) > x, y ) ∈ E G ,and π ranges over invariant probability distributions for Π. Note that this isthe largest entropy rate among all stationary Markov chains whose transitionprobability matrix is compatible with the graph.This characterization of the growth rate of the number of paths in anirreducible graph is a consequence of the Donsker-Varadhan formula for thePerron-Frobenius eigenvalue of a nonnegative matrix. Let us verify this asa corollary of Theorem 4 in the case without controls. We take the statespace in Theorem 4 to be S , i.e. the vertex set of the graph. The controlspace U is a set consisting of a single point, which we write as U = { u } . Let p ( y | x, u ) := d ( x ) for d ( x ) := the out-degree of x and ( x, y ) ∈ E G , and let r ( x, u, y ) := ( log d ( x ) if ( x, y ) ∈ E G −∞ otherwise. (21)Substituting these into the right hand side of (10) gives the expression in(20). 24e now bring risk-sensitive control into this mix of ideas. Let U be afinite set and suppose now that for each u ∈ U we are given a directed graph G u with vertex set S . Assume that each vertex has at least one out-goingedge in each G u . We pose the problem of maximizingmax x ∈S lim inf n →∞ n log ˆ N n ( x ) , where now ˆ N n ( x ) is the largest number of directed paths of length n one cancreate when starting at x and at each time choosing one of the graphs alongwhich to move (i.e. one of the control actions) depending on the history ofthe states visited so far. More generally, we might allow for a randomizedchoice of the graph to be used at each time, based on the history of the statesand the realizations of the control so far, and ask for the maximum growthrate of the expectation of the number of directed paths of each length thatwe can create in this way.This problem can be posed in a framework that is amenable to an applica-tion of Theorem 4. As in the case without controls, we set p ( y | x, u ) := d u ( x ) for all ( x, y ) ∈ E G u , where d u ( x ) denotes the out-degree of vertex x in G u ,and we now set r ( x, u, y ) := ( log d u ( x ) if ( x, y ) ∈ E G u −∞ otherwise. (22)According to Theorem 4 this maximum growth rate is given bymax η − X x,u ˜ η ( x, u ) X y : ( x,y ) ∈E Gu η ( y | x, u ) log η ( y | x, u )) , where the maximum is over all η ( x, u, y ) = ˜ η ( x, u ) η ( y | x, u ) with η ( y | x, u ) > x, y ) ∈ E G u , and such that X ( x,u ) ˜ η ( x, u ) η ( y | x.u ) = η ( x ) , where, as usual, η ( x ) := P u ˜ η ( x, u ). Note that this has following interpreta-tion: among all stationary Markov chains (( X n , Z n ) , n ≥
0) with state space
S × U that are compatible with the family of graphs in the sense that if atransition from ( x, u ) to ( y, u ′ ) has positive probability then ( x, y ) ∈ E G u ,25aximize the conditional entropy of the next state given the current state-entropy pair, i.e. maximize H ( X | X , U ).The interpretation of the growth rate of the number of directed paths of agiven length in a directed graph as an entropy rate has considerable practicalimportance in coding theory. Each directed path of length n can be viewedas an allowed sequence of length n , with coordinates from the state space S , and the set of such directed paths is then viewed as a set of constrainedsequences [14, Problem 4.16], [31]. The problem of constrained coding hasbeen extensively studied. In one version of this problem, the goal is to comeup with algorithms that can take an infinitely long sequence of symbols froma finite set of size m and produce S -valued sequences as output in a one-to-one fashion, and such that the output sequences meet the constraints definedby the graph, see [31, Sec. 5.2] for more details. Naturally, it is not possibleto do this if log m exceeds the growth rate given by (20); finding efficientalgorithms to do this whenever log m is less than the growth rate given in(20) was a key early success in this area [1], [31]. Investigating the questionof constrained coding up to the maximum possible conditional entropy rategiven by the application of Theorem 4 to the controlled graph formulationabove would be an interesting challenge. As another example, we consider the portfolio optimization problem from [6],except that we consider the reward maximization framework instead of costminimization as in the classic work of Cover [13]. The model is as follows.The underlying ‘factor process’ { X n } is a discrete time Markov chain ona finite state space Q := { , · · · , m } (say) with an irreducible transitionmatrix Q = [[ q ( j | i )]]. The control space will be the simplex A := { a = a , · · · , a m ] ∈ R m : a i ≥ ∀ i, P i a i ≤ } , with a i denoting the proportionof wealth invested in the i th risky asset. In particular, 1 − P i a i is then theproportion invested in the risk-less bank account. We denote by { π n } the A -valued control sequence, representing the trading strategy, i.e., π n,i willbe the proportion of wealth invested in the i th risky asset at time n . { W n } is the process of m -dimensional vectors of price relatives such that W n +1 isconditionally independent of X i , i < n ; W i , π i , i ≤ n , given ( X n , X n +1 ) and itsconditional law given the latter is specified by a kernel ν ( x, y, dw ) : Q × Q 7→R m with support in the interior of the positive cone in R m . Let e r , r > , denote the per period multiplier of wealth invested in the bank account (thus26 r − denote the constant vector of all 1’s. Theevolution of the wealth process { V t } is given by V n +1 = V n [ e r + h π n , W n +1 − e r i ] , where V := 1. The objective is to maximize the risk-adjusted growth rateof wealth lim inf n ↑∞ n log E h e − θ log V n i . (23)Here θ is the risk sensitivity parameter. The control sequence { π n } isassumed to be adapted to the factor process { X n } and the controls, i.e. thedistribution of π n is chosen as a function of ( X , . . . , X n , π , . . . , π n − ).It is useful to constrast the objecive we consider with that considered in[6] of maximizing, for θ >
0, the quantitylim inf n ↑∞ − θ n log E h e − θ log V n i . (24)In [6], this problem is considered by writing the objective in (24) as − lim sup n ↑∞ θ n log E h e − θ log V n i , and then studying the risk-sensitive cost minimization problem correspondingto the objective lim sup n ↑∞ θ n log E h e − θ log V n i . That positive θ indicates risk aversion in (24) is argued, see [7, Eqn. (2.1)],by writing the Taylor’s series expansion, for small θ , − θ log E h e − θ log V n i = E [log V n ] − θ V n ) + O ( θ ) . By constrast, our formulation is able to handle both the case of risk-aversion and risk-seeking. The Taylor’s series expansionlog E h e − θ log V n i = − θ E [log V n ] + θ V n ) + o ( θ )27ndicates that if the objective in (23) is multiplied by − θ , then it correspondsto risk-aversion for positive θ and to risk-seeking for negative θ .Keeping in mind that e r + h a, W − e r i > ν ( x, y, dz ), define µ ( x, a, y ) := Z e − θ log[ e r + h a,w − e r i ] ν ( x, y, dw ) , (assumed to be < ∞ ) r ( x, a ) := − θ log X y q ( y | x ) µ ( x, a, y ) ! ,p ( y | x, a ) := q ( y | x ) µ ( x, a, y ) P y ′ q ( y ′ | x ) µ ( x, a, y ′ ) . One can show that for all n ≥ n log E h e − θ log V n i = 1 n log ˜ E h e − θ P n − m =0 r ( X m ,π m ) i , where ˜ E is the expectation with respect to the law p ( x ) φ ( da | x ) p ( x | x , a ) φ ( da | x , a , x ) . . . × φ n − ( da n − | ( x i , a i , ≤ i ≤ n − , x n − ) , where p ( x ) is the initial distribution of X , the admissible controls are de-termined by the kernels φ ( ·|· ) , . . . , φ n − ( ·|· ), and the salient point is that thetransition kernel for the evolution of the factor process under this changeof measure is given by the kernel p ( ·|· , · ) defined above. To see this, firstobserve that W , . . . , W n are conditionally independent and identically dis-tributed given ( X i , π i , ≤ i ≤ n ). Hence E h e − θ log V n | X i , π i , ≤ i ≤ n i = E " n − Y m =0 e − θ log Vm +1 Vm | X i , π i , ≤ i ≤ n = n − Y m =0 E h e − θ log Vm +1 Vm | X i , π i , ≤ i ≤ n i = n − Y m =0 µ ( X m , π m , X m +1 ) ,
28o we have E h e − θ log V n i = E " n − Y m =0 µ ( X m , π m , X m +1 ) . For an admissible control strategy, we can write this as X x ,...,x n Z a . . . Z a n − p ( x ) n − Y m =0 µ ( x m , a m , x m +1 ) q ( x m +1 | x m ) φ m ( da m | ( x i , a i , ≤ i ≤ m − , x m ) , which is the same as X x ,...,x n Z a . . . Z a n − p ( x ) n − Y m =0 e − θ r ( x m ,a m ) p ( x m +1 | x m , a m ) φ m ( da m | ( x i , a i , ≤ i ≤ m − , x m ) , which equals ˜ E h e − θ P n − m =0 r ( X m ,π m ) i .Hence the problem of maximizing (23) is equivalent to the risk-sensitivecontrol problem for a controlled Markov chain on Q with action space A andcontrolled transition probabilities p ( y | x, a ) , x, y ∈ Q , a ∈ A , the objectivebeing to maximize the reward λ := sup x sup lim inf n ↑∞ n log E h e − θ P n − m =0 r ( X m ,π m ) | X = x i . where the second supremum is over admissible controls.The optimal growth rate for the wealth is then given by λ = max η ∈G (cid:16) X x Z A ˜ η ( x, da )( − θ r ( x, a ) − X y Z A η ( y | x, a ) log (cid:18) η ( y | x, a ) p ( y | x, a ) (cid:19) ) (cid:17) where G := n η ( x, da, y ) ∈ P ( Q × A × Q ) : η ( x, da, y ) = ˜ η ( x, da ) η ( y | x, a )= η ( x ) η ( da | x ) η ( y | x, a ) such that η is stationary underthe transition matrix (cid:20)(cid:20)Z η ( da | x ) η ( y | x, a ) (cid:21)(cid:21) o .
29n order to justify this, we need to verify that the conditions (A0) and (A1) are satisfied. Here Q plays the role of S , A plays the role of U , and − θ r ( x, a ) plays the role of r ( x, u, y ) in the general theory. The validity of (A0) follows from the continuity of the logarithm function. The validity of (A1) follows from the continuity of the logarithm function, the fact that Q is finite, and because P y ′ q ( y ′ | x ) µ ( x, a, y ) is strictly positive for all ( x, a ).If we discretize A , this is a finite dimensional concave maximization prob-lem eminently amenable to standard nonlinear programming tools. Consider a set of controlled stochastic matrices on a finite state space S = { , · · · , s } denoted by P u = [[ p ( j | i, u )]] i,j ∈ S . Here u is the control parametertaking values in A , where A is a compact metric action space. We assumethat u P u is continuous and P u is irreducible for all u . Let S ⊂ S be anonempty proper subset of S and let S := S c denote its complement. Letˇ P u denote the restriction of P u to S and for a sequence of random variables { X n } with values in S , define τ := inf { n ≥ X n ∈ S } .We are interested in determining λ := sup i ∈ S sup lim inf n ↑∞ n log P ( τ > n ) , where the second supremum is over all admissible controls, and the law of τ isdetermined by the control strategy. Namely, we are interested in the problemof finding the slowest exit rate from S over admissible control strategies.Write ˇ P u = D u Q u where D u is a diagonal matrix with its i th diagonalentry d ( i, u ) := P j ∈ S p ( j | i, u ) and Q u := [[ q ( j | i, u )]] is a stochastic matrixon S given by q ( j | i, u ) := d ( i, u ) − p ( j | i, u ), where we will also assume that d ( i, u ) > i ∈ S and u ∈ A . It can be checked that for any admissiblecontrol strategy and i ∈ S , we have P ( τ > n ) = E h e P n − m =0 log( d ( X m ,U m )) i , where U m denotes the choice of control at time m , and { X n } is the S -valued Markov chain, having the transition probability matrix Q U m at time m . Therefore, with the choices S := S , U := A , and r ( i, u, j ) := log d ( i, u ),the problem is amenable to our general theory.30isintegrate a typical element η ∈ P ( S × A × S ) as η ( i ) η ( du | i ) η ( j | i, u ),and write ˜ η ( i, du ) for η ( i ) η ( du | i ).Then our results show that λ = max η ∈G (cid:16) X i,j ∈ S Z A η ( i, du, j ) log( d ( i, u )) − X i ∈ S Z A ˜ η ( i, du ) D ( η ( j | i, u ) || q ( j | i, u )) (cid:17) , where G denotes the set of η ∈ P ( S × A × S ) for which η is invariant underthe transition kernel R A η ( du | i ) η ( j | i, u ). To verify this, we need to checkthe validity of the conditions (A0) and (A1) . The former is a consequenceof the assumed continuity of u P u . The latter is a consequence of the factthat S is finite and that u Q u is continuous, which in turn follows fromthe assumed continuity of u P u and the assumption that d ( i, u ) > i ∈ S and u ∈ A . We considered the problem of maximizing the growth rate of reward in thestandard risk-sensitive formulation for a controlled Markov chain on a com-pact metric state space, with a compact metric action space. 
We took a non-standard approach to this problem via a nonlinear version of the Krein-Rutman theorem to obtain a variational formulation for the optimal reward. This leads to an occupation measure based concave maximization formulation of the control problem.

The approach holds promise for possible use of convex optimization techniques for approximate solution of the risk-sensitive reward maximization problem, in a manner analogous to what abstract linear programming does for the classical additive reward problems (such as discounted or ergodic rewards, see, e.g., [25]). We achieved this with rather few technical conditions except for the compactness of the state and action spaces. It remains a major challenge to extend this approach to noncompact state and action spaces.

References

[1] Adler, R. L., Coppersmith, D., and Hassner, M. (1983) “Algorithms for sliding block code: an application of symbolic dynamics to information theory”,
IEEE Trans. Information Theory
Cones and Duality, Graduate Studies in Mathematics, Vol. 84, American Mathematical Society, Providence, RI, USA. [3] Arapostathis, A. (2013) “A correction to Mahadevan’s nonlinear Krein-Rutman theorem”, (preprint). [4] Arapostathis, A., Borkar, V. S., and Kumar, K. S. (2013) “Risk-sensitive control and an abstract Collatz-Wielandt formula”, arXiv preprint arXiv:1312.5834. [5] Beneš, V. E. (1970) “Existence of optimal strategies based on specified information, for a class of stochastic decision problems”,
SIAM J. Control
Math. Methods of Op. Research
J. Econ. Dyn. and Control
Vol. 24, 1145-1177. [8] Bierkens, J., and Kappen, H. J. (2014) “Explicit solution of relative entropy weighted control”,
Systems & Control Letters
72, 36-43.[9] Billingsley, P. (1968)
Convergence of Probability Measures, John Wiley & Sons, New York, USA. [10] Borkar, V. S. (1989)
Optimal Control of Diffusion Processes, Pitman Research Notes in Math. No. 203, Longman Scientific & Technical, Harlow, UK. [11] Borkar, V. S. (2002) “Convex analytic methods in Markov decision processes”, in ‘
Handbook of Markov Decision Processes’ (E. A. Feinberg and A. Shwartz, eds.), Kluwer Academic Publishers, Boston, 347-375. [12] Chang, K. C. (2009) “A nonlinear Krein Rutman theorem”,
J. Systems Sci. and Complexity
Math. Finance
Elements of Information Theory (2nd ed.), Wiley-Interscience, New Jersey, USA.[15] Dembo, A., and Zeitouni, O. (1998)
Large Deviations Techniques and Applications (2nd ed.), Springer Verlag, New York, USA. [16] Di Masi, G. B., and Stettner, L. (2007) “Infinite horizon risk sensitive control of discrete time Markov processes under minorization property”,
SIAM J. Control and Optim.
A Weak Convergence Approach to the Theory of Large Deviations, John Wiley, New York. [18] Donsker, M. D., and Varadhan, S. R. S. (1975) “On a variational formula for the principal eigenvalue for operators with maximum principle”,
Proc. Nat. Acad. Sci. USA
Proc. Nat. Acad. Sci. USA
38, 121-126. [20] Fleming, W. H., and Hernández-Hernández, D. (1996) “Risk-sensitive control of finite state machines on an infinite horizon I”,
SIAM J. Control and Optim.
SIAM J. Control and Optim.
Linear Algebra and its Applications
IEEE Trans. Automatic Control
Systems and Control Letters
Handbook of Markov Decision Processes’ (E. A. Feinberg and A. Shwartz, eds.), Kluwer Academic Publishers, Boston, 377-407. [26] Jaśkiewicz, A. (2007) “Average optimality for risk-sensitive control with general state space”,
Ann. Appl. Prob.
Systems & Control Letters
Operations Research Letters
Uspekhi Mat. Nauk, 3(1:23), 3-95. [30] Lemmens, B., and Nussbaum, R. (2012)
Nonlinear Perron-Frobenius Theory, Cambridge Tracts in Mathematics, Vol. 189, Cambridge University Press, Cambridge, UK. [31] Lind, D., and Marcus, B. (1995)
An Introduction to Symbolic Dynamics and Coding, Cambridge University Press, Cambridge, UK. [32] Mahadevan, R. (2007) “A note on a non-linear Krein-Rutman theorem”,
Nonlinear Analysis: Theory, Methods & Appl.
Matrix Analysis and Applied Linear Algebra, SIAM. [35] Nisio, M. (1978) “On stochastic optimal controls and envelope of Markovian semigroups”, in
Proc. Intl. Symp. on Stochastic Differential Equations, RIMS, Kyoto Uni., Kyoto, 1976 (K. Ito, ed.), Wiley, New York, 297-325. [36] Nussbaum, R. D. (1980) “Eigenvalues of nonlinear positive operators and the linear Krein-Rutman theorem”, in
Lec. Notes in Math., Vol. 886, Springer, Berlin, 1981, 309-330. [37] Ogiwara, T. (1995) “Nonlinear Perron-Frobenius problem on an ordered Banach space”,
Japan J. Math., 21, 43-103. [38] Pollard, D. (2002)
A User’s Guide to Measure Theoretic Probability, Cambridge Uni. Press, Cambridge, UK. [39] Rudin, W. (1973)
Functional Analysis , McGraw-Hill, New York, USA.[40] Todorov, E. (2007) “Linearly-solvable Markov decision problems”,
Proc. Advances in Neural Information Processing Systems, Vol. 19 (B. Schölkopf, J. Platt and T. Hoffman, eds.), MIT Press, Cambridge, Mass., 2007, 1369-1376. [41] Van Erven, T., and Harremoës, P. (2014) “Rényi divergence and the Kullback-Leibler divergence”,