Zero-Sum Stochastic Games with Partial Information and Average Payoff
SUBHAMAY SAHA
Abstract.
We consider a discrete time partially observable zero-sum stochastic game with the average payoff criterion. We study the game by means of an equivalent completely observable game. We show that the game has a value and we also exhibit a pair of optimal strategies, one for each player.

1. Introduction
Stochastic games were introduced by Shapley in [9]. Following this pioneering work there has been a lot of work on stochastic games. For a survey of zero-sum games we refer to [10]. Most of the available literature in this category concerns stochastic games with complete observation, i.e., at each stage the state of the game is completely known to the players. Although there is a considerable amount of literature (see [2], [3], [4] and the references therein) on partially observable Markov decision processes (POMDPs), of which stochastic games are a generalisation, the corresponding literature on partially observable stochastic games is rather sparse. In [5] the authors study zero-sum partially observable stochastic games under the discounted payoff criterion. In this article we investigate the same problem under the average payoff criterion. In [3] the author studies POMDPs under the average cost criterion using the approach based on the Athreya-Ney-Nummelin construction of pseudo-atoms ([1], [8]) as described in [7]. In this article we extend those ideas to the zero-sum game case. Zero-sum stochastic games are generally studied by solving the corresponding dynamic programming or Shapley equations [10]. This approach has also been carried out for partially observable games in [5]. In this paper, instead of solving the appropriate Shapley equations, we solve two dynamic programming type inequalities, which in turn lead to the existence of a value and of saddle point strategies. Our article also extends the idea of using the pseudo-atom approach for solving MDPs to the stochastic game setup. Under a certain Lyapunov assumption we use the pseudo-atom construction to carry out a coupling argument, which gives an appropriate bound on the relative $\alpha$-discounted value function. This bound then enables us to make the appropriate limiting arguments.

Mathematics Subject Classification. Primary 91A15; Secondary 91A05, 91A25.
Key words and phrases. Stochastic games, partial observation, average payoff, saddle point strategies.
This work is supported in part by an SPM fellowship of CSIR and in part by the UGC Centre for Advanced Study.
The rest of the paper is organized as follows. In Section 2 we describe the model. In Section 3 we use the vanishing discount approach to prove the existence of a value and a saddle-point equilibrium for the POSG. We conclude with a few remarks in Section 4.

2. Preliminaries and Model Description
Let $X$, $Y$ and $U$, $V$ be Polish spaces representing the state, observation and action spaces for player 1 and player 2 respectively. We further assume that $U$ and $V$ are compact. For any Polish space $S$, we denote by $\mathcal{P}(S)$ the Polish space of probability measures on $S$ and by $\mathcal{B}(S)$ the Borel $\sigma$-field on $S$. Let $\{X_n\}$ be an $X$-valued partially observed controlled Markov chain with $Y$-valued observation process $\{Y_n\}$. Let
\[
(x,u,v) \in X \times U \times V \mapsto p(dz,dy \mid x,u,v) \in \mathcal{P}(X \times Y)
\]
be a transition kernel which is assumed to be continuous in its arguments. Let $\lambda$ denote a regular Borel (Radon) measure on $X$. We assume the existence of a probability measure $\eta$ on $Y$ and a $\varphi \in C_b(X \times U \times V \times X \times Y)$ with $\varphi(\cdot) > 0$, such that
\[
p(dz,dy \mid x,u,v) = \varphi(x,u,v,z,y)\, \lambda(dz)\, \eta(dy).
\]
The chain is controlled by two players. Player 1 chooses his actions from $U$ and player 2 chooses his actions from $V$. Let $\{U_n\}$ be a $U$-valued control sequence of player 1 and $\{V_n\}$ a $V$-valued control sequence of player 2. The transition law of the controlled chain $\{X_n\}$ together with the observation process $\{Y_n\}$ is given by
\[
P(X_{n+1} \in A,\ Y_{n+1} \in B \mid X_m, Y_m, U_m, V_m,\ m \le n) = \int_A \int_B \varphi(X_n,U_n,V_n,z,y)\, \eta(dy)\, \lambda(dz)
\]
for $A \in \mathcal{B}(X)$ and $B \in \mathcal{B}(Y)$. The partially observed stochastic game (POSG) under the ergodic payoff criterion is the following:
(i) The initial distribution of the (unobservable) state process, i.e. the law of $X_0$, is $\psi$, which is known to both players; $Y_0$ is deterministic, say $Y_0 = y^*$ for some fixed element $y^*$ of $Y$.
(ii) At the 0th epoch the players, based on the knowledge that the initial distribution of the state process is $\psi$, independently choose actions $u_0 \in U$ and $v_0 \in V$. Consequently, conditional on the event $X_0 = x_0$, player 1 gets an (unobservable) payoff $c(x_0,u_0,v_0)$ from player 2. Here $c : X \times U \times V \to \mathbb{R}_+$ is assumed to be a bounded continuous function. The next state and observation pair $(X_1,Y_1)$ is generated according to the stochastic kernel $p(dz,dy \mid x_0,u_0,v_0)$.
(iii) Now, conditioned on the event $Y_1 = y_1$, the players again choose their actions, and so on. This process is repeated over an infinite time horizon.
(iv) Each player can recall at any time the observations and actions of the past.
We now construct a probability space on which all the random variables are defined. The canonical sample space is defined as
\[
\Omega := (X \times Y \times U \times V)^\infty.
\]
A generic element is of the form $\omega = (x_0, y_0, u_0, v_0, x_1, \dots)$, $x_i \in X$, $y_i \in Y$, $u_i \in U$, $v_i \in V$.

The history spaces are defined as
\[
H_0 = X \times Y, \qquad H_{n+1} := H_n \times U \times V \times X \times Y.
\]

The state, observation, action and history processes, denoted by $\{X_n\}$, $\{Y_n\}$, $\{U_n\}$, $\{V_n\}$, $\{H_n\}$ respectively, are defined by the projections
\[
X_n(\omega) = x_n, \quad Y_n(\omega) = y_n, \quad U_n(\omega) = u_n, \quad V_n(\omega) = v_n, \quad H_n(\omega) = (x_0, y_0, u_0, v_0, \dots, u_{n-1}, v_{n-1}, x_n, y_n).
\]
The entire history up to time $n$ is not available to the players for decision making at time $n$. The players have to make their decisions based on the observed history, or information vector,
\[
i_n := (y_0, u_0, v_0, \dots, u_{n-1}, v_{n-1}, y_n)
\]
and the initial distribution $\psi$. We define the information spaces as follows:
\[
I_0 := Y, \qquad I_{n+1} := I_n \times U \times V \times Y.
\]
The information process is defined by $I_n(\omega) = (y_0, u_0, v_0, \dots, u_{n-1}, v_{n-1}, y_n)$. An admissible strategy for player 1 is a sequence $\pi^1 = \{\pi^1_n\}$ of stochastic kernels on $U$ given $\mathcal{P}(X) \times I_n$. The set of admissible strategies for player 1 is denoted by $\Pi_1$. Similarly, an admissible strategy for player 2 is a sequence $\pi^2 = \{\pi^2_n\}$ of stochastic kernels on $V$ given $\mathcal{P}(X) \times I_n$. The set of admissible strategies for player 2 is denoted by $\Pi_2$. With $\psi \in \mathcal{P}(X)$ and a pair of admissible strategies $(\pi^1, \pi^2) \in \Pi_1 \times \Pi_2$ specified, there exists a unique probability measure $P^{\pi^1,\pi^2}_\psi$ on $(\Omega, \mathcal{B}(\Omega))$ defined by
\[
\begin{aligned}
P^{\pi^1,\pi^2}_\psi(dx_0, dy_0, du_0, dv_0, \dots, du_{n-1}, dv_{n-1}, dx_n, dy_n)
&= \psi(dx_0)\, \delta_{y^*}(dy_0)\, \pi^1_0(du_0 \mid \psi, y_0)\, \pi^2_0(dv_0 \mid \psi, y_0)\, p(dx_1, dy_1 \mid x_0, u_0, v_0) \cdots \\
&\quad \pi^1_{n-1}(du_{n-1} \mid \psi, y_0, u_0, v_0, \dots, y_{n-1})\, \pi^2_{n-1}(dv_{n-1} \mid \psi, y_0, u_0, v_0, \dots, y_{n-1})\, p(dx_n, dy_n \mid x_{n-1}, u_{n-1}, v_{n-1}).
\end{aligned}
\tag{2.1}
\]
We now describe the payoff criterion. Given the initial distribution $\psi$ and a pair of strategies $(\pi^1, \pi^2) \in \Pi_1 \times \Pi_2$, the average payoff criterion is given by
\[
V^{\pi^1,\pi^2}(\psi) = \liminf_{n \to \infty} \frac{1}{n}\, E^{\pi^1,\pi^2}_\psi \Big[ \sum_{k=0}^{n-1} c(X_k, U_k, V_k) \Big], \tag{2.2}
\]
where $E^{\pi^1,\pi^2}_\psi$ is the expectation with respect to the probability measure $P^{\pi^1,\pi^2}_\psi$. Player 1 wishes to maximise $V^{\pi^1,\pi^2}(\psi)$ over all his admissible strategies and player 2 wishes to minimise the same over all his admissible strategies. A strategy $\pi^{1*}$ is said to be optimal for player 1 if
\[
V^{\pi^{1*},\pi^2}(\psi) \ge \inf_{\Pi_2} \sup_{\Pi_1} V^{\pi^1,\pi^2}(\psi)
\]
for any $\pi^2 \in \Pi_2$. Similarly, a strategy $\pi^{2*}$ is said to be optimal for player 2 if
\[
V^{\pi^1,\pi^{2*}}(\psi) \le \sup_{\Pi_1} \inf_{\Pi_2} V^{\pi^1,\pi^2}(\psi)
\]
for any $\pi^1 \in \Pi_1$. The game is said to have a value if
\[
\inf_{\Pi_2} \sup_{\Pi_1} V^{\pi^1,\pi^2}(\psi) = \sup_{\Pi_1} \inf_{\Pi_2} V^{\pi^1,\pi^2}(\psi).
\]
If a pair of optimal strategies $(\pi^{1*}, \pi^{2*})$ exists, then the pair $(\pi^{1*}, \pi^{2*})$ is called a saddle point equilibrium. Now, since the original state process is unobservable, we define another state variable which is observable to the players. To that end, by conditioning we have
\[
V^{\pi^1,\pi^2}(\psi) = \liminf_{n \to \infty} \frac{1}{n} \sum_{m=0}^{n-1} E^{\pi^1,\pi^2}_\psi [\tilde{c}(\Psi_m, U_m, V_m)], \tag{2.3}
\]
where $\{\Psi_n\}$ is the regular conditional law of $X_n$ given $I_n$, satisfying the recursion
\[
\Psi_{n+1}(dz) = \frac{\int_X \Psi_n(dx)\, \varphi(x, U_n, V_n, z, Y_{n+1})\, \lambda(dz)}{\int_X \int_X \Psi_n(dx)\, \varphi(x, U_n, V_n, z, Y_{n+1})\, \lambda(dz)}, \qquad n \ge 0, \tag{2.4}
\]
and
\[
\tilde{c}(\psi, u, v) = \int_X c(x, u, v)\, \psi(dx).
\]
Equation (2.4) is known as the filtering equation. Note that since $Y_0$ is deterministic, $\Psi_0 = \psi$, the law of $X_0$. This allows us to consider an equivalent stochastic game with a $\mathcal{P}(X)$-valued state process $\{\Psi_n\}$, with its evolution given by (2.4), under the same set of admissible strategies and with the payoff criterion given by (2.3). This is a completely observable stochastic game (COSG) because $\Psi_n$ is known to both players via the information up to time $n$. Thus we can solve the original POSG by solving this equivalent COSG.
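To fix ideas, the filtering recursion (2.4) is easy to implement when $X$ and $Y$ are finite and $\lambda$, $\eta$ are counting measures. The sketch below is purely illustrative and not part of the model above; the finite-state restriction and the array names (psi, phi) are our own assumptions.

```python
import numpy as np

def belief_update(psi, phi, u, v, y_next):
    """One step of the filtering equation (2.4) for finite X, Y (lambda, eta = counting measures).

    psi : current belief, shape (|X|,), psi[x] = P(X_n = x | I_n)
    phi : transition density, shape (|X|, |U|, |V|, |X|, |Y|),
          phi[x, u, v, z, y] = P(X_{n+1} = z, Y_{n+1} = y | X_n = x, U_n = u, V_n = v)
    Returns the updated belief psi'[z] = P(X_{n+1} = z | I_{n+1}) given actions (u, v) and observation y_next.
    """
    # numerator of (2.4): integrate the current belief against the kernel at the observed y_next
    unnormalised = psi @ phi[:, u, v, :, y_next]        # shape (|X|,)
    total = unnormalised.sum()                          # denominator of (2.4)
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return unnormalised / total
```

Iterating belief_update along the observed actions and observations reproduces the $\mathcal{P}(X)$-valued state process $\{\Psi_n\}$ of the COSG.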
Now, in order to show that the POSG model under the average payoff criterion has a saddle point equilibrium and a value, we impose the following Lyapunov type assumptions on our model.

(A1) There exist inf-compact functions $h$ and $V \in C(X)$ satisfying $h \ge 1$, such that under any pair of admissible strategies and for any initial distribution,
\[
E(V(X_{n+1}) \mid \mathcal{F}_n) - V(X_n) \le -h(X_n) + c\, I_K(X_n), \tag{2.5}
\]
where $K$ is some compact set with $\lambda(K) > 0$ and $\mathcal{F}_n = \sigma(X_k, Y_k, U_k, V_k,\ k \le n)$. We have dropped the super- and subscripts on $E$ for notational convenience. Let
\[
\tau_K = \min\{ n \ge 1 : X_n \in K \}.
\]
Then it is well known ([7]) that
\[
E[\tau_K \mid X_0 = x] = O(V(x)).
\]
Define
\[
\mathcal{P}_V(X) = \Big\{ \mu \in \mathcal{P}(X) : \int V \, d\mu < \infty \Big\}.
\]
Now using (2.5) we obtain
\[
E[V(X_{n+1})] = E\Big[ \int_X V(x)\, \Psi_{n+1}(dx) \Big] \le E[V(X_n)] + \text{constant} = E\Big[ \int_X V(x)\, \Psi_n(dx) \Big] + \text{constant}.
\]
Hence it follows that if $\Psi_0 \in \mathcal{P}_V(X)$ then $\Psi_n \in \mathcal{P}_V(X)$ for all $n \ge 1$. We assume that $\Psi_0 \in \mathcal{P}_V(X)$, and hence $\{\Psi_n\}$ can be viewed as a $\mathcal{P}_V(X)$-valued process. We further assume that:

(A2) Under all admissible strategies and for any initial distribution,
\[
\lim_{n \to \infty} \frac{E[V(X_n)]}{n} = 0.
\]
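As a purely illustrative instance of (A1)-(A2), not taken from the paper, consider $X = \mathbb{R}$ with the controlled linear dynamics $X_{n+1} = \rho(U_n, V_n) X_n + \xi_{n+1}$, where $|\rho(u,v)| \le \rho_0 < 1$ and $\{\xi_n\}$ are i.i.d. $N(0, \sigma^2)$. Then $V(x) = 1 + x^2$ and $h(x) = 1 + \tfrac{1}{2}(1-\rho_0^2)x^2$ satisfy (2.5) with $K$ a sufficiently large interval, and $E[V(X_n)]$ stays bounded, so (A2) holds as well. The snippet below checks the drift inequality numerically on a grid; the model and all names in it are our own choices.

```python
import numpy as np

# Hypothetical controlled AR(1) model: X_{n+1} = rho(u, v) X_n + xi,  xi ~ N(0, sigma^2)
rho = lambda u, v: 0.5 * np.cos(u - v)                 # any choice with |rho(u, v)| <= rho0 < 1
rho0, sigma = 0.5, 1.0

V = lambda x: 1.0 + x**2                               # Lyapunov function
h = lambda x: 1.0 + 0.5 * (1.0 - rho0**2) * x**2       # inf-compact, h >= 1
r = np.sqrt(2.0 * (sigma**2 + 1.0) / (1.0 - rho0**2))  # K = [-r, r]
c_K = sigma**2 + 1.0                                   # constant multiplying I_K in (2.5)

ok = True
for x in np.linspace(-10.0, 10.0, 401):
    for u in np.linspace(0.0, np.pi, 9):
        for v in np.linspace(0.0, np.pi, 9):
            # E[V(X_{n+1}) | X_n = x] has the closed form 1 + rho^2 x^2 + sigma^2
            drift = (1.0 + rho(u, v)**2 * x**2 + sigma**2) - V(x)
            ok &= drift <= -h(x) + c_K * (abs(x) <= r) + 1e-9
print("drift condition (2.5) holds on the grid:", ok)
```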
3. Saddle Point Strategies and Value

We follow the vanishing discount approach to solve the average cost problem. Let $\alpha \in (0,1)$. The $\alpha$-discounted payoff criterion for the POSG is given by
\[
V^{\pi^1,\pi^2}_\alpha(\psi) = E^{\pi^1,\pi^2}_\psi \Big[ \sum_{k=0}^{\infty} \alpha^k c(X_k, U_k, V_k) \Big].
\]
Player 1 tries to maximise the above quantity over all his admissible strategies and player 2 tries to minimise the same quantity over his admissible strategies. The definitions of the value of the game and of optimal strategies are analogous to those for the average payoff criterion. The following theorem can be proved using the equivalence with the COSG discussed above and standard arguments as in [5]:
Theorem 3.1.
The discounted payoff POSG has a value, and the value function $V_\alpha(\cdot)$ is the unique bounded solution of the following pair of Shapley equations:
\[
\begin{aligned}
V_\alpha(\psi) &= \min_{\nu \in \mathcal{P}(V)} \max_{\mu \in \mathcal{P}(U)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu) + \alpha \int_{\mathcal{P}(X)} V_\alpha(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu) \Big] \\
&= \max_{\mu \in \mathcal{P}(U)} \min_{\nu \in \mathcal{P}(V)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu) + \alpha \int_{\mathcal{P}(X)} V_\alpha(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu) \Big],
\end{aligned}
\tag{3.1}
\]
where
\[
\phi(d\psi' \mid \psi, \mu, \nu) = \int_U \int_V \tilde{\phi}(d\psi' \mid \psi, u, v)\, \mu(du)\, \nu(dv),
\]
with $\tilde{\phi}(d\psi' \mid \psi, u, v)$ being the controlled transition kernel of the Markov chain $\{\Psi_n\}$, and
\[
\bar{\tilde{c}}(\psi, \mu, \nu) = \int_U \int_V \tilde{c}(\psi, u, v)\, \mu(du)\, \nu(dv).
\]
Moreover, let $u^* : \mathcal{P}(X) \to \mathcal{P}(U)$ be a measurable function such that $u^*(\cdot)$ is an outer maximiser of (3.1). Then the strategy $\{\pi^{1*}_n\}$ given by
\[
\pi^{1*}_n(\cdot \mid i_n) = u^*(\psi_n)(\cdot)
\]
is optimal for player 1. Further, let $v^* : \mathcal{P}(X) \to \mathcal{P}(V)$ be a measurable function such that $v^*(\cdot)$ is an outer minimiser of (3.1). Then $\{\pi^{2*}_n\}$ given by
\[
\pi^{2*}_n(\cdot \mid i_n) = v^*(\psi_n)(\cdot)
\]
is an optimal strategy for player 2.
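To make the fixed point in Theorem 3.1 concrete, the following sketch iterates the Shapley operator for an $\alpha$-discounted zero-sum game on a finite state space, solving the matrix game at each state by linear programming with SciPy's linprog. In the paper the state is the belief $\Psi_n \in \mathcal{P}(X)$, so this should be read as value iteration on a finite discretisation of the belief space; the function names and the data layout are our own, purely illustrative, choices.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(A):
    """Value and optimal mixed strategy of the row player for max_mu min_nu mu^T A nu, via LP."""
    m, n = A.shape
    # variables: (w, mu_1, ..., mu_m); maximise w subject to sum_i mu_i A[i, j] >= w for every column j
    c = np.zeros(m + 1); c[0] = -1.0                   # linprog minimises, so minimise -w
    A_ub = np.hstack([np.ones((n, 1)), -A.T])          # w - (A^T mu)_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, m + 1)); A_eq[0, 1:] = 1.0     # mu is a probability vector
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, 1.0)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return -res.fun, res.x[1:]

def shapley_value_iteration(r, P, alpha, tol=1e-8, max_iter=10_000):
    """Iterate the Shapley operator of (3.1) on a finite state space.

    r[s]       : |U| x |V| payoff matrix at state s
    P[s, i, j] : distribution of the next state under actions (i, j) at state s
    """
    S = len(r)
    V = np.zeros(S)
    for _ in range(max_iter):
        V_new = np.empty(S)
        for s in range(S):
            # one-shot game: current payoff plus discounted expected continuation value
            A = r[s] + alpha * np.tensordot(P[s], V, axes=([2], [0]))
            V_new[s], _ = matrix_game_value(A)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V
```

The mixed strategies returned by matrix_game_value at each state play the role of the outer maximiser $u^*(\psi)$ (and, by symmetry, the outer minimiser $v^*(\psi)$) in the theorem.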
Now, for the vanishing discount approach, we need to compare $V_\alpha(\cdot)$ for two different values of its argument. For that we construct, on a common probability space, two $X$-valued controlled Markov chains as above, controlled by the same pair of strategies but with different initial distributions $\hat\psi$ and $\tilde\psi$. This is done by a modification of the construction in the previous section. Let $\{\pi^1_n\}$ be an admissible strategy for player 1 and let $\{\pi^2_n\}$ be an admissible strategy for player 2. Define
\[
\bar\Omega = (X \times X \times Y \times Y \times U \times V)^\infty
\]
with $\bar{\mathcal{F}}$ the corresponding product Borel $\sigma$-algebra. Define $\bar P^{\pi^1,\pi^2}_{\hat\psi,\tilde\psi}$, a probability measure on $(\bar\Omega, \bar{\mathcal{F}})$, by
\[
\begin{aligned}
&\bar P^{\pi^1,\pi^2}_{\hat\psi,\tilde\psi}(d\hat x_0, d\tilde x_0, d\hat y_0, d\tilde y_0, du_0, dv_0, d\hat x_1, d\tilde x_1, d\hat y_1, d\tilde y_1, du_1, dv_1, \dots, du_{n-1}, dv_{n-1}, d\hat x_n, d\tilde x_n, d\hat y_n, d\tilde y_n) \\
&\quad = \hat\psi(d\hat x_0)\, \tilde\psi(d\tilde x_0)\, \delta_{y^*}(d\hat y_0)\, \delta_{y^*}(d\tilde y_0)\, \pi^1_0(du_0 \mid \hat\psi, \hat y_0)\, \pi^2_0(dv_0 \mid \tilde\psi, \tilde y_0)\, p(d\hat x_1, d\hat y_1 \mid \hat x_0, u_0, v_0)\, p(d\tilde x_1, d\tilde y_1 \mid \tilde x_0, u_0, v_0) \\
&\qquad \pi^1_1(du_1 \mid \hat\psi, \hat y_0, u_0, v_0, \hat y_1)\, \pi^2_1(dv_1 \mid \tilde\psi, \tilde y_0, u_0, v_0, \tilde y_1) \cdots \\
&\qquad \pi^1_{n-1}(du_{n-1} \mid \hat\psi, \hat y_0, u_0, v_0, \hat y_1, \dots, u_{n-2}, v_{n-2}, \hat y_{n-1})\, \pi^2_{n-1}(dv_{n-1} \mid \tilde\psi, \tilde y_0, u_0, v_0, \tilde y_1, \dots, u_{n-2}, v_{n-2}, \tilde y_{n-1}) \\
&\qquad p(d\hat x_n, d\hat y_n \mid \hat x_{n-1}, u_{n-1}, v_{n-1})\, p(d\tilde x_n, d\tilde y_n \mid \tilde x_{n-1}, u_{n-1}, v_{n-1}).
\end{aligned}
\]
On $(\bar\Omega, \bar{\mathcal{F}}, \bar P)$, define the processes $\{\hat X_n\}$, $\{\tilde X_n\}$, $\{\hat Y_n\}$, $\{\tilde Y_n\}$, $\{U_n\}$, $\{V_n\}$ canonically. Then the Markov chains $\{\hat X_n\}$, $\{\tilde X_n\}$ on $(\bar\Omega, \bar{\mathcal{F}}, \bar P)$ form the desired pair. For notational simplicity we omit the superscripts and subscripts on $\bar P$. We write $\bar X_n = (\hat X_n, \tilde X_n)$ and denote the associated observation pair by $\bar Y_n = (\hat Y_n, \tilde Y_n)$. Then $\{\bar X_n\}$ is an $X^2$-valued Markov chain. Let its controlled transition kernel be denoted by
\[
\bar p(d\bar z, d\bar y \mid \bar x, u, v) \in \mathcal{P}(X^2 \times Y^2)
\]
for $\bar x = (x_1, x_2) \in X^2$. Define $G = K^2$ and define $\Theta \in \mathcal{P}(X^2)$ by
\[
\Theta(A) = \frac{(\lambda \times \lambda)(A \cap G)}{\lambda(K)^2}
\]
for any Borel set $A$ of $X^2$. Then it follows from our assumptions that
\[
\bar p(A \times Y^2 \mid \bar x, u, v) \ge \delta\, I_G(\bar x)\, \Theta(A),
\]
where
\[
\delta = \Big( \inf_{x \in K,\, u \in U,\, v \in V,\, z \in K} \int_Y \varphi(x,u,v,z,y)\, \eta(dy)\; \lambda(K) \Big)^2.
\]
This is the minorization condition of [7] in the present context, which enables us to carry out the Athreya-Ney-Nummelin construction of a pseudo-atom [7].
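The splitting construction just invoked can be illustrated on a much simpler, uncontrolled chain. The sketch below simulates a one-dimensional AR(1) chain together with the Athreya-Ney-Nummelin "bell" variable: on the small set $C$ the transition density is bounded below by $\delta$ times the uniform density on $C$, and whenever the bell rings the next state is effectively drawn afresh from that uniform law, which is exactly the regeneration role played by the atom $K^2 \times \{1\}$ for the coupled chain above. The one-dimensional model, the set $C$ and the constant $\delta$ are our own illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
a = 0.5                                       # X_{n+1} = a X_n + xi,  xi ~ N(0, 1)
C = (-1.0, 1.0)                               # small set
beta = norm.pdf(1.0 + a)                      # p(z | x) >= beta for all x, z in C
delta = 2.0 * beta                            # minorization: p(dz | x) >= delta * Unif_C(dz) for x in C
theta = lambda z: 0.5 if C[0] <= z <= C[1] else 0.0   # density of Theta = Unif(C)

def split_chain_step(x):
    """One step of the split chain: returns (next state, bell indicator)."""
    z = a * x + rng.standard_normal()
    bell = 0
    if C[0] <= x <= C[1]:
        # retrospective splitting: given x in C and the sampled z,
        # ring the bell with probability delta * theta(z) / p(z | x), which is always <= 1
        bell = int(rng.random() < delta * theta(z) / norm.pdf(z - a * x))
    return z, bell

x, regenerations = 0.0, []
for n in range(10_000):
    x, bell = split_chain_step(x)
    if bell:
        regenerations.append(n)               # the chain probabilistically restarts at these times
print("number of regenerations:", len(regenerations))
```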
Let $H = X^2$ and $H^* = X^2 \times \{0,1\}$. Endow $H^*$ with its Borel $\sigma$-field. For any measure $\mu$ on $H$, define a measure $\mu^*$ on $H^*$ as follows: for Borel $A \subset H$, let $A_0 = A \times \{0\}$ and $A_1 = A \times \{1\}$. Then
\[
\mu^*(A_0) = (1-\delta)\, \mu(A \cap K^2) + \mu(A \cap (K^2)^c), \qquad \mu^*(A_1) = \delta\, \mu(A \cap K^2).
\]
For a measure $\mu$ on $H \times Y^2$, we define the measure $\mu^*$ on $H^* \times Y^2$ by
\[
\mu^*(A_0 \times D) = (1-\delta)\, \mu((A \cap K^2) \times D) + \mu((A \cap (K^2)^c) \times D), \qquad \mu^*(A_1 \times D) = \delta\, \mu((A \cap K^2) \times D),
\]
for $D \subset Y^2$ Borel. On a suitable probability space $(\Omega^*, \mathcal{F}^*, P^*)$, define an $H^*$-valued controlled Markov chain $\{(X^*_n, i^*_n)\}$ (where $X^*_n = (\hat X^*_n, \tilde X^*_n)$) with $U$-valued control process $\{U^*_n\}$, $V$-valued control process $\{V^*_n\}$ and $Y^2$-valued observation process $\{Y^*_n\}$, such that:
(i) the controlled transition kernel of $\{(X^*_n, i^*_n, Y^*_n)\}$ is given by: for $x = (x_0, i) \in H^*$,
\[
q(d\bar x, d\bar y \mid x, u, v) =
\begin{cases}
\bar p^*(d\bar x, d\bar y \mid x_0, u, v), & x \in (H \setminus K^2) \times \{0\}, \\[3pt]
\dfrac{1}{1-\delta}\big( \bar p^*(d\bar x, d\bar y \mid x_0, u, v) - \delta\, \Theta^*(d\bar x)\, \eta^2(d\bar y) \big), & x \in K^2 \times \{0\}, \\[3pt]
\Theta^*(d\bar x)\, \eta^2(d\bar y), & x \in H \times \{1\};
\end{cases}
\]
(ii)
\[
\begin{aligned}
P^*((X^*_0, i^*_0) \in A_0,\ Y^*_0 \in A',\ U^*_0 \in \Delta,\ V^*_0 \in \Gamma) &= (1-\delta)\, \bar P(\bar X_0 \in A \cap K^2,\ \bar Y_0 \in A',\ U_0 \in \Delta,\ V_0 \in \Gamma) \\
&\quad + \bar P(\bar X_0 \in A \cap (K^2)^c,\ \bar Y_0 \in A',\ U_0 \in \Delta,\ V_0 \in \Gamma), \\
P^*((X^*_0, i^*_0) \in A_1,\ Y^*_0 \in A',\ U^*_0 \in \Delta,\ V^*_0 \in \Gamma) &= \delta\, \bar P(\bar X_0 \in A \cap K^2,\ \bar Y_0 \in A',\ U_0 \in \Delta,\ V_0 \in \Gamma)
\end{aligned}
\]
for $A \subset H$, $A' \subset Y^2$, $\Delta \subset U$, $\Gamma \subset V$ Borel;
(iii) and
\[
\begin{aligned}
&P^*(U^*_n \in \Delta,\ V^*_n \in \Gamma \mid (X^*_m, i^*_m, Y^*_m) = (x_m, i_m, y_m),\ m \le n,\ U^*_k = u_k,\ V^*_k = v_k,\ k < n) \\
&\quad = \bar P(U_n \in \Delta,\ V_n \in \Gamma \mid (\bar X_m, \bar Y_m) = (x_m, y_m),\ m \le n,\ U_k = u_k,\ V_k = v_k,\ k < n)
\end{aligned}
\]
for $n \ge 1$. Here $\eta^2 = \eta \times \eta$. From the above construction the following lemmas can be proved.
Lemma 3.2.
The set $K^2 \times \{1\}$ is an accessible atom of $\{(X^*_n, i^*_n)\}$ in the sense of Meyn and Tweedie ([7]).

Lemma 3.3.
For any Borel $A_i \subset H$, $B_i \subset Y^2$, $\Delta_i \subset U$, $\Gamma_i \subset V$, $0 \le i \le n$, $n \ge 0$,
\[
\begin{aligned}
&P^*\Big( \big( (X^*_0, i^*_0, Y^*_0, U^*_0, V^*_0), \dots, (X^*_n, i^*_n, Y^*_n, U^*_n, V^*_n) \big) \in \prod_{i=0}^{n} \big( (A_i \times \{0\}) \cup (A_i \times \{1\}) \big) \times B_i \times \Delta_i \times \Gamma_i \Big) \\
&\quad = \bar P\Big( \big( (\bar X_0, \bar Y_0, U_0, V_0), \dots, (\bar X_n, \bar Y_n, U_n, V_n) \big) \in \prod_{i=0}^{n} A_i \times B_i \times \Delta_i \times \Gamma_i \Big).
\end{aligned}
\]
Let
\[
\tau = \min\{ n \ge 1 : (X^*_n, i^*_n) \in K^2 \times \{1\} \}. \tag{3.2}
\]
Then the following lemma can be proved using (A1) and standard arguments as in [7].
Lemma 3.4.
Under (A1) we have
\[
E^*[\tau \mid (X^*_0, i^*_0) = (x, i)] = O(V(x_1) + V(x_2)) \tag{3.3}
\]
for any $(x, i) = ((x_1, x_2), i) \in X^2 \times \{0, 1\}$, where $\tau$ is as in (3.2).

The following lemma gives a bound on the difference of $V_\alpha(\cdot)$ for two different values of its argument.

Lemma 3.5.
For $\hat\psi, \tilde\psi \in \mathcal{P}_V(X)$, there exists a suitable constant $\bar K$ such that
\[
|V_\alpha(\hat\psi) - V_\alpha(\tilde\psi)| \le \bar K \Big[ \int V \, d\hat\psi + \int V \, d\tilde\psi \Big].
\]

Proof.
Let $V_\alpha(\hat\psi) \ge V_\alpha(\tilde\psi)$; the other case can be handled by a symmetric argument. Let $\pi^1 = \{\pi^1_n\}$ be an optimal strategy for player 1 for the discounted payoff POSG with initial distribution $\hat\psi$, and let $\pi^2 = \{\pi^2_n\}$ be an optimal strategy for player 2 for the discounted payoff POSG with initial distribution $\tilde\psi$. Then we have
\[
\begin{aligned}
|V_\alpha(\hat\psi) - V_\alpha(\tilde\psi)| &\le \Big| \sum_{m=0}^{\infty} \alpha^m \bar E^{\pi^1,\pi^2}_{\hat\psi,\tilde\psi}[c(\hat X_m, U_m, V_m)] - \sum_{m=0}^{\infty} \alpha^m \bar E^{\pi^1,\pi^2}_{\hat\psi,\tilde\psi}[c(\tilde X_m, U_m, V_m)] \Big| \\
&= \Big| \sum_{m=0}^{\infty} \alpha^m \bar E[c(\hat X_m, U_m, V_m) - c(\tilde X_m, U_m, V_m)] \Big| \\
&\le \Big| E^*\Big[ \sum_{m=0}^{\tau} \alpha^m \big( c(\hat X^*_m, U^*_m, V^*_m) - c(\tilde X^*_m, U^*_m, V^*_m) \big) \Big] \Big| \\
&\le 2 \|c\|_\infty\, E^*(\tau + 1),
\end{aligned}
\]
where the third step follows from the fact that, beyond the coupling time $\tau$, $\hat X^*_{\tau+m}$ and $\tilde X^*_{\tau+m}$, $m \ge 1$, have the same conditional law, so that the corresponding terms cancel in expectation. Thus from (3.3) we have
\[
|V_\alpha(\hat\psi) - V_\alpha(\tilde\psi)| \le \bar K\, \bar E[V(\hat X_0) + V(\tilde X_0)].
\]
Hence the lemma follows. $\Box$
Now fix $\psi^* \in \mathcal{P}_V(X)$. Define $\bar V_\alpha(\psi) = V_\alpha(\psi) - V_\alpha(\psi^*)$. Substituting in (3.1) we get
\[
\begin{aligned}
\bar V_\alpha(\psi) + (1-\alpha) V_\alpha(\psi^*) &= \min_{\nu \in \mathcal{P}(V)} \max_{\mu \in \mathcal{P}(U)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu) + \alpha \int_{\mathcal{P}(X)} \bar V_\alpha(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu) \Big] \\
&= \max_{\mu \in \mathcal{P}(U)} \min_{\nu \in \mathcal{P}(V)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu) + \alpha \int_{\mathcal{P}(X)} \bar V_\alpha(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu) \Big].
\end{aligned}
\tag{3.4}
\]
Now $(1-\alpha) V_\alpha(\psi^*)$ is bounded. Thus we can find a sequence $\alpha(n) \to 1$ such that
\[
(1 - \alpha(n))\, V_{\alpha(n)}(\psi^*) \to \gamma \tag{3.5}
\]
for some $\gamma \in \mathbb{R}$. Let $\hat V(\psi) = \limsup_{n \to \infty} \bar V_{\alpha(n)}(\psi)$ and $\underline{V}(\psi) = \liminf_{n \to \infty} \bar V_{\alpha(n)}(\psi)$.
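Numerically, (3.4)-(3.5) can be observed on the finite sketch given after Theorem 3.1: as $\alpha \uparrow 1$ the normalised values $(1-\alpha)V_\alpha(\psi^*)$ settle near a constant $\gamma$, while the relative values $\bar V_\alpha(\psi)$ remain bounded. The snippet below reuses shapley_value_iteration from that earlier sketch; the two-state game data are again our own illustrative choice.

```python
import numpy as np

# Toy two-state, two-action zero-sum game (illustrative data only)
r = np.array([[[1.0, 0.0],
               [0.0, 2.0]],      # payoff matrix in state 0
              [[0.0, 1.0],
               [3.0, 0.0]]])     # payoff matrix in state 1
P = np.zeros((2, 2, 2, 2))
P[0, :, :, :] = [0.7, 0.3]       # next-state law from state 0 (independent of actions here)
P[1, :, :, :] = [0.4, 0.6]       # next-state law from state 1

psi_star = 0                     # reference state, playing the role of psi* in (3.4)
for alpha in (0.9, 0.95, 0.99):
    V = shapley_value_iteration(r, P, alpha)   # sketch defined after Theorem 3.1
    print(alpha,
          (1 - alpha) * V[psi_star],           # should stabilise near gamma, cf. (3.5)
          V - V[psi_star])                     # relative value, stays bounded
```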
Lemma 3.6.
The function $\hat V$ satisfies
\[
\hat V(\psi) + \gamma \le \max_{\mu \in \mathcal{P}(U)} \min_{\nu \in \mathcal{P}(V)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu) + \int_{\mathcal{P}(X)} \hat V(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu) \Big], \tag{3.6}
\]
where $\gamma$ is as in (3.5).

Proof. We have
\[
\bar V_{\alpha(n)}(\psi) + (1 - \alpha(n)) V_{\alpha(n)}(\psi^*) = \max_{\mu \in \mathcal{P}(U)} \min_{\nu \in \mathcal{P}(V)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu) + \alpha(n) \int_{\mathcal{P}(X)} \bar V_{\alpha(n)}(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu) \Big].
\]
Now taking the limit $n \to \infty$ in the above, we get
\[
\begin{aligned}
\hat V(\psi) + \gamma &= \limsup_{n \to \infty} \max_{\mu \in \mathcal{P}(U)} \min_{\nu \in \mathcal{P}(V)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu) + \alpha(n) \int_{\mathcal{P}(X)} \bar V_{\alpha(n)}(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu) \Big] \\
&= \limsup_{n \to \infty} \min_{\nu \in \mathcal{P}(V)} \Big[ \bar{\tilde{c}}(\psi, \mu^*_n, \nu) + \alpha(n) \int_{\mathcal{P}(X)} \bar V_{\alpha(n)}(\psi')\, \phi(d\psi' \mid \psi, \mu^*_n, \nu) \Big] \\
&\le \min_{\nu \in \mathcal{P}(V)} \limsup_{n \to \infty} \Big[ \bar{\tilde{c}}(\psi, \mu^*_n, \nu) + \alpha(n) \int_{\mathcal{P}(X)} \bar V_{\alpha(n)}(\psi')\, \phi(d\psi' \mid \psi, \mu^*_n, \nu) \Big].
\end{aligned}
\]
In the second step $\mu^*_n$ is the outer maximiser. Now fix $\psi$. By dropping to a subsequence if necessary, we may suppose that $\bar V_{\alpha(n)}(\psi) \to \hat V(\psi)$ and $\mu^*_n \to \mu^*$ in $\mathcal{P}(U)$. Now, by the previous lemma, $|\bar V_\alpha(\psi)| \le K\big(1 + \int V \, d\psi\big)$. Thus, by Lemma 8.3.7 in [6], the last expression is bounded above by
\[
\min_{\nu \in \mathcal{P}(V)} \Big[ \bar{\tilde{c}}(\psi, \mu^*, \nu) + \int_{\mathcal{P}(X)} \hat V(\psi')\, \phi(d\psi' \mid \psi, \mu^*, \nu) \Big] \le \max_{\mu \in \mathcal{P}(U)} \min_{\nu \in \mathcal{P}(V)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu) + \int_{\mathcal{P}(X)} \hat V(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu) \Big].
\]
The claim follows. $\Box$
Similarly we have the following result.
Lemma 3.7.
The function $\underline{V}$ satisfies
\[
\underline{V}(\psi) + \gamma \ge \min_{\nu \in \mathcal{P}(V)} \max_{\mu \in \mathcal{P}(U)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu) + \int_{\mathcal{P}(X)} \underline{V}(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu) \Big]. \tag{3.7}
\]

Proof. We have
\[
\bar V_{\alpha(n)}(\psi) + (1 - \alpha(n)) V_{\alpha(n)}(\psi^*) = \min_{\nu \in \mathcal{P}(V)} \max_{\mu \in \mathcal{P}(U)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu) + \alpha(n) \int_{\mathcal{P}(X)} \bar V_{\alpha(n)}(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu) \Big].
\]
Now taking the limit $n \to \infty$ in the above, we get
\[
\begin{aligned}
\underline{V}(\psi) + \gamma &= \liminf_{n \to \infty} \min_{\nu \in \mathcal{P}(V)} \max_{\mu \in \mathcal{P}(U)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu) + \alpha(n) \int_{\mathcal{P}(X)} \bar V_{\alpha(n)}(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu) \Big] \\
&= \liminf_{n \to \infty} \max_{\mu \in \mathcal{P}(U)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu^*_n) + \alpha(n) \int_{\mathcal{P}(X)} \bar V_{\alpha(n)}(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu^*_n) \Big] \\
&\ge \max_{\mu \in \mathcal{P}(U)} \liminf_{n \to \infty} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu^*_n) + \alpha(n) \int_{\mathcal{P}(X)} \bar V_{\alpha(n)}(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu^*_n) \Big].
\end{aligned}
\]
In the second step $\nu^*_n$ is the outer minimiser. Now, by arguments analogous to those in the proof of the previous lemma, there exists a $\nu^* \in \mathcal{P}(V)$ such that the last expression is bounded below by
\[
\max_{\mu \in \mathcal{P}(U)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu^*) + \int_{\mathcal{P}(X)} \underline{V}(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu^*) \Big] \ge \min_{\nu \in \mathcal{P}(V)} \max_{\mu \in \mathcal{P}(U)} \Big[ \bar{\tilde{c}}(\psi, \mu, \nu) + \int_{\mathcal{P}(X)} \underline{V}(\psi')\, \phi(d\psi' \mid \psi, \mu, \nu) \Big].
\]
The claim follows. $\Box$
Finally we get the following theorem:
Theorem 3.8.
Assume (A1)-(A2). Then $\gamma$ (as in (3.5)) is the value of the COSG. Moreover, let $u^* : \mathcal{P}(X) \to \mathcal{P}(U)$ be a measurable function such that $u^*(\cdot)$ is the outer maximiser of the right-hand side of (3.6) (which exists by our assumptions and a standard measurable selection theorem). Then the strategy $\{\pi^{1*}_n\}$ given by
\[
\pi^{1*}_n(\cdot \mid i_n) = u^*(\psi_n)(\cdot)
\]
is an optimal strategy for player 1. Similarly, let $v^* : \mathcal{P}(X) \to \mathcal{P}(V)$ be a measurable function such that $v^*(\cdot)$ is the outer minimiser of the right-hand side of (3.7). Then $\{\pi^{2*}_n\}$ given by
\[
\pi^{2*}_n(\cdot \mid i_n) = v^*(\psi_n)(\cdot)
\]
is an optimal strategy for player 2.

Proof. Let $\{\pi^2_n\}$ be an arbitrary admissible strategy of player 2. Then from (3.6) we have
\[
E^{\pi^{1*},\pi^2}_\psi[\hat V(\Psi_n)] + \gamma \le E^{\pi^{1*},\pi^2}_\psi[\tilde c(\Psi_n, U_n, V_n)] + E^{\pi^{1*},\pi^2}_\psi[\hat V(\Psi_{n+1})], \qquad n \ge 0.
\]
Hence
\[
\gamma \le \frac{1}{n} \sum_{m=0}^{n-1} E^{\pi^{1*},\pi^2}_\psi[\tilde c(\Psi_m, U_m, V_m)] + \frac{E^{\pi^{1*},\pi^2}_\psi[\hat V(\Psi_n)] - \hat V(\psi)}{n}.
\]
Then, taking the limit $n \to \infty$ and using assumption (A2), we have
\[
\gamma \le \liminf_{n \to \infty} \frac{1}{n} \sum_{m=0}^{n-1} E^{\pi^{1*},\pi^2}_\psi[\tilde c(\Psi_m, U_m, V_m)].
\]
Similarly, if $\{\pi^1_n\}$ is an arbitrary admissible strategy of player 1, then by (3.7) we have
\[
E^{\pi^1,\pi^{2*}}_\psi[\underline{V}(\Psi_n)] + \gamma \ge E^{\pi^1,\pi^{2*}}_\psi[\tilde c(\Psi_n, U_n, V_n)] + E^{\pi^1,\pi^{2*}}_\psi[\underline{V}(\Psi_{n+1})], \qquad n \ge 0.
\]
Therefore
\[
\gamma \ge \frac{1}{n} \sum_{m=0}^{n-1} E^{\pi^1,\pi^{2*}}_\psi[\tilde c(\Psi_m, U_m, V_m)] + \frac{E^{\pi^1,\pi^{2*}}_\psi[\underline{V}(\Psi_n)] - \underline{V}(\psi)}{n}.
\]
Then, taking the limit $n \to \infty$ and using assumption (A2), we have
\[
\gamma \ge \liminf_{n \to \infty} \frac{1}{n} \sum_{m=0}^{n-1} E^{\pi^1,\pi^{2*}}_\psi[\tilde c(\Psi_m, U_m, V_m)].
\]
Now the conclusions follow. $\Box$
Now the following theorem follows from Theorem 3.8 and the equivalence of the COSG and the POSG.

Theorem 3.9. The POSG with the average cost criterion has a value, which equals $\gamma$ (as in (3.5)) for any initial distribution. Moreover, $(\{\pi^{1*}_n\}, \{\pi^{2*}_n\})$ given by Theorem 3.8 is a saddle point equilibrium.

4. Conclusion
In this article we study a partially observed stochastic game under the average payoff criterion. We estimate the unobservable state variable and use the state estimate as our new observable state variable. We then use the vanishing discount approach to solve the average cost problem. Our analysis involves a coupling argument which uses the machinery of the pseudo-atom construction. We show that the game has a value and also prove the existence of a saddle point equilibrium for our partially observable model.
Acknowledgement.
The author wishes to thank V. S. Borkar and M. K. Ghosh for many helpful discussions and comments.
References

[1] Athreya, K. B., and Ney, P., A new approach to the limit theory of recurrent Markov chains, Transactions of the American Mathematical Society, Vol. 245, pp. 493-501, 1978.
[2] Bertsekas, D. P., and Shreve, S. E., Stochastic Optimal Control, Academic Press, New York, NY, 1978.
[3] Borkar, V. S., Dynamic programming for ergodic control with partial observations, Stochastic Processes and their Applications, Vol. 103, pp. 293-310, 2003.
[4] Dynkin, E. B., and Yushkevich, A., Controlled Markov Processes, Springer Verlag, Berlin, Germany, 1979.
[5] Ghosh, M. K., McDonald, D., and Sinha, S., Zero-sum stochastic games with partial information, Journal of Optimization Theory and Applications, Vol. 121, pp. 99-118, 2004.
[6] Hernández-Lerma, O., and Lasserre, J. B., Further Topics on Discrete-Time Markov Control Processes, Springer, New York, NY, 1999.
[7] Meyn, S. P., and Tweedie, R. L., Markov Chains and Stochastic Stability, Springer, London, 1993.
[8] Nummelin, E., A splitting technique for Harris recurrent chains, Z. Wahrscheinlichkeitstheorie Verw. Geb., Vol. 43, pp. 309-318, 1978.
[9] Shapley, L., Stochastic games, Proceedings of the National Academy of Sciences, Vol. 39, pp. 1095-1100, 1953.
[10] Vrieze, K., Zero-sum stochastic games: a survey, CWI Quarterly, Vol. 2, pp. 147-170, 1989.
Department of Mathematics, Indian Institute of Science, Bangalore 560 012, India.
E-mail address: