Indexability of Restless Bandit Problems and Optimality of Whittle's Index for Dynamic Multichannel Access
Keqin Liu, Qing Zhao
University of California, Davis, CA
[email protected], [email protected]
Abstract
We consider a class of restless multi-armed bandit problems (RMBP) that arises in dynamic multichannel access, user/server scheduling, and optimal activation in multi-agent systems. For this class of RMBP, we establish the indexability and obtain Whittle’s index in closed form for both discounted and average reward criteria. These results lead to a direct implementation of Whittle’s index policy with remarkably low complexity. When the Markov chains are stochastically identical, we show that Whittle’s index policy is optimal under certain conditions. Furthermore, it has a semi-universal structure that obviates the need to know the Markov transition probabilities. The optimality and the semi-universal structure result from the equivalency between Whittle’s index policy and the myopic policy established in this work. For non-identical channels, we develop efficient algorithms for computing a performance upper bound given by Lagrangian relaxation. The tightness of the upper bound and the near-optimal performance of Whittle’s index policy are illustrated with simulation examples.
Index Terms
Opportunistic access, dynamic channel selection, restless multi-armed bandit, Whittle’s index, indexability, myopic policy.

This work was supported by the Army Research Laboratory CTA on Communication and Networks under Grant DAAD19-01-2-0011 and by the National Science Foundation under Grants ECS-0622200 and CCF-0830685. Part of this work was presented at the 5th IEEE Conference on Sensor, Mesh and Ad Hoc Communications and Networks (SECON) Workshops (June, 2008) and the IEEE Asilomar Conference on Signals, Systems, and Computers (October, 2008).
I. INTRODUCTION
A. Restless Multi-armed Bandit Problem
The Restless Multi-armed Bandit Process (RMBP) is a generalization of the classical Multi-armed Bandit Process (MBP), which has been studied since the 1930’s [1]. In an MBP, a player, with full knowledge of the current state of each arm, chooses one out of N arms to activate at each time and receives a reward determined by the state of the activated arm. Only the activated arm changes its state according to a Markovian rule, while the states of passive arms are frozen. The objective is to maximize the long-run reward over the infinite horizon by choosing which arm to activate at each time. The structure of the optimal policy for the classical MBP was established by Gittins in 1979 [2], who proved that an index policy is optimal. The significance of Gittins’ result is that it reduces the complexity of finding the optimal policy for an MBP from exponential in N to linear in N. Specifically, an index policy assigns an index to each state of each arm and activates the arm whose current state has the largest index. Arms are decoupled when computing the index, thus reducing an N-dimensional problem to N independent 1-dimensional problems. Whittle generalized the MBP to the RMBP by allowing multiple (K ≥ 1) arms to be activated simultaneously and allowing passive arms to also change states [3]. Either of these two generalizations would render Gittins’ index policy suboptimal in general, and finding the optimal solution to a general RMBP has been shown to be PSPACE-hard by Papadimitriou and Tsitsiklis [4]. In fact, merely allowing multiple plays (K > 1) would have fundamentally changed the problem, as shown in the classic work by Anantharam et al. [5] and by Pandelis and Teneketzis [6]. By considering the Lagrangian relaxation of the problem, Whittle proposed a heuristic index policy for RMBP [3]. Whittle’s index policy is the optimal solution to RMBP under a relaxed constraint: the number of activated arms can vary over time provided that its average over the infinite horizon equals K.
This average constraint leads to decoupling among arms and, subsequently, the optimality of an index policy. Under the strict constraint that exactly K arms are to be activated at each time, Whittle’s index policy has been shown to be asymptotically optimal under certain conditions (N → ∞ with stochastically identical arms) [7]. In the finite regime, extensive empirical studies have demonstrated its near-optimal performance; see, for example, [8], [9]. The difficulty of Whittle’s index policy lies in the complexity of establishing its existence and computing the index, especially for RMBP with an uncountable state space as in our case. Not every RMBP has a well-defined Whittle’s index; those that admit Whittle’s index policy are called indexable [3]. The indexability of an RMBP is often difficult to establish, and computing Whittle’s index can be complex, often relying on numerical approximations. In this paper, we show that for a significant class of RMBP most relevant to dynamic multichannel access applications, the indexability can be established and Whittle’s index can be obtained in closed form. For stochastically identical arms, we establish the equivalency between Whittle’s index policy and the myopic policy. This result, coupled with recent findings in [10], [11] on the myopic policy for this class of RMBP, shows that Whittle’s index policy achieves the optimal performance under certain conditions and has a semi-universal structure that is robust against model mismatch and variations. This class of RMBP is described next.
B. Dynamic Multichannel Access
Consider the problem of probing N independent Markov chains. Each chain has two states, “good” and “bad”, with different transition probabilities across chains (see Fig. 1). At each time, a player can choose K (1 ≤ K < N) chains to probe and receives a reward determined by the states of the probed chains. The objective is to design an optimal policy that governs the selection of K chains at each time to maximize the long-run reward.

Fig. 1. The Gilbert-Elliot channel model: a two-state Markov chain with states 0 (bad) and 1 (good) and transition probabilities p(i)01, p(i)11, p(i)00, p(i)10.
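As an illustration, the two-state chains and the slot-by-slot probing reward can be simulated directly. The sketch below is ours, not the authors’ code; the selection rule passed in as `select` is a hypothetical placeholder for any policy that returns K chain indices per slot:

```python
import random

def step(state, p01, p11, rng):
    """One Gilbert-Elliot transition: go to 'good' (1) w.p. p11 from good, p01 from bad."""
    return 1 if rng.random() < (p11 if state == 1 else p01) else 0

def total_reward(select, P, B, K, horizon, seed=0):
    """Accrue sum_{i in U(t)} S_i(t) * B_i over `horizon` slots, where U(t) = select(t).
    P[i] = (p01, p11) for chain i; B[i] is the reward of probing chain i in the good state."""
    rng = random.Random(seed)
    states = [rng.randint(0, 1) for _ in P]   # unknown initial states
    reward = 0.0
    for t in range(horizon):
        U = select(t)                         # the K chains probed in slot t
        assert len(U) == K
        reward += sum(states[i] * B[i] for i in U)
        states = [step(s, *P[i], rng) for i, s in enumerate(states)]
    return reward
```

For example, `total_reward(lambda t: [t % 2], [(0.3, 0.8), (0.2, 0.9)], [1.0, 0.5], 1, 100)` evaluates a round-robin rule over two chains.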
The above general problem arises in a wide range of communication systems, including cognitive radio networks, downlink scheduling in cellular systems, opportunistic transmission over fading channels, and resource-constrained jamming and anti-jamming. In the communications context, the N independent Markov chains correspond to N communication channels under the Gilbert-Elliot channel model [12], which has been commonly used to abstract physical channels with memory (see, for example, [13], [14]). The state of a channel models the communication quality of this channel and determines the resultant reward of accessing it. For example, in cognitive radio networks, where secondary users search the spectrum for idle channels temporarily unused by primary users [15], the state of a channel models the occupancy of the channel. For downlink scheduling in cellular systems, the user is a base station, and each channel is associated with a downlink mobile receiver. Downlink receiver scheduling is thus equivalent to channel selection. The application of this problem also goes beyond communication systems. For example, it has applications in target tracking as considered in [16], where K unmanned aerial vehicles are tracking the states of N (N > K) targets in each slot.

C. Main Results
Fundamental questions concerning Whittle’s index policy since the day of its invention have been its existence, its performance, and the complexity of computing the index. What are the necessary and/or sufficient conditions on the state transition and the reward structure that make an RMBP indexable? When can Whittle’s index be obtained in closed form? For which special classes of RMBP is Whittle’s index policy optimal? When numerical evaluation has to be resorted to in studying its performance, are there easily computable performance benchmarks? In this paper, we attempt to address these questions for the class of RMBP described above. As will be shown, this class of RMBP has an uncountable state space, making the problem highly nontrivial. The underlying two-state Markov chain that governs the state transition of each arm, however, brings rich structure into the problem, leading to positive and surprising answers to the above questions. The wide range of applications of this class of RMBP makes the results obtained in this paper generally applicable. Under both discounted and average reward criteria, we establish the indexability of this class of RMBP. The basic technique of our proof is to bound the total amount of time that an arm is made passive under the optimal policy. The general approach of using the total passive time in proving indexability was considered by Whittle in [3] when showing that a classic MBP is always indexable. Applying this approach to a nontrivial RMBP is, however, much more involved, and our proof appears to be the first that extends this approach to RMBP. We hope that this work contributes to the set of possible techniques for establishing indexability of RMBP. Based on the indexability, we show that Whittle’s index can be obtained in closed form for both discounted and average reward criteria. This result reduces the complexity of implementing Whittle’s index policy to simple evaluations of these closed-form expressions.
This result is particularly significant considering the uncountable state space, which would render numerical approaches impractical. The monotonically increasing and piecewise concave (for arms with p11 ≥ p01) or piecewise convex (for arms with p11 < p01) properties of Whittle’s index are also established. The monotonicity of Whittle’s index leads to an interesting equivalency with the myopic policy, the simplest nontrivial index policy, when arms are stochastically identical. This equivalency allows us to work with the myopic index, which has a much simpler form, when establishing the structure and optimality of Whittle’s index policy for stochastically identical arms. As to the performance of Whittle’s index policy for this class of RMBP, we show that under certain conditions, Whittle’s index policy is optimal for stochastically identical arms. This result provides examples for the optimality of Whittle’s index policy in the finite regime. The approximation factor of Whittle’s index policy (the ratio of the performance of Whittle’s index policy to that of the optimal policy) is analyzed when the optimality conditions do not hold. Specifically, we show that when arms are stochastically identical, the approximation factor of Whittle’s index policy is at least K/N when p11 ≥ p01 and at least max{1/2, K/N} when p11 < p01. When arms are non-identical, we develop an efficient algorithm to compute a performance upper bound based on Lagrangian relaxation. We show that this algorithm runs in at most O(N(log N)²) time to compute the performance upper bound within ǫ-accuracy for any ǫ > 0. Furthermore, when every channel satisfies p11 < p01, we can compute the upper bound without error with complexity O(N log N). Another interesting finding is that when arms are stochastically identical, Whittle’s index policy has a semi-universal structure that obviates the need to know the Markov transition probabilities. The only required knowledge about the Markovian model is the order of p11 and p01.
This semi-universal structure reveals the robustness of Whittle’s index policy against model mismatch and variations.

D. Related Work
Multichannel opportunistic access in the context of cognitive radio systems has been studied in [17], [18], where the problem is formulated as a Partially Observable Markov Decision Process (POMDP) to take into account potential correlations among channels. For stochastically identical and independent channels and under the assumption of single-channel sensing (K = 1), the structure, optimality, and performance of the myopic policy have been investigated in [10], where the semi-universal structure of the myopic policy was established for all N and the optimality of the myopic policy proved for N = 2. In a recent work [11], the optimality of the myopic policy was extended to N > 2 under the condition of p11 ≥ p01. These results have also been extended to cases with probing errors in [19]. In this paper, we establish the equivalence relationship between the myopic policy and Whittle’s index policy when channels are stochastically identical. This equivalency shows that the results obtained in [10], [11] for the myopic policy are directly applicable to Whittle’s index policy. Furthermore, we extend these results to multichannel sensing (K > 1). Other examples of applying the general RMBP framework to communication systems include the work by Lott and Teneketzis [20] and the work by Raghunathan et al. [21]. In [20], the problem of multichannel allocation for single-hop mobile networks with multiple service classes was formulated as an RMBP, and sufficient conditions for the optimality of a myopic-type index policy were established. In [21], multicast scheduling in wireless broadcast systems with strict deadlines was formulated as an RMBP with a finite state space. The indexability was established and Whittle’s index was obtained in closed form. Recent work by Kleinberg gives interesting applications of bandit processes to Internet search and web advertisement placement [22]. In the general context of RMBP, there is a rich literature on indexability.
See [23] for the linear programming representation of conditions for indexability and [9] for examples of specific indexable restless bandit processes. Constant-factor approximation algorithms for RMBP have also been explored in the literature. For the same class of RMBP as considered in this paper, Guha and Munagala [24] have developed a constant-factor (1/68) approximation via LP relaxation under the condition of p11 > 1/2 > p01 for each channel. In [25], Guha et al. have developed a factor-2 approximation policy via LP relaxation for the so-called monotone bandit processes. In [16], Le Ny et al. have considered the same class of RMBP motivated by the application of target tracking. They have independently established the indexability and obtained the closed-form expressions for Whittle’s index under the discounted reward criterion. Our approach to establishing indexability and obtaining Whittle’s index is, however, different from that used in [16], and the two approaches complement each other. Indeed, the fact that two completely different applications lead to the same class of RMBP lends support for a detailed investigation of this particular type of RMBP. We also include several results that were not considered in [16]. In particular, we consider both discounted and average reward criteria, develop algorithms for and analyze the complexity of computing the optimal performance under the Lagrangian relaxation, and establish the semi-universal structure and the optimality of Whittle’s index policy for stochastically identical arms.

E. Organization
The rest of the paper is organized as follows. In Sec. II, the RMBP formulation is presented. In Sec. III, we introduce the basic concepts of indexability and Whittle’s index. In Sec. IV, we address the total discounted reward criterion, where we establish the indexability, obtain Whittle’s index in closed form, and develop efficient algorithms for computing an upper bound on the performance of the optimal policy. Simulation examples are provided to illustrate the tightness of the upper bound and the near-optimal performance of Whittle’s index policy. In Sec. V, we consider the average reward criterion and obtain results parallel to those obtained under the discounted reward criterion. In Sec. VI, we consider the special case when channels are stochastically identical. We show that Whittle’s index policy is optimal under certain conditions and has a simple and robust structure. The approximation factor of Whittle’s index policy is also analyzed. Sec. VII concludes this paper.

II. PROBLEM STATEMENT AND RESTLESS BANDIT FORMULATION
A. Multi-channel Opportunistic Access
Consider N independent Gilbert-Elliot channels, each with transmission rate B_i (i = 1, · · · , N). Without loss of generality, we normalize the maximum data rate: max_{i∈{1,···,N}} B_i = 1. The state of channel i, “good” (1) or “bad” (0), evolves from slot to slot as a Markov chain with transition matrix P_i = {p(i)_{j,k}}_{j,k∈{0,1}}, as shown in Fig. 1. At the beginning of slot t, the user selects K out of N channels to sense. If the state S_i(t) of the sensed channel i is 1, the user transmits and collects B_i units of reward in this channel. Otherwise, the user collects no reward in this channel. Let U(t) denote the set of K channels chosen in slot t. The reward obtained in slot t is thus given by

R_{U(t)}(t) = Σ_{i∈U(t)} S_i(t) B_i.

Our objective is to maximize the expected long-run reward by designing a sensing policy that sequentially selects K channels to sense in each slot. (A conference version of our result was published in June, 2008, the same time as [16].)

B. Restless Multi-armed Bandit Formulation
The channel states [S_1(t), ..., S_N(t)] ∈ {0,1}^N are not directly observable before the sensing action is made. The user can, however, infer the channel states from its decision and observation history. It has been shown that a sufficient statistic for optimal decision making is given by the conditional probability that each channel is in state 1 given all past decisions and observations [26]. Referred to as the belief vector or information state, this sufficient statistic is denoted by Ω(t) ≜ [ω_1(t), · · · , ω_N(t)], where ω_i(t) is the conditional probability that S_i(t) = 1. Given the sensing action U(t) and the observations in slot t, the belief state in slot t+1 can be obtained recursively as follows:

ω_i(t+1) = p(i)11, if i ∈ U(t), S_i(t) = 1;
           p(i)01, if i ∈ U(t), S_i(t) = 0;
           T(ω_i(t)), if i ∉ U(t),   (1)

where T(ω_i(t)) ≜ ω_i(t) p(i)11 + (1 − ω_i(t)) p(i)01 denotes the operator for the one-step belief update of unobserved channels. If no information on the initial system state is available, the i-th entry of the initial belief vector Ω(1) can be set to the stationary distribution ω_o^(i) of the underlying Markov chain:

ω_o^(i) = p(i)01 / (p(i)01 + p(i)10).   (2)

It is now easy to see that we have an RMBP, where each channel is considered as an arm and the state of arm i in slot t is the belief state ω_i(t). The user chooses an action U(t) consisting of K arms to activate (sense) in each slot, while the other arms are made passive (unobserved). The states of both active and passive arms change as given in (1). A policy π : Ω(t) → U(t) is a function that maps the belief vector Ω(t) to the action U(t) in slot t. Our objective is to design the optimal policy π* to maximize the expected long-term reward. There are two commonly used performance measures.
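The belief recursion (1) and the stationary belief (2) translate directly into code. A minimal sketch (the function and parameter names are ours):

```python
def tau(omega, p01, p11):
    """One-step belief update T(omega) = omega*p11 + (1 - omega)*p01 for an unobserved channel."""
    return omega * p11 + (1 - omega) * p01

def next_belief(omega, sensed, observed_good, p01, p11):
    """Belief recursion (1): a sensed channel jumps to p11 or p01 depending on the
    observation; an unobserved channel evolves through the operator T."""
    if sensed:
        return p11 if observed_good else p01
    return tau(omega, p01, p11)

def stationary_belief(p01, p10):
    """Stationary distribution (2): omega_o = p01 / (p01 + p10)."""
    return p01 / (p01 + p10)
```

For instance, with p01 = 0.2 and p11 = 0.8 (so p10 = 0.2), the stationary belief ω_o = 0.5 is a fixed point of T.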
One is the expected total discounted reward over the infinite horizon:

E_π [ Σ_{t=1}^∞ β^{t−1} R_{π(Ω(t))}(t) | Ω(1) ],   (3)

where 0 ≤ β < 1 is the discount factor and R_{π(Ω(t))}(t) is the reward obtained in slot t under the action U(t) = π(Ω(t)) determined by the policy π. This performance measure applies when rewards in the future are less valuable, for example, in delay-sensitive communication systems. It also applies when the horizon length is a geometrically distributed random variable with parameter β. For example, a communication session may end at a random time, and the user aims to maximize the number of packets delivered before the session ends. The other performance measure is the expected average reward over the infinite horizon [27]:

E_π [ lim_{T→∞} (1/T) Σ_{t=1}^T R_{π(Ω(t))}(t) | Ω(1) ].   (4)

This is the common measure of throughput in the context of communications. For notational convenience, let (Ω(1), {P_i}_{i=1}^N, {B_i}_{i=1}^N, β) denote the RMBP with the discounted reward criterion, and (Ω(1), {P_i}_{i=1}^N, {B_i}_{i=1}^N) the RMBP with the average reward criterion.

III. INDEXABILITY AND INDEX POLICIES
In this section, we introduce the basic concepts of indexability and Whittle’s index policy.
A. Index Policy

An index policy assigns an index to each state of each arm to measure how rewarding it is to activate an arm at a particular state. In each slot, the policy activates those K arms whose current states have the largest indices. For a strongly decomposable index policy, the index of an arm only depends on the characteristics (transition probabilities, reward structure, etc.) of this arm. Arms are thus decoupled when computing the index, reducing an N-dimensional problem to N independent 1-dimensional problems. The myopic policy is a simple example of a strongly decomposable index policy. This policy ignores the impact of the current action on the future reward, focusing solely on maximizing the expected immediate reward. The index is thus the expected immediate reward of activating an arm at a particular state. For the problem at hand, the myopic index of state ω_i(t) of arm i is simply ω_i(t) B_i. The myopic action Û(t) under the belief state Ω(t) = [ω_1(t), · · · , ω_N(t)] is given by

Û(t) = arg max_{U(t)} Σ_{i∈U(t)} ω_i(t) B_i.   (5)

B. Indexability and Whittle’s Index Policy
To introduce indexability and Whittle’s index, it suffices to consider a single arm due to the strong decomposability of Whittle’s index. Consider a single-armed bandit process (a single channel) with transition probabilities {p_{j,k}}_{j,k∈{0,1}} and bandwidth B (here we drop the channel index for notational simplicity). In each slot, the user chooses one of two possible actions, u ∈ {0 (passive), 1 (active)}, to make the arm passive or active. An expected reward of ωB is obtained when the arm is activated at belief state ω, and the belief state transits according to (1). The objective is to decide whether to activate the arm in each slot to maximize the total discounted or average reward. The optimal policy is essentially given by an optimal partition of the state space [0, 1] into a passive set {ω : u*(ω) = 0} and an active set {ω : u*(ω) = 1}, where u*(ω) denotes the optimal action under belief state ω. Whittle’s index measures how attractive it is to activate an arm based on the concept of subsidy for passivity. Specifically, we construct a single-armed bandit process that is identical to the one specified above except that a constant subsidy m is obtained whenever the arm is made passive. Obviously, this subsidy m will change the optimal partition of the passive and active sets, and states that remain in the active set under a larger subsidy m are more attractive to the user. The minimum subsidy m that is needed to move a state from the active set to the passive set under the optimal partition thus measures how attractive this state is. We now present the formal definitions of indexability and Whittle’s index. We consider the discounted reward criterion; their definitions under the average reward criterion can be similarly obtained. Denote by V_{β,m}(ω) the value function, which represents the maximum expected total discounted reward that can be accrued from a single-armed bandit process with subsidy m when the initial belief state is ω.
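The value function and the induced passive/active partition just described can be approximated numerically by value iteration on a discretized belief space. A minimal sketch under our own choices of grid size, iteration count, and nearest-point projection (none of which appear in the paper):

```python
import numpy as np

def subsidy_bandit(p01, p11, B, beta, m, n=201, iters=400):
    """Value iteration for the single-armed bandit with subsidy m.
    Returns the belief grid, the value function, and a boolean mask of the passive set."""
    grid = np.linspace(0.0, 1.0, n)
    proj = lambda w: int(round(w * (n - 1)))             # nearest grid point
    t_idx = np.array([proj(w * p11 + (1 - w) * p01) for w in grid])  # indices of T(omega)
    V = np.zeros(n)
    for _ in range(iters):
        passive = m + beta * V[t_idx]                    # stay passive, collect subsidy m
        active = grid * B + beta * (grid * V[proj(p11)] + (1 - grid) * V[proj(p01)])
        V = np.maximum(passive, active)
    return grid, V, passive >= active
```

As a sanity check, with m ≥ B every belief is passive and V ≈ m/(1 − β), while with m < 0 the passive set is empty.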
Considering the two possible actions in the first slot, we have

V_{β,m}(ω) = max{V_{β,m}(ω; u = 0), V_{β,m}(ω; u = 1)},   (6)

where V_{β,m}(ω; u) denotes the expected total discounted reward obtained by taking action u in the first slot followed by the optimal policy in future slots. Consider V_{β,m}(ω; u = 0). It is given by the sum of the subsidy m obtained in the first slot under the passive action and the total discounted future reward β V_{β,m}(T(ω)), which is determined by the updated belief state T(ω) (see (1)). V_{β,m}(ω; u = 1) can be similarly obtained, and we arrive at the following dynamic programming equations:

V_{β,m}(ω; u = 0) = m + β V_{β,m}(T(ω)),   (7)
V_{β,m}(ω; u = 1) = ωB + β (ω V_{β,m}(p11) + (1 − ω) V_{β,m}(p01)).   (8)

The optimal action u*_m(ω) for belief state ω under subsidy m is given by

u*_m(ω) = 1, if V_{β,m}(ω; u = 1) > V_{β,m}(ω; u = 0); 0, otherwise.   (9)

The passive set P(m) under subsidy m is given by

P(m) = {ω : u*_m(ω) = 0}   (10)
     = {ω : V_{β,m}(ω; u = 0) ≥ V_{β,m}(ω; u = 1)}.   (11)

Definition 1:
An arm is indexable if the passive set P(m) of the corresponding single-armed bandit process with subsidy m monotonically increases from ∅ to the whole state space [0, 1] as m increases from −∞ to +∞. An RMBP is indexable if every arm is indexable. Under the indexability condition, Whittle’s index is defined as follows.

Definition 2:
If an arm is indexable, its
Whittle’s index W(ω) of state ω is the infimum subsidy m such that it is optimal to make the arm passive at ω. Equivalently, Whittle’s index W(ω) is the infimum subsidy m that makes the passive and active actions equally rewarding:

W(ω) = inf_m {m : u*_m(ω) = 0}   (12)
     = inf_m {m : V_{β,m}(ω; u = 0) = V_{β,m}(ω; u = 1)}.   (13)

In Fig. 2, we compare the performance (throughput) of the myopic policy, Whittle’s index policy, and the optimal policy for the RMBP formulated in Sec. II. We observe that Whittle’s index policy achieves near-optimal performance while the myopic policy suffers from a significant performance loss.
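Definition 2 suggests a direct numerical procedure: bisect on the subsidy m, solving the subsidy-m single-armed bandit at each trial value, until the two actions become indifferent at ω. A self-contained sketch with B = 1; the grid resolution, iteration count, and bisection bracket are our own choices:

```python
import numpy as np

def action_values(p01, p11, beta, m, n=201, iters=400):
    """Value iteration for Eqs. (7)-(8) with subsidy m and B = 1; returns the
    passive and active action values on a uniform belief grid."""
    grid = np.linspace(0.0, 1.0, n)
    proj = lambda w: int(round(w * (n - 1)))
    t_idx = np.array([proj(w * p11 + (1 - w) * p01) for w in grid])
    V = np.zeros(n)
    for _ in range(iters):
        passive = m + beta * V[t_idx]
        active = grid + beta * (grid * V[proj(p11)] + (1 - grid) * V[proj(p01)])
        V = np.maximum(passive, active)
    return grid, passive, active

def whittle_index(omega, p01, p11, beta, tol=1e-4):
    """W(omega): the smallest subsidy m under which passivity is optimal at omega (Eq. (12))."""
    lo, hi = -1.0, 2.0              # W(omega) lies in [0, 1] when B = 1; bracket generously
    while hi - lo > tol:
        m = 0.5 * (lo + hi)
        grid, passive, active = action_values(p01, p11, beta, m)
        i = int(round(omega * (len(grid) - 1)))
        if passive[i] >= active[i]:
            hi = m                  # passivity already optimal: the index is at most m
        else:
            lo = m
    return 0.5 * (lo + hi)
```

As a sanity check, with β = 0 the index degenerates to the myopic index, W(ω) = ω.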
Fig. 2. The performance of Whittle’s index policy compared with the optimal policy and the myopic policy (K = 1, N = 7).

IV. WHITTLE’S INDEX UNDER DISCOUNTED REWARD CRITERION
In this section, we focus on the discounted reward criterion. We establish the indexability, obtain Whittle’s index in closed form, and develop efficient algorithms for computing an upper bound on the optimal performance to provide a benchmark for evaluating the performance of Whittle’s index policy.

A. Properties of Belief State Transition
To establish indexability and obtain Whittle’s index, it suffices to consider the single-armed bandit process with subsidy m. Again, we drop the channel index from all notations and set B = 1.

Fig. 3. The k-step belief update T^k(ω) of an unobserved arm (p11 ≥ p01).

Fig. 4. The k-step belief update T^k(ω) of an unobserved arm (p11 < p01).

The following lemma establishes properties of the belief state transition that reveal the basic structure of the RMBP considered in this paper. We resort often to these properties when deriving the main results.
Lemma 1:
Let T^k(ω(t)) ≜ Pr[S(t+k) = 1 | ω(t)] (k = 0, 1, 2, · · ·) denote the k-step belief update of ω(t) when the arm is unobserved for k consecutive slots. We have

T^k(ω) = [p01 − (p11 − p01)^k (p01 − (1 + p01 − p11) ω)] / (1 + p01 − p11),   (14)

min{p01, p11} ≤ T^k(ω) ≤ max{p01, p11}, ∀ω ∈ [0, 1], ∀k ≥ 1.   (15)

Furthermore, the convergence of T^k(ω) to the stationary distribution ω_o = p01/(p01 + p10) has the following property.

• Case 1: Positively correlated channel (p11 ≥ p01). For any ω ∈ [0, 1], T^k(ω) monotonically converges to ω_o as k → ∞ (see Fig. 3).
• Case 2: Negatively correlated channel (p11 < p01). For any ω ∈ [0, 1], T^{2k}(ω) and T^{2k+1}(ω) converge, from opposite directions, to ω_o as k → ∞ (see Fig. 4).

Proof: T^k(ω) = ω T^k(1) + (1 − ω) T^k(0), where T^k(1) = Pr[S(t+k) = 1 | S(t) = 1] is the k-step transition probability from state 1 to state 1, and T^k(0) = Pr[S(t+k) = 1 | S(t) = 0] is the k-step transition probability from state 0 to state 1. From the eigen-decomposition of the transition matrix P (see [28]), we have

T^k(1) = [p01 + (1 − p11)(p11 − p01)^k] / (1 + p01 − p11),  T^k(0) = p01 [1 − (p11 − p01)^k] / (1 + p01 − p11),

which leads to (14). The other properties follow directly from (14).

Next, we define an important quantity L(ω, ω′). Referred to as the crossing time, L(ω, ω′) is the minimum amount of time required for a passive arm to transit across ω′ starting from ω:

L(ω, ω′) ≜ min{k : T^k(ω) > ω′}.

For a positively correlated arm, we have, from Lemma 1,

L(ω, ω′) = 0, if ω > ω′;
         = ⌊ log_{p11−p01} [ (p01 − ω′(1 + p01 − p11)) / (p01 − ω(1 + p01 − p11)) ] ⌋ + 1, if ω ≤ ω′ < ω_o;
         = ∞, if ω ≤ ω′ and ω′ ≥ ω_o.   (16)

For a negatively correlated arm, we have

L(ω, ω′) = 0, if ω > ω′;
         = 1, if ω ≤ ω′ and T(ω) > ω′;
         = ∞, if ω ≤ ω′ and T(ω) ≤ ω′.   (17)
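Both the closed form (14) and the crossing time can be checked mechanically. The sketch below (ours) compares (14) against powers of the transition matrix and computes L(ω, ω′) by direct search, with a truncation cap standing in for ∞:

```python
import numpy as np

def T_k(omega, k, p01, p11):
    """Eq. (14): k-step belief update of an arm left unobserved for k slots."""
    return (p01 - (p11 - p01) ** k * (p01 - (1 + p01 - p11) * omega)) / (1 + p01 - p11)

def T_k_reference(omega, k, p01, p11):
    """Reference computation: push the distribution [1-omega, omega] through
    the k-th power of the transition matrix and read off Pr[state = 1]."""
    P = np.array([[1 - p01, p01], [1 - p11, p11]])
    return (np.array([1 - omega, omega]) @ np.linalg.matrix_power(P, k))[1]

def crossing_time(omega, omega_p, p01, p11, cap=1000):
    """L(omega, omega') = min{k : T_k(omega) > omega'}; inf if never crossed."""
    for k in range(cap):
        if T_k(omega, k, p01, p11) > omega_p:
            return k
    return float("inf")
```

For instance, for a positively correlated arm with p01 = 0.2, p11 = 0.8, the stationary belief is ω_o = 0.5, so any threshold ω′ ≥ 0.5 is never crossed from below, in agreement with the last branch of (16).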
It is easy to show that p11 > p01 corresponds to the case where the channel states in two consecutive slots are positively correlated, i.e., for any distribution of S(t), we have E[(S(t) − E[S(t)])(S(t+1) − E[S(t+1)])] > 0, where S(t) is the state of the Gilbert-Elliot channel in slot t. Similarly, p11 < p01 corresponds to the case where S(t) and S(t+1) are negatively correlated, and p11 = p01 to the case where S(t) and S(t+1) are independent.

B. The Optimal Policy
In this subsection, we show that the optimal policy for the single-armed bandit process with subsidy m is a threshold policy. This threshold structure provides the key to establishing the indexability and solving for Whittle’s index in closed form, as shown in Sec. IV-E. The threshold structure is obtained by examining the value functions V_{β,m}(ω; u = 0) and V_{β,m}(ω; u = 1) given in (7) and (8). From (8), we observe that V_{β,m}(ω; u = 1) is a linear function of ω. Following the general result on the convexity of the value function of a POMDP [29], we conclude that V_{β,m}(ω; u = 0) given in (7) is convex in ω. These properties of V_{β,m}(ω; u = 1) and V_{β,m}(ω; u = 0) lead to the lemma below.

Lemma 2:
The optimal policy for the single-armed bandit process with subsidy m is a threshold policy, i.e., there exists an ω*_β(m) ∈ R such that

u*_m(ω) = 1, if ω > ω*_β(m); 0, if ω ≤ ω*_β(m),

and V_{β,m}(ω*_β(m); u = 0) = V_{β,m}(ω*_β(m); u = 1).

Fig. 5. The optimality of a threshold policy (0 ≤ m < 1): the linear V_{β,m}(ω; u = 1) and the convex V_{β,m}(ω; u = 0) intersect at ω*_β(m), separating the passive set {ω < ω*_β(m)} from the active set {ω > ω*_β(m)}.

Proof:
Consider first 0 ≤ m < 1. We have the following inequalities regarding the end points of V_{β,m}(ω; u = 1) and V_{β,m}(ω; u = 0) (see Fig. 5):

V_{β,m}(0; u = 1) = β V_{β,m}(p01) ≤ m + β V_{β,m}(p01) = V_{β,m}(0; u = 0),   (18)
V_{β,m}(1; u = 1) = 1 + β V_{β,m}(p11) > m + β V_{β,m}(p11) = V_{β,m}(1; u = 0).   (19)

Since V_{β,m}(ω; u = 1) is linear in ω and V_{β,m}(ω; u = 0) is convex in ω, V_{β,m}(ω; u = 1) and V_{β,m}(ω; u = 0) must have one unique intersection at some point ω*_β(m), as shown in Fig. 5. When m ≥ 1, it is optimal to make the arm passive all the time since the expected immediate reward ω of activating the arm is uniformly upper bounded by 1 (see Fig. 6). We can thus choose ω*_β(m) = c for any c > 1.

Fig. 6. The optimality of a threshold policy (m ≥ 1): V_{β,m}(ω; u = 1) = ω + mβ/(1 − β) and V_{β,m}(ω; u = 0) = m/(1 − β).

When m < 0, we have (see Fig. 7)

V_{β,m}(0; u = 1) = β V_{β,m}(p01) > m + β V_{β,m}(p01) = V_{β,m}(0; u = 0),   (20)
V_{β,m}(1; u = 1) = 1 + β V_{β,m}(p11) > m + β V_{β,m}(p11) = V_{β,m}(1; u = 0).   (21)

Based on the convexity of V_{β,m}(ω; u = 0) in ω, we have V_{β,m}(ω; u = 1) > V_{β,m}(ω; u = 0) for any ω ∈ [0, 1]. It is thus optimal to always activate the arm, and we can choose ω*_β(m) = b for any b < 0. Lemma 2 thus follows. The expressions of V_{β,m}(ω; u = 1) and V_{β,m}(ω; u = 0) given in Fig. 6 and Fig. 7 are obtained from the closed-form expression of the value function, which will be given in the next subsection.

Fig. 7. The optimality of a threshold policy (m < 0): V_{β,m}(ω; u = 1) = (ω(1 − β) + βp01) / ((1 − β)(1 − βp11 + βp01)) and V_{β,m}(ω; u = 0) = m + β (T(ω)(1 − β) + βp01) / ((1 − β)(1 − βp11 + βp01)).

C. Closed-form Expression of The Value Function
In this subsection, we obtain closed-form expressions for the value function V_{β,m}(ω). This result is fundamental to calculating Whittle’s index in closed form and analyzing the performance of Whittle’s index policy. Based on the threshold structure of the optimal policy, the value function V_{β,m}(ω) can be expressed in terms of V_{β,m}(T^{L(ω, ω*_β(m))}(ω); u = 1), where t = L(ω, ω*_β(m)) + 1 ∈ Z+ ∪ {∞} is the index of the slot in which the belief state transits across the threshold ω*_β(m) for the first time (recall that L(ω, ω*_β(m)) is the crossing time given in (16) and (17)). Specifically, in the first L(ω, ω*_β(m)) slots, the subsidy m is obtained in each slot. In slot t = L(ω, ω*_β(m)) + 1, the belief state transits across the threshold ω*_β(m) and the arm is activated. The total reward thereafter is V_{β,m}(T^{L(ω, ω*_β(m))}(ω); u = 1). We thus have, considering the discount factor,

V_{β,m}(ω) = [(1 − β^{L(ω, ω*_β(m))}) / (1 − β)] m + β^{L(ω, ω*_β(m))} V_{β,m}(T^{L(ω, ω*_β(m))}(ω); u = 1).   (22)

Since V_{β,m}(T^k(ω); u = 1) is a function of V_{β,m}(p11) and V_{β,m}(p01) as shown in (8), we only need to solve for V_{β,m}(p11) and V_{β,m}(p01). Note that p11 and p01 are simply two specific values of ω; both V_{β,m}(p11) and V_{β,m}(p01) can be written as functions of themselves through (22). We can thus solve for V_{β,m}(p11) and V_{β,m}(p01) as given in Lemma 3.

Lemma 3:
Let ω*_β(m) denote the threshold of the optimal policy for the single-armed bandit process with subsidy m. The value functions can be obtained in closed form as given below, where L = L(p01, ω*_β(m)) for brevity.

• Case 1: Positively correlated channel (p11 ≥ p01).

V_{β,m}(p01) =
  p01 / ((1−β)(1−βp11+βp01)),   if ω*_β(m) < p01;
  [(1−βp11)(1−β^L) m + (1−β) β^L T^L(p01)] / [(1−βp11)(1−β)(1−β^{L+1}) + (1−β)^2 β^{L+1} T^L(p01)],   if p01 ≤ ω*_β(m) < ω_o;
  m/(1−β),   if ω*_β(m) ≥ ω_o.   (23)

V_{β,m}(p11) =
  (p11 + β(1−p11) V_{β,m}(p01)) / (1−βp11),   if ω*_β(m) < p11;
  m/(1−β),   if ω*_β(m) ≥ p11.   (24)

Note that V_{β,m}(p01) is given explicitly in (23) while V_{β,m}(p11) is given in terms of V_{β,m}(p01) for the ease of presentation.

• Case 2: Negatively correlated channel (p11 < p01).

V_{β,m}(p11) =
  (p11(1−β) + βp01) / ((1−β)(1−βp11+βp01)),   if ω*_β(m) < p11;
  [m(1−β(1−p01)) + β T(p11)(1−β) + β^2 p01] / [1−β(1−p01) − β^2 T(p11)(1−β) − β^3 p01],   if p11 ≤ ω*_β(m) < T(p11);
  m/(1−β),   if ω*_β(m) ≥ T(p11).   (25)

V_{β,m}(p01) =
  (p01 + βp01 V_{β,m}(p11)) / (1−β(1−p01)),   if ω*_β(m) < p01;
  m/(1−β),   if ω*_β(m) ≥ p01.   (26)

Note that V_{β,m}(p11) is given explicitly in (25) while V_{β,m}(p01) is given in terms of V_{β,m}(p11) for the ease of presentation.

Proof:
The key to the closed-form expressions for V_{β,m}(p01) and V_{β,m}(p11) is finding the first slot in which the optimal action is to activate the arm (i.e., the slot in which the belief state transits across the threshold ω*_β(m)). This can be done by applying the transition properties of the belief state given in Lemma 1. See Appendix A for the complete proof.

D. The Total Discounted Time of Being Passive
In this subsection, we study the total discounted time that the single-armed bandit process with subsidy m is made passive. This quantity plays the central role in our proof of indexability and in the algorithms for computing an upper bound on the optimal performance as shown in Sec. IV-E and Sec. IV-F.

Let D_{β,m}(ω) denote the total discounted time that the single-armed bandit process with subsidy m is made passive under the optimal policy when the initial belief state is ω. It has been shown by Whittle that D_{β,m}(ω) is the derivative of the value function V_{β,m}(ω) with respect to m [3]:

D_{β,m}(ω) = d(V_{β,m}(ω))/dm.

This result is intuitive: when the subsidy for passivity m increases, the rate at which the total discounted reward V_{β,m}(ω) increases is determined by how often the arm is made passive. Based on the threshold structure of the optimal policy, we can obtain the following dynamic programming equation for D_{β,m}(ω), similar to that for V_{β,m}(ω) given in (22):

D_{β,m}(ω) = (1 − β^{L(ω,ω*_β(m))})/(1 − β) + β^{L(ω,ω*_β(m))+1} (T^{L(ω,ω*_β(m))}(ω) D_{β,m}(p11) + (1 − T^{L(ω,ω*_β(m))}(ω)) D_{β,m}(p01)).   (27)

Specifically, the first term in (27) is the total discounted time of the first L(ω, ω*_β(m)) slots, during which the arm is made passive. In slot L(ω, ω*_β(m)) + 1, the arm is activated. With probability T^{L(ω,ω*_β(m))}(ω), the channel is in the good state in this slot, and the total future discounted passive time is D_{β,m}(p11). With probability 1 − T^{L(ω,ω*_β(m))}(ω), the channel is in the bad state in this slot, and the total future discounted passive time is D_{β,m}(p01).

By considering ω = p01 and ω = p11, both D_{β,m}(p01) and D_{β,m}(p11) can be written as functions of themselves through (27). We can thus solve for D_{β,m}(p01) and D_{β,m}(p11) as given in Lemma 4.

Lemma 4:
Let ω*_β(m) denote the threshold of the optimal policy for the single-armed bandit process with subsidy m. The total discounted passive times are given as follows, where L = L(p01, ω*_β(m)) for brevity.

• Case 1: Positively correlated channel (p11 ≥ p01).

D_{β,m}(p01) =
  0,   if ω*_β(m) < p01;
  (1−βp11)(1−β^L) / [(1−βp11)(1−β)(1−β^{L+1}) + (1−β)^2 β^{L+1} T^L(p01)],   if p01 ≤ ω*_β(m) < ω_o;
  1/(1−β),   if ω*_β(m) ≥ ω_o.   (28)

D_{β,m}(p11) =
  β(1−p11) D_{β,m}(p01) / (1−βp11),   if ω*_β(m) < p11;
  1/(1−β),   if ω*_β(m) ≥ p11.   (29)

• Case 2: Negatively correlated channel (p11 < p01).

D_{β,m}(p11) =
  0,   if ω*_β(m) < p11;
  (1−β(1−p01)) / [1−β(1−p01) − β^2 T(p11)(1−β) − β^3 p01],   if p11 ≤ ω*_β(m) < T(p11);
  1/(1−β),   if ω*_β(m) ≥ T(p11).   (30)

D_{β,m}(p01) =
  βp01 D_{β,m}(p11) / (1−β(1−p01)),   if ω*_β(m) < p01;
  1/(1−β),   if ω*_β(m) ≥ p01.   (31)

Proof:
The process of solving for D_{β,m}(p01) and D_{β,m}(p11) is similar to that of solving for V_{β,m}(p01) and V_{β,m}(p11). Details are omitted. D_{β,m}(p01) and D_{β,m}(p11) can also be obtained by taking the derivatives of V_{β,m}(p01) and V_{β,m}(p11) with respect to m.

We point out that V_{β,m}(ω) is not differentiable in m at every point (i.e., the left derivative may not equal the right derivative). Suppose that V_{β,m}(ω) is not differentiable at m. Then it can be shown that the left derivative at m corresponds to the case when the threshold ω*_β(m) is included in the active set, while the right derivative corresponds to the case when ω*_β(m) is included in the passive set. In this paper, we include the threshold in the passive set (see (11)), i.e., we choose the passive action when both actions are optimal. As a consequence, we consider the right derivative of V_{β,m}(ω) when it is not differentiable.

The following lemma shows the piecewise-constant (a stair function) and monotonically increasing properties of D_{β,m}(ω) as a function of m. These properties allow us to develop an efficient algorithm for computing a performance upper bound as shown in Sec. IV-F.

Lemma 5:
The total discounted passive time D_{β,m}(ω) as a function of m is monotonically increasing and piecewise constant (with countably many pieces for p11 ≥ p01 and finitely many pieces for p11 < p01). Equivalently, the value function V_{β,m}(ω) is piecewise linear and convex in m.

Proof:
The piecewise-constant property follows directly from (27) and Lemma 4 and is illustrated in Fig. 10 and Fig. 11. The monotonicity of D_{β,m}(ω) applies to a general restless bandit and has been stated without proof by Whittle [3]. We provide a proof below for completeness.

We show that V_{β,m}(ω) is convex in m, i.e., for any 0 ≤ α ≤ 1 and m1, m2 ∈ R,

α V_{β,m1}(ω) + (1−α) V_{β,m2}(ω) ≥ V_{β,αm1+(1−α)m2}(ω).   (32)

Consider the optimal policy π under subsidy αm1 + (1−α)m2. If we apply π to the system with subsidy m1, the total discounted reward will be V_{β,αm1+(1−α)m2}(ω) + D_{β,αm1+(1−α)m2}(ω)((1−α)(m1−m2)). Since π may not be the optimal policy under subsidy m1, we have

V_{β,m1}(ω) ≥ V_{β,αm1+(1−α)m2}(ω) + D_{β,αm1+(1−α)m2}(ω)((1−α)(m1−m2)).   (33)

Similarly,

V_{β,m2}(ω) ≥ V_{β,αm1+(1−α)m2}(ω) + D_{β,αm1+(1−α)m2}(ω)(α(m2−m1)).   (34)

(32) thus follows from taking the convex combination of (33) and (34) with weights α and 1−α.

E. Indexability and Whittle's Index Policy
With the threshold structure of the optimal policy and the closed-form expressions of the valuefunction and discounted passive time, we are ready to establish the indexability and solve forWhittle’s index.
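Before stating the results, the two structural facts used throughout (the threshold form of the optimal single-arm policy, and the convexity and monotonicity of V_{β,m}(ω) in m from Lemma 5) can be checked numerically. The sketch below is not part of the paper: it value-iterates the subsidized single-armed process on a discretized belief grid, with assumed illustrative parameters p11 = 0.8, p01 = 0.2, β = 0.9.

```python
# Sketch (assumed parameters, not from the paper): value iteration for the
# single-armed bandit with subsidy m on a discretized belief grid.
import numpy as np

def solve_single_arm(p11, p01, beta, m, n_grid=501, n_iter=2000):
    """Return (V, active): value function and optimal action on the grid."""
    grid = np.linspace(0.0, 1.0, n_grid)
    Tg = p11 * grid + p01 * (1.0 - grid)   # one-step belief operator T(omega)
    V = np.zeros(n_grid)
    for _ in range(n_iter):
        # active: immediate reward omega; belief then resolves to p11 or p01
        Va = grid + beta * (grid * np.interp(p11, grid, V)
                            + (1 - grid) * np.interp(p01, grid, V))
        # passive: subsidy m; belief evolves to T(omega)
        Vp = m + beta * np.interp(Tg, grid, V)
        V = np.maximum(Va, Vp)
    return V, Va > Vp

p11, p01, beta = 0.8, 0.2, 0.9
_, act = solve_single_arm(p11, p01, beta, m=0.4)
switches = int(np.count_nonzero(np.diff(act.astype(int))))
print("policy switches once across the belief grid:", switches == 1)

# V_{beta,m}(omega) at omega = 0.5 as a function of m: convex and nondecreasing
ms = np.linspace(0.0, 1.0, 11)
vals = [solve_single_arm(p11, p01, beta, m)[0][250] for m in ms]
```

For 0 < m < 1 the action map switches from passive to active exactly once on the grid (the threshold structure of Lemma 2), and the sampled values are convex and nondecreasing in m, consistent with Lemma 5.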
Theorem 1:
The restless multi-armed bandit process (Ω(1), {P_i}_{i=1}^N, {B_i}_{i=1}^N, β) is indexable.

Proof:
The proof is based on Lemma 2 and Lemma 4. Details are given in Appendix B. Theorem 2:
Whittle's index W_β(ω) ∈ R for arm i of the RMBP (Ω(1), {P_i}_{i=1}^N, {B_i}_{i=1}^N, β) is given as follows, where for brevity we drop the superscript (i) on p11, p01, T(·), ω_o, and L, and write L = L(p01, ω).

• Case 1: Positively correlated channel (p11^(i) ≥ p01^(i)).

W_β(ω) =
  ω B_i,   if ω ≤ p01 or ω ≥ p11;
  (ω / (1 − βp11 + βω)) B_i,   if ω_o ≤ ω < p11;
  ([ω − βT(ω) + C2(1−β)(β(1−βp11) − β(ω − βT(ω)))] / [1 − βp11 − C1(β(1−βp11) − β(ω − βT(ω)))]) B_i,   if p01 < ω < ω_o,   (35)

where
  C1 = (1−βp11)(1−β^L) / [(1−βp11)(1−β^{L+1}) + (1−β)β^{L+1}T^L(p01)],
  C2 = β^L T^L(p01) / [(1−βp11)(1−β^{L+1}) + (1−β)β^{L+1}T^L(p01)].

• Case 2: Negatively correlated channel (p11^(i) < p01^(i)).

W_β(ω) =
  ω B_i,   if ω ≤ p11 or ω ≥ p01;
  ((βp01 + ω(1−β)) / (1 + β(p01 − ω))) B_i,   if T(p11) ≤ ω < p01;
  ((1−β+βC3)(βp01 + ω(1−β)) / [1 − β(1−p01) − C4(β^2 p01 + βω − β^2 ω)]) B_i,   if ω_o ≤ ω < T(p11);
  ([(1−β)(βp01 + ω − βT(ω)) − C3 β(βT(ω) − βp01 − ω)] / [1 − β(1−p01) + C4 β(βT(ω) − βp01 − ω)]) B_i,   if p11 < ω < ω_o,   (36)

where, with Δ = 1 − β(1−p01) − β^2(1−β)T(p11) − β^3 p01,
  C3 = β(1−β)((1−β)T(p11) + βp01) / Δ,
  C4 = (1−β)(1−β(1−p01)) / Δ.

Proof:
By the definition of Whittle's index, for a given belief state ω, its Whittle's index is the subsidy m that solves the following equation in m:

ω + β(ω V_{β,m}(p11) + (1−ω) V_{β,m}(p01)) = m + β V_{β,m}(T(ω)),   (37)

where the left-hand side is V_{β,m}(ω; u=1) and the right-hand side is V_{β,m}(ω; u=0). From the closed-form expressions for V_{β,m}(p11), V_{β,m}(p01), and V_{β,m}(T(ω)) given in Lemma 3, we can solve (37) and obtain Whittle's index.

The following properties of Whittle's index W_β(ω) follow from Theorem 1 and Theorem 2.

Corollary 1: Properties of Whittle's Index
• W_β(ω) is a monotonically increasing function of ω. As a consequence, Whittle's index policy is equivalent to the myopic policy for stochastically identical arms.
• For a positively correlated channel (p11 ≥ p01), W_β(ω) is piecewise concave with countably many pieces. More specifically, W_β(ω) is linear in [0, p01] and [p11, 1], concave in [ω_o, p11), and piecewise concave with countably many pieces in (p01, ω_o) (see Fig. 8-left).
• For a negatively correlated channel (p11 < p01), W_β(ω) is piecewise convex with finitely many pieces. More specifically, W_β(ω) is linear in [0, p11] and [p01, 1], and convex in (p11, ω_o), [ω_o, T(p11)), and [T(p11), p01) (see Fig. 8-right).

The equivalency between Whittle's index policy and the myopic policy is particularly important. It allows us to establish the structure and optimality of Whittle's index policy by examining the myopic policy, which has a very simple index form.

Note that the region [p01, ω_o) for a positively correlated arm is the most complex. The infinite but countable concave pieces of Whittle's index in this region correspond to the possible values of the crossing time L(p01, ω) ∈ {0, 1, ⋯}. This region presents most of the difficulties in analyzing the performance of Whittle's index policy, as shown in the next subsection.
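Equation (37) also suggests a direct numerical procedure for computing the index without the closed form: fix ω and bisect over the subsidy m until the active and passive values at ω coincide. The sketch below (assumed illustrative parameters p11 = 0.8, p01 = 0.2, β = 0.8, with value iteration on a belief grid; not the paper's closed-form computation) recovers the linear branch W_β(ω) = ω for ω ≤ p01 and the monotonicity of the index in ω.

```python
# Sketch: Whittle's index by bisection on the subsidy m, using (37).
# Parameters are illustrative, with B_i = 1 so indices lie in [0, 1].
import numpy as np

def value_fn(p11, p01, beta, m, n_grid=301, n_iter=500):
    grid = np.linspace(0.0, 1.0, n_grid)
    Tg = p11 * grid + p01 * (1.0 - grid)
    V = np.zeros(n_grid)
    for _ in range(n_iter):
        Va = grid + beta * (grid * np.interp(p11, grid, V)
                            + (1 - grid) * np.interp(p01, grid, V))
        Vp = m + beta * np.interp(Tg, grid, V)
        V = np.maximum(Va, Vp)
    return grid, V

def whittle_index(w, p11, p01, beta, n_bisect=30):
    lo, hi = 0.0, 1.0
    for _ in range(n_bisect):
        m = (lo + hi) / 2
        grid, V = value_fn(p11, p01, beta, m)
        va = w + beta * (w * np.interp(p11, grid, V)
                         + (1 - w) * np.interp(p01, grid, V))
        vp = m + beta * np.interp(p11 * w + p01 * (1 - w), grid, V)
        lo, hi = (m, hi) if va > vp else (lo, m)  # indexability: advantage falls with m
    return (lo + hi) / 2

p11, p01, beta = 0.8, 0.2, 0.8
idx = [whittle_index(w, p11, p01, beta) for w in (0.1, 0.3, 0.5, 0.7)]
print([round(v, 3) for v in idx])   # increasing in omega; idx[0] close to 0.1
```

The first point lies in the linear region (ω = 0.1 ≤ p01), so its index is approximately ω itself, and the sampled indices increase with ω as stated in Corollary 1.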
Fig. 8. Whittle's index W_β(ω) as a function of ω (left: a positively correlated channel; right: a negatively correlated channel).

F. Performance of Whittle's Index Policy

1) The Optimality of Whittle's Index Policy under a Relaxed Constraint:
Whittle's index policy is the optimal solution to a Lagrangian relaxation of the RMBP [3]. Specifically, the number of activated arms can vary over time provided that its discounted average over the infinite horizon equals K. Let K(t) denote the number of arms activated in slot t. The relaxed constraint is given by

E_π[(1−β) Σ_{t=1}^∞ β^{t−1} K(t)] = K.   (38)

Let V̄_β(Ω(1)) denote the maximum expected total discounted reward that can be obtained under this relaxed constraint when the initial belief vector is Ω(1). Based on the Lagrangian multiplier theorem, we have [3]

V̄_β(Ω(1)) = inf_m { Σ_{i=1}^N V^{(i)}_{β,m}(ω_i(1)) − m(N−K)/(1−β) },   (39)

where V^{(i)}_{β,m}(ω) is the value function of the single-armed bandit process with subsidy m that corresponds to the i-th channel.

The above equation reveals the role of the subsidy m as the Lagrangian multiplier and the optimality of Whittle's index policy for the RMBP under the relaxed constraint given in (38). Specifically, under the relaxed constraint, Whittle's index policy is implemented by activating, in each slot, those arms whose current states have a Whittle's index greater than a constant m*. This constant m* is the Lagrangian multiplier that makes the relaxed constraint given in (38) satisfied, or equivalently, the Lagrangian multiplier that achieves the infimum in (39). It is not difficult to see that Whittle's index policy implemented by comparing to a constant m* is the optimal policy (i.e., achieves V̄_β(Ω(1))) for the RMBP under the relaxed constraint.
2) An Upper Bound of The Optimal Performance:
Under the strict constraint of K(t) = K for all t, Whittle's index policy is implemented by activating the K arms with the largest indices in each slot. Its optimality is lost in general.

Let V_β(Ω(1)) denote the maximum expected total discounted reward of the RMBP under the strict constraint that K(t) = K for all t. It is obvious that V_β(Ω(1)) ≤ V̄_β(Ω(1)). V̄_β(Ω(1)) thus provides a performance benchmark for all RMBP policies, including Whittle's index policy. Unfortunately, V̄_β(Ω(1)) as given in (39) is, in general, difficult to obtain due to the complexity of calculating the value functions of all arms and searching for the infimum over an uncountable space. For the problem at hand, however, we have obtained V^{(i)}_{β,m}(ω_i(1)) in closed form as given in Lemma 3. Furthermore, the piecewise-constant structure of the discounted passive time D^{(i)}_{β,m}(ω_i(1)) given in Lemma 5 leads to efficient algorithms for searching for the infimum over m, as shown below.

Fig. 9. The optimal performance under the relaxed constraint: G_{β,m}(Ω(1)) as a function of m, with the infimum V̄_β(Ω(1)) achieved at m*.

Let

G_{β,m}(Ω(1)) = Σ_{i=1}^N V^{(i)}_{β,m}(ω_i(1)) − m(N−K)/(1−β).

We then have V̄_β(Ω(1)) = inf_m G_{β,m}(Ω(1)). From Lemma 5, it is easy to see that G_{β,m}(Ω(1)) is convex in m, as illustrated in Fig. 9. The infimum of G_{β,m}(Ω(1)) is achieved at the point m* at which the derivative of G_{β,m}(Ω(1)) with respect to m becomes nonnegative for the first time (note that G_{β,m}(Ω(1)) is not differentiable at every m, and we consider the right derivative when it is not differentiable). Equivalently,

m* = sup { m : d(G_{β,m}(Ω(1)))/dm = Σ_{i=1}^N D^{(i)}_{β,m}(ω_i(1)) − (N−K)/(1−β) ≤ 0 }.
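As a cross-check of the efficient search developed below, G_{β,m}(Ω(1)) can also be evaluated by brute force with a generic value-iteration solver and minimized over a grid of subsidies. The following sketch uses three hypothetical channels (all parameters assumed for illustration; B_i = 1, K = 1, β = 0.8) and verifies that the resulting bound is convex in m and dominates a simple feasible policy.

```python
# Sketch: brute-force evaluation of the Lagrangian bound in (39).
import numpy as np

def V_arm(w0, p11, p01, beta, m, n_grid=301, n_iter=500):
    """Value of one single-armed process with subsidy m from initial belief w0."""
    grid = np.linspace(0.0, 1.0, n_grid)
    Tg = p11 * grid + p01 * (1.0 - grid)
    V = np.zeros(n_grid)
    for _ in range(n_iter):
        Va = grid + beta * (grid * np.interp(p11, grid, V)
                            + (1 - grid) * np.interp(p01, grid, V))
        Vp = m + beta * np.interp(Tg, grid, V)
        V = np.maximum(Va, Vp)
    return float(np.interp(w0, grid, V))

beta, K = 0.8, 1
arms = [(0.8, 0.2), (0.7, 0.3), (0.6, 0.1)]          # hypothetical (p11, p01)
w0 = [p01 / (1 - p11 + p01) for p11, p01 in arms]    # stationary initial beliefs
N = len(arms)

ms = np.linspace(0.0, 1.2, 25)
G = [sum(V_arm(w, p11, p01, beta, m) for (p11, p01), w in zip(arms, w0))
     - m * (N - K) / (1 - beta) for m in ms]
bound = min(G)   # approximates inf_m G_{beta,m}(Omega(1)) = bar-V_beta(Omega(1))

# sanity: the bound dominates a simple feasible policy (always sense one channel,
# whose value from belief w is the closed form of the always-active process)
single = max((w * (1 - beta) + beta * p01)
             / ((1 - beta) * (1 - beta * p11 + beta * p01))
             for (p11, p01), w in zip(arms, w0))
print(bound >= single)
```

Since any policy obeying the strict constraint is feasible for the relaxed problem, the bound must exceed the value of always sensing the single best channel, which gives a quick correctness check.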
From Lemma 5, D^{(i)}_{β,m}(ω_i(1)) is piecewise constant in m for each channel (see Fig. 10 and Fig. 11). We can thus partition the range of m into disjoint regions such that d(G_{β,m}(Ω(1)))/dm is constant in each region. To obtain m*, we only need to check each region successively until d(G_{β,m}(Ω(1)))/dm becomes nonnegative for the first time (due to the monotonically increasing property of D^{(i)}_{β,m}(ω_i(1)) in m). The difficulty is that for a positively correlated channel, there are infinitely many constant regions of D^{(i)}_{β,m}(ω_i(1)) (see Fig. 11).

Fig. 10. The passive time D_{β,m}(ω(1)) for different regions of m (p11 < p01), with region boundaries at W_β(ω(1)), W_β(p11), W_β(p01), W_β(T(ω(1))), and W_β(T(p11)).

Fig. 11. The passive time D_{β,m}(ω(1)) for different regions of m (p11 ≥ p01); the gray area (infinitely many pieces) lies between W_β(ω̄) and W_β(ω_o), with the remaining region boundaries at indices of the form W_β(T^k(p01)) and W_β(T^k(ω(1))).

However, we can find an arbitrarily small interval (W_β(ω̄), W_β(ω_o)] (referred to as the gray area) outside which there are only a finite number of constant regions of D^{(i)}_{β,m}(ω_i(1)). By setting the gray area for each positively correlated channel small enough, we can find an m′ that is arbitrarily close to m*, so that G_{β,m′}(Ω(1)) − G_{β,m*}(Ω(1)) ≤ ε for any ε > 0.
Specifically, we set the length of the gray area for each positively correlated channel to δ/N (i.e., W_β(ω_o) − W_β(ω̄) ≤ δ/N), where δ = ε(1−β)/K. The total length of the gray area over all channels is thus at most δ, i.e., m′ − m* ≤ δ. Based on the convexity of G_{β,m}(Ω(1)), the maximum derivative of G_{β,m}(Ω(1)) for m* ≤ m ≤ 1 is achieved at m = 1 and is equal to K/(1−β). Thus, we have

G_{β,m′}(Ω(1)) − G_{β,m*}(Ω(1)) ≤ (K/(1−β))(m′ − m*) ≤ δK/(1−β) = ε.

We point out that if m* does not fall into the gray area, the algorithm will obtain m* and V̄_β(Ω(1)) without error. In the special case when every channel is negatively correlated, the algorithm will always output the exact values of m* and V̄_β(Ω(1)). The detailed algorithm is given in Fig. 12. The complexity of this algorithm is given in the following theorem.

Computing the Performance Upper Bound within ε-Accuracy

Input an ε > 0. Set δ = ε(1−β)/K and j = 0.
1) For each negatively correlated channel i, calculate W_β(p11^(i)), W_β(p01^(i)), and W_β(T(p11^(i))). If ω_i(1) < ω_o^(i), calculate W_β(ω_i(1)) and W_β(T(ω_i(1))); otherwise only calculate W_β(ω_i(1)).
2) For each positively correlated channel i, calculate W_β(p01^(i)), W_β(p11^(i)), and W_β(ω_o^(i)). Search for an ω̄^(i) ∈ [ω_o^(i) − δ/N, ω_o^(i)) such that W_β(ω_o^(i)) − W_β(ω̄^(i)) ≤ δ/N. Let l_i be the smallest integer such that T^{l_i}(p01^(i)) > ω̄^(i). Calculate W_β(T^k(p01^(i))) for all 0 ≤ k ≤ l_i. If ω_i(1) < ω_o^(i), then let d_i be the smallest integer such that T^{d_i}(ω_i(1)) > ω̄^(i) and calculate W_β(T^k(ω_i(1))) for all 0 ≤ k ≤ d_i; otherwise only calculate W_β(ω_i(1)). Set the gray area V = ∪_i [min{W_β(T^{l_i}(p01^(i))), W_β(T^{d_i}(ω_i(1)))}, W_β(ω_o^(i))).
3) Sort all Whittle's indices calculated in Steps 1 and 2 in ascending order.
Let [a_1, …, a_h] denote the ordered Whittle's indices. Set a_0 = 0 and a_{h+1} = 1.
4) If [a_j, a_{j+1}) ⊄ V, calculate D = Σ_{k=1}^N D^{(k)}_{β,m}(ω_k(1)) − (N−K)/(1−β) for m ∈ [a_j, a_{j+1}) according to (27) (note that every D^{(k)}_{β,m}(ω_k(1)) is constant for m ∈ [a_j, a_{j+1})). If D is nonnegative, go to Step 5; otherwise set j = j + 1 and repeat Step 4.
5) Calculate G = G_{β,m}(Ω(1)) for m ∈ [a_j, a_{j+1}) according to (22). Output m′ = a_j and G.

Fig. 12. Algorithm for computing the upper bound of the optimal performance.
Theorem 3:
For any ε > 0, the algorithm given in Fig. 12 runs in at most O(N log N) time to output a value G that is within ε of V̄_β(Ω(1)).

Proof:
See Appendix C.

To find the infimum of G_{β,m}(Ω(1)), we can also carry out a binary search on the subsidy m. It can be shown that this algorithm runs in O(N(log N)^2) time. However, it cannot output the exact values of m* and V̄_β(Ω(1)).

Fig. 13 shows an example of the performance of Whittle's index policy. It demonstrates the near-optimal performance of Whittle's index policy and the tightness of the performance upper bound.
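The search alternative mentioned above relies only on the convexity of G_{β,m}(Ω(1)) in m. The sketch below illustrates the idea with a ternary search on a toy convex piecewise-linear surrogate of G (the surrogate stands in for the sum of per-arm value functions and is not the paper's model); by construction its minimizer sits at the kink m = 0.5, where the derivative first turns positive.

```python
# Ternary search on a convex function of the subsidy m (toy surrogate for G).
def G(m, w=(0.2, 0.5, 0.8), c=(1.0, 1.1, 0.9), n_minus_k=2):
    # each term is convex, piecewise linear, and nondecreasing in m,
    # mimicking a per-arm value function V^{(i)}_{beta,m}
    return sum(ci * max(wi, m) for wi, ci in zip(w, c)) - n_minus_k * m

def ternary_min(f, lo=0.0, hi=1.5, iters=100):
    # for convex f, comparing two interior points safely shrinks the bracket
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

m_star = ternary_min(G)
print(round(m_star, 4))   # → 0.5
```

The same bracketing argument is what makes a derivative-free search on the true convex G sound; the exact-region algorithm of Fig. 12 is preferable only because it can exploit the piecewise-constant derivative directly.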
Fig. 13. The performance of Whittle's index policy compared with the upper bound on the optimal performance (N = 8, B_i = 1 for i = 1, …, 8).

V. WHITTLE'S INDEX UNDER AVERAGE REWARD CRITERION
In this section, we investigate Whittle’s index policy under the average reward criterion andestablish results parallel to those obtained under the discounted reward criterion in Sec. IV.
A. The Value Function and The Optimal Policy
First, we present a general result by Dutta [30] on the relationship between the value function and the optimal policy under the total discounted reward criterion and those under the average reward criterion. This result allows us to study Whittle's index policy under the average reward criterion by examining its limiting behavior as the discount factor β → 1.

Dutta's Theorem [30]. Let F be the belief space of a POMDP and V_β(Ω) the value function with discount factor β for belief Ω ∈ F. The POMDP satisfies the value-boundedness condition if there exist a belief Ω′, a real-valued function c(Ω): F → R, and a constant c < ∞ such that

c(Ω) ≤ V_β(Ω) − V_β(Ω′) ≤ c

for any Ω ∈ F and β ∈ [0, 1). Under the value-boundedness condition, if a sequence of optimal policies π_{β_k} for a POMDP with discount factor β_k pointwise converges to a limit π* as β_k → 1, then π* is the optimal policy for the POMDP under the average reward criterion. Furthermore, let J(Ω) denote the maximum expected average reward over the infinite horizon starting from the initial belief Ω. We have J(Ω) = lim_{β_k→1}(1−β_k)V_{β_k}(Ω), and J(Ω) = J is independent of the initial belief Ω.

Next, we will show that the single-armed bandit process with subsidy m under the discounted reward criterion (see Sec. III-B) satisfies the value-boundedness condition.

Lemma 6:
The single-armed bandit process with subsidy under the discounted reward criterion satisfies the value-boundedness condition. More specifically, we have

|V_{β,m}(ω) − V_{β,m}(ω′)| ≤ c + 1  for all ω, ω′ ∈ [0, 1],   (40)

where c = max{(p11−p01)/(1−p11+p01), (p01−p11)/(1−p01+p11)}.

Proof:
See Appendix D.

Under the value-boundedness condition, the optimal policy for the single-armed bandit process with subsidy under the average reward criterion can be obtained as the limit of any pointwise convergent sequence of the optimal policies under the discounted reward criterion. The following lemma shows that the optimal policy for the single-armed bandit process with subsidy under the average reward criterion is also a threshold policy.
Lemma 7:
Let ω*_β(m) denote the threshold of the optimal policy for the single-armed bandit process with subsidy m under the discounted reward criterion. Then lim_{β→1} ω*_β(m) exists for any m. Furthermore, the optimal policy for the single-armed bandit process with subsidy m under the average reward criterion is also a threshold policy, with threshold ω*(m) = lim_{β→1} ω*_β(m). Here we do not consider the trivial case in which the arm has absorbing states.

Proof:
See Appendix E.
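The limit in Dutta's theorem can also be observed numerically for the single-armed process with subsidy. In the sketch below (assumed parameters p11 = 0.8, p01 = 0.2 and m = 0, for which always activating the arm is optimal and J equals the stationary belief ω_o = 0.5), the scaled value (1−β)V_{β,m}(ω) flattens in ω and approaches J as β → 1.

```python
# Sketch: (1-beta)*V_beta -> J as beta -> 1 (Dutta's theorem, single arm, m = 0).
import numpy as np

def V_single(p11, p01, beta, m, n_grid=301, n_iter=3000):
    grid = np.linspace(0.0, 1.0, n_grid)
    Tg = p11 * grid + p01 * (1.0 - grid)
    V = np.zeros(n_grid)
    for _ in range(n_iter):
        Va = grid + beta * (grid * np.interp(p11, grid, V)
                            + (1 - grid) * np.interp(p01, grid, V))
        Vp = m + beta * np.interp(Tg, grid, V)
        V = np.maximum(Va, Vp)
    return V

p11, p01, m = 0.8, 0.2, 0.0
w_o = p01 / (1 - p11 + p01)          # stationary belief, here 0.5
for beta in (0.9, 0.99):
    V = V_single(p11, p01, beta, m)
    scaled = (1 - beta) * V
    print(beta, round(scaled.min(), 3), round(scaled.max(), 3))
# prints: 0.9 0.391 0.609  then  0.99 0.488 0.512
# both ranges straddle J = 0.5, and the spread shrinks as beta -> 1
```

The bounded spread of V itself (independent of β) is exactly the value-boundedness property of Lemma 6.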
B. Indexability and Whittle’s index policy
Based on Lemma 7, the restless multi-armed bandit process (Ω(1), {P_i}_{i=1}^N, {B_i}_{i=1}^N, 1) is indexable if the threshold ω*(m) of the optimal policy is monotonically increasing with the subsidy m. Next, we show that this monotonicity holds and that the restless multi-armed bandit process (Ω(1), {P_i}_{i=1}^N, {B_i}_{i=1}^N, 1) is indexable. Moreover, we obtain Whittle's index in closed form as shown below.

Theorem 4:
The restless multi-armed bandit process (Ω(1), {P_i}_{i=1}^N, {B_i}_{i=1}^N, 1) is indexable with Whittle's index W(ω) given below, where for brevity we drop the superscript (i) on p11, p01, T(·), and ω_o, and write L = L(p01, ω).

• Case 1: Positively correlated channel (p11^(i) ≥ p01^(i)).

W(ω) =
  ω B_i,   if ω ≤ p01 or ω ≥ p11;
  ([(ω − T(ω))(L + 1) + T^L(p01)] / [1 − p11 + (ω − T(ω))L + T^L(p01)]) B_i,   if p01 < ω < ω_o;
  (ω / (1 − p11 + ω)) B_i,   if ω_o ≤ ω < p11.   (41)

• Case 2: Negatively correlated channel (p11^(i) < p01^(i)).

W(ω) =
  ω B_i,   if ω ≤ p11 or ω ≥ p01;
  ((ω + p01 − T(ω)) / (1 + p01 − T(p11) + T(ω) − ω)) B_i,   if p11 < ω < ω_o;
  (p01 / (1 + p01 − T(p11))) B_i,   if ω_o ≤ ω < T(p11);
  (p01 / (1 + p01 − ω)) B_i,   if T(p11) ≤ ω < p01.   (42)

Proof:
See Appendix F.

The monotonicity and piecewise concave/convex properties of Whittle's index under the discounted reward criterion given in Corollary 1 are preserved under the average reward criterion. The only difference is that Whittle's index under the discounted reward criterion is always strictly increasing with the belief state, while Whittle's index W(ω) under the average reward criterion is a constant function of ω for ω_o ≤ ω < T(p11) in the case of a negatively correlated channel (see (42)).

C. The Performance of Whittle's Index Policy
Similar to the case under the discounted reward criterion, Whittle's index policy is optimal under the average reward criterion when the constraint on the number of activated arms K(t) (t ≥ 1) is relaxed to the following:

E_π[ lim_{T→∞} (1/T) Σ_{t=1}^T K(t) ] = K.

Let J̄(Ω(1)) denote the maximum expected average reward that can be obtained under this relaxed constraint when the initial belief vector is Ω(1). Based on the Lagrangian multiplier theorem, we have [3]

J̄ = inf_m { Σ_{i=1}^N J_m^{(i)} − m(N−K) },   (43)

where J_m^{(i)} is the value function of the single-armed bandit process with subsidy m that corresponds to the i-th channel.

Let J(Ω(1)) denote the maximum expected average reward of the RMBP under the strict constraint that K(t) = K for all t. Obviously, J(Ω(1)) ≤ J̄. J̄ thus provides a performance benchmark for Whittle's index policy under the strict constraint.

To evaluate J̄, we consider the single-armed bandit with subsidy m under the average reward criterion. The value function J_m and the average passive time D_m = d(J_m)/dm can be obtained in closed form as shown in Lemma 8 below.

Lemma 8:
The value function J_m and the average passive time D_m can be obtained in closed form as given below, where ω*(m) is the threshold of the optimal policy and L = L(p01, ω*(m)). Furthermore, D_m is piecewise constant and increasing with m.

J_m =
  ω_o,   if ω*(m) < min{p01, p11};
  [(1−p11) L m + T^L(p01)] / [(1−p11)(L+1) + T^L(p01)],   if p01 ≤ ω*(m) < ω_o;
  (p01 m + p01) / (1 + 2p01 − T(p11)),   if p11 ≤ ω*(m) < T(p11);
  m,   in all other cases,   (44)

and

D_m =
  0,   if ω*(m) < min{p01, p11};
  (1−p11) L / [(1−p11)(L+1) + T^L(p01)],   if p01 ≤ ω*(m) < ω_o;
  p01 / (1 + 2p01 − T(p11)),   if p11 ≤ ω*(m) < T(p11);
  1,   in all other cases.   (45)

Proof:
Under the value-boundedness condition as shown in Sec. V-A, we have, according to Dutta's theorem,

J_m = lim_{β_k→1} (1−β_k) V_{β_k,m}(ω),

which leads to (44) directly. The closed-form expression for D_m can be obtained from the derivative of J_m with respect to m. The proof that D_m is increasing with m is similar to that given in Lemma 5.

Based on the closed-form D_m given in Lemma 8, we can obtain the subsidy m* that achieves the infimum in (43). Specifically, m* is the supremum of the values of m ∈ [0, 1] satisfying Σ_{i=1}^N D_m^{(i)} ≤ N − K. After obtaining m*, it is easy to calculate the infimum according to the closed-form J_m given in Lemma 8. With minor changes, the algorithm in Sec. IV-F can be applied to evaluate the upper bound J̄. We note that the initial belief need not be considered in this algorithm, which leads to a shorter running time. Simulation results similar to Fig. 9 have been observed, demonstrating the near-optimal performance of Whittle's index policy under the average reward criterion.

VI. WHITTLE'S INDEX POLICY FOR STOCHASTICALLY IDENTICAL CHANNELS
Based on the equivalency between Whittle's index policy and the myopic policy for stochastically identical arms, we can analyze Whittle's index policy by focusing on the myopic policy, which has a much simpler index form. In this section, we establish the semi-universal structure and study the optimality of Whittle's index policy for stochastically identical arms.
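The queue structure established in this section is simple enough to simulate directly. The following sketch (hypothetical parameters, not from the paper) maintains the queue under the reordering rules described in Sec. VI-A and checks, in every slot, that the K channels at the head of the queue are exactly a set of K channels with the largest beliefs, i.e., the myopic/Whittle choice.

```python
# Sketch: the semi-universal queue structure vs. belief-based (myopic) selection.
import random
random.seed(1)

def run(p11, p01, N=5, K=2, slots=200):
    pos = p11 >= p01
    T = lambda w: p11 * w + p01 * (1 - w)        # belief update if unobserved
    b = sorted((0.2 + 0.12 * i for i in range(N)), reverse=True)
    belief = dict(enumerate(b))                  # channel -> belief, descending
    queue = list(range(N))                       # initial order per (46)
    state = [int(random.random() < belief[i]) for i in range(N)]
    for _ in range(slots):
        sel = queue[:K]                          # sense the K head channels
        top = sorted(belief.values(), reverse=True)[:K]
        assert sorted((belief[c] for c in sel), reverse=True) == top
        obs = {c: state[c] for c in sel}
        for i in range(N):                       # belief update
            belief[i] = (p11 if obs[i] else p01) if i in obs else T(belief[i])
        state = [int(random.random() < (p11 if s else p01)) for s in state]
        if pos:   # 1-observed stay at head, 0-observed go to the end
            queue = ([c for c in sel if obs[c]] + queue[K:]
                     + [c for c in sel if not obs[c]])
        else:     # 0-observed stay at head, 1-observed to the end, rest reversed
            queue = ([c for c in sel if not obs[c]] + queue[K:][::-1]
                     + [c for c in sel if obs[c]])
    return True

ok = run(0.8, 0.2) and run(0.2, 0.8)
print(ok)   # → True: the queue heads always carry the largest beliefs
```

Note that the simulation never consults p11 and p01 inside the reordering rule except through their order, which is the semi-universal property.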
A. The Structure of Whittle’s Index Policy
The implementation of Whittle's index policy can be described with a queue structure. Specifically, all N channels are ordered in a queue, and in each slot, the K channels at the head of the queue are sensed. Based on the observations, channels are reordered at the end of each slot according to the following simple rules.

When p11 ≥ p01, the channels observed in state 1 stay at the head of the queue while the channels observed in state 0 are moved to the end of the queue (see Fig. 14).

When p11 < p01, the channels observed in state 0 stay at the head of the queue while the channels observed in state 1 are moved to the end of the queue. The order of the unobserved channels is reversed (see Fig. 15).

Fig. 14. The structure of Whittle's index policy (p11 ≥ p01).
Fig. 15. The structure of Whittle's index policy (p11 < p01).

The initial channel ordering K(1) is determined by the initial belief vector as given below:

ω_{n1}(1) ≥ ω_{n2}(1) ≥ ⋯ ≥ ω_{nN}(1) ⟹ K(1) = (n1, n2, ⋯, nN).   (46)

See Appendix G for the proof of the structure of Whittle's index policy.

The advantage of this structure of Whittle's index policy is twofold. First, it demonstrates the simplicity of Whittle's index policy: channel selection is reduced to maintaining a simple queue structure that requires no computation and little memory. Second, it shows that Whittle's index policy has a semi-universal structure; it can be implemented without knowing the channel transition probabilities except for the order of p11 and p01. As a result, Whittle's index policy is robust against model mismatch and automatically tracks variations in the channel model provided that the order of p11 and p01 remains unchanged. As shown in Fig. 16, the transition probabilities change abruptly in the fifth slot, which corresponds to an increase in the occurrence of the good channel state in the system. From this figure, we can observe, from the change in the rate at which the throughput increases, that Whittle's index policy effectively tracks the model variations.

Fig. 16. Tracking the change in channel transition probabilities occurring at t = 6 (p11 = 0.6, p01 = 0.1 for t ≤ 5; p11 = 0.9, p01 = 0.4 for t > 5).

B. Optimality and Approximation Factor of Whittle's Index Policy
Based on the simple structure of Whittle's index policy for stochastically identical channels, we can obtain a lower bound on its performance. Combining this lower bound and the upper bound shown in Sec. V-C, we further obtain the approximation factor of Whittle's index policy, which is independent of the channel parameters. Recall that J denotes the average reward achieved by the optimal policy. Let J_w denote the average reward achieved by Whittle's index policy.

Theorem 5: Lower and Upper Bounds on The Performance of Whittle's Index Policy

If p11 ≥ p01,

K T^{⌊N/K⌋−1}(p01) / (1 − p11 + T^{⌊N/K⌋−1}(p01)) ≤ J_w ≤ J ≤ min{ K ω_o / (1 − p11 + ω_o), N ω_o }.   (47)

If p11 < p01,

K p01 / (1 + p01 − T^{⌊N/K⌋−1}(p11)) ≤ J_w ≤ J ≤ min{ K p01 / (1 + p01 − T(p11)), N ω_o }.   (48)

Proof:
The upper bound on J is obtained from the upper bound on the optimal performance for generally non-identical channels given in (43). The lower bound on J_w is obtained from the structure of Whittle's index policy. See Appendix H for the complete proof.

Corollary 2:
Let η = J_w/J be the approximation factor, defined as the ratio of the performance of Whittle's index policy to the optimal performance. We have
Fig. 17. The approximation factor of Whittle’s index policy.
Positively correlated channels: η = 1 for K = 1, N−1, N; η ≥ K/N otherwise.

Negatively correlated channels: η = 1 for K = N−1, N; η ≥ max{1/2, K/N} otherwise.

Proof:
See Appendix I.

Fig. 17 illustrates the approximation factor of Whittle's index policy for both positively correlated and negatively correlated channels. We notice that the approximation factor approaches 1 as K increases. For negatively correlated channels, Whittle's index policy achieves at least half of the optimal performance. For positively correlated channels, the approximation factor can be further improved under certain conditions on the transition probabilities. Specifically, we have η ≥ 1 − p11 + ω_o.

From Corollary 2, Whittle's index policy is optimal when K = 1 (for positively correlated channels) and K = N − 1. The optimality for K = N is trivial. We point out that for a general K, numerical examples have shown that the actions given by Whittle's index policy match the optimal actions on randomly generated sample paths, suggesting the optimality of Whittle's index policy.

VII. CONCLUSION
In this paper, we have formulated the multi-channel opportunistic access problem as a restless multi-armed bandit process. We established the indexability and obtained Whittle's index in closed form under both the discounted and the average reward criteria. We developed efficient algorithms for computing an upper bound on the optimal performance, namely, the optimal performance under the relaxed constraint on the average number of channels that can be sensed simultaneously. When channels are stochastically identical, we have shown that Whittle's index policy coincides with the myopic policy. Based on this equivalency, we established the semi-universal structure and the optimality of Whittle's index policy under certain conditions.

APPENDIX
A: PROOF OF LEMMA

From the threshold structure of the optimal policy for the single-armed bandit with subsidy m, we have

  V_{β,m}(p_{01}) = [(1 − β^{L(p_{01}, ω*_β(m))}) / (1 − β)] m + β^{L(p_{01}, ω*_β(m))} V_{β,m}(T^{L(p_{01}, ω*_β(m))}(p_{01}); u = 1),   (49)

  V_{β,m}(p_{11}) = [(1 − β^{L(p_{11}, ω*_β(m))}) / (1 − β)] m + β^{L(p_{11}, ω*_β(m))} V_{β,m}(T^{L(p_{11}, ω*_β(m))}(p_{11}); u = 1).   (50)

As shown in (7), V_{β,m}(T^{L(ω, ω*_β(m))}(ω); u = 1) is a function of V_{β,m}(p_{01}) and V_{β,m}(p_{11}) for any ω ∈ [0, 1]. We thus have two equations, (49) and (50), for the two unknowns V_{β,m}(p_{01}) and V_{β,m}(p_{11}), provided that we can obtain the two crossing times L(p_{01}, ω*_β(m)) and L(p_{11}, ω*_β(m)). From (16) and (17), we can obtain these crossing times by considering the different regions in which the threshold ω*_β(m) may lie (see Fig. 18 and Fig. 19). We can thus solve for V_{β,m}(p_{01}) and V_{β,m}(p_{11}) from (49) and (50) by considering each region within which both crossing times L(p_{01}, ω*_β(m)) and L(p_{11}, ω*_β(m)) are constant.

Fig. 18. The threshold crossing time for the different regions of ω*_β(m) when p_{11} ≥ p_{01} (the two partitions show L(p_{01}, ω*_β(m)) and L(p_{11}, ω*_β(m))).

APPENDIX
B: PROOF OF THEOREM 1

We show that the single-armed bandit process is indexable. Based on the threshold structure of the optimal policy for the single-armed bandit with subsidy m given in Lemma 2, indexability reduces to the monotonicity of the threshold ω*_β(m), i.e., to showing that ω*_β(m) is monotonically increasing with the subsidy m for m ∈ [0, 1). To prove the monotonicity of ω*_β(m), we first give the following lemma.

Fig. 19. The threshold crossing time for the different regions of ω*_β(m) when p_{11} < p_{01} (the two partitions show L(p_{01}, ω*_β(m)) and L(p_{11}, ω*_β(m))).

Lemma 9:
Suppose that for any m ∈ [0, 1) we have

  dV_{β,m}(ω; u = 1)/dm |_{ω = ω*_β(m)} < dV_{β,m}(ω; u = 0)/dm |_{ω = ω*_β(m)}.   (51)

Then ω*_β(m) is monotonically increasing with m.

We prove Lemma 9 by contradiction. Assume that there exists an m ∈ [0, 1) such that ω*_β(m) is decreasing at m. Then there exists an ε > 0 such that for any ∆m ∈ [0, ε], we have

  V_{β,m+∆m}(ω*_β(m); u = 1) ≥ V_{β,m+∆m}(ω*_β(m); u = 0).   (52)

Since ω*_β(m) is the threshold of the optimal policy under subsidy m, we have

  V_{β,m}(ω*_β(m); u = 1) = V_{β,m}(ω*_β(m); u = 0).   (53)

From (52) and (53), we have

  dV_{β,m}(ω; u = 1)/dm |_{ω = ω*_β(m)} ≥ dV_{β,m}(ω; u = 0)/dm |_{ω = ω*_β(m)},

which contradicts (51). Lemma 9 thus holds.

According to Lemma 9, it is sufficient to prove (51). Recall that D_{β,m}(ω) = dV_{β,m}(ω)/dm. From (7) and (8), we can write (51) as

  β (ω*_β(m) D_{β,m}(p_{11}) + (1 − ω*_β(m)) D_{β,m}(p_{01})) < 1 + β D_{β,m}(T(ω*_β(m))).   (54)

To prove (54), we consider the following three regions of ω*_β(m).

Region 1: 0 ≤ ω*_β(m) < min{p_{01}, p_{11}}. Based on the lower bound of the updated belief given in Lemma 1, the arm will be activated in every slot when the initial belief ω > ω*_β(m). Thus, D_{β,m}(p_{01}) = D_{β,m}(p_{11}) = D_{β,m}(T(ω*_β(m))) = 0, and (54) holds trivially.

Region 2: ω_o ≤ ω*_β(m) ≤ 1. In this region, the arm is made passive in every slot when the initial belief state is T(ω*_β(m)). This is because T^k(ω*_β(m)) ≤ ω*_β(m) for any k ≥ 1 (see Lemma 1, Fig. 3, and Fig. 4). Therefore, D_{β,m}(T(ω*_β(m))) = 1/(1 − β). Since both D_{β,m}(p_{01}) and D_{β,m}(p_{11}) are upper bounded by 1/(1 − β), it is easy to see that (54) holds.

Region 3: min{p_{01}, p_{11}} ≤ ω*_β(m) < ω_o. In this region, T(ω*_β(m)) > ω*_β(m) (see Fig. 3 and Fig. 4).
Thus, T(ω*_β(m)) is in the active set, which gives us

  D_{β,m}(T(ω*_β(m))) = β (T(ω*_β(m)) D_{β,m}(p_{11}) + (1 − T(ω*_β(m))) D_{β,m}(p_{01})).   (55)

To prove (54), we consider the positively correlated and negatively correlated cases separately.

• Case 1: Negatively correlated channels (p_{11} < p_{01}). Since p_{01} > ω_o > ω*_β(m), p_{01} is in the active set. We thus have

  D_{β,m}(p_{01}) = β (p_{01} D_{β,m}(p_{11}) + (1 − p_{01}) D_{β,m}(p_{01})).   (56)

Substituting (55) and (56) into (54), we reduce (54) to the following inequality:

  [β D_{β,m}(p_{11}) / (1 − β(1 − p_{01}))] (1 − β) (β p_{01} + ω*_β(m) − β T(ω*_β(m))) < 1.   (57)

Notice that the left-hand side of (57) is increasing with ω*_β(m) and with D_{β,m}(p_{11}). It thus suffices to show the inequality after replacing ω*_β(m) with its upper bound ω_o and D_{β,m}(p_{11}) with its upper bound 1/(1 − β). After some simplifications, it is sufficient to prove

  f(β) ≜ p_{01}(p_{01} − p_{11}) β² + β (p_{01} + 1 − p_{11} − p_{01}² + p_{01} p_{11}) − (1 + p_{01} − p_{11}) < 0.   (58)

It is easy to see that f(β) is convex in β, f(0) = −(1 + p_{01} − p_{11}) < 0, and f(1) = 0. We thus conclude that f(β) < 0 for any 0 ≤ β < 1.

• Case 2: Positively correlated channels (p_{11} > p_{01}). Since p_{11} ≥ ω_o > ω*_β(m), p_{11} is in the active set. We thus have

  D_{β,m}(p_{11}) = β (p_{11} D_{β,m}(p_{11}) + (1 − p_{11}) D_{β,m}(p_{01})).   (59)

Substituting (55) and (59) into (54), we reduce (54) to the following inequality:

  β D_{β,m}(p_{01}) (1 − β) [1 − ω*_β(m) − β(p_{11} − T(ω*_β(m)))] / (1 − β p_{11}) < 1.   (60)

Substituting the closed form of D_{β,m}(p_{01}) given in (28) into (60), we end up with an inequality in terms of L(p_{01}, ω*_β(m)) and ω*_β(m). Notice that the left-hand side of (60) is decreasing with ω*_β(m). It thus suffices to show the inequality after replacing ω*_β(m) with its lower bound T^{L(p_{01}, ω*_β(m))−1}(p_{01}) (by the definition of L(p_{01}, ω*_β(m))). Let x = p_{11} − p_{01}. After some simplifications, it is sufficient to show that for any 0 ≤ β < 1, 0 ≤ p_{01} ≤ 1, 0 ≤ x ≤ 1 − p_{01}, and L ∈ {0, 1, 2, ...},

  f(β) ≜ β^{L+2} p_{01} x^{L+1}(1 − x) + β² (p_{01} x^{L+2} + x − x² − p_{01} x) + β (x² + p_{01} x − p_{01} x^{L+1} − 1) + 1 − x > 0.   (61)

Since f(0) = 1 − x > 0 and f(1) = 0, it is sufficient to prove that f(β) is strictly decreasing with β for 0 ≤ β ≤ 1, which follows by showing df(β)/dβ < 0 for 0 ≤ β < 1, where

  df(β)/dβ = (L + 2) β^{L+1} p_{01} x^{L+1}(1 − x) + 2β (p_{01} x^{L+2} + x − x² − p_{01} x) + (x² + p_{01} x − p_{01} x^{L+1} − 1).   (62)

To show df(β)/dβ < 0 for 0 ≤ β < 1, we establish the following two facts: (i) df(β)/dβ |_{β=1} ≤ 0; (ii) df(β)/dβ is strictly increasing with β.

To prove (i), we set β = 1 in (62). After some simplifications, we need to prove

  h(p_{01}) ≜ −p_{01} L x^{L+2} + p_{01} (L + 1) x^{L+1} − x² − p_{01} x + 2x − 1 ≤ 0.   (63)

Since h(0) = −(x − 1)² ≤ 0, it is sufficient to prove that h(p_{01}) is monotonically decreasing with p_{01}, i.e., we need to prove

  dh(p_{01})/dp_{01} = −L x^{L+2} + (L + 1) x^{L+1} − x ≤ 0.   (64)

Since L x^{L+1} ≤ Σ_{k=1}^{L} x^k = (x − x^{L+1})/(1 − x), it is easy to see that (64) holds. We thus proved (i).

To prove (ii), it suffices to show that the coefficient of β in (62) is nonnegative, i.e., we need to prove

  p_{01} x^{L+2} + x − x² − p_{01} x ≥ 0.   (65)

Since 0 ≤ x ≤ 1 − p_{01}, we have p_{01} x (x^{L+1} − 1) ≥ −p_{01} x ≥ (x − 1) x, and it is easy to see that (65) holds. We thus proved (ii).

From (i) and (ii), it is easy to see that df(β)/dβ < 0 for any 0 ≤ β < 1. We thus proved the indexability.

APPENDIX
C: PROOF OF THEOREM

Step 1 runs in O(N) time. In Step 2, the number of regions that needs to be calculated for each channel is at most O(log_δ N) = O(log N), and finding l_i and d_i for channel i takes constant time; Step 2 thus runs in at most O(N log N) time. In Step 3, sorting all these probabilities takes at most O(N log N · log(N log N)) = O(N (log N)²) time. Step 4 runs in O(N) time for each region that does not belong to V, so Step 4 runs in at most O(N log N) time. Finally, the last step runs in O(N) time. Overall, the algorithm runs in at most O(N (log N)²) time.

APPENDIX
D: PROOF OF LEMMA 6

From the boundedness of V_{β,m}(p) (see Lemma 3), we have, for any β (0 ≤ β < 1),

  |V_{β,m}(p_{01}) − V_{β,m}(p_{11})| ≤ c.   (66)

From Fig. 6, Fig. 7, and Fig. 5, we have, for any ω ∈ [0, 1],

  min{V_{β,m}(0; u = 1), V_{β,m}(1; u = 1)} ≤ V_{β,m}(ω) ≤ max{V_{β,m}(0; u = 0), V_{β,m}(1; u = 1)}.   (67)

Consequently, for any ω, ω′ ∈ [0, 1],

  |V_{β,m}(ω) − V_{β,m}(ω′)| ≤ max(|V_{β,m}(0; u = 1) − V_{β,m}(1; u = 1)|, |V_{β,m}(0; u = 0) − V_{β,m}(0; u = 1)|, |V_{β,m}(0; u = 0) − V_{β,m}(1; u = 1)|),

where each term on the right-hand side is at most β |V_{β,m}(p_{01}) − V_{β,m}(p_{11})| + 1. Since |V_{β,m}(p_{01}) − V_{β,m}(p_{11})| ≤ c for any β (0 ≤ β < 1), we have |V_{β,m}(ω) − V_{β,m}(ω′)| ≤ c + 1 for any β (0 ≤ β < 1) and ω, ω′ ∈ [0, 1]. Thus the value-boundedness condition is satisfied.

APPENDIX
E: PROOF OF LEMMA 7

The case of ω*_β(m) is trivial for m < 0 and m ≥ 1. For 0 ≤ m < 1, let W(ω) = lim_{β→1} W_β(ω). This limit exists and is given in Theorem 4 (obtaining the limit is tedious and lengthy, and we skip the detailed calculation). Define ω*(m) as the inverse function of W(ω). We notice that W(ω) is a constant function (thus not invertible) when ω_o ≤ ω ≤ T(p_{11}) (see (42)); in this case, we set ω*(m) = T(p_{11}). Formally, we have

  ω*(m) = { c (c < 0), if m < 0;  {ω : W(ω) = m}, if 0 ≤ m < 1;  b (b > 1), if m ≥ 1. }   (68)

Next, we prove that lim_{β→1} ω*_β(m) = ω*(m) by contradiction. Since W(ω) = lim_{β→1} W_β(ω) and W_β(ω) is increasing with ω, W(ω) is also increasing with ω. Assume first that W_β(ω) is strictly increasing at the point ω*_β(m). We prove lim_{β→1} ω*_β(m) = ω*(m) by contradiction as follows.

Assume that ω*_β(m) does not converge to ω*(m), i.e., there exist an ε > 0, a β′ (0 ≤ β′ < 1), and a sequence {β_k} (β_k → 1) such that |ω*_{β_k}(m) − ω*(m)| > ε for any β_k > β′. If ω*(m) − ε > ω*_{β_k}(m) for any β_k > β′, then W_{β_k}(ω*(m) − ε) ≥ W_{β_k}(ω*_{β_k}(m)) for any β_k > β′ by the monotonicity of W_{β_k}(ω). Since W(ω) is strictly increasing at the point ω*(m), there exists a δ > 0 such that W(ω*(m)) − W(ω*(m) − ε) > δ. Then we have, for any β_k > β′,

  W_{β_k}(ω*(m) − ε) ≥ W_{β_k}(ω*_{β_k}(m)) = m = W(ω*(m)) > W(ω*(m) − ε) + δ,

which contradicts the fact that W_{β_k}(ω*(m) − ε) → W(ω*(m) − ε) as β_k → 1. The proof for the case when ω*(m) + ε < ω*_{β_k}(m) for any β_k > β′ is similar to the above.

Consider next the case that W(ω) is not strictly increasing at the point ω*(m). This case only occurs when p_{11} < p_{01} and ω*(m) = T(p_{11}). We notice that W_β(T(p_{11})) increasingly converges to W(T(p_{11})) as β → 1. Thus ω*_β(m) ≥ T(p_{11}) = ω*(m) by the monotonicity of W_β(ω).
Assume that ω*_β(m) does not converge to ω*(m), i.e., there exist an ε > 0, a β′ (0 ≤ β′ < 1), and a sequence {β_k} (β_k → 1) such that ω*_{β_k}(m) − ω*(m) > ε for any β_k > β′. We have W_{β_k}(ω*(m) + ε) ≤ W_{β_k}(ω*_{β_k}(m)) for any β_k > β′ by the monotonicity of W_{β_k}(ω). Since W(ω) is strictly increasing in [ω*(m), ω*(m) + ε], there exists a δ′ > 0 such that W(ω*(m) + ε) − W(ω*(m)) > δ′. Then we have, for any β_k > β′,

  W_{β_k}(ω*(m) + ε) ≤ W_{β_k}(ω*_{β_k}(m)) = m = W(ω*(m)) < W(ω*(m) + ε) − δ′,

which contradicts the fact that W_{β_k}(ω*(m) + ε) → W(ω*(m) + ε) as β_k → 1.

Next, we show that the optimal policy π*_{β_k} for the single-armed bandit process with subsidy under the discounted reward criterion pointwise converges to a threshold policy π* as β_k → 1. We construct π* as follows: (1) if m < 0, the arm is made active all the time; (2) if m ≥ 1, the arm is made passive all the time; (3) if 0 ≤ m < 1, the arm is made passive when the current belief ω ≤ ω*(m) and is activated otherwise. Since ω*_β(m) converges to ω*(m) as β → 1, it is easy to see that π*_{β_k} pointwise converges to π*. Because the single-armed bandit process with subsidy under the discounted reward criterion satisfies the value-boundedness condition (see Lemma 6), the threshold policy π* is optimal for the single-armed bandit process with subsidy under the average reward criterion, based on Dutta's theorem.

APPENDIX
F: PROOF OF THEOREM

Since ω*(m) = lim_{β→1} ω*_β(m) and ω*_β(m) is monotonically increasing with m (see Theorem 1), it is easy to see that ω*(m) is also monotonically increasing with m. Therefore, the bandit is indexable.

Next, we prove that W(ω) ≜ lim_{β→1} W_β(ω) is indeed Whittle's index under the average reward criterion. For a belief state ω of an arm, its Whittle's index is the infimum subsidy m such that ω is in the passive set under the optimal policy for the arm, i.e., the infimum subsidy m such that ω ≤ ω*(m) (according to Lemma 7). From (68) and the monotonicity of W(ω) with ω, we have that W(ω) is the infimum subsidy m such that ω ≤ ω*(m).

APPENDIX
G: PROOF OF THE STRUCTURE OF WHITTLE'S INDEX POLICY
The proof is an extension of the proof given in [10] for single-channel sensing (K = 1). Consider the belief update of unobserved channels (see (1)):

  T(ω) = p_{01} + ω (p_{11} − p_{01}).   (69)

We notice that T(ω) is an increasing function of ω for p_{11} > p_{01} and a decreasing function of ω for p_{11} < p_{01}. Furthermore, the belief value ω_i(t) of channel i in slot t is bounded between p_{01} and p_{11} for any i and t > 1 (see (1)).

Consider first p_{11} ≥ p_{01}. The channels observed to be in state 1 in slot t − 1 achieve the upper bound p_{11} of the belief value in slot t, while the channels observed to be in state 0 achieve the lower bound p_{01}. Whittle's index policy, which is equivalent to the myopic policy, will thus stay in the channels observed to be in state 1 and treat the channels observed to be in state 0 as the least favorite in the next slot. The unobserved channels maintain their ordering of belief values in every slot due to the monotonically increasing property of T(ω). The structure of Whittle's index policy for p_{11} < p_{01} can be similarly obtained by noticing that reversing the order of the unobserved channels in every slot maintains the ordering of belief values, due to the monotonically decreasing property of T(ω).

APPENDIX
H: PROOF OF THEOREM 5

The lower bound on J_w is an extension of that with single-channel sensing (K = 1) given in [10]. It is, however, much more complex to analyze the performance of Whittle's index policy when K ≥ 2. The lower bound obtained here is looser than that in [10] when applied to the case of K = 1.

Define a transmission period on a channel as the number of consecutive slots in which the channel has been sensed. Based on the structure of Whittle's index policy, it is easy to show that

  J_w = K (1 − 1/E[τ]),  if p_{11} ≥ p_{01};   J_w = K / E[τ],  if p_{11} < p_{01},   (70)

where E[τ] is the average length of a transmission period over the infinite time horizon. By (70), bounding the throughput J_w is equivalent to bounding the average length E[τ] of the transmission period. We consider the following two cases.

• Case 1: p_{11} ≥ p_{01}.

Let ω denote the belief value of the chosen channel in the first slot of a transmission period. The length τ(ω) of this transmission period has the following distribution:

  Pr[τ(ω) = l] = 1 − ω, for l = 1;   Pr[τ(ω) = l] = ω p_{11}^{l−2} (1 − p_{11}), for l > 1.   (71)

It is easy to see that if ω′ ≥ ω, then τ(ω′) stochastically dominates τ(ω).

From the structure of Whittle's index policy, ω = T^k(p_{01}), where k is the number of consecutive slots in which the channel has been unobserved since the last visit to this channel. When the user leaves one channel, this channel has the lowest priority; it will take at least ⌊(N − K)/K⌋ slots before the user returns to the same channel, i.e., k ≥ ⌊N/K⌋ − 1. Based on the monotonically increasing property of the k-step belief update T^k(p_{01}) (see Fig. 3), we have ω = T^k(p_{01}) ≥ T^{⌊N/K⌋−1}(p_{01}). Thus τ(T^{⌊N/K⌋−1}(p_{01})) is stochastically dominated by τ(ω), and the expectation of the former leads to the lower bound on J_w given in (47).

• Case 2: p_{11} < p_{01}.

In this case, τ(ω) has the following distribution:

  Pr[τ(ω) = l] = ω, for l = 1;   Pr[τ(ω) = l] = (1 − ω)(1 − p_{01})^{l−2} p_{01}, for l > 1.   (72)

Opposite to Case 1, τ(ω′) stochastically dominates τ(ω) if ω′ ≤ ω.
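To make the transmission-period argument concrete, the distribution in (71) and its mean, which works out to E[τ(ω)] = 1 + ω/(1 − p_{11}), can be checked by a short simulation. The sketch below follows our reading of Case 1 (state 1 is the rewarded state, and the user leaves a channel one slot after it is observed in state 0); the helper names are ours.

```python
import random

# Sketch of the transmission-period calculation for the positively
# correlated case (p11 >= p01): by (71), a period starting at belief w has
#   Pr[tau = 1] = 1 - w,  Pr[tau = l] = w * p11**(l-2) * (1 - p11), l >= 2,
# which gives E[tau] = 1 + w / (1 - p11).

def expected_tau(w, p11):
    """Closed-form mean length of a transmission period."""
    return 1.0 + w / (1.0 - p11)

def sample_tau(w, p11, rng):
    """Draw one period length from the distribution in (71)."""
    if rng.random() >= w:   # channel observed in state 0: leave after one slot
        return 1
    tau = 2                 # observed in state 1: stay while it remains in state 1
    while rng.random() < p11:
        tau += 1
    return tau

rng = random.Random(0)
w, p11, n = 0.6, 0.8, 200_000
est = sum(sample_tau(w, p11, rng) for _ in range(n)) / n
print(abs(est - expected_tau(w, p11)) < 0.05)   # Monte Carlo mean is close to 4.0
```

Each such period contains τ − 1 rewarded slots, which is how (70) gives J_w = K(1 − 1/E[τ]) in this case; plugging in ω = T^{⌊N/K⌋−1}(p_{01}) yields the lower bound in (47).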
From the structure of Whittle's index policy, ω = T^k(p_{11}), where k is the number of consecutive slots in which the channel has been unobserved since the last visit to this channel. If k is odd, then T^k(p_{11}) ≥ T^{⌊N/K⌋−1}(p_{11}) since ⌊N/K⌋ − 1 is an even number (see Fig. 4). If k is even, then k is at least ⌊(N − K)/K⌋, and we have ω = T^k(p_{11}) ≥ T^{⌊N/K⌋−1}(p_{11}). Thus τ(ω) is stochastically dominated by τ(T^{⌊N/K⌋−1}(p_{11})), and the expectation of the latter leads to the lower bound on J_w given in (48).

Next, we show the upper bound on J. From (43), we have J ≤ inf_m {N J_m − m(N − K)} since the channels are stochastically identical. When p_{11} ≥ p_{01}, evaluating the right-hand side at two particular values of m yields

  J ≤ min{ K ω_o / (1 − p_{11} + ω_o), N ω_o }.   (73)

When p_{11} < p_{01}, we similarly obtain

  J ≤ min{ K p_{01} / (1 − T(p_{11}) + p_{01}), N ω_o }.   (74)

APPENDIX
I: PROOF OF COROLLARY 2

The optimality of the myopic policy has been established for K = 1 and p_{11} ≥ p_{01} [10], [11] (note that for N = 2, the optimality of the myopic policy has also been established for negatively correlated channels). Based on the equivalency between Whittle's index policy and the myopic policy, we conclude that Whittle's index policy is optimal for K = 1 and p_{11} ≥ p_{01}.

We now prove that Whittle's index policy is optimal when K = N − 1. We construct a genie-aided system in which the user knows the states S_i(t) of all channels at the end of each slot t. In this system, Whittle's index policy is clearly optimal, and its performance is an upper bound on that of the original system. For the original system, in which the user only knows the states of the N − 1 sensed channels, we notice that the channel ordering under Whittle's index policy in each slot is the same as that in the genie-aided system. Whittle's index policy thus achieves the same performance as in the genie-aided system, and it is therefore optimal.

According to Theorem 5, we arrive at the following inequalities (notice that J_w ≥ K ω_o):

  η ≥ max{ 1 − p_{11} + ω_o, K/N },  if p_{11} ≥ p_{01};   η ≥ max{ (1 − T(p_{11}) + p_{01}) / (1 − p_{11} + p_{01}), K/N },  if p_{11} < p_{01}.   (75)

From (75), we have η ≥ K/N. Next, we show that Whittle's index policy achieves at least half the optimal performance for negatively correlated channels (p_{11} < p_{01}). In this case, we have

  η ≥ (1 − T(p_{11}) + p_{01}) / (1 − p_{11} + p_{01}) = 1 + (p_{11} − p_{01})(1 − p_{11}) / (1 − (p_{11} − p_{01})) ≥ 1 − (1 − p_{11})/2 ≥ 1/2.

REFERENCES

[1] A. Mahajan and D. Teneketzis, "Multi-armed Bandit Problems," in
Foundations and Applications of Sensor Management, A. O. Hero III, D. A. Castanon, D. Cochran, and K. Kastella (Eds.), Springer-Verlag, 2007.
[2] J. C. Gittins, "Bandit Processes and Dynamic Allocation Indices," Journal of the Royal Statistical Society, Vol. 41, No. 2, pp. 148-177, 1979.
[3] P. Whittle, "Restless Bandits: Activity Allocation in a Changing World," Journal of Applied Probability, Vol. 25, 1988.
[4] C. H. Papadimitriou and J. N. Tsitsiklis, "The Complexity of Optimal Queueing Network Control," Mathematics of Operations Research, Vol. 24, No. 2, pp. 293-305, May 1999.
[5] V. Anantharam, P. Varaiya, and J. Walrand, "Asymptotically Efficient Adaptive Allocation Rules for the Multi-armed Bandit Problem with Multiple Plays (Part I: I.I.D. Rewards. Part II: Markovian Rewards.)," IEEE Transactions on Automatic Control, Vol. AC-32, No. 11, pp. 968-982, Nov. 1987.
[6] D. G. Pandelis and D. Teneketzis, "On the Optimality of the Gittins Index Rule in Multi-armed Bandits with Multiple Plays," Mathematical Methods of Operations Research, Vol. 50, pp. 449-461, 1999.
[7] R. R. Weber and G. Weiss, "On an Index Policy for Restless Bandits," Journal of Applied Probability, Vol. 27, No. 3, pp. 637-648, Sep. 1990.
[8] P. S. Ansell, K. D. Glazebrook, J. E. Nino-Mora, and M. O'Keeffe, "Whittle's Index Policy for a Multi-class Queueing System with Convex Holding Costs," Mathematical Methods of Operations Research, Vol. 57, pp. 21-39, 2003.
[9] K. D. Glazebrook, D. Ruiz-Hernandez, and C. Kirkbride, "Some Indexable Families of Restless Bandit Problems," Advances in Applied Probability, Vol. 38, pp. 643-672, 2006.
[10] Q. Zhao, B. Krishnamachari, and K. Liu, "On Myopic Sensing for Multi-Channel Opportunistic Access: Structure, Optimality, and Performance," to appear in IEEE Transactions on Wireless Communications, Dec. 2008, available at http://arxiv.org/abs/0712.0035v3.
[11] S. H. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari, "Optimality of Myopic Sensing in Multi-Channel Opportunistic Access," submitted to IEEE Transactions on Information Theory, May 2008, available at http://arxiv.org/abs/0811.0637.
[12] E. N. Gilbert, "Capacity of Burst-Noise Channels," Bell System Technical Journal, Vol. 39, pp. 1253-1265, Sept. 1960.
[13] M. Zorzi, R. Rao, and L. Milstein, "Error Statistics in Data Transmission over Fading Channels," IEEE Transactions on Communications, Vol. 46, pp. 1468-1477, Nov. 1998.
[14] L. A. Johnston and V. Krishnamurthy, "Opportunistic File Transfer over a Fading Channel: A POMDP Search Theory Formulation with Optimal Threshold Policies," IEEE Transactions on Wireless Communications, Vol. 5, No. 2, 2006.
[15] Q. Zhao and B. Sadler, "A Survey of Dynamic Spectrum Access," IEEE Signal Processing Magazine, Vol. 24, No. 3, pp. 79-89, May 2007.
[16] J. Le Ny, M. Dahleh, and E. Feron, "Multi-UAV Dynamic Routing with Partial Observations using Restless Bandit Allocation Indices," Proceedings of the 2008 American Control Conference, Seattle, WA, June 2008.
[17] Q. Zhao, L. Tong, A. Swami, and Y. Chen, "Decentralized Cognitive MAC for Opportunistic Spectrum Access in Ad Hoc Networks: A POMDP Framework," IEEE Journal on Selected Areas in Communications: Special Issue on Adaptive, Spectrum Agile and Cognitive Wireless Networks, April 2007.
[18] Y. Chen, Q. Zhao, and A. Swami, "Joint Design and Separation Principle for Opportunistic Spectrum Access in the Presence of Sensing Errors," IEEE Transactions on Information Theory, Vol. 54, No. 5, pp. 2053-2071, May 2008.
[19] Q. Zhao and B. Krishnamachari, "Structure and Optimality of Myopic Policy in Opportunistic Access with Noisy Observations," submitted to IEEE Transactions on Automatic Control, Feb. 2008, available at http://arxiv.org/abs/0802.1379v2.
[20] C. Lott and D. Teneketzis, "On the Optimality of an Index Rule in Multi-Channel Allocation for Single-Hop Mobile Networks with Multiple Service Classes," Probability in the Engineering and Informational Sciences, Vol. 14, pp. 259-297, 2000.
[21] V. Raghunathan, V. Borkar, M. Cao, and P. R. Kumar, "Index Policies for Real-Time Multicast Scheduling for Wireless Broadcast Systems," IEEE INFOCOM, 2008.
[22] R. Kleinberg, A. Slivkins, and E. Upfal, "Multi-armed Bandit Problems in Metric Spaces," Proc. of the 40th ACM Symposium on Theory of Computing (STOC), 2008.
[23] J. E. Nino-Mora, "Restless Bandits, Partial Conservation Laws and Indexability," Advances in Applied Probability, Vol. 33, pp. 76-98, 2001.
[24] S. Guha and K. Munagala, "Approximation Algorithms for Partial-Information Based Stochastic Control with Markovian Rewards," Proc. 48th IEEE Symposium on Foundations of Computer Science (FOCS), 2007.
[25] S. Guha and K. Munagala, "Approximation Algorithms for Restless Bandit Problems," http://arxiv.org/abs/0711.3861.
[26] R. Smallwood and E. Sondik, "The Optimal Control of Partially Observable Markov Processes over a Finite Horizon," Operations Research, pp. 1071-1088, 1971.
[27] A. Arapostathis, V. S. Borkar, E. Fernandez-Gaucherand, M. K. Ghosh, and S. I. Marcus, "Discrete-Time Controlled Markov Processes with Average Cost Criterion: A Survey," SIAM Journal on Control and Optimization, Vol. 31, No. 2, pp. 282-344, 1993.
[28] R. G. Gallager, Discrete Stochastic Processes, Kluwer Academic Publishers, 1995.
[29] E. J. Sondik, "The Optimal Control of Partially Observable Markov Processes over the Infinite Horizon: Discounted Costs," in