Online Learning Algorithms for Stochastic Water-Filling
Yi Gai and Bhaskar Krishnamachari
Ming Hsieh Department of Electrical Engineering
University of Southern California, Los Angeles, CA 90089, USA
Email: {ygai, bkrishna}@usc.edu

Abstract—Water-filling is the term for the classic solution to the problem of allocating constrained power to a set of parallel channels to maximize the total data-rate. It is used widely in practice, for example, for power allocation to sub-carriers in multi-user OFDM systems such as WiMax. The classic water-filling algorithm is deterministic and requires perfect knowledge of the channel gain to noise ratios. In this paper we consider how to do power allocation over stochastically time-varying (i.i.d.) channels with unknown gain to noise ratio distributions. We adopt an online learning framework based on stochastic multi-armed bandits. We consider two variations of the problem, one in which the goal is to find a power allocation to maximize $\sum_i E[\log(1+\mathrm{SNR}_i)]$, and another in which the goal is to find a power allocation to maximize $\sum_i \log(1+E[\mathrm{SNR}_i])$. For the first problem, we propose a cognitive water-filling algorithm that we call CWF1. We show that CWF1 obtains a regret (defined as the cumulative gap over time between the sum-rate obtained by a distribution-aware genie and this policy) that grows polynomially in the number of channels and logarithmically in time, implying that it asymptotically achieves the optimal time-averaged rate that can be obtained when the gain distributions are known. For the second problem, we present an algorithm called CWF2, which is, to our knowledge, the first algorithm in the literature on stochastic multi-armed bandits to exploit non-linear dependencies between the arms. We prove that the number of times CWF2 picks the incorrect power allocation is bounded by a function that is polynomial in the number of channels and logarithmic in time, implying that its frequency of incorrect allocation tends to zero.

I. INTRODUCTION
A fundamental resource allocation problem that arises in many settings in communication networks is to allocate a constrained amount of power across many parallel channels in order to maximize the sum-rate. Assuming that the power-rate function for each channel is proportional to $\log(1+\mathrm{SNR})$ as per Shannon's capacity theorem for AWGN channels, it is well known that the optimal power allocation can be determined by a water-filling strategy [1]. The classic water-filling solution is a deterministic algorithm, and requires perfect knowledge of all channel gain to noise ratios.

In practice, however, channel gain-to-noise ratios are stochastic quantities. To handle this randomness, we consider an alternative approach, based on online learning, specifically stochastic multi-armed bandits. We formulate the problem of stochastic water-filling as follows: time is discretized into slots; each channel's gain-to-noise ratio is modeled as an i.i.d. random variable with an unknown distribution. In our general formulation, the power-to-rate function for each channel is allowed to be any sub-additive function¹. We seek a power allocation that maximizes the expected sum-rate (i.e., an optimization of the form $\max E[\sum_i \log(1+\mathrm{SNR}_i)]$). Even if the channel gain-to-noise ratios are random variables with known distributions, this turns out to be a hard combinatorial stochastic optimization problem. Our focus in this paper is thus on a more challenging case.

In the classical multi-armed bandit, there is a player playing $K$ arms that yield stochastic rewards with unknown means at each time in i.i.d. fashion over time. The player seeks a policy to maximize its total expected reward over time. The performance metric of interest in such problems is regret, defined as the cumulative difference in expected reward between a model-aware genie and that obtained by the given learning policy. It is of interest to show that the regret grows sub-linearly with time, so that the time-averaged regret asymptotically goes to zero, implying that the time-averaged reward of the model-aware genie is obtained asymptotically by the learning policy.

We show that it is possible to map the problem of stochastic water-filling to an MAB formulation by treating each possible power allocation as an arm (we consider discrete power levels in this paper; if there are $P$ possible power levels for each of $N$ channels, there would be $P^N$ total arms). We present a novel combinatorial policy for this problem that we call CWF1, which yields regret growing polynomially in $N$ and logarithmically over time. Despite the exponentially growing set of arms, CWF1 observes and maintains information for $P \cdot N$ variables, one corresponding to each power-level and channel, and exploits linear dependencies between the arms based on these variables.

Typically, the randomness in the channel gain to noise ratios is dealt with by first estimating the mean channel gain to noise ratios from a finite set of training observations, and then using the estimated gains in a deterministic water-filling procedure. Essentially, this approach tries to identify the power allocation that maximizes a pseudo-sum-rate, which is determined by applying the power-rate equation to the mean channel gain-to-noise ratios (i.e., an optimization of the form $\max \sum_i \log(1+E[\mathrm{SNR}_i])$). We also present a different stochastic water-filling algorithm that we call CWF2, which learns to do this in an online fashion. This algorithm observes and maintains information for $N$ variables, one corresponding to each channel, and exploits non-linear dependencies between the arms based on these variables.

¹A function $f$ is subadditive if $f(x+y) \le f(x) + f(y)$. Any concave function $g$ with $g(0) \ge 0$ (such as $\log(1+x)$) is subadditive: concavity gives $g(x) \ge \frac{y}{x+y}g(0) + \frac{x}{x+y}g(x+y) \ge \frac{x}{x+y}g(x+y)$, and symmetrically $g(y) \ge \frac{y}{x+y}g(x+y)$; summing the two yields $g(x)+g(y) \ge g(x+y)$.
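Since the paper repeatedly contrasts against this deterministic baseline, a minimal sketch of classic water-filling may help. It assumes perfectly known gain-to-noise ratios $g_i$ and the $\log(1+p_i g_i)$ rate model cited from [1]; the bisection search, tolerance, and function names are our own illustration, not from the paper.

```python
import numpy as np

def water_filling(g, p_total, tol=1e-9):
    """Classic deterministic water-filling [1].

    Maximizes sum_i log(1 + p_i * g[i]) subject to sum_i p_i <= p_total, p_i >= 0,
    given perfectly known gain-to-noise ratios g[i]. The optimum has the form
    p_i = max(mu - 1/g[i], 0) for a water level mu chosen (here by bisection)
    so that the power budget is met.
    """
    g = np.asarray(g, dtype=float)
    lo, hi = 0.0, p_total + 1.0 / g.min()   # at hi, the used power already exceeds p_total
    while hi - lo > tol:
        mu = (lo + hi) / 2
        used = np.maximum(mu - 1.0 / g, 0.0).sum()
        if used > p_total:
            hi = mu
        else:
            lo = mu
    return np.maximum(lo - 1.0 / g, 0.0)

# Example: stronger channels receive more power; the weakest may get none.
p = water_filling(g=[2.0, 1.0, 0.25], p_total=3.0)
print(p, p.sum())   # allocation sums to ~3.0
```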
To our knowledge, CWF2 is the first MAB algorithm to exploit non-linear dependencies between the arms. We show that the number of times CWF2 plays a non-optimal combination of powers is uniformly bounded by a function that is logarithmic in time. Under some restrictive conditions, CWF2 may also solve the first problem more efficiently.

II. RELATED WORK
The classic water-filling strategy is described in [1]. There are a few other stochastic variations of water-filling covered in the literature that are different in spirit from our formulation. When a fading distribution over the gains is known a priori, the power constraint is expressed over time, and the instantaneous gains are also known, then a deterministic joint frequency-time water-filling strategy can be used [2], [3]. In [4], a stochastic gradient approach based on Lagrange duality is proposed to solve this problem when the fading distribution is unknown but instantaneous gains are still available. By contrast, in our work we do not assume that the instantaneous gains are known, and focus on keeping the same power constraint at each time while considering unknown gain distributions.

Another work [5] considers water-filling over stochastic non-stationary fading channels, and proposes an adaptive learning algorithm that tracks the time-varying optimal power allocation by incorporating a forgetting factor. However, the focus of their algorithm is on minimizing the maximum mean squared error assuming imperfect channel estimates, and they prove only that their algorithm would converge in a stationary setting. Although their algorithm can be viewed as a learning mechanism, they do not treat stochastic water-filling from the perspective of multi-armed bandits, which is a novel contribution of our work. In our work, we focus on a stationary setting with perfect channel estimates, but prove stronger results, showing that our learning algorithm not only converges to the optimal allocation, it does so with sub-linear regret.

There has been a long line of work on stochastic multi-armed bandits involving playing arms that yield stochastically time-varying rewards with unknown distributions. Several authors [6]–[9] present learning policies that yield regret growing logarithmically over time (asymptotically, in the case of [6]–[8], and uniformly over time in the case of [9]). Our algorithms build on the UCB1 algorithm proposed in [9], but make significant modifications to handle the combinatorial nature of the arms in this problem. CWF1 has some commonalities with the LLR algorithm we recently developed for a completely different problem, that of stochastic combinatorial bipartite matching for channel allocation [10], but is modified to account for the non-linear power-rate function in this paper. Other recent work on stochastic MAB has considered decentralized settings [11]–[14] and non-i.i.d. reward processes [15]–[19]. With respect to this literature, the problem setting for stochastic water-filling is novel in that it involves a non-linear function of the action and the unknown variables. In particular, as far as we are aware, our CWF2 policy is the first to exploit the non-linear dependencies between arms to provably improve the regret performance.

III. PROBLEM FORMULATION
We define the stochastic version of the classic communication theory problem of power allocation for maximizing rate over parallel channels (water-filling) as follows.

We consider a system with $N$ channels, where the channel gain-to-noise ratios are unknown random processes $X_i(n)$, $1 \le i \le N$. Time is slotted and indexed by $n$. We assume that $X_i(n)$ evolves as an i.i.d. random process over time (i.e., we consider block fading), with the only restriction that its distribution has a finite support. Without loss of generality, we normalize $X_i(n) \in [0, 1]$. We do not require that $X_i(n)$ be independent across $i$. This random process is assumed to have a mean $\theta_i = E[X_i]$ that is unknown to the users. We denote the set of all these means by $\Theta = \{\theta_i\}$.

At each decision period $n$ (also referred to interchangeably as a time slot), an $N$-dimensional action vector $a(n)$, representing a power allocation on these $N$ channels, is selected under a policy $\pi(n)$. We assume that the power levels are discrete, and we can put any constraint on the selection of power allocations such that they come from a finite set $\mathcal{F}$ (e.g., a maximum total power constraint, or an upper bound on the maximum allowed power per subcarrier). We assume $a_i(n) \ge 0$ for all $1 \le i \le N$. When a particular power allocation $a(n)$ is selected, the channel gain-to-noise ratios corresponding to the nonzero components of $a(n)$ are revealed, i.e., the value of $X_i(n)$ is observed for all $i$ such that $a_i(n) \neq 0$. We denote by $\mathcal{A}_a = \{i : a_i \neq 0, 1 \le i \le N\}$ the index set of all nonzero components of an allocation $a$.

We adopt a general formulation for water-filling, where the sum rate obtained at time $n$ by allocating a set of powers $a(n)$ is defined as
$$R_{a(n)}(n) = \sum_{i \in \mathcal{A}_{a(n)}} f_i(a_i(n), X_i(n)), \qquad (1)$$
where for all $i$, $f_i(a_i(n), X_i(n))$ is a nonlinear, continuous, increasing, sub-additive function in $X_i(n)$, and $f_i(a_i(n), 0) = 0$ for any $a_i(n)$. We assume $f_i$ is defined on $\mathbb{R}^+ \times \mathbb{R}^+$.

Our formulation is general enough to include as a special case the rate function obtained from Shannon's capacity theorem for AWGN, which is widely used in communication networks:
$$R_{a(n)}(n) = \sum_{i=1}^{N} \log(1 + a_i(n) X_i(n)).$$
We refer to rate and reward interchangeably in this paper.
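To make the distinction between the realized and the expected sum-rate concrete, here is a small hedged sketch that evaluates reward (1) in its Shannon form for one candidate allocation; the gain distribution, allocation, and means below are illustrative stand-ins, not the paper's simulation parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sum_rate(a, x):
    """Reward (1) with the Shannon form f_i(a_i, x_i) = log(1 + a_i * x_i)."""
    return np.log1p(np.asarray(a, float) * np.asarray(x, float)).sum()

a = [20.0, 20.0, 10.0, 10.0]                  # one candidate allocation
mean_x = np.array([0.45, 0.35, 0.25, 0.15])   # E[X_i] (illustrative)
draw = lambda: rng.uniform(0, 2 * mean_x)     # stand-in gain model, support in [0, 1]

one_slot = sum_rate(a, draw())                            # realized reward in a slot
expected = np.mean([sum_rate(a, draw()) for _ in range(100_000)])  # Monte Carlo E[.]
pseudo = sum_rate(a, mean_x)                  # pseudo-sum-rate at the mean gains
print(one_slot, expected, pseudo)             # expected <= pseudo, by Jensen's inequality
```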
In the typical formulation there is a total power constraint and individual power constraints; the corresponding constraint set is
$$\mathcal{F} = \left\{ a : \sum_{i=1}^{N} a_i \le P_{total} \;\wedge\; 0 \le a_i \le P_i, \forall i \right\},$$
where $P_{total}$ is the total power constraint and $P_i$ is the maximum allowed power per channel.

Our goal is to maximize the expected sum-rate when the distributions of all $X_i$ are unknown, as shown in (2). We refer to this objective as $O_1$:
$$\max_{a \in \mathcal{F}} E\left[\sum_{i \in \mathcal{A}_a} f_i(a_i, X_i)\right]. \qquad (2)$$
Note that even when the $X_i$ have known distributions, this is a hard combinatorial non-linear stochastic optimization problem. In our setting, with unknown distributions, we can formulate it as a multi-armed bandit problem, where each power allocation $a \in \mathcal{F}$ is an arm and the reward function is in a combinatorial non-linear form. The optimal arms are the ones with the largest expected reward, denoted $O_1^* = \{a^*\}$. For the rest of the paper, we use $*$ as the index indicating that a parameter is for an optimal arm. If more than one optimal arm exists, $*$ refers to any one of them.

We note that for the combinatorial multi-armed bandit problem with linear rewards, where the reward function is defined by $R_{a(n)}(n) = \sum_{i \in \mathcal{A}_{a(n)}} a_i(n) X_i(n)$, $a^*$ is a solution to a deterministic optimization problem, because $\max_{a \in \mathcal{F}} E[\sum_{i \in \mathcal{A}_a} a_i X_i] = \max_{a \in \mathcal{F}} \sum_{i \in \mathcal{A}_a} a_i E[X_i]$. Different from the linear-rewards case, $a^*$ here is a solution to a stochastic optimization problem, i.e.,
$$a^* \in O_1^* = \left\{ \tilde{a} : \tilde{a} = \arg\max_{a \in \mathcal{F}} E\left[\sum_{i \in \mathcal{A}_a} f_i(a_i, X_i)\right] \right\}. \qquad (3)$$

We evaluate policies for $O_1$ with respect to regret, defined as the difference between the expected reward that could be obtained by a genie that picks an optimal arm at each time, and that obtained by the given policy. Note that minimizing the regret is equivalent to maximizing the expected reward. Regret can be expressed as
$$R^{\pi}(n) = n R^* - E\left[\sum_{t=1}^{n} R^{\pi(t)}(t)\right], \qquad (4)$$
where $R^* = \max_{a \in \mathcal{F}} E[\sum_{i \in \mathcal{A}_a} f_i(a_i, X_i)]$ is the expected reward of an optimal arm.

Intuitively, we would like the regret $R^{\pi}(n)$ to be as small as possible. If it is sub-linear with respect to time $n$, the time-averaged regret will tend to zero and the maximum possible time-averaged reward can be achieved. Note that the number of arms $|\mathcal{F}|$ can be exponential in the number of unknown random variables $N$.
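The warning that $|\mathcal{F}|$ can be exponential in $N$ is easy to see by enumeration. A minimal sketch follows, assuming illustrative per-channel level sets and a 60 mW budget of the kind used later in Section VI; the specific values are ours, not the paper's.

```python
from itertools import product

def enumerate_arms(levels, p_total):
    """All feasible allocations a in F: a_i from its per-channel level set,
    subject to the total power constraint sum_i a_i <= P_total."""
    return [a for a in product(*levels) if sum(a) <= p_total]

# Illustrative level sets (mW): the budget prunes F, but its size still
# scales like P^N in the number of channels N.
levels = [(0, 10, 20, 30)] * 4
print(len(enumerate_arms(levels, p_total=60)))   # 150 of the 4**4 = 256 combinations
```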
We also note that for the stochastic version of the water-filling problem, a typical way in practice to deal with the unknown randomness is to first estimate the mean channel gain to noise ratios, and then find the optimized allocation based on the mean values. This approach tries to identify the power allocation that maximizes the power-rate equation applied to the mean channel gain-to-noise ratios. We refer to maximizing this as the sum-pseudo-rate over averaged channels, and denote this objective by $O_2$, as shown in (5):
$$\max_{a \in \mathcal{F}} \sum_{i \in \mathcal{A}_a} f_i(a_i, E[X_i]). \qquad (5)$$

We would also like to develop an online learning policy for $O_2$. Note that the optimal arm $a^*$ of $O_2$ is a solution to a deterministic optimization problem. So, we evaluate policies for $O_2$ with respect to the expected total number of times that a non-optimal power allocation is selected. We denote by $T_a(n)$ the number of times that a power allocation $a$ is picked up to time $n$, and denote $r_a = \sum_{i \in \mathcal{A}_a} f_i(a_i, E[X_i])$. Let $T^{\pi}_{non}(n)$ denote the total number of times that a policy $\pi$ selects a power allocation with $r_a < r_{a^*}$. Denote by $\mathbb{1}^{\pi}_t(a)$ the indicator function which equals $1$ if $a$ is selected under policy $\pi$ at time $t$, and $0$ else. Then
$$E[T^{\pi}_{non}(n)] = n - E\left[\sum_{t=1}^{n} \mathbb{1}^{\pi}_t(a^*)\right] = \sum_{a: r_a < r_{a^*}} E[T_a(n)]. \qquad (6)$$

IV. ONLINE LEARNING FOR MAXIMIZING THE SUM-RATE

We first present in this section an online learning policy for stochastic water-filling under objective $O_1$.

A. Policy Design

A straightforward, naive way to solve this problem is to use the UCB1 policy proposed in [9]. For UCB1, each power allocation is treated as an arm, and the arm that maximizes $\hat{Y}_k + \sqrt{\frac{2 \ln n}{m_k}}$ is selected at each time slot, where $\hat{Y}_k$ is the mean observed reward on arm $k$, and $m_k$ is the number of times that arm $k$ has been played. This approach essentially ignores the underlying dependencies across the different arms; it requires storage that is linear in the number of arms and yields regret growing linearly with the number of arms. Since there can be an exponential number of arms, the UCB1 algorithm performs poorly on this problem.

We note that for combinatorial optimization problems with linear reward functions, an online learning algorithm, LLR, has been proposed in [6] as an efficient solution. LLR stores the mean of the observed values for every underlying unknown random variable, as well as the number of times each has been observed. So the storage of LLR is linear in the number of unknown random variables, and the analysis in [6] shows LLR achieves a regret that grows logarithmically in time and polynomially in the number of unknown parameters.

However, the challenge with stochastic water-filling under objective $O_1$, where the expectation is outside the non-linear reward function, is that directly storing the mean observations of $X_i$ will not work. To deal with this challenge, we propose to store the information for each $(a_i, X_i)$ combination: $\forall 1 \le i \le N$, $\forall a_i$, we define a new set of random variables $Y_{i,a_i} = f_i(a_i, X_i)$. The number of random variables $Y_{i,a_i}$ is $\sum_{i=1}^{N} |\mathcal{B}_i|$, where $\mathcal{B}_i = \{a_i : a_i \neq 0\}$. Note that $\sum_{i=1}^{N} |\mathcal{B}_i| \le PN$. Then the reward function can be expressed as
$$R_a = \sum_{i \in \mathcal{A}_a} Y_{i,a_i}. \qquad (7)$$
Note that (7) is in a combinatorial linear form.

For this redefined MAB problem with $\sum_{i=1}^{N} |\mathcal{B}_i|$ unknown random variables and the linear reward function (7), we propose the following online learning policy, CWF1, for stochastic water-filling, as shown in Algorithm 1.

Algorithm 1 Online Learning for Stochastic Water-Filling: CWF1
1: // INITIALIZATION
2: If $\max_a |\mathcal{A}_a|$ is known, let $L = \max_a |\mathcal{A}_a|$; else, $L = N$;
3: for $n = 1$ to $N$ do
4:   Play any arm $a$ such that $n \in \mathcal{A}_a$;
5:   $\forall i \in \mathcal{A}_a$, $\forall a_i \in \mathcal{B}_i$: $Y_{i,a_i} := \frac{Y_{i,a_i} m_i + f_i(a_i, X_i)}{m_i + 1}$;
6:   $\forall i \in \mathcal{A}_a$: $m_i := m_i + 1$;
7: end for
8: // MAIN LOOP
9: while 1 do
10:   $n := n + 1$;
11:   Play an arm $a$ which solves the maximization problem
$$\max_{a \in \mathcal{F}} \sum_{i \in \mathcal{A}_a} \left( Y_{i,a_i} + \sqrt{\frac{(L+1)\ln n}{m_i}} \right); \qquad (8)$$
12:   $\forall i \in \mathcal{A}_a$, $\forall a_i \in \mathcal{B}_i$: $Y_{i,a_i} := \frac{Y_{i,a_i} m_i + f_i(a_i, X_i)}{m_i + 1}$;
13:   $\forall i \in \mathcal{A}_a$: $m_i := m_i + 1$;
14: end while

To get a tighter regret bound, and differently from the LLR algorithm, instead of storing the number of times each unknown random variable $Y_{i,a_i}$ has been observed, we use a $1 \times N$ vector, denoted $(m_i)_{1 \times N}$, to store the number of times that $X_i$ has been observed up to the current time slot. We use a $1 \times \sum_{i=1}^{N} |\mathcal{B}_i|$ vector, denoted $(Y_{i,a_i})_{1 \times \sum_{i=1}^{N} |\mathcal{B}_i|}$, to store the information based on the observed values; it is updated as shown in line 12. Each time an arm $a(n)$ is played, the value of $X_i$ is observed for every $i \in \mathcal{A}_{a(n)}$. For every observed value of $X_i$, $|\mathcal{B}_i|$ values are updated: $\forall a_i \in \mathcal{B}_i$, the average $Y_{i,a_i}$ of all the values of $Y_{i,a_i}$ up to the current time slot is refreshed. The CWF1 policy thus requires storage linear in $\sum_{i=1}^{N} |\mathcal{B}_i|$.
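A simulation-oriented sketch of Algorithm 1 follows. It is our reading of CWF1 under illustrative assumptions: Shannon-form $f_i$, bounded uniform gains, and brute-force search over $\mathcal{F}$ for the index maximization in (8); none of these specifics are fixed by the paper.

```python
import math
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

N, L = 4, 4
levels = [(0, 10, 20)] * N                    # 0 plus the nonzero level set B_i, per channel
arms = [a for a in product(*levels) if sum(a) <= 40]
f = lambda a_i, x: math.log1p(a_i * x)        # f_i(a_i, x) = log(1 + a_i * x)
theta = np.array([0.45, 0.35, 0.25, 0.15])    # unknown means E[X_i] (simulator only)
draw = lambda: rng.uniform(0, 2 * theta)      # X_i(n) in [0, 1] with mean theta_i

m = np.zeros(N)                               # m_i: number of observations of channel i
Y = {(i, a_i): 0.0 for i in range(N) for a_i in levels[i] if a_i != 0}

def update(a, x):
    """After playing arm a, observe X_i for i in A_a and refresh every Y_{i,a_i}."""
    for i in range(N):
        if a[i] != 0:
            for a_i in levels[i][1:]:         # all nonzero power levels of channel i
                Y[(i, a_i)] = (Y[(i, a_i)] * m[i] + f(a_i, x[i])) / (m[i] + 1)
            m[i] += 1

for i in range(N):                            # initialization: observe each channel once
    update(tuple(10 if j == i else 0 for j in range(N)), draw())

for n in range(N + 1, 5000):                  # main loop: play the index-maximizing arm (8)
    pad = np.sqrt((L + 1) * math.log(n) / m)
    a = max(arms, key=lambda a: sum(Y[(i, a[i])] + pad[i]
                                    for i in range(N) if a[i] != 0))
    update(a, draw())
```

In practice, the $\arg\max$ over $\mathcal{F}$ would use whatever combinatorial solver the constraint structure admits; exhaustive search is only viable for small arm sets like this one.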
B. Analysis of Regret

Theorem 1: The expected regret under the CWF1 policy is at most
$$\left[ \frac{4 a_{\max}^2 L^2 (L+1) N \ln n}{(\Delta_{\min})^2} + N + \frac{\pi^2}{3} L N \right] \Delta_{\max}, \qquad (9)$$
where $a_{\max} = \max_{a \in \mathcal{F}} \max_i a_i$, $\Delta_{\min} = \min_{a \neq a^*} (R^* - E[R_a])$, and $\Delta_{\max} = \max_{a \neq a^*} (R^* - E[R_a])$. Note that $L \le N$. The proof of Theorem 1 is omitted.

Remark 1: For the CWF1 policy, although there are $\sum_{i=1}^{N} |\mathcal{B}_i|$ random variables, the upper bound on the regret remains $O(N^4 \log n)$, which is the same as for LLR, as shown by Theorem 2 in [6]. Directly applying the LLR algorithm to the redefined MAB problem in (7) would result in a regret that grows as $O(PN^4 \log n)$.

Remark 2: Algorithm 1 works even for rate functions that do not satisfy subadditivity.

Remark 3: We can develop similar policies and results when the $X_i$ are Markovian rewards, as in [19] and [20].

V. ONLINE LEARNING FOR SUM-PSEUDO-RATE

We now show our novel online learning algorithm CWF2 for stochastic water-filling with objective $O_2$. Unlike CWF1, CWF2 exploits non-linear dependencies between the choices of power allocations and requires lower storage. Under conditions where the power allocation that maximizes $O_2$ also maximizes $O_1$, we will see through simulations that CWF2 has better regret performance.

A. Policy Design

Our proposed policy CWF2 for stochastic water-filling with objective $O_2$ is shown in Algorithm 2.

Algorithm 2 Online Learning for Stochastic Water-Filling: CWF2
1: // INITIALIZATION
2: If $\max_a |\mathcal{A}_a|$ is known, let $L = \max_a |\mathcal{A}_a|$; else, $L = N$;
3: for $n = 1$ to $N$ do
4:   Play any arm $a$ such that $n \in \mathcal{A}_a$;
5:   $\forall i \in \mathcal{A}_a$: $\bar{X}_i := \frac{\bar{X}_i m_i + X_i}{m_i + 1}$, $m_i := m_i + 1$;
6: end for
7: // MAIN LOOP
8: while 1 do
9:   $n := n + 1$;
10:   Play an arm $a$ which solves the maximization problem
$$\max_{a \in \mathcal{F}} \sum_{i \in \mathcal{A}_a} \left[ f_i(a_i, \bar{X}_i) + f_i\!\left(a_i, \sqrt{\frac{(L+1)\ln n}{m_i}}\right) \right]; \qquad (10)$$
11:   $\forall i \in \mathcal{A}_{a(n)}$: $\bar{X}_i := \frac{\bar{X}_i m_i + X_i}{m_i + 1}$, $m_i := m_i + 1$;
12: end while

We use two $1 \times N$ vectors to store the information after we play an arm at each time slot. One is $(\bar{X}_i)_{1 \times N}$, in which $\bar{X}_i$ is the average (sample mean) of all the observed values of $X_i$ up to the current time slot (obtained through potentially different sets of arms over time). The other is $(m_i)_{1 \times N}$, in which $m_i$ is the number of times that $X_i$ has been observed up to the current time slot. So the CWF2 policy requires storage linear in $N$.
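For comparison, here is the same kind of hedged sketch for Algorithm 2, under the illustrative setup used in the CWF1 sketch above. Note how the index (10) applies $f_i$ separately to the sample mean $\bar{X}_i$ and to the confidence padding, which is where CWF2 exploits the non-linear structure of the arms.

```python
import math
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

N, L = 4, 4
levels = [(0, 10, 20)] * N
arms = [a for a in product(*levels) if sum(a) <= 40]
f = lambda a_i, x: math.log1p(a_i * x)        # f_i(a_i, x) = log(1 + a_i * x)
theta = np.array([0.45, 0.35, 0.25, 0.15])    # unknown means E[X_i] (simulator only)
draw = lambda: rng.uniform(0, 2 * theta)      # X_i(n) in [0, 1] with mean theta_i

m = np.zeros(N)                               # m_i: observations of channel i
xbar = np.zeros(N)                            # sample means of X_i

def update(a, x):
    """After playing arm a, refresh xbar_i and m_i for all observed channels."""
    for i in range(N):
        if a[i] != 0:
            xbar[i] = (xbar[i] * m[i] + x[i]) / (m[i] + 1)
            m[i] += 1

for i in range(N):                            # initialization: observe each channel once
    update(tuple(10 if j == i else 0 for j in range(N)), draw())

for n in range(N + 1, 5000):                  # main loop: maximize index (10)
    pad = np.sqrt((L + 1) * math.log(n) / m)
    a = max(arms, key=lambda a: sum(f(a[i], xbar[i]) + f(a[i], pad[i])
                                    for i in range(N) if a[i] != 0))
    update(a, draw())
```

Only $2N$ scalars of state are kept, versus the $\sum_i |\mathcal{B}_i|$ averages maintained by the CWF1 sketch.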
B. Analysis of Regret

For the analysis of the upper bound on $E[T^{\pi}_{non}(n)]$ under the CWF2 policy, we use the inequalities of the Chernoff-Hoeffding bound:

Lemma 1 (Chernoff-Hoeffding bound): Let $X_1, \ldots, X_n$ be random variables with range $[0,1]$ such that $E[X_t \mid X_1, \ldots, X_{t-1}] = \mu$, $\forall 1 \le t \le n$. Denote $S_n = \sum_i X_i$. Then for all $a \ge 0$,
$$P\{S_n \ge n\mu + a\} \le e^{-2a^2/n}, \qquad P\{S_n \le n\mu - a\} \le e^{-2a^2/n}. \qquad (11)$$

Theorem 2: Under the CWF2 policy, the expected total number of times that non-optimal power allocations are selected is at most
$$E[T^{\pi}_{non}(n)] \le \frac{N(L+1)\ln n}{B_{\min}} + N + \frac{\pi^2}{3} L N, \qquad (12)$$
where $B_{\min}$ is a constant determined by $\delta_{\min}$ and $L$, and $\delta_{\min} = \min_{a: r_a < r_{a^*}} (r_{a^*} - r_a)$.

We show this upper bound in three steps: (1) introduce a counter $\tilde{T}_i(n)$ (defined below) and show its relationship to the upper bound; (2) show an upper bound on $E[\tilde{T}_i(n)]$; (3) show the upper bound on $E[T^{\pi}_{non}(n)]$.

(1) The counter $\tilde{T}_i(n)$: After the initialization period, $(\tilde{T}_i(n))_{1 \times N}$ is introduced as a counter and is updated in the following way: at any time $n$ when a non-optimal power allocation is selected, find $i$ such that $i = \arg\min_{j \in \mathcal{A}_{a(n)}} m_j$. If there is only one such index, $\tilde{T}_i(n)$ is increased by $1$; if there are multiple, we arbitrarily pick one, say $i'$, and increment $\tilde{T}_{i'}$ by $1$. By this definition, each time a non-optimal power allocation is selected, exactly one element of $(\tilde{T}_i(n))_{1 \times N}$ is incremented by $1$. So the sum of all counters equals the total number of times that non-optimal power allocations have been selected:
$$\sum_{a: r_a < r_{a^*}} T_a(n) = \sum_{i=1}^{N} \tilde{T}_i(n).$$

(2) For the second step, it can be shown that
$$E[\tilde{T}_i(n)] \le \left\lceil \frac{(L+1)\ln n}{B_{\min}} \right\rceil + \sum_{t=2}^{\infty} \sum_{m_{h_1}=1}^{t-1} \cdots \sum_{m_{h_{|\mathcal{A}_{a^*}|}}=1}^{t-1} \sum_{m_{p_1}=l}^{t-1} \cdots \sum_{m_{p_{|\mathcal{A}_{a(t)}|}}=l}^{t-1} 2L\, t^{-2(L+1)} \le \frac{(L+1)\ln n}{B_{\min}} + 1 + 2L \sum_{t=1}^{\infty} t^{-2} \le \frac{(L+1)\ln n}{B_{\min}} + 1 + \frac{\pi^2}{3} L. \qquad (28)$$

(3) Summing (28) over the $N$ counters and accounting for the $N$ initialization plays yields (12). The remaining details of the proof are omitted.

Remark 4: CWF2 can be used to solve stochastic water-filling with objective $O_1$ as well, if $\exists a^* \in O_1^*$ such that $\forall a \notin O_1^*$,
$$\sum_{i \in \mathcal{A}_{a^*}} f_i(a^*_i, \theta_i) > \sum_{j \in \mathcal{A}_a} f_j(a_j, \theta_j). \qquad (30)$$
Then the regret of CWF2 is at most
$$R_{CWF2}(n) \le \left[ \frac{N(L+1)\ln n}{B_{\min}} + N + \frac{\pi^2}{3} L N \right] \Delta_{\max}. \qquad (31)$$

VI. APPLICATIONS AND NUMERICAL SIMULATION RESULTS

A. Numerical Results for CWF1

We now show the numerical results for the CWF1 policy. We consider an OFDM system with 4 subcarriers. We assume the bandwidth of the system is … MHz, and the noise density is −… dBW/Hz. We assume Rayleigh fading with parameters $\sigma = (2, \ldots, \ldots, \ldots)$ for the 4 subcarriers. We consider the following objective for our simulation:
$$\max \; E\left[\sum_{i=1}^{N} \log(1 + a_i(n) X_i(n))\right] \qquad (32)$$
$$\text{s.t.} \quad \sum_{i=1}^{N} a_i(n) \le P_{total}, \;\forall n \qquad (33)$$
$$a_1(n) \in \{\cdot, \cdot, \cdot, \cdot\}, \;\forall n \qquad (34)$$
$$a_2(n) \in \{\cdot, \cdot, \cdot, \cdot\}, \;\forall n \qquad (35)$$
$$a_3(n) \in \{\cdot, \cdot, \cdot, \cdot, \cdot\}, \;\forall n \qquad (36)$$
$$a_4(n) \in \{\cdot, \cdot, \cdot\}, \;\forall n \qquad (37)$$
where $P_{total} = 60$ mW ($\approx 17.8$ dBm). The unit for the power constraints (34) to (37) is mW. Note that (33) to (37) define the constraint set $\mathcal{F}$.

For this scenario, there are … different choices of power allocations, and the optimal power allocation can be calculated as $(20, \cdot, \cdot, \cdot)$.

We compare the performance of our proposed CWF1 policy with the UCB1 policy and the LLR policy, as shown in Figure 1. As we can see from Figure 1, naively applying the UCB1 and LLR policies results in worse performance than CWF1, since the UCB1 policy cannot exploit the underlying dependencies across arms, and the LLR policy does not utilize the observations as efficiently as CWF1 does.

Fig. 1. Normalized regret $R(n)/\log n$ vs. $n$ time slots, for UCB1, LLR, and CWF1.

B. Numerical Results for CWF2

We show the simulation results of CWF2 using the same system as in VI-A. We consider the following objective for our simulation:
$$\max \sum_{i=1}^{N} \log(1 + a_i(n) E[X_i(n)]) \quad \text{s.t. } a \in \mathcal{F}, \qquad (38)$$
where $\mathcal{F}$ is the same as in VI-A. For this scenario, we assume Rayleigh fading with parameters $\sigma = (1.\ldots, \ldots, \ldots, \ldots)$ for the subcarriers, and the optimal power allocation can be calculated as $(20, \cdot, \cdot, \cdot)$.

Figure 2 shows the simulation results for the total number of times that non-optimal power allocations are chosen by running CWF2 for up to 30 million time slots, together with the theoretical upper bound. In this case, we see that the theoretical upper bound is quite loose, and the algorithm does much better in practice.

Fig. 2. Numerical results of $E[\tilde{T}_i(n)]/\log n$ and the theoretical upper bound, for CWF2.
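The optimal allocations quoted in this section can be reproduced in principle by brute-force search over $\mathcal{F}$. Here is a minimal sketch for the deterministic objective $O_2$ in (38), with illustrative level sets and means standing in for the paper's exact simulation parameters.

```python
import math
from itertools import product

def best_arm_O2(levels, p_total, theta):
    """Exhaustively search F for the allocation maximizing
    sum_i log(1 + a_i * E[X_i]) -- the deterministic genie for objective O2."""
    feasible = (a for a in product(*levels) if sum(a) <= p_total)
    return max(feasible,
               key=lambda a: sum(math.log1p(a_i * t) for a_i, t in zip(a, theta)))

levels = [(0, 10, 20), (0, 10, 20), (0, 10, 20, 30), (0, 10, 20)]   # illustrative, mW
print(best_arm_O2(levels, p_total=60, theta=[0.45, 0.35, 0.25, 0.15]))
```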
For this setting, we note that (30) is satisfied, since the allocation $(20, \cdot, \cdot, \cdot)$ also maximizes (32). So, as stated in Remark 4, CWF2 can also be used to solve stochastic water-filling with objective $O_1$, with regret that grows logarithmically in time and polynomially in the number of channels.

We show a comparison of the UCB1, LLR, CWF1 and CWF2 policies under this setting in Figure 3. We can see that CWF2 performs the best by far, since it incorporates a way to exploit non-linear dependencies across arms, and learns more efficiently.

Fig. 3. Normalized regret $R(n)/\log n$ vs. $n$ time slots, for UCB1, LLR, CWF1, and CWF2.

VII. CONCLUSION

In this work we have considered the problem of optimal power allocation over parallel channels with stochastically time-varying gain-to-noise ratios for maximizing information rate (stochastic water-filling). We approached this problem from the novel perspective of online learning. The crux of our approach is to map each possible power allocation into an arm in a stochastic multi-armed bandit problem. The significant new challenge imposed here is that the reward obtained is a non-linear function of the arm choice and the underlying unknown random variables. To our knowledge, there is no prior work on stochastic MAB that explicitly treats such a problem.

We first considered the problem of maximizing the expected sum-rate. For this problem we developed the CWF1 algorithm. Despite the fact that the number of arms grows exponentially in the number of channels, we showed that the CWF1 algorithm requires only polynomial storage and also yields a regret that is polynomial in the number of power levels per channel and the number of channels, and logarithmic in time.

We then considered the problem of maximizing the sum-pseudo-rate, where the pseudo-rate for a stochastic channel is defined by applying the power-rate equation to its mean SNR ($\log(1 + E[\mathrm{SNR}])$). The justification for considering this problem is its connection to practice, where allocations over stochastic channels are made based on estimated mean channel conditions. Albeit sub-optimal with respect to maximizing the expected sum-rate, the use of the sum-pseudo-rate as the objective function is a more tractable approach. For this problem, we developed a new MAB algorithm that we call CWF2. This is the first algorithm in the literature on stochastic MAB that exploits non-linear dependencies between the arm rewards. We have proved that the number of times this policy uses a non-optimal power allocation is also bounded by a function that is polynomial in the number of channels and power levels, and logarithmic in time.

Our simulation results show that the algorithms we develop are indeed better than a naive application of classic MAB solutions. We also see that under settings where the power allocation maximizing the sum-pseudo-rate matches the optimal power allocation maximizing the expected sum-rate, CWF2 has significantly better regret performance than CWF1.

Because our formulations allow for very general classes of sub-additive reward functions, we believe that our technique may be much more broadly applicable to settings other than power allocation for stochastic channels. We would therefore like to identify and explore such applications in future work.

REFERENCES

[1] T. Cover and J. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[2] A. J. Goldsmith and P. P. Varaiya, "Capacity of Fading MIMO Channels with Channel Estimation Error," IEEE International Conference on Communications (ICC), June 2004.
[3] A. J. Goldsmith, Wireless Communications. New York: Cambridge University Press, 2005.
[4] X. Wang, D. Wang, H. Zhuang, and S. D. Morgera, "Energy-Efficient Resource Allocation in Wireless Sensor Networks over Fading TDMA," IEEE Journal on Selected Areas in Communications, vol. 28, no. 7, pp. 1063-1072, 2010.
[5] I. Zaidi and V. Krishnamurthy, "Stochastic Adaptive Multilevel Waterfilling in MIMO-OFDM WLANs," 39th Asilomar Conference on Signals, Systems and Computers, October 2005.
[6] Y. Gai, B. Krishnamachari, and R. Jain, "Combinatorial Network Optimization with Unknown Variables: Multi-armed Bandits with Linear Rewards," arXiv:1011.4748.
[7] V. Anantharam, P. Varaiya, and J. Walrand, "Asymptotically Efficient Allocation Rules for the Multiarmed Bandit Problem with Multiple Plays-Part I: IID Rewards," IEEE Transactions on Automatic Control, vol. 32, no. 11, pp. 968-976, 1987.
[8] R. Agrawal, "Sample Mean Based Index Policies with O(log n) Regret for the Multi-Armed Bandit Problem," Advances in Applied Probability, vol. 27, no. 4, pp. 1054-1078, 1995.
[9] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem," Machine Learning, vol. 47, no. 2, pp. 235-256, 2002.
[10] Y. Gai, B. Krishnamachari, and R. Jain, "Learning Multiuser Channel Allocations in Cognitive Radio Networks: A Combinatorial Multi-armed Bandit Formulation," IEEE International Dynamic Spectrum Access Networks (DySPAN) Symposium, Singapore, April 2010.
[11] A. Anandkumar, N. Michael, and A. K. Tang, "Opportunistic Spectrum Access with Multiple Users: Learning under Competition," IEEE International Conference on Computer Communications (INFOCOM), March 2010.
[12] A. Anandkumar, N. Michael, A. Tang, and A. Swami, "Distributed Learning and Allocation of Cognitive Users with Logarithmic Regret," IEEE JSAC on Advances in Cognitive Radio Networking and Communications, vol. 29, no. 4, pp. 731-745, 2011.
[13] K. Liu and Q. Zhao, "Distributed Learning in Cognitive Radio Networks: Multi-Armed Bandit with Distributed Multiple Players," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2010.
[14] Y. Gai and B. Krishnamachari, "Decentralized Online Learning Algorithms for Opportunistic Spectrum Access," to appear in the IEEE Global Communications Conference (GLOBECOM), December 2011. arXiv:1104.0111.
[15] C. Tekin and M. Liu, "Online Algorithms for the Multi-Armed Bandit Problem with Markovian Rewards," 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), September 2010.
[16] C. Tekin and M. Liu, "Online Learning in Opportunistic Spectrum Access: A Restless Bandit Approach," IEEE International Conference on Computer Communications (INFOCOM), April 2011.
[17] H. Liu, K. Liu, and Q. Zhao, "Learning and Sharing in a Changing World: Non-Bayesian Restless Bandit with Multiple Players," Information Theory and Applications Workshop (ITA), January 2011.
[18] W. Dai, Y. Gai, B. Krishnamachari, and Q. Zhao, "The Non-Bayesian Restless Multi-Armed Bandit: A Case of Near-Logarithmic Regret," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2011.
[19] Y. Gai, B. Krishnamachari, and M. Liu, "On the Combinatorial Multi-Armed Bandit Problem with Markovian Rewards," to appear in the IEEE Global Communications Conference (GLOBECOM), December 2011. arXiv:1012.3005.
[20] Y. Gai, B. Krishnamachari, and M. Liu, "Online Learning for Combinatorial Network Optimization with Restless Markovian Rewards," arXiv:1109.1606.
[21] D. Pollard, Convergence of Stochastic Processes. Berlin: Springer, 1984.