TACT: A Transfer Actor-Critic Learning Framework for Energy Saving in Cellular Radio Access Networks
Rongpeng Li, Zhifeng Zhao, Xianfu Chen, Jacques Palicot, Honggang Zhang
Rongpeng Li∗, Zhifeng Zhao∗, Xianfu Chen†, Jacques Palicot‡ and Honggang Zhang∗§
∗ Zhejiang University, Zheda Road 38, Hangzhou 310027, China. Email: {lirongpeng, zhaozf, honggangzhang}@zju.edu.cn
† VTT Technical Research Centre of Finland, P.O. Box 1100, FI-90571 Oulu, Finland. Email: xianfu.chen@vtt.fi
‡ Supélec, Avenue de la Boulaie CS 47601, Cesson-Sévigné Cedex, France. Email: [email protected]
§ Université Européenne de Bretagne (UEB) & Supélec, Avenue de la Boulaie CS 47601, Cesson-Sévigné Cedex, France
Abstract
Recent works have validated the possibility of improving energy efficiency in radio access networks (RANs) by dynamically turning on/off some base stations (BSs). In this paper, we extend the research over BS switching operations, which should match up with traffic load variations. Instead of depending on dynamic traffic loads, which are still quite challenging to forecast precisely, we first formulate the traffic variations as a Markov decision process. Afterwards, in order to foresightedly minimize the energy consumption of RANs, we design a BS switching operation scheme based on a reinforcement learning framework. Furthermore, to speed up the ongoing learning process, a transfer actor-critic algorithm (TACT), which utilizes the transferred learning expertise from historical periods or neighboring regions, is proposed and provably converges. In the end, we evaluate our proposed scheme by extensive simulations under various practical configurations and show that the proposed TACT algorithm contributes to a performance jumpstart and demonstrates the feasibility of significant energy efficiency improvement at the expense of tolerable delay performance.

Index Terms
Radio access networks, base stations, sleeping mode, green communications, energy saving, reinforcement learning, transfer learning, actor-critic algorithm.
I. INTRODUCTION
The explosive popularity of smartphones and tablets has ignited a surging traffic load demand for radio access and has been incurring massive energy consumption and huge greenhouse gas emission [1], [2]. Specifically, the information and communication technology (ICT) industry accounts for 2% to 10% of the world's overall power consumption [3] and has emerged as one of the major contributors to the world-wide CO₂ emission. Besides, there also exist economic pressures for cellular network operators to reduce the power consumption of their networks. For example, the electricity bill of China Mobile is envisioned to double within five years [4]. Meanwhile, the energy expenditure accounts for a significant proportion of the overall operational cost. Therefore, it is essential to improve the energy efficiency of the ICT industry.

Currently, a dominant share of the power consumption takes place in the radio access networks (RANs), especially the base stations (BSs) [5]. The reason is largely that the present BS deployment is dimensioned for peak traffic loads and BSs generally stay active irrespective of the heavily dynamic traffic load variations [6], [7]. Recently, there has been a substantial body of work towards traffic load-aware BS adaptation [8], and the authors have validated the possibility of improving energy efficiency from different perspectives. Luca Chiaraviglio et al. [9] showed the possibility of energy saving by simulations. [10] and [11] proposed how to dynamically adjust the working status of BSs, depending on the predicted traffic loads. However, reliably predicting the traffic loads is still quite challenging, which limits these works in practical applications. On the other hand, [12] and [13] presented dynamic BS switching algorithms with the traffic loads known a priori and preliminarily proved the effectiveness of energy saving.

Besides, it is also found that turning on/off some of the BSs will immediately affect the associated BS of a mobile terminal (MT). Moreover, subsequent choices of user associations in turn lead to the traffic load differences of BSs. Hence, any two consecutive BS switching operations are correlated with each other, and the current BS switching operation will further influence the overall energy consumption in the long run. In other words, the expected energy saving scheme must be foresighted while minimizing the energy consumption: it should concern its effect on both the current and future system performance to deliver a visionary BS switching operation solution.

The authors in [6] presented a partially foresighted energy saving scheme which combines BS switching operation and user association by giving a heuristic solution on the basis of a stationary traffic load profile. In this paper, we try to solve this problem from a different perspective. Instead of predicting the volume of traffic loads, we apply a Markov decision process (MDP) to model the traffic load variations. Afterwards, the solution to the formulated MDP problem can be attained by making use of the actor-critic algorithm [14], [15], a reinforcement learning (RL) approach [16], one advantage of which is that no a priori knowledge about the traffic loads within the BSs is required.
On the other hand, given the centralized structure of cellular networks, energy saving will significantly benefit from an already existing centralized BS switching operation controller, such as the base station controller (BSC) in second generation (2G) cellular networks or the radio network controller (RNC) in third generation (3G) or long term evolution (LTE) cellular networks, rather than a distributed one. As a result, we assume that a BS switching operation controller exists within the reinforcement learning framework, as illustrated in Fig. 1. The controller first estimates the traffic load variations based on the on-line experience. Afterwards, it can select one of the possible BS switching operations under the estimated circumstance and then decrease or increase the probability of the same action being selected later on the basis of the incurred cost. Here, the cost primarily focuses on the energy consumption due to such a BS switching operation and also takes the performance metric into account to ensure the user experience. After repeating the actions and knowing the corresponding costs, the controller would know how to switch the BSs for one specific traffic load profile. Moreover, with the MDP model, the resulting BS switching strategy is foresighted, which improves energy efficiency in the long run.

However, it usually takes some time for the RL approaches to converge to the optimal solution in terms of the whole cost [17], [18]. Hence, the direct application of the RL algorithms may sometimes get into trouble, especially for a scenario where a BS switching operation controller usually takes charge of tens or even hundreds of BSs [11].

Fig. 1. Transfer learning for reinforcement learning in BS switching operation scenario.
Fortunately, the periodicity and mobility of human behavior patterns make the traffic loads exhibit some temporal and spatial relevancies [19], thus making the traffic load-aware BS switching strategies at different moments or in neighboring regions relevant. Therefore, we could deal with the application issue by utilizing the conceptual idea of transfer learning (TL) [20]. TL, which mostly concerns how to recognize and apply the knowledge learned from one or more previous tasks (source tasks) to more effectively learn to solve a novel but related task (target task) [21], is intuitively appealing, cognitively inspired, and has led to a burst of research activities [20]–[23]. By transferring the learned BS switching operation strategy at historical moments or in neighboring regions (source tasks), TL could exploit the temporal and spatial relevancy in the traffic loads and speed up the ongoing learning process in regions of interest (target tasks), as depicted in Fig. 1. As a result, the learning framework of BS switching operation is further enhanced by incorporating the idea of TL into the classical actor-critic algorithm (AC), yielding the Transfer Actor-CriTic algorithm (TACT) in this paper.

In a nutshell, our work proposes a reinforcement learning framework for energy saving in RANs. Compared to the previous works, this paper provides the following three key insights:

• Firstly, we show that the learning framework is feasible to save the energy consumption in RANs without a priori knowledge of traffic loads. Moreover, the performance of the learning framework approaches that of the state-of-the-art (SOTA) scheme [6], which is assumed to have full knowledge of traffic loads. These preliminary results have already been presented in [24].

• Secondly, we extend the idea of TL to the conventional RL algorithms and show that the proposed TACT algorithm outperforms the classical AC algorithm with a performance jumpstart.

• Thirdly, this paper details the convergence analysis of the TACT algorithm and thereby contributes to the general literature in the RL field, especially the general AC algorithm.

The remainder of the paper is organized as follows. In Section II, we introduce the system model and formulate the traffic variation as an MDP. In Section III, we discuss the energy saving scheme under the conventional RL framework. Section IV focuses on the incorporation of the idea of TL into the conventional RL framework and investigates the convergence proof of the TACT algorithm. Section V evaluates the proposed schemes and presents their validity and effectiveness. Finally, we conclude this paper and present several remaining works in Section VI.

II. SYSTEM MODEL AND PROBLEM FORMULATION
A. System model
Beforehand, Table I summarizes the most used notations in this paper.

TABLE I
A LIST OF THE MAIN SYMBOLS IN THE PAPER

M = ⟨S, A, P, C⟩ : MDP tuple: state space S, action space A, state transition probability function P, and cost function C
s^(k) ∈ S, a^(k) ∈ A : state and action at stage k; the superscript (k) denotes the stage number
V^π(s) : value function V w.r.t. strategy π and state s
p(s, a) : policy, i.e., the tendency to select action a under state s
p_o, p_n and p_e : overall, native and exotic policy (subscripts o, n, e)
δ(s^(k), a^(k)) : TD error under state s^(k) and action a^(k)
ν(s^(k), k) : occurrences of state s^(k) in the previous k stages
k̂ = ν(s^(k), a^(k), k) : occurrences of (s^(k), a^(k)) in the previous k stages
p̂_o(k̂) : discrete sequence describing the evolution of p_o^(k)(s, a)
p̂^(0)(t) : continuous sequence obtained by interpolating p̂_o(k̂)
p̂^(k̂)(t) : temporally shifted version of p̂^(0)(t)
π̇(t), V̇(t) and ṗ_o(t) : derivatives of π(t), V(t) and p_o(s, a)
α(·), β(·) and ζ(·) : positive step-size parameters in the learning algorithms
λ(x), 1/µ(x) : arrival rate and file size at location x
q_i : constant power consumption percentage for BS i
τ : temperature (positive parameter)
ς : delay performance importance (positive parameter)

An RAN usually consists of multiple BSs, while the traffic loads of the BSs usually fluctuate, often leaving BSs under-utilized. In this paper, we assume that there exists a region L ⊂ R² served by a set of overlapped BSs B = {1, . . . , N}, as Fig. 1 depicts. In addition, we assume there exists a BS switching operation controller, which can timely know the traffic loads in these BSs at the current stage and correspondingly determine, in a centralized way, the energy-efficient working status of any BS (i.e., active/sleeping mode) at the next stage. Beyond that, the paper focuses on downlink communication, i.e., from BSs to MTs. Meanwhile, the file transmission requests at a location x ∈ L arrive following a Poisson point process with arrival rate per unit area λ(x) and mean file size 1/µ(x) [25]–[27]. Accordingly, the traffic load density at a location x ∈ L is defined as λ(x)/µ(x) < ∞ [6], [25]. Therefore, the traffic load density can capture different spatial traffic variations; for example, a hotspot can be characterized by a higher arrival rate or a larger file size. Hence, when the set of BSs B_on is turned on, the traffic load served by BS i ∈ B_on can be represented as Γ_i = ∫_L I_i(x, B_on) λ(x)/µ(x) dx, where I_i(x, B_on) = 1 is a user association indicator denoting that location x is served by BS i ∈ B_on, and I_i(x, B_on) = 0 otherwise. If a BS i is in sleeping mode, i.e., i ∈ B \ B_on, its traffic load is defined as zero, namely Γ_i = 0.

To model the temporal traffic load variations within one BS's coverage, i.e., P(Γ_i^(k+1) | Γ_i^(k)) within the coverage of BS i, we partition the traffic loads Γ_i into several segments and use a finite state indicator s_i ∈ S_i to describe one segment. Subsequently, for the whole region of interest, a state vector s = {s_1, · · · , s_N} ∈ S = S_1 × · · · × S_N is constructed to model the traffic load variations and constitutes a finite state Markov chain (FSMC).

Let us denote the transmission rate of a user located at x and served by BS i ∈ B_on as c_i(x, B_on). For analytical convenience, assume that c_i(x, B_on) does not change over time, i.e., we do not consider fast fading or dynamic inter-cell interference. Instead, c_i(x, B_on) is assumed to be a time-averaged transmission rate in this paper, based on the fact that the time scale of user association is commonly much larger than the time scale of fast fading or dynamic inter-cell interference. Hence, the inter-cell interference is considered as static Gaussian-like noise, which is feasible under interference randomization or fractional frequency reuse [6], [27]. Beyond that, though c_i(x, B_on) is location-dependent, it is also affected by the shadowing effect and thus not necessarily determined by the distance from BS i.

Furthermore, the system load density can be defined as the fraction of time required to deliver the traffic loads from BS i ∈ B_on to location x, namely ϱ_i(x) = λ(x)/(µ(x) c_i(x, B_on)). Analogous to the definition of traffic loads, the system load of an active BS i ∈ B_on can be represented as ρ_i = ∫_L ϱ_i(x) I_i(x, B_on) dx, while the system load of a sleeping BS i ∈ B \ B_on is defined as zero.
Hence, the indicator set I = {I_i(x, B_on) | i ∈ B, x ∈ L} is feasible [25] if each BS i ∈ B serves ρ_i < 1. Eventually, our goal is to choose certain active BSs and find a feasible user association indicator set to minimize the total cost. By exploiting the proposed learning framework, the controller can eventually learn the BS switching operation strategy without a priori knowledge of the traffic loads. We will give the details in Section III.
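To make the load definitions above concrete, the following sketch numerically approximates Γ_i and ρ_i on a discretized grid over L. It is only a minimal illustration: the grid resolution, BS positions, rate model and traffic profile are assumptions, not values prescribed by the paper.

```python
import numpy as np

# Approximate the integrals defining the traffic load Gamma_i and the system
# load rho_i by sums over a discrete grid covering the region L.
grid = 50                                    # grid points per axis (assumption)
xs = np.linspace(0.0, 2000.0, grid)          # a 2 km x 2 km region, in metres
X, Y = np.meshgrid(xs, xs)
dA = (xs[1] - xs[0]) ** 2                    # area element for the sums

bs_pos = np.array([[500.0, 500.0], [1500.0, 1500.0]])  # two active BSs (assumed)
lam = np.full((grid, grid), 5e-4)            # arrival rate per unit area lambda(x)
size = np.full((grid, grid), 100e3 * 8)      # mean file size 1/mu(x), in bits

def rate(i: int) -> np.ndarray:
    """Toy time-averaged rate c_i(x, B_on), decaying with distance (assumed)."""
    d = np.hypot(X - bs_pos[i, 0], Y - bs_pos[i, 1]) + 1.0
    return 1e7 / d                           # bits/s, illustrative only

rates = np.stack([rate(i) for i in range(len(bs_pos))])
assoc = rates.argmax(axis=0)                 # association indicator I_i(x, B_on)

for i in range(len(bs_pos)):
    mask = assoc == i
    gamma_i = (lam * size * mask).sum() * dA           # Gamma_i: integral of lambda/mu
    rho_i = (lam * size / rates[i] * mask).sum() * dA  # rho_i: integral of varrho_i
    print(f"BS {i}: traffic load {gamma_i:.3e}, system load {rho_i:.3f}")
```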
B. Problem formulation

In this paper, we primarily aim to minimize the overall energy consumption of BSs in RANs. Our previous work [11] has shown that the energy consumption of a BS is not linearly proportional to the traffic loads within its coverage area. Moreover, the energy consumption of BSs consists of two categories: some constant energy consumption stays irrelevant to the BS's traffic loads, while the remainder varies proportionally to the BS's traffic loads. Hence, we adopt the generalized energy consumption model [6], which can be summarized as

C_ee(ρ, B_on) = Σ_{i∈B_on} [(1 − q_i) ρ_i P_i + q_i P_i], (1)

where ρ = {ρ_1, · · · , ρ_N}. Besides, q_i ∈ (0, 1] is the constant power consumption percentage for BS i, and P_i is the maximum power consumption of BS i when it is fully utilized.

On the other hand, in order to avoid potential quality of service (QoS) deterioration, we introduce the delay-optimal metric in [25] to characterize the flow performance. As defined in [25], the delay-optimal performance function can be formulated as

C_dp(ρ, B_on) = Σ_{i∈B_on} ρ_i / (1 − ρ_i). (2)

Specifically, for an M/G/1 processor-sharing queueing system, (2) equals the number of flows in the system. If we try to minimize (2), Little's law [28] implies that it is actually equivalent to minimizing the average delay.

Above all, our problem is to find an optimal set of active BSs and corresponding user associations that minimize the energy consumption function while ensuring the QoS, namely

min_{ρ, B_on} { C = C_ee(ρ, B_on) + ς C_dp(ρ, B_on) },
s.t. ρ_i ∈ [0, 1), ∀ i ∈ B, (3)

where ς is a positive balancing parameter with a unit of W/s. ς indicates the equivalent cost for one flow waiting in the system and reflects the importance of the delay performance relative to the energy consumption.
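As a companion to (1)–(3), the sketch below evaluates the total cost of a candidate configuration. It is an illustrative helper under assumed parameters: the 865 W and 38 W figures reappear in Section V, while the q and ς values here are arbitrary.

```python
from typing import Dict, List

def total_cost(rho: Dict[int, float], b_on: List[int],
               p_max: Dict[int, float], q: Dict[int, float],
               varsigma: float) -> float:
    """Total cost C = C_ee + varsigma * C_dp from (1)-(3) for active BSs b_on.

    Every active BS must satisfy rho[i] in [0, 1); otherwise the delay term
    in (2) is undefined and the configuration is infeasible.
    """
    c_ee = sum((1 - q[i]) * rho[i] * p_max[i] + q[i] * p_max[i] for i in b_on)
    c_dp = sum(rho[i] / (1 - rho[i]) for i in b_on)
    return c_ee + varsigma * c_dp

# Example: one macro (865 W) and one micro (38 W) BS; q = 0.5 is assumed.
p_max = {0: 865.0, 1: 38.0}
q = {0: 0.5, 1: 0.5}
print(total_cost({0: 0.3, 1: 0.6}, [0, 1], p_max, q, varsigma=1000.0))
print(total_cost({0: 0.8, 1: 0.0}, [0], p_max, q, varsigma=1000.0))  # micro asleep
```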
III. STOCHASTIC BS SWITCHING OPERATION IN REINFORCEMENT LEARNING FRAMEWORK

A. Markov decision process
An MDP is defined as a tuple M = ⟨S, A, P, C⟩, where S is the state space, A is the action space, P is a state transition probability function, and C is a cost function. Specifically, at stage k, the traffic load state is s^(k). Following an action a^(k) = {a_1^(k), · · · , a_N^(k)}, the controller chooses to turn a BS i ∈ B into sleeping mode if a_i^(k) = 0. Otherwise, if a_i^(k) = 1, the BS i remains active. The users correspondingly associate themselves with the remaining active BSs B_on according to an indicator set I^(k), which can be determined by specific metrics for selecting the serving BS, such as cell traffic loads or received signal strength [6]. Thereafter, as the traffic loads evolve, the traffic load state transforms into s^(k+1), which is determined by the exact volume of varying traffic loads at stage k and the related serving BSs, with the transition probability

P(s′ | s^(k), a^(k)) = { 1, s′ = s^(k+1);
                        0, otherwise. } (4)

Meanwhile, the immediate cost generated by the environment (computed by (3)) is fed back to the BS switching operation controller.

The goal is to find a strategy π, which maps a state s to an action π(s), i.e., a^(k), so as to minimize the discounted accumulative cost starting from the state s. Formally, this accumulative cost is called a state value function, which can be calculated by [16]

V^π(s) = E_π [ Σ_{k=0}^{∞} γ^k C(s^(k), π(s^(k))) | s^(0) = s ]
       = E_π [ C(s, π(s)) + γ Σ_{s′∈S} P(s′ | s, π(s)) V^π(s′) ], (5)

where the positive parameter γ is the discount factor that maps the future cost to the current state. Given the diminishing importance of the future cost compared with the current one, γ is smaller than 1. The optimal strategy π* satisfies the Bellman equation [16]:

V*(s) = V^{π*}(s) = min_{a∈A} { E_{π*} [ C(s, a) + γ Σ_{s′∈S} P(s′ | s, a) V^{π*}(s′) ] }. (6)

Since the optimal strategy minimizes the cumulative cost from the beginning, it contributes to the design of a foresighted energy saving scheme.
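If P and C were known in advance, (6) could be solved offline by dynamic programming. The toy value-iteration sketch below, on an assumed two-state, two-action MDP, illustrates that baseline; the learning framework of the next subsection exists precisely because such knowledge is unavailable for real traffic.

```python
import numpy as np

# Value iteration for the Bellman equation (6), usable only when the
# transition probabilities P and costs C are fully known. The tiny MDP
# below is purely illustrative.
n_s, n_a, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.5, 0.5]],      # P[s, a, s']
              [[0.2, 0.8], [0.6, 0.4]]])
C = np.array([[1.0, 2.0], [4.0, 0.5]])       # C[s, a]

V = np.zeros(n_s)
for _ in range(1000):
    Q = C + gamma * P @ V                    # Q[s, a] = C + gamma * sum_s' P V
    V_new = Q.min(axis=1)                    # minimisation, since C is a cost
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new
print("V* =", V, "  pi* =", Q.argmin(axis=1))
```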
B. The actor-critic learning framework for energy saving

There have been some well-known methods to solve MDP problems, such as dynamic programming [16]. Unfortunately, these methods heavily depend on prior knowledge of the environmental dynamics, and it is challenging to know the future traffic loads precisely in advance. Therefore, in this paper, we employ reinforcement learning approaches to solve the MDP problem without requiring a priori knowledge of the traffic loads, and specifically adopt the actor-critic algorithm. As the name implies, the actor-critic algorithm encompasses three components: actor, critic, and environment, as illustrated in Fig. 2 (left). At a given state, the actor selects an action in a stochastic way and then executes it. This execution transforms the state of the environment to a new one with certain probability and feeds back the cost to the actor. Then, the critic criticizes the action executed by the actor and updates the value function through a temporal-difference (TD) error. After the criticism, the actor updates the policy to prefer an action with a smaller cost, and vice versa. The algorithm repeats the above procedure until convergence. The reasons to adopt the actor-critic algorithm are threefold: (i) since it generates the action directly from the stored policy, it requires little computation to select an action to perform; (ii) it can learn an explicitly stochastic policy, which may be useful in the non-Markov traffic variation environment of RANs [29]; (iii) it separately updates the value function and the policy [16]. As a result, it is easier to implement the policy knowledge transfer in Section IV, compared to critic-only algorithms like ε-greedy and Q-learning [30].
Fig. 2. Architecture of classical actor-critic algorithm and transfer actor-critic algorithm (TACT).
Fig. 3. Illustration of the learning framework for energy saving.

We design an actor-critic learning framework for the energy saving scheme as illustrated in Fig. 3.

(i) Action selection: Let us assume that the system is at the beginning of stage k and the traffic load state is s^(k). The controller needs to select an action according to a stochastic strategy, the purpose of which is to improve performance while explicitly balancing two competing objectives: a) searching for a better BS switching operation (exploration) and b) incurring as little cost as possible (exploitation). As a result, the controller not only performs a good BS switching operation based on its past experience, but is also able to explore a new one. The most common methodology is to use a Boltzmann distribution. The controller chooses an action a in state s^(k) of stage k with probability [16]

π^(k)(s^(k), a) = exp{p(s^(k), a)/τ} / Σ_{a′∈A} exp{p(s^(k), a′)/τ}, (7)

where τ is a positive parameter called the temperature. In addition, p(s^(k), a^(k)) indicates the tendency to select action a^(k) at the state s^(k), and it is updated after every stage. It is worth noting that there exists the possibility that the remaining active BSs are not enough to serve the traffic loads in the present stage k. However, as conventional energy saving schemes commonly do, the controller can start an emergency response paradigm to quickly turn on some BSs. Hence, in this paper, we assume the action a^(k), which the controller finally chooses, can meet the traffic load requirements.

(ii) User association and data transmission: In one stage, there exist several slots for user association and data transmission. After the controller chooses to turn some of the BSs into sleeping mode and broadcasts the traffic load density of stage k − 1, the users choose to connect to one BS according to the modified metric in [6] and start the communications slot by slot. Specifically, users at location x choose to join BS i*, where i* satisfies

i*(x) = arg max_{j∈B_on} c_j(x, B_on) / [ (1 − q_j) P_j + ς (1 − ρ_j^{(k−1)})^{−2} ], ∀ x ∈ L. (8)

As stated in [6], (8) proves to be optimal for achieving the minimum total cost in (3) once the active BSs are determined. Intuitively, (8) would be simplified to i*(x) = arg max_{j∈B_on} c_j(x, B_on) / ((1 − q_j) P_j), ∀ x ∈ L, if we merely consider the minimization of energy consumption (i.e., ς = 0). The simplified equation implies that users at location x prefer to join the BS with the largest transmission rate per unit of traffic load-related power consumption.
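The selection rule (7) and the association rule (8) translate directly into code; in the sketch below, all numerical inputs (tendencies, rates, loads, powers) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_action(p_row: np.ndarray, tau: float) -> int:
    """Sample an action from the Boltzmann strategy (7) for one state s."""
    z = p_row / tau
    z -= z.max()                              # stabilise the exponentials
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(p_row), p=probs))

def associate(rates, q, p_max, rho_prev, varsigma):
    """User association (8): the user joins the active BS j maximising
    c_j / ((1 - q_j) * P_j + varsigma * (1 - rho_j)^(-2))."""
    scores = [c / ((1.0 - qj) * pj + varsigma * (1.0 - r) ** -2)
              for c, qj, pj, r in zip(rates, q, p_max, rho_prev)]
    return int(np.argmax(scores))

# Illustrative inputs: three candidate switching actions, two active BSs.
print(boltzmann_action(np.array([0.0, 1.2, -0.7]), tau=1.0))
print(associate([2e6, 1.5e6], [0.5, 0.5], [865.0, 38.0], [0.4, 0.2], 1000.0))
```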
(iii) State-value function update: After the transmission part of stage k, the traffic loads in each BS will change, transforming the system to state s^(k+1) by (4). Meanwhile, the total cost of the transmission is C^(k)(s^(k), a^(k)). Consequently, a TD error δ(s^(k), a^(k)) is computed at the critic as the difference between the state-value function V^(k)(s^(k)) estimated at the preceding state and C^(k)(s^(k), a^(k)) + γ · V^(k)(s^(k+1)), namely

δ^(k)(s^(k), a^(k)) = C^(k)(s^(k), a^(k)) + γ Σ_{s′∈S} P(s′ | s^(k), a^(k)) V(s′) − V(s^(k))
                   = C^(k)(s^(k), a^(k)) + γ · V^(k)(s^(k+1)) − V^(k)(s^(k)). (9)

Afterwards, the TD error is fed back to the actor, and the state-value function is updated as

V^(k+1)(s^(k)) = V^(k)(s^(k)) + α(ν(s^(k), k)) · δ^(k)(s^(k), a^(k)). (10)

Here, ν(s^(k), k) denotes the number of occurrences of state s^(k) in the previous k stages, and α(·) is a positive step-size parameter that affects the convergence rate. On the other hand, if s ≠ s^(k), V^(k+1)(s) is kept the same as V^(k)(s).

(iv) Policy update: At the end of stage k, the critic employs the TD error to "criticize" the selected action, which is implemented as

p^(k+1)(s^(k), a^(k)) = p^(k)(s^(k), a^(k)) − β(ν(s^(k), a^(k), k)) · δ^(k)(s^(k), a^(k)). (11)

Similar to ν(s^(k), k), ν(s^(k), a^(k), k) indicates the number of times action a^(k) has been executed at state s^(k) in the previous k stages, and β(·) is a positive step-size parameter. (7) and (11) ensure that an action under a specific state is selected with higher probability if the "foresighted" cost it incurs is comparatively smaller, i.e., δ^(k)(s^(k), a^(k)) < 0. Additionally, if a ≠ a^(k), p^(k+1)(s^(k), a) = p^(k)(s^(k), a).

If each action is executed infinitely often in every state, in other words, if in the limit the learning strategy is greedy with infinite exploration, the value function V(s) and strategy π^(k)(s, a) will finally converge to V* and π* with probability (w.p.) 1 as k → ∞ [31].
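Steps (i)–(iv) combine into a compact tabular agent. The sketch below implements (7) and (9)–(11); the schedules α(k) = 1/k and β(k) ≈ 1/(k log k) anticipate the choices of Section V, and the environment (which would supply the cost and the next state) is deliberately left out.

```python
import math
import numpy as np

class ActorCritic:
    """Tabular actor-critic with the updates (9)-(11) of the paper."""

    def __init__(self, n_states: int, n_actions: int, gamma: float, tau: float):
        self.V = np.zeros(n_states)               # state-value function
        self.p = np.zeros((n_states, n_actions))  # tendencies p(s, a)
        self.visits_s = np.zeros(n_states, dtype=int)
        self.visits_sa = np.zeros((n_states, n_actions), dtype=int)
        self.gamma, self.tau = gamma, tau
        self.rng = np.random.default_rng(1)

    def act(self, s: int) -> int:                 # Boltzmann strategy (7)
        z = self.p[s] / self.tau
        z -= z.max()
        probs = np.exp(z) / np.exp(z).sum()
        return int(self.rng.choice(len(probs), p=probs))

    def update(self, s: int, a: int, cost: float, s_next: int) -> None:
        self.visits_s[s] += 1
        self.visits_sa[s, a] += 1
        alpha = 1.0 / self.visits_s[s]                          # alpha(k) = 1/k
        k = self.visits_sa[s, a] + 1
        beta = 1.0 / (k * math.log(k + 1.0))                    # beta ~ 1/(k log k)
        delta = cost + self.gamma * self.V[s_next] - self.V[s]  # TD error (9)
        self.V[s] += alpha * delta                              # value update (10)
        self.p[s, a] -= beta * delta                            # policy update (11)
```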
IV. TRANSFER ACTOR-CRITIC ALGORITHM FOR STOCHASTIC BS SWITCHING OPERATION

A. Motivation and formulation of the transfer actor-critic algorithm
The previous section addressed the methodology for exploiting the classical AC algorithm to conduct the BS switching operation, culminating in an effective energy saving strategy. In this section, we present the means by which the controller utilizes the knowledge of strategies learned during historical periods or in neighboring regions to accelerate finding the optimal BS switching operations.

Basically, the policy, say p(s, a), which finally determines the strategy π(s, a) in one learning task, indicates the tendency of action a to be chosen in state s. When the learning process converges, the tendency to choose a specific action a in a specific state is comparatively larger than that of other actions. In other words, if the BS switching operation is conducted based on one learned strategy, the energy saving in the whole system tends to be optimized in the long run. Hence, if the knowledge of this policy p(s, a) is transferred to another task, e.g., from Period 1 (source task) to Period 2 (target task) within the same region of interest in Fig. 1, the controller in the target task can make an attempt by taking the same action a when the traffic loads come into state s. Compared to learning from scratch, the controller might directly make the wisest choice at the very beginning. However, in spite of the similarities between the source task and the target task, there might still exist some differences. For example, the system might come into the same state in two different tasks, whereas the traffic loads in the source task (e.g., Period 1) might usually be higher than those in the target one (e.g., Period 2). Hence, instead of staying with the action a chosen in the source task, the controller in the target task can make a more aggressive choice and turn more BSs into sleeping mode, thus saving more energy. Consequently, in this case, the transferred policy guides in a negative manner. To avoid this underlying problem, the transferred policy should have a decreasing impact on choosing a certain action once the controller has attempted to choose this action and nurtured its own learning experience.

Taking the above considerations into account, we propose a new policy update method, named the Transfer Actor-CriTic (TACT) algorithm, as shown in Fig. 2. In the TACT algorithm, the overall policy (i.e., p_o) to select an action is divided into a native one p_n and an exotic one p_e. Without loss of generality, let us assume that at stage k, the traffic load state is s^(k) and the chosen action is a^(k). Accordingly, the overall policy p_o is updated as

p_o^(k+1)(s^(k), a^(k)) = [ (1 − ζ(ν(s^(k), a^(k), k))) p_n^(k+1)(s^(k), a^(k)) + ζ(ν(s^(k), a^(k), k)) p_e(s^(k), a^(k)) ]_{−p_t}^{p_t}, (12)

where [x]_a^b with b > a denotes the Euclidean projection of x onto the interval [a, b], i.e., [x]_a^b = a if x < a, [x]_a^b = b if x > b, and [x]_a^b = x if a ≤ x ≤ b. In this case, a = −p_t and b = p_t, with p_t > 0. Additionally, p_o^(k+1)(s^(k), a) = p_o^(k)(s^(k), a), ∀ a ∈ A with a ≠ a^(k). Besides, p_n(s, a) still updates itself according to the classical actor-critic algorithm, namely (11).

Initially, the exotic policy p_e(s, a) dominates in the overall strategy. Hence, when the environment enters a state s, the presence of p_e(s, a) contributes to choosing the action which might be optimal for s in the source task.
Consequently, the proposed policy update method leads to a possible performance jumpstart. On the other hand, since the transfer rate ζ(·) ∈ (0, 1) and ζ(k) → 0 as k → ∞, the effect of the transferred exotic policy p_e(s, a) continuously decreases. Therefore, the controller can not only take advantage of the learned expertise in the source task, but also swiftly get rid of negative guidance.

Finally, we summarize our proposed TACT algorithm in Algorithm 1.
Algorithm 1 TACT: The Transfer Learning Framework for Energy Saving

Initialization: for each s ∈ S, each a ∈ A do
  Initialize the state-value function V(s), the native policy function p_n(s, a), the exotic policy function p_e(s, a) (transferred knowledge) and the strategy function π(s, a);
end for

Repeat until convergence:
1) Choose an action a^(k) in state s^(k) according to π^(k)(s^(k), a^(k)) in (7);
2) Users at location x connect to one BS i by (8) and then start data transmission;
3) If ρ_i < 1, ∀ i ∈ B, the chosen action is feasible, and the cost function C(s^(k), a^(k)) is calculated by (3); otherwise, an emergency response paradigm starts, as the conventional scheme does;
4) Identify the traffic loads, accordingly update the state s^(k) → s^(k+1), and compute the TD error by (9);
5) Update the state-value function by (10) for s = s^(k);
6) Update the native policy function and the overall policy function by (11) and (12) for s = s^(k), a = a^(k), respectively;
7) Update the strategy function π^(k+1)(s^(k), a) by (7), for s = s^(k) and all a ∈ A.
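Relative to the classical AC agent, only the policy update of step 6 changes. A minimal sketch of the mixing-and-projection rule (12), where ζ(k) = θ^k follows Section V and the θ and p_t values are assumptions:

```python
import numpy as np

def tact_policy_update(p_n_new: float, p_e: float, visits_sa: int,
                       theta: float = 0.2, p_t: float = 50.0) -> float:
    """Overall policy update (12): mix the native and exotic tendencies,
    then project onto [-p_t, p_t]. theta and p_t are assumed values."""
    zeta = theta ** visits_sa            # transfer rate zeta(k) = theta^k
    p_o = (1.0 - zeta) * p_n_new + zeta * p_e
    return float(np.clip(p_o, -p_t, p_t))

# Early on (few visits) the exotic, transferred tendency dominates ...
print(tact_policy_update(p_n_new=0.1, p_e=3.0, visits_sa=1))
# ... while after many visits the native tendency takes over.
print(tact_policy_update(p_n_new=0.1, p_e=3.0, visits_sa=50))
```

Early on, the transferred exotic tendency dominates the mixture; as the state-action pair accumulates visits, ζ decays and the native tendency learned in the target task takes over, which is exactly the jumpstart-then-forget behavior motivated above.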
B. Convergence analysis

Next, we are interested in the convergence of the TACT algorithm, since the knowledge transfer makes the policy update in the proposed TACT algorithm distinct from that in the classical AC algorithms, and it becomes difficult to directly apply the convergence results of the latter. We start the analysis by introducing several related lemmas. Singh et al. [31] show that the Boltzmann method is greedy in the limit with infinite exploration, given a large enough τ. Therefore, we have the following lemma.

Lemma 1. If we use the Boltzmann exploration method with a large enough τ, there exists an η > 0 such that

lim_{k→∞} ν(s, a, k)/k ≥ η, ∀ s ∈ S, a ∈ A. (13)

In other words, as k → ∞, ν(s, a, k) = ηk → ∞.

Definition 1. Define a function ϑ_{s,a}(p_o) as

ϑ_{s,a}(p_o) = { 0, if p_o(s, a) = p_t and δ(s, a) ≥ 0, or p_o(s, a) = −p_t and δ(s, a) ≤ 0;
                1, otherwise. } (14)

The next theorem states that our proposed policy update tracks an ordinary differential equation (ODE).

Theorem 1. Assume that the learning rate β(k) in (11) satisfies

Σ_{k=0}^{∞} β(k) = ∞, β(k) ≥ 0, Σ_{k=0}^{∞} β²(k) < ∞, (15)

and the transfer rate ζ(k) satisfies lim ζ(k)/β(k) → 0 as k → ∞. Then p_o(s, a) asymptotically tracks the solution of the ODE

ṗ_o(t) = −δ(s, a) ϑ_{s,a}(p_o), ∀ s ∈ S, a ∈ A, (16)

where δ(s, a) = lim δ^(k)(s, a) as k → ∞.

Proof:
We provide a proof sketch here and address the details in the Appendix. Without loss of generality, assume that the state is s^(k). By Algorithm 1, at stage k, the policy value p_o(s^(k), a) is changed only when a is the executed action a^(k). Therefore, by merely including the updated values, we can form another discrete sequence p̂_o(k̂) to indicate the evolution of p_o^(k)(s^(k), a^(k)), where the index k̂ equals ν(s^(k), a^(k), k). (Indeed, p̂_o(k̂) refers to p̂_o^(k̂)(s^(k), a^(k)); for simplicity of representation, the notation of s^(k) and a^(k) is omitted here.) After that, by introducing a concept of β(k̂)-induced continuous time t and interpolating the discrete sequence p̂_o(k̂), we construct a continuous sequence p̂^(0)(t) and its shifted version p̂^(k̂)(t). Next, we prove that the shifted continuous sequence p̂^(k̂)(t) is equicontinuous. Based on the discussions around the Arzelà–Ascoli Theorem [32], we finally obtain that any limit of p̂(t), or the discrete equivalent p̂_o(k̂), must track the solution of the ODE in (16) for a sufficiently large k̂. By Lemma 1, the theorem follows.

In addition, we introduce the definition of a strict Lyapunov function [32], which is fundamental to our following proof.

Definition 2. Suppose that for an ODE ż(t) = f(z) defined on a region D, V(z) is a continuously differentiable and real-valued function of z such that V(0) = 0 and V(z) > 0, ∀ z ≠ 0. If V̇(t) = ∇V · ż(t) = ∇V · f(z) ≤ 0 on the region D, and the equality holds only when ż(t) = 0, then the function V(z) is a strict Lyapunov function for the ODE ż(t).

Our proof relies on the following theorem by Konda and Borkar [14], which establishes the convergence of a general actor-critic algorithm.

Theorem 2. Assume that the learning rate α(k) satisfies the assumptions in Section 2.2 of [14] and that β(k) and ζ(k) meet the conditions in Theorem 1. If the strategy π, which is derived by (7) with the policy update method given by (12), has a strict Lyapunov function for the ODE π̇(t), then π converges w.p. 1 and ‖π − π*‖ ≤ ε for any ε > 0 as p_t → ∞.

Beforehand, the following lemma comes from directly applying (5) in (9).

Lemma 2.

Σ_{a∈A} δ(s, a) π(s, a) = 0, ∀ s ∈ S. (17)

Lemma 3. If the strategy π(s, a) tracks the solution of the ODE π̇(t), and π̇(t) satisfies Σ_{a∈A} π̇(t) δ(s, a) ≤ 0, then we have ∇V^π(s) π̇(t) ≤ Σ_{a∈A} π̇(t) δ(s, a) ≤ 0, ∀ s ∈ S.

Proof:
For two distinct policies π and π′, define a value function operator T(π′, V^π(s)) = E_{π′} [ C(s, a) + γ Σ_{s′∈S} P(s′ | s, a) V^π(s′) ]. Assume that there exists an infinitesimal ε > 0 such that π + ε π̇(t) is still a valid strategy. Denoting π′ = π + ε π̇(t), we have

T(π′, V^π(s)) − V^π(s) = E_{π′} [ C(s, a) + γ Σ_{s′∈S} P(s′ | s, a) V^π(s′) ] − V^π(s)
= Σ_{a∈A} { π′ [ C(s, a) + γ Σ_{s′∈S} P(s′ | s, a) V^π(s′) − V^π(s) ] }
= Σ_{a∈A} (π + ε π̇(t)) δ(s, a)
= Σ_{a∈A} ε π̇(t) δ(s, a) ≤ 0.

The last equality follows from Lemma 2.

Denoting an iterated operator T^n(π′, V^π(s)) = T^{n−1}(π′, T(π′, V^π(s))), we have T^n(π′, V^π(s)) ≤ T^{n−1}(π′, V^π(s)) ≤ · · · ≤ V^π(s). In addition, T^n(π′, V^π(s)) − V^π(s) ≤ Σ_{a∈A} ε π̇(t) δ(s, a) for n > 1. As n → ∞, T^n(π′, V^π(s)) → V^{π′}(s), and we obtain

(V^{π′}(s) − V^π(s))/ε = (V^{π+επ̇}(s) − V^π(s))/ε ≤ Σ_{a∈A} π̇(t) δ(s, a) ≤ 0.

As ε → 0, ∇V^π(s) π̇(t) ≤ Σ_{a∈A} π̇(t) δ(s, a) ≤ 0. The claim follows.
Theorem 3. Σ_{s∈S} V^π(s) is a strict Lyapunov function for the ODE π̇(t), if p_t is sufficiently large.

Proof:
By explicitly differentiating (7) over t, we have

π̇(t) = (1/τ) · [ exp(p_o(s, a)/τ) / Σ_{a′∈A} exp(p_o(s, a′)/τ) ] · ṗ_o(t)
      − (1/τ) · exp(p_o(s, a)/τ) · Σ_{a′∈A} { exp(p_o(s, a′)/τ) ṗ_o(t) } / { Σ_{a′∈A} exp(p_o(s, a′)/τ) }²
= (1/τ) π(s, a) ṗ_o(t) − (1/τ) π(s, a) Σ_{a′∈A} π(s, a′) ṗ_o(t)
= (1/τ) π(s, a) ṗ_o(t) − (1/τ) π(s, a) Σ_{a′∈A} π(s, a′) [ −δ(s, a′) ϑ_{s,a′}(p_o) ]
= (1/τ) π(s, a) ṗ_o(t).

The last equality follows from Lemma 2, after taking into account that ϑ_{s,a}(p_o) = 1 holds if p_t is sufficiently large. By Theorem 1,

π̇(t) δ(s, a) = [ −(1/τ) π(s, a) δ(s, a) ϑ_{s,a}(p_o) ] · δ(s, a) = −(1/τ) π(s, a) [δ(s, a)]² ≤ 0.
Fig. 4. Illustration of BS deployment in our simulation scenario.
The equality only holds at the equilibrium point ṗ_o(t) = −δ(s, a) = 0. By Lemma 3, ∇V^π(s) π̇(t) ≤ 0. Therefore, according to Definition 2, the claim follows.

Theorem 4. Regardless of the initial values chosen for p_n(s, a) and the transferred knowledge p_e(s, a), if the learning rates α(k), β(k) and the transfer rate ζ(k) meet the required conditions while p_t and τ are sufficiently large, Algorithm 1 converges.

Proof:
The proof is a direct application of Theorem 2, which establishes the convergence given two conditions. First, the policy p_o(s, a) tracks the solution of an ODE, by Theorem 1. Second, the tracked ODE has a strict Lyapunov function, by Theorem 3. Therefore, the learning process in Algorithm 1 converges.
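The theorems above only constrain the schedules. For instance, β(k) = 1/(k log k) satisfies (15), and ζ(k) = θ^k decays geometrically, so ζ(k)/β(k) → 0; a quick numerical illustration (θ = 0.2 is an assumption):

```python
import math

theta = 0.2                               # transfer rate factor (assumption)
for k in (10, 100, 1000, 10000):
    beta = 1.0 / (k * math.log(k))        # square-summable but not summable
    zeta = theta ** k                     # geometric, hence zeta/beta -> 0
    print(f"k={k:>6}  beta={beta:.3e}  zeta/beta={zeta / beta:.3e}")
```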
V. NUMERICAL ANALYSIS

We validate the energy efficiency improvement of our proposed scheme by extensive simulations under practical configurations. Here, we simulate an area of 2 km × 2 km, and the file transmission requests at location x ∈ L follow a Poisson point process with arrival rate λ(x) and file size 1/µ(x). To ease the simulation process, each BS's traffic load state merely takes the value 0 or 1 (0 represents the case where the realistic traffic loads are smaller than the historical average, while 1 indicates the other cases). Beyond that, we assume maximum transmission powers of 20 W and 1 W for macro and micro BSs, respectively. Based on the linear relationship between transmission and operational energy consumption in [6], the maximum operational powers for a macro BS and a micro BS are 865 W and 38 W, respectively. We set the propagation channel according to the COST-231 modified Hata model [34] and do not consider the influence of fast fading and noise. As for the proposed TACT algorithm, the learning rates are α(k) = 1/k and β(k) = 1/(k log k) [14]. Moreover, the transfer rate is ζ(k) = θ^k, with the transfer rate factor θ ∈ (0, 1), thus satisfying the assumption in Theorem 1.

TABLE II
USED SIMULATION PARAMETERS

Simulation area : 2 km × 2 km
Maximum transmission power : 20 W (macro BS), 1 W (micro BS)
Maximum operational power : 865 W (macro BS), 38 W (micro BS)
Arrival rate λ(x) : 5 × 10⁻⁴
File size 1/µ(x) : 100 kbyte

By the way, we assume the extra cost is negligible when we turn the necessary BSs into active mode. Besides, we define the cumulative energy consumption ratio (CECR) as the metric to test how much energy saving can be achieved by applying our proposed schemes. Specifically, the CECR metric is defined as the ratio of the accumulative energy consumption when certain BSs are turned off (as our scheme runs) to that when all the BSs stay active, counted since our simulation starts. This definition is reasonable since it can show the foresighted energy efficiency improvement, which is exactly the goal of an energy saving scheme.

Besides, we compare the performance of our proposed schemes with that of the state-of-the-art (SOTA) scheme [6], which assumes the controller can obtain full knowledge of the traffic loads in advance and finds the optimal BS switching solution by greedily turning as many BSs as possible into sleeping mode. To simplify the comparison, we simulate by adjusting only one parameter while configuring the others according to Table II.

Firstly, we examine how much energy saving can be achieved under different static traffic load arrival rates. [6] shows that when all BSs are turned on, a homogeneous traffic distribution of λ(x) = 10⁻⁴ for all x ∈ L will offer loads corresponding to about 10% of BS utilization. Therefore, we vary the homogeneous traffic arrival rate λ(x) from 0.05 × 10⁻⁴ to 5 × 10⁻⁴ to reflect the effect of traffic loads on energy saving. Here, the transferred policy is generated from a source task with the static arrival rate λ = 5 × 10⁻⁴. (Due to the space limitation, only temporal knowledge transfer is considered for the TACT scheme.)

Fig. 5. Performance comparison under various homogeneous traffic arrival rates: (a) λ = 0.05 × 10⁻⁴, (b) λ = 0.1 × 10⁻⁴, (c) λ = 2 × 10⁻⁴, (d) λ = 5 × 10⁻⁴.

Fig. 6. Performance tradeoff between energy and delay under different delay equivalent cost scenarios, with ς ∈ {0, 100, 1000, 1500, 2000, 5000}.
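For clarity, a minimal sketch of the CECR metric defined above, applied to synthetic per-stage energy traces (the numbers are illustrative, not simulation outputs):

```python
import numpy as np

def cecr(energy_scheme: np.ndarray, energy_all_on: np.ndarray) -> np.ndarray:
    """Cumulative energy consumption ratio after each stage: the running
    energy of the sleeping scheme over that of the always-on baseline."""
    return np.cumsum(energy_scheme) / np.cumsum(energy_all_on)

# Illustrative traces (W per stage): the scheme gradually learns to sleep BSs.
all_on = np.full(6, 1000.0)
scheme = np.array([1000.0, 950.0, 800.0, 700.0, 650.0, 640.0])
print(cecr(scheme, all_on))   # a monotonically improving ratio
```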
As depicted in Fig. 5, we can expect more significant energy conservation as the arrival rate λ decreases. This is because, if all the BSs stay active under lower traffic loads, the BSs are more severely under-utilized. Moreover, the CECR keeps decreasing as the simulation runs, since the controller obtains a better understanding of the traffic loads and thereby learns which action yields a better energy efficiency. Unfortunately, since the proposed learning schemes are performed without a priori knowledge of the traffic loads, their performance is inferior to that of the SOTA scheme, especially at the beginning of the simulations. However, we can see that the gap due to the absent knowledge becomes much smaller when the TACT scheme is applied with the transferred knowledge. (Together with the classical AC algorithm, the ε-greedy algorithm is also compared. However, due to the insufficient exploration issue in the ε-greedy algorithm [16], its performance is the worst in all cases, which refrains us from using it further.)

After validating the feasibility of the proposed learning framework to save energy, Fig. 6 depicts the performance tradeoff between energy consumption and delay under different delay equivalent cost scenarios obtained by tuning ς. When ς = 0, the energy saving is most significant. However, this also incurs a limited increase in delay. Comparatively, the energy saving becomes less obvious if we put more emphasis on the delay equivalent cost by choosing a larger ς so as to decrease the delay. Again, we can find that the tradeoff points of TACT are closer to those of the SOTA solutions in all these scenarios.
Fig. 7. Performance improvement of the TACT scheme over the classical AC scheme versus the Kullback-Leibler divergence. The bars corresponding to the left Y-axis reflect the CECR improvement, while the dotted line corresponding to the right Y-axis represents the Kullback-Leibler divergence.
Fig. 7 presents the performance improvement of the TACT scheme over the classical AC scheme. As expected, the TACT scheme yields a relatively large performance improvement, especially at the beginning of each simulation. In other words, the TACT scheme contributes to a performance jumpstart, or a faster convergence speed. Fig. 7 also depicts the similarity between the source task and the target task, measured by the Kullback-Leibler divergence [35]. It shows that a smaller Kullback-Leibler divergence between the source task and the target task leads to a more efficient transfer effect. (The performance improvement is calculated by dividing the energy consumption margin between the TACT scheme and the classical AC scheme by the energy consumption of the classical AC scheme.)

Fig. 8. Performance impact of the transfer rate factor θ on the TACT scheme.

Besides, we also plot the impact of the transfer rate factor θ in Fig. 8. Generally speaking, as we expect, a larger θ results in a faster convergence rate and larger energy saving.

We also investigate the performance of the proposed schemes when the traffic loads fluctuate periodically. [12] shows that a practical traffic load profile is periodic and can be approximated by a sinusoidal function λ(k) = λ_V · cos(2π(k + φ)/D) + λ_M, where D is the period of the traffic load profile, λ_V is the variance of the traffic profile and λ_M is the mean arrival rate. Therefore, we employ λ(k, x) = (λ_V · cos(2π(k + 10)/24) + 1) × 10⁻⁴ to approximate the practical traffic load arrival rate at location x ∈ L.
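A small sketch generating such a profile; note that the amplitude coefficient lam_v is a placeholder, since only the form of λ(k, x), the phase 10 and the period 24 are fixed by the text.

```python
import numpy as np

def sinusoidal_arrival_rate(k: np.ndarray, lam_v: float = 0.5,
                            phi: float = 10.0, period: float = 24.0,
                            lam_m: float = 1.0) -> np.ndarray:
    """lambda(k) = (lam_v * cos(2*pi*(k + phi)/period) + lam_m) * 1e-4.
    lam_v is an assumed placeholder, not the paper's exact coefficient."""
    return (lam_v * np.cos(2 * np.pi * (k + phi) / period) + lam_m) * 1e-4

stages = np.arange(48)                      # two 24-stage periods
print(sinusoidal_arrival_rate(stages)[:6])
```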
Fig. 9 compares the performance of the proposed schemes under this profile and shows that the TACT scheme converges faster than the classical AC scheme.

Fig. 9. Performance comparison with the time-variant traffic arrival rate λ(k, x) = (λ_V · cos(2π(k + 10)/24) + 1) × 10⁻⁴.

At last, we continue the performance evaluation of our proposed schemes and present more detailed sensitivity analyses in Fig. 10. In Fig. 10 (a)–(c), we present the simulation results under various configurations to reflect the effect of the temperature value τ, the file size 1/µ and the constant power consumption percentage q. We can observe that the performance trends match our intuition in all these cases. For example, in Fig. 10 (a), a larger value of τ implies that the controller has a higher desire to explore new actions. Therefore, even though the controller has tried the wisest action, it would still choose more actions with larger cost, resulting in a less significant energy consumption saving. Fortunately, the TACT scheme can exploit the transferred knowledge to avoid some certainly undesirable actions and performs better than the classical AC one, especially at larger values of τ. Fig. 10 (b) demonstrates that the effect of file sizes on the scheme performance is similar to that of arrival rates. Fig. 10 (c) implies that the schemes perform better when the constant power consumption accounts for a larger proportion of the whole cost, since turning off one under-utilized BS makes a clearer difference and saves more energy in these cases. On the other hand, we also give the simulation results for the red shaded region with 6 BSs (illustrated in Fig. 4) and exhibit the robustness of our proposed schemes in different BS deployment scenarios.

Fig. 10. Performance comparison under various configurations: (a) different values of τ, (b) different file sizes 1/µ, (c) different energy consumption models, and (d) different arrival rates in a 6 BS (red shaded region in Fig. 4) scenario. All these simulation results are generated after 1500 stages.
VI. CONCLUSION

In this paper, we have developed a learning framework for BS energy saving. We specifically formulated the BS switching operations under varying traffic loads as a Markov decision process. Besides, we adopted the actor-critic method, a reinforcement learning algorithm, to give the BS switching solution that decreases the overall energy consumption. Afterwards, to fully exploit the temporal relevancy in traffic loads, we proposed a transfer actor-critic algorithm to improve the strategies by taking advantage of the knowledge learned from historical periods. Our proposed algorithm provably converges given certain restrictions that arise during the learning process, and the extensive simulation results manifest the effectiveness and robustness of our energy saving schemes under various practical configurations.

Similar to the simulated temporal knowledge transfer, our proposed TACT approach is potentially viable for application in spatial scenarios to achieve a performance improvement. Unfortunately, the mapping of knowledge will sometimes be less straightforward in the latter case, due to the underlying differences in BS geographical deployment. Therefore, we are dedicated to handling the related meaningful yet more challenging issues over spatial knowledge transfer in the future.

ACKNOWLEDGMENT
The authors would like to express their sincere gratitude to the editor, Prof. Jack L. Burbank, and the anonymous reviewers for their kind comments. The authors also thank Qianlan Ying (ZJU), Shun Cai (SEU) and Yi Zhong (USTC) for their commendable suggestions in improving the paper quality. This paper is supported by the National Basic Research Program of China (973Green, No. 2012CB316000), the Key (Key grant) Project of Chinese Ministry of Education (No. 313053), the Key Technologies R&D Program of China (No. 2012BAH75F01), and the grant of the "Investing for the Future" program of France ANR to the CominLabs excellence laboratory (ANR-10-LABX-07-01).

APPENDIX
Proof of Theorem 1.
Proof:
Without loss of generality, assume that at stage k, the state is s^(k) and the chosen action is a^(k). Moreover, suppose the latest stage at which the state-action pair (s^(k), a^(k)) occurred is stage m. Thus, by Algorithm 1, the policy p_o^(j)(s^(k), a^(k)) remains invariant for any j ∈ [m, · · · , k). For simplicity of representation, we denote one sequence p̂_o(k̂) = p_o^(k)(s^(k), a^(k)) and p̂_o(k̂ − 1) = p_o^(j)(s^(k), a^(k)) for any j ∈ [m, · · · , k), where the index k̂ equals ν(s^(k), a^(k), k). In addition, the sequences p̂_n(k̂) and δ̂(k̂) are defined analogously to p̂_o(k̂). Thus, based on (12), we have

p̂_o(k̂) = p_o^(k)(s^(k), a^(k)) = [ (1 − ζ(k̂ − 1)) p̂_n(k̂) + ζ(k̂ − 1) p_e(s^(k), a^(k)) ]_{−p_t}^{p_t}. (18)

Firstly, assume that p_t is large enough such that |p̂_o(k̂)| < p_t and |p̂_o(k̂ + 1)| < p_t; this assumption will be dropped later. Subtracting (18) from (12), we obtain

p̂_o(k̂ + 1) − p̂_o(k̂) = (1 − ζ(k̂ − 1)) ( p̂_n(k̂ + 1) − p̂_n(k̂) ) − ( ζ(k̂) − ζ(k̂ − 1) ) ( p̂_n(k̂ + 1) − p_e(s^(k), a^(k)) )
= −β(k̂) (1 − ζ(k̂ − 1)) δ̂(k̂) − ( ζ(k̂) − ζ(k̂ − 1) ) ( p̂_n(k̂ + 1) − p_e(s^(k), a^(k)) ). (19)

The last equality holds because of (11).

Define t_0 = 0 and t_k̂ = Σ_{j=0}^{k̂−1} β(j). For t ≥ 0, let K(t) denote the unique value of k̂ such that t_k̂ ≤ t < t_{k̂+1}, as Fig. 11(a) depicts. For t < 0, set K(t) = 0. Define the continuous-time interpolation p̂^(0)(·) on (−∞, ∞) by p̂^(0)(t) = p_o^(0)(s^(k), a^(k)) for t ≤ 0, and for t ≥ 0,

p̂^(0)(t) = p̂_o(K(t)) = p̂_o(k̂), for t_k̂ ≤ t < t_{k̂+1}.

Fig. 11. Illustration of (a) the function K(t), (b) the function p̂^(0)(t), (c) the function K(t_k̂ + t) and (d) the function p̂^(k̂)(t).

Moreover, we define the sequence of shifted processes p̂^(k̂)(t) = p̂^(0)(t_k̂ + t), t ∈ (−∞, ∞), as Fig. 11(d) depicts. Define Y_j = 0 and Z_j = 0 for j < 0. Moreover, define Y_j = (1 − ζ(j − 1)) δ̂(j) and Z_j = (ζ(j) − ζ(j − 1)) ( p̂_n(j + 1) − p_e(s^(k), a^(k)) ) for j ≥ 0. Define Z^(0)(t) = 0 for t ≤ 0 and

Z^(0)(t) = Σ_{j=0}^{K(t)−1} Z_j,
Z^(k̂)(t) = Z^(0)(t_k̂ + t) − Z^(0)(t_k̂) = Σ_{j=k̂}^{K(t_k̂+t)−1} Z_j, t ≥ 0.

Taking into account the definitions above (recall that K(t_k̂) = k̂), the following equation can be obtained by a manipulation of (19):

p̂^(k̂)(t) = p̂_o(k̂) − Σ_{j=k̂}^{K(t_k̂+t)−1} ( β(j) Y_j + Z_j ) = p̂_o(k̂) − Σ_{j=k̂}^{K(t_k̂+t)−1} β(j) Y_j − Z^(k̂)(t). (20)

Since p̂^(k̂)(t) is piecewise constant, we can rewrite (20) as

p̂^(k̂)(t) = p̂_o(k̂) − ∫_0^t Y_{K(t_k̂+x)} dx − Z^(k̂)(t) + ϕ^(k̂)(t), (21)

where ϕ^(k̂)(t) is the error due to the replacement of the first sum in (20) by an integral. ϕ^(k̂)(t) = 0 at the times when the interpolated sequences have jumps, i.e., t = t_k̂′ − t_k̂, k̂′ > k̂, and ϕ^(k̂)(t) → 0 in t as k̂ → ∞ under the assumption in (15).

Besides, by our assumption that lim ζ(k̂)/β(k̂) → 0 as k̂ → ∞,

Z_k̂ = ( ζ(k̂) − ζ(k̂ − 1) ) ( p̂_n(k̂ + 1) − p_e(s^(k), a^(k)) ) = o(β(k̂)) ( p̂_n(k̂ + 1) − p_e(s^(k), a^(k)) ).

Therefore, Z^(k̂)(t) = Σ_{j=k̂}^{K(t_k̂+t)−1} o(β(j)) ( p̂_n(j + 1) − p_e(s^(k), a^(k)) ). Thus, as k̂ → ∞, Z^(k̂)(t) is negligible, since it is of a smaller order of magnitude than Σ_{j=k̂}^{K(t_k̂+t)−1} β(j) Y_j.

Given the above discussion, as k̂ → ∞, the sequence of functions p̂^(k̂)(t) = p̂_o(k̂) − ∫_0^t Y_{K(t_k̂+x)} dx is equicontinuous. Hence, by the Arzelà–Ascoli Theorem [32], there is a convergent subsequence in the sense of uniform convergence on each bounded time interval, and it is easily seen that any limit of p̂(t), or the discrete equivalent p̂_o(k̂), must track the solution of the ODE ˙p̂(t) = −δ̂(k̂) for sufficiently large k̂.

Next, in the special case where p̂_o(k̂ − 1) = p_t and δ̂(k̂ − 1) ≥ 0, at the next stage the overall policy p̂_o(k̂) equals p_t; thus, the ODE becomes ˙p̂(t) = 0. A similar discussion applies to the case where p̂_o(k̂ − 1) = −p_t and δ̂(k̂ − 1) ≤ 0.
Furthermore, as k → ∞, by Lemma 1, k̂ = ν(s^(k), a^(k), k) → ∞. Summarizing the above discussion and taking into account that δ(s^(k), a^(k)) = lim δ^(k)(s^(k), a^(k)) as k → ∞, we obtain

ṗ_o(t) = −δ(s^(k), a^(k)) ϑ_{s^(k),a^(k)}(p_o). (22)

The claim follows.
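For intuition about the construction above, the sketch below builds the β-induced times t_k̂ = Σ_{j<k̂} β(j) and the piecewise-constant interpolation p̂^(0)(t); the policy sequence is synthetic, and β follows the 1/(k log k) schedule.

```python
import math
import numpy as np

# Construct the beta-induced times t_khat and the piecewise-constant
# interpolation p_hat^(0)(t) used in the proof of Theorem 1. The policy
# sequence below is synthetic, for illustration only.
K = 200
beta = np.array([1.0 / ((j + 2) * math.log(j + 2)) for j in range(K)])
t = np.concatenate(([0.0], np.cumsum(beta)))        # t_0, t_1, ..., t_K
p_hat = np.cumsum(-0.05 * np.random.default_rng(2).standard_normal(K))

def p_interp(time: float) -> float:
    """p_hat^(0)(t): constant on each interval [t_khat, t_{khat+1})."""
    khat = int(np.searchsorted(t, time, side="right")) - 1
    return float(p_hat[min(max(khat, 0), K - 1)])

print([round(p_interp(x), 3) for x in (0.0, 1.0, 5.0, 20.0)])
```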
REFERENCES

[1] J. Wu, S. Rangan, and H. Zhang, Eds., Green Communications: Theoretical Fundamentals, Algorithms and Applications, 1st ed. CRC Press, Sep. 2012.
[2] H. Zhang, A. Gladisch, M. Pickavet, Z. Tao, and W. Mohr, "Energy efficiency in communications," IEEE Commun. Mag., vol. 48, no. 11, pp. 48–49, Nov. 2010.
[3] M. Marsan, L. Chiaraviglio, D. Ciullo, and M. Meo, "Optimal energy savings in cellular networks," in Proc. IEEE ICC Workshops, Dresden, Germany, Jun. 2009.
[4] China Mobile Research Institute, "C-RAN: road towards green radio access network," Tech. Rep., 2010.
[5] G. P. Fettweis and E. Zimmermann, "ICT energy consumption - trends and challenges," in Proc. WPMC, vol. 4, Lapland, Finland, Sep. 2008.
[6] K. Son, H. Kim, Y. Yi, and B. Krishnamachari, "Base station operation and user association mechanisms for energy-delay tradeoffs in green cellular networks," IEEE J. Select. Areas Commun., vol. 29, no. 8, pp. 1525–1536, Sep. 2011.
[7] C. Peng, S.-B. Lee, S. Lu, H. Luo, and H. Li, "Traffic-driven power savings in operational 3G cellular networks," in Proc. ACM Mobicom, Las Vegas, Nevada, USA, Sep. 2011.
[8] Z. Niu, "TANGO: traffic-aware network planning and green operation," IEEE Wireless Commun., vol. 18, no. 5, pp. 25–29, Oct. 2011.
[9] L. Chiaraviglio, D. Ciullo, M. Meo, M. Marsan, and I. Torino, "Energy-aware UMTS access networks," in Proc. WPMC, Lapland, Finland, Sep. 2008.
[10] Z. Niu, Y. Wu, J. Gong, and Z. Yang, "Cell zooming for cost-efficient green cellular networks," IEEE Commun. Mag., vol. 48, no. 11, pp. 74–79, Nov. 2010.
[11] R. Li, Z. Zhao, Y. Wei, X. Zhou, and H. Zhang, "GM-PAB: a grid-based energy saving scheme with predicted traffic load guidance for cellular networks," in Proc. IEEE ICC, Ottawa, Canada, Jun. 2012.
[12] E. Oh and B. Krishnamachari, "Energy savings through dynamic base station switching in cellular wireless access networks," in Proc. IEEE Globecom, Miami, Florida, USA, Dec. 2010.
[13] S. Zhou, J. Gong, Z. Yang, Z. Niu, and P. Yang, "Green mobile access network with dynamic base station energy saving," in Proc. ACM Mobicom, Beijing, China, Sep. 2009.
[14] V. Konda and V. Borkar, "Actor-critic-type learning algorithms for Markov decision processes," SIAM J. Contr. Optim., vol. 38, no. 1, pp. 94–123, 1999.
[15] V. Konda and J. Tsitsiklis, "Actor-critic algorithms," SIAM J. Contr. Optim., vol. 42, no. 4, pp. 1143–1166, 2000.
[16] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998. [Online]. Available: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/
[17] H. Berenji and D. Vengerov, "A convergent actor-critic-based FRL algorithm with application to power management of wireless transmitters," IEEE Trans. Fuzzy Syst., vol. 11, no. 4, pp. 478–485, 2003.
[18] F. Woergoetter and B. Porr, "Reinforcement learning," Scholarpedia, vol. 3, no. 3, p. 1448, 2008.
[19] X. Zhou, Z. Zhao, R. Li, Y. Zhou, and H. Zhang, "The predictability of cellular networks traffic," in Proc. IEEE ISCIT, Gold Coast, Australia, Oct. 2012.
[20] M. Taylor and P. Stone, "Transfer learning for reinforcement learning domains: A survey," J. Mach. Learn. Res., vol. 10, pp. 1633–1685, Jul. 2009.
[21] D. Aha, M. Molineaux, and G. Sukthankar, "Case-based reasoning in transfer learning," Lect. Notes Artif. Int., pp. 29–44, 2009.
[22] S. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowledge Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[23] L. A. Celiberto Jr., J. P. Matsuura, R. L. de Mantaras, and R. A. C. Bianchi, "Using transfer learning to speed-up reinforcement learning: a case-based approach," in Proc. LARS, Washington, DC, USA, Oct. 2010.
[24] R. Li, Z. Zhao, X. Chen, and H. Zhang, "Energy saving through a learning framework in greener cellular radio access networks," in Proc. IEEE Globecom, Anaheim, USA, Dec. 2012.
[25] H. Kim, G. de Veciana, X. Yang, and M. Venkatachalam, "Alpha-optimal user association and cell load balancing in wireless networks," in Proc. IEEE INFOCOM, San Diego, CA, USA, Mar. 2010.
[26] S. Das, H. Viswanathan, and G. Rittenhouse, "Dynamic load balancing through coordinated scheduling in packet data systems," in Proc. IEEE INFOCOM, San Francisco, USA, 2003.
[27] A. Sang, X. Wang, M. Madihian, and R. Gitlin, "Coordinated load balancing, handoff/cell-site selection, and scheduling in multi-cell packet data systems," Wireless Networks, vol. 14, no. 1, pp. 103–120, 2008.
[28] A. Leon-Garcia, Probability, Statistics, and Random Processes for Electrical Engineering.
[29] in Proc. IEEE ICC, Kyoto, Japan, Jun. 2011.
[30] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: Standard and natural policy gradients," IEEE Trans. Syst., Man, Cybern. C, vol. 42, no. 6, pp. 1291–1307, 2012.
[31] S. Singh, T. Jaakkola, M. Littman, and C. Szepesvári, "Convergence results for single-step on-policy reinforcement-learning algorithms," Mach. Learn., vol. 38, no. 3, pp. 287–308, 2000.
[32] H. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications.
[33] in Proc. IEEE INFOCOM, Orlando, Florida, USA, Mar. 2012.
[34] IEEE 802.16 Broadband Wireless Access Working Group, "IEEE 802.16m evaluation methodology document (EMD)," Jul. 2008. [Online]. Available: http://ieee802.org/16
[35] S. Kullback and R. Leibler, "On information and sufficiency," Ann. Math. Statist., vol. 22, no. 1, pp. 79–86, 1951.