Resource Allocation Using Gradient Boosting Aided Deep Q-Network for IoT in C-RANs
Yifan Luo, Jiawei Yang, Wei Xu, Senior Member, IEEE, Kezhi Wang, Member, IEEE, and Marco Di Renzo, Senior Member, IEEE
Abstract—In this paper, we investigate dynamic resource allocation (DRA) problems for the Internet of Things (IoT) in real-time cloud radio access networks (C-RANs), combining gradient boosting approximation and deep reinforcement learning to address the following two major problems. First, in C-RANs, the decision-making process of resource allocation is time-consuming and computationally expensive, which motivates us to use an approximation method, namely the gradient boosting decision tree (GBDT), to approximate the solutions of the second-order cone programming (SOCP) problem. Second, considering the innumerable states in real-time C-RAN systems, we employ a deep reinforcement learning framework, namely the deep Q-network (DQN), to generate a robust policy that controls the status of the remote radio heads (RRHs). We propose a GBDT-based DQN framework for the DRA problem, in which the heavy computation of solving SOCP problems is cut down and considerable power consumption is saved in the whole C-RAN system. We demonstrate that the generated policy is error-tolerant even if the gradient boosting regression may not strictly satisfy the constraints of the original problem. Comparisons between the proposed method and existing baseline methods confirm the advantages of our method.
Index Terms—Cloud radio access networks (C-RANs), resource allocation, Internet of Things (IoT), gradient boosting, reinforcement learning, deep Q-network, ensemble learning.
I. INTRODUCTION

The requirement for and development of Internet of Things (IoT) services, a key challenge in 5G, have been continuously rising, with the expanding diversity and density of IoT devices [1]. Cloud radio access networks (C-RANs) [2] are regarded as a promising mobile network architecture to meet this new challenge. Specifically, C-RANs separate base stations into radio units, commonly referred to as remote radio heads (RRHs), and a centralized signal-processing baseband unit (BBU) pool. In a C-RAN, the BBU pool can be placed in a convenient and easily accessible location, and RRHs can be deployed on poles or rooftops on demand. It is expected that the C-RAN architecture will be an integral part of future deployments to enable efficient IoT services.
Y. Luo and J. Yang contributed equally to this paper. Y. Luo and W. Xu are with the National Mobile Communications Research Laboratory (NCRL), Southeast University, Nanjing 210096, China (e-mail: [email protected]; [email protected]). J. Yang is with the Department of Computer Science and Engineering, Southeast University, Nanjing 210096, China (e-mail: [email protected]). K. Wang is with the Department of Computer and Information Sciences, Northumbria University, Newcastle upon Tyne NE1 8ST, U.K. M. Di Renzo is with the Laboratoire des Signaux et Systèmes, CNRS, CentraleSupélec, Univ Paris-Sud, Université Paris-Saclay, 3 rue Joliot Curie, Plateau de Saclay, 91192 Gif-sur-Yvette, France (e-mail: [email protected]).
Dynamic resource allocation (DRA) for IoT in C-RANs is indispensable to maintain acceptable performance. In order to obtain the optimal allocation strategy, several works have applied convex optimization, such as second-order cone programming (SOCP) in [3], semi-definite programming (SDP) in [4], and mixed-integer programming (MIP) in [5]. However, in real-time C-RANs where the environment keeps changing, the efficiency of the above methods in finding the optimal decision faces great challenges. Attempts have been made with reinforcement learning (RL) to increase the efficiency of the solution procedure in [3], [6].

RL has shown great advantages in solving DRA problems in wireless communication systems and for IoT. Existing approaches to the DRA problem in RANs generally model it as an RL problem [6], [7], [8] by setting different parameters as the reward. For instance, the work in [6] regarded the successful transmission probability of user requests as the reward, and another work in [8] set the sum of the average quality of service (QoS) and the average resource utilization of the slice as the reward. However, as the complexity of allocation problems increases, the search space of solutions tends to become infinite, which is hard to tackle.

By combining RL with deep neural networks (DNNs) [9], deep reinforcement learning (DRL) has been proposed and applied to address the above problems in [10], [11], [12]. By exploiting the ability of DNNs to extract useful features directly from a high-dimensional state space, DRL is able to perform end-to-end RL [9]. With the assistance of DNNs, large search spaces and continuous states are no longer insurmountable challenges.

To apply a DRL framework to DRA problems, the design of the reward, action and state becomes vital. The action set needs to be enumerable in most circumstances. The work in [3] used a two-step decision framework to guarantee enumerability, by changing the state of one RRH at each epoch, which performs well in models with innumerable states.

Furthermore, in DRA problems, obtaining the optimal allocation strategy is in most cases finally turned into another optimization problem, i.e., a convex optimization problem [13], which can be solved mathematically. Unfortunately, traditional algorithms [14], [15], [16] for solving convex optimization problems such as SOCP still face significant limitations, such as being time-consuming, which makes it hard to generate a policy for large-scale systems.

Recent works have achieved significant improvements in computational efficiency by applying DNN approximators [17], [18], [19], [20] to DRA problems. However, the unstable performance of DNNs in the regression process makes it hard to achieve good performance [21]. With a large number of hyper-parameters, fine tuning becomes even harder in practical systems. Some researchers have discussed and investigated this problem in the computability theory and information theory domains, e.g., in [22].

The gradient boosting machine (GBM) [23] is a member of the boosting family of algorithms [24], [25], a sub-branch of ensemble learning [26], [27], [28]. It has been firmly established as one of the state-of-the-art approaches in the machine learning (ML) community, and it has played a dominating role in data mining and machine learning competitions [29] due to its fast training and excellent performance.
However, to the best of our knowledge, few works have applied this method to the DRA problem, or even to other regression problems in communication systems.

In this paper, to efficiently address the DRA problem for IoT in C-RANs with innumerable states, one common form of DRL, namely the deep Q-network (DQN), is employed. Moreover, to tackle the difficulty of obtaining the reward in the DQN with low latency, a tree-based GBM, i.e., the gradient boosting decision tree (GBDT), is utilized to approximate the solutions of the SOCP. Then, we demonstrate the improvement of our method by comparing it with traditional methods in simulations. The main contributions of this paper are as follows:

• We first give the model of the dynamic resource allocation problem for IoT in the real-time C-RAN. Then, we propose a GBDT-based regressor to approximate the SOCP solution of the optimal transmit power consumption, which serves as the immediate reward needed in the DQN. By doing so, there is no need to solve the original SOCP problem every time, and therefore great computational cost can be saved.

• Next, we aggregate the GBDT-based regressor with a DQN to propose a new framework, where the immediate reward is obtained from the GBDT-based regressor instead of SOCP solutions, to generate the optimal policy controlling the states of the RRHs. The proposed framework can save the power consumption of the whole C-RAN system for IoT.

• We show the performance gain and complexity reduction of our proposed solution by comparing it with existing methods.

The remainder of this paper is organized as follows. Section II presents the related works, whereas the system model is given in Section III. Section IV introduces the proposed GBDT-based DQN framework. The simulation results are reported in Section V, followed by the conclusions presented in Section VI.
II. RELATED WORKS
The resource allocation problem in C-RANs is normally formulated as an optimization problem, where one needs to search the decision space to find an optimal combination of decisions that optimizes different goals [13], [30], [31] based on the current situation. Although numerous researchers have devoted their time to finding solutions to optimization problems, most of them are still hard or impossible to tackle with traditional purely mathematical methods. RL has recently been applied to address those problems.

In [32], a model-free RL model was adopted to solve the adaptive selection problem between backhaul and fronthaul transfer modes, aiming to minimize the long-term delivery latency in a fog radio access network (F-RAN). Specifically, an online on-policy value-based strategy, State-Action-Reward-State-Action (SARSA) with linear approximation, was applied in this system. Moreover, some works have proposed more efficient RL methods to overcome the slow convergence and scalability issues of traditional RL-based algorithms such as Q-learning. In [33], four methods, i.e., state space reduction techniques, convergence speed-up methods, demand forecasting combined with an RL algorithm, and DNNs, were proposed to handle the aforementioned problems, especially the huge state space.

Furthermore, as reported in [34], the DQN achieved better performance on resource allocation problems compared with the traditional Q-learning based method. In practice, the size of the possible state space may be very large or even infinite, which makes it impossible to traverse each state as required by traditional Q-learning. Approximation methods can address this kind of problem: they map the continuous and innumerable state space to a near-optimal Q-value space in a continuous setting, rather than a Q-table. DNNs show their advantage of approximation in high-dimensional spaces in many domains. Therefore, adopting a DNN to estimate the Q-value can improve the system performance and computing efficiency, as reported in the simulation results of [34].

In [3], a two-step decision framework was adopted to solve the enumerability problem of the action space in C-RANs. The DRL agent first determined which RRH to turn on or off, and then the agent obtained the resource allocation solution by solving a convex optimization problem. Any more complex action can be decomposed into this two-step decision, reducing the action space significantly. Moreover, the work in [6] shows the impracticality of the SA (i.e., single BS association) scheme even in a small-scale C-RAN. Specifically, the SA scheme abandons the collaboration of the RRHs and only supports few users. This research is a guidance to our work.

The works in [6] and [8] both adopted DRL methods to solve resource allocation problems in RAN settings. In [8], the concept of intelligent allocation based on DRL was proposed to tackle the cache resource optimization problem in F-RANs. To satisfy users' QoS, the caching schemes should be intelligent, i.e., more effective and self-adaptive. Considering the limitation of cache space, this requirement challenges the design of such schemes and motivates the adoption of DRL techniques.

As reported in [6], a DRL-based framework is used in more complicated resource allocation problems, i.e., virtualized radio access networks. Based on the average QoS utility and resource utilization of users, the DQN-based autonomous resource management framework enables virtual operators to customize their own utility functions and objective functions according to different requirements.
Fig. 1. Dynamic resource allocation for IoT under the DRL framework in C-RANs.

In this paper, to improve the system efficiency, we propose a novel gradient-boosting-based DQN framework for the resource allocation problem, which significantly improves the system performance through offline training and online running. To the best of our knowledge, there are few works applying gradient boosting machines to approximate solutions of convex optimization problems in wireless communications, and we are the first to propose this framework.
III. SYSTEM MODEL
A. Network Model
We consider a typical C-RAN architecture, a single-cell model with a set of m RRHs denoted by R = {r_1, r_2, ..., r_m} and a set of n users, which can be IoT devices, denoted by U = {u_1, u_2, ..., u_n}. In the DRA for IoT in C-RANs shown in Fig. 1, we can obtain the current states, i.e., the state of each RRH and the demands of the IoT device users, from the network in the k-th decision epoch t_k. All the RRHs are connected to the centralized BBU pool, meaning that all information can be shared and processed by the DQN-based agent to make decisions, i.e., turning the RRHs on or off. We simplify the model by assuming that all RRHs and users are equipped with a single antenna, which can readily be generalized to the multi-antenna case by using the technique proposed in [35].

Then, the corresponding signal-to-interference-plus-noise ratio (SINR) at the receiver of the i-th user u_i is given by

$$\mathrm{SINR}_i = \frac{\left|\mathbf{h}_{u_i}^{T}\mathbf{w}_{u_i}\right|^{2}}{\sum_{u_j \neq u_i}\left|\mathbf{h}_{u_i}^{T}\mathbf{w}_{u_j}\right|^{2} + \sigma^{2}}, \quad u_i \in \mathcal{U} \qquad (1)$$

where h_{u_i} = [h_{r_1 u_i}, h_{r_2 u_i}, ..., h_{r_m u_i}]^T denotes the channel gain vector, with element h_{r_l u_i} denoting the channel gain from RRH r_l ∈ R to user u_i; w_{u_i} = [w_{r_1 u_i}, w_{r_2 u_i}, ..., w_{r_m u_i}]^T denotes the vector of all RRHs' beamforming weights towards user u_i, with element w_{r_l u_i} denoting the beamforming weight of RRH r_l ∈ R allocated to user u_i; and σ² is the noise power. According to the Shannon formula, the data rate of user u_i is given by

$$R_i = B \log_2\!\left(1 + \frac{\mathrm{SINR}_i}{\Gamma_m}\right), \quad u_i \in \mathcal{U} \qquad (2)$$

where B is the channel bandwidth and Γ_m is the SINR margin depending on a couple of practical considerations, e.g., the modulation scheme.

The relationship between the transmit power and the power consumed by the base station can be approximated as nearly linear, according to [36]. Then, we apply the following linear power model for each RRH:

$$P_i = \begin{cases} P_{r_i,A} + \dfrac{1}{\eta} P_{r_i,Tx}, & r_i \in \mathcal{A} \\ P_{r_i,S}, & r_i \in \mathcal{S} \end{cases} \qquad (3)$$

where P_{r_i,Tx} = Σ_{u_j ∈ U} |w_{r_i u_j}|² is the transmit power of RRH r_i; η is a constant denoting the drain efficiency of the power amplifier; and P_{r_i,A} is the power consumption of RRH r_i when it is active without transmitting signals. In the case of no need for transmission, r_i can be set to the sleep mode, whose power consumption is P_{r_i,S}. Thus, one has A ∪ S = R. In addition, we take into consideration the power consumed by RRH state transitions, i.e., the power needed to change the RRHs' states. We put the RRHs that reverse their states in the current epoch into the set T and use P_{r_i,T} to denote the power needed to change the mode between Active and
Sleep, i.e., we assume both transitions consume the same power. Therefore, in the current epoch, the total power consumption of all RRHs can be written as

$$P_{\mathrm{Total}} = \underbrace{\sum_{r_i \in \mathcal{A}}\sum_{u_j \in \mathcal{U}} \frac{1}{\eta}\left|w_{r_i u_j}\right|^{2}}_{\text{Transmit Power}} + \underbrace{\sum_{r_i \in \mathcal{A}} P_{r_i,A} + \sum_{r_i \in \mathcal{S}} P_{r_i,S}}_{\text{State Power}} + \underbrace{\sum_{r_i \in \mathcal{T}} P_{r_i,T}}_{\text{Transition Power}}. \qquad (4)$$
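For concreteness, Equation (4) can be evaluated directly once the beamforming weights and the RRH modes are known. The following Python sketch is a minimal illustration with variable names of our own choosing (W for the beamforming weights; the power constants are the values later listed in Table I); it is not the authors' code.

import numpy as np

def total_power(W, active, switched, P_A=6.8, P_S=4.3, P_T=2.0, eta=0.25):
    """Evaluate Eq. (4): transmit + state + transition power (illustrative sketch).

    W        : (m, n) complex beamforming weights w_{r_i u_j} (rows of sleeping RRHs are zero)
    active   : length-m boolean mask, True if RRH r_i is active
    switched : length-m boolean mask, True if RRH r_i changed state this epoch
    """
    active = np.asarray(active, dtype=bool)
    switched = np.asarray(switched, dtype=bool)
    transmit = np.sum(np.abs(np.asarray(W)[active, :]) ** 2) / eta   # (1/eta) * sum |w|^2
    state = P_A * np.count_nonzero(active) + P_S * np.count_nonzero(~active)
    transition = P_T * np.count_nonzero(switched)
    return transmit + state + transition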
B. CP-Beamforming

From Equation (4), one can see that the latter two parts are easy to calculate, as they are composed of constants and rely only on the current state and action. To minimize P_Total, it is necessary to calculate the minimal transmit power in each epoch, which depends on the allocation of beamforming weights among the active RRHs. Therefore, this optimization problem can be expressed as:

• Control Plane (CP)-Beamforming:

$$\min_{w_{r_i u_j}} \; P_T \qquad (5)$$
$$\text{s.t.} \quad P_T = \sum_{r_i \in \mathcal{A}}\sum_{u_j \in \mathcal{U}} \left|w_{r_i u_j}\right|^{2} \qquad (5.1)$$
$$R_i \le B \log_2\!\left(1 + \frac{\mathrm{SINR}_i}{\Gamma_m}\right), \quad u_i \in \mathcal{U} \qquad (5.2)$$
$$\sum_{u_j \in \mathcal{U}} \left|w_{r_i u_j}\right|^{2} \le P_{r_i}, \quad r_i \in \mathcal{A} \qquad (5.3)$$

where the objective is to obtain the minimal total transmit power given the states of the RRHs and the user demands. The variables w_{r_i u_j} are the distributive weights corresponding to the beamforming power; R_i is the user demand; SINR_i is given by Equation (1); and P_{r_i} is the transmit power constraint of RRH r_i. Constraint (5.2) ensures that the demands of all users are met, whereas Constraint (5.3) enforces the transmit power limit of each RRH.

As shown in [4], the above CP-Beamforming problem can be transformed into an SOCP problem. Therefore, we rewrite the above optimization as:

• Modified CP-Beamforming:

$$\min_{w_{r_i u_j}} \; P_T \qquad (6)$$
$$\text{s.t.} \quad \sum_{u_j \in \mathcal{U}} \left|\mathbf{h}_{u_i}^{H}\mathbf{w}_{u_j}\right|^{2} + \sigma^{2} \le \mu_{u_i} \left|\mathbf{h}_{u_i}^{H}\mathbf{w}_{u_i}\right|^{2}, \quad u_i \in \mathcal{U} \qquad (6.1)$$
$$\sum_{u_j \in \mathcal{U}} \left|w_{r_i u_j}\right|^{2} \le P_{r_i}, \quad r_i \in \mathcal{A} \qquad (6.2)$$
$$\sum_{r_i \in \mathcal{A}}\sum_{u_j \in \mathcal{U}} \left|w_{r_i u_j}\right|^{2} \le P_T \qquad (6.3)$$
$$\mu_i = \frac{\iota_i + 1}{\iota_i}, \quad u_i \in \mathcal{U} \qquad (6.4)$$
$$\iota_i = \Gamma_m\left(2^{R_i/B} - 1\right) \qquad (6.5)$$

where we introduce the variable P_T to replace the objective (5.1) by adding Constraint (6.3), which is a common step in such transformations [37]. We also rewrite Constraint (5.2) as Constraint (6.1) and apply some simple manipulations to obtain the above modified optimization.

It is now easy to see that the above Modified CP-Beamforming optimization has the form of a standard SOCP problem. By using the iterative algorithm proposed in [38], we can obtain the optimal solutions. It is worth noting that the CP-Beamforming optimization may have no feasible solution; in this case, more RRHs should be activated to satisfy the user demands, and we give a large negative reward to the DQN agent and jump out of the current training loop.

Then, we can calculate the total power consumption by applying Equation (4). In the following section, we propose the DQN-based framework to decide the states of the RRHs and adopt the GBDT to approximate the solutions of the aforementioned SOCP problems.
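As a concrete illustration of the Modified CP-Beamforming problem in (6), the sketch below formulates it with the cvxpy modeling package; the paper itself relies on a dedicated SOCP solver, so this is only an assumed, hedged rendering. Constraint (6.1) is written in its second-order cone form via the usual phase-rotation argument. All function and variable names are ours.

import numpy as np
import cvxpy as cp

def cp_beamforming(H, rates, B, gamma_m, sigma2, p_max):
    """Sketch of the SOCP in (6) for the currently active RRHs.

    H      : (m, n) complex channels h_{r_i u_j} from the m active RRHs to the n users
    rates  : length-n user demands R_i (bit/s); B: bandwidth; gamma_m: SINR margin
    sigma2 : noise power; p_max: per-RRH transmit power limit P_{r_i}
    Returns (minimal transmit power P_T, weights) or None if infeasible.
    """
    m, n = H.shape
    W = cp.Variable((m, n), complex=True)                       # w_{r_i u_j}
    iota = gamma_m * (2.0 ** (np.asarray(rates) / B) - 1.0)     # target SINRs, Eq. (6.5)
    constraints = []
    for i in range(n):
        h = H[:, i]
        rx = h.conj() @ W                                       # h_{u_i}^H w_{u_j} for all j
        noise = np.array([np.sqrt(sigma2)], dtype=complex)
        # (6.1) as a cone: ||[rx, sigma]|| <= sqrt(mu_i) * Re(h^H w_i), mu_i = (iota_i+1)/iota_i
        constraints.append(
            cp.norm(cp.hstack([rx, noise]))
            <= np.sqrt((iota[i] + 1.0) / iota[i]) * cp.real(h.conj() @ W[:, i])
        )
    for r in range(m):                                          # (6.2) per-RRH power limit
        constraints.append(cp.sum_squares(W[r, :]) <= p_max)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(W)), constraints)   # objective (5.1)/(6.3)
    prob.solve(solver=cp.ECOS)                                  # any SOCP-capable solver works
    if prob.status not in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE):
        return None                                             # infeasible: activate more RRHs
    return prob.value, W.value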
IV. GBDT-AIDED DEEP Q-NETWORK FOR DRA IN C-RANS

A. State, Action Space and Reward Function
Our goal in the aforementioned DRA problem is to generate a policy that minimizes the system's power consumption in any state by taking the best action. Here, the best action refers to the action that, among all available actions, contributes the least to the overall power consumption in the long term while satisfying the user demands, system requirements and constraints. The fundamental idea of RL-based methods is to abstract an agent and an environment from the given problem to build the environment model [39], and to let the agent find the optimal action in each state so as to maximize the cumulative discounted reward by exploring the environment and receiving the immediate rewards signalled by the environment.

To apply the RL method to our problem, we transform the system model defined in Section III into an RL model. The common assumption that future rewards are discounted by a factor of γ per time-step is made here. Then, the cumulative discounted reward from time-step t can be expressed as

$$R_t = \mathbb{E}\!\left[\sum_{k=t}^{\infty} \gamma^{k} r_k(s_k, a_k) \,\middle|\, s_t = s, a_t = a\right] \qquad (7.1)$$

where E(·) denotes the mathematical expectation; r_k(·) denotes the k-th reward; s_k denotes the k-th state; and γ ∈ (0, 1] denotes the discount factor. If γ tends to 0, the agent only considers the immediate reward, whereas if γ tends to 1, the agent focuses on the future reward. Moreover, the infinite upper limit of the summation reflects the endless decision sequence in the DRA problem.

Leveraging the common definition in Q-learning, the optimal action-value function Q*(s, a) is defined as the greatest expected cumulative discounted reward that can be reached by taking action a in state s and then following an optimal policy, which guarantees the optimality of the cumulative future reward. The function Q*(s, a) satisfies the Bellman equation, a well-known identity in optimality theory. In this model, the optimal action-value function representing the maximum cumulative reward from state s with action a can be expressed as

$$Q^{*}(s, a) = \mathbb{E}\!\left[r_{(s,a)} + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a\right] \qquad (7.2)$$

where r_{(s,a)} denotes the immediate reward received in state s if action a is taken; a' denotes a possible action in the next state s'; and the other symbols have the same meaning as in Equation (7.1). The expression means that the agent takes action a in state s, receives the immediate reward r_{(s,a)}, and then follows an optimal trajectory that leads to the greatest Q value.

In a general view, Q*(s, a) quantifies how promising the final expected cumulative reward will be if action a is taken in state s. That is to say, in the DRA problem, it measures how much power consumption the C-RAN can cut down if it decides to take action a, i.e., switches one selected RRH on or off, when observing state s, i.e., a set of user demands and the states (sleep/active) of the RRHs. Since the true value of Q*(s, a) can never be known exactly, our goal is to employ a DNN to learn an approximation Q(s, a). In the following sections, Q(s, a) denotes this approximation of Q*(s, a) and is assumed to share its properties.

The generic policy function π(s) defined in the context of RL is used here, which can be expressed as

$$\pi(s) = \arg\max_{a} Q^{*}(s, a) \qquad (7.3)$$

where π(s) is the argmax of the action-value function Q(s, a) over all possible actions in a specific state s.
The policy function thus selects, in every state, the action that maximizes the Q(s, a) value.
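A minimal sketch of the greedy policy in (7.3) and the return in (7.1), assuming a Q-function approximator is available (names and the example discount factor are ours):

import numpy as np

def greedy_policy(q_values):
    """Eq. (7.3): pick the action with the largest estimated Q(s, a).

    q_values : length-(m+1) array of Q(s, a) for switching each of the m RRHs,
               or keeping all RRH states unchanged (the 'no-op' action).
    """
    return int(np.argmax(q_values))

def discounted_return(rewards, gamma=0.9):
    """Eq. (7.1): cumulative discounted reward of one observed trajectory."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))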
Fig. 2. The DQN-based scheme for dynamic resource allocation.
The state, action and reward of our problem are defined as follows:

• State: The state has two components: a set of RRH states and a set of user demands. Specifically, Y = [y_1, y_2, ..., y_m] is defined as the set of all m RRHs' states, in which y_i ∈ {0, 1} denotes the state of RRH i. If y_i = 0, RRH i is in the sleep state, whereas y_i = 1 means that it is in the active state. D = [d_1, d_2, ..., d_n] is defined as the set of all n users' demands, and d_j ∈ [d_min, d_max] denotes the demand of user j, where d_min is the minimum and d_max the maximum of all demands. Thus, the RL state is expressed as Y ∪ D = [y_1, y_2, ..., y_m, d_1, d_2, ..., d_n] and its cardinality is m + n.

• Action: In each decision epoch, the RL agent determines the next state of at most one RRH. We use the set A = {α_1, α_2, ..., α_m} to denote the action space, in which α_i ∈ {0, 1}, i = 1, 2, ..., m. If α_i = 1, RRH i changes its state; otherwise the RRH keeps its current state in the next epoch. The action space can thereby be substantially reduced. It is noteworthy that we impose the restriction 0 ≤ Σ_{i=1}^{m} α_i ≤ 1, which means that only one RRH, or none, alters its state, reducing the action space to size m + 1.

• Reward: To minimize the total power consumption, we define the immediate reward as the difference between the upper bound of the power consumption and the actual power consumption, expressed as

$$\nabla_k = P_{\mathrm{UB}} - P_{\mathrm{Total}}$$

where P_UB denotes the upper bound of the power consumption obtained from the system setting, and P_Total denotes the actual total power consumption of the system, composed of the three parts defined in Equation (4). More specifically, the reward is defined so as to minimize the system power consumption under the condition of satisfying the user demands, which requires solving the optimization problem in Equation (6), shown in Section III.

To sum up, the policy considered in this work is a function that maps the current state s, i.e., the set of user demands and RRH statuses, to the best action a, i.e., turning one RRH on or off, so as to minimize the overall power consumption of the whole system.
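These definitions translate directly into code. The sketch below encodes a state as the concatenation [y_1,...,y_m, d_1,...,d_n], applies an action by flipping at most one RRH, and computes the reward ∇_k = P_UB − P_Total; it reuses the hypothetical total_power helper sketched in Section III, and all names are ours.

import numpy as np

def encode_state(rrh_states, demands):
    """State vector of cardinality m + n: RRH on/off flags followed by user demands."""
    return np.concatenate([np.asarray(rrh_states, dtype=float),
                           np.asarray(demands, dtype=float)])

def apply_action(rrh_states, action):
    """Action a in {0, 1, ..., m}: flip RRH `action`, or keep everything if action == m."""
    next_states = np.asarray(rrh_states).copy()
    switched = np.zeros(len(next_states), dtype=bool)
    if action < len(next_states):              # at most one RRH changes its state
        next_states[action] = 1 - next_states[action]
        switched[action] = True
    return next_states, switched

def reward(p_upper_bound, p_total):
    """Immediate reward: gap to the power upper bound (larger gap = less power used)."""
    return p_upper_bound - p_total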
B. Gradient Boosting Decision Tree

GBM is a gradient boosting framework that can be applied to any classifier or regressor. More specifically, GBM is an aggregation of base estimators (i.e., classifiers or regressors): any base estimator, such as a K-nearest-neighbour, neural network or naive Bayes estimator, can be fitted within the GBM, and better base estimators yield higher performance. Among all kinds of GBM, a prominent one is based on decision trees and is called the gradient boosting decision tree (GBDT), which has been gaining popularity for years due to its competitive performance in different areas. In our framework, the GBDT is applied to the regression task due to its prominent performance.

The concept of GBDT is to optimize the empirical risk via steepest gradient descent in hypothesis space by adding base tree estimators. Considering the regression task in our work, given a dataset with n entities of different states and their corresponding rewards generated by simulation and by solving the SOCP, one has D = {(x_i, p_i)} (|D| = n, x_i ∈ D ∪ Y, p_i ∈ S), where x_i denotes the state representation [y_{i1}, y_{i2}, ..., y_{im}, d_{i1}, d_{i2}, ..., d_{in}] of the system model and p_i denotes the corresponding solution of the SOCP solver from Equation (6), in line with the definition of the reward function. Optimizing the empirical risk of the regression means minimizing the expectation of a well-defined loss function over the given dataset D, which can be expressed as

$$\mathbb{E}_{\mathcal{D}}\left[L(\widehat{P}, P)\right] = l(\widehat{P}, P) + \Omega(\phi) = l(f(X), P) + \Omega(\phi) \qquad (8)$$

where φ denotes the model itself and f(X) is the final mapping that approximates P, our fitting target, i.e., the power consumption. X is the set of x_i representing the system model, and P is the set of p_i representing the solutions of the SOCP solver. The first term is the model prediction loss, a differentiable convex function that measures the distance between the true power consumption and the estimated power consumption; the L2 loss (i.e., the mean-square error) is applied in this task. The latter term is the regularization penalty applied to constrain the model complexity, which helps to obtain a model with less over-fitting and better generalization performance.

The choice of prediction loss and regularization penalty varies with the circumstances. Here, the penalty function is given by

$$\Omega(f) = \beta T + \frac{1}{2}\lambda \|w\|^{2}$$

where β and λ are two hyper-parameters, while T and w are the number of ensembled trees and the weights owned by each tree, respectively. When the regularization parameters are set to zero, the loss function falls back to the traditional gradient tree boosting method [40].

GBDT starts with a weak model that simply predicts the mean value of P at each leaf and improves the prediction by aggregating K additive fixed-size decision trees as base estimators that predict the pseudo-residuals of the previous results. The final prediction is a linear combination of the outputs of the K regression trees. The final estimator function can be expressed as

$$f(x) = \sum_{k=0}^{K} f_k(x) = f_0(x) + \sum_{k=1}^{K} \theta_k \phi_k(x) \qquad (9)$$

where f_0 is the initial guess, φ_k(x) is the base estimator at iteration k, and θ_k is the weight of the k-th estimator or a fixed learning rate. The product θ_k φ_k(x) denotes the step at iteration k.
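Since the simulations in Section V use LightGBM, the GBDT regressor of this subsection could be sketched as follows; the feature matrix X stacks the state vectors [y, d] and the targets p are the SOCP transmit powers. The hyper-parameter values below are placeholders of ours, not the paper's.

import lightgbm as lgb

# X: (N, m+n) states [y_1..y_m, d_1..d_n]; p: (N,) optimal SOCP transmit powers
def train_gbdt_regressor(X, p):
    """Fit a GBDT to approximate the SOCP solution (illustrative hyper-parameters)."""
    model = lgb.LGBMRegressor(
        n_estimators=500,      # number of additive trees K
        learning_rate=0.1,     # step weight theta_k
        num_leaves=31,         # tree complexity, controlled by the Omega(f) penalty
        reg_lambda=1.0,        # L2 penalty lambda on leaf weights
    )
    model.fit(X, p)
    return model

# usage sketch: p_hat = model.predict(encode_state(y, d).reshape(1, -1))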
C. GBDT-based Deep Q-Network (DQN)

In this section, we show how to apply the GBDT-based DQN scheme to solve our DRA problem for IoT in real-time C-RANs, using the previously defined states, actions and reward. Traditional RL methods, like Q-learning, compute and store the Q value of each state-action pair in a table. It is unrealistic to apply those methods to our problem, as the state-action pairs are countless and the user demands in a state are continuous variables. Therefore, the DQN is considered the best solution for this problem. Similar to the related works, e.g., [8], [34], we also apply an experience replay buffer and fixed Q-targets to estimate the action-value function Q(s, a).

Our framework includes two stages, i.e., offline training, and online decision making with regular tuning:

• In the offline training stage, we pre-train the DQN to estimate the value of taking each action in any specific state. To achieve this, millions of system data samples are generated in terms of RRH states, user demands and the corresponding system power consumption, by simulation and by solving the SOCP problem given in Equation (6). Then, the GBDT is employed to estimate the immediate reward, alleviating the expensive computation of solving the SOCP problem during further training and tuning.

• For online decision making and regular tuning, we load the pre-trained DQN to generate the best action for our proposed DRA problem in real time. This is achieved by employing the policy function defined in (7.3), which maximizes Q(s, a) in state s. To emphasize, the Q(s, a) function tells how much the system can cut down the power consumption if it decides to take action a when seeing state s. Then, the DQN observes the immediate reward r_t obtained from the GBDT approximation and the next state s_{t+1}. In the online regular tuning scheme, the DQN does not immediately update its model parameters when observing new states, but stores the new observations in the memory buffer. Then, under given conditions, the DQN fine-tunes its parameters according to that buffer. This allows the DQN to adapt dynamically and regularly to new patterns.

The whole procedure is given in Algorithm 1, whereas the framework of the GBDT-based DQN is shown in Fig. 2. Here θ denotes the set of model parameters; the loss function is the L2 loss (i.e., mean-square error), which measures the difference between the Q-target and the model output; and S_i refers to step i in Algorithm 1.
Algorithm 1 GBDT-based DQN framework

Offline:
S1: Generate millions of data samples randomly via the SOCP solver;
S2: Pre-train the GBDT with these data and save its model;
S3: Pre-train the DQN with these data following S6 to S15, and then save the well-trained network and its corresponding experience memory D;

Online:
S4: Load the replay memory with capacity N from the offline-trained experience memory D and load the GBDT model;
S5: Set the action-value function Q with weights θ from the offline-trained network;
S6: For each episode t:
S7:   a) Offline: with probability ε select a random action a_t, otherwise select a_t = π(s) when observing a new state s;
      b) Online: directly select a_t = π(s) when observing a new state s;
S8:   Execute action a_t;
S9:   Obtain reward r_t from a) Offline: the SOCP solver, or b) Online: the GBDT machine, and observe s_{t+1};
S10:  Store transition (s_t, a_t, r_t, s_{t+1}) in D;
S11:  If a) offline or b) online the given condition is reached, then execute S12 to S14;
S12:  Sample a random mini-batch of transitions (s_j, a_j, r_j, s_{j+1}) from D;
S13:  Set y_j = r_j if the episode terminates at step j+1; otherwise y_j = r_j + γ max_{a'} Q(φ_{j+1}, a'; θ⁻);
S14:  Apply a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ;
S15: EndFor
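Steps S12-S14 of Algorithm 1 correspond to the standard DQN update with experience replay and a fixed target network. The sketch below is a framework-agnostic rendering under assumed interfaces (q_net and target_net are callables returning the Q-values of all actions for a state; optimizer_step applies a gradient step on the squared TD error); it is not the authors' implementation.

import random
import numpy as np
from collections import deque

replay = deque(maxlen=100000)            # experience memory D with capacity N

def dqn_update(q_net, target_net, optimizer_step, batch_size=32, gamma=0.9):
    """One mini-batch update (Algorithm 1, S12-S14), assuming user-supplied networks."""
    batch = random.sample(list(replay), batch_size)                  # S12
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * np.max(target_net(s_next))    # S13: fixed Q-target
        td_error = y - q_net(s)[a]                                    # S14: y_j - Q(s_j, a_j; theta)
        optimizer_step(s, a, td_error)       # descend the squared TD error w.r.t. theta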
In Fig. 2, the left side depicts the DQN framework, illustrating the agent, the environment and how the reward is obtained. Specifically, the agent observes a new state from the environment after taking an action, and then receives an immediate reward signalled by the reward function from the GBDT approximator. A traditional DQN obtains the reward by solving the SOCP optimization, which cannot be done in real time, as explained before. In our architecture, we adopt the GBDT regression (i.e., the right side of Fig. 2) to obtain the reward, which can operate online in real time. We also give the training process of the GBDT in the Appendix.
D. Error Tolerance Examination (ETE)
Our target is to use the GBDT to approximate the typical SOCP problem in C-RANs under the DQN framework; thus, it is important to evaluate its practical performance. The error from the GBDT or DNN will influence the optimality of the resulting scheme, and may even worsen the power consumption of the whole system. Therefore, examining the influence of this error is of vital significance. Considering its important role in the whole DRA problem, we emphasize the concept of error tolerance examination (ETE) here. Specifically, in the simulations, we first compare the optimal decision provided by the CP-Beamforming solution with the near-optimal decision obtained from the GBDT or DNN approximation, and then evaluate its performance in the dynamic resource allocation setting.
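The ETE procedure above can be phrased as a simple check: for a batch of sampled states, compare the action ranked best when rewards come from the SOCP solver against the action ranked best when rewards come from the GBDT approximation. A hedged sketch, assuming reward_socp and reward_gbdt callables of our own naming:

def decision_agreement(states, actions, reward_socp, reward_gbdt):
    """Fraction of states in which GBDT-based rewards pick the same action as SOCP-based ones."""
    agree = 0
    for s in states:
        best_socp = max(actions, key=lambda a: reward_socp(s, a))
        best_gbdt = max(actions, key=lambda a: reward_gbdt(s, a))
        agree += int(best_socp == best_gbdt)
    return agree / len(states)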
V. SIMULATION RESULTS
In this section, we present the simulation settings and the performance of the proposed GBDT-based DQN solution. We take the definition of the channel fading model from previous work [41]:

$$h_{r,u} = 10^{-L(d_{r,u})/20}\,\sqrt{\varphi_{r,u}\, s_{r,u}}\; G_{r,u}$$

where L(d_{r,u}) is the path loss at distance d_{r,u}, φ_{r,u} is the antenna gain, s_{r,u} is the shadowing coefficient and G_{r,u} is the small-scale fading coefficient. The simulation settings are summarized in Table I.

All training and testing processes are conducted in an environment equipped with 8 GB RAM and an Intel Core i7-6700HQ (2.6 GHz), using Python 3.5.6, TensorFlow 1.13.1 and LightGBM 2.2.3.

We compare our DQN-based solution containing the GBDT approximator (abbreviated as DQN) with two other schemes:

1) All RRHs Open (AO): all RRHs are turned on and can serve each user;
2) One RRH Closed (OC): one randomly chosen RRH stays in the sleep state and cannot serve any user.

It is noteworthy that in previous work [42], another solution in which only one random RRH is turned on is also discussed for the dynamic resource allocation problem. However, it can hardly be applied to practical systems [3]; therefore, we do not compare against it in this paper.
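A sketch of how one channel realization h_{r,u} could be drawn according to the fading model above, using the Table I values for the antenna gain and shadowing; the path loss in dB is passed in rather than assumed, since its exact coefficients are not fully legible in the source. Names are ours.

import numpy as np

def channel_gain(pathloss_db, antenna_gain_dbi=9.0, shadowing_db=8.0, rng=None):
    """One realization of h_{r,u} = 10^{-L(d)/20} * sqrt(phi * s) * G (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    phi = 10 ** (antenna_gain_dbi / 10)                    # antenna gain phi_{r,u} (linear)
    s = 10 ** (rng.normal(0.0, shadowing_db) / 10)         # log-normal shadowing s_{r,u}
    g = (rng.normal() + 1j * rng.normal()) / np.sqrt(2)    # CN(0, 1) small-scale fading G_{r,u}
    return 10 ** (-pathloss_db / 20) * np.sqrt(phi * s) * g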
A. GBDT-based SOCP Approximator

1) Computational Complexity: We compare the computational complexity of the GBDT approximator with that of the traditional SOCP solver in [33]. Firstly, a test set of 1000 entities is randomly generated in terms of the status of the RRHs and the user demands. In addition, both the GBDT approximator and the traditional SOCP method are executed to predict or compute the outputs of that test set 10000 times, respectively. One can see from Table II that the GBDT approximator is much faster than the SOCP solver, which proves the efficiency of the GBDT approximator.
TABLE I
SIMULATION SETTINGS

Symbol      | Parameter                          | Value
B           | Channel bandwidth                  | 10 MHz
P_{r,max}   | Max transmit power                 | 1.0 W
P_{r,A}     | Active power                       | 6.8 W
P_{r,S}     | Sleep power                        | 4.3 W
P_{r,T}     | Transition power                   | 2.0 W
σ²          | Background noise                   | -102 dBm
φ_{r,u}     | Antenna gain                       | 9 dBi
s_{r,u}     | Log-normal shadowing               | 8 dB
G_{r,u}     | Rayleigh small-scale fading        | CN(0, I)
L(d_{r,u})  | Path loss at distance d_{r,u} (km) | 148.1 + 37.6 log10(d_{r,u}) dB
d_{r,u}     | Distance                           | Uniformly distributed
η           | Power amplifier efficiency         | 25%

W = Watt, dB = decibel, dBm = decibel-milliwatts, dBi = dB (isotropic).

TABLE II
COMPUTATIONAL COMPLEXITY COMPARISON

System Input Setup   | Average Time Per Input (GBDT) | Average Time Per Input (SOCP)
6 RRHs and 3 users   | 0.00079 s                     | –
12 RRHs and 6 users  | 0.00070 s                     | –
18 RRHs and 9 users  | 0.00075 s                     | –

The time in the above table is obtained by averaging over 1000 different system inputs, each of which is recalculated 10000 times by the two algorithms, respectively.
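The timing comparison of Table II can be reproduced, in spirit, with a simple wall-clock benchmark; the sketch below assumes the hypothetical gbdt model and cp_beamforming solver sketched earlier, and an unpack helper that turns a state vector into solver arguments.

import time

def average_time(fn, inputs, repeats=10000):
    """Average wall-clock time per input over `repeats` evaluations (as in Table II)."""
    start = time.perf_counter()
    for _ in range(repeats):
        for x in inputs:
            fn(x)
    return (time.perf_counter() - start) / (repeats * len(inputs))

# e.g. average_time(lambda x: gbdt.predict(x.reshape(1, -1)), test_states)
#      average_time(lambda x: cp_beamforming(*unpack(x)), test_states)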
2) Fitting Property:
Then, we analyse the performance of the GBDT approximator in specific situations, where we set 8 RRHs and 4 IoT device users whose demands range from 20 Mbps to 40 Mbps. We compare it with a DNN approximator, which is a fully-connected network with 3 layers of 32, 64 and 1 neurons, respectively, using the rectified linear unit (ReLU) as activation function. Firstly, in Fig. 3(a), we assume that all 8 RRHs are turned on. One can see from this figure that the GBDT has better fitting performance than the DNN. Then, we assume that one RRH is switched off. One can see from Fig. 3(b) that the GBDT still fits the SOCP solutions very well. In Fig. 3(c), we assume that the states of all 8 RRHs are set on or off randomly. As expected, the GBDT fits the SOCP solutions much better than the DNN.
B. Training Effect of GBDT and DNN
We compare the training performance of the GBDT approximator and the DNN approximator in Fig. 4. The mean squared error (MSE) is used to calculate the loss. From Fig. 4, one can see that even when trained for far more time, the loss of the DNN is still higher than that of the GBDT. One also notices that the GBDT has fewer parameters to adjust and therefore a quicker training process. A more detailed comparison is not unfolded here, as it is not the focus of this paper. Next, we examine the performance of the GBDT-based DQN solution.

Fig. 3. Approximation performance of the GBDT approximator: (a) all RRHs open; (b) one RRH closed; (c) random states of RRHs.

Fig. 4. Training effect of GBDT and DNN: (a) training effect of GBDT within 4 seconds; (b) training effect of DNN over 3 hours.
C. System Performance
In this section, we consider 8 RRHs and 4 users whose demands are randomly selected, and we change the user demands every 100 ms. The performance of AO, OC and the GBDT-based DQN is compared next.
1) Instant Power:
We examine the instant system power consumption in this subsection. In the top plots of Fig. 5(a) and Fig. 5(b), we compare the AO and DQN strategies, where all RRHs are open initially and all RRHs stay active in the AO scheme. In the bottom plots of Fig. 5(a) and Fig. 5(b), we turn off one RRH randomly at the beginning for both OC and DQN, and that RRH stays switched off in the OC scheme. Moreover, the user demands are selected randomly from 20 Mbps to 40 Mbps in Fig. 5(a), whereas they are selected randomly from 20 Mbps to 60 Mbps in Fig. 5(b). One can see from all plots in Fig. 5 that our proposed DQN always outperforms AO and OC. This is because the DQN turns RRHs on and off depending on the current state of the system, whereas AO always keeps all RRHs on and OC randomly turns off one RRH, which may not be the optimal strategy and leads to larger power consumption than the DQN.

One can also see that when the upper limit of the user demands increases from 40 Mbps in Fig. 5(a) to 60 Mbps in Fig. 5(b), the performance of DQN, OC and AO all become more unstable. However, our proposed DQN still achieves the best performance compared with AO and OC.

Moreover, although there may be some errors caused by the GBDT approximator, our proposed DQN framework still performs well, which shows the good error tolerance of our proposed solution.

Fig. 5. Total instant power consumption with different user demands and allocation schemes: (a) demands from 20 Mbps to 40 Mbps; (b) demands from 20 Mbps to 60 Mbps.
2) Average Power:
In Fig. 6, we show the performance comparison between the GBDT-based DQN, AO and OC in the long term. The DQN with the reward obtained from the SOCP solver is also depicted. We compare the average system power consumption by averaging all instant system power values over the past time slots.

We first analyse the performance under user demands below 40 Mbps for both DQN schemes (with GBDT and with SOCP rewards) and the AO scheme.
We set all the RRHs switched on and change the user demands every 100 ms per slot, over a duration of 500 s. One can see from Fig. 6(a) that both DQN schemes outperform AO and can save around 8 Watts per time slot. The slight fluctuation comes from the randomness of the requirements. Moreover, one can see from Fig. 6(a) that the DQN with GBDT has a performance similar to the DQN scheme with the SOCP solver, which shows the error tolerance of our proposed solution.

Then we turn one RRH off and continue to analyse the average system power consumption under the DQN and OC schemes. One can see from Fig. 6(b) that both DQN schemes still outperform the OC scheme, as expected. Also, the DQN scheme with GBDT has a performance similar to that with the SOCP solver, as above.

Fig. 6. Total average power consumption with different allocation schemes: (a) AO vs. DQN; (b) OC vs. DQN.
3) Overall Performance of GBDT-based DQN:
To evaluate the overall performance of the GBDT-based DQN in different situations, we set the user demands from 20 Mbps to 60 Mbps with a 10 Mbps interval, and keep the other factors unchanged. One can see from Fig. 7(a) and Fig. 7(b) that, as the user demands increase, the power consumption of AO, OC and DQN increases as well. One also sees that our proposed GBDT-based DQN performs much better than AO and OC, as expected, which proves the effectiveness of our scheme.

VI. CONCLUSION
In this paper, we presented a GBDT-based DQN framework to tackle the dynamic resource allocation problem for IoT in real-time C-RANs. We first employed the GBDT to approximate the solutions of the SOCP problem. Then, we built the DQN framework to generate an efficient resource allocation policy governing the status of the RRHs in C-RANs. Furthermore, we described the offline training, online decision making and regular tuning processes. Lastly, we evaluated the proposed framework in comparison with two other methods, AO and OC, and examined its accuracy and error tolerance against the SOCP-based DQN scheme. Simulation results showed that the proposed GBDT-based DQN can achieve much better power-saving performance than the baseline solutions in the real-time setting. Future work is in progress to let the GBDT approximator meet the strict constraints of practical problems, which is expected to broaden its range of applicable scenarios.

Fig. 7. Average power consumption versus different user demands: (a) AO vs. DQN; (b) OC vs. DQN.
APPENDIX
TRAINING AND PREDICTING PROCESS OF GBDT
The training process of GBDT is shown in Algorithm 2. GBDT combines two concepts: the gradient and boosting. In the training process, the 0-th tree is fitted to the given training dataset and predicts the mean value of y_true in the training set regardless of the input; the predicted values of the 0-th tree are denoted as y_predicted^(0). However, the predictions y_predicted^(0) of the 0-th tree still have residuals with respect to the true values y_true. Then, an additive tree is fitted to a new dataset whose inputs are the same as those of the 0-th tree, but whose fitting targets are the residuals (y_true − y_predicted^(0)). The predictions of the GBDT are then the linear combination of the predictions of the 0-th tree and of the new additive tree, namely y_predicted = y_predicted^(0) + γ·y_predicted^(1), where γ is the weight attributed to this tree. Next, another tree is fitted to the new residuals y_true − (y_predicted^(0) + γ·y_predicted^(1)), and the same process is repeated.
Algorithm 2 Training process of GBDT

Initialization:
(1) Set the iteration counter m = 0. Initialize the additive predictor f̂^[0] with a starting value, e.g. f̂^[0] := (0)_{i=1,...,n}. Specify a set of base-learners h_1(x_1), ..., h_p(x_p).

Fit the negative gradient:
(2) Set m := m + 1.
(3) Compute the negative gradient vector u of the loss function at the previous iteration:
u^[m] = (u_i^[m])_{i=1,...,n} = ( −∂ρ(y_i, f)/∂f |_{f = f̂^[m−1](·)} )_{i=1,...,n}
(4) Fit the negative gradient vector u^[m] separately to every base-learner:
u^[m] → ĥ_j^[m](x_j), for j = 1, ..., p.

Update one component:
(5) Select the component j* that best fits the negative gradient vector:
j* = argmin_{1≤j≤p} Σ_{i=1}^{n} (u_i^[m] − ĥ_j^[m](x_j))²
(6) Update the additive predictor f̂ with that component:
f̂^[m](·) = f̂^[m−1](·) + sl · ĥ_{j*}^[m](x_{j*})
where sl is a small step length (0 < sl ≪ 1); a typical value in practice is 0.1.

Iteration:
Iterate steps (2) to (6) until m = m_stop.
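Algorithm 2 can be condensed into a few lines of Python: start from the mean prediction and repeatedly fit a small regression tree to the current residuals (the negative gradient of the squared-error loss). The sketch below is a simplified, single-base-learner version using scikit-learn trees, which we assume only for illustration; the paper itself uses LightGBM.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_trees=100, step=0.1, max_depth=3):
    """Simplified boosting loop of Algorithm 2 for squared-error loss (illustrative)."""
    y = np.asarray(y, dtype=float)
    f0 = y.mean()                                  # step (1): initial predictor (mean of targets)
    pred = np.full_like(y, f0)
    trees = []
    for _ in range(n_trees):                       # steps (2)-(6) until m_stop
        residual = y - pred                        # step (3): negative gradient of 0.5*(y - f)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)   # step (4)
        pred += step * tree.predict(X)             # step (6): update with step length sl
        trees.append(tree)
    return f0, trees

def predict_gbdt(f0, trees, X, step=0.1):
    """Final prediction, Eq. (9): initial guess plus weighted sum of tree outputs."""
    return f0 + step * sum(t.predict(X) for t in trees)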
From the above process, one can see that the idea of boosting is to utilize the residuals between the previously ensembled results and the true values: by learning from the residuals, the model makes progress as new trees are added. The gradient part of the concept reflects that the whole training process is supervised and guided by the gradient of the objective function, typically expressed as ½(y_true − y_predicted)², whose derivative is the pseudo-residual between y_true and y_predicted.

REFERENCES

[1] J. Lin et al., "A survey on Internet of Things: Architecture, enabling technologies, security and privacy, and applications," IEEE Internet Things J., vol. 4, no. 5, pp. 1125-1142, Oct. 2017.
[2] A. Checko et al., "Cloud RAN for mobile networks–A technology overview," IEEE Commun. Surveys Tuts., vol. 17, no. 1, pp. 405-426, Sep. 2014.
[3] Z. Xu, Y. Wang, J. Tang, J. Wang, and M. C. Gursoy, "A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs," in Proc. IEEE Int. Conf. Commun. (ICC), pp. 1-6, 2017.
[4] A. Wiesel, Y. C. Eldar, and S. Shamai, "Linear precoding via conic optimization for fixed MIMO receivers," IEEE Trans. Signal Process., vol. 54, no. 1, pp. 161-176, 2006.
[5] M. Gerasimenko et al., "Cooperative radio resource management in heterogeneous cloud radio access networks," IEEE Access, vol. 3, pp. 397-406, 2015.
[6] Y. Zhou et al., "Deep reinforcement learning based coded caching scheme in fog radio access networks," pp. 309-313, 2018.
[7] P. Rost et al., "Cloud technologies for flexible 5G radio access networks," IEEE Commun. Mag., vol. 52, no. 5, pp. 68-76, 2014.
[8] G. Sun et al., "Dynamic reservation and deep reinforcement learning based autonomous resource slicing for virtualized radio access networks," IEEE Access, vol. 7, pp. 45758-45772, 2019.
[9] V. François-Lavet et al., "An introduction to deep reinforcement learning," Foundations and Trends in Machine Learning, vol. 11, no. 3-4, pp. 219-354, 2018.
[10] H. He et al., "Model-driven deep learning for physical layer communications," arXiv preprint arXiv:1809.06059, 2019.
[11] H. Zhu et al., "Caching transient data for Internet of Things: A deep reinforcement learning approach," IEEE Internet Things J., vol. 6, no. 2, pp. 2074-2083, Apr. 2019.
[12] H. Zhu, Y. Cao, W. Wang, T. Jiang, and S. Jin, "Deep reinforcement learning for mobile edge caching: Review, new features, and open issues," IEEE Netw., vol. 32, no. 6, pp. 50-57, Nov. 2018.
[13] D. Liu et al., "User association in 5G networks: A survey and an outlook," IEEE Commun. Surveys Tuts., vol. 18, no. 2, pp. 1018-1044, 2nd Quart. 2015.
[14] A. Domahidi, E. Chu, and S. Boyd, "ECOS: An SOCP solver for embedded systems," in Proc. European Control Conference (ECC), pp. 3071-3076, 2013.
[15] E. Andersen and K. Andersen, "The MOSEK interior point optimizer for linear programming: An implementation of the homogeneous algorithm," High Performance Optimization, vol. 33, pp. 197-232, 2000.
[16] J. F. Sturm, "Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric cones," Optimization Methods and Software, vol. 11, no. 1-4, pp. 625-653, 1999.
[17] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in Proc. 27th International Conference on Machine Learning, Omnipress, pp. 399-406, 2010.
[18] J. R. Hershey, J. Le Roux, and F. Weninger, "Deep unfolding: Model-based inspiration of novel deep architectures," arXiv preprint arXiv:1409.2574, 2014.
[19] C. Lu, W. Xu, S. Jin, and K. Wang, "Bit-level optimized neural network for multi-antenna channel quantization," IEEE Commun. Lett. (Early Access), pp. 1-1, Sep. 2019.
[20] C. Lu, W. Xu, H. Shen, J. Zhu, and K. Wang, "MIMO channel information feedback using deep recurrent network," IEEE Commun. Lett., vol. 23, no. 1, pp. 188-191, Jan. 2019.
[21] Z. H. Zhou and J. Feng, "Deep forest: Towards an alternative to deep neural networks," arXiv preprint arXiv:1702.08835, 2017.
[22] H. Sun et al., "Learning to optimize: Training deep neural networks for interference management," IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438-5453, Oct. 2018.
[23] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, pp. 1189-1232, 2001.
[24] L. Breiman, "Bias, variance, and arcing classifiers," Tech. Rep. 460, Statistics Department, University of California, Berkeley, CA, USA, 1996.
[25] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms, Chapman and Hall/CRC, 2012.
[26] D. Opitz and R. Maclin, "Popular ensemble methods: An empirical study," Journal of Artificial Intelligence Research, pp. 169-198, 1999.
[27] R. Polikar, "Ensemble based systems in decision making," IEEE Circuits Syst. Mag., vol. 6, no. 3, pp. 21-45, 2006.
[28] L. Rokach, "Ensemble-based classifiers," Artificial Intelligence Review, vol. 33, no. 1-2, pp. 1-39, 2010.
[29] A. Natekin and A. Knoll, "Gradient boosting machines, a tutorial," Frontiers in Neurorobotics, vol. 7, no. 21, 2013.
[30] T. P. Do and Y. H. Kim, "Resource allocation for a full-duplex wireless-powered communication network with imperfect self-interference cancelation," IEEE Commun. Lett., vol. 20, no. 12, pp. 2482-2485, Dec. 2016.
[31] J. Miao, Z. Hu, K. Yang, C. Wang, and H. Tian, "Joint power and bandwidth allocation algorithm with QoS support in heterogeneous wireless networks," IEEE Commun. Lett., vol. 16, no. 4, pp. 479-481, 2012.
[32] J. Moon et al., "Online reinforcement learning of X-Haul content delivery mode in fog radio access networks," IEEE Signal Process. Lett., vol. 26, no. 10, pp. 1451-1455, 2019.
[33] I. John, A. Sreekantan, and S. Bhatnagar, "Efficient adaptive resource provisioning for cloud applications using reinforcement learning," Umea, Sweden, pp. 271-272, 2019.
[34] J. Li, H. Gao, T. Lv, and Y. Lu, "Deep reinforcement learning based computation offloading and resource allocation for MEC," pp. 1-6, April 2018.
[35] B. Dai and W. Yu, "Energy efficiency of downlink transmission strategies for cloud radio access networks," IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 1037-1050, Apr. 2016.
[36] G. Auer et al., "How much energy is needed to run a wireless network," IEEE Wirel. Commun., vol. 18, no. 5, pp. 40-49, 2011.
[37] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[38] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret, "Applications of second-order cone programming," Linear Algebra and its Applications, vol. 284, no. 1, pp. 193-228, 1998.
[39] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, Cambridge, MA: MIT Press, 1998.
[40] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
[41] Y. Shi, J. Zhang, and K. B. Letaief, "Group sparse beamforming for green cloud-RAN," IEEE Trans. Wireless Commun., vol. 13, no. 5, pp. 2809-2823, May 2014.
[42] B. Dai and W. Yu, "Energy efficiency of downlink transmission strategies for cloud radio access networks," IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 1037-1050, Apr. 2016.