Reinforcement Learning for Self-Organization and Power Control of Two-Tier Heterogeneous Networks
Roohollah Amiri, Student Member, IEEE, Mojtaba Ahmadi Almasi, Student Member, IEEE, Jeffrey G. Andrews, Fellow, IEEE, and Hani Mehrpouyan, Member, IEEE

Abstract
Self-organizing networks (SONs) can help manage the severe interference in dense heterogeneous networks (HetNets). Given their need to automatically configure power and other settings, machine learning is a promising tool for data-driven decision making in SONs. In this paper, a HetNet is modeled as a dense two-tier network with conventional macrocells overlaid with denser small cells (e.g., femto or pico cells). First, a distributed framework based on a multi-agent Markov decision process is proposed that models the power optimization problem in the network. Second, we present a systematic approach for designing a reward function based on the optimization problem. Third, we introduce a Q-learning based distributed power allocation algorithm (Q-DPA) as a self-organizing mechanism that enables ongoing transmit power adaptation as new small cells are added to the network. Further, the sample complexity of the Q-DPA algorithm to achieve ε-optimality with high probability is provided. We demonstrate that, at a density of several thousand femtocells per km², the required quality of service of a macrocell user can be maintained via the proper selection of independent or cooperative learning and appropriate Markov state models.

Index Terms
Self-organizing networks, HetNets, Reinforcement learning, Markov decision process.
This work was presented in part at ICC 2018 [1]. R. Amiri, M. A. Almasi, and H. Mehrpouyan are with the Department of Electrical and Computer Engineering, Boise State University, Boise, ID, USA (e-mail: [email protected]; [email protected]; [email protected]). J. G. Andrews (e-mail: [email protected]) is with the University of Texas at Austin, USA. Last revised March 19, 2019.

I. INTRODUCTION
Self-organization is a key feature as cellular networks densify and become more heterogeneous through the addition of small cells such as pico and femtocells [2]–[6]. Self-organizing networks (SONs) can perform self-configuration, self-optimization, and self-healing. These operations cover basic tasks such as configuration of a newly installed base station (BS), resource management, and fault management in the network [7]. In other words, SONs attempt to minimize human intervention by using measurements from the network to reduce the cost of installation, configuration, and maintenance. SONs thus bring two main factors into play: intelligence and autonomous adaptability [2], [3]. Therefore, machine learning techniques can play a major role in processing underutilized sensory data to enhance the performance of SONs [8], [9].

One of the main responsibilities of SONs is to configure the transmit power at various small BSs to manage interference. A small BS needs to configure its transmit power before joining the network (self-configuration) and subsequently needs to dynamically control its transmit power during its operation in the network (self-optimization). To address these two issues, we consider a macrocell network overlaid with small cells and focus on autonomous distributed power control, which is a key element of self-organization since it improves network throughput [10]–[14] and minimizes energy usage [15]–[17]. We rely on local measurements, such as the signal-to-interference-plus-noise ratio (SINR), and on machine learning to develop a SON framework that can continually improve the above performance metrics.
A. Related Work
In wireless communications, dynamic power control with the use of machine learning has been implemented via reinforcement learning (RL). In this context, RL is an area of machine learning that attempts to optimize a BS's transmit power to achieve a certain goal such as throughput maximization. One of the main advantages of RL over supervised learning methods is its training phase, in which there is no need for labeled input/output data. Instead, RL operates by applying the experience that it gains through interacting with the network [18]. RL methods have been applied in the field of wireless communications in areas such as resource management [19]–[24], energy harvesting [25], and opportunistic spectrum access [26], [27]. A comprehensive review of RL applications in wireless communications can be found in [28].

Q-learning is a model-free RL method [29]. The model-free feature of Q-learning makes it a suitable method for scenarios in which the statistics of the network continuously change. Further, Q-learning has low computational complexity and can be implemented by BSs in a distributed manner [1]. Therefore, Q-learning can bring scalability, robustness, and computational efficiency to large networks. However, designing a proper reward function which accelerates the learning process and avoids the false-learning or unlearning phenomena [30] is not trivial. Therefore, to solve an optimization problem, an appropriate reward function for Q-learning needs to be determined. In this regard, the works in [19]–[24] have proposed different reward functions to optimize power allocation between femtocell base stations (FBSs). The method in [19] uses independent Q-learning in a cognitive radio system to set the transmit power of secondary BSs in a digital television system. The solution in [19] ensures that the minimum quality of service (QoS) for the primary user is met by applying Q-learning and using the SINR as a metric. However, the approach in [19] does not take the QoS of the secondary users into consideration. The work in [20] uses cooperative Q-learning to maximize the sum transmission rate of the femtocell users while keeping the transmission rate of macrocell users near a certain threshold. Further, the authors in [21] use the proximity of FBSs to a macrocell user as a factor in the reward function. This results in a fair power allocation scheme in the network. Their proposed reward function keeps the transmission rate of the macrocell user above a certain threshold while maximizing the sum transmission rate of the FBSs. However, by not considering a minimum threshold for the FBSs' rates, the approach in [21] fails to support some FBSs as the density of the network (and consequently the interference) increases. The authors in [22] model the cross-tier interference management problem as a non-cooperative game between the femtocells and the macrocell. In [22], femtocells use the average SINR measurement to enhance their individual performance while maintaining the QoS of the macrocell user. In [23], the authors attempt to improve the transmission rate of cell-edge users while keeping fairness between the macrocell and femtocell users by applying a round-robin approach. The work in [24] minimizes power usage in a Long Term Evolution (LTE) enterprise femtocell network by applying an exponential reward function, without the requirement to achieve fairness amongst the femtocells in the network.

In the above works, the reward functions do not extend to dense networks.
That is to say, first, there is no minimum threshold for the achievable rate of the femtocells. Second, the reward functions are designed to limit the macrocell user rate to its required QoS and no more. This property encourages an FBS to use more power to increase its own rate under the assumption that the resulting interference affects only the macrocell user. However, the neighboring femtocells suffer from this decision, and the overall sum rate of the network decreases. Further, these works do not provide a generalized framework for modeling a HetNet as a multi-agent RL network, nor a procedure to design a reward function which meets the QoS requirements of the network. In this paper, we focus on dense networks and provide a general solution to the above challenges.

B. Contributions
We propose a learning framework based on a multi-agent Markov decision process (MDP). By considering an FBS as an agent, the proposed framework enables FBSs to join and adapt to a dense network autonomously. Due to the unplanned and dense deployment of femtocells, providing the required QoS to all the users in the network becomes an important issue. Therefore, we design a reward function that trains the FBSs to achieve this goal. Furthermore, we introduce a Q-learning based distributed power allocation approach (Q-DPA) as an application of the proposed framework. Q-DPA uses the proposed reward function to maximize the transmission rate of femtocells while prioritizing the QoS of the macrocell user. More specifically, the contributions of the paper can be summarized as:

1) We propose a framework that is agnostic to the choice of learning method but also connects the required RL analogies to wireless communications. The proposed framework models a multi-agent network with a single MDP that contains the joint action of all the agents as its action set. Next, we introduce MDP factorization methods to provide a distributed and scalable architecture for the proposed framework. The proposed framework is used to benchmark the performance of different learning rates, Markov state models, or reward functions in two-tier wireless networks.

2) We present a systematic approach for designing a reward function based on the optimization problem and the nature of RL. In fact, due to the scarcity of resources in a dense network, we propose properties that a reward function should satisfy in order to maximize the sum transmission rate of the network while considering the minimum requirements of all users. The procedure is simple and general, and the designed reward function takes the shape of low-complexity polynomials. Further, the designed reward function increases the achievable sum transmission rate of the network while consuming considerably less power compared to greedy algorithms.

3) We propose Q-DPA as an application of the proposed framework to perform distributed power allocation in a dense femtocell network. Q-DPA uses the factorization method to derive independent and cooperative learning from the optimal solution. Q-DPA uses local signal measurements at the femtocells to train the FBSs in order to: (i) maximize the transmission rate of the femtocells, (ii) achieve the minimum required QoS for all femtocell users with high probability, and (iii) maintain the QoS of the macrocell user in a densely deployed femtocell network. In addition, we determine the minimum number of samples required to achieve an ε-optimal policy in Q-DPA as its sample complexity.

4) We introduce four different learning configurations based on different combinations of independent/cooperative learning and Markov state models. We conduct extensive simulations to quantify the effect of different learning configurations on the performance of the network. Simulations show that the proposed Q-DPA algorithm can decrease power usage and, as a result, reduce the interference to the macrocell user.

The paper is organized as follows. In Section II, the system model is presented. Section III introduces the optimization problem and presents the existing challenges in solving this problem. Section IV presents the proposed learning framework, which models a two-tier femtocell network with a multi-agent MDP. Section V presents the Q-DPA algorithm as an application of the proposed framework. Section VI presents the simulation results, while Section VII concludes the paper.
Notation:
Lower case, boldface lower case, and calligraphic symbols represent scalars, vectors, and sets, respectively. For a real-valued function Q : Z → R, ‖Q‖ denotes the max norm, i.e., ‖Q‖ = max_{z ∈ Z} |Q(z)|. E_x[·], E_x[·|·], and ∂f/∂x denote the expectation, the conditional expectation, and the partial derivative with respect to x, respectively. Further, Pr(·|·) and |·| denote the conditional probability and absolute value operators, respectively.

II. DOWNLINK SYSTEM MODEL
Consider the downlink of a single cell of a HetNet operating over a set S = {1, ..., S} of S orthogonal subbands. In the cell, a single macro base station (MBS) is deployed. The MBS serves one macrocell user equipment (MUE) over each subband while guaranteeing this user a minimum average SINR over each subband, denoted by Γ_0.

Figure 1. Macrocell and femtocells operating over the same frequency band.
A set of FBSs is deployed in the coverage area of the macrocell. Each FBS selects a random subband and serves one femtocell user equipment (FUE). We assume that overall, on each subband s ∈ S, a set K = {1, ..., K} of K FBSs is operating. Each FBS guarantees a minimum average SINR, denoted by Γ_k, to its related FUE. We consider a dense network in which the density results in both cross-tier and co-tier interference. Therefore, in order to control the interference level and provide the users with their required minimum SINR, we focus on power allocation in the downlink of the femtocell network. Uplink results can be obtained in a similar fashion but are not included for brevity. The overall network configuration is presented in Fig. 1. We focus on one subband; the proposed solution can be extended to the case in which each FBS supports multiple users on different subbands.

We denote the MBS-MUE pair by the index 0 and the FBS-FUE pairs by the index k from the set K. In the downlink, the received signal at the MUE operating over subband s includes interference from the femtocells and thermal noise. Hence, the SINR at the MUE operating over subband s ∈ S, γ_0, is calculated as

γ_0 = p_0 |h_{0,0}|^2 / ( Σ_{k ∈ K} p_k |h_{k,0}|^2 + N_0 ),   (1)

where the sum in the denominator is the femtocells' interference, p_0 denotes the power transmitted by the MBS, and h_{0,0} denotes the channel gain from the MBS to the MUE. Further, the power transmitted by the k-th FBS is denoted by p_k and the channel gain from the k-th FBS to the MUE is denoted by h_{k,0}. Finally, N_0 denotes the variance of the additive white Gaussian noise. Similarly, the SINR at the k-th FUE operating over subband s ∈ S, γ_k, is obtained as

γ_k = p_k |h_{k,k}|^2 / ( p_0 |h_{0,k}|^2 + Σ_{j ∈ K\{k}} p_j |h_{j,k}|^2 + N_k ),   (2)

where the first term in the denominator is the macrocell's interference and the sum is the other femtocells' interference, h_{k,k} denotes the channel gain between the k-th FBS and the k-th FUE, h_{0,k} denotes the channel gain between the MBS and the k-th FUE, p_j denotes the transmit power of the j-th FBS, h_{j,k} is the channel gain between the j-th FBS and the k-th FUE, and N_k is the variance of the additive white Gaussian noise. Finally, the transmission rates, normalized by the transmission bandwidth, at the MUE and the FUE operating over subband s ∈ S, i.e., r_0 and r_k, respectively, are expressed as r_0 = log_2(1 + γ_0) and r_k = log_2(1 + γ_k), k ∈ K.
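For concreteness, the per-subband SINR and rate model of (1) and (2) can be evaluated directly from a power vector and a channel-gain matrix. The following Python sketch is illustrative only; the function name, the gain-matrix convention (entry [i, j] holds |h_{i,j}|^2 from transmitter i to the user served by transmitter j, with index 0 for the MBS/MUE), and the use of a common noise power are assumptions made for this example.

```python
import numpy as np

def sinr_and_rates(p0, p, h, N0):
    """Compute gamma_0, gamma_k, r_0, r_k for one subband (Eqs. (1)-(2)).

    p0 : MBS transmit power (linear scale)
    p  : length-K array of FBS transmit powers (linear scale)
    h  : (K+1) x (K+1) array, h[i, j] = |h_{i,j}|^2 from transmitter i to the
         user served by transmitter j (index 0 = MBS / MUE)
    N0 : noise power at each receiver
    """
    K = len(p)
    tx = np.concatenate(([p0], p))                    # powers of MBS and all FBSs
    # MUE SINR: desired MBS signal over femtocell interference plus noise
    gamma_0 = tx[0] * h[0, 0] / (np.sum(tx[1:] * h[1:, 0]) + N0)
    # FUE SINRs: desired FBS signal over macro + other-femtocell interference
    gamma_k = np.empty(K)
    for k in range(1, K + 1):
        interf = np.sum(tx * h[:, k]) - tx[k] * h[k, k]
        gamma_k[k - 1] = tx[k] * h[k, k] / (interf + N0)
    r0 = np.log2(1.0 + gamma_0)
    rk = np.log2(1.0 + gamma_k)
    return gamma_0, gamma_k, r0, rk
```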
III. PROBLEM FORMULATION

Each FBS has the objective of maximizing its transmission rate while ensuring that the SINR of the MUE stays above the required threshold, i.e., Γ_0. Denoting p = {p_1, ..., p_K} as the vector of the transmit powers of the K FBSs operating over subband s ∈ S, the power allocation problem is formulated as

maximize_p   Σ_{k ∈ K} log_2(1 + γ_k)   (3a)
subject to   0 ≤ p_k ≤ p_max,  k ∈ K,   (3b)
             γ_0 ≥ Γ_0,   (3c)
             γ_k ≥ Γ_k,  k ∈ K,   (3d)

where p_max defines the maximum available transmit power at each FBS. The objective (3a) is to maximize the sum transmission rate of the FUEs. Constraint (3b) refers to the power limitation of every FBS. Constraints (3c) and (3d) ensure that the minimum SINR requirement is satisfied for the MUE and the FUEs. The addition of constraint (3d) to the optimization problem is one of the differences between the proposed approach in this paper and those of [19]–[24].

Considering (2), it can be concluded that the optimization in (3) is a non-convex problem for dense networks. This follows from the SINR expression in (2) and the objective function (3a). More specifically, the interference term due to the neighboring femtocells in the denominator of (2) ensures that the optimization problem in (3) is not convex [31]. This interference term may be ignored in low-density networks but cannot be ignored in dense networks consisting of a large number of femtocells [32]. However, non-convexity is not the only challenge of the above problem. Many iterative algorithms have been developed that solve the above optimization problem with excellent performance; however, these algorithms contain expensive computations, such as matrix inversion, bisection, or singular value decomposition, in each iteration, which makes their real-time implementation challenging [33]. Besides, the k-th FBS is only aware of its own transmit power, p_k, and does not know the transmit powers of the remaining FBSs. Therefore, the idea here is to treat the given problem as a black box and learn the relation between the transmit power and the resulting transmission rate gradually, by interacting with the network and using simple computations.

To realize self-organization, each FBS should be able to operate autonomously. This means an FBS should be able to connect to the network at any time and to continuously adapt its transmit power to achieve its objectives. Therefore, our optimization problem requires a self-adaptive solution. The steps for achieving self-adaptation can be summarized as: (i) the FBS measures the interference level at its related FUE, and (ii) the FBS determines the maximum transmit power that supports its FUE without greatly degrading the performance of other users in the network. The next section presents the framework required to solve this problem.
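Before moving on, note that once the FBS powers are restricted to a finite set, problem (3) can in principle be solved by brute force over the discretized power grid; this is the exhaustive-search baseline referred to in Section VI. The sketch below reuses the hypothetical sinr_and_rates helper from Section II and simply scores every feasible power vector by (3a) subject to (3b)–(3d). It is meant only to illustrate the problem structure, not as a practical algorithm, since its cost grows exponentially with the number of FBSs.

```python
import itertools
import numpy as np

def exhaustive_search(power_levels, p0, h, N0, Gamma0, Gamma_k):
    """Brute-force solution of problem (3) over a discretized power grid.

    power_levels : iterable of allowed FBS power levels (linear scale)
    Gamma0       : minimum SINR required by the MUE
    Gamma_k      : length-K array of minimum SINRs required by the FUEs
    Returns the best feasible power vector and its sum FUE rate.
    """
    K = len(Gamma_k)
    best_p, best_rate = None, -np.inf
    for cand in itertools.product(power_levels, repeat=K):   # constraint (3b)
        p = np.array(cand, dtype=float)
        g0, gk, _, rk = sinr_and_rates(p0, p, h, N0)         # sketch from Section II
        if g0 >= Gamma0 and np.all(gk >= Gamma_k):           # constraints (3c)-(3d)
            if rk.sum() > best_rate:                         # objective (3a)
                best_p, best_rate = p, rk.sum()
    return best_p, best_rate
```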
IV. THE PROPOSED LEARNING FRAMEWORK

Here, we first model a multi-agent network as an MDP. Then the required definitions, evaluation methods, and the factorization of the MDP needed to develop a distributed learning framework are explained. Subsequently, the femtocell network is modeled as a multi-agent MDP and the proposed learning framework is developed.
A. Multi-Agent MDP and Policy Evaluation
A single-agent MDP comprises an agent, an environment, an action set, and a state set. The agent can transition between different states by choosing different actions. The sequence of actions taken by the agent is called its policy. With each transition, the agent receives a reward from the environment as a consequence of its action and accumulates the discounted sum of rewards as a cumulative reward. The agent continues its behavior with the goal of maximizing the cumulative reward, and the value of the cumulative reward evaluates the chosen policy. The discounting increases the impact of recent rewards and decreases the effect of later ones. If the number of transitions is limited, the non-discounted sum of rewards can be used as well.

A multi-agent MDP consists of a set K of K agents. The agents select actions to move between different states of the model so as to maximize the cumulative reward received by all the agents. Here, we again formulate the network of agents as one MDP, e.g., we define the action set as the joint action set of all the agents. Therefore, the multi-agent MDP framework is defined with a tuple (A, X, Pr, R) with the following definitions.
• A is the joint set of all the agents' actions. An agent k selects its action a_k from its action set A_k, i.e., a_k ∈ A_k. The joint action set is represented as A = A_1 × · · · × A_K, with a ∈ A as a single joint action.
• The state of the system is defined with a set of random variables. Each random variable is represented by X_i with i = 1, ..., n, and the state set is represented as X = {X_1, X_2, ..., X_n}, where x ∈ X denotes a single state of the system. Each random variable reflects a specific feature of the network.
• The transition probability function,
Pr(x, a, x′), represents the probability of taking joint action a at state x and ending in state x′. In other words, the transition probability function defines the environment with which the agents are interacting.
• R(x, a) is the reward function, whose value is the reward received by the agents for taking joint action a at state x.

We define π : X → A as the policy function, where π(x) is the joint action taken at state x. In order to evaluate the policy π(x), a value function V^π(x) and an action-value function Q^π(x, a) are defined. The value of the policy π in state x′ ∈ X is defined as [18]

V^π(x′) = E_π [ Σ_{t=0}^{∞} β^t R^{(t+1)} | x^{(0)} = x′ ],   (4)

in which β ∈ (0, 1) is a discount factor, R^{(t+1)} is the received reward at time step t + 1, and x^{(0)} is the initial state. The action-value function, Q^π(x, a), represents the value of the policy π for taking joint action a at state x and then following policy π for subsequent iterations. According to [18], the relation between the value function and the action-value function is given by

Q^π(x, a) = R(x, a) + β Σ_{x′ ∈ X} Pr(x′ | x, a) V^π(x′).   (5)
To this point, we defined the Q-function over the joint state-action space of all the agents,i.e.,
X × A . We refer to this Q-function as the global Q-function. According to [29], Q- learning finds the optimal solution to a single MDP with probability one. However, in large MDPs, dueto exponential increase in the size of the joint state-action space with respect to the number ofagents, the solution to the problem becomes intractable. To resolve this issue, we use factored
MDPs as a decomposition technique for large MDPs. The idea in factored MDPs is that manylarge MDPs are generated by systems with many parts that are weakly interconnected. Each parthas its associated state variables and the state space can be factored into subsets accordingly. Thedefinition of the subsets affects the optimality of the solution [34], and investigating the optimalfactorization method helps with understanding the optimality of multi-agent RL solutions [35].In [36] power control of a multi-hop network is modeled as an MDP and the state set is factorizedinto multiple subsets each referring to a single hop. The authors in [37] show that the subsetscan be defined based on the local knowledge of the agents from the environment. Meanwhile, weaim to distribute the power control to the nodes of the network. Therefore, due to the definition of the problem in Section III and the fact that each FBS is only aware of its own power, we usethe assumption in [37] and define the individual action set of the agents, i.e., A k , as the subsetsof the joint action set. Consequently, the resultant Q-function for the k th agent is defined as Q k ( x k , a k ) , in which a k ∈ A k , x k ∈ X k is the state vector of the k th agent, and X k , k ∈ K , arethe subsets of the global state set of the system, i.e., X .In factored MDPs, We assume that the reward function is factored based on the subsets, i.e., R ( x , a ) = (cid:88) k ∈K R k ( x k , a k ) , (7)where, R k ( x k , a k ) is the local reward function of the k th agent. Moreover, we also assume thatthe transition probabilities are factored, i.e., for the k th subsystem we have Pr ( x (cid:48) k | x , a ) = Pr ( x (cid:48) k | x k , a k ) , ( x , a ) ∈ X × A , ( x k , a k ) ∈ X k × A k , x (cid:48) k ∈ X k . (8)The value function for the global MDP is given by V ( x ) = E (cid:34) ∞ (cid:88) t =0 β t R ( t +1) ( x , a ) (cid:35) = E (cid:34) ∞ (cid:88) t =0 β t (cid:88) k ∈K R ( t +1) k ( x k , a k ) (cid:35) = (cid:88) k ∈K V k ( x k ) , (9)where, V k ( x k ) is the value function of the k th agent. Therefore, the derived policy has the valuefunction equal to the linear combination of local value functions. Further, according to (5), foreach agent k ∈ K Q k ( x k , a k ) = R k ( x k , a k ) + β (cid:88) x (cid:48) k Pr ( x (cid:48) k | x k , a k ) V k ( x (cid:48) k ) , (10)and for the global Q-function Q ( x , a ) = R ( x , a ) + β (cid:88) x (cid:48) ∈X Pr ( x (cid:48) | x , a ) V ( x (cid:48) )= (cid:88) k ∈K R k ( x k , a k ) + β (cid:88) x (cid:48) ∈X Pr ( x (cid:48) | x , a ) (cid:88) k ∈K V k ( x k )= (cid:88) k ∈K R k ( x k , a k ) + β (cid:88) k ∈K (cid:88) x (cid:48) k ∈X k Pr ( x (cid:48) k | x , a ) V k ( x k )= (cid:88) k ∈K R k ( x k , a k ) + β (cid:88) k ∈K (cid:88) x (cid:48) ∈X k Pr ( x (cid:48) k | x k , a k ) V k ( x k ) = (cid:88) k ∈K Q k ( x k , a k ) . (11)Therefore, based on the assumptions in (7) and (8), the global Q-function can be approximatedwith the linear combination of local Q-functions. Further, (11) results in a distributed and scalablearchitecture for the framework. TransmissionMeasurementContext RewardState Agent (FBS)Environment(Macrocell + other Femtocells) S I N R Action(power level) S i g n a l defabcmnojklwxyztuvpqrsghi@ * + * + * + Figure 2. The proposed learning framework: the environment from the point of view of an agent (FBS), and itsinteraction with the environment in the learning procedure. Context defines the data needed to derive the state ofthe agent. 
Measurement refers to calculations needed to derive the reward of the agent.
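As a small illustration of (11), the global Q-value of a joint state-action pair is approximated by summing the local Q-tables of the agents. The helper below assumes integer-indexed local states and actions and sketches only the bookkeeping, not any training logic.

```python
def global_q_from_local(q_tables, states, joint_action):
    """Approximate the global Q-function by the sum of local Q-functions (Eq. (11)).

    q_tables     : list of K local Q-tables, q_tables[k][x_k, a_k]
    states       : list of K local state indices x_k
    joint_action : list of K local action indices a_k
    """
    return sum(q_tables[k][states[k], joint_action[k]]
               for k in range(len(q_tables)))
```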
C. Femtocell Network as Multi-Agent MDP
In a wireless communication system, the resource management policy is equivalent to the policy function in an MDP. To integrate the femtocell network into a multi-agent MDP, we define the following, according to Fig. 2.
• Environment: From the viewpoint of an FBS, the environment is comprised of the macrocell and all other femtocells.
• Agent: Each FBS is an independent agent in the MDP. In this paper, the terms agent and FBS are used interchangeably. An agent has three objectives: (i) improving its transmission rate, (ii) guaranteeing the required SINR for its user (i.e., Γ_k), and (iii) meeting the required SINR for the MUE.
• Action set (A_k): The transmit power level is the action of an FBS. The k-th FBS chooses its transmit power from the set A_k, which covers the space between p_min and p_max; p_min and p_max denote the minimum and maximum transmit power of the FBS, respectively. In general, the FBS has no knowledge of the environment, and it chooses its actions with equal probability in the training mode. Therefore, equal step sizes of ∆p are chosen between p_min and p_max to construct the set A_k.
• State set (X_k): The state set directly affects the performance of the MUE and the FUEs. To this end, we define four variables to represent the state of the network. The state set variables are defined based on the constraints of the optimization problem in (3). We define the variables X_1 and X_2 as indicators of the performance of the FUE and the MUE. On the other hand, the relative location of an FBS with respect to the MUE and the MBS is important and affects the interference power at the MUE caused by the FBS, and the interference power at the FBS caused by the MBS. Therefore, we define X_3 as an indicator of the interference imposed on the MUE by the FBS, and X_4 as an indicator of the interference imposed on the femtocell by the MBS. The state variables are defined as
– X_1 ∈ {0, 1}: The value of X_1 indicates whether the FBS is supporting its FUE with the required minimum SINR or not. X_1 is defined as X_1 = 1{γ_k ≥ Γ_k}.
– X_2 ∈ {0, 1}: The value of X_2 indicates whether the MUE is being supported with its required minimum SINR or not. X_2 is defined as X_2 = 1{γ_0 ≥ Γ_0}.
– X_3 ∈ {0, 1, 2, ..., N_1}: The value of X_3 defines the location of the FBS with respect to N_1 concentric rings around the MUE. The radii of the rings are d_1, d_2, ..., d_{N_1}.
– X_4 ∈ {0, 1, 2, ..., N_2}: The value of X_4 defines the location of the FBS with respect to N_2 concentric rings around the MBS. The radii of the rings are d′_1, d′_2, ..., d′_{N_2}.

The k-th FBS calculates γ_k based on the channel quality indicator (CQI) received from its related FUE to assess X_1. The MBS is aware of the SINR of the MUE, i.e., γ_0, and of the relative location of the FBS with respect to itself and the MUE. Therefore, the FBS obtains the information required to assess the X_2, X_3, and X_4 variables via the backhaul and feedback from the MBS.

Here, we have defined the state variables as functions of each FBS's SINR and location. Therefore, in the high-SINR regime, the states of the FBSs can be assumed to be independent of each other. In Section VI, we will examine different possible state sets to investigate the effect of the above state variables on the performance of the network.
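A minimal sketch of how an FBS could map its local measurements to the state vector (X_1, X_2, X_3, X_4) is given below. The ring-index convention (0 for the innermost region, N for beyond the outermost ring) and the argument names are assumptions of this example; the paper only specifies that X_3 and X_4 are ring indices around the MUE and the MBS.

```python
def encode_state(gamma_k, gamma_0, Gamma_k, Gamma_0,
                 d_mue, d_mbs, rings_mue, rings_mbs):
    """Map local measurements to the state variables (X1, X2, X3, X4) of Section IV-C.

    d_mue, d_mbs         : distances of the FBS to the MUE and to the MBS
    rings_mue, rings_mbs : increasing ring radii [d_1, ..., d_N1] and [d'_1, ..., d'_N2]
    """
    x1 = int(gamma_k >= Gamma_k)              # FUE meets its SINR target
    x2 = int(gamma_0 >= Gamma_0)              # MUE meets its SINR target
    x3 = sum(d_mue > r for r in rings_mue)    # assumed ring index around the MUE
    x4 = sum(d_mbs > r for r in rings_mbs)    # assumed ring index around the MBS
    return (x1, x2, x3, x4)
```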
V. Q-DPA, REWARD FUNCTION, AND SAMPLE COMPLEXITY

In this section, we present Q-DPA, which is an application of the proposed framework. Q-DPA details the learning method, the learning rate, and the training procedure. Then, the proposed reward function is defined. Finally, the required sample complexity for the training is derived.
A. Q-learning Based Distributed Power Allocation (Q-DPA)
To solve the Bellman equation in (6), we use Q-learning. The reasons for choosing RL and the advantages of Q-learning are explained in Sections IV-A and I-A, respectively. The Q-learning update rule to evaluate a policy for the global Q-function can be represented as [29]

Q(x^{(t)}, a^{(t)}) ← Q(x^{(t)}, a^{(t)}) + α^{(t)}(x, a) [ R^{(t+1)}(x^{(t)}, a^{(t)}) + β max_{a′} Q(x^{(t+1)}, a′) − Q(x^{(t)}, a^{(t)}) ],   (12)

where a′ ∈ A, α^{(t)}(x, a) denotes the learning rate at time step t, and x^{(t+1)} is the new state of the network. The term M = max_{a′} Q(x^{(t+1)}, a′) is the maximum value of the global Q-function available at the new state x^{(t+1)}. After each iteration, the FBSs receive the delayed reward R^{(t+1)}(x^{(t)}, a^{(t)}) and the global Q-function is updated according to (12).

In the prior works [19]–[21], [23], [24], a constant learning rate was used for Q-learning to solve the required optimization problems. However, according to [38], in a finite number of iterations, the performance of Q-learning can be improved by applying a decaying learning rate. Therefore, we use the following learning rate

α^{(t)}(x, a) = 1 / [1 + t(x, a)],   (13)

in which t(x, a) refers to the number of times, up to time step t, that the state-action pair (x, a) has been visited. It is worth mentioning that, by using the above learning rate, we need to keep track of the number of times each state-action pair has been visited during training, which requires more memory. Therefore, at the cost of more memory, a better performance can be achieved.

There are two alternatives available for the training of new FBSs as they join the network: independent learning or cooperative learning. In independent learning, each FBS tries to maximize its own Q-function. In other words, using the factorization method in Section IV-B, the term M in (12) is approximated as

M = max_{a′} Σ_{k ∈ K} Q_k(x^{(t+1)}_k, a′_k) ≈ Σ_{k ∈ K} max_{a′_k} Q_k(x^{(t+1)}_k, a′_k).   (14)

In cooperative learning, the FBSs share their local Q-functions and assume that FBSs with the same state make the same decision. Hence, the term M is approximated as

M = max_{a′} Σ_{k ∈ K} Q_k(x^{(t+1)}_k, a′_k) ≈ max_{a′_k} Σ_{k ∈ K′} Q_k(x^{(t+1)}_k, a′_k),   (15)

where K′ is the set of FBSs with the same state x^{(t+1)}_k. Cooperative Q-learning may result in a higher cumulative reward [39]. However, cooperation results in the same policy for FBSs with the same state and in additional overhead, since the Q-functions need to be shared between FBSs over the backhaul network.
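The difference between independent and cooperative learning is only in the quantity that an FBS maximizes when bootstrapping: its own Q-row, as in (14), or the element-wise sum of the Q-rows of all FBSs sharing its next state, as in (15). A small sketch, with assumed list/array data structures:

```python
import numpy as np

def bootstrap_value(q_tables, next_states, fbs_index, cooperative):
    """Local bootstrap value used by one FBS when updating its Q-table.

    Independent learning: maximum of the FBS's own Q-row (cf. Eq. (14)).
    Cooperative learning: maximum of the summed Q-rows of all FBSs that share
    the same next state (cf. Eq. (15)); these rows are exchanged over backhaul.
    """
    x_next = next_states[fbs_index]
    if not cooperative:
        return np.max(q_tables[fbs_index][x_next])
    same_state = [k for k, x in enumerate(next_states) if x == x_next]
    summed_row = sum(q_tables[k][x_next] for k in same_state)
    return np.max(summed_row)
```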
The local update rule for the k-th FBS can be derived from (12) as

Q_k(x^{(t)}_k, a^{(t)}_k) ← Q_k(x^{(t)}_k, a^{(t)}_k) + α^{(t)} [ R^{(t+1)}(x^{(t)}_k, a^{(t)}_k) + β Q_k(x^{(t+1)}_k, a*_k) − Q_k(x^{(t)}_k, a^{(t)}_k) ],   (16)

where R^{(t+1)}(x^{(t)}_k, a^{(t)}_k) is the reward of the k-th FBS, and a*_k is defined as

a*_k = argmax_{a′_k} Q_k(x^{(t+1)}_k, a′_k),   (17)

and

a*_k = argmax_{a′_k} Σ_{k ∈ K′} Q_k(x^{(t+1)}_k, a′_k),   (18)

for independent and cooperative learning, respectively.

In this paper, a tabular representation is used for the Q-function, in which the rows of the table refer to the states and the columns refer to the actions of an agent. Generally, for large state spaces, neural networks are more efficient to use as Q-functions; however, part of this work is focused on the effect of the state-space variables, so we avoid a large number of state variables. On the other hand, we provide the exhaustive search solution to investigate the optimality of our solution, which is not possible for large state spaces.

The training of an FBS happens over L frames. At the beginning of each frame, the FBS chooses an action, i.e., a transmit power. Then, the FBS sends a frame to the intended FUE. The FUE feeds back the required measurements, such as the CQI, so that the FBS can estimate the SINR at the FUE and calculate the reward based on (24). Finally, the FBS updates its Q-table according to (16).

Due to the limited number of training frames, each FBS needs to select its actions in a way that covers most of the action space and improves the policy at the same time. Therefore, the FBS chooses its actions with a combination of exploration and exploitation, known as e-greedy exploration. In the e-greedy method, the FBS acts greedily with probability 1 − e (i.e., exploiting) and randomly with probability e (i.e., exploring). In exploitation, the FBS selects the action that has the maximum value in the current state in its own Q-table (independent learning) or in the summation of Q-tables (cooperative learning). In exploration, the FBS selects an action randomly to cover the action space and avoid biasing towards a local maximum. In [18], it is shown that for a limited number of iterations the e-greedy policy results in a final value closer to the optimal value compared to pure exploitation or pure exploration.

It is worth mentioning that the overhead of sharing Q-tables depends on the definition of the state model X_k according to Section IV-C. For instance, assume the largest possible state model, X_k = {X_1, X_2, X_3, X_4}. The variables X_3 and X_4 depend on the location of the FBS and are fixed during training. Therefore, one training FBS uses four rows of its Q-table and just needs the same rows from the other FBSs. Hence, if the number of active FBSs is |K|, the number of messages to the FBS in each training frame is 4 × (|K| − 1), each of size |A_k|.
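Putting the pieces together, one Q-DPA training frame for a single FBS consists of an e-greedy action choice followed by the tabular update (16) with the decaying learning rate (13). The sketch below assumes integer-indexed states and actions and takes the bootstrap value (computed independently or cooperatively, as above) as an input; the visit-count convention for (13) is an assumption, since the text does not state whether the current visit is counted in t(x, a).

```python
import numpy as np

rng = np.random.default_rng(0)

def q_dpa_update(Q, visits, x, a, reward, q_next_max, beta):
    """One tabular Q-DPA update for a single FBS (Eqs. (13) and (16)).

    Q, visits  : |X_k| x |A_k| arrays (Q-table and visit counts)
    q_next_max : bootstrap value for the next state, from Eq. (17) or (18)
    """
    alpha = 1.0 / (1.0 + visits[x, a])            # decaying learning rate, Eq. (13)
    visits[x, a] += 1
    Q[x, a] += alpha * (reward + beta * q_next_max - Q[x, a])   # Eq. (16)

def e_greedy_action(Q, x, e):
    """e-greedy action selection used during the training frames."""
    if rng.random() < e:                          # explore with probability e
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[x]))                   # exploit with probability 1 - e
```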
B. Proposed Reward Function

The design of the reward function is essential because it directly impacts the objectives of the FBS. Generally, there has not been a quantitative approach to designing the reward function. Here, we present a systematic approach for deriving the reward function based on the nature of the optimization problem under consideration. Then, we compare the behavior of the designed reward function to the ones in [19]–[21].

The reward function for the k-th FBS is represented as R_k. According to Section IV-C, the k-th FBS has knowledge of the minimum required SINR for the MUE, i.e., Γ_0, and the minimum required SINR for its related FUE, i.e., Γ_k. Also, after taking an action in each step, the k-th FBS has access to the rate of the MUE, i.e., r_0, and the rate of its related FUE, i.e., r_k. Therefore, R_k is considered as a function of the above four variables, R_k : (r_0, r_k, Γ_0, Γ_k) → R.

In order to design an appropriate reward function, we need to estimate the progress of the k-th FBS toward the goals of the optimization problem. Based on the input arguments of the reward function, we define two progress estimators, one for the MUE as (r_0 − log_2(1 + Γ_0)) and one for the k-th FUE as (r_k − log_2(1 + Γ_k)). To reduce computational complexity, we define the reward function as a polynomial function of the defined progress estimators,

R_k(r_0, r_k, Γ_0, Γ_k) = (r_0 − log_2(1 + Γ_0))^{k_1} + (r_k − log_2(1 + Γ_k))^{k_2} + C,   (19)

where k_1 and k_2 are integers and C ∈ R is a constant referred to as the bias of the reward function.

The constant bias, C, in the reward function has two effects on the learning algorithm: (i) on the final value of the states for a given policy π, and (ii) on the behavior of the agent at the beginning of the learning process, as follows.

1) Effect of bias on the final value of the states: Assume the reward function R_1 = f(·) and the reward function R_2 = f(·) + C, C ∈ R. We define the value of state x for a given policy π using R_1 as V_1(x), and the value of state x for the same policy using R_2 as V_2(x). According to (4),

V_2(x) = E_π [ Σ_{t=0}^{∞} β^t ( f^{(t+1)}(·) + C ) ] = E_π [ Σ_{t=0}^{∞} β^t f^{(t+1)}(·) ] + C Σ_{t=0}^{∞} β^t = V_1(x) + C / (1 − β).   (20)

Therefore, the bias of the reward function adds the constant value C / (1 − β) to the value of the states. However, all the states are affected in the same way after the convergence of the algorithm.

2) Effect of bias at the beginning of the learning process: This effect is studied using the action-value function of an agent, i.e., the Q-function. Assume that the Q-function of the agent is initialized with zero values and the reward function is defined as R = f(·) + C. Further, let the first transition of the agent from state x′ to state x′′ happen by taking action a at time step t, i.e., x^{(t)} = x′ and x^{(t+1)} = x′′. The update rule at time step t is given by (16) as

Q(x′, a) ← Q(x′, a) + α^{(t)}(x′, a) ( R(x′, a) + β max_{a′} Q(x′′, a′) − Q(x′, a) )
         ← α^{(t)}(x′, a) ( f(·) + β max_{a′} Q(x′′, a′) ) + α^{(t)}(x′, a) C,   (21)

where the last term, α^{(t)}(x′, a) C, is denoted by (A). According to the above, after the first transition from state x′ to state x′′, the Q-value of state x′ is biased by the term (A).
If A > 0, the value of state x′ increases, and if A < 0, the value of state x′ decreases. Therefore, the already visited states become more or less attractive to the agent at the beginning of the learning process, as long as the agent has not explored the state space enough. This change of behavior can be used to bias the agent towards desired actions or states. However, in basic Q-learning the agent has no prior knowledge about the environment. Therefore, we select the bias equal to zero, C = 0, and define the reward function as follows.

Definition 1.
The reward function for the k-th FBS, R_k : (r_0, r_k, Γ_0, Γ_k) → R, is a continuous and differentiable function on R defined as

R_k(r_0, r_k, Γ_0, Γ_k) = (r_0 − log_2(1 + Γ_0))^{k_1} + (r_k − log_2(1 + Γ_k))^{k_2},   (22)

where k_1 and k_2 are integers.

The objective of the FBS is to maximize its transmission rate. On the other hand, a high transmission rate for the MUE is a priority for the FBS. Therefore, R_k should have the following property:

∂R_k / ∂r_i ≥ 0,  i = 0, k.   (23)

The above property implies that a higher transmission rate for the FBS or the MUE results in a higher reward. Hence, considering Definition 1, we design a reward function that motivates the FBSs to increase r_k and r_0 as much as possible, even beyond the required rates, as

R_k = (r_0 − log_2(1 + Γ_0))^{2m−1} + (r_k − log_2(1 + Γ_k))^{2m−1},   (24)

where m is an integer. The above reward function considers the minimum rate requirements of the FUE and the MUE, while encouraging the FBS to increase the transmission rate of both.

To further explain the proposed reward function, we discuss the reward functions used in [19]–[21]. We refer to the designed reward function in [19] as the quadratic, in [20] as the exponential, and in [21] as the proximity reward function. The quadratic reward function is designed based on a conservative approach: the FBS is encouraged to select actions that result in a transmission rate close to the minimum requirement. Therefore, rates higher or lower than the minimum requirement result in the same amount of reward. The behavior of the quadratic reward function can be described as

∂R_k / ∂r_i × (r_i − log_2(1 + Γ_i)) ≤ 0,  i = 0, k.   (25)

The above property implies that if the rate of the FBS or the MUE is higher than the minimum requirement, the actions that increase the rate decrease the reward. Hence, this property works against increasing the sum transmission rate of the network. The exponential and proximity reward functions have the property in (23) for the rate of the FBS, and the property in (25) for the rate of the MUE. In other words, they satisfy the following properties
∂R_k / ∂r_0 × (r_0 − log_2(1 + Γ_0)) ≤ 0,  and  ∂R_k / ∂r_k ≥ 0.   (26)

As the density of the FBSs increases, the above properties result in increasing transmit power to achieve a higher individual rate for an FUE, while introducing higher interference for the MUE and the other neighboring FUEs. In fact, since increasing the FUE rate is rewarded, taking actions that increase the MUE rate decreases the reward. However, the FBS should have the option of decreasing its transmit power to increase the rate of the MUE. This behavior is important since it causes an FBS to produce less interference for its neighboring femtocells. Therefore, we give equal opportunity to increasing the rate of the MUE or the FUE.

Figure 3. Reward functions: (a) proposed reward function with m = 2, (b) quadratic reward function with its zero maximum at the minimum rate requirements, (c) exponential reward function, (d) proximity reward function.

The values of the reward functions differ between FBSs, but they exhibit the same behavior. In Fig. 3, we plot the values of the four reward functions discussed above: the proposed (Fig. 3a), quadratic (Fig. 3b), exponential (Fig. 3c), and proximity (Fig. 3d) reward functions. The important information that can be obtained from these plots is the maximal points of the reward functions, the behavior of the reward functions around the minimum requirements, and the behavior of the reward functions as r_k or r_0 increases. The proposed reward function in Fig. 3a pushes the FBS to select transmit power levels that increase both r_k and r_0, while the other reward functions have their maxima around the minimum rate requirements.
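The proposed reward (24) is a sum of odd-power polynomials of the two progress estimators, so it keeps increasing as r_0 and r_k grow beyond their targets, whereas a quadratic-style reward peaks at the minimum requirements. The sketch below implements (24) with the exponent written as 2m − 1, as reconstructed above, together with an illustrative quadratic reward for contrast; the quadratic form shown is not claimed to be the exact function of [19], only a function with the property (25).

```python
import numpy as np

def proposed_reward(r0, rk, Gamma0, Gammak, m=2):
    """Proposed reward of Eq. (24): odd-power polynomials of the progress estimators."""
    exponent = 2 * m - 1
    return ((r0 - np.log2(1.0 + Gamma0)) ** exponent
            + (rk - np.log2(1.0 + Gammak)) ** exponent)

def quadratic_style_reward(r0, rk, Gamma0, Gammak):
    """Illustrative quadratic-style reward: zero maximum at the minimum rate
    requirements, so exceeding the targets is not rewarded (property (25))."""
    return -((r0 - np.log2(1.0 + Gamma0)) ** 2
             + (rk - np.log2(1.0 + Gammak)) ** 2)
```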
C. Sample Complexity

In each training frame, Q-DPA collects one sample from the environment, represented as a state-action pair in the Q-function. The sample complexity is defined as the minimum number of samples required to train the Q-function to achieve an ε-optimal policy. For ε > 0 and δ ∈ (0, 1), π is an ε-optimal policy if [40]

Pr( ‖Q* − Q^π‖ < ε ) ≥ 1 − δ.   (27)

The sample complexity depends on the exploration policy that generates the samples. In Q-DPA, the e-greedy policy is used as the exploration policy. However, the e-greedy policy depends on the Q-function of the agent, which is itself being updated; in fact, the distribution of the e-greedy policy is unknown. Here, we provide a general bound on the sample complexity of Q-learning.

Proposition 1.
Assume R_max is the maximum of the reward function for an agent and Q^{(T)} is the action-value for state-action pair (x, a) after T iterations. Then, with probability at least 1 − δ, we have

‖Q* − Q^{(T)}‖ ≤ ( R_max / (1 − β) ) [ β / (T(1 − β)) + √( (1/T) ln( 2|X|·|A| / δ ) ) ].   (28)

Proof.
See Appendix A.

This proposition proves the stability of Q-learning and allows us to derive the minimum number of iterations needed to achieve an error of at most ε > 0 with respect to Q*, with probability 1 − δ, for each state-action pair. By setting the right-hand side of the above inequality equal to ε, the following corollary is obtained.

Corollary 1.
For any ε > 0, after

T = Ω( ( R_max^2 / (ε^2 (1 − β)^2) ) ln( 2|X|·|A_k| / δ ) )   (29)

iterations, Q^{(T)} reaches ε-optimality with probability at least 1 − δ.
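Corollary 1 is what sizes the training budget in Table II: with T samples per state-action pair, an FBS needs L = T × |X_k|·|A_k| training frames. The sketch below turns the bound, as reconstructed here, into a rough frame budget; the constants hidden by the Ω(·) notation and the exact exponents are not reproduced faithfully, so the returned numbers are only indicative of the scaling.

```python
import numpy as np

def training_frame_budget(R_max, eps, beta, delta, n_states, n_actions):
    """Rough per-pair sample budget T from the reconstructed bound (29), and the
    total number of training frames L = T * |X_k| * |A_k| used in Table II."""
    T = (R_max ** 2) / (eps ** 2 * (1.0 - beta) ** 2) \
        * np.log(2.0 * n_states * n_actions / delta)
    T = int(np.ceil(T))
    return T, T * n_states * n_actions
```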
VI. SIMULATION RESULTS
The objective of this section is to validate the performance of the Q-DPA algorithm with different learning configurations in a dense urban scenario. We first introduce the simulation setup and parameters. Then, we introduce four different learning configurations and analyze the trade-offs between them. Finally, we investigate the performance of Q-DPA with the different reward functions introduced in Section V-B. For the sake of simplicity, we use IL to denote independent learning and CL to denote cooperative learning.
A. Simulation Setup
We use a dense urban scenario for the simulation setup, as illustrated in Fig. 4. We consider one macrocell which supports multiple MUEs. The MBS assigns a subband to each MUE. Each MUE is located within a block of apartments, and each block contains two strips of apartments. Each strip has five apartments of size 10 m × 10 m. There is one FBS located in the middle of each apartment, each supporting one FUE, and we assume that the FUEs are always inside the apartments. The FBSs are closed-access; therefore, the MUE is not able to connect to any FBS, but it receives interference from the FBSs working on the same subband as itself. Here, we assume that the MUE and all the FBSs work on the same sub-carriers to consider the worst-case (high interference) scenario. The extension of the simulation to the multi-carrier scenario is straightforward but does not affect our investigations. We assume the block of apartments is located on the edge of the macrocell, and the MUE is assumed to be in between the two strips of apartments.

In these simulations, in order to initialize the state variables X_3 and X_4 in Section IV-C, the number of rings around the MBS and the MUE is set to three (N_1 = N_2 = 3), although, as the density increases, more rings with smaller diameters can be used to more clearly distinguish between the FBSs.

It is assumed that the FBSs and the MBS operate at the same carrier frequency of about 2 GHz. The MBS uses a fixed transmit power, and the FBSs choose their transmit power from the range p_min to p_max = 15 dBm in steps of ∆p dB (Table II). In order to model the pathloss, we use the urban dual-strip model from 3GPP TR 36.814 [41]. The pathloss models of the different links are provided in Table I, where R is the distance between a transmitter and a receiver in meters and L_ow is the outdoor wall penetration loss, set according to [41].

Figure 4. Dense urban scenario with a dual-strip apartment block located at the edge of the macrocell; FUEs are randomly located inside each apartment.

Table I. Urban dual strip pathloss model
Link | PL (dB)
MBS to MUE | 15.3 + 37.6 log_10 R
MBS to FUE | 15.3 + 37.6 log_10 R + L_ow
FBS to FUE (same apt. strip) | 38.46 + 20 log_10 R + 0.7 d_{2D,indoor}
FBS to FUE (different apt. strip) | max(15.3 + 37.6 log_10 R, 38.46 + 20 log_10 R) + 18.3 + 0.7 d_{2D,indoor} + L_ow

In Table I, d_{2D,indoor} is the 2-dimensional indoor distance. We assume that the apartments are single-floor; therefore, d_{2D,indoor} ≈ R. The fourth row of the pathloss models is also used for the links between the FBSs and the MUE.

The minimum SINR requirements for the MUE and the FUEs are defined based on the rate required to support the corresponding user. In our simulations, the minimum required transmission rate to meet the QoS of the MUE is assumed to be 4 (b/s/Hz), i.e., log_2(1 + Γ_0) = 4 (b/s/Hz). Moreover, for the FUEs the minimum required rate is set to 0.5 (b/s/Hz), i.e., log_2(1 + Γ_k) = 0.5 (b/s/Hz), k ∈ K. It is worth mentioning that, by knowing the medium access control (MAC) layer parameters, the values of the required rates can be calculated using [42, Eqs. (20) and (21)].

To perform Q-learning, the minimum number of required frames, i.e., L, is calculated based on achieving ε-optimality with probability of at least 1 − δ (Corollary 1). The simulation parameters are given in Table II. The values of the Q-learning parameters are selected according to our simulations and references [19]–[24].

The simulation starts with one femtocell. The FBS starts running the Q-DPA algorithm of Section V-A using IL. After convergence, the next FBS is added to the network. The new FBS runs Q-DPA, while the already-trained FBS simply acts greedily to choose its transmit power.
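The deployment procedure just described (femtocells added one at a time, with only the newcomer learning) can be summarized by the following driver loop; the fbs objects and their train_step / greedy_step methods are hypothetical wrappers around the Q-DPA update and the greedy policy of an already-trained FBS.

```python
def incremental_training(fbs_list, n_frames):
    """Sequential deployment used in the simulations: FBSs are added one at a
    time; the newcomer trains with Q-DPA while already-trained FBSs act greedily."""
    active = []
    for new_fbs in fbs_list:
        active.append(new_fbs)
        for _ in range(n_frames):          # L training frames for the newcomer
            for fbs in active[:-1]:
                fbs.greedy_step()          # trained FBSs: exploit their frozen Q-tables
            active[-1].train_step()        # new FBS: e-greedy Q-DPA update
    return active
```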
Table II. Simulation Parameters

Default parameters | Value || State parameters | Value
Frame time | 2 ms || d′_1, d′_2, d′_3 | 50, 150, 400 m
UE thermal noise | -174 dBm/Hz || d_1, d_2, d_3 |

FBS parameters | Value || Q-DPA parameters | Value
p_min | || L | T × |X_k|·|A_k| frames
p_max | 15 dBm || Learning parameter (β) |
∆p | || Exploration probability (e) | 10%

After convergence of the second FBS, the next one is added to the network, and so on. We present all the results versus the number of active femtocells in the system, from one to ten. Considering the size of the apartment block, and the assumption that all femtocells operate on the same frequency range, the density of deployment ranges up to several thousand FBS/km².
Here, we show the simulation results of distributed power allocation with Q-DPA. First, we define two different state sets, X^(1) = {X_1, X_3, X_4} and X^(2) = {X_2, X_3, X_4}. In both sets, the FBSs are aware of their relative location to the MUE and the MBS due to the presence of X_3 and X_4, respectively. The state set X^(1) gives the FBS knowledge of the status of its FUE, and the state set X^(2) gives the FBS knowledge of the status of the MUE. In order to understand the effect of independent and cooperative learning, and the effect of the different state sets, we use four learning configurations: independent learning with each of the two state sets, IL+X^(1) and IL+X^(2), and cooperative learning with each of the two state sets, CL+X^(1) and CL+X^(2). The results are compared with the greedy approach, in which each FBS chooses the maximum transmit power. The simulation results are shown in three figures: the transmission rate of the MUE (Fig. 5a), the sum transmission rate of the FUEs (Fig. 5b), and the sum transmit power of the FBSs (Fig. 5c).

Figure 5. Performance of different learning configurations: (a) transmission rate of the MUE, (b) sum transmission rate of the FUEs, (c) sum transmit power of the FBSs.

Table III. Performance of the four learning configurations in terms of Σ_k p_k, Σ_k r_k, and r_0, ranked from 1 (best) to 4 (worst).

According to Fig. 5c, in the greedy algorithm each FBS uses the maximum available power for transmission. Therefore, the greedy method introduces maximum interference for the MUE and has the lowest MUE transmission rate in Fig. 5a. On the other hand, despite using maximum power, the greedy algorithm does not achieve the highest transmission rate for the FUEs either (Fig. 5b). This is again due to the high level of interference.

The state set X^(2) provides knowledge of the MUE's QoS status to the learning FBSs. Therefore, as we see in Fig. 5a, the performance of IL with X^(2) is higher than that with X^(1); the same holds for CL. We see the reverse of this conclusion in the FUEs' sum transmission rate in Fig. 5b: the performance of IL with X^(1) is higher than that of IL with X^(2). This is because the FBSs are aware of the status of the FUE and therefore favor actions that drive the state variable X_1 = 1{γ_k ≥ Γ_k} to 1. The same holds when comparing the states in CL. In conclusion, the state set X^(1) works in favor of the femtocells and the state set X^(2) benefits the MUE.

We conclude from the simulation results that IL and CL present different trade-offs. More specifically, IL supports a higher sum transmission rate for the FBSs and a lower transmission rate for the MUE, while CL can support a higher transmission rate for the MUE at the cost of an overall lower sum transmission rate for the FBSs. From a power consumption point of view, IL results in a higher power consumption than CL. In general, IL trains an FBS to be selfish compared to CL. IL can be very useful when there is no means of communication between the agents. On the other hand, CL trains an FBS to be more considerate of other FBSs at the cost of communication overhead.

In Table III, we compare the performance of the four learning configurations.
In each column, the number 1 refers to the highest performance achieved and the number 4 refers to the lowest performance observed. The first column represents the sum of the transmit powers of the FBSs, the second column the sum of the transmission rates of the FUEs, and the third column the transmission rate of the MUE.

C. Reward Function Performance
Here, we compare the performance of the four reward functions discussed in Section V-B. Since the objective is to maximize the sum transmission rate of the FUEs, according to Table III we choose the combination IL+X^(1) as the learning configuration. The performance of the reward functions is reported as the MUE transmission rate (Fig. 6a), the sum transmission rate of the FUEs (Fig. 6b), and the sum transmit power of the FBSs (Fig. 6c). In each figure, the solution of the optimization problem obtained by exhaustive search and the performance of the greedy method are also provided. The exhaustive search provides the highest achievable sum transmission rate for the network. The quadratic, exponential, and proximity reward functions result in a fast decay of the MUE transmission rate, while the proposed reward function results in a much slower decrease of the MUE rate. The proposed reward function also achieves a higher sum transmission rate than the other three reward functions. Fig. 6c indicates that the proposed reward function reduces the sum transmit power at the FBSs, which in turn could result in lower levels of interference at the FUEs.

Figure 6. Performance of the proposed reward function compared to the quadratic, exponential, and proximity reward functions: (a) transmission rate of the MUE, (b) sum transmission rate of the FUEs, (c) sum transmit power of the FBSs.

In comparison with the exhaustive search solution, i.e., the optimal solution, there is a performance gap. For instance, according to Fig. 6c, for eight FBSs the proposed reward function uses a lower average sum transmit power than the optimal solution. However, as we see in Fig. 6b and Fig. 6a, by using more power the sum transmission rate can be improved and the transmission rate of the MUE can be reduced to the level of the exhaustive solution without violating its minimum required rate. In future work, we aim to close this gap by using neural networks as the function approximator of the learning method.

VII. CONCLUSION AND FUTURE WORK
VII. CONCLUSION AND FUTURE WORK

In this paper, we propose a learning framework for a two-tier femtocell network. The framework enables the addition of a new femtocell to the network, with the femtocell training itself to adapt its transmit power so that it supports its serving user while protecting the macrocell user. Moreover, the proposed method, as a distributed approach, can solve the power optimization problem in dense HetNets while significantly reducing power usage. The proposed framework is generic and motivates the design of machine learning based SONs for management schemes in femtocell networks. In addition, the framework can be used as a test bench for evaluating the performance of different learning configurations, such as Markov state models, reward functions, and learning rates. Further, the proposed framework can be applied to other interference-limited networks, such as cognitive radio networks.

In future work, it would be interesting to consider mmWave-enabled femtocells in the present setup. The high pathloss and shadowing, along with the vulnerability of directional mmWave signals to blockage, would affect the learning outcome [43], which in turn affects the subsequent power optimization problem. In addition, as discussed in detail in the simulation section, there is a performance gap between the proposed approach and the exhaustive search. Although the proposed approach has lower computational complexity, we wish to close this gap by utilizing neural networks as the function approximator of the learning method, since neural networks can handle large state-action spaces more efficiently. Moreover, a complementary direction for achieving a higher sum data rate and closing the performance gap would be to feed the interference model of the network into the factorization process, so that a better factorization of the global Q-function can be obtained.

APPENDIX A
PROOF OF PROPOSITION 1
Proof. Assume an MDP represented as $(\mathcal{X}, \mathcal{A}, \Pr(y\,|\,x,a), r(x,a))$ and a policy $\pi$ with value function $V^{\pi}: \mathcal{X} \rightarrow \mathbb{R}$ and Q-function $Q^{\pi}: \mathcal{Z} \rightarrow \mathbb{R}$, where $\mathcal{Z} = \mathcal{X} \times \mathcal{A}$. Here, $\mathcal{A}$ refers to the action space of one agent and $k$ is the iteration index. According to (4), the maximum of the value function can be found as $V_{\max} = \frac{R_{\max}}{1-\beta}$. The Bellman optimality operator is defined as
$$(\mathbf{T}Q)(x,a) \triangleq r(x,a) + \beta \sum_{y \in \mathcal{X}} \Pr(y\,|\,x,a)\, \max_{b \in \mathcal{A}} Q(y,b).$$
$\mathbf{T}$ is a contraction operator with factor $\beta$, i.e., $\|\mathbf{T}Q - \mathbf{T}Q'\| \leq \beta \|Q - Q'\|$, and $Q^{*}$ is the unique fixed point of $(\mathbf{T}Q)(x,a)$ for all $(x,a) \in \mathcal{Z}$. Further, for ease of notation and readability, the time-step notation is slightly changed: $Q_k$ refers to the action-value function after $k$ iterations.

Assume that the state-action pair $(x,a)$ is visited $k$ times and $\mathcal{F}_k = \{y_1, y_2, \ldots, y_k\}$ is the sequence of visited next states. At time step $k+1$, the update rule of Q-learning is
$$Q_{k+1}(x,a) = (1-\alpha_k)\, Q_k(x,a) + \alpha_k\, \mathbf{T}_k Q_k(x,a), \qquad (30)$$
where $\mathbf{T}_k Q_k$ is the empirical Bellman operator defined as $\mathbf{T}_k Q_k(x,a) \triangleq r(x,a) + \beta \max_{b \in \mathcal{A}} Q_k(y_k, b)$. (From this point on, for simplicity, we drop the dependency on $(x,a)$.) It is easy to show that $\mathbb{E}[\mathbf{T}_k Q_k] = \mathbf{T} Q_k$; therefore, we define the estimation error of each iteration as $e_k = \mathbf{T}_k Q_k - \mathbf{T} Q_k$. By using $\alpha_k = \frac{1}{k+1}$, the update rule of Q-learning can be written as
$$Q_{k+1} = \frac{1}{k+1}\left(k Q_k + \mathbf{T} Q_k + e_k\right). \qquad (31)$$
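For readers who prefer code, the following minimal sketch (an illustration under assumed data structures, not our simulation code) implements the tabular update (30)-(31) with the averaging step size $\alpha_k = 1/(k+1)$ for a single state-action pair:

```python
def q_dpa_style_update(q, x, a, reward, next_state, actions, visits, beta=0.9):
    """One tabular Q-learning update with alpha_k = 1/(k+1), cf. (30)-(31).

    q      : dict mapping (state, action) -> Q-value
    visits : dict mapping (state, action) -> number of previous updates k
    The empirical Bellman backup is T_k Q_k = r + beta * max_b Q_k(y_k, b).
    """
    k = visits.get((x, a), 0)
    alpha = 1.0 / (k + 1)                                   # averaging step size
    backup = reward + beta * max(q.get((next_state, b), 0.0) for b in actions)
    q[(x, a)] = (1 - alpha) * q.get((x, a), 0.0) + alpha * backup
    visits[(x, a)] = k + 1
    return q[(x, a)]

# Example: one update for a toy state encoded as a tuple, with 5 power levels.
q_table, counts = {}, {}
q_dpa_style_update(q_table, x=(1, 0, 1), a=2, reward=0.5,
                   next_state=(1, 0, 0), actions=range(5), visits=counts)
```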
Now, in order to prove Proposition 1, we need to state the following lemmas.

Lemma 1. For any $k \geq 1$,
$$Q_k = \frac{1}{k} \sum_{i=0}^{k-1} \mathbf{T}_i Q_i = \frac{1}{k}\left(\sum_{i=0}^{k-1} \mathbf{T} Q_i + \sum_{i=0}^{k-1} e_i\right). \qquad (32)$$
Proof. We prove this lemma by induction. The lemma holds for $k = 1$ since $Q_1 = \mathbf{T}_0 Q_0 = \mathbf{T} Q_0 + e_0$. We now show that if the result holds for $k$, then it also holds for $k+1$. From (31) we have
$$Q_{k+1} = \frac{k}{k+1} Q_k + \frac{1}{k+1}\left(\mathbf{T} Q_k + e_k\right) = \frac{k}{k+1} \cdot \frac{1}{k}\left(\sum_{i=0}^{k-1} \mathbf{T} Q_i + \sum_{i=0}^{k-1} e_i\right) + \frac{1}{k+1}\left(\mathbf{T} Q_k + e_k\right) = \frac{1}{k+1}\left(\sum_{i=0}^{k} \mathbf{T} Q_i + \sum_{i=0}^{k} e_i\right).$$
Thus (32) holds for all $k \geq 1$ by induction.
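Before moving on to Lemma 2, a quick numerical sanity check of Lemma 1 may be helpful (an illustrative sketch, not part of the proof): with $\alpha_k = 1/(k+1)$, iterating (30) for a single state-action pair reproduces the running average of the empirical Bellman backups, as (32) states.

```python
# Toy check that the averaging recursion (30)/(31) equals the running
# average in (32) for a single scalar state-action pair.
import random

random.seed(0)
beta, q = 0.9, 0.0            # start from Q_0 = 0
backups = []                   # the sequence T_i Q_i

for k in range(50):
    # toy empirical backup: random reward plus discounted bootstrap value
    t_k_q_k = random.uniform(0.0, 1.0) + beta * q
    backups.append(t_k_q_k)
    q = (1 - 1.0 / (k + 1)) * q + (1.0 / (k + 1)) * t_k_q_k   # update (30)

running_average = sum(backups) / len(backups)                  # right side of (32)
assert abs(q - running_average) < 1e-12
print(q, running_average)
```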
Lemma 2. Assume that the initial action-value function $Q_0$ is uniformly bounded by $V_{\max}$. Then, for all $k \geq 1$, we have $\|Q_k\| \leq V_{\max}$ and $\|Q^{*} - Q_k\| \leq 2V_{\max}$.
Proof. We first prove that $\|Q_k\| \leq V_{\max}$ by induction. The inequality holds for $k = 1$ since
$$\|Q_1\| = \|\mathbf{T}_0 Q_0\| = \|r + \beta \max Q_0\| \leq \|r\| + \beta \|Q_0\| \leq R_{\max} + \beta V_{\max} = V_{\max}.$$
Now, assume that $\|Q_i\| \leq V_{\max}$ holds for $1 \leq i \leq k$. First,
$$\|\mathbf{T}_k Q_k\| = \|r + \beta \max Q_k\| \leq \|r\| + \beta \|\max Q_k\| \leq R_{\max} + \beta V_{\max} = V_{\max}.$$
Second, from Lemma 1 we have
$$\|Q_{k+1}\| = \frac{1}{k+1}\left\|\sum_{i=0}^{k} \mathbf{T}_i Q_i\right\| \leq \frac{1}{k+1} \sum_{i=0}^{k} \|\mathbf{T}_i Q_i\| \leq V_{\max}.$$
Therefore, the inequality holds for all $k \geq 1$ by induction. The bound on $\|Q^{*} - Q_k\|$ now follows from
$$\|Q^{*} - Q_k\| \leq \|Q^{*}\| + \|Q_k\| \leq 2V_{\max}.$$
Lemma 3. Assume that the initial action-value function $Q_0$ is uniformly bounded by $V_{\max}$. Then, for any $k \geq 1$,
$$\|Q^{*} - Q_k\| \leq \frac{2\beta V_{\max}}{k(1-\beta)} + \frac{1}{k}\left\|\sum_{i=0}^{k-1} e_i\right\|. \qquad (33)$$
Proof. From Lemma 1, we have
$$Q^{*} - Q_k = Q^{*} - \frac{1}{k}\left(\sum_{i=0}^{k-1} \mathbf{T} Q_i + \sum_{i=0}^{k-1} e_i\right) = \frac{1}{k}\sum_{i=0}^{k-1}\left(\mathbf{T} Q^{*} - \mathbf{T} Q_i\right) - \frac{1}{k}\sum_{i=0}^{k-1} e_i.$$
Therefore, we can write
$$\|Q^{*} - Q_k\| \leq \frac{1}{k}\left\|\sum_{i=0}^{k-1}\left(\mathbf{T} Q^{*} - \mathbf{T} Q_i\right)\right\| + \frac{1}{k}\left\|\sum_{i=0}^{k-1} e_i\right\| \leq \frac{1}{k}\sum_{i=0}^{k-1}\|\mathbf{T} Q^{*} - \mathbf{T} Q_i\| + \frac{1}{k}\left\|\sum_{i=0}^{k-1} e_i\right\| \leq \frac{\beta}{k}\sum_{i=0}^{k-1}\|Q^{*} - Q_i\| + \frac{1}{k}\left\|\sum_{i=0}^{k-1} e_i\right\|,$$
and according to [44], $\|Q^{*} - Q_i\| \leq \beta^{i}\|Q^{*} - Q_0\|$. Hence, using Lemma 2, we can write
$$\|Q^{*} - Q_k\| \leq \frac{\beta}{k}\sum_{i=0}^{k-1}\beta^{i}\, 2V_{\max} + \frac{1}{k}\left\|\sum_{i=0}^{k-1} e_i\right\| \leq \frac{2\beta V_{\max}}{k(1-\beta)} + \frac{1}{k}\left\|\sum_{i=0}^{k-1} e_i\right\|.$$

Now, we prove Proposition 1 by using the above result in Lemma 3. To this end, we need to bound the norm of the sum of errors in the inequality of Lemma 3. First, we can write
$$\frac{1}{k}\left\|\sum_{i=0}^{k-1} e_i\right\| = \frac{1}{k}\max_{(x,a) \in \mathcal{Z}}\left|\sum_{i=0}^{k-1} e_i(x,a)\right|.$$
For the estimation-error sequence $\{e_0, e_1, \cdots, e_{k-1}\}$, we have the property that $\mathbb{E}[e_k \mid \mathcal{F}_{k-1}] = 0$, which means that the error sequence is a martingale difference sequence with respect to $\mathcal{F}_k$. Therefore, according to the Hoeffding-Azuma inequality [45], for a martingale difference sequence $\{e_0, e_1, \cdots, e_{k-1}\}$ that is bounded by $V_{\max}$, for any $t > 0$ we can write
$$\Pr\left(\left|\sum_{i=0}^{k-1} e_i\right| > t\right) \leq 2\exp\left(\frac{-t^{2}}{2kV_{\max}^{2}}\right).$$
Therefore, by a union bound over the state-action space, we have
$$\Pr\left(\left\|\sum_{i=0}^{k-1} e_i\right\| > t\right) \leq 2\,|\mathcal{X}|\,|\mathcal{A}|\exp\left(\frac{-t^{2}}{2kV_{\max}^{2}}\right) = \delta,$$
and then
$$\Pr\left(\frac{1}{k}\left\|\sum_{i=0}^{k-1} e_i\right\| \leq V_{\max}\sqrt{\frac{2}{k}\ln\frac{2|\mathcal{X}|\,|\mathcal{A}|}{\delta}}\right) \geq 1 - \delta.$$
Hence, with probability at least $1-\delta$ we can say
$$\|Q^{*} - Q_k\| \leq \frac{R_{\max}}{1-\beta}\left[\frac{2\beta}{k(1-\beta)} + \sqrt{\frac{2}{k}\ln\frac{2|\mathcal{X}|\,|\mathcal{A}|}{\delta}}\right].$$
Consequently, the result in Proposition 1 is proved.
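To make the bound of Proposition 1 tangible, the following short sketch evaluates the final high-probability expression for an assumed parameter set; the numbers ($|\mathcal{X}| = 8$, $|\mathcal{A}| = 5$, $\beta = 0.9$, $R_{\max} = 1$, $\delta = 0.05$) are illustrative and are not taken from our simulations.

```python
# Evaluate the high-probability bound on ||Q* - Q_k|| derived above,
# for illustrative (assumed) parameter values.
import math

def q_learning_error_bound(k, n_states, n_actions, beta, r_max, delta):
    """Bound on ||Q* - Q_k|| with alpha_k = 1/(k+1), as in the proof above."""
    v_max = r_max / (1.0 - beta)
    bias_term = 2.0 * beta * v_max / (k * (1.0 - beta))
    noise_term = v_max * math.sqrt((2.0 / k) * math.log(2.0 * n_states * n_actions / delta))
    return bias_term + noise_term

for k in (10**2, 10**3, 10**4, 10**5):
    eps = q_learning_error_bound(k, n_states=8, n_actions=5,
                                 beta=0.9, r_max=1.0, delta=0.05)
    print(f"k = {k:>6}: ||Q* - Q_k|| <= {eps:.3f} with probability >= 0.95")
```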
REFERENCES

[1] R. Amiri, H. Mehrpouyan, L. Fridman, R. K. Mallik, A. Nallanathan, and D. Matolak, "A machine learning approach for power allocation in HetNets considering QoS," in Proc. IEEE ICC, pp. 1–7, May 2018.
[2] O. G. Aliu, A. Imran, M. A. Imran, and B. Evans, "A survey of self organisation in future cellular networks," IEEE Commun. Surv. Tutor., vol. 15, no. 1, pp. 336–361, First Quarter 2013.
[3] J. Moysen and L. Giupponi, "From 4G to 5G: Self-organized network management meets machine learning," CoRR, vol. abs/1707.09300, 2017. [Online]. Available: http://arxiv.org/abs/1707.09300
[4] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. K. Soong, and J. C. Zhang, "What will 5G be?" IEEE J. Select. Areas Commun., vol. 32, no. 6, pp. 1065–1082, June 2014.
[5] M. Peng, D. Liang, Y. Wei, J. Li, and H. Chen, "Self-configuration and self-optimization in LTE-advanced heterogeneous networks," IEEE Commun. Mag., vol. 51, no. 5, pp. 36–45, May 2013.
[6] M. Agiwal, A. Roy, and N. Saxena, "Next generation 5G wireless networks: A comprehensive survey," IEEE Commun. Surv. Tutor., vol. 18, no. 3, pp. 1617–1655, Third Quarter 2016.
[7] P. V. Klaine, M. A. Imran, O. Onireti, and R. D. Souza, "A survey of machine learning techniques applied to self-organizing cellular networks," IEEE Commun. Surv. Tutor., vol. 19, no. 4, pp. 2392–2431, Fourth Quarter 2017.
[8] A. Imran, A. Zoha, and A. Abu-Dayya, "Challenges in 5G: how to empower SON with big data for enabling 5G," IEEE Network, vol. 28, no. 6, pp. 27–33, Nov 2014.
[9] R. Li, Z. Zhao, X. Zhou, G. Ding, Y. Chen, Z. Wang, and H. Zhang, "Intelligent 5G: When cellular networks meet artificial intelligence," IEEE Wirel. Commun., vol. 24, no. 5, pp. 175–183, Oct 2017.
[10] V. Chandrasekhar, J. G. Andrews, T. Muharemovic, Z. Shen, and A. Gatherer, "Power control in two-tier femtocell networks," IEEE Trans. Wireless Commun., vol. 8, no. 8, pp. 4316–4328, Aug 2009.
[11] Z. Lu, T. Bansal, and P. Sinha, "Achieving user-level fairness in open-access femtocell-based architecture," IEEE Trans. Mobile Comput., vol. 12, no. 10, pp. 1943–1954, Oct 2013.
[12] H. Claussen, "Performance of macro- and co-channel femtocells in a hierarchical cell structure," in Proc. IEEE 18th Int. Symp. Pers. Indoor Mobile Radio Commun., pp. 1–5, Sep 2007.
[13] R. Amiri and H. Mehrpouyan, "Self-organizing mm-wave networks: A power allocation scheme based on machine learning," in Proc. IEEE GSMM, pp. 1–4, May 2018.
[14] Y. Sinan Nasir and D. Guo, "Deep reinforcement learning for distributed dynamic power allocation in wireless networks," ArXiv e-prints, Aug. 2018.
[15] D. Lopez-Perez, X. Chu, A. V. Vasilakos, and H. Claussen, "Power minimization based resource allocation for interference mitigation in OFDMA femtocell networks," IEEE J. Select. Areas Commun., vol. 32, no. 2, pp. 333–344, Feb 2014.
[16] M. Yousefvand, T. Han, N. Ansari, and A. Khreishah, "Distributed energy-spectrum trading in green cognitive radio cellular networks," IEEE Trans. Green Commun., vol. 1, no. 3, pp. 253–263, Sep 2017.
[17] H. Yazdani and A. Vosoughi, "On cognitive radio systems with directional antennas and imperfect spectrum sensing," in Proc. IEEE ICASSP, pp. 3589–3593, March 2017.
[18] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.
[19] A. Galindo-Serrano and L. Giupponi, "Distributed Q-learning for aggregated interference control in cognitive radio networks," IEEE Trans. Veh. Technol., vol. 59, no. 4, pp. 1823–1834, May 2010.
[20] H. Saad, A. Mohamed, and T. ElBatt, "Distributed cooperative Q-learning for power allocation in cognitive femtocell networks," in Proc. IEEE Veh. Technol. Conf., pp. 1–5, Sep 2012.
[21] J. R. Tefft and N. J. Kirsch, "A proximity-based Q-learning reward function for femtocell networks," in Proc. IEEE Veh. Technol. Conf., pp. 1–5, Sep 2013.
[22] M. Bennis, S. M. Perlaza, P. Blasco, Z. Han, and H. V. Poor, "Self-organization in small cell networks: A reinforcement learning approach," IEEE Trans. Wireless Commun., vol. 12, no. 7, pp. 3202–3212, July 2013.
[23] B. Wen, Z. Gao, L. Huang, Y. Tang, and H. Cai, "A Q-learning-based downlink resource scheduling method for capacity optimization in LTE femtocells," in Proc. IEEE Int. Comp. Sci. and Edu., pp. 625–628, Aug 2014.
[24] Z. Gao, B. Wen, L. Huang, C. Chen, and Z. Su, "Q-learning-based power control for LTE enterprise femtocell networks," IEEE Syst. J., vol. 11, no. 4, pp. 2699–2707, Dec 2017.
[25] M. Miozzo, L. Giupponi, M. Rossi, and P. Dini, "Distributed Q-learning for energy harvesting heterogeneous networks," in Proc. IEEE ICCW, pp. 2006–2011, June 2015.
[26] B. Hamdaoui, P. Venkatraman, and M. Guizani, "Opportunistic exploitation of bandwidth resources through reinforcement learning," in Proc. IEEE GLOBECOM, pp. 1–6, Nov 2009.
[27] G. Alnwaimi, S. Vahid, and K. Moessner, "Dynamic heterogeneous learning games for opportunistic access in LTE-based macro/femtocell deployments," IEEE Trans. Wireless Commun., vol. 14, no. 4, pp. 2294–2308, April 2015.
[28] K.-L. A. Yau, P. Komisarczuk, and P. D. Teal, "Reinforcement learning for context awareness and intelligence in wireless networks: Review, new features and open issues," J. Netw. Comput. Appl., vol. 35, no. 1, pp. 253–267, Jan 2012.
[29] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992.
[30] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, "Reward function and initial values: Better choices for accelerated goal-directed reinforcement learning," in Proc. ICANN, pp. 840–849, 2006.
[31] Z.-Q. Luo and W. Yu, "An introduction to convex optimization for communications and signal processing," IEEE J. Select. Areas Commun., vol. 24, no. 8, pp. 1426–1438, Aug 2006.
[32] S. Niknam and B. Natarajan, "On the regimes in millimeter wave networks: Noise-limited or interference-limited?" in Proc. IEEE ICCW, pp. 1–6, May 2018.
[33] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for interference management," IEEE Trans. Signal Processing, vol. 66, no. 20, pp. 5438–5453, Oct 2018.
[34] C. Boutilier, T. L. Dean, and S. Hanks, "Decision-theoretic planning: Structural assumptions and computational leverage," J. Artif. Intell. Research, vol. 11, pp. 1–94, July 1999.
[35] R. Amiri, H. Mehrpouyan, D. Matolak, and M. Elkashlan, "Joint power allocation in interference-limited networks via distributed coordinated learning," in Proc. IEEE Veh. Technol. Conf., to be published. [Online]. Available: https://arxiv.org/abs/1806.02449
[36] Z. Lin and M. van der Schaar, "Autonomic and distributed joint routing and power control for delay-sensitive applications in multi-hop wireless networks," vol. 10, no. 1, pp. 102–113, Jan 2011.
[37] C. Guestrin, M. G. Lagoudakis, and R. Parr, "Coordinated reinforcement learning," in Proc. ICML, pp. 227–234, July 2002.
[38] E. Even-Dar and Y. Mansour, "Learning rates for Q-learning," J. Mach. Learn. Research, vol. 5, pp. 1–25, Dec 2004.
[39] L. Busoniu, R. Babuška, and B. D. Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Trans. Syst., Man, Cybern. C, vol. 38, no. 2, pp. 156–172, March 2008.
[40] M. J. Kearns and S. P. Singh, "Finite-sample convergence rates for Q-learning and indirect algorithms," NIPS, vol. 11, pp. 996–1002, 1999.
[41] 3GPP, "Evolved Universal Terrestrial Radio Access (E-UTRA); Further advancements for E-UTRA physical layer aspects," 3rd Generation Partnership Project (3GPP), Technical Specification (TS) 36.814, 03 2010, version 9.0.0.
[42] C. C. Zarakovitis, Q. Ni, D. E. Skordoulis, and M. G. Hadjinicolaou, "Power-efficient cross-layer design for OFDMA systems with heterogeneous QoS, imperfect CSI, and outage considerations," IEEE Trans. Veh. Technol., vol. 61, no. 2, pp. 781–798, Feb 2012.
[43] S. Niknam, R. Barazideh, and B. Natarajan, "Cross-layer interference modeling for 5G mmWave networks in the presence of blockage," ArXiv e-prints, Jul. 2018.
[44] A. L. Strehl, L. Li, and M. L. Littman, "Reinforcement learning in finite MDPs: PAC analysis," J. Mach. Learn. Res., vol. 10, pp. 2413–2444, Dec. 2009.
[45] W. Hoeffding, "Probability inequalities for sums of bounded random variables,"