Fairness-Oriented User Scheduling for Bursty Downlink Transmission Using Multi-Agent Reinforcement Learning
Mingqi Yuan, Qi Cao, Man-On Pun, Senior Member, IEEE, and Yi Chen
Abstract—User scheduling is a classical problem and key technology in wireless communication, and it will still play an important role in the prospective 6G. Many sophisticated schedulers are widely deployed in base stations, such as Proportional Fairness (PF) and Round-Robin Fashion (RRF). It is known that Opportunistic (OP) scheduling is the optimal scheduler for maximizing the average user data rate (AUDR) under full-buffer traffic. However, the optimal strategy achieving the highest fairness remains largely unknown, both for full-buffer traffic and for bursty traffic. In this work, we investigate the problem of fairness-oriented user scheduling, especially RBG allocation. We build a user scheduler using Multi-Agent Reinforcement Learning (MARL), which performs distributional optimization to maximize the fairness of the communication system. The agents take cross-layer information (e.g., RSRP, buffer size) as state and the RBG allocation result as action, then explore the optimal solution under a well-defined reward function designed to maximize fairness. Furthermore, we take the 5%-tile user data rate (5TUDR) as the key performance indicator (KPI) of fairness, and compare the performance of MARL scheduling with PF scheduling and RRF scheduling through extensive simulations. The simulation results show that the proposed MARL scheduling outperforms the traditional schedulers.
Index Terms—User scheduling, RBG allocation, fairness-oriented, multi-agent reinforcement learning
I. INTRODUCTION
For the prospective 6G wireless communication technology, both the basic architectures and the components remain largely undefined [1]. It is foreseeable that 6G will bring an unprecedented revolution while meeting the tremendous and increasing network demand. The deployment of 5G greatly improves communication services in densely-populated cities, especially in indoor communication scenarios. However, a large population still cannot obtain basic data services, such as in developing countries and rural areas [2]. 6G is predicted to be human-centric and data-centric rather than machine-centric [3].
This work was supported, in part, by the Shenzhen Science and Technology Innovation Committee under Grants No. ZDSYS20170725140921348 and JCYJ20190813170803617, and by the National Natural Science Foundation of China under Grant No. 61731018. (Corresponding author: Man-On Pun.)
M. Yuan, Q. Cao, M. Pun and Y. Chen are with the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 518172, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
M. Yuan, Q. Cao and M. Pun are also with the Shenzhen Key Laboratory of IoT Intelligent Systems and Wireless Network Technology, 518172, China.
M. Yuan, M. Pun and Y. Chen are also with the Shenzhen Research Institute of Big Data, 518172, China.
Therefore, 6G is expected to construct a wireless network that is both vertical and horizontal, so as to benefit a wider population rather than only the majority living in cities. This expectation emphasizes the fairness of network resource scheduling, which has been difficult to optimize in previous studies.

In this paper, we investigate the problem of user scheduling, especially Resource Block Group (RBG) allocation. We focus on fairness-oriented scheduling and aim to build a scheduler that maximizes the fairness of the network. As the key performance indicators (KPIs), the average user data rate (AUDR) and fairness are usually adopted to evaluate schedulers. It is known that Opportunistic (OP) scheduling is the optimal scheduler for maximizing the AUDR under full-buffer traffic [4]. OP scheduling follows a greedy policy and always allocates advantageous resources to the users with the highest expected transmission rate. However, the optimal strategy achieving the highest fairness has not been established for either full-buffer traffic or bursty traffic. Furthermore, the definition and measurement of fairness are still debatable today. In terms of allocation time, Round-Robin Fashion (RRF) is regarded as the optimal scheduler under full-buffer traffic [5]. RRF scheduling allocates all the resources to each user in turn, regardless of their status. This scheduler indeed realizes allocation-time equality, but it may waste much network resource, because a user still gets all the resources even when its data packets are exhausted. In this era of tremendous network demands, time-fairness is apparently obsolete and rate-fairness has been proposed instead. Therefore the 5%-tile user data rate (5TUDR), i.e., the user data rate of the worst 5% of users, is more widely leveraged to evaluate the fairness of the network [6]. Compared with time-fairness, the 5TUDR emphasizes the right of the inferior users but does not work against the AUDR. So it is likely feasible to maximize the 5TUDR while maintaining a considerable AUDR.

There are two main challenges in realizing this objective. On the one hand, the wireless communication network is highly complex, with multiple components and mechanisms, rendering the task of mathematically modeling the whole system analytically intractable. Most traditional model-based schedulers are designed on a single layer of the OSI protocol stack, and can therefore exploit only limited information when making allocations. For instance, Proportional Fairness (PF) scheduling calculates the allocation priorities by tracing the user data rate recorded by the MAC layer [7]. But the network performance is affected by multiple layers, so it is difficult to achieve higher fairness through model-based approaches. On the other hand, studying bursty traffic is a non-stationary task; it brings unaccountable uncertainty such as a time-varying number of users. Moreover, the 5TUDR is affected by long-term allocation steps, and it is difficult to optimize such an objective by optimizing single-step allocations alone.

To address this dilemma, machine learning (ML) methods have been introduced to wireless communication research [8] in recent years. In sharp contrast with model-based approaches, ML-based methods are data-driven and optimize by exploiting the massive data in the network. This allows ML-based methods to solve many network optimization problems without establishing a mathematical model. For instance, Cao et al.
designed a DNN-based framework that takes cross-layer network data as input to predict network-level system performance [9]. The simulation results show that the model can make accurate predictions at the cost of high computational complexity. For user scheduling, another method, Reinforcement Learning (RL), has also been introduced to improve existing schedulers or build new ones. RL is one of the most representative learning paradigms of machine learning, which aims to maximize the long-term expected return by learning from interactions with the environment [10]. In RL, the agent receives the state of the environment and makes a corresponding action; the agent is then rewarded or punished. To obtain a higher reward, the agent must continuously adjust its policy to make better actions. RL has strong exploration ability and often achieves surprising performance, and much pioneering work has already been carried out with it.

For instance, Comsa et al. suggest that PF scheduling cannot adapt to dynamic network conditions because of its fixed scheduling preference [11]. Thus they propose a parameterized PF scheduling entitled Generalized PF (GPF) using a deep RL approach. The GPF collects diverse network data such as the Channel Quality Indicator (CQI) and user throughput, then adaptively adjusts the scheduling preference in each transmission time interval (TTI). As a result, GPF is shown to be more flexible in handling dynamic network conditions. Furthermore, Zhang et al. build an RL-enabled algorithm selector, which scans the network status and intelligently selects the best scheduling algorithm from a set of scheduling algorithms in each TTI [12]. The simulation results show that this fusion strategy achieves higher performance compared to the traditional methods. However, both algorithms incur low efficiency because of their huge computation cost, and they are not independent schedulers. Therefore, Xu et al. consider building entirely independent schedulers with RL [13] and propose an RL-based scheduler. The scheduler takes the estimated instantaneous data rate, the averaged data rate and other metrics as the input of the agent, then outputs the allocation preference for each user. The performance comparison in [13] indicates that the RL-based scheduler can effectively improve the throughput, along with the fairness and the packet drop rate. But this scheduler can only allocate RBGs in full-buffer traffic, because it takes a fully-connected neural network as the backbone of the agent. Similarly, Hua et al. investigate resource management in network slicing, and propose a GAN-powered deep distributional RL method to find the optimal solution of demand-aware resource management [14]. The agent takes the number of arrived packets in each slice within a specific time window as input, then generates the bandwidth allocation for each slice. The simulation results show that the method significantly improves the utilization efficiency of resources.

In this work, we consider a Long-Term Evolution (LTE) network with one base station (BS) scheduling its RBGs to multiple users under bursty traffic. Taking the allocation result of the RBGs as the action in RL, the multiple RBGs produce a joint action. For single-agent RL, it is difficult to learn a joint policy because of the exponential action space. Thus we leverage Multi-Agent Reinforcement Learning (MARL) to address the problem, which maintains multiple agents and learns a distributional policy [15].
As multiple agents interact with the environment, the problem can be regarded as a game from the perspective of game theory, which takes the Nash Equilibrium (NE) as the optimal solution [16]. Inspired by the distributional setting and the powerful exploration abilities of RL, MARL offers a solution paradigm that can cope with complex problems. One of its most successful applications is AlphaStar, a game AI for StarCraft II built with MARL [17]. AlphaStar achieves grandmaster-level performance in confrontations with human players, which implies the great potential of MARL.

In this work, we first model the RBG allocation process as a stochastic game, in which each RBG is allocated by a specific agent independently. All the agents are set to be fully cooperative: they share the same reward function and maximize the objective together. Then a MARL algorithm entitled QMIX is leveraged to find the scheduling policy. Finally, we compare the performance of the MARL scheduler with PF scheduling and RRF scheduling. The simulation results show that our method outperforms the two traditional model-based algorithms. The main contributions of this paper are as follows:
• We analyze the necessity of fairness-oriented scheduling algorithms in the development of 6G, which is expected to provide more equitable network service. The main challenges of building such algorithms are fully discussed, and we summarize the applications of data-driven approaches in wireless communication, especially in wireless resource management;
• To build the fairness-oriented scheduling policy, we first build an LTE downlink wireless network to simulate realistic communication scenarios. Then we model the RBG allocation process as a stochastic game, and carefully define its components such as state, action and reward function. Moreover, we build the stochastic game considering bursty traffic, which brings dynamic network conditions such as a time-varying number of active users, and we propose an adaptive working mechanism for the agents;
• The reward function characterizes the optimization objective of the agents; it is difficult to obtain and is usually experience-based. To find a feasible reward function that maximizes the 5TUDR, we dive into the design of the reward function and develop a fairness-oriented reward function, which is shown to be effective in realizing our objective;
• Based on the proposed stochastic game, we build a MARL-based algorithm to search for the expected scheduling policy. The workflow of this algorithm is carefully summarized. After a large amount of training, the expected scheduling policy is successfully obtained; it achieves a high 5TUDR while maintaining a considerable AUDR;
• Finally, extensive simulations are conducted to demonstrate the better performance of the MARL scheduler over the traditional schedulers. Furthermore, we analyze the special scheduling policy of the MARL scheduler in depth through data analysis, and provide numerical results about the implementation details.
The remainder of the paper is organized as follows: Section II defines the system model and gives the problem formulation. Section III elaborates the framework of the stochastic game for RBG allocation. Section IV introduces the MARL algorithm for solving the stochastic game. Section V shows the simulation results and numerical analysis. Finally, Section VI summarizes the paper and proposes the prospects.

II. SYSTEM MODEL AND PROBLEM FORMULATION
A. System Model
We consider a single-cell LTE wireless communication network in which a BS serves multiple user equipments (UEs) in the downlink in Frequency Division Duplexing (FDD) mode. The UE arrival is modeled as following a distribution, such as the Poisson distribution with an arrival rate λ. Upon arrival, each UE requests a different but finite amount of traffic data that is stored in the buffer at the transmitter. After its request is completed, the corresponding UE departs the network immediately, which effectively simulates the bursty traffic mode. Furthermore, the frequency resources are divided into RBGs. We consider the most common network setting in which one RBG can be allocated to at most one UE in each TTI, as shown in Fig. 1, where there are k RBGs and four users in the network. At each TTI, if a user is scheduled, the corresponding data is loaded into a HARQ buffer and cleared out from the data buffer. Each user can only maintain a finite number of HARQ processes, and the ACK/NACK message is fed back with a fixed time period. The more specific network mechanisms such as retransmission and the OLLA process are elaborated in Appendix A. Based on this network, we aim to build an RBG allocation algorithm that realizes intelligent and flexible scheduling, so as to maximize the 5TUDR under bursty traffic while maintaining a considerable AUDR.
Fig. 1. Network Mechanisms
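As a minimal illustration of this traffic model, the sketch below generates Poisson arrivals and immediate departures once a UE's buffer is drained; the arrival rate, buffer sizes and per-TTI service amount are illustrative assumptions, and the min-based "scheduler" is only a placeholder.

```python
import numpy as np

def simulate_bursty_traffic(arrival_rate=0.1, horizon=1000, seed=0):
    """Toy bursty-traffic generator: Poisson UE arrivals per TTI; each UE
    carries a finite random buffer and departs as soon as it is served."""
    rng = np.random.default_rng(seed)
    active = {}          # uid -> remaining buffer (bits)
    next_id = 0
    for tti in range(horizon):
        # New UEs this TTI ~ Poisson(arrival_rate).
        for _ in range(rng.poisson(arrival_rate)):
            active[next_id] = int(rng.integers(10_000, 100_000))
            next_id += 1
        if active:
            uid = min(active)        # stand-in for a real scheduler
            active[uid] -= 8_000     # bits drained in one TTI (illustrative)
            if active[uid] <= 0:
                del active[uid]      # request finished: UE departs immediately
    return next_id                   # total number of UEs that arrived
```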
B. Problem Formulation
To formulate the specific optimization problem, we first define two key performance indicators for evaluating the utility of the schedulers. One is the average user data rate (AUDR), which measures the usage of the network resources. The other is the 5%-tile user data rate (5TUDR), which assesses the fairness of the scheduling.
1) Key Performance Indicators:
Denote by t_{n,a} and t_{n,d} the arrival and departure time of the user indexed by n, respectively. The user data rate (UDR) of user n at time t ∈ [t_{n,a}, t_{n,d}] can be defined as:

\psi_t^n = \frac{\sum_{i=t_{n,a}}^{t} T_n[i]\, A_n[i]}{\min(t,\, t_{n,d}) - t_{n,a}},   (1)

where T_n[i] is the packet size transmitted during the i-th TTI, and A_n[i] is the ACK/NACK feedback, with "1" indicating a successful transmission and "0" a failed one. Denote by N_t the total number of users that arrive during the time period TTI = [1, ..., t]. The AUDR can be defined as

D(t) = \frac{1}{N_t} \sum_{n=1}^{N_t} \psi_t^n.   (2)

Finally, let \varphi_t denote the 5%-tile user data rate, which can be written as:

\varphi_t = T(\Psi_t), \quad \Psi_t = (\psi_t^1, ..., \psi_t^{N_t}),   (3)

where T(·) is the searching operator that returns the user data rate of the worst 5% user.
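As a concrete companion to Eqs. (1)-(3), the sketch below computes the per-user UDR, the AUDR and the 5TUDR from a transmission log; the log layout and the use of an interpolated percentile in place of the search operator T(·) are our own illustrative choices.

```python
import numpy as np

def user_data_rate(T, A, t, t_arr, t_dep):
    """Eq. (1): delivered bits averaged over the user's active time.
    T[i]: packet size sent in TTI i; A[i]: 1 on ACK, 0 on NACK.
    Assumes t > t_arr so the denominator is positive."""
    delivered = sum(T[i] * A[i] for i in range(t_arr, t + 1))
    return delivered / (min(t, t_dep) - t_arr)

def audr_and_5tudr(udrs):
    """Eqs. (2)-(3): mean UDR over all arrived users, and the 5%-tile
    UDR standing in for the worst-5% search operator T(.)."""
    psi = np.asarray(udrs, dtype=float)
    return psi.mean(), np.percentile(psi, 5)
```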
2) Optimization Objective:
In this paper, we aim to maximize the 5TUDR under bursty traffic while maintaining a considerable AUDR. Given a network defined above with a set K of RBGs, we set the maximum number of users in each TTI as M. We choose to optimize the 5TUDR over a finite time horizon, where the arrival of the users follows a specific distribution. Taking the Poisson distribution for instance, we set the arrival rate of the users as λ. Denote by π the scheduling policy. For a period TTI = [1, ..., T], the optimization problem is written as:

\max_\pi \ \varphi_T   (4)
subject to
\psi_t^n \geq 0, \quad 1 \leq t \leq T, \ 1 \leq n \leq N_t,
D(T) \geq \kappa,
P(N_T) = \frac{\lambda^{N_T}}{N_T!} e^{-\lambda},

where P(·) is the probability and κ is the threshold. However, it is mathematically intractable to find such a policy π via traditional optimization methods because of the highly complex network components and mechanisms. In the following sections, we transform this scheduling problem into a stochastic game where the RBGs are allocated by multiple agents. Moreover, cross-layer network data is leveraged to serve as the observations in the game, so as to make allocations from a global perspective. Finally, a MARL-based algorithm is proposed to search for the solutions.

III. STOCHASTIC GAME FRAMEWORK FOR RBG ALLOCATION
There are two main challenges in solving Problem (4) under bursty traffic. On the one hand, the actual user data rate is affected by multiple factors on different layers, so it is almost impossible to build a mathematical model and solve it through traditional optimization methods. Moreover, the 5TUDR is a special quantity affected by the whole sequence of allocation steps; straightforwardly optimizing single-step scheduling may not improve it, and long-term optimization is indispensable. On the other hand, allocating multiple RBGs to multiple users can be considered as conducting a joint action, which produces an exponential action space. Searching for a solution in this action space is generally intractable in terms of both solvability and computation.

To address these two challenges, we consider optimizing the objective via a long-term and distributional method. A traditional model-based scheduler allocates all the RBGs following the same scheduling policy. This allows the scheduler to stably optimize a fixed objective, e.g., system throughput or fairness. But a static policy is short of flexibility and adaptability, and cannot cope with complex network conditions. In sharp contrast to the centralized policy, the distributional policy decomposes the exponential action space into smaller individual action spaces and optimizes the objective more efficiently. Based on the distributional policy, we further leverage cross-layer network data to make allocations. A traditional model-based scheduler always takes single-layer network data as its decision basis; e.g., PF scheduling uses the user data rate to calculate the allocation priorities. But cross-layer network data contains more comprehensive decision information, which may contribute to the scheduling from a wider perspective.

To learn such a distributional scheduling policy, we first model the RBG allocation process as a stochastic game, which is the generalization of the Markov Decision Process (MDP) to the multi-agent case.
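To gauge the size of the joint action space, consider illustrative numbers (not the paper's exact setup): with |K| = 3 RBGs and \hat{N} = 10 active users, a centralized scheduler must choose among \hat{N}^{|K|} = 10^3 = 1000 joint actions per TTI, and with |K| = 8 the count grows to 10^8. A factored, per-RBG policy instead leaves each of the |K| agents with only \hat{N} = 10 local actions, so the decision space grows linearly in |K| rather than exponentially.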
Definition 1.
This stochastic game can be defined by a tuple E = ⟨S, U, A, P, r, Z, O, γ⟩ [18], where:
• S is the global state space;
• U is the action space for each agent;
• A is a set of agents, and each agent is identified by a ∈ A;
• P(s' | s, u): S × U × S → [0, 1] is the transition probability, where u ∈ U ≡ U^{|A|};
• r(s, u, a): S × U × A → R is the reward function, which specifies each agent's reward;
• Z is the observation space from which each agent draws its individual observation z ∈ Z;
• O(s, a): S × A → Z is the observation function;
• γ ∈ [0, 1] is a discount factor.

We further consider the partially observable setting, in which agents get local observations o_a^t based on the observation function O(s, a) [19], [20]. Thus each agent has an action-observation history τ_a ∈ T ≡ (Z × U)*, on which it conditions a stochastic policy π_a(u_a | τ_a): T × U → [0, 1]. All the local policies constitute the joint policy π, and the distribution over the joint action can be formulated as:

\pi(\mathbf{u} \mid s_t) = \prod_a \pi_a(u_a \mid \tau_a).   (5)

Fig. 2 illustrates the overview of this stochastic game. The base station holds a set of K RBGs and accepts the UEs, then follows the scheduling policy π to allocate RBGs. In this game, |K| agents are built to allocate the RBGs respectively. Each agent maintains an independent scheduling policy which is only responsible for its own RBG. At each TTI, the agents observe the network data of the cell, which constitutes the state s. The scheduling policy π accepts the state and makes the corresponding action u. Here the action is the allocation result: it determines to which user each RBG should be allocated. The base station then accepts the allocation result and conducts the transmission as required. Finally, the environment evaluates the state-action pair and calculates a reward. The policy update procedure is elaborated in Section IV. In the following subsections, the specific forms of the states, actions and reward function in the stochastic game of RBG allocation are elaborated in detail.
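For implementation purposes, the game above can be wrapped in a small environment interface; the class and method names below are our own, not an API from the paper.

```python
from typing import List, Tuple
import numpy as np

class RBGAllocationGame:
    """Skeleton of the stochastic game: |K| agents, one per RBG."""

    def __init__(self, num_rbgs: int, max_users: int):
        self.num_rbgs = num_rbgs      # |K| agents / RBGs
        self.max_users = max_users    # user capacity M

    def reset(self) -> Tuple[np.ndarray, List[np.ndarray]]:
        """Return the initial global state s and local observations o_a."""
        raise NotImplementedError

    def step(self, joint_action: List[int]
             ) -> Tuple[np.ndarray, List[np.ndarray], float, bool]:
        """joint_action[a] is the user index chosen for RBG a.
        Returns (next state, next observations, shared reward, done)."""
        raise NotImplementedError
```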
A. Cross-layer Observation and Adaptive Action
Initialize a base station with a set of K RBGs, and randomly generate \hat{N} active users. A traditional model-based scheduler only observes single-layer information of the network, such as the MAC layer or the PHY layer. The limitation of the mathematical models deprives the allocation of comprehensive supporting information. To address this problem, we consider conducting the scheduling based on cross-layer information. To handle the cross-layer network data, a deep neural network (DNN) is leveraged to serve as the backbone of the agents. Hornik et al. proved that standard multilayer feedforward networks are capable of approximating any measurable function to any desired degree of accuracy [21]. The DNN provides powerful feature extraction ability and can approximate arbitrary functions [22]. This allows us to straightforwardly take the cross-layer network data as the observation and make actions according to the output of the DNN.

Fig. 2. The overview of the stochastic game of the RBG allocation. (a) A single cell which contains a base station and UEs. (b) The decision process of the scheduling policy π. (c) The update procedure of the scheduling policy, where QMIX is the MARL learning algorithm introduced in Section IV.
TABLE I
NETWORK DATA OF GLOBAL STATE

Attr.                               Layer   Dim.
RSRP                                PHY     1
Average of RB CQI (each RBG)        PHY     Num. of RBGs
Buffer size                         MAC     1
Scheduled times                     MAC     1
OLLA offset                         MAC     1
Historical user data rate (HUDR)    MAC     1
Table I lists the network data selected in this paper. Reference Signal Receiving Power (RSRP) is one of the key parameters representing the wireless signal strength in LTE networks. This metric is a constant throughout the interaction between user and base station, and straightforwardly reflects the level of the final transmission rate. The Channel Quality Indicator (CQI) is the information indicator of channel quality, which represents the current channel quality and corresponds to the signal-to-noise ratio (SNR) of the channel. Each user has a different CQI on different RBGs. This metric is adopted with reference to OP scheduling: to maximize the throughput, OP scheduling prefers to allocate more RBGs to advantageous users with higher expected transmission rates. The buffer size is the amount of data packets remaining to be transferred; it affects the residence time of the users under bursty traffic. The scheduled times attribute is designed to represent probabilistic fairness, indicating whether all the users have the same probability of getting RBGs. Finally, the HUDR is collected with reference to PF scheduling, which calculates the allocation priority of each user by tracing the HUDR and the instantaneous UDR. The configured agents are expected to take such metrics into consideration when conducting the allocation.

Denote by F = {f^1, ..., f^{|F|}} the set of selected network data from different layers; then the global state can be formulated as a feature matrix:

S = (f^1, \cdots, f^{|F|}), \quad f^i = (f_1^i, \cdots, f_{\hat N}^i)^T.   (6)

Under bursty traffic, the feature matrix is variable due to the time-varying number of active users in the cell. This also means the agents need to allocate a fixed number of RBGs to an uncertain number of users. However, a fully-connected neural network (FCNN) only accepts inputs (and produces outputs) of fixed dimensions [23]. To address this problem, a projection operation is applied to the state matrix S:

s = \mathcal{F}(S^T S), \quad s \in \mathbb{R}^{|F|^2},   (7)

where \mathcal{F}(·) is a flattening operator that flattens a matrix into a vector row by row, generating a vector of fixed dimension. Because many of the selected attributes are continuous values, it is almost impossible to generate exactly the same s twice.
TABLE II
NETWORK DATA OF LOCAL OBSERVATIONS

Attr.                               Layer   Dim.
RSRP                                PHY     1
Average of RB CQI                   PHY     1
Buffer size                         MAC     1
Scheduled times                     MAC     1
OLLA offset                         MAC     1
Historical user data rate (HUDR)    MAC     1
Based on the global state, we design the local observations using a subset of F (listed in Table II), which can be written as:

O = (f^1, \cdots, f^{|\hat F|}), \quad f^i = (f_1^i, \cdots, f_{\hat N}^i)^T, \quad \hat F \subset F.   (8)

As can be seen in Table II, the agents share some public network data such as the buffer size and the HUDR, while keeping their unique attributes. Since each RBG is allocated by an agent independently, each agent only needs to observe its own CQI on the UEs. Moreover, this exclusive feature highlights the uniqueness of each RBG, which may motivate the agents to explore diverse scheduling policies.

Similarly, the projection operation is also applied to the local observations to generate a fixed-dimension vector:

o = \mathcal{F}(O^T O), \quad o \in \mathbb{R}^{|\hat F|^2}.   (9)

As the DNN is leveraged as the backbone of the agents, we let π_{θ_a}^a denote the policy of agent a, parameterized by a fully-connected network with parameters θ_a. For RBG allocation, the policy accepts a local observation o_a and generates \hat N priorities for the \hat N users; the RBG is then allocated to the user with the highest priority:

u_a = \arg\max_{u_a} \{\pi_{\theta_a}^a(o_a)\}.   (10)

All the local actions constitute the joint action u, which can be written as follows when there are |K| RBGs and \hat N active users in the cell:

u = (u_1, \cdots, u_{|K|}), \quad u_a \in \{1, ..., \hat N\}.   (11)

Similar to the time-varying problem in the definition of the observation, the policy also needs to make adaptive allocations due to the time-varying \hat N. However, limited by the properties of the FCNN, it cannot generate adaptive output. As there is a maximum user capacity M, the users in the cell can be divided into two parts: \hat N active users and M − \hat N virtual users. We therefore set the output layer of the policy network with M neurons to generate M priorities, and select the action using only the former \hat N values. Fig. 3 illustrates the operation.
Fig. 3. Adaptive action.
Moreover, all the network data of the virtual users is set to 0, and this does not affect the final values of the observation vector due to the properties of the projection. With a large number of interactions and sufficient training, the agents are able to learn the corresponding pattern.
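The following sketch shows, under our own naming, how the projection of Eqs. (7) and (9) turns a variable number of users into a fixed-length input, and why zero-padded virtual users leave the result unchanged.

```python
import numpy as np

def project(feature_matrix: np.ndarray) -> np.ndarray:
    """Eq. (7)/(9): map an (num_users x num_features) matrix to a
    fixed-length vector by flattening S^T S (num_features^2 entries)."""
    S = np.asarray(feature_matrix, dtype=float)
    return (S.T @ S).ravel()        # row-major flattening

# Zero rows (virtual users) do not change S^T S, so padding the
# active users up to the capacity M is harmless:
rng = np.random.default_rng(0)
active = rng.random((4, 6))                     # 4 active users, 6 features
padded = np.vstack([active, np.zeros((3, 6))])  # pad to M = 7 users
assert np.allclose(project(active), project(padded))
```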
B. Fully Cooperative Game
As the final user data rate is determined by the usage of all the resources, we set all the agents to be fully cooperative in this game, which implies that all the agents receive the same reward and treat it as a global reward [16]:

r(s, \mathbf{u}, a) = r(s, \mathbf{u}, a'), \quad \forall a, a'.   (12)

As the agents are fully cooperative, the reward function is written as r(s, u) for convenience in the following sections.

In this paper, we aim to maximize the 5TUDR under bursty traffic. Since the reward function characterizes the optimization objective of the game, it should output, for each state-action pair, the performance gain toward the final 5TUDR. Considering a time period TTI = [1, ..., T], this reward function evaluates the contribution of each allocation step to the final 5TUDR. Thus the optimization objective in Problem (4) is transformed into:

\max_\pi \ \mathbb{E}\Big[\sum_{t=1}^{T} \gamma^t r_t(s_t, \mathbf{u}_t) \,\big|\, \pi\Big]   (13)
s.t. \psi_t^n \geq 0, \quad 1 \leq t \leq T, \ 1 \leq n \leq N_t,
D(T) \geq \kappa,
P(N_T) = \frac{\lambda^{N_T}}{N_T!} e^{-\lambda},

where π is the joint policy. Eq. (13) indicates that the objective of this stochastic game is to find a joint policy π that maximizes the long-term return, and the agents are allowed to explore diverse policies as long as Eq. (13) is maximized. However, the design of the reward function lacks rigorous theoretical support in practice; most configurations are experience-based. In the following subsection, we dive into the design of the reward function and discuss the insights behind it from a comprehensive perspective.

C. Reward Function
RRF scheduling achieves the highest time-fairness because every user gets responses equally. But this scheduling policy does not take account of resource usage, and wastes much of the available network resources. So we argue that the promotion of the 5TUDR should not largely decrease the average user data rate: considering the network capacity, we hope to maximize the 5TUDR while maintaining a considerable AUDR.

Intuitively, to maximize the 5TUDR, it is natural to take the increment of the 5TUDR as the reward function. But this reward function is found to be incompatible with bursty traffic. To demonstrate the conflict, we trace the variation of the 5TUDR in one random communication process lasting 500 TTIs. As can be seen in Fig. 4, the increment of the 5TUDR is almost always 0, interrupted by some sharp increases and sharp decreases. A sharp decrease is caused by the arrival of new users: if a new user arrives at the base station at TTI t, φ_t suddenly becomes 0 because the new user has no transmitted data packets. Once the new user is scheduled, its user data rate soars, which causes the sharp increase. Therefore, this reward function cannot cope with bursty traffic or properly evaluate the actions.

Similarly, we could take the increment of the AUDR as the reward function, aiming to improve the 5TUDR via the AUDR. But this is also affected by the time-varying number of active users. Fig. 5 illustrates the increment of the AUDR between two adjacent TTIs in one simulation. There are again some sharp increases and decreases. Here a sharp decrease is mainly caused by the arrival of new users, which reduces the instantaneous average value, while a sharp increase mainly occurs when the agents allocate the RBGs to a user with very high CQI values.
Fig. 4. The increment of 5TUDR between two adjacent TTIs in one simulation.
Fig. 5. The increment of AUDR between two adjacent TTIs in one simulation.
On the one hand, the scheduling policy is expected to maintain considerable usage of the network resources. The simplest way to realize this objective is to follow OP scheduling, which always allocates the RBGs according to high CQI. But this makes the scheduler entirely ignore the inferior users, so that the 5TUDR is greatly decreased. On the other hand, to maximize the 5TUDR, the scheduler cannot focus only on the CQI of users and should pay more attention to other decision bases such as the HUDR. But this in turn damages the usage of the network resources.

To address this dilemma, we propose the following reward function, which considers both usage and fairness:

G_\sigma(\mathbf{x}) = \prod_{i=1}^{n} \sin\!\left(\frac{\pi x_i}{2\max(\mathbf{x})}\right)^{\sigma}, \quad \sigma \in \mathbb{R}^+,
\Delta\Psi_t = (\Delta\psi_t^1, ..., \Delta\psi_t^{\hat N_t}), \quad \Delta\psi_t^i = \psi_t^i - \psi_{t-1}^i,
r_t = h\Big(\sum_{i=1}^{N_t} \psi_t^i - \sum_{i=1}^{N_{t-1}} \psi_{t-1}^i\Big) - \exp\{-G_\sigma(\Delta\Psi_t)\},   (14)

where G_\sigma(·) ∈ [0, 1] is the G's fairness index [24], h(x) = 1/(1 + e^{-x}) is the sigmoid function, N_t is the number of total arrived users in the cell, \hat N_t is the number of active users in the cell, and \psi_0^i = 0. The reward function of Eq. (14) is designed based on the following insights:
• This reward function conditions on the sum user data rate, which is independent of the time-varying number of total arrived users.
• To maintain the usage of the network resources, the scheduling policy is required to maximize the increment of the sum user data rate. This encourages the agents to allocate RBGs to the users with higher CQI.
• The G's fairness index is introduced as the punishment item, which conditions on the increment sequence of the individual user data rates. The gain of the sum user data rate can be realized through two possible solutions. One is to allocate the RBGs only to the superior users, which brings a tremendous gain in throughput. The other is to let more users contribute, which brings a smaller gain in throughput. If the agents follow the first method, they will be punished. Through the fairness index, the agents are forced to allocate RBGs more evenly, so as to promote the 5TUDR.

So far all the elements of the stochastic game have been obtained. For such a game problem, the Nash Equilibrium (NE) is usually used to describe the solution [16]: in the NE, each action of an agent is the best response to the other agents. But the NE is difficult to reach in many scenarios. To find the expected scheduling policy in this stochastic game, a learning algorithm based on Multi-Agent Reinforcement Learning (MARL) is proposed in the next section.
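To make Eq. (14) concrete, here is a direct transcription in Python; the exact form of the fairness index is our reconstruction of the garbled typesetting (a product of sines that equals 1 when all users gain equally), and the clipping of negative increments is an implementation choice, not something stated in the paper.

```python
import numpy as np

def g_fairness(delta_psi, sigma: float = 1.0) -> float:
    """G's fairness index over per-user UDR increments: 1.0 when all
    users gain equally, approaching 0 as the gain concentrates."""
    x = np.clip(np.asarray(delta_psi, dtype=float), 0.0, None)
    peak = x.max()
    if peak <= 0.0:
        return 0.0                    # no positive gain in this TTI
    return float(np.prod(np.sin(np.pi * x / (2.0 * peak)) ** sigma))

def reward(sum_udr_now: float, sum_udr_prev: float, delta_psi) -> float:
    """Eq. (14): sigmoid of the sum-UDR increment minus a penalty
    term that grows as the allocation becomes less even."""
    gain = 1.0 / (1.0 + np.exp(-(sum_udr_now - sum_udr_prev)))
    return gain - float(np.exp(-g_fairness(delta_psi)))
```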
IV. MARL-BASED ALGORITHM
MARL develops and analyzes learning rules and algorithms that can discover effective strategies in multi-agent environments. With its powerful self-exploration ability, MARL is making an increasing difference in the construction of AI systems. In this section, we first give the basic elements of MARL based on the defined stochastic game. Then the QMIX learning algorithm is introduced to maximize the expected long-term return defined in Eq. (13). Finally, the detailed MARL algorithm for RBG allocation is summarized.
A. MARL Background
The MARL agents learn from interactions with the environment and update their policy to maximize the long-term return. To evaluate the policy, the state-value function and action-value function are introduced to characterize the expected return:

V^\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{+\infty} \gamma^t r_t \,\big|\, \pi, s_0 = s\Big],
Q^\pi(s, \mathbf{u}) = \mathbb{E}\Big[\sum_{t=0}^{+\infty} \gamma^t r_t \,\big|\, \pi, s_0 = s, \mathbf{u}_0 = \mathbf{u}\Big].   (15)

Fig. 6. (a) Mixing network structure. In red are the hypernetworks that produce the weights and biases for the mixing network layers shown in blue. (b) The overall QMIX architecture. (c) Agent network structure. (d) The GRU structure, where h is the hidden information used for recursion [25].

V^\pi(s) represents the expected return if the agents continue executing the joint policy π after the global state s. Similarly, Q^\pi(s, u) represents the expected return if the agents execute the joint policy π after taking the joint action u at the global state s. Based on the action-value function, value-based RL algorithms choose the action at each step following [10]:

\mathbf{u} = \arg\max_{\mathbf{u}} \{Q^\pi(s, \mathbf{u})\}.   (16)

Moreover, the Bellman Optimality Theorem indicates that the optimal policy π* satisfies:

Q^{\pi^*}(s, \mathbf{u}) = Q^*(s, \mathbf{u}) := \sup_{\pi \in \Pi} \{Q^\pi(s, \mathbf{u})\}.   (17)

Therefore the optimal policy π* can be obtained if the action-value function is maximized. To find such a policy, the iteration method is adopted to carry out the action-value function iteration, which is the well-known Q-learning algorithm [26].
To conduct the iteration, the Bellman Equation (18) is introduced, which describes the relationship between the action-value of a state-action pair and the action-value of the successor state-action pair [27]:

Q^\pi(s, \mathbf{u}) = \mathbb{E}[r + \gamma Q^\pi(s', \mathbf{u}') \,|\, \pi].   (18)

And Q^*(s, u) satisfies the Bellman Optimality Equation:

Q^*(s, \mathbf{u}) = \mathbb{E}\Big[r + \gamma \max_{\mathbf{u}' \in \mathbf{U}} \{Q^*(s', \mathbf{u}')\}\Big].   (19)

Considering the high-dimensional state space, it is inevitable to use function approximation as the representation of action values. Let Q_\theta^\pi(s, u) denote the action-value function parameterized by a DNN with parameters θ. The optimization objective is transformed into finding θ such that Q_\theta^\pi(s, u) ≈ Q^*(s, u). The optimal parameters θ can be approached by minimizing the following temporal-difference error [28]:

L(\theta) = \Big[r + \gamma \max_{\mathbf{u}'} \{Q_\theta^\pi(s', \mathbf{u}')\} - Q_\theta^\pi(s, \mathbf{u})\Big]^2.   (20)
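The following is a minimal sketch of the temporal-difference step behind Eq. (20), written for a tabular Q-function so the mechanics are visible; the learning rate and the integer state/action encoding are illustrative.

```python
import numpy as np

def td_update(Q: np.ndarray, s: int, u: int, r: float, s_next: int,
              gamma: float = 0.99, lr: float = 0.1) -> float:
    """One Q-learning step: move Q[s, u] toward the Bellman target
    r + gamma * max_u' Q[s', u'] and return the squared TD error."""
    target = r + gamma * Q[s_next].max()
    td_error = target - Q[s, u]
    Q[s, u] += lr * td_error            # gradient step on Eq. (20)
    return td_error ** 2
```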
B. QMIX-Based RBG Allocation
The partially observable setting lets each agent maintain an observation history τ_a ∈ T; for the environment, there is a joint action-observation history τ ∈ T^{|A|}. The action-value function is then rewritten as Q_\theta^\pi(\tau, u). It is difficult to learn the optimal action-value function directly due to the exponential joint-action space; moreover, Q_\theta^\pi(\tau, u) cannot be used to apply decentralized control. To address this problem, QMIX is introduced to learn the joint action-value function via value decomposition. For convenience, we write Q_{tot} for Q_\theta^\pi(\tau, u) in the following sections.

QMIX is a model-free, value-based, off-policy, decentralized-execution and centralized-training algorithm for cooperative tasks, which takes a DNN function approximator to estimate the action-value function [29]. QMIX lets each agent a maintain an individual action-value function Q_a(\tau_a, u_a), conditioned on the action-observation history τ_a and the local action u_a. Decentralized execution is thus tractable based on the individual action-value functions. For centralized training, QMIX leverages a mixing network to calculate the centralized action-value function Q_{tot}. Furthermore, QMIX ensures that a global argmax operation applied to the centralized Q_{tot} generates the same result as a set of argmax operations applied to each Q_a:

\arg\max_{\mathbf{u}} \{Q_{tot}\} = \big(\arg\max_{u_1}\{Q_1(\tau_1, u_1)\}, \; \cdots, \; \arg\max_{u_n}\{Q_n(\tau_n, u_n)\}\big).   (21)

Mathematically, this imposes a monotonicity constraint on the relationship between Q_{tot} and each Q_a:

\frac{\partial Q_{tot}}{\partial Q_a} \geq 0, \quad \forall a.   (22)

Fig. 6 illustrates the framework of QMIX. At each step, each agent a accepts a local observation o_a and generates |U| action values. Then the action is chosen according to the ε-greedy principle [30]:

\pi_a(u_a \mid o_a) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{|U|}, & u_a = \arg\max_{u_a} \{Q_a(\tau_a, u_a)\} \\ \frac{\epsilon}{|U|}, & u_a \neq \arg\max_{u_a} \{Q_a(\tau_a, u_a)\} \end{cases}   (23)

This policy implies that the agent randomly selects an action from the action space with probability ε, or follows the action with the maximum action-value with probability 1 − ε. It allows the agents to make sufficient exploration, because it increases the probability of covering all the possible state-action pairs during training.

After the individual action-values of all the agents are generated, they are fed into the mixing network to generate the centralized action-value Q_{tot}. The mixing network consists of two parts: the backbone and the hyper-networks. As Q_{tot} conditions on the global state s, the mixing network uses the hyper-network to generate the weights and biases of its backbone. Let H_\eta and M_\beta denote the hyper-network and the mixing network, parameterized by η and β respectively. At each step t, the hyper-network accepts the global state s and generates the weights and biases:

\beta_t \leftarrow H_\eta(s_t).   (24)

Moreover, all the generated weights are positive so as to satisfy Eq. (22). Considering the mixing network in Fig. 6, let W_1, W_2, B_1, B_2 denote the weights and biases generated by the hypernetworks, Q = {Q_1, ..., Q_n} the input vector, and

g(x) = \begin{cases} \alpha(e^x - 1), & x < 0, \ \alpha > 0 \\ x, & x \geq 0 \end{cases}

the ELU activation function. Then Q_{tot} can be written as:

Q_{tot} = W_2 \, g(W_1 Q + B_1) + B_2 = W_2 \, g(Y) + B_2.   (25)

Then,

\frac{\partial Q_{tot}}{\partial Q} = \frac{\partial Q_{tot}}{\partial g(Y)} \cdot \frac{\partial g(Y)}{\partial Y} \cdot \frac{\partial Y}{\partial Q} = W_2 \cdot g'(W_1 Q + B_1) \cdot W_1.   (26)

Since

g'(x) = \begin{cases} \alpha e^x, & x < 0, \ \alpha > 0 \\ 1, & x \geq 0 \end{cases}   (27)

and W_1, W_2 are non-negative, we have:

\frac{\partial Q_{tot}}{\partial Q_a} \geq 0, \quad \forall a.   (28)
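Below is a condensed sketch of the mixing network of Eqs. (24)-(25) in PyTorch-style code; the layer sizes and names are our own, and the non-negativity of the generated weights is enforced with an absolute value, one common choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingNetwork(nn.Module):
    """Combines per-agent Q values into Q_tot; a hypernetwork
    conditioned on the global state s produces the (non-negative)
    mixing weights, which guarantees dQ_tot/dQ_a >= 0 (Eq. 28)."""

    def __init__(self, n_agents: int, state_dim: int, hidden: int = 32):
        super().__init__()
        self.n_agents, self.hidden = n_agents, hidden
        self.hyper_w1 = nn.Linear(state_dim, n_agents * hidden)
        self.hyper_b1 = nn.Linear(state_dim, hidden)
        self.hyper_w2 = nn.Linear(state_dim, hidden)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        w1 = self.hyper_w1(state).abs().view(-1, self.n_agents, self.hidden)
        b1 = self.hyper_b1(state).view(-1, 1, self.hidden)
        y = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)    # Eq. (25)
        w2 = self.hyper_w2(state).abs().view(-1, self.hidden, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        return (torch.bmm(y, w2) + b2).squeeze(-1).squeeze(-1)  # Q_tot
```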
Through the hyper-network, the global state is integrated to build the centralized action-value function. Finally, QMIX is trained by minimizing the following loss function:

L(\theta, \eta) = \Big[r + \gamma \max_{\mathbf{u}'} Q_{tot}(\tau', \mathbf{u}', s') - Q_{tot}(\tau, \mathbf{u}, s)\Big]^2,   (29)

where θ is the parameter set of the agents. Eq. (29) is similar to Eq. (20), along with its update operations. Finally, the QMIX procedure for searching the scheduling policy for RBG allocation is summarized in Algorithm 1.
Algorithm 1 QMIX for RBG Allocation

Initialize |K| agent networks Q_{θ_1}: T → R^{|U|}, ..., Q_{θ_{|K|}}: T → R^{|U|} with parameters θ = (θ_1, ..., θ_{|K|}).
Initialize the hyper-network H_η and the mixing network M_β with parameters η and β.
Set a replay buffer B, learning rate λ, discount factor γ and ε.
for epoch = 1, ..., E do
    Initialize the base station with a set of K RBGs and randomly generate \hat N initial users.
    Each agent receives its initial local observation o_a.
    for t = 1, ..., T do
        New users arrive at the cell according to the Poisson distribution.
        Each agent a makes a local action u_t^a following the ε-greedy policy and observes a new local observation o_{t+1}^a.
        Execute the joint action u_t, observe the global reward r_t, then observe the new global state s_{t+1}.
        Store the transition (s_t, r_t, s_{t+1}, o_t^a, o_{t+1}^a, u_t^a, h_{t-1}^a, h_t^a) in B.
        Sample a random minibatch of b transitions (s_i, r_i, s_{i+1}, o_i^a, o_{i+1}^a, u_i^a, h_{i-1}^a, h_i^a) from B.
        Get the |K| individual Q values:
            Q_{θ_a} = (Q_{θ_1}(τ^1, u_i^1), ..., Q_{θ_{|K|}}(τ^{|K|}, u_i^{|K|})),
            Q'_{θ_a} = (Q_{θ_1}(τ^{1'}, u_{i+1}^1), ..., Q_{θ_{|K|}}(τ^{|K|'}, u_{i+1}^{|K|})).
        Calculate the weights of the mixing network, then get Q_tot and Q'_tot:
            β_i ← H_η(s_i),  β_{i+1} ← H_η(s_{i+1}),
            Q_tot(τ, u_i, s_i; β_i) ← M_{β_i}(Q_{θ_a}),
            Q'_tot(τ', u_{i+1}, s_{i+1}; β_{i+1}) ← M_{β_{i+1}}(Q'_{θ_a}).
        Set y_i = r_i + γ max_{u_{i+1}} {Q'_tot(τ', u_{i+1}, s_{i+1}; β_{i+1})}.
        Update all the agent networks and the hyper-network by minimizing the loss:
            L(θ, η) = Σ_{i=1}^{b} [ y_i − Q_tot(τ, u_i, s_i; β_i) ]^2.
    end for
end for
V. SIMULATION RESULTS AND NUMERICAL ANALYSIS
So far the problem modeling is accomplished and the solution method is obtained. In this section, we present the simulation results and the corresponding numerical analysis. All the simulations are conducted on the network defined in Section II, which follows the mechanisms in Appendix A. Table III lists the parameters of the base station.
TABLE III
PARAMETERS FOR THE BASE STATION

Parameters                              Values
Transmit power for each RB              18 dBm
Number of RBs for each RBG              3
Number of RBGs                          3
Frequency bandwidth for each RBG        10 MHz
Noise power density                     -174 dBm/Hz
Minimum MCS                             1
Maximum MCS                             29
Maximum number of HARQ processes        8
Feedback period of HARQ                 8
Initial RB CQI value                    4
We deploy 3 RBGs in the base station to test the proposed algorithm. Each RBG is composed of three RBs, whose transmit power is 18 dBm. For transmission, each user owns at most eight HARQ processes, and the ACK/NACK message is fed back within seven TTIs.
A. Model Training
The agents are trained following the parameters listed in Table IV. Here one epoch represents one allocation process whose duration is 1000 TTIs. In each epoch, the base station continues to accept data requests and the agents make allocations until the duration is over. This produces many experiences, each composed of a state-action pair and the corresponding reward. The agents then select minibatches of experiences to update their parameters. The number of batches and the batch size are set to 10 and 256 respectively, where the batch size is the number of experiences in each minibatch.
TABLE IV
PARAMETERS FOR TRAINING

Parameters               Values
Epochs                   100
Duration of one epoch    1000 TTIs
Learning rate            1e-3
Learning rate decay      1e-7
Batches                  10
Batch size               256
Replayer capacity        2000
ε
As can be seen in Fig. 7, the episode reward of each epoch increases stably as the training progresses.
B. Performance Comparison
Fig. 7. Episode reward of each epoch.

To demonstrate the performance of the proposed scheduler, we compare it with two traditional schedulers, namely PF scheduling and RRF scheduling, both defined in Appendix B.

In this paper, we aim to maximize the 5TUDR of the network. PF scheduling is considered the most fairness-oriented when α₁ = 0 and α₂ = 1 [6]; in this configuration, PF scheduling only takes the historical user data rate to calculate the allocation priorities. But the realistic situation might be different under bursty traffic, because such a scheduler may not attain the whole capacity of the network.

To investigate the highest 5TUDR that PF scheduling may achieve, we measure an experience-based value through a large number of simulations. We take RRF scheduling and OP scheduling as the baselines, then perform the simulations using PF scheduling with different coefficients (listed in Table V), as sketched after the table.
TABLE V
LIST OF SCHEDULERS

Schedulers        Remarks
RRF scheduling    N/A
PF scheduling     α₁ ranging from 0 to 1; α₂ = 1
OP scheduling     N/A
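The generalized PF metric behind this comparison can be sketched as follows; since Appendix B is not reproduced here, the exponent placement (α₁ on the instantaneous rate, α₂ on the historical rate) is our assumption based on the common generalized-PF form.

```python
import numpy as np

def pf_priority(inst_rate: np.ndarray, hist_rate: np.ndarray,
                alpha1: float, alpha2: float) -> np.ndarray:
    """Generalized PF metric: r_inst^alpha1 / R_hist^alpha2.
    alpha1=0, alpha2=1 ranks users purely by (low) historical rate."""
    eps = 1e-9                       # avoid division by zero for new users
    return inst_rate ** alpha1 / (hist_rate + eps) ** alpha2

# PF1 ignores channel quality; PF2 trades it off against history.
inst = np.array([50.0, 80.0, 20.0])
hist = np.array([10.0, 40.0, 5.0])
print(pf_priority(inst, hist, 0.0, 1.0).argmax())  # PF1 picks user 2
```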
As can be seen in Fig. 8, PF scheduling achieves a higher 5TUDR as α₁ increases, and the 5TUDR finally stabilizes at a moderate value of α₁. When α₁ is nonzero, PF scheduling starts to consider the instantaneous estimated user data rate: advantaged users with better channels take higher allocation priorities and produce more throughput, which not only promotes the usage of the network resources but also contributes to the 5TUDR. Therefore, the following evaluations are performed with four schedulers: MARL scheduling, PF1 scheduling (α₁ = 0, α₂ = 1), PF2 scheduling (a nonzero α₁ with α₂ = 1) and RRF scheduling.

Fig. 8. Average 5TUDR of different schedulers in 1000 random simulations.

Similar to the model training, random simulations are performed to verify the performance of the four schedulers, and the AUDR and 5TUDR are calculated respectively. Fig. 9 and Fig. 10 illustrate the corresponding cumulative distribution functions (CDFs) of the performance differences. The CDFs show that the MARL scheduler achieves AUDR gains of 83.20 Bits/TTI, -29.84 Bits/TTI and 147 Bits/TTI over PF1, PF2 and RRF respectively. Moreover, the MARL scheduler achieves 5TUDR gains of 54.70 Bits/TTI, 39.42 Bits/TTI and 46.96 Bits/TTI over PF1, PF2 and RRF respectively. Therefore, the proposed MARL-based scheduler successfully promotes the 5TUDR while maintaining a considerable AUDR.
Fig. 9. CDF of performance difference in AUDR.
Fig. 10. CDF of performance difference in 5TUDR.
C. Policy Analysis
Different from the traditional schedulers, which have explicit formulations, the learned MARL-based scheduler is a black box described by a set of network parameters, so it is difficult to analyze its scheduling policy directly. In this section, we attempt to demonstrate the different scheduling preferences of the four schedulers by analyzing the allocation log. Considering a specific bursty-traffic case with only a finite number of users in the cell and no new arrivals, 5 users and 3 RBGs are deployed in the cell to conduct interactions for 500 TTIs.
1) More Transmission Time and Less Residence Time:
We first investigate the total transmission time and the total residence time of the five users. The total transmission time indicates how many times the users get RBGs; for instance, if the three RBGs are allocated to three users respectively, the transmission time is increased by 3. On the contrary, the residence time indicates how long the users need to finish their requests, including the waiting time. As can be seen in Fig. 11, the MARL-based scheduler significantly decreases the total residence time of the users. This indicates that the requests of the users are satisfied more quickly under the allocation of the MARL-based scheduler. Meanwhile, the total transmission time of the users is significantly increased, which means the users spend less time waiting. So the users can finish their data requests as quickly as possible, which produces a higher AUDR.
Fig. 11. Total residence time and total transmission time of all users (MARL, PF1, PF2, RRF).
2) Distributional Allocation:
Compared with the other schedulers, the MARL-based scheduler produces the highest total transmission time, which indicates that it prefers distributional allocation to centralized allocation. Table VI and Table VII illustrate the allocation results at TTI = 1 and TTI = 100. Here "T" and "F" mark whether an RBG is allocated to a user, and "N/A" means the user has already finished its transmission. Correspondingly, in Fig. 12, "One", "Two" and "Three" count how many times a scheduler allocates one, two or three RBGs to a single user within one TTI.

Fig. 12. Total transmission time with different numbers of RBGs allocated to a single user (groups: One, Two, Three; schedulers: MARL, PF1, PF2, RRF).
As can be seen in Fig. 12, the PF1 scheduler and the RRF scheduler almost always allocate all the RBGs to a single user throughout the communication process, whereas the MARL scheduler and the PF2 scheduler distribute the RBGs across different users. In each TTI, the users get more opportunities to obtain RBGs under the MARL and PF2 schedulers, which yields higher allocation fairness. The allocation of the MARL scheduler is even more distributional than that of the PF2 scheduler, because the MARL scheduler almost never allows a user to monopolize the RBGs, which produces a higher total transmission time and higher allocation fairness.
3) Scheduling Preference:
Finally, we attempt to identify the scheduling preference of the MARL scheduler. The policy of the MARL scheduler is a set of DNN parameters, which is intractable to demonstrate or summarize directly, but we can still infer the scheduling tendency by analyzing the allocation log. In the simulations of Table VI and Table VII, the UEs with smaller buffers are served and emptied first. Therefore, it is reasonable to assume that the MARL scheduler prefers to serve the users with less buffered data first. To test this assumption, we first investigate the number of active users in each TTI under the different schedulers. Fig. 13 shows that the MARL scheduler takes less time to satisfy more requests. Based on Fig. 13, we further investigate the scheduled time of the three small users from TTI = 1 to TTI = 83; for instance, if a user is allocated 2 RBGs in a TTI, its scheduled time increases by 2. As can be seen in Fig. 14, the MARL scheduler clearly spends more time on these users. By accelerating the transmission of the small users, the MARL scheduler promotes their UDR, which finally contributes to the AUDR.

TABLE VI. Allocation example (TTI = 1).

TABLE VII. Allocation example (TTI = 100).
Fig. 13. Number of active users in each TTI (MARL, PF1, PF2, RRF).
Fig. 14. Scheduled time of the small users (MARL, PF1, PF2, RRF).
As for the other features, we calculate the Spearman correlation coefficient matrix between the cross-layer features and the allocation results of the RBGs, which is illustrated in Fig. 15. In summary, the policy analysis shows that the MARL scheduler has learned:

• A distributional policy that prefers to distribute the RBGs among different users to promote the allocation fairness. Under the MARL scheduler, the users have more opportunities to obtain resources rather than just waiting. This guarantees that all the users can contribute to the transmission rate, so as to promote the 5TUDR, and it also shows that the proposed reward function indeed steers the learning direction of the agents.
Fig. 15. Spearman correlation coefficient matrix.

• To keep the system resources utilized, the MARL scheduler tries to satisfy the requests of the small users as quickly as possible. By shortening the completion time of the small users, the MARL scheduler achieves a considerable UDR gain, which greatly contributes to the final AUDR.

• The scheduling policy of the MARL scheduler is forward-looking: it conducts macroscopic, long-term allocation to realize the final objective. Although the MARL scheduler cannot be thoroughly interpreted due to the complexity of the DNN, it still demonstrates the great potential of MARL in user scheduling.

VI. CONCLUSION

In this work, we investigate the problem of user scheduling, especially the RBG allocation. To maximize the fairness performance of the communication system, a fairness-oriented scheduler is proposed based on multi-agent reinforcement learning. This scheduler can allocate multiple RBGs to multiple users in the bursty traffic, and it achieves the highest 5%-tile user data rate compared with the PF scheduling and the RRF scheduling. Furthermore, we analyze the scheduling policy of the MARL scheduler from diverse perspectives, such as the allocation time. There remains considerable room to develop the combination of user scheduling and reinforcement learning, and we will further improve the MARL-based scheduler to optimize both fairness and throughput. We believe that reinforcement learning will enable more pioneering work in this field in the future.

APPENDIX

A. Network mechanism

1) Outer Loop Link Adaptation (OLLA):
On the user side, CQI levels represent the received signal-to-noise ratio (SNR) on each resource block (RB), and they are periodically computed and reported. Thus, the received instantaneous SNR directly influences the selection of the MCS. However, there is a CQI offset, reflecting the detection sensitivity of each user, which is unknown to the BS. To compensate for the discrepancy between the chosen MCS and the optimal MCS for different users, an OLLA process is carried out at the BS to offset a reported CQI value q as

\bar{q} = [q + \alpha], \quad (30)

where [\cdot] is the rounding operation and \bar{q} is the offset CQI ready for MCS selection. Here, \alpha is the adjustment coefficient, updated as

\alpha = \begin{cases} \alpha + s_A, & A = 1, \\ \alpha + s_N, & A = 0, \end{cases} \quad (31)

where A is the corresponding ACK/NACK feedback, with "1" indicating a successful transmission and "0" a failed one, and the update steps s_A > 0 and s_N < 0 can be customized.
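As an illustration, here is a minimal sketch of the OLLA update in Eqs. (30) and (31). The step-size values are placeholders; only their sign constraints (s_A > 0, s_N < 0) come from the text.

```python
# Sketch of the per-user OLLA state described by Eqs. (30)-(31).

class OLLA:
    def __init__(self, s_ack=0.01, s_nack=-0.1):  # example step sizes
        self.alpha = 0.0                  # per-user CQI adjustment coefficient
        self.s_ack, self.s_nack = s_ack, s_nack

    def update(self, ack: bool):
        # Eq. (31): nudge alpha up on ACK (A = 1), down on NACK (A = 0).
        self.alpha += self.s_ack if ack else self.s_nack

    def offset_cqi(self, q: float) -> int:
        # Eq. (30): rounded, offset CQI used for MCS selection.
        return round(q + self.alpha)
```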
2) Transmission Block Formation:
Let F(q) denote the mapping from the CQI \bar{q} to its spectral efficiency (SE), and let \mathcal{I} denote the set of CQI levels measured by a user over a group of RBs. The number of bits that can be loaded onto the group of RBs for the user is given by

f(\mathcal{I}) = |\mathcal{I}| \cdot F\left( \left\lfloor \frac{1}{|\mathcal{I}|} \sum_{i} \{\mathcal{I}\}_i \right\rfloor \right), \quad (32)

where \lfloor \cdot \rfloor is the floor function. For instance, the estimated data rate of user n on RBG k is obtained as R_{n,k} = f(\mathcal{I}_n(k)), where \mathcal{I}_n(k) is the set of all RBs in RBG k measured by user n. When multiple RBGs are allocated to the user, they are utilized to convey one transport block (TB) with the same MCS. The size of data loaded onto the TB is given by T_n = f(\mathcal{I}_n), where \mathcal{I}_n is the set of all RBs allocated to user n.
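The following sketch computes Eq. (32) directly. The CQI-to-SE table is a placeholder stand-in for the actual mapping F(q), which is not reproduced in the text.

```python
# Sketch of the TB-size estimate in Eq. (32).
import math

SE_TABLE = {q: 0.15 * q for q in range(16)}   # placeholder F(q) for CQI 0..15

def tb_bits(cqi_per_rb):
    """f(I): bits loadable onto a set of RBs sharing one common MCS."""
    n = len(cqi_per_rb)                        # |I|
    q_eff = math.floor(sum(cqi_per_rb) / n)    # floor of the average CQI
    return n * SE_TABLE[q_eff]                 # |I| * F(.)

# Example: the estimated rate of user n on RBG k is tb_bits(I_n(k)).
```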
3) Hybrid Automatic Repeat reQuest and Retransmission:
When a TB is packed for a user, the corresponding data is loaded into a HARQ buffer and cleared from the data buffer. The BS can maintain at most eight HARQ processes (each with its own HARQ buffer) for each user, and it receives an ACK/NACK message from the corresponding user seven TTIs later. In the case of an ACK, the HARQ process terminates; in the case of a NACK, a retransmission is triggered and the HARQ process is kept alive. The RBGs initially assigned to the first transmission conduct the retransmission, and are thus temporarily unavailable for user scheduling. The MCS selection and TB size remain the same for the retransmission, and the HARQ process expires after five consecutive failures, which causes the so-called packet loss.
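A minimal sketch of this HARQ bookkeeping under the stated parameters (feedback after seven TTIs, expiry after five consecutive failures); the class layout and names are our own illustration.

```python
# Sketch of one HARQ process. The BS may hold up to MAX_PROCESSES of
# these per user; that cap is not modeled here.

MAX_PROCESSES, FEEDBACK_DELAY, MAX_TX = 8, 7, 5

class HarqProcess:
    def __init__(self, tb_bits, rbgs, tti):
        self.tb_bits, self.rbgs = tb_bits, rbgs  # frozen for all retransmissions
        self.tx_count, self.sent_tti = 1, tti

    def on_feedback(self, ack: bool, tti: int):
        """Returns 'done', 'retransmit', or 'expired' (i.e., packet loss)."""
        assert tti == self.sent_tti + FEEDBACK_DELAY  # feedback seven TTIs later
        if ack:
            return "done"
        if self.tx_count >= MAX_TX:                   # fifth consecutive failure
            return "expired"
        self.tx_count += 1
        self.sent_tti = tti                           # retransmit on the same RBGs
        return "retransmit"
```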
B. Benchmark schemes
Here we give the definitions of the PF, OP and RRF scheduling adopted in this work.
1) Proportional Fairness Scheduling (PF):
In the PF scheduling, any user with a non-empty buffer is a candidate for each RBG, and each RBG independently sorts all users according to their PF values. Denoting the PF value of user n on RBG k within transmission time interval (TTI) i by \zeta_{n,k}[i], we have

\zeta_{n,k}[i] = \frac{\left(R_{n,k}[i]\right)^{\alpha_1}}{\left(\hat{T}_n[i]\right)^{\alpha_2}}, \quad (33)

where \hat{T}_n[i] is the user's moving-average throughput, which can be expressed as

\hat{T}_n[i] = (1 - \chi)\,\hat{T}_n[i-1] + \chi\, T_n[i-1]. \quad (34)

In Eq. (34), \chi is the moving-average coefficient, and T_n[i-1] is the actual TB size of user n in TTI i-1. The users with higher PF values hold higher priority to occupy the RBG, and the PF value of a user can differ across RBGs. Specifically, the PF scheduling can be expressed as

P^{*}_{PF}(k) = \arg\max_{n} \{\zeta_{n,k}\}. \quad (35)
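Assuming R is a per-user, per-RBG estimated-rate matrix and tput holds the moving-average throughputs, a compact sketch of Eqs. (33)-(35) follows. The epsilon guard and the assumption that empty-buffer users are filtered out beforehand are ours.

```python
# Sketch of the PF priority rule, Eqs. (33)-(35).

def pf_allocate(R, tput, alpha1=1.0, alpha2=1.0, eps=1e-9):
    """For each RBG k, return the index of the user with the highest PF value.
    Assumes only users with non-empty buffers are present in R."""
    n_users, n_rbgs = len(R), len(R[0])
    return [
        max(range(n_users),
            key=lambda n: (R[n][k] ** alpha1) / ((tput[n] + eps) ** alpha2))
        for k in range(n_rbgs)
    ]

def update_tput(tput, tb_sizes, chi=0.01):
    """Eq. (34): moving-average throughput update, one entry per user."""
    return [(1 - chi) * t + chi * tb for t, tb in zip(tput, tb_sizes)]
```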
2) Opportunistic Scheduling (OP):
As the optimal scheduling in terms of system throughput maximization, opportunistic user scheduling is quite straightforward and can be expressed as

P^{*}_{OP}(k) = \arg\max_{n} \{R_{n,k}\}. \quad (36)

Each RBG selects the user with the highest estimated data rate, which implies the optimal scheduling for achieving the highest AUDR in the full buffer traffic, but not necessarily in the bursty traffic.
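Eq. (36) amounts to a per-RBG argmax over the estimated rates (equivalently, Eq. (33) with α1 = 1 and α2 = 0); a one-function sketch with the same illustrative interface as above:

```python
# Sketch of OP scheduling, Eq. (36): each RBG goes to the highest-rate user.

def op_allocate(R):
    n_users, n_rbgs = len(R), len(R[0])
    return [max(range(n_users), key=lambda n: R[n][k]) for k in range(n_rbgs)]
```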
3) Round Robin Fashion Scheduling (RRF):
The RRF scheduling allocates all RBGs to one user at a time, serving the users in turns regardless of their status. New users are appended to the end of the queue.
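A minimal sketch of this round-robin queue, using an illustrative deque-based interface:

```python
# Sketch of RRF scheduling: every RBG to the head-of-queue user, then rotate.
from collections import deque

class RoundRobin:
    def __init__(self, users):
        self.queue = deque(users)

    def add_user(self, user):
        self.queue.append(user)       # new users appended to the end of the queue

    def allocate(self, n_rbgs):
        user = self.queue[0]
        self.queue.rotate(-1)         # the next TTI serves the next user in turn
        return [user] * n_rbgs        # all RBGs to the same user, status ignored
```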