Hermes: Decentralized Dynamic Spectrum Access System for Massive Devices Deployment in 5G
Zhihui Gao, Duke University, [email protected]
Ang Li, Duke University, [email protected]
Yunfan Gao, ETH Zurich, [email protected]
Wang, Tsinghua University, [email protected]
Yiran Chen, Duke University, [email protected]
Abstract
With the incoming 5G network, ubiquitous Internet of Things (IoT) devices such as smart cameras and drones can benefit our daily life. With the introduction of the millimeter-wave band and the thriving number of IoT devices, it is critical to design new dynamic spectrum access (DSA) systems to coordinate spectrum allocation across massive devices in 5G. In this paper, we present Hermes, the first decentralized DSA system for massive device deployment. Specifically, we propose an efficient multi-agent reinforcement learning algorithm and introduce a novel shuffle mechanism, addressing the drawbacks of collision and fairness in existing decentralized systems. We implement Hermes in a 5G network via simulations. Extensive evaluations show that Hermes significantly reduces collisions and improves fairness compared to state-of-the-art decentralized methods. Furthermore, Hermes is able to adapt to environmental changes within 0.5 seconds, showing its practicality for deployment in the dynamic environment of 5G.
Categories and Subject Descriptors
Computer Systems Organization [COMPUTER-COMMUNICATION NETWORKS]: Network Architecture and Design
General Terms
Design, Standardization
Keywords
Dynamic spectrum access, 5G network, multi-agent reinforcement learning
Introduction

Recent development of the fifth-generation (5G) network, as well as the explosive growth of Internet of Things (IoT) devices, draws attention to spectrum management (SM), especially dynamic spectrum access. The hybrid spectrum landscape in 5G, i.e., the microwave and millimeter-wave (mmWave) bands, provides more available spectrum resources than before [22]. In addition, thriving numbers of ubiquitous IoT devices emerge in our daily life, such as smartwatches, augmented/virtual reality headsets, and self-driving cars. With the increase of available spectrum and IoT devices, there is an urgent demand for efficient and fair spectrum allocation management.

Traditional SM refers to centralized methods that are deployed on a central processor, such as a base station. The central processor collects all the sensory data from multiple user equipments (UEs) and schedules how to allocate the limited channels to UEs. The channels allocated by the central processor are referred to as the licensed channels. In cognitive networks, besides the licensed channels, UEs are also able to opportunistically access temporarily unused or unlicensed channels, which is termed dynamic spectrum access (DSA) [2, 3]. DSA often adopts decentralized methods that are deployed on UEs, such that UEs can access the shared spectrum in a coordinated manner. Compared to centralized SM, UEs do not have to wait for the decision of the base station but decide their actions on their own. Hence, the response delay is greatly decreased in DSA. However, it is very challenging for multiple UEs to coordinate a schedule plan without collisions.

One of the most widely used SM methods in 4G/5G cellular networks is proportional fairness (PF) [25]. PF strikes a better balance between total throughput and fairness compared to other methods such as round robin and best CQI (channel quality indication). As a centralized method, PF is also bottlenecked by serious delay when the number of UEs increases. In addition, all the sensory data of UEs needs to be uploaded to a central processor, which raises privacy concerns [5, 31]. As for decentralized methods in DSA, UEs make their decisions independently to maximize their own benefits. Therefore, game theory is introduced to analyze the scheduling strategy and benefits of UEs. For example, the game-theory-based algorithm ALLURE-U [26] provides an optimal transmission plan in a realistic setting. The plan decides how much power to allocate over licensed and unlicensed channels, and prospect theory is exploited to estimate the Nash Equilibrium. However, the output of ALLURE-U is only heuristic: it does not generate a detailed scheduling plan for each time slot and therefore cannot be directly applied.

Figure 1: A high-level overview of Hermes.

Besides game theory, multi-agent reinforcement learning (MARL) and deep Q networks (DQN) [15] have been applied to DSA. Deep Q-learning spectrum access (DQSA) [19] deploys a DQN on each UE and enables UEs to choose their own channels without a central processor or sharing their sensory data. DQSA works well when there are enough channels for each UE. However, as the number of UEs increases, DQSA fails in two aspects. First, DQSA shows poor fairness because UEs do not share the channels alternately but occupy certain channels.
Worse, UEs are so aggressive that they would rather cause collisions than tolerate other UEs occupying channels alone without collisions.

Table 1: Comparison between Hermes and existing SM and DSA techniques.

Methods    Mode           Fairness  UE Scalability
PF         Centralized    Yes       No
ALLURE-U   -              No        Yes
DQSA       Decentralized  No        No
Hermes     Decentralized  Yes       Yes
In this paper, we propose Hermes, a decentralized DSA system for 5G networks that effectively addresses the aforementioned drawbacks of existing MARL methods such as DQSA and achieves performance comparable to centralized PF. As Fig. 1 shows, Hermes consists of two modules: improved multi-agent reinforcement learning (iMARL), which is deployed on each UE, and a shuffle mechanism, which is deployed on local transceivers acting as shufflers. These two modules overcome the challenges of collisions and fairness, respectively. In iMARL, we define a novel reward function for Q-learning such that the Nash Equilibrium changes and collisions are significantly reduced. In addition, two modifications to the DQN's workflow are made to accommodate the shuffle mechanism. The novel shuffle mechanism, deployed on single or multiple shufflers, is proposed to improve fairness. Each shuffler does not need to collect all the models, only a portion from participating UEs. With the received models, the shuffler evaluates the UEs' preference for each model and distributes the models back accordingly. As the models are shuffled, the UEs' channel selections are shuffled; hence, UEs can fairly share all channels. Note that during a shuffle only the parameters of the iMARL model are uploaded, so the privacy of sensory data is protected.

Table 1 provides a comparison between Hermes and existing SM and DSA techniques. Hermes differs from PF in that it is a decentralized solution offering high UE scalability. Compared to ALLURE-U, Hermes provides a practical scheduling plan for each UE, taking fairness into account. In addition, Hermes overcomes the fairness and UE scalability drawbacks of DQSA.

We summarize four major contributions of Hermes:

• To the best of our knowledge, Hermes is the first decentralized DSA system that enables UEs to efficiently and fairly share channels in massive UE deployments.

• We propose an improved multi-agent reinforcement learning algorithm that significantly reduces channel collisions. Besides, iMARL contains a compact DQN structure specifically designed for Hermes, such that the communication cost of sharing the model with other UEs is significantly reduced.

• We propose a novel shuffle mechanism that shares the models from multiple UEs to achieve better fairness while protecting the UEs' privacy.

• We implement a prototype system of Hermes and conduct extensive experiments by simulating various 5G settings. Experimental results demonstrate that Hermes significantly reduces collisions and improves fairness compared to state-of-the-art decentralized techniques.
Background and Motivation

We begin by introducing existing centralized SM and decentralized DSA techniques. We then show that the state-of-the-art MARL-based approach incurs both serious collisions and unfairness when allocating channels to UEs, which motivates the design of Hermes.
Spectrum management (SM) [3] in cellular networks refers to the process by which a central base station allocates and schedules the limited channels on the spectrum to multiple UEs over time. The goal of SM is to maximize total throughput while guaranteeing fairness over UEs. To achieve this goal in 5G, scheduling is performed in three steps per period. First, all the UEs upload their sensory data to a 5G new radio gNodeB (gNB, i.e., the base station in 5G). The typical sensory data includes the channel quality indication (CQI) that describes the throughput capacity of each channel. Second, the gNB runs a scheduling strategy, such as PF, to determine the scheduling plan for the current period based on the received sensory data from all the UEs. Finally, the scheduling plan is offloaded to the UEs and executed; UEs can transmit data in the allocated channels until the next period.

Such a centralized method guarantees that UEs work cooperatively without collision, and hence yields good performance. However, it leads to privacy issues. For example, the absolute value of CQI indicates the distance from the UE to the gNB [5]; such information may breach UEs' localization and mobility privacy [31].

In addition, with the explosive increase of IoT devices (over 1 million connected devices per square kilometer [12]) that require massive channels, it is not practical for a gNB to efficiently manage all the UEs simultaneously. With a huge number of UEs, it is more challenging to synchronously upload sensory data on time, and it takes longer to perform the scheduling. Hence, the latency of SM dramatically increases; for example, Verizon reported that the latency of 5G in 2019 was 30 milliseconds [7]. Furthermore, with constrained spectrum resources, uploading sensory data and offloading schedule plans incur non-negligible resource consumption when massive UEs are involved.
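To make the PF strategy concrete, the sketch below shows one common greedy variant; the scalar per-UE rates and the one-RBG-per-UE cap mirror the implementation described later in our evaluation, while the function and variable names are illustrative assumptions rather than the 3GPP-specified procedure.

```python
# A minimal sketch of greedy proportional-fair allocation (assumptions: one
# scalar rate per UE, at most one RBG per UE, as in our evaluation setup).
import numpy as np

def pf_schedule(inst_rate: np.ndarray, avg_rate: np.ndarray, n_rbgs: int):
    """Give each RBG to the unserved UE with the highest ratio of
    instantaneous rate to historical average rate."""
    metric = inst_rate / np.maximum(avg_rate, 1e-9)  # PF metric per UE
    ranked = np.argsort(-metric)                     # best UEs first
    return {int(ue): rbg for rbg, ue in enumerate(ranked[:n_rbgs])}
```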
Compared to SM, DSA [30] is a decentralized mechanism developed for cognitive radios. Being cognitive means that radios can automatically detect available channels and choose the best one to use. In practice, the gNB connects to a portion of licensed UEs such as smartphones, namely primary users (PUs), and runs the centralized scheduling strategy to allocate channels to PUs. The rest of the UEs are considered unlicensed secondary users (SUs); they need to develop their scheduling plans independently and opportunistically, under the principle that the PUs' communication channels are not interfered with. For example, an SU can occupy a temporarily idle channel that is not allocated to any PU, or transmit with low power that does not interfere with the PUs' communication. SUs can be any kind of IoT device other than the PUs, such as smartwatches, smart glasses, and augmented/virtual reality headsets.
To address DSA at the SU side, reinforcement learning (RL) based methods have been proposed. The typical RL-based DSA is a single-agent paradigm [28, 16, 17] that deploys an RL model on a single SU. The goal of the RL is to learn the transmission pattern of nearby PUs and minimize the SU's collisions with them. RL can be formulated with three components: state, action, and reward. In these RL methods, the state includes the UE's own communication requirement and the sensed signal-to-interference-and-noise ratio (SINR) of each available channel, which is fully perceived by the UE. As the action, the UE decides whether to transmit and on which channel. It gets a positive reward if data is transmitted successfully and a negative one if not.
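For reference, a minimal tabular Q-learning update matching this state/action/reward formulation might look as follows; the learning rate and discount factor are illustrative, and the DQN variant discussed next replaces the table with a network.

```python
# A minimal sketch of one tabular Q-learning step (hyper-parameters assumed).
from collections import defaultdict

def make_q_table(num_actions: int):
    return defaultdict(lambda: [0.0] * num_actions)  # Q[state][action]

def q_update(Q, s, a, r, s_next, lr: float = 0.1, gamma: float = 0.9):
    """Move Q(s, a) toward the TD target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += lr * (td_target - Q[s][a])
```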
In most cases, more than one SU shares the unlicensed channels that are not occupied by PUs, so the interference among SUs needs to be considered. Therefore, multi-agent reinforcement learning (MARL) [19] based methods have been proposed. MARL focuses on the competition and cooperation among SUs, ignoring the existence of PUs. In particular, MARL-based methods assume a mapping from the non-consecutive unlicensed channels to consecutive virtual channels, as if there were no PUs in these virtual channels, and DSA is performed on the consecutive virtual channels. In this case, collisions (two or more UEs choose the same channel and none of them is able to transmit successfully) and idle channels (none of the UEs chooses a certain channel) are two critical challenges.

Figure 2: The channel selection using the MARL method over time, where 10 UEs share 3 channels. Yellow, red, and black represent the three channels, and white represents the case that the UE stays silent.

Note that in MARL each UE can only perceive its own state, not the states of its peers. This means the observation a UE obtains cannot fully represent the state. In such a non-Markovian model, UEs are supposed to decide their next action based not only on the current observation, but also on the action history in recent time slots as additional observations. With more information added to the state vector, the state space grows exponentially for tabular Q-learning. Hence, a deep Q network (DQN) is introduced to assist the decision.

We consider the competitive model where a UE's reward is based on its own throughput within a time slot. When there are sufficient channels for each UE, UEs are able to share the available channels with few collisions, as claimed in [19]. However, when there are more UEs than channels, directly applying MARL leads to two issues, as shown in Fig. 2. First, the majority of the UEs choose to request channels, leading to serious collisions all the time, so that none of the UEs can communicate successfully. The reason is that all the UEs are so selfish that they prefer to request channels, even with only a slight probability of obtaining access when the other competitors fall back to random choices (under the ε-greedy policy, agents take random choices with small probability ε), rather than staying silent. A more profound explanation of these collisions via the Nash Equilibrium is provided in Sec. 3.4.

Second, UEs tend to either occupy a particular channel or never request a channel over time, except under the ε-greedy policy. Such behavior is harmful even if no collision occurs: some UEs can always occupy channels and transmit data while others can never communicate through these channels, so a fairness issue arises. The reason is that the action history that UEs store is limited by several factors, including the UEs' memory size and the input size of the DQN model. With limited memory, it is impossible for a UE to learn an alternating choice pattern. As an extreme result, UEs stick to particular channels all the time.

We implement the above MARL method under different settings and analyze the trained model of each UE. We make two key observations from the existing MARL methods. First, the longer the models are trained and the closer they are to convergence, the more diverged the models across UEs become. Specifically, not only do models that converge to different channels have different parameters, but models that prefer the same channel also diverge. This observation indicates that these models do not need to be unified, and the discrepancy among converged models should be maintained.

Furthermore, the aforementioned model divergence does not result from the input sensory data. Instead, the subtle differences in the random initialization of the models lead to the significant discrepancy of the converged models. The interaction across UEs is the only feasible way to train a model that converges to a particular channel choice. This observation suggests that we cannot modify the models' parameters, but can either reuse an existing model or train models from scratch.

Based on these two key observations, we propose a novel shuffle mechanism in Hermes, where we shuffle the uploaded models and distribute them back to UEs. Such a shuffle mechanism maintains the diversity of the models and reuses all the existing models for the next period. As a result, after shuffling the models, UEs are likely to change their channel choices, e.g., from staying silent to requesting a particular channel and vice versa. Therefore, the fairness issue can be effectively addressed.
Hermes Design

Fig. 3 shows a closer look at the Hermes architecture. As shown, Hermes consists of two modules: the improved MARL (iMARL) framework and the shuffle mechanism. In iMARL, we deploy a DQN on each UE to learn the Q function. For every time slot, each UE feeds the sensory data and the feedback from the environment, which together constitute the observations, into the DQN and takes an action according to the DQN outputs. The environment collects the actions from all the UEs and provides rewards and feedback to the UEs. The DQN is then updated based on the observations, actions, and rewards.

After several updates of the DQN, the local models from the UEs are uploaded to the shuffler, where the proposed shuffle mechanism is performed. We execute the shuffle mechanism in two steps: model evaluation and model distribution. First, the model evaluation module estimates the potential UE preference for the received models. The model distribution module then fairly distributes the corresponding models back to the UEs based on the estimated preference.

Fig. 4 shows the workflow of Hermes. Once a DQN model is distributed by the shuffler, the UE keeps running and training this model until the next model arrives; UEs never wait for the next shuffled model to come. The training of iMARL is only performed between two shuffling cycles. After UEs upload their models to the shuffler, they merely run the model but no longer train it. Training is not done in parallel with shuffles because the trained models would be overwritten by the incoming shuffled models.

Although training and shuffling the DQN model incur a certain computation cost, the critical bottleneck of the delay is only the inference latency of the DQN. Therefore, given a DQN with efficient inference, Hermes can offer very responsive spectrum management.
The combination of iMARL and the shuffle mechanism in Hermes is promising for decentralized DSA; however, it also introduces three design challenges.
Challenge 1.
As described in Section 2, one of the major drawbacks of the existing MARL-based method is channel collision due to the UEs' greediness. Considering the decentralized setting where UEs do not interact with each other, it is challenging to design iMARL so that it converges to a model that maximally avoids collisions and significantly improves channel utilization.
Challenge 2.
The traditional DQN architecture is heavy in two aspects. First, deep neural networks often contain a large number of parameters; even worse, the Q network and the target Q network in DQN double the model size. Second, a replay buffer that stores previous experience is attached to each model. Sharing the DQN model between UEs and the shuffler thus incurs significant communication overhead. How to design a compact DQN structure that reduces this overhead is a great challenge.
Challenge 3.
As presented in Section 2.5, the DQN models are heterogeneous across UEs. The shuffler needs to match the DQN models with UEs based on their potential preferences for channel selection. However, it is difficult to deduce the channel selection from the received model parameters. Moreover, the sensory data cannot be explicitly shared with the shuffler, so the preference evaluation can only be performed using the received models. Therefore, how to efficiently evaluate the personalized preference of each UE w.r.t. each model without sensory data poses another challenge.

Addressing these challenges is not trivial. Without a careful design of the DQN model, the communication overhead of sharing it can overshadow the benefits it brings. Even worse, if the model does not converge to a scenario with few collisions, the channels cannot be efficiently utilized and are wasted. In addition, the evaluation module at the shuffler needs to be carefully designed, such that UEs can be well matched to models without raw sensory data. In the following, we present the techniques we develop in Hermes to effectively address these challenges.
Before diving into the design details, we clarify the system configuration of Hermes, which significantly impacts our design. In terms of the wireless environment, we assume there are N UEs sharing M channels, where N > M. The achievable bits transmitted per unit of time (the data rate) offered by each channel varies across UEs and can be estimated by CQI.

Figure 3: A closer look at the Hermes architecture.
Figure 4: The workflow of Hermes.

In one time slot, a UE can either stay silent or request one channel to transmit data. A silent UE does not send any data or interfere with any channel. If a UE requests a particular channel, it can successfully transmit data only when the channel is requested by this UE alone. Otherwise, when two or more UEs request the same channel, a collision occurs and all of them fail to transmit. At the end of the slot, UEs obtain feedback from the environment about whether they have transmitted data successfully. However, UEs cannot know their peers' channel choices or which UE collided with them.

In addition, each UE has a local buffer that temporarily stores the data to transmit. In our setting, we only care about the amount of buffered data. The amount increases when the UE has new data to transmit but does not instantly get sufficient channel resources. In contrast, when a UE is able to transmit data through a certain channel, the amount of buffered data decreases accordingly. The data transmission terminates once the buffer is empty, but the channel remains occupied for the rest of the current slot.

We implement iMARL based on the Deep Q Network (DQN) [15]. The structure of the DQN deployed on each UE is shown in Fig. 5. In this part, we first introduce the core components of iMARL, including observations, actions, and rewards. Then the compact DQN structure is presented. Finally, we explain how we modify the DQN's workflow to further accommodate the shuffle mechanism.

Figure 5: The structure of the 2-layer-FC DQN.

The observation o_i(t) of the i-th UE at time slot t consists of four parts, whose space is

o_i(t) \in \mathbb{R}_+^M \times \{0,1\} \times \mathbb{N}_+^M \times \{0,1\},   (1)

where \mathbb{N}_+ and \mathbb{R}_+ represent the sets of positive integers and positive real numbers. The first M elements are the available data rates offered by the M channels; this part is the sensory data collected by the UE's sensors and is independent of the actions. The second part is the buffer status, a single bit indicating whether the buffer is empty. The next M elements record how long it has been since the last request for each of the M channels. The last element indicates whether the UE successfully transmitted data in the last slot.

Let a_i(t) denote the i-th UE's action at time slot t, which can be formulated as

a_i(t) \in \{1, 2, \dots, M, M+1\},   (2)

where the first M values represent choosing one of the M channels and the last value represents staying silent. For time slot t, the i-th UE's action a_i(t) is determined by the ε-greedy policy. After an inference of the DQN, the output is considered the action vector, which contains M+1 entries. Under the ε-greedy policy, a UE chooses the action with the maximal value in the action vector with probability 1 − ε and takes a random choice with small probability ε.
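The following sketch illustrates a compact two-FC-layer DQN and the ε-greedy selection described above; it is written in PyTorch for illustration, and the class name, hidden size, and hyper-parameters are our own assumptions rather than the paper's exact implementation.

```python
# A minimal sketch of the compact per-UE DQN and the epsilon-greedy policy.
import numpy as np
import torch
import torch.nn as nn

class CompactDQN(nn.Module):
    """Two FC layers with ReLU and Sigmoid activations, mapping an
    observation (2M+2 values per Eq. (1)) to an action vector of M+1
    preferences: one per channel plus one for staying silent."""
    def __init__(self, num_channels: int, hidden: int = 32):
        super().__init__()
        obs_dim = 2 * num_channels + 2
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_channels + 1),
            nn.Sigmoid(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def epsilon_greedy(action_vector: torch.Tensor, epsilon: float) -> int:
    """Best action with probability 1 - eps, a uniformly random one otherwise."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(action_vector.numel()))
    return int(torch.argmax(action_vector).item())
```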
Figure 6: Reward tables and reward expectation tables of the existing MARL and iMARL in a toy example, where two UEs, Alice and Bob, share one channel: (a) reward table of the existing MARL; (b) reward expectation table of the existing MARL; (c) reward table of iMARL; (d) reward expectation table of iMARL. The rewards and reward expectations are painted red and blue for Alice and Bob, and the gray scenarios are the Nash Equilibria.

The setting of the rewards plays an important role in the UEs' actions after convergence. Let r_i(t) be the reward the i-th UE obtains at slot t. In the existing MARL methods, r_i(t) depends on its transmitted data, i.e., UEs get positive rewards when successfully transmitting data and zero rewards otherwise. Such rewards lead to frequent collisions, as demonstrated in the toy example in Fig. 6, where two UEs, Alice and Bob, share one channel. The reward settings for Alice and Bob in the existing MARL methods can be formulated as the reward table in Fig. 6a. Because of the ε-greedy policy, UEs do not always take the same actions, even after convergence. Therefore, we cannot directly calculate the Nash Equilibrium from the reward table; instead, a table containing the reward expectation over actions for each scenario is required. With a small ε, the reward expectation table of iMARL under the ε-greedy policy can be calculated as in Fig. 6d. In this case, the Nash Equilibrium moves to the scenarios where one UE transmits successfully and the other stays silent, without collisions. Generally, the Nash Equilibrium in iMARL is the situation where M UEs stick to the M channels and the other N − M UEs always stay silent. Hereby, we define the reward function as

r_i(t) = \begin{cases} x_i(t) & \text{success to transmit} \\ -\alpha \, x_i(t) & \text{fail to transmit} \\ 0 & \text{stay silent} \end{cases}   (3)

where x_i(t) is the data rate and α is a punishment factor. With the discount factor γ, the total reward R_i of the i-th UE is defined as

R_i = \sum_{t=1}^{T} \gamma^{t-1} r_i(t).   (4)

To reduce the overhead of sharing the DQN model, as well as the inference latency of the DQN, the structure of the DQN should be compact. Therefore, the DQN we design for each UE contains two fully connected (FC) layers and two activation layers, ReLU and Sigmoid, respectively. The feed-forward process of the FC layers can be formulated as

x_{l+1} = W_l x_l + b_l \quad (l = 1, 2),   (5)

where W_l is the weight matrix and b_l is the bias vector; x_l and x_{l+1} are the input and output of the l-th FC layer. We denote the size of the hidden layer as L; the total number of parameters of the two FC layers thus scales with L and the action dimension M + 1.

To train the DQN for each UE, we establish two identical networks, i.e., the Q network and the target Q network. The target Q network is fixed while the Q network is being trained; once a training epoch ends, the Q network is copied to the target Q network. Also, a replay buffer is assigned to store the experiences, and a batch of experiences is randomly sampled from it during training. The replay buffer stores the tuples [s_i(t), a_i(t), r_i(t), s_i(t+1)] at previous time slots; the notation of the tuple is simplified as [s, a, r, s'] in the rest of the paper. To train the network, we define the loss function as

L(\theta) = \mathbb{E}_{(s,a,r,s')}\left[ \left( r + \gamma \max_{a'} Q(a', s'; \theta_{target}) - Q(a, s; \theta) \right)^2 \right],   (6)

where θ denotes the parameters of the Q network and θ_target the parameters of the target Q network. It is straightforward to calculate the gradient of the loss function as in Eq. (7):
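To make Eqs. (3) and (6) concrete, a minimal sketch of the iMARL reward and the squared TD loss follows; the value of the punishment factor α and the batch layout are illustrative assumptions.

```python
# A minimal sketch of the iMARL reward (Eq. (3)) and TD loss (Eq. (6)).
import torch

def imarl_reward(data_rate: float, requested: bool, success: bool,
                 alpha: float = 0.5) -> float:
    """Eq. (3): +x_i(t) on success, -alpha*x_i(t) on collision, 0 when silent."""
    if not requested:
        return 0.0
    return data_rate if success else -alpha * data_rate

def dqn_loss(q_net, target_net, batch, gamma: float = 0.9) -> torch.Tensor:
    """Eq. (6): squared TD error over a sampled batch of (s, a, r, s')."""
    s, a, r, s_next = batch          # a must be a LongTensor of action indices
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():            # the target network is held fixed
        target = r + gamma * target_net(s_next).max(dim=1).values
    return ((target - q_sa) ** 2).mean()
```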
\nabla_\theta L(\theta) = \mathbb{E}_{(s,a,r,s')}\left[ -2 \left( r + \gamma \max_{a'} Q(a', s'; \theta_{target}) - Q(a, s; \theta) \right) \nabla_\theta Q(a, s; \theta) \right].   (7)

The Q network is optimized with the expectation of the gradient estimated via the sampled tuples in the batches.

To further reduce the overhead of communicating iMARL models between UEs and the shuffler, two additional modifications are introduced. First, UEs only upload the DQN model right after a training epoch ends. At this moment, the Q network is about to be copied to the target Q network, so the two are identical; thus, only the Q network is uploaded. Also, since training no longer proceeds after uploading and the target Q network is never used for inference, the local copy can be omitted. The second modification is that the replay buffer is not uploaded and is cleared when a new shuffled model arrives. UEs train the DQN using data that is collected recently and locally. The rationale is that when the training is close to convergence, the variance of the experience is low, so a small batch of experiences provides an accurate estimate of the gradient expectation in Eq. (7); before convergence, the model changes rapidly and memory generated long ago is out of date.

The shuffle mechanism contains two steps: model evaluation followed by model distribution. In the following, we illustrate the design details of both steps.

In the model evaluation, we evaluate all the uploaded models by calculating the UEs' potential preference for them. We denote the model from the i-th UE as the i-th model. A preference matrix E is introduced, where element E_{i,j} represents the i-th UE's preference for the j-th model. For each E_{i,j}, we consider a combination of two metrics, namely the model last appearance E^{MLA}_{i,j} and the model distance E^{MD}_{i,j}.

The model last appearance (MLA) E^{MLA}_{i,j} refers to how long it has been since the last allocation of the j-th model to the i-th UE. Maximizing MLA means that UEs prefer models that have not been allocated to them recently, increasing the models' mobility over UEs. As the mobility increases, UEs can access all the models and thus all the channels, so this metric directly addresses the fairness issue. Despite this benefit, it is not appropriate to evaluate the UE's preference using MLA alone, because MLA does not utilize the parameters of the models and ignores the UEs' adaptation to different models.

The model distance (MD) E^{MD}_{i,j} describes the discrepancy between the i-th model uploaded from the i-th UE and the j-th model received from the j-th UE. We tend to distribute to a UE a model that is similar to the one it uploaded, by minimizing the MD. The rationale is that UEs allocated similar models are able to adapt to the new models within fewer time slots, reducing unexpected outputs that may lead to collisions.

Before calculating MDs, model normalization is performed. In our DQN with two FC layers, we only consider the MD of the last FC layer, which converts the features to the action vector. As presented in Eq. (5), a UE takes the action with the maximum value in the action vector. This means we only care about the relative values, and subtracting the same row vector from every row of W and from b will not affect the UE's decision. Therefore, a zero-mean normalization of each column of W and of b is performed before comparing the models:

\hat{W} = W - \frac{1}{N_{row}} \mathbf{1}\mathbf{1}^T W,   (8)
\hat{b} = b - \frac{1}{N_{row}} \mathbf{1}\mathbf{1}^T b,   (9)

where N_{row} is the number of rows and \mathbf{1} is the all-ones column vector of length N_{row}.
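A minimal sketch of this zero-mean normalization, and of the model distance it feeds into (Eq. (10) below), is given here; it assumes the last FC layer's weight matrix has one row per action, i.e., shape (M+1) × L.

```python
# A minimal sketch of Eqs. (8)-(10) under the assumed (M+1) x L weight layout.
import numpy as np

def normalize_head(W: np.ndarray, b: np.ndarray):
    """Subtract the per-column mean over the action dimension; shifting all
    actions' scores by the same vector leaves the argmax unchanged."""
    return W - W.mean(axis=0, keepdims=True), b - b.mean()

def model_distance(Wi, bi, Wj, bj) -> float:
    """Euclidean distance between the normalized last layers of two models."""
    Wi_h, bi_h = normalize_head(Wi, bi)
    Wj_h, bj_h = normalize_head(Wj, bj)
    return float(np.linalg.norm(Wi_h - Wj_h) + np.linalg.norm(bi_h - bj_h))
```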
Based on the normalized \hat{W} and \hat{b}, we define the MD as the Euclidean distance of W and b between two models:

E^{MD}_{i,j} = \| \hat{W}_i - \hat{W}_j \| + \| \hat{b}_i - \hat{b}_j \|.   (10)

Finally, we calculate the preference E_{i,j} from the two metrics E^{MLA}_{i,j} and E^{MD}_{i,j} as

E_{i,j} = E^{MLA}_{i,j} - \lambda E^{MD}_{i,j},   (11)

where λ is a hyper-parameter that balances the weights of the two metrics.

Based on the estimated UE preferences for the models, the shuffler distributes the models back to the UEs. The model distribution can be formulated as a weighted bipartite perfect matching problem: the N UEs and the N models are the two partitions, and the preference matrix E is the edge weight matrix between them. The Hungarian algorithm [11] and the Kuhn-Munkres (KM) algorithm [18] can be utilized to find a perfect matching M_{KM} that maximizes the sum of the matched weights (in a bipartite matching problem, a perfect matching M is a matching that covers all vertices of the graph):

M_{KM} = \arg\max_{M} \sum_{E_{i,j} \in M} E_{i,j}.   (12)

However, the KM algorithm does not take fairness into account. In a matching that maximizes the sum of weights, extreme matches with much smaller weights than the others cannot be excluded. To eliminate such circumstances, we propose an optimized UE-model matching algorithm. Unlike the KM algorithm, our matching algorithm maximizes a threshold E_{th} such that all weights within the matching are greater than or equal to the threshold. In other words, the minimum weight in the matching is maximized:

M_{opt} = \arg\max_{M} (E_{th}) = \arg\max_{M} \left( \min_{E_{i,j} \in M} E_{i,j} \right).   (13)

The matching algorithm is presented in Alg. 1.

Algorithm 1: Optimized UE-Model Matching Algorithm
Input: E: preference matrix
Output: M_opt: the optimal perfect matching
  Right = max(E), Left = min(E)
  while Right > Left do
    E_th = (Right + Left) / 2
    build the unweighted subgraph ~E containing each edge (i, j) with E_{i,j} >= E_th
    M = Hungarian(~E)
    if M is a perfect matching then
      Left = E_th
    else
      Right = E_th
    end if
  end while
  return M as M_opt

The core idea of this matching algorithm is trial and error using a binary search. We set an upper bound Right and a lower bound Left for E_{th}. Given the two bounds, the threshold E_{th} is set to (Right + Left)/2. A new unweighted subgraph is built based on the threshold: an edge from the i-th UE to the j-th model is included in the subgraph if and only if E_{i,j} is no less than E_{th}. Then the Hungarian algorithm is performed on the subgraph. If a perfect matching exists, we raise the lower bound Left to E_{th}; otherwise, we lower the upper bound Right to E_{th}. We repeat this loop until the upper bound Right and the lower bound Left converge.
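Below is a sketch of Algorithm 1. Two simplifications are ours: the binary search runs over the distinct preference values (so it always terminates), and a plain augmenting-path routine stands in for a library Hungarian solver.

```python
# A sketch of Algorithm 1 with a discrete threshold search (our simplification).
import numpy as np

def try_match(adj: np.ndarray):
    """Kuhn's augmenting-path algorithm on an unweighted bipartite graph
    (rows: UEs, columns: models). Returns match[j] = UE given model j,
    or None if no perfect matching exists."""
    n = adj.shape[0]
    match = [-1] * n

    def augment(i, seen):
        for j in range(n):
            if adj[i, j] and not seen[j]:
                seen[j] = True
                if match[j] == -1 or augment(match[j], seen):
                    match[j] = i
                    return True
        return False

    return match if all(augment(i, [False] * n) for i in range(n)) else None

def bottleneck_matching(E: np.ndarray):
    """Maximize the minimum preference in a perfect matching (Eq. (13)) by
    binary-searching the threshold E_th over the sorted values of E."""
    values = np.unique(E)            # candidate thresholds, ascending
    lo, hi = 0, len(values) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        m = try_match(E >= values[mid])
        if m is not None:            # a perfect matching exists: raise E_th
            best, lo = m, mid + 1
        else:                        # infeasible: lower E_th
            hi = mid - 1
    return best
```

At the lowest threshold every edge is present, so a perfect matching always exists there and the search is well defined.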
Evaluation

We implement and evaluate Hermes in a simulated 5G network. We refer to the MATLAB 5G Toolbox [14] for developing the 5G simulation environment.
5G Frequency Domain and Time Domain.
We first introduce 5G's structure in the frequency and time domains. In the frequency domain, the resource block group (RBG) is the smallest unit allocated to UEs. Each RBG contains 16 resource blocks (RBs), each consisting of 12 subcarriers with a subcarrier spacing of 15 kHz. The CQI on each RB can be sensed by UEs and is represented by a positive integer no greater than 15, determined by the distance between the UE and the gNB; it fluctuates randomly over time. The number of bits carried by an RBG in a slot is

Bits = (bits/symbol) × ((symbols per slot) − (DMRS symbols)) × (subcarriers per RB) × (RBs per RBG).   (14)
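As a worked instance of Eq. (14), the snippet below computes the bits per RBG per slot under an assumed numerology: 14 OFDM symbols per slot with 1 DMRS symbol (our assumption), 12 subcarriers per RB, and 16 RBs per RBG as stated above; bits_per_symbol would come from the CQI-to-modulation mapping.

```python
# A worked instance of Eq. (14); default counts are assumptions except the
# 12 subcarriers per RB and 16 RBs per RBG stated in the text.
def rbg_bits_per_slot(bits_per_symbol: int, symbols: int = 14,
                      dmrs_symbols: int = 1, subcarriers: int = 12,
                      rbs: int = 16) -> int:
    return bits_per_symbol * (symbols - dmrs_symbols) * subcarriers * rbs

# e.g., 16-QAM (4 bits/symbol): 4 * 13 * 12 * 16 = 9984 bits per RBG per slot
```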
5G Application Configuration.
The dynamic change of a UE's buffer size depends on the requirements of the applications running on it. We simplify the application configuration by making three assumptions (a sketch of the resulting buffer dynamics follows the list):

• There are no extra control signals between the gNB and the UEs; all the throughput is data transmission.

• The host device of all applications is the UE, meaning that only uplinks from UEs to the gNB exist.

• There is no need for retransmission.

Two properties are considered for each application: packetInterval in slots and packetSize in bytes. Each UE's buffer increases by packetSize bytes every packetInterval slots for each application on this UE.
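A minimal sketch of the per-slot buffer dynamics implied by the assumptions above; the function and parameter names are ours.

```python
# A minimal sketch of the assumed per-slot buffer dynamics (names are ours).
def step_buffer(buffered_bytes: int, slot: int, packet_interval: int,
                packet_size: int, sent_bytes: int) -> int:
    """The buffer grows by packetSize every packetInterval slots and shrinks
    by whatever was transmitted; it never goes below empty."""
    if slot % packet_interval == 0:
        buffered_bytes += packet_size
    return max(0, buffered_bytes - sent_bytes)
```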
Channel Utilization Efficiency.
We adopt channel utilization efficiency (CUE) to evaluate Hermes. The traditional CUE refers to the ratio of the utilized data rate to the maximum data rate of a channel. It is not applicable to Hermes because a channel's maximum data rate varies across UEs. Therefore, in our experiments, we define CUE as the proportion of used RBGs over all available RBGs. A lower CUE indicates that collisions or idle channels exist. In other words, CUE can also be defined as one minus the proportions of collided and idle RBGs:

P(utilized) = 1 − P(collided) − P(idle).   (15)

Average Throughput. In a wireless network, throughput is an important metric that evaluates the data rate of wireless devices. We use the average throughput over UEs as one evaluation metric in our experiments. With a fixed number of UEs, the average throughput over UEs is linearly proportional to the total throughput. Unlike CUE, average throughput measures how much of the RBGs' capacity is utilized, instead of merely counting the number of used RBGs. In this sense, average throughput can be viewed as CUE weighted by the RBGs' data rates.
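A sketch of how the CUE of Eq. (15) could be computed from a per-slot log of channel requests follows; the (slots × UEs) array with −1 meaning "silent" is an assumed encoding.

```python
# A sketch of CUE (Eq. (15)) over a per-slot request log (assumed encoding).
import numpy as np

def channel_utilization(requests: np.ndarray, n_channels: int) -> float:
    """Fraction of RBG-slots used by exactly one UE (neither collided nor idle)."""
    used = 0
    for slot in requests:                    # slot: each UE's chosen channel
        _, counts = np.unique(slot[slot >= 0], return_counts=True)
        used += int((counts == 1).sum())     # exactly one requester => utilized
    return used / (len(requests) * n_channels)
```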
Jain’s Fairness Index.
We also apply Jain's fairness index (JFI) [10] to evaluate the throughput fairness over UEs. It is formulated as

J(x_1, x_2, \dots, x_N) = \frac{\left( \sum_{i=1}^{N} x_i \right)^2}{N \sum_{i=1}^{N} x_i^2},   (16)

where x_i is the throughput of the i-th UE. One advantage of this metric is that it is only affected by the relative throughputs of the UEs, not the absolute values. When the system is maximally fair and the throughput is even across all users, JFI reaches its maximum value of 1.
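Eq. (16) translates directly into a few lines; an even throughput vector yields a JFI of 1.0, while a single UE taking all throughput among N UEs yields 1/N.

```python
# Eq. (16) in a few lines; throughputs are assumed non-zero in aggregate.
import numpy as np

def jains_fairness(throughputs) -> float:
    x = np.asarray(throughputs, dtype=float)
    return float(x.sum() ** 2 / (len(x) * np.square(x).sum()))

# jains_fairness([1, 1, 1, 1]) -> 1.0  (perfectly even)
# jains_fairness([1, 0, 0, 0]) -> 0.25 (= 1/N for a single dominant UE)
```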
We evaluate Hermes's performance by comparing it with the centralized PF method [25] and the decentralized DQSA method [19].

PF is a traditional scheduling strategy on gNBs in SM. Compared to other scheduling strategies such as Best CQI and Round Robin, PF is known for its better fairness. In the PF approach, UEs need to upload their full status to the gNB for every scheduling period, including the buffer status, CQI, and historical average data rate. The historical average data rate describes a UE's previous data rate and is updated every time slot. PF tends to distribute each RBG to the UE with the maximal ratio of the current data rate to the historical average data rate, until a maximum number of RBGs (1 in our implementation) is allocated to this UE.

Besides, the decentralized method DQSA is taken as another baseline. It is one of the state-of-the-art MARL-based DSA methods and outputs a good DSA scheduling plan in small-scale UE deployments. However, as described in Sec. 2.4, DQSA suffers from fairness issues and serious collisions when the number of UEs N is greater than the number of RBGs M.

We first evaluate Hermes under a relatively simple setting, where only a small number of UEs and RBGs need to be scheduled over a short simulation time. After that, we compare Hermes with the two baseline methods on the aforementioned evaluation metrics under a more challenging setting, with more UEs and RBGs and an extended simulation time. In both settings, we assume that the sensory data across all the UEs are independent and identically distributed (i.i.d.) and that a single shuffler serves all the UEs. The training of iMARL is performed every 10 slots and the shuffling every 50 slots.

Figure 7: The channel selection and throughput heatmaps of Hermes, where 10 UEs share 3 RBGs for 100 frames: (a) channel selection; (b) throughput.

Fig. 7 shows the results under the relatively simple setting with 10 UEs and 3 RBGs for 100 frames (1 second, or 1000 slots). In Fig. 7a, yellow, red, and black represent the three channels, and white represents a UE staying silent. The 10 UEs start with random channel selections and reach the Nash Equilibrium within 100 frames, where 3 of the 10 UEs request channels and the rest stay silent. Note that the training of iMARL proceeds smoothly even though shuffling takes place every 50 slots. Besides, there are still a small number of stripes after the DQN converges, indicating that UEs retain a small tendency to explore different RBGs due to the ε-greedy policy. Although this tendency leads to collided and idle channels, it helps the model adapt to environmental changes: the model could not make use of newly available RBGs if the ε-greedy policy were abandoned once the DQN converges.

In terms of throughput, as Fig. 7b shows, the darker the stripe, the higher the throughput the UE achieves. Generally, a channel can achieve a data rate of 3 Mbps to 6 Mbps. In the first 10 frames, when the models have not converged and collisions occur frequently, the throughput of all the UEs remains low. When the Nash Equilibrium is reached, the throughput heatmap looks similar to the channel selection heatmap, which indicates that almost all channel requests successfully transmit data without collisions.

Figure 8: The CUE and average throughput over time, where 20 UEs share 6 RBGs for 500 frames: (a) channel utilization efficiency; (b) average throughput.

We then compare Hermes with the two baselines in a more challenging environment, where 20 UEs share 6 RBGs for 500 frames. Their performance is shown in Fig. 8 and Table 2. Fig. 8a shows the CUE of Hermes. The DQN model converges around the 100th frame. After that, the CUE fluctuates slightly around the maximum, and the collision probability is reduced to its minimal value. The probability of idle channels stays small all the time. The CUE cannot reach 100% due to the ε-greedy policy.

Table 2: Comparison of JFI between Hermes, DQSA, and PF.

Methods  Jain's Fairness Index
Hermes   0.9619
PF       0.9274
DQSA     0.4902

The average throughput of Hermes, DQSA, and PF is shown in Fig. 8b. The average throughput of Hermes is slightly lower than that of PF, because Hermes cannot fully utilize all available RBGs owing to the ε-greedy policy. Yet its average throughput is much higher than that of DQSA, thanks to the properly designed reward.

We also evaluate the JFI of all three methods and report the results in Table 2. Hermes is slightly better than PF, and these two methods are significantly better than DQSA, which lacks shuffling, in terms of fairness.

In addition to the performance evaluation, we evaluate Hermes's robustness with different numbers of UEs and RBGs and varied deployment intervals between UEs. Since the average throughput is strongly correlated with the number of RBGs, its trend as the number of RBGs increases tells little about the system's robustness; hence, we only use the other two metrics in this section.
Figure 9: The impact of the numbers of UEs and RBGs on CUE and JFI.
Figure 10: The impact of the deployment interval d on CUE and JFI: (a) deployment; (b) deployment interval d.

We also evaluate how the deployment interval between UEs affects the CUE and JFI. Here the deployment interval is defined as the distance between UEs. In this experiment, as shown in Fig. 10a, we deploy all the UEs with the same interval d along a line from the gNB, with the middle UE always 500 meters away from the gNB. As Fig. 10b shows, the CUE is not impacted by the deployment interval. Meanwhile, fairness is always guaranteed, even in the extreme situation where the deployment interval d is 50 meters; in that circumstance, the closest UE is only 50 meters from the gNB and the farthest UE is at the edge of the gNB's coverage.

Finally, we evaluate Hermes's runtime performance in a dynamic environment, where the numbers of available RBGs and UEs keep changing over time. In addition, multiple shufflers are deployed, and each shuffler is responsible for shuffling the models among a random subset of the UEs. We use throughput to evaluate the runtime performance.

We simulate the dynamic environment for 1000 frames. There are 10 UEs and 2 RBGs at the beginning, and new UEs and RBGs are gradually added in the first 500 frames. Specifically, ten more UEs become available at the 250th frame, and one more RBG is added every 100 frames starting from the 100th frame. In the next 500 frames, as shown in Fig. 11, some of the RBGs and UEs exit sequentially.

The results are presented in Fig. 11. When a new RBG becomes available, the throughput on this RBG increases slowly and reaches full capacity within 0.5 seconds. On the contrary, when an RBG becomes unavailable, the corresponding throughput drops to zero immediately. The emergence of new UEs introduces a throughput drop, lasting under 0.5 seconds, for all the UEs, because their untrained models bring collisions and are taken into the shuffle process and distributed to other UEs. When UEs exit, some of the RBGs become idle; the total throughput drops rapidly but quickly recovers within around 0.5 seconds.
Figure 11: The runtime performance of Hermes in the dynamic environment: (a) total throughput; (b) throughput over UEs; (c) throughput over RBGs.

To summarize, Hermes is capable of adapting to environmental changes in no more than 0.5 seconds.
Related Work

The key challenge of designing a DSA protocol is to effectively and efficiently handle the multi-channel structure and highly dynamic network resources [2]. Many DSA techniques have been proposed to address this challenge based on classical mathematical algorithms [4]. Among these, the auction-based approach is promising: it dynamically allocates the spectrum to SUs based on the best bids they submit to the PU. The bid can take many forms, e.g., money or relaying services. For example, Wang et al. [27] designed a DSA approach based on bandwidth auctions, where SUs submit bids for the spectrum and PUs allocate the spectrum among the SUs without affecting their own performance. Game theory is also adopted to optimize performance related to spectrum sharing. Maskery et al. [13] proposed a DSA method from an adaptive game theory perspective: when a channel is available, SUs compete for it to fulfill their own demands while minimizing interference with their peers. Huang et al. [9] presented a non-cooperative game theory method for LTE-A networks, where femtocells compete for sharing the primary macrocell spectrum. On the contrary, Gharehshiran et al. [8] designed a cooperative game theory method for LTE-A networks. Niyato et al. [23] proposed a comprehensive solution combining cooperative and non-cooperative game theory approaches: multiple PUs sell spectrum to multiple SUs, where the competition among PUs is modeled as a non-cooperative game and the competition among SUs is treated as an evolutionary game. Vamvakas et al. [26] introduced prospect theory to allocate the transmission power of PUs and SUs. However, the dynamic interaction cannot be fully described using a game theory approach [24], and this issue has been addressed by applying Markov chains. Akbar and Tranter [1] proposed a Markov-based Channel Prediction Algorithm (MCPA), allowing SUs to dynamically select different licensed bands while significantly reducing the interference from and to PUs. Nguyen et al. [20] utilized a hidden Markov process to model the utilization state of each spectrum band at each time slot, where the state can be approximated from power spectral density measurements; based on the occupancy state, the available bands are assigned accordingly.

Besides the mathematical approaches, RL has become an emerging state-of-the-art technique for DSA. The most widely applied RL algorithm in wireless communication is Q-learning [29]; therefore, most RL-based DSA approaches focus on Q-learning [16, 6, 21]. Morozs et al. [16, 17] presented a centralized RL method for DSA, where the RL model is deployed on the base station or a single SU. In this method, the interactions among multiple SUs are ignored, which is not reasonable in practice. Naparstek and Cohen [19] proposed a distributed RL approach for DSA, introducing an MARL algorithm that considers the interference among multiple SUs. Although RL-based techniques such as Q-learning have demonstrated good performance on DSA, a critical challenge is that they fail as the number of users increases. In this paper, our proposed method significantly improves the performance with massive users.
Conclusion

In this paper, we present Hermes, a decentralized DSA method for 5G that accounts for massive numbers of UEs. In Hermes, we design an improved MARL algorithm for training local models at UEs in order to achieve better resource utilization. A novel shuffle mechanism is also proposed to exchange trained models among UEs for better fairness. Comprehensive simulation experiments demonstrate that Hermes outperforms the compared baseline methods, achieving 84% channel utilization efficiency and over 0.96 in JFI. Furthermore, it takes only 0.5 seconds to adapt to real-time changes in a dynamic environment. We believe our work can facilitate the large-scale deployment of mobile devices in 5G, and the idea of shuffling can easily be extended to more decentralized applications.
Acknowledgement
This work was supported in part by the National Key R&D Program of China (2018YFB0105000); in part by the National Natural Science Foundation of China (No. U19B2019, 61832007, 61621091); in part by the Tsinghua EE Xilinx AI Research Fund; in part by the Beijing National Research Center for Information Science and Technology (BNRist); and in part by the Beijing Innovation Center for Future Chips.

References

[1] I. A. Akbar and W. H. Tranter. Dynamic spectrum allocation in cognitive radio using hidden Markov models: Poisson distributed case. In Proceedings 2007 IEEE SoutheastCon, pages 196–201. IEEE, 2007.
[2] I. F. Akyildiz, W.-Y. Lee, M. C. Vuran, and S. Mohanty. Next generation/dynamic spectrum access/cognitive radio wireless networks: A survey. Computer Networks, 50(13):2127–2159, 2006.
[3] I. F. Akyildiz, W.-Y. Lee, M. C. Vuran, and S. Mohanty. A survey on spectrum management in cognitive radio networks. IEEE Communications Magazine, 46(4):40–48, 2008.
[4] A. O. Arafat, A. Al-Hourani, N. S. Nafi, and M. A. Gregory. A survey on dynamic spectrum access for LTE-Advanced. Wireless Personal Communications, 97(3):3921–3941, 2017.
[5] P. Bahl and V. N. Padmanabhan. RADAR: An in-building RF-based user location and tracking system. In Proceedings IEEE INFOCOM 2000, volume 2, pages 775–784. IEEE, 2000.
[6] X. Chen, Z. Zhao, and H. Zhang. Stochastic power adaptation with multiagent reinforcement learning for cognitive wireless mesh networks. IEEE Transactions on Mobile Computing, 12(11):2155–2166, 2012.
[7] K. George and K. Kevin. Minneapolis are first in the world to get 5G-enabled smartphones connected to a 5G network, 04.03.2019.
[8] O. N. Gharehshiran, A. Attar, and V. Krishnamurthy. Collaborative sub-channel allocation in cognitive LTE femto-cells: A cooperative game-theoretic approach. IEEE Transactions on Communications, 61(1):325–334, 2012.
[9] J. W. Huang and V. Krishnamurthy. Cognitive base stations in LTE/3GPP femtocells: A correlated equilibrium game-theoretic approach. IEEE Transactions on Communications, 59(12):3485–3493, 2011.
[10] R. Jain, A. Durresi, and G. Babic. Throughput fairness index: An explanation. In ATM Forum contribution, volume 99, 1999.
[11] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[12] N. Lindsay. 5G - Connection Density — Massive IoT and So Much More, 2017.
[13] M. Maskery, V. Krishnamurthy, and Q. Zhao. Decentralized dynamic spectrum access for cognitive radios: Cooperative design of a non-cooperative game. IEEE Transactions on Communications, 57(2):459–469, 2009.
[14] MATLAB 5G Toolbox, 2020. The MathWorks, Natick, MA, USA.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[16] N. Morozs, T. Clarke, and D. Grace. Distributed heuristically accelerated Q-learning for robust cognitive spectrum management in LTE cellular systems. IEEE Transactions on Mobile Computing, 15(4):817–825, 2015.
[17] N. Morozs, T. Clarke, and D. Grace. Heuristically accelerated reinforcement learning for dynamic secondary spectrum sharing. IEEE Access, 3:2771–2783, 2015.
[18] J. Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38, 1957.
[19] O. Naparstek and K. Cohen. Deep multi-user reinforcement learning for distributed dynamic spectrum access. IEEE Transactions on Wireless Communications, 18(1):310–323, 2018.
[20] T. Nguyen, B. L. Mark, and Y. Ephraim. Hidden Markov process based dynamic spectrum access for cognitive radio. pages 1–6. IEEE, 2011.
[21] J. Nie and S. Haykin. A Q-learning-based dynamic channel assignment technique for mobile communication systems. IEEE Transactions on Vehicular Technology, 48(5):1676–1687, 1999.
[22] S. Niknam, H. S. Dhillon, and J. H. Reed. Federated learning for wireless communications: Motivation, opportunities, and challenges. IEEE Communications Magazine, 58(6):46–51, 2020.
[23] D. Niyato, E. Hossain, and Z. Han. Dynamics of multiple-seller and multiple-buyer spectrum trading in cognitive radio networks: A game-theoretic modeling approach. IEEE Transactions on Mobile Computing, 8(8):1009–1022, 2008.
[24] W. Saad, Z. Han, R. Zheng, A. Hjorungnes, T. Basar, and H. V. Poor. Coalitional games in partition form for joint spectrum sensing and access in cognitive radio networks. IEEE Journal of Selected Topics in Signal Processing, 6(2):195–209, 2011.
[25] S. Tiwari. Long Term Evolution (LTE) protocol verification of MAC scheduling algorithms in NetSim, 2014.
[26] P. Vamvakas, E. E. Tsiropoulou, and S. Papavassiliou. Dynamic spectrum management in 5G wireless networks: A real-life modeling approach. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pages 2134–2142. IEEE, 2019.
[27] X. Wang, Z. Li, P. Xu, Y. Xu, X. Gao, and H.-H. Chen. Spectrum sharing in cognitive radio networks—an auction-based approach. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 40(3):587–596, 2009.
[28] Y. Wang, Z. Ye, P. Wan, and J. Zhao. A survey of dynamic spectrum allocation based on reinforcement learning algorithms in cognitive radio networks. Artificial Intelligence Review, 51(3):493–506, 2019.
[29] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, 1989.
[30] Q. Zhao and B. M. Sadler. A survey of dynamic spectrum access. IEEE Signal Processing Magazine, 24(3):79–89, 2007.
[31] Y. Zhu, Z. Xiao, Y. Chen, Z. Li, M. Liu, B. Y. Zhao, and H. Zheng. Et tu Alexa? When commodity WiFi devices turn into adversarial motion sensors. arXiv preprint arXiv:1810.10109, 2018.