DQN-Based Multi-User Power Allocation for Hybrid RF/VLC Networks
Bekir Sait Ciftler, Abdulmalik Alwarafy, Mohamed Abdallah, and Mounir Hamdi
Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
{bciftler, aalwarafy, moabdallah, mhamdi}@hbku.edu.qa

Abstract—In this paper, a Deep Q-Network (DQN) based multi-agent multi-user power allocation algorithm is proposed for hybrid networks composed of radio frequency (RF) and visible light communication (VLC) access points (APs). The users are capable of multihoming, which can bridge RF and VLC links for accommodating their bandwidth requirements. By leveraging a non-cooperative multi-agent DQN algorithm, where each AP is an agent, an online power allocation strategy is developed to optimize the transmit power for providing the users' required data rates. Our simulation results demonstrate that the DQN-based algorithm's median training convergence time is shorter than that of the Q-Learning (QL) based algorithm. The DQN-based algorithm converges to the desired user rate in half the duration on average and converges in a larger fraction of experiments than the QL-based algorithm. Additionally, thanks to its continuous state-space definition, the DQN-based power allocation algorithm provides average user data rates closer to the target rates than the QL-based algorithm when it converges.
Index Terms—Convergence, DQN, DRL, hybrid networks, optimization, power allocation, RF, VLC.
I. INTRODUCTION
As the number of interconnected devices in our lives increases exponentially, spectrum scarcity becomes a bigger problem. In recent years, Visible Light Communication (VLC) has attracted attention due to its vast potential to provide high data rates and ubiquitous indoor coverage by utilizing the visible spectrum. VLC is based on light-emitting diodes (LEDs), which are very energy-efficient and capable of exploiting the unused visible spectrum [1]-[3]. VLC has many advantages, such as inexpensive transmitters and receivers, low power consumption [4], and better physical-layer security features [5], [6]. However, VLC requires line-of-sight communication with a proper angle between the transmitter and the receiver. Hence, its coverage area can be limited, even indoors. As a consequence, VLC is usually deployed in hybrid systems alongside conventional RF communication networks [7]. Hybrid RF/VLC systems are gaining popularity thanks to their many features, such as energy efficiency, ubiquitous connectivity by utilizing existing infrastructure, and high throughput capacity.

The users of hybrid RF/VLC systems usually have the multihoming ability, which allows them to connect to multiple APs simultaneously. Power allocation in these networks is crucial for providing the necessary quality-of-service (QoS) to applications while reducing the overconsumption of power and possible interference with other network entities [8]. The proper allocation of transmit power is of great importance under varying channel conditions and user requirements.
Fig. 1: Hybrid RF/VLC communication system, consisting of four VLC APs (VLC AP-1 to VLC AP-4), one RF AP, and two users (UE-1 and UE-2).
Conventional power allocation mechanisms usually rely on optimization methods such as mathematical programming. However, hybrid RF/VLC systems usually have complex models that result in intractable optimization problems for power allocation [9]. Typically, the system model for power allocation requires approximations and relaxations of closed-form equations around the solution to provide satisfactory results. Hence, the general approach is either to approximate the model to make it solvable [10], or to model the hybrid RF/VLC power allocation as a mixed-integer nonlinear program, as in [7], and then simplify it to a discrete linear programming problem by approximation around the solution. It has been shown that the model becomes intractable as the number of parameters and network elements increases, even in the simplified forms. As the numbers of APs and user devices grow in these hybrid RF/VLC networks, such solutions become impractical due to the increased complexity. Therefore, techniques that do not rely on complex system models are required. Machine learning (ML) based solutions for power allocation have been gaining prominence due to their strong performance in solving optimization problems without explicit system models and state transition dynamics [11]. Reinforcement learning (RL) based solutions in particular are prevailing, since they allow mapping observed states to the best actions so as to maximize a defined cumulative reward.

In our work, we propose a multi-agent DQN-based multi-user power allocation scheme for hybrid RF/VLC networks, as shown in Fig. 1. The power allocation problem is defined as an optimization problem in Section III to provide the necessary data rate to users while minimizing power consumption. The proposed methodology allows a continuous state-space in which the gap between the target rates and the actual rates of users can be observed precisely; hence, the power allocation actions are more efficient and precise as well. Our simulation results show that the proposed method converges faster than the Q-Learning (QL) based algorithm. Additionally, the proposed method achieves an average user rate closer to the target rate thanks to its continuous state-space.

Our main contributions in this work are listed below:
• We propose and implement a non-cooperative multi-agent DQN-based algorithm with a continuous state-space definition to solve the multi-user power allocation problem of hybrid RF/VLC systems.
• We define a precise reward function for the stability of non-cooperative RL-based algorithms for multi-user power allocation.
• We benchmark convergence time for multi-agent QL-based and DQN-based power allocation algorithms.
• We show that DQN outperforms the QL-based algorithm with shorter convergence times.

This paper is structured as follows. In Section II, a brief overview of the existing literature on power allocation for hybrid RF/VLC networks is presented. We provide the system model for the hybrid RF/VLC communication system in Section III. Subsequently, we explain the QL-based and DQN-based multi-user power allocation algorithms in Section IV. Numerical results for the provided techniques are given in Section V. Finally, in Section VI, we present our future work and concluding remarks.

II. LITERATURE REVIEW
Hybrid RF/VLC systems are a promising prospect for energy-efficient and ubiquitous wireless communications. This section presents a brief literature review on power allocation for hybrid RF/VLC networks.

Conventional optimization techniques for VLC network performance are surveyed in [12], where VLC systems and channel models are treated in depth from the perspective of optimization algorithms, and resource and power control with AP assignment is reviewed in detail. The optimization techniques covered are conventional, model-based methods that require full observation of the channel.

As an example of conventional methods for resource allocation in hybrid RF/VLC systems, the authors of [7] investigated cell formation and frequency reuse patterns in the context of load balancing. This hybrid VLC system is shown to provide high area spectral efficiency, and the hybrid configuration achieves the highest grade of fairness in most scenarios. In [7], the hybrid system resource allocation is modeled with mixed-integer nonlinear programming (MINLP); however, it is simplified to discrete linear programming by approximation. It is shown that the model becomes intractable as the number of elements increases.

A QL-based power allocation technique for hybrid RF/VLC networks is proposed in a distributed fashion in [13]. In this study, a multi-agent QL-based technique is proposed in which each AP is an independent agent that interacts with the environment in a two-timescale power allocation scheme. The proposed methodology satisfies the QoS requirements of the users on average. However, classical solutions such as QL have problems with scalability and limited mapping of observations due to the discrete definition of the state space.

Energy-efficient resource allocation for software-defined hybrid RF/VLC systems is studied in [14]. The authors developed an optimization framework that considers backhaul constraints, QoS requirements, energy efficiency, and inter-cell interference limits for a heterogeneous VLC and RF small-cell network. The formulated optimization problem is solved by the alternating direction method of multipliers (ADMM). The simulation results show that the proposed scheme converges within a few iterations while increasing the throughput significantly and avoiding interference by limiting power consumption.

All of the references mentioned above utilize either conventional optimization techniques or RL techniques with a discrete state space. In our work, we propose a multi-agent DQN-based multi-user power allocation algorithm for hybrid RF/VLC systems to allocate power more precisely, considering users' data rate requirements and their actual rates.

III. SYSTEM MODEL AND PROBLEM FORMULATION
We consider a multi-user downlink resource allocation problem for multihoming hybrid RF/VLC networks. Our system model consists of a single RF AP and K VLC APs, and there are N mobile users equipped with both RF and VLC receivers and with multihoming capability. At timestep t, the channel gain of the link between user u and VLC AP l can be represented as [9]:

G_{VLC}^{(u,l)}(t) = \frac{(m+1) A_{pd} \lambda \cos^{m}(\theta_{tx}^{(u,l)}(t))}{2\pi \left( (x^{(u,l)}(t))^{2} + y^{2} \right)} H_{f}(\theta_{rx}^{(u,l)}(t)) \, H_{c}(\theta_{rx}^{(u,l)}(t)) \cos(\theta_{rx}^{(u,l)}(t)),    (1)

where x and y are the horizontal and vertical distances between the user and the VLC AP, A_{pd} is the effective detection area of the photodiode (PD), \lambda is the PD responsivity, and \theta_{tx} and \theta_{rx} represent the angle of irradiance and the angle of incidence, respectively [9]. The PDs of all users are assumed to be facing vertically upwards for simplicity (i.e., \theta_{tx}^{(u,l)} = \theta_{rx}^{(u,l)}). The gains of the user's optical filter and optical concentrator are represented by H_{f}(\theta_{rx}^{(u,l)}(t)) and H_{c}(\theta_{rx}^{(u,l)}(t)), respectively. Additionally, m = -1/\log_{2}(\cos(\Psi_{1/2})), where \Psi_{1/2} is the semi-angle at half-power of the LED. The gain of the user's optical filter is assumed to be unity throughout this manuscript [13], while the optical concentrator's gain is given by:

H_{c}(\theta_{rx}^{(u,l)}(t)) = \frac{n_{c}^{2}}{\sin^{2}(\Psi_{fov})} \, \mathbb{1}\left( 0 \le \theta_{rx}^{(u,l)}(t) \le \Psi_{fov} \right),    (2)

where \Psi_{fov} stands for half of the PD's field-of-view (FoV), \mathbb{1}(\cdot) is the indicator function, and n_{c} is the refractive index of the optical concentrator [9].

The coverage regions of the VLC APs are exclusive, since each VLC AP allocates orthogonal frequencies for bandwidth, and each AP's total bandwidth is divided equally among the users within its coverage area. The transmit power of each VLC AP is determined by a centralized entity at the beginning of each timestep t. The achievable rate of the link between user u and VLC AP l in timestep t is:

R_{VLC}^{(u,l)}(t) = W_{VLC} \log_{2}\left( 1 + \frac{\left( \kappa \, m_{d} \, P_{VLC}^{(u,l)}(t) \, G_{VLC}^{(u,l)}(t) \right)^{2}}{W_{VLC} \, \sigma_{VLC}^{2}} \right),    (3)

where W_{VLC} is the VLC link bandwidth, \kappa is the optical-to-electric conversion efficiency, m_{d} is the modulation depth, \sigma_{VLC}^{2} is the noise power spectral density (PSD) of the VLC links, and P_{VLC}^{(u,l)}(t) is the optical (transmit) power of VLC AP l for user u in timestep t [9].

The gain of the RF link in timestep t is defined as:

G_{RF}^{(u)}(t) = 10^{-L(d_{RF}^{(u)}(t))/10} \, |h_{RF}^{(u)}(t)|^{2},    (4)

where h_{RF}^{(u)}(t) stands for the small-scale fading gain, modeled as an exponential random variable, and L(d) is the path-loss component defined as:

L(d) = 47.9 + 10 \, \nu \log_{10}(d/d_{0}) + X \;\; \mathrm{(dB)},    (5)

where d indicates the distance between the transmitter and the receiver, d_{0} = 1 m, \nu is the path-loss exponent, and X is a zero-mean Gaussian random variable representing the shadowing component [13].

The achievable rate at user u from the RF link in timestep t is given as:

R_{RF}^{(u)}(t) = W_{RF} \log_{2}\left( 1 + \frac{P_{RF}^{(u)}(t) \, G_{RF}^{(u)}(t)}{W_{RF} \, \sigma_{RF}^{2}} \right),    (6)

where W_{RF} is the bandwidth of each RF link and P_{RF}^{(u)}(t) is the transmit power allocated to the link between user u and the RF AP in timestep t. The PSD of the additive white Gaussian noise (AWGN) on the RF links is represented by \sigma_{RF}^{2}.
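For concreteness, the link models in (1)-(6) can be evaluated numerically as in the minimal sketch below. This is an illustration rather than the authors' simulator: the default filter gain and refractive index are assumptions, and all other parameters are passed in explicitly since their values are environment-specific.

```python
import numpy as np

def vlc_channel_gain(x, y, psi_half, psi_fov, A_pd, lam, H_f=1.0, n_c=1.5):
    """VLC channel gain of Eqs. (1)-(2) for an upward-facing PD,
    so the angle of irradiance equals the angle of incidence."""
    m = -1.0 / np.log2(np.cos(psi_half))      # Lambertian order of the LED
    d2 = x**2 + y**2                          # squared Tx-Rx distance
    cos_theta = y / np.sqrt(d2)               # cosine of the common angle
    if np.arccos(cos_theta) > psi_fov:        # outside the PD field-of-view
        return 0.0
    H_c = n_c**2 / np.sin(psi_fov)**2         # concentrator gain, Eq. (2)
    return ((m + 1) * A_pd * lam * cos_theta**m
            / (2 * np.pi * d2)) * H_f * H_c * cos_theta

def vlc_rate(P, G, W, kappa, m_d, sigma2):
    """Achievable VLC link rate, Eq. (3)."""
    return W * np.log2(1 + (kappa * m_d * P * G) ** 2 / (W * sigma2))

def rf_rate(P, G, W, sigma2):
    """Achievable RF link rate, Eq. (6)."""
    return W * np.log2(1 + P * G / (W * sigma2))
```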
Since users are assumed to have multihoming capability, the RF and VLC links can be used simultaneously. Hence, the total achievable rate of user u at timestep t is the sum of the actual rates of both links:

R^{(u)}(t) = R_{RF}^{(u)}(t) + R_{VLC}^{(u,l)}(t),    (7)

where l is the VLC AP associated with user u at timestep t. Our goal in this paper is to develop a DRL-based solution that controls the transmit powers to maximize a utility function based on each user's actual rate.

The optimization problem for adjusting the transmit powers accordingly is defined as follows:

\max_{\{P_{RF}^{(u)}(t)\}, \{P_{VLC}^{(u,l)}(t)\}} \sum_{t=1}^{\infty} U(t)    (8)
s.t. \sum_{u=1}^{N} P_{RF}^{(u)}(t) \le P_{RF}^{\max},
     \sum_{u=1}^{N} P_{VLC}^{(u,l)}(t) \le P_{VLC}^{\max} \;\; \forall l,
     P_{VLC}^{(u,l)} \ge 0, \; P_{RF}^{(u)} \ge 0 \;\; \forall u, l,

where U(t) is the utility function for timestep t. In our proposed scheme, the utility function is defined as:

U(t) = \sum_{u=1}^{N} B^{(u)} - \left| R^{(u)}(t) - T^{(u)} \right|,    (9)

where B^{(u)} is the target rate band, which defines the vicinity of T^{(u)}, the target rate required to provide the desired QoS to user u. In our simulations, we define the target band as:

B^{(u)} = \max\left\{ \epsilon_{B} \times T^{(u)}, \; B_{\min} \right\},    (10)

where \epsilon_{B} is a fixed fraction of the target rate and B_{\min} is a minimum band width. This band definition provides stability for the system, since a particular target rate may not be exactly attainable with discrete power levels. For the sake of simplicity, we assume the target rates of users are static within an episode.
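For illustration, a minimal sketch of the utility in (9)-(10) follows; the band fraction and floor are assumed placeholders, since the numerical constants of (10) are simulation choices.

```python
def target_band(T, frac=0.05, floor=0.5):
    """Target band B(u) of Eq. (10); frac (epsilon_B) and floor (B_min)
    are assumed placeholder values, in Mbps."""
    return max(frac * T, floor)

def utility(R, T):
    """Utility U(t) of Eq. (9); the same quantity serves as the common
    reward r_t of Eq. (19) for all agents."""
    return sum(target_band(t) - abs(r - t) for r, t in zip(R, T))

# Example: two users with target rates of 20 and 12 Mbps (cf. Fig. 2)
print(utility([19.5, 12.2], [20.0, 12.0]))
```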
IV. RL-BASED MULTI-USER MULTI-AGENT POWER ALLOCATION

In this section, we present two power allocation methods, one QL-based and one DQN-based, in which the transmit powers of the RF AP and the VLC APs are adjusted in every timestep to optimize the downlink data rates of users. In both algorithms, the separate agents do not communicate with each other; hence, they work non-cooperatively.

The state space for the system is defined by the target rates and the actual rates of the users as follows:

s_t = [s_t^{(1)}, \cdots, s_t^{(u)}, \cdots, s_t^{(N)}],    (11)

where the per-user state entries are defined separately for each algorithm, since QL is bounded by a discrete state-action table, whereas DQN admits a continuous state space.

A. QL-Based Power Allocation
In this subsection, we present a QL-based power allocation method in which the VLC APs and the RF AP are individual agents that use QL to learn power allocations providing the necessary target rates T^{(u)} to users, taking the users' current status as input, as explained in Algorithm 1.
1) State-space:
The state space for the QL-based power allocation is based on each user's actual rate and target rate as follows:

s_t^{(u)} = \begin{cases} 0, & \text{if } R^{(u)}(t) < T^{(u)} \\ 2, & \text{if } R^{(u)}(t) > T^{(u)} + B^{(u)} \\ 1, & \text{if } T^{(u)} + B^{(u)} \ge R^{(u)}(t) \ge T^{(u)}, \end{cases}    (12)

where state 0 means the actual rate of the user is below the target rate, state 2 means the actual rate exceeds the target rate by more than the target band, and state 1 means the user's actual rate is within the targeted band. The state-space definition is the same for all agents, independent of the users' locations and the agents' actions.

Algorithm 1 QL-Based Power Allocation
Initialization: Set t = 0. Initialize the Q-values of all state-action pairs as Q_{VLC}^{(l)}(s, a) = 0 for the VLC APs and Q_{RF}(s, a) = 0 for the RF AP.
for t = 1 to ∞ do
    Observe state s_t.
    for l = 1 to K do
        Generate a random number x from [0, 1].
        if x ≤ ε(t) then
            Select a random action a_t^{(l)} from the action space of the VLC AP, A_{VLC}.
        else
            Select the a_t^{(l)} with the largest Q-value according to argmax_{a^{(l)} ∈ A_{VLC}} Q_{VLC}^{(l)}(s_t, a^{(l)}).
        end if
    end for
    Generate a random number x from [0, 1].
    if x ≤ ε(t) then
        Select a random action a_t^{RF} from the action space of the RF AP, A_{RF}.
    else
        Select the a_t^{RF} with the largest Q-value according to argmax_{a^{RF} ∈ A_{RF}} Q_{RF}(s_t, a^{RF}).
    end if
    Execute all actions a_t^{(l)} at VLC APs l = 1 to K, and a_t^{RF} at the RF AP.
    Receive the reward r_t using (19).
    Observe the new state s_{t+1} using (12).
    Update Q_{VLC}^{(l)} for the VLC APs and Q_{RF} for the RF AP as follows:
        Q_{VLC}^{(l)}(s_t, a^{(l)}) ← (1 - α) Q_{VLC}^{(l)}(s_t, a^{(l)}) + α ( r_t + γ max_{a^{(l)} ∈ A_{VLC}} Q_{VLC}^{(l)}(s_{t+1}, a^{(l)}) )
        Q_{RF}(s_t, a^{RF}) ← (1 - α) Q_{RF}(s_t, a^{RF}) + α ( r_t + γ max_{a^{RF} ∈ A_{RF}} Q_{RF}(s_{t+1}, a^{RF}) )
end for
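As a compact reference for the update rule above, the following sketch implements one tabular agent. The state and action counts are assumptions; the learning rate and discount factor follow Table I.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 9, 16   # assumed sizes: joint user states, feasible power vectors
ALPHA, GAMMA = 0.5, 0.5       # learning rate and discount factor (Table I)
Q = np.zeros((N_STATES, N_ACTIONS))   # one table per agent (AP)

def select_action(s, eps):
    """Epsilon-greedy action selection used in Algorithm 1."""
    if rng.random() <= eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[s]))

def ql_update(s, a, r, s_next):
    """Tabular update of Algorithm 1:
    Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
    Q[s, a] = (1 - ALPHA) * Q[s, a] + ALPHA * (r + GAMMA * Q[s_next].max())
```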
2) Action-space:
The action space of each agent is defined by the power levels of its AP, either VLC or RF. The sets of transmit power levels at a VLC AP and at the RF AP are

P_{VLC} = \{ P_{VLC,1}, P_{VLC,2}, \ldots, P_{VLC,V_P} \},    (13)

and

P_{RF} = \{ P_{RF,1}, \ldots, P_{RF,k}, \ldots, P_{RF,R_P} \},    (14)

where V_P and R_P refer to the numbers of power levels at the VLC APs and the RF AP, respectively. The action space of a VLC AP can be expressed as

A_{VLC} = \{ a_1, \ldots, a_i, \ldots, a_{V_A} \},    (15)

where a_i = [a_i^{(1)}, \cdots, a_i^{(u)}, \cdots, a_i^{(N)}] is a vector of size N (i.e., the number of users) whose entries a_i^{(u)} ∈ P_{VLC} are the transmit power levels allocated to users, bounded by the power constraint

\sum_{u=1}^{N} a_i^{(u)} \le P_{VLC}^{\max}.    (16)

As a consequence, there are V_A possible transmit power combinations satisfying the above constraint. Similarly, the RF AP agent has the action space below, with R_A possible combinations:

A_{RF} = \{ a_1, \ldots, a_i, \ldots, a_{R_A} \},    (17)

where a_i = [a_i^{(1)}, \cdots, a_i^{(u)}, \cdots, a_i^{(N)}] is a vector of size N with a_i^{(u)} ∈ P_{RF}, bounded by the power constraint

\sum_{u=1}^{N} a_i^{(u)} \le P_{RF}^{\max}.    (18)
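The feasible action sets in (15)-(18) can be enumerated directly, as sketched below with hypothetical VLC power levels; only the 2 W budget comes from Table I.

```python
from itertools import product

def feasible_actions(levels, n_users, p_max):
    """Enumerate action vectors a_i of Eqs. (15)-(18): one power level
    per user, subject to the AP power budget."""
    return [a for a in product(levels, repeat=n_users) if sum(a) <= p_max]

# Example with assumed power levels and the 2 W VLC budget from Table I
A_vlc = feasible_actions(levels=[0.0, 0.5, 1.0, 1.5], n_users=2, p_max=2.0)
print(len(A_vlc))   # V_A, the number of feasible combinations
```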
3) Reward function:
The optimization problem defined in (8) and (9) aims to minimize the difference between the actual rates and the target rates of the users. In our work, we define the reward function using (9); hence the reward is:

r_t = \sum_{u=1}^{N} B^{(u)} - \left| R^{(u)}(t) - T^{(u)} \right|.    (19)
4) Exploration vs. Exploitation:
The exploration vs. exploitation trade-off is one of the critical success factors in RL-based systems. Our algorithms use a time-dependent ε-greedy technique to balance exploration and exploitation. The epsilon function ε(t) is a geometrically decaying schedule with a constant floor:

\epsilon(t) = \max\left\{ \rho^{\,t-1}, \; \epsilon_{\min} \right\},    (20)

where ρ ∈ (0, 1) is the decay factor and ε_min is the minimum exploration rate.
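In code, the schedule of (20) is a one-liner; the decay factor and floor below are assumed placeholder values.

```python
def epsilon(t, rho=0.99, eps_min=0.05):
    """Decaying epsilon-greedy schedule of the form in Eq. (20)."""
    return max(rho ** (t - 1), eps_min)
```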
B. DQN-Based Power Allocation

In this subsection, we propose a DQN-based power allocation method that utilizes the continuous state-space of DQN to exploit the information on the users' actual and target rates, shortening convergence time and using power more efficiently. In this method, the RF AP and each VLC AP act as separate agents without coordination; hence, each agent can only observe the users' actual and target rates and whether the users are within its coverage area. The algorithm for multi-agent DQN-based power allocation is provided in Algorithm 2.
1) State-space:
The state space of the DQN agents replaces the discrete mapping of (12) with the continuous pair:

s_t^{(u)} = \left[ R^{(u)}(t), \; T^{(u)} \right]^{T},    (21)

where R^{(u)}(t) is the actual rate and T^{(u)} is the target rate of user u. This state-space definition allows our agents to act on the actual rate difference and learn to approach the target rate more precisely.
2) Action-space:
The action space definition is the same as the QL action space, as given in (13)-(18).
3) Reward function:
The reward function definition is the same as for the QL agents, as given in (19).
4) Exploration vs. Exploitation:
The epsilon function ε(t) for the algorithm is defined in (20).

Algorithm 2 DQN-Based Power Allocation
Initialization: Initialize the replay memories of the VLC AP agents D_{VLC}^{(l)} and of the RF AP agent D_{RF} with capacity M. Initialize the action-value functions Q_{VLC}^{(l)} and Q_{RF} with random weights θ_{VLC}^{(l)} and θ_{RF}.
for t = 1 to ∞ do
    Initialize the state s_t with the initial observation.
    for l = 1 to K do
        Generate a random number x from [0, 1].
        if x ≤ ε(t) then
            Select a random action a_t^{(l)} from the action space of the VLC AP, A_{VLC}.
        else
            Select the a_t^{(l)} with the largest Q-value according to argmax_{a^{(l)} ∈ A_{VLC}} Q_{VLC}^{(l)}(s_t, a^{(l)}; θ_{VLC}^{(l)}).
        end if
    end for
    Generate a random number x from [0, 1].
    if x ≤ ε(t) then
        Select a random action a_t^{RF} from the action space of the RF AP, A_{RF}.
    else
        Select the a_t^{RF} with the largest Q-value according to argmax_{a^{RF} ∈ A_{RF}} Q_{RF}(s_t, a^{RF}; θ_{RF}).
    end if
    Execute all actions a_t^{(l)} at VLC APs l = 1 to K, and a_t^{RF} at the RF AP.
    Receive the reward r_t according to (19) and observe the new state s_{t+1} according to (21).
    Store the transition (s_t, a_t^{(l)}, r_t, s_{t+1}) for each VLC AP in D_{VLC}^{(l)} and (s_t, a_t^{RF}, r_t, s_{t+1}) for the RF AP in D_{RF}.
    for l = 1 to K do
        Sample a random minibatch of transitions from D_{VLC}^{(l)}.
        Set y_j = r_j + γ max_{a^{(l)}} Q_{VLC}^{(l)}(s_{j+1}, a^{(l)}; θ_{VLC}^{(l)}) and ŷ_j = Q_{VLC}^{(l)}(s_j, a_j^{(l)}; θ_{VLC}^{(l)}).
        Update the weights with gradient descent: θ_{VLC}^{(l)} ← θ_{VLC}^{(l)} - α ∇_θ (y_j - ŷ_j)^2.
    end for
    Sample a random minibatch of transitions from D_{RF}.
    Set y_j = r_j + γ max_{a^{RF}} Q_{RF}(s_{j+1}, a^{RF}; θ_{RF}) and ŷ_j = Q_{RF}(s_j, a_j^{RF}; θ_{RF}).
    Update the weights with gradient descent: θ_{RF} ← θ_{RF} - α ∇_θ (y_j - ŷ_j)^2.
end for
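To make Algorithm 2 concrete, below is a condensed single-agent sketch in Keras, using the MSE loss and Adam optimizer stated in Section V. The hidden-layer width, replay capacity, and batch size are assumed values, and, following Algorithm 2 as written, the target y_j is computed with the current weights rather than a separate target network.

```python
import random
from collections import deque
import numpy as np
import tensorflow as tf

class DQNAgent:
    """One AP agent of Algorithm 2; states are NumPy vectors as in Eq. (21)."""
    def __init__(self, state_dim, n_actions, gamma=0.5, mem_size=10_000):
        self.n_actions = n_actions
        self.gamma = gamma
        self.memory = deque(maxlen=mem_size)          # replay memory D
        self.model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(state_dim,)),
            tf.keras.layers.Dense(24, activation="relu"),
            tf.keras.layers.Dense(24, activation="relu"),
            tf.keras.layers.Dense(n_actions, activation="linear"),
        ])
        self.model.compile(optimizer="adam", loss="mse")   # MSE + Adam (Sec. V)

    def act(self, state, eps):
        """Epsilon-greedy choice over Q(s, a; theta)."""
        if random.random() <= eps:
            return random.randrange(self.n_actions)
        q = self.model.predict(state[None, :], verbose=0)
        return int(np.argmax(q[0]))

    def remember(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def replay(self, batch_size=32):
        """Minibatch update with target y_j = r_j + gamma max_a Q(s_{j+1}, a)."""
        if len(self.memory) < batch_size:
            return
        batch = random.sample(self.memory, batch_size)
        s = np.array([b[0] for b in batch])
        s_next = np.array([b[3] for b in batch])
        q = self.model.predict(s, verbose=0)
        q_next = self.model.predict(s_next, verbose=0)
        for j, (_, a, r, _) in enumerate(batch):
            q[j, a] = r + self.gamma * q_next[j].max()    # regression target y_j
        self.model.fit(s, q, epochs=1, verbose=0)         # step on (y - y_hat)^2
```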
V. NUMERICAL RESULTS
We consider a square room whose center is the origin point (0, 0, 0), with the RF AP located at the origin. Four VLC APs are located at (-3, -3), (-3, 3), (3, -3), and (3, 3). We have simulated the given system in Monte Carlo experiments in which two users are randomly placed, with x and y coordinates uniformly distributed within the room dimensions. Each simulation is executed until convergence is achieved, defined as having the average user rates within the target bands for all UEs for a fixed number of consecutive iterations. The rest of the system parameters for the simulations are provided in Table I. The TensorFlow and Keras libraries are used to implement the neural networks of the DQN agents. Each agent's DQN has fully connected hidden layers, the loss function is the mean squared error, and the optimizer is Adam.
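The convergence criterion above can be tested as sketched below; the window length is an assumed placeholder, since the exact iteration count is a simulation choice.

```python
import numpy as np

def converged(rate_history, targets, bands, window=100):
    """True if the average user rates stay within the target bands for all
    UEs over the last `window` iterations (Sec. V convergence criterion)."""
    if len(rate_history) < window:
        return False
    recent = np.asarray(rate_history[-window:])   # shape: (window, n_users)
    avg = recent.mean(axis=0)
    return bool(np.all(np.abs(avg - np.asarray(targets)) <= np.asarray(bands)))
```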
TABLE I: Simulation parameters for the hybrid network.

Parameter (symbol): Value
Maximum transmit power for RF links (P_RF^max): W
PSD of AWGN at the RF links (σ_RF^2): − dBm/MHz
Bandwidth for RF links (W_RF): 5 MHz
Maximum transmit power for VLC links (P_VLC^max): 2 W
PSD of noise in the VLC links (σ_VLC^2): − dBm/MHz
Total bandwidth for each VLC AP (W_VLC): MHz
Height of the ceiling (y): m
Half of the PD's field-of-view (Ψ_fov): °
Semi-angle at half power of the LED (Ψ_1/2): °
Effective detection area of the PD (A_pd): m²
Responsivity of the PD (λ): A/W
Gain of the optical filter (H_f): 1
Refractive index of the optical concentrator (n_c):
Optical-to-electric conversion efficiency (κ):
Number of VLC APs (K): 4
Number of RF APs (K_RF): 1
Number of UEs (N): 2
Learning rate (α): 0.5
Discount factor (γ): 0.5

Fig. 2: User rate (Mbps) vs. iterations (t), where the user target rates are 20 Mbps and 12 Mbps.

As a sample realization, in Fig. 2 we evaluate the two algorithms by comparing the actual user rates against target rates of 20 Mbps and 12 Mbps for User-1 and User-2, respectively. Both algorithms initially begin with lower transmit powers and take different actions over time. In Fig. 2, we can observe that the DQN-based power allocation algorithm achieves convergence by reaching the target rates of both users in far fewer iterations than the QL-based power allocation algorithm requires to reach a feasible solution with the desired average rate for both users. The QL-based algorithm reaches the target rate for User-1 and for User-2 at different times; however, neither is the desired state, because the other user's rate in each situation is much lower than its target rate. Since the power is shared between the users, the QL-based algorithm needs considerably more iterations to reach a feasible solution and converge. Another observation from Fig. 2 concerns the gap between the actual rates and the users' target rates: thanks to its continuous state-space definition, the DQN-based algorithm converges to the target band in fewer iterations and to a value closer to the target rate than the QL-based algorithm.

Fig. 3: Reward comparison of the QL and DQN algorithms.

In Fig. 3, the reward at each iteration is shown for the QL and DQN agents in the sample case of Fig. 2. Note that the reward for each agent in the system is the same due to the reward function definition in (19). This result shows that the convergence speed and overall performance of the DQN-based algorithm are much better than those of the QL-based algorithm: the DQN-based algorithm reaches convergence in far fewer iterations than the QL-based algorithm and converges to a better result, with a larger reward.

Fig. 4: Convergence CDFs of QL and DQN.

The convergence time CDFs of the two algorithms are provided in Fig. 4.
The CDFs are computed over Monte Carlo experiments in which users are distributed uniformly over the simulation area in each experiment. The QL-based power allocation algorithm has a longer median convergence time than the DQN-based power allocation algorithm. This result shows that using a continuous state space allows the DQN-based algorithm to converge roughly twice as fast as the QL-based algorithm on average. Additionally, we observe that the DQN-based power allocation scheme converges to the desired state (stability within the target band) in a larger fraction of the experiments than the QL-based algorithm.

VI. CONCLUSION
This paper investigates power allocation for hybrid RF/VLC networks with multiple users. A multi-agent DQN-based algorithm is proposed to take precise actions based on the difference between users' actual rates and target rates. It is shown that the DQN-based algorithm's median convergence time is shorter than that of the QL-based algorithm. Additionally, the DQN-based algorithm performs better than the QL-based algorithm at closing the gap between users' actual rates and target rates. In our future work, our goal is to develop a Deep Deterministic Policy Gradient (DDPG) based power allocation algorithm with a continuous action space to handle power allocation more precisely, converging to the exact target rates instead of target bands.
REFERENCES
[1] D. A. Basnayaka and H. Haas, "Hybrid RF and VLC systems: Improving user data rate performance of VLC systems," in Proc. IEEE 81st Vehicular Technology Conference (VTC Spring), 2015, pp. 1–5.
[2] S. Dimitrov and H. Haas, Principles of LED Light Communications: Towards Networked Li-Fi. Cambridge University Press, 2015.
[3] M. Hammouda, S. Akın, A. M. Vegni, H. Haas, and J. Peissig, "Link selection in hybrid RF/VLC systems under statistical queueing constraints," IEEE Transactions on Wireless Communications, vol. 17, no. 4, pp. 2738–2754, 2018.
[4] M. Kashef, M. Ismail, M. Abdallah, K. A. Qaraqe, and E. Serpedin, "Energy efficient resource allocation for mixed RF/VLC heterogeneous wireless networks," IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 883–893, 2016.
[5] A. Mostafa and L. Lampe, "Physical-layer security for indoor visible light communications," in Proc. IEEE International Conference on Communications (ICC), 2014, pp. 3342–3347.
[6] J. Al-Khori, G. Nauryzbayev, M. Abdallah, and M. Hamdi, "Secrecy capacity of hybrid RF/VLC DF relaying networks with jamming," in Proc. IEEE, 2019, pp. 67–72.
[7] X. Li, R. Zhang, and L. Hanzo, "Cooperative load balancing in hybrid visible light communications and WiFi," IEEE Trans. Commun., vol. 63, no. 4, pp. 1319–1329, 2015.
[8] X. Li, J. Fang, W. Cheng, H. Duan, Z. Chen, and H. Li, "Intelligent power control for spectrum sharing in cognitive radios: A deep reinforcement learning approach," IEEE Access, vol. 6, pp. 25463–25473, 2018.
[9] D. A. Basnayaka and H. Haas, "Design and analysis of a hybrid radio frequency and visible light communication system," IEEE Transactions on Communications, vol. 65, no. 10, pp. 4334–4347, 2017.
[10] A. Ahmad, S. Ahmad, M. H. Rehmani, and N. U. Hassan, "A survey on radio resource allocation in cognitive radio sensor networks," IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 888–917, 2015.
[11] F. Hussain, S. A. Hassan, R. Hussain, and E. Hossain, "Machine learning for resource management in cellular and IoT networks: Potentials, current solutions, and open challenges," IEEE Communications Surveys & Tutorials, 2020. [Online]. Available: http://arxiv.org/abs/1907.08965
[12] M. Obeed, A. M. Salhab, M.-S. Alouini, and S. A. Zummo, "On optimizing VLC networks for downlink multi-user transmission: A survey," IEEE Communications Surveys & Tutorials, vol. 21, no. 3, pp. 2947–2976, 2019.
[13] J. Kong, Z. Wu, M. Ismail, E. Serpedin, and K. A. Qaraqe, "Q-learning based two-timescale power allocation for multi-homing hybrid RF/VLC networks," IEEE Wireless Communications Letters, pp. 1–1, 2019.
[14] H. Zhang, N. Liu, K. Long, J. Cheng, V. C. M. Leung, and L. Hanzo, "Energy efficient subchannel and power allocation for software-defined heterogeneous VLC and RF networks," IEEE Journal on Selected Areas in Communications, vol. 36, no. 3, pp. 658–670, 2018.