Enhanced Pub/Sub Communications for Massive IoT Traffic with SARSA Reinforcement Learning

Carlos E. Arruda, Pedro F. Moraes, Nazim Agoulmine, and Joberto S. B. Martins

University of Paris-Saclay/IBISC Laboratory, France
Salvador University - UNIFACS, Salvador, Brazil
[email protected], [email protected], [email protected], [email protected]
Abstract.
Sensors are being extensively deployed and are expected to expand at significant rates in the coming years. They typically generate a large volume of data in internet of things (IoT) application areas like smart cities, intelligent traffic systems, smart grid, and e-health. Cloud, edge and fog computing are potential and competitive strategies for collecting, processing, and distributing IoT data. However, cloud, edge, and fog-based solutions need to tackle the distribution of a high volume of IoT data efficiently through constrained and limited-resource network infrastructures. This paper addresses the issue of conveying a massive volume of IoT data through a network with limited communications resources (bandwidth) using a cognitive communications resource allocation based on Reinforcement Learning (RL) with the SARSA algorithm. The proposed network infrastructure (PSIoTRL) uses a Publish/Subscribe architecture to access massive and highly distributed IoT data. It is demonstrated that the PSIoTRL bandwidth allocation for buffer flushing based on SARSA enhances the IoT aggregator buffer occupation and network link utilization. The PSIoTRL dynamically adapts the IoT aggregator traffic flushing according to the Pub/Sub topic's priority and network constraint requirements.
Keywords:
Publish/Subscribe · Reinforcement Learning · SARSA · IoT · Massive IoT Traffic · Resource Allocation · Network Communications.
1 Introduction

(Work supported by CAPES and Salvador University - UNIFACS.)

Sensors are being extensively deployed and are expected to expand at significant rates in the coming years in internet of things (IoT) application areas like smart city, intelligent traffic systems (ITS), connected and autonomous vehicles (CAV), video analytics in public safety (VAPS) and mobile e-health [31] [37], to mention some.

Cloud, edge and fog computing are currently the most prevailing and competing strategies used to convey IoT data between producers and consumers. They use several different alternatives to decide where to process, how to distribute, and where to store IoT data [23]. Nevertheless, cloud, edge, and fog-based solutions need to tackle the distribution of a high volume of IoT data efficiently.

IoT data distribution, whatever data handling, processing, or storing strategy is used, requires efficient network communications infrastructures. In fact, there is a research challenge in enhancing network communications and providing efficient resource allocation based on the fact that, in many applications like IoT, networks have limited and constrained resources [5].

Machine learning and reinforcement learning with distinct algorithms like Q-Learning [33], deep reinforcement learning [11], and SARSA (State-Action-Reward-State-Action) [1] are being applied to support an efficient allocation of resources in networks for application areas like 5G, cognitive radio, and mobile edge computing, to mention some [7] [28].

The Publish/Subscribe (Pub/Sub) paradigm is extensively used as an enabler for data distribution in various scenarios like information-centric networking, data-centric systems, smart grid, and other general group-based communications [25] [10] [2]. The Pub/Sub paradigm is an alternative for IoT data distribution between producers and consumers in general [10] [22].

This work addresses the issue of enhancing the utilization of a communication channel (MPLS label switched path - LSP, physical link, fiber optics slot, others) deployed in network infrastructures with limited resources (bandwidth) using the SARSA reinforcement learning algorithm. The target application area is the massive exchange of IoT data using the Pub/Sub paradigm. The framework integrating the SARSA module with the Pub/Sub message exchange deployment (PSIoT, described in [22]) is the PSIoTRL.

Differently from other proposals for network infrastructure resource allocation and optimization that consider, for instance, the quality of service for the entire set of network nodes and Pub/Sub aggregators, the SARSA-based PSIoTRL framework provides a simple data-ingress-based solution. In fact, it controls the quality of Pub/Sub topic data distribution in an aggregator by keeping Pub/Sub topic buffer occupation below a defined threshold.
This indirectly means that a certain amount of bandwidth is available for the Pub/Sub topic and, consequently, a level of quality is allocated for the communication channel. Since the Pub/Sub data transfers are dynamically requested, SARSA deploys a dynamic control of bandwidth allocation for buffer data flushing.

The contribution of this work is multi-fold. Firstly, we modeled the SARSA agent communications with a generic buffer occupation metric. The adopted metric results in a limited number of SARSA states, allowing the allocation of bandwidth that preserves Pub/Sub topic priorities without using extensive computational resources. Moreover, we demonstrate that Pub/Sub communications with SARSA can be enhanced by dynamically adjusting IoT aggregator queue occupation to Pub/Sub topic priorities and network resource availability.

The remainder of the paper is organized as follows. Section 2 presents the related work on IoT data processing deployments with cognitive communications approaches. Section 3 is a background section about reinforcement learning with SARSA. Sections 4 and 5 describe the basic PSIoT Pub/Sub framework, discuss the RL applicability for the constrained communication problem and present the PSIoTRL framework components, including the intelligent orchestrator for IoT traffic management. In Section 6, we present a proof of concept for the PSIoTRL and evaluate how it enhances IoT network resource management. Finally, Section 7 concludes with an overview of the main highlights, contributions and future work.
2 Related Work

Architectural and system-wide studies about IoT massive data processing and data flow in smart cities are presented in Al-Fuqaha [20], Rathore [27] and Martins [17]. Al-Fuqaha [20] introduces the concept of a cognitive smart city where IoT and artificial intelligence are merged in a three-level model with different requirements. In relation to the communication level, the basic approach assumes a fog cloud computing (fog-CC) communication without considering any specific IoT massive traffic requirement. In fact, the proposal assumes that aggregation with edge processing reduces the required bandwidth for fog-CC communication to a minimum. Rathore [27] explores the issue of big data analytics for smart cities. The paper proposes an edge-based aggregation and processing strategy for raw data. Processed data is forwarded through gateways to smart city applications using the Internet, assuming bandwidth reduction at edge level. Martins [17] discusses the potential benefits and impacts of how some technological enablers, like software-defined networking [12] and machine learning [36], are integrated and aim at cognitive management [28]. IoT data edge-processed aggregation, network communication and service deployments towards an efficient overall smart city solution are discussed, and the relevance of intelligent communication resource provisioning and deployment is highlighted.

Resource provisioning from the perspective of IoT services deployment for smart cities is considered by Santos [31]. The paper proposes a container-based micro-services approach, using Kubernetes, for service deployment that aims to off-load IoT processing with fog computing.
In relation to this paper's discussion, the proposal endorses fog processing offloading using an edge-computed approach and does not consider the network communication resource provisioning necessary to distribute the outcomes of the edge processing.

Edge intelligence for service deployment is discussed in Zhang [37]. The proposed solution defines a framework capable of supporting the deployment of AI algorithms like RL on common edge aggregators like a Raspberry Pi. The assumed approach is to enable the execution of AI algorithms at the edge for those applications that require near real-time edge processing, like voice recognition and on-board autonomous vehicle processing.

Machine learning in communication networks is broadly addressed by Cote [7] and Boutaba et al. in [5].

Intelligent network communication resources are considered by Zhao [38]. The work presents a smart routing solution for crowdsourcing data with mobile edge computing (MEC) in smart cities using reinforcement learning (RL). The solution defines routes, differently from ours, which optimizes the bandwidth allocated for IoT flow flushing.

In summary, this work advances on existing studies that propose service provisioning by adding an intelligent component for the allocation of communication resources between IoT data aggregators and IoT data consumers. From the architectural point of view, this work adopts an edge-based processing approach coupled with an efficient communication resource allocation for massive IoT data transfers.
3 Reinforcement Learning with SARSA

Reinforcement learning (RL) is a largely used machine learning (ML) technique in which a trial-and-error learning process is executed by an agent that acts on a system aiming to maximize its rewards. The RL algorithm is expected to learn how to reach or approach a certain objective by interacting with a system through a feedback loop [13] [33] [19].

In RL, a reward value r is received by the agent for the transitions from one state to another. The overall objective of the agent is to find a policy π which maximizes the expected future sum of rewards received, each of them subjected to a discount factor γ.

The value function in RL is a prediction of the return available from each state, as indicated in Equation 1:

V(s_t) ← E{ Σ_{k=0}^{∞} γ^k r_{t+k} }   (1)

where r_t is the reward received for the transition from state s_t to s_{t+1} and γ is the discount factor (0 ≤ γ ≤ 1). The value function V(s_t) represents the discounted sum of rewards received from step t onward. Therefore, it depends on the sequence of actions taken and on the policy adopted to take these actions.

Two well-known and somewhat similar reinforcement learning algorithms are Q-learning [33] [8] and SARSA [14] [1].

Q-learning is an off-policy RL algorithm in which the agent finds an optimal policy that maximizes the total discounted expected reward for executing a particular action at a particular state. Fundamentally, Q-learning finds the optimal policy in a step-by-step manner. Q-learning is off-policy because the next state and action are uncertain when the algorithm updates the value function.

In Q-learning, the value function, termed the Q-function, is learnt. It is a prediction of the return associated with each action a ∈ A (the set of actions). This prediction can be updated with respect to the predicted return of the next state visited (Equation 2).
Q(s_t, a_t) ← r_t + γ V(s_{t+1})   (2)

Since the overall objective is to maximize the reward received, the current estimate of V(s_{t+1}) becomes:

Q(s_t, a_t) ← r_t + γ max_{a ∈ A} Q(s_{t+1}, a)   (3)

The Q-function is shown to converge for Markovian decision processes (MDP) [26]. In Q-learning, the agent maintains a lookup table Q(S, A), where Q(s, a) represents the current estimate of the optimal action-value function. Once the Q-function has converged, the optimal policy π is to take the action in each state with the highest predicted return (greedy policy) [29].

The SARSA (State-Action-Reward-State-Action) algorithm is the reinforcement learning approach used by the PSIoTRL framework [14] [1]. SARSA is a temporal-difference (TD) on-policy algorithm that learns the Q-values based on the action performed by the current policy. The SARSA algorithm differs from Q-learning in the way it sets up the future reward: in SARSA the agent uses the action and the state at time t+1 to update the Q-value. The SARSA tuple of main elements involved in the interaction process is:

⟨ s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1} ⟩   (4)

where:

– s_t, a_t are the current state and action;
– r_{t+1} is the reward; and
– s_{t+1}, a_{t+1} are the next state and action reached using the policy (ε-greedy).

SARSA Q-values are therefore updated based on Equation 5 or 6:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]   (5)

ΔQ(s_{t−1}, a_{t−1}) = α(t) [ r_t + γ Q(s_t, a_t) − Q(s_{t−1}, a_{t−1}) ]   (6)

where:

– α is the learning rate (0 ≤ α ≤ 1); and
– γ is the discount factor (0 ≤ γ ≤ 1).

Reinforcement learning has achieved superhuman performance in games like Go [32] and also obtained a critical result bridging the divide between high-dimensional sensory inputs and actions with ATARI [18].

Video games and communication networks have imperfect but interesting operational similarities. As discussed in Cote [7], video games and networks are closed systems with a finite number of states.
In games, actions include pressing buttons and moving the joystick, the image pixels define the state, and the cost function is the game score. In networks, actions correspond mainly to network configuration parameters (bandwidth allocation, fiber allocation, others) or network routing configuration. The network state (snapshot) defines the state of the system at each iteration, and the cost function corresponds to the performance of the network, such as utilization rate, throughput, or number of dropped packets. Hence, it is reasonable to consider that SARSA may have a parallel success in allocating network communications resources (bandwidth) in the PSIoTRL.

Reinforcement learning with SARSA has proved efficient and obtained substantial success in resource allocation [1] [30], cloud computing [4] and computational offloading [3] [24]. An evaluation of SARSA and various other model-free algorithms is presented in Dafazio [9]. The evaluation considered diverse and difficult problems within a consistent environment (Arcade [15]) involving configuration aspects like the ε-greedy policy, exploration versus exploitation and state space, with SARSA outperforming algorithms like Q-learning, ETTR (Expected Time-to-Rendezvous) [34], R-learning [16], the GQ algorithm [35] and the Actor-Critic algorithm [6].

SARSA algorithm features and potential advantages within the PSIoTRL environment include:

– Being an on-policy algorithm, SARSA attempts to evaluate or improve the policy that is used to make decisions;
– SARSA avoids the state explosion issue common with the Q-learning algorithm;
– PSIoTRL has a reduced state space and, consequently, SARSA behaves effectively by exploring all states at least once; and
– Key issues that RL algorithms experience, like bad initial performance and long training time, are minimized in the PSIoTRL environment since SARSA addresses the allocation of network bandwidth locally with a reduced state space.
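The off-policy versus on-policy distinction between Equations 3 and 5 can be made concrete with a short tabular sketch (illustrative only, not the PSIoTRL implementation; state and action names are hypothetical):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.2, gamma=0.8):
    """Off-policy (Equation 3): bootstrap on the greedy action in s_next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.2, gamma=0.8):
    """On-policy (Equation 5): bootstrap on the action actually taken in s_next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# Q-values start at zero (tabula rasa), keyed by (state, action) pairs
Q = defaultdict(float)
```

The only difference is the bootstrap term: Q-learning takes the maximum over all actions in the next state, while SARSA uses the Q-value of the action the current (ε-greedy) policy actually selected.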
4 The PSIoT Framework

The objective of this work is to enhance the Publish/Subscribe (Pub/Sub) communications used in the context of massive IoT data transfers (Figure 1). The Pub/Sub communications are supported by the PSIoT framework described in [22]. The PSIoT framework aims to efficiently handle network resources and IoT QoS requirements over the network between the IoT devices and consumer IoT applications [22]. Consumers are applications executed on servers located beyond the backbone network, in a cloud computing infrastructure accessed by the network, or in any other scheme that makes use of the managed network infrastructure for communications (Figure 1).
Fig. 1. PSIoT Framework Functional View [22].
The PSIoT framework was developed to manage IoT traffic in a network and to provide QoS based on IoT data characteristics and network-wide specifications, e.g. total network use, real-time network traffic, routing and bandwidth constraints.

IoT data generated from sensors and devices can be aggregated and processed in the cloud or at the network edge, where each of these points is considered an Aggregator or Producer in the PSIoT framework, whereas each client that receives IoT data, both end-user applications and even other aggregation nodes, is denoted a Consumer.

The PSIoT framework was designed to orchestrate massive IoT traffic and allocate network resources between producers and consumers. Besides, the PSIoT framework can schedule the data flow, based on Quality of Service (QoS) requirements, and allocate bandwidth on a backbone with limited resources. For this, the PSIoT was modeled to operate with four main components: producers, consumers, a backbone with limited resources, and the orchestrator (Figure 1) [22].

Aggregators are elements that gather the data obtained by connected sensors to send them (at an opportune moment) to their consumers (clients) using a Pub/Sub-style communication channel. In the opposite direction, aggregators deliver data received from producers to actuators.
Fig. 2. PSIoT Pub/Sub Message Model [22].
This exchange of information is performed through the Pub/Sub architecture, whose asynchronous characteristic, use of topics and use of a broker make it attractive for IoT applications [10].

PSIoT implements QoS when forwarding the data gathered by the aggregator to output buffers, according to the following characteristics [22]:

– b0 (priority) - high transmission rates and low delay application requirements, for health care and data from critical industrial sensors, as examples;
– b1 (sensitive) - commercial data and security sensors; and
– b2 (insensitive) - best effort.

The Pub/Sub message model adopted by the PSIoT is illustrated in Figure 2. In summary, the message flow is as follows [22]:

1. A consumer subscribes to a particular topic and specifies the requested QoS level with a particular aggregator (a);
2. The aggregator sends to the orchestrator relevant metadata such as the number of subscribers and their associated QoS levels and buffer allocation (b);
3. The orchestrator notifies the aggregator with an amount of bandwidth that can be consumed by each level of QoS (c); and
4. The aggregator publishes the data to the consumer according to bandwidth and data availability in the buffer (d).

The PSIoT uses fixed-rule scheduling for IoT data consumption.
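The four-step flow above can be sketched as a minimal toy broker (all class and method names here are illustrative, not the PSIoT framework's actual API; the even-split allocation rule is a placeholder for the orchestrator's real logic):

```python
class Orchestrator:
    """Toy stand-in for step (c): hands back per-topic bandwidth shares."""
    def allocate(self, metadata):
        # placeholder rule: split 100% of bandwidth evenly across active topics
        n = max(len(metadata), 1)
        return {topic: 100 // n for topic in metadata}

class Aggregator:
    """Toy sketch of the aggregator side of the Pub/Sub message flow."""
    def __init__(self, orchestrator):
        self.orchestrator = orchestrator
        self.subscribers = {}                      # topic -> [(consumer, qos)]
        self.buffers = {"b0": [], "b1": [], "b2": []}
        self.rates = {}

    def subscribe(self, consumer, topic, qos):     # step (a)
        self.subscribers.setdefault(topic, []).append((consumer, qos))
        self.send_metadata()

    def send_metadata(self):                       # step (b): subscriber counts
        meta = {t: len(subs) for t, subs in self.subscribers.items()}
        self.rates = self.orchestrator.allocate(meta)   # step (c)

    def publish(self, topic, data):                # step (d)
        self.buffers[topic].append(data)
        for consumer, _qos in self.subscribers.get(topic, []):
            consumer(data)
```

A subscription triggers a metadata update and a fresh allocation, after which published items are buffered and delivered to the topic's consumers.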
5 The PSIoTRL Framework

The PSIoTRL architectural components are illustrated in Figure 3. Its main proposed modules are:

– The SARSA agent;
– The PSIoTRL orchestrator module;
– Aggregators (at least one for each cluster where IoT data will be consumed);
– The network infrastructure (backbone); and
– Producers and consumers.

The SARSA agent is integrated in the PSIoT framework [22] and, on demand, allocates bandwidth for the aggregator queues.

The PSIoTRL orchestrator module is described in [22] and [21]. It basically has knowledge of each aggregator's Pub/Sub subscriptions, as well as the QoS levels required for each topic subscription. This allows it to control the transmission (emptying) rates of IoT data from each buffer within the aggregator in the network. The SARSA agent computes the allocated bandwidth and the orchestrator deploys it.

The aggregators are fog-like nodes connected to IoT devices that act as aggregators for their data and also act as Pub/Sub producers regarding IoT applications subscribing to topics and consuming the corresponding IoT data [21] [10].

The network interconnects data producers (aggregators) and consumers and has limited bandwidth resources for massive IoT data transfers.

Producers and consumers exchange IoT data. They use Pub/Sub [10] to communicate and exchange massive amounts of IoT data through a network with constrained bandwidth resources [22] [21].
The PSIoTRL framework uses the Pub/Sub paradigm to produce and consume data. Consumers request IoT data on the aggregator's queues, and the aggregator empties its buffers according to consumer requests. The aggregator queues are deployed with the following configuration:

– Three IoT data queues (buffers) are configured per aggregator (one Pub/Sub topic per queue): B1, B2 and B3;
– Initial buffer transmission rates are T1, T2 and T3; and
– Buffer priorities are p1, p2 and p3, with p1 > p2 > p3.

For this evaluation we consider one aggregator Ag located at network node n_z that delivers IoT data to a set of consumers C_i, with i = 1, 2, ..., n, located at network node n_y. Between nodes n_z and n_y there is a communication resource (MPLS LSP, physical link, fiber slot, other) with a limited bandwidth BW_zy. The communication resource interconnecting the Pub/Sub producer and consumers is then constrained as follows.

By initial buffer transmission rate, we mean the configured initial transmission rate to empty buffers without any buffer over-utilization.
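A minimal sketch of this setup (names are hypothetical; the 35/25/15% initial rates and the 50% threshold are taken from the proof-of-concept section later in the paper, and the capacity check mirrors the link constraint of Equation 7):

```python
from dataclasses import dataclass

@dataclass
class QueueConfig:
    name: str
    priority: int        # 1 = highest, mirroring p1 > p2 > p3
    init_rate: float     # initial emptying rate, in % of the channel bandwidth

# Three topic queues per aggregator, with the proof-of-concept initial rates
queues = [QueueConfig("B1", 1, 35.0),
          QueueConfig("B2", 2, 25.0),
          QueueConfig("B3", 3, 15.0)]

def state_of(occupations, threshold=0.5):
    """Map buffer occupations (fractions) to the agent's BL/AL state tuple."""
    return tuple("AL" if occ > threshold else "BL" for occ in occupations)

def within_link_capacity(rates, bw_zy=100.0):
    """Equation 7: the sum of emptying rates must fit the channel bandwidth."""
    return sum(rates) <= bw_zy
```

With the initial rates summing to 75% of the link, the constraint holds and leaves headroom for the agent's T+ adjustments.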
Fig. 3. PSIoTRL Components and Communication - Orchestrator, Aggregator, Buffers and Communication Channel.

BW_zy ≥ Σ_{j=1}^{n} TB_j   (7)

where TB_j is the current transmission rate used to empty buffer j.

The aggregator monitors buffer occupation, and occupation above a defined threshold limit is signaled to the orchestrator (Figure 3).

The PSIoTRL deployment and operation require SARSA agent modeling for the buffer bandwidth allocation problem and the definition of basic SARSA algorithm configuration parameters. The primary goal of the SARSA agent is to arbitrate the output transmission rates among the aggregator buffers in order to efficiently distribute IoT data to the consumers (Figure 3).

The PSIoTRL SARSA agent is modeled with a finite set of states for the aggregator output queues, a finite set of configuration actions to be executed on each queue, and a set of reward values for each state/action transition pair. In the SARSA agent, the system state S represents the overall aggregator output queue status in terms of occupation at a given time t.
Table 1. Buffer (Queue) States.

State (B1,B2,B3)   Description
BL, BL, BL         all queues below the threshold limit
BL, BL, AL         queues 1 and 2 below the threshold limit
BL, AL, BL         queues 1 and 3 below the threshold limit
BL, AL, AL         queue 1 below the threshold limit
AL, BL, BL         queues 2 and 3 below the threshold limit
AL, BL, AL         queue 2 below the threshold limit
AL, AL, BL         queue 3 below the threshold limit
AL, AL, AL         no queues below the threshold limit

† Legend: BL - Below Limit; AL - Above Limit.
Table 2.
SARSA agent actions.

Buffer (Queue)   Actions
B1               T+, T−, N
B2               T+, T−, N
B3               T+, T−, N

T+: increase transmission rate; T−: decrease transmission rate; N: null action.

The SARSA agent states are indicated in Table 1. Each queue has two states, either below (BL) or above (AL) a preconfigured threshold. Having two states for each queue provides, in this context, sufficient information for bandwidth allocation and contributes to avoiding the common state explosion issue existing in many RL deployments.

A discrete and small number of states for the queues can reasonably be assumed because the state mainly works as a threshold to indicate that queues require attention to enforce priorities or maximize throughput. The first case eventually happens when queued Pub/Sub messages require more capacity to be transferred than the actually allocated one. The second case corresponds to the situation where some queues have plenty of data to transmit, while others have unused capacity.

The utilization of the buffer threshold results in a somewhat agnostic Pub/Sub implementation. In fact, the threshold-based cognitive actions allow:

– Tuning the PSIoTRL to have a faster or proactive reaction to adjust the queues' bandwidth; and
– Tuning the Pub/Sub dynamic reconfiguration capability according to IoT data transfer sensitivity.

The PSIoTRL SARSA agent actions are illustrated in Table 2. Three actions are defined for each buffer: increase capacity (transmission rate), reduce capacity (transmission rate) and do nothing. Therefore, twenty-seven combined actions are possible for each agent state. The amount of bandwidth increased or reduced per queue is a SARSA configuration parameter. A set of rewards is also defined for each executed state/action pair.

ε-Greedy Policy

The SARSA agent uses an ε-greedy policy since it must exploit as much as possible the acquired knowledge but also explore new possibilities for enhancing the allocation of bandwidth for queue communications.
The SARSA agent takes new actions at random, with the ε-greedy policy defining the probability of choosing random actions. No value change or on-the-fly fine-tuning of ε was considered in this solution.

The ε-greedy policy adequately matches the inherent needs of a Pub/Sub IoT message delivery framework. Random demands from consumers consume Pub/Sub data with different priorities. These demands generate a random volume of data to be transferred from aggregator queues to consumers. The per-queue bandwidth distribution must be dynamically computed among IoT data queues considering the existing constraint (bandwidth limitation) of the available communication channel.

The SARSA agent manages the aggregator output queue transmission rates according to IoT data demands, IoT data priority, and communication resource constraints (bandwidth limitation). The bandwidth allocation algorithm pseudo-code is presented in Algorithm 1.
Algorithm 1 PSIoTRL-SARSA Bandwidth Allocation Algorithm Pseudo-code

1:  procedure PSIoTRL-SARSA(Q(S_t, A_t, r, S_{t+1}, A_{t+1}))   ▷ SARSA states and reward
2:    for each pair (S_t, A_t) do                               ▷ Initialization
3:      Initialize Q-values: Q(S_t, A_t) = 0                    ▷ Tabula rasa approach
4:    repeat                                                    ▷ Forever
5:      Get current PSIoTRL buffers state S_t                   ▷ Buffer state is the trigger event
6:      repeat                                                  ▷ Finishes upon terminal condition
7:        Choose action A_t using the ε-greedy policy           ▷ Exploration and exploitation
8:        Execute action A_t on the PSIoTRL system              ▷ T+, T− or do nothing on queues
9:        Get immediate reward r_t
10:       Collect new state S_{t+1}
11:       Choose new action A_{t+1} using the ε-greedy policy
12:       Update Q-value Q(S_t, A_t) using SARSA Equation 5     ▷ SARSA algorithm
13:     until terminal condition reached
14:   until forever
The algorithm procedure is as follows:

1. The algorithm starts by initializing the Q-values in the table, Q(s, a);
2. The current state s is captured;
3. An action is chosen for the current state using the ε-greedy policy;
4. The agent triggers the action and observes the immediate reward r, as well as the new state s′ reached;
5. The Q-value for the pair (s, a) is updated using the observed reward and the Q-value of the next state-action pair; and
6. Finally, for the current cycle, the state is set to the new state, and the process repeats until the end condition is reached.

The end condition for the PSIoTRL bandwidth allocation process is the following:

– The bandwidth constraint limit is reached (Equation 7); or
– The current buffer transmission rates match the corresponding priorities (T1 > T2 > T3); or
– The maximum number of attempts is reached.
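The procedure above can be sketched compactly in tabular form (an illustrative sketch, not the PSIoTRL code; `env` is a hypothetical stand-in for the PSIoTRL system, and the reward logic is left to the caller):

```python
import random
from collections import defaultdict

# 8 buffer states (BL/AL per queue) and 27 combined actions (T+/T-/N per queue)
STATES = [(b1, b2, b3) for b1 in ("BL", "AL")
          for b2 in ("BL", "AL") for b3 in ("BL", "AL")]
ACTIONS = [(a1, a2, a3) for a1 in "+-N" for a2 in "+-N" for a3 in "+-N"]

def run_sarsa(env, episodes=400, alpha=0.2, gamma=0.8, eps=0.02):
    """Sketch of Algorithm 1. `env` must expose reset() -> state and
    step(action) -> (reward, next_state, done), where `done` captures the
    terminal conditions (constraint reached, priorities satisfied, max tries)."""
    Q = defaultdict(float)                               # tabula rasa Q-table
    def choose(s):                                       # eps-greedy policy
        if random.random() < eps:
            return random.choice(ACTIONS)                # explore
        return max(ACTIONS, key=lambda a: Q[(s, a)])     # exploit
    for _ in range(episodes):
        s = env.reset()                                  # triggering buffer state
        a = choose(s)
        done = False
        while not done:
            r, s2, done = env.step(a)                    # apply T+/T-/N on queues
            a2 = choose(s2)
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])  # Eq. 5
            s, a = s2, a2
    return Q
```

The small state space (8 states × 27 actions = 216 Q-entries) is what makes the paper's argument about avoiding state explosion plausible: the table is tiny, so every state can be visited repeatedly within a few hundred attempts.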
6 Proof of Concept

The purpose of this proof of concept is to validate that the agent behavior enhances the allocation of bandwidth when one or more buffers exceed the defined buffer occupation limit (50% - AL condition). When this event occurs, the aggregator detects the problem and sends an alarm containing metadata to the orchestrator. After that, the SARSA agent allocates a new percentage of bandwidth among the aggregator buffers.

Four initial buffer conditions were used:

– One queue exceeds the occupation threshold (AL, BL, BL);
– Two queues exceed the occupation threshold (AL, AL, BL);
– Three queues exceed the occupation threshold (AL, AL, AL); and
– One queue exceeds the link bandwidth capacity and the two other queues exceed the occupation threshold (+AL, AL, AL).

The considered performance parameters are the following:

– Queue (buffer) occupation;
– Link occupation; and
– Packet loss.

The allocation of bandwidth to the aggregator queues is evaluated in two scenarios. In scenario 1, a predefined simple algorithm is used to allocate the bandwidth. In scenario 2, the SARSA agent does the same task. Table 3 presents a summary of the evaluation scenarios.
Table 3.
Summary of the Proof of Concept Scenarios.

AGGREGATOR                               ORCHESTRATOR
Queue States (B1,B2,B3)  T1i  T2i  T3i  Scenario 1            Scenario 2 (SARSA)
AL,BL,BL                 35%  25%  15%  T1 = T1i * factor;    SARSA
AL,AL,BL                                T2 = (100 - T1)/2;    Module
AL,AL,AL                                T3 = (100 - T1)/2;
+AL,BL,BL

– +AL: 120% buffer occupation; dropping packets; factor: 1.15 or 1.25 or 1.50.

Scenario 1: Fixed-Rule Bandwidth Allocation

In this scenario, the orchestrator has a simple fixed rule to increase or decrease the bandwidth for buffer flushing. The T1 bandwidth is adjusted by a fixed factor (15%, 25% or 50%) and the remaining queues each get half of the remaining bandwidth, as follows:

– T1 = T1i * factor, where T1i is the T1 initial state;
– T2 = (100 − T1)/2; and
– T3 = (100 − T1)/2.

The initial buffer transmission (emptying) rates are respectively 35%, 25% and 15% for T1, T2 and T3. Tables 4, 5 and 6 present the results obtained with the fixed-rule bandwidth allocation method. In these tables, the packet loss for queue Bi is Pi.
Table 4. Bandwidth allocation with fixed rule - 15% bandwidth adjustment.

Initial QS (B1i,B2i,B3i)  Final QS (B1,B2,B3)  Final TR (T1,T2,T3)  Packet Loss (P1,P2,P3)  Link Occupation
AL,BL,BL                  (72,16,0)            (40,30,30)           (0,0,0)                 100
AL,AL,BL                  (72,64,0)            (40,30,30)           (0,0,0)                 100
AL,AL,AL                  (72,64,0)            (40,30,30)           (0,0,0)                 100
+AL,BL,BL                 (108,16,0)           (40,30,30)           (8,0,0)                 100

– QS: Queue State; TR: Transmission Rate; AL: 80%; BL: 20%; +AL: 120%.
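The scenario 1 rule can be sketched in a few lines (illustrative; the rounding here is simplified, so the 50% case differs slightly from the paper's (53,24,23) in Table 6, where the split is not exactly even):

```python
def fixed_rule(t1_init=35.0, factor=1.15):
    """Scenario 1: boost T1 by a fixed factor and split the remaining
    bandwidth evenly between T2 and T3 (all values in % of the link)."""
    t1 = t1_init * factor
    t2 = (100.0 - t1) / 2.0
    t3 = (100.0 - t1) / 2.0
    return round(t1), round(t2), round(t3)
```

For example, `fixed_rule(35, 1.15)` yields (40, 30, 30) and `fixed_rule(35, 1.25)` yields (44, 28, 28), matching the final transmission rates in Tables 4 and 5. Note that the rule always saturates the link (the three rates sum to 100%), which is why link occupation is 100 in every fixed-rule run.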
Table 5. Bandwidth allocation with fixed rule - 25% bandwidth adjustment.

Initial QS (B1i,B2i,B3i)  Final QS (B1,B2,B3)  Final TR (T1,T2,T3)  Packet Loss (P1,P2,P3)  Link Occupation
AL,BL,BL                  (56,18,2)            (44,28,28)           (0,0,0)                 100
AL,AL,BL                  (56,72,2)            (44,28,28)           (0,0,0)                 100
AL,AL,AL                  (56,72,8)            (44,28,28)           (0,0,0)                 100
+AL,BL,BL                 (84,18,2)            (44,28,28)           (0,0,0)                 100

– QS: Queue State; TR: Transmission Rate; AL: 80%; BL: 20%; +AL: 120%.

Table 6. Bandwidth allocation with fixed rule - 50% bandwidth adjustment.

Initial QS (B1i,B2i,B3i)  Final QS (B1,B2,B3)  Final TR (T1,T2,T3)  Packet Loss (P1,P2,P3)  Link Occupation
AL,BL,BL                  (40,20,10)           (53,24,23)           (0,0,0)                 100
AL,AL,BL                  (40,80,10)           (53,24,23)           (0,0,0)                 100
AL,AL,AL                  (40,80,40)           (53,24,23)           (0,0,0)                 100
+AL,BL,BL                 (60,20,10)           (53,24,23)           (0,0,0)                 100

– Legend: QS: Queue State; TR: Transmission Rate; AL: 80%; BL: 20%; +AL: 120%.

Scenario 2: SARSA-Based Bandwidth Allocation

The parameters and initial conditions used in this scenario of the proof of concept are the following:

– SARSA agent configuration parameters:
  • ε-greedy policy ε of 2%;
  • Learning rate α of 20%; and
  • Discount factor γ of 80%;
– Pub/Sub queue threshold limit (triggers agent action) = 50%;
– Agent actions: bandwidth increased or reduced by 10%; and
– Maximum number of attempts = 400.

The SARSA configuration parameters are fixed for all evaluations in scenario 2. Typical values were used; the impact of varying these parameters was not considered in this evaluation and is left for future work.

In terms of the PSIoTRL operation and SARSA evaluation, the agent action in the orchestrator is triggered by the aggregator anytime one of its queue occupations goes beyond the configured threshold limit (Figure 3).

The initial conditions with SARSA are the same as those applied in scenario 1, as indicated in Table 3. Ten runs were executed for each of the four initial states and the obtained results are summarized in Table 7 with a confidence interval of 5%.
Table 7. Queue Occupation, Packet Loss and Link Occupation with SARSA.

Initial QS (B1i,B2i,B3i) | Final QS (B1,B2,B3) | Final TR (T1,T2,T3) | Packet Loss (P1,P2,P3) | Link Occupation (%)
AL,BL,BL   | (48,12,22) | (49,35,13) | (0,0,0)  | 97
AL,AL,BL   | (47,47,23) | (50,35,12) | (0,0,0)  | 97
AL,AL,AL   | (48,48,87) | (49,35,14) | (0,0,0)  | 98
+AL,BL,BL  | (24,24,26) | (63,20,11) | (28,0,0) | 94
Legend: QS: Queue State; TR: Transmission Rate; AL: 80%; BL: 20%; +AL: 120%.

First of all, it is essential to highlight that the objective of this analysis is to verify that the SARSA algorithm enhances bandwidth allocation for an IoT aggregator that specifically uses a Pub/Sub method to exchange data. In this context, enhanced bandwidth allocation means that the constrained bandwidth resources can be tuned and, consequently, better used to attain QoS performance parameters like delay, jitter, and packet loss.
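The AL/BL/+AL labels used in the tables can be read as a simple discretization of queue occupancy against the 50% threshold. The sketch below is illustrative (the label semantics follow the tables' legend; the exact encoding in PSIoTRL may differ):

```python
THRESHOLD = 50   # queue occupancy threshold, in percent
CAPACITY = 100   # nominal queue capacity, in percent

def queue_state(occupancy):
    """Map a queue occupancy (%) to the state labels used in the tables:
    BL (below the 50% limit), AL (above the limit), +AL (overloaded)."""
    if occupancy > CAPACITY:
        return "+AL"
    if occupancy > THRESHOLD:
        return "AL"
    return "BL"

# The legend's reference fill levels map as expected:
# 20% -> BL, 80% -> AL, 120% -> +AL
```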
The evaluated parameters are queue occupation, link occupation, and packet loss.

The algorithm's ability to bring the queue state back below a threshold limit is not a performance parameter in itself, but it demonstrates that the algorithm enhances the usage of the constrained bandwidth resources. For example, keeping the occupation of queues 1 and 2 below 50% would imply keeping the delay and jitter of the IoT data belonging to the corresponding Pub/Sub topics within their applications' requirements.

The link occupation parameter evaluated in the runs indicates whether the available bandwidth is used efficiently, and the packet loss parameter is relevant to Pub/Sub topics sensitive to data loss.

Figures 4, 5, 6 and 7 present the final buffer occupation with bandwidth allocation for buffer flushing done by the orchestrator using the fixed rule (case 1) and SARSA (case 2).
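The fixed-rule baseline (case 1) can be sketched as follows: every over-threshold queue has its flushing rate raised by a fixed step (15%, 25% or 50%), and the rates are rescaled if the link would be oversubscribed. The function and variable names are illustrative assumptions, not the paper's exact rule.

```python
LINK_CAPACITY = 100  # link bandwidth, in percent
THRESHOLD = 50       # queue occupancy threshold, in percent

def fixed_rule_rates(rates, occupancy, step):
    """Increase the rate of every over-threshold queue by `step` (fraction)
    of its current value, then rescale so the link is not oversubscribed."""
    adjusted = [r * (1 + step) if occ > THRESHOLD else r
                for r, occ in zip(rates, occupancy)]
    total = sum(adjusted)
    if total > LINK_CAPACITY:  # keep the aggregate rate within the link
        adjusted = [r * LINK_CAPACITY / total for r in adjusted]
    return adjusted
```

With this rule, an overloaded first queue gets a one-shot rate boost at the expense of the others, which matches the static behavior the SARSA agent is compared against.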
Fig. 4. Final Queue Occupation - Initial State = AL, BL, BL.

Fig. 5. Final Queue Occupation - Initial State = AL, AL, BL.

For all initial states (AL,BL,BL; AL,AL,BL; AL,AL,AL and +AL,BL,BL), bandwidth allocation with SARSA performed better than the fixed rule using increments of 15%, 25% and 50% in the buffer flushing bandwidth. The SARSA algorithm brought all buffers to the final condition BL,BL,BL.

In Figure 5, case 1 (50%) was the only fixed-rule option that succeeded in bringing one of the queue occupations (queue 3) below the threshold limit.

In Figure 6, bandwidth allocation with SARSA again performed better than the fixed-rule algorithm. However, while the two priority buffers B1 and B2 are below the limit, buffer B3 (least effort) was left above the limit due to its lower priority.

Fig. 6. Final Queue Occupation - Initial State = AL, AL, AL.

Figure 7 corresponds to the evaluation scenario in which we want to observe the behavior of the fixed-rule bandwidth allocation and SARSA when buffer B1 is already experiencing packet loss (above 100% capacity). Figure 7 shows that the SARSA algorithm is again the most efficient approach to bring all buffers below the threshold limit.

Fig. 7. Final Queue Occupation - Initial State = +AL, BL, BL.
In this experiment, the objective is to use the minimum amount of link bandwidth per buffer to transmit data. Buffer priority should be preserved, and the intended final condition is to bring all buffers to the BL state after bandwidth allocation.

Figure 8 illustrates two related aspects of link occupation. First, it shows that the SARSA allocation of bandwidth per buffer is aligned with buffer priorities. In fact, more bandwidth is allocated to B1 than to B2, and, in turn, B2 gets more bandwidth than B3. This result is fully consistent with the defined buffer priorities. The fixed-rule method allocates bandwidth to buffer B1 and splits the remaining bandwidth among the other buffers.

Fig. 8. Allocated Bandwidth per Queue and Link Occupation.
A second result illustrated in Figure 8 is link occupation. The result indicates that the SARSA algorithm is more efficient than the fixed-rule allocation method. SARSA succeeds in bringing all buffer occupation levels below the defined threshold and, at the same time, keeps link utilization below 100%.

Packet loss is reported in Tables 4, 5, 6 and 7. The allocation of chunks of bandwidth for flushing IoT data from the queues is done proactively, once the 50% buffer occupancy threshold is reached. Consequently, no packet loss occurs in three of the four cases considered in the PSIoTRL evaluation (AL,BL,BL; AL,AL,BL and AL,AL,AL): the buffer flushing bandwidth is adjusted before any packet loss can occur.

The +AL,BL,BL case starts with a buffer already overloaded and, as such, packet loss does occur, since both the fixed-rule method and the SARSA algorithm take some time to allocate bandwidth to the overloaded buffer and fix the problem. As expected, SARSA takes more time to allocate bandwidth due to the learning process inherent to the algorithm and, consequently, presents a more significant packet loss.
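The proactive trigger described above can be sketched as a simple monitoring loop on the aggregator side; the callback-based interface below is an assumption for illustration, not the PSIoTRL API.

```python
THRESHOLD = 50  # percent of queue capacity

def check_queues(occupancy, notify):
    """Call `notify(queue_index)` for every queue whose occupancy (%) has
    crossed the threshold, so the orchestrator can reallocate bandwidth
    before the queue overflows and starts dropping packets."""
    triggered = []
    for i, occ in enumerate(occupancy):
        if occ > THRESHOLD:
            notify(i)
            triggered.append(i)
    return triggered
```

Because the trigger fires at 50% occupancy rather than at overflow, bandwidth is normally readjusted with half the queue still free, which is why packet loss only appears in the +AL case, where a queue starts beyond full capacity.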
This work has presented a solution to address the problem of network resource allocation in the context of massive IoT data distribution. It presented the PSIoTRL framework, whose architecture aims to manage the allocation of limited network resources to aggregators.

As highlighted in the simulations, the SARSA agent in PSIoTRL was able to reconfigure the queues' flushing transmission rates efficiently, ensuring in 100% of the tested cases that the queues reached the BL (below threshold limit) final condition with less link bandwidth usage. The PSIoTRL demonstrates that a solution based on SARSA is an efficient reinforcement learning approach for enhancing resource (bandwidth) allocation in the context of Pub/Sub-based massive IoT data exchange.

Future work will address the computational scalability of SARSA concerning the granularity used for allocating new chunks of bandwidth to flush IoT data and adjust Pub/Sub queue occupancy. The impact of the SARSA configuration parameters (epsilon-greedy policy, learning rate, and discount factor) on the algorithm's time response, with a focus on the learning rate, will also be evaluated.

Acknowledgments
The authors thank CAPES (Coordination for the Improvement of Higher Education Personnel) for the master's scholarship support granted.
References
1. Alfakih, T., Hassan, M.M., Gumaei, A., Savaglio, C., Fortino, G.: Task Offloading and Resource Allocation for Mobile Edge Computing by Deep Reinforcement Learning Based on SARSA. IEEE Access, 54074–54084 (2020)
2. An, K., Gokhale, A., Tambe, S., Kuroda, T.: Wide Area Network-scale Discovery and Data Dissemination in Data-centric Publish/Subscribe Systems. pp. 1–2. ACM Press (2015)
3. Asghari, A., Sohrabi, M.K., Yaghmaee, F.: Task Scheduling, Resource Provisioning, and Load Balancing on Scientific Workflows Using Parallel SARSA Reinforcement Learning Agents and Genetic Algorithm. The Journal of Supercomputing (Jul 2020)
4. Bibal Benifa, J.V., Dejey, D.: RLPAS: Reinforcement Learning-based Proactive Auto-Scaler for Resource Provisioning in Cloud Environment. Mobile Networks and Applications (4), 1348–1363 (Aug 2019)
5. Boutaba, R., Salahuddin, M.A., Limam, N., Ayoubi, S., Shahriar, N., Estrada-Solano, F., Caicedo, O.M.: A Comprehensive Survey on Machine Learning for Networking: Evolution, Applications and Research Opportunities. Journal of Internet Services and Applications (1), 16 (Jun 2018)
6. Ciosek, K., Vuong, Q., Loftin, R., Hofmann, K.: Better Exploration with Optimistic Actor Critic. pp. 1787–1798 (2019)
7. Cote, D.: Using Machine Learning in Communication Networks. IEEE/OSA Journal of Optical Communications and Networking (10), D100–D109 (Oct 2018)
8. Dabbaghjamanesh, M., Moeini, A., Kavousi-Fard, A.: Reinforcement Learning-based Load Forecasting of Electric Vehicle Charging Station Using Q-Learning Technique. IEEE Transactions on Industrial Informatics, pp. 1–9 (2020)
9. Defazio, A., Graepel, T.: A Comparison of Learning Algorithms on the Arcade Learning Environment. arXiv:1410.8620 [cs] (Oct 2014), http://arxiv.org/abs/1410.8620
10. Happ, D., Wolisz, A.: Limitations of the Pub/Sub Pattern for Cloud Based IoT and Their Implications. In: Cloudification of the Internet of Things (CIoT). pp. 1–6. IEEE, Paris (Nov 2016)
11. Koo, J., Mendiratta, V.B., Rahman, M.R., Walid, A.: Deep Reinforcement Learning for Network Slicing with Heterogeneous Resource Requirements and Time Varying Traffic Dynamics. In: 2019 15th International Conference on Network and Service Management. pp. 1–5 (Oct 2019)
12. Kreutz, D., Ramos, F.M.V., Verissimo, P., Rothenberg, C.E., Azodolmolky, S., Uhlig, S.: Software-Defined Networking: A Comprehensive Survey. Proceedings of the IEEE (1), 14–76 (Dec 2014)
13. Latah, M., Toker, L.: Artificial Intelligence Enabled Software-Defined Networking: A Comprehensive Overview. IET Networks (2), 79–99 (2019)
14. Liao, X., Wu, D., Wang, Y.: Dynamic Spectrum Access Based on Improved SARSA Algorithm. IOP Conference Series: Materials Science and Engineering (7), 072015 (Mar 2020)
15. Machado, M.C., Bellemare, M.G., Talvitie, E., Veness, J., Hausknecht, M., Bowling, M.: Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. Journal of Artificial Intelligence Research, 523–562 (Mar 2018)
16. Mahadevan, S.: Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results. Machine Learning (1), 159–195 (Mar 1996)
17. Martins, J.S.B.: Towards Smart City Innovation Under the Perspective of Software-Defined Networking, Artificial Intelligence and Big Data. Revista de Tecnologia da Informação e Comunicação (2), 1–7 (Oct 2018)
18. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level Control through Deep Reinforcement Learning. Nature (7540), 529–533 (Feb 2015)
19. Moerland, T.M., Broekens, J., Jonker, C.M.: A Framework for Reinforcement Learning and Planning. PhD thesis, TU Delft (Jun 2020)
20. Mohammadi, M., Al-Fuqaha, A.: Enabling Cognitive Smart Cities Using Big Data and Machine Learning: Approaches and Challenges. IEEE Communications Magazine (2), 94–101 (Feb 2018)
21. Moraes, P.F., Martins, J.S.B.: A Pub/Sub SDN-Integrated Framework for IoT Traffic Orchestration. In: Proceedings of the 3rd International Conference on Future Networks and Distributed Systems (ICFNDS '19). pp. 1–9. Paris, France (2019)
22. Moraes, P.F., Reale, R.F., Martins, J.S.B.: A Publish/Subscribe QoS-aware Framework for Massive IoT Traffic Orchestration. In: Proceedings of the 6th International Workshop on ADVANCEs in ICT Infrastructures and Services (ADVANCE). pp. 1–14. Santiago (Jan 2018)
23. Mukherjee, M., Shu, L., Wang, D.: Survey of Fog Computing: Fundamental, Network Applications, and Research Challenges. IEEE Communications Surveys & Tutorials (3), 1826–1857 (2018)
24. Nassar, A., Yilmaz, Y.: Reinforcement Learning for Adaptive Resource Allocation in Fog RAN for IoT With Heterogeneous Latency Requirements. IEEE Access, 128014–128025 (2019)
25. Nour, B., Sharif, K., Li, F., Yang, S., Moungla, H., Wang, Y.: ICN Publisher-Subscriber Models: Challenges and Group-based Communication. IEEE Network (6), 156–163 (Nov 2019)
26. Ramani, D.: A Short Survey on Memory Based Reinforcement Learning. arXiv:1904.06736 [cs] (Apr 2019)
27. Rathore, M.M., Ahmad, A., Paul, A., Rho, S.: Urban Planning and Building Smart Cities Based on the Internet of Things Using Big Data Analytics. Computer Networks, 63–80 (Jun 2016)
28. Rendon, O.M.C., Estrada-Solano, F., Boutaba, R., Shahriar, N., Salahuddin, M., Liman, N., Ayoubi, S.: Machine Learning for Cognitive Network Management. IEEE Communications Magazine, pp. 1–9 (2018)
29. Rummery, G.A., Niranjan, M.: On-line Q-learning Using Connectionist Systems. Tech. Rep. TR 166, Cambridge University Engineering Department, Cambridge, England (1994)
30. Sampaio, L.S.R., Faustini, P.H.A., Silva, A.S., Granville, L.Z., Schaeffer-Filho, A.: Using NFV and Reinforcement Learning for Anomalies Detection and Mitigation in SDN. In: 2018 IEEE Symposium on Computers and Communications (ISCC). pp. 00432–00437 (Jun 2018)
31. Santos, J., Wauters, T., Volckaert, B., De Turck, F.: Resource Provisioning in Fog Computing: From Theory to Practice. Sensors (Basel, Switzerland) (10) (May 2019)
32. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D.: Mastering the Game of Go Without Human Knowledge. Nature (7676), 354–359 (Oct 2017)
33. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edn. (1998)
34. Wang, J.H., Lu, P.E., Chang, C.S., Lee, D.S.: A Reinforcement Learning Approach for the Multichannel Rendezvous Problem. In: 2019 IEEE Globecom Workshops (GC Wkshps). pp. 1–5 (Dec 2019)
35. Wang, Y., Zou, S.: Finite-sample Analysis of Greedy-GQ with Linear Function Approximation under Markovian Noise. Proceedings of Machine Learning Research (Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI)), 1–26 (2020)
36. Xie, J., Yu, F.R., Huang, T., Xie, R., Liu, J., Wang, C., Liu, Y.: A Survey of Machine Learning Techniques Applied to Software Defined Networking (SDN): Research Issues and Challenges. IEEE Communications Surveys & Tutorials (1), 393–430 (2019)
37. Zhang, X., Wang, Y., Lu, S., Liu, L., Xu, L., Shi, W.: OpenEI: An Open Framework for Edge Intelligence. In: 39th IEEE International Conference on Distributed Computing Systems (ICDCS). pp. 1–12. Dallas, US (Jul 2019)
38. Zhao, L., Wang, J., Liu, J., Kato, N.: Routing for Crowd Management in Smart Cities: A Deep Reinforcement Learning Perspective. IEEE Communications Magazine 57