Enhanced Pub/Sub Communications for Massive IoT Traffic with SARSA Reinforcement Learning

Carlos E. Arruda, Pedro F. Moraes, Nazim Agoulmine, and Joberto S. B. Martins

University of Paris-Saclay/IBISC Laboratory, France
Salvador University - UNIFACS, Salvador, Brazil
[email protected], [email protected], [email protected], [email protected]
Abstract.
Sensors are being extensively deployed and are expected to expand at significant rates in the coming years. They typically generate a large volume of data in internet of things (IoT) application areas like smart cities, intelligent traffic systems, smart grid, and e-health. Cloud, edge and fog computing are potential and competitive strategies for collecting, processing, and distributing IoT data. However, cloud, edge, and fog-based solutions need to tackle the distribution of a high volume of IoT data efficiently through constrained and limited-resource network infrastructures. This paper addresses the issue of conveying a massive volume of IoT data through a network with limited communications resources (bandwidth) using a cognitive communications resource allocation based on Reinforcement Learning (RL) with the SARSA algorithm. The proposed network infrastructure (PSIoTRL) uses a Publish/Subscribe architecture to access massive and highly distributed IoT data. It is demonstrated that the PSIoTRL bandwidth allocation for buffer flushing based on SARSA enhances the IoT aggregator buffer occupation and network link utilization. The PSIoTRL dynamically adapts the IoT aggregator traffic flushing according to the Pub/Sub topic's priority and network constraint requirements.
Keywords:
Publish/Subscribe · Reinforcement Learning · SARSA · IoT · Massive IoT Traffic · Resource Allocation · Network Communications.
1 Introduction

(Work supported by CAPES and Salvador University - UNIFACS.)

Sensors are being extensively deployed and are expected to expand at significant rates in the coming years in internet of things (IoT) application areas like smart city, intelligent traffic systems (ITS), connected and autonomous vehicles (CAV), video analytics in public safety (VAPS) and mobile e-health [31] [37], to mention some.

Cloud, edge and fog computing are currently the most prevailing and competing strategies used to convey IoT data between producers and consumers. They use several different alternatives to decide where to process, how to distribute, and where to store IoT data [23]. Nevertheless, cloud, edge, and fog-based solutions need to tackle the distribution of a high volume of IoT data efficiently.

IoT data distribution, whatever data handling, processing, or storing strategy is used, requires efficient network communications infrastructures. In fact, there is a research challenge in enhancing network communications and providing efficient resource allocation based on the fact that, in many applications like IoT, networks have limited and constrained resources [5].

Machine learning and reinforcement learning with distinct algorithms like Q-Learning [33], deep reinforcement learning [11], and SARSA (State-Action-Reward-State-Action) [1] are being applied to support an efficient allocation of resources in networks for application areas like 5G, cognitive radio, and mobile edge computing, to mention some [7] [28].

The Publish/Subscribe (Pub/Sub) paradigm is extensively used as an enabler for data distribution in various scenarios like information-centric networking, data-centric systems, smart grid, and other general group-based communications [25] [10] [2]. The Pub/Sub paradigm is an alternative for IoT data distribution between producers and consumers in general [10] [22].

This work addresses the issue of enhancing the utilization of a communication channel (MPLS label switched path - LSP, physical link, fiber optics slot, others) deployed in network infrastructures with limited resources (bandwidth) using the SARSA reinforcement learning algorithm. The target application area is the massive exchange of IoT data using the Pub/Sub paradigm. The framework integrating the SARSA module with the Pub/Sub message exchange deployment (PSIoT, described in [22]) is the PSIoTRL.

Differently from other proposals for network infrastructure resource allocation and optimization that consider, for instance, the quality of service for the entire set of network nodes and Pub/Sub aggregators, the SARSA-based PSIoTRL framework provides a simple data-ingress-based solution. In fact, it controls the quality of Pub/Sub topic data distribution in an aggregator by keeping Pub/Sub topic buffer occupation below a defined threshold.
This indirectly means that a certain amount of bandwidth is available for the Pub/Sub topic and, consequently, a level of quality is allocated for the communication channel. Since the Pub/Sub data transfers are dynamically requested, SARSA deploys a dynamic control of bandwidth allocation for buffer data flushing.

The contribution of this work is multi-fold. Firstly, we modeled the SARSA agent communications with a generic buffer occupation metric. The adopted metric results in a limited number of SARSA states, allowing the allocation of bandwidth that preserves Pub/Sub topic priorities without using extensive computational resources. Moreover, we demonstrate that Pub/Sub communications with SARSA can be enhanced by dynamically adjusting IoT aggregator queue occupation to Pub/Sub topic priorities and network resource availability.

The remainder of the paper is organized as follows. Section 2 presents the related work on IoT data processing deployments with cognitive communications approaches. Section 3 is a background section about reinforcement learning with SARSA. Sections 4 and 5 describe the basic PSIoT Pub/Sub framework, discuss the RL applicability for the constrained communication problem and present the PSIoTRL framework components, including the intelligent orchestrator for IoT traffic management. In Section 6, we present a proof of concept for the PSIoTRL and evaluate how it enhances IoT network resource management. Finally, Section 7 concludes with an overview of the main highlights, contributions and future work.
2 Related Work

Architectural and system-wide studies about IoT massive data processing and data flow in smart cities are presented in Al-Fuqaha [20], Rathore [27] and Martins [17]. Al-Fuqaha [20] introduces the concept of a cognitive smart city where IoT and artificial intelligence are merged in a three-level model with different requirements. In relation to the communication level, the basic approach assumes a fog cloud computing (fog-CC) communication without considering any specific IoT massive traffic requirement. In fact, the proposal assumes that aggregation with edge processing reduces the required bandwidth for fog-CC communication to a minimum. Rathore [27] explores the issue of big data analytics for smart cities. The paper proposes an edge-based aggregation and processing strategy for raw data. Processed data is forwarded through gateways to smart city applications using the Internet, assuming bandwidth reduction at edge level. Martins [17] discusses the potential benefits and impacts of how some technological enablers, like software-defined networking [12] and machine learning [36], are integrated and aim at cognitive management [28]. IoT data edge-processed aggregation, network communication and service deployments towards an efficient overall smart city solution are discussed, and the relevance of intelligent communication resource provisioning and deployment is highlighted.

Resource provisioning from the perspective of IoT services deployment for smart cities is considered by Santos [31]. The paper proposes a container-based micro-services approach, using Kubernetes, for service deployment that aims to off-load IoT processing with fog computing.
In relation to this paper's discussion, the proposal endorses fog processing offloading using an edge-computed approach and does not consider the network communication resource provisioning necessary to distribute the outcomes of the edge processing.

Edge intelligence for service deployment is discussed in Zhang [37]. The proposed solution defines a framework capable of supporting the deployment of AI algorithms like RL on common edge aggregators like a Raspberry Pi. The assumed approach is to enable the execution of AI algorithms at the edge for those applications that require near real-time edge processing, like voice recognition and on-board autonomous vehicle processing.

Machine learning in communication networks is broadly addressed by Cote [7] and Boutaba et al. in [5].

Intelligent network communication resources are considered by Zhao [38]. The work presents a smart routing solution for crowdsourcing data with mobile edge computing (MEC) in smart cities using reinforcement learning (RL). The solution defines routes, differently from ours, which optimizes the bandwidth allocated for IoT flow flushing.

In summary, this work advances on existing studies that propose service provisioning by adding an intelligent component for the allocation of communication resources between IoT data aggregators and IoT data consumers. From the architectural point of view, this work adopts an edge-based processing approach coupled with an efficient communication resource allocation for massive IoT data transfers.
3 Reinforcement Learning with SARSA

Reinforcement learning (RL) is a largely used machine learning (ML) technique in which a trial-and-error learning process is executed by an agent that acts on a system aiming to maximize its rewards. The RL algorithm is expected to learn how to reach or approach a certain objective by interacting with a system through a feedback loop [13] [33] [19].

In RL, a reward value r is received by the agent for the transitions from one state to another. The overall objective of the agent is to find a policy π which maximizes the expected future sum of rewards received, each of them subjected to a discount factor γ.

The value function in RL is a prediction of the return available from each state, as indicated in Equation 1:

V(s_t) ← E{ Σ_{k=0}^{∞} γ^k r_{t+k} }   (1)

where r_t is the reward received for the transition from state s_t to s_{t+1} and γ is the discount factor (0 ≤ γ ≤ 1). The value function V(s_t) represents the discounted sum of rewards received from step t onward. Therefore, it depends on the sequence of actions taken and on the policy adopted to take these actions.

Two well-known and somewhat similar reinforcement learning algorithms are Q-learning [33] [8] and SARSA [14] [1].

Q-learning is an off-policy RL algorithm in which the agent finds an optimal policy that maximizes the total discounted expected reward for executing a particular action at a particular state. Fundamentally, Q-learning finds the optimal policy in a step-by-step manner. Q-learning is off-policy because the next state and action are uncertain when the algorithm updates the value function.

In Q-learning, the value function, termed the Q-function, is learnt. It is a prediction of the return associated with each action a ∈ A (the set of actions). This prediction can be updated with respect to the predicted return of the next state visited (Equation 2).
Q(s_t, a_t) ← r_t + γ V(s_{t+1})   (2)

Since the overall objective is to maximize the reward received, the current estimate of V(s_{t+1}) becomes:

Q(s_t, a_t) ← r_t + γ max_{a ∈ A} Q(s_{t+1}, a)   (3)

The Q-function is shown to converge for Markovian decision processes (MDP) [26]. In Q-learning, the agent maintains a lookup table Q(S, A), where Q(s, a) represents the current estimate of the optimal action-value function. Once the Q-function has converged, the optimal policy π is to take the action in each state with the highest predicted return (greedy policy) [29].

The SARSA (State-Action-Reward-State-Action) algorithm is the reinforcement learning approach used by the PSIoTRL framework [14] [1]. SARSA is a temporal-difference (TD) on-policy algorithm that learns the Q-values based on the action performed by the current policy. The SARSA algorithm differs from Q-learning in the way it sets up the future reward: in SARSA the agent uses the action and the state at time t+1 to update the Q-value. The SARSA tuple of main elements involved in the interaction process is:

⟨ s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1} ⟩   (4)

where:

– s_t, a_t are the current state and action;
– r_{t+1} is the reward; and
– s_{t+1}, a_{t+1} are the next state and action reached using the policy (ε-greedy).

SARSA Q-values are therefore updated based on Equation 5 or 6:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]   (5)

ΔQ(s_{t−1}, a_{t−1}) = α(t) [ r_t + γ Q(s_t, a_t) − Q(s_{t−1}, a_{t−1}) ]   (6)

where:

– α is the learning rate (0 ≤ α ≤ 1); and
– γ is the discount factor (0 ≤ γ ≤ 1).

Reinforcement learning has achieved superhuman performance in games like Go [32] and also obtained a critical result bridging the divide between high-dimensional sensory inputs and actions with ATARI [18].

Video games and communication networks have imperfect but interesting operational similarities. As discussed in Cote [7], video games and networks are closed systems with a finite number of states.
In games, actions include pressing buttons and moving the joystick, the image pixels define the state, and the cost function is the game score. In networks, actions correspond mainly to network configuration parameters (bandwidth allocation, fiber allocation, others) or network routing configuration. The network state (snapshot) defines the state of the system at each iteration, and the cost function corresponds to the performance of the network, such as utilization rate, throughput, or number of dropped packets. Hence, it is reasonable to consider that SARSA may have a parallel success in allocating network communications resources (bandwidth) in the PSIoTRL.

Reinforcement learning with SARSA has proved efficient and obtained substantial success in resource allocation [1] [30], cloud computing [4] and computational offloading [3] [24]. An evaluation of SARSA and various other model-free algorithms is presented in Dafazio [9]. The evaluation considered diverse and difficult problems within a consistent environment (Arcade [15]) involving configuration aspects like the ε-greedy policy, exploration versus exploitation and state space, with SARSA outperforming algorithms like Q-learning, ETTR (Expected Time-to-Rendezvous) [34], R-learning [16], the GQ algorithm [35] and the Actor-Critic algorithm [6].

SARSA algorithm features and potential advantages within the PSIoTRL environment include:

– Being an on-policy algorithm, SARSA attempts to evaluate or improve the policy that is used to make decisions;
– SARSA avoids the state explosion issue common with the Q-learning algorithm;
– PSIoTRL has a reduced state space and, consequently, SARSA behaves effectively by exploring all states at least once; and
– Key issues that RL algorithms experience, like bad initial performance and long training time, are minimized in the PSIoTRL environment since SARSA addresses the allocation of network bandwidth locally with a reduced state space.
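The off-policy versus on-policy distinction between Equations 3 and 5 can be made concrete with a short tabular sketch (illustrative only, not the PSIoTRL implementation; state and action names are hypothetical):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.2, gamma=0.8):
    """Off-policy (Equation 3): bootstrap on the greedy action in s_next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.2, gamma=0.8):
    """On-policy (Equation 5): bootstrap on the action actually taken in s_next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# Q-values start at zero (tabula rasa), keyed by (state, action) pairs
Q = defaultdict(float)
```

The only difference is the bootstrap term: Q-learning takes the maximum over all actions in the next state, while SARSA uses the Q-value of the action the current (ε-greedy) policy actually selected.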
4 The PSIoT Framework

The objective of this work is to enhance the Publish/Subscribe (Pub/Sub) communications used in the context of massive IoT data transfers (Figure 1). The Pub/Sub communications are supported by the PSIoT framework described in [22]. The PSIoT framework aims to efficiently handle network resources and IoT QoS requirements over the network between the IoT devices and consumer IoT applications [22]. Consumers are applications executed on servers located beyond the backbone network, in a cloud computing infrastructure accessed by the network, or in any other scheme that makes use of the managed network infrastructure for communications (Figure 1).
Fig. 1. PSIoT Framework Functional View [22].
The PSIoT framework was developed to manage IoT traffic in a network and to provide QoS based on IoT data characteristics and network-wide specifications, e.g. total network use, real-time network traffic, routing and bandwidth constraints.

IoT data generated from sensors and devices can be aggregated and processed in the cloud or at the network edge, where each of these points is considered an Aggregator or Producer in the PSIoT framework, whereas each client that receives IoT data, both end-user applications and even other aggregation nodes, is denoted a Consumer.

The PSIoT framework was designed to orchestrate massive IoT traffic and allocate network resources between producers and consumers. Besides, the PSIoT framework can schedule the data flow, based on Quality of Service (QoS) requirements, and allocate bandwidth on a backbone with limited resources. For this, the PSIoT was modeled to operate with four main components: producers, consumers, a backbone with limited resources, and the orchestrator (Figure 1) [22].

Aggregators are elements that gather the data obtained by connected sensors to send them (at an opportune moment) to their consumers (clients) using a Pub/Sub-style communication channel. In the opposite direction, aggregators deliver data received from producers to actuators.
Fig. 2. PSIoT Pub/Sub Message Model [22].
This exchange of information is performed through the Pub/Sub architecture, whose asynchronous characteristic, use of topics and use of a broker make it attractive for IoT applications [10].

PSIoT implements QoS when forwarding the data gathered by the aggregator to output buffers, according to the following characteristics [22]:

– b0 (priority) - high transmission rates and low delay application requirements, for health care and data from critical industrial sensors, as examples;
– b1 (sensitive) - commercial data and security sensors; and
– b2 (insensitive) - best effort.

The Pub/Sub message model adopted by the PSIoT is illustrated in Figure 2. In summary, the message flow is as follows [22]:

1. A consumer subscribes to a particular topic and specifies the requested QoS level with a particular aggregator (a);
2. The aggregator sends to the orchestrator relevant metadata such as the number of subscribers and their associated QoS levels and buffer allocation (b);
3. The orchestrator notifies the aggregator with an amount of bandwidth that can be consumed by each level of QoS (c); and
4. The aggregator publishes the data to the consumer according to bandwidth and data availability in the buffer (d).

The PSIoT uses fixed-rule scheduling for IoT data consumption.
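The four-step flow above can be sketched as a minimal toy broker (all class and method names here are illustrative, not the PSIoT framework's actual API; the even-split allocation rule is a placeholder for the orchestrator's real logic):

```python
class Orchestrator:
    """Toy stand-in for step (c): hands back per-topic bandwidth shares."""
    def allocate(self, metadata):
        # placeholder rule: split 100% of bandwidth evenly across active topics
        n = max(len(metadata), 1)
        return {topic: 100 // n for topic in metadata}

class Aggregator:
    """Toy sketch of the aggregator side of the Pub/Sub message flow."""
    def __init__(self, orchestrator):
        self.orchestrator = orchestrator
        self.subscribers = {}                      # topic -> [(consumer, qos)]
        self.buffers = {"b0": [], "b1": [], "b2": []}
        self.rates = {}

    def subscribe(self, consumer, topic, qos):     # step (a)
        self.subscribers.setdefault(topic, []).append((consumer, qos))
        self.send_metadata()

    def send_metadata(self):                       # step (b): subscriber counts
        meta = {t: len(subs) for t, subs in self.subscribers.items()}
        self.rates = self.orchestrator.allocate(meta)   # step (c)

    def publish(self, topic, data):                # step (d)
        self.buffers[topic].append(data)
        for consumer, _qos in self.subscribers.get(topic, []):
            consumer(data)
```

A subscription triggers a metadata update and a fresh allocation, after which published items are buffered and delivered to the topic's consumers.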
5 The PSIoTRL Framework

The PSIoTRL architectural components are illustrated in Figure 3. Its main proposed modules are:

– The SARSA agent;
– The PSIoTRL orchestrator module;
– Aggregators (at least one for each cluster where IoT data will be consumed);
– The network infrastructure (backbone); and
– Producers and consumers.

The SARSA agent is integrated in the PSIoT framework [22] and, on demand, allocates bandwidth for the aggregator queues.

The PSIoTRL orchestrator module is described in [22] and [21]. It basically has knowledge of each aggregator's Pub/Sub subscriptions, as well as the QoS levels required for each topic subscription. This allows it to control the transmission (emptying) rates of IoT data from each buffer within the aggregator in the network. The SARSA agent computes the allocated bandwidth and the orchestrator deploys it.

The aggregators are fog-like nodes connected to IoT devices that act as aggregators for their data and also act as Pub/Sub producers regarding IoT applications subscribing to topics and consuming the corresponding IoT data [21] [10].

The network interconnects data producers (aggregators) and consumers and has limited bandwidth resources for massive IoT data transfers.

Producers and consumers exchange IoT data. They use Pub/Sub [10] to communicate and exchange massive amounts of IoT data through a network with constrained bandwidth resources [22] [21].
The PSIoTRL framework uses the Pub/Sub paradigm to produce and consume data. Consumers request IoT data on the aggregator's queues, and the aggregator empties its buffers according to consumer requests. The aggregator queues are deployed with the following configuration:

– Three IoT data queues (buffers) are configured per aggregator (one Pub/Sub topic per queue): B1, B2 and B3;
– Initial buffer transmission rates are T1, T2 and T3; and
– Buffer priorities are p1, p2 and p3, with p1 > p2 > p3.

For this evaluation we consider one aggregator Ag located at network node n_z that delivers IoT data to a set of consumers C_i, with i = 1, 2, ..., n, located at network node n_y. Between nodes n_z and n_y there is a communication resource (MPLS LSP, physical link, fiber slot, other) with a limited bandwidth BW_zy. The communication resource interconnecting the Pub/Sub producer and consumers is then constrained as follows.

By initial buffer transmission rate, we mean the configured initial transmission rate to empty buffers without any buffer over-utilization.
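A minimal sketch of this setup (names are hypothetical; the 35/25/15% initial rates and the 50% threshold are taken from the proof-of-concept section later in the paper, and the capacity check mirrors the link constraint of Equation 7):

```python
from dataclasses import dataclass

@dataclass
class QueueConfig:
    name: str
    priority: int        # 1 = highest, mirroring p1 > p2 > p3
    init_rate: float     # initial emptying rate, in % of the channel bandwidth

# Three topic queues per aggregator, with the proof-of-concept initial rates
queues = [QueueConfig("B1", 1, 35.0),
          QueueConfig("B2", 2, 25.0),
          QueueConfig("B3", 3, 15.0)]

def state_of(occupations, threshold=0.5):
    """Map buffer occupations (fractions) to the agent's BL/AL state tuple."""
    return tuple("AL" if occ > threshold else "BL" for occ in occupations)

def within_link_capacity(rates, bw_zy=100.0):
    """Equation 7: the sum of emptying rates must fit the channel bandwidth."""
    return sum(rates) <= bw_zy
```

With the initial rates summing to 75% of the link, the constraint holds and leaves headroom for the agent's T+ adjustments.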
Fig. 3. PSIoTRL Components and Communication - Orchestrator, Aggregator, Buffers and Communication Channel.

BW_zy ≥ Σ_{j=1}^{n} TB_j   (7)

where TB_j is the current transmission rate used to empty buffer j.

The aggregator monitors buffer occupation, and occupation above a defined threshold limit is signaled to the orchestrator (Figure 3).

The PSIoTRL deployment and operation require SARSA agent modeling for the buffer bandwidth allocation problem and the definition of basic SARSA algorithm configuration parameters. The primary goal of the SARSA agent is to arbitrate the output transmission rates among the aggregator buffers in order to efficiently distribute IoT data to the consumers (Figure 3).

The PSIoTRL SARSA agent is modeled with a finite set of states for the aggregator output queues, a finite set of configuration actions to be executed on each queue, and a set of reward values for each state/action transition pair. In the SARSA agent, the system state S represents the overall aggregator output queue status in terms of occupation at a given time t.
Table 1. Buffer (Queue) States.

State (B1,B2,B3)   Description
BL, BL, BL         all queues below the threshold limit
BL, BL, AL         queues 1 and 2 below the threshold limit
BL, AL, BL         queues 1 and 3 below the threshold limit
BL, AL, AL         queue 1 below the threshold limit
AL, BL, BL         queues 2 and 3 below the threshold limit
AL, BL, AL         queue 2 below the threshold limit
AL, AL, BL         queue 3 below the threshold limit
AL, AL, AL         no queues below the threshold limit

† Legend: BL - Below Limit; AL - Above Limit.
Table 2.
SARSA agent actions.

Buffer (Queue)   Actions
B1               T+, T−, N
B2               T+, T−, N
B3               T+, T−, N

T+: increase transmission rate; T−: decrease transmission rate; N: null action.

The SARSA agent states are indicated in Table 1. Each queue has two states, either below (BL) or above (AL) a preconfigured threshold. Having two states for each queue provides, in this context, sufficient information for bandwidth allocation and contributes to avoiding the common state explosion issue existing in many RL deployments.

A discrete and small number of states for the queues can reasonably be assumed because the state mainly works as a threshold to indicate that queues require attention to enforce priorities or maximize throughput. The first case eventually happens when queued Pub/Sub messages require more capacity to be transferred than the actually allocated one. The second case corresponds to the situation where some queues have plenty of data to transmit, while others have unused capacity.

The utilization of the buffer threshold results in a somewhat agnostic Pub/Sub implementation. In fact, the threshold-based cognitive actions allow:

– Tuning the PSIoTRL to have a faster or proactive reaction to adjust the queues' bandwidth; and
– Tuning the Pub/Sub dynamic reconfiguration capability according to IoT data transfer sensitivity.

The PSIoTRL SARSA agent actions are illustrated in Table 2. Three actions are defined for each buffer: increase capacity (transmission rate), reduce capacity (transmission rate) and do nothing. Therefore, twenty-seven combined actions are possible for each agent state. The amount of bandwidth increased or reduced per queue is a SARSA configuration parameter. A set of rewards is also defined for each executed state/action pair.

ε-Greedy Policy

The SARSA agent uses an ε-greedy policy since it must exploit as much as possible the acquired knowledge but also explore new possibilities for enhancing the allocation of bandwidth for queue communications.
The SARSA agent takes new actions at random, with the ε-greedy policy defining the probability of choosing random actions. No value change or on-the-fly fine-tuning of ε was considered in this solution.

The ε-greedy policy adequately matches the inherent needs of a Pub/Sub IoT message delivery framework. Random demands from consumers consume Pub/Sub data with different priorities. These demands generate a random volume of data to be transferred from aggregator queues to consumers. The per-queue bandwidth distribution must be dynamically computed among IoT data queues considering the existing constraint (bandwidth limitation) of the available communication channel.

The SARSA agent manages the aggregator output queue transmission rates according to IoT data demands, IoT data priority, and communication resource constraints (bandwidth limitation). The bandwidth allocation algorithm pseudo-code is presented in Algorithm 1.
Algorithm 1 PSIoTRL-SARSA Bandwidth Allocation Algorithm Pseudo-code

1:  procedure PSIoTRL-SARSA(Q(S_t, A_t, r, S_{t+1}, A_{t+1}))   ▷ SARSA states and reward
2:    for each pair (S_t, A_t) do                               ▷ Initialization
3:      Initialize Q-values: Q(S_t, A_t) = 0                    ▷ Tabula rasa approach
4:    repeat                                                    ▷ Forever
5:      Get current PSIoTRL buffers state S_t                   ▷ Buffer state is the trigger event
6:      repeat                                                  ▷ Finishes upon terminal condition
7:        Choose action A_t using the ε-greedy policy           ▷ Exploration and exploitation
8:        Execute action A_t on the PSIoTRL system              ▷ T+, T− or do nothing on queues
9:        Get immediate reward r_t
10:       Collect new state S_{t+1}
11:       Choose new action A_{t+1} using the ε-greedy policy
12:       Update Q-value Q(S_t, A_t) using SARSA Equation 5     ▷ SARSA algorithm
13:     until terminal condition reached
14:   until forever
The algorithm procedure is as follows:

1. The algorithm starts by initializing the Q-values in the table, Q(s, a);
2. The current state s is captured;
3. An action is chosen for the current state using the ε-greedy policy;
4. The agent triggers the action and observes the immediate reward r, as well as the new state s′ reached;
5. The Q-value for the pair (s, a) is updated using the observed reward and the Q-value of the next state-action pair; and
6. Finally, for the current cycle, the state is set to the new state, and the process repeats until the end condition is reached.

The end condition for the PSIoTRL bandwidth allocation process is the following:

– The bandwidth constraint limit is reached (Equation 7); or
– The current buffer transmission rates match the corresponding priorities (T1 > T2 > T3); or
– The maximum number of attempts is reached.
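The procedure above can be sketched compactly in tabular form (an illustrative sketch, not the PSIoTRL code; `env` is a hypothetical stand-in for the PSIoTRL system, and the reward logic is left to the caller):

```python
import random
from collections import defaultdict

# 8 buffer states (BL/AL per queue) and 27 combined actions (T+/T-/N per queue)
STATES = [(b1, b2, b3) for b1 in ("BL", "AL")
          for b2 in ("BL", "AL") for b3 in ("BL", "AL")]
ACTIONS = [(a1, a2, a3) for a1 in "+-N" for a2 in "+-N" for a3 in "+-N"]

def run_sarsa(env, episodes=400, alpha=0.2, gamma=0.8, eps=0.02):
    """Sketch of Algorithm 1. `env` must expose reset() -> state and
    step(action) -> (reward, next_state, done), where `done` captures the
    terminal conditions (constraint reached, priorities satisfied, max tries)."""
    Q = defaultdict(float)                               # tabula rasa Q-table
    def choose(s):                                       # eps-greedy policy
        if random.random() < eps:
            return random.choice(ACTIONS)                # explore
        return max(ACTIONS, key=lambda a: Q[(s, a)])     # exploit
    for _ in range(episodes):
        s = env.reset()                                  # triggering buffer state
        a = choose(s)
        done = False
        while not done:
            r, s2, done = env.step(a)                    # apply T+/T-/N on queues
            a2 = choose(s2)
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])  # Eq. 5
            s, a = s2, a2
    return Q
```

The small state space (8 states × 27 actions = 216 Q-entries) is what makes the paper's argument about avoiding state explosion plausible: the table is tiny, so every state can be visited repeatedly within a few hundred attempts.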
6 Proof of Concept

The purpose of this proof of concept is to validate that the agent behavior enhances the allocation of bandwidth when one or more buffers exceed the defined buffer occupation limit (50% - AL condition). When this event occurs, the aggregator detects the problem and sends an alarm containing metadata to the orchestrator. After that, the SARSA agent allocates a new percentage of bandwidth among the aggregator buffers.

Four initial buffer conditions were used:

– One queue exceeds the occupation threshold (AL, BL, BL);
– Two queues exceed the occupation threshold (AL, AL, BL);
– Three queues exceed the occupation threshold (AL, AL, AL); and
– One queue exceeds the link bandwidth capacity and the two other queues exceed the occupation threshold (+AL, AL, AL).

The considered performance parameters are the following:

– Queue (buffer) occupation;
– Link occupation; and
– Packet loss.

The allocation of bandwidth to the aggregator queues is evaluated in two scenarios. In scenario 1, a predefined simple algorithm is used to allocate the bandwidth. In scenario 2, the SARSA agent does the same task. Table 3 presents a summary of the evaluation scenarios.
Table 3.
Summary of the Proof of Concept Scenarios.

AGGREGATOR                               ORCHESTRATOR
Queue States (B1,B2,B3)  T1i  T2i  T3i  Scenario 1            Scenario 2 (SARSA)
AL,BL,BL                 35%  25%  15%  T1 = T1i * factor;    SARSA
AL,AL,BL                                T2 = (100 - T1)/2;    Module
AL,AL,AL                                T3 = (100 - T1)/2;
+AL,BL,BL

– +AL: 120% buffer occupation; dropping packets; factor: 1.15 or 1.25 or 1.50.

Scenario 1: Fixed-Rule Bandwidth Allocation

In this scenario, the orchestrator has a simple fixed rule to increase or decrease the bandwidth for buffer flushing. The T1 bandwidth is adjusted by a fixed factor (15%, 25% or 50%) and the remaining queues each get half of the remaining bandwidth, as follows:

– T1 = T1i * factor, where T1i is the T1 initial state;
– T2 = (100 − T1)/2; and
– T3 = (100 − T1)/2.

The initial buffer transmission (emptying) rates are respectively 35%, 25% and 15% for T1, T2 and T3. Tables 4, 5 and 6 present the results obtained with the fixed-rule bandwidth allocation method. In these tables, the packet loss for queue Bi is Pi.
Table 4. Bandwidth allocation with fixed rule - 15% bandwidth adjustment.

Initial QS (B1i,B2i,B3i)  Final QS (B1,B2,B3)  Final TR (T1,T2,T3)  Packet Loss (P1,P2,P3)  Link Occupation
AL,BL,BL                  (72,16,0)            (40,30,30)           (0,0,0)                 100
AL,AL,BL                  (72,64,0)            (40,30,30)           (0,0,0)                 100
AL,AL,AL                  (72,64,0)            (40,30,30)           (0,0,0)                 100
+AL,BL,BL                 (108,16,0)           (40,30,30)           (8,0,0)                 100

– QS: Queue State; TR: Transmission Rate; AL: 80%; BL: 20%; +AL: 120%.
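The scenario 1 rule can be sketched in a few lines (illustrative; the rounding here is simplified, so the 50% case differs slightly from the paper's (53,24,23) in Table 6, where the split is not exactly even):

```python
def fixed_rule(t1_init=35.0, factor=1.15):
    """Scenario 1: boost T1 by a fixed factor and split the remaining
    bandwidth evenly between T2 and T3 (all values in % of the link)."""
    t1 = t1_init * factor
    t2 = (100.0 - t1) / 2.0
    t3 = (100.0 - t1) / 2.0
    return round(t1), round(t2), round(t3)
```

For example, `fixed_rule(35, 1.15)` yields (40, 30, 30) and `fixed_rule(35, 1.25)` yields (44, 28, 28), matching the final transmission rates in Tables 4 and 5. Note that the rule always saturates the link (the three rates sum to 100%), which is why link occupation is 100 in every fixed-rule run.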
Table 5. Bandwidth allocation with fixed rule - 25% bandwidth adjustment.

Initial QS (B1i,B2i,B3i)  Final QS (B1,B2,B3)  Final TR (T1,T2,T3)  Packet Loss (P1,P2,P3)  Link Occupation
AL,BL,BL                  (56,18,2)            (44,28,28)           (0,0,0)                 100
AL,AL,BL                  (56,72,2)            (44,28,28)           (0,0,0)                 100
AL,AL,AL                  (56,72,8)            (44,28,28)           (0,0,0)                 100
+AL,BL,BL                 (84,18,2)            (44,28,28)           (0,0,0)                 100

– QS: Queue State; TR: Transmission Rate; AL: 80%; BL: 20%; +AL: 120%.

Table 6. Bandwidth allocation with fixed rule - 50% bandwidth adjustment.

Initial QS (B1i,B2i,B3i)  Final QS (B1,B2,B3)  Final TR (T1,T2,T3)  Packet Loss (P1,P2,P3)  Link Occupation
AL,BL,BL                  (40,20,10)           (53,24,23)           (0,0,0)                 100
AL,AL,BL                  (40,80,10)           (53,24,23)           (0,0,0)                 100
AL,AL,AL                  (40,80,40)           (53,24,23)           (0,0,0)                 100
+AL,BL,BL                 (60,20,10)           (53,24,23)           (0,0,0)                 100

– Legend: QS: Queue State; TR: Transmission Rate; AL: 80%; BL: 20%; +AL: 120%.

Scenario 2: SARSA-Based Bandwidth Allocation

The parameters and initial conditions used in this scenario of the proof of concept are the following:

– SARSA agent configuration parameters:
  • ε-greedy policy ε of 2%;
  • Learning rate α of 20%; and
  • Discount factor γ of 80%;
– Pub/Sub queue threshold limit (triggers agent action) = 50%;
– Agent actions: bandwidth increased or reduced by 10%; and
– Maximum number of attempts = 400.

The SARSA configuration parameters are fixed for all evaluations in scenario 2. Typical values were used; the impact of varying these parameters was not considered in this evaluation and is left for future work.

In terms of the PSIoTRL operation and SARSA evaluation, the agent action in the orchestrator is triggered by the aggregator anytime one of its queue occupations goes beyond the configured threshold limit (Figure 3).

The initial conditions with SARSA are the same as those applied in scenario 1, as indicated in Table 3. Ten runs were executed for each of the four initial states and the obtained results are summarized in Table 7 with a confidence interval of 5%.
Table 7. Queue Occupation, Packet Loss and Link Occupation with SARSA.

Initial QS (B1i,B2i,B3i) | Final QS (B1,B2,B3) | Final TR (T1,T2,T3) | Packet Loss (P1,P2,P3) | Link Occupation (%)
AL,BL,BL   | (48,12,22) | (49,35,13) | (0,0,0)  | 97
AL,AL,BL   | (47,47,23) | (50,35,12) | (0,0,0)  | 97
AL,AL,AL   | (48,48,87) | (49,35,14) | (0,0,0)  | 98
+AL,BL,BL  | (24,24,26) | (63,20,11) | (28,0,0) | 94
Legend: QS: Queue State; TR: Transmission Rate; AL: 80%; BL: 20%; +AL: 120%.

First of all, it is essential to highlight that the objective of this analysis is to verify that the SARSA algorithm enhances bandwidth allocation for an IoT aggregator that specifically uses a Pub/Sub method to exchange data. In this context, enhanced bandwidth allocation means that the constrained bandwidth resources can be tuned and, consequently, better used to attain QoS performance parameters like delay, jitter, and packet loss.
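The AL/BL/+AL labels used in the tables can be read as a simple discretization of queue occupancy against the 50% threshold. The sketch below is illustrative (the label semantics follow the tables' legend; the exact encoding in PSIoTRL may differ):

```python
THRESHOLD = 50   # queue occupancy threshold, in percent
CAPACITY = 100   # nominal queue capacity, in percent

def queue_state(occupancy):
    """Map a queue occupancy (%) to the state labels used in the tables:
    BL (below the 50% limit), AL (above the limit), +AL (overloaded)."""
    if occupancy > CAPACITY:
        return "+AL"
    if occupancy > THRESHOLD:
        return "AL"
    return "BL"

# The legend's reference fill levels map as expected:
# 20% -> BL, 80% -> AL, 120% -> +AL
```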
The evaluated parameters are queue occupation, link occupation, and packet loss.

The algorithm's ability to bring the queue state back below a threshold limit is not a performance parameter in itself, but it demonstrates that the algorithm enhances the usage of the constrained bandwidth resources. For example, keeping the occupation of queues 1 and 2 below 50% would imply keeping the delay and jitter of the IoT data belonging to the corresponding Pub/Sub topics within their applications' requirements.

The link occupation parameter evaluated in the runs indicates whether the available bandwidth is used efficiently, and the packet loss parameter is relevant to Pub/Sub topics sensitive to data loss.

Figures 4, 5, 6 and 7 present the final buffer occupation with bandwidth allocation for buffer flushing done by the orchestrator using the fixed rule (case 1) and SARSA (case 2).
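The fixed-rule baseline (case 1) can be sketched as follows: every over-threshold queue has its flushing rate raised by a fixed step (15%, 25% or 50%), and the rates are rescaled if the link would be oversubscribed. The function and variable names are illustrative assumptions, not the paper's exact rule.

```python
LINK_CAPACITY = 100  # link bandwidth, in percent
THRESHOLD = 50       # queue occupancy threshold, in percent

def fixed_rule_rates(rates, occupancy, step):
    """Increase the rate of every over-threshold queue by `step` (fraction)
    of its current value, then rescale so the link is not oversubscribed."""
    adjusted = [r * (1 + step) if occ > THRESHOLD else r
                for r, occ in zip(rates, occupancy)]
    total = sum(adjusted)
    if total > LINK_CAPACITY:  # keep the aggregate rate within the link
        adjusted = [r * LINK_CAPACITY / total for r in adjusted]
    return adjusted
```

With this rule, an overloaded first queue gets a one-shot rate boost at the expense of the others, which matches the static behavior the SARSA agent is compared against.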
Fig. 4. Final Queue Occupation - Initial State = AL, BL, BL.

Fig. 5. Final Queue Occupation - Initial State = AL, AL, BL.

For all initial states (AL,BL,BL; AL,AL,BL; AL,AL,AL and +AL,BL,BL), bandwidth allocation with SARSA performed better than the fixed rule using increments of 15%, 25% and 50% in the buffer flushing bandwidth. The SARSA algorithm brought all buffers to the final condition BL,BL,BL.

In Figure 5, case 1 (50%) was the only fixed-rule option that succeeded in bringing one of the queue occupations (queue 3) below the threshold limit.

In Figure 6, bandwidth allocation with SARSA again performed better than the fixed-rule algorithm. However, while the two priority buffers B1 and B2 are below the limit, buffer B3 (least effort) was left above the limit due to its lower priority.

Fig. 6. Final Queue Occupation - Initial State = AL, AL, AL.

Figure 7 corresponds to the evaluation scenario in which we want to observe the behavior of the fixed-rule bandwidth allocation and SARSA when buffer B1 is already experiencing packet loss (above 100% capacity). Figure 7 shows that the SARSA algorithm is again the most efficient approach to bring all buffers below the threshold limit.

Fig. 7. Final Queue Occupation - Initial State = +AL, BL, BL.
In this experiment, the objective is to use the minimum amount of link bandwidth per buffer to transmit data. Buffer priority should be preserved, and the intended final condition is to bring all buffers to the BL state after bandwidth allocation.

Figure 8 illustrates two related aspects of link occupation. First, it shows that the SARSA allocation of bandwidth per buffer is aligned with buffer priorities. In fact, more bandwidth is allocated to B1 than to B2, and, in turn, B2 gets more bandwidth than B3. This result is fully consistent with the defined buffer priorities. The fixed-rule method allocates bandwidth to buffer B1 and splits the remaining bandwidth among the other buffers.

Fig. 8. Allocated Bandwidth per Queue and Link Occupation.
A second result illustrated in Figure 8 is link occupation. The result indicates that the SARSA algorithm is more efficient than the fixed-rule allocation method. SARSA succeeds in bringing all buffer occupation levels below the defined threshold and, at the same time, keeps link utilization below 100%.

Packet loss is reported in Tables 4, 5, 6 and 7. The allocation of chunks of bandwidth for flushing IoT data from the queues is done proactively, once the 50% buffer occupancy threshold is reached. Consequently, no packet loss occurs in three of the four cases considered in the PSIoTRL evaluation (AL,BL,BL; AL,AL,BL and AL,AL,AL): the buffer flushing bandwidth is adjusted before any packet loss can occur.

The +AL,BL,BL case starts with a buffer already overloaded and, as such, packet loss does occur, since both the fixed-rule method and the SARSA algorithm take some time to allocate bandwidth to the overloaded buffer and fix the problem. As expected, SARSA takes more time to allocate bandwidth due to the learning process inherent to the algorithm and, consequently, presents a more significant packet loss.
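The proactive trigger described above can be sketched as a simple monitoring loop on the aggregator side; the callback-based interface below is an assumption for illustration, not the PSIoTRL API.

```python
THRESHOLD = 50  # percent of queue capacity

def check_queues(occupancy, notify):
    """Call `notify(queue_index)` for every queue whose occupancy (%) has
    crossed the threshold, so the orchestrator can reallocate bandwidth
    before the queue overflows and starts dropping packets."""
    triggered = []
    for i, occ in enumerate(occupancy):
        if occ > THRESHOLD:
            notify(i)
            triggered.append(i)
    return triggered
```

Because the trigger fires at 50% occupancy rather than at overflow, bandwidth is normally readjusted with half the queue still free, which is why packet loss only appears in the +AL case, where a queue starts beyond full capacity.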
This work has presented a solution to address the problem of network resource allocation in the context of massive IoT data distribution. It presented the PSIoTRL framework, whose architecture aims to manage the allocation of limited network resources to aggregators.

As highlighted in the simulations, the SARSA agent in PSIoTRL was able to reconfigure the queues' flushing transmission rates efficiently, ensuring in 100% of the tested cases that the queues reached the BL (below threshold limit) final condition with less link bandwidth usage. The PSIoTRL demonstrates that a solution based on SARSA is an efficient reinforcement learning approach for enhancing resource (bandwidth) allocation in the context of Pub/Sub-based massive IoT data exchange.

Future work will address the computational scalability of SARSA concerning the granularity used for allocating new chunks of bandwidth to flush IoT data and adjust Pub/Sub queue occupancy. The impact of the SARSA configuration parameters (epsilon-greedy policy, learning rate, and discount factor) on the algorithm's time response, with a focus on the learning rate, will also be evaluated.

Acknowledgments
The authors thank CAPES (Coordination for the Improvement of Higher Education Personnel) for the master's scholarship support granted.
References
1. Alfakih, T., Hassan, M.M., Gumaei, A., Savaglio, C., Fortino, G.: Task Offloading and Resource Allocation for Mobile Edge Computing by Deep Reinforcement Learning Based on SARSA. IEEE Access, 54074–54084 (2020)
2. An, K., Gokhale, A., Tambe, S., Kuroda, T.: Wide Area Network-scale Discovery and Data Dissemination in Data-centric Publish/Subscribe Systems. pp. 1–2. ACM Press (2015)
3. Asghari, A., Sohrabi, M.K., Yaghmaee, F.: Task Scheduling, Resource Provisioning, and Load Balancing on Scientific Workflows Using Parallel SARSA Reinforcement Learning Agents and Genetic Algorithm. The Journal of Supercomputing (Jul 2020)
4. Bibal Benifa, J.V., Dejey, D.: RLPAS: Reinforcement Learning-based Proactive Auto-Scaler for Resource Provisioning in Cloud Environment. Mobile Networks and Applications (4), 1348–1363 (Aug 2019)
5. Boutaba, R., Salahuddin, M.A., Limam, N., Ayoubi, S., Shahriar, N., Estrada-Solano, F., Caicedo, O.M.: A Comprehensive Survey on Machine Learning for Networking: Evolution, Applications and Research Opportunities. Journal of Internet Services and Applications (1), 16 (Jun 2018)
6. Ciosek, K., Vuong, Q., Loftin, R., Hofmann, K.: Better Exploration with Optimistic Actor Critic. pp. 1787–1798 (2019)
7. Cote, D.: Using Machine Learning in Communication Networks. IEEE/OSA Journal of Optical Communications and Networking (10), D100–D109 (Oct 2018)
8. Dabbaghjamanesh, M., Moeini, A., Kavousi-Fard, A.: Reinforcement Learning-based Load Forecasting of Electric Vehicle Charging Station Using Q-Learning Technique. IEEE Transactions on Industrial Informatics, pp. 1–9 (2020)
9. Defazio, A., Graepel, T.: A Comparison of Learning Algorithms on the Arcade Learning Environment. arXiv:1410.8620 [cs] (Oct 2014), http://arxiv.org/abs/1410.8620
10. Happ, D., Wolisz, A.: Limitations of the Pub/Sub Pattern for Cloud Based IoT and Their Implications. In: Cloudification of the Internet of Things (CIoT). pp. 1–6. IEEE, Paris (Nov 2016)
11. Koo, J., Mendiratta, V.B., Rahman, M.R., Walid, A.: Deep Reinforcement Learning for Network Slicing with Heterogeneous Resource Requirements and Time Varying Traffic Dynamics. In: 2019 15th International Conference on Network and Service Management. pp. 1–5 (Oct 2019)
12. Kreutz, D., Ramos, F.M.V., Verissimo, P., Rothenberg, C.E., Azodolmolky, S., Uhlig, S.: Software-Defined Networking: A Comprehensive Survey. Proceedings of the IEEE (1), 14–76 (Dec 2014)
13. Latah, M., Toker, L.: Artificial Intelligence Enabled Software-Defined Networking: A Comprehensive Overview. IET Networks (2), 79–99 (2019)
14. Liao, X., Wu, D., Wang, Y.: Dynamic Spectrum Access Based on Improved SARSA Algorithm. IOP Conference Series: Materials Science and Engineering (7), 072015 (Mar 2020)
15. Machado, M.C., Bellemare, M.G., Talvitie, E., Veness, J., Hausknecht, M., Bowling, M.: Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. Journal of Artificial Intelligence Research, 523–562 (Mar 2018)
16. Mahadevan, S.: Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results. Machine Learning (1), 159–195 (Mar 1996)
17. Martins, J.S.B.: Towards Smart City Innovation Under the Perspective of Software-Defined Networking, Artificial Intelligence and Big Data. Revista de Tecnologia da Informação e Comunicação (2), 1–7 (Oct 2018)
18. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level Control through Deep Reinforcement Learning. Nature (7540), 529–533 (Feb 2015)
19. Moerland, T.M., Broekens, J., Jonker, C.M.: A Framework for Reinforcement Learning and Planning. PhD thesis, TU Delft (Jun 2020)
20. Mohammadi, M., Al-Fuqaha, A.: Enabling Cognitive Smart Cities Using Big Data and Machine Learning: Approaches and Challenges. IEEE Communications Magazine (2), 94–101 (Feb 2018)
21. Moraes, P.F., Martins, J.S.B.: A Pub/Sub SDN-Integrated Framework for IoT Traffic Orchestration. In: Proceedings of the 3rd International Conference on Future Networks and Distributed Systems (ICFNDS '19). pp. 1–9. Paris, France (2019)
22. Moraes, P.F., Reale, R.F., Martins, J.S.B.: A Publish/Subscribe QoS-aware Framework for Massive IoT Traffic Orchestration. In: Proceedings of the 6th International Workshop on ADVANCEs in ICT Infrastructures and Services (ADVANCE). pp. 1–14. Santiago (Jan 2018)
23. Mukherjee, M., Shu, L., Wang, D.: Survey of Fog Computing: Fundamental, Network Applications, and Research Challenges. IEEE Communications Surveys & Tutorials (3), 1826–1857 (2018)
24. Nassar, A., Yilmaz, Y.: Reinforcement Learning for Adaptive Resource Allocation in Fog RAN for IoT With Heterogeneous Latency Requirements. IEEE Access, 128014–128025 (2019)
25. Nour, B., Sharif, K., Li, F., Yang, S., Moungla, H., Wang, Y.: ICN Publisher-Subscriber Models: Challenges and Group-based Communication. IEEE Network (6), 156–163 (Nov 2019)
26. Ramani, D.: A Short Survey on Memory Based Reinforcement Learning. arXiv:1904.06736 [cs] (Apr 2019)
27. Rathore, M.M., Ahmad, A., Paul, A., Rho, S.: Urban Planning and Building Smart Cities Based on the Internet of Things Using Big Data Analytics. Computer Networks, 63–80 (Jun 2016)
28. Rendon, O.M.C., Estrada-Solano, F., Boutaba, R., Shahriar, N., Salahuddin, M., Liman, N., Ayoubi, S.: Machine Learning for Cognitive Network Management. IEEE Communications Magazine, pp. 1–9 (2018)
29. Rummery, G.A., Niranjan, M.: On-line Q-learning Using Connectionist Systems. Tech. Rep. TR 166, Cambridge University Engineering Department, Cambridge, England (1994)
30. Sampaio, L.S.R., Faustini, P.H.A., Silva, A.S., Granville, L.Z., Schaeffer-Filho, A.: Using NFV and Reinforcement Learning for Anomalies Detection and Mitigation in SDN. In: 2018 IEEE Symposium on Computers and Communications (ISCC). pp. 00432–00437 (Jun 2018)
31. Santos, J., Wauters, T., Volckaert, B., De Turck, F.: Resource Provisioning in Fog Computing: From Theory to Practice. Sensors (Basel, Switzerland) (10) (May 2019)
32. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D.: Mastering the Game of Go Without Human Knowledge. Nature (7676), 354–359 (Oct 2017)
33. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edn. (1998)
34. Wang, J.H., Lu, P.E., Chang, C.S., Lee, D.S.: A Reinforcement Learning Approach for the Multichannel Rendezvous Problem. In: 2019 IEEE Globecom Workshops (GC Wkshps). pp. 1–5 (Dec 2019)
35. Wang, Y., Zou, S.: Finite-sample Analysis of Greedy-GQ with Linear Function Approximation under Markovian Noise. Proceedings of Machine Learning Research (Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI)), 1–26 (2020)
36. Xie, J., Yu, F.R., Huang, T., Xie, R., Liu, J., Wang, C., Liu, Y.: A Survey of Machine Learning Techniques Applied to Software Defined Networking (SDN): Research Issues and Challenges. IEEE Communications Surveys & Tutorials (1), 393–430 (2019)
37. Zhang, X., Wang, Y., Lu, S., Liu, L., Xu, L., Shi, W.: OpenEI: An Open Framework for Edge Intelligence. In: 39th IEEE International Conference on Distributed Computing Systems (ICDCS). pp. 1–12. Dallas, US (Jul 2019)
38. Zhao, L., Wang, J., Liu, J., Kato, N.: Routing for Crowd Management in Smart Cities: A Deep Reinforcement Learning Perspective. IEEE Communications Magazine 57