Robust and Scalable Routing with Multi-Agent Deep Reinforcement Learning for MANETs
Saeed Kaviani, Bo Ryu, Ejaz Ahmed, Kevin A. Larson, Anh Le, Alex Yahja, Jae H. Kim
EpiSys Science, Inc.: Saeed Kaviani, Bo Ryu, Ejaz Ahmed, Anh Le, Alex Yahja ({saeed, bo.ryu, ejaz, anhle, alex}@episci.com)
Boeing Research and Technology: Kevin A. Larson, Jae H. Kim ({kevin.a.larson, jae.h.kim}@boeing.com)

Abstract—We address the packet routing problem in highly dynamic mobile ad-hoc networks (MANETs). In the network routing problem, each router chooses the next-hop(s) of each packet to deliver the packet to its destination with lower delay, higher reliability, and less overhead in the network. In this paper, we present a novel framework and routing policies, deepCQ+ routing, using multi-agent deep reinforcement learning (MADRL), designed to be robust and scalable for MANETs. Unlike other deep reinforcement learning (DRL)-based routing solutions in the literature, our approach enables training over a limited range of network parameters and conditions while achieving realistic routing policies for a much wider range of conditions, including a variable number of nodes, different data flows with varying data rates and source/destination pairs, diverse mobility levels, and other dynamic network topologies. We demonstrate the scalability, robustness, and performance enhancements obtained by deepCQ+ routing over a recently proposed model-free, non-neural robust and reliable routing technique (CQ+ routing). deepCQ+ routing outperforms non-DRL-based CQ+ routing in terms of overhead while maintaining the same goodput rate. Under a wide range of network sizes and mobility conditions, we have observed a reduction in normalized overhead of 10-15%, indicating that the deepCQ+ routing policy delivers more packets end-to-end with less overhead. To the best of our knowledge, this is the first successful application of MADRL to the MANET routing problem that simultaneously achieves scalability and robustness under dynamic conditions while outperforming its non-neural counterpart. More importantly, we provide a framework to design scalable and robust routing policies for any desired network performance metric of interest.
I. INTRODUCTION
Routing has been one of the most challenging problems in communication and computer networks, especially in distributed and autonomous wireless networks without coordination. In packet routing protocols, each router chooses, for a given data packet, another node as the next-hop (i.e. unicasting) following its routing policy. In distributed routing algorithms, there is typically no coordination and no designated communication between nodes in the network to share information other than the data and acknowledgment (ACK) packets.

The routing problem is even more difficult when the network is highly dynamic, heterogeneous, and/or contains variable data traffic and network size. This is of particular interest for mobile ad-hoc networks (MANETs) in the tactical wireless networking domain, where nodes are mobile, mobility and traffic patterns are unpredictable, and the network topology and size change constantly due to battlefield environmental conditions such as terrain and jammers. Many traditional MANET routing protocols are unreliable in these environments and require constant re-computation of the end-to-end routes upon network changes. Hence, they are not robust and suffer from periodic losses in throughput due to routing table computations. For reliability of packet delivery and rapid exploration in highly dynamic networks, broadcasting (i.e. transmission of a packet to all neighbors) has been added to the possible routing decisions [1], [2], [3], [4].

To bring robustness to routing protocols in highly dynamic MANETs, many solutions have considered adapting routing protocols to variations in the network conditions; e.g., the fish-eye state routing protocol (FSR) [5] uses adaptive link-state update rates, and the adaptive distance vector (ADV) routing protocol [6] uses a threshold-based adjustment of the routing update rates based on the network dynamics. While these protocols outperform traditional routing schemes with lower overhead and topology information sharing, they are not responsive enough in highly dynamic MANETs.

The seminal work in [7] proposed Q-routing, which uses a reinforcement learning (RL) module (i.e. Q-learning [8]) to route packets and minimize delivery time. Each node decides the next-hop based on locally acquired and maintained statistics, i.e. Q-values. Each Q-value represents the quality of each next-hop (or route). For the particular target of minimizing the end-to-end delay in the routing algorithm, the Q-value represents the estimated delay of each path. The Q-table is maintained and updated for each pair of destination and next-hop consistently through ACK messages. A node that transmits a data packet to a neighbor may receive an ACK message, which contains a value for updating the Q-table. The Q-routing protocol sends the data packet to the next-hop (unicast) with the best Q-value. Q-routing is efficient in static and slowly changing networks. In dynamic networks, Q-values can get outdated and the algorithm requires some adjustments.

The Q-routing framework was improved for dynamic networks in [9] with the addition of a confidence level table (i.e. C-values), in the so-called CQ-routing protocol. C-values are incremented when a Q-value is updated and decremented when it gets outdated. CQ-routing becomes inefficient in highly dynamic networks, as it is based only on unicasting to a single node. Unicasting makes network exploration (Q-value updates) slow when the network changes rapidly.
Unicasting to single nodes is also exhaustive when the neighboring nodes have changed or little information is available. Since we consider routing in networks without any designated communication between nodes, the CQ-routing protocol provides a solid foundation to disseminate information through ACKs only and to monitor the state of the routes systematically. Therefore, CQ-routing is the state-monitoring protocol that we use as the foundation of this work. Specifically, a recent robust routing protocol for highly dynamic and tactical MANETs was developed in [10], which is based on CQ-routing but adds a broadcast decision.

The adaptive routing (AR) [11] and smart robust routing (SRR) [10] algorithms are efforts to dynamically switch between unicast and broadcast to improve robustness in tactical networks while reducing the high overhead caused by flooding. The routing technique proposed in [10] uses the CQ-routing protocol (i.e. C- and Q-values) but extends it by adding a broadcast procedure for high reliability, robustness, and the rapid network exploration needed in tactical and highly dynamic MANETs. Therefore, it is justified to refer to it as the CQ+ routing protocol. Although CQ+ routing uses a simple but efficient switching policy to choose between unicast and broadcast, it has several areas that can be improved. For example, the CQ+ routing policy is threshold-based and its decisions depend on a single network parameter (the best path confidence level). It does not have the perspective of the entire network and can settle on a merely locally optimal solution. CQ+ routing also does not account for the rate of change of network parameters or possible congestion on the forward paths. A routing policy that predicts the network dynamics and changes while a packet is traveling or queued along a path, especially in large networks, can improve routing performance significantly. Hence, we can explore more advanced decision-making policies that make these considerations and improve performance.

Adaptive decision making in routing protocols can be interpreted in the reinforcement learning (RL) framework. This was initiated by the Q-routing protocol, which was based on the concept of Q-learning in RL [7]. Following the Q-routing approach, many other techniques and algorithms from the RL community have been applied to the packet routing and scheduling problem in communication networks [12], [13], [14], [15], [16], [17], [18], [19]. [12] uses MADRL to design independent deep routing policies for each agent based on an off-policy deep Q-learning RL algorithm. Deep Q-learning approaches are based on so-called value estimation, i.e. estimation of the return (expected reward) of the actions at a certain state. The problem with deep Q-learning and value estimation policies is that they do not scale easily, as the expected reward depends on many network parameters and conditions that are not known prior to the decisions. Moreover, in MADRL-based approaches in the literature like [12], training independent policies for each agent raises a scalability issue: it is unclear how to use policies trained for a specific number of agents (network size) when the network is extended or shrunk to a different size. A similar deep Q-learning RL-based approach is used in [13], which considers clusters in the network and accounts for inter-cluster routing performance. It also assumes a feedback link is available from the source node to the cluster lead agents.
Although it is claimed that their approach is expected to be scalable to larger networks and dynamics, it is not clear how it would scale and perform if only trained on smaller networks. The deep neural network (DNN)-based design of a routing policy is challenging for MANETs with high dynamics, as it is a multi-agent environment. Optimization of the policy for one agent depends on the policies and actions of the other agents and therefore suffers from the non-stationarity issue. This is particularly difficult in dynamic networks, where many network parameters and the topology are rapidly changing.

To the best of our knowledge, there has been no prior work on a scalable and robust routing policy design framework using MADRL in MANETs. To provide robustness and reliability, we use an approach similar to CQ+ routing, where CQ-routing is combined with adaptive flooding. In this paper, we use MADRL and advanced RL algorithms and techniques to train a robust and reliable routing policy that can be applied to any network size, traffic, and dynamics. As it is closely related to the CQ+ routing protocol, we refer to it as deep robust routing for dynamic networks (deepCQ+ routing).

II. ROBUST ROUTING FRAMEWORK
We consider a robust routing protocol that monitors the quality and confidence level of the routes (next-hops) via ACK messages, i.e. the CQ-routing protocol [9]. We use this protocol because it does not add any designated communication link for coordination between nodes in the network, and it is robust in dynamic networks thanks to the addition of the confidence level C. We specifically use the extended version of CQ-routing with adaptive broadcasting, which brings more reliability (higher delivery rates) and rapid network exploration (for highly dynamic networks), as introduced in [10]; we refer to it as CQ+ routing. We summarize CQ+ routing in Algorithm 1, which shows how it chooses between broadcast and unicast (and the next-hop) adaptively.

A. CQ+ Routing Protocol
The smart robust routing (SRR) algorithm proposed in the CQ+ routing work [10] makes its routing decisions using the network parameters C and H, primarily introduced in the seminal CQ-routing [9]. Each node i has an H-factor, h(i, j, d) (i.e. for the route i → j → d), which represents an estimate of the least number of hops between node i and destination d passing through potential next-hop j. (We have renamed the original Q-factor of CQ-routing to H-factor to prevent confusion with the Q in Q-networks and/or Q-learning in the RL context.) To monitor the dynamics of the network, each node i also has a confidence level, or C-value, c(i, j, d), which represents the confidence in the hop estimate h(i, j, d). This C-value is increased and corrected with every packet transmission success (receiving the ACK). With every packet transmission, the C-value is degraded by a decay factor.

The C- and H-factors are updated through the values c_ack and h_ack, which are propagated by the acknowledgment (ACK) packets from the receiving node (e.g. the next-hop) to the transmitting node. These ACK values are computed at the next-hop node j as

h_ack = 1 + h(j, k̂, d),   (1)
c_ack = c(j, k̂, d),   (2)

where k̂ is node j's best path (next-hop) estimate toward destination d, found by

k̂ = arg min_k h(j, k, d)(1 − c(j, k, d)).   (3)

The updates of the C- and H-levels are given by

h_{t+1}(i, j, d) = (1 − α) h_t(i, j, d) + α h_ack,   (4)

c_{t+1}(i, j, d) = (1 − λ) c_t(i, j, d) on failure, and (1 − λ) c_t(i, j, d) + λ c_ack otherwise,   (5)

where 0 ≤ α ≤ 1 is a discount factor for the new observation with the adaptive value

α = max(c_ack, 1 − c_t(i, j, d)),   (6)

and λ is the decay factor for the new observation c_ack (correspondingly, 1 − λ is the decay factor for the old observation). In other words, the C- and H-values are exponential moving averages of c_ack and h_ack, respectively. If a transmission fails and there is no ACK to update the C- and H-levels, then the H-level cannot be updated; however, we degrade the C-level as if c_ack = 0, reflecting the path failure (the failure case in (5)). If a packet is received at the destination d itself, then c_ack and h_ack are set to 1, indicating full confidence that we are one hop away from the destination. The SRR algorithm for CQ+ routing is summarized in Algorithm 1; more details are available in [10]. The SRR algorithm includes (i) reception of an ACK and, consequently, updating the C- and H-levels; (ii) reception of non-ACK packets, checking for duplication and loops, and pushing into the queue; and (iii) transmission of data packets from the queue by routing them according to the policy in use. We simply refer to this decision policy as the CQ+ routing policy.
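To make these update rules concrete, the following Python sketch shows how a node might maintain its C- and H-tables and apply (1)-(6). This is a minimal illustration, not the authors' implementation; the class name, the table initialization, and the decay value are assumptions.

import numpy as np

class CQTables:
    """Per-node C- and H-tables indexed as [next_hop, destination].

    A minimal sketch of the update rules (1)-(6); the class name,
    initialization, and decay value are illustrative assumptions.
    """

    def __init__(self, num_nodes: int, decay: float = 0.05):
        self.c = np.zeros((num_nodes, num_nodes))  # c(i, j, d): confidence levels
        self.h = np.full((num_nodes, num_nodes), float(num_nodes))  # h(i, j, d): hop estimates
        self.lam = decay  # lambda in (5)

    def best_next_hop(self, d: int) -> int:
        # Eqs. (3)/(7): minimize expected hops weighted by uncertainty.
        return int(np.argmin(self.h[:, d] * (1.0 - self.c[:, d])))

    def make_ack(self, d: int):
        # Values piggybacked on the ACK returned upstream, eqs. (1)-(2).
        k = self.best_next_hop(d)
        return 1.0 + self.h[k, d], self.c[k, d]

    def on_ack(self, j: int, d: int, h_ack: float, c_ack: float):
        # Eqs. (4)-(6): exponential moving averages with adaptive alpha.
        alpha = max(c_ack, 1.0 - self.c[j, d])
        self.h[j, d] = (1.0 - alpha) * self.h[j, d] + alpha * h_ack
        self.c[j, d] = (1.0 - self.lam) * self.c[j, d] + self.lam * c_ack

    def on_failure(self, j: int, d: int):
        # No ACK received: only the confidence decays, i.e. (5) with c_ack = 0.
        self.c[j, d] = (1.0 - self.lam) * self.c[j, d]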
1) CQ+ routing Policy
The CQ+ routing decision policy chooses the next-hop by minimizing the information uncertainty and the expected number of hops:

j* = arg min_j h(i, j, d)(1 − c(i, j, d)).   (7)

The next-hop j* is the node to which the CQ+ routing policy unicasts the packet if it unicasts to a single node. However, the CQ+ routing policy also enables broadcasting to minimize the next-hop information uncertainty. The uncertainty of the next-hop is measured by 1 − c(i, j*, d). CQ+ routing therefore decides to broadcast when the uncertainty about the available information is high, so as to explore the network more and make a reliable transmission by flooding the neighbors. The CQ+ routing policy is probabilistic and assigns the probability of broadcast as

P_BC = ε + (1 − c(i, j*, d))(1 − ε) = 1 − c(i, j*, d) ε̃,   (8)

where the small value ε is the minimum probability of broadcast (defined for exploration purposes) and, correspondingly, ε̃ = 1 − ε is the maximum probability of unicast.

Algorithm 1: CQ+ routing algorithm [10]

Receive incoming packet at node i:
  if packet is ACK then
    Update c and h from (4) and (5)
  else
    if packet traversed a loop then
      Drop packet, do not return ACK
    end
    if packet is already in queue then
      Drop packet
      Find best next-hop j* from (7)
      Compute c_ack and h_ack from (2) and (1) using j*
      Return ACK
    end
    if packet is not duplicate then
      Add packet to the queue
    end
  end
if queue is not empty then
  Pick up packet from queue
  ROUTING DECISION POLICY:
    P_BC ← ε + (1 − ε)(1 − c(i, j*, d))
    Choose broadcast with probability P_BC, or unicast to j* with probability 1 − P_BC
  if broadcast then
    Forward packet to all
  end
  if unicast then
    Forward packet to j*
  end
end

In this paper, we use the CQ+ routing protocol, meaning that the reception procedure of our algorithm is the same as described in Algorithm 1, but our improvements are applied to the transmission procedure and the CQ+ routing policy. In this work, we preserve the protocol and concentrate on routing policy optimization.
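Below is a minimal Python sketch of the transmission-side decision in Algorithm 1, reusing the CQTables sketch above; the exploration floor value is an assumption.

import random

def cq_plus_decision(tables, d: int, eps: float = 0.05):
    """Transmission-side decision of Algorithm 1, eqs. (7)-(8).

    Returns ("broadcast", None) or ("unicast", j_star); tables is the
    CQTables sketch above and eps is the assumed exploration floor.
    """
    j_star = tables.best_next_hop(d)                        # eq. (7)
    p_bc = eps + (1.0 - eps) * (1.0 - tables.c[j_star, d])  # eq. (8)
    if random.random() < p_bc:
        return "broadcast", None
    return "unicast", j_star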
III. DEEP REINFORCEMENT LEARNING FOR CQ+ ROUTING (DEEPCQ+)
A. Deep Reinforcement Learning Background
Our robust routing problem fits as a decentralized partially-observable Markov decision process (Dec-POMDP), which is used for decision-making problems in a team of cooperative agents [20]. A Dec-POMDP models and optimizes the behavior of the agents while considering the environment's and other agents' uncertainties. A POMDP is defined by a set of states S describing the possible configurations of the agent(s), a set of actions A, and a set of observations O. (In the Dec-POMDP, these sets are defined for each agent separately.) Actions are selected using a stochastic policy π_θ : O × A → [0, 1] (parameterized by θ), which results in the next state defined by the environment's state transition function T : S × A → S. As a result of this transition, a reward is obtained by the agent(s), described as a function of the state and action, i.e. r : S × A → R. Each agent receives an observation related to the state as o : S → O. Below, we give a brief overview of the DRL algorithms and techniques used in this work to design the deepCQ+ routing policies.
1) Value Optimization and Deep Q-learning
To find the optimal policy, Q-learning is a widely used model-free algorithm that estimates the action-value function Q^π(s, a) for policy π. The action-value function (or Q-function) is the expected sum of (discounted) rewards perceived at state s when taking action a:

Q^π(s, a) := E[ Σ_{t=0}^{T} γ^t r_t | s_0 = s, a_0 = a ],   (9)

where γ is the discount factor and T is the time horizon. The action-value function can be written (and calculated) recursively as Q^π(s, a) = E_{s'}[ r(s, a) + γ E_{a' ∼ π}[ Q^π(s', a') ] ].

The recent deep learning paradigm enables RL algorithms to approximate the Q-function using a deep neural network, i.e. Q(s, a) ≈ Q(s, a; θ), where θ is the set of neural network parameters. A popular method for doing this is known as the deep Q-network (DQN) [21]. DQN learns the optimal action-value function Q* by minimizing the loss

L(θ) = E_{(s,a,r,s')}[ ( Q*(s, a; θ) − ( r + γ max_{a'} Q̄*(s', a') ) )^2 ],   (10)

where Q̄ is a target Q-function whose parameters are periodically (or gradually) updated with the most recent parameters θ to further stabilize the learning. DQN also benefits from a large experience replay buffer B of experience tuples (s, a, r, s'). Finally, once the optimal parameters θ* are found, the optimal (deterministic) policy is

π*(s) = arg max_{a ∈ A} Q(s, a; θ*).   (11)

(The name of the Q-routing algorithm is inspired by the Q-learning algorithm in RL, but its Q-value only represents the expected number of hops of different paths as an action-value function. Also, throughout the paper, "neural network parameters" refers to the coefficients or weights of the neural network, while "network parameters" refers to communication network parameters such as c, h, etc.)
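As a concrete illustration of the loss in (10), here is a minimal PyTorch sketch; the function and batch layout are assumptions for illustration, not part of the deepCQ+ implementation (which ultimately uses PPO, below).

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma: float = 0.99):
    """Sketch of the DQN loss (10) on a replay-buffer mini-batch.

    `batch` is assumed to be (s, a, r, s2, done) tensors; q_net and
    target_net map a batch of states to per-action Q-values.
    """
    s, a, r, s2, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    with torch.no_grad():  # target network parameters are held fixed
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    return F.mse_loss(q_sa, target)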
2) Policy Optimizations
Policy gradient methods are another popular choice, based on optimizing the policy directly. Let π denote a stochastic policy which assigns a probability π(a_t | s_t) to an action a_t given a state s_t. In policy optimization methods, the main idea is to optimize the parameters θ of the policy to maximize the expected discounted return objective

J(θ) = E_{s_0, a_0, s_1, ...}[ Σ_{t=0}^{T} γ^t r_t ] = E_τ[ R_τ ],   (12)

where the experience sequence (or path) τ is denoted {s_0, a_0, s_1, ...} with s_0 ∼ p_0(s_0), a_t ∼ π(a_t | s_t), and s_{t+1} ∼ T(s_{t+1} | s_t, a_t). The discounted return of the sequence τ is R_τ, and p_0 is the distribution of the initial state s_0.

The action-value function Q^π, the value function V^π, and the advantage function A^π are defined as

Q^π(s_t, a_t) = E_{s_{t+1}, a_{t+1}, ...}[ Σ_{l=0}^{T} γ^l r_{t+l} ],   (13)
V^π(s_t) = E_{a_t, s_{t+1}, ...}[ Σ_{l=0}^{T} γ^l r_{t+l} ],   (14)
A^π(s, a) = Q^π(s, a) − V^π(s),   (15)

where a_t ∼ π(a_t | s_t) and s_{t+1} ∼ T(s_{t+1} | s_t, a_t). The objective can also be written as

J(θ) = E_τ[ R_τ ] = Σ_τ P(τ; θ) R_τ.   (16)

A class of popular policy optimization methods, policy gradient (PG), optimizes the policy parameters θ by ascending in the gradient direction ∇_θ J(θ) [22]. We therefore review how this gradient direction can be expressed, simplified, and empirically calculated. Using the fact that ∇_x log y = ∇_x y / y, we can rewrite the gradient ∇_θ J as

∇_θ J(θ) = Σ_τ ∇_θ P(τ; θ) R_τ = Σ_τ P(τ; θ) ∇_θ log P(τ; θ) R_τ = E_τ[ ∇_θ log P(τ; θ) R_τ ].   (17)

In other words, the gradient direction ∇_θ J(θ) that maximizes the expected return can be interpreted as the expectation of the gradient of the log-likelihood of the paths, ∇_θ log P(τ; θ), weighted by the path return R_τ. Ascending in this direction increases the probability of paths with positive returns and decreases the probability of paths with negative returns. One approach to empirically computing this gradient direction is to average this term over all paths observed during the search:

∇_θ J(θ) ≈ (1/M) Σ_{m=1}^{M} ∇_θ log P(τ^(m); θ) R_{τ^(m)}.   (18)

This still requires finding ∇_θ log P(τ^(m); θ) empirically. The probability of the experience path τ can be computed by decomposing the path into states and actions:

P(τ; θ) = Π_{t=0}^{∞} T(s_{t+1} | s_t, a_t) · π(a_t | s_t; θ),   (19)
log P(τ; θ) = Σ_{t=0}^{∞} log T(s_{t+1} | s_t, a_t) + Σ_{t=0}^{∞} log π(a_t | s_t; θ).   (20)

Using (19) and the fact that the environment's state transition T(s_{t+1} | s_t, a_t) is independent of θ, the gradient of the log-likelihood of path τ, i.e. ∇_θ log P(τ; θ), decomposes into the sum of log-likelihoods of the stochastic policy π(a_t | s_t):

∇_θ log P(τ; θ) = Σ_{t=0}^{∞} ∇_θ log π(a_t | s_t; θ).   (21)

Hence, the gradient can be written as

∇_θ J(θ) = E_τ[ Σ_t ∇_θ log π(a_t | s_t; θ) R_τ ].   (22)

From the gradient estimator of the vanilla policy gradient algorithm [23], this expression can be simplified to

∇_θ J(θ) = E_t[ ∇_θ log π_θ(a_t | s_t) A_t ].   (23)

Let η_t^θ denote the probability ratio

η_t^θ = π_θ(a_t | o_t) / π_{θ_old}(a_t | o_t),   (24)

so that η_t^{θ_old} = 1. A popular state-of-the-art policy gradient algorithm is proximal policy optimization (PPO) [24], [25], which uses the clipped objective

L^clip(θ) = E_t[ min( η_t^θ Â_t, clip(η_t^θ, 1 − ε, 1 + ε) Â_t ) ]   (25)

as the optimization problem of interest. PPO is in fact a family of policy optimization methods that uses multiple epochs of stochastic gradient ascent to perform each policy update. PPO inherits the stability and reliability of trust-region methods [25] but is implemented in a much simpler way. We use the PPO algorithm for the optimization of the routing policy in the following sections.
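For concreteness, here is a minimal PyTorch sketch of the clipped surrogate objective (25); the clip range of 0.2 is a common default and an assumption here. In practice, our experiments rely on the PPO implementation in RLlib (Section IV-A) rather than a hand-written loss.

import torch

def ppo_clip_loss(logp_new, logp_old, adv, clip_eps: float = 0.2):
    """Sketch of the clipped surrogate objective (25).

    logp_new / logp_old are log pi(a_t | o_t) under the current and
    behavior policies; adv is the advantage estimate A_hat_t.
    """
    ratio = torch.exp(logp_new - logp_old)  # eta_t, eq. (24)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Negated because optimizers minimize while (25) is maximized.
    return -torch.min(ratio * adv, clipped * adv).mean()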
3) Centralized Training, Decentralized Execution
In multi-agent reinforcement learning, each agent can have its own policy while sharing the environment with other agents. In a communication network setup, partial observability and/or communication constraints necessitate learning decentralized policies, which condition only on the local action-observation history of each agent. Decentralized policies also naturally avoid the joint action space, which grows exponentially with the number of agents and makes traditional single-agent RL methods impractical.

Fortunately, decentralized policies can be learned in a centralized fashion, especially in a simulated or laboratory setting. Centralized training has access to the hidden state information of other agents and removes inter-agent communication constraints. The paradigm of centralized training with decentralized execution has attracted attention in the RL community [26], [27], [20]. However, many challenges surrounding how to best exploit centralized training remain open.
4) Parameter Sharing
In the context of centralized training with decentralized execution, a common strategy is to share the policy parameters between agents that are homogeneous [28], [29], [30] (parameter sharing). Note that in heterogeneous communication networks, we can always categorize nodes as homogeneous with some individual parameters, and those parameters can be given as inputs to the policy with shared parameters.

The multi-agent environment for the routing problem and our proposed framework are summarized in Figure 1, where the centralized training, decentralized execution, and parameter sharing are illustrated.
Fig. 1. Multi-agent network routing environment with shared policy parameters between agents. The centralized training and decentralized execution are also shown. Each agent i uses the shared policy π_θ individually to find its own action a_i(t) based on its own observations o_i(t). The multi-agent environment operates based on the joint actions decided and taken individually, transitions to the next state s_{t+1}, and produces the corresponding rewards.

B. Our Proposed DRL Framework and Approach
We consider large, dynamic wireless communication networks with a relatively wide range of sizes (e.g. 10 ≤ N ≤ 30). Each node i holds a queue of data packets intended to be delivered to various destinations. The nodes can have arbitrary traffic data flows. The nodes are constantly moving at various random velocities, and the dynamic level of each node is different and random. The mobility of the nodes follows a well-established MANET model, the Gauss-Markov model, discussed in Section IV-A2. At each time t, each node i picks a packet from its data queue (according to a pre-defined packet scheduling algorithm). We use simple first-in-first-out (FIFO) packet scheduling, but the routing decisions and policy design are independent of this scheduling method. Each node holds a table of network parameters, namely the C- and H-levels discussed in Section II-A. These network parameters form (N − 1) × (N − 1) matrices for each node (i.e. c(i, j, d), h(i, j, d)), defined for each destination node id and each potential next-hop (the N − 1 other nodes). These matrices are initialized when an episode starts. Hence, the network parameter matrices are often sparse, as each node may have a limited number of neighbors, and many network parameters therefore stay at their initial reset values.

During execution, each node i has already loaded a pre-trained policy π_θ, which is the same as that of the other nodes. Therefore, there is no selection or classification of the nodes to identify which policy should be used for each.

Each episode contains a specific number of packet-duration time slots L, with various data rates. Higher data rates may introduce large queue backlogs. We end the episode when the length of the episode exceeds a maximum traffic length L_max and all the nodes have empty data queues. We stop any new incoming data traffic when the episode length exceeds L_max; the episode length is therefore slightly larger than the maximum traffic length. The goodput is computed as the number of packets delivered (non-duplicate) over the total number of packets injected as input traffic.
1) Scalability
The main distinction of our work from similar DRL-based routing algorithms is that our solution is scalable in terms of (i) communication network size, (ii) network dynamics, mobility levels, and topology, and (iii) different data flows, data rates, sources, and destinations. In other words, the proposed DNN policy is designed and trained such that it is not over-fitted to specific configurations and system parameters. It works for a wide range of communication network sizes, various dynamic levels, different sources and destinations, and data flows.
2) DNN Policy
To satisfy the scalability requirement, the proposed DNN policy needs a network architecture that accommodates a variable number of agents. The routing decisions generally depend on the network parameters (i.e. C- and H-values) of the neighbors. For example, if a subset of agents has zero C-levels, there is no need to consider them in our routing decisions. Hence, we pre-process the available network parameters (the state characteristics) to obtain a fixed number of inputs fed into the DNN policy: we select the K best neighbors out of the N total agents in the network for each node i and use their network parameters as the current state of that specific node. For simplicity of notation, we drop the current node i and destination d from the C- and H-levels and index by next-hop only (i.e. c_t(i, j, d) → c_t(i, j)). We also order the next-hop indices in ascending order of h_j (1 − c_j). The observation of an agent i at time t, obtained from its state, is then given by

o_t(i) = [ c_t(i), h_t(i) ],   (26)

where

c_t(i) = [ c_t(i, i_1), c_t(i, i_2), ..., c_t(i, i_K) ],
h_t(i) = [ h_t(i, i_1), h_t(i, i_2), ..., h_t(i, i_K) ],   (27)

such that the neighboring nodes i_1, ..., i_K of i are ordered as

h_t(i, i_1)(1 − c_t(i, i_1)) ≤ · · · ≤ h_t(i, i_K)(1 − c_t(i, i_K)).   (28)

Training an individual policy for each node is impractical due to the scalability constraints of the design. For example, if we have trained a policy for one node within a specific network size, it is not clear how the trained policies should be assigned when the network grows and new nodes are added. We therefore train one policy for all nodes, π(a | s_t; θ). Training a universal policy has a scalability advantage: any node added to the network, under any network dynamics, topology, or data flow, uses the same policy. This is consistent with the original CQ+ routing policy, which also offers the same policy for all nodes, although that policy is not DNN-based and makes only probabilistic decisions (see (8)). Table I compares the robust routing protocols discussed so far.

TABLE I
COMPARISON OF ROBUST ROUTING PROTOCOLS IN DYNAMIC NETWORKS

Routing Protocol             C/Q-values   Broadcast   MADRL
CQ-routing [9]                   ✓            ×          ×
CQ+ routing [10]                 ✓            ✓          ×
deepCQ+ routing (this work)      ✓            ✓          ✓

We use a fully-connected DNN, but the established framework extends to other DNN architectures. We add the change between the current and previous observations, as well as the previous effective action taken by the agent, to the observations at node i fed into the DNN policy as input features. Including the previous observations in the input features helps capture the temporal rate of change of the network parameters. Alternatively, we could use recurrent neural network architectures, but we have found that training such networks typically takes longer. The input features to the DNN policy at node i are given by

o_t(i) = [ c_t(i), h_t(i), ∆c_t(i), ∆h_t(i), a_{t−1}(i) ],   (29)

where ∆c_t(i) = c_t(i) − c_{t−1}(i) and ∆h_t(i) = h_t(i) − h_{t−1}(i).
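A short Python sketch of this pre-processing is given below, under the assumption that the per-destination C- and H-values are available as dense arrays; the function name is illustrative.

import numpy as np

def build_features(c_now, h_now, c_prev, h_prev, a_prev, K: int = 4):
    """Fixed-size input features of eqs. (26)-(29) for one node and
    one destination; c_now[j] and h_now[j] are the C- and H-values
    toward the destination via next-hop j.
    """
    order = np.argsort(h_now * (1.0 - c_now))[:K]  # K best next-hops, eq. (28)
    return np.concatenate([
        c_now[order], h_now[order],        # eqs. (26)-(27)
        c_now[order] - c_prev[order],      # delta c, eq. (29)
        h_now[order] - h_prev[order],      # delta h, eq. (29)
        [float(a_prev)],                   # previous action
    ])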
C. CQ+ Routing as a DRL-based Policy

In this section, we reformulate the RL problem so that its optimal policy coincides with the CQ+ routing policy. This particularly helps in defining a proper rewarding system for our CQ+ routing policy improvements and extensions.

First, we consider a stochastic routing policy π(a | s_t) where the action probabilities are dictated only by the policy. This is similar to the CQ+ routing policy, which computes a probability of broadcast. We give the following rewards to the unicast action (a = 0) and the broadcast action (a = 1):

r_t(s_t, a_t) = { 1 − c_t(i, i_1) ε̃,  a_t = 1 (broadcast);  c_t(i, i_1) ε̃,  a_t = 0 (unicast) },   (Reward 1) (30)

where ε̃ = 1 − ε. Note that c_t(i, i_1) closely tracks the success/freshness probability of the path via the best next-hop. The total expected (discounted) reward to be maximized is then

R = E_{s_t ∼ p_π, a_t ∼ π}[ Σ_{t=0}^{T} γ^t r_t ]
  = E_{s_t ∼ p_π, a_t}[ Σ_{t=0}^{T} γ^t ( π(a_t = 0 | s_t) c_t(i, i_1) ε̃ + π(a_t = 1 | s_t)(1 − c_t(i, i_1) ε̃) ) ].   (31)

With a zero horizon (i.e. T = 0), maximizing the expected reward (31) leads to a policy whose probability of broadcast is

π(s_t, a_t = 1) = P_BC = { 1,  c_t(i, i_1) ε̃ < 1/2;  0,  c_t(i, i_1) ε̃ > 1/2 }.   (32)

This is the deterministic version of the CQ+ routing policy previously given in (8).

D. DeepCQ+ Routing
An immediate improvement of the CQ+ routing policy using DRL is to maximize the expected return R in (31) with γ > 0 and some large horizon T, where the immediate reward r_t(s_t, a_t) is given by (30). We can use efficient RL algorithms such as PPO to find a DNN policy. We refer to this DNN-based version of CQ+ routing as deepCQ+ routing with reward type 1. Note that different reward designs yield different performance objectives.

The CQ+ routing policy aims to choose the next-hop that minimizes uncertainties (probability of failure) in dynamic networks. This is done by accounting for the probability of failure of the path with the lowest link uncertainties and number of hops. Although the CQ+ routing policy indirectly optimizes the network overhead, overhead is not accounted for directly in its optimization. We define the overhead in routing for dynamic networks as the ratio of the total number of transmissions in the communication network, denoted by N_TX, to the total number of packets delivered, denoted by N_D, for a specific packet data rate and a window of time. The overhead normalized by the network size N is

OH = (1/N) · (N_TX / N_D).   (33)

To show the effectiveness of the DRL approach on CQ+ routing policy optimization, the deepCQ+ routing objective is to minimize overhead while keeping the goodput rate ρ at the same level as (or higher than) the goodput rate ρ_0 provided by CQ+ routing (for the same input data flows and the same horizon). This can be formulated as

min_θ N_TX / N_D  such that  ρ ≥ ρ_0.   (34)

Since the target is the efficiency of our routing policy, we can simply assume ρ = ρ_0 and, consequently, N_D = ρ_0 T for some time horizon T (and some specific input data rate). With this approach we maintain the goodput rate while achieving lower overhead, making the comparison with previous robust routing policies fair. Therefore, the optimization problem in (34) can be simplified to

min_θ N_TX  such that  N_D = ρ_0 T.   (35)

From our experiments, the rewarding system in (30) already gives the goodput rate ρ_0 while keeping the overhead lower than CQ+ routing. Hence, we can target overhead minimization by reformulating our rewarding system to accommodate (35) as

r_i(t) = w_1 D − w_2 Z − w_3 N_ack / N + { 1 − c_t(i, i_1) ε̃,  a_t = 1;  c_t(i, i_1) ε̃,  a_t = 0 },   (36)

where D is the reward for packet delivery: if a packet is delivered successfully to the destination and node i has contributed to that delivery, then node i is rewarded. Note that we do not reward duplicate deliveries of a packet. Z is a penalty indicator which is enabled when no ACK has been received for our transmissions. This term is not directly related to the optimization problem (35), but it prevents the system from learning to unicast all packets initially, which may happen if the initial reward is negative and the agent tries to end the episode to avoid accumulating penalties. The next term, N_ack / N, is the normalized number of ACKs received as a result of the action taken, and therefore closely represents the number of copies of a packet at other nodes. A naive approach would be to penalize every transmission directly. However, the cause of the transmissions at each node is not the actions taken at that specific node but the transmissions of packets from other nodes to it. Hence, the N_ack received at each node is used to estimate the number of additional transmissions imposed on the network as a result of each node's action. The last reward component remains intact from reward type 1, as it already outperformed CQ+ routing in terms of overhead while achieving the same goodput (see Figures 4 and 5). The weights of the reward components (i.e. w_1, w_2, w_3) have been tuned for the overhead minimization problem (35). The proposed deepCQ+ routing technique with this modified (extended) reward definition is referred to as deepCQ+ routing with reward type 2. The deepCQ+ routing algorithms are summarized in Algorithm 2.
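The following Python sketch illustrates one plausible reading of the per-step reward (36); the weight values shown are placeholders, since the tuned values of w_1, w_2, w_3 are not listed here, and the argument names are assumptions.

def reward_type2(delivered, no_ack, n_ack, n_nodes, c_best, action,
                 w1=1.0, w2=1.0, w3=1.0, eps=0.05):
    """One plausible implementation of the per-step reward (36) for node i.

    c_best is c_t(i, i_1), the confidence of the best next-hop; the
    weights shown are placeholders for the tuned w_1, w_2, w_3.
    """
    eps_bar = 1.0 - eps  # epsilon-tilde in (30) and (36)
    r = w1 * float(delivered) - w2 * float(no_ack) - w3 * n_ack / n_nodes
    # Reward-type-1 component: broadcast earns 1 - c*eps_bar, unicast c*eps_bar.
    r += (1.0 - c_best * eps_bar) if action == 1 else c_best * eps_bar
    return r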
Algorithm 2: The proposed deepCQ+ routing (with reward types 1 and 2)

Receive incoming packet at node i:
  if packet is ACK then
    Update c and h using (4) and (5)
  else
    if packet traversed a loop then
      Drop packet, do not return ACK
    end
    if packet is already in queue then
      Drop packet
      Find best next-hop j* from (7)
      Compute c_ack and h_ack from (2) and (1) using j*
      Return ACK
    end
    if packet is not duplicate then
      Add packet to the queue
    end
  end
if queue is not empty then
  Pick up packet from the queue of node i (at current time t)
  Pre-process the best (K = 4) neighbors using (28)
  Form the input to the DNN-based routing policy:
    o_t(i) = [ c_t(i), h_t(i), c_{t−1}(i), h_{t−1}(i), a_{t−1}(i) ]
  ROUTING DECISION POLICY (θ = θ* trained with reward type 1 or 2):
    Choose broadcast with probability π_θ(a = 1 | o_t(i); θ), or unicast with probability π_θ(a = 0 | o_t(i); θ)
  if decision is broadcast then
    Forward packet to all
  end
  if decision is unicast then
    Forward packet to j*
  end
end
IV. EXPERIMENTS AND NUMERICAL RESULTS
A. Environment Modelling and Training Platforms
To model the CQ+ routing environment, and for rapid development of the design, algorithms, and testing, we have developed a CQ+ routing network simulator and constructed an RL CQ+ routing environment to train and test our approach and the CQ+ routing baselines. Our environment platform is built in Python and implements all CQ+ routing protocol features, including generation of random dynamic networks with data flows, different numbers of nodes, and configurable network setup parameters (dynamic levels, range, node locations, data rates, sources and destinations, data queues and backlogs, link quality computation, etc.). The CQ+ routing protocol implementation handles tracking and distributing the C- and H-values, duplicate packet checking, etc. It also extracts performance evaluation information. The environment is interfaced with Ray, a powerful distributed computing platform for machine learning [31], and the RLlib library [32], which provides scalable software primitives for RL algorithms.

A snapshot of the environment is given in Figure 2, where the source and destination nodes are colored gray. The color of each node shows its queue length, i.e. its congestion level. The color of each link is associated with its quality, with green links representing high quality and low packet error rates; the thickness of a link shows its traffic level. Heatmap images of the C- and H-levels are monitored to show the state of the network visually. The heatmap images of c(i, j, d) are 2-D because they consider a single destination node (node d = 25). Each row is associated with the current node i, and each column corresponds to a candidate next-hop j of that node. For example, the C-levels via nodes closer to the destination are relatively high, as there is high confidence that the path to the destination goes through them.
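As an illustration of how such an environment can be wired to RLlib with a single shared PPO policy (the parameter sharing of Section III-A4), here is a minimal sketch. The environment stub, the policy ID, and all configuration values other than the learning rate are assumptions, and the configuration keys follow the Ray 1.x-era RLlib API, which differs in newer versions.

import numpy as np
import ray
from gym.spaces import Box, Discrete
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.env.multi_agent_env import MultiAgentEnv

# Observation layout of eq. (29) with K = 4: [c, h, dc, dh, a_prev] -> 17 inputs.
OBS_SPACE = Box(low=-np.inf, high=np.inf, shape=(17,))
ACT_SPACE = Discrete(2)  # 0 = unicast, 1 = broadcast


class ManetRoutingStub(MultiAgentEnv):
    """Trivial stand-in for the routing simulator (random transitions),
    included only to make the shared-policy wiring runnable."""

    def __init__(self, config=None):
        self.agents = [f"node_{i}" for i in range(12)]

    def reset(self):
        return {a: OBS_SPACE.sample() for a in self.agents}

    def step(self, action_dict):
        obs = {a: OBS_SPACE.sample() for a in self.agents}
        rew = {a: 0.0 for a in self.agents}
        done = {"__all__": True}  # end the (dummy) episode immediately
        return obs, rew, done, {a: {} for a in self.agents}


ray.init()
trainer = PPOTrainer(config={
    "env": ManetRoutingStub,
    "lr": 5e-5,  # learning rate from Section IV-B
    "multiagent": {
        # A single policy ("shared") mapped to every node: parameter sharing.
        "policies": {"shared": (None, OBS_SPACE, ACT_SPACE, {})},
        "policy_mapping_fn": lambda agent_id: "shared",
    },
})
print(trainer.train()["episode_reward_mean"])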
1) Benchmark Topology
A benchmark topology was considered based on the adaptive routing work [11] and the CQ+ routing work [10]. In this topology scenario, there are N nodes in the network, randomly located in an area. The nodes between the source and destination nodes move in random directions with variable speeds: the closer they are to the source or destination, the slower their speed, reflecting heterogeneous tactical network environments. In the example considered for training and testing, the area is 800 m by 300 m and the wireless range of the nodes is 150 m; the packet error rate rises rapidly when the distance between nodes exceeds that range. We assume no interference between transmissions, to focus on the routing layer. Source and destination nodes exhibit a slow mobility pattern, similar to the nodes in the outermost regions. The data flow between the source and destination nodes can have various data rates (e.g. 20 packets per second of up to 1000 bytes each; each flow generates up to 160 kbps of payload traffic). In different experiments, we use different routing policies, starting from the generic CQ+ routing policy [10]. The same network topology is used to compare the other routing algorithms, i.e. deepCQ+ routing with reward types 1 and 2. For both testing and training (to monitor the policy training progress), we monitor the resulting performance metrics, such as the broadcast rate (the percentage of broadcast actions in the network) and the goodput rate. Goodput is the rate of delivery of intended data packets that are successfully received at the destination. We exclude duplicate packets from the count of delivered packets, and in the RL context we do not count them toward any reward. More importantly, we consider the total overhead as the metric that quantifies the efficiency of our decisions; consistent with (33), it is the ratio of the total number of transmissions made to the total number of data packets delivered. Other network metrics are also recorded, such as the average number of hops between source and destination. Factors such as overhead and the number of hops are normalized by the size of the network so that they apply to networks of different scales.
2) Mobility Model
Various mobility models for mobile ad-hoc networks (MANETs) have been proposed and discussed recently, particularly regarding their impact on network routing protocols [33], [34], [35], [36], [37]. MANET mobility models represent the moving behavior of each mobile node and are particularly important from our training and testing perspective. These models need to be realistic for dynamic tactical wireless networks, while still having enough randomization to cover corner cases and extreme scenarios in training and to properly reflect the performance impact of the network dynamics, scheduling, and resource allocation protocols. The random waypoint mobility model is often used in simulation studies of MANETs, despite some unrealistic movement behaviors such as sudden stops and sharp turns [38]. The Gauss-Markov mobility model [39] has been shown to solve both of these problems [35]. Therefore, we have used this model in our training and testing process and implemented it in our network simulator platform. For completeness, we have also tested the policies trained with the Gauss-Markov model on networks that follow the random waypoint model and observed similar performance and behavior. Compared with the random waypoint model, the Gauss-Markov mobility model has better modeling performance at higher dynamics (e.g. as fast as fast automobiles), while it matches the random waypoint model at human running speeds [35]. For the training of the MANET, the important factor is the sensitivity of the throughput (or goodput) and the end-to-end delay to different levels of the randomness settings, and the Gauss-Markov model shows no adverse effect on the accuracy of these metrics [35].

In the Gauss-Markov model, the velocity of a mobile node is assumed to be correlated over time and is modeled as a Gauss-Markov stochastic process. In a two-dimensional simulation and emulation field (as in this study), the speed and direction at the n-th time instance are calculated from the speed and direction at the (n − 1)-th time instance and a random variable, using the following equations:

v(n) = μ v(n − 1) + (1 − μ) v̄ + sqrt(1 − μ²) ṽ(n − 1),   (37)
φ(n) = μ φ(n − 1) + (1 − μ) φ̄ + sqrt(1 − μ²) φ̃(n − 1),   (38)

where v(n) and φ(n) are the new speed and direction of the node at time interval n; 0 ≤ μ ≤ 1 is the tuning parameter for the randomness (and the correlation with the previous time instance); v̄ and φ̄ are constants representing the mean values of speed (i.e. the dynamic level) and direction as n → ∞; and ṽ(n − 1) and φ̃(n − 1) are random values from a Gaussian distribution that add randomness. The Gauss-Markov model then gives the next location from the current location at time instance n as

x(n) = x(n − 1) + v(n) cos(φ(n)),   (39)
y(n) = y(n − 1) + v(n) sin(φ(n)),   (40)

where (x(n), y(n)) and (x(n − 1), y(n − 1)) are the coordinates of the MANET node's location at the n-th and (n − 1)-th time intervals, respectively. The mean angle is adjusted when nodes reach the region edges, to limit the movement to within the region.

To more accurately reflect the benchmark topology used in the CQ+ routing papers (discussed in the previous section), we have divided the environment into five regions placed symmetrically between the source and destination nodes.
The source and destination regions are at the left and right corners of the area. The closer a region is to the center, the faster its nodes move. The regions overlap by 10% to prevent too many network partitions. The central region has double the speed variance of the mid-left and mid-right regions, and the source and destination regions have half the speed variance of the mid-left/right regions. The mobility of the MANET nodes is simulated and shown in Figure 3.
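A minimal Python sketch of one Gauss-Markov update per (37)-(40) follows; the unit-variance Gaussian noise, the unit time step, and the non-negative speed clamp are simplifying assumptions.

import numpy as np

def gauss_markov_step(v, phi, v_bar, phi_bar, mu, rng):
    """One speed/direction update of the Gauss-Markov model, eqs. (37)-(38).

    mu in [0, 1] controls memory; unit-variance noise is assumed.
    """
    w = np.sqrt(1.0 - mu ** 2)
    v_new = mu * v + (1.0 - mu) * v_bar + w * rng.standard_normal()
    phi_new = mu * phi + (1.0 - mu) * phi_bar + w * rng.standard_normal()
    return max(v_new, 0.0), phi_new  # speeds kept non-negative (assumption)

def gauss_markov_move(x, y, v, phi):
    # Position update, eqs. (39)-(40), with a unit time step assumed.
    return x + v * np.cos(phi), y + v * np.sin(phi)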
B. Hyperparameter Tuning and Configuration Parameters

In this section, we discuss the hyperparameter tuning process and list the parameters obtained through tuning. We also itemize the configuration parameters used in the training process to generate the CQ+ routing environment.
Fig. 3. Mobility regions and network topology. The movements of the nodes are shown for a network of size 30 according to the Gauss-Markov model.
Parameters           Value
number of nodes      10-30
learning rate        0.00005
discount factor γ
The results in Figure 7 confirm that, in general, deepCQ+ routing with reward type 2 outperforms the generic (non-DRL-based) CQ+ routing technique. Note that the test is performed over network sizes of 10 to 30 nodes while the training is performed only on 12-node networks. This is evidence that deepCQ+ routing is scalable across different network sizes and dynamics. The main performance metric of interest is the normalized overhead, as it shows the efficiency of the network routing in terms of delivery of intended data packets.

Fig. 4. Goodput rate (delivery rate) during training vs. number of steps.

Fig. 5. Normalized overhead metric during training vs. number of steps.

The normalized overhead is the total overhead (transmitted) packet rate in the network divided by the goodput rate, and further divided by the network size for fair comparison across different network sizes. The normalized overhead is at least 15% lower for deepCQ+ routing with reward type 2 than for the generic non-DRL-based CQ+ routing, while the goodput rate is about the same. Lower normalized overhead values indicate more efficient policies. Note that this is only one example of routing policy design with our DRL framework, showing that the target objective can be achieved (e.g. minimizing normalized overhead while maintaining the goodput rate) while remaining scalable (trained on certain network sizes and dynamics but performing satisfactorily for the rest of the network parameter range).

Our MADRL framework for network routing policy design also enables training over variable network sizes and dynamics if required. Figure 8 shows the performance of deepCQ+ with reward type 2 trained on 12-node networks and the same policy trained over variable network sizes of 10 to 30 nodes. The results show that training on 12-node networks scales well over the entire network size domain; although it is possible to train over 10- to 30-node networks, there is little gain, if any, in doing so.
V. CONCLUSIONS AND FUTURE DIRECTIONS

In this paper, we have presented a MADRL approach to designing a robust, reliable, and scalable routing policy for dynamic wireless communication networks. Our DRL framework is specifically designed for scalability: it enables us to train and test routing policies over variable network sizes, data rates, and mobility dynamics, and the resulting policies perform well in other scenarios. Our DNN-based robust routing policy for dynamic networks, deepCQ+ routing, is based on CQ-routing but also monitors network statistics to improve broadcast/unicast decisions. We have shown that deepCQ+ routing is much more efficient than the traditional CQ+ routing technique, significantly decreasing the normalized overhead (number of transmissions per successfully delivered packet). Moreover, the policy is scalable and uses parameter sharing across all nodes, making it possible to use the same trained policy for networks and agents with various mobility dynamics, data rates, and network sizes.

In future work, we will expand the action space of deepCQ+ routing to include next-hop selection. We can also extend deepCQ+ routing to accommodate heterogeneous networks with dynamic node types. Another possible direction is to update and redesign the network parameter distribution via ACK and include more efficient and useful information. DRL-based policies can be designed and trained to accommodate different performance metrics, such as end-to-end delay minimization, overhead minimization, goodput rate maximization, etc.
VI. ACKNOWLEDGEMENT

Research reported in this publication was supported in part by the Office of Naval Research under contract N00014-19-C-1037. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Office of Naval Research. The authors would like to thank Dr. Santanu Das (ONR Program Manager) for his support and encouragement.

Fig. 6. Broadcast rate during training vs. number of steps.

Fig. 7. Comparison of the results of deepCQ+ routing with reward type 1 and reward type 2, trained on 12-node networks only, versus CQ+ routing; the results are tested across network sizes from 10 to 30. Although the deepCQ+ routing PPO policy is trained on 12-node networks, it scales well across various network sizes. deepCQ+ routing with reward type 2 achieves significantly lower normalized overhead (overhead rate divided by the goodput rate, divided by the network size).

Fig. 8. Scalability of deepCQ+ routing with reward type 2 trained on 12-node networks, compared with the policy trained over all 10- to 30-node networks (variable network sizes). The policy trained on 12-node networks performs as well as or better in terms of normalized overhead on network sizes of 10 to 30.
REFERENCES

[1] C. E. Perkins and P. Bhagwat, "Highly dynamic destination-sequenced distance-vector routing (DSDV) for mobile computers," ACM SIGCOMM Computer Communication Review, vol. 24, no. 4, pp. 234-244, 1994.
[2] C. E. Perkins and E. M. Royer, "Ad-hoc on-demand distance vector routing," in Proceedings WMCSA'99. Second IEEE Workshop on Mobile Computing Systems and Applications. IEEE, 1999, pp. 90-100.
[3] T. Clausen, P. Jacquet, C. Adjih, A. Laouiti, P. Minet, P. Muhlethaler, A. Qayyum, and L. Viennot, "Optimized link state routing protocol (OLSR)," 2003.
[4] J. Moy et al., "OSPF version 2," 1998.
[5] G. Pei, M. Gerla, and T.-W. Chen, "Fisheye state routing: A routing scheme for ad hoc wireless networks," in , vol. 1. IEEE, 2000, pp. 70-74.
[6] R. V. Boppana and S. P. Konduru, "An adaptive distance vector routing algorithm for mobile, ad hoc networks," in Proceedings IEEE INFOCOM 2001. Conference on Computer Communications. Twentieth Annual Joint Conference of the IEEE Computer and Communications Society (Cat. No. 01CH37213), vol. 3. IEEE, 2001, pp. 1753-1762.
[7] J. A. Boyan and M. L. Littman, "Packet routing in dynamically changing networks: A reinforcement learning approach," Advances in Neural Information Processing Systems, 1994.
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[9] S. Kumar and R. Miikkulainen, "Confidence-based Q-routing: An on-line adaptive network routing algorithm," in Proceedings of Artificial Neural Networks in Engineering, 1998.
[10] M. Johnston, C. Danilov, and K. Larson, "A reinforcement learning approach to adaptive redundancy for routing in tactical networks," in MILCOM 2018 - 2018 IEEE Military Communications Conference (MILCOM), 2018, pp. 267-272.
[11] C. Danilov, T. R. Henderson, T. Goff, O. Brewer, J. H. Kim, J. Macker, and B. Adamson, "Adaptive routing for tactical communications," in MILCOM 2012 - 2012 IEEE Military Communications Conference, 2012, pp. 1-7.
[12] X. You, X. Li, Y. Xu, H. Feng, J. Zhao, and H. Yan, "Toward packet routing with fully distributed multiagent deep reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020.
[13] R. E. Ali, B. Erman, E. Baştuğ, and B. Cilli, "Hierarchical deep double Q-routing," in ICC 2020 - 2020 IEEE International Conference on Communications (ICC). IEEE, 2020, pp. 1-7.
[14] Z. Mammeri, "Reinforcement learning based routing in networks: Review and classification of approaches," IEEE Access, vol. 7, pp. 55916-55950, 2019.
[15] C. Yu, J. Lan, Z. Guo, and Y. Hu, "DROM: Optimizing the routing in software-defined networks with deep reinforcement learning," IEEE Access, vol. 6, pp. 64533-64539, 2018.
[16] G. Stampa, M. Arias, D. Sánchez-Charles, V. Muntés-Mulero, and A. Cabellos, "A deep-reinforcement learning approach for software-defined networking routing optimization," arXiv preprint arXiv:1709.07080, 2017.
[17] H. Ye, G. Y. Li, and B.-H. F. Juang, "Deep reinforcement learning based resource allocation for V2V communications," IEEE Transactions on Vehicular Technology, vol. 68, no. 4, pp. 3163-3173, 2019.
[18] S.-C. Lin, I. F. Akyildiz, P. Wang, and M. Luo, "QoS-aware adaptive routing in multi-layer hierarchical software defined networks: A reinforcement learning approach," in . IEEE, 2016, pp. 25-33.
[19] A. Valadarsky, M. Schapira, D. Shahaf, and A. Tamar, "Learning to route," ser. HotNets-XVI. New York, NY, USA: Association for Computing Machinery, 2017, pp. 185-191. [Online]. Available: https://doi.org/10.1145/3152434.3152441
[20] F. A. Oliehoek, C. Amato et al., A Concise Introduction to Decentralized POMDPs. Springer, 2016, vol. 1.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[22] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, 2000, pp. 1057-1063.
[23] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229-256, 1992.
[24] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[25] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889-1897.
[26] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," in Advances in Neural Information Processing Systems, 2017, pp. 6379-6390.
[27] L. Kraemer and B. Banerjee, "Multi-agent reinforcement learning as a rehearsal for decentralized planning," Neurocomputing, vol. 190, pp. 82-94, 2016.
[28] J. K. Gupta, M. Egorov, and M. Kochenderfer, "Cooperative multi-agent control using deep reinforcement learning," in International Conference on Autonomous Agents and Multiagent Systems. Springer, 2017, pp. 66-83.
[29] J. K. Terry, N. Grammel, A. Hari, L. Santos, B. Black, and D. Manocha, "Parameter sharing is surprisingly useful for multi-agent deep reinforcement learning," arXiv preprint arXiv:2005.13625, 2020.
[30] X. Chu and H. Ye, "Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning," arXiv preprint arXiv:1710.00336, 2017.
[31] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan et al., "Ray: A distributed framework for emerging AI applications," in , 2018, pp. 561-577.
[32] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, and I. Stoica, "RLlib: Abstractions for distributed reinforcement learning," in International Conference on Machine Learning, 2018, pp. 3053-3062.
[33] A. K. Gupta, H. Sadawarti, and A. K. Verma, "Performance analysis of MANET routing protocols in different mobility models," International Journal of Information Technology and Computer Science (IJITCS), vol. 5, no. 6, pp. 73-82, 2013.
[34] V. Timcenko, M. Stojanovic, and S. B. Rakas, "MANET routing protocols vs. mobility models: performance analysis and comparison," in Proceedings of the 9th WSEAS International Conference on Applied Informatics and Communications. World Scientific and Engineering Academy and Society (WSEAS), 2009, pp. 271-276.
[35] J. Ariyakhajorn, P. Wannawilai, and C. Sathitwiriyawong, "A comparative study of random waypoint and Gauss-Markov mobility models in the performance evaluation of MANET," in , 2006, pp. 894-899.
[36] B. Divecha, A. Abraham, C. Grosan, and S. Sanyal, "Impact of node mobility on MANET routing protocols models," J. Digit. Inf. Manag., vol. 5, no. 1, pp. 19-23, 2007.
[37] F. Bai and A. Helmy, "A survey of mobility models in wireless ad-hoc networks," 2004.
[38] C. Bettstetter, H. Hartenstein, and X. Pérez-Costa, "Stochastic properties of the random waypoint mobility model," Wireless Networks, vol. 10, no. 5, pp. 555-567, 2004.
[39] B. Liang and Z. J. Haas, "Predictive distance-based mobility management for PCS networks," in