DRLE: Decentralized Reinforcement Learning at the Edge for Traffic Light Control in the IoV
Pengyuan Zhou, Member, IEEE, Xianfu Chen, Member, IEEE, Zhi Liu, Senior Member, IEEE, Tristan Braud, Member, IEEE, Pan Hui, Fellow, IEEE, and Jussi Kangasharju, Member, IEEE
Abstract—The Internet of Vehicles (IoV) enables real-time data exchange among vehicles and roadside units and thus provides a promising solution to alleviate traffic jams in urban areas. Meanwhile, better traffic management via efficient traffic light control can benefit the IoV as well by enabling a better communication environment and decreasing the network load. As such, the IoV and efficient traffic light control can form a virtuous cycle. Edge computing, an emerging technology that provides low-latency computation capabilities at the edge of the network, can further improve the performance of this cycle. However, while the collected information is valuable, an efficient solution for better utilization and faster feedback has yet to be developed for the edge-empowered IoV. To this end, we propose Decentralized Reinforcement Learning at the Edge for traffic light control in the IoV (DRLE). DRLE exploits the ubiquity of the IoV to accelerate the collection of traffic data and its interpretation towards alleviating congestion and providing better traffic light control. DRLE operates within the coverage of the edge servers and uses aggregated data from neighboring edge servers to provide city-scale traffic light control. DRLE decomposes the highly complex problem of large-area control into a decentralized multi-agent problem. We prove its global optimality with concrete mathematical reasoning. The proposed decentralized reinforcement learning algorithm running at each edge node adapts the traffic lights in real time. We conduct extensive evaluations and demonstrate the superiority of this approach over several state-of-the-art algorithms.
Index Terms—Edge Computing, Multi-agent Deep Reinforcement Learning, Internet of Vehicles, Traffic Light Control
I. INTRODUCTION
The Internet of Vehicles (IoV) [1]–[3] allows data exchange among vehicles (V2V), roadside units (RSUs) (V2I), and other commutable devices on roads or remote resources distributed over the Internet. It can facilitate and enable a wide variety of applications such as driving habit monitoring, driving operation recommendation, and emergency notification [4]–[6]. The IoV leverages an ever-increasing number of vehicles connected to the Internet and has significant potential to alleviate the continuously rising traffic congestion, which has dramatic consequences on the environment as well as the well-being of citizens. Thanks to its high-speed wireless connectivity, the IoV enables data collection from vehicles in real time [7], [8]. On a related note, edge computing [9], [10] has emerged in recent years as a solution to extend the capacity of remote cloud services towards nearby end users. Edge computing is at the core of most future networking paradigms, as its characteristics make it an ideal candidate for time-sensitive and highly mobile applications such as those encountered in the IoV [11]–[14]. Co-located with base stations or RSUs, edge computing nodes can process the IoV data and respond to traffic jams and anomalies. By connecting traffic signals to the IoV, it becomes possible to empower signal timing plans with real-time traffic information and exploit the intelligence at the edge to react to unforeseen congestion. However, while the collected information at the edge is valuable, an efficient solution for better utilization and faster feedback has yet to be developed for the large-scale edge-empowered IoV.

Most of the related solutions follow one of two directions: 1) utilize linear programming at intersections for fast adaptation of the signal plan [15]–[17]; 2) deploy machine learning to directly control the traffic lights or adapt the phase duration [18]–[20]. The first direction lacks exploration or learning abilities and strongly depends on the availability of accurate objective functions and constraints; hence many research efforts focus on the second direction, including single-agent [21], [22] and multi-agent systems [19], [23]–[25]. Single-agent solutions suffer from huge state and action spaces. Multi-agent solutions, on the other hand, can decrease the state and action spaces. Nevertheless, we notice three major concerns regarding the works in this direction:

1) Lack of practical solutions to bridge the technology gap between machine learning algorithms and deployability in real-life smart city scenarios.

2) Lack of solid theoretical analysis to prove the optimal performance of decentralized training.

3) Lack of extensive tests with credible simulator platforms to show the benefit of decentralized learning in traffic light control with reusable results.

In this work, we propose Decentralized Reinforcement Learning at the Edge for traffic light control in the IoV (DRLE). Following a hierarchy similar to our previous works [26], [27], we build a new system model with a focus on multi-agent training with rich IoV data. This integrated framework leverages the real-time data collection from connected vehicles to optimize traffic light control from the perspective of hierarchical levels. Each level optimizes its coverage with edge servers running a level-specific algorithm, of which one or several key parameters are tuned by the upper level's algorithm in real time. The decentralized architecture of DRLE relies on a pervasive deployment of edge servers, including signal control units at intersections and edge servers co-located with base stations and aggregation points. In this work, we use the terms traffic signal and traffic light interchangeably.
DRLE decomposes the highly complex problem of large-area control into a decentralized multi-agent problem. We prove its global optimality with concrete mathematical reasoning (§IV). We build our algorithm with credible open-source platforms [28], [29] and a reinforcement learning library [30] (§V). We conduct extensive evaluations and demonstrate the superiority of this approach over several state-of-the-art algorithms (§VI). Specifically, DRLE decreases the convergence time by 65.66% compared to Proximal Policy Optimization and the number of training steps by 79.44% compared to Augmented Random Search and Evolutionary Strategies. Besides, DRLE exponentially reduces the action space and provides comparable traffic control performance within only 1/4 of the training time of its centralized counterpart.

The rest of the paper is structured as follows. We give an overview of related works in §II. In §III we present the system design and traffic model. We describe the theoretical details of our algorithm in §IV and show our evaluation setup and results in §V and §VI, respectively. Finally, §VII concludes the paper.

II. RELATED WORK
Researchers have put a lot of effort into optimizing traffic light control. Major solutions include linear programming and machine learning. Meanwhile, recent proposals have started to look at the potential of rich IoV data for traffic management. In this section, we give an overview of related works.
IoV.
Powered by fast-developing vehicular networking techniques, researchers have proposed solutions to utilize rich IoV data for traffic control [31], [32]. Kumar et al. proposed to apply ant colony algorithms to help vehicles find the optimal routes [33]. Darwish et al. focused on real-time big data analytics in the IoV environment powered by fog computing [34]. Chen et al. targeted enhancing transportation safety and network security by mining effective information from both the physical and network data spaces [35]. However, the potential of utilizing rich IoV data specifically for traffic light control has not been fully explored.
Linear programming.
Researchers have proposed traffic light control solutions based on traffic models (microscopic, mesoscopic and macroscopic) [36]–[38]. These solutions utilize linear programming to solve the objective functions [15]–[17]. Although linear programming is straightforward and has low latency, it depends on accurate objective functions and constraints. Moreover, it lacks the exploration or learning abilities needed to scale to large-area optimization and complicated scenarios.
Reinforcement Learning.
Instead of trying to build explicit traffic flow models, machine learning proposals learn the traffic patterns and achieve optimal policies by iteratively adapting actions with the goal of maximizing the cumulative reward [20], [23]. Many proposals have applied single-agent or multi-agent reinforcement learning to optimize traffic light control. Single-agent solutions have large state and action spaces and thus require a large capacity to calculate optimal signal phases [18]. Therefore, researchers nowadays tend to use multi-agent algorithms to divide larger problems into smaller sub-problems. For example, Chu et al. proposed in [39] to reduce the action space by dynamically partitioning the traffic grid into smaller regions and deploying a local agent in each region. They applied multi-agent reinforcement learning to A2C for large-scale traffic light control in [19]. Li et al. used deep Q-learning (DQL) to control traffic lights and proposed deploying a deep stacked autoencoder (SAE) neural network to reduce the huge state space brought by the tabular Q-learning method [20]. Balaji et al. proposed to control traffic lights with distributed agents at each intersection [18]. El-Tantawy et al. explored coordinated agents to let intersections conduct signal control actions in cooperation with neighbors [40]. Based on [39], Tan et al. further proposed to concatenate the latent states of local agents to form the global action-value function [41]. Overall, we find the existing solutions fall short in the following aspects and propose DRLE to improve on them.

• Optima proof.
We rarely find multi-agent approaches for traffic light control that mathematically prove the optimality of multi-agent reinforcement learning.

• The gap between technique and reality. Most multi-agent learning approaches do not provide deployable solutions to apply the algorithms to real-life smart cities.

• Extensive tests with open-source platforms. Many related works have conducted simulations with vehicular traffic simulators and self-developed machine learning scripts limited to the proposals. We find those results hard to reuse due to the lack of credible open-source platforms.

In this work, we address those shortcomings by proposing a deployable decentralized solution and proving its optimality with mathematical reasoning and extensive reusable tests.

III. SYSTEM
In this section, we first present the design of DRLE and the communication mechanism. Then, we describe the traffic model as the basis of the algorithm.
A. System Design
DRLE revolves around the device layer and the edge layer, as shown in detail in Fig. 1 and at a high level in Fig. 2. The device layer includes the vehicles, traffic signals, pedestrian devices, RSUs and other devices involved in the IoV. In the rest of this paper, we assume each traffic signal is connected with a control unit. Each control unit employs video cameras facing all directions to collect real-time traffic data and transmits it via a wireless network to a nearby edge server. The edge layer hosts the edge servers in two tiers. The first-tier servers (ES1) are co-located with the base stations at the radio access network. The second-tier servers (ES2) are co-located with the aggregation points in the core network. This scheme agrees with the Multi-access Edge Computing standard proposed by the European Telecommunications Standards Institute. Please refer to [26] for the detailed edge server deployment strategy. Each ES2 collects data from nearby ES1 servers, provides a larger scale of service, and sends backups to the cloud data center. Next, we briefly describe the communication mechanism and overhead.

B. Communications
Each ES1 feeds the collected data into local reinforcement learning and sends the actions to the signal control units in real time.

Fig. 1: System overview. In each Intra-ES subsystem, an ES1 (MEC) uses decentralized multi-agent learning to train on the collected IoV data and sends commands to control the traffic lights. ES2 tunes the parameters of the multi-agent algorithms running in each ES1 (Intra-ES subsystem).
Fig. 2: System layers.

An ES1 collects the data by communicating with the devices (connected vehicles, signal control units and RSUs) via Cellular Vehicle-to-Everything (C-V2X) [42] and operates within a coverage defined by the range of its co-located base station. For simplicity, we do not consider RSUs and pedestrians in the rest of the paper.

Actions are primarily light-switching commands, which can be encapsulated in small data packets. These packets are sent only to a limited number of signals; hence, the majority of data transmission is between the connected vehicles and the ES1 via Vehicle-to-Network (V2N). As suggested by the 3GPP standard [43], each message should be sent at a frequency between 0.1 Hz and 1 Hz with a payload between 50 bytes and 300 bytes. We assume a message frequency of 1 Hz to align with the agent learning rate in the evaluation (see §V). We assume that each message has the size of a maximum transmission unit (1500 bytes), which is sufficient to contain the required data (speed, location, and direction of travel of the vehicle). Note that other transmission mechanisms such as V2V and V2I are performed via a different radio (e.g., ITS 5.9 GHz) than V2N (licensed mobile bands, e.g., 700 MHz) and are not necessary for DRLE.

Since the transmissions between the vehicles and the ES1 base stations are only one-hop and unidirectional (uplink), the communications require little networking capacity. Hence, they have a limited overhead and impact on the overall C-V2X environment. We present a preliminary evaluation in §VI-D and show that the end-to-end delay, consisting of the vehicle-to-ES1 transmission delay and the ES1-to-signal transmission delay, is much smaller than a training step duration (1 second). A training step is the period over which one gradient update happens; we define each step as a 1-second period. The delay of each gradient update is at the millisecond level; in other words, each gradient update takes several milliseconds and then waits until the next second to conduct the next update.
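As a rough sanity check of this small-overhead claim, the aggregate uplink load at one ES1 follows directly from the message size and frequency above. The minimal sketch below is illustrative only; the count of 230 active vehicles is the per-step average reported later in §VI-D.

```python
# Rough estimate of the V2N uplink load at one ES1, using the message
# parameters above (one 1500-byte message per vehicle per second).
MSG_SIZE_BYTES = 1500   # one maximum transmission unit per message
MSG_FREQ_HZ = 1.0       # message frequency per vehicle
NUM_VEHICLES = 230      # average active vehicles per step (cf. §VI-D)

bits_per_vehicle = MSG_SIZE_BYTES * 8 * MSG_FREQ_HZ   # 12 kbit/s
total_uplink_bps = bits_per_vehicle * NUM_VEHICLES    # ~2.76 Mbit/s

print(f"per-vehicle uplink: {bits_per_vehicle / 1e3:.1f} kbit/s")
print(f"aggregate uplink at one ES1: {total_uplink_bps / 1e6:.2f} Mbit/s")
```

Even at this worst-case payload, the aggregate load stays in the low megabit-per-second range, which supports the limited-overhead argument.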
C. Traffic Model and Problem
We assume the traffic lights in the urban area pertain to a common signal timing plan characterized by a fixed cycle containing a fixed number of phases. A phase refers to the time duration of the green lights for a given direction. Let $\mathcal{L}$ be the set of links in a signalized urban area. Then, $\mathcal{L}^{(\mathrm{in})}$ ($\mathcal{L}^{(\mathrm{out})}$) is the set of input (output) links of the area. To allow a smooth driving experience, we introduce two parameters, i.e., the number of halting vehicles and the speed-lag. A vehicle moving at a speed slower than 0.1 m/s is considered halting. Speed-lag is defined as the speed difference between the actual driving speed and the maximum speed permitted by statute. We use the speed-lag instead of the commonly used speed to account for the maximum speed limit in reality. We formulate the problem as minimizing the overall number of halting vehicles and the speed-lag in the area over the whole optimization horizon. For the coverage of an ES1, the objective of the optimization problem can be formulated as

$$\bar{P} = \lim_{K \to \infty} \mathbb{E}\left[ \frac{1}{K} \sum_{k=1}^{K} \sum_{m \in \mathcal{L}^{(\mathrm{in})} \setminus \mathcal{L}^{(\mathrm{out})}} \left( -w_1 \cdot H_m^k - w_2 \cdot \Delta V_m^k \right) \right], \quad (1)$$

which can also be approximated as

$$P = \mathbb{E}\left[ (1-\gamma) \cdot \sum_{k=1}^{\infty} \gamma^{k-1} \cdot \sum_{m \in \mathcal{L}^{(\mathrm{in})} \setminus \mathcal{L}^{(\mathrm{out})}} \left( -w_1 \cdot H_m^k - w_2 \cdot \Delta V_m^k \right) \right], \quad (2)$$

if the discount factor $\gamma \in [0, 1)$ approaches 1. Herein, $H_m^k$ is the number of halting vehicles and $\Delta V_m^k$ is the average speed-lag of the vehicles in lane $m$ at the beginning of the $k$-th cycle; $w_1$ and $w_2$ are two positive weighting constants. Although the model describes the problem from a global view, each internal intersection can have a varied impact on the optimization. Therefore, it is better to disassemble the optimization objective over the multiple intersections, namely,

$$P = \mathbb{E}\left[ (1-\gamma) \cdot \sum_{k=1}^{\infty} \gamma^{k-1} \cdot \sum_{c=1}^{C} \sum_{m \in \mathcal{L}_c^{(\mathrm{in})} \setminus \mathcal{L}_c^{(\mathrm{out})}} \left( -w_1 \cdot H_m^k - w_2 \cdot \Delta V_m^k \right) \right], \quad (3)$$

where $c$ is an intersection index, $C$ refers to the total number of intersections in the area, and $\mathcal{L}_c^{(\mathrm{in})}$ ($\mathcal{L}_c^{(\mathrm{out})}$) represents the set of input (output) links of intersection $c$. In this work, we propose to decompose the optimization problem and solve it with a decentralized reinforcement learning algorithm, as described in §IV.
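For illustration, the per-cycle summand of Eqs. (1)–(3) can be computed directly from per-lane observations. The following minimal sketch assumes illustrative lane data and unit weights $w_1 = w_2 = 1$.

```python
# Sketch: the per-cycle summand of Eqs. (1)-(3) for one ES1 coverage area.
# halting[m] is H_m^k and speed_lag[m] is ΔV_m^k for each input (non-output)
# lane m; w1 and w2 are the positive weighting constants of the model.
def step_objective(halting, speed_lag, w1=1.0, w2=1.0):
    """Return sum_m (-w1 * H_m - w2 * ΔV_m) over the observed lanes."""
    return sum(-w1 * h - w2 * dv for h, dv in zip(halting, speed_lag))

# Example: three lanes with (halting count, average speed-lag in m/s).
print(step_objective(halting=[4, 0, 2], speed_lag=[8.3, 0.5, 3.1]))
```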
IV. ALGORITHM

A. Hierarchical Algorithm
The system includes three parallel and interactive algorithms running at three levels, i.e., Intersection, Intra-ES and Inter-ES, performed by the signal control units, ES1 and ES2, respectively (Fig. 1). At the Intersection level, each signal adapts its phases with a threshold-based algorithm. The threshold can be the average queuing length or the average space headway. At the Intra-ES level, each ES1 runs a multi-agent reinforcement learning algorithm to participate in the traffic light switching directly. At the Inter-ES level, an ES2 runs a threshold-based algorithm to optimize the urban traffic by tuning the reinforcement learning rate in each ES1. We briefly describe the Intersection and Inter-ES levels and detail the decomposed reinforcement learning algorithm at the Intra-ES level.
Intersection level. The control unit of each intersection signal runs a threshold-based algorithm similar to [44]. Each control unit employs cameras to capture traffic jams and adapts the phase duration based on predefined parameters. When the parameters surpass the thresholds, the control unit extends the green phase for the more jammed direction (e.g., with longer queues or smaller space headway) and decreases the green phase for the other direction by an equivalent length.
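A minimal sketch of such a threshold rule is given below; the threshold, increment, and minimum green values are illustrative assumptions rather than the parameters of [44].

```python
# Sketch of the Intersection-level threshold rule: if one direction is
# sufficiently more jammed than the other (here measured by queue length),
# extend its green phase and shorten the other's by the same amount.
def adapt_phase(green_ns, green_ew, queue_ns, queue_ew,
                threshold=5, delta=3, min_green=5):
    if queue_ns - queue_ew > threshold and green_ew - delta >= min_green:
        green_ns, green_ew = green_ns + delta, green_ew - delta
    elif queue_ew - queue_ns > threshold and green_ns - delta >= min_green:
        green_ns, green_ew = green_ns - delta, green_ew + delta
    return green_ns, green_ew

print(adapt_phase(green_ns=20, green_ew=20, queue_ns=12, queue_ew=3))
# -> (23, 17): the more jammed north-south direction gains green time.
```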
Inter-ES level. An ES2 tunes the urban traffic by adapting the reinforcement learning rate in each ES1 based on a threshold-based algorithm similar to [44].

Intra-ES level.
Each ES1 optimizes the internal traffic within its coverage, defined by the co-located base station, by switching the traffic lights directly. We adopt DQL to provide an adaptive algorithm that responds to the dynamically changing traffic conditions. The advantages of Q-learning for traffic light control are described in more detail in a study by Abdulhai et al. [23], including not requiring a prespecified model of the environment and being adaptive and unsupervised. We define the state $s^k$, action $a^k$ and reward $R^k$ as follows.

• $s^k$ is the state at the $k$-th cycle, including the number of halting vehicles $H^k = \{H_m^k \mid m = 1, \cdots, M\}$, the speed-lag of the vehicles $\Delta V^k = \{\Delta V_m^k \mid m = 1, \cdots, M\}$, and the traffic light states $\theta^k = \{\theta_c^k \mid c = 1, \cdots, C\}$, where $M$ is the total number of lanes in the area.

• $a^k$ is the action operated by the agents after observing $s^k$. More specifically, $a^k = \{a_c^k \mid c = 1, \cdots, C\}$, where $a_c^k \in \{0, 1\}$ represents the decision of switching the traffic light.

• $R^k$ is the reward, defined as the additive inverse of the average speed-lag and the number of halting vehicles, namely, $R^k = \sum_{m=1}^{M} \left( -w_1 \cdot H_m^k - w_2 \cdot \Delta V_m^k \right)$.
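The following minimal sketch shows how $s^k$ and $R^k$ could be assembled from raw per-vehicle speed observations; the lane records and unit weights are illustrative placeholders, and the 60 m/s cap matches the speed limit used later in §V-A.

```python
# Sketch: assembling the state s^k and reward R^k defined above from raw
# per-vehicle speed observations.
HALT_SPEED = 0.1  # m/s; below this a vehicle counts as halting (§III-C)

def lane_features(speeds, v_max):
    """speeds: list of current vehicle speeds (m/s) in one lane."""
    halting = sum(v < HALT_SPEED for v in speeds)           # H_m^k
    speed_lag = (sum(v_max - v for v in speeds) / len(speeds)
                 if speeds else 0.0)                        # ΔV_m^k
    return halting, speed_lag

def build_state_and_reward(lanes, light_states, v_max=60.0, w1=1.0, w2=1.0):
    feats = [lane_features(speeds, v_max) for speeds in lanes]
    H = tuple(h for h, _ in feats)
    dV = tuple(dv for _, dv in feats)
    state = (H, dV, tuple(light_states))                    # s^k
    reward = sum(-w1 * h - w2 * dv for h, dv in feats)      # R^k
    return state, reward

state, reward = build_state_and_reward(
    lanes=[[12.0, 0.0, 3.2], [25.5, 30.1]], light_states=[0, 1])
```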
Agent Goal. The overall goal of DRLE is to optimize the light control to smooth the traffic. To do so, the agents of an ES1 need to find a control policy $\pi$ that maximizes

$$Q^{\pi}(s, a) = (1-\gamma) \cdot \mathbb{E}\left[ \sum_{k=1}^{\infty} \gamma^{k-1} \cdot R^k \,\Big|\, s^1 = s, a^1 = a \right], \quad (4)$$

which is also termed a Q-function. A control policy $\pi$ can be defined as a mapping $a = \pi(s)$. To put it another way, the goal is to solve

$$\pi^* = \arg\max_{\pi} Q^{\pi}(s, \pi(s)), \quad \forall s. \quad (5)$$

For notational convenience, we denote $Q(s, a) = Q^{\pi^*}(s, a)$, $\forall (s, a)$. Using the state-action-reward-state-action (SARSA) algorithm [45], the optimal Q-function can be found in an iterative on-policy manner. The centralized decision making at a cycle is executed at the signals independently, based on which we linearly decompose the Q-function,

$$Q(s, a) = \sum_{c=1}^{C} Q_c(s, a_c), \quad (6)$$

where $Q_c(s, a_c)$ is defined to be the per-signal Q-function given by

$$Q_c(s, a_c) = (1-\gamma) \cdot \mathbb{E}\left[ \sum_{k=1}^{\infty} \gamma^{k-1} \cdot R_c^k \,\Big|\, s^1 = s, a_c^1 = a_c \right]. \quad (7)$$

Herein, $a_c^k$ and $R_c^k$ are, respectively, the action and the reward for the agent at intersection $c$ at cycle $k$. We emphasize that the decision makings across the cycles are performed at each signal in accordance with the optimal control policy implemented by the ES1. In other words, $\forall s$,

$$\pi^*(s) = \arg\max_{a = (a_c : c = 1, \cdots, C)} \sum_{c=1}^{C} Q_c(s, a_c). \quad (8)$$
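A useful consequence of the decomposition in Eq. (6) is that the joint argmax in Eq. (8) splits into $C$ independent one-dimensional maximizations, since each term $Q_c(s, a_c)$ depends only on its own binary action. A minimal tabular sketch, with dictionary-backed Q-tables as an implementation assumption:

```python
# Joint action selection under Eq. (8): because Q(s, a) is a sum of
# per-signal terms Q_c(s, a_c), maximizing over the joint action reduces
# to maximizing each term independently over a_c in {0, 1}.
def select_joint_action(q_tables, state):
    """q_tables[c] maps (state, action) -> Q_c(s, a_c)."""
    return [max((0, 1), key=lambda a: q_c.get((state, a), 0.0))
            for q_c in q_tables]

# Example with C = 2 signals: signal 0 prefers switching, signal 1 holding.
q_tables = [{("s0", 0): -4.0, ("s0", 1): -1.5},
            {("s0", 0): -2.0, ("s0", 1): -3.0}]
print(select_joint_action(q_tables, "s0"))  # -> [1, 0]
```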
B. Optimization and Convergence Guarantee

Theorem 1 (Optimization Guarantee).
The linear Q-function decomposition approach in Eq. (6) asserts the expected optimal long-term performance.
Proof.
For the Q-function of a centralized decision making $a$ under a global network state $s$, we have

$$Q(s, a) = (1-\gamma) \cdot \mathbb{E}\left[ \sum_{k=1}^{\infty} \gamma^{k-1} R^k \,\Big|\, s^1 = s, a^1 = a \right] = (1-\gamma) \cdot \mathbb{E}\left[ \sum_{k=1}^{\infty} \gamma^{k-1} \sum_{c=1}^{C} R_c^k \,\Big|\, s^1 = s, a^1 = a \right] = \sum_{c=1}^{C} (1-\gamma) \cdot \mathbb{E}\left[ \sum_{k=1}^{\infty} \gamma^{k-1} R_c^k \,\Big|\, s^1 = s, a^1 = a \right] = \sum_{c=1}^{C} Q_c(s, a_c), \quad (9)$$

which completes the proof. $\Box$

Therefore, instead of learning the global Q-function, the SARSA updating rule is slightly adapted for each signal to

$$Q_c^{k+1}(s, a_c) = \left(1 - \alpha^k\right) \cdot Q_c^k(s, a_c) + \alpha^k \cdot \left( (1-\gamma) \cdot R_c^k + \gamma \cdot Q_c^k(s', a_c') \right), \quad (10)$$

where $\alpha^k \in [0, 1]$ is the learning rate. Theorem 2 ensures the convergence of the decentralized learning process.
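A minimal tabular sketch of the per-signal update in Eq. (10) follows; the defaultdict storage and the example transition values are implementation assumptions.

```python
from collections import defaultdict

# One application of Eq. (10) for signal c on the observed transition
# (s, a_c, R_c, s', a_c'); note the (1 - gamma) scaling of the reward,
# which matches the normalized Q-function of Eq. (7).
def sarsa_update(q_c, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    target = (1 - gamma) * r + gamma * q_c[(s_next, a_next)]
    q_c[(s, a)] = (1 - alpha) * q_c[(s, a)] + alpha * target

q_c = defaultdict(float)                 # Q_c(s, a_c), zero-initialized
sarsa_update(q_c, s="s0", a=1, r=-3.5, s_next="s1", a_next=0)
print(q_c[("s0", 1)])                    # the updated Q-value
```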
The sequence $\{(Q_c^k(s, a_c) : \forall (s, a_c), \forall c \in \{1, \cdots, C\}) : k\}$ generated by Eq. (10) surely converges to the per-signal Q-functions $(Q_c(s, a_c) : \forall (s, a_c), \forall c \in \{1, \cdots, C\})$ if and only if, for each signal $c \in \{1, \cdots, C\}$, the $(s, a_c)$-pairs are visited an infinite number of times.

Proof.
Since the per-signal Q-functions are learned simultaneously, we consider monolithic updates during the decentralized learning process. That is, the iterative rule in Eq. (10) can be encapsulated as

$$\sum_{c=1}^{C} Q_c^{k+1}(s, a_c) = \left(1 - \alpha^k\right) \cdot \sum_{c=1}^{C} Q_c^k(s, a_c) + \alpha^k \cdot \left( (1-\gamma) \cdot \sum_{c=1}^{C} R_c + \gamma \cdot \sum_{c=1}^{C} Q_c^k(s', a_c') \right). \quad (11)$$

Subtracting the sum of the per-signal Q-functions from both sides of Eq. (11) leads to

$$\sum_{c=1}^{C} Q_c^{k+1}(s, a_c) - \sum_{c=1}^{C} Q_c(s, a_c) = \left(1 - \alpha^k\right) \cdot \left( \sum_{c=1}^{C} Q_c^k(s, a_c) - \sum_{c=1}^{C} Q_c(s, a_c) \right) + \alpha^k \cdot T^k(s, a_c), \quad (12)$$

where

$$T^k(s, a_c) = (1-\gamma) \cdot \sum_{c=1}^{C} R_c + \gamma \cdot \max_{a''} \sum_{c=1}^{C} Q_c^k(s', a_c'') - \sum_{c=1}^{C} Q_c(s, a_c) + \gamma \cdot \left( \sum_{c=1}^{C} Q_c^k(s', a_c') - \max_{a''} \sum_{c=1}^{C} Q_c^k(s', a_c'') \right). \quad (13)$$

We let $\Delta^k$ denote the history of the first $k$ cycles of the decentralized learning process. The per-signal Q-functions are $\Delta^k$-measurable; thus both $\left( \sum_{c=1}^{C} Q_c^{k+1}(s, a_c) - \sum_{c=1}^{C} Q_c(s, a_c) \right)$ and $T^k(s, a_c)$ are $\Delta^k$-measurable. We then attain

$$\begin{aligned} \left\| \mathbb{E}\left[ T^k(s, a_c) \mid \Delta^k \right] \right\|_\infty &\le \left\| \mathbb{E}\left[ (1-\gamma) \cdot \sum_{c=1}^{C} R_c + \gamma \cdot \max_{a''} \sum_{c=1}^{C} Q_c^k(s', a_c'') - \sum_{c=1}^{C} Q_c(s, a_c) \,\Big|\, \Delta^k \right] \right\|_\infty \\ &\quad + \left\| \mathbb{E}\left[ \gamma \cdot \left( \sum_{c=1}^{C} Q_c^k(s', a_c') - \max_{a''} \sum_{c=1}^{C} Q_c^k(s', a_c'') \right) \Big|\, \Delta^k \right] \right\|_\infty \\ &\overset{(a)}{\le} \gamma \cdot \left\| \sum_{c=1}^{C} Q_c^k(s, a_c) - \sum_{c=1}^{C} Q_c(s, a_c) \right\|_\infty + \left\| \mathbb{E}\left[ \gamma \cdot \left( \sum_{c=1}^{C} Q_c^k(s', a_c') - \max_{a''} \sum_{c=1}^{C} Q_c^k(s', a_c'') \right) \Big|\, \Delta^k \right] \right\|_\infty, \end{aligned} \quad (14)$$

where $\| \cdot \|_\infty$ is the maximum norm of a vector and (a) is due to the convergence property of standard Q-learning. We are now left with verifying that $\left\| \mathbb{E}\left[ \gamma \cdot \left( \sum_{c=1}^{C} Q_c^k(s', a_c') - \max_{a''} \sum_{c=1}^{C} Q_c^k(s', a_c'') \right) \mid \Delta^k \right] \right\|_\infty$ converges to zero, which follows from: i) an $\epsilon$-greedy policy is deployed for the exploration-exploitation trade-off during decision making; ii) the per-signal Q-function values are upper bounded; and iii) both the global network state space and the decision-making spaces are finite. Thus the convergence of the decentralized learning is ensured. $\Box$
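Condition i) of the proof is typically met with an $\epsilon$-greedy rule, which keeps every $(s, a_c)$ pair visited infinitely often in the limit. A minimal sketch layered on the decomposed action selection above, with the exploration rate as an assumed constant:

```python
import random

# ε-greedy exploration over the decomposed Q-functions: with probability
# epsilon each signal takes a random action, otherwise the greedy one
# (select_joint_action is the sketch from §IV-A).
def epsilon_greedy_joint_action(q_tables, state, epsilon=0.05):
    greedy = select_joint_action(q_tables, state)
    return [random.choice((0, 1)) if random.random() < epsilon else a
            for a in greedy]
```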
Takeaway. The core contribution of the algorithm, i.e., decentralization, lies in the action selection process, during which the algorithm selects the optimal joint action that maximizes the sum of the Q-values of all agents, thus ensuring global optimality (Theorem 1 and Theorem 2). The major advantage in comparison with traditional centralized reinforcement learning is that the action selection cost grows linearly instead of exponentially as the number of involved agents (i.e., the traffic signals) increases: with $C$ binary-action signals, a centralized learner faces $2^C$ joint actions, whereas the decomposed learner evaluates only $2C$ per-signal values. It is noteworthy that the training is done in a centralized fashion to allow efficient cooperative training across the multiple agents.

V. EXPERIMENT SETUP
We conduct the training and the tests on an MSI GS65 Stealth 8SG equipped with a 6-core i7-8750H CPU, 32 GB of memory, and an Nvidia RTX 2080 Max-Q GPU. We build the learning algorithms on RLlib [30], an open-source library for reinforcement learning that can easily be scaled by increasing the number of workers [46]. We use its underlying Ray framework [29] to accelerate the decentralized algorithm training.

Besides Deep Q-Network (DQN), we also build and test distributional DQN (dDQN), proposed by Bellemare et al. in 2017 [47], which learns a categorical distribution of discounted returns instead of estimating the mean. To provide better performance, we leverage the values of several parameters explored by the well-acknowledged work Rainbow [48], including the learning rate, epsilon, and softmax cross entropy for dDQN. We define and evaluate the algorithms and policies using Flow [28], a Python library that provides the interface between RLlib and SUMO [49], a microscopic simulator for traffic and vehicle dynamics. Next, we describe the detailed settings.
A. Simulation Setup
To extensively test our proposal, we run sets of tests with different algorithms and policies in different scenarios.
Map.
Most related works on traffic light control experiment on grid-like maps such as the urban Manhattan grid scenario used by 3GPP [50]. We follow this setup and deploy the tests on n × n grid road maps that consist of n × n typical four-way, traffic-light-controlled intersections. Each intersection allows vehicles to flow either horizontally or vertically. When a green light ends, it transitions to yellow for two seconds before switching to red for safety. Each lane has one signal only, and the signal phases of an intersection in clockwise order consist of GrGr, yryr, rGrG, and ryry, where G, y, r refer to green, yellow and red, respectively. The traffic lights can be switched only after a minimum interval to prevent flickering.
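A minimal sketch of driving this phase cycle through SUMO's TraCI interface, which Flow exposes in our setup; the minimum green time is an illustrative parameter.

```python
import traci  # SUMO's TraCI Python client (Flow drives SUMO through it)

# Clockwise phase cycle of one intersection (§V-A). The yellow phases
# (indices 1 and 3) are held for two seconds before the conflicting green.
PHASES = ["GrGr", "yryr", "rGrG", "ryry"]
PHASE_MIN_S = {"yryr": 2, "ryry": 2}  # yellows last 2 s; greens vary

def advance_phase(tls_id, phase_idx, time_in_phase, min_green):
    """Advance one traffic light to its next phase once the current phase
    has run its minimum time, preventing flickering."""
    required = PHASE_MIN_S.get(PHASES[phase_idx], min_green)
    if time_in_phase < required:
        return phase_idx
    nxt = (phase_idx + 1) % len(PHASES)
    traci.trafficlight.setRedYellowGreenState(tls_id, PHASES[nxt])
    return nxt
```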
Scale. In this work, each ES1 collects data and controls the traffic lights within the coverage defined by its co-located base station. To be more realistic, we investigate the coverage range of LTE cell towers via a crowdsourced cellular tower and coverage mapping service. It shows that the coverage range of an LTE cell tower on B7 (2600 MHz) is typically around 5 to 10 blocks and 10 to 20 intersections in a European city center area (Helsinki, Finland). Therefore, we focus the simulation on the scale of 5 × 5 intersections to cover common scenarios. For the completeness of the work, we also test larger scales such as 10 × 10 and 15 × 15.
Traffic. Vehicles enter the map from all outer edges at a predefined rate, i.e., 360 vehicles/hour/edge. At such a rate, 7200 vehicles enter a 5 × 5 grid map from the 20 outer edges in an hour. To simplify the problem, vehicles travel straight on their paths. Each vehicle is driven following a basic SUMO built-in car-following model, of which the minimum gap between successive vehicles, maximum speed limit and deceleration ability are set to 2.5 m, 60 m/s and 7.5 m/s², respectively.
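A minimal sketch of these traffic settings written against Flow's parameter classes; the outer-edge identifiers are placeholders for the names the grid scenario actually generates.

```python
from flow.core.params import InFlows, SumoCarFollowingParams, VehicleParams

# Car-following settings of §V-A: 2.5 m minimum gap, 60 m/s speed cap,
# and 7.5 m/s^2 deceleration ability, on SUMO's built-in model.
vehicles = VehicleParams()
vehicles.add(
    veh_id="human",
    car_following_params=SumoCarFollowingParams(
        min_gap=2.5, max_speed=60, decel=7.5),
    num_vehicles=0)  # all traffic enters through the inflows below

# 360 vehicles/hour on every outer edge of the grid; the edge names here
# are placeholders for the identifiers of the generated 5 x 5 network.
inflow = InFlows()
for edge in ["left0", "right0", "bot0", "top0"]:
    inflow.add(veh_type="human", edge=edge, vehs_per_hour=360)
```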
B. Training Configurations

We test the learning algorithms utilizing single- and multi-agent training. We use multi-agent DQN and dDQN as the target algorithms to validate our mathematical reasoning in §IV. Considering that intersections with different centralities may influence the traffic performance differently, we define two kinds of policies with corresponding reward definitions as follows.
Policy.
We test the system with two policies, i.e., SharedPolicy and MultiPolicy, to address the different centralities of the intersections. We call the signals with higher centralities "central nodes", as indicated by the blue squares residing in the central area of Fig. 3, and the others "edge nodes", as indicated by the black circles. SharedPolicy lets the agents at all intersections share the same policy. MultiPolicy lets the agents at "central nodes" share a central policy while the other agents share an edge policy.
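A minimal sketch of how the two schemes could be expressed with RLlib's multi-agent API of the era we build on; the observation shape, environment name, stopping criterion, and node partition below are illustrative placeholders, not our exact configuration.

```python
from gym.spaces import Box, Discrete
from ray import tune

obs_space = Box(low=0.0, high=1.0, shape=(10,))   # per-agent observation
act_space = Discrete(2)                           # keep / switch the light
CENTRAL_NODES = {"center_6", "center_7", "center_11", "center_12"}

# MultiPolicy: two policy entries, assigned by centrality. SharedPolicy
# would instead declare a single entry and map every agent to it.
policies = {
    "central": (None, obs_space, act_space, {}),
    "edge": (None, obs_space, act_space, {}),
}

def policy_mapping_fn(agent_id):
    return "central" if agent_id in CENTRAL_NODES else "edge"

tune.run("DQN", config={
    "env": "traffic_grid",   # the Flow-registered multi-agent grid env
    "multiagent": {"policies": policies,
                   "policy_mapping_fn": policy_mapping_fn},
}, stop={"timesteps_total": 3_000_000})
```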
Reward.
We define the reward based on the average speed-lag and the number of halting vehicles (§III). Specifically, the reward of SharedPolicy is defined as

$$R = -w_1 \cdot H - w_2 \cdot \Delta V, \quad (15)$$

where $R$, $H$ and $\Delta V$ indicate the reward, the number of halting vehicles, and the average speed-lag, respectively. The rewards of MultiPolicy for the agents at "central nodes" and "edge nodes" are defined as

$$R_{\mathrm{central}} = -w_{c1} \cdot H - w_{c2} \cdot \Delta V, \quad (16)$$

$$R_{\mathrm{edge}} = -w_{e1} \cdot H - w_{e2} \cdot \Delta V, \quad (17)$$

where $w_{c1}$, $w_{c2}$, $w_{e1}$ and $w_{e2}$ denote the weights that differentiate the penalties that central and edge nodes receive. In this paper, we set $w_{c1} > w_{e1}$ and $w_{c2} > w_{e2}$ to give central nodes higher penalties for poor performance.
Fig. 3: Central (blue squares) and edge (black circles) intersections.

C. Benchmarks
We compare the performance of DQN and dDQN with multiple algorithms as follows.

• Static simply lets traffic lights deploy pre-defined static phases.

• Actuated is a common light control scheme in Germany. It works by either prolonging traffic phases upon detecting a continuous traffic stream, or switching to the next phase upon detecting a sufficient time gap between successive vehicles [51].

• Augmented Random Search (ARS) is an improved version of Basic Random Search (BRS), proposed by Mania et al. in 2018 [52].

• Evolutionary Strategies (ES) is one of the OpenAI solutions, proposed by Salimans et al. in 2017 [53].

• Proximal Policy Optimization (PPO) is a popular gradient-based policy optimization algorithm proposed by Schulman et al. in 2017 [54]. It uses multiple epochs of mini-batch updates, and an MLP with tanh non-linearity to compute a value function baseline.

We also test dDQN using all parameter values deployed in Rainbow [48] to see whether it applies to our use case. Table I shows the hyper-parameters of the benchmark algorithms, DQN and dDQN.

VI. TEST RESULTS
A. Training Results
We conduct the training iteratively, where each iteration consists of numerous rollouts. As shown in Fig. 4 to Fig. 6, the plotted interval consists of 100 iterations, each of which consists of 30 rollouts. Each rollout has 1000 steps, and a step maps to one second in real life. As such, the interval consists of 3 million steps. A rollout, or playout, is a term often used in machine learning that originates from Monte Carlo methods; in each rollout, an agent takes actions until reaching a predefined maximum number of steps.
TABLE I: Hyper-parameters.

| Algorithm | Selected hyper-parameters |
|---|---|
| Augmented Random Search | SGD step size = 0.2; noise standard deviation = 0.2 |
| Evolutionary Strategies | Adam step size = 0.02; noise standard deviation = 0.02 |
| Proximal Policy Optimization | λ (GAE) = 0.; clipping ε = 0.; Adam step size = 5 × 10−; minibatch size = 128; hiddens [100, 50, 25]; γ (discount) = 0. |
| Deep Q-Network | Adam ε = 1. × 10−; learning rate α = 6. × 10−; hiddens [256, 256]; dimension = 84; train batch size = 1000 |
| Distributional DQN | number of atoms = 51; min/max values = [−, ]; target network update freq. = 8000 |
Since SharedPolicy and MultiPolicy have different reward definitions, their reward performances are not strictly comparable; therefore, we plot them separately. As shown in Fig. 4 to Fig. 6, dDQN and DQN show similar curves, with the former performing slightly better than the latter. PPO is outperformed by dDQN and DQN at the beginning of the training and climbs to a similar level after around 2M steps. The Rainbow setup provides the worst performance of all algorithms, indicating that the parameter values discovered in Rainbow do not fit our use case. Single-agent algorithms take much longer to reach a similar level of performance, so we exclude their curves from the figures. For instance, the training time of single-agent DQN is 4 times the training time of its multi-agent counterpart. This is because the action space dimension for each agent in the multi-agent algorithms is 2, while single-agent DQN has a joint action space exponential in the number of intersections and thus requires a much longer training time.

B. Traffic Results
We replay the policies with SUMO to compare the traffic control performance and show the summarized results in Table II. Halting vehicles refers to the average number of halting vehicles per step (speed below 0.1 m/s). Queuing time indicates the average waiting time of vehicles due to queuing per step. Queuing length refers to the average length of the queues per step; the end of a queue is defined as the last vehicle with a speed below 0.1 m/s. Speed refers to the average speed.

As shown, the Static algorithm presents the worst performance. Meanwhile, we find that all machine learning algorithms provide very limited improvement in the average vehicle speed compared to the Actuated algorithm.
Fig. 4: Reward for SharedPolicy (defined in Eq. (15)): (a) maximum, (b) average, and (c) minimum reward over millions of steps.

Fig. 5: Reward for MultiPolicy (defined in Eq. (16) and Eq. (17)): (a) maximum, (b) average, and (c) minimum reward over millions of steps.

Fig. 6: Average reward per agent: (a) SharedPolicy (reward defined in Eq. (15)); (b) central policy (reward defined in Eq. (16)); (c) edge policy (reward defined in Eq. (17)).
However, they all decrease the number of halting vehicles and the queuing time significantly (except the Rainbow configurations, which is most likely due to unsuitable parameter values). This indicates that with agent-controlled signals, vehicles experience far fewer stop-and-go waves and much less waiting time in queues, thus gaining considerable improvements in driving experience.

Multi-agent SharedPolicy-PPO, SharedPolicy-DQN and MultiPolicy-DQN provide the smoothest traffic control. Their performances show the smallest numbers of halting vehicles and the shortest queuing times and lengths while allowing higher driving speeds than the others. The performance of single-agent DQN is fairly good; however, it requires exponentially higher capacity and four times the training period of its multi-agent counterparts. ARS and ES also show competitive performances, although presenting either slightly slower speeds or more halting vehicles. Nevertheless, ARS and ES currently only support single-agent training and thus also require much longer training time. For example, to achieve the performance in Table II, ARS and ES need around 800K steps (approximately 253 minutes) while multi-agent DQN and dDQN only need around 240K steps (approximately 52 minutes). Similarly, PPO has a much longer convergence time compared to the multi-agent DQN algorithms, as shown in Fig. 4 and Fig. 5.

Overall, the training and traffic results align with Theorems 1 and 2 proposed in §IV, showing that decentralized multi-agent DQN can achieve optima similar to centralized single-agent DQN. Surprisingly, MultiPolicy has not provided observable benefits compared to SharedPolicy.
TABLE II: Traffic performance and required training time. N: Non-ML, S: SingleAgent-ML, M: MultiAgent-ML.

| Num agents | Algorithm | Halting vehicles | Queue time (s) | Queue length (m) | Speed (m/s) |
|---|---|---|---|---|---|
| N | Static | 95 ±40 | 14.23 ±10.25 | 19.82 ±12.17 | 13.39 ±3.20 |
| N | Actuated | | | | |
| S | ARS | | | | |
| S | ES | | | | |
| S | PPO | | | | |
| S | DQN | | | | |
| M | SharedPolicy-PPO | | | | |
| M | SharedPolicy-DQN | | | | |
| M | SharedPolicy-dDQN-rainbow | | | | |
| M | SharedPolicy-dDQN | | | | |
| M | MultiPolicy-PPO | | | | |
| M | MultiPolicy-DQN | | | | |
| M | MultiPolicy-dDQN-rainbow | | | | |
| M | MultiPolicy-dDQN | | | | |
TABLE III: Performance of SharedPolicy-dDQN on different map scales.

| Scale | Halting vehicles | Queue time (s) | Queue length (m) | Speed (m/s) | Training time/rollout (s) |
|---|---|---|---|---|---|
| 5 × 5 | | | | | |
| 10 × 10 | | | | | |
| 15 × 15 | | | | | |

Fig. 7: Average reward per agent for SharedPolicy-DQN on different scales (5 × 5, 10 × 10, and 15 × 15).

We will explore more varieties of MultiPolicy setups in future work. Meanwhile, decentralized multi-agent DQN outperforms centralized single-agent DQN in the following aspects: 1) it requires only 1/4 of the training time centralized DQN needs; 2) it requires exponentially less computation and memory capacity than centralized DQN. Next, we evaluate the scalability of DRLE by conducting tests on different map scales. Thereafter, we evaluate the communication overhead with a preliminary simulation.
C. Scalability
To illustrate the system scalability, we apply multi-agent SharedPolicy-DQN on various map scales. Fig. 7 shows the training performance of SharedPolicy-DQN on maps with 5 × 5, 10 × 10, and 15 × 15 intersections. Because of the fixed vehicle inflow rate (see §V-A), each intersection has fewer nearby vehicles in larger maps, resulting in shorter queue lengths and vehicle delays, which, in turn, leads to a higher reward. As such, the average reward per agent increases as the map scale increases. The traffic control performance summarized in Table III mostly confirms this reasoning. In particular, when we increase the size of the map, the number of halting vehicles per intersection decreases. Overall, DRLE performs well on different map scales.

D. Communication Overhead
DRLE leverages the ubiquity of the IoV and the foreseeable edge facilities in the near future to run the algorithms. In this subsection, we provide a preliminary evaluation of the communication overhead brought by this system architecture. We deploy a map consisting of 5 × 5 intersections, as shown in Fig. 3, in NS-3, a packet-level discrete-event network simulator for Internet systems. Because the mobility trace files generated from the training tests are not compatible with NS-3, we instead use the trace files to estimate the average number of active vehicles per step in the considered map, i.e., 230. Then we simulate 230 vehicles in NS-3 with the same parameters (speed, acceleration, etc.) as in the training tests.
TABLE IV: Communication parameters.

| Model | Protocol | Packet | Frequency | Hops |
|---|---|---|---|---|
| V2N [43] | UDP | 1500 B | 1 Hz | 1 |
TABLE V: Transmission delays (MAD: median absolute deviation).

| Direction | Mean (ms) | MAD (ms) |
|---|---|---|
| Uplink (vehicle-to-ES1) | 110.82 | 17.68 |
| Downlink (ES1-to-signal) | 106.23 | 0 |

As such, we argue that the transmission delay performance, albeit not identical to the performance in an integrated test, should be very similar from a statistical perspective. As described in §III-B, each vehicle sends 1500 B packets to a tri-sector eNB at the center of the map at 1 Hz via C-V2X (LTE) (Table IV). For a period of 1000 seconds (the default training period in the training tests), the results show that the uplink (vehicle-to-ES1) plus downlink (ES1-to-signal) delay is less than 240 ms per packet (Table V). As such, the transmission delay during each step is much smaller than the step duration and thus does not impact DRLE.

VII. CONCLUSION AND FUTURE WORK
In this work, we present DRLE, an integrated edge computing framework leveraging the ubiquity of the IoV to alleviate traffic congestion in real time at city scale. We decompose the highly complex centralized problem of large-area traffic light control into a multi-agent decentralized problem and prove its global optimality with concrete mathematical reasoning. DRLE exploits the low latency of edge servers to provide fast DQN training and control feedback. Thanks to its layered architecture and hierarchical algorithm, DRLE runs optimization at the Intersection, Intra-ES and Inter-ES levels, which allows for traffic light control on different scales. We present numerous comparisons to evaluate the traffic improvement brought by DRLE and show that, compared to the state-of-the-art baselines, DRLE decreases the convergence time by 65.66% compared to PPO and the number of training steps by 79.44% compared to ARS and ES. Besides, DRLE provides traffic control performance comparable to its centralized counterpart while requiring only 1/4 of the training time.

This paper explores the performance of the algorithm at the Intra-ES level. In future work, we will explore the Inter-ES algorithm and its impact on the performance of DRLE. We also look forward to investigating the potential of tuning the threshold parameter value in the Intersection-level algorithm instead of directly switching the traffic lights, to further decrease the action space and extend the scalability.

REFERENCES

[1] J. Contreras-Castillo, S. Zeadally, and J. A. Guerrero-Ibañez, "Internet of vehicles: architecture, protocols, and security," IEEE Internet of Things Journal, vol. 5, no. 5, pp. 3701–3709, 2017.
[2] C. Wu, X. Chen, T. Yoshinaga, Y. Ji, and Y. Zhang, "Integrating licensed and unlicensed spectrum in the internet of vehicles with mobile edge computing," IEEE Network, vol. 33, no. 4, pp. 48–53, 2019.
[3] Y. Dai, D. Xu, S. Maharjan, G. Qiao, and Y. Zhang, "Artificial intelligence empowered edge computing and caching for internet of vehicles," IEEE Wireless Communications, vol. 26, no. 3, pp. 12–18, 2019.
[4] M. Zhang, C. Chen, T. Wo, T. Xie, M. Z. A. Bhuiyan, and X. Lin, "Safedrive: online driving anomaly detection from large-scale vehicle data," IEEE Transactions on Industrial Informatics, vol. 13, no. 4, pp. 2087–2096, 2017.
[5] Y.-S. Chou, Y.-C. Mo, J.-P. Su, W.-J. Chang, L.-B. Chen, J.-J. Tang, and C.-T. Yu, "i-car system: A lora-based low power wide area networks vehicle diagnostic system for driving safety," IEEE, 2017, pp. 789–791.
[6] P. Zhou, W. Zhang, T. Braud, P. Hui, and J. Kangasharju, "Arve: Augmented reality applications in vehicle to edge networks," in Proceedings of the 2018 Workshop on Mobile Edge Communications. ACM, 2018, pp. 25–30.
[7] N. Lu, N. Cheng, N. Zhang, X. Shen, and J. W. Mark, "Connected vehicles: Solutions and challenges," IEEE Internet of Things Journal, vol. 1, no. 4, pp. 289–299, 2014.
[8] C. Wu, Z. Liu, F. Liu, T. Yoshinaga, Y. Ji, and J. Li, "Collaborative learning of communication routes in edge-enabled multi-access vehicular environment," IEEE Transactions on Cognitive Communications and Networking, 2020.
[9] J. Feng, Z. Liu, C. Wu, and Y. Ji, "Ave: Autonomous vehicular edge computing framework with aco-based scheduling," IEEE Transactions on Vehicular Technology, vol. 66, no. 12, pp. 10660–10675, 2017.
[10] N. Abbas, Y. Zhang, A. Taherkordi, and T. Skeie, "Mobile edge computing: A survey," IEEE Internet of Things Journal, vol. 5, no. 1, pp. 450–465, 2017.
[11] A. Alhilal, T. Braud, and P. Hui, "Distributed vehicular computing at the dawn of 5g: a survey," 2020.
[12] K. Zhang, S. Leng, X. Peng, L. Pan, S. Maharjan, and Y. Zhang, "Artificial intelligence inspired transmission scheduling in cognitive vehicular communications and networks," IEEE Internet of Things Journal, vol. 6, no. 2, pp. 1987–1997, 2018.
[13] K. Zhang, Y. Zhu, S. Leng, Y. He, S. Maharjan, and Y. Zhang, "Deep learning empowered task offloading for mobile edge computing in urban informatics," IEEE Internet of Things Journal, vol. 6, no. 5, pp. 7635–7647, 2019.
[14] K. Zhang, Y. Zhu, S. Maharjan, and Y. Zhang, "Edge intelligence and blockchain empowered 5g beyond for the industrial internet of things," IEEE Network, vol. 33, no. 5, pp. 12–19, 2019.
[15] A. Barisone and D. Giglio, "A macroscopic traffic model for real-time optimization of signalized urban areas," pp. 900–903, 2002.
[16] M. Dotoli, M. P. Fanti, and C. Meloni, "A signal timing plan formulation for urban traffic control," Control Engineering Practice, vol. 14, no. 11, pp. 1297–1311, 2006.
[17] P. Coll, P. Factorovich, I. Loiseau, and R. Gómez, "A linear programming approach for adaptive synchronization of traffic signals," International Transactions in Operational Research, vol. 20, no. 5, pp. 667–679, 2013.
[18] P. Balaji, X. German, and D. Srinivasan, "Urban traffic signal control using reinforcement learning agents," IET Intelligent Transport Systems, vol. 4, no. 3, pp. 177–188, 2010.
[19] T. Chu, J. Wang, L. Codecà, and Z. Li, "Multi-agent deep reinforcement learning for large-scale traffic signal control," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 3, pp. 1086–1095, 2019.
[20] L. Li, Y. Lv, and F.-Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.
[21] J. N. Tsitsiklis, "Asynchronous stochastic approximation and q-learning," Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.
[22] E. Bingham, "Reinforcement learning in neurofuzzy traffic signal control," European Journal of Operational Research, vol. 131, no. 2, pp. 232–241, 2001.
[23] B. Abdulhai, R. Pringle, and G. J. Karakoulas, "Reinforcement learning for true adaptive traffic signal control," Journal of Transportation Engineering, vol. 129, no. 3, pp. 278–285, 2003.
[24] D. de Oliveira, A. L. Bazzan, B. C. da Silva, E. W. Basso, L. Nunes, R. Rossetti, E. de Oliveira, R. da Silva, and L. Lamb, "Reinforcement learning based control of traffic lights in non-stationary environments: A case study in a microscopic simulator," in EUMAS, 2006.
[25] K. Dresner and P. Stone, "Multiagent traffic management: A reservation-based intersection control mechanism," in Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2, 2004, pp. 530–537.
[26] P. Zhou, W. Zhang, T. Braud, P. Hui, and J. Kangasharju, "Enhanced augmented reality applications in vehicle-to-edge networks," IEEE, 2019, pp. 167–174.
[27] P. Zhou, T. Braud, A. Alhilal, P. Hui, and J. Kangasharju, "Erl: Edge based reinforcement learning for optimized urban traffic light control," IEEE, 2019, pp. 849–854.
[28] C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, "Flow: Architecture and benchmarking for reinforcement learning in traffic control," arXiv preprint arXiv:1710.05465, 2017.
[29] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan et al., "Ray: A distributed framework for emerging AI applications," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018, pp. 561–577.
[30] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, J. Gonzalez, K. Y. Goldberg, and I. Stoica, "Ray rllib: A composable and scalable reinforcement learning library," ArXiv, vol. abs/1712.09381, 2017.
[31] N. R. Moloisane, R. Malekian, and D. Capeska Bogatinoska, "Wireless machine-to-machine communication for intelligent transportation systems: Internet of vehicles and vehicle to grid," 2017, pp. 411–415.
[32] M. Gerla, E.-K. Lee, G. Pau, and U. Lee, "Internet of vehicles: From intelligent grid to autonomous cars and vehicular clouds," IEEE, 2014, pp. 241–246.
[33] P. M. Kumar, G. Manogaran, R. Sundarasekar, N. Chilamkurti, R. Varatharajan et al., "Ant colony optimization algorithm with internet of vehicles for intelligent traffic control system," Computer Networks, vol. 144, pp. 154–162, 2018.
[34] T. S. Darwish and K. A. Bakar, "Fog based intelligent transportation big data analytics in the internet of vehicles environment: motivations, architecture, challenges, and critical issues," IEEE Access, vol. 6, pp. 15679–15701, 2018.
[35] M. Chen, Y. Tian, G. Fortino, J. Zhang, and I. Humar, "Cognitive internet of vehicles," Computer Communications, vol. 120, pp. 58–70, 2018.
[36] V. Tyagi, S. Darbha, and K. Rajagopal, "A review of the mathematical models for traffic flow," International Journal of Advances in Engineering Sciences and Applied Mathematics, vol. 1, no. 1, pp. 53–68, 2009.
[37] S. P. Hoogendoorn and P. H. Bovy, "State-of-the-art of vehicular traffic flow modelling," Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, vol. 215, no. 4, pp. 283–303, 2001.
[38] N. Eissfeldt, "Vehicle-based modelling of traffic," Ph.D. dissertation, Universität zu Köln, 2004.
[39] T. Chu, S. Qu, and J. Wang, "Large-scale traffic grid signal control with regional reinforcement learning," IEEE, 2016, pp. 815–820.
[40] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (marlin-atsc): methodology and large-scale application on downtown toronto," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1140–1150, 2013.
[41] T. Tan, F. Bao, Y. Deng, A. Jin, Q. Dai, and J. Wang, "Cooperative deep reinforcement learning for large-scale traffic grid signal control," IEEE Transactions on Cybernetics, vol. 50, no. 6, pp. 2687–2700, 2019.
[42] A. Papathanassiou and A. Khoryaev, "Cellular v2x as the essential enabler of superior global connected transportation services," IEEE 5G Tech Focus, vol. 1, no. 2, pp. 1–2, 2017.
[43] 3GPP, "Study on lte support for vehicle-to-everything (v2x) services," 3rd Generation Partnership Project (3GPP), Technical Report (TR) 22.885, 2015, version 14.0.0. [Online]. Available: https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=2898
[44] D. Krajzewicz, E. Brockfeld, J. Mikat, J. Ringel, C. Rossel, W. Tuchscheerer, P. Wagner, and R. Wosler, "Simulation of modern traffic lights control systems using the open source traffic simulation SUMO," in Proceedings of the 3rd Industrial Simulation Conference, Berlin, Germany, pp. 229–302, 2005.
[45] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[46] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[47] M. G. Bellemare, W. Dabney, and R. Munos, "A distributional perspective on reinforcement learning," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML'17. JMLR.org, 2017, pp. 449–458.
[48] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, "Rainbow: Combining improvements in deep reinforcement learning," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[49] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, "Recent development and applications of SUMO - Simulation of Urban MObility," International Journal On Advances in Systems and Measurements, vol. 5, no. 3&4, pp. 128–138, December 2012.
[50] 3GPP, "Study on lte-based v2x services," 3rd Generation Partnership Project (3GPP), Technical Report (TR) 36.885, 2016, version 14.0.0. [Online]. Available: https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=2934
[51] SUMO, "Simulation/traffic lights," 2020. [Online]. Available: https://sumo.dlr.de/docs/Simulation/Traffic_Lights.html
[52] H. Mania, A. Guy, and B. Recht, "Simple random search provides a competitive approach to reinforcement learning," CoRR, vol. abs/1803.07055, 2018. [Online]. Available: http://arxiv.org/abs/1803.07055
[53] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, "Evolution strategies as a scalable alternative to reinforcement learning," arXiv preprint arXiv:1703.03864, 2017.
[54] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.