DRLE: Decentralized Reinforcement Learning at the Edge for Traffic Light Control in the IoV
Pengyuan Zhou, Member, IEEE, Xianfu Chen, Member, IEEE, Zhi Liu, Senior Member, IEEE, Tristan Braud, Member, IEEE, Pan Hui, Fellow, IEEE, and Jussi Kangasharju, Member, IEEE
Abstract—The Internet of Vehicles (IoV) enables real-time data exchange among vehicles and roadside units and thus provides a promising solution to alleviate traffic jams in urban areas. Meanwhile, better traffic management via efficient traffic light control can benefit the IoV as well by enabling a better communication environment and decreasing the network load. As such, the IoV and efficient traffic light control can form a virtuous cycle. Edge computing, an emerging technology that provides low-latency computation capabilities at the edge of the network, can further improve the performance of this cycle. However, while the collected information is valuable, an efficient solution for better utilization and faster feedback has yet to be developed for the edge-empowered IoV. To this end, we propose Decentralized Reinforcement Learning at the Edge for traffic light control in the IoV (DRLE). DRLE exploits the ubiquity of the IoV to accelerate the collection of traffic data and its interpretation towards alleviating congestion and providing better traffic light control. DRLE operates within the coverage of the edge servers and uses aggregated data from neighboring edge servers to provide city-scale traffic light control. DRLE decomposes the highly complex problem of large-area control into a decentralized multi-agent problem. We prove its global optimality with concrete mathematical reasoning. The proposed decentralized reinforcement learning algorithm running at each edge node adapts the traffic lights in real time. We conduct extensive evaluations and demonstrate the superiority of this approach over several state-of-the-art algorithms.
Index Terms—Edge Computing, Multi-agent Deep Reinforcement Learning, Internet of Vehicles, Traffic Light Control
I. INTRODUCTION
The Internet of Vehicles (IoV) [1]–[3] allows data exchange among vehicles (V2V), roadside units (RSUs) (V2I), and other commutable devices on roads or remote resources distributed over the Internet. It can facilitate and enable a wide variety of applications such as driving habit monitoring, driving operation recommendation, and emergency notification [4]–[6]. The IoV leverages an ever-increasing number of vehicles connected to the Internet and has significant potential to alleviate the continuously rising traffic congestion, which has dramatic consequences on the environment as well as the well-being of citizens. Thanks to its high-speed wireless connectivity, the IoV enables data collection from vehicles in real time [7], [8]. On a related note, edge computing [9], [10] has emerged in recent years as a solution to extend the capacity of remote cloud services towards nearby end users. Edge computing is at the core of most future networking paradigms, as its characteristics make it an ideal candidate for time-sensitive and highly mobile applications such as those encountered in the IoV [11]–[14]. Co-located with base stations or RSUs, edge computing nodes can process the IoV data and respond to traffic jams and anomalies. By connecting traffic signals to the IoV, it becomes possible to empower signal timing plans with real-time traffic information and exploit the intelligence at the edge to react to unforeseen congestion. However, while the collected information at the edge is valuable, an efficient solution for better utilization and faster feedback has yet to be developed for the large-scale edge-empowered IoV.

Most of the related solutions follow one of two directions: 1) utilize linear programming at intersections for fast adaptation of the signal plan [15]–[17]; 2) deploy machine learning to directly control the traffic lights or adapt the phase duration [18]–[20]. The first direction lacks exploration or learning abilities and strongly depends on the availability of accurate objective functions and constraints; hence many research efforts focus on the second direction, including single-agent [21], [22] and multi-agent systems [19], [23]–[25]. Single-agent solutions suffer from huge state and action spaces. Multi-agent solutions, on the other hand, can decrease the state and action spaces. Nevertheless, we notice three major concerns regarding the works in this direction:

1) Lack of practical solutions to bridge the technology gap between machine learning algorithms and deployability in real-life smart city scenarios.

2) Lack of solid theoretical analysis to prove the optimal performance of decentralized training.

3) Lack of extensive tests with credible simulator platforms to show the benefit of decentralized learning in traffic light control with reusable results.

In this work, we propose Decentralized Reinforcement Learning at the Edge for traffic light control in the IoV (DRLE). Following a hierarchy similar to our previous works [26], [27], we build a new system model with a focus on multi-agent training with rich IoV data. This integrated framework leverages the real-time data collection from connected vehicles to optimize traffic light control from the perspective of hierarchical levels. Each level optimizes its coverage with edge servers running a level-specific algorithm, of which one or several key parameters are tuned by the upper level's algorithm in real time. The decentralized architecture of DRLE relies on a pervasive deployment of edge servers, including signal control units at intersections and edge servers co-located with base stations and aggregation points. In this work, we use the terms traffic signal and traffic light interchangeably.
DRLE decomposes the highly complex problem of large-area control into a decentralized multi-agent problem. We prove its global optimality with concrete mathematical reasoning (§IV). We build our algorithm with credible open-source platforms [28], [29] and a reinforcement learning library [30] (§V). We conduct extensive evaluations and demonstrate the superiority of this approach over several state-of-the-art algorithms (§VI). Specifically, DRLE decreases the convergence time by 65.66% compared to Proximal Policy Optimization and the number of training steps by 79.44% compared to Augmented Random Search and Evolutionary Strategies. Besides, DRLE exponentially reduces the action space and provides comparable traffic control performance within only 1/4 of the training time of its centralized counterpart.

The rest of the paper is structured as follows. We give an overview of related works in §II. In §III we present the system design and traffic model. We describe the theoretical details of our algorithm in §IV and show our evaluation setup and results in §V and §VI, respectively. Finally, §VII concludes the paper.

II. RELATED WORK
Researchers have put a lot of effort into optimizing traffic light control. Major solutions include linear programming and machine learning. Meanwhile, recent proposals have started to look at the potential of rich IoV data for traffic management. In this section, we give an overview of related works.
IoV.
Powered by fast-developing vehicular networking techniques, researchers have proposed solutions to utilize rich IoV data for traffic control [31], [32]. Kumar et al. proposed to apply ant colony algorithms to help vehicles find the optimal routes [33]. Darwish et al. focused on real-time big data analytics in the IoV environment powered by fog computing [34]. Chen et al. targeted enhancing transportation safety and network security by mining effective information from both the physical and network data spaces [35]. However, the potential of utilizing rich IoV data specifically for traffic light control has not been fully explored.
Linear programming.
Researchers have proposed traffic light control solutions based on traffic models (microscopic, mesoscopic and macroscopic) [36]–[38]. These solutions utilize linear programming to solve the objective functions [15]–[17]. Although linear programming is straightforward and has low latency, it depends on accurate objective functions and constraints. Moreover, it lacks the exploration or learning abilities needed to scale to large-area optimization and complicated scenarios.
Reinforcement Learning.
Instead of trying to build explicit traffic flow models, machine learning proposals learn the traffic patterns and achieve optimal policies by iteratively adapting actions with the goal of maximizing the cumulative reward [20], [23]. Many proposals have applied single-agent or multi-agent reinforcement learning to optimize traffic light control. Single-agent solutions have large state and action spaces and thus require a large capacity to calculate optimal signal phases [18]. Therefore, researchers nowadays tend to use multi-agent algorithms to divide larger problems into smaller sub-problems. For example, Chu et al. proposed in [39] to reduce the action space by dynamically partitioning the traffic grid into smaller regions and deploying a local agent in each region. They applied multi-agent reinforcement learning to A2C for large-scale traffic light control in [19]. Li et al. used deep Q-learning (DQL) to control traffic lights and proposed deploying a deep stacked autoencoder (SAE) neural network to reduce the huge state space brought by the tabular Q-learning method [20]. Balaji et al. proposed to control traffic lights with distributed agents at each intersection [18]. El-Tantawy et al. explored coordinated agents to let intersections conduct signal control actions in cooperation with neighbors [40]. Based on [39], Tan et al. further proposed to concatenate the latent states of local agents to form the global action-value function [41]. Overall, we find the existing solutions fall short in the following aspects and propose DRLE to improve on them.

• Optima proof.
We rarely find multi-agent approaches for traffic light control that mathematically prove the optimality of multi-agent reinforcement learning.

• The gap between technique and reality. Most multi-agent learning approaches do not provide deployable solutions to apply the algorithms to real-life smart cities.

• Extensive tests with open-source platforms. Many related works have conducted simulations with vehicular traffic simulators and self-developed machine learning scripts limited to the proposals. We find those results hard to reuse due to the lack of credible open-source platforms.

In this work, we address those shortcomings by proposing a deployable decentralized solution and proving its optimality with mathematical reasoning and extensive reusable tests.

III. SYSTEM
In this section, we first present the design of DRLE and the communication mechanism. Then, we describe the traffic model as the basis of the algorithm.
A. System Design
DRLE revolves around the device layer and the edge layer, as shown in detail in Fig. 1 and at a high level in Fig. 2. The device layer includes the vehicles, traffic signals, pedestrian devices, RSUs and other devices involved in the IoV. In the rest of this paper, we assume each traffic signal is connected with a control unit. Each control unit employs video cameras facing all directions to collect real-time traffic data and transmits it via a wireless network to a nearby edge server. The edge layer hosts the edge servers in two tiers. The first-tier servers (ES1) are co-located with the base stations at the radio access network. The second-tier servers (ES2) are co-located with the aggregation points in the core network. This scheme agrees with the Multi-access Edge Computing standard proposed by the European Telecommunications Standards Institute. Please refer to [26] for the detailed edge server deployment strategy. Each ES2 collects data from nearby ES1 servers, provides a larger scale of service, and sends backups to the cloud data center. Next, we briefly describe the communication mechanism and overhead.

B. Communications
Each ES1 feeds the collected data into local reinforcement learning and sends the actions to the signal control units in real time.

Fig. 1: System overview. In each Intra-ES subsystem, an ES1 (MEC) uses decentralized multi-agent learning to train on the collected IoV data and sends commands to control the traffic lights. ES2 tunes the parameters of the multi-agent algorithms running in each ES1 (Intra-ES subsystem).
Fig. 2: System layers.

An ES1 collects the data by communicating with the devices (connected vehicles, signal control units and RSUs) via Cellular Vehicle-to-Everything (C-V2X) [42] and operates within a coverage defined by the range of its co-located base station. For simplicity, we do not consider RSUs and pedestrians in the rest of the paper.

Actions are primarily light-switching commands, which can be encapsulated in small data packets. These packets are sent only to a limited number of signals; hence, the majority of data transmission is between the connected vehicles and the ES1 via Vehicle-to-Network (V2N). As suggested by the 3GPP standard [43], each message should be sent at a frequency between 0.1 Hz and 1 Hz with a payload between 50 bytes and 300 bytes. We assume a message frequency of 1 Hz to align with the agent learning rate in the evaluation (see §V). We assume that each message has the size of a maximum transmission unit (1500 bytes), which is sufficient to contain the required data (speed, location, and direction of travel of the vehicle). Note that other transmission mechanisms such as V2V and V2I are performed via a different radio (e.g., ITS 5.9 GHz) than V2N (licensed mobile bands, e.g., 700 MHz) and are not necessary for DRLE.

Since the transmissions between the vehicles and the ES1 base stations are only one-hop and unidirectional (uplink), the communications require little networking capacity. Hence, they have a limited overhead and impact on the overall C-V2X environment. We present a preliminary evaluation in §VI-D and show that the end-to-end delay, consisting of the vehicle-to-ES1 transmission delay and the ES1-to-signal transmission delay, is much smaller than a training step duration (1 second). A training step is the period over which one gradient update happens; we define each step as a 1-second period. The delay of each gradient update is at the millisecond level; in other words, each gradient update takes several milliseconds and then waits until the next second to conduct the next update.
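As a rough sanity check of this small-overhead claim, the aggregate uplink load at one ES1 follows directly from the message size and frequency above. The minimal sketch below is illustrative only; the count of 230 active vehicles is the per-step average reported later in §VI-D.

```python
# Rough estimate of the V2N uplink load at one ES1, using the message
# parameters above (one 1500-byte message per vehicle per second).
MSG_SIZE_BYTES = 1500   # one maximum transmission unit per message
MSG_FREQ_HZ = 1.0       # message frequency per vehicle
NUM_VEHICLES = 230      # average active vehicles per step (cf. §VI-D)

bits_per_vehicle = MSG_SIZE_BYTES * 8 * MSG_FREQ_HZ   # 12 kbit/s
total_uplink_bps = bits_per_vehicle * NUM_VEHICLES    # ~2.76 Mbit/s

print(f"per-vehicle uplink: {bits_per_vehicle / 1e3:.1f} kbit/s")
print(f"aggregate uplink at one ES1: {total_uplink_bps / 1e6:.2f} Mbit/s")
```

Even at this worst-case payload, the aggregate load stays in the low megabit-per-second range, which supports the limited-overhead argument.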
C. Traffic Model and Problem
We assume the traffic lights in the urban area pertain to a common signal timing plan characterized by a fixed cycle containing a fixed number of phases. A phase refers to the time duration of the green lights for a given direction. Let $\mathcal{L}$ be the set of links in a signalized urban area. Then, $\mathcal{L}^{(\mathrm{in})}$ ($\mathcal{L}^{(\mathrm{out})}$) is the set of input (output) links of the area. To allow a smooth driving experience, we introduce two parameters, i.e., the number of halting vehicles and the speed-lag. A vehicle moving at a speed slower than 0.1 m/s is considered halting. Speed-lag is defined as the speed difference between the actual driving speed and the maximum speed permitted by statute. We use the speed-lag instead of the commonly used speed to account for the maximum speed limit in reality. We formulate the problem as minimizing the overall number of halting vehicles and the speed-lag in the area over the whole optimization horizon. For the coverage of an ES1, the objective of the optimization problem can be formulated as

$$\bar{P} = \lim_{K \to \infty} \mathbb{E}\left[ \frac{1}{K} \sum_{k=1}^{K} \sum_{m \in \mathcal{L}^{(\mathrm{in})} \setminus \mathcal{L}^{(\mathrm{out})}} \left( -w_1 \cdot H_m^k - w_2 \cdot \Delta V_m^k \right) \right], \quad (1)$$

which can also be approximated as

$$P = \mathbb{E}\left[ (1-\gamma) \cdot \sum_{k=1}^{\infty} \gamma^{k-1} \cdot \sum_{m \in \mathcal{L}^{(\mathrm{in})} \setminus \mathcal{L}^{(\mathrm{out})}} \left( -w_1 \cdot H_m^k - w_2 \cdot \Delta V_m^k \right) \right], \quad (2)$$

if the discount factor $\gamma \in [0, 1)$ approaches 1. Herein, $H_m^k$ is the number of halting vehicles and $\Delta V_m^k$ is the average speed-lag of the vehicles in lane $m$ at the beginning of the $k$-th cycle; $w_1$ and $w_2$ are two positive weighting constants. Although the model describes the problem from a global view, each internal intersection can have a varied impact on the optimization. Therefore, it is better to disassemble the optimization objective over the multiple intersections, namely,

$$P = \mathbb{E}\left[ (1-\gamma) \cdot \sum_{k=1}^{\infty} \gamma^{k-1} \cdot \sum_{c=1}^{C} \sum_{m \in \mathcal{L}_c^{(\mathrm{in})} \setminus \mathcal{L}_c^{(\mathrm{out})}} \left( -w_1 \cdot H_m^k - w_2 \cdot \Delta V_m^k \right) \right], \quad (3)$$

where $c$ is an intersection index, $C$ refers to the total number of intersections in the area, and $\mathcal{L}_c^{(\mathrm{in})}$ ($\mathcal{L}_c^{(\mathrm{out})}$) represents the set of input (output) links of intersection $c$. In this work, we propose to decompose the optimization problem and solve it with a decentralized reinforcement learning algorithm, as described in §IV.
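For illustration, the per-cycle summand of Eqs. (1)–(3) can be computed directly from per-lane observations. The following minimal sketch assumes illustrative lane data and unit weights $w_1 = w_2 = 1$.

```python
# Sketch: the per-cycle summand of Eqs. (1)-(3) for one ES1 coverage area.
# halting[m] is H_m^k and speed_lag[m] is ΔV_m^k for each input (non-output)
# lane m; w1 and w2 are the positive weighting constants of the model.
def step_objective(halting, speed_lag, w1=1.0, w2=1.0):
    """Return sum_m (-w1 * H_m - w2 * ΔV_m) over the observed lanes."""
    return sum(-w1 * h - w2 * dv for h, dv in zip(halting, speed_lag))

# Example: three lanes with (halting count, average speed-lag in m/s).
print(step_objective(halting=[4, 0, 2], speed_lag=[8.3, 0.5, 3.1]))
```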
IV. ALGORITHM

A. Hierarchical Algorithm
The system includes three parallel and interactive algorithms running at three levels, i.e., Intersection, Intra-ES and Inter-ES, performed by the signal control units, ES1 and ES2, respectively (Fig. 1). At the Intersection level, each signal adapts its phases with a threshold-based algorithm. The threshold can be the average queuing length or the average space headway. At the Intra-ES level, each ES1 runs a multi-agent reinforcement learning algorithm to participate in the traffic light switching directly. At the Inter-ES level, an ES2 runs a threshold-based algorithm to optimize the urban traffic by tuning the reinforcement learning rate in each ES1. We briefly describe the Intersection and Inter-ES levels and detail the decomposed reinforcement learning algorithm at the Intra-ES level.
Intersection level. The control unit of each intersection signal runs a threshold-based algorithm similar to [44]. Each control unit employs cameras to capture traffic jams and adapts the phase duration based on predefined parameters. When the parameters surpass the thresholds, the control unit extends the green phase for the more jammed direction (e.g., with longer queues or smaller space headway) and decreases the green phase for the other direction by an equivalent length.
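A minimal sketch of such a threshold rule is given below; the threshold, increment, and minimum green values are illustrative assumptions rather than the parameters of [44].

```python
# Sketch of the Intersection-level threshold rule: if one direction is
# sufficiently more jammed than the other (here measured by queue length),
# extend its green phase and shorten the other's by the same amount.
def adapt_phase(green_ns, green_ew, queue_ns, queue_ew,
                threshold=5, delta=3, min_green=5):
    if queue_ns - queue_ew > threshold and green_ew - delta >= min_green:
        green_ns, green_ew = green_ns + delta, green_ew - delta
    elif queue_ew - queue_ns > threshold and green_ns - delta >= min_green:
        green_ns, green_ew = green_ns - delta, green_ew + delta
    return green_ns, green_ew

print(adapt_phase(green_ns=20, green_ew=20, queue_ns=12, queue_ew=3))
# -> (23, 17): the more jammed north-south direction gains green time.
```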
Inter-ES level. An ES2 tunes the urban traffic by adapting the reinforcement learning rate in each ES1 based on a threshold-based algorithm similar to [44].

Intra-ES level.
Each ES1 optimizes the internal traffic within its coverage, defined by the co-located base station, by switching the traffic lights directly. We adopt DQL to provide an adaptive algorithm that responds to the dynamically changing traffic conditions. The advantages of Q-learning for traffic light control are described in more detail in a study by Abdulhai et al. [23], including not requiring a prespecified model of the environment and being adaptive and unsupervised. We define the state $s^k$, action $a^k$ and reward $R^k$ as follows.

• $s^k$ is the state at the $k$-th cycle, including the number of halting vehicles $H^k = \{H_m^k \mid m = 1, \cdots, M\}$, the speed-lag of the vehicles $\Delta V^k = \{\Delta V_m^k \mid m = 1, \cdots, M\}$, and the traffic light states $\theta^k = \{\theta_c^k \mid c = 1, \cdots, C\}$, where $M$ is the total number of lanes in the area.

• $a^k$ is the action operated by the agents after observing $s^k$. More specifically, $a^k = \{a_c^k \mid c = 1, \cdots, C\}$, where $a_c^k \in \{0, 1\}$ represents the decision of switching the traffic light.

• $R^k$ is the reward, defined as the additive inverse of the average speed-lag and the number of halting vehicles, namely, $R^k = \sum_{m=1}^{M} \left( -w_1 \cdot H_m^k - w_2 \cdot \Delta V_m^k \right)$.
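The following minimal sketch shows how $s^k$ and $R^k$ could be assembled from raw per-vehicle speed observations; the lane records and unit weights are illustrative placeholders, and the 60 m/s cap matches the speed limit used later in §V-A.

```python
# Sketch: assembling the state s^k and reward R^k defined above from raw
# per-vehicle speed observations.
HALT_SPEED = 0.1  # m/s; below this a vehicle counts as halting (§III-C)

def lane_features(speeds, v_max):
    """speeds: list of current vehicle speeds (m/s) in one lane."""
    halting = sum(v < HALT_SPEED for v in speeds)           # H_m^k
    speed_lag = (sum(v_max - v for v in speeds) / len(speeds)
                 if speeds else 0.0)                        # ΔV_m^k
    return halting, speed_lag

def build_state_and_reward(lanes, light_states, v_max=60.0, w1=1.0, w2=1.0):
    feats = [lane_features(speeds, v_max) for speeds in lanes]
    H = tuple(h for h, _ in feats)
    dV = tuple(dv for _, dv in feats)
    state = (H, dV, tuple(light_states))                    # s^k
    reward = sum(-w1 * h - w2 * dv for h, dv in feats)      # R^k
    return state, reward

state, reward = build_state_and_reward(
    lanes=[[12.0, 0.0, 3.2], [25.5, 30.1]], light_states=[0, 1])
```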
Agent Goal. The overall goal of DRLE is to optimize the light control to smooth the traffic. To do so, the agents of an ES1 need to find a control policy $\pi$ that maximizes

$$Q^{\pi}(s, a) = (1-\gamma) \cdot \mathbb{E}\left[ \sum_{k=1}^{\infty} \gamma^{k-1} \cdot R^k \,\Big|\, s^1 = s, a^1 = a \right], \quad (4)$$

which is also termed a Q-function. A control policy $\pi$ can be defined as a mapping $a = \pi(s)$. To put it another way, the goal is to solve

$$\pi^* = \arg\max_{\pi} Q^{\pi}(s, \pi(s)), \quad \forall s. \quad (5)$$

For notational convenience, we denote $Q(s, a) = Q^{\pi^*}(s, a)$, $\forall (s, a)$. Using the state-action-reward-state-action (SARSA) algorithm [45], the optimal Q-function can be found in an iterative on-policy manner. The centralized decision making at a cycle is executed at the signals independently, based on which we linearly decompose the Q-function,

$$Q(s, a) = \sum_{c=1}^{C} Q_c(s, a_c), \quad (6)$$

where $Q_c(s, a_c)$ is defined to be the per-signal Q-function given by

$$Q_c(s, a_c) = (1-\gamma) \cdot \mathbb{E}\left[ \sum_{k=1}^{\infty} \gamma^{k-1} \cdot R_c^k \,\Big|\, s^1 = s, a_c^1 = a_c \right]. \quad (7)$$

Herein, $a_c^k$ and $R_c^k$ are, respectively, the action and the reward for the agent at intersection $c$ at cycle $k$. We emphasize that the decision makings across the cycles are performed at each signal in accordance with the optimal control policy implemented by the ES1. In other words, $\forall s$,

$$\pi^*(s) = \arg\max_{a = (a_c : c = 1, \cdots, C)} \sum_{c=1}^{C} Q_c(s, a_c). \quad (8)$$
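A useful consequence of the decomposition in Eq. (6) is that the joint argmax in Eq. (8) splits into $C$ independent one-dimensional maximizations, since each term $Q_c(s, a_c)$ depends only on its own binary action. A minimal tabular sketch, with dictionary-backed Q-tables as an implementation assumption:

```python
# Joint action selection under Eq. (8): because Q(s, a) is a sum of
# per-signal terms Q_c(s, a_c), maximizing over the joint action reduces
# to maximizing each term independently over a_c in {0, 1}.
def select_joint_action(q_tables, state):
    """q_tables[c] maps (state, action) -> Q_c(s, a_c)."""
    return [max((0, 1), key=lambda a: q_c.get((state, a), 0.0))
            for q_c in q_tables]

# Example with C = 2 signals: signal 0 prefers switching, signal 1 holding.
q_tables = [{("s0", 0): -4.0, ("s0", 1): -1.5},
            {("s0", 0): -2.0, ("s0", 1): -3.0}]
print(select_joint_action(q_tables, "s0"))  # -> [1, 0]
```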
B. Optimization and Convergence Guarantee

Theorem 1 (Optimization Guarantee).
The linear Q-function decomposition approach in Eq. (6) asserts the expected optimal long-term performance.
Proof.
For the Q-function of a centralized decision making $a$ under a global network state $s$, we have

$$Q(s, a) = (1-\gamma) \cdot \mathbb{E}\left[ \sum_{k=1}^{\infty} \gamma^{k-1} R^k \,\Big|\, s^1 = s, a^1 = a \right] = (1-\gamma) \cdot \mathbb{E}\left[ \sum_{k=1}^{\infty} \gamma^{k-1} \sum_{c=1}^{C} R_c^k \,\Big|\, s^1 = s, a^1 = a \right] = \sum_{c=1}^{C} (1-\gamma) \cdot \mathbb{E}\left[ \sum_{k=1}^{\infty} \gamma^{k-1} R_c^k \,\Big|\, s^1 = s, a^1 = a \right] = \sum_{c=1}^{C} Q_c(s, a_c), \quad (9)$$

which completes the proof. $\Box$

Therefore, instead of learning the global Q-function, the SARSA updating rule is slightly adapted for each signal to

$$Q_c^{k+1}(s, a_c) = \left(1 - \alpha^k\right) \cdot Q_c^k(s, a_c) + \alpha^k \cdot \left( (1-\gamma) \cdot R_c^k + \gamma \cdot Q_c^k(s', a_c') \right), \quad (10)$$

where $\alpha^k \in [0, 1]$ is the learning rate. Theorem 2 ensures the convergence of the decentralized learning process.
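A minimal tabular sketch of the per-signal update in Eq. (10) follows; the defaultdict storage and the example transition values are implementation assumptions.

```python
from collections import defaultdict

# One application of Eq. (10) for signal c on the observed transition
# (s, a_c, R_c, s', a_c'); note the (1 - gamma) scaling of the reward,
# which matches the normalized Q-function of Eq. (7).
def sarsa_update(q_c, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    target = (1 - gamma) * r + gamma * q_c[(s_next, a_next)]
    q_c[(s, a)] = (1 - alpha) * q_c[(s, a)] + alpha * target

q_c = defaultdict(float)                 # Q_c(s, a_c), zero-initialized
sarsa_update(q_c, s="s0", a=1, r=-3.5, s_next="s1", a_next=0)
print(q_c[("s0", 1)])                    # the updated Q-value
```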
The sequence $\{(Q_c^k(s, a_c) : \forall (s, a_c), \forall c \in \{1, \cdots, C\}) : k\}$ generated by Eq. (10) surely converges to the per-signal Q-functions $(Q_c(s, a_c) : \forall (s, a_c), \forall c \in \{1, \cdots, C\})$ if and only if, for each signal $c \in \{1, \cdots, C\}$, the $(s, a_c)$-pairs are visited an infinite number of times.

Proof.
Since the per-signal Q-functions are learned simultaneously, we consider monolithic updates during the decentralized learning process. That is, the iterative rule in Eq. (10) can be encapsulated as

$$\sum_{c=1}^{C} Q_c^{k+1}(s, a_c) = \left(1 - \alpha^k\right) \cdot \sum_{c=1}^{C} Q_c^k(s, a_c) + \alpha^k \cdot \left( (1-\gamma) \cdot \sum_{c=1}^{C} R_c + \gamma \cdot \sum_{c=1}^{C} Q_c^k(s', a_c') \right). \quad (11)$$

Subtracting the sum of the per-signal Q-functions from both sides of Eq. (11) leads to

$$\sum_{c=1}^{C} Q_c^{k+1}(s, a_c) - \sum_{c=1}^{C} Q_c(s, a_c) = \left(1 - \alpha^k\right) \cdot \left( \sum_{c=1}^{C} Q_c^k(s, a_c) - \sum_{c=1}^{C} Q_c(s, a_c) \right) + \alpha^k \cdot T^k(s, a_c), \quad (12)$$

where

$$T^k(s, a_c) = (1-\gamma) \cdot \sum_{c=1}^{C} R_c + \gamma \cdot \max_{a''} \sum_{c=1}^{C} Q_c^k(s', a_c'') - \sum_{c=1}^{C} Q_c(s, a_c) + \gamma \cdot \left( \sum_{c=1}^{C} Q_c^k(s', a_c') - \max_{a''} \sum_{c=1}^{C} Q_c^k(s', a_c'') \right). \quad (13)$$

We let $\Delta^k$ denote the history of the first $k$ cycles of the decentralized learning process. The per-signal Q-functions are $\Delta^k$-measurable; thus both $\left( \sum_{c=1}^{C} Q_c^{k+1}(s, a_c) - \sum_{c=1}^{C} Q_c(s, a_c) \right)$ and $T^k(s, a_c)$ are $\Delta^k$-measurable. We then attain

$$\begin{aligned} \left\| \mathbb{E}\left[ T^k(s, a_c) \mid \Delta^k \right] \right\|_\infty &\le \left\| \mathbb{E}\left[ (1-\gamma) \cdot \sum_{c=1}^{C} R_c + \gamma \cdot \max_{a''} \sum_{c=1}^{C} Q_c^k(s', a_c'') - \sum_{c=1}^{C} Q_c(s, a_c) \,\Big|\, \Delta^k \right] \right\|_\infty \\ &\quad + \left\| \mathbb{E}\left[ \gamma \cdot \left( \sum_{c=1}^{C} Q_c^k(s', a_c') - \max_{a''} \sum_{c=1}^{C} Q_c^k(s', a_c'') \right) \Big|\, \Delta^k \right] \right\|_\infty \\ &\overset{(a)}{\le} \gamma \cdot \left\| \sum_{c=1}^{C} Q_c^k(s, a_c) - \sum_{c=1}^{C} Q_c(s, a_c) \right\|_\infty + \left\| \mathbb{E}\left[ \gamma \cdot \left( \sum_{c=1}^{C} Q_c^k(s', a_c') - \max_{a''} \sum_{c=1}^{C} Q_c^k(s', a_c'') \right) \Big|\, \Delta^k \right] \right\|_\infty, \end{aligned} \quad (14)$$

where $\| \cdot \|_\infty$ is the maximum norm of a vector and (a) is due to the convergence property of standard Q-learning. We are now left with verifying that $\left\| \mathbb{E}\left[ \gamma \cdot \left( \sum_{c=1}^{C} Q_c^k(s', a_c') - \max_{a''} \sum_{c=1}^{C} Q_c^k(s', a_c'') \right) \mid \Delta^k \right] \right\|_\infty$ converges to zero, which follows from: i) an $\epsilon$-greedy policy is deployed for the exploration-exploitation trade-off during decision making; ii) the per-signal Q-function values are upper bounded; and iii) both the global network state space and the decision-making spaces are finite. Thus the convergence of the decentralized learning is ensured. $\Box$
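Condition i) of the proof is typically met with an $\epsilon$-greedy rule, which keeps every $(s, a_c)$ pair visited infinitely often in the limit. A minimal sketch layered on the decomposed action selection above, with the exploration rate as an assumed constant:

```python
import random

# ε-greedy exploration over the decomposed Q-functions: with probability
# epsilon each signal takes a random action, otherwise the greedy one
# (select_joint_action is the sketch from §IV-A).
def epsilon_greedy_joint_action(q_tables, state, epsilon=0.05):
    greedy = select_joint_action(q_tables, state)
    return [random.choice((0, 1)) if random.random() < epsilon else a
            for a in greedy]
```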
Takeaway. The core contribution of the algorithm, i.e., decentralization, lies in the action selection process, during which the algorithm selects the optimal joint action that maximizes the sum of the Q-values of all agents, thus ensuring global optimality (Theorem 1 and Theorem 2). The major advantage in comparison with traditional centralized reinforcement learning is that the action selection cost grows linearly instead of exponentially as the number of involved agents (i.e., the traffic signals) increases: with $C$ binary-action signals, a centralized learner faces $2^C$ joint actions, whereas the decomposed learner evaluates only $2C$ per-signal values. It is noteworthy that the training is done in a centralized fashion to allow efficient cooperative training across the multiple agents.

V. EXPERIMENT SETUP
We conduct the training and the tests on an MSI GS65 Stealth 8SG equipped with a 6-core i7-8750H CPU, 32 GB of memory, and an Nvidia RTX 2080 Max-Q GPU. We build the learning algorithms on RLlib [30], an open-source library for reinforcement learning that can easily be scaled by increasing the number of workers [46]. We use its underlying Ray framework [29] to accelerate the decentralized algorithm training.

Besides Deep Q-Network (DQN), we also build and test distributional DQN (dDQN), proposed by Bellemare et al. in 2017 [47], which learns a categorical distribution of discounted returns instead of estimating the mean. To provide better performance, we leverage the values of several parameters explored by the well-acknowledged work Rainbow [48], including the learning rate, epsilon, and softmax cross entropy for dDQN. We define and evaluate the algorithms and policies using Flow [28], a Python library that provides the interface between RLlib and SUMO [49], a microscopic simulator for traffic and vehicle dynamics. Next, we describe the detailed settings.
A. Simulation Setup
To extensively test our proposal, we run sets of tests with different algorithms and policies in different scenarios.
Map.
Most related works on traffic light control experiment on grid-like maps such as the urban Manhattan grid scenario used by 3GPP [50]. We follow this setup and deploy the tests on n × n grid road maps that consist of n × n typical four-way, traffic-light-controlled intersections. Each intersection allows vehicles to flow either horizontally or vertically. When a green light ends, it transitions to yellow for two seconds before switching to red for safety. Each lane has one signal only, and the signal phases of an intersection in clockwise order consist of GrGr, yryr, rGrG, and ryry, where G, y, r refer to green, yellow and red, respectively. The traffic lights can be switched only after a minimum interval to prevent flickering.
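A minimal sketch of driving this phase cycle through SUMO's TraCI interface, which Flow exposes in our setup; the minimum green time is an illustrative parameter.

```python
import traci  # SUMO's TraCI Python client (Flow drives SUMO through it)

# Clockwise phase cycle of one intersection (§V-A). The yellow phases
# (indices 1 and 3) are held for two seconds before the conflicting green.
PHASES = ["GrGr", "yryr", "rGrG", "ryry"]
PHASE_MIN_S = {"yryr": 2, "ryry": 2}  # yellows last 2 s; greens vary

def advance_phase(tls_id, phase_idx, time_in_phase, min_green):
    """Advance one traffic light to its next phase once the current phase
    has run its minimum time, preventing flickering."""
    required = PHASE_MIN_S.get(PHASES[phase_idx], min_green)
    if time_in_phase < required:
        return phase_idx
    nxt = (phase_idx + 1) % len(PHASES)
    traci.trafficlight.setRedYellowGreenState(tls_id, PHASES[nxt])
    return nxt
```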
Scale. In this work, each ES1 collects data and controls the traffic lights within the coverage defined by its co-located base station. To be more realistic, we investigate the coverage range of LTE cell towers via a crowdsourced cellular tower and coverage mapping service. It shows that the coverage range of an LTE cell tower on B7 (2600 MHz) is typically around 5 to 10 blocks and 10 to 20 intersections in a European city center area (Helsinki, Finland). Therefore, we focus the simulation on the scale of 5 × 5 intersections to cover common scenarios. For the completeness of the work, we also test larger scales such as 10 × 10 and 15 × 15.
Traffic. Vehicles enter the map from all outer edges at a predefined rate, i.e., 360 vehicles/hour/edge. At such a rate, 7200 vehicles enter a 5 × 5 grid map from the 20 outer edges in an hour. To simplify the problem, vehicles travel straight on their paths. Each vehicle is driven following a basic SUMO built-in car-following model, of which the minimum gap between successive vehicles, maximum speed limit and deceleration ability are set to 2.5 m, 60 m/s and 7.5 m/s², respectively.
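A minimal sketch of these traffic settings written against Flow's parameter classes; the outer-edge identifiers are placeholders for the names the grid scenario actually generates.

```python
from flow.core.params import InFlows, SumoCarFollowingParams, VehicleParams

# Car-following settings of §V-A: 2.5 m minimum gap, 60 m/s speed cap,
# and 7.5 m/s^2 deceleration ability, on SUMO's built-in model.
vehicles = VehicleParams()
vehicles.add(
    veh_id="human",
    car_following_params=SumoCarFollowingParams(
        min_gap=2.5, max_speed=60, decel=7.5),
    num_vehicles=0)  # all traffic enters through the inflows below

# 360 vehicles/hour on every outer edge of the grid; the edge names here
# are placeholders for the identifiers of the generated 5 x 5 network.
inflow = InFlows()
for edge in ["left0", "right0", "bot0", "top0"]:
    inflow.add(veh_type="human", edge=edge, vehs_per_hour=360)
```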
B. Training Configurations

We test the learning algorithms utilizing single- and multi-agent training. We use multi-agent DQN and dDQN as the target algorithms to validate our mathematical reasoning in §IV. Considering that intersections with different centralities may influence the traffic performance differently, we define two kinds of policies with corresponding reward definitions as follows.
Policy.
We test the system with two policies, i.e., SharedPolicy and MultiPolicy, to address the different centralities of the intersections. We call the signals with higher centralities "central nodes", as indicated by the blue squares residing in the central area of Fig. 3, and the others "edge nodes", as indicated by the black circles. SharedPolicy lets the agents at all intersections share the same policy. MultiPolicy lets the agents at "central nodes" share a central policy while the other agents share an edge policy.
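A minimal sketch of how the two schemes could be expressed with RLlib's multi-agent API of the era we build on; the observation shape, environment name, stopping criterion, and node partition below are illustrative placeholders, not our exact configuration.

```python
from gym.spaces import Box, Discrete
from ray import tune

obs_space = Box(low=0.0, high=1.0, shape=(10,))   # per-agent observation
act_space = Discrete(2)                           # keep / switch the light
CENTRAL_NODES = {"center_6", "center_7", "center_11", "center_12"}

# MultiPolicy: two policy entries, assigned by centrality. SharedPolicy
# would instead declare a single entry and map every agent to it.
policies = {
    "central": (None, obs_space, act_space, {}),
    "edge": (None, obs_space, act_space, {}),
}

def policy_mapping_fn(agent_id):
    return "central" if agent_id in CENTRAL_NODES else "edge"

tune.run("DQN", config={
    "env": "traffic_grid",   # the Flow-registered multi-agent grid env
    "multiagent": {"policies": policies,
                   "policy_mapping_fn": policy_mapping_fn},
}, stop={"timesteps_total": 3_000_000})
```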
Reward.
We define the reward based on the average speed-lag and the number of halting vehicles (§III). Specifically, the reward of SharedPolicy is defined as

$$R = -w_1 \cdot H - w_2 \cdot \Delta V, \quad (15)$$

where $R$, $H$ and $\Delta V$ indicate the reward, the number of halting vehicles, and the average speed-lag, respectively. The rewards of MultiPolicy for the agents at "central nodes" and "edge nodes" are defined as

$$R_{\mathrm{central}} = -w_{c1} \cdot H - w_{c2} \cdot \Delta V, \quad (16)$$

$$R_{\mathrm{edge}} = -w_{e1} \cdot H - w_{e2} \cdot \Delta V, \quad (17)$$

where $w_{c1}$, $w_{c2}$, $w_{e1}$ and $w_{e2}$ denote the weights that differentiate the penalties that central and edge nodes receive. In this paper, we set $w_{c1} > w_{e1}$ and $w_{c2} > w_{e2}$ to give central nodes higher penalties for poor performance.
Fig. 3: Central (blue squares) and edge (black circles) intersections.

C. Benchmarks
We compare the performance of DQN and dDQN with multiple algorithms as follows.

• Static simply lets traffic lights deploy pre-defined static phases.

• Actuated is a common light control scheme in Germany. It works by either prolonging traffic phases upon detecting a continuous traffic stream, or switching to the next phase upon detecting a sufficient time gap between successive vehicles [51].

• Augmented Random Search (ARS) is an improved version of Basic Random Search (BRS), proposed by Mania et al. in 2018 [52].

• Evolutionary Strategies (ES) is one of the OpenAI solutions, proposed by Salimans et al. in 2017 [53].

• Proximal Policy Optimization (PPO) is a popular gradient-based policy optimization algorithm proposed by Schulman et al. in 2017 [54]. It uses multiple epochs of mini-batch updates, and an MLP with tanh non-linearity to compute a value function baseline.

We also test dDQN using all parameter values deployed in Rainbow [48] to see whether it applies to our use case. Table I shows the hyper-parameters of the benchmark algorithms, DQN and dDQN.

VI. TEST RESULTS
A. Training Results
We conduct the training iteratively, where each iteration consists of numerous rollouts. As shown in Fig. 4 to Fig. 6, the plotted interval consists of 100 iterations, each of which consists of 30 rollouts. Each rollout has 1000 steps, and a step maps to one second in real life. As such, the interval consists of 3 million steps. A rollout, or playout, is a term often used in machine learning that originates from Monte Carlo methods; in each rollout, an agent takes actions until reaching a predefined maximum number of steps.
TABLE I: Hyper-parameters.

| Algorithm | Selected hyper-parameters |
|---|---|
| Augmented Random Search | SGD step size = 0.2; noise standard deviation = 0.2 |
| Evolutionary Strategies | Adam step size = 0.02; noise standard deviation = 0.02 |
| Proximal Policy Optimization | λ (GAE) = 0.; clipping ε = 0.; Adam step size = 5 × 10−; minibatch size = 128; hiddens [100, 50, 25]; γ (discount) = 0. |
| Deep Q-Network | Adam ε = 1. × 10−; learning rate α = 6. × 10−; hiddens [256, 256]; dimension = 84; train batch size = 1000 |
| Distributional DQN | number of atoms = 51; min/max values = [−, ]; target network update freq. = 8000 |
Since SharedPolicy and MultiPolicy have different reward definitions, their reward performances are not strictly comparable; therefore, we plot them separately. As shown in Fig. 4 to Fig. 6, dDQN and DQN show similar curves, with the former performing slightly better than the latter. PPO is outperformed by dDQN and DQN at the beginning of the training and climbs to a similar level after around 2M steps. The Rainbow setup provides the worst performance of all algorithms, indicating that the parameter values discovered in Rainbow do not fit our use case. Single-agent algorithms take much longer to reach a similar level of performance, so we exclude their curves from the figures. For instance, the training time of single-agent DQN is 4 times the training time of its multi-agent counterpart. This is because the action space dimension for each agent in the multi-agent algorithms is 2, while single-agent DQN has a joint action space exponential in the number of intersections and thus requires a much longer training time.

B. Traffic Results
We replay the policies with SUMO to compare the traffic control performance and show the summarized results in Table II. Halting vehicles refers to the average number of halting vehicles per step (speed below 0.1 m/s). Queuing time indicates the average waiting time of vehicles due to queuing per step. Queuing length refers to the average length of the queues per step; the end of a queue is defined as the last vehicle with a speed below 0.1 m/s. Speed refers to the average speed.

As shown, the Static algorithm presents the worst performance. Meanwhile, we find that all machine learning algorithms provide very limited improvement in the average vehicle speed compared to the Actuated algorithm.
Fig. 4: Reward for SharedPolicy (defined in Eq. (15)): (a) maximum, (b) average, and (c) minimum reward over millions of steps.

Fig. 5: Reward for MultiPolicy (defined in Eq. (16) and Eq. (17)): (a) maximum, (b) average, and (c) minimum reward over millions of steps.

Fig. 6: Average reward per agent: (a) SharedPolicy (reward defined in Eq. (15)); (b) central policy (reward defined in Eq. (16)); (c) edge policy (reward defined in Eq. (17)).
However, they all decrease the number of halting vehicles and the queuing time significantly (except the Rainbow configurations, which is most likely due to unsuitable parameter values). This indicates that with agent-controlled signals, vehicles experience far fewer stop-and-go waves and much less waiting time in queues, thus gaining considerable improvements in driving experience.

Multi-agent SharedPolicy-PPO, SharedPolicy-DQN and MultiPolicy-DQN provide the smoothest traffic control. Their performances show the smallest numbers of halting vehicles and the shortest queuing times and lengths while allowing higher driving speeds than the others. The performance of single-agent DQN is fairly good; however, it requires exponentially higher capacity and four times the training period of its multi-agent counterparts. ARS and ES also show competitive performances, although presenting either slightly slower speeds or more halting vehicles. Nevertheless, ARS and ES currently only support single-agent training and thus also require much longer training time. For example, to achieve the performance in Table II, ARS and ES need around 800K steps (approximately 253 minutes) while multi-agent DQN and dDQN only need around 240K steps (approximately 52 minutes). Similarly, PPO has a much longer convergence time compared to the multi-agent DQN algorithms, as shown in Fig. 4 and Fig. 5.

Overall, the training and traffic results align with Theorems 1 and 2 proposed in §IV, showing that decentralized multi-agent DQN can achieve optima similar to centralized single-agent DQN. Surprisingly, MultiPolicy has not provided observable benefits compared to SharedPolicy.
TABLE II: Traffic performance and required training time. N: Non-ML, S: SingleAgent-ML, M: MultiAgent-ML.

| Num agents | Algorithm | Halting vehicles | Queue time (s) | Queue length (m) | Speed (m/s) |
|---|---|---|---|---|---|
| N | Static | 95 ±40 | 14.23 ±10.25 | 19.82 ±12.17 | 13.39 ±3.20 |
| N | Actuated | | | | |
| S | ARS | | | | |
| S | ES | | | | |
| S | PPO | | | | |
| S | DQN | | | | |
| M | SharedPolicy-PPO | | | | |
| M | SharedPolicy-DQN | | | | |
| M | SharedPolicy-dDQN-rainbow | | | | |
| M | SharedPolicy-dDQN | | | | |
| M | MultiPolicy-PPO | | | | |
| M | MultiPolicy-DQN | | | | |
| M | MultiPolicy-dDQN-rainbow | | | | |
| M | MultiPolicy-dDQN | | | | |
TABLE III: Performance of SharedPolicy-dDQN on different map scales.

| Scale | Halting vehicles | Queue time (s) | Queue length (m) | Speed (m/s) | Training time/rollout (s) |
|---|---|---|---|---|---|
| 5 × 5 | | | | | |
| 10 × 10 | | | | | |
| 15 × 15 | | | | | |

Fig. 7: Average reward per agent for SharedPolicy-DQN on different scales (5 × 5, 10 × 10, and 15 × 15).

We will explore more varieties of MultiPolicy setups in future work. Meanwhile, decentralized multi-agent DQN outperforms centralized single-agent DQN in the following aspects: 1) it requires only 1/4 of the training time centralized DQN needs; 2) it requires exponentially less computation and memory capacity than centralized DQN. Next, we evaluate the scalability of DRLE by conducting tests on different map scales. Thereafter, we evaluate the communication overhead with a preliminary simulation.
C. Scalability
To illustrate the system scalability, we apply multi-agent SharedPolicy-DQN on various map scales. Fig. 7 shows the training performance of SharedPolicy-DQN on maps with 5 × 5, 10 × 10, and 15 × 15 intersections. Because of the fixed vehicle inflow rate (see §V-A), each intersection has fewer nearby vehicles in larger maps, resulting in shorter queue lengths and vehicle delays, which, in turn, leads to a higher reward. As such, the average reward per agent increases as the map scale increases. The traffic control performance summarized in Table III mostly confirms this reasoning. In particular, when we increase the size of the map, the number of halting vehicles per intersection decreases. Overall, DRLE performs well on different map scales.

D. Communication Overhead
DRLE leverages the ubiquity of the IoV and the foreseeable edge facilities in the near future to run the algorithms. In this subsection, we provide a preliminary evaluation of the communication overhead brought by this system architecture. We deploy a map consisting of 5 × 5 intersections, as shown in Fig. 3, in NS-3, a packet-level discrete-event network simulator for Internet systems. Because the mobility trace files generated from the training tests are not compatible with NS-3, we instead use the trace files to estimate the average number of active vehicles per step in the considered map, i.e., 230. Then we simulate 230 vehicles in NS-3 with the same parameters (speed, acceleration, etc.) as in the training tests.
TABLE IV: Communication parameters.

| Model | Protocol | Packet | Frequency | Hops |
|---|---|---|---|---|
| V2N [43] | UDP | 1500 B | 1 Hz | 1 |
TABLE V: Transmission delays (MAD: median absolute deviation).

| Direction | Mean (ms) | MAD (ms) |
|---|---|---|
| Uplink (vehicle-to-ES1) | 110.82 | 17.68 |
| Downlink (ES1-to-signal) | 106.23 | 0 |

As such, we argue that the transmission delay performance, albeit not identical to the performance in an integrated test, should be very similar from a statistical perspective. As described in §III-B, each vehicle sends 1500 B packets to a tri-sector eNB at the center of the map at 1 Hz via C-V2X (LTE) (Table IV). For a period of 1000 seconds (the default training period in the training tests), the results show that the uplink (vehicle-to-ES1) plus downlink (ES1-to-signal) delay is less than 240 ms per packet (Table V). As such, the transmission delay during each step is much smaller than the step duration and thus does not impact DRLE.

VII. CONCLUSION AND FUTURE WORK
In this work, we present DRLE, an integrated edge computing framework leveraging the ubiquity of the IoV to alleviate traffic congestion in real time at city scale. We decompose the highly complex centralized problem of large-area traffic light control into a multi-agent decentralized problem and prove its global optimality with concrete mathematical reasoning. DRLE exploits the low latency of edge servers to provide fast DQN training and control feedback. Thanks to its layered architecture and hierarchical algorithm, DRLE runs optimization at the Intersection, Intra-ES and Inter-ES levels, which allows for traffic light control on different scales. We present numerous comparisons to evaluate the traffic improvement brought by DRLE and show that, compared to the state-of-the-art baselines, DRLE decreases the convergence time by 65.66% compared to PPO and the number of training steps by 79.44% compared to ARS and ES. Besides, DRLE provides traffic control performance comparable to its centralized counterpart while requiring only 1/4 of the training time.

This paper explores the performance of the algorithm at the Intra-ES level. In future work, we will explore the Inter-ES algorithm and its impact on the performance of DRLE. We also look forward to investigating the potential of tuning the threshold parameter value in the Intersection-level algorithm instead of directly switching the traffic lights, to further decrease the action space and extend the scalability.

REFERENCES

[1] J. Contreras-Castillo, S. Zeadally, and J. A. Guerrero-Ibañez, "Internet of vehicles: architecture, protocols, and security," IEEE Internet of Things Journal, vol. 5, no. 5, pp. 3701–3709, 2017.
[2] C. Wu, X. Chen, T. Yoshinaga, Y. Ji, and Y. Zhang, "Integrating licensed and unlicensed spectrum in the internet of vehicles with mobile edge computing," IEEE Network, vol. 33, no. 4, pp. 48–53, 2019.
[3] Y. Dai, D. Xu, S. Maharjan, G. Qiao, and Y. Zhang, "Artificial intelligence empowered edge computing and caching for internet of vehicles," IEEE Wireless Communications, vol. 26, no. 3, pp. 12–18, 2019.
[4] M. Zhang, C. Chen, T. Wo, T. Xie, M. Z. A. Bhuiyan, and X. Lin, "Safedrive: online driving anomaly detection from large-scale vehicle data," IEEE Transactions on Industrial Informatics, vol. 13, no. 4, pp. 2087–2096, 2017.
[5] Y.-S. Chou, Y.-C. Mo, J.-P. Su, W.-J. Chang, L.-B. Chen, J.-J. Tang, and C.-T. Yu, "i-car system: A lora-based low power wide area networks vehicle diagnostic system for driving safety," IEEE, 2017, pp. 789–791.
[6] P. Zhou, W. Zhang, T. Braud, P. Hui, and J. Kangasharju, "Arve: Augmented reality applications in vehicle to edge networks," in Proceedings of the 2018 Workshop on Mobile Edge Communications. ACM, 2018, pp. 25–30.
[7] N. Lu, N. Cheng, N. Zhang, X. Shen, and J. W. Mark, "Connected vehicles: Solutions and challenges," IEEE Internet of Things Journal, vol. 1, no. 4, pp. 289–299, 2014.
[8] C. Wu, Z. Liu, F. Liu, T. Yoshinaga, Y. Ji, and J. Li, "Collaborative learning of communication routes in edge-enabled multi-access vehicular environment," IEEE Transactions on Cognitive Communications and Networking, 2020.
[9] J. Feng, Z. Liu, C. Wu, and Y. Ji, "Ave: Autonomous vehicular edge computing framework with aco-based scheduling," IEEE Transactions on Vehicular Technology, vol. 66, no. 12, pp. 10660–10675, 2017.
[10] N. Abbas, Y. Zhang, A. Taherkordi, and T. Skeie, "Mobile edge computing: A survey," IEEE Internet of Things Journal, vol. 5, no. 1, pp. 450–465, 2017.
[11] A. Alhilal, T. Braud, and P. Hui, "Distributed vehicular computing at the dawn of 5g: a survey," 2020.
[12] K. Zhang, S. Leng, X. Peng, L. Pan, S. Maharjan, and Y. Zhang, "Artificial intelligence inspired transmission scheduling in cognitive vehicular communications and networks," IEEE Internet of Things Journal, vol. 6, no. 2, pp. 1987–1997, 2018.
[13] K. Zhang, Y. Zhu, S. Leng, Y. He, S. Maharjan, and Y. Zhang, "Deep learning empowered task offloading for mobile edge computing in urban informatics," IEEE Internet of Things Journal, vol. 6, no. 5, pp. 7635–7647, 2019.
[14] K. Zhang, Y. Zhu, S. Maharjan, and Y. Zhang, "Edge intelligence and blockchain empowered 5g beyond for the industrial internet of things," IEEE Network, vol. 33, no. 5, pp. 12–19, 2019.
[15] A. Barisone and D. Giglio, "A macroscopic traffic model for real-time optimization of signalized urban areas," pp. 900–903, 2002.
[16] M. Dotoli, M. P. Fanti, and C. Meloni, "A signal timing plan formulation for urban traffic control," Control Engineering Practice, vol. 14, no. 11, pp. 1297–1311, 2006.
[17] P. Coll, P. Factorovich, I. Loiseau, and R. Gómez, "A linear programming approach for adaptive synchronization of traffic signals," International Transactions in Operational Research, vol. 20, no. 5, pp. 667–679, 2013.
[18] P. Balaji, X. German, and D. Srinivasan, "Urban traffic signal control using reinforcement learning agents," IET Intelligent Transport Systems, vol. 4, no. 3, pp. 177–188, 2010.
[19] T. Chu, J. Wang, L. Codecà, and Z. Li, "Multi-agent deep reinforcement learning for large-scale traffic signal control," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 3, pp. 1086–1095, 2019.
[20] L. Li, Y. Lv, and F.-Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.
[21] J. N. Tsitsiklis, "Asynchronous stochastic approximation and q-learning," Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.
[22] E. Bingham, "Reinforcement learning in neurofuzzy traffic signal control," European Journal of Operational Research, vol. 131, no. 2, pp. 232–241, 2001.
[23] B. Abdulhai, R. Pringle, and G. J. Karakoulas, "Reinforcement learning for true adaptive traffic signal control," Journal of Transportation Engineering, vol. 129, no. 3, pp. 278–285, 2003.
[24] D. de Oliveira, A. L. Bazzan, B. C. da Silva, E. W. Basso, L. Nunes, R. Rossetti, E. de Oliveira, R. da Silva, and L. Lamb, "Reinforcement learning based control of traffic lights in non-stationary environments: A case study in a microscopic simulator," in EUMAS, 2006.
[25] K. Dresner and P. Stone, "Multiagent traffic management: A reservation-based intersection control mechanism," in Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2, 2004, pp. 530–537.
[26] P. Zhou, W. Zhang, T. Braud, P. Hui, and J. Kangasharju, "Enhanced augmented reality applications in vehicle-to-edge networks," IEEE, 2019, pp. 167–174.
[27] P. Zhou, T. Braud, A. Alhilal, P. Hui, and J. Kangasharju, "Erl: Edge based reinforcement learning for optimized urban traffic light control," IEEE, 2019, pp. 849–854.
[28] C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, "Flow: Architecture and benchmarking for reinforcement learning in traffic control," arXiv preprint arXiv:1710.05465, 2017.
[29] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan et al., "Ray: A distributed framework for emerging AI applications," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018, pp. 561–577.
[30] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, J. Gonzalez, K. Y. Goldberg, and I. Stoica, "Ray rllib: A composable and scalable reinforcement learning library," ArXiv, vol. abs/1712.09381, 2017.
[31] N. R. Moloisane, R. Malekian, and D. Capeska Bogatinoska, "Wireless machine-to-machine communication for intelligent transportation systems: Internet of vehicles and vehicle to grid," 2017, pp. 411–415.
[32] M. Gerla, E.-K. Lee, G. Pau, and U. Lee, "Internet of vehicles: From intelligent grid to autonomous cars and vehicular clouds," IEEE, 2014, pp. 241–246.
[33] P. M. Kumar, G. Manogaran, R. Sundarasekar, N. Chilamkurti, R. Varatharajan et al., "Ant colony optimization algorithm with internet of vehicles for intelligent traffic control system," Computer Networks, vol. 144, pp. 154–162, 2018.
[34] T. S. Darwish and K. A. Bakar, "Fog based intelligent transportation big data analytics in the internet of vehicles environment: motivations, architecture, challenges, and critical issues," IEEE Access, vol. 6, pp. 15679–15701, 2018.
[35] M. Chen, Y. Tian, G. Fortino, J. Zhang, and I. Humar, "Cognitive internet of vehicles," Computer Communications, vol. 120, pp. 58–70, 2018.
[36] V. Tyagi, S. Darbha, and K. Rajagopal, "A review of the mathematical models for traffic flow," International Journal of Advances in Engineering Sciences and Applied Mathematics, vol. 1, no. 1, pp. 53–68, 2009.
[37] S. P. Hoogendoorn and P. H. Bovy, "State-of-the-art of vehicular traffic flow modelling," Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, vol. 215, no. 4, pp. 283–303, 2001.
[38] N. Eissfeldt, "Vehicle-based modelling of traffic," Ph.D. dissertation, Universität zu Köln, 2004.
[39] T. Chu, S. Qu, and J. Wang, "Large-scale traffic grid signal control with regional reinforcement learning," IEEE, 2016, pp. 815–820.
[40] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (marlin-atsc): methodology and large-scale application on downtown toronto," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1140–1150, 2013.
[41] T. Tan, F. Bao, Y. Deng, A. Jin, Q. Dai, and J. Wang, "Cooperative deep reinforcement learning for large-scale traffic grid signal control," IEEE Transactions on Cybernetics, vol. 50, no. 6, pp. 2687–2700, 2019.
[42] A. Papathanassiou and A. Khoryaev, "Cellular v2x as the essential enabler of superior global connected transportation services," IEEE 5G Tech Focus, vol. 1, no. 2, pp. 1–2, 2017.
[43] 3GPP, "Study on lte support for vehicle-to-everything (v2x) services," 3rd Generation Partnership Project (3GPP), Technical Report (TR) 22.885, 2015, version 14.0.0. [Online]. Available: https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=2898
[44] D. Krajzewicz, E. Brockfeld, J. Mikat, J. Ringel, C. Rossel, W. Tuchscheerer, P. Wagner, and R. Wosler, "Simulation of modern traffic lights control systems using the open source traffic simulation SUMO," in Proceedings of the 3rd Industrial Simulation Conference, Berlin, Germany, pp. 229–302, 2005.
[45] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[46] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[47] M. G. Bellemare, W. Dabney, and R. Munos, "A distributional perspective on reinforcement learning," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML'17. JMLR.org, 2017, pp. 449–458.
[48] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, "Rainbow: Combining improvements in deep reinforcement learning," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[49] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, "Recent development and applications of SUMO - Simulation of Urban MObility," International Journal On Advances in Systems and Measurements, vol. 5, no. 3&4, pp. 128–138, December 2012.
[50] 3GPP, "Study on lte-based v2x services," 3rd Generation Partnership Project (3GPP), Technical Report (TR) 36.885, 2016, version 14.0.0. [Online]. Available: https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=2934
[51] SUMO, "Simulation/traffic lights," 2020. [Online]. Available: https://sumo.dlr.de/docs/Simulation/Traffic_Lights.html
[52] H. Mania, A. Guy, and B. Recht, "Simple random search provides a competitive approach to reinforcement learning," CoRR, vol. abs/1803.07055, 2018. [Online]. Available: http://arxiv.org/abs/1803.07055
[53] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, "Evolution strategies as a scalable alternative to reinforcement learning," arXiv preprint arXiv:1703.03864, 2017.
[54] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.