Optimizing Traffic Lights with Multi-agent Deep Reinforcement Learning and V2X Communication
Azhar Hussain, Tong Wang and Cao Jiahua
College of Information and Communication Engineering, Harbin Engineering University, 150001, Harbin, China
[email protected]
Abstract—We consider a system to optimize the duration of traffic signals using multi-agent deep reinforcement learning and Vehicle-to-Everything (V2X) communication. The system analyzes independent and shared rewards for multiple agents that control the duration of traffic lights. Each learning agent (a traffic light) gets information along its lanes within a circular V2X coverage. The duration cycles of a traffic light are modeled as a Markov decision process. We investigate four variations of reward functions. The first two are unshared rewards, based respectively on the waiting number and the waiting time of vehicles between two cycles of a traffic light. The third and fourth are shared rewards, based respectively on the waiting cars and the waiting time across all agents. Each agent has a memory for optimization through a target network and prioritized experience replay. We evaluate the multi-agent system in the Simulation of Urban MObility (SUMO) simulator. The results prove the effectiveness of the proposed system in optimizing traffic signals, reducing the average number of waiting cars by 41.5% compared to the traditional periodic traffic control system.

Keywords—Deep reinforcement learning, V2X, deep learning, traffic light control.
I. INTRODUCTION

EXISTING traffic signal management is performed either by leveraging limited real-time traffic information or by fixed periodic traffic signal timings [1]. This information is widely obtained from underlying inductive loop detectors. However, this input is processed in a limited domain when estimating better durations of red/green signals. Advances in mobile communication networks and sensor technology have made it possible to obtain real-time traffic information [2]. An artificial brain can be implemented with deep reinforcement learning (DRL). DRL is based on three main components: states in the environment, an action space, and the scalar reward from each action [3]. A popular success of DRL is AlphaGo [4] and its successor AlphaGo Zero [5]. The main goal is to maximize the reward by choosing the best actions.

In some research works, the state is defined as the number of vehicles waiting at an intersection or the waiting queue length [6], [7]. However, it is shown in [8] that the real traffic environment cannot be fully captured by the number of waiting vehicles or the waiting queue length. Thanks to the rapid development of deep learning, large-state problems have been addressed with the deep neural network paradigm [9]. In [10], [11] the authors have proposed to resolve the traffic control problem with DRL. However, two limitations exist in the current studies: 1) fixed time intervals of traffic lights, which is not efficient in some cases; 2) random sequences of traffic signals, which may cause safety and comfort issues. In [12] the authors have controlled the durations in a cycle based on information extracted from vehicle and sensor networks, which can reduce the average waiting time by 20%.

In this letter we investigate, for the first time, multiple experienced traffic operators to control traffic at each step at multiple traffic lights.
Fig. 1: Illustration of the system under study: an agent observes state s from the environment, a deep neural network with parameter θ maps the state to a policy π_θ(s, a), and the chosen action a returns a reward r.
This idea assumes that the control process can be modeled as a Markov decision process (MDP). The system learns the control strategy based on the MDP by trial and error. Recently, a Q-learning based method was proposed in [18], showing better performance than a fixed-period policy. A linear function is proposed to achieve more effective traffic flow management under high traffic flow [19]. But neither tabular Q-learning nor linear function methods can support the increasing size of the traffic state space and accurate estimation of the Q value in a real scenario [20]. Gao et al. [21] proposed a DRL method with the change in cumulative staying time as the reward; a CNN is employed to map states to rewards [21]. H. Jiang analyzed nonzero-sum games with multiple players by using adaptive dynamic programming (ADP) [22]. Li et al. proposed a single-intersection control method based on DQN that focuses on a local optimum [23]. Pol et al. combined DQN with a max-plus method to achieve cooperation between traffic lights [25]. Xiaoyuan Liang et al. proposed a model incorporating multiple optimization elements for traffic lights, and simulation in SUMO has shown the efficiency of the model [12]. However, most of these works focused on a single agent controlling the signals of one intersection.

II. SYSTEM MODEL
The proposed system is shown in Fig. 1. The model for each agent is based on four items ⟨S, A, R, P⟩. Let S be the possible state space, with s a state (s ∈ S); A the possible action space, with a an action (a ∈ A); and R the reward space. Let P be the space of transition probability functions from one state to the next. A series of consequent actions is a policy π. Learning an optimal policy to maximize the cumulative expected reward is the main goal. An agent at state s takes an action a to reach the next state s′ and gets the reward r. The four-tuple ⟨s, a, r, s′⟩ represents this situation. The state transition occurs at a discrete step n in π. Let Q(s, a) be the cumulative future reward obtained by executing action a at state s, and let r_n be the reward at the n-th step; then (1) gives Q^π(s, a) for π:

$$ Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{n+k} \,\middle|\, s_n = s,\; a_n = a,\; \pi \right] \quad (1) $$

Fig. 2: Multi-agent traffic scenario (agents 1–8 control the traffic lights with IDs 8, 10, 15, 17, 20, 22, 27 and 29).
The parameter γ is a discount factor in [0, 1] that decides how much importance is given to recent and future rewards. The optimal policy π* can be acquired through several episodes of the learning process. The calculation of the optimal Q(s, a) is based on the optimal Q values of the succeeding states, represented by the Bellman optimality equation (2):

$$ Q^{\pi^{*}}(s, a) = \mathbb{E}_{s'}\left[ r_n + \gamma \max_{a'} Q^{\pi^{*}}(s', a') \,\middle|\, s, a \right] \quad (2) $$

It can be solved by dynamic programming, keeping the states finite for a lower computational burden. However, the Q values can also be estimated by a function with parameter θ for a larger number of states.
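For illustration only, the following minimal sketch applies the backup of (2) repeatedly to a small finite MDP; this is the dynamic-programming route mentioned above. The state/action sizes and the randomly generated transition and reward tables are hypothetical, not taken from the paper.

```python
import numpy as np

# Hypothetical small MDP: P[s, a, s'] transition probabilities, R[s, a] rewards.
n_states, n_actions, gamma = 4, 2, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    # Bellman optimality backup, Eq. (2): Q(s,a) <- E_{s'}[r + gamma * max_a' Q(s',a')]
    Q = R + gamma * (P @ Q.max(axis=1))
print(Q)  # converges to Q* since gamma < 1
```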
A. Problem formulation

The environment (road traffic) is shared by the agents. Let D_a be the total number of controlled lanes of an agent a, where a ∈ A and A is the set of agents. The number of waiting cars on a lane i_a is w_{i_a}. The reward is r_a^c for a case c ∈ C with C = {1, 2, 3, 4}. Figure 2 shows the multi-agent traffic scenario under study. We aim to minimize the average of w_{i_a} in each case c and for a varying number of agents, as described in problem (P):

$$ \text{(P)}: \; \min_{\pi} \mathbb{E}_{\pi}\left[ \sum_{i_a=1}^{D_a} w_{i_a} \right] \quad \text{s.t.} \quad r_a^c \in (0, 1], \; \forall c \in C, \; \forall a \in A \quad (3) $$
1) States: The states are the numbers of vehicles on the road for each lane of a traffic light agent. The numbers of vehicles are acquired from V2X communication in DSRC mode [26]. This restricts the states to the controlled lanes of a multi-lane traffic intersection, reducing the computational burden. The length of the road is defined as l_r. The state is a four-value vector ⟨l_0, l_1, l_2, l_3⟩ such that each element represents the number of vehicles in lane 0, lane 1, lane 2 and lane 3 (North, East, South, West), respectively.
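As a minimal sketch, such a state vector could be assembled through SUMO's TraCI interface; the lane IDs, the intersection center, and the coverage radius C_r below are placeholders, not values from the paper.

```python
import traci  # SUMO's Python TraCI client

# Placeholder incoming-lane IDs for one agent (lanes 0..3 = North, East, South, West)
LANES = ["north_in_0", "east_in_0", "south_in_0", "west_in_0"]
C_R = 150.0  # assumed V2X coverage radius in meters

def get_state(center_x, center_y):
    """Four-value state <l0, l1, l2, l3>: vehicles per lane that lie
    inside the circular V2X coverage around the traffic light."""
    state = []
    for lane in LANES:
        count = 0
        for veh in traci.lane.getLastStepVehicleIDs(lane):
            x, y = traci.vehicle.getPosition(veh)
            if (x - center_x) ** 2 + (y - center_y) ** 2 <= C_R ** 2:
                count += 1
        state.append(count)
    return state
```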
2) Actions: The action space decides the duration of each phase in the next cycle. The change of phase durations between two consecutive cycles is modeled as a high-dimensional MDP. A phase remains constant for D seconds. Let ⟨D_1, D_2, D_3, D_4⟩ be the durations of the four phases. The duration of a phase in the next cycle is incremented by D if the same action is chosen by an agent; repeating an action thus increases the duration of the same phase. To investigate the feasibility of actions, we assume that the probability of a phase transition from action i to action j is P_ij. Let f_ij(n) be the probability that, starting from action i, action j is reached for the first time after n steps. An agent may take too long to choose an unvisited action if there are no bounding conditions. The probability that action j is reached from action i in one action-step is f_ij(1), as in (4):

$$ f_{ij}(1) = P^{(1)}_{ij} = P_{ij} \quad (4) $$

Fig. 3: Process flow of the multi-agent system under study.

Fig. 4: Results for the traditional periodic TLC. (a) Duration per phase is 30 s (maximum average waiting cars: 3.3). (b) Duration per phase is 40 s (maximum: 3.9).
The first-passage probability after n action-steps can be generalized as (5):

$$ f_{ij}(n) = P^{(n)}_{ij} - f_{ij}(1)\, P^{(n-1)}_{jj} - \cdots - f_{ij}(n-1)\, P^{(1)}_{jj} \quad (5) $$

Let q_ij(m) be the probability that an agent at action i will eventually take j at least once within m transitions; it is the sum of all first-passage probabilities (6):

$$ q_{ij}(m) = f_{ij}(1) + f_{ij}(2) + f_{ij}(3) + \cdots + f_{ij}(m) \quad (6) $$

The feasible operation of the proposed system is possible if and only if j is transient in nature, which is expressed by (7):

$$ \sum_{n=1}^{\infty} P^{(n)}_{jj} < \infty \quad (7) $$

This can be achieved by setting a threshold W_sum on the maximum waiting number of vehicles.
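As a small numeric illustration, the recursion (5) and the cumulative sum (6) can be evaluated directly for any transition matrix; the 2 × 2 matrix below is hypothetical.

```python
import numpy as np

def first_passage(P, i, j, m):
    """f_ij(n) for n = 1..m via Eqs. (4)-(5), and q_ij(m) via Eq. (6)."""
    Pn = [np.linalg.matrix_power(P, n) for n in range(1, m + 1)]  # P^(n)
    f = np.zeros(m + 1)      # f[n] = f_ij(n); index 0 unused
    f[1] = P[i, j]           # Eq. (4)
    for n in range(2, m + 1):
        # Eq. (5): remove paths that already visited j at an earlier step
        f[n] = Pn[n - 1][i, j] - sum(f[k] * Pn[n - k - 1][j, j]
                                     for k in range(1, n))
    return f[1:], f[1:].sum()

P = np.array([[0.6, 0.4],    # hypothetical action-transition matrix
              [0.3, 0.7]])
f, q = first_passage(P, 0, 1, 10)  # e.g. f_01(2) = 0.6 * 0.4 = 0.24
```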
3) Rewards: Our study aims at decreasing w_{i_a} in each lane, which proportionally reduces the cumulative waiting time. In contrast to previous single-agent research, we argue that real-life scenarios consist of multiple traffic lights, and the learning process of one agent may not be effective in reducing traffic congestion in its neighborhood. The reason is that once an agent performs well at an intersection, the congestion on the roads connected to this agent is reduced, increasing the traffic flow toward, and potentially causing severe traffic jams at, the neighboring intersections. To investigate this we consider four cases in C.
4) Agents with unshared rewards: In the first case the reward for an agent a is based on the accumulated waiting number of vehicles in its vicinity. Let D_a be the total number of lanes of agent a. The reward r_a^1 for case 1 is given by (8):

$$ r_a^{1} = \left( 1 + \sum_{i_a=1}^{D_a} w_{i_a} \right)^{-1}, \quad \text{for } \sum_{i_a=1}^{D_a} w_{i_a} < W_{sum} \quad (8) $$

The reward in the second case is based on the accumulated waiting time of vehicles in an agent's vicinity. Let w_{t,i_a} be the total waiting time of all vehicles in lane i_a; then (9) gives the reward r_a^2 for a step t:

$$ r_a^{2} = \left( 1 + \sum_{i_a=1}^{D_a} w_{t,i_a} \right)^{-1} \quad (9) $$
5) Agents with shared rewards: The third case considers a common aim for all agents: the actions of the agents are selected to minimize the overall waiting number of cars across all agents. Therefore, the reward for each agent is based on the accumulated waiting number of vehicles in its own vicinity as well as in the other agents' vicinities. Let A agents be selected for the common environment and let D_b be the total number of controlled lanes of an agent b other than a; the reward r_a^3 is given by (10):

$$ r_a^{3} = \left( 1 + \sum_{i_a=1}^{D_a} w_{i_a} + \sum_{b=1, b \neq a}^{A} \sum_{i_b=1}^{D_b} w_{i_b} \right)^{-1} \quad (10) $$

In the fourth case we consider the waiting time experienced by all agents as the shared reward for an agent, i.e., r_a^4 is based on the accumulated waiting time of vehicles in its own and the other agents' vicinities, as expressed in (11):

$$ r_a^{4} = \left( 1 + \sum_{i_a=1}^{D_a} w_{t,i_a} + \sum_{b=1, b \neq a}^{A} \sum_{i_b=1}^{D_b} w_{t,i_b} \right)^{-1} \quad (11) $$

The variable w_{t,i_b} is the total waiting time of all vehicles in lane i_b.
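The shared variants only change what is summed. A sketch covering both (10) and (11), where per_agent_values holds waiting counts for case 3 or waiting times for case 4:

```python
def shared_reward(per_agent_values, a):
    """Eqs. (10)-(11): per_agent_values[b] is the list of per-lane values
    of agent b (waiting counts for case 3, waiting times for case 4)."""
    own = sum(per_agent_values[a])
    others = sum(sum(vals) for b, vals in enumerate(per_agent_values) if b != a)
    return 1.0 / (1.0 + own + others)

# Example: 3 agents, shared reward of agent 0 over everyone's waiting counts
r = shared_reward([[2, 0, 1, 3], [4, 1, 0, 0], [0, 0, 2, 1]], a=0)
```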
B. Process flow of the proposed model

Figure 3 shows the process flow of the proposed multi-agent model. The initialization parameters and their values are given in Table I. The proposed algorithm reads the dataset of traffic flow per lane and formulates the vehicle flow rate from the real-world domain into the SUMO domain in terms of steps. We choose this formulation to assign a vehicle flow rate, or arrival probability, to each vehicle according to configuration 1 of the "Huawei 2019 Vehicle Scheduling Contest" [24]. The purpose of selecting this online dataset is to establish a standard comparison for the research. A total of 128 cars are used. The agent training process starts by initializing the SUMO environment, which imitates a real-world road network with traffic lights and vehicles. Each agent is responsible for its traffic light region. The V2X coverage is limited to a circle of radius C_r meters around the traffic light intersection. The agent gets information on the vehicles on each connected road under the coverage area.

Each lane is given equal importance in the calculation of the reward function. All agents take actions in a predefined cycle duration. The agents act upon their respective states either using experience replay or based on random decisions under exploration. Experience replay is used individually by each agent to minimize its loss, which is the difference between target and prediction. A separate neural network acts as a function approximator, known as the Q-network Q(s, a; θ) with weights θ. The Q-network is trained by minimizing a sequence of loss functions L_i(θ_i) that changes at each iteration i, as shown in (12):

$$ L_i(\theta_i) = \mathbb{E}_{s,a \sim P(s,a)}\left[ \left( Q^{target}_i(s, a) - Q(s, a; \theta_i) \right)^{2} \right] \quad (12) $$
where P(s, a) is the probability distribution over state and action sequences, and the target Q^{target}_i(s, a) for the i-th iteration is given by (13):

$$ Q^{target}_i(s, a) = \mathbb{E}_{s' \sim \varepsilon}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right] \quad (13) $$

The weights θ_{i−1} of the previous iteration are kept fixed during the optimization of the loss function L_i(θ_i). The term ε denotes the SUMO environment. The aim is to predict the Q value for a given state and action, obtain the target value, and replace the predicted value for that action with the target value. The targets depend on the Q-network weights. Let L′_i = r + γ max_{a'} Q(s′, a′; θ_{i−1}) − Q(s, a; θ_i) be a temporary variable. Differentiating the loss function with respect to the weights gives the gradient in the form of (14):

$$ \nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim P(s,a)}\left[ L'_i \, \nabla_{\theta_i} Q(s, a; \theta_i) \right] \quad (14) $$

The proposed multi-agent DRL is a model-free approach: it solves the task directly from samples of the SUMO environment and does not require an estimate of ε. Each agent learns its greedy strategy max_a Q(s, a; θ) while following a behaviour distribution that ensures adequate state-space exploration. The learning behaviour distribution follows an ε-greedy strategy: exploitation is selected with probability 1 − ε, and a random action is selected with probability ε.

Fig. 5: Results with 2 agents. (a)–(d) Average of rewards r_a^1, r_a^2, r_a^3 and r_a^4. (e)–(h) Average waiting cars for cases 1–4.
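A minimal PyTorch sketch of (12)–(14) with a fixed target network and ε-greedy selection follows. The three 24-node hidden layers and the learning rate come from Table I, while the four-action output and the discount value 0.95 are assumptions; replay-buffer handling is omitted for brevity.

```python
import random
import torch
import torch.nn as nn

N_STATES, N_ACTIONS, GAMMA = 4, 4, 0.95  # 4-lane state; action count and gamma assumed

class QNet(nn.Module):
    """Q-network: three fully connected hidden layers of 24 nodes (Table I)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_STATES, 24), nn.ReLU(),
            nn.Linear(24, 24), nn.ReLU(),
            nn.Linear(24, 24), nn.ReLU(),
            nn.Linear(24, N_ACTIONS))
    def forward(self, s):
        return self.net(s)

q_net, target_net = QNet(), QNet()
target_net.load_state_dict(q_net.state_dict())  # theta_{i-1}, kept fixed
opt = torch.optim.Adam(q_net.parameters(), lr=0.001)

def act(state, eps):
    """Epsilon-greedy: explore with probability eps, else exploit."""
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return q_net(torch.as_tensor(state, dtype=torch.float32)).argmax().item()

def train_step(s, a, r, s2):
    """One minimization step of Eq. (12) on a minibatch (s, a, r, s2)."""
    with torch.no_grad():                          # Eq. (13): fixed target network
        y = r + GAMMA * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)            # Eq. (12)
    opt.zero_grad(); loss.backward(); opt.step()   # gradient step of Eq. (14)
    return loss.item()
```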
TABLE I: Parameters for model evaluation

Parameter                        Value
Learning rate                    0.001
Memory size                      10000
Epsilon (initial)                0.95
Epsilon (final)                  0.01
Epsilon decay rate               0.001
Minibatch size                   32
Discount factor γ                –
Episodes                         60
Fully connected hidden layers    3
Nodes per hidden layer           24, 24, 24
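To connect Fig. 3 with Table I, a high-level sketch of one training run is given below. The SUMO configuration file name, the step budget, and the W_SUM value are placeholders, and agents is assumed to map each traffic-light ID to a DQN agent exposing lanes, act, apply, remember, memory, and replay as in the sketch above.

```python
import traci

EPISODES, BATCH, W_SUM, MAX_STEPS = 60, 32, 50, 3600  # W_SUM and MAX_STEPS assumed

def waiting(lanes):
    # halting (stopped) vehicles per controlled lane, via TraCI's counter
    return [traci.lane.getLastStepHaltingNumber(l) for l in lanes]

for episode in range(EPISODES):
    traci.start(["sumo", "-c", "multi_agent.sumocfg"])  # (re)start environment
    for step in range(MAX_STEPS):
        states = {tls: waiting(ag.lanes) for tls, ag in agents.items()}
        actions = {tls: ag.act(states[tls]) for tls, ag in agents.items()}
        for tls, ag in agents.items():
            ag.apply(actions[tls])          # extend the chosen phase duration
        traci.simulationStep()
        for tls, ag in agents.items():
            w = waiting(ag.lanes)
            r = 1.0 / (1.0 + sum(w))        # case-1 reward, Eq. (8)
            ag.remember(states[tls], actions[tls], r, w)
            if len(ag.memory) > BATCH:
                ag.replay(BATCH)            # sample a minibatch, minimize Eq. (12)
        if sum(sum(waiting(ag.lanes)) for ag in agents.values()) > W_SUM:
            break                           # Done: waiting threshold exceeded
    traci.close()
```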
III. EVALUATION
The scenario is a 6 × 6 road network, and the w_{i_a} are aggregated in each episode. The performance of the agents is compared for r_a^1, r_a^2, r_a^3 and r_a^4. The rewards in Fig. 5(a) are for the agents with IDs 8 and 10, evaluated with r_a^1. Similarly, Fig. 5(b) shows the corresponding rewards evaluated using r_a^2. The shared rewards r_a^3 and r_a^4 are presented in Fig. 5(c) and Fig. 5(d), respectively. The effects of using the different reward functions are shown for each case in Fig. 5(e)-(h). It is observed that in case 1 both agents perform well in maximizing their rewards and minimizing the number of waiting cars in their connected lanes. Agent 10 performs better than agent 8 for cases 2, 3 and 4: the reward of agent 10 is higher than that of agent 8, and the corresponding results also show fewer average waiting cars for agent 10 compared to agent 8.

Figure 6 shows the results of 4 agents with IDs 8, 10, 15 and 17 for the same four cases in the same scenario. It is observed in Fig. 6(a) that agent 10 performs better than the other agents, as the average number of waiting cars under agent 10 is lower than the others. In Fig. 6(b) all agents try to perform better, with fluctuating rewards. These fluctuations are also reflected in the number of waiting cars, as shown in Fig. 6(f).
Fig. 6: Results with 4 agents. (a)–(d) Average of rewards r_a^1, r_a^2, r_a^3 and r_a^4. (e)–(h) Average waiting cars for cases 1–4.

Interestingly, similar rewards are observed in the shared-rewards cases (3, 4) in Fig. 6(c) and Fig. 6(d); the average numbers of waiting cars are also similar, as shown in Fig. 6(g) and Fig. 6(h). The results for 8 agents, with IDs 8, 10, 15, 17, 20, 22, 27 and 29, are shown in Fig. 7. Case 1 outperforms the other cases in Fig. 7(a). All agents try to maximize their rewards, showing better results in Fig. 7(a)-(b) and Fig. 7(e)-(f). The peak average number of waiting cars in Fig. 7(f) is 2.1, compared to 3.9 (see Fig. 4(b)) after 15 episodes, which means a decrease of 41.5%. On the other hand, all rewards become the same across the agents, as expected. However, the shared-reward schemes performed poorly compared to the unshared rewards. The reason behind this low performance is the failure to find actions that maximize the shared reward: the actions of one agent produce environment conditions that can negatively disturb the rewards of the other agents. Agents 8, 10, 17 and 29 performed relatively better even with reduced shared rewards.

Fig. 7: Results with 8 agents. (a)–(d) Average of rewards r_a^1, r_a^2, r_a^3 and r_a^4. (e)–(h) Average waiting cars for cases 1–4.

IV. CONCLUSION
In this article we have presented the performance of deep reinforcement learning in a multi-agent, V2X-driven traffic control system. We have observed that multi-agents with individual rewards based on the waiting number of cars are a better choice compared to the average waiting time. On the other hand, the shared-reward cases do not perform better: shared rewards make the situation more competitive. This competition should be further investigated using other deep reinforcement learning techniques. We have also observed that for a larger number of agents, the reward based on waiting time is the better choice.