Optimizing Traffic Lights with Multi-agent Deep Reinforcement Learning and V2X Communication
Azhar Hussain, Tong Wang and Cao Jiahua
College of Information and Communication Engineering, Harbin Engineering University, 150001, Harbin, China
[email protected]
Abstract—We consider a system to optimize the duration of traffic signals using multi-agent deep reinforcement learning and Vehicle-to-Everything (V2X) communication. The system analyzes independent and shared rewards for multiple agents that control the duration of traffic lights. Each learning agent (a traffic light) gets information along its lanes within a circular V2X coverage. The duration cycles of a traffic light are modeled as a Markov decision process. We investigate four variations of reward functions. The first two are unshared rewards, based respectively on the waiting number and the waiting time of vehicles between two cycles of a traffic light. The third and fourth are shared rewards, based respectively on the waiting cars and the waiting time across all agents. Each agent has a memory for optimization through a target network and prioritized experience replay. We evaluate the multi-agent system in the Simulation of Urban MObility (SUMO) simulator. The results prove the effectiveness of the proposed system in optimizing traffic signals, reducing the average number of waiting cars by 41.5% compared to the traditional periodic traffic control system.

Keywords—Deep reinforcement learning, V2X, deep learning, traffic light control.
I. INTRODUCTION

EXISTING traffic signal management is performed either by leveraging limited real-time traffic information or by fixed periodic traffic signal timings [1]. This information is widely obtained from underlying inductive loop detectors. However, this input is processed in a limited domain when estimating better durations of red/green signals. Advances in mobile communication networks and sensor technology have made it possible to obtain real-time traffic information [2]. An artificial brain can be implemented with deep reinforcement learning (DRL). DRL is based on three main components: states in the environment, an action space, and the scalar reward from each action [3]. A popular success of DRL is AlphaGo [4] and its successor AlphaGo Zero [5]. The main goal is to maximize the reward by choosing the best actions.

In some research works, the state is defined as the number of vehicles waiting at an intersection or the waiting queue length [6], [7]. However, it is shown in [8] that the real traffic environment cannot be fully captured by the number of waiting vehicles or the waiting queue length. Thanks to the rapid development of deep learning, large-state problems have been addressed with the deep neural network paradigm [9]. In [10], [11] the authors have proposed to resolve the traffic control problem with DRL. However, two limitations exist in the current studies: 1) fixed time intervals of traffic lights, which is not efficient in some cases; 2) random sequences of traffic signals, which may cause safety and comfort issues. In [12] the authors have controlled the durations in a cycle based on information extracted from vehicle and sensor networks, which can reduce the average waiting time by 20%.

In this letter we investigate, for the first time, multiple experienced traffic operators to control traffic at each step at multiple traffic lights.
Fig. 1: Illustration of the system under study: an agent observes state s from the environment, a deep neural network with parameter θ maps the state to a policy π_θ(s, a), and the chosen action a returns a reward r.
This idea assumes that the control process can be modeled as a Markov decision process (MDP). The system learns the control strategy based on the MDP by trial and error. Recently, a Q-learning based method was proposed in [18], showing better performance than a fixed-period policy. A linear function is proposed to achieve more effective traffic flow management under high traffic flow [19]. But neither tabular Q-learning nor linear function methods can support the increasing size of the traffic state space and accurate estimation of the Q value in a real scenario [20]. Gao et al. [21] proposed a DRL method with the change in cumulative staying time as the reward; a CNN is employed to map states to rewards [21]. H. Jiang analyzed nonzero-sum games with multiple players by using adaptive dynamic programming (ADP) [22]. Li et al. proposed a single-intersection control method based on DQN that focuses on a local optimum [23]. Pol et al. combined DQN with a max-plus method to achieve cooperation between traffic lights [25]. Xiaoyuan Liang et al. proposed a model incorporating multiple optimization elements for traffic lights, and simulation in SUMO has shown the efficiency of the model [12]. However, most of these works focused on a single agent controlling the signals of one intersection.

II. SYSTEM MODEL
The proposed system is shown in Fig. 1. The model for each agent is based on four items ⟨S, A, R, P⟩. Let S be the possible state space, with s a state (s ∈ S); A the possible action space, with a an action (a ∈ A); and R the reward space. Let P be the space of transition probability functions from one state to the next. A series of consequent actions is a policy π. Learning an optimal policy to maximize the cumulative expected reward is the main goal. An agent at state s takes an action a to reach the next state s′ and gets the reward r. The four-tuple ⟨s, a, r, s′⟩ represents this situation. The state transition occurs at a discrete step n in π. Let Q(s, a) be the cumulative future reward obtained by executing action a at state s, and let r_n be the reward at the n-th step; then (1) gives Q^π(s, a) for π:

$$ Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{n+k} \,\middle|\, s_n = s,\; a_n = a,\; \pi \right] \quad (1) $$

Fig. 2: Multi-agent traffic scenario (agents 1–8 control the traffic lights with IDs 8, 10, 15, 17, 20, 22, 27 and 29).
The parameter γ is a discount factor in [0, 1] that decides how much importance is given to recent and future rewards. The optimal policy π* can be acquired through several episodes of the learning process. The calculation of the optimal Q(s, a) is based on the optimal Q values of the succeeding states, represented by the Bellman optimality equation (2):

$$ Q^{\pi^{*}}(s, a) = \mathbb{E}_{s'}\left[ r_n + \gamma \max_{a'} Q^{\pi^{*}}(s', a') \,\middle|\, s, a \right] \quad (2) $$

It can be solved by dynamic programming, keeping the states finite for a lower computational burden. However, the Q values can also be estimated by a function with parameter θ for a larger number of states.
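For illustration only, the following minimal sketch applies the backup of (2) repeatedly to a small finite MDP; this is the dynamic-programming route mentioned above. The state/action sizes and the randomly generated transition and reward tables are hypothetical, not taken from the paper.

```python
import numpy as np

# Hypothetical small MDP: P[s, a, s'] transition probabilities, R[s, a] rewards.
n_states, n_actions, gamma = 4, 2, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    # Bellman optimality backup, Eq. (2): Q(s,a) <- E_{s'}[r + gamma * max_a' Q(s',a')]
    Q = R + gamma * (P @ Q.max(axis=1))
print(Q)  # converges to Q* since gamma < 1
```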
A. Problem formulation

The environment (road traffic) is shared by the agents. Let D_a be the total number of controlled lanes of an agent a, where a ∈ A and A is the set of agents. The number of waiting cars on a lane i_a is w_{i_a}. The reward is r_a^c for a case c ∈ C with C = {1, 2, 3, 4}. Figure 2 shows the multi-agent traffic scenario under study. We aim to minimize the average of w_{i_a} in each case c and for a varying number of agents, as described in problem (P):

$$ \text{(P)}: \; \min_{\pi} \mathbb{E}_{\pi}\left[ \sum_{i_a=1}^{D_a} w_{i_a} \right] \quad \text{s.t.} \quad r_a^c \in (0, 1], \; \forall c \in C, \; \forall a \in A \quad (3) $$
1) States: The states are the numbers of vehicles on the road for each lane of a traffic light agent. The numbers of vehicles are acquired from V2X communication in DSRC mode [26]. This restricts the states to the controlled lanes of a multi-lane traffic intersection, reducing the computational burden. The length of the road is defined as l_r. The state is a four-value vector ⟨l_0, l_1, l_2, l_3⟩ such that each element represents the number of vehicles in lane 0, lane 1, lane 2 and lane 3 (North, East, South, West), respectively.
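As a minimal sketch, such a state vector could be assembled through SUMO's TraCI interface; the lane IDs, the intersection center, and the coverage radius C_r below are placeholders, not values from the paper.

```python
import traci  # SUMO's Python TraCI client

# Placeholder incoming-lane IDs for one agent (lanes 0..3 = North, East, South, West)
LANES = ["north_in_0", "east_in_0", "south_in_0", "west_in_0"]
C_R = 150.0  # assumed V2X coverage radius in meters

def get_state(center_x, center_y):
    """Four-value state <l0, l1, l2, l3>: vehicles per lane that lie
    inside the circular V2X coverage around the traffic light."""
    state = []
    for lane in LANES:
        count = 0
        for veh in traci.lane.getLastStepVehicleIDs(lane):
            x, y = traci.vehicle.getPosition(veh)
            if (x - center_x) ** 2 + (y - center_y) ** 2 <= C_R ** 2:
                count += 1
        state.append(count)
    return state
```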
2) Actions: The action space decides the duration of each phase in the next cycle. The change of phase durations between two consecutive cycles is modeled as a high-dimensional MDP. A phase remains constant for D seconds. Let ⟨D_1, D_2, D_3, D_4⟩ be the durations of the four phases. The duration of a phase in the next cycle is incremented by D if the same action is chosen by an agent; repeating an action thus increases the duration of the same phase. To investigate the feasibility of actions, we assume that the probability of a phase transition from action i to action j is P_ij. Let f_ij(n) be the probability that, starting from action i, action j is reached for the first time after n steps. An agent may take too long to choose an unvisited action if there are no bounding conditions. The probability that action j is reached from action i in one action-step is f_ij(1), as in (4):

$$ f_{ij}(1) = P^{(1)}_{ij} = P_{ij} \quad (4) $$

Fig. 3: Process flow of the multi-agent system under study.

Fig. 4: Results for the traditional periodic TLC. (a) Duration per phase is 30 s (maximum average waiting cars: 3.3). (b) Duration per phase is 40 s (maximum: 3.9).
The first-passage probability after n action-steps can be generalized as (5):

$$ f_{ij}(n) = P^{(n)}_{ij} - f_{ij}(1)\, P^{(n-1)}_{jj} - \cdots - f_{ij}(n-1)\, P^{(1)}_{jj} \quad (5) $$

Let q_ij(m) be the probability that an agent at action i will eventually take j at least once within m transitions; it is the sum of all first-passage probabilities (6):

$$ q_{ij}(m) = f_{ij}(1) + f_{ij}(2) + f_{ij}(3) + \cdots + f_{ij}(m) \quad (6) $$

The feasible operation of the proposed system is possible if and only if j is transient in nature, which is expressed by (7):

$$ \sum_{n=1}^{\infty} P^{(n)}_{jj} < \infty \quad (7) $$

This can be achieved by setting a threshold W_sum on the maximum waiting number of vehicles.
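As a small numeric illustration, the recursion (5) and the cumulative sum (6) can be evaluated directly for any transition matrix; the 2 × 2 matrix below is hypothetical.

```python
import numpy as np

def first_passage(P, i, j, m):
    """f_ij(n) for n = 1..m via Eqs. (4)-(5), and q_ij(m) via Eq. (6)."""
    Pn = [np.linalg.matrix_power(P, n) for n in range(1, m + 1)]  # P^(n)
    f = np.zeros(m + 1)      # f[n] = f_ij(n); index 0 unused
    f[1] = P[i, j]           # Eq. (4)
    for n in range(2, m + 1):
        # Eq. (5): remove paths that already visited j at an earlier step
        f[n] = Pn[n - 1][i, j] - sum(f[k] * Pn[n - k - 1][j, j]
                                     for k in range(1, n))
    return f[1:], f[1:].sum()

P = np.array([[0.6, 0.4],    # hypothetical action-transition matrix
              [0.3, 0.7]])
f, q = first_passage(P, 0, 1, 10)  # e.g. f_01(2) = 0.6 * 0.4 = 0.24
```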
3) Rewards: Our study aims at decreasing w_{i_a} in each lane, which proportionally reduces the cumulative waiting time. In contrast to previous single-agent research, we argue that real-life scenarios consist of multiple traffic lights, and the learning process of one agent may not be effective in reducing traffic congestion in its neighborhood. The reason is that once an agent performs well at an intersection, the congestion on the roads connected to this agent is reduced, increasing the traffic flow toward, and potentially causing severe traffic jams at, the neighboring intersections. To investigate this we consider four cases in C.
4) Agents with unshared rewards: In the first case the reward for an agent a is based on the accumulated waiting number of vehicles in its vicinity. Let D_a be the total number of lanes of agent a. The reward r_a^1 for case 1 is given by (8):

$$ r_a^{1} = \left( 1 + \sum_{i_a=1}^{D_a} w_{i_a} \right)^{-1}, \quad \text{for } \sum_{i_a=1}^{D_a} w_{i_a} < W_{sum} \quad (8) $$

The reward in the second case is based on the accumulated waiting time of vehicles in an agent's vicinity. Let w_{t,i_a} be the total waiting time of all vehicles in lane i_a; then (9) gives the reward r_a^2 for a step t:

$$ r_a^{2} = \left( 1 + \sum_{i_a=1}^{D_a} w_{t,i_a} \right)^{-1} \quad (9) $$
5) Agents with shared rewards: The third case considers a common aim for all agents: the actions of the agents are selected to minimize the overall waiting number of cars across all agents. Therefore, the reward for each agent is based on the accumulated waiting number of vehicles in its own vicinity as well as in the other agents' vicinities. Let A agents be selected for the common environment and let D_b be the total number of controlled lanes of an agent b other than a; the reward r_a^3 is given by (10):

$$ r_a^{3} = \left( 1 + \sum_{i_a=1}^{D_a} w_{i_a} + \sum_{b=1, b \neq a}^{A} \sum_{i_b=1}^{D_b} w_{i_b} \right)^{-1} \quad (10) $$

In the fourth case we consider the waiting time experienced by all agents as the shared reward for an agent, i.e., r_a^4 is based on the accumulated waiting time of vehicles in its own and the other agents' vicinities, as expressed in (11):

$$ r_a^{4} = \left( 1 + \sum_{i_a=1}^{D_a} w_{t,i_a} + \sum_{b=1, b \neq a}^{A} \sum_{i_b=1}^{D_b} w_{t,i_b} \right)^{-1} \quad (11) $$

The variable w_{t,i_b} is the total waiting time of all vehicles in lane i_b.
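The shared variants only change what is summed. A sketch covering both (10) and (11), where per_agent_values holds waiting counts for case 3 or waiting times for case 4:

```python
def shared_reward(per_agent_values, a):
    """Eqs. (10)-(11): per_agent_values[b] is the list of per-lane values
    of agent b (waiting counts for case 3, waiting times for case 4)."""
    own = sum(per_agent_values[a])
    others = sum(sum(vals) for b, vals in enumerate(per_agent_values) if b != a)
    return 1.0 / (1.0 + own + others)

# Example: 3 agents, shared reward of agent 0 over everyone's waiting counts
r = shared_reward([[2, 0, 1, 3], [4, 1, 0, 0], [0, 0, 2, 1]], a=0)
```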
B. Process flow of the proposed model

Figure 3 shows the process flow of the proposed multi-agent model. The initialization parameters and their values are given in Table I. The proposed algorithm reads the dataset of traffic flow per lane and formulates the vehicle flow rate from the real-world domain into the SUMO domain in terms of steps. We choose this formulation to assign a vehicle flow rate, or arrival probability, to each vehicle according to configuration 1 of the "Huawei 2019 Vehicle Scheduling Contest" [24]. The purpose of selecting this online dataset is to establish a standard comparison for the research. A total of 128 cars are used. The agent training process starts by initializing the SUMO environment, which imitates a real-world road network with traffic lights and vehicles. Each agent is responsible for its traffic light region. The V2X coverage is limited to a circle of radius C_r meters around the traffic light intersection. The agent gets information on the vehicles on each connected road under the coverage area.

Each lane is given equal importance in the calculation of the reward function. All agents take actions in a predefined cycle duration. The agents act upon their respective states either using experience replay or based on random decisions under exploration. Experience replay is used individually by each agent to minimize its loss, which is the difference between target and prediction. A separate neural network acts as a function approximator, known as the Q-network Q(s, a; θ) with weights θ. The Q-network is trained by minimizing a sequence of loss functions L_i(θ_i) that changes at each iteration i, as shown in (12):

$$ L_i(\theta_i) = \mathbb{E}_{s,a \sim P(s,a)}\left[ \left( Q^{target}_i(s, a) - Q(s, a; \theta_i) \right)^{2} \right] \quad (12) $$
where P(s, a) is the probability distribution over state and action sequences, and the target Q^{target}_i(s, a) for the i-th iteration is given by (13):

$$ Q^{target}_i(s, a) = \mathbb{E}_{s' \sim \varepsilon}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right] \quad (13) $$

The weights θ_{i−1} of the previous iteration are kept fixed during the optimization of the loss function L_i(θ_i). The term ε denotes the SUMO environment. The aim is to predict the Q value for a given state and action, obtain the target value, and replace the predicted value for that action with the target value. The targets depend on the Q-network weights. Let L′_i = r + γ max_{a'} Q(s′, a′; θ_{i−1}) − Q(s, a; θ_i) be a temporary variable. Differentiating the loss function with respect to the weights gives the gradient in the form of (14):

$$ \nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim P(s,a)}\left[ L'_i \, \nabla_{\theta_i} Q(s, a; \theta_i) \right] \quad (14) $$

The proposed multi-agent DRL is a model-free approach: it solves the task directly from samples of the SUMO environment and does not require an estimate of ε. Each agent learns its greedy strategy max_a Q(s, a; θ) while following a behaviour distribution that ensures adequate state-space exploration. The learning behaviour distribution follows an ε-greedy strategy: exploitation is selected with probability 1 − ε, and a random action is selected with probability ε.

Fig. 5: Results with 2 agents. (a)–(d) Average of rewards r_a^1, r_a^2, r_a^3 and r_a^4. (e)–(h) Average waiting cars for cases 1–4.
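A minimal PyTorch sketch of (12)–(14) with a fixed target network and ε-greedy selection follows. The three 24-node hidden layers and the learning rate come from Table I, while the four-action output and the discount value 0.95 are assumptions; replay-buffer handling is omitted for brevity.

```python
import random
import torch
import torch.nn as nn

N_STATES, N_ACTIONS, GAMMA = 4, 4, 0.95  # 4-lane state; action count and gamma assumed

class QNet(nn.Module):
    """Q-network: three fully connected hidden layers of 24 nodes (Table I)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_STATES, 24), nn.ReLU(),
            nn.Linear(24, 24), nn.ReLU(),
            nn.Linear(24, 24), nn.ReLU(),
            nn.Linear(24, N_ACTIONS))
    def forward(self, s):
        return self.net(s)

q_net, target_net = QNet(), QNet()
target_net.load_state_dict(q_net.state_dict())  # theta_{i-1}, kept fixed
opt = torch.optim.Adam(q_net.parameters(), lr=0.001)

def act(state, eps):
    """Epsilon-greedy: explore with probability eps, else exploit."""
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return q_net(torch.as_tensor(state, dtype=torch.float32)).argmax().item()

def train_step(s, a, r, s2):
    """One minimization step of Eq. (12) on a minibatch (s, a, r, s2)."""
    with torch.no_grad():                          # Eq. (13): fixed target network
        y = r + GAMMA * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)            # Eq. (12)
    opt.zero_grad(); loss.backward(); opt.step()   # gradient step of Eq. (14)
    return loss.item()
```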
TABLE I: Parameters for model evaluation

Parameter                        Value
Learning rate                    0.001
Memory size                      10000
Epsilon (initial)                0.95
Epsilon (final)                  0.01
Epsilon decay rate               0.001
Minibatch size                   32
Discount factor γ                –
Episodes                         60
Fully connected hidden layers    3
Nodes per hidden layer           24, 24, 24
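To connect Fig. 3 with Table I, a high-level sketch of one training run is given below. The SUMO configuration file name, the step budget, and the W_SUM value are placeholders, and agents is assumed to map each traffic-light ID to a DQN agent exposing lanes, act, apply, remember, memory, and replay as in the sketch above.

```python
import traci

EPISODES, BATCH, W_SUM, MAX_STEPS = 60, 32, 50, 3600  # W_SUM and MAX_STEPS assumed

def waiting(lanes):
    # halting (stopped) vehicles per controlled lane, via TraCI's counter
    return [traci.lane.getLastStepHaltingNumber(l) for l in lanes]

for episode in range(EPISODES):
    traci.start(["sumo", "-c", "multi_agent.sumocfg"])  # (re)start environment
    for step in range(MAX_STEPS):
        states = {tls: waiting(ag.lanes) for tls, ag in agents.items()}
        actions = {tls: ag.act(states[tls]) for tls, ag in agents.items()}
        for tls, ag in agents.items():
            ag.apply(actions[tls])          # extend the chosen phase duration
        traci.simulationStep()
        for tls, ag in agents.items():
            w = waiting(ag.lanes)
            r = 1.0 / (1.0 + sum(w))        # case-1 reward, Eq. (8)
            ag.remember(states[tls], actions[tls], r, w)
            if len(ag.memory) > BATCH:
                ag.replay(BATCH)            # sample a minibatch, minimize Eq. (12)
        if sum(sum(waiting(ag.lanes)) for ag in agents.values()) > W_SUM:
            break                           # Done: waiting threshold exceeded
    traci.close()
```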
III. EVALUATION
The scenario is a 6 × 6 road network, and the w_{i_a} are aggregated in each episode. The performance of the agents is compared for r_a^1, r_a^2, r_a^3 and r_a^4. The rewards in Fig. 5(a) are for the agents with IDs 8 and 10, evaluated with r_a^1. Similarly, Fig. 5(b) shows the corresponding rewards evaluated using r_a^2. The shared rewards r_a^3 and r_a^4 are presented in Fig. 5(c) and Fig. 5(d), respectively. The effects of using the different reward functions are shown for each case in Fig. 5(e)-(h). It is observed that in case 1 both agents perform well in maximizing their rewards and minimizing the number of waiting cars in their connected lanes. Agent 10 performs better than agent 8 for cases 2, 3 and 4: the reward of agent 10 is higher than that of agent 8, and the corresponding results also show fewer average waiting cars for agent 10 compared to agent 8.

Figure 6 shows the results of 4 agents with IDs 8, 10, 15 and 17 for the same four cases in the same scenario. It is observed in Fig. 6(a) that agent 10 performs better than the other agents, as the average number of waiting cars under agent 10 is lower than the others. In Fig. 6(b) all agents try to perform better, with fluctuating rewards. These fluctuations are also reflected in the number of waiting cars, as shown in Fig. 6(f).
Fig. 6: Results with 4 agents. (a)–(d) Average of rewards r_a^1, r_a^2, r_a^3 and r_a^4. (e)–(h) Average waiting cars for cases 1–4.

Interestingly, similar rewards are observed in the shared-rewards cases (3, 4) in Fig. 6(c) and Fig. 6(d); the average numbers of waiting cars are also similar, as shown in Fig. 6(g) and Fig. 6(h). The results for 8 agents, with IDs 8, 10, 15, 17, 20, 22, 27 and 29, are shown in Fig. 7. Case 1 outperforms the other cases in Fig. 7(a). All agents try to maximize their rewards, showing better results in Fig. 7(a)-(b) and Fig. 7(e)-(f). The peak average number of waiting cars in Fig. 7(f) is 2.1, compared to 3.9 (see Fig. 4(b)) after 15 episodes, which means a decrease of 41.5%. On the other hand, all rewards become the same across the agents, as expected. However, the shared-reward schemes performed poorly compared to the unshared rewards. The reason behind this low performance is the failure to find actions that maximize the shared reward: the actions of one agent produce environment conditions that can negatively disturb the rewards of the other agents. Agents 8, 10, 17 and 29 performed relatively better even with reduced shared rewards.

Fig. 7: Results with 8 agents. (a)–(d) Average of rewards r_a^1, r_a^2, r_a^3 and r_a^4. (e)–(h) Average waiting cars for cases 1–4.

IV. CONCLUSION
In this article we have presented the performance of deep reinforcement learning in a multi-agent, V2X-driven traffic control system. We have observed that multi-agents with individual rewards based on the waiting number of cars are a better choice compared to the average waiting time. On the other hand, the shared-reward cases do not perform better: shared rewards make the situation more competitive. This competition should be further investigated using other deep reinforcement learning techniques. We have also observed that for a larger number of agents, the reward based on waiting time is the better choice.