A Methodology for the Development of RL-Based Adaptive Traffic Signal Controllers
Guilherme S. Varela, Pedro P. Santos, Alberto Sardinha, Francisco S. Melo
INESC-ID and Instituto Superior Técnico, University of Lisbon
[email protected], [email protected], [email protected], [email protected]
Abstract
This article proposes a methodology for the development of adaptive traffic signal controllers using reinforcement learning. Our methodology addresses the lack of standardization in the literature that renders the comparison of approaches in different works meaningless, due to differences in metrics, environments and even experimental design and methodology. The proposed methodology thus comprises all the steps necessary to develop, deploy and evaluate an adaptive traffic signal controller, from simulation setup to problem formulation and experimental design. We illustrate the proposed methodology in two simple scenarios, highlighting how its different steps address limitations found in the current literature.
Introduction
Traffic congestion is a cross-continental problem. In the United States alone, the average automobile commuter spends hours in congested traffic and wastes gallons of fuel due to congestion, leading to a substantial estimated cost in wasted time and fuel per commuter (Schrank, Eisele, and Lomax 2019), not considering external costs such as the increasing price of goods caused by the inflation of transportation costs, environmental and productivity impacts, as well as the decrease of the population's quality of life (Hilbrecht, Smale, and Mock 2014). Similarly, a recent study shows that the total external costs associated with traffic in the EU are likewise substantial (Becker, Becker, and Gerlach 2016). Hence, there have been numerous initiatives to mitigate traffic congestion, such as investment in public transit systems (Harford 2006) or intelligent transportation systems (Dimitrakopoulos and Demestichas 2010).

Traffic signals, a fundamental element of traffic control and regulation, are at once responsible for a significant share of traffic bottlenecks in urban environments and key to addressing the problem of traffic congestion. Effective traffic signal control is, therefore, a central part of urban traffic management. Classic traffic signal control approaches from transportation engineering, such as the Webster (Webster 1958) or Max-Pressure (Varaiya 2013) methods, have greatly increased the efficiency of traffic infrastructures all around the world. However, such approaches are either unable to adapt to changing traffic volumes, or rely on oversimplified traffic models, manual tuning and inaccurate traffic information (Zheng et al. 2019; Wei et al. 2019).

Alternatively, adaptive traffic signal control (ATSC) approaches seek to take advantage of the multiple sources of information currently available (from mobile navigation applications, ride-sharing platforms, etc.). Machine learning techniques can use the data made available by such platforms to provide traffic signal control strategies that adapt effectively to the current traffic conditions, providing a promising alternative to classical approaches.

Recently, several researchers have addressed ATSC using Markov decision processes (MDPs) (Wang et al. 2019; Wei et al. 2019). MDPs model discrete-time stochastic control problems and are extensively used in artificial intelligence to describe problems of sequential decision-making under uncertainty (Puterman 2005). In an MDP, an agent (the controller) interacts sequentially with an environment by selecting actions based on its observation of the environment's state; the actions selected by the agent influence how the environment's state evolves, and the agent receives a numerical evaluation signal (a reward) that instantaneously assesses the quality of the agent's action. The agent's goal is to select its actions to maximize some form of cumulative reward. MDPs are the backbone of reinforcement learning (RL), a machine learning paradigm in which the agent learns the optimal way of selecting actions from direct interaction with the environment, without resorting to any pre-defined model thereof (Sutton and Barto 1998).

RL algorithms are a natural choice when addressing ATSC, since they can be trained directly on the available data, without requiring human annotators to define what a "good" or "bad" control strategy is.
Unfortunately, the use of RL in this domain is not without its own challenges.

One of the first challenges is the lack of standard environments in the domain of ATSC. The RL field has benefited from standardized environments and easy-to-use APIs that allow researchers to compare different approaches on the same problem space, e.g. the Deep Q-network (Mnih et al. 2013), which has been shown to generate relevant features for the Arcade Learning Environment. The need for standardization has sparked new research towards open-source frameworks (Genders and Razavi 2019), as a means to prevent researchers from re-implementing the same set of fundamental tools with which to conduct de facto experiments. We argue it is important that the community agrees on a set of benchmark environments/traffic networks that may be used as a first test stage for the algorithms explored in the context of ATSC. The existence of such benchmarks would enable a proper comparison of different models and algorithms on a common set of environments, enabling a clearer assessment of the strengths and weaknesses of different alternatives.

Another major challenge is related to the security and explainability of current RL architectures. Although some RL systems have been quite successful in improving metrics of interest such as the average travel time, some protocols are not viable for real-world deployment for a variety of reasons (Ault, Hanna, and Sharon 2020), such as long and unsafe tuning processes, opaque policies, and the predominant use of synthetic simulation scenarios.

A third challenge is reproducibility. Reproducibility issues are thoroughly documented in the RL literature, as such systems might overfit to the training experience, showing good performance during training but performing poorly at deployment time (Whiteson et al. 2011). Prior work shows that simply changing the random seeds used to generate the simulations influences the outcome of the RL algorithm in a statistically significant manner (Aslani, Mesgari, and Wiering 2017).

This paper contributes one further step towards a wider application of RL in ATSC. However, for this potential to be realized, it is paramount to address the standardization, security and reproducibility issues identified above. Our contribution in this article is, therefore, a methodology for the development of RL-based adaptive traffic signal controllers that ensures some level of standardization at the different stages of the experimental process: simulation setup, environment modeling, experimental design and result reporting. We adopt a rather constrained action definition: it produces synchronous joint action schemes across the network. We show that the reinforcement learning agent is able to find policies that are on par with classical controllers, which benefit both from human supervision and from decades of literature in the transportation domain. Such action plans generate protocols that are more in line with governmental transportation departments' expectations, and hence provide more trust in face of the liability that such regulatory agencies face. In particular, with respect to experimental design and result reporting, we discuss good practices and relevant metrics that have been explored in different works, highlighting their merits and how their adoption may contribute to a better interpretation of experimental results and mitigate reproducibility issues.
We illustrate our own methodology by applying its different steps in designing a traffic signal controller using the well-known deep Q-network (DQN) algorithm (Mnih et al. 2013).

Background
Most approaches to traffic signal control rely on computer software for microsimulation, which simulates traffic at the level of individual vehicles, computing the position, velocity, emissions data and other quantities for every vehicle at each time step. A route is a sequence of roads used by vehicles to traverse the network.

A traffic infrastructure can be represented as a network/graph, where roads and junctions correspond to the edges and nodes, respectively. A road connects two nodes of the network and has a given number of lanes, a maximum speed and a length. Traffic light controllers are typically installed at road junctions. An intersection is a common type of junction in which roads cross each other. Figure 1 illustrates a typical intersection with four incoming and four outgoing approaches, each approach composed of three lanes. A signal movement refers to the transit of vehicles from an incoming approach to an outgoing approach. A signal movement can generally be sub-categorized as a left-turn, through or right-turn movement. For example, in the intersection of Fig. 1, East-South corresponds to a left-turn movement, while East-West corresponds to a through movement. Both are examples of signal movements. A green signal indicates that the respective movement is allowed, whereas a red signal indicates that the movement is prohibited.

Figure 1: Intersection with four incoming approaches, each composed of three lanes.

A signal phase is a combination of non-conflicting signal movements, i.e. the signal movements that can be set to green at the same time. In the intersection of Fig. 1, the triplet (North through, South through, South right-turn) is a valid signal phase. In contrast, (North through, South left-turn) is not. A signal plan for a single intersection is a sequence of signal phases and their respective durations. Usually, the time to cycle through all phases, known as the cycle length, is fixed. Therefore, the phase splits, i.e. the amount of time allocated to each signal phase, are normally defined as a ratio of the cycle length. A yellow signal is set as a transition from a green to a red signal.
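To make the terminology above concrete, the sketch below encodes an intersection's phases and a signal plan in Python. It is purely illustrative: the movement labels, class names and two-phase plan are our own and are not part of any simulator API.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    movements: set[str]  # non-conflicting movements shown green together, e.g. "N->S"


@dataclass
class SignalPlan:
    cycle_length: int    # seconds, e.g. 60
    phases: list[Phase]
    splits: list[float]  # fraction of the cycle allocated to each phase

    def durations(self) -> list[float]:
        """Phase durations in seconds, assuming the splits sum to one."""
        assert abs(sum(self.splits) - 1.0) < 1e-6
        return [self.cycle_length * s for s in self.splits]


# A two-phase plan for an intersection like the one in Fig. 1:
# phase 1 serves the vertical direction, phase 2 the horizontal one.
plan = SignalPlan(
    cycle_length=60,
    phases=[Phase({"N->S", "S->N"}), Phase({"E->W", "W->E"})],
    splits=[0.5, 0.5],
)
print(plan.durations())  # [30.0, 30.0]
```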
Reinforcement Learning

As mentioned in the introduction, reinforcement learning considers an agent whose interaction with the environment can be described as a Markov decision process (MDP). An MDP is a tuple $(\mathcal{S}, \mathcal{A}, \{\mathbf{P}_a, a \in \mathcal{A}\}, r, \gamma)$, where $\mathcal{S}$ is a set of states and $\mathcal{A}$ is a set of actions. At each step $t$, the agent observes the state $S_t \in \mathcal{S}$ of the environment and selects an action $A_t \in \mathcal{A}$. The environment then transitions to a state $S_{t+1}$, where

$$\mathbb{P}\left[S_{t+1} = s' \mid S_t = s, A_t = a\right] = [\mathbf{P}_a]_{ss'}. \qquad (1)$$

The matrix $\mathbf{P}_a$, $a \in \mathcal{A}$, encodes the transition probabilities associated with action $a$. Upon executing an action $a$ in state $s$, the agent receives a (possibly random) reward with expectation given by $r(s, a)$. The goal of the agent is to select its actions so as to maximize the expected total discounted reward (TDR),

$$\mathrm{TDR} = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t\right], \qquad (2)$$

where $R_t$ is the random reward received at time step $t$ (with $\mathbb{E}[R_t] = r(S_t, A_t)$) and the scalar $\gamma$ is a discount factor. The long-term value of an action $a$ in a state $s$ is captured by the optimal $Q$-value, $Q^*(s, a)$, which can be computed using, for example, the $Q$-learning algorithm (Watkins 1989). The $Q$-learning algorithm estimates the optimal $Q$-values as the agent interacts with the environment: given a transition $(s, a, r, s')$ experienced by the agent, $Q$-learning performs the update

$$\hat{Q}(s, a) \leftarrow \hat{Q}(s, a) + \alpha \left( r + \gamma \max_{a' \in \mathcal{A}} \hat{Q}(s', a') - \hat{Q}(s, a) \right), \qquad (3)$$

where $\alpha$ is a step size. Upon computing $Q^*$, the agent can act optimally by selecting, in each state $s$, the optimal action at $s$, given by $\pi^*(s) = \operatorname{argmax}_a Q^*(s, a)$. The mapping $\pi^*: \mathcal{S} \to \mathcal{A}$, mapping each state $s$ to the corresponding optimal action $\pi^*(s)$, is known as the optimal policy for the MDP.

The deep Q-network (DQN) (Mnih et al. 2013) is a well-known RL method that approximates the Q-values with a neural network $\hat{Q}(s, a; \theta)$, where $\theta$ denotes the parameters of the model. At each step, the agent adds a transition $(s, a, r, s')$ to a replay memory buffer, from which batches of transitions are sampled in order to optimize the parameters of the model such that the following loss is minimized:

$$L(\theta) = \left( r + \gamma \max_{a' \in \mathcal{A}} \hat{Q}(s', a'; \theta^-) - \hat{Q}(s, a; \theta) \right)^2. \qquad (4)$$

The gradient of the loss is backpropagated only into the behaviour network, $\hat{Q}(s, a; \theta)$, which is used to select actions. The term $\theta^-$ represents the parameters of the target network, a periodic copy of the behaviour network.
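For reference, the tabular Q-learning update of Eq. (3), together with an epsilon-greedy behaviour policy, can be sketched as follows. This is a didactic sketch with illustrative values of the step size, discount factor and exploration rate, not the implementation used in this work.

```python
import random
from collections import defaultdict

# Q[(s, a)] -> estimated Q-value; unseen pairs default to 0.0.
Q = defaultdict(float)
alpha, gamma = 0.1, 0.99  # step size and discount factor (illustrative values)


def q_learning_update(s, a, r, s_next, actions):
    """Apply one Q-learning update, Eq. (3), for the transition (s, a, r, s')."""
    target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def epsilon_greedy(s, actions, eps=0.1):
    """Behaviour policy: explore with probability eps, otherwise act greedily."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```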
Related Work

Several works have explored the use of RL in traffic light control, most of which rely on estimating the Q-function or an approximation thereof (Abdulhai, Pringle, and Karakoulas 2003; Wiering 2000). In their simplest form, such RL approaches consider that each intersection is controlled by a single agent that ignores the existence of other agents in neighboring intersections. More sophisticated approaches consider the existence of multiple agents and leverage the network structure to address the interaction between the different agents (Prabuchandran, Hemanth, and Bhatnagar 2014; Liu, Liu, and Chen 2017). With the advent of deep learning, several of the approaches above have been extended to accommodate deep neural networks as the underlying representation of the problem. For example, Genders and Razavi (Genders and Razavi 2016) propose a DQN control agent that combines a deep convolutional network with Q-learning. The work controls the traffic lights at a single intersection, and essentially extends previous work to accommodate the deep learning model.

In terms of experimental methodology, there are several issues that make the comparison of different approaches challenging. First, different works adopt different evaluation metrics and baseline policies. In one work, the performance metrics used are the average travel time per car and the average wait time per car, and the proposed approach is compared against a fixed-time plan (Thorpe 1997). In a different work, the metrics adopted are the average waiting time and the number of refused cars (a saturation condition of the used simulator), and the proposed approach is compared against both a random policy and a fixed-time policy (Wiering 2000). In yet another work, the metric used is the ratio of the average delay, and the proposed approach is compared against a pre-timed plan (Abdulhai, Pringle, and Karakoulas 2003). The lack of a clear understanding of how the different metrics depend on the number of intersections is another issue that renders comparisons difficult. Finally, while average metrics of the variables of interest are provided, most works offer no measures of significance regarding the reported performance.

Recently, several works have sought to address some of the issues discussed above (Genders and Razavi 2019). Researchers have put forth several recommendations/good practices for the use of RL in the context of ATSC: (i) provide all hyper-parameters and the number of trial experiments; (ii) report aggregated results with deviation metrics (averages and standard deviations, not maximum returns); (iii) implement proper experimental procedures (average together many trials using different random seeds for each) (Islam et al. 2017). We extend those works by providing both the preliminary steps necessary to simulate, develop, train and evaluate RL-based experiments for ATSC on real-world scenarios, as well as insights that can be extracted by applying our methodology in this domain.
Methodology

We propose a four-stage methodology to be used in the development of RL-based traffic signal controllers. Figure 2 illustrates the proposed methodology, comprising four stages: simulation setup, MDP formulation and selection of the RL learning method, training, and evaluation.
Simulation setup
The first stage of the proposed methodology is the simulation setup phase. RL controllers must be trained by resorting to (micro-)simulators that are able to provide a realistic response to the agent's actions during the learning process. The main objective of the simulation setup stage is to prepare all the simulations needed to carry out such training. It includes gathering simulation-related data, such as the topological data of the road network and vehicle demand/route data, as well as setting up the traffic simulator.

Figure 2: Diagram illustrating the proposed methodology, composed of four stages. Solid arrows denote the main development flow, whereas dashed arrows denote the iterative process of model tuning.

In any case, for the purpose of our methodology, it is important that during this stage two key components are configured and defined: (i) the topology of the road network; and (ii) the traffic demands and routes.
Road network topology.

Networks can be either synthetic or extracted from real-world locations. Available open-source services, such as OpenStreetMap (OpenStreetMap contributors 2017), allow segments of city districts to be exported and, during the simulation setup step, such information can be prepared and fed to the simulator, thus opening up the possibility of simulating a rich set of networks relevant to real-world traffic signal control.

In our own implementation, we use geospatial data from OpenStreetMap to build the configuration files to be used by the simulator, in our case the SUMO micro-simulator (Krajzewicz et al. 2012). The following steps are required: (i) extract the region of interest from OpenStreetMap and open the resulting file with the JOSM editor (https://josm.openstreetmap.de/), an extensible editor for OpenStreetMap files, in order to fine-tune the network; (ii) convert the (edited) OpenStreetMap file into the SUMO network format using the netconvert tool (https://sumo.dlr.de/docs/netconvert.html); and (iii) open the resulting SUMO network file with the netedit tool (https://sumo.dlr.de/docs/netedit.html), a graphical network editor for SUMO, in order to ensure that all intersections are properly set up, namely checking whether all traffic phases and links (connections between lanes) are correct.
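The conversion in step (ii) can be scripted. The sketch below calls SUMO's netconvert command-line tool from Python, assuming a map excerpt map.osm exported from OpenStreetMap and already edited in JOSM; the file names are illustrative and the exact options may vary with the SUMO version.

```python
import subprocess

# Step (ii): convert an (edited) OpenStreetMap excerpt into a SUMO network.
# "map.osm" and "network.net.xml" are illustrative file names.
subprocess.run(
    [
        "netconvert",
        "--osm-files", "map.osm",   # input OSM excerpt (after JOSM editing)
        "-o", "network.net.xml",    # resulting SUMO network file
    ],
    check=True,
)
# The resulting network.net.xml can then be opened with netedit (step (iii))
# to verify traffic-light phases and lane-to-lane connections.
```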
Traffic demands and routes.

Traffic demands and routes can be either synthetic (Wei et al. 2018) or derived from real-world data using origin-destination matrices (Aslani, Mesgari, and Wiering 2017) or induction loop counts (Rodrigues and Azevedo 2019). Regarding synthetic demands, simple (constant demand) to complex (variable demand) scenarios can be created by specifying the probabilities of vehicle insertion through time. With respect to the generation of a synthetic set of routes, the duarouter tool (https://sumo.dlr.de/docs/duarouter.html) can be used. Afterwards, a probability can be assigned to each unique route by weighting it according to a pre-determined criterion, such as inversely proportional to the number of turns contained in the respective route.
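As an illustration of the weighting criterion mentioned above, the following sketch assigns each candidate route a probability inversely proportional to its number of turns. The route identifiers and turn counts are placeholders that would, in practice, be parsed from the duarouter output.

```python
# Hypothetical candidate routes and their number of turns.
routes = {"route_a": 0, "route_b": 1, "route_c": 3}  # route id -> number of turns

# Weight each route inversely to (1 + number of turns), then normalize.
weights = {r: 1.0 / (1 + turns) for r, turns in routes.items()}
total = sum(weights.values())
probabilities = {r: w / total for r, w in weights.items()}

print(probabilities)  # route_a receives the largest insertion probability
```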
MDP formulation and RL approach

The second stage in our methodology is the description of the traffic control problem as a Markov decision problem (or a multiagent version thereof). As previously discussed, an MDP comprises five elements: the set of states, the set of actions, the transition probabilities, the reward, and the discount factor.

Most works that apply RL to the ATSC domain do not specify the transition probabilities, since they are not strictly required; moreover, explicitly modeling the traffic dynamics that result from changes in traffic light control is infeasible in most cases. The remaining components must still be specified: the state space (i.e., the information upon which the agent will base its decisions), the action space (i.e., how the agent is able to influence the environment through the choice of its actions) and the reward function (which implicitly encodes the goal of the agent). The literature is very diverse with respect to the adopted problem formulation; it is important to stress that the performance of the resulting controller will depend critically on the choices made at this stage.

Alongside the MDP formulation, it is also necessary to select the RL method to be used as the learning component of the traffic signal controller, a choice that is not independent of the MDP formulation. The literature is also very diverse with respect to the choice of algorithm, even though the majority of works focus their attention on value-based methods (i.e., methods that seek to estimate/approximate the Q-function or a surrogate thereof). The work of El-Tantawy et al. (El-Tantawy, Abdulhai, and Abdelgawad 2014) provides a comparison between some common state-space representations, reward functions and action space definitions, using a real-world network topology. Wei et al. (Wei et al. 2019) provide a comprehensive list of commonly used MDP formulations in the context of ATSC, as well as RL methods used in the context of traffic signal control.
Training

The training procedure of RL-based traffic signal controllers should follow the same guidelines used in the field of RL. For example, a proper balance between exploration and exploitation should be ensured, making sure that the agent is able to experience a wide range of different situations. In adherence to the good practices previously discussed, it is important to run multiple instances of the training process, using different seeds, in order to correctly assess the learning ability of the proposed RL method (a sketch of such a procedure is given at the end of this subsection).

In the context of ATSC, particular attention must be paid to ensuring that the simulations are running properly. Gridlocks, which occur when a queue from one bottleneck creates a new bottleneck somewhere else, and so on in a vicious cycle that completely stalls the vehicles' circulation (Daganzo 2007), should be avoided or properly processed, for example by restarting the simulation, adjusting the vehicles' arrivals, or teleporting vehicles.

Finally, it is important to monitor performance metrics such as losses, rewards, the number of vehicles in the simulation and the vehicles' velocities throughout training, in order to gain better insight into the learning process.

The outcome of the training process is a set of policies (each one resulting from a different training run) that need to be properly evaluated. The next and final step of the proposed methodology addresses how this can be accomplished.
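A minimal sketch of such a multi-seed training procedure is shown below; train_one_run is a hypothetical placeholder for the full agent/simulator training loop, and the monitored metrics mirror the ones suggested above.

```python
import random

import numpy as np


def train_one_run(seed: int):
    """Placeholder for a full training run (agent + simulator) under a given seed."""
    random.seed(seed)
    np.random.seed(seed)
    # ... build the simulation, train the RL agent, log losses/rewards/queues ...
    policy = {"seed": seed}               # stand-in for the learned policy
    metrics = {"reward": [], "loss": []}  # monitored throughout training
    return policy, metrics


# Run several independent training instances with different seeds,
# as recommended for a correct assessment of the learning ability.
policies = []
for seed in range(30):
    policy, metrics = train_one_run(seed)
    policies.append(policy)
# The resulting set of policies is what the evaluation stage assesses.
```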
Evaluation
In the context of traffic light control, several performance metrics have been proposed: travel time, waiting time, number of stops, queue length, throughput, as well as gas emissions and fuel consumption. Of all these metrics, the minimization of travel time is usually the main goal in the development of ATSCs; it is therefore arguably the most important metric to report. Since algorithms are usually unable to directly optimize the travel time, it is useful to report additional metrics that are more closely related to the formulated agent's objective (the reward function, in the case of RL agents), such as the queue length or waiting time.

It is also important to run some baseline algorithms, such as the Webster or Max-Pressure methods, under the same scenario. These runs are of extreme importance since they allow the performance of the RL controller(s) to be compared against well-established, commonly used traffic engineering methods.
Performance estimation.
In order to adequately compare the different approaches, it is important to assess the performance of the alternative proposals on a set of accordingly seeded simulations, so as to rule out any influence of the simulation seeds on the results. For the baseline algorithms, this can be achieved by simply running multiple evaluation simulations. With respect to the RL agents, each of the policies that resulted from the training stage should be evaluated on a set of evaluation rollouts and, if later needed, the results aggregated per policy.
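The estimation protocol can be sketched as follows: every method is evaluated on the same set of seeded simulations and the resulting travel times are aggregated per policy. Here, evaluate_rollout and the list of trained policies are hypothetical placeholders for a full evaluation simulation and the output of the training stage.

```python
import numpy as np


def evaluate_rollout(policy, seed: int) -> float:
    """Placeholder: run one evaluation simulation with the given seed and
    return the mean travel time observed for that rollout."""
    rng = np.random.default_rng(seed)
    return float(rng.normal(25.0, 1.0))  # stand-in value


policies = [{"seed": s} for s in range(30)]  # placeholder for the trained policies
eval_seeds = list(range(100, 130))           # the same seeds are reused for every method


def evaluate_policy(policy) -> np.ndarray:
    """Travel-time means obtained by one policy over all evaluation seeds."""
    return np.array([evaluate_rollout(policy, s) for s in eval_seeds])


# For an RL method, evaluate each trained policy and aggregate per policy;
# for a baseline, simply evaluate it over the same set of seeds.
results_per_policy = [evaluate_policy(p) for p in policies]
mean_travel_time = np.mean([r.mean() for r in results_per_policy])
```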
Performance analysis & comparison.
The simplest and most straightforward way to present and compare the performances of the different methods is through point estimation, for example by reporting the mean and standard deviation of the travel times observed over a set of evaluation rollouts. While this is commonly used in the ATSC domain, there is sometimes a large overlap in the reported performance metrics between different methods; it is therefore important to understand whether the observed differences are statistically significant or not.

Figure 3: SUMO screenshot of the considered intersections, near Marquês de Pombal square, in Lisbon. Scenario 1 comprises only the left-most intersection whereas scenario 2 is composed of all three pictured intersections. Phase 1 allows the movement of vehicles in the vertical direction whereas phase 2 allows the movement in the horizontal direction.

A better insight into the performance of the methods can be achieved by plotting the distributions of the travel time means for each of the methods. However, in order to draw conclusions from the observed performances, we propose the use of statistical hypothesis testing (Ross 2004). Specifically, we are interested in understanding whether two or more population means (the performance metrics of the different methods) are equal. It is highly likely that this hypothesis is rejected; therefore, post hoc comparisons need to be performed in order to understand between which pairs of means there is a statistically significant difference. To perform such an evaluation, we propose the use of the (one-way) ANOVA test, followed by the Tukey Honestly Significant Difference (HSD) test for post hoc comparisons. Unfortunately, these tests make some assumptions about the data distributions that do not always hold. Therefore, it is worth checking whether the assumptions are met and, if not, replacing the aforementioned tests with their respective non-parametric versions.

As a complementary analysis, it might be interesting to plot histograms of the performance metrics, since mean values are often not representative of the underlying distributions.
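The hypothesis-testing procedure described above can be carried out, for example, with SciPy and statsmodels, as in the sketch below. The per-method travel-time samples are illustrative, and the Kruskal-Wallis test is shown as one possible non-parametric fallback.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Travel-time means per evaluation rollout, one array per method (illustrative data).
samples = {
    "static":       np.random.normal(25.0, 0.5, 30),
    "webster":      np.random.normal(25.4, 0.5, 30),
    "max_pressure": np.random.normal(23.4, 0.5, 30),
    "rl":           np.random.normal(25.0, 0.5, 30),
}

# One-way ANOVA: are all population means equal?
f_stat, p_value = stats.f_oneway(*samples.values())
print(f"ANOVA p-value: {p_value:.4f}")

# Post hoc pairwise comparisons (Tukey HSD) if the ANOVA rejects equality of means.
values = np.concatenate(list(samples.values()))
labels = np.repeat(list(samples.keys()), [len(v) for v in samples.values()])
print(pairwise_tukeyhsd(values, labels))

# If the ANOVA assumptions (normality, equal variances) do not hold,
# fall back to a non-parametric test such as Kruskal-Wallis.
h_stat, p_value_np = stats.kruskal(*samples.values())
```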
Experiments
We now illustrate the application of the previously described methodology by developing a DQN-based ATSC for two real-world scenarios in Lisbon, Portugal.
Pre-processing
For the purposes of this work, two real-world scenarios in the Lisbon metropolitan area are considered: scenario 1 consists of a single intersection, whereas the second scenario consists of a set of three consecutive intersections. The two scenarios were extracted from OpenStreetMap by following the previously described steps. The JOSM editor was used to further crop, rotate and resize the area of interest. The resulting SUMO environment is shown in Figure 3. The considered demands are time-constant and proportional to the number of lanes. The routes are weighted using the previously described synthetic procedure.
MDP formulation and RL approach

We adopt a simple approach in which an RL agent controls a single intersection, ignoring the existence of other agents in neighboring intersections (Abdulhai, Pringle, and Karakoulas 2003). In other words, each individual agent is modeled using an MDP that considers only the traffic information in the intersection controlled by that agent.

In each intersection, the signal cycle length is fixed to 60 seconds, and the yellow time to 6 seconds. At the beginning of each cycle, the controller is able to pick the signal plan to be executed throughout the next cycle, from a set of predefined signal plans. In our case, all intersections are composed of two phases. More precisely, the action space consists of a discrete set of 7 signal plans {0: (30%, 70%), 1: (37%, 63%), 2: (43%, 57%), 3: (50%, 50%), 4: (57%, 43%), 5: (63%, 37%), 6: (70%, 30%)}, where the first and second elements of each tuple correspond, respectively, to the phase splits of phase 1 and phase 2. With this action space definition, adjacent traffic controllers can be easily synchronized and a minimum green time is guaranteed for all phases, easily ensuring that all safety standards are met.

The state $s$ at cycle $c$ is represented by the tuple $(w_1, w_2)$, where $w_p$ is the cumulative number of vehicles waiting, or navigating at low speeds, in phase $p$, and can be computed according to:

$$w_p = \frac{1}{K} \sum_{k=0}^{K-1} \sum_{l \in L_p} \sum_{v \in V_{kl}} \mathrm{stopped}(v, k), \qquad (5)$$

where $K$ is the cycle length in seconds (fixed as $K = 60$), $L_p$ is the set of all inbound lanes to phase $p$, and $V_{kl}$ is the set of vehicles on lane $l$ at time $k$. A vehicle is said to be stopped when it has a very low speed:

$$\mathrm{stopped}(v, k) = \begin{cases} 1 & \text{if } \mathrm{speed}(v, k) < \text{threshold} \\ 0 & \text{otherwise,} \end{cases} \qquad (6)$$

where the used threshold corresponds to 10% of the maximum velocity and all vehicles' speeds are normalized so as to belong to the interval $[0, 1]$.

Finally, we use an action-independent reward function defined, for a state $s = (w_1, w_2)$ in intersection $i$ and cycle $c$ with phases $P$, as:

$$r = -\sum_{p \in P} w_p. \qquad (7)$$

The reward in Eq. (7) consists of the (negative) sum of the total amount of seconds the vehicles have been stopped during the cycle. We use a fixed discount factor $\gamma < 1$.
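A sketch of how the state of Eq. (5) and the reward of Eq. (7) could be computed with SUMO's TraCI Python API is shown below. The lane identifiers, phase-to-lane mapping and configuration file name are illustrative, and the actual implementation may differ.

```python
import traci

CYCLE_LENGTH = 60     # K, in seconds
SPEED_FRACTION = 0.1  # a vehicle counts as "stopped" below 10% of the lane speed limit

# Illustrative mapping from phase index to its inbound lanes (lane ids are placeholders).
PHASE_LANES = {1: ["edge_ns_0", "edge_sn_0"], 2: ["edge_ew_0", "edge_we_0"]}


def stopped_count(lane_id: str) -> int:
    """Number of vehicles on the lane travelling below the speed threshold."""
    threshold = SPEED_FRACTION * traci.lane.getMaxSpeed(lane_id)
    vehicles = traci.lane.getLastStepVehicleIDs(lane_id)
    return sum(1 for veh in vehicles if traci.vehicle.getSpeed(veh) < threshold)


def run_one_cycle():
    """Advance the simulation one full cycle and return the state (w_1, w_2) and reward."""
    w = {p: 0 for p in PHASE_LANES}
    for _ in range(CYCLE_LENGTH):
        traci.simulationStep()
        for phase, lanes in PHASE_LANES.items():
            w[phase] += sum(stopped_count(lane) for lane in lanes)
    state = tuple(w[p] / CYCLE_LENGTH for p in sorted(PHASE_LANES))  # Eq. (5)
    reward = -sum(state)                                             # Eq. (7)
    return state, reward


if __name__ == "__main__":
    # "scenario.sumocfg" is an illustrative configuration file name.
    traci.start(["sumo", "-c", "scenario.sumocfg"])
    print(run_one_cycle())
    traci.close()
```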
Training

We ran 30 independent training runs. Figures 4 and 5 display, respectively, the mean actions selected during training and the observed instantaneous rewards. As can be seen, the agent's actions converge towards lower-indexed signal plans, which can be explained by the intersection's structure (Fig. 3): the horizontal direction serves more vehicles, so the agent converges towards lower-indexed actions, which allocate a longer period to this phase. As the actions converge, a corresponding increase in the observed instantaneous rewards is noticeable.

Figure 4: (Scenario 1) Mean actions selected per cycle during training.

Figure 5: (Scenario 1) Instantaneous rewards per cycle (mean, smoothed mean, and standard deviation).

The outcome of the training stage consists of 30 policies that now need to be properly assessed.
Evaluation
Finally, the performance of the resulting agents is evaluated using a set of performance metrics: travel time, waiting time and vehicles' speed. Tables 1 and 2 display the evaluation metrics for the different traffic signal controllers. The actuated controller dynamically extends the current phase, up to a maximum value, if a continuous stream of incoming vehicles is detected.

Method         Speed (m/s)   Waiting time (s)   Travel time (s)
Static         (6.7, 3.5)    (8.1, 9.9)         (25.0, 12.5)
Webster        (6.6, 3.4)    (8.2, 9.7)         (25.4, 12.4)
Max-pressure   (6.5, 2.8)    (5.7, 6.5)         (23.4, 9.0)
Actuated       (6.7, 3.4)    (7.8, 10.2)        (24.9, 12.8)
RL controller  (6.8, 3.5)    (8.0, 10.1)        (25.0, 12.6)

Table 1: (Scenario 1) Evaluation metrics aggregated per vehicle trip (averaged over 30 simulations). The first tuple position encodes the mean value; the second tuple position encodes the standard deviation.

Method         Speed (m/s)   Waiting time (s)   Travel time (s)
Webster        (6.3, 2.7)    (12.0, 12.0)       (38.6, 14.7)
Max-pressure   (5.8, 2.2)    (9.0, 7.0)         (42.3, 18.6)
Actuated       (5.6, 2.5)    (11.6, 10.5)       (45.5, 22.0)
RL controller  (6.3, 2.8)    (12.2, 10.9)       (39.1, 15.5)

Table 2: (Scenario 2) Evaluation metrics aggregated per vehicle trip (averaged over 30 simulations). The first tuple position encodes the mean value; the second tuple position encodes the standard deviation.

With respect to scenario 1 (Tab. 1), the results show that the RL controller is able to achieve the highest average speed, outperforming all the other controllers. With respect to the travel time metric, the RL agent is able to outperform the Webster method and equal the performance of both the best static controller and the actuated method. However, the agent is unable to outperform the Max-Pressure controller; it is important to notice, though, that this method exhibits a higher degree of flexibility in comparison to the RL agent, since its cycle length is dynamic.

Regarding the second scenario (Tab. 2), the RL agents are again able to achieve the highest average speed. Regarding the travel time, it can be seen that the RL agents are able to outperform the actuated method as well as the Max-Pressure controller, despite the fact that these methods are able to achieve a significantly lower waiting time. This happens due to miscoordination between the adjacent intersections for both the actuated and Max-Pressure methods (due to their acyclic behaviour).

Fig. 6 displays the distribution of the travel time means for scenario 1. As can be seen, there is a significant overlap in performance between the different methods. Furthermore, despite the fact that the ANOVA test rejects the hypothesis that the mean performance of all methods is the same, the Tukey HSD pairwise tests show no significant difference in mean performance between some methods. Namely, between the actuated, static and RL controllers, the confidence interval on the difference of means either includes zero, or its bounds are close to it.

Finally, Fig. 7 displays the distribution of the vehicles' speeds per trip. It is interesting to notice the multimodal shape of the underlying distributions, as well as the slight differences between the different methods.
Figure 6: (Scenario 1) Kernel density estimation of the travel time means, computed using 30 samples for each of the methods (static, Webster, Max-pressure, actuated and RL).

Figure 7: (Scenario 1) Kernel density estimation of the vehicles' speeds, computed using 30 samples for each of the methods (static, Webster, Max-pressure, actuated and RL).

Conclusions

In this paper, we proposed a methodology for the development of RL-based adaptive traffic signal controllers. The proposed methodology comprises four steps, all of which are necessary to develop, deploy and evaluate an ATSC: simulation setup, problem formulation, training and evaluation. We illustrated the proposed methodology by developing a deep RL-based ATSC that achieves performance on par with established methods from the transportation engineering field. Despite the fact that the flexibility of the agent is constrained in order to meet safety standards, the presented results hint at the potential of RL-based controllers to help mitigate traffic congestion, and highlight the need for coordination between adjacent intersections. Finally, we note that the advantages of RL methods are more apparent in scenarios comprising larger traffic networks and more variable traffic patterns, something that could be considered in future work.
Acknowledgments
This work was partially supported by national funds through the Portuguese Fundação para a Ciência e a Tecnologia under project UIDB/50021/2020 (INESC-ID multi-annual funding) and the project ILU, with reference DSAIPA/DS/0111/2018. This research was also partially supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No. 952215.
References
Abdulhai, B.; Pringle, R.; and Karakoulas, G. 2003. Reinforcement learning for true adaptive traffic signal control. Journal of Transportation Engineering.
Aslani, Mesgari, and Wiering. 2017. Transportation Research Part C 85: 732-752.
Ault, J.; Hanna, J. P.; and Sharon, G. 2020. Learning an Interpretable Traffic Signal Control Policy. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '20, 88-96. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450375184.
Becker, U.; Becker, T.; and Gerlach, J. 2016. The true costs of automobility: External costs of cars, overview on existing estimates in EU-27. Technical report, Institute of Transport Planning and Road Traffic, Technische Universität Dresden.
Daganzo, C. 2007. Urban gridlock: Macroscopic modeling and mitigation approaches. Transportation Research Part B.
Dimitrakopoulos and Demestichas. 2010. IEEE Vehicular Technology Magazine.
El-Tantawy, Abdulhai, and Abdelgawad. 2014. Journal of Intelligent Transportation Systems.
Genders, W.; and Razavi, S. N. 2016. Using a Deep Reinforcement Learning Agent for Traffic Signal Control. CoRR abs/1611.01142. URL http://arxiv.org/abs/1611.01142.
Genders, W.; and Razavi, S. N. 2019. CoRR abs/1909.00395.
Harford, J. 2006. Congestion, pollution, and benefit-to-cost ratios of US public transit systems. Transportation Research Part D.
Hilbrecht, Smale, and Mock. 2014. World Leisure Journal.
Islam et al. 2017. CoRR abs/1708.04133.
Krajzewicz, D.; Erdmann, J.; Behrisch, M.; and Bieker, L. 2012. Recent development and applications of SUMO - Simulation of Urban MObility. International Journal on Advances in Systems and Measurements.
Liu, Liu, and Chen. 2017. In Proc. IEEE 20th Int. Conf. Intelligent Transportation Systems.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. A. 2013. Playing Atari with Deep Reinforcement Learning. CoRR.
Prabuchandran, Hemanth, and Bhatnagar. 2014. In Proc. 17th Int. Conf. Intelligent Transportation Systems, 2529-2534.
Puterman, M. 2005. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
Rodrigues, F.; and Azevedo, C. L. 2019. Towards Robust Deep Reinforcement Learning for Traffic Signal Control: Demand Surges, Incidents and Sensor Failures. ArXiv abs/1904.08353.
Ross, S. 2004. Introduction to Probability and Statistics for Engineers and Scientists. Elsevier Science. ISBN 9780080470313.
Schrank, D.; Eisele, B.; and Lomax, T. 2019. 2019 Urban Mobility Report. Technical report, Texas A&M Transportation Institute.
Sutton, R. S.; and Barto, A. G. 1998. Introduction to Reinforcement Learning. MIT Press.
Thorpe, T. L. 1997. Vehicle traffic light control using SARSA. Technical report, Department of Computer Science, Colorado State University.
Varaiya, P. 2013. The Max-Pressure Controller for Arbitrary Networks of Signalized Intersections. In Advances in Dynamic Network Modeling in Complex Transportation Systems.
Wang et al. 2019. Transportation Research Part C 99: 144-163.
Watkins, C. 1989. Learning From Delayed Rewards. Ph.D. thesis, King's College.
Webster, F. 1958. Traffic Signal Settings. Road research technical paper. H.M. Stationery Office. URL https://books.google.pt/books?id=c9QOQ4jXK5cC.
Wei, H.; Yao, H.; Zheng, G.; and Li, Z. 2018. IntelliLight: A reinforcement learning approach for intelligent traffic light control. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2496-2505. Association for Computing Machinery. ISBN 9781450355520. doi:10.1145/3219819.3220096.
Wei, H.; Zheng, G.; Gayah, V.; and Li, Z. 2019. A survey on traffic signal control methods. CoRR abs/1904.08117.
Whiteson, S.; Tanner, B.; Taylor, M. E.; and Stone, P. 2011. Protecting against evaluation overfitting in empirical reinforcement learning. In Proc. 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 120-127.
Wiering, M. 2000. Multi-agent reinforcement learning for traffic light control. In Proc. 7th Int. Conf. Machine Learning, 1151-1158.
Zheng, G.; Zang, X.; Xu, N.; Wei, H.; Yu, Z.; Gayah, V. V.; Xu, K.; and Li, Z. 2019. Diagnosing Reinforcement Learning for Traffic Signal Control.