A New Approach for Tactical Decision Making in Lane Changing: Sample Efficient Deep Q Learning with a Safety Feedback Reward
Ugur Yavas, Tufan Kumbasar and Nazm Kemal Ure
Abstract — Automated lane change is one of the most challenging tasks for highly automated vehicles due to its safety-critical, uncertain and multi-agent nature. This paper presents a novel deployment of the state-of-the-art Q-learning method, Rainbow DQN, which uses a new safety-driven rewarding scheme to tackle these issues in a dynamic and uncertain simulation environment. We present comparative results showing that our novel approach of feeding reward back from the safety layer dramatically increases both the agent's performance and sample efficiency. Furthermore, through the deployment of Rainbow DQN, more intuition about the agent's actions is extracted by examining the distributions of the generated Q values. The proposed algorithm shows superior performance over the baseline algorithms in challenging scenarios with only 200,000 training steps (equivalent to roughly 55 hours of driving).
I. INTRODUCTION

There has been a growing interest in self-driving cars by industry since the DARPA Urban Challenge [1]. Despite the great achievements in this competition, the deployment of self-driving cars into production is a complicated problem due to reasons such as the long tail of edge cases, safety verification and the need for intelligent algorithms that are capable of negotiating with human drivers. There are already level-2 capable cars in production that autonomously control the vehicle at both the longitudinal and lateral levels. However, level-2 systems still need advancements, namely the inclusion of automated lane change functionality, which is crucial as it covers most aspects of highway driving. Thus, we believe that making tactical decisions to change lanes requires intelligence in the context of understanding the behavior of other traffic participants and strict safety monitoring, considering the fact that a large number of accidents happen during this maneuver [2].
A. Related Work
The automated lane change problem has been widely studied, and various approaches such as rule-based [3], data-driven supervised learning [4], utility-based [5] and reinforcement learning-based [6] methods have been proposed to solve it.
Ugur Yavas is with Eatron Technologies, Istanbul, Turkey [email protected]
Tufan Kumbasar is with the Control and Automation Engineering Department, Istanbul Technical University, Turkey [email protected]
Nazm Kemal Ure is with the Artificial Intelligence and Data Science Research Center and the Department of Aeronautical Engineering, Istanbul Technical University, Turkey [email protected]
This work was supported by the Research Fund of the Scientific and Technological Research Council of Turkey under Project 118E807.

However, excluding reinforcement learning, the main drawback of these approaches is that there is no learning involved. Thus, they are prone to errors caused by noise and uncertainty when the environment deviates slightly from the intended design. Data-driven algorithms, on the other hand, have problems when facing cases outside of their training distribution. Recently, applications of Deep Reinforcement Learning (DRL) to the lane change problem have been investigated by using Q-masking to integrate high-level knowledge [7], combining the agent with a safety layer [8], injecting uncertainty [9], introducing spatial invariance with Convolutional Neural Networks (CNNs) [10] and combining it with planning [11]. DRL-based methods have clear advantages over other methods considering the fact that they can cope well with uncertainty, measurement noise and large input spaces [12].

The efficient design and implementation of DRL agents involves many steps: choosing the state-action representation, balancing a multi-objective reward function, tuning the hyper-parameters of the optimization algorithm, deciding on the network architecture, generating rich data out of realistic scenarios and, finally, a broad evaluation against proper baseline methods with different seeds. Considering the aforementioned steps, [7] lacks a comparison with a fair baseline and uses a very naive simulation environment without challenging scenarios. On the other hand, [8] proposes compact state representations that work in any lane-vehicle number configuration and integrates a safety layer based on time-to-collision evaluations of the leader and follower vehicles. The defined safety layer can reject the actions proposed by the Q network if they are evaluated as unsafe. Although the compact state representation accelerates training (i.e. reduces the amount of computation), it has been underlined in [11] that deciding lane changes by considering only the adjacent lanes fails to solve the case shown in Fig. 1. Furthermore, it is stated in [10] that a DRL agent that is capable of jointly deciding longitudinal and lateral actions performs better than an agent that only makes lane change decisions. In [11], a realistic simulation environment, which contains measurement noise and randomized agent behaviors, was used to train a Monte Carlo Tree Search (MCTS) based agent without any consideration of safety.
B. Contribution
This paper proposes a method to train a sample-efficient Rainbow DQN [19] agent that not only makes tactical decisions in dynamic, uncertain and noisy highway scenarios but also considers safety constraints. The highlights of our contributions can be summarized as follows:
• Implementation of Rainbow DQN for the lane change problem, which results in a major performance increase over double DQN [16] and creates more intuition about the agent's actions by analyzing the Q value distributions (see Section V).
• The novel use of a safety layer that provides reward feedback to the agent, which dramatically increases both sample efficiency and final performance, together with a simple yet efficient safety layer implementation that is aligned with current in-vehicle technology such as blind spot warning. Our approach differs from [8] as we use a different safety metric and feed the rejection information from the safety layer as a negative reward during the learning of the agent.

Fig. 1. Three-lane scenario with the orange car (1) being the ego-vehicle: changing one lane to the left does not bring any speed gain since the lead vehicle in the centre lane (2) is slightly slower than vehicle 3. This scenario is a common pitfall for rule-based and narrow-sighted systems.

Our results demonstrate that the designed Rainbow DQN agent with safety feedback performs significantly better than both the rule-based and the double DQN agents in complex scenarios with different seeds (a, b, c) involving 20 surrounding vehicles with uncertain behavior, and reaches this superior performance after only 200,000 training steps.

The paper is organized as follows: Section II gives the details of how DRL methods have been applied to the automated lane change problem, Section III provides the details of the simulation environment and scenario configuration, Section IV shares the results of the training and evaluation runs, Section V discusses the results and Section VI draws conclusions and proposes future research directions.

II. PROBLEM STATEMENT

Automated lane change can be formulated as a DRL problem with a continuous input state and discrete actions. Actions of the agent need to be evaluated by the safety layer and then passed to the low-level controllers that eventually determine the desired steering angle and acceleration. The ego-vehicle is assumed to have an accurate perception system that provides the relative velocities and positions of the other agents. In order to make this perception assumption more realistic, we consider cases with a compact state representation and measurement noise. In reality, perception systems also suffer from occlusions, but this was not considered in this study. We also avoid adding longitudinal control actions to the DRL agent, since a realistic Adaptive Cruise Control (ACC) system has a much more complicated design than the Intelligent Driver Model (IDM) [13], and giving the DRL agent an extra degree of freedom in the action space would require additional safety verification, which is again outside the scope of this work.
A. Reinforcement Learning
Reinforcement learning is a machine learning paradigm that relies on self-learning agents driven by a reward function which is computed through interactions with the environment. At every time step, the agent observes the state S_t provided by the environment, selects an action A_t, and then receives the next state S_{t+1} together with the reward R_{t+1} and the discount factor γ_{t+1}. These elements form a tuple ⟨S, A, T, R, γ⟩ that is used to model a Markov Decision Process (MDP) [14]. In the model-free reinforcement learning setup, the transition function T(s, a, s') = P[S_{t+1} = s' | S_t = s, A_t = a] is not known, and the agent tries to find the best action set (policy π) that maximizes the reward without knowing the dynamics of the environment. The problem formulation for the finite horizon H is described as follows:

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}\left[\, \sum_{t=0}^{H} R_t(S_t, A_t, S_{t+1}) \;\middle|\; \pi \right] \quad (1)$$
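To make the objective in (1) concrete, the following is a minimal sketch (our own illustration, not from the paper) that estimates the finite-horizon return of a fixed policy by Monte Carlo rollouts in a toy tabular MDP; all names and sizes (n_states, n_actions, P, R, policy, horizon) are illustrative assumptions.

```python
# Minimal sketch (assumption): Monte Carlo estimate of E[ sum_t R_t | pi ]
# from Eq. (1) in a randomly generated tabular MDP.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 5, 3, 20

# Random transition kernel T(s, a, s'), reward R(s, a, s') and policy pi(a | s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions, n_states))
policy = rng.dirichlet(np.ones(n_actions), size=n_states)

def rollout_return(start_state: int) -> float:
    """Sum of rewards over one finite-horizon episode, as in Eq. (1)."""
    s, total = start_state, 0.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy[s])
        s_next = rng.choice(n_states, p=P[s, a])
        total += R[s, a, s_next]
        s = s_next
    return total

# Average over many rollouts starting from state 0
estimate = np.mean([rollout_return(0) for _ in range(1000)])
print(f"estimated objective for this policy: {estimate:.3f}")
```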
B. Rainbow DQN

Q-learning is a value-based technique to solve the problem in (1) by recursively estimating the optimal action-value function Q*(s, a) [15]. By calculating the Q value of each possible state-action pair and using the Bellman equation [23], the optimal policy can be attained with a greedy policy that chooses the actions with maximum Q values:

$$Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \right] \quad (2)$$

However, the conventional Q-learning algorithm cannot handle environments with large, continuous state-action spaces. Starting with the DQN algorithm [15], which approximates the Q function with neural networks and uses large experience replay buffers to break correlations in the training data, DRL agents have reached outstanding performance in many different tasks [17], [18].

The state-of-the-art algorithm is Rainbow DQN [19], which combines the most significant enhancements over the initial DQN algorithm in terms of training dynamics, sample efficiency and performance. We briefly summarize the main elements of Rainbow DQN below and encourage interested readers to check out the original paper.
• Double Q-learning decouples value estimation and action selection between the online and target networks.
• Prioritized experience replay samples experiences with larger loss more frequently to speed up training.
• The duelling network architecture consists of a shared encoder followed by separate fully connected layers that predict the advantage and the value of the states separately.
• Multi-step learning unrolls equation (2) N steps further to make the Q(s, a) values converge faster (see the sketch after this list).
• The noisy network adds a noise parameter with a normal distribution to each weight in the fully connected layers; these parameters are updated via back-propagation. This leads to a better exploration strategy than the standard ε-greedy method.
• The Q(s, a) values are predicted as distributions by minimizing the Kullback-Leibler loss. The distributional Q function gives more insight while evaluating a particular state, as also shown in Figures 5 and 6.
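The multi-step target mentioned in the list above can be written compactly; the following is a minimal sketch (our own illustration, not the paper's code) of the N-step bootstrapped target that unrolls equation (2) n steps before bootstrapping with the target network.

```python
# Minimal sketch (assumption): the N-step bootstrapped target used by multi-step learning.
def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """rewards: the next n rewards r_{t+1}, ..., r_{t+n};
    bootstrap_q: max_a' Q_target(s_{t+n}, a') at the n-th next state."""
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r                    # discounted n-step reward sum
    target += (gamma ** len(rewards)) * bootstrap_q   # bootstrap with the target network
    return target

# Example: a 3-step target with a bootstrapped value of 5.0
print(n_step_target([0.1, -1.0, 0.2], bootstrap_q=5.0))
```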
C. State/Action Representations

In this paper, the agent has three available discrete actions: keeping the current lane, changing lanes to the left and changing lanes to the right. Regardless of the selected action, IDM handles the longitudinal control and determines the following distance and speed. We use the ego-centered relative state representation given in Table I, which includes the positions and velocities of the other vehicles. However, we also analyze the influence of the more compact representation suggested in [8], which only provides the information of the lead and following vehicles in each lane.
TABLE I
EGO-CENTERED, NORMALIZED CARTESIAN STATE REPRESENTATION
Normalized ego vehicle speed: v_ego / v_d_ego
Normalized ego vehicle lateral position: y_ego / y_max
Normalized relative position of vehicle i: Δs_i / Δs_max
Normalized relative velocity of vehicle i: Δv_i / v_max
Normalized relative lateral position of vehicle i: Δy_i / y_max
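To make Table I concrete, the following is a minimal sketch (an assumption, not the authors' code) of assembling the ego-centered, normalized state vector; the field names and the normalization constants DS_MAX, V_MAX, Y_MAX and V_DES_EGO are illustrative placeholders, not the paper's values.

```python
# Minimal sketch (assumption): building the ego-centered, normalized state vector of Table I.
import numpy as np

DS_MAX, V_MAX, Y_MAX = 100.0, 30.0, 12.0   # assumed normalization constants
V_DES_EGO = 25.0                           # assumed desired ego speed (m/s)

def build_state(ego, others):
    """ego: dict with 's', 'y', 'v'; others: list of dicts with the same keys."""
    state = [ego["v"] / V_DES_EGO, ego["y"] / Y_MAX]
    for veh in others:
        state.append((veh["s"] - ego["s"]) / DS_MAX)   # relative longitudinal position
        state.append((veh["v"] - ego["v"]) / V_MAX)    # relative velocity
        state.append((veh["y"] - ego["y"]) / Y_MAX)    # relative lateral position
    return np.asarray(state, dtype=np.float32)

ego = {"s": 0.0, "y": 3.5, "v": 20.0}
others = [{"s": 40.0, "y": 3.5, "v": 18.0}, {"s": -25.0, "y": 7.0, "v": 22.0}]
print(build_state(ego, others))
```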
D. Reward Function with Safety Feedback

In the literature, simple reward functions are typically defined and used, such as (1) lightly punishing each lane change to limit the number of attempts and (2) heavily punishing accidents while rewarding the agent proportionally to the target speed. We argue that such a simple reward scheme with a short episode length of 1000 m may not reflect what the agent has actually learned. In this context, we propose the following novel rewarding scheme, combined with a game-like episode definition, for r(s, a, s'):
• speed incentive: (v_current − v_initial) / v_d
• lane change penalty: a small negative reward for each lane change
• collision: a large negative reward (terminal)
• reaching v_current = v_d: +100 (terminal)
• unsafe action: −1

Combining the above reward scheme with a longer episode length (5 km in our case) gives a more intuitive evaluation of the training and evaluation performance of the agent. Instead of using infinite episodes without termination, we consider episodes that are well aligned with the actual use case of the automated lane change functionality. A new episode begins whenever the speed of the vehicle drops below the desired speed, and the agent is then expected to make tactical decisions in order to reach the desired set-point (i.e. the target speed) once again. During an episode, the intermediate speed of the ego vehicle does not matter even if it settles at a speed slower than the desired ego vehicle speed.

As a second improvement to the rewarding scheme, we propose a novel reward feedback from the safety layer. In the classical safety approach, as in [8], decisions of the DRL agent are rejected by a safety layer, and the next action proposed by the agent is evaluated again until an acceptable action is obtained. We enhance this approach so that the safety layer interacts with the DRL agent through the reward function. Thus, every time the DRL agent violates safety, it gets a −1 reward, the corresponding action is overwritten by the safety layer and the agent receives the information regarding the next state. The proposed safety layer is as simple as rejecting the actions that would result in clear accidents, such as trying to change lanes while the adjacent lane is occupied. This technique avoids frequent terminal states caused by accidents, especially during the early stages of training, and significantly increases the training speed and the agent's performance.
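The following is a minimal sketch (an assumption, not the authors' implementation) of how the reward terms above and the safety-layer feedback could be combined in a single step function. Only the −1 unsafe-action reward and the +100 terminal bonus are stated in the text; the lane-change and collision penalty magnitudes are placeholders.

```python
# Minimal sketch (assumption): reward with safety feedback.
LANE_CHANGE_PENALTY = -0.5     # placeholder magnitude (small negative, per the text)
COLLISION_PENALTY = -100.0     # placeholder magnitude (large negative, per the text)
UNSAFE_PENALTY = -1.0          # stated in the text
TARGET_BONUS = 100.0           # stated in the text

def step_reward(v_current, v_initial, v_desired,
                action_was_unsafe, collided, lane_changed):
    reward = (v_current - v_initial) / v_desired       # speed incentive
    terminal = False
    if lane_changed:
        reward += LANE_CHANGE_PENALTY
    if action_was_unsafe:                               # safety layer rejected the action
        reward += UNSAFE_PENALTY                        # feedback instead of a silent override
    if collided:
        reward += COLLISION_PENALTY
        terminal = True
    elif v_current >= v_desired:
        reward += TARGET_BONUS                          # desired speed reached
        terminal = True
    return reward, terminal
```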
E. Network Architecture and Training Parameters
In this paper, we propose two architectures based on Double DQN and Rainbow DQN and compare them with a rule-based agent driven by the Minimizing Overall Braking Induced by Lane Changes (MOBIL) [20] algorithm. In their standard implementations, both algorithms use networks with CNN layers to process images from the game environment. Although we are working with continuous measurements rather than pixels, we use CNN layers as proposed in [10] in order to get a significant performance boost. Following the CNN layers, and considering the findings from [21], large fully connected layers with 256 neurons are used to prevent over-fitting in the training phase of the networks.
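The following PyTorch sketch is one plausible reading of the architecture described above (a 1-D convolutional encoder over the continuous state vector followed by 256-unit fully connected dueling heads); the channel counts, kernel sizes, the input dimension of 26 and the plain dueling head are illustrative assumptions, and the noisy and distributional components of Rainbow are omitted for brevity.

```python
# Minimal PyTorch sketch (assumption, not the exact architecture used in the paper).
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 32 * state_dim
        self.value = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        self.advantage = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

    def forward(self, state):                       # state: (batch, state_dim)
        h = self.encoder(state.unsqueeze(1))        # add a channel dimension
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)  # dueling aggregation

q_net = DuelingQNet(state_dim=26)                   # 26 = 2 ego features + 3 x 8 vehicles (assumed)
print(q_net(torch.randn(4, 26)).shape)              # -> torch.Size([4, 3])
```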
TABLE II
RAINBOW HYPER-PARAMETERS: priority replay beta, beta schedule steps, N-step prediction, replay size, target network sync frequency, learning rate, discount factor γ, batch size

III. SIMULATION ENVIRONMENT

In this paper, we use an enhanced version of the simulator presented in our previous work [9]. All vehicles except the ego-vehicle are driven by a combination of the IDM and MOBIL algorithms. The vehicle motions are defined with the kinematic bicycle model. Low-level longitudinal and lateral controllers calculate the required acceleration and steering angle of the vehicles. There are four major improvements over the simulator given in [9]:
• The lane of the ego-vehicle is always blocked with a slower vehicle.
• During an episode, if every lane is locked by slow vehicles, the slower vehicles are randomly sped up.
• Driver profiles are randomly selected from a uniform distribution according to Table III.
• Realistic position and velocity measurement noise [22] is injected into the ego-vehicle states.

IDM [13] is a standard car-following model that calculates the required acceleration response to reach a desired velocity set-point or following distance when there is a lead car. The dynamics of IDM are as follows:

$$\frac{dv}{dt} = a = a_{\max}\left( 1 - \left(\frac{v}{v_d}\right)^{\delta} - \left(\frac{d^{\star}(v, \Delta v)}{d}\right)^{2} \right) \quad (3)$$

$$d^{\star}(v, \Delta v) = d_0 + v\,T_{set} + \frac{v\,\Delta v}{2\sqrt{b\,a_{\max}}} \quad (4)$$

The parameters of IDM used to simulate different driver behaviors are shown in Table III [11].

The MOBIL algorithm is used to decide when to change lanes in the simulator. It makes a decision based on relative acceleration calculations regarding the following and lead vehicles in the current lane and the two adjacent lanes. In this context, with respect to the neighbouring vehicles, the following safety criterion is calculated first:

$$\tilde{a}_n > -b_{safe} \quad (5)$$

Here, ã_n refers to the new acceleration of the follower after a lane change and b_safe is the maximum safe deceleration. The safety criterion of MOBIL guarantees accident-free lane change decisions under the assumptions that other drivers react reasonably and there is no noise in the environment. If the safety criterion is fulfilled, the incentive criterion is calculated as follows:

$$\tilde{a}_e - a_e + p\,(\tilde{a}_n - a_n) + q\,(\tilde{a}_o - a_o) > a_{th} \quad (6)$$

where ã_e, ã_n and ã_o are the new accelerations, calculated by the IDM, for the lane-changing, new follower and old follower vehicles, respectively; a_e, a_n and a_o refer to the current accelerations of the same vehicles; p and q are the politeness factors for the side and rear vehicles; and a_th is the lane change decision threshold. The parameters of the MOBIL algorithm that model different driver behaviors are shown in Table III [11].

The MOBIL algorithm relies on a single threshold a_th to make a decision if the decision passes the safety criterion. This is the main weakness of the algorithm, since it is difficult to find an ideal threshold that can handle many different traffic situations and remain robust to measurement noise.
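As a concrete illustration of the longitudinal model, the following is a minimal sketch (not the simulator's code) of the IDM acceleration of Eqs. (3)-(4); the parameter values are generic placeholders, not the Table III values.

```python
# Minimal sketch (assumption): IDM acceleration from Eqs. (3)-(4) with placeholder parameters.
import math

def idm_acceleration(v, v_lead, gap, v_desired,
                     a_max=1.5, b=2.0, d0=2.0, t_set=1.5, delta=4.0):
    """v: ego speed, v_lead: lead-vehicle speed, gap: bumper-to-bumper distance (m)."""
    dv = v - v_lead                                                        # closing speed
    d_star = d0 + v * t_set + (v * dv) / (2.0 * math.sqrt(a_max * b))      # Eq. (4)
    return a_max * (1.0 - (v / v_desired) ** delta - (d_star / gap) ** 2)  # Eq. (3)

# Example: ego at 20 m/s approaching a lead car at 15 m/s, 30 m ahead
print(idm_acceleration(v=20.0, v_lead=15.0, gap=30.0, v_desired=25.0))
```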
A. Highway Simulation Details

The simulation environment randomly generates scenarios from the initial conditions defined in Table IV.
B. Performance/Safety Indicators
During the training and evaluation experiments, we monitor the number of accidents of the agents, the average reward of the last 100 episodes, the number of lane changes, the number of safety violations if the safety layer is integrated, and the ratio of successfully reaching the terminal state, which in our case is not the final destination but the desired ego-velocity. Moreover, in order to monitor sample efficiency, we compute the settling step, i.e. how many steps it takes to reach 95% of the settled reward.
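The following is a minimal sketch (an assumption, since the exact smoothing is not specified in the text) of computing the settling step from a training curve, using a 100-episode running mean to mirror the monitoring described above.

```python
# Minimal sketch (assumption): the first training step at which the running average
# reward reaches 95% of its settled (end-of-training) value.
import numpy as np

def settling_step(steps, rewards, window=100, fraction=0.95):
    """steps, rewards: per-episode training step indices and episode rewards."""
    rewards = np.asarray(rewards, dtype=float)
    running = np.convolve(rewards, np.ones(window) / window, mode="valid")
    settled = running[-1]                        # reward level at the end of training
    idx = np.argmax(running >= fraction * settled)
    return steps[idx + window - 1]               # map back to the training step axis

# Example with a synthetic saturating reward curve
steps = np.arange(0, 1_000_000, 1000)
rewards = 90 * (1 - np.exp(-steps / 150_000)) + np.random.default_rng(0).normal(0, 2, steps.size)
print(settling_step(steps, rewards))
```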
TABLE III
IDM AND MOBIL MODEL PARAMETERS FOR DIFFERENT DRIVERS (Normal / Timid / Aggressive)
Desired speed (m/s), v_set
Desired time gap (s), T_set
Minimum gap distance (m), d_0
Maximal acceleration (m/s^2), a_max
Desired deceleration (m/s^2), b
Politeness factor, p
Changing threshold (m/s^2), a_th
Safe braking (m/s^2), b_safe

TABLE IV
HIGHWAY SIMULATION PARAMETERS
Number of lanes, n
Number of vehicles, m
Maximum initial vehicle spread, d_long (m)
Minimum inter-vehicle distance (m)
Rear vehicles initial speed range, [v_rear_min, v_rear_max] (m/s), lower bound 15
Front vehicles initial speed range, [v_front_min, v_front_max] (m/s), lower bound 10
Initial speed range for ego vehicle, [v_ego_min, v_ego_max] (m/s), lower bound 10
Desired speed range for other vehicles, [v_d_min, v_d_max] (m/s), lower bound 18
Desired speed for ego vehicle, v_d_ego (m/s)
Episode length, d_max (m)

IV. RESULTS

We created two different benchmark scenarios to evaluate the influence of safety feedback, the use of Rainbow DQN, and the compact state representation. The initial configuration is quite similar to our previous work [9]: the ego-vehicle is surrounded by 8 vehicles that share the normal driving behavior. We trained Rainbow DQN, Rainbow DQN with safety and Double DQN agents over 1M training steps with three different seeds. In the second configuration, we took inspiration from [11] and increased the number of surrounding vehicles to 20, uniformly sampled different driver behaviors from Table III and injected Gaussian measurement noise into the positions and velocities of the vehicles. For this configuration, we trained three agents:
• Rainbow: Rainbow DQN without safety, with the standard state representation (the information of all vehicles is provided).
• Rainbow-blindspot: Rainbow DQN with the standard representation, including a blind-spot sensor and safety feedback.
• Rainbow-blindspot-comp: Rainbow DQN with the compact representation (only the following and lead vehicles in each lane are provided), including a blind-spot sensor and safety feedback.
A. Noise Free Dynamic Highway Environment

Figure 2 shows the average reward of the last 100 episodes over the 1M training steps. As can be clearly seen, Rainbow DQN performs significantly better than double DQN in every aspect. We also observed this superiority with the other seeds and in the validation runs. Thus, we did not employ double DQN in the more challenging scenarios.
Fig. 2. Training performance of three agents in seed a.

TABLE V
AVERAGE REWARD OVER EPISODES
Observations         Solved Eps. Ratio   Mean reward   Settling step
Rainbow              96%                 91            1 M
Rainbow-blindspot    99%                 93            200 k
DoubleDQN            70%                 64            1 M
MOBIL timid          83%                 78            0
MOBIL aggressive     95%                 90            0
B. Noisy, Uncertain Dynamic Highway Environment
Figure 3 shows the average reward of the last 100 episodes over the 250k training steps. In line with the previous findings, the agents with safety feedback rapidly catch up with the performance of the MOBIL algorithm and surpass it after 200k time steps. Table VI shows the performance in the evaluation run as well as the training convergence. Moreover, the performance of the agent with the compact state representation is worse than that of the agent that uses the states of each vehicle, whereas the compact representation converges faster to the settling reward.
TABLE VI
AVERAGE REWARD OVER EPISODES
Observations              Solved Eps. Ratio   Mean reward   Settling step
Rainbow                   82%                 76            1 M
Rainbow-blindspot         92%                 86            250 k
Rainbow-blindspot-comp    87%                 82            180 k
MOBIL timid               72%                 69            0
MOBIL aggressive          81%                 77            0
Fig. 3. Training performance of three agents in seed a over 250k steps.
V. COMMENTS AND DISCUSSIONS

In this section, we first show different highway scenarios and try to understand the reasoning of the agent by using the value distributions. Figure 4 shows an early moment of a challenging scenario. The agent (orange car) is surrounded by many vehicles, and only going straight has a positive expectation in the predicted Q value distributions, since going straight is always the safe state, while changing lanes is expected to cause either an accident or a departure from the road. Consequently, the agent waits until the adjacent left lane has enough space to overtake lead vehicle 3.
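The following is a minimal sketch (an assumption, not the paper's code) of how such an interpretation can be read off a C51-style distributional head: the expected Q value per action and the probability mass on negative returns, which indicates how likely the agent believes an action leads to a crash. The number of atoms and the value support are illustrative, not the paper's settings.

```python
# Minimal sketch (assumption): summarizing per-action return distributions.
import numpy as np

N_ATOMS, V_MIN, V_MAX = 51, -100.0, 100.0
support = np.linspace(V_MIN, V_MAX, N_ATOMS)            # fixed return atoms

def summarize(action_probs):
    """action_probs: (n_actions, N_ATOMS) probability mass per action."""
    expected_q = action_probs @ support                  # expectation per action
    crash_risk = action_probs[:, support < 0].sum(axis=1)  # mass on negative returns
    return expected_q, crash_risk

# Toy distributions for actions [left, straight, right]
probs = np.zeros((3, N_ATOMS))
probs[0, :10] = 0.1    # "left": all mass on strongly negative returns (likely crash)
probs[1, 30:40] = 0.1  # "straight": mass on moderately positive returns
probs[2, :20] = 0.05   # "right": mass spread over negative returns
q, risk = summarize(probs)
print("expected Q:", q, "P(return < 0):", risk)
print("greedy action index:", int(np.argmax(q)))         # -> 1 (go straight)
```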
Fig. 4. Probability distributions of Q values at the initial conditions. The agent is aware that going left causes an accident with high probability. (Selected action: go straight)