Amortized Q-learning with Model-based Action Proposals for Autonomous Driving on Highways
Branka Mirchevska, Maria Hügle, Gabriel Kalweit, Moritz Werling, Joschka Boedecker
Abstract— Well-established optimization-based methods can guarantee an optimal trajectory for a short optimization horizon, typically no longer than a few seconds. As a result, choosing the optimal trajectory for this short horizon may still result in a sub-optimal long-term solution. At the same time, the resulting short-term trajectories allow for effective, comfortable and provably safe maneuvers in a dynamic traffic environment. In this work, we address the question of how to ensure an optimal long-term driving strategy while keeping the benefits of classical trajectory planning. We introduce a Reinforcement Learning based approach that, coupled with a trajectory planner, learns an optimal long-term decision-making strategy for driving on highways. By generating locally optimal maneuvers as actions online, we balance between the infinite low-level continuous action space and the limited flexibility of a fixed number of predefined standard lane-change actions. We evaluated our method on realistic scenarios in the open-source traffic simulator SUMO and were able to achieve better performance than the 4 benchmark approaches we compared against: a random action selecting agent, a greedy agent, a high-level discrete actions agent and an IDM-based SUMO-controlled agent.
Dept. of Computer Science, University of Freiburg, Germany. {hueglem, kalweitg, jboedeck}@cs.uni-freiburg.de
BMW Group, Unterschleissheim, Germany. {Branka.Mirchevska, Moritz.Werling}@bmw.de
Cluster of Excellence BrainLinks-BrainTools, Freiburg, Germany.

I. INTRODUCTION

Autonomous vehicles have to make safe and reliable decisions in highly dynamic and complex traffic environments. So far, the approaches proposed for tackling this challenge can be divided into three broad categories: rule-based systems, optimization-based (optimal control) systems and data-driven, i.e., machine learning-based systems. Rule-based systems [15], [24], [2] offer the advantage of greater control over their actions. However, defining a consistent set of rules that works in all possible situations under noisy observations is a difficult and error-prone procedure. Optimal control-based systems, like [29], [17], [3] and [4], which rely on constrained optimization and Model Predictive Control, encode the vehicle dynamics model directly in the planning module and are able to generate feasible and comfortable trajectories. They also typically offer sound solutions with mathematically backed-up safety guarantees. However, due to the short optimization horizon, these approaches are not capable of making farsighted, globally optimal decisions [16]. Alternatively, purely machine learning-based approaches, while offering better generalization to unseen situations and learning from data, introduce safety concerns and reduce the transparency of the behavior of the system. Among machine learning methods, Reinforcement Learning (RL) has become
increasingly popular in autonomous driving applications, due to the notable success in many application areas [14, 19, 20]. However, when applying RL to autonomous driving, an important challenge is to determine the level of control the agent should have over the vehicle. Most common are the discrete actions and continuous actions settings. In the discrete actions case (e.g. [1, 6, 31] and also [13, 7, 8, 9, 10]), the agent can choose from actions such as keep lane, lane-change to the left or lane-change to the right, or fixed accelerating/decelerating steps. While the small and fixed action set leads to fast learning progress, the lane-change maneuvers usually have a fixed execution duration, resulting in sub-optimal, unnatural behavior in tight situations. On the other end of the spectrum, in the continuous actions setting, the agent learns to influence either the low-level steering wheel and gas/brake pedal control signals directly, or some kind of abstraction thereof such as curvature or acceleration [27, 25, 26]. This endows the agent with maximal freedom of choice, but since the set of actions is infinite and the choices that the agent needs to make are typically very granular, it will need large amounts of good action sequences among the learning samples, slowing down learning considerably. In addition, the maximization of the action-value function in continuous off-policy RL methods is problematic. To address that, in [30], the maximization step is simplified by a learned proposal distribution which is used to sample the continuous action space. The approximate maximum is then taken as the maximizing action from the proposed sample-set.

Fig. 1: RL environment: RL agent (black), surrounding vehicles (white), reachable gaps (black-dashed), unreachable gaps (red-dashed)

Inspired by this approach, we use model-based action proposals for balancing out the trade-off between continuous and discrete actions discussed above (see Figure 1 for an overview). Since the actions that we are proposing are finite but described by continuous values, they can be seen as a compromise between the discrete and the continuous representations. This way, we combine the advantages of both sides of the spectrum, while ensuring safety. For this purpose, we introduce the notion of gaps, as model-based action proposals for the agent. A gap is a space between two surrounding vehicles on the same lane. We define the actions as the set of all reachable gaps in the vicinity of the agent. An embedded trajectory planner plans locally optimal trajectories which satisfy all physical and safety constraints, based on which the set of reachable gaps is created. After the RL agent chooses a gap from the reachable set of gaps, the best low-level maneuver towards that gap is executed. Since the number of gaps the agent can choose from is finite, but the features describing each gap are continuous, we allow for a sufficient level of control over the vehicle, while ensuring safety and keeping the action space and the learning times manageable. We introduce Amortized Q-learning with Model-based Action Proposals (AQL-MAP), an
Amortized Q-learning approach that replaces the learned proposal distribution by model-based action estimation. The term "amortized" in this case stems from amortizing the cost of maximization over the whole continuous action space by proposing a finite set of reachable actions to maximize over. Even though the set of proposed actions is finite, they cover all safe and feasible maneuvers the agent can execute in a given state. We implement the approach in the Semi-MDP Options framework [21], [18] in order to speed up learning and stabilize the performance. Moreover, since for autonomous driving in real traffic, on-policy, online learning is out of the question, the training data usually needs to be pre-collected from fleets of human-operated vehicles. With that in mind, we train our approach in an offline and off-policy fashion, on a fixed batch of training data. Our main contributions are the following:
1) Introducing gaps as an action abstraction, described by continuous features, which incorporate all possible maneuvers that the vehicle can safely perform.
2) Coupling RL with an optimization-based trajectory planner for estimating the current set of accessible gaps and generating feasible and safe trajectories towards them.
3) Analysis of the achieved performance by evaluating the novel agent against 4 benchmark agents in realistic highway scenarios, defined in the Sumo simulation environment [11]. The benchmark agents include: a random action selecting agent, a greedy agent, an IDM-based Sumo-controlled agent [23] and a high-level, discrete action selecting agent.

II. REINFORCEMENT LEARNING BACKGROUND

Reinforcement learning is concerned with the problem of learning from interaction with the environment. Formally, we define an RL problem as a Markov Decision Process, consisting of: a set of states S = {s_1, s_2, ..., s_n}, a set of actions A = {a_1, a_2, ..., a_m}, a transition function T with T(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a), and a reward function r(s, a). In a state s_t, at a time-step t, following a policy π, an RL agent executes an action a_t in the environment. At time-step t + 1, it then ends up in some state s_{t+1} and receives a reward r_{t+1}. We define R(s_t) = Σ_{t' ≥ t} γ^{t' − t} r_{t'} as the expected discounted long-term return, where γ ∈ [0, 1] is the discount factor. The objective of the agent is to optimize its policy π in order to maximize R. One way of finding the optimal policy is Q-learning [28], where the agent aims to learn the optimal action-value function Q*(s, a), defined as: Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π].

Fig. 2: Diagram of the information flow between Sumo, TP and the RL agent

Since we implement our approach in the Options framework, we briefly introduce it here. Options are a generalization of primitive actions to include temporally extended courses of action. An option is a triple o = (I, π, β), where I ⊆ S is the set of states that an option can start in (initiation set), π : S × A → [0, 1] is the policy, and β : S⁺ → [0, 1] is the termination condition (S⁺ is the set of all states including the terminal state, if there is one). An example of an option would be the task of changing a lane, which consists of multiple primitive actions like activating the turn signal, turning the steering wheel, stepping on the gas pedal etc. A set of options defined over an MDP constitutes a semi-Markov decision process (SMDP). Options can take a variable number of steps.
Given a fixed set of options O_s at state s, at time-step t the agent chooses an option o_t. When the option terminates at time-step t + k, the agent selects a new one (unlike the standard MDP setting, where the agent always chooses a new primitive action at t + 1). Using options can potentially speed up and stabilize learning on large problems. For more details refer to [21] and [22].

III. GENERAL APPROACH

The final approach consists of two main components: the trajectory planning and the learning algorithm. For generating the set of action proposals for each time-step, we adapted the sampling-based optimal control approach for trajectory planning and generation from [29]. For approximating the optimal Q-function, we use a DQN-based function approximator Q, with parameters θ, in an offline, batch learning way. Namely, in each training iteration, we sub-sample a mini-batch M of transition samples (s, a, r, s') from a pre-collected training data buffer R. Given M = (s_i, a_i, s_{i+1}, r_i), we optimize the loss function L(θ) = (1/b) Σ_i (Q(s_i, a_i | θ) − y_i)², where y_i = r_i + γ max_a Q'(s_{i+1}, a | θ') are the targets. Q' is the target network with parameters θ', which is updated every iteration by a soft update policy with parameter τ ∈ [0, 1]. In order to reduce the overestimation of the Q-values, we use two target networks, as suggested in [5]. Furthermore, we build on the DeepSet-Q approach from [7], which allows us to support a variable number of inputs in our RL architecture.

A. AQL-MAP Framework

In Figure 2, the complete framework of interaction between the simulation environment (Sumo), the trajectory planner (TP) and the RL agent is shown. In each time-step, Sumo first sends the current environment state to the TP. Based on it, the TP plans locally optimal trajectories and creates the set of reachable gaps. The current environment state and the set of reachable gaps constitute the RL state, which is sent to the RL agent. After the RL agent has chosen a gap to fit into, it sends its decision to the TP. Then, the TP sends to Sumo the low-level control signals necessary for reaching the selected gap. Sumo executes the controls and sends the next environment state to the TP to begin a new cycle of the same process.

Using this framework, we collect the training data transitions. In RL state s_t, given the initial state of the RL agent s^rl_t = {pos_t, speed_t, accel_t} at time-step t = 0, and considering the surrounding vehicles in the sensor range of the RL agent, the set of all surrounding gaps G_t = {g_{t,1}, g_{t,2}, ..., g_{t,n}} is computed. Then, the TP generates all possible trajectories starting from state s^rl_t: T_t = {tr_{t,1}, tr_{t,2}, ..., tr_{t,m}}. By examining which gaps are reachable by the generated trajectories, we compute the set of reachable gaps RG_t = {rg_{t,1}, rg_{t,2}, ..., rg_{t,k}}. This set is proposed to the RL agent to choose from. The RL agent then randomly chooses a gap g_t ∈ RG_t, upon which the TP returns the best trajectory leading to g_t. The simulator starts executing the trajectory, the RL agent arrives in state s_{t+1} and receives a reward r_t from the environment. Finally, the transition (s_t, g_t, s_{t+1}, r_t) is stored in the fixed batch R.
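As an illustration of this collection cycle, the following is a minimal sketch of the data-gathering loop with a random behavior policy. The interfaces (sumo_env, trajectory_planner and their methods) are assumed placeholder names, not the actual SUMO/TraCI or planner API:

    import random

    def collect_transitions(sumo_env, trajectory_planner, replay_buffer, num_steps):
        """Store (s_t, g_t, s_{t+1}, r_t) transitions collected with a random gap choice."""
        env_state = sumo_env.reset()
        for _ in range(num_steps):
            # The TP plans locally optimal trajectories and derives the reachable gaps.
            reachable_gaps = trajectory_planner.reachable_gaps(env_state)
            rl_state = (env_state, reachable_gaps)

            # Random behavior policy for offline data collection.
            chosen_gap = random.choice(reachable_gaps)

            # The TP returns the best trajectory to the chosen gap; Sumo executes it.
            trajectory = trajectory_planner.best_trajectory(env_state, chosen_gap)
            next_env_state, reward = sumo_env.execute(trajectory)
            next_rl_state = (next_env_state, trajectory_planner.reachable_gaps(next_env_state))

            replay_buffer.append((rl_state, chosen_gap, next_rl_state, reward))
            env_state = next_env_state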
B. Trajectory Planning

We plan locally optimal trajectories, satisfying physical and safety constraints, to each gap that is reachable within a planning horizon. This is done based on the collision checking module of the TP, as well as predefined vehicle dynamics constraints like the maximum/minimum acceleration and the maximal allowed speed. First, we compute all gaps in the surroundings of the agent. Then, we check which ones are safely reachable by at least one trajectory. In this manner, we define the set of reachable gaps that are proposed to the RL agent as actions (Figure 1). Once the agent selects a target gap, out of all trajectories that safely reach the gap, the TP outputs the best one. The best trajectory is selected based on a set of costs and weights taking into consideration the speed, the comfort, as well as how well the end points of the trajectory fit inside the gap. More precisely, we minimize the cumulative longitudinal and lateral jerk of the trajectory for comfort. In the lateral direction, we penalize the lateral distance to the target lane center. In the longitudinal direction, we predict constant velocity trajectories for all relevant surrounding vehicles (e.g., if we are examining a keep-lane trajectory, only the vehicles on that lane are considered relevant), so that we can maintain a safe distance between them and the RL agent. This is done by calculating the time-to-collision and time-headway values for the planned trajectory, relative to the relevant surrounding vehicles. For a host vehicle v_h and a reference vehicle v_r, we define time-to-collision and time-headway as ttc = |v_{h,pos} − v_{r,pos}| / (v_{f,speed} − v_{l,speed}) and thw = |v_{h,pos} − v_{r,pos}| / v_{f,speed}, where v_{h,pos} and v_{r,pos} are the longitudinal positions of the host and the reference vehicle, and v_{l,speed} and v_{f,speed} refer to whichever vehicle is leading/following between the host and the reference vehicle at the time of the calculation. For more details on the trajectory generation, refer to [29].
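As a minimal illustration of these two safety measures, the helpers below compute ttc and thw from longitudinal positions and speeds; the function and argument names are ours, not taken from the planner in [29]:

    def time_to_collision(host_pos, ref_pos, follower_speed, leader_speed):
        """Time until the following vehicle closes the distance to the leading one."""
        closing_speed = follower_speed - leader_speed
        if closing_speed <= 0.0:
            return float("inf")  # follower is not faster, no collision course
        return abs(host_pos - ref_pos) / closing_speed

    def time_headway(host_pos, ref_pos, follower_speed):
        """Time the following vehicle needs to cover the current distance to the leader."""
        if follower_speed <= 0.0:
            return float("inf")
        return abs(host_pos - ref_pos) / follower_speed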
C. Model-based Action Proposals Algorithm

We use DQN for the state-action value function approximation. For that purpose, we initialize a network Q_AQL with parameters θ and a target network Q'_AQL with parameters θ'. Q_AQL consists of the modules φ, ρ and Q, shown in Figure 3. The fully connected network φ transforms each of the dynamic input instances {s^dyn_{i,1}, s^dyn_{i,2}, ..., s^dyn_{i,j}}, describing the surrounding vehicles of the RL agent, into the representation φ(s^dyn_i), as in [7]. Then a pooling operation (sum) is applied over the output, ultimately making the Q-function permutation invariant w.r.t. its input. This is what enables handling a dynamic number of input objects [32]. The output of the pooling operation is processed by the fully connected network ρ, providing the final representation of the dynamic inputs. Besides the dynamic input instances, we also have a static part of the state, s^static_i, consisting of features describing the RL agent. In order to inform the RL agent about the available gaps, we include the features describing the reachable gaps. Finally, we combine the final representation of the dynamic inputs, ρ(Σ_j φ(s^dyn_{i,j})), with as many pairs (s^static_i, s^gap_i) as there are available gaps at time-step i. Then we propagate each triple (ρ(Σ_j φ(s^dyn_{i,j})), s^static_i, s^gap_i) through the Q module separately, to finally get the state-action value estimate q for each proposed gap. In the end, we acquire the final policy by maximizing over the q-value estimates of all proposed gaps and choosing the one with the maximal q-value. For details refer to Algorithm 1. Building upon [7], the parts marked in red in lines 11 and 12 of Algorithm 1 are the ones that we introduce for the description of the gap and for enabling a variable number of actions in each time-step. Once we have a trained agent, we can apply it to a new driving scenario. The gap selection steps of a trained agent are summarized in Algorithm 2.

Fig. 3: Amortized Q-learning with Model-based Action Proposals network architecture

Algorithm 1: Fixed Batch AQL-MAP
    initialize Q_AQL = (Q, φ, ρ), Q'_AQL = (Q', φ', ρ')
    set replay buffer R
    for training iteration ti = 1, 2, ... do
        get mini-batch M = (s_i, g_i, s_{i+1}, r_{i+1}) from R,
            where s_i = (S^dyn_i, s^static_i), s_{i+1} = (S^dyn_{i+1}, s^static_{i+1})
        foreach transition do
            foreach object s^j_{i+1} in S^dyn_{i+1} do
                (φ'_{i+1})_j = φ'(s^j_{i+1})
            end
            ρ'_{i+1} = ρ'(Σ_j (φ'_{i+1})_j)
            y_i = r_{i+1} + γ max_g Q'(ρ'_{i+1}, (s^static_{i+1}, s^gap_{i+1}), g),
                where g = s^gap_{i+1} ∈ {s^gap1_{i+1}, s^gap2_{i+1}, ..., s^gapm_{i+1}}
        end
        perform a gradient step on the loss: (1/b) Σ_i (Q_AQL(s_i, g_i) − y_i)²
        update the target network by: θ_Q' ← τ θ_Q + (1 − τ) θ_Q'
    end

IV. TRAINING

We apply the AQL-MAP approach to teach an agent to drive on a simulated highway in a smooth and safe way, while respecting the desired speed requirement. In this section, we discuss the training pipeline and its components.
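To make the pipeline concrete before describing its components, the following is a minimal sketch of the (φ, ρ, Q) modules from Figure 3 and of one target computation and soft update, assuming PyTorch and layer sizes taken from Table I (the remaining dimensions are illustrative). It uses a single target network for brevity, whereas the actual approach uses two target networks as suggested in [5]:

    import torch
    import torch.nn as nn

    class AQLMAPNetwork(nn.Module):
        """DeepSet-style Q-network: phi encodes each surrounding vehicle, sum pooling
        plus rho give a permutation-invariant scene representation, and the Q head
        scores one (scene, static, gap) triple per reachable gap."""

        def __init__(self, dyn_dim=3, static_dim=3, gap_dim=5):
            super().__init__()
            self.phi = nn.Sequential(nn.Linear(dyn_dim, 20), nn.ReLU(), nn.Linear(20, 80))
            self.rho = nn.Sequential(nn.Linear(80, 80), nn.ReLU(), nn.Linear(80, 20))
            self.q = nn.Sequential(nn.Linear(20 + static_dim + gap_dim, 100), nn.ReLU(),
                                   nn.Linear(100, 1))

        def forward(self, dyn_objects, static_state, gap_features):
            # dyn_objects: (num_vehicles, dyn_dim), gap_features: (num_gaps, gap_dim)
            scene = self.rho(self.phi(dyn_objects).sum(dim=0))      # permutation invariant
            scene = scene.unsqueeze(0).expand(gap_features.size(0), -1)
            static = static_state.unsqueeze(0).expand(gap_features.size(0), -1)
            per_gap_input = torch.cat([scene, static, gap_features], dim=1)
            return self.q(per_gap_input).squeeze(-1)                # one q-value per gap

    def td_target(reward, next_dyn, next_static, next_gaps, target_net, gamma=0.99):
        """y = r + gamma * max over the gaps reachable in the next state."""
        with torch.no_grad():
            return reward + gamma * target_net(next_dyn, next_static, next_gaps).max()

    def soft_update(online_net, target_net, tau):
        """theta' <- tau * theta + (1 - tau) * theta'."""
        for p, tp in zip(online_net.parameters(), target_net.parameters()):
            tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

A gradient step on (1/b) Σ_i (Q_AQL(s_i, g_i) − y_i)² with these targets, followed by the soft update, then corresponds to one iteration of Algorithm 1.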
A. Options Framework

At a time-step t, the RL agent chooses a gap g (an option) from the set of reachable gaps. Unlike the classical RL framework, where the RL agent chooses a new action at fixed time-intervals, here the duration of the intervals is arbitrary. Every second, if the previously chosen gap has not been reached yet and it is still available, the RL agent is not required to choose again. If the initially chosen option is not available anymore, we can interrupt it by choosing a new, reachable one.
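A minimal sketch of this option-keeping logic, with illustrative names (the once-per-second check frequency follows the description above):

    def select_option(current_gap, gap_reached, reachable_gaps, q_values):
        """Continue the current gap-option while it is neither reached nor unavailable;
        otherwise (termination or interruption) pick the gap with the highest Q-value."""
        if current_gap is not None and not gap_reached and current_gap in reachable_gaps:
            return current_gap                                        # keep following the chosen gap
        return max(reachable_gaps, key=lambda gap: q_values[gap])    # new decision required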
B. State space

The RL state consists of features describing the RL agent, its surrounding vehicles and the gaps. For the RL agent we use:
• absolute velocity, v_RL ∈ R≥0,
• left lane valid flag, indicating whether there is a lane to the left of the RL agent, ll_valid ∈ {0, 1},
• analogously for the right lane: rl_valid ∈ {0, 1}.
For describing the vehicles j surrounding the RL agent, we consider all vehicles in a sensor range sr of
80 m ahead and behind. We describe them with the following features:
• relative distance d_rel,j, defined as (pl_j − pl_RL) / sr, where pl_RL and pl_j are the longitudinal positions of the RL agent and the considered vehicle j, respectively,
• relative velocity v_rel,j, defined as (v_j − v_RL) / v_des,RL, where v_j is the absolute velocity of vehicle j, and v_des,RL is the user-defined desired velocity the RL agent is aiming to achieve,
• relative lane lane_rel,j, defined as lane_ind,j − lane_ind,RL, where lane_ind is the lane index of the right, middle or left lane.
Additionally, we include the features for every reachable gap in the surroundings, described below.

Algorithm 2: AQL-MAP Learned Agent Gap Selection
    Input: RL agent's state s^rl_t = {pos_t, speed_t, accel_t},
           RL state s_t = (s^dyn_t, s^static_t) in time-step t = 0,
           learned model Q_AQL = (Q, φ, ρ)
    while episode not finished do
        get the reachable gap proposals from the TP: s^gap_t = {s^gap1_t, s^gap2_t, ..., s^gapm_t}
        compute the dynamic state representation: state_rep^dyn_t = ρ(Σ_j φ(s^dyn_{t,j}))
        Q_values_t = {}
        foreach gap g^m_t = s^gapm_t in s^gap_t do
            Q_values_t.insert(Q_AQL(state_rep^dyn_t, s^static_t, g^m_t))
        end
        choose gap g_t = argmax(Q_values_t)
    end
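The loop in Algorithm 2 maps directly to a few lines of code. The sketch below reuses the AQLMAPNetwork from the architecture sketch above; the environment and planner interfaces are again assumed placeholders, and each gap proposal is assumed to carry its feature vector (see the action space description below):

    import torch

    def drive_episode(sumo_env, trajectory_planner, q_net, encode_scene):
        """Greedy gap selection with a trained Q_AQL network, as in Algorithm 2."""
        env_state = sumo_env.reset()
        while not sumo_env.episode_finished():
            gaps = trajectory_planner.reachable_gaps(env_state)   # gap proposals from the TP
            dyn, static = encode_scene(env_state)                 # assumed feature extraction
            gap_features = torch.stack([gap.features for gap in gaps])

            with torch.no_grad():
                q_values = q_net(dyn, static, gap_features)       # one q-value per gap
            chosen_gap = gaps[int(torch.argmax(q_values))]

            trajectory = trajectory_planner.best_trajectory(env_state, chosen_gap)
            env_state, _ = sumo_env.execute(trajectory)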
C. Action space

The action set consists of all reachable gaps RG, described in Section III-A. For the description of the gaps we use:
• relative distance d_rel,g, relative velocity v_rel,g and relative lane lane_rel,g, defined in the same way as for the surrounding vehicles in Section IV-B, here in terms of gaps,
• gap length len_g, defined as pl_lv − pl_fv, where pl_lv and pl_fv are the longitudinal positions of the leading and the following vehicle forming the gap g,
• gap association flag af_g ∈ {0, 1}, which is 0 if the currently chosen gap is the same as the one chosen at the previous time-step, and 1 otherwise.
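For illustration, a gap could be encoded into this five-dimensional feature vector roughly as follows (a sketch with assumed field names, not the actual feature-extraction code):

    def encode_gap(gap, agent, previously_chosen_gap, sensor_range=80.0):
        """Gap features from Section IV-C: relative distance, relative velocity,
        relative lane, gap length and the gap association flag."""
        d_rel = (gap.longitudinal_position - agent.longitudinal_position) / sensor_range
        v_rel = (gap.velocity - agent.velocity) / agent.desired_velocity
        lane_rel = gap.lane_index - agent.lane_index
        length = gap.leading_vehicle_position - gap.following_vehicle_position
        association_flag = 0.0 if gap == previously_chosen_gap else 1.0
        return [d_rel, v_rel, lane_rel, length, association_flag]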
D. Reward function

For a desired velocity v_des,RL, given δ_vel = |v_RL − v_des,RL|, the reward function r : S × A → R is defined as:

    r(s, a) = −δ_vel + p_ac,   if v_RL < v_des,RL
              p_ac,            otherwise                    (1)

where p_ac is a slight penalty for choosing an action that differs from the one chosen in the previous time-step, which encourages smoother driving. The value of p_ac was chosen empirically by conducting a wide range of experiments.
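A direct transcription of Equation (1), under our reading that the penalty p_ac only applies when the chosen gap differs from the previous one (the exact penalty value is not reproduced here):

    def reward(agent_velocity, desired_velocity, action_changed, p_ac):
        """Reward of Equation (1): track the desired velocity, discourage gap switching.
        p_ac is a small negative constant, chosen empirically by the authors."""
        delta_vel = abs(agent_velocity - desired_velocity)
        action_penalty = p_ac if action_changed else 0.0
        if agent_velocity < desired_velocity:
            return -delta_vel + action_penalty
        return action_penalty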
E. Training data details

We collected two training data sets, one for the high-level agent, where the RL agent chooses an action every second, and one for the options-selecting agent, which chooses a new option in intervals of varying size (depending on when a selected option terminates). Apart from the action/option choosing frequency, we made sure that they are collected in the same way, so everything we describe below holds for both of them. In the Sumo simulation environment, we collected transition samples (s, a, s', r) that correspond to approximately 150 hours of driving. The data is collected on a 3-lane straight highway. Besides the RL agent, there are between 0 and 70 surrounding vehicles, positioned randomly on the highway, with varying desired speeds and driver behaviors. All surrounding vehicles are allowed to change lanes and accelerate/decelerate as their underlying controller suggests; we have control only over the RL agent. For the options-selecting agent, we used a random data collection policy. For the high-level agent, we collected data in a pseudo-random way, where the agent would sometimes keep the previously chosen action with higher probability. This is in order to make sure that the data incorporates samples that describe completely executed lane-changes.

V. EXPERIMENTS

A. Benchmark agents
As already mentioned, we compared the performance of the option-selecting RL agent to that of 4 other benchmark agents: a random agent, a greedy agent, a high-level RL agent and an IDM-based Sumo agent. The random agent is the simplest baseline, providing us with the benchmark performance of a non-intelligent agent. The greedy agent, out of all reachable gaps, selects the gap with the fastest trajectory leading towards it. The IDM-based agent is a Sumo-controlled agent. We set its parameters as relaxed as possible in order to make sure that no internal constraints prevent it from achieving the desired speed (such as no overtaking from the right, being too cooperative etc.). Finally, the high-level RL agent chooses from the discrete actions: keep lane, left lane-change and right lane-change. Since for this agent there is no need to feed the actions into the network as part of the RL state (because they are discrete and fixed), we used the classic DeepSet DQN architecture. However, we made sure that everything else is comparable to the way the option-selecting agent was trained. We used the same reward function and the same RL state (excluding the options-related features). During evaluation, all agents were rewarded based on the same criteria. The underlying trajectory planning and selecting mechanism is identical for all agents; they only differ in the action/option selection strategy.
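For reference, the random and greedy baselines reduce to one-line selection rules over the same gap proposals (a sketch; the trajectory duration attribute is an assumed name):

    import random

    def random_agent(reachable_gaps):
        """Picks any reachable gap uniformly at random."""
        return random.choice(reachable_gaps)

    def greedy_agent(reachable_gaps, trajectory_planner, env_state):
        """Picks the gap whose best trajectory reaches it fastest, ignoring long-term effects."""
        return min(reachable_gaps,
                   key=lambda gap: trajectory_planner.best_trajectory(env_state, gap).duration)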
B. Evaluation setup
We evaluated all agents on the same set of realistic scenarios, generated randomly as described in Section IV-E. In order to objectively assess the driving of the agents, we generated scenarios with a varying number of surrounding vehicles, simulating fluctuating traffic density, from light to heavy. For each number of surrounding vehicles we have 10 different evaluation scenarios (80 in total). Additionally, we evaluate the performance of the agents on critical, more challenging highway scenarios, shown in Figure 4 and Figure 5, where planning ahead is crucial. We compared the average achieved speeds of all agents on the 80 scenarios. The average is calculated over the 10 scenarios with the same number of cars. In order to acquire a better overview of the performance, for the random agent and the RL agents we calculated an average over 10 different runs. For the RL agents, we calculated the average over 10 models trained with the same randomly initialized hyper-parameters (Table I). This allows for a better inspection of the robustness of the approach. We performed over 500 random search runs (configuration space shown in Table I) using the Hyperband framework for distributed hyper-parameter optimization [12].

Fig. 4: The RL agent (red), surrounded by slow vehicles, learned that it is beneficial to move over to the middle lane in order to reach the right-most lane, where it will make progress faster.

Fig. 5: A more challenging dense-traffic highway situation that can occur, e.g., when entering a highway. Only the options agent has learned that even though choosing the slow gap behind requires a prolonged braking maneuver, it is still the best long-term strategic choice.

TABLE I: Training hyper-parameters and configuration spaces

    Parameter          Final value    Search configuration space
    training steps
    batch size
    discount factor
    update type        soft
    learning rate
    τ
    φ hidden dim.      20
    φ output dim.      80
    ρ hidden dim.      80
    ρ output dim.      20
    FC1/FC2 dim.       100
    pooling fn.        Σ

VI. RESULTS
Fig. 6: Performance of all agents in the randomly generated scenarios. The shaded areas represent the standard deviation over the 10 runs for the random, the high-level and the options agents.

Figure 6 shows the final average speed achieved by each agent in the randomly generated scenarios. As expected, the random agent (grey) shows the worst performance. The greedy agent performs better than the random one, but worse than the high-level agent and the options agent. The Sumo agent, due to its simplistic strategy, falls slightly behind the greedy agent. The high-level agent and the options agent have similar performance in the generic scenarios. This is because in simple, usual scenarios, where the others behave predictably, the benefits of long-term planning are less visible. Additionally, the high-level agent has learned to exploit the longitudinal trajectories of the trajectory planner, which is sufficient for navigating the common, randomly generated scenarios. However, in the section below we show that the options agent is superior as a whole, since when it comes to delicate situations, it has learned to incorporate long-term planning, unlike any of the comparison agents.
A. Performance in the critical scenarios
For further investigation of the performance, we looked into specific, more challenging scenarios, where planning ahead is crucial. In the scenario in Figure 4 (from left to right), the RL agent (red) is in its initial gap, surrounded by slow vehicles. In the starting position, there are 2 available gaps: stay in the current gap or start reaching the gap on the middle lane. The RL agent and the leading vehicles of the reachable gaps are driving with the same speed, hence there is no obvious short-term incentive to change a lane. However, the options-selecting agent has learned that it is beneficial in the long term to move over to the middle lane, since once it is there, the gap to the right opens up, where it will be able to speed up and overtake the slow vehicles. For the high-level agent, since it is limited to single lane-changes (we do not consider double lane-changes directly, since more surrounding vehicles have to be considered and it is harder to guarantee safety), the only available gap is the starting one. This is because in the starting position its speed is the same as the speed of the leading vehicles and a lane change is not feasible. In any situation where, for example, the surrounding vehicles are driving with a similar speed as the RL agent and the RL agent is limited to discrete actions, a specific maneuver that may seem simple and logical may be infeasible. The IDM-based Sumo agent, even though completely informed about the surroundings, failed to react appropriately without a mechanism for long-term considerations. Since both reachable gaps have the same speed, and moving to the next lane would result in an additional penalty, the greedy agent also fails to exit the initial gap. In Figure 5, the agent and the 3 surrounding vehicles drive with the same speed, lower than the RL agent's desired speed (30 m/s). However, the surrounding vehicles have achieved their desired velocities and have no intention to accelerate. From that situation the agent has to brake in order to reach the gap on the middle lane. Even though braking, i.e. selecting a slow gap behind, translates directly into a lower immediate reward, the options agent learned that it is essential in the long run, since staying in the initial gap is only locally optimal. Figure 7 summarizes the performance of all agents, averaged over the multiple runs, in terms of average speed and duration on the scenarios from Figure 4 (blue) and Figure 5 (green), respectively.

Fig. 7: Performance of all agents in the challenging scenarios from Figure 4 and Figure 5, in terms of (a) average speed and (b) average duration.

VII. CONCLUSION

In this work, we introduced the Amortized Q-learning with Model-based Action Proposals approach for decision-making in common and critical highway situations. It is a Reinforcement Learning-based method, with an embedded optimal control trajectory planner, that is not limited to fixed action sets and supports a variable number of inputs. This is especially beneficial in confined-space dynamic environments, where precise problem definition is crucial. Moreover, it brings the best of both worlds together: safe optimization-based trajectory planning and execution, and long-term optimal decision-making. We evaluated our options-selecting agent against 4 benchmark agents: a random agent, a greedy agent, an IDM-based SUMO agent and a high-level discrete actions agent. We were able to show that the agents trained with our approach not only achieved better overall performance but also were the only ones that managed to successfully navigate the challenging scenarios. In the future, it would be interesting to incorporate an additional interaction term, such as courtesy, in the reward signal. This would make sure that the agent is cooperative and takes into account how its driving affects others.

REFERENCES
[1] Ali Alizadeh et al. "Automated Lane Change Decision Making using Deep Reinforcement Learning in Dynamic and Uncertain Highway Environment". 2019. arXiv preprint.
[2] A. Bacha et al. "Odin: Team VictorTango's entry in the DARPA Urban Challenge". In: J. Field Robotics 25 (2008), pp. 467–492.
[3] F. Borrelli et al. "MPC-Based Approach to Active Steering for Autonomous Vehicle Systems". In: International Journal of Vehicle Autonomous Systems.
[4] P. Falcone et al. "Predictive Active Steering Control for Autonomous Vehicle Systems". In: IEEE Transactions on Control Systems Technology 15 (2007), pp. 566–580.
[5] Hado van Hasselt, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-learning". 2015. arXiv preprint.
[6] C. Hoel, K. Wolff, and L. Laine. "Automated Speed and Lane Change Decision Making using Deep Reinforcement Learning". 2018, pp. 2148–2155.
[7] Maria Hügle et al. "Dynamic Input for Deep Reinforcement Learning in Autonomous Driving". (2019), pp. 7566–7573.
[8] Maria Hügle et al. "Dynamic Interaction-Aware Scene Understanding for Reinforcement Learning in Autonomous Driving". (2020), pp. 4329–4335.
[9] G. Kalweit et al. "Interpretable Multi Time-scale Constraints in Model-free Deep Reinforcement Learning for Autonomous Driving". In: arXiv abs/2003.09398 (2020).
[10] Gabriel Kalweit et al. "Deep Inverse Q-learning with Constraints". 2020. arXiv preprint.
[11] Daniel Krajzewicz et al. "Recent Development and Applications of SUMO - Simulation of Urban MObility". In: International Journal On Advances in Systems and Measurements.
[12] Lisha Li et al. "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization". In: CoRR abs/1603.06560 (2016). URL: http://arxiv.org/abs/1603.06560.
[13] Branka Mirchevska et al. "High-level Decision Making for Safe and Reasonable Autonomous Lane Changing using Reinforcement Learning". (2018), pp. 2156–2162.
[14] Volodymyr Mnih et al. "Playing Atari with Deep Reinforcement Learning". 2013. arXiv preprint.
[15] M. Montemerlo et al. "Junior: The Stanford Entry in the Urban Challenge". In: The DARPA Urban Challenge. 2009.
[16] W. Schwarting, Javier Alonso-Mora, and D. Rus. "Planning and Decision-Making for Autonomous Vehicles". 2018.
[17] W. Schwarting et al. "Parallel autonomy in automated vehicles: Safe motion generation with minimal intervention". (2017), pp. 1928–1935.
[18] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. "Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving". 2016. arXiv preprint.
[19] D. Silver et al. "Mastering the game of Go with deep neural networks and tree search". In: Nature 529 (2016), pp. 484–489.
[20] D. Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (2017), pp. 354–359.
[21] R. Sutton, D. Precup, and Satinder Singh. "Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning". In: Artificial Intelligence 112 (1999), pp. 181–211.
[22] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. The MIT Press, 2018. URL: http://incompleteideas.net/book/the-book-2nd.html.
[23] Martin Treiber, Ansgar Hennecke, and Dirk Helbing. "Congested traffic states in empirical observations and microscopic simulations". In: Physical Review E. ISSN: 1095-3787. DOI: 10.1103/PhysRevE.62.1805. URL: http://dx.doi.org/10.1103/PhysRevE.62.1805.
[24] C. Urmson et al. "Autonomous driving in urban environments: Boss and the Urban Challenge". In: J. Field Robotics 25 (2008), pp. 425–466.
[25] Pin Wang, Ching-Yao Chan, and Arnaud de La Fortelle. "A Reinforcement Learning Based Approach for Automated Lane Change Maneuvers". 2018. arXiv preprint.
[26] Pin Wang, Hanhan Li, and Ching-Yao Chan. "Quadratic Q-network for Learning Continuous Control for Autonomous Vehicles". 2019. arXiv preprint.
[27] Sen Wang, Daoyuan Jia, and Xinshuo Weng. "Deep Reinforcement Learning for Autonomous Driving". 2019. arXiv preprint.
[28] Christopher J. C. H. Watkins and Peter Dayan. "Q-learning". In: Machine Learning. 1992, pp. 279–292.
[29] M. Werling et al. "Optimal trajectory generation for dynamic street scenarios in a Frenét Frame". In: 2010 IEEE International Conference on Robotics and Automation (2010), pp. 987–993.
[30] Tom Van de Wiele et al. "Q-Learning in enormous action spaces via amortized approximate maximization". In: CoRR abs/2001.08116 (2020). URL: https://arxiv.org/abs/2001.08116.
[31] Changxi You et al. "Advanced Planning for Autonomous Vehicles Using Reinforcement Learning and Deep Inverse Reinforcement Learning". In: Robotics and Autonomous Systems 114 (Jan. 2019).
[32] Manzil Zaheer et al. "Deep Sets". In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., 2017, pp. 3391–3401. URL: http://papers.nips.cc/paper/6931-deep-sets.pdf.