Meta Reinforcement Learning-Based Lane Change Strategy for Autonomous Vehicles
Fei Ye, Pin Wang, Ching-Yao Chan and Jiucai Zhang

Abstract—Recent advances in supervised learning and reinforcement learning have provided new opportunities to apply related methodologies to automated driving. However, there are still challenges to achieving automated driving maneuvers in dynamically changing environments. Supervised learning algorithms such as imitation learning can generalize to new environments by training on a large amount of labeled data; however, it can often be impractical or cost-prohibitive to obtain sufficient data for each new environment. Although reinforcement learning methods can mitigate this data-dependency issue by training the agent in a trial-and-error way, they still need to re-train policies from scratch when adapting to new environments. In this paper, we therefore propose a meta reinforcement learning (MRL) method to improve the agent's generalization capability so that it can make automated lane-changing maneuvers in different traffic environments, which are formulated as different traffic congestion levels. Specifically, we train the model at light to moderate traffic densities and test it at a new, heavy traffic density. We use both collision rate and success rate to quantify the safety and effectiveness of the proposed model. A benchmark model is developed based on a pretraining method, which uses the same network structure and training tasks as our proposed model for a fair comparison. The simulation results show that the proposed method achieves an overall success rate up to 20% higher than the benchmark model when generalized to the new environment of heavy traffic density. The collision rate is also reduced by up to 18% compared with the benchmark model. Finally, the proposed model shows more stable and efficient generalization capability when adapting to the new environment, and it can achieve a 100% success rate and 0% collision rate with only a few steps of gradient updates.
I. INTRODUCTION
Automated and semi-automated vehicles are gaining popularity for their potential use in transportation. Considerable developments have focused on autonomous driving applications during the past decade [1]–[6]. Particularly noticeable in the last few years, advancements in machine learning have prompted the application of such methodologies to the field of automated driving.

Supervised learning approaches, such as imitation learning, rely heavily on large amounts of labeled data [7]. For example, [8] uses imitation learning to learn from human demonstrations and introduces perturbations to discourage undesirable behaviors. Each task in supervised learning is trained separately, so the trained agent is often unable to generalize to other tasks. Furthermore, acquiring and labeling a sufficient amount of data for each individual task in autonomous driving can be costly and time-consuming, and it is also challenging to cover all real-world driving scenarios.

In comparison, reinforcement learning (RL) offers an alternative approach by training the agent in a trial-and-error way, and it does not require explicit human labeling or supervision of each data sample. Recently, RL methods have been increasingly applied to the decision making and control of autonomous vehicles [9]–[12]. For example, the Deep Q Network (DQN) was introduced to solve high-level decision-making problems with automated speed control for highway driving [10], [13]. A DQN model was established by Hoel et al. [10] in a simulation environment to issue driving behavioral commands (e.g., change lanes to the right/left, cruise in the current lane), and the study also compared the influence of different neural network structures on the agent's performance. More recent work [14] introduced a hierarchical architecture to train two separate policies for high-level decision making and low-level control execution.

Despite the rapid progress, classical reinforcement learning methods still need to re-train new policies from scratch for new tasks, which fails to exploit the properties learned from similar tasks. To truly achieve autonomous driving in the diverse and complex real-world environment, it will be essential for the autonomous driving agent to handle newly encountered situations using past experience [15]. This motivates us to introduce in this work a meta reinforcement learning (MRL) approach, which integrates meta-learning into deep reinforcement learning.

To illustrate our proposed approach, we apply the MRL method to develop a decision-making strategy for lane changing in highway driving. Decision-making for automated driving in a dynamically changing environment can be challenging due to the complex interactions and uncertain behaviors of other road users. With the proposed framework, the learned model can quickly adapt to new driving conditions, which may include environmental variations such as different traffic congestion levels, different road geometries, and different driving habits in different regions.

(F. Ye, P. Wang and C.-Y. Chan are with California PATH, University of California, Berkeley, Richmond, CA 94804, USA; email: {fye, pin_wang, cychan}@berkeley.edu. J. Zhang is with GAC R&D Center Silicon Valley Inc., 639 N. Pastoria Ave, Sunnyvale, CA 94085, USA; email: {jzhang}@gacrndusa.com.)
The MRL method enables the creation of a generic model that can be generalized to make automated lane-changing maneuvers for vehicles operating in different traffic environments.

To improve the policy's generalization capability, this paper adopts the framework of model-agnostic meta learning (MAML) [16]. Beyond adapting and generalizing to new tasks more efficiently, MAML is agnostic both to the architecture of the neural network and to the loss function. MAML is explicitly designed to train the model's initial parameters such that the model can reach optimal performance on a new task after just a few gradient updates.

Fig. 1. Schematics of MAML.

Applying MAML in the context of RL also requires two loops of optimization. A meta policy parameterized by θ is updated in the outer loop by minimizing the total expected loss over all of the training tasks. The inner loop performs task-specific policy updates and computes the expected loss using the updated policy. More specifically, for each task we collect trajectories using the meta initialization and then update the task-specific policy parameters by gradient descent to obtain the adapted policy. Trajectories collected with the adapted policy are later used to compute the meta objective, which is sent back to the outer loop for updating the meta policy parameter θ.

Safety is an issue of paramount concern for autonomous vehicles. We also propose a novel method of addressing driving safety in conjunction with the MRL framework. In the inner-loop task learner, a safety module is incorporated into the proximal policy optimization (PPO) [17] algorithm for learning each driving task. To further enhance safety in both the learning and execution phases, a safety intervention module [18] is added to reduce the chance of taking catastrophic actions in a complex interactive environment. Moreover, we introduce a novel option of aborting the lane change when the agent is choosing its lateral action, which enables the ego vehicle to avoid potential collisions by aborting and changing back to the original lane at any point while undertaking the lane change. In the longitudinal direction, the ego vehicle chooses which leading vehicle to follow, so it can perform speed adjustments even before making the actual lane change.

The rest of the paper is organized as follows: Section II describes the underlying principle of meta reinforcement learning. Section III presents the proposed meta reinforcement learning framework for learning an adaptable lane change strategy, with detailed descriptions of the reward design and model implementation. Section IV describes the simulation environment, the meta-learning task design, the benchmark, the evaluation metrics, and the results. Section V concludes the paper by highlighting the effectiveness of the proposed model.

II. META REINFORCEMENT LEARNING
Reinforcement learning teaches an agent how to act by interacting with its environment in order to maximize the expected cumulative reward for a certain task. However, reinforcement learning methods often suffer from data inefficiency and limited generalization. Meta learning, or learning to learn, refers to methods that enable agents to adapt quickly to new tasks using prior knowledge or inductive biases learned from previously seen related tasks. Recent efforts have explored meta-learning algorithms in the context of reinforcement learning, i.e., meta reinforcement learning [16], [19]–[21]. Existing meta reinforcement learning algorithms can be generally categorized into two principal families. The first category of methods leverages prior experience through a learned structure such as recurrent neural networks [19], [22]. The second set of methods is gradient-based meta reinforcement learning, which attempts to optimize the parameters of the network during meta-training such that they provide a good initialization point for adapting to new tasks with gradient descent [16], [20]. We extend our prior work [23], which developed a policy-gradient-based lane change strategy, to improve model generalization and task performance, and focus on a very influential gradient-based meta reinforcement learning method referred to as model-agnostic meta learning (MAML) [16].

The MAML algorithm is model-agnostic: it is agnostic both to the architecture of the neural network and to the loss function. The backbone of MAML is to optimize the meta parameters θ with gradient descent in two loops, an inner-loop task learner and an outer-loop meta learner, as shown in Fig. 1. MAML aims to learn a good parameter initialization and uses gradient descent for both the task-learner and meta-learner updates. These features give MAML great flexibility, making it applicable to reinforcement learning problems that maximize the expected cumulative reward through policy gradients.

Formally, we consider MAML as a neural network f_θ parameterized by θ, which can conduct task-specific fine-tuning using gradient descent when adapting to a new task. During the meta-training stage, as shown in Fig. 1, MAML operates in an inner loop and an outer loop. In the inner loop, the task learner initializes with the meta parameter θ and computes the updated parameter φ_i of the task learner for each task T_i using training data D^{tr}_i collected with the meta policy. It then evaluates the loss on validation data D^{vd}_i, sampled from trajectories collected with the updated model parameters φ_i. The evaluated loss for each task T_i can be written as

L_{T_i}(f_{θ'_i}) = L(φ_i, D^{vd}_i) = L(θ − α ∇_θ L_{RL}(θ, D^{tr}_i), D^{vd}_i),   (1)

where φ_i ← θ − α ∇_θ L_{RL}(θ, D^{tr}_i) is the updated model parameter for task T_i. The loss function for updating an RL policy has the general form

L_{RL}(θ) = Ê_t [ log π_θ(a_t | s_t) Â_t ],   (2)

where Ê_t is the expectation operator, π_θ is a stochastic RL policy, and Â_t is an estimator of the advantage function at time step t.
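To make the two-loop structure concrete, the snippet below sketches one meta-optimization iteration, covering the inner-loop adaptation of Eq. (1) and the outer-loop meta-update of Eq. (3) below, treating the policy parameters as a single tensor θ. It is a minimal illustration under stated assumptions, not the authors' implementation: `rl_loss` and `sample_trajectories` are placeholder callables standing in for the policy-gradient loss of Eq. (2) and trajectory collection in the simulator.

```python
import torch

def inner_adapt(theta, traj_train, rl_loss, alpha):
    # One inner-loop gradient step (Eq. (1)): adapt the meta parameters to task T_i
    # using trajectories D_tr_i collected with the meta policy.
    loss = rl_loss(theta, traj_train)
    grad = torch.autograd.grad(loss, theta, create_graph=True)[0]  # keep graph for the meta-gradient
    return theta - alpha * grad                                    # adapted parameters phi_i

def meta_iteration(theta, tasks, rl_loss, sample_trajectories, alpha, beta):
    # One outer-loop update: sum the post-adaptation losses over the training tasks
    # and take a gradient step on the meta parameters theta.
    meta_loss = torch.zeros(())
    for task in tasks:
        traj_train = sample_trajectories(theta, task)    # D_tr_i, collected with the meta policy
        phi = inner_adapt(theta, traj_train, rl_loss, alpha)
        traj_val = sample_trajectories(phi, task)        # D_vd_i, collected with the adapted policy
        meta_loss = meta_loss + rl_loss(phi, traj_val)   # post-update loss L(phi_i, D_vd_i)
    meta_grad = torch.autograd.grad(meta_loss, theta)[0]
    return (theta - beta * meta_grad).detach().requires_grad_()
```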
In the outer loop, the meta learner aggregates the per-task post-update losses L(φ_i, D^{vd}_i) and performs a meta-gradient update on the original model parameters θ as

θ ← θ − β ∇_θ Σ_{T_i ∼ p(T)} L(φ_i, D^{vd}_i),   (3)

where β is the learning rate of the outer loop. MAML optimizes the meta-policy parameters θ such that the expected loss across all training tasks after the inner-loop update is minimized.

At meta-test time, MAML adapts the meta parameters with a few iterations of gradient updates on rollout trajectories collected from the new task. In practice, the policy must be able to read in the state for each of the tasks, which typically requires the states to have at least the same dimensionality.

In summary, the essential idea of MAML is to find a set of neural network parameters that does not necessarily achieve optimal performance on the individual tasks at the meta-training stage, but can quickly adapt to new (unseen) tasks by fine-tuning with gradient descent.

Fig. 2. System architecture of the proposed MAML-based lane change method.

III. AUTOMATED LANE CHANGE FRAMEWORK BASED ON META REINFORCEMENT LEARNING
A. Overview
The framework of the proposed MAML model for performing automated lane change maneuvers is illustrated in Fig. 2.
1) Task Learner (Inner-Loop):
The task learner learns a decision-making strategy for automated mandatory lane change maneuvers that emphasizes safety, efficiency, and comfort. Instead of using the vanilla MAML inner-loop optimization, i.e., the REINFORCE [24] algorithm as in [16], we further improve learning efficiency and driving safety by implementing a proximal policy optimization (PPO) [17] method, combined with a safety module, for inner-loop optimization. The PPO algorithm is built upon an actor-critic structure, in which the parameterized actor enforces a trust region with clipped objectives, offering good computational efficiency and learning stability, while the critic supplies the actor with a low-variance estimate of performance. These properties improve PPO's suitability for real-life applications. The safety module can modify the exploration process of the task learner through the incorporation of a risk metric to further improve driving safety.
2) Meta-Learner (Outer-Loop):
The meta learner enables efficient model adaptation of the lane change policy to new situations (e.g., different traffic densities, different road geometries, or different driving habits in different regions). In the outer loop, the meta policy parameterized by θ is updated using the expected return computed from trajectories collected over all the training tasks with the adapted inner-loop policies.

B. Inner-Level Task-Learner Design via PPO

1) Inner-level objective:
The clipped surrogate loss objective has shown better performance than the REINFORCE loss used in the vanilla MAML algorithm. The clipped surrogate loss function of PPO combines the policy surrogate and a value function error term, and is defined as [17]

L^{CLIP+VF+S}(θ) = Ê_t [ L^{CLIP}(θ) − c_1 L^{VF}(θ) + c_2 S[π_θ](s_t) ],   (4)

where L^{CLIP} is the clipped surrogate objective, c_1 and c_2 are coefficients, L^{VF} is the squared-error loss of the value function (V_θ(s_t) − V^{targ}_t)^2, and S denotes the entropy bonus. Specifically, the clipped surrogate objective L^{CLIP} takes the following form

L^{CLIP}(θ) = Ê_t [ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],   (5)

where ε is a hyperparameter, and r_t(θ) denotes the probability ratio r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t). In this manner, the probability ratio r_t is clipped at 1 − ε or 1 + ε, depending on whether the advantage is positive or negative, which forms the clipped objective after multiplying by the advantage estimator Â_t. The final value of L^{CLIP} takes the minimum of this clipped objective and the unclipped objective r_t(θ) Â_t, which effectively avoids taking a large policy update compared with the unclipped version [17]; the latter is also known as the loss function of the conservative policy iteration algorithm [25].

TABLE I
SIMULATION PARAMETERS SETUP

Parameter                  | Value | Description
timesteps per actor batch  | 512   | number of steps of environment per update
clip parameter             | 0.2   | clipping range
optim epochs               | 10    | number of training epochs per update
learning rate              | 1e-4  | learning rate
γ                          | –     | discount factor
λ                          | –     | –
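The following snippet sketches how the combined objective of Eqs. (4)–(5) can be computed. It is an illustrative PyTorch implementation rather than the authors' code; only the clipping range is taken from Table I, and the coefficient values c_1 and c_2 shown here are assumptions.

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, value_targets,
             entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    # r_t(theta): probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_obj = torch.min(unclipped, clipped).mean()      # L_CLIP, Eq. (5)
    value_loss = (values - value_targets).pow(2).mean()    # L_VF, squared-error critic loss
    # Eq. (4) is an objective to maximize; return its negative as a loss to minimize.
    return -(policy_obj - c1 * value_loss + c2 * entropy.mean())
```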
2) Network Structure and Optimizer:
The policy network consists of an input layer, two hidden layers, and an output layer. The Adam optimizer [26] is applied to compute the policy gradient and update the network weights.
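As a concrete illustration of such a network, the sketch below builds a small actor-critic MLP with two hidden layers in PyTorch. The layer sizes, the number of discrete actions, the shared trunk for actor and critic, and the activation function are all assumptions made for illustration; the paper's exact dimensions are not reproduced here.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim=18, hidden_dim=64, num_actions=6):
        super().__init__()
        # Shared trunk: input layer followed by two hidden layers.
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden_dim, num_actions)  # actor: logits over discrete actions
        self.value_head = nn.Linear(hidden_dim, 1)              # critic: state-value estimate

    def forward(self, state):
        h = self.trunk(state)
        dist = torch.distributions.Categorical(logits=self.policy_head(h))
        return dist, self.value_head(h).squeeze(-1)
```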
3) Parameter Settings:
In terms of PPO hyper-parameters, we use the Adam optimizer with learning rate annealing and an initial step size of 1e-4, and we set the horizon T = 512, the mini-batch size to 64, and the clipping parameter ε = 0.2; the remaining parameters, including the discount factor γ, are listed in Table I.
4) Reward Function:
The reward function is designed to incorporate the key objectives of the vehicle maneuver, namely to develop an automated lane change strategy centered around the evaluation metrics of safety, efficiency, and comfort. More specifically, these terms are defined as follows:

a. Comfort: this reward is introduced to avoid sudden acceleration and deceleration of the vehicle that may cause occupant discomfort. Here, the comfort reward is an evaluation of jerk in both the lateral and longitudinal directions.

b.
Efficiency: the ego vehicle should manage to move to the target lane as soon as possible without exceeding the speed limit. Thus, the efficiency reward is an evaluation of the relative lateral distance to the target lane, the time cost, and the vehicle speed.

c.
Safety: the ego vehicle should avoid collisions with surrounding vehicles. Thus, the safety reward is an evaluation of the risk of collisions and near-collisions. Here, we introduce a near-collision penalty and a safety module that modifies the exploration process by incorporating external knowledge to further improve driving safety during the learning and execution phases.

The near-collision penalty R_near_collision and the risk metric are both conditioned on the state-action space, rather than only penalizing a collision that actually takes place. In this way, the ego vehicle C_e can learn to abort the lane change maneuver if its relative distance to a surrounding vehicle C_i is smaller than a predefined threshold, indicating that a collision is likely to happen. The specific form of this near-collision penalty term R_near_collision in terms of the relative positions is shown in Table II, in which F(C_e, C_i) is defined as

F(C_e, C_i) = −1 / ( |P^e_y − P^i_y| + c ),   (6)

where P^e_y is the longitudinal position of the ego vehicle C_e, P^i_y is the longitudinal position of the surrounding vehicle C_i, and c is a small positive constant.

The safety module works in a reactive way: it evaluates the collision risk and corrects the learner when the chosen action would cause a catastrophic result (i.e., a collision with another vehicle) [27]. The main idea is to distinguish truly "catastrophic" actions from merely "sub-optimal" actions in the automated lane changing process. The safety module evaluates the risk of collision based on the relative distances to the surrounding vehicles in both the lateral and longitudinal directions, and outputs a binary label that classifies whether the currently chosen action is "catastrophic" and which vehicle is associated with the risk. The safety module then selects a different, safe action based on the current state and the returned label to replace the "catastrophic" action.

More details on the formulation of the reward functions that explicitly quantify the aforementioned evaluation metrics can be found in [23].
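A minimal sketch of how the near-collision penalty of Eq. (6) and the reactive safety module might be realized is given below. Function and variable names are hypothetical, and the distance threshold and the small offset `c` are placeholders rather than the paper's values.

```python
def near_collision_penalty(ego_y, other_y, c=0.1):
    """Penalty F(C_e, C_i) of Eq. (6); its magnitude grows as the longitudinal gap shrinks.
    The small offset c avoids division by zero (its exact value is an assumption here)."""
    return -1.0 / (abs(ego_y - other_y) + c)

def safety_filter(proposed_action, gaps, threshold, safe_fallback_action):
    """Sketch of the reactive safety module: if the proposed action would bring the ego
    vehicle within `threshold` of any relevant surrounding vehicle, label it catastrophic
    and substitute a safe fallback action. `gaps` maps each surrounding vehicle id to its
    current distance from the ego vehicle."""
    catastrophic = any(gap < threshold for gap in gaps.values())
    if catastrophic:
        return safe_fallback_action, True   # action replaced, risk label raised
    return proposed_action, False
```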
5) Action Space:
We design the action space in both the lateral and longitudinal directions, so that the agent can learn when and how to perform a lane change.

In a real-world scenario, a driver's execution of the lane change decision can also be affected by the interactions between the ego vehicle and other vehicles. We therefore introduce an aborting-lane-change action into the lateral action space, which enables the vehicle to abort the lane change in order to avoid a potential collision at any point while undertaking the maneuver.

In the longitudinal direction, the ego vehicle needs to choose which leading vehicle to follow, so it can perform speed adjustments even before making the actual lane change. A sketch of such an action space is given below.
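The enumeration below illustrates one way to encode this two-dimensional discrete action space. The lateral action indices follow Table II (0: lane keeping, 2: aborting lane change); the particular set of longitudinal choices shown here (current-lane leader versus target-lane leader) is an assumption made for illustration.

```python
from enum import IntEnum
from itertools import product

class LateralAction(IntEnum):
    LANE_KEEPING = 0
    LANE_CHANGING = 1
    ABORTING_LANE_CHANGE = 2   # return to the original lane mid-maneuver

class LongitudinalAction(IntEnum):
    FOLLOW_CURRENT_LANE_LEADER = 0
    FOLLOW_TARGET_LANE_LEADER = 1

# Joint discrete action space the policy samples from at every decision step.
ACTIONS = list(product(LateralAction, LongitudinalAction))
```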
C. Outer-Level Meta-learner Design
The meta learner is an optimization-based meta-RL agent that learns a prior parameter initialization over a distribution of tasks, which transfers to new tasks via fine-tuning with gradient descent. Specifically, the meta-optimization in this study is performed with the following objective:

min_θ Σ_{T_i ∼ p(T)} L_{T_i}(f_{θ'_i}) = Σ_{T_i ∼ p(T)} L_{T_i}( f_{θ − α ∇_θ L_{T_i}(f_θ)} ),   (7)

where T_i ∼ p(T) represents each training task drawn from a task distribution p(T). The model parameters θ are trained by optimizing the post-adaptation performance across all the training tasks. The network structure of the outer-level meta learner is the same as that of the inner-level task learner. As in general meta-RL formulations, we assume that both training and test tasks are drawn from the same task distribution p(T), where each task is a Markov decision process (MDP) consisting of a set of actions, states, stochastic dynamics, and reward functions.
TABLE II
NEAR-COLLISION REWARD CONDITIONED ON THE ACTION SPACE
(C_1: current lane leader, C_2: current lane follower, C_3: target lane leader, C_4: target lane follower)

Lateral action 0: Lane keeping           | R_near_collision = F(C_e, C_1)
Lateral action 1: Lane changing          | R_near_collision = min(F(C_e, C_3), F(C_e, C_4))
Lateral action 2: Aborting lane change   | R_near_collision = min(F(C_e, C_1), F(C_e, C_2))

Fig. 3. Interactions between the simulation environment and the meta learner.
In this study, different tasks are formulated as performing lane change maneuvers at different traffic densities. At the training phase, the training tasks are drawn from mild to moderate traffic conditions. At the testing phase, we evaluate our model in a new environment featuring heavy, dense traffic. The interaction of the outer-level meta learner with the different traffic-condition tasks in the SUMO environment is illustrated in Fig. 3.
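For illustration, the task distribution p(T) can be represented simply as a set of traffic-density levels from which training tasks are drawn; the probability-factor values below are placeholders, not the values used in the paper.

```python
import random

# Each task T_i is a traffic-density level, identified by the SUMO vehicle-release
# probability factor f in [0, 1]. The numeric values here are illustrative only.
TRAIN_TASKS = [0.2, 0.3, 0.4]   # light-to-moderate traffic used for meta-training
TEST_TASK = 0.7                 # heavy traffic, unseen during training (meta-testing)

def sample_training_task():
    """Draw one task T_i ~ p(T) for a meta-optimization iteration."""
    return random.choice(TRAIN_TASKS)
```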
IV. SIMULATION EXPERIMENT
In this section, we evaluate the performance of the proposed model on more challenging, unseen lane-changing tasks in highway driving. The goal of the experiment is to demonstrate the effectiveness of the high-level meta learner combined with a low-level, safety-enhanced task learner in achieving more reliable and more adaptive learning in a complex environment.
A. Simulation Setup
The simulation network models a real-world three-lane highway segment with on-ramps and off-ramps, as shown in Fig. 4, and is implemented in SUMO [28]. The highway segment length to the ramp exit is 800 m and the width of each lane is 3.75 m.

The low-level vehicle control is implemented through the Traffic Control Interface (TraCI). Specifically, we implement an intelligent driver model (IDM) [29] for the car-following behavior of the other vehicles in the simulation. For the ego vehicle, we adapt the IDM for low-level longitudinal control so that it can adjust the vehicle speed when following different leaders. More details regarding the IDM settings can be found in [23]. A minimal sketch of this interface is given below.

In SUMO, the vehicle counts are generated from a probability factor f ∈ [0, 1], which represents the probability of a vehicle being released in each second. We simulate the highway exit behavior at light, moderate, and heavy traffic conditions by drawing f from increasingly higher sub-ranges of [0, 1].

Fig. 4. Simulation network.
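The snippet below shows roughly how the ego vehicle can be driven through TraCI; the configuration file name, the ego-vehicle id, and the specific speed and lane commands are placeholders, and the sketch assumes the ego vehicle has already been inserted into the running simulation. It is not the authors' control implementation.

```python
import traci

traci.start(["sumo", "-c", "highway.sumocfg"])   # launch SUMO with the scenario configuration
try:
    for _ in range(600):
        # Longitudinal control: command the ego speed computed by the (adapted) IDM.
        traci.vehicle.setSpeed("ego", 25.0)
        # Lateral control: request a move to an adjacent lane for a fixed duration.
        target_lane = min(traci.vehicle.getLaneIndex("ego") + 1, 2)  # 3-lane road: indices 0-2
        traci.vehicle.changeLane("ego", target_lane, 2.0)
        traci.simulationStep()                               # advance the simulation by one step
        ego_x, ego_y = traci.vehicle.getPosition("ego")      # observations fed back to the RL agent
finally:
    traci.close()
```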
B. Meta-RL Training and Evaluation Task Design

1) Meta agent: The meta agent uses the trained meta learner's network weights as its initialization parameters. When adapting to new tasks, the meta agent fine-tunes its policy on the meta-testing tasks starting from the meta learner's weights, which are obtained from the meta-learning process over the different meta-training tasks.

At the meta-training stage, three training tasks with light to moderate traffic density are drawn from the distribution p(T) for meta optimization. At the meta-testing stage, we use a more challenging task than the training tasks to evaluate the performance of the trained meta learner. Specifically, a heavy traffic density scenario is selected as the meta-testing task; this task has never been seen during the training process.
2) Pretrained agent (Benchmark):
To make a fair comparison, a benchmark is created based on a pretrained agent that uses the same network structure and training tasks as the meta agent. When adapting to new tasks, the pretrained agent fine-tunes its policy on the meta-testing tasks starting from the pretrained model, which is trained on mixed data sampled from all the training tasks.
C. Evaluation Metrics
Besides evaluating the reward function described earlier, which models safety, efficiency, and comfort, we also add two performance-based metrics, namely the success rate and the collision rate, to quantify the safety and effectiveness of the proposed MAML-based automated lane change method in both the meta-training and meta-testing processes.

Fig. 5. Training results: comfort, efficiency, safety, and total average rewards, success rate, and collision rate versus training step.
1) Success Rate:
A successful run is defined as one in which the ego vehicle successfully changes to the target lane before reaching the highway exit while avoiding collisions with other vehicles. The success rate is defined as the ratio of the number of successful runs to the total number of simulation runs in the policy rollout. Unsuccessful runs are attributed to cases in which the autonomous agent fails to make a proper lane change, either successfully or safely, before the highway exit.
2) Collision Rate:
This metric evaluates the safety of the vehicle, as it is generally used to account for the occurrence of collision events. The collision rate is defined as the ratio of the number of collision events to the total number of simulation runs in the policy rollout.
D. Results
The training results of the meta agent are shown in Fig. 5, which presents the performance that the meta agent achieves after a one-step gradient update during the training process. As discussed earlier, the reward function consists of three components: comfort evaluated by jerk, efficiency, and safety. We can observe from Fig. 5 that both the comfort and efficiency rewards improve relatively quickly over the course of training. The safety reward slopes down gradually as training evolves. At the beginning, the meta agent hardly makes any lane change attempts, as both the success rate and the collision rate are near 0. When the meta agent starts to explore by taking more lane change actions, the safety reward begins to decline. Because the safety reward accounts for the occurrence of both collision and near-collision events, it can also be read as a measure of the aggressiveness of the lane change actions. Finally, the desired performance is reached after a sufficient number of training steps, as the lane change success rate climbs to a high level and the collision rate is reduced to 0, indicating that the meta agent has learned a safe and effective lane change strategy on the meta-training tasks.

Fig. 6 illustrates the adaptation results of both the meta agent and the benchmark pretrained agent. In the meta-testing stage, both agents are exposed to the same new environment, and our goal is to evaluate their adaptation capabilities in this unseen environment. Each subfigure of Fig. 6 has two lines: the orange line represents the pretrained agent, and the green one represents the meta agent. We use the same metrics to evaluate their performance, first in terms of the total average return, which consists of the comfort, efficiency, and safety components, and then in terms of the task performance measured by the success rate and the collision rate.

In reinforcement learning, few-shot adaptation is typically understood as a few steps of gradient adaptation. As shown in Fig. 6, our meta agent clearly outperforms the pretrained agent, i.e., the benchmark. First, the meta agent achieves much better safety metrics, as its collision rate quickly drops to zero after around 20 gradient steps. While both the meta agent and the pretrained agent have similar collision rates at the beginning of the meta-testing phase, the pretrained agent ends up with a higher collision rate. Second, the rapid rise of the success rate up to 100% shows that the meta agent can adapt to the new environment much more quickly than the pretrained agent. Looking at each reward component individually, the meta agent and the pretrained agent end up with similar comfort rewards. In terms of the efficiency reward, which evaluates how quickly the vehicle can switch to the desired highway exit lane, the performance of the meta agent is slightly lower than that of the pretrained agent. The main reason for this discrepancy is that the meta agent sacrifices some efficiency for better safety, as it learns to be patient in order to avoid unnecessary collisions.

Tables III and IV present the comparison between the meta agent and the pretrained agent in terms of success rate and collision rate. It can be observed that the meta agent consistently outperforms the pretrained agent as the number of gradient steps grows. For example, with only 5 gradient steps, the meta agent has a 20% advantage in success rate and an 18% advantage in collision rate over the pretrained agent. At 20 gradient steps, the meta agent reaches a 100% success rate, while the pretrained agent only reaches 90%. Additionally, the collision rate of the pretrained agent (8%) remains much greater than that of the meta agent, which has already been reduced to zero.

Fig. 6. Adaptation results: comfort, efficiency, safety, and total average rewards, success rate, and collision rate versus number of gradient steps, for the meta agent and the pretrained agent.

TABLE III
AVERAGE SUCCESS RATE

Model            | 5 gradient steps | 10 gradient steps | 20 gradient steps
Pretrained agent | 78%              | 86%               | 90%
Meta agent       | 98%              | 100%              | 100%
Advantage        | 20%              | 14%               | 10%
TABLE IV
AVERAGE COLLISION RATE

Model            | 5 gradient steps | 10 gradient steps | 20 gradient steps
Pretrained agent | 20%              | 14%               | 8%
Meta agent       | 2%               | 0%                | 0%
Advantage        | 18%              | 14%               | 8%
V. CONCLUSIONS
This paper proposes a strategy for automated mandatory lane change based on meta reinforcement learning. The meta agent is trained with the MAML framework, which enables the model's quick adaptation to unseen tasks (i.e., a more challenging scenario with heavy traffic). In less than 10 steps of gradient updates, the meta agent quickly improves the lane change performance to an average success rate of 99% and reduces the collision rate to less than 0.2%. The meta agent provides a more stable and efficient starting point for learning a new task and can achieve a 100% success rate and 0% collision rate with a few steps of gradient updates. In summary, the meta agent consistently outperforms the pretrained agent in both adaptation speed and task performance.
REFERENCES

[1] D. González, J. Pérez, V. Milanés, and F. Nashashibi, "A review of motion planning techniques for automated vehicles," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 4, pp. 1135–1145, 2015.
[2] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, "A survey of motion planning and control techniques for self-driving urban vehicles," IEEE Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 33–55, 2016.
[3] W. Schwarting, J. Alonso-Mora, and D. Rus, "Planning and decision-making for autonomous vehicles," Annual Review of Control, Robotics, and Autonomous Systems, 2018.
[4] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, "A survey of autonomous driving: common practices and emerging technologies," arXiv preprint arXiv:1906.05113, 2019.
[5] V. Talpaert, I. Sobh, B. R. Kiran, P. Mannion, S. Yogamani, A. El-Sallab, and P. Perez, "Exploring applications of deep reinforcement learning for real-world autonomous driving systems," arXiv preprint arXiv:1901.01536, 2019.
[6] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu, "A survey of deep learning techniques for autonomous driving," Journal of Field Robotics, vol. 37, no. 3, pp. 362–386, 2020.
[7] J. Zhang and K. Cho, "Query-efficient imitation learning for end-to-end autonomous driving," 2016.
[8] M. Bansal, A. Krizhevsky, and A. Ogale, "ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst," 2018.
[9] A. Sallab, M. Abdou, E. Perot, and S. Yogamani, "Deep reinforcement learning framework for autonomous driving," Electronic Imaging, vol. 2017, pp. 70–76, Jan 2017.
[10] C.-J. Hoel, K. Wolff, and L. Laine, "Automated speed and lane change decision making using deep reinforcement learning," in Proc. Int. Conf. Intell. Transp. Syst. (ITSC), Nov 2018.
[11] P. Wang, C.-Y. Chan, and A. de La Fortelle, "A reinforcement learning based approach for automated lane change maneuvers," in IEEE Intell. Veh. Symp. (IV), Jun 2018.
[12] Y. Ye, X. Zhang, and J. Sun, "Automated vehicle's behavior decision making using deep reinforcement learning and high-fidelity simulation environment," Transportation Research Part C: Emerging Technologies, vol. 107, pp. 155–170, 2019.
[13] B. Mirchevska, C. Pek, M. Werling, M. Althoff, and J. Boedecker, "High-level decision making for safe and reasonable autonomous lane changing using reinforcement learning," in Proc. Int. Conf. Intell. Transp. Syst. (ITSC), 2018, pp. 2156–2162.
[14] T. Shi, P. Wang, X. Cheng, C. Chan, and D. Huang, "Driving decision and control for automated lane change behavior based on deep reinforcement learning," in Proc. Int. Conf. Intell. Transp. Syst. (ITSC), Oct 2019, pp. 2895–2900.
[15] A. E. Sallab, M. Saeed, O. A. Tawab, and M. Abdou, "Meta learning framework for automated driving," 2017.
[16] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," arXiv preprint arXiv:1703.03400, 2017.
[17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[18] W. Saunders, G. Sastry, A. Stuhlmueller, and O. Evans, "Trial without error: Towards safe reinforcement learning via human intervention," arXiv preprint arXiv:1707.05173, 2017.
[19] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel, "RL²: Fast reinforcement learning via slow reinforcement learning," arXiv preprint arXiv:1611.02779, 2016.
[20] A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine, "Meta-reinforcement learning of structured exploration strategies," in Advances in Neural Information Processing Systems, 2018, pp. 5302–5311.
[21] A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn, "Learning to adapt in dynamic, real-world environments through meta-reinforcement learning," arXiv preprint arXiv:1803.11347, 2018.
[22] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick, "Learning to reinforcement learn," arXiv preprint arXiv:1611.05763, 2016.
[23] F. Ye, X. Cheng, P. Wang, and C.-Y. Chan, "Automated lane change strategy using proximal policy optimization-based deep reinforcement learning," arXiv preprint arXiv:2002.02667, 2020.
[24] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[25] S. Kakade and J. Langford, "Approximately optimal approximate reinforcement learning," in ICML, vol. 2, 2002, pp. 267–274.
[26] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[27] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu, "Safe reinforcement learning via shielding," arXiv preprint arXiv:1708.08611, 2017.
[28] M. Behrisch, L. Bieker, J. Erdmann, and D. Krajzewicz, "SUMO (Simulation of Urban Mobility): an overview," in Proceedings of SIMUL 2011, The Third International Conference on Advances in System Simulation. ThinkMind, 2011.
[29] M. Treiber, A. Hennecke, and D. Helbing, "Congested traffic states in empirical observations and microscopic simulations," Physical Review E, vol. 62, no. 2, pp. 1805–1824, 2000.