Longitudinal Dynamic versus Kinematic Models for Car-Following Control Using Deep Reinforcement Learning
Yuan Lin, John McPhee, and Nasser L. Azad

Yuan Lin is a Postdoctoral Fellow in the Systems Design Engineering Department at the University of Waterloo, Ontario, Canada N2L 3G1. [email protected]
John McPhee is a Professor and Canada Research Chair in the Systems Design Engineering Department at the University of Waterloo, Ontario, Canada N2L 3G1. [email protected]
Nasser L. Azad is an Associate Professor in the Systems Design Engineering Department at the University of Waterloo, Ontario, Canada N2L 3G1. [email protected]

Abstract — The majority of current studies on autonomous vehicle control via deep reinforcement learning (DRL) utilize point-mass kinematic models, neglecting vehicle dynamics, which include acceleration delay and acceleration command dynamics. The acceleration delay, which results from sensing and actuation delays, leads to delayed execution of the control inputs. The acceleration command dynamics dictates that the actual vehicle acceleration does not rise to the commanded acceleration instantaneously. In this work, we investigate the feasibility of applying DRL controllers trained using vehicle kinematic models to more realistic driving control with vehicle dynamics. We consider a particular longitudinal car-following control problem, i.e., Adaptive Cruise Control (ACC), solved via DRL using a point-mass kinematic model. When such a controller is applied to car following with vehicle dynamics, we observe significantly degraded car-following performance. Therefore, we redesign the DRL framework to accommodate the acceleration delay and acceleration command dynamics by adding the delayed control inputs and the actual vehicle acceleration, respectively, to the reinforcement learning environment state. The training results show that the redesigned DRL controller achieves near-optimal car-following performance with vehicle dynamics considered, when compared with dynamic programming solutions.
I. INTRODUCTION

Reinforcement learning is a goal-directed learning-based method that can be used for control tasks [1]. Reinforcement learning is formulated as a Markov Decision Process (MDP) wherein an agent takes an action based on the current environment state and receives a reward as the environment moves to the next state due to the action taken. The goal of the reinforcement learning agent is to learn a state-action mapping policy that maximizes the long-term cumulative reward. DRL utilizes deep (multi-layer) neural nets to approximate the optimal state-action policy through trial and error as the agent interacts with the environment during training [2]. DRL has achieved recent breakthroughs as it surpassed humans in playing board games [3]. DRL is actively evolving, and various algorithms have been developed, including Deep Q Networks [2], Deep Deterministic Policy Gradient (DDPG) [4], Distributed Distributional Deterministic Policy Gradient [5], and Soft Actor Critic [6].
Connected and automated vehicles have become increasingly popular in academia and industry since the DARPA Urban Challenge, as autonomous driving could potentially become a reality [7]. Fully autonomous driving is a challenging task since transportation traffic can be dynamic, high-speed, and unpredictable. The Society of Automotive Engineers has defined multiple levels of automation as we progress from partial automation, such as Advanced Driver Assistance Systems (ADAS), to full automation. Current ADAS include ACC, lane-keeping assistance, lane-change assistance, emergency braking assistance, and driver drowsiness detection [8]. Future highly automated vehicles shall be able to tackle more challenging traffic scenarios such as freeway on-ramp merging, intersection maneuvers, and roundabout traversing.

Since DRL has been demonstrated to surpass humans in certain domains, it could potentially be suited to solving the challenging tasks in automated driving to achieve superhuman performance. In the current literature, DRL has been used to tackle various traffic scenarios for automated driving. In [9], Deep Q-learning is used to guide an autonomous vehicle to merge onto a freeway from an on-ramp. In [10], [11], [12], [13], Deep Q Networks and/or DDPG allow an autonomous vehicle to maneuver through a single intersection while avoiding collisions. In [14], DRL is used to solve the lane-change maneuver. Other studies have also used DRL to train a single agent to handle a variety of driving tasks [15], [16].

However, all the above-mentioned studies consider point-mass kinematic models of the vehicle, instead of vehicle dynamic models wherein acceleration delay and acceleration command dynamics are included. With acceleration delay, the reinforcement learning action such as the target acceleration is delayed in time; with acceleration command dynamics, the actual acceleration does not rise to the target acceleration immediately [17]. We acknowledge that acceleration command dynamics is considered in a couple of recent works that use DRL for vehicle control. In [18], a longitudinal dynamic model is considered for predictive speed control using DDPG. In [19], a car-following controller with acceleration command dynamics considered is developed using DDPG by learning from naturalistic human-driving data. However, neither study investigated the impact of acceleration delay, which could degrade the control performance.

Regarding car-following control using DRL, there are other studies in the literature that have developed such controllers. In [20], a cooperative car-following controller is developed using policy gradient with a single-hidden-layer neural net. In [21], [22], human-like car-following controllers without considering vehicle dynamics are developed using deterministic policy gradient by learning from naturalistic human-driving data. To the best of our knowledge, there is currently no study that utilizes DRL to develop an ACC controller in simulation (not learning from naturalistic data).

There are studies in the literature that investigate delayed control inputs in non-deep reinforcement learning. It is suggested that the delay can negatively influence control performance if it is not considered in the reinforcement learning controller development [23]. A few approaches have been proposed to cope with control delay in reinforcement learning.
In [24], the environment state is augmented by adding the delayed control inputs, i.e., the actions in the delay interval which have not been executed, for developing a vehicle speed controller using reinforcement learning whose state-action mapping policy is a decision tree instead of a neural net. In [25], the authors proposed to learn the underlying dynamic system model so as to use the model to predict the future state after the delay for the purpose of determining the current control action. In [26], a memoryless method that exploits the delay length is proposed to directly learn the control action from the current environment state, with the state-action mapping policy being a tile-coding function instead of a neural net. There is currently no study that investigates how a deep neural net trained in a no-control-delay environment responds to control delay, and no work that develops a DRL controller with control delay considered.

The contribution of this work is studying the necessity and methodology of incorporating vehicle dynamics, which include both acceleration delay and acceleration command dynamics, in developing DRL controllers for automated vehicle control. We first investigate whether a DRL agent trained using a vehicle kinematic model could be used for more realistic control with vehicle dynamics. We consider a particular car-following scenario wherein the preceding vehicle maintains a constant speed. As the DRL controller trained using a kinematic model causes significantly degraded performance when vehicle dynamics exists, we redesign the DRL controller by adding the delayed control inputs and the actual acceleration to the environment state [27], [24] to accommodate vehicle dynamics.

II. CAR-FOLLOWING PROBLEM FORMULATION

In this section, we derive the state-space equations of the car-following control system so as to (1) understand how it fits into the reinforcement learning framework with state-action mapping, and (2) use dynamic programming (DP) to compute the globally optimal solutions for comparison with DRL solutions. DP is based on the state-space equations and checks all permissible state values to search for the global minimum cost of the control system [28]; a sketch of this grid search is given below.

We acknowledge that the relatively easy car-following control problem may preferably be solved using classical control methods instead of DRL, which is more suited to challenging control tasks such as freeway on-ramp merging. We choose the car-following control problem here because it can be explicitly modeled to obtain the state-space equations, with which we can use DP to solve for guaranteed globally optimal solutions for comparison purposes. The DP solutions are critical because they serve as benchmarks against which we can evaluate the DRL controllers trained with either the vehicle dynamic or kinematic model. Other autonomous driving control tasks such as freeway on-ramp merging may not be explicitly modeled since they involve highly complex multi-vehicle interactions.
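To illustrate the kind of grid-based search meant here, the minimal Python sketch below runs backward induction over evenly spaced samples of the states and control inputs, using the kinematic model and absolute-value cost that are derived in the remainder of this section (Equations (3) and (7)). The grid ranges, resolutions, and cost weights are illustrative placeholders, not the values used for the benchmarks in this paper.

```python
import numpy as np

# Illustrative parameters only; not the paper's Table I values.
DT, T_STEPS = 0.1, 200
ALPHA, BETA, E_NMAX, U_MAX = 0.8, 0.2, 10.0, 3.0

# Evenly spaced samples of the permissible state and action values.
e_grid = np.linspace(-10.0, 10.0, 81)    # gap-keeping error x1 [m]
de_grid = np.linspace(-5.0, 5.0, 41)     # error rate x2 [m/s]
u_grid = np.linspace(-U_MAX, U_MAX, 21)  # control input [m/s^2]

def nearest(grid, value):
    """Index of the grid point closest to value (successor states snap to the grid)."""
    step = grid[1] - grid[0]
    return int(np.clip(np.round((value - grid[0]) / step), 0, len(grid) - 1))

# Backward induction over the finite horizon with the kinematic model of Eq. (3).
V = np.zeros((len(e_grid), len(de_grid)))            # terminal cost-to-go is zero
for _ in range(T_STEPS):
    V_new = np.empty_like(V)
    for i, e in enumerate(e_grid):
        for j, de in enumerate(de_grid):
            best = np.inf
            for u in u_grid:
                stage = (ALPHA * abs(e) / E_NMAX + BETA * abs(u) / U_MAX) * DT
                e_next, de_next = e + DT * de, de - DT * u
                cand = stage + V[nearest(e_grid, e_next), nearest(de_grid, de_next)]
                best = min(best, cand)
            V_new[i, j] = best       # minimum cost-to-go over all permissible inputs
    V = V_new
```

As noted with the results later in the paper, a finer state grid reduces the residual steady-state error of the DP solution at the cost of a longer run time.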
Fig. 1. Schematic for car following with a constant distance headway.
We consider a simple car-following control problem wherein a following vehicle i desires to maintain a constant distance headway $d_d$ between itself and its preceding vehicle i−1, see Fig. 1. The gap-keeping error dynamic equations of the car-following control system can be derived as:

$$e = l_{i-1} - l_i - b_{i-1} - d_d, \quad \dot{e} = v_{i-1} - v_i, \quad \ddot{e} = a_{i-1} - a_i \quad (1)$$

where $e$ is the error between the actual inter-vehicle distance and the desired distance headway $d_d$, $b_{i-1}$ is the vehicle body length of the preceding vehicle i−1, $l_{i-1}$ and $l_i$ are the distances traveled by the preceding and following vehicles, respectively, $v_{i-1}$ and $v_i$ are the velocities of the preceding and following vehicles, respectively, and $a_{i-1}$ and $a_i$ are the actual accelerations of the preceding and following vehicles, respectively. For the state-space representation, we define $x_1 = e$ and $x_2 = \dot{e}$. Then

$$\dot{x}_1 = x_2, \quad \dot{x}_2 = a_{i-1} - a_i \quad (2)$$

Assuming no vehicle-to-vehicle communication, the preceding vehicle's acceleration $a_{i-1}$ is unknown to the following vehicle. As the DRL algorithm used here is Deep Deterministic Policy Gradient, which demands that the system be deterministic, we only consider the preceding vehicle's speed to be constant with $a_{i-1} = 0$.
In fact, without knowing the preceding vehicle's acceleration, the system is not closed and the exact optimal solution cannot be found. We found that even though the DRL neural nets are trained for this scenario in which the preceding vehicle has a constant speed, the trained neural nets can be applied to scenarios where the preceding vehicle accelerates or decelerates, with acceptable gap-keeping errors. Since the purpose of this paper is to compare the use of dynamic versus kinematic models for vehicle control, we do not show such results here.

Now we consider using the vehicle kinematic and dynamic models for the control. For a point-mass kinematic model, the following vehicle's control input $u_i$ is exactly the acceleration $a_i$, i.e., $u_i = a_i$. The vehicle integrates and double-integrates the control input (acceleration) for velocity and position updates, respectively. Thus, the state-space representation when using a point-mass kinematic model is

$$\dot{x}_1 = x_2, \quad \dot{x}_2 = -u_i \quad (3)$$

For a vehicle dynamic model, we adopt a simplified first-order system for the acceleration command dynamics from the current literature used for the Toyota Prius and Volvo S60 [29], [30], which is shown in the Laplace domain as

$$\frac{A_i(s)}{U_i(s)} = \frac{1}{\tau s + 1} e^{-\phi s} \quad (4)$$

where $s$ is the Laplace transform variable, $A_i(s)$ and $U_i(s)$ are the Laplace transforms of $a_i$ and $u_i$, respectively, $\tau$ is the time constant of the first-order system, and $\phi$ is the acceleration time delay. In the time domain, the first-order system can be interpreted as

$$\dot{a}_i = \frac{u_i(t - \phi) - a_i}{\tau} \quad (5)$$

where $u_i(t - \phi)$ denotes that $u_i$ is delayed by $\phi$ in time. Introducing another state variable $x_3 = a_i$, the state-space representation when using the dynamic model is

$$\dot{x}_1 = x_2, \quad \dot{x}_2 = -x_3, \quad \dot{x}_3 = \frac{u_i(t - \phi) - x_3}{\tau} \quad (6)$$

The control goal is to minimize both the error and the control effort, which is a common goal of classical control methods such as the Linear Quadratic Regulator (LQR) and Model Predictive Control (MPC). Here we define the absolute-value cost for the car-following control system as

$$J = \int_0^{t_f} \left( \alpha \frac{|e|}{e_{nmax}} + \beta \frac{|u_i|}{u_{max}} \right) dt \quad (7)$$

where $t_f$ is the terminal time, $|e|$ and $|u_i|$ denote the absolute values of the error and control input, respectively, $u_{max}$ is the allowed maximum of $|u_i|$, $e_{nmax}$ is the nominal maximum of $|e|$, and $\alpha$ and $\beta$ are coefficients that satisfy $\alpha > 0$, $\beta > 0$, and $\alpha + \beta = 1$.
The α and β values can be adjusted so as to decide the weighting of minimizing the error over the control action in the combined cost. The $e_{nmax}$ is a nominal maximum because the gap-keeping error can be very large, especially during DRL training, wherein the vehicle can have any acceleration behavior before it is well trained, see the next section. We choose a sufficiently large $e_{nmax}$ to represent a maximum gap-keeping error of a general car-following transient state.

As both dynamic programming and reinforcement learning are based on discrete time, the above continuous-time equations are discretized using a forward Euler integrator, as sketched below. Note that the absolute-value cost is different from the quadratic cost used in LQR and MPC. This is because, for DRL, absolute-value rewards lead to lower steady-state errors [31]. As we want to compare DRL solutions with DP ones, the DP cost function needs to be the same as the DRL one.
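To make the discretization concrete, the Python sketch below steps the dynamic model of Equation (6) forward with the forward Euler integrator (the kinematic model of Equation (3) is the special case with no delay buffer and no acceleration state) and accumulates the discretized absolute-value cost of Equation (7). All numerical values and the simple feedback law in the rollout are illustrative assumptions, not the parameter values or controller used in this paper.

```python
import numpy as np
from collections import deque

# Illustrative placeholder values; not the parameter values in Table I.
DT = 0.1        # discrete time step [s]
TAU = 0.5       # acceleration time constant [s]
PHI = 0.2       # acceleration delay [s]
E_NMAX = 10.0   # nominal maximum gap-keeping error [m]
U_MAX = 3.0     # maximum control input magnitude [m/s^2]
ALPHA, BETA = 0.8, 0.2   # cost weights with ALPHA + BETA = 1

def step_dynamic(x, u, pending):
    """Forward Euler step of Eq. (6); x = [e, e_dot, a_i], preceding vehicle
    at constant speed (a_{i-1} = 0), pending holds the delayed commands."""
    pending.append(u)                 # newest command enters the delay line
    u_delayed = pending.popleft()     # command issued about PHI seconds ago
    e, e_dot, a = x
    return np.array([e + DT * e_dot,
                     e_dot + DT * (-a),
                     a + DT * (u_delayed - a) / TAU])

def step_kinematic(x, u):
    """Forward Euler step of Eq. (3); x = [e, e_dot] and u is the acceleration."""
    e, e_dot = x
    return np.array([e + DT * e_dot, e_dot + DT * (-u)])

def stage_cost(e, u):
    """Discretized integrand of the absolute-value cost in Eq. (7)."""
    return (ALPHA * abs(e) / E_NMAX + BETA * abs(u) / U_MAX) * DT

# Example rollout of the dynamic model under a toy feedback law (not the DRL policy).
k = int(PHI / DT)                     # number of delayed, not-yet-executed commands
pending = deque([0.0] * k)            # control inputs waiting to take effect
x = np.array([5.0, 0.0, 0.0])         # 5 m initial error, zero error rate and accel
total_cost = 0.0
for _ in range(200):                  # 20 s horizon
    u = float(np.clip(0.5 * x[0] + 0.8 * x[1], -U_MAX, U_MAX))
    total_cost += stage_cost(x[0], u)
    x = step_dynamic(x, u, pending)
```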
III. DEEP REINFORCEMENT LEARNING ALGORITHM

In this section, we introduce the reinforcement learning framework and the specific DRL algorithm, Deep Deterministic Policy Gradient (DDPG), that we use to solve the above car-following control problem.

A. Reinforcement Learning

As stated in [1], reinforcement learning is learning what to do, i.e., how to map states to actions, so as to maximize a numerical cumulative reward. Reinforcement learning is formulated as a Markov Decision Process. At each time step $t$, $t = 0, 1, 2, ..., T$, a reinforcement learning agent receives the environment state $s_t$ and, on that basis, selects an action $a_t$. As a consequence of the action, the agent receives a numerical reward $r(s_t, a_t)$ and finds itself in a new state $s_{t+1}$. In reinforcement learning, there are probability distributions for transitioning from a state to an action and for the corresponding reward, which are not detailed here. The goal in reinforcement learning is to learn an optimal state-action mapping policy $\pi^\star$ that maximizes the expected cumulative discounted reward $R = \mathbb{E}[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)]$, with $\mathbb{E}$ denoting the expectation over the probabilities. The symbol $\star$ denotes optimality. The Q-value, i.e., the state-action value, for time step $t$ is defined as the expected cumulative discounted reward accumulated from time $t$, i.e., $Q(s_t, a_t) = \mathbb{E}[\sum_{k=t}^{T} \gamma^k r(s_k, a_k)]$. The reinforcement learning problem is solved using Bellman's principle of optimality: if the optimal state-action value for the next time step, $Q^\star(s_{t+1}, a_{t+1})$, is known, then the optimal state-action value for the current time step can be obtained by taking the action that maximizes $r(s_t, a_t) + \gamma Q^\star(s_{t+1}, a_{t+1})$.

The reinforcement learning framework for the car-following control system is based on the state-space equations described in the previous section. The action of the reinforcement learning framework is the control input of the car-following control system, $u_{i,t}$, for time step $t$. The reward function is the negative value of the discretized absolute-value cost defined in Equation (7) of the previous section:

$$r(s_t, a_t) = -\alpha \frac{|e_{t+1}|}{e_{nmax}} - \beta \frac{|u_{i,t}|}{u_{max}} \quad (8)$$

With this expression, the reward value range is $(-\infty, 0]$. We clip the reward to the range $[-1, 0]$ to avoid huge bumps in the gradient updates of the policy and Q-value neural networks of DDPG, since such bumps lead to training instability [32].

We consider four cases of the reinforcement learning framework, as this work compares dynamic versus kinematic models for autonomous vehicle control. For case 1, a kinematic model is used. Based on Equation (3), the gap-keeping error and error rate are sufficient to describe the dynamic system, so the environment state vector is $s_t = [e_t, \dot{e}_t]$ for time step $t$.

For case 2, only acceleration delay is considered, with no acceleration command dynamics. We consider this intermediate case for comparison purposes as well; in fact, for hybrid electric vehicles such as the Toyota Prius [29], the time constant in the acceleration dynamics equation is small. The state vector is $s_t = [e_t, \dot{e}_t, u_{i,t-k}, ..., u_{i,t-1}]$, with $k$ being the largest integer such that $k\Delta t \le \phi$, where $\Delta t = 0.1$ s is one time step.
This means that we feed into the DRL agent the past control inputs that have not yet been executed by the control system due to the time delay. We expect the DRL agent to use these delayed control inputs to infer the corresponding system responses that will occur in the future and to predict the next optimal control input $u_{i,t}$.

For case 3, only acceleration command dynamics is considered, with no acceleration delay ($\phi = 0$).
For this case, the time constant is $\tau = 0.5$ s. The state vector is $s_t = [e_t, \dot{e}_t, a_{i,t}]$, which includes the error, the error rate, and the actual acceleration of the following vehicle.

TABLE I. CAR-FOLLOWING CONTROL SYSTEM PARAMETER VALUES: discrete time step $\Delta t$, nominal maximum error $e_{nmax}$, maximum control input $u_{max}$, acceleration delay $\phi$, time constant $\tau$, preceding vehicle speed $v_{i-1}$, initial following vehicle speed $v_i(t=0)$, and initial gap-keeping error $e(t=0)$.

For case 4, both acceleration command dynamics and delay are considered. For this case, the time constant is also
$\tau = 0.5$ s for gas-engine vehicles. The state vector is $s_t = [e_t, \dot{e}_t, a_{i,t}, u_{i,t-k}, ..., u_{i,t-1}]$. Table I shows the parameter values for the car-following control system.
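As an illustration of how the four state vectors above can be assembled in an environment implementation, the sketch below keeps a buffer of the commands issued during the delay interval that have not yet been executed and concatenates them with the error, the error rate, and, for cases 3 and 4, the actual acceleration. The class and variable names are our own and do not reflect the authors' implementation.

```python
import numpy as np
from collections import deque

class CarFollowingState:
    """Builds the DRL environment state for the four cases described above."""

    def __init__(self, case, k_delay_steps):
        self.case = case
        # Commands issued during the delay interval that have not been executed yet.
        self.pending = deque([0.0] * k_delay_steps, maxlen=k_delay_steps)

    def push_command(self, u):
        """Record the latest control input u_{i,t} before it takes effect."""
        if self.case in (2, 4):
            self.pending.append(u)

    def observe(self, e, e_dot, a_actual):
        if self.case == 1:                       # kinematic model: [e, e_dot]
            return np.array([e, e_dot])
        if self.case == 2:                       # delay only: append pending inputs
            return np.concatenate(([e, e_dot], list(self.pending)))
        if self.case == 3:                       # command dynamics only: add a_i
            return np.array([e, e_dot, a_actual])
        # case 4: both delay and command dynamics
        return np.concatenate(([e, e_dot, a_actual], list(self.pending)))
```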
B. Deep Deterministic Policy Gradient

The DRL algorithm that we use is DDPG, exactly as proposed in [4]. Here we provide a brief description of the DDPG algorithm and encourage readers to consult the original paper. The DDPG algorithm utilizes two deep neural networks: an actor network and a critic network. The actor network represents the state-action mapping policy $\mu(s_t|\theta^\pi)$, where $\theta^\pi$ denotes the actor neural net weight parameters, and the critic network represents the Q-value function (cumulative discounted reward) $Q(s_t, a_t|\theta^Q)$, where $\theta^Q$ denotes the critic neural net weight parameters. DDPG concurrently learns the policy and the Q-value function. For learning the Q-value (Q-learning), Bellman's principle of optimality is followed to minimize the mean-squared Bellman error, i.e., the squared temporal-difference residual $L_t = [r(s_t, a_t) + \gamma Q(s_{t+1}, \mu(s_{t+1}|\theta^\pi)|\theta^Q) - Q(s_t, a_t|\theta^Q)]^2$, using gradient descent. For learning the policy, gradient ascent is performed with respect to only the policy parameters $\theta^\pi$ to maximize the Q-value $Q(s_t, \mu(s_t|\theta^\pi)|\theta^Q)$.
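For readers less familiar with these two updates, the PyTorch fragment below sketches one DDPG learning step on a sampled mini-batch: the critic is regressed toward the bootstrapped target computed with the target networks, and the actor is updated by ascending the critic's value of its own action. This is a generic DDPG step written in this paper's notation, not the authors' code; terminal-state masking is omitted because the episodes here have a fixed horizon.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99):
    """One DDPG gradient step on a mini-batch of (s, a, r, s') transitions.
    The actor maps states to actions; the critic takes (state, action) and
    returns the Q-value."""
    s, a, r, s_next = batch                     # tensors of shape [batch, ...]

    # Critic: minimize the squared Bellman residual against the target networks.
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * critic_target(s_next, a_next)   # TD target
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: gradient ascent on Q(s, mu(s)) w.r.t. the policy parameters only.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```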
TABLE II. DEEP DETERMINISTIC POLICY GRADIENT PARAMETER VALUES.
Target network update coefficient: 0.001
Reward discount factor: 0.99
Actor learning rate: 0.0001
Critic learning rate: 0.001
Experience replay memory size: 500,000
Mini-batch size: 64
Actor Gaussian noise mean: 0
Actor Gaussian noise standard deviation: 0.02
Target networks are adopted to stabilize training [2]. We use Gaussian noise for action exploration [18]. Mini-batch gradient descent is used [4]. Experience replay is used for stability [2]. Batch normalization is used to accelerate learning by reducing internal covariate shift [33]. Please see Table II for the DDPG algorithm parameter values.
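The target-network and exploration details above amount to two small operations, sketched below: a Polyak (soft) update of the target weights using the update coefficient from Table II, and zero-mean Gaussian noise added to the deterministic action during training. The clipping bound u_max is a placeholder, not a value reported in the paper.

```python
import torch

def soft_update(target_net, net, coeff=0.001):
    """Polyak-average the online weights into the target network (Table II coefficient)."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), net.parameters()):
            p_t.mul_(1.0 - coeff).add_(coeff * p)

def exploratory_action(actor, state, noise_std=0.02, u_max=3.0):
    """Deterministic policy output perturbed by Gaussian noise for exploration,
    clipped to an assumed control-input range."""
    with torch.no_grad():
        u = actor(state)
    u = u + noise_std * torch.randn_like(u)
    return torch.clamp(u, -u_max, u_max)
```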
Fig. 2. Undiscounted episode reward for training with the vehicle kinematic model.
Both the actor and critic networks are neural nets with two hidden layers for all cases. For training with vehicle kinematics (case 1) and just acceleration command dynamics (case 3), the neural nets have 64 neurons in each hidden layer and the training time is 1 million time steps. For training with control delays (cases 2 and 4), the neural nets have 128 neurons in each hidden layer and the training time is 1.5 million time steps; a sketch of an actor of this shape is given below. For all cases, the training converges. Fig. 2 shows the undiscounted episode reward for case 1. The plots of the undiscounted episode rewards for the other cases look similar and are not shown here. We use the undiscounted episode reward since it allows us to track changes in the latter part of the car-following errors easily.
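As a rough picture of the network sizes just described, an actor of this shape could look like the PyTorch sketch below, with 64 hidden units for cases 1 and 3 and 128 for the delay cases. The choice of ReLU activations and the tanh output scaled to u_max are our assumptions rather than details reported in the paper.

```python
import torch.nn as nn

def make_actor(state_dim, hidden=64, u_max=3.0):
    """Two-hidden-layer actor with batch normalization (hidden=128 for the delay
    cases); the output is scaled to an assumed control-input range [-u_max, u_max]."""
    class Actor(nn.Module):
        def __init__(self):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
                nn.Linear(hidden, 1), nn.Tanh(),
            )
            self.u_max = u_max

        def forward(self, s):
            # s has shape [batch, state_dim]; returns the control input u_i.
            return self.u_max * self.body(s)

    return Actor()
```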
Fig. 3. Training and testing results for the DRL controller trained using the point-mass kinematic model (K-DRL). Columns (a), (b), and (c) show the results of testing this DRL controller for car-following control with just acceleration delay (K-DRL-delay), just acceleration command dynamics (K-DRL-dynamics), and both acceleration delay and command dynamics (K-DRL-delay&dynamics), respectively; each column also shows the DP benchmark solution (K-DP). The variable $e$ is the gap-keeping error, $v_i$ is the following vehicle's velocity, and $u_i$ is the control input to the following vehicle. Note that the control input is equal to the actual acceleration for the kinematic model case. The preceding vehicle's constant speed is $v_{i-1}$ = 30 m/s, which is not shown here.

Note that, with the discount factor, the last reward at 20 seconds (200 time steps) of one episode is discounted by $0.99^{200} \approx 0.13$, so the discounted episode reward only weakly reflects the latter part of an episode. When the K-DRL controller is applied to car-following control with just acceleration delay, the performance is degraded, but the gap-keeping error $e$ is able to return to near-zero in the steady state, see column (a) of Fig. 3. When this DRL controller is applied to car-following control with just acceleration command dynamics, the car-following performance is degraded to a greater extent than in the delay case: the gap-keeping error $e$ takes longer to return to near-zero in the steady state, see column (b) of Fig. 3. When this DRL controller is applied to car-following control with both acceleration delay and command dynamics, the performance is the worst; both the transient and steady-state performances are significantly degraded. The steady-state error $e$ does not return to zero but forms a wavy oscillation pattern, with a maximum of 0.73 m and a minimum of -0.22 m.
Fig. 4. DRL controller results when trained with just acceleration delay (DE-DRL, column (a)), just acceleration command dynamics (DY-DRL, column (b)), and both acceleration delay and command dynamics (DE-DY-DRL, column (c)), together with the corresponding DP solutions (DE-DP, DY-DP, DE-DY-DP). The variable $e$ is the gap-keeping error, $v_i$ is the following vehicle's velocity, $u_i$ is the control input to the following vehicle, and $a_i$ is the actual acceleration of the following vehicle. The preceding vehicle's constant speed is $v_{i-1}$ = 30 m/s, which is not shown here.

Columns (a), (b), and (c) in Fig. 4 show the results for the redesigned DRL controllers trained with acceleration delay (case 2), acceleration command dynamics (case 3), and both acceleration delay and command dynamics (case 4), respectively. For all these cases, the DRL controllers achieve near-optimal solutions as compared with the DP ones. Note that the steady-state gap-keeping errors of the DP solutions in columns (b) and (c) are around 0.5 m. These would be reduced by using a smaller interval to create the evenly spaced samples of the states for DP, although the computation would then take much longer.

V. CONCLUSION

By solving a particular car-following control problem using DRL (deep reinforcement learning), we show that a DRL controller trained with a point-mass kinematic model cannot be generalized to more realistic control situations with both vehicle acceleration delay and command dynamics. We added the control inputs that are delayed and have not been executed, as well as the actual acceleration of the vehicle, to the reinforcement learning environment state for DRL controller development with vehicle dynamics. The training results show that this approach provides near-optimal solutions for car-following control with vehicle dynamics.

In this work, the DRL controllers are trained with a fixed initial condition for all cases. We later trained a DRL controller with varying initial conditions and observed similarly significant performance degradation when applying the kinematic-model-trained DRL controller to practical control with vehicle dynamics.

When the reinforcement learning environment state is augmented with the delayed control inputs, the DRL agent is expected to utilize the delayed control inputs to predict the future system behavior and determine the next optimal control action. Our results show that the DRL agent is capable of doing so after training, in a near-optimal manner. However, because the environment state is augmented with more variables, the neural network size needs to be increased and more training time is needed, which is a disadvantage. As stated in the introduction, an alternative method is to learn the underlying dynamic system separately and use the learned system to predict the system behavior after the delay time so as to determine the current control action [25]. However, this method may not be feasible for challenging autonomous driving control systems such as merging control, because such systems are subject to many variations and disturbances due to multi-vehicle interactions, and it may not be easy to develop or learn an accurate model for them.

Future work includes developing a more robust car-following DRL controller that can be trained with rich variations of the preceding vehicle's speed.
Another research direction is to develop DRL controllers with vehicle dynamics considered for more challenging autonomous driving scenarios such as freeway on-ramp merging.

ACKNOWLEDGMENT

The authors would like to thank Toyota, the Ontario Centres of Excellence, and the Natural Sciences and Engineering Research Council of Canada for their support of this work.

REFERENCES
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[4] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[5] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap, "Distributed distributional deterministic policy gradients," arXiv preprint arXiv:1804.08617, 2018.
[6] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," arXiv preprint arXiv:1801.01290, 2018.
[7] C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan, D. Duggins, T. Galatali, C. Geyer, et al., "Autonomous driving in urban environments: Boss and the Urban Challenge," Journal of Field Robotics, vol. 25, no. 8, pp. 425–466, 2008.
[8] A. Eskandarian, Handbook of Intelligent Vehicles. Springer London, 2012.
[9] P. Wang and C.-Y. Chan, "Autonomous ramp merge maneuver based on reinforcement learning with continuous action space," arXiv preprint arXiv:1803.09203, 2018.
[10] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura, "Navigating occluded intersections with autonomous vehicles using deep reinforcement learning," in . IEEE, 2018, pp. 2034–2039.
[11] Z. Qiao, K. Muelling, J. M. Dolan, P. Palanisamy, and P. Mudalige, "Automatically generated curriculum based reinforcement learning for autonomous vehicles in urban environment," in . IEEE, 2018, pp. 1233–1238.
[12] Z. Qiao, K. Muelling, J. Dolan, P. Palanisamy, and P. Mudalige, "POMDP and hierarchical options MDP with continuous actions for autonomous driving at intersections," in . IEEE, 2018, pp. 2377–2382.
[13] C. Li and K. Czarnecki, "Urban driving with multi-objective deep reinforcement learning," arXiv preprint arXiv:1811.08586, 2018.
[14] P. Wang, C.-Y. Chan, and A. de La Fortelle, "A reinforcement learning based approach for automated lane change maneuvers," in . IEEE, 2018, pp. 1379–1384.
[15] P. Wolf, K. Kurzer, T. Wingert, F. Kuhnt, and J. M. Zollner, "Adaptive behavior generation for autonomous driving using deep reinforcement learning with compact semantic states," in . IEEE, 2018, pp. 993–1000.
[16] S. Aradi, T. Becsi, and P. Gaspar, "Policy gradient based reinforcement learning approach for autonomous highway driving," in . IEEE, 2018, pp. 670–675.
[17] R. N. Jazar, Vehicle Dynamics: Theory and Application. Springer, 2017.
[18] M. Bucchel and A. Knoll, "Deep reinforcement learning for predictive longitudinal control of automated vehicles," in . IEEE, 2018, pp. 2391–2397.
[19] S. Wei, Y. Zou, T. Zhang, X. Zhang, and W. Wang, "Design and experimental validation of a cooperative adaptive cruise control system based on supervised reinforcement learning," Applied Sciences, vol. 8, no. 7, p. 1014, 2018.
[20] C. Desjardins and B. Chaib-Draa, "Cooperative adaptive cruise control: A reinforcement learning approach," IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 4, pp. 1248–1260, 2011.
[21] D. Zhao, B. Wang, and D. Liu, "A supervised actor–critic approach for adaptive cruise control," Soft Computing, vol. 17, no. 11, pp. 2089–2099, 2013.
[22] M. Zhu, X. Wang, and Y. Wang, "Human-like autonomous car-following model with deep reinforcement learning," Transportation Research Part C: Emerging Technologies, vol. 97, pp. 348–368, 2018.
[23] E. Schuitema, M. Wisse, T. Ramakers, and P. Jonker, "The design of Leo: a 2D bipedal walking robot for online autonomous reinforcement learning," in . IEEE, 2010, pp. 3238–3243.
[24] T. Hester and P. Stone, "TEXPLORE: real-time sample-efficient reinforcement learning for robots," Machine Learning, vol. 90, no. 3, pp. 385–429, 2013.
[25] T. J. Walsh, A. Nouri, L. Li, and M. L. Littman, "Learning and planning in environments with delayed feedback," Autonomous Agents and Multi-Agent Systems, vol. 18, no. 1, p. 83, 2009.
[26] E. Schuitema, L. Buşoniu, R. Babuška, and P. Jonker, "Control delay in reinforcement learning for real-time dynamic systems: a memoryless approach," in . IEEE, 2010, pp. 3226–3231.
[27] K. V. Katsikopoulos and S. E. Engelbrecht, "Markov decision processes with delays and asynchronous cost collection," IEEE Transactions on Automatic Control, vol. 48, no. 4, pp. 568–574, 2003.
[28] D. S. Naidu, Optimal Control Systems. CRC Press, 2002.
[29] J. Ploeg, B. T. Scheepers, E. Van Nunen, N. Van de Wouw, and H. Nijmeijer, "Design and experimental evaluation of cooperative adaptive cruise control," in . IEEE, 2011, pp. 260–265.
[30] K. Lidstrom, K. Sjoberg, U. Holmberg, J. Andersson, F. Bergh, M. Bjade, and S. Mak, "A modular CACC system integration and design," IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 3, pp. 1050–1061, 2012.
[31] J.-M. Engel and R. Babuška, "On-line reinforcement learning for nonlinear motion control: Quadratic and non-quadratic reward functions," IFAC Proceedings Volumes, vol. 47, no. 3, pp. 7043–7048, 2014.
[32] H. P. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver, "Learning values across many orders of magnitude," in Advances in Neural Information Processing Systems, 2016, pp. 4287–4295.
[33] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[34] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv preprint arXiv:1606.01540.