Autonomous Braking System via Deep Reinforcement Learning
Hyunmin Chae, Chang Mook Kang, ByeoungDo Kim, Jaekyum Kim, Chung Choo Chung, Jun Won Choi
Hanyang University, Seoul, Korea
Email: [email protected], [email protected], {bdkim, jkkim}@spo.hanyang.ac.kr, {cchung, junwchoi}@hanyang.ac.kr

Abstract—In this paper, we propose a new autonomous braking system based on deep reinforcement learning. The proposed autonomous braking system automatically decides whether to apply the brake at each time step when confronting the risk of collision, using the information on the obstacle obtained by the sensors. The problem of designing the brake control is formulated as a search for the optimal policy in a Markov decision process (MDP) model, where the state is given by the relative position of the obstacle and the vehicle's speed, and the action space is defined as the set of brake actions including 1) no braking, 2) weak, 3) mid, and 4) strong braking actions. The policy used for brake control is learned through computer simulations using the deep reinforcement learning method called deep Q-network (DQN). In order to derive a desirable braking policy, we propose a reward function which balances the damage imposed on the obstacle in case of an accident against the reward achieved when the vehicle gets out of risk as soon as possible. The DQN is trained for the scenario where a vehicle encounters a pedestrian crossing an urban road. Experiments show that the control agent exhibits desirable control behavior and avoids collision without any mistake in various uncertain environments.
I. INTRODUCTION

Safety is one of the top priorities that should be pursued in realizing fully autonomous driving vehicles. For safe autonomous driving, autonomous vehicles should perceive the environment using their sensors and control the vehicle so that it travels to the destination without any accidents. Since it is inevitable for an autonomous vehicle to encounter unexpected and risky situations, it is critical to develop reliable autonomous control systems that can cope well with such uncertainties. Recently, several safety systems including collision avoidance, pedestrian detection, and front collision warning (FCW) have been proposed to enhance the safety of the autonomous vehicle [1]-[3].

One critical component for enabling safe autonomous driving is the autonomous braking system, which reduces the velocity of the vehicle automatically when a threatening obstacle is detected. Autonomous braking should offer safe and comfortable brake control without braking too early or too late. Most conventional autonomous braking systems are rule-based, designating a specific brake control protocol for each different situation. Unfortunately, this approach is limited in handling all scenarios that can happen on real roads. Hence, an intelligent braking system should be developed that avoids accidents in a principled and goal-oriented manner.

Recently, interest in machine learning has grown explosively with the rise of parallel computing technology and large amounts of training data. In particular, the success of the deep neural network (DNN) technique led researchers to investigate the application of machine learning to autonomous driving. The DNN has been applied to autonomous driving, from camera-based perception [4]-[6] to end-to-end approaches which learn a mapping from sensing to control [7], [8]. Reinforcement learning (RL) techniques have also improved significantly as the DNN was adopted. The resulting technique, called deep reinforcement learning (DRL), has been shown to perform reasonably well for various challenging robotics and control problems. In [9], the DRL technique called deep Q-network (DQN) was proposed, which approximates the Q-value function using a DNN. It was shown that the DQN can outperform human experts in various Atari video games. Recently, DRL has been applied to control systems for autonomous driving vehicles in [10], [11].
Fig. 1. The proposed DRL-based autonomous braking system.
In this paper, we propose a new autonomous braking system based on DRL, which can intelligently control the velocity of the vehicle in situations where a collision is expected if no action is taken. The proposed autonomous braking system is described in Fig. 1. The agent (vehicle) interacts with an uncertain environment where the position of the obstacle changes over time and thus the risk of collision at each time step varies as well. The agent receives the information on the obstacle's position from the sensors and adapts the brake control to the state change such that the chance of an accident is minimized. In our work, we design the autonomous braking system for an urban road scenario where a vehicle faces a pedestrian who crosses the street at a random time. In order to find the desirable brake action for the given pedestrian's location and vehicle's speed, we need to allocate an appropriate reward for each state-action pair. We focus on finding a reward function which strikes a balance between the penalty imposed on the agent when an accident happens and the reward obtained when the vehicle quickly gets out of risk. Using the reward function we carefully designed, we train a DQN to learn the policy that decides the timing of the brake based on the given pedestrian's state. We also provide a new DQN design which can rapidly learn the policy to avoid rare accidents.

Via computer simulations, we evaluate the performance of the proposed autonomous braking system. In the simulations, we consider the uncertainty of the vehicle's initial velocity, the pedestrian's initial position, and whether the pedestrian will cross or not. The experimental results show that the proposed braking system exhibits desirable control behavior for various test scenarios including the autonomous emergency braking (AEB) test administered by Euro NCAP.

The rest of this paper is organized as follows. In Section II, we describe the basic scenarios and the framework of the proposed system. In Section III, we provide the details of the DQN design for autonomous braking. The experimental results are provided in Section IV and the paper is concluded in Section V.

II. SYSTEM DESCRIPTION
In this section, we describe the overall structure of the autonomous braking system. We first define the possible scenarios for autonomous braking and then explain the detailed operation of the proposed system.
Fig. 2. Though a pedestrian is detected in both cases (a) and (b), the proper control actions differ between the two cases. In case (a), the vehicle should stop in front of the pedestrian, while in case (b), the vehicle should stand by without applying the brake yet.
A. Scenarios
One of the factors that hinders safe autonomous driving is the threat from nearby objects, e.g., pedestrians. Many accidents happen when the vehicle fails to stop in time as a pedestrian crosses the road. Hence, in order to avoid accidents, the vehicle should detect in advance the threats that can potentially cause accidents and perform appropriate brake actions to stop the vehicle in front of the obstacle. However, there exist various sources of uncertainty which make the design of autonomous braking challenging, such as
• the vehicle's initial velocity,
• the pedestrian's position,
• the pedestrian's speed,
• the pedestrian's crossing timing,
• the pedestrian's moving direction,
• the sensor's measurement error, and
• the road condition.
Even if a pedestrian is detected accurately, it is hard to know when the pedestrian will become a threat to the vehicle. Hence, we need an appropriate braking strategy for different situations (see Fig. 2). That is, for the given state of the pedestrian (i.e., position and velocity), the autonomous braking system should decide what brake action to apply.
Fig. 3. Behavior of a pedestrian modeled by the discrete-state Markov process.
In our system, we consider the scenario where the behavior of the pedestrian follows the discrete-state Markov process described in Fig. 3. The state S_nobody means that the sensors have not detected any obstacle. Once a pedestrian is detected, the state S_nobody can change to the state S_stay or the state S_cross, where S_stay is the state in which the pedestrian stays on the sidewalk and S_cross is the state in which the pedestrian crosses the road. The pedestrian's initial position can be either on the far side or the near side of the vehicle, and the pedestrian walking speed can vary between v_ped,min m/s and v_ped,max m/s. Note that the vehicle's initial velocity is distributed between v_veh,min m/s and v_veh,max m/s. In practical scenarios, it is difficult to know the transition probabilities of the Markov process and the distribution of the pedestrian's states. Therefore, a reinforcement learning approach can be applied to learn the brake control policy through interaction with the environment.
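As an illustration, the following minimal Python sketch shows how the discrete-state pedestrian model of Fig. 3 could be simulated. The transition probabilities p_detect and p_cross are illustrative placeholders, not values used in the paper.

```python
import random

# Minimal sketch of the pedestrian behaviour model in Fig. 3.
# The transition probabilities below are illustrative placeholders;
# the paper does not publish concrete values.
STATES = ("nobody", "stay", "cross")

def step_pedestrian(state, p_detect=0.1, p_cross=0.5):
    """Advance the discrete-state Markov process by one time step."""
    if state == "nobody":
        # A pedestrian appears with probability p_detect ...
        if random.random() < p_detect:
            # ... and either stays on the sidewalk or starts crossing.
            return "cross" if random.random() < p_cross else "stay"
        return "nobody"
    if state == "stay":
        # A waiting pedestrian may start crossing at a random time step.
        return "cross" if random.random() < p_cross else "stay"
    return "cross"  # once crossing, the pedestrian keeps crossing
```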
B. Autonomous Braking System

The detailed operation of the proposed autonomous braking system is depicted in Fig. 4. The vehicle moves at speed v_veh from the position (vehpos_x, vehpos_y). As soon as a pedestrian is detected, the autonomous braking system receives the relative position of the pedestrian, i.e., (pedpos_x − vehpos_x, pedpos_y − vehpos_y), from the sensor measurements, where (pedpos_x, pedpos_y) is the location of the pedestrian. Using the vehicle's velocity v_veh and the relative position (pedpos_x − vehpos_x, pedpos_y − vehpos_y), the vehicle decides whether to apply the brake at each time step. The interval between consecutive time steps is given by ∆T. We consider four brake actions: no braking a_nothing, and braking actions a_high, a_mid, and a_low with different intensities. More brake actions with finer steps, or a continuous brake action, could be included but are not considered in this work.
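For illustration, the discrete action set could be encoded as follows. The numerical deceleration values below are placeholders for the sketch only and do not correspond to the values used in our experiments.

```python
from enum import IntEnum

class BrakeAction(IntEnum):
    NOTHING = 0   # keep the current speed
    LOW = 1       # weak braking
    MID = 2       # medium braking
    HIGH = 3      # strong braking

# Placeholder decelerations (illustrative only).
DECELERATION = {
    BrakeAction.NOTHING: 0.0,
    BrakeAction.LOW: -3.0,
    BrakeAction.MID: -6.0,
    BrakeAction.HIGH: -9.0,
}

def apply_action(v_veh, action, dt):
    """Update the vehicle speed for one decision interval of length dt."""
    return max(0.0, v_veh + DECELERATION[action] * dt)
```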
III. DEEP REINFORCEMENT LEARNING FOR AUTONOMOUS BRAKING SYSTEM

In this section, we present the details of the proposed DRL-based autonomous braking system. We first introduce the structure of the DQN and then explain the reward function used to train the DQN in detail.
A. Structure of DRL
Our system follows the basic RL structure. The agent performs an action A_t in state S_t under a policy π. The agent receives the next state as feedback from the environment and gets the reward r_t for the action taken. The state feedback that the agent obtains from the sensors consists of the velocity of the vehicle v_veh and the relative position to the pedestrian, (pedpos_x − vehpos_x, pedpos_y − vehpos_y), for the past n time steps. The actions the agent can choose from are the decelerations a_high, a_mid, a_low and keeping the current speed a_nothing. The goal of the proposed autonomous braking system is to maximize the expected accumulated reward, called the "value function", that will be received in the future within an episode. Using the simulations, the agent learns from interaction with the environment episode by episode. One episode starts when a pedestrian is detected. Note that the initial position of the pedestrian and the initial velocity of the vehicle are random. The vehicle drives on a straight road based on the brake policy π. If the distance between the vehicle and the pedestrian is less than the safety distance l, it is considered a collision event (see Fig. 4).

Fig. 4. Illustration of the autonomous braking operation.

The episode ends if at least one of the following events occurs:
• Stop: the vehicle completely stops, i.e., v_veh = 0.
• Bump: the vehicle passes the safety line l while the pedestrian is crossing the road.
• Pass: the vehicle passes the pedestrian without accident.
• Cross: the pedestrian completely crosses the road and reaches the opposite side.
Once an episode ends, the next episode starts with the state of the environment and the value function reset.
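Under our assumptions, the state vector (velocity and relative position over the past n time steps, i.e., 3n = 15 features for n = 5, consistent with the input layer size reported in Section IV) and the episode-termination check could be implemented as sketched below; all function and variable names are illustrative.

```python
from collections import deque

def make_state_buffer(n=5):
    """Keep (v_veh, dx, dy) samples for the past n time steps."""
    return deque(maxlen=n)

def build_state(buffer):
    """Flatten the buffered (velocity, relative position) samples into one vector."""
    state = []
    for v_veh, dx, dy in buffer:
        state.extend([v_veh, dx, dy])
    return state

def episode_done(v_veh, dist_to_ped, crossing, passed, crossed, l_safe=3.0):
    """Return the terminal event of the episode, if any, as listed above."""
    if v_veh == 0.0:
        return "stop"    # vehicle completely stops
    if crossing and dist_to_ped < l_safe:
        return "bump"    # vehicle crosses the safety line while the pedestrian crosses
    if passed:
        return "pass"    # vehicle passes the pedestrian without accident
    if crossed:
        return "cross"   # pedestrian reaches the opposite side
    return None
```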
B. Deep Q-Network
Q-learning is one of the popular RL methods which searches for the optimal policy in an iterative fashion [12]. The Q-value function q_π(s, a) is defined as

q_π(s, a) = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | S_t = s, A_t = a ]    (1)

for the given state s and action a, where r_t is the reward received at time step t. The Q-value function is the expected sum of the future rewards, which indicates how good the action a is in the state s under the policy π of the agent. The contribution of far-off future rewards to the Q-value function decays exponentially with the discount factor γ. For the given Q-value function, the greedy policy is obtained as

π(s) = argmax_a q_π(s, a).    (2)

One can show that for the policy in (2), the following Bellman equation should hold [12]:

q_π(s, a) = E[ r_{t+1} + γ max_{a'} q_π(S_{t+1}, a') | S_t = s, A_t = a ].    (3)

In practice, since it is hard to obtain the exact value of q_π(s, a) satisfying the Bellman equation, the Q-learning method uses the following update rule for a given one-step backup (S_t, A_t, r_{t+1}, S_{t+1}):

q_π(S_t, A_t) ← q_π(S_t, A_t) + α ( r_{t+1} + γ max_a q_π(S_{t+1}, a) − q_π(S_t, A_t) ).    (4)

However, when the state space is continuous, it is impossible to find the optimal value q*(s, a) of the state-action pair for all possible states. To deal with this problem, the DQN method was proposed, which approximates the state-action value function q(s, a) using a DNN, i.e., q(s, a) ≈ q_θ(s, a), where θ is the parameter of the DNN [9]. The parameter θ of the DNN is then optimized to minimize the squared value of the temporal-difference error δ_t:

δ_t = r_{t+1} + γ max_{a'} q_θ(S_{t+1}, a') − q_θ(S_t, A_t).    (5)

For better convergence of the DQN, instead of estimating both q(S_t, A_t) and q(S_{t+1}, a') in (5) with the same network, we approximate q(S_t, A_t) and q(S_{t+1}, a') using the Q-network and the target network, parameterized by θ and θ⁻, respectively [9]. The target network parameter θ⁻ is updated by periodically cloning the Q-network parameter θ. Thus, (5) becomes

δ_t = r_{t+1} + γ max_{a'} q_{θ⁻}(S_{t+1}, a') − q_θ(S_t, A_t).    (6)

To speed up convergence further, a replay memory is adopted to store one-step backups, from which a batch of backups is chosen randomly [9]. The backups in the batch are used to calculate the loss function L, which is given by

L = Σ_{t ∈ B_replay} δ_t²,    (7)

where B_replay is the set of backups in the batch selected from the replay memory. The optimization of the parameter θ to minimize the loss L is done by the stochastic gradient descent method.
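For concreteness, the sketch below shows one way the temporal-difference error in (6) and the replay loss in (7) could be computed. Here q_net and target_net are assumed to be callables that map a state to a vector of Q-values (one per action), the discount factor value is illustrative, and terminal transitions are handled in the standard DQN way.

```python
import numpy as np

def td_errors(batch, q_net, target_net, gamma=0.99):
    """Compute the TD errors in (6) for a batch of one-step backups.

    Each backup is (s_t, a_t, r_next, s_next, done); q_net and target_net
    are assumed to return one Q-value per action; gamma is illustrative.
    """
    errors = []
    for s, a, r, s_next, done in batch:
        # Terminal transitions use the reward alone (standard DQN practice).
        target = r if done else r + gamma * np.max(target_net(s_next))
        errors.append(target - q_net(s)[a])
    return np.array(errors)

def replay_loss(batch, q_net, target_net):
    """Squared-TD-error loss in (7), summed over the replay minibatch."""
    return np.sum(td_errors(batch, q_net, target_net) ** 2)
```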
C. Reward Function

Unlike in video games, the reward in an autonomous braking system should be appropriately defined by the system designer. As mentioned, the reward function determines the behavior of the brake control. Hence, in order to ensure the reliability of the brake control, it is crucial to use a properly defined reward function. In our model, there is a conflict between two intuitive objectives for brake control: 1) collision should be avoided no matter what happens, and 2) the vehicle should get out of the risky situation quickly. If these objectives are unbalanced, the agent becomes either too conservative or reckless. Therefore, we should use a reward function which balances the two conflicting objectives. Taking this into consideration, we propose the following reward function:

r_t = −( α (pedpos_x − vehpos_x)² + β ) decel − ( η v_t + λ ) 1(S_t = bump),   α, β, η, λ > 0,    (8)

where v_t is the velocity of the vehicle at time step t, decel is the difference between v_{t−1} and v_t (i.e., the speed reduction in one time step), and 1(x = y) has a value of 1 if the statement inside is true and 0 otherwise. The first term −(α(pedpos_x − vehpos_x)² + β) decel in the reward function prevents the agent from braking too early by giving a penalty proportional to the squared distance between the vehicle and the pedestrian. It guides the vehicle to drive without deceleration if the pedestrian is far from the vehicle. On the other hand, the term −(η v_t + λ) 1(S_t = bump) is the penalty that the agent receives when an accident occurs. Note that this penalty is a function of the vehicle's velocity, which reflects the severe damage to the pedestrian in case of a high velocity at collision. Without such dependency on the velocity, the agent would not reduce the speed in situations where the accident is not avoidable. The constants α, β, η and λ are the weight parameters that control the trade-off between the two objectives.
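A minimal sketch of the reward computation in (8) is given below. The weights alpha, beta and eta are placeholders (their exact values are not restated here), while lam = 100 follows the collision penalty weight reported in Section IV.

```python
def reward(ped_x, veh_x, v_t, v_prev, bumped,
           alpha=0.1, beta=0.1, eta=0.1, lam=100.0):
    """Sketch of the reward in (8); alpha, beta, eta are placeholder weights."""
    decel = v_prev - v_t                                  # speed reduction in this step
    brake_penalty = (alpha * (ped_x - veh_x) ** 2 + beta) * decel
    bump_penalty = (eta * v_t + lam) if bumped else 0.0   # collision penalty grows with speed
    return -brake_penalty - bump_penalty
```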
D. Trauma Memory

As mentioned in the previous section, the autonomous braking system should learn to satisfy both of the conflicting objectives. However, when we train the DQN with the reward function in (8), we find that the learning performance is not stable, since collision events rarely happen and thus only a few one-step backups associated with collisions remain in the replay memory. As a result, the probability of picking such one-step backups is small and the DQN does not have enough chance to learn to avoid accidents in the practical learning stage. To solve this issue, we propose the so-called "trauma" memory, which is used to store only the one-step backups for the rare events (e.g., collision events in our scenario). While the one-step backups are randomly picked from the replay memory, some fixed number of backups associated with the collision events are randomly selected from the trauma memory and used for training together. In other words, with the trauma memory, the loss function L is modified to

L = Σ_{t ∈ B_replay} δ_t² + Σ_{t ∈ B_trauma} δ_t²,    (9)

where B_trauma is the set of backups randomly picked from the trauma memory. The trauma memory persistently reminds the agent of the accidents regardless of the current policy, thus allowing the agent to learn to maintain speed and avoid collisions reliably.
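The minibatch construction used for (9) could look as follows. The batch sizes match those reported in Section IV, while the memory containers themselves are assumed to be plain Python lists of one-step backups.

```python
import random

def sample_minibatch(replay_memory, trauma_memory,
                     replay_batch=32, trauma_batch=10):
    """Draw one training minibatch as described for (9).

    Ordinary one-step backups are sampled uniformly from the replay memory,
    while a fixed number of collision backups are always drawn from the
    trauma memory, so rare accidents stay represented in every update.
    """
    batch = random.sample(replay_memory, min(replay_batch, len(replay_memory)))
    if trauma_memory:
        batch += random.sample(trauma_memory, min(trauma_batch, len(trauma_memory)))
    return batch
```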
IV. EXPERIMENTS

In this section, we evaluate the performance of the proposed autonomous braking system via computer simulations.
A. Simulation Setup
In the simulations, we used the commercial software PreScan, which models vehicle dynamics in real time [15]. We generated the environment used to train the DQN by simulating the random behavior of the pedestrian. In the simulations, we assume that the relative location of the pedestrian is provided to the agent. To make the system practical, we add slight measurement noise to it. In each episode, the initial position of the vehicle is set to (0, 0). The time-to-collision TTC is chosen according to a uniform distribution. The initial velocity of the vehicle is uniformly distributed between v_init,min = 2.78 m/s (10 km/h) and v_init,max = 16.67 m/s (60 km/h). At the beginning of each episode, the position of the pedestrian is fixed to 5 · v_init meters away from the position of the vehicle. The pedestrian stands either at the far side or at the near side of the vehicle with equal probability. The behavior of the pedestrian follows one of the two scenarios below:
• Scenario 1: cross the road.
• Scenario 2: stay at the initial position.
During training, either of the two scenarios is selected with equal probability. In Scenario 1, the pedestrian starts to move when the vehicle is crossing the "pedestrian crossing point" p_trig = (5 − TTC) · v_init (see Fig. 4). The safety distance l for the pedestrian is set to 3 m. The agent chooses the brake control among a_high, a_mid, a_low and a_nothing = 0 every ∆T seconds. The detailed simulation setup is summarized below.
• Initial velocity of the vehicle: v_init ∼ U(2.78, 16.67) m/s
• Velocity of the pedestrian: v_ped ∼ U(2, v_ped,max) m/s
• Time-to-collision: TTC ∼ U(1. , ) s
• Initial pedestrian position: pedpos_x = 5 · v_init m
• Trigger point: p_trig = (5 − TTC) · v_init m
• Safety line: l = 3 m
• Decision interval: ∆T = 0. s
• Brake actions: a_high, a_mid, a_low, a_nothing = 0 m/s
B. Training of DQN

The neural network used for the DQN consists of fully-connected layers with five hidden layers. The RMSProp algorithm [14] is used to minimize the loss with learning rate µ = 0.0005. The number of position data samples used as a state is set to n = 5. We set the size of the replay memory to 10,000 and that of the trauma memory to 1,000. We set the replay batch size to 32 and the trauma batch size to 10. The summary of the DQN configuration used for our experiments is provided below.
• State buffer size: n = 5
• Network architecture: fully-connected feedforward network
• Nonlinear function: leaky ReLU [13]
• Number of nodes per layer: [15 (input layer), 100, 70, 50, 70, 100, 4 (output layer)]
• Optimizer: RMSProp with learning rate 0.0005 [14]
• Replay memory size: 10,000
• Replay batch size: 32
• Trauma memory size: 1,000
• Trauma batch size: 10
• Reward function: α = 0. , β = 0. , η = 0. , λ = 100

Fig. 5 provides the plot of the total accumulated reward, i.e., the value function, achieved for each episode when training is conducted with and without the trauma memory. We observe that with the trauma memory the value function converges after 2,000 episodes and a high total reward is steadily attained after convergence, while without the trauma memory the policy does not converge and keeps fluctuating.
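A Q-network with the layer configuration listed above could be built as sketched below. PyTorch is used here only for illustration, as the paper does not state which deep learning framework was employed.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully-connected Q-network with the reported layer sizes and leaky ReLU."""
    def __init__(self, state_dim=15, n_actions=4):
        super().__init__()
        sizes = [state_dim, 100, 70, 50, 70, 100]
        layers = []
        for d_in, d_out in zip(sizes[:-1], sizes[1:]):
            layers += [nn.Linear(d_in, d_out), nn.LeakyReLU()]
        layers.append(nn.Linear(sizes[-1], n_actions))  # one Q-value per brake action
        self.net = nn.Sequential(*layers)

    def forward(self, state):
        return self.net(state)

# RMSProp with the reported learning rate of 0.0005.
q_net = QNetwork()
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=5e-4)
```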
C. Test Results
Safety tests were conducted for several different TTC values. The collision rate is measured over 10,000 trials for each TTC value. Table I provides the collision rate for each TTC value for the test performed for Scenario 1. The agent avoids collisions successfully for TTC values above 1.5 s. For TTC values less than 1.5 s, we observe some collisions. According to our analysis of the trajectory of braking actions, these are the cases where the collision was not avoidable due to the high initial velocity of the vehicle, even though full braking actions were applied. In Scenario 2, the agent passed the pedestrian without any unnecessary stop in all cases. The detailed trajectory of the brake actions for one example case is shown in Fig. 6. Fig. 6(a) shows the trajectories of the positions of the vehicle and the pedestrian over time. The velocity of the agent and the brake actions applied are shown in Fig. 6(b) and (c), respectively. The vehicle starts to decelerate about 20 m away from the pedestrian and completely stops about 5 m ahead, thereby accomplishing collision avoidance. We observe that weak braking actions are applied in the beginning of the deceleration, and stronger braking actions follow as the agent gets close to the pedestrian.

Fig. 5. Value function (total accumulated reward) achieved per episode during training, with and without the trauma memory.
Fig. 6. Trajectories in one example episode with TTC = 1. s: (a) longitudinal position of the vehicle and the pedestrian (m), (b) velocity of the vehicle (m/s), and (c) deceleration actions, each plotted against time (s).

Fig. 7 shows how the initial position of the pedestrian and the relative distance between the pedestrian and the vehicle are distributed over 1,000 trials of Scenario 1. We see that the vehicle stops a few meters in front of the pedestrian in most cases. This appears to be a reasonably safe braking operation considering the safety distance of l = 3 m. Note that this stopping distance can be adjusted by changing the reward parameters. Overall, the experimental results show that the proposed autonomous braking system exhibits consistent brake control performance for all cases considered.

TABLE I
COLLISION RATE IN TEST SCENARIOS

TTC (s)            0.9    1.1    1.3   1.5  1.7  1.9  2.1  2.3  2.5  2.7  2.9  3.1  3.3  3.5  3.7  3.9
Collision rate (%) 61.29  18.85  0.74  0    0    0    0    0    0    0    0    0    0    0    0    0
Fig. 7. Initial position of the pedestrian and relative distance between the pedestrian and the vehicle after the episode ends.
D. Test Results for Euro NCAP AEB Pedestrian Test
Additional autonomous emergency braking (AEB) pedestrian tests are conducted. We follow the test procedure specified by the Euro NCAP test protocol [16], [17]. Tests are conducted for both the far-side (CVFA test) and the near-side (CVNA test) scenarios over the velocity range from 20 to 60 km/h in 5 km/h steps. TTC is set to 4 s, and the pedestrian crosses the road at 8 km/h for CVFA and at 5 km/h for CVNA. Tests are scored according to the rating parameters and the metric suggested in [17]. The proposed system passed all tests without collision, and the rating scores acquired by the proposed method are shown in Table II.

TABLE II
AEB TEST RESULTS

v_init (km/h)   20   25   30   35   40   45   50   55   60
CVFA score
CVNA score
V. CONCLUSIONS

We have presented a new autonomous braking system based on deep reinforcement learning. The proposed system learns an intelligent brake control policy from the experience obtained in a simulated environment. We designed the autonomous braking system using the DQN method with a carefully designed reward function, and enhanced the stability of the learning process by modifying the structure of the DQN. We showed through computer simulations that the proposed autonomous braking system exhibits desirable and consistent brake control behavior in various scenarios where the behavior of the pedestrian is uncertain.
REFERENCES
[1] A. Vahidi and A. Eskandarian, "Research advances in intelligent collision avoidance and adaptive cruise control," IEEE Trans. Intelligent Transportation Systems, vol. 4, no. 3, pp. 143-153, Sept. 2003.
[2] K. Lee and H. Peng, "Evaluation of automotive forward collision warning and collision avoidance algorithms," Vehicle System Dynamics, vol. 43, no. 10, pp. 735-751, Feb. 2007.
[3] T. Gandhi and M. M. Trivedi, "Pedestrian collision avoidance systems: a survey of computer vision based recent studies," IEEE Intelligent Transportation Systems Conference, pp. 976-981, Sept. 2006.
[4] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R. Cheng-Yue, F. Mujica, A. Coates, and A. Y. Ng, "An empirical evaluation of deep learning on highway driving," arXiv preprint, arXiv:1504.01716, Apr. 2015.
[5] D. Tome, F. Monti, L. Baroffio, L. Bondi, M. Tagliasacchi, and S. Tubaro, "Deep convolutional neural networks for pedestrian detection," Signal Processing: Image Communication, vol. 47, pp. 482-489, May 2016.
[6] R. S. Tomar and S. Verma, "Neural network based lane change trajectory prediction in autonomous vehicles," Transactions on Computational Science XIII, Springer Berlin Heidelberg, pp. 125-146, 2011.
[7] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, "End to end learning for self-driving cars," arXiv preprint, arXiv:1604.07316, Apr. 2016.
[8] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "DeepDriving: learning affordance for direct perception in autonomous driving," Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2722-2730, 2015.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[10] J. Koutnik, J. Schmidhuber, and F. Gomez, "Evolving deep unsupervised convolutional networks for vision-based reinforcement learning," Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, ACM, pp. 541-548, 2014.
[11] C. Desjardins and B. Chaib-draa, "Cooperative adaptive cruise control: a reinforcement learning approach," IEEE Trans. Intelligent Transportation Systems, vol. 12, no. 4, pp. 1248-1260, Dec. 2011.
[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[13] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 807-814, 2010.
[14] T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, 4.2, 2012.
[15] PreScan.
[16] "Pedestrian testing protocol," European New Car Assessment Programme, Dec. 2016.
[17] P. Seiniger, A. Hellmann, O. Bartels, M. Wisch, and J. Gail, "Test Procedures and Results for Pedestrian AEB Systems."