A Deep Reinforcement Learning Framework for Eco-driving in Connected and Automated Hybrid Electric Vehicles
Zhaoxuan Zhu, Shobhit Gupta, Abhishek Gupta, Marcello Canova

Z. Zhu, S. Gupta and M. Canova are with the Center for Automotive Research, The Ohio State University, Columbus, OH 43212 USA (email: [email protected]; [email protected]; [email protected]). A. Gupta is with the Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210 USA (email: [email protected]).
Abstract—Connected and Automated Vehicles (CAVs), in particular those with multiple power sources, have the potential to significantly reduce fuel consumption and travel time in real-world driving conditions. In particular, the Eco-driving problem seeks to design optimal speed and power usage profiles based upon look-ahead information from connectivity and advanced mapping features to minimize the fuel consumption over a given itinerary. Due to the complexity of the problem and the limited on-board computational capability, the real-time implementation of many existing methods that rely on online trajectory optimization becomes infeasible. In this work, the Eco-driving problem is formulated as a Partially Observable Markov Decision Process (POMDP), which is then solved with a state-of-the-art Deep Reinforcement Learning (DRL) actor-critic algorithm, Proximal Policy Optimization. An Eco-driving simulation environment is developed for training and testing purposes. To benchmark the performance of the DRL controller, a baseline controller representing the human driver and the wait-and-see deterministic optimal solution are presented. With minimal on-board computational requirement and comparable travel time, the DRL controller reduces the fuel consumption by more than 17% by modulating the vehicle velocity over the route and performing energy-efficient approach and departure at signalized intersections when compared against the baseline controller.
Index Terms—Connected and automated vehicle, Eco-driving, deep reinforcement learning, dynamic programming, long short-term memory.
I. INTRODUCTION

With the advancement in vehicular connectivity and autonomy, Connected and Automated Vehicles (CAVs) have the potential to operate in a more time- and fuel-efficient manner [1]. With Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communication, the controller has access to real-time look-ahead information including the terrain, infrastructure and surrounding vehicles. Intuitively, with connectivity technologies, controllers can plan a speed profile that allows the ego vehicle to intelligently pass more signalized intersections in green phase with less change in speed. This problem is formulated as the Eco-driving problem (incorporating Eco-Approach and Departure at signalized intersections), which aims to minimize the fuel consumption and the travel time between two designated locations by co-optimizing the speed trajectory and the powertrain control strategy [2], [3].

The literature related to the Eco-driving problem distinguishes among three aspects, namely, powertrain configurations, traffic scenarios, and formulation domain. Regarding powertrain configuration, the difference is in whether the powertrain is equipped with a single power source [3]–[6] or a hybrid electric architecture [7]–[10]. The latter involves modeling multiple power sources and devising optimal control algorithms that can synergistically split the power demand in order to efficiently utilize the electric energy stored in the battery. Maamria et al. [11] systematically compare the computational requirement and the optimality of different Eco-driving formulations solved offline via Deterministic Dynamic Programming (DDP).

Related to the traffic scenarios, Ozatay et al. [4] proposed a framework providing an advisory speed profile using online optimization conducted on a cloud-based server, without considering the real-time traffic light variability. Olin et al. [9] implemented the Eco-driving framework to evaluate real-world fuel economy benefits obtained from a control logic computed in a Rapid Prototyping System on-board a test vehicle. As traffic lights are not explicitly considered in these studies, the Eco-driving control module is required to be coupled with other decision-making agents, such as human drivers or Adaptive Cruise Control (ACC) systems. Other studies have explicitly modeled and considered Signal Phase and Timings (SPaTs). Jin et al. [3] formulated the problem as a Mixed Integer Linear Programming (MILP) problem for conventional vehicles with an Internal Combustion Engine (ICE). Asadi et al. [12] used traffic simulation models and proposed to solve the problem considering probabilistic SPaT with DDP. Sun et al. [6] formulated the Eco-driving problem as a distributionally robust stochastic optimization problem with collected real-world data. Guo et al. [8] proposed a bi-level control framework for a hybrid vehicle. Bae et al. [10] extended the work in [6] to a heuristic HEV supervisory controller.

Finally, the Optimal Control Problem (OCP) for Eco-driving can be formulated in either the time domain [3], [8], [12] or the spatial domain [4], [6], [9], [10]. The primary benefit of the spatial formulation is that it naturally lends itself to the incorporation of distance-based route features, such as speed limits, grade, traffic light and stop sign locations. However, the spatial-domain formulation makes it difficult to incorporate time-based information such as the SPaT at signalized intersections received from V2I communication.

The dimensionality of the problem, and therefore the computational requirements, can quickly become intractable as the number of states increases. This is the case, for instance, when the energy management of a hybrid powertrain system is combined with velocity optimization in the presence of other vehicles or when approaching signalized intersections. The aforementioned methods either consider a simplified powertrain model [10], [13] or treat the speed planning and the powertrain control hierarchically [8].
Although such efforts made the real-time implementation feasible, optimality can be sacrificed [11]. The use of Deep Reinforcement Learning (DRL) in the context of Eco-driving has received considerable attention in recent years. DRL provides a train-offline, execute-online methodology with which the policy is learned from historical data or by interacting with a simulated environment. The offline training can either result in an explicit policy or be implemented as part of Model Predictive Control (MPC) as the terminal cost function. Shi et al. [14] modeled conventional vehicles with ICE as a simplified model and implemented Q-learning to minimize the CO2 emissions at signalized intersections. Li et al. [15] applied an actor-critic algorithm to the ecological ACC problem in car-following mode. Lee et al. [16] proposed a model-based Q-learning method for electric vehicles on driving cycles without considering signalized intersections.

This work focuses on the development of an Eco-driving controller for HEVs with the capability to pass signalized intersections autonomously. Compared to other works, the OCP is formulated as a centralized problem with a nonlinear hybrid electric powertrain model. To overcome the intensive onboard computation, the problem is formulated as a Partially Observable Markov Decision Process (POMDP), and it is subsequently solved with an on-policy actor-critic DRL algorithm, Proximal Policy Optimization (PPO), along with Long Short-Term Memory (LSTM) as the function approximator. To benchmark the performance of the resultant explicit policy, we present a baseline strategy representing human driving behaviors with a rule-based energy management module, and a wait-and-see deterministic optimal solution. The comparison was conducted over 100 randomly generated trips with urban and mixed urban driving scenarios.

The remainder of the paper is organized as follows. Section II presents the simulation environment. Section III introduces the preliminaries of the DRL algorithm employed in this work. Section IV mathematically formulates the Eco-driving problem, and Section V presents the proposed DRL controller. Section VI presents the baseline strategy and the wait-and-see deterministic optimal solution for benchmarking. Section VII shows the training details and benchmarks the performance. Finally, Section VIII concludes the study and identifies future research directions.

II. ENVIRONMENT MODEL
The successful training of a reinforcement learning agent relies on an environment to provide the data. In particular, model-free reinforcement learning methods typically require a large amount of data before the agent learns the policy via interaction with the environment.
Fig. 1. The Structure of the Environment Model.

In the context of Eco-driving, collecting such an amount of real-world driving data is expensive. Furthermore, the need for policy exploration during training poses safety concerns for human operators and hardware. For these reasons, a model of the environment is developed for training and validation purposes. The environment model, named EcoSIM, consists of a Vehicle Dynamics and Powertrain (VD&PT) model and a microscopic traffic simulator. Fig. 1 shows the schematic of EcoSIM. The controller commands three control inputs, namely, the ICE torque, the electric motor torque and the mechanical brake torque. The component-level torques collectively determine the HEV powertrain dynamics, the longitudinal dynamics of the ego vehicle and its location along the trip.
While the states of the vehicle and powertrain, such as battery State-of-Charge (SoC), velocity and gear, are readily available to the powertrain controller, the availability of the connectivity information depends on the infrastructure and the types of sensors equipped onboard. Two types of SPaT data sources, namely, Dedicated Short Range Communication (DSRC) and Cellular Communication (4G/LTE or 5G/LTE), are used for Eco-driving in [17] and [18], respectively. In this study, it is assumed that Cellular Communication technologies are available onboard, and SPaT remains available at every point during the trip. The DRL agent utilizes the SPaT from the upcoming traffic light while ignoring the SPaT from any other traffic light regardless of availability. Specifically, the distance to the upcoming traffic light, its status and its SPaT program are fed into the controller as observations. Finally, a navigation application with Global Positioning System (GPS) is assumed to be on the vehicle such that the locations of the origin and the destination, the remaining distance, and the speed limits of the entire trip are available at every point during the trip.
A. Vehicle and Powertrain Model
A forward-looking dynamic powertrain model is developed for fuel economy evaluation and control strategy verification over real-world routes. In this work, a P0 mild-hybrid electric vehicle (mHEV) is considered, equipped with a 48V Belted Starter Generator (BSG) performing torque assist, regenerative braking and start-stop functions. The topology and the block diagram of this powertrain are illustrated in Fig. 2 and Fig. 3, respectively. The key components of the low-frequency quasi-static model are described below.
Fig. 2. Block Diagram of P0 mHEV Topology.
Fig. 3. Block Diagram of 48V P0 Mild-Hybrid Drivetrain.
1) Engine Model:
The engine is modeled as low-frequency quasi-static nonlinear maps. The fuel consumption and the torque limit maps are based on steady-state engine test bench data provided by a supplier:

$$\dot{m}_{fuel,t} = \psi(\omega_{eng,t}, T_{eng,t}), \qquad (1)$$

where the second subscript $t$ represents the discrete time index, and $\omega_{eng}$ and $T_{eng}$ are the engine angular velocity and torque, respectively.
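As an illustration of Eqn. (1), the sketch below evaluates a quasi-static fuel map with bilinear interpolation. The grid breakpoints and map values are synthetic placeholders for illustration, not the supplier data used in the paper.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Placeholder engine map: speed [rad/s] x torque [Nm] -> fuel rate [g/s]
omega_grid = np.linspace(80.0, 600.0, 27)     # engine speed breakpoints
torque_grid = np.linspace(0.0, 250.0, 26)     # engine torque breakpoints
fuel_map = 1e-4 * np.outer(omega_grid, np.ones_like(torque_grid)) \
           * (0.4 + 0.6 * torque_grid / torque_grid[-1])  # synthetic psi(w, T)

psi = RegularGridInterpolator((omega_grid, torque_grid), fuel_map,
                              bounds_error=False, fill_value=None)

def fuel_rate(omega_eng, T_eng):
    """Instantaneous fuel consumption m_dot_fuel = psi(omega_eng, T_eng)."""
    return float(psi([[omega_eng, T_eng]]))

print(fuel_rate(300.0, 120.0))  # fuel rate [g/s] at an example operating point
```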
2) BSG Model:
In a P0 configuration, the BSG is connected to the engine via a belt, as shown in Eqn. (2). A simplified, quasi-static efficiency map $\eta(\omega_{bsg,t}, T_{bsg,t})$ is used to compute the electrical power output $P_{bsg,t}$ in both regenerative braking and traction operating modes:

$$\omega_{bsg,t} = \tau_{belt}\,\omega_{eng,t}, \qquad (2)$$

$$P_{bsg,t} = T_{bsg,t}\,\omega_{bsg,t} \cdot
\begin{cases}
\eta(\omega_{bsg,t}, T_{bsg,t}), & T_{bsg,t} < 0 \\
\dfrac{1}{\eta(\omega_{bsg,t}, T_{bsg,t})}, & T_{bsg,t} > 0
\end{cases} \qquad (3)$$

where $\tau_{belt}$, $\omega_{bsg,t}$ and $T_{bsg,t}$ refer to the belt ratio, the BSG angular velocity and the BSG torque, respectively.
3) Battery Model:
A zero-th order equivalent circuit model is used to model the current ($I_t$) dynamics. Coulomb counting [19] is used to compute the battery SoC:

$$I_t = \frac{V_{OC}(SoC_t) - \sqrt{V_{OC}(SoC_t)^2 - 4\,R(SoC_t)\,P_{bsg,t}}}{2\,R(SoC_t)}, \qquad (4)$$

$$SoC_{t+1} = SoC_t - \frac{\Delta t}{C_{nom}}\,(I_t + I_a), \qquad (5)$$

where $\Delta t$ is the time discretization, which is set to 1 s in this study. The power consumed by the auxiliaries is modeled by a calibrated constant current bias $I_a$. The cell open circuit voltage $V_{OC}(SoC_t)$ and internal resistance $R(SoC_t)$ data are obtained from the pack supplier.
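The electrical path of Eqns. (3)-(5) can be sketched as follows. The open-circuit voltage, internal resistance, capacity and auxiliary current are illustrative constants (in the paper, $V_{OC}$ and $R$ are SoC-dependent maps from the pack supplier), and the efficiency split in `bsg_power` follows the convention of Eqn. (3).

```python
def bsg_power(T_bsg, omega_bsg, eta=0.90):
    """Electrical power of the BSG (Eqn. 3): divide by eta when motoring
    (T_bsg > 0), multiply by eta when regenerating (T_bsg < 0)."""
    mech = T_bsg * omega_bsg
    return mech / eta if T_bsg > 0 else mech * eta

def battery_step(soc, P_bsg, dt=1.0, V_oc=48.0, R0=0.01,
                 C_nom=28.0 * 3600.0, I_a=5.0):
    """Zero-th order equivalent circuit (Eqn. 4) + Coulomb counting (Eqn. 5)."""
    # Battery current from R0*I^2 - V_oc*I + P_bsg = 0 (smaller root)
    I = (V_oc - (V_oc ** 2 - 4.0 * R0 * P_bsg) ** 0.5) / (2.0 * R0)
    soc_next = soc - dt / C_nom * (I + I_a)
    return soc_next, I

soc_next, I = battery_step(soc=0.5, P_bsg=bsg_power(T_bsg=40.0, omega_bsg=300.0))
print(soc_next, I)
```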
4) Torque Converter Model:
A simplified torque converter model is developed with the purpose of computing the losses during traction and regeneration modes. Here, the lock-up clutch is assumed to be always actuated, applying a controlled slip $\omega_{slip}$ between the turbine and the pump. The assumption might be inaccurate during launches, and this can be compensated by including a fuel consumption penalty in the optimization problem, associated to each vehicle launch event. This model is described as follows [20]:

$$T_{tc,t} = T_{pt,t}, \qquad (6)$$

$$\omega_{p,t} = \omega_{tc,t} + \omega_{slip}(n_{g,t}, \omega_{eng,t}, T_{eng,t}), \qquad (7)$$

$$\omega_{eng,t} =
\begin{cases}
\omega_{p,t}, & \omega_{p,t} \geq \omega_{stall} \\
\omega_{idle,t}, & 0 \leq \omega_{p,t} < \omega_{stall} \\
0, & 0 \leq \omega_{p,t} < \omega_{stall} \text{ and } Stop = 1
\end{cases} \qquad (8)$$

where $n_{g}$ is the gear number, $\omega_{p,t}$ is the speed of the torque converter pump, $\omega_{tc,t}$ is the speed of the turbine, $\omega_{stall}$ is the speed at which the engine stalls, $\omega_{idle,t}$ is the idle speed of the engine, $Stop$ is a flag from the ECU indicating engine shut-off when the vehicle is stationary, $T_{tc,t}$ is the turbine torque, and $T_{pt,t}$ is the combined powertrain torque. The desired slip $\omega_{slip}$ is determined based on the powertrain conditions and the desired operating mode of the engine (traction or deceleration fuel cut-off).
5) Transmission Model:
The transmission model is based on a static gearbox, whose equations are as follows:

$$\omega_{tc,t} = \tau_{g}(n_{g,t})\,\omega_{trans,t} = \tau_{g}(n_{g,t})\,\tau_{fdr}\,\omega_{out,t} = \tau_{g}(n_{g,t})\,\tau_{fdr}\,\frac{v_{veh,t}}{R_{w}}, \qquad (9)$$

$$T_{trans,t} = \tau_{g}(n_{g,t})\,T_{tc,t}, \qquad (10)$$

$$T_{out,t} =
\begin{cases}
\tau_{fdr}\,\eta_{trans}(n_{g,t}, T_{trans,t}, \omega_{trans,t})\,T_{trans,t}, & T_{trans,t} \geq 0 \\
\tau_{fdr}\,\eta_{trans}^{-1}(n_{g,t}, T_{trans,t}, \omega_{trans,t})\,T_{trans,t}, & T_{trans,t} < 0
\end{cases} \qquad (11)$$

where $\tau_{g}$ and $\tau_{fdr}$ are the gear ratio and the final drive ratio, respectively. The transmission efficiency $\eta_{trans}(n_{g}, T_{trans}, \omega_{trans})$ is scheduled as a nonlinear map expressed as a function of the gear number $n_{g}$, the transmission input shaft torque $T_{trans,t}$ and the transmission input speed $\omega_{trans}$. $\omega_{out,t}$ refers to the angular velocity of the wheels. $R_{w}$ and $v_{veh,t}$ are the radius of the vehicle wheel and the longitudinal velocity of the vehicle, respectively.
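A minimal sketch of the static gearbox relations in Eqns. (9)-(11). The gear ratios, final drive ratio, wheel radius and the constant efficiency standing in for the mapped $\eta_{trans}$ are placeholders.

```python
GEAR_RATIOS = {1: 4.6, 2: 2.7, 3: 1.8, 4: 1.3, 5: 1.0, 6: 0.8}
TAU_FDR = 3.2          # final drive ratio (placeholder)
R_WHEEL = 0.32         # wheel radius [m] (placeholder)
ETA_TRANS = 0.95       # constant stand-in for the mapped efficiency

def gearbox(n_g, v_veh, T_tc):
    """Map vehicle speed and turbine torque to transmission/wheel quantities."""
    tau_g = GEAR_RATIOS[n_g]
    omega_out = v_veh / R_WHEEL                    # wheel speed [rad/s]
    omega_trans = TAU_FDR * omega_out              # transmission input speed
    omega_tc = tau_g * omega_trans                 # turbine speed (Eqn. 9)
    T_trans = tau_g * T_tc                         # Eqn. 10
    # Eqn. 11: efficiency applied according to the power-flow direction
    T_out = TAU_FDR * (ETA_TRANS * T_trans if T_trans >= 0
                       else T_trans / ETA_TRANS)
    return omega_tc, omega_trans, T_out

print(gearbox(n_g=3, v_veh=15.0, T_tc=80.0))
```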
Fig. 4. Validation of Vehicle Velocity, SoC and Fuel Consumed over FTP Cycle.
6) Vehicle Longitudinal Dynamics Model:
The vehicle dynamics model is based on the road-load equation, which includes the tire rolling resistance, road grade, and aerodynamic drag:

$$a_{veh,t} = \frac{T_{out,t} - T_{brk,t}}{M R_{w}} - \frac{C_{d}\,\rho_{a}\,A_{f}}{2M}\,v_{veh,t}^{2} - g C_{r} \cos\alpha_t - g \sin\alpha_t. \qquad (12)$$

Here, $a_{veh,t}$ is the longitudinal acceleration of the vehicle, $T_{brk}$ is the brake torque applied at the wheel, $M$ is the mass of the vehicle, $C_{d}$ is the aerodynamic drag coefficient, $\rho_{a}$ is the air density, $A_{f}$ is the effective aerodynamic frontal area, $C_{r}$ is the rolling resistance coefficient, and $\alpha_t$ is the road grade.
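The road-load equation (12) can be integrated with a simple forward-Euler step, as sketched below with placeholder vehicle parameters.

```python
import math

M, R_W = 1600.0, 0.32          # vehicle mass [kg], wheel radius [m] (placeholders)
C_D, RHO_A, A_F = 0.30, 1.2, 2.2
C_R, G = 0.009, 9.81

def longitudinal_step(v, T_out, T_brk, alpha=0.0, dt=1.0):
    """One integration step of the vehicle longitudinal dynamics (Eqn. 12)."""
    a = ((T_out - T_brk) / (M * R_W)
         - C_D * RHO_A * A_F / (2.0 * M) * v ** 2
         - G * C_R * math.cos(alpha)
         - G * math.sin(alpha))
    return max(0.0, v + a * dt)   # the vehicle does not move backwards

print(longitudinal_step(v=10.0, T_out=800.0, T_brk=0.0))
```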
7) Vehicle Model Verification:
The forward model is then calibrated and verified using experimental data from chassis dynamometer testing. The key variables used for evaluating the model are the vehicle velocity, battery SoC, gear number, engine speed, desired engine and BSG torque profiles, and the fuel consumption. Fig. 4 shows sample results from model verification over the FTP regulatory drive cycle, where the battery SoC and fuel consumption are compared against experimental data. The mismatches in the battery SoC profiles can be attributed to the simplicity of the battery model, in which the electrical accessory loads are modeled using a constant current bias. The fuel consumption over the FTP cycle is well estimated by the model, with an error on the final value of less than 4% relative to the experimental data.
B. Traffic Model
A large-scale microscopic traffic simulator is developed in the open-source software Simulation of Urban Mobility (SUMO) [21] as part of the environment. In order to recreate realistic mixed urban and highway trips for training, the map of the city of Columbus, OH, USA is downloaded from the online database OpenStreetMap [22]. The map contains the length, shape, type and speed limit of the road segments
and the detailed program of each traffic light at signalized intersections. Fig. 5 highlights the area that is covered in the study. In the shaded area, 10,000 random passenger car trips, each with a total distance randomly distributed from 5 km to 10 km, are generated as the training set. Another 100 trips, indicated by the origins and destinations in Fig. 5, are generated following the same distribution as the testing set. In addition, the inter-departure time of each trip follows a geometric distribution with success rate $p$. The variation and the randomness of the trips used for training enhance the richness of the environment, which subsequently leads to a learned policy that is less subject to local minima and agnostic to the specific driving condition (better generalizability) [23].

Fig. 5. Map of Columbus, OH for DRL Training. [Each red and blue marker denotes the start and end point of an individual trip, and the colored lines denote the routes between these points.]

The interface between the traffic simulator and the VD&PT model is established via the Traffic Control Interface (TraCI), which is part of the SUMO package. At any given time step, the kinetics of the vehicle calculated from VD&PT is fed to the traffic simulator as input. Subsequently, SUMO determines the location of the ego vehicle, updates the connectivity information, such as the SPaT of the upcoming traffic light and the GPS signal, and returns them to the agent as part of the observations.
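A minimal sketch of how such a co-simulation loop can be driven through TraCI is shown below. The SUMO configuration file name, the ego-vehicle id, and the `agent`/`vdpt` stubs are assumptions made for illustration; the TraCI calls shown (simulationStep, vehicle.setSpeed, vehicle.getNextTLS, trafficlight.getNextSwitch) are standard SUMO APIs, although the interface used in the paper may differ in detail.

```python
import traci

EGO_ID = "ego"                       # placeholder vehicle id

def run_episode(agent, vdpt, sumo_cfg="ecosim.sumocfg", max_steps=2000):
    traci.start(["sumo", "-c", sumo_cfg])
    try:
        for _ in range(max_steps):
            # 1) Build the observation from SUMO (SPaT of the upcoming light)
            next_tls = traci.vehicle.getNextTLS(EGO_ID)
            if next_tls:
                tls_id, _, dist_tfc, state = next_tls[0]
                t_switch = (traci.trafficlight.getNextSwitch(tls_id)
                            - traci.simulation.getTime())
            else:
                dist_tfc, state, t_switch = 1e4, "G", 0.0
            obs = vdpt.observation(dist_tfc, state, t_switch)

            # 2) Agent chooses component torques; VD&PT integrates kinematics
            T_eng, T_bsg, T_brk = agent.act(obs)
            v_next = vdpt.step(T_eng, T_bsg, T_brk)

            # 3) Feed the ego-vehicle speed back to SUMO and advance one step
            traci.vehicle.setSpeed(EGO_ID, v_next)
            traci.simulationStep()
    finally:
        traci.close()
```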
III. DEEP REINFORCEMENT LEARNING PRELIMINARIES

A. Markov Decision Process
In a Markov Decision Process (MDP), sequential decisions are made in order to maximize the discounted sum of rewards. An MDP can be defined by a tuple $\langle \mathcal{S}, \mathcal{A}, P, \rho_0, r, \gamma \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ are the state space and the action space, respectively; $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ is the transition dynamics distribution; $\rho_0$ is the initial distribution of the state space. The reward function $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ maps the tuple $(s_t, a_t, s_{t+1})$ to the instantaneous reward. Finally, $\gamma$ is the discount factor that prioritizes the immediate reward and ensures that the summation over an infinite horizon is finite.

Let $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$ be a randomized policy and $\Pi$ be the set of all randomized policies. The objective of the MDP is to find the optimal policy $\pi^*$ that maximizes the expectation of the discounted sum of rewards, defined as follows:

$$\pi^* = \operatorname*{argmax}_{\pi \in \Pi} \eta(\pi), \quad \text{where } \eta(\pi) = \mathbb{E}_{s_{t+1} \sim P(\cdot|s_t, a_t)}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \quad s_0 \sim \rho_0(\cdot),\ a_t \sim \pi(\cdot|s_t). \qquad (13)$$

In the remainder of the work, the expectation under the state trajectory $\mathbb{E}_{s_{t+1} \sim P(\cdot|s_t, a_t)}[\cdot]$ will be written compactly as $\mathbb{E}_{\pi}[\cdot]$. For any policy $\pi$, the value function $V^\pi: \mathcal{S} \rightarrow \mathbb{R}$, the Q function $Q^\pi: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ and the advantage function $A^\pi: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ are defined as follows:

$$V^\pi(s_t) = \mathbb{E}_\pi\left[\sum_{i=t}^{\infty} \gamma^{i-t} r(s_i, a_i) \,\middle|\, s_t\right], \qquad (14)$$

$$Q^\pi(s_t, a_t) = \mathbb{E}_\pi\left[\sum_{i=t}^{\infty} \gamma^{i-t} r(s_i, a_i) \,\middle|\, s_t, a_t\right], \qquad (15)$$

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t). \qquad (16)$$
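For reference, the sketch below computes Monte Carlo estimates of the discounted return in Eqn. (14) and the corresponding advantage in Eqn. (16) for a short, purely illustrative episode.

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = sum_{i>=t} gamma^(i-t) * r_i, computed backwards (Eqn. 14)."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

rewards = [-0.1, -0.2, 0.0, 1.0]            # illustrative one-episode rewards
returns = discounted_returns(rewards)        # Monte Carlo estimates of V(s_t)
V_estimate = 0.3                             # illustrative critic value
advantages = [G - V_estimate for G in returns]   # Eqn. (16) with Q ~ G_t
print(returns, advantages)
```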
B. Actor Critic Algorithm

The actor-critic algorithm is one of the earliest concepts in the field of reinforcement learning [24], [25]. The actor is typically a stochastic control policy, whereas the critic is a value function that assists the actor in improving the policy. In Deep Reinforcement Learning, both the actor and the critic are typically in the form of deep neural networks.

In this study, a policy gradient method is used to iteratively improve the policy. According to the Policy Gradient Theorem in [26], [27], the gradient of the objective with respect to the policy parameters $\theta$ can be determined by the following equation:

$$\nabla_\theta \eta(\pi_{\theta_k}) \propto \sum_s \rho(s) \sum_a Q^{\pi_{\theta_k}}(s, a)\, \nabla_\theta \pi_{\theta_k}(a|s), \qquad (17)$$

where $\rho(s)$ is the discounted on-policy state distribution defined as follows:

$$\rho(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s). \qquad (18)$$

As in [28], the gradient can be estimated as follows:

$$\nabla_\theta \eta(\pi_{\theta_k}) = \mathbb{E}_\pi\left[A^{\pi_{\theta_k}}(s_t, a_t)\, \frac{\nabla_\theta \pi_{\theta_k}(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}\right]. \qquad (19)$$

Accordingly, to incrementally increase $\eta(\pi_\theta)$, the gradient ascent rule follows

$$\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta \eta(\pi_{\theta_k}), \qquad (20)$$

where $\alpha_k$ is the learning rate. As Eqn. (19) is a local estimate of the gradient in the neighborhood of the current policy, updating the policy in such a direction with a large step size could potentially lead to a large performance drop. Schulman et al. [29] proposed to constrain the difference between the probability distributions of the old and the new policy with a trust region method. Although less brittle, the algorithm requires the analytical Hessian matrix, resulting in a high computational load and a nontrivial implementation. In this paper, a first-order method proposed by Schulman et al. [30] is used. Instead of Eqn. (19), a clipped surrogate objective function is defined as follows:

$$L_t(\theta) = \mathbb{E}_\pi\left[\min\left(r_t(\theta)\,A^{\pi_{\theta_k}}(s_t, a_t),\ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,A^{\pi_{\theta_k}}(s_t, a_t)\right)\right],$$
$$\text{where } r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}, \quad \operatorname{clip}(x, a_{\min}, a_{\max}) = \min(\max(x, a_{\min}), a_{\max}). \qquad (21)$$

Here, the hyperparameter $\epsilon$ is the clipping ratio. Note that the first-order derivative of the loss function around $\theta_k$, $\nabla_\theta L_t(\theta)|_{\theta_k}$, is equal to $\nabla_\theta \eta(\pi_\theta)|_{\theta_k}$, which is consistent with the Policy Gradient Theorem.

Fig. 6. Surrogate Objective Function.

Fig. 6 shows the value of $L_t(\theta)$ as a function of $r_t(\theta)$. When $A^{\pi_{\theta_k}}(s_t, a_t)$ is positive, an increase in $\pi_\theta(a_t|s_t)$ results in an increase in $L_t(\theta)$. In the meantime, due to the clipping effect, any increase in $r_t(\theta)$ beyond $(1+\epsilon)$ no longer improves the surrogate objective. Similarly, when $A^{\pi_{\theta_k}}(s_t, a_t)$ is negative, any decrease beyond $(1-\epsilon)$ does not increase the surrogate objective. The minimization operation guarantees $L_t(\theta) \leq r_t(\theta) A^{\pi_{\theta_k}}(s_t, a_t)$ for all $r_t(\theta)$, which is required by the method of majorization-minimization (MM) [31].
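A minimal NumPy sketch of the clipped surrogate objective in Eqn. (21); the log-probabilities and advantages below are illustrative values only.

```python
import numpy as np

def ppo_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """L = E[min(r*A, clip(r, 1-eps, 1+eps)*A)], with r = pi_theta / pi_theta_k."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

logp_old = np.array([-1.2, -0.7, -2.0])   # log pi_theta_k(a_t|s_t)
logp_new = np.array([-1.0, -0.9, -1.5])   # log pi_theta(a_t|s_t)
adv = np.array([0.5, -0.3, 1.2])          # advantage estimates
print(ppo_surrogate(logp_new, logp_old, adv))
```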
C. Partially Observable Markov Decision Process

In many practical applications, states are not fully observable. The partial observability can arise from sources such as the need to remember past states, sensor availability or noise, and unobserved variations of the plant under control [32]. Such a problem can be modeled as a POMDP, where observations $o_t \in \mathcal{O}$, instead of states, are available to the agent. The observation at a certain time follows the conditional distribution given the current state, $o_t \sim P(\cdot|s_t)$.

For a POMDP, the optimal policy, in general, depends on the entire history $h_t = (o_0, a_0, o_1, a_1, \ldots, a_{t-1}, o_t)$, i.e. $a_t \sim \pi(\cdot|h_t)$. The optimal policy can be obtained by giving policies internal memory to access the history $h_t$. In [33], it is shown that the policy gradient theorem can be used to solve the POMDP problem with a Recurrent Neural Network (RNN) as the function approximator, i.e. Recurrent Policy Gradients. Compared to other function approximators such as the Multilayer Perceptron (MLP) or the Convolutional Neural Network (CNN), an RNN exploits the sequential property of the inputs and uses internal states for memory. Specifically, Long Short-Term Memory (LSTM) [34], as a special architecture of RNN, is typically used in DRL to avoid gradient explosion and gradient vanishing, which are well-known issues with RNNs [35]. In LSTM, three types of gates are used to keep the memory cells activated for arbitrarily long periods. The combination of Policy Gradient and LSTM has shown excellent results in many modern DRL applications [36], [37].

In the Eco-driving problem, there are trajectory constraints while approaching traffic lights and stop signs. Thus, in these situations, the ego vehicle needs to remember the states visited in the recent past. LSTM-based function approximators are therefore chosen to approximate the value function and the advantage function so that the ego vehicle can use information about the past states visited to decide on its torques.

IV. PROBLEM FORMULATION
In the Eco-driving problem, the objective is to minimize the weighted sum of the fuel consumption and the travel time between two designated locations. The optimal control problem is formulated as follows:

$$\min_{T_{eng}, T_{bsg}, T_{brk}} \ \mathbb{E}\left[\sum_{t=0}^{\infty} (\dot{m}_{fuel,t} + c_{time})\,\Delta t \cdot \mathbb{I}[d_t < d_{total}]\right] \qquad (22a)$$

subject to

$$SoC_{t+1} = f_{batt}(v_{veh,t}, SoC_t, T_{eng,t}, T_{bsg,t}, T_{brk,t}) \qquad (22b)$$
$$v_{veh,t+1} = f_{veh}(v_{veh,t}, SoC_t, T_{eng,t}, T_{bsg,t}, T_{brk,t}) \qquad (22c)$$
$$T_{eng}^{\min}(\omega_{eng,t}) \leq T_{eng,t} \leq T_{eng}^{\max}(\omega_{eng,t}) \qquad (22d)$$
$$T_{bsg}^{\min}(\omega_{bsg,t}) \leq T_{bsg,t} \leq T_{bsg}^{\max}(\omega_{bsg,t}) \qquad (22e)$$
$$I^{\min} \leq I_t \leq I^{\max} \qquad (22f)$$
$$SoC^{\min} \leq SoC_t \leq SoC^{\max} \qquad (22g)$$
$$SoC_T \geq SoC_F \qquad (22h)$$
$$0 \leq v_{veh,t} \leq v_{lim,t} \qquad (22i)$$
$$(t, d_t) \notin S_{red} \qquad (22j)$$

Here, $\dot{m}_{fuel,t}$ is the instantaneous fuel consumption at time $t$, $c_{time}$ is a constant that penalizes the travel time taken at each step, and $f_{batt}$ and $f_{veh}$ are the battery and vehicle dynamics introduced in Section II-A. The problem is formulated as an infinite horizon problem in which the stage cost becomes zero once the system reaches the goal, i.e. the travelled distance $d_t$ is greater than or equal to the total distance of the trip $d_{total}$. Eqns. (22d) to (22f) are the constraints imposed by the powertrain components. Eqn. (22g) and Eqn. (22h) are the constraints on the instantaneous battery SoC and the terminal SoC for charge sustaining. Here, the subscript $T$ represents the time at which the vehicle reaches the destination. $SoC^{\min}$, $SoC^{\max}$ and $SoC_F$ are commonly set to 30%, 80% and 50%, respectively. Eqn. (22i) and (22j) are the constraints imposed by the traffic conditions. The set $S_{red}$ represents the set in which the traffic light at a certain location is in red phase, as indicated by the red lines in Fig. 7.

Fig. 7. Feasible Set Imposed by Traffic Lights.

As the controller can only accurately predict the future driving condition in a relatively short range, due to the limited connectivity range and onboard processing power, a stochastic optimal control formulation is deployed to accommodate the future uncertainties. Specifically, since surrounding vehicles are not considered in the study, the main source of uncertainty comes from the unknown SPaT and the road conditions, such as the speed limits and the distances between signalized intersections, beyond the connectivity range.

TABLE I
OBSERVATION AND ACTION SPACE OF THE ECO-DRIVING PROBLEM

Observations (O ∈ R^9):
  SoC      Battery SoC
  v_veh    Vehicle velocity
  v_lim    Speed limit at the current road segment
  v'_lim   Upcoming speed limit
  t_s      Time to the start of the next green light at the upcoming intersection
  t_e      Time to the end of the next green light at the upcoming intersection
  d_tfc    Distance to the upcoming traffic light
  d'_lim   Distance to the road segment at which the speed limit changes
  d_rem    Remaining distance of the trip

Actions (A ∈ R^3):
  T_eng    Engine torque
  T_bsg    Motor torque
  T_brk    Equivalent brake torque
V. DEEP REINFORCEMENT LEARNING CONTROLLER
A. POMDP Adoption
In this study, the Eco-driving problem described by Eqn. (22) is solved as a POMDP. The constraints on the action space, i.e. Eqn. (22d), (22e) and (22f), are handled implicitly by the environment model, whereas the constraints on the state space are handled by imposing numerical penalties during the offline training.

Tab. I lists the observation and action spaces used to approach the Eco-driving stochastic OCP.
Here, $SoC$ and $v_{veh}$ are the states measured by the onboard sensors, and $v_{lim}$, $v'_{lim}$, $d_{tfc}$, $d'_{lim}$ and $d_{rem}$ are assumed to be provided by the downloaded map and GPS. $t_s$ and $t_e$ are the SPaT signals provided by V2I communication. When the upcoming traffic light is in green phase, $t_s$ remains 0, and $t_e$ is the remaining time of the current green phase; when the upcoming traffic light is in red phase, $t_s$ is the remaining time of the current red phase, and $t_e$ is the sum of the remaining red phase and the duration of the upcoming green phase.

The reward function $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ consists of four terms:

$$r = \operatorname{clip}\left[r_{obj} + r_{vel} + r_{batt} + r_{tfc},\ -1,\ 1\right]. \qquad (23)$$

Here, $r_{obj}$ represents the reward associated with the OCP objective; $r_{vel}$ is the penalty (negative reward) associated with the violation of the speed limit constraint Eqn. (22i); $r_{batt}$ represents the penalties associated with the violation of the battery constraints imposed by Eqn. (22g), (22h); $r_{tfc}$ is the penalty regarding the violation of the traffic light constraint imposed by Eqn. (22j). Specifically, the first three terms are designed as follows:

$$r_{obj,t} = c_{obj}\,(\dot{m}_{fuel,t}\,[\mathrm{g/s}] + c_{time}), \qquad (24)$$

$$r_{vel,t} = c_{vel,1}\,[v_{veh,t} - v_{lim,t}]^+ + c_{vel,2}\,\dot{a}_{veh,t}, \qquad (25)$$

$$r_{batt,t} =
\begin{cases}
c_{batt,1}\left([SoC_t - SoC^{\max}]^+ + [SoC^{\min} - SoC_t]^+\right), & d_t < d_{total} \\
c_{batt,2}\,[SoC_F - SoC_t]^+, & d_t \geq d_{total}
\end{cases} \qquad (26)$$

where $[\cdot]^+$ is the positive part of the variable, defined as

$$[x]^+ = \max(0, x). \qquad (27)$$

In Eqn. (25), a penalty on the longitudinal jerk is assigned to the agent to improve the drive quality and to avoid unnecessary speed fluctuations.

While the design of the first three rewards is straightforward, the reward associated with the traffic light constraints is more convoluted to define. First, a discrete state variable $m_{tfc}$ is defined in Fig. 8. $m_{tfc} = 0$ whenever the distance to the upcoming traffic light is greater than the critical braking distance $d_{critical,t}$, which is defined as follows:

$$d_{critical,t} = \frac{v_{veh,t}^2}{2\,b_{\max}}, \qquad (28)$$

where $b_{\max}$ is the maximal deceleration of the vehicle. Intuitively, the agent does not need to make an immediate decision regarding whether to accelerate to pass, or to decelerate to stop at the upcoming signalized intersection, while outside the critical braking distance range. Once the vehicle is within the critical braking distance range, $m_{tfc}$ is determined by the current status of the upcoming traffic light, the distance between the vehicle and the intersection, and the maximal distance that the vehicle could drive within the remaining green phase, $d_{\max,t}$, defined as follows:

$$d_{\max,t} = \sum_{i=0}^{t_e} \left[\min\left(v_{lim,t},\ v_{veh,t} + i\,a_{\max}\right)\right], \qquad (29)$$

where $a_{\max}$ is the maximal acceleration.

Fig. 8. State Machine for the Indicator $m_{tfc}$.

If the upcoming traffic light is in green, i.e. $t_s = 0$, $m_{tfc}$ gets updated to 1 if the distance between the vehicle and the upcoming intersection is less than $d_{\max,t}$, and to 2 otherwise. Intuitively, $m_{tfc} = 1$ means that the vehicle has enough time to cross the upcoming intersection within the remaining green phase of the current traffic light programming cycle. In case the vehicle following the actor policy is not able to catch the green light, a penalty proportional to $d_{miss}$, the distance between the vehicle and the intersection at the last second of the green phase, is assigned. On the other hand, $m_{tfc} = 2$ means the vehicle would not reach the intersection even with the highest acceleration in the current cycle.
If the upcoming traffic light is in red phase, $m_{tfc,t}$ gets updated to 2. When the vehicle is not able to come to a stop and violates the constraint, a penalty proportional to the speed at which the vehicle passes in red is assigned. In summary, the reward associated with the traffic light constraints is designed as follows:

$$r_{tfc,t} =
\begin{cases}
c_{tfc,1} + c_{tfc,2}\,d_{miss,t}, & m_{tfc,t} = 1 \\
c_{tfc,1} + c_{tfc,3}\,v_{veh,t}, & \text{otherwise}
\end{cases} \qquad (30)$$

In Appendix A, a guideline for the design and the tuning of the reward mechanism is provided, and the numerical values of all the constants in the reward function are listed in Table III.

In order to determine the reward at any given time, the environment model requires states that are not directly available as observations, such as $\dot{a}_t$, $m_{tfc}$ and $d_{miss}$. Instead of making these states available to the control agent, the POMDP formulation is intentionally selected for two reasons. Firstly, $m_{tfc}$ is heuristically determined and ignores the acceleration limits imposed by the powertrain components. Since it poses a significant impact on the reward at the intersection, revealing $m_{tfc}$ to the controller results in a strong dependency of the strategy on it, and occasionally such dependency misleads the agent to violate the constraints. Secondly, $d_{miss}$ is only relevant when the vehicle is catching a green light within the critical braking distance. Its numerical value in other situations could potentially mislead the policy, which is in the form of a neural network.

Studies have suggested that clipping the rewards between $[-1, 1]$ results in a better overall training process [38], [39]. With the coefficients listed in Table III, the negative reward saturates at $-1$ once $d_{miss,t}$ (with $m_{tfc,t} = 1$) or $v_{veh,t}$ (with $m_{tfc,t} = 2$) exceeds a threshold, which means that the rewarding mechanism no longer differentiates the quality of the state-action pairs beyond these thresholds at the signalized intersection. Such a design, on one hand, reduces the strong impact of the heuristically designed $m_{tfc}$. On the other hand, it also significantly slows down, or in some cases prevents, any learning, as the rewards carry little directional information. Heess et al. [23] propose to use a more diversified and rich environment to overcome such an issue. In this study, the diversity of the environment is ensured by the size of the SUMO map and the 10,000 randomly generated trips. In addition, the vehicle speed $v_{veh}$ and battery SoC are randomly reassigned following $\mathcal{U}(0, v_{lim})$ and $\mathcal{U}(SoC^{\min}, SoC^{\max})$, respectively, every $T = 100$ time steps. This domain randomization mechanism, used in many other DRL applications [37], [40], forces the agent to explore the state space more efficiently and learn a more robust policy.
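The reward mechanism of Eqns. (23)-(26) and (30) can be sketched as follows. The coefficients follow Table III, except c_time, whose value is not recoverable from the extracted table and is set to a placeholder here; the use of the jerk magnitude in Eqn. (25) and the zero reward outside the two traffic-light violation events are assumptions based on the textual description.

```python
def positive_part(x):
    return max(0.0, x)

def eco_reward(m_fuel_gps, jerk, v_veh, v_lim, soc, at_destination,
               tfc_event="none", d_miss=0.0,
               c_obj=-0.001, c_time=1.0,   # c_time: placeholder, value missing in Table III
               c_vel1=-0.002, c_vel2=-0.015,
               c_batt1=-0.25, c_batt2=-10.0,
               c_tfc1=-0.25, c_tfc2=-0.01, c_tfc3=-0.02,
               soc_min=0.3, soc_max=0.8, soc_f=0.5):
    """Reward of Eqns. (23)-(26) and (30), clipped to [-1, 1]."""
    r_obj = c_obj * (m_fuel_gps + c_time)                                # Eqn. (24)
    r_vel = c_vel1 * positive_part(v_veh - v_lim) + c_vel2 * abs(jerk)   # Eqn. (25)
    if not at_destination:                                               # Eqn. (26)
        r_batt = c_batt1 * (positive_part(soc - soc_max)
                            + positive_part(soc_min - soc))
    else:
        r_batt = c_batt2 * positive_part(soc_f - soc)
    # Eqn. (30): penalty only at the violation events described in the text
    if tfc_event == "missed_green":
        r_tfc = c_tfc1 + c_tfc2 * d_miss
    elif tfc_event == "ran_red":
        r_tfc = c_tfc1 + c_tfc3 * v_veh
    else:
        r_tfc = 0.0
    return max(-1.0, min(1.0, r_obj + r_vel + r_batt + r_tfc))           # Eqn. (23)

print(eco_reward(m_fuel_gps=0.8, jerk=0.5, v_veh=14.0, v_lim=13.0,
                 soc=0.45, at_destination=False, tfc_event="ran_red"))
```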
B. Algorithm Details

With the Monte Carlo method, the value function can be approximated as follows:

$$\hat{V}^{\pi_\theta}_\xi(s_t) \leftarrow \hat{\mathbb{E}}_{\pi_\theta}\left[\sum_{i=t}^{\infty} \gamma^{i-t} r(s_i, a_i) \,\middle|\, s_t\right], \qquad (31)$$

where the superscript $\pi_\theta$ indicates that the value function is associated with the policy $\pi$ parametrized by $\theta$, and the subscript $\xi$ indicates that the value function itself is parametrized by $\xi$. Although unbiased, the Monte Carlo estimator has high variance and requires the entire trajectory to be simulated. On the other hand, the TD($N$) estimator is defined as follows:

$$\hat{V}^{\pi_\theta}_\xi(s_t) \leftarrow \hat{\mathbb{E}}_{\pi_\theta}\left[\sum_{i=t}^{t+N-1} \gamma^{i-t} r(s_i, a_i) + \gamma^{N}\,\hat{V}^{\pi_\theta}_{\xi_{old}}(s_{t+N}) \,\middle|\, s_t\right]. \qquad (32)$$

Compared to the Monte Carlo method, it reduces the required rollout length and the variance of the estimate by bootstrapping. However, the TD($N$) estimator is biased due to the approximation error in $V^\pi(s_{t+N})$. TD($\lambda$), included in [25], takes the geometric sum of the terms from TD($N$), leading to an adjustable balance between bias and variance. In this study, LSTM is used as the function approximator; instead of the data tuple $(s_t, a_t, r_t, s_{t+1})$, the tuples $(o_t, h_{o,t}, a_t, h_{a,t}, r_t, o_{t+1}, h_{o,t+1})$ are logged in simulation, where $h_o$ and $h_a$ are the hidden states of the policy and value function networks, respectively. Since the state space is randomized every $N$ steps, truncated TD($\lambda$) is used for value function approximation. Specifically, after having collected a sequence of tuples $(o_t, h_{o,t}, a_t, h_{a,t}, r_t, o_{t+1}, h_{o,t+1})_{t_0:t_0+N}$,
the following equations are used for updating the value function, $\forall t \in [t_0, t_0 + N - 1]$:

$$\hat{V}^{\pi_\theta}_\xi(o_t, h_{o,t}) \leftarrow V^{\pi_\theta}_\xi(o_t, h_{o,t}) + \sum_{i=t}^{t_0+N-1} (\gamma \lambda_V)^{i-t}\,\delta_i, \qquad (33)$$

where

$$\delta_i = r_i + \gamma V^{\pi_\theta}_\xi(o_{i+1}, h_{o,i+1}) - V^{\pi_\theta}_\xi(o_i, h_{o,i}). \qquad (34)$$

Similarly, to balance the variance and the bias, the advantage function is estimated with truncated GAE($\lambda$), as proposed in [41]:

$$\hat{A}^{\pi_\theta}(o_t, h_{o,t}, a_t) = \sum_{i=t}^{t_0+N-1} (\gamma \lambda_A)^{i-t}\,\delta_i. \qquad (35)$$

Fig. 9. Network Architecture.

Fig. 9 shows the architecture of the neural network function approximators for the value function and the policy. Here, a multivariate Gaussian distribution with diagonal covariance matrix is used as the stochastic policy. With the estimated advantage function and the policy update rule in Eqn. (21), the policy tends to converge prematurely to a suboptimal deterministic policy, since a sequence of actions is required in order to change the intersection-crossing behavior due to its hierarchical nature. Studies [42], [43] show that adding the entropy of the stochastic policy to the surrogate objective function effectively prevents such premature convergence. As a result, the policy is updated by maximizing the following objective function:

$$L_{mod,t}(\theta) = L_t(\theta) + \beta\,h\left(\mathcal{N}(\mu, \Sigma)\right). \qquad (36)$$

Here, $L_t(\theta)$ is the surrogate objective function defined in Eqn. (21). $\beta$ and $h(\mathcal{N}(\mu, \Sigma))$ are the entropy coefficient and the entropy of the multivariate Gaussian policy.

Since PPO is on-policy, its sample efficiency is low. To accelerate the learning progress, multiple actors are distributed over different CPU processors for sample collection. Algorithm 1 lists the detailed steps of the algorithm, and Appendix B lists all the hyperparameters used for the training.
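A minimal sketch of the truncated TD(λ) value target (Eqns. (33)-(34)) and the truncated GAE(λ) advantage (Eqn. (35)) over one rollout segment, with λ_V = 0.4 and λ_A = 0.8 as in Table IV; the rewards and critic values are illustrative.

```python
import numpy as np

def td_residuals(rewards, values, bootstrap_value, gamma=0.99):
    """delta_i = r_i + gamma * V(o_{i+1}) - V(o_i), Eqn. (34)."""
    v_next = np.append(np.asarray(values[1:], dtype=float), bootstrap_value)
    return np.asarray(rewards, dtype=float) + gamma * v_next - np.asarray(values, dtype=float)

def lambda_sum(deltas, gamma, lam):
    """out[t] = sum_{i>=t} (gamma*lam)^(i-t) * delta_i over the segment."""
    out, acc = np.zeros_like(deltas), 0.0
    for i in reversed(range(len(deltas))):
        acc = deltas[i] + gamma * lam * acc
        out[i] = acc
    return out

rewards = [-0.05, -0.02, 0.3]        # r_t over one N-step segment
values = [0.10, 0.12, 0.20]          # critic estimates V(o_t, h_{o,t})
deltas = td_residuals(rewards, values, bootstrap_value=0.25)
value_targets = np.asarray(values) + lambda_sum(deltas, gamma=0.99, lam=0.4)  # Eqn. (33)
advantages = lambda_sum(deltas, gamma=0.99, lam=0.8)                          # Eqn. (35)
print(value_targets, advantages)
```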
VI. METHODS FOR BENCHMARKING

A. Baseline Controller
To benchmark the fuel economy benefits of CAV technologies, it is crucial to establish a baseline representative of real-world driving. The baseline controller consists of the Enhanced Driver Model (EDM), a deterministic reference velocity predictor that utilizes route features to generate velocity profiles representing varying driving styles [44], [45], and a rule-based HEV energy management controller. The EDM is capable of predicting the response of a human driver while approaching a stop sign or a signalized intersection using a Line-of-Sight (LoS) based heuristic scheme [46].

Algorithm 1: PPO for Eco-driving
  Initialize EcoSIM with the randomly generated routes;
  Initialize the policy and value networks π_θ, V_ξ;
  while NOT converged do
    for actor = 1, 2, ..., N do
      Reset v_{veh,0} ~ U(0, v_{lim,0}) and SoC_0 ~ U(0.3, 0.8);
      Execute policy π_{θ_old} in EcoSIM for T time steps;
      Record (o_t, h_{o,t}, a_t, h_{a,t}, r_t, o_{t+1}, h_{o,t+1}) for each time step;
      Compute advantage estimates Â^{π_{θ_old}}(o_t, a_t) following Eqn. (35);
    end
    Compute the gradient of the surrogate objective function L_mod(θ) in Eqn. (36) and apply gradient ascent w.r.t. θ;
    Update the value function and its parameter set ξ following Eqn. (34);
  end

LoS is a dynamic human-vision based distance parameter used to preview the upcoming route feature, as devised by the Intersection Sight Distance (ISD) specified by the American Association of State Highway and Transportation Officials (AASHTO) and the US DoT FHA [47]. The EDM comprises three operating modes, namely Car-Following (CF), Freeway Driving (FD), and Stop Mode (S), as represented by the following equations:

$$a_{veh,t} =
\begin{cases}
a_{\max}\left[1 - \left(\dfrac{v_{veh,t}}{v_{lim,t} - \theta}\right)^{\delta}\right], & v_{veh,t} \leq v_{lim,t} \quad [\mathrm{FD}] \\
-b_{\max}\left[1 - \left(\dfrac{v_{lim,t} - \theta}{v_{veh,t}}\right)^{\delta}\right], & v_{veh,t} \geq v_{lim,t} \quad [\mathrm{FD}] \\
a_{\max}\left[1 - \left(\dfrac{v_{veh,t}}{v_{l,t}}\right)^{\delta}\right], & v_{veh,t} \leq v_{l,t} \quad [\mathrm{CF}] \\
-b_{\max}\left[1 - \left(\dfrac{v_{l,t}}{v_{veh,t}}\right)^{\delta}\right], & v_{veh,t} \geq v_{l,t} \quad [\mathrm{CF}] \\
-b_{\max}\left[\dfrac{v_{veh,t}^2}{2\,b_{\max}\,s_t}\right], & v_{l,t} = 0 \quad [\mathrm{S}]
\end{cases}$$

$$s_t = x_{l,t} - x_{veh,t} - s_{safe}, \qquad s_{brake,t} = c_\delta\,\frac{v_{veh,t}^2}{2\,b_{\max}}, \qquad (37)$$

where $v_{veh,t}$ is the (reference) ego vehicle velocity, $v_{lim,t}$ is the route speed limit, $v_{l,t}$ is the lead vehicle velocity, and $x_{veh,t}$ and $x_{l,t}$ are the positions of the ego and lead vehicles, respectively.

The FD mode is activated in the absence of a lead vehicle, where the ego vehicle follows the route speed limit with the rate of change of velocity limited by the maximum acceleration $a_{\max}$. The repulsive braking strategy in the FD mode ensures that the vehicle never violates the speed limits and, if required, decelerates at rate $b_{\max}$. The car-following mode is activated in the presence of a lead vehicle, and a tunable parameter $s_{safe}$ is used to maintain a safe separation from the leader. The stop mode is activated when the distance from the ego vehicle to the stop sign decreases below the critical braking distance $s_{brake}$.

The approach of the ego vehicle to a signalized intersection is modeled by providing the signal phase information to the driver only within the LoS. At any point beyond the LoS (chosen as 100 m) from the traffic light, the traffic light is assumed to be a stop sign. The velocity reference from the EDM is fed to a tracking controller that generates the necessary inputs to the validated vehicle model, as described in Fig. 10.

Fig. 10. Integration of the EDM with the Vehicle and Powertrain Model.
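A simplified sketch of the EDM freeway-driving and stop modes in Eqn. (37). The parameters a_max, b_max, θ and δ are placeholders for the calibrated EDM values, and the stop-mode deceleration is written here as the kinematic deceleration needed to stop within s_t, an assumption made because the corresponding expression is partially garbled in the extracted equation.

```python
A_MAX, B_MAX = 2.0, 3.0      # maximum acceleration / deceleration [m/s^2] (placeholders)
THETA, DELTA = 0.5, 4.0      # speed-limit offset and shape exponent (placeholders)

def edm_accel(v_veh, v_lim, dist_to_stop=None):
    """Reference acceleration: free-flow (FD) tracking of the speed limit,
    or stop mode (S) once a stop is within the critical braking distance."""
    s_brake = v_veh ** 2 / (2.0 * B_MAX)
    if dist_to_stop is not None and dist_to_stop <= s_brake:
        # Assumed stop-mode law: decelerate just enough to stop within s_t
        return -v_veh ** 2 / (2.0 * max(dist_to_stop, 1e-3))
    if v_veh <= v_lim:
        return A_MAX * (1.0 - (v_veh / (v_lim - THETA)) ** DELTA)
    return -B_MAX * (1.0 - ((v_lim - THETA) / v_veh) ** DELTA)

print(edm_accel(v_veh=12.0, v_lim=15.0))                      # FD, below the limit
print(edm_accel(v_veh=14.0, v_lim=15.0, dist_to_stop=30.0))   # stop mode
```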
B. Deterministic Optimal Solution
A wait-and-see solution, obtained via Deterministic Dynamic Programming (DDP) [48], is introduced to provide a benchmark for optimality against which the DRL solution can be compared. Here, the route information and the SPaT sequences of the traffic lights along the route are assumed to be known a priori. The objective of the optimization is defined as follows:

$$\min_{a \in \mathcal{A}} \sum_{t=0}^{\infty} (\dot{m}_{fuel,t} + c_{time})\,\Delta t \cdot \mathbb{I}[d_t < d_{total}] \qquad (38a)$$

$$= \min_{a \in \mathcal{A}} \sum_{k=0}^{N} \left[(\dot{m}_{fuel,k} + c_{time})\,\frac{\Delta d}{\bar{v}_{veh,k}}\right], \qquad (38b)$$

where $k$ is the index in the distance domain, and $N$ is the index at which $d_N$ reaches the total distance. $\Delta d / \bar{v}_{veh}$ is the travel time per step, computed from the distance discretization and the average velocity $\bar{v}_{veh,k} = (v_{veh,k} + v_{veh,k+1})/2$. Here, the state and action spaces are defined as follows:

$$x_k = [v_{veh,k},\ SoC_k,\ t_k]^T, \qquad (39)$$

$$u_k = [T_{eng,k},\ T_{bsg,k},\ T_{brk,k}]^T, \qquad (40)$$

where $t_k$ is the time at which the vehicle reaches the distance $d_k$.
The state dynamics in the spatial domain follow the equations below:

$$v_{veh,k+1} = \sqrt{\left[v_{veh,k}^2 + 2 a_k \Delta d\right]^+}, \qquad (41)$$

$$SoC_{k+1} = SoC_k - \frac{\Delta d}{\bar{v}_{veh,k}}\,\frac{\bar{I}_{batt,k}}{C_{nom}}, \qquad (42)$$

$$t_{k+1} = t_k + \frac{\Delta d}{\bar{v}_{veh,k}}. \qquad (43)$$

Note that when the vehicle stops at intersections, $\Delta d / \bar{v}_{veh}$ goes to infinity. To address this issue, time, as a state, jumps to the start of the next green phase when the vehicle stops before a traffic light, as illustrated by Fig. 11.

Fig. 11. State Dynamics of Time.

To solve the problem via DDP, the state and action spaces are discretized, and the optimal cost-to-go matrix and the optimal control policy matrix are obtained from backward recursion [9]. Since each combination in the state-action space needs to be evaluated to obtain the optimal solution, the calculation of the wait-and-see solution for any trip used in the study can take hours with modern Central Processing Units (CPUs). Here, the parallel DDP solver in [49] with CUDA programming is used to reduce the computation time from hours to seconds.
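A minimal sketch of the spatial-domain state update of Eqns. (41)-(43), including the time jump to the start of the next green phase when the vehicle stops at a red light; the fixed-cycle SPaT schedule and battery constants are illustrative assumptions.

```python
import math

def next_green_start(t, t_cycle=60.0, t_green=30.0):
    """Start time of the next green phase for an illustrative fixed-cycle
    light (green during [0, t_green) of every cycle)."""
    phase = t % t_cycle
    return t if phase < t_green else t + (t_cycle - phase)

def spatial_step(v, soc, t, a, I_batt, delta_d=10.0, C_nom=28.0 * 3600.0,
                 stopped_at_red=False):
    v_next = math.sqrt(max(0.0, v ** 2 + 2.0 * a * delta_d))   # Eqn. (41)
    v_avg = max(0.5 * (v + v_next), 1e-3)
    soc_next = soc - delta_d / v_avg * I_batt / C_nom           # Eqn. (42)
    t_next = t + delta_d / v_avg                                 # Eqn. (43)
    if stopped_at_red:
        t_next = next_green_start(t_next)                        # time jump
    return v_next, soc_next, t_next

print(spatial_step(v=10.0, soc=0.5, t=55.0, a=0.5, I_batt=20.0))
```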
VII. RESULTS

The training takes place on a node of the Ohio Supercomputer Center (OSC) [50] with 40 Dual Intel Xeon 6148 processors and an NVIDIA Volta V100 GPU. The results shown below required running the training continuously for 24 hours.

As domain randomization is activated during training, the performance of the agent needs to be evaluated separately from the training to show the learning progress. Here, 40 trips in the training set with domain randomization deactivated are executed every 10 policy updates, i.e. every 4,000 rollout episodes. Fig. 12 shows the evolution of the mean policy entropy, the average cumulative rewards, the average fuel economy, the average speed and the completion ratio over the 40 randomly selected trips. Here, a trip is considered successfully completed if the agent did not violate any constraint during the itinerary. At the beginning of the training, the policy entropy increases to encourage exploration, and as the training evolves, the policy entropy eventually decreases as the entropy coefficient $\beta$ is linearly annealed. On average, the agent reaches a performance with an average fuel economy of ∼ mpg and an average speed of ∼ m/s. In addition, the agent was able to learn to obey the traffic rules at signalized intersections thanks to the properly designed rewarding mechanism.

Fig. 12. Evolution of Policy Entropy, Cumulative Rewards, Fuel Economy, Average Speed and Complete Ratio during Training.

The performance of the DRL controller is then compared against the causal baseline controller and the non-causal wait-and-see deterministic optimal solution over the 100 testing trips shown in Fig. 5. Fig. 13 and Tab. II show the statistical comparison among the three strategies. Here, the black line in each box represents the median of the data, and the lower and upper boundaries represent the 25th and 75th percentiles, respectively. The extremes of the lines extending outside the box represent the minimum and maximum of the data, and the "+" symbols represent the outliers. With comparable travel time, the DRL controller consumes 17.4% less fuel than the baseline over all the trips, and the wait-and-see solution consumes 28.3% less fuel while keeping the travel time comparable. The additional benefit of the wait-and-see solution stems from the fact that it has the information of all the traffic lights over the entire trip, whereas the DRL controller only uses the SPaT of the upcoming traffic light.

Fig. 13. Fuel Economy, Travel Time Comparison and Charge Sustenance Behavior for Baseline, DRL and Wait-and-see Solutions.

TABLE II
FUEL ECONOMY, AVERAGE SPEED AND SOC VARIANCE FOR BASELINE, DRL AND WAIT-AND-SEE SOLUTIONS

                        Baseline   DRL Controller   Wait-and-See
  Fuel Economy [mpg]    33.7       40.8             47.0
  Speed Mean [m/s]
  SoC Variance [%]

Such an advantage is also reflected in Fig. 14. Here, the average speed and the fuel economy of each trip are plotted against the traffic light density, i.e. the number of traffic lights divided by the total distance in kilometers. Intuitively, the average speed of the three controllers decreases as the traffic light density increases. Compared to the causal controllers, the fuel economy of the wait-and-see solution is least affected by the increase in traffic light density. This is because the additional SPaT information from the subsequent traffic lights allows the wait-and-see solution to plan through the intersections more efficiently. In the meantime, as suggested by the regression curves, the DRL controller is less sensitive to the increase in traffic light density than the baseline (EDM with rule-based energy management strategy).

Fig. 14. Variation of Average Speed and Fuel Economy against Traffic Light (TL) Density for Baseline, DRL and Wait-and-see Solutions.

Fig. 15 and Fig. 16 show the trajectories of the three controllers in urban and mixed (urban and highway) driving conditions, respectively. Driving under the same conditions, the DRL controller comes to a full stop at signalized intersections less frequently compared to the baseline. In addition, the DRL controller utilizes more of the battery capacity, i.e. a SoC profile with higher variation, compared to the baseline. DRL's efficient maneuvers while approaching the intersection, coupled with better utilization of SoC, result in up to 27% reduction in fuel consumption for both urban and mixed driving when compared against the baseline. It should be noted that since the baseline produces trajectories for a limited line-of-sight, the EDM has the tendency to follow the speed limits in the absence of traffic, which leads to faster travel times in mixed-highway driving but also to more frequent stops at intersections when compared against the DRL or the wait-and-see solution, as evident in Fig. 16.

Fig. 15. Comparison of Velocity, ΔSoC, Time-Space and Fuel Consumption for Baseline, DRL and Wait-and-see Solutions [Urban Route].

Fig. 16. Comparison of Velocity, ΔSoC, Time-Space and Fuel Consumption for Baseline, DRL and Wait-and-see Solutions [Mixed Route].

VIII. CONCLUSION
In this study, the Eco-driving problem for Hybrid Electric Vehicles (HEVs) with the capability of autonomously passing signalized intersections is studied. To accommodate the complexity and the high computational requirement of solving this problem, a learn-offline, execute-online strategy is proposed. The Eco-driving problem is formulated as a Partially Observable Markov Decision Process, and a Deep Reinforcement Learning (DRL) framework is subsequently developed. To facilitate the training, a simulation environment was created consisting of a mild HEV powertrain model and a large-scale microscopic traffic simulator developed in Simulation of Urban Mobility. The DRL controller is trained via Proximal Policy Optimization with Long Short-Term Memory as the function approximator. To benchmark the performance of the DRL controller, a baseline strategy and a wait-and-see optimal solution (computed via Deterministic Dynamic Programming) are presented. With the properly designed rewarding mechanism, the agent learned to obey the constraints in the optimal control problem formulation. Furthermore, the learned explicit policy reduces the average fuel consumption by 17.4% over the 100 randomly generated trips in urban, mixed-urban and highway conditions compared to the baseline, while keeping the travel time comparable.

Future work will focus on two aspects. First, hard constraint satisfaction will be rigorously analyzed. Second, the design of reward functions specific to drivers (personalization) will be investigated.
APPENDIX A
REWARD FUNCTION DESIGN
In general, designing the reward function is an iterative process, and it requires tuning. Here, the numerical values of the coefficients in the reward function are listed in Tab. III. Some key takeaways are listed below.

1) Normalize the scale of the reward function such that its numerical value is between [-1, 1].
2) Rewards from the environment should reflect incremental incentives/penalties. For example, rewards associated with traffic lights are sparse, meaning that the agent receives these rewards periodically and needs a sequence of actions to avoid the large penalty. Eqn. (30) ensures the penalty from violating the traffic condition is proportional to how bad the violation was.
3) Penalties related to constraints should be orders of magnitude larger than those related to performance.

APPENDIX B
HYPERPARAMETERS
The hyperparameters used for training are listed in Tab. IV.
TABLE III
NUMERICAL VALUES OF THE CONSTANTS IN THE REWARD FUNCTION

  Constant    Value
  c_obj       -0.001
  c_time
  c_vel,1     -0.002
  c_vel,2     -0.015
  c_batt,1    -0.25
  c_batt,2    -10
  c_tfc,1     -0.25
  c_tfc,2     -0.01
  c_tfc,3     -0.02
TABLE IV
HYPERPARAMETERS USED FOR TRAINING

  Parameter                        Value
  LSTM Layer Size                  100
  MLP Layer Size                   100
  Samples per Episode              50
  Number of Episodes per Update    400
  Number of CPUs                   40
  Number of GPUs                   1
  PPO Clipping                     0.2
  GAE(λ)                           0.8
  Truncated TD(λ)                  0.4
  Entropy Coefficient              0.01 (linear annealing)
  Learning Rate α
  γ

ACKNOWLEDGMENT
The authors acknowledge the support from the United States Department of Energy, Advanced Research Projects Agency–Energy (ARPA-E) NEXTCAR project (Award Number DE-AR0000794) and The Ohio Supercomputer Center.
REFERENCES

[1] A. Vahidi and A. Sciarretta, "Energy saving potentials of connected and automated vehicles," Transportation Research Part C: Emerging Technologies, vol. 95, pp. 822–843, 2018.
[2] A. Sciarretta, G. De Nunzio, and L. L. Ojeda, "Optimal ecodriving control: Energy-efficient driving of road vehicles as an optimal control problem," IEEE Control Systems Magazine, vol. 35, no. 5, pp. 71–90, 2015.
[3] Q. Jin, G. Wu, K. Boriboonsomsin, and M. J. Barth, "Power-based optimal longitudinal control for a connected eco-driving system," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 10, pp. 2900–2910, 2016.
[4] E. Ozatay, S. Onori, J. Wollaeger, U. Ozguner, G. Rizzoni, D. Filev, J. Michelini, and S. Di Cairano, "Cloud-based velocity profile optimization for everyday driving: A dynamic-programming-based solution," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 6, pp. 2491–2505, 2014.
[5] J. Han, A. Vahidi, and A. Sciarretta, "Fundamentals of energy efficient driving for combustion engine and electric vehicles: An optimal control perspective," Automatica, vol. 103, pp. 558–572, 2019.
[6] C. Sun, J. Guanetti, F. Borrelli, and S. Moura, "Optimal eco-driving control of connected and autonomous vehicles through signalized intersections," IEEE Internet of Things Journal, 2020.
[7] F. Mensing, R. Trigui, and E. Bideaux, "Vehicle trajectory optimization for hybrid vehicles taking into account battery state-of-charge," IEEE, 2012, pp. 950–955.
[8] L. Guo, B. Gao, Y. Gao, and H. Chen, "Optimal energy management for HEVs in eco-driving applications using bi-level MPC," IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 8, pp. 2153–2162, 2016.
[9] P. Olin, K. Aggoune, L. Tang, K. Confer, J. Kirwan, S. R. Deshpande, S. Gupta, P. Tulpule, M. Canova, and G. Rizzoni, "Reducing fuel consumption by using information from connected and automated vehicle modules to optimize propulsion system control," SAE Technical Paper, Tech. Rep., 2019.
[10] S. Bae, Y. Choi, Y. Kim, J. Guanetti, F. Borrelli, and S. Moura, "Real-time ecological velocity planning for plug-in hybrid vehicles with partial communication to traffic lights," arXiv preprint arXiv:1903.08784, 2019.
[11] D. Maamria, K. Gillet, G. Colin, Y. Chamaillard, and C. Nouillant, "Computation of eco-driving cycles for hybrid electric vehicles: Comparative analysis," Control Engineering Practice, vol. 71, pp. 44–52, 2018.
[12] B. Asadi and A. Vahidi, "Predictive cruise control: Utilizing upcoming traffic signal information for improving fuel economy and reducing trip time," IEEE Transactions on Control Systems Technology, vol. 19, no. 3, pp. 707–714, 2011.
[13] F. Mensing, E. Bideaux, R. Trigui, and H. Tattegrain, "Trajectory optimization for eco-driving taking into account traffic constraints," Transportation Research Part D: Transport and Environment, vol. 18, pp. 55–61, 2013.
[14] J. Shi, F. Qiao, Q. Li, L. Yu, and Y. Hu, "Application and evaluation of the reinforcement learning approach to eco-driving at intersections under infrastructure-to-vehicle communications," Transportation Research Record, vol. 2672, no. 25, pp. 89–98, 2018.
[15] G. Li and D. Görges, "Ecological adaptive cruise control for vehicles with step-gear transmission based on reinforcement learning," IEEE Transactions on Intelligent Transportation Systems, 2019.
[16] H. Lee, N. Kim, and S. W. Cha, "Model-based reinforcement learning for eco-driving control of electric vehicles," IEEE Access, vol. 8, pp. 202886–202896, 2020.
[17] B. Asadi and A. Vahidi, "Predictive cruise control: Utilizing upcoming traffic signal information for improving fuel economy and reducing trip time," IEEE Transactions on Control Systems Technology, vol. 19, no. 3, pp. 707–714, 2010.
[18] G. Mahler, A. Winckler, S. A. Fayazi, M. Filusch, and A. Vahidi, "Cellular communication of traffic signal state to connected vehicles for arterial eco-driving," IEEE, 2017, pp. 1–6.
[19] P. Rong and M. Pedram, "An analytical model for predicting the remaining battery capacity of lithium-ion batteries," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 5, pp. 441–451, 2006.
[20] M. Livshiz, M. Kao, and A. Will, "Validation and calibration process of powertrain model for engine torque control development," SAE Technical Paper, Tech. Rep., 2004.
[21] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner, "Microscopic traffic simulation using SUMO," in The 21st IEEE International Conference on Intelligent Transportation Systems.
[22] OpenStreetMap. [Online]. Available: https://www.openstreetmap.org
[23] N. Heess et al., "Emergence of locomotion behaviours in rich environments," arXiv preprint arXiv:1707.02286, 2017.
[24] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, no. 5, pp. 834–846, 1983.
[25] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[26] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[27] P. Marbach and J. N. Tsitsiklis, "Simulation-based optimization of Markov reward processes," IEEE Transactions on Automatic Control, vol. 46, no. 2, pp. 191–209, 2001.
[28] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.
[29] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
[30] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[31] D. R. Hunter and K. Lange, "A tutorial on MM algorithms," The American Statistician, vol. 58, no. 1, pp. 30–37, 2004.
[32] N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver, "Memory-based control with recurrent neural networks," arXiv preprint arXiv:1512.04455, 2015.
[33] D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber, "Solving deep memory POMDPs with recurrent policy gradients," in International Conference on Artificial Neural Networks. Springer, 2007, pp. 697–706.
[34] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[35] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[36] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev et al., "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, no. 7782, pp. 350–354, 2019.
[37] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse et al., "Dota 2 with large scale deep reinforcement learning," arXiv preprint arXiv:1912.06680, 2019.
[38] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[39] H. P. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver, "Learning values across many orders of magnitude," in Advances in Neural Information Processing Systems, 2016, pp. 4287–4295.
[40] Z. Zhu, Y. Liu, and M. Canova, "Energy management of hybrid electric vehicles via deep Q networks," IEEE, 2020.
[41] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.
[42] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in
International conference on machine learning ,2016, pp. 1928–1937.[43] R. J. Williams and J. Peng, “Function optimization using connectionistreinforcement learning algorithms,”
Connection Science , vol. 3, no. 3,pp. 241–268, 1991.[44] S. Gupta, S. R. Deshpande, P. Tulpule, M. Canova, and G. Rizzoni,“An enhanced driver model for evaluating fuel economy on real-worldroutes,”
IFAC-PapersOnLine , vol. 52, no. 5, pp. 574–579, 2019.[45] S. Gupta, S. R. Deshpande, D. Tufano, M. Canova, G. Rizzoni,K. Aggoune, P. Olin, and J. Kirwan, “Estimation of fuel economy onreal-world routes for next-generation connected and automated hybridpowertrains,” SAE Technical Paper, Tech. Rep., 2020.[46] S. R. Deshpande, S. Gupta, D. Kibalama, N. Pivaro, and M. Canova,“Benchmarking fuel economy for connected and automated vehiclesin real world driving conditions via monte carlo simulation,” DynamicSystems and Control Conference, Tech. Rep., 2020.[47] A. AASHTO, “Policy on geometric design of highways and streets,”
American Association of State Highway and Transportation Officials,Washington, DC , vol. 1, no. 990, p. 158, 2001.[48] O. Sundstrom and L. Guzzella, “A generic dynamic programming matlabfunction,” in
Control Applications,(CCA) & Intelligent Control,(ISIC),2009 IEEE . IEEE, 2009, pp. 1625–1630.[49] Z. Zhu, S. Gupta, N. Pivaro, and M. Canova, “A gpu implementation ofa look-ahead optimal controller for eco-driving based on dynamic pro-gramming (submitted),” in .IEEE, 2021.[50] O. S. Center, “Ohio supercomputer center,” 1987. [Online]. Available:http://osc.edu/ark:/19495/f5s1ph73 PLACEPHOTOHERE
Zhaoxuan Zhu received the B.Sc. degree (summa cum laude) and the M.S. degree, both in mechanical engineering, from The Ohio State University, Columbus, OH, USA, in 2016 and 2018, respectively. He is currently a Ph.D. candidate in the Department of Mechanical and Aerospace Engineering at The Ohio State University, Columbus, OH, USA. His research interests include the application of stochastic optimal control and reinforcement learning in the field of connected and autonomous vehicles.
Shobhit Gupta received the Bachelor of Technology degree in mechanical engineering from the Indian Institute of Technology Guwahati, Assam, India, in 2017, and the M.S. degree in mechanical engineering from The Ohio State University, Columbus, OH, USA, in 2019. He is currently a Ph.D. student in the Department of Mechanical and Aerospace Engineering at The Ohio State University, Columbus, OH, USA. His research interests include optimal control of connected and autonomous vehicles and driver behavior recognition for predictive control.
Abhishek Gupta received the B.Tech. degree in aerospace engineering from the Indian Institute of Technology Bombay, Mumbai, India, in 2009, and the M.S. degree in aerospace engineering, the M.S. degree in applied mathematics, and the Ph.D. degree in aerospace engineering from the University of Illinois at Urbana-Champaign (UIUC), Urbana, IL, USA, in 2011, 2012, and 2014, respectively. He is currently an Assistant Professor with the Electrical and Computer Engineering Department at The Ohio State University, Columbus, OH, USA. From 2014 to 2015, he was a Postdoctoral Researcher with the Ming Hsieh Department of Electrical Engineering at the University of Southern California. His research interests include security of cyber-physical systems, multi-agent decision theory, probability theory, and optimization.