Data-Driven Transferred Energy Management Strategy for Hybrid Electric Vehicles via Deep Reinforcement Learning
Abstract: Real-time application of energy management strategies (EMSs) in hybrid electric vehicles (HEVs) imposes the harshest requirements on researchers and engineers. Inspired by the excellent problem-solving capability of deep reinforcement learning (DRL), this paper proposes a real-time EMS that incorporates the DRL method and transfer learning (TL). The related EMSs are derived from and evaluated on real-world driving cycles collected in the Transportation Secure Data Center (TSDC). The concrete DRL algorithm is proximal policy optimization (PPO), which belongs to the policy gradient (PG) family. Specifically, multiple source driving cycles are utilized to train the parameters of the PPO-based deep network. The learned parameters are then transferred to the target driving cycles under the TL framework. The EMSs for the target driving cycles are estimated and compared under different training conditions. Simulation results indicate that the presented transfer DRL-based EMS can effectively reduce time consumption and guarantee control performance.
Keywords:
Transfer learning; Deep reinforcement learning; Proximal policy optimization; Energy management; Hybrid electric vehicle; Real-world driving cycles
Teng Liu a,b, Bo Wang c,*, Wenhao Tan a, Shaobo Lu a, Yalian Yang a
a Department of Automotive Engineering, Chongqing University, Chongqing 400044, China
b Department of Mechanical and Mechatronics Engineering, University of Waterloo, Ontario N2L 3G1, Canada
c School of Mathematics and Statistics, Beijing Institute of Technology, Beijing 100081, China
* Corresponding author (Email address: [email protected]). Tel: +86(10)68944115; Fax: +86(10)68944115.
1. Introduction
Hybrid electric vehicles (HEVs) and plug-in hybrid electric vehicles (PHEVs) show great promise for energy saving and emission reduction [1]. Academic researchers are devoting themselves to improving the performance of these vehicles, and car manufacturers are developing new models of them. The energy management strategy (EMS) is a popular research topic for these vehicles. By designing reasonable distributions among multiple power sources, HEVs and PHEVs are able to promote energy efficiency and retard the greenhouse effect [2]. How to achieve an applicable online or real-time EMS is still a challenging problem in the energy management field. The approaches utilized to derive EMSs are primarily classified into three types: rule-oriented, optimization-based, and learning-enabled methods [3]. A rule-oriented EMS is often built from human engineering experience, and this kind of policy is prevalent in production hybrid vehicles [4]. An optimization-based EMS is usually constructed from optimal control techniques, and this strategy is often regarded as the benchmark to evaluate other policies [5]. A learning-enabled EMS is motivated by the intelligent regulation of machine learning, and this approach has become prevalent in recent years due to its potential for online implementation [6]. Many methods have been applied to establish learning-enabled EMSs, such as reinforcement learning (RL) and deep reinforcement learning (DRL). RL is a conceptual framework that addresses sequential decision-making problems through the interaction of an intelligent agent and its environment [7]. Since 2015, more and more attempts have been made in the study of RL-based EMSs for hybrid electric powertrains. For example, Liu et al. proposed an RL-based adaptive energy management policy for a hybrid tracked vehicle in [8]. The related performance is compared with the stochastic dynamic programming (SDP) method in different driving schedules.
The authors in [9] formulated an advanced EMS for a power-split PHEV with a heuristic planning reinforcement learning (RL) method. The Dyna-H algorithm is exploited to derive the control policy, which is shown to have advantages in fuel economy and computational speed. Furthermore, integrating the RL algorithm with other information is a novel pattern for generating real-time EMSs. In Ref. [10], Zhou et al. summarized the combination of future driving conditions and RL to obtain a predictive EMS. The future driving information includes vehicle speed, acceleration, and driver behaviors. This information enables the derived EMS to adapt to future driving environments and promote energy efficiency. The researchers in [11] discussed the energy management problem of an HEV in a connected environment with the proximal policy optimization (PPO) algorithm. It is assumed that the driving information (e.g., speed and acceleration) of the surrounding vehicles is known to the studied vehicle via V2V or V2I technology. Ref. [12] combined an onboard learning algorithm for Markov chains, the speedy Q-learning algorithm, and the induced matrix norm to produce an online EMS. The relevant control policy is demonstrated to be superior to SDP and traditional RL in real-time driving situations. The DRL technique is the incorporation of deep learning into RL. It has achieved several impressive results in the energy management field in the past three years [13]. The main difference between RL and DRL is the use of deep learning to approximate the Q-table of RL. This feature improves the accuracy and effectiveness of conventional RL algorithms. For example, deep Q-learning (DQL), also known as deep Q-network (DQN), was the first approach adopted in HEV energy management [14]. This algorithm can resolve continuous optimal control problems with more precise powertrain modeling. In addition, some improved variants of common DQL have also been applied in energy management research.
The authors in [15] discussed the energy management problem with double DQL. This method is able to effectively eliminate the overestimation issue in the approximation of the Q-table. Qi et al. [16] used dueling DQL to analyze the energy management problem. This technique employs a state-value function and an advantage function to acquire the Q-table, which can recognize the worth of each control action. Moreover, the deep deterministic policy gradient (DDPG) algorithm is another popular approach for handling the continuous state and control spaces in energy management problems. Ref. [17] applied Gaussian processes, random forests, and gradient-boosted random trees to optimize the hyperparameters of the DDPG algorithm. Simulation results show that the random forest can elevate the control performance of DDPG. To accelerate the convergence of random control action selection, the authors in [18] embedded the optimal brake specific fuel consumption curve of the engine and the charge-discharge power characteristics of the battery in the training process of DDPG. This work shows that a particular powertrain efficiency map can speed up learning and reduce fuel consumption. Transfer learning (TL) is another learning paradigm of machine learning [19]. The concept of TL is transferring learned knowledge from source tasks to target tasks to accelerate the learning process. This method is inspired by the fact that humans can apply previously learned knowledge to solve new problems with high efficiency. TL has proven beneficial in many research fields. For example, the authors in [20] developed a traffic forecasting framework by combining TL with a diffusion convolutional recurrent neural network. The predictive framework, trained on the California highway network, can forecast traffic on unseen regions of the network with high accuracy.
To overcome the problem of insufficient training data in emotion recognition, Nguyen et al. [21] presented a TL-based strategy to learn cross-domain features to improve recognition performance. The availability and robustness of the proposed framework are verified. Ref. [22] discussed the algorithm recommendation problem with TL. The trained neural network is reused to transfer the learned knowledge to similar datasets, and the performance is guaranteed. Recently, the integration of RL (or DRL) and TL has become a promising solution to control problems in automotive engineering. Several applications have been carried out for the study of HEVs or autonomous vehicles. For example, the authors in [23] combined DDPG and TL to construct a bi-level control architecture for a hybrid powertrain. The upper level utilizes DDPG to train the EMS at different speed intervals, and the lower level employs TL to transfer the pre-trained neural network to a new driving cycle. Lian et al. [24] realized cross-type knowledge transfer in a DRL-based EMS for hybrid electric vehicles. Four types of hybrid powertrains are included, and the EMS learned from a Prius model is transferred to three other powertrains, which are a series, a series-parallel, and a power-split one. Furthermore, the knowledge-transfer idea is also reflected in autonomous vehicles. Isele et al. discussed decision-making problems at intersections for self-driving vehicles [25]. The driving situations are cast into left-turning, going forward, and right-turning. The control policy learned from one driving task is proven to be effective in another driving mission. Ref. [26] exploited transfer RL to address the target recognition problem for multiple autonomous underwater vehicles (AUVs). The proposed transferred model reduces the repeated calculation of similar data and ensures the real-time performance of the algorithm.
Fig. 1. The proposed transfer DRL-based control framework for data-driven energy management strategy.
Motivated by the merits of the TL and RL methods, this paper presents a data-driven transferred energy management strategy for hybrid electric vehicles; see Fig. 1 for an illustration. The specific DRL algorithm is proximal policy optimization (PPO), which belongs to the policy gradient family. The relevant EMS is built and analyzed on the real-world driving cycle dataset from the Transportation Secure Data Center (TSDC). These driving cycles are classified into source driving cycles and target ones. The parameters of the deep neural network are trained and learned from the source driving cycles and then transferred to the target driving cycles to improve learning efficiency. The impacts of the number of source driving cycles and of the presence of the testing driving cycle on the related EMS are illustrated. Multiple assessments are executed to evaluate the efficacy of the transfer DRL-based energy management policy. Simulation results indicate that the presented transfer DRL-based EMS can effectively reduce time consumption and guarantee control performance. The main contributions and innovations of this work are threefold: 1) a real-time energy management policy is constructed for an HEV by using DRL (the PPO algorithm) and TL; 2) a real-world driving cycle dataset is applied to formulate and estimate the data-driven EMS; 3) the necessity and importance of the source and target driving cycles are illuminated to promote the control performance of the proposed method. This paper is an attempt to combine TL, DRL, and real-world driving data to establish a real-time EMS for an HEV. The presented control framework is adaptable to different hybrid powertrains. The rest of this work is organized as follows. The powertrain modeling of the studied HEV and its relevant energy management problem is described in Section 2. Section 3 gives the implementation of the transfer DRL approach, including the PPO and TL algorithms.
In Section 4, the simulation results of different testing experiments are elaborated. The effects of the source driving cycles are discussed. Finally, Section 5 summarizes the key takeaways.
2. Hybrid Powertrain Modeling and Energy Management Problem
Taking a series-parallel hybrid powertrain as an example, this section establishes the powertrain model and its related energy management problem. A sketch of the studied powertrain is depicted in Fig. 2 [27]. The main components include a battery pack, an internal-combustion engine (ICE), an integrated starter generator (ISG), and a traction motor. The power distribution between the battery and ICE is treated as the optimization objective in order to improve energy efficiency. Since the core of this work focuses on the application of transfer DRL to energy management problems, battery aging, road slope, and temperature effects are not considered in this article. The presented EMS is easily generalized to different hybrid powertrains.
Fig. 2. Control architecture of the studied series-parallel powertrain topology [27].

The power request of the powertrain is computed as:

$P_{req} = P_{ro} + P_{ae} + P_{ine}$  (1)

where P_ro, P_ae, and P_ine are the powers associated with the rolling resistance, aerodynamic drag, and inertial force, respectively. They can be expressed further in the following format:

$P_{ro} = m_v g f v,\quad P_{ae} = \tfrac{1}{2}\rho A_a C_D v^{3},\quad P_{ine} = m_v a v$  (2)

where m_v is the vehicle mass, f is the rolling resistance coefficient, and g is the gravitational acceleration. ρ is the air density, and A_a and C_D are the frontal area and aerodynamic drag coefficient, respectively. v and a are the vehicle speed and acceleration; they vary with the driving cycle and thus lead to a mutable power request. In this work, the driving cycles are adopted from a real-world driving cycle dataset, which is introduced in Section 2.3. In general, the battery and ICE are the main energy sources to satisfy the power request of the powertrain. Hence, the dynamic modeling of these two components is necessary for the energy management problem. In the battery, the state of charge (SOC) is the principal variable determining the remaining electricity capacity at each time instant. The value of SOC ranges from 0 (fully discharged) to 1 (fully charged). Its dynamics are computed as follows:

$\dot{SOC} = -I_{bat} / Q_{cap}$  (3)

where I_bat is the battery current and Q_cap is the nominal capacity of the battery. In this work, the battery is mimicked by the internal resistance model [28], wherein the power and voltage of the battery are expressed through the battery current as:

$P_{bat} = U_{bat} I_{bat},\quad U_{bat} = V_{oc} - I_{bat}\, r$  (4)

where V_oc is the open-circuit voltage of the battery and r is the internal resistance. For this particular series-parallel hybrid powertrain, the nominal capacity Q_cap is 8.1 Ah, the nominal voltage is 200 V, and the internal resistance r is 0.25 Ω.
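The two models above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the parameter values follow Section 2 and Table 1, and treating V_oc as a constant is a simplifying assumption (in practice it varies with SOC).

```python
import math

M_V, G, F_ROLL = 1325.0, 9.8, 0.012      # vehicle mass [kg], gravity [m/s^2], rolling coeff.
RHO, A_A, C_D = 1.225, 2.16, 0.26        # air density [kg/m^3], frontal area [m^2], drag coeff.
V_OC, R_INT = 150.0, 0.25                # open-circuit voltage [V], internal resistance [ohm]
Q_CAP = 8.1 * 3600.0                     # nominal capacity: 8.1 Ah expressed in coulombs

def power_request(v, a):
    """Total power demand [W] at speed v [m/s] and acceleration a [m/s^2], Eqs. (1)-(2)."""
    p_ro = M_V * G * F_ROLL * v          # rolling resistance power
    p_ae = 0.5 * RHO * A_A * C_D * v**3  # aerodynamic drag power
    p_ine = M_V * a * v                  # inertial power
    return p_ro + p_ae + p_ine

def soc_derivative(p_bat):
    """dSOC/dt for battery power p_bat [W], from the model in Eqs. (3)-(4)."""
    disc = V_OC**2 - 4.0 * R_INT * p_bat
    if disc < 0.0:
        raise ValueError("requested power exceeds battery capability")
    i_bat = (V_OC - math.sqrt(disc)) / (2.0 * R_INT)   # battery current [A]
    return -i_bat / Q_CAP
```

Note the sign convention: discharging (p_bat > 0) yields a negative dSOC/dt, while charging yields a positive one.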
Incorporating (3) and (4), the derivative of SOC is depicted as:

$\dot{SOC} = -\left(V_{oc} - \sqrt{V_{oc}^{2} - 4 r P_{bat}}\right) \big/ \left(2 Q_{cap}\, r\right)$  (5)

SOC reflects the current condition of the battery, and thus it is chosen as a state variable of the energy management problem. The range of SOC is [0, 1] in this article because real-world driving cycles are considered. As the power request is satisfied by the battery and ICE, the power of the ICE is selected as the control action, and it decides the battery power at each time step. In the static map model of the ICE, the significant factor is the fuel consumption rate. It is computed from the speed and torque of the ICE as follows:

$\dot{m}_{ICE} = f(T_{ICE}, \omega_{ICE})$  (6)

where ṁ_ICE is the instantaneous fuel consumption rate, and T_ICE and ω_ICE are the torque and speed of the ICE, respectively. f in (6) denotes a look-up table, which is supported by the brake specific fuel consumption (BSFC) map of the ICE. The parameters of the ICE are as follows: the range of the ICE speed is [1000, 4500] rpm, the peak power is 57 kW at 5000 rpm, and the peak torque is 115 Nm at 4200 rpm [27]. The control action in the energy management problem is the ICE torque with a continuous space, which is expected to be resolved by the advanced DRL method. The parameters of this series-parallel hybrid powertrain are described in Table 1. Table 1. Factors of the series-parallel powertrain for the energy management problem.
Indicator | Meaning | Value
m_v | Vehicle mass | 1325 kg
ρ | Air density | 1.225 kg/m³
f | Rolling resistance coefficient | 0.012
A_a | Frontal area | 2.16 m²
C_D | Aerodynamic drag coefficient | 0.26
g | Gravitational acceleration | 9.8 m/s²
V_oc | Open-circuit voltage | 150 V
SOC_ref | Charge-sustaining value | 0.65

The optimization objective of the energy management problem is minimizing the fuel consumption while sustaining the battery charge. The cost function over one driving cycle is defined as:

$J = \int_{0}^{T}\left[\dot{m}_{ICE}(t) + \lambda\,\Delta SOC(t)^{2}\right]dt,\quad \Delta SOC(t) = \begin{cases} SOC_{ref} - SOC(t), & SOC(t) < SOC_{ref} \\ 0, & SOC(t) \ge SOC_{ref} \end{cases}$  (7)

where [0, T] is the time horizon of each driving cycle. λ is a positive weighting factor achieving a trade-off between these two control indexes, and it is set to 1000 in this paper. SOC_ref is a pre-defined parameter that realizes the charge-sustaining mode, and it equals 0.65 in this article. Since the original intention of the energy management problem is searching for the optimal control action at each time instant, the state variables and control actions should be defined. The state variables are the vehicle speed, the acceleration, and the battery SOC. The control action is the ICE torque with a continuous space. They are limited as follows:

$S = \{v \in [0, 45]\ \mathrm{m/s},\ a \in [-5, 5]\ \mathrm{m/s^2},\ SOC \in [0, 1]\},\quad A = \{T_{ICE} \in [0, 115]\ \mathrm{Nm}\}$  (8)

In the selection of the control policy, many variables should stay within a reasonable physical range. This indicates that the parameters of the ICE, battery, ISG, and traction motor should follow a couple of constraints. These constraints depend on the particular hybrid powertrain. In the studied series-parallel powertrain, they are stated as follows:

$SOC_{min} \le SOC(t) \le SOC_{max},\quad P_{bat,min} \le P_{bat}(t) \le P_{bat,max}$  (9)

$\omega_{x,min} \le \omega_{x}(t) \le \omega_{x,max},\quad T_{x,min} \le T_{x}(t) \le T_{x,max},\quad x = Mot, Gen, ICE$  (10)

where the subscripts min and max imply the minimum and maximum values of each variable, and the subscripts Mot, Gen, and ICE denote the speed or torque of the traction motor, the ISG, and the engine, respectively. The next subsection gives the source of the real-world driving cycle dataset.

2.3 Driving Cycle Dataset
Fig. 3. Selected real-world driving cycles for transfer DRL-based EMS research.
In most references related to the EMS of HEVs, the simulation driving cycles are standard cycles [29]. As this work aims to propose a real-time EMS for multiple hybrid powertrains, real-world driving cycles are adopted. The concrete driving cycle dataset is collected by the Transportation Secure Data Center (TSDC) [30]. This driving cycle data includes aggregated statistics, trip driving distances, and second-by-second profiles (speed and acceleration) collected from vehicles with global positioning systems (GPS). The involved types of automobiles are light-, medium-, and heavy-duty vehicles. The data processing procedures are introduced in [31]. The generated cleansed data are available to many open-source software tools for programming, database query, and analysis, such as PostgreSQL/PostGIS/QGIS, GRASS, Python, and R. As explained in (2), vehicle speed and acceleration are the necessary elements for the power request calculation. The driving cycles in the energy management problem are essentially the various vehicle speed and acceleration profiles. In this work, the Delaware Valley Regional Planning Commission (DVRPC) processed driving cycles in the TSDC are used to derive the real-time EMSs. Specifically, 20 driving cycles (labeled Cycle 1 to Cycle 20) are compiled and analyzed by the proposed transfer DRL method. Fig. 3 shows the sampled vehicle speed and acceleration profiles. It can be discerned that the time duration, maximum speed, and average velocity differ among these driving cycles. As a consequence, the variation of the state variables (e.g., SOC) differs among these control cases. As mentioned in the Introduction, there are source and target task domains in the TL method. In this paper, the transferability of the DRL-based EMS is investigated and analyzed.
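Since only second-by-second speed traces are stored, the acceleration profile needed for the power request of Eq. (2) has to be derived numerically. A minimal sketch, assuming a 1 Hz sampling rate (the speed array here is illustrative, not an actual TSDC cycle):

```python
import numpy as np

def acceleration_profile(speed, dt=1.0):
    """Backward-difference acceleration [m/s^2] from a speed trace [m/s].

    The first sample is padded with the initial speed so the output has
    the same length as the input.
    """
    return np.diff(speed, prepend=speed[0]) / dt
```

A forward or central difference would work equally well; the backward difference is chosen here only because it needs no future samples and thus suits online use.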
A part of the 20 driving cycles is categorized as the source driving cycles, and the DRL approach is applied to them to train the parameters of the neural network. Then, the trained neural networks are applied to the target driving cycles to improve learning efficiency and control performance. The influence of the composition of the source driving cycles is discussed, and the role of knowledge about the target driving cycles is also illuminated. In the next section, the particular DRL method (the PPO algorithm) and the TL method are introduced.
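The source/target split described above can be sketched as follows. This helper is purely illustrative: the function name, the choice of the last two cycles as targets, and the assignment order are assumptions for demonstration, not the paper's actual assignment.

```python
def split_cycles(cycle_ids, n_source, include_targets):
    """Split cycle labels into source and target sets.

    The last two cycles play the role of the target driving cycles; the
    source set has n_source cycles and either contains the targets or not.
    """
    targets = cycle_ids[-2:]
    if include_targets:
        source = targets + cycle_ids[: n_source - 2]
    else:
        source = cycle_ids[: n_source]
    return source, targets
```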
3. DRL Method and TL Technique for EMS Transformation
This section describes the implementation of the transfer DRL method, which is utilized to formulate the real-time EMS in this article. The core of the RL framework is given first, including the essential elements of an RL algorithm. Then, the proximal policy optimization (PPO) algorithm is introduced, which is suitable for optimal control problems with continuous spaces; it is a recent form of policy gradient technique. Finally, the TL method is employed to transfer the learned EMS from the source driving cycles to the target ones.

3.1 RL Implementation

The RL framework describes the interaction between an intelligent agent and its environment. At each step, the agent takes a control action, and the environment returns a state and a feedback signal. According to the feedback from the environment, the agent modulates and improves its control policy. In this work, the agent is the onboard energy management controller, and the environment is the powertrain dynamics. The goal of the agent is to minimize the cost function in (7) for a given driving cycle by selecting an appropriate control strategy. The interaction between the agent and the environment is often modeled as a Markov decision process (MDP), given as a tuple (S, A, R, P). An MDP is a decision-making process with the Markov property, which indicates that the next state depends only on the present state, not on the past states. In the tuple, S is the set of state variables and A is the set of control actions. R is the reward function; it delegates the feedback from the environment and is represented by the instantaneous cost function in (7). P is the transition model, which describes the evolution of the state variables. The RL method is exploited to formulate the EMS (a sequence of control actions) in the HEV's energy management problem. The control actions are chosen so as to minimize the cumulative rewards.
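The interaction loop just described can be sketched as follows. `env` and `policy` are hypothetical stand-ins for the powertrain model and the energy management controller; no specific RL library is assumed.

```python
def rollout(env, policy, horizon):
    """Collect (s, a, r, s') transitions from one episode of agent-environment interaction."""
    transitions = []
    s = env.reset()
    for _ in range(horizon):
        a = policy(s)              # control action (ICE torque) chosen by the agent
        s_next, r = env.step(a)    # powertrain dynamics return next state and stepwise cost
        transitions.append((s, a, r, s_next))
        s = s_next
    return transitions
```

Each stored tuple is exactly the "observation" defined below, and a batch of such transitions is what the policy update consumes.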
This accumulated reward is computed as the sum of the current reward and the discounted future rewards as follows [32-34]:

$R_{t} = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_{k}$  (11)

where t is the time instant and γ ∈ [0, 1] is a discount factor balancing the current and future rewards. This is because a control action in the RL framework affects not only the present reward but also the future feedback. r_k ∈ R is the instantaneous reward. Assuming the EMS (control strategy) obtained from the energy management controller is π, it is calculated based on the value functions via RL algorithms. Two value functions are usually applied to derive the control policy in RL algorithms. They are named the state-value function and the action-value function, written as follows, respectively:

$V^{\pi}(s) = E_{\pi}\left[\sum_{k=t}^{T} \gamma^{\,k-t}\, r_{k} \,\Big|\, s_{t} = s\right]$  (12)

$Q^{\pi}(s, a) = E_{\pi}\left[\sum_{k=t}^{T} \gamma^{\,k-t}\, r_{k} \,\Big|\, s_{t} = s,\ a_{t} = a\right]$  (13)

where s_t and a_t indicate the state variables and control action at time step t. E_π[·] denotes the expected value of the accumulated reward following policy π. Different RL algorithms use disparate criteria to update the value functions. As the control action appears in the action-value function, many RL algorithms update this function (called the Q-matrix for short). For convenience, the action-value function is rewritten in recursive form to isolate the current reward as follows:

$Q^{\pi}(s_{t}, a_{t}) = E_{\pi}\left[r_{t} + \gamma \min_{a_{t+1}} Q^{\pi}(s_{t+1}, a_{t+1})\right]$  (14)

where s_{t+1} and a_{t+1} are the next state variable and control action. To compute the optimal control policy, the Bellman optimality equation is described with the transition model:

$Q^{*}(s_{t}, a_{t}) = \sum_{s_{t+1},\, r_{t}} p(s_{t+1}, r_{t} \mid s_{t}, a_{t})\left[r_{t} + \gamma \min_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1})\right]$  (15)

where p(s_{t+1}, r_t | s_t, a_t) ∈ P denotes the transition probability from the current state to the next state. The tuple (s_{t+1}, r_t, s_t, a_t) is called an observation or a transition in the RL algorithm and records the stepwise interaction between the intelligent agent and its environment. Finally, the optimal control action at each step is generated from the trained Q-matrix as follows:

$a_{t}^{*} = \arg\min_{a_{t}} Q^{*}(s_{t}, a_{t})$  (16)

In the conventional RL algorithm, the Q-matrix is updated by filling in the observation from each step, which is time-consuming and inefficient. Hence, in the DRL approach, deep learning (e.g., a neural network) is leveraged to approximate the Q-matrix. The policy is modeled with a parameter vector θ as π_θ(a|s), where θ denotes the parameters of the neural networks. In the next subsection, the PPO algorithm is introduced to update the parameterized policy.
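As a small worked example of the discounted return in Eq. (11), the sum can be evaluated by a backward recursion over the episode, which avoids recomputing the powers of γ:

```python
def discounted_return(rewards, gamma=0.9):
    """R_t at the first step: sum of gamma^(k - t) * r_k over the remaining horizon."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret   # R_k = r_k + gamma * R_(k+1)
    return ret
```

For instance, with γ = 0.5 and rewards [1, 1, 1], the return is 1 + 0.5 + 0.25 = 1.75.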
3.2 PPO Algorithm

For the parameterized cost function J(θ) derived from (7), the gradient ∇_θ J(θ) helps the agent find the best θ for π_θ that produces the lowest cumulative cost. This gradient can be computed with respect to the action-value function as follows [35]:

$\nabla_{\theta} J(\theta) = \sum_{s \in S} p^{\pi}(s) \sum_{a \in A} \nabla_{\theta}\, \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a) = E_{\pi}\left[Q^{\pi}(s, a)\, \nabla_{\theta} \ln \pi_{\theta}(a \mid s)\right]$  (17)

where p^π(s) is the probability distribution of the state variable. Let π_θ(a|s) and π_{θ_old}(a|s) be the new and old policies; the probability ratio between the two policies is depicted as:

$r(\theta) = \pi_{\theta}(a \mid s) \big/ \pi_{\theta_{old}}(a \mid s)$  (18)

Then, the loss function for updating the policy in the RL framework is represented by:

$L^{PG}(\theta) = \hat{E}\left[\ln \pi_{\theta}(a \mid s)\, \hat{A}(s, a)\right]$  (19)

where Â(s, a) is the estimated advantage function, because the real one is usually unknown. In the conventional policy gradient method, the loss function L^PG is often used to perform multiple optimization steps with the same policy. However, this updating criterion may cause problems such as sample inefficiency, policy divergence, and hesitation between exploration and exploitation: a new parameter θ may make the policy change too much in one step. Thus, the PPO algorithm is presented in [36] to overcome these challenges by synthesizing the advantages of typical value-based and policy-based RL methods. To improve the stability of the policy update in the PPO algorithm, a constraint is imposed on the probability ratio r(θ) to keep it within a small interval around 1, namely [1−ε, 1+ε]. The new loss function is described as follows:

$L^{CLIP}(\theta) = \hat{E}\left[\min\left(r(\theta)\, \hat{A}(s, a),\ \mathrm{clip}(r(\theta), 1-\varepsilon, 1+\varepsilon)\, \hat{A}(s, a)\right)\right]$  (20)

where ε is a small positive constant close to zero (equal to 0.2 in this work), which is a hyperparameter. The function clip(r(θ), 1−ε, 1+ε) clips the probability ratio to be no more than 1+ε and no less than 1−ε. This function effectively avoids a large policy update away from the old policy. In the actual application of PPO, the neural network of the actor (policy) shares parameters with that of the critic (state-value function). To encourage sufficient exploration by the agent, the clipped loss function is augmented with an error term on the value estimate and an entropy term:

$L^{CLIP+VF+S}(\theta) = \hat{E}\left[L^{CLIP}(\theta) - c_{1}\left(V_{\theta}(s_{t}) - V_{target}\right)^{2} + c_{2}\, H(s_{t}, \pi_{\theta})\right]$  (21)

where the second term is the value-function error term and the third term is the entropy term. c_1 and c_2 are tuned coefficients, which are hyperparameters. When updating the policy in the PPO algorithm, a sample of K timesteps (K is less than the length of the episode) is collected by running the current policy. These data are used to calculate the loss function in (21), wherein the estimated advantage function is written as:

$\hat{A}(s_{t}, a_{t}) = \delta_{t} + (\gamma \lambda)\, \delta_{t+1} + \cdots + (\gamma \lambda)^{K-t+1}\, \delta_{K-1},\quad \delta_{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t})$  (22)

where λ is another discount factor for the advantage function. In each iteration, M actors collect the data. The implemented pseudo-code of the PPO algorithm is depicted in Table 2; this algorithm has been shown to produce excellent results in several control tasks [37-38]. In the energy management problem, the state variables and control action are specified in (8). The transition model of the vehicle speed and acceleration is provided by the real-world driving data, the transition model of SOC is represented by (5), and the reward is the instantaneous cost function in (7). Table 2. Implemented pseudo-code of the PPO algorithm.
PPO Algorithm, Actor-Critic Style
For iteration = 1, 2, … do
    For actor = 1, 2, …, M do
        Run policy π_θold in the environment for K timesteps
        Calculate the advantage estimates Â_1, …, Â_K based on (22)
    end for
    Optimize the loss function in (21) with respect to θ for Z epochs
    Update θ_old with θ
end for

The default parameters of the applied PPO algorithm are defined as follows: the discount factor γ and learning rate α in the RL framework are 0.9 and 0.01, respectively. The number of timesteps K is 512, the mini-batch size Z is 64, the hyperparameter ε is 0.2, and the discount coefficient λ for the advantage function is 0.92. The hyperparameters c_1 and c_2 in (21) are 0.5 and 0.01. After training, the PPO-based EMS is generated, and the related parameters of the neural network are stored. For a new driving cycle, the learned parameters are transferred to accelerate learning. The TL method is introduced in the next subsection.

3.3 TL Technique

In this paper, the transfer of the EMS for the HEV means transferring the learned neural networks from the source driving cycles to the target driving cycles. The essence is reusing the parameters of the actor-critic network. For different driving cycles, the resulting EMSs are disparate; however, the definitions of the reward, state variables, control action, and transition model are the same, and the leading EMS is still the power-split control between the ICE and battery. Hence, the parameters of the actor-critic network can be transferred to improve learning efficiency. The neural network related to the source driving cycles is named the expert network, and the neural network related to the target driving cycles is the student network. In this work, the number of source driving cycles is larger than that of target driving cycles. The source driving cycles may or may not comprise the target driving cycles. The weights and biases of the expert network are trained and obtained first.
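The transfer step itself amounts to a parameter copy. A minimal sketch, with the expert and student networks represented as plain dicts of parameter arrays rather than any specific deep-learning framework's objects:

```python
import copy

def transfer_parameters(expert, student):
    """Initialize the student with every actor-critic parameter of the expert.

    A deep copy is used so that subsequent fine-tuning of the student on the
    target driving cycles does not alter the stored expert network.
    """
    for name in student:
        student[name] = copy.deepcopy(expert[name])
    return student
```

In a framework such as PyTorch or TensorFlow, the same idea would be expressed by loading the expert's saved weights into the student model before fine-tuning begins.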
These parameters serve as the initialization for training the EMS on the target driving cycles. All the parameters of the actor and critic networks are replaced, and the hyperparameters stay the same across the two training processes. This operation is capable of maintaining the control performance and elevating the learning efficiency on the target driving cycles. In this work, the number of source driving cycles is set to 5, 10, and 20, respectively, and the effect of the amount of training data in the source domain on the control performance is evaluated. These driving cycles are all extracted from one dataset to guarantee their correlation. For a fixed number, the source driving cycles are designed either to contain the target driving cycles or not; this layout estimates the influence of knowledge of the target driving cycles on the efficacy of the EMS. In the next section, the proposed transfer DRL-based EMS is assessed in multiple simulation tests.
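Before moving to the experiments, the two quantities at the heart of the PPO update of Section 3.2 — the clipped surrogate of Eq. (20) and the truncated advantage estimate of Eq. (22) — can be sketched numerically. This is an illustrative NumPy sketch with the hyperparameter values listed above (γ = 0.9, λ = 0.92, ε = 0.2); function names are not from the paper.

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A), Eq. (20)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

def advantages(rewards, values, gamma=0.9, lam=0.92):
    """Truncated advantage estimates over a K-step rollout, Eq. (22).

    `values` has length K + 1: the state values of the K visited states plus
    a bootstrap value for the final state.
    """
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals delta_t
    adv = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(deltas))):
        acc = deltas[t] + gamma * lam * acc              # backward recursion
        adv[t] = acc
    return adv
```

The clipping means that once the probability ratio leaves [1−ε, 1+ε] in the direction favored by the advantage, the objective stops rewarding further policy movement for that sample.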
4. Simulation Results and Analysis
This section evaluates the transfer PPO-enabled EMS for the HEV in three aspects. First, the number of source driving cycles is varied, and the performance of the relevant EMS is discussed. Then, the presence of the target driving cycles in the source set is analyzed, i.e., whether the source driving cycles include the target driving cycles or not. Finally, the PPO algorithm with and without the TL method is compared; this experiment evaluates the ability of TL to accelerate learning.
4.1 Influence of Source Driving Cycles
This subsection discusses the impact of the number of source driving cycles on the EMS of the HEV. The number of source driving cycles is defined as 5, 10, and 20, respectively. For convenience, the source driving cycles and target driving cycles in the transfer DRL-based EMS are abbreviated as SDC and TDC. Two real-world driving cycles are chosen as the TDC, and these TDC stay the same throughout the analysis in Section 4. Fig. 4 depicts the speed trajectories of the two TDC. The learned neural networks from different SDC are transferred to train the EMSs for these two TDC. In this subsection, all three cases of SDC include the TDC.
Fig. 4. Two target driving cycles used in the discussion and analysis in Section 4.
To describe the effectiveness of the EMS constructed using the proposed method, the state variables and control action are compared and analyzed. Since the vehicle speed and acceleration are extracted from the real-world driving data, Fig. 5 shows the SOC curves for the two selected TDC. In each TDC, three control cases are enumerated; the main difference is the number of source driving cycles, which represents the source knowledge in the presented transfer DRL approach. From Fig. 5, it is evident that the SOC variations are not the same for the two TDC. For target driving cycle 1, for example, the SOC of the 20 SDC case is totally different from those of the 5 SDC and 10 SDC cases in the time interval [400, 700]; for target driving cycle 2, the corresponding time horizon is [400, 800]. Moreover, it can be discerned that the vehicle speed is very high around 750 s in TDC 1 and around 200 s in TDC 2. This means the power demand of the powertrain is high, and thus the ICE and battery should provide power simultaneously; as a consequence, the SOC declines at these time steps. This implies that the control logic of the different EMSs is the same, but the particular control actions differ for different numbers of SDC.
Fig. 5. SOC trajectories for two TDC in different control cases.
After exhibiting the SOC graphs, the power-split controls between the ICE and battery are given in Fig. 6. The first row is for TDC 1, and the second row is for TDC 2. The ICE powers in TDC 1 are usually lower than those in TDC 2. This is caused by the vehicle speeds in Fig. 4, where the speed of TDC 2 is generally higher than that of TDC 1. For TDC 1, the power splits with respect to different numbers of SDC are different; especially in the time range [300, 600] (shown in the purple rectangle), the ICE and battery powers differ. For TDC 2, the power distribution between the ICE and battery is also not the same: as marked by the purple ellipse, the ICE power variations differ in the time horizon [400, 800]. This distinction is decided by the stated control action, the ICE torque. Since the power demand is fixed for each TDC, the ICE torque determines the ICE power and thus affects the battery power. Furthermore, the SOC is determined by the battery power, which is why the SOC curves differ in Fig. 5. This difference also influences the reward, as shown in what follows.
Fig. 6. Power distribution between ICE and battery in three control cases for two TDC.
Fig. 7. Different control policies in the energy management problem for TDC 1 and TDC 2.
Fig. 7 compares the trajectories of the control action (ICE torque T_ICE) among the disparate control cases. For each TDC, the three control cases are labeled as 5 SDC, 10 SDC, and 20 SDC. It can be found that the control actions at some particular time steps are different. When the ICE torque is zero, the ICE does not work; thus, the working conditions of the ICE are nearly the same for the different SDC cases. However, the output torques are diverse at the same time instants. This is attributed to the neural networks transferred from the SDC: since the transferred knowledge is different, the obtained EMSs are not the same, which is reflected by the trajectories of the control actions. The zoom-in figures show the differences in the particular time interval [0, 300] for the two TDC. Finally, the total reward (defined in (7)) is the most important metric in the DRL-based energy management problem. Fig. 8 lists the total reward of the different control cases in the two TDC. In this graph, the X-axis indicates the different SDC situations, and the Y-axis represents the two TDC; hence, the first row shows the different control cases for TDC 1, and the second row is for TDC 2. For each TDC, the 20 SDC case leads to the highest total reward, which represents the best control performance. As the number of SDC increases, the total reward becomes higher. This feature is caused by the driving information contained in the SDC: more source driving cycles result in a more mature neural network, and based on this network, the transferred EMS is better. In conclusion, a larger number of SDC is preferable in the proposed transfer DRL-enabled EMS. Thus, in the following discussion, the number of SDC is fixed at 20 in all control cases.
Fig. 8. Total reward with respect to different control cases for two TDC.
Fig. 9. SOC curves for evaluation of the significance of the TDC information.
Fig. 10. Power distribution between ICE and battery in two control cases for two TDC.
Then, the power distribution between the two onboard energy sources (ICE and battery) is given in Fig. 10. The boxplot describes the minimum, median, and maximum values of the ICE and battery power (Bat for short). For TDC 1, several characteristics can be observed. The ICE power is always greater than zero, and the battery power fluctuates around zero, meaning the battery can operate in either charging or discharging conditions. Furthermore, in the case of the SDC including the TDC, the median ICE power is more prominent than in the other case, implying that the ICE is able to work in the high-efficiency area to achieve high fuel economy. For TDC 2, similar features can be discovered. Since the sum of the ICE and battery powers equals the power demand, the median battery power is lower in the including case (i.e., the SDC including the TDC information). This pattern of power distribution affects the related reward, as exhibited in the following content.
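The boxplot summaries described above reduce each power trace to its minimum, median, and maximum. A minimal sketch of that reduction, on an illustrative (made-up) battery-power trace where negative values denote charging:

```python
from statistics import median

def power_summary(trace):
    """Return (min, median, max) of a power trace in kW."""
    return (min(trace), median(trace), max(trace))

# Hypothetical battery-power samples: charging (<0) and discharging (>0)
bat_power = [-4.0, 0.0, 2.5, 7.0, -1.5]
low, mid, high = power_summary(bat_power)
```

A median near zero, as in this sketch, matches the observation that the battery alternates between charging and discharging while the ICE carries the base load.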
Fig. 11. Control strategies for two cases (SDC including TDC or not) for two TDC.
Furthermore, the power distribution is directly influenced by the control action (ICE torque), and the relevant ICE torque trajectories are depicted in Fig. 11. It is apparent that the working conditions of the ICE are influenced by the driving cycles. When the vehicle speed in Fig. 4 is zero, the ICE usually stops providing power, and the powertrain works in the electric mode (power is provided by the battery). For each TDC, the zoom-in graph shows that the control policies of the two control cases are different, indicating that the related EMSs for the HEV are not the same. This feature reflects that the information transferred from the two source tasks is different. The corresponding reward explains which transferred neural networks are better in the energy management problem. Finally, the reward variations in the two control cases of the two TDC are sketched in Fig. 12. It can be discerned that the reward is lower when the vehicle speed is higher: at these time points, the ICE needs to work to provide power and the SOC decreases, so from the cost function in (7) the reward is lower at these time instants. For TDC 1, the total rewards in the not-including and including cases are -121.7403 and -118.0950, respectively; for TDC 2, the related rewards are -416.8146 and -405.101, respectively. It can be recognized that the including case leads to better control performance, which indicates that the TDC information is able to improve the effectiveness of the transfer PPO-based EMS. In the energy management problem, future driving information (e.g., vehicle speed, acceleration, and road slope) is essential for the performance of the EMS. Predictive techniques will be incorporated into the proposed control framework to further enhance the control performance in future work.
Fig. 12. Reward trajectories in the compared control cases with respect to two TDC.
Fig. 13. Cumulative reward in two control cases: PPO with TL and PPO without TL.
Fig. 14. Different types of the loss function to highlight the advantages of the TL method.
Another factor representing the learning efficiency of DRL methods is the value of the loss function, which indicates the convergence rate of the relevant algorithms. Four representations of the loss function are given in Fig. 14: original points, primary lines, smooth approximation, and linear approximation. From this figure, the transferred parameters reduce the initial value of the loss function, and in each episode, the loss in the PPO with TL case is lower than that in PPO without TL. This feature reflects that more mature network parameters are achieved by PPO with TL. In the graphs of the smooth approximation and linear approximation, the loss of PPO with TL obviously converges faster than that of PPO without TL, which shows that the proposed transfer PPO-based EMS has a faster convergence rate. Based on the testing results in Figs. 13 and 14, it can be concluded that the TL method is able to improve the learning efficiency and guarantee the control performance. By reducing the computational time, the proposed transfer PPO-based EMS has the potential to be applied in a real-time energy management controller.
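A common way to produce a "smooth approximation" of a noisy loss trace such as the one plotted in Fig. 14 is an exponential moving average; the paper does not state which smoother was used, so this sketch and its smoothing factor are assumptions for illustration.

```python
def smooth_loss(losses, beta=0.9):
    """Exponential moving average of a loss trace; larger beta
    means heavier smoothing."""
    smoothed, running = [], losses[0]
    for x in losses:
        running = beta * running + (1.0 - beta) * x
        smoothed.append(running)
    return smoothed
```

Applied to the raw per-episode losses of both runs, such a smoother makes the lower starting value and faster decay of the PPO-with-TL curve easier to compare visually.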
5. Conclusion
A data-driven energy management policy is presented for the HEV using the transfer DRL method in this work. The PPO algorithm is applied to address the energy management problem with a continuous action space, and the TL method is utilized to transfer the learned neural networks from source tasks to target tasks. The relevant driving cycles are extracted from a real-world driving dataset, which implies that the derived EMS is suitable for real-time implementation. The simulation results discuss the impact of the source and target driving cycles on the EMS, and the merits of the TL method are emphasized. Future work will focus on two aspects to promote the proposed energy management framework for the HEV. First, future driving information will be considered: deep learning approaches will be used to forecast the future vehicle speed and acceleration, and this information is expected to improve the fuel economy of the EMS. Second, the real-time application of the EMS will be validated: related hardware-in-the-loop (HIL) experiments will be conducted to assess the online characteristics of the presented energy management policy for multiple hybrid powertrains.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (NSFC 51575064), the State Key Laboratory of Mechanical System and Vibration (Grant No. MSV202016), and the NNSF (Grant No. 11701027).
References
[1] Zou Y, Liu T, Sun F, Peng H. Comparative study of dynamic programming and Pontryagin's minimum principle on energy management for a parallel hybrid electric vehicle. Energies 2013;6(4):2305-2318. https://doi.org/10.3390/en6042305.
[2] Liu T, Hu X, Li S E, Cao D. Reinforcement learning optimized look-ahead energy management of a parallel hybrid electric vehicle. IEEE/ASME Trans Mechatron 2017;22(4):1497-1507. https://doi.org/10.1109/TMECH.2017.2707338.
[3] Huang Y, Wang H, Khajepour A, Li B, Ji J, Zhao K, Hu C. A review of power management strategies and component sizing methods for hybrid vehicles. Renew Sustain Energy Rev 2018;96:32-144. https://doi.org/10.1016/j.rser.2018.07.020.
[4] Banvait H, Anwar S, Chen Y. A rule-based energy management strategy for plug-in hybrid electric vehicle (PHEV). 2009 American Control Conference (ACC). IEEE. 2009. p. 3938-3943.
[5] Wu J, Zou Y, Zhang X, Liu T, Kong Z, He D. An online correction predictive EMS for a hybrid electric tracked vehicle based on dynamic programming and reinforcement learning. IEEE Access 2019;7:98252-98266. https://doi.org/10.1109/ACCESS.2019.2926203.
[6] Hu X, Liu T, Qi X, Barth M. Reinforcement learning for hybrid and plug-in hybrid electric vehicle energy management: Recent advances and prospects. IEEE Ind Electron Mag 2019;13(3):16-25. https://doi.org/10.1109/MIE.2019.2913015.
[7] Liu T, Hu X. A bi-level control for energy efficiency improvement of a hybrid tracked vehicle. IEEE Trans Ind Informat 2018;14(4):1616-1625. https://doi.org/10.1109/TII.2018.2797322.
[8] Liu T, Zou Y, Liu D, Sun F. Reinforcement learning of adaptive energy management with transition probability for a hybrid electric tracked vehicle. IEEE Trans Ind Electron 2015;62(12):7837-7846. https://doi.org/10.1109/TIE.2015.2475419.
[9] Liu T, Hu X, Hu W, Zou Y. A heuristic planning reinforcement learning-based energy management for power-split plug-in hybrid electric vehicles. IEEE Trans Ind Informat 2019;15(12):6436-6445. https://doi.org/10.1109/TII.2019.2903098.
[10] Zhou Y, Ravey A, Péra M C. A survey on driving prediction techniques for predictive energy management of plug-in hybrid electric vehicles. J Power Sources 2019;412:480-495. https://doi.org/10.1016/j.jpowsour.2018.11.085.
[11] Inuzuka S, Xu F, Zhang B, Shen T. Reinforcement learning based on energy management strategy for HEVs. 2019 IEEE Vehicle Power and Propulsion Conference (VPPC). IEEE. 2019. p. 1-6. https://doi.org/10.1109/VPPC46532.2019.8952511.
[12] Liu T, Wang B, Yang C. Online Markov Chain-based energy management for a hybrid tracked vehicle with speedy Q-learning. Energy 2018;160:544-555. https://doi.org/10.1016/j.energy.2018.07.022.
[13] Wang P, Li Y, Shekhar S, Northrop W F. A deep reinforcement learning framework for energy management of extended range electric delivery vehicles. 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE. 2019. p. 1837-1842. https://doi.org/10.1109/IVS.2019.8813890.
[14] Wu J, He H, Peng J, Li Y, Li Z. Continuous reinforcement learning of energy management with deep Q network for a power split hybrid electric bus. Appl Energy 2018;222:799-811. https://doi.org/10.1016/j.apenergy.2018.03.104.
[15] Han X, He H, Wu J, Peng J, Li Y. Energy management based on reinforcement learning with double deep Q-learning for a hybrid electric tracked vehicle. Appl Energy 2019;254:113708. https://doi.org/10.1016/j.apenergy.2019.113708.
[16] Qi X, Luo Y, Wu G, Boriboonsomsin K, Barth M. Deep reinforcement learning enabled self-learning control for energy efficient driving. Transp Res Part C Emerg Technol 2019;99:67-81. https://doi.org/10.1016/j.trc.2018.12.018.
[17] Liessner R, Schmitt J, Dietermann A, Bäker B. Hyperparameter optimization for deep reinforcement learning in vehicle energy management. Proceedings of the 11th International Conference on Agents and Artificial Intelligence (ICAART). 2019. p. 134-144.
[18] Lian R, Peng J, Wu Y, Tan H, Zhang H. Rule-interposing deep reinforcement learning based energy management strategy for power-split hybrid electric vehicle. Energy 2020. https://doi.org/10.1016/j.energy.2020.117297.
[19] Pan S J, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng 2009;22(10):1345-1359. https://doi.org/10.1109/TKDE.2009.191.
[20] Mallick T, Balaprakash P, Rask E, Macfarlane J. Transfer learning with graph neural networks for short-term highway traffic forecasting, 2020, arXiv preprint arXiv:2004.08038.
[21] Nguyen D, Sridharan S, Nguyen D T, Denman S, Tran S N, Zeng R, Fookes C. Joint deep cross-domain transfer learning for emotion recognition, 2020, arXiv preprint arXiv:2003.11136.
[22] Pereira G T, Santos M, Alcobaca E, Mantovani R, Carvalho A. Transfer learning for algorithm recommendation, 2019, arXiv preprint arXiv:1910.07012.
[23] Guo X, Liu T, Tang B, Tang X, Zhang J, Tan W, Jin S. Transfer deep reinforcement learning-enabled energy management strategy for hybrid tracked vehicle, 2020, arXiv preprint arXiv:2007.08690.
[24] Lian R, Tan H, Peng J, Li Q, Wu Y. Cross-type transfer for deep reinforcement learning based hybrid electric vehicle energy management. IEEE Trans Veh Technol 2020;69(8):8367-8380. https://doi.org/10.1109/TVT.2020.2999263.
[25] Isele D, Cosgun A. Transferring autonomous driving knowledge on simulated and real intersections, 2017, arXiv preprint arXiv:1712.01106.
[26] Cai L, Sun Q, Xu T, Ma Y, Chen Z. Multi-AUV collaborative target recognition based on transfer-reinforcement learning. IEEE Access 2020;8:39273-39284. https://doi.org/10.1109/ACCESS.2020.2976121.
[27] Liu T, Zou Y, Liu D. Energy management for battery electric vehicle with automated mechanical transmission. Int J Veh Des 2016;70(1):98-112. https://doi.org/10.1504/IJVD.2016.073701.
[29] Andre M, Hickman A J, Hassel D, Joumard. Driving cycles for emission measurements under European conditions. SAE Transactions 1995:562-574. https://doi.org/10.2307/44615107.
[30] Transportation Secure Data Center. 2011.
[31] Gonder J, Burton E, Murakami E. Archiving data from new survey technologies: Enabling research with high-precision data while preserving participant privacy. Transportation Research Procedia 2015;11:85-97. https://doi.org/10.1016/j.trpro.2015.12.008.
[32] Liu T, Tang X, Wang H, Yu H, Hu X. Adaptive hierarchical energy management design for a plug-in hybrid electric vehicle. IEEE Trans Veh Technol 2019;68(12):11513-11522. https://doi.org/10.1109/TVT.2019.2926733.
[33] Hua H, Qin Y, Hao C, Cao J. Optimal energy management strategies for energy Internet via deep reinforcement learning approach. Appl Energy 2019;239:598-609. https://doi.org/10.1016/j.apenergy.2019.01.145.
[34] Liu T, Yu H, Guo H, Qin Y, Zou Y. Online energy management for multimode plug-in hybrid electric vehicles. IEEE Trans Industr Inform 2019;15(7):4352-4361. https://doi.org/10.1109/TII.2018.2880897.
[35] Sutton R S, Barto A G. Reinforcement learning: An introduction. MIT Press, 2018.
[36] Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal policy optimization algorithms, 2017, arXiv preprint arXiv:1707.06347.