Application of twin delayed deep deterministic policy gradient learning for the control of transesterification process
Tanuja Joshi, Shikhar Makker, Hariprasad Kodamana, Harikumar Kandath
A Preprint
February 26, 2021

Tanuja Joshi, Shikhar Makker, Hariprasad Kodamana
Department of Chemical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi
[email protected]
Harikumar Kandath
International Institute of Information Technology, Hyderabad
[email protected]

Abstract
Persistent depletion of fossil fuels has encouraged mankind to look for alternative fuels that are renewable and environment-friendly. One of the promising renewable alternatives to fossil fuels is bio-diesel produced by means of the batch transesterification process. Control of the batch transesterification process is difficult due to its complex and non-linear dynamics. It is expected that some of these challenges can be addressed by developing control strategies that directly interact with the process and learn from experience. To this end, this study explores the feasibility of reinforcement learning (RL) based control of the batch transesterification process. In particular, the present study exploits the application of twin delayed deep deterministic policy gradient (TD3) based RL for the continuous control of the batch transesterification process. The results showcase that the TD3 based controller is able to control the batch transesterification process and can be a promising direction towards the goal of artificial intelligence based control in process industries.

Keywords: Reinforcement learning · deep Q-learning · deep deterministic policy gradient · batch transesterification

Gradual depletion of fossil fuels has mandated the need for alternative eco-friendly fuels owing to environmental sustainability and low carbon emissions. One of the promising and renewable alternatives to diesel fuel that has been gaining much attention is bio-diesel. It is obtained from various plant-based sources such as vegetable oil, soyabean oil, palm oil, waste cooking oil, animal fats, etc. [1, 2]. Bio-diesel is produced via transesterification, wherein triglycerides from fatty acids react with alcohols in the presence of a catalyst to produce glycerol and methyl esters [3]. Industrial-scale production of bio-diesel is carried out in both continuous stirred tank reactors and batch reactors. However, studies have emphasised the usefulness of the batch process because of various advantages such as low capital cost, low raw material cost and flexibility in operation [4]. The operation of the transesterification process grapples with several challenges due to the unique characteristics of the batch process, such as non-linearity due to the temperature-dependent kinetics, time-varying dynamics and a broad
range of operating conditions. Therefore, the control and optimisation of the batch transesterification process is a very challenging task. Reaction kinetics suggests that biodiesel production is affected mainly by four factors: reaction time, molar ratio, catalyst, and reaction temperature [5]. The reaction temperature has a significant effect on the biodiesel yield and is, therefore, the primary control variable in the existing literature on the control of the batch process [6, 7, 8]. Hence, one of the primary goals of the control of transesterification is to track the desired temperature trajectory optimally. Ultimately, the yield of the product depends mostly on the choice of the controller, its capability, and its tuning parameters. Even if the controller is appropriately tuned for an operating condition, drifts in the process parameters can significantly deteriorate the controller performance over time.

To overcome some of these issues, advanced control strategies such as model predictive control (MPC), iterative learning control (ILC) and non-linear MPC (NMPC) have been presented in the literature for the control of transesterification [9, 10, 11, 12, 13, 14, 15, 7, 16, 17]. The performance of a model-based controller is heavily dependent on the accuracy of the underlying model of the process. Even the slightest inaccuracies in the model will result in plant-model mismatch and, consequently, inaccurate prediction of the concentration profiles of the species from the model. Further, optimization-based control algorithms are computationally demanding, as the computation of the optimal input sequence involves on-line optimization at each time step. Despite the advances in numerical methods and computational hardware, this is still a challenging task for complex non-linear, high-dimensional dynamical systems. Hence, there is much incentive if a control strategy can directly interact with the process trajectory and provide a control solution for on-line course correction.

To this extent, it is useful to explore the feasibility of reinforcement learning (RL) as a potential paradigm for the control of the transesterification process. Contrary to classical controllers, RL based controllers do not require a process model or control law; instead, they learn the dynamics of the process by directly interacting with the operational environment [18, 19]. As a result, the controller performance is not contingent on a high-fidelity model of the process. Further, as opposed to traditional controllers, an RL based controller learns from experience and past history, thus improving the control policy at every step. As an earlier proposition in this direction, an approximate dynamic programming (ADP) approach that learns the optimal 'cost-to-go' function has been proposed for the optimal control of non-linear systems [20, 21, 22]. Following this, an ADP based 'Q-learning' approach that learns the optimal policy using value iteration in a model-less manner has also been proposed [20]. Unlike traditional applications of 'Q-learning', the state-space and action-space are continuous for process control applications. This poses a major limitation to the application of conventional Q-learning methods in this context. For systems with continuous state-space, Mnih et al. combined Q-learning with deep neural networks for function approximation and developed the deep Q network (DQN) [23].
Inspired by these, some of the recent works related to RL applications in chemical processes have attempted to optimize the control policy by applying Q-learning and deep Q-learning for applications such as chromatography and polymerization [24, 25]. Although the advent of DQN based RL was a major breakthrough in applications to systems with continuous state-space, its utility for systems with continuous action-space was still limited. Another approach to solving the RL problem is by employing policy gradient (PG) methods, where instead of evaluating value functions to find an optimal policy, the optimal policy is evaluated directly. This approach is particularly well-suited to problems where both the state space and the action space are continuous. For instance, Petsagkourakis et al. applied the PG algorithm to find the optimal policy for a batch bioprocess by using principles of transfer learning [26]. However, PG methods suffer from the drawback of noisy gradients due to high variance, leading to slow convergence. Actor-critic methods reduce the variance of the gradient estimates by exploiting a critic network and have been the widely used framework for dealing with continuous action spaces. DDPG is one of the actor-critic algorithms that has enjoyed huge success [27]. Indeed, DDPG was the first efficient algorithm used to solve high-dimensional continuous control tasks. It effectively combines the architecture of DQN and the deterministic policy gradient (DPG). Recent works have shown the application of the DDPG algorithm for chemical process control as well. For example, Ma et al. have proposed a DDPG based controller for the control of semi-batch polymerization [28], and Spielberg et al. have applied the DDPG algorithm to SISO and MIMO linear systems [29, 30]. For a detailed review of the application of RL to chemical control problems, the readers are referred to [31, 19, 32, 33].

Fujimoto et al. [26] have shown that the DDPG algorithm suffers from a critical issue of overestimation of network bias due to function approximation error, leading to sub-optimal policies. The authors have also provided an approach to address the function approximation error in actor-critic methods and termed the new algorithm Twin Delayed Deep Deterministic Policy Gradient (TD3). Some recent works involving the application of the TD3 algorithm include motion planning of robot manipulators and a half-cheetah robot acting as an intelligent agent running across a field [27, 28]. Inspired by these, this paper presents a comparative simulation study of the two actor-critic methods, namely TD3 and DDPG, for the optimal control of the reactor temperature in the batch transesterification process. Specifically, we propose a control strategy based on TD3 for the control of the reactor temperature in batch transesterification processes. We also propose novel reward function structures, analogous to PI and PID controllers, to aid the agent in better learning.
We have also compared the results of the actor-critic methods with RL algorithms with discrete action spaces. To the best of our knowledge, (i) the TD3 algorithm has not been applied for the control of a chemical process, and (ii) such a comprehensive study involving the comparison of RL based algorithms with continuous and discrete action spaces has also not been reported in the literature in the context of process control. The analysis shows the advantages of the TD3 algorithm in terms of tracking performance, control effort and computational cost when compared to the well-known DDPG algorithm in controlling the transesterification process.

The rest of the paper is divided into the following sections. Section 2 presents the background of RL useful for developing a TD3 based controller. Section 3 explains the TD3 algorithm for the control of the transesterification process. Section 4 discusses the methodology adopted for implementing our algorithm. Section 5 shows the application of the proposed controller and important results for the control of the batch transesterification process with continuous and discrete action spaces. Section 6 draws conclusive remarks from the study.
A standard RL framework consists of three main components, namely the agent, the environment (E) and the reward (r) [29]. At each time step t, the agent performs an action a_t based on the state s_t of the environment and receives a scalar reward r_t; as a result, the environment moves to a new state s_{t+1}. The objective of the RL problem is to find the sequence of control actions A := {a_0, a_1, a_2, ...} that maximises the expected discounted reward given below:

\arg\max_{A} \; \mathbb{E}\Big[ R_t := \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \Big]    (1)

where 0 < γ < 1 is the discount factor and E denotes the expectation operator, which is applied to the discounted reward due to the stochastic nature of the process dynamics. However, an explicit solution of Eq. (1) is tedious to obtain. Q-learning is an iterative algorithm to solve the RL problem over a finite set of actions. The Q-value of a state-action pair, Q^π(s_t, a_t), is the expected return after performing an action a_t at a state s_t following a policy π : S → A:

Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \;\Big|\; s_t = s,\, a_t = a \Big]    (2)

where S = {s_0, s_1, s_2, ...}. The objective of Q-learning is to find the optimal policy (π*) by learning the optimal Q-value, Q*(s_t, a_t), which is the maximum expected return achievable by any policy for a state-action pair. The optimal Q-value, Q*(s_t, a_t), must satisfy the Bellman optimality equation [30] given as:

Q^{*}(s_t, a_t) = \mathbb{E}\big[ r_t + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1}) \;\big|\; s_t = s,\, a_t = a \big]    (3)

where s_{t+1} and a_{t+1} are the state and action at the next time step. The Q-learning algorithm iteratively updates the Q-value for each state-action pair until the Q-function converges to the optimal Q-function. This is known as value iteration and is given as:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big)    (4)

Major encumbrances of traditional Q-learning are: (i) its application is limited only to problems with discrete state and action spaces; (ii) the computational difficulty faced while dealing with large state spaces owing to the large size of the Q-matrix. The former problem can be circumvented by employing a function approximator for modelling the relation between the Q-value and state-action pairs. Deep Q-learning [23] is an RL framework wherein a deep neural network (DNN) is used for approximating the optimal Q-values. Since the goal is to find an optimal policy via learning an optimal Q-value, the DNN is termed the policy network. A separate target network is used to estimate the target Q-value for the policy network. The weights of the target network are updated periodically with the updated weights of the policy network to make the learning stable. To address the problem of correlated sequences, DQN uses a replay buffer or experience replay memory, which has a pre-defined capacity, where all the past experiences are stored as the transition tuple (s := s_t, a := a_t, r := r_t, s' := s_{t+1}). The DQN uses past experiences to train the policy network by selecting suitable mini-batches from the replay buffer. The state is given as the input to the policy network, and the network outputs the Q-values corresponding to all possible actions in the action space. The loss is calculated as the mean square error (MSE) between the current Q-value and the target Q-value as given in Eq. (5):
\mathrm{Loss} = \mathbb{E}\big[ ( Q^{*}(s,a) - Q(s,a) )^{2} \big]    (5)
\phantom{\mathrm{Loss}} = \mathbb{E}\big[ ( r + \gamma \max_{a'} Q_{\phi,T}(s',a') - Q_{\phi}(s,a) )^{2} \big]    (6)

where φ represents the parameters of the network.
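To make the DQN update concrete, the following is a minimal PyTorch sketch of the loss in Eqs. (5)-(6); the network sizes, state and action dimensions, and the dummy mini-batch are illustrative placeholders, not the settings used in this work.

```python
import torch
import torch.nn as nn

# Policy network Q_phi and target network Q_{phi,T}; sizes are illustrative.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 5))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 5))
target_net.load_state_dict(q_net.state_dict())   # periodic copy of policy weights

def dqn_loss(s, a, r, s_next, gamma=0.99):
    """MSE between Q_phi(s, a) and the target r + gamma * max_a' Q_{phi,T}(s', a')."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q-value of the taken action
    with torch.no_grad():                                    # the target is not differentiated
        target = r + gamma * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)

# Dummy mini-batch of 8 transitions sampled from the replay buffer
s, s_next = torch.randn(8, 4), torch.randn(8, 4)
a, r = torch.randint(0, 5, (8,)), torch.randn(8)
loss = dqn_loss(s, a, r, s_next)
loss.backward()
```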
The applications of RL algorithms such as Q-learning and DQN are limited to problems with discrete action spaces. Policy-based methods provide an alternative solution for continuous stochastic environments by directly optimizing the policy, taking the gradient of the objective function with respect to the policy parameters θ:

\nabla_{\theta} J(\pi_{\theta}) = \nabla_{\theta}\big( \mathbb{E}_{\tau \sim \pi_{\theta}}[ R(\tau) ] \big) = \mathbb{E}_{\tau \sim \pi_{\theta}}\Big[ \Big( \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \Big) R(\tau) \Big]    (7)

where π_θ is the stochastic parameterized policy and R(τ) is the return obtained from the trajectory τ = {s_0, a_0, s_1, a_1, ...}. The architecture of actor-critic algorithms is based on the policy gradient, making them amenable to continuous action spaces [31]. Policy-based (actor) methods suffer from the drawback of high-variance estimates of the gradient, which leads to slow learning. Value-based (critic) methods are an indirect means of optimizing the policy by optimizing the value function. Actor-critic algorithms combine the advantages of both actor-only (policy-based) and critic-only (value-based) methods and learn optimal estimates of both the policy and the value function. In actor-critic methods, the policy dictates the action based on the current state, and the critic evaluates the action taken by the actor based on the value function estimate. The parameterized policy is then updated using the value function via gradient ascent to improve the performance. The deterministic policy gradient (DPG) proposed by Silver et al. is an actor-critic off-policy algorithm used for continuous action spaces [32]. It uses the expected gradient of the Q-value function to evaluate the gradient of the objective function with respect to the parameters θ and find the optimal policy, as given below:

\nabla_{\theta} J(\mu_{\theta}) = \mathbb{E}\big[ \nabla_{\theta} Q^{\mu}(s,a) \big|_{a = \mu_{\theta}(s)} \big]    (8)

where μ is the deterministic policy. However, the authors only investigated the performance using linear function approximators to evaluate an unbiased estimate of the Q-value. The DPG algorithm was further extended by Lillicrap et al. to the deep deterministic policy gradient (DDPG) [33] by employing a DQN as a non-linear function approximator for the estimation of the Q-values. DDPG incorporates the merits of the experience replay buffer and target networks to learn stable and robust Q-values. In the actor part of the DDPG architecture, rather than directly copying the weights of the policy network into the target network, the target network weights are allowed to slowly track the policy network weights to improve stability. The critic part of DDPG uses a regular DQN to find the estimate of the Q-value by minimising the loss function.
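As a minimal illustration of the deterministic policy gradient of Eq. (8), the sketch below performs gradient ascent on Q(s, μ_θ(s)) with respect to the actor parameters; the network sizes, dimensions and learning rate are assumptions made for the example and are not the settings used in this study.

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))    # mu_theta(s)
critic = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))   # Q(s, a)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(16, 3)                                   # mini-batch of states
q_values = critic(torch.cat([states, actor(states)], dim=1))  # Q(s, mu_theta(s))
actor_loss = -q_values.mean()                                 # minimising -Q performs ascent on Q
actor_opt.zero_grad()
actor_loss.backward()    # chain rule yields grad_theta Q(s, a)|_{a = mu_theta(s)} as in Eq. (8)
actor_opt.step()
```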
Both value-based methods and actor-critic methods suffer from the problem of overestimation bias. This problem arises from the maximisation term in the target action-value function of the loss function. Since the agent is unaware of the environment in the beginning, it needs to first estimate Q(s, a) and then update the estimates further to learn an optimal policy. Since the estimates of Q(s, a) are likely to be very noisy, evaluating the maximum over the value function does not guarantee the selection of the optimal action. If the target itself is prone to error, then the value estimates are overestimated, and this bias is then propagated further through the Bellman equation during the update. Double Q-learning or Double DQN [34] was an attempt to provide a solution to this problem in the value-based framework, where actions are discrete, by separating the action selection step from the estimation of the value of the selected action. For example, Double DQN proposed by Hasselt et al. [35] uses two estimates of the Q-value: first, an online network is used for the selection of the action that gives the maximum Q-value, and, second, a target network estimates the Q-value based on this action. Fujimoto et al. [26] have proved that this overestimation-bias problem also exists in the actor-critic setting, where it leads to the selection of sub-optimal actions, resulting in poor policy updates. They have addressed this problem by introducing a variant of DQN in the actor-critic framework: the twin delayed deep deterministic policy gradient (TD3).

The TD3 algorithm is an extension of the DDPG algorithm with the following modifications to address some of the lacunae of DDPG. (i) To address the overestimation bias problem, the concept of clipped double Q-learning is used, wherein two Q-values are learned and the minimum of them is used for the approximation of the target Q-value. Thus, TD3 has two critic networks and corresponding critic target networks, reflecting the 'twin' term in its name. (ii) To reduce the high variance and noisy gradients while minimising the error per update, target networks are used to reduce error propagation by delaying the policy update until the convergence of the Q-value. This results in less frequent policy network updates than critic network updates. (iii) To reduce the variance in the target action values, target policy smoothing is performed by a regularisation technique in which clipped noise is added to the target action obtained from the policy.

The TD3 algorithm is applied for the control of the batch transesterification process to achieve the desired reactor temperature (T_ref). The mathematical model of the process is assumed as the environment. The model includes the kinetic model and the energy balance equations and is adapted from [7, 6].
The details of the model and the kinetic equations are omitted here for brevity. We propose a two-stage framework for developing a TD3 based controller for the control of the batch transesterification process, as shown in Fig. 1. The learning of the agent (controller) includes both offline learning and online learning steps. The agent performs the offline learning by interacting with the process model of the environment. We propose to use a concept similar to transfer learning [36] to adapt the trained actor and critic networks from the offline learning step in the online learning stage, enabling a warm start-up. This is expected to reduce the number of episodes required by the agent in the online stage to achieve convergence, making it suitable to apply in real time. The online learning starts with the trained actor and critic models, and the data obtained during the offline learning stage serves as the historical data for the agent in online learning. The historical data contains tuples of state (s), action (a), reward (r) and next state (s'), which are used as the replay memory. When the TD3 agent works in a closed-loop fashion, the actor model receives the initial state s from the plant and outputs the action a. This action is then injected into the plant, which then reaches a new state s' and outputs the reward r for taking the selected action. We obtain a transition tuple and add it to the experience replay memory (E). Training of the actor and the critic networks happens as per the TD3 algorithm steps. After the training, the actor network outputs the action a' for the new state s', which is again injected into the plant. These steps take place in an iterative fashion until the first batch run (b_i) completes. The updated model obtained for the b_i batch is used for the initialisation of the networks at the start time (t_start) of the b_{i+1} batch run. The whole process repeats for multiple batch runs. In this way, the agent is able to learn from the environment and thus achieve convergence. Figure 1 shows the schematic of the framework of the TD3 based controller for the control of the batch transesterification process. The following subsections contain elaborate discussions on the reward selection procedure, offline learning, and online learning for the batch transesterification process.

The reward function is an imperative constituent of RL, as the agent learns in the direction of increasing reward. We have considered two reward functions inspired by the functional structure of the PI and PID control laws, as illustrated below, to accommodate the time-varying nature of the error profiles (a computational sketch of both reward structures follows the list):

1. Reward Function 1 (PI)

r(t) =
\begin{cases}
c_{I1} + c_{I2}\, g\big( f(e(t)) \big) + c_{I3}\, g\Big( \sum_{t=0}^{k-1} f(e(t)) \Big), & \text{if } f(e(t)) < thr_1 \\
c_{I4} + c_{I5}\, g\Big( \sum_{t=0}^{k-1} f(e(t)) \Big), & \text{otherwise}
\end{cases}    (9)

In Eq. (9), c_{I1}, ..., c_{I5} are suitable tuning parameters, f(.) and g(.) are suitable functions of the error e(t) := x(t) - x_ref ∈ R^n, and thr_1 is a suitably chosen threshold parameter. For instance, the penalty function f(.) may take structural forms like ‖e(t)‖_{w,p}, the w-weighted p-norm of the vector space of e, while g is a suitable operator that converts the penalty to a reward. In this reward formulation, instead of calculating the reward based only on the current error e(t), we have also considered the effect of the historical error profile. The idea is basically to minimise the error that has been accumulated over time in the past.
Since this is inspired by the principle of a proportional-integral (PI) controller, we name it the PI Reward Function.

2. Reward Function 2 (PID)

r(t) =
\begin{cases}
c_{II1} + c_{II2}\, g\big( f(e(t)) \big) + c_{II3}\, g\Big( \sum_{t=0}^{k-1} f(e(t)) \Big) + c_{II4}\, g\big( c_{II5} + f(\Delta e(t)) \big), & \text{if } f(e(t)) < thr_2 \\
c_{II6} + c_{II7}\, g\Big( \sum_{t=0}^{k-1} f(e(t)) \Big) + c_{II8}\, g\big( c_{II9} + f(\Delta e(t)) \big), & \text{otherwise}
\end{cases}    (10)

Here, Eq. (10) contains an additional term compared to Eq. (9), Δe(t) := e(t) - e(t-1). This particular term brings information regarding the rate of change of the error profile into the reward calculation. The parameters c_{II1}, ..., c_{II9} and the threshold parameter thr_2 are to be tuned appropriately. Hence, we call this the proportional integral derivative (PID) reward function. In the next subsection, we proceed to illustrate the simulation steps required for the implementation of the TD3 based controller.
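The following is a minimal Python sketch of the two reward structures in Eqs. (9)-(10), assuming f(e) = |e| and g(x) = 1/x (the choices adopted later in Section 5); the constants c and the threshold thr here are illustrative placeholders rather than the tuned values used in this work.

```python
def f(e):                      # penalty on the tracking error
    return abs(e)

def g(x, eps=1e-6):            # operator converting a penalty into a reward
    return 1.0 / (x + eps)

def pi_reward(err_hist, c=(1.0, 1.0, 1.0, 1.0, 1.0), thr=5.0):
    """PI-type reward of Eq. (9): current error plus accumulated past error."""
    e_t = err_hist[-1]
    acc = sum(f(e) for e in err_hist[:-1])           # sum_{t=0}^{k-1} f(e(t))
    if f(e_t) < thr:
        return c[0] + c[1] * g(f(e_t)) + c[2] * g(acc)
    return c[3] + c[4] * g(acc)

def pid_reward(err_hist, c=(1.0,) * 9, thr=5.0):
    """PID-type reward of Eq. (10): adds the rate-of-change term Delta e(t)."""
    e_t = err_hist[-1]
    de = e_t - err_hist[-2] if len(err_hist) > 1 else 0.0
    acc = sum(f(e) for e in err_hist[:-1])
    if f(e_t) < thr:
        return c[0] + c[1] * g(f(e_t)) + c[2] * g(acc) + c[3] * g(c[4] + f(de))
    return c[5] + c[6] * g(acc) + c[7] * g(c[8] + f(de))

errors = [4.2, 3.1, 1.7, 0.8]    # e(t) = T_r(k) - T_ref over a few time steps
print(pi_reward(errors), pid_reward(errors))
```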
Figure 1: Schematic of the TD3 based controller for the batch transesterification process.

The detailed steps for the offline learning of the TD3 algorithm employed in our study are given below (a sketch of the replay buffer and the noisy action selection of steps 3-6 follows the list):

1. Build the actor network and the corresponding 'target actor network' with parameters φ_A and φ_{A,T}, where T and A denote the target network and the actor network, respectively. Initialize the target network with φ_{A,T} ← φ_A.

2. Build two critic networks and initialise them with parameters φ_{C1} and φ_{C2}, and build the corresponding 'target critic networks' with parameters φ_{C1,T} and φ_{C2,T}, where C denotes the critic network. Initialize the target networks with φ_{C1,T} ← φ_{C1} and φ_{C2,T} ← φ_{C2}.

3. Initialise the experience replay buffer (E) with a defined cardinality, say M.

4. Observe the initial state s and select the action a from the actor network with noise added to the action:

a = \mathrm{clip}\big( \mu_{\phi_A}(s) + \epsilon,\; a_{\min},\; a_{\max} \big)    (11)

where a_min and a_max represent the lower and upper bounds of the action, respectively, μ_{φ_A} is the parametrized deterministic policy and ε ∼ N(0, σ), with a suitably chosen σ.

5. Execute the action a by injecting it into the plant model and obtain the reward r := r(t) and the new state s'.

6. Add the obtained tuple of state, action, reward and next state (s, a, r := r(t), s') to the replay buffer (E).

7. Train both the actor and the critics according to Step 6 - Step 13 as detailed in the online learning algorithm (Section 3.3) for the desired number of episodes until convergence.

8. Save the updated actor and critic models after the training is completed. Let the updated actor network parameters and the corresponding target parameters be φ_A and φ_{A,T}, respectively. Similarly, the updated critic model parameters are denoted by φ_{C1} and φ_{C2}, and the corresponding 'target critic networks' are denoted by the parameters φ_{C1,T} and φ_{C2,T}, respectively. These networks are termed the trained models in the subsequent steps.
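A minimal sketch of steps 3-6, i.e. the replay buffer E and the clipped noisy action selection of Eq. (11), is given below; the buffer capacity, noise scale, action bounds and the stand-in policy are illustrative placeholders.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Experience replay memory E with a fixed cardinality M."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):          # store the transition tuple (s, a, r, s')
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):            # mini-batch used to train the actor and critics
        return random.sample(self.buffer, batch_size)

def select_action(actor, s, a_min, a_max, sigma=0.1):
    """a = clip(mu_phiA(s) + eps, a_min, a_max) with eps ~ N(0, sigma)."""
    a = actor(s) + np.random.normal(0.0, sigma)
    return float(np.clip(a, a_min, a_max))

buffer = ReplayBuffer(capacity=100_000)
actor = lambda s: 340.0                      # stand-in for the parametrized policy mu_phiA
a = select_action(actor, s=345.0, a_min=330.0, a_max=350.0)
buffer.add(345.0, a, 0.5, 344.2)             # (s, a, r, s') from one interaction
```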
Step 1: Use the trained actor and critic models obtained from offline learning as the actor network and the critic networks for the true plant.

Step 2: Initialise the experience replay buffer (E) with a suitable cardinality and add the tuples obtained from offline learning. Each tuple is composed of a state, action, reward and new state, i.e., (s, a, r, s').

Step 3: Observe the initial state s and select the action a from the actor network, adding Gaussian noise as a means of exploration during training. Also, clip the action a to the action range, as follows:

a = \mathrm{clip}\big( \mu_{\phi_A}(s) + \epsilon,\; a_{\min},\; a_{\max} \big)    (12)

where ε ∼ N(0, σ) with a suitable exploration noise variance σ.
Step 4: Execute the action a by injecting it into the true process and obtain the reward r and the new state s'.

Step 5: Add the obtained tuple of state, action, reward and next state (s, a, r, s') to the replay buffer (E).

Step 6: Sample a batch of transitions from the experience replay E.

Step 7: The target actor network outputs the deterministic action ã for the state s'; subsequently, clipped noise is added to this action. The action values are further clipped to ensure that they are in the valid action range:

\tilde{a} = \mathrm{clip}\big( \mu_{\phi_{A,T}}(s') + \mathrm{clip}(\epsilon, -c, c),\; a_{\min},\; a_{\max} \big)    (13)

where ε ∼ N(0, σ); c and σ are the noise clip and the policy noise variance, respectively.
Step 8: The state s' and the target action ã are given as input to the target Q-networks to estimate the target Q-values, Q_{φ_{C1,T}}(s', ã) and Q_{φ_{C2,T}}(s', ã). Select the minimum of the two Q-values to calculate the target value (TV), given as:

TV = r + \gamma \min_{i=1,2} Q_{\phi_{Ci,T}}(s', \tilde{a})    (14)
Step 9: Estimate the Q-values for the state-action pair (s, a), Q_{φ_{C1}}(s, a) and Q_{φ_{C2}}(s, a), and calculate the losses (cf. Eq. (5)) as below; a PyTorch sketch of Steps 7-12 is given after the step list:

\mathrm{Loss}_1 = \mathrm{MSE}\big( Q_{\phi_{C1}}(s,a),\, TV \big)    (15)
\phantom{\mathrm{Loss}_1} = \mathbb{E}\big[ ( Q_{\phi_{C1}}(s,a) - TV )^{2} \big]    (16)

\mathrm{Loss}_2 = \mathrm{MSE}\big( Q_{\phi_{C2}}(s,a),\, TV \big)    (17)
\phantom{\mathrm{Loss}_2} = \mathbb{E}\big[ ( Q_{\phi_{C2}}(s,a) - TV )^{2} \big]    (18)

\mathrm{Loss} = \mathrm{MSE}\big( Q_{\phi_{C1}}(s,a),\, TV \big) + \mathrm{MSE}\big( Q_{\phi_{C2}}(s,a),\, TV \big)    (19)
Step 10: Backpropagate the loss and update the critic network parameters φ_{C1} and φ_{C2} by stochastic gradient descent using a suitable optimizer.

Step 11: Every two iterations, update the actor network by performing gradient ascent on the Q-value of the first critic network, ∇_{φ_A} Q_{φ_{C1}}(s, μ_{φ_A}(s)).
Step 12: Update the weights of the critic targets, for i = 1, 2,

\phi_{Ci,T} \leftarrow \tau\, \phi_{Ci} + (1 - \tau)\, \phi_{Ci,T}    (20)

and of the actor target,

\phi_{A,T} \leftarrow \tau\, \phi_{A} + (1 - \tau)\, \phi_{A,T}    (21)

by Polyak averaging, where τ ∈ [0, 1] is a suitable target update rate.
Step 13: Obtain the new state, s → s', and repeat from Step 3 until the batch process completes at t = t_end, where t_end is the end time of the batch.

Step 14: Repeat Steps 1-13 for the subsequent batches. (Note: Step 1 now uses the updated model obtained from the previous batch for the initialisation of the networks at the start time of the next batch run, instead of the model obtained from offline learning. Also, in Step 2, E contains the tuples obtained from the previous batch run.)
This section discusses the numerical simulation results for the application of the TD3 based controller to the batch transesterification process. We have compared the performance of TD3 with other continuous action-space RL algorithms such as DDPG, and with discrete action-space RL algorithms such as DQN and Q-learning with a Gaussian process (GP) as the function approximator. The above-mentioned algorithms are trained using the two types of reward functions, namely the PI reward and the PID reward, respectively. We have also introduced batch-to-batch variations and evaluated the comparative performance of both the TD3 and DDPG algorithms in this section.
Herein, we present the training details of the RL based agent used for the control of the batch transesterification problem. A neural network consisting of 2 hidden layers with 400 and 300 hidden nodes, respectively, is used for both the actor network and the twin critic networks. The Rectified Linear Unit (ReLU) is used as the activation function between the hidden layers for both the actor and the critics. Further, a linear activation function is used for the output of the actor network. The network parameters are updated using the ADAM optimiser for both the actor and the critic networks. Both the state and the action are given as input to the critic network to estimate the Q-value. (A minimal sketch of this network architecture is given after Table 2 below.)

The implementation of the algorithm was done in Python 3.7.4, and the neural network framework was constructed using the PyTorch (for continuous action space) and Keras (for discrete action space) APIs. The mathematical model of the transesterification process was simulated in Matlab and integrated with Python via the Matlab engine. Table 1 lists the hyper-parameters used for the implementation of the TD3 algorithm.

In this study, the penalty function f(.) in the reward formulation (as discussed in Section 3) is taken as the absolute value of the error, |e(t)|, i.e., |T_r(k) - T_ref|, and g(.) is the inverse operator (g(.) = 1/f(.)). Here T_r is the reactor temperature and T_ref is the desired temperature. The threshold parameters thr_1 and thr_2 are fixed to be 5 in the present study. The constants c_{I1}, ..., c_{I5} of the PI reward and c_{II1}, ..., c_{II9} of the PID reward are set to fixed tuned values.

Table 1: Hyperparameters for the TD3 algorithm
Hyperparameter          Value
Discount factor         0.99
Policy noise            0.2
Exploration noise       0.1
Clipped noise           0.5
Actor learning rate     10e-3
Critic learning rate    10e-3
Target update rate      0.005
Policy frequency        2

Table 2 compares the average RMSE values for four different algorithms, namely TD3, DDPG, DQN and GP. A neural network and Gaussian process regression (GPR) are the candidate function approximators in DQN and GP, respectively. Further, the results are compared for the two types of reward functions considered, namely PI and PID, as discussed in Section 3. Here, the RMSE values reported are the average of the last four batches out of a total of 10 batches. It can be seen that the TD3 based controller has the lowest RMSE, 1.1526 and 1.1738 for the PID and PI reward functions, respectively, as compared to DDPG and the discrete action-space algorithms. Additionally, it can be seen that the PID reward function is the better choice of reward function due to its lower RMSE for all four algorithms.

Table 2: RMSE comparison of four different RL algorithms for the batch transesterification process
Reward    TD3 (continuous)    DDPG (continuous)    DQN (discrete)    GP (discrete)
PI        1.1738              1.2524               1.3189            1.4026
PID       1.1526              1.1758               1.2826            1.2835
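For concreteness, a minimal PyTorch sketch of the actor and twin-critic architecture described above (two hidden layers of 400 and 300 ReLU units, linear actor output, ADAM optimisers, and the state and action fed jointly to the critics) is given below; the state/action dimensions and the learning rate are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=1, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim))            # linear output activation

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    def __init__(self, state_dim=1, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1))                     # scalar Q-value

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=1))  # state and action given jointly as input

actor, critic_1, critic_2 = Actor(), Critic(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(list(critic_1.parameters()) + list(critic_2.parameters()), lr=1e-3)
```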
Figures 2 and 3 show the comparison of the control performance in terms of the reactor temperature (T_r) with respect to time. Figure 2 compares the control performance of the continuous action-space algorithms, namely TD3 vs. DDPG, for the PID and PI rewards in subplots (a) and (b), respectively. Similarly, the performance of the discrete action-space algorithms, namely DQN and GP, for both reward functions is shown in Figure 3. It can be seen from the plots that the reactor temperature profile closely follows the target value for the TD3 controller with the PID reward function. The results clearly show that the proposed TD3 based controller is capable of learning from the given environment and controlling the system by achieving the desired temperature set-point (T_ref) of 345 K.
Figure 2: Comparison of the control performance of (a) TD3 vs. DDPG for the PID reward and (b) TD3 vs. DDPG for the PI reward.

Since the steady-state error with the PID reward function is better than with the PI reward function for all four algorithms, the subsequent analysis considers only the PID reward function. It can be seen that the variability in the control actions is higher for the discrete case than for the continuous action space. Table 3 reports the average value of the standard deviation (SD) of the control actions across the last four episodes. The results show that the SD values for TD3 and DDPG are comparable, while the input fluctuations are larger for the discrete case. It is worth noting that we have constrained the action space between 330 and 350 K for the continuous action-space algorithms. However, we have observed that the discrete action-space algorithms are unable to honour this constraint due to the limited discrete action options. Hence, we have relaxed the constraint space to 330-360 K for the discrete action-space algorithms, with an interval of 0.75 K. Figure 4(a) shows the control action plots for all four algorithms for the PID reward. Further, the control effort is calculated and compared for TD3 and DDPG by taking the average of the integral of the square of the control actions, and the obtained values are reported in Table 4.

We have also compared the reward-vs-time plots for TD3 and DDPG. Here, in order to make the contrast visible, we have evaluated the inverse of the reward function, the penalty, and plotted the penalty values vs. time, as shown in Figure 4(b). Figure 4(b) shows that DDPG has a higher value of the penalty as compared to the TD3 algorithm, indicating lower rewards obtained in comparison with TD3.
Figure 3: Comparison of the control performance of (a) DQN vs. GP for the PID reward and (b) DQN vs. GP for the PI reward.

Table 3: Variability in the control action for the four RL algorithms with the PID reward
Algorithm    SD
TD3          3.2430
DDPG         3.2642
DQN          11.1897
GP           9.1718

These results reinforce that TD3 is better than DDPG for the control of the batch transesterification process. Further, to test the efficacy of the TD3 controller, we have introduced batch-to-batch variations in the simulations. Batch-to-batch variation may occur due to slight perturbations in the process parameters and changes in the environmental conditions during a batch run. For the batch transesterification process, we have introduced batch-to-batch variation by randomly changing the rate constant, k_c, through a perturbation of the pre-exponential factor (k_o) with a fixed variance in each batch. The tracking trajectory presented in Figure 5 clearly shows that the TD3 based controller is able to reach the set-point in the presence of batch-to-batch variations, and thus we achieve the desired control performance. The average RMSE is 1.1620 and 1.2133 for TD3 and DDPG, respectively, under batch-to-batch variations, once again indicating the advantages of TD3 over DDPG for controlling the batch transesterification process.
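As an illustration of the two performance measures used above, the sketch below computes the RMSE of the reactor temperature against the set-point and the control effort as the integral of the squared control action; the trajectories are dummy data and, per the text, the reported values are additionally averaged over the last batches.

```python
import numpy as np

def rmse(T_r, T_ref):
    """Root-mean-square tracking error of the reactor temperature."""
    return float(np.sqrt(np.mean((np.asarray(T_r) - T_ref) ** 2)))

def control_effort(u, dt):
    """Rectangle-rule approximation of the integral of u(t)^2 over the batch."""
    return float(np.sum(np.asarray(u) ** 2) * dt)

T_r = 345.0 + 1.2 * np.random.randn(100)      # dummy reactor temperature trajectory (K)
u = 340.0 + np.random.randn(100)              # dummy jacket inlet temperature actions (K)
print(rmse(T_r, T_ref=345.0), control_effort(u, dt=1.0))
```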
Figure 4: Comparison of (a) the control inputs of all approaches for the PID reward and (b) the penalty for TD3 and DDPG for the PID reward.
Figure 5: Comparison of the control performance of TD3 vs. DDPG for the PID reward (batch-to-batch variation).

Table 4: Comparison of the control effort for the continuous action-space algorithms
Algorithm    Control effort
TD3          118416.7
DDPG         118680.3
This paper presented a controller based on the TD3 RL algorithm for the control of the batch transesterification process. The reactor temperature is considered the state (controlled variable), and the jacket inlet temperature is taken as the action (control input). It was observed that the controller is able to learn the optimal policy and achieve the desired reactor temperature by implementing appropriate control actions. We also formulated reward functions taking inspiration from the functional structure of PI and PID controllers, by incorporating the historical errors, and showed that this helps the agent to better learn about the process. The results indicate that TD3 shows better convergence as compared to the continuous action-space algorithm DDPG and other algorithms such as DQN and GP that are applicable to discrete action spaces. In summary, the TD3 based RL controller is able to learn, intervene in, and control the process operation efficiently and can be used as a potential framework for complex non-linear systems where both the state and the action space are continuous.
References

[1] Khalid Khalizani, Khalid Khalisanni, et al. Transesterification of palm oil for the production of biodiesel. American Journal of Applied Sciences, 8(8):804–809, 2011.
[2] Xuejun Liu, Huayang He, Yujun Wang, Shenlin Zhu, and Xianglan Piao. Transesterification of soybean oil to biodiesel using CaO as a solid base catalyst. Fuel, 87(2):216–221, 2008.
[3] Junzo Otera. Transesterification. Chemical Reviews, 93(4):1449–1470, 1993.
[4] Mehdi Hosseini, Ali Mohammad Nikbakht, and Meisam Tabatabaei. Biodiesel production in batch tank reactor equipped to helical ribbon-like agitator. Modern Applied Science, 6(3):40, 2012.
[5] B Freedman, EH Pryde, and TL Mounts. Variables affecting the yields of fatty esters from transesterified vegetable oils. Journal of the American Oil Chemists Society, 61(10):1638–1643, 1984.
[6] A Chanpirak and W Weerachaipichasgul. Improvement of biodiesel production in batch transesterification process. In Proceedings of the International MultiConference of Engineers and Computer Scientists, II, volume 5, 2017.
[7] Pahola T Benavides and Urmila Diwekar. Optimal control of biodiesel production in a batch reactor: Part II: Stochastic control. Fuel, 94:218–226, 2012.
[8] Riju De, Sharad Bhartiya, and Yogendra Shastri. Dynamic optimization of a batch transesterification process for biodiesel production. (Icc):117–122, 2016.
[9] Alejandro G Marchetti, Grégory François, Timm Faulwasser, and Dominique Bonvin. Modifier adaptation for real-time optimization—methods and applications. Processes, 4(4):55, 2016.
[10] Ali Mesbah. Stochastic model predictive control: An overview and perspectives for future research. IEEE Control Systems Magazine, 36(6):30–44, 2016.
[11] Jay H Lee and Kwang S Lee. Iterative learning control applied to batch processes: An overview. Control Engineering Practice, 15(10):1306–1318, 2007.
[12] Richard Kern and Yogendra Shastri. Advanced control with parameter estimation of batch transesterification reactor. Journal of Process Control, 33:127–139, 2015.
[13] Farouq S Mjalli and Mohamed Azlan Hussain. Approximate predictive versus self-tuning adaptive control strategies of biodiesel reactors. Industrial & Engineering Chemistry Research, 48(24):11034–11047, 2009.
[14] Farouq S Mjalli, L Kim San, K Chai Yin, and M Azlan Hussain. Dynamics and control of a biodiesel transesterification reactor. Chemical Engineering & Technology: Industrial Chemistry-Plant Equipment-Process Engineering-Biotechnology, 32(1):13–26, 2009.
[15] Ho Yong Kuen, Farouq S Mjalli, and Yeoh Hak Koon. Recursive least squares-based adaptive control of a biodiesel transesterification reactor. Industrial & Engineering Chemistry Research, 49(22):11434–11442, 2010.
[16] Ana SR Brásio, Andrey Romanenko, João Leal, Lino O Santos, and Natércia CP Fernandes. Nonlinear model predictive control of biodiesel production via transesterification of used vegetable oils. Journal of Process Control, 23(10):1471–1479, 2013.
[17] Ana SR Brásio, Andrey Romanenko, Natércia CP Fernandes, and Lino O Santos. First principle modeling and predictive control of a continuous biodiesel plant. Journal of Process Control, 47:11–21, 2016.
[18] Steven Spielberg, Aditya Tulsyan, Nathan P Lawrence, Philip D Loewen, and R Bhushan Gopaluni. Deep reinforcement learning for process control: A primer for beginners. arXiv preprint arXiv:2004.05490, 2020.
[19] Rui Nian, Jinfeng Liu, and Biao Huang. A review on reinforcement learning: Introduction and applications in industrial process control, 8 2020.
[20] Jong Min Lee and Jay H Lee. Approximate dynamic programming-based approaches for input–output data-driven control of nonlinear processes. Automatica, 41(7):1281–1288, 2005.
[21] Jong Min Lee. A Study on Architecture, Algorithms, and Applications of Approximate Dynamic Programming Based Approach to Optimal Control. Technical report, 2004.
[22] Jay H Lee and Weechin Wong. Approximate dynamic programming approach for process control. Journal of Process Control, 20:1038–1048, 2010.
[23] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[24] Saxena Nikita, Anamika Tiwari, Deepak Sonawat, Hariprasad Kodamana, and Anurag S Rathore. Reinforcement learning based optimization of process chromatography for continuous processing of biopharmaceuticals. Chemical Engineering Science, 230:116171, 2021.
[25] Vikas Singh and Hariprasad Kodamana. Reinforcement learning based control of batch polymerisation processes. IFAC-PapersOnLine, 53(1):667–672, 2020.
[26] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2018.
[27] MyeongSeop Kim, Dong-Ki Han, Jae-Han Park, and Jung-Su Kim. Motion planning of robot manipulators for a smoother path using a twin delayed deep deterministic policy gradient with hindsight experience replay. Applied Sciences, 10(2):575, 2020.
[28] Stephen Dankwa and Wenfeng Zheng. Modeling a continuous locomotion behavior of an intelligent agent using deep reinforcement technique. In , pages 172–175. IEEE, 2019.
[29] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. 2018.
[30] EN Barron and H Ishii. The Bellman equation for minimizing the maximum cost. Nonlinear Analysis: Theory, Methods & Applications, 13(9):1067–1090, 1989.
[31] Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pages 1057–1063. Citeseer, 1999.
[32] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387–395. PMLR, 2014.
[33] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[34] Hado Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 23:2613–2621, 2010.
[35] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[36] Lisa Torrey and Jude Shavlik. Transfer learning. In