Distributed deep reinforcement learning for simulation control
Suraj Pawar∗
School of Mechanical & Aerospace Engineering, Oklahoma State University, Stillwater, Oklahoma 74078, USA. [email protected]
Romit Maulik
Argonne Leadership Computing Facility, Argonne National Laboratory, Lemont, Illinois, USA. [email protected]

ABSTRACT
Several applications in the scientific simulation of physical systems can be formulated as control/optimization problems. The computational models for such systems generally contain hyperparameters, which control solution fidelity and computational expense. The tuning of these parameters is non-trivial, and the general approach is to manually 'spot-check' for good combinations. This is because an optimal hyperparameter configuration search becomes impractical when the parameter space is large and when the parameters may vary dynamically. To address this issue, we present a framework based on deep reinforcement learning (RL) to train a deep neural network agent that controls a model solver by varying parameters dynamically. First, we validate our RL framework for the problem of controlling chaos in chaotic systems by dynamically changing the parameters of the system. Subsequently, we illustrate the capabilities of our framework for accelerating the convergence of a steady-state CFD solver by automatically adjusting the relaxation factors of the discretized Navier-Stokes equations during run-time. The results indicate that the run-time control of the relaxation factors by the learned policy leads to a significant reduction in the number of iterations for convergence compared to random selection of the relaxation factors. Our results point to potential benefits of learning adaptive hyperparameter strategies across different geometries and boundary conditions, with implications for reduced computational campaign expenses.

Keywords: Reinforcement learning, Neural network, Systems control, Fluid dynamics
In recent years, there has been a proliferation in the use of machine learning for fluid mechanics problems that use the vast amount of data generated from experiments, field measurements, and large-scale high-fidelity simulations [1–3]. A key ML algorithm is the artificial neural network, which has been widely used to tackle challenges in fluid mechanics. The artificial neural network is popular due to its ability to approximate complex nonlinear functions from large amounts of data, although this comes at the expense of reduced interpretability. Different types of neural networks have been adopted for a variety of tasks such as turbulence closure modeling [4–8], reduced order modeling [9–13], super-resolution and forecasting [14–16], solving partial differential equations [17, 18], flow control [19, 20], etc. Apart from fluid mechanics, there is an ever-increasing trend in the use of machine learning in materials science [21, 22], computational chemistry [23–25], and medical imaging [26, 27]. A common thread that links a large majority of these investigations is that machine learning has been deployed in a supervised learning fashion. Supervised learning relies on labeled data to learn relationships from observables to targets and is often limited by the lack of pre-existing data. In contrast, reinforcement learning (RL) is a machine learning philosophy in which an agent interacts with an environment and learns a policy that will maximize a predefined long-term reward [28]. Multiple methods can be employed as an agent in RL, but the most successful is a neural network in conjunction with RL, known as deep RL. Deep RL is gaining widespread popularity due to its success in robotics, playing Atari games, the game of Go, and process control [29–33].
∗This work was performed during a Research Aide appointment at the Argonne Leadership Computing Facility, Argonne National Laboratory. Data and codes available at https://github.com/Romit-Maulik/RLLib_Theta/

Recently, the field of quantum reinforcement learning has been emerging, which aims at developing quantum agents that can interact with the outer world and adapt to it with the strategy of achieving a certain final goal [34–36]. Deep RL is also making inroads in fluid mechanics as a means to solve fundamental problems in flow control. RL has been utilized to learn the collective hydrodynamics of self-propelled swimmers to optimally achieve a specified goal [37, 38], to train a glider to autonomously navigate in a turbulent atmosphere [39, 40], for controlled gliding and perching of ellipsoid-shaped bodies [41], and for navigation of smart microswimmers in complex flows [42]. RL has also been used for closed-loop control of the drag of a cylinder flow [43], control of a two-dimensional convection-diffusion partial differential equation [44], designing control strategies for active flow control [45, 46], discovering computational models from data [47], and shape optimization [48]. RL's attractiveness stems from its ability to perform both exploitation and exploration to learn about a particular environment.

Many real-world system optimization problems can be formulated as RL problems and offer potential applications of the recent advances in deep RL [49]. Several problems in scientific simulations share a similarity with system optimization problems. For example, the iterative solver in numerical simulations of fluid flows may be considered as a plant to be controlled. The iterative solver involves different parameters such as relaxation factors, smoothing operations, type of solver, and residual tolerance, and all of these parameters could be controlled during run-time with RL to achieve faster convergence.
Another example is the turbulence closure model employed in computational fluid dynamics (CFD) simulations to represent unresolved physics. The closure coefficients of the turbulence model can be calibrated with RL, similarly to the optimization of hyperparameters in neural networks [50]. Indeed, multi-agent RL was recently introduced as an automated turbulence model discovery tool in the context of sub-grid scale closure modeling in large eddy simulations of homogeneous isotropic turbulence [51].

To illustrate the application of RL for the control of parameters in computational models employed in scientific simulation, we consider the problem of accelerating the convergence of steady-state iterative solvers in CFD simulations by dynamically controlling the under-relaxation factors of the discretized Navier-Stokes equations. More concretely, we train the RL agent to control the under-relaxation factors so as to minimize the computational time, or the number of iterations, it takes for the solution to converge. This control problem differs from standard control problems in that the target state (i.e., the converged solution) is not known a priori. There have been some studies on dynamically updating under-relaxation factors with the goal of maximizing the speed of convergence. One method uses a fuzzy control algorithm based on the oscillations of the solution norm observed over a large interval prior to the current iteration [52]. Other methods based on neural networks and fuzzy logic have also been proposed to automate the convergence of CFD simulations [53, 54]. Limitations of the fuzzy logic methods are that they rest on certain assumptions about the system that might not always be accurate, and that they do not scale well to larger or more complex systems. In contrast, a neural network agent trained using RL can discover a more accurate decision-making policy to achieve a given objective and can be applied to complex systems.
The problem of controlling under-relaxation factors in CFD simulations is analogous to controlling the learning rate of the stochastic gradient descent (SGD) algorithm, and RL has been demonstrated to achieve better convergence of SGD than human-designed learning rates [55]. In the present work, we introduce a novel method based on RL to accelerate the convergence of CFD simulations applied to turbulent flows and demonstrate the robustness of the method for the backward-facing step test case.

One of the major challenges in using RL for scientific simulations is the computational cost of simulating an environment [38, 56]. For example, the cost of a CFD solver depends on several factors, such as mesh refinement, complexity of the problem, and level of required convergence, and can range from the order of minutes to the order of several hours. RL is known to be sample inefficient, and the most popular RL strategies require a large number of interactions with the environment. Therefore, the RL framework should take the computational overhead of simulating an environment into account for scientific computing problems. One approach to accelerate RL training is to run multiple environments concurrently on different processors and then collect their experience to train the agent. We refer to this multi-environment approach as distributed RL in this study. There are many open-source packages, such as RLLib [57], Tensorforce [58], SEED RL [59], and Acme [60], that can be exploited to achieve faster training with distributed RL. In addition to the distributed execution of the environment, the CFD simulations can also be parallelized to run on multiple processing elements [61, 62]. The contributions of this work can be summarized as follows:

• A distributed deep RL framework is presented for the optimization and control of system parameters encountered in scientific simulations.
• We apply the deep RL framework to the task of restoring chaos in transiently chaotic systems, and for accelerating the convergence of a steady-state CFD solver.
• We highlight the importance of distributed RL and how it aids accelerated training for scientific applications.
• We also provide an open-source code and a set of tutorials for deploying the introduced framework on distributed multiprocessing systems.

This paper is organized as follows. Section 2 provides background information on the RL problem formulation and presents the proximal policy optimization (PPO) algorithm. In Section 3, we validate our framework using the test case of restoring chaos in chaotic systems. Then, we describe how controlling relaxation factors in CFD simulations is converted to a sequential decision-making problem and discuss the ability of an RL agent to generalize to different boundary conditions and different geometry setups. Finally, we conclude with a summary of our study and outline future research directions in Section 4.
In this section, we discuss the formulation of the RL problem and briefly describe the proximal policy optimization (PPO) algorithm [63], which belongs to the class of policy-gradient methods. In RL, the agent observes the state of the environment and then, based on these observations, takes an action. The objective of the agent is the maximization of the expected value of the cumulative sum of the reward. This problem statement can be framed as a Markov decision process (MDP). At each time step t, the agent observes some representation of the state of the system, s_t ∈ S, and based on this observation selects an action, a_t ∈ A. As a consequence of the action, the agent receives a reward, r_t ∈ R, and the environment enters a new state s_{t+1}. Therefore, the interaction of an agent with the environment gives rise to a trajectory

τ = {s_0, a_0, r_0, s_1, a_1, r_1, . . .}.    (1)

The advantage of the MDP framework is that it is flexible and can be applied to different problems in different ways. For example, the time steps in the MDP formulation need not refer to the fixed time step used in CFD simulations; they can refer to arbitrary successive stages of taking the action. The goal of the RL agent is to choose actions that maximize the expected discounted return over the trajectory τ; mathematically, it can be written as

R(τ) = Σ_{t=0}^{T} γ^t r_t,    (2)

where γ ∈ [0, 1] is a parameter called the discount rate, and T is the horizon of the trajectory. The discount rate determines how much importance is given to long-term rewards compared to immediate rewards. If γ → 0, the agent is concerned with maximizing immediate rewards, and as γ → 1, the agent takes future rewards into account more strongly. RL tasks are also classified based on whether or not they terminate. Some tasks involve agent-environment interactions that break into a sequence of episodes, where each episode ends in a state called the terminal state.
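The finite-horizon return of Equation 2 can be sketched in a few lines; the reward sequence below is illustrative and not taken from the paper's experiments.

```python
# Sketch: finite-horizon discounted return, R(tau) = sum_{t=0}^{T} gamma^t r_t.
def discounted_return(rewards, gamma):
    """Discounted return over a finite trajectory of stepwise rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]            # hypothetical stepwise rewards
print(discounted_return(rewards, 1.0))    # undiscounted: 4.0
print(discounted_return(rewards, 0.5))    # 1 + 0.5 + 0.25 + 0.125 = 1.875
```

As the discount rate shrinks, later rewards contribute less, matching the γ → 0 versus γ → 1 behavior described above.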
On the other hand, there are tasks, such as process control, that go on continually without limit. The horizon T of the trajectory for these continuing tasks goes to infinity in Equation 2.

In RL, the agent's decision-making procedure is characterized by a policy π(s, a) ∈ Π. The RL agent is trained to find a policy that optimizes the expected return when starting in state s at time step t; this quantity is called the V-value function. We can write the V-value function as

V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, π ].    (3)

Similarly, the expected return starting in a state s, taking an action a, and thereafter following a policy π is called the Q-value function and can be written as

Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a, π ].    (4)

We also define an advantage function, which can be considered a lower-variance version of the Q-value function obtained by taking the V-value function as the baseline. The advantage function can be written as

A^π(s, a) = Q^π(s, a) − V^π(s).    (5)

In deep RL, a neural network is used as the RL agent, and therefore the weights and biases of the neural network are the parameters of the policy [64]. We use π_w(·) to denote that the policy is parameterized by w ∈ R^d. The agent is trained with an objective function defined as [28]

J(w) := V^{π_w}(s_0),    (6)

where an episode starts in some particular state s_0, and V^{π_w} is the value function for the policy π_w. The policy parameters w are learned by estimating the gradient of the objective function and plugging it into a gradient ascent algorithm as follows:

w ← w + α ∇_w J(w),    (7)

where α is the learning rate of the optimization algorithm. The gradient of the objective function can be computed using the policy gradient theorem [64] as follows:

∇_w V^{π_w}(s) = E_{π_w}[ ∇_w ( log π_w(s, a) ) Q^{π_w}(s, a) ].    (8)

There are two main challenges in using the above empirical expectation.
The first is the large number of samples required, and the second is the difficulty of obtaining stable and steady improvement. Different families of policy-gradient algorithms have been proposed to tackle these challenges [65–67]. The performance of policy gradient methods is highly sensitive to the learning rate α in Equation 7: if the learning rate is large, it can cause the training to be unstable. The proximal policy optimization (PPO) algorithm uses a clipped surrogate objective function to avoid excessive updates in the policy parameters in a simplified way [63]. The clipped objective function of the PPO algorithm is

J^clip(w) = E[ min( r_t(w) A^{π_w}(s, a), clip( r_t(w), 1 − ε, 1 + ε ) A^{π_w}(s, a) ) ],    (9)

where r_t(w) denotes the probability ratio between the new and old policies:

r_t(w) = π_{w+∆w}(s, a) / π_w(s, a).    (10)

The ε in Equation 9 is a hyperparameter that controls how much the new policy is allowed to deviate from the old one. This is done using the function clip( r_t(w), 1 − ε, 1 + ε ), which forces the ratio between the new and old policies, r_t(w), to stay within the limits [1 − ε, 1 + ε].

In this section, we demonstrate the application of deep RL for two problems where the parameters of a system need to be controlled to achieve a certain objective. The first test case is restoring chaos in transiently chaotic dynamical systems [68]. We utilize this test case as a validation example for our distributed deep RL framework. The second test problem is accelerating the convergence of a steady-state CFD solver through deep RL, and we illustrate this for turbulent flow over a backward-facing step.
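Before moving to the test cases, the clipped surrogate of Equation 9 can be sketched numerically. The ratios and advantages below are placeholder values, not quantities from a real rollout.

```python
import numpy as np

# Sketch of the per-sample clipped surrogate objective in Equation 9;
# the training loss is the negated mean of this quantity.
def ppo_clip_objective(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # the min makes the surrogate a pessimistic (lower) bound on improvement
    return np.minimum(unclipped, clipped)

ratio = np.array([0.5, 1.0, 1.5])       # pi_new / pi_old (illustrative)
advantage = np.array([1.0, 1.0, -1.0])  # estimated advantages (illustrative)
obj = ppo_clip_objective(ratio, advantage)
# elementwise result: [0.5, 1.0, -1.5]
```

Note how the clip leaves small updates untouched while capping the incentive to push the ratio far outside [1 − ε, 1 + ε].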
The chaotic state is essential in many dynamical systems, and the phenomenon called crisis can cause sudden destruction of chaotic attractors [69]. This can be detrimental in applications where chaos is essential, such as enhancing mixing by chaotic advection [70] or preventing the collapse of electric power systems [71]. One method to sustain chaos is to apply small perturbations based on knowledge of the dynamical system and a priori information about the escape regions from chaos [72]. Vashishtha et al. [68] proposed an approach using deep RL to determine small perturbations to the control parameters of the Lorenz system [73] in such a way that chaotic dynamics are sustained despite the uncontrolled dynamics being transiently chaotic. This method does not require finding escape regions or target points and is therefore attractive for systems where complete information about the dynamical system is unavailable. The Lorenz system is described by the following equations:

dx/dt = σ(y − x),    (11)
dy/dt = x(ρ − z) − y,    (12)
dz/dt = xy − βz,    (13)

where x, y, z are the state of the Lorenz system, and σ, ρ, β are the system's parameters. The Lorenz system gives rise to a chaotic trajectory with σ = 10, ρ = 28, and β = 8/3, as shown in Figure 1. However, the Lorenz system exhibits transient chaotic behavior for ρ ∈ [13.93, 24.06] [74]. For example, if we use ρ = 20 for the Lorenz system, the solution converges to the fixed point P− = (−√(β(ρ − 1)), −√(β(ρ − 1)), ρ − 1) ≈ (−7.12, −7.12, 19), as illustrated in Figure 1.

In order to sustain the chaos in the transiently chaotic Lorenz system, we utilize the same RL problem formulation as Vashishtha et al. [68]. The RL agent observes the solution vector and its time derivative as the state of the system. Therefore, we have

s_t = {x(t), y(t), z(t), ẋ(t), ẏ(t), ż(t)},    (15)
Figure 1: Evolution trajectory of the Lorenz system with ρ = 28 (left) and ρ = 20 (right).

where the dot represents the time derivative, i.e., the velocity. Based on these observations, the agent chooses an action, which is the perturbation to the control parameters of the Lorenz system:

a_t = {∆σ, ∆ρ, ∆β}.    (16)

The perturbation to the control parameters is restricted to lie within a certain interval, which is ±
10% of the parameter values. Therefore, we have ∆σ ∈ [−σ/10, σ/10], ∆ρ ∈ [−ρ/10, ρ/10], and ∆β ∈ [−β/10, β/10]. We note here that these limits remain fixed throughout the episode based on the initial values of (σ, ρ, β). For a transiently chaotic system, the trajectory converges to a fixed point with time, and therefore the magnitude of the velocity will decrease consistently with time. This fact is utilized to design the reward function. When the magnitude of the velocity is greater than a certain threshold velocity V_Th, the agent is rewarded, and the agent is penalized when the magnitude of the velocity is less than V_Th. In addition to the stepwise reward at each time step, a terminal reward is given at the end of each episode. The stepwise reward function can be written as

r_t = { +1, V(t) > V_Th; −1, V(t) ≤ V_Th },    (17)

and the terminal reward (Equation 18) assigns a large penalty when the average of the stepwise reward over the last 2000 time steps of an episode falls below a threshold, and a bonus otherwise. The value of the threshold velocity V_Th is set at 40.

For training the agent, we divide each episode into 4000 time steps of fixed size dt. We train the agent for 2 × 10^5 time steps, which corresponds to 50 episodes. The RL agent is a fully connected neural network with two hidden layers and 64 neurons in each layer. Additionally, the output of the second hidden layer is further processed by an LSTM layer composed of 256 cells. The hyperbolic tangent (tanh) activation function is used for all hidden layers. Figure 2 displays how the mean reward progresses with training for different numbers of workers. By different numbers of workers, we mean that the agent-environment interactions are run concurrently on different processors. In distributed deep RL, the environment is simulated on different processors and their experience is collected. These sample batches encode one or more fragments of a trajectory and are concatenated into a larger batch that is then utilized to train the neural network.
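The fragment-concatenation step described above can be sketched with a standard multiprocessing pool. The rollout below is a stand-in for the actual environment, and the batch size follows the 8000-step budget used in this example; everything else is an illustrative assumption.

```python
from multiprocessing import Pool

# Sketch of distributed experience collection: a fixed training batch is
# split evenly across N workers and the per-worker fragments are
# concatenated into one batch for the learner.
TRAIN_BATCH_SIZE = 8000

def rollout(num_steps):
    # placeholder rollout: one (state, action, reward) record per step;
    # in the paper, each worker runs the actual environment here
    return [(0.0, 0.0, 1.0)] * num_steps

def collect_batch(num_workers):
    steps_per_worker = TRAIN_BATCH_SIZE // num_workers
    with Pool(num_workers) as pool:
        fragments = pool.map(rollout, [steps_per_worker] * num_workers)
    # concatenate worker fragments into a single training batch
    return [sample for frag in fragments for sample in frag]

if __name__ == "__main__":
    batch = collect_batch(4)   # 4 workers x 2000 steps each
    print(len(batch))          # 8000
```

Frameworks such as RLLib automate this scatter/gather pattern, but the batch bookkeeping is the same.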
More concretely, in this particular example, we collect the experience of an agent for 8000 time steps to train the neural network. When we employ two workers, the agent-environment interaction is simulated for 4000 time steps on each worker, and this experience is concatenated to form a training batch for the neural network. Similarly, for four workers, the agent-environment interaction is simulated for 2000 time steps on each worker to collect a training batch size of 8000 time steps. Figure 2 also reports the computational time required for training. We see that as we increase the number of workers from two to four, the computational time is reduced by approximately 25%. However, as we increase the number of workers to 8, there is only a marginal improvement in terms of CPU time. This is not surprising, as the communication frequency increases with eight workers, and hence there is no significant further reduction in CPU time.

Figure 2: Evolution of the moving average mean reward for different numbers of workers trained with the PPO algorithm. The moving average window size is set at 5 episodes.

Figure 3 depicts the performance of the trained RL agent in sustaining chaos in the transiently chaotic Lorenz system. In Figure 3, the black trajectory is the uncontrolled solution of the Lorenz system for parameters (σ, ρ, β) = (10, 20, 8/3). When the policy learned by the RL agent is used for control, the Lorenz system does not converge to the fixed point P−. We can also see that the RL agent trained with different numbers of workers learns the policy effectively and is able to sustain chaos over the period of temporal integration. This observation is very important for employing RL in physical systems where the simulation of a computationally intensive environment is the bottleneck, and distributed deep RL can be employed to address this issue.
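A single environment step for this test case can be sketched as follows: the agent's perturbations enter the Lorenz right-hand side, the state is advanced, and the velocity-threshold reward of Equation 17 is evaluated. The explicit Euler scheme and the step size dt are illustrative assumptions, not the integrator used in the paper.

```python
import numpy as np

V_TH = 40.0  # threshold velocity from the reward design above

def lorenz_rhs(state, sigma, rho, beta):
    """Right-hand side of Equations 11-13."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def env_step(state, action, sigma=10.0, rho=20.0, beta=8.0 / 3.0, dt=1e-3):
    # the agent perturbs (sigma, rho, beta) within its +/-10% bounds (Equation 16)
    d_sigma, d_rho, d_beta = action
    params = (sigma + d_sigma, rho + d_rho, beta + d_beta)
    new_state = state + dt * lorenz_rhs(state, *params)  # explicit Euler substep
    speed = float(np.linalg.norm(lorenz_rhs(new_state, *params)))
    reward = 1.0 if speed > V_TH else -1.0               # Equation 17
    return new_state, reward

state = np.array([1.0, 1.0, 1.0])
state, r = env_step(state, action=np.zeros(3))
```

With zero perturbation and ρ = 20, repeated steps drive the velocity magnitude toward zero as the trajectory approaches P−, which is precisely what the learned policy prevents.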
In CFD, the Navier-Stokes equations are discretized, which leads to a system of linear equations. This system of linear equations is usually solved using iterative techniques such as Gauss-Seidel methods, Krylov subspace methods, and multigrid methods [75]. The main idea behind iterative methods is to start with some approximate solution and then reduce the residual vector at each iteration through so-called relaxation steps. The relaxation steps are performed until the convergence criteria are reached. In a relaxation step, the updated value of the variable is obtained by adding some fraction of the difference between the predicted value and the current value to the value at the current iteration. The SIMPLE algorithm [76] is widely used in CFD, and it utilizes the under-relaxation method to update the solution from one iteration to another. Under-relaxation improves convergence by slowing down the update to the solution field. The value of the variable at the k-th iteration is obtained as a linear interpolation between the value from the previous iteration and the value obtained in the current iteration:

x_P^(k) = x_P^(k−1) + α ( (Σ a_nb x_nb + b) / a_P − x_P^(k−1) ),    (19)

where 0 < α < 1 is the relaxation factor, x_P^(k) is the value of the variable at node P to be used for the next iteration, x_P^(k−1) is the value of the variable at node P from the previous iteration, x_nb are the values of the variables at the neighboring nodes, and a_P, a_nb, b are the constants obtained by discretizing the Navier-Stokes equations.

The under-relaxation factor α in Equation 19 should be chosen such that it is small enough to ensure stable computation and large enough to move the iterative process forward quickly. In practice, a constant value of the relaxation factor is employed throughout the computation, and this value is usually decided based on some previous computations.
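The effect of α in Equation 19 can be demonstrated on a small diagonally dominant system a_P x_P = Σ a_nb x_nb + b (a 1D Poisson-like stencil). The system size, tolerance, and relaxation factors below are illustrative choices, not values from the study.

```python
import numpy as np

# Sketch of the under-relaxed update in Equation 19 applied as a
# Jacobi-style sweep over all nodes of a 1D stencil with a_P = 2, a_nb = 1.
def relaxed_jacobi(n=20, alpha=0.7, tol=1e-8, max_iter=20000):
    a_p, a_nb = 2.0, 1.0
    b = np.ones(n)                        # source term
    x = np.zeros(n)                       # initial guess
    for k in range(1, max_iter + 1):
        nb_sum = np.zeros(n)
        nb_sum[1:] += a_nb * x[:-1]       # west neighbour contribution
        nb_sum[:-1] += a_nb * x[1:]       # east neighbour contribution
        x_new = x + alpha * ((nb_sum + b) / a_p - x)   # Equation 19
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, k
        x = x_new
    return x, max_iter

x, iters_07 = relaxed_jacobi(alpha=0.7)
_, iters_10 = relaxed_jacobi(alpha=1.0)
# on this well-behaved system, the smaller alpha damps the update and
# therefore needs more iterations: iters_07 > iters_10
```

In a real SIMPLE loop the trade-off is reversed for stiff couplings: small α is what keeps the iteration stable at all, which is exactly the tension the RL agent is asked to manage.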
Also, the suitable value of the relaxation factor is problem-dependent, and a small change in it can result in a large difference in the number of iterations needed to obtain a converged solution. In Figure 4, the number of iterations required for convergence with different values of the relaxation factors for the momentum and pressure equations is shown for the problem investigated in this study (i.e., the backward-facing step example). It can be noticed that
Figure 3: Comparison of the evolution of the Lorenz system without control and with the control policy learned by the RL agent for different numbers of workers (N = 2, 4, 8). The parameters of the Lorenz system are (σ, β, ρ) = (10, 8/3, 20).

the number of iterations is lower for α_u = 0.5 (the momentum equation relaxation factor) for all values of α_p (the pressure equation relaxation factor). However, for α_u = 0.9, the solution converges only for a single value of α_p, and for the other values of α_p the solution does not converge within 3000 iterations. Even though for most CFD simulations a good value of the relaxation factors can be found through trial and error, the process can become intractable for large parameter spaces. The complexity increases even further when the relaxation factors need to be updated on the fly during run-time for multiphysics problems. Therefore, we attempt to automate the procedure of dynamically updating the relaxation factors using deep RL to accelerate the convergence of CFD simulations.

We illustrate our RL framework for accelerating the convergence of CFD simulations by considering a two-dimensional backward-facing step example. First, we validate our CFD model using the experimental data provided in [77]. The height of the step in the experimental setup was H = 0.0127 m and the free-stream reference velocity was U_ref = 44.2 m/s. The corresponding Reynolds number based on the step height and the reference free-stream velocity is approximately Re_H = 36,000. The turbulence intensity I_t and the eddy-viscosity-to-molecular-viscosity ratio µ_t/µ at the inlet are set to match the experimental free-stream conditions. We use the computational mesh distributed by NASA [78] for our study. We utilize the steady-state simpleFoam solver in OpenFOAM [79] and apply the k−ω SST turbulence model [80] with the standard values of the closure coefficients to simulate the turbulent flow.
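For reference, with the simpleFoam solver the static relaxation factors are specified in the case's system/fvSolution dictionary. The snippet below is illustrative; the values shown are common defaults, not the settings used in this study.

```
relaxationFactors
{
    fields
    {
        p           0.3;    // pressure field under-relaxation
    }
    equations
    {
        U           0.7;    // momentum equation under-relaxation
        k           0.7;
        omega       0.7;
    }
}
```

The RL agent introduced below replaces this static choice by adjusting the momentum and pressure factors during the run.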
Figure 5 displays the contour plot of the velocity magnitude along with the normalized stream-wise velocity at four axial locations obtained experimentally and numerically. We can observe very good agreement between the experimental and numerical results for the normalized stream-wise velocity.

To learn a policy that generalizes across inlet boundary conditions, we train the RL agent for multiple values of the inlet velocity. For each episode, the inlet velocity is selected randomly from the interval [35, 65] m/s.

Figure 4: Number of iterations required for convergence with different combinations of the relaxation factors for the momentum and pressure equations (left). The convergence of the state of the environment with iterations for two values of the momentum equation relaxation factor, α_u = 0.5 and α_u = 0.9 (right). The pressure relaxation factor is held fixed for both cases.

Figure 5: Validation of the CFD model with experimental measurements for the normalized velocity in the x-direction. The experimental data for the velocity measurements are published in [77].

We highlight here that, in our study, the frequency at which the relaxation factors are updated is different from the frequency of CFD iterations. In particular, we update the relaxation factors for the momentum and pressure equations after every 100 iterations of the CFD simulation. We use the average value of the square of the velocity magnitude as the characteristic quantity that represents the CFD solution. Thus, the state of the environment can be written as

s_t = (1/N_b) Σ_{i=1}^{N_b} [U_b(i)]^2 + (1/N_m) Σ_{i=1}^{N_m} [U_m(i)]^2,    (20)

where U_b is the velocity at the inlet boundary, U_m is the nodal value of the velocity in the internal mesh, N_b is the total number of nodes on the inlet boundary, and N_m is the total number of nodes in the internal mesh.
At the start of an episode, the velocity in the interior of the domain is initialized to zero, and hence the initial state for each episode depends only on the value of the velocity at the inlet boundary. Figure 4 displays how the state of the environment progresses towards the converged state for two different values of the momentum equation relaxation factor. As the solution starts converging, the change in the state of the system from one iteration to the next is also reduced. Based on the observation of the state of the system, the agent decides the relaxation factors for the momentum and pressure equations. The action of the agent can be written as

a_t = {α_u, α_p},    (21)

where α_u and α_p are the relaxation factors for the momentum and pressure equations, respectively. The action space for both relaxation factors is a continuous interval with an upper bound of 0.95. Even though the action space is composed of only two relaxation factors, this problem is complicated due to its dynamic nature, and simple exploratory computation is not feasible. For more complex multiphysics problems, the action space can easily be augmented in this framework to include relaxation factors for other governing equations. The goal of the RL agent is to minimize the number of iterations required to obtain the converged solution. We give the negative of the number of CFD iterations between two time steps as the reward for that time step. Accordingly, the reward at each time step can be written as

r_t = −(K_t − K_{t−1}),    (22)

where K is the iteration count of the CFD simulation. Since each episode comes to an end at the terminal state, the task of accelerating CFD simulations can be considered an episodic task. Therefore, we set the discount factor γ = 1.0 for this test case. The stepwise reward function computed using Equation 22 with γ = 1.0 gives a cumulative reward equal to the negative of the total number of iterations it took for the CFD simulation to converge.
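The reward bookkeeping of Equation 22 can be sketched as follows. The solver below is a stand-in that simply "converges" after a given number of iterations; the update interval matches the 100-iteration cadence described above, while everything else is illustrative.

```python
# Sketch of the episodic reward in Equation 22: the agent acts every
# UPDATE_INTERVAL solver iterations and is penalized by the iterations
# spent since its last action.
UPDATE_INTERVAL = 100

def run_episode(total_iterations_to_converge):
    rewards = []
    k_prev = 0                  # K_{t-1}: iteration count at the last action
    converged = False
    while not converged:
        # ... agent would pick (alpha_u, alpha_p) here, solver advances ...
        k = min(k_prev + UPDATE_INTERVAL, total_iterations_to_converge)
        converged = k == total_iterations_to_converge
        rewards.append(-(k - k_prev))   # r_t = -(K_t - K_{t-1})
        k_prev = k
    return rewards

rewards = run_episode(450)
print(rewards)       # [-100, -100, -100, -100, -50]
print(sum(rewards))  # -450: the negative of the total iteration count
```

Maximizing the undiscounted return is therefore exactly minimizing the iteration count.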
Specifically, if the CFD simulation required 450 iterations to converge and the relaxation factors are updated every 100 iterations, then the reward trajectory will be {−100, −100, −100, −100, −50}.

For this example, the RL agent is a fully connected neural network with 64 neurons in each hidden layer. The RL agent is trained for 2000 episodes with a fixed learning rate. We assume that convergence of the CFD simulation is reached once the residuals for the momentum and pressure equations fall below preset tolerances. Figure 6 shows how the moving average mean reward progresses with training for different numbers of workers. The moving average window size is set at 100 episodes, i.e., the average of the reward is taken over the last 100 episodes. At the beginning of the training, the agent is predominantly exploring, and the mean reward at the start of training corresponds to a range of 650-700 iterations for the CFD simulation to converge. The mean reward improves over the course of training, and the final mean reward corresponds to around 450 iterations for convergence of the CFD simulations. This amounts to approximately a 30% reduction in the number of iterations required for convergence. We highlight here that, for many CFD problems, suitable values of the relaxation factors can be chosen to achieve faster convergence based on prior experience. However, for complex systems, the process of learning a rule to dynamically update the relaxation factors through exploratory computations can become unmanageable. For such systems, the proposed RL framework is attractive for accelerating convergence. In Figure 6, the CPU time for training is reported for different numbers of workers. As we increase the number of workers/environments from 4 to 8, we obtain around a 40% improvement in the CPU time for training.
The improvement in CPU time is only marginal as the number of processors is increased to 16, which might be due to increased communication between the processors. We note here that our environment, i.e., the CFD simulation, utilizes a single core, as the backward-facing step is a relatively small problem and can be simulated on the order of seconds to minutes depending upon the computer architecture. For high-dimensional problems, parallelization through domain decomposition can also be exploited to reduce the CPU time of each CFD simulation run.

Figure 6: Evolution of the moving-average mean reward for the PPO algorithm trained using different numbers of workers, N = 4, 8, 16 (left); the moving-average window size is set at 100 episodes. CPU wall time for the agent's training for different numbers of workers (right).

Once the agent is trained, we test its performance for three different values of the inlet velocity, V = 25.0, 50.0, 75.0 m/s. Among these three velocities, V = 50.0 m/s lies within the training velocity range, i.e., [35.0, 65.0]. We run ten different samples of the CFD simulation for these three inlet velocities, and utilize the RL agent to select the relaxation factors after every 100 iterations. Figure 7 shows box plots depicting the data obtained from the ten samples through their quartiles. It should be noted that the policy learned by the agent is not deterministic, and hence the number of iterations required for convergence for a single inlet velocity is not constant across samples. The mean and standard deviation of these samples are also shown in Figure 7. From both these plots, we can see that the mean of the samples for all three velocities is around 450 iterations.
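The box-plot summaries in Figure 7 reduce the ten stochastic evaluations per velocity to quartiles, a mean, and a standard deviation. A minimal sketch with Python's `statistics` module (the iteration counts passed in would be the per-sample convergence counts):

```python
import statistics

def summarize_runs(iteration_counts):
    """Quartiles, mean, and sample standard deviation of the iteration
    counts from repeated stochastic-policy evaluations (cf. Figure 7)."""
    x = sorted(float(v) for v in iteration_counts)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(x, n=4)
    return {"q1": q1, "median": median, "q3": q3,
            "mean": statistics.mean(x),
            "std": statistics.stdev(x)}
```

Because the policy is stochastic, repeated runs at a fixed inlet velocity yield a distribution of iteration counts rather than a single number, which is why the paper reports quartiles and error bars instead of a point estimate.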
Therefore, we can conclude that the policy learned by the RL agent generalizes to inlet boundary conditions different from the training boundary condition. We plot the variation of the relaxation factors for the momentum and pressure equations in Figure 8 to understand the policy learned by the RL agent. The agent has learned to keep the relaxation factor for the momentum equations at its highest value (i.e., 0.95) throughout the CFD simulation. We do not see any specific pattern in the variation of the pressure-equation relaxation factor. Since the relaxation factor for the momentum equations is constant, the difference in the number of iterations to convergence across samples is mainly due to how the pressure relaxation factor is chosen by the RL agent at various stages of the CFD simulation.

Furthermore, we test the policy on a different backward-facing step geometry. The test geometry has a step height twice that of the training geometry. In Figure 9, the flow features for the training and test geometries are shown, and it can be seen that there is a substantial difference in the flow in the recirculation region. We run a similar experiment with the test geometry for the three values of the inlet velocity. Figure 10 summarizes the RL agent's performance in accelerating the CFD simulation of the test geometry. Overall, the number of iterations required for convergence for the test geometry is higher than for the training geometry. The mean of the samples for the test geometry is around 550 iterations for all three velocities. Interestingly, the variance of the test-geometry samples obtained with the policy trained using 8 workers is lower than for the policies trained using 4 and 16 workers.

Figure 7: Evaluation of the trained agent's PPO policy for different inlet velocities - V = 25.0 m/s (left), V = 50.0 m/s (middle), V = 75.0 m/s (right).
The top plot is a box plot in which the green marker denotes the mean, red markers denote outliers, and the orange line denotes the median of the data. The bottom plot shows the mean of the samples with error bars denoting the standard deviation.

In this study, we have investigated the application of distributed reinforcement learning (RL) to automatically update the computational hyperparameters of numerical simulations, which is a common task in many scientific applications. We illustrate our framework on two problems. The first problem is tasked with sustaining chaos in chaotic systems, and the second is the acceleration of convergence of a steady-state CFD solver. In the first example, we demonstrate
Figure 8: Variation of the relaxation factors for the momentum and pressure equations over iterations of the steady-state CFD solver for V = 25.0 m/s, for five samples evaluated during testing with the learned policy.

that the policy learned by an RL agent trained using the proximal policy optimization (PPO) algorithm can automatically adjust the system's parameters to maintain chaos in transiently chaotic systems. The problem of controlling relaxation factors in steady-state CFD solvers differs from a standard control problem. In a standard control problem, the target value is known in advance, and this target is used to change the action variables so that the state of the system is guided towards the target state. In the problem of accelerating the convergence of steady-state CFD solvers, however, the target point, i.e., the converged solution, is not known. Therefore, RL is well suited to this problem: it discovers a decision-making strategy that dynamically updates the relaxation factors with the objective of minimizing the number of iterations required for convergence. Our numerical experiments indicate that the learned policy accelerates the convergence of the steady-state CFD solver and generalizes to unseen boundary conditions and a different geometry setup.

Despite the relative simplicity of the RL problem formulation and its application to the backward-facing step example, this framework can be readily extended to more complex multiphysics systems. The state and action spaces of the RL agent can be augmented to tackle the complexities of these systems. Our future work is centered around applying this framework to more challenging problems, such as a differentially heated cavity, and improving the computational efficiency of the software pipeline through direct I/O communication between the OpenFOAM solver and RL libraries.
One of the major challenges in using RL for scientific simulations is the computational time due to a slow environment, and we believe this challenge can be addressed by utilizing distributed environment evaluation along with high-performance computing to reduce the CPU time of the solver.

Acknowledgement
We acknowledge the helpful discussions with Venkat Vishwanath and the Argonne Leadership Computing Facility Datascience team. This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research was funded in part and used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. DOE or the United States Government.

Figure 9: Flow features in the vicinity of the backward-facing step for the training and test geometry setups. The test geometry has a step height twice that of the training geometry.
Figure 10: Evaluation of the trained agent's PPO policy for different inlet velocities in the test geometry setup - V = 25.0 m/s (left), V = 50.0 m/s (middle), V = 75.0 m/s (right). The top plot is a box plot in which the green marker denotes the mean, red markers denote outliers, and the orange line denotes the median of the data. The bottom plot shows the mean of the samples with error bars denoting the standard deviation.