Single-step deep reinforcement learning for open-loop control of laminar and turbulent flows
Optimization and passive flow control using single-step deep reinforcement learning
A Preprint
H. Ghraieb
MINES Paristech , PSL - Research University, CEMEF
J. Viquerat ∗ MINES Paristech , PSL - Research University, CEMEF [email protected]
A. Larcher
MINES Paristech , PSL - Research University, CEMEF
P. Meliga
MINES Paristech , PSL - Research University, CEMEF
Elie Hachem
MINES Paristech , PSL - Research University, CEMEF

June 5, 2020
Abstract
This research gauges the ability of deep reinforcement learning (DRL) techniques to assist the optimization and control of fluid mechanical systems. It combines a novel, "degenerate" version of the proximal policy optimization (PPO) algorithm, that trains a neural network in optimizing the system only once per learning episode, and an in-house stabilized finite elements environment implementing the variational multiscale (VMS) method, that computes the numerical reward fed to the neural network. Three prototypical examples of separated flows in two dimensions are used as testbed for developing the methodology, each of which adds a layer of complexity due either to the unsteadiness of the flow solutions, or the sharpness of the objective function, or the dimension of the control parameter space. Relevance is carefully assessed by comparing systematically to reference data obtained by canonical direct and adjoint methods. Beyond adding value to the shallow literature on this subject, these findings establish the potential of single-step PPO for reliable black-box optimization of computational fluid dynamics (CFD) systems, which paves the way for future progress in optimal flow control using this new class of methods.
Keywords
Deep Reinforcement Learning; Artificial Neural Networks; Computational fluid dynamics; Flow control; Adjoint method

∗ Corresponding author

Introduction
Flow control, defined as the ability to finesse a flow into a more desired state, is a field of tremendous societal and economical importance. In applications such as ocean shipping or airline traffic, reducing the overall drag by just a few percent while maintaining lift can help reduce fossil fuel consumption and CO2 emission while saving several billion dollars annually. Many other scenarios relevant to fluid mechanical systems call for similarly improved engineering design, e.g., the airline industry is greatly concerned with reducing the structural vibrations and the radiated noise that occur under unsteady flow conditions, while microfluidics and combustion both benefit from enhanced mixing (which can be achieved by promoting unsteadiness in some appropriate manner). All such problems fall under the purview of this line of study.

Flow control is often benchmarked in the context of bluff body drag reduction. Numerous strategies have been implemented, either open-loop with passive appendices (e.g., end plate, splitter plate, small secondary cylinder, or flexible tail), or open-loop with actuating devices (e.g., plasma actuation, steady or unsteady base bleeding, rotation), or closed-loop (e.g., via transverse motion, blowing/suction, rotation, all relying on an appropriate sensing of flow variables). An overview of the recent achievements and perspectives can be found in several comprehensive surveys [1, 2, 3, 4, 5, 6, 7, 8, 9]. Nonetheless, many of the proposed strategies are trial and error, and therefore require extensive and costly experimental or numerical campaigns. This has motivated the development of rigorous mathematical formalisms capable of designing optimal control strategies with minimal effort. The adjoint method is one popular family of such techniques, that has proven efficient at computing accurate, low-cost estimates of the objective gradient with respect to the control variables in large optimization spaces, and is widely used for many applications ranging from oceanography and atmospheric sciences [10] to aerodynamic design [11, 12, 13, 14], by way of fresh developments meant to promote/mitigate the linear amplification of flow disturbances [15, 16, 17, 18, 19, 20].

It is the premise of this research that the task of selecting an optimal subset of control parameters can be alternatively automated with the assistance of supervised learning algorithms (i.e., the simplest and most common form of machine learning, where an agent learns from labeled training data and is provided with corrective feedback). The breakthrough in this field dates back to the introduction of the back-propagation algorithm [21]. It has helped popularize Artificial Neural Networks (ANN), a family of versatile non-parametric tools that can be trained to hierarchically extract informative features from data, and can be used for several purposes, from exploratory data analysis to qualitative and quantitative predictive modeling. Since then, the increased affordability of high performance computational hardware (together with reduced costs for computation and data storage) has allowed leveraging the ever-increasing volume of numerical and experimental data generated for research and engineering purposes into novel insight and actionable information, which in turn has reshaped entire scientific disciplines, such as image analysis [22] or robotics [23, 24].
Because neural networks haveproduced most remarkable results when applied to stiff large-scale nonlinear problems [25], it is onlynatural to assume that they are similarly suitable for the state-space models arising from the high-dimensional Machine learning algorithms have thus been making rapid inroads in fluid mechanics, asa mean to solve the Navier–Stokes equations [26] or to predict closure terms in turbulence models [27];see also Ref. [28] for an overview of the current developments and opportunities.The focus here is on deep reinforcement learning (DRL), an advanced branch of machine learningin which neural networks specialize in solving decision-making problems (the deep terminology beingused to weigh on the sizable depth of the network itself). Simply put, the neural network trains infinding out which actions or succession of actions maximize a numerical reward signal, one challengebeing that a given action affects not only the immediate reward but also the next states and, thus,all the subsequent rewards. The most well-known success story in this field is AlphaGo, the ANNthat defeated the top-level human player at the game of Go [29], but the approach has also provenfruitful to solve computer vision problems (e.g., filtering, or extracting image features) [30] and hasenabled breakthroughs in the optimal control of complex dynamic systems [31, 32], including leggedrobots deployed in real environments [33]. Despite these achievements, DRL remains surprisinglysparsely used in fluid mechanics, with a limited amount of literature barely scratching the surfaceof the performance improvements to be delivered in flow control [34, 35, 36] and shape optimizationproblems [37]. 2 x x y y Figure 1: Fully connected neural network with two hidden layers, modeling a mapping from R to R .This researches assesses the feasibility of applying proximal policy optimization (PPO [32]) for flow con-trol and optimization purposes, to both shape the capabilities of the method (PPO is still a relativelynewcomer that has quickly emerged as the go-to DRL algorithm due to its data efficiency, simplicityof implementation and reliable performance) and narrow the gap between DRL and advanced numer-ical methods for multiscale, multi-physics computational fluid dynamics (CFD). We investigate morespecifically the “degenerate” single-step PPO algorithm introduced in [37], in which the neural networkgets only one attempt per learning episode at finding the optimal. Prototypical examples of separatedflows in two dimensions are used as testbed for developing this novel approach (to the best of theauthors knowledge, no previous study in the related literature has considered using DRL in a similarfashion), as it has been speculated to hold a high potential as a reliable black-box optimizer for CFDproblems (where only the final configuration is of interest), but needs further characterization andfine-tuning.The organization is as follows: the single-step PPO method is presented in section 2, together withthe baseline principles and assumptions of both DRL and PPO. Section 3 outlines the main features ofthe finite element CFD environment used to compute the numerical reward fed to the neural network,whose efficiency and reliability has been assessed in several previous studies. 
Three test cases arethen organized by increasing level of complexity, namely the flow past a NACA 0012 airfoil (section4), the flow past a tandem arrangement of two identical cylinders (section 5), and the passive controlof a cylinder flow (section 6). For each case, exhaustive results pertaining to the DRL optimizationof steady asymptotic, time-averaged, or fluctuating drag and lift are reported and carefully validatedagainst reference data obtained by canonical direct/adjoint methods. A neural network (NN) is a collection of artificial neurons, i.e., connected computational units withuniversal approximation capabilities [38], that can be trained to arbitrarily well approximate themapping function between input and output spaces. Each connection provides the output of a neuronas an input to another neuron. Each neuron performs a weighted sum of its inputs, to assign significanceto the inputs with regard to the task the algorithm is trying to learn. It then adds a bias to betterrepresent the part of the output that is actually independent of the input. Finally, it feeds an activationfunction that determines whether and to what extent the computed value should affect the ultimateoutcome. As sketched in figure 1, a fully connected network is generally organized into layers, with theneurons of one layer being connected solely to those of the immediately preceding and following layers.The layer that receives the external data is the input layer, the layer that produces the outcome is theoutput layer, and in between them are zero or more hidden layers.The design of an efficient neural network requires a proper optimization of the weights and biases,together with a relevant nonlinear activation function. The abundant literature available on this topic(see, e.g., Ref. [39] and the references therein) points to a relevant network architecture (e.g., type3 nvironment s t s t +1 Agent r t a t w t Figure 2: RL agent and its interactions with its environment.of network, depth, width of each layer), finely tuned hyper parameters (i.e., parameters whose valuecannot be estimated from data, e.g., optimizer, learning rate, batch size) and a sufficiently large amountof data to learn from as being the key ingredients for success.
Deep reinforcement learning (DRL) is an advanced branch of machine learning in which deep neural networks train in solving sequential decision-making problems. It is a natural extension of reinforcement learning (RL), in which an agent (the neural network) is taught how to behave in an environment by taking actions and by receiving feedback from it under the form of a reward (to measure how good or bad the taken action was) and information (to gauge how the action has affected the environment). This can be formulated as a Markov Decision Process, for which a typical execution goes as follows (see also figure 2):

• assume the environment is in state s_t ∈ S at iteration t, where S is a set of states,
• the agent uses w_t, an observation of the current environment state (and possibly a partial subset of s_t) to take action a_t ∈ A, where A is a set of actions,
• the environment reacts to the action by transitioning from s_t to state s_{t+1} ∈ S,
• the agent is fed with a reward r_t ∈ R, where R is a set of rewards, and a new observation w_{t+1}.

This repeats until some termination state is reached, the succession of states and actions defining a trajectory τ = (s_0, a_0, s_1, a_1, ...). In any given state, the objective of the agent is to determine the action maximizing its cumulative reward over an episode, i.e., over one instance of the scenario in which the agent takes actions. Most often, the quantity of interest is the discounted cumulative reward along a trajectory, defined as

R(τ) = Σ_{t=0}^{T} γ^t r_t ,    (1)

where T is the horizon of the trajectory, and γ ∈ [0, 1] is a discount factor that weighs the relative importance of present and future rewards (the agent being short-sighted in the limit where γ → 0, and far-sighted in the limit where γ → 1, since it then cares equally about all rewards).

There exist two main types of RL algorithms, namely model-based methods, in which the agent tries to build a model of how the environment works to make predictions about what the next state and reward will be before taking any action, and model-free methods, in which the agent conversely interacts with the environment without trying to understand it, and which are prominent in the DRL community. Another important distinction to be made within model-free algorithms is that between value-based methods, in which the agent learns to predict the future reward of taking an action when provided a given state, then selects the maximum action based on these estimates, and policy-based methods, in which it optimizes the expected reward of a decision policy mapping states to actions. Many of the most successful algorithms in DRL (including proximal policy optimization, whose assessment for flow control and optimization purposes is the primary motivation for this research) proceed from policy gradient methods, in which gradient ascent is used to optimize a parameterized policy with respect to the expected return, as further explained in the next section. The reader interested in a more thorough introduction to the zoology of RL methods (together with their respective pros and cons) is referred to Ref. [40].
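Before moving on, note that the discounted return (1) is straightforward to evaluate once a trajectory of rewards has been collected. The short Python sketch below is added purely for illustration (it is not part of the original workflow) and makes the role of γ explicit:

```python
def discounted_return(rewards, gamma):
    """Discounted cumulative reward R(tau) of equation (1), for a trajectory
    of rewards r_0, ..., r_T and a discount factor gamma in [0, 1]."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# gamma -> 0: only the immediate reward matters (short-sighted agent);
# gamma -> 1: all rewards count equally (far-sighted agent).
rewards = [1.0, 0.5, 0.25]
print(discounted_return(rewards, gamma=0.0))  # 1.0
print(discounted_return(rewards, gamma=1.0))  # 1.75
```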
This section, intended for the non-specialist reader, briefly reviews the basic principles and assumptions of policy gradient methods, together with the various steps taken for improvement.
Policy methods.
A policy method maximizes the expected discounted cumulative reward of a decision policy mapping states to actions. It resorts not to a value function, but to a probability distribution over actions given states, that fully defines the behavior of the agent. Since policies are most often stochastic, the following notations are introduced:

• π(s, a) is the probability of taking action a in state s under policy π,
• Q^π(s, a) is the expected value of the return of the policy after taking action a in state s (also termed state-action value function or Q-function)

Q^π(s, a) = E_π[ R(τ) | s, a ] ,    (2)

• V^π(s) is the expected value of the return of the policy in state s (also termed value function or V-function)

V^π(s) = E_π[ R(τ) | s ] .    (3)

The V and Q functions are therefore such that

V^π(s) = Σ_a π(s, a) Q^π(s, a) ,    (4)

so V^π(s) can also be understood as the probability-weighted average of discounted cumulated rewards over all possible actions in state s.

Policy gradient methods.

A policy gradient method aims at optimizing a parametrized policy π_θ, where θ denotes the free parameters whose value can be learnt from data (as opposed to the hyper parameters). In practice, one defines an objective function based on the expected discounted cumulative reward

J(θ) = E_{τ∼π_θ}[ R(τ) ] ,    (5)

and seeks the parameterization θ* maximizing J(θ), hence such that

θ* = arg max_θ E_{τ∼π_θ}[ R(τ) ] ,    (6)

which can be done on paper by plugging an estimator of the policy gradient ∇_θ J(θ) into a gradient ascent algorithm. This is no small task, as one is looking for the gradient with respect to the policy parameters in a context where the effects of policy changes on the state distribution are unknown (since modifying the policy will most likely modify the set of visited states, which will in turn affect performance in some indefinite manner). The most commonly used estimator, derived in [40] using the log-probability trick, reads

∇_θ J(θ) = E_{τ∼π_θ}[ Σ_{t=0}^{T} ∇_θ log(π_θ(s_t, a_t)) R(τ) ] ∼ E_{τ∼π_θ}[ Σ_{t=0}^{T} ∇_θ log(π_θ(s_t, a_t)) A^π(s, a) ] ,    (7)

where the rightmost term is a convenient approximation relying on the advantage function

A^π(s, a) = Q^π(s, a) − V^π(s) ,    (8)

that measures the improvement (if A > 0, otherwise the lack thereof) associated with taking action a in state s compared to taking the average over all possible actions. This is because the value function does not depend on θ, so taking it off changes neither the expected value, nor the gradient, but it does reduce the variance, and speeds up the training. Furthermore, when the policy π_θ is represented by a neural network (in which case θ simply denotes the network weights and biases to be optimized), the focus is rather on the policy loss defined as

L(θ) = E_{τ∼π_θ}[ Σ_{t=0}^{T} log(π_θ(a_t | s_t)) A^π(s, a) ] ,    (9)

whose gradient is equal to the (approximated) policy gradient (7) (since the gradient operator acts only on the log-policy term, not on the advantage) and is computed with respect to each weight and bias by the chain rule, one layer at a time, using the back-propagation algorithm [21].

Figure 3: Agent network example used to map states to policy. The input state s, here of size 2, is mapped to a mean µ and a standard deviation σ vector, each of size 2. All activation functions are ReLU, except for those of the last layer, which are linear for the µ output and softplus for the σ output. Orthogonal weights initialization is used throughout the network.

Trust regions.
The performance of policy gradient methods is hurt by the high sensitivity to the learning rate, i.e., the size of the step to be taken in the gradient direction. Indeed, small learning rates are detrimental to learning, but large learning rates can lead to a performance collapse if the agent falls off the cliff and restarts from a poorly performing state with a locally bad policy. This is all the more harmful as the learning rate cannot be tuned locally, meaning that an above-average learning rate will speed up learning in some regions of the parameter space where the policy loss is relatively flat, but will possibly trigger an exploding policy update in other regions exhibiting sharper variations. One way to ensure continuous improvement is by imposing a trust region constraint to limit the difference between the current and updated policies, which can be done by determining first a maximum step size relevant for exploration, then by locating the optimal point within this trust region. We will not dwell on the intricate details of the many algorithms developed to solve such trust region optimization problems, e.g., natural policy gradient (NPG [41]), or trust region policy optimization (TRPO [42]). Suffice it to say that they use the minorize-maximization algorithm to maximize iteratively a surrogate policy loss (i.e., a lower bound approximating locally the actual loss at the current policy), but are difficult to implement and can be computationally expensive, as they rely on an estimate of the second-order gradient of the policy log probability.

Proximal policy optimization.

Proximal policy optimization (PPO) is another approach with simple and effective heuristics, that uses a probability ratio between the two policies to maximize improvement without the risk of performance collapse [32]. The focus here is on the PPO-clip algorithm, that optimizes the surrogate loss

L(θ) = E_{(s,a)∼π_θ}[ min( (π_θ(a|s) / π_θold(a|s)) A^{π_θ}(s, a) , g(ε, A^{π_θ}(s, a)) ) ] ,    (10)

where

g(ε, A) = (1 + ε) A if A ≥ 0 ,   (1 − ε) A if A < 0 ,    (11)

and ε is the clipping range, a small hyper parameter (typically between 0.1 and 0.3) defining how far away the new policy is allowed to go from the old. In practice, a state-action pair yielding a positive advantage will cause π_θ(a|s) to increase for the action to become more likely. By doing so, if it increases in a way such that π_θ(a|s) > (1 + ε) π_θold(a|s), the min in (10) kicks in and its argument hits a ceiling of (1 + ε) A^{π_θ}(s, a). Conversely, a state-action pair yielding a negative advantage will cause π_θ(a|s) to decrease for the action to become less likely. If it decreases in a way such that π_θ(a|s) < (1 − ε) π_θold(a|s), the min kicks in again and its argument hits a ceiling of (1 − ε) A^{π_θ}(s, a). In neither case does the new policy have any incentive to step too far away from the current policy, which ensures that both policies will behave similarly.

The strengths of the approach lie in the fact that the clipped surrogate itself is cheap to compute, and that it needs only an estimate of the first-order gradient of the policy log probability (like policy gradient methods). Refinements have been proposed recently, for instance the Trust region PPO (TRGPPO) that adaptively adjusts the clipping range within the trust region [43], but classic PPO is generally acknowledged to be one of the most successful DRL methods, combining ease of implementation and tuning, together with state-of-the-art performance across a wide range of challenging tasks.

We now come to the single-step PPO method proposed in [37], a "degenerate" version of PPO in which the agent and the environment get to exchange only one single triplet (s, a, r) per learning episode. This represents a significant difference with respect to classic Markov-based methods, and amounts to viewing the agent as a parameterized mapping f_θ from state to action. As sketched in figure 3, the agent is consistently fed with the same input state s (usually a vector of zeros), and outputs a policy π that can be represented by mean and standard deviation vectors (denoted by µ and σ, respectively) whose size matches the dimension of the action required by the environment. The optimization loop is as follows: first, the agent implements a random mapping f_θ from s to an initial policy determined by the initialization of the network parameters θ. At each iteration, a population of actions a_t = f_θt(s) is drawn from the current policy, the reward associated to each set of actions is computed, and the agent is returned incentives (through the rewards) to update the free parameters in a way such that the next population of actions a_{t+1} ∼ f_θ{t+1}(s) will yield higher rewards (see also figure 4). This is done following for the most part the various steps described in section 2.3, only one seeks in practice the optimal mapping f_θopt such that a_opt ∼ f_θopt(s), not the optimal set of actions a_opt itself, and the loss is computed from (10) by substituting a normalized averaged reward for the advantage function.

It is worth noticing that an accurate estimation of the policy gradient requires evaluating a large amount of actions drawn from the current policy, which comes at the expense of computing the same amount of reward evaluations.
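To make the clipped surrogate (10)-(11) concrete, the following sketch evaluates its argument for a batch of sampled state-action pairs. This is an illustration only, written in plain NumPy rather than mirroring the Stable Baselines implementation actually used in this work; the default ε = 0.2 is a commonly used value, not one prescribed by the paper. In single-step PPO, the advantage entries are simply the normalized rewards of the current population of actions.

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Argument of the expectation in equation (10), evaluated element-wise.
    log_prob_new / log_prob_old : log pi_theta(a|s) under the new and old policies,
    advantage                   : estimate of A^pi(s, a) (a normalized reward in single-step PPO),
    eps                         : clipping range."""
    ratio = np.exp(log_prob_new - log_prob_old)       # pi_theta(a|s) / pi_theta_old(a|s)
    clipped = np.where(advantage >= 0.0,
                       (1.0 + eps) * advantage,       # g(eps, A) for A >= 0, equation (11)
                       (1.0 - eps) * advantage)       # g(eps, A) for A <  0
    return np.minimum(ratio * advantage, clipped)

# The surrogate loss maximized by gradient ascent is the batch mean, i.e.,
# L(theta) ~ np.mean(ppo_clip_objective(lp_new, lp_old, adv)).
```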
One key issue in the context of CFD applications is thus the ability todistribute a given set of actions to a parallel environment running on large computer clusters, as weshow in the following that the CPU cost of solving the present academic optimization problems in twodimensions ranges from a hundred to several thousand hours. There is also a PPO-Penalty variant which uses a penalization on the average Kullback–Leibler divergencebetween the current and new policies, but PPO-clip performs better in practice. Agent Parallelenvs. a t r t θ t → θ t +1 Figure 4: Action loop for single-step PPO. At each episode, the input state s is provided to theagent, which in turn provides n actions to n parallel environments. The latter return n rewards, thatevaluate the quality of each action taken. Once all the rewards are collected, an update of the agentparameters is made using the PPO loss (10). The process is repeated until convergence of the mapping(i.e., µ = µ ∗ and σ = ). In the following, we investigate two-dimensional (2-D) laminar incompressible flows over various ge-ometries of interest. The flow is described in a Cartesian coordinate system ( x, y ) with drag force(resp. lift force) positive in the stream-wise + x direction (resp. the cross-wise +y direction). Thegoverning equations to be solved numerically are the incompressible Navier–Stokes equations ∇ · u = 0 , ∂ t u + u · ∇ u + ∇ · [ p I − ε ( u )] = , (12)where u = ( u, v ) is the velocity field of component u (resp. v ) in the x (resp. y ) direction, p is thepressure, ε ( u ) = ( ∇ u + ∇ u T ) / In the context of finite element methods (that remain widely used to simulate engineering CFD systemsdue to their ability to handle complex geometries), direct numerical simulation (DNS) solves the weakform of (12) obtained by integrating by parts the pressure and viscous terms, to give( ∂ t u + u · ∇ u , w ) + 2Re ( ε ( u ) , ε ( w )) − ( p , ∇ · w ) + ( ∇ · u , q ) = 0 , (13)where ( w , q ) are relevant test functions for the velocity and pressure variables, and ( , ) is the L innerproduct on the computational domain.We use here the variational multiscale (VMS) approach [44, 45, 46] to solve a stabilized formulationof (13). This allows circumventing the Babuska—Brezzi condition (that otherwise imposes that dif-ferent interpolation orders be used to discretize the velocity and pressure variables, while we use heresimple continuous piecewise linear P elements for both) and prevents numerical instabilities in ad-vection regimes at high Reynolds numbers (that otherwise quickly spread in the shear regions andpollute the entire solution domain). We shall not go into the extensive details about the derivationof the stabilized formulation, for which the reader is referred to [47]. Suffice it to say here that theflow quantities are split into coarse and fine scale components, that correspond to different levels ofresolution. The fine scales are solved in an approximate manner to allow modeling their effect into thelarge-scale equations, which gives rise to additional terms in the right-hand side of (13), and yields the8ollowing weak form for the large scale( ∂ t u + u · ∇ u , w ) + 2Re ( ε ( u ) , ε ( w )) − ( p , ∇ · w ) + ( ∇ · u , q ) = X K ∈T h [( τ R M , u · ∇ w ) K + ( τ R M , ∇ q ) K + ( τ R C , ∇ · w ) K ] , (14)where ( , ) K is the inner product on element K , R M,C are the residuals defined by −R M = ∂ t u + u · ∇ u + ∇ p − R C = ∇ · u , (15)and τ , are ad-hoc mesh-dependent stabilization parameters defined in [48, 49]. 
Equation (14) issolved here with an in-house VMS solver whose accuracy and flexibility has been assessed in a series ofprevious papers [47, 50, 51, 49], together with open flow boundary conditions consisting of a uniformfree-stream u = (1 ,
0) at the inflow, free-surface conditions ∂ y u = v = 0 at the upper and lowerboundaries, advective conditions at the outflow, and no-slip conditions at all solid surfaces. For each test case documented in the following, a 2-D computational domain is defined asΩ = { ( x, y ) | x −∞ ≤ x ≤ x ∞ ; | y | ≤ y ∞ ; Γ s ( x, y ) < } , (16)where Γ s is a level set function that depends on the control parameters, whose zero iso-contour givesthe position of the solid surface S (the convention used here being to have Γ s ( x, y ) < s ≥ s on the control parameters is one key issue,as it imposes that a different mesh be used for each action taken by the agent over the course of thePPO optimization, each of which must be equally capable of accurately capturing the global and localspatio-temporal features at play. Here, this is achieved using the immerse volume method (IVM),widely used to immerse and represent complex geometries inside a unique mesh, and that has beenimplemented in the context of finite element VMS methods (see, e.g., [50, 49] for a detailed presentationof the mathematical formulation relevant to a rigid body immersed in an incompressible fluid). Foreach geometry and each value of the control parameters, the level set function is computed at allnodes of the mesh using the algorithm presented in [52]. The mesh at the interface is then refinedanisotropically using the gradient of the level set, and the physical properties of each domain are setwith appropriate mixing laws. Each mesh is adapted under the constraint of a fixed, case-dependentnumber of edges. Representative interface regions at the end of the anisotropic adaptation process areshown in figure 5. In practice, the adaptation process consists of adding nodes locally in the vicinity ofthe interface. The rest of the elements keep the same background size, that increases with the distancevia three successive refinement steps to accurately capture the near-wake region; see figure 5(d). This study is on the optimization of the non-dimensional forces exerted on the immersed body, asmeasured by the drag and lift coefficients per unit span length, denoted by D and L , respectively, anddefined as 2 I S ( − p n + 2Re ε ( u ) · n ) · e θ d l , (17)where e θ = (1 − θ ) e x + θ e y and θ is a Boolean used to specify the quantity subjected to optimization( θ = 0 for drag, θ = 1 for lift). The latter is computed as2 I ∂ Ω ( − p n + 2Re ε ( u ) · n ) · w d l = − Z Ω ( − p n + 2Re ε ( u ) · n ) : ∇ w d s − Z Ω ( ∇ u · u ) · w d s , (18)using the variational approach described in [53], that consists of multiplying the momentum equationsby a test function w equal to e θ on the nodes lying on S and 0 otherwise (hence vanishing everywhereexcept on the first layer of elements around S ), then of integrating by parts to bring up only volumeintegral terms, reportedly more accurate and less sensitive to the approximation of the body interfacetheir surface counterparts [54]. 9a) (b)(c) (d)Figure 5: ( a − c ) Details of the boundary layer mesh around (a) a NACA 0012 airfoil, (b) two circulartandem cylinders, and (c) a two-circular cylinder system, as obtained at the end of the anisotropicadaptation process. The blue line indicates the zero iso-contour of the level set function. (d) Back-ground mesh of the tandem configuration evidencing the three successive refinement steps. 
The present workflow relies on the online PPO implementation of Stable Baselines, a toolset of re-inforcement learning algorithms dedicated to the research community and industry [55], for which acustom OpenAI environment has been designed using the Gym library [56]. The instant reward r t usedto train the neural network is simply the quantity subjected to optimization (i.e., either the steadyasymptotic, time-averaged, or fluctuating value or drag and lift, modulo a plus or minus sign to tackleboth maximization and minimization problems). A moving average reward is also computed on the flyas the sliding average over the 100 latest values of r t , and we assume convergence when the absolutevariation in the moving average reward between two consecutive episodes is thrice smaller than 10 − out of 4 consecutive values. All other relevant hyper parameters are documented in the next sections,with the exception of the discount factor (since single-step PPO computes only one single reward perepisode). We consider first the NACA 0012 airfoil at incidence in a uniform stream depicted in figure 6. Allvariables are made non-dimensional using the straight chord distance c between the leading and trailing10igure 6: Schematic diagram of the flow past a NACA 0012 at incidence.(a) (b)Figure 7: Flow past a NACA 0012 at Re = 10. Iso-contours of the steady vorticity for (a) α = 11 ◦ ,and (b) α = 51 ◦ .edges and the free-stream velocity. This is a simple problem whose sole control parameter is the angleof attack α , i.e., the angle between the chord and the stream-wise directions (with α > reference data. Theorigin of the coordinate system is at the airfoil pivot-point, set at quarter chord length from theleading edge (that is, the eading edge is at ( − . ,
0) under zero incidence). The dimensions of thecomputational domain are x −∞ = − x ∞ = 40 and y ∞ = 15, which was found to be large enough notto have a discernible influence on the results. All meshes are adapted under the constraint of a fixednumber of edges n el = 115 , ∼ . × − ,which yields 20 −
25 boundary-layer grid points at mid-chord under zero incidence (depending on theReynolds number). The agent for this case is a fully-connected network with two hidden layers, eachholding 4 neurons. The learning rate is set to 5 × − , and the PPO loss clipping range is (cid:15) = 0 . The Reynolds number is set first to Re = 10 for the flow to remain steady regardless of the angle ofincidence. As could have been expected, the upper boundary layer remains attached even for positive,non-small values of α (in practice up to α ∼ ◦ ± ◦ ), after which it separates and pressure drops along This is not strictly speaking a control problem because the angle of attack cannot always be tuned toimprove performances. The methodology however carries over straightforwardly to related control problems,such as the optimization of multi-element high-lift systems. A deliberate abuse of language in the attempt to ease the reading. The reference data are actually VMS,not DNS, but the difference is clear from context. α = 11 ◦ , for which the boundary layer remains attached, and 51 ◦ , for whichit separates. Of course, for negative values of α , the same is true of the lower boundary layer, whoseseparation creates a pressure drop along the lower surface, hence creating negative lift.One objective of interest here is to maximize the steady asymptotic value of the lift coefficient L s reached after the initial transient. We show in figure 8(a) a curve of lift against the angle of attack.It has been obtained by DNS from a sample of 30 value of α uniformly distributed in the range[ − ◦ ; 75 ◦ ], for the point of attack to remain at the leading edge. In each simulation, the initialcondition is marched in time up to t = 100 with time step ∆ t = 0 . . α = 0,as should be, and exhibits a maximum for α opt = 39 .
5, associated to L s,opt = 1 . r t = L s ) uses 8 environmentsand 4 steps mini-batches to update the network for 32 epochs. Convergence is achieved after 20episodes, as indicated in figure 8(b) showing the evolution of the instant reward (in black) and of themoving average reward (in orange). This represents 160 simulations, each of which is performed on 1core following the exact same procedure as above, and lasts 1h 25min, hence ∼ α opt ∈ [37 . ◦ ; 40 ◦ ]associated which L s,opt ∼ .
16, which is perfectly in line with the DNS results; see the blue hue infigure 8(a).
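For the interested reader, the episode loop just described can be summarized by the following schematic sketch. The helper names, the mapping from raw actions to angles of attack, and the analytic stand-in for the CFD reward are all hypothetical; in the actual study, each reward evaluation is a full VMS simulation and the update is handled by the Stable Baselines PPO implementation.

```python
import numpy as np

def lift_reward(alpha):
    """Placeholder for one CFD evaluation of the steady lift L_s(alpha).
    In the paper this is a VMS simulation lasting about 1h25 on one core;
    the analytic expression below is a toy stand-in for illustration only."""
    return 1.2 * np.sin(np.deg2rad(2.0 * alpha)) * np.exp(-((alpha - 20.0) / 60.0) ** 2)

def run_episode(mu, sigma, n_envs=8, alpha_min=-15.0, alpha_max=75.0):
    """One single-step PPO episode for the lift maximization problem of section 4
    (schematic sketch; bounds and scaling are illustrative assumptions)."""
    raw = np.random.normal(mu, sigma, size=n_envs)                    # actions drawn from the current policy
    alphas = np.clip(alpha_min + 0.5 * (raw + 1.0) * (alpha_max - alpha_min),
                     alpha_min, alpha_max)                            # map raw actions to angles of attack
    rewards = np.array([lift_reward(a) for a in alphas])              # one reward per parallel environment
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalized rewards used in the loss (10)
    return alphas, rewards, advantages
```

The network parameters (hence µ and σ) are then updated with the clipped loss of section 2.3, and the loop is repeated until the moving average reward converges.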
The Reynolds number is now set to Re = 100 to add complexity by triggering flow unsteadiness, atleast for angles of incidence above a certain threshold value. This is best seen in figure 9 showinginstantaneous iso-contours of vorticity distributions computed at α = 26 ◦ , for which the flow settles toa steady regime similar to that observed at Re=10 (although the vorticity sheets spread much furtherdownstream), and 51 ◦ , for which the flow develops to a time-periodic, vortex-shedding state.A relevant objective here is thus to maximize the time-averaged lift coefficient¯ L = 1 τ Z t i + τt i L ( t ) d t , (19) This could scaled down to a few tens of minutes using an iterative Newton-like method, but the scope ofthis research is broader than just steady flows. α = 26 ◦ , and (b) α = 51 ◦ .(a) (b)Figure 10: Flow past a NACA 0012 at Re = 100. (a) Mean lift against the angle of attack, as computedby DNS. The vertical blue lines are the lower and upper boundaries of the optimal range of anglesobtained using DRL to solve the mean lift maximization problem, with the blue circle indicatingthe associated optimal lift. (b) Evolution per episode of the instant reward (in black) and of themoving average reward (in orange). The DNS values associated to min-max reward are reported asthe horizontal dashed lines.that mesures the mean lift force exerted on the airfoil, and reduces to the steady asymptotic lift forthose values of α yielding a steady state (as long as t i is large enough to dismiss the initial transient).figure 10(a) presents a curve of mean lift against α that has been obtained by DNS using the samesample of 30 values of α ∈ [ − ◦ ; 75 ◦ ]. In each simulation, the initial condition is marched in time upto t = 50 to leave out the transient, then up to t = 150 to compute (a posteriori) accurate averages.The averaging time τ = 100 represents 16 −
20 shedding cycles (depending on the angle of incidence)and has been checked to determine accurately the mean lift within 3%. The obtained distribution isagain perfectly antisymmetric with respect to α = 0, as should be. It loses continuity at α ∼ ◦ ,which reflects the bifurcation from the steady to the time-periodic regime through a classical linearinstability mechanism (the critical value being very close to that reported in [57]), and triggers asubstantial increase of lift at higher angles of incidence. The maximum is reached for α opt = 50 . L opt = 0 . r t = ¯ L ) uses 8 envi-ronments and 4 steps mini-batches to update the network for 32 epochs. Convergence is achievedafter 22 episodes; see figure 10(b) showing the evolutions of the instant reward (in black) and of the13igure 11: Schematic diagram of the flow past a tandem arrangement of two circular cylinders atzero-incidence.moving average reward (in orange). This represents 176 simulations, each of which is performed on1 core following the exact same procedure as above, and lasts 4h30 on average, hence approximately ∼ α opt ∈ [48 . ◦ ; 50 . ◦ ] for which ¯ L opt = 0 .
94, again perfectly in line with the DNS values; see the bluehue in figure 10(a).
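The time average (19) is evaluated a posteriori from the stored force history. A minimal sketch, assuming a uniformly sampled lift signal stored as NumPy arrays (not the actual post-processing code), reads:

```python
import numpy as np

def mean_coefficient(time, signal, t_i=50.0):
    """Time-averaged force coefficient, equation (19): discard the transient
    t < t_i and average the remaining, uniformly sampled values."""
    return signal[time >= t_i].mean()
```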
As a second test case, we consider the tandem arrangement of two identical circular cylinders in auniform stream at zero-incidence. A schematic of the configuration is provided in figure 11. Allvariables are made non-dimensional using the cylinders diameter d and the free-stream velocity, andthe sole control parameter is the center distance η between the centers of the two cylinders. Whilethis remains a simple problem on paper (especially as the control parameter space is small enoughthat it again allows direct comparisons between DRL/PPO and DNS), we shall see that it brings asignificant layer of complexity due to the topology of the cost functional. The origin of the coordinatesystem is set at the center of the upstream (main) cylinder, that is, the center of the downstream(or surrounding) cylinder is at ( η, x −∞ = 15, x ∞ = 40 and y ∞ = 15, which was found to be large enough not to have a discernible influence onthe results in the range of center distances specified hereinafter. The number of mesh edges for thiscase is n el = 125 , ∼ − , which yields20 −
25 grid points in the boundary-layer of the main cylinder, just prior to separation (depending onthe center distance and the Reynolds number). The agent for this case is similar in every respect tothat used in section 4, i.e., a fully-connected network with two hidden layers, each holding 4 neurons,learning rate equal to 5 × − , and PPO loss clipping range of 0 . As has been done in the NACA test case, the Reynolds number is set first to Re = 10 for the flowto remain steady regardless of the center distance. This is best seen in figure 12 showing iso-contoursof vorticity computed at η = 3, a small value for which the shear layer separating from the maincylinder simply engulfs its downstream counterpart, and η = 11, a value large enough for each cylinderto develop independent vorticity sheets (only those of the surrounding cylinder is of slightly lowermagnitude since the wake velocity it is subjected to is smaller than the uniform stream velocity).Since lift is inherently zero (using basic symmetry considerations), one meaningful objective for this caseis to maximize the steady asymptotic value of the total drag coefficient reached after the initial transient(i.e., the non-dimensional drag force exerted on the two-cylinder system). We show in figure 13(a) acurve of drag against the center distance, that has been obtained by DNS from a sample of 11 values14a) (b)Figure 12: Flow past the tandem arrangement of two circular cylinders at Re = 10. Iso-contours ofthe steady vorticity for (a) η = 3, and (b) η = 11.(a) (b)Figure 13: Flow past the tandem arrangement of two circular cylinders at Re = 10. (a) Drag againstthe center distance, as computed by DNS. The vertical blue line is the optimal distance obtained usingDRL to solve the drag maximization problem, with the blue circle indicating the associated optimaldrag. (b) Evolution per episode of the instant reward (in black) and of the moving average reward (inorange). The DNS values associated to min-max reward are reported as the horizontal dashed lines.of η uniformly distributed in the range [1; 11]. We did not consider pushing the second cylinder furtherdownstream to keep the computational costs affordable (especially in view of the time-dependent casesdocumented in the following), as this would have required extending the computational domain andincreasing the numbers of grid points accordingly - plus we do not anticipate such large distances tobe relevant from the standpoint of optimization since the interaction between both cylinders weakensincreasingly, although it can take up to several tens of diameters for the sensitivity to η to go to zero.In each simulation, the initial condition is marched in time up to t = 100 with time step ∆ t = 0 . . η , which we ascribe to the fact that the surrounding cylinder experiences negativedrag if sufficiently close to the main cylinder, on behalf of the low pressures prevailing in the gap [58].In return, the maximum is reached at η opt = 11, associated to D s,opt = 4 . r t = D s ) uses 16 environ-ments and 4 steps mini-batches to update the network for 32 epochs. Convergence is achieved after9 episodes, as illustrated in figure 13(b) showing the evolutions of the instant and moving averagerewards. This represents 144 simulations, each of which is performed on 1 core following the exactsame procedure as above, and lasts 2h30, hence 360h of CPU cost to achieve convergence. On behalf15a) (b)Figure 14: Flow past the tandem arrangement of two circular cylinders at Re = 100. 
Iso-contours ofthe instantaneous vorticity for (a) η = 7, and (b) η = 11.of the monotonical behavior of drag, it is not too surprising that the DRL converges to the optimaldistance η opt = 11 associated to D s,opt ∼ .
58, identical to the DNS values.
The Reynolds number is now set to Re = 100 to trigger flow unsteadiness. However, only a limitedrange of distances η = [7; 11] is investigated because the surrounding cylinder inhibits vortex sheddingfrom the main cylinder for sufficiently small values of η (the critical value at Re = 100 being η = 3 . η = 7, while a single vortex street is recovered at larger distances; see figure 14(b) for η = 11.One relevant objective here (in connection with fatigue reduction and energy harvesting problems influid-structure systems) is to maximize the root mean square (rms) lift coefficient L = (cid:18) τ Z t i + τt i ( L ( t ) − ¯ L ) d t (cid:19) / , (20)that measures the magnitude of the fluctuating lift force exerted on the two-cylinder system. fig-ure 15(a) shows a curve of rms lift against the center distance, that has been obtained by DNS from areduced sample of 9 values of η ∈ [7; 11]. In each simulation, the initial condition is marched in timeup to t = 100 to leave out the transient, then up to t = 200 to compute (a posteriori) meaningful oscil-lation amplitudes. The averaging time τ = 100 represents ∼
20 shedding cycles, and allows convergingthe rms lift within 5%. A maximum is reached for η opt = 8 .
8, associated to L opt = 0 . r t = L ) uses 16 envi-ronments and 4 steps mini-batches. Convergence is achieved after 12 episodes, as illustrated by theinstant and moving average rewards in figure 15(b). This represents 192 simulations, each of which isperformed on 1 core following the exact same procedure as above, and lasts 7h30 (much longer than forthe NACA 0012 test case, because all tandem solutions are unsteady), hence approximately ∼ η opt ∈ [8 .
69; 8 . L opt ∼ .
91, almost identical to the DNS values as the discrepancy in the optimal distanceis by 1%.
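In the same spirit as the mean lift computation of section 4, the rms lift (20) can be evaluated a posteriori from the stored signal; the sketch below is again a minimal illustration assuming a uniformly sampled history, not the actual post-processing code.

```python
import numpy as np

def rms_coefficient(time, signal, t_i=100.0):
    """Root-mean-square force coefficient, equation (20), for a uniformly
    sampled signal: fluctuation about the mean over the window t >= t_i."""
    f = signal[time >= t_i]
    return np.sqrt(np.mean((f - f.mean()) ** 2))
```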
Finally, the Reynolds number is set to Re = 300 for the flow to develop into its vortex-shedding stateregardless of the center distance. The full range η ∈ [1; 11] is considered, therefore the flow patternnow depends dramatically on the center distance. We shall not go into the full details here, as thishas been the subject of several numerical studies [60, 61, 62, 59, 63], but it is worth noticing thatfor small distances η (cid:46)
2, the separated shear layers from the main cylinder engulf their downstreamcounterparts and trigger shedding vortices far from the cylinders, although the gap flow between thecylinders remains steady; see figure 16(a) showing iso-contours of vorticity computed at η = 2. For η ∼
3, the gap flow becomes unsteady, but the gap vortices are not fully developed by the time theyimpinge on the surrounding cylinder, hence a single vortex street is observed in the far wake; seefigure 16(b) obtained for η = 3. For η ∼
4, one pair of gap vortices fully develops, then impingeson the downstream cylinder. This in turn triggers a complex interaction in the near wake of thesurrounding cylinder before a vortex street eventually forms further downstream; see figure 16(c) for η = 4. Finally, for η (cid:38)
5, the wake of the surrounding cylinder becomes unsteady again, and bothcylinders shed vortices that are synchronized and close to anti-phase; see figure 16(d) for η = 5.This broad variety of flow patterns triggers different features of flow unsteadiness, which shows inthe curve of rms lift against the center distance provided in 17(a). Such results have been obtainedfrom an extended sample of 30 values of η ∈ [1; 11], with each initial condition marched in time up to t = 200 to leave out the transient, then up to t = 300. The maximum is reached for η opt = 3 .
35 (i.e., avalue for which the gap vortices are close to being fully developed before impingement), associated to L opt = 1 .
99, but a close second lies at η ∼ opt = 7 .
25, associated to L opt = 1 .
36. The DRL/single-stepPPO resolution of the lift maximization problem uses 16 environments and 4 steps mini-batches, andconvergence is achieved after 20 episodes, as illustrated by the instant and moving average rewards infigure 17(b). This represents 320 simulations, each of which is performed on 1 core following the exactsame procedure as above, and lasts 10h30 (much longer than at Re=100 due to the increased simulationtime; more on this in the following), hence approximately 3300h of CPU cost to achieve convergence.Interestingly, DRL converges to an optimal distance η opt ∈ [7 .
32; 7 .
38] for which L opt ∼ .
39, i.e., itfails to identify the global optimal, but does converge to a value close to the secondary DNS maximum.17a) (b)(c) (d)Figure 16: Flow past the tandem arrangement of two circular cylinders at Re = 300. Iso-contours ofthe instantaneous vorticity for (a) η = 2, (b) η = 3, (c) η = 4 and (d) η = 5.(a) (b)Figure 17: Flow past the tandem arrangement of two circular cylinders at Re = 300. (a) Root meansquare lift against the center distance, as computed by DNS. The vertical blue lines are the lower andupper boundaries of the optimal range of angles obtained using DRL to solve the rms lift maximizationproblem, with the blue circle indicating the associated optimal lift. (b) Evolution per episode of theinstant reward (in black) and of the moving average reward (in orange). The DNS values associatedto min-max reward are reported as the horizontal dashed lines.This half-failure can be explained by the steepness of the reward gradients with respect to the controlvariable in the vicinity of the global optimal, that reflects the existence of a secondary instability18a) (b)Figure 18: Flow past the tandem arrangement of two circular cylinders at Re = 300: time history oflift for (a) η = 3, (b) η = 9.Figure 19: Schematic diagram of the flow past a two-circular cylinder system consisting of a fixedcylinder and movable control cylinder.mechanism at play in a narrow range of center distances. This is illustrated in figure 18(a) showingthat for η ∼
3, the flow settles to a first time-periodic solution, then bifurcates to a second time-periodic solution associated with increased lift oscillations (hence the large value of t i used for thiscase). Actually, the DRL does identify high reward positions close to η = 3, as evidenced by theearly peaks in the instant reward curve in figure 17(b), whose value L ∼ The third test case if that of a circular cylinder in a uniform stream, for which we assess the effectof inserting a second smaller (control) cylinder, as depicted in figure 19. All variables are made non-dimensional using the diameter d of the main cylinder and the free-stream velocity. The origin of the19oordinate system is set at the center of the main cylinder. The diameter of the control cylinder is setto d c = 0 .
1, and the sole control parameter is the position of its center denoted by ( x c , y c ). While thismay not seem overly complicated, the parameter space is now large enough do dismiss mapping thebest positions for placement of the control cylinder by classical numerical procedures (to give a taste,a tens of thousands of DNS are required to cover merely a few diameters around the main cylinder).In the following, the DRL/PPO results are thus compared to theoretical predictions obtained by theadjoint method, that has proven fruitful to gain insight into the most efficient region from the linearsensitivity of the uncontrolled flow (i.e., the flow past the main cylinder), without ever calculatingthe controlled states. We shall not go into the technicalities of how to derive said sensitivity by theadjoint method, as this is a topic thoroughly covered in a series of recent papers [20, 64, 65] to whichthe interested reader is referred for further deepening. The line of thought is conversely to take theoutput sensitivity as a given to assess the relevance of DRL/PPO.The agent for this case is a fully-connected network with two hidden layers, each holding 2 neurons,with the same learning rate 5 × − and PPO loss clipping range 0 . x −∞ = − x ∞ = 15 and y ∞ = 40. Thenumber of mesh edges for this case is n el = 190 , ∼ . × − at the main cylinder, and ∼ × − at control cylinder, which yields 25 −
40 grid points in the boundary-layer of the control cylinder(depending on the position of the control cylinder and the Reynolds number).
The Reynolds number is set first to Re = 40 for the flow to remain steady regardless of the position ofthe control cylinder. This is best seen in figures 20(a) − x c , y c ) = ( − . , x c , y c ) = (0 . , . x c , y c ) = (1 . , D s , that inherently pertainto the two-cylinder system, to those values D s + δD s computed by the adjoint method, where D s isthe drag exerted on the main cylinder in the absence of control cylinder, and δD s is a control-inducedvariation computed simultaneously for all possible positions of the control cylinder as δD s = Z Ω ( u † + 2 e x ) · δ f d s . (21)In equation (21), u † is a steady adjoint velocity, solution to ∇ · u † = 0 , −∇ u † · u + ∇ u T · u † + ∇ · σ ( − p † , u † ) = , (22)together with boundary condition u † = 2 e x at the surface of the main cylinder, that we compute withthe finite-element solver presented in [20]. Also, δ f is a simple model of the force exerted by the controlcylinder on the flow, namely a force opposite to the drag felt by the control cylinder in a uniform flowat the local velocity, carefully crafted to numerical and experimental data, whose relevance to mimicwith good accuracy the effect of a true control cylinder in the range of Reynolds numbers consideredis discussed extensively in [20].A map of the so-computed drag variations is shown in figure 20(d), where the successful positions ofthe control cylinder (those for which D s < D s , or equivalently δD s <
0) are indicated by the bluehue. The map is symmetric with respect to the centerline y = 0 and indicates that the total dragis reduced in two distinct flow regions, a first one located approximately 1 diameter upstream of themain cylinder, and a second one located in the vicinity of the rear stagnation point and extendingdownstream over 3 diameters, along the outer boundary of the recirculation region. Interestingly, bothregions achieve a similar drag reduction by ∼ r t = − D s ) uses 8 environments and a single-step mini-batch to update the network for 32 epochs,20a) (b)(c) (d)Figure 20: Flow past a two-circular cylinder system at Re = 40. ( a − c ) Iso-contours of the steadyvorticity for (a) ( x c , y c ) = ( − . , x c , y c ) = (0 . , .
9) and (c) ( x c , y c ) = (1 . , d c = 0 .
1, as computed using a steadyadjoint method, modeling the presence of the control cylinder by a pointwise reacting force localizedat the same location where the control cylinder is placed.and restricts to positions of the control cylinder in the upper half-plane ( y >
0) to take advantage ofthe problem symmetries. Convergence is achieved after 75 episodes, as illustrated by the instant andmoving average rewards in figure 21(b). This represents 600 simulations in which the initial conditionis marched in time up to t = 150 time step ∆ t = 0 .
1. Each simulation is performed on 1 core and lasts1h30, hence approximately 900h of CPU cost to achieve convergence. This number of episodes maybe deemed large compared to the other test cases documented herein, but it is easily explained by thecomplexity of the functional topology, that consists of two distinct valleys of comparable depth. Fromthis point of view, the fact that the DRL achieves convergence is simply a matter of chance, as it canin theory oscillate forever between both. The positions investigated by the DRL are reported as thegrey circles in figure 21(a), together with those optimal positions associated with the minimum drag D s,opt ∼ .
51 in red. Despite some limited discrepancies regarding the spatial extent of the upstreamregion, the reported agreement is very satisfying, and establishes the ability of DRL to identify bothregions of interest, and to predict the drag reduction achieved in these regions.
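Since the search is restricted to the upper half-plane to exploit the symmetry of the uncontrolled flow, each raw two-dimensional action output by the agent must be mapped to an admissible control-cylinder position. The paper does not spell out this mapping, so the sketch below is purely hypothetical: the bounds and the squashing function are illustrative assumptions.

```python
import numpy as np

def to_control_position(raw_action, x_bounds=(-2.0, 4.0), y_bounds=(0.0, 2.0)):
    """Hypothetical mapping from a raw 2-D action to a control-cylinder position
    (x_c, y_c) restricted to the upper half-plane (y_c >= 0)."""
    a = np.tanh(np.asarray(raw_action, dtype=float))                 # squash to [-1, 1]
    x_c = x_bounds[0] + 0.5 * (a[0] + 1.0) * (x_bounds[1] - x_bounds[0])
    y_c = y_bounds[0] + 0.5 * (a[1] + 1.0) * (y_bounds[1] - y_bounds[0])
    return x_c, y_c
```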
We set now the Reynolds number to Re = 100 for the flow to consistently develop to time-periodicvortex-shedding, i.e., the control cylinder is unable to inhibit vortex shedding from the main cylinder.In contrast, the flow past the control cylinder remains steady, as illustrated in figures 22(a) − d c = 0 .
1. The grey circles are the positions investigated bythe DRL. The red circles single out those DRL positions associated with an optimal reduction. Thenegative iso-contours of the adjoint-based variation are superimposed from figure 20(d). (b) Evolutionper episode of the instant reward (in black) and of the moving average reward (in orange).easily explained by the fact that the Reynolds number is the wake of the latter is roughly scaled by theratio of the cylinder diameters, and therefore remains below 15 regardless of the position of the controlcylinder. The intended objective here is to minimize the total mean drag. This requires comparing theDRL values of ¯ D , that again inherently pertain to the two-cylinder system, to those values ¯ D + δ ¯ D computed by the adjoint method, where ¯ D is the mean drag exerted on the main cylinder in theabsence of control cylinder, and δ ¯ D is the control-induced variation computed simultaneously for allpossible positions of the control cylinder as δ ¯ D = 1 τ Z t i + τt i Z Ω ( u † ( t ) + 2 e x ) · δ f ( t ) d s d t , (23)with u † the time-dependent adjoint velocity, solution to ∇ · u † = 0 , − ∂ t u † − ∇ u † · u + ∇ u T · u † + ∇ · σ ( − p † , u † ) = , (24)with boundary condition u † = 2 e y at the surface of the main cylinder.A map of the so-computed drag variations is reproduced from [66] in figure 22(d), where the successfulpositions of the control cylinder (those for which ¯ D < ¯ D , or equivalently δ ¯ D <
0) are indicated bythe blue hue. The map is again symmetric with respect to the centerline, and indicates that the totaldrag is reduced in two distinct flow regions, a first one approximately 1 diameter upstream of the maincylinder, that extends upstream over 2 diameters (hence further upstream than at Re = 40) and inwhich drag is reduced by roughly 2%, and a second one in the vicinity of the rear stagnation point,that extends downstream over 2 diameters along the outer boundary of the mean recirculation region(shorter than that found at Re = 40), and in which drag is reduced by almost 8%. The DRL/single-step PPO resolution of the lift maximization problem( r t = − ¯ D ) uses again 8 environments and asingle-step mini-batch, and convergence is achieved after 35 episodes, as illustrated by the instant andmoving average rewards in figure 23(b). This represents 280 simulations in which the initial conditionis marched in time with time step ∆ t = 0 .
1, up to t = 100 to leave out the transient, then up to t = 200 to compute relevant averages. Each simulation is performed on 1 core and lasts 4h30, henceapproximately 1260h of CPU cost to achieve convergence. The reduced cost compared to the steadycase is ascribed to the simpler functional topology, that features now two valleys with clearly differentdepths. This shows up in figure 23(a) depicting the positions investigated by the DRL (grey circles)together with those optimal positions associated with the minimum drag ¯ D opt ∼ .
23 in red. One seesthat while a few optimal points are identified in the upstream region, DRL quickly settles for the mostefficient downstream region, and more specifically the core of the mean recirculation region, in strikingagreement with the adjoint-based results. 22a) (b)(c) (d)Figure 22: Flow past a two-circular cylinder system at Re = 100. ( a − c ) Iso-contours of the instan-taneous vorticity for (a) ( x c , y c ) = ( − . , x c , y c ) = ( − . , .
2) and (c) ( x c , y c ) = (1 . , d c = 0 .
1, as computed usingthe time-varying adjoint method, modeling the presence of the control cylinder by a pointwise reactingforce localized at the same location where the control cylinder is placed. Reproduced from [66].
Fluid dynamicists have just begun to gauge the relevance of deep reinforcement learning techniquesto assist the design of optimal flow control strategies. This research weighs in on this issue and showsthat the proposed single-step PPO holds a high potential as a reliable, go-to black-box optimizer forcomplex CFD problems. The findings reported herein add value to the shallow literature on thissubject, as the combination of PPO and CFD has proven fruitful to tackle several time-independentand time-dependent test cases, some of which are not tractable using canonical direct methods. Despitethese achievements, there is still a need to consolidate the acquired knowledge, and to further develop,characterize and fine-tune the method, whether it be via an improved balance between explorationand exploitation to deal with steep global maxima (for instance using Trust Region-Guided PPO, as iteffectively encourages the policy to explore more on the potential valuable actions, no matter whetherthey were preferred by the previous policies or not [43]), via non-normal probability density functionsto deal with multiple global maxima, or via coupling with a surrogate model trained on-the-fly.One point worth being emphasized is that while we do report exhaustive data regarding the computa-tional efficiency of DRL with the hope to foster future comparisons, we did not seek to optimize saidefficiency, neither by optimizing the hyper parameters, nor by using pre-trained deep learning models(as is done in transfer learning). Actually, even the most successful DRL algorithm will not be able tocompete with the state-of-the art adjoint methods one generally opts for when parameter spaces arelarge enough to dismiss direct approaches, at least from the standpoint of CPU cost. This is because23a) (b)Figure 23: Flow past a two-circular cylinder system at Re = 100. (a) Variation of the total mean draginduced by a control cylinder of diameter d c = 0 .
1. The grey circles are the positions investigated bythe DRL. The red circles single out those DRL positions associated with an optimal reduction. Thenegative iso-contours of the adjoint-based variation are superimposed from figure 22(d). (b) Evolutionper episode of the instant reward (in black) and of the moving average reward (in orange).the adjoint method involves only two numerical simulations (one for the uncontrolled solution, onefor the adjoint solution), so convergence would need to be achieved in less than two episodes (usingsingle-step PPO) to match that cost. The one area where DRL can outperform the adjoint method isapplicability. On the one hand, adjoint methods are increasingly difficult to apply rigorously as tur-bulence sets in, because the high sensitivity to initial conditions yields exponentially diverging adjointsolutions as soon as the length of the adjoint simulation exceeds the predictability time scale [67]. Onthe other hand, they suffer from the reversal in space-time directionality (evidenced by the minus signahead of the material derivative term in (24)) that imposes to march the adjoint equations backwardsin time. This in turn requires knowledge of the history of the cylinder flow solution through the entiretime-span of the adjoint simulation, which is inconvenient and very costly in terms of storage. To give ataste, it takes 45 Gb to solve the passive control case at Re = 100 using the method documented in [20](solving first the Navier–Stokes equations, writing all time steps to disk, solving the adjoint equationsover the same time interval and with the same time step, and discarding the early and late time stepscorresponding to transients of the DNS and adjoint solutions to compute meaningful time averages ofthe adjoint-based integrands). It would therefore take up to 2 Tb to handle a simple three-dimensional(3-D) case with 40 points distributed in the span-wise direction, which is beyond feasible limits. Incontrast, it would likely take minimum effort to apply DRL-CFD to 3-D, turbulent flows, as the onlyprerequisite is the ability to compute accurate numerical solutions (which behoves the CFD solver, notthe reinforcement learning algorithm), and the storage cost of an episode in the single-step PPO issimply that of a CFD run, times the number of environments. This should opens a number of researchdirections, including applying the methodology to systems of industrial relevance, featuring, e.g., highReynolds numbers, multiple phases, non-Newtonian fluids and/or coupled fluid-thermic processes.
Acknowledgements
This work is supported by the Carnot M.I.N.E.S. Institute through the M.I.N.D.S. project.