Single-step deep reinforcement learning for open-loop control of laminar and turbulent flows
Optimization and passive flow control using single-step deep reinforcement learning
A Preprint
H. Ghraieb
MINES Paristech , PSL - Research University, CEMEF
J. Viquerat ∗ MINES Paristech , PSL - Research University, CEMEF [email protected]
A. Larcher
MINES Paristech , PSL - Research University, CEMEF
P. Meliga
MINES Paristech , PSL - Research University, CEMEF
Elie Hachem
MINES Paristech , PSL - Research University, CEMEF

June 5, 2020
Abstract
This research gauges the ability of deep reinforcement learning (DRL) techniques to assist the optimization and control of fluid mechanical systems. It combines a novel, "degenerate" version of the proximal policy optimization (PPO) algorithm, that trains a neural network in optimizing the system only once per learning episode, and an in-house stabilized finite elements environment implementing the variational multiscale (VMS) method, that computes the numerical reward fed to the neural network. Three prototypical examples of separated flows in two dimensions are used as testbed for developing the methodology, each of which adds a layer of complexity due either to the unsteadiness of the flow solutions, or the sharpness of the objective function, or the dimension of the control parameter space. Relevance is carefully assessed by comparing systematically to reference data obtained by canonical direct and adjoint methods. Beyond adding value to the shallow literature on this subject, these findings establish the potential of single-step PPO for reliable black-box optimization of computational fluid dynamics (CFD) systems, which paves the way for future progress in optimal flow control using this new class of methods.
Keywords
Deep Reinforcement Learning; Artificial Neural Networks; Computational fluid dynamics; Flow control; Adjoint method

∗ Corresponding author

Introduction
Flow control, defined as the ability to finesse a flow into a more desired state, is a field of tremendous societal and economical importance. In applications such as ocean shipping or airline traffic, reducing the overall drag by just a few percent while maintaining lift can help reduce fossil fuel consumption and CO2 emission while saving several billion dollars annually. Many other scenarios relevant to fluid mechanical systems call for similarly improved engineering design, e.g., the airline industry is greatly concerned with reducing the structural vibrations and the radiated noise that occur under unsteady flow conditions, while microfluidics and combustion both benefit from enhanced mixing (which can be achieved by promoting unsteadiness in some appropriate manner). All such problems fall under the purview of this line of study.

Flow control is often benchmarked in the context of bluff body drag reduction. Numerous strategies have been implemented, either open-loop with passive appendices (e.g., end plate, splitter plate, small secondary cylinder, or flexible tail), or open-loop with actuating devices (e.g., plasma actuation, steady or unsteady base bleeding, rotation), or closed-loop (e.g., via transverse motion, blowing/suction, rotation, all relying on an appropriate sensing of flow variables). An overview of the recent achievements and perspectives can be found in several comprehensive surveys [1, 2, 3, 4, 5, 6, 7, 8, 9]. Nonetheless, many of the proposed strategies are trial and error, and therefore require extensive and costly experimental or numerical campaigns. This has motivated the development of rigorous mathematical formalisms capable of designing optimal control strategies with minimal effort. The adjoint method is one popular family of such techniques, that has proven efficient at computing accurate, low-cost estimates of the objective gradient with respect to the control variables in large optimization spaces, and is widely used for many applications ranging from oceanography and atmospheric sciences [10] to aerodynamic design [11, 12, 13, 14], by way of fresh developments meant to promote/mitigate the linear amplification of flow disturbances [15, 16, 17, 18, 19, 20].

It is the premise of this research that the task of selecting an optimal subset of control parameters can be alternatively automated with the assistance of supervised learning algorithms (i.e., the simplest and most common form of machine learning, where an agent learns from labeled training data and is provided with corrective feedback). The breakthrough in this field dates back to the introduction of the back-propagation algorithm [21]. It has helped popularize Artificial Neural Networks (ANN), a family of versatile non-parametric tools that can be trained to hierarchically extract informative features from data, and can be used for several purposes, from exploratory data analysis to qualitative and quantitative predictive modeling. Since then, the increased affordability of high performance computational hardware (together with reduced costs for computation and data storage) has allowed leveraging the ever-increasing volume of numerical and experimental data generated for research and engineering purposes into novel insight and actionable information, which in turn has reshaped entire scientific disciplines, such as image analysis [22] or robotics [23, 24].
Because neural networks haveproduced most remarkable results when applied to stiff large-scale nonlinear problems [25], it is onlynatural to assume that they are similarly suitable for the state-space models arising from the high-dimensional Machine learning algorithms have thus been making rapid inroads in fluid mechanics, asa mean to solve the Navier–Stokes equations [26] or to predict closure terms in turbulence models [27];see also Ref. [28] for an overview of the current developments and opportunities.The focus here is on deep reinforcement learning (DRL), an advanced branch of machine learningin which neural networks specialize in solving decision-making problems (the deep terminology beingused to weigh on the sizable depth of the network itself). Simply put, the neural network trains infinding out which actions or succession of actions maximize a numerical reward signal, one challengebeing that a given action affects not only the immediate reward but also the next states and, thus,all the subsequent rewards. The most well-known success story in this field is AlphaGo, the ANNthat defeated the top-level human player at the game of Go [29], but the approach has also provenfruitful to solve computer vision problems (e.g., filtering, or extracting image features) [30] and hasenabled breakthroughs in the optimal control of complex dynamic systems [31, 32], including leggedrobots deployed in real environments [33]. Despite these achievements, DRL remains surprisinglysparsely used in fluid mechanics, with a limited amount of literature barely scratching the surfaceof the performance improvements to be delivered in flow control [34, 35, 36] and shape optimizationproblems [37]. 2 x x y y Figure 1: Fully connected neural network with two hidden layers, modeling a mapping from R to R .This researches assesses the feasibility of applying proximal policy optimization (PPO [32]) for flow con-trol and optimization purposes, to both shape the capabilities of the method (PPO is still a relativelynewcomer that has quickly emerged as the go-to DRL algorithm due to its data efficiency, simplicityof implementation and reliable performance) and narrow the gap between DRL and advanced numer-ical methods for multiscale, multi-physics computational fluid dynamics (CFD). We investigate morespecifically the “degenerate” single-step PPO algorithm introduced in [37], in which the neural networkgets only one attempt per learning episode at finding the optimal. Prototypical examples of separatedflows in two dimensions are used as testbed for developing this novel approach (to the best of theauthors knowledge, no previous study in the related literature has considered using DRL in a similarfashion), as it has been speculated to hold a high potential as a reliable black-box optimizer for CFDproblems (where only the final configuration is of interest), but needs further characterization andfine-tuning.The organization is as follows: the single-step PPO method is presented in section 2, together withthe baseline principles and assumptions of both DRL and PPO. Section 3 outlines the main features ofthe finite element CFD environment used to compute the numerical reward fed to the neural network,whose efficiency and reliability has been assessed in several previous studies. 
Three test cases arethen organized by increasing level of complexity, namely the flow past a NACA 0012 airfoil (section4), the flow past a tandem arrangement of two identical cylinders (section 5), and the passive controlof a cylinder flow (section 6). For each case, exhaustive results pertaining to the DRL optimizationof steady asymptotic, time-averaged, or fluctuating drag and lift are reported and carefully validatedagainst reference data obtained by canonical direct/adjoint methods. A neural network (NN) is a collection of artificial neurons, i.e., connected computational units withuniversal approximation capabilities [38], that can be trained to arbitrarily well approximate themapping function between input and output spaces. Each connection provides the output of a neuronas an input to another neuron. Each neuron performs a weighted sum of its inputs, to assign significanceto the inputs with regard to the task the algorithm is trying to learn. It then adds a bias to betterrepresent the part of the output that is actually independent of the input. Finally, it feeds an activationfunction that determines whether and to what extent the computed value should affect the ultimateoutcome. As sketched in figure 1, a fully connected network is generally organized into layers, with theneurons of one layer being connected solely to those of the immediately preceding and following layers.The layer that receives the external data is the input layer, the layer that produces the outcome is theoutput layer, and in between them are zero or more hidden layers.The design of an efficient neural network requires a proper optimization of the weights and biases,together with a relevant nonlinear activation function. The abundant literature available on this topic(see, e.g., Ref. [39] and the references therein) points to a relevant network architecture (e.g., type3 nvironment s t s t +1 Agent r t a t w t Figure 2: RL agent and its interactions with its environment.of network, depth, width of each layer), finely tuned hyper parameters (i.e., parameters whose valuecannot be estimated from data, e.g., optimizer, learning rate, batch size) and a sufficiently large amountof data to learn from as being the key ingredients for success.
Deep reinforcement learning (DRL) is an advanced branch of machine learning in which deep neural networks train in solving sequential decision-making problems. It is a natural extension of reinforcement learning (RL), in which an agent (the neural network) is taught how to behave in an environment by taking actions and by receiving feedback from it under the form of a reward (to measure how good or bad the taken action was) and information (to gauge how the action has affected the environment). This can be formulated as a Markov Decision Process, for which a typical execution goes as follows (see also figure 2):

• assume the environment is in state s_t ∈ S at iteration t, where S is a set of states,
• the agent uses w_t, an observation of the current environment state (and possibly a partial subset of s_t) to take action a_t ∈ A, where A is a set of actions,
• the environment reacts to the action by transitioning from s_t to state s_{t+1} ∈ S,
• the agent is fed with a reward r_t ∈ R, where R is a set of rewards, and a new observation w_{t+1}.

This repeats until some termination state is reached, the succession of states and actions defining a trajectory τ = (s_0, a_0, s_1, a_1, ...). In any given state, the objective of the agent is to determine the action maximizing its cumulative reward over an episode, i.e., over one instance of the scenario in which the agent takes actions. Most often, the quantity of interest is the discounted cumulative reward along a trajectory, defined as

R(τ) = Σ_{t=0}^{T} γ^t r_t ,    (1)

where T is the horizon of the trajectory, and γ ∈ [0, 1] is a discount factor that weighs the relative importance of present and future rewards (the agent being short-sighted in the limit where γ → 0, and far-sighted in the limit where γ → 1, since it then cares equally about all rewards).

There exist two main types of RL algorithms, namely model-based methods, in which the agent tries to build a model of how the environment works to make predictions about what the next state and reward will be before taking any action, and model-free methods, in which the agent conversely interacts with the environment without trying to understand it, and which are prominent in the DRL community. Another important distinction to be made within model-free algorithms is that between value-based methods, in which the agent learns to predict the future reward of taking an action when provided a given state, then selects the maximum action based on these estimates, and policy-based methods, in which it optimizes the expected reward of a decision policy mapping states to actions. Many of the most successful algorithms in DRL (including proximal policy optimization, whose assessment for flow control and optimization purposes is the primary motivation for this research) proceed from policy gradient methods, in which gradient ascent is used to optimize a parameterized policy with respect to the expected return, as further explained in the next section. The reader interested in a more thorough introduction to the zoology of RL methods (together with their respective pros and cons) is referred to Ref. [40].
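Before moving on, note that the discounted return (1) is straightforward to evaluate once a trajectory of rewards has been collected. The short Python sketch below is added purely for illustration (it is not part of the original workflow) and makes the role of γ explicit:

```python
def discounted_return(rewards, gamma):
    """Discounted cumulative reward R(tau) of equation (1), for a trajectory
    of rewards r_0, ..., r_T and a discount factor gamma in [0, 1]."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# gamma -> 0: only the immediate reward matters (short-sighted agent);
# gamma -> 1: all rewards count equally (far-sighted agent).
rewards = [1.0, 0.5, 0.25]
print(discounted_return(rewards, gamma=0.0))  # 1.0
print(discounted_return(rewards, gamma=1.0))  # 1.75
```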
This section, intended for the non-specialist reader, briefly reviews the basic principles and assumptions of policy gradient methods, together with the various steps taken for improvement.
Policy methods.
A policy method maximizes the expected discounted cumulative reward of a decision policy mapping states to actions. It resorts not to a value function, but to a probability distribution over actions given states, that fully defines the behavior of the agent. Since policies are most often stochastic, the following notations are introduced:

• π(s, a) is the probability of taking action a in state s under policy π,
• Q^π(s, a) is the expected value of the return of the policy after taking action a in state s (also termed state-action value function or Q-function)

Q^π(s, a) = E_π[ R(τ) | s, a ] ,    (2)

• V^π(s) is the expected value of the return of the policy in state s (also termed value function or V-function)

V^π(s) = E_π[ R(τ) | s ] .    (3)

The V and Q functions are therefore such that

V^π(s) = Σ_a π(s, a) Q^π(s, a) ,    (4)

so V^π(s) can also be understood as the probability-weighted average of discounted cumulated rewards over all possible actions in state s.

Policy gradient methods.

A policy gradient method aims at optimizing a parametrized policy π_θ, where θ denotes the free parameters whose value can be learnt from data (as opposed to the hyper parameters). In practice, one defines an objective function based on the expected discounted cumulative reward

J(θ) = E_{τ∼π_θ}[ R(τ) ] ,    (5)

and seeks the parameterization θ* maximizing J(θ), hence such that

θ* = arg max_θ E_{τ∼π_θ}[ R(τ) ] ,    (6)

which can be done on paper by plugging an estimator of the policy gradient ∇_θ J(θ) into a gradient ascent algorithm. This is no small task, as one is looking for the gradient with respect to the policy parameters in a context where the effects of policy changes on the state distribution are unknown (since modifying the policy will most likely modify the set of visited states, which will in turn affect performance in some indefinite manner). The most commonly used estimator, derived in [40] using the log-probability trick, reads

∇_θ J(θ) = E_{τ∼π_θ}[ Σ_{t=0}^{T} ∇_θ log(π_θ(s_t, a_t)) R(τ) ] ∼ E_{τ∼π_θ}[ Σ_{t=0}^{T} ∇_θ log(π_θ(s_t, a_t)) A^π(s, a) ] ,    (7)

where the rightmost term is a convenient approximation relying on the advantage function

A^π(s, a) = Q^π(s, a) − V^π(s) ,    (8)

that measures the improvement (if A > 0, otherwise the lack thereof) associated with taking action a in state s compared to taking the average over all possible actions. This is because the value function does not depend on θ, so taking it off changes neither the expected value, nor the gradient, but it does reduce the variance, and speeds up the training. Furthermore, when the policy π_θ is represented by a neural network (in which case θ simply denotes the network weights and biases to be optimized), the focus is rather on the policy loss defined as

L(θ) = E_{τ∼π_θ}[ Σ_{t=0}^{T} log(π_θ(a_t | s_t)) A^π(s, a) ] ,    (9)

whose gradient is equal to the (approximated) policy gradient (7) (since the gradient operator acts only on the log-policy term, not on the advantage) and is computed with respect to each weight and bias by the chain rule, one layer at a time, using the back-propagation algorithm [21].

Figure 3: Agent network example used to map states to policy. The input state s, here of size 2, is mapped to a mean µ and a standard deviation σ vector, each of size 2. All activation functions are ReLU, except for those of the last layer, which are linear for the µ output and softplus for the σ output. Orthogonal weights initialization is used throughout the network.

Trust regions.
The performance of policy gradient methods is hurt by the high sensitivity to the learning rate, i.e., the size of the step to be taken in the gradient direction. Indeed, small learning rates are detrimental to learning, but large learning rates can lead to a performance collapse if the agent falls off the cliff and restarts from a poorly performing state with a locally bad policy. This is all the more harmful as the learning rate cannot be tuned locally, meaning that an above-average learning rate will speed up learning in some regions of the parameter space where the policy loss is relatively flat, but will possibly trigger an exploding policy update in other regions exhibiting sharper variations. One way to ensure continuous improvement is by imposing a trust region constraint to limit the difference between the current and updated policies, which can be done by determining first a maximum step size relevant for exploration, then by locating the optimal point within this trust region. We will not dwell on the intricate details of the many algorithms developed to solve such trust region optimization problems, e.g., natural policy gradient (NPG [41]), or trust region policy optimization (TRPO [42]). Suffice it to say that they use the minorize-maximization algorithm to maximize iteratively a surrogate policy loss (i.e., a lower bound approximating locally the actual loss at the current policy), but are difficult to implement and can be computationally expensive, as they rely on an estimate of the second-order gradient of the policy log probability.

Proximal policy optimization.

Proximal policy optimization (PPO) is another approach with simple and effective heuristics, that uses a probability ratio between the two policies to maximize improvement without the risk of performance collapse [32]. The focus here is on the PPO-clip algorithm, that optimizes the surrogate loss

L(θ) = E_{(s,a)∼π_θ}[ min( (π_θ(a|s) / π_θold(a|s)) A^{π_θ}(s, a) , g(ε, A^{π_θ}(s, a)) ) ] ,    (10)

where

g(ε, A) = (1 + ε) A if A ≥ 0 ,   (1 − ε) A if A < 0 ,    (11)

and ε is the clipping range, a small hyper parameter (typically between 0.1 and 0.3) defining how far away the new policy is allowed to go from the old. In practice, a state-action pair yielding a positive advantage will cause π_θ(a|s) to increase for the action to become more likely. By doing so, if it increases in a way such that π_θ(a|s) > (1 + ε) π_θold(a|s), the min in (10) kicks in and its argument hits a ceiling of (1 + ε) A^{π_θ}(s, a). Conversely, a state-action pair yielding a negative advantage will cause π_θ(a|s) to decrease for the action to become less likely. If it decreases in a way such that π_θ(a|s) < (1 − ε) π_θold(a|s), the min kicks in again and its argument hits a ceiling of (1 − ε) A^{π_θ}(s, a). In neither case does the new policy have any incentive to step too far away from the current policy, which ensures that both policies will behave similarly.

The strengths of the approach lie in the fact that the clipped surrogate itself is cheap to compute, and that it needs only an estimate of the first-order gradient of the policy log probability (like policy gradient methods). Refinements have been proposed recently, for instance the Trust region PPO (TRGPPO) that adaptively adjusts the clipping range within the trust region [43], but classic PPO is generally acknowledged to be one of the most successful DRL methods, combining ease of implementation and tuning, together with state-of-the-art performance across a wide range of challenging tasks.

We now come to the single-step PPO method proposed in [37], a "degenerate" version of PPO in which the agent and the environment get to exchange only one single triplet (s, a, r) per learning episode. This represents a significant difference with respect to classic Markov-based methods, and amounts to viewing the agent as a parameterized mapping f_θ from state to action. As sketched in figure 3, the agent is consistently fed with the same input state s (usually a vector of zeros), and outputs a policy π that can be represented by mean and standard deviation vectors (denoted by µ and σ, respectively) whose size matches the dimension of the action required by the environment. The optimization loop is as follows: first, the agent implements a random mapping f_θ from s to an initial policy determined by the initialization of the network parameters θ. At each iteration, a population of actions a_t = f_θt(s) is drawn from the current policy, the reward associated to each set of actions is computed, and the agent is returned incentives (through the rewards) to update the free parameters in a way such that the next population of actions a_{t+1} ∼ f_θ{t+1}(s) will yield higher rewards (see also figure 4). This is done following for the most part the various steps described in section 2.3, only one seeks in practice the optimal mapping f_θopt such that a_opt ∼ f_θopt(s), not the optimal set of actions a_opt itself, and the loss is computed from (10) by substituting a normalized averaged reward for the advantage function.

It is worth noticing that an accurate estimation of the policy gradient requires evaluating a large amount of actions drawn from the current policy, which comes at the expense of computing the same amount of reward evaluations.
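To make the clipped surrogate (10)-(11) concrete, the following sketch evaluates its argument for a batch of sampled state-action pairs. This is an illustration only, written in plain NumPy rather than mirroring the Stable Baselines implementation actually used in this work; the default ε = 0.2 is a commonly used value, not one prescribed by the paper. In single-step PPO, the advantage entries are simply the normalized rewards of the current population of actions.

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Argument of the expectation in equation (10), evaluated element-wise.
    log_prob_new / log_prob_old : log pi_theta(a|s) under the new and old policies,
    advantage                   : estimate of A^pi(s, a) (a normalized reward in single-step PPO),
    eps                         : clipping range."""
    ratio = np.exp(log_prob_new - log_prob_old)       # pi_theta(a|s) / pi_theta_old(a|s)
    clipped = np.where(advantage >= 0.0,
                       (1.0 + eps) * advantage,       # g(eps, A) for A >= 0, equation (11)
                       (1.0 - eps) * advantage)       # g(eps, A) for A <  0
    return np.minimum(ratio * advantage, clipped)

# The surrogate loss maximized by gradient ascent is the batch mean, i.e.,
# L(theta) ~ np.mean(ppo_clip_objective(lp_new, lp_old, adv)).
```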
One key issue in the context of CFD applications is thus the ability todistribute a given set of actions to a parallel environment running on large computer clusters, as weshow in the following that the CPU cost of solving the present academic optimization problems in twodimensions ranges from a hundred to several thousand hours. There is also a PPO-Penalty variant which uses a penalization on the average Kullback–Leibler divergencebetween the current and new policies, but PPO-clip performs better in practice. Agent Parallelenvs. a t r t θ t → θ t +1 Figure 4: Action loop for single-step PPO. At each episode, the input state s is provided to theagent, which in turn provides n actions to n parallel environments. The latter return n rewards, thatevaluate the quality of each action taken. Once all the rewards are collected, an update of the agentparameters is made using the PPO loss (10). The process is repeated until convergence of the mapping(i.e., µ = µ ∗ and σ = ). In the following, we investigate two-dimensional (2-D) laminar incompressible flows over various ge-ometries of interest. The flow is described in a Cartesian coordinate system ( x, y ) with drag force(resp. lift force) positive in the stream-wise + x direction (resp. the cross-wise +y direction). Thegoverning equations to be solved numerically are the incompressible Navier–Stokes equations ∇ · u = 0 , ∂ t u + u · ∇ u + ∇ · [ p I − ε ( u )] = , (12)where u = ( u, v ) is the velocity field of component u (resp. v ) in the x (resp. y ) direction, p is thepressure, ε ( u ) = ( ∇ u + ∇ u T ) / In the context of finite element methods (that remain widely used to simulate engineering CFD systemsdue to their ability to handle complex geometries), direct numerical simulation (DNS) solves the weakform of (12) obtained by integrating by parts the pressure and viscous terms, to give( ∂ t u + u · ∇ u , w ) + 2Re ( ε ( u ) , ε ( w )) − ( p , ∇ · w ) + ( ∇ · u , q ) = 0 , (13)where ( w , q ) are relevant test functions for the velocity and pressure variables, and ( , ) is the L innerproduct on the computational domain.We use here the variational multiscale (VMS) approach [44, 45, 46] to solve a stabilized formulationof (13). This allows circumventing the Babuska—Brezzi condition (that otherwise imposes that dif-ferent interpolation orders be used to discretize the velocity and pressure variables, while we use heresimple continuous piecewise linear P elements for both) and prevents numerical instabilities in ad-vection regimes at high Reynolds numbers (that otherwise quickly spread in the shear regions andpollute the entire solution domain). We shall not go into the extensive details about the derivationof the stabilized formulation, for which the reader is referred to [47]. Suffice it to say here that theflow quantities are split into coarse and fine scale components, that correspond to different levels ofresolution. The fine scales are solved in an approximate manner to allow modeling their effect into thelarge-scale equations, which gives rise to additional terms in the right-hand side of (13), and yields the8ollowing weak form for the large scale( ∂ t u + u · ∇ u , w ) + 2Re ( ε ( u ) , ε ( w )) − ( p , ∇ · w ) + ( ∇ · u , q ) = X K ∈T h [( τ R M , u · ∇ w ) K + ( τ R M , ∇ q ) K + ( τ R C , ∇ · w ) K ] , (14)where ( , ) K is the inner product on element K , R M,C are the residuals defined by −R M = ∂ t u + u · ∇ u + ∇ p − R C = ∇ · u , (15)and τ , are ad-hoc mesh-dependent stabilization parameters defined in [48, 49]. 
Equation (14) issolved here with an in-house VMS solver whose accuracy and flexibility has been assessed in a series ofprevious papers [47, 50, 51, 49], together with open flow boundary conditions consisting of a uniformfree-stream u = (1 ,
0) at the inflow, free-surface conditions ∂ y u = v = 0 at the upper and lowerboundaries, advective conditions at the outflow, and no-slip conditions at all solid surfaces. For each test case documented in the following, a 2-D computational domain is defined asΩ = { ( x, y ) | x −∞ ≤ x ≤ x ∞ ; | y | ≤ y ∞ ; Γ s ( x, y ) < } , (16)where Γ s is a level set function that depends on the control parameters, whose zero iso-contour givesthe position of the solid surface S (the convention used here being to have Γ s ( x, y ) < s ≥ s on the control parameters is one key issue,as it imposes that a different mesh be used for each action taken by the agent over the course of thePPO optimization, each of which must be equally capable of accurately capturing the global and localspatio-temporal features at play. Here, this is achieved using the immerse volume method (IVM),widely used to immerse and represent complex geometries inside a unique mesh, and that has beenimplemented in the context of finite element VMS methods (see, e.g., [50, 49] for a detailed presentationof the mathematical formulation relevant to a rigid body immersed in an incompressible fluid). Foreach geometry and each value of the control parameters, the level set function is computed at allnodes of the mesh using the algorithm presented in [52]. The mesh at the interface is then refinedanisotropically using the gradient of the level set, and the physical properties of each domain are setwith appropriate mixing laws. Each mesh is adapted under the constraint of a fixed, case-dependentnumber of edges. Representative interface regions at the end of the anisotropic adaptation process areshown in figure 5. In practice, the adaptation process consists of adding nodes locally in the vicinity ofthe interface. The rest of the elements keep the same background size, that increases with the distancevia three successive refinement steps to accurately capture the near-wake region; see figure 5(d). This study is on the optimization of the non-dimensional forces exerted on the immersed body, asmeasured by the drag and lift coefficients per unit span length, denoted by D and L , respectively, anddefined as 2 I S ( − p n + 2Re ε ( u ) · n ) · e θ d l , (17)where e θ = (1 − θ ) e x + θ e y and θ is a Boolean used to specify the quantity subjected to optimization( θ = 0 for drag, θ = 1 for lift). The latter is computed as2 I ∂ Ω ( − p n + 2Re ε ( u ) · n ) · w d l = − Z Ω ( − p n + 2Re ε ( u ) · n ) : ∇ w d s − Z Ω ( ∇ u · u ) · w d s , (18)using the variational approach described in [53], that consists of multiplying the momentum equationsby a test function w equal to e θ on the nodes lying on S and 0 otherwise (hence vanishing everywhereexcept on the first layer of elements around S ), then of integrating by parts to bring up only volumeintegral terms, reportedly more accurate and less sensitive to the approximation of the body interfacetheir surface counterparts [54]. 9a) (b)(c) (d)Figure 5: ( a − c ) Details of the boundary layer mesh around (a) a NACA 0012 airfoil, (b) two circulartandem cylinders, and (c) a two-circular cylinder system, as obtained at the end of the anisotropicadaptation process. The blue line indicates the zero iso-contour of the level set function. (d) Back-ground mesh of the tandem configuration evidencing the three successive refinement steps. 
The present workflow relies on the online PPO implementation of Stable Baselines, a toolset of re-inforcement learning algorithms dedicated to the research community and industry [55], for which acustom OpenAI environment has been designed using the Gym library [56]. The instant reward r t usedto train the neural network is simply the quantity subjected to optimization (i.e., either the steadyasymptotic, time-averaged, or fluctuating value or drag and lift, modulo a plus or minus sign to tackleboth maximization and minimization problems). A moving average reward is also computed on the flyas the sliding average over the 100 latest values of r t , and we assume convergence when the absolutevariation in the moving average reward between two consecutive episodes is thrice smaller than 10 − out of 4 consecutive values. All other relevant hyper parameters are documented in the next sections,with the exception of the discount factor (since single-step PPO computes only one single reward perepisode). We consider first the NACA 0012 airfoil at incidence in a uniform stream depicted in figure 6. Allvariables are made non-dimensional using the straight chord distance c between the leading and trailing10igure 6: Schematic diagram of the flow past a NACA 0012 at incidence.(a) (b)Figure 7: Flow past a NACA 0012 at Re = 10. Iso-contours of the steady vorticity for (a) α = 11 ◦ ,and (b) α = 51 ◦ .edges and the free-stream velocity. This is a simple problem whose sole control parameter is the angleof attack α , i.e., the angle between the chord and the stream-wise directions (with α > reference data. Theorigin of the coordinate system is at the airfoil pivot-point, set at quarter chord length from theleading edge (that is, the eading edge is at ( − . ,
0) under zero incidence). The dimensions of thecomputational domain are x −∞ = − x ∞ = 40 and y ∞ = 15, which was found to be large enough notto have a discernible influence on the results. All meshes are adapted under the constraint of a fixednumber of edges n el = 115 , ∼ . × − ,which yields 20 −
25 boundary-layer grid points at mid-chord under zero incidence (depending on theReynolds number). The agent for this case is a fully-connected network with two hidden layers, eachholding 4 neurons. The learning rate is set to 5 × − , and the PPO loss clipping range is (cid:15) = 0 . The Reynolds number is set first to Re = 10 for the flow to remain steady regardless of the angle ofincidence. As could have been expected, the upper boundary layer remains attached even for positive,non-small values of α (in practice up to α ∼ ◦ ± ◦ ), after which it separates and pressure drops along This is not strictly speaking a control problem because the angle of attack cannot always be tuned toimprove performances. The methodology however carries over straightforwardly to related control problems,such as the optimization of multi-element high-lift systems. A deliberate abuse of language in the attempt to ease the reading. The reference data are actually VMS,not DNS, but the difference is clear from context. α = 11 ◦ , for which the boundary layer remains attached, and 51 ◦ , for whichit separates. Of course, for negative values of α , the same is true of the lower boundary layer, whoseseparation creates a pressure drop along the lower surface, hence creating negative lift.One objective of interest here is to maximize the steady asymptotic value of the lift coefficient L s reached after the initial transient. We show in figure 8(a) a curve of lift against the angle of attack.It has been obtained by DNS from a sample of 30 value of α uniformly distributed in the range[ − ◦ ; 75 ◦ ], for the point of attack to remain at the leading edge. In each simulation, the initialcondition is marched in time up to t = 100 with time step ∆ t = 0 . . α = 0,as should be, and exhibits a maximum for α opt = 39 .
5, associated to L s,opt = 1 . r t = L s ) uses 8 environmentsand 4 steps mini-batches to update the network for 32 epochs. Convergence is achieved after 20episodes, as indicated in figure 8(b) showing the evolution of the instant reward (in black) and of themoving average reward (in orange). This represents 160 simulations, each of which is performed on 1core following the exact same procedure as above, and lasts 1h 25min, hence ∼ α opt ∈ [37 . ◦ ; 40 ◦ ]associated which L s,opt ∼ .
16, which is perfectly in line with the DNS results; see the blue hue infigure 8(a).
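For the interested reader, the episode loop just described can be summarized by the following schematic sketch. The helper names, the mapping from raw actions to angles of attack, and the analytic stand-in for the CFD reward are all hypothetical; in the actual study, each reward evaluation is a full VMS simulation and the update is handled by the Stable Baselines PPO implementation.

```python
import numpy as np

def lift_reward(alpha):
    """Placeholder for one CFD evaluation of the steady lift L_s(alpha).
    In the paper this is a VMS simulation lasting about 1h25 on one core;
    the analytic expression below is a toy stand-in for illustration only."""
    return 1.2 * np.sin(np.deg2rad(2.0 * alpha)) * np.exp(-((alpha - 20.0) / 60.0) ** 2)

def run_episode(mu, sigma, n_envs=8, alpha_min=-15.0, alpha_max=75.0):
    """One single-step PPO episode for the lift maximization problem of section 4
    (schematic sketch; bounds and scaling are illustrative assumptions)."""
    raw = np.random.normal(mu, sigma, size=n_envs)                    # actions drawn from the current policy
    alphas = np.clip(alpha_min + 0.5 * (raw + 1.0) * (alpha_max - alpha_min),
                     alpha_min, alpha_max)                            # map raw actions to angles of attack
    rewards = np.array([lift_reward(a) for a in alphas])              # one reward per parallel environment
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalized rewards used in the loss (10)
    return alphas, rewards, advantages
```

The network parameters (hence µ and σ) are then updated with the clipped loss of section 2.3, and the loop is repeated until the moving average reward converges.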
The Reynolds number is now set to Re = 100 to add complexity by triggering flow unsteadiness, atleast for angles of incidence above a certain threshold value. This is best seen in figure 9 showinginstantaneous iso-contours of vorticity distributions computed at α = 26 ◦ , for which the flow settles toa steady regime similar to that observed at Re=10 (although the vorticity sheets spread much furtherdownstream), and 51 ◦ , for which the flow develops to a time-periodic, vortex-shedding state.A relevant objective here is thus to maximize the time-averaged lift coefficient¯ L = 1 τ Z t i + τt i L ( t ) d t , (19) This could scaled down to a few tens of minutes using an iterative Newton-like method, but the scope ofthis research is broader than just steady flows. α = 26 ◦ , and (b) α = 51 ◦ .(a) (b)Figure 10: Flow past a NACA 0012 at Re = 100. (a) Mean lift against the angle of attack, as computedby DNS. The vertical blue lines are the lower and upper boundaries of the optimal range of anglesobtained using DRL to solve the mean lift maximization problem, with the blue circle indicatingthe associated optimal lift. (b) Evolution per episode of the instant reward (in black) and of themoving average reward (in orange). The DNS values associated to min-max reward are reported asthe horizontal dashed lines.that mesures the mean lift force exerted on the airfoil, and reduces to the steady asymptotic lift forthose values of α yielding a steady state (as long as t i is large enough to dismiss the initial transient).figure 10(a) presents a curve of mean lift against α that has been obtained by DNS using the samesample of 30 values of α ∈ [ − ◦ ; 75 ◦ ]. In each simulation, the initial condition is marched in time upto t = 50 to leave out the transient, then up to t = 150 to compute (a posteriori) accurate averages.The averaging time τ = 100 represents 16 −
20 shedding cycles (depending on the angle of incidence)and has been checked to determine accurately the mean lift within 3%. The obtained distribution isagain perfectly antisymmetric with respect to α = 0, as should be. It loses continuity at α ∼ ◦ ,which reflects the bifurcation from the steady to the time-periodic regime through a classical linearinstability mechanism (the critical value being very close to that reported in [57]), and triggers asubstantial increase of lift at higher angles of incidence. The maximum is reached for α opt = 50 . L opt = 0 . r t = ¯ L ) uses 8 envi-ronments and 4 steps mini-batches to update the network for 32 epochs. Convergence is achievedafter 22 episodes; see figure 10(b) showing the evolutions of the instant reward (in black) and of the13igure 11: Schematic diagram of the flow past a tandem arrangement of two circular cylinders atzero-incidence.moving average reward (in orange). This represents 176 simulations, each of which is performed on1 core following the exact same procedure as above, and lasts 4h30 on average, hence approximately ∼ α opt ∈ [48 . ◦ ; 50 . ◦ ] for which ¯ L opt = 0 .
94, again perfectly in line with the DNS values; see the bluehue in figure 10(a).
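The time average (19) is evaluated a posteriori from the stored force history. A minimal sketch, assuming a uniformly sampled lift signal stored as NumPy arrays (not the actual post-processing code), reads:

```python
import numpy as np

def mean_coefficient(time, signal, t_i=50.0):
    """Time-averaged force coefficient, equation (19): discard the transient
    t < t_i and average the remaining, uniformly sampled values."""
    return signal[time >= t_i].mean()
```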
As a second test case, we consider the tandem arrangement of two identical circular cylinders in auniform stream at zero-incidence. A schematic of the configuration is provided in figure 11. Allvariables are made non-dimensional using the cylinders diameter d and the free-stream velocity, andthe sole control parameter is the center distance η between the centers of the two cylinders. Whilethis remains a simple problem on paper (especially as the control parameter space is small enoughthat it again allows direct comparisons between DRL/PPO and DNS), we shall see that it brings asignificant layer of complexity due to the topology of the cost functional. The origin of the coordinatesystem is set at the center of the upstream (main) cylinder, that is, the center of the downstream(or surrounding) cylinder is at ( η, x −∞ = 15, x ∞ = 40 and y ∞ = 15, which was found to be large enough not to have a discernible influence onthe results in the range of center distances specified hereinafter. The number of mesh edges for thiscase is n el = 125 , ∼ − , which yields20 −
25 grid points in the boundary-layer of the main cylinder, just prior to separation (depending onthe center distance and the Reynolds number). The agent for this case is similar in every respect tothat used in section 4, i.e., a fully-connected network with two hidden layers, each holding 4 neurons,learning rate equal to 5 × − , and PPO loss clipping range of 0 . As has been done in the NACA test case, the Reynolds number is set first to Re = 10 for the flowto remain steady regardless of the center distance. This is best seen in figure 12 showing iso-contoursof vorticity computed at η = 3, a small value for which the shear layer separating from the maincylinder simply engulfs its downstream counterpart, and η = 11, a value large enough for each cylinderto develop independent vorticity sheets (only those of the surrounding cylinder is of slightly lowermagnitude since the wake velocity it is subjected to is smaller than the uniform stream velocity).Since lift is inherently zero (using basic symmetry considerations), one meaningful objective for this caseis to maximize the steady asymptotic value of the total drag coefficient reached after the initial transient(i.e., the non-dimensional drag force exerted on the two-cylinder system). We show in figure 13(a) acurve of drag against the center distance, that has been obtained by DNS from a sample of 11 values14a) (b)Figure 12: Flow past the tandem arrangement of two circular cylinders at Re = 10. Iso-contours ofthe steady vorticity for (a) η = 3, and (b) η = 11.(a) (b)Figure 13: Flow past the tandem arrangement of two circular cylinders at Re = 10. (a) Drag againstthe center distance, as computed by DNS. The vertical blue line is the optimal distance obtained usingDRL to solve the drag maximization problem, with the blue circle indicating the associated optimaldrag. (b) Evolution per episode of the instant reward (in black) and of the moving average reward (inorange). The DNS values associated to min-max reward are reported as the horizontal dashed lines.of η uniformly distributed in the range [1; 11]. We did not consider pushing the second cylinder furtherdownstream to keep the computational costs affordable (especially in view of the time-dependent casesdocumented in the following), as this would have required extending the computational domain andincreasing the numbers of grid points accordingly - plus we do not anticipate such large distances tobe relevant from the standpoint of optimization since the interaction between both cylinders weakensincreasingly, although it can take up to several tens of diameters for the sensitivity to η to go to zero.In each simulation, the initial condition is marched in time up to t = 100 with time step ∆ t = 0 . . η , which we ascribe to the fact that the surrounding cylinder experiences negativedrag if sufficiently close to the main cylinder, on behalf of the low pressures prevailing in the gap [58].In return, the maximum is reached at η opt = 11, associated to D s,opt = 4 . r t = D s ) uses 16 environ-ments and 4 steps mini-batches to update the network for 32 epochs. Convergence is achieved after9 episodes, as illustrated in figure 13(b) showing the evolutions of the instant and moving averagerewards. This represents 144 simulations, each of which is performed on 1 core following the exactsame procedure as above, and lasts 2h30, hence 360h of CPU cost to achieve convergence. On behalf15a) (b)Figure 14: Flow past the tandem arrangement of two circular cylinders at Re = 100. 
Iso-contours ofthe instantaneous vorticity for (a) η = 7, and (b) η = 11.of the monotonical behavior of drag, it is not too surprising that the DRL converges to the optimaldistance η opt = 11 associated to D s,opt ∼ .
58, identical to the DNS values.
The Reynolds number is now set to Re = 100 to trigger flow unsteadiness. However, only a limitedrange of distances η = [7; 11] is investigated because the surrounding cylinder inhibits vortex sheddingfrom the main cylinder for sufficiently small values of η (the critical value at Re = 100 being η = 3 . η = 7, while a single vortex street is recovered at larger distances; see figure 14(b) for η = 11.One relevant objective here (in connection with fatigue reduction and energy harvesting problems influid-structure systems) is to maximize the root mean square (rms) lift coefficient L = (cid:18) τ Z t i + τt i ( L ( t ) − ¯ L ) d t (cid:19) / , (20)that measures the magnitude of the fluctuating lift force exerted on the two-cylinder system. fig-ure 15(a) shows a curve of rms lift against the center distance, that has been obtained by DNS from areduced sample of 9 values of η ∈ [7; 11]. In each simulation, the initial condition is marched in timeup to t = 100 to leave out the transient, then up to t = 200 to compute (a posteriori) meaningful oscil-lation amplitudes. The averaging time τ = 100 represents ∼
20 shedding cycles, and allows convergingthe rms lift within 5%. A maximum is reached for η opt = 8 .
8, associated to L opt = 0 . r t = L ) uses 16 envi-ronments and 4 steps mini-batches. Convergence is achieved after 12 episodes, as illustrated by theinstant and moving average rewards in figure 15(b). This represents 192 simulations, each of which isperformed on 1 core following the exact same procedure as above, and lasts 7h30 (much longer than forthe NACA 0012 test case, because all tandem solutions are unsteady), hence approximately ∼ η opt ∈ [8 .
69; 8 . L opt ∼ .
91, almost identical to the DNS values as the discrepancy in the optimal distanceis by 1%.
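In the same spirit as the mean lift computation of section 4, the rms lift (20) can be evaluated a posteriori from the stored signal; the sketch below is again a minimal illustration assuming a uniformly sampled history, not the actual post-processing code.

```python
import numpy as np

def rms_coefficient(time, signal, t_i=100.0):
    """Root-mean-square force coefficient, equation (20), for a uniformly
    sampled signal: fluctuation about the mean over the window t >= t_i."""
    f = signal[time >= t_i]
    return np.sqrt(np.mean((f - f.mean()) ** 2))
```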
Finally, the Reynolds number is set to Re = 300 for the flow to develop into its vortex-shedding stateregardless of the center distance. The full range η ∈ [1; 11] is considered, therefore the flow patternnow depends dramatically on the center distance. We shall not go into the full details here, as thishas been the subject of several numerical studies [60, 61, 62, 59, 63], but it is worth noticing thatfor small distances η (cid:46)
2, the separated shear layers from the main cylinder engulf their downstreamcounterparts and trigger shedding vortices far from the cylinders, although the gap flow between thecylinders remains steady; see figure 16(a) showing iso-contours of vorticity computed at η = 2. For η ∼
3, the gap flow becomes unsteady, but the gap vortices are not fully developed by the time theyimpinge on the surrounding cylinder, hence a single vortex street is observed in the far wake; seefigure 16(b) obtained for η = 3. For η ∼
4, one pair of gap vortices fully develops, then impingeson the downstream cylinder. This in turn triggers a complex interaction in the near wake of thesurrounding cylinder before a vortex street eventually forms further downstream; see figure 16(c) for η = 4. Finally, for η (cid:38)
5, the wake of the surrounding cylinder becomes unsteady again, and bothcylinders shed vortices that are synchronized and close to anti-phase; see figure 16(d) for η = 5.This broad variety of flow patterns triggers different features of flow unsteadiness, which shows inthe curve of rms lift against the center distance provided in 17(a). Such results have been obtainedfrom an extended sample of 30 values of η ∈ [1; 11], with each initial condition marched in time up to t = 200 to leave out the transient, then up to t = 300. The maximum is reached for η opt = 3 .
35 (i.e., avalue for which the gap vortices are close to being fully developed before impingement), associated to L opt = 1 .
99, but a close second lies at η ∼ opt = 7 .
25, associated to L opt = 1 .
36. The DRL/single-stepPPO resolution of the lift maximization problem uses 16 environments and 4 steps mini-batches, andconvergence is achieved after 20 episodes, as illustrated by the instant and moving average rewards infigure 17(b). This represents 320 simulations, each of which is performed on 1 core following the exactsame procedure as above, and lasts 10h30 (much longer than at Re=100 due to the increased simulationtime; more on this in the following), hence approximately 3300h of CPU cost to achieve convergence.Interestingly, DRL converges to an optimal distance η opt ∈ [7 .
32; 7 .
38] for which L opt ∼ .
39, i.e., itfails to identify the global optimal, but does converge to a value close to the secondary DNS maximum.17a) (b)(c) (d)Figure 16: Flow past the tandem arrangement of two circular cylinders at Re = 300. Iso-contours ofthe instantaneous vorticity for (a) η = 2, (b) η = 3, (c) η = 4 and (d) η = 5.(a) (b)Figure 17: Flow past the tandem arrangement of two circular cylinders at Re = 300. (a) Root meansquare lift against the center distance, as computed by DNS. The vertical blue lines are the lower andupper boundaries of the optimal range of angles obtained using DRL to solve the rms lift maximizationproblem, with the blue circle indicating the associated optimal lift. (b) Evolution per episode of theinstant reward (in black) and of the moving average reward (in orange). The DNS values associatedto min-max reward are reported as the horizontal dashed lines.This half-failure can be explained by the steepness of the reward gradients with respect to the controlvariable in the vicinity of the global optimal, that reflects the existence of a secondary instability18a) (b)Figure 18: Flow past the tandem arrangement of two circular cylinders at Re = 300: time history oflift for (a) η = 3, (b) η = 9.Figure 19: Schematic diagram of the flow past a two-circular cylinder system consisting of a fixedcylinder and movable control cylinder.mechanism at play in a narrow range of center distances. This is illustrated in figure 18(a) showingthat for η ∼
3, the flow settles to a first time-periodic solution, then bifurcates to a second time-periodic solution associated with increased lift oscillations (hence the large value of t i used for thiscase). Actually, the DRL does identify high reward positions close to η = 3, as evidenced by theearly peaks in the instant reward curve in figure 17(b), whose value L ∼ The third test case if that of a circular cylinder in a uniform stream, for which we assess the effectof inserting a second smaller (control) cylinder, as depicted in figure 19. All variables are made non-dimensional using the diameter d of the main cylinder and the free-stream velocity. The origin of the19oordinate system is set at the center of the main cylinder. The diameter of the control cylinder is setto d c = 0 .
1, and the sole control parameter is the position of its center denoted by ( x c , y c ). While thismay not seem overly complicated, the parameter space is now large enough do dismiss mapping thebest positions for placement of the control cylinder by classical numerical procedures (to give a taste,a tens of thousands of DNS are required to cover merely a few diameters around the main cylinder).In the following, the DRL/PPO results are thus compared to theoretical predictions obtained by theadjoint method, that has proven fruitful to gain insight into the most efficient region from the linearsensitivity of the uncontrolled flow (i.e., the flow past the main cylinder), without ever calculatingthe controlled states. We shall not go into the technicalities of how to derive said sensitivity by theadjoint method, as this is a topic thoroughly covered in a series of recent papers [20, 64, 65] to whichthe interested reader is referred for further deepening. The line of thought is conversely to take theoutput sensitivity as a given to assess the relevance of DRL/PPO.The agent for this case is a fully-connected network with two hidden layers, each holding 2 neurons,with the same learning rate 5 × − and PPO loss clipping range 0 . x −∞ = − x ∞ = 15 and y ∞ = 40. Thenumber of mesh edges for this case is n el = 190 , ∼ . × − at the main cylinder, and ∼ × − at control cylinder, which yields 25 −
40 grid points in the boundary-layer of the control cylinder(depending on the position of the control cylinder and the Reynolds number).
The Reynolds number is set first to Re = 40 for the flow to remain steady regardless of the position ofthe control cylinder. This is best seen in figures 20(a) − x c , y c ) = ( − . , x c , y c ) = (0 . , . x c , y c ) = (1 . , D s , that inherently pertainto the two-cylinder system, to those values D s + δD s computed by the adjoint method, where D s isthe drag exerted on the main cylinder in the absence of control cylinder, and δD s is a control-inducedvariation computed simultaneously for all possible positions of the control cylinder as δD s = Z Ω ( u † + 2 e x ) · δ f d s . (21)In equation (21), u † is a steady adjoint velocity, solution to ∇ · u † = 0 , −∇ u † · u + ∇ u T · u † + ∇ · σ ( − p † , u † ) = , (22)together with boundary condition u † = 2 e x at the surface of the main cylinder, that we compute withthe finite-element solver presented in [20]. Also, δ f is a simple model of the force exerted by the controlcylinder on the flow, namely a force opposite to the drag felt by the control cylinder in a uniform flowat the local velocity, carefully crafted to numerical and experimental data, whose relevance to mimicwith good accuracy the effect of a true control cylinder in the range of Reynolds numbers consideredis discussed extensively in [20].A map of the so-computed drag variations is shown in figure 20(d), where the successful positions ofthe control cylinder (those for which D s < D s , or equivalently δD s <
0) are indicated by the bluehue. The map is symmetric with respect to the centerline y = 0 and indicates that the total dragis reduced in two distinct flow regions, a first one located approximately 1 diameter upstream of themain cylinder, and a second one located in the vicinity of the rear stagnation point and extendingdownstream over 3 diameters, along the outer boundary of the recirculation region. Interestingly, bothregions achieve a similar drag reduction by ∼ r t = − D s ) uses 8 environments and a single-step mini-batch to update the network for 32 epochs,20a) (b)(c) (d)Figure 20: Flow past a two-circular cylinder system at Re = 40. ( a − c ) Iso-contours of the steadyvorticity for (a) ( x c , y c ) = ( − . , x c , y c ) = (0 . , .
9) and (c) ( x c , y c ) = (1 . , d c = 0 .
1, as computed using a steadyadjoint method, modeling the presence of the control cylinder by a pointwise reacting force localizedat the same location where the control cylinder is placed.and restricts to positions of the control cylinder in the upper half-plane ( y >
0) to take advantage ofthe problem symmetries. Convergence is achieved after 75 episodes, as illustrated by the instant andmoving average rewards in figure 21(b). This represents 600 simulations in which the initial conditionis marched in time up to t = 150 time step ∆ t = 0 .
1. Each simulation is performed on 1 core and lasts1h30, hence approximately 900h of CPU cost to achieve convergence. This number of episodes maybe deemed large compared to the other test cases documented herein, but it is easily explained by thecomplexity of the functional topology, that consists of two distinct valleys of comparable depth. Fromthis point of view, the fact that the DRL achieves convergence is simply a matter of chance, as it canin theory oscillate forever between both. The positions investigated by the DRL are reported as thegrey circles in figure 21(a), together with those optimal positions associated with the minimum drag D s,opt ∼ .
51 in red. Despite some limited discrepancies regarding the spatial extent of the upstreamregion, the reported agreement is very satisfying, and establishes the ability of DRL to identify bothregions of interest, and to predict the drag reduction achieved in these regions.
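Since the search is restricted to the upper half-plane to exploit the symmetry of the uncontrolled flow, each raw two-dimensional action output by the agent must be mapped to an admissible control-cylinder position. The paper does not spell out this mapping, so the sketch below is purely hypothetical: the bounds and the squashing function are illustrative assumptions.

```python
import numpy as np

def to_control_position(raw_action, x_bounds=(-2.0, 4.0), y_bounds=(0.0, 2.0)):
    """Hypothetical mapping from a raw 2-D action to a control-cylinder position
    (x_c, y_c) restricted to the upper half-plane (y_c >= 0)."""
    a = np.tanh(np.asarray(raw_action, dtype=float))                 # squash to [-1, 1]
    x_c = x_bounds[0] + 0.5 * (a[0] + 1.0) * (x_bounds[1] - x_bounds[0])
    y_c = y_bounds[0] + 0.5 * (a[1] + 1.0) * (y_bounds[1] - y_bounds[0])
    return x_c, y_c
```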
We set now the Reynolds number to Re = 100 for the flow to consistently develop to time-periodicvortex-shedding, i.e., the control cylinder is unable to inhibit vortex shedding from the main cylinder.In contrast, the flow past the control cylinder remains steady, as illustrated in figures 22(a) − d c = 0 .
1. The grey circles are the positions investigated bythe DRL. The red circles single out those DRL positions associated with an optimal reduction. Thenegative iso-contours of the adjoint-based variation are superimposed from figure 20(d). (b) Evolutionper episode of the instant reward (in black) and of the moving average reward (in orange).easily explained by the fact that the Reynolds number is the wake of the latter is roughly scaled by theratio of the cylinder diameters, and therefore remains below 15 regardless of the position of the controlcylinder. The intended objective here is to minimize the total mean drag. This requires comparing theDRL values of ¯ D , that again inherently pertain to the two-cylinder system, to those values ¯ D + δ ¯ D computed by the adjoint method, where ¯ D is the mean drag exerted on the main cylinder in theabsence of control cylinder, and δ ¯ D is the control-induced variation computed simultaneously for allpossible positions of the control cylinder as δ ¯ D = 1 τ Z t i + τt i Z Ω ( u † ( t ) + 2 e x ) · δ f ( t ) d s d t , (23)with u † the time-dependent adjoint velocity, solution to ∇ · u † = 0 , − ∂ t u † − ∇ u † · u + ∇ u T · u † + ∇ · σ ( − p † , u † ) = , (24)with boundary condition u † = 2 e y at the surface of the main cylinder.A map of the so-computed drag variations is reproduced from [66] in figure 22(d), where the successfulpositions of the control cylinder (those for which ¯ D < ¯ D , or equivalently δ ¯ D <
0) are indicated bythe blue hue. The map is again symmetric with respect to the centerline, and indicates that the totaldrag is reduced in two distinct flow regions, a first one approximately 1 diameter upstream of the maincylinder, that extends upstream over 2 diameters (hence further upstream than at Re = 40) and inwhich drag is reduced by roughly 2%, and a second one in the vicinity of the rear stagnation point,that extends downstream over 2 diameters along the outer boundary of the mean recirculation region(shorter than that found at Re = 40), and in which drag is reduced by almost 8%. The DRL/single-step PPO resolution of the lift maximization problem( r t = − ¯ D ) uses again 8 environments and asingle-step mini-batch, and convergence is achieved after 35 episodes, as illustrated by the instant andmoving average rewards in figure 23(b). This represents 280 simulations in which the initial conditionis marched in time with time step ∆ t = 0 .
1, up to t = 100 to leave out the transient, then up to t = 200 to compute relevant averages. Each simulation is performed on 1 core and lasts 4h30, henceapproximately 1260h of CPU cost to achieve convergence. The reduced cost compared to the steadycase is ascribed to the simpler functional topology, that features now two valleys with clearly differentdepths. This shows up in figure 23(a) depicting the positions investigated by the DRL (grey circles)together with those optimal positions associated with the minimum drag ¯ D opt ∼ .
23 in red. One seesthat while a few optimal points are identified in the upstream region, DRL quickly settles for the mostefficient downstream region, and more specifically the core of the mean recirculation region, in strikingagreement with the adjoint-based results. 22a) (b)(c) (d)Figure 22: Flow past a two-circular cylinder system at Re = 100. ( a − c ) Iso-contours of the instan-taneous vorticity for (a) ( x c , y c ) = ( − . , x c , y c ) = ( − . , .
2) and (c) ( x c , y c ) = (1 . , d c = 0 .
1, as computed usingthe time-varying adjoint method, modeling the presence of the control cylinder by a pointwise reactingforce localized at the same location where the control cylinder is placed. Reproduced from [66].
Fluid dynamicists have just begun to gauge the relevance of deep reinforcement learning techniquesto assist the design of optimal flow control strategies. This research weighs in on this issue and showsthat the proposed single-step PPO holds a high potential as a reliable, go-to black-box optimizer forcomplex CFD problems. The findings reported herein add value to the shallow literature on thissubject, as the combination of PPO and CFD has proven fruitful to tackle several time-independentand time-dependent test cases, some of which are not tractable using canonical direct methods. Despitethese achievements, there is still a need to consolidate the acquired knowledge, and to further develop,characterize and fine-tune the method, whether it be via an improved balance between explorationand exploitation to deal with steep global maxima (for instance using Trust Region-Guided PPO, as iteffectively encourages the policy to explore more on the potential valuable actions, no matter whetherthey were preferred by the previous policies or not [43]), via non-normal probability density functionsto deal with multiple global maxima, or via coupling with a surrogate model trained on-the-fly.One point worth being emphasized is that while we do report exhaustive data regarding the computa-tional efficiency of DRL with the hope to foster future comparisons, we did not seek to optimize saidefficiency, neither by optimizing the hyper parameters, nor by using pre-trained deep learning models(as is done in transfer learning). Actually, even the most successful DRL algorithm will not be able tocompete with the state-of-the art adjoint methods one generally opts for when parameter spaces arelarge enough to dismiss direct approaches, at least from the standpoint of CPU cost. This is because23a) (b)Figure 23: Flow past a two-circular cylinder system at Re = 100. (a) Variation of the total mean draginduced by a control cylinder of diameter d c = 0 .
1. The grey circles are the positions investigated bythe DRL. The red circles single out those DRL positions associated with an optimal reduction. Thenegative iso-contours of the adjoint-based variation are superimposed from figure 22(d). (b) Evolutionper episode of the instant reward (in black) and of the moving average reward (in orange).the adjoint method involves only two numerical simulations (one for the uncontrolled solution, onefor the adjoint solution), so convergence would need to be achieved in less than two episodes (usingsingle-step PPO) to match that cost. The one area where DRL can outperform the adjoint method isapplicability. On the one hand, adjoint methods are increasingly difficult to apply rigorously as tur-bulence sets in, because the high sensitivity to initial conditions yields exponentially diverging adjointsolutions as soon as the length of the adjoint simulation exceeds the predictability time scale [67]. Onthe other hand, they suffer from the reversal in space-time directionality (evidenced by the minus signahead of the material derivative term in (24)) that imposes to march the adjoint equations backwardsin time. This in turn requires knowledge of the history of the cylinder flow solution through the entiretime-span of the adjoint simulation, which is inconvenient and very costly in terms of storage. To give ataste, it takes 45 Gb to solve the passive control case at Re = 100 using the method documented in [20](solving first the Navier–Stokes equations, writing all time steps to disk, solving the adjoint equationsover the same time interval and with the same time step, and discarding the early and late time stepscorresponding to transients of the DNS and adjoint solutions to compute meaningful time averages ofthe adjoint-based integrands). It would therefore take up to 2 Tb to handle a simple three-dimensional(3-D) case with 40 points distributed in the span-wise direction, which is beyond feasible limits. Incontrast, it would likely take minimum effort to apply DRL-CFD to 3-D, turbulent flows, as the onlyprerequisite is the ability to compute accurate numerical solutions (which behoves the CFD solver, notthe reinforcement learning algorithm), and the storage cost of an episode in the single-step PPO issimply that of a CFD run, times the number of environments. This should opens a number of researchdirections, including applying the methodology to systems of industrial relevance, featuring, e.g., highReynolds numbers, multiple phases, non-Newtonian fluids and/or coupled fluid-thermic processes.
Acknowledgements
This work is supported by the Carnot M.I.N.E.S. Institute through the M.I.N.D.S. project.